All words are included and presented as they were found in the underlying documents. For that reason, a corpus may contain misspellings (like "goverment" instead of "government"), outdated forms (like "thou") or dialectal variations. Because randomly chosen Web pages are used as source material, a corpus may also include sentences or words that can be considered racist, sexist or problematic in other ways.
Besides issues related to the source material, errors in our processing pipeline may also introduce errors into our data (e.g., the extraction of word fragments like "ing" by our tokenizer). In most cases, the frequency of an ill-formed word is significantly lower than that of its correct form. For outdated spellings, we sometimes provide a link to the correct word. If you find a systematic error in our data, we would appreciate a short note.