
Frequently asked questions (FAQ)

...about the project Leipzig Corpora Collection / Deutscher Wortschatz

The Leipzig Corpora Collection (or its branch Deutscher Wortschatz, which focuses on the German language) collects and processes documents available on the Internet, typically in an annual cycle. The results are corpus-based dictionaries for more than 250 languages, which provide statistical information, example sentences, and links to related words for every word. The service ranks among the most comprehensive information systems about the German language and, for many languages, provides the largest freely available text resource.
Information about citation and publications can be found here.

...about the data

All words are included and presented as they were found in the underlying documents. For that reason, a corpus may contain misspellings (like "goverment" instead of "government"), outdated word forms (like "thou") or dialectal variations. The use of randomly chosen Web pages as source material can also lead to the inclusion of sentences or words that can be considered racist, sexist or problematic in other ways.

Besides issues related to the source material, errors in our processing pipeline may also lead to errors in our data (e.g. the extraction of word fragments like "ing" by our tokenizer). In most cases, the frequency of an ill-formed word is significantly lower than the frequency of its correct version. In the case of outdated spelling we sometimes provide a link to the correct word. If you find a systematic error in our data, we would be grateful for a short note.
The Leipzig Corpora Collection creates corpora mostly based on documents from the Internet, which are processed automatically by our toolchain. If a specific word does not occur in the source documents, it is not contained in the resulting corpus. We do not select documents manually for inclusion in a corpus (except in some cases of domain-specific corpora).
Information about the downloads can be found here.
The Leipzig Corpora Collection mostly uses documents from the Internet for the creation of its corpora. As this material is subject to copyright law, every text is split into its sentences, and those sentences are randomly reordered to destroy the original document structure. After this processing step, the original documents are deleted and can no longer be provided.
We use corpus names that encode the most relevant information about the source material used. All corpus names comply with the following structure: LANGUAGE_GENRE_DATE
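The splitting and shuffling step can be illustrated with a minimal sketch. It assumes a naive regular-expression sentence splitter; the segmenter actually used in the toolchain is not described here, and the function name `scramble_documents` is hypothetical.

```python
import random
import re


def scramble_documents(documents, seed=None):
    """Split documents into sentences and return them in random order."""
    sentences = []
    for text in documents:
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        sentences.extend(p for p in parts if p)
    # Shuffling the pooled sentences removes the original document structure.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences
```

After this step only the pooled sentences remain; which document a sentence came from, and where it stood in that document, can no longer be recovered from the result.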
  • Language: Information about the language of the source material based on ISO 639-3, optionally extended by country of origin using ISO 3166
  • Genre: Information about the kind of source material. Typical values are "web", "wikipedia", "news" (news material, often via RSS feeds) or "newscrawl" (news material crawled from websites)
  • Date: Information about the timespan in which the source material was acquired
Examples of corpus names are:
  • deu_news_2011: German-language news material from 2011
  • deu-at_news_2011: German-language news material from Austria from 2011
  • deu-at_web_2011-2014: German-language Web text from Austria, collected between 2011 and 2014
  • deu_wikipedia_2014: German-language Wikipedia texts from 2014
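The naming scheme can be decomposed mechanically. The following sketch covers only the forms shown in the examples above; it assumes lowercase three-letter language codes, an optional two-letter country code, and four-digit years. The names `CorpusName` and `parse_corpus_name` are hypothetical, and real corpus names may contain parts this pattern does not handle.

```python
import re
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    language: str           # ISO 639-3 code, e.g. "deu"
    country: Optional[str]  # optional ISO 3166 code, e.g. "at"
    genre: str              # e.g. "news", "web", "wikipedia"
    date: str               # a year or a year range, e.g. "2011-2014"


# Assumed shape: LANGUAGE[-COUNTRY]_GENRE_DATE
_PATTERN = re.compile(
    r"^(?P<language>[a-z]{3})"
    r"(?:-(?P<country>[a-z]{2}))?"
    r"_(?P<genre>[a-z]+)"
    r"_(?P<date>\d{4}(?:-\d{4})?)$"
)


def parse_corpus_name(name):
    """Split a corpus name into its language, country, genre and date parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised corpus name: {name!r}")
    return CorpusName(**match.groupdict())
```

For instance, `parse_corpus_name("deu-at_web_2011-2014")` yields the language `deu`, the country `at`, the genre `web` and the date `2011-2014`.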

...about the corpora portal

The corpora portal supports the search for and presentation of inflected word forms. This includes single words (like "car" or "cars") as well as multiword units (like "Sri Lanka" or "Los Angeles"). The set of indexed multiword units is not consistent and varies, especially between different languages and genres. Every query is treated as case-sensitive. If the corpus contains words that would match a case-insensitive search, they are displayed as "See also:" underneath the head word.

The corpora portal also allows searching for word patterns. The special characters '*' (or '%') for an arbitrary number of characters and '?' (or '_') for a single character are supported. As an example, the query "l??d*" may find words like "London", "lead" or "landscape". Words matching the specified pattern are shown on a special result page, sorted by their frequency in the corpus in descending order.
The information box on the upper right shows information about the selected corpus. This includes the number of sentences, the number of distinct words (types) and the number of running words (tokens).
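One way to picture the pattern search is as a translation of the wildcards into a regular expression. This is only a sketch, not the portal's implementation; the function names are hypothetical. Because the example query "l??d*" is said to find "London", the sketch assumes that pattern matching ignores case.

```python
import re


def wildcard_to_regex(pattern):
    """Translate '*'/'%' (any number of characters) and '?'/'_' (one character)."""
    parts = []
    for ch in pattern:
        if ch in "*%":
            parts.append(".*")
        elif ch in "?_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Assumption: matching is case-insensitive, as the "London" example suggests.
    return re.compile("".join(parts), re.IGNORECASE)


def match_words(pattern, word_frequencies):
    """Return matching words, most frequent first."""
    regex = wildcard_to_regex(pattern)
    hits = [w for w in word_frequencies if regex.fullmatch(w)]
    return sorted(hits, key=lambda w: word_frequencies[w], reverse=True)
```

With a toy frequency list such as `{"London": 50, "lead": 30, "landscape": 10, "car": 100}`, the query "l??d*" returns "London", "lead" and "landscape" in that order.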
We provide different information about the frequency of a word. This includes:
  • Frequency: Number of occurrences of the word in the corpus. This is an absolute number and therefore linearly dependent on the corpus size.
  • Rank: Position of the word in the corpus word list sorted by frequency in descending order. In most English corpora "the" is the most frequent word and therefore has rank 1. The second most frequent word (often "and" or "to") has rank 2, etc. The rank of a word does not grow with corpus size, but it may differ significantly between corpora (especially for low-frequency words).
  • Frequency class: Words of similar frequency are grouped into classes, with the goal that the frequency class of a word rarely changes between different corpora. The frequency of the most frequent word of a corpus is divided by the frequency of the word in question, and the base-2 logarithm of the result is rounded up to the next whole number. The most frequent word in a corpus always has frequency class 0; a word in frequency class 1 occurs around half as often as the most frequent word. In general, a word of frequency class n+1 has about half the frequency of a word in frequency class n. Extremely rare words may have a frequency class of 20 or higher in large corpora.
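The frequency class described above amounts to ceil(log2(f_max / f)), where f_max is the frequency of the most frequent word and f the frequency of the word in question. A minimal sketch, with a hypothetical function name and made-up frequencies:

```python
import math


def frequency_class(max_frequency, word_frequency):
    """Base-2 logarithm of max_frequency / word_frequency, rounded up."""
    if word_frequency <= 0 or word_frequency > max_frequency:
        raise ValueError("word_frequency must be in the range 1..max_frequency")
    return math.ceil(math.log2(max_frequency / word_frequency))


# Illustrative values only, not taken from a real corpus.
print(frequency_class(1000, 1000))  # most frequent word itself
print(frequency_class(1000, 500))   # half as frequent
print(frequency_class(1000, 300))   # between a half and a quarter
```

Because the class depends only on the ratio of two frequencies, it stays roughly stable when the corpus grows, unlike the absolute frequency.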
For every word, we provide information about its frequency in the corpus. More details about this information can be found here.

We provide the following information for many, but not all, words. In most cases it was generated using automatic procedures and may contain errors. This information includes:
  • For inflected words, the corresponding base form is provided ("seminars" -> "seminar"), and conversely, for a base form, all known inflected forms, ordered by their frequency in descending order (for example, "planets" for "planet").
  • For base forms, their part of speech is provided; for nouns, also their grammatical gender.
  • For compound words, their constituents are provided.
  • Hyphenation describes possible options for splitting a word at line breaks (like "syl|la|ble").
  • Descriptions are extracted from the corresponding Wikipedia article.
  • Synonyms are words with a similar or identical meaning.
The dictionary Dornseiff: Der deutsche Wortschatz nach Sachgruppen (published by De Gruyter since 1934) groups words by semantic criteria into currently 22 main groups and 970 domain groups. Domain groups are in turn structured into semantic groups.

The 8th edition of the dictionary (published in 2004) was created using data of the Leipzig Corpora Collection. The publisher kindly gave us permission to present both the name of the corresponding Dornseiff set and the complete semantic group for every word.
As we destroy the original document structure in our preprocessing for copyright reasons, all sentences are only available in a random order. To favour "typical" sentences we use an adapted version of the GDEX algorithm ("Good Dictionary Examples in a Corpus") for most corpora. The algorithm prefers short sentences of rather simple grammatical structure, without extremely rare words and without numbers or special characters. More details about the original algorithm can be found here.
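To illustrate the kind of preferences listed above, here is a deliberately simple scoring sketch. It is not the GDEX algorithm or the adapted version used for the corpora; the penalties, the thresholds `max_tokens` and `rare_class`, and the function name `example_score` are all assumptions chosen for illustration.

```python
import re


def example_score(sentence, frequency_classes, max_tokens=20, rare_class=16):
    """Score a candidate example sentence; higher means more 'typical'."""
    tokens = sentence.split()
    score = 1.0
    # Penalise long sentences (assumed threshold).
    if len(tokens) > max_tokens:
        score -= 0.3
    # Penalise sentences containing numbers.
    if any(ch.isdigit() for ch in sentence):
        score -= 0.3
    # Penalise special characters beyond ordinary punctuation.
    if re.search(r"[^\w\s.,;:!?'\"-]", sentence):
        score -= 0.2
    # Penalise rare words; words missing from the table count as rare.
    words = (t.strip(".,;:!?\"'").lower() for t in tokens)
    if any(frequency_classes.get(w, rare_class) >= rare_class for w in words):
        score -= 0.2
    return score
```

Candidate sentences for a word could then be sorted by this score so that short, plain sentences made of common words are shown first.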
Cooccurrences of a word are those words that occur noticeably often together with it. This may be the case as immediate left neighbour, as immediate right neighbour, or in the same sentence. The relevance of a cooccurrence is quantified by a significance measure, and cooccurrences are ordered by their significance. The Leipzig Corpora Collection uses the log-likelihood ratio as its significance measure; word pairs of little significance are removed.
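As an illustration of the significance measure, the sketch below computes the log-likelihood ratio in its common 2x2 contingency-table form (Dunning's G²) for sentence cooccurrences. The exact variant and the cut-off used for the corpora are not stated here, so the formulation and the function names are assumptions.

```python
import math


def _term(observed, expected):
    # Contribution of one table cell; an empty cell contributes nothing.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood(n_ab, n_a, n_b, n):
    """G^2 for words A and B.

    n_ab: sentences containing both words
    n_a, n_b: sentences containing A and B respectively
    n: total number of sentences
    """
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    # Expected cell counts if A and B occurred independently of each other.
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    return 2.0 * sum(_term(o, e) for o, e in zip(observed, expected))
```

The value is zero when the two words cooccur exactly as often as chance predicts and grows as the observed counts depart from that expectation. Since it also grows when two words cooccur less often than expected, a practical filter would additionally check that `n_ab` exceeds its expected value.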
Word similarity based on cooccurrences comprises words that occur in similar contexts as the word in question. The distributional hypothesis assumes that those words share a similar meaning.

To compute this kind of similarity between words, the sets of their significant cooccurrences are compared. A high similarity of those sets (computed using the Dice coefficient) suggests that the words are interchangeable in similar contexts and therefore have a similar meaning. The results are ranked by similarity in descending order; only results with a minimum number of common cooccurrences are considered.
The cooccurrences graph is a visualisation of sentence cooccurrences. For the most significant cooccurring words of an input word, every possible word pair is checked for a significant cooccurrence relationship. If there is one, an edge is drawn between the two nodes in the graph and also between each of them and the input word. The significance of a cooccurrence is represented by the line width of the corresponding edge.
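The Dice coefficient of two sets A and B is 2·|A ∩ B| / (|A| + |B|). A minimal sketch of the comparison follows; the threshold `min_common`, the function names and the toy data are assumptions.

```python
def dice(a, b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def similar_words(target, cooccurrences, min_common=2):
    """Rank other words by the Dice similarity of their cooccurrence sets."""
    target_set = cooccurrences[target]
    scored = []
    for word, cooc in cooccurrences.items():
        if word == target:
            continue
        # Skip words that share too few cooccurrences with the target.
        if len(target_set & cooc) < min_common:
            continue
        scored.append((word, dice(target_set, cooc)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy cooccurrence sets, invented for illustration.
toy = {
    "car": {"drive", "road", "engine", "wheel"},
    "truck": {"drive", "road", "engine", "load"},
    "bus": {"drive", "road", "stop"},
    "banana": {"eat", "fruit", "road"},
}
print(similar_words("car", toy))
```

In the toy data, "truck" shares three of four cooccurrences with "car" and ranks first, while "banana" shares only one and is dropped by the minimum-overlap filter.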
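The construction of the graph can be sketched as follows. The sketch follows the description literally: an edge to the input word is only added for words that take part in at least one significant pair. The data layout (a dictionary from word pairs to significance values), the parameter `top_n` and the function name are assumptions.

```python
from itertools import combinations


def cooccurrence_graph(word, significance, top_n=10):
    """Return edges as {frozenset({w1, w2}): significance}.

    significance maps frozenset({w1, w2}) to the significance of that pair
    and is assumed to contain only significant cooccurrences.
    """
    # Significant partners of the input word and their significance.
    partners = {}
    for pair, score in significance.items():
        if word in pair and len(pair) == 2:
            (other,) = pair - {word}
            partners[other] = score
    # Keep only the most significant partners.
    top = sorted(partners, key=partners.get, reverse=True)[:top_n]

    edges = {}
    for w1, w2 in combinations(top, 2):
        pair = frozenset((w1, w2))
        if pair in significance:
            # Edge between the two partners, and from each to the input word.
            edges[pair] = significance[pair]
            edges[frozenset((word, w1))] = partners[w1]
            edges[frozenset((word, w2))] = partners[w2]
    return edges
```

The stored significance value of each edge can be mapped to a line width when the graph is drawn.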
The available languages are presented in the footer of every Web page. The selected language will be stored in a cookie on your computer and will be automatically selected at your next visit.
