FAQ - Leipzig Corpora Collection / Deutscher Wortschatz

Frequently asked questions (FAQ)

...about the project Leipzig Corpora Collection / Deutscher Wortschatz

What is the Leipzig Corpora Collection?

The Leipzig Corpora Collection (or its branch Deutscher Wortschatz, which focuses on the German language) collects and processes documents available on the Internet, typically in an annual cycle. The results are corpus-based dictionaries for more than 250 languages that provide statistical information, example sentences, and links to related words for every word. The service ranks among the most comprehensive information systems about the German language and, for many languages, provides the largest freely available text resource.

How do I cite the project or specific resources?

Information about citation and publications can be found here.

How can I contact the project Leipzig Corpora Collection?

Contact information can be found here.

Do you have a data privacy statement?

Our privacy notice can be found here.

...about the data

I have found an error/a misspelled word/an ungrammatical sentence!

All words are included and presented as they were found in the underlying documents. For that reason, misspellings (like "goverment" instead of "government"), words in outdated spelling (like "thou") or dialectal variations may be contained in a corpus. The use of randomly chosen Web pages as source material can also lead to the inclusion of sentences or words that can be considered racist, sexist or problematic in other ways.

Besides issues related to the source material, errors in our processing pipeline may also lead to errors in our data (e.g. the extraction of word fragments like "ing" by our tokenizer). In most cases, the frequency of an ill-formed word is much lower than the frequency of its correct version. In the case of outdated spelling we sometimes provide a link to the correct word. If you find a systematic error in our data, please let us know.

Why can't I find word X?

The Leipzig Corpora Collection creates corpora mostly based on documents from the Internet, which are processed automatically by our toolchain. If a specific word does not occur in the source documents, it is not contained in the resulting corpus. We do not select documents manually for inclusion in a corpus (except in some cases of domain-specific corpora).

Where can I download the data?

Information about the downloads can be found here.

Where are the complete texts?

The Leipzig Corpora Collection mostly uses documents from the Internet to create its corpora. As this material is subject to copyright law, every text is split into its sentences, and those sentences are randomly reordered to destroy the original document structure. After this processing step, the original documents are deleted and can no longer be provided.
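The split-and-shuffle step described above can be illustrated with a minimal sketch. It is not the project's actual toolchain: the function name `scramble`, the `seed` parameter, and the naive punctuation-based sentence splitter are all assumptions made for illustration.

```python
import random
import re


def scramble(documents, seed=None):
    """Pool the sentences of all documents and return them in random order."""
    sentences = []
    for text in documents:
        # Naive splitter (assumption): break after '.', '!' or '?' followed
        # by whitespace. A real segmenter must handle abbreviations etc.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        sentences.extend(part for part in parts if part)
    # Shuffling the pooled list removes the link between a sentence
    # and the document (and position) it came from.
    random.Random(seed).shuffle(sentences)
    return sentences
```

After this step only the shuffled sentence list is kept, so the original texts cannot be reassembled from it.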

What do these cryptic corpus names mean?

We use corpus names that encode the most relevant information about the used source material. All corpus names comply with the following structure: LANGUAGE_GENRE_DATE
With:

Language: Information about the language of the source material based on ISO 639-3, optionally extended by country of origin using ISO 3166
Genre: Information about the kind of source material. Typical values are "web", "wikipedia", "news" (news material, often via RSS feeds) or "newscrawl" (news material, crawled from Websites)
Date: Information about the timespan in which the source material was acquired

Examples for corpus names are:

deu_news_2011: German-language news material from 2011
deu-at_news_2011: German-language news material from Austria, from 2011
deu-at_web_2011-2014: German-language Web text from Austria, collected between 2011 and 2014
deu_wikipedia_2014: German-language Wikipedia texts from 2014
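As an illustration of the naming scheme, the following sketch splits a corpus name into its parts. The regular expression is an assumption derived only from the four examples above; real corpus names may use variants it does not cover, and `CorpusName` and `parse_corpus_name` are hypothetical names, not part of any project API.

```python
import re
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    language: str           # ISO 639-3 code, e.g. "deu"
    country: Optional[str]  # optional country suffix, e.g. "at"
    genre: str              # e.g. "news", "web", "wikipedia"
    start_year: int
    end_year: int           # equals start_year for single-year corpora


# Assumed pattern: LANGUAGE[-COUNTRY]_GENRE_YEAR[-YEAR]
_NAME_RE = re.compile(
    r"^(?P<lang>[a-z]{3})(?:-(?P<country>[a-z]{2}))?"
    r"_(?P<genre>[a-z]+)"
    r"_(?P<start>\d{4})(?:-(?P<end>\d{4}))?$"
)


def parse_corpus_name(name: str) -> CorpusName:
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised corpus name: {name!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    return CorpusName(match["lang"], match["country"], match["genre"], start, end)
```

For example, `parse_corpus_name("deu-at_web_2011-2014")` yields the language `deu`, the country `at`, the genre `web`, and the years 2011 to 2014.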

...about the corpora portal

What search capabilities does the corpora portal provide?

The corpora portal supports the search for and presentation of inflected word forms. This includes simple words (like "car" or "cars") but also multiword units (like "Sri Lanka" or "Los Angeles"). The set of indexed multiword units is not uniform and varies in particular between languages and genres. Every query is treated as case-sensitive. If the corpus contains words that would match in a case-insensitive search, they are displayed as "See also:" underneath the head word.

The corpora portal also allows searching for word patterns. The special characters '*' (or '%') for an arbitrary number of characters and '?' (or '_') for a single character are supported. As an example, the query "l??d*" may find words like "London", "lead" or "landscape". Words matching the specified pattern are shown on a special result page, sorted by their frequency in the corpus in descending order.
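The wildcard semantics can be sketched as a translation into a regular expression. This is an illustration, not the portal's implementation: `wildcard_to_regex` and `search` are hypothetical names, and because this FAQ does not state how pattern queries handle case, the sketch exposes that as an `ignore_case` flag.

```python
import re


def wildcard_to_regex(pattern, ignore_case=False):
    """Translate '*'/'%' (any number of characters) and '?'/'_' (one character)."""
    parts = []
    for ch in pattern:
        if ch in "*%":
            parts.append(".*")
        elif ch in "?_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile("".join(parts), flags)


def search(pattern, frequencies, ignore_case=False):
    """Return the matching words, most frequent first.

    `frequencies` maps each word to its corpus frequency (toy data here).
    """
    regex = wildcard_to_regex(pattern, ignore_case)
    hits = [word for word in frequencies if regex.fullmatch(word)]
    return sorted(hits, key=lambda word: frequencies[word], reverse=True)
```

With a toy word list containing "London", "lead" and "landscape", the query "l??d*" matches all three only when `ignore_case=True`; in case-sensitive mode "London" is excluded.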

How large is the underlying database?

In the information box on the upper right you can find information about the selected corpus. This includes the number of sentences, the number of distinct words (types) and the number of running words (tokens).

What does the different frequency information about a word mean?

We provide different information about the frequency of a word. This includes:

Frequency: Number of occurrences of the word in the corpus. This is an absolute number and therefore linearly dependent on the corpus size.
Rank: Position of the word in the corpus word list sorted by frequency in descending order. In most English corpora "the" is the most frequent word and therefore has rank 1. The second most frequent word (often "and" or "to") has rank 2, etc. The rank of a word does not grow with corpus size, but it may differ significantly between corpora (especially for low-frequency words).
Frequency class: Words of similar frequency are grouped into classes so that the frequency class of a word rarely changes between corpora. The frequency of the most frequent word in the corpus is divided by the frequency of the word in question, and the base-2 logarithm of the result is rounded up to the next whole number. The most frequent word in a corpus always has frequency class 0; a word in frequency class 1 occurs about half as often in the corpus as the most frequent word. In general, a word of frequency class n+1 has roughly half the frequency of a word in frequency class n. Extremely rare words may have a frequency class of 20 or higher in large corpora.
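The rank and frequency class definitions above can be written down as a short sketch. The function names and the toy frequencies are assumptions for illustration; tie handling in `ranks` (equal frequencies keep their input order) is also an assumption, since the FAQ does not specify it.

```python
import math


def frequency_class(word_freq, max_freq):
    """ceil(log2(max_freq / word_freq)), as described in the text above."""
    if word_freq <= 0 or word_freq > max_freq:
        raise ValueError("expected 0 < word_freq <= max_freq")
    return math.ceil(math.log2(max_freq / word_freq))


def ranks(frequencies):
    """Map each word to its 1-based position in the frequency-sorted word list."""
    ordered = sorted(frequencies, key=frequencies.get, reverse=True)
    return {word: position for position, word in enumerate(ordered, start=1)}
```

For a corpus whose most frequent word occurs 1000 times, a word with 400 occurrences gives a ratio of 2.5 and a base-2 logarithm of about 1.32, which rounds up to frequency class 2.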

What lexical/morphological information is provided for a word?

For every word, we provide information about its frequency in the corpus. More details about this information can be found here.

We provide the following information for many, but not all, words. In most cases it was generated using automatic procedures and may contain errors. This information includes:

Our transliterations are text conversions from non-Latin scripts into Latin script. As an example, we use the Pinyin transliteration for the Chinese language and the Python Transliterator by Artur Barseghyan for some other languages (like Modern Greek, Russian, or Georgian).
For inflected words, the corresponding base form is provided ("seminars" -> "seminar"), and conversely, for a base form, all known inflected forms, ordered by their frequency in descending order (like for "planet": "planets").
For base forms, their part of speech is provided; for nouns, also their grammatical gender.
For compound words, their constituents are provided.
Hyphenation describes possible options for splitting a word at line breaks (like "syl|la|ble").
Descriptions are extracted from the corresponding Wikipedia article.
Synonyms are words with a similar or identical meaning.

What are Dornseiff sets?

The dictionary Dornseiff: Der deutsche Wortschatz nach Sachgruppen (published by De Gruyter since 1934) groups words by semantic criteria into currently 22 main groups and 970 domain groups. Domain groups are in turn divided into semantic groups.

The 8th and 9th editions of the dictionary (published in 2004 and 2020) were created using data from the Leipzig Corpora Collection. The publisher kindly permitted us to present both the name of the corresponding Dornseiff set and the complete semantic group for every word.

What is the purpose of the translation links?

If translations of a word into other languages are available, each translation links to the respective information page in the corpora portal. For languages with more than one corpus in the portal, the link typically points to the largest corpus (alternatively: the most recent one). Within each language, the translations are sorted by their frequency in the target corpus in descending order. For clarity, the respective frequency class is shown as well (information about the use of frequency classes can be found here).

What is the sorting order of the sentences?

As we destroy the original document structure in our preprocessing for copyright reasons, all sentences are only available in a random order. To favour "typical" sentences we use an adapted version of the GDEX algorithm ("Good Dictionary Examples in a Corpus") for most corpora. The algorithm prefers short sentences of rather simple grammatical structure, without extremely rare words and without numbers or special characters. More details about the original algorithm can be found here.

What are cooccurrences?

Cooccurrences of a word are those words that occur noticeably often together with it. This may be the case as immediate left neighbour, as immediate right neighbour, or in the same sentence. The relevance of a cooccurrence is measured using a significance measure; cooccurrences are ordered by their significance. The Leipzig Corpora Collection uses the log-likelihood ratio as its significance measure and removes word pairs of little significance.
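To make the significance measure concrete, here is a minimal sketch for sentence cooccurrences. This FAQ names the log-likelihood ratio but not its exact formulation, so the sketch assumes the common form computed from a 2x2 contingency table (Dunning's G²). The function names, the input layout, and the threshold 6.63 (the chi-square critical value for p = 0.01 with one degree of freedom) are illustrative assumptions, not the project's settings.

```python
import math


def log_likelihood(n_a, n_b, n_ab, n):
    """G² for words A and B.

    n_a, n_b: number of sentences containing A resp. B
    n_ab:     number of sentences containing both
    n:        total number of sentences
    """
    observed = [
        [n_ab, n_a - n_ab],                  # A with B, A without B
        [n_b - n_ab, n - n_a - n_b + n_ab],  # B without A, neither
    ]
    row_totals = [n_a, n - n_a]
    col_totals = [n_b, n - n_b]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def significant_pairs(counts, n, threshold=6.63):
    """Keep pairs that co-occur more often than chance predicts, best first.

    `counts` maps (a, b) -> (n_a, n_b, n_ab); the threshold is an assumption.
    """
    scored = []
    for pair, (n_a, n_b, n_ab) in counts.items():
        # G² is also large when two words avoid each other, so require
        # observed > expected before scoring.
        if n_ab * n > n_a * n_b:
            score = log_likelihood(n_a, n_b, n_ab, n)
            if score >= threshold:
                scored.append((pair, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A pair whose joint count equals the count expected under independence scores 0 and is dropped, regardless of how frequent the individual words are.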

What is similarity based on cooccurrences ("Words with similar context")?

Word similarity based on cooccurrences comprises words that occur in similar contexts as the word in question. The distributional hypothesis assumes that those words share a similar meaning.

To compute this kind of similarity between two words, the sets of their significant cooccurrences are compared. A high similarity of those sets (computed using the Dice coefficient) suggests that the words are interchangeable in similar contexts and therefore have a similar meaning. The results are ranked by similarity in descending order; only results with a minimum number of common cooccurrences are considered.
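The set comparison can be sketched as follows. The Dice coefficient itself is standard (twice the size of the intersection divided by the sum of the set sizes); the function names, the toy data layout, and the default `min_common=2` are assumptions, since the FAQ does not state the actual minimum.

```python
def dice(a, b):
    """Dice coefficient of two sets: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def similar_words(target, cooccurrences, min_common=2):
    """Rank other words by the Dice similarity of their cooccurrence sets.

    `cooccurrences` maps each word to the set of its significant cooccurrences.
    """
    base = cooccurrences[target]
    scored = []
    for word, others in cooccurrences.items():
        if word == target:
            continue
        if len(base & others) < min_common:
            continue  # too few shared cooccurrences to be considered
        scored.append((word, dice(base, others)))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For instance, the sets {drive, road, engine} and {drive, road, load} share two of three members each, giving a Dice coefficient of 4/6, or about 0.67.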

What does the cooccurrences graph represent?

The cooccurrences graph is a visualisation of sentence cooccurrences. For the most significant cooccurring words of an input word, each possible pair of these words is checked for a significant cooccurrence relationship. If there is one, an edge is drawn between the two nodes in the graph and also between each of them and the input word. The significance of a cooccurrence is represented by the line width of the corresponding edge.
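One literal reading of this construction is sketched below. It is an interpretation of the description above, not the portal's code: the function name, the `top_n` default, and the representation of significant pairs as a dictionary keyed by `frozenset` are all assumptions.

```python
from itertools import combinations


def cooccurrence_graph(word, significance, top_n=10):
    """Return the edges of the graph as {frozenset({a, b}): significance}.

    `significance` holds only significant pairs, keyed by frozenset({a, b}).
    """
    # Significant cooccurrences of the input word and their scores.
    neighbours = {}
    for pair, score in significance.items():
        if word in pair and len(pair) == 2:
            (other,) = pair - {word}
            neighbours[other] = score
    top = sorted(neighbours, key=neighbours.get, reverse=True)[:top_n]

    edges = {}
    for a, b in combinations(top, 2):
        score = significance.get(frozenset((a, b)))
        if score is not None:
            # The pair is itself significant: connect the two words
            # and link each of them to the input word.
            edges[frozenset((a, b))] = score
            edges[frozenset((word, a))] = neighbours[a]
            edges[frozenset((word, b))] = neighbours[b]
    return edges
```

The stored significance values can then be mapped to line widths when the graph is drawn.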

What is the "in the current news" box?

Our portal Words of the Day collects and analyses Internet news articles on a daily basis. Words that occur unusually frequently on a given day, or in atypical sentence contexts, are selected as "word of the day" and are displayed together with current example sentences and source information. The box in the corpora portal shows the daily relative frequency of the word for the last few weeks or months. Clicking a blue bar takes you to the respective information page at Words of the Day.

How can I change the language of the Web page?

The available languages are presented in the footer of every Web page. The selected language will be stored in a cookie on your computer and will be automatically selected at your next visit.
