Documentation - Leipzig Corpora Collection / Deutscher Wortschatz

About the project Deutscher Wortschatz and the Leipzig Corpora Collection

Short description

The Deutscher Wortschatz project has provided information about the German language since the mid-1990s. We regularly collect and process available documents from the Internet (typically in an annual cycle). The result is a corpus-based dictionary in which every word has a page containing statistical information, example sentences, and links to related words. Because the underlying text material is very large, comprising several million sentences, information can be provided about almost every word. The service ranks among the most comprehensive information systems about the German language.

Over time, we have extended our service to more and more languages under the name Leipzig Corpora Collection. Currently, corpora for more than 250 languages are available and can be queried online. For many of those languages, we provide the largest freely available text resources.

For the presentation, the largest corpus of each language is preselected; for many languages, multiple corpora are available. All corpora are classified along the following dimensions:

Language (sometimes in connection with the country of origin, like "deu-ch" for German texts from Switzerland)
Genre (currently: news texts, random Web texts, and Wikipedia texts)
Year of download
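
The three dimensions above can be pictured as a small record type. The following Python sketch is illustrative only: the class name `CorpusDescriptor`, the function `describe`, and the short genre labels are assumptions made for this example and are not identifiers used by the project. Only the "deu-ch" tag format is taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical short labels for the three genres named above
# (news texts, random Web texts, Wikipedia texts).
GENRES = {"news", "web", "wikipedia"}


@dataclass(frozen=True)
class CorpusDescriptor:
    language: str            # language code, e.g. "deu"
    country: Optional[str]   # country of origin, e.g. "ch", or None
    genre: str               # one of GENRES
    year: int                # year of download


def describe(language_tag: str, genre: str, year: int) -> CorpusDescriptor:
    """Split a tag such as "deu-ch" into language and optional country."""
    language, _, country = language_tag.partition("-")
    if genre not in GENRES:
        raise ValueError(f"unknown genre: {genre!r}")
    return CorpusDescriptor(language, country or None, genre, year)
```

For example, `describe("deu-ch", "news", 2020)` yields a descriptor for German news texts from Switzerland downloaded in 2020, while `describe("deu", "web", 2019)` leaves the country unset.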

For many applications, smaller corpora are sufficient. For those use cases, we create so-called 'normed size corpora' based on 10,000, 30,000, 100,000, 300,000, and 1,000,000 randomly selected sentences. They are available for download under our terms of usage.
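
As a rough illustration of what a normed size corpus is, the sketch below draws random sentence samples of the sizes listed above from an in-memory list. It is an assumption-laden example, not the project's actual procedure: the function name `normed_samples`, the fixed seed, and the choice to draw each size independently are all hypothetical.

```python
import random
from typing import Dict, List, Sequence

# Sizes of the normed size corpora as listed in the text above.
NORMED_SIZES = (10_000, 30_000, 100_000, 300_000, 1_000_000)


def normed_samples(sentences: Sequence[str], seed: int = 0) -> Dict[int, List[str]]:
    """Draw one random sample per normed size that the input can support.

    Each sample is drawn without replacement and independently of the
    others; sizes larger than the input are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples: Dict[int, List[str]] = {}
    for size in NORMED_SIZES:
        if size <= len(sentences):
            samples[size] = rng.sample(sentences, size)
    return samples
```

With an input of 35,000 sentences, for instance, only the 10,000 and 30,000 sentence samples would be produced.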
