Leipzig Corpora Collection Download Page

The Leipzig Corpora Collection provides different tools and data for download, which are protected by copyright. For more details please refer to our terms of usage.

Download Corpora

The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists as well as for applications such as knowledge extraction programs.
The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material was removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the most significant words appearing as immediate left or right neighbor or appearing anywhere within the same sentence are given. More information about the format and content of these files can be found here.
The corpora are automatically collected from carefully selected public sources without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors.

If you use one of these corpora in your work we kindly ask you to cite this paper as

D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
In: Proceedings of the 8th International Language Ressources and Evaluation (LREC'12), 2012

To download a corpus select a language and corpus size - given in number of sentences - and download the corresponding data file.

SentiWS

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative polarity bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS contains 1,650 positive and 1,818 negative words, which sum up to 15,649 positive and 15,632 negative word forms incl. their inflections, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one.

SentiWS is organised in two UTF8-encoded text files structured the following way:
<Word>|<POS tag> \t <Polarity weight> \t <Infl_1>,...,<Infl_k> \n
where \t denotes a tab, and \n denotes a new line.

SentiWS is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
If you use SentiWS in your work we kindly ask you to cite this paper as

R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis.
In: Proceedings of the 7th International Language Ressources and Evaluation (LREC'10), pp. 1168-1171, 2010

Download SentiWS:
  • v1.8c, 2011-03-21: Second publicly available version in which some POS tags were corrected.
  • v1.8b, 2010-05-19: First publicly available version as described in Remus et al. (2010).

TinyCC

TinyCC 2.0 is a text corpus production engine that can be used to produce corpora in Leipzig Corpus Collection (LCC) format.

Documentation and download: TinyCC 2.0

ASV Toolbox

The ASV Toolbox is a modular collection of tools for the exploration of written language data. It was created at the NLP Group, Leipzig University and is not actively developed anymore.

Download at the Language Technology Group, Universität Hamburg: ASV Toolbox