unsupos - Unsupervised Part-of-Speech Tagging

by Chris Biemann, July 2007

Introduction

This is the page of Chris Biemann’s unsupervised POS tagger unsupos. The paper describing it is

 

Biemann C. (2006): Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL-06 Student Research Workshop, Sydney, Australia (PDF)

 

If you use this tagger, please acknowledge the software and cite the paper.

 

This page is not concerned with discussing research questions; it merely explains how to operate the tagger and how to run and tune the implementation.

 

To run the software, you need the Java JDK 1.5 installed on your computer.

 

Unsupervised POS Tagging

The trick with unsupervised POS tagging is that no annotated training material is required. Instead, unsupos finds the word categories itself by analyzing a large sample of monolingual, sentence-separated plain text. And when I write “large sample”, I really mean large: at least 2 million tokens (100k sentences) to get something going, 50 million tokens (3M sentences) for reasonable results, and 500 million tokens (30M sentences) for really nice performance. More is better.
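To check whether your corpus is in the right ballpark, you can count sentences and tokens with standard tools (a sketch assuming plain text with one sentence per line and whitespace tokenization; corpus.txt is a hypothetical file name):

```shell
# number of sentences (one per line)
wc -l corpus.txt
# rough token count (whitespace-separated words)
wc -w corpus.txt
```

If the token count is well below 2 million, consider collecting more text before training.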

 

Implementation

The open-source implementation in Java by Andreas Klaus can be downloaded here:

Download unsupos (24 MB)

 

For operating unsupos, please read the manual, where you will also find explanations of the parameters. For developers, there are also links to the javadoc.

 

Sample Data

If you want to try out the tagger, you might consider downloading text corpora at LCC.

Use the .sentence file and strip the sentence numbers.
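The LCC sentence files usually prefix each line with a sentence number followed by a tab (an assumption about the exact layout; check your download). If so, the numbers can be stripped with cut:

```shell
# drop the first tab-separated field (the sentence number),
# keep everything after it (the sentence text)
cut -f2- leipzig.sentences > corpus.txt
```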

Sample Tagger Models

The implementation contains a Viterbi tagger that can be used to tag arbitrary text using a tagger model that was computed by unsupos before.

 

To tag an input_text with a taggermodel.tmodel, call:

 

shell> java -Xmx500M -jar lib/ViterbiTagger.jar taggermodel.tmodel input_text > output_text

 

For large tagger models, a heap space of 500 MB will probably not be enough; increase the -Xmx value (e.g. -Xmx2G) accordingly.

 

The input_text must be tokenized in the same way as the corpus used for building taggermodel.tmodel. It is recommended to use the built-in tokenizer of unsupos by calling unsupos on your input_text and then using the file basic/input_text.tok.

Notice that the internal tokenizer replaces numbers with the symbol %N%.
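If you prepare input for the Viterbi tagger without the internal tokenizer, you have to mimic this replacement yourself. A rough sed sketch (the exact number pattern the internal tokenizer uses is an assumption):

```shell
# replace digit sequences, optionally with decimal separators, by %N%
sed -E 's/[0-9]+([.,][0-9]+)*/%N%/g' input_text > input_text.norm
```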

 

If you are just interested in the quality of tagging and do not have a large corpus to train on, you can download the following tagger models:

 

Language         | Corpus         | # Sentences | Tokenisation                   | Download link
-----------------|----------------|-------------|--------------------------------|-----------------
Catalan          | LCC            | 3M          | internal                       | Catalan-model
Czech            | LCC            | 4M          | internal                       | Czech-model
Danish           | LCC            | 3M          | internal                       | Danish-model
Dutch            | LCC            | 18M         | internal                       | Dutch-model
English          | BNC            | 6M          | from BNC                       | BNC-model
English          | MEDLINE 2004   | 34M         | Penn Treebank                  | MEDLINE-model
Finnish          | LCC            | 11M         | internal                       | Finnish-Model
French           | LCC            | 3M          | internal                       | French-Model
German           | Wortschatz     | 40M         | internal                       | German-model
Hindi            | Dainik Jagran* | 2M          | from source; encoding: IITRANS | Hindi-model
Hungarian        | LCC            | 18M         | internal                       | Hungarian-model
Icelandic        | LCC            | 14M         | internal                       | Icelandic-model
Italian          | LCC            | 9M          | internal                       | Italian-model
Norwegian        | LCC            | 16M         | internal                       | Norwegian-model
Spanish (Mexico) | LCC            | 4.5M        | internal                       | SpanishMx-model
Swedish          | LCC            | 3M          | internal                       | Swedish-model

* Dainik Jagran: thanks to Monojit Choudhury and Joydeep Nath for providing a crawled and cleaned version of these newspaper articles.

 

If you feel your language is missing, ask me whether I can provide it for you. If you have a tagger model that is worth distributing, send it to me and I will put it here. If you don’t like the granularity of the tag sets, go ahead and play with the parameters!

 

 

Acknowledgements

The biggest thanks go to the invaluable programmer of this unsupos version, Andreas Klaus, who had to dig through a huge pile of undocumented Perl scripts.

Further thanks go to Stefan Bordag for his similarity.jar package and to Marco Büchler for the Medusa engine.

 

Last edited December 2, 2009