unsupos - Unsupervised Part-of-Speech Tagging
by Chris Biemann, July 2007
This is the page of Chris Biemann’s unsupervised POS tagger unsupos.
The paper describing it is
Biemann, C. (2006): Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL-06 Student Research Workshop, Sydney, Australia.
If you use this tagger, please acknowledge the software and cite the paper.
This page does not discuss research questions. It only covers how to run the tagger and how to operate and tune the implementation.
To run the software, you need Java JDK 1.5 installed on your computer.
The trick with unsupervised POS tagging is that no annotated training material is required. Instead, unsupos finds the word categories itself by analyzing a large sample of monolingual, sentence-separated plain text. And when I write “large sample”, I really mean large: at least 2 million tokens (100k sentences) to get something going, 50 million tokens (3M sentences) to get reasonable results, and 500 million tokens (30M sentences) to obtain really nice performance. More is better.
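To check whether a corpus is in the right size range before training, a rough count is enough. The following is a minimal sketch, not part of unsupos: it assumes one sentence per line and counts whitespace-separated tokens, which will differ somewhat from the counts produced by the internal tokenizer. The function names and verdict strings are hypothetical; the thresholds are the ones given above.

```python
def count_corpus(lines):
    """Count non-empty lines (sentences) and whitespace-separated tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        parts = line.split()
        if parts:  # ignore blank lines
            sentences += 1
            tokens += len(parts)
    return sentences, tokens


def size_verdict(tokens):
    """Map a token count to the rough quality levels named on this page."""
    if tokens >= 500_000_000:
        return "really nice performance"
    if tokens >= 50_000_000:
        return "reasonable results"
    if tokens >= 2_000_000:
        return "enough to get something going"
    return "too small"


if __name__ == "__main__":
    import sys

    # Usage: python count_corpus.py corpus.txt   (file name is an example)
    with open(sys.argv[1], encoding="utf-8") as corpus:
        n_sentences, n_tokens = count_corpus(corpus)
    print(n_sentences, "sentences,", n_tokens, "tokens:", size_verdict(n_tokens))
```

The file is read line by line, so memory use stays small even for corpora with tens of millions of sentences.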
The open-source implementation in Java by Andreas Klaus can be downloaded here:
Download unsupos (24 megabytes)
To operate unsupos, please read the manual. There, you can also find explanations of the parameters. For developers, there are also links to the javadoc.
If you want to try out the tagger, you might consider downloading text corpora at LCC. Use the .sentence file and strip the sentence numbers.
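Stripping the sentence numbers can be done with a few lines of Python. This is a minimal sketch under an assumption that is not stated on this page: each line of the .sentence file is taken to have the form "number, then a tab or spaces, then the sentence". Check your download before relying on it. The function names and file names are hypothetical.

```python
import re

# Assumed line layout: "<number><TAB or spaces><sentence>".
_LEADING_NUMBER = re.compile(r"^\s*\d+[\t ]+")


def strip_sentence_number(line):
    """Remove one leading sentence number; lines without one are unchanged."""
    return _LEADING_NUMBER.sub("", line.rstrip("\r\n"), count=1)


def strip_file(src_path, dst_path, encoding="utf-8"):
    """Write one number-free sentence per line to dst_path."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            sentence = strip_sentence_number(line)
            if sentence:  # skip empty lines
                dst.write(sentence + "\n")


if __name__ == "__main__":
    import sys

    # Usage: python strip_numbers.py corpus.sentence corpus.txt
    strip_file(sys.argv[1], sys.argv[2])
```

Only the first number on each line is removed, so a sentence that itself begins with a number keeps it.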
The implementation contains a Viterbi tagger that can be used to tag arbitrary text with a tagger model previously computed by unsupos.
To tag an input_text with a taggermodel.tmodel, call
shell> java -Xmx500M -jar lib/ViterbiTagger.jar taggermodel.tmodel input_text > output_text
A heap space of 500M is probably not enough for large tagger models. In that case, increase the -Xmx value.
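If you tag many files, the call above can be wrapped in a small script so that the heap size is easy to change. This is a sketch, not part of unsupos: the function names are hypothetical, and it assumes that java is on your PATH and that the script is started from the unsupos directory, so that lib/ViterbiTagger.jar resolves. The command line it builds is exactly the one shown above.

```python
import subprocess


def build_tagger_command(model, input_text, heap="500M",
                         jar="lib/ViterbiTagger.jar"):
    """Build the java command line for the Viterbi tagger."""
    return ["java", "-Xmx" + heap, "-jar", jar, model, input_text]


def tag_file(model, input_text, output_text, heap="500M"):
    """Run the tagger and write its standard output to output_text."""
    command = build_tagger_command(model, input_text, heap=heap)
    with open(output_text, "wb") as out:
        # check=True raises CalledProcessError if java exits with an error,
        # for example when the heap is too small for the model.
        subprocess.run(command, stdout=out, check=True)


if __name__ == "__main__":
    # Example values; raise the heap for large tagger models.
    tag_file("taggermodel.tmodel", "input_text", "output_text", heap="2G")
```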
The input_text must be tokenized in the same way as the corpus used for building the taggermodel.tmodel. It is recommended to use the built-in tokenization of unsupos by calling unsupos with your input_text and then using the file basic/input_text.tok.
Note that the internal tokenizer replaces numbers with the symbol %N%.
If you are just interested in the quality of tagging and do not have a large corpus to train on, you can download the following tagger models:
| Language | Corpus | # Sentences | Tokenisation | Download link |
|---|---|---|---|---|
| Catalan | | 3M | internal | |
| Czech | | 4M | internal | |
| Danish | | 3M | internal | |
| Dutch | | 18M | internal | |
| English | | 6M | from BNC | |
| English | | 34M | | |
| Finnish | | 11M | internal | |
| French | | 3M | internal | |
| German | | 40M | internal | |
| Hindi | Dainik Jagran | 2M | from source; encoding: IITRANS | |
| Hungarian | | 18M | internal | |
| Icelandic | | 14M | internal | |
| Italian | | 9M | internal | |
| Norwegian | | 16M | internal | |
| Spanish (Mexico) | | 4.5M | internal | |
| Swedish | | 3M | internal | |
If you feel your language is missing, ask me whether I can provide it for you. If you have a tagger model that is worth distributing, send it to me and I will put it here. If you don’t like the granularity of the tag sets, go ahead and play with the parameters!
Biggest thanks go to the invaluable programmer of this unsupos version, Andreas Klaus. He had to dig through a huge pile of undocumented Perl scripts. Further thanks go to Stefan Bordag for his similarity.jar package and to Marco Büchler for the Medusa engine.
Last edited December 2, 2009