Unsupervised Part-Of-Speech-Tagger -- UnsuPOS



1 Introduction

UnsuPosTagger is a Java implementation of the unsupervised part-of-speech tagger presented by Chris Biemann (2006). The UnsuPosTagger class described in this document uses external Java libraries from the ASV Group (University of Leipzig), e.g. Medusa for tokenizing and ChineseWhispers for graph clustering. The following sections describe the configuration of the UnsuPosTagger, its input and output files, and the interface for integrating it into other Java applications. All files and external libraries are packed into the folder unsupos/.

2 Classes and source code files

Package name

de.uni_leipzig.asv.unsupos


Classes

The package de.uni_leipzig.asv.unsupos primarily consists of three classes:
  1. an IntegerPair used for key values in HashMaps,
  2. the UnsuPosTagger which implements the algorithm, and
  3. a main class UnsuPos for command line usage.
The Java files of all three classes are located in the source code folder src/de/uni_leipzig/asv/unsupos/ (matching the package name, see above). The files are named IntegerPair.java, UnsuPosTagger.java and UnsuPos.java, respectively. The UnsuPos class in src/de/uni_leipzig/asv/unsupos/UnsuPos.java provides an example of how to start the UnsuPosTagger as a thread and how to request progress information (see also the comments in this file). Anyone who wants to use the UnsuPosTagger within their own application will probably replace or extend this file.

3 Configuration

3.1 Command line arguments

The UnsuPos tagger has three command line options: (1) the input file name, (2) a configuration file name (for an explanation see section 3.3), and (3) an option to control case sensitivity. A typical call on the command line looks as follows:
java -Xmx<. . .>
     -cp <. . .>
     -Dfile.encoding=ISO-8859-1
     -Djava.io.tmpdir=../tmp/
     -Dde.uni_leipzig.asv.medusa.config.ClassConfig=
     de/uni_leipzig/asv/unsupos/UnsuPos  -input 
                                        [-config ]
                                        [-ignoreCase]
                                        [-help]
An example can be found in bin/run.aspra4.sh, which was used to run the application on the workstation aspra4. If no configuration file is specified, the UnsuPos main class assumes config/unsupos.conf. UnsuPos passes the options on to the UnsuPosTagger constructor (see section 3.2).
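The option handling described above can be sketched as follows. This is only an illustration and not the code in UnsuPos.java: the class CliOptions and its fields are hypothetical names, and only the option names and the default configuration path are taken from this section.

```java
/** Hypothetical holder for the documented options; not taken from UnsuPos.java. */
final class CliOptions {
    /** Default configuration file, as stated in section 3.1. */
    static final String DEFAULT_CONFIG = "config/unsupos.conf";

    String input;                    // value following -input (required unless -help is given)
    String config = DEFAULT_CONFIG;  // value following -config, otherwise the default
    boolean ignoreCase;              // true if -ignoreCase is present
    boolean help;                    // true if -help is present

    static CliOptions parse(String[] args) {
        CliOptions o = new CliOptions();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-input":
                    o.input = valueAfter(args, i);
                    i++; // skip the consumed value
                    break;
                case "-config":
                    o.config = valueAfter(args, i);
                    i++; // skip the consumed value
                    break;
                case "-ignoreCase":
                    o.ignoreCase = true;
                    break;
                case "-help":
                    o.help = true;
                    break;
                default:
                    throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        if (!o.help && o.input == null) {
            throw new IllegalArgumentException("missing -input");
        }
        return o;
    }

    private static String valueAfter(String[] args, int i) {
        if (i + 1 >= args.length) {
            throw new IllegalArgumentException(args[i] + " requires a value");
        }
        return args[i + 1];
    }
}
```

With this sketch, parsing the arguments -input corpus.txt -ignoreCase leaves config at its default value, which mirrors the fallback described above.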

3.2 UnsuPosTagger constructor

The signature of the constructor of the UnsuPosTagger-class is
public UnsuPosTagger(String fcorpusname, boolean ignoreCase,
                     String fconfname, PrintStream sout,
                     PrintStream eout, String newLineSequence)
The parameters are:

3.3 Configuration file unsupos.conf

The configuration file unsupos.conf for the UnsuPosTagger is located in the config/ subfolder. The name of the configuration file is arbitrary and can be specified either on the command line or in the UnsuPosTagger constructor (see sections 3.1 and 3.2). All parameters are listed in Table 1.

The NB_MAX parameter limits the number of neighbours that are considered when computing similarity based on neighbouring co-occurrences; it is mainly meant for reducing the processing time. Small settings result in a considerable loss of data and therefore a smaller lexicon. Large settings lead to an unnecessarily long run-time, since it is in general sufficient to consider only the most significant neighbours for words of high frequency (recommended setting: 100-2000, default: 100).

To control the execution of the individual steps, the parameters PREPROC, PART1 and so on can be modified. Four values are possible: true means that the corresponding files are created, but existing files are not overwritten; force means that existing files are overwritten; false means that the corresponding step is skipped (files will not be created); and break stops the execution.
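The four step values can be modelled as a small enum. This is a sketch of the semantics stated above, not the tagger's actual parsing code; the names StepMode, shouldWrite and stopsExecution are hypothetical.

```java
import java.util.Locale;

/** Hypothetical model of the four step values; the real code in UnsuPosTagger may differ. */
enum StepMode {
    TRUE, FORCE, FALSE, BREAK;

    /** Parses a configuration value such as "true" or "force". */
    static StepMode parse(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Whether the step's output files should be written, given whether they already exist. */
    boolean shouldWrite(boolean filesExist) {
        switch (this) {
            case TRUE:
                return !filesExist; // create, but never overwrite
            case FORCE:
                return true;        // create or overwrite
            default:
                return false;       // FALSE skips the step, BREAK stops the run
        }
    }

    /** Whether this value stops the execution. */
    boolean stopsExecution() {
        return this == BREAK;
    }
}
```

The difference between true and force only shows when the files already exist: StepMode.parse("true").shouldWrite(true) is false, whereas StepMode.parse("force").shouldWrite(true) is true.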

The TOKENIZER parameter specifies the implementation used for tokenizing ("medusa" is the default). The medusa tokenizer can be configured via the Medusa configuration file medusa_config.xml (property "TOKENIZER_IMPL"). The alternative value for TOKENIZER is "normal", which means that only numbers are replaced. Note that lines with fewer than 3 tokens are skipped when writing the tokenized corpus file (this is necessary due to a bug in Medusa).

Parameter                Default value  Description
FEAT                     200            The number of feature words.
TARG                     10000          The number of target words.
COS_THRESH               0.66           The cosine threshold.
CWTARG                   5000           The number of nodes in the cosine graph for partitioning 1.
TOPADD                   9500           The number of nodes in the cosine graph when building the lexicon.
NB_THRESH                4              Threshold for number of cooccurrences.
NB_MAX                   100            Maximum number of neighbours that are considered when computing similarity.
CONF_OVERLAP             2              Minimal overlap of clusters from partitioning 1 and 2.
SING_ADD                 200            The number of words to be added as single clusters.
BEHEAD                   2000           The number of words to be skipped for partitioning 1.
TOKENIZER                medusa         Specifies the tokenizing procedure (medusa, normal).
PREPROC                  true           Perform preprocessing.*
PART1                    true           Perform partitioning 1.*
PART2                    true           Perform partitioning 2.*
JOINPARTS                true           Perform joining of partitionings.*
BUILDLEX                 force          Perform building of the lexicon.*
TAGGING                  false          Perform tagging.*
DELTEMPFILES (optional)  true           Controls whether temporary files are deleted or not (possible values are 'true' and 'false').

Table 1: Most parameters are specified in the unsupos/config/unsupos.conf file. Note that the parameter S_THRESH (significance threshold for nb-cooccurrences) is set in the Medusa configuration file.
* Possible values are true (create files if they do not exist), false (do not create corresponding files, i.e. skip this step), force (create/overwrite files), and break (stop the UnsuPosTagger).
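As an illustration of how the defaults in Table 1 interact with values given in a configuration file, the following sketch loads settings with java.util.Properties. It assumes a KEY=VALUE file syntax, which this document does not specify; the class ConfigSketch is hypothetical and lists only a subset of the parameters. The UnsuPosTagger may read its configuration differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper; assumes KEY=VALUE syntax, which is not confirmed by the documentation. */
final class ConfigSketch {
    /** A subset of the default values from Table 1. */
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("FEAT", "200");
        d.setProperty("TARG", "10000");
        d.setProperty("COS_THRESH", "0.66");
        d.setProperty("NB_MAX", "100");
        d.setProperty("TOKENIZER", "medusa");
        d.setProperty("PREPROC", "true");
        d.setProperty("BUILDLEX", "force");
        d.setProperty("TAGGING", "false");
        return d;
    }

    /** Reads settings from the given source; keys that are absent fall back to the defaults. */
    static Properties load(Reader source) {
        Properties p = new Properties(defaults());
        try {
            p.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

For example, loading a source that contains only the line NB_MAX=500 yields 500 for NB_MAX, while getProperty("FEAT") still returns the default 200.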

3.4 Medusa configuration

Medusa's configuration file medusa_config.xml is located in the config/ subfolder. Two options in this file are of interest for the UnsuPosTagger. The parameter dblThreshold in the category de.uni_leipzig.asv.medusa.export.FlatFileExporterImpl specifies the significance threshold for the similarity calculation. A more technical parameter is intFreeMemory in the category de.uni_leipzig.asv.medusa.config.DefaultMemoryAllocatorImpl, which specifies the number of bytes of memory not used by the algorithms. Due to a bug in Medusa, this parameter should be greater than one third of the Java VM memory size (cf. the java option -Xmx).
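The one-third rule for intFreeMemory is simple arithmetic; the sketch below computes the smallest whole number of bytes that satisfies it for a given heap size. The class and method names are hypothetical and not part of Medusa or UnsuPOS.

```java
/** Hypothetical helper for choosing intFreeMemory; not part of Medusa or UnsuPOS. */
final class MedusaMemory {
    /**
     * Returns the smallest whole number of bytes that is strictly greater
     * than one third of the given heap size.
     */
    static long minFreeMemoryBytes(long maxHeapBytes) {
        // Integer division rounds down, so adding 1 always exceeds the exact third.
        return maxHeapBytes / 3 + 1;
    }
}
```

For a heap of 1024 MiB (1073741824 bytes) the result is 357913942 bytes. Inside a running JVM, Runtime.getRuntime().maxMemory() can serve as an approximation of the -Xmx value, although the reported figure may be somewhat lower than the configured one.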

4 Data files

Input file

The corpus file is a simple text file where every single line corresponds to one sentence of natural language. The name of the corpus file is used as a suffix for all output files, which are stored in subdirectories at the same location as the corpus file.

Output files

5 Interface

The UnsuPosTagger implements the Runnable interface and can therefore be started as a Thread. An example can be found in the UnsuPos class. The following methods report the current status of the UnsuPosTagger or perform housekeeping:
  • public String getActualStepName() - returns the name of the current step (e.g. Partitioning 1).
  • public String getActualTaskName() - returns the name of the current task (e.g. Create context vectors). Note: in general, one step consists of several tasks.
  • public int getActualTaskProgress() - returns the progress of the current task (an integer between 0 and 100).
  • public void deleteDataFiles() - deletes all data files except the corpus (input) file.
  • public void printConfigInfo() - prints parameter settings and information about the configuration.
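Starting a Runnable in its own thread and polling the three status getters listed above might look like the following sketch. To keep it runnable without the unsupos libraries, it uses a stand-in class FakeTagger and a hypothetical interface StatusSource that merely repeats the getter signatures; neither is part of the package, and the real UnsuPos class may organize this differently.

```java
/** Hypothetical interface repeating the three status getters listed above. */
interface StatusSource {
    String getActualStepName();
    String getActualTaskName();
    int getActualTaskProgress();
}

/** Stand-in for UnsuPosTagger so that the sketch runs without the unsupos libraries. */
final class FakeTagger implements Runnable, StatusSource {
    private volatile int progress = 0;

    @Override
    public void run() {
        // Simulate work by raising the progress from 0 to 100 in steps of 20.
        for (int p = 0; p <= 100; p += 20) {
            progress = p;
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    @Override public String getActualStepName() { return "Partitioning 1"; }
    @Override public String getActualTaskName() { return "Create context vectors"; }
    @Override public int getActualTaskProgress() { return progress; }
}

final class ProgressMonitor {
    /** Runs the job in a new thread, prints its status until it finishes, and returns the last progress value. */
    static int runAndWatch(Runnable job, StatusSource status, long pollMillis) {
        Thread worker = new Thread(job);
        worker.start();
        try {
            while (worker.isAlive()) {
                System.out.printf("%s / %s: %d%%%n",
                        status.getActualStepName(),
                        status.getActualTaskName(),
                        status.getActualTaskProgress());
                worker.join(pollMillis); // wait for the poll interval or until the worker ends
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return status.getActualTaskProgress();
    }
}
```

With the stand-in, the call is ProgressMonitor.runAndWatch(tagger, tagger, 5) for a FakeTagger instance named tagger. With the real class, the same loop would call the getters directly on the UnsuPosTagger instance.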
6 Javadoc documentation

Detailed source code documentation generated with Javadoc can be found in doc/javadoc/index.html.

A HowTo

Usage of external tokenizers

To use your own tokenizer implementation, there are two options: (1) include the tokenizer in Medusa (see the Medusa configuration for more details), or (2) set the option PREPROC to false in the configuration file and write the tokenized file to basic/<corpus.txt>.tok, where corpus.txt is the name of the original input file and basic/ is a subdirectory located in the same directory as the input file. The UnsuPosTagger then skips the generation of the tokenized file basic/<corpus.txt>.tok and starts with creating the word and frequency list. Note that the UnsuPosTagger nevertheless requires the path to the original input file.

Usage of standard output and standard error

When creating a new UnsuPosTagger object, two PrintStreams for standard output and standard error, respectively, can be specified (see section 3.2). The UnsuPos main class, for example, writes the corresponding output to the files log/<suffix>.stdout and log/<suffix>.stderr (see UnsuPos.java), where suffix is the input file name.
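For an external tokenizer, the location of the tokenized file follows from the path of the corpus file as described under "Usage of external tokenizers". The sketch below derives that path; the class CorpusPaths and its method are hypothetical helpers and not part of the unsupos package.

```java
import java.nio.file.Path;

/** Hypothetical helper that derives basic/<corpus file name>.tok from the corpus path. */
final class CorpusPaths {
    static Path tokenizedFile(Path corpus) {
        // basic/ is a subdirectory of the directory that contains the corpus file.
        Path dir = corpus.toAbsolutePath().getParent();
        return dir.resolve("basic").resolve(corpus.getFileName().toString() + ".tok");
    }
}
```

For a corpus file data/corpus.txt, the resulting path ends in data/basic/corpus.txt.tok. An external tokenizer would write its output there before the UnsuPosTagger is started with PREPROC set to false.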

References

Biemann, C. (2006): Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL-06 Student Research Workshop 2006, Sydney, Australia, http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/unsupos_graph_coling06SRW.pdf