LangSepP - Unsupervised Language Separation Program
by Chris Biemann, July 2007
This page contains langSepP, a program that sorts multilingual text into monolingual chunks.
The program comes separately for UNIX/LINUX and Windows. Download Version 1.0 of langSepP for LINUX.
To run langSepP, you need a Java Runtime Environment (JRE) of version 1.5 or later. You can obtain it at http://java.sun.com/j2se/1.5.0/download.jsp. Please ensure that java is in the path. To check this, type java -version in your shell; it should respond with a version number of 1.5.0 or higher.
Further, you need Perl version 5 or higher. The latest version can be downloaded at http://www.perl.com/download.csp. Please ensure that perl is in the path. To check this, type perl -v in your shell; it should respond with a version number of 5 or higher.
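The path check described above can be sketched as a small shell helper. This is only an illustration: check_tool is a hypothetical name and not part of langSepP, and it only tests whether a command can be found, not which version it has. The version still has to be read from the output of java -version and perl -v.

```shell
# Report whether a command is reachable through PATH.
# check_tool is a hypothetical helper name, not part of langSepP.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found in PATH"
        return 1
    fi
}

# langSepP needs both of these; a missing tool is reported, not fatal here.
for tool in java perl; do
    check_tool "$tool" || true
done
```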
After unzipping langSepP, perform a test run with the example file mix1K.txt, which contains sentences from 10 languages, one sentence per line.
> sh langSepP.sh mix1K.txt
The program computes a word co-occurrence graph, clusters it with the Chinese Whispers graph clustering algorithm, and uses the words in the clusters as word lists in a word-based language identifier. For reference, see [1].
Results
will be stored in the subdirectory res with each language in a separate file. Lines
that could not be assigned to a language are stored in a file with the
extension unknown.
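To get a quick overview of how the input was distributed, the number of lines in each result file can be counted. The following is a minimal sketch; summarize_results is a hypothetical helper that is not shipped with langSepP, and it makes no assumption about the file names inside res beyond them being regular files.

```shell
# Print "<line count> <file>" for every regular file in a result directory.
# summarize_results is a hypothetical helper, not part of langSepP.
summarize_results() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # tr strips the padding some wc implementations add
        printf '%s %s\n' "$(wc -l < "$f" | tr -d ' ')" "$f"
    done
}
```

After a run, it would be called as summarize_results res.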
You should always provide an input file that is at least roughly sentence-separated. Lines with too many words may cause the program to crash due to memory overflow. Lines that hold short documents rather than single sentences should, however, still work.
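Before a run, overly long lines in the input can be spotted with a short awk call. This is a sketch only: long_lines is a hypothetical helper, words are counted as whitespace-separated fields, and the threshold is an assumption, since no exact limit is stated here.

```shell
# Print the line number and word count of every line with more than MAX words.
# Usage: long_lines FILE MAX
# long_lines is a hypothetical helper; the threshold is chosen by the caller.
long_lines() {
    awk -v max="$2" 'NF > max { print NR ": " NF " words" }' "$1"
}
```

For example, long_lines mix1K.txt 100 would list all lines of the example file with more than 100 words; an empty result means no line exceeds the threshold.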
If you want to process very large files, or if your machine has more or less memory than the program uses by default, you should change the parameters in the script langSepP.sh.
The parameters are similar to those in the tinyCC2.0 corpus production engine. Here, only rules of thumb with respect to language separation are given.
# Memory max usage in MB (approximate)
export MAXMEM=600
Should be set to fit the size of your machine's memory, but never above 1500.
# min frequency for scoocs
export SMINFREQ=2
Should be at least 2. Higher values speed up the process but result in lower text coverage. For corpora above 1 million sentences, you can safely use 4.
# min sig for scooc
export SMINSIG=2.71
Should be at least 2. Higher values speed up the process but result in lower text coverage. For corpora above 1 million sentences, you can safely use 6.
# CW
java -jar -Xmx500M bin/CW.jar -S -F -i temp$1/$1.nodes temp$1/$1.edges$SMINSIG -o temp$1/$1.res$SMINSIG
Here, pay attention to the memory value (500M). If CW exits with an out-of-memory error during the process, additionally pass the parameter -R.
For very large corpora, it might be advisable to make these changes. Since the internal on-disk memory for the co-occurrence graph is limited to 2 gigabytes, higher parameter settings can push the limit of what is possible with the current implementation further.
[1] Biemann, C., Teresniak, S. (2005): Disentangling from Babylonian Confusion - Unsupervised Language Identification. Proceedings of CICLing-2005, Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico. Springer LNCS 3406.