langSepP - Unsupervised Language Separation Program

by Chris Biemann, July 2007

 

 

Introduction

 

This page contains langSepP, a program that sorts multilingual text into monolingual chunks.

The program is packaged separately for UNIX/Linux and Windows. Download version 1.0: langSepP for Linux; langSepP for Windows.

 

 

System requirements

To run langSepP, you need a Java Runtime Environment (JRE), version 1.5 or later. You can obtain it at http://java.sun.com/j2se/1.5.0/download.jsp. Please ensure that java is in the path: type “java -version” in your shell, and it should respond with a version number of 1.5.0 or higher. You also need Perl, version 5 or higher. The latest version can be downloaded at http://www.perl.com/download.csp. Please ensure that perl is in the path: type “perl -v” in your shell, and it should respond with a version number of 5 or higher.
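The two path checks above can be combined into one small script. This is a minimal sketch; check_tool is a hypothetical helper name and not part of the langSepP distribution. It only tests whether the commands can be found, so the version numbers still have to be read from “java -version” and “perl -v”.

```shell
#!/bin/sh
# Report whether a command can be found on the PATH.
# check_tool is a hypothetical helper, not part of langSepP.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found on PATH"
        return 1
    fi
}

# Check both prerequisites named above and print a hint for each missing one.
for tool in java perl; do
    check_tool "$tool" || echo "Install $tool or add it to the PATH before running langSepP."
done
```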

 

 

Operating langSepP

After unzipping langSepP, perform a test run with the example file mix1K.txt, which contains sentences in 10 languages, one sentence per line.

 

> sh langSepP.sh mix1K.txt

 

The program computes a word co-occurrence graph, clusters it with the Chinese Whispers graph clustering algorithm, and uses the words in each cluster as a word list for a word-based language identifier. For details, see [1].

 

Results will be stored in the subdirectory res with each language in a separate file. Lines that could not be assigned to a language are stored in a file with the extension unknown.
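To get a quick overview of how the input was distributed, the lines in each result file can be counted. The sketch below makes no assumption about the file names langSepP produces; it simply walks over whatever it finds in the directory. count_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Print the number of lines in every regular file of a directory.
# count_lines is a hypothetical helper, not part of langSepP.
count_lines() {
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            # wc -l pads its output on some systems, so strip the blanks.
            n=$(wc -l < "$f" | tr -d ' ')
            printf '%s\t%s\n' "$n" "$f"
        fi
    done
}

# After a run, inspect the res subdirectory if it exists.
if [ -d res ]; then
    count_lines res
fi
```

A large count for the file with the extension unknown suggests that many lines could not be assigned to any language.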

 

Always provide input that is at least roughly sentence-separated, one sentence per line. Lines with too many words may cause the program to run out of memory and crash. A short document on a single line, however, should be possible.
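One way to guard against overly long lines is to drop them before running langSepP. The following is a sketch only: filter_long_lines is a hypothetical helper, and the limit of 3 words in the demonstration is chosen just to keep the example small. The document does not state at which line length memory problems start, so a suitable limit has to be found by experiment.

```shell
#!/bin/sh
# Keep only lines with at most $1 whitespace-separated words.
# Reads from standard input and writes to standard output.
# filter_long_lines is a hypothetical helper, not part of langSepP.
filter_long_lines() {
    awk -v max="$1" 'NF <= max'
}

# Demonstration: the five-word line is dropped, the other two are kept.
printf 'one two three\none two three four five\nsix\n' | filter_long_lines 3
```

For a real file, the call would look like “filter_long_lines 100 < input.txt > input.filtered.txt”, where both file names and the limit of 100 are placeholders.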

 

Tuning langSepP

If you want to process very large files, or if your machine has more or less memory than the program uses by default, change the parameters in the script langSepP.sh.

 

The parameters are similar to those of the tinyCC2.0 corpus production engine; only rules of thumb with respect to language separation are given here.

# Memory max usage in MB (approximate)

export MAXMEM=600

 

Should be set to fit the size of your machine’s memory, but never above 1500.
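If the value is computed rather than typed in by hand, the ceiling of 1500 can be enforced in the script itself. This is a sketch under that assumption; cap_maxmem is a hypothetical helper and is not present in langSepP.sh.

```shell
#!/bin/sh
# Cap a requested memory size (in MB) at the 1500 MB ceiling stated above.
# cap_maxmem is a hypothetical helper, not part of langSepP.sh.
cap_maxmem() {
    if [ "$1" -gt 1500 ]; then
        echo 1500
    else
        echo "$1"
    fi
}

# A request of 2048 MB is reduced to 1500; smaller values pass through unchanged.
MAXMEM=$(cap_maxmem 2048)
export MAXMEM
echo "MAXMEM=$MAXMEM"
```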

 

# min frequency for scoocs

export SMINFREQ=2

 

Should be at least 2. Higher values speed up the process but reduce text coverage. For corpora of more than 1 million sentences, you can safely use 4.

 

 

# min sig for scooc

export SMINSIG=2.71

 

Should be at least 2. Higher values speed up the process but reduce text coverage. For corpora of more than 1 million sentences, you can safely use 6.

 

# CW

java -jar -Xmx500M bin/CW.jar -S -F -i temp$1/$1.nodes temp$1/$1.edges$SMINSIG -o temp$1/$1.res$SMINSIG

 

Here, pay attention to the memory value (500M). If CW exits with an out-of-memory error, additionally pass the parameter -R.

 

 

For very large corpora, these changes may be advisable. Because the internal on-disk storage for the co-occurrence graph is limited to 2 gigabytes, higher parameter settings can extend what is possible with the current implementation.

 

 

References

[1] Biemann, C., Teresniak, S. (2005): Disentangling from Babylonian Confusion - Unsupervised Language Identification. Proceedings of CICLing-2005, Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico. Springer LNCS 3406.