Activate / Deactivate a Language
Starting the tool in the console with command lines
Run the tool with a given in- and output file:
Run the tool by typing the input directly in the console:
How you can use the JLanI-Tool in your own program
A description of how to install a module is available on the main page of the ASV Toolbox project.
The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.jLanI.main.WortschatzModulPane
The JLanI tool identifies the language of a text input. The input can come from a file, a database connection, or plain text.
For each language, a list of words and their frequencies is needed for training. This list can come from a database or a text file.
This tool uses a file called blacklist.txt, which you find in resources/jlani. The file should be empty when you download the tool. JLanI needs this file even if it is empty (so do not delete it), but you can fill it with words that should not be used for language identification, one word per line. For example, you should add names like George, Bush, Angelika, Merkel or USA, Deutschland and abbreviations like A. or G. or W. to this file. This improves the result when the input contains only words that are not specific to any language.
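The effect of such a blacklist can be sketched as follows. This is only an illustration of the one-word-per-line format, not JLanI's own code: the class BlacklistSketch, its methods, and the UTF-8 encoding are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of JLanI.
class BlacklistSketch {

    // Reads one word per line (UTF-8 is an assumption).
    // Blank lines are skipped, so an empty file yields an empty set.
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Returns the tokens that are not blacklisted, in their original order.
    static List<String> filter(List<String> tokens, Set<String> blacklist) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!blacklist.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        Set<String> blacklist = load(Paths.get("resources/jlani/blacklist.txt"));
        List<String> tokens = Arrays.asList("Angelika", "Merkel", "besucht", "die", "USA");
        System.out.println(filter(tokens, blacklist));
    }
}
```

With Angelika, Merkel and USA in the file, only the two language-specific words of the sample would remain for identification.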
Here you can find the direct link to the help you are reading now. You can also see the current version of the JLanI plugin and its authors.

On this panel you can specify the text input. You have three options to do this:


After you have chosen the input, you have to specify what kind of output you need. You have three options:
1. Text output – check this option; nothing further is required.
2. File output – check this option and specify the file name by clicking the “search” button.
3. Database output – check this option and set the parameters (“table name”, “ID”, “sentence”, “language 1”, “Probability1”, “language 2”, “Probability2”).

If you choose this option, the output is written to the database. If the table does not exist, the program creates it. The probability values give the confidence (in percent) for language 1 and language 2. When using database input, IDs match between the input and output tables.
Attention: if this option is chosen, it can take a couple of minutes before the program finishes, especially for large databases.
After you have specified the input and the output, click the start button. The language of the input is then identified sentence by sentence.
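To illustrate what sentence-wise processing means, the following sketch splits a text into sentences with the standard java.text.BreakIterator. How JLanI itself segments the input is not documented here, so this is a stand-in technique; the class SentenceSplitSketch is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of JLanI.
class SentenceSplitSketch {

    // Splits text at the sentence boundaries reported by BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Das ist ein Satz. This is a sentence.";
        for (String sentence : split(text, Locale.ROOT)) {
            System.out.println(sentence);
        }
    }
}
```

Each resulting sentence could then be identified separately. Rule-based splitters like this one may break after abbreviations or ordinal dates, so real input can need a more careful segmentation.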
If an error occurs, go to the “DB Connection” panel and check the parameters.
Here you can add a new language to the tool. The input for the new language can come from a text file or from a database connection.
On the left side you can see the currently available languages.

By checking the active flag, you can activate or deactivate a language.
If you want to remove a language, click on it (the selected language is shown with a blue background) and then click the delete button. The tool asks whether you really want to delete the language. If you are sure, click the “ok” button.
Click the “load all languages” button. All available languages are loaded from disk. If an error occurs, go to the “DB Connection” panel and check the parameters.
If the button cannot be clicked, all languages are already loaded.
Click the “add from file” button and browse to the file that contains the language you want to add.
The file with the language should look like this:
503058069
15151724 der
14548413 die
10698711 und
7404753 in
5761988 den
4562084 von
4248046 das
4074500 mit
4007442 zu
The first line specifies how many words (tokens) the underlying corpus contains in total. After this line, the words are listed, one per line, with two parameters each: the first is the absolute frequency of the word, the second is the word itself.
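A reader for this format can be sketched as below. It is a minimal illustration, not the loader JLanI uses: the class WordListSketch is hypothetical, and it assumes that frequency and word are separated by whitespace. The sample values are taken from the example file above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the word list format, not part of JLanI.
class WordListSketch {
    final long totalTokens;
    final Map<String, Long> frequencies = new LinkedHashMap<>();

    // lines.get(0) holds the corpus size; every further line is "<frequency> <word>".
    WordListSketch(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing token count line");
        }
        totalTokens = Long.parseLong(lines.get(0).trim());
        for (String line : lines.subList(1, lines.size())) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Assumption: whitespace separates the frequency from the word.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected '<frequency> <word>': " + line);
            }
            frequencies.put(parts[1], Long.parseLong(parts[0]));
        }
    }

    // Share of the corpus made up by this word; 0.0 for unknown words.
    double relativeFrequency(String word) {
        Long count = frequencies.get(word);
        return count == null ? 0.0 : (double) count / totalTokens;
    }

    public static void main(String[] args) {
        WordListSketch list = new WordListSketch(
                Arrays.asList("503058069", "15151724 der", "14548413 die"));
        System.out.println("der: " + list.relativeFrequency("der"));
    }
}
```

The total in the first line is what turns absolute counts into relative frequencies, which makes word lists from corpora of different sizes comparable.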
You have to specify some parameters before you can add a language from a database. When using LCC corpora, the parameters should look like this:

If a language is already loaded, the tool asks whether you want to overwrite it.
On this panel you can specify the parameters for the database connection. Here you can see the parameters for a connection to a MySQL database:

The parameters explained:
Driver – the JDBC driver for the database connection.
Protocol – specifies the type of database.
Host – the host where the database server is running. Normally this is localhost. If the database server is running on another machine, enter the IP address of the server in this field, e.g. 192.168.13.12.
Port – the port of the database server. Port 3306 is the default port of the MySQL server.
Database – the database you want to connect to.
User – the user name of the database user who has access to the database.
Password – the password of the user.
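The sketch below shows how parameters of this kind are typically combined into a JDBC connection. It is not taken from JLanI: the class DbConnectionSketch is hypothetical, the protocol value "jdbc:mysql" and the database name "mydb" are assumptions, and the exact strings the panel expects may differ.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper, not part of JLanI.
class DbConnectionSketch {

    // Builds a URL of the usual form protocol://host:port/database.
    static String jdbcUrl(String protocol, String host, int port, String database) {
        return protocol + "://" + host + ":" + port + "/" + database;
    }

    // Loads the driver class by name and opens the connection.
    // The driver jar must be on the classpath for this to succeed.
    static Connection open(String driver, String url, String user, String password)
            throws ClassNotFoundException, SQLException {
        Class.forName(driver);
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) {
        // "mydb" is a placeholder database name.
        System.out.println(jdbcUrl("jdbc:mysql", "localhost", 3306, "mydb"));
    }
}
```

If the connection fails, the URL printed by main is a quick way to check host, port and database name against the server settings.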
Open a new console like this

You can start the tool with the following command:
java -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain <inFile> <outFile>
If the tool runs very slowly, you can give it more memory with this command:
java -Xmx500M -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain <inFile> <outFile>
The parameters in detail:
<inFile> – the path of the input file, e.g. c:\input.txt
<outFile> – the path of the output file, e.g. c:\output.txt
java -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain ./examples/jLanI/jLanI_text.txt ./examples/jLanI/output_cli.txt
After you have started the program, wait until it shows “finished”. The output file is written to ./examples/jLanI/output_cli.txt
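The commands above use ";" as the classpath separator, which is the Windows convention; on Linux and macOS the separator is ":". The following sketch assembles the same command with the separator of the current platform. The class CliCommandSketch and its method are hypothetical and only build the argument list; starting the process is left as a comment.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that assembles the documented command line.
class CliCommandSketch {

    // maxHeapMb <= 0 leaves out the -Xmx option.
    static List<String> command(String inFile, String outFile, String sep, int maxHeapMb) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        if (maxHeapMb > 0) {
            cmd.add("-Xmx" + maxHeapMb + "M");
        }
        cmd.add("-classpath");
        cmd.add("." + sep + "./lib/ASV_JLanI.jar");
        // Mirrors the documented command. JDK 9 and later no longer
        // support the java.ext.dirs property, so this targets older JVMs.
        cmd.add("-Djava.ext.dirs=." + sep + "./lib");
        cmd.add("de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain");
        cmd.add(inFile);
        cmd.add(outFile);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = command("./examples/jLanI/jLanI_text.txt",
                "./examples/jLanI/output_cli.txt", File.pathSeparator, 500);
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```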
In the console, type the following command:
java -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain - -
Now wait until all languages are loaded. Then you can type an input like this:
Die erste Ausgabe der von Arwidsson herausgegebenen, kurzlebigen Zeitschrift Abo Morgonblad vom 5. Januar 1821.
When the program has finished, the output is printed to the console.
It is easy to use the JLanI tool in your own program. There are only three classes you need to know: Request, Response and LanIKernel. They are available in the package “de.uni_leipzig.asv.toolbox.jLanI.kernel”.
Here is a short overview of these three classes:
- Request – this class needs some parameters to work. The main parameter is the input text, given as a string.
- LanIKernel – this is the class that identifies the language. It receives a Request object and returns a Response object. The class uses the config file “lanikernel.ini”, which specifies the directory for the word lists. The class is a singleton.
- Response – this is the object returned by the kernel. If everything went well, it contains the results of the language identification.
For example, a Java class can look like this:
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

import de.uni_leipzig.asv.toolbox.jLanI.kernel.DataSourceException;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.LanIKernel;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.Request;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.RequestException;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.Response;

public class JLanITest {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            String sentence = "Site Internet du Musée d'histoire naturelle";
            System.out.println("Sentence :\n" + sentence + "\n");

            // Build the request from the input text and the remaining parameters.
            Set languages = new HashSet();
            int modus = 0;
            boolean reduce = false;
            Request req = new Request(sentence, languages, modus, reduce);

            // LanIKernel is a singleton; evaluate() returns the Response.
            LanIKernel kernel = LanIKernel.getInstance();
            Response res = kernel.evaluate(req);

            // Copy the result and look for the entry with the highest value.
            Hashtable result = new Hashtable(res.getResult());
            Enumeration enumeration = result.keys();
            double finalValue = 0;
            String finalLang = "";
            while (enumeration.hasMoreElements()) {
                Object key = enumeration.nextElement();
                Object value = result.get(key);
                double val = ((Double) value).doubleValue();
                if (val > finalValue) {
                    finalValue = val;
                    finalLang = "" + key;
                }
            }

            System.out.println("\nResult :");
            System.out.println(finalLang + " - " + finalValue);
        } catch (RequestException e) {
            e.printStackTrace();
        } catch (DataSourceException e) {
            e.printStackTrace();
        }
    }
}
If you run the program above, you see an output like this:
Sentence :
Site Internet du Musée d'histoire naturelle
(18.08.2007 11:06:55) LOG : Properties (lanikernel)loaded successfully from FileE:\toolbox_v1.0_08082007\workspace\JLanITest\lanikernel.ini!
DatasourceManager contains:
25 languages:
[ca, tr, no, hu, jp, lv, lt, de, fi, dk, fr, sl, sk, it, so, mt, kr, se, cs, ee, pt, en, gr, es, nl]
122000 words
Result :
fr - 71.73692763612787
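If more than the single best language is of interest, the scores can be ranked. The sketch below is independent of the JLanI classes and works on a plain map from language code to score, which is the shape the sample program suggests for the result; this shape is an assumption. The class ScoreRankingSketch is hypothetical and the scores in main are made-up inputs, not tool output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JLanI.
class ScoreRankingSketch {

    // Returns the entries sorted by score, highest first.
    static List<Map.Entry<String, Double>> ranked(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("en", 10.0);
        scores.put("fr", 71.7);
        scores.put("it", 18.3);
        for (Map.Entry<String, Double> entry : ranked(scores)) {
            System.out.println(entry.getKey() + " - " + entry.getValue());
        }
    }
}
```

The first two entries of the ranked list would correspond to what the database output calls language 1 and language 2, assuming those columns hold the two best candidates.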