Documentation of JLanI


 

Installation

Introduction

The Welcome Panel

The JLanI Panel

The “add new language” Panel

Activate / Deactivate a Language

Delete a Language

Add a language

The DB Connection Panel

Starting the tool in the console with command lines

Run the tool with a given in- and output file:

Example:

Run the tool by typing the input directly in the console:

How you can use the JLanI-Tool in your own program

 

Installation

A description of how to install a module is available on the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.jLanI.main.WortschatzModulPane

 

Introduction

 

The JLanI tool allows you to identify a language on the basis of a text input. This input can be given by a file, database connection or plain text.

Per language, a list of words and their frequency is needed for training. This list can come from a database or a text file.

This tool uses a file called blacklist.txt, which you can find in resources/jlani. The file is empty when you download the tool. JLanI needs this file even if it is empty (so do not delete it), but you can fill it with words that should not be used for language identification. It contains one word per line. For example, you should add names like George, Bush, Angelika, Merkel or USA, Deutschland and abbreviations like A., G. or W. to this file. This will improve the result when the input consists only of words that are not specific to any language.
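The effect of such a blacklist can be sketched as follows. This is only an illustration and not JLanI's actual implementation: the class BlacklistSketch and its methods are hypothetical, and the sketch assumes whitespace tokenization and exact, case-sensitive matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the JLanI API.
class BlacklistSketch {

    // Builds the blacklist from the lines of the file (one word per line).
    // Empty lines are ignored, so an empty file gives an empty set.
    static Set<String> parseBlacklist(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Drops every whitespace-separated token that appears in the blacklist.
    static List<String> filter(String text, Set<String> blacklist) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !blacklist.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // In practice the lines could be read from resources/jlani/blacklist.txt,
        // e.g. with java.nio.file.Files.readAllLines.
        Set<String> blacklist = parseBlacklist(Arrays.asList("George", "Bush", "USA", "A."));
        System.out.println(filter("George Bush besucht die USA", blacklist)); // prints [besucht, die]
    }
}
```

Only the tokens that survive the filter would then be passed on to the language identification.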

The Welcome Panel

 

Here you can find a direct link to this help document. You can also see the current version of the JLanI plugin and its authors.

 


 

The JLanI Panel

 

On this panel you can specify the text input. You have three options to do this:

 

  1. Plain text input – copy and paste the text whose language you want to identify.

  2. File input – choose “File input” in this panel and select a file on your local machine by clicking on the “search” button.

  3. Database input – specify the “Table name”, the “ID” (primary key) and the column containing the text (“Sentence”).

 

    

 

 


 

After you have made your choice of input, you have to specify what kind of output you need. You have 3 options to do this:

 

1.      Text output – select this option; nothing further has to be set.

2.      File output – select this option and specify the filename by clicking on the “search” button.

3.      Database output – select this option and set the parameters (“table name”, “ID”, “sentence”, “language 1”, “Probability1”, “language 2”, “Probability2”).

 

    

If you choose this option, the output will be written to the database. If the table does not exist, the program will create it. The probability values give the confidence (in percent) for language 1 and language 2. When using database input, the IDs in the input and output tables will match.

 

Attention: if this option is chosen, it can take a couple of minutes before the program finishes, especially for large databases.

 

 

After you have specified the input and the output, click on the start button. The language of the given input will now be identified sentence by sentence.

 

 

If an error occurs, please go to the “DB Connection Panel” and check the parameters.

 

The “add new language” Panel

 

Here you can add a new language to the tool. The input for the language you want to add can be given by a text file or from a database connection.

 

On the left side you can see the currently available languages.

 

    

 

Activate / Deactivate a Language

 

By checking or unchecking the active flag, you can activate or deactivate a language.

 

Delete a Language

 

If you want to remove a language, click on it in the list (the selected language is shown with a blue background). Then click on the delete button. The tool will ask whether you really want to delete the language. If you are sure, click on the “ok” button.

 

Add a language

 

  1. Adding all languages available on the database

 

Just click on the “load all languages” button. All available languages will be loaded from the disk. If an error occurs, go to the “DB Connection” panel and check the parameters.

 

If the button is disabled, all languages are already loaded.

 

 

  2. Adding from a text file

 

Click on the “add from file” button and browse to the file which contains the language you want to add.

 

 

The file with the language should look like this:

 

503058069
15151724 der
14548413 die
10698711 und
7404753 in
5761988 den
4562084 von
4248046 das
4074500 mit
4007442 zu

The first line specifies how many words (tokens) the underlying corpus contains in total. After this line, the words are listed. For each word you have to specify two parameters: the first parameter is the absolute frequency of the word, the second parameter is the word itself.
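A file in this format can be read with a few lines of Java. The following is a minimal sketch, assuming well-formed input in which frequency and word are separated by whitespace; the class WordlistSketch is hypothetical and not part of JLanI. Dividing by the total token count to obtain relative frequencies is one plausible use of the first line, not a confirmed detail of the tool.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the JLanI API.
class WordlistSketch {

    // Parses the lines of a word list file and returns word -> relative frequency.
    static Map<String, Double> relativeFrequencies(List<String> lines) {
        // First line: total number of tokens in the underlying corpus.
        long totalTokens = Long.parseLong(lines.get(0).trim());
        Map<String, Double> result = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Each entry: absolute frequency, whitespace, then the word itself.
            String[] parts = trimmed.split("\\s+", 2);
            long frequency = Long.parseLong(parts[0]);
            result.put(parts[1], (double) frequency / totalTokens);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "503058069", "15151724 der", "14548413 die", "10698711 und");
        for (Map.Entry<String, Double> entry : relativeFrequencies(lines).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```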

 

  3. Adding from a database

You have to specify some parameters before you can add a language from a database. When using LCC corpora, the parameters should look like this:

 

         

 

If a language is already loaded, the tool will ask you if you want to overwrite it.


The DB Connection Panel

 

On this panel you can specify the parameters for the connection to the database. Here you can see the parameters for a connection to a MySQL database:

 

 

 

The parameters explained:

Driver – the JDBC driver for the database connection.

Protocol – specifies the type of database.

Host – the host where the database server is running. Normally this is localhost. If the database server is running on another machine, type the IP address of the server in this field, e.g. 192.168.13.12

 

Port – the port of the database server. Port 3306 is the default port of the MySQL server.

 

Database – the database you want to connect to

 

User – the username of the db user who has access to the database

 

Password – the password of the user
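For orientation, the fields above correspond to the parts of a typical JDBC connection URL. The sketch below shows how they are commonly combined; the class DbUrlSketch, the protocol value "jdbc:mysql" and the database name "mydb" are assumptions for illustration, and JLanI may assemble its URL differently.

```java
// Hypothetical illustration; not part of the JLanI API.
class DbUrlSketch {

    // Combines the panel fields into a URL of the form protocol://host:port/database.
    static String buildUrl(String protocol, String host, int port, String database) {
        return protocol + "://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Example values only; "jdbc:mysql" and "mydb" are assumptions.
        String url = buildUrl("jdbc:mysql", "localhost", 3306, "mydb");
        System.out.println(url); // prints jdbc:mysql://localhost:3306/mydb
        // With a MySQL JDBC driver on the classpath, a connection could then be opened
        // with java.sql.DriverManager.getConnection(url, user, password).
    }
}
```

If the connection fails, comparing the values in the panel against a URL built this way can help to spot a wrong host, port or database name.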

 


Starting the tool in the console with command lines

 

Open a new console.

Run the tool with a given in- and output file:

You can start the tool with the following command:

 

java -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain <inFile> <outFile>

 

If the tool runs very slowly, you can start it with more memory using this command:

 

java -Xmx500M -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain <inFile> <outFile>

 

The parameters in detail:

 

<inFile> here you have to specify the path of the input file, e.g. c:\input.txt

<outFile> here you have to specify the path of the output file, e.g. c:\output.txt

 

Example:

 

  java -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib

      de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain

      ./examples/jLanI/jLanI_text.txt ./examples/jLanI/output_cli.txt

 

After you have started the program, wait until it shows “finished”. The output will be written to the file ./examples/jLanI/output_cli.txt
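Note that the commands above use “;” as the classpath separator, which is the Windows convention; on Unix-like systems the separator is “:”. The following sketch builds the same command line in a platform-independent way. The class CliCommandSketch is hypothetical and not part of the JLanI distribution, and the sketch assumes it is run from the toolbox directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the JLanI distribution.
class CliCommandSketch {

    // Builds the command line from this section with the given path separator.
    static List<String> buildCommand(String separator, String inFile, String outFile) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-classpath");
        command.add("." + separator + "./lib/ASV_JLanI.jar");
        command.add("-Djava.ext.dirs=." + separator + "./lib");
        command.add("de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain");
        command.add(inFile);
        command.add(outFile);
        return command;
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        List<String> command = buildCommand(File.pathSeparator,
                "./examples/jLanI/jLanI_text.txt", "./examples/jLanI/output_cli.txt");
        System.out.println(String.join(" ", command));
        // To actually start the tool from the toolbox directory:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```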

 

Run the tool by typing the input directly in the console:

In the console, type in the following command:

 

  java -classpath .;./lib/ASV_JLanI.jar -Djava.ext.dirs=.;./lib

      de.uni_leipzig.asv.toolbox.jLanI.main.CLIMain - -

 

Now wait until all languages are loaded. Once this is done, you can type in an input like this:

 

Die erste Ausgabe der von Arwidsson herausgegebenen, kurzlebigen Zeitschrift Abo Morgonblad vom 5. Januar 1821.

 

When the program has finished, the output will be printed to the console.

 

How you can use the JLanI-Tool in your own program

 

It is easy to use the JLanI Tool in your own program. There are only three classes you have to know. These classes (Request, Response and LanIKernel) are available in the package “de.uni_leipzig.asv.toolbox.jLanI.kernel”.

 

Here is a short overview of these three classes:

 

-         Request – this class needs some parameters to work. The main parameter is the input text, given as a string.

-         LanIKernel – this is the class that identifies the language. It receives a Request object and returns a Response object. The class uses the config file “lanikernel.ini”. In this config file the directory for the wordlist is specified. Also, the class works as a singleton.

-         Response – this is the object returned by the kernel. If everything is ok, the results of the language identification are found in this object.

 

 

For example, a Java class can look like this:

 

import java.util.Enumeration;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

import de.uni_leipzig.asv.toolbox.jLanI.kernel.DataSourceException;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.LanIKernel;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.Request;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.RequestException;
import de.uni_leipzig.asv.toolbox.jLanI.kernel.Response;

public class JLanITest {

      public static void main(String[] args) {
            try {
                  String sentence = "Site Internet du Musée d'histoire naturelle";
                  System.out.println("Sentence :\n" + sentence + "\n");

                  // Build the request: input text, set of languages, modus and reduce flag.
                  Set languages = new HashSet();
                  int modus = 0;
                  boolean reduce = false;
                  Request req = new Request(sentence, languages, modus, reduce);

                  // LanIKernel is a singleton; it evaluates the request and returns a response.
                  LanIKernel kernel = LanIKernel.getInstance();
                  Response res = kernel.evaluate(req);
                  Hashtable result = new Hashtable(res.getResult());

                  // Find the language with the highest value.
                  Enumeration enumeration = result.keys();
                  double finalValue = 0;
                  String finalLang = "";
                  while (enumeration.hasMoreElements()) {
                        Object key = enumeration.nextElement();
                        Object value = result.get(key);
                        double val = ((Double) value).doubleValue();
                        if (val > finalValue) {
                              finalValue = val;
                              finalLang = "" + key;
                        }
                  }

                  System.out.println("\nResult :");
                  System.out.println(finalLang + " - " + finalValue);
            } catch (RequestException e) {
                  e.printStackTrace();
            } catch (DataSourceException e) {
                  e.printStackTrace();
            }
      }
}

 

If you run the program above, you will see output like this:

 

Sentence :

Site Internet du Musée d'histoire naturelle

 

(18.08.2007 11:06:55) LOG : Properties (lanikernel)loaded successfully from FileE:\toolbox_v1.0_08082007\workspace\JLanITest\lanikernel.ini!

 

DatasourceManager contains:

  25 languages: [ca, tr, no, hu, jp, lv, lt, de, fi, dk, fr, sl, sk, it, so, mt, kr, se, cs, ee, pt, en, gr, es, nl]

  122000 words

 

Result :

fr - 71.73692763612787
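The loop in the example program keeps only the best language. If the runner-up is needed as well (as in the “language 2” and “Probability2” columns of the database output), the result can be sorted by value instead. The following is a minimal sketch, assuming the result can be treated as a map from language code to Double, as the cast in the example program suggests. The class RankingSketch is hypothetical and the scores are invented for demonstration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the JLanI API.
class RankingSketch {

    // Returns the entries sorted by descending score.
    static List<Map.Entry<String, Double>> rank(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        // Invented scores; a real map would come from res.getResult().
        Map<String, Double> scores = new HashMap<>();
        scores.put("fr", 70.0);
        scores.put("it", 20.0);
        scores.put("es", 10.0);

        List<Map.Entry<String, Double>> ranked = rank(scores);
        System.out.println(ranked.get(0).getKey() + " - " + ranked.get(0).getValue()); // fr - 70.0
        System.out.println(ranked.get(1).getKey() + " - " + ranked.get(1).getValue()); // it - 20.0
    }
}
```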

 
