Documentation of Levenshtein


 

Installation

Introduction

How to use the Gui Version

Check the spelling of a word or find prosecutions:

Train a DAWG from database:

Train a DAWG from file:

How to use the Command Line Version

Command:

Options:

Examples:

How to use this tool in your own program

Classes and Methods:

Example:

 

Installation

A description of how to install a module can be found on the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.levenshtein.LevenshteinModul

Introduction

This tool verifies the spelling of a word and finds prosecutions for it, that is, words that continue the entered word (for example halfway for half). It uses directed acyclic word graphs, so-called DAWGs.
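
To make the idea of a word graph more concrete, here is a minimal sketch of a trie, a simpler relative of a DAWG (a DAWG additionally shares common word endings to save space). It is not the tool's implementation. The class WordTrieSketch and its methods are hypothetical names chosen for this illustration, and the reading of "prosecutions" as longer words that start with the entered word is inferred from the example output at the end of this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the ASV toolbox.
public class WordTrieSketch {

    // One node per character; isWord marks the end of a stored word.
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Stores a word by walking or creating one node per character.
    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Spell check: true only if exactly this word was stored.
    public boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    // All stored words that start with the prefix and are longer than it.
    public List<String> continuations(String prefix) {
        List<String> result = new ArrayList<>();
        Node start = find(prefix);
        if (start != null) {
            for (Map.Entry<Character, Node> e : start.children.entrySet()) {
                collect(e.getValue(), prefix + e.getKey(), result);
            }
        }
        return result;
    }

    // Follows the characters of s from the root; null if the path ends early.
    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    // Depth-first walk that gathers every stored word below the node.
    private void collect(Node node, String word, List<String> out) {
        if (node.isWord) {
            out.add(word);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            collect(e.getValue(), word + e.getKey(), out);
        }
    }

    public static void main(String[] args) {
        WordTrieSketch trie = new WordTrieSketch();
        for (String w : new String[] {"half", "halfway", "halftime", "hall"}) {
            trie.add(w);
        }
        System.out.println(trie.contains("half"));
        System.out.println(trie.continuations("half"));
    }
}
```

The sketch keeps the children in a TreeMap so that the continuations come out in a stable, sorted order.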

How to use the Gui Version

Check the spelling of a word or find prosecutions:

Before you can start you have to make 2 simple settings. The first one is to choose a DAWG. For this you have 2 options: use an integrated DAWG from the drop-down menu (see figure 1) or choose your own DAWG from a file (see figure 2).

This is the drop-down menu for choosing an integrated DAWG for the language you want to use.

 
choose a predefined DAWG

figure 1

Click this button to open the file dialog below.

Select a DAWG file and open it. The DAWG is then loaded from this file. When loading has finished, “own DAWG” is selected in the drop-down menu for the language.

 
choose DAWG from file

figure 2

The second setting is to select a distance (see figure 3). The distance determines how dissimilar a word may be from your word and still be listed in the Did you mean text area on the panel.

 

 

Use the arrow buttons to increase or decrease the distance.

For example: a distance of 3 means that every word in the Did you mean text area differs from the word you entered in at most 3 positions.

 
select a distance

figure 3
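
As an illustration of what such a distance measures, the following sketch computes the standard Levenshtein distance: the smallest number of single-character insertions, deletions and substitutions that turn one word into another. The class name LevenshteinDistanceSketch is hypothetical, and it is an assumption that the tool counts differences in exactly this way. The tool itself searches a DAWG rather than comparing word pairs one by one.

```java
// Hypothetical illustration; not part of the ASV toolbox.
public class LevenshteinDistanceSketch {

    // Classic dynamic-programming computation that keeps only two table rows.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty word takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            // The finished row becomes the previous row of the next round.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("half", "hall"));
        System.out.println(distance("half", "behalf"));
        System.out.println(distance("half", "shelf"));
    }
}
```

The comparison is case sensitive, so "half" and "Half" are one substitution apart.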

 

Now you can enter your word in the text field at the top of the panel. Press Enter or use the go button to start the spell check and the search for prosecutions (see figure 4).

Enter the word here. Press the Enter key on your keyboard or use the go button to start.

 

Here you find the result of the spell checking.

 

Here you find all prosecutions found for your word.

 

Use this button to save all words from the spell check to a file. The file has the following format:

first line: the word you entered

following lines: one word from the spell check per line

Use this button to save all prosecutions to a file. The file has the following format:

first line: the word you entered

following lines: one prosecution per line

 
spell checking and prosecutions finding

figure 4
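
The file format described above (the entered word on the first line, then one result per line) is easy to read back in your own code. The following is a minimal sketch under assumptions: the class ResultFileSketch is a hypothetical name, and the UTF-8 encoding is a guess, because this document does not say which encoding the tool writes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not part of the ASV toolbox.
public class ResultFileSketch {

    public final String word;          // first line: the word that was entered
    public final List<String> results; // following lines: one result per line

    private ResultFileSketch(String word, List<String> results) {
        this.word = word;
        this.results = Collections.unmodifiableList(results);
    }

    // Splits the lines of a saved file into the entered word and its results.
    public static ResultFileSketch parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("The file must contain at least the entered word");
        }
        List<String> results = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) { // skip blank lines, for example a trailing one
                results.add(trimmed);
            }
        }
        return new ResultFileSketch(lines.get(0).trim(), results);
    }

    // Reads a saved file; UTF-8 is an assumption, adjust it to your setup.
    public static ResultFileSketch read(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

A call such as ResultFileSketch.read(Paths.get("prosecutions.txt")) would then give you the entered word and the list of results, where the file name is only a placeholder.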

 

 

Train a DAWG from database:

You find this option on the Options panel. Fill in all text fields with the correct data and click on connect to Database. All tables and their columns are then loaded into the tool (see figure 5).

Choose the tables and columns you need for the DAWG. The table should contain at least 2 columns: one with the words and one with the IDs of the words.

 

Click on this button to connect to your database and load the tables and column names for this tool.

 

Fill in the text fields (the first 7 fields, from the Driver Class to the Database) with the correct information for the database you want to use.

 
configure the database settings

figure 5

 

Before you start the training, choose how many words you want to use (see figure 6).

 

 

Id of the first word which should be used in the DAWG.

 

Number of words which should be used in the DAWG.

 
number of words

figure 6
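
To show how the database settings, the first word ID and the number of words could fit together, here is a sketch that builds a JDBC URL and a query for the word column. It is an assumption, not the tool's actual code: the class WordQuerySketch, its method names, and the way the first ID and the word count are turned into a WHERE and LIMIT clause are all hypothetical. The values in main are taken from the command line examples in this document.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the ASV toolbox.
public class WordQuerySketch {

    // Combines protocol, host, port and database name into a JDBC URL.
    public static String jdbcUrl(String protocol, String host, int port, String database) {
        return "jdbc:" + protocol + "://" + host + ":" + port + "/" + database;
    }

    // Builds the query text. Table and column names cannot be bound as
    // parameters, so they are checked against a conservative pattern first.
    public static String selectWordsSql(String table, String wordColumn, String idColumn) {
        for (String name : new String[] {table, wordColumn, idColumn}) {
            if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                throw new IllegalArgumentException("Unexpected identifier: " + name);
            }
        }
        return "SELECT " + wordColumn + " FROM " + table
                + " WHERE " + idColumn + " >= ? ORDER BY " + idColumn + " LIMIT ?";
    }

    // Loads up to count words, starting at the word with the ID firstId.
    // Assumes a MySQL-style LIMIT clause and a JDBC driver on the class path.
    public static List<String> loadWords(String url, String user, String password,
                                         String sql, int firstId, int count) throws SQLException {
        List<String> words = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(url, user, password);
             PreparedStatement st = con.prepareStatement(sql)) {
            st.setInt(1, firstId);
            st.setInt(2, count);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    words.add(rs.getString(1));
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Only prints the URL and the query; no database connection is opened here.
        System.out.println(jdbcUrl("mysql", "localhost", 3306, "de1M"));
        System.out.println(selectWordsSql("words", "word", "w_id"));
    }
}
```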

 

Now start the training by clicking the Load Words.. button (see figure 7).

Load Words.. button. Click it and the save dialog opens, in which you choose the file for the trained DAWG. The button is disabled until the DAWG has been saved.

 

Save dialog for choosing the file for the trained DAWG. After choosing the file the training will begin.

 
train and save DAWG

figure 7

Train a DAWG from file:

To train a DAWG from a file you need a word list with one word per line, as in figure 8.

file with the right format

figure 8
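
If your words come from another source, a training file with one word per line can be written with a few lines of Java. This is a minimal sketch under assumptions: the class WordListSketch is a hypothetical name, and trimming, dropping duplicates and the UTF-8 encoding are choices made for this example, not requirements stated in this document.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the ASV toolbox.
public class WordListSketch {

    // Trims each word and drops empty entries and duplicates, keeping the original order.
    public static List<String> normalize(Collection<String> words) {
        Set<String> unique = new LinkedHashSet<>();
        for (String word : words) {
            if (word == null) {
                continue;
            }
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                unique.add(trimmed);
            }
        }
        return new ArrayList<>(unique);
    }

    // Writes one word per line; UTF-8 is an assumption, adjust it to your setup.
    public static void write(Path file, Collection<String> words) throws IOException {
        Files.write(file, normalize(words), StandardCharsets.UTF_8);
    }
}
```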

To start the training, click the Choose file.. button on the Options panel and choose the training file (see figure 9).

Choose file.. button to open the file dialog below. It is disabled until the DAWG has been saved.

 

Select the training file and open it.

 
choose the training file

figure 9

 

In the second file dialog, which opens after you have chosen the training file, choose the file for the trained DAWG. After you press save DAWG, the training of the DAWG begins (see figure 10).

After you click this button, the training of the DAWG begins.

 
save the dawg

figure 10

 

How to use the Command Line Version

Command:

To start the command line version of this tool, use the following command (the ; in the class path is the Windows path separator; on Linux or macOS use : instead):
java -Xmx500M -classpath .;./lib/ASV_Levenshtein.jar -Djava.ext.dirs=.;.lib de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein option ...

Options:

-? Print this information.
-C Create a word graph from a file (-i) or from a database.
-g Start GUI mode.
-w Specify the word to check (use with -f).
-f Specify the DAWG file to use.
-i Specify the file containing the words (use with -C).
-o Save the output to the specified file.
-l Levenshtein distance to use (default is 1).
-D Specify the driver (default is com.mysql.jdbc.Driver).
-P Specify the protocol (default is mysql).
-h Specify the database host (default is localhost).
-x Set the port to use (default is 3306).
-d Specify the database.
-u Database user name.
-p The user's password.
-t Specify the table.
-W Specify the table column that contains the words.
-c Specify the table column that contains the word IDs.
-I Specify the lowest word ID (default is 101).
-O Specify the number of words (default is 2000).

Examples:

· java -Xmx500M -classpath .;./lib/ASV_Levenshtein.jar -Djava.ext.dirs=.;.lib de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein -C -D com.mysql.jdbc.Driver -P mysql -h localhost -x 3306 -d de1M -u root -p root -t words -W word -c w_id -I 101 -O 5000 -o ./examples/test.dawg

· java -Xmx500M -classpath .;./lib/ASV_Levenshtein.jar -Djava.ext.dirs=.;.lib de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein -C -i ./resources/levenshtein/plain/wordlist_de.txt -o ./examples/de_cli.dawg

· (needs the DAWG that was built in the second example) java -Xmx500M -classpath .;./lib/ASV_Levenshtein.jar -Djava.ext.dirs=.;.lib de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein -w Baum -f ./examples/de_cli.dawg

· (needs the DAWG that was built in the second example) java -Xmx500M -classpath .;./lib/ASV_Levenshtein.jar -Djava.ext.dirs=.;.lib de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein -w Baum -f ./examples/de_cli.dawg -o ./examples/Levenshtein_CLOutput_Baum.txt -l 3
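
If you want to start the command line version from another Java program, the arguments from the examples above can be assembled as a list and passed to ProcessBuilder. The following is only a sketch: the class CommandLineSketch and its methods are hypothetical names, the JVM options are copied verbatim from the examples, and File.pathSeparator is used so that the class path separator fits the operating system.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the ASV toolbox.
public class CommandLineSketch {

    // Builds the command for checking one word against a DAWG file.
    public static List<String> checkCommand(String word, String dawgFile, int distance) {
        String sep = File.pathSeparator; // ";" on Windows, ":" on Linux and macOS
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx500M");
        cmd.add("-classpath");
        cmd.add("." + sep + "./lib/ASV_Levenshtein.jar");
        cmd.add("-Djava.ext.dirs=." + sep + ".lib"); // copied verbatim from the examples above
        cmd.add("de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein");
        cmd.add("-w");
        cmd.add(word);
        cmd.add("-f");
        cmd.add(dawgFile);
        cmd.add("-l");
        cmd.add(String.valueOf(distance));
        return cmd;
    }

    // Runs the command, forwards its output to this process and returns the exit code.
    public static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) to execute it.
        System.out.println(String.join(" ", checkCommand("Baum", "./examples/de_cli.dawg", 3)));
    }
}
```

Passing each argument as a separate list element avoids quoting problems with the shell, which matters here because ; has a special meaning in most Unix shells.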

 

 

How to use this tool in your own program

Classes and Methods:

It is easy to use Levenshtein in your own program. You only need the 3 classes Levenshtein, Dawg and DawgFactory, which you find in the package de.uni_leipzig.asv.toolbox.levenshtein.

Class: Dawg

Description: This class represents the DAWG. You have to create an instance of this class using the class DawgFactory.

Class: DawgFactory

Description: This class creates an instance of the class Dawg. For this, use the method LoadGraph(String filename), which takes as its parameter a string with the path to the file containing the serialized DAWG.

Class: Levenshtein

Description: This class provides the algorithms to find alternatives and prosecutions. There are 2 methods and 1 attribute you need to know:

public static HashMap WordGraphs: the hash map that holds the Dawg instances to work with. Use put(String key, Dawg dawg) to store an instance of Dawg under the key in the hash map.

Vector<String> FindAlternatives(String dawg, String w, int dis, boolean doSort): this method returns all alternatives that were found for the word w within the distance dis in the DAWG stored under the key dawg in the hash map WordGraphs. If doSort is true, the alternatives in the vector are in alphabetical order.

Vector<String> FindProsecutions(String dawg, String w): this method returns all prosecutions that were found for the word w in the DAWG stored under the key dawg in the hash map WordGraphs.

 

Example:

Here is an example of a Java class (LevenshteinTest.java) that uses the Levenshtein tool. You can find the class LevenshteinTest in the package de.uni_leipzig.asv.toolbox.tests.

 

package de.uni_leipzig.asv.toolbox.tests;

import java.util.Vector;

import de.uni_leipzig.asv.toolbox.levenshtein.Dawg;
import de.uni_leipzig.asv.toolbox.levenshtein.DawgFactory;
import de.uni_leipzig.asv.toolbox.levenshtein.Levenshtein;

public class LevenshteinTest {

    public static void main(String[] args) {
        // path to the file with the serialized DAWG
        String dawgFile = "./resources/levenshtein/top50000en.dawg";

        // load the DAWG from the file
        Dawg graph = DawgFactory.LoadGraph(dawgFile);

        // register the graph in Levenshtein under the key dawgFile
        Levenshtein.WordGraphs.put(dawgFile, graph);

        // word to check
        String word = "half";

        // maximum distance for alternatives
        int distance = 2;

        // calculate alternatives (sorted) and prosecutions
        Vector<String> alternatives = Levenshtein.FindAlternatives(dawgFile, word, distance, true);
        Vector<String> prosecutions = Levenshtein.FindProsecutions(dawgFile, word);

        // print the word, then all alternatives and all prosecutions on one line each
        System.out.println("word:\n" + word);
        System.out.println("alternatives:");
        for (int i = 0; i < alternatives.size(); i++) {
            System.out.print(alternatives.get(i) + " ");
        }
        System.out.println("\nprosecutions:");
        for (int i = 0; i < prosecutions.size(); i++) {
            System.out.print(prosecutions.get(i) + " ");
        }
    }
}

 

When you run this test, you see the output below.

Loading graph. This may take a while..

word:

half

alternatives:

half Half calf halo hall halt Zale Val Vale Valu Khalq Khalaf Kalb Ghali Gal Gale Gala Gall Gulf Golf Lal Elf Rolf Daf Dal Dali Dale Daly Pal Palo Pall Palm Falk Fall Cal Calfa Call Cali Calif Chalk Sal Salt Salk Sale Self Shale Shall Shelf Yale Wolf Whale Wharf Wald Walk Walt Wall Nall pal palm pale pals pall Alf gulf gale gala gall golf Hal Hale Hall chalk calm call Bali Bala Ball mall malt male Maly Male Mali Mall hulk hull holy hole hold hilt hill ha hajj hawk hack haul hauls ham hams hat hats hate haze hazy hair hail hails hang hand harp harm hard halls halve halts had hadn have has hasn hash hay heal held helm hell help bale balm bald balk ball behalf whale wharf wolf walk wall Tal Tale Talb Tall Talk Taif fall self shale shall shelf salt sale al ale all tall tale talk

prosecutions:

halfway halftime half-way half-year half-time half-empty half-staff half-inch half-interest half-brother half-point half-price half-mile half-million half-day half-dozen half-hour half-hearted half-century

 

 

 
