Documentation of Namerec


 

Installation

Introduction

How to use the GUI Version

Loading and saving a configuration:

Configuring the tool yourself:

Starting the tool:

How to use the Command Line Version

Command:

Options:

Examples:

How to use Namerec in your own program

Classes and Methods

Example:

 

Installation

A description of how to install a module is available on the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.namerec.gui.RecognizerPanel

Introduction

This tool tries to recognize names in sentences. It needs some initially given names (a gazetteer) and some rules for learning new names.

How to use the GUI Version

Before you start the tool, you have to configure it. You can either configure it yourself or load an existing configuration.

Loading and saving a configuration:

Choose the File Management panel. At the bottom you will find two buttons, one for loading configurations and one for saving them (see figure 1).

This button saves a configuration to a file. Clicking it opens a file save dialog in which you can save the current configuration to a file in any directory.

This button loads a configuration file. Clicking it opens a file open dialog in which you can choose the configuration file from any directory.

 
Configuration files: loading and saving

figure 1

 

 

New Items: This file contains all items found in your text that were unknown up to that point, together with their classification. An entry looks like this: Pauli NN

(Pauli = item, NN = classification of Pauli)
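
As a rough illustration of this entry format, the following sketch reads such a line back into its two parts. It assumes that the classification is the last whitespace-separated token on the line and that everything before it is the item. The class and method names (NewItemEntry, parse) are hypothetical and not part of Namerec.

```java
// Hypothetical helper, not part of Namerec: splits a "New Items" line
// such as "Pauli NN" into the item and its classification.
final class NewItemEntry {
    final String item;
    final String classification;

    private NewItemEntry(String item, String classification) {
        this.item = item;
        this.classification = classification;
    }

    // Assumption: the classification is the last whitespace-separated token.
    static NewItemEntry parse(String line) {
        String trimmed = line.trim();
        int split = -1;
        for (int i = trimmed.length() - 1; i >= 0; i--) {
            if (Character.isWhitespace(trimmed.charAt(i))) {
                split = i;
                break;
            }
        }
        if (split < 0) {
            throw new IllegalArgumentException("No classification in: " + line);
        }
        return new NewItemEntry(trimmed.substring(0, split).trim(),
                                trimmed.substring(split + 1));
    }

    public static void main(String[] args) {
        NewItemEntry e = parse("Pauli NN");
        System.out.println(e.item + " -> " + e.classification); // prints "Pauli -> NN"
    }
}
```

Taking the last token as the classification keeps the sketch usable for items that consist of more than one word, should the file contain such entries.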

 
Configuring the tool yourself:

There are six panels for configuring the tool. The first one is the File Management panel. Here you can configure which output files you want and where to save them (see figure 2).

 

All complex names that were found.

Items that were found but occurred too rarely, or that were too rarely given the same classification.

Rule Context: specifies why an item was classified.

 
configure output files

figure 2

Log file for Namerec: contains information about everything Namerec does.

The second panel is the Parameters and Settings panel. Here you can specify the parameters for the algorithm and some settings for the database input and the usage of the internal tokenizer (see figure 3).

Field for the version id.

 
 


Here you can decide whether to use the internal tokenizer to tokenize your text and whether to replace all numbers with %N%.

 

Database configuration: choose the first and the last id of the sentences you want to analyse with the tool.

 
Parameters and Settings

Here you can specify the number of verification threads, the number of sentences used for verification, the threshold for accepting items, and the number of sentences between time estimates.

 
figure 3
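
To make the number replacement option more concrete, here is a minimal sketch of what replacing numbers with %N% could look like. The regular expression is an assumption (runs of digits, optionally with "." or "," between digit groups); Namerec's internal tokenizer may treat numbers differently. The class name NumberReplacer is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Namerec's tokenizer: replaces numbers with the %N% placeholder.
final class NumberReplacer {
    // Assumption: a number is a run of digits, optionally with '.' or ',' between digit groups.
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)*");

    static String replaceNumbers(String sentence) {
        return NUMBER.matcher(sentence).replaceAll("%N%");
    }

    public static void main(String[] args) {
        // prints "He paid %N% euros in %N%."
        System.out.println(replaceNumbers("He paid 1,500 euros in 2004."));
    }
}
```

The sentence-final period survives because a separator only counts as part of a number when another digit follows it.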

 

 

 

 

The next panel is the Tag System panel. Here you can configure the tag encoding and the regular expression tagging (see figure 4).

 

Add a new entry to the table by filling out the fields.

 

Content of the table which will be used by the algorithm.

 

Button for saving the content of the table to file.

 

Delete the complete content of the table.

 

Button for loading the content of the table from file.

 
Tag System

Button to delete the selected entry of the table.

 

Auto Fill button for the tag encoding.

 
figure 4

 

 

The next panel is the Rules and Patterns panel (see figure 5). It works like the Tag System panel.

Patterns for finding names.

Rules for classifying unknown words in a context.

 
Rules and Patterns

figure 5
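
The idea behind a pattern for finding names can be sketched as follows. The example output at the end of this document reports the pattern "VN ZN NN" for the name "Osama Bin Laden", so the sketch treats a pattern as a sequence of tags and looks for token runs carrying exactly those tags. This is only an illustration under that assumption, not Namerec's actual matching code; the class and method names are hypothetical, and "?" stands for tokens whose tags are not shown in this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Namerec's implementation: matches a tag sequence
// such as "VN ZN NN" against the tags of a tokenized sentence.
final class TagPatternSketch {
    // Returns every run of tokens whose tags equal the pattern's tag sequence.
    static List<String> findMatches(String[] tokens, String[] tags, String pattern) {
        String[] wanted = pattern.trim().split("\\s+");
        List<String> matches = new ArrayList<String>();
        for (int start = 0; start + wanted.length <= tags.length; start++) {
            boolean ok = true;
            for (int k = 0; k < wanted.length; k++) {
                if (!wanted[k].equals(tags[start + k])) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                String[] run = Arrays.copyOfRange(tokens, start, start + wanted.length);
                matches.add(String.join(" ", run));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] tokens = {"Osama", "Bin", "Laden", "ist", "gesucht", "."};
        // VN, ZN, NN and PU come from the example output; "?" marks tags not shown there.
        String[] tags = {"VN", "ZN", "NN", "?", "?", "PU"};
        System.out.println(findMatches(tokens, tags, "VN ZN NN")); // prints "[Osama Bin Laden]"
    }
}
```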

 

The next panel is the Known names panel. Here you can list all names that are already known (see figure 6). It works like the Tag System panel.

 

All known names.

 
Known names

figure 6

 

The last panel you have to configure is the Database Settings panel (see figure 7).

Switch to turn the write back to the database on or off. If it is on, no other output is possible.

 

Database settings for database input and output.

 

Table for database input.

 

Output table for write back result.

 
Database Settings

figure 7

Switch to turn the verification by database on or off.

Database settings for verification: requires a table with words, a table with sentences, and a table that connects the two via their ids.

Starting the tool:

On the Run panel you can start the tool (see figure 8).

 

Text area for output result.

 

Select this to write the output to a file.

Stop button to stop the algorithm while it is running.

Select this to run NE recognition.

Field for entering a sentence.

 

Button to choose a file for input.

 

Button to start the algorithm from sentence.

 

Button to start the algorithm from file.

 

Button to start the algorithm from database.

 
Run

figure 8

 

How to use the Command Line Version

Command:

To start the command line version of this tool, use the following command:
java -Xmx500M -classpath .;./lib/ASV_Namerec.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.namerec.Recognizer configfile [-t -rn] [-o outfile] db|file|sentence [filename|sentence]

 

Options:

configfile - path to a configuration file of this tool containing the settings for this run
-t - use the tokenizer
-rn - replace numbers
-o outfile - write the output to the file outfile; if not specified, the output is written to the console
db - use sentences from the database for the run (configured in configfile)
file - use sentences from the file filename for the run (the filename must follow, separated by a space)
sentence - use the specified sentence for the run (the sentence must follow, separated by a space)
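
The layout of the arguments can be sketched with a small parser. This is a hypothetical illustration of the argument order given in the command above, not Namerec's own parsing code. The class name NamerecArgsSketch and its fields are made up for this example, and joining the remaining arguments into one sentence is an assumption.

```java
import java.util.Arrays;

// Hypothetical sketch, not Namerec's own parsing code: reads arguments laid out as
//   configfile [-t -rn] [-o outfile] db|file|sentence [filename|sentence]
final class NamerecArgsSketch {
    String configFile;
    boolean tokenize;       // -t
    boolean replaceNumbers; // -rn
    String outFile;         // -o outfile; null means "write to console"
    String mode;            // "db", "file" or "sentence"
    String input;           // filename or sentence; null for "db"

    static NamerecArgsSketch parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "Expected: configfile [options] db|file|sentence [input]");
        }
        NamerecArgsSketch a = new NamerecArgsSketch();
        a.configFile = args[0];
        int i = 1;
        while (i < args.length && args[i].startsWith("-")) {
            String opt = args[i++];
            if (opt.equals("-t")) {
                a.tokenize = true;
            } else if (opt.equals("-rn")) {
                a.replaceNumbers = true;
            } else if (opt.equals("-o") && i < args.length) {
                a.outFile = args[i++];
            } else {
                throw new IllegalArgumentException("Unknown or incomplete option: " + opt);
            }
        }
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing input mode");
        }
        a.mode = args[i++];
        if (a.mode.equals("db")) {
            return a;
        }
        if (!a.mode.equals("file") && !a.mode.equals("sentence")) {
            throw new IllegalArgumentException("Unknown input mode: " + a.mode);
        }
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing input after " + a.mode);
        }
        // Assumption: an unquoted sentence arrives as several arguments, so join the rest.
        a.input = String.join(" ", Arrays.copyOfRange(args, i, args.length));
        return a;
    }

    public static void main(String[] args) {
        NamerecArgsSketch a = parse(new String[] {
            "./config/namerec/NameRec_noWriteback.cfg", "-o", "out.txt",
            "sentence", "George", "Bush", "ist", "gesucht." });
        System.out.println(a.mode + ": " + a.input);
    }
}
```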

Examples:

·        Run Namerec with configuration ./config/namerec/NameRec_noWriteback.cfg with db input and output to file ./examples/NameRec/namerecdb.txt
java -Xmx500M -classpath .;./lib/ASV_Namerec.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.namerec.Recognizer ./config/namerec/NameRec_noWriteback.cfg -o ./examples/NameRec/namerecdb.txt db

·        Run Namerec with configuration ./config/namerec/NameRec_noWriteback.cfg with sentence input and output to file ./examples/NameRec/namerecdb.txt
java -Xmx500M -classpath .;./lib/ASV_Namerec.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.namerec.Recognizer ./config/namerec/NameRec_noWriteback.cfg -o ./examples/NameRec/namerecdb.txt sentence George Bush ist gesucht.

·        Run Namerec with configuration ./config/namerec/NameRec_noWriteback.cfg with file input and output to file ./examples/NameRec/namerecdb.txt
java -Xmx500M -classpath .;./lib/ASV_Namerec.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.namerec.Recognizer ./config/namerec/NameRec_noWriteback.cfg -o ./examples/NameRec/namerecdb.txt file ./examples/NameRec/Namerec.txt
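
The commands above use ";" to separate classpath entries, which is the Windows convention; on Unix-like systems the JVM expects ":" instead. The following sketch assembles the same command from Java and picks the separator for the current platform. The class name NamerecCommandSketch is hypothetical; the jar path, JVM options and main class are copied from the commands above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: builds the Namerec command line with the
// classpath separator of the platform it runs on.
final class NamerecCommandSketch {
    static List<String> buildCommand(String configFile, String outFile, String... input) {
        String sep = File.pathSeparator; // ";" on Windows, ":" on Unix-like systems
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-Xmx500M");
        cmd.add("-classpath");
        cmd.add("." + sep + "./lib/ASV_Namerec.jar");
        cmd.add("-Djava.ext.dirs=." + sep + "./lib");
        cmd.add("de.uni_leipzig.asv.toolbox.namerec.Recognizer");
        cmd.add(configFile);
        if (outFile != null) {
            cmd.add("-o");
            cmd.add(outFile);
        }
        cmd.addAll(Arrays.asList(input)); // e.g. "db", or "file" followed by a filename
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("./config/namerec/NameRec_noWriteback.cfg",
                "./examples/NameRec/namerecdb.txt", "db");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to new ProcessBuilder(cmd).inheritIO().start() to launch the tool. Note that the extension mechanism behind -Djava.ext.dirs was removed in Java 9, so on newer JVMs that option will probably have to be dropped and the jars in ./lib listed on the classpath instead.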

 

How to use Namerec in your own program

Classes and Methods

To use Namerec in your own program, you only need to know a few classes.

·        Recognizer:
This class runs the algorithm. In addition, it provides some methods for initializing your rules.

·        SatzDatasource:
Interface that provides access to any data source.

·        Config:
Class for handling your configuration file.
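
As an illustration of the data source idea, here is a sketch of a source backed by a list of sentences. So that it compiles on its own, it uses a stand-in interface that mirrors only the two methods implemented in the NamerecTest example in the Example section (getNextSentence and getNumOfSentences); the real SatzDatasource interface may declare more. Returning "END" once all sentences are consumed follows that example. The names SentenceSourceSketch and ListDatasource are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in for de.uni_leipzig.asv.toolbox.namerec.SatzDatasource.
// Assumption: only the two methods used in NamerecTest are mirrored here.
interface SentenceSourceSketch {
    String getNextSentence();
    int getNumOfSentences();
}

// Hypothetical data source that serves sentences from an in-memory list.
final class ListDatasource implements SentenceSourceSketch {
    private final List<String> sentences;
    private final Iterator<String> remaining;

    ListDatasource(List<String> sentences) {
        this.sentences = sentences;
        this.remaining = sentences.iterator();
    }

    public String getNextSentence() {
        // "END" as the end marker follows the NamerecTest example.
        return remaining.hasNext() ? remaining.next() : "END";
    }

    public int getNumOfSentences() {
        return sentences.size();
    }

    public static void main(String[] args) {
        ListDatasource ds = new ListDatasource(Arrays.asList(
                "Osama Bin Laden ist gesucht.", "George Bush ist gesucht."));
        for (String s = ds.getNextSentence(); !s.equals("END"); s = ds.getNextSentence()) {
            System.out.println(s);
        }
    }
}
```

With the real interface in place of the stand-in, such an object could be assigned to rec.ds in the same way as the anonymous class in the example.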

Example:

Here is an example of a Java class (NamerecTest.java) that uses the Namerec tool. You can find the class NamerecTest.java in the package de.uni_leipzig.asv.toolbox.tests.

package de.uni_leipzig.asv.toolbox.tests;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Observable;
import java.util.Observer;
import java.util.Scanner;
import java.util.Vector;

import javax.swing.SwingUtilities;

import de.uni_leipzig.asv.toolbox.namerec.NameTable;
import de.uni_leipzig.asv.toolbox.namerec.Pattern;
import de.uni_leipzig.asv.toolbox.namerec.Recognizer;
import de.uni_leipzig.asv.toolbox.namerec.SatzDatasource;
import de.uni_leipzig.asv.toolbox.namerec.util.Config;
import de.uni_leipzig.asv.toolbox.namerec.util.SwingWorker;

public class NamerecTest {

    private static Vector<Pattern> extraPats;

    public static void main(String[] args) {
        final boolean tokenize = true;
        SwingWorker sw = new SwingWorker() {

            public Object construct() {
                // read config
                String configFile = "./config/namerec/completedatabasetestconfig.cfg";
                Config cfg2 = null;
                try {
                    cfg2 = new Config(configFile);
                } catch (FileNotFoundException e1) {
                    e1.printStackTrace();
                    return null; // nothing below works without a configuration
                } catch (IOException e1) {
                    e1.printStackTrace();
                    return null; // nothing below works without a configuration
                }
                String parent = new File(configFile).getParent();

                // load classification rules
                Vector<Pattern> classRules = new Vector<Pattern>();
                Scanner insc;
                try {
                    insc = new Scanner(new File(parent + "/"
                            + cfg2.getString("IN.PATFILE", "")));
                    classRules = Recognizer.loadClassRules(insc);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                    return null;
                }

                // load extraction patterns
                extraPats = new Vector<Pattern>();
                try {
                    insc = new Scanner(new File(parent + "/"
                            + cfg2.getString("IN.PATFILENE", "")));
                    extraPats = Recognizer.loadExtractionPattern(insc);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                    return null;
                }
                cfg2.set("DB.WRITEBACK", "false");

                // make paths to files absolute
                cfg2.set("IN.REGEXP", parent + "/" + cfg2.getString("IN.REGEXP", ""));
                cfg2.set("IN.PATFILE", parent + "/" + cfg2.getString("IN.PATFILE", ""));
                cfg2.set("IN.PATFILENE", parent + "/"
                        + cfg2.getString("IN.PATFILENE", ""));
                cfg2.set("IN.CLASSNAMES", parent + "/"
                        + cfg2.getString("IN.CLASSNAMES", ""));
                cfg2.set("IN.KNOWLEDGE", parent + "/"
                        + cfg2.getString("IN.KNOWLEDGE", ""));

                // create Recognizer
                Recognizer.makePatternMap(classRules, extraPats);
                Recognizer rec;
                try {
                    rec = new Recognizer(cfg2, null, classRules, extraPats, tokenize);
                } catch (IOException e1) {
                    e1.printStackTrace();
                    return null;
                }
                Recognizer.cfg2 = cfg2;

                final String sentence = "Osama Bin Laden ist gesucht.";

                // set data source: delivers the single sentence once, then "END"
                rec.ds = new SatzDatasource() {
                    boolean isDone = false;

                    public String getNextSentence() {
                        if (!this.isDone) {
                            this.isDone = true;
                            return sentence;
                        }
                        return "END";
                    }

                    public int getNumOfSentences() {
                        return 1;
                    }
                };
                rec.addObserver(getObserver());
                try {
                    rec.doTheRecogBoogie(false, tokenize);
                } catch (Exception e) {
                    e.printStackTrace();
                    return null;
                }

                return rec;
            }

            public void finished() {
                System.out.println("finished");
            }

            // observer that prints the recognition result
            private Observer getObserver() {
                return new Observer() {
                    public void update(Observable o, Object arg) {
                        final Object[] arr = (Object[]) arg;
                        SwingUtilities.invokeLater(new Runnable() {
                            public void run() {
                                System.out.println("\n\n\nResult:"
                                        + Recognizer.outputSentenceCL((NameTable) arr[0],
                                                (String) arr[1], extraPats));
                            }
                        });
                    }
                };
            }
        };
        sw.start();
    }
}

Running this test produces the following output:

Confg:Einstellungen:
-------------
 Klassen: .\config\namerec//additional/completedatabasetestconfig.cfg.tagsystem
 Wissen Items: .\config\namerec//additional/completedatabasetestconfig.cfg.knowledgeItems
 Wissen Regexp: .\config\namerec//additional/completedatabasetestconfig.cfg.regex
 Wissen Regeln: .\config\namerec//additional/completedatabasetestconfig.cfg.rules
 Regeln für NEs .\config\namerec//additional/completedatabasetestconfig.cfg.extraPats
 Anzahl Sätze zur Kandidatenüberprüfung 30
 Threshhold Anerkennung Item 0.15
 Beginne bei Satz: 0
 Ende bei Satz: -1
 Datei für neue Items:
 Datei für eventuelle Items: maybes-de.txt
 Datei für Kontexte, wenn Regeln irgendwie zuschlagen:
 Datei für komplett bekannte Namen: NEs-de.txt
 Anzahl der Verifikationsthreads: 10
Could not connect! You can not use database option.
Could not connect! You can not use database option.
Initializing basetagger...
Number of Rules: 65
0: Osama Bin Laden ist gesucht.
Knowledge: Bin=ZN
Knowledge: Osama=VN
Knowledge: .=PU
Knowledge: Laden=NN
verification done!
finished



Result:<person pattern="VN ZN NN">Osama Bin Laden</person> ist gesucht .
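
If the result line is processed further in your own program, the marked names can be pulled out with a regular expression. The sketch below assumes that every recognized name is written as <tag pattern="...">text</tag>, as in the result line above; other output shapes are not covered. The class name ResultLineSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts the marked names from a Namerec result line.
final class ResultLineSketch {
    // Assumption: names are marked as <tag pattern="...">text</tag>, as in the sample output.
    private static final Pattern ENTITY =
            Pattern.compile("<(\\w+) pattern=\"([^\"]*)\">(.*?)</\\1>");

    // Each result entry holds { tag, pattern, text }.
    static List<String[]> extract(String line) {
        List<String[]> result = new ArrayList<String[]>();
        Matcher m = ENTITY.matcher(line);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Result:<person pattern=\"VN ZN NN\">Osama Bin Laden</person> ist gesucht .";
        for (String[] e : extract(line)) {
            // prints "person [VN ZN NN]: Osama Bin Laden"
            System.out.println(e[0] + " [" + e[1] + "]: " + e[2]);
        }
    }
}
```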

 

 

 

 
