Documentation of Viterbitagger

   


 

Installation

Introduction

How to use the GUI version

Train from file – horizontal format

Tag with a Viterbitagger

Evaluate your Tagger

How to use the Command Line Version

Command

Abilities

Options

Examples

How to use the Tagger in your own Program

Classes and Methods

Example

 

Installation:

A description of how to install a module can be found on the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.viterbitagger.gui.ViterbiTagger

Introduction

This tool trains a Viterbitagger from tagged text in horizontal or vertical format, with one or two tags per word. You can also tag text with a trained Viterbitagger (input from a text field, a file or a database) and evaluate your Viterbitagger (from file only).

How to use the GUI version

Train a Viterbitagger:
Choose the Train panel. It offers three options for training a Viterbitagger: training from a file containing tagged text in horizontal format, training from a file containing tagged text in vertical format, and training from a database containing tagged sentences in horizontal format (see figure 1).

Callouts in figure 1: (1) Training from sentences in database, horizontal format. (2) Training from text in file, horizontal format. (3) Training from text in file, vertical format.

the three training options

figure 1

Train from file – vertical format:
Choose the vertical format option at the top. Vertical format means that each line of the file contains one word together with its tag or tags, separated by tabs (see figure 2).

example of text in vertical format

figure 2

Now specify the sentence end tag, i.e. the tag that marks the end of a sentence (PUN in figure 3). This setting is crucial for the success of the tagger.

set sentence end tag to PUN

figure 3
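The vertical format and the role of the sentence end tag can be illustrated with a small reader. This is only a sketch and not the toolbox's own code: the class VerticalFormatSketch and the method readSentences are hypothetical names, and it assumes that a sentence ends directly after the token whose tag equals the sentence end tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative reader for vertical-format training data; not part of the toolbox. */
final class VerticalFormatSketch {

    /**
     * Groups tab-separated lines into sentences of {word, tag} pairs.
     * wordPos and tagPos are 1-based column positions, as in the GUI.
     * ASSUMPTION: a sentence ends after the token whose tag equals sentenceEndTag.
     */
    static List<List<String[]>> readSentences(List<String> lines, int wordPos,
                                              int tagPos, String sentenceEndTag) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < Math.max(wordPos, tagPos)) {
                continue; // blank or malformed line: the needed columns are missing
            }
            String word = cols[wordPos - 1];
            String tag = cols[tagPos - 1];
            current.add(new String[] {word, tag});
            if (tag.equals(sentenceEndTag)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a sentence end tag
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "a\tAT0", "sentence\tNN1", "!\tPUN",
                "for\tPRP", "tagging\tNN1", ".\tPUN");
        for (List<String[]> sentence : readSentences(lines, 1, 2, "PUN")) {
            StringBuilder sb = new StringBuilder();
            for (String[] token : sentence) {
                sb.append(token[0]).append('/').append(token[1]).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

With a wrong sentence end tag, a reader like this would return the whole file as one long sentence, which suggests why the setting matters for training.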


Then specify whether you have one or two tags per word, and the positions of word and tag. Additionally, you can choose to replace numbers with the symbol %N% when training the Viterbitagger (see figure 4). Recommended setting: replace numbers.

Callouts in figure 4: (1) Two tags are used here; the word stands at position 1 and the tag at position 2, i.e. first the word and then the tag. (2) Choose this option to replace all numbers in the text (not in the tags) with a unique symbol.

configure word and tag position and replace number option

figure 4
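The effect of the replace-numbers option can be sketched as follows. The documentation does not say exactly which tokens the tool treats as numbers, so the pattern below (digits, optionally grouped with "." or ",") is an assumption, and the class and method names are hypothetical.

```java
/** Illustrative number replacement; not the toolbox's own implementation. */
final class ReplaceNumbersSketch {

    /** The placeholder symbol named in the documentation. */
    static final String NUMBER_SYMBOL = "%N%";

    /**
     * Returns NUMBER_SYMBOL for numeric words and the word itself otherwise.
     * ASSUMPTION: a number is a run of digits, optionally with "." or ","
     * between digit groups (for example 1862 or 1,000.5).
     */
    static String replaceNumber(String word) {
        return word.matches("[0-9]+([.,][0-9]+)*") ? NUMBER_SYMBOL : word;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"in", "1862", "and", "1863"}) {
            System.out.print(replaceNumber(word) + " ");
        }
        System.out.println();
    }
}
```

Mapping all numbers to one symbol means the tagger sees a single frequent lexicon entry instead of many rare ones, which is the likely reason the setting is recommended.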

                                                                                                                                                                                

 

Finally, click the “Train Tagger from file” button. A new window opens for choosing the training file. When training has finished, another window opens for saving the Viterbitagger to a file.
If you get an error message like the one in figure 5, no tag could be found in the training data. Possible reasons are that you chose the wrong format, the wrong separator (used only in horizontal format) or the wrong file (e.g. text without tags).

error message if no tag is found

figure 5

 

Train from file – horizontal format:

Training from a file in horizontal format is very similar to training from a file in vertical format (see above); only two things differ. First, you have to choose “horizontal format” at the top of the panel. Horizontal format means that each word and its tag are joined by the separator (see figure 6).

 

part of a tagged file in horizontal format

figure 6

Second, you have to specify the separator instead of the sentence end tag (see figure 7).

 

set the separator to |

figure 7
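A reader for the horizontal format could look like the sketch below. It is an illustration only, with hypothetical class and method names, and it assumes one tag per word and whitespace between tokens. Splitting at the last occurrence of the separator is a design choice of this sketch, so that a word containing the separator stays intact.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for horizontal-format text; not part of the toolbox. */
final class HorizontalFormatSketch {

    /**
     * Splits a sentence such as "a|AT0 sentence|NN1" into {word, tag} pairs.
     * ASSUMPTIONS: tokens are separated by whitespace, each token carries one
     * tag, and tokens without a word or without a tag are skipped.
     */
    static List<String[]> parse(String sentence, String separator) {
        List<String[]> tokens = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            int idx = token.lastIndexOf(separator);
            if (idx <= 0 || idx + separator.length() >= token.length()) {
                continue; // separator missing, or word or tag empty
            }
            tokens.add(new String[] {
                    token.substring(0, idx),
                    token.substring(idx + separator.length())});
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("a|AT0 sentence|NN1 for|PRP tagging|NN1 !|PUN", "|")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

If the separator does not match the file, such a parser finds no tags at all, which corresponds to the error case described for training.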

 

Train from db:
Choose the “train from db” option at the top. This option requires a database containing a table with tagged sentences in horizontal format (vertical format is not supported).

 

First, specify the database connection settings on the db tab (see figure 8). Do not forget to push the “Connect to Database” button.

 

configure database settings

figure 8

 

Now switch back to the train panel. Here you can specify the table and columns containing the tagged data and choose how many sentences should be used for training (see figure 9).

Callouts in figure 9: (1) The option for training from db must be chosen before table and columns can be specified. (2) First select the table for training, then the columns. (3) Select the id column (needed for selecting only a part of the sentences) and the sentence column (containing the tagged data). (4) The id of the first sentence for training. (5) The id of the last sentence for training.

choose table and columns and number of sentences

figure 9

 
 

 

 

 

 

 


All other settings on this panel are made in the same way as for training from file. Use the “Train from DB” button to start the training.

 

 

Tag with a Viterbitagger:

For tagging you need a trained Viterbitagger model and some text without tags. Choose the “Apply” panel and load your tagger model by pushing the “Load taggermodel” button. A new window opens in which you can navigate to the directory where you saved the tagger model. Choose the file ending with “.model” (see figure 10).

 

choose en.model from directory taggermodels in resources

figure 10

 

Now choose between “Lexicon in RAM” and “Lexicon on Disc”. Lexicon in RAM is much faster but needs a considerable amount of memory; Lexicon on Disc is slower but needs less memory.

Further, you can specify whether to use the internal tokenizer (see figure 11). Training and application text should be tokenized in the same way.

Callouts in figure 11: (1) Option for the internal tokenizer. (2) Lexicon on Disc: lower speed, but needs less memory. (3) Lexicon in RAM: faster, but needs more memory. (4) Some information about the loaded tagger.

loaded taggermodel, Lexicon on Disc chosen and internal tokenizer chosen

figure 11

 
 

 

 

 

 

 


The next step is to choose an input method. You can enter text in the text area (see figure 12), choose a file from the file system (see figure 13) or use sentences from a database (see figure 14).

Callouts in figure 12: (1) Choose this option to use text input; after choosing it you can enter text in the text area below. (2) Here you can enter the text you want to tag; copy and paste are also possible.

example text input

figure 12

Callouts in figure 13: (1) Choose this option to tag text in a file; after choosing it, the search button and the text field are enabled. (2) Enter the path of the file you want to tag in the text field, or click on search to choose the file in the file open dialog.

example tag file

figure 13

 

 

 

 

 

Callouts in figure 14: (1) Choose this option to tag text in a database; after choosing it you can select table, columns and ids. (2) Select the table containing the data for tagging. (3) Select the id column and the sentence column (which should contain untagged sentences). (4) Select the id of the first sentence you want to tag. (5) Select the id of the last sentence you want to tag; -1 means tag until the last sentence in the table.

example tag text in database

figure 14
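The meaning of the first and last id, including the special value -1, can be written as a small predicate. This is a sketch with hypothetical names; it assumes that both bounds are inclusive, which the documentation does not state explicitly.

```java
/** Illustrative id-range check for sentences read from a database table. */
final class IdRangeSketch {

    /**
     * Returns true if the sentence with the given id should be tagged.
     * lastId == -1 means "until the last sentence in the table".
     * ASSUMPTION: firstId and lastId are both inclusive.
     */
    static boolean inRange(long id, long firstId, long lastId) {
        return id >= firstId && (lastId == -1 || id <= lastId);
    }

    public static void main(String[] args) {
        System.out.println(inRange(150, 100, 200)); // inside the range
        System.out.println(inRange(250, 100, 200)); // beyond the last id
        System.out.println(inRange(250, 100, -1));  // open-ended range
    }
}
```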

 
 

 

 

 

 

 

 

 

 


Before you start the tagging process, specify your output options: output to the text area, to a file and/or to a database (see figure 15).

Callouts in figure 15: (1) Select the checkboxes in front of the output methods you want to use; this enables the components for configuring each method. (2) Here you will see the tagged text. (3) Enter the path to the output file, or select the output file in a file open dialog by clicking on search. (4) For output to a database, select the output table or enter a new one, and select or enter the id column and the sentence column.

example: use all output methods

figure 15

 
 

 

 

 


Finally, click the start button at the bottom. The progress bar shows the progress of the tagging process (see figure 16). If you click on cancel, the running tagging process is stopped.

Callouts in figure 16: (1) Progress bar. (2) Cancel button for stopping the tagging process.

started tagging process

figure 16

 

Evaluate your Tagger:

This option is only available for taggers with one tag per word. On the test panel you can evaluate your Viterbitagger model. For this you need your trained Viterbitagger and a file with already tagged text, a so-called test set.

 

Load your tagger model in the test panel: click the “Load Tagger Model” button, navigate in the file dialog to the tagger model file, which ends with “.model”, and open it. Choose whether you want Lexicon on Disc or Lexicon in RAM. Now open the file with the already tagged text by clicking the “Search” button. For example, the panel could look like figure 17, where the en.model file (in resources/taggermodels) is used as the tagger model and the file “examples/tagger/en.horizontal.oneTag” as the test file.

 

loaded tagger and loaded evaluation file, lexicon in ram

figure 17

 

Click the “Evaluate” button to start the evaluation process. You can follow the progress on the progress bar (see figure 18).

Callout in figure 18: Evaluate button and, below it, the progress bar for the evaluation process.

evaluation process

figure 18

 

The result of the evaluation will be displayed on the text area in the middle of the panel (see figure 19).

 

finished evaluation with result in text area

figure 19

 

 

Figure 20 shows the complete result of the evaluation. The most interesting value is the accuracy. All other values refer to evaluating on tagsets different from the one used in training: here, the score to minimize, with 1.0 as the optimum, is the total cluster-conditional tag perplexity; see [Freitag, 2004] for details.

 

complete result of evaluation

figure 20

 

How to use the Command Line Version

Command:

To start the command line version of this tool, use the following command. The “;” path separator shown is the Windows form; on Unix-like systems use “:” instead:

java -Xmx500M -classpath .;./lib/ASV_Viterbitagger.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.viterbitagger.gui.ViterbitaggerCL ability option [options ...]

Abilities:


            train       train a Viterbitagger
            tag         tag some text with a given Viterbitagger model
            evaluate    evaluate a given Viterbitagger model with a file

Options:


              -tm taggermodelfile   the tagger model that will be used (tag/evaluate) or created (train)
            only for train:
              -db                   train from db; uses the configuration from config/viterbiTagger/viterbiTagger.query
              -f file               train from the given file
              -h                    file is in horizontal format; otherwise it is in vertical format
              -s separator          separator sign between word and tag; use only together with -h
              -set sentenceendtag   tag for the end of a sentence; use only without -h
              -rn                   replace numbers option
              -wp                   word position, default 1
              -ptp                  primary tag position, default 2
              -stp                  secondary tag position, -1 means not in use, default -1
            only for tag and evaluate:
              -ram                  the Viterbitagger is loaded completely into RAM (faster); otherwise part of the tagger stays on disk (for small RAM)
            only for tag:
              -tokenizer            use the tokenizer before tagging the text
              -if infile            input file for tagging
              -of outfile           output file for tagging
              -it                   read input from the console
              -ot                   print tagged text to the screen
              -idb                  use db for input; uses the configuration from config/viterbiTagger/viterbiTagger.query
              -odb                  write tagged sentences back to db; uses the configuration from config/viterbiTagger/viterbiTagger.query
              -ids                  start id for tagging sentences from the database, default 0
              -ide                  end id for tagging sentences from the database, -1 means all, default -1
            only for evaluate:
              -ef evaluatefile      file for evaluation
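The command and options above can also be assembled programmatically, for example to start the tool from another Java program. The following is a minimal sketch, not part of the toolbox: the class ViterbitaggerCommandSketch and the method command are hypothetical, the file names my.model and train.txt are made up for illustration, and only options listed above are used. File.pathSeparator supplies “;” on Windows and “:” on Unix-like systems.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative builder for the documented command line; not part of the toolbox. */
final class ViterbitaggerCommandSketch {

    static final String MAIN_CLASS =
            "de.uni_leipzig.asv.toolbox.viterbitagger.gui.ViterbitaggerCL";

    /** Builds the argument list: fixed JVM part, then ability, then options. */
    static List<String> command(String ability, String... options) {
        String sep = File.pathSeparator; // platform-specific classpath separator
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-Xmx500M",
                "-classpath", "." + sep + "./lib/ASV_Viterbitagger.jar",
                "-Djava.ext.dirs=." + sep + "./lib",
                MAIN_CLASS, ability));
        cmd.addAll(Arrays.asList(options));
        return cmd;
    }

    public static void main(String[] args) {
        // Train from a vertical-format file (hypothetical file names).
        System.out.println(String.join(" ",
                command("train", "-tm", "my.model", "-f", "train.txt", "-set", "PUN", "-rn")));
        // Tag console input with the model in RAM and print the result.
        System.out.println(String.join(" ",
                command("tag", "-tm", "resources/taggermodels/en.model",
                        "-ram", "-tokenizer", "-it", "-ot")));
        // Evaluate the model against a tagged test file.
        System.out.println(String.join(" ",
                command("evaluate", "-tm", "resources/taggermodels/en.model",
                        "-ram", "-ef", "examples/tagger/en.horizontal.oneTag")));
    }
}
```

The returned list can be passed to java.lang.ProcessBuilder to actually launch the tool; the sketch itself only prints the command lines.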

Examples:

      I'm a sentence for tagging!
      ^^
      You enter the following text: I'm a sentence for tagging!
      Result:  I|PNP '|POS m|UNC a|AT0 sentence|NN1 for|PRP tagging|NN1* !|PUN

sentences tagged:100

sentences tagged:200

Overall Accuracy: 0.858072225172597 (6463/7532)

TOTAL TOKENS:  7532

 

total cluster purity goldtags:  0.8799787573021773

total cluster purity lextags:   0.8590015932023367

total entropy goldtags:   3.214317492922561

total entropy lextags:    3.01810840177073

total conditional Entropy:      2.5812824807793713

total entropy purity lextags:   0.8030577211065688

total cluster-cond. tag perplexity:   1.8833178065437262

 

KNOWN TOKENS:  6175 (81.98353690918748% )

known cluster purity goldtags: 0.9481781376518219

known cluster purity lextags:   0.9468825910931175

known entropy goldtags:   3.0994674059763128

known entropy lextags:    3.0725497903606205

known conditional Entropy:      2.877526443574545

known entropy purity lextags:   0.9283938388983131

known cluster-cond. tag perplexity:   1.248497667384779

 

UNKNOWN TOKENS:     1357

unknown percent:    18.016463090812536

unknown cluster purity goldtags:      0.6462785556374355

unknown cluster purity lextags: 0.5092114959469418

unknown entropy goldtags: 2.816107783691665

unknown entropy lextags: 2.1357734485840076

unknown conditional Entropy:    1.0741627680527874

unknown entropy purity lextags: 0.381435247000616

unknown cluster-cond. tag perplexity: 5.7084356284891395

 

HOLE statistics

total holes: 1034

 length 1: 816(78.91682785299807% )

 length 2: 153(14.796905222437138% )

 length 3: 41(3.9651837524177944% )

 length 4: 13(1.2572533849129592% )

 length 5: 7(0.6769825918762089% )

 length 6: 3(0.2901353965183753% )

 length 7: 1(0.09671179883945842% )

 length 8: 0(0.0% )

 length 9: 0(0.0% )

 longer: 0(0.0% )
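Some figures in the sample evaluation output above can be cross-checked with a few lines of Java. The accuracy is simply correct tokens divided by all tokens. In addition, in this sample the reported cluster-conditional tag perplexity coincides with exp(entropy goldtags - conditional Entropy) for the total, known and unknown groups. That relation is an observation about these numbers, not a documented definition; see [Freitag, 2004] for the actual measures. The class and method names are hypothetical.

```java
/** Illustrative cross-check of the sample evaluation output; not toolbox code. */
final class EvaluationFiguresSketch {

    /** Accuracy as the fraction of correctly tagged tokens. */
    static double accuracy(long correct, long total) {
        return (double) correct / total;
    }

    /**
     * OBSERVATION, not a documented formula: in the sample output the reported
     * perplexity matches exp("entropy goldtags" - "conditional Entropy").
     */
    static double perplexityFromEntropies(double entropyGoldtags, double conditionalEntropy) {
        return Math.exp(entropyGoldtags - conditionalEntropy);
    }

    public static void main(String[] args) {
        System.out.println("accuracy:           " + accuracy(6463, 7532));
        System.out.println("known tokens:       " + 100.0 * 6175 / 7532 + " %");
        System.out.println("total perplexity:   "
                + perplexityFromEntropies(3.214317492922561, 2.5812824807793713));
        System.out.println("known perplexity:   "
                + perplexityFromEntropies(3.0994674059763128, 2.877526443574545));
        System.out.println("unknown perplexity: "
                + perplexityFromEntropies(2.816107783691665, 1.0741627680527874));
    }
}
```

Read this way, a perplexity of 1.0 is reached when the two entropy values are equal, which fits the statement that 1.0 is the score to aim for.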

How to use the Tagger in your own Program

Classes and Methods:

It is easy to use Viterbitagger in your own program. All you have to know is the class Tagger in the package de.uni_leipzig.asv.toolbox.viterbitagger.

Method: Tagger(taglistfile, lexiconfile, transitionsfile, conditionsfile, useforeval) (constructor)
Description: Creates a new Tagger from the given files. If conditionsfile is null, the tagger is used with only one tag. If useforeval is true, the tagger performs evaluation with the given text.

Method: setExtern(boolean extern)
Description: If extern is true, the lexicon is not loaded into RAM (for small memory; slightly slower).

Method: setReplaceNumbers(boolean replacenumbers)
Description: The value for this parameter is stored in the tagger model file (key: ReplaceNumbers).

Method: setUseInternalTok(boolean internalTok)
Description: If internalTok is true, the text is tokenized before it is tagged.

Method: String tagSentence(String text)
Description: Tags the given sentence and returns the tagged result.

Example:

 

Here is an example of a Java class (ViterbitaggerTest.java) that uses the Viterbitagger tool. You can find the class ViterbitaggerTest.java in the package de.uni_leipzig.asv.toolbox.tests.

package de.uni_leipzig.asv.toolbox.tests;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Properties;

import de.uni_leipzig.asv.toolbox.viterbitagger.Tagger;

public class ViterbitaggerTest {

      public static void main(String[] args) {
            // tagger model file
            String tmFile = "./resources/taggermodels/en.model";
            // sentence to tag
            String sentence = "Sherman served under General Ulysses" +
                        " S. Grant in 1862 and 1863 during the campaigns " +
                        "that led to the fall of the Confederate stronghold " +
                        "of Vicksburg on the Mississippi River and " +
                        "culminated with the routing of the Confederate " +
                        "armies in the state of Tennessee.";
            // properties for reading the content of the tagger model file
            Properties props = new Properties();
            // directory of the tagger model; the other files are resolved relative to it
            String tmDir = new File(tmFile).getParent();
            // try-with-resources closes the model file after loading
            try (FileInputStream in = new FileInputStream(tmFile)) {
                  props.load(in);
                  // conditionsfile is null: use the tagger with one tag per word
                  Tagger tagger = new Tagger(tmDir + "/" + props.getProperty("taglist"),
                              tmDir + "/" + props.getProperty("lexicon"),
                              tmDir + "/" + props.getProperty("transitions"), null, false);
                  // keep the lexicon in RAM
                  tagger.setExtern(false);
                  // null-safe comparison in case the key is missing
                  tagger.setReplaceNumbers("true".equals(props.getProperty("ReplaceNumbers")));
                  tagger.setUseInternalTok(true);
                  System.out.println(tagger.tagSentence(sentence));
            } catch (FileNotFoundException e) {
                  e.printStackTrace();
            } catch (IOException e) {
                  e.printStackTrace();
            }
      }
}

 

Running this test class produces the following output:

 

Sherman|NP0 served|VVN under|PRP General|AJ0 Ulysses|NN2* S|NP0 .|PUN Grant|NP0 in|PRP 1862|CRD* and|CJC 1863|CRD* during|PRP the|AT0 campaigns|NN2 that|CJT led|VVD to|PRP the|AT0 fall|NN1 of|PRF the|AT0 Confederate|AJ0* stronghold|NN1 of|PRF Vicksburg|NP0* on|PRP the|AT0 Mississippi|NP0 River|NN1 and|CJC culminated|NN1* with|PRP the|AT0 routing|NN1* of|PRF the|AT0 Confederate|AJ0* armies|NN2* in|PRP the|AT0 state|NN1 of|PRF Tennessee|NP0** .|PUN

 

 

References

[Freitag, 2004] D. Freitag. "Toward Unsupervised Whole-Corpus Tagging." Proceedings of COLING 2004.

 
