Train from file – horizontal format:
How to use the Command Line Version
How to use the Tagger in your own Program
A description of how to install a module can be found on the main page of the ASV Toolbox project.
The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.viterbitagger.gui.ViterbiTagger
This tool trains a Viterbitagger from tagged text in horizontal or vertical
format with one or two tags per word. You can also tag text with a trained
Viterbitagger (input from a text field, a file or a database) and evaluate
your Viterbitagger (from a file only).
Train a Viterbitagger:
Choose the Train panel.
It offers three options for training a Viterbitagger: training from a file
containing tagged text in horizontal format, training from a file containing
tagged text in vertical format, and training from a database containing tagged
sentences in horizontal format (see figure 1).

figure 1: the three training options (training from sentences in a database, horizontal format; training from text in a file, horizontal format; training from text in a file, vertical format)
Train from file – vertical format:
Choose the vertical format option at the top. Vertical format means that each
line of the file contains one word together with its tag(s), separated by tabs
(see figure 2).

figure 2
Now you can specify the
sentence end tag. This is crucial for the success of the tagger.
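
To make the role of the sentence end tag concrete, here is a minimal sketch of how a file in vertical format could be read into sentences. It is not the tool's own reader: the class name, the tags (DT0, VVZ, AJ0) and the sentence end tag SENT are assumptions chosen for illustration, and it assumes one tag per word with the word at position 1 and the tag at position 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative reader for the vertical format: one "word<TAB>tag" pair per line.
// This is a sketch, not the reader used by the Viterbitagger.
class VerticalFormatSketch {

    // Groups the lines into sentences. A sentence is closed after the token
    // whose tag equals sentenceEndTag.
    static List<List<String[]>> readSentences(List<String> lines, String sentenceEndTag) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                throw new IllegalArgumentException("No tag found in line: " + line);
            }
            // Assumption: word at position 1, tag at position 2.
            current.add(new String[] { cols[0], cols[1] });
            if (cols[1].equals(sentenceEndTag)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens without a sentence end tag
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "This\tDT0", "works\tVVZ", ".\tSENT", "Fine\tAJ0", ".\tSENT");
        System.out.println(readSentences(lines, "SENT").size() + " sentences");
    }
}
```

If the wrong sentence end tag is given, a reader like this never closes a sentence and the whole file ends up as one long sentence, which suggests why the setting matters.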

figure 3
Then specify whether you have one or two tags per word and the order of word
and tag(s). Additionally, you can choose to replace numbers with the symbol
%N% when training the Viterbitagger (see figure 3). Recommended setting:
replace numbers.
In the example shown, the word is at position 1 and the tag at position 2,
that is, the word comes first and the tag follows. The replace-numbers option
replaces all numbers in the text (not in the tags) with one common symbol.
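
The effect of the replace-numbers option can be sketched as follows. What exactly the tool treats as a number is not stated here, so the regular expression below (digits, optionally grouped with '.' or ',') is an assumption, as are the class and method names.

```java
// Illustrative sketch of the "replace numbers" idea: numeric words are mapped
// to the symbol %N%, tags are left untouched. The definition of "number" used
// here is an assumption, not the tool's documented behaviour.
class ReplaceNumbersSketch {

    static final String NUMBER_SYMBOL = "%N%";

    // Replaces a word by %N% if it consists only of digits, optionally
    // separated by '.' or ','.
    static String replaceNumber(String word) {
        return word.matches("\\d+([.,]\\d+)*") ? NUMBER_SYMBOL : word;
    }

    // Applies the replacement to the word part of a "word<separator>tag" token.
    static String replaceInToken(String token, String separator) {
        int idx = token.lastIndexOf(separator);
        if (idx < 0) {
            return replaceNumber(token); // untagged token
        }
        return replaceNumber(token.substring(0, idx)) + token.substring(idx);
    }

    public static void main(String[] args) {
        System.out.println(replaceInToken("1862|CRD", "|"));
        System.out.println(replaceInToken("fall|NN1", "|"));
    }
}
```

Mapping all numbers to one symbol means the tagger sees a single lexicon entry instead of many rare number strings.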


figure 4
Finally, click the “Train Tagger from file” button. A new window opens for
choosing the training file. After training has finished, another window opens
for saving the Viterbitagger to a file.
If you get an error message like the one in figure 4, no tag could be found in
the training data. Possible reasons are that you chose the wrong format, the
wrong separator (used only by the horizontal format) or the wrong file (e.g.
text without tags).

figure 5
Training from a file in horizontal format is very similar to training from a
file in vertical format (see above). Only two small things are different. The
first is that you have to choose “horizontal format” at the top of the panel.
Horizontal format means that each word and its tag are separated by the
separator (see figure 6).

figure 6
The second difference is that you have to specify the separator instead of
the sentence end tag.
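
As an illustration of the horizontal format, the following sketch splits one tagged sentence into word/tag pairs using the chosen separator. It assumes one tag per word and tokens separated by whitespace, as in the tagged examples later in this document; the class and method names are hypothetical and this is not the tool's own parser.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for one sentence in horizontal format,
// e.g. "I|PNP am|VBB" with separator "|". A sketch under stated assumptions.
class HorizontalFormatSketch {

    // Returns one {word, tag} pair per whitespace-separated token.
    static List<String[]> parseSentence(String sentence, String separator) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            // Use the last occurrence so that words containing the
            // separator character are still split at the tag.
            int idx = token.lastIndexOf(separator);
            if (idx <= 0 || idx + separator.length() >= token.length()) {
                throw new IllegalArgumentException("No tag found in token: " + token);
            }
            pairs.add(new String[] {
                    token.substring(0, idx),
                    token.substring(idx + separator.length()) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parseSentence("a|AT0 sentence|NN1", "|")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

With a wrong separator no token can be split, which corresponds to the situation where no tag is found in the training data.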

figure 7
Train from db:
Choose the train from db option at the top. This option requires a database
containing a table with tagged sentences in horizontal format (vertical format
is not supported).
First, specify the database connection settings on the db tab (see
figure 8). Do not forget to push the “Connect to Database” button.

figure 8
Now switch back to the train panel. Here you can specify the table and columns
containing the tagged data and choose how many sentences should be used for
training (see figure 9).
The option for training from db must be chosen before you can specify the
table and columns. Select the table for training first and then the columns:
the id column (needed for selecting only a part of the sentences) and the
sentence column (containing the tagged data). Then enter the id of the first
sentence and the id of the last sentence to be used for training.

figure 9
All other settings on this panel are made in the same way as for training
from file. Use the “Train from DB” button to start the training.
For tagging you need a trained Viterbitagger model and some text without
tags. Choose the “Apply” panel and load your tagger model by pushing the “Load
taggermodel” button. A new window opens in which you can navigate to the
directory where you saved the tagger model. Choose the file ending with
“.model” (see figure 10).

figure 10
Now choose between “Lexicon in RAM” and “Lexicon on Disc”. Lexicon in RAM is
much faster but needs a considerable amount of memory.
You can also specify whether to use the internal tokenizer (see figure 11).
Training and application text should be tokenized in the same way.
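
The requirement that training and application text are tokenized in the same way can be illustrated with a small sketch in which one tokenizer function is shared by both steps. The tokenizer below is an assumption for illustration only: it separates every punctuation character into its own token, which happens to reproduce the tokens in the command line example of this document, but it is not the tool's internal tokenizer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative tokenizer: every character that is neither a letter, a digit
// nor whitespace becomes its own token. This is NOT the internal tokenizer
// of the Viterbitagger, only a stand-in to show the consistency requirement.
class TokenizerConsistencySketch {

    static List<String> tokenize(String text) {
        String spaced = text.replaceAll("([^\\p{L}\\p{N}\\s])", " $1 ").trim();
        if (spaced.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(spaced.split("\\s+"));
    }

    public static void main(String[] args) {
        // Use the same function for the training text and for the text to be tagged.
        System.out.println(String.join(" ", tokenize("I'm a sentence for tagging!")));
        // prints: I ' m a sentence for tagging !
    }
}
```

If the training text kept "I'm" as one token while the application text split it into three, the tagger would look up words it never saw in that form during training.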

figure 11: the Apply panel with the option for the internal tokenizer, the lexicon options (Lexicon on Disc: lower speed but needs less memory; Lexicon in RAM: faster but needs more memory) and some information about the loaded tagger
The next step is to choose an input method. You can enter some text in the
text area (see figure 12), choose a file from the file system (see figure 13)
or use sentences from a database (see figure 14).
Choose the text input option to enter text directly. After you choose this
option, you can enter the text you want to tag in the text area below. Copy
and paste are also possible.

figure 12
Choose the file option to tag text in a file. After you choose this option,
the search button and the text field are enabled. Enter the path to the text
file you want to tag in the text field, or click on search to choose the file
in the file open dialog.

figure 13
Choose the database option to tag text in a database. After choosing it, you
can select the table, columns and ids. Select the table containing the data
for tagging, then the id column and the sentence column (which should contain
untagged sentences). Select the id of the first sentence and the id of the
last sentence you want to tag. An id of -1 for the last sentence means tag
until the last sentence in the table.

figure 14
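
The selection by first and last id, with -1 meaning "until the last sentence", could be expressed as a query like the one built below. This is only a sketch: the queries the tool really uses are not shown in this document, the table and column names (sentences, s_id, sentence) are hypothetical, and treating both ids as inclusive bounds is an assumption.

```java
// Illustrative construction of a SELECT statement for an id range.
// Table and column names are hypothetical; identifiers are concatenated
// directly, so they must come from trusted configuration, not user input.
class IdRangeQuerySketch {

    static String buildQuery(String table, String idColumn, String sentenceColumn,
                             long firstId, long lastId) {
        StringBuilder sql = new StringBuilder();
        sql.append("SELECT ").append(idColumn).append(", ").append(sentenceColumn)
           .append(" FROM ").append(table)
           .append(" WHERE ").append(idColumn).append(" >= ").append(firstId);
        if (lastId != -1) {
            // -1 means no upper bound: read until the last sentence in the table.
            sql.append(" AND ").append(idColumn).append(" <= ").append(lastId);
        }
        sql.append(" ORDER BY ").append(idColumn);
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("sentences", "s_id", "sentence", 0, -1));
        System.out.println(buildQuery("sentences", "s_id", "sentence", 10, 20));
    }
}
```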
Before you start the tagging process, specify your output options: output in
the text area, to a file and/or to a database (see figure 15).
Select the checkboxes in front of the output methods you want to use. This
enables the components for configuring the corresponding output method. For
output to a database, select the output table or enter a new one, and select
or enter the id column and the sentence column. For output to a file, enter
the path to the output file or select it in a file open dialog by clicking on
search. With output in the text area, the tagged text is shown there.

figure 15
Finally, click on the start button at the bottom. The progress bar shows the
progress of the tagging process. If you click on cancel, the running tagging
process is stopped.

figure 16: the progress bar and the cancel button for stopping the tagging process
This option is only available for taggers with one tag per word. On the test
panel you can evaluate your Viterbitagger model. For this you need your
trained Viterbitagger and a file with already tagged text, a so-called test
set.
Load your tagger model in the test panel: click on the “Load Tagger Model”
button, navigate in the file dialog to the tagger model file ending with
“.model”, and open it. Choose whether you want Lexicon on Disc or Lexicon in
RAM. Now open the file with the already tagged text by clicking the “Search”
button. For example, the panel could look like figure 17, where the en.model
file (in resources/taggermodels) is used as the tagger model and the file
“examples/tagger/en.horizontal.oneTag” as the test file.

figure 17
Click on the “Evaluate” button to start the evaluation process. You can
follow the progress on the progress bar.

figure 18: the Evaluate button with the progress bar for the evaluation process below it
The result of the evaluation is displayed in the text area in the middle of
the panel (see figure 19).

figure 19
The complete result of the evaluation is shown above. The most interesting
value is the accuracy. All other values refer to evaluating on a different
tagset than the one used in training: here, the score to minimize is the total
cluster-conditional tag perplexity, whose best possible value is 1.0; see
[Freitag, 2004] for details.
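
As a reading aid for the accuracy value, the sketch below computes accuracy as the share of tokens whose predicted tag equals the tag in the test set. This matches the form in which the evaluation output reports it (correct tokens over total tokens), but the class is only an illustration and not the tool's evaluation code; the tags in the example are made up.

```java
// Illustrative accuracy computation: correctly tagged tokens / all tokens.
class AccuracySketch {

    static double accuracy(String[] goldTags, String[] predictedTags) {
        if (goldTags.length != predictedTags.length) {
            throw new IllegalArgumentException("Both sequences must have the same length");
        }
        if (goldTags.length == 0) {
            throw new IllegalArgumentException("No tokens to evaluate");
        }
        int correct = 0;
        for (int i = 0; i < goldTags.length; i++) {
            if (goldTags[i].equals(predictedTags[i])) {
                correct++;
            }
        }
        return (double) correct / goldTags.length;
    }

    public static void main(String[] args) {
        String[] gold = { "AT0", "NN1", "PRP", "NN1" };
        String[] predicted = { "AT0", "NN1", "PRP", "VVG" };
        System.out.println(accuracy(gold, predicted)); // prints: 0.75
    }
}
```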

figure 20
To start the command line version of this tool, use the following command:
    java -Xmx500M -classpath .;./lib/ASV_Viterbitagger.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.viterbitagger.gui.ViterbitaggerCL ability option [options ...]
ability:
    train                 train a Viterbitagger
    tag                   tag some text with a given Viterbitagger model
    evaluate              evaluate a given Viterbitagger model with a file
options:
    -tm taggermodelfile   the tagger model which will be used (tag/evaluate) or which will be created (train)
only for train:
    -db                   train from db, uses configuration from config/viterbiTagger/viterbiTagger.query
    -f file               train from the given file
    -h                    file is in horizontal file format, else file is in vertical file format
    -s separator          separator sign between word and tag, use this only together with the -h option
    -set sentenceendtag   tag for the end of a sentence, use this only without the -h option
    -rn                   replace numbers option
    -wp                   word position, default 1
    -ptp                  primary tag position, default 2
    -stp                  secondary tag position, -1 means not in use, default -1
only for tag and evaluate:
    -ram                  the Viterbitagger is loaded completely into RAM (faster), else part of the tagger stays on disk (for small RAM)
only for tag:
    -tokenizer            uses the tokenizer before tagging the text
    -if infile            specify the input file for tagging
    -of outfile           specify the output file for tagging
    -it                   use this option for input from console
    -ot                   print tagged text to screen
    -idb                  use db for input, uses configuration from config/viterbiTagger/viterbiTagger.query
    -odb                  write tagged sentences back to db, uses configuration from config/viterbiTagger/viterbiTagger.query
    -ids                  start id for tagging sentences from database, default 0
    -ide                  end id for tagging sentences from database, -1 means all, default -1
only for evaluate:
    -ef evaluatefile      specify the file for evaluation
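
If the command line version is to be started from another Java program, the argument list can be assembled as in the following sketch. It uses only flags listed above; the order of the options after the ability, the class and method names, and the input and output file names are assumptions. The ";" in the classpath shown above is the Windows separator, so the sketch uses File.pathSeparator instead. Note also that newer JVMs (Java 9 and later) no longer support java.ext.dirs, so that argument may have to be dropped in favour of classpath entries there.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative builder for a "tag" call of the command line version.
// A sketch only: option order and file names are assumptions.
class TaggerCommandSketch {

    static List<String> buildTagCommand(String modelFile, String inFile,
                                        String outFile, boolean lexiconInRam) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx500M");
        cmd.add("-classpath");
        cmd.add("." + File.pathSeparator + "./lib/ASV_Viterbitagger.jar");
        cmd.add("-Djava.ext.dirs=." + File.pathSeparator + "./lib");
        cmd.add("de.uni_leipzig.asv.toolbox.viterbitagger.gui.ViterbitaggerCL");
        cmd.add("tag");            // the ability
        cmd.add("-tm");
        cmd.add(modelFile);
        if (lexiconInRam) {
            cmd.add("-ram");       // load the tagger completely into RAM
        }
        cmd.add("-if");
        cmd.add(inFile);
        cmd.add("-of");
        cmd.add(outFile);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildTagCommand(
                "./resources/taggermodels/en.model", "input.txt", "output.txt", true);
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```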
I'm a sentence for tagging!
^^
You enter the following text: I'm a sentence for tagging!
Result: I|PNP '|POS m|UNC a|AT0 sentence|NN1 for|PRP tagging|NN1* !|PUN
sentences tagged:100
sentences tagged:200
Overall Accuracy: 0.858072225172597 (6463/7532)
TOTAL TOKENS: 7532
total cluster purity goldtags: 0.8799787573021773
total cluster purity lextags: 0.8590015932023367
total entropy goldtags: 3.214317492922561
total entropy lextags: 3.01810840177073
total conditional Entropy: 2.5812824807793713
total entropy purity lextags: 0.8030577211065688
total cluster-cond. tag perplexity: 1.8833178065437262
KNOWN TOKENS: 6175 (81.98353690918748% )
known cluster purity goldtags: 0.9481781376518219
known cluster purity lextags: 0.9468825910931175
known entropy goldtags: 3.0994674059763128
known entropy lextags: 3.0725497903606205
known conditional Entropy: 2.877526443574545
known entropy purity lextags: 0.9283938388983131
known cluster-cond. tag perplexity: 1.248497667384779
UNKNOWN TOKENS: 1357
unknown percent: 18.016463090812536
unknown cluster purity goldtags: 0.6462785556374355
unknown cluster purity lextags: 0.5092114959469418
unknown entropy goldtags: 2.816107783691665
unknown entropy lextags: 2.1357734485840076
unknown conditional Entropy: 1.0741627680527874
unknown entropy purity lextags: 0.381435247000616
unknown cluster-cond. tag perplexity: 5.7084356284891395
HOLE statistics
total holes: 1034
length 1: 816(78.91682785299807% )
length 2: 153(14.796905222437138% )
length 3: 41(3.9651837524177944% )
length 4: 13(1.2572533849129592% )
length 5: 7(0.6769825918762089% )
length 6: 3(0.2901353965183753% )
length 7: 1(0.09671179883945842% )
length 8: 0(0.0% )
length 9: 0(0.0% )
longer: 0(0.0% )
It is easy to use the Viterbitagger in your own program. All you need to know
is the class Tagger in the package de.uni_leipzig.asv.toolbox.viterbitagger.
| methods | description |
| --- | --- |
| constructor | Tagger(taglistfile, lexiconfile, transitionsfile, conditionsfile, useforeval) |
| setExtern(boolean extern) | if extern is true, the lexicon is not loaded into RAM (for small memory; a little slower) |
| setReplaceNumbers(boolean replacenumbers) | you will find this parameter in the tagger model file (key: ReplaceNumbers) |
| setUseInternalTok(boolean internalTok) | if internalTok is true, the text is tokenised before it is tagged |
| String tagSentence(String text) | tags the given sentence and returns the tagged result |
Here is an example of a Java class (ViterbitaggerTest.java) that uses the
Viterbitagger tool. You can find the class ViterbitaggerTest.java in the
package de.uni_leipzig.asv.toolbox.tests.
package de.uni_leipzig.asv.toolbox.tests;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Properties;

import de.uni_leipzig.asv.toolbox.viterbitagger.Tagger;

public class ViterbitaggerTest {

    public static void main(String[] args) {
        // tagger model file
        String tmFile = "./resources/taggermodels/en.model";
        // sentence to tag
        String sentence = "Sherman served under General Ulysses"
                + " S. Grant in 1862 and 1863 during the campaigns "
                + "that led to the fall of the Confederate stronghold "
                + "of Vicksburg on the Mississippi River and "
                + "culminated with the routing of the Confederate "
                + "armies in the state of Tennessee.";
        // properties for reading the content of the tagger model file
        Properties props = new Properties();
        // directory of the tagger model; the files it names are resolved against it
        String tmDir = new File(tmFile).getParent();
        // try-with-resources closes the stream once the properties are loaded
        try (FileInputStream in = new FileInputStream(tmFile)) {
            props.load(in);
            Tagger tagger = new Tagger(
                    tmDir + "/" + props.getProperty("taglist"),
                    tmDir + "/" + props.getProperty("lexicon"),
                    tmDir + "/" + props.getProperty("transitions"),
                    null,    // conditionsfile
                    false);  // useforeval
            tagger.setExtern(false);
            // "true".equals(...) avoids a NullPointerException if the key is missing
            tagger.setReplaceNumbers("true".equals(props.getProperty("ReplaceNumbers")));
            tagger.setUseInternalTok(true);
            System.out.println(tagger.tagSentence(sentence));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
You can run this test. Its output is shown below.
Sherman|NP0 served|VVN under|PRP
General|AJ0 Ulysses|NN2* S|NP0 .|PUN Grant|NP0 in|PRP 1862|CRD* and|CJC
1863|CRD* during|PRP the|AT0 campaigns|NN2 that|CJT led|VVD to|PRP the|AT0
fall|NN1 of|PRF the|AT0 Confederate|AJ0* stronghold|NN1 of|PRF Vicksburg|NP0*
on|PRP the|AT0 Mississippi|NP0 River|NN1 and|CJC culminated|NN1* with|PRP
the|AT0 routing|NN1* of|PRF the|AT0 Confederate|AJ0* armies|NN2* in|PRP the|AT0
state|NN1 of|PRF Tennessee|NP0** .|PUN
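
To process such a result further, each token can be split into word and tag. The sketch below also separates the trailing asterisks that appear after some tags in the output above; their meaning is not explained in this document, so the sketch only keeps them apart and does not interpret them. The class and method names are hypothetical.

```java
// Illustrative splitter for one token of the tagged output, e.g. "Tennessee|NP0**".
// Returns {word, tag, markers}, where markers holds any trailing '*' characters.
class TaggedOutputSketch {

    static String[] splitToken(String token) {
        int sep = token.lastIndexOf('|');
        if (sep < 0) {
            throw new IllegalArgumentException("No tag found in token: " + token);
        }
        String word = token.substring(0, sep);
        String tag = token.substring(sep + 1);
        // Strip trailing asterisks from the tag and keep them separately.
        int end = tag.length();
        while (end > 0 && tag.charAt(end - 1) == '*') {
            end--;
        }
        return new String[] { word, tag.substring(0, end), tag.substring(end) };
    }

    public static void main(String[] args) {
        String output = "in|PRP the|AT0 state|NN1 of|PRF Tennessee|NP0** .|PUN";
        for (String token : output.split("\\s+")) {
            String[] parts = splitToken(token);
            System.out.println(parts[0] + "\t" + parts[1] + "\t" + parts[2]);
        }
    }
}
```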
[Freitag, 2004] "Toward Unsupervised Whole-Corpus Tagging," Proceedings of Coling 2004.