How to use Command Line Version
How you can use the TE-Tool in your own program
A description of how to install a module is available on the main page of the ASV Toolbox project. The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.te.TEPanel
If you want to install a new language, you will have to edit the properties file, as described in the section How to add a new language.
The Terminology Extraction
Tool extracts technical terms from a given text for use in terminological
databases and dictionaries. In order to determine if a given word or sequence
of words is terminological, the software uses both statistical and
pattern-based methods.
The most important
statistical method is the so-called differential analysis which measures
the extent to which the frequency of a word w in the given text deviates from
its frequency in general usage. The latter frequency is determined using a reference
corpus, i.e. a large and well-balanced collection of documents in the given
language. The termhood of w is quantified using a statistical significance
measure reflecting the significance of the frequency deviation.
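As a rough illustration of differential analysis, the following sketch computes two significance values for a single word from four counts: its frequency in the text, the text length, its frequency in the reference corpus, and the corpus length. The log-likelihood formula is the common two-corpus statistic and the frequency ratio is a plain quotient of relative frequencies. This documentation does not state the exact formulas used by TE, so the class name DifferentialAnalysisSketch, the method names and both formulas are assumptions for illustration only.

```java
final class DifferentialAnalysisSketch {

    // Contribution of one observed count given its expected count; 0 when nothing was observed.
    private static double term(double observed, double expected) {
        return observed == 0.0 ? 0.0 : observed * Math.log(observed / expected);
    }

    /**
     * Two-corpus log-likelihood ratio (assumed formula, not taken from the TE source).
     * a = frequency of the word in the text, c = number of tokens in the text,
     * b = frequency of the word in the reference corpus, d = number of tokens in the corpus.
     */
    static double logLikelihood(long a, long c, long b, long d) {
        double expectedText = (double) c * (a + b) / (c + d);
        double expectedCorpus = (double) d * (a + b) / (c + d);
        return 2.0 * (term(a, expectedText) + term(b, expectedCorpus));
    }

    /**
     * Relative frequency in the text divided by relative frequency in the corpus
     * (assumed definition). Yields Infinity if the word never occurs in the corpus.
     */
    static double frequencyRatio(long a, long c, long b, long d) {
        return ((double) a / c) / ((double) b / d);
    }

    public static void main(String[] args) {
        // A word seen 2 times in a 100-token text and 50 times in a 1,000,000-token corpus.
        System.out.println(logLikelihood(2, 100, 50, 1_000_000));
        System.out.println(frequencyRatio(2, 100, 50, 1_000_000));
    }
}
```

A word whose relative frequency is the same in the text and in the corpus gets a log-likelihood of zero in this sketch; the further the two frequencies diverge, the larger the value becomes.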
Pattern-based methods are
based on part-of-speech (POS) information. POS are used both to restrict the
output units to certain word classes - it is often argued that most technical
terms are nouns - and to extract multiword units (or phrases) in a shallow way.
This is done by extracting sequences of words that appear frequently together
and follow certain POS patterns, e.g. noun + noun ("terminology
extraction").
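The pattern step can be sketched as a sliding window over POS-tagged words. The example below assumes that the text has already been tagged and mapped to the classes N, A, V and X, and it only finds the word sequences that match a pattern; counting how often each sequence occurs is left out. The class PosPatternSketch and its method are hypothetical and are not part of the TE API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PosPatternSketch {

    /**
     * Returns every run of consecutive words whose POS classes equal the pattern.
     * words and tags are parallel lists; pattern is, for example, "N N" or "A N".
     */
    static List<String> match(List<String> words, List<String> tags, String pattern) {
        List<String> wanted = Arrays.asList(pattern.trim().split(" "));
        List<String> phrases = new ArrayList<>();
        for (int start = 0; start + wanted.size() <= words.size(); start++) {
            int end = start + wanted.size();
            // Compare the tags in the current window with the pattern.
            if (tags.subList(start, end).equals(wanted)) {
                phrases.add(String.join(" ", words.subList(start, end)));
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "terminology", "extraction", "tool");
        List<String> tags = Arrays.asList("X", "N", "N", "N");
        System.out.println(match(words, tags, "N N"));   // [terminology extraction, extraction tool]
        System.out.println(match(words, tags, "N N N")); // [terminology extraction tool]
    }
}
```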
To use the extraction tool, you can either type the text directly into a text area or load it from a file. You can also build and use your own terminology dictionary. In the Config Panel you can adjust many technical parameters to your specific needs.
The first time you use a language, the terminology extraction needs some extra time and a little more memory, because the tagger needs binary files for tagging. These files are created the first time the tagger is used, so later runs are faster.
Here you can extract the terminology from your text, load a dictionary to use it, and build a dictionary.
If you want to use your own dictionary, press the Add button under the text area labeled Use Dictionaries. In the window that opens, choose the dictionary file and press Open. To remove a dictionary, select it and press Remove.
[Figure: the Use Dictionaries list with an added dictionary, the Remove button for removing a selected dictionary, and the Add button for adding more dictionaries.]
To extract the terminology from your text, load or type the text into the large text area on the left. To load a file, use the "Load from File" button, choose the file and press Open. Do not forget to specify the language of the text under the text area. Start the extraction with the "Extract -->" button. The result appears in the large text area on the right.
[Figure: loading a file, choosing the language, and the Extract button (a Stop button stops the process before it finishes).]
A significant word in the text is marked in red, and a significant phrase is marked in green. The 1st column lists the most significant terms of the text. The 2nd column gives the POS class of the word, which can be A, N, V or a combination of them. The 3rd column shows the significance of the word or phrase in the text. The 4th column contains the frequency of the word in the text.


example for result
After you have extracted terms from a text, you can save the results to a file by pressing "Save to File" and choosing a directory and filename, or add the results to an existing dictionary by pressing "Add to dictionary", choosing a dictionary and pressing Append.
Since version 1.1, the contexts in which the extracted terms appear are directly accessible by clicking the row of the term whose contexts should be shown. Pressing "Save Contexts" saves all contexts for every term at once into a directory of your choice. For every term, there will be one file for every source document.
[Figure: example of a result saved to a file; the columns are frequency, significance score, POS class, and word or phrase.]
[Figure: example of a result saved as a dictionary.]
In the config panel, you can set the parameters of the extraction tool. Generally, the predefined default settings should work rather well for medium-sized texts (up to 10,000 words); for larger texts, some of the thresholds may need to be adjusted. In the following, each parameter is explained in detail.
Choose this option if your text isn't already tokenised and you want to use the internal tokenizer.
Decide between likelihood ratio and frequency ratio for calculating the significance.
Choose here the minimum significance a word must have in the text to be shown as a significant term.
Choose here how many times a phrase must be seen in the text to be shown as a significant term.
Choose here how many times a word must be seen in the reference corpus to be shown as a significant term.
Choose here how many times a word must be seen in the text to be shown as a significant term.

You may need this option if you want to use a dictionary for extraction. If you choose it, TE adds words that occur both in the text and in the dictionary to the significant terms, even though they are not in the reference corpus. These words get a significance value of –2.

Choose this option for base form reduction. For example, trees and tree count as the same word if you choose stemming, but as two different words if you do not.
This option allows TE to add words with a high frequency in the text to the significant terms even though they are not in the reference corpus. These words get a significance value of –1.

Choose this option to use verbs as significant terms.
Choose this option to use nouns as significant terms.
Type a new phrase pattern into this text field and click the button beneath it to add the pattern to the list. The pattern should consist only of Ns, As and Vs, separated by single spaces.
Select a phrase pattern and click this button to delete the pattern.
Choose which kinds of phrases are used as significant terms.
Choose this option to use phrases as significant terms.
Choose this option to use adjectives as significant terms.

Click this button to load a configuration.
Click this button to save the current configuration.
You can save a configuration of this tool for later use. Click the “Save configuration” button; a new window opens in which you can choose the directory and filename for your configuration. Click the “Load configuration” button to load a configuration; in the window that opens, choose the directory and file of the configuration.

To use the command line version, run the following:
Windows: java -Xmx500M -classpath .;./lib/ASV_TE.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.te.TETool [OPTIONS] TEXTFILE [TERMSFILE]
Linux: java -Djava.ext.dirs=.:./lib de.uni_leipzig.asv.toolbox.te.TETool [OPTIONS] TEXTFILE [TERMSFILE]
The following options are available:
| command | alternative | description |
|---|---|---|
| -l=# | --language=# | specifies the language of the text; possible: en, de; default: en |
| -ft=# | --min_freq_text=# | sets the minimum frequency of a word in the text to be taken into account to #; default: 1 |
| -fc=# | --min_freq_corpus=# | sets the minimum frequency of a word in the corpus to be taken into account to #; default: 2 |
| -ms=# | --min_significance=# | sets the minimum significance of a word to be taken into account to #; default: 20.0 |
| -sm=# | --sig_measure=# | specifies the significance measure; possible: lr (likelihood ratio), fr (frequency ratio); default: lr |
| -s | --show_significances | show significances |
Replace TEXTFILE with the path to the input file you want to use. Replace TERMSFILE with the path to the output file if you want to write the output to a file. If you do not specify a TERMSFILE, then the output is written to the screen.
java -Xmx500M -classpath .;./lib/ASV_TE.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.te.TETool -l=de -s ./examples/te/text_deutsch.txt
java -Xmx500M -classpath .;./lib/ASV_TE.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.te.TETool -l=de -s -ft=2 -fc=1 -ms=10.0 -sm=fr ./examples/te/text_deutsch.txt ./examples/te/text_deutsch_output.txt
java -Xmx500M -classpath .;./lib/ASV_TE.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.te.TETool -l=en -ft=2 -fc=1 -ms=10.0 -sm=lr ./examples/te/text_englisch.txt
It is easy to use TE in your own program. You only need the class Indexer, which is in the package de.uni_leipzig.asv.toolbox.te.indexer, and the class Word in the same package.
| Method with parameters | Description |
|---|---|
| Indexer | |
| new Indexer(Properties props) | Constructor of the class Indexer. It needs an object of type Properties representing the content of the file “te.properties” in “./config/te”. This file defines all parameters for the available languages. |
| setLanguage(String language) | Sets the language to be used for extraction. |
| setStemming(boolean stemming) | Defines whether base form reduction is used or not. |
| setTagger(int tagger) | Sets the tagger for the extraction. You should only use Parameters.QT for the parameter tagger. |
| setSigFormula(int) | With this method you can set two values, one per call. The first is the significance measure: use Parameters.LR for likelihood ratio and Parameters.HQ for frequency ratio. The second is the access method for the corpus: whether the corpus should be loaded into RAM (Parameters.RAM) or used as a file (Parameters.FILE). Prefer file access, because the corpus may be too big for RAM and file access has no strong effect on performance. |
| setMinFreq(int minfreq) | Sets how many times a word must be seen in the text to be significant. |
| setCorpusMinFreq(int corpusminFreq) | Sets how many times a word must be seen in the corpus to be significant. |
| setMinSig(float minsig) | Sets the minimum significance a word needs to be significant. |
| setMinFreqPhrases(int minFreq) | Sets the minimum frequency a phrase needs to be significant. |
| setPOSPatterns(List<String> posPatterns) | Sets the POS patterns for the extraction. |
| prepare(String text, Thread thread) | Performs the extraction on the given text. The parameter thread is the Thread that performs the extraction; it is used to stop the extraction process if needed. |
| getFilteredTerms(String pos, Thread thread) | Returns all significant terms of the specified POS class pos from the text. The parameter thread has the same function as in the method prepare above. |
| getPhrases(Thread thread) | Returns all phrases of the text. For the parameter thread, see the description of the method prepare. |
| Word | |
| getWordStr() | Returns the word as a String. |
| getPos() | Returns the POS class of the word in the text. |
| getSig() | Returns the significance of the word in the text. |
| getFreq() | Returns the frequency of the word in the text. |
Here is an example of a Java class (TETest.java) that uses the TE-Tool. You can find the class TETest.java in the package de.uni_leipzig.asv.toolbox.tests.
package de.uni_leipzig.asv.toolbox.tests;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;

import de.uni_leipzig.asv.toolbox.te.indexer.Indexer;
import de.uni_leipzig.asv.toolbox.te.indexer.Word;
import de.uni_leipzig.asv.toolbox.te.utils.Parameters;

public class TETest {

    /**
     * @param args
     */
    public static void main(String[] args) {
        String language = "en";
        String text = "William Tecumseh Sherman (February 8, 1820 – February 14, 1891) "
                + "was an American soldier, businessman, educator, and author. He served as "
                + "a general in the United States Army during the American Civil War (1861–65), "
                + "receiving both recognition for his outstanding command of military strategy, and "
                + "criticism for the harshness of the scorched earth policies he implemented in "
                + "conducting total war against the Confederate States of America. Military historian "
                + "Basil Liddell Hart famously declared that Sherman was the first modern general.";
        int minFreqText = 1;
        int minFreqCorpus = 1;
        // Parameters.LR = likelihood ratio, Parameters.HQ = frequency ratio
        int ratio = Parameters.LR;
        float minSig = (float) 10.0;
        TETestThread myThread = new TETest.TETestThread(language, minFreqText, minFreqCorpus, minSig, ratio);
        myThread.extract(text);
    }

    public static class TETestThread extends Thread {

        Indexer indexer;
        String text;

        public TETestThread(String language, int minFreqText, int minFreqCorpus, float minSig, int ratio) {
            // property file with parameters for the different languages
            String propertyfile = "./config/te/te.properties";
            Properties props = new Properties();
            try { // load property file
                props.load(new FileInputStream(propertyfile));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            indexer = new Indexer(props); // create Indexer for extraction
            // set parameters
            indexer.setLanguage(language);
            indexer.setStemming(true);
            indexer.setTagger(Parameters.QT);
            indexer.setMinFreq(minFreqText);
            indexer.setCorpusMinFreq(minFreqCorpus);
            indexer.setMinSig(minSig);
            indexer.setMinFreqPhrases(1);
            ArrayList<String> posPatterns = new ArrayList<String>();
            posPatterns.add("A N");
            posPatterns.add("N N");
            posPatterns.add("N N N");
            indexer.setPOSPatterns(posPatterns);
            // setSigFormula is called twice: once for the measure, once for the corpus access method
            indexer.setSigFormula(ratio);
            indexer.setSigFormula(Parameters.FILE);
        }

        public void run() {
            this.indexer.prepare(text, this);
            List<Word> terms = this.indexer.getFilteredTerms("N", this);
            terms.addAll(this.indexer.getFilteredTerms("V", this));
            terms.addAll(this.indexer.getFilteredTerms("A", this));
            terms.addAll(this.indexer.getPhrases(this));
            Iterator<Word> it = terms.iterator();
            System.out.println("term\tpos\tsignificance\tfrequency");
            while (it.hasNext()) {
                Word word = it.next();
                System.out.println(word.getWordStr() + "\t" + word.getPos() + "\t" + word.getSig() + "\t" + word.getFreq());
            }
        }

        public void extract(String text) {
            this.text = text;
            this.start();
        }
    }
}
You can now run this test. Below you see its output.
Loading trees!
Accessing .\resources\corpora\en_with_wordnumbers.txt and .\resources\corpora\en_wordnumbers_counts.txt
term	pos	significance	frequency
sherman	N	29.455941931868438	2
liddel	N	20.01071701179626	1
harshness	N	19.519715473945325	1
february	N	17.62468549492769	2
basil	N	16.094381230930594	1
confederate	N	15.345867249008734	1
war	N	13.806933445390314	2
educator	N	12.619932897607214	1
historian	N	12.180977952448302	1
hart	N	11.927657159088994	1
businessman	N	11.082113131618826	1
famously	A	18.19563057169671	1
scorch	A	15.570666559529855	1
military	A	14.09592092083767	2
general	A	12.291852246038616	2
american	A	11.233365741558373	2
military historian	A N	1.0	1
earth policy	N N	1.0	1
basil liddel	N N	1.0	1
civil war	A N	1.0	1
unit state	N N	1.0	1
unit state army	N N N	1.0	1
scorch earth	A N	1.0	1
military strategy	A N	1.0	1
state army	N N	1.0	1
total war	A N	1.0	1
businessman educator	N N	1.0	1
sherman february	N N	1.0	1
soldier businessman educator	N N N	1.0	1
confederate state	N N	1.0	1
outstand command	A N	1.0	1
american soldier	A N	1.0	1
historian basil liddel	N N N	1.0	1
soldier businessman	N N	1.0	1
historian basil	N N	1.0	1
liddel hart	N N	1.0	1
All you need for a new language is a Viterbitagger, trees for base form reduction, stop words, a mapping and a reference corpus.
You can train a Viterbitagger for your language with the Tagger-Tool. For this you need some tagged text.
Example: Train a Finnish tagger from the file horizontal.1Tag.fi100k10000.txt, which you can find in "examples/tagger". The format is horizontal, and the text is tagged with only one tag, which stands behind each word. Save the tagger in resources/taggermodels under the name fiTaggerModel.
Now you have to write the mapping. The mapping is a file with two columns separated by a space. The first column contains the tags used by the tagger, and the second column contains N for nouns, V for verbs, A for adjectives and X for anything that is neither noun, verb nor adjective. Create a new file and copy the first column of the taglist file (which belongs to the tagger) into it. Now you have all tags used by your tagger and can write the mapping to N, A, V or X behind each of them.
Example: Create the file fi.map in resources/mappings and copy into it the first column of the file fiTaggerModel.taglist in resources/taggermodels. Now write an A behind the tags A and ADV, an N behind the tag N, a V behind the tag V and an X behind the remaining tags.
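Before starting the Toolbox, it can help to check that a mapping file has the expected two-column shape. The sketch below parses mapping lines into a map and rejects any class other than N, A, V or X. The class TagMapSketch is hypothetical, the tag CC in the sample is invented for illustration, and how TE itself reads the mapping is not documented here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TagMapSketch {

    /** Parses "TAG CLASS" lines, where CLASS is N, A, V or X, into a map from tag to class. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] columns = line.trim().split("\\s+");
            if (columns.length != 2 || !columns[1].matches("[NAVX]")) {
                throw new IllegalArgumentException("Malformed mapping line: " + line);
            }
            mapping.put(columns[0], columns[1]);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Inline sample in the format described above; CC is a made-up tag.
        List<String> sample = Arrays.asList("A A", "ADV A", "N N", "V V", "CC X");
        System.out.println(parse(sample)); // {A=A, ADV=A, N=N, V=V, CC=X}
    }
}
```

To check a real file, the lines could be read with java.nio.file.Files.readAllLines and passed to parse.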
The next step is creating the files of the reference corpus. For this you need a file named lang_with_wordnumbers.txt and a file named lang_wordnumbers_counts.txt. Both must have two columns separated by tabs, and in both files the first column contains the word IDs. The second column contains the word in the first file and the frequency of the word in the second file. After that, use WordServer.jar in the lib directory of the Toolbox to create the files fi_with_wordnumbers.txt.bin, fi_with_wordnumbers.txt.idx, fi_with_wordnumbers.txt.trie and some more, and cooccaccess.jar to create the files fi_wordnumbers_counts.txt.bin, fi_wordnumbers_counts.txt.idx and fi_wordnumbers_counts.txt.meta (see the example for this step).
Example: Such files can be extracted from the Finnish database called fi100k. Use the following two statements to create the files.
SELECT w_id, word FROM words INTO OUTFILE 'fi_with_wordnumbers.txt' FIELDS TERMINATED BY '\t';
SELECT w_id, freq FROM words INTO OUTFILE 'fi_wordnumbers_counts.txt' FIELDS TERMINATED BY '\t';
Now copy both files into the toolbox directory.
Use the following command to create fi_with_wordnumbers.txt.bin, fi_with_wordnumbers.txt.idx and fi_with_wordnumbers.txt.trie.
java -Xmx512m -jar ./lib/WordServer.jar fi_with_wordnumbers.txt
Use the following command to create fi_wordnumbers_counts.txt.bin, fi_wordnumbers_counts.txt.idx and fi_wordnumbers_counts.txt.meta.
java -Xmx512m -classpath ./lib/cooccaccess.jar de.uni_leipzig.asv.coocc.BinFileMultColPreparer fi_wordnumbers_counts.txt 2
Copy all these files into resources/corpora.
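The counts file also yields the corpus length that te.properties asks for later, namely the sum of all frequencies. If the database is no longer at hand, that sum can be computed from the tab-separated file directly. The following is a minimal sketch under that assumption; the class CorpusLengthSketch and the sample numbers are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class CorpusLengthSketch {

    /** Sums the second column of "id<TAB>frequency" lines. */
    static long corpusLength(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] columns = line.split("\t");
            total += Long.parseLong(columns[1].trim());
        }
        return total;
    }

    public static void main(String[] args) {
        // Three invented rows in the format of a counts file.
        List<String> sample = Arrays.asList("1\t120", "2\t45", "3\t7");
        System.out.println(corpusLength(sample)); // 172
    }
}
```

For a real corpus, the lines of the counts file could be read with java.nio.file.Files.readAllLines, or streamed line by line if the file is large.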
In TE, the stemming option can be deactivated, meaning that base form reduction is not used. For this you need a modified reference corpus, because it should also contain words that are not in their base form. The easiest way to create this corpus is to copy the files of the basic reference corpus (the files created in the step above) and rename the copies to fi_with_wordnumbers.txt.full.bin, fi_with_wordnumbers.txt.full.idx, fi_with_wordnumbers.txt.full.trie, fi_wordnumbers_counts.txt.full.bin, fi_wordnumbers_counts.txt.full.idx and fi_wordnumbers_counts.txt.full.meta. You can also repeat the creation of the reference corpus above and modify the select statement so that it works on a table containing words in base form as well as inflected forms. In that case, append .full to the names of the output files. (Attention: The commands for creating the .bin, .trie, .idx and .meta files have to be changed so that they use the files created by your select statement!)
Now let us create the stop words. The stop words have to be in a file with one word per line. If you do not know the stop words of your language, just take the 500 most frequent words.
Example: For Finnish just use the 500 most frequent words from the database. In the database fi100k the words are already ordered by frequency, but the first 100 words are special characters, so use 600 instead of 500 words. Use the following statement to create the stop words file from the database.
SELECT word FROM words LIMIT 600 INTO OUTFILE 'fi_stopwords.txt';
If you use another database where the words are not ordered by their frequency, sort them in descending order of frequency with this statement.
SELECT word FROM words ORDER BY freq DESC LIMIT 600 INTO OUTFILE 'fi_stopwords.txt';
Copy fi_stopwords.txt into resources/stopwords.
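If no database is available, the same selection can be made in code from a table of word frequencies. The sketch below returns the most frequent words from an in-memory map; the class StopWordSketch and the sample counts are invented for illustration, and writing the result to fi_stopwords.txt with one word per line is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class StopWordSketch {

    /** Returns the words with the highest frequencies, most frequent first. */
    static List<String> mostFrequent(Map<String, Integer> frequencies, int limit) {
        return frequencies.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented counts; a real run would use the frequencies of the reference corpus.
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        frequencies.put("the", 900);
        frequencies.put("and", 500);
        frequencies.put("sauna", 3);
        frequencies.put("of", 700);
        System.out.println(mostFrequent(frequencies, 3)); // [the, of, and]
    }
}
```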
You also need three trees for base form reduction for your language: one for adjectives, one for nouns and one for verbs. The trees should contain rules like "trees 1" or "studied 3y". If you already have such trees, just copy them into resources/trees. Otherwise create them with the Baseforms-Tool and save them in resources/trees.
Example: The files fi-BaseRulesA.txt, fi-BaseRulesV.txt and fi-BaseRulesN.txt in resources/trees/plain contain the data for the trees. Create the trees and save them as fi-adjectives.tree, fi-verbs.tree and fi-nouns.tree in resources/trees.
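The two sample rules suggest a simple reading: the number says how many characters to cut from the end of the word form, and any letters after the number are appended to what remains. That reading is an assumption drawn from the examples "trees 1" and "studied 3y", not a documented specification of the Baseforms-Tool. Under that assumption, applying a rule could look like this (BaseFormRuleSketch is a hypothetical class):

```java
final class BaseFormRuleSketch {

    /**
     * Applies a rule such as "1" or "3y" to a word form: cut the leading number of
     * characters from the end of the word, then append the letters after the number.
     */
    static String apply(String word, String rule) {
        int digits = 0;
        while (digits < rule.length() && Character.isDigit(rule.charAt(digits))) {
            digits++;
        }
        int cut = digits == 0 ? 0 : Integer.parseInt(rule.substring(0, digits));
        if (cut > word.length()) {
            throw new IllegalArgumentException("Rule cuts more characters than the word has: " + rule);
        }
        String suffix = rule.substring(digits);
        return word.substring(0, word.length() - cut) + suffix;
    }

    public static void main(String[] args) {
        System.out.println(apply("trees", "1"));    // tree
        System.out.println(apply("studied", "3y")); // study
    }
}
```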
Finally, you have to change the file te.properties in config/te. After the next start of the Toolbox your new language will be available for TE. Change te.properties as shown in the following example for Finnish.
Example: At the top of the file add an entry like Lang3 = fi (if 3 already exists, use the next unused number; fi will be shown in the dropdown menu in the TE-Panel).
Scroll to the end of the file and add the following entries. If you do not use fi, or are creating another language, then replace fi with the code you use.
fi_wordserverFile = resources/corpora/fi_with_wordnumbers.txt
Enter here your path to the file which you processed with WordServer.jar.
fi_cooccIndexFile = resources/corpora/fi_wordnumbers_counts.txt
Enter here your path to the file which you processed with cooccaccess.jar.
fi_tagMapFile = resources/mappings/fi.map
Enter here your path to the mapping file.
fi_stopWords = resources/stopwords/fi_stopwords.txt
Enter here your path to the stop words file.
fi_corpuslength = 43175032
Enter here the length of the corpus. It is the sum of all frequencies of the words in the fi_wordnumbers_counts.txt file. With the following statement you can get the number from the database: SELECT sum(freq) from words;
fi_taggerModel = ./resources/taggermodels/fiTaggerModel.model
Enter here your path to the .model file of the Viterbitagger.
fi_reduceNNTree = resources/trees/fi-nouns.tree
Enter here your path to the base form reduction tree for nouns.
fi_reduceADJTree = resources/trees/fi-adjectives.tree
Enter here your path to the base form reduction tree for adjectives.
fi_reduceVVTree = resources/trees/fi-verbs.tree
Enter here your path to the base form reduction tree for verbs.
Start the Toolbox and the language will be available in the TE-Panel.
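Because a single missing entry can keep the new language from working, it may be worth checking the properties before starting the Toolbox. The sketch below lists which of the per-language keys from the Finnish example are absent or empty. The key suffixes are taken from the entries above; the class LanguagePropertiesCheck is hypothetical, and whether TE requires every one of these keys is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class LanguagePropertiesCheck {

    // Key suffixes taken from the Finnish example entries in this section.
    static final String[] SUFFIXES = {
        "wordserverFile", "cooccIndexFile", "tagMapFile", "stopWords", "corpuslength",
        "taggerModel", "reduceNNTree", "reduceADJTree", "reduceVVTree"
    };

    /** Returns the keys for the given language code that are missing or empty. */
    static List<String> missingKeys(Properties props, String language) {
        List<String> missing = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            String key = language + "_" + suffix;
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Deliberately incomplete configuration to show what the check reports.
        Properties props = new Properties();
        props.setProperty("fi_wordserverFile", "resources/corpora/fi_with_wordnumbers.txt");
        props.setProperty("fi_corpuslength", "43175032");
        System.out.println(missingKeys(props, "fi"));
    }
}
```

In a real check, the Properties object would be loaded from ./config/te/te.properties in the same way as in TETest.java.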