Documentation Terminology Extraction


Installation
Introduction
The Extract Panel
The Config Panel
How to use Command Line Version
    Command:
    Examples:
How you can use the TE-Tool in your own program
    Classes and Methods:
    Example:
How to add a new language

 

Installation

A description of how to install a module is available on the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.te.TEPanel

If you want to install a new language, you will have to edit the properties file, as described in the section How to add a new language.

Introduction

The Terminology Extraction Tool extracts technical terms from a given text for use in terminological databases and dictionaries. In order to determine if a given word or sequence of words is terminological, the software uses both statistical and pattern-based methods.

The most important statistical method is the so-called differential analysis which measures the extent to which the frequency of a word w in the given text deviates from its frequency in general usage. The latter frequency is determined using a reference corpus, i.e. a large and well-balanced collection of documents in the given language. The termhood of w is quantified using a statistical significance measure reflecting the significance of the frequency deviation.
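The idea of differential analysis can be illustrated with a small Java sketch. The document does not give the exact formulas used by the tool, so the two functions below are assumptions: a plain ratio of relative frequencies, and a common two-term form of the log-likelihood ratio. The class name, method names and counts are hypothetical.

```java
/** Sketch of two significance measures for differential analysis (assumed formulas). */
final class SignificanceSketch {

    /** Relative frequency in the text divided by relative frequency in the reference corpus. */
    static double frequencyRatio(long freqText, long textSize, long freqCorpus, long corpusSize) {
        double relText = (double) freqText / textSize;
        double relCorpus = (double) freqCorpus / corpusSize;
        return relText / relCorpus;
    }

    /** Two-term log-likelihood ratio comparing observed and expected frequencies. */
    static double logLikelihoodRatio(long freqText, long textSize, long freqCorpus, long corpusSize) {
        double total = (double) freqText + freqCorpus;
        // Expected frequencies if the word were equally common in text and corpus
        double expectedText = textSize * total / (textSize + corpusSize);
        double expectedCorpus = corpusSize * total / (textSize + corpusSize);
        return 2.0 * (term(freqText, expectedText) + term(freqCorpus, expectedCorpus));
    }

    // observed * ln(observed / expected), using the convention 0 * ln(0) = 0
    private static double term(long observed, double expected) {
        return observed == 0 ? 0.0 : observed * Math.log(observed / expected);
    }

    public static void main(String[] args) {
        // Made-up counts: 2 occurrences in a 100-word text, 10 in a 1,000,000-word corpus
        System.out.println(frequencyRatio(2, 100, 10, 1_000_000));
        System.out.println(logLikelihoodRatio(2, 100, 10, 1_000_000));
    }
}
```

A word that is exactly as frequent in the text as in the reference corpus gets a log-likelihood ratio of zero; the more its frequency deviates, the larger the value.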

Pattern-based methods are based on part-of-speech (POS) information. POS are used both to restrict the output units to certain word classes - it is often argued that most technical terms are nouns - and to extract multiword units (or phrases) in a shallow way. This is done by extracting sequences of words that appear frequently together and follow certain POS patterns, e.g. noun + noun ("terminology extraction").
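As a rough illustration of pattern-based phrase extraction, the following Java sketch collects all contiguous word sequences whose tags match a pattern such as "N N". It is not the tool's implementation: the class and method names are hypothetical, the tags are assumed to be already mapped to N, A, V or X, and the frequency check ("appear frequently together") is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: find word sequences that follow a POS pattern such as "N N" or "A N". */
final class PosPatternSketch {

    /** Returns every contiguous word sequence whose tags equal the given pattern. */
    static List<String> match(List<String> words, List<String> tags, String pattern) {
        String[] wanted = pattern.split(" ");
        List<String> phrases = new ArrayList<>();
        for (int start = 0; start + wanted.length <= words.size(); start++) {
            boolean matches = true;
            for (int i = 0; i < wanted.length; i++) {
                if (!wanted[i].equals(tags.get(start + i))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                phrases.add(String.join(" ", words.subList(start, start + wanted.length)));
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> words = List.of("the", "terminology", "extraction", "tool");
        List<String> tags = List.of("X", "N", "N", "N");
        System.out.println(match(words, tags, "N N"));
    }
}
```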

To use the extraction tool, you can either type the text directly into a text area or load it from a file. You can also build and use your own terminology dictionary. In the Config Panel you can adjust many technical parameters to your specific needs.

The first time you use a language, terminology extraction needs some extra time and a little more memory, because the tagger needs binary files for tagging. These files are created the first time the tagger is used. After that, extraction is faster.

The Extract Panel

Here you can extract the terminology from your text, load a dictionary to use during extraction, and build a dictionary.

If you want to use your own dictionary, press the Add button under the text area labeled Use Dictionaries, choose the dictionary file in the window that opens, and press Open. To remove a dictionary, select it and press Remove.

[Figure: the "Use Dictionaries" text box and its components: the list of added dictionaries, the Add button for adding more dictionaries, and the Remove button for removing a selected dictionary.]


To extract the terminology from your text, type or load the text into the large text area on the left. To load a file, use the "Load from File" button, choose the file and press Open. Do not forget to specify the language of the text below the text area. Start the extraction with the "Extract -->" button. The result appears in the large text area on the right.

[Figures: loading a text from a file, choosing the language, and starting the extraction. While the extraction is running, the Extract button acts as a Stop button for stopping the process before it has finished.]

[Figure: example of an extraction result]

The result is shown as a table with four columns:

1st column: the most significant terms of the text.

2nd column: the POS class of the word. This can be A, N, V or a combination of them.

3rd column: the significance of the word or phrase in the text.

4th column: the frequency of the word in the text.

In the text, a significant word is marked in red and a significant phrase is marked in green.

 

If you have extracted terms from a text, you can save the results to a file by pressing "Save to File" and choosing a directory and a file name. You can also add the result of the extraction to an existing dictionary by pressing "Add to dictionary", choosing a dictionary and pressing Append. Since version 1.1, the contexts in which an extracted term appears can be displayed directly by clicking the row of that term. Pressing "Save Contexts" saves the contexts of all terms at once into a directory of your choice. For every term, there will be one file per source document.

[Figure: example of a result saved in a text file. Each line contains the word or phrase, the POS class, the significance score and the frequency.]

[Figure: example of a result saved as a dictionary (part of the dictionary).]

 

The Config Panel

In the Config Panel you can set the parameters of the extraction tool. Generally, the predefined default settings should work well for medium-sized texts (up to 10,000 words); for larger texts, some of the thresholds may need to be adjusted. Each parameter is explained in detail below.

[Figure: example threshold and tokenizer configuration]

Choose the tokenizer option if your text is not already tokenised and you want to use the internal tokenizer.

For the significance measure, decide between likelihood ratio and frequency ratio.

Set the minimum significance a word must have in the text to be shown as a significant term.

Set how many times a phrase must occur in the text to be shown as a significant term.

Set how many times a word must occur in the reference corpus to be shown as a significant term.

Set how many times a word must occur in the text to be shown as a significant term.
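How the three word thresholds could work together is sketched below in Java. This is only an assumption about the filtering logic: the Candidate class is hypothetical and not part of the TE tool, and the sketch assumes that a value equal to a threshold still passes.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: keep only candidates that reach all three thresholds (assumed semantics). */
final class ThresholdSketch {

    /** Hypothetical candidate record; not a class of the TE tool. */
    static final class Candidate {
        final String term;
        final int freqText;
        final int freqCorpus;
        final double significance;

        Candidate(String term, int freqText, int freqCorpus, double significance) {
            this.term = term;
            this.freqText = freqText;
            this.freqCorpus = freqCorpus;
            this.significance = significance;
        }
    }

    static List<Candidate> filter(List<Candidate> candidates,
                                  int minFreqText, int minFreqCorpus, double minSig) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.freqText >= minFreqText
                    && c.freqCorpus >= minFreqCorpus
                    && c.significance >= minSig) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up candidates for illustration
        List<Candidate> all = List.of(
                new Candidate("sherman", 2, 40, 29.4),
                new Candidate("war", 2, 9000, 13.8),
                new Candidate("educator", 1, 1, 12.6));
        for (Candidate c : filter(all, 1, 2, 20.0)) {
            System.out.println(c.term);
        }
    }
}
```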

 

 

[Figure: configuring which words should be shown]

The dictionary option is useful if you want to use a dictionary for extraction. If you choose it, TE adds words that occur both in the text and in the dictionary to the significant terms, even if they are not in the reference corpus. These words get a significance value of -2.

The stemming option should be chosen for base form reduction. For example, "trees" and "tree" are treated as the same word if you choose stemming, but as two different words if you do not.

A further option allows TE to add words with a high frequency in the text to the significant terms even though they are not in the reference corpus. These words get a significance value of -1.

         

 

 

[Figure: configuration of the POS classes]

Choose the nouns option to use nouns as significant terms.

Choose the verbs option to use verbs as significant terms.

Choose the adjectives option to use adjectives as significant terms.

Choose the phrases option to use phrases as significant terms.

In the pattern list, choose which kinds of phrases are used as significant terms.

To add a new phrase pattern, write it in the text field and click the button beneath it. A pattern may only consist of the letters N, A and V, separated by single spaces.

To delete a phrase pattern, select it and click the delete button.
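The rule for phrase patterns (only N, A and V, separated by one space) can be checked with a regular expression. The Java sketch below is a hypothetical helper for validating patterns before entering them; it is not taken from the tool.

```java
import java.util.regex.Pattern;

/** Sketch: check that a phrase pattern consists only of N, A and V separated by single spaces. */
final class PatternSyntaxSketch {

    // One letter out of N, A, V, followed by any number of " <letter>" groups
    private static final Pattern VALID = Pattern.compile("[NAV]( [NAV])*");

    static boolean isValid(String pattern) {
        return VALID.matcher(pattern).matches();
    }

    public static void main(String[] args) {
        for (String p : new String[] {"A N", "N N N", "N  N", "X N"}) {
            System.out.println("\"" + p + "\" valid: " + isValid(p));
        }
    }
}
```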

                                                                                                                                 

 

You can save a configuration of this tool for later use. Click the "Save configuration" button; a new window opens in which you can choose the directory and file name for your configuration. To load a configuration, click the "Load configuration" button and choose the directory and file of the configuration in the window that opens.

[Figure: the "Load configuration" and "Save configuration" buttons]

 

How to use Command Line Version

Command:

To use the command line version, run the following command.
Windows:
java -Xmx500M -classpath .;./lib/ASV_TE.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.te.TETool [OPTIONS] TEXTFILE [TERMSFILE]
Linux:
java -Djava.ext.dirs=.:./lib de.uni_leipzig.asv.toolbox.te.TETool [OPTIONS] TEXTFILE [TERMSFILE]

The following options are available:

Option   Alternative            Description
-l=#     --language=#           Specifies the language of the text; possible values: en, de; default: en
-ft=#    --min_freq_text=#      Sets the minimum frequency a word must have in the text to be taken into account; default: 1
-fc=#    --min_freq_corpus=#    Sets the minimum frequency a word must have in the corpus to be taken into account; default: 2
-ms=#    --min_significance=#   Sets the minimum significance a word must have to be taken into account; default: 20.0
-sm=#    --sig_measure=#        Specifies the significance measure; possible values: lr (likelihood ratio), fr (frequency ratio); default: lr
-s       --show_significances   Shows the significances


Replace TEXTFILE with the path to the input file you want to use.
Replace TERMSFILE with the path to the output file if you want to write the output to a file. If you do not specify a TERMSFILE, the output is written to the screen.

Examples:
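As one possible illustration, the Java sketch below assembles an argument list for the Linux form of the command, using only options from the table above. The class name, the method and the file names input.txt and terms.txt are hypothetical placeholders; the sketch only builds and prints the command and does not run it.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: assemble the command line for TETool from a few settings. */
final class CommandLineSketch {

    static List<String> buildCommand(String language, int minFreqText, double minSignificance,
                                     String textFile, String termsFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Djava.ext.dirs=.:./lib");
        cmd.add("de.uni_leipzig.asv.toolbox.te.TETool");
        cmd.add("--language=" + language);
        cmd.add("--min_freq_text=" + minFreqText);
        cmd.add("--min_significance=" + minSignificance);
        cmd.add("--show_significances");
        cmd.add(textFile);
        if (termsFile != null) {
            cmd.add(termsFile); // optional: without it the terms are written to the screen
        }
        return cmd;
    }

    public static void main(String[] args) {
        // input.txt and terms.txt are placeholder file names
        System.out.println(String.join(" ", buildCommand("de", 2, 15.0, "input.txt", "terms.txt")));
    }
}
```

The resulting list could be passed to java.lang.ProcessBuilder to start the tool from another Java program, provided the working directory is the Toolbox directory.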

How you can use the TE-Tool in your own program

Classes and Methods:

It is easy to use TE in your own program. You only need the classes Indexer and Word, which are both in the package de.uni_leipzig.asv.toolbox.te.indexer.

The following overview lists each method with its parameters, followed by its description.

Class Indexer

new Indexer(Properties props)
    The constructor of the class Indexer. It needs an object of type Properties representing the content of the file “te.properties” in “./config/te”. This file defines all parameters for the available languages.

setLanguage(String language)
    Sets the language which should be used for extraction.

setStemming(boolean stemming)
    Defines whether base form reduction is used or not.

setTagger(int tagger)
    Sets the tagger for the extraction. You should only use Parameters.QT for the parameter tagger.

setSigFormula(int)
    With this method you can set two values. The first is the significance measure: use Parameters.LR for likelihood ratio and Parameters.HQ for frequency ratio. The second is the access method for the corpus: whether the corpus is loaded into RAM (Parameters.RAM) or accessed as a file (Parameters.FILE). Prefer file access, because the corpus may be too big for RAM and file access does not have a strong effect on performance.

setMinFreq(int minfreq)
    Sets how many times a word must occur in the text to be significant.

setCorpusMinFreq(int corpusminFreq)
    Sets how many times a word must occur in the corpus to be significant.

setMinSig(float minsig)
    Sets the minimum significance a word must have to be significant.

setMinFreqPhrases(int minFreq)
    Sets the minimum frequency a phrase must have to be significant.

setPOSPatterns(List<String> posPatterns)
    Sets the POS patterns for the extraction.

prepare(String text, Thread thread)
    Performs the extraction on the given text. The parameter thread is the Thread which does the extraction. It is used to stop the extraction process if needed.

getFilteredTerms(String pos, Thread thread), returns List<Word>
    Returns all significant terms of the specified POS class pos from the text. The parameter thread has the same function as in the method prepare above.

getPhrases(Thread thread), returns List<Word>
    Returns all phrases of the text. For the parameter thread, see the description of the method prepare.

Class Word

getWord_Str(), returns String
    Returns the word as a String.

getPos(), returns String
    Returns the POS class of the word in the text.

getSig(), returns double
    Returns the significance of the word in the text.

getFreq(), returns int
    Returns the frequency of the word in the text.

Example:

Here is an example of a Java class (TETest.java) that uses the TE-Tool. You can find the class TETest.java in the package de.uni_leipzig.asv.toolbox.tests.

package de.uni_leipzig.asv.toolbox.tests;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;

import de.uni_leipzig.asv.toolbox.te.indexer.Indexer;
import de.uni_leipzig.asv.toolbox.te.indexer.Word;
import de.uni_leipzig.asv.toolbox.te.utils.Parameters;

public class TETest {

    /**
     * @param args
     */
    public static void main(String[] args) {

        String language = "en";
        String text = "William Tecumseh Sherman (February 8, 1820 – February 14, 1891) "
            + "was an American soldier, businessman, educator, and author. He served as "
            + "a general in the United States Army during the American Civil War (1861–65), "
            + "receiving both recognition for his outstanding command of military strategy, and "
            + "criticism for the harshness of the scorched earth policies he implemented in "
            + "conducting total war against the Confederate States of America. Military historian "
            + "Basil Liddell Hart famously declared that Sherman was the first modern general.";
        int minFreqText = 1;
        int minFreqCorpus = 1;
        int ratio = Parameters.LR; // Parameters.LR = likelihood ratio, Parameters.HQ = frequency ratio
        float minSig = (float) 10.0;
        TETestThread myThread = new TETest.TETestThread(language, minFreqText, minFreqCorpus, minSig, ratio);
        myThread.extract(text);
    }

    public static class TETestThread extends Thread {
        Indexer indexer;
        String text;

        public TETestThread(String language, int minFreqText, int minFreqCorpus, float minSig, int ratio) {
            // property file with parameters for the different languages
            String propertyfile = "./config/te/te.properties";
            Properties props = new Properties();
            try { // load property file
                props.load(new FileInputStream(propertyfile));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            indexer = new Indexer(props); // create Indexer for extraction
            // set parameters
            indexer.setLanguage(language);
            indexer.setStemming(true);
            indexer.setTagger(Parameters.QT);
            indexer.setMinFreq(minFreqText);
            indexer.setCorpusMinFreq(minFreqCorpus);
            indexer.setMinSig(minSig);
            indexer.setMinFreqPhrases(1);
            ArrayList<String> posPatterns = new ArrayList<String>();
            posPatterns.add("A N");
            posPatterns.add("N N");
            posPatterns.add("N N N");
            indexer.setPOSPatterns(posPatterns);
            indexer.setSigFormula(ratio);           // significance measure
            indexer.setSigFormula(Parameters.FILE); // corpus access method
        }

        public void run() {
            this.indexer.prepare(text, this);
            List<Word> terms = this.indexer.getFilteredTerms("N", this);
            terms.addAll(this.indexer.getFilteredTerms("V", this));
            terms.addAll(this.indexer.getFilteredTerms("A", this));
            terms.addAll(this.indexer.getPhrases(this));

            Iterator<Word> it = terms.iterator();
            System.out.println("term\tpos\tsignificance\tfrequency");
            while (it.hasNext()) {
                Word word = it.next();
                System.out.println(word.getWordStr() + "\t" + word.getPos() + "\t" + word.getSig() + "\t" + word.getFreq());
            }
        }

        public void extract(String text) {
            this.text = text;
            this.start();
        }
    }
}

 

 

When you run this test, you get the following output.

 

Loading trees!
Accessing .\resources\corpora\en_with_wordnumbers.txt and .\resources\corpora\en_wordnumbers_counts.txt
term                          pos    significance         frequency
sherman                       N      29.455941931868438   2
liddel                        N      20.01071701179626    1
harshness                     N      19.519715473945325   1
february                      N      17.62468549492769    2
basil                         N      16.094381230930594   1
confederate                   N      15.345867249008734   1
war                           N      13.806933445390314   2
educator                      N      12.619932897607214   1
historian                     N      12.180977952448302   1
hart                          N      11.927657159088994   1
businessman                   N      11.082113131618826   1
famously                      A      18.19563057169671    1
scorch                        A      15.570666559529855   1
military                      A      14.09592092083767    2
general                       A      12.291852246038616   2
american                      A      11.233365741558373   2
military historian            A N    1.0                  1
earth policy                  N N    1.0                  1
basil liddel                  N N    1.0                  1
civil war                     A N    1.0                  1
unit state                    N N    1.0                  1
unit state army               N N N  1.0                  1
scorch earth                  A N    1.0                  1
military strategy             A N    1.0                  1
state army                    N N    1.0                  1
total war                     A N    1.0                  1
businessman educator          N N    1.0                  1
sherman february              N N    1.0                  1
soldier businessman educator  N N N  1.0                  1
confederate state             N N    1.0                  1
outstand command              A N    1.0                  1
american soldier              A N    1.0                  1
historian basil liddel        N N N  1.0                  1
soldier businessman           N N    1.0                  1
historian basil               N N    1.0                  1
liddel hart                   N N    1.0                  1
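The example class prints one term per line with the four values separated by tab characters. If you want to process that output further, a sketch like the following could parse the lines and sort them by significance. The Row class and the method names are hypothetical and not part of the TE tool.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: parse tab-separated result lines (term, pos, significance, frequency). */
final class OutputParserSketch {

    /** Hypothetical holder for one result line; not a class of the TE tool. */
    static final class Row {
        final String term;
        final String pos;
        final double significance;
        final int frequency;

        Row(String term, String pos, double significance, int frequency) {
            this.term = term;
            this.pos = pos;
            this.significance = significance;
            this.frequency = frequency;
        }
    }

    static Row parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length != 4) {
            throw new IllegalArgumentException("expected 4 tab-separated columns: " + line);
        }
        return new Row(cols[0], cols[1], Double.parseDouble(cols[2]), Integer.parseInt(cols[3]));
    }

    /** Parses all lines and returns the rows ordered by descending significance. */
    static List<Row> sortBySignificance(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(parse(line));
        }
        rows.sort(Comparator.comparingDouble((Row r) -> r.significance).reversed());
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "war\tN\t13.806933445390314\t2",
                "sherman\tN\t29.455941931868438\t2",
                "civil war\tA N\t1.0\t1");
        for (Row r : sortBySignificance(lines)) {
            System.out.println(r.term + " (" + r.pos + "): " + r.significance);
        }
    }
}
```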

How to add a new language

All you need for a new language are a Viterbitagger, trees for base form reduction, stop words, a mapping and a reference corpus.
You can train a Viterbitagger for your language with the Tagger-Tool. For this you need some tagged text.
Example: Train a Finnish tagger from the file horizontal.1Tag.fi100k10000.txt, which you can find in "examples/tagger". The format is horizontal and the text is tagged with only one tag, which stands behind the word. Save the tagger in resources/taggermodels under the name fiTaggerModel.
Now you have to write the mapping. The mapping is a file with two columns separated by a space. The first column contains the tags used by the tagger; the second column contains N for nouns, V for verbs, A for adjectives and X for everything that is neither noun, verb nor adjective. Create a new file and copy the first column of the taglist file (which belongs to the tagger) into it. Now you have all the tags used by your tagger and can write the mapping to N, A, V or X behind each of them.
Example: Create the file fi.map in resources/mappings and copy into it the first column of the file fiTaggerModel.taglist in resources/taggermodels. Then write an A behind the tags A and ADV, an N behind the tag N, a V behind the tag V and an X behind the remaining tags.
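For a long tag list, the mapping lines can also be generated. The Java sketch below follows the Finnish example (A and ADV become A, N becomes N, V becomes V, everything else X). The class and method names are hypothetical, and CONJ is only a made-up tag for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: build mapping file lines ("<tag> <N|V|A|X>") from a list of tagger tags. */
final class TagMappingSketch {

    /** Maps a tagger tag to N, V, A or X, following the Finnish example. */
    static String toNavx(String tag) {
        switch (tag) {
            case "A":
            case "ADV":
                return "A";
            case "N":
                return "N";
            case "V":
                return "V";
            default:
                return "X";
        }
    }

    /** One line per tag: the tag, one space, then the mapped class. */
    static List<String> mappingLines(List<String> tags) {
        List<String> lines = new ArrayList<>();
        for (String tag : tags) {
            lines.add(tag + " " + toNavx(tag));
        }
        return lines;
    }

    public static void main(String[] args) {
        // CONJ is a made-up tag that falls into the X class
        for (String line : mappingLines(List.of("A", "ADV", "N", "V", "CONJ"))) {
            System.out.println(line);
        }
    }
}
```

For another tagger, the cases in toNavx have to be adapted to the tags that the tagger actually uses.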
The next step is to create the files of the reference corpus. For this you need a file named lang_with_wordnumbers.txt and a file named lang_wordnumbers_counts.txt, where lang stands for the code of your language. Both files must have two columns separated by tabs, and in both files the first column contains the word IDs. The second column contains the word in the first file and the frequency of the word in the second file.
After that, use WordServer.jar in the lib directory of the Toolbox to create the files fi_with_wordnumbers.txt.bin, fi_with_wordnumbers.txt.idx, fi_with_wordnumbers.txt.trie and some more, and use cooccaccess.jar to create the files fi_wordnumbers_counts.txt.bin, fi_wordnumbers_counts.txt.idx and fi_wordnumbers_counts.txt.meta (see the example for this step).
Example: Such files can be extracted from the Finnish database called fi100k. Use the following two statements to create the files.
SELECT w_id, word FROM words INTO OUTFILE 'fi_with_wordnumbers.txt' FIELDS TERMINATED BY '\t';
SELECT w_id, freq FROM words INTO OUTFILE 'fi_wordnumbers_counts.txt' FIELDS TERMINATED BY '\t';
Now copy both files into the toolbox directory.
Use the following command to create fi_with_wordnumbers.txt.bin, fi_with_wordnumbers.txt.idx and fi_with_wordnumbers.txt.trie.
java -Xmx512m -jar ./lib/WordServer.jar fi_with_wordnumbers.txt
Use the following command to create fi_wordnumbers_counts.txt.bin, fi_wordnumbers_counts.txt.idx and fi_wordnumbers_counts.txt.meta.
java -Xmx512m -classpath ./lib/cooccaccess.jar de.uni_leipzig.asv.coocc.BinFileMultColPreparer fi_wordnumbers_counts.txt 2
Copy all these files into resources/corpora.
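The counts file has two tab-separated columns (word ID and frequency). The Java sketch below checks that format and sums the frequencies; this sum is the corpus length that is entered in te.properties later in this section. The class and method names are hypothetical, and the sample lines are made up.

```java
import java.util.List;

/** Sketch: validate "<word id>\t<frequency>" lines and sum the frequency column. */
final class CorpusLengthSketch {

    static long corpusLength(List<String> countLines) {
        long sum = 0;
        for (String line : countLines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                throw new IllegalArgumentException("expected 2 tab-separated columns: " + line);
            }
            sum += Long.parseLong(cols[1].trim());
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up sample lines; in practice the lines could be read with
        // java.nio.file.Files.readAllLines from fi_wordnumbers_counts.txt
        List<String> lines = List.of("1\t5", "2\t7", "3\t30");
        System.out.println(corpusLength(lines));
    }
}
```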
In TE, the stemming option can be deactivated, meaning that base form reduction is not used. For this you need a modified reference corpus, because it should also contain words which are not in their base form. The easiest way to create this corpus is to copy the files of the basic reference corpus (the files created in the step above) and rename the copies to fi_with_wordnumbers.txt.full.bin, fi_with_wordnumbers.txt.full.idx, fi_with_wordnumbers.txt.full.trie, fi_wordnumbers_counts.txt.full.bin, fi_wordnumbers_counts.txt.full.idx and fi_wordnumbers_counts.txt.full.meta. Alternatively, you can repeat the creation of the reference corpus above and modify the select statement so that it works on a table containing words in base form as well as inflected forms. In that case, append .full to the names of the output files. (Attention: The commands for creating the .bin, .trie, .idx and .meta files have to be changed so that they use the files created by your select statement!)
Now let us create the stop words. The stop words have to be in a file with one word per line. If you do not know the stop words of your language, just take the 500 most frequent words.
Example: For Finnish, just use the 500 most frequent words from the database. In the database fi100k the words are already ordered by frequency, but the first 100 words are special characters, so use 600 instead of 500 words. Use the following statement to create the stop words file from the database.
SELECT word FROM words LIMIT 600 INTO OUTFILE 'fi_stopwords.txt';
If you use another database in which the words are not ordered by their frequency, you can use this statement.
SELECT word FROM words ORDER BY freq DESC LIMIT 600 INTO OUTFILE 'fi_stopwords.txt';
Copy fi_stopwords.txt into resources/stopwords.
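If no database is at hand, the most frequent words can also be selected in code. The Java sketch below picks the n most frequent words from a word-to-frequency map; the class name, the method name and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: choose the n most frequent words as stop word candidates. */
final class StopWordSketch {

    static List<String> mostFrequent(Map<String, Integer> frequencies, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(frequencies.entrySet());
        // Highest frequency first
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> words = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            words.add(entries.get(i).getKey());
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up frequencies for illustration
        Map<String, Integer> freqs = Map.of("ja", 50, "on", 90, "talo", 3);
        for (String word : mostFrequent(freqs, 2)) {
            System.out.println(word); // one word per line, as required for the stop words file
        }
    }
}
```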
You also need three trees for base form reduction for your language: one for adjectives, one for nouns and one for verbs. The trees should contain rules like "trees 1" or "studied 3y". If you already have such trees, just copy them into resources/trees. Otherwise create them with the Baseforms-Tool and save them in resources/trees.
Example: The files fi-BaseRulesA.txt, fi-BaseRulesV.txt and fi-BaseRulesN.txt in resources/trees/plain contain the data for the trees. Create the trees and save them as fi-adjectives.tree, fi-verbs.tree and fi-nouns.tree in resources/trees.
Finally, you have to change the file te.properties in config/te. After the next start of the Toolbox, your new language will be available for TE.
Change te.properties as shown in the following example for Finnish.
Example: At the top of the file add an entry like
Lang3 = fi
(If 3 already exists, use the next unused number. fi will be shown in the drop-down menu in the TE-Panel.)
Scroll to the end of the file and add the following entries. If you are adding a language other than Finnish, replace fi with the code you use.
fi_wordserverFile = resources/corpora/fi_with_wordnumbers.txt
Enter the path to the file which you processed with WordServer.jar.
fi_cooccIndexFile = resources/corpora/fi_wordnumbers_counts.txt
Enter the path to the file which you processed with cooccaccess.jar.
fi_tagMapFile = resources/mappings/fi.map
Enter the path to the mapping file.
fi_stopWords = resources/stopwords/fi_stopwords.txt
Enter the path to the stop words file.
fi_corpuslength = 43175032
Enter the length of the corpus. This is the sum of the frequencies of all words in the fi_wordnumbers_counts.txt file. With the following statement you can get the number from the database:
SELECT sum(freq) from words;
fi_taggerModel = ./resources/taggermodels/fiTaggerModel.model
Enter the path to the .model file of the Viterbitagger.
fi_reduceNNTree = resources/trees/fi-nouns.tree
Enter the path to the base form reduction tree for nouns.
fi_reduceADJTree = resources/trees/fi-adjectives.tree
Enter the path to the base form reduction tree for adjectives.
fi_reduceVVTree = resources/trees/fi-verbs.tree
Enter the path to the base form reduction tree for verbs.
Start the Toolbox and the language will be available in the TE-Panel.
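The entries for a new language all follow the same naming scheme, so they can be generated from the language code. The Java sketch below produces the nine entries listed above for a given code; the class and method names are hypothetical, and the paths simply repeat the layout of the Finnish example, so adjust them if your files are stored elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: generate the te.properties entries for one language code. */
final class LanguageEntriesSketch {

    static Map<String, String> entries(String lang, long corpusLength) {
        Map<String, String> e = new LinkedHashMap<>(); // keeps the order of the example
        e.put(lang + "_wordserverFile", "resources/corpora/" + lang + "_with_wordnumbers.txt");
        e.put(lang + "_cooccIndexFile", "resources/corpora/" + lang + "_wordnumbers_counts.txt");
        e.put(lang + "_tagMapFile", "resources/mappings/" + lang + ".map");
        e.put(lang + "_stopWords", "resources/stopwords/" + lang + "_stopwords.txt");
        e.put(lang + "_corpuslength", Long.toString(corpusLength));
        e.put(lang + "_taggerModel", "./resources/taggermodels/" + lang + "TaggerModel.model");
        e.put(lang + "_reduceNNTree", "resources/trees/" + lang + "-nouns.tree");
        e.put(lang + "_reduceADJTree", "resources/trees/" + lang + "-adjectives.tree");
        e.put(lang + "_reduceVVTree", "resources/trees/" + lang + "-verbs.tree");
        return e;
    }

    public static void main(String[] args) {
        // Prints the entries in "key = value" form, ready to paste into te.properties
        for (Map.Entry<String, String> entry : entries("fi", 43175032L).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```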

 
