A description of how to install a module is available on the main page of the ASV Toolbox project. The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.baseforms.Baseforms
The Baseforms tool reduces inflected word forms to their base form and splits compound nouns using Compact Patricia Trie (CPT) classifiers. These can be trained in the Pretree tool and also in the Baseforms tool itself.
This panel lists the authors of the tool and contains a link to this documentation.
Here, base form reduction can be carried out using either pre-trained data or your own data.
First, select the language and the part of speech. When “own data” is selected, the user is asked to specify a file in Pretree format.

Figure 1:
Base form reduction panel
Then enter a word in the text field and press “Reduce”. The inflected form and the reduced form will appear in the table below. The table can be saved to a file in tab-separated format or cleared, and a one-word-per-line file can be loaded to reduce multiple word forms at once.
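The batch workflow described above (one word per line in, tab-separated pairs out) can also be pictured in code. The following is only a minimal sketch: the class `BatchReduceSketch`, the method `reduceAll` and the toy reducer are hypothetical and not part of the toolbox. In practice the reducer would wrap a trained classifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; not part of the ASV Toolbox.
class BatchReduceSketch {

    // Turns one-word-per-line input into "inflected<TAB>reduced" rows.
    static List<String> reduceAll(List<String> lines, UnaryOperator<String> reducer) {
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            String word = line.trim();
            if (word.isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(word + "\t" + reducer.apply(word));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Toy reducer (assumption): strips a trailing "n".
        // A real reducer would use trained data instead.
        UnaryOperator<String> toy =
                w -> w.endsWith("n") ? w.substring(0, w.length() - 1) : w;
        for (String row : reduceAll(List.of("Baumschulen", "", "Haus"), toy)) {
            System.out.println(row);
        }
    }
}
```

Passing the reducer as a function keeps the file handling independent of whichever classifier is actually used.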
To add more languages, edit the file config/baseforms/baseforms.properties.
This panel allows you to split compounds into their consecutive parts, reducing the parts to their base form in the process. Not all languages form compounds by joining several words into one; compounding is used, for example, in German, the Scandinavian languages, Finnish and Korean.
First, specify the language or whether you want to use your own data. Compound decomposition is implemented by recursively applying CPT classifiers to split parts from the beginning and the end of the word. Once the parts are identified, they are reduced to base form by applying a POS-independent base form reducer.
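The recursive scheme described above can be illustrated with a small sketch. It is not the toolbox implementation: the trained CPT classifiers are replaced by fixed lists of known word beginnings and endings, the base form reducer by a lookup table, and all names (`CompoundSplitSketch`, `split`, `decompose`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; lists and lookup table stand in for trained classifiers.
class CompoundSplitSketch {
    // Toy stand-in (assumption) for the classifier that splits at the word beginning.
    static final List<String> HEADS = List.of("Baum");
    // Toy stand-in (assumption) for the classifier that splits at the word end.
    static final List<String> TAILS = List.of("verzeichnis");
    // Toy stand-in (assumption) for the POS-independent base form reducer.
    static final Map<String, String> BASE = Map.of("schulen", "schule");

    // Recursively splits a known part off the beginning, then off the end.
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        for (String head : HEADS) {
            if (word.length() > head.length() && word.startsWith(head)) {
                parts.add(head);
                parts.addAll(split(word.substring(head.length())));
                return parts;
            }
        }
        for (String tail : TAILS) {
            if (word.length() > tail.length() && word.endsWith(tail)) {
                parts.addAll(split(word.substring(0, word.length() - tail.length())));
                parts.add(tail);
                return parts;
            }
        }
        parts.add(word); // nothing left to split off
        return parts;
    }

    // Splits the word, then reduces every part to its base form.
    static List<String> decompose(String word) {
        List<String> reduced = new ArrayList<>();
        for (String part : split(word)) {
            reduced.add(BASE.getOrDefault(part, part));
        }
        return reduced;
    }

    public static void main(String[] args) {
        System.out.println(decompose("Baumschulen"));            // [Baum, schule]
        System.out.println(decompose("Baumschulenverzeichnis")); // [Baum, schule, verzeichnis]
    }
}
```

The sketch ignores linking elements and matches case-sensitively; a trained classifier generalizes to unseen words, which a fixed list cannot do.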

Figure 2:
Compound noun decomposition panel
Figure 2 shows three examples for German. “Prüfungszeitstressfaktor” is split into its parts, referring to “the factor of stress observed in the examination period”. Phonetic linking elements that are put between some parts, as in “Schifffahrt-s-gesellschaft”, are pruned. Notice that when splitting “Häuserkämpfe”, the parts “Häuser” (houses) and “kämpfe” (battles) are reduced to their base form.
It is possible to save and clear the table, as well as to load a one-compound-per-line file.
To add more languages, edit the file config/baseforms/baseforms.properties.
In this panel, data for base form reduction can be trained; alternatively, the generic CPT tool in the Pretree panel can be used for this.

Figure 3:
Base form training
The process is as follows. Enter some inflected word forms in the text field and press “Add Word”. These words will appear in the “unconfirmed words” table. The “base form” column is editable; enter the corresponding base form there. Then select all words and accept them by pressing the “accept selected” arrow. The words with their correct base forms will appear in the “confirmed words” table.
In subsequent rounds of entering base forms, you can make use of the generalization power of the classifier: add some more words to the “unconfirmed words” table and press “Classify List”. The classifier, trained on the confirmed words, will fill in the base form in the respective column. Correct any errors, select the words, accept them and continue.
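To give an idea of how a classifier trained on confirmed words can generalize to new ones, here is a simplified sketch. It is not the CPT implementation used by the tool: it stores one reduction rule (cut some trailing characters, append a string) per confirmed word and applies the rule of the confirmed word that shares the longest ending with the new word. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a trained classifier; not the toolbox's CPT.
class SuffixRuleSketch {

    // A reduction rule: cut `cut` trailing characters, then append `append`.
    static final class Rule {
        final int cut;
        final String append;
        Rule(int cut, String append) { this.cut = cut; this.append = append; }
    }

    private final Map<String, Rule> confirmed = new LinkedHashMap<>();

    // Derives a rule from a confirmed (word, base form) pair and stores it.
    void confirm(String word, String base) {
        int common = 0;
        while (common < word.length() && common < base.length()
                && word.charAt(common) == base.charAt(common)) {
            common++;
        }
        confirmed.put(word, new Rule(word.length() - common, base.substring(common)));
    }

    // Number of characters two words share at their ends.
    static int commonSuffix(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length()
                && a.charAt(a.length() - 1 - n) == b.charAt(b.length() - 1 - n)) {
            n++;
        }
        return n;
    }

    // Applies the rule of the confirmed word with the longest shared ending.
    String classify(String word) {
        Rule best = null;
        int bestLen = 0;
        for (Map.Entry<String, Rule> e : confirmed.entrySet()) {
            int len = commonSuffix(word, e.getKey());
            if (len > bestLen) {
                bestLen = len;
                best = e.getValue();
            }
        }
        if (best == null || best.cut > word.length()) {
            return word; // no usable rule: leave the word unchanged
        }
        return word.substring(0, word.length() - best.cut) + best.append;
    }

    public static void main(String[] args) {
        SuffixRuleSketch c = new SuffixRuleSketch();
        c.confirm("Schulen", "Schule");
        c.confirm("Kinder", "Kind");
        System.out.println(c.classify("Blumen")); // Blume
        System.out.println(c.classify("Bilder")); // Bild
    }
}
```

Wrong suggestions of such a classifier correspond to the entries that have to be corrected by hand before accepting them.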
The resulting tree can be exported and used for base form reduction.
When setting up a new language, first sort the words by part of speech, as base form reduction is usually part-of-speech dependent. Then proceed by word frequency in descending order.
Compound noun training works similarly to base form training: enter a compound, indicate its correct split by using space characters as split delimiters, accept these compounds as confirmed compounds, and use the generalization power of the classifier to classify newly entered compounds, thereby building up your compound splitter.

Figure 4:
Compound noun training panel
As indicated in the section on the compound noun decomposition panel, two CPTs are trained, one for word beginnings and one for word endings; both can be exported and used further.
You should have a POS-independent base form reduction ready and specify it here; otherwise, do not use base form reduction in this panel.
To start the command line version of this tool, use the following command:
java -classpath .;./lib/ASV_Baseforms.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.baseforms.BaseformsCL command option [option ...]
-help          print this help
-br            baseform reduction
-cnd           compound noun decomposition
-i word        input a word
-if file       input a word file
-o             output on screen
-of file       output to the specified file
-l lang        choose a language
-rt file       load a reduction tree from the specified file
The following option applies only to baseform reduction:
-wf wordform   choose a word form
The following options apply only to compound noun decomposition:
-ft file       load a forward tree from the specified file
-bt file       load a backward tree from the specified file
It is easy to use Baseforms in your own program. The classes and methods you need to know are listed below.
| Class (package) | Description and methods |
|---|---|
| Pretree (de.uni_leipzig.asv.utils) | This class is for loading and using a pretree. Baseform reduction uses pretrees which are trained for reduction. Methods: load(String filename) loads a pretree from a file; classify(String word) classifies the word and returns the classification. |
| Zerleger2 (de.uni_leipzig.asv.toolbox.baseforms) | This class is for splitting compound nouns. It uses 3 pretrees for this (forward tree, backward tree and reduction tree). Methods: Vector<String> kZerlegung(String word) returns the splitting of word as a Vector of Strings. |
Here is an example of a Java class (BaseformsTest.java) that uses the Baseforms tool. You can find the class BaseformsTest.java in the package de.uni_leipzig.asv.toolbox.tests.
package de.uni_leipzig.asv.toolbox.tests;

import de.uni_leipzig.asv.toolbox.baseforms.Zerleger2;
import de.uni_leipzig.asv.utils.Pretree;

public class BaseformsTest {
    public static void main(String[] args) {
        // reduction tree file for baseform reduction
        String redbase = "./resources/trees/de-nouns.tree";
        // reduction tree file for splitting
        String red = "./resources/trees/grfExt.tree";
        // forward tree file
        String forw = "./resources/trees/kompVVic.tree";
        // backward tree file
        String back = "./resources/trees/kompVHic.tree";
        // pretree for baseform reduction
        Pretree pretree = new Pretree();
        // splitter
        Zerleger2 zer = new Zerleger2();
        // word for reduction and splitting
        String word = "Baumschulen";

        pretree.load(redbase);
        zer.init(forw, back, red);
        System.out.println(word + " is reduced to " + pretree.classify(word));

        // kZerlegung returns a Vector; strip the brackets and commas of its string form
        String splitted = "" + zer.kZerlegung(word);
        splitted = splitted.replaceAll("\\[", "");
        splitted = splitted.replaceAll("\\]", "");
        splitted = splitted.replaceAll(",", "");
        System.out.println(word + " is splitted into " + splitted);
    }
}
Running this test produces the following output:
Loading ./resources/trees/kompVVic.tree ...loaded
Loading ./resources/trees/kompVHic.tree ...loaded
Loading ./resources/trees/grfExt.tree ...loaded
Baumschulen is reduced to 1
Baumschulen is splitted into Baum schule
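The output above suggests that classify returns a reduction rule (“1”) rather than the base form itself. This documentation does not specify the rule format. Purely as an assumption, the sketch below treats a rule as a number of trailing characters to cut, optionally followed by characters to append, and shows how such a rule could be applied. The class `RuleApplySketch` and the method `apply` are hypothetical and not part of the toolbox.

```java
// Illustrative sketch; the rule format "<digits><suffix>" is an assumption.
class RuleApplySketch {

    // Cuts the given number of trailing characters and appends the rest of the rule.
    static String apply(String word, String rule) {
        int i = 0;
        while (i < rule.length() && Character.isDigit(rule.charAt(i))) {
            i++;
        }
        if (i == 0) {
            return word; // no leading number: leave the word unchanged
        }
        int cut = Integer.parseInt(rule.substring(0, i));
        if (cut > word.length()) {
            return word; // rule does not fit this word
        }
        return word.substring(0, word.length() - cut) + rule.substring(i);
    }

    public static void main(String[] args) {
        System.out.println(apply("Baumschulen", "1")); // Baumschule
    }
}
```

Under this assumption, the rule “1” turns “Baumschulen” into “Baumschule”, which agrees with the part “schule” in the splitting output.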
The Compact Patricia Tree data structure can be found in
The compound splitter was used for generating features for document classification in:
· Witschel, F., Biemann, C. (2005): Rigorous dimensionality reduction through linguistically motivated feature selection for text categorisation. Proceedings of NODALIDA 2005, Joensuu, Finland
The base form reduction step is described (for Norwegian) in:
· Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C. (2006): Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL, Turku, Finland