Baseforms Tool

Installation
Introduction
Welcome Panel
Base Form Reduction
Compound Noun Decomposition
Base Form Training
Compound Noun Training
Command Line version
Command:
Commands:
Options:
Examples:
Usage in your own program
Classes and Methods
Example:
References

 

Installation

A description of how to install a module is available on the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.baseforms.Baseforms

Introduction

The Baseforms tool reduces inflected word forms to their base forms and splits compound nouns, using Compact Patricia Trie (CPT) classifiers for both tasks. The classifiers can be trained in the Pretree tool or in the Baseforms tool itself.

Welcome Panel

This panel lists the authors of the tool and contains a link to this documentation.

 

Base Form Reduction

In this panel, base form reduction can be carried out using either pre-trained data or your own data.

First, select the language and the part of speech. If you select “own data”, you are asked to specify a file in Pretree format.

Figure 1: Base form reduction panel

 

Then enter a word in the text field and press “Reduce”. The inflected form and its reduced form appear in the table below. The table can be saved to a file in tab-separated format or cleared, and a one-word-per-line file can be loaded to reduce many word forms at once.
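The batch workflow described above (read a one-word-per-line file, reduce each word, write a tab-separated table) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class BatchReduce, its method reduceLines and the toy reducer are hypothetical names that are not part of the toolbox. In a real program the reducer would wrap a trained classifier.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of the ASV Toolbox.
final class BatchReduce {

    // Turns one-word-per-line input into "inflected<TAB>reduced" rows.
    static List<String> reduceLines(List<String> words, UnaryOperator<String> reducer) {
        List<String> rows = new ArrayList<>();
        for (String raw : words) {
            String word = raw.trim();
            if (word.isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(word + "\t" + reducer.apply(word));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder reducer (assumption): strips a final "n".
        // Replace it with a call into a real, trained classifier.
        UnaryOperator<String> toyReducer =
                w -> w.endsWith("n") ? w.substring(0, w.length() - 1) : w;

        // Read the word list from the file given as first argument,
        // or fall back to two sample words.
        List<String> words = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("Baumschulen", "Haus");

        for (String row : reduceLines(words, toyReducer)) {
            System.out.println(row);
        }
    }
}
```

Passing the reducer as a function keeps the file handling independent of how the base form is actually computed.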

 

To add more languages, edit the file config/baseforms/baseforms.properties.

 

Compound Noun Decomposition

This panel splits compounds into their consecutive parts and reduces each part to its base form. Not all languages form compounds by joining several words into one; compounding is used, for example, in German, the Scandinavian languages, Finnish and Korean.

First, specify the language or whether you want to use your own data. Compound decomposition is implemented by recursively applying CPT classifiers to split parts from the beginning and the end of the word. Once the parts are identified, they are reduced to base form by applying a POS-independent base form reducer.
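The recursive scheme described above can be illustrated with a small, self-contained sketch. It is not the toolbox implementation: the class ToySplitter is hypothetical, and the forward classifier, backward classifier and base form reducer are replaced by hand-written toy functions that know only a few German word parts. The sketch assumes that a classifier returns the length of the part to split off, or 0 if it proposes no split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;
import java.util.function.UnaryOperator;

// Toy illustration of recursive compound splitting; not the toolbox code.
final class ToySplitter {
    private final ToIntFunction<String> forward;   // length of first part, 0 = no split
    private final ToIntFunction<String> backward;  // length of last part, 0 = no split
    private final UnaryOperator<String> reducer;   // POS-independent base form reducer

    ToySplitter(ToIntFunction<String> forward, ToIntFunction<String> backward,
                UnaryOperator<String> reducer) {
        this.forward = forward;
        this.backward = backward;
        this.reducer = reducer;
    }

    // Splits the word recursively, then reduces every part to its base form.
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        collect(word, parts);
        List<String> reduced = new ArrayList<>();
        for (String part : parts) {
            reduced.add(reducer.apply(part));
        }
        return reduced;
    }

    private void collect(String word, List<String> out) {
        int head = forward.applyAsInt(word);
        if (head > 0 && head < word.length()) {
            out.add(word.substring(0, head));          // split off the beginning
            collect(word.substring(head), out);        // recurse on the rest
            return;
        }
        int tail = backward.applyAsInt(word);
        if (tail > 0 && tail < word.length()) {
            collect(word.substring(0, word.length() - tail), out);
            out.add(word.substring(word.length() - tail)); // split off the ending
            return;
        }
        out.add(word); // no further split proposed
    }

    // Hand-written stand-ins for trained classifiers (toy data, assumption).
    static ToySplitter demo() {
        ToIntFunction<String> fwd = w -> w.startsWith("Baum") ? 4 : 0;
        ToIntFunction<String> bwd = w -> w.endsWith("buch") ? 4 : 0;
        Map<String, String> baseForms = new HashMap<>();
        baseForms.put("schulen", "schule");
        baseForms.put("Kinder", "Kind");
        return new ToySplitter(fwd, bwd, p -> baseForms.getOrDefault(p, p));
    }

    public static void main(String[] args) {
        System.out.println(demo().split("Baumschulen"));
        System.out.println(demo().split("Kinderbuch"));
    }
}
```

Every recursive call works on a strictly shorter string, so the recursion always terminates.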

 

Figure 2: Compound noun decomposition panel

 

Figure 2 shows three examples for German. “Prüfungszeitstressfaktor” is split into its parts, referring to “the factor of stress observed in the examination period”. Phonetic elements inserted between some parts, such as the “s” in “Schifffahrt-s-gesellschaft”, are pruned. Note that when “Häuserkämpfe” is split, the parts “Häuser” (houses) and “kämpfe” (battles) are reduced to their base forms.

It is possible to save and clear the table, as well as to load a one-compound-per-line file.

 

To add more languages, edit the file config/baseforms/baseforms.properties.

 

 

Base Form Training

In this panel, you can train a classifier for base form reduction. Alternatively, the generic CPT tool in the Pretree panel can be used for the same purpose.

 

Figure 3: Base form training

 

The process is as follows. Enter some inflected word forms in the text field and press “Add Word”. These words appear in the “unconfirmed words” table. The “base form” column is editable: enter the corresponding base form for each word there. Then select all words and accept them by pressing the “accept selected” arrow. The words and their correct base forms now appear in the “confirmed words” table.

Now, in subsequent steps of entering base forms, you can make use of the generalization power of the classifier: Add some more words in the “unconfirmed words” table and press “Classify List”. The classifier will enter the base form in the respective column, based on the classifier trained from the confirmed words. Edit the errors, select, accept and continue.
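The confirm-then-classify loop can be pictured with a deliberately simplified classifier. The sketch below is not a Compact Patricia Trie and not the toolbox code: ToySuffixClassifier is a hypothetical class that stores one rule per word ending and answers with the rule of the longest known ending. The rule format (number of characters to cut, followed by the string to append) is an assumption made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a trained classifier; NOT the toolbox's Compact Patricia Trie.
final class ToySuffixClassifier {
    // Maps a word ending to the rule first seen with it among the confirmed words.
    private final Map<String, String> ruleBySuffix = new HashMap<>();

    // Assumed rule format: number of characters to cut, then the string to append.
    static String deriveRule(String inflected, String base) {
        int common = 0;
        int limit = Math.min(inflected.length(), base.length());
        while (common < limit && inflected.charAt(common) == base.charAt(common)) {
            common++;
        }
        return (inflected.length() - common) + base.substring(common);
    }

    // "Accept" a confirmed pair: remember its rule for every ending of the word.
    // Simplification: on conflicts, the rule confirmed first is kept.
    void confirm(String inflected, String base) {
        String rule = deriveRule(inflected, base);
        for (int i = 0; i < inflected.length(); i++) {
            ruleBySuffix.putIfAbsent(inflected.substring(i), rule);
        }
    }

    // "Classify": return the rule of the longest known ending, or "0" (no change).
    String classify(String word) {
        for (int i = 0; i < word.length(); i++) {
            String rule = ruleBySuffix.get(word.substring(i));
            if (rule != null) {
                return rule;
            }
        }
        return "0";
    }

    public static void main(String[] args) {
        ToySuffixClassifier classifier = new ToySuffixClassifier();
        classifier.confirm("Schulen", "Schule"); // learns rule "1"
        classifier.confirm("Frauen", "Frau");    // learns rule "2"
        // Unseen words are classified by their longest known ending.
        System.out.println("Baumschulen -> " + classifier.classify("Baumschulen"));
        System.out.println("Jungfrauen -> " + classifier.classify("Jungfrauen"));
    }
}
```

The more confirmed words share an ending, the more unseen words that ending covers, which is why proposals improve as the confirmed list grows.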

The resulting tree can be exported and used for base form reduction.

 

When setting up a new language, first sort the words according to part-of-speech, as base form reduction is usually part-of-speech dependent. Proceed by word frequency in descending order.

Compound Noun Training

Compound noun training works similarly to base form training: enter a compound, mark its correct split using space characters as delimiters, accept the compounds as confirmed, and then use the generalisation power of the classifier to classify newly entered compounds, thereby building up your compound splitter.

Figure 4: compound noun training panel

 

As indicated in the section for the compound noun decomposition panel, two CPTs for word beginnings and endings are trained, which can be exported and used further.

You should have a POS-independent base form reducer ready and specify it here; otherwise, do not use base form reduction in this panel.

 

Command Line version

Command:

To start the command-line version of this tool, use the following command (the “;” path separator shown is for Windows; on Unix-like systems use “:” instead):

java -classpath .;./lib/ASV_Baseforms.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.baseforms.BaseformsCL command option [option ...]

Commands:

-help       print this help
-br         base form reduction
-cnd        compound noun decomposition

Options:

-i word     input a single word
-if file    input a word file
-o          print the output to the screen
-of file    write the output to the specified file
-l lang     choose a language
-rt file    load a reduction tree from the specified file

The following option applies only to base form reduction:

-wf wordform    choose a word form

The following options apply only to compound noun decomposition:

-ft file    load a forward tree from the specified file
-bt file    load a backward tree from the specified file

Examples:

 

Usage in your own program

Classes and Methods

It is easy to use Baseforms in your own program. Here are all classes and methods you need to know.

 

 

Class (package): Pretree (de.uni_leipzig.asv.utils)

Description: This class loads and applies a pretree. Base form reduction uses pretrees that have been trained for reduction.

Methods:

load(filename) - loads a pretree from a file

String classify(String word) - classifies the word and returns the classification

Class (package): Zerleger2 (de.uni_leipzig.asv.toolbox.baseforms)

Description: This class splits compound nouns. It uses three pretrees for this: a forward tree, a backward tree and a reduction tree.

Methods:

init(String forwardtree, String backwardtree, String reducetree) - initializes the instance and loads the specified trees for splitting

Vector<String> kZerlegung(String word) - returns the split of word as a Vector of Strings

 

Example:

Here is an example of a Java class (BaseformsTest.java) that uses the Baseforms tool. You can find the class in the package de.uni_leipzig.asv.toolbox.tests.

package de.uni_leipzig.asv.toolbox.tests;

import de.uni_leipzig.asv.toolbox.baseforms.Zerleger2;
import de.uni_leipzig.asv.utils.Pretree;

public class BaseformsTest {

    public static void main(String[] args) {
        // reduce file for baseform
        String redbase = "./resources/trees/de-nouns.tree";
        // reduce file for splitting
        String red = "./resources/trees/grfExt.tree";
        // forward file
        String forw = "./resources/trees/kompVVic.tree";
        // backward file
        String back = "./resources/trees/kompVHic.tree";

        // pretree for baseform reduction
        Pretree pretree = new Pretree();
        // splitter
        Zerleger2 zer = new Zerleger2();

        // word for reduction and splitting
        String word = "Baumschulen";

        pretree.load(redbase);
        zer.init(forw, back, red);

        System.out.println(word + " is reduced to " + pretree.classify(word));

        // turn the Vector's "[a, b]" representation into "a b"
        String splitted = "" + zer.kZerlegung(word);
        splitted = splitted.replaceAll("\\[", "");
        splitted = splitted.replaceAll("\\]", "");
        splitted = splitted.replaceAll(",", "");

        System.out.println(word + " is splitted into " + splitted);
    }
}

 

Running this test produces the following output:

 

Loading ./resources/trees/kompVVic.tree ...loaded

Loading ./resources/trees/kompVHic.tree ...loaded

Loading ./resources/trees/grfExt.tree ...loaded

Baumschulen is reduced to 1

Baumschulen is splitted into Baum schule
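Note that the first output line shows the classification “1” rather than a base form. One plausible reading, which is an assumption and not stated on this page, is that the classification is a reduction rule of the form "number of trailing characters to cut, followed by the string to append". Under that assumption, a rule could be applied as in the following sketch; the class ReductionRule and its method apply are hypothetical and not part of the toolbox.

```java
// Hypothetical helper; the rule format is an assumption, not documented here.
final class ReductionRule {

    // Applies a rule such as "1" (cut one character) or "2us"
    // (cut two characters, then append "us") to a word.
    static String apply(String word, String rule) {
        int digits = 0;
        while (digits < rule.length() && Character.isDigit(rule.charAt(digits))) {
            digits++;
        }
        if (digits == 0) {
            return word; // not a rule in the assumed format: leave the word unchanged
        }
        int cut = Integer.parseInt(rule.substring(0, digits));
        if (cut > word.length()) {
            return word; // rule does not fit this word
        }
        return word.substring(0, word.length() - cut) + rule.substring(digits);
    }

    public static void main(String[] args) {
        // With the classification from the test output above:
        System.out.println(apply("Baumschulen", "1")); // prints Baumschule
    }
}
```

If the classification returned by your own trees looks different, check the tree's training data before relying on this interpretation.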

 

References

The Compact Patricia Tree data structure can be found in

 

The compound splitter was used for generating features for document classification in:

· Witschel, F., Biemann, C. (2005): Rigorous dimensionality reduction through linguistically motivated feature selection for text categorisation. Proceedings of NODALIDA 2005, Joensuu, Finland

 

The base form reduction step is described (for Norwegian) in

· Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F., Biemann, C. (2006): Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL, Turku, Finland

 
