Documentation of Pendulum
Installation
Introduction
How to use the GUI version
Loading and saving a configuration from/to file
Configure the tool on your own
Starting the tool
How to use the command-line version
Command
Options
Examples
A description of how to install a module is available on the main
page of the ASV Toolbox project.
The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.pendel.PendelPanel
This tool finds named entities by bootstrapping.
To use the tool you need a database that contains three tables: a table of
words, a table of sentences, and a table that connects the words table and the
sentences table.
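The three-table layout described above can be sketched as follows. This is only an illustration, not the schema Pendulum ships with: the table and column names (words, sentences, inv_w, w_id, s_id, freq) are assumptions, since the real names are whatever you select in the Parameters and Settings panel, and SQLite is used here only so that the sketch is self-contained. The frequency column is included because the word table settings ask for one.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only.
SCHEMA = """
CREATE TABLE words (
    w_id INTEGER PRIMARY KEY,   -- id of the word
    word TEXT NOT NULL,         -- the word itself
    freq INTEGER NOT NULL       -- frequency of the word in the corpus
);
CREATE TABLE sentences (
    s_id INTEGER PRIMARY KEY,   -- id of the sentence
    sentence TEXT NOT NULL      -- the sentence text
);
CREATE TABLE inv_w (            -- connects words and sentences
    w_id INTEGER NOT NULL REFERENCES words(w_id),
    s_id INTEGER NOT NULL REFERENCES sentences(s_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one sentence and two words."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO words VALUES (1, 'Angela', 1)")
    conn.execute("INSERT INTO words VALUES (2, 'Merkel', 1)")
    conn.execute(
        "INSERT INTO sentences VALUES (1, 'Angela Merkel spoke in Berlin.')"
    )
    conn.executemany("INSERT INTO inv_w VALUES (?, ?)", [(1, 1), (2, 1)])
    return conn


def sentences_containing(conn, word):
    """Return all sentences linked to the given word via the connecting table."""
    rows = conn.execute(
        "SELECT s.sentence FROM words w "
        "JOIN inv_w i ON i.w_id = w.w_id "
        "JOIN sentences s ON s.s_id = i.s_id "
        "WHERE w.word = ?",
        (word,),
    )
    return [row[0] for row in rows]
```

The connecting table is what lets a lookup go from a word to every sentence in which it occurs, which is the kind of access a bootstrapping process over a corpus needs.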
Before you start the bootstrapping process you have to configure the tool. You
can do this by loading a configuration file or by setting the configuration on
your own.
The File Management panel has two buttons: one for loading a configuration and
one for saving the current configuration (see figure 1). After loading a
configuration you can still change the settings of the tool. For this, see
“Configure the tool on your own”.
This button saves the configuration to a file. Clicking it opens a file-save
dialog in which you can save the current configuration to a file in any directory.
This button loads a configuration file. Clicking it opens a file-open dialog
in which you can choose the configuration file from any directory.
figure 1
There are five panels in which you can change settings. We begin with the
first one, the File Management panel. Here you choose which output files you
want to get after running the bootstrapping process. Select the check box in
front of each file you want. For every selected file, a file dialog opens in
which you choose the output file (figures 2 and 3).
Example of one of the file dialogs: the dialog for the log file. Here the file
is named logpendel.txt and is placed in the main directory of the toolbox. (You
do not have to use this name or place the file in a directory of the toolbox.
Other directories and file names are also possible.)
New Items file: all named entities that were found during the bootstrapping
process are saved in this file. An entry in this file looks like this:
Gerhard VN 11/29 Angela
(Gerhard = new item; VN = classification of the new item; 11/29 = classified
this way in 11 of 29 cases; Angela = the item because of which the new item
was found)
figure 2
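The New Items entry format described above can be read programmatically. The following is a minimal sketch, assuming that fields are separated by whitespace and that the count field always has the form hits/total; only the single example entry in this documentation is known, so the handling of multi-word items is a guess. The names NewItemEntry and parse_new_item are hypothetical helpers and not part of Pendulum.

```python
import re
from typing import NamedTuple


class NewItemEntry(NamedTuple):
    item: str     # the new item, e.g. "Gerhard"
    tag: str      # its classification, e.g. "VN"
    hits: int     # number of cases classified this way, e.g. 11
    total: int    # number of cases overall, e.g. 29
    source: str   # the item that led to this one, e.g. "Angela"


_RATIO = re.compile(r"^(\d+)/(\d+)$")


def parse_new_item(line: str) -> NewItemEntry:
    """Split one entry at its hits/total field (assumed layout)."""
    tokens = line.split()
    for pos, token in enumerate(tokens):
        match = _RATIO.match(token)
        # At least an item and a tag must come before the ratio.
        if match and pos >= 2:
            break
    else:
        raise ValueError(f"no hits/total field found in: {line!r}")
    item = " ".join(tokens[: pos - 1])
    tag = tokens[pos - 1]
    source = " ".join(tokens[pos + 1 :])
    return NewItemEntry(item, tag, int(match.group(1)), int(match.group(2)), source)
```

For the example entry, parse_new_item("Gerhard VN 11/29 Angela") yields the item Gerhard with tag VN, 11 hits out of 29 cases, and the source item Angela.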
This file contains the extraction patterns that turn classified words into a
named entity. An entry looks like this: Angela Merkel VN NN->name
(Angela Merkel = named entity; VN NN->name = the extraction pattern that was
used)
This file contains the rule that explains why an item was classified the way
it was. An entry looks like this: Angela(VN) Merkel(?NN) VN GR*->NN
(item(class) = an item that is already in the knowledge base, with its
classification; item(?class) = an item that was found because of the rule of
this entry; VN GR*->NN = the rule that caused the classification)
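An entry of the rule file described above can be split into its annotated items and its rule. This is a sketch under assumptions: it treats every leading token of the form item(class) or item(?class) as an annotated item and everything after them as the rule, which matches the one example entry given here but is not a documented grammar. The function name parse_rule_entry is hypothetical.

```python
import re

# item(class) or item(?class); the "?" marks an item found by this rule.
_ITEM = re.compile(r"^(?P<item>.+)\((?P<new>\?)?(?P<tag>[^()?]+)\)$")


def parse_rule_entry(line):
    """Return (items, rule) for one entry of the rule file (assumed layout)."""
    items = []
    rule_tokens = []
    for token in line.split():
        match = _ITEM.match(token)
        # Annotated items are only expected before the rule starts.
        if match and not rule_tokens:
            items.append(
                {
                    "item": match.group("item"),
                    "tag": match.group("tag"),
                    # True if the item was already in the knowledge base.
                    "known": match.group("new") is None,
                }
            )
        else:
            rule_tokens.append(token)
    return items, " ".join(rule_tokens)
```

Applied to the example entry, this separates Angela (known, VN) and Merkel (newly found, NN) from the rule VN GR*->NN.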


The log is saved in this file. Everything about new items, maybe items, and so
on is logged, so you can trace why an item was found or why it did not become
a new item.
This file is very similar to the New Items file, and its entries are written
in the same format. However, the listed items are only "maybe" new items,
meaning that there is not enough data to be sure that the classification is
right.
figure 3
The next panel is the Parameters and Settings panel. Here you have to
configure the database and the parameters of the bootstrapping algorithm (see
figures 4 and 5).
Configure the settings for the word table here: in the first drop-down menu
the table name, in the second the ID column for words, in the third the column
with the word, and in the last the column containing the frequency of the word
in the corpus you use.
These are the sentence table settings. Choose the table with the sentences
(first drop-down menu), the column with the sentence IDs (second drop-down
menu), and the column with the sentences (last drop-down menu).
The last table you have to configure is the table connecting the word table
and the sentence table. Choose, in this order, the table, the column with the
word IDs, and the column with the sentence IDs.
figure 4
Configure the settings for your database here.
Use this button to restore the default settings for the parameters.
Minimum count of the word in the corpus for it to become a new item.
Threshold that has to be exceeded for the item to be accepted.
Maximum number of sentences that are used for verification.
Select the check box if you want to use the internal tokenizer.
Maximum number of sentences in which a word is searched.
figure 5
The next panel is the Rules and Patterns panel. Here you can add, delete, and
save rules and patterns (see figures 6 and 7).
Button to add rules from a file. A file dialog opens in which you choose the
file. If you save the configuration, the rules are saved in a file ending in
.rules.
Select a rule in the table above and click this button to delete the rule.
Here you can add a new rule. Enter the rule and click Add to add it.
Table containing all rules that will be used.
Button to save the rules to a file.
Button to delete all rules.
figure 6
This is the extraction pattern part. It works in the same way as the class
rules part above.
figure 7
The next panel is the Input Items panel. Here you can add start items, that
is, items that are already classified. In addition, you can add some
background knowledge items, which are used for classification and extraction
but are not listed in the item list at the end. This panel works in the same
way as the class rules part and the extraction pattern part. The panel may
look like figure 8.
Here you can configure the background knowledge items.
Here you can configure the start items.
figure 8
The next panel is the Tag System panel. Here you can configure the tag
encoding and the regular expression tagging. Regular expression tagging means
that you can use a regular expression to find candidates for new items. The
panel may look like figure 9. It works in the same way as the Rules and
Patterns panel.
The Auto Fill button fills out the table automatically.
figure 9
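The idea of regular expression tagging mentioned above, using a regular expression to find candidates for new items, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the tag names CAP and NUM, the two patterns, and the first-match-wins policy are not taken from Pendulum, and the syntax accepted by Pendulum's own regular expressions may differ from Python's re module.

```python
import re

# Hypothetical tag table: tag name -> pattern a whole token must match.
REGEX_TAGS = {
    "CAP": re.compile(r"^[A-Z][a-z]+$"),  # capitalized word
    "NUM": re.compile(r"^\d+$"),          # sequence of digits
}


def tag_candidates(tokens):
    """Return (token, tag) pairs for tokens matched by one of the patterns."""
    found = []
    for token in tokens:
        for tag, pattern in REGEX_TAGS.items():
            if pattern.match(token):
                found.append((token, tag))
                break  # first matching pattern wins (assumed policy)
    return found
```

Tokens that match none of the patterns are simply not proposed as candidates.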
After configuring the tool you can start the search for named entities (see
figure 10).
Button to stop the algorithm.
Table of all items that were found or that are in the start item list.
Button to pause the algorithm.
Table containing the unused items.
Button to start the algorithm.
figure 10
To start the command-line version of this tool, use the following command:
java -Xmx500M -classpath .;./lib/ASV_Pendulum.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.pendel.PendelCL configfile -o outputfile [-t]
configfile      path to the configuration file that should be used for the run
-o outputfile   path to the output file to which the output is written
-t              use the internal tokenizer
- Run Pendulum with the configuration in the file
./config/pendel/Pendel_localhost.cfg, use the tokenizer, and write the output
to ./examples/pendel/output_tokenizer.txt:
java -Xmx500M -classpath .;./lib/ASV_Pendulum.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.pendel.PendelCL ./config/pendel/Pendel_localhost.cfg -o ./examples/pendel/output_tokenizer.txt -t
- Run Pendulum with the configuration in the file
./config/pendel/Pendel_localhost.cfg, do not use the tokenizer, and write the
output to ./examples/pendel/output_notokenizer.txt:
java -Xmx500M -classpath .;./lib/ASV_Pendulum.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.asv.toolbox.pendel.PendelCL ./config/pendel/Pendel_localhost.cfg -o ./examples/pendel/output_notokenizer.txt
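If you want to launch the command-line version from a script, the command shown above can be assembled as an argument list. The sketch below is a hypothetical helper (build_command is not part of the toolbox) that reproduces the documented arguments. One point it makes explicit: the ";" in the class path is the path separator used on Windows, while Unix-like systems use ":", so the helper takes the separator from os.pathsep by default.

```python
import os

MAIN_CLASS = "de.uni_leipzig.asv.toolbox.pendel.PendelCL"


def build_command(config_file, output_file, use_tokenizer=False, sep=os.pathsep):
    """Build the java argument list for the documented PendelCL call.

    sep is the path separator: ";" on Windows, ":" on Unix-like systems.
    """
    classpath = sep.join([".", "./lib/ASV_Pendulum.jar"])
    ext_dirs = sep.join([".", "./lib"])
    command = [
        "java",
        "-Xmx500M",
        "-classpath",
        classpath,
        f"-Djava.ext.dirs={ext_dirs}",
        MAIN_CLASS,
        config_file,
        "-o",
        output_file,
    ]
    if use_tokenizer:
        command.append("-t")  # use the internal tokenizer
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) from the main directory of the toolbox. Passing the arguments as a list avoids shell quoting problems with the separator character.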
Reference
This is an implementation of the bootstrapping method described in:
Quasthoff, U.; Biemann, Chr.: Named entity learning and verification: EM in
large corpora. In: Proceedings of CoNLL-2002, The Sixth Workshop on
Computational Language Learning, 31 August and 1 September 2002, in
association with Coling 2002, Taipei, Taiwan.