A description of how to
install a module can be found on the main page of the Toolbox project.
The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.hac.ClusteringModule
The tool
"Agglomerative Hierarchical Clustering" can be used to create a clustering
of objects. It is part of the "ASV Toolbox" - a collection of tools
for natural language processing, developed at the Department for Natural
Language Processing (NLP) at Leipzig University [1].
The tool creates a hierarchical clustering of the participating
objects, by performing an agglomerative, hierarchical clustering analysis.
Each object that
participates in the clustering process is represented by a feature vector. These
vector representations are used to estimate the similarity (or dissimilarity)
between objects. The clustering then provides a representation of a set of
objects, where similar objects appear close together and dissimilar objects are
separated from each other [2].
In terms of NLP, an
"object" is usually a word in a corpus, and its "features" are
other words that frequently co-occur with that (object) word in the corpus. A
"feature vector" for the given word usually contains the significance
values of these co-occurrences. A non-zero value for the n-th element of the
feature vector indicates that the object word co-occurs significantly often
with the feature word that is the n-th word of the corpus' word list.
AHC is an iterative, bottom-up
process: [2]
Initially, each object
represents a separate cluster. At each step, the two most similar (i.e. least
distant) clusters are merged to form a larger cluster. Cluster distances are
calculated using a specific distance function. The process of merging the two
most similar clusters continues until only one cluster remains, which contains
all the participating objects.
The resulting clustering is
a hierarchical structure. It can be visualized as a dendrogram.
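The iterative merging described above can be sketched in a few lines of Python. This is a naive illustration only, not the tool's implementation: the function names, the choice of the L1 norm and single linkage, and the sample vectors are all assumptions made for the example.

```python
from itertools import combinations


def l1_distance(u, v):
    """Sum of absolute element differences (L1 norm)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def single_linkage(c1, c2, vectors, dist):
    """Smallest distance between any member of c1 and any member of c2."""
    return min(dist(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerate(vectors, dist=l1_distance, linkage=single_linkage):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Initially, each object forms its own cluster (a tuple of object indices).
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        # Find the two least distant clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(pair[0], pair[1], vectors, dist),
        )
        merges.append((a, b, linkage(a, b, vectors, dist)))
        # Replace the pair by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical sample data: two tight pairs and one outlier.
vectors = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
for a, b, d in agglomerate(vectors):
    print(a, "+", b, "at distance", d)
```

The merge history is exactly the information a dendrogram displays: which clusters were joined, and at which distance.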
To determine the distance
between clusters based on their member elements, the following methods have
been implemented:
| Single Linkage | minimum distance between any members of each group |
| Complete Linkage | maximum distance between any members of each group |
| Average Linkage | average pair-wise distance between each member of one cluster and each member of the other cluster |
| Average Group Linkage | average distance between all possible element pairs of the union of the two clusters |
| Centroid | distance between the mean vectors (centroids) of the two clusters |
| Ward's Method | increase in variance when merging two clusters |
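To make the differences between these cluster distances concrete, the following Python sketch computes each of them for two small clusters. It is an illustration under assumptions, not the tool's code: the function names are hypothetical, Euclidean distance is used throughout, and Ward's criterion is expressed as the increase in the within-cluster sum of squared deviations.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    """Mean vector of a cluster given as a list of vectors."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def single_linkage(a, b):
    return min(dist(x, y) for x, y in product(a, b))


def complete_linkage(a, b):
    return max(dist(x, y) for x, y in product(a, b))


def average_linkage(a, b):
    # Average over all pairs with one member from each cluster.
    return sum(dist(x, y) for x, y in product(a, b)) / (len(a) * len(b))


def average_group_linkage(a, b):
    # Average over all pairs within the union of both clusters.
    pairs = list(combinations(a + b, 2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)


def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))


def ward_increase(a, b):
    # Increase in the within-cluster sum of squared deviations after merging.
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * centroid_distance(a, b) ** 2


# Hypothetical sample clusters.
a = [[0.0, 0.0], [0.0, 2.0]]
b = [[4.0, 0.0], [4.0, 2.0]]
for method in (single_linkage, complete_linkage, average_linkage,
               average_group_linkage, centroid_distance, ward_increase):
    print(method.__name__, round(method(a, b), 3))
```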
Distances between element
vectors can be calculated using one of the following methods:
| L1-Norm |
| L2-Norm |
| Dice |
| Jaccard |
| Cosine |
(See also [5], [6], [7], and [8].)
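As an illustration of these vector distances, here is a short Python sketch. The exact formulas used by the clustering tool are not stated in this manual, so the definitions below are assumptions: the norms follow their standard definitions, Cosine is taken as one minus the cosine similarity, and Dice and Jaccard are computed on the sets of non-zero feature positions, following the set-based definitions in [7] and [8]. All function names are hypothetical.

```python
from math import sqrt


def l1_norm(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_norm(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    # Undefined (division by zero) if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def _support(u):
    """Positions of the non-zero elements of a vector."""
    return {i for i, a in enumerate(u) if a != 0}


def dice_distance(u, v):
    # Assumption: one minus the Dice coefficient of the non-zero positions.
    x, y = _support(u), _support(v)
    return 1.0 - 2 * len(x & y) / (len(x) + len(y))


def jaccard_distance(u, v):
    # Assumption: one minus the Jaccard index of the non-zero positions.
    x, y = _support(u), _support(v)
    return 1.0 - len(x & y) / len(x | y)


# Hypothetical sample vectors.
u = [4, 0, 8, 0]
v = [2, 3, 6, 0]
for distance in (l1_norm, l2_norm, cosine_distance,
                 dice_distance, jaccard_distance):
    print(distance.__name__, round(distance(u, v), 4))
```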
Running this software
requires a Java Runtime Environment (JRE) of version 1.5 or later [9], which is available from http://java.sun.com/javase/downloads/index.jsp. To check whether Java is properly
installed on your system, type java -version in your console/shell. This should
return a version statement of 1.5.0 or higher.
To use all the features of
this clustering tool, you need to be able to connect to a co-occurrence
database. You can set up such a database on your local system. First, you need
to download MySQL [10] from http://dev.mysql.com/downloads/ and install MySQL as a service/daemon on your
machine. Second, you need to obtain data files for a co-occurrence database. The
tool is pre-configured to work with databases as provided by the Leipzig Corpora Collection [11].
If you want to build the
clustering tool from source, you will need to have Ant installed [12]. Ant is available from http://ant.apache.org/. You can check whether Ant is
properly installed on your system by typing ant -version in your
console/shell. This should return a version statement of 1.6.5 or higher.
The clustering tool uses a
set of third-party libraries, which are expected to be found in the java
extension directories. Usually the lib sub folder in your application
folder contains all the required libraries. Please make sure that all these library dependencies are met after installing the clustering tool.
This software may be
obtained in two ways:
The clustering tool is
available as a part of the toolbox. Please refer to the toolbox documentation
for information on installation and usage. Once the toolbox is installed, you
can use the clustering tool as one of its modules. Alternatively, you can find
the clustering tool as asv-toolbox-ahc.jar in the lib
folder of the toolbox. This allows you to use the clustering tool as a
stand-alone application. Simply follow the instructions provided under usage.
The clustering tool is
distributed as a zip-archive in a file named ahc.zip. Create a folder (the application
folder) and extract the archive's contents into that folder.
With installation
completed, the application folder may contain the following sub folders with
their respective contents:
| | the clustering tool as a "Java Archive" |
| | application configuration and default settings |
| | sample data and user manual |
| | libraries used by the clustering tool |
| | the source code (if provided) |
| | compiled java classes (if provided) |
| | API documentation (if provided) |
| | icons and images for the graphical user interface |
The following files will be
available:
| | for Windows: runs the clustering tool as stand-alone application with GUI |
| | for Linux: runs the clustering tool as stand-alone application with GUI |
| | for Windows: runs the ASV Toolbox (if available) |
| | for Linux: runs the ASV Toolbox (if available) |
| | for Windows: runs a few demos of the clustering tool in console mode |
| | Ant buildfile (if provided) |
| | first information and help |
If your distribution
provides the source code for this tool, you can compile the clustering tool
using Ant. [12] Please
refer to the buildfile build.xml for up-to-date settings and
information. The following targets should be available:
| | compiles the source code |
| | (default) equivalent for |
| | removes files that have been created/built by other targets |
| | creates the distribution zip-archive(s) for this project |
| | creates the java-executable jar file(s) for this project |
| | runs the clustering tool, using its GUI |
The following libraries are
required to compile and run the clustering tool.
| JUnit | |
| Commons-Logging | |
| Doug Lea's FJTasks Framework | |
| Dom4J | |
| JArgs | |
| MySQL Connector for Java | |
| ASV's WordServer [13] | |
| Browser Launcher | |
Note: junit.jar is only required during
compilation.
The clustering tool uses a
set of configuration files. They are used to load default settings and
parameters on application startup. Additionally, they allow the tool to remember
settings from the GUI, and they let you use and manage several setups for the clustering
tool when it is run from the command line. Finally, while most parameters for data input
and clustering can be provided via the clustering tool's user interfaces, some
settings can only be made in the configuration files.
On start-up, the clustering
tool tries to load initial (default) settings from the following locations. Properties
read at a later stage override earlier properties:
config/ahc/clustering.properties
config/ahc/clustering.queries
clustering.properties in the user's home directory
The last of these is used as the user property file. Settings from the GUI are stored in that
file after a clustering analysis that was started from the GUI has run
to completion without any problems. An alternative location for the user
property file may be provided with the command-line parameter --propfile <filename>. Note: Only the GUI ever stores any
settings in a user property file.
Note: When the user property file is
re-written, any comment lines are lost. Furthermore, the individual property
lines will be written in a scrambled order.
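The override order can be pictured as a layered merge in which each file read later replaces values from files read earlier. The Python sketch below illustrates only that idea and is not how the tool loads its settings: the parser handles plain key=value lines rather than the full Java properties syntax, the function names are hypothetical, and treating clustering.queries as a key=value file is an assumption.

```python
from pathlib import Path


def read_properties(path):
    """Parse simple 'key=value' lines; '#' and '!' start comment lines."""
    props = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def load_settings(paths):
    """Merge property files in order; later files override earlier ones."""
    settings = {}
    for path in paths:
        if Path(path).is_file():  # missing files are silently skipped
            settings.update(read_properties(path))
    return settings


settings = load_settings([
    "config/ahc/clustering.properties",
    "config/ahc/clustering.queries",
    Path.home() / "clustering.properties",  # user property file
])
```

Because a plain dictionary keeps neither comments nor the original line order, a round trip through such a structure also shows why rewriting a property file can lose comment lines and reorder entries.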
Currently, text input
pre-processing can only be configured via the configuration files. The text
pre-processing settings are applied to text input within the GUI as well as
text file input from command-line. Basically, text pre-processing includes
possible substitutions and character exclusions, both of which use regular
expressions that are defined in the configuration files. See the application
configuration file for further details.
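A rough Python sketch of this kind of pre-processing follows. The two regular expressions are purely hypothetical examples; the patterns the tool actually applies are defined in its configuration files and are not reproduced here.

```python
import re

# Hypothetical rules for illustration only.
SUBSTITUTIONS = [(re.compile(r"<[^>]+>"), " ")]  # replace markup tags by a space
EXCLUDED_CHARS = re.compile(r"[0-9.,;:!?()\"]")  # drop digits and punctuation


def preprocess(text):
    """Apply substitutions, remove excluded characters, split into tokens."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    text = EXCLUDED_CHARS.sub("", text)
    return text.split()


def word_list(text):
    """Unique words in order of first appearance."""
    return list(dict.fromkeys(preprocess(text)))


print(word_list("<b>Car</b> and train, 2 cars and a car!"))
```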
The application
configuration file contains a lot of explanatory comments on its
properties. You may also copy its property lines to a user property file
and modify the settings according to your needs.
The clustering tool
provides a graphical user interface (GUI) (both as ASV toolbox module, and as a
stand-alone application) as well as a console interface for non-interactive
access. Details on how to use these interfaces will be provided in the
following sections.
A few data files with tiny
data samples are provided in the application sub folder resources/ahc/examples. Please refer to the System Requirements section for how to set up a
MySQL database to be used with the clustering tool.
To run the GUI you can use
one of the provided scripts: run_ahc.bat (for Windows) or run_ahc.sh
(for Linux). Alternatively, you can use the following command:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --gui
To run the GUI with a
custom user property file, you can use the following command:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain
--propfile "my_clustering.properties" --gui
To run the console
interface, use the following command:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain [OPTIONS]
Options and examples will be explained in the following
sections.
Note: On Linux-like operating systems, use "/" instead of
"\" in paths and ":" instead of ";" as the classpath separator.
Note: Set your current working directory
to the "application folder", that is the folder where you extracted
this software.
Note: If you run into memory problems,
i.e. an OutOfMemoryError, consider using Java's -Xmx<size>M option. For instance, if you want
the clustering tool to use up to 256MB of memory (default is 64MB), use the
command:
java -Xmx256M -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain
[OPTIONS]
To start a clustering
analysis, you need to perform the following steps:
The progress bar indicates
the current status and progress. Depending on the number of candidates and
their feature representation the calculations may take just a few seconds or
several hours.
As soon as the clustering
analysis is completed, the results will appear as a dendrogram in a new window.
The dendrogram can be saved as a PNG image file. Several dendrogram windows may
be kept open, to compare results.
After each successful
clustering analysis, the GUI stores its settings in the user property file.
(See Configuration for details.) Note, however, that
the status of the checkboxes on the Database Input tab is not saved.
When no input data is available, or
as a baseline for algorithm comparisons, this data source creates random
feature vectors for the clustering analysis. Its parameters can be set on the Clustering
tab of the GUI.
The clustering tool can
read feature vectors from two related vector file formats: explicit
and compact. Both formats are text files that contain one
feature vector per line. Each line is a space-separated list of tokens. In both
formats the first token of each line represents a label, i.e. the candidate
word of the respective feature vector on that line. Interpretation of the
remaining tokens differs between the two formats.
For the explicit
format each following token is read as a feature vector element. Consider the
following example of a vector file in explicit format:
Car 4 0 0 8 0 7 2 0 0 0 0 0 0 0 0 0 2
Train 2 0 0 6 0 2 5 0 0 3 0 0 0 0 0 0 15
House 0 0 4 0 0 1 0 0 7 2 0 9 6 0 0 0 0
Cheese 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 12 0
Here, the feature vector
for the candidate word "Car" has 17 elements, five of which are
non-zero. The first element has value 4, the fourth element has value 8, and so
on.
By contrast, in the compact
format, tokens that follow the candidate word are interpreted pair-wise. The
first token of each pair indicates a position in the feature vector (or element
index). The second token of any pair represents the feature vector element
value of that position. Unspecified element values are interpreted as zero. Consider
the following example of a vector file in compact format:
Car 1 4 4 8 6 7 7 2 17 2
Train 1 2 4 6 6 2 7 5 10 3 17 15
House 3 4 6 1 9 7 10 2 12 9 13 6
Cheese 10 18 16 12
This represents the same
feature vectors as in the previous example, but this time in compact
format, which is more efficient for sparse vectors.
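The relationship between the two formats can be checked with a small Python sketch. The function names are hypothetical, values are read as floating-point numbers as an assumption, and element indices are taken to be 1-based, as the examples above suggest. The sketch splits on whitespace, so it does not handle labels that contain spaces.

```python
def parse_explicit(line):
    """'label v1 v2 ... vn' -> (label, [v1, ..., vn])"""
    label, *tokens = line.split()
    return label, [float(t) for t in tokens]


def parse_compact(line, length):
    """'label i1 v1 i2 v2 ...' -> (label, vector with `length` elements)"""
    label, *tokens = line.split()
    vector = [0.0] * length  # unspecified elements are zero
    # Tokens come in (index, value) pairs; indices are 1-based.
    for index, value in zip(tokens[0::2], tokens[1::2]):
        vector[int(index) - 1] = float(value)
    return label, vector


def to_compact(label, vector):
    """Write only the non-zero elements as 'index value' pairs."""
    pairs = [f"{i} {v:g}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([label] + pairs)


explicit = "Car 4 0 0 8 0 7 2 0 0 0 0 0 0 0 0 0 2"
compact = "Car 1 4 4 8 6 7 7 2 17 2"

# Both lines describe the same 17-element feature vector.
assert parse_explicit(explicit) == parse_compact(compact, 17)
print(to_compact(*parse_explicit(explicit)))  # prints: Car 1 4 4 8 6 7 7 2 17 2
```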
Note: This requires a working database
connection.
The clustering tool can
load candidate words and feature vectors from a database. Relevant parameters
for this data source may be specified using the Database Input tab
of the GUI.
Candidate Words
Selection:
Using the database as input source, the clustering tool loads a range of
candidate words from the database's word list. Additionally, it is possible to
exclude candidates based on their frequency values in the database's word list.
The resulting candidate list can directly be used for clustering. Alternatively,
you can choose to edit the candidate list prior to clustering. With this option
enabled, you must use the Text Input tab to load the candidate list from
the database.
Feature Selection:
When feature vectors are loaded from a database, two different cut-offs can be
applied. First, it is possible to specify a minimal co-occurrence significance.
Co-occurrence features with a lower significance will be ignored. This may lead
to more sparse feature vectors, and it may reduce noise in the data. A second
possibility to reduce the number of elements per feature vector is to consider
only the most significant co-occurrences for the given candidate word. The
clustering tool allows you to specify how many of the most significant
co-occurrences you wish to use to create feature vectors.
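The effect of the two cut-offs can be sketched as follows in Python. This is an assumption-laden illustration rather than the tool's database logic: the function name, parameter names, word IDs, and significance values are all hypothetical.

```python
def select_features(features, min_significance=None, max_features=None):
    """features: dict mapping feature word ID -> co-occurrence significance."""
    items = list(features.items())
    if min_significance is not None:
        # Ignore co-occurrences below the minimal significance.
        items = [(word_id, sig) for word_id, sig in items
                 if sig >= min_significance]
    # Rank the remaining co-occurrences, most significant first.
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    if max_features is not None:
        ranked = ranked[:max_features]
    return dict(ranked)


# Hypothetical co-occurrences of one candidate word.
cooccurrences = {101: 18.0, 102: 3.5, 103: 12.0, 104: 7.2}
print(select_features(cooccurrences, min_significance=5.0))
print(select_features(cooccurrences, max_features=2))
```

Both cut-offs make the feature vectors sparser; they can also be combined, in which case the sketch applies the minimum significance first.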
As a convenience, it is
possible to check the feature vector for a candidate word using the Feature
Vector Preview. Just provide a candidate word (or its ID in the database's
word list). If you click on the Check button, the frequency value and
feature vector will be loaded from the database. This allows you to estimate the
influence of the current feature selection cut-offs and candidate selection
parameters.
Note: The feature vector is presented in compact
format, as described above. However, the first token of the feature
vector preview represents the candidate word ID, instead of the word label. All
following tokens are pairs of a feature element index, i.e. another database
word ID, and a feature element value, i.e. the respective co-occurrence
significance.
Note: This requires a working database
connection.
Note: Settings made on the Database
Input tab are also
applied for database operations during text import.
The clustering tool can use
any text data to create a list of candidates for clustering. The Text Input
tab allows you to enter text or provide it via copy-and-paste. When the
clustering analysis is started using text input, a word list is first
created from the available text data. This word list is matched against the
database. All words for which feature vectors can
be loaded from the database (applying the settings that were made on the Database
Input tab) are kept as candidates. The
feature vectors of these candidates are then used for clustering.
When a word list is created
from text input, the text undergoes some pre-processing. This pre-processing
can only be configured using the configuration files of the clustering tool.
Using the Create
candidate list from input button you can preview (and re-edit) your
candidate list.
If a word ID range
restriction is activated (see the Database Input tab), then a candidate list can be
loaded from the database's word list.
The console interface
provides basically the same functionality as the GUI. However, there are a few
peculiarities. Using the console interface, you will need to work with the configuration files for the clustering tool. Many
parameters for the database connection, text file preprocessing, and candidate
and feature selection cannot be provided via command-line options. Instead,
these parameters are read from the tool's configuration files. It is
recommended to use and maintain one or several user property files.
The console interface
provides the following additional features:
The following command-line
options are available:
General:
-?, --help
    print this help
-g --gui
    starts the GUI of this clustering tool
    This ignores all other command-line parameters, except any specified property file
-t --threads <nr>
    number of background threads that should be used for calculations (default is 1)

Datasources (Hint only ONE datasource may be used!):
-v --vectorfile <filename>
    uses the vector file indicated by <filename> as datasource
--compact
    indicates that the file is in short/compact format (default)
--explicit
    indicates that the file is in long/explicit format
--wordserv <filename>
    indicates that the vector names should be retrieved from the wordserver file indicated by <filename>
-d --database
    uses a database connection as datasource.
-f --textfile <filename>
    uses the text file indicated by <filename> as word list source along with a database connection to retrieve feature vectors
--propfile <filename>
    indicates that database connection and input settings should be loaded from the property file <filename>
--restrictrange
    indicates that candidate words must have database IDs within a certain range, as defined in the property file. Other words will be excluded. If omitted (default), range will not be restricted.
--restrictfreq
    indicates that candidate words must have database frequencies within a certain range, as defined in the property file. Other words will be excluded. If omitted (default), frequency will not be restricted.
--applyfeatureminsig
    indicates that feature vector elements must have a minimum significance, whose value is specified in the property file. If omitted (default), this minimum is not considered.
--applyfeaturelimit
    indicates that only a number of most significant features be used as feature vector elements. This number is specified in the property file. If omitted (default), all available features are used as feature vector elements.
-u --user <username>
    the database user name
-p --password <password>
    the database password
--test
    use randomly created test data

Output options:
-o --out <filename>
    the filename for output without extension! For example, using myout you will get myout.png for dendrogram or myout.xml for xml output (or both)
--disabledendrogram
    disable dendrogram drawing (enabled by default)
-x --xml
    enable xml output (disabled by default)
--depth <int>
    the depth for xml output: number of top cluster levels

Algorithm options:
-v --vectordist <dist>
    set the vector distance, possible distances are:
    L1 Norm
    L2 Norm
    Cosine
    Dice
    Jaccard
-c --cluster <dist>
    set the cluster distance, possible distances are:
    SingleLinkage
    CompleteLinkage
    AverageLinkage
    AvgGroupLinkage
    CentroidMethod
    WardsMethod
These demo examples can
also be found in the scripts run_ahc_demos.bat (for Windows) or run_ahc_demos.sh (for Linux).
Demo 0: Print out the list of command-line
options:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain -?
Demo 1: Use the test data source. Perform a
clustering analysis using SingleLinkage and Cosine. Save results both as
dendrogram and xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain --test -o "1_clustering_output_test"
-x -v Cosine -c SingleLinkage
Demo 2: Use a vector file in compact
format. Perform a clustering analysis using AverageLinkage and Cosine. Save
results both as dendrogram and xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain --vectorfile
".\resources\ahc\examples\vectors_compact.txt" -o
"2_clustering_output_vectorfile" -x -v Cosine -c AverageLinkage
Demo 3: Use a vector file in explicit
format. Perform a clustering analysis using AverageLinkage and Cosine. Save
results both as dendrogram and xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain --vectorfile
".\resources\ahc\examples\vectors_explicit.txt" --explicit -o
"3_clustering_output_vectorfile_explicit" -x -v Cosine -c AverageLinkage
Demo 4: Use a vector file in explicit
format. Use the ASV WordServer [13] to load word labels from an ASV WordServer word list. Perform a
clustering analysis using AverageLinkage and Cosine. Save results both as dendrogram
and xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain --wordserv
".\resources\ahc\examples\wordserver_miniwordlist.txt" --vectorfile
".\resources\ahc\examples\vectors_explicit_wordserver.txt" --explicit
-o "4_clustering_output_wordserv" -x -v Cosine -c AverageLinkage
Demo 5: Use a database connection as data
source, user is root, password is mypassword.
Perform a clustering analysis using AverageLinkage and Cosine. Save results
both as dendrogram and xml. Restrict the depth of the clustering hierarchy to 6
levels (descending from top cluster) when saving as xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain -d -u root -p
"mypassword" -o "5_clustering_output_database" -x --depth 6
-v Cosine -c AverageLinkage
Demo 6b: Use the words from a text input
file as candidates for clustering. Load database connection settings and text
preprocessing settings from the property file my_clustering.properties. Perform a clustering analysis
using AverageLinkage and Cosine. Save results only as xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain --disabledendrogram -f
".\resources\ahc\examples\wortschatz.uni-leipzig.de_asv.htm" -q
"my_clustering.properties" -o
"6_clustering_output_textfile" -x -v Cosine -c AverageLinkage
Demo 7: Use the words from a text input
file as candidates for clustering, but only consider words whose database word
IDs are in a certain range. Load database connection settings, word ID range
limit, and text preprocessing settings from the property file my_clustering.properties. Perform a clustering analysis using AverageLinkage and Cosine. Save
results both as dendrogram and xml.
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib
de.uni_leipzig.asv.toolbox.hac.main.CLIMain -f
".\resources\ahc\examples\wortschatz.uni-leipzig.de_asv.htm"
--restrictrange -q "my_clustering.properties" -o
"7_clustering_output_textfile_restricted_custom" -x -v Cosine -c
AverageLinkage
Dendrograms are not well
suited for visualizing centroid-based clustering, where the cluster distances
are not monotonically increasing. The resulting dendrograms may be locally
distorted.
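A small worked example shows how such a non-monotonic merge sequence can arise with the centroid method. The three points below are hypothetical and were chosen so that the second merge happens at a smaller distance than the first; Euclidean distance is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Three hypothetical objects forming a nearly equilateral triangle.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: a and b are the closest pair (distance 2.0), so they are merged first.
first_merge = min(dist(a, b), dist(a, c), dist(b, c))

# Step 2: the centroid of {a, b} lies closer to c (about 1.8)
# than a and b were to each other.
ab_centroid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
second_merge = dist(ab_centroid, c)

print(first_merge, second_merge)
print("inversion:", second_merge < first_merge)
```

In a dendrogram, the second merge would have to be drawn below the first one, which is the kind of local distortion described above.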
Currently, only the console
interface allows combining vector files with the ASV WordServer [13], which is the only way to use
vector files that contain phrasal words. (Note that database and text input can
handle phrasal words.)
Currently, it is not
possible to create vector files from database data. It is recommended to
work with databases directly.
Currently, only the console
interface allows saving results as xml files.
Currently, while the GUI
stores feature and candidate selection cut-offs, it cannot recall
whether these cut-offs had actually been applied (that is, whether the
respective checkboxes had been selected).
Currently, no parameters or
settings are automatically stored with the results.
Note: Addressing the previous three
limitations, for systematic use it is recommended to
The tool
"Agglomerative Hierarchical Clustering" was developed by Arne Brutschy, Christian Beutenmüller, Steffen Becker and Julian Hesselbach in the context of the lecture
"Computational Linguistics" at the NLP Department, Leipzig University
[1], with special thanks to Thomas
Wittig for his competent supervision.
Additional features have
been added by Frank Binder. Thanks go to Chris Biemann and
Uwe Quasthoff for their supervision, and to Thomas Wittig and Arne Brutschy for
suggestions and comments.
This software is available
as part of the "ASV Toolbox". A separate release as open-source
software will be considered.
[1] NLP Department, Leipzig University,
http://www.asv.informatik.uni-leipzig.de
[2] Manning, C. D. and H. Schütze (1999). Foundations of
statistical natural language processing. The MIT Press.
[3] Erik Velldal (2003). Modeling Word Senses With Fuzzy
Clustering. University of Oslo.
[4] http://de.wikipedia.org/wiki/Clusteranalyse
[5] http://de.wikipedia.org/wiki/Distanzfunktion
[6] http://en.wikipedia.org/wiki/Data_clustering
[7] http://en.wikipedia.org/wiki/Dice's_coefficient
[8] http://en.wikipedia.org/wiki/Jaccard_index
[9] Java SE Downloads http://java.sun.com/javase/downloads/index.jsp
[10] MySQL Downloads, http://dev.mysql.com/downloads/
[11] Leipzig Corpora Collection, http://corpora.informatik.uni-leipzig.de/download.html
[12] The Apache Ant Project, http://ant.apache.org/
[13] ASV WordServer (restricted access), http://wortschatz.uni-leipzig.de/snipsnap/space/WordServer