HAC Clustering Tool - User Manual

Installation

Instructions for installing a module can be found on the main page of the Toolbox project.
The line you have to copy into the toolbox.start file looks like this:
de.uni_leipzig.asv.toolbox.hac.ClusteringModule

Introduction

The tool "Agglomerative Hierarchical Clustering" can be used to create a clustering of objects. It is part of the "ASV Toolbox" - a collection of tools for natural language processing, developed at the Department for Natural Language Processing (NLP) at Leipzig University [1].

The tool creates a hierarchical clustering of the participating objects, by performing an agglomerative, hierarchical clustering analysis.

Each object that participates in the clustering process is represented by a feature vector. These vector representations are used to estimate the similarity (or dissimilarity) between objects. The clustering then provides a representation of a set of objects, where similar objects appear close together and dissimilar objects are separated from each other [2].

In NLP terms, an "object" is usually a word in a corpus, and its "features" are other words that frequently co-occur with that (object) word in the corpus. A "feature vector" for a given word usually contains the significance values of these co-occurrences. A non-zero value for the n-th element of the feature vector indicates that the object word co-occurs significantly often with the feature word that is the n-th word of the corpus' word list.
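The mapping from co-occurrences to vector positions can be illustrated with a minimal sketch. The word list, the significance values, and the function name are made up for illustration; the tool itself loads this data from a co-occurrence database.

```python
# Hypothetical toy data: a corpus word list and the co-occurrence
# significances of the object word "car". Not taken from a real corpus.
word_list = ["car", "road", "engine", "cheese", "driver"]
cooccurrences = {"road": 12.5, "engine": 8.0, "driver": 3.2}


def feature_vector(cooccurrences, word_list):
    """Return a vector whose n-th element belongs to the n-th word of word_list.

    Words that do not co-occur significantly with the object word get 0.0.
    """
    return [cooccurrences.get(word, 0.0) for word in word_list]


vector = feature_vector(cooccurrences, word_list)
print(vector)
```

Most elements of such a vector are zero for a realistic word list, which is why a sparse representation is usually preferable.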

Agglomerative Hierarchical Clustering

AHC is an iterative, bottom-up process: [2]

Initially, each object represents a separate cluster. At each step, the two most similar (i.e. least distant) clusters are merged to form a larger cluster. Cluster distances are calculated using a specific distance function. The process of merging the two most similar clusters continues until only one cluster remains, which contains all the participating objects.

The resulting clustering is a hierarchical structure. It can be visualized as a dendrogram.
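The merge loop described above can be sketched as follows. This is a naive illustration, not the tool's implementation: it assumes Euclidean distance and single linkage, and it recomputes all cluster distances at every step.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2, vectors):
    """Minimum distance between any member of c1 and any member of c2."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerate(vectors, cluster_distance=single_linkage):
    """Return the merge history as a list of (cluster_a, cluster_b, distance).

    Clusters are tuples of object indices into `vectors`.
    """
    clusters = [(i,) for i in range(len(vectors))]  # one cluster per object
    merges = []
    while len(clusters) > 1:
        # Find the two least distant clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], vectors),
        )
        merges.append((a, b, cluster_distance(a, b, vectors)))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
merges = agglomerate(points)
for a, b, distance in merges:
    print(a, b, distance)
```

The merge history contains everything needed to draw a dendrogram: each entry is one inner node, and its distance is the height at which the two branches join.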

Clustering Methods

To determine the distance between clusters based on their member elements, the following methods have been implemented:

Single Linkage

minimum distance between any members of each group

Complete Linkage

maximum distance between any members of each group

Average Linkage

average pair-wise distance between each member of one cluster to each member of another cluster

Average Group Linkage

average distance between all possible element pairs of the union of the two clusters

Centroid

distance between the mean vectors (centroids) of the two clusters

Ward's Method

increase in variance when merging two clusters

(Taken from [3] & [4].)
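As a rough illustration of the six definitions, the following sketch computes each cluster distance for two small clusters. It assumes Euclidean distance between vectors, and it uses a common formulation of Ward's criterion (the increase in the within-cluster sum of squares); the tool's own formulas may differ in detail.

```python
import math
from itertools import combinations


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Mean vector of a cluster (a list of equally long vectors)."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]


def single_linkage(c1, c2):
    return min(l2(a, b) for a in c1 for b in c2)


def complete_linkage(c1, c2):
    return max(l2(a, b) for a in c1 for b in c2)


def average_linkage(c1, c2):
    # Only pairs with one member from each cluster.
    return sum(l2(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def average_group_linkage(c1, c2):
    # All pairs within the union, including pairs inside c1 and inside c2.
    pairs = list(combinations(list(c1) + list(c2), 2))
    return sum(l2(a, b) for a, b in pairs) / len(pairs)


def centroid_distance(c1, c2):
    return l2(centroid(c1), centroid(c2))


def ward_increase(c1, c2):
    # Increase in the within-cluster sum of squares caused by the merge.
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * l2(centroid(c1), centroid(c2)) ** 2


left = [(0.0, 0.0), (0.0, 2.0)]
right = [(4.0, 0.0), (4.0, 2.0)]
for method in (single_linkage, complete_linkage, average_linkage,
               average_group_linkage, centroid_distance, ward_increase):
    print(method.__name__, method(left, right))
```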

Vector Distances

Distances between element vectors can be calculated using one of the following methods:

L1-Norm

L2-Norm

Dice

Jaccard

Cosine

(See also [5], [6], [7], and [8].)
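The five measures can be sketched as below. The exact definitions used by the tool are not given in this manual, so the sketch makes two assumptions: Cosine, Dice and Jaccard are turned into distances as one minus the respective similarity, and Dice and Jaccard treat a vector as the set of its non-zero positions.

```python
import math


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as maximally distant.
    return 1.0 - dot / norm if norm else 1.0


def support(vector):
    """Set of positions with a non-zero value (assumed set interpretation)."""
    return {i for i, x in enumerate(vector) if x != 0}


def jaccard_distance(a, b):
    sa, sb = support(a), support(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def dice_distance(a, b):
    sa, sb = support(a), support(b)
    total = len(sa) + len(sb)
    return 1.0 - 2 * len(sa & sb) / total if total else 0.0


u = [4, 0, 8, 0]
v = [2, 0, 6, 3]
for measure in (l1_distance, l2_distance, cosine_distance,
                jaccard_distance, dice_distance):
    print(measure.__name__, measure(u, v))
```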

System Requirements

Running this software requires a Java Runtime Environment (JRE), version 1.5 or later [9]. It is available from http://java.sun.com/javase/downloads/index.jsp. To check whether Java is properly installed on your system, type java -version in your console/shell. This should report version 1.5.0 or higher.

To use all the features of this clustering tool, you need to be able to connect to a co-occurrence database. You can set up such a database on your local system. First, you need to download MySQL [10] from http://dev.mysql.com/downloads/ and install MySQL as a service/daemon on your machine. Second, you need to obtain data files for a co-occurrence database. The tool is pre-configured to work with databases as provided by the Leipzig Corpora Collection [11].

If you want to build the clustering tool from source, you will need to have Ant installed [12]. Ant is available from http://ant.apache.org/. You can check whether Ant is properly installed on your system by typing ant -version in your console/shell. This should report version 1.6.5 or higher.

The clustering tool uses a set of third-party libraries, which are expected to be found in the java extension directories. Usually the lib sub folder in your application folder contains all the required libraries. Please make sure that all these library dependencies are met after installing the clustering tool.

Installation

This software may be obtained in two ways:

As "ASV-Toolbox" Module

The clustering tool is available as a part of the toolbox. Please refer to the toolbox documentation for information on installation and usage. Once the toolbox is installed, you can use the clustering tool as one of its modules. Alternatively, you can find the clustering tool as asv-toolbox-ahc.jar in the lib folder of the toolbox. This allows you to use the clustering tool as a stand-alone application. Simply follow the instructions provided under usage.

As "Stand-Alone" Application

The clustering tool is distributed as a zip-archive in a file named ahc.zip. Create a folder (the application folder) and extract the archive's contents into that folder.

Installation Contents

With installation completed, the application folder may contain the following sub folders with their respective contents:

dist

the clustering tool as a "Java Archive" ahc.jar

config

application configuration and default settings

resources

sample data and user manual

lib

libraries used by the clustering tool

src

the source code (if provided)

build

compiled java classes (if provided)

doc

API documentation (if provided)

img

icons and images for the graphical user interface

The following files will be available:

run_ahc.bat

for Windows: runs the clustering tool as stand-alone application with GUI

run_ahc.sh

for Linux: runs the clustering tool as stand-alone application with GUI

run_toolbox.bat

for Windows: runs the ASV Toolbox (if available)

run_toolbox.sh

for Linux: runs the ASV Toolbox (if available)

run_ahc_demos.bat

for Windows: runs a few demos of the clustering tool in console mode

build.xml

Ant buildfile (if provided)

readme.txt

first information and help

Building from Source Code

If your distribution provides the source code for this tool, you can compile the clustering tool using Ant. [12] Please refer to the buildfile build.xml for up-to-date settings and information. The following targets should be available:

compile

compiles the source code

build

(default) equivalent to compile

clean

removes files that have been created/built by other targets

dist

creates the distribution zip-archive(s) for this project

jar

creates the java-executable jar file(s) for this project

run

runs the clustering tool, using its GUI

Library Dependencies

The following libraries are required to compile and run the clustering tool.

JUnit

junit.jar

Commons-Logging

commons-logging.jar and commons-logging-api.jar

Doug Lea's FJTasks Framework

csc375.jar

Dom4J

dom4j-1.6.1.jar

JArgs

jargs.jar

MySQL Connector for Java

mysql-connector-java-3.1.10-bin.jar

ASV's WordServer [13]

WordServer.jar

Browser Launcher

BrowserLauncher.jar

Note: junit.jar is only required during compilation.

Configuration

The clustering tool uses a set of configuration files. They are used to load default settings and parameters on application startup. Additionally, they allow the tool to remember settings from the GUI, and they let you use and manage several setups for the clustering tool when it is run from the command line. Finally, while most parameters for data input and clustering can be provided via the clustering tool's user interfaces, some settings can only be made in the configuration files.

On start-up, the clustering tool tries to load initial (default) settings from the following locations. Properties read at a later stage override earlier properties:

  1. Built-in default settings are provided by the application. These are specified in the source code and cannot be modified without re-building the application from source.
  2. An application configuration file is read. This file is located within the application folder file hierarchy at config/ahc/clustering.properties
  3. Query properties, i.e. properties containing the MySQL queries that are used to access the database, are read from config/ahc/clustering.queries
  4. The user property file is read, if available. By default, a file named clustering.properties in the user's home directory is used as the user property file. The GUI stores its settings in that file after a clustering analysis started from the GUI has run to completion without problems. An alternative location for the user property file may be provided with the command-line parameter --propfile <filename>.

Note: Only the GUI ever stores any settings in a user property file.

Note: When the user property file is re-written, any comment lines are lost. Furthermore, the individual property lines will be written in a scrambled order.
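The override order can be pictured as a sequence of dictionary updates, where each later source replaces values set by an earlier one. The sketch below is an illustration only: the property keys and values are hypothetical, and the parser handles just the simple key=value form with comment lines, not the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple 'key=value' lines; blank lines and comments are skipped."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def load_settings(*layers):
    """Merge property layers; later layers override earlier ones."""
    settings = {}
    for layer in layers:
        settings.update(layer)
    return settings


# Hypothetical keys and values, for illustration only.
built_in = {"db.user": "guest", "clustering.method": "SingleLinkage"}
application = parse_properties("""
# stands in for config/ahc/clustering.properties
clustering.method = AverageLinkage
vector.distance = Cosine
""")
user = parse_properties("db.user = root")

settings = load_settings(built_in, application, user)
print(settings)
```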

Currently, text input pre-processing can only be configured via the configuration files. The text pre-processing settings are applied to text input within the GUI as well as text file input from command-line. Basically, text pre-processing includes possible substitutions and character exclusions, both of which use regular expressions that are defined in the configuration files. See the application configuration file for further details.

The application configuration file contains a lot of explanatory comments on its properties. You may also copy its property lines to a user property file and modify the settings according to your needs.

Usage

The clustering tool provides a graphical user interface (GUI) (both as ASV toolbox module, and as a stand-alone application) as well as a console interface for non-interactive access. Details on how to use these interfaces will be provided in the following sections.

Sample Data

A few data files with tiny data samples are provided in the application sub folder resources/ahc/examples. Please refer to the System Requirements section for how to set up a MySQL database to be used with the clustering tool.

Running the Clustering Tool

To run the GUI you can use one of the provided scripts: run_ahc.bat (for Windows) or run_ahc.sh (for Linux). Alternatively, you can use the following command:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --gui

To run the GUI with a custom user property file, you can use the following command:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --propfile "my_clustering.properties" --gui

To run the console interface, use the following command:
java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain [OPTIONS]

Options and examples will be explained in the following sections.

Note: On Linux-like operating systems, use "/" instead of "\" in paths and ":" instead of ";" as the classpath separator.

Note: Set your current working directory to the "application folder", that is the folder where you extracted this software.

Note: If you run into memory problems, such as an OutOfMemoryError, consider using Java's -Xmx<size>M option. For instance, to let the clustering tool use up to 256MB of memory (default is 64MB), use the command:
java -Xmx256M -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain [OPTIONS]
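If you start the tool from a script, the platform differences can be handled when the command is assembled. The following sketch is a hypothetical helper, not part of the distribution. It builds the command shown above and uses Python's os.pathsep, which is ";" on Windows and ":" on Linux-like systems, as the classpath separator.

```python
import os

MAIN_CLASS = "de.uni_leipzig.asv.toolbox.hac.main.CLIMain"


def build_command(options, max_heap_mb=None, pathsep=os.pathsep):
    """Return the java command line as an argument list.

    options      -- command-line options of the clustering tool
    max_heap_mb  -- optional value for Java's -Xmx option, in megabytes
    pathsep      -- classpath separator; defaults to the current platform's
    """
    classpath = pathsep.join([".", "./lib/ASV_HAC.jar"])
    command = ["java"]
    if max_heap_mb is not None:
        command.append(f"-Xmx{max_heap_mb}M")
    command += ["-classpath", classpath, "-Djava.ext.dirs=./lib", MAIN_CLASS]
    return command + list(options)


command = build_command(["--test", "-o", "out", "-x"], max_heap_mb=256)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, cwd=application_folder), where application_folder is the folder into which you extracted the software.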

Using the Graphical User Interface (GUI)

To start a clustering analysis, you need to perform the following steps:

  1. Choose a Data Source
    Feature vectors can be created from random data, vector files, or a co-occurrence database. Using a co-occurrence database allows you to edit the list of candidates, or to provide candidates for clustering as simple text input via copy-and-paste.
  2. Configure the Data Source
    Each data source requires specific parameters or input, which are described in the following sections.
  3. Choose Algorithms
    Choose the clustering method and the vector distance that you wish to use.
  4. Start the Clustering Analysis

The progress bar indicates the current status and progress. Depending on the number of candidates and their feature representation the calculations may take just a few seconds or several hours.

As soon as the clustering analysis is completed, the results will appear as a dendrogram in a new window. The dendrogram can be saved as a PNG image file. Several dendrogram windows may be kept open, to compare results.

After each successful clustering analysis, the GUI stores its settings in the user property file. (See Configuration for details.) Note, however, that the state of the checkboxes on the Database Input tab is not saved.

Using Random Data

This data source creates random feature vectors for the clustering analysis. It is useful when no input data is available, or as a baseline for algorithm comparisons. Its parameters can be set on the Clustering tab of the GUI.

Using Vector Files

The clustering tool can read feature vectors from two related vector file formats: explicit and compact. Both formats are text files that contain one feature vector per line. Each line is a space-separated list of tokens. In both formats the first token of each line represents a label, i.e. the candidate word of the respective feature vector on that line. Interpretation of the remaining tokens differs between the two formats.

For the explicit format each following token is read as a feature vector element. Consider the following example of a vector file in explicit format:

 
        Car 4 0 0 8 0 7 2 0 0 0 0 0 0 0 0 0 2
        Train 2 0 0 6 0 2 5 0 0 3 0 0 0 0 0 0 15
        House 0 0 4 0 0 1 0 0 7 2 0 9 6 0 0 0 0
        Cheese 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 12 0

Here, the feature vector for the candidate word "Car" has 17 elements, five of which are non-zero. The first element has value 4, the fourth element has value 8, and so on.

By contrast, in the compact format, tokens that follow the candidate word are interpreted pair-wise. The first token of each pair indicates a position in the feature vector (or element index). The second token of any pair represents the feature vector element value of that position. Unspecified element values are interpreted as zero. Consider the following example of a vector file in compact format:

 
        Car 1 4 4 8 6 7 7 2 17 2
        Train 1 2 4 6 6 2 7 5 10 3 17 15
        House 3 4 6 1 9 7 10 2 12 9 13 6
        Cheese 10 18 16 12

This represents the same feature vectors as in the previous example, but this time in compact format, which is more efficient for sparse vectors.
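The relation between the two formats can be made concrete with a small parser. This is a sketch, not the tool's reader: it assumes 1-based positions (as the two examples imply) and requires the vector length to be passed in for the compact format, since a compact line does not state it.

```python
def parse_explicit(line):
    """Parse 'label v1 v2 ... vn' into (label, vector)."""
    label, *tokens = line.split()
    return label, [float(token) for token in tokens]


def parse_compact(line, length):
    """Parse 'label i1 v1 i2 v2 ...' into (label, vector of given length)."""
    label, *tokens = line.split()
    vector = [0.0] * length  # unspecified elements are zero
    for index, value in zip(tokens[0::2], tokens[1::2]):
        vector[int(index) - 1] = float(value)  # positions assumed 1-based
    return label, vector


def to_compact(label, vector):
    """Write a vector as a compact-format line, skipping zero elements."""
    parts = [label]
    for position, value in enumerate(vector, start=1):
        if value != 0:
            parts += [str(position), format(value, "g")]
    return " ".join(parts)


explicit = parse_explicit("Car 4 0 0 8 0 7 2 0 0 0 0 0 0 0 0 0 2")
compact = parse_compact("Car 1 4 4 8 6 7 7 2 17 2", length=17)
print(explicit == compact)
print(to_compact(*explicit))
```

Because lines are split on spaces, a label that itself contains a space cannot be represented in either format.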

Using Database Input

Note: This requires a working database connection.

The clustering tool can load candidate words and feature vectors from a database. Relevant parameters for this data source may be specified using the Database Input tab of the GUI.

Candidate Words Selection:
Using the database as input source, the clustering tool loads a range of candidate words from the database's word list. Additionally, it is possible to exclude candidates based on their frequency values in the database's word list. The resulting candidate list can directly be used for clustering. Alternatively, you can choose to edit the candidate list prior to clustering. With this option enabled, you must use the Text Input tab to load the candidate list from the database.

Feature Selection:
When feature vectors are loaded from a database, two different cut-offs can be applied. First, it is possible to specify a minimal co-occurrence significance. Co-occurrence features with a lower significance will be ignored. This may lead to more sparse feature vectors, and it may reduce noise in the data. A second possibility to reduce the number of elements per feature vector is to consider only the most significant co-occurrences for the given candidate word. The clustering tool allows you to specify how many of the most significant co-occurrences you wish to use to create feature vectors.
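The effect of the two cut-offs can be sketched on a sparse feature vector. The function and parameter names are hypothetical, and the word IDs and significance values are invented; the sketch assumes that the minimum-significance filter is applied before the limit on the number of features.

```python
def select_features(features, min_significance=None, max_features=None):
    """Apply the two cut-offs to a sparse feature vector.

    features -- dict mapping a feature word ID to its co-occurrence significance
    """
    items = list(features.items())
    if min_significance is not None:
        # Features with a lower significance are ignored.
        items = [(word_id, sig) for word_id, sig in items if sig >= min_significance]
    # Most significant co-occurrences first.
    items.sort(key=lambda item: item[1], reverse=True)
    if max_features is not None:
        items = items[:max_features]
    return dict(items)


# Invented example data.
features = {101: 15.0, 205: 3.5, 317: 9.2, 442: 1.1}
print(select_features(features, min_significance=3.0))
print(select_features(features, max_features=2))
```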

As a convenience, it is possible to check the feature vector for a candidate word using the Feature Vector Preview. Just provide a candidate word (or its ID in the database's word list). If you click on the Check button, the frequency value and feature vector will be loaded from the database. This lets you estimate the influence of the current feature selection cut-offs and candidate selection parameters.

Note: The feature vector is presented in compact format, as described above. However, the first token of the feature vector preview represents the candidate word ID, instead of the word label. All following tokens are pairs of a feature element index, i.e. another database word ID, and a feature element value, i.e. the respective co-occurrence significance.

Using Text Input

Note: This requires a working database connection.

Note: Settings made on the Database Input tab are also applied for database operations during text import.

The clustering tool can use any text data to create a list of candidates for clustering. The Text Input tab allows you to enter text or provide it via copy-and-paste. When the clustering analysis is started using text input, at first, a word list is created from the available text data. This word list is matched against the database. All those words are kept as candidates, for which feature vectors can be loaded from the database (applying the settings that were made on the Database Input tab). The feature vectors of these candidates are then used for clustering.

When a word list is created from text input, the text undergoes some pre-processing. This pre-processing can only be configured using the configuration files of the clustering tool.
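The path from raw text to a candidate list can be sketched as follows. The substitution and exclusion patterns are made-up stand-ins for the regular expressions defined in the configuration files, and a plain set of known words stands in for the database lookup.

```python
import re

# Hypothetical pre-processing rules; the real ones are defined in the
# tool's configuration files.
SUBSTITUTIONS = [(re.compile(r"<[^>]+>"), " ")]  # e.g. replace HTML tags
EXCLUDED_CHARS = re.compile(r"[^\w\s-]")         # e.g. drop punctuation


def candidate_words(text, known_words):
    """Return the words of `text` for which the database has an entry.

    known_words stands in for the database lookup. Each candidate is
    listed once, in order of first appearance.
    """
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    text = EXCLUDED_CHARS.sub(" ", text)
    seen, candidates = set(), []
    for word in text.split():
        if word in known_words and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates


known = {"car", "train", "house"}
print(candidate_words("<p>The car, the train and the car.</p>", known))
```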

Using the Create candidate list from input button you can preview (and re-edit) your candidate list.

If a word ID range restriction is activated (see the Database Input tab), then a candidate list can be loaded from the database's word list.

Using the Console Interface

The console interface provides essentially the same functionality as the GUI. However, there are a few peculiarities. Using the console interface, you will need to work with the configuration files for the clustering tool. Many parameters for the database connection, text file preprocessing, and candidate and feature selection cannot be provided via command-line options. Instead, these parameters are read from the tool's configuration files. It is recommended to use and maintain one or more user property files.

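In addition to what the GUI offers, the console interface allows you to combine vector files with the ASV WordServer [13] and to save results as xml files (see Limitations).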

Command-Line Options

The following command-line options are available:

 
General:
-?, --help                  print this help
-g  --gui                   starts the GUI of this clustering tool
                            This ignores all other command-line parameters,
                            except any specified property file
-t  --threads <nr>          number of background threads that should be used
                            for calculations (default is 1)
 
Datasources (Hint: only ONE datasource may be used!):
-v  --vectorfile <filename> uses the vector file indicated by <filename>
                            as datasource
    --compact               indicates that the file is in
                            short/compact format (default)
    --explicit              indicates that the file is in long/explicit format
    --wordserv <filename>   indicates that the vector names should be retrieved
                            from the wordserver file indicated by <filename>
-d  --database              uses a database connection as datasource.
-f  --textfile <filename>   uses the text file indicated by <filename>
                            as word list source along with a database connection
                            to retrieve feature vectors
    --propfile <filename>   indicates that database connection and input
                            settings should be loaded from the property file
                            <filename>   
    --restrictrange         indicates that candidate words must have database
                            IDs within a certain range, as defined in the
                            property file. Other words will be excluded.
                            If omitted (default), range will not be
                            restricted.
    --restrictfreq          indicates that candidate words must have database
                            frequencies within a certain range, as defined in
                            the property file. Other words will be excluded.
                            If omitted (default), frequency will not be
                            restricted.
    --applyfeatureminsig    indicates that feature vector elements must have
                            a minimum significance, whose value is specified
                            in the property file.
                            If omitted (default), this minimum is not
                            considered.          
    --applyfeaturelimit     indicates that only a number of most significant
                            features be used as feature vector elements.
                            This number is specified in the property file.
                            If omitted (default), all available features
                            are used as feature vector elements.
-u  --user <username>       the database user name
-p  --password <password>   the database password
    --test                  use randomly created test data
 
Output options:
-o  --out <filename>        the filename for output without extension!
                            For example, using myout you will get myout.png
                            for dendrogram or myout.xml for xml output (or both)
    --disabledendrogram     disable dendrogram drawing (enabled by default)
-x  --xml                   enable xml output (disabled by default)
    --depth <int>           the depth for xml output: number of top cluster
                            levels
    
Algorithm options:
-v  --vectordist <dist>     set the vector distance, possible distances are:
                              L1 Norm
                              L2 Norm
                              Cosine
                              Dice
                              Jaccard
-c  --cluster <dist>        set the cluster distance, possible distances are:
                              SingleLinkage
                              CompleteLinkage
                              AverageLinkage
                              AvgGroupLinkage
                              CentroidMethod
                              WardsMethod            
                

Examples for using command-line options

These demo examples can also be found in the scripts run_ahc_demos.bat (for Windows) or run_ahc_demos.sh (for Linux).

Demo 0: Print out the list of command-line options:

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain -?

 

Demo 1: Use the test data source. Perform a clustering analysis using SingleLinkage and Cosine. Save results both as dendrogram and xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --test -o "1_clustering_output_test" -x -v Cosine -c SingleLinkage

 

Demo 2: Use a vector file in compact format. Perform a clustering analysis using AverageLinkage and Cosine. Save results both as dendrogram and xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --vectorfile ".\resources\ahc\examples\vectors_compact.txt" -o "2_clustering_output_vectorfile" -x -v Cosine -c AverageLinkage

 

Demo 3: Use a vector file in explicit format. Perform a clustering analysis using AverageLinkage and Cosine. Save results both as dendrogram and xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --vectorfile ".\resources\ahc\examples\vectors_explicit.txt" --explicit -o "3_clustering_output_vectorfile_explicit" -x -v Cosine -c AverageLinkage

 

Demo 4: Use a vector file in explicit format. Use the ASV WordServer [13] to load word labels from an ASV WordServer word list. Perform a clustering analysis using AverageLinkage and Cosine. Save results both as dendrogram and xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --wordserv ".\resources\ahc\examples\wordserver_miniwordlist.txt" --vectorfile ".\resources\ahc\examples\vectors_explicit_wordserver.txt" --explicit -o "4_clustering_output_wordserv" -x -v Cosine -c AverageLinkage

 

Demo 5: Use a database connection as data source, user is root, password is mypassword. Perform a clustering analysis using AverageLinkage and Cosine. Save results both as dendrogram and xml. Restrict the depth of the clustering hierarchy to 6 levels (descending from top cluster) when saving as xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain -d -u root -p "mypassword" -o "5_clustering_output_database" -x --depth 6 -v Cosine -c AverageLinkage

 

Demo 6b: Use the words from a text input file as candidates for clustering. Load database connection settings and text preprocessing settings from the property file my_clustering.properties. Perform a clustering analysis using AverageLinkage and Cosine. Save results only as xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain --disabledendrogram -f ".\resources\ahc\examples\wortschatz.uni-leipzig.de_asv.htm" -q "my_clustering.properties" -o "6_clustering_output_textfile" -x -v Cosine -c AverageLinkage

 

Demo 7: Use the words from a text input file as candidates for clustering, but only consider words whose database word IDs are in a certain range. Load database connection settings, word ID range limit, and text preprocessing settings from the property file my_clustering.properties. Perform a clustering analysis using AverageLinkage and Cosine. Save results both as dendrogram and xml.

java -classpath .;./lib/ASV_HAC.jar -Djava.ext.dirs=./lib de.uni_leipzig.asv.toolbox.hac.main.CLIMain -f ".\resources\ahc\examples\wortschatz.uni-leipzig.de_asv.htm" --restrictrange -q "my_clustering.properties" -o "7_clustering_output_textfile_restricted_custom" -x -v Cosine -c AverageLinkage

Limitations

Dendrograms are not well suited for visualizing centroid-based clustering, where the cluster distances are not monotonically increasing. The resulting dendrograms may be locally distorted.
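A small worked example shows how such a non-monotonic step can arise. The three points are chosen for illustration, and Euclidean distance is assumed. After the two closest points are merged, their centroid lies closer to the third point than the two merged points were to each other.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: a and b are the closest pair, so they are merged first.
first_merge = l2(a, b)

# Step 2: the merged cluster {a, b} is represented by its centroid (1.0, 0.0).
second_merge = l2(centroid([a, b]), c)

# The second merge happens at a smaller distance than the first one.
inversion = second_merge < first_merge
print(first_merge, second_merge, inversion)
```

In a dendrogram, the node for the second merge would have to be drawn below the node it contains, which is the local distortion mentioned above.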

Currently, only the console interface allows combining vector files with the ASV WordServer [13], which is the only way to use vector files that contain phrasal words. (Note that database and text input can handle phrasal words.)

Currently, it is not possible to create vector files from database data. It is recommended to work directly with databases instead.

Currently, only the console interface allows saving results as xml files.

Currently, while the GUI stores feature and candidate selection cut-offs, it cannot recall whether these cut-offs had actually been applied (that is, whether the respective checkboxes had been selected).

Currently, no parameters or settings are automatically stored with the results.

Note: To address the previous three limitations, for systematic use it is recommended to

  1. use the console interface,
  2. manage distinct configurations using several user property files,
  3. and associate stored results with their respective configuration files and command-line options.

Acknowledgements

The tool "Agglomerative Hierarchical Clustering" was developed by Arne Brutschy, Christian Beutenmüller, Steffen Becker and Julian Hesselbach in the context of the lecture "Computational Linguistics" at the NLP Department, Leipzig University [1]. Special thanks go to Thomas Wittig for his competent supervision.

Additional features were added by Frank Binder. Thanks go to Chris Biemann and Uwe Quasthoff for their supervision, and to Thomas Wittig and Arne Brutschy for suggestions and comments.

Copyright

This software is available as part of the "ASV Toolbox". A separate release as open-source software will be considered.

References

[1] NLP Department, Leipzig University, http://www.asv.informatik.uni-leipzig.de
[2] Manning, C. D. and H. Schütze (1999). Foundations of statistical natural language processing. The MIT Press.
[3] Erik Velldal (2003). Modeling Word Senses With Fuzzy Clustering. University of Oslo.
[4] Wikipedia, Clusteranalyse, http://de.wikipedia.org/wiki/Clusteranalyse
[5] Wikipedia, Distanzfunktion, http://de.wikipedia.org/wiki/Distanzfunktion
[6] Wikipedia, Data clustering, http://en.wikipedia.org/wiki/Data_clustering
[7] Wikipedia, Dice's coefficient, http://en.wikipedia.org/wiki/Dice's_coefficient
[8] Wikipedia, Jaccard index, http://en.wikipedia.org/wiki/Jaccard_index
[9] Java SE Downloads, http://java.sun.com/javase/downloads/index.jsp
[10] MySQL Downloads, http://dev.mysql.com/downloads/
[11] Leipzig Corpora Collection, http://corpora.informatik.uni-leipzig.de/download.html
[12] The Apache Ant Project, http://ant.apache.org/
[13] ASV WordServer (restricted access), http://wortschatz.uni-leipzig.de/snipsnap/space/WordServer
