Documentation of Zipfel

Installation

Introduction

The Welcome Panel

The input panels

File input

Database input

Text input

Settings

The output panels

Statistics

Table

Ranks

Counts

Export

Diagram panel

How to use the Command Line Version

Parameters

Examples

How you can use the Zipfel-Tool in your own program

 

Installation

A description how to install a module is available at the main page of the ASV Toolbox project.

The line you have to copy into the toolbox.start file looks like this:

de.uni_leipzig.asv.toolbox.zipfel.ZipfelPlugin

Introduction

 

This tool demonstrates Zipf’s law. For reference, here is the definition of Zipf’s law:

“Originally, Zipf's law stated that, in a corpus of natural language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table. So, the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.” wiki[01]

 

The tool uses input from a file, a database or from plain text input. It extracts the words from the given input and calculates the frequencies of the words.

After this, it shows you the result in a table or a diagram.
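
The extraction and counting step described above can be sketched in a few lines of Java. This is only an illustration and not Zipfel's own code: the class name WordFrequencySketch, the lower-casing and the rule that a word is a run of letters are all assumptions, and the tool may tokenize differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordFrequencySketch {

    /** Counts how often each lower-cased word occurs in the text. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a word is a maximal run of letters; Zipfel may split differently.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sorts the words by descending frequency; list index + 1 is the rank. */
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Ties are broken alphabetically so that the order is reproducible.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> ranked =
                rank(countWords("the cat and the dog and the bird"));
        for (int i = 0; i < ranked.size(); i++) {
            Map.Entry<String, Integer> e = ranked.get(i);
            System.out.println((i + 1) + "\t" + e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Each printed line holds the rank, the word and its frequency, which are the quantities the output panels work with.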

 

 

The Welcome Panel

 

This panel contains a direct link to this help page. It also shows the current version of the Zipfel program and its authors.

The input panels

 

The input area offers three panels for input (file, database and text) and one panel for the settings.

 

File input

If you have a file containing the text you want to analyze, go to this panel, click the “browse” button and select the file on your local machine.

If the file is valid, click the start button. A popup is shown while the program analyzes your input.

 

Database input

The data can also be taken from a database. On this panel, you have to specify the parameters for the database connection.

The standard settings for LCC databases are shown below:

 

First, fill in the fields “host”, “port”, “name”, “password” and “database”. Then click the connect button. The plug-in loads the names of all available tables together with their columns. It assumes a table with word forms and frequencies.

If the program succeeded in loading this information, you can choose the proper table and its word and frequency columns.

Once you have chosen the right parameters, click the start button. A popup is shown while the program analyzes your input.


Text input

 

You can copy and paste any text you want into this field. Then click the start button. A popup is shown while the program analyzes your input.

Settings

 

On this panel you can change the settings for the tool.

 

 

 

Here you can set the following parameters:

-         show straight line: if this value is set, the straight line is shown

-         show boundaries: if this value is set, the boundaries are shown in the diagram panel

-         lower boundary: sets the lower boundary (if the value is set, you can change it by dragging the arrow or by clicking the “Input” button and typing the value directly)

-         upper boundary: sets the upper boundary (if the value is set, you can change it by dragging the arrow or by clicking the “Input” button and typing the value directly)


The output panels

 

Statistics

 

This panel shows the statistical results once your input has been analyzed.

The output in detail:

-         number of types (different words): how many different words (types) the input contains

-         number of tokens (including frequency): how many tokens the input contains, counting every occurrence of a word

-         text dependent constant k: the mean of the text dependent constants k, which are calculated for every word with the following formula:

          k = r * n, where r is the rank of the word and n is its frequency (the number of times it occurs)

-         language dependent constant c: the mean of the language dependent constants c, which are calculated with the following formula:

          c = r_n * n / N, where n is the frequency of the word, N is the number of all words (tokens) and r_n is the highest rank among all words with frequency n
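
A minimal sketch of the two mean values, assuming the word frequencies are already available as an array sorted in descending order, so that the rank of a word is its index plus one. The class and method names are hypothetical, and averaging c over all words (rather than over distinct frequencies) is an assumption, since the text does not say which is meant.

```java
class ZipfConstantsSketch {

    /** Mean of k = r * n over all words; counts must be sorted in descending order. */
    static double meanK(int[] counts) {
        double sum = 0;
        for (int i = 0; i < counts.length; i++) {
            sum += (double) (i + 1) * counts[i]; // rank r = i + 1
        }
        return sum / counts.length;
    }

    /** Mean of c = r_n * n / N over all words; counts must be sorted in descending order. */
    static double meanC(int[] counts) {
        long total = 0; // N, the number of tokens
        for (int n : counts) {
            total += n;
        }
        double sum = 0;
        // Walk from the lowest-ranked word upwards so that the highest rank
        // sharing the current frequency (r_n) is known before it is needed.
        int highestRank = counts.length;
        for (int i = counts.length - 1; i >= 0; i--) {
            if (i < counts.length - 1 && counts[i] != counts[i + 1]) {
                highestRank = i + 1;
            }
            sum += (double) highestRank * counts[i] / total;
        }
        return sum / counts.length;
    }

    public static void main(String[] args) {
        int[] counts = {4, 2, 2}; // made-up frequencies, N = 8
        System.out.println("mean k = " + meanK(counts));
        System.out.println("mean c = " + meanC(counts));
    }
}
```

In the made-up input the two words with frequency 2 share r_n = 3, which is why both contribute 3 * 2 / 8 to the mean of c.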

Table

 

This panel shows the result as a list of all words in the input. The words are sorted by their frequency.

The output in detail:

-         relative commonness: the relative frequency of the word, which is calculated with the following formula:

          n / N, where n is the frequency of the word and N is the number of all words (tokens)

-         text dependent constant k: the text dependent constant k of the word, which is calculated with the following formula:

          k = r * n, where r is the rank of the word and n is its frequency

-         language dependent constant c: the language dependent constant c of the word, which is calculated with the following formula:

          c = r_n * n / N, where n is the frequency of the word, N is the number of all words (tokens) and r_n is the highest rank among all words that occur n times
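
As a worked example with made-up numbers (not taken from any real corpus): suppose the input has N = 1000 tokens and a word at rank r = 3 occurs n = 50 times. If no other word occurs exactly 50 times, the highest rank with that frequency is also 3, and the three table values for this word are:

```latex
\begin{align*}
\frac{n}{N} &= \frac{50}{1000} = 0.05 \\
k &= r \cdot n = 3 \cdot 50 = 150 \\
c &= \frac{r_n \cdot n}{N} = \frac{3 \cdot 50}{1000} = 0.15
\end{align*}
```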

 

Ranks

 

This panel shows a selection of words, down to the lowest rank.

 

 

Counts

 

 

Export

 

Here you can export the results. There are three options:

-         Export as open document table: exports the result to an open document table

-         Export as CSV file: writes the result to a CSV file (comma-separated values)

-         Diagram as .png: saves the diagram as a PNG file (a graphics format)

Diagram panel

 

This panel shows you the result as a diagram. The x axis represents the rank of a word and the y axis represents the count n of a word. The axes are scaled logarithmically.
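
The reason for the logarithmic axes is that data following Zipf's law exactly (n = k / r) satisfies log n = log k - log r, so the points lie on a straight line with slope -1. The sketch below checks this numerically with a least-squares fit. How Zipfel computes the straight line it can display is not described here; least squares, the class name and the test data are assumptions chosen only for illustration.

```java
class LogLogSketch {

    /** Least-squares slope of log10(count) against log10(rank), with rank = index + 1. */
    static double logLogSlope(double[] counts) {
        int m = counts.length; // needs at least two strictly positive counts
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log10(i + 1);
            double y = Math.log10(counts[i]);
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }
        return (m * sumXY - sumX * sumY) / (m * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        // Ideal Zipf data: n = k / r with an arbitrary k = 1000.
        double[] counts = new double[100];
        for (int r = 1; r <= counts.length; r++) {
            counts[r - 1] = 1000.0 / r;
        }
        System.out.println("slope = " + logLogSlope(counts));
    }
}
```

For real text the slope is usually only close to -1, which is what the diagram lets you judge by eye.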

 


How to use the Command Line Version

java -classpath .;./lib/ASV_Zipfel.jar -Djava.ext.dirs=.;./lib de.uni_leipzig.toolbox.zipfel.ZipfelCmdLine [parameter …]

Parameters

-host <host>
            host of your database

-port <port; default=3306>
            port of the database

-user <user>
            your user name for the database

-pw <password>
            your password for the database

-db <database>
            database containing the data for Zipfel

-table <database table>
            table on which Zipfel should run

-wordcol <column containing words or word numbers>
            column containing the words or their word numbers for Zipfel

-countcol <column containing frequency>
            column containing the frequency of the words

-where <restriction using where; embed in quotation marks>
            this option allows you to specify a where clause for restricting the data

-filenamediagram <output file for diagram in png format>
            absolute path of the output file in png format

-filenametablecsv <output file for table in csv format>
            absolute path of the output file in csv format

-filenametableods <output file for table in open document format>
            absolute path of the output file in open document format
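
If you want to start the command line version from another Java program, the argument list can be assembled from the flags above. The following is a sketch only: the class and method names are hypothetical, every value passed in main is a placeholder, and the launcher part simply repeats the command shown at the top of this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ZipfelCommandSketch {

    /** Builds the full command from the documented flags; -port is left at its default. */
    static List<String> buildCommand(String host, String user, String password,
                                     String database, String table,
                                     String wordColumn, String countColumn,
                                     String csvFile) {
        // Launcher part copied from the documented command; ';' is the
        // Windows-style path separator used there.
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-classpath", ".;./lib/ASV_Zipfel.jar",
                "-Djava.ext.dirs=.;./lib",
                "de.uni_leipzig.toolbox.zipfel.ZipfelCmdLine"));
        cmd.addAll(Arrays.asList(
                "-host", host, "-user", user, "-pw", password,
                "-db", database, "-table", table,
                "-wordcol", wordColumn, "-countcol", countColumn,
                "-filenametablecsv", csvFile));
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder values only; replace them with your own connection data.
        List<String> cmd = buildCommand("localhost", "user", "secret", "mydb",
                "words", "word", "freq", "/tmp/zipfel.csv");
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Passing the arguments as a list avoids quoting problems, which matters most for a -where clause containing spaces.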

 

The command line output has the following format:

[restriction of the selection] TAB types TAB tokens TAB k TAB c TAB a TAB b
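
A program that calls the command line version could split such a line as sketched below. Several things are assumptions here: the class name, that numbers use a decimal point, and that the last six fields are always present while the restriction may be empty. The meaning of a and b is not explained in this document, so the sketch only stores them.

```java
import java.util.Arrays;

class ZipfelOutputSketch {
    final String restriction;
    final long types;
    final long tokens;
    final double k;
    final double c;
    final double a; // meaning not documented here
    final double b; // meaning not documented here

    private ZipfelOutputSketch(String restriction, long types, long tokens,
                               double k, double c, double a, double b) {
        this.restriction = restriction;
        this.types = types;
        this.tokens = tokens;
        this.k = k;
        this.c = c;
        this.a = a;
        this.b = b;
    }

    /** Parses one tab-separated output line; the last six fields must be numbers. */
    static ZipfelOutputSketch parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 6) {
            throw new IllegalArgumentException("expected at least 6 fields: " + line);
        }
        int first = fields.length - 6; // index of the "types" field
        // Everything before the numeric fields is treated as the restriction.
        String restriction = String.join("\t", Arrays.copyOfRange(fields, 0, first));
        return new ZipfelOutputSketch(
                restriction,
                Long.parseLong(fields[first].trim()),
                Long.parseLong(fields[first + 1].trim()),
                Double.parseDouble(fields[first + 2].trim()),
                Double.parseDouble(fields[first + 3].trim()),
                Double.parseDouble(fields[first + 4].trim()),
                Double.parseDouble(fields[first + 5].trim()));
    }

    public static void main(String[] args) {
        // Made-up line for illustration, not real Zipfel output.
        ZipfelOutputSketch r = parse("[freq>10]\t100\t5000\t812.5\t0.08\t1.5\t-1.02");
        System.out.println(r.types + " types, " + r.tokens + " tokens");
    }
}
```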

Examples:

 

How you can use the Zipfel-Tool in your own program

This example shows you how you can use the tool in your own program:

 

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.Hashtable;

import de.uni_leipzig.asv.toolbox.zipfel.Wort;
import de.uni_leipzig.asv.toolbox.zipfel.Zipfel;
import de.uni_leipzig.asv.toolbox.zipfel.ZipfelErgebnis;

/**
 * Reads input.txt, prints every word with its count,
 * and then prints the total reported by ZipfelErgebnis.
 *
 * @author madaxe
 */
public class ZipfelTest {

    /**
     * @return a stream over the file input.txt in the working directory
     * @throws IOException if the file cannot be opened
     */
    private InputStream getInput() throws IOException {
        return new FileInputStream(new File("input.txt"));
    }

    /**
     * Runs the analysis and prints the results.
     */
    private void run() {
        Zipfel z = new Zipfel();

        // try-with-resources closes the stream even if processing fails;
        // a missing file is reported by the catch block instead of passing null on
        try (InputStream in = getInput()) {
            // the values of the returned table are Wort objects
            Hashtable hs = new Hashtable(z.process(in));
            Enumeration enumer = hs.keys();
            Wort[] words = new Wort[hs.size()];
            int b = 0;
            while (enumer.hasMoreElements()) {
                Object key = enumer.nextElement();
                Wort w = (Wort) hs.get(key);
                // German identifiers: Wort = word, Anzahl = count
                System.out.println(w.getWort() + " : " + w.getAnzahl());
                words[b] = w;
                b++;
            }
            ZipfelErgebnis ze = new ZipfelErgebnis(words);
            // Gesamtzahl = total number
            System.out.println(ze.getGesamtzahl());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * @param args not used
     */
    public static void main(String[] args) {
        ZipfelTest zt = new ZipfelTest();
        zt.run();
    }
}


Links:

 

wiki[01] - http://en.wikipedia.org/wiki/Zipf%27s_law