Objective

The objective of FindLinks is to procure the data for NextLinks. To this end, as many web pages as possible are loaded and the links within these pages are detected.
The intent is to use the spare computing power and bandwidth of a distributed system to detect new URLs on the web, similar to seti@home (see http://setiathome.ssl.berkeley.edu/).

Project Status

The system is still in a beta test phase. The FindLinks client is not stable yet but will be available soon.

Architecture

FindLinks has a client-server architecture. The FindLinks server is responsible for distributing URLs to the clients. The FindLinks clients process the URLs and send the results of their analysis back to the server. The clients are platform independent and can run on any computer connected to the internet.
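
The division of labour between server and clients can be illustrated with a small in-memory stand-in for the server. This is only a sketch under assumptions: the class name `UrlDispatcher`, its methods, and the `batch_size` parameter are hypothetical and say nothing about how the real FindLinks server is implemented or how it talks to its clients.

```python
from collections import deque


class UrlDispatcher:
    """Hypothetical stand-in for the server: hands out URLs, collects results."""

    def __init__(self, urls, batch_size=3):
        self.pending = deque(urls)      # URLs still waiting to be processed
        self.batch_size = batch_size    # assumed parameter, chosen by the caller
        self.discovered = set()         # URLs reported back by clients

    def next_batch(self):
        """Return the next batch of URLs for a client (empty list when done)."""
        count = min(self.batch_size, len(self.pending))
        return [self.pending.popleft() for _ in range(count)]

    def submit(self, found_urls):
        """Record the URLs that a client extracted and sent back."""
        self.discovered.update(found_urls)
```

A client would call `next_batch()`, process the URLs it received, and pass its findings to `submit()`. Duplicates reported by different clients collapse because the results are kept in a set.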

Technical Realization

The FindLinks server holds a list of several million URLs that have to be evaluated one after another. Each client receives its own package of 500 URLs and tries to download these 500 pages. The URLs contained in each downloaded page are extracted, and only the list of these URLs is sent back to the server. The client then receives the next package of 500 URLs, and so on.
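
The client-side step described above (download each page of a package, extract the URLs, return only the URL list) might look roughly like the following sketch. It is not the actual FindLinks client. Several details are assumptions made for illustration: only `href` attributes of `<a>` tags are considered, fragments are dropped, non-HTTP links are ignored, and `fetch` is a hypothetical callable supplied by the caller that returns the page text or `None` on failure.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

PACKAGE_SIZE = 500  # package size stated in the text


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment".
                url, _fragment = urldefrag(urljoin(self.base_url, value))
                if url.startswith(("http://", "https://")):
                    self.links.add(url)


def extract_links(html, base_url):
    """Return the set of URLs linked from one page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def process_package(urls, fetch):
    """Process one package and return only the sorted list of URLs found.

    `fetch` is an assumed callable: url -> HTML text, or None if the
    download failed. Pages that fail to download are skipped.
    """
    found = set()
    for url in urls[:PACKAGE_SIZE]:
        html = fetch(url)
        if html is None:
            continue
        found |= extract_links(html, url)
    return sorted(found)
```

Passing `fetch` in as a parameter keeps the sketch free of network code, so the extraction logic can be tried out with a plain dictionary of page texts, for example `process_package(urls, pages.get)`.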

For Webmasters: Load Balancing and robots.txt

A reasonable URL ordering should prevent individual servers from being overloaded by a large number of requests within a short time. The FindLinks server takes the file robots.txt (see http://www.robotstxt.org/) into account. Changes to that file are picked up after 30 days at the latest.
If you experience problems, send a short e-mail to wort@informatik.uni-leipzig.de so that we can react immediately.
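
For readers curious how such measures can work in principle, here is a minimal sketch of a robots.txt check, a 30-day staleness test, and one possible URL ordering. The real ordering used by FindLinks is not documented here. The round-robin strategy, the user-agent token `findlinks`, and all function names below are assumptions for illustration only. The robots.txt parsing uses Python's standard `urllib.robotparser` module.

```python
from collections import OrderedDict, deque
from datetime import datetime, timedelta
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "findlinks"            # hypothetical user-agent token
ROBOTS_MAX_AGE = timedelta(days=30)  # refresh interval stated in the text


def allowed(url, robots_txt, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_stale(fetched_at, now):
    """Return True if a cached robots.txt copy is due for a re-fetch."""
    return now - fetched_at >= ROBOTS_MAX_AGE


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts, so that each host is
    requested at most once per round (an assumed strategy)."""
    queues = OrderedDict()
    for url in urls:
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)
    ordered = []
    while queues:
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

With this ordering, a list containing many URLs from one host and a few from others is spread out so that the busy host is not hit in one burst. Once only a single host remains, its URLs still end up adjacent, so a real crawler would typically add a delay between requests to the same host as well.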

Imprint

FindLinks is a project of the Automated Speech Processing Group (see http://www.asv.informatik.uni-leipzig.de/) at the Institute of Computer Science at Leipzig University. Contact: wort@informatik.uni-leipzig.de
