Distributed web crawling

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. The idea is to spread the required computation and bandwidth across many computers and networks.

Contents:
1. Types
2. Implementations
3. Draw-backs
4. See also
5. External links

1. Types

Cho and Garcia-Molina (2002) studied two types of policies for assigning URLs to crawlers:

1. 1. Dynamic assignment

With this type of policy, a central server assigns new URLs to different crawlers dynamically. This allows the central server to, for instance, dynamically balance the load of each crawler.

Systems that use dynamic assignment can typically also add or remove downloader processes during the crawl. The central server may become a bottleneck, so for large crawls most of the workload must be transferred to the distributed crawling processes.
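
The load-balancing idea above can be sketched as follows. This is a minimal illustration, not a description of any particular crawler: the class name `CentralAssigner`, its methods, and the choice of "fewest pending URLs" as the load measure are all assumptions made for the example.

```python
class CentralAssigner:
    """Hypothetical central server that gives each new URL to the least-loaded crawler."""

    def __init__(self, crawler_ids):
        # Number of URLs assigned to each crawler and not yet reported as done.
        self.pending = {cid: 0 for cid in crawler_ids}

    def assign(self, url):
        """Pick the crawler with the fewest pending URLs (ties broken by id)."""
        cid = min(self.pending, key=lambda c: (self.pending[c], c))
        self.pending[cid] += 1
        return cid

    def mark_done(self, crawler_id):
        """Called when a crawler reports that it finished one download."""
        self.pending[crawler_id] -= 1

    def add_crawler(self, crawler_id):
        """Register a new downloader process while the crawl is running."""
        self.pending[crawler_id] = 0

    def remove_crawler(self, crawler_id):
        """Drop a downloader process; a real system would reassign its pending URLs."""
        del self.pending[crawler_id]
```

Because every assignment passes through `assign`, this object is exactly the kind of central component that can become a bottleneck in a large crawl.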

Shkapenyuk and Suel (2002) describe two configurations of crawling architectures with dynamic assignment:

A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders.
A large crawler configuration, in which the DNS resolver and the queues are also distributed.
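
As a rough illustration of "central queues per Web site", the sketch below keeps one FIFO queue per host and lets downloaders pull URLs in round-robin order across hosts. It is an assumption-laden toy, not the design described by Shkapenyuk and Suel: the class `PerSiteFrontier` and its methods are hypothetical, and DNS resolution and per-site delays are left out.

```python
from collections import deque
from urllib.parse import urlsplit


class PerSiteFrontier:
    """Hypothetical central frontier with one URL queue per Web site."""

    def __init__(self):
        self._queues = {}      # host -> deque of URLs waiting for that host
        self._ready = deque()  # hosts that currently have at least one queued URL

    def add(self, url):
        """Queue a URL under its host name."""
        host = urlsplit(url).hostname or ""
        queue = self._queues.setdefault(host, deque())
        if not queue:
            # The host had nothing queued, so it re-enters the rotation.
            self._ready.append(host)
        queue.append(url)

    def next_url(self):
        """Return the next URL for a downloader, rotating across hosts, or None."""
        if not self._ready:
            return None
        host = self._ready.popleft()
        queue = self._queues[host]
        url = queue.popleft()
        if queue:
            # More work remains for this host; send it to the back of the rotation.
            self._ready.append(host)
        return url
```

In the large configuration, one such structure would run on each of several machines rather than on a single central one.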

1. 2. Static assignment

With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how to assign new URLs to the crawlers.

For static assignment, a hashing function can be used to transform URLs (or, even better, complete Web site names) into a number that is the index of the crawling process responsible for them. Hashing whole site names keeps all pages of a site on the same process. Because some links lead from a Web site assigned to one crawling process to a Web site assigned to a different one, some exchange of URLs between processes must occur.
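
A hash-based assignment rule of this kind might look like the sketch below. The function name `assign_crawler` and the use of SHA-256 are choices made for the example; any hash that gives the same result on every machine would do (Python's built-in `hash()` does not, since it is randomized per process for strings).

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers) based on its host name."""
    # Hash the host rather than the full URL so a whole site stays on one crawler.
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a crawler index.
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

A crawling process with index `i` would keep a discovered URL when `assign_crawler(url, n) == i` and pass it to the responsible process otherwise.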

To reduce the overhead of exchanging URLs between crawling processes, the exchange should be done in batches, several URLs at a time, and the most cited URLs in the collection should be known to all crawling processes before the crawl starts (e.g., using data from a previous crawl) (Cho and Garcia-Molina, 2002).
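
Both optimizations can be sketched with a small buffer that each crawling process keeps for URLs belonging to its peers. Everything here is illustrative: `UrlExchangeBuffer`, the `send` callback, and the default batch size of 100 are assumptions, not values taken from the cited study.

```python
from collections import defaultdict


class UrlExchangeBuffer:
    """Hypothetical per-process buffer that forwards foreign URLs in batches."""

    def __init__(self, send, batch_size=100, well_known=()):
        self._send = send                   # callable(peer_index, list_of_urls)
        self._batch_size = batch_size
        self._well_known = set(well_known)  # e.g. most cited URLs from a previous crawl
        self._pending = defaultdict(list)   # peer_index -> URLs not yet sent

    def add(self, peer_index, url):
        """Buffer a URL owned by another process; send once the batch is full."""
        if url in self._well_known:
            return  # every process already knows this URL, so nothing is sent
        batch = self._pending[peer_index]
        batch.append(url)
        if len(batch) >= self._batch_size:
            self.flush(peer_index)

    def flush(self, peer_index):
        """Send whatever is buffered for one peer."""
        batch = self._pending.pop(peer_index, [])
        if batch:
            self._send(peer_index, batch)

    def flush_all(self):
        """Send all partial batches, for example at the end of the crawl."""
        for peer_index in list(self._pending):
            self.flush(peer_index)
```

A larger batch size means fewer messages but a longer delay before a peer learns about new URLs.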

An effective assignment function must have three main properties: each crawling process should get approximately the same number of hosts (balancing property); if the number of crawling processes grows, the number of hosts assigned to each process must shrink (contra-variance property); and the assignment must be able to add and remove crawling processes dynamically. To achieve all three properties, Boldi et al. (2004) propose consistent hashing, which replicates the buckets so that adding or removing a bucket does not require re-hashing the whole table.
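
The sketch below shows one common way to build consistent hashing, a hash ring with replicated points per crawler. It is a generic illustration under stated assumptions, not the implementation of Boldi et al.: the class `ConsistentHashRing`, the default of 100 replicas, and the SHA-256 based helper `_hash` are all choices made for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash of a string (hypothetical helper for this sketch)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical hash ring that maps host names to crawling processes."""

    def __init__(self, replicas=100):
        self._replicas = replicas  # ring points per crawler; more points, better balance
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> crawler id

    def add_crawler(self, crawler_id):
        """Place `replicas` points for the crawler on the ring."""
        for i in range(self._replicas):
            point = _hash(f"{crawler_id}#{i}")
            if point in self._owner:
                continue  # extremely unlikely collision; keep the existing owner
            self._owner[point] = crawler_id
            bisect.insort(self._points, point)

    def remove_crawler(self, crawler_id):
        """Remove the crawler's points; only hosts it owned change hands."""
        self._points = [p for p in self._points if self._owner[p] != crawler_id]
        self._owner = {p: c for p, c in self._owner.items() if c != crawler_id}

    def crawler_for(self, host):
        """Return the crawler owning the first ring point after the host's hash."""
        if not self._points:
            raise LookupError("no crawlers registered")
        i = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[i]]
```

With this structure, removing a crawler reassigns only the hosts that crawler owned, and adding one takes hosts only from their current owners, so the rest of the assignment is left untouched.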