Web Crawler – Focused Crawler and Topical Crawler

A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
Web crawlers are also known as spiders, bots, web agents, or worms. They support universal search engines (Google, Yahoo, MSN, Windows Live, Ask, Bing, etc.).

Examples of crawlers: Googlebot, Scooter, Slurp, MSNbot, etc.

Types of Crawler

1. Universal crawlers

2. Preferential crawlers

   a. Focused crawlers

   b. Topical crawlers

Basic Crawler Algorithm
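
A basic crawler is commonly described as a loop over a frontier of unvisited URLs: take a URL from the frontier, fetch the page, extract its links, and add the links not seen before back to the frontier. The following is a minimal sketch of that loop, not a definitive implementation. To keep it runnable offline, the Web is simulated by a hypothetical in-memory dictionary `WEB`, and `fetch_links` is an assumed stand-in for downloading and parsing a page.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fetch_links(url):
    """Assumed stand-in for downloading a page and extracting its links."""
    return WEB.get(url, [])


def basic_crawl(seeds, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    frontier = deque(seeds)   # URLs waiting to be visited (FIFO)
    seen = set(seeds)         # URLs already placed in the frontier
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

With a FIFO frontier the crawl is breadth-first; the preferential crawlers described next differ only in how the frontier is ordered.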

Preferential Crawlers

Assume we can estimate an importance measure I(p) for each page p and want to visit pages in order of decreasing I(p). Preferential crawlers can be divided into two types: focused crawlers and topical crawlers.
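
Visiting pages in order of decreasing I(p) can be sketched by replacing the FIFO frontier with a priority queue. The example below is an illustration under stated assumptions: the link graph `LINKS` and the scores in `IMPORTANCE` are made-up data, and in practice I(p) would have to be estimated rather than looked up.

```python
import heapq

# Hypothetical link graph and importance estimates I(p).
LINKS = {"seed": ["x", "y", "z"], "x": ["w"], "y": [], "z": [], "w": []}
IMPORTANCE = {"seed": 1.0, "x": 0.2, "y": 0.9, "z": 0.5, "w": 0.8}


def preferential_crawl(seed, importance, links, max_pages=10):
    """Best-first crawl: always visit the frontier URL with the highest I(p)."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-importance[seed], seed)]
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-importance.get(link, 0.0), link))
    return order
```

Note that the ordering is only best-first with respect to the URLs currently in the frontier: a page with a high score that is discovered late (such as `w` here) is still visited after lower-scored pages that were discovered earlier.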

1. Focused Crawler

Rather than crawling pages from the entire Web, we may want to crawl only pages in certain categories. One application of such a preferential crawler would be to maintain a Web taxonomy such as the Yahoo! Directory (dir.yahoo.com) or the volunteer-based Open Directory Project (ODP, dmoz.org).

A focused crawler attempts to bias the crawl towards pages in the categories in which the user is interested. Chakrabarti proposed a focused crawler based on a classifier. The idea is to first build a text classifier using labeled example pages.

The classifier then guides the crawler by preferentially selecting from the frontier those pages that appear most likely to belong to the categories of interest, according to the classifier's prediction.

The focused crawler has three main components:
1. Classifier
2. Distiller
3. Crawler

Soft Focused Strategy : The crawler uses the score R(p) of each crawled page p as a priority value for all unvisited URLs extracted from p. The URLs are then added to the frontier.
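
The soft focused strategy can be sketched as a priority-queue crawl in which every extracted URL inherits the score of the page it was found on. This is a minimal illustration, not the original system: `PAGES` and `FOCUS_TERMS` are made-up data, and `relevance` is a toy keyword-based stand-in for the classifier score R(p), which a real focused crawler would obtain from a trained text classifier.

```python
import heapq

# Hypothetical pages: URL -> (text, outgoing links).
PAGES = {
    "home": ("sports news and cooking recipes", ["soccer", "pasta"]),
    "soccer": ("soccer league sports scores", ["scores"]),
    "pasta": ("pasta cooking recipes", ["sauce"]),
    "scores": ("sports scores table", []),
    "sauce": ("tomato sauce", []),
}
FOCUS_TERMS = {"sports", "soccer", "scores"}


def relevance(text):
    """Toy stand-in for R(p): fraction of words that are focus terms."""
    words = text.split()
    return sum(w in FOCUS_TERMS for w in words) / len(words)


def soft_focused_crawl(seed, max_pages=10):
    """Crawl where each URL's priority is R(p) of the page that linked to it."""
    frontier = [(-1.0, seed)]   # min-heap with negated priority; seed goes first
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        order.append(url)
        score = relevance(text)          # R(p) of the page just crawled
        for link in links:
            if link not in seen:         # every link is kept, none discarded
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return order
```

In this toy run no URL is discarded: the link found on the irrelevant `pasta` page is still crawled, only later than the link found on the relevant `soccer` page.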

Hard Focused Strategy : For a crawled page p, the classifier first finds the leaf category C(p) in the taxonomy most likely to include p. If an ancestor of C(p) is a focus category, then the URLs from the crawled page p are added to the frontier; otherwise they are discarded.
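
The hard focused rule amounts to walking up the taxonomy from C(p) and checking for a focus category. The sketch below assumes a hypothetical taxonomy stored as a child-to-parent map `PARENT`, and it assumes C(p) has already been predicted by the classifier. It also treats C(p) itself as matching when it is a focus category, which is an assumption of this example.

```python
# Hypothetical taxonomy: category -> parent category (None at the root).
PARENT = {
    "root": None,
    "sports": "root",
    "soccer": "sports",
    "tennis": "sports",
    "food": "root",
    "pasta": "food",
}


def ancestors(category):
    """Yield the category itself, then each ancestor up to the root."""
    while category is not None:
        yield category
        category = PARENT[category]


def links_to_enqueue(leaf_category, links, focus_categories):
    """Hard focus rule: keep the links only if C(p) lies under a focus category."""
    if any(c in focus_categories for c in ancestors(leaf_category)):
        return list(links)   # page is on topic: add its URLs to the frontier
    return []                # page is off topic: discard its URLs
```

For example, with `{"sports"}` as the focus set, links from a page classified as `soccer` are kept, while links from a page classified as `pasta` are dropped.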

2. Topical Crawler

Topical crawlers are used when example pages are not available in sufficient numbers to train a focused crawler before the crawl starts. They do not have text classifiers to guide crawling. An example is the MySpiders applet.

However, unlike a search engine, this application has no index to search for results. Instead, the Web is crawled in real time.
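
A query-time crawl of this kind can be sketched as follows. Since no classifier is available, this example assumes that links are prioritized by the cosine similarity between the query and the text of the page they were found on. That scoring choice, the data in `TOPIC_PAGES`, and all function names are assumptions made for illustration, not a description of how MySpiders works.

```python
import heapq
import math
from collections import Counter

# Hypothetical pages: URL -> (text, outgoing links).
TOPIC_PAGES = {
    "start": ("web crawler news", ["p1", "p2"]),
    "p1": ("new web crawler released today", []),
    "p2": ("cooking recipes", []),
}


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def topical_crawl(query, seed, max_pages=10):
    """Crawl at query time; return (url, similarity) hits, best first."""
    frontier = [(-1.0, seed)]   # min-heap with negated priority
    seen = {seed}
    hits = []
    while frontier and len(hits) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOPIC_PAGES[url]
        sim = cosine(query, text)        # computed on the freshly fetched page
        hits.append((url, sim))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-sim, link))
    return sorted(hits, key=lambda h: -h[1])
```

Every hit is scored on the page as fetched during the crawl, and the ranking uses only the query and the page text.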

The advantage of topical crawling is that all hits are fresh by definition. No stale results are returned because the pages are visited at query time. This type of crawler is suitable for applications that look for very recently posted documents, which a search engine may not have indexed yet.

The disadvantage of topical crawling is that it is slow compared to traditional search engines. In addition, the ranking algorithm cannot take advantage of global prestige measures such as PageRank.
