Python Experiment No: 12 Implementation of Web Crawler

Aim:- Implementation of Web Crawler
Theory:-

The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Due to this explosion in size, web search engines have become the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries.

Many researchers have studied web search technology over the last few years, including crawling strategies, storage, indexing, and ranking techniques, along with a significant amount of work on the structural analysis of the web and the web graph. Highly efficient crawling systems are therefore needed to download the hundreds of millions of web pages indexed by the major search engines. In fact, search engines compete against each other primarily on the size and currency of their underlying databases, in addition to the quality and response time of their ranking functions. Even the largest search engines, such as Google or AltaVista, currently cover only limited parts of the web, and much of their data is several months out of date. (Crawling speed, however, is not the only obstacle to increased search engine size; scaling query throughput and response time to larger collections is also a major issue.)

A crawler for a large search engine has to address two issues. First, it needs a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs a highly optimized system architecture that can download a large number of pages per second while remaining robust against crashes, manageable, and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on strategies for crawling important pages first [12, 21], crawling pages on a particular topic or of a particular type, recrawling (refreshing) pages in order to optimize the overall "freshness" of a collection of pages, and scheduling of crawling activity over time.
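As a minimal illustration of the core operation described above (downloading a page and following its hyperlinks), the sketch below fetches a single page and extracts the links it contains. It assumes the third-party packages requests and beautifulsoup4 are installed; the start URL is a placeholder chosen only for illustration.

# Minimal sketch: fetch one page and extract its hyperlinks.
# Assumes 'requests' and 'beautifulsoup4' are installed.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_links(url):
    """Download a page and return the absolute URLs of all links on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Resolve relative links against the page URL.
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    for link in fetch_links("https://example.com/"):   # placeholder start page
        print(link)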

Crawling Applications:-
Breadth-First Crawler:

In order to build a major search engine or a large repository such as the Internet Archive, high-performance crawlers start out with a small set of seed pages and then explore other pages by following links in a "breadth-first-like" fashion. In practice, web pages are often not traversed in a strict breadth-first order but according to a variety of policies, e.g., for pruning crawls inside a web site or for crawling more important pages first. A simplified sketch of such a crawler is given below.
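A breadth-first crawler can be sketched with a FIFO frontier of URLs and a set of already-visited pages. The sketch below is a simplified illustration, not a production crawler: the seed URL, page limit, and delay are arbitrary choices, and it ignores robots.txt, parallel downloading, and persistent storage.

# Sketch of a breadth-first crawler: a FIFO frontier plus a visited set.
# Assumes 'requests' and 'beautifulsoup4' are installed; seed URL and limits are illustrative.
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed, max_pages=50, delay=1.0):
    frontier = deque([seed])      # URLs waiting to be downloaded (FIFO -> breadth-first order)
    visited = set()               # URLs already fetched, to avoid revisiting
    pages = {}                    # url -> downloaded HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue              # skip pages that fail to download
        pages[url] = response.text

        # Extract outgoing links and append them to the back of the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in visited:
                frontier.append(link)

        time.sleep(delay)         # simple politeness delay between requests
    return pages

if __name__ == "__main__":
    crawled = bfs_crawl("https://example.com/")   # placeholder seed URL
    print(f"Downloaded {len(crawled)} pages")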

Recrawling Pages for Updates:
After pages are initially acquired, they may have to be periodically recrawled and checked for updates. In the simplest case, this could be done by starting another broad breadth-first crawl, or by simply requesting all URLs in the collection again.
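One lightweight alternative to blindly re-downloading a known URL is an HTTP conditional request. The sketch below re-fetches a page only if the server reports that it has changed; it assumes the requests package, and the stored ETag value is a hypothetical example.

# Sketch: recheck a known URL for updates using an HTTP conditional GET.
# Assumes 'requests' is installed; the stored ETag value is hypothetical.
import requests

def recrawl_if_changed(url, old_etag=None):
    """Return (new_html, new_etag) if the page changed, else (None, old_etag)."""
    headers = {}
    if old_etag:
        headers["If-None-Match"] = old_etag     # ask the server to compare against our copy
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 304:             # 304 Not Modified: our copy is still fresh
        return None, old_etag
    response.raise_for_status()
    return response.text, response.headers.get("ETag")

if __name__ == "__main__":
    html, etag = recrawl_if_changed("https://example.com/")   # placeholder URL
    print("changed" if html is not None else "unchanged")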


Focused Crawling:
More specialized search engines may use crawling policies that attempt to focus only on certain types of pages, e.g., pages on a particular topic or in a particular language, images, mp3 files, or computer science research papers.
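A very simple way to express such a policy is to filter pages before enqueuing their links. The sketch below keeps only pages whose visible text mentions a given topic keyword; the keyword test is a deliberately naive placeholder for the trained topic classifier a real focused crawler would use.

# Sketch: a naive relevance filter for a focused crawler.
# A real focused crawler would use a trained topic classifier; this keyword test is a placeholder.
from bs4 import BeautifulSoup

def is_relevant(html, topic_keyword="machine learning"):
    """Crude topic test: does the visible page text mention the keyword?"""
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return topic_keyword.lower() in text

# Inside a crawler loop, links would only be enqueued from relevant pages, e.g.:
#     if is_relevant(response.text):
#         frontier.extend(extract_links(response.text))   # extract_links as in the earlier sketches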

Random Walking and Sampling:
Several techniques have been studied that use random walks on the web graph (or a slightly modified graph) to sample pages or estimate the size and quality of search engines.
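The sketch below performs a simple random walk of this kind: starting from a seed page, it repeatedly jumps to one randomly chosen outgoing link, restarting from the seed when a page has no usable links. The seed URL and walk length are illustrative assumptions.

# Sketch: a random walk over web pages, picking one outgoing link at random at each step.
# Assumes 'requests' and 'beautifulsoup4' are installed; seed URL and walk length are illustrative.
import random
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def random_walk(seed, steps=10):
    url = seed
    sample = []
    for _ in range(steps):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            url = seed                      # restart the walk on failure
            continue
        sample.append(url)
        soup = BeautifulSoup(response.text, "html.parser")
        links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)
                 if a["href"].startswith(("http", "/"))]
        url = random.choice(links) if links else seed   # jump to a random outlink, or restart
    return sample

if __name__ == "__main__":
    print(random_walk("https://example.com/"))          # placeholder seed URL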

Crawling the “Hidden Web”:
A lot of the data accessible via the web actually resides in databases and can only be retrieved by posting appropriate queries and/or filling out forms on web pages.
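Such content is typically reached by submitting a form programmatically rather than by following links. The sketch below posts form data to a hypothetical search endpoint using requests; the URL and the form field name are assumptions made only for illustration.

# Sketch: retrieving "hidden web" content by submitting a form.
# The endpoint URL and the form field name 'query' are hypothetical placeholders.
import requests

def query_hidden_database(search_term):
    form_data = {"query": search_term}                        # assumed form field name
    response = requests.post("https://example.com/search",    # placeholder form endpoint
                             data=form_data, timeout=10)
    response.raise_for_status()
    return response.text                                      # result page containing database records

if __name__ == "__main__":
    print(len(query_hidden_database("web crawler")))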



Q1) Explain the working of a web crawler.



Comments