Assignment 1

Due date: Wednesday, October 5, 2016.
NOTES:
You can use the programming language of your choice.

Problem description

Develop a toy Web crawler. A Web crawler is an Internet program that systematically browses the World Wide Web, typically for the purpose of Web indexing. You will build your crawler according to the following requirements:
  1. The input is a text file that contains a list of URLs.
  2. The input file must also contain configuration parameters, such as:
    1. The time delay between successive requests to the same server.
    2. MAX DISTINCT URLS: the maximum number of distinct pages to be visited. When this maximum is reached, the crawler stops. A default value must be set, e.g., 1000.
    3. The type of URLs to be followed. For example, if we want the crawler to crawl only pages within the temple.edu domain, then it must check for the presence of this substring in each URL.

      • HINT: use a regular expression to determine whether a URL string contains one of the user-specified substrings (see the first sketch after this list).
  3. PARSING: you only need to retrieve the URLs from <a href="URL"> tags within the HTML pages (see the second sketch after this list).
  4. Relative and absolute URLs. Your crawler needs to recognize both relative and absolute URLs, and transform relative URLs into absolute ones as needed for fetching (the second sketch after this list also covers this). Only absolute URLs should be used in the crawler's queue and lookup data structures.
  5. Your crawler needs to organize visited Web pages as follows.
    • The list of URLs WITHIN the temple.edu domain.
    • The list of URLs from temple.edu Web pages to OUTSIDE domains, i.e., URLs that do not contain temple.edu.
  6. IMPORTANT: recall crawler politeness. Your crawler must behave in an ethical and polite way: avoid sending too many requests in rapid succession to the same server, and obey the Robots Exclusion Protocol (robots.txt). Both were discussed in class (see the third sketch after this list).
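
The sketches below use Python purely as an illustration; per the note above, any language is acceptable. First, a minimal sketch of the regex check from the hint in requirement 2, assuming the user-specified substrings have already been read from the input file (ALLOWED_SUBSTRINGS and is_allowed are hypothetical names):

    import re

    # Hypothetical: substrings read from the input file's settings.
    ALLOWED_SUBSTRINGS = ["temple.edu"]

    # Pre-compile one pattern per substring; re.escape keeps characters
    # such as '.' from being treated as regex metacharacters.
    ALLOWED_PATTERNS = [re.compile(re.escape(s)) for s in ALLOWED_SUBSTRINGS]

    def is_allowed(url):
        """Return True if the URL contains any user-specified substring."""
        return any(p.search(url) for p in ALLOWED_PATTERNS)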
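
Second, a sketch covering requirements 3 and 4 together: extracting hrefs with Python's standard html.parser module and resolving relative URLs with urllib.parse. The class and function names are illustrative:

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag

    class LinkExtractor(HTMLParser):
        """Collect the href values of all <a> tags on a page."""

        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.hrefs.append(value)

    def extract_links(base_url, html_text):
        """Return absolute, fragment-free URLs linked from a page."""
        parser = LinkExtractor()
        parser.feed(html_text)
        # urljoin resolves relative URLs against the page's own URL;
        # absolute hrefs pass through unchanged. urldefrag drops any
        # "#fragment" so the same page is not counted twice.
        return [urldefrag(urljoin(base_url, h))[0] for h in parser.hrefs]

For example, extract_links("http://www.temple.edu/about", '<a href="/admissions">') yields ["http://www.temple.edu/admissions"].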
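
Third, a sketch of politeness (requirement 6) using the standard urllib.robotparser module: robots.txt is fetched once per host and cached, and a per-host delay is enforced between requests. The user-agent string, delay value, and cache structures are all illustrative:

    import time
    from urllib import robotparser
    from urllib.parse import urlparse

    USER_AGENT = "ToyCrawler"   # hypothetical name; choose your own
    DELAY_SECONDS = 2.0         # in practice, read from the input file

    _robots = {}        # cache: host -> RobotFileParser
    _last_request = {}  # host -> time of the most recent request

    def may_fetch(url):
        """Check the host's robots.txt, fetching and caching it once per host."""
        host = urlparse(url).netloc
        if host not in _robots:
            rp = robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            try:
                rp.read()   # a missing (404) robots.txt permits everything
            except OSError:
                pass        # network failure: can_fetch() stays conservative
            _robots[host] = rp
        return _robots[host].can_fetch(USER_AGENT, url)

    def wait_politely(url):
        """Sleep until DELAY_SECONDS have passed since the last request to this host."""
        host = urlparse(url).netloc
        elapsed = time.time() - _last_request.get(host, 0.0)
        if elapsed < DELAY_SECONDS:
            time.sleep(DELAY_SECONDS - elapsed)
        _last_request[host] = time.time()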

Deliverables

You must produce a brief document (no more than 3 pages) that describes your crawler. Input settings: MAX DISTINCT URLS = 100,000 and start URL = www.temple.edu. The document must:
  1. Include a table with statistics about your crawler's activity. For example, if www.temple.edu is the starting point, then you need to provide (1) the total number of encountered URLs, (2) the number of URLs within the temple.edu domain, and (3) the number of URLs outside the temple.edu domain.
  2. Give a histogram of the top 20 most frequent URLs within the temple.edu domain.
  3. Give a rank-frequency graph of the encountered URLs within the temple.edu domain, and comment on whether it resembles a power-law distribution (see the sketch after this list).
  4. Give a histogram of the top 20 most frequent URLs outside the temple.edu domain.
  5. Give a rank-frequency graph of the encountered URLs outside the temple.edu domain, and comment on whether it resembles a power-law distribution.
  6. Provide the running time and the specifications of the computer on which you ran your crawler.
  7. Describe the steps you took to optimize your crawler.
  8. Describe the challenges you faced and how you addressed them.
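
A minimal sketch for items 3 and 5, in the same illustrative Python as the earlier sketches and assuming the matplotlib library is available; Counter(urls).most_common(20) likewise yields the data for the top-20 histograms in items 2 and 4:

    from collections import Counter
    import matplotlib.pyplot as plt

    def plot_rank_frequency(urls, title):
        """Plot URL frequency against rank on log-log axes."""
        freqs = sorted(Counter(urls).values(), reverse=True)
        ranks = range(1, len(freqs) + 1)
        # A power law f(r) ~ C * r^(-alpha) appears as a roughly straight
        # line on log-log axes, which is the usual visual check.
        plt.figure()
        plt.loglog(ranks, freqs, marker=".", linestyle="none")
        plt.xlabel("rank")
        plt.ylabel("frequency")
        plt.title(title)
        plt.savefig(title.replace(" ", "_") + ".png")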
Upload the document to Blackboard.