Assignment 2

Due date: Thursday, September 25, in the lab.
NOTES:
This is a more advanced Java project; it requires object-oriented programming techniques.

Problem description

Develop a toy Web crawler. A Web crawler is an Internet program that systematically browses the World Wide Web, typically for the purpose of Web indexing. You will build your crawler according to the following requirements:
  1. Provide a simple user interface from which the user can specify the starting point of the crawl. The user enters a URL and your interface displays the Web page corresponding to that URL. The user can input multiple URLs until they decide which one to finally follow.
  2. Once the user has decided on the starting URL, they can hit the START CRAWL button.
  3. The interface must include two additional fields: BANNED DOMAINS and MAX DISTINCT URLS. The former allows the user to specify the Internet domains the crawler must ignore: e.g., .NET, .GOV or .ORG. That is, if a URL contains one of the domains specified by the user, that URL must not be appended to the TO DO LIST. For example, if the crawler encounters a URL of the form
    http://nsf.gov/news/special_reports/
    this URL must not be added to the TO DO LIST.
      HINT: use Java regex to determine whether a URL string contains one of the user-specified substrings.

    Your crawler must run until it has visited MAX DISTINCT URLS. If the user does not specify this number, your user interface must supply a default value, say 100.
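The BANNED DOMAINS check hinted at above can be sketched with Java regex as follows. This is only an illustration: the class and method names are made up, and treating each banned domain as a case-insensitive substring of the URL is an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DomainFilter {
    private final List<Pattern> banned;

    public DomainFilter(List<String> bannedDomains) {
        // Compile each banned domain (e.g. ".gov") into a case-insensitive literal pattern.
        this.banned = bannedDomains.stream()
                .map(d -> Pattern.compile(Pattern.quote(d), Pattern.CASE_INSENSITIVE))
                .collect(Collectors.toList());
    }

    // Returns true if the URL contains any banned domain as a substring.
    public boolean isBanned(String url) {
        return banned.stream().anyMatch(p -> p.matcher(url).find());
    }

    public static void main(String[] args) {
        DomainFilter filter = new DomainFilter(List.of(".gov", ".org"));
        System.out.println(filter.isBanned("http://nsf.gov/news/special_reports/")); // true
        System.out.println(filter.isBanned("http://example.com/index.html"));        // false
    }
}
```

A URL that `isBanned` accepts is skipped: it is never appended to the TO DO LIST.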
  4. Your crawler needs to organize visited Web pages as follows. It distinguishes between:
    • Media Web pages: these pages contain images and/or videos. To detect them, look for the img HTML tag: e.g., <img src="pulpit.jpg" alt="Pulpit rock" width="304" height="228">.
    • Interactive Web pages: these pages contain Web forms. To detect them, look for the form HTML tag: e.g.,
      <form action="demo_form.asp" method="get">
      First name: <input type="text" name="fname"><br>
      Last name: <input type="text" name="lname"><br>
      <input type="submit" value="Submit">
      </form>
    • Textual Web pages: these pages contain plain text. Any Web page that does not fall into either of the previous two categories falls into this one.
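The classification above can be sketched with two tag patterns. Note one assumption: the assignment does not say what to do with a page that contains both an img tag and a form tag, so checking for images first (and thus calling such a page a Media page) is a choice of this sketch, not a requirement.

```java
import java.util.regex.Pattern;

public class PageClassifier {
    // Case-insensitive patterns for the tags that mark each category.
    private static final Pattern IMG  = Pattern.compile("<img\\b",  Pattern.CASE_INSENSITIVE);
    private static final Pattern FORM = Pattern.compile("<form\\b", Pattern.CASE_INSENSITIVE);

    // Classifies raw HTML: Media if it has an img tag, Interactive if it has
    // a form tag, otherwise Textual.
    public static String classify(String html) {
        if (IMG.matcher(html).find())  return "Media";
        if (FORM.matcher(html).find()) return "Interactive";
        return "Textual";
    }
}
```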
  5. For a Media Web page, keep track of all the images in the page, including their types: e.g., GIF, JPEG or PNG.
  6. For an Interactive Web page, record the number of input elements, the number of select elements, the number of option elements, and the number of buttons.
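These counts can be gathered with a small tag counter like the one below. The helper is illustrative; counting only opening tags, and treating the button tag as the only kind of button (rather than also counting <input type="submit">), are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormStats {
    // Counts opening occurrences of the given HTML tag, e.g. countTag(html, "input").
    public static int countTag(String html, String tag) {
        Matcher m = Pattern.compile("<" + Pattern.quote(tag) + "\\b",
                                    Pattern.CASE_INSENSITIVE).matcher(html);
        int count = 0;
        while (m.find()) count++;
        return count;
    }
}
```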
  7. For a Textual Web page, determine the number of times each of the following three terms appears in the page: temple, university and computer.
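A sketch of the term count, assuming case-insensitive, whole-word matches (the assignment specifies neither, so both are assumptions of this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermCounter {
    // Counts case-insensitive whole-word occurrences of a term in the page text.
    public static int count(String text, String term) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                                    Pattern.CASE_INSENSITIVE).matcher(text);
        int count = 0;
        while (m.find()) count++;
        return count;
    }
}
```

With whole-word matching, "computers" does not count as an occurrence of "computer"; drop the \b anchors if substring matches are preferred.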
  8. At the end of the crawl, your program must present a crawling report: your user interface must list the crawled pages along with their properties. For example, you could show two lists side by side. The list on the left-hand side shows all the crawled pages and has two columns: the first shows the title of the page (most Web pages have a title: e.g., <title>HTML form tag</title>) and the second shows the type of the Web page: ignored, Media Web page, Interactive Web page or Textual Web page. When the user clicks on a page in this list, the list on the right-hand side presents the details of that page. For example, for a Media Web page you need to display the number of GIF pictures, the number of JPEG pictures and the number of PNG pictures.
  9. Implementation requirements:
    • define an abstract class for Web pages, then use inheritance to define the remaining Web page types.
    • use Java ArrayList and LinkedList data structures to keep track of the data handled by your program.
      NOTE: using data structures other than these two will be penalized.
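The required hierarchy might be sketched as follows. The class names, fields, and report strings are illustrative only; what matters is the abstract base class, the subclass per page type, and the use of ArrayList.

```java
import java.util.ArrayList;
import java.util.List;

// Abstract base class: data common to every crawled page.
abstract class WebPage {
    private final String url;
    private final String title;

    WebPage(String url, String title) {
        this.url = url;
        this.title = title;
    }

    String getUrl()   { return url; }
    String getTitle() { return title; }

    // Each subclass reports its category name for the crawling report.
    abstract String getType();
}

class MediaPage extends WebPage {
    // Types of the images found on the page, e.g. "GIF", "JPEG", "PNG".
    private final List<String> imageTypes = new ArrayList<>();

    MediaPage(String url, String title) { super(url, title); }

    void addImageType(String type) { imageTypes.add(type); }
    List<String> getImageTypes()   { return imageTypes; }

    @Override String getType() { return "Media Web page"; }
}

class InteractivePage extends WebPage {
    private final int inputs, selects, options, buttons;

    InteractivePage(String url, String title,
                    int inputs, int selects, int options, int buttons) {
        super(url, title);
        this.inputs = inputs;
        this.selects = selects;
        this.options = options;
        this.buttons = buttons;
    }

    @Override String getType() { return "Interactive Web page"; }
}

class TextualPage extends WebPage {
    private final int temple, university, computer;

    TextualPage(String url, String title, int temple, int university, int computer) {
        super(url, title);
        this.temple = temple;
        this.university = university;
        this.computer = computer;
    }

    @Override String getType() { return "Textual Web page"; }
}
```

The crawler can then hold its crawled pages in a List<WebPage> (backed by an ArrayList) and let polymorphism supply the type column of the report.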