Crawling the Deep Web.

In your first assignment you were required to develop a surface Web crawler. In this phase of the project, you will be exposed to the challenges of crawling the Deep Web.

Requirements for this phase of the project

  • Choose a Deep Web source from one of the following domains: Hotels (e.g., Booking, Hotels), Restaurants (e.g., Tripadvisor, OpenTable, Yelp), Movies (e.g., Redbox, Rotten Tomatoes), Books (e.g., Barnes & Noble), or Electronics (e.g., Amazon, Newegg). Let L denote the website of your choice.
  • Automatically issue queries to L.
  • Determine whether L returns any results for your query.
  • L usually returns hundreds of records for a user query. The results are organized into pages, typically with 10 or 20 records per page. Hence, to gather all records returned for a query, your crawler needs to traverse all of those pages.
  • Each record has its own page that contains detailed information about an entity. For example, if the web site is from the Restaurant domain, for each restaurant it will provide pieces of data such as Name, Address, Phone, and Cuisine. Some web sites may even provide user reviews. Your crawler will need to collect all such data about each restaurant.
  • If the web site has user reviews, the reviews may also span multiple pages. Again, you will need to traverse all those pages and collect the data.
  • Before starting to work on this project, you will need to tell me which Deep Web site you plan to work on.
  • In Phase 2 of this project, you will organize this data in a relational database.
  • In Phase 3 of this project, you will extend your crawler to work with a different web site from the same domain. You have to explain the encountered challenges.
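The query-and-pagination steps above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the URL pattern (`/search?q=…&page=…`) and the class names (`record-link`, `next-page`) are assumptions for illustration, and you would inspect your chosen site's search form and result markup to find the real ones (and respect its robots.txt and rate limits).

```python
import time
import urllib.parse
from html.parser import HTMLParser

def build_query_url(base, term, page=1):
    # Hypothetical URL pattern -- inspect the site's search form for the real one.
    return f"{base}/search?{urllib.parse.urlencode({'q': term, 'page': page})}"

class ResultPageParser(HTMLParser):
    """Collects record links and the 'next page' link from one result page.
    The class names 'record-link' and 'next-page' are assumptions."""
    def __init__(self):
        super().__init__()
        self.record_links = []
        self.next_page = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if "record-link" in classes:
            self.record_links.append(a.get("href"))
        elif "next-page" in classes:
            self.next_page = a.get("href")

def crawl_results(base, term, fetch, max_pages=100, delay=1.0):
    """Issue one query and traverse every result page, returning links to the
    per-record detail pages. fetch(url) -> html is injected so the network
    layer (and politeness policy) can be swapped out or mocked in tests."""
    links = []
    url = build_query_url(base, term)
    for _ in range(max_pages):
        parser = ResultPageParser()
        parser.feed(fetch(url))
        if not parser.record_links:      # the site returned no results
            break
        links.extend(parser.record_links)
        if parser.next_page is None:     # last result page reached
            break
        url = urllib.parse.urljoin(url, parser.next_page)
        time.sleep(delay)                # be polite between requests
    return links
```

In practice `fetch` would wrap `urllib.request.urlopen` with a User-Agent header and error handling. The same traversal loop also works for paginated review pages, which follow "next" links the same way.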
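Once the crawler reaches a record's detail page, it must extract the entity's fields (for the Restaurant domain: Name, Address, Phone, Cuisine). A minimal sketch follows; the class-name-to-field mapping below is an assumption, and you would replace it with whatever markup your chosen site actually uses.

```python
from html.parser import HTMLParser

# Assumed mapping from CSS class names to record fields -- adapt per site.
FIELDS = {"name", "address", "phone", "cuisine"}

class DetailPageParser(HTMLParser):
    """Extracts one restaurant's fields from its detail page."""
    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None   # field whose text node we are waiting for

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field in FIELDS:
            if field in classes:
                self._current = field
                break

    def handle_data(self, data):
        # First non-empty text node after a matching tag is the field value.
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None
```

If the detail page also hosts user reviews spanning multiple pages, the same next-link traversal used for result pages applies to collect them all.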
Deliverables

    1. Write a report that describes your approach to crawling a Deep Web source.
    2. Upload the report to Blackboard.
Start early!