Phase 2: Web data scraping.

Requirements

  1. Design a Web data extraction algorithm. The algorithm needs to extract data from conference websites, such as ICDE, VLDB, or SIGMOD. There is no restriction on the techniques you employ: brute-force, learning-based, and rule-based algorithms are all acceptable.
  2. Implement your algorithm in the programming language of your choice.
  3. Insert data into the database PubWorld that you created in the previous phase.
  4. Measure the accuracy of your extraction algorithm. Data on the Web is formatted differently from one website to the next, so your algorithm is unlikely to be 100% accurate, which is acceptable.
  5. These papers describe algorithms for extracting data from Web lists.
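
As an illustration of requirements 1 through 4, here is a minimal rule-based sketch in Python. Everything specific in it is an assumption made for the example: the list markup (<li><b>Title</b><br>Authors</li>), the sample entries, the two-table schema, and the use of an in-memory SQLite database as a stand-in for PubWorld, whose real schema comes from your previous phase. Accuracy is measured as precision and recall against a small hand-labeled gold list, which is one reasonable choice among several.

```python
import re
import sqlite3

# ASSUMPTION: each paper appears as <li><b>Title</b><br>Author, Author and Author</li>.
# Real conference pages differ; this rule is only one hypothetical pattern.
ENTRY = re.compile(
    r"<li>\s*<b>(?P<title>.*?)</b>\s*(?:<br\s*/?>)?\s*(?P<authors>.*?)\s*</li>",
    re.IGNORECASE | re.DOTALL,
)
# Authors are separated by commas and/or the word "and".
AUTHOR_SEP = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def extract_papers(html):
    """Return (title, (author, ...)) tuples for every entry matching the rule."""
    records = []
    for m in ENTRY.finditer(html):
        title = " ".join(m.group("title").split())  # normalize whitespace
        authors = tuple(
            a.strip() for a in AUTHOR_SEP.split(m.group("authors")) if a.strip()
        )
        records.append((title, authors))
    return records


def store(conn, records):
    """Insert records into a HYPOTHETICAL schema standing in for PubWorld."""
    conn.execute("CREATE TABLE IF NOT EXISTS paper (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS paper_author (paper_id INTEGER, author TEXT)")
    for title, authors in records:
        cur = conn.execute("INSERT INTO paper (title) VALUES (?)", (title,))
        conn.executemany(
            "INSERT INTO paper_author VALUES (?, ?)",
            [(cur.lastrowid, a) for a in authors],
        )
    conn.commit()


def precision_recall(extracted, gold):
    """Compare extracted records with a hand-labeled gold list (exact match)."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up sample page; the third entry uses <i>, so the rule misses it.
SAMPLE_HTML = """
<ul>
  <li><b>A Hypothetical Paper on Indexing</b><br>Alice Example, Bob Sample and Carol Placeholder</li>
  <li><b>Another Made-Up Title</b> Dan Demo</li>
  <li><i>Keynote: An Entry the Rule Misses</i> Erin Fictional</li>
</ul>
"""

# Hand-labeled ground truth for the sample page.
GOLD = [
    ("A Hypothetical Paper on Indexing",
     ("Alice Example", "Bob Sample", "Carol Placeholder")),
    ("Another Made-Up Title", ("Dan Demo",)),
    ("Keynote: An Entry the Rule Misses", ("Erin Fictional",)),
]

records = extract_papers(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")
store(conn, records)
precision, recall = precision_recall(records, GOLD)
print(f"extracted={len(records)} precision={precision:.2f} recall={recall:.2f}")
```

On this sample the rule extracts two of the three entries, so precision is high while recall is not. That gap is the kind of result requirement 4 asks you to measure and report.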

Deliverables

  • Update your report to include a detailed description of your algorithms.
  • Include in your report SMALL pieces (no more than two pages!) of source code that convincingly show that you implemented the program.
  • Do not submit any piece of your source code.
  • Describe the software packages that you use to implement your algorithm, e.g., RegEx.
  • Describe the difficulties in implementing your algorithm.
  • Algorithm analysis: If your algorithm is not 100% accurate, itemize the reasons.
  • Start early!
    Identify other students in the class who work on the same datasets as you. Collaborate! Compare and discuss your approaches.
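
For the algorithm analysis deliverable, one possible way to itemize the reasons for inaccuracy is to sort each disagreement with the hand-labeled data into a category and count the categories. The sketch below assumes records are keyed by title and that three categories are enough; the category names and the sample data are hypothetical, and your own analysis may need finer distinctions (for example, truncated titles or merged entries).

```python
from collections import Counter


def categorize_errors(extracted, gold):
    """Count error types; both arguments map title -> tuple of authors."""
    errors = Counter()
    for title, authors in gold.items():
        if title not in extracted:
            errors["missed entry"] += 1        # in gold, not extracted
        elif extracted[title] != authors:
            errors["wrong author list"] += 1   # title found, authors differ
    for title in extracted:
        if title not in gold:
            errors["spurious entry"] += 1      # extracted, not in gold
    return errors


# Made-up data for illustration only.
gold = {
    "Paper One": ("Alice Example", "Bob Sample"),
    "Paper Two": ("Carol Placeholder",),
    "Paper Three": ("Dan Demo",),
}
extracted = {
    "Paper One": ("Alice Example",),             # one author dropped
    "Paper Two": ("Carol Placeholder",),         # correct
    "Program Committee": ("Erin Fictional",),    # not a paper
}

error_counts = categorize_errors(extracted, gold)
for reason, count in error_counts.most_common():
    print(f"{reason}: {count}")
```

A table of such counts, with one example per category, is a compact way to present the itemized reasons in the report.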