Phase 2: Web data scraping.

Requirements

  1. Design a Web data extraction algorithm. The algorithm needs to extract data from the Microsoft Academic Search website, available at http://academic.research.microsoft.com/. There is no limitation on the techniques you decide to employ: brute-force, learning-based, and rule-based algorithms are all acceptable. The focus is on Computer Science.
  2. Extract all the Names of the Conferences and Journals in Computer Science available at http://academic.research.microsoft.com/?SearchDomain=2&entitytype=2.
  3. Extract the Conferences and Journals by Subject. Artificial Intelligence, Data Mining and Databases are among the subjects listed on the Microsoft Academic Search website. You are expected to extract all of them. Examples of Conferences in Databases include VLDB - Very Large Data Bases, ICDE - International Conference on Data Engineering and Data Base Workshops. Examples of journals include TKDE - IEEE Transactions on Knowledge and Data Engineering, VLDB - The Vldb Journal and IS - Information Systems. You are expected to extract all the conferences and journals for each subject. Note that the list of conferences (as well as that of journals) may span multiple pages: you are expected to navigate to those pages automatically and collect the data.
  4. Split each conference/journal name into Acronym and Long Name: e.g., in ICDE - International Conference on Data Engineering, "ICDE" is the acronym and "International Conference on Data Engineering" is the long name.
  5. The extracted data should be in the format: Subject, RawName, ShortName, LongName, URL. The URL is the link associated with the conference/journal on the webpage and appears in the href attribute, e.g., "/Conference/22/icde-international-conference-on-data-engineering" for ICDE - International Conference on Data Engineering.
  6. Keep track of all instances where the data does not follow the Acronym - Long Name pattern, e.g., Data Base Workshops or Geoinformatica.
  7. Implement your algorithm in the programming language of your choice.
  8. [Optional] Insert the data into a table with the schema Subject, RawName, ShortName, LongName, URL.
  9. Provide statistics about the extracted data: e.g., total number of extracted conferences/journals per subject.
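
As an illustration of requirements 2 and 5, the following is a minimal rule-based sketch in Python that collects the href and the visible name of each conference/journal link on one page. It uses only the standard library. The "/Journal/" prefix and the sample HTML are assumptions made for illustration (only the "/Conference/..." form appears above), and the real page markup may differ.

```python
from html.parser import HTMLParser


class VenueLinkParser(HTMLParser):
    """Collect (href, raw_name) pairs for anchors that look like venue links."""

    # Assumption: venue links start with one of these path prefixes.
    PREFIXES = ("/Conference/", "/Journal/")

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, raw_name) pairs
        self._href = None    # href of the venue anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.PREFIXES):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            raw_name = " ".join("".join(self._text).split())
            self.links.append((self._href, raw_name))
            self._href = None


def extract_venue_links(html):
    """Return all (href, raw_name) pairs found in one page of HTML."""
    parser = VenueLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment, used only to exercise the parser.
sample = """
<ul>
  <li><a href="/Conference/22/icde-international-conference-on-data-engineering">
      ICDE - International Conference on Data Engineering</a></li>
  <li><a href="/About">About</a></li>
</ul>
"""

for href, raw_name in extract_venue_links(sample):
    print(raw_name, "|", href)
```

Navigation links such as "/About" are skipped because they do not match the assumed prefixes; a regular-expression or learning-based extractor would be an equally valid choice.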

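Requirements 4, 5, 6 and 9 can be sketched together in Python as follows. This is one possible rule-based approach, not a prescribed solution: it assumes the acronym and long name are separated by the first " - " in the raw name, and the function names and empty placeholder URLs are hypothetical.

```python
import csv
import io
from collections import Counter

SEPARATOR = " - "  # assumed delimiter between acronym and long name


def split_name(raw_name):
    """Return (short_name, long_name); both are empty if the pattern is absent."""
    short, sep, long_name = raw_name.partition(SEPARATOR)
    if not sep or not short.strip() or not long_name.strip():
        return "", ""
    return short.strip(), long_name.strip()


def build_rows(entries):
    """Turn (subject, raw_name, url) triples into output rows.

    Returns (rows, exceptions), where exceptions holds the rows whose
    raw name does not follow the "Acronym - Long Name" pattern.
    """
    rows, exceptions = [], []
    for subject, raw_name, url in entries:
        short, long_name = split_name(raw_name)
        row = (subject, raw_name, short, long_name, url)
        rows.append(row)
        if not short:
            exceptions.append(row)
    return rows, exceptions


# Names taken from the examples above; empty URLs are placeholders.
entries = [
    ("Databases", "ICDE - International Conference on Data Engineering",
     "/Conference/22/icde-international-conference-on-data-engineering"),
    ("Databases", "Data Base Workshops", ""),
    ("Databases", "TKDE - IEEE Transactions on Knowledge and Data Engineering", ""),
]

rows, exceptions = build_rows(entries)

# Write the rows in the required column order.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Subject", "RawName", "ShortName", "LongName", "URL"])
writer.writerows(rows)
print(buffer.getvalue())

# Statistics: number of extracted venues per subject.
per_subject = Counter(row[0] for row in rows)
print(dict(per_subject))
print("Names not matching the pattern:", [row[1] for row in exceptions])
```

Splitting on the first separator only keeps any later " - " inside the long name; names without a separator, such as Data Base Workshops, are kept in the output with empty ShortName and LongName and are also recorded in the exception list.
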
Deliverables

  • Update your report and include a detailed description of your algorithms.
  • Include in your report SMALL pieces (no more than two pages!) of source code that convincingly show that you implemented the program.
  • Do not submit any piece of your source code.
  • Describe the software packages that you use to implement your algorithm, e.g., RegEx.
  • Describe the difficulties in implementing your algorithm.
  • Algorithm analysis: If your algorithm is not 100% accurate, itemize the reasons.
  • Start early!
  • Collaborate! Compare and discuss your approaches. The end product is expected to be an individual effort!