Phase 4: Merge and Clean.
In this phase of the project you will learn about Data Cleaning: computational procedures to automatically or
semi-automatically identify and correct errors in data sets.
Bad data costs the US economy over $3.1 trillion every year. Scary, isn't it? In this phase of the project we get a first encounter with the problem.
Requirements
First you will work with the data from Microsoft Academic Search (MAS).
- Find instances of the homonymy problem. In this task, you will find all distinct conferences that share the same acronym. Provide a complete list grouped by acronym. Perform the same task for journals.
- Count all instances of conferences (and, separately, journals) whose names do not follow the pattern ACRONYM HYPHEN LONG NAME. Include in your report 10 interesting instances for conferences and 10 for journals.
- Attempt to categorize as many as possible of the offending instances collected in the previous step. E.g., one category is the set of conferences that do not have an acronym (give all instances), another is the set of conferences that contain multiple HYPHENS (give all instances).
- Provide solutions to detect the acronym for all identified categories. Place the acronym in the field ShortName.
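The categorization and acronym-detection steps above can be sketched as follows. This is a minimal illustration, not a prescribed solution: the category names, the regular expressions, the assumption that the separator is a hyphen surrounded by spaces, and the helper names classify and category_distribution are all hypothetical choices, and the example venue names are illustrative strings rather than rows taken from MAS.

```python
import re
from collections import Counter

# Assumption: an acronym is a short token starting with an upper-case letter.
ACRONYM = re.compile(r"^[A-Z][A-Za-z0-9&/]{1,11}$")
# Assumption: the HYPHEN separator has whitespace on both sides,
# so hyphens inside words (e.g. "Data-Driven") are not separators.
SEPARATOR = re.compile(r"\s+-\s+")
# Assumption: some names carry the acronym in parentheses instead.
PAREN = re.compile(r"\(([A-Z][A-Za-z0-9&/]{1,11})\)")


def classify(name):
    """Return (category, acronym or None) for one venue name."""
    parts = [p.strip() for p in SEPARATOR.split(name.strip())]
    if len(parts) == 1:
        # No separator at all: look for a parenthesized acronym.
        found = PAREN.search(name)
        if found:
            return ("parenthesized_acronym", found.group(1))
        return ("no_acronym", None)
    head = parts[0]
    if not ACRONYM.match(head):
        return ("leading_part_not_acronym", None)
    if len(parts) == 2:
        return ("well_formed", head)
    return ("multiple_hyphens", head)


def category_distribution(names):
    """Count how many names fall into each category."""
    return Counter(classify(n)[0] for n in names)


if __name__ == "__main__":
    samples = [
        "VLDB - Very Large Data Bases",
        "Very Large Data Bases (VLDB)",
        "ICDE - Data Engineering - Workshops",
        "Workshop on Data-Driven Methods",
    ]
    for s in samples:
        print(s, "->", classify(s))
    print(category_distribution(samples))
```

The heuristic is deliberately simple: a one-word leading part such as "Journal" would be accepted as an acronym, so the offending instances in each category still need manual inspection.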
Second, connect the DBLP data to the data from Microsoft Academic Search using the conference/journal information.
- Perform cleaning similar to the above on the DBLP data in the table Publication and place the acronyms in the field venueClean.
- Attempt to match as many records as possible in the Publication table to the records in MAS using the fields ShortName and venueClean.
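One possible way to perform the match is to compare a normalized form of the two fields. The sketch below assumes the rows have already been loaded as Python dictionaries keyed by field name. The helpers normalize, match_publications and match_rate are hypothetical, and the sample rows are made up for illustration. Note that it counts every record that finds at least one candidate, which is only an upper bound on the number of correct matches.

```python
import re


def normalize(acronym):
    """Upper-case the value and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", (acronym or "").upper())


def match_publications(publications, mas_venues):
    """Pair each Publication row with the MAS rows sharing its acronym."""
    index = {}
    for venue in mas_venues:
        key = normalize(venue.get("ShortName"))
        if key:
            index.setdefault(key, []).append(venue)

    matched = []
    for pub in publications:
        key = normalize(pub.get("venueClean"))
        candidates = index.get(key, []) if key else []
        if candidates:
            matched.append((pub, candidates))
    return matched


def match_rate(publications, matched):
    """Return (number matched, percentage of all Publication rows)."""
    total = len(publications)
    percentage = 100.0 * len(matched) / total if total else 0.0
    return len(matched), percentage


if __name__ == "__main__":
    # Illustrative rows only, not real data.
    pubs = [{"venueClean": "vldb"}, {"venueClean": "SIGMOD Conf."}, {"venueClean": None}]
    mas = [{"ShortName": "VLDB"}, {"ShortName": "ICDE"}]
    result = match_publications(pubs, mas)
    print(match_rate(pubs, result))
```

A Publication row whose candidate list has more than one entry matched an ambiguous acronym, and deserves separate treatment in the report.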
Report:
- The number (and percentage) of records in Publication that you were able to correctly match.
- The categories of conference/journal titles that you identified, and the distribution of titles across these categories.
- The number of conferences/journals for which you were able to find the acronyms in each category.
- Describe what the homonymy problem leads to in this phase of the project. What may go wrong? Suggest steps to solve it. Implement at least one solution and report how it performs.
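As a starting point for the homonymy analysis, acronyms shared by several distinct venues can be found by grouping full names by acronym. The sketch below assumes the input is a list of (acronym, full name) pairs. The helper find_homonyms is hypothetical, acronyms are compared case-insensitively by assumption, and the venue names are invented for illustration.

```python
from collections import defaultdict


def find_homonyms(venues):
    """Map each acronym to the distinct full names that share it.

    Only acronyms with two or more distinct full names are returned.
    """
    names_by_acronym = defaultdict(set)
    for acronym, full_name in venues:
        if acronym:
            names_by_acronym[acronym.strip().upper()].add(full_name.strip())
    return {
        acronym: sorted(names)
        for acronym, names in names_by_acronym.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Invented rows for illustration only.
    venues = [
        ("ABC", "Conference on Alpha Beta Computing"),
        ("ABC", "Conference on Applied Bio Chemistry"),
        ("XYZ", "Workshop on XYZ Systems"),
        ("xyz", "Workshop on XYZ Systems"),
    ]
    for acronym, names in find_homonyms(venues).items():
        print(acronym, names)
```

Any Publication record matched through an acronym returned here may be attached to the wrong venue, so these matches are the ones to disambiguate with additional evidence.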
Deliverables
Update your report and include a detailed description of your algorithms.
Include in your report SMALL pieces (no more than two pages!) of source code that convincingly show that you implemented all the required steps of this phase of the project.
Do not submit any piece of your source code.
Describe the software packages that you use to implement your algorithms.
Describe the difficulties in implementing your algorithms.
Provide statistics about the extracted data: see above.
Start early!
Collaborate! Compare and discuss your approaches. The end product is expected to be an individual effort!