Phase 1: XML to Relational Database.
In this phase of the project you will learn about XML and its connection to relational databases. First, you will need to get familiar with XML on your own. I suggest that you read Chapter 23 in the textbook.
Requirements
- Get familiar with XML. Read Chapter 23.
- Download the current XML file from DBLP. You need these two files: dblp.dtd and dblp.xml.gz. dblp.dtd is the schema of the XML document in dblp.xml.gz. This is analogous to the concepts of schema and instance that we discussed in class for the relational model.
- Download a Java (or your preferred programming language) library for parsing XML files. There are two options: in-memory parsing and sequential reading of the XML file. dblp.xml is over 1 GB when uncompressed, so in-memory parsing may not be feasible, depending on how much memory your computer has. For sequential reading you can use SAX (http://www.saxproject.org/). Other libraries are also available. Choose the one that you like.
- Parse the dblp.xml file and insert the data into a relational database. A possible starting point is available here: http://dblp.uni-trier.de/xml/docu/.
- Design your own relational schema.
- Studies:
- Study 1: Create the schema with all the constraints (e.g., PK, FK, UK, NOT NULL). Insert the records into the database. Report the time to insert all the records into each table. Consider only the major tables: People (or Authors) and Publications.
- Study 2: Create the schema without any constraints. Insert the records into the database. Report the time to insert all the records into each table. Consider only the major tables: People (or Authors) and Publications.
- Learn about batch inserts and use them for both Study 1 and Study 2 above.
- Compare and explain the obtained times.
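For the sequential-reading option described above, a SAX handler might be sketched as follows. This is only an illustration, not a reference solution: the class name `DblpSaxSketch`, the handler `CountingHandler`, and the list of record-level element names are assumptions (check the element names against your copy of dblp.dtd). The sketch only counts records and collects distinct author names; in the project you would replace the counting with inserts into your own schema.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DblpSaxSketch {
    // Record-level elements; ASSUMED here, verify against dblp.dtd.
    static final Set<String> RECORD_TAGS = Set.of(
            "article", "inproceedings", "proceedings", "book",
            "incollection", "phdthesis", "mastersthesis", "www");

    // Streams through the document and keeps only small summaries in memory.
    static class CountingHandler extends DefaultHandler {
        final Map<String, Integer> recordCounts = new TreeMap<>();
        final Set<String> authors = new HashSet<>();
        private StringBuilder authorText; // non-null only while inside <author>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (RECORD_TAGS.contains(qName)) {
                recordCounts.merge(qName, 1, Integer::sum);
            } else if (qName.equals("author")) {
                authorText = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so accumulate.
            if (authorText != null) {
                authorText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("author") && authorText != null) {
                authors.add(authorText.toString().trim());
                authorText = null;
            }
        }
    }

    // Parses the stream sequentially and returns the filled handler.
    static CountingHandler parse(InputStream in) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            CountingHandler handler = new CountingHandler();
            parser.parse(in, handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }
}
```

To run this on the real data you could pass a `java.util.zip.GZIPInputStream` wrapped around a `FileInputStream` for dblp.xml.gz. The parser needs to be able to find dblp.dtd to resolve character entities, and depending on your JDK you may have to raise the entity-expansion limit (for example through the `jdk.xml.entityExpansionLimit` system property).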
Deliverables
Start a semester-long report.
Update your report and include a detailed description of your algorithms.
Include in your report SMALL pieces (no more than two pages!) of source code that convincingly show that you implemented the XML parser for the dblp.xml file.
Do not submit your source code separately.
Describe the software packages that you use to implement your algorithms, e.g., SAX.
Describe the difficulties in implementing your algorithms.
Provide statistics about the extracted data: e.g., total number of people, publications.
Include a screenshot that shows the database on your computer.
Most importantly, report the running times of the 4 studies (Studies 1 and 2, each with and without batch inserts). Give your system configuration.
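One possible way to time a batched insert with JDBC is sketched below. It is a minimal illustration under stated assumptions: the class `BatchInsertSketch`, the table `author` with a single `name` column, and the batch size of 1000 are all hypothetical, and you still need a JDBC driver and connection for your own database. The same method can be run against the schema with constraints and the schema without them to obtain comparable timings.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInsertSketch {
    // ASSUMED batch size; tune it for your database and driver.
    static final int BATCH_SIZE = 1000;

    // Inserts all names in batches inside one transaction and
    // returns the elapsed wall-clock time in milliseconds.
    static long insertAuthors(Connection conn, List<String> names)
            throws SQLException {
        // Hypothetical table and column; adapt to your own schema.
        String sql = "INSERT INTO author (name) VALUES (?)";
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // avoid one commit per row
        long start = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch(); // send a full batch to the server
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // send the final partial batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

For the non-batched runs, a comparable baseline would call `executeUpdate()` once per row instead of `addBatch()`, with everything else kept the same so that only the insert strategy differs between measurements.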
Start early!
Collaborate! Compare and discuss your approaches. The end product is expected to be an individual effort!