Phase 3: XML to Relational Database.
In this phase of the project you will learn about XML and its connection to relational databases. First, you will need to get familiar with XML on your own. I suggest that you read Chapter 23 in the textbook.
Requirements
- Get familiar with XML: read Chapter 23.
- Download the current XML file from DBLP at http://dblp.uni-trier.de/xml/. You need two files: dblp.dtd and dblp.xml.gz. dblp.dtd is the schema (a DTD) of the XML document in dblp.xml.gz. This is analogous to the concepts of schema and instance that we discussed in class for the relational model.
- Download a Java library for parsing XML files. There are two options: in-memory parsing, which loads the whole document, and sequential (streaming) reading of the XML file. dblp.xml is over 1 GB when uncompressed, so the in-memory option may not be feasible, depending on your computer's configuration. For sequential reading you can use SAX (http://www.saxproject.org/). Other libraries are also available; choose the one you like.
- Parse the dblp.xml file and insert the data into a relational database. A possible starting point is available here: http://dblp.uni-trier.de/xml/docu/.
- The schema of the relational database is available here: dbschema.
- Studies:
- Study 1: Create the schema with all the constraints. Insert the records into the database. Report the time to insert all the records in each table. Consider only the major tables: People (or Authors) and Publications.
- Study 2: Repeat Study 1, but make the indices for the major tables clustered.
- Study 3: Create the schema without the constraints. Insert the records into the database. Report the time to insert all the records in each table. Consider only the major tables: People (or Authors) and Publications.
- Compare and explain the times obtained in the three studies.
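To make the sequential-reading option concrete, here is a minimal sketch of a SAX handler using the parser that ships with the JDK (javax.xml.parsers). It only counts publication records and distinct author names in a tiny inline sample, and times the pass with System.nanoTime. The class name DblpSaxSketch, the sample document, and the list of record tags are assumptions made for illustration; check dblp.dtd for the actual element names. Inserting into the database is not shown. When parsing the real file, dblp.dtd should be in the same directory as dblp.xml, and you may need to raise the JDK's entity expansion limit.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DblpSaxSketch {

    // Assumed subset of record-level elements; verify against dblp.dtd.
    static final Set<String> RECORD_TAGS = new HashSet<>(Arrays.asList(
            "article", "inproceedings", "proceedings", "book",
            "incollection", "phdthesis", "mastersthesis"));

    // Streams through the document once, keeping only counters and author names.
    static class CountingHandler extends DefaultHandler {
        int publications = 0;
        final Set<String> authors = new HashSet<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inAuthor = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (RECORD_TAGS.contains(qName)) {
                publications++;
            }
            if ("author".equals(qName)) {
                inAuthor = true;
                text.setLength(0); // start collecting a new author name
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so append.
            if (inAuthor) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("author".equals(qName)) {
                authors.add(text.toString().trim());
                inAuthor = false;
            }
            // A real loader would hand the finished record to JDBC here
            // when qName is one of RECORD_TAGS.
        }
    }

    static CountingHandler parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-record sample standing in for dblp.xml.
        String xml = "<dblp>"
                + "<article key=\"a1\"><author>Ann Lee</author>"
                + "<author>Bob Ray</author><title>T1</title></article>"
                + "<inproceedings key=\"p1\"><author>Ann Lee</author>"
                + "<title>T2</title></inproceedings>"
                + "</dblp>";
        long start = System.nanoTime();
        CountingHandler h = parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("publications: " + h.publications);
        System.out.println("distinct authors: " + h.authors.size());
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

For the real file, the input stream could instead be a java.util.zip.GZIPInputStream wrapped around a FileInputStream for dblp.xml.gz, so the data never has to be fully decompressed on disk or held in memory. The same nanoTime pattern can be placed around the insert phase for each table when collecting the timings for the three studies.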
Deliverables
Update your report and include a detailed description of your algorithms.
Include in your report SMALL pieces (no more than two pages!) of source code that convincingly show that you implemented the XML parser for the dblp.xml file.
Apart from these excerpts, do not submit any of your source code.
Describe the software packages that you use to implement your algorithms, e.g., SAX.
Describe the difficulties in implementing your algorithms.
Provide statistics about the extracted data: e.g., total number of people, publications.
Include a screenshot that shows the database on your computer.
Start early!
Collaborate! Compare and discuss your approaches. The end product is expected to be an individual effort!