Introduction to Document Databases
JSON is a popular textual data format that's used for exchanging data in modern web and mobile applications. JSON is also used for storing unstructured data in log files or NoSQL databases. The goal of this project is to introduce you to JSON and Data integration.
Requirements
- Phase 1. Get familiar with JSON. In this task, you will need to read about JSON and get familiar with it. Learn about Export relational data to JSON. Learn about Import JSON documents into a document databases. Different DMBSes have different statements to support them. You will need to read about them and implement on some toy examples.
- Document database. Choose a database system that manages JSON documents. You can choose from a number of systems: MySQL, MongoDB, or Oracle. More choices are available here. You will collect and receive a large number of JSON documents. In this task you will create a document database based on this data.
- Phase 2: Data collection from Twitter. In this task, you will use Twitter API to collect data from Twitter stream. Collect at least 1M tweets. See here some useful links: API and Research Use Cases .
- Phase 3. Study Query Languages for Document DMBSes. In this task, you will study . If you choose MongoDB, here is your starting point. I will provide additional docs in canvans. If you choose a different DBMS, you will need to read its documentation.
- Phase 4. Additional Document DMBSes. In this task, you will receive JSON documents from 2 other sources whose schemas are different from that of Twitter. You will load them into your DBMS, You will consider 2 scenarios. To come
- Phase 5. JSON to Relational. In this task, you create a relational database corresponding to the UNION of the 3 types of JSON documents. (In other words, you integrate the 3 your 3 types of JSON documents into one relational database.) Then, you create a tool that loads your JSON files into your new database. Your schema must have at least 2 tables, e.g., User and Post. You will need to add the attributes to those tables and identify constraints, like PKs and FKs. Your datbase schema needs to have clearly defined tables that minimize redundancy: e.g., your table User has one entry per user, that is, if a user has multiple posts that user is inserted in table User and the comments of that user will be inserted in Post; you need to add a FK to Post that points to User ID.
Write the SQL queries for the queries you completed in Phase 4.
Take steps to optimze your queries, like adding indices. Give the screenshots of query plans of each query before and after your optimization.
Deliverables
A semester report. However, you will need to upload a weekly version of report on each Monday until the end of the semester. Each subsequent version of the report marks clearly the new parts, e.g., using different font styles or colors. The report starts with a progress section, which gives an itimized list of the updates to your project and report.
Include in your report SMALL pieces (no more than two pages!) of source code that convincingly show that you implemented the tasks.
Source code - only after you give an in-person demo.
Describe the software packages that you use to implement your algorithms.
Describe the difficulties in implementing your algorithms and how you overcame them.
Provide statistics about the queries.
Include a screen shots that you ran your queries on your computer.
Give details about your steps to improve the queries.
Start early!
Collaborate! Compare and discuss your approaches. The end product is expected to be an individual effort!