Text processing and statistics.

Requirements

We have two datasets, Restaurants and Hotels. Each consists of records about restaurants or hotels, respectively, which have fields such as name, address, and phone. Each record is accompanied by a set of user reviews. Each review has items such as user ID, title, time, and content. The restaurant dataset has records from Yelp (R-Y), Tripadvisor (R-T), and OpenTable (R-O). The hotel dataset has records from Hotels (H-H) and Booking (H-B).
The tasks to be accomplished in this phase of the project are:
  1. Compute the distribution of word frequencies i) for the Restaurants and Hotels datasets as a whole and ii) individually for R-Y, R-T, R-O, H-H, and H-B. In total, you will produce 7 (seven) distributions. Plot them and give the corresponding expressions of the Zipf's law distributions. You need to estimate the parameter k from the data. In this study, each user review represents a document.
  2. For each of the 7 datasets, give the following: a) total documents, b) total word occurrences, c) vocabulary size, d) words occurring more than 1000 times, e) words occurring once. Also give a table with the top-50 most frequent words.
  3. Compute the vocabulary growth distribution for each of the 7 datasets. See the lecture slides about Heaps' law. Estimate the two parameters for each of the datasets. Give the plots.
  4. Comment on the observed trends in the two datasets. Describe your tokenization procedure. Use a naive tokenization procedure for this phase of the project.
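
As a starting point, the tasks above could be sketched as follows. This is a minimal illustration, not a required implementation. It assumes that the reviews of one dataset are already loaded as a list of strings, that naive tokenization means lowercasing and keeping runs of letters, digits, and apostrophes, and that Zipf's law is written as f(r) = C / r^s and Heaps' law as V(n) = K * n^beta, both fitted by least squares in log-log space. The lecture slides may use a different parameterization (for example, a single constant k), so adapt the fit accordingly. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase, keep runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (needs >= 2 distinct x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def corpus_stats(reviews, top_n=50, high=1000):
    """Return the summary statistics of task 2 plus the raw word counts."""
    counts = Counter()
    for review in reviews:  # each review is one document
        counts.update(tokenize(review))
    stats = {
        "documents": len(reviews),
        "occurrences": sum(counts.values()),
        "vocabulary": len(counts),
        "over_high": sum(1 for c in counts.values() if c > high),
        "once": sum(1 for c in counts.values() if c == 1),
        "top": counts.most_common(top_n),
    }
    return stats, counts


def fit_zipf(counts):
    """Fit log f = log C - s * log r over ranked frequencies; return (C, s)."""
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    slope, intercept = fit_line(xs, ys)
    return math.exp(intercept), -slope


def fit_heaps(reviews):
    """Fit log V = log K + beta * log n, sampling after each review; return (K, beta)."""
    seen, n, xs, ys = set(), 0, [], []
    for review in reviews:
        tokens = tokenize(review)
        if not tokens:
            continue  # skip empty reviews to avoid log(0)
        n += len(tokens)
        seen.update(tokens)
        xs.append(math.log(n))
        ys.append(math.log(len(seen)))
    slope, intercept = fit_line(xs, ys)
    return math.exp(intercept), slope
```

For example, calling corpus_stats(reviews) on the R-Y reviews would give the counts for that dataset, and fit_zipf(counts) and fit_heaps(reviews) would give its fitted parameters. A least-squares fit over all ranks gives heavy weight to the long tail of rare words, so you may prefer to fit over a restricted rank range and should say so in the report.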

Deliverables

  1. Send a report with your findings that includes all the plots and tables. Describe the methods used to estimate the parameters.
  2. Upload the report to Blackboard.
Start early!