Phase 3: Playing with Term Weighting Schemes
Requirements
In this phase of the project we will explore several of the weighting schemes discussed in class.
Tasks to be accomplished in this phase of the project:
For each entity, create one (large) document that contains all the reviews for that entity. Include only the content of the reviews; remove user id, title, date, etc.
For each such document create a vector representation over the vocabularies you created in the previous two phases. You must have at least 2 vocabularies: the raw one (phase 1) and the one obtained in phase 2.
You will create 4 vector representations: binary, tf-idf with log normalization and inverse document frequency, tf-idf with double normalization and probabilistic inverse document frequency, and Okapi BM25.
Compute the distance between documents using Jaccard, Euclidean and Cosine for the following pairs: (i) Restaurant: Y and O and (ii) Hotels: H and B.
Optional: compute the similarities for Restaurant: T-O and Y-T.
Rank the scores in (i) and (ii) in descending order.
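The four vector representations listed above could be sketched as follows. This is a minimal illustration, not a reference solution: the function name `build_vectors`, the input layout (a dict of entity name to token list), and the BM25 parameters `k1=1.2` and `b=0.75` are assumptions. The formulas follow one common textbook convention (log normalization as log(1 + f), double normalization with K = 0.5, probabilistic idf clamped at zero, and a smoothed BM25 idf); if the definitions given in class differ, use those instead.

```python
import math
from collections import Counter

def build_vectors(docs, vocab, k1=1.2, b=0.75):
    """docs: dict of entity name -> list of tokens (hypothetical layout).
    vocab: list of terms. Returns scheme -> entity -> weights aligned with vocab."""
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts.values() if c[t] > 0) for t in vocab}
    # Document length here counts every token, including out-of-vocabulary ones.
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    names = ("binary", "tfidf_log", "tfidf_double_prob", "bm25")
    vectors = {scheme: {} for scheme in names}
    for name, c in counts.items():
        doc_len = len(docs[name])
        max_f = max((c[t] for t in vocab), default=0)
        rows = {scheme: [] for scheme in names}
        for t in vocab:
            f, n = c[t], df[t]
            if f == 0:
                # Assumption: a term absent from the document gets weight 0
                # in every scheme.
                for scheme in names:
                    rows[scheme].append(0.0)
                continue
            rows["binary"].append(1.0)
            # Log-normalized tf times plain idf.
            rows["tfidf_log"].append(math.log(1 + f) * math.log(n_docs / n))
            # Double normalization 0.5 times probabilistic idf, clamped at 0
            # so that terms found in half or more of the documents do not
            # receive negative (or undefined) weights.
            tf_double = 0.5 + 0.5 * f / max_f
            prob_idf = max(0.0, math.log((n_docs - n) / n)) if n < n_docs else 0.0
            rows["tfidf_double_prob"].append(tf_double * prob_idf)
            # Okapi BM25 with a smoothed, non-negative idf variant.
            bm25_idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            norm = f + k1 * (1 - b + b * doc_len / avg_len)
            rows["bm25"].append(bm25_idf * f * (k1 + 1) / norm)
        for scheme in names:
            vectors[scheme][name] = rows[scheme]
    return vectors

# Toy usage with made-up entities and tokens.
docs = {
    "A": ["good", "food", "good"],
    "B": ["bad", "food"],
    "C": ["good", "room"],
}
vocab = ["good", "food", "bad", "room"]
vectors = build_vectors(docs, vocab)
print(vectors["binary"]["A"])
```

Running the same function once per vocabulary (the raw one from phase 1 and the one from phase 2) yields the two sets of four representations.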
Deliverables
You have 2 vocabulary versions and, for each, 4 vector representations. You will therefore produce 8 ranked lists for each of (i) and (ii), 16 lists in total. Update your report to include the top 10 of each ranking. Show the pairs of entities in the form: Entity Name 1 : Entity Name 2: Distance.
Upload the report to Blackboard.
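One possible way to compute the three distances and print a ranking in the requested form is sketched below. The helper names, the entity names, and the toy vectors are hypothetical. Two choices here are assumptions rather than requirements of the assignment: Jaccard is computed in its generalized form (1 minus the sum of element-wise minima over the sum of element-wise maxima), which reduces to the usual set-based Jaccard distance on binary vectors and assumes non-negative weights; and pairs are formed by crossing every entity in one group with every entity in the other.

```python
import math

def jaccard_distance(u, v):
    # Generalized Jaccard distance; assumes non-negative weights.
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 - num / den if den > 0 else 0.0

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: distance to an all-zero vector is defined as 1.
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def rank_pairs(group_a, group_b, distance, top=10):
    """Score every cross pair, sort in descending order, and format the
    first `top` entries as 'Entity Name 1 : Entity Name 2: Distance'."""
    scored = [
        (name_a, name_b, distance(vec_a, vec_b))
        for name_a, vec_a in group_a.items()
        for name_b, vec_b in group_b.items()
    ]
    scored.sort(key=lambda row: row[2], reverse=True)
    return [f"{a} : {b}: {d:.4f}" for a, b, d in scored[:top]]

# Toy usage with made-up entity names and binary vectors.
group_a = {"Cafe A": [1, 1, 0, 0]}
group_b = {"Cafe B": [0, 1, 1, 0], "Cafe C": [1, 1, 0, 0]}
ranked = rank_pairs(group_a, group_b, jaccard_distance)
for line in ranked:
    print(line)
```

Passing `euclidean_distance` or `cosine_distance` as the third argument produces the corresponding ranking for the same pair of groups.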
Start early!