Phase 2: Stemming and Lemmatization.

Requirements

In this phase of the project we will look into the effect of text normalization. (i) Stemming and (ii) lemmatization are two of the text normalization techniques we discussed in class. The tasks to be accomplished in this phase of the project are:
  1. You need to use only one of (i) or (ii), NOT both (unless you want to).
  2. Ideally, you should work in pairs: one of you uses (i) and the other uses (ii), so that you can compare the results. However, this is not required. Half of you will use (i) and the other half (ii), first come, first served. Please send me your preference by Monday, Oct. 24, 2016.
  3. NLP tools that implement these techniques already exist. For (i), the Porter stemmer is available at http://snowball.tartarus.org/download.html. For (ii), one lemmatizer is available in Stanford CoreNLP; another is in the Natural Language Toolkit (NLTK) and is based on WordNet. Other options for either (i) or (ii) are accepted provided you first run them by me.
  4. You need to perform stemming or lemmatization on the 7 datasets.
  5. Repeat the experiment from the previous phase on the normalized datasets.
  6. Discuss the changes you observe in vocabulary size, Zipf's distribution, and Heaps' law.
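
To make tasks 4 to 6 concrete, here is a minimal sketch of one way to compare vocabulary size before and after normalization and to estimate power-law parameters. It is an illustration under stated assumptions, not a required implementation: all function names are hypothetical, toy_stem is only a stand-in for whichever stemmer or lemmatizer you choose (it is NOT the Porter algorithm), and the fit assumes Heaps' law in the form V(n) = k * n^b, estimated by ordinary least squares in log-log space. The same fit applied to the rank-frequency pairs gives a rough Zipf exponent (the negative of the fitted slope).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Hypothetical placeholder normalizer: strips a few common suffixes.

    This is NOT the Porter stemmer; swap in your chosen tool here.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary_growth(tokens):
    """Return [(n, V(n))], where V(n) is the number of distinct types in the first n tokens."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


def rank_frequency(tokens):
    """Return [(rank, frequency)] in order of decreasing frequency, for a Zipf plot."""
    counts = Counter(tokens)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def fit_power_law(points):
    """Least-squares fit of log y = log k + b * log x; returns (k, b)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    k = math.exp(mean_y - b * mean_x)
    return k, b


if __name__ == "__main__":
    # Tiny made-up sentence, only to show the calls; use the 7 datasets instead.
    text = "The runner runs and jumped; the jumping runners were running."
    raw = tokenize(text)
    normalized = [toy_stem(t) for t in raw]
    print("types before:", len(set(raw)), "after:", len(set(normalized)))
    print("Heaps (k, b):", fit_power_law(vocabulary_growth(normalized)))
    print("Zipf  (k, slope):", fit_power_law(rank_frequency(normalized)))
```

On a real dataset you would run the same calls on the raw and the normalized token streams and compare the fitted parameters side by side. A log-log least-squares fit is only one possible estimation method; whichever method you use, describe it in the report.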

Deliverables

  1. Send a report with your findings that includes all the plots and tables. Describe the methods used to estimate the parameters.
  2. Upload the report to Blackboard.
Start early!