Assignment 6

Due date: Wednesday, November 27, 11:59PM.
NOTES:
This assignment requires Java object-oriented techniques. You will practice hash-based data structures, sorting, and other techniques. You will also practice some math in Java.

Problem description

In this assignment we extend the previous project. You are now required to analyze the content of a set D of text documents, preferably a set of HTML Web pages, according to the following requirements:

Requirements:

  1. Use the implementation from your previous assignment to count the occurrences of each word in each document. Denote by f(t, d) the number of occurrences of word t in document d.
  2. For each document d in D, compute max{f(w, d) : w in d}, which represents the maximum frequency of any word in document d. Denote this by max(d).
  3. Compute the expression 0.5 + (0.5 * f(t, d))/max(d) for each word t in document d. We denote this expression by tf(t, d).
  4. For each term t in document d, compute the expression log(|D|/g(t)), where |D| is the number of documents in D and g(t) is the number of documents in D that contain the word t. Note that t may appear in multiple documents, so compute g(t) once and use it whenever needed. Let idf(t, D) = log(|D|/g(t)).
  5. Compute the expression tf-idf(t, d, D) = tf(t, d) * idf(t, D).
  6. The base of the logarithm above does not matter. Use whatever is most convenient to you.
  7. For each document, show the k words with the largest tf-idf(t, d, D).
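
The formulas in steps 2 to 5 can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not a required design: the class name TfIdfSketch, the method names, and the choice to represent each document as a Map from word to count f(t, d) are all hypothetical, and the two sample documents are made up. The natural logarithm (Math.log) is used, which step 6 permits.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfSketch {

    /** Steps 2-3: tf(t, d) = 0.5 + 0.5 * f(t, d) / max(d) for every word of one document. */
    static Map<String, Double> tf(Map<String, Integer> counts) {
        int max = 0;
        for (int c : counts.values()) {
            max = Math.max(max, c); // max(d): largest count of any word in d
        }
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), 0.5 + 0.5 * e.getValue() / max);
        }
        return tf;
    }

    /** Step 4: idf(t, D) = log(|D| / g(t)); g(t) is computed once for the whole collection. */
    static Map<String, Double> idf(List<Map<String, Integer>> docs) {
        Map<String, Integer> g = new HashMap<>();
        for (Map<String, Integer> counts : docs) {
            for (String t : counts.keySet()) {
                g.merge(t, 1, Integer::sum); // one more document contains t
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : g.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    /** Step 5: tf-idf(t, d, D) = tf(t, d) * idf(t, D); assumes d was part of the docs passed to idf. */
    static Map<String, Double> tfIdf(Map<String, Integer> counts, Map<String, Double> idf) {
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : tf(counts).entrySet()) {
            result.put(e.getKey(), e.getValue() * idf.get(e.getKey()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two tiny made-up documents, already reduced to word counts f(t, d).
        Map<String, Integer> d1 = Map.of("java", 2, "hash", 1);
        Map<String, Integer> d2 = Map.of("java", 1, "sort", 3);
        Map<String, Double> idf = idf(List.of(d1, d2));
        System.out.println(tfIdf(d1, idf));
    }
}
```

In this sample, "java" appears in both documents, so idf("java", D) = log(2/2) = 0 and its tf-idf is 0 in every document, while "hash" in the first document gets tf = 0.5 + 0.5 * 1/2 = 0.75 and tf-idf = 0.75 * log(2).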

Program input

  1. A folder with a set of distinct text files.
  2. An integer number k for top-k words with largest tf-idf(t, d, D). Default k = 10.

Program output

  1. For each document, display the top-k words in descending order of tf-idf(t, d, D), along with their tf-idf(t, d, D) values. Round tf-idf(t, d, D) to 3 decimal places. (Use Java floating-point formatting.)
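
One possible way to produce this output is to sort a document's words by score and format each score with String.format. The sketch below is only an illustration: the class name TopKSketch, the method topK, the alphabetical tie-breaking rule, and the sample scores are assumptions, not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TopKSketch {

    /** Returns up to k "word score" lines, highest tf-idf first, scores shown with 3 decimals. */
    static List<String> topK(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Sort by score, descending. Ties are broken alphabetically
        // (an assumption; the handout does not specify a tie rule).
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        int limit = Math.min(k, entries.size()); // a document may have fewer than k words
        for (Map.Entry<String, Double> e : entries.subList(0, limit)) {
            // %.3f rounds the value to 3 decimal places.
            lines.add(String.format(Locale.ROOT, "%s %.3f", e.getKey(), e.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up tf-idf scores for one document.
        Map<String, Double> scores = Map.of("hash", 0.5199, "java", 0.0, "sort", 1.23456);
        for (String line : topK(scores, 2)) {
            System.out.println(line); // prints "sort 1.235" then "hash 0.520"
        }
    }
}
```

Sorting every word costs O(n log n) per document, which is usually acceptable here; a bounded priority queue of size k would be an alternative for very large vocabularies.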

Bonus points

  1. (2 points) The input of your program is a set of Web pages.
  2. (2 points) Develop a user interface for your project. Be creative.