Eduard Dragut - CIS2168 Fall 2013 Assignment 4

Assignment 4

Due date: Tuesday, November 21.
NOTES:

10% penalty for each day late, up to three days
This is an individual assignment; no groups allowed

This assignment requires JAVA object oriented techniques. You will practice the following data structures: BST, Max Heap and Huffman encoding.

Problem description

For this assignment you will analyze the content of text documents. The text documents are the tweets you collected in Assignment 2:

Requirements:

As you parse through the tweets you will insert the words (terms) in a Binary Search Tree (BST). Use the lexicographical order (discussed in class) to construct the tree. Insert the words as you encounter them.
Convert all words to small case (use toLowerCase() method).
For each encountered word, you need first to check whether the word was previously seen and hence is present in the BST. You will first need to search the word in the BST. If the word is present in the BST, then you calculate and store the frequency of the word. Therefore, the Node of the BST has two data items: the word itself and its frequency. The former is used to construct the tree, while the latter is used to keep track of the number of times the word is encountered.
Once the tree is constructed, create a list of stop words. In computing (e.g., information retrieval and natural language processing), stop words are words which are filtered out prior to, or after, processing of natural language data. You will determine the stop words by finding the top most frequently occurring words in the document. You will accomplish this task by constructing a Max Heap data structure (recall then in the class we talked about the Min Heap. In a Max Heap the root node contains the maximum). Your Max Heap uses the frequencies that you computed in the BST. You will need to traverse the tree in some manner to insert its data into the Max Heap. You may use an iterator. The input of your program contains an integer argument, call it top_k_stop_words. You will need to use the Max Heap to determine the top TOP-K-STOP-WORDS. (Hint: use the remove operation in the heap.)
Denote by L be the set of words after the removal of the stop words. Use the remove operation in BST to obtain L. That is, you remove all the stop words from your BST.
Use the frequencies of the words in L to construct a custom Huffman tree. Note that the tree is over the words and not over the characters as we did in class. In other words, the leaf nodes of your custom Huffman tree will be the words.

Program input

A text file with all the tweets. Alternatively, the path to the folder where the tweets are saved. You need to have at least 1,000 tweets.
An integer number for TOP-K-STOP-WORDS. Default TOP-K-STOP-WORDS = 10.

Program output

Display the stop words.
Given a word (your program must accept input from the standard prompt), your program must display its Huffman code.
Given a word (your program must accept input from the standard prompt), your program must display the subtree in the BST whose root is that word. Use the procedure dicussed inclass with the indentation if you use the command prompt.

Bonus points