Eduard Dragut - CIS2168 Fall 2013 Assignment 2

Assignment 4

Due date: Thursday, November 14, in the lab.
NOTES:

10% penalty for each day late, up to three days
This is an individual assignment; no groups allowed

This assignment requires JAVA object oriented techniques. You will practice the following data structures: BST, Max Heap and Huffman encoding.

Problem description

For this assignment you will analyze the content of text document, preferable an html Web page, according to the following requirements:

Requirements:

As you parse through the file you will insert the words in a Binary Search Tree (BST). Use the lexicographical order (discussed in class) to construct the tree. Insert the words as you encounter them.
Convert all words to small case (use toLowerCase() method).
For each encountered word, you need first to check whether the word was previously seen and hence is present in the BST. You will first need to search the word in the BST. If the word is present in the BST, then you calculate and store the frequencies of word. Therefore, the Node of the BST has two data items: the word itself and its frequency. The former is used to construct the tree, while the latter is used to keep track of the number of times the word is encountered.
Once the tree is constructed, create a list of stop words. In computing (e.g., information retrieval and natural language processing), stop words are words which are filtered out prior to, or after, processing of natural language data. You will determine the stop words by finding the top most frequently occurring words in the document. You will accomplish this task by constructing a Max Heap data structure (recall then in the class we talked about the Min Heap. In a Max Heap the root node contains the maximum). Your Max Heap uses the frequencies that you computed in the BST. You will need to traverse the tree in some manner to insert its data into the Max Heap. You may use an iterator. The input of your program contains an integer argument, call it top_k_stop_words. You will need to use the Max Heap to determine the top top_k_stop_words. (Hint: use the remove operation.)
Denote by L be the set of words after the removal of the stop words. Use the remove operation in BST to obtain L. That is, you remove of stop words from your BST.
Use the frequencies of the words in L to construct a custom Huffman tree. Note that tree is over the words and not over the characters. In other words, the leaf nodes of your custom Huffman tree will be the words.

Program input

A text file.
An integer number for top_k_stop_words. Default top_k_stop_words = 10.

Program output

Display the stop words.
Given a word (your program must accept input from the standard prompt), your program must display its Huffman code.
Given a word (your program must accept input from the standard prompt), your program must display its parent and his children in the BST.

Bonus points