CIS2168 - Homework 10: WORDS IN BOOKS, THEIR FREQUENCIES, THEIR HUFFMAN CODES

Assignment given: November 9, 2010
Due Date: November 22, by 10pm

We are given a file containing URLs, one per line. Each URL refers to a text file on the internet; the files are books from Project Gutenberg.

You are to read each of these files and store the words they contain, as lowercase words, in a single array or ArrayList. To identify the words, you may find the following statement useful, where fin denotes a Scanner: fin.useDelimiter("\\W+"); For each word, keep track of its number of occurrences. After each file has been processed, print out the size of the store and the time, in milliseconds, that it took to process that file. Be clever or the program will be very slow.
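As an illustration of the reading step, here is a minimal sketch, not a required design. It keeps the words in a single ArrayList, as asked, and assumes that an auxiliary HashMap from each word to its position in the list is permitted, so that a lookup does not scan the whole list. The names WordStore, Entry, add, addAll, countOf and processUrl are hypothetical, and the UTF-8 encoding is also an assumption. Note that with "\\W+" as delimiter, digits and underscores count as word characters, and a word such as "don't" is split into "don" and "t".

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical names throughout; the assignment does not prescribe them.
class WordStore {
    /** One store entry: a lowercase word and its number of occurrences. */
    static class Entry {
        final String word;
        int count = 1;
        Entry(String word) { this.word = word; }
    }

    // The single store requested by the assignment.
    final List<Entry> entries = new ArrayList<>();
    // Assumption: a side index (word -> position in entries) is allowed.
    // Without it, every lookup would be a linear scan of the list.
    private final Map<String, Integer> index = new HashMap<>();

    /** Records one occurrence of the given word, lowercased. */
    void add(String raw) {
        String word = raw.toLowerCase();
        if (word.isEmpty()) return;
        Integer pos = index.get(word);
        if (pos == null) {
            index.put(word, entries.size());
            entries.add(new Entry(word));
        } else {
            entries.get(pos).count++;
        }
    }

    /** Reads every word from the scanner, splitting on non-word characters. */
    void addAll(Scanner fin) {
        fin.useDelimiter("\\W+");
        while (fin.hasNext()) add(fin.next());
    }

    int size() { return entries.size(); }

    /** Returns the occurrence count of a word, or 0 if it is not stored. */
    int countOf(String word) {
        Integer pos = index.get(word.toLowerCase());
        return pos == null ? 0 : entries.get(pos).count;
    }

    /** Processes one URL and returns the elapsed time in milliseconds. */
    long processUrl(String url) throws IOException {
        long start = System.currentTimeMillis();
        try (InputStream in = URI.create(url).toURL().openStream();
             Scanner fin = new Scanner(in, "UTF-8")) { // encoding is an assumption
            addAll(fin);
        }
        return System.currentTimeMillis() - start;
    }
}
```

A driver would call processUrl once per line of the URL file and print size() and the returned time after each call.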

You will then sort the entries in this array in decreasing order of occurrences and print out the 100 most frequent words.
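The sorting step might be sketched as follows. The class FrequencyRanking and its members are hypothetical names, and breaking ties alphabetically is an assumption made only so that the output is reproducible; the assignment does not say how to order words with equal counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of one way to rank words by frequency.
class FrequencyRanking {
    static class WordCount {
        final String word;
        final int count;
        WordCount(String word, int count) { this.word = word; this.count = count; }
    }

    /** Sorts in place: larger counts first; ties broken alphabetically (an assumption). */
    static void sortByCountDescending(List<WordCount> entries) {
        entries.sort((a, b) -> {
            if (a.count != b.count) return Integer.compare(b.count, a.count);
            return a.word.compareTo(b.word);
        });
    }

    /** Returns a copy of the first n entries, or all of them if there are fewer than n. */
    static List<WordCount> top(List<WordCount> sorted, int n) {
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<WordCount> entries = new ArrayList<>();
        entries.add(new WordCount("cat", 2));
        entries.add(new WordCount("the", 7));
        entries.add(new WordCount("and", 2));
        entries.add(new WordCount("hat", 1));
        sortByCountDescending(entries);
        for (WordCount e : top(entries, 100)) {
            System.out.println(e.word + " " + e.count);
        }
    }
}
```

List.sort runs in O(n log n) time, so this step is cheap compared with reading the books.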

Determine and print out the average length of the 1000 most frequent words, their Huffman codes, and the average number of bits of these codes. Notice that Huffman codes are just sequences of bits, not characters, but, if you want, you can represent the bits by the characters '0' and '1' and the codes by strings. Be sure to implement and test two functions. The first, given one of the 1000 words, will print out the corresponding Huffman code. The second, given a Huffman code, will print out the corresponding word.
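The Huffman step could look like the sketch below, which uses the standard construction with a priority queue: repeatedly remove the two lightest nodes and replace them with a parent whose weight is their sum. Codes are represented as strings of '0' and '1'. All names (HuffmanSketch, Node, encode, decode, averageCodeLength) are hypothetical, and the functions return values instead of printing so that they are easy to test. The sketch assumes at least two words, and it computes the plain average of the code lengths; if a frequency-weighted average is intended, each length would be multiplied by the word's count before dividing by the total count.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical names; a sketch of Huffman coding over word frequencies.
class HuffmanSketch {
    /** Leaves carry a word; internal nodes have word == null and two children. */
    static class Node {
        final String word;
        final long weight;
        final Node left, right;
        Node(String word, long weight, Node left, Node right) {
            this.word = word; this.weight = weight; this.left = left; this.right = right;
        }
    }

    final Node root;
    final Map<String, String> codes = new HashMap<>();

    /** Builds the tree from word counts. Assumption: at least two words. */
    HuffmanSketch(Map<String, Integer> counts) {
        if (counts.size() < 2) throw new IllegalArgumentException("need at least two words");
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();   // lightest
            Node b = queue.poll();   // second lightest
            queue.add(new Node(null, a.weight + b.weight, a, b));
        }
        root = queue.poll();
        assign(root, "");
    }

    // Left edges contribute '0', right edges '1'.
    private void assign(Node n, String prefix) {
        if (n.word != null) { codes.put(n.word, prefix); return; }
        assign(n.left, prefix + "0");
        assign(n.right, prefix + "1");
    }

    /** Word to code; null if the word is not in the table. */
    String encode(String word) { return codes.get(word); }

    /** Code to word; null if the bit string is not exactly one complete code. */
    String decode(String code) {
        Node n = root;
        for (char bit : code.toCharArray()) {
            if (n.word != null) return null;          // bits continue past a leaf
            if (bit == '0') n = n.left;
            else if (bit == '1') n = n.right;
            else return null;                         // not a bit
        }
        return n.word;                                // null if we stopped inside the tree
    }

    /** Plain (unweighted) average number of bits per code. */
    double averageCodeLength() {
        long total = 0;
        for (String c : codes.values()) total += c.length();
        return (double) total / codes.size();
    }
}
```

Because no code is a prefix of another, decode can walk the tree one bit at a time and never has to backtrack. The exact bit patterns depend on how ties between equal weights are broken, so two correct programs may print different codes with the same total cost.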