CIS2168 - Homework 11: Search on RSS Feeds (*)
Assignment given: November 23, 2010
Due Date: December 8 by 10pm
You have probably seen that most newspapers provide an RSS feed of the headlines.
In fact they provide a number of such feeds.
Each feed is an XML document that needs to be analyzed to identify the news items it
contains and, for each item, to determine its title, description, and link to the news
story. "title", "description", and "link" are strings. Aggregators such as
Bloglines allow you to subscribe to such feeds.
Our aim is to build a service that, given search keywords, retrieves the items
that are "most relevant" to those keywords.
You are helped by three Java files that you will find in
this folder. The HTMLTokenizer.java file is from Princeton, and you will not use it
directly. The ItemEntry.java file represents a news item. The ItemParser.java file is
the most useful to you: given a URL, it gives you access to the feed's items through an
iterator. The URL could be something like
"http://timesofindia.indiatimes.com/rssfeedstopstories.cms" or the name of a file in
your homework12s10 directory. My Iterator in ItemParser.java seems to work for the
news sources I have tried, but it is a hack, not a serious parser. Use it as is for the news
sources where it works.
You are given a text file feeds.txt
that specifies the URL for a number of news sources. Your task for homework 11 is:
- Collect the news items from all these sources.
- For each word (compared without regard to case) that occurs in the title or description
of an item, record the item where it occurs and the score of that item, computed as two times
the number of occurrences of the word in the title plus the number of
occurrences in the description. For example, a word that occurs once in the title and
three times in the description has score 2*1 + 3 = 5. Exclude from the words you
collect the 50 most common words you determined in a previous assignment.
At this point you have collected all the news items from the specified sources, together
with their links, and have determined, for each word occurring in these news items, an
array list of the news items where it occurs, each with a rank score.
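The indexing step above can be sketched as follows. This is only one possible layout, not the required solution. The names WordIndexSketch, Posting, counts, and addItem are made up for illustration, an item is represented just by its link, and splitting words on runs of non-letters (ASCII letters only) is an assumption; the assignment does not say how to tokenize. In your program the title, description, and link would come from the items produced by ItemParser.java, and the stop-word set would hold your 50 most common words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the word -> (item, score) index described above.
class WordIndexSketch {

    /** One array-list entry: the item (identified here by its link) and its score for a word. */
    static final class Posting {
        final String link;
        final int score;

        Posting(String link, int score) {
            this.link = link;
            this.score = score;
        }
    }

    /** Counts caseless word occurrences; the tokenization rule is an assumption. */
    static Map<String, Integer> counts(String text) {
        Map<String, Integer> out = new HashMap<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.merge(w, 1, Integer::sum);
            }
        }
        return out;
    }

    /** Adds one news item to the index, skipping the excluded common words. */
    static void addItem(Map<String, List<Posting>> index, Set<String> stopWords,
                        String title, String description, String link) {
        Map<String, Integer> inTitle = counts(title);
        Map<String, Integer> inDescription = counts(description);

        Set<String> allWords = new HashSet<>(inTitle.keySet());
        allWords.addAll(inDescription.keySet());

        for (String word : allWords) {
            if (stopWords.contains(word)) {
                continue;
            }
            // score = 2 * (occurrences in title) + (occurrences in description)
            int score = 2 * inTitle.getOrDefault(word, 0) + inDescription.getOrDefault(word, 0);
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(new Posting(link, score));
        }
    }
}
```

Calling addItem once per news item, with a shared HashMap as the index, yields for each word the array list of items and scores that the query step needs.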
Next, prompt the user to enter a search query consisting of words separated
by spaces. Collect the distinct news items in which all these words occur, rank
them by their relevance to the query, and display the top-ranked news items.
[Two news items are considered equal if they have the same link; some may also
consider two news items equal if they have the same title.]
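A minimal sketch of the query step is given below, under stated assumptions. The assignment leaves the relevance measure to you; here the relevance of an item is assumed to be the sum of its scores for the query words, and ties are broken by link. Items are identified by their link, so two entries with the same link count as the same item. The names QuerySketch, Hit, and search are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: keep the items that contain every query word, then rank them.
class QuerySketch {

    /** An item (identified by its link) together with a score. */
    static final class Hit {
        final String link;
        final int score;

        Hit(String link, int score) {
            this.link = link;
            this.score = score;
        }
    }

    /** Returns at most 'limit' items containing all query words, best first. */
    static List<Hit> search(Map<String, List<Hit>> index, String query, int limit) {
        Map<String, Integer> totals = null; // link -> summed score so far
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Scores of the items that contain this word, keyed by link.
            Map<String, Integer> forWord = new HashMap<>();
            for (Hit h : index.getOrDefault(word, Collections.<Hit>emptyList())) {
                forWord.merge(h.link, h.score, Integer::sum);
            }
            if (totals == null) {
                totals = forWord;
            } else {
                // Intersection: drop the items that lack the current word.
                totals.keySet().retainAll(forWord.keySet());
                for (Map.Entry<String, Integer> e : totals.entrySet()) {
                    e.setValue(e.getValue() + forWord.get(e.getKey()));
                }
            }
        }

        List<Hit> ranked = new ArrayList<>();
        if (totals != null) {
            for (Map.Entry<String, Integer> e : totals.entrySet()) {
                ranked.add(new Hit(e.getKey(), e.getValue()));
            }
        }
        // Highest score first; equal scores are ordered by link (an arbitrary choice).
        ranked.sort((a, b) -> a.score != b.score
                ? Integer.compare(b.score, a.score)
                : a.link.compareTo(b.link));
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

Note that with this sketch a query word that is not in the index, including any of the excluded common words, produces an empty result; you may prefer to drop such words from the query before searching.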
Special praise, but no extra points, to the students who output these top news items
as an HTML page viewable in a browser, so that users can open a news item by clicking on it.
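For those who want to try the HTML output, one way to build the page is sketched below. It is an illustration only: the names HtmlReportSketch, escape, and render are made up, and each result is passed as a two-element array {title, link} purely to keep the example short. The one point that matters is escaping the characters &, <, > and " before placing titles and links in the page, since feed text often contains them.

```java
import java.util.List;

// Hypothetical sketch: turn ranked results into a small clickable HTML page.
class HtmlReportSketch {

    /** Escapes the characters that have a special meaning in HTML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds the page; each result is an array {title, link}. */
    static String render(String query, List<String[]> results) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>Results for ")
            .append(escape(query))
            .append("</title></head><body>\n<ol>\n");
        for (String[] item : results) {
            html.append("<li><a href=\"")
                .append(escape(item[1]))
                .append("\">")
                .append(escape(item[0]))
                .append("</a></li>\n");
        }
        html.append("</ol>\n</body></html>\n");
        return html.toString();
    }
}
```

The returned string can be saved with java.nio.file.Files.write to a file such as results.html (the file name is your choice) and then opened in a browser.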
(*) This homework is derived from an old assignment at Stanford and
uses code from a Princeton course.