Project 1: IPC and map-reduce for counting word frequency

Due date

Oct 04, 2016, 11:59pm (demo your programs in my office 3-5pm or in the class on Oct 05)

Goal

The goal of this project is to practice various IPC methods and learn map-reduce (parallel computing). Both are very important techniques frequently used in industry.

Details

This project consists of three independent sub-projects, each performing the same task: given a text file, the program outputs a list showing the count of each unique word (case insensitive). For example, given "Can you can a can", you should output "a : 1 can : 3 you :1". Let's simplify the definition of a word: for example, "today's", "today.", and "today" are regarded as three different words; a perfect tokenization is not our purpose.

Your program execution (i.e., the parent process) will create a child process. The parent process can open the text file, read the content, and pass it to the child, but cannot count words, while the child process cannot open that file but can count words. Finally, the prarent process (rather than the child) should output the result. Below are the requirements for the three sub-projects:

  1. Use Pipe as the method of passing the file content and the result.
  2. Use Message Queue as the method of passing the file content and the result.
  3. Use Shared Memory as the method of passing the file content and the result. Plus, the child process creates 4 threads, each acting as a Mapper; and the child process's main thread works as the single reducer. Map-reduce is only required for this sub-project. You are not allowed to use Hadoop as the map-reduce infrastructure; instead, you have to use Posix thread programming to implement map-reduce.

Input text file

Anna Karenina. Leo Tolstoy, 1870. link

Submission

Your submission should include the code (a makefile is preferred but not required), a readme file describing your design, how to compile / use your code and the contribution in the case of pair programming, and a report which consists of the following parts:

Environment

Linux (Ubuntu is recommended) and C/C++.

References

You may find the following articles useful

.