Project 1: IPC and map-reduce for counting word frequency
Due date
Oct 04, 2016, 11:59pm (demo your programs in my office 3-5pm or in the class on Oct 05)
Goal
The goal of this project is to practice various IPC methods and learn map-reduce (parallel computing). Both are very important techniques frequently used in industry.
Details
This project consists of three independent sub-projects, each performing the same task: given a text file, the program outputs a list showing the count of each unique word (case insensitive). For example, given "Can you can a can", you should output "a : 1 can : 3 you :1". Let's simplify the definition of a word: for example, "today's", "today.", and "today" are regarded as three different words; a perfect tokenization is not our purpose.
Your program execution (i.e., the parent process) will create a child process. The parent process can open the text file, read the content, and pass it to the child, but cannot count words, while the child process cannot open that file but can count words. Finally, the prarent process (rather than the child) should output the result. Below are the requirements for the three sub-projects:
- Use Pipe as the method of passing the file content and the result.
- Use Message Queue as the method of passing the file content and the result.
- Use Shared Memory as the method of passing the file content and the result. Plus, the child process creates 4 threads, each acting as a Mapper; and the child process's main thread works as the single reducer. Map-reduce is only required for this sub-project. You are not allowed to use Hadoop as the map-reduce infrastructure; instead, you have to use Posix thread programming to implement map-reduce.
Input text file
Anna Karenina. Leo Tolstoy, 1870. link
Submission
Your submission should include the code (a makefile is preferred but not required), a readme file describing your design, how to compile / use your code and the contribution in the case of pair programming, and a report which consists of the following parts:
- Time the execution of your three programs, and analyze what contributes to the performance difference.
- Summarize the different IPC methods provided by Linux, and describe when to use which.
- Write a short paragraph about map-reduce.
- Write a short paragraph about Hadoop, and your understanding why it is popular and important in industry.
Environment
Linux (Ubuntu is recommended) and C/C++.
References
You may find the following articles useful
- Beej's Guide to Unix IPC. Beej, 2010. link
- Linux Interprocess Communications. Goldt, Meer, Burkett, and Welsh, 1995. link
- MapReduce: Simplified Data Processing on Large Clusters. Dean and Ghemawat, 2004. link
- POSIX Threads Programming. Barney, 2015. link
- Beej's Quick Guide to GDB. Beej, 2009. link