CIS 1068: Programming Assignment #12

Sorting a Really Big File

Handed out: 04/13/10 Due: by 10 pm 04/19/10
E-mail program to TA

Assignment Description [Dr. Yates, with minor modifications by Dr. Ingargiola]

One common problem with designing a computer program is resource constraints. For example, let's say you had a really big file containing people's names that you wanted to sort. The most common way to do this is to read the entire file into a big array of Strings, and then sort the array, and finally save the sorted array into a new file. But files can be really big, and they may contain more information than can be stored in memory. My office computer, for example, has a 200Gb hard drive and 4 Gb of memory. Theoretically, I could have a 200 Gb file. If I tried to read in such a large file into an array, my program would run out of memory and throw an OutOfMemoryError. The program is constrained by the available memory resources.

For this assignment, you will design and implement a sort program that still works correctly, even when there is not enough memory to store an entire input file. This will give you practice with arrays, files, exceptions, and sorting algorithms.

Program Description

For this homework, create a Java file called BigFileSorter.java. The class should accept the two file names as command-line arguments, the first being the name of an input file and the second the name of an output file.

The program will sort the lines in the input file, and save the sorted lines in the output file. Your sort algorithm will be a hybrid of the MergeSort algorithm and Java's built-in sorting algorithm as a two-phase process, as described below.

Phase 1: Breaking the input file into manageable chunks

In the first phase, the sort algorithm will read in a fixed number of lines from the input file, and store them into an array. It should pick a manageable number that will fit into memory. Next, it will sort that array using Java's built-in Arrays.sort method. Then it will store the sorted array in a temporary file called "temp_0_0.txt".

The algorithm will repeatedly read in chunks of lines until it has read in the entire contents of the input file. Each time it reads in a chunk of lines from the input file, it stores that chunk in an array, sorts the array, and saves the array in another temporary file ("temp_0_1.txt", "temp_0_2.txt", ... "temp_0_n.txt"). Notice that it is reading in an amount that will fit into memory each time, so that it does not run out of memory.

After this first phase is complete, each of the "temp_0_i.txt" files is in sorted order, and together they contain all of the stuff that was in the input file. The second phase must merge the files together, while making sure that the merged files remain in sorted order.

Phase 2: Merging the chunks together

Remember from class that the MergeSort algorithm repeatedly merges two small sorted arrays into a larger sorted array. Here, you will do the same, but instead repeatedly merge two small sorted files into a larger sorted file. Take care that you only read in one line of each file into memory at a time --- don't read the entire files into memory, or you will run out of memory!

Your sort algorithm should begin phase 2 by merging "temp_0_0.txt" and "temp_0_1.txt", and saving it to a new file called "temp_1_0.txt". Next, it will merge "temp_0_2.txt" with "temp_0_3.txt", and save the merged file as "temp_1_1.txt". This will repeat until there are no more "temp_0_i.txt" files left. If there are an odd number of these files, the last one will have nothing to merge with. That's ok, it can be merged in later iterations. Just rename it as the next available "temp_1_i.txt".

After merging pairs of the "temp_0_i.txt" files, the sort algorithm needs to begin merging pairs of the "temp_1_i.txt" files. It will begin by merging "temp_1_0.txt" and "temp_1_1.txt", and saving the result to "temp_2_0.txt". Then, it will merge "temp_1_2.txt" and "temp_1_3.txt", and save the result to "temp_2_1.txt", and so on.
We do not want to keep the temporary files after they are used, so after we merge two temporary files we delete them.

Each time through the set of temporary files, the merging process cuts the number of temporary files roughly in half, because it merges two files into one. ( I say "roughly" because there may be an odd number of files, in which case the last file does not get merged with anything.) This phase of the sorting must keep repeating, until there are only one temporary file left. When that happens, it should rename that temporary file to the output file. Then the sort is finished.

Guidelines for Testing Your Program

You may find a reasonably large text file (~7 Mb) here, which you may use to test your program. It contains Aesop's Fables, the complete works of Shakespeare, Mary Shelley's Frankenstein, and Mark Twain's The Adventures of Huckleberry Finn.

The file is certainly small enough to fit into memory --- I didn't want to you to have to deal with an enormous file. However, you should choose a setting so that your program reads in less than the whole file at a time. At first, read in 4096 lines at a time into memory during the first phase. You should test your program with several different settings.

In this file you see the behavior of my program (I put in the program a few print statements to see what it was doing).

You can see how to rename a file and how to delete it in the following code

import java.io.*;
import java.util.*;

public class Temp
{
    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("usage: java Temp filename");
            return;
        }
        File ff = new File(args[0]);
        String newName = args[0] + 1;
        File ff1 = new File(newName);
        ff.renameTo(ff1); // now the file is renamed newName
        ff1.delete(); // and now it is deleted
    }
}