For this assignment, you will:
try/catch statementsIterator to iterate over a collection of StringsAll of the sorting algorithms we’ve looked at so far assume that the data is in memory and that we can swap data elements efficiently. Sometimes, because of the size of the data you cannot fit all of it in memory. In these situations, many of the traditional sorting algorithms fail miserably; the algorithms do not preserve data locality and end up accessing the disk frequently, resulting in very slow running times.
For this assignment, you will be implementing an on-disk sorting algorithm that is designed to use the disk efficiently to sort large data sets. The sorting algorithm will work in two phases:
The correctness of the assignments in this class will be automatically verified. For this reason, you must follow all naming conventions specified in this assignment.
OnDiskSort classWe have provided you with a skeleton OnDiskSort class that you will need to fill in the details for. We encourage you to add additional private methods, but do not change the names or parameters of the methods we have provided you. This will make our life much easier when we grade the assignment. As an aside, we have made some of the methods protected where normally we would have made them private to, again, assist us in grading.
You will need to fill in the following methods:
OnDiskSort: the constructor for the class. Make sure that you understand what all of the parameters do. maxSize is the maximum number of Strings that can be read in to memory at any one time. workingDirectory is the directory where you will create temporary files along the way (for example, to store the sorted chunks). Note that workingDirectory has type File. Use the sorting_run directory as the workingDirectory (but make sure that your code will work even if we test it using a different directory). We suggest you name the temporary files something simple like 0.tempfile, 1.tempfile, etc. Make sure you clear the working directory when you’re done. sorter is the sorter that you should use to sort each chunk.
sort: this is the public method that will be called when you want to sort new data. For this assignment, we will only be sorting String data (notice that WordScanner is an Iterator<String>). This method will read in the data maxSize words at a time, sort each chunk using the sorter, store the sorted chunk in a temporary file, and then put the file into an arraylist of files. Once all of the data has been read in, you will have an arraylist of files, each of which is sorted. You should then call the mergeFiles method to merge all the sorted files.
merge: take two sorted files and merge them into one sorted file. This is very similar to the merge method of MergeSort. The main difference is that rather than merging from two arrays (or ArrayLists) you are merging two files. You should not simply read in the data from both of these files and then use the merge method from MergeSort. We are trying to be memory efficient and this would defeat the purpose. Instead, you should open BufferedReaders to both of the files and then, reading one line at a time, read either from the first file or the second, and write that directly out to the output file, depending on the appropriate ordering. Besides the variables for doing the file I/0, you should only need two String variables to keep track of the data.
mergeFiles: takes an ArrayList of Files, each of which should contain sorted data and then uses the merge method above to eventually merge them into one large sorted file. Notice that the merge method only merges two files at a time. The easiest way to merge all of the n sorted files is to merge the first two files, then merge the third file with the result of merging the first two files, then the fourth in, etc. NOTE: you cannot read and write to a file at the same time, so you will need to use another temporary file to store your temporary results as you merge the data.
main: This method gets everything going and is provided to you. It creates a sorter that does a merge-sort in memory, then creates a diskSorter to do the external merges. Parameters to the OnDiskSort sets up directory sorting run to be the working directory for the sorts. It then creates a word scanner to read King’s “I have a dream” speech. Finally it calls the sort method of diskSorter with the scanner to input all the words of the speech, sorts them, and puts them in the file data.sorted.
To assist you, we have also provided a few helper methods in the OnDiskSort class that you may find useful. They primarily do some simple operations with files. If there is any confusion about what these methods do, please come talk to us. In addition, these helper methods may also help you understand basic Java file I/O.
MergeSort classImplementation of the Mergesort algorithm
QuickSort classImplementation of the Quicksort algorithm
Sorter interfaceAn interface for sorting algorithms. Implemented by the MergeSort and Quicksort class.
WordScanner classImplements the Java Iterator interface. An iterator over Strings read in from file.
As usual, create a new project in Eclipse and copy the starter code into the src directory of your newly created project from:
/common/cs/cs062/assignments/assignment03/You will also need a directory in which to put the files to be sorted. Create a directory called “sorting_run” in the parent directory (or in the same directory as the src and bin folder). In that directory put a file containing a copy of King’s “I have a dream” speech. It is in a file named “Ihaveadream.txt” and is in with files from last week’s assignment. Be sure to name these exactly as given here. (If not, then the program won’t find them and it will crash.) See the main method of OnDiskSort for the names. Note that we may test your code using a different directory for temporary files, so your code shouldn’t for OnDiskSort shouldn’t use the name “sorting_run” except in its main method as a default value.
You are now ready to get started! Again, try to code incrementally one method at a time.
You will be graded based on the following criteria:
| Criterion | Points |
Merge
|
3 |
MergeFiles
|
3 |
Sort
|
3 |
Uses only two Strings in merge
|
2 |
| Cleans up temporary files appropriately | 2 |
| General correctness | 2 |
| Appropriate comments (including JavaDoc) | 3 |
| Appropriate use of generics | 2 |
| Style and formatting | 2 |
| Submitted correctly | 1 |
| Extra credit | 2 |
NOTE: Code that does not compile will get a zero! Make sure that your code compiles before submitting.
A reminder: in addition to Javadoc comments, your code must adhere to the (CS 62 Style Guide)[http://www.cs.pomona.edu/classes/cs062/notes/cs062-style.html].
As usual, export the entire folder from Eclipse to your desktop.
Change the name of the folder to Assignment03_LastNameFirstName. Make sure the folder contains both a src and a bin directory.
Make sure to fill out the asg03.json file (found in the same place as the assignment starter code) with your username and whether or not you did extra credit. Add it to the top level of your submission directory.
little.cs.pomona.edu), and then run the submission script at /common/cs/cs062/bin/submit with the following arguments:
cs062 in this case).asg03 in this case).~/Desktop/Assignment03.Once you call the script, it should print “You are:” followed by your username. If it prints the wrong username, you’re using someone else’s login and your submission won’t be recognized as yours. After confirming your username, the script will tell you what file it submitted for what assignment/class and the name of the submitted file. If you get the arguments wrong, it will print an error message instead (which should be informative). Remember you can use tab-completion for filenames to spell them correctly (filenames with spaces in them can accidentally be treated as multiple separate arguments unless you’re careful). Using the submit command looks like this:
/common/cs/cs062/bin/submit cs062 asg03 ~/Desktop/Assignment03_Name
You can use a program like ssh (should be installed by default on Macs) or PuTTY (available for free for Windows) to login to little.cs.pomona.edu remotely to run the submit command.
For those that haven’t had any file I/O experience in Java, we’ll give a brief intro here, but also take a look at the streams cheat sheet available off of the course Documentation page. You can also look up information about the classes seen in the code and discussed here via the Java libraries link there. For most I/O, you’ll need to import java.io.*.
The two main classes you’ll be concerned with when doing file I/O in java are BufferedReader for reading data and PrintWriter for writing data. To read data, you can create a new reader by:
BufferedReader in = new BufferedReader(new FileReader(...))where “…” can be either replaced with a String or can be replaced with a File object. To write data, you can create a new writer by:
PrinterWriter out = new PrintWriter(new FileOutputStream(...))In both cases, you will need to surround these with a try-catch to handle the IOException.
The file system on these computers starts at the root directory ‘/’. Everything is then expanded out based on directories. For example “/home/america/” means that in the root directory ‘/’ there is a directory named “home” and inside this directory is another directory named “pmawhorter”. The ‘/’ symbol is called the file separator and is different depending on the operating system (e.g. it’s ‘’ on windows computers).
Filenames can be specified as relative filenames, where the filename is specified relative to the current location of the program (or user), and absolute filenames where the entire path, starting from the root directory, is specified. For example, the absolute filename for the src directory of my Assignment03 folder in my Eclipse workspace is:
/home/pmawhorter/Documents/cs062/workspace/Assignment03/src
The relative filename for the src directory depends on the current directory. If the current location of the program (or user) is the bin directory, then the relative filename is “../src.” The “..” symbol means go up one directory. So, from the bin directory, the relative filename “../src” means go up one directory and then go to the src directory. Note that relative filenames do NOT start with a ‘/’.
It can be confusing telling exactly where your program currently is when running it, so often the best approach when writing programs is to use a full path which starts at the root directory. (We won’t do that this time because we want the program you turn in to work when copied to the TA’s directory.)
If you ever want to know where you are when you’re in the Terminal, the pwd command (which stands for print working directory). You can try it out by just typing pwd and hitting return (though that won’t work in Eclipse – you must be in the Terminal)!. Rather than hard-coding in a file or directory, you can also pop up a dialog box and let the user choose the file. We did that with last week’s assignment. In fact, the startup code given to you in the main class of TextGenerator is a good example of how to use file input.