For this assignment, you will:
try
/catch
statementsIterator
to iterate over a collection of Strings
All of the sorting algorithms we’ve looked at so far assume that the data is in memory and that we can swap data elements efficiently. Sometimes, because of the size of the data you cannot fit all of it in memory. In these situations, many of the traditional sorting algorithms fail miserably; the algorithms do not preserve data locality and end up accessing the disk frequently, resulting in very slow running times.
For this assignment, you will be implementing an on-disk sorting algorithm that is designed to use the disk efficiently to sort large data sets. The sorting algorithm will work in two phases:
The correctness of the assignments in this class will be automatically verified. For this reason, you must follow all naming conventions specified in this assignment.
OnDiskSort
classWe have provided you with a skeleton OnDiskSort
class that you will need to fill in the details for. We encourage you to add additional private methods, but do not change the names or parameters of the methods we have provided you. This will make our life much easier when we grade the assignment. As an aside, we have made some of the methods protected
where normally we would have made them private
to, again, assist us in grading.
You will need to fill in the following methods:
OnDiskSort
: the constructor for the class. Make sure that you understand what all of the parameters do. maxSize
is the maximum number of Strings
that can be read in to memory at any one time. workingDirectory
is the directory where you will create temporary files along the way (for example, to store the sorted chunks). Note that workingDirectory
has type File
. Use the sorting_run
directory as the workingDirectory
(but make sure that your code will work even if we test it using a different directory). We suggest you name the temporary files something simple like 0.tempfile
, 1.tempfile
, etc. Make sure you clear the working directory when you’re done. sorter
is the sorter that you should use to sort each chunk.
sort
: this is the public method that will be called when you want to sort new data. For this assignment, we will only be sorting String
data (notice that WordScanner
is an Iterator<String>
). This method will read in the data maxSize
words at a time, sort each chunk using the sorter, store the sorted chunk in a temporary file, and then put the file into an arraylist of files. Once all of the data has been read in, you will have an arraylist of files, each of which is sorted. You should then call the mergeFiles
method to merge all the sorted files.
merge
: take two sorted files and merge them into one sorted file. This is very similar to the merge method of MergeSort
. The main difference is that rather than merging from two arrays (or ArrayLists
) you are merging two files. You should not simply read in the data from both of these files and then use the merge method from MergeSort
. We are trying to be memory efficient and this would defeat the purpose. Instead, you should open BufferedReaders
to both of the files and then, reading one line at a time, read either from the first file or the second, and write that directly out to the output file, depending on the appropriate ordering. Besides the variables for doing the file I/0, you should only need two String
variables to keep track of the data.
mergeFiles
: takes an ArrayList
of Files
, each of which should contain sorted data and then uses the merge
method above to eventually merge them into one large sorted file. Notice that the merge method only merges two files at a time. The easiest way to merge all of the n sorted files is to merge the first two files, then merge the third file with the result of merging the first two files, then the fourth in, etc. NOTE: you cannot read and write to a file at the same time, so you will need to use another temporary file to store your temporary results as you merge the data.
main
: This method gets everything going and is provided to you. It creates a sorter that does a merge-sort in memory, then creates a diskSorter
to do the external merges. Parameters to the OnDiskSort
sets up directory sorting run to be the working directory for the sorts. It then creates a word scanner to read King’s “I have a dream” speech. Finally it calls the sort method of diskSorter
with the scanner to input all the words of the speech, sorts them, and puts them in the file data.sorted
.
To assist you, we have also provided a few helper methods in the OnDiskSort
class that you may find useful. They primarily do some simple operations with files. If there is any confusion about what these methods do, please come talk to us. In addition, these helper methods may also help you understand basic Java file I/O.
MergeSort
classImplementation of the Mergesort algorithm
QuickSort
classImplementation of the Quicksort algorithm
Sorter
interfaceAn interface for sorting algorithms. Implemented by the MergeSort
and Quicksort
class.
WordScanner
classImplements the Java Iterator
interface. An iterator over Strings
read in from file.
As usual, create a new project in Eclipse and copy the starter code into the src
directory of your newly created project from:
/common/cs/cs062/assignments/assignment03/
You will also need a directory in which to put the files to be sorted. Create a directory called “sorting_run
” in the parent directory (or in the same directory as the src
and bin
folder). In that directory put a file containing a copy of King’s “I have a dream” speech. It is in a file named “Ihaveadream.txt” and is in with files from last week’s assignment. Be sure to name these exactly as given here. (If not, then the program won’t find them and it will crash.) See the main
method of OnDiskSort
for the names. Note that we may test your code using a different directory for temporary files, so your code shouldn’t for OnDiskSort shouldn’t use the name “sorting_run
” except in its main method as a default value.
You are now ready to get started! Again, try to code incrementally one method at a time.
You will be graded based on the following criteria:
Criterion | Points |
Merge
|
3 |
MergeFiles
|
3 |
Sort
|
3 |
Uses only two Strings in merge
|
2 |
Cleans up temporary files appropriately | 2 |
General correctness | 2 |
Appropriate comments (including JavaDoc) | 3 |
Appropriate use of generics | 2 |
Style and formatting | 2 |
Submitted correctly | 1 |
Extra credit | 2 |
NOTE: Code that does not compile will get a zero! Make sure that your code compiles before submitting.
A reminder: in addition to Javadoc comments, your code must adhere to the (CS 62 Style Guide)[http://www.cs.pomona.edu/classes/cs062/notes/cs062-style.html].
As usual, export the entire folder from Eclipse to your desktop.
Change the name of the folder to Assignment03_LastNameFirstName
. Make sure the folder contains both a src
and a bin
directory.
Make sure to fill out the asg03.json
file (found in the same place as the assignment starter code) with your username and whether or not you did extra credit. Add it to the top level of your submission directory.
little.cs.pomona.edu
), and then run the submission script at /common/cs/cs062/bin/submit
with the following arguments:
cs062
in this case).asg03
in this case).~/Desktop/Assignment03
.Once you call the script, it should print “You are:” followed by your username. If it prints the wrong username, you’re using someone else’s login and your submission won’t be recognized as yours. After confirming your username, the script will tell you what file it submitted for what assignment/class and the name of the submitted file. If you get the arguments wrong, it will print an error message instead (which should be informative). Remember you can use tab-completion for filenames to spell them correctly (filenames with spaces in them can accidentally be treated as multiple separate arguments unless you’re careful). Using the submit command looks like this:
/common/cs/cs062/bin/submit cs062 asg03 ~/Desktop/Assignment03_Name
You can use a program like ssh
(should be installed by default on Macs) or PuTTY
(available for free for Windows) to login to little.cs.pomona.edu
remotely to run the submit command.
For those that haven’t had any file I/O experience in Java, we’ll give a brief intro here, but also take a look at the streams cheat sheet available off of the course Documentation page. You can also look up information about the classes seen in the code and discussed here via the Java libraries link there. For most I/O, you’ll need to import java.io.*
.
The two main classes you’ll be concerned with when doing file I/O in java are BufferedReader
for reading data and PrintWriter
for writing data. To read data, you can create a new reader by:
BufferedReader in = new BufferedReader(new FileReader(...))
where “…” can be either replaced with a String
or can be replaced with a File
object. To write data, you can create a new writer by:
PrinterWriter out = new PrintWriter(new FileOutputStream(...))
In both cases, you will need to surround these with a try-catch to handle the IOException
.
The file system on these computers starts at the root directory ‘/’. Everything is then expanded out based on directories. For example “/home/america/” means that in the root directory ‘/’ there is a directory named “home” and inside this directory is another directory named “pmawhorter”. The ‘/’ symbol is called the file separator and is different depending on the operating system (e.g. it’s ‘’ on windows computers).
Filenames can be specified as relative filenames, where the filename is specified relative to the current location of the program (or user), and absolute filenames where the entire path, starting from the root directory, is specified. For example, the absolute filename for the src
directory of my Assignment03
folder in my Eclipse workspace is:
/home/pmawhorter/Documents/cs062/workspace/Assignment03/src
The relative filename for the src directory depends on the current directory. If the current location of the program (or user) is the bin
directory, then the relative filename is “../src
.” The “..
” symbol means go up one directory. So, from the bin directory, the relative filename “../src
” means go up one directory and then go to the src directory. Note that relative filenames do NOT start with a ‘/’.
It can be confusing telling exactly where your program currently is when running it, so often the best approach when writing programs is to use a full path which starts at the root directory. (We won’t do that this time because we want the program you turn in to work when copied to the TA’s directory.)
If you ever want to know where you are when you’re in the Terminal, the pwd
command (which stands for print working directory). You can try it out by just typing pwd
and hitting return (though that won’t work in Eclipse – you must be in the Terminal)!. Rather than hard-coding in a file or directory, you can also pop up a dialog box and let the user choose the file. We did that with last week’s assignment. In fact, the startup code given to you in the main class of TextGenerator
is a good example of how to use file input.