Homework 10

Concurrency

Please submit homeworks via the DCI submission page.

In this homework, I’ve tasked you with writing a search program. The program’s interface is as follows:

$ ./search
Usage: search needle [haystack]
    where needle is a string and haystack is a directory (defaulting to .)

For example, suppose I have a directory test with two files in it, test.txt and fail.txt.

$ ./search
Usage: search needle [haystack]
    where needle is a string and haystack is a directory (defaulting to .)
$ cd test
$ ../search sir
./test.txt
$

That is, it prints the paths of the files whose contents include the string. I could also run:

$ ./search sir test
test/test.txt
$

Searching for a string that isn’t present at all produces no output; searching for a string that’s present in more than one file prints those files out in some arbitrary order.

$ ./search absent test
$ ./search " " test
test/fail.txt
test/test.txt
$

Sample directories are available on the CS network in /common/cs/cs131/hw10. There are three of them: /common/cs/cs131/hw10/test/ is the directory used in the examples above; /common/cs/cs131/hw10/little/ has 15 megabytes of data in it; /common/cs/cs131/hw10/big/ has 518 megabytes of data in it. (Take a look! I highly recommend the music and movies in that directory.)

Your program will need to handle all of those directories, plus some I have not shown you. There are two gotchas you should watch out for: binary files (look inside them!) and the number of available file descriptors (don’t run out!).
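The two gotchas above can be illustrated with a minimal single-threaded sketch in Python. It is not a reference solution, and several details are assumptions rather than part of the assignment: the helper names file_contains, search, and main are hypothetical, the needle is matched as UTF-8 bytes, subdirectories are searched recursively, and the usage message goes to standard output. Files are opened in binary mode so that binary data is searched rather than decoded, and each data file is closed before the next one is opened.

```python
import os
import sys


def file_contains(path, needle, chunk_size=1 << 20):
    """Return True if the bytes `needle` occur anywhere in the file at `path`."""
    overlap = len(needle) - 1
    tail = b""
    # Binary mode: nothing is decoded, so binary files are searched like any other.
    with open(path, "rb") as f:  # `with` closes the descriptor as soon as we return
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last len(needle) - 1 bytes so a match that straddles
            # two chunks is still found on the next iteration.
            tail = window[-overlap:] if overlap > 0 else b""


def search(needle, haystack="."):
    """Yield paths of regular files under `haystack` whose contents include `needle`."""
    needle_bytes = needle.encode("utf-8")  # assumption: match on UTF-8 bytes
    for dirpath, _dirnames, filenames in os.walk(haystack):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and file_contains(path, needle_bytes):
                    yield path
            except OSError:
                # Unreadable or vanished file: skip it (an assumption about
                # the desired behavior, not something the assignment states).
                continue


def main(argv):
    """Mimic the command-line interface shown earlier; returns an exit status."""
    if len(argv) not in (2, 3):
        print("Usage: search needle [haystack]")
        print("    where needle is a string and haystack is a directory (defaulting to .)")
        return 1
    for path in search(*argv[1:]):
        print(path)
    return 0
```

To use this as a script, one would call sys.exit(main(sys.argv)) at the bottom of the file. Reading in fixed-size chunks keeps memory use bounded even on the large files in the big directory.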

Grading

My original plan was to have this homework be out of three points:

  • 1 point for writing a program that works
  • 1 point for writing a program that’s faster than my single-threaded implementation
  • 1 point for writing a program that’s faster than my multi-threaded implementation
  • 1 single point of extra credit for the fastest of all solutions (has to get all three points first!)

There’s a hitch, though: without doing a deep-dive on parallel I/O, I wasn’t able to get my multi-threaded program to go faster than my single-threaded program on all inputs. My problem is a common one! Concurrent programs are often slower than their single-threaded counterparts. So, here are the revised grading guidelines.

  • 1 point for writing a program that works
  • 1 point for writing a program that’s faster than one of my implementations
  • 1 point for writing a program that’s faster than both of my implementations
  • 1 single point of extra credit for the fastest of all solutions (has to get all three points first!)
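For a sense of what a multi-threaded version can look like, here is one hedged sketch in Python using a fixed-size thread pool. It is not either of the implementations being graded against; the names contains and parallel_search and the default of 8 workers are hypothetical choices. Because each worker opens one file at a time, at most `workers` data files are open simultaneously, which keeps the program well away from the file descriptor limit. As noted above, there is no guarantee that this beats a single-threaded loop: the threads mainly overlap waits on I/O, and the benefit depends on the disk and the cache.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def contains(path, needle, chunk_size=1 << 20):
    """True if the bytes `needle` occur in the file at `path` (binary, chunked)."""
    keep = max(len(needle) - 1, 0)
    tail = b""
    try:
        with open(path, "rb") as f:  # closed when this worker finishes the file
            for chunk in iter(lambda: f.read(chunk_size), b""):
                window = tail + chunk
                if needle in window:
                    return True
                # Carry over enough bytes to catch a match spanning two chunks.
                tail = window[-keep:] if keep else b""
    except OSError:
        pass  # unreadable file: treated as "no match" (an assumption)
    return False


def parallel_search(needle, haystack=".", workers=8):
    """Return matching paths, checking files on a bounded pool of threads."""
    needle_bytes = needle.encode("utf-8")
    # Collect paths first; these are just strings, so no descriptors are held.
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, names in os.walk(haystack)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `paths`.
        hits = pool.map(lambda p: contains(p, needle_bytes), paths)
        return [path for path, hit in zip(paths, hits) if hit]
```

Calling parallel_search("sir", "test") would return a list of matching paths; timing it with several values of `workers`, including 1, is a quick way to see whether the threads are helping at all.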

So: how fast are my implementations? Here are the numbers produced from the benchmarking script /common/cs/cs131/hw10/benchmark on each input. The script runs the program once to “warm things up”, then it takes the mean wall clock time of 5 runs. I’ve highlighted the top-performing program on each input.

Search string  Directory                     Single-threaded run time  Multi-threaded run time  grep-based run time
absent         /common/cs/cs131/hw10/test    0.1                       0.0*                     0.1
xyzzy          /common/cs/cs131/hw10/little  0.120*                    0.122                    0.492
weightb        /common/cs/cs131/hw10/big     1.116                     1.088*                   1.934

(* marks the top-performing program on each input.)

There are three implementations listed here: my single-threaded implementation, my multi-threaded implementation, and an implementation based on OS X grep.

I ran this code on cslab-2 in Edmunds 227 when the fileserver was relatively quiescent; I’ll be grading in a similar environment. You may want to run tests on your own computer, but be careful: your computer may have different performance than the lab computers!
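If you do time your program on your own machine, a small harness along the following lines mirrors the procedure described above: one untimed warm-up run, then the mean wall clock time of 5 runs. This is only a stand-in written for illustration, not the actual script at /common/cs/cs131/hw10/benchmark, and the function name mean_wall_time is hypothetical.

```python
import subprocess
import sys
import time


def mean_wall_time(cmd, runs=5):
    """Run `cmd` once untimed, then return the mean wall-clock seconds over `runs` runs."""
    # Warm-up run: lets the OS cache the files so later runs are comparable.
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
        total += time.perf_counter() - start
    return total / runs
```

For example, mean_wall_time(["./search", "xyzzy", "/common/cs/cs131/hw10/little"]) would time the second row of the table. Output is discarded so that terminal speed does not affect the measurement.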

Submission

You may write your program in any language. Please submit (via DCI) a zipfile with a README explaining how I can compile and run your program, whether your program is multi- or single-threaded, and a high-level description of your approach to the problem. If you’re familiar with them, Makefiles are appreciated.

Your code must run on the lab computers. It’s okay if you have to install a specific compiler to get it to run, but be certain to give me good instructions.

Remember: if we can’t run your code, we can’t grade it.