CS201 - Spring 2014

CS201 - Spring 2014 - Class 28

exercise

admin
- office hours will start late ~10:30 on Tue.

quick recap of data structures
   - No free lunch: there is no single best data structure
      - different data structures (and different implementations of those structures) are good at different things
   - What are the following structures good for/at?
      - Arrays/ArrayLists
         - storing sequential data
         - O(1) access to elements at particular indices
         - ArrayLists allow for amortized O(1) adds
      - Linked lists
         - add/remove from beginning/end in O(1) (removing from the end in O(1) requires a doubly linked list)
         - can delete from the middle in O(1) if we have a reference to the node (again, doubly linked)
      - Binary search trees
         - If balanced, O(log n):
            - add
            - delete
            - search
      - Heaps
         - good for min/max requests
         - O(log n) add and extractMin/Max
   - We've also looked at some meta-data structures that help facilitate certain operations
      - Stacks/Queues
      - Trees

sets
   - we'd like to be able to support set-like data structures
      - objects can be added
      - and we can ask if an object belongs to a set
   - look at the Set interface in Hashtables code
   - can we support this type of thing with anything we have so far?
      - binary search trees do this in O(log n)
   - can we do better?
      - are there additional things we can do with binary search trees that we don't need?
         - there is still an ordering
            - things like successor can be done quickly
            - can print out the data in order
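   - a minimal sketch of the point above, assuming java.util.TreeSet (a balanced BST from the standard library) as a stand-in for our own BST code; the names BstSetDemo and sampleSet are made up for illustration
      - add and contains are all a set needs; higher (successor) and in-order printing are the extras that come from the ordering

```java
import java.util.TreeSet;

class BstSetDemo {
    // hypothetical helper: builds a small set backed by a balanced BST
    static TreeSet<Integer> sampleSet() {
        TreeSet<Integer> set = new TreeSet<>();
        set.add(42);
        set.add(7);
        set.add(19);
        return set;
    }

    public static void main(String[] args) {
        TreeSet<Integer> set = sampleSet();

        // the two operations a set needs, both O(log n) in a balanced BST
        System.out.println(set.contains(19)); // true
        System.out.println(set.contains(20)); // false

        // extras we get from the ordering, which a plain set does not need
        System.out.println(set.higher(19));   // successor of 19: 42
        System.out.println(set);              // in order: [7, 19, 42]
    }
}
```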

set applications?

universe of keys
   - we have some universe of keys (often called U) that we want to store, be it numbers, strings, objects, etc.
      - for example:
         - all Middlebury ID numbers
         - all social security numbers
         - all last names
         - all names of people (first and last together)
         - all tweets from today
         - ...
   - if we know the min and max key, is there a simple approach?
      - store them in an array with one entry per possible key, using the key (minus the min) as the index
   - Any problems?
      - for any given run, we don't need to store ALL keys, just a subset
      - the array has to be at least the size of the universe of keys (all possible keys!)
      - lots of wasted space
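   - a minimal sketch of the array approach, assuming integer keys with a known min and max; the class DirectAddressSet and the 5-digit ID universe are hypothetical, chosen only to make the wasted space visible

```java
class DirectAddressSet {
    private final int min;            // smallest possible key
    private final boolean[] present;  // one slot per key in the universe

    DirectAddressSet(int min, int max) {
        this.min = min;
        this.present = new boolean[max - min + 1];
    }

    // O(1): the key itself (shifted by min) is the index
    void add(int key) { present[key - min] = true; }

    boolean contains(int key) { return present[key - min]; }

    int slots() { return present.length; }

    public static void main(String[] args) {
        // hypothetical universe: all 5-digit ID numbers
        DirectAddressSet ids = new DirectAddressSet(10000, 99999);
        ids.add(12345);
        ids.add(54321);
        ids.add(77777);

        System.out.println(ids.contains(12345)); // true
        System.out.println(ids.contains(12346)); // false
        // 3 keys stored, but the array covers the whole universe
        System.out.println(ids.slots());         // 90000
    }
}
```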

hash functions
   - a hash function is a function that maps the universe of keys to a restricted range of size m (for example, the indices 0 to m-1), where m << |U|, that is, m is much smaller than the universe of keys
   - how does this help us?
      - now we don't have to have an array of size |U|, just have to have an array of size m
      - a hashtable is a data structure that uses an array to store the items. Using a hash function, every item is mapped to an index in the array.
      - to find if an item exists in the hash table, we hash the item and see if it exists in the table at the specified entry
   - what can happen if m < |U|?
      - we can have two things map to the same position in the array even though they're not equivalent, that is h(x) == h(y) even though !x.equals(y)
      - this is called a "collision"
      - a good hash function will try to avoid them but if m < |U|, they are inevitable
         - why?
            - pigeonhole principle: if n items are put into m pigeonholes with n > m, then at least one pigeonhole must contain more than one item
            - simple idea, but often useful for proving things
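   - a minimal sketch of hashing into m slots and of why collisions are unavoidable; HashDemo, hash, and firstCollision are made-up names, and taking hashCode() mod m is just one simple choice of hash function, not necessarily the one used in the course code
      - with 5 distinct keys and only 4 slots, the pigeonhole principle guarantees that firstCollision finds a pair

```java
class HashDemo {
    // map any key to a slot in the range 0 .. m-1
    static int hash(Object key, int m) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), m);
    }

    // returns the first two distinct keys that land in the same slot,
    // or null if every key gets its own slot
    static String[] firstCollision(String[] keys, int m) {
        String[] table = new String[m];
        for (String key : keys) {
            int slot = hash(key, m);
            if (table[slot] != null && !table[slot].equals(key)) {
                return new String[] { table[slot], key };
            }
            table[slot] = key;
        }
        return null;
    }

    public static void main(String[] args) {
        String[] keys = { "ann", "bob", "cat", "dan", "eve" }; // n = 5 keys
        int m = 4;                                             // only 4 slots, so n > m

        String[] pair = firstCollision(keys, m);
        // h(x) == h(y) even though !x.equals(y)
        System.out.println(pair[0] + " and " + pair[1] + " collide in slot " + hash(pair[0], m));
    }
}
```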

hashCode
   - every object in Java has a method called hashCode that returns an integer for that object; ideally, different objects get different integers
   - how does this happen?
      - it's another method (like equals and toString) that is inherited from the Object class
   - the hashCode method for Object is typically based on the object's location in memory and does a fairly good job of providing unique numbers; however, two different objects that are equal according to equals will usually get different numbers
   - if you plan on using Maps or hashtables with your own class and it overrides equals, you should also override the hashCode method
   - the two requirements of the hashCode method are ( http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode() ):
      - "Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified."
      - "If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result."
   - A number of the common classes (like String, Integer, etc.) already override hashCode
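   - a minimal sketch of overriding hashCode together with equals; the Name class is hypothetical, and Objects.hash is one convenient way to combine fields, not the only one
      - hashCode uses exactly the fields that equals compares, so equal objects produce the same integer (the second requirement above)

```java
import java.util.HashSet;
import java.util.Objects;

class Name {
    private final String first;
    private final String last;

    Name(String first, String last) {
        this.first = first;
        this.last = last;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Name)) return false;
        Name n = (Name) other;
        return first.equals(n.first) && last.equals(n.last);
    }

    @Override
    public int hashCode() {
        // built from the same fields equals compares
        return Objects.hash(first, last);
    }

    public static void main(String[] args) {
        HashSet<Name> set = new HashSet<>();
        set.add(new Name("Ada", "Lovelace"));

        // a different object in memory, but equal according to equals,
        // so it hashes to the same place and is found
        System.out.println(set.contains(new Name("Ada", "Lovelace"))); // true
        System.out.println(set.contains(new Name("Alan", "Turing")));  // false
    }
}
```

   - without the hashCode override, the first contains call would most likely print false, since the two objects would get unrelated hash codes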