CS201 - Spring 2014 - Class 28
exercise
admin
- office hours will start late ~10:30 on Tue.
quick recap of data structures
- No free lunch: there is no single best data structure
- different data structures (and different implementations of those structures) are good at different things
- What are the following structures good for/at?
- Arrays/ArrayLists
- storing sequential data
- O(1) access to elements at particular indices
- ArrayLists allow for amortized O(1) adds
- Linked lists
- add/remove from beginning/end in O(1) (for the end, this assumes a tail reference and, for removal, a doubly linked list)
- can delete from the middle if we have a reference to the node
- Binary search trees
- If balanced, O(log n):
- add
- delete
- search
- Heaps
- good for min/max requests
- O(log n) add and extractMin/Max
- We've also looked at some meta-data structures that help facilitate certain operations
- Stacks/Queues
- Trees
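As a quick refresher, the operations recapped above can be tried out with the standard java.util collections. This is only a sketch: the class name RecapDemo and the helper heapSorted are made up for illustration, and java.util.PriorityQueue stands in for whatever heap implementation the course used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.PriorityQueue;

public class RecapDemo {
    // Hypothetical helper: builds a min-heap from the values, then
    // repeatedly extracts the minimum (each poll is O(log n)).
    static List<Integer> heapSorted(List<Integer> values) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(values);
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            out.add(heap.poll());
        }
        return out;
    }

    public static void main(String[] args) {
        // ArrayList: amortized O(1) add at the end, O(1) access by index
        ArrayList<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        list.add("c");
        System.out.println(list.get(1));

        // LinkedList (doubly linked): O(1) add/remove at either end
        LinkedList<String> linked = new LinkedList<>(list);
        linked.addFirst("start");
        linked.removeLast();
        System.out.println(linked);

        // Heap: elements come out smallest first
        System.out.println(heapSorted(Arrays.asList(5, 1, 4)));
    }
}
```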
sets
- we'd like to be able to support set-like data structures
- objects can be added
- and we can ask if an object belongs to a set
- look at the Set interface in the Hashtables code
- can we support this type of thing with anything we have so far?
- binary search trees do this in O(log n)
- can we do better?
- are there additional things we can do with binary search trees that we don't need?
- there is still an ordering
- things like successor can be done quickly
- can print out the data in order
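To make the binary search tree option concrete, here is a small sketch using java.util.TreeSet, which is backed by a balanced tree. It is a stand-in for the course's own Set interface, which is not shown here; the class name BstSetDemo and the helper buildSet are made up for illustration. Notice that we get ordering and successor queries along with add and contains, even though a plain set does not need them.

```java
import java.util.TreeSet;

public class BstSetDemo {
    // Hypothetical helper: builds a small set backed by a balanced BST.
    static TreeSet<Integer> buildSet() {
        TreeSet<Integer> set = new TreeSet<>();
        set.add(42);
        set.add(7);
        set.add(19);
        set.add(7); // already present, so the set is unchanged
        return set;
    }

    public static void main(String[] args) {
        TreeSet<Integer> set = buildSet();
        System.out.println(set.contains(19)); // membership test, O(log n)
        System.out.println(set.contains(5));
        System.out.println(set.higher(7));    // successor of 7
        System.out.println(set);              // elements are visited in sorted order
    }
}
```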
set applications?
universe of keys
- we have some universe of keys (often called U) that we want to store, be it numbers, strings, objects, etc.
- for example:
- all Middlebury ID numbers
- all social security numbers
- all last names
- all names of people (first and last together)
- all tweets from today
- ...
- if you know the min and max key, any approach?
- store them in an array
- Any problems?
- for any given run, we don't need to store ALL keys, just a subset
- the array has to be at least the size of the universe of keys (all possible keys!)
- lots of wasted space
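The array idea above can be sketched as follows, assuming integer keys with a known minimum and maximum. The class name DirectAddressSet and the 5-digit ID range are hypothetical. The array needs one slot per possible key, no matter how few keys are actually stored.

```java
public class DirectAddressSet {
    private final int minKey;
    private final boolean[] present; // one slot for every possible key

    public DirectAddressSet(int minKey, int maxKey) {
        this.minKey = minKey;
        this.present = new boolean[maxKey - minKey + 1];
    }

    // Keys outside [minKey, maxKey] are not supported in this sketch
    // (they would cause an ArrayIndexOutOfBoundsException).
    public void add(int key) {
        present[key - minKey] = true;
    }

    public boolean contains(int key) {
        return present[key - minKey];
    }

    public int capacity() {
        return present.length;
    }

    public static void main(String[] args) {
        // Hypothetical universe: all 5-digit ID numbers
        DirectAddressSet ids = new DirectAddressSet(10000, 99999);
        ids.add(12345);
        System.out.println(ids.contains(12345));
        System.out.println(ids.contains(54321));
        // 90000 slots are allocated to store a single key
        System.out.println(ids.capacity());
    }
}
```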
hash functions
- a hash function is a function that maps the universe of keys to a restricted range of size m (for example, the integers 0 through m-1), where m << |U|; that is, m is much smaller than the universe of keys
- how does this help us?
- now we don't need an array of size |U|, just an array of size m
- a hashtable is a data structure that uses an array of some sort to store the items. Using a hash function, any item is mapped to a position in the array.
- to find if an item exists in the hash table, we hash the item and see if it exists in the table at the specified entry
- what can happen if m < |U|?
- we can have two things map to the same position in the array even though they're not equivalent, that is h(x) == h(y) even though !x.equals(y)
- this is called a "collision"
- a good hash function will try to avoid them but if m < |U|, they are inevitable
- why?
- pigeonhole principle: if n items are put into m pigeonholes with n > m, then at least one pigeonhole must contain more than one item
- simple idea, but often useful for proving things
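A minimal sketch of these ideas, assuming integer keys and the simple (and not necessarily good) hash function h(key) = key mod m. The names HashDemo, hash, and firstCollision are made up for illustration. The second method shows the pigeonhole principle at work: hash more than m distinct keys into m slots and some slot must be hit twice.

```java
public class HashDemo {
    // Maps any int key to a slot in [0, m). Math.floorMod keeps the
    // result non-negative even when the key is negative.
    static int hash(int key, int m) {
        return Math.floorMod(key, m);
    }

    // Hashes the keys in order and returns the index of the first key
    // that lands in an already occupied slot, or -1 if there is none.
    // Assumes the keys are distinct, so any hit is a true collision.
    static int firstCollision(int[] keys, int m) {
        boolean[] used = new boolean[m];
        for (int i = 0; i < keys.length; i++) {
            int slot = hash(keys[i], m);
            if (used[slot]) {
                return i;
            }
            used[slot] = true;
        }
        return -1;
    }

    public static void main(String[] args) {
        int m = 10;
        // 3 and 13 are different keys, but both hash to slot 3
        System.out.println(hash(3, m) == hash(13, m));
        // slots used: 3, 7, 5, then 3 again at index 3
        System.out.println(firstCollision(new int[] {3, 17, 25, 13}, m));
    }
}
```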
hashCode
- every object in Java has a method called hashCode that returns an attempt at a unique integer for that object
- how does this happen?
- it's another method (like equals and toString) that is inherited from the Object class
- the hashCode method for Object is typically based on the object's location in memory and does a fairly good job of providing unique numbers, however...
- if you plan on using Maps or hashtables with your own class, you should consider overriding the hashCode method (and if you override equals, you need to override hashCode so the two stay consistent)
- the two requirements of the hashCode method are (http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()):
- "Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified."
- "If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result."
- A number of the common classes (like String, Integer, etc.) do have overridden hashCode methods
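Here is one way overriding hashCode alongside equals might look. The Student class and its fields are hypothetical, and java.util.Objects.hash is just one convenient way to combine fields; any method that returns the same integer for equal objects satisfies the requirement quoted above.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class Student {
    private final String name;
    private final int id;

    public Student(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Student)) return false;
        Student s = (Student) other;
        return id == s.id && name.equals(s.name);
    }

    // Uses exactly the fields that equals compares, so two equal
    // students always produce the same hash code.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    public static void main(String[] args) {
        Set<Student> roster = new HashSet<>();
        roster.add(new Student("Ada", 1));
        // A different object that is equal according to equals is still found
        System.out.println(roster.contains(new Student("Ada", 1)));
        // String overrides hashCode: equal strings hash the same
        System.out.println("hello".hashCode() == new String("hello").hashCode());
    }
}
```

Without the hashCode override, the contains call above would most likely report false, because the two Student objects would get different hash codes from Object and be looked up in different places.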