CS201 - Spring 2014 - Class 30
exercise
storing data for quick lookup:
- support three key operations:
- insert
- search/contains
- remove
- key idea:
- use an array to store the data
- associate with each data item an index in the array
1. generate a numerical representation for the data item
2. take numerical representation (hash code) and map it to an entry in the array
3. handle collisions
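Steps 1 and 2 above can be sketched in a few lines of Java. This is only an illustration: the class name HashIndex and the method indexFor are made up for this example, and it assumes the item's own hashCode() is a reasonable numerical representation.

```java
// Illustrative sketch only; HashIndex and indexFor are hypothetical names.
class HashIndex {
    // Step 1: item.hashCode() gives a numerical representation of the item.
    // Step 2: Math.floorMod maps that number to an entry 0..m-1 in the array.
    static int indexFor(Object item, int m) {
        // hashCode() may be negative; floorMod (unlike %) still returns
        // a value in 0..m-1 when m is positive.
        return Math.floorMod(item.hashCode(), m);
    }

    public static void main(String[] args) {
        int m = 11; // table size, chosen arbitrarily for the example
        for (String s : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(s + " -> " + indexFor(s, m));
        }
    }
}
```

Step 3 (collisions) is not handled here: two different items can easily get the same index.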
collision resolution by chaining
open addressing
- because of some of the cons above (in particular the overhead), we often only want to use a basic array to store the hashtable
- we still have to do something about collisions... ideas?
- when we have a collision and there's already an item at that location, we need to find another possible place to put it
- for open addressing, we must define a "probe sequence" that determines where to look in the table next if we have a collision
- if h(x) is the hash function, the probe sequence is often written as h(x, i), that is, the place to look after the i previous locations h(x, 0), ..., h(x, i-1) were all found to be full
- h(x, 0) is the first place to check
- h(x, 1) the next
- and so on
- notice that this is defined by the hash function, so it could be different for different items, etc.
- the probe sequence must be a permutation of all of the entries in the table, that is, if we look at h(x, 0), h(x, 1), ..., h(x, m-1), these values will be a permutation of 0, 1, ..., m-1
- why?
- inserting
- given this, how can we insert items into the table?
- start at probe sequence 0, if it's empty put the item there
- if it's full, go on to 1, etc.
- note that we can actually fill up the table here
- contains
- what do we need to check here?
- again, start at probe 0
- see if there's something there AND see if the item is equal to the item we're actually looking for
- if not, keep looking
- when do we stop?
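- when we find an empty entry (the item would have been put there if it were in the table), or after m probes if the table is completely full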
- look at OpenAddressedHashtable class in Hashtables code
- what is the "put" method doing?
- write the "contains" method
- notice that the class is abstract since we haven't defined what the probe sequence will be
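One possible shape for the insert and contains logic described above is sketched here. This is not the course's OpenAddressedHashtable: the class name OpenAddressedSketch, the String-only table, and the probe method signature are all assumptions made for the example.

```java
// Hypothetical sketch of an open-addressed set of strings.
abstract class OpenAddressedSketch {
    protected final String[] table;

    OpenAddressedSketch(int m) {
        table = new String[m];
    }

    // Subclasses define the probe sequence: the index (0..m-1) of the
    // i-th place to look for this item.
    protected abstract int probe(String item, int i);

    // Returns false if every probe found an occupied entry (table is full).
    public boolean put(String item) {
        for (int i = 0; i < table.length; i++) {
            int idx = probe(item, i);
            if (table[idx] == null) {
                table[idx] = item; // first empty entry on the probe sequence
                return true;
            }
            if (table[idx].equals(item)) {
                return true; // already in the table
            }
        }
        return false;
    }

    public boolean contains(String item) {
        for (int i = 0; i < table.length; i++) {
            int idx = probe(item, i);
            if (table[idx] == null) {
                return false; // empty entry: the item would have been put here
            }
            if (table[idx].equals(item)) {
                return true;
            }
        }
        return false; // probed all m entries without finding it
    }
}
```

Both loops stop after m probes, which is only safe if the probe sequence is a permutation of 0, 1, ..., m-1.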
abstract methods/classes
probe sequences
- our one requirement is that the probe sequence must visit every entry in the table
- ideas?
- linear probing
- easiest to understand:
- h(k, i) = (h(k) + i) % m
- just look at the next location in the hash table
- if the original hash function says to look at location j and it's full, then we look at j+1, j+2, j+3, ...
- need to modulo the size of the table to wrap around
- look at LinearAddressedHashtable class in Hashtables code
- problems?
- "primary clustering"
- you tend to get very long sequences of things clustered together
- show an example
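A small example of primary clustering, as a sketch. It assumes non-negative integer keys and the simple hash h(k) = k % m; the class LinearProbingDemo and its methods are made up for this illustration and are not the course's LinearAddressedHashtable.

```java
import java.util.Arrays;

// Hypothetical demo of linear probing with integer keys.
class LinearProbingDemo {
    // h(k, i) = (h(k) + i) % m, assuming h(k) = k % m and k >= 0
    static int probe(int key, int i, int m) {
        return (key % m + i) % m;
    }

    // Inserts key and returns the number of probes used, or -1 if the table is full.
    static int insert(Integer[] table, int key) {
        int m = table.length;
        for (int i = 0; i < m; i++) {
            int idx = probe(key, i, m);
            if (table[idx] == null) {
                table[idx] = key;
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[10];
        // 12, 22 and 32 all hash to 2 and pile up in entries 2, 3, 4.
        // 13 hashes to 3, a different location, but still has to walk
        // through the cluster before finding entry 5.
        for (int key : new int[] {12, 22, 32, 13}) {
            System.out.println(key + ": " + insert(table, key) + " probe(s)");
        }
        System.out.println(Arrays.toString(table));
    }
}
```

Every key that lands anywhere in the run of occupied entries extends the same run, which is why clusters tend to keep growing.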
- double hashing
- h(x, i) = (h(x) + i h2(x)) % m
- unlike linear probing, where the step between probes is always 1, the step this time is h2(x), a second hash of the data, so it can differ from item to item
- avoids primary clustering
- what is the challenge?
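- probe sequence must be a permutation of the entries in the table
- h2(x) must let the sequence visit all possible positions in the table, which requires h2(x) to be nonzero and relatively prime to m (e.g. m prime, or m a power of 2 with h2(x) always odd)
- commonly used in real implementations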
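The permutation requirement for double hashing can be checked directly. In this sketch the values of h(x) and h2(x) are passed in as plain non-negative ints rather than computed from an item; DoubleHashingDemo and its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo: which step sizes make the probe sequence a permutation?
class DoubleHashingDemo {
    // h(x, i) = (h(x) + i * h2(x)) % m, with h = h(x) and h2 = h2(x) given.
    // Assumes h and h2 are non-negative; long arithmetic avoids int overflow.
    static int probe(int h, int h2, int i, int m) {
        return (int) (((long) h + (long) i * h2) % m);
    }

    // True if probes 0..m-1 land on m distinct entries, i.e. the sequence
    // is a permutation of 0, 1, ..., m-1.
    static boolean visitsEveryEntry(int h, int h2, int m) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < m; i++) {
            seen.add(probe(h, h2, i, m));
        }
        return seen.size() == m;
    }

    public static void main(String[] args) {
        System.out.println(visitsEveryEntry(5, 3, 8)); // step 3 shares no factor with 8
        System.out.println(visitsEveryEntry(5, 4, 8)); // step 4 only alternates between 2 entries
    }
}
```

With m = 8, a step of 4 only ever reaches entries 5 and 1, so an insert could fail even though the table has room.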
running time for open addressing
- what is the run-time for contains for open addressing?
- again, assume an ideal hash function where each original location is equally likely and also each probe is equally likely
- assuming n things stored in a table with m entries (i.e. a load of alpha = n/m)
- what is the probability that the first place we look is occupied?
- alpha (n/m)
- given the first was occupied, what is the probability that the second place we look is occupied?
- alpha (actually, (n-1)/(m-1), but almost alpha :)
- what is the probability that we have to make a third probe?
- alpha (the first position was occupied) *
- alpha (the second position was occupied) = alpha^2
- so, what is the expected number of positions we have to probe before we find an open one?
- it's the sum of the probabilities that we have to make each probe (the first probe always happens, the ith probe happens with probability about alpha^{i-1})
- 1 + alpha + alpha^2 + alpha^3 + ... + alpha^{m-1}
- which is bounded by: 1/(1-alpha)
- how does this help us with our run-time?
- the run-time is bounded by the number of probes we have to make
- to insert, we need to find an open entry, what is the running time?
O(1 + 1/(1-alpha))
- for contains, we may have to search until we find an open entry, what's the running time?
O(1 + 1/(1-alpha))
- what does this translate to search-wise?
alpha    average number of probes, 1/(1-alpha)
0.1      1.11
0.25     1.33
0.5      2
0.75     4
0.9      10
0.95     20
0.99     100
(note that these are ideal case numbers)
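The numbers above come straight from the 1/(1-alpha) bound and can be reproduced with a short sketch. The class ExpectedProbes and the method bound are names invented for this example.

```java
// Hypothetical helper that evaluates the ideal-case bound on the number of probes.
class ExpectedProbes {
    // 1/(1-alpha); only meaningful for 0 <= alpha < 1
    static double bound(double alpha) {
        return 1.0 / (1.0 - alpha);
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99}) {
            System.out.printf("%.2f  %.2f%n", alpha, bound(alpha));
        }
    }
}
```

The bound grows slowly until the table is about half full and then climbs quickly, which is why open-addressed tables are usually resized well before alpha gets close to 1.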