CS 062, Lecture 36

Maps
1. Possible implementations of maps:
2. Hashing functions
  1. Selecting hashing functions

Maps

A map represents a function from keys to values. That is, for each key there is a unique value associated with it. It is convenient to represent such a function as a collection of key-value pairs. The interface Map.Entry<K,V> of Java, which represents one such pair, is as follows:

interface Map.Entry<K,V> {
   // return the key of the entry
   K getKey();
   
   // return the value associated with the entry
   V getValue();
   
   // update the value associated with the current key
   V setValue(V value);
}

Operations associated with Map<K,V> include the following:

   void clear();
   boolean containsKey(K key);

   // Returns true if this map maps one or more keys to the specified value.
   boolean containsValue(V value);

   // Returns a Set view of the mappings contained in this map.
   Set<Map.Entry<K,V>> entrySet();

   // Returns the value to which the specified key is mapped, or null if this 
   // map contains no mapping for the key.
   V get(K key);

   boolean isEmpty();

   // Returns a Set view of the keys contained in this map
   Set<K> keySet();

   // Associates the specified value with the specified key in this map
   // (optional operation)
   V put(K key, V value);

   // Removes the mapping for a key from this map if it is present
   // (optional operation)
   V remove(K key);

   // Number of entries in the map
   int size();

   // Returns a Collection view of the values contained in this map.
   Collection<V> values();

We will talk about sets a bit later, but the key thing is that they have an iterator() method. Note that get returns null if the key is not found. How can we implement a map?
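
As a quick illustration of these operations, here is a minimal sketch that counts words with java.util.HashMap. The class name WordCount and the sample words are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

class WordCount {
    // Count how often each word occurs, using containsKey, get, and put.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            if (counts.containsKey(w)) {
                counts.put(w, counts.get(w) + 1);
            } else {
                counts.put(w, 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(new String[] {"a", "b", "a"});
        // entrySet() returns a Set, so we can iterate over its entries.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // get returns null when the key has no mapping.
        System.out.println(counts.get("zebra"));
    }
}
```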

Possible implementations of maps:

Note: n = actual number of elts in the table, N = maximum number of elts the table can hold.

**Complexity of table operations**
Structure	Search	Insert	Delete	Space
Linked List	O(n)	O(1)	O(n)	O(n)
Sorted Array	O(log n)	O(n)	O(n)	O(N)
Balanced BST	O(log n)	O(log n)	O(log n)	O(n)
Array[KeyRange] of EltType	O(1)	O(1)	O(1)	KeyRange

Other possibilities include unordered array, ordered linked list, unbalanced BST.

We can search Sorted Arrays somewhat faster if we use an interpolation search (as long as we know the distribution of keys). On uniformly distributed keys it averages O(log log n) probes, but its worst case is O(n), and insertion and deletion are still O(n).

Hashing functions

The table implementation of an array with keys as the subscripts and values as contents makes sense. Nevertheless there are some important restrictions on the use of this representation of a table.

This implementation assumes that the data has a key which is of a restricted type (some enumerated type in Pascal, integers in Java), which is not always the case.

Note also that the size requirements for this implementation could be prohibitive.

Ex. If the array held 2000 student records indexed by social security number it would be declared as ARRAY[0..999,999,999]

What if most of the entries are empty? If we use a much smaller array, all of the elements could still fit.

Suppose we have a lot of data elements of type EltType and a set of locations in which we could store data elements.

Consider a function H: EltType -> Location with the properties

H(elt) can be computed quickly
If elt₁ <> elt₂ then H(elt₁) <> H(elt₂). (H is a one-to-one function.)

This is called a perfect hashing function. Unfortunately, they are difficult to find unless you know all possible entries to the table in advance. This is not often the case.

Instead we use something that behaves well, but not necessarily perfectly.

The goal is to scatter elements through the array randomly so that they won't bump into each other.

Define a function H: Keys -> Addresses, and call H(element.key) the home address of element.

Of course now we can't list elements easily in any kind of order, but hopefully we can find them in time O(1). A data structure using hash functions is java.util.HashMap.

Note that each entry in the table will need to include the actual key, since several different keys will likely get mapped to the same subscript.

Hash functions are so important that in Java every object supports the method hashCode. If you take the default blindly, you get something like a location. A very important constraint is that if a.equals(b) then a.hashCode() == b.hashCode().

Thus if you ever redefine equals, you must redefine hashCode consistently or certain internal operations will no longer work.
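
A minimal sketch of keeping equals and hashCode consistent follows. The class StudentId and its fields are hypothetical, invented only for this illustration; Objects.hash is a standard java.util helper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class, used only for this illustration.
class StudentId {
    private final String dept;
    private final int number;

    StudentId(String dept, int number) {
        this.dept = dept;
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof StudentId)) return false;
        StudentId that = (StudentId) other;
        return number == that.number && dept.equals(that.dept);
    }

    // Built from exactly the fields that equals compares, so equal
    // objects always produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(dept, number);
    }

    public static void main(String[] args) {
        Map<StudentId, String> names = new HashMap<>();
        names.put(new StudentId("CS", 62), "Ada");
        // A different but equal object finds the same entry.
        System.out.println(names.get(new StudentId("CS", 62)));
    }
}
```

If hashCode were left at its default here, the second StudentId would most likely land in a different bucket and the lookup would fail even though the two keys are equal.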

There are two problems to look at:

What are good hashing functions?
What do we do when two different elements get sent to same home address?

Selecting hashing functions

The following quote should be memorized by anyone trying to design a hashing function: "A given hash function must always be tried on real data in order to find out whether it is effective or not." Data which has certain regularities can completely destroy the usefulness of any hashing function!

Sometimes you are lucky and the key is an int or another number that can be truncated to an int. If the key is a string, we can try the following:

String-valued keys

We can use a formula like Key(xy) = 2⁸ * ORD(x) + ORD(y) to convert from alphabetic keys to ASCII equivalents. This is often used in combination with folding (for the rest of the string) and division.

If you use 32-bit integers (longints in Pascal, ints in Java) to hold the numbers, then you can get 4 letters into one number in this way. If they are all alphabetic (no special characters), then you can subtract (int)'a' from each ASCII code in order to reduce the size of the keys.

Here is a very simple-minded hash code for strings: Add together the ordinal equivalent of all letters and take the remainder mod tableSize.

Problem: Words with same letters get mapped to same places:

miles, slime, smile

This would be much improved if you took the letters in pairs before division.

Nevertheless, for simplicity we adopt this simple-minded (and thus relatively useless) hash function for the following discussion. Here is code which adds up the ordinal values of the letters and then takes the result mod tableSize:

int hash = 0;
for (int charNo = 0; charNo < word.length(); charNo++) 
    hash = hash + (int)(word.charAt(charNo));
hash = hash % tableSize;  // gives 0 <= hash < tableSize

The code is only a little more complex if we weight each succeeding character by a further factor of 2⁸ = 256.

Efficient way using Horner's rule:

int hash = 0;
for (int charNo = word.length()-1; charNo >= 0; charNo--)
    hash = (256*hash + (int)(word.charAt(charNo))) % tableSize;

Notice we mod by tableSize each time we update hash to prevent overflows.

This efficient way of calculating uses only word.length() multiplications, while the normal way (recomputing each power of 256 from scratch) would involve O(word.length()²) multiplications.
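
The two string hashes discussed above can be compared side by side. This is a sketch only: the class name StringHash, the method names, and the table size 997 are choices made for this example, and tableSize is assumed small enough that 256 * tableSize fits in an int.

```java
class StringHash {
    // Simple-minded hash: sum of the character codes, mod tableSize.
    static int sumHash(String word, int tableSize) {
        int hash = 0;
        for (int charNo = 0; charNo < word.length(); charNo++) {
            hash = hash + (int) word.charAt(charNo);
        }
        return hash % tableSize;
    }

    // Horner's rule: treats the characters as base-256 digits and
    // reduces mod tableSize at every step to avoid overflow.
    static int hornerHash(String word, int tableSize) {
        int hash = 0;
        for (int charNo = word.length() - 1; charNo >= 0; charNo--) {
            hash = (256 * hash + (int) word.charAt(charNo)) % tableSize;
        }
        return hash;
    }

    public static void main(String[] args) {
        int tableSize = 997; // assumed table size for this demo
        for (String w : new String[] {"miles", "slime", "smile"}) {
            System.out.println(w + ": sum=" + sumHash(w, tableSize)
                    + " horner=" + hornerHash(w, tableSize));
        }
    }
}
```

The sum gives the three anagrams the same address because addition ignores letter order, while the Horner version depends on the position of each letter.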

What if the obvious hash code is too large?

a. Digit selection

Choose digits from certain positions of key (e.g. last 3 digits of SS#).

Unfortunately it is easy to get a biased sample. We can carefully analyze keys to see which will work best. We must watch out for patterns - they should generate all possible table positions. (For example the first digits of SS#'s reflect the region in which they were assigned and hence usually would work poorly as a hashing function.)

b. Division

Let H(key) = key mod TableSize.

This is very efficient and often gives good results if the TableSize is chosen properly.

If it is chosen poorly then you can get very poor results. If TableSize = 2⁸ = 256 and the keys are integer ASCII equivalents of two-letter pairs, i.e. Key(xy) = 2⁸ * ORD(x) + ORD(y), then all pairs ending with the same letter get mapped to the same address, since Key(xy) mod 256 = ORD(y). Similar problems arise with any power of 2.

The best bet seems to be to let the TableSize be a prime number.

In practice if no divisor of the TableSize is less than 20, the hash function seems to be OK. (Text uses 997 in the sample program)
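
The effect of the table size can be seen in a small sketch. The class and method names are invented for this example, and 997 is used only as a convenient prime.

```java
class DivisionHash {
    // Key(xy) = 2^8 * ORD(x) + ORD(y), as in the notes.
    static int pairKey(char x, char y) {
        return 256 * x + y;
    }

    // H(key) = key mod tableSize.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    public static void main(String[] args) {
        // Three pairs that differ only in their first letter.
        for (String p : new String[] {"az", "qz", "mz"}) {
            int key = pairKey(p.charAt(0), p.charAt(1));
            System.out.println(p + ": mod 256 = " + hash(key, 256)
                    + ", mod 997 = " + hash(key, 997));
        }
    }
}
```

With TableSize = 256 the first letter has no influence on the address, so every pair ending in z lands in slot ORD('z') = 122.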

c. Mid-Square:

In this algorithm you square the key and then select certain bits. Usually the middle half of the bits is taken. The mixing provided by the multiplication ensures that all digits are used in the computation of the hash code.

Example: Let the keys range between 1 and 32000 and let the TableSize be 2048 = 2¹¹.

Square the Key and extract the middle 11 bits. (Grabbing certain bits of a word is easy to do using shift operators in assembly language or can be done using the div and mod operators using powers of two.)

In general r bits gives a table of size 2^r.
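
A minimal sketch of the mid-square example, using Java's shift and mask operators. The exact bit positions (bits 9 through 19 of the square) are an assumption made for this sketch; the notes only say to take the middle 11 bits.

```java
class MidSquare {
    // Assumes 1 <= key <= 32000, so key * key < 2^30 (a 30-bit number).
    // Dropping the low 9 bits and keeping the next 11 selects
    // bits 9..19, roughly the middle of those 30 bits.
    static int hash(int key) {
        long square = (long) key * key;
        return (int) ((square >>> 9) & 0x7FF); // 0x7FF = 2^11 - 1
    }

    public static void main(String[] args) {
        for (int key : new int[] {1, 123, 4567, 32000}) {
            System.out.println(key + " -> " + hash(key));
        }
    }
}
```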

d. Folding

Break the key into pieces (sometimes reversing alternate chunks) and add them up.

This is often used if the key is too big. E.g., if the keys are Social Security numbers, the 9 digits will not fit into a small integer type (such as a 16-bit integer). Break the number into three pieces: the 1st digit, the next 4, and then the last 4. Then add them together.

Now you can do arithmetic on them.

This technique is often used in conjunction with other methods (e.g. division)
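
The folding step for a 9-digit key can be sketched as follows. The key is taken as a String of digits so the sketch does not depend on any particular integer width; the class name, method names, and the sample number are made up for this example.

```java
class Folding {
    // Split a 9-digit key into its 1st digit, the next 4 digits,
    // and the last 4 digits, then add the three pieces.
    static int fold(String nineDigits) {
        int first = Integer.parseInt(nineDigits.substring(0, 1));
        int middle = Integer.parseInt(nineDigits.substring(1, 5));
        int last = Integer.parseInt(nineDigits.substring(5, 9));
        return first + middle + last;
    }

    // Folding combined with division to get a table position.
    static int hash(String nineDigits, int tableSize) {
        return fold(nineDigits) % tableSize;
    }

    public static void main(String[] args) {
        // 1 + 2345 + 6789 = 9135
        System.out.println(fold("123456789"));
        System.out.println(hash("123456789", 997));
    }
}
```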

What to do if you obtain hash clashes?

The home address of a key is the location that the hash function returns for that key.

A hash clash occurs when two different keys have the same home location.

There are two main options for getting out of trouble:

Rehash to try to find an open slot (open addressing). This must be done in such a way that one can find the element again quickly!
External chaining.

Suppose

    Map.Entry [] table = new Map.Entry[TableSize];

1. Open Addressing:

Find the home address of the key (obtained through the hash function). If it happens to be filled already, keep trying new slots until you find an empty slot.

There will be three types of entries in the table:

empty entries, represented by null;
deleted entries, marked by inserting a special "reserved" object; and
normal entries, which contain a reference to an object.

Here are some variations:

a. Linear rehashing

Let Probe(i) = (i + 1) % TableSize.

This is about as simple a function as possible. Successive rehashing will eventually try every slot in the table for an empty slot.

Ex. Table to hold strings, TableSize = 26

H(key) = ASCII code of first letter of string - ASCII code of 'A'.

Strings to be input are GA, D, A, G, A2, A1, A3, A4, Z, ZA, E

Look at how they get inserted with linear rehashing:

0   1   2   3   4   5   6   7   8   9   10      ...     25
A   A2  A1  D   A3  A4  GA  G   ZA  E   ..      ...     Z

Primary clustering occurs when more than one key has the same home address. If the same rehashing scheme is used for all keys with the same home address then any new key will collide with all earlier keys with the same home address when the rehashing scheme is employed.

In the example, this happened with A, A2, A1, A3, A4.

Secondary clustering occurs when a key has a home address which is occupied by an element which originally got hashed to a different home address, but in rehashing got moved to the address which is the same as the home address of the new element.

In the example, this happened with E.
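
Tracing the insertions by hand is error-prone, so here is a minimal sketch that simulates linear rehashing on the example strings. It assumes the home address is the first letter's offset from 'A' (the keys are uppercase) and a table with slots 0 through 25; the class and method names are invented for this example.

```java
class LinearProbeDemo {
    static final int TABLE_SIZE = 26;

    // Home address: offset of the first letter from 'A'.
    static int home(String key) {
        return key.charAt(0) - 'A';
    }

    // Insert each key, stepping to (i + 1) % TABLE_SIZE on a collision.
    // Assumes fewer keys than slots, so an empty slot always exists.
    static String[] insertAll(String[] keys) {
        String[] table = new String[TABLE_SIZE];
        for (String key : keys) {
            int i = home(key);
            while (table[i] != null) {
                i = (i + 1) % TABLE_SIZE;
            }
            table[i] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        String[] keys = {"GA", "D", "A", "G", "A2", "A1", "A3", "A4", "Z", "ZA", "E"};
        String[] table = insertAll(keys);
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (table[i] != null) {
                System.out.println(i + ": " + table[i]);
            }
        }
    }
}
```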

What happens when we delete A2 and then search for A1?

We must mark deletions (not just make them empty) and then try to fill them when possible.
This can be quite complex.

See code in text for how this is handled.
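
The following is a minimal sketch of marking deletions with a "reserved" object, not the code from the text. It stores bare strings rather than key-value entries, uses the first-letter hash from the example (keys are assumed to start with an uppercase letter), and all names are invented for this illustration.

```java
class ProbedTable {
    // Special "reserved" marker for deleted slots; compared by identity.
    private static final String RESERVED = new String("<reserved>");
    private final String[] table;

    ProbedTable(int tableSize) {
        table = new String[tableSize];
    }

    private int home(String key) {
        return Math.floorMod(key.charAt(0) - 'A', table.length);
    }

    // Returns the slot holding key, or -1 if it is absent.
    // The search passes over reserved slots and stops only at null.
    private int locate(String key) {
        int i = home(key);
        for (int tries = 0; tries < table.length; tries++) {
            if (table[i] == null) return -1;
            if (table[i] != RESERVED && table[i].equals(key)) return i;
            i = (i + 1) % table.length;
        }
        return -1;
    }

    boolean contains(String key) {
        return locate(key) >= 0;
    }

    // Adds key, reusing the first reserved slot on its probe path if any.
    // Returns false if the key is already present or the table is full.
    boolean add(String key) {
        if (contains(key)) return false;
        int i = home(key);
        for (int tries = 0; tries < table.length; tries++) {
            if (table[i] == null || table[i] == RESERVED) {
                table[i] = key;
                return true;
            }
            i = (i + 1) % table.length;
        }
        return false;
    }

    boolean remove(String key) {
        int i = locate(key);
        if (i < 0) return false;
        table[i] = RESERVED; // not null: later keys on this path stay findable
        return true;
    }
}
```

If remove stored null instead, a search for A1 after deleting A2 would stop at the emptied slot and wrongly report that A1 is missing.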

Minor variants of linear rehashing would include adding any number k (which is relatively prime to TableSize) to i rather than 1.

If k shares any factor greater than 1 with TableSize (i.e., k is not relatively prime to TableSize), then not all entries of the table will be explored when rehashing. For instance, if TableSize = 100 and k = 50, the Probe function will only explore two slots no matter how many times it is applied to a starting location.

Often the use of k=1 works as well as any other choice.
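
To check how many slots a given step size can reach, here is a small sketch that counts the distinct slots visited by Probe(i) = (i + k) % TableSize. The class and method names are made up, and start and k are assumed to be non-negative with start < tableSize.

```java
class StepCoverage {
    // Count the distinct slots reached from start by repeatedly
    // applying Probe(i) = (i + k) % tableSize.
    static int slotsVisited(int start, int k, int tableSize) {
        boolean[] seen = new boolean[tableSize];
        int count = 0;
        int i = start;
        while (!seen[i]) {
            seen[i] = true;
            count++;
            i = (i + k) % tableSize;
        }
        return count;
    }

    public static void main(String[] args) {
        for (int k : new int[] {1, 3, 4, 50}) {
            System.out.println("k = " + k + ": "
                    + slotsVisited(7, k, 100) + " slots");
        }
    }
}
```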

b. Quadratic rehashing

Try (home + j²) % TableSize on the jth rehash.

This variant helps with secondary clustering but not primary clustering. (Why?) It can also result in instances where in rehashing you don't try all possible slots in table.

For example, suppose the TableSize is 5, with slots 0 to 4. If the home address of an element is 1, then successive rehashings result in 2, 0, 0, 2, 1, 2, 0, 0, ... The slots 3 and 4 will never be examined to see if they have room. This is clearly a disadvantage.
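
The probe sequence in this example can be reproduced with a short sketch (the class and method names are invented for the illustration):

```java
class QuadraticProbe {
    // Slot tried on the jth rehash: (home + j*j) % tableSize.
    static int probe(int home, int j, int tableSize) {
        return (home + j * j) % tableSize;
    }

    public static void main(String[] args) {
        // Home address 1 in a table of size 5, first eight rehashes.
        for (int j = 1; j <= 8; j++) {
            System.out.print(probe(1, j, 5) + " ");
        }
        System.out.println();
    }
}
```

Since j² mod 5 can only be 0, 1, or 4, the slots reached from home address 1 are limited to 1, 2, and 0.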

c. Double Hashing

Rather than computing a uniform jump size for successive rehashes, make it depend on the key by using a different hashing function to calculate the rehash.

E.g. compute delta(Key) = Key % (TableSize -2) + 1, and add delta for successive tries.

If the TableSize is chosen well, this should alleviate both primary and secondary clustering.

For example, suppose the TableSize is 5, and H(n) = n % 5. We calculate delta as above. Thus, while H(1) = H(6) = H(11) = 1, the jump sizes differ since delta(1) = 2, delta(6) = 1, and delta(11) = 3.
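
A minimal sketch of this scheme follows, with H and delta as defined above. The probe method (home address plus j jumps of size delta) and all names are assumptions for the illustration; keys are assumed to be non-negative.

```java
class DoubleHashing {
    // H(n) = n % tableSize
    static int home(int key, int tableSize) {
        return key % tableSize;
    }

    // delta(Key) = Key % (TableSize - 2) + 1
    static int delta(int key, int tableSize) {
        return key % (tableSize - 2) + 1;
    }

    // Slot examined on the jth try (j = 0 is the home address).
    static int probe(int key, int j, int tableSize) {
        return (home(key, tableSize) + j * delta(key, tableSize)) % tableSize;
    }

    public static void main(String[] args) {
        for (int key : new int[] {1, 6, 11}) {
            System.out.println("key " + key + ": home " + home(key, 5)
                    + ", delta " + delta(key, 5)
                    + ", first rehash " + probe(key, 1, 5));
        }
    }
}
```

Although the three keys share home address 1, their first rehashes go to three different slots.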

2. External chaining.

The idea is to let each slot in the hash table hold as many items as necessary.

The easiest way to do this is to let each slot be the head of a linked list of elements.

Draw picture with strings to be input as GA, D, A, G, A2, A1, A3, A4, Z, ZA, E

The simplest way to represent this is to allocate a table as an array of pointers, with each non-null entry a pointer to the linked list of elements which got mapped to that slot.

We can organize these lists as ordered, singly or doubly linked, circular, etc.

We can even let them be binary search trees if we want.

Of course with good hashing functions, the size of lists should be short enough so that there is no need for fancy representations (though one may wish to hold them as ordered lists).
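
External chaining can be sketched with java.util.LinkedList for the chains. This sketch stores bare strings, reuses the first-letter hash from the open addressing example (keys are assumed to start with an uppercase letter), and uses invented class and method names.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    // Each slot is null (empty) or the head of a chain of keys.
    private final List<LinkedList<String>> slots = new ArrayList<>();

    ChainedTable(int tableSize) {
        for (int i = 0; i < tableSize; i++) {
            slots.add(null);
        }
    }

    private int home(String key) {
        return Math.floorMod(key.charAt(0) - 'A', slots.size());
    }

    void add(String key) {
        int i = home(key);
        if (slots.get(i) == null) {
            slots.set(i, new LinkedList<String>());
        }
        slots.get(i).add(key);
    }

    boolean contains(String key) {
        LinkedList<String> chain = slots.get(home(key));
        return chain != null && chain.contains(key);
    }

    // Deleting is just a list removal; no "reserved" marker is needed.
    boolean remove(String key) {
        LinkedList<String> chain = slots.get(home(key));
        return chain != null && chain.remove(key);
    }

    int chainLength(int slot) {
        LinkedList<String> chain = slots.get(slot);
        return chain == null ? 0 : chain.size();
    }
}
```

With the strings GA, D, A, G, A2, A1, A3, A4, Z, ZA, E, every key stays in its home slot, so slot 0 holds a chain of five keys and slots 1 and 2 stay empty.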

See figure 14.2 in text on efficiency

There are some advantages to this over open addressing:

Deleting is not a big deal.
The number of elements stored in the table can be larger than the table size.
It avoids all problems of secondary clustering (though primary clustering can still be a problem).

Analysis

We can measure the performance of various hashing schemes with respect to the "load factor" of the table.

The load factor of a table is defined as a = (number of elts in table) / (size of table).

a = 1 means the table is full, a = 0 means it is empty.

Larger values of a lead to more collisions.

(Note that with external chaining, it is possible to have a > 1).

The following table summarizes the performance of our collision resolution techniques in searching for an element. The value in each slot represents the average number of compares necessary for the search. The first column represents the number of compares if the search is ultimately unsuccessful, while the second represents the case when the item is found:

**Complexity of hashing operations when table has load factor a**
Strategy	Unsuccessful	Successful
Linear rehashing	1/2 (1+ 1/(1-a)²)	1/2 (1+ 1/(1-a))
Double hashing	1/(1-a)	- (1/a) x log(1-a)
External hashing	a+e^a	1 + 1/2 a

The main point to note is that the time for linear rehashing goes up dramatically as a approaches 1.

Double hashing is similar, but not so bad, whereas the cost of external chaining does not increase very rapidly at all (linearly).
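
To get a feel for how quickly these costs grow, the open addressing formulas from the table can be evaluated directly. This sketch covers only the linear rehashing formulas and the unsuccessful-search formula for double hashing; the class and method names are invented, and a is assumed to satisfy 0 <= a < 1.

```java
class ProbeEstimates {
    // Linear rehashing, unsuccessful search: 1/2 (1 + 1/(1-a)^2)
    static double linearUnsuccessful(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }

    // Linear rehashing, successful search: 1/2 (1 + 1/(1-a))
    static double linearSuccessful(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    // Double hashing, unsuccessful search: 1/(1-a)
    static double doubleUnsuccessful(double a) {
        return 1 / (1 - a);
    }

    public static void main(String[] args) {
        for (double a : new double[] {0.5, 0.75, 0.9}) {
            System.out.println("a = " + a
                    + ": linear unsuccessful " + linearUnsuccessful(a)
                    + ", linear successful " + linearSuccessful(a)
                    + ", double unsuccessful " + doubleUnsuccessful(a));
        }
    }
}
```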

In particular, if a = .9, we get

**Complexity of hashing operations when table has load factor .9**
Strategy	Unsuccessful	Successful
Linear rehashing	50.5	11/2
Double hashing	10	~ 4
External hashing	3	1.45

The differences become greater with heavier loading.

The space requirements (in words of memory) are roughly the same for both techniques:

TableSize + n (Objectsize) --- open addressing
TableSize + n (Objectsize + 1) --- external chaining

but external chaining tolerates a smaller table (i.e., a higher load factor).

General rule of thumb: with small elts and a small load factor, use open addressing.

With large elts, external chaining gives good performance at a small cost in space.

Dictionaries

A dictionary or table represents a way of looking up items via key-value associations. We've already talked about associations, now we figure out ways of storing them efficiently so that we can look up information. The key difference between a map and a dictionary is that a dictionary may have multiple entries with the same key.

public interface Dictionary extends Container
{
    public Object put(Object key, Object value);
    // pre: key is non-null
    // post: puts key-value pair in Dictionary, 
    // returns old value

    public boolean contains(Object value);
    // pre: value is non-null
    // post: returns true iff the dictionary contains the value

    public boolean containsKey(Object key);
    // pre: key is non-null
    // post: returns true iff the dictionary contains the key

    public Object remove(Object key);
    // pre: key is non-null
    // post: removes the key-value pair whose key is "equal" to key
    
    public Object get(Object key);
    // pre: key is non-null
    // post: returns value associated with key, in table

    public Iterator keys();
    // post: returns iterator for traversing keys in dictionary

    public Iterator elements();
    // post: returns iterator for traversing values in 
    // dictionary

    public int size();
    // post: returns number of elements in dictionary
}