CS 062, Lecture 38

Hash Tables

Analysis

We can measure the performance of various hashing schemes with respect to the "load factor" of the table.

The load factor of a table is defined as a = (number of elts in table) / (size of table)

a = 1 means the table is full, a = 0 means it is empty.

Larger values of a lead to more collisions.

(Note that with external chaining, it is possible to have a > 1).
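
As a concrete illustration, here is a minimal sketch (not the course's hash table code; the field names and the 0.5 threshold are just illustrative assumptions) of how a table might track its load factor and decide when to grow:

public class LoadFactorSketch {
  private Object[] table = new Object[16];  // slots; null means empty
  private int count = 0;                    // number of elts currently stored

  /** a = (number of elts in table) / (size of table) */
  private double loadFactor() {
    return (double) count / table.length;
  }

  /** With open addressing we might grow the table once a exceeds some threshold. */
  private void checkLoad() {
    if (loadFactor() > 0.5) {   // 0.5 is an illustrative threshold, not a fixed rule
      rehash();
    }
  }

  /** Allocate a larger table and re-insert every element (details omitted). */
  private void rehash() { /* ... */ }
}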

The following table summarizes the performance of our collision resolution techniques in searching for an element. The value in each slot represents the average number of compares necessary for the search. The first column represents the number of compares if the search is ultimately unsuccessful, while the second represents the case when the item is found:

Strategy            Unsuccessful              Successful
Linear rehashing    (1/2)(1 + 1/(1-a)^2)      (1/2)(1 + 1/(1-a))
Double hashing      1/(1-a)                   (1/a) ln(1/(1-a))
External chaining   a + e^(-a)                1 + a/2

Complexity of hashing operations when table has load factor a
The main point to note is that the time for linear rehashing goes up dramatically as a approaches 1.

Double hashing is similar but not as bad, whereas the cost for external chaining grows only linearly in a.

In particular, if a = .9, we get

Strategy            Unsuccessful   Successful
Linear rehashing    50.5           5.5
Double hashing      10             ~2.6
External chaining   ~1.3           1.45

Complexity of hashing operations when table has load factor .9
The differences become greater with heavier loading.
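
To see where these numbers come from, you can just plug load factors into the formulas above. The small program below (not part of the course code) evaluates them; note that it uses the natural logarithm for the double hashing formula.

public class HashCostSketch {
  public static void main(String[] args) {
    double[] loads = {0.5, 0.7, 0.9};
    for (double a : loads) {
      double linUnsucc   = 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
      double linSucc     = 0.5 * (1 + 1 / (1 - a));
      double dblUnsucc   = 1 / (1 - a);
      double dblSucc     = (1 / a) * Math.log(1 / (1 - a));
      double chainUnsucc = a + Math.exp(-a);
      double chainSucc   = 1 + a / 2;
      System.out.printf("a = %.2f: linear %.1f/%.1f, double %.1f/%.1f, chaining %.2f/%.2f%n",
          a, linUnsucc, linSucc, dblUnsucc, dblSucc, chainUnsucc, chainSucc);
    }
  }
}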

The space requirements (in words of memory) are roughly the same for both techniques, but external chaining is happier with a smaller table (i.e., a higher load factor).

General rule of thumb: small elts, small load factor, use open addressing.

If the elts are large, then external chaining gives good performance at a small cost in space.

Dictionaries

A dictionary or table represents a way of looking up items via key-value associations. The key difference between a map and a dictionary is that a dictionary may have multiple entries with the same key.

public interface Dictionary<K,V> {

  /** Returns the number of entries in the dictionary. */
  public int size();

  /** Returns whether the dictionary is empty. */
  public boolean isEmpty();

  /** Returns an entry containing the given key, or null if
   * no such entry exists. */
  public Entry<K,V> find(K key)
    throws InvalidKeyException;

  /** Returns an iterable collection of all the entries with the
   * given key, or an empty collection if no such entries exist. */
  public Iterable<Entry<K,V>> findAll(K key) 
    throws InvalidKeyException;

  /** Inserts an item into the dictionary.  Returns the newly created
   * entry. */
  public Entry<K,V> insert(K key, V value)  
    throws InvalidKeyException;

  /** Removes and returns the given entry from the dictionary. */
  public Entry<K,V> remove(Entry<K,V> e)
    throws InvalidEntryException;

  /** Returns an iterable collection of all the entries in the dictionary. */
  public Iterable<Entry<K,V>> entries(); 
}
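
As a quick usage sketch, suppose we had some implementation of this interface (the class name HashTableDictionary below is made up, and I am assuming Entry provides getValue() as in the course library). Then duplicate keys would behave like this:

  Dictionary<String,Integer> phone = new HashTableDictionary<String,Integer>();
  phone.insert("Smith", 1234);
  phone.insert("Smith", 5678);            // a second entry with the same key is allowed
  Entry<String,Integer> one = phone.find("Smith");             // returns one of the two entries
  for (Entry<String,Integer> entry : phone.findAll("Smith"))   // iterates over both entries
    System.out.println(entry.getValue());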

Hash tables make all of these operations simple except entries(), since you need to search through the whole table to find the non-null entries; entries() therefore takes O(N) time, where N is the size of the table. All of the other operations should be O(1) on average, but can be as bad as O(n), where n is the number of entries, if you have a bad hash code.
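
For example, in a hash table that resolves collisions with external chaining, entries() might look roughly like the sketch below (the field table, an array of buckets, is an assumption about the implementation, not code from the course):

  /** Sketch of entries() for external chaining: scan every bucket of the table,
   * so the cost is O(N) where N is the size of the table. */
  public Iterable<Entry<K,V>> entries() {
    java.util.List<Entry<K,V>> result = new java.util.ArrayList<Entry<K,V>>();
    for (java.util.List<Entry<K,V>> bucket : table) {   // table: the array of buckets
      if (bucket != null) {
        result.addAll(bucket);   // every entry that hashed to this slot
      }
    }
    return result;
  }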

Binary search trees

Discussed binary search trees earlier.

Definition: A binary tree is a binary search tree iff it is empty or if the value of every node is both greater than or equal to every value in its left subtree and less than or equal to every value in its right subtree.

Interestingly, it will simplify a number of algorithms if we represent binary search trees so that all external nodes hold null. Thus an empty binary search tree has null at the root.

Searching a binary search tree is straightforward. See the code in BinarySearchTree, in particular the protected method treeSearch, which is used to find, insert, and remove elements of the binary search tree. Notice that find and insert have complexity proportional to the height of the tree, which may be as good as log n or as bad as n, where n is the number of elements in the tree.
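
Here is a minimal sketch of the search logic (this is not the course's BinarySearchTree code, which works on tree positions and returns the node where the search stops so that insert and remove can reuse it; the simplified Node class below is just for illustration):

class Node<K extends Comparable<K>, V> {
  K key;
  V value;
  Node<K,V> left, right;   // null children play the role of external nodes
  Node(K key, V value) { this.key = key; this.value = value; }
}

class TreeSketch {
  /** Returns the node holding key, or null if the search reaches an external node.
   * Each step moves down one level, so the time is proportional to the height. */
  static <K extends Comparable<K>, V> Node<K,V> treeSearch(Node<K,V> node, K key) {
    if (node == null) return null;                // fell off the tree: not found
    int cmp = key.compareTo(node.key);
    if (cmp == 0) return node;                    // found the key
    else if (cmp < 0) return treeSearch(node.left, key);
    else return treeSearch(node.right, key);
  }
}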

Removing an element from a tree is a bit trickier. The removal algorithm depends on a helper removeExternal(v), where v is an external node (one holding null). removeExternal removes v and its parent and replaces the parent by v's sibling (which might be null).

Removing the element in a node w is done by cases (a sketch of both cases appears after this list):

  1. If one of the children of w is an external node z, then remove w and z using removeExternal(z).
  2. If neither child is external, then find the successor z of w in the tree and replace w's element by z's. We can find the successor of w by going to the right subtree of w and then moving left until we hit an external node x; z is x's parent. Since z has no internal left child, we can remove it using removeExternal(x).
This is all fine except that the tree may become unbalanced and the complexity may increase to O(n) rather than O(log n), so we next investigate ways of keeping trees mostly balanced.