**Hash Tables**
*Oct 29*
# Hash Tables (Hashtables, HashMaps, Maps, etc.)
A hash table is like an array, but different. With an array, we use integer indices to get constant-time access to any element. For example,
~~~java linenumbers
String[] array = new String[10];
array[0] = "CS 62";
array[5] = "Data Structures";
System.out.println(array[0] + " " + array[5]);
~~~
We can implement this same functionality with a hash table:
~~~java linenumbers
// Requires: import java.util.HashMap;
HashMap<Integer, String> hashtable = new HashMap<>();
hashtable.put(0, "CS 62");
hashtable.put(5, "Data Structures");
System.out.println(hashtable.get(0) + " " + hashtable.get(5));
~~~
But, with a hash table we can use any `key` (for example a `char` or `String`) to get constant time access to any element:
~~~java linenumbers
HashMap<Character, String> hashtable2 = new HashMap<>();
hashtable2.put('a', "CS 62");
hashtable2.put('b', "Data Structures");
System.out.println(hashtable2.get('a') + " " + hashtable2.get('b'));
~~~
To get an idea of how this works, let's try to replicate the hash table experience using an array:
~~~java linenumbers
int m = 10;
String[] array2 = new String[m];
array2[hash("one", m)] = "CS 62";
array2[hash("five", m)] = "Data Structures";
System.out.println(array2[hash("one", m)] + " " + array2[hash("five", m)]);
~~~
That will work. We just need a function called `hash` that turns a `String` (or any type) into an `int` between `0` and `m - 1`, so that it is always a valid index into the array.
[Here is the code from above.](https://github.com/pomonacs622020fa/LectureCode/tree/master/HashTables)
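The `hash` function itself is not shown in these notes. As a minimal sketch, assuming we lean on Java's built-in `String.hashCode()` and simply wrap it into the range of valid indices, it could look like this (the class name `HashSketch` is made up for illustration):

```java
class HashSketch {
    // Maps a String key to an array index in [0, m).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int hash(String key, int m) {
        return Math.floorMod(key.hashCode(), m);
    }

    public static void main(String[] args) {
        int m = 10;
        String[] array2 = new String[m];
        array2[hash("one", m)] = "CS 62";
        array2[hash("five", m)] = "Data Structures";
        System.out.println(array2[hash("one", m)] + " " + array2[hash("five", m)]);
    }
}
```

With `m = 10`, `"one"` and `"five"` happen to land in different slots, so nothing gets overwritten. That is luck, not design.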
# Collisions
The problem with the last example in the Hash Tables section (above) is that we have no control over what happens when two `String`s produce the same `int` after a call to `hash`. This scenario is known as a **collision**, and it happens **all the time**. Let's say that our array has a capacity of 97 (prime numbers are good; we'll talk more about this in CS 140). How many elements can we insert until we can expect a collision?
*To be a bit more precise, how many elements can we insert before we have greater than or equal to a 50% chance of having experienced a collision?*
*What is the probability of a collision on the first insert?* $0$
*What is the probability of a collision on the second insert?* $\frac{1}{97}$
*What is the probability of a collision on the third insert (assuming the first two did not collide)?* $\frac{2}{97}$
*What is the probability of a collision during the first three insertions?* This question is a bit different. We can have 1 or 2 collisions and we need to account for both. An easier question to answer is the complement (why do I say it is easier to calculate the complement?):
*What is the probability of no collision during the first three insertions?* $\frac{97}{97}\cdot\frac{96}{97}\cdot\frac{95}{97}\approx 97\%$
*So, then what is the probability of a collision during the first three insertions?* $\approx 1-97\%=3\%$
Now back to the original question. *How many elements can we insert until we can expect a collision?*
$$
1-\prod_{i=0}^{I-1} \frac{97-i}{97}\ge 0.5
$$
In the above equation, we would want to solve for $I$ (the number of insertions) where $\Pi$ denotes the product (similar to summations, $\Sigma$, but we're multiplying instead of adding--as in the complement equation above where we ended up with $97\%$).
That is not a very nice equation to solve. Luckily, there is an approximation that is not so bad:
$$
\prod_{i=0}^{x-1} \frac{n-i}{n} \approx e^{\frac{-x(x-1)}{2n}}
$$
Now we can solve for $x$:
$$
1 - e^{\frac{-x(x-1)}{2 \cdot 97}} = 0.5
$$
[Which you can find here.](https://www.wolframalpha.com/input/?i=1+-+e%5E%28-x%28x-1%29%2F%282*97%29%29+%3D+0.5)
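If you would rather work it out by hand, here is a sketch of the algebra (the decimal values are rounded):

$$
\begin{aligned}
e^{\frac{-x(x-1)}{2 \cdot 97}} &= 0.5 \\
x(x-1) &= 2 \cdot 97 \cdot \ln 2 \approx 134.5 \\
x &= \frac{1+\sqrt{1+4\cdot 134.5}}{2} \approx 12.1
\end{aligned}
$$

The approximation itself comes from $1-\frac{i}{n}\approx e^{-i/n}$ when $\frac{i}{n}$ is small. Multiplying those factors adds the exponents $0+1+\cdots+(x-1)=\frac{x(x-1)}{2}$, which is where the $\frac{x(x-1)}{2n}$ comes from. Because $1-\frac{i}{n}\le e^{-i/n}$, the approximation slightly underestimates the chance of a collision.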
**You only need to insert 12 items before you have a greater than 50% chance of a collision!**
Sanity check:
$$
P(C_{11}) = 1 - \frac{97\cdot96\cdot95\cdot94\cdot93\cdot92\cdot91\cdot90\cdot89\cdot88\cdot87}{97^{11}} \approx 1 - 0.555 = 44.5\%
$$
and
$$
P(C_{12}) = 1 - \frac{97\cdot96\cdot95\cdot94\cdot93\cdot92\cdot91\cdot90\cdot89\cdot88\cdot87\cdot86}{97^{12}} \approx 1 - 0.492 = 50.8\%
$$
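The same sanity check can be done in a few lines of code. This is a sketch that assumes every key is equally likely to land in any of the $m$ slots; the class and method names are made up for illustration.

```java
class CollisionOdds {
    // Probability of at least one collision when k keys are hashed
    // uniformly at random into m slots.
    static double collisionProbability(int k, int m) {
        double noCollision = 1.0;
        for (int i = 0; i < k; i++) {
            // The (i+1)-th key must avoid the i slots already taken.
            noCollision *= (double) (m - i) / m;
        }
        return 1.0 - noCollision;
    }

    // Smallest number of insertions with at least a 50% chance of a collision.
    static int firstAtLeastHalf(int m) {
        int k = 0;
        while (collisionProbability(k, m) < 0.5) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        int m = 97;
        System.out.printf("P(C_11) = %.3f%n", collisionProbability(11, m));
        System.out.printf("P(C_12) = %.3f%n", collisionProbability(12, m));
        System.out.println("First k at or above 50%: " + firstAtLeastHalf(m));
    }
}
```

For $m = 97$ this reproduces the two values above (about 44.5% and 50.8%) and reports 12 as the first count at or above 50%.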
So, collisions are unavoidable in practice, and we need to handle them with care.
## Collision Handling
We'll discuss two forms of collision handling:
1. Separate Chaining
2. Open Addressing
    - Linear Probing
    - Double Hashing
![Collision Handling Diagrams](images/2020-10-30-Collisions.jpg)
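To make the first option concrete, here is a minimal sketch of separate chaining, where each slot holds a linked list of every key/value pair that hashed there. It assumes `String` keys and values and a fixed number of slots; `ChainedTable` and `Entry` are names invented for this example, not code from the lecture.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    // A key/value pair stored in a chain.
    private static class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    // One linked list ("chain") per slot; colliding keys share a chain.
    private final List<LinkedList<Entry>> slots = new ArrayList<>();

    ChainedTable(int m) {
        for (int i = 0; i < m; i++) {
            slots.add(new LinkedList<>());
        }
    }

    private LinkedList<Entry> chainFor(String key) {
        return slots.get(Math.floorMod(key.hashCode(), slots.size()));
    }

    // Overwrites the value if the key is already in its chain, otherwise appends.
    void put(String key, String value) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        chainFor(key).add(new Entry(key, value));
    }

    // Returns the value for key, or null if the key is absent.
    String get(String key) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

Even `new ChainedTable(1)`, where every key collides, still returns the right values. It just degrades into a plain linked list search.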
## Deleting Objects
Deleting an object is discussed a bit in the images above, but I'll summarize here:
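1. When searching for an element, we keep following the sequence (linked list or probe sequence) until we find it or run out of places to look (the end of the list, or an empty slot in the probe sequence).
2. So, with open addressing, we cannot simply empty a slot when removing an element. We must leave a marker (a tombstone) indicating that a search must keep looking past it to confirm the element is not in the table.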
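Here is a sketch of how tombstones might look with linear probing. It assumes `String` keys and values and a table that is never resized; `ProbingTable` and `TOMBSTONE` are hypothetical names for this example.

```java
class ProbingTable {
    // Unique marker object; compared by reference, so it never matches a real key.
    private static final String TOMBSTONE = new String("<deleted>");

    private final String[] keys;
    private final String[] values;

    ProbingTable(int m) {
        keys = new String[m];
        values = new String[m];
    }

    // Index of key, or -1 once an empty slot (or the whole table) has been passed.
    private int find(String key) {
        int m = keys.length;
        int start = Math.floorMod(key.hashCode(), m);
        for (int step = 0; step < m; step++) {
            int i = (start + step) % m;
            if (keys[i] == null) return -1;   // truly empty: stop searching
            if (keys[i] != TOMBSTONE && keys[i].equals(key)) return i;
        }
        return -1;
    }

    // First empty or tombstoned slot on the key's probe sequence.
    private int firstFreeSlot(String key) {
        int m = keys.length;
        int start = Math.floorMod(key.hashCode(), m);
        for (int step = 0; step < m; step++) {
            int i = (start + step) % m;
            if (keys[i] == null || keys[i] == TOMBSTONE) return i;
        }
        throw new IllegalStateException("table is full (this sketch never resizes)");
    }

    void put(String key, String value) {
        int i = find(key);
        if (i < 0) {
            i = firstFreeSlot(key);
            keys[i] = key;
        }
        values[i] = value;
    }

    String get(String key) {
        int i = find(key);
        return i < 0 ? null : values[i];
    }

    void remove(String key) {
        int i = find(key);
        if (i >= 0) {
            keys[i] = TOMBSTONE;   // not null: keys further along must stay reachable
            values[i] = null;
        }
    }
}
```

A handy pair for trying this out: `"Aa"` and `"BB"` have the same `hashCode()` in Java, so they always collide. Insert both, remove `"Aa"`, and `get("BB")` still works because the search walks past the tombstone.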
## Re-Hashing
A good rule of thumb is to **increase the size of the hash table ($m$) by a factor of 1.5 whenever the table becomes more than 75% full.** This keeps the expected running time of each operation at $O(1)$ (amortized for insertions, since an occasional insert triggers a resize).
When you increase the size of the table, you must re-insert all items. You can't just copy elements over: an element's slot depends on $m$, so it might end up in the wrong spot and become unfindable.
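A sketch of that rule of thumb, assuming a linear probing table that stores only `String` keys (a set rather than a map, to keep it short). `GrowingSet` and its starting capacity of 8 are arbitrary choices for illustration. It grows just before an insert that would push the table past 75% full.

```java
class GrowingSet {
    private String[] slots = new String[8];
    private int size = 0;

    // Linear probing insert into the given array; returns true if the key was new.
    private static boolean insert(String[] table, String key) {
        int i = Math.floorMod(key.hashCode(), table.length);
        while (table[i] != null) {
            if (table[i].equals(key)) return false;
            i = (i + 1) % table.length;
        }
        table[i] = key;
        return true;
    }

    void add(String key) {
        // Grow first if one more key could push the table past 75% full.
        if (size + 1 > 0.75 * slots.length) {
            rehash((int) (slots.length * 1.5));
        }
        if (insert(slots, key)) size++;
    }

    // Re-inserts every key, because a key's slot depends on the table length.
    private void rehash(int newCapacity) {
        String[] bigger = new String[newCapacity];
        for (String key : slots) {
            if (key != null) insert(bigger, key);
        }
        slots = bigger;
    }

    boolean contains(String key) {
        int i = Math.floorMod(key.hashCode(), slots.length);
        while (slots[i] != null) {
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }

    int capacity() { return slots.length; }
}
```

Because the table never gets more than 75% full, there is always an empty slot, so the probing loops are guaranteed to stop.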
# Hashing in Java
All Java classes inherit a method called `hashCode()`, which returns an `int`. The default implementation returns a value tied to the identity of the instance (it is often described as the memory address, though the JVM is not required to implement it that way). To implement your own hashing functionality, you need to override both `equals()` and `hashCode()`.
If two objects are **equal** according to `equals()`, they must have the same **hash value**.
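As an example, here is a sketch of a small class that overrides both methods. `Course` is a made-up class for illustration; `Objects.hash` is the standard library helper for combining fields.

```java
import java.util.HashMap;
import java.util.Objects;

class Course {
    private final String department;
    private final int number;

    Course(String department, int number) {
        this.department = department;
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Course)) return false;
        Course that = (Course) other;
        return number == that.number && department.equals(that.department);
    }

    // Built from the same fields as equals(), so equal courses hash alike.
    @Override
    public int hashCode() {
        return Objects.hash(department, number);
    }

    public static void main(String[] args) {
        HashMap<Course, String> titles = new HashMap<>();
        titles.put(new Course("CS", 62), "Data Structures");
        // A different instance with equal fields finds the same entry.
        System.out.println(titles.get(new Course("CS", 62)));
    }
}
```

If `hashCode()` were left at its default, the second `new Course("CS", 62)` would almost certainly hash to a different bucket and the lookup would come back `null`, even though the two objects are equal.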
## Walk-Through Video
(I'll admit. This is not my best video work.)
# Hash Functions
An ideal hash function would take any key and produce a number in $[0, n-1]$, where $n$ is the number of slots in the table. Hash functions should
- be deterministic (they always produce the same value for a given input),
- be efficient (they are called during every hash table operation),
- have a uniform distribution in the output (all outputs are equally likely).
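For a feel of what such a function can look like, here is a sketch of a simple polynomial hash for strings. It uses the same base-31 recipe that Java documents for `String.hashCode()`, then reduces the result into $[0, n-1]$. It is deterministic and takes time proportional to the key length; how uniform the output is depends on the keys, so treat this as an illustration rather than a recommendation. `PolynomialHash` is a made-up name.

```java
class PolynomialHash {
    // Treats the characters as digits of a base-31 number, reduced into [0, n).
    static int hash(String key, int n) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);   // int overflow wraps around, which is fine here
        }
        return Math.floorMod(h, n);
    }
}
```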
We'll talk quite a bit about hash functions in Algorithms (CS 140). But for now, I'll leave you with these links that are worth your time:
- [General Purpose Hash Function Algorithms](https://www.partow.net/programming/hashfunctions/idx.html)
- [The State of Hashing Algorithms — The Why, The How, and The Future](https://medium.com/@rauljordan/the-state-of-hashing-algorithms-the-why-the-how-and-the-future-b21d5c0440de)
- [Which hashing algorithm is best for uniqueness and speed?](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed)
# Summary
Hash tables act like an array where you can use any hashable object as the index. In expectation, they have
- Constant time lookups
- Constant time insertions
- Constant time deletions
They are not very good at much else. But a lot of applications only require these operations.