**Hash Tables**
*Oct 29*

# Hash Tables (Hashtables, HashMaps, Maps, etc.)

Like an array, but different. With an array, we use numbers to get constant-time access to any element. For example,

~~~java linenumbers
String[] array = new String[10];
array[0] = "CS 62";
array[5] = "Data Structures";
System.out.println(array[0] + " " + array[5]);
~~~

We can implement this same functionality with a hash table:

~~~java linenumbers
// Requires: import java.util.HashMap;
HashMap<Integer, String> hashtable = new HashMap<>();
hashtable.put(0, "CS 62");
hashtable.put(5, "Data Structures");
System.out.println(hashtable.get(0) + " " + hashtable.get(5));
~~~

But with a hash table we can use any `key` (for example, a `char` or a `String`) to get constant-time access to any element:

~~~java linenumbers
HashMap<Character, String> hashtable2 = new HashMap<>();
hashtable2.put('a', "CS 62");
hashtable2.put('b', "Data Structures");
System.out.println(hashtable2.get('a') + " " + hashtable2.get('b'));
~~~

To get an idea of how this works, let's try to replicate the hash table experience using an array:

~~~java linenumbers
int m = 10;
String[] array2 = new String[m];
array2[hash("one", m)] = "CS 62";
array2[hash("five", m)] = "Data Structures";
System.out.println(array2[hash("one", m)] + " " + array2[hash("five", m)]);
~~~

That will work. We just need a function called `hash` that turns a `String` (or any type) into an `int` between `0` and `m - 1`.

[Here is the code from above.](https://github.com/pomonacs622020fa/LectureCode/tree/master/HashTables)

# Collisions

The problem with the last example in the Hash Tables section above is that we don't have any control over what happens when two `String`s result in the same `int` after a call to `hash`. This scenario is known as a **collision**, and it happens **all the time**.

Let's say that our array has a capacity of 97 (prime numbers are good--we'll talk more about this in CS 140). How many elements can we insert until we can expect a collision?
*To be a bit more precise, how many elements can we insert before we have greater than or equal to a 50% chance of having experienced a collision?*

*What is the probability of a collision on the first insert?* $0$

*What is the probability of a collision on the second insert?* $\frac{1}{97}$

*What is the probability of a collision on the third insert (given no collision so far)?* $\frac{2}{97}$

*What is the probability of a collision during the first three insertions?*

This question is a bit different. We can have 1 or 2 collisions and we need to account for both. An easier question to answer is the complement (why do I say it is easier to calculate the complement?):

*What is the probability of no collision during the first three insertions?*

$\frac{97}{97}\cdot\frac{96}{97}\cdot\frac{95}{97}\approx 97\%$

*So, then what is the probability of a collision during the first three insertions?*

$1-97\%=3\%$

Now back to the original question. *How many elements can we insert until we can expect a collision?*

$$ 1-\prod_{i=1}^{I-1} \frac{97-i}{97}\ge 0.5 $$

In the above equation, we want to solve for $I$ (the number of insertions), where $\Pi$ denotes the product (similar to summations, $\Sigma$, but we're multiplying instead of adding--as in the complement equation above where we ended up with $97\%$). The product stops at $i = I-1$ because the first insertion can never collide; for $I = 3$ it gives exactly the $\frac{96}{97}\cdot\frac{95}{97}$ from above. That is not a very nice equation to solve.
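The inequality is awkward to solve by hand, but easy to evaluate numerically. As a minimal sketch (the class and method names here are hypothetical, not part of the lecture code, and it assumes every key is equally likely to land in any slot), we can multiply the no-collision probabilities one insertion at a time until the collision probability crosses the threshold:

~~~java linenumbers
public class CollisionOdds {
    // Smallest number of insertions into a table of m slots for which the
    // probability of at least one collision is >= threshold.
    // Assumes a uniform hash: each key is equally likely to hit any slot.
    static int insertionsUntilLikelyCollision(int m, double threshold) {
        if (m <= 0 || threshold <= 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("need m > 0 and 0 < threshold <= 1");
        }
        double noCollision = 1.0; // probability that all insertions so far missed each other
        int inserted = 0;
        while (1.0 - noCollision < threshold) {
            // The next key must avoid the `inserted` slots that are already taken.
            noCollision *= (double) (m - inserted) / m;
            inserted++;
        }
        return inserted;
    }

    public static void main(String[] args) {
        System.out.println(insertionsUntilLikelyCollision(97, 0.5));
    }
}
~~~

A loop like this answers the question for one particular table size by brute force, but it does not give a formula relating the number of insertions to the table size.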
Luckily, there is an approximation that is not so bad:

$$ \prod_{i=1}^{x-1} \frac{n-i}{n} \approx e^{\frac{-x(x-1)}{2n}} $$

Now we can solve for $x$ with $n = 97$:

$$ 1 - e^{\frac{-x(x-1)}{2 \cdot 97}} = 0.5 $$

[Which you can find here.](https://www.wolframalpha.com/input/?i=1+-+e%5E%28-x%28x-1%29%2F%282*97%29%29+%3D+0.5) The positive solution is $x \approx 12.1$.

**You only need to insert 12 items before you have a greater than 50% chance of a collision!**

Sanity check:

$$ P(C_{11}) = 1 - \frac{97\cdot96\cdot95\cdot94\cdot93\cdot92\cdot91\cdot90\cdot89\cdot88\cdot87}{97^{11}} \approx 1 - 0.555 = 44.5\% $$

and

$$ P(C_{12}) = 1 - \frac{97\cdot96\cdot95\cdot94\cdot93\cdot92\cdot91\cdot90\cdot89\cdot88\cdot87\cdot86}{97^{12}} \approx 1 - 0.492 = 50.8\% $$

Collisions are this common, so we need to handle them with care.

## Collision Handling

We'll discuss two forms of collision handling:

1. Separate Chaining
2. Open Addressing
    - Linear Probing
    - Double Hashing

![Collision Handling Diagrams](images/2020-10-30-Collisions.jpg)

## Deleting Objects

Deleting an object is discussed a bit in the images above, but I'll summarize here:

1. When searching for an element, we keep following the sequence (linked list or probe sequence) until we hit an empty spot.
2. So, with open addressing, we cannot simply empty a slot when we remove an element: we must leave a marker (a tombstone) indicating that a search must keep looking to confirm the element is not in the table. (With separate chaining, the node can just be unlinked from its list.)

## Re-Hashing

A good rule of thumb is to **increase the size of the hash table ($m$) by a factor of 1.5 whenever the table becomes more than 75% full.** This keeps the expected (amortized) running time of all operations at $O(1)$.

When you increase the size of the table, you must re-insert all items. You can't just copy elements over: an item's position depends on $m$, so after resizing it might end up in the wrong spot and become unfindable.

# Hashing in Java

All Java classes inherit a method called `hashCode()`, which returns an integer. The default implementation returns a value tied to the identity of the instance (commonly described as its memory address, though the exact scheme depends on the JVM), so two distinct objects with the same contents will generally get different hash codes. To implement your own hashing functionality, you need to override both `equals()` and `hashCode()`.
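As a minimal sketch of what overriding both methods can look like, here is a hypothetical `Course` class (its name and fields are made up for illustration and are not part of the lecture code). It treats two courses as equal when their department and number match, and builds the hash code from the same two fields using `Objects.hash`:

~~~java linenumbers
import java.util.HashMap;
import java.util.Objects;

// Hypothetical key class: two Course objects are "the same" if their fields match.
public class Course {
    private final String department;
    private final int number;

    public Course(String department, int number) {
        this.department = department;
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Course)) return false;
        Course that = (Course) other;
        return number == that.number && Objects.equals(department, that.department);
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals() compares.
        return Objects.hash(department, number);
    }

    public static void main(String[] args) {
        HashMap<Course, String> titles = new HashMap<>();
        titles.put(new Course("CS", 62), "Data Structures");
        // A different instance with the same fields finds the entry.
        System.out.println(titles.get(new Course("CS", 62))); // prints Data Structures
    }
}
~~~

If `hashCode()` were left at its default here, the lookup in `main` would most likely return `null`, because the second `Course` instance would hash to a different bucket than the first.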
If objects are **equal** they should have the same **hash value**.

## Walk-Through Video

(I'll admit. This is not my best video work.)

# Hash Functions

An ideal hash function would take any key and produce a number in $[0, m-1]$, where $m$ is the size of the table. Hash functions should

- be deterministic (they always produce the same value for a given input),
- be efficient (they are called during every hash table operation),
- have a uniform distribution in the output (all outputs are equally likely).

We'll talk quite a bit about hash functions in Algorithms (CS 140). But for now, I'll leave you with these links that are worth your time:

- [General Purpose Hash Function Algorithms](https://www.partow.net/programming/hashfunctions/idx.html)
- [The State of Hashing Algorithms — The Why, The How, and The Future](https://medium.com/@rauljordan/the-state-of-hashing-algorithms-the-why-the-how-and-the-future-b21d5c0440de)
- [Which hashing algorithm is best for uniqueness and speed?](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed)

# Summary

Hash tables act like an array where you can use any hashable object as the index. They have

- Expected constant-time lookups
- Expected constant-time insertions
- Expected constant-time deletions

They are not very good at much else. But a lot of applications only require these operations.
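
As a closing sketch, here is one possible way to write the `hash(key, m)` helper that the array example near the top of these notes relied on. This is an assumption for illustration, not necessarily what the linked lecture code does: it reuses Java's built-in `String.hashCode()` and folds the result into the range of valid indices. The class name `HashSketch` is hypothetical.

~~~java linenumbers
public class HashSketch {
    // Hypothetical helper: maps a String key to an array index in [0, m - 1].
    static int hash(String key, int m) {
        // hashCode() can be negative, and % would then give a negative index,
        // so Math.floorMod is used to always land in [0, m - 1].
        return Math.floorMod(key.hashCode(), m);
    }

    public static void main(String[] args) {
        int m = 10;
        String[] table = new String[m];
        table[hash("one", m)] = "CS 62";
        System.out.println(table[hash("one", m)]); // prints CS 62
    }
}
~~~

The helper is deterministic and cheap, which covers two of the three properties listed under Hash Functions. It does nothing about collisions: two different keys can still map to the same index and overwrite each other in a plain array.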