Hashing a string key
int hash = 0;
for(int i = 0; i < s.length(); i++) hash = (R * hash + s.charAt(i)) %M;
treat string as a huge integer
separate-chaining speed
In a separate-chaining hash table with M lists and N keys, the number of compares (equality tests) for search miss and insert is ~N/M.
hash function can tolerate some non-uniform
converting a hashCode to an array index
private int hash(Key x) {return (x.hashCode() & 0x7fffffff) % M;}
produce integers between 0 and M-1
hashCode for user-defined type
public int hashCode() {
int hash = 17;
hash = 31 * hash + field1.hashCode();
hash = 31 * hash + field2.hashCode();
return hash;
}
consistent, efficient, and uniformly distribute keys
Load factor
The performance of separate-chaining and open addressing depends on the ration N/M, which is referred to as the load factor of a hash table. For separate-chaining, the load factor is generally larger than 1. For open addressing, the load factor can not be greater than 1, otherwise search goes into an infinite loop in a full table.
1/8 to 1/2 is good load factor range in practice
Memory usage
method N items' space usage
separate chaining ~48N + 32M
linear probing [32N, 128N]
BST ~56N
object overhead + reference (key, value, next*)
Uniform hashing assumption
The hash functions that we use uniformly and independently distribute keys among the integer values between 0 and M-1.
independence is rarely satisfied in practice
linear-probing performance
In a linear-probing hash table with M lists and N = kM keys, the average number of probes (under the uniform hashing assumption) required is ~ 0.5*(1 + 1/(1-k)) and ~ 0.5*(1 + 1/(1-k)-square) for search hits and search misses or inserts, respectively. In particular, when k is 1/2, the formula yields 3/2 and 5/2 respectively, when k is 1/8, the formula yields 15/14 and 113/98.
1/8 to 1/2 is good load factor range in practice
Poisson distribution
In a separate-chaining hash table with M lists and N keys, the probability (under uniform hashing assumption) that the number of keys in a list is within a small constant factor of N/M is extremely close to 1.
power(a * e/t),t)/power(e, t)
amortized analysis
Suppose a hash table is built with array resizing, starting with an empty table. Under uniform hashing distribution assumption, any sequence of t search, insert, and delete symbol-table operations is executed in expected time proportional to t and with memory usage always within a constant factor of the number of keys in the table.
1/8 to 1/2 is good load factor range in practice