Site Search:

hash and uncertainty

next

Hashing a string key

int hash = 0;
for(int i = 0; i < s.length(); i++) hash = (R * hash + s.charAt(i)) %M;

treat string as a huge integer

separate-chaining speed

In a separate-chaining hash table with M lists and N keys, the number of compares (equality tests) for search miss and insert is ~N/M.

hash function can tolerate some non-uniform

converting a hashCode to an array index

private int hash(Key x) {return (x.hashCode() & 0x7fffffff) % M;}

produce integers between 0 and M-1

hashCode for user-defined type

public int hashCode() {
    int hash = 17;
    hash = 31 * hash + field1.hashCode();
    hash = 31 * hash + field2.hashCode();
    return hash;
}

consistent, efficient, and uniformly distribute keys

Load factor

The performance of separate-chaining and open addressing depends on the ration N/M, which is referred to as the load factor of a hash table. For separate-chaining, the load factor is generally larger than 1. For open addressing, the load factor can not be greater than 1, otherwise search goes into an infinite loop in a full table.

1/8 to 1/2 is good load factor range in practice

Memory usage

method                        N items' space usage
separate chaining        ~48N + 32M
linear probing             [32N, 128N]
BST                             ~56N

object overhead + reference (key, value, next*)

Uniform hashing assumption

The hash functions that we use uniformly and independently distribute keys among the integer values between 0 and M-1.

independence is rarely satisfied in practice

linear-probing performance

In a linear-probing hash table with M lists and N = kM keys, the average number of probes (under the uniform hashing assumption) required is ~ 0.5*(1 + 1/(1-k)) and ~ 0.5*(1 + 1/(1-k)-square) for search hits and search misses or inserts, respectively. In particular, when k is 1/2, the formula yields 3/2 and 5/2 respectively, when k is 1/8, the formula yields 15/14 and 113/98.

1/8 to 1/2 is good load factor range in practice

Poisson distribution

In a separate-chaining hash table with M lists and N keys, the probability (under uniform hashing assumption) that the number of keys in a list is within a small constant factor of N/M is extremely close to 1.

power(a * e/t),t)/power(e, t)

amortized analysis

Suppose a hash table is built with array resizing, starting with an empty table. Under uniform hashing distribution assumption, any sequence of t search, insert, and delete symbol-table operations is executed in expected time proportional to t and with memory usage always within a constant factor of the number of keys in the table.

1/8 to 1/2 is good load factor range in practice