XYZ CODE: hash and uncertainty

Hashing a string key

int hash = 0;
for(int i = 0; i < s.length(); i++) hash = (R * hash + s.charAt(i)) %M;

treat string as a huge integer

separate-chaining speed

In a separate-chaining hash table with M lists and N keys, the number of compares (equality tests) for search miss and insert is ~N/M.

hash function can tolerate some non-uniform

converting a hashCode to an array index

private int hash(Key x) {return (x.hashCode() & 0x7fffffff) % M;}

produce integers between 0 and M-1

hashCode for user-defined type

public int hashCode() {
int hash = 17;
hash = 31 * hash + field1.hashCode();
hash = 31 * hash + field2.hashCode();
return hash;
}

consistent, efficient, and uniformly distribute keys

Load factor

The performance of separate-chaining and open addressing depends on the ration N/M, which is referred to as the load factor of a hash table. For separate-chaining, the load factor is generally larger than 1. For open addressing, the load factor can not be greater than 1, otherwise search goes into an infinite loop in a full table.

1/8 to 1/2 is good load factor range in practice

Memory usage

method N items' space usage
separate chaining ~48N + 32M
linear probing [32N, 128N]
BST ~56N

object overhead + reference (key, value, next*)

Uniform hashing assumption

The hash functions that we use uniformly and independently distribute keys among the integer values between 0 and M-1.

independence is rarely satisfied in practice

linear-probing performance

In a linear-probing hash table with M lists and N = kM keys, the average number of probes (under the uniform hashing assumption) required is ~ 0.5*(1 + 1/(1-k)) and ~ 0.5*(1 + 1/(1-k)-square) for search hits and search misses or inserts, respectively. In particular, when k is 1/2, the formula yields 3/2 and 5/2 respectively, when k is 1/8, the formula yields 15/14 and 113/98.

1/8 to 1/2 is good load factor range in practice

Poisson distribution

In a separate-chaining hash table with M lists and N keys, the probability (under uniform hashing assumption) that the number of keys in a list is within a small constant factor of N/M is extremely close to 1.

power(a * e/t),t)/power(e, t)

amortized analysis

Suppose a hash table is built with array resizing, starting with an empty table. Under uniform hashing distribution assumption, any sequence of t search, insert, and delete symbol-table operations is executed in expected time proportional to t and with memory usage always within a constant factor of the number of keys in the table.

1/8 to 1/2 is good load factor range in practice

XYZ CODE

hash and uncertainty

Hashing a string key

separate-chaining speed

converting a hashCode to an array index

hashCode for user-defined type

Load factor

Memory usage

Uniform hashing assumption

linear-probing performance

Poisson distribution

amortized analysis

No comments:

Post a Comment