Cormen et al.'s "Introduction to Algorithms" says the following about the division method hash function $h(k)=k \text{ mod } m$:

A prime not too close to an exact power of 2 is often a good choice for $m$. For example, suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly $n=2000$ character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, so we allocate a table of size $m=701$. The number $701$ is chosen because it is a prime near $2000/3$ but not near any power of 2.

The book, however, does not explain what it means by "near". Is $701$ preferable to, say, $661$ because $661-512<701-512$? I do not understand how this is relevant to the modulo function involved in the calculation of $h(k)$.

hspil

2 Answers

If a string of 8-bit bytes is interpreted as a large integer and you calculate its value modulo $2^{24}-1$, then, because $2^{24} \equiv 1 \pmod{2^{24}-1}$, the first, fourth, seventh byte, etc. all carry the same weight. Swapping any of those bytes leaves the hash unchanged, which means the hash will not be well distributed.
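A small sketch of this effect, assuming the key is read as a big-endian base-256 integer (the helper name `key_to_int` and the sample strings are made up for illustration). Two keys that differ only by swapping their first and fourth bytes collide modulo $2^{24}-1$ but not modulo $701$:

```python
def key_to_int(s: bytes) -> int:
    """Interpret a byte string as one large base-256 integer (big-endian)."""
    return int.from_bytes(s, "big")

M_BAD = 2**24 - 1   # equals 256**3 - 1, so 256**3 is congruent to 1 mod M_BAD
M_GOOD = 701        # the prime from the textbook example

key_a = b"abcdefg"
key_b = b"dbcaefg"  # same as key_a with the 1st and 4th bytes swapped

# The swapped bytes have weights 256**6 and 256**3, both congruent to 1 mod M_BAD,
# so the swap cannot change the remainder.
bad_a, bad_b = key_to_int(key_a) % M_BAD, key_to_int(key_b) % M_BAD
good_a, good_b = key_to_int(key_a) % M_GOOD, key_to_int(key_b) % M_GOOD

print(bad_a == bad_b)    # the two keys collide
print(good_a == good_b)  # the two keys are separated
```

The same collision appears for any pair of byte positions that are a multiple of three apart.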

gnasher729

This answer may explain why choosing the table size $M$ equal to a power of 2 should be avoided.

Prime numbers that are too close to a power of 2 produce the same kind of bias as a power of 2 itself: this happens when $2^k \equiv \pm a \pmod{M}$ for small $k$ and $a$.
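To make this concrete, here is a minimal sketch with $M = 257 = 2^8 + 1$, a prime for which $2^8 \equiv -1 \pmod{M}$. It assumes keys are read as big-endian base-256 integers; the function name `division_hash` and the sample keys are chosen only for illustration. A two-byte key $(c_1, c_2)$ then hashes to $(c_2 - c_1) \bmod 257$, so every pair of consecutive letters lands in the same slot:

```python
def division_hash(key: bytes, m: int) -> int:
    """Division-method hash of a byte string read as a base-256 integer."""
    return int.from_bytes(key, "big") % m

keys = [b"ab", b"bc", b"cd", b"de"]

# 257 = 2**8 + 1, so 256 is congruent to -1 mod 257:
# the hash of a two-byte key collapses to (second byte - first byte) mod 257.
near_power_of_two = [division_hash(k, 257) for k in keys]

# 701 is the prime from the textbook example, not close to a power of 2.
far_from_power_of_two = [division_hash(k, 701) for k in keys]

print(near_power_of_two)      # all four keys share one slot
print(far_from_power_of_two)  # four different slots
```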

In the division method we simply use the remainder modulo $M$: $h(K) = K \bmod M$.

In this case, some values of $M$ are obviously much better than others.

Case 1: If $M$ is an even number, $h(K)$ will be even when $K$ is even and odd when $K$ is odd, and this will lead to a substantial bias in many files.

Case 2: It would be even worse to let M be a power of 2 (more generally the radix of the computer), since K mod M would then be simply the least significant digits of K (independent of the other digits).

Case 3: Similarly, we can argue that M probably shouldn't be a multiple of 3: if the keys are alphabetic, two keys that differ only by a permutation of letters would then differ in numeric value by a multiple of 3. (This occurs because $2^{2n} \bmod 3 = 1$ and $10^n \bmod 3 = 1$.)

Case 4: In general, we want to avoid values of M that divide $r^k+a$ or $r^k-a$, where k and a are small numbers and r is the radix of the alphabetic character set (usually $r=64$, $256$ or $100$), since a remainder modulo such a value of M tends to be largely a simple superposition of key digits. Such considerations suggest that we choose M to be a prime number such that $r^k \not\equiv \pm a \pmod{M}$ for small k and a.

--Donald E. Knuth (The Art of Computer Programming, Vol. 3)
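The cases above can be checked mechanically. The following is a rough sketch, not taken from Knuth: the function name `small_congruences` and the bounds `max_k` and `max_a` are assumptions of mine about what counts as "small". It lists every $k$ for which $r^k \bmod M$ is within `max_a` of $0$, reported as a signed value $a$ so that $r^k \equiv a \pmod{M}$:

```python
def small_congruences(m: int, r: int = 256, max_k: int = 4, max_a: int = 8):
    """Return (k, a) pairs with r**k congruent to a (mod m) and |a| <= max_a.

    a == 0 means m divides r**k; the bounds are illustrative assumptions.
    """
    hits = []
    for k in range(1, max_k + 1):
        residue = pow(r, k, m)
        # Map the residue to a signed value so that m - 1 is reported as -1.
        signed = residue - m if residue > m // 2 else residue
        if abs(signed) <= max_a:
            hits.append((k, signed))
    return hits

print(small_congruences(701))  # no small congruences for k <= 4
print(small_congruences(257))  # 256**k alternates between -1 and +1
print(small_congruences(3))    # 256**k is always 1 mod 3 (Case 3)

# Case 2: with M = 256 only the last byte of the key matters.
last_byte_only = int.from_bytes(b"xyzA", "big") % 256 == ord("A")

# Case 3: anagrams agree modulo 3 because 256 is congruent to 1 mod 3.
anagrams_agree = (int.from_bytes(b"listen", "big") % 3
                  == int.from_bytes(b"silent", "big") % 3)
```

With these bounds, $701$ produces no hits for radix 256, whereas $257$ and $3$ are flagged at every $k$. A different choice of bounds or radix could of course give a different verdict.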

Lavlesh Mishra