Hash functions and pathological data sets

Question

So I'm taking an Algorithms course on Coursera, and we are currently discussing hash tables. The instructor is talking about the importance of a good hash function, and about how an ideal hash function would be a "super clever hash function guaranteed to spread every data set evenly".

Then, he explains that the problem is that such a hash function does not exist (and that for every hash function there is a pathological data set), and that the reason for this is as follows:

Fix a hash function $h: U \to \{0, 1, 2, \ldots, n-1\}$. By the Pigeonhole Principle, there exists a bucket $i$ such that at least $|U|/n$ elements of $U$ hash to $i$ under $h$. If a data set draws only from these, everything collides.

The bolded part is what's confusing me. Why does there exist a bucket $i$ such that at least $|U|/n$ elements of $U$ hash to $i$ under $h$? I can't really visualize what he means.

Because the pigeonhole principle says exactly that. Did you look it up? — David Richerby, Nov 04 '16 at 11:49
It's easier to understand a more concrete example: You have, say, 5 buckets, and you need to stick, say, 7 pigeons in them somehow. 7 > 5, so it follows that at least 1 bucket has at least 2 pigeons in it. — j_random_hacker, Nov 04 '16 at 15:44
@DavidRicherby well, the Pigeonhole Principle as I understood it is what the first sentence in the wikipedia link says: "In mathematics, the pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item." I didn't immediately see the equivalence between that and the bolded statement that confused me. Weirdly enough I was able to get useful answers out of asking this question :) — FrostyStraw, Nov 04 '16 at 18:43
OK but if you read as far as the third page of the wikipedia article, it tells you that, if you put more than $km$ items in $m$ buckets, at least one must contain more than $k$ items, which is exactly what's being used here. — David Richerby, Nov 05 '16 at 00:08
The answers posted here were more clear and therefore more helpful to me. — FrostyStraw, Nov 05 '16 at 02:21

Mario Cervera · Accepted Answer · 2016-11-04T14:28:50.003

An easy way to visualize this is to imagine a hash table of size $n$ (implemented with chaining) that contains all of the elements of $U$ (even though this is unrealistic in practice because $U$ is typically enormous). Since $|U| \gg n$, the elements of $U$ cannot all go into separate buckets; therefore, there will be collisions. Consider, for example, the universe $U=\{a,b,c,d,e,f,g\}$ and a hash table with $n=3$ buckets. Since $|U|=7$, at least one bucket must contain $\lceil |U| / n \rceil = \lceil 7/3 \rceil = 3$ or more elements. Even with the most clever hash function (one that spreads the elements of $U$ as evenly as possible), this bucket would contain exactly $3$ elements.

[Figure: the elements of $U$ distributed over the three buckets, with the three-element bucket highlighted in red]

It is important to see that no matter how clever the hash function is, there will always exist a data set (for example, the set $\{b,g,a\}$) whose elements hash to the same bucket (for example, bucket number $1$). Such a pathological data set will make your hash table degenerate to its worst-case linear-time performance.
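As a rough sketch of the idea, the snippet below groups a small universe by bucket and returns the fullest bucket, whose contents form a pathological data set. The names `fullest_bucket` and `toy_hash` are made up for this illustration, and `toy_hash` is an arbitrary stand-in, so the bucket it fills differs from the $\{b,g,a\}$ example above.

```python
from collections import defaultdict
from math import ceil


def fullest_bucket(universe, h, n):
    """Group every element of `universe` by bucket h(x) % n and
    return (bucket index, elements) for the largest group."""
    buckets = defaultdict(list)
    for x in universe:
        buckets[h(x) % n].append(x)
    index = max(buckets, key=lambda i: len(buckets[i]))
    return index, buckets[index]


def toy_hash(x):
    # Hypothetical hash function, chosen only for illustration:
    # the letter's position in the alphabet.
    return ord(x) - ord("a")


universe = list("abcdefg")  # |U| = 7
n = 3                       # number of buckets

index, pathological = fullest_bucket(universe, toy_hash, n)

# The pigeonhole bound: the fullest bucket holds at least ceil(|U| / n) elements.
print(index, pathological, len(pathological) >= ceil(len(universe) / n))
```

Swapping in any other function for `toy_hash` changes which elements end up together, but the fullest bucket never drops below $\lceil |U|/n \rceil$ elements, and inserting only those elements makes every insertion collide.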

score 4 · Answer 2 · answered Nov 04 '16 at 07:50

Assume there is no such bucket. Then each bucket has fewer than $|U|/n$ items. There are $n$ buckets, so the total number of items is less than $n \cdot (|U|/n) = |U|$. But $|U|$ is the number of items we distributed to the buckets in the first place. This is a contradiction, so the statement "each bucket has fewer than $|U|/n$ items" is false, which is equivalent to the statement "some bucket has at least $|U|/n$ items."
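The same counting can also be written directly as one chain of inequalities. Here $b_j$ is notation introduced only for this sketch: the number of elements of $U$ that hash to bucket $j$.

$$|U| \;=\; \sum_{j=0}^{n-1} b_j \;\le\; n \cdot \max_{j} b_j \quad\Longrightarrow\quad \max_{j} b_j \;\ge\; \frac{|U|}{n}.$$

Because each $b_j$ is an integer, the bound tightens to $\max_j b_j \ge \lceil |U|/n \rceil$. With $|U| = 7$ and $n = 3$, this gives $\lceil 7/3 \rceil = 3$.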
