Can somebody explain the following:

image
(source: fbcdn.net)

U is a universe of keys, and H is a finite collection of hash functions mapping U to {0, 1, … , m-1}.

I do not understand definition 2, and therefore why the number of functions that map x and y to the same location is given by |H|/m.

Glorfindel
coolchock

2 Answers

Let’s begin by talking about the intuition for universal hash families. Intuitively, a family of hash functions is universal if for any distinct objects x and y that you’d like to hash, if you select a random hash function from the hash family, the probability that you get a collision between those two elements is at most 1/m, where m is the number of buckets. In other words, universal hash families tend to spread elements out in a way where the probability of a pair colliding is the same as if the elements were distributed randomly.

Let’s see how the definition accomplishes this. Here’s the definition from your question:

H is universal if ∀ x, y ∈ U where x ≠ y, | { h in H : h(x) = h(y) } | = |H| / m.

For starters, I’m assuming that we’re talking about hash functions that map from some set U to the integers 1 through m or 0 through m-1. With that in mind, let’s unpack some of this notation.

If we replace the universal quantifier (∀) with the plain English “for all,” this definition says “H is universal if for every choice of two different items x and y to hash, some condition is true.” So let’s look at that condition. First, what is this bit?

| { h in H : h(x) = h(y) } |

The vertical bars here represent the size of a set, and the set in question is this one:

{ h in H : h(x) = h(y) }

Read literally, this is the set of all hash functions h in the family H where h(x) = h(y). Keeping in mind that we’re talking about hash collisions here, we can think of h(x) = h(y) as saying that hash function h causes x and y to collide with one another (have the same hash code). With that in mind, the complex expression

| { h in H : h(x) = h(y) } |

means “the number of hash functions in H where x and y collide.” Combining that with our earlier bit, we can rewrite the entire definition as

H is universal if for any two distinct elements x and y that we want to hash, the number of hash functions in H where x and y collide is |H| / m (many textbooks relax this to “at most |H| / m”).

So what’s |H| / m? That’s the total number of hash functions (|H|) divided by the number of possible outputs for any one single hash function (m). It might help to divide the entire expression through by |H|, which would then give this final definition:

H is universal if for any two distinct elements x and y that we want to hash, the probability of picking a hash function h where x and y collide (that is, the total number of hash functions where x and y collide divided by the total number of hash functions) is 1/m (or at most 1/m, if the definition is stated with ≤ instead of =).
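To see the collision bound on something concrete, here is a minimal brute-force sketch. It is not taken from the question: it assumes the well-known family h(x) = ((a·x + b) mod p) mod m with p prime, 1 ≤ a < p and 0 ≤ b < p, over the universe {0, …, p-1}, and the values p = 7 and m = 3 as well as the helper name `collision_counts` are purely illustrative. For every pair of distinct keys it counts how many functions in the family make them collide and compares that count with |H| / m.

```python
from fractions import Fraction
from itertools import combinations


def collision_counts(p, m):
    """Count, for each pair x < y in {0..p-1}, the functions that collide on it.

    The family is h_ab(x) = ((a*x + b) % p) % m with 1 <= a < p, 0 <= b < p.
    This choice of family is an assumption made for illustration.
    """
    family = [(a, b) for a in range(1, p) for b in range(p)]

    def h(a, b, x):
        return ((a * x + b) % p) % m

    counts = {}
    for x, y in combinations(range(p), 2):
        counts[(x, y)] = sum(1 for a, b in family if h(a, b, x) == h(a, b, y))
    return len(family), counts


size, counts = collision_counts(p=7, m=3)
worst = max(counts.values())

# |H| = 42, so the bound |H| / m is 14; the worst pair collides under 10 functions.
print(size, worst, Fraction(worst, size))  # 42 10 5/21
```

Here the worst-case collision probability is 5/21, which is below 1/3, so this particular family satisfies the “at most 1/m” form of the condition rather than hitting 1/m exactly.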

Hope this helps!

templatetypedef
  • Yes, it helps a lot, but I still do not understand why the number of functions where h(x) = h(y) is equal to |H|/m. –  Jun 15 '19 at 23:10
  • Think of it as a probability. Divide both sides by |H| to get a left-hand side of “the fraction of functions where x and y collide,” which we can interpret as “the probability of a collision between x and y if you sample a random hash function.” Now, what would that quantity be if the functions are truly random? If there are m possible choices for a hash code for x and m choices for a hash code for y, there are m² equally likely pairs of hash codes, and in m of them the two codes match, so the collision probability would be m/m² = 1/m - the chance that they end up in the same slot. So for universal hashing, we want that same probability, 1/m, which is the right-hand side. – templatetypedef Jun 15 '19 at 23:39
  • it's all clear now, thank you very much for your professional help! – coolchock Jun 16 '19 at 00:05

As you found, it is just a definition, so the useful question is what the definition means. It says that the set of functions H is universal if, for every two different members of U, the number of hash functions in H that map both members to the same value is exactly the same, and that number is |H|/m.

The intuition behind this definition comes from how it is applied. If a set of functions is universal (that is, it satisfies the definition), then the probability that a function chosen uniformly at random from H takes the same value on two different members of U is 1/m. In other words, this probability depends only on m (the size of the range of those hash functions), not on which two members were picked.
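The “exactly |H|/m” count can be checked by hand on a tiny case. The sketch below is only an illustration under stated assumptions: it takes H to be the set of all functions from a three-key universe to {0, 1}, which is one family that meets the definition with equality, and the keys `"a"`, `"b"`, `"c"` and the helper name `count_colliding` are made up for the example.

```python
from itertools import combinations, product


def count_colliding(universe, m):
    """Return |H| and, per pair of distinct keys, how many h in H collide on it.

    H is assumed to be every function from `universe` to {0..m-1}; each
    function is stored as a tuple holding one output per key.
    """
    family = list(product(range(m), repeat=len(universe)))
    index = {key: i for i, key in enumerate(universe)}
    counts = {
        (x, y): sum(1 for h in family if h[index[x]] == h[index[y]])
        for x, y in combinations(universe, 2)
    }
    return len(family), counts


total, per_pair = count_colliding(universe=["a", "b", "c"], m=2)

# There are 2**3 = 8 functions, and every pair collides under 8 / 2 = 4 of them.
print(total, per_pair)
```

Every pair gets the same count because a colliding function is fixed by choosing the shared value (m ways) and then the values of the remaining keys, which gives |H|/m functions regardless of which pair was picked.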

OmG