
I'm reading about the MinHash technique for estimating the similarity between two sets. Given sets $A$ and $B$, let $h$ be a hash function and let $h_\min(S)$ be the minimum hash of a set $S$, i.e. $h_\min(S) = \min_{s \in S} h(s)$. We have the equation: $$ p(h_\min(A) = h_\min(B)) = \frac{|A \cap B|}{|A \cup B|} $$ which means that the probability that the minimum hash of $A$ equals the minimum hash of $B$ is the Jaccard similarity of $A$ and $B$.

I am trying to prove the above equation and came up with the following argument: let $a \in A$ and $b \in B$ be such that $h(a) = h_\min(A)$ and $h(b) = h_\min(B)$. If $h_\min(A) = h_\min(B)$, then $h(a) = h(b)$. Assume that the hash function $h$ maps distinct keys to distinct hash values, so $h(a) = h(b)$ if and only if $a = b$, which means the probability is $\frac{|A \cap B|}{|A \cup B|}$. However, my proof is incomplete, since a hash function can return the same value for different keys. So I'm asking for help finding a proof that applies regardless of the hash function.

Raphael
Long Thai
  • It often pays off to take ridiculous corner cases: If you take $h(a) = 1$ for all $a$, the result is clearly false; while it is true if the hash function is perfect (returns different values for each $a$). Thus the probability of a collision plays a rôle here. – vonbrand Apr 12 '13 at 13:56

1 Answer


Min-hash functions are not just (standard) hash functions, but a family of functions $\mathcal{H}$ such that, if you pick a function $h\leftarrow \mathcal{H}$ uniformly at random from the family, the probability over that choice satisfies the equation in your question.

Check "Min-wise independent permutations" by Broder, Charikar, Frieze and Mitzenmacher, 2000. They define this notion and analyze it in a way that is easy to follow.
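For orientation, here is a sketch of the standard argument in the question's notation. The symbols $X$ and $x$ are introduced only for this illustration, and every $h$ in the family is assumed to be injective on $A \cup B$ (true for permutations). A family $\mathcal{H}$ is min-wise independent if, for every set $X$ and every $x \in X$, $$\Pr_{h\leftarrow \mathcal{H}}[h(x) = h_\min(X)] = \frac{1}{|X|}.$$ Take $X = A \cup B$. Exactly one element $x$ attains the minimum over $A \cup B$, and $h_\min(A) = h_\min(B)$ holds exactly when that $x$ lies in $A \cap B$, so $$\Pr_{h\leftarrow \mathcal{H}}[h_\min(A) = h_\min(B)] = \sum_{x \in A \cap B} \Pr_{h\leftarrow \mathcal{H}}[h(x) = h_\min(A \cup B)] = \frac{|A \cap B|}{|A \cup B|}.$$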

From a practical point of view, you usually have a bound on the size of the sets you use (say $2^{256}$, probably more than the number of atoms in the universe). Then a 256-bit output is enough to have hash functions that are permutations (and clearly, the set of all permutations is a good family of min-hash functions). So the problem is not the output length, but how many functions the min-hash family must contain so that, if you pick one at random, every element of a set $S$ is the minimum with probability exactly $1/|S|$.
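The claim that the set of all permutations works can be checked by brute force on a tiny universe. The following is only an illustrative sketch: the function names, the example sets `A` and `B`, and the six-element universe `U` are assumptions chosen so that all $6! = 720$ permutations can be enumerated.

```python
from fractions import Fraction
from itertools import permutations


def minhash_collision_probability(a, b, universe):
    """Exact Pr[h_min(a) == h_min(b)] for h drawn uniformly from all
    permutations of `universe` (only feasible for tiny universes)."""
    universe = list(universe)
    hits = 0
    total = 0
    for ranks in permutations(range(len(universe))):
        # h maps each element of the universe to a distinct rank.
        h = dict(zip(universe, ranks))
        if min(h[x] for x in a) == min(h[x] for x in b):
            hits += 1
        total += 1
    return Fraction(hits, total)


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b|."""
    return Fraction(len(a & b), len(a | b))


A = {1, 2, 3, 4}
B = {3, 4, 5}
U = range(1, 7)  # hypothetical universe of 6 elements

print(minhash_collision_probability(A, B, U), jaccard(A, B))  # prints "2/5 2/5"
```

Because every permutation is counted once, the result is an exact fraction rather than an estimate, so it can be compared to the Jaccard similarity with `==`.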

To overcome the huge set of hash functions (there are $(2^{256})! \gg 2^{256}$ permutations of the $2^{256}$ possible 256-bit values), they relax the requirement on the probability and obtain an approximate notion of $\epsilon$-min-hash, which guarantees that $$\Pr_{h\leftarrow {\cal H}}[h_\min(A)=h_\min(B)] = \frac{|A\cap B|}{|A\cup B|}\pm \epsilon$$ which is more useful from an efficiency perspective (the set $\mathcal{H}$ is smaller and easier to sample from).
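To give a feel for what a small, easy-to-sample family looks like in practice, here is a minimal sketch that uses random linear functions $x \mapsto (ax + b) \bmod P$ as a stand-in. This is an assumption made for illustration, not the construction from the paper: linear functions are not exactly min-wise independent, and no value of $\epsilon$ is claimed for them here. The prime `P`, the helper names, and the example sets are all hypothetical.

```python
import random

P = 2_147_483_647  # prime 2**31 - 1; assumed larger than every set element


def make_hashes(k, seed=0):
    """Draw k random linear functions x -> (a*x + b) % P, stored as (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def signature(s, hashes):
    """Min-hash signature of s: one minimum per sampled function."""
    return [min((a * x + b) % P for x in s) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of functions whose minima agree for both sets."""
    matches = sum(u == v for u, v in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Two random sets of 1000 elements each that share exactly 500 elements.
rng = random.Random(1)
items = rng.sample(range(1_000_000), 1500)
A = set(items[:1000])
B = set(items[500:])

hashes = make_hashes(400)
est = estimate_jaccard(signature(A, hashes), signature(B, hashes))
true = len(A & B) / len(A | B)  # 500 / 1500
print(f"estimate={est:.3f} exact={true:.3f}")
```

Each sampled function costs two integers to store, versus a full permutation table for a truly random permutation, which is the trade-off the relaxed definition is meant to allow.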

Ran G.