
Say I have a hash table of size $m$, with collision handled by chaining. Assume the hash function hashes uniformly, so every key has probability of $\frac{1}{m}$ of being hashed to any slot in the table. I insert $n$ keys into the table. What is the probability that the longest chain in the table is size $k$?

My initial idea is like this:

$$\sum_{i=1}^{m}p_k^ip_{lk}^i$$

where $p^i_k$ is the probability of having $i$ slots with chain size $k$, and $p_{lk}^i$ is the probability of having $m-i$ slots with chain size less than $k$. I am able to get the probability of one slot having chain size $k$, but I am not sure how to derive the other probabilities from it. A pointer in the right direction would be appreciated!

  • Exactly size $k$ or at least size $k$? – saulspatz Jul 30 '19 at 22:23
  • Exactly size $k$ – William Deng Jul 30 '19 at 22:34
  • The events aren't independent, so multiplying the probabilities doesn't look right. – saulspatz Jul 30 '19 at 22:41
  • I see... Then I am stuck... Can you point me in the right direction? – William Deng Jul 30 '19 at 22:46
  • Sorry, I don't see how to do it either, at least not yet. – saulspatz Jul 30 '19 at 22:50
  • There are a lot of "common sense" notions here that make it hard to extract the mathematical problem. What, for instance, is (mathematically) a "hash table" of some size (a map? from what to what?), then "collision", then "chaining"; what is a "hash function", a "key", a "slot" in the table, a "longest chain" after insertion... The proposed formula is hard to digest, since it depends on new variables defined in terms of "slots". A simple example might make things clear. Generally I do not downvote, but this is a case where I would consider it, since there is no point from which to start... – dan_fulea Jul 30 '19 at 23:31
  • I took it to mean: Given $n$ iid variables $X_i$ taking values uniformly from $\{1,2, \ldots, m\}$, what is the probability that $\max_j( |\{i \mid X_i = j\}| ) = k$? – Jair Taylor Jul 30 '19 at 23:44
  • Yes that is the problem. – William Deng Jul 30 '19 at 23:50

1 Answer


Leaving the following answer even though it applies to a slightly different model than the one asked about.

Consider the random variables $(A_{ij}\colon 1\leq i\leq m,\ 1\leq j\leq n)$ given by $A_{ij}=1$ if the $j^{th}$ inserted key hashes to slot $i$, and $A_{ij}=0$ otherwise. Then the variables $(A_{ij})$ are mutually independent and each distributed as a Bernoulli$(1/m)$ random variable. Let $S_i$ denote the length of the chain in slot $i$, or equivalently, $$ S_i=\sum_{j=1}^n A_{ij}. $$ Your question asks to find the distribution of $\max_{1\leq i\leq m}S_i$. Now we come to a branching off point: to some people, even rephrasing the problem in this way is already an answer, but I will outline some of the different paths one can choose to take from here.

Observe that $S_i$ is Binomial$(n,1/m)$ and the $S_i$ are independent. If precise asymptotics are desired (for instance, assuming $m$ is $O(1)$ and $n$ tends to infinity), then one can use a Gaussian approximation to $S_i$ and follow approaches similar to those in "Expectation of the maximum of gaussian random variables" or "Bounds for the maximum of binomial random variables" to get good bounds on the size of the maximum.

If weaker bounds suffice, you can follow the approach outlined in "Maximum of $k$ binomial random variables?"

If you need an exact formula, you can simply use the probability mass function to express things as a complicated sum.
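
As a hedged illustration of what such a sum can look like (this sketch is not part of the original answer): in the independent model above, $P(\max_i S_i \le k) = F(k)^m$, where $F$ is the Binomial$(n,1/m)$ distribution function, so $P(\max_i S_i = k) = F(k)^m - F(k-1)^m$. For the model in the question, where each key lands in exactly one slot and the chain lengths are therefore multinomial, a standard counting argument gives $P(\max \le k) = \frac{n!}{m^n}\,[x^n]\big(\sum_{j=0}^{k} x^j/j!\big)^m$. The Python sketch below evaluates both with exact rational arithmetic; the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import comb, factorial


def multinomial_max_at_most(n, m, k):
    """P(longest chain <= k) when n keys go uniformly into m slots.

    Computes n!/m^n times the coefficient of x^n in
    (sum_{j=0}^{k} x^j / j!)^m by repeated polynomial multiplication.
    """
    if k < 0:
        return Fraction(0)
    # Truncated exponential series for a single slot.
    base = [Fraction(1, factorial(j)) for j in range(min(k, n) + 1)]
    poly = [Fraction(1)] + [Fraction(0)] * n  # the polynomial "1"
    for _ in range(m):
        new = [Fraction(0)] * (n + 1)
        for a, ca in enumerate(poly):
            if ca == 0:
                continue
            for b, cb in enumerate(base):
                if a + b > n:
                    break  # higher powers of x are never needed
                new[a + b] += ca * cb
        poly = new
    return poly[n] * factorial(n) / m**n


def multinomial_max_exactly(n, m, k):
    """P(longest chain == k) in the model from the question."""
    return multinomial_max_at_most(n, m, k) - multinomial_max_at_most(n, m, k - 1)


def binomial_cdf(n, p, k):
    """P(Binomial(n, p) <= k), with p given as a Fraction."""
    return sum(
        (comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(min(k, n) + 1)),
        Fraction(0),
    )


def independent_max_exactly(n, m, k):
    """P(max_i S_i == k) when the S_i are treated as independent
    Binomial(n, 1/m) variables (the approximate model of this answer)."""
    p = Fraction(1, m)
    return binomial_cdf(n, p, k) ** m - binomial_cdf(n, p, k - 1) ** m


if __name__ == "__main__":
    n, m = 10, 4
    for k in range(n + 1):
        exact = multinomial_max_exactly(n, m, k)
        approx = independent_max_exactly(n, m, k)
        print(k, float(exact), float(approx))
```

Running the script prints, for each $k$, the exact probability under the model in the question next to the value from the independent model, which gives a quick way to see how far apart the two are for small $n$ and $m$.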

pre-kidney
  • But the $S_i$ are not independent. e.g., if $S_1 = n$ then we know $S_2 = \ldots = S_m = 0$. – Jair Taylor Jul 31 '19 at 16:46
  • The $A_{ij}$ are also not independent. There can only be one $i$ with $A_{ij} = 1$ for each $j$, since each key hashes to only one slot. – Jair Taylor Jul 31 '19 at 16:52
  • @JairTaylor good catch, somehow I transformed the model in my mind when writing the answer. That being said, if one is interested in asymptotics when $m=o(n)$ then my model is a good approximation of the one in the problem statement, up to a simple rescaling. – pre-kidney Aug 01 '19 at 02:57
  • Right, I think the asymptotics should be similar in that domain, although I'm not sure how to prove it. – Jair Taylor Aug 01 '19 at 04:51
  • It should be a relatively straightforward consequence of the strong law of large numbers at least for the regime $m=o(n)$, although I will hold off on working out the details until @William clarifies the question further. – pre-kidney Aug 01 '19 at 05:13