I was referred here by stackoverflow.com. With some searching I found this: another balls and bins question, but it's not quite what I am looking for. I want rather the inverse, i.e. the expected number of buckets that have H-1 balls in them.

I realize the title is a bit odd, but this is a statistics/probability problem that I am trying to figure out, and I am stumped. (No, it's not homework; see the bottom for the real explanation.)

The premise is simple. You have N buckets. Each bucket can hold H balls. None of the buckets is full. You have D balls already in the buckets, but you don't know where the balls are (you forgot!). You choose a bucket at random and add 1 ball. What is the probability that that bucket will then be full?

Some example diagrams, with N = 4, H = 3, D = 4. Each is just one hypothetical arrangement of the balls out of many.

Scenario 1: 1 bucket could be filled.
|   |   |   |   |
+ - + - + - + - +
| B |   |   |   |
+ - + - + - + - +
| B | B |   | B |
+ - + - + - + - +

Scenario 2: 2 buckets could be filled.
|   |   |   |   |
+ - + - + - + - +
|   | B | B |   |
+ - + - + - + - +
|   | B | B |   |
+ - + - + - + - +

Scenario 3: 0 buckets could be filled.
|   |   |   |   |
+ - + - + - + - +
|   |   |   |   |
+ - + - + - + - +
| B | B | B | B |
+ - + - + - + - +
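
For small cases like the diagrams above, the probability can be tallied by brute force. This is only a sketch: it assumes every sequence of D throws into uniformly random buckets is equally likely and that sequences which already fill a bucket are discarded (the interpretation the answers below settle on, not something the question states). The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def brute_force_fill_probability(n, h, d):
    """Exact P(next ball fills a bucket) by enumerating all n**d throw sequences.

    Assumption: each sequence of d throws is equally likely, and sequences in
    which some bucket already holds h balls are thrown away. Requires
    d <= n * (h - 1) so that at least one valid sequence exists.
    """
    valid = 0        # sequences in which no bucket reaches h balls
    nearly_full = 0  # buckets holding exactly h - 1 balls, summed over valid sequences
    for throws in product(range(n), repeat=d):
        counts = [0] * n
        for bucket in throws:
            counts[bucket] += 1
        if max(counts) >= h:
            continue
        valid += 1
        nearly_full += counts.count(h - 1)
    # The next ball picks one of the n buckets uniformly at random.
    return Fraction(nearly_full, n * valid)


print(brute_force_fill_probability(4, 3, 4))  # 9/34
```

The loop visits n**d sequences, so this is useful only as a check on tiny inputs.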

The problem is that I need a general-purpose equation of the form P = f(N, H, D).


Alright, you've tuned in this far. The reason behind this query is that I'm interested in running large battles between units. Each unit can belong to a brigade that contains many units of the same type. The battle progresses slowly over time, and at each phase the state is saved to the DB. Instead of saving every unit and its health, I want to save the number of units and the total damage on the brigade. When damage is added to a brigade, f(N, H, D) is run and returns the chance that a unit in the brigade is destroyed (all of its HP are used up). That unit is then removed from the brigade, decrementing N by 1 and D by H.

Before you get too technical: I need to implement this solution in a program, so integrals are out of the question for now. I'm stuck with algebra and trig functions.

I appreciate the help

ohmusama
  • If people give you formulas with factorials, your math library may be able to calculate them faster than looping if n is big, for example many math.h C libraries let you evaluate n! as tgamma(n+1). – Matt Apr 21 '11 at 20:03
  • In the example you gave, is having 1,0,1,2 (the reverse of Scenario 1) a possibility? – Aryabhata Apr 21 '11 at 20:34
  • @Moron - All possible combinations are possible, those are just some sample possibilities. @Matt - Yes, factorials are fine. I'm sure this is a combinatorial problem, but I can't seem to grasp how to set it up. – ohmusama Apr 21 '11 at 20:39
  • The only difference from the question you reference in your first line (specifically Henry's solution, with k=H-1) is that in your case, the distribution is conditioned on all buckets having H-1 or fewer balls. Unfortunately, this raises the hardness of the question significantly. – Matt Apr 21 '11 at 22:05
  • @Matt - I know :( thus why I was so confused. – ohmusama Apr 21 '11 at 22:32

2 Answers


The start has been substantially changed as a result of Matt's comments, affecting the result.

Let $f(N,H,D)$ be the number of ways of putting $D$ balls into $N$ buckets where there are strictly fewer than $H$ balls in each bucket, and counting different orders for the balls as distinct (i.e. labelled balls and labelled buckets). We can calculate this with the recurrence

$$f(N,H,D) = \sum_{i=0}^{H-1} {D \choose i} f(N-1,H,D-i)$$

starting with $f(0,H,D)=0$ except $f(0,H,0)=1$ and remembering ${x \choose y} = 0$ if $x<y$. This stems from adding an extra bucket and putting from $0$ to $H-1$ balls in it, in an order mixed with the balls in the other buckets. For example $f(4,3,4)=204$.
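
A minimal sketch of this recurrence in Python, with memoization so each $(N, D)$ pair is evaluated only once. The function name `count_fillings` is hypothetical, and `math.comb` (Python 3.8+) already returns 0 when $x<y$.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_fillings(n, h, d):
    """Ways to put d labelled balls into n labelled buckets, fewer than h per bucket."""
    if d < 0:
        return 0
    if n == 0:
        return 1 if d == 0 else 0
    # Put i balls (0 <= i <= h-1) in the extra bucket, choosing which i of the d balls.
    return sum(comb(d, i) * count_fillings(n - 1, h, d - i) for i in range(h))


print(count_fillings(4, 3, 4))  # 204
```

The recursion is n levels deep, so very large bucket counts would need an iterative version or a raised recursion limit.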

How many of these have $H-1$ balls in the first bin? That is like removing $H-1$ balls and one bin, so is $f(N-1,H,D-H+1) {D \choose H-1}$ which in the example is $9 \times {4 \choose 2} = 54$.

So the probability that (a) you put the next ball in the first bin and (b) fill the first bin by doing so is $\frac{1}{N} \cdot \frac{f(N-1,H,D-H+1) {D \choose H-1}}{f(N,H,D)}$, and so by symmetry the probability of filling any of the $N$ bins with the next ball is $N$ times that, i.e.

$$\frac{f(N-1,H,D-H+1) {D \choose H-1}}{f(N,H,D)}$$

or in this example $\dfrac{54}{204} = \dfrac{9}{34}$ confirming Matt's result.
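
As a rough sketch, the whole probability can be computed by filling in the values of $f$ row by row (one row per bucket count), so each value is computed once. The function name is hypothetical, and the early return for $D < H-1$ and the error for impossible inputs are my own assumptions about how to treat the boundary cases.

```python
from fractions import Fraction
from math import comb


def fill_probability(n, h, d):
    """P(next ball fills a bucket) = C(d, h-1) * f(n-1, h, d-h+1) / f(n, h, d)."""
    if n < 1 or d < h - 1:
        return Fraction(0)  # no bucket can hold h-1 balls yet
    # row[j] holds f(k, h, j) for the current bucket count k, starting at k = 0.
    row = [1] + [0] * d
    for _ in range(n):
        prev = row
        row = [
            sum(comb(j, i) * prev[j - i] for i in range(min(h, j + 1)))
            for j in range(d + 1)
        ]
    # After the loop: row is f(n, h, .) and prev is f(n-1, h, .).
    if row[d] == 0:
        raise ValueError("d balls cannot be placed without filling a bucket")
    return Fraction(comb(d, h - 1) * prev[d - h + 1], row[d])


print(fill_probability(4, 3, 4))  # 9/34
```

This does on the order of N x D x H big-integer operations, so it is exact but may still be slow at the largest sizes mentioned in the comments.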

Henry
  • I'm not a math wiz, by any stretch of the imagination. Or I would have figured this out, but I would like to understand the methods behind this before I blindly copy the code over to PHP. Do you have a good link that might explain Constraint Compositions? – ohmusama Apr 21 '11 at 22:39
  • I don't think the final answer here is right. For the example shown, scenario 1 can be drawn in 12 ways (the diagram shown is one of 12 possibilities), each of which can have its balls labeled "first", "second", etc. in 12 ways. Scenario 2 can be drawn in 6 ways, each of which can be labeled in 6 ways. Scenario 3 can be drawn in 1 way, which can be labeled in 24 ways. So the expectation is $\frac{\frac{1}{4} \cdot 144+\frac{2}{4} \cdot 36}{144+36+24} = \frac{9}{34}$. Also, I thought I saw a generating function answer earlier that incorrectly ignored the labeling... how did it disappear? – Matt Apr 21 '11 at 22:40
  • @Matt: Moron deleted his earlier answer: it also came to $9/34$, but I think did not do what you did of spotting how many ways each scenario could be constructed and how many ways this could lead to a bin being filled: instead it was $\frac{1}{N}\left(1 - \frac{f(H-2)}{f(H-1)}\right)$; it would be ironic if that turned out to be a correct expression. If you are right, and you may well be, I may delete this answer. – Henry Apr 21 '11 at 22:52
  • I didn't realize things could disappear without a trace... I spent quite a while trying to figure out whether it was my browser or the web site that was broken! Anyway, I ran a simulation to confirm my answer, so I think it is right unless I am misunderstanding the problem. I take a sample by throwing balls in bins at random until a bin hits the limit of H balls. At that point, if I have thrown D+1 balls it is a "hit", and if I have thrown more than D+1 balls, it is a "miss", and if I have thrown less than D+1 balls, I ignore that sample as invalid. – Matt Apr 21 '11 at 23:01
  • @Matt - That sounds right. Basically you've randomly thrown in D balls so far, and none of them have filled a bucket yet. So with D+1 what are the odds of filling a bucket. So I think you have it right. – ohmusama Apr 21 '11 at 23:29
  • @Matt: I now agree with you, I think. I have changed my answer suitably. With $H=3$, the results of the recurrence are related to OEIS A141765 called a "Width-restricted finite function" – Henry Apr 21 '11 at 23:51
  • I found Moron's answer, regardless of its correctness level, to be enlightening regarding possible approaches. Before I saw his answer, I thought this problem was completely intractable, akin to extending estimates for erf(x) to the case of integrating high dimensional Gaussians over complicated regions (and was on the verge of embarrassing myself by saying so). But after seeing Moron's approach, I changed my mind. So I give it a +1 in memoriam! – Matt Apr 22 '11 at 00:04
  • Henry, what is the value of f(N−H+1,H,D−1) for {N,H,D} = {4,3,4}? – Mr.Wizard Apr 22 '11 at 00:16
  • @Mr.Wizard: 9, i.e. the ways of putting $4-3+1=2$ balls in $4-1=3$ buckets with fewer than 3 balls in each bucket. If the buckets are b, c and d then the possibilities are bb, bc, bd, cb, cc, cd, db, dc, dd. – Henry Apr 22 '11 at 00:24
  • @Henry: You have defined N and D (and therefore f) backwards from the original poster, but you seem to be internally consistent, and I think correct (although I haven't checked all the boundary cases of the recurrence — I am still thinking about a Moron-like (Moronic?) approach to the recurrence, as shown in this other question (which differed from this question in that it was merely asking for the number of configurations, i.e. diagrams as drawn here by the original poster, and didn't talk about throwing balls in bins)). – Matt Apr 22 '11 at 00:38
  • Oh! Thanks @Matt, I was furiously chasing nonexistent bugs trying to make this work, and I forgot to check that the symbols were the same! – Mr.Wizard Apr 22 '11 at 00:46
  • @Matt. I need to sleep - I have tried to swap them. – Henry Apr 22 '11 at 00:52
  • what does the exactTerms in the twoConstraintCompositions() function mean? Would I set it true by default for my problem? – ohmusama Apr 22 '11 at 07:06
  • @ohmusama: Don't worry about that - as Matt pointed out, it gave the wrong answer. For the record exactTerms=true means an exact number of positive compositions, and exactTerms=false means any number of positive compositions up to the maximum. – Henry Apr 22 '11 at 12:43
  • @Henry, I understand what you are doing now, except what exactly is behind { f(x, y, z) | x = n-h+1, y = h, and z = d-1 } ? – ohmusama Apr 22 '11 at 14:44
  • @ohmusama: If you have $h-1$ balls in the first bucket, that leaves $n-h+1$ balls for the other $d-1$ buckets – Henry Apr 22 '11 at 16:06
  • @henry: I totally get that, but what does f() do with its inputs to get the output? If it the recursive function at the top? – ohmusama Apr 22 '11 at 16:14
  • @ohmusama: it is the recursive function, but it has changed more than once at Matt's suggestion, including the swapping of $n$ and $d$ – Henry Apr 22 '11 at 16:24
  • Is it finalized? (also I coded it up already. Going to do some testing on it later today). Then, I will compare efficiency with mr wizard's solution and see how tractable each one is :) and what ranges for the variables I am allowed. When this is solved, I have another interesting problem to post. – ohmusama Apr 22 '11 at 16:40
  • This solution is intractable :( Even with small values like N = 15, H = 3, D = 15, the numerator f() is called 3,587,226 times and the denominator f() is called 13,359,993 times. The growth is exponential; I'm guessing it's about O((N*(N+D))^H). – ohmusama Apr 22 '11 at 19:54
  • You could do the recursion once per value and store the 256 ($(N+1)\times(D+1)$) interesting values, each one of which is the sum of up to three ($H$) of the others (multiplied by some small binomial coefficients). I make the probability in that case about 0.28579 - though at the moment I am not feeling totally reliable. – Henry Apr 22 '11 at 20:33
  • If you want an approximation to the probability you could try something like $\left(\dfrac{d-h+2}{nh-n-h+2}\right)^{\alpha}$ where $\alpha$ is slightly less than $h-1$. This could give you probabilities of zero and one in the correct places at $d=h-2$ and at $d=n(h-1)$ and very roughly the shape of the curve. – Henry Apr 22 '11 at 20:38
  • I will test this out tomorrow, and its likely you will get the accept check if it is even reasonably close. – ohmusama Apr 23 '11 at 08:41
  • An interesting answer. The lower D is, the more discrepancy it has; however, it's good enough, I guess. It just means the bias will be toward the larger fleet even more. So I set α to be .99(h-1). Is this what you mean by slightly less, or do you mean .999999999? – ohmusama Apr 23 '11 at 19:58
  • @ohmusama: I don't know exactly what $\alpha$ should be. With $h=2$ you need $\alpha=1$. With $h=6$ and $n=40$ I think something like $\alpha \approx 3.8$ seems to work reasonably well. – Henry Apr 23 '11 at 23:22
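
The approximation Henry suggests in the comments above could be sketched as follows. The function name is hypothetical, the default exponent of 0.99(h-1) is just ohmusama's trial value rather than a recommended setting, and the clamping to 0 and 1 outside the valid range is my own assumption.

```python
def approx_fill_probability(n, h, d, alpha=None):
    """Rough approximation ((d - h + 2) / (n*h - n - h + 2)) ** alpha."""
    if alpha is None:
        alpha = 0.99 * (h - 1)  # assumed default; the thread leaves alpha open
    numerator = d - h + 2
    denominator = n * h - n - h + 2  # equals (n - 1) * (h - 1) + 1, so it is >= 1
    if numerator <= 0:
        return 0.0  # d <= h - 2: no bucket can be one ball short of full
    if numerator >= denominator:
        return 1.0  # d >= n * (h - 1): every bucket is one ball short of full
    return (numerator / denominator) ** alpha
```

The clamp also avoids raising a negative base to a fractional power, which Python would turn into a complex number.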

Here is how I see this. In your sample case of N, H, D = 4, 3, 4 :

{2,2,0,0}   {6,  6}
{2,1,1,0}   {12,12}
{1,1,1,1}   {1, 24}

On the left we have restricted partitions, on the right the number of ways to permute and uniquely fill each partition.

Therefore, we have a combined enumeration of:

{2,2,0,0}     36
{2,1,1,0}    144
{1,1,1,1}     24

I count the number of nearly-filled bins as 2 * 36 + 144 = 216, out of 4 * (36 + 144 + 24) = 816, for a probability of 216/816 = 9/34.

In other words, I am getting the same result as Matt.
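
A sketch of how this enumeration might be done programmatically. The names are hypothetical and this is not the Mathematica code mentioned in the comments; it generates the restricted partitions, weights each by the number of bucket layouts times the number of ball labelings, and counts the nearly-filled bins.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def restricted_partitions(d, n, max_part):
    """Yield non-increasing tuples of n parts, each in 0..max_part, summing to d."""
    if n == 0:
        if d == 0:
            yield ()
        return
    for first in range(min(d, max_part), -1, -1):
        # The other n-1 parts are at most `first`, so they must be able to hold the rest.
        if (n - 1) * first < d - first:
            break
        for rest in restricted_partitions(d - first, n - 1, first):
            yield (first,) + rest


def fill_probability_by_partitions(n, h, d):
    """Weight each partition by (bucket layouts) * (ball labelings); needs d <= n*(h-1)."""
    nearly_full = 0
    total = 0
    for parts in restricted_partitions(d, n, h - 1):
        layouts = factorial(n)  # distinct arrangements of the parts over n buckets
        for multiplicity in Counter(parts).values():
            layouts //= factorial(multiplicity)
        labelings = factorial(d)  # ways to label the balls "first", "second", ...
        for size in parts:
            labelings //= factorial(size)
        weight = layouts * labelings
        total += weight
        nearly_full += weight * parts.count(h - 1)
    return Fraction(nearly_full, n * total)


print(fill_probability_by_partitions(4, 3, 4))  # 9/34
```

The number of partitions grows quickly with N and D, so this is a way to check small cases rather than a practical formula.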

Mr.Wizard
  • This is the same as my calculation. Where you say "restricted partitions", I say "number of possible diagrams, of the type drawn by the original poster", and where you say "the number of ways to permute and uniquely fill each partition", I say "the number of ways to label the balls in the diagram as 'first', 'second', etc.". (Sorry for being jargon-impaired!) – Matt Apr 21 '11 at 23:25
  • @Matt, I am not a mathematician so the chances are that it is my jargon that is impaired. – Mr.Wizard Apr 21 '11 at 23:28
  • So if I were doing this programmatically, how might I generate the partitions? And then how would I derive the permute and uniquely fill numbers? I think this could be a solution. I wonder if it will also be tractable. – ohmusama Apr 21 '11 at 23:33
  • @ohmusama it is not too complex, if this actually is correct. I use Mathematica, which does make things simpler, but it should not be hard to implement. I'll update my answer in a little while, if no one has shown this to be invalid, or posted a better answer by then. – Mr.Wizard Apr 21 '11 at 23:41
  • @ohmusama: If you can count the units in your brigades, and the hit points of your units, on your fingers (preferably of one hand), the method we're using here might work for you... but we need to seriously simplify our formulas before they are ready for practical use! Here we are essentially just double-checking the logic of our approaches on your nicely illustrated example, not yet claiming to have solved your original problem in any useful way. (Speaking for myself, at least.) – Matt Apr 21 '11 at 23:42
  • @ohmusama how large are your working N, H, D? – Mr.Wizard Apr 21 '11 at 23:50
  • N could be in the (tens of) thousands, although in most cases (95%) it will be in the tens to hundreds. H will likely be 5 or less (10 at most). Note that the higher H is, the less likely N is to be big, i.e. N is inversely proportional to H. D would then be at most N * (H - 1). – ohmusama Apr 22 '11 at 00:54
  • @ohmusama that could be a problem. I will need to find optimizations. :-) Would you be able to make changes to your parameters, such as H always less than 5, or anything like that? – Mr.Wizard Apr 22 '11 at 01:20
  • I could restrict myself. But ultimately I can only guarantee the values of H. – ohmusama Apr 22 '11 at 01:26
  • @ohmusama I think I can do it, but it will take some time. I'll try to crack it tomorrow. – Mr.Wizard Apr 22 '11 at 01:31
  • @ohmusama, how precise does the answer need to be to be useful? – Mr.Wizard Apr 22 '11 at 02:01
  • I don't think people will be able to tell if it's 1-2% off; that is the limit of human observation. It does need to be 100% accurate when D = N * (H - 1), since that case always results in a bucket filling. It also needs to be 100% accurate when D <= H - 2, since that case never results in a bucket filling. – ohmusama Apr 22 '11 at 03:10
  • @ohmusama, I was not able to find the optimizations I hoped to. A curve fitting or interpolation approach may actually work given your loose accuracy requirement. I have not given up but I am not confident that I can answer this question satisfactorily. I posted your question to another math forum, that while slow, has some very bright people. – Mr.Wizard Apr 23 '11 at 07:04