select a randomly chosen subset of integer range

Question

I tried to ask this question on Stack Overflow, but it was not well received.

I want to generate an increasing sequence of N random integers in a range [b,e] such that every N-element subset of [b,e] (having no duplicates) is equally likely. To do this I can of course just generate the N integers, then sort them into increasing order. But is there a way to do this without having to sort? I.e., can it be done by selecting the smallest one first and continuing in order?
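For reference, the sort-based baseline mentioned above can be sketched in a few lines of Python. The function name `sorted_random_subset` and the optional `rng` parameter are illustrative assumptions, not part of the question.

```python
import random

def sorted_random_subset(n, b, e, rng=random):
    """Baseline: draw n distinct integers from [b, e], then sort them.

    random.sample draws without replacement, so every n-element
    subset of [b, e] is equally likely.
    """
    return sorted(rng.sample(range(b, e + 1), n))
```

This is the behaviour that any sort-free method would need to reproduce.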

For example: first select r(0) in the range [b,e-N+1], then select r(1) in the range [r(0)+1,e-N+2], ... selecting r(k) in the range [r(k-1)+1,e-N+k+1], so that the last element r(N-1) can be as large as e.

There is an obvious problem with this approach. If r(k) is blindly chosen in [r(k-1)+1,e] the resulting distribution will be heavily weighted toward the right. E.g., a sequence containing b and b+1 would be very unlikely.
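The size of the skew can be checked exactly for a small case. The sketch below is only an illustration: it assumes each r(k) is drawn uniformly from the values that still leave room for the remaining elements, and the name `sequential_probs` is made up for this example.

```python
from fractions import Fraction

def sequential_probs(n, b, e):
    """Exact probability of every increasing n-sequence in [b, e] when each
    element is drawn uniformly from the values still feasible at that step."""
    probs = {}

    def extend(prefix, prob):
        k = len(prefix)
        if k == n:
            probs[tuple(prefix)] = prob
            return
        lo = prefix[-1] + 1 if prefix else b
        hi = e - (n - k - 1)  # leave room for the n-k-1 later elements
        choices = range(lo, hi + 1)
        for r in choices:
            extend(prefix + [r], prob / len(choices))

    extend([], Fraction(1))
    return probs
```

For n = 2 on [1,4], this gives probability 1/9 to (1, 2) and 1/3 to (3, 4), whereas a uniform choice of subset would give each of the six pairs probability 1/6.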

Ideally a solution would be a "formula" for r(k) in terms of N and r(k-1) and an equal-distribution random() function-oid.

Any suggestions?

Essentially this has been asked and answered on SO. – r.e.s., Nov 15 '17 at 21:49

r.e.s. · Answer 1 · 2017-11-17T05:45:59.340

It is well known$(^*)$ that if $U_1,...,U_n$ are iid Uniform on the interval $(0,1)$, then the joint distribution of the order statistics $U_{(1)}\le U_{(2)}\le ...\le U_{(n)}$ is the same as that of the quantities
$$R_j=\frac{\sum_{i=1}^j X_i}{\sum_{i=1}^{n+1}X_i} \quad (j=1..n)$$ where the $X_i$ are iid Exponential (with any fixed mean).

Therefore, the quantities $a + (b-a)R_j$ will be distributed like the order statistics of a sample that's iid Uniform on the interval $(a,b)$; consequently, the quantities $$a + \lfloor(b-a) R_j\rfloor$$ will be distributed like the order statistics of a sample that's iid Uniform on the set of integers $\{\text{low},...,\text{high}\}$ if we take $a = \text{low}$ and $b=\text{high}+1$.

An example in SageMath:

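# In SageMath, expovariate and floor are available by default; in plain Python,
# use `from random import expovariate` and `from math import floor`.
def sorted_uniform(n, low, high):
    width = (high + 1) - low  # number of integers in {low, ..., high}
    cum_sum = 0
    numerators = []  # partial sums X_1 + ... + X_j for j = 1..n
    for _ in range(n):
        cum_sum += expovariate(1.0)
        numerators.append(cum_sum)
    cum_sum += expovariate(1.0)  # the (n+1)-th term appears only in the denominator
    return [low + floor(width*numerator/cum_sum) for numerator in numerators]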

Unfortunately, if it is required to produce a sequence with no ties, it may be necessary to repeat the whole procedure until it does so. That may or may not be feasible, depending on the parameters involved.
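A minimal sketch of that retry loop in plain Python follows. The wrapper name `sorted_distinct_uniform`, the `rng` parameter, the `max_tries` cap and the `min()` guard are all assumptions added for illustration. Conditioning on "no ties" keeps every n-element subset equally likely, because each subset arises from the same number of orderings of the underlying iid sample.

```python
import random
from math import floor

def sorted_uniform(n, low, high, rng=random):
    """Plain-Python version of the routine above (may contain ties)."""
    width = (high + 1) - low
    cum_sum = 0.0
    numerators = []
    for _ in range(n):
        cum_sum += rng.expovariate(1.0)
        numerators.append(cum_sum)
    cum_sum += rng.expovariate(1.0)
    # min() is an added guard against a float ratio rounding up to 1.0
    return [min(high, low + floor(width * x / cum_sum)) for x in numerators]

def sorted_distinct_uniform(n, low, high, rng=random, max_tries=10_000):
    """Repeat sorted_uniform until the result has no ties."""
    if n > high - low + 1:
        raise ValueError("n exceeds the number of integers in the range")
    for _ in range(max_tries):
        sample = sorted_uniform(n, low, high, rng)
        if all(x < y for x, y in zip(sample, sample[1:])):
            return sample
    raise RuntimeError("no tie-free sample found within max_tries")
```

The expected number of retries grows quickly as n approaches the size of the range, which is the feasibility concern noted above.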

NB: Generating floats in the interval $(\text{low},\text{high}+1)$, then rounding down to an integer using floor(), avoids a mistake in the Q & A referenced in my comment above. (The endpoints of the interval $(\text{low},\text{high})$ would be under-represented if floating point numbers were generated in the interval $(\text{low},\text{high})$ and then simply rounded to the nearest integer.)
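The endpoint effect is easy to see empirically. The following is a rough simulation sketch, not part of the original answer; the function name `rounding_counts` is hypothetical.

```python
import random
from collections import Counter
from math import floor

def rounding_counts(low, high, trials, rng):
    """Tally integers produced by round-to-nearest on (low, high)
    versus floor on (low, high + 1)."""
    nearest = Counter()
    floored = Counter()
    for _ in range(trials):
        nearest[round(rng.uniform(low, high))] += 1
        # min() guards against the rare float result equal to high + 1
        floored[min(high, floor(rng.uniform(low, high + 1)))] += 1
    return nearest, floored
```

With `low=1` and `high=6`, the `nearest` tallies for 1 and 6 should come out at roughly half of those for 2 through 5, since each endpoint only collects a half-width interval, while the `floored` tallies should be roughly equal.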

$(^*)$ This is derived in Johnson, N. L. and S. Kotz [1970]. Continuous Univariate Distributions-2. Proofs are also posted on MSE here and here.

score 0 · Answer 2 · answered Nov 17 '17 at 06:38

There are $\binom{e-b+1}{N}$ subsets as you describe: $\binom{e-b}{N-1}$ of them include $b$ and the other $\binom{e-b}{N}$ do not. Choose a random integer $p$ uniformly in $[0,\binom{e-b+1}{N}-1]$. If $p$ is less than $\binom{e-b}{N-1}$, include $b$ and append the $p$th list of $N-1$ elements from $[b+1,e]$. Otherwise do not include $b$, and append the $(p-\binom{e-b}{N-1})$th list of $N$ elements from $[b+1,e]$. This describes a recursive algorithm that produces the elements in order, so no sorting is required.
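One way to write the recursion above as a loop, sketched in Python (3.8+ for `math.comb`). The function names `unrank_subset` and `random_sorted_subset` and the `rng` parameter are illustrative choices, not taken from the answer.

```python
import random
from math import comb

def unrank_subset(p, n, b, e):
    """Return the p-th n-element subset of [b, e], in increasing order,
    for 0 <= p < comb(e - b + 1, n)."""
    result = []
    lo = b
    while n > 0:
        with_lo = comb(e - lo, n - 1)  # subsets of [lo, e] that include lo
        if p < with_lo:
            result.append(lo)  # include lo; n-1 elements remain to be chosen
            n -= 1
        else:
            p -= with_lo       # skip lo; still n elements to choose
        lo += 1
    return result

def random_sorted_subset(n, b, e, rng=random):
    """Pick p uniformly, then unrank it; elements come out already sorted."""
    total = comb(e - b + 1, n)
    return unrank_subset(rng.randrange(total), n, b, e)
```

Since each value of p maps to a different subset, a uniform p gives a uniform subset. Note that $\binom{e-b}{N-1}/\binom{e-b+1}{N} = N/(e-b+1)$, so $b$ is included with exactly that probability.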
