pick K random integers without repetition

Question

Suppose we have to pick $K$ ($\le N$) random integers in the range $[0, N - 1]$ for very large $N$ such that there is no repetition, while also deterministically minimizing the number of calls made to the $rand()$ function. How would we do it?

Assume $rand(X)$ produces a random integer in the range $[0,X - 1]$; calling it with any argument counts as one call.

Picking at random until we get a unique value may not be efficient for smaller $N$, and $O(N)$ algorithms, i.e. ones that require looping through the entire range of integers, are also not feasible, as $N$ could be up to $10^{18}$.

Is there any way to accomplish this?

This is the reservoir problem – Ṃųỻịgǻňạcểơửṩ Jan 23 '24 at 05:48

score 16 · Answer 1 · answered Jan 23 '24 at 18:30

Ṃųỻịgǻňạcểơửṩ mentioned that this is the reservoir problem. Reservoir sampling, though, explicitly imposes the very strict requirements of a streaming algorithm, i.e. it must work in just one pass over a stream of values whose eventual size we don't know.

In this case, we want to do it in one pass, but we do know both N and K, so the problem is a little different and it permits a simpler solution.

Floyd's algorithm is designed for this task. It has some resemblance to both the Fisher-Yates shuffle and to reservoir sampling.

A couple of minor differences from the framing in your question:

K < N
we use randint(i, j) which produces a random integer between i and j inclusive
the samples will be generated in the range 1 to N inclusive (but of course it is easy to adjust this at the end)

Here's the Python code:

import random
n = 200
k = 5
s = set()
for j in range(n-k+1, n+1):
    t = random.randint(1, j)
    if t not in s:
        s.add(t)
    else:
        s.add(j)
print(s)

A good source for the origin of and mathematical justification for Floyd's algorithm is this article by Jon Bentley from 1987. It's worth noting that the algorithm only calls randint() K times.

A very brief description of how it works: on iteration i (for i = 1, ..., k), it draws a number t uniformly at random from [1..j], where j = n-k+i. If t has not been drawn before, t is added to the sample. Otherwise, j itself is added to the sample; j cannot already be in the sample, because every earlier iteration only added values smaller than j. This gives the proper probabilities of inclusion in the final set.

In Bentley's article there is a description of the recursive formulation of the algorithm:

We can appreciate the correctness of Floyd’s algorithm anecdotally. When M = 5 and N = 10, the algorithm first recursively computes in S a 4-element random sample from 1..9. Next it assigns to T a random integer in the range 1..10. Of the 10 values that T can assume, exactly 5 result in inserting 10 into S: the four values already in S, and the value 10 itself. Thus element 10 is inserted into the set with the correct probability of 5/10.
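The recursive formulation in the quoted passage can be sketched in Python roughly as follows. This is only an illustration, not Bentley's own code: the name floyd_recursive is made up here, and it follows the quote's convention of sampling M elements from 1..N.

```python
import random


def floyd_recursive(m, n):
    """Return a random m-element subset of 1..n (hypothetical helper name).

    Mirrors the recursive description: first sample m-1 elements from
    1..n-1, then draw t from 1..n and insert either t or n.
    """
    if m == 0:
        return set()
    s = floyd_recursive(m - 1, n - 1)
    t = random.randint(1, n)  # inclusive on both ends
    # If t is already present, insert n instead; n cannot be in s yet,
    # because s was drawn from 1..n-1.
    s.add(n if t in s else t)
    return s


print(floyd_recursive(5, 10))
```

Since Python limits recursion depth, the iterative loop shown earlier is the more practical form for large sample sizes; the recursive version is mainly useful for following the correctness argument.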

I can't do better than that explanation, right now!

The algorithm has been mentioned on StackOverflow and proofs are here on Math SE.

It should be noted that although this algorithm produces uniform probability for each in-range number to be included in the sample, it does not yield a uniform distribution of permutations of the selected elements. In many cases, that won't matter, but in some cases it might. — John Bollinger, Jan 23 '24 at 22:16

Peter Cordes · Answer 2 · 2024-01-23T23:30:43.173

What quality of randomness do you need? You can construct a Linear Congruential Generator as a PRNG with period of N (or the next prime above N) and discard out-of-range samples.

Use rand() only to seed your LCG PRNG.

Of course you get the same sequence every time, just from different start points, unless you also try to randomize your LCG's multiplier and adder parameters. Or as John Bollinger puts it, for any given K it can produce only about N distinct K-element samples.

Re-randomizing the multiplier (a) and adder (c) constants while still satisfying the conditions for period m (the modulus) can make multiple sets of sequences possible. And/or raise m so there are more numbers between N and m you'd have to discard if generated. But there's limited freedom so this only goes so far. (And depending on LCG parameters there can be significant correlation between numbers.)

This requires O(1) storage and negligible time per result, just a 64-bit multiply, add, and modulo.
If you need bigints for this, I guess O(log2(N)) storage and time per query with a small constant factor. But 10^18 < 2^64. Initial setup time requires finding a prime above N.
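As a rough illustration of the idea, here is a minimal Python sketch. It is a simplified variant, not the code from this answer: instead of tuning the parameters so the period equals N, it assumes a power-of-two modulus m >= N (the "raise m and discard" option), for which the full-period conditions reduce to "c is odd and a - 1 is a multiple of 4". The function name lcg_sample and the particular constants are assumptions made for this example.

```python
import random


def lcg_sample(n, k, rand=random.randrange):
    """Yield-free sketch: k distinct integers in [0, n) from a full-period LCG.

    Assumption: modulus m is the smallest power of two >= max(n, 8), so the
    generator visits every value in [0, m) exactly once per period and we
    simply discard values >= n.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    m = 1 << max(3, (n - 1).bit_length())
    a = 6364136223846793005 % m  # a % 8 == 5, so a - 1 is a multiple of 4
    c = 1442695040888963407 % m  # odd, hence relatively prime to m
    state = rand(m)              # the only call to the random source
    out = []
    while len(out) < k:
        state = (a * state + c) % m
        if state < n:            # discard out-of-range values
            out.append(state)
    return out


print(lcg_sample(10**18, 5))
```

Because m < 2N here, fewer than half of the generated values are discarded on average. The caveats from the answer still apply: the output is one of only about m possible sequences, and its statistical quality depends heavily on the parameters.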

Quality of randomness is low: even the best LCGs are not great pseudo-random generators by modern standards, although some of their downsides come from choices like a power-of-2 modulus which makes the low-order bits highly correlated. (Other downsides include a short period, which we're taking advantage of here.) But automated selection of LCG parameters to fit a given period can result in generators much worse than a "good" LCG. For some small N, pairs of consecutive numbers are common in the output sequence. Maybe different LCG parameters with the same or similar period could avoid that.

I used this in practice for a partly-finished subtree-pruning-regrafting library that I never did much with: GPLed code on GitHub with comments explaining the algorithm, citing Knuth TAOCP for the conditions that produce period m, and Numerical Recipes for some PRNG quality folklore and best practices. The LCG-selection part does work correctly, but basically gives up and makes the multiplier 1 (i.e. b = a - 1 = 0, so the sequence just steps by the adder c) in some corner cases. It does the same, with adder 1, for tiny maxval = N, 6 or less, since for this library's purposes, 6 was few enough to just brute-force try all possibilities and see which tree gave the maximum likelihood.

struct lcg {
    unsigned int state;
    unsigned int a, c, m;
    unsigned int startstate;
};

/******************** Linear Congruential Generator setup ***********/
/* to generate all possible SPRs in a pseudo-random order, we generate
 * all the numbers between 0 and the number of possible SPRs once each
 * without repetition using an LCG of the form: x_n+1 = x_n*a + c mod m.
 */
/*
 * Knuth: TAOCP 3.2.1: ex 2: if a and m are relatively prime,
 * the number X_0 will always appear in the period.  (will return to start?)
 *
 * 3.2.1.2: Theorem A: An LCG will have period m iff:
 *   - c is relatively prime to m
 *   - b = a-1 is a multiple of p, for every prime p dividing m
 *   - b is a multiple of 4, if m is a multiple of 4
 *
 * 3.2.1.3: ex 4:  m = 2**e >= 8  ->  maximum potency when a mod 8 = 5.
 * small multipliers are to be avoided.
 *
 * Numerical Recipes suggests c = a prime close to (1/2 - sqrt(3)/6)*m
 */
/* make up some parameters for an LCG that will have the maximum period
 * equal to the range, so every value is generated once.
 * When maxval doesn't have any repeated prime factors, a = m+1,
 * which is the same as a=1.  It's not exactly random, but it does still
 * mix up which SPRs are done.
 *
 * TODO: take advantage of the fact that maxval = floor(sqrt(maxval))*ceil(sqrt(maxval))
 * could do that, but then the code would be less general-purpose
 * successfully brute-force tested for maxval=1..1000.
 */
void findlcg(struct lcg *lcg_params, int maxval)
{
    unsigned int a, b, c, m = maxval;
    int i;
    primesetup (maxval+maxval/2);

    if (m<=6){ // will be either 6 or 2.  Just loop in order
        b=0;
        c=1;
    }else{
        int divlimit = m;
        b=1; // b must be a multiple of all of m's prime factors
        if (!(m%2)){
            b=2;
            while (divlimit%2 == 0) divlimit /= 2;
        }
        for (i=3 ; i <= divlimit ; i+=2){
            if (is_prime(i) && m%i == 0){
                b *= i;
                while (divlimit%i == 0) divlimit /= i;
            }
        }

        if (!(m%4)){    // if m is a mult of 4, b must be.
            while (b%4) b *= 2;
        }

        /* make sure a isn't too small */
        while (b<sqrtf(m)) b*=7;

        if (b == m) b=0;  // just give up and avoid overflow

        // Numerical Recipes says there is "lore" behind this... :)
        // TAOCP says it's useless unless the multiplier sucks (section 3.3.3, eq. 40)
        // That would be us.
        c = next_prime(max(5, (0.5 - sqrtf(3)/6.0)*m - 2));
        while (m%c == 0) c = next_prime(c+1);
        // Luckily we don't have to test for c>m, because it doesn't
        // happen with any m<100, and there are enough primes later...
        /* I've observed that when a == m, (e.g. a=13, c=13, m=72) you often get
         * two consecutive numbers...  Do something to avoid that if it's a problem */
    }
    a = b+1;
    unsigned long long l = (unsigned long long)a * m;
    if (l > ULONG_MAX){
        fprintf(stderr, "spr: chosen Linear Congruential Generator is bogus\n"
                "   x_n+1 = x_n*%u+%u mod %u\n"
                "   a*m > ULONG_MAX, so it would overflow :(\n", a, c, m);
    }
    lcg_params->a = a;
    lcg_params->c = c;
    lcg_params->m = m;
    lcg_params->startstate = UINT_MAX;
    lcg_params->state = rand() % m;
}

My use-case was smallish trees so for prime finding I just used a straightforward Sieve to find true primes, not just relatively-prime which would be sufficient. Quality of pseudo-randomness was not a priority at the time, before moving on to other work. There might be room to spend more time choosing LCG parameters better than the algorithm shown here.

Nice option! This seems like a pragmatic approach. I want to document the shortcoming that its output is not uniformly random; it is pseudorandom. So if high-quality randomness is needed, the other answers might be best, but this might be sufficient if there aren't strict requirements on the distribution of the algorithm's output. — D.W., Jan 23 '24 at 20:34
@D.W.: In the worst case it's barely pseudorandom, significant correlation can happen depending on how bad the LCG parameters are. But yes, this was a pragmatic approach I used for a project once, which I thought was pretty clever. The upside in speed and lack of storage is significant. I added a link to the code on github in case anyone actually wants to use it, and quoted the interesting comments which describe the necessary details to produce such an LCG. — Peter Cordes, Jan 23 '24 at 20:54
I accept that this yields approximately uniform probability of each element of the domain being included in the sample, but for any given $K$ it can produce only about $N$ distinct $K$-element samples, which is far fewer than the total number of possible distinct samples for $N$ at the large end of the OP's range and even pretty small $K$. — John Bollinger, Jan 23 '24 at 23:08
@JohnBollinger: Indeed, that's a good way of putting it. There are multiple values of multiplier and adder that can make an LCG with period m, so re-randomizing those could help. — Peter Cordes, Jan 23 '24 at 23:32

score 5 · Answer 3 · edited Jan 24 '24 at 21:09

If you care only about having uniform probability for each of the $N$ elements to be included in the $K$-element sample, then @PeterCordes's suggestion to use just one call to $rand()$ to seed a suitably tuned LCG is very attractive.

If it is additionally important to have equal probability of each of the $\tbinom{N}{K}$ distinct possible $K$-element samples, without regard to order, then Floyd's algorithm for sampling without replacement, as suggested by @TheoH, is hard to beat.

But if the order of the selections is important, such that for every draw, each as-yet unselected element should have an equal probability of being selected, then you need something more. A partial Fisher-Yates shuffle could do this job in $K$ calls to $rand()$, but that requires an impractical amount of storage for large $N$. Or does it?

You can perform a Fisher-Yates-like selection of $K$ elements with $K$ calls to $rand()$ and with only $O(K)$ overhead by maintaining a map from previously selected values to replacement values. Such a map can be constructed with $O(K)$ memory for which storage and retrieval operations each have $O(1)$ amortized asymptotic complexity, so this does not asymptotically increase either computational complexity or the storage required ($O(K)$ storage already being required for the result). Having first initialized an empty map, $M$, one proceeds as follows on the $i$-th draw ($i$ starting from 0):

compute $x = \mathrm{rand}(N - i)$
compute the selected element $y$ as $y = \begin{cases} M[x] & \text{if it exists} \\ x & \text{otherwise} \end{cases}$
set $M[x] = \begin{cases} M[N - i - 1] & \text{if it exists} \\ N - i - 1 & \text{otherwise} \end{cases}$
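The three steps above can be sketched in Python along these lines. The function name ordered_sample is invented for this example, a plain dict stands in for the map $M$, and random.randrange plays the role of the question's $rand(X)$ (an integer in $[0, X - 1]$).

```python
import random


def ordered_sample(n, k, rand=random.randrange):
    """Return k distinct integers from [0, n) in uniformly random order.

    Uses exactly k calls to rand and O(k) extra storage.
    """
    replacement = {}  # the map M: slot index -> value currently in that slot
    picks = []
    for i in range(k):
        x = rand(n - i)                        # slot among the n - i remaining
        picks.append(replacement.get(x, x))    # y = M[x] if present, else x
        last = n - i - 1                       # slot that drops out of range
        # Move the value from the last live slot into slot x.
        replacement[x] = replacement.get(last, last)
    return picks


print(ordered_sample(10**18, 5))
```

The dict never holds more than k entries, so memory stays proportional to the sample size even for $N$ near $10^{18}$.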

The simplistic alternative is just to run Floyd's algorithm and randomly shuffle/permute the resulting array, since we're doing it offline anyway. I presume this algorithm is cheaper than doing that. — Peter Cordes, Jan 24 '24 at 03:45
Thank you @IlmariKaronen. Yes, that was the intent, but apparently I was a bit distracted when I first wrote this up. Fixed now. — John Bollinger, Jan 24 '24 at 21:04

score 4 · Andy · Answer 4 · 2024-01-24T02:28:22.643

You can do it in one call to the rand() function, with a perfectly uniform distribution, if you're willing to use large integers.

Here's the basic idea:

List out all of the possible ways to choose K distinct integers from the range [0, N-1].
Throw a dart at the list, choosing one of those ways.

Once you've done this, you'll realize that you don't actually need that list of possibilities: You can use a mathematical formula to compute the selection choice directly from the id, in O(1) time, without the list.

For example, if you're trying to select k = 2 distinct items from n = 3 options, then you could use the list:

L(3, 2) = ["011", "101", "110"]

...where "110" means that you would select the first two items, and ignore the third. So, if rand() generates "id = 2", then you'd look up the binary sequence at position 2 ("110"), and select those first two items.

In general, there are $M = \binom{N}{K} = \frac{N!}{K!\,(N-K)!}$ possible ways to select K distinct integers from the set [0, N-1]. So a single call to the random number generator, rand(M), will randomly choose one of those possibilities.

All you need to get started is a canonical list of the possibilities--that is, a canonical ordering of the list.

Really, any ordering will work. But here's one that has a nice recursive formula:

L(n, 0) = ["0...0"] (where "0...0" has length n), and
L(n, n) = ["1...1"] (where "1...1" has length n), and
L(n, k) = [], for n < k, and
L(n, k) = ["0" + L(n-1, k), "1" + L(n-1, k-1)]
...where "1" + "001" = "1001" is string concatenation.

This lets you build up a canonical list of any size.

For example, if you'd like to generate the list from earlier, L(n, k) where n = 3 and k = 2, then you could do it like this:

L(1, 1) = ["1"]
L(2, 2) = ["11"]
L(1, 0) = ["0"]
L(2, 0) = ["00"]
L(3, 0) = ["000"]
L(2, 1)
 = ["0" + L(1, 1), "1" + L(1, 0)]
 = ["0" + ["1"], "1" + ["0"]]
 = ["01", "10"]
L(3, 1)
= ["0" + L(2, 1), "1" + L(2, 0)]
= ["0" + ["01", "10"], "1" + ["00"]]
= ["001", "010", "100"]
L(3, 2)
= ["0" + L(2,2), "1" + L(2,1)]
= ["0" + ["11"], "1" + ["01", "10"]]
= ["011", "101", "110"]
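The recursive definition and the worked example above can be reproduced with a few lines of Python. This is just a direct transcription of the four rules for small inputs; the function name build_list is made up here, and the list it builds has $\binom{n}{k}$ entries, so it is only practical for tiny n and k.

```python
def build_list(n, k):
    """Return the canonical list L(n, k) of length-n bit strings with k ones."""
    if n < k:
        return []              # L(n, k) = [] for n < k
    if k == 0:
        return ["0" * n]       # L(n, 0) = ["0...0"]
    if k == n:
        return ["1" * n]       # L(n, n) = ["1...1"]
    # L(n, k) = ["0" + L(n-1, k), "1" + L(n-1, k-1)]
    with_zero = ["0" + s for s in build_list(n - 1, k)]
    with_one = ["1" + s for s in build_list(n - 1, k - 1)]
    return with_zero + with_one


print(build_list(3, 2))  # ['011', '101', '110']
```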

This allows you to compute the canonical list for any problem size. This is already enough to get a solution using dynamic programming.

But you can do better. The recursive formula also yields a closed-form solution, and its inverse, g(id), directly computes the binary sequence.

This is a high-effort approach, but it does yield a fantastic O(1) solution that involves only one call to the random number generator.
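The answer does not spell out g(id), so the following is only one possible reading of it: a sketch that walks the recursive definition of L(n, k) without building the list, using the fact that the "0" + L(n-1, k) part contains $\binom{n-1}{k}$ entries. The names unrank and sample_via_rank are assumptions for this example. Note that this version takes n steps of big-integer arithmetic, so it illustrates the mapping for small n rather than demonstrating an O(1) method.

```python
import math
import random


def unrank(n, k, ident):
    """Return the bit string at position ident of the canonical list L(n, k)."""
    if not 0 <= ident < math.comb(n, k):
        raise ValueError("ident out of range")
    bits = []
    while n > 0:
        zeros_first = math.comb(n - 1, k)  # size of the "0" + L(n-1, k) part
        if ident < zeros_first:
            bits.append("0")
        else:
            bits.append("1")
            ident -= zeros_first           # skip past the "0..." entries
            k -= 1
        n -= 1
    return "".join(bits)


def sample_via_rank(n, k, rand=random.randrange):
    """Pick k distinct indices from [0, n) using a single call to rand."""
    ident = rand(math.comb(n, k))
    return [i for i, bit in enumerate(unrank(n, k, ident)) if bit == "1"]


print(unrank(3, 2, 2))  # '110'
print(sample_via_rank(20, 5))
```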

edited Jan 24 '24 at 02:28

answered Jan 23 '24 at 21:07

You just need to choose an enumeration - Mapping log2(M) bits of entropy to an enumeration is a critical part of doing this efficiently. Either one query at a time, or constructing an array or set data structure of results. If there are known algorithms for this, your answer would be much better if you name and/or link one. Without that, interesting mathematical insight... – Peter Cordes Jan 23 '24 at 21:14
All right, I'll add an example. – Andy Jan 23 '24 at 21:38
Your answer needs to map bit-sequences back to sets, not the reverse. Is that also efficiently possible? – Peter Cordes Jan 24 '24 at 00:06
Yes. It's an O(1) lookup if you've previously computed the list of sequences. If you didn't build the list of sequences, but went straight to the formula, then it's g(id) = binary_sequence, involving factorials. As you mentioned, it's the inverse of the enumeration function. And it's a polynomial, so it is computationally efficient--O(1) again--but it does require big ints. – Andy Jan 24 '24 at 00:20
What size of BigInt? Assuming they scale with N or M, that can't be O(1). The problem is to generate a list of K elements, and saying you've solved it in O(1) by generating one huge random integer instead of multiple is not realistic for any sane cost-model. Real-world entropy sources give you some fixed width per sample, like 64 bits at a time. (Yes you could build super wide hardware, but then we'd need to talk about the cost of that and how wide it needed to be, in terms of M.) – Peter Cordes Jan 24 '24 at 03:54
Anyway, BigInt addition is linear in number of bits, so log2 of values. BigInt multiplication with Toom-Cook is $O(n^{1.465})$ in the number of bits (practical BigInt implementations do use Toom-Cook for big problems, it is practically useful, unlike some asymptotically-better algorithms). Division is at least that, with higher constant factors https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations / Did the 2019 discovery of O(N log(N)) multiplication have a practical outcome? – Peter Cordes Jan 24 '24 at 04:04
This reminds me of the warning in the documentation of Python's random.shuffle: "Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator." – Stef Jan 24 '24 at 17:55
OP says "$N$ could be up to $10^{18}$", and the number of combinations grows exponentially in $K$ for $K \ll N$, so computing the list explicitly is infeasible. – Pablo H Jan 25 '24 at 14:36
The critical value is $M = \binom{N}{K}$, which is small when K is small. So this could work, even for a very large value of N. If K also rises, though, then this approach is dead in the water, unless you can switch over to the g(id) function approach. I think we need more explanation around g(id). The version I was imagining involved a dynamic-programming approach that partitioned the id range, essentially seeking through a tree to compute the binary sequence. – Andy Jan 25 '24 at 19:15

Ilmari Karonen · Answer 5 · 2024-01-24T21:19:40.050

Theo H's suggestion of using Floyd's algorithm is a good one, but requires knowing the number of elements we wish to sample in advance. However, there's a fairly similar algorithm that allows the sampled elements to be generated one at a time "on demand".

The basic idea is to run a partial Fisher–Yates-Durstenfeld–Knuth shuffle on a sparse array, storing only the elements that have changed from their initial values.

To understand this algorithm, it may be useful to start with a quick recap of the basic FYDK shuffle (zero-based ascending-index version with initially sorted array):

Let a be an n element array initialized with a[k] := k for all 0 ≤ k < n.

For each i from 0 to n-1:

Let j be a uniformly chosen random integer with i ≤ j < n.

If i ≠ j, swap a[i] and a[j].

Now a is a random permutation of the integers from 0 to n-1.
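For reference, the dense-array shuffle just described could look like this in Python. It is a minimal sketch (the name fydk_shuffle is chosen for this example), and it allocates all n elements, which is exactly what the sparse variant discussed in this answer avoids.

```python
import random


def fydk_shuffle(n):
    """Return a uniformly random permutation of 0..n-1 (ascending-index form)."""
    a = list(range(n))               # a[k] = k for all 0 <= k < n
    for i in range(n):
        j = random.randrange(i, n)   # uniform integer with i <= j < n
        if i != j:
            a[i], a[j] = a[j], a[i]
    return a


print(fydk_shuffle(10))
```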

Looking at the FYDK shuffle algorithm above, we can see a few notable features and invariants:

In each loop iteration, the new value of a[i] is chosen uniformly at random from among the values not yet assigned to any a[k] with k < i.
The value assigned to a[i] in each loop iteration is never overwritten by later iterations.
By induction on the above, at the end of each loop iteration, a[0] … a[i] is a uniformly chosen (and uniformly shuffled) random sample of i+1 distinct integers from the range [0, n).

In particular, this leads to the following observations:

If we stop the FYDK shuffle (as given above) after k ≤ n iterations, the first k elements of the array a form a uniformly chosen (and uniformly shuffled) random sample of k distinct integers from the range [0, n).
During the first k iterations, at most 2k elements of a are changed from their initial values. Thus, if k is much smaller than n, we can save time and storage space by using a sparse map to only store the changed elements of the array a.

Also, since the first k elements of a are never changed after the first k iterations of the FYDK shuffle loop, we can easily transform the algorithm into a coroutine that yields the k-th element to the caller immediately after the k-th iteration:

Let a be a sparse map with default values a[k] = k for all 0 ≤ k < n.

For each i from 0 to n-1 (or until caller stops iteration):

Let j be a uniformly chosen random integer with i ≤ j < n.

Yield a[j].

Let a[j] := a[i].

Unset a[i].

…or, equivalently (using a map without implicit default values):

Let a be a sparse map.

For each i from 0 to n-1 (or until caller stops iteration):

Let j be a uniformly chosen random integer with i ≤ j < n.

If j is a key in a, yield a[j], else yield j.

If i is a key in a, let a[j] := a[i], else let a[j] := i.

Unset a[i].

(The "unset a[i]" step at the end of the loop is not strictly necessary for correctness, but since no later iteration will access a[i], there is no point in storing it any longer.)

Anyway, here's an example implementation in Python:

import random

def sample_range(start, stop=None, step=1):
    """
    Yields a stream of random unique integers from range(start, stop[, step]) using only
    O(k) storage for k outputs. This is done using the Fisher-Yates-Durstenfeld-Knuth
    shuffle on a sparse array (implemented using a dict) that only stores elements that
    have been changed and not yet yielded to the caller. The first k outputs form a
    uniformly chosen (and shuffled) random sample of k distinct elements from the range.
    """
    if stop is None:
        # Python convention: range(n) == range(0, n)
        start, stop = 0, start

    # Conceptually this dictionary is initialized so that a[k] = k for all k in the range,
    # but we only explicitly store values for entries that are changed from this default.
    a = dict()

    for i in range(start, stop, step):
        j = random.randrange(i, stop, step)
        # Conceptually we swap a[i] and a[j] and then yield a[i] to the caller. However,
        # as a[i] will never again be accessed after this, we can save some time and
        # storage space by leaving it unset.
        yield a.get(j, j)   # a.get(key, default) returns a[key] if set, else default
        a[j] = a.pop(i, i)  # a.pop(key, default) does the same, but also unsets a[key]

(And here's an online demo with a basic test of correctness and uniformity.)
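To show how the "on demand" aspect might be used for the original question, here is a small usage sketch. It restates the generator in condensed form (zero-based, step 1) under the hypothetical name lazy_unique so that the snippet is self-contained, and takes the first K values with itertools.islice; the caller could just as well keep pulling values one at a time.

```python
import itertools
import random


def lazy_unique(n):
    """Condensed restatement of the sparse shuffle: unique integers from [0, n)."""
    a = {}  # stores only the entries that differ from the default a[k] = k
    for i in range(n):
        j = random.randrange(i, n)
        yield a.get(j, j)
        a[j] = a.pop(i, i)


# Draw K = 10 distinct values from [0, 10**18) without materializing the range.
first_ten = list(itertools.islice(lazy_unique(10**18), 10))
print(first_ten)
```

Only the iterations actually consumed are ever executed, so the cost is one random draw and a couple of dict operations per value taken.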

FWIW, I do not know if this particular algorithm / coroutine has a name. I've used and written about it before, but for all I know I could be the first person to invent it. Given its simplicity, however, I very much doubt that.

In particular, I rather suspect that, had someone described this algorithm to Knuth 50 years ago, he would've considered it a natural and obvious extension of the FYDK shuffle. Which of course doesn't mean that it's necessarily quite as obvious to us lesser mortals.

(Edit: In fact, except for trivial indexing differences, I believe this is essentially the same algorithm as the one sketched out in John Bollinger's answer.)
