Reservoir Sampling vs Round Robin

Question

You are given a List of numbers (length unknown).

Let's say the length is 10.

GetRandom(List) is called once. If implemented correctly, each number has 1/10 probability of being returned.

GetRandom(List) is called 100 times. If implemented correctly, each number should appear about 10 times in the result.

Fine?

You now have to do the same for a Stream of numbers.

GetRandom(Stream, 5) is called. This adds 5 to the Stream. The Stream now has length N = 1, so 5 is returned (probability = 1/N = 1).

GetRandom(Stream, 3) is called. 3 is added to stream. N=2. Either 3 or 5 is returned (prob = 1/2).

How will this be tested for correctness?
If GetRandom(Stream) (without adding any more numbers) is called 10 times when the length of the Stream is 2, each number (3 & 5) should be returned ~5 times.

GetRandom(Stream, 7) is called. 7 is added to Stream. N = 3. One of the 3 numbers (5, 3, 7) is returned (probability = 1/3).

But how will this be tested for correctness?
If GetRandom(Stream) is called 10 times when N = 3, each number is returned ~3 times.

So far, so good?

Alright, here is my algorithm:

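N = 0          # Number of items in the Stream so far
Pointer = 0    # Index of the item returned last (0 = nothing returned yet)

GetRandom(Stream, Number = NULL):
    Pointer += 1

    if Number is NOT NULL:
        append Number to Stream
        N += 1
    else:
        if Pointer > N:    # Moved past the last item
            Pointer = 1    # Reset

    return Stream[Pointer]    # Assume 1-based indexes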

This simply cycles through all numbers in order / round-robin fashion.

If GetRandom(Stream) is called 1000 times on a Stream with 100 numbers, each number will appear exactly 10 times.

Suppose GetRandom(Stream, 77) is called on a Stream with 100 numbers (so 77 is the 101st number) and Pointer has been reset to its initial location 1. If GetRandom(Stream) is then called 101 times, 77 is output on the 101st call, which satisfies the required probability of 1/101. If it is called 202 times, 77 is output again on the 202nd call, which satisfies 2/202.

Why bother with Reservoir Sampling k/k+1? Why bother with a random number generator?
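For comparison, here is a minimal sketch of the size-1 reservoir sampling that the question refers to. The class name `ReservoirSampler` and its `add` method are made up for illustration. It assumes that only one item is kept and that the stream itself is never stored, which is the usual reason to prefer it over indexing into a saved list.

```python
import random


class ReservoirSampler:
    """Keeps one uniformly chosen item from a stream of unknown length."""

    def __init__(self, rng=None):
        # A random.Random instance can be passed in to make runs reproducible.
        self.rng = rng if rng is not None else random.Random()
        self.n = 0          # Items seen so far
        self.sample = None  # The item currently kept

    def add(self, number):
        """Feed one new item and return the current sample."""
        self.n += 1
        # Replace the kept item with probability 1/n. After n items,
        # every item seen so far is the sample with probability 1/n.
        if self.rng.randrange(self.n) == 0:
            self.sample = number
        return self.sample


sampler = ReservoirSampler()
for value in (5, 3, 7):
    print(sampler.add(value))
```

Unlike the round-robin pointer, the value returned after each `add` cannot be predicted from the previous outputs.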

Average probability of the events is not the only measurement for randomness. Your algorithm yields a sequence with high autocorrelation. — Albjenow, Feb 13 '20 at 13:52
@Albjenow, how do you know that the correlation was a result of the algorithm and not of randomness? Are you saying that 'randomly generated numbers should NOT be sequential'? If you are imposing 'your' rules on 'randomness', that defeats the purpose of random, doesn't it? It's like saying "if I toss a coin twice and get HHTT, it's right, but if I get HTHT, it's wrong". – d-_-b, Feb 13 '20 at 20:12
I am surprised there are even CS@SE tags for what I think purely mathematical concepts. — greybeard, Feb 13 '20 at 21:43
https://cs.stackexchange.com/q/67087/755, https://cs.stackexchange.com/q/7729/755 — D.W., Feb 14 '20 at 00:58
You were the first to impose rules on randomness by saying "If GetRandom(Stream) is called on a Stream with 100 numbers, 1000 times, each number will appear exactly 10 times." According to that, a sequence of 1000 42s is not random. The most loose definition of randomness is a sequence without (obvious) patterns. ... — Albjenow, Feb 14 '20 at 07:45
... An attacker on a crypto system that uses your algorithm will spot the sequential pattern easily even though he cannot be sure it is really how your algorithm works without looking at the code. He could (and will) try guessing the next number. And if he breaks the system, no philosophical discussion on the nature of randomness will protect your data or that of your customers. To increase the strength of a random number generator, you should test for more than just one property of the sequence. This is not just my personal opinion but well established in the literature on RNGs. — Albjenow, Feb 14 '20 at 07:50

user3494047 · Answer 1 · 2020-02-18T15:30:06.830

3

It seems that you're asking why you should bother with reservoir sampling when you're able to trick the test that you wrote.

Round robin doesn't return random numbers. It returns numbers deterministically, or at least in a way that looks far more deterministic than reservoir sampling or other methods.

Your tests should be better. If you need the result to seem random and not deterministic, write a test that captures that instead of one based only on empirical probability.
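As a rough sketch of such a test, the snippet below measures how often an output is immediately followed by its successor in stream order. The helper names (`round_robin_outputs`, `random_outputs`, `successor_rate`) are hypothetical, and the stream is assumed to contain distinct values. Both generators pass the frequency test, but only one passes this one.

```python
import random


def round_robin_outputs(stream, calls):
    """Outputs of a round-robin picker: first item, second item, ... wrapping around."""
    return [stream[i % len(stream)] for i in range(calls)]


def random_outputs(stream, calls, rng):
    """Outputs of a picker that draws uniformly and independently on each call."""
    return [rng.choice(stream) for _ in range(calls)]


def successor_rate(stream, outputs):
    """Fraction of consecutive output pairs (a, b) where b directly follows a in the stream."""
    position = {value: i for i, value in enumerate(stream)}  # Assumes distinct values
    hits = sum(
        1
        for a, b in zip(outputs, outputs[1:])
        if position[b] == (position[a] + 1) % len(stream)
    )
    return hits / (len(outputs) - 1)


stream = list(range(100))
rr_rate = successor_rate(stream, round_robin_outputs(stream, 1000))
rnd_rate = successor_rate(stream, random_outputs(stream, 1000, random.Random(0)))
print(rr_rate, rnd_rate)
```

For round robin the rate is exactly 1, since every output is followed by its successor. For independent uniform draws over 100 values it should be near 1/100.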

EDIT to add a test example: another test for randomness (one that round robin would fail) is to run the same process many times and check that you don't get the same result every time. For uniform random sampling of a set/list/stream of fixed size n (which is what reservoir sampling outputs), the probability of getting one specific subset of size k should be 1/(n choose k). You can run your method once, then run it another 10000 times (or any number of times), and check that the first result comes up in only about 1/(n choose k) of those runs.
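Here is a minimal sketch of that repeated-run test. It uses `random.sample` as a stand-in for the output of a uniform subset sampler, and it models a fresh round-robin run as always returning the first k items. Both are assumptions for illustration, as are the names `repeat_rate`, `uniform_subset`, and `round_robin_subset`.

```python
import math
import random


def repeat_rate(draw, runs):
    """Run draw() once, then `runs` more times; return how often the first result recurs."""
    first = draw()
    return sum(draw() == first for _ in range(runs)) / runs


stream = list(range(10))  # n = 10
k = 2
rng = random.Random(1)


def uniform_subset():
    # Stand-in for a uniform sampler: every k-subset is equally likely.
    return frozenset(rng.sample(stream, k))


def round_robin_subset():
    # A fresh round-robin run always starts at the front of the stream.
    return frozenset(stream[:k])


expected = 1 / math.comb(len(stream), k)  # 1/45 for n = 10, k = 2
uniform_rate = repeat_rate(uniform_subset, 10000)
rr_rate = repeat_rate(round_robin_subset, 10000)
print(expected, uniform_rate, rr_rate)
```

The uniform sampler should repeat its first result in roughly 1/45 of the runs, while round robin repeats it in every run.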

edited Feb 18 '20 at 15:30

answered Feb 13 '20 at 13:13

user3494047


How do you determine 'randomness' other than by equality of distribution? https://i.pinimg.com/originals/2c/61/6a/2c616a9a9f542dffd56b10da4adb4bae.gif – d-_-b Feb 13 '20 at 20:10
How do you know that the correlation was a result of the algorithm and not of randomness? Are you saying that 'randomly generated numbers should NOT be sequential'? If you are imposing 'your' rules on 'randomness', that defeats the purpose of random, doesn't it? It's like saying "if I toss a coin twice and get HHTT, it's right, but if I get HTHT, it's wrong". – d-_-b Feb 13 '20 at 20:13
One test for randomness is to run the same process many times and check that you don't get the same result every time. For a set/list/stream of a fixed size, the probability of getting one specific subset of size k should be 1/(n choose k). You can run your method once, then another 1000 times (or any number of times), and check that the first result comes up in only about 1/(n choose k) of those runs. – user3494047 Feb 15 '20 at 11:58
In your coin toss example, what you're actually saying is that if I always (100% of the time) get HHTT and the probability of getting HTHT is 0%, then it's still like tossing a coin. – user3494047 Feb 15 '20 at 12:03
