
2 hours ago I thought I had this figured out, but now I am doubting myself and want someone to validate my algorithm.

I want to take a stream of k trusted random bits and convert it into groups of 5 digits, uniformly distributed over the range 0-99999. To do this I take the k bits and truncate to a multiple of 32; say the result now has n bits, where n = k - k % 32. To create one group of 5 digits we load 32 bits into a variable we'll call var_i; the result will be stored in var_e. Representing 5 digits from 0-99999 takes 17 bits, so we perform the loop:

    for (int i = 0; i < 17; i++)
      var_e += ((var_i >> i) % 2) << i;   /* copy bit i of var_i into var_e */

This way we preserve only the low-order 17 bits of var_i (with var_e starting at 0). Then we perform var_e = var_e % 100000 to truncate var_e to 5 decimal digits; var_e is now one complete group of 5 digits.
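
Putting the first method together, here is a compact sketch of what I mean (the masking is equivalent to the loop above; the function just takes the 32 loaded bits as its argument):

    #include <stdint.h>

    /* var_i holds 32 bits loaded from the trusted random stream */
    uint32_t group_of_5_digits(uint32_t var_i)
    {
        uint32_t var_e = var_i & 0x1FFFF;  /* keep the low 17 bits, same as the loop above */
        return var_e % 100000;             /* truncate to 5 decimal digits */
    }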

An alternative algorithm could be to just take a 32-bit number var_i and perform var_e = var_i % 100000. This would take just the last 5 decimal digits of the 32-bit number, which fall in the range 0-99999.
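
In code, under the same convention as above:

    uint32_t group_of_5_digits_alt(uint32_t var_i)
    {
        return var_i % 100000;   /* keep the last 5 decimal digits */
    }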

These algorithms both waste nearly 50% of the input data, but it would be fairly easy to change the first one to accept 17 bits of input.
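
For example, something like the following would consume exactly 17 bits of the stream per group; here the stream is represented one bit per array element, which is just one way to write the sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* bits[] holds the trusted random stream, one bit (0 or 1) per element;
       *pos is the index of the next unused bit and is advanced by 17 */
    uint32_t group_of_5_digits_17(const uint8_t bits[], size_t *pos)
    {
        uint32_t var_e = 0;
        for (int i = 0; i < 17; i++)
            var_e |= (uint32_t)bits[(*pos)++] << i;  /* assemble 17 bits */
        return var_e % 100000;                       /* then truncate to 5 digits */
    }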

chew socks
  • @fgrieu That answer does look helpful, and I will read it, but I would like to try to learn some more from your comments to this question. I just realized I left off what may have been a crucial piece of information: the input stream is trusted random. 1) Yes, it should be uniformly distributed; why do the two algorithms fail this? 2) One output for every 32 bits; each 32-bit block is not related to its neighbor in any way. 3) Assume now that neither computational nor data efficiency is important. – chew socks Apr 15 '13 at 05:56
  • Why would var_e = var_i % 100000 give an unbiased output? For example, 0 is reached when var_i is 0 or 100000, but 99999 is reached only when var_i is 99999. – fgrieu Apr 15 '13 at 05:59
  • @fgrieu I see what you are saying about the bias from mod now, I hadn't considered that case. Is there a way to eliminate the most significant digit without creating a bias? – chew socks Apr 15 '13 at 06:18
  • @fgrieu Could you please explain what the problem with the first method is, apart from the mod operator? – chew socks Apr 15 '13 at 06:19
  • @chewsocks The mod operator is the problem. You are reducing 2^17 different values into disjoint equivalence classes of 100000 elements (namely 0 to 99999). Since 100000 does not divide 2^17, you have one equivalence class which is smaller than 100000, and therein lies the bias (in fact, one equivalence class will have only 31072 elements instead of 100000, those elements are twice as likely to occur as the others - hardly a uniform distribution). For 2^32 the bias is smaller, but still present. a mod m is unbiased iff m divides the range of a. – Thomas Apr 15 '13 at 06:56
  • @chewsocks To verify it yourself, what I suggest is take a reduced version of your algorithm (say, 2^7 and 100 instead of 2^17 and 100000), then iterate over all possible inputs to the algorithm and plot a frequency histogram of the 100 possible outputs. A uniform distribution will be flat - but you won't get that (a sketch of this check appears after these comments). – Thomas Apr 15 '13 at 06:58
  • How do you distribute fairly $a$ apples to $b$ children if $a$ is not a multiple of $b$? (In your case $a = 2^k$ and $b = 100000$) – j.p. Apr 15 '13 at 13:13
  • @Thomas Your use of "equivalence class" is a bit confusing – the equivalence classes produced by the $\bmod$ operator are actually orthogonal to the ones you are describing. – Paŭlo Ebermann Apr 15 '13 at 15:42
  • Related to this newer question, with a simple and practical answer; and this one. – fgrieu Apr 15 '13 at 20:43
  • @Thomas I will try that, I think it will help me understand better. – chew socks Apr 18 '13 at 16:36
  • @fgrieu Thank you for linking to that question. Since my range is so much smaller than 32 bits, what I gather from that question is that it would be best to truncate to 17 bits, and then reject all numbers outside my range? – chew socks Apr 18 '13 at 16:43
  • @chew socks: Yes, while ((output = rnd32() & 0x1FFFF) > 99999); is a correct way to get a perfectly uniform output over 0-99999, assuming rnd32() returns 32 random bits (a fuller sketch appears after these comments). It is wasteful of input bits, but often that's a non-issue. – fgrieu Apr 18 '13 at 16:58
  • @fgrieu I graphed the distribution of the output and the result was a very "spiky" graph centered around some value. Is this what the distribution of random, uniformly distributed data should look like? – chew socks Apr 19 '13 at 01:27
  • @chew socks: What you observe is indicative that your initial source of random bits is not of cryptographic quality. – fgrieu Apr 19 '13 at 05:01
  • @fgrieu I am using /dev/random on Ubuntu linux. Graphing the input with no algorithm applied (finding the distribution from 0-255) yielded the same result. – chew socks Apr 19 '13 at 19:44
  • @chew socks: Now what you observe is indicative of bad reading of /dev/random, or graphing something with little visual significance to the naked eye like the values obtained vs iteration number. You want to count the number of times a value is reached over a number of experiments like 10 times the number of possible values or more, then graph that as a function of the number. It is a bit difficult making that for the range $[0\dots2^{32}-1]$, but it is easy with $[0\dots99999]$ and will reveal the problem with one of the alternatives considered, and perhaps the other with many experiments. – fgrieu Apr 20 '13 at 16:19
  • @fgrieu What would be the correct way to read from /dev/random? I was just using dd and a pipe to send the data through stdin. The graph I am generating is the value of a number vs. the number of times it occurs. – chew socks Apr 21 '13 at 23:09
  • I skipped the middle man and opened /dev/urandom right in my program, and got the same result. – chew socks Apr 21 '13 at 23:10
  • @fgrieu Do you mind looking at my distribution graphs? I ran the program twice to show that they are different each time. https://www.dropbox.com/s/iww8i5v0bucp7d4/rand_dist.pdf https://www.dropbox.com/s/d9g3e5s2v98q8nf/rand_dist2.pdf – chew socks Apr 21 '13 at 23:21
  • @chew socks: These two graphs look about right for the number of times a particular byte is reached (not 32-bit values or range [0..99999]). The spiky aspect comes from [a] joining points using lines for consecutive byte values (rather than just a cloud of points); [b] zooming-in on the band where most points are. If you draw that kind of graph, preferably without lines, for output=(rnd32()&0x1FFFF)%100000, computed for enough values, you will see a defect clearly. – fgrieu Apr 22 '13 at 07:26
  • @fgrieu Thank you. I was worried that I may have been over scrutinizing the graphs. I created the graph you suggested and I see the bias where some values are twice as likely to occur as others. – chew socks Apr 23 '13 at 03:35
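
A minimal sketch of the reduced-size check Thomas suggests in the comments: enumerate all 2^7 = 128 possible inputs, reduce each mod 100, and tally how often every output value occurs. Since 128 mod 100 = 28, the outputs 0..27 are hit twice while 28..99 are hit once, so the histogram is not flat.

    #include <stdio.h>

    int main(void)
    {
        int count[100] = {0};
        for (int x = 0; x < 128; x++)   /* every possible 7-bit input */
            count[x % 100]++;           /* reduce mod 100 and tally   */
        for (int v = 0; v < 100; v++)
            printf("%2d %d\n", v, count[v]);
        return 0;
    }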
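
And a sketch of the rejection-sampling one-liner from fgrieu's comment, with rnd32() filled in by reading /dev/urandom; this is just one possible implementation, with error handling kept minimal.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* one possible rnd32(): read 4 bytes from /dev/urandom */
    static uint32_t rnd32(void)
    {
        static FILE *f = NULL;
        uint32_t r;
        if (f == NULL && (f = fopen("/dev/urandom", "rb")) == NULL)
            exit(EXIT_FAILURE);
        if (fread(&r, sizeof r, 1, f) != 1)
            exit(EXIT_FAILURE);
        return r;
    }

    /* keep 17 bits; reject 100000..131071 and draw again */
    static uint32_t uniform_0_99999(void)
    {
        uint32_t output;
        while ((output = rnd32() & 0x1FFFF) > 99999)
            ;
        return output;
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            printf("%05u\n", (unsigned)uniform_0_99999());
        return 0;
    }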

0 Answers