10

I know there are different statistical test suites out there (NIST, Dieharder, etc.), each of which analyzes entropy in a different way.

What I'm having a hard time finding is any particular literature which describes how to go from those tests to actual bits of entropy in the byte stream.

How do you go from the p-value of a byte stream (say, 100 MB long) to bits of entropy?

Updated:

As mentioned below, you can't estimate the entropy from the output alone (it's simply not possible); you can only do so by understanding the physics of the underlying process that generates the entropy.

Blaze
  • The common notion of entropy is the notion of Shannon entropy. The Shannon entropy H(x) of a value x that occurs with probability Pr[x] is H(x) = -log_2(Pr[x]). Related questions: http://crypto.stackexchange.com/q/378/6961 and http://crypto.stackexchange.com/q/700/6961 – e-sushi Sep 18 '13 at 03:32
  • 2
    There are two issues with Shannon entropy: 1) It's only defined for a probability distribution, not for an individual string 2) Shannon entropy and average key-strength aren't exactly the same thing if the probability distribution isn't uniform. – CodesInChaos Sep 18 '13 at 11:32
  • I've made some progress here. When I have a full answer, I will post it (especially given all the great help I've gotten). I'm currently investigating Ping Li's work here: http://www.stanford.edu/group/mmds/slides2010/Li.pdf Ideally though I'd do something based off of NIST's work here: http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html – Blaze Sep 23 '13 at 22:16
  • Can you tell us more? Why do you want to measure the amount of entropy? What do you know about the source of the byte stream? The answer is going to depend heavily on the answers to these questions and on other details, so if you can give us more details, we might be more likely to be able to give you a good answer. This is not a simple subject with a simple one-line answer... – D.W. Oct 01 '13 at 22:49
  • No, I can't. I really want to simply estimate entropy. – Blaze Oct 03 '13 at 17:25
  • Well actually, I want to get very rich. That is really my XY problem. But for now, I guess this is an interesting problem. Even if it doesn't end up solving the other one. – Blaze Oct 03 '13 at 17:34
  • To properly estimate entropy you need a reasonable model of the underlying physics. And it's rather hard to get right, it's one of the reasons why supposedly secure PRNGs fail. – CodesInChaos Oct 03 '13 at 19:49
  • Have you come across "Entropy and Prefixes" by P. C. Shields? It might be useful. This approach looks for patterns in the string. I came across it today and I'm trying to figure out whether it is at all practical for a fixed and relatively small n (the length of the string). – Chrysanthi Pas Jan 04 '19 at 14:53

2 Answers

19

Entropy is a function of the distribution. That is, the process used to generate a byte stream is what has entropy, not the byte stream itself. If I give you the bits 1011, that could have anywhere from 0 to 4 bits of entropy; you have no way of knowing that value.

Here is the definition of Shannon entropy. Let $X$ be a random variable that takes on the values $x_1,x_2,x_3,\dots,x_n$. Then the Shannon entropy is defined as

$$H(X) = -\sum_{i=1}^{n} \operatorname{Pr}[x_i] \cdot \log_2\left(\operatorname{Pr}[x_i]\right)$$

where $\operatorname{Pr}[\cdot]$ represents probability. Note that the definition is a function of a random variable (i.e., a distribution), not a particular value!

So what is the entropy in a single flip of a fair coin? Let $F$ be a random variable representing that flip. There are two events, heads and tails, each with probability $0.5$. So, the Shannon entropy of $F$ is:

$$H(F) = -(0.5\cdot\log_2 0.5 + 0.5\cdot\log_2 0.5) = -(-0.5 + -0.5) = 1.$$

Thus, $F$ has exactly one bit of entropy, just as we expected.
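
To see the definition in action, here is a minimal Python sketch (my own illustration, not part of the original answer) that evaluates the formula above for a known distribution and reproduces the fair-coin result:

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy H(X) = -sum(Pr[x] * log2(Pr[x])) of a known distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))        # fair coin: exactly 1.0 bit
print(shannon_entropy([0.9, 0.1]))        # biased coin: ~0.47 bits
print(shannon_entropy([1 / 256] * 256))   # uniform byte value: 8.0 bits
```

Note that the function takes the probabilities themselves as input: you must already know the distribution, which is exactly the sticking point discussed below.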

So, to find how much entropy is present in a byte stream, you need to know how the byte stream is generated and the entropy of any inputs (in the case of PRNGs). Recall that a deterministic algorithm cannot add entropy to an input, only take it away, so the entropy of all inputs to a deterministic algorithm is the maximum entropy possible in the output.

If you're using a hardware RNG, then you need to know the probabilities associated with the data it gives you, else you cannot formally find the Shannon entropy (though you could give it a lower bound if you know the probabilities of some, but not all, events).

But note that in any case, you are dependent on the knowledge of the distribution associated with the byte stream. You can do statistical tests, like you mention, to verify that the output "looks random" (from a certain perspective). But you'll never be able to say any more than "it looks pretty uniformly distributed to me!". You'll never be able to look at a bitstream without knowing the distribution and say "there are X bits of entropy here."
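
To make that last point concrete, here is a hedged Python sketch (my own illustration; the stream construction is arbitrary): it deterministically stretches a single coin flip into a megabyte of data by hashing the bit with a counter, then applies a naive frequency-based entropy estimate. The estimate reports nearly 8 bits per byte, even though no more than 1 bit of entropy ever went in.

```python
import hashlib
import math
from collections import Counter

def expand_one_bit(bit, n_bytes):
    """Deterministically stretch one bit into n_bytes of random-looking data by
    hashing the bit together with a counter (at most 1 bit of real entropy)."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(bytes([bit]) + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n_bytes])

def empirical_entropy_per_byte(data):
    """Plug-in Shannon entropy of the observed byte frequencies.
    This measures how uniform the output *looks*, not how much entropy went in."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

stream = expand_one_bit(bit=1, n_bytes=1_000_000)
print(empirical_entropy_per_byte(stream))  # ~7.9998 bits/byte, yet true entropy <= 1 bit
```

Any purely output-based estimator is in the same position: it can only certify that the stream is consistent with a uniform distribution, which at best gives a ceiling on the entropy, never a guarantee.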

Reid
  • 2
    Hmm, that doesn't seem to jibe with what I've been asked, unfortunately. I think if you have enough data and enough tests, you should be able to come up with a reasonable estimate of entropy. – Blaze Sep 18 '13 at 00:33
  • @Blaze: Here's an example: the stream cipher AES-CTR is conjectured to be indistinguishable from random with an unknown key (in PPT). Despite that the only inputs with entropy to AES-CTR are the key and the nonce, I could generate terabytes of random data, and it's conjectured that you could not distinguish it from truly random in probabilistic polynomial time. Knowing that, it should be fairly apparent that what you seek is not possible... at least, not without knowing the process by which the random numbers were generated. – Reid Sep 18 '13 at 00:47
  • 1
  • Yes, that's absolutely true (and a great point), however the sources under discussion have not been mathematically obfuscated and the assumption is that they are bare before analysis. One example of entropy measurement of course is compression. If we were able to compress something 10 times more than something else, I think we can make some reasonable observations about how much more entropy it has. We can do heuristic analysis to come up with an estimate of the entropy bits, but I think it's nice (and generally expected) to have the results of experimental data to back that up with. – Blaze Sep 18 '13 at 01:42
  • @Blaze: Well, like I said, statistical tests (and things like the compression test you've mentioned) focus solely on how "random" the data looks in the light of whatever they test. But entropy has a precise, information-theoretic definition, and can't be accurately determined without knowing the distribution of the data you're looking at. Now you could say that this bitstream passes all of these batteries of statistical tests and therefore "looks like" it has $n$ bits of entropy (that is, it looks random). But the bitstream's entropy is unknown. – Reid Sep 18 '13 at 02:25
  • 1
    Well, the entropy measurement will never be accurate, unless perhaps it was some radioactively decaying isotope that obeys some particular law of physics. I think what most people these days are looking for is just a reasonable estimate. – Blaze Sep 18 '13 at 02:46
  • 2
    I think what people are looking for is heuristic analysis + statistical analysis. For example, if I flip a coin, I can theoretically say 1 bit of entropy. Now, let's flip that coin 10 thousand times and do a statistical analysis to see if that measures up. If I get all heads, then so much for my heuristic analysis... I'm pretty sure this is the motivation. The part I'm missing is how to go from statistical analysis to 'bits per byte' of entropy. Note this request comes from experts. – Blaze Sep 18 '13 at 02:53
  • 4
    @Blaze: My whole point is that you can't go from a statistical analysis to a measurement of entropy. If you happen to know how the bitstream is generated, then you can calculate the entropy directly (via the above formula) --- but if you don't know how it's calculated, then it simply can't be done. – Reid Sep 18 '13 at 03:45
  • I suspect you are correct if we were looking for an exact measurement rather than a rough estimate. – Blaze Sep 18 '13 at 04:05
  • (BTW Reid, I've learned a lot from your posts and I'm very grateful! I apologize if my writing here doesn't express that properly. I'm just trying to appease some statistical gods here and they're not very helpful) – Blaze Sep 18 '13 at 04:15
  • 3
    I would think we could pragmatically estimate entropy via a combination of statistics and good analysis. Empirically measure a statistical distribution on the lowest level of the entropy source that you can. Calculate $H$ where $Pr[x_i]$ is calculated from that distribution. How good an estimate of entropy this produces will depend on how good a choice was made for "lowest level". E.g., if you choose the output of a PRG seeded by 0, you will have a falsely high entropy estimate due to a poor decision. But if you choose a level the attacker won't have deeper insight into, the estimate should be good. – B-Con Sep 18 '13 at 04:58
  • 1
    Do we have anything better than that for practical entropy estimates in real life? We don't (well, rarely) deal with truly random events, just events where we have a distribution and very little additional insight beyond that distribution. (Top of my head examples: hard drive seek times, mouse coordinates, etc. They aren't truly random, but at some level of detail you're stuck with a distribution of behavior and no way to analyze the source any more finely.) – B-Con Sep 18 '13 at 05:03
  • 1
    This answer is mostly right. I think there is some room for tools which were neglected in the answer, but I do appreciate now what the answerer was trying to say. – Blaze Oct 02 '13 at 09:29
1

There are some tests out there: NIST Draft Special Publication 800-90B (National Institute of Standards and Technology).

In particular, the min-entropy, partial collection, Markov (useful for non-IID sources), collision, and compression tests.

The issue with the Markov test is its constraint on the bit size of the samples.
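
For a rough idea of what these estimators do, here is a hedged Python sketch of the simplest one, the most-common-value (min-entropy) estimate: take the empirical probability of the most frequent sample value and compute $-\log_2$ of it. The actual SP 800-90B estimator additionally applies an upper confidence bound to that probability, which this sketch omits, and the file name below is just a placeholder.

```python
import math
from collections import Counter

def naive_min_entropy_per_sample(samples):
    """Most-common-value estimate: H_min = -log2(p_max), where p_max is the
    empirical probability of the most frequent sample value. Like every
    output-only test, this can only upper-bound the source's real entropy."""
    counts = Counter(samples)
    p_max = counts.most_common(1)[0][1] / len(samples)
    return -math.log2(p_max)

# Hypothetical input: raw byte samples captured from the noise source.
with open("noise_samples.bin", "rb") as f:
    data = f.read()
print(naive_min_entropy_per_sample(data), "bits of min-entropy per byte (at most)")
```

Min-entropy is the conservative figure for security purposes because it is determined entirely by an attacker's single best guess.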

Updated:

These tests only measure the output. They don't measure the underlying entropy that was used to generate the data. You can take completely non-random data and make it look perfectly random according to these tests (or any tests).

Blaze
  • It's a fair point. I'll change the accepted answer. The standards were (are?) messed up. – Blaze Jun 08 '15 at 17:22
  • 4
    No, the standards are not messed up. Instead, they take a problem which (as Neil mentioned) is insolvable as defined, and try to come up with as good an answer as they practically can. What the tools do is come up with a *ceiling on the entropy* (we're pretty sure that there's no more than N bits of entropy there); that leaves open the question of a lower bound, but we don't know how to solve that problem. Anyone using the tool should realize its limitations. – poncho Jun 08 '15 at 18:53