6

I am trying to find some measure for identifying and distinguishing between compressed and random data. I first tried computing the entropy of such data, but the entropy value is extremely high (almost maximal) in both cases, so that approach does not seem to work as a distinguisher.
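For reference, something along these lines is the kind of byte-level Shannon entropy measurement I mean (a minimal Python sketch, just for illustration, not my exact code); both a compressed file and random data come out close to the 8 bits/byte maximum:

    # Minimal sketch of a byte-level Shannon entropy measurement (illustration only).
    import math
    import sys
    from collections import Counter

    def shannon_entropy_bits_per_byte(data: bytes) -> float:
        counts = Counter(data)            # frequency of each byte value present
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f:
            print(f"{shannon_entropy_bits_per_byte(f.read()):.4f} bits/byte")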

I read about the chi-squared test but I have never used it (actually, I still have some problems interpreting the results). Does anybody know if this test can lead to better results?

tom
    I am not an expert in this, but the NIST tests might do a better job. – mikeazo Aug 15 '12 at 13:07
  • Wikipedia has a pretty good description of Pearson's chi-squared test. The challenge, really, is coming up with a suitable null hypothesis to test; for example, even very poor pseudorandom streams will usually satisfy the simple hypothesis that the frequencies of individual bytes are uniformly distributed, no matter what test you use. – Ilmari Karonen Aug 15 '12 at 14:01
  • See also http://crypto.stackexchange.com/questions/1287/example-of-chi-square-test-on-caesar-cipher – mikeazo Aug 15 '12 at 14:12

2 Answers

3

The NIST tools are a good starting point.

There is no general-purpose algorithm that will always distinguish compressed data from random data; the better the compression, the less redundancy is left for a statistical test to detect.

However, if you want to try the chi-squared test, you can compute a histogram of the byte values (how many 0 bytes you saw in the data, how many 1 bytes, etc.), and then use the chi-squared test to check whether this histogram deviates from what you'd expect for uniformly random data.
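For concreteness, here is a minimal sketch of that byte-histogram approach (my own illustration, not code from this answer), in Python using SciPy; the file named on the command line is whatever data you want to check:

    # Chi-squared test of a byte histogram against a uniform distribution (sketch).
    import sys
    from collections import Counter

    from scipy.stats import chisquare

    def byte_chi_squared(data: bytes):
        counts = Counter(data)
        observed = [counts.get(b, 0) for b in range(256)]  # frequency of each byte value
        # With no expected frequencies given, chisquare() assumes a uniform
        # distribution, i.e. len(data)/256 expected occurrences of each byte value
        # (255 degrees of freedom).
        result = chisquare(observed)
        return result.statistic, result.pvalue

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f:
            stat, p = byte_chi_squared(f.read())
        print(f"chi-squared = {stat:.1f}, p = {p:.4f}")
        # A very small p value suggests the byte frequencies deviate from uniform;
        # a p value that is not small is consistent with uniformly random bytes.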

D.W.
1

A late answer, but I recently had cause to perform some entropy estimation and calculated some chi-squared statistics.

For context, with uniformly distributed random bytes the expected chi-squared statistic is ~255 (the degrees of freedom for 256 byte values), which corresponds to a p value of ~0.5. Since one definition of randomness is incompressibility, it follows that you should not be able to differentiate a well-compressed file from a truly random one. The caveat, though, is the level of compression actually achieved. A compressed file requires control and format structures within it that significantly differentiate it from perfectly random data, and these structures skew the calculated p values in a chi-squared test. So some examples of compressed data:-

.zip p < 0.0001
.jpg p < 0.0001
.png p < 0.0001

Remember that random data would have p ~ 0.5 on average. More specifically, a Kolmogorov–Smirnov test of these p values should find them uniformly distributed between 0 and 1. So at this point my answer would be that yes, you can use a chi-squared test to identify random data (the compressed files above give themselves away with tiny p values).
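As a quick numerical check of those two claims (my illustration, not part of this answer; the simulated blocks below stand in for "truly random files"), SciPy can convert a chi-squared statistic at 255 degrees of freedom into a p value and run a Kolmogorov–Smirnov test over a batch of p values:

    # Sketch checking two claims above:
    # (1) a chi-squared statistic of ~255 at 255 degrees of freedom gives p ~ 0.5, and
    # (2) chi-squared p values from truly random byte blocks look uniform on [0, 1]
    #     under a Kolmogorov-Smirnov test.
    import numpy as np
    from scipy.stats import chi2, chisquare, kstest

    # (1) p value of the "target" statistic 255 at 255 degrees of freedom.
    print(chi2.sf(255, df=255))        # ~0.49, i.e. p ~ 0.5

    # (2) chi-squared p values for many simulated blocks of uniformly random bytes.
    rng = np.random.default_rng(0)
    p_values = []
    for _ in range(200):
        block = rng.integers(0, 256, size=65536)    # one simulated "random file"
        counts = np.bincount(block, minlength=256)  # byte histogram
        p_values.append(chisquare(counts).pvalue)

    # The p values should be indistinguishable from Uniform(0, 1); a large KS p value
    # here means "consistent with uniform".
    print(kstest(p_values, "uniform"))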

But compression algorithms have improved, and I found fp8, a PAQ8 derivative. It is the most powerful compression program I found that can be easily compiled. The same files, having been compressed by fp8, now give the following p values:-

.zip.fp8 p = 0.93
.jpg.fp8 p = 0.14
.png.fp8 p = 0.38

On prima facie evidence, these compressed files produce chi-squared p values consistent with fully random data. So my final answer is no, you cannot differentiate random data from compressed data using a chi-squared test.

Some further insight into the chi-squared statistic and p values might be had here.

Paul Uszak