8

I read recently about an idea: instead of storing the actual data, convert it to a string of digits and store the index at which this pattern occurs in some number, for example $\pi$. The idea is that the index would take up less storage space than the data itself.

Of course, we don't know whether $\pi$ is a normal number and hence we do not know if every finite decimal pattern occurs, but let's assume for the moment that it does (or one simply changes to some proven normal number, like the Copeland-Erdős constant).

The thing that struck me was whether the index of the data might actually be a larger number than the data itself. Is there some measure of the probability of finding a decimal sequence of length $n$ before the $m$th decimal place? For $\pi$ in this case, I doubt there's a general formula. Would it depend on the base?
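To make the concern concrete, here is a minimal sketch that looks up a few patterns in the Copeland-Erdős constant $0.235711131719\ldots$ (the primes concatenated) and compares the length of the index with the length of the pattern. The function names, the 50,000-digit cutoff and the sample patterns are arbitrary choices for illustration only.

```python
import math

def copeland_erdos_digits(n_digits):
    """First n_digits decimals of the Copeland-Erdos constant (primes concatenated)."""
    pieces = []
    total = 0
    candidate = 2
    while total < n_digits:
        # Plain trial division; adequate for the small primes needed here.
        if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            text = str(candidate)
            pieces.append(text)
            total += len(text)
        candidate += 1
    return "".join(pieces)[:n_digits]

def first_index(pattern, digits):
    """1-based position of the first occurrence of pattern, or None if absent."""
    pos = digits.find(pattern)
    return None if pos < 0 else pos + 1

digits = copeland_erdos_digits(50_000)
for pattern in ["7", "42", "2014", "31415"]:
    idx = first_index(pattern, digits)
    index_length = None if idx is None else len(str(idx))
    # Columns: pattern, index, digits in pattern, digits in index
    print(pattern, idx, len(pattern), index_length)
```

A pattern that does not occur within the chosen cutoff is reported as `None`, which simply means more digits would be needed.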

Information or references to other, similar ideas are also very welcome.

(Yes, I understand that this method is very impractical for everyday use; I just found the idea intriguing.)

naslundx
  • 1
    Related: http://math.stackexchange.com/questions/100477/probability-to-find-the-sequence-rar-in-a-random-uniform-bytes-stream-at-a/100494#100494 – leonbloy Jul 14 '14 at 20:37
  • The relevant property you want is not normality, but rather the much weaker property of simply having every finite digit string appear at least once (and hence, infinitely often) in the decimal expansion of $\pi.$ By the way, you might be interested in a related idea that appeared in Carl Sagan's novel Contact. – Dave L. Renfro Jul 14 '14 at 21:13
  • 1
    The link in your first paragraph is a joke, I'm afraid. – TonyK Jul 14 '14 at 22:03
  • @TonyK Really? I think they're aware that it isn't practical, but the source code seems alright (although I haven't tried it). – naslundx Jul 15 '14 at 07:44
  • Then it's an elaborate joke! As I explain in my answer, the concept itself is useless. – TonyK Jul 15 '14 at 07:51

3 Answers

7

You don't need to know anything about $\pi$ to answer this one $-$ you just need to know that you can't get something for nothing.

To be more precise: any compression algorithm (which is what your scheme is), if it makes some inputs shorter, must also make some inputs longer. If an algorithm were able to compress, say, all 10000-bit inputs into fewer than 10000 bits, then the number of possible outputs ($2^{10000} - 1$, counting every bit string shorter than 10000 bits) would be less than the number of possible inputs ($2^{10000}$) $-$ so you would inevitably have at least one pair of inputs compressing to the same output.
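The count of outputs can be checked directly. The following is only a small sketch of the counting argument; the helper name `count_shorter_strings` is made up for this illustration, and the empty string is counted as a possible output.

```python
from itertools import product

def count_shorter_strings(n):
    """Number of bit strings of length 0, 1, ..., n-1 (empty string included)."""
    return sum(2 ** length for length in range(n))

# Small case by brute-force enumeration: 3-bit inputs versus all shorter outputs.
inputs = ["".join(bits) for bits in product("01", repeat=3)]
outputs = ["".join(bits) for length in range(3) for bits in product("01", repeat=length)]
print(len(inputs), len(outputs))  # more inputs than outputs, so some pair must collide

# The same gap of exactly one holds at the size used in the argument.
print(count_shorter_strings(10000) == 2 ** 10000 - 1)
```

Because there is always one fewer short string than there are $n$-bit inputs, no lossless scheme can shorten every input of a given length.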

TonyK
3

Working in base 10 and assuming (a big assumption) that the digits of $\pi$ are random (uniformly distributed, iid), the probability of finding a given number with $k$ decimal digits before some position $n$ is difficult to compute in general (see e.g. here), and it might depend on the number itself.

A simplifying assumption is to treat all matching attempts as independent (ignoring overlaps); this is obviously false, but in many asymptotic settings it is a fair approximation. We would then have a geometric random variable with success probability $p=10^{-k}$, and its expected value would be $1/p = 10^{k}$, which is of the same order as the value of the number. Hence, under this very coarse approximation, the "index" is on average of the same magnitude as the number itself.
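A quick simulation can illustrate this estimate. It is only a sketch under the iid uniform digit assumption (it says nothing about $\pi$ itself); the function names, the seed and the two sample patterns are arbitrary choices for illustration.

```python
import random
from itertools import count

def first_index_in_random_digits(pattern, rng):
    """1-based start index of the first occurrence of pattern in an iid uniform digit stream."""
    k = len(pattern)
    window = ""
    for position in count(1):
        # Keep only the last k digits generated so far.
        window = (window + str(rng.randrange(10)))[-k:]
        if window == pattern:
            return position - k + 1

def mean_first_index(pattern, trials=2000, seed=0):
    """Average first index over independent simulated digit streams."""
    rng = random.Random(seed)
    samples = [first_index_in_random_digits(pattern, rng) for _ in range(trials)]
    return sum(samples) / trials

# Per-position success probability is 10**-k, so the mean should be of order 10**k.
for pattern in ["42", "11"]:
    print(pattern, mean_first_index(pattern))
```

Comparing a pattern that can overlap itself (such as `11`) with one that cannot (such as `42`) gives a feel for how the exact waiting time depends on the pattern, while the order of magnitude stays at $10^k$.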

leonbloy
1

The probability of the $n$th digit of a normal number (if it has essentially random digits) being a given digit is $\frac{1}{10}$. Therefore the probability of finding a given sequence of $m$ digits starting at the $n$th digit is $\frac{1}{10^m}$, so (treating the positions as independent) the probability of the sequence not having occurred after $n$ digits is $\left(1-\frac{1}{10^m}\right)^n$. One way to guess at the digit at which you might find the sequence is to find the point at which the probability of having found it is $\frac 1 2$, that is, $n = \log_{1-10^{-m}}\left(\frac 1 2\right) = \frac{\ln\frac 1 2}{\ln(1-10^{-m})} \approx 10^m \ln 2$. This turns out to be a number with just about the same number of digits as the sequence that you are trying to store, and so saves no space.
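The digit counts can be tabulated numerically. This is a minimal sketch under the same independence approximation; `median_index` is a name chosen here for illustration.

```python
import math

def median_index(m):
    """Solve (1 - 10**-m)**n = 1/2 for n, under the independence approximation."""
    # log1p keeps precision when 10**-m is tiny.
    return math.log(0.5) / math.log1p(-(10.0 ** -m))

for m in range(1, 11):
    n = round(median_index(m))
    # Columns: sequence length m, median index n, number of digits in n
    print(m, n, len(str(n)))
```

Since the median index is roughly $0.69 \cdot 10^m$, writing it down takes about as many digits as the sequence itself.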

Avi