
I'll apologize in advance if anything in here is ineloquent. Suppose we have a pair of lossless compression (C) and decompression (D) functions.

compressed = C(uncompressed)

and

uncompressed = D(compressed)

Because the system is lossless, any data passed through C should map to a single, unambiguous output. The mapping would essentially be 1:1; otherwise two uncompressed inputs would map to the same compressed output and it would be impossible to determine which one should be chosen on decompression.

Likewise, each compressed input to D should map to exactly one uncompressed output, otherwise the system is not deterministic. This is the same 1:1 mapping viewed in reverse: D is the inverse of C.
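To make that concrete, here's a quick Python sketch (my own illustration; zlib is just standing in for C and D, any lossless codec would behave the same way):

    import os
    import zlib

    def C(uncompressed: bytes) -> bytes:
        # zlib stands in for an arbitrary lossless compressor
        return zlib.compress(uncompressed)

    def D(compressed: bytes) -> bytes:
        return zlib.decompress(compressed)

    # Lossless means the round trip is the identity: D(C(x)) == x for every x.
    for _ in range(1000):
        x = os.urandom(100)  # a random 100-byte input
        assert D(C(x)) == x

    # Injectivity follows: if C(a) == C(b), then a == D(C(a)) == D(C(b)) == b,
    # so two distinct uncompressed inputs can never share one compressed output.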

In essence, this could be thought of as an encoding, where one input is translated to another output and vice-versa, but with the goal of (most of the time) producing an output with a smaller file size. For that to work, though, there must be inputs to C whose compressed output is larger, perhaps significantly larger, than the original input.

Put another way: if I had a disk containing every possible 100-byte, 8-bit file, I'd have something like 6.67e242 source bytes of input to C. Although many of the compressed representations would be smaller, many others would have to be larger to satisfy the 1:1 relationship. It seems like the total compressed size would still be about 6.67e242 bytes, so the mean size of each compressed file would be about 100 bytes. Maybe larger.
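To put rough numbers on that, here's a back-of-the-envelope Python sketch (my own counting exercise, not tied to any real codec) showing how few of those inputs could possibly get smaller:

    # All 100-byte files = all bit strings of length 800. Count how many
    # possible outputs are at least k bits shorter and compare to 2**800 inputs.
    n_bits = 800
    num_inputs = 2 ** n_bits

    def outputs_shorter_by(k: int) -> int:
        # every bit string of length 0 .. n_bits - k
        return 2 ** (n_bits - k + 1) - 1

    for k in (8, 80, 400):  # shrink by at least 1, 10, or 50 bytes
        frac = outputs_shorter_by(k) / num_inputs
        print(f"at most {frac:.3e} of inputs can shrink by {k} bits or more")

    # Every input that can't claim one of those shorter outputs must stay the
    # same size or grow, since an injective C can't reuse an output.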

The "maybe larger" bit leads to another intuition I had, which says that an ideal decompression function should return an output for every single input. If certain inputs generate an error (i.e. because of data alignment issues, unexpected truncation, checksum failures) that implies that there is wasted space in the mapping that is causing less than 100% optimal usage of the compressed space.

My questions are: 1) Is my understanding of how this works correct? That is, is the mean compression ratio of random data 1.0 or maybe worse? 2) If a decompression function can fail to decode some inputs, is it not making fully optimal use of the compressed domain?

Are there formal names for this that I can read up on? I feel like this is a problem space that's been well-explored already, but I'd love to see a more formal definition of what is currently a nebulous intuition in my mind.

Comments:

  • I read your statement (eloquent, appropriate or not) of data compression fundamentals as consistent with my notion thereof. 2) does _not_ guarantee anybody will be able to find/implement a bijective compression that is better than the best of non-bijective ones. – greybeard Mar 23 '15 at 19:20
  • In particular, @Vor's answer: http://cs.stackexchange.com/a/7533/7459 and his pointer to Kolmogorov Complexity. – Wandering Logic Mar 23 '15 at 19:54
  • Ah, yes. @andrej-bauer's answer on the other page sums it up pretty much perfectly: "So, the best compression scheme in the world is the identity function! Well, only if we want to compress random strings of bits. The bit strings which occur in practice are far from random and exhibit a lot of regularity." – smitelli Mar 25 '15 at 13:32

0 Answers