I'll apologize in advance if anything in here is ineloquent. Suppose we have a pair of lossless compression (C) and decompression (D) functions.
compressed = C(uncompressed)
and
uncompressed = D(compressed)
Because the system is lossless, any data passed through C must map to a single, unambiguous output. This mapping would essentially be 1:1 (injective); otherwise two uncompressed inputs would map to the same compressed output, and it would be impossible to determine which one should be chosen on decompression.
Likewise, each compressed input to D should map to exactly one uncompressed output; otherwise the system is not deterministic. This is also a 1:1 mapping, the inverse of the one above.
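As a concrete illustration, here is a minimal round-trip check using Python's standard zlib module (just one lossless codec among many, standing in for C and D). The fact that D(C(x)) == x must hold for every x is what forces the mapping to be 1:1: if two different inputs compressed to the same output, the round trip could not recover both of them.

```python
import os
import zlib

# Round-trip check with one concrete lossless codec (zlib).
# If compress() ever mapped two different inputs to the same output,
# decompress() could not return both of them, so the round trip
# would have to fail for at least one of them.
for _ in range(1000):
    original = os.urandom(100)               # a random 100-byte file
    compressed = zlib.compress(original)
    restored = zlib.decompress(compressed)
    assert restored == original              # lossless: D(C(x)) == x
```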
In essence, this could be thought of as an encoding, where one input is translated to another output and vice-versa, but with the goal of (most of the time) producing an output with a smaller file size. But for that to work, there must be inputs to C for which the compressed output is larger, perhaps significantly so, than the input was.
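You can see this with real codecs. Here is a sketch, again using zlib purely as an example: random bytes are essentially incompressible, so the output carries the input more or less verbatim plus the format's own header and framing overhead, and ends up larger than the input.

```python
import os
import zlib

# Random bytes are essentially incompressible, so zlib can do little
# better than storing them verbatim plus its own header, block framing,
# and checksum overhead.
original = os.urandom(100)
compressed = zlib.compress(original, level=9)
print(len(original), len(compressed))  # typically 100 and something larger (~111)
```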
Put another way: If I had a disk containing every possible 100-byte, 8-bit file, that's 256^100 ≈ 6.67e240 files, or about 6.67e242 source bytes of input to C. Although many of the compressed representations would be smaller, many would also have to be larger to satisfy the 1:1 relationship. It seems like the total compressed size would also be about 6.67e242 bytes, so the mean size of each compressed file would be about 100 bytes. Maybe larger.
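The counting can be checked exactly with integer arithmetic (a sketch; the 100-byte / 256-symbol numbers are just the ones from the example above). There are far fewer byte strings shorter than 100 bytes than there are 100-byte strings, so only a small fraction of the inputs can map to a strictly shorter output; everything else must stay the same size or grow.

```python
# Counting argument for the 100-byte example above, using exact big integers.
n_inputs = 256 ** 100                                 # every possible 100-byte file
shorter_outputs = sum(256 ** k for k in range(100))   # every file of 0..99 bytes
print(shorter_outputs < n_inputs)                     # True
print(shorter_outputs / n_inputs)                     # ~0.0039: under 0.4% can shrink
```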
The "maybe larger" bit leads to another intuition I had, which says that an ideal decompression function should return an output for every single input. If certain inputs generate an error (e.g. because of data alignment issues, unexpected truncation, or checksum failures), that implies there is wasted space in the mapping, so the compressed space is being used at less than 100% efficiency.
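Real formats do behave this way. A quick sketch, once more with zlib as the example codec: almost no random byte strings are well-formed compressed streams, so the decompressor rejects them instead of mapping them back to some uncompressed file.

```python
import os
import zlib

# Most random byte strings are not well-formed zlib streams: the header,
# block structure, or checksum will be invalid, so decompress() raises.
rejected = 0
for _ in range(1000):
    candidate = os.urandom(100)
    try:
        zlib.decompress(candidate)
    except zlib.error:
        rejected += 1
print(f"{rejected}/1000 random inputs were rejected")  # almost always 1000/1000
```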
My questions are: 1) Is my understanding of how this works correct? That is, is the mean compression ratio over random data 1.0, or maybe worse? And 2) if a decompression function can fail to decode some inputs, is it not making optimal use of the compressed domain?
Are there formal names for this that I can read up on? I feel like this is a problem space that's been well-explored already, but I'd love to see a more formal definition of what is currently a nebulous intuition in my mind.