I read on Wikipedia and in lecture notes that if a lossless data compression algorithm makes a message shorter, it must make another message longer.
For example, this set of notes says:
> Consider, for example, the 8 possible 3 bit messages. If one is compressed to two bits, it is not hard to convince yourself that two messages will have to expand to 4 bits, giving an average of 3 1/8 bits.
There must be a gap in my understanding, because I thought I could compress all 3-bit messages this way:
- Encode: If it starts with a zero, delete the leading zero.
- Decode: If the message is 3 bits, do nothing. If the message is 2 bits, add a leading zero (see the sketch after this list).
- Compressed set: 00, 01, 10, 11, 100, 101, 110, 111
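To make this concrete, here is a minimal Python sketch of the encode/decode rules I described (my own illustration, not from the notes); each 3-bit message does round-trip correctly when handled on its own:

```python
# Sketch of the scheme from the question: drop a leading zero on encode,
# restore it on decode if the codeword is only 2 bits long.

def encode(msg: str) -> str:
    """Drop the leading zero if the 3-bit message starts with '0'."""
    return msg[1:] if msg.startswith("0") else msg

def decode(code: str) -> str:
    """Re-add the leading zero if the codeword is only 2 bits long."""
    return "0" + code if len(code) == 2 else code

for m in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    c = encode(m)
    assert decode(c) == m
    print(m, "->", c)
```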
What am I getting wrong? I am new to CS, so maybe there are some rules/conventions that I missed?
More formally, we encode sequences of bits, and a sequence should have a start and an end. If all the messages are 3 bits, we can omit the start and end, because they give no additional information. But if the length varies, then you do need delimiters.
– Shaull Feb 19 '13 at 18:02
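To illustrate the point about delimiters (this example is mine, not part of the comment above): once codewords of different lengths are concatenated without any boundary information, the bit stream can be split into valid codewords in more than one way. A short Python check shows one ambiguous stream for the codeword set from the question:

```python
# Hedged illustration: with the variable-length codewords from the question,
# a concatenated stream can be parsed in more than one way, so the message
# boundaries (or lengths) must be conveyed separately.

CODEWORDS = ["00", "01", "10", "11", "100", "101", "110", "111"]

def parses(stream: str):
    """Return every way to split `stream` into valid codewords."""
    if stream == "":
        return [[]]
    results = []
    for cw in CODEWORDS:
        if stream.startswith(cw):
            for rest in parses(stream[len(cw):]):
                results.append([cw] + rest)
    return results

# "100100" splits as 100|100 or 10|01|00 -- two different message sequences.
print(parses("100100"))
```

Running this prints `[['10', '01', '00'], ['100', '100']]`, i.e. the same stream could decode to the messages 010, 001, 000 or to 100, 100, which is exactly why the length information cannot simply be omitted.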