I read on Wikipedia and in lecture notes that if a lossless data compression algorithm makes a message shorter, it must make another message longer.
For example, this set of notes says:
> Consider, for example, the 8 possible 3 bit messages. If one is compressed to two bits, it is not hard to convince yourself that two messages will have to expand to 4 bits, giving an average of 3 1/8 bits.
There must be a gap in my understanding, because I thought I could compress all 3-bit messages this way:
- Encode: If it starts with a zero, delete the leading zero.
- Decode: If the message is 3 bits, do nothing. If the message is 2 bits, add a leading zero (see the sketch after this list).
- Compressed set: 00, 01, 10, 11, 100, 101, 110, 111
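To make this concrete, here is a minimal Python sketch of the encode/decode rules I described (my own illustration, not from the notes); each 3-bit message does round-trip correctly when handled on its own:

```python
# Sketch of the scheme from the question: drop a leading zero on encode,
# restore it on decode if the codeword is only 2 bits long.

def encode(msg: str) -> str:
    """Drop the leading zero if the 3-bit message starts with '0'."""
    return msg[1:] if msg.startswith("0") else msg

def decode(code: str) -> str:
    """Re-add the leading zero if the codeword is only 2 bits long."""
    return "0" + code if len(code) == 2 else code

for m in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    c = encode(m)
    assert decode(c) == m
    print(m, "->", c)
```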
What am I getting wrong? I am new to CS, so maybe there are some rules/conventions that I missed?
More formally, we encode sequences of bits, and a sequence should have a start and an end. If all the messages are 3 bits, we can omit the start and end, because they give no additional information. But if the length varies, then you do need delimiters.
– Shaull Feb 19 '13 at 18:02
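To illustrate the point about delimiters (this example is mine, not part of the comment above): once codewords of different lengths are concatenated without any boundary information, the bit stream can be split into valid codewords in more than one way. A short Python check shows one ambiguous stream for the codeword set from the question:

```python
# Hedged illustration: with the variable-length codewords from the question,
# a concatenated stream can be parsed in more than one way, so the message
# boundaries (or lengths) must be conveyed separately.

CODEWORDS = ["00", "01", "10", "11", "100", "101", "110", "111"]

def parses(stream: str):
    """Return every way to split `stream` into valid codewords."""
    if stream == "":
        return [[]]
    results = []
    for cw in CODEWORDS:
        if stream.startswith(cw):
            for rest in parses(stream[len(cw):]):
                results.append([cw] + rest)
    return results

# "100100" splits as 100|100 or 10|01|00 -- two different message sequences.
print(parses("100100"))
```

Running this prints `[['10', '01', '00'], ['100', '100']]`, i.e. the same stream could decode to the messages 010, 001, 000 or to 100, 100, which is exactly why the length information cannot simply be omitted.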