Decrypting two XORed compressed messages?

Question

Suppose that, through a previous attack, cipher misuse, a two-time pad, etc., I obtain $c = m_1 \oplus m_2$, where $m_1$ and $m_2$ are compressed messages, for instance gzipped English text documents. Can we recover anything significant about the messages?

Obviously, if we can guess the two prefixes, we can verify the guess. Can we do more? Can something practical be done about the end of the messages if we fail to decipher the beginning, or do not even have the start of the message?

Since the messages are compressed, I don't know of any common substrings except in the header, nor do I know how to validate whether a short fragment in the middle is plausible.
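To make the "guess and verify" step concrete, here is a minimal sketch. It assumes both messages were compressed with the same settings and a bare gzip header (no timestamp or file name), and that $c$ is truncated to the shorter stream. The helper names `gzip_bytes`, `xor_bytes` and `check_guess` are made up for illustration.

```python
import zlib

def gzip_bytes(data: bytes, level: int = 9) -> bytes:
    """Gzip-compress with zlib; wbits=31 selects the gzip container."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    return comp.compress(data) + comp.flush()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))

def check_guess(c: bytes, guess: bytes):
    """Return whatever the partner stream decodes to, or None if it is rejected.

    Assumption: the guessed plaintext was compressed exactly like gzip_bytes does.
    """
    candidate = xor_bytes(c, gzip_bytes(guess))
    try:
        # decompressobj tolerates a truncated stream and returns partial output
        return zlib.decompressobj(31).decompress(candidate)
    except zlib.error:
        return None

m1 = (b"The quick brown fox jumps over the lazy dog while the committee "
      b"debates whether compression before encryption leaks anything useful.")
m2 = b"attack at dawn"
c = xor_bytes(gzip_bytes(m1), gzip_bytes(m2))

recovered = check_guess(c, m1)
print(recovered)
```

A `None` result rejects the guess. A non-`None` result only means the candidate stream was not rejected, so it should still be checked for plausibility.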

Sort of related, but here it might be possible to brute force uncompress $m_1$ and $m_2$ until you get $c$. See Can compressed data be made to look like random data — daniel, Oct 10 '17 at 11:56
Recovering the content of gzip's dynamic huffman tables sounds very annoying. — CodesInChaos, Oct 10 '17 at 12:01
@LuisCasillas but there's no compression in that question, crib dragging wouldn't work here — daniel, Oct 10 '17 at 21:46

fgrieu · Answer 1 · 2017-10-13T19:37:09.623

5

Update: a premise in the former answer did not resist the acid test of experiment. This whole answer was thus very wrong.

Thanks to daniel's comment for opening my eyes.

edited Oct 13 '17 at 19:37

answered Oct 11 '17 at 06:50

On Wikipedia it sounded like each block can contain an arbitrary (RLE-compressed) Huffman table. Shouldn't that table depend on the whole content of the block? – CodesInChaos Oct 11 '17 at 09:07
@CodesInChaos: I wrote the answer on the assumption that mode 2 from the "Putting it all together" section of the first reference in the answer is used, where Huffman tables are not in the output (but rather built jointly by the compressor and decompressor). I have not tried to verify that assumption. – fgrieu Oct 11 '17 at 10:25
I like the idea of narrowing the question to one block, '01 - compressed with fixed Huffman codes', so the 3 header bits for both messages would be 101 and c = 000... That said, isn't the 4th bit dependent on all the plaintext bits from both messages? (The later parts of the plaintext change the order of the Huffman codes.) – daniel Oct 11 '17 at 10:42
Why I don't think this works: "beginning of the messages" should read "short messages", since plaintext after the beginning will compress so as to change every bit of m1. – daniel Oct 12 '17 at 09:28
@daniel: I did a short test gzipping a file and its prefix, and it confirms what you state: the compressed form of the prefix is not close to being a prefix of the compressed form of the longer file; hence the attack can only work when one of the files can be guessed. I have no time to fix the answer right now, but that needs to be done! – fgrieu Oct 12 '17 at 10:47
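The kind of test described in the comment above can be sketched as follows. This is not the commenter's actual script. The word list, the seed and the helper names are illustrative assumptions, and the text is synthetic rather than real English.

```python
import random
import zlib

def gzip_bytes(data: bytes, level: int = 9) -> bytes:
    """Gzip-compress with zlib; wbits=31 selects the gzip container."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    return comp.compress(data) + comp.flush()

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes on which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Synthetic "document": random words from a small, made-up vocabulary.
rng = random.Random(0)
words = ["cipher", "message", "stream", "block", "random", "header",
         "prefix", "attack", "compress", "table", "entropy", "key"]
text = " ".join(rng.choice(words) for _ in range(2000)).encode()
prefix = text[: len(text) // 2]

z_full = gzip_bytes(text)
z_prefix = gzip_bytes(prefix)
shared = common_prefix_len(z_full, z_prefix)

print("compressed prefix:", len(z_prefix), "bytes")
print("compressed full  :", len(z_full), "bytes")
print("shared leading bytes:", shared)
```

The two outputs share at least the fixed 10-byte gzip header that zlib writes. How far the agreement extends beyond that depends on the data and the compressor settings.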

Paul Uszak · Answer 2 · 2017-10-11T10:07:52.303

2

Well, your first couple of bytes will probably be zero, as they might be golden ones and cancel each other out. This would almost certainly confirm that the same algorithm was used on both messages. Not sure if this is of use though.
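A small sketch of the header cancellation, assuming both streams carry the bare 10-byte gzip header that zlib writes (no timestamp, no file name). The command-line `gzip` tool normally stores a timestamp and the original file name, so with real files fewer bytes may cancel.

```python
import zlib

def gzip_bytes(data: bytes, level: int = 9) -> bytes:
    """Gzip-compress with zlib; wbits=31 selects the gzip container."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    return comp.compress(data) + comp.flush()

z1 = gzip_bytes(b"first message, in plain English")
z2 = gzip_bytes(b"a second, unrelated message")

# XOR of the two compressed streams, truncated to the shorter one
c = bytes(x ^ y for x, y in zip(z1, z2))

# Count how many leading bytes of c are zero
leading_zeros = len(c) - len(c.lstrip(b"\x00"))

print(z1[:10].hex())   # the gzip header of the first stream
print(c[:16].hex())    # starts with a run of zero bytes
print(leading_zeros)
```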

The attacker's problem with compression is that some of it is very good indeed. fp8 will compress to within 0.1% of the theoretical Shannon limit. This means that the compressed file will be almost perfectly random. For example, a large fp8-compressed file comfortably passes both the ent and FIPS-140 tests for randomness. A typical file compressed with fp8 will easily achieve 7.999837 bits/byte of entropy as measured by ent.

The end is where it's interesting. You mention misuse. It might be that the two messages compress to two very different lengths. If these were then XORed without noticing, one end would be original compressed information. Only a few people know how fp8 works, but it's feasible that you might be able to recover fragments in less time than a brute force search would take. The attacker would only be fighting against the compression algorithm itself, and that's more Kerckhoffs than probability theory.
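As an illustration of the unequal-length case, here is a sketch using gzip rather than fp8. It assumes the shorter stream is zero-padded before the XOR instead of the result being truncated. The helper names are hypothetical. Under that assumption the tail of the longer stream survives unchanged, including the gzip trailer, which stores the CRC-32 and the length of the uncompressed data.

```python
import struct
import zlib

def gzip_bytes(data: bytes, level: int = 9) -> bytes:
    """Gzip-compress with zlib; wbits=31 selects the gzip container."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    return comp.compress(data) + comp.flush()

def xor_padded(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (the assumed misuse)."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\x00"), b.ljust(n, b"\x00")))

m_long = (b"The quick brown fox jumps over the lazy dog while the committee "
          b"debates whether compression before encryption leaks anything "
          b"useful to an eavesdropper who only sees the combined stream.")
m_short = b"attack at dawn"
z_long, z_short = gzip_bytes(m_long), gzip_bytes(m_short)

c = xor_padded(z_long, z_short)

# Everything past the end of the shorter stream is the longer stream verbatim.
tail = c[len(z_short):]

# The last 8 bytes of a gzip member: CRC-32 and size mod 2**32, little-endian.
crc, size = struct.unpack("<II", c[-8:])
print(len(tail), hex(crc), size)
```

Decoding the exposed Deflate bits without the start of the block is a separate and harder problem that this sketch does not attempt.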

If they both end up exactly the same length before XORing, the problem is hard. If you have no idea of what the messages could possibly be, you have 99.9979625% true randomness and 0.0020375% file format (from my example compression). The author's creativity in writing each original message forms a seed. The compressor forms a true randomness extractor with a 0.0020375% output error. If internal blocks overlap, the file format gets destroyed, and the error decreases very substantially. Tricky. NSA guys, what do you think?
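The two percentages appear to come from dividing the ent figure by the 8-bit maximum. Assuming that is how they were derived, and that ent computes the usual byte-frequency entropy with $p_b$ the fraction of bytes having value $b$:

$$H = -\sum_{b=0}^{255} p_b \log_2 p_b = 7.999837 \text{ bits/byte}$$

$$\frac{7.999837}{8} = 0.999979625 = 99.9979625\%, \qquad 1 - 0.999979625 = 0.000020375 = 0.0020375\%$$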

edited Oct 11 '17 at 10:07

answered Oct 10 '17 at 13:06

I feel like the problem can't be 'hard', or else you could stretch an OTP to message length = 1.5 x key length by adding the two rules "compress everything before sending, inject every 3rd key as m1 xor m2". Maybe a better word is tricky. – daniel Oct 10 '17 at 13:22
The second paragraph is full of nonsense. Compression doesn't come anywhere close to the theoretical limits on typical data. Shannon entropy is a property of a generation process, and not of an individual string of data. Kolmogorov complexity is, but it's literally impossible to compute. – CodesInChaos Oct 10 '17 at 13:49
In any realistic scenario I can think of it would be truncated to the shorter message. We will actually have two messages xored with the same key stream. – Meir Maor Oct 10 '17 at 15:12
@CodesInChaos So where did 7.999837 bits/byte come from? Is that not enough entropy for you? Speaking of nonsense... – Paul Uszak Oct 10 '17 at 15:19
@PaulUszak I don't know how you compute that value. But I assume it's some kind of upper bound, based on an assumption like "all the bytes in the file are independent" or "compression algorithm x will eliminate all redundancies", so it only shows the limitations of your estimation algorithm, and not fundamental information theoretical limitations. – CodesInChaos Oct 10 '17 at 17:13
@CodesInChaos I've edited to make it clear that the 7.9... value is from ent which uses the standard Shannon formula that we all rely on in this forum. – Paul Uszak Oct 11 '17 at 09:55
I like the approach of comparing the entropy rate in both text inputs to the ciphertext rate, and concluding that the problem can't be solved if the former is larger than the latter. I used that idea in my answer, as well as what goes on when one of the compressed streams is shorter. But I fail to see the relevance of the "7.999837 bits/byte of entropy as measured by ent", because that does not measure the quality of gzip's compression (which is what matters) in a meaningful way; much like when testing an RNG, this test only concludes that the gzip output can't be compressed by standard methods. – fgrieu Oct 11 '17 at 10:08
If the ent test is applied to gzip output at different levels of compression (except none), it will give similar results; yet it is possible to take the less compressed output, decompress it, recompress it with high compression (or a better compressor than gzip, e.g. bzip2, PPM*, PAQ8), and obtain a significantly more compressed result. Hence, the ent test of the compressed stream does not tell us anything useful about the compression level when its result is next to 1, which is the case at hand. – fgrieu Oct 11 '17 at 10:16
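The experiment described in this comment can be sketched roughly as follows. The `byte_entropy` function is an assumed stand-in for ent's entropy figure (byte-frequency Shannon entropy), and the input is synthetic text built from a made-up vocabulary, so the exact numbers will differ from real documents.

```python
import bz2
import math
import random
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte."""
    n = len(data)
    return -sum(k / n * math.log2(k / n) for k in Counter(data).values())

def gzip_bytes(data: bytes, level: int) -> bytes:
    """Gzip-compress with zlib; wbits=31 selects the gzip container."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    return comp.compress(data) + comp.flush()

# Synthetic text: random words from a small, illustrative vocabulary.
rng = random.Random(1)
vocab = ["cipher", "message", "stream", "block", "random", "header", "prefix",
         "attack", "compress", "table", "entropy", "key", "the", "of", "and", "a"]
text = " ".join(rng.choice(vocab) for _ in range(50_000)).encode()

fast = gzip_bytes(text, 1)                        # weak compression
best = gzip_bytes(text, 9)                        # strongest gzip level
original = zlib.decompressobj(31).decompress(fast)
recompressed = bz2.compress(original)             # a different compressor

for name, blob in [("gzip -1", fast), ("gzip -9", best), ("bzip2", recompressed)]:
    print(f"{name:8s} {len(blob):7d} bytes  {byte_entropy(blob):.4f} bits/byte")
```

On inputs like this the byte entropy of every output is typically close to 8 bits/byte even though the sizes differ, which is the comment's point: the figure says little about how much redundancy a better compressor could still remove.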
