Say you have a file that is not random, and you XOR every bit with a random bit (truly random, not pseudorandom). Can someone who sees only the result extract any information from it? Obviously, it won't be 100% accurate, but I imagine some sort of statistical analysis could give a vague idea. If yes, how? If no, is there a mathematical proof?
Adding a small note in addition to Reid's answer… remember to never reuse the random key-stream. (Just in case you weren't aware of related issues if you do.) – e-sushi Jun 25 '15 at 02:16
2 Answers
This cipher is called a one-time pad. It is unbreakable ("perfect secrecy") assuming that:
- The pad (the collection of random bits) really is truly random
- The pad is never reused to encrypt other messages
So, no information can be extracted from $\text{file} \oplus \text{random bits}$.
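The basic idea of the proof is that an attacker can try every possible key, but each key decrypts the ciphertext to a different plaintext of the same length, and every plaintext of that length is produced by exactly one key. Since all keys are equally likely, the attacker has no way of knowing which plaintext is actually correct. If I encrypt "attack" with a one-time pad, the resulting ciphertext is just as consistent with any other six-character string.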
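To make that argument concrete, here is a minimal Python sketch. The helper `xor_bytes` and the candidate words are my own illustrative choices, and `secrets.token_bytes` (the operating system's CSPRNG) only stands in for a truly random source. It shows that for any candidate plaintext of the right length you can construct a pad that produces the very same ciphertext.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (hypothetical helper for this sketch)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"attack"
# Assumption: the OS CSPRNG stands in for a truly random pad.
pad = secrets.token_bytes(len(plaintext))
ciphertext = xor_bytes(plaintext, pad)

# Every six-byte candidate has exactly one pad that maps it to this ciphertext.
for candidate in (b"attack", b"defend", b"retire"):
    candidate_pad = xor_bytes(ciphertext, candidate)
    assert xor_bytes(candidate, candidate_pad) == ciphertext
    print(candidate, "is explained by pad", candidate_pad.hex())
```

Because every pad is equally likely, none of these explanations is favoured over the others, which is why seeing the ciphertext tells the attacker nothing beyond its length.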

You do leak the length of the plaintext, unless you use some sort of padding. – SAI Peregrinus Jun 25 '15 at 03:13
If the file has been crafted deliberately to survive this form of damage, then yes, you should be able to recover your data.
There are many quite simple methods, from adding CRCs to replicating the data multiple times.
There are other possible routes to recovery. If, for example, the file was an ASCII text file, then it may be possible to recover something close to the original data by reasoning and dictionary work.

The "damage" done by XOR-ing with a truly random source (each bit independently has a 50% chance of being 0 or 1) is too much for any recovery scheme. No amount of statistical analysis or combining of known repeated elements will give you a better than 50% chance of guessing any individual bit, and a correct guess at any bit value gives you no advantage in guessing any other bit value. Any CRCs would be equally mangled and not recoverable. – Neil Slater Jun 25 '15 at 11:32
If you changed the source to have some bias such as $p(0)=0.4, p(1)=0.6$, then enough repetition or suitably robust error correction codes could in theory work. – Neil Slater Jun 25 '15 at 11:36
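A rough simulation of the point made in the two comments above, as a sketch only: the function names, the repetition count of 2001 and the 200-bit message are arbitrary choices of mine, and Python's `random` module (a pseudorandom generator) stands in for the random source. Each message bit is sent many times, every copy is XORed with a fresh pad bit, and the receiver takes a majority vote.

```python
import random

def biased_bits(n, p_one, rng):
    """Return n bits, each equal to 1 with probability p_one."""
    return [1 if rng.random() < p_one else 0 for _ in range(n)]

def recover_by_majority(message, copies, p_one, rng):
    """Guess each message bit from `copies` independently padded copies of it."""
    likely_pad_bit = 1 if p_one > 0.5 else 0
    guesses = []
    for bit in message:
        noisy = [bit ^ k for k in biased_bits(copies, p_one, rng)]
        majority = 1 if 2 * sum(noisy) > copies else 0
        # Undo the most likely pad bit to get the guess for the message bit.
        guesses.append(majority ^ likely_pad_bit)
    return guesses

def accuracy(p_one, copies=2001, n_bits=200, seed=0):
    """Fraction of message bits recovered correctly (illustrative parameters)."""
    rng = random.Random(seed)
    message = biased_bits(n_bits, 0.5, rng)
    guesses = recover_by_majority(message, copies, p_one, rng)
    return sum(m == g for m, g in zip(message, guesses)) / n_bits

print("pad with p(1)=0.6:", accuracy(0.6))
print("pad with p(1)=0.5:", accuracy(0.5))
```

With the biased pad the majority vote should recover essentially every bit, while with the unbiased pad the guesses should be right only about half the time, no matter how many copies are sent.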