What determines the entropy of a program's source code?

Question

a few days ago I asked a question about the limits of compression: Can PRNGs be used to magically compress-stuff?

The idea common to all the answers was that if you consider all programs of length < k (Let's call this set $C_k$), then there is only a finite number of possible outputs generated by programs in $C_k$ (let's call that set $S_k$).

In the comments, I asked a followup question roughly equivalent to "what determines which messages are possible outputs?" and have been thinking about this ever since.

It seemes clear that what determines that is the program that executes the program. If we ignore TM's and think in terms of source code, then the output of a piece of code depends only on the syntax and semantics of the programming language. So the language design determines the contents of $S_k$. There are some interesting corollaries:

All program transformations that do no affect the AST nor enlarge the program beyond k bits will produce programs that generate the same output.
All refactorings, which might alter the AST but preserve all input/output relationships, will do the same if they do not inflate the size of the source code.
a PL that provides multiple constructs ("There's more then one way to do it") for achieving the same result (control flow constructs, for example) will , generally speaking, have more programs of length < k which generate the same output and so a smaller $S_k$ then a similar but "slimmer" PL.

This brings me to the following question: If I consider all programs in $C_k$ that generate the same output (it's an equivalence relation) and then proceed to compress them, what would be the results? Would the equivalent programs compress down to roughly the same size or not? Does the compressability depend on the meaning of the program (which is identical) or the bit sequence (which is not)?

A statement given in the answers to the previous question is that the entropy of a bit sequence determines how compressible it is. Is that entropy determined by the input/output relationships encoded in the program, or by it's bit-representation in language X? What's the right way to think of this?

update Since a program describes a computational process rather then merely output, the notion of equivalence suggested really doesn't indicate me much about the relative complexity of programs which are equivalent by that definition.

score 5 · Accepted Answer · answered Mar 30 '14 at 02:28

This is a classical question in the foundations of Kolmogorov complexity: does the programming language matter for the sake of defining minimal description length? It depends on your model, but in general the answer is "no" or "almost no".

Suppose that $P_1,P_2$ are two Turing complete programming languages. In particular, in $P_1$ you can write an interpreter $I$ for $P_2$. Therefore any program $\pi$ written in $P_2$ can be converted in $P_1$ by pasting together $I$ and $\pi$. What is the length of the combined program? In any reasonable programming language, it is $O(|\pi|)$, and so different programming languages only have at most a constant multiplicative advantage over each other. If we are more stringent in our requirements on the programming languages, the answer should be $|\pi| + O(1)$, which is what happens in Kolmogorov complexity. In this case different programming languages only have at most a constant additive advantage over each other.

The idea behind $\pi + O(1)$ is that you should be able to concatenate the code for $I$ with $\pi$. But in practice this is not quite possible. Suppose for example that $P_1$ is $C$. We need to write $\pi$ as a string. If the character set of $\pi$ includes backslash or double quotes, then we will need to escape these, which could result in a multiplicative blow-up. Still, we get a $C$ program whose length is $O(|\pi|)$. If we are careful in our definitions then we can prevent this — check any introductory text on Kolmogorov complexity for the details.

What determines the entropy of a program's source code?

1 Answers1