I need a function $F$ that compresses an arbitrary number of bytes (with no upper bound on the length) into a single byte. The goal is to create a single unpredictable byte from an arbitrary bitstring.
Here are the requirements:
- Pseudo-randomness: the resulting byte should look unbiased and uncorrelated with the input. This is why Pearson hashing is unacceptable (one does not even need to compute the hash to know that the outputs for `0000` and `FF00` will be different). Note: see the detailed clarification of Property 1 in Edit #1 below.
- Let $S$ denote the collection of all possible (different) bitstrings of one fixed non-zero length that is a multiple of 8. Let $L$ denote the number of elements of $S$. Let $T$ denote the collection of the outputs that correspond to the elements of $S$. Then $T$ must contain $L/256$ occurrences of each byte. For example, if I compress all possible two-byte inputs, I will have 256 occurrences of each byte in the collection of all outputs. This requirement makes standard cryptographic hash functions unacceptable.
- Table lookups are allowed, but if and only if all tables are “Nothing up my sleeve” and are generated/precomputed by an open algorithm, so everyone can observe the entire process of their creation.
- The algorithm should be as fast/efficient as possible. This is another reason why standard cryptographic hash functions are not acceptable. Consider the case where $F$ is used only for 8/16/24/32-bit inputs: using a 128/256/512-bit hash function to compress such inputs does not seem efficient.
- The algorithm is not required to be online. The entire input is allowed to be stored in RAM.
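The equal-occurrence requirement on $T$ can be checked exhaustively for short inputs. Below is a minimal sketch, assuming a hypothetical candidate `xor_fold` standing in for $F$ (it satisfies the balance requirement but clearly not the pseudo-randomness one) and a hypothetical helper `is_balanced`; neither name comes from an existing library.

```python
from collections import Counter
from itertools import product


def xor_fold(data: bytes) -> int:
    """Toy stand-in for F: XOR of all input bytes (balanced, NOT pseudo-random)."""
    out = 0
    for b in data:
        out ^= b
    return out


def is_balanced(f, n_bytes: int) -> bool:
    """Return True if f maps the 256**n_bytes inputs onto each byte equally often."""
    counts = Counter(f(bytes(t)) for t in product(range(256), repeat=n_bytes))
    expected = 256 ** (n_bytes - 1)  # this is L / 256
    return len(counts) == 256 and all(c == expected for c in counts.values())


# All two-byte inputs: each output byte must occur exactly 256 times.
print(is_balanced(xor_fold, 2))  # True
```

The check is only feasible for a few input bytes, since it enumerates all $256^n$ inputs.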
Is there any algorithm that matches the above requirements?
EDIT #1
I will clarify Property 1 as follows.
Let $G(V, W)$ denote the infinite family of all possible functions such that each function $G_i$, given a bitstring $V$, generates a collection (array) of $W$ different bitstrings of equal length, where $V$ is the first element of the generated collection and the length (in bits) of each bitstring in the collection is a multiple of 8.
Let $G_1$ denote any chosen (fixed) function from the family $G$. The algorithm of $G_1$ is open to everyone (as is the algorithm of $F$). The simplest example of $G_1$ is a function that creates a collection where each successive element $b_i$ is the binary representation of $(b_{i-1} + 1) \bmod{2^N}$, where $N$ is the length of $V$ in bits.
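The counter example of $G_1$ can be sketched as follows. The name `g_counter` and the big-endian byte order are assumptions made only for illustration.

```python
from typing import List


def g_counter(v: bytes, w: int) -> List[bytes]:
    """Return w bitstrings starting at v, each one the previous plus 1 mod 2**N."""
    n_bits = 8 * len(v)
    if w > (1 << n_bits):
        # More than 2**N elements would force repeats, but all must be different.
        raise ValueError("w exceeds the number of distinct N-bit strings")
    start = int.from_bytes(v, "big")
    modulus = 1 << n_bits
    return [((start + i) % modulus).to_bytes(len(v), "big") for i in range(w)]


# The counter wraps around at 2**N.
print(g_counter(b"\xff\xff", 3))  # [b'\xff\xff', b'\x00\x00', b'\x00\x01']
```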
Alice chooses any bitstring $B_1$ such that its length is equal to $8X$. Then Alice obtains the collection $S_1$: $$S_1 = G_1(B_1, X) = (B_1, B_2, \ldots, B_{X-1}, B_X).$$ Then Alice computes the bitstring $Y_1$: $$Y_1 = F(B_1) \mathbin\Vert F(B_2) \mathbin\Vert \ldots \mathbin\Vert F(B_{X-1}) \mathbin\Vert F(B_X).$$
Then $Y_1$ is given to the attackers, so they know the exact value of $X$.
The attackers know that $G_1$ was used to generate each $i$-th element $B_i$, and they are allowed to choose any function from the family of $G$, denote it by $G_k$, then choose any $8X$-bit sequence $B_n$, then generate the collection $$S_2 = G_k(B_n, X) = (C_1, C_2, \ldots, C_{X-1}, C_X),$$
and then compute the corresponding bitstring $Y_2$: $$Y_2 = F(C_1) \mathbin\Vert F(C_2) \mathbin\Vert \ldots \mathbin\Vert F(C_{X-1}) \mathbin\Vert F(C_X)$$
(the attackers are not allowed to replace $F$ with another function).
The desired property of $F$ is that the cost of finding any combination of $G_k$ and $B_n$ that yields $Y_2 = Y_1$ should be as close to $\sqrt{2^{8X}}$ as possible.
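For concreteness, the target cost simplifies as follows (the choice $X = 16$ is only an illustrative assumption):

$$\sqrt{2^{8X}} = 2^{8X/2} = 2^{4X}, \qquad \text{e.g. } X = 16 \;\Rightarrow\; \sqrt{2^{128}} = 2^{64}.$$

That is, for a 16-byte $B_1$ and the resulting 16-byte $Y_1$, the attack should cost about $2^{64}$ operations.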
SUMMARY
This question implies a bijective function (or a Permutation-Based Compression Function) that can be described by the following quote (taken from the Abstract of “The MD6 hash function: A proposal to NIST for SHA-3”):
> The compression function can be viewed as encryption with a fixed key (or equivalently, as applying a fixed random permutation of the message space) followed by truncation.
and optimized for 8-bit-word architectures and for all bitstrings whose lengths are multiples of 8 (similar to how the MD6 compression function operates, so that no unnecessary computations are performed if all inputs are short, and all parameters are configurable for different input sizes).
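To make the "fixed permutation followed by truncation" idea concrete, here is a toy sketch for two-byte inputs only. It is not MD6 and has not been analysed for security. The names `make_table`, `permute16` and `f_two_bytes`, the public label, the 4-round Feistel structure and the SHA-256-driven shuffle are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def make_table(label: bytes = b"nothing-up-my-sleeve example") -> list:
    """Derive a 256-entry byte permutation from a public label (Fisher-Yates)."""
    table = list(range(256))
    for i in range(255, 0, -1):
        digest = hashlib.sha256(label + i.to_bytes(2, "big")).digest()
        j = int.from_bytes(digest[:4], "big") % (i + 1)  # slight modulo bias; sketch only
        table[i], table[j] = table[j], table[i]
    return table


TABLE = make_table()


def permute16(left: int, right: int, rounds: int = 4):
    """Toy Feistel network: a bijection on the 16-bit message space."""
    for r in range(rounds):
        left, right = right, left ^ TABLE[right ^ r]
    return left, right


def f_two_bytes(data: bytes) -> int:
    """Apply the fixed permutation, then truncate the result to one byte."""
    _, right = permute16(data[0], data[1])
    return right


# Truncating a bijection keeps the output balanced: 256 occurrences of each byte.
counts = Counter(f_two_bytes(bytes([a, b])) for a in range(256) for b in range(256))
print(len(counts) == 256 and set(counts.values()) == {256})  # True
```

Because the Feistel network is a bijection on all 65536 two-byte inputs, keeping one byte of its output gives each byte value exactly 256 preimages, whatever the round function is. Whether the result is unpredictable enough depends entirely on the strength of the permutation, which this sketch does not address.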