3

If the title was not clear: I'm examining methods of taking two binary strings as input and outputting one binary string, in such a way that the two original strings can be extracted from the output, and I want to know how efficiently this can be done - is there a best way? What's the minimum overhead?

I'm also interested in any method that differs significantly from or improves upon any of the following.

Given strings $A$, $B$ with lengths $|A|$, $|B|$:

Approach 1: Code 0 as 00, 1 as 11 and let 01 be a 'separator' character. This produces a string of length $2(|A|+|B|+1)$. To get the original strings back, you just find the separator, discard it and read off $A$ and $B$.
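For concreteness, here is a minimal Python sketch of approach 1 (the function names are mine):

```python
def encode1(a: str, b: str) -> str:
    """Double every bit of each string and join them with the '01' separator."""
    double = lambda s: "".join(c + c for c in s)
    return double(a) + "01" + double(b)

def decode1(s: str) -> tuple[str, str]:
    # Doubled blocks are only '00' or '11', so the first '01' block is the separator.
    blocks = [s[i:i + 2] for i in range(0, len(s), 2)]
    sep = blocks.index("01")
    undouble = lambda bs: "".join(block[0] for block in bs)
    return undouble(blocks[:sep]), undouble(blocks[sep + 1:])
```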

Approach 2: Somewhat more elegant, but produces a string of the same length; encode as $A0BA1B$. To get $A$ and $B$ back, just halve the string and compare the two halves character by character until there's a discrepancy; it occurs exactly at the separator position.
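A sketch of approach 2 in the same style; the only index where the two halves disagree is the separator at 0-based position $|A|$, so the discrepancy locates the split:

```python
def encode2(a: str, b: str) -> str:
    # The two halves A0B and A1B differ only at the separator bit.
    return a + "0" + b + a + "1" + b

def decode2(s: str) -> tuple[str, str]:
    half = len(s) // 2
    h1, h2 = s[:half], s[half:]
    i = next(k for k in range(half) if h1[k] != h2[k])  # i == |A|
    return h1[:i], h1[i + 1:]
```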

Approach 3: This is an improvement on approach 1. Code $A$ with the doubling scheme ($0\to00, 1\to11$), follow up with the opposite of the first character of $B$, then put down $B$. This has length $2|A|+|B|+1$. To get $A$ and $B$ back, break the string into blocks of two and look for the first that isn't $00$ or $11$. Before that point, you have the coded version of $A$. The first character of that anomalous block can be discarded, and the rest is $B$.
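A sketch of approach 3, assuming $B$ is nonempty (the construction needs a first bit of $B$ to negate):

```python
def encode3(a: str, b: str) -> str:
    doubled = "".join(c + c for c in a)       # 0 -> 00, 1 -> 11
    marker = "0" if b[0] == "1" else "1"      # opposite of B's first bit
    return doubled + marker + b

def decode3(s: str) -> tuple[str, str]:
    i = 0
    while s[i] == s[i + 1]:                   # scan blocks of two until an anomaly
        i += 2
    return s[0:i:2], s[i + 1:]                # every other bit undoes the doubling
```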

Approach 4: Getting cleverer now. Uses a similar idea to approach 3, but instead of separating $A$ from $B$, you separate $|A|$ (as a binary number) from $AB$. This has length approximately $|A|+|B|+2\log_2 |A|$.
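A sketch of approach 4, using the approach-3-style separator and assuming $AB$ is nonempty:

```python
def encode4(a: str, b: str) -> str:
    header = "".join(c + c for c in format(len(a), "b"))  # |A| in binary, bits doubled
    payload = a + b
    marker = "0" if payload[0] == "1" else "1"            # separator as in approach 3
    return header + marker + payload

def decode4(s: str) -> tuple[str, str]:
    i = 0
    while s[i] == s[i + 1]:            # undouble until the anomalous block
        i += 2
    n = int(s[0:i:2], 2)               # recover |A| as a binary number
    payload = s[i + 1:]
    return payload[:n], payload[n:]
```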

Approach 5: This is the most complicated, but potentially the best (yet) for maximally compressed strings (asymptotically). Look for the longest consecutive run of $1$s or $0$s in $A$ and $B$. Say the longest run consists of the bit $a$ and has length $n$. Then produce the string $A(1-a)(a)^{n+2}(1-a)B$ (where the power denotes repeated concatenation). This has length $|A|+|B|+n+4$, but I've not managed to make a probabilistic argument as to the expected size of $n$. My best guess is that it's logarithmic in $|A|+|B|$. To get $A$ and $B$ back, look for the longest consecutive run of $0$s or $1$s, then discard that run and the two characters on either side of it. $A$ is on the left, $B$ is on the right.
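A sketch of approach 5, assuming $AB$ is nonempty; a regular expression finds the runs:

```python
import re

def encode5(a: str, b: str) -> str:
    # Longest run in A and B (scanning a+b jointly can only lengthen it; still safe).
    longest = max(re.findall(r"0+|1+", a + b), key=len)
    bit, n = longest[0], len(longest)
    guard = "1" if bit == "0" else "0"                    # the (1-a) characters
    return a + guard + bit * (n + 2) + guard + b

def decode5(s: str) -> tuple[str, str]:
    # The marker is the unique longest run; strip it plus one guard on each side.
    m = max(re.finditer(r"0+|1+", s), key=lambda r: len(r.group()))
    return s[:m.start() - 1], s[m.end() + 1:]
```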

WillG
  • 6,585
G. H. Faust
  • 1,766
  • How do you measure efficiency? In your last paragraph you speak about probabilistic arguments, so do you assume some probability distribution on the infinite set of strings? Are you looking for good average case or worst case? – Hagen von Eitzen Aug 05 '14 at 08:09
  • @HagenvonEitzen Wherever I have used the term efficiency I am primarily referring to the length of the string coding $A$ and $B$. In the last section, where I refer to probabilistic arguments, I assume maximally compressed strings have a distribution based on being essentially 'random' (lacking in patterns), but I'm not as knowledgeable in this area as I could be. Generally speaking I'm interested in both worst and average case. Your answer is clever! I'm leaving this question open for now, but if there's no activity for a while I'll accept. – G. H. Faust Aug 05 '14 at 08:58

3 Answers

5

Let $n=|A|+|B|$. Use $\lceil \log_2 (n+1)\rceil$ bits to binary encode $|A|$ (a number between $0$ and $n$, inclusive), then copy $A$, then copy $B$.

To decode $C$, start with $n=|C|$ and, while $\lceil \log_2(n+1)\rceil + n>|C|$, decrease $n$. Then interpret the first $\lceil \log_2 (n+1)\rceil$ bits as the length of $A$, voilà.

Efficiency: There are $(n+1)\cdot 2^n$ possible $(A,B)$ for a given value of $n=|A|+|B|$. Hence to encode all of these, at least some input requires at least $n+\lceil \log_2(n+1)\rceil$ bits. So there is already very little waste, and only little can be gained by considering different $n$ together: $\sum_{k=0}^n(k+1)2^k = n\cdot2^{n+1}+1$, so encoding all possible $A,B$ with $|A|+|B|\le n$ requires at least one code of length $\ge n+1+\lfloor\log_2 n\rfloor$.
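For illustration, a short Python sketch of this scheme (function names are mine); note that $\lceil\log_2(n+1)\rceil$ is exactly Python's `n.bit_length()`:

```python
def header_bits(n: int) -> int:
    return n.bit_length()              # == ceil(log2(n + 1)) for every n >= 0

def encode(a: str, b: str) -> str:
    k = header_bits(len(a) + len(b))
    header = format(len(a), "b").zfill(k) if k else ""
    return header + a + b

def decode(c: str) -> tuple[str, str]:
    # The largest n with header_bits(n) + n <= |c| is unique, because
    # n -> header_bits(n) + n is strictly increasing (see WillG's comments below).
    n = len(c)
    while header_bits(n) + n > len(c):
        n -= 1
    k = header_bits(n)
    la = int(c[:k], 2) if k else 0     # |A| from the header
    return c[k:k + la], c[k + la:]
```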

  • What is $C$ in your argument? – mvw Aug 05 '14 at 13:24
  • Can you give a proof that your decoding algorithm works? In particular, why couldn't there be some $m > n$ such that $\lceil \log_2(m+1)\rceil + m ≤ |C|$? – WillG Jan 11 '24 at 21:36
  • Perhaps the following suffices. The function $n \mapsto f(n) = n + \lceil \log_2(n + 1)\rceil$ increases by at least $1$ whenever $n$ increases by $1$, and for the base case $n = 1$, we have $n < f(n)$. Thus $f$ is injective, so there is at most one possible $n$ for each $|C| = f(n)$. – WillG Jan 11 '24 at 21:59
1

Note: This is inspired by Hagen's answer; I believe it is roughly the same approach.

The idea is to use the length of the encoded string $s$ to derive, via a suitable function $f$ with $f(|s|) = |w|$, the length of a leading word / prefix $w$, where $$ w=zp \qquad s=wab $$ (the padding $z$ contains only $0$ bits).

While encoding, the word $w$ was chosen large enough to contain a pointer $p$ which is able to point to any position in $s$, especially the position which splits $a$ and $b$.

Decoding:

  • determine the length of $s$
  • apply the function $f$: $|w| = f(|s|)$, see $(**)$ below
  • use $|w|$ to split $s$ into $w$ and $u=ab$, keep $w$
  • decode $w$: $|wa| = d(w)$, using some bit string to number decoder $d:\{0,1\}^*\to \mathbb{N}$, $|wa|$ gives the position where $b$ begins within $s$
  • use $|w|$ and $|wa|$, to split $s$ into $w$, $a$ and $b$, keep $a$ and $b$
  • output $a$ and $b$

If $w$ encodes a number $n$, $|w|$ bits can store a number up to $m = 2^{|w|} - 1$ which means $|w| = \log_2(m + 1)$ and thus for $n \le m$ we estimate $|w| = \lceil \log_2(n+1) \rceil$.

For $n$ we get this equation:

$$ n = |wu| = |w| + |u| = \lceil \log_2(n+1) \rceil + |u| =: F_{|u|}(n) \quad (*) $$

This is a fixed point equation, $$ F_{|u|}(n) = n, $$ which can be solved by numerical methods (not just fixed point solvers, but also by making use of the fact that $\lceil \cdot \rceil$ produces straight line segments whose intersection with the diagonal $\mbox{id}$ can be calculated efficiently; see this question for a similar problem).

The function $f$ for decoding should be $$ f(|s|) = |w| = \lceil \log_2(|s|+1) \rceil \quad (**) $$

Encoding:

  • determine $|a|$ and $|b|$
  • determine $|u| = |a| + |b|$
  • solve $(*)$ for $F_{|u|}$, choosing the smallest fixed point as $n = |s|$, e.g. $7$ for $F_4$
  • determine $|w| = n - |u|$
  • determine $|wa|= |w| + |a|$
  • encode $p = e(|wa|)$, using some number to bit string encoder $e:\mathbb{N}\to \{0,1\}^*$
  • choose the zero bit string $z = 0^{|w| - |p|}$
  • output $s = zpab$

Example:

Encoding: Having $|a| = 8$ and $|b| = 2$, thus $|u| = 10$, we get $F_{10}(14) = 14$ (here by solving $(*)$ graphically in gnuplot or Wolfram Alpha, see link). So $n = |s| = 14$ and $|w| = n - |u| = 4$; the maximum pointer value $14$ needs $\log_2 15 \approx 3.9$ bits, fitting into $|w| = 4$ bits. We put $|wa| = |w| + |a| = 4 + 8 = 12$ there, encoded as $p = 1100$; the padding $z = \varepsilon$ is empty. We would output $s = 1100ab$.

Decoding: We get $|s| = 14$, then $|w| = 4$. We split and get $w = 1100$. Decoding gives $|wa| = 12$. So we can split $s$ to get $a$ and $b$.
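A Python sketch of the whole scheme (function names are mine); the loop in `encode` is a naive linear search for the smallest fixed point of $(*)$, rather than the graphical/numerical solution used above:

```python
def f(s_len: int) -> int:
    return s_len.bit_length()          # |w| = ceil(log2(|s| + 1)), equation (**)

def encode(a: str, b: str) -> str:
    u = len(a) + len(b)                # |u| = |ab|
    n = u
    while f(n) + u != n:               # smallest fixed point of F_{|u|}(n) = n
        n += 1
    w = n - u                          # header length |w|
    p = format(w + len(a), "b") if w else ""   # pointer |wa|: where b begins
    z = "0" * (w - len(p))             # zero padding
    return z + p + a + b

def decode(s: str) -> tuple[str, str]:
    w = f(len(s))                      # recover |w| from |s| alone, via (**)
    wa = int(s[:w], 2) if w else 0     # decode the pointer
    return s[w:wa], s[wa:]
```

With $|a| = 8$ and $|b| = 2$ this reproduces the example: $n = 14$, $|w| = 4$, $p = 1100$.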

mvw
  • 34,562
  • You've phrased it differently, but I don't see a significant difference from Hagen's answer. – G. H. Faust Aug 06 '14 at 06:25
  • The funny bit is that I have only understood Hagen's first sentence. This answer is what I derived from it. It might have turned out much lengthier, if I had grokked the other two. :-) As added value I gave full encoding and decoding schemes. That the logarithm is important was already recognized by the OP, sec. 5. The fixed point solution I used to determine the size of the encoded string is probably equivalent to the procedure in Hagen's 2nd sentence. This was basically an attempt to work it out myself how to avoid marker symbols and just use the string length for separation. – mvw Aug 06 '14 at 08:48
0

Encode:

Create the string $a2b$ and pretend it is written in base 3. Its length is $|a|+|b|+1$. Then convert it to base 2. The result $s$ is a binary string of length about $(|a|+|b|+1)\cdot\log_2 3$, that is, about 59% longer than $|a|+|b|$.

For example, let $a = 10011_2 = 19_{10}$ and $b = 11010_2 = 26_{10}$, for a total length of 10 bits.

We create the string $a2b = 10011211010$, then pretend it is base 3 and convert it to base 2, giving $1111010001100010$, and voilà!

Decode:

To decode $s$, convert it to base 3, look for the digit "2" (there is exactly one) and split the string into two parts at that digit.
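A Python sketch of the round trip (function names are mine). One caveat the answer does not address: converting to an integer drops leading zeros, so a string $a$ beginning with $0$ would not survive; strings beginning with $1$ round-trip cleanly:

```python
def encode_base3(a: str, b: str) -> str:
    value = int(a + "2" + b, 3)        # read a2b as a base-3 numeral
    return format(value, "b")          # rewrite it in base 2

def decode_base3(s: str) -> tuple[str, str]:
    value = int(s, 2)
    digits = ""
    while value:                       # rewrite the number in base 3
        digits = str(value % 3) + digits
        value //= 3
    a, b = digits.split("2")           # exactly one '2' by construction
    return a, b
```

Running `encode_base3("10011", "11010")` gives `1111010001100010`, matching the example above.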

Although this is a bit less efficient than the methods above, it is still better than the doubling scheme. And it has the unique advantage that you can encode any number of binary strings (not just two) into one, by inserting the necessary "2" separators.

This coding can be further improved (for large inputs) by writing the two input strings $a$ and $b$ in bytes (that is, in base $256$), pretending the digits are written in base $257$, concatenating the digits of $a$, the separator digit $256$, and the digits of $b$, and converting the result back to base $256$.
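A sketch of this byte-level variant, under the same caveat about leading zeros (here, leading zero bytes); the separator is the one base-$257$ digit, $256$, that base-$256$ data can never contain:

```python
def encode_bytes(a: bytes, b: bytes) -> bytes:
    value = 0
    for digit in list(a) + [256] + list(b):   # digits of a, separator, digits of b
        value = value * 257 + digit           # read the digit string in base 257
    return value.to_bytes((value.bit_length() + 7) // 8, "big")  # back to base 256

def decode_bytes(s: bytes) -> tuple[bytes, bytes]:
    value = int.from_bytes(s, "big")
    digits = []
    while value:                              # recover the base-257 digits
        digits.append(value % 257)
        value //= 257
    digits.reverse()
    sep = digits.index(256)                   # exactly one separator digit
    return bytes(digits[:sep]), bytes(digits[sep + 1:])
```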