Is in-place run length encoding possible in O(1) space given that the output is shorter than the input?

Question

This is inspired by a problem from here. This is the approximate form of the problem:

Given a string like "aaaa777cbb" (10 symbols long), run length encode it in-place to a string like "a473c1b2" (8 symbols long). You are guaranteed that the input will always be longer than the output.

The precise form of the problem is:

You are given an ordered list $L$ of symbols from a set $S$. Any symbol from $S$ may appear in the list.
$S$ contains all the positive integers up to and including $|L|$ (the length of $L$) and also some other symbols.
Rules of manipulating the input in-place
- You can replace one symbol in the list with another
- You can trim the list to a length of your choice by removing symbols from the end
- You cannot insert symbols
You must overwrite the list of symbols with its run-length-encoding representation and trim it to length so that it includes only the run-length-encoding representation.
- The run-length-encoding representation replaces each series of 1 or more of the same symbol in the input with that symbol followed by the symbol representing the number of occurrences of the previous symbol.
  - For example: $[a, a, a, a, a, a, a, a, a, a, 7]$ becomes $[a, 10, 7, 1]$ meaning "$a$ ten times followed by $7$ one time"
  - Note that the length of the output list is always even
- You are guaranteed that the length of the input list is always larger than the length of the output list
You must do this with $O(1)$ additional working memory
- Each "word" of working memory contains $\log_2 |S|$ bits (put another way, words may be constructed which store constant amounts of information, the position of any element in the input, or any symbol from the input)
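
For reference, the target encoding itself (ignoring the in-place and $O(1)$ space constraints) can be sketched as below. This is only a minimal illustration of the expected output, not a solution to the problem; the name `rle_reference` is made up for this sketch, and counts are represented as Python integers standing in for the integer symbols of $S$.

```python
from itertools import groupby


def rle_reference(symbols):
    """Return the run-length encoding of `symbols` as a NEW list.

    This uses O(n) extra space, so it only shows what the in-place
    algorithm is expected to produce.
    """
    out = []
    for symbol, run in groupby(symbols):
        out.append(symbol)                # the repeated symbol
        out.append(sum(1 for _ in run))   # how many times it occurred
    return out


print(rle_reference(["a"] * 10 + [7]))  # [a x 10, 7] from the example above
print(rle_reference(list("abccccc")))
```

Each run contributes exactly two output symbols, which is why the output length is always even.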

Intuitively I don't think this is possible. The solutions provided on the original site seem to break on strings like "abccccc" (length 7) where the output should be "a1b1c5" (length 6), since they start by overwriting "b" with the "1" from "a1" before they have even checked which symbol is in the 2nd position.

I have thought about trying to start by finding the compressible runs of letters (2 or more of the same letter), but I don't know how to tell which symbols are already processed and which are from the original input without using some sort of memory that would grow with the size of the input (like a bitmap of processed areas) and therefore put me in violation of the $O(1)$ space requirement.

I consider acceptable answers to be proofs that this problem either is or is not solvable in $O(1)$ space.

output string's length is always smaller than input string (from the hyperlink) seems to allow two interpretations: 1) for every problem instance, the output is shorter than the input 2) at every output (of two symbols), the encoding of the prefix to the part under consideration will be short enough to allow two symbols (implying input symbols will be duplicated to the end of input or a run longer than two). — greybeard, Sep 23 '20 at 07:19
@greybeard The question I am asking here is not exactly the same as the linked one (ex: an output of 10 is one symbol). I see your point about interpretations, but for my question it is the first. The precise form is my own doing. I'm not sure I understand your last comment. A Google search for "universal encoding" doesn't seem to turn up anything related. — 9072997, Sep 23 '20 at 12:43
The computing model is going to be very important in this situation. What exactly does $O(1)$ space mean to you? Does it mean $O(1)$ bits of space? Or does it mean $O(1)$ words, where each word is allowed to contain $O(\lg n)$ bits, where $n$ is the length of the input? (the transdichotomous model) In the former case, a problem can be solved in $O(1)$ space if and only if it forms a regular language. In the latter case, you are asking whether this problem is in L (aka LSPACE). — D.W., Sep 24 '20 at 04:44
@D.W. Your question has made me realize there is no way I can let you store a symbol without giving you infinite memory (this becomes Hilbert's paradox). I will update the question to limit the size of $S$. — 9072997, Sep 24 '20 at 23:06
Great, so then your question is equivalent to asking whether there is a finite-state transducer for your mapping (that maps a string of symbols to its run-length encoding). You might be able to use a pumping lemma to try to prove it is not a regular transduction; or construct a regular transduction explicitly. — D.W., Sep 25 '20 at 00:54

score 1 · Accepted Answer · answered Sep 25 '20 at 05:17

Here is an $O(1)$ space algorithm that uses one extra symbol not found in $L$, which I will call $B$ for blank space.

I define an operation, a "shift right" at position $k$. It finds the next blank symbol $B$ after position $k$, shifts every symbol from position $k$ up to that blank one place to the right, and sets position $k$ to $B$. For example, a shift right at the third symbol:

abcdeBfjgB    becomes    abBcdefjgB
  ^                        ^

Similarly a "shift left" at position $k$ assumes there is a $B$ symbol there, and moves it all the way to the end of the string, shifting all other symbols left.

abBdeBfjgB    becomes    abdeBfjgBB
  ^                        ^

Note that you can perform both shifts in $O(1)$ memory.
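
A minimal sketch of the two shifts, assuming the list is a Python list and the blank is the string `"B"` (both choices, and the names `BLANK`, `shift_right` and `shift_left`, are assumptions made for illustration). Each function uses only a couple of index variables, which is what keeps the extra memory constant.

```python
BLANK = "B"  # hypothetical blank marker; assumed not to occur in the input


def shift_right(L, k):
    """Open a blank at index k by consuming the next blank to its right."""
    j = k
    while j < len(L) and L[j] != BLANK:   # find the next blank
        j += 1
    if j == len(L):
        raise ValueError("no blank to the right of position k")
    while j > k:                          # move L[k..j-1] one place right
        L[j] = L[j - 1]
        j -= 1
    L[k] = BLANK


def shift_left(L, k):
    """Move the blank at index k to the end, shifting the rest left."""
    if L[k] != BLANK:
        raise ValueError("position k must hold a blank")
    for i in range(k, len(L) - 1):
        L[i] = L[i + 1]
    L[-1] = BLANK


row = list("abcdeBfjgB")
shift_right(row, 2)           # third symbol, 0-based index 2
print("".join(row))           # abBcdefjgB

row = list("abBdeBfjgB")
shift_left(row, 2)
print("".join(row))           # abdeBfjgBB
```

Both calls reproduce the two examples above. Each shift takes time linear in the length of the list, so the sketch trades running time for space.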

Now, first we replace every run of a symbol $x$ with length $l \geq 3$ with $xlB^{l-2}$. This can be done in-place, and leaves such runs identifiable. Also note that these are exactly the runs that shorten the output compared to the input.
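
This first pass can be sketched as follows, again assuming a Python list and `"B"` as the blank (the names `BLANK` and `mark_long_runs` are hypothetical). The scan measures each run before overwriting it, so it never reads a position it has already changed.

```python
BLANK = "B"  # hypothetical blank marker; assumed not to occur in the input


def mark_long_runs(L):
    """Rewrite every run x^l with l >= 3 as x, l, followed by l-2 blanks."""
    n = len(L)
    i = 0
    while i < n:
        j = i
        while j < n and L[j] == L[i]:   # measure the run starting at i
            j += 1
        length = j - i
        if length >= 3:
            L[i + 1] = length           # the count follows the symbol
            for t in range(i + 2, j):   # the rest of the run becomes blank
                L[t] = BLANK
        i = j                           # runs of length 1 or 2 are untouched


row = list("abccccc")
mark_long_runs(row)
print(row)  # ['a', 'b', 'c', 5, 'B', 'B', 'B']
```

The list keeps its length, and the only working state is the indices `i`, `j`, `t` and the run length.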

Then, move a single pointer $p$ from left to right:

1. If the string at the pointer starts with $B$, shift left.
2. If the string at the pointer starts with $xlB^+$, this is the start of a run with length at least 3. Increment $p$ by 2.
3. If the string at the pointer starts with $xx$, replace it with $x2$ and increment $p$ by 2. Note that $xx$ can never be the start of a run of length 3 or higher, since we already replaced those.
4. If the string at the pointer has the form $xy$, increment $p$ by 1, shift right, and replace the resulting $B$ at $p$ with $1$. Increment $p$ by 1 again. Note that the shift right must succeed because the output is guaranteed to be shorter than the input, and we already created all the space possible when replacing all runs of length 3 or more.

If any space is left over at the end, the algorithm will eventually get stuck performing step 1. Detect this, strip the remaining space, and you are done.

There must exist a $B$ not in $L$, since $S$ contains all the integers up to $|L|$, and $L$ must contain some duplicates in order to be compressible — 9072997, Sep 25 '20 at 17:50
