Let's say you have this mapping of symbols and codewords:

$$ \begin{array}{cc} \hline \text { Symbol } & \text { Codeword } \\ \hline \text { A } & 101 \\ \text { B } & 100 \\ \text { C } & 01 \\ \text { D } & 00 \\ \text { E } & 110 \\ \text { F } & 111 \\ \hline \end{array} $$

How is it possible to determine the probability of the occurrence of a symbol like A? Unfortunately, I have not found anything on this and would be thankful for tips.

Yuval Filmus
Rico1990

1 Answer

You could estimate the probability of occurrence by working out the probability that a randomly generated string of bits starts with the codeword for A.

Here that would be a 50% chance of a 1, then a 50% chance of a 0, then a 50% chance of a 1 again, for a total chance of 0.5 × 0.5 × 0.5 = 12.5% of arriving at A after this random walk along the encoding tree.

That doesn't tell you the probability of A occurring in the original string, only the best approximation the coding table was able to capture.
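The random-walk estimate can be sketched for the whole table at once: a codeword of length n is reached with probability 2^-n. The names `codebook` and `implied_probabilities` below are made up for this illustration, and the snippet assumes every bit is an independent fair coin flip, as described above.

```python
# Codewords copied from the table in the question.
codebook = {
    "A": "101",
    "B": "100",
    "C": "01",
    "D": "00",
    "E": "110",
    "F": "111",
}


def implied_probabilities(codebook):
    """Probability that a random walk down the code tree ends at each symbol.

    Assumes each bit is an independent fair coin flip, so a codeword of
    length n is reached with probability 2 ** -n.
    """
    return {symbol: 2.0 ** -len(word) for symbol, word in codebook.items()}


probs = implied_probabilities(codebook)
for symbol, p in sorted(probs.items()):
    print(f"{symbol}: {p:.3%}")

# The code tree is complete, so the implied probabilities sum to 1.
print("total:", sum(probs.values()))
```

These are the probabilities the table implies, not measurements of the source text.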

It's possible all 6 of these symbols occur equally frequently in the input text, and that C & D were chosen arbitrarily to get shorter code words just because we didn't need all 8 combinations of 3 bits for this alphabet. Or it could be that C and D are significantly more likely to occur, and were given shorter code words to reflect that. This table alone doesn't tell us which scenario is true, or to what degree.
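To make the ambiguity concrete, here is a minimal sketch that assumes the table came from a Huffman-style construction (the question does not say how it was built). The function `huffman_code_lengths` and the two weight tables are hypothetical, chosen only for illustration. Both a uniform source and a source where C and D are twice as frequent yield two 2-bit and four 3-bit codewords.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return {symbol: codeword length} for a Huffman code over `weights`.

    Repeatedly merges the two lightest subtrees; every symbol inside a
    merged subtree moves one level deeper in the code tree.
    """
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths


# Hypothetical symbol counts, not taken from the question.
uniform = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1, "F": 1}
skewed = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 1, "F": 1}

uniform_lengths = huffman_code_lengths(uniform)
skewed_lengths = huffman_code_lengths(skewed)

print(sorted(uniform_lengths.values()))
print(sorted(skewed_lengths.values()))
```

In the uniform case, which two symbols end up with the short codewords depends only on tie-breaking, which is why the sketch compares sorted lengths rather than per-symbol assignments.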

As long as no symbol has probability > 33.333%, we'd arrive at an encoding table like this one even if A had a vanishingly small probability in the input (arbitrarily close to zero). There just wouldn't be any savings in encoding it with 4 bits unless we could drop a very high-probability symbol down to a 1-bit code. So all we can say with confidence is that A's probability in the source was somewhere in the range [0%, 16.667%], with 12.5% being a middle-of-the-road guess.

DMGregory