An input sequence consists of two symbols, each independently and randomly chosen from a set of symbols with known associated probabilities. Each symbol is independently encoded using a prefix code that minimizes the expected codeword length. Is it possible to reduce the expected total codeword length by instead using a prefix code whose input symbols are ordered pairs of the original symbols?
Answer
A simple example: consider a source $S = \lbrace 0, 1 \rbrace$ with probabilities $\lbrace 0.9, 0.1\rbrace$.
If you take one symbol at a time, no compression is possible: you need one bit per symbol.
Taking two symbols at a time, compression is possible. Now you have a source $S_2 = \lbrace 00, 01, 10, 11 \rbrace$ with probabilities $\lbrace 0.81, 0.09, 0.09, 0.01 \rbrace$. A possible encoding is:
Message | Codeword
--------|---------
00 | 0
01 | 10
10 | 110
11 | 111
Here, the average number of bits per symbol of $S$ is $(1 \cdot 0.81 + 2 \cdot 0.09 + 3 \cdot 0.09 + 3 \cdot 0.01)/2 = 1.29/2 = 0.645$. With only four combinations, however, the code cannot match the ideal codeword lengths: for example, $01$ and $10$ have equal probability and so ideally should have codewords of the same length, which in turn should be shorter than the codeword for $11$. Grouping more symbols of $S$ into each message gives more room for optimizing the code.
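The average-length computation above can be checked directly from the table (a minimal sketch; the symbol order and lengths are those of the table):

```python
# Probabilities and codeword lengths from the table, in order 00, 01, 10, 11.
probs = [0.81, 0.09, 0.09, 0.01]
lengths = [1, 2, 3, 3]

# Expected bits per two-symbol message, then per original symbol of S.
bits_per_message = sum(p * l for p, l in zip(probs, lengths))
bits_per_symbol = bits_per_message / 2
print(bits_per_message, bits_per_symbol)  # 1.29  0.645
```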
Note that the entropy of $S$ is approximately $0.469$ bits per symbol, so the code above is still far from optimal. Shannon's source coding theorem guarantees that the expected length per symbol can be brought arbitrarily close to the entropy by increasing the number of source symbols grouped into each input symbol of the prefix code.