I've been using the following formula on various empirical data $d$, to obtain a correlation factor $c_f$:-
$$ c_f = { |C(d_s)| \over |C(d)|} $$
where $C$ is a compression function such as bz2 or zip, and $d_s$ is the data after a random shuffle. The purpose of the random shuffle is to destroy any byte-wise auto-correlation that might exist in $d$.
A typical value for $c_f$ might be 2.7. What might this indicate vis-à-vis any possible correlation within the data? Is it at all possible to convert 2.7 to a lag figure of $n$ bits, even loosely?
Example figures are:-
$|d|$ = 921,600 bytes
$|C(d)|$ = 201,000 bytes
$|C(d_s)|$ = 543,000 bytes
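For reference, this is roughly how the figures above are produced: a minimal sketch assuming Python's bz2 as $C$ and a byte-level shuffle (the file path is just a placeholder).

```python
import bz2
import random

def correlation_factor(d: bytes) -> float:
    """Compress the data as-is, compress a byte-shuffled copy,
    and return c_f = |C(d_s)| / |C(d)|."""
    d_s = bytearray(d)
    random.shuffle(d_s)   # destroys byte-order structure, keeps the byte histogram
    return len(bz2.compress(bytes(d_s))) / len(bz2.compress(d))

# e.g. (placeholder path):
# with open("data.bin", "rb") as f:
#     print(correlation_factor(f.read()))
```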
Answer
What might this indicate vis-à-vis any possible correlation within the data? Is it at all possible to convert 2.7 to a lag figure of $n$ bits, even loosely?
The compression ratio is directly proportional to the entropy ($H$) of the source. Shannon's entropy, for instance, tells you how many bits on average would be required to encode each symbol of a source, given its statistical distribution. The expression for the Shannon entropy is:
$$ H(S) = -\sum_{i=0}^{|S|-1} P(s_i) \log_2 P(s_i)$$
Where $|\cdot |$ indicates the size of an ordered set, $S$ is a memoryless source having $|S|$ symbols, each one appearing with probability $P(s_i)$ in a stream.
If the source was totally predictable ($1111111111111111111111111111 \ldots$) then $S_p=[1]$, $P(S_p) = [1]$ and to describe it you would need $H(S_p)=0$ bits.
If the source had to alternate between two states but in a totally unpredictable way then $S_u=[0,1], P(S_u) = [0.5,0.5]$ and to describe it you would need $H(S_u) = -(0.5 \cdot \log_2 0.5) \cdot 2 = 1$ bit.
BUT, if the same source started showing a preference for one of the states, it would suddenly become just a bit more predictable. So, let's say $S_b=[0,1]$, $P(S_b)=[0.7,0.3]$ and $H(S_b) = -(0.7 \cdot \log_2 0.7 + 0.3 \cdot \log_2 0.3) \approx 0.88129\ldots$.
Remember that $H$ is expressed in bits here. If I need $H(S)$ bits per symbol in $S$ and produce a message $M$ that is $N$ symbols long, then I need $|M| = H(S) \cdot N$ bits to describe the message.
The unpredictable source $S_u$ needs 1 bit per symbol, so we need as many bits as there are symbols in $M$, and $|M_u| = N_u \cdot H(S_u)$. But the "biased" source $S_b$ needs $\approx 0.88129$ bits per symbol, so $|M_b| = N_b \cdot H(S_b)$ and, theoretically at least, we could fit it in less space.
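These numbers are easy to reproduce. A minimal sketch (the message length $N$ below is purely illustrative, not taken from the question):

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits per symbol: H(S) = -sum p_i * log2(p_i)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # S_p: 0.0 bits (totally predictable)
print(entropy([0.5, 0.5]))   # S_u: 1.0 bit  (fair coin)
print(entropy([0.7, 0.3]))   # S_b: 0.8812908992306927 bits

# Bits needed to describe an N-symbol message: |M| = H(S) * N
N = 1_000_000
print(N * entropy([0.5, 0.5]))   # 1,000,000 bits
print(N * entropy([0.7, 0.3]))   # ~881,291 bits
```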
The question now is, how do you do that? How do you re-encode a sequence to occupy less space? Because that is very, very important (for the interpretation of metrics such as this one).
Without wanting to go into too much detail (now), let's say that we do have a way of encoding a message in such a way that the compression ratio approximates its entropy as the length of the message tends to infinity. This is what you call $C$ (and it is encouraging that you have selected zip, bz2, etc).
Then, forming the ratio of the lengths of the shuffled and unshuffled compressed sequences is equivalent to forming the ratio $\frac{|M_{u \leftarrow b}|}{|M_b|}$.
In other words, you have a file that is coherent because it carries content. For that reason, the bytes appear in a somewhat predictable way, following some kind of pattern, much like $S_b$, except that this time your $S_b$ contains 256 symbols and your $P(S_b)$ contains 256 probabilities (this is not exactly accurate because a file is not memoryless, but I will come back to this later).
When you shuffle the file, you basically construct an equivalent white-noise sequence from a source $S_u$ where, ideally, $P(s) = \frac{1}{256}\ \forall s \in S_u$ (where $\forall$ means "for all"). This is what I mean above with that $M_{u \leftarrow b}$.
Earlier, $S_u,S_b$ were unrelated, but here $S_u$ is the result of $S_b$ after you shuffled it.
So, having said all this, what you basically have expressed there is a ratio that is equivalent to the inverse of the compression ratio of $C$ on your data $d$.
Provided that the shuffling is adequate, $|C(d_s)|$ is about as long as the byte frequencies of $d_s$ allow: with the ordering destroyed, there is essentially nothing else left to compress. And $|C(d)|$ is approximated by the entropy of $d$ times its length.
What does this tell you? It tells you how much bigger a white-noise sequence using the same symbols as $d$ is, compared with the coherent version of $d$.
What does this say about the auto-correlation of the data? Under very strict conditions it can tell you the average size of the longest redundant sequence of bits. If the data compression ratio is 0.5, on average, you are substituting 2 bits for 1.
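Purely as arithmetic on the figures in the question, treating compressed size as a rough per-byte entropy estimate (and before the caveats that follow):
$$ \hat H(d) \approx \frac{8 \cdot 201{,}000}{921{,}600} \approx 1.74 \text{ bits/byte}, \qquad \hat H(d_s) \approx \frac{8 \cdot 543{,}000}{921{,}600} \approx 4.71 \text{ bits/byte} $$
so $c_f = 543{,}000/201{,}000 \approx 2.7$ and, very loosely, the byte ordering alone accounts for roughly $4.71 - 1.74 \approx 3$ bits of redundancy per byte. That is about as far as the ratio on its own can take you.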
And this is where things can go really wrong, both in the choice of $C$ and in the interpretation.
The compressor $C$ doesn't care what is in $d$. The only thing it sees is a stream of bytes. If you have a file of doubles in a common format, the compressor will compress each and every one of those 8 bytes separately, not as one number. What your ratio returns is information about the efficiency of compressing the stream, not your "data".
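A quick way to see this effect, as a sketch (the sine wave is just a stand-in for smooth, strongly auto-correlated real-valued data):

```python
import bz2
import numpy as np

# A smooth, heavily auto-correlated real-valued signal.
t = np.linspace(0, 200 * np.pi, 100_000)
signal = np.sin(t)

raw = signal.astype(np.float64).tobytes()   # 8 bytes per sample
print(len(bz2.compress(raw)) / len(raw))    # modest: the compressor sees 8
                                            # loosely related bytes per number

quantised = np.round(signal * 127).astype(np.int8).tobytes()
print(len(bz2.compress(quantised)) / len(quantised))   # far smaller: quantising
                                                        # exposes the repetition
                                                        # at the byte level
```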
Is it at all possible to convert 2.7 to a lag figure of n bits, even loosely?
This was answered above, but the discussion so far brings us to the wonderful subject of LZ Complexity.
The LZW algorithm is a dictionary-based algorithm. It compresses a sequence by keeping a running dictionary of fragments it has encountered before.
Consider the sequence 0101001010010. We start at the first 0: have we seen it before? No, so put it in the dictionary, output 0, and move to the next symbol. Have we seen 1 before? No, put it in the dictionary, output 1, move on. Have we seen 0 before? Yes, carry on to the next. Have we seen 01 before? No, put it in the dictionary, output the index of the known sequence (the 0) and the new bit (1), and move on... and so on until the end of the file.
You see what's happening here: the algorithm compresses when it encounters a sequence it has seen before, substituting it with its index in the dictionary. In these early stages it does not seem to compress much, but if the source is predictable then the algorithm will build a set of increasingly long repeating sequences very quickly and will be able to start substituting repetitions. Have we seen 001100110011001100110011 before? Yes, it is entry 242. So you substitute a long string of bytes with a single integer.
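To make the walkthrough concrete, here is a toy dictionary parse in the LZ78 style described above (an illustrative sketch, not a real compressor):

```python
def lz_parse(bits: str):
    """Toy LZ78-style parse: emit (dictionary index, new symbol) pairs,
    growing the phrase dictionary as we go."""
    dictionary = {"": 0}   # index 0 = the empty phrase
    output = []
    phrase = ""
    for bit in bits:
        if phrase + bit in dictionary:
            phrase += bit                      # keep extending a known phrase
        else:
            output.append((dictionary[phrase], bit))
            dictionary[phrase + bit] = len(dictionary)
            phrase = ""
    if phrase:                                 # flush a trailing known phrase
        output.append((dictionary[phrase], ""))
    return output, dictionary

pairs, phrases = lz_parse("0101001010010")
print(pairs)          # [(0, '0'), (0, '1'), (1, '1'), (1, '0'), (2, '0'), (5, '0'), (5, '')]
print(list(phrases))  # ['', '0', '1', '01', '00', '10', '100']
```

The growing phrase dictionary is the object you would have to inspect to say anything about lag, which is the point of the next paragraph.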
The reason I am talking about this is that if you wanted to make such strong statements about the lag of the autocorrelation, it would not be enough to examine the compression ratio. Ideally, you would have to examine the patterns within the dictionaries.
If you are lucky, your data is inherently digital (categorical) and you might be able to do this. But if your data is real numbers, then you have to quantise them and the way you do this makes it very difficult to apply these methods to real valued data.
Hope this helps.