This week, I could not derive all the equations, but I got the gist of their theory
by working out a few of them; all of which are well summarized in the paper. The
paper is very readable for anyone who has taken a course in statistics,
information theory, and systems science. Statistics and information theory are
weak spots of mine, so I’ll concentrate on them here to clear my own head.

Two key concepts used in the paper are: the Shannon entropy of a
random variable with a probability mass function (pmf),
and the mutual information between two random variables.

Shannon entropy is defined to be $H(X) = -\sum_x p(x) \log_2 p(x)$.

If one uses $\log_2$ then entropy is measured in bits; this is the unit most
often used to measure it. If we flip two fair coins then we can use 2 bits to
encode the outcome: 00 (HH), 01 (HT), 10 (TH), and 11 (TT). Let’s call this the normal
encoding. Let’s see what the value of $H(X)$ is in this case: it is
$-\sum_{i=1}^{4} \frac{1}{4} \log_2 \frac{1}{4} = 2$ bits. If $H(X)$ were always equal to the number of
bits required for the normal encoding, it would not be of great help. Usually
$H(X)$ is less than the number of bits required to encode the outcome when the
probability distribution is not uniform. Intuitively, if some events occur more often
than others, it makes sense that we can encode the outcomes using fewer bits on average.
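
To make this concrete, here is a small Python sketch of my own (not from the paper) that computes $H(X)$ for the two-coin pmf and for a biased pmf; the biased probabilities are made up purely for illustration, and the result comes out below 2 bits, matching the intuition above.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Two fair coins: four equally likely outcomes (HH, HT, TH, TT).
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(uniform))   # 2.0 bits -- same as the normal encoding

# A biased pmf over the same four outcomes (values chosen only for illustration).
biased = [0.7, 0.1, 0.1, 0.1]
print(entropy(biased))    # ~1.36 bits -- fewer bits needed on average
```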

The next concept is the mutual information between two variables $X$ and
$Y$. Mutual information is defined in terms of
entropy: $I(X;Y) = H(X) - H(X|Y)$. It “measures the degree to
which knowledge of one variable reduces entropic uncertainty in another, regardless how their outcome may correlate” (emphasis mine). The paper
establishes a relation between mutual information and channel capacity.
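
As a sanity check on the definition, here is a small sketch (mine, not the paper’s) that computes $I(X;Y) = H(X) - H(X|Y)$ from a toy joint pmf; the joint probabilities are invented just to have something to plug in.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Toy joint pmf p(x, y) over two binary variables (numbers are illustrative only).
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y).
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# H(X) and the conditional entropy H(X|Y) = sum_y p(y) * H(X | Y=y).
h_x = entropy(p_x.values())
h_x_given_y = sum(
    p_y[y] * entropy([joint[(x, y)] / p_y[y] for x in (0, 1)])
    for y in (0, 1)
)

print(h_x - h_x_given_y)  # I(X;Y) ≈ 0.28 bits
```

Knowing $Y$ here cuts the uncertainty about $X$ from 1 bit to about 0.72 bits, so the two variables share roughly 0.28 bits of information.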

Any upper bound on the mutual information in a channel which
takes $X_1$ as input and emits $X_2$ as output puts an upper bound on the
channel capacity $C$. In a system with such a channel, there is a lower bound on
the mean-squared estimation error, $E(X_1 - \hat{X_1})^2$. The term $\hat{X_1}$
is an “estimator”: an arbitrary function of the discrete signal time series. The
$X_1$ dynamics at equilibrium are described by a stochastic differential
equation.
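
To get a feel for how a cap on mutual information turns into a floor on estimation error, here is the simplest toy case I could think of: a scalar Gaussian channel $X_2 = X_1 + N$. This is my own sketch, not the paper’s model; the bound used below is the standard Gaussian rate-distortion bound $E(X_1 - \hat{X_1})^2 \ge \sigma_1^2\, 2^{-2 I(X_1;X_2)}$, which the conditional-mean estimator happens to meet with equality in this setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian channel X2 = X1 + N (illustrative setup, not the paper's model).
sigma_x = 1.0   # std of the input signal X1
sigma_n = 0.5   # std of the channel noise N
n = 1_000_000

x1 = rng.normal(0.0, sigma_x, n)
x2 = x1 + rng.normal(0.0, sigma_n, n)

# Mutual information of a Gaussian channel: I = 0.5 * log2(1 + SNR) bits.
mi = 0.5 * np.log2(1.0 + (sigma_x / sigma_n) ** 2)

# Conditional-mean estimator of X1 given X2 (optimal for jointly Gaussian variables).
gain = sigma_x**2 / (sigma_x**2 + sigma_n**2)
x1_hat = gain * x2

mse = np.mean((x1 - x1_hat) ** 2)
bound = sigma_x**2 * 2.0 ** (-2.0 * mi)   # lower bound implied by the MI cap

print(f"I(X1;X2) = {mi:.3f} bits")
print(f"MSE      = {mse:.4f}")    # ~0.20
print(f"bound    = {bound:.4f}")  # 0.20 -- the MSE cannot go below this
```

A noisier channel (larger `sigma_n`) lowers the mutual information and raises this floor, which is how I read the qualitative message of the bound.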