Information Theory 1
Chapter 2 of the Deep Learning Book focuses on Probability and
Information Theory. This post is TLDR part 4 of the corresponding chapter of
the book.
Introduction and definitions
- Information theory is needed to characterize probability distributions and
compare them.
- Quantifying similarity between probability distributions is a topic of interest.
- We need a way to measure information. The measure of information $I(x)$ has to satisfy
the following properties:
- Likely events have low information content:
- If an event is sure to happen, no information is gained from observing that event.
- Less likely events have higher information content.
- Independent events have additive information:
$I(x_1 \text{ and } x_2) = I(x_1) + I(x_2)$
- One of the simplest functions that satisfies this relation is:
\[I(x) = -\log P(x)\]
- Units depend on the base of the log:
- For $\log_e$, the unit is nats
- For $\log_2$, the unit is bits (or shannons)
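- Proof of additivity: for independent events, $P(x_1 \text{ and } x_2) = P(x_1) P(x_2)$, so $-\log\left(P(x_1) P(x_2)\right) = -\log P(x_1) - \log P(x_2) = I(x_1) + I(x_2)$.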
- The expectation of information across all events is called Shannon entropy:
\[H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]\]
- Shannon entropy of a distribution is the expected amount of information in an event drawn from
that distribution.
- It gives a lower bound on the number of bits needed on average to encode symbols
drawn from distribution $P$ (when the logarithm is base 2).
- If $x$ is a continuous random variable, the Shannon entropy is called
differential entropy.
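
As a minimal sketch of the definitions above, the following Python snippet computes self-information and Shannon entropy for a discrete distribution. The function names `self_information` and `shannon_entropy` and the two coin distributions are illustrative assumptions, not something taken from the book.

```python
import math

def self_information(p, base=2.0):
    """Information content -log(p) of an event with probability p.

    base=2 gives bits, base=math.e gives nats.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log(1 / p) is the same quantity as -log(p)
    return math.log(1.0 / p) / math.log(base)

def shannon_entropy(probs, base=2.0):
    """Expected self-information of a discrete distribution.

    Outcomes with zero probability are skipped, which follows the
    usual 0 * log(0) = 0 convention.
    """
    return sum(p * self_information(p, base) for p in probs if p > 0.0)

fair_coin = [0.5, 0.5]      # hypothetical example distribution
biased_coin = [0.9, 0.1]    # hypothetical example distribution

print("certain event:", self_information(1.0))
print("fair coin flip (bits):", self_information(0.5))
print("fair coin flip (nats):", self_information(0.5, math.e))
print("entropy of fair coin:", shannon_entropy(fair_coin))
print("entropy of biased coin:", shannon_entropy(biased_coin))
```

The biased coin has lower entropy than the fair coin because its outcome is more predictable, so on average each flip carries less information.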
KL Divergence
\[D_{KL}(P \vert\vert Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{P(x)}{Q(x)}
\right] = \mathbb{E}_{x \sim P} \left[ \log P(x) - \log Q(x) \right]\]
For a discrete random variable $x$, $D_{KL}(P \vert\vert Q)$ measures the extra amount of
information (in bits, if the logarithm is base 2) needed to send a message containing
symbols drawn from probability distribution $P$, when using a code designed to minimize
the length of messages drawn from $Q$.
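
The definition above can be sketched for discrete distributions as follows. This is only an illustration: the function name `kl_divergence`, the handling of zero probabilities, and the example distributions `p` and `q` are assumptions made for this sketch.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # term is treated as 0 (0 * log 0 = 0 convention)
        if q_i == 0.0:
            return math.inf   # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total / math.log(base)

p = [0.5, 0.25, 0.25]     # hypothetical "true" distribution
q = [1 / 3, 1 / 3, 1 / 3] # hypothetical distribution the code was designed for

print("D_KL(P || Q) in bits:", kl_divergence(p, q))
print("D_KL(P || P) in bits:", kl_divergence(p, p))
```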
Properties of KL divergence:
- KL divergence between $P$ and $Q$ is 0 iff $P$ and $Q$ are the same
distribution (for discrete variables) or equal almost everywhere (if they are
continuous).
- This is because, when $P$ and $Q$ are the same, the value of $P(x)/Q(x)$
is always one, and $\log(P(x)/Q(x))$ is always 0. Therefore its
expected value is also zero.
- KL divergence between $P$ and $Q$ is always non-negative, and it is strictly
greater than 0 unless the condition above is satisfied.
- KL divergence is not symmetric: in general, $D_{KL}(P\vert\vert Q) \ne D_{KL}(Q\vert\vert P)$
- This asymmetry means KL divergence is not a true distance metric
between two distributions. However, it is often informally thought of as a
kind of distance between them.
- This asymmetry also affects the choice between
$D_{KL}(P\vert\vert Q)$ and $D_{KL}(Q\vert\vert P)$, which depends on the problem.
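
The non-negativity and asymmetry properties listed above can be checked numerically. The sketch below uses a compact, hypothetical `kl_divergence` helper (in bits) and two made-up distributions; it assumes $Q$ is non-zero wherever $P$ is non-zero.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

p = [0.9, 0.1]  # hypothetical skewed distribution
q = [0.5, 0.5]  # hypothetical uniform distribution

forward = kl_divergence(p, q)   # D_KL(P || Q)
reverse = kl_divergence(q, p)   # D_KL(Q || P)

print("D_KL(P || Q):", forward)
print("D_KL(Q || P):", reverse)
print("both non-negative:", forward >= 0.0 and reverse >= 0.0)
print("symmetric:", math.isclose(forward, reverse))
```

Swapping the arguments changes which distribution the expectation is taken over, which is why the two directions give different values here.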
Cross Entropy
- Closely related to KL-Divergence
- Denoted as: $H(P,Q)$
\[\begin{aligned}
H(P,Q) &= H(P) + D_{KL}(P \vert\vert Q) \\
&= - \mathbb{E}_{x \sim P}[\log P(x)] + \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] \\
&= - \mathbb{E}_{x \sim P}[\log Q(x)]
\end{aligned}\]
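
The decomposition above can be verified numerically for a small discrete example. This is a sketch under stated assumptions: the helper names `entropy`, `kl_divergence`, and `cross_entropy` and the distributions `p` and `q` are hypothetical, and $Q$ is assumed to be non-zero wherever $P$ is non-zero.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0.0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x ~ P}[log Q(x)] in bits."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

p = [0.5, 0.25, 0.25]  # hypothetical example distribution
q = [0.25, 0.25, 0.5]  # hypothetical example distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

print("H(P, Q):", lhs)
print("H(P) + D_KL(P || Q):", rhs)
print("match:", math.isclose(lhs, rhs))
```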
- Minimizing cross-entropy with respect to $Q$ is the same as minimizing KL
divergence with respect to $Q$, because the only other term in the relation
between them, $H(P)$, does not depend on $Q$.
- In many problems involving information and entropy, it is common to adopt the
convention $0 \log 0 = 0$, which is justified by the limit $\lim_{x \to 0^+} x \log x = 0$.
- To discuss this topic in more detail, I am thinking of doing another TLDR on
Shannon’s paper: A Mathematical Theory of Communication.
But this will be done at a much later date.
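
The two technical points in the list above (minimizing cross-entropy versus KL divergence, and the $0 \log 0$ convention) can also be illustrated numerically. The sketch below is an assumption-laden toy: it searches a hypothetical grid of two-outcome candidates for $Q$, uses natural logarithms (nats), and the names `cross_entropy`, `kl_divergence`, and `candidates` are made up for this example.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

p = [0.7, 0.3]  # hypothetical fixed distribution P

# Hypothetical grid of candidate distributions Q = [k/20, 1 - k/20]
candidates = [[k / 20, 1 - k / 20] for k in range(1, 20)]

best_by_cross_entropy = min(candidates, key=lambda q: cross_entropy(p, q))
best_by_kl = min(candidates, key=lambda q: kl_divergence(p, q))
print("same minimizer:", best_by_cross_entropy == best_by_kl)

# The two objectives differ by H(P), which is the same for every candidate Q
gaps = [cross_entropy(p, q) - kl_divergence(p, q) for q in candidates]
print("spread of the gap across candidates:", max(gaps) - min(gaps))

# x * log(x) shrinks toward 0 as x approaches 0 from above
for x in (1e-1, 1e-4, 1e-8, 1e-16):
    print(x, x * math.log(x))
```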
Bipin Lekhak