Information Theory 1
Chapter 3 of the Deep Learning Book is focused on Probability and
Information Theory. This post is TLDR part 4 of the corresponding chapter of
the book.
Introduction and definitions
- Information theory is needed to characterize probability distributions and
to compare them.
- Quantifying the similarity between two probability distributions is a recurring topic of interest.
- A way to measure information is needed. The measure of information $I(x)$ has to satisfy the
following properties:
- Likely events have low information content:
- If an event is certain to happen, no information is gained from observing that event.
- Less likely events have higher information content.
- Independent events have additive information:
$I(x_1 \text{ and } x_2) = I(x_1) + I(x_2)$
- One of the simplest functions that satisfies these properties is:
\[I(x) = -\log P(x)\]
- Units depend on the base of the log:
- For $\log_e$, the unit is nats.
- For $\log_2$, the unit is bits (or Shannons).
- Proof of additivity: for independent events, $P(x_1 \text{ and } x_2) = P(x_1)P(x_2)$, so
$I(x_1 \text{ and } x_2) = -\log\big(P(x_1)P(x_2)\big) = -\log P(x_1) - \log P(x_2) = I(x_1) + I(x_2)$.
- The expectation of information across all events is called Shannon entropy:
\[H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]\]
- The Shannon entropy of a distribution is the expected amount of information in an event drawn from
that distribution.
- It gives a lower bound on the number of bits needed on average to encode symbols
drawn from distribution $P$ (a small numerical sketch follows this list).
- If $x$ is a continuous random variable, the Shannon entropy is called
differential entropy.
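To make these definitions concrete, here is a minimal numerical sketch (my own example, not taken from the book) that computes self-information and Shannon entropy for an arbitrary four-outcome distribution using NumPy:

```python
import numpy as np

# Arbitrary example distribution over four outcomes (must sum to 1).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Self-information of each outcome, I(x) = -log2 P(x), measured in bits.
self_info = -np.log2(p)
print(self_info)       # [1. 2. 3. 3.]

# Shannon entropy: the expected self-information, H(x) = -sum P(x) log2 P(x).
entropy = np.sum(p * self_info)
print(entropy)         # 1.75 bits per symbol on average
```

Because the probabilities here are powers of two, an optimal code (e.g. a Huffman code) uses exactly 1, 2, 3, and 3 bits for the four symbols, and its average length of 1.75 bits matches the entropy lower bound.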
KL Divergence
\[D_{KL}(P \vert\vert Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]
= \mathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right]\]
- For a discrete random variable $x$, $D_{KL}$ measures the extra number of bits
needed to send a message containing symbols drawn from probability
distribution $P$, when using a code designed to minimize the length of messages drawn from $Q$.
- Properties of KL divergence:
- The KL divergence between $P$ and $Q$ is 0 iff $P$ and $Q$ are the same
distribution (for discrete variables) or equal almost everywhere (for
continuous variables):
- This is because, when $P$ and $Q$ are the same, the value of $P(x)/Q(x)$
is always one, so $\log(P(x)/Q(x))$ is always 0 and therefore its
expected value is also zero. (The converse, that zero divergence forces $P$ and $Q$ to be equal, follows from Jensen's inequality.)
- The KL divergence between $P$ and $Q$ is always non-negative, and it is strictly greater
than 0 unless the condition above ($P = Q$) is satisfied.
- KL divergence is not symmetric: $D_{KL}(P\vert\vert Q) \ne D_{KL}(Q\vert\vert P)$
- This asymmetry means KL divergence cannot be treated as a distance metric
between two distributions, although for simplicity it is often loosely thought
of as one (the sketch after this list illustrates the asymmetry).
- This asymmetry also has consequences for choosing between
$D_{KL}(P\vert\vert Q)$ and $D_{KL}(Q\vert\vert P)$, depending on the
problem.
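Below is a small sketch (with my own example distributions, not from the book) that computes the divergence in both directions, illustrating both non-negativity and asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with full support, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Two arbitrary example distributions over three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

print(kl_divergence(p, q))   # ~1.10 nats
print(kl_divergence(q, p))   # ~1.00 nats -> different value: KL is not symmetric
print(kl_divergence(p, p))   # 0.0        -> zero when the distributions are equal
```

The same quantity is also available as `scipy.stats.entropy(p, q)`, which returns the relative entropy when a second distribution is passed in.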
Cross Entropy
- Closely related to KL divergence.
- Denoted as $H(P, Q)$:
\[\begin{aligned} H(P,Q) &= H(P) + D_{KL}(P \vert\vert Q) \\
&= -\mathbb{E}_{x \sim P}[\log P(x)] + \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] \\
&= -\mathbb{E}_{x \sim P}[\log Q(x)] \end{aligned}\]
- Minimizing cross-entropy with respect to $Q$ is the same as minimizing the KL
divergence with respect to $Q$, because the remaining term in the relation above,
$H(P)$, does not depend on $Q$ (a quick numerical check follows).
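As a quick numerical check (my own sketch, reusing the example distributions from above), the identity $H(P,Q) = H(P) + D_{KL}(P \vert\vert Q)$ can be verified directly:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (arbitrary example)
q = np.array([0.1, 0.3, 0.6])   # model distribution (arbitrary example)

entropy_p     = -np.sum(p * np.log(p))       # H(P)
kl_pq         =  np.sum(p * np.log(p / q))   # D_KL(P || Q)
cross_entropy = -np.sum(p * np.log(q))       # H(P, Q) = -E_{x~P}[log Q(x)]

# The identity H(P, Q) = H(P) + D_KL(P || Q) holds up to floating-point error.
print(np.isclose(cross_entropy, entropy_p + kl_pq))   # True

# H(P) does not depend on Q, so minimizing H(P, Q) over Q is equivalent
# to minimizing D_KL(P || Q) over Q.
```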
Notes:
- In many problems involving information and entropy, it is common to adopt the convention
$0 \log 0 = 0$, which is justified by the limit $\lim_{x \to 0} x \log x = 0$ (see the check after these notes).
- To discuss this topic in more detail, I am thinking of doing another TLDR on
Shannon's paper, A Mathematical Theory of Communication.
But this shall be done at a much later date.
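As a quick check of this convention (again my own sketch), $x \log x$ does indeed approach 0, and in code it is common to mask out zero-probability outcomes explicitly so that NumPy does not produce `0 * -inf = nan`:

```python
import numpy as np

for x in [1e-2, 1e-4, 1e-8, 1e-16]:
    print(x * np.log(x))    # values approach 0 as x -> 0

# Entropy with the 0 * log 0 = 0 convention applied explicitly:
def entropy(p):
    p = np.asarray(p, dtype=float)
    nonzero = p > 0                      # drop zero-probability outcomes
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

print(entropy([0.5, 0.5, 0.0]))          # ~0.693 nats, the same as a fair coin
```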
September 26th, 2020 by Bipin Lekhak