Bipin's Bubble

Probability 2: Common Probability Distributions

Chapter 3 of the Deep Learning Book is focused on Probability and Information Theory. This post is part 2 of the TL;DR of that chapter, covering common probability distributions.

Bernoulli Distribution

\[P(\text{x} = 1) = \phi \\ P(\text{x} = 0) = 1 - \phi\] \[P(\text{x} = x) = \phi^x (1-\phi)^{(1-x)}\] \[\mathbb{E}_{\text{x}} [x] = \phi \\ \mathrm{Var}_{\text{x}}(x) = \phi (1- \phi)\]
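A quick numerical sanity check of these moments (a NumPy sketch; $\phi = 0.3$ is an assumed value for illustration):

```python
import numpy as np

# Bernoulli with parameter phi (value assumed for illustration)
phi = 0.3
rng = np.random.default_rng(0)

# A Bernoulli draw is a Binomial draw with n = 1
samples = rng.binomial(n=1, p=phi, size=100_000)

# PMF from the formula above: phi^x * (1 - phi)^(1 - x)
pmf = lambda x: phi ** x * (1 - phi) ** (1 - x)

# Sample mean approaches phi, sample variance approaches phi * (1 - phi)
```
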

Multinoulli Distribution (Categorical distribution)

A distribution over a single discrete variable with $k$ different states, parametrized by a vector $\mathbf{p} \in [0, 1]^{k-1}$; the probability of the $k$-th state is $1 - \mathbf{1}^T \mathbf{p}$.

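The multinoulli (categorical) distribution assigns a probability to each of $k$ discrete states. As a sketch (the probability vector is an assumed value for illustration), samples can be drawn with NumPy, and the empirical frequencies recover the parameters:

```python
import numpy as np

# Multinoulli over k = 3 states with probabilities p (assumed for illustration)
p = np.array([0.2, 0.5, 0.3])
rng = np.random.default_rng(1)

# One draw of 100k multinoulli trials, returned as per-state counts
counts = rng.multinomial(n=100_000, pvals=p)
freqs = counts / counts.sum()  # empirical frequencies approach p
```
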
Gaussian Distribution (Normal Distribution)

\[\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2 \pi \sigma^2}} \exp \left(-\frac{1}{2 \sigma^2} (x - \mu)^2 \right)\] \[\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2 \pi}} \exp \left(-\frac{\beta}{2} (x - \mu)^2 \right)\] \[\mathcal{N}(\mathbf{x}; \mathbf{\mu}, \mathbf{\Sigma}) = \sqrt{\frac{1}{(2 \pi)^n \det(\mathbf{\Sigma})}} \exp \left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}) \right)\]

Or, alternatively, in terms of the precision matrix $\mathbf{\beta} = \mathbf{\Sigma}^{-1}$:

\[\mathcal{N}(\mathbf{x}; \mathbf{\mu}, \mathbf{\beta}^{-1}) = \sqrt{\frac{\det(\mathbf{\beta})}{(2 \pi)^n}} \exp \left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu})^T \mathbf{\beta} (\mathbf{x} - \mathbf{\mu}) \right)\]
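A small sketch checking that the two univariate parametrizations above agree when $\beta = 1/\sigma^2$, and that the density integrates to 1 ($\mu$ and $\sigma^2$ are assumed values for illustration):

```python
import numpy as np

# Univariate Gaussian density, variance parametrization (mu, sigma2 assumed)
def gaussian_pdf(x, mu, sigma2):
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-((x - mu) ** 2) / (2 * sigma2))

# Same density in the precision parametrization, beta = 1 / sigma^2
def gaussian_pdf_precision(x, mu, beta):
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-beta * (x - mu) ** 2 / 2)

x = np.linspace(-10.0, 10.0, 10_001)
dens = gaussian_pdf(x, mu=1.0, sigma2=2.0)

# Riemann-sum approximation of the total probability; should be ~1
area = np.sum(dens) * (x[1] - x[0])
```
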

Exponential and Laplace Distributions

\[p(x; \lambda) = \lambda \mathbf{1}_{x \ge 0} \exp(-\lambda x)\] \[\text{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp \left(-\frac{\vert x - \mu \vert}{\gamma}\right)\]
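Both are easy to sample from and sanity-check against their known moments: the exponential has mean $1/\lambda$, and the Laplace has mean $\mu$ and variance $2\gamma^2$. A sketch with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Exponential(lambda): mean = 1 / lambda (NumPy's scale parameter is 1 / lambda)
lam = 2.0
exp_samples = rng.exponential(scale=1.0 / lam, size=100_000)

# Laplace(mu, gamma): mean = mu, variance = 2 * gamma^2
mu, gamma = 1.0, 0.5
lap_samples = rng.laplace(loc=mu, scale=gamma, size=100_000)
```
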

Dirac Distribution and Empirical Distribution

The Dirac delta $\delta(x - \mu)$ is not an ordinary function: it is zero everywhere except $x = \mu$, yet integrates to 1, so it concentrates all probability mass on a single point. The empirical distribution places mass $\frac{1}{m}$ on each of the $m$ data points:

\[p(x) = \delta(x - \mu)\] \[p(x) = \frac{1}{m} \sum_{i=1}^{m}\delta(x - x^{(i)})\]
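Sampling from the empirical distribution is just uniform resampling of the dataset, so its moments match the sample moments of the data. A sketch with a toy (assumed) dataset:

```python
import numpy as np

# Empirical distribution over a dataset: each point x^(i) gets mass 1/m.
data = np.array([0.0, 1.0, 1.0, 3.0])  # toy dataset, assumed for illustration
rng = np.random.default_rng(3)

# Drawing from the empirical distribution = uniform resampling with replacement
resampled = rng.choice(data, size=100_000, replace=True)
# The mean of the resampled values approaches the mean of the data
```
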

Mixture of Distributions

\[P(x) = \sum_i{P(c = i) P(x\vert c = i)}\]
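This sum-over-components form suggests ancestral sampling: first draw the component identity $c \sim P(c)$, then draw $x$ from the chosen component $P(x \vert c)$. A sketch for a two-component Gaussian mixture (weights, means, and standard deviations are assumed values):

```python
import numpy as np

rng = np.random.default_rng(4)
weights = np.array([0.7, 0.3])                     # P(c = i), assumed
means, stds = np.array([0.0, 5.0]), np.array([1.0, 1.0])  # components, assumed

n = 100_000
c = rng.choice(len(weights), size=n, p=weights)    # step 1: latent component ids
x = rng.normal(means[c], stds[c])                  # step 2: x | c

# Mixture mean is the weighted sum of component means: 0.7*0 + 0.3*5 = 1.5
```
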

Useful Properties of Common Functions

Some useful relations involving the logistic sigmoid $\sigma(x) = \frac{1}{1 + \exp(-x)}$ and the softplus $\zeta(x) = \log(1 + \exp(x))$:

\[\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x)) \\ 1 - \sigma(x) = \sigma(-x) \\ \log \sigma(x) = -\zeta(-x) \\ \frac{d}{dx}\zeta(x) = \sigma(x) \\ \zeta(x) - \zeta(-x) = x\]

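Two functions that recur throughout the book are the logistic sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ and the softplus $\zeta(x) = \log(1 + e^x)$. A small numerical check of some of their standard identities:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))  # zeta(x) = log(1 + exp(x))

x = np.linspace(-5.0, 5.0, 101)

# Identities checked below:
#   1 - sigma(x) = sigma(-x)
#   log sigma(x) = -zeta(-x)
#   zeta(x) - zeta(-x) = x
```
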
Change of variable

If $y = g(x)$, where $g$ is an invertible, continuous, and differentiable function, then:

\[\vert \, p_y(g(x)) \, dy \, \vert = \vert \, p_x(x) \, dx \, \vert\] \[p_y(y) = p_x(g^{-1}(y)) \left\vert \frac{\partial x}{\partial y} \right\vert\] \[p_x(x) = p_y(g(x)) \left\vert \frac{\partial g(x)}{\partial x} \right\vert\]
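The key point is the Jacobian factor: naively substituting $p_y(y) = p_x(g^{-1}(y))$ would not integrate to 1. A sketch checking this for $y = g(x) = 2x$ with $x \sim \text{Uniform}(0, 1)$, where the formula gives $p_y(y) = 1 \cdot \vert \frac{1}{2} \vert = \frac{1}{2}$ on $(0, 2)$:

```python
import numpy as np

# x ~ Uniform(0, 1), y = g(x) = 2x stretches the support to (0, 2)
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=200_000)
y = 2.0 * x

# Histogram density of y should be flat at 1/2, not at p_x(x) = 1:
# the |dx/dy| = 1/2 factor compensates for the stretched interval.
hist, edges = np.histogram(y, bins=20, range=(0.0, 2.0), density=True)
```
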