Probability 2: Common Probability Distributions
Chapter 3 of the Deep Learning Book focuses on Probability and
Information Theory. This post is part 2 of the TLDR of the corresponding chapter of
the book.
Bernoulli Distribution
- A single binary random variable: the two possible states are 0 and 1.
- Controlled by a single parameter $\phi \in [0,1]$, which gives the probability of the random variable taking state 1.
\[P(\text{x} = 1) = \phi \\ P(\text{x} = 0) = 1 - \phi\]
- Combining these two relations:
\[P(\text{x} = x) = \phi^x (1-\phi)^{(1-x)}\]
- The expectation and variance of this random variable are:
\[\mathbb{E}_{\text{x}} [x] = \phi \\ Var_{\text{x}}(x) = \phi (1- \phi)\]
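A minimal numpy sketch (the value of $\phi$ is arbitrary) that draws Bernoulli samples and checks the expectation and variance formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter: probability of the random variable taking state 1.
phi = 0.3

# Draw Bernoulli(phi) samples and compare empirical moments with
# E[x] = phi and Var(x) = phi * (1 - phi).
samples = rng.binomial(n=1, p=phi, size=100_000)
print(samples.mean())   # approximately 0.3
print(samples.var())    # approximately 0.3 * 0.7 = 0.21
```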
Multinoulli Distribution (Categorical Distribution)
- Similar to Bernoulli, but with $k$ different possible states ($k$ being a finite integer).
- Parameterized by a vector $\mathbf{p} \in [0,1]^{k-1}$, where $p_i$ gives the
probability of the $i^{th}$ state (see the sampling sketch after this list).
- The probability of $k^{th}$ state is given by: $1- \mathbf{1^Tp}$
- Note that $\mathbf{1^Tp} \le 1$
- Generally used to refer to distributions over categories of objects.
- Do not assume state 1 has numerical value 1.
- Usually no need to compute the expectation or variance, since the states carry no numerical meaning.
- The Bernoulli and categorical distributions can describe any distribution over
their domain:
- (Reason) The domain is simple: discrete values -> finite states -> all states can be easily
enumerated.
- Complications arise with continuous random variables:
- Infinite number of states
- Finite number of parameters
- Need to impose strict limits on distributions.
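A small sketch of sampling from a categorical distribution parameterized by the first $k-1$ probabilities, as described above (the state count and probability values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-state example: p holds the probabilities of the first
# k - 1 = 3 states; the k-th state gets probability 1 - 1^T p.
p = np.array([0.1, 0.2, 0.3])
probs = np.append(p, 1.0 - p.sum())   # full probability vector over k = 4 states

# Sample state indices; the states are just labels, not numerical values.
states = rng.choice(len(probs), size=10, p=probs)
print(states)   # array of indices in {0, 1, 2, 3}
```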
Gaussian Distribution (Normal Distribution)
- Continuous random variable
\[\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2 \pi \sigma^2}}
exp \left(-\frac{1}{2 \sigma^2} (x - \mu)^2 \right)\]
- The parameters of the distribution are: $\mu \in \mathbb{R}$ and
$\sigma \in (0, \infty)$
- Under this distribution: $\mathbb{E}[x] = \mu$ gives location of central peak.
- Variance of distribution is $\sigma^2$
- Alternate parameterization: using mean $\mu$ and precision (inverse variance) $\beta$:
\[\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2 \pi}}
exp \left(-\frac{\beta}{2 } (x - \mu)^2 \right)\]
- The normal distribution is a sensible choice in many applications, and can
therefore be used as a default choice in problems where we lack prior knowledge
about the distribution over real numbers. The major reasons are:
- Central Limit Theorem
- Sum of many independent random variables is approximately normally distributed.
- Complicated systems can be modelled successfully as normally distributed noise, even if
the system can be decomposed into more structured parts.
- High entropy
- Out of all distributions with the same variance, the normal distribution
encodes the maximum uncertainty over the real numbers.
- Introduces least amount of prior information into model.
- More details on this later.
- Generalizing normal distribution to vector valued $\mathbf{x} \in \mathbb{R}^n$
- $\mu \in \mathbb{R}^n$ will be mean
- Covariance will replace variance and is denoted by matrix:
$\mathbf{\Sigma} \in \mathbb{R}^{n,n}$
- Alternatively, precision matrix will be: $\mathbf{\beta} \in \mathbb{R}^{n,n}$
\[\mathcal{N}(\mathbf{x}; \mathbf{\mu}, \mathbf{\Sigma}) = \sqrt{\frac{1}{(2 \pi)^n det(\mathbf{\Sigma})}}
exp \left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}) \right)\]
Or, alternatively, using the precision matrix:
\[\mathcal{N}(\mathbf{x}; \mathbf{\mu}, \mathbf{\beta}^{-1}) = \sqrt{\frac{det(\mathbf{\beta})}{(2 \pi)^n}}
exp \left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu})^T \beta (\mathbf{x} - \mathbf{\mu}) \right)\]
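A sketch (all parameter values are made up) that evaluates the precision form of the univariate density and the multivariate density above, cross-checking both against scipy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate density written with the precision beta = 1 / sigma^2.
mu, sigma = 1.0, 2.0
beta = 1.0 / sigma**2
x = 0.5
pdf_precision = np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu)**2)
print(np.isclose(pdf_precision, norm.pdf(x, loc=mu, scale=sigma)))   # True

# Multivariate density with a hypothetical 2-d mean and covariance.
mean = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
xv = np.array([0.3, 0.7])
d = xv - mean
pdf_mv = np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / np.sqrt((2 * np.pi)**2 * np.linalg.det(cov))
print(np.isclose(pdf_mv, multivariate_normal.pdf(xv, mean=mean, cov=cov)))   # True
```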
Exponential and Laplace Distribution
- The Gaussian is smooth around its mean (around 0 for the standard Gaussian).
- Sometimes a sharp point is needed at $x=0$.
- The exponential and Laplace distributions achieve that.
- Exponential PDF:
\[p(x; \lambda) = \lambda \mathbf{1}_{x \ge 0} exp(-\lambda x)\]
- $\mathbf{1}_{x \ge 0}$ is the indicator function that equals 1 when $x \ge 0$ and 0 otherwise
- Assigns probability of 0 for $x < 0$
- Laplace is closely related to exponential and has PDF:
\[Laplace(x; \mu, \gamma) = \frac{1}{2\gamma} exp \left(-\frac{\vert x- \mu \vert}{\gamma}\right)\]
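A short sketch of both densities (parameter values chosen arbitrarily); evaluating points on either side of 0 shows the indicator cutting off the exponential and the sharp, non-differentiable peak of the Laplace at $\mu$:

```python
import numpy as np

def exponential_pdf(x, lam):
    # lambda * exp(-lambda * x) for x >= 0, and 0 otherwise (the indicator term).
    return np.where(x >= 0, lam * np.exp(-lam * x), 0.0)

def laplace_pdf(x, mu, gamma):
    # Density with a sharp (non-differentiable) peak at x = mu.
    return np.exp(-np.abs(x - mu) / gamma) / (2 * gamma)

# Points straddling 0 (with mu = 0 for the Laplace).
xs = np.array([-1.0, -0.01, 0.0, 0.01, 1.0])
print(exponential_pdf(xs, lam=1.5))
print(laplace_pdf(xs, mu=0.0, gamma=1.0))
```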
Dirac Distribution and empirical distribution
- Used to place all the probability mass of a PDF on a single point.
- Uses the Dirac delta function, which is zero everywhere except at $\mu$ yet integrates to 1:
\[p(x) = \delta(x - \mu)\]
- Empirical distribution: sum of $m$ Dirac delta distribution:
\[p(x) = \frac{1}{m} \sum_{i=1}^{m}\delta(x - x^{(i)})\]
- Puts probability mass of $\frac{1}{m}$ on each of the $m$ points $x^{(1)}, …, x^{(m)}$
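A small sketch (the dataset values are made up) showing that sampling from the empirical distribution amounts to drawing data points uniformly with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset of m = 5 points; the empirical distribution puts
# probability mass 1/m on each of them.
data = np.array([0.2, 1.5, 1.5, 3.0, 4.2])

# Sampling from the empirical distribution = resampling the data
# uniformly with replacement.
draws = rng.choice(data, size=8, replace=True)
print(draws)

# Its expectation is simply the sample mean of the data.
print(data.mean())
```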
Mixture of Distributions
- Combine simple distributions with small numbers of parameters (called component
distributions) to make a more complex mixture distribution.
- A categorical distribution decides which component distribution to choose from.
- Sampling from a mixture = sample a value from the categorical distribution to choose a
component, then sample from that component distribution (see the sampling sketch at the
end of this section):
\[P(x) = \sum_i{P(c = i) P(x\vert c = i)}\]
- The random variable $c$ above is an example of a latent variable.
- Latent variables can not be directly observed.
- May be related to $x$ by joint distribution: $P(x,c) = P(x \vert c) P (c)$
- Gaussian mixtures are mixtures in which each component distribution is a Gaussian.
- $p(x \vert c = i)$ is Gaussian with parameters: $\mu^{(i)}$ and $\Sigma^{(i)}$
- Further constraints are possible: e.g. all components may share a single covariance matrix.
- $P(c=i) = \alpha_i$ for all $i$ are additional parameters for Gaussian
mixture, these parameters are called prior probabilities.
- By this definition: $P(c \vert x)$ will be called posterior probability
because it is computed after observation of $x$.
- Gaussian mixtures are universal approximators of densities:
- Any smooth probability density can be approximated to within any specific non-zero
amount of error by a Gaussian mixture model with enough components.
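A minimal sketch (component parameters are made up) of ancestral sampling from a 1-d Gaussian mixture, following $P(x) = \sum_i P(c = i) P(x \vert c = i)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-d Gaussian mixture with 3 components.
alphas = np.array([0.5, 0.3, 0.2])   # prior probabilities P(c = i)
mus = np.array([-2.0, 0.0, 3.0])     # component means mu^(i)
sigmas = np.array([0.5, 1.0, 0.8])   # component standard deviations

def sample_mixture(n):
    # Ancestral sampling: first draw the latent category c from the
    # categorical prior, then draw x from the chosen Gaussian component.
    c = rng.choice(len(alphas), size=n, p=alphas)
    return rng.normal(loc=mus[c], scale=sigmas[c])

print(sample_mixture(5))
```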
Useful Properties of common functions
- Logistic sigmoid:
-
\[\sigma (x) = \frac{1}{1 + exp(-x)}\]
- $0 \le \sigma (x) \le 1$
- Used to produce parameter $\phi$ for Bernoulli Distribution
- Softplus function:
-
\[\zeta (x) = log(1 + exp(x))\]
- $0 < \zeta (x) < \infty$
- Used to produce parameter $\beta$ or $\sigma$ for Gaussian Distribution
Some useful relations (spot-checked numerically in the sketch after this list):
-
\[\sigma (x) = \frac{exp(x)}{exp(0) + exp(x)}\]
-
\[\frac{d}{dx}\sigma (x) = \sigma (x) ( 1- \sigma (x))\]
-
\[1- \sigma (x) = \sigma (-x)\]
-
\[log(\sigma (x)) = - \zeta (-x)\]
-
\[\frac{d}{dx}\zeta (x) = \sigma (x)\]
-
\[\forall x \in (0,1) , \sigma^{-1} (x)= log\left(\frac{x}{1-x}\right)\]
-
\[\forall x > 0, \zeta^{-1} (x)= log\left(exp(x) - 1 \right)\]
-
\[\zeta(x)= \int_{-\infty}^{x}{\sigma (y) dy}\]
-
\[\zeta(x) - \zeta (-x) = x\]
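A quick numpy spot-check of several of the relations above (the test points are arbitrary; the implementations are naive rather than numerically hardened):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + exp(x))

xs = np.linspace(-5, 5, 11)
print(np.allclose(1 - sigmoid(xs), sigmoid(-xs)))         # 1 - sigma(x) = sigma(-x)
print(np.allclose(np.log(sigmoid(xs)), -softplus(-xs)))   # log(sigma(x)) = -zeta(-x)
print(np.allclose(softplus(xs) - softplus(-xs), xs))      # zeta(x) - zeta(-x) = x

xp = np.linspace(0.1, 0.9, 9)
print(np.allclose(sigmoid(np.log(xp / (1 - xp))), xp))    # sigma(sigma^{-1}(x)) = x

xq = np.linspace(0.5, 3.0, 6)
print(np.allclose(softplus(np.log(np.expm1(xq))), xq))    # zeta(zeta^{-1}(x)) = x
```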
Change of variable
If $y = g(x)$, where $g$ is an invertible, continuous, and differentiable function:
\[\vert \ p_y (g(x)) \ dy \ \vert = \vert \ p_x (x) \ dx \ \vert \\
p_y(y) = p_x(g^{-1}(y)) \left \vert \frac{\partial x}{\partial y} \right \vert \\
p_x(x) = p_y(g(x)) \left \vert \frac{\partial g(x)}{\partial x} \right \vert\]
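A small numerical check of the change-of-variables formula (the transformation $y = x^2$ applied to a uniform $x$ is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example: x ~ Uniform(0, 1) and y = g(x) = x**2.
# The formula gives p_y(y) = p_x(sqrt(y)) * |dx/dy| = 1 / (2 * sqrt(y)) on (0, 1).
x = rng.random(1_000_000)
y = x**2

# Compare the Monte Carlo probability of y landing in [a, b] with the
# integral of 1 / (2 * sqrt(y)) over [a, b], which is sqrt(b) - sqrt(a).
a, b = 0.25, 0.36
mc_mass = np.mean((y >= a) & (y <= b))
analytic_mass = np.sqrt(b) - np.sqrt(a)
print(mc_mass, analytic_mass)   # both approximately 0.1
```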
September 24th, 2020 by Bipin Lekhak
Feel free to share!