Probability 1

Chapter 3 of the Deep Learning Book is focused on Probability and Information Theory. This post is part 1 of a TLDR of that chapter.

Why Probability?


Random Variables


Probability Distributions

Discrete Random Variable and Probability Mass Function

Continuous Random Variable and Probability Density Function


Marginal Probability

\[\forall x \in \text{x}, P(\text{x}= x) = \sum_y{P(\text{x}= x, \text{y}=y)}\] \[\forall y \in \text{y}, P(\text{y}= y) = \sum_x{P(\text{x}= x, \text{y}=y)}\]
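As a quick sanity check of these sums, here is a minimal numpy sketch using a small hypothetical joint table over discrete x (rows) and y (columns); summing out the other variable gives each marginal:

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1)  # P(x) = sum_y P(x, y)
p_y = joint.sum(axis=0)  # P(y) = sum_x P(x, y)

print(p_x)        # approx [0.4, 0.6]
print(p_y)        # approx [0.35, 0.35, 0.3]
print(p_x.sum())  # approx 1.0 -- marginals of a valid joint still sum to 1
```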

Conditional Probability

\[P(\text{y} = y \vert \text{x} = x) = \dfrac{P(\text{y} = y, \text{x} = x)}{P(\text{x} = x)}\]
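Continuing the same hypothetical joint table, the conditional P(y | x) is simply the joint divided by the marginal of the variable being conditioned on:

```python
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],   # hypothetical P(x, y)
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1, keepdims=True)  # P(x), shape (2, 1) for broadcasting
p_y_given_x = joint / p_x               # P(y | x) = P(y, x) / P(x)

print(p_y_given_x)
print(p_y_given_x.sum(axis=1))          # each row is a distribution, so approx [1, 1]
```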

Chain Rule of Conditional Probability

\[P(x^{(1)}, ..., x^{(n)}) = P(x^{(1)}) \ \prod_{i=2}^{n}{\ P(x^{(i)} \vert x^{(1)}, ..., x^{(i-1)})}\]

Using this equation we can write:

\[P(a, b, c) = P(a \vert b, c) \ P(b, c)\] \[P(b, c) = P(b \vert c) \ P(c)\] \[P(a, b, c) = P(a \vert b, c) \ P(b \vert c) \ P(c)\]
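A quick numerical check of this factorization, using a hypothetical random 2x2x2 joint over a, b, and c:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()              # hypothetical P(a, b, c), axes ordered (a, b, c)

p_c = joint.sum(axis=(0, 1))      # P(c)
p_bc = joint.sum(axis=0)          # P(b, c)
p_b_given_c = p_bc / p_c          # P(b | c)
p_a_given_bc = joint / p_bc       # P(a | b, c)

reconstructed = p_a_given_bc * p_b_given_c * p_c
print(np.allclose(reconstructed, joint))  # True
```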

Independence and Conditional Independence

\[\forall x \in \text{x}, y \in \text{y}, p(\text{x}=x, \text{y}=y) = p(\text{x}=x) p(\text{y}=y)\] \[\forall x \in \text{x}, y \in \text{y}, z \in \text{z}, p(\text{x}=x, \text{y}=y \vert \text{z} = z) = p(\text{x}=x \vert \text{z} = z) p(\text{y}=y \vert \text{z} = z)\]
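The first condition is easy to test numerically: a discrete joint describes independent variables exactly when it equals the outer product of its own marginals. A sketch with hypothetical tables:

```python
import numpy as np

def is_independent(joint):
    """Check whether a 2-D discrete joint factorizes into its marginals."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return np.allclose(joint, np.outer(p_x, p_y))

independent_joint = np.outer([0.4, 0.6], [0.3, 0.5, 0.2])  # built as P(x) P(y)
dependent_joint = np.array([[0.10, 0.20, 0.10],
                            [0.25, 0.15, 0.20]])

print(is_independent(independent_joint))  # True
print(is_independent(dependent_joint))    # False
```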

Bayes’ Rule

\[P(x \vert y) = \frac{P(x) P(y \vert x)}{P(y)}\] \[P(x \vert y) = \frac{P(x) P(y \vert x)}{\sum_x{P(x) P(y \vert x)}}\]
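A small worked example with hypothetical numbers: a prior P(x) over two states, a likelihood table P(y | x), and the posterior obtained by normalizing prior times likelihood (the denominator is the marginal P(y) from the second form above):

```python
import numpy as np

prior = np.array([0.8, 0.2])        # hypothetical P(x)
likelihood = np.array([[0.9, 0.1],  # hypothetical P(y | x): rows index x, columns index y
                       [0.3, 0.7]])

y_observed = 1                                     # suppose we observe y = 1
unnormalized = prior * likelihood[:, y_observed]   # P(x) P(y | x)
posterior = unnormalized / unnormalized.sum()      # divide by P(y) = sum_x P(x) P(y | x)

print(posterior)        # approx [0.364, 0.636]
print(posterior.sum())  # approx 1.0
```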

Expectation, Variance and Covariance

Expectation

\[\mathbb{E}_{\text{x} \sim P}[f(x)] = \sum_{x}{f(x)P(x)}\] \[\mathbb{E}_{\text{x} \sim p}[f(x)] = \int{f(x)p(x)dx}\] \[\mathbb{E}_{\text{x}}[\alpha f(x) + \beta g(x)] = \alpha \mathbb{E}_{\text{x}}[f(x)] + \beta \mathbb{E}_{\text{x}}[g(x)]\]
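The discrete form and the linearity property can be verified directly; a minimal sketch with a hypothetical distribution P and functions f and g:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # states of a hypothetical discrete random variable
P = np.array([0.1, 0.4, 0.3, 0.2])   # P(x)

f = x ** 2          # f(x)
g = np.cos(x)       # g(x)
alpha, beta = 2.0, -1.0

E_f = np.sum(f * P)                           # E[f(x)] = sum_x f(x) P(x)
E_g = np.sum(g * P)
E_combo = np.sum((alpha * f + beta * g) * P)  # E[alpha f(x) + beta g(x)]

print(np.isclose(E_combo, alpha * E_f + beta * E_g))  # True -- expectation is linear
```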

Variance

\[Var(f(x)) = \mathbb{E}[\ (f(x) - \mathbb{E}[f(x)])^2 \ ]\]
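In code, variance is just the expectation of the squared deviation; reusing the hypothetical distribution from above:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical discrete states
P = np.array([0.1, 0.4, 0.3, 0.2])   # P(x)
f = x ** 2                           # f(x)

E_f = np.sum(f * P)                  # E[f(x)]
var_f = np.sum((f - E_f) ** 2 * P)   # E[(f(x) - E[f(x)])^2]

print(var_f)
```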

Covariance

\[Cov( f(x), g(y) ) = \mathbb{E}[\ (f(x) - \mathbb{E}[f(x)]) (g(y) - \mathbb{E}[g(y)])\ ]\]
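A sketch of the same idea on a hypothetical discrete joint, taking f and g to be the identity so this is just Cov(x, y):

```python
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],   # hypothetical P(x, y)
                  [0.25, 0.15, 0.20]])
x_vals = np.array([0.0, 1.0])           # states of x (rows)
y_vals = np.array([0.0, 1.0, 2.0])      # states of y (columns)

E_x = np.sum(x_vals * joint.sum(axis=1))   # E[x]
E_y = np.sum(y_vals * joint.sum(axis=0))   # E[y]

# Cov(x, y) = E[(x - E[x]) (y - E[y])], expectation taken under the joint
cov_xy = np.sum(joint * np.outer(x_vals - E_x, y_vals - E_y))
print(cov_xy)  # approx -0.02: nonzero, so x and y are not independent here
```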