Bipin's Bubble

Variational Inference

This is a lecture/paper note series where I publish my personal notes from when I read a couple of disparate things, try to make sense of them, and store them for later. The idea behind publishing them openly is to give other people access to my personal notes, in the hope that they may help at least one other person.

Introduction

Assume a data generation process where i.i.d. data $x$ is generated from a latent variable $z$. Observing $x$ raises interest in learning the data generation process $p(x|z)$, and even the latent variables $z$ may be learnt from the data $x$. This can be done using Bayes' rule. However, except under some strong assumptions, applying Bayes' rule to obtain the posterior $p(z|x)$ leads to intractable integrals. Variational inference is one way of numerically approximating such complicated probabilities even in the presence of intractable integrals.

Our goal is to estimate the true posterior distribution of $z$ given $x$, parameterized by $\theta$ and expressed as $p_\theta(z|x)$, with an approximation $q_\phi(z)$, by finding an algorithm that can use the data and distributions conditioned on $x$ to get us a good estimate.
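
To make this setup concrete, here is a minimal Python sketch of such a generative process, using a toy linear-Gaussian model. The choice of Gaussians, the parameter value `theta_true`, and the dataset size are illustrative assumptions on my part, not part of the general setup.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0                            # decoder parameter we would like to recover

def sample_dataset(n):
    z = rng.normal(0.0, 1.0, size=n)        # latent variables z ~ p(z) = N(0, 1)
    x = rng.normal(theta_true * z, 1.0)     # observations x | z ~ p(x|z) = N(theta * z, 1)
    return x

x_data = sample_dataset(1000)               # only x_data is observed; z stays hidden
```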

Bayes Rule

\[\begin{align} p_\theta (x) & = \int{p_\theta (x, z) \ dz} \\ & = \int{p_\theta (x, z) \cdot \dfrac{q_\phi(z)}{q_\phi(z)} \ dz} \end{align}\]

Taking log on both sides:

\[\begin{align} \log \ p_\theta (x) & = \log \left\{ \int{p_\theta (x, z) \cdot \dfrac{q_\phi(z)}{q_\phi(z)} \ dz} \right\} \\ & \geq \int{q_\phi(z) \cdot \log\left(\dfrac{p_\theta (x, z)}{q_\phi(z)}\right) \ dz} = \text{ELBO} \end{align}\]

The inequality follows from Jensen's inequality: for a concave function $f$, $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$, applied here with $f = \log$ and the expectation taken under $q_\phi(z)$.

The RHS term is called the ELBO (evidence lower bound), which is a function of $\theta$ and $\phi$.

Optimizing the ELBO with respect to $\theta$ and $\phi$ gives a good lower bound on $\log p_\theta(x)$, which in many use cases can be used as a good proxy for $p_\theta(x)$.
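
As a quick sanity check of the bound, the following sketch continues the toy linear-Gaussian model from above, where $\log p_\theta(x)$ is available in closed form ($x \sim N(0, \theta^2 + 1)$), and compares it against a Monte Carlo estimate of the ELBO for an arbitrary Gaussian $q_\phi(z)$. The variational family, parameter values, and sample size are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta, x = 2.0, 1.5                # model parameter and a single observed data point
mu, s = 0.5, 0.8                   # variational parameters phi = (mu, s) of q_phi(z) = N(mu, s^2)

z = rng.normal(mu, s, size=100_000)                                    # samples z ~ q_phi(z)
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, theta * z, 1.0)  # log p_theta(x, z)
log_q = norm.logpdf(z, mu, s)                                          # log q_phi(z)
elbo = np.mean(log_joint - log_q)                                      # Monte Carlo ELBO

log_px = norm.logpdf(x, 0.0, np.sqrt(theta**2 + 1.0))                  # exact log p_theta(x)
print(f"ELBO ~ {elbo:.4f}  <=  log p(x) = {log_px:.4f}")
```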

The ELBO can be further broken down as:

\[\begin{align} \text{ELBO} & = \int{q_\phi(z) \cdot \log\left(\dfrac{p_\theta (x, z)}{q_\phi(z)}\right) \ dz} \\ & = E_{q_\phi(z)}\left[\log\left(\dfrac{p_\theta (x, z)}{q_\phi(z)}\right)\right] \\ & = E_{q_\phi(z)}\left[\log\left(\dfrac{p_\theta (x|z) \cdot p_\theta (z)}{q_\phi(z)}\right)\right] \\ & = E_{q_\phi(z)}\left[\log(p_\theta (x|z)) + \log\left(\dfrac{p_\theta (z)}{q_\phi(z)}\right)\right] \\ & = E_{q_\phi(z)}[\log(p_\theta (x|z))] + E_{q_\phi(z)}\left[\log\left(\dfrac{p_\theta (z)}{q_\phi(z)}\right)\right] \\ & = E_{q_\phi(z)}[\log(p_\theta (x|z))] - KL(q_\phi(z) \ || \ p_\theta (z)) \end{align}\]

Maximizing the ELBO can therefore be thought of as optimizing a composite objective with two parts: the expected reconstruction log-likelihood $E_{q_\phi(z)}[\log p_\theta(x|z)]$, which rewards explaining the observed data, and the KL term $KL(q_\phi(z) \ || \ p_\theta(z))$, which penalizes the approximation for straying from the prior.
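
The following sketch computes the same ELBO through this decomposition: a Monte Carlo estimate of the reconstruction term plus the closed-form KL divergence between two univariate Gaussians. It continues the toy model from earlier, and the specific parameter values are again assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta, x = 2.0, 1.5                # same toy model and observation as above
mu, s = 0.5, 0.8                   # variational parameters of q_phi(z) = N(mu, s^2)

z = rng.normal(mu, s, size=100_000)
recon = np.mean(norm.logpdf(x, theta * z, 1.0))      # E_q[log p_theta(x|z)], Monte Carlo
kl = np.log(1.0 / s) + (s**2 + mu**2) / 2.0 - 0.5    # KL(N(mu, s^2) || N(0, 1)), closed form

elbo = recon - kl                                    # reconstruction term minus KL term
print(f"reconstruction {recon:.4f} - KL {kl:.4f} = ELBO {elbo:.4f}")
```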

Optimizing this ELBO with respect to the parameters $\theta, \phi$ results in learning an approximate encoder $q_\phi(z|x)$, which serves as an estimate of the true posterior $p_\theta(z|x)$, together with the decoder $p_\theta(x|z)$, and thereby helps estimate the entire data generation model.
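
For a rough picture of how this looks in practice, here is a minimal amortized-VI (VAE-style) sketch in PyTorch, assuming a Gaussian encoder $q_\phi(z|x)$ and a Gaussian decoder $p_\theta(x|z)$ with unit observation noise, trained by maximizing a reparameterized ELBO. The network sizes, optimizer settings, and toy data are all illustrative assumptions, not a prescription from these notes.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): maps x to the mean and log-variance of a Gaussian over z."""
    def __init__(self, x_dim=1, z_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return mu, logvar

class Decoder(nn.Module):
    """p_theta(x|z): maps z to the mean of a unit-variance Gaussian over x."""
    def __init__(self, x_dim=1, z_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, z):
        return self.net(z)

def neg_elbo(x, enc, dec):
    mu, logvar = enc(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)       # reparameterized sample z ~ q_phi(z|x)
    recon = -0.5 * ((x - dec(z)) ** 2).sum(dim=-1)                # log p_theta(x|z) up to a constant
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=-1)  # KL(q_phi(z|x) || N(0, I))
    return -(recon - kl).mean()                                   # negative ELBO, averaged over the batch

# Toy training loop on synthetic 1-D data (illustrative only).
x = torch.randn(1000, 1) * (2.0 ** 2 + 1.0) ** 0.5                # marginal of the toy model: N(0, theta^2 + 1)
enc, dec = Encoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = neg_elbo(x, enc, dec)
    loss.backward()
    opt.step()
```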

