Bipin's Bubble

Deep Variational Bayes Filters

arXiv link to the paper: https://arxiv.org/abs/1605.06432

This paper is another one that deals with modelling dynamical systems. Related earlier posts: Variational RNN and Deep Kalman Filters.

This is a paper summary and notes I wrote while reading the paper. I claim no rights to any intellectual property in this post. I am sharing it mostly for personal purposes, and in case someone finds my notes helpful.

Introduction

Modelling assumptions

As mentioned above, the whole idea is to learn an approximation to $\beta$, the stochastic parameters of the latent dynamical system, so we condition the latent transitions on $\beta$:

\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(x_{1:T}, \ z_{1:T} | u_{1:T}) \ dz_{1:T} \\ &= \int p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) \ p(z_{1:T}| u_{1:T}) \ dz_{1:T} \\ &= \int \int p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) \ p(z_{1:T}| u_{1:T}, \beta_{1:T}) \ p(\beta_{1:T}) \ d\beta_{1:T} \ dz_{1:T} \end{align}\]

From the graphical model for state space models, $z_t$ depends only on $z_{t-1}$ (together with the control $u_{t-1}$ and $\beta_{t-1}$), and $x_t$ depends only on $z_t$. The conditional independences thus derived between the $x$'s and $z$'s at different time steps can be used to factorize the emission (decoding) distribution and the prior on $z$ in the above equation as:

\[\begin{align} p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) &= \prod_{t=1}^T p(x_t | z_t) \\ p(z_{1:T}\ | \ u_{1:T}, \beta_{1:T}) &= \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) \\ \end{align}\]
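
To make these two factorizations concrete, here is a minimal ancestral-sampling sketch in PyTorch. It is only an illustration under assumed toy choices, not the paper's architecture: `transition` and `emit` are hypothetical stand-ins for the learned transition and emission networks, and $\beta_t$ is drawn from a simple Gaussian prior.

```python
import torch

# Toy ancestral sampling from the factorized generative model.
# `transition` and `emit` are hypothetical stand-ins for learned networks.

def transition(z, u, beta):
    # p(z_{t+1} | z_t, u_t, beta_t): a simple linear map driven by the control
    # plus the per-step stochastic parameters beta_t
    return 0.9 * z + 0.1 * u + beta

def emit(z):
    # p(x_t | z_t): Gaussian emission centred on a (hypothetical) linear readout
    return torch.distributions.Normal(2.0 * z, 0.1)

T, z_dim = 10, 4
u = torch.randn(T, z_dim)               # controls u_{1:T}
beta = 0.05 * torch.randn(T, z_dim)     # beta_{1:T} ~ p(beta), toy Gaussian noise
z = torch.zeros(z_dim)                  # initial latent state
xs = []
for t in range(T):
    xs.append(emit(z).sample())         # x_t depends only on z_t
    z = transition(z, u[t], beta[t])    # z_{t+1} depends only on (z_t, u_t, beta_t)
x = torch.stack(xs)                     # a sample of x_{1:T} from p(x_{1:T} | u_{1:T})
```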

Substituting the above factorizations into the double integral, we get:

\[p(x_{1:T} | u_{1:T}) = \int \int \prod_{t=1}^T p(x_t | z_t) \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) p(\beta_{1:T}) \ d\beta_{1:T} \ dz_{1:T} \\\]

This equation is interesting because, by reparametrizing all the randomness of the transitions into $\beta$, the transition function becomes deterministic for fixed values of $\beta$. In other words, $p(z_{t+1} | z_t, u_t, \beta_t)$ is a Dirac distribution centred at $z_{t+1} = f(z_t, u_t, \beta_t)$. This simplifies the equation, since the inner integral over $z_{1:T}$ collapses onto the deterministic trajectory:

\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(\beta_{1:T}) \int \prod_{t=1}^T p(x_t | z_t) \ p(z_{1:T} | u_{1:T}, \beta_{1:T}) \ dz_{1:T} \ d\beta_{1:T} \\ &= \int p(\beta_{1:T}) \ \Big[\prod_{t=1}^T p(x_t |z_t)\Big]_{z_t = f(z_{t-1}, u_{t-1}, \beta_{t-1})} \ d\beta_{1:T} \\ &= \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \end{align}\]
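
A minimal sketch of this collapse, using the same toy `transition_det` and `emit` stand-ins as above (not the paper's networks): for a fixed sample of $\beta_{1:T}$, the latent trajectory is fully determined, so evaluating $p_\theta(x_{1:T} | z_{1:T})$ reduces to rolling the latents forward deterministically and summing the emission log-densities along that single path.

```python
import torch

# log p_theta(x_{1:T} | z_{1:T}) for one fixed sample of beta_{1:T}.
# All transition randomness lives in beta, so the latent path is a deterministic
# function of (z_1, u_{1:T}, beta_{1:T}) and the integral over z collapses.
# `transition_det` and `emit` are hypothetical toy stand-ins.

def transition_det(z, u, beta):
    return 0.9 * z + 0.1 * u + beta                    # z_{t+1} = f(z_t, u_t, beta_t)

def emit(z):
    return torch.distributions.Normal(2.0 * z, 0.1)    # p(x_t | z_t)

def log_lik_given_beta(x, u, beta, z1):
    """Sum of log p(x_t | z_t) along the deterministic rollout."""
    z, total = z1, torch.tensor(0.0)
    for t in range(x.shape[0]):
        total = total + emit(z).log_prob(x[t]).sum()   # emission term at time t
        z = transition_det(z, u[t], beta[t])           # deterministic transition
    return total
```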

Learning

Variational inference can be used in the above model, for which the ELBO can be derived (via Jensen's inequality) as:

\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \\ \log \ p(x_{1:T} | u_{1:T}) &= \log \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \\ &= \log \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \cdot \dfrac{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \ d\beta_{1:T} \\ &\ge \int q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T}) \ \log \Big(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\Big) \ d\beta_{1:T} \\ &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \Big[\log \Big(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\Big)\Big] \end{align}\]

The ELBO can be further simplified as:

\[\begin{align} \mathcal{L} &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \Big[\log \Big(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\Big)\Big] \\ &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \Big[\log \ p_\theta(x_{1:T} | z_{1:T}) + \log \Big(\dfrac{p(\beta_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\Big)\Big] \\ &= E_{q_\phi}[\log \ p_\theta(x_{1:T} | z_{1:T})] - KL\big(q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T}) \ || \ p(\beta_{1:T})\big) \end{align}\]
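
Here is a hedged sketch of how this bound could be estimated in practice, assuming a diagonal-Gaussian $q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})$ (its mean and log-std would come from a recognition network; here they are placeholder tensors) and reusing the toy rollout above: one reparameterized sample of $\beta$ gives a Monte Carlo estimate of the reconstruction term, and the KL to a standard-normal prior is available in closed form.

```python
import torch

# Single-sample Monte Carlo estimate of the ELBO
#   E_q[ log p_theta(x_{1:T} | z_{1:T}) ] - KL( q_phi(beta | x, u) || p(beta) ),
# assuming diagonal-Gaussian q_phi and a standard-normal prior on beta.
def elbo(x, u, q_mean, q_logstd, z1, transition_det, emit):
    prior = torch.distributions.Normal(torch.zeros_like(q_mean), torch.ones_like(q_mean))
    q = torch.distributions.Normal(q_mean, q_logstd.exp())
    beta = q.rsample()                                 # reparameterised beta ~ q_phi
    # reconstruction term: deterministic rollout of z given this beta sample
    z, rec = z1, torch.tensor(0.0)
    for t in range(x.shape[0]):
        rec = rec + emit(z).log_prob(x[t]).sum()
        z = transition_det(z, u[t], beta[t])
    # closed-form KL between the two diagonal Gaussians
    kl = torch.distributions.kl_divergence(q, prior).sum()
    return rec - kl
```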

This objective can be optimized with stochastic gradient methods. As usual, one term of the ELBO is the reconstruction error and the other is a KL divergence, here between the approximate posterior over $\beta$ given the observations and the prior on $\beta$. The KL term keeps the posterior approximation for $\beta$ close to its prior, while the reconstruction term forces the inferred $\beta$ (and hence the latent trajectory) to explain the observations. Optimizing this ELBO thus learns an approximate posterior over $\beta$, the stochastic parameters that govern the dynamic system.
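
As a sketch only (using the same assumed toy components; `recognition_net` and `params` are placeholders for whatever parameterizes $q_\phi$ and $p_\theta$ in a real implementation), the bound can be maximized by gradient descent on its negative:

```python
import torch

# Hypothetical training step: maximise the ELBO by descending on its negative.
# `recognition_net`, `transition_det`, `emit`, `elbo`, and `params` are the
# placeholder components sketched above, not the paper's actual architecture.
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(x, u, z1):
    opt.zero_grad()
    q_mean, q_logstd = recognition_net(x, u)    # amortised q_phi(beta | x, u)
    loss = -elbo(x, u, q_mean, q_logstd, z1, transition_det, emit)
    loss.backward()                             # gradients reach phi via rsample
    opt.step()
    return loss.item()
```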

