arXiv link to the paper: https://arxiv.org/abs/1605.06432
This paper also deals with modelling dynamic systems. Related earlier posts: Variational RNN and Deep Kalman Filters.
This is a paper summary and notes I wrote while reading the paper. I claim no rights to any intellectual property in this post. I am sharing this mostly for personal purposes, in case someone finds my notes helpful.
Kalman Filter (non-linear) assumptions (with some reparameterization): \(z_{t+1} = f(z_t, u_t, \beta_t), \ \ \ where \ \ \ \beta_t = \{ w_t, \ v_t \}\)
Here, $\beta_t$ incorporates the noise in the transition model and comprises two parts: $w_t$, which incorporates all sample-specific noise, and $v_t$, which incorporates the universal noise shared across samples of data.
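To make this split concrete, here is a minimal (hypothetical) PyTorch sketch of such a transition: the sample-specific noise $w_t$ enters as an extra input, while the universal part $v_t$ is simply absorbed into the learned network weights. The class name, MLP architecture, and dimensions are my own illustration, not the paper's exact parameterization (the paper itself works with more structured, e.g. locally linear, transitions).

```python
import torch
import torch.nn as nn

class TransitionNet(nn.Module):
    """Hypothetical deterministic transition z_{t+1} = f(z_t, u_t, beta_t).

    The sampled part of beta_t (w_t, the sample-specific noise) enters as an
    extra input, while the universal part (v_t) is absorbed into the learned
    MLP weights. A sketch only, not the paper's parameterization.
    """
    def __init__(self, latent_dim=4, control_dim=2, noise_dim=4, hidden=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(latent_dim + control_dim + noise_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, u_t, beta_t):
        # For a fixed draw of beta_t the next state is fully deterministic.
        return self.f(torch.cat([z_t, u_t, beta_t], dim=-1))
```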
The main idea of this paper is to condition the state transition on $\beta$ and to use variational inference to learn these parameters.
As mentioned above, the whole idea is to learn an approximate posterior over $\beta$, the parameters of the latent dynamic system, so the latent system is conditioned on $\beta$.
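As a picture of what "learning an approximation on $\beta$" could look like, here is a hedged sketch of an amortized recognition network $q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})$ that outputs a diagonal Gaussian per time step. The GRU-based encoder and all dimensions are assumptions for illustration, not the paper's exact inference network.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class BetaEncoder(nn.Module):
    """Amortized q_phi(beta_t | x_{1:T}, u_{1:T}) as a diagonal Gaussian per step."""
    def __init__(self, obs_dim=16, control_dim=2, noise_dim=4, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + control_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * noise_dim)  # mean and log-variance

    def forward(self, x, u):
        # x: (batch, T, obs_dim), u: (batch, T, control_dim)
        h, _ = self.rnn(torch.cat([x, u], dim=-1))
        mean, logvar = self.head(h).chunk(2, dim=-1)
        return Normal(mean, torch.exp(0.5 * logvar))  # q_phi(beta_t | x, u)
```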
\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(x_{1:T}, \ z_{1:T} | u_{1:T}) \ dz_{1:T} \\ &= \int p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) \ p(z_{1:T}| u_{1:T}) \ dz_{1:T} \\ &= \int \int p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) \ p(z_{1:T}| u_{1:T}, \beta_{1:T}) \ p(\beta_{1:T}) \ d\beta_{1:T} \ dz_{1:T} \\ \end{align}\]From the graphical model for State Space Models, $z_t$ depends only on $z_{t-1}$ (and $u_{t-1}$), and $x_t$ depends only on $z_t$. The conditional independence thus derived between the $x$'s and $z$'s at different time steps can be used to factorize the decoding distribution and the prior on $z$ in the above equation as:
\[\begin{align} p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) &= \prod_{t=1}^T p(x_t | z_t) \\ p(z_{1:T}\ | \ u_{1:T}, \beta_{1:T}) &= \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) \\ \end{align}\]Substituting these factorizations into the double integral, we get:
\[p(x_{1:T} | u_{1:T}) = \int \int \prod_{t=1}^T p(x_t | z_t) \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) p(\beta_{1:T}) \ d\beta_{1:T} \ dz_{1:T} \\\]This equation is interesting because reparametrizing all the randomness of the transitions into $\beta$ makes the transition a deterministic function once $\beta$ is fixed. In other words, $p(z_{t+1} | z_t, u_t, \beta_t)$ collapses to a point mass at $z_{t+1} = f(z_t, u_t, \beta_t)$, so the integral over $z$ can be evaluated by simply plugging this deterministic rollout into the decoder, leaving only the integral over $\beta$:
\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(\beta_{1:T}) \left[ \int \prod_{t=1}^T p(x_t | z_t) \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) \ dz_{1:T} \right] d\beta_{1:T} \\ &= \int p(\beta_{1:T}) \ \Big[\prod_{t=1}^T p(x_t |z_t)\Big]_{z_{t+1} = f(z_t, u_t, \beta_t)} \ d\beta_{1:T} \\ &= \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \end{align}\]Variational inference can now be applied to this model, for which the ELBO can be derived as:
\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \\ log \ p(x_{1:T} | u_{1:T}) &= log \ \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \\ &= log \ \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \cdot \dfrac{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \ d\beta_{1:T} \\ &\ge \int q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T}) \ log \left(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right) d\beta_{1:T} \\ &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \left[log \left(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right)\right] \end{align}\]where the inequality follows from Jensen's inequality. The ELBO can be further simplified as:
\[\begin{align} \mathcal{L} &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \left[log \left(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right)\right] \\ &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \left[log \ p_\theta(x_{1:T} | z_{1:T}) \ + \ log \left(\dfrac{p(\beta_{1:T}) }{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right)\right] \\ &= E_{q_\phi}[log \ p_\theta(x_{1:T} | z_{1:T})] - KL(q_\phi(\vec{\beta} | \vec{x}, \vec{u}) \ || \ p(\vec{\beta})) \end{align}\]This objective can be optimized directly. As usual, one term of the ELBO is the reconstruction error and the other is a KL divergence, here between the approximate posterior over $\beta$ given the observations and the prior on $\beta$. As the ELBO is maximized, the KL term keeps this approximate posterior close to the prior while the reconstruction term forces $\beta$ to explain the observed dynamics; optimizing the ELBO therefore learns a close approximation for $\beta$, the parameters that govern the dynamic system.
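Putting the pieces together, a one-sample Monte-Carlo estimate of this ELBO under the sketches above might look as follows: a reparameterized draw of $\beta$ from the encoder, a deterministic rollout of $z$ through the transition, a reconstruction term from the decoder, and a KL against a standard Gaussian prior on $\beta$. Here `decoder` is assumed to map $z_t$ to a `torch.distributions.Normal` over $x_t$, the prior on $\beta$ is assumed standard normal, and the initial state `z0` is taken as given; these are all simplifying assumptions of mine, not the paper's exact setup.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, u, encoder, transition, decoder, z0):
    """One-sample estimate of E_q[log p(x|z)] - KL(q(beta|x,u) || p(beta)).

    x: (batch, T, obs_dim), u: (batch, T, control_dim), z0: (batch, latent_dim).
    encoder/transition are the sketch modules above; decoder(z) returns a Normal.
    """
    q_beta = encoder(x, u)                    # q_phi(beta_{1:T} | x, u)
    beta = q_beta.rsample()                   # reparameterized sample of beta
    p_beta = Normal(torch.zeros_like(beta), torch.ones_like(beta))  # assumed prior

    log_px, z_t = 0.0, z0
    for t in range(x.shape[1]):
        z_t = transition(z_t, u[:, t], beta[:, t])                  # deterministic given beta
        log_px = log_px + decoder(z_t).log_prob(x[:, t]).sum(-1)    # log p(x_t | z_t)

    kl = kl_divergence(q_beta, p_beta).sum(dim=(1, 2))   # sum over time and noise dims
    return (log_px - kl).mean()                          # maximize this over theta, phi
```

Training would then maximize this quantity (equivalently, minimize its negative) with respect to the decoder/transition parameters $\theta$ and the encoder parameters $\phi$.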