arXiv link to the paper: https://arxiv.org/abs/1605.06432
This paper also deals with modelling dynamic systems. Related earlier posts: Variational RNN and Deep Kalman Filters.
This is a paper summary and notes I wrote while reading the paper. I claim no rights to any intellectual property in this post. I am sharing this mostly for personal purposes, in case someone finds my notes helpful.
Kalman Filter (non-linear) assumptions (with some reparameterization): \(z_{t+1} = f(z_t, u_t, \beta_t), \ \ \ where \ \ \ \beta_t = \{ w_t, \ v_t \}\)
Here, $\beta_t$ incorporates the noise in the transition model and comprises two parts: $w_t$, which incorporates all sample-specific noise, and $v_t$, which incorporates the universal noise shared across samples of data.
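To make this split concrete, here is a minimal (hypothetical) PyTorch sketch of such a transition: the sample-specific noise $w_t$ enters as an extra input, while the universal part $v_t$ is simply absorbed into the learned network weights. The class name, MLP architecture, and dimensions are my own illustration, not the paper's exact parameterization (the paper itself works with more structured, e.g. locally linear, transitions).

```python
import torch
import torch.nn as nn

class TransitionNet(nn.Module):
    """Hypothetical deterministic transition z_{t+1} = f(z_t, u_t, beta_t).

    The sampled part of beta_t (w_t, the sample-specific noise) enters as an
    extra input, while the universal part (v_t) is absorbed into the learned
    MLP weights. A sketch only, not the paper's parameterization.
    """
    def __init__(self, latent_dim=4, control_dim=2, noise_dim=4, hidden=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(latent_dim + control_dim + noise_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, u_t, beta_t):
        # For a fixed draw of beta_t the next state is fully deterministic.
        return self.f(torch.cat([z_t, u_t, beta_t], dim=-1))
```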
The main idea of this paper is to condition the state transition on $\beta$ and to use variational inference to learn these parameters.
As mentioned above, the whole idea is to learn an approximate posterior over $\beta$, the parameters of the latent dynamic system, so the latent system is conditioned on $\beta$.
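As a picture of what "learning an approximation on $\beta$" could look like, here is a hedged sketch of an amortized recognition network $q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})$ that outputs a diagonal Gaussian per time step. The GRU-based encoder and all dimensions are assumptions for illustration, not the paper's exact inference network.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class BetaEncoder(nn.Module):
    """Amortized q_phi(beta_t | x_{1:T}, u_{1:T}) as a diagonal Gaussian per step."""
    def __init__(self, obs_dim=16, control_dim=2, noise_dim=4, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + control_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * noise_dim)  # mean and log-variance

    def forward(self, x, u):
        # x: (batch, T, obs_dim), u: (batch, T, control_dim)
        h, _ = self.rnn(torch.cat([x, u], dim=-1))
        mean, logvar = self.head(h).chunk(2, dim=-1)
        return Normal(mean, torch.exp(0.5 * logvar))  # q_phi(beta_t | x, u)
```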
\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(x_{1:T}, \ z_{1:T} | u_{1:T}) \ dz_{1:T} \\ &= \int p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) \ p(z_{1:T}| u_{1:T}) \ dz_{1:T} \\ &= \int \int p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) \ p(z_{1:T}| u_{1:T}, \beta_{1:T}) \ p(\beta_{1:T}) \ d\beta_{1:T} \ dz_{1:T} \\ \end{align}\]From the graphical model for State Space Models, $z_t$ depends only on $z_{t-1}$ (and $u_{t-1}$), and $x_t$ depends only on $z_t$. The conditional independence thus derived between the $x$'s and $z$'s at different time steps can be used to factorize the decoding distribution and the prior on $z$ in the above equation as:
\[\begin{align} p(x_{1:T}\ | \ z_{1:T}, u_{1:T}) &= \prod_{t=1}^T p(x_t | z_t) \\ p(z_{1:T}\ | \ u_{1:T}, \beta_{1:T}) &= \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) \\ \end{align}\]Substituting these factorizations into the double integral, we get:
\[p(x_{1:T} | u_{1:T}) = \int \int \prod_{t=1}^T p(x_t | z_t) \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) p(\beta_{1:T}) \ d\beta_{1:T} \ dz_{1:T} \\\]This equation is interesting because reparametrizing all the randomness of the transitions into $\beta$ makes the transition a deterministic function once $\beta$ is fixed. In other words, $p(z_{t+1} | z_t, u_t, \beta_t)$ collapses to a point mass at $z_{t+1} = f(z_t, u_t, \beta_t)$, so the integral over $z$ can be evaluated by simply plugging this deterministic rollout into the decoder, leaving only the integral over $\beta$:
\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(\beta_{1:T}) \left[ \int \prod_{t=1}^T p(x_t | z_t) \prod_{t=0}^{T-1} p(z_{t+1} | z_t, u_t, \beta_t) \ dz_{1:T} \right] d\beta_{1:T} \\ &= \int p(\beta_{1:T}) \ \Big[\prod_{t=1}^T p(x_t |z_t)\Big]_{z_{t+1} = f(z_t, u_t, \beta_t)} \ d\beta_{1:T} \\ &= \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \end{align}\]Variational inference can now be applied to this model, for which the ELBO can be derived as:
\[\begin{align} p(x_{1:T} | u_{1:T}) &= \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \\ log \ p(x_{1:T} | u_{1:T}) &= log \ \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \ d\beta_{1:T} \\ &= log \ \int p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T}) \cdot \dfrac{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \ d\beta_{1:T} \\ &\ge \int q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T}) \ log \left(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right) d\beta_{1:T} \\ &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \left[log \left(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right)\right] \end{align}\]where the inequality follows from Jensen's inequality. The ELBO can be further simplified as:
\[\begin{align} \mathcal{L} &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \left[log \left(\dfrac{p(\beta_{1:T}) \ p_\theta(x_{1:T} | z_{1:T})}{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right)\right] \\ &= E_{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})} \left[log \ p_\theta(x_{1:T} | z_{1:T}) \ + \ log \left(\dfrac{p(\beta_{1:T}) }{q_\phi(\beta_{1:T} | x_{1:T}, u_{1:T})}\right)\right] \\ &= E_{q_\phi}[log \ p_\theta(x_{1:T} | z_{1:T})] - KL(q_\phi(\vec{\beta} | \vec{x}, \vec{u}) \ || \ p(\vec{\beta})) \end{align}\]This objective can be optimized directly. As usual, one term of the ELBO is the reconstruction error and the other is a KL divergence, here between the approximate posterior over $\beta$ given the observations and the prior on $\beta$. As the ELBO is maximized, the KL term keeps this approximate posterior close to the prior while the reconstruction term forces $\beta$ to explain the observed dynamics; optimizing the ELBO therefore learns a close approximation for $\beta$, the parameters that govern the dynamic system.
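Putting the pieces together, a one-sample Monte-Carlo estimate of this ELBO under the sketches above might look as follows: a reparameterized draw of $\beta$ from the encoder, a deterministic rollout of $z$ through the transition, a reconstruction term from the decoder, and a KL against a standard Gaussian prior on $\beta$. Here `decoder` is assumed to map $z_t$ to a `torch.distributions.Normal` over $x_t$, the prior on $\beta$ is assumed standard normal, and the initial state `z0` is taken as given; these are all simplifying assumptions of mine, not the paper's exact setup.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, u, encoder, transition, decoder, z0):
    """One-sample estimate of E_q[log p(x|z)] - KL(q(beta|x,u) || p(beta)).

    x: (batch, T, obs_dim), u: (batch, T, control_dim), z0: (batch, latent_dim).
    encoder/transition are the sketch modules above; decoder(z) returns a Normal.
    """
    q_beta = encoder(x, u)                    # q_phi(beta_{1:T} | x, u)
    beta = q_beta.rsample()                   # reparameterized sample of beta
    p_beta = Normal(torch.zeros_like(beta), torch.ones_like(beta))  # assumed prior

    log_px, z_t = 0.0, z0
    for t in range(x.shape[1]):
        z_t = transition(z_t, u[:, t], beta[:, t])                  # deterministic given beta
        log_px = log_px + decoder(z_t).log_prob(x[:, t]).sum(-1)    # log p(x_t | z_t)

    kl = kl_divergence(q_beta, p_beta).sum(dim=(1, 2))   # sum over time and noise dims
    return (log_px - kl).mean()                          # maximize this over theta, phi
```

Training would then maximize this quantity (equivalently, minimize its negative) with respect to the decoder/transition parameters $\theta$ and the encoder parameters $\phi$.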