This is a paper summary with notes that I wrote while reading the paper. I claim no rights to any intellectual property in this post. I am sharing it mostly for personal purposes, in case someone finds my notes helpful.
This post is based on the paper "A Recurrent Latent Variable Model for Sequential Data" by Chung et al. [1], published in NeurIPS 2015.
The main goal of this paper is to model a dynamic system, e.g. a time series.
Assume a sequence of observations $x_1, x_2, \ldots, x_T$ is generated by some underlying latent process.
For sequence modelling, RNNs are an obvious choice. However, on their own they cannot model the latent generative process.
VAEs can model a generative process with latent variables, but they are designed for IID data. Here the observations are not IID; they form a sequence.
Chung et al. propose the variational RNN (VRNN) model to deal with this.
Figure taken from the GitHub implementation of this algorithm at emited/VariationalRecurrentNeuralNetwork [2].
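Since those equations only appear in the figure, here they are written out in slightly simplified notation (the paper additionally passes $x_t$ and $z_t$ through feature-extractor networks, which I omit here); all distributions are diagonal Gaussians whose parameters come from small neural networks:

\[\begin{align} z_t &\sim \mathcal{N}(\mu_{0,t}, \sigma_{0,t}^2), \quad [\mu_{0,t}, \sigma_{0,t}] = \varphi^{prior}(h_{t-1}) && \text{(prior)} \\ x_t \,|\, z_t &\sim \mathcal{N}(\mu_{x,t}, \sigma_{x,t}^2), \quad [\mu_{x,t}, \sigma_{x,t}] = \varphi^{dec}(z_t, h_{t-1}) && \text{(generation)} \\ h_t &= f(x_t, z_t, h_{t-1}) && \text{(recurrence)} \\ z_t \,|\, x_t &\sim \mathcal{N}(\mu_{z,t}, \sigma_{z,t}^2), \quad [\mu_{z,t}, \sigma_{z,t}] = \varphi^{enc}(x_t, h_{t-1}) && \text{(inference)} \end{align}\]

The key point is that the RNN hidden state $h_{t-1}$ is a deterministic function of $(x_{<t}, z_{<t})$, so conditioning on $h_{t-1}$ is how every distribution conditions on the whole history.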
These modelling assumptions let the joint distribution of all the $x$'s and their corresponding $z$'s be factored as:
\[p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t | z_{\le t}, x_{< t} ) * p(z_t | z_{<t}, x_{< t}) \tag{J}\]

Interpretation: the joint factorizes according to the graphical model shown in the figure above. Each $x_t$ depends on the latent variables $z_{\le t}$ up to and including time $t$ and on the previous observations $x_{<t}$, while each $z_t$ depends on the previous latent variables $z_{<t}$ and the previous observations $x_{<t}$.
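To make the factorization (J) concrete, here is a minimal PyTorch-style sketch of one generative step. This is my own illustration, not code from the paper or from [2]; the module names (`prior_net`, `dec_net`, `rnn_cell`), the GRU cell, and the Gaussian output are assumptions, and the paper's feature-extractor networks are omitted for brevity.

```python
import torch
import torch.nn as nn

class VRNNGenerativeStep(nn.Module):
    """One generative step: prior over z_t, decoder for x_t, and the
    deterministic recurrence that folds (x_t, z_t) into h_t."""

    def __init__(self, x_dim=28, z_dim=16, h_dim=64):
        super().__init__()
        # p(z_t | z_{<t}, x_{<t}) is parameterized from h_{t-1}
        self.prior_net = nn.Linear(h_dim, 2 * z_dim)
        # p(x_t | z_{<=t}, x_{<t}) is parameterized from (z_t, h_{t-1})
        self.dec_net = nn.Linear(z_dim + h_dim, 2 * x_dim)
        # h_t = f(x_t, z_t, h_{t-1})
        self.rnn_cell = nn.GRUCell(x_dim + z_dim, h_dim)

    def forward(self, h_prev):
        # Prior: sample z_t ~ N(mu_0, sigma_0^2), parameters from h_{t-1}
        mu_0, logvar_0 = self.prior_net(h_prev).chunk(2, dim=-1)
        z_t = mu_0 + torch.exp(0.5 * logvar_0) * torch.randn_like(mu_0)
        # Decoder: sample x_t ~ N(mu_x, sigma_x^2) given (z_t, h_{t-1})
        mu_x, logvar_x = self.dec_net(torch.cat([z_t, h_prev], dim=-1)).chunk(2, dim=-1)
        x_t = mu_x + torch.exp(0.5 * logvar_x) * torch.randn_like(mu_x)
        # Recurrence: h_t summarizes (x_{<=t}, z_{<=t}) for the next step
        h_t = self.rnn_cell(torch.cat([x_t, z_t], dim=-1), h_prev)
        return x_t, z_t, h_t
```

Unrolling this step from $h_0 = 0$ for $T$ steps samples a sequence exactly according to (J): the hidden state stands in for the conditioning on $(x_{<t}, z_{<t})$.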
Mirroring this joint, the approximate posterior over the entire sequence is factored as:
\[q_\phi(z_{\le T}| x_{\le T}) = \prod_{t=1}^{T} q_\phi (z_t | z_{< t}, x_{\le t}) \tag{Post}\]

The variational paradigm (see my old post) gives a lower bound on the log marginal likelihood of $x$:
\[\begin{align} log(p(x_{\le T})) &= log(\int{p(x_{\le T}, \ z_{\le T}) \ d z_{\le T}}) \\ &= log(\int{\dfrac{p(x_{\le T}, \ z_{\le T})}{q(z_{\le T}| x_{\le T})} * q(z_{\le T}| x_{\le T}) \ d z_{\le T}}) \\ & \ge \int q(z_{\le T}| x_{\le T}) * log(\dfrac{p(z_{\le T},\ x_{\le T})}{q(z_{\le T}|\ x_{\le T})}) \ d z_{\le T} \\ & = E_{q(z_{\le T}| x_{\le T})}[log(\dfrac{p(z_{\le T},\ x_{\le T})}{q(z_{\le T}|\ x_{\le T})})] \end{align}\]

where the inequality is Jensen's inequality (log is concave). This is the ELBO for this model.
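A standard identity (not spelled out in the derivation above) makes precise what the bound gives up: the gap between the log marginal and the ELBO is the KL divergence from the approximate posterior to the true one,

\[log(p(x_{\le T})) - \mathcal{L} = KL(q_\phi(z_{\le T}|\ x_{\le T}) \ || \ p(z_{\le T}|\ x_{\le T})) \ge 0,\]

so maximizing the ELBO with respect to $\phi$ also pushes $q_\phi$ towards the true posterior.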
As discussed in my post on VAEs, maximizing the ELBO is the learning algorithm for this model. The ELBO can be simplified further using the factorizations of the joint (J) and of the encoder (Post):
\[\begin{align} ELBO (\mathcal{L}) &= E_{q(z_{\le T}| x_{\le T})}[ \ log(\dfrac{p(z_{\le T},\ x_{\le T})}{q(z_{\le T}|\ x_{\le T})}) \ ] \\ &= E_{encoder} [ \ log(\dfrac{\prod_{t=1}^T p(x_t | z_{\le t}, x_{< t} ) * p(z_t | z_{<t}, x_{< t})}{\prod_{t=1}^{T} q (z_t | z_{< t}, x_{\le t})}) \ ] \\ &= E_{encoder} [ \ \sum_{t=1}^T log(p(x_t | z_{\le t}, x_{< t})) \ + \ log(p(z_t | z_{<t}, x_{< t})) - log(q (z_t | z_{< t}, x_{\le t})) \ ] \\ &= E_{encoder} [ \ \sum_{t=1}^T log(p(x_t | z_{\le t}, x_{< t})) \ - \ KL(q (z_t | z_{< t}, x_{\le t}) \ || \ p(z_t | z_{<t}, x_{< t})) \ ] \end{align}\]

This final expression can be optimized directly: a Monte-Carlo estimate of it is obtained by averaging the term under the expectation over samples of $z_t$ drawn from the encoder. That term is a log reconstruction probability minus a KL divergence; for Gaussian distributions the KL is analytic, and gradients flow through the samples via the usual VAE re-parameterization trick.
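As a sketch of how this loss is computed in practice, here is a single-sample Monte-Carlo estimate of the negative ELBO, again my own illustration under the same assumptions as the earlier snippet (hypothetical names `enc_net`, `prior_net`, `dec_net`, `rnn_cell`; Gaussian likelihood; feature extractors omitted). The reference implementation [2] differs in details.

```python
import math
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Analytic KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )

def gaussian_log_prob(x, mu, logvar):
    """log N(x; mu, var) for a diagonal Gaussian, summed over features."""
    return -0.5 * torch.sum(
        logvar + (x - mu) ** 2 / logvar.exp() + math.log(2 * math.pi),
        dim=-1,
    )

def neg_elbo(x_seq, h0, prior_net, enc_net, dec_net, rnn_cell):
    """Single-sample Monte-Carlo estimate of -ELBO for one batch of sequences.

    x_seq: (T, batch, x_dim). Each net is assumed to output concatenated
    (mu, logvar), mirroring the factorizations (J) and (Post)."""
    h = h0
    loss = 0.0
    for x_t in x_seq:
        # q(z_t | z_{<t}, x_{<=t}) conditions on (x_t, h_{t-1})
        mu_q, logvar_q = enc_net(torch.cat([x_t, h], dim=-1)).chunk(2, dim=-1)
        # Re-parameterization trick: z_t = mu + sigma * eps
        z_t = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        # p(z_t | z_{<t}, x_{<t}) from h_{t-1}; the KL term is analytic
        mu_p, logvar_p = prior_net(h).chunk(2, dim=-1)
        kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
        # p(x_t | z_{<=t}, x_{<t}) from (z_t, h_{t-1}); reconstruction term
        mu_x, logvar_x = dec_net(torch.cat([z_t, h], dim=-1)).chunk(2, dim=-1)
        rec = gaussian_log_prob(x_t, mu_x, logvar_x)
        loss = loss + (kl - rec).mean()
        # Recurrence: fold (x_t, z_t) into the hidden state
        h = rnn_cell(torch.cat([x_t, z_t], dim=-1), h)
    return loss
```

In practice this loss is averaged over mini-batches and backpropagated through the whole unrolled sequence; the re-parameterized sample `z_t` is what lets gradients of the reconstruction term flow back into the encoder.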
[1] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio, “A Recurrent Latent Variable Model for Sequential Data,” arXiv:1506.02216 [cs], Apr. 2016, Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1506.02216
[2] emited, "VariationalRecurrentNeuralNetwork," GitHub. [Online]. Available: https://github.com/emited/VariationalRecurrentNeuralNetwork