NLP Papers at ICML2022
Latent Diffusion Energy-based Model for Interpretable Text Modeling
Yu, Xie, Ma, Jia, Pang, Gao, Zhu, Zhu, Wu. UCLA
This paper combines a particular “interpretable” energy-based model, namely the symbol-vector coupling EBM, with inference techniques from diffusion models.
It has a bunch of prerequisites and can be intimidating to someone (me) unfamiliar with the literature. To understand what the authors’ actual contribution is, you need to be quite familiar with the problem of learning EBMs and the previous work, and dedicate some time to staring at the equations.
Background
First, some high-level background on EBMs. EBMs are a class of probabilistic generative models (like normalizing flows, variational auto-encoders, autoregressive models, and GANs). You can view pretty much anything neural under the lens of EBMs, because an EBM is just an unnormalised score model: it explicitly parameterizes the distribution’s log-probability function while ignoring its normalizing constant.
This gives flexibility when designing the energy function, but comes at the (very high) cost of making likelihood computation and sampling generally intractable. Since we cannot compute likelihoods, we cannot do exact MLE training, so papers revolve around some way to approximate the maximum-likelihood gradient. For example, in the “old days” when Restricted Boltzmann machines were introduced, people used MCMC to approximate the likelihood gradient, which is typically slow.
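To spell out why MCMC shows up: for an EBM $p_{\theta}(x) = \frac{1}{Z_{\theta}} \exp(-E_{\theta}(x))$, the log-likelihood gradient splits into a data term and a model term,

\begin{equation} \nabla_{\theta} \log p_{\theta}(x) = -\nabla_{\theta} E_{\theta}(x) + \mathbb{E}_{\tilde{x} \sim p_{\theta}} [\nabla_{\theta} E_{\theta}(\tilde{x})] \end{equation}

and the second expectation is over samples from the model itself, which is exactly the part that MCMC (Gibbs sampling for RBMs, Langevin dynamics more recently) has to approximate.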
More specific background
Let’s start with the model. They adopt the latent space energy-based prior model (Pang et al., 2020a)¹, which is a symbol-vector coupling model where continuous latent variables are coupled with discrete one-hot symbol variables. We have a symbolic one-hot vector $y \in \mathbb{R}^K$ and a continuous latent vector $z$; $y$ and $z$ are coupled by an EBM $p_{\alpha}(y, z)=\frac{1}{Z_{\alpha}} \exp(\langle y, f_{\alpha}(z) \rangle )p_0(z)$. Thus the generative model, where $\theta=(\alpha, \beta)$, is
\begin{equation} p_{\theta}(y, z, x) = p_{\alpha}(y, z)p_{\beta}(x|z) \end{equation}
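As a concrete sketch of this prior (my own, not the authors’ code; the MLP and layer sizes are made-up placeholders): $f_{\alpha}$ maps $z$ to $K$ logits, the coupling is the dot product with the one-hot $y$, summing out $y$ leaves a log-sum-exp energy on $z$, and $p_{\alpha}(y|z)$ is a plain softmax.

```python
import torch
import torch.nn as nn

class SymbolVectorPrior(nn.Module):
    """Sketch of the symbol-vector coupling prior p_alpha(y, z).
    The MLP for f_alpha and the layer sizes are hypothetical placeholders."""

    def __init__(self, z_dim=32, k_symbols=10, hidden=200):
        super().__init__()
        # f_alpha: R^{z_dim} -> R^K, the logits that couple with the one-hot y.
        self.f_alpha = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.GELU(),
            nn.Linear(hidden, k_symbols),
        )

    def log_prior(self, z):
        # Unnormalised log p_alpha(z): summing exp(<y, f_alpha(z)>) over the
        # K one-hot values of y gives a log-sum-exp over the logits, plus the
        # Gaussian base log p_0(z) (constants, incl. log Z_alpha, dropped).
        coupling = torch.logsumexp(self.f_alpha(z), dim=-1)
        log_p0 = -0.5 * (z ** 2).sum(dim=-1)
        return coupling + log_p0

    def symbol_posterior(self, z):
        # p_alpha(y | z) is available in closed form as a softmax over the
        # logits -- this is the "interpretable" discrete readout.
        return torch.softmax(self.f_alpha(z), dim=-1)
```

That closed-form softmax $p_{\alpha}(y|z)$ is what makes the symbolic side cheap to read off once you have a latent vector.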
The whole point of this is to allow discrete structure without sacrificing generation quality: we get one representation ($y$) for interpretability and another ($z$) for performance. The ELBO that we want to maximise, after marginalising out $y$, is
\begin{equation} \mathrm{ELBO}_{\phi, \theta} = \mathbb{E}_{q_{\phi}(z|x)} [ \log p_{\beta} (x|z) - \log q_{\phi}(z|x) + \log p_{\alpha}(z)] \end{equation}

Next, the learning algorithm. Gao et al. (2020)² showed how to use diffusion-like methods to learn a sequence of EBMs. Each EBM in the sequence is trained by maximising the conditional probability of the data given their noisy versions. Intuitively, this is easier than the marginal likelihood objective because the conditional distribution is much better behaved: mathematically, $p(z_t | z_{t+1})$ is approximately a single-mode Gaussian when the noise added at each step is sufficiently small.
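To make the “conditional is easier” point concrete, here is a minimal sketch (my own, not the authors’ code) of recovery-likelihood-style training at one noise level. The conditional is $p(z_t \mid z_{t+1}) \propto \exp\big(f(z_t) - \lVert z_{t+1} - z_t \rVert^2 / 2\sigma^2\big)$ (ignoring any rescaling in the forward process), so short-run Langevin sampling from it, initialised at the noisy latent, is easy, and the EBM update is the usual positive-minus-negative contrast:

```python
import torch

def langevin_conditional(ebm, z_noisy, sigma, steps=20, step_size=0.1):
    """Short-run Langevin sampling from
    p(z | z_noisy) ∝ exp(ebm(z) - ||z_noisy - z||^2 / (2 sigma^2)).
    `ebm(z)` is assumed to return a per-sample scalar (negative energy);
    the step count and step size are made-up values."""
    z = z_noisy.clone().detach().requires_grad_(True)  # init at the noisy latent
    for _ in range(steps):
        log_cond = ebm(z) - ((z_noisy - z) ** 2).sum(-1) / (2 * sigma ** 2)
        grad, = torch.autograd.grad(log_cond.sum(), z)
        z = (z + 0.5 * step_size ** 2 * grad
               + step_size * torch.randn_like(z)).detach().requires_grad_(True)
    return z.detach()

def recovery_step_loss(ebm, z_clean, sigma):
    """Contrastive loss whose gradient matches maximising the conditional
    (recovery) likelihood at one noise level: raise the energy term on clean
    latents, lower it on conditional Langevin samples."""
    z_noisy = z_clean + sigma * torch.randn_like(z_clean)
    z_neg = langevin_conditional(ebm, z_noisy, sigma)
    return -(ebm(z_clean) - ebm(z_neg)).mean()
```

The quadratic term is what keeps the Langevin chain pinned near the noisy latent, which is the whole point of conditioning on the noisy version.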
This Work
Instead of a vanilla VAE-type framework, they consider a diffusion-style one. Recall that, in the view above, a diffusion model learns a sequence of EBMs by optimising conditional likelihoods, which are more tractable than the marginal likelihood. So instead of the joint distribution $p(y, z, x)$, we now have the joint distribution over the trajectory of latent variables, $p(y, z_{0:T}, x)$. Rewriting the ELBO, they end up with
\begin{align} \mathrm{ELBO}_{\phi, \theta} &= \mathbb{E}_{q_{\phi} (z_0|x)} [ \log p_{\beta} (x|z_0) - \log q_{\phi}(z_0|x)]\\ &+ \mathbb{E}_{q_{\phi}(z_0|x)} \left[ \log \int_{z_{1:T}} p_{\alpha} (z_{0:T})\, dz_{1:T} \right] \end{align}

There is a lot more in the paper in terms of derivations and detailed algorithms. I feel like I oversimplified this work, but I think that’s the main gist (or the main premise) of the paper, and the rest of it is scaffolding to make optimising this new ELBO work out.
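One step that this summary glosses over (my reading, so take it with a grain of salt): the trajectory prior should factorise in the usual reverse-time diffusion fashion,

\begin{equation} p_{\alpha}(z_{0:T}) = p(z_T) \prod_{t=1}^{T} p_{\alpha}(z_{t-1} \mid z_t) \end{equation}

with $p(z_T)$ a simple base distribution and each $p_{\alpha}(z_{t-1} \mid z_t)$ one of the conditional EBMs from the background section, so the awkward $\log \int_{z_{1:T}} p_{\alpha}(z_{0:T})\, dz_{1:T}$ term gets attacked one tractable conditional at a time.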
References
1. Pang, Wu. Latent Space Energy-Based Model of Symbol-Vector Coupling for Text Generation and Classification.
2. Gao, Song, Poole, Wu, Kingma. Learning Energy-Based Models by Diffusion Recovery Likelihood.