1. Controlling Conditional Language Models without Catastrophic Forgetting

Korbak, Elsahar, Kruszewski, Dymetman. Naver Labs

This paper does not introduce a new problem or a totally new method; it (very) closely follows work by Khalifa et al. (2021) [1], "A distributional approach to controlled text generation". That paper formulates controlled text generation as a constraint satisfaction problem over the distribution $p$, where $p$ is the desired target LM. Understanding the previous paper is a strong prerequisite, as this work's contribution is extending the unconditional generation method to conditional generation. An alternative title for the sake of our sanity would be "A distributional approach to controlled conditional text generation."

Motivation

The problem is controlling generation from a pretrained language model without a signal that would allow fully supervised finetuning on entire generated sequences. The assumption is that we only have an indicator $b(x) \in \{0, 1\}$ of whether a sample $x$ satisfies the control objective, e.g. source code that compiles, or a factually correct summary. The other problem is catastrophic forgetting whenever the model weights are retrained.

Background

Khalifa 2021 “formalizes the problem of controlled text generation as a constraint satisfaction problem over the probability distribution $p$ representing the desired target LM.”

What do we want from $p$? First, the constraints are specified as moments $\mu_i = \mathbb{E}_{x \sim p}\,\phi_i(x)$ of predefined feature functions $\phi_i(x)$, for $i \in \{1, \dots, k\}$. Let $\mathcal{C}$ be the set of all distributions that satisfy these moment constraints. Then $p$ is the distribution from this set that also minimizes the KL divergence from the original LM $a$ (to avoid catastrophic forgetting):

\begin{equation} p = \operatorname{argmin}_{c \in \mathcal{C}} \mathrm{KL}(c \,\|\, a) \end{equation}
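To make this concrete, here is a hypothetical pair of constraints (a toy instance in the spirit of Khalifa et al.'s examples, not taken verbatim from either paper): a pointwise constraint "every generated biography mentions science" and a distributional constraint "half of the generated biographies are about women":

\begin{equation} \mu_1 = \mathbb{E}_{x \sim p}\,\phi_{\text{science}}(x) = 1, \qquad \mu_2 = \mathbb{E}_{x \sim p}\,\phi_{\text{female}}(x) = 0.5 \end{equation}

where $\phi_{\text{science}}$ and $\phi_{\text{female}}$ are binary detectors of the two properties.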

How do we train the EBM (energy-based model) $P$? In the special case of only pointwise constraints, where $\phi_i(x) \in \{0, 1\}$ and every sample must satisfy them, they prove $p(x) \propto a(x)b(x)$. This means the EBM is simply $P(x) = a(x)b(x)$. In the general case of distributional constraints (desired moments that are not just "every sample satisfies the feature"), the solution has the exponential-family form $P(x) = a(x)e^{\lambda \cdot \phi(x)}$, where $\lambda$ is fitted so that the moment constraints hold, using self-normalized importance sampling with samples from $a$.
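As a hedged illustration of that last step, here is a minimal runnable sketch of fitting $\lambda$ so that the self-normalized importance sampling (SNIS) estimate of $\mathbb{E}_{p_\lambda}[\phi(x)]$ matches a target moment. It uses a toy categorical distribution instead of an LM, a single feature, and SGD on the squared moment error; the variable names and the exact fitting procedure are my assumptions, not lifted from the paper.

```python
import torch

torch.manual_seed(0)
V = 10                                           # toy "sequences" are just items 0..V-1
a = torch.softmax(torch.randn(V), dim=0)         # pretrained LM a(x), a toy categorical
phi = torch.linspace(0.0, 1.0, V)                # one real-valued feature phi(x) in [0, 1]
mu_bar = torch.tensor(0.6)                       # desired moment E_{x~p}[phi(x)]

# p_lambda(x) ∝ a(x) exp(lambda * phi(x)); since p_lambda(x)/a(x) ∝ exp(lambda * phi(x)),
# the SNIS estimate of E_{p_lambda}[phi] uses samples from a with those unnormalized weights.
lam = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([lam], lr=5e-2)

for step in range(2000):
    x = torch.multinomial(a, 4096, replacement=True)   # samples from a
    w = torch.exp(lam * phi[x])                         # unnormalized importance weights
    mu_hat = (w * phi[x]).sum() / w.sum()               # SNIS estimate of E_{p_lambda}[phi]
    loss = (mu_hat - mu_bar) ** 2                       # push the estimated moment to mu_bar
    opt.zero_grad(); loss.backward(); opt.step()

print(lam.item(), mu_hat.item())  # mu_hat should be close to 0.6 after fitting
```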

How do we sample from the EBM? Since the EBM is an unnormalized distribution, we cannot directly sample from it. As a result, Khalifa et al. train a policy $\pi_{\theta}$ to approximate $p$ (the normalized version of $P$) using the Distributional Policy Gradient (DPG) algorithm (Parshakova et al., 2019) [2]. The algorithm minimizes the cross-entropy $\mathrm{CE}(p, \pi_{\theta})$ using importance sampling from a proposal $q$ (fixed to $a$ in the offline variant). One gradient step on a sample $x \sim q$ looks like $\theta \leftarrow \theta + \alpha \frac{P(x)}{q(x)} \nabla_{\theta} \log \pi_{\theta}(x)$.
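A minimal sketch of that update, using a toy categorical "LM" so everything is runnable end to end (this is my own illustration following the notation above, not the authors' implementation):

```python
import torch

torch.manual_seed(0)
V = 10                                               # toy "sequences" are items 0..V-1
a = torch.softmax(torch.randn(V), dim=0)             # pretrained LM a(x), kept fixed
b = (torch.arange(V) % 2 == 0).float()               # pointwise constraint b(x): "x is even"
P = a * b                                            # unnormalized EBM P(x) = a(x) b(x)

theta = torch.log(a).clone().requires_grad_(True)    # policy pi_theta, initialized at a
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(2000):
    q = a                                            # offline DPG: proposal q fixed to a
    x = torch.multinomial(q, 64, replacement=True)   # batch of samples x ~ q
    log_pi = torch.log_softmax(theta, dim=0)[x]
    w = (P[x] / q[x]).detach()                       # importance weights P(x)/q(x)
    loss = (-w * log_pi).mean()                      # importance-weighted cross-entropy
    opt.zero_grad(); loss.backward(); opt.step()

pi = torch.softmax(theta, dim=0)
print(pi)   # mass should concentrate on even x, proportional to a(x) there
```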

In this paper

We want models that are good at summarization, translation, and so on. These are all conditional generation tasks. Let's deal only with the simple case of pointwise constraints, where $P(x)=a(x)b(x)$. To make this conditional on a context $c$ (say, the source sentence), the EBM becomes $P_c(x) = a(x|c)\,b(x,c)$.

How do we sample from this EBM? Recall that we need to train a policy. But how do we train a single policy for the task, given all the different contexts in the training data, e.g. the documents to be summarized or all the source sentences in translation? This work says you should minimize the expected cross-entropy between $\pi_\theta(\cdot|c)$ and the multiple targets $p_c$, where the expectation is over $\tau(c)$, a distribution over contexts $c$. The final form of the gradient step given in the paper looks complicated, but following the algorithm pseudocode it should be doable. There is, however, an estimate of the per-context normalization constant $Z_c$ inside each gradient step, which might be expensive; details in the paper.
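For intuition, here is a toy sketch of what one such conditional step could look like, again over categorical distributions, with a crude per-context estimate $\hat{Z}_c$ computed from the same batch. This is an assumption-laden illustration of the idea described above, not the paper's actual algorithm or code.

```python
import torch

torch.manual_seed(0)
C, V = 3, 10                                          # toy context set and output vocabulary
a = torch.softmax(torch.randn(C, V), dim=1)           # pretrained conditional LM a(x|c)
xs, cs = torch.arange(V), torch.arange(C).unsqueeze(1)
b = ((xs + cs) % 2 == 0).float()                      # toy pointwise constraint b(x, c)
P = a * b                                             # conditional EBM P_c(x) = a(x|c) b(x, c)
tau = torch.full((C,), 1.0 / C)                       # distribution over contexts tau(c)

theta = torch.log(a).clone().requires_grad_(True)     # policy pi_theta(x|c), initialized at a
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(2000):
    c = torch.multinomial(tau, 1).item()              # sample a context c ~ tau
    pi_c = torch.softmax(theta[c], dim=0).detach()    # current policy, used as the proposal
    x = torch.multinomial(pi_c, 64, replacement=True) # batch of samples x ~ pi_theta(.|c)
    log_pi = torch.log_softmax(theta[c], dim=0)[x]
    w = P[c, x] / pi_c[x]                              # importance weights P_c(x)/pi_theta(x|c)
    Z_hat = w.mean().clamp_min(1e-8)                   # per-context estimate of Z_c
    loss = (-(w / Z_hat) * log_pi).mean()              # importance-weighted x-ent toward p_c
    opt.zero_grad(); loss.backward(); opt.step()
```

The per-context $\hat{Z}_c$ is the part flagged above as potentially expensive: with a real LM it means scoring a batch of samples under both $P_c$ and $\pi_\theta$ for every context in every gradient step.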

References