Back to Contents

Co-training improves prompt-based learning for Large Language Models

Lang, Agrawal, Kim, Sontag. MIT

This paper proposes to co-train a LLM (GPT3) with a smaller Bert model.

Co-training (Blum & Mitchell 1998)1 has always seemed like a funny kind of iterative unsupervised method mostly based on intuition. It’s usually more compelling if you have two views of the same data which are complementary (e.g., tweet title and tweet image), and so one view benefits from the pseudo-labels gathered from the other view. Since we know how to simply combine two views of the same data in big neural land (multimodal neural networks), co-training isn’t very popular anymore.

Here instead of having two views of the same data, the authors use two models $\phi_0$ and $\phi_1$ applied to the same input. This starts to look more like a teacher student or model calibration setup rather than co-training “on different views”. They do justify this framing with recent work (see paper for references) which says that the views can be identical as long as $\phi_0(x)$ and $\phi_1(x)$ are different enough.

Background

The co-training framework. The high level idea is that you have two views of your data, A and B, you gather labels from A and use it to train a classifier on B. Use the new labels on B and try to train a classifier on A and iterate.

(Its easy to understand but if you’re new to it best with a video than me trying to explain it here.)

Method

GPT’s view The first view is $\phi_0^{(i)}(x) \in \mathbb{R}^{|V|}$ a vector of output probabilities on the vocabulary space using a very powerful model like GPT3. To restrict the vocabulary space, they take the top 25% of tokens with nonzero probability mass across all prompt examples. For $k$ prompts, the first view is a matrix $\phi_0(x) \in \mathbb{R}^{k \times |V|}$. Note that $k$ is not $k$ datapoints, but $k$ prompts on the same datapoint. They combine the $k$ prompts on the same datapoint into a single pseudolabel by using a calibration Matrix (Zhao et al., 2021)2, and train this calibration matrix $W$.

Smaller model view The second view is $\phi_1(x)$ is the second last layer where $\phi_1$ is a small pretrained model (they use Deberta).

Selecting Confident data.

This is a standard development in co-training to avoid the negative loops (wrong pseudo labels feeding wrong pseudo labels). They use model confidence based on the assumption that every class accounts for at least 1% of the data and a “cut-statistic”. The idea is to form a graph of datapoints connecting vertices who are nearest neighbours in $\phi(x)$, and consider an edge between the nearest neighbors cut if it has a different label from its nearest neighbors. If a node has a different label from its neighbors, we are less confident about this node. This seems to be a direct application of Zhang & Zhou 20113.

References