Back to Contents

Black-Box Tuning for Language-Model-as-a-Service

Sun, Shao, Qian, Huang, Qiu. Fudan University

This paper works in the setting where large LMs (e.g. GPT-3) are served only as black-box APIs, so users have to tune their prompts without access to gradients. This is a compelling scenario: most people are already used to the idea that the big boys host the big models behind APIs, something which will only become more prevalent.

Enter Derivative-Free Optimization (DFO). DFO is not new, but DFO at scale is a problem because the existing algorithms β€œare known to suffer from slow convergence rate when the dimensionality of the search space is high. Thus, it is intractable to optimize even only the continuous prompts, which can be tens of thousands of parameters, using DFO algorithms.”

Despite the hype in the introduction, no fancy new DFO algorithm is invented here; instead, low dimensionality comes to the rescue!

The authors project the original prompt space onto a much smaller subspace using a random linear projection. This paper feels like a straightforward application and implementation of two well-known existing ideas (random projection to reduce dimensions + DFO). It is still certainly valuable, though: even if these ideas are well-known in ML land, they may not be that well-known in NLP land. Kudos to them as well for stepping out of gradient land.

Someone interested in applying or following up on this work should stare carefully at their many experiments and ablations, as the method almost feels too simple to outperform gradient-based methods (which they claim in the introduction).

Method

Step 1: Write the loss function

Assume we have access to the model's logits $f(p; X)$, so minimally we can compute some loss $\mathcal{L}(f(p; X), Y)$. We want to optimise some prompt $p \in \mathbb{R}^D$ which is combined with $X$. But it is intractable to do DFO on $p^* = \mathrm{argmin}_{p \in \mathcal{P}} \mathcal{L}(f(p; X), Y)$ directly, so instead we optimise a lower-dimensional $z \in \mathbb{R}^d$. We can do this with the standard technique of using a random projection matrix $A \in \mathbb{R}^{D \times d}$ to project $z$ back up to $D$-dimensional space.

\begin{equation} z^* = \mathrm{argmin}_{z \in \mathcal{Z}} \mathcal{L}(f(Az + p_0; X), Y) \end{equation}
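As a rough sketch (not the authors' code), the projection can be implemented with a fixed random matrix and a hypothetical `loss_fn` that wraps the API call; the dimensions below are illustrative:

```python
import numpy as np

# Illustrative numbers: D is the full prompt-embedding dimension, d is the subspace we search over.
D = 50 * 1024
d = 500

rng = np.random.default_rng(0)
A = rng.normal(size=(D, d))   # random projection; the paper prefers a He-style scale (see tricks below)
p0 = np.zeros(D)              # initial prompt embedding p_0 (placeholder here)

def projected_loss(z, loss_fn, X, Y):
    """Evaluate the black-box loss at the full-dimensional prompt Az + p0."""
    p = A @ z + p0
    return loss_fn(p, X, Y)   # loss_fn calls the LM API and computes the loss from the returned logits
```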

Step 2: Apply a standard DFO algorithm

The authors directly apply the Covariance Matrix Adaptation Evolution Strategy (CMA-ES; Hansen, 2016). The main idea behind this algorithm is that a population of new query solutions (offspring) is sampled at every step from a multivariate normal distribution, whose mean and covariance are the parameters updated based on the likelihood of previously successful candidates.
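Concretely, the loop might look like the following sketch using the off-the-shelf `cma` package, reusing the hypothetical `projected_loss`, `loss_fn`, `X`, `Y`, and `d` from the sketch above; the population size and step size are placeholders, not the paper's settings:

```python
import cma

# z lives in the low-dimensional subspace; CMA-ES never sees the D-dimensional prompt.
es = cma.CMAEvolutionStrategy(d * [0.0], 1.0, {'bounds': [-5, 5], 'popsize': 20})

while not es.stop():
    candidates = es.ask()                                              # sample a population of offspring z's
    losses = [projected_loss(z, loss_fn, X, Y) for z in candidates]    # one API call per candidate
    es.tell(candidates, losses)                                        # update the mean/covariance of the search distribution

z_star = es.result.xbest
```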

Tricks to make it work

  • Sampling the rows of $A$ from a He initialisation appears to work better than sampling from a standard normal distribution. No reason is given why. (See the sketch after this list.)
  • Restricting the search space $\mathcal{Z}$ to $[-5, 5]^d$.
  • Using cross-entropy for the loss function instead of accuracy. This is ML 101.
  • Instead of optimising the entire prompt, they start with some initial prompt $p_0$ and only optimise a small perturbation around it: $Az + p_0$.
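A hedged sketch of two of these tricks; the He-style scale $\sqrt{2/d}$ and the log-softmax cross-entropy are my reading of standard practice, not lifted from the paper's code:

```python
import numpy as np

def he_random_projection(D, d, seed=0):
    """Sample A with He-style scaling (std = sqrt(2 / d)) instead of a standard normal."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / d), size=(D, d))

def cross_entropy_from_logits(logits, labels):
    """logits: (N, C) array returned by the API; labels: (N,) integer class ids."""
    logits = logits - logits.max(axis=1, keepdims=True)                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    return -log_probs[np.arange(len(labels)), labels].mean()
```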

References

Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772.