Improving Language Models by Retrieving from Trillions of Tokens

Team from DeepMind

This paper is a massive scaling up of retrieval-augmented models. The paper is chock full of training details, ablations, and analysis of design decisions.

Interesting is the care taken to separate the training data from the test data (an especially acute problem since the model does direct retrieval). They compute a 13-gram Jaccard similarity using MinHash and remove all training documents with a similarity of 0.8 or higher to any validation or test set document, in addition to removing all validation and test articles from WikiText-103.
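A minimal sketch of what such a filter could look like, assuming whitespace tokenization and a simple hash-based MinHash; the signature size, hash function, and helper names here are my own choices for illustration, not the paper's actual pipeline:

```python
import hashlib

NUM_PERM = 128      # number of hash permutations in the MinHash signature (assumption)
NGRAM = 13          # 13-gram shingles, as in the paper
THRESHOLD = 0.8     # training docs at or above this similarity to eval docs are dropped

def ngrams(tokens, n=NGRAM):
    """All contiguous n-grams of a token sequence, joined into strings."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(shingles, num_perm=NUM_PERM):
    """MinHash signature: for each seed, the minimum hash value over all shingles."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # vary the hash function via the salt
        min_h = min(
            (int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
             for s in shingles),
            default=0,
        )
        sig.append(min_h)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def filter_train_docs(train_docs, eval_docs):
    """Drop training documents too similar (>= THRESHOLD) to any eval document."""
    eval_sigs = [minhash_signature(ngrams(d.split())) for d in eval_docs]
    kept = []
    for doc in train_docs:
        sig = minhash_signature(ngrams(doc.split()))
        if all(estimated_jaccard(sig, es) < THRESHOLD for es in eval_sigs):
            kept.append(doc)
    return kept
```

At trillion-token scale you would of course use locality-sensitive hashing rather than comparing every pair of signatures, but the thresholding logic is the same.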

There is a further section, 2.6 Quantifying dataset leakage exploitation, which is a fairly new kind of analysis.

First they chunk up the evaluation and training data. For each evaluation chunk, they retrieve the 10 nearest neighbors in the training data and compute the longest common token substring. They can then plot the model's log-likelihood on each chunk against the amount of data overlap, which indicates how much the model exploits evaluation leakage.
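A minimal sketch of that analysis, assuming the 10 neighbors per chunk have already been retrieved; `retrieve_neighbors`, `log_likelihood`, and the overlap bucket edges are placeholders I made up, not the paper's exact setup:

```python
from collections import defaultdict

def longest_common_token_substring(chunk, neighbor):
    """Length of the longest contiguous run of tokens shared by two chunks (DP)."""
    best = 0
    prev = [0] * (len(neighbor) + 1)
    for a in chunk:
        cur = [0] * (len(neighbor) + 1)
        for j, b in enumerate(neighbor, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def overlap_ratio(chunk, neighbors):
    """Overlap in [0, 1]: longest substring shared with any neighbor, normalized by chunk length."""
    longest = max(longest_common_token_substring(chunk, n) for n in neighbors)
    return longest / len(chunk)

def bucket_log_likelihoods(eval_chunks, retrieve_neighbors, log_likelihood,
                           edges=(0.2, 0.4, 0.6, 0.8)):
    """Group per-chunk log-likelihoods by how much each chunk overlaps the training data."""
    buckets = defaultdict(list)
    for chunk in eval_chunks:
        r = overlap_ratio(chunk, retrieve_neighbors(chunk, k=10))
        b = sum(r > e for e in edges)  # bucket index by increasing overlap
        buckets[b].append(log_likelihood(chunk))
    return {b: sum(v) / len(v) for b, v in buckets.items()}
```

Plotting the per-bucket averages shows whether the model's likelihood improves mainly on chunks that leak from the training set.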

Nice example of how to write a thorough paper, kudos.