NLP Papers at ICML 2022
What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization?
Wang, Roberts, Hesslow, Le Scao, Chung, Beltagy, Launay, Raffel. HuggingFace.
A massively compute-intensive paper (with "a total of 830,000 TPUv4-hours over the study") from HuggingFace.
For model architectures, they consider decoder-only, encoder-decoder, and non-causal decoder. For pretraining objectives, they consider the full language modeling (LM) objective and the masked LM objective. Perhaps most interesting is the non-causal decoder, which had never really caught my eye before; the difference from a standard decoder comes down to the attention mask, as sketched below.
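To make the non-causal decoder concrete: it is architecturally the same as a decoder-only model, but the attention mask is bidirectional over the input (prefix) portion and causal over the rest, which is why it is also called a prefix LM. A minimal NumPy sketch of the two masks (function names and the `prefix_len` parameter are mine, not the paper's):

```python
import numpy as np

def causal_mask(seq_len):
    # Causal (decoder-only) mask: each token attends only to itself and earlier tokens.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len, prefix_len):
    # Non-causal ("prefix LM") decoder mask: tokens within the prefix attend to the
    # whole prefix bidirectionally; tokens after the prefix remain causal.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:prefix_len, :prefix_len] = True  # bidirectional attention over the input/prefix
    return mask

if __name__ == "__main__":
    print(causal_mask(5).astype(int))
    print(prefix_lm_mask(5, prefix_len=3).astype(int))
```

The encoder-decoder separates these two roles into distinct stacks; the non-causal decoder gets a similar effect with a single stack and shared parameters.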
In conclusion, an autoregressive (causal) decoder works best if there is no fine-tuning, and the encoder-decoder architecture works best if we are allowed to fine-tune.
I applaud this effort and hope people read the conclusion section of the paper, so the carbon counts for something.