NLP Papers at ICML 2022
Towards Coherent and Consistent Use of Entities in Narrative Generation
Papalampidi, Cao, Kocisky. DeepMind
This is primarily an architecture paper that uses dynamic memory to tackle the problem of coherent entity use in text generation. The authors propose a dynamic entity memory plus cross-attention blocks at each layer of the model to augment a pre-trained LM. They say the key difference from previous work is being able to condition on multiple relevant entities instead of just one, and to update all entity slots in the dynamic memory via soft-attention. They also define metrics for “automatically measuring entity coherence and consistency”.
I’m usually not a big fan of such work (including the line of retrieval-augmented models) because I’m in the camp that believes very large LMs with lots of parameters already act as implicit memory banks, so all of this should come for free. Personally, I believe that while more architecture engineering should give better sample efficiency, if the architecture also comes at the cost of requiring additional supervision (instead of being purely self-supervised), adoption tends to be temporary or patchy until the next scalable model architecture comes out.
I’m also not sure how interesting it is: it combines several well-known mechanisms (soft-attention, dynamic memory, gating, etc.), so I don’t expect too much novelty; the techniques in this space are getting a bit saturated. Still, a solid engineering effort nonetheless.
Background
Memory-augmented LMs are a type of architecture where, in addition to the base architecture of whatever LM you’re working with, you also keep a set of additional vectors that get stored and updated as the model sees more input. The updates are usually gated for more “learning control”, and these vectors then interact with the default hidden states in some way to influence prediction.
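As a toy illustration (not from the paper; all names and dimensions below are made up), one read/write step of a memory-augmented LM might look something like this:

```python
import torch
import torch.nn.functional as F

# Toy dimensions -- not from the paper.
d_model, n_slots = 64, 8

hidden = torch.randn(1, d_model)         # current hidden state of the base LM
memory = torch.randn(n_slots, d_model)   # extra memory vectors carried across tokens

# Read: attend over the memory and mix the result into the hidden state.
scores = memory @ hidden.squeeze(0) / d_model ** 0.5    # (n_slots,)
read = F.softmax(scores, dim=-1) @ memory                # (d_model,)
hidden = hidden + read                                   # this then influences prediction

# Write: a gate controls how much each slot is overwritten by new information.
gate = torch.sigmoid(memory @ hidden.squeeze(0)).unsqueeze(-1)   # (n_slots, 1)
memory = gate * hidden.expand(n_slots, d_model) + (1 - gate) * memory
```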
Method
They have j memory slots, each holding a key-value pair. The key is a fixed representation (presumably just the entity token) and the value is dynamic, updated as the model reads in the input.
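Roughly, and purely as a guess at the setup (slot count and dimensions here are illustrative, not the paper’s):

```python
import torch

# Illustrative sizes -- the paper's slot count j and hidden size are not reproduced here.
j, d_model = 8, 64
entity_token_emb = torch.randn(j, d_model)   # one embedding per tracked entity

memory_keys = entity_token_emb               # fixed per slot, used for addressing
memory_values = entity_token_emb.clone()     # dynamic, overwritten as the model reads input
```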
Each layer of the transformer has a cross-attention block that takes the input representation of the token and all memory slots, and computes a new representation from them using the standard softmax-attention and concatenation operations. There are gating mechanisms in various places for updating the memory value of each slot.
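A minimal sketch of what such a block could look like, assuming single-head attention over the slots and a sigmoid gate on the value update (the class name and exact parameterisation are mine, not the paper’s):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryCrossAttention(nn.Module):
    """Rough sketch of a per-layer cross-attention block over entity memory."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(2 * d_model, d_model)    # combine token repr with memory read
        self.gate = nn.Linear(2 * d_model, d_model)   # controls how much each value slot changes

    def forward(self, token_h, mem_keys, mem_values):
        # token_h: (d_model,); mem_keys / mem_values: (j, d_model)
        q = self.q(token_h)
        k = self.k(mem_keys)
        v = self.v(mem_values)

        attn = F.softmax(k @ q / q.shape[-1] ** 0.5, dim=-1)   # (j,) soft-attention over slots
        read = attn @ v                                        # (d_model,) weighted memory read

        # Combine the memory read with the token representation.
        new_h = self.out(torch.cat([token_h, read], dim=-1))

        # Gated update of every value slot, weighted by its attention score.
        g = torch.sigmoid(self.gate(torch.cat(
            [mem_values, new_h.expand_as(mem_values)], dim=-1)))
        new_values = mem_values + attn.unsqueeze(-1) * g * (new_h - mem_values)
        return new_h, new_values, attn

block = MemoryCrossAttention(d_model=64)
h, values, attn = block(torch.randn(64), torch.randn(8, 64), torch.randn(8, 64))
```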
Finally there is a “regularisation of cross-attention scores” step where they minimise the KL divergence between the cross-attention weights and the “ground-truth distribution for that token”. (Presumably this ground-truth distribution puts all probability on one slot, so the soft-attention is pushed back towards hard attention; but hey, with a tuning hyperparameter on this KL loss, why not.)
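With a one-hot ground-truth distribution, the KL term reduces to a negative log-likelihood on the gold slot’s attention weight. A hedged sketch (variable names and the loss weight are mine, not the paper’s):

```python
import torch
import torch.nn.functional as F

# Illustrative only: attn is the cross-attention distribution over j slots for one token,
# gold_slot is the annotated entity slot for that token.
j = 8
attn = F.softmax(torch.randn(j), dim=-1)
gold_slot = 3
target = F.one_hot(torch.tensor(gold_slot), num_classes=j).float()

# KL(target || attn); for a one-hot target this is just -log attn[gold_slot].
kl_loss = -(target * attn.log()).sum()

lambda_kl = 0.1                    # the tuning hyperparameter on this KL loss
lm_loss = torch.tensor(2.3)        # placeholder for the usual LM objective
total_loss = lm_loss + lambda_kl * kl_loss
```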