EMNLP 2020
Likes: Gathertown, rocketchat, videos. Dislikes: Zoom
Keynotes
Information Extraction Through the Years: How Did We Get Here? - Claire Cardie
This talk was an Ohhhh moment for me about why there are so many subfields of “extraction” when they all seem to be very similar or related. Information Extraction in 1991 was too “real”, and they had to simplify/break down the real task into smaller tasks (named entity recognition, coreference resolution, relation extraction, entity linking, etc.) to indicate where the model was failing. In some sense, the “new” task of event extraction is going full circle and getting close to the original task. In event extraction, there are predefined event types, and entities fill specific roles in the event.
It’s funny that people motivate end-to-end work by saying how pipelines introduce all these intermediate errors. I feel like every time the DARPA guy reads this they roll their eyes and go yeah, of course we know, and 30 years later you have finally come full circle. Maybe in the next 50 years there will be a talk titled “AGI Through the Years”. Anyway, the limitations or future directions for Information Extraction are better document-level event understanding (neural methods still don’t work that well at the document level), low-resource settings (as usual), and user-centric systems (as usual).
Friends Don’t Let Friends Deploy Black-Box Models: The Importance of Intelligibility in Machine Learning - Rich Caruana
Rich Caruana sells Explainable Boosting Machines (EBMs) pretty hard. While acknowledging that Neural Nets will be “king of the hill” for a long time on signal data, he argues that for tabular data EBMs are competitive with NNs, yet are more explainable. The rest of his talk is examples of EBM success in the medical domain. A bunch of cute examples but not very technical. He talked very little about the technicalities of EBMs, which I did some brief reading about subsequently.
Recall a linear model is $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$, and an additive model is $y=f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$. An EBM is a generalized additive model of the form
\begin{equation} g(\mathbb{E} [ y]) = \beta_0 + \sum f_j (x_j) + \sum f_{ij}(x_i, x_j) \end{equation}
It has both general additive effects and pairwise interaction effects. This is considered “white-box” and interpretable, because the contribution of each feature to a final prediction can be visualized and understood by plotting each \(f\) in a modular way. There are some details regarding training: it requires round-robin cycles through the features to mitigate the effects of co-linearity and to learn the best feature function \(f\). For more details see their InterpretML library paper and its references.
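For the curious, this is roughly what fitting one looks like with their InterpretML library. A minimal sketch assuming a synthetic tabular dataset (not from the talk):

```python
# Minimal sketch of fitting an EBM with InterpretML on a toy tabular dataset.
from sklearn.datasets import make_classification
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Each feature gets its own shape function f_j; pairwise terms f_ij are added
# on top (the `interactions` argument controls how many are considered).
ebm = ExplainableBoostingClassifier(interactions=10, random_state=0)
ebm.fit(X, y)

# The global explanation plots each learned f_j, which is what makes the model
# "white-box": you can read off each feature's contribution to predictions.
show(ebm.explain_global())
```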
Linguistic Behaviour and the Realistic Testing of NLP Systems - Janet Pierrehumbert
The Dame talked about the validity of NLP methodology wrt linguistic behavior. She focused particularly on the cloze test: it is strongly grounded in human behavioral studies, but we could have better separation/analysis of errors depending on the context. The point is that there are qualitatively different uses of “context”. Different types of words have different levels of “burstiness”, different topics activate different vocabularies, and local constraints can dominate the vocabulary distribution.
There were some fun one-shot linguistic learning examples that humans can do (Oh no! Look at that Zunky pan!) and a fun tidbit that word use can be modeled as a “hazard process”, with models spanning sequences of up to 50k words, a much larger time lag than what deep learning models account for. And big shoutout to my girl Li Ke for being referenced on the keynote slides ;)
Long Papers
Most things happened in a blur. Below are some papers which I read more carefully after watching their videos. Comments are very high level; detailed questions were emailed to the authors (note to self to update if anything interesting comes back).
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
Motivation Multi-head attention may suffer from “collapse” - where different heads extract similar attention representations.
Method They use particle-optimization sampling to explicitly make the attention head representations different.
Comments The motivation is juicy to tackle, and this seems like a direct application of an existing optimization algorithm. But not everyone knows enough about particle-optimization sampling to use it in the first place (I certainly don’t), so I think it’s good work that’s ahead of most of the curve :)
Information-Theoretic Probing with Minimum Description Length
Motivation To predict how well learned representations “encode” linguistic property, we typically train a classifier (“probe”) to predict the property from the fixed learned representations. However, this does not differentiate well between learned and random representations.
Method The authors propose Information-theoretic probing with minimum description length as an alternative, where the goal is to have the shortest “description length” that transmits the labels. Why could this work? The intuition is that if the representations contain information about the labels, then the labels can be compressed well. Note that since labels are transmitted using a model, the model has to be transmitted as well (directly or indirectly). Thus, the overall code length is a combination of the quality of fit of the model (compressed data length) with the cost of transmitting the model itself. They consider variational coding and online coding for estimating MDL. They end up with this equation for the variational code:
\begin{equation} KL(\beta\,||\,\alpha) - \mathbb{E}_{\theta \sim \beta} \sum_{i=1}^N \log_2 p_{\theta} (y_i|x_i) \end{equation}
where $\beta(\theta) = \prod_{i=1}^m \beta_i (\theta_i)$ is a distribution over the parameter values, obtained from Bayesian Compression, $\alpha$ is a prior distribution over the parameter space, and the remaining term is the code length for transmitting the labels, which is also the categorical cross-entropy loss of the model predicting the label $y_i$ given the $i$-th example $x_i$. I am still bothered by the equation not having $\sum_j p(y_{ij})$ in front of the $\log_2$, even though I know the probability is 1 for the given ground-truth category $j$. (Thanks Lena for the correspondence)
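Their second estimate, the online code, is easier for me to picture in code. A minimal sketch of the online codelength computation as I understand it (my own toy version with a logistic-regression probe, not the authors' code):

```python
# Online coding sketch: train a probe on growing prefixes of the data and pay,
# in bits, the cost of predicting each next block with the model fit so far.
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, num_classes,
                      fractions=(0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.0)):
    n = len(y)
    cuts = [max(num_classes, int(f * n)) for f in fractions]
    # The first block is transmitted with a uniform code over the labels.
    codelength = cuts[0] * np.log2(num_classes)
    for start, end in zip(cuts[:-1], cuts[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        log_probs = probe.predict_log_proba(X[start:end])          # natural log
        codelength += -log_probs[np.arange(end - start), y[start:end]].sum() / np.log(2)
    return codelength

# Toy usage: random "representations" carry no label information, so the
# codelength should stay close to the uniform-code baseline of n bits.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 32)), rng.integers(0, 2, size=2000)
print(online_codelength(X, y, num_classes=2))
```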
Comments The intuition is hard to deny; the method requires some working knowledge of information theory. The neat thing is that MDL captures the difficulty of the task by including model size. For example, in “control tasks” where word types are assigned random outputs, even though accuracy is consistently high, the code length shows that the representations at different layers contain different amounts of relevant information.
I think this is a strictly preferable way to do probing and I hope it gets adopted by one of the big probing libraries.
Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning
Motivation In decoding, if we have a sentence \(X, Y, Z\) and we need to output \(Y\) given \(X\) and \(Z\), how can we do that? Conditioning on \(X\) alone, or simply concatenating \(X\) and \(Z\) as the input context, are both incorrect.
Method Their method is based on this 2020 ICLR plug-and-play paper, which is based on this 2016 vision plug-and-play paper.
First perform a forward pass to generate \(Y\) from \(P(Y|X)\), and then perform a backward pass maximizing \(P(Z|X,Y)\) to update the logits of \(Y\) without changing the parameters of the neural network. Since \(Y\) is discrete text and not differentiable, update \(Y=(y_1, \ldots, y_n) \in \mathbb{R}^{n \times v}\), the logits over the vocabulary space, instead. Take a weighted mixture of the “left-to-right” \(Y\) and the “backward-updated” \(Y'\), and after several iterations of forward and backward passes, sample \(y_n \sim \mathrm{softmax}(y_n/\tau)\), where \(\tau\) is the temperature.
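Here is my own toy sketch of the idea, not the authors' code: a tiny GRU stands in for the pretrained LM, and only the soft logits of \(Y\) are optimized so that the continuation \(Z\) becomes likely (the forward-pass initialization and the forward/backward mixture are omitted for brevity):

```python
# Toy forward/backward logit refinement: update the relaxed logits of Y so
# that P(Z | X, Y) increases, keeping the language model parameters frozen.
import torch
import torch.nn.functional as F

vocab_size, hidden, n_y = 100, 32, 5                   # toy sizes (assumptions)
emb = torch.nn.Embedding(vocab_size, hidden)           # shared embedding table
lm = torch.nn.GRU(hidden, hidden, batch_first=True)    # stand-in for a pretrained LM
out = torch.nn.Linear(hidden, vocab_size)

x = torch.randint(vocab_size, (1, 4))                  # left context X (token ids)
z = torch.randint(vocab_size, (1, 3))                  # right constraint Z (token ids)

# Relaxed logits of Y (in the real method these come from the forward pass P(Y|X)).
y_logits = torch.zeros(1, n_y, vocab_size, requires_grad=True)
opt = torch.optim.Adam([y_logits], lr=0.1)

for step in range(20):
    # Soft embeddings of Y: expected embedding under softmax(y_logits).
    y_emb = F.softmax(y_logits, dim=-1) @ emb.weight               # (1, n_y, hidden)
    seq = torch.cat([emb(x), y_emb, emb(z[:, :-1])], dim=1)        # X ; soft Y ; Z[:-1]
    h, _ = lm(seq)
    logits = out(h)[:, -z.size(1):]                                # predictions for Z
    # Backward pass: maximize P(Z | X, Y) by updating y_logits only.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), z.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Decode Y by sampling from the temperature-scaled refined logits.
tau = 0.7
y_tokens = torch.distributions.Categorical(logits=y_logits.detach() / tau).sample()
print(y_tokens)
```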
Comments The motivation and idea behind the method were so well explained that the first section I looked at in their paper was “Related Work”. I have some healthy discomfort with this paper re the claim to counterfactual and abductive reasoning. Those are big words when the method they describe seems to me to be mostly constrained language modeling with a pretrained model. I didn’t read their experiments, but I think that’s where one should be most critical.
Attention is Not Only a Weight: Analyzing Transformers with Vector Norms
Method In self-attention, the updated vector is $y_i = \sum_{j=1}^{n} \alpha_{i,j} f(x_j)$, where $j$ indexes the other token representations and $f(x_j)$ is the “Value” vector (remember $K, Q, V$ in self-attention?). Instead of looking at $\alpha_{i,j}$ to see which token representations are important, look at $|| \alpha_{i,j} f(x_j) ||$.
Comments Looking at attention weights for interpretability is just problematic. With just a tiny bit of work (this is probably 2 lines of code) we can get something obviously better.
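Something like this, assuming we already have one head's attention weights and value-transformed vectors (toy tensors, not the paper's code):

```python
# Norm-based analysis: weigh each value vector f(x_j) by alpha_{i,j} and look
# at the norm of the resulting vector instead of the raw attention weight.
import torch

n, d = 6, 16                                           # toy sizes (assumptions)
alpha = torch.softmax(torch.randn(n, n), dim=-1)       # attention weights alpha_{i,j}
values = torch.randn(n, d)                             # f(x_j), the "Value" vectors

weighted = alpha.unsqueeze(-1) * values.unsqueeze(0)   # (i, j, d): alpha_{i,j} f(x_j)
norms = weighted.norm(dim=-1)                          # ||alpha_{i,j} f(x_j)||, shape (n, n)
print(norms)
```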
RNNs can generate bounded hierarchical languages with optimal memory
Motivation: Why are RNNs good at syntax? From previous work we know that RNNs can generate regular (finite-state) languages, but the best construction we know of requires exponential memory, $O(k^{\frac{m}{2}})$ (citation needed).
Method: The authors set up a family of context-free languages which requires bounded memory: Dyck-(k,m). This language describes balanced brackets, so something like $(_1 (_2 )_2 (_2 )_2 )_1$, which can be hierarchically nested, where there are $k$ bracket types and at most $m$ unclosed open brackets at any time. In real-world language, this translates to $k$ being the vocabulary size and $m$ being the nesting depth. They describe a stack construction which reads and writes from an implicit stack encoded in the hidden states, and show how the recurrent $W$ matrix can push and pop from this stack with $2mk$ memory for an RNN and $mk$ memory for an LSTM. Experiments training memory-restricted RNNs on the language, evaluated on bracket-closing accuracy, show that they indeed do not require exponential hidden states.
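To make the language concrete, here is a small sampler for Dyck-(k, m) strings (my own sketch, not the paper's code):

```python
# Sample a string from Dyck-(k, m): k bracket types, at most m unclosed
# open brackets at any point, closed in last-opened-first-closed order.
import random

def sample_dyck(k, m, max_len=20, p_open=0.5):
    stack, out = [], []
    while len(out) < max_len or stack:
        must_close = len(stack) == m or len(out) >= max_len
        if stack and (must_close or random.random() > p_open):
            out.append(f"){stack.pop()}")      # close the most recently opened bracket
        else:
            t = random.randint(1, k)           # open a bracket of a random type
            stack.append(t)
            out.append(f"({t}")
    return " ".join(out)

print(sample_dyck(k=2, m=3))                   # e.g. "(1 (2 )2 (2 )2 )1 ..."
```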
Comments: This is not your typical EMNLP paper with a shit ton of experiments, and I’m encouraged that the community can appreciate papers like this. The proof construction is kind of like proof by conjecture: we don’t know whether RNNs in practice are doing what the authors suggest they could be doing, although the experiments do support the new memory bound, so that’s convincing for me. It’s reminiscent of Psychology papers “theorising” about how cognition works.
Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning
to come. hopefully.
If Beam Search is the Answer, What was the Question?
to come. hopefully.
How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
to come. hopefully.
Pareto Probing: Trading Off Accuracy for Complexity
to come. hopefully.
Tutorial
Antoine Bosselut’s section of the NLG Tutorial (rec by my dude Anton Belyy). I am secretly happy that this part of the tutorial was about formulating the objective function for training, rather than neural architectures for Text Generation. No sign of a hierarchical bidirectional skip-connection highway network. Genuine thanks to transformers for quite effectively squashing this line of work.
Decoding:
Why does repetition happen in NLG? Repetition causes the token-level negative log likelihood to decrease (repeated sequences become more and more probable), and this is even worse for Transformers. No real explanation is given for this; it is one of the modern mysteries of NLP. Many papers try to address it, but we still don’t know why it happens.
The tutorial describes temperature sampling, top-k sampling (truncating to the top-k options), top-p/nucleus sampling (sampling from the subset of the vocabulary which contains the top p probability mass), reranking sequences based on other attributes, and KNN models (other citations needed). This section of the tutorial reminds me of why I shirk away from NLG research. I can’t shake off the feeling that most of these are ad-hoc heuristic methods. I do like the new gradient-based “plug-and-play” stuff though (see above).
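For my own reference, a minimal sketch of top-p/nucleus sampling (my toy version, not the tutorial's code):

```python
# Nucleus (top-p) sampling: sample from the smallest set of tokens whose
# cumulative probability mass exceeds p, after renormalizing.
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1     # keep tokens up to mass p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, 1).item()
    return int(sorted_idx[choice].item())

logits = torch.randn(50)                                # toy vocabulary of 50 tokens
print(nucleus_sample(logits, p=0.9))
```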
Training:
MLE (cross-entropy loss) discourages diversity, so “unlikelihood training” explicitly discourages the model from producing certain outputs. F^2 softmax, which factorizes the softmax into first selecting a word category and then selecting the word within it, looks like a more nuanced approach to training. The tutorial next describes exposure bias, which just means that neural models produce degenerate output when the conditioning text is very different from that seen during training.
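A rough sketch of the token-level unlikelihood term as I understand it (my paraphrase, not the authors' code):

```python
# Unlikelihood training: on top of the usual NLL on the target token, add a
# penalty -log(1 - p(c)) for each "negative candidate" c (e.g. previously
# generated tokens), discouraging the model from repeating them.
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, target, negative_candidates, alpha=1.0):
    """logits: (vocab,), target: token id, negative_candidates: list of token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs[target]                                       # standard MLE term
    probs = log_probs.exp()
    penalty = -torch.stack([torch.log1p(-probs[c].clamp(max=1 - 1e-6))
                            for c in negative_candidates]).sum()
    return nll + alpha * penalty

logits = torch.randn(50)                                           # toy next-token logits
print(unlikelihood_loss(logits, target=3, negative_candidates=[7, 12]))
```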
Finally, I was surprised to see RL introduced as an approach to train the generator. Surprised because you would think that the “reward” is sparser than, say, an MLE loss. Even if we wanted to account for style etc. during generation, we should be able to incorporate this into the loss as an additional regularization or penalty term.
In my exchange with Antoine, he mentioned that discrete rewards could be used directly for behaviors that are harder to formalize, with the example of this approach that uses RL with approximations of human judgements as rewards. RL also helps alleviate the “exposure bias” issue, because we sample proposal sequences from the model as trajectories and receive feedback from our own reward function; as a result the model is learning from its own sampled sequences rather than the gold sequences. (I still don’t get the practical differences with scheduled sampling.) Antoine does caveat that RL is, however, used less in practice. (thx for the correspondence)
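For completeness, the REINFORCE-style objective being discussed is roughly the following (my own illustration with a made-up scalar reward, not from the tutorial):

```python
# REINFORCE for generation: sample a sequence from the model, score it with a
# reward function, and weight the log-likelihood of the sampled tokens by that
# reward; the gradient pushes up the probability of high-reward samples.
import torch
import torch.nn.functional as F

def reinforce_loss(step_logits, sampled_tokens, reward):
    """step_logits: (T, vocab), sampled_tokens: (T,), reward: scalar."""
    log_probs = F.log_softmax(step_logits, dim=-1)
    chosen = log_probs[torch.arange(len(sampled_tokens)), sampled_tokens]
    return -reward * chosen.sum()

step_logits = torch.randn(6, 50)                       # toy: 6 steps, vocab of 50
sampled = torch.multinomial(step_logits.softmax(-1), 1).squeeze(-1)
print(reinforce_loss(step_logits, sampled, reward=0.8))
```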