Pad pack sequences for Pytorch batch processing with DataLoader
PyTorch setup for batch sentence/sequence processing - minimal working example. The pipeline consists of the following:
- Convert sentences to ix
- `pad_sequence` to convert variable-length sequences to the same size (using DataLoader)
- Convert padded sequences to embeddings
- `pack_padded_sequence` before feeding into RNN
- `pad_packed_sequence` on our packed RNN output
- Eval/reconstruct actual output
1. Convert sentences to ix
Construct word-to-index and index-to-word dictionaries, tokenize words, and convert words to indexes. Note the special indexes that we need to reserve for `<pad>`, `<EOS>`, `<unk>`, and `N` (digits). The indexes should correspond to positions in the word-embedding matrix.
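A minimal sketch of this step (the toy corpus and the exact ordering of the special tokens below are illustrative assumptions):

```python
# Reserve special indexes first, then add the corpus vocabulary.
special_tokens = ["<pad>", "<EOS>", "<unk>", "N"]
corpus = [["the", "cat", "sat"], ["the", "dog", "barked", "loudly"]]  # toy corpus

word2ix = {tok: ix for ix, tok in enumerate(special_tokens)}
for sentence in corpus:
    for word in sentence:
        if word not in word2ix:
            word2ix[word] = len(word2ix)
ix2word = {ix: tok for tok, ix in word2ix.items()}

def sentence_to_ix(sentence, word2ix):
    # Unknown words map to <unk>; append <EOS> to mark the end of the sequence.
    return [word2ix.get(w, word2ix["<unk>"]) for w in sentence] + [word2ix["<EOS>"]]

print(sentence_to_ix(["the", "cat", "meowed"], word2ix))  # e.g. [4, 5, 2, 1]
```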
2. `pad_sequence` to convert variable-length sequences to the same size
For the network to take in a batch of variable-length sequences, we need to first pad each sequence with empty values (0). This makes every training sentence the same length, and the input to the model is now $(N, M)$, where $N$ is the batch size and $M$ is the length of the longest training instance.
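For instance, `pad_sequence` from `torch.nn.utils.rnn` pads a list of variable-length index tensors into a single $(N, M)$ batch (the index values below are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three variable-length "sentences" of word indexes (illustrative values).
seqs = [torch.tensor([4, 5, 6, 1]), torch.tensor([7, 8, 1]), torch.tensor([9, 1])]

# padding_value=0 corresponds to the reserved <pad> index.
x_padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(x_padded.shape)  # torch.Size([3, 4]) -> (N, M)
```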
For batch processing, a typical pattern is to use this with PyTorch's DataLoader and Dataset:
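A sketch of that pattern; the names `SentenceDataset` and `pad_collate` are assumptions for illustration, not the original code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

class SentenceDataset(Dataset):
    # Wraps parallel lists of input (xx) and target (yy) index sequences.
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        # Returns one unpadded (xx, yy) pair as tensors.
        return torch.tensor(self.xs[i]), torch.tensor(self.ys[i])

def pad_collate(batch):
    # batch is a list of (xx, yy) tuples; zip(*batch) gives tuples of xxs and yys.
    xxs, yys = zip(*batch)
    x_lens = torch.tensor([len(x) for x in xxs])  # lengths before padding
    y_lens = torch.tensor([len(y) for y in yys])
    xx_pad = pad_sequence(list(xxs), batch_first=True, padding_value=0)
    yy_pad = pad_sequence(list(yys), batch_first=True, padding_value=0)
    return xx_pad, yy_pad, x_lens, y_lens

train_dataset = SentenceDataset(xs=[[4, 5, 6, 1], [7, 8, 1]], ys=[[5, 6, 7, 1], [8, 9, 1]])
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True, collate_fn=pad_collate)

for xx_pad, yy_pad, x_lens, y_lens in train_loader:
    pass  # padded batches plus the original lengths
```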
One instance from the train dataset returns an unpadded `(xx, yy)` pair, so that when used with our custom collate function, we get tuples of `xx`s and `yy`s and can pad them by batch. Next, enumerate over the DataLoader to get the padded sequences and their lengths (before padding).
Note: Here we are assuming `yy` is a target sequence. If `yy` is just a categorical variable, then it is already of fixed length for all data instances and there is no need to pad.
3. Convert padded sequences to embeddings
`x_padded` is an $(N, M)$ matrix, and after embedding it becomes $(N, M, E)$, where $E$ is the embedding dimension. Note that `vocab_size` should include the special `<pad>`, `<EOS>`, etc. tokens.
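A sketch of the embedding step, with assumed values for `vocab_size` and the embedding dimension:

```python
import torch
import torch.nn as nn

vocab_size = 20     # assumed; must cover <pad>, <EOS>, <unk>, N and all regular words
embedding_dim = 8   # E, assumed

# padding_idx=0 keeps the <pad> embedding at zeros and excludes it from gradient updates.
embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

x_padded = torch.tensor([[4, 5, 6, 1], [7, 8, 1, 0], [9, 1, 0, 0]])  # (N, M) padded batch
x_embed = embed(x_padded)
print(x_embed.shape)  # torch.Size([3, 4, 8]) -> (N, M, E)
```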
4. `pack_padded_sequence` before feeding into RNN
Now, pack the padded, embedded sequences. For PyTorch to know how to pack and unpack properly, we feed in the lengths of the original sentences (before padding). Note that we won't be able to pack before embedding. The `rnn` can be a GRU, LSTM, etc.
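A minimal sketch of the packing step, assuming a GRU with `batch_first=True` and a random stand-in for the embedded batch:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

hidden_dim = 16  # H, assumed
rnn = nn.GRU(input_size=8, hidden_size=hidden_dim, batch_first=True)  # could also be nn.LSTM

x_embed = torch.randn(3, 4, 8)    # stand-in for the embedded, padded batch (N, M, E)
x_lens = torch.tensor([4, 3, 2])  # original lengths before padding

# Lengths must live on the CPU; enforce_sorted=False keeps the original batch order.
x_packed = pack_padded_sequence(x_embed, x_lens, batch_first=True, enforce_sorted=False)

output_packed, hidden = rnn(x_packed)  # hidden is h_n for GRU, (h_n, c_n) for LSTM
```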
`x_packed` and `output_packed` are in a format that the PyTorch RNNs can read, allowing them to ignore the padded inputs when calculating gradients for backprop. We can also set `enforce_sorted=True`, which requires the input to be sorted by decreasing length; just make sure the targets $y$ are sorted accordingly as well.
Note: It is standard to initialise the hidden states of the LSTM/GRU cell to 0 for each new sequence. There are of course other approaches, such as random initialisation or learning the initial hidden state, which is an active area of research.
5. `pad_packed_sequence` on our packed RNN output
This returns our familiar padded output format, $(N, M_{out}, H)$, where $M_{out}$ is the length of the longest sequence and $H$ is the RNN hidden dimension. The length of each sentence is given by `output_lengths`.
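A minimal sketch of unpacking, repeating the packing step from above with a stand-in batch:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x_embed, x_lens = torch.randn(3, 4, 8), torch.tensor([4, 3, 2])  # stand-in batch
x_packed = pack_padded_sequence(x_embed, x_lens, batch_first=True, enforce_sorted=False)
output_packed, _ = rnn(x_packed)

# Unpack back to a padded tensor plus the per-sequence lengths.
output_padded, output_lengths = pad_packed_sequence(output_packed, batch_first=True)
print(output_padded.shape)  # torch.Size([3, 4, 16]) -> (N, M_out, H)
print(output_lengths)       # tensor([4, 3, 2])
```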
6. Eval/reconstruct actual output
Push the padded output through the final output layer to get (unnormalised) scores over the vocabulary space.
Finally, we can (1) recover the actual output by taking the argmax, slicing with `output_lengths`, and converting indexes to words using our index-to-word dictionary, or (2) directly calculate the loss with `cross_entropy`, ignoring the `<pad>` index via `ignore_index`.
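A sketch of both options, using stand-in tensors and an assumed linear output layer `fc`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_dim = 20, 16  # assumed sizes
fc = nn.Linear(hidden_dim, vocab_size)

output_padded = torch.randn(3, 4, hidden_dim)  # (N, M_out, H) stand-in RNN output
output_lengths = torch.tensor([4, 3, 2])
yy_pad = torch.randint(1, vocab_size, (3, 4))  # stand-in padded targets
yy_pad[1, 3:] = 0
yy_pad[2, 2:] = 0                              # 0 marks <pad> positions

scores = fc(output_padded)  # (N, M_out, vocab_size), unnormalised

# (1) Reconstruct words: argmax over the vocabulary, then slice away the padded tail.
pred_ix = scores.argmax(dim=-1)  # (N, M_out)
preds = [pred_ix[i, : output_lengths[i]].tolist() for i in range(len(output_lengths))]
# ...then map indexes back to words with ix2word.

# (2) Or compute the loss directly, ignoring the <pad> index.
loss = F.cross_entropy(scores.view(-1, vocab_size), yy_pad.view(-1), ignore_index=0)
```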