Author: Dr. Patrik Lambert, Senior Machine Translation Scientist @ Iconic
The standard Transformer model is autoregressive (AT), which means that the prediction of each target word is conditioned on the previously predicted words. The output is generated from left to right, a process which cannot be parallelised because the prediction probability of a token depends on the previous tokens. In the last few years, new approaches have been proposed to predict the output tokens simultaneously in order to speed up decoding. These approaches, called non-autoregressive (NAT), are based on the assumption that the prediction probability of each token is independent of the previous tokens. Since this assumption does not hold in general in machine translation, NAT approaches cause a drop in translation quality. A popular way to mitigate this quality drop has been to sacrifice part of the speed-up by iteratively refining the output of NAT systems, as in the Levenshtein Transformer (see Issue #86). Today, we take a look at a paper by Gu and Kong (2020) which proposes a fully non-autoregressive model achieving BLEU scores similar to those of the standard Transformer. The key point of the method is to reduce the dependency between output tokens in the learning space. The authors combine several existing methods which reduce this dependency in different aspects of the training process: the corpus, the model and the loss function.
Given a toy corpus with only two examples, “AB” and “BA”, each of which has a 50% probability of appearing, an AT model would assign 50% of the probability mass to each of these possible outputs. However, a NAT model, in which each output token is produced independently, will assign a 25% probability to each of the following outputs: “AA”, “AB”, “BA” and “BB”. It thus assigns probability mass to incorrect outputs (“AA” and “BB”) which were not seen during training. In real translation examples, the dependency between output tokens is much more complicated, and some dependency reduction in the output space is necessary so that the remaining dependencies can be captured by the NAT model.
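The toy example can be checked numerically. The sketch below is only an illustration, assuming that a NAT model predicts each position from its per-position marginal distribution; the function name `nat_distribution` is made up for this example.

```python
from itertools import product

# Toy corpus from the text: two equally likely target sentences.
corpus = {"AB": 0.5, "BA": 0.5}


def nat_distribution(corpus):
    """Probability of every output when each position is predicted independently."""
    length = len(next(iter(corpus)))
    vocab = sorted({tok for sent in corpus for tok in sent})
    # Per-position marginals: P(token at position i), estimated from the corpus.
    marginals = [
        {tok: sum(p for sent, p in corpus.items() if sent[i] == tok) for tok in vocab}
        for i in range(length)
    ]
    dist = {}
    for tokens in product(vocab, repeat=length):
        prob = 1.0
        for i, tok in enumerate(tokens):
            prob *= marginals[i][tok]  # independence assumption of NAT
        dist["".join(tokens)] = prob
    return dist


nat = nat_distribution(corpus)
print(nat)  # {'AA': 0.25, 'AB': 0.25, 'BA': 0.25, 'BB': 0.25}

# Probability mass placed on outputs never seen in training ("AA" and "BB").
unseen_mass = sum(p for sent, p in nat.items() if sent not in corpus)
print(unseen_mass)  # 0.5
```

Half of the probability mass ends up on outputs that never occur in the corpus, which is the problem the dependency reduction techniques below try to address.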
Corpus: Knowledge Distillation
The most effective dependency reduction technique is knowledge distillation (KD). In practice, it consists of translating the source side of the training corpus with a pre-trained AT engine and replacing the original target training sentences with the translated ones. KD can be seen as a filter which reduces the “modes” (alternative translations for an input) in the training data (Zhou et al., 2020). KD is thus able to simplify the training data, removing some noise and aligning the targets more deterministically to the inputs. To understand the impact on NAT neural machine translation (NMT), Zhou et al. take the extreme example of a corpus composed of English sentences translated into French, Spanish and German. In this case each English sentence has three modes: the translations into French, Spanish and German. In the output sentence produced by an AT model trained on this corpus, most tokens would be in the same language because the model would follow the word dependencies. However, a NAT model trained on this corpus would produce sentences mixing the three languages. KD would keep only one mode for each English sentence, and would thus avoid mixing the languages in the translation. This example explains why KD is effective in NAT NMT models.
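The data preparation step of KD can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: `distill_corpus`, `toy_teacher` and `modes_per_source` are hypothetical names, and the lookup table stands in for a real pre-trained AT engine.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def distill_corpus(pairs: Iterable[Pair], teacher_translate: Callable[[str], str]) -> List[Pair]:
    """Replace each reference target with the teacher's translation of the source."""
    return [(src, teacher_translate(src)) for src, _reference in pairs]


def modes_per_source(pairs: Iterable[Pair]) -> Dict[str, Set[str]]:
    """Collect the distinct targets ("modes") observed for each source sentence."""
    modes: Dict[str, Set[str]] = {}
    for src, tgt in pairs:
        modes.setdefault(src, set()).add(tgt)
    return modes


# Toy version of the multilingual corpus: one English source, three target languages.
corpus = [
    ("thank you", "merci"),
    ("thank you", "gracias"),
    ("thank you", "danke"),
]


def toy_teacher(src: str) -> str:
    # Hypothetical stand-in for an AT engine: it always commits to a single mode.
    return {"thank you": "merci"}[src]


distilled = distill_corpus(corpus, toy_teacher)
print(len(modes_per_source(corpus)["thank you"]))     # 3
print(len(modes_per_source(distilled)["thank you"]))  # 1
```

The distilled corpus has the same number of sentence pairs, but each source is now paired with a single target, so a NAT student no longer sees conflicting modes for the same input.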
Model: Latent Variables
Another way of reducing the dependency between target words in the output is to express the probabilities of the output tokens in terms of variables typically predicted from the inputs. These latent variables are introduced at the encoder output of each source position.
Loss Function: Latent Alignments
The standard Transformer is trained by minimising the cross-entropy loss, which compares the model’s output with the target token at each corresponding position. However, as NAT models ignore the dependency in the output space, it is almost impossible for such models to accurately model the token position differences. An effective dependency reduction in the target space can be achieved by using a loss function which compares sequences of predictions directly aligned with the source, instead of comparing target tokens position by position. These sequences of predicted tokens, called latent alignments, have the same length as the source sequence and are monotonically aligned to it (Saharia et al., 2020). They can contain blanks and repetitions, and can be collapsed into the target sequence by merging consecutive repeated tokens and then removing the blanks. For example, given a source sequence of length 10 and a target sequence y = (A, A, B, C, D), a possible alignment is (_, A, A, _, A, B, B, C, _, D). The dependency is reduced because the NAT model is able to freely choose the best prediction regardless of the target offsets. In the new loss function, the log-likelihood is expressed as a sum over all valid alignments from which the target y can be recovered. Given the assumptions on the alignments (monotonicity and source size), it can be computed efficiently by dynamic programming.
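The collapse operation and the sum over alignments can be illustrated with a small pure-Python sketch. It assumes a CTC-style formulation (merge consecutive repeats, then drop blanks) and per-position probabilities given as one dictionary per position; the function names are made up for this example, and it is not the authors' implementation.

```python
from itertools import product

BLANK = "_"


def collapse(alignment):
    """Merge consecutive repeated tokens, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out


def brute_force_likelihood(probs, target):
    """Sum the probability of every alignment that collapses to `target` (exponential)."""
    total = 0.0
    for alignment in product(list(probs[0]), repeat=len(probs)):
        if collapse(alignment) == list(target):
            p = 1.0
            for t, tok in enumerate(alignment):
                p *= probs[t][tok]
            total += p
    return total


def alignment_likelihood(probs, target):
    """Same quantity computed by dynamic programming (CTC forward recursion)."""
    ext = [BLANK]  # target interleaved with blanks: _, y1, _, y2, ..., _
    for tok in target:
        ext += [tok, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]  # advance by one symbol
            # Skipping a blank is only allowed between two different tokens.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# The alignment from the text collapses to the target (A, A, B, C, D).
print(collapse(list("_AA_ABBC_D")))  # ['A', 'A', 'B', 'C', 'D']

# Three positions with uniform probabilities over {_, A, B}; target (A, B).
uniform = [{tok: 1 / 3 for tok in "_AB"} for _ in range(3)]
print(alignment_likelihood(uniform, ["A", "B"]))
print(brute_force_likelihood(uniform, ["A", "B"]))
```

In the uniform case, five alignments collapse to (A, B): _AB, A_B, AB_, AAB and ABB. The brute-force sum enumerates every possible alignment, whereas the recursion only tracks one value per position in the blank-extended target, which is what makes the loss tractable for real sentence lengths.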
Results are given in terms of BLEU score for engines trained on WMT shared task data for the German (de) – English (en) and Romanian (ro) – English (en) language pairs, and for the Japanese (ja) – English (en) language direction. For en-de and en-ro, KD yields improvements of 5 to 8 points, latent alignments yield further improvements of 4 to 7 points, and latent variables yield an additional improvement of 0.5 to 1 point. The NAT model with all three techniques achieves a new state of the art in NAT NMT, with BLEU scores similar to those of the baseline AT model (standard Transformer). The ja-en direction is harder and the full NAT model fails to reach the BLEU score of the baseline AT model. Interestingly, however, when beam search is performed on an interpolation of the NMT model score and a 4-gram target language model score, the BLEU score rises to the level of the baseline AT model.
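To give an intuition of why a target language model can help, here is a rough sketch of score interpolation. It is not the paper's decoder: it only reranks a fixed list of candidates instead of running a beam search, the weighted sum of log-probabilities is one common way to combine two scores, and the weight and all scores are invented for the illustration.

```python
def interpolated_score(nat_logprob, lm_logprob, lm_weight):
    """Weighted sum of log-probabilities (lm_weight is a hypothetical tuning knob)."""
    return nat_logprob + lm_weight * lm_logprob


# Made-up scores: (NAT model log-probability, 4-gram LM log-probability).
candidates = {
    "the cat sat": (-2.0, -3.0),
    "the the cat": (-1.8, -9.0),  # repeated token: fine for the NAT model, poor for the LM
}

best_nat_only = max(candidates, key=lambda c: candidates[c][0])
best_with_lm = max(candidates, key=lambda c: interpolated_score(*candidates[c], lm_weight=0.5))

print(best_nat_only)  # the the cat
print(best_with_lm)   # the cat sat
```

The language model score brings back some of the dependency between neighbouring target tokens that the NAT model does not capture on its own.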
The speed-up achieved by the fully NAT model depends on the translation mode. When translating one sentence at a time on GPU, the fact that all tokens are decoded simultaneously instead of one after the other, coupled with the parallelisation capabilities of the GPU, gives a 16.5x speed-up with respect to the AT model. When translating one sentence at a time on CPU, which has fewer parallelisation capabilities than a GPU, the speed-up is more than halved (about 7.5x). When translating a batch of sentences, several sentences can be decoded simultaneously on different threads, even with the AT model. In this case the speed-up is nearly 3x on GPU, and is not given on CPU. Compared with an AT model with a shallow decoder (12 encoder layers and 1 decoder layer), the speed-up for batch translation on GPU is “only” 2x.
Non-autoregressive models have the advantage of speeding up decoding. However, this usually comes at the expense of a drop in translation quality. We have seen an approach combining several recently proposed improvements (knowledge distillation, latent variables and latent alignments), which achieves BLEU scores similar to those of the standard AT Transformer model. The guiding principle is that these improvements contribute to reducing the dependency between target tokens, a dependency which cannot be well captured by NAT models. The speed-up is impressive for sentence-by-sentence translation on GPU (16.5x), and is also valuable for batch translation on GPU (nearly 3x).