Issue #105 – Improving Non-autoregressive Neural Machine Translation with Monolingual Data
Author: Dr. Chao-Hong Liu, Machine Translation Scientist @ Iconic
In neural machine translation (NMT), taking advantage of monolingual data to improve the trained models remains a challenge. In this post, we review an approach proposed by Zhou and Keung (2020) within the framework of non-autoregressive (NAR) NMT. Their results show that NAR models trained with this approach achieve better or comparable performance compared to state-of-the-art non-iterative NAR models.
NAR-MT with Monolingual Data
Zhou and Keung (2020) see the “NAR model as a function approximator of an existing AR (autoregressive) model”. The approach takes an AR model and a set of source sentences as input. First, the AR (teacher) model translates the source sentences. These translations are then paired with the source sentences and used to train the NAR model. Under this framework, it is easy to incorporate monolingual data into training: Zhou and Keung (2020) used the AR model to translate additional source-language data, which makes the resulting parallel corpus larger and more general for training the NAR model.
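The data preparation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The function `build_distillation_corpus`, the `teacher_translate` callable and the `toy_teacher` stand-in are hypothetical names introduced here, and the removal of empty and duplicate source sentences is our own simplification rather than something taken from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def build_distillation_corpus(
    teacher_translate: Callable[[str], str],
    parallel_sources: Iterable[str],
    monolingual_sources: Iterable[str],
) -> List[Tuple[str, str]]:
    """Pair every source sentence with the teacher's translation of it.

    Source sentences come from two places: the source side of the original
    parallel data, and additional monolingual data in the source language.
    The reference translations are not used; the NAR student is trained on
    the teacher's output only.
    """
    seen = set()
    corpus = []
    for sentence in list(parallel_sources) + list(monolingual_sources):
        sentence = sentence.strip()
        # Assumption (ours): skip empty lines and repeated source sentences.
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        corpus.append((sentence, teacher_translate(sentence)))
    return corpus


def toy_teacher(sentence: str) -> str:
    """Hypothetical stand-in for a trained AR model: reverses the word order."""
    return " ".join(reversed(sentence.split()))


corpus = build_distillation_corpus(
    toy_teacher,
    parallel_sources=["a b c"],
    monolingual_sources=["d e", "a b c", ""],
)
for source, target in corpus:
    print(f"{source}\t{target}")
```

In a real setting, `teacher_translate` would wrap decoding with the trained AR model, and the resulting list of sentence pairs would be passed to whatever routine trains the NAR model.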
Experiments and Results
Zhou and Keung (2020) used the WMT16 English–Romanian (En-Ro) and WMT14 English–German (En-De) datasets for their experiments, which contain approximately 610k and 4.5M sentence pairs, respectively. As monolingual data for the En-Ro models, the “Romanian portion of the News Crawl 2015 corpus and the English portion of the Europarl v7/v8 corpus” are used. Table 1 shows the gain in BLEU when different amounts of monolingual data are used to prepare the parallel corpus for training the NAR models. “Gold” in the table indicates that true target lengths are used in the experiments, rather than predicted ones. The experiments showed consistently that the more monolingual data is translated by the AR models to produce sentence pairs for NAR training, the better the performance of the resulting models. The state-of-the-art results for En-to-Ro and Ro-to-En MT performance are 32.20 and 32.84 in terms of BLEU, respectively. The proposed method achieved 34.50 and 34.01 using an AR Transformer with a beam size of 4. Similar results were observed in the En-to-De and De-to-En experiments.
Table 1. BLEU scores of NAR models trained on English–Romanian language pairs in both directions. B refers to the number of length candidates determined, while gold refers to using true target length. Excerpted from Zhou and Keung (2020).
In this post we reviewed the use of monolingual data under the framework of non-autoregressive (NAR) NMT. Theoretically, models can only learn from the data they are given for training. Therefore, it does not seem possible that NAR models could outperform their teacher AR models given the same training sets. We see that they do, though, possibly because the resulting NAR models rule out unnecessary complexity that was learnt by the teacher models. In Zhou and Keung (2020), the performance improvement is gained by introducing more monolingual data, which the teacher models translate to produce more parallel text for the NAR models to train on. It looks as if more information has been introduced into the training of the models, but that information still depends mainly on the trained teacher models. Nevertheless, the results show that the overall performance has improved significantly. We now need to look at the impact on performance at the microscale, as the complexity of the trained models is reduced using this approach.