Issue #67 – Unsupervised Adaptation of Neural MT with Iterative Back-Translation
Author: Dr. Patrik Lambert, Machine Translation Scientist @ Iconic
The most popular domain adaptation approach, when some in-domain parallel data are available, is to fine-tune the generic model on the in-domain corpus. When no parallel in-domain data are available, the most popular approach is back-translation, which consists of translating monolingual target in-domain data into the source language and using the resulting synthetic sentence pairs as a training corpus. In this post we take a look at a refinement of back-translation, inspired by advances in unsupervised neural MT, which yields large BLEU score improvements.
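To make plain back-translation concrete, here is a minimal sketch. It assumes a target-to-source translation function is available; `toy_reverse_model` is a hypothetical stand-in for a real MT system, and all names are illustrative rather than taken from any particular toolkit.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    translate_tgt_to_src: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each genuine in-domain target sentence with a synthetic source.

    Returns (synthetic_source, genuine_target) pairs, which can then be
    used as training data for the source-to-target model.
    """
    return [(translate_tgt_to_src(t), t) for t in target_sentences]


# Hypothetical stand-in for a real target-to-source MT model.
def toy_reverse_model(sentence: str) -> str:
    return "<src> " + sentence.lower()


in_domain_target = ["The court ruled.", "The appeal failed."]
synthetic_corpus = back_translate(in_domain_target, toy_reverse_model)
```

The target side of every pair is genuine in-domain text; only the source side is machine-generated.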
Adaptation with Iterative Back-Translation
The method is presented in a paper by Jin et al. (2020). It assumes access to an out-of-domain parallel training corpus and in-domain monolingual data (in both the source and the target languages). In this approach the training optimises three objectives:
- Source and target bidirectional language models. In these language models, masked words are predicted given the whole context surrounding them.
- Source-to-target and target-to-source unsupervised translation models. Source monolingual sentences are translated by the current source-to-target model. Similarly, target monolingual sentences are translated by the current target-to-source model. The objective is to minimise the training loss of these models with these synthetic data.
- A supervised neural MT model, trained with out-of-domain data.
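The three objectives above can be sketched as a single loss for one training step. This is only an illustration under stated assumptions, not the implementation from the paper: the unweighted sum, the training of the supervised model in both directions, and the choice to use each synthetic translation as the input with the genuine sentence as the reference are all assumptions, and every function name is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (input sentence, reference output)


@dataclass
class Batch:
    src_mono: List[str]   # in-domain monolingual source sentences
    tgt_mono: List[str]   # in-domain monolingual target sentences
    parallel: List[Pair]  # out-of-domain (source, target) pairs


def ibt_step_loss(
    batch: Batch,
    translate_s2t: Callable[[str], str],
    translate_t2s: Callable[[str], str],
    loss_s2t: Callable[[List[Pair]], float],
    loss_t2s: Callable[[List[Pair]], float],
    lm_loss_src: Callable[[List[str]], float],
    lm_loss_tgt: Callable[[List[str]], float],
) -> float:
    # 1. Language model objective on both monolingual sides.
    lm = lm_loss_src(batch.src_mono) + lm_loss_tgt(batch.tgt_mono)

    # 2. Back-translation objective: translate with the current models,
    #    then (assumption) train the opposite direction on the pairs
    #    (synthetic input, genuine output).
    synth_for_t2s = [(translate_s2t(x), x) for x in batch.src_mono]
    synth_for_s2t = [(translate_t2s(y), y) for y in batch.tgt_mono]
    bt = loss_t2s(synth_for_t2s) + loss_s2t(synth_for_s2t)

    # 3. Supervised objective on out-of-domain parallel data
    #    (assumption: applied in both translation directions).
    reversed_pairs = [(t, s) for s, t in batch.parallel]
    sup = loss_s2t(batch.parallel) + loss_t2s(reversed_pairs)

    # Assumption: plain unweighted sum of the three objectives.
    return lm + bt + sup


# Toy stand-ins so the sketch runs end to end (all hypothetical).
batch = Batch(
    src_mono=["quelle a", "quelle b"],
    tgt_mono=["target a", "target b", "target c"],
    parallel=[("haus", "house")],
)
total = ibt_step_loss(
    batch,
    translate_s2t=lambda s: "t:" + s,
    translate_t2s=lambda t: "s:" + t,
    loss_s2t=lambda pairs: float(len(pairs)),
    loss_t2s=lambda pairs: float(len(pairs)),
    lm_loss_src=lambda sents: 0.5 * len(sents),
    lm_loss_tgt=lambda sents: 0.5 * len(sents),
)
```

Because the synthetic pairs are regenerated with the current models at every step, the quality of the synthetic data can improve as training progresses, which is what makes the back-translation iterative.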
Adaptation with Iterative Back-Translation (IBT) is compared with baseline adaptation methods. The best baselines are back-translation and DAFE. DAFE performs multi-task learning, training a translation model on out-of-domain parallel data and a language model on in-domain target-side monolingual data, while inserting domain and task embedding learners into the transformer-based model. IBT works much better than the baselines when adapting between specific domains, but since that does not seem to be a real-world scenario, we will focus on the results of adaptation from a more general domain (like news) into a specific domain (like law or medical). In this case, DAFE works slightly better than back-translation.
For the WMT14 de-en task, IBT with the out-of-domain data as parallel corpus yields an improvement of 1.5 to 2.5 BLEU points over the best baseline. With back-translated data as parallel corpus, the improvement is larger, at more than 2 to 3 BLEU points. Adding extra monolingual in-domain data gives further improvements. For the smaller WMT16 ro-en task, the improvement is larger.
An ablation study reveals that all components are important: pre-training, IBT, language models, and supervised translation models with back-translated data. However, the language models are the component with the least impact.
Interestingly, pre-training also has a large positive impact on the baselines. However, this result is reported only for adaptation between specific domains. What is missing is a comparison of IBT with pre-trained back-translation and DAFE on generic-to-specific domain adaptation.
Iterative Back-Translation with pre-training, source and target language models and back-translated parallel data is the best adaptation approach to date when no in-domain parallel data are available. However, it ideally requires monolingual in-domain data in both the source and target languages. The paper also highlighted the positive impact of pre-training for all considered domain adaptation baselines.