Issue #62 – Domain Differential Adaptation for Neural MT
Author: Raj Patel, Machine Translation Scientist @ Iconic
Neural MT models are data hungry and domain sensitive, and it is nearly impossible to obtain a good amount (>1M segments) of training data for every domain we are interested in. One common strategy is to align the statistics of the source and target domains, but the drawback of this approach is that the statistics of different domains are inherently divergent, and smoothing over them does not always ensure optimal performance. In this post we’ll discuss Domain Differential Adaptation (DDA), proposed by Dou et al. (2019), where instead of smoothing over the differences we embrace them.
Domain Differential Adaptation
In the DDA method, we capture the domain difference with two language models (LMs), trained on in-domain (LM-in) and out-of-domain (LM-out) monolingual data respectively. We then adapt the NMT model trained on out-of-domain data (NMT-out) so that it approximates, as closely as possible, an NMT model trained on in-domain parallel data (NMT-in), without using any in-domain parallel data. In the paper, the authors propose two approaches under the overall umbrella of the DDA framework:
- Shallow Adaptation: Given LM-in, LM-out, and NMT-out, in shallow adaptation (DDA-Shallow) we combine the three models at the decoding step. Specifically, at each decoding time step t, the probability of the next generated word is obtained by interpolating the log-probabilities from LM-in and LM-out into those of NMT-out. Intuitively, we encourage the model to generate more words typical of the target domain and reduce the probability of generating words typical of the source domain.
- Deep Adaptation: This method enables the model to learn to predict using the hidden states of LM-in, LM-out, and NMT-out. The parameters of the LMs are frozen, and we train only the fusion strategy and the NMT parameters.
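To make the shallow variant more concrete, here is a minimal sketch of a single decoding step. It assumes the interpolation adds the weighted difference between the LM-in and LM-out log-probabilities to the NMT-out log-probabilities and then renormalises; the weight name `beta`, the function names, and the toy three-word vocabulary are our own illustrative choices, not taken from the paper.

```python
import math


def log_softmax(scores):
    """Normalise raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dda_shallow_step(nmt_out_logp, lm_in_logp, lm_out_logp, beta=0.5):
    """One decoding step of a DDA-Shallow-style combination.

    Assumption: the combined score for each word is
        log p_NMT-out + beta * (log p_LM-in - log p_LM-out),
    where beta is a hypothetical interpolation weight tuned on held-out data.
    """
    combined = [
        n + beta * (i - o)
        for n, i, o in zip(nmt_out_logp, lm_in_logp, lm_out_logp)
    ]
    return log_softmax(combined)


# Toy example: "court" is typical of the target (law) domain,
# "click" is typical of the source (IT) domain.
vocab = ["court", "click", "the"]
nmt_out = log_softmax([1.0, 1.0, 2.0])
lm_in = log_softmax([3.0, 0.0, 2.0])
lm_out = log_softmax([0.0, 3.0, 2.0])

adapted = dda_shallow_step(nmt_out, lm_in, lm_out, beta=0.5)
best = vocab[max(range(len(vocab)), key=lambda k: adapted[k])]
```

In this toy setup the difference term pushes "court" up and "click" down, which is the intuition described above. Deep adaptation is not sketched here, since it depends on the specific fusion layer and training setup.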
One potential problem of training with only out-of-domain parallel corpora is that the proposed method cannot learn a reasonable strategy to predict in-domain words, since it would never come across them during training or fine-tuning. In order to solve this problem, the authors tried copying some in-domain monolingual data from the target side to source side, as described by Currey et al. (2017), to form pseudo in-domain parallel corpora. The pseudo in-domain data is concatenated with the original dataset when training the models.
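The copying step can be sketched in a few lines. This is only an illustration of the idea, assuming the corpus is held in memory as (source, target) string pairs; the function name and the example sentences are hypothetical.

```python
def build_training_set(out_of_domain_pairs, in_domain_target_sentences):
    """Concatenate real parallel data with copied pseudo-parallel pairs.

    Each in-domain target-side sentence is copied to the source side,
    giving a (target, target) pair, as in the copied-monolingual-data idea.
    """
    pseudo_pairs = [(sentence, sentence) for sentence in in_domain_target_sentences]
    return list(out_of_domain_pairs) + pseudo_pairs


# Hypothetical toy data: one out-of-domain de-en pair, one in-domain English sentence.
out_of_domain_pairs = [("Das ist ein Test.", "This is a test.")]
in_domain_target_sentences = ["The court dismissed the appeal."]

training_set = build_training_set(out_of_domain_pairs, in_domain_target_sentences)
```

In practice the pseudo pairs would usually be written out alongside the original corpus files and shuffled before training, but those details depend on the toolkit being used.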
The performance of the proposed methods is evaluated on German-English (de-en) and Czech-English (cs-en) models across the law, medical and IT domains. In the experiments, the authors also compare their methods in three different settings:
- Shallow fusion and deep fusion (Gulcehre et al., 2015)
- Copied monolingual data model (Currey et al., 2017)
- Back-translation (Sennrich et al., 2016)
Across the language pairs and domains, DDA outperforms the strong baseline models by a significant margin of +2 BLEU. In the settings where additional copied in-domain data (setting-2) or back-translated data (setting-3) is added to the training set, the proposed method improves further, consistently outperforming the methods based on the fusion strategy.
It is important to mention here that the proposed methods only require in-domain monolingual data, as opposed to in-domain parallel data. From the results, it is evident that DDA can adapt models to a larger extent and with higher accuracy than the alternative adaptation strategies.