Issue #78 – Balancing Training data for Multilingual Neural MT

Author: Raj Patel, Machine Translation Scientist @ Iconic

Multilingual Neural MT (MNMT) can translate to and from multiple languages, but its training sets are typically imbalanced: some languages have far more training data than others. In general, we up-sample the low-resource languages to balance the representation. However, the degree of up-sampling has a large effect on the overall performance of the model. In this post, we discuss a new method proposed by Wang et al. (2020) that automatically learns how to weight training data through a data scorer, which is optimised to maximise performance on all test languages.

Differentiable Data Selection (DDS)

Differentiable Data Selection (DDS) is a general machine learning method for optimising the weighting of different training examples to improve a predetermined objective. In the paper, this objective is the average loss across the different languages. The authors directly optimise the weights of the training data from each language to minimise this loss on a multilingual development set. Specifically, DDS uses a technique called bilevel optimisation to learn a data scorer P(x,y;ψ), parameterised by ψ, where (x,y) is a sentence pair from the training data, such that training on data sampled from the scorer optimises model performance on the development set.
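To make the term bilevel optimisation concrete, the setup can be written roughly as follows. This is a sketch under our own notation rather than a quotation from the paper: θ (the NMT model parameters), ℓ (the per-example training loss), J (the development objective) and D_dev (the multilingual development set) are symbols assumed here for illustration; only P(x,y;ψ) and ψ come from the description above.

```latex
\begin{aligned}
\psi^{*} &= \operatorname*{arg\,min}_{\psi} \; J\big(\theta^{*}(\psi),\, D_{\text{dev}}\big), \\
\theta^{*}(\psi) &= \operatorname*{arg\,min}_{\theta} \; \mathbb{E}_{(x,y)\sim P(x,y;\psi)}\big[\ell(x,y;\theta)\big].
\end{aligned}
```

The inner problem trains the model on data drawn from the scorer, and the outer problem adjusts the scorer so that the resulting model does well on the development set.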

Experimental Evaluation

To test the effectiveness of the proposed method, the authors use the 58-languages-to-English parallel data from Qi et al. (2018). They train a multilingual NMT model on each of two sets of language pairs with different levels of language diversity:

Related: 4 Low Resource Languages (LRLs) (Azerbaijani:aze, Belarusian:bel, Galician:glg, Slovak:slk) and a related High Resource Language (HRL) for each LRL (Turkish:tur, Russian:rus, Portuguese:por, Czech:ces)

Diverse: 8 languages with varying amounts of data, picked without consideration for relatedness (Bosnian:bos, Marathi:mar, Hindi:hin, Macedonian:mkd, Greek:ell, Bulgarian:bul, French:fra, Korean:kor)

For each set of languages, they test two varieties of translation: 1) many-to-one (M2O): translating 8 languages to English; 2) one-to-many (O2M): translating English into 8 different languages.

Baselines: They compare the proposed method with three standard heuristic-based methods:

      1. Uniform (τ=∞): datasets are sampled uniformly, so that LRLs are over-sampled to match the size of the HRLs; 
      2. Temperature: scales the proportional distribution by τ=5 to slightly over-sample the LRLs; 
      3. Proportional (τ=1): datasets are sampled proportional to their size, so that there is no over-sampling of the LRLs.
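The three baselines above differ only in the value of τ, which a few lines of Python can illustrate. This is a minimal sketch assuming the commonly used formulation in which the probability of sampling language i is proportional to (n_i / Σn)^(1/τ), where n_i is the corpus size; the function name and the corpus sizes below are hypothetical and are not taken from the paper.

```python
def sampling_probabilities(sizes, tau):
    """Per-language sampling probabilities for temperature tau.

    sizes: dict mapping language code -> number of sentence pairs.
    tau=1 gives proportional sampling; tau=float("inf") gives uniform.
    """
    if tau == float("inf"):
        # Limit case: every language is sampled equally often.
        return {lang: 1.0 / len(sizes) for lang in sizes}
    total = sum(sizes.values())
    # Flatten the proportional distribution by raising it to 1/tau.
    scaled = {lang: (n / total) ** (1.0 / tau) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


if __name__ == "__main__":
    # Hypothetical corpus sizes, for illustration only.
    sizes = {"aze": 6_000, "tur": 180_000}
    for tau in (1, 5, float("inf")):
        print(tau, sampling_probabilities(sizes, tau))
```

As τ grows, the share of the smaller corpus rises from its proportional value towards an equal split. DDS replaces this single hand-tuned value with weights that are learned during training.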

The proposed method consistently delivers better overall performance than the best baseline, by as much as 1.61 average BLEU points. From these results, we can conclude that the proposed method provides a stable strategy for training multilingual systems over a variety of settings, and that it outperforms the baseline on more languages.

In summary

The proposed method not only outperforms the existing temperature-based sampling methods, but also provides a flexible framework for using different multilingual objective functions as required. For example, with default multilingual objectives we can also optimise the model for the development risks of all the languages. As the proposed method doesn’t add much overhead to model training, it can be an effective way to handle data imbalance in MNMT training. This is especially helpful for use cases like e-discovery, where we prefer to build as robust a model as possible using all the available corpora for a language pair.