Issue #66 – Neural Machine Translation Strategies for Low-Resource Languages
This week we are pleased to welcome the newest member to our scientific team, Dr. Chao-Hong Liu. In this, his first post with us, he’ll give his views on two specific MT strategies, namely, pivot MT and zero-shot MT. While we have covered these topics in previous ‘Neural MT Weekly’ blog posts (Issue #54, Issue #40), these are topics that Chao-Hong has recently worked on prior to joining Iconic. Take it away, Chao-Hong!
Author: Dr. Chao-Hong Liu, Machine Translation Scientist @ Iconic
In this post we will briefly review and discuss two main strategies, pivot MT and zero-shot MT, for building neural machine translation (NMT) models without direct parallel data. The main motivation for these methods is to build MT systems for language pairs where direct parallel corpora do not exist or are very small. They are especially useful for low-resource languages, but they can also be applied in other situations, e.g. training MT systems for specific domains.
Pivot Machine Translation
The strategy of pivot MT is to build two cascading systems that pivot through a widely used language, avoiding the data scarcity problem in training. For example, to build an MT system that translates from Swahili to Greek, instead of building one direct system we build two: the first translating from Swahili to English, and the second from English to Greek. We do this because there is far more parallel data for Swahili–English and English–Greek than for Swahili–Greek directly. Each of the two systems can be built with any MT architecture, e.g. statistical MT or NMT. Experiments show that, when no direct parallel data is available and a pivot language (English, in this example) can be found, this approach remains the best strategy to date.
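The cascade above can be sketched in a few lines. This is a minimal illustration, not a real system: the `translate_*` functions are hypothetical stand-ins for trained MT systems (statistical or neural), stubbed here with toy word lookups.

```python
# Sketch of pivot MT as a cascade of two systems (hypothetical API).
# The two toy dictionaries stand in for trained Swahili->English and
# English->Greek MT models.

SW_EN = {"habari": "hello"}    # toy Swahili->English "model"
EN_EL = {"hello": "γεια σου"}  # toy English->Greek "model"

def translate_sw_en(text: str) -> str:
    """First stage: Swahili -> English (the pivot language)."""
    return " ".join(SW_EN.get(w, w) for w in text.split())

def translate_en_el(text: str) -> str:
    """Second stage: English (pivot) -> Greek."""
    return " ".join(EN_EL.get(w, w) for w in text.split())

def pivot_translate(text: str) -> str:
    """Swahili -> Greek by pivoting through English."""
    return translate_en_el(translate_sw_en(text))

print(pivot_translate("habari"))  # -> γεια σου
```

Note that errors made by the first system are passed on to the second, which is the main weakness of cascading.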
Zero-shot Machine Translation
The zero-shot approach is to build a single NMT model using the same parallel data as in pivot MT. In my opinion, this might be the most important idea since the invention of NMT itself. Using the same example, we train one neural model on both the Swahili–English and English–Greek parallel data, with a small amount of Swahili–Greek development data to guide training. The resulting single model, in a zero-shot setup, can then translate directly from Swahili to Greek, even though no direct Swahili–Greek parallel corpus was used in training.
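One common way to set up such a multilingual model, used in Google's multilingual NMT work, is to prepend a target-language token to every source sentence, so that one model learns all directions from the mixed data. The sketch below shows only this data-preparation step; the sentence pairs are toy placeholders.

```python
# Sketch of preparing mixed training data for a single multilingual
# NMT model: each source sentence gets a token naming the desired
# target language. At inference time, prefixing a Swahili sentence
# with <2el> requests Swahili->Greek, a direction never seen in
# training (the zero-shot direction).

sw_en = [("habari dunia", "hello world")]    # Swahili-English pairs
en_el = [("hello world", "γεια σου κόσμε")]  # English-Greek pairs

def tag(pairs, target_lang):
    """Prepend a target-language token to every source sentence."""
    return [(f"<2{target_lang}> {src}", tgt) for src, tgt in pairs]

# One combined training set for one model covering both directions.
train = tag(sw_en, "en") + tag(en_el, "el")

for src, tgt in train:
    print(src, "=>", tgt)
```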
Although the idea is great, we have not yet seen strong evidence that the zero-shot approach really works. The only positive results so far come from models where the source and target languages are linguistically close, e.g. Portuguese and Spanish. With the same training strategy on other language pairs, e.g. those in the UN parallel corpora, experiments did not show zero-shot working. We also used forward and back-translation to prepare a direct parallel corpus (Swahili–Greek in our example) from the data we do have (e.g. Swahili–English and English–Greek). This achieved results comparable to pivot MT. However, in this approach a direct Swahili–Greek parallel corpus is prepared for training the final single NMT model, so it is not clear whether the zero-shot idea itself is working.
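The forward/back-translation step above can be sketched as follows. This is one plausible reading of the setup, under the assumption that the shared English side of each corpus is machine-translated into the missing language; the `translate_en_*` functions are hypothetical stand-ins for trained systems, stubbed with toy lookups.

```python
# Sketch of building a synthetic direct Swahili-Greek corpus from the
# two available corpora, by translating the English side of each into
# the missing language. The translate_en_* functions are hypothetical
# stand-ins for trained English->Greek and English->Swahili systems.

def translate_en_el(text: str) -> str:
    return {"hello": "γεια σου"}.get(text, text)  # stub MT output

def translate_en_sw(text: str) -> str:
    return {"hello": "habari"}.get(text, text)    # stub MT output

sw_en = [("habari", "hello")]    # real Swahili-English data
en_el = [("hello", "γεια σου")]  # real English-Greek data

# Forward-translate English -> Greek to pair with the real Swahili side.
synthetic = [(sw, translate_en_el(en)) for sw, en in sw_en]
# Back-translate English -> Swahili to pair with the real Greek side.
synthetic += [(translate_en_sw(en), el) for en, el in en_el]

# 'synthetic' is now a direct (but machine-generated) Swahili-Greek
# corpus for training the final single NMT model.
print(synthetic)
```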
In this post, we briefly discussed two main strategies for building MT systems without direct parallel data. Pivot MT remains a good and reliable strategy for developing systems for many low-resource languages, provided we have a pivot language to bridge the source and target languages. There have been some developments in zero-shot MT: when forward and back-translation are incorporated before training the resulting single NMT model, its performance is comparable to pivot MT. However, it is then no longer purely zero-shot in terms of model training. Personally, I think this is the frontier of AI research: that models could learn something beyond the data we provide. The breakthrough, in my opinion, may come from improvements in NMT modeling itself, rather than from adding more language pairs to the current model and hoping zero-shot MT works once we have added enough.