MT Success Series #1 Language


We’re kicking off the series today with probably the most natural starting point – language. To modify a phrase from the US Declaration of Independence, “not all languages are created equal”. An important consequence of this is that some language pairs are more suitable for machine translation (MT) than others. We’ve written about this in detail in our Language Challenges Series, but now we want to take a step back and look at how this affects the feasibility of practical MT applications.

Broadly speaking, the closer two languages are in terms of word order and grammatical structure, the more they lend themselves to MT. We separate languages into four categories based on their suitability for MT (assuming, for simplicity, that English is either the source or the target language). You can see some examples* of the languages in each category below, and we explain what the implications are for potential MT projects.

*this is a non-exhaustive list

Iconic 8 Steps to MT Success Language Categories

Category 1

These languages lend themselves well to MT and projects tend to be successful more often than not. Category 1 languages are easier from a technical perspective, with only moderate syntactic differences from English. They are popular and well studied in the research community, and they have ample resources available to support development, such as corpora, syntactic parsers, and other linguistic tools.

Category 2

These languages pose much more of a challenge from a linguistic perspective and require additional specialised processing. However, because they are languages in traditionally large markets, they have received a lot of investment in research and development, and the challenges they present have been well studied (e.g. German verb positioning). There are also a lot of resources available. As a result, projects involving these languages can take more time to reach an acceptable quality level, but still have a strong chance of being successful.

Category 3

Languages in this category tend to be similar in difficulty to those languages in Category 2. However, as there is typically less demand for these languages (due to market forces, number of speakers, “strategic” importance, etc.) they tend to be less extensively studied and thus represent more of a challenge for MT. It can also be the case that they are simply extremely challenging (e.g. Korean). In order for projects in these languages to be successful, we are more reliant on a number of the other 7 factors in this series being in place.

Category 4

Unfortunately, these languages simply do not lend themselves to MT under almost all circumstances. They may be very challenging from a linguistic perspective due to extreme levels of morphology (e.g. Hungarian, Finnish), have no known origin from which we can draw experience (e.g. Basque), or be so rare and obscure that they might not even have written forms (e.g. indigenous Australian languages). There may be the occasional case in which MT has been used successfully for languages in this category, but these would be the exception rather than the rule.

There is an important implication of this variation in MT performance across different languages. When a project spans multiple languages, you need to consider which of those languages you might deploy MT for, rather than treating MT as a blanket solution for the entire project.
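As a rough illustration of that per-language decision, the sketch below splits a project's target languages into those where MT looks like a reasonable starting point and those that need closer scrutiny. The lookup contains only the languages this post names explicitly; the function name, the `max_category` cut-off, and the "unknown" bucket are all assumptions made for the example, not part of any Iconic tooling.

```python
# Hypothetical lookup: only the languages named in this post, with the
# category the post assigns to them (English assumed on the other side).
CATEGORY = {
    "German": 2,
    "Korean": 3,
    "Hungarian": 4,
    "Finnish": 4,
    "Basque": 4,
}


def split_by_mt_suitability(project_languages, max_category=2):
    """Split languages into (deploy, review, unknown) lists.

    max_category is an assumed cut-off: categories at or below it are
    treated as candidates for MT, higher ones as needing a closer look.
    """
    deploy, review, unknown = [], [], []
    for language in project_languages:
        category = CATEGORY.get(language)
        if category is None:
            unknown.append(language)   # not covered by the lookup
        elif category <= max_category:
            deploy.append(language)    # strong chance of success
        else:
            review.append(language)    # depends on the other factors
    return deploy, review, unknown


deploy, review, unknown = split_by_mt_suitability(
    ["German", "Korean", "Finnish", "Swahili"]
)
print("MT candidates:", deploy)
print("Needs review:", review)
print("Not classified:", unknown)
```

In practice the cut-off would not be fixed. As the Category 3 description notes, the other factors in this series can tip a borderline language either way.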

Case Study / Example

A good example of the impact language can have on a project is the European Commission’s (EC) case study. For the past number of years, the EC has been running a project called MT@EC in which they provide internal access to engines for all official EU language pairs: that’s 552 engines in total.
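The figure of 552 follows from counting every ordered source/target combination. A minimal sketch of that arithmetic, assuming the 24 official EU languages (the ISO 639-1 codes below are supplied for the example and are not listed in the original post):

```python
from itertools import permutations

# Assumption: the 24 official EU languages, as ISO 639-1 codes.
EU_LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et",
    "fi", "fr", "ga", "hr", "hu", "it", "lt", "lv",
    "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def directed_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


pairs = directed_pairs(EU_LANGUAGES)
print(len(EU_LANGUAGES))  # 24
print(len(pairs))         # 552, i.e. 24 * 23
```

Each direction counts as its own engine, so French to Spanish and Spanish to French are two separate entries.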

They are constantly tracking performance of the engines and rank how usable the translations are on three levels – Gold (best possible), Silver (good enough), and Bronze (just ok). You can see from the table below that performance varies widely depending on the languages in question. Gold and Silver engines typically include one of the Category 1 languages, while the only Gold engine not including English as the source or target language is French to Spanish.

Iconic 8 Steps to MT Success table of language difficulty

While Language is a key contributor to these results, the difference between Silver and Bronze performance is likely due to other factors that we will address later in the series, such as the amount of available Training Data.

Next Week

Next week we’re going to look at a trickier concept that impacts on MT projects – the volume of words being translated. In the meantime, if you’d like to know more about Language, or where a particular language fits in the scheme, please don’t hesitate to get in touch.

About the 8 Steps to MT Success Series

The 8 Steps to MT Success Series is an in-depth look at the various factors that influence machine translation projects from a feasibility, cost, and development time perspective. Using Iconic’s extensive experience and expertise in assessing the suitability of projects for MT, we’ve been able to distill the process down to 8 key factors: language, content type, volume of words, quality requirements, integration requirements, available training data, translation memory leverage, and buyer experience.

About Iconic Translation Machines

Iconic Translation Machines is a leading machine translation software and solutions provider that specialises in custom solutions tailored with subject matter expertise for specific industry sectors, including legal, life sciences and financial services. Iconic is the MT partner of choice for some of the world’s largest translation companies, information providers, and government and enterprise organisations, helping them to translate more content, more accurately and in less time, resulting in significant cost savings and increased revenue.
