Objective Assessment of Machine Translation Technologies
Posted by John Tinsley
There’s been some good discussion online recently, from people who know what they’re talking about, on comparing machine translation (MT) systems. We’ve had:
- Lilt’s evaluation: http://labs.lilt.com/2017/01/10/mt-quality-evaluation/
- Kirti Vashee’s response: http://kv-emptypages.blogspot.ie/2017/01/the-trouble-with-competitive-mt-output.html
- Common Sense Advisory’s thoughts: http://www.commonsenseadvisory.com/Default.aspx?Contenttype=ArticleDetAD&tabID=63&Aid=37887&moduleId=390
I’d been thinking about this a bit and wanted to add my two cents.
The initiative by Lilt, the post by CSA, and the response from Kirti all serve to shine further light on a challenge we have in the industry that, despite the best efforts of the best minds, is very difficult to overcome. Similar efforts were proposed in the past at a number of TAUS events, and benchmarking continues to be a goal of the DQF (though not just of MT).
The challenge is making an apples-to-apples comparison. MT systems put forward for such comparative evaluations are generally trying to cover a very broad range of content (which is what the likes of Google and Microsoft excel at). While most MT providers have such systems, they rarely represent their best offering or full technical capability.
For instance, at Iconic, we have generic engines and domain-specific engines for various language combinations and, on any given test set, they may or may not outperform another system. I certainly would not want our technology judged on this basis though! From our perspective, these engines are just foundations upon which we build production-quality engines.
We have a very clear picture internally of how much value we add when we customise engines for a specific client, use case, and/or content type. This is when MT technology in general is most effective. We see very significant improvements over our generic engines. However, the only way these customisations actually get done is through client engagements, and the resulting systems are typically either proprietary or too specific to a particular purpose to be useful for anyone else.
Therefore, the best examples of exceptional technology performance we have are not ones we can put forward in the public domain for the purpose of openness and transparency, however desirable that may be.
I’ve been saying for a while now that providing MT is a mix of cutting-edge technology and the expertise and capability to enhance performance. In an ideal world, we will automate the capability to enhance performance as much as possible (which is what Lilt are doing for the post-editing use case), but the reality is that right now, comparative benchmarking is just evaluating the technology “out of the box” and not the whole package.
This is why you won’t see companies investing in MT technology on the basis of public comparisons just yet.