Issue #87 – YiSi – A Unified Semantic MT Quality Evaluation and Estimation Metric
Author: Dr. Karin Sim, Machine Translation Scientist @ Iconic
Automatic evaluation is an issue that has long troubled machine translation (MT): how do we evaluate how good the MT output is? Traditionally, BLEU has been the “go to”, as it is simple to use across language pairs. However, it is overly simplistic, evaluating string matches against a single reference translation. More sophisticated metrics have come on the scene, including chrF (Popović, 2015), TER (Snover et al., 2006), and METEOR (Banerjee and Lavie, 2005). None of them attempt to evaluate the extent to which the meaning of the source is transferred to the target text. The first real attempt to incorporate a semantic element into automatic evaluation was MEANT (Lo and Wu, 2011). However, its requirement for additional linguistic resources limited its widespread use. YiSi (Lo, 2019) builds on that work, offering a range of flavours based on the level of resources available for a given language pair. Interestingly, this includes a quality estimation component too, which allows us to measure the quality of the output without the need for a reference (a version previously translated by a human). In today’s post we examine how this metric, YiSi, measures the semantic quality of the output, and we look at its results.
At a most basic level, a translation should transfer the meaning of the source text to the target text. A good MT quality metric should be able to measure the extent to which it does that. YiSi does this using a shallow semantic parser, which derives semantic frames and role fillers from the source and target texts. In other words, it extracts entities and their roles in a sentence from the source and target text, and compares them in a range of ways.
- Derives the logical form with a shallow semantic parser
- Aligns semantic frames extracted from source and target by comparing the lexical similarity of the predicates
- Aligns the arguments by comparing the lexical similarity of the arguments (entities) in source and target texts
- Computes the F-score of these aligned roles and entities, where:
  - w(e) = the lexical weight of e
  - s(e, f) = the lexical similarity of e and f
  The definitions of lexical similarity and lexical weight depend on the version of YiSi used (see below).
- Computes phrasal semantic similarity precision and recall as the weighted average of the similarities of the aligned entities
- Computes structural semantic similarity precision and recall as a weighted sum of the phrasal precision and recall of the aligned role fillers (where the role types match) and the weighted lexical similarities of the aligned entities. This is then normalised by a token coverage factor reflecting the importance of that frame in the sentence. For full details see Lo (2019).
Weighting varies depending on usage (i.e. evaluation or system optimization), and the author simplifies semantic role labels to 8 types for robustness.
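The weighted precision/recall/F-score computation described above can be sketched as follows. This is an illustration only, not the reference implementation: `sim` and `weight` are hypothetical stand-ins for the lexical similarity s(e, f) and the lexical weight w(e), which YiSi defines differently per variant.

```python
def weighted_f_score(ref_tokens, hyp_tokens, sim, weight):
    """Sketch of a YiSi-style weighted unigram F-score.

    sim(a, b)  -> lexical similarity s(e, f) in [0, 1]
    weight(t)  -> lexical weight w(e) of token t
    """
    def directed(src, tgt):
        # Each token on the `tgt` side is matched greedily to its most
        # similar token on the `src` side, weighted by w(token).
        num = sum(weight(t) * max((sim(t, s) for s in src), default=0.0)
                  for t in tgt)
        den = sum(weight(t) for t in tgt)
        return num / den if den else 0.0

    precision = directed(ref_tokens, hyp_tokens)  # hypothesis supported by reference
    recall = directed(hyp_tokens, ref_tokens)     # reference covered by hypothesis
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

With exact-match similarity and uniform weights this reduces to a plain unigram F-score; YiSi's variants differ precisely in how `sim` and `weight` are defined.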
YiSi-0:
- Requires no additional resources and can therefore be deployed for any language pair
- Measures lexical similarity via longest common character substring
- Compares MT output and human reference
- Since both are in the same language, the lexical weight is the inverse document frequency of the words in the reference document
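The longest-common-character-substring similarity used by YiSi-0 can be sketched with standard dynamic programming. Normalising by the longer string's length is an assumption made here for illustration; Lo (2019) gives the exact formulation.

```python
def lcs_substring_len(a, b):
    """Length of the longest common (contiguous) character substring."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ch == cb:
                # Extend the common substring ending at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def char_similarity(a, b):
    # Assumed normalisation: share of the longer string covered by the
    # longest common substring.
    return lcs_substring_len(a, b) / max(len(a), len(b)) if a and b else 0.0
```

For example, "walked" and "walking" share the substring "walk", so their similarity here is 4/7.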
YiSi-1:
- Requires an embedding model
- Optionally also a semantic role labeler in output language
- Measures the similarity between the MT and a reference by aggregating the lexical semantic similarity of the embeddings
- Where available it can incorporate shallow semantic structures to evaluate structural semantic similarity
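For the embedding-based variant, a natural choice for the lexical similarity s(e, f) is the cosine of the two word vectors; a minimal sketch, assuming embeddings are already available as NumPy arrays:

```python
import numpy as np

def embedding_sim(u, v):
    """s(e, f) as the cosine similarity of two embedding vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Guard against zero vectors, which have no defined direction.
    return float(u @ v) / denom if denom else 0.0
```

These per-word similarities are then aggregated with the lexical weights into the precision/recall computation described earlier.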
YiSi-2:
- Requires a crosslingual embedding model
- Optionally requires a semantic role labeler in both input and output language
- Evaluates the crosslingual lexical semantic similarity of the source text with the MT output using bilingual embeddings
- Can also estimate the quality of the MT output without any reference, attempting to directly evaluate whether the MT output reflects the semantics of the source text
- Can also be used for parallel corpus filtering
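Because this variant is reference-less, the similarity must bridge languages: source and MT words are compared in a shared crosslingual embedding space. A toy sketch, where the tiny table of vectors is entirely assumed data standing in for a real bilingual embedding model:

```python
import numpy as np

# Hypothetical bilingual embedding table (assumed values, illustration only).
bilingual_emb = {
    "chat": np.array([1.0, 0.1]),  # French source word ("cat")
    "cat":  np.array([0.9, 0.2]),  # English MT word
    "dog":  np.array([0.1, 1.0]),  # unrelated English word
}

def xling_sim(src_word, mt_word):
    """Cosine similarity of a source word and an MT word in the shared space."""
    u, v = bilingual_emb[src_word], bilingual_emb[mt_word]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0
```

A correct translation ("chat" → "cat") scores higher than a semantically wrong one ("chat" → "dog"), which is exactly the signal a reference-less metric needs.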
Results are reported on the WMT2018 metrics task evaluation set, where human direct assessments involved 1) scoring adequacy on an absolute scale, in addition to the usual 2) ranking of translations. Scores are for Pearson’s correlation with the former (aggregated system level), and Kendall’s correlation with the latter (ranked at segment level).
YiSi-0 is comparable to chrF and BLEU on both 1) and 2), and significantly outperforms BLEU on the Turkish-English/English-Turkish language pairs. YiSi-1 outperforms them all on 2) (with one exception). YiSi-2 performs very well, particularly given that it is reference-less, whereas the other metrics compare against a reference. Overall, YiSi performs well, particularly given that it measures the adequacy of the translation rather than its fluency.
We have reviewed the range of options that YiSi offers for evaluating the semantic quality of MT output, and conclude that it is a useful and performant metric to integrate into an MT workflow. The crosslingual variant YiSi-2 also offers quality estimation: in addition to evaluating against a reference, it can estimate the quality of the MT output without any reference, directly assessing whether the output reflects the semantics of the source text. This is particularly useful for estimating the quality of MT output in real-world scenarios, where deceptively fluent (but potentially semantically wrong) NMT output can pass unnoticed and no reference translation is available.