Issue #133 – Evaluating Gender Bias in MT

Author: Akshai Ramesh, Machine Translation Scientist @ Iconic

Introduction

We often tend to associate gender with certain roles based on the beholder’s interpretation. There are plenty of examples of this – “Mother Earth”, doctor (man), cricketer (man), nurse (woman), cook (woman), etc. MT systems are trained on large amounts of parallel corpora that encode this social bias. If that is the case, then to what extent is this misconception passed on to MT systems? Join us as we dive into one of the interesting research areas of MT: gender bias.

In issue #23 of the blog series, we went into the details of what causes gender bias, how it affects translation quality, and a few techniques to reduce its impact. You may like to check it out if you haven’t already. In today’s blog post, we cover the work of Stanovsky et al. (2019), who attempt to estimate the extent to which gender bias exists in machine translation.

Some languages, like Spanish and Italian, encode grammatical gender and have a distinct word for each gender – for example, “doctor” (male) and “doctora” (female) in Spanish. But languages like English and Turkish don’t encode grammatical gender: in English, the word “doctor” is used for both male and female doctors. This mismatch in gender mechanisms prevents one-to-one translation. The “Machine Bias” section of issue #23 covers many such examples showing that MT outputs are gender-biased.
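To make the mismatch concrete, here is a toy sketch of why translating from English into a gendered language requires information the source word alone does not carry. The lookup table and function below are purely illustrative, not part of any real system:

```python
# Toy illustration: English "doctor"/"nurse" carry no grammatical gender,
# but their Spanish translations must commit to one gendered form.
ES_TRANSLATIONS = {
    ("doctor", "male"): "doctor",
    ("doctor", "female"): "doctora",
    ("nurse", "male"): "enfermero",
    ("nurse", "female"): "enfermera",
}

def translate_noun(noun: str, referent_gender: str) -> str:
    """Pick the Spanish form matching the referent's gender - information
    that must come from context, not from the English noun itself."""
    return ES_TRANSLATIONS[(noun, referent_gender)]

print(translate_noun("doctor", "female"))  # doctora
```

When the context is ambiguous or ignored, an MT system has to guess the referent’s gender, and that guess is where stereotypes can creep in.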

In their work, Stanovsky et al. (2019) try to answer the following research questions:

  • Can we quantitatively evaluate gender translation in MT?
  • How much does MT rely on gender stereotypes vs. meaningful context?
  • Can we reduce gender bias by rephrasing source texts?

Challenge Set for Gender Bias in MT

The authors present a challenge set, WinoMT, to evaluate MT systems on a specialized test set created to assess a specific linguistic property. WinoMT consists of 3,888 English segments that are balanced with respect to both gender (an equal number of male and female entities) and stereotype (an equal number of stereotypical and non-stereotypical role assignments). The corpus is a concatenation of the Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018) coreference test sets and includes gold-standard gender annotations.
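A WinoMT-style record pairs a sentence with a gold gender and a stereotype label for one entity. The field names below are illustrative, not the dataset’s actual schema, but they show how the two balance properties can be checked:

```python
from dataclasses import dataclass
from collections import Counter

# Hypothetical record layout for one WinoMT-style segment.
@dataclass
class Segment:
    sentence: str        # English source sentence
    entity: str          # profession whose gender is annotated
    gold_gender: str     # "male" or "female"
    stereotypical: bool  # does the gold gender match the stereotype?

def check_balance(segments):
    """Count segments by gold gender and by stereotype assignment."""
    by_gender = Counter(s.gold_gender for s in segments)
    by_stereo = Counter(s.stereotypical for s in segments)
    return by_gender, by_stereo

corpus = [
    Segment("The doctor asked the nurse to help her.", "doctor", "female", False),
    Segment("The doctor asked the nurse to help him.", "doctor", "male", True),
]
print(check_balance(corpus))
```

In the real corpus, both counters come out equal across categories, which is what lets ∆G and ∆S isolate bias rather than data imbalance.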

Automatic Gender Bias Evaluation Method

The authors propose a framework for evaluating gender bias: it translates the input coreference test set (WinoMT) into the target language and outputs an accuracy score for gender translation.

The steps involved are as follows:

  • Translate the WinoMT English segments into the corresponding target language.
  • Align the source segments with their translations using a word alignment model. In this work, the authors make use of FastAlign (Dyer et al., 2013).
  • Extract the gender of the translated entity using target-language morphological analyzers or simple heuristics.
  • Compute the accuracy score by comparing the extracted target-side gender against the gold-standard annotations provided on the source side.
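The steps above can be sketched as a single loop. The `translate`, `align`, and `extract_gender` callables below stand in for a real MT system, FastAlign, and a target-language morphological analyzer; this is a simplified sketch, not the authors’ implementation:

```python
def gender_accuracy(segments, translate, align, extract_gender):
    """Fraction of segments whose translated entity gender matches gold.

    segments: iterable of (source_sentence, entity_index, gold_gender).
    align(src, tgt) returns a source-index -> target-index mapping.
    """
    correct = 0
    total = 0
    for src_sentence, entity_idx, gold_gender in segments:
        total += 1
        tgt_sentence = translate(src_sentence)             # step 1: translate
        alignment = align(src_sentence, tgt_sentence)      # step 2: word-align
        tgt_idx = alignment.get(entity_idx)
        if tgt_idx is None:
            continue                                       # unaligned entity counts as wrong
        predicted = extract_gender(tgt_sentence, tgt_idx)  # step 3: morphology/heuristics
        correct += int(predicted == gold_gender)           # step 4: compare to gold
    return correct / total
```

Plugging in a commercial API for `translate` and a morphological tagger for `extract_gender` yields the per-system accuracy the paper reports.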

To validate the proposed approach, the authors carry out a human validation experiment in which 100 test samples from each translation system are annotated by two native speakers of the target language. The average agreement between the sentence-level human annotations and the output of the proposed automatic method is 87%, with an inter-annotator agreement of 90%.
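Both figures are plain percent agreement between two sequences of labels. A minimal way to compute such a score (the paper does not publish this code; this is just the standard calculation):

```python
def percent_agreement(labels_a, labels_b):
    """Sentence-level agreement: share of items given the same label."""
    assert len(labels_a) == len(labels_b), "label lists must be parallel"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)
```

Comparing human labels against the automatic method’s labels gives the 87% figure; comparing the two humans against each other gives the 90% ceiling.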

Experiments and Results

Experimental Setup:

The proposed approach is evaluated on 8 different languages – English (En) -> { Spanish (Es), French (Fr), Italian (It), Russian (Ru), Ukrainian (Uk), Hebrew (He), Arabic (Ar), German (De) } belonging to 4 different language families using six widely used MT models representing both commercial and academic research – Google Translate, Microsoft Translator, AmazonTranslate, SYSTRAN, the model of Ott et al. (2018), and the model of Edunov et al. (2018). The WinoMT English segments are translated into the corresponding target language using the above-listed models.

Evaluation Metric:

The evaluation results are based on Accuracy, ∆G and ∆S.

  • Accuracy – the percentage of instances in which the gender of the translated entity matches the gender of the source-side entity.
  • ∆G – the difference in performance between male and female translations.
  • ∆S – the difference in performance between pro-stereotypical and non-stereotypical role assignments.
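These three numbers can be derived from per-example outcomes. The sketch below uses per-group accuracy as the performance measure – a simplification of the paper’s setup, where the record format and helper names are my own:

```python
def bias_metrics(results):
    """Compute Accuracy, dG, dS from per-example outcomes.

    results: list of (gold_gender, stereotypical, correct) tuples,
    where correct is 1 if the translated gender matched gold, else 0.
    """
    def acc(subset):
        subset = list(subset)
        return 100.0 * sum(c for *_, c in subset) / len(subset) if subset else 0.0

    accuracy = acc(results)
    delta_g = (acc(r for r in results if r[0] == "male")
               - acc(r for r in results if r[0] == "female"))
    delta_s = (acc(r for r in results if r[1])
               - acc(r for r in results if not r[1]))
    return accuracy, delta_g, delta_s
```

A positive ∆G means the system handles male entities better; a positive ∆S means it leans on stereotypes rather than context.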

Findings:

  • Based on the evaluation results, all the tested MT systems perform quite poorly on the chosen metrics, which is indicative of the presence of gender bias.
  • From the ∆G results, it can be seen that all the systems perform significantly better on male roles, which may be a result of male forms occurring more often in the training corpora. An exception to this rule is Microsoft Translator on German, which the authors suggest may be due to the similarity between German and English.
  • Based on the ∆S results, it can be concluded that MT struggles with non-stereotypical roles across languages and systems.
  • An attempt to reduce the bias by prepending adjectives (“handsome”, “pretty”) to the gendered entities (“nurse”, “doctor”) corrects some of the professional bias by mixing signals, which serves as a further indication of gender bias in MT.

In summary

As a first step towards developing more gender-balanced MT models, Stanovsky et al. (2019) present a challenge set to assess the extent of gender bias in MT. They also propose a framework for automatic evaluation of gender bias, backed by human validation results. Based on the evaluation of 6 MT systems – including both popular commercial and state-of-the-art research models – across 8 target languages, the authors conclude that MT models are prone to gender stereotypes.