Language Challenge #5 – Portuguese


Quick Facts

  • Portuguese is the 6th most spoken language in the world, with official status in 10 countries across 4 continents (though the majority of its speakers are in Brazil).
  • It is a member of the Romance family of languages and, like the others in that family, is descended from Vulgar Latin.
  • Portuguese is not one of the 6 official languages of the United Nations, despite numerous petitions to have it added.
  • There is significant mutual intelligibility between (written) Portuguese and Spanish. In fact, the name Portunhol is given to the unsystematic mix of the two languages in regions of Spanish-speaking South America that border Brazil.
  • Famous quote originating in Portuguese: Se podes olhar, vê. Se podes ver, repara. (“If you can look, see. If you can see, notice.”) – José Saramago, Nobel Laureate, Literature.

Characteristics

    Portuguese is a Romance language with many similar characteristics to languages we’ve treated previously, particularly Spanish.

    It uses the basic Latin alphabet extended with five diacritics: the cedilla (ç), acute accent (á, é, í, ó, ú), circumflex accent (â, ê, ô), tilde (ã, õ), and grave accent (à).

    It is a subject-verb-object language in terms of word order, with some exceptions that can present a challenge for MT (discussed later).

    Nouns, adjectives, and determiners are inflected for number and gender, but only masculine and feminine – there is no neuter form.

    Verbs are highly inflected, with more than 50 conjugated forms accounting for person, number, tense, mood, aspect, and voice. Ongoing evolution of the language, however, is seeing forms like the future simple tense being replaced by the informal future, as shown below. The informal future is formed using the present tense of the verb ir (“to go”) + infinitive.

  • Future simple: Ele estudará amanhã (He will study tomorrow)
  • Informal future: Ele vai estudar amanhã (He is going to study tomorrow)
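
The informal future pattern above can be sketched in a few lines of Python. This is a minimal illustration rather than a full conjugator: the function name `informal_future` and the pronoun table are assumptions made for this example, and only a handful of subject pronouns are covered.

```python
# Present-tense forms of "ir" (to go), keyed by subject pronoun.
# The table is deliberately small; it is an illustrative assumption,
# not a complete description of Portuguese subjects.
IR_PRESENT = {
    "eu": "vou",
    "tu": "vais",
    "ele": "vai",
    "ela": "vai",
    "você": "vai",
    "nós": "vamos",
    "eles": "vão",
    "elas": "vão",
    "vocês": "vão",
}


def informal_future(subject: str, infinitive: str) -> str:
    """Build the informal future: subject + present of 'ir' + infinitive."""
    key = subject.lower()
    if key not in IR_PRESENT:
        raise ValueError(f"unsupported subject pronoun: {subject!r}")
    return f"{subject} {IR_PRESENT[key]} {infinitive}"


print(informal_future("Ele", "estudar") + " amanhã")  # Ele vai estudar amanhã
```

Because the infinitive is left untouched, this construction needs only the nine forms of one auxiliary verb, whereas the future simple needs a distinct ending on every verb.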

Portuguese and MT

    Portuguese grammar is quite similar to that of Spanish, though there are some important differences. However, these differences do not have a particular impact on the key challenges faced when developing MT for Portuguese. Many of these challenges are the same as for other Romance languages, such as changes in word order between noun-adjective pairs, as illustrated below, and agreement between articles and nouns.

    [Figure: example of noun-adjective word order in English and Portuguese]

    There are some exceptions to standard word order that can pose a challenge for MT, e.g. when using the imperative. Traditionally, Portuguese attaches clitic pronouns to the end of the verb. This gives such statements a verb-pronoun-object order, as shown below, which is the same as the English word order.

    [Figure: example of an imperative with the pronoun attached to the verb]

    However, it is also perfectly acceptable, and arguably more common nowadays, to simply place the pronoun ahead of the verb, as in the example below. The fact that both of these competing forms are acceptable presents a challenge for MT. To translate these forms correctly when going into Portuguese, the training data should ideally be homogeneous and contain only one form or the other. When translating from Portuguese, the training data needs to have examples of both to account for variations in the input.

    [Figure: example of an imperative with the pronoun placed ahead of the verb]
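
One rough way to check whether target-side training data is homogeneous is to flag sentences that contain a hyphenated pronoun attached to the preceding word. The Python sketch below is a heuristic under stated assumptions: the pronoun list, the regular expression, the helper names `has_enclitic` and `keep_proclitic_only`, and the two sample sentences are all illustrative (they are not taken from the original figures), and a production pipeline would more likely rely on a part-of-speech tagger.

```python
import re

# Heuristic: a word followed by a hyphen and an unstressed object pronoun,
# e.g. "Dê-me". Hyphenated compounds such as "bem-te-vi" are false positives,
# so this is only a first-pass filter.
ENCLITIC = re.compile(
    r"\b\w+-(?:me|te|se|lhe|lhes|nos|vos|o|a|os|as)\b",
    re.IGNORECASE,
)


def has_enclitic(sentence: str) -> bool:
    """Return True if the sentence appears to attach a pronoun after a verb."""
    return ENCLITIC.search(sentence) is not None


def keep_proclitic_only(pairs):
    """Drop (source, target) pairs whose Portuguese target uses an enclitic."""
    return [(src, tgt) for src, tgt in pairs if not has_enclitic(tgt)]


# Illustrative sentence pairs (hypothetical, not from a real corpus).
corpus = [
    ("Give me the book", "Dê-me o livro"),
    ("Give me the book", "Me dê o livro"),
]
print(keep_proclitic_only(corpus))
```

For the Portuguese-to-English direction the same check could be used the other way round, to confirm that both forms are represented in the source side of the data.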

    Another particular challenge for Portuguese MT, as with Spanish, is the variation across dialects – specifically European Portuguese and Brazilian Portuguese. There are significant differences across a variety of categories that need to be accounted for by MT systems. Firstly, there are differences in spelling (despite attempts at reform), e.g. atual (Brazilian) vs. actual (European) for the word “current”, or ótimo (Brazilian) vs. óptimo (European) for the word “great”.

    There are also differences in vocabulary itself, e.g. the word “pineapple” can be translated as ananás (European) or abacaxi (Brazilian), while the word “train” can be translated as comboio (European) or trem (Brazilian).

    In terms of grammatical differences, one example is how ongoing actions are described. In Brazilian Portuguese, the gerund is used, while in European Portuguese, the preposition a followed by the infinitive is preferred.

  • English: I am studying
  • Brazilian Portuguese: Estou estudando
  • European Portuguese: Estou a estudar

    To account for these differences, there are two options. The most obvious is to develop distinct MT systems for each dialect. However, this is not always practical: sufficient training data may not be available for each dialect, and it also represents a significant duplication of effort because of the clear overlap across dialects.

    Instead, a more effective approach is to develop generic Portuguese MT systems and then apply changes to the translation output as required for the dialect in question. This process is known as Automatic Post-Editing.
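
As a rough illustration of the idea, the following Python sketch post-edits generic output toward Brazilian Portuguese using only the kinds of differences mentioned in this article. The rule tables and the function name `post_edit_to_brazilian` are assumptions for this example; a real Automatic Post-Editing component would typically be learned from data or use far larger, linguistically informed rule sets.

```python
import re

# Word-level substitutions (European -> Brazilian), based on the examples above.
# Lower-case forms only; this table is illustrative, not exhaustive.
LEXICAL_RULES = {
    "actual": "atual",
    "óptimo": "ótimo",
    "ananás": "abacaxi",
    "comboio": "trem",
}

# Present-tense forms of "estar", used to spot "estar a + infinitive".
ESTAR = r"(estou|estás|está|estamos|estão)"
# Captures the infinitive without its final "r" (estudar -> estuda).
# Heuristic: any word ending in -ar/-er/-ir is treated as a verb, so
# non-verbs in this position (e.g. "está a par") would be rewritten wrongly.
PROGRESSIVE = re.compile(ESTAR + r" a (\w+[aei])r\b", re.IGNORECASE)


def post_edit_to_brazilian(text: str) -> str:
    """Apply illustrative European-to-Brazilian post-editing rules."""
    # Grammar rule: "estou a estudar" -> "estou estudando".
    text = PROGRESSIVE.sub(lambda m: f"{m.group(1)} {m.group(2)}ndo", text)

    # Lexical rules: swap whole words that appear in the table.
    def swap(match):
        return LEXICAL_RULES.get(match.group(0), match.group(0))

    return re.sub(r"\w+", swap, text)


print(post_edit_to_brazilian("Estou a estudar no comboio"))  # Estou estudando no trem
```

Keeping the dialect-specific rules in a separate step like this means a single generic engine can serve both dialects, with only the post-editing stage differing between them.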

    Data Availability

    Training data is not as abundant for Portuguese as it is for some other European languages because it is an official language of fewer international organisations; it is not, for instance, an official language of the United Nations. What data is available can be found from organisations like the European Parliament and from commercial sources.

    Coming next week…

    We’ve now reached the halfway point in the Language Challenges Series. We hope you have been enjoying the articles to date. Please feel free to send us feedback through the usual channels. Next week, we will take a look at our first Semitic language…Arabic!

    About Iconic Translation Machines Ltd.

    Iconic Translation Machines provides intelligent domain-adapted machine translation solutions as a cloud-based service for targeted sectors of the translation industry. Our highly-tuned engines produce best-in-class translation quality, allowing Language Service Providers to increase throughput, productivity, and margins. Our flagship product, IPTranslator, provides high-quality machine translation for the patent and intellectual property sector.
