Language Challenge #4 – Spanish

Español / Castellano

Quick Facts

  • Spanish is the second most spoken language in the world by number of native speakers, and an official language of more than 20 countries in Europe and the Americas.
  • In Spanish, the language can be referred to as either español (“Spanish”) or castellano (“Castilian”). Preference for one term or the other depends on the region.
  • The Real Academia Española (Royal Spanish Academy) was founded in 1713 to govern and standardise the language.
  • There are 355 words in Spanish that contain all 5 vowels, e.g. comunidades (“communities”).
  • There are more than 40 million native speakers in the USA, and Spanish is projected to become the first language of more than 50% of the population by 2050.
  • Famous quote originating in Spanish: El que lee mucho y anda mucho, ve mucho y sabe mucho (“He who reads a lot and walks a lot, sees a lot and knows a lot”) – Cervantes in Don Quijote.
Characteristics

    As a Romance language, Spanish uses the basic Latin alphabet with one additional letter, ñ, extended further with an acute accent on vowels, e.g. á é í ó ú, and the diaeresis ü. Questions and exclamatory statements are preceded by an inverted question mark ¿ and an inverted exclamation mark ¡ respectively.

    It is a subject-verb-object language in terms of word order, though there can be some variation, for example when translating English sentences that use clefting.

    Spanish is a fairly highly inflected language, with verbs conjugated for person, number, tense, mood, aspect, and voice. Notable is the extensive use of the subjunctive, in cases such as the example below, which contributes to Spanish verbs frequently having more than 50 conjugated forms.

  • Present indicative: “You are well” -> Estás bien
  • Present subjunctive: “I hope that you are well” -> Espero que estés bien

    Nouns, adjectives, and determiners are inflected for number and gender, while suffixes are widely used as diminutives and augmentatives, e.g. Juan “John” -> Juanito “little John” (a small boy), and gol “a goal” -> golazo “a great goal”.
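
To give a feel for how verb inflection multiplies the number of forms, here is a minimal Python sketch that generates the present indicative and present subjunctive of a regular -ar verb. It is an illustration only: the function name and the endings table are our own, and it deliberately ignores irregular verbs such as estar, as well as every other tense.

```python
# Present-tense endings for regular -ar verbs, in the order:
# yo, tú, él/ella/usted, nosotros, vosotros, ellos/ellas/ustedes.
AR_ENDINGS = {
    "indicative": ["o", "as", "a", "amos", "áis", "an"],
    "subjunctive": ["e", "es", "e", "emos", "éis", "en"],
}


def conjugate_regular_ar(infinitive, mood):
    """Return the six present-tense forms of a regular -ar verb.

    Hypothetical helper for illustration; irregular verbs are not handled.
    """
    if not infinitive.endswith("ar"):
        raise ValueError("this sketch only handles regular -ar verbs")
    stem = infinitive[:-2]  # "esperar" -> "esper"
    return [stem + ending for ending in AR_ENDINGS[mood]]


if __name__ == "__main__":
    for mood in ("indicative", "subjunctive"):
        print(mood, conjugate_regular_ar("esperar", mood))
```

Two moods of a single tense already give twelve forms for one verb, which is why an MT system needs to see a lot of data, or model morphology explicitly, to pick the right one.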

    Spanish and MT

    Many of the challenges faced when developing machine translation systems for Spanish are similar to those for other Romance languages, like French, which we discussed previously. These include changes in word order between noun-adjective pairs, as illustrated below, and also agreement between articles and nouns.


    A more challenging aspect of word order in Spanish relates to the so-called cleft sentences mentioned above. Take, for example, the sentence “It was John who won the prize”. Here English makes an exception to its usual subject-verb-object word order for emphasis. Spanish is less restricted and can reorder the words relatively freely in such cases, as illustrated below, where the three Spanish variations all represent legitimate translations of the English sentence.


    These variations in the Spanish sentences carry subtle differences for a native speaker, but it is virtually impossible for machine translation to capture such nuances.

    Another particular challenge for Spanish MT is the variation across dialects. The principal issue here relates to differences in vocabulary between Spanish from Spain and from various countries in Latin America. There are also some variations relating to pronouns. For example, where in Spain the 2nd person singular pronoun (you) is usted in the formal and tú in the familiar, in Latin America there is a third variation, vos. The choice of pronoun in each case has an impact on how the verb is conjugated.
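
The effect of pronoun choice on the verb can be sketched in a few lines of Python. The table below covers only the present indicative of regular verbs, and the vos endings assume the Rioplatense pattern (other voseo regions differ). The names are hypothetical and chosen for illustration.

```python
# Present indicative endings for regular verbs, keyed by pronoun and then
# by infinitive ending. The "vos" row assumes Rioplatense-style voseo.
SECOND_PERSON_ENDINGS = {
    "tú": {"ar": "as", "er": "es", "ir": "es"},
    "vos": {"ar": "ás", "er": "és", "ir": "ís"},
    "usted": {"ar": "a", "er": "e", "ir": "e"},
}


def second_person_present(infinitive, pronoun):
    """Return the singular 'you' form of a regular verb for a given pronoun."""
    stem, group = infinitive[:-2], infinitive[-2:]
    if group not in ("ar", "er", "ir"):
        raise ValueError("expected an infinitive ending in -ar, -er or -ir")
    return stem + SECOND_PERSON_ENDINGS[pronoun][group]


if __name__ == "__main__":
    for pronoun in ("tú", "vos", "usted"):
        print(pronoun, second_person_present("hablar", pronoun))
```

The same English input, “you speak”, therefore has at least three valid singular renderings (hablas, hablás, habla), and the right one depends on region and register rather than on anything in the source sentence.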

    The most obvious way to overcome this is to develop distinct MT systems for each dialect. However, this is not always practical because sufficient training data is often unavailable, and it also represents a significant duplication of effort given the clear overlap across dialects.

    Instead, a more effective approach is to develop generic Spanish MT systems and then apply changes to the translation output as required for the dialect in question. This process is known as Automatic Post-Editing.
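
As a rough illustration of the idea, the following Python sketch applies a small vocabulary substitution to generic output. It is not how any particular production system works: the lexicon is a hypothetical, hand-picked sample of Spain vs. Latin America word pairs, “Latin America” is treated as a single target for simplicity, and the pairs were chosen to share grammatical gender so that surrounding articles stay valid. A real post-editing step would also need to handle agreement, pronouns, and verb forms.

```python
import re

# Hypothetical sample lexicon: Peninsular term -> Latin American equivalent.
# Each pair shares grammatical gender, so articles do not need to change.
ES_TO_LATAM = {
    "coche": "carro",
    "móvil": "celular",
    "zumo": "jugo",
    "patata": "papa",
}


def post_edit(text, lexicon):
    """Replace whole-word lexicon matches, preserving initial capitalisation."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lexicon) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        replacement = lexicon[word.lower()]
        # Keep a leading capital if the original word had one.
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(swap, text)


if __name__ == "__main__":
    print(post_edit("Mi móvil está en el coche.", ES_TO_LATAM))
```

Keeping the dialect knowledge in a separate table means the generic engine is trained once, and each target variety only needs its own, much smaller, set of rules.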

    Data Availability

    Like French, Spanish is one of the more widely researched languages in terms of MT, both because it is one of the “easier” ones and because of the availability of training data from multinational institutions such as the European Parliament and the United Nations.

    Anecdotally, and interestingly enough given our activity in the field of patent translation with IPTranslator, Spanish is not one of the official languages of the European Patent Office. If you are so inclined, you can read more about the impact of this here.

    Coming next week…

    We’re taking a short break from the series next week, after which we’re going to round off our treatment of Romance languages when we take a look at… Portuguese! That will complete the first half of the language challenges series, and we have some really interesting languages in store for the second half, so stay tuned.

    About Iconic Translation Machines Ltd.

    Iconic Translation Machines provides intelligent domain-adapted machine translation solutions as a cloud-based service for targeted sectors of the translation industry. Our highly-tuned engines produce best-in-class translation quality, allowing Language Service Providers to increase throughput, productivity, and margins. Our flagship product, IPTranslator, provides high-quality machine translation for the patent and intellectual property sector.
