Language Challenge #6 – Arabic

العربية (al-ʻarabiyyah)


Quick Facts

  • Official language of 27 states in Africa and the Middle East, the third-highest number of states after English and French.
  • It is a member of the Semitic language family, which also includes Hebrew.
  • Arabic has multiple variants, the most common of which is Modern Standard Arabic (MSA). MSA is used in books, newspapers and official documents, but it is rarely spoken in day-to-day communication. Spoken Arabic comes in the form of dialects, which vary across geographical areas and differ significantly from MSA, reflecting influences from ancient languages such as Aramaic and Assyrian in the Middle East and foreign languages such as French in North Africa.
  • As one of the oldest modern languages, it has had significant influence on the vocabulary of other languages, including English. Examples include algebra (al-jabr), cotton (qúţun), and magazine (maḵāzin), amongst many others.
  • Old Arabic Proverb: يا اخد القرد على مالو بيروح المال وبيضل القرد على حالو (Lit: If you marry a monkey for money, the money will go away but the monkey will stay the same. Meaning: “Don’t marry for money!”)

Characteristics

    The Arabic alphabet contains 28 letters. In addition, there are diacritics which represent short vowels; some are written above the letter and others beneath it. However, they are usually dropped from modern Arabic text and used only to disambiguate certain words.
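
    To make the effect of dropped diacritics concrete, here is a minimal Python sketch that strips the short-vowel marks from two fully vowelled words. The function name `strip_diacritics` and the two example words (kataba, “he wrote”, and kutub, “books”) are illustrative assumptions rather than examples from this article. It relies on the Arabic short-vowel signs being combining marks (Unicode category `Mn`).

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (category Mn), which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Illustrative words (assumptions, not taken from the article):
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"          # kutub, "books"

if __name__ == "__main__":
    # Both words collapse to the same three consonants once the vowels are gone.
    print(strip_diacritics(KATABA) == strip_diacritics(KUTUB))
```

    Once the marks are removed, the two words are indistinguishable as strings, so a reader (or an MT system) has to rely on context to tell them apart.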


    Arabic is written (and read) from right to left, although numbers are written from left to right. However, this doesn’t have an impact on machine translation, because computers store text in reading order regardless of the direction in which it is displayed.
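
    The point about text direction can be checked with a short Python sketch. It assumes the text is stored as Unicode, where characters are kept in reading order and each character carries a directional class that the display layer uses. The helper name `bidi_classes` is hypothetical; `unicodedata.bidirectional` is part of the Python standard library.

```python
import unicodedata

# "The sky is clear", written with escapes so the stored order is explicit.
SENTENCE = "\u0627\u0644\u0633\u0645\u0627\u0621 \u0635\u0627\u0641\u064A\u0629"


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    first_word = SENTENCE.split()[0]
    # The first stored word is the first word read, even though it is drawn on the right.
    print(first_word)
    print(bidi_classes(first_word))  # Arabic letters report class "AL"
    print(bidi_classes("24"))        # European digits report class "EN"
```

    Because the direction is a display concern, a translation system can treat Arabic input as an ordinary sequence of characters and words.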

    Arabic has two types of sentences: verbal and nominal. Verbal sentences typically have verb-subject-object word order, while nominal sentences start with a subject noun followed by a complement and frequently omit the verb completely, as shown in the example below:

  • Arabic: السماء صافية
  • English: The sky is clear
  • Literal: The sky clear

    Arabic is a very morphologically complex language. Words are first formed by combining a root, which is a discontinuous sequence of consonants, with a pattern. For example, the word for ‘I wrote’ is constructed by combining the root k-t-b (“write”) with the pattern -a-a-tu (first person, past tense pattern) to form katabtu (“I wrote”). These stems can then be further modified and extended with affixes (for gender, number, person, tense) and clitics (for determiners, particles, some conjunctions and pronouns).
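
    The root-and-pattern mechanism can be sketched in a few lines of Python using transliterated forms. This is only an illustration: the function name `apply_pattern` and the slot notation (digits 1, 2, 3 marking where the first, second and third root consonants go) are assumptions made for this sketch, and the second root, d-r-s (“study”), is an added example.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Slot the consonants of a root such as "k-t-b" into a pattern such as "1a2a3tu".

    Digits in the pattern mark consonant slots (hypothetical notation);
    every other character is copied through unchanged.
    """
    consonants = root.split("-")
    pieces = []
    for ch in pattern:
        if ch.isdigit():
            pieces.append(consonants[int(ch) - 1])  # slot 1 -> first consonant, etc.
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(apply_pattern("k-t-b", "1a2a3tu"))  # katabtu, "I wrote"
    print(apply_pattern("d-r-s", "1a2a3tu"))  # darastu, "I studied"
```

    The same pattern applied to a different root yields the same grammatical form of a different verb, which is why a morphological analyser has to separate the two before it can look anything up.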

    Arabic nouns and adjectives inflect for case, state (definite/indefinite), gender, and number. Verbs inflect for aspect, mood, person, voice, gender, and number.

    This richness in morphology means that a very large number of forms can be generated by combining various elements. For example, the single Arabic word وسيقرؤونه (translation: “and they will read it”) is composed of a conjunction, a future particle, an active verb inflected for third person masculine plural indicative imperfective, and an object pronoun!
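
    As a rough illustration of what splitting such a word involves, the Python sketch below peels the clitics off the example word. It is a toy under strong assumptions: the clitic lists contain only the pieces found in this one word, the names `PROCLITICS`, `ENCLITICS` and `segment` are hypothetical, and the greedy stripping would wrongly split any word whose stem happens to begin or end with one of these letters. Real analysers rely on lexicons and context.

```python
# Toy clitic inventory covering only the example word (an assumption).
PROCLITICS = [
    ("\u0648", "conjunction 'and'"),
    ("\u0633", "future particle"),
]
ENCLITICS = [
    ("\u0647", "object pronoun 'it'"),
]

# The example word from the text: "and they will read it".
WORD = "\u0648\u0633\u064A\u0642\u0631\u0624\u0648\u0646\u0647"


def segment(word: str) -> list:
    """Greedily strip known proclitics in order, then at most one enclitic."""
    parts = []
    for form, label in PROCLITICS:
        if word.startswith(form):
            parts.append((form, label))
            word = word[len(form):]
    suffix = None
    for form, label in ENCLITICS:
        if word.endswith(form):
            suffix = (form, label)
            word = word[:-len(form)]
            break
    parts.append((word, "inflected verb"))
    if suffix is not None:
        parts.append(suffix)
    return parts


if __name__ == "__main__":
    for form, label in segment(WORD):
        print(form, "-", label)
```

    The sketch recovers four pieces from one written word, each of which corresponds to a separate word in the English translation.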

    Arabic and MT

    Arabic is not for the faint-hearted when it comes to MT. The characteristics of Arabic described above make it a really difficult proposition for machine translation to and from English. It combines many of the significant challenges of other languages we have written about previously, like word order (German, Japanese) and agreement (Romance languages), but also includes a number of challenges specific to Arabic.

    Word order is a problem because Arabic allows different word orders, as described above, and also because certain words are dropped, which means they simply won’t appear in the translation. These issues are compounded by the fact that the words in a sentence can vary hugely in form because of the morphology. This must be resolved before an MT system can determine where each word belongs in the sentence. The example below illustrates a number of these characteristics in a simple phrase.

    [Figure: example phrase (screenshot)]

    Here we see the verb moving position (“teaches”), noun-adjective reordering (“Arabic language”), and additional morphology such as the conjunction (“and”) and article (“the”) attached as prefixes, while the pronoun (“her”) is attached as a suffix.

    Advanced morphological analysis must be used in order to resolve many of these challenges but, as we pointed out in our French article, this requires more computing power and can be expensive. Then, to address challenges around word order, this needs to be combined with deeper syntactic parsing. While extremely effective, this also exacerbates the issues around computing power.

    Even with this, we still face challenges around word ambiguity because of the dropped diacritics, which can only be resolved if we have sufficient, relevant training data with which to develop the MT systems.

    Data Availability

    The main source of data for Arabic is the United Nations, where it enjoys official language status. Additional resources have been created from multilingual news outlets such as Al Jazeera. However, as with Japanese, Arabic is a perfect example of a language where data alone is not sufficient. Significant linguistic expertise is required in addition to data to develop many of the solutions mentioned above and to deliver usable machine translation engines.

    Coming Next Week…

    We will continue to look east next week, though we’ll go a bit further and discuss the most widely spoken language in the world…Chinese!

    About Iconic Translation Machines Ltd.

    Iconic Translation Machines provides intelligent domain-adapted machine translation solutions as a cloud-based service for targeted sectors of the translation industry. Our highly-tuned engines produce best-in-class translation quality, allowing Language Service Providers to increase throughput, productivity, and margins. Our flagship product, IPTranslator, provides high-quality machine translation for the patent and intellectual property sector.
