Language Challenge #3 – Japanese

日本語 (Nihongo)


Quick Facts

  • An official language only in Japan, it is the most widely spoken language in the world that is used primarily in a single country.
  • It is a member of the Japonic family, which also includes languages of the islands surrounding Japan, though scholars have not yet resolved the family's origin.
  • 12% of the residents of Hawaii are Japanese speakers.
  • The Japanese-Language Proficiency Test (JLPT) is a standardised proficiency test administered by the Japan Foundation and Japan Educational Exchanges and Services (JEES).
  • Old Japanese proverb: 知らぬが仏 (Shiranu ga hotoke) (Lit: “Not knowing is Buddha”. English equivalent: “Ignorance is bliss”)


    Japanese uses a combination of three writing systems: kanji, hiragana, and katakana. Kanji, a set of characters borrowed from Chinese, is the predominant script. In theory, all Japanese words can also be written in hiragana, but in practice it is mainly used for words that have no kanji representation and for suffixes on kanji words. Katakana is used to transcribe foreign words and proper names.

    Both hiragana and katakana are syllable-based: each character corresponds to a sound or syllable. This means that non-Japanese words can be transliterated according to how they should sound in Japanese, as illustrated below. The practice of writing Japanese words in the Latin alphabet is called rōmaji and is mainly used for language learning and computer processing.

    [Figure: examples of non-Japanese words transliterated into Japanese]

    Japanese generally follows subject-object-verb word order. Apart from the requirement that the verb come last, however, word order is relatively free.

    Nouns have very little inflection in Japanese: there is no distinction between singular and plural, and no grammatical gender. Verbs are conjugated for tense, but only for ‘past’ and ‘non-past’. Suffixes can be used to indicate things like the continuous aspect and the conditional mood, and verbs are also inflected to form the negative.

    Japanese also has an extensive system of honorific speech to express distinct sentiments such as politeness, respect, and humility.

    Japanese and MT

    As you may suspect, these particular characteristics of Japanese make it one of the more challenging languages for machine translation to and from English. The first problem we face is that there are no spaces between words in Japanese. Before we can translate the words, we first need to find out where each one begins and ends by splitting up the characters in the sentence. This process, as illustrated below, is known as text segmentation.

    [Figure: example of text segmentation of a Japanese sentence]
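
    To make the idea concrete, here is a minimal sketch of one simple segmentation strategy, greedy longest-match against a word list. This is not necessarily how a production MT system segments text (those typically rely on statistical or morphological analysers), and the tiny lexicon and the function name segment are assumptions made up for this illustration.

```python
# Toy lexicon: hypothetical entries chosen for this example only.
LEXICON = {"私", "は", "猫", "が", "好き", "です",
           "日本語", "日本", "語", "を", "勉強", "します"}


def segment(text, lexicon=LEXICON):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, fall back to a single character.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("私は猫が好きです"))
print(segment("日本語を勉強します"))
```

    Because the method always prefers the longest match, 日本語 is kept as one word rather than being split into 日本 and 語. The same greediness can also pick the wrong split in ambiguous cases, which is one reason real systems use more than a dictionary lookup.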

    The different writing systems also present a challenge, particularly for proper names and technical terms. In Japanese, such terms are transliterated using katakana, which essentially creates new words that the MT system might not have seen before. To overcome this, automatic transliteration is used to map the katakana symbols into Latin characters to produce the final translation.
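
    As a rough illustration of the mapping step, the sketch below converts basic katakana into Latin characters with a lookup table. It is a simplified assumption of how such a mapping could start, not the method any particular MT system uses: the function name katakana_to_romaji is hypothetical, combined sounds such as キャ are not handled, and the output is a romanised reading rather than the original English spelling, which would need a further model or dictionary to recover.

```python
VOWELS = "aiueo"

# Each row lists the five katakana for one consonant, in a-i-u-e-o order.
ROWS = {
    "": "アイウエオ", "k": "カキクケコ", "s": "サシスセソ", "t": "タチツテト",
    "n": "ナニヌネノ", "h": "ハヒフヘホ", "m": "マミムメモ", "r": "ラリルレロ",
    "g": "ガギグゲゴ", "z": "ザジズゼゾ", "d": "ダヂヅデド", "b": "バビブベボ",
    "p": "パピプペポ",
}
KANA = {kana: cons + vowel
        for cons, row in ROWS.items()
        for kana, vowel in zip(row, VOWELS)}
# Irregular readings and the characters that do not fit the grid above.
KANA.update({"シ": "shi", "チ": "chi", "ツ": "tsu", "フ": "fu", "ジ": "ji",
             "ヂ": "ji", "ヅ": "zu", "ヤ": "ya", "ユ": "yu", "ヨ": "yo",
             "ワ": "wa", "ヲ": "o", "ン": "n"})


def katakana_to_romaji(word):
    """Map basic katakana to Latin characters (simplified sketch)."""
    out = []
    double_next = False
    for ch in word:
        if ch == "ッ":          # small tsu: double the next consonant
            double_next = True
            continue
        if ch == "ー":          # long-vowel mark: repeat the previous vowel
            if out and out[-1][-1] in VOWELS:
                out.append(out[-1][-1])
            continue
        syllable = KANA.get(ch, ch)  # unknown characters pass through
        if double_next and ch in KANA and syllable[0] not in VOWELS:
            syllable = syllable[0] + syllable  # simplification: "cchi", not "tchi"
        double_next = False
        out.append(syllable)
    return "".join(out)


print(katakana_to_romaji("カメラ"))          # camera
print(katakana_to_romaji("インターネット"))  # Internet
```

    The gap between the romanised output and the English source word shows why transliteration of unseen katakana terms is treated as a problem in its own right.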

    The biggest challenge for Japanese MT is word order: not only is the verb placed at the end of the sentence, but word order in general is less strict. This makes it hard to determine the correct order when translating into English. A first solution to this challenge, as with German (described in last week’s article), is to reorder the words of the source sentence before translation so that it more closely follows the word order of the target language.

    Regarding the free word order, particles usually indicate the subject and object of a sentence, wherever those phrases occur. To position them correctly in the target language, we must identify these particles before translation and then place the phrases they mark in the correct order after translation.
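
    The following sketch shows, under heavy simplification, how particles could be used to recover roles and produce an English-like subject-verb-object order. It assumes the sentence has already been segmented, treats は and が as subject markers and を as the object marker (は is really a topic marker, so this is an approximation), and takes whatever follows the last particle as the verb. The names PARTICLE_ROLES, label_roles and to_svo are invented for this example.

```python
# Simplified assumption: these particles mark the role of the phrase before them.
PARTICLE_ROLES = {"は": "subject", "が": "subject", "を": "object"}


def label_roles(tokens):
    """Assign each phrase the role of the particle that follows it.

    Tokens left over after the last particle are treated as the verb.
    """
    roles = {}
    phrase = []
    for token in tokens:
        if token in PARTICLE_ROLES:
            roles[PARTICLE_ROLES[token]] = "".join(phrase)
            phrase = []
        else:
            phrase.append(token)
    roles["verb"] = "".join(phrase)
    return roles


def to_svo(tokens):
    """Reorder the labelled phrases into subject-verb-object order."""
    roles = label_roles(tokens)
    return [roles[r] for r in ("subject", "verb", "object") if r in roles]


# Both sentences mean "the cat eats the fish"; only the phrase order differs.
print(to_svo(["猫", "が", "魚", "を", "食べる"]))
print(to_svo(["魚", "を", "猫", "が", "食べる"]))
```

    Both inputs yield the same reordered sequence, which is the point of relying on particles rather than position. Real sentences involve many more particles, dropped subjects and nested clauses, so this only illustrates the principle.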

    Finally, the fact that Japanese nouns are not inflected for number can cause ambiguity for MT. For instance, the word 猫 (neko) can mean “a cat”, “the cat”, “the cats”, “some cats”, or “cats in general”. The only way to translate this correctly is to interpret the wider context in which the word was used. This, however, is one of the weaker aspects of MT technology in general, which is why Japanese remains one of the biggest challenges.

    Data Availability

    While there are sources of parallel data for Japanese, such as TAUS and NTCIR, they are often available only for research purposes. Moreover, even when good-quality data is available, Japanese is the perfect example of a language where data alone is not sufficient. Significant linguistic expertise is required in addition to data to develop many of the solutions mentioned above and to deliver usable machine translation engines.

    Coming next week…

    In next week’s post, we’re going to look at one of the most widely spoken languages across the globe…Spanish!

    About Iconic Translation Machines Ltd.

    Iconic Translation Machines provides intelligent domain-adapted machine translation solutions as a cloud-based service for targeted sectors of the translation industry. Our highly-tuned engines produce best-in-class translation quality, allowing Language Service Providers to increase throughput, productivity, and margins. Our flagship product, IPTranslator, provides high-quality machine translation for the patent and intellectual property sector.
