Language Challenge #7 – Chinese

中文 (Zhōngwén)


Quick Facts

  • Chinese, taken together with its varieties, is the most widely spoken language in the world, with more than 1 billion native speakers.
  • It is a member of the Sino-Tibetan language family and has influenced many other major languages such as Japanese and Korean.
  • Mandarin, also known as standard Chinese, is the most commonly spoken variety and is the official language of both the Republic of China and the People’s Republic of China.
  • As a spoken language, Chinese makes extensive use of tones. The same syllable can have radically different meanings depending on the tone in which it is said; for example, mā (妈, “mother”) and mǎ (马, “horse”) differ only in tone.
  • Quote in Chinese: 以約失之者,鮮矣。 (Lit: “The cautious seldom err.”) – Confucius, Chinese philosopher.

Characteristics

    Chinese is written using a set of characters known as hanzi. There are more than 100,000 individual characters, though only around 3,000 to 4,000 are required for basic literacy. Roughly speaking, each character corresponds to a syllable and, as such, can be used alone or in combination to form words. More recently, written Chinese has been split into two forms: Simplified and Traditional. Simplified Chinese uses a reduced character set with less complex characters. It is used in mainland China, whereas Traditional Chinese persists elsewhere, for instance in Hong Kong, Taiwan, and Macau.

    An official romanisation system called Pinyin, which uses the Latin alphabet, has also been developed to facilitate pronunciation.

    Relative to English, Chinese has a very simple grammar. Chinese lacks inflection and most words have only a single form. Elements such as number (singular and plural) and verb tense are not expressed grammatically. Articles (the, a, an, etc.) are also not used.

    For instance, for the verb “to eat”, the Chinese verb 吃 (chī) can mean “to eat”, “eat”, “eats”, “eating”, and so on. In order to express tense, additional words like temporal adverbs (yesterday, tomorrow, now, later) are used, as shown below:

  • 我吃 (wǒ chī) – “I eat”
  • 昨天我吃了 (zuótiān wǒ chī le) – “Yesterday I ate” (Lit: “Yesterday I eat”, followed by the completion particle 了)
  • 明天我吃 (míngtiān wǒ chī) – “Tomorrow I will eat” (Lit: “Tomorrow I eat”)

    In terms of basic word order, Chinese is generally classified as a subject-verb-object language, like English. However, there are considerable exceptions to this when it comes to noun phrases and relative clauses. The head noun always occurs at the end of the noun phrase, so its modifiers, including relative clauses, always come before it. These modifiers are marked by a special particle, 的 (DE), which causes real difficulty for machine translation.

    Chinese and MT

    The first problem that we face with Chinese is one that we also encounter with Japanese: there are no spaces between the words. Before we can translate the words of a sentence, we first need to find out where each word begins and ends by splitting up the characters in the sentence. This process, as illustrated below, is known as text segmentation.

    [Image: example of Chinese text segmentation]
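
To make the idea of segmentation concrete, here is a minimal sketch in Python of one simple dictionary-based approach, greedy forward maximum matching. It is only an illustration and not necessarily the method used in our engines; the function name `segment`, the `max_len` parameter, and the toy `LEXICON` are all assumptions made for this example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon entry that matches, falling back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, purely illustrative (hypothetical, not a real resource).
LEXICON = {"昨天", "明天", "我", "吃", "了", "农民", "水稻"}

print(segment("昨天我吃了", LEXICON))  # ['昨天', '我', '吃', '了']
```

A greedy matcher like this makes mistakes whenever a sentence can be split in more than one valid way, which is one reason segmentation is treated as a problem in its own right.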

    The biggest source of difficulty with Chinese-English machine translation, as with many other language pairs, is the difference in word order. We see this frequently in Chinese, where noun phrases are moved to the end of sentences around the 的 (DE) particle, as shown in the example below with the subject noun “farmer”. The most effective way to solve this problem is to identify the “DE” constructions so that we know which words need to change position in the translation.

    [Image: example of reordering around the 的 (DE) particle, with the subject noun “farmer”]
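
As a rough sketch of what identifying a “DE” construction and moving words might look like, the Python snippet below reorders an already segmented noun phrase so that the head noun comes first, as it would in English. The function `reorder_de`, the `<REL>` placeholder, and the example phrase 种水稻的农民 (“farmer who grows rice”) are assumptions chosen for illustration; they are not taken from the figure above. The sketch assumes the token list is a single noun phrase containing a relative clause.

```python
DE = "的"


def reorder_de(tokens):
    """Move the head noun (after 的) in front of its modifier (before 的).

    A "<REL>" placeholder marks where English needs a relativiser
    such as "who" or "that".
    """
    if DE not in tokens:
        return list(tokens)
    split = tokens.index(DE)
    modifier = tokens[:split]   # e.g. 种 水稻 ("grow rice")
    head = tokens[split + 1:]   # e.g. 农民 ("farmer")
    return head + ["<REL>"] + modifier


# Hypothetical example phrase, already segmented.
print(reorder_de(["种", "水稻", "的", "农民"]))
# ['农民', '<REL>', '种', '水稻']  i.e. "farmer <who> grow rice"
```

Note that 的 also marks possessives and adjectival modifiers, so a real system first has to decide which kind of “DE” construction it is looking at before reordering anything.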

    Because of the lack of inflection in Chinese, it is significantly more difficult to develop machine translation engines for Chinese into English than the other way around. This is because we don’t know what form the verb should take (e.g. “grow” -> “grows” in the example above), where we need to insert words that don’t exist in the Chinese sentence (e.g. “the”), and so on. In order to overcome this, we need to carry out a syntactic analysis of the sentence to determine these aspects. However, perfecting such analysis is still an active research topic.
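
One of the simplest cues available is the temporal adverb, as in the “yesterday” and “tomorrow” examples earlier. The Python sketch below shows how such a cue could be turned into a tense hint for the English side. It is a deliberately naive illustration, far short of the syntactic analysis described above; the names `TENSE_HINTS` and `guess_tense` and the two-entry table are assumptions made for this example.

```python
# Hypothetical lookup table: temporal adverb -> tense hint for English.
TENSE_HINTS = {
    "昨天": "past",    # zuótiān, "yesterday"
    "明天": "future",  # míngtiān, "tomorrow"
}


def guess_tense(tokens, default="present"):
    """Return the tense suggested by the first temporal adverb found,
    or the default when the sentence carries no such cue."""
    for token in tokens:
        if token in TENSE_HINTS:
            return TENSE_HINTS[token]
    return default


print(guess_tense(["昨天", "我", "吃", "了"]))  # past
print(guess_tense(["明天", "我", "吃"]))        # future
print(guess_tense(["我", "吃"]))                # present
```

When no adverb is present, the sketch simply falls back to a default, which is exactly the situation where deeper analysis of the sentence is needed.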

    We also need to be conscious of which version of written Chinese we are dealing with. As mentioned above, Traditional Chinese has a much richer character set (and more complex characters, as shown below), so if we develop machine translation engines with only Simplified Chinese data (which is far more prevalent), there will be a lot of Traditional Chinese words that we cannot translate. Solving this problem involves developing a process to convert between the two variants which, while non-trivial, is possible.

    [Image: comparison of Traditional and Simplified Chinese characters]
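
For a sense of what such a conversion process involves, the Python sketch below maps Traditional characters to their Simplified counterparts one character at a time. The tiny table `T2S` and the function `to_simplified` are assumptions for illustration only; a usable converter needs a far larger table and word-level handling.

```python
# Tiny, illustrative Traditional -> Simplified table (hypothetical subset).
T2S = {
    "學": "学",
    "國": "国",
    "語": "语",
    "馬": "马",
    "頭": "头",
    "發": "发",  # "to send out, to develop"
    "髮": "发",  # "hair": two Traditional characters share one Simplified form
}

TABLE = str.maketrans(T2S)


def to_simplified(text):
    """Replace each Traditional character found in the table;
    all other characters pass through unchanged."""
    return text.translate(TABLE)


print(to_simplified("學國語"))  # 学国语
print(to_simplified("頭髮"))    # 头发
```

The 發/髮 pair hints at why the problem is non-trivial: going from Traditional to Simplified is mostly a lookup, but going the other way requires context to choose between several candidate characters.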

    Data Availability

    There are a large number of machine translation research groups based in China, which has driven the creation of data resources for Chinese MT. Because Chinese is an official language of the United Nations, large corpora are available from that domain, in addition to large collections of news text and other domains from organisations like the LDC (Linguistic Data Consortium) and ELRA (European Language Resources Association).

    Coming next week…

    Next week, we go 180° to one of the most grammatically complex languages… Russian!

    About Iconic Translation Machines Ltd.

    Iconic Translation Machines provides intelligent domain-adapted machine translation solutions as a cloud-based service for targeted sectors of the translation industry. Our highly-tuned engines produce best-in-class translation quality, allowing Language Service Providers to increase throughput, productivity, and margins. Our flagship product, IPTranslator, provides high-quality machine translation for the patent and intellectual property sector.
