Machine Learning: The machines can translate now

by Paul Curzon, Queen Mary University of London

“The Machines can translate now…
…I SAID ‘THE MACHINES CAN TRANSLATE NOW'”

Portion of the Rosetta Stone which has the same text written in three languages.

The stereotype of the Englishman abroad when confronted by someone who doesn’t speak English is just to say it louder. That could soon be a thing of the past as portable devices start to gain speach recognition skills and as the machines get better at translating between languages.

Traditionally machine translation has involved professional human linguists manually writing lots of translation rules for the machines to follow. Recently there have been great advances in what is known as statistical machine translation where the machine learns the translations rules automatically. It does this using a parallel corpus*: just lots of pairs of sentences; one a sentence in the original language, the other its translation. Parallel corpora* are extracted from multi-lingual news sources like the BBC web site where professional human translators have done the translations.

Let’s look at an example translation of the accompanying original arabic:

Machine Translation: Baghdad 1-1 (AFP) – The official Iraqi news agency reported that the Chinese vice-president of the Revolutionary Command Council in Iraq, Izzat Ibrahim, met today in Baghdad, chairman of the Saudi Export Development Center, Abdel Rahman al-Zamil.

Human Translation: Baghdad 1-1 (AFP) – Iraq’s official news agency reported that the Deputy Chairman of the Iraqi Revolutionary Command Council, Izzet Ibrahim, today met with Abdul Rahman al-Zamil, Managing Director of the Saudi Center for Export Development.

This example shows a sentence from an Arabic newspaper then its translation by the Queen Mary University of London’s statistical machine translator, and finally a translation by a professional human translator. The statistical translation does allow a reader to get a rough understanding of the original Arabic sentence. There are several mistakes, though.

Mistranslating the “Managing Director” of the export development center as its “chairman” is perhaps not too much of a problem. Mistranslating “Deputy Chairman” as the “Chinese vice-president” is very bad. That kind of mistranslation could easily lead to grave insults!

That reminds me of the point in ‘The Hitch-Hiker’s Guide to the Galaxy’ where Arthur Dent’s words “I seem to be having tremendous difficulty with my lifestyle,” slip down a wormhole in space-time to be overheard by the Vl’hurg commander across a conference table. Unfortunately this was understood in the Vl’hurg tongue as the most dreadful insult imaginable, resulting in them waging terrible war for centuries…

For now the human’s are still the best translators but the machines are learning from them fast!

*corpus and corpora = singular and plural for the word used to describe a collection of written texts, literally a ‘body’ of text. A corpus might be all the works written by one author, corpora might be works of several authors.

**************

This article was originally published on the CS4FN website; there are some more articles below from CS4FN’s computer science and linguistics mini site.