Machine translation (MT) is the process of automatic translation from one natural language to another
by a computer.
- Machine translation (MT) is the application of computers to the task of translating texts from one natural language to
another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of
systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific
domains.
- Source: www.eamt.org, European Association for Machine Translation (EAMT), 1997.
On 7 January 1954, the first public demonstration of an MT system was held
in New York at the head office of IBM. The
demonstration was widely reported in the newspapers and received much public interest. The system itself, however, was no more
than what today would be called a "toy" system, having just 250 words and translating just 49 carefully selected Russian
sentences into English -- mainly in the field of chemistry. Nevertheless it encouraged the view that MT was imminent -- and in
particular stimulated the financing of MT research, not just in the US but worldwide.
Introduction
Translation is anything but simple. It is not a mere word-for-word substitution: a translator must know "all of the words"
in a given sentence or phrase and how each may influence the others. Human
languages consist of morphology (the way words are built up from small
meaning-bearing units), syntax (sentence structure), and semantics (meaning). Even simple texts can be filled with ambiguities.
Linguistic approaches
It is often argued that the problem of machine translation requires the problem of natural language understanding to be solved
first. However, a number of heuristic methods of machine translation work
surprisingly well, including:
In general terms, rule-based methods (the first three) parse a text, usually creating an intermediary, symbolic
representation, from which they then generate text in the target language. This approach requires extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules.
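The parse, represent, generate pipeline described above can be sketched in a few lines. This is a minimal illustration rather than a real MT system: the lexicon, the `NounPhrase` representation, and the function names are all hypothetical, and the sketch handles only English determiner-adjective-noun phrases translated into French.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: English word -> (category, French lemma, gender).
LEXICON = {
    "the": ("DET", "le", None),
    "red": ("ADJ", "rouge", None),
    "car": ("NOUN", "voiture", "f"),
    "book": ("NOUN", "livre", "m"),
}


@dataclass
class NounPhrase:
    """Intermediate symbolic representation of a simple noun phrase."""
    det: str     # French determiner lemma
    adj: str     # French adjective lemma
    noun: str    # French noun lemma
    gender: str  # grammatical gender of the noun, "m" or "f"


def parse(text):
    """Analyse an English 'DET ADJ NOUN' phrase into a NounPhrase."""
    words = text.lower().split()
    cats = [LEXICON[w][0] for w in words]
    if cats != ["DET", "ADJ", "NOUN"]:
        raise ValueError(f"unsupported structure: {cats}")
    det, adj, noun = (LEXICON[w] for w in words)
    return NounPhrase(det=det[1], adj=adj[1], noun=noun[1], gender=noun[2])


def generate(phrase):
    """Produce French text from the representation."""
    # Agreement rule: the definite article takes the gender of the noun.
    det = {"m": "le", "f": "la"}[phrase.gender] if phrase.det == "le" else phrase.det
    # Ordering rule: in this toy grammar the adjective follows the noun.
    return f"{det} {phrase.noun} {phrase.adj}"


def translate(text):
    return generate(parse(text))


print(translate("the red car"))
print(translate("the red book"))
```

Even this tiny example needs a lexicon entry with category and gender for every word, plus separate agreement and word-order rules, which hints at why full rule-based systems require such large hand-built resources.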
Statistical methods (the last two) eschew manual lexicon building and rule-writing and instead try to generate
translations based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament. Where such corpora are
available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare.
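To illustrate how word translations can be learned from a bilingual corpus with no hand-written lexicon, here is a minimal sketch of one well-known technique, IBM Model 1 trained with expectation-maximization. The text above does not say which statistical model is meant, so this choice is an assumption. The three-sentence corpus and the function names are likewise hypothetical, and the sketch omits the NULL word used in full versions of the model.

```python
from collections import defaultdict

# Hypothetical three-sentence English-French parallel corpus (illustrative only).
CORPUS = [
    ("the house", "la maison"),
    ("the flower", "la fleur"),
    ("a flower", "une fleur"),
]


def train_model1(corpus, iterations=20):
    """Estimate t[(f, e)] = P(French word f | English word e) by EM."""
    pairs = [(en.split(), fr.split()) for en, fr in corpus]
    t = defaultdict(lambda: 1.0)  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        # E-step: share each French word among the English words in its sentence.
        for en_words, fr_words in pairs:
            for f in fr_words:
                norm = sum(t[(f, e)] for e in en_words)
                for e in en_words:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, english_word):
    """Return the French word with the highest estimated probability."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


model = train_model1(CORPUS)
for word in ["the", "house", "flower", "a"]:
    print(word, "->", best_translation(model, word))
```

The model is never told that "house" means "maison"; it infers this because "la" is better explained by "the", which co-occurs with it in two sentences. With so little data the estimates are fragile, which reflects the point above about needing large corpora of the right kind.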
Given enough data, most MT programs work well enough for a native
speaker of one language to get the approximate meaning of what a native speaker of the other has written. The difficulty is
getting enough data of the right kind to support the particular method. For example, the large multilingual corpus needed for statistical methods to work is not necessary for grammar-based methods.
Grammar-based methods, in turn, need a skilled linguist to carefully design the grammar that they use.
Users
Despite their inherent limitations, MT programs are currently used by various organizations around the world. Probably the
largest institutional user is the European Commission, which
uses a highly customized version of the commercial MT system SYSTRAN to handle the
automatic translation of a large volume of preliminary drafts of documents for internal use.
It was recently revealed that in April 2003 Microsoft began using a hybrid MT
system for the translation of a database of technical support documents from English to Spanish. The system was developed
internally by Microsoft's Natural Language Research group. The group is currently testing an English – Japanese system as
well as bringing English – French and English – German systems online. The latter two systems use a learned language
generation component whereas the first two have manually developed generation components. The systems were developed and trained
using translation memory databases with over a million
sentences each.