Notes
-
[1]
Machine Translation is the subset of Automated Translation which aims to build programs capable of taking over translation per se, whether fully automatically or with the assistance of monolingual users.
-
[2]
METEO (Chandioux, 1988, 1976-), ALT/Flash (NTT, 1999-)…
-
[3]
ENGSPAN & SPANAM (Vasconcellos and León, 1988, 1980-), CATALYST (Nyberg and Mitamura, 1992, 1992-)…
-
[4]
ITS (BYU, Provo, 1974-81), N-Trans (Whitelock et al., 1986).
-
[5]
such as JETS at IBM-Japan (Maruyama et al., 1990), see also Boitet and Blanchon, 1994 ; Wehrli, 1992.
-
[6]
the majority of Systran "language pairs" until very recently, GlobaLink, Web-translator…
-
[7]
e.g., TDMT (ATR).
-
[8]
e.g., LMT (IBM).
-
[9]
e.g., MU (Univ. of Kyoto), and its successor MAJESTIC (JST).
-
[10]
e.g., GETA’s Russian-French system (Boitet and Nédobejkine, 1981).
-
[11]
e.g., METAL (Slocum), Shalt-1 (IBM).
-
[12]
See the difficulties experienced by Siemens with METAL in the 90’s, which contributed to the company’s exit from the scene.
1 – Introduction
1MT (Machine Translation) [1] was the first non-numerical application of computers. After initially promising demonstrations in the US around 1954, it was realized that HQFAMT (high quality fully automatic machine translation) would in general be impossible. Less ambitious tasks were then attacked by lowering one or more requirements. The result was several kinds of AT (automated translation) systems. There exist many LQFAMT (low quality-) systems, producing “rough” translations and used for accessing information in foreign languages. HQFAMT systems for very restricted typologies (kinds of texts) are less common but do exist. There are also HQMT systems for restricted typologies, which produce “raw” translations good enough to be revised cost-efficiently by professional revisers. HQMT can also be obtained by asking end users to assist the system. Finally, TA (translation aids) combining online dictionaries, term banks, and translation memories, are used extensively by professionals.
2 – Human Translation
2.1 – Translation is difficult
2Although it can be argued that the common nature of natural languages makes it possible to translate between any two languages, the task is much more difficult than usually believed, because of differences between languages, unavoidable ambiguities, and insufficient contextual knowledge.
2.1.1 – Differences between languages
3Natural languages reflect different points of view about the world. The “percepts” (things) we speak about may differ. English, French, and Russian have one word for “wall” (mur, stena/стена), while German, Spanish, and Italian distinguish between the volume and the surface (Mauer/Wand, muro/pared, muro/parete). The “concepts” (ideas) also reflect cultural differences. For example, Japanese has not only different verb forms but actually different verbs for the concept “to be” (da/familiar, desu/neutral, gozaimasu/polite) according to the attitude of the speaker. In Japanese, the subject of a verb must be volitional, so that it is incorrect to say “the typhoon destroyed the house”: one must use another point of view, and say “the house collapsed/was destroyed due to the typhoon”. Etc.
4Another source of the differences which make translation difficult is that, while all types of meanings can be expressed in all languages, some elements of meaning must be expressed in some languages, while they are usually unexpressed in others. In languages without articles (the, a/an), such as Russian, Chinese, Japanese, or Thai, definiteness is “underspecified” (usually not expressed explicitly). Similarly, aspect is underspecified in French as compared with Russian or even English. Number and gender (or sex) are likewise necessary in some languages and not usually expressed in others. For example, "Tanaka-san" in Japanese means “Mr Tanaka” as well as “Mrs Tanaka” or “Miss Tanaka”.
5It is not possible in general to translate by parts of speech: a noun is not always translatable as a noun, an adjective by an adjective, etc. One reason is that languages may not have the same parts of speech. For example, Thai has no adjectives. This variability extends to phrases: gerund or infinitive clauses don’t exist in all languages. Other important differences appear in the grouping or structuring of words and phrases inside utterances. For example, in German “er schwimmt gern” (he willingly swims) means “he likes to swim”, where the syntactic relations are reversed.
2.1.2 – Ambiguities
6All natural languages are inherently ambiguous, at all levels (sounds, words, phrases, functions, propositional meaning, intentions). For example, “recognize speech” can be understood as “wreck a nice beach”; “The time to learn about computers” can mean the time it takes to learn about computers, or the right moment to learn about computers; and “I didn’t hit you on purpose” may mean “I hit you by accident”, or “I purposely avoided hitting you”. If there are “black notebooks and folders”, the folders may or may not be black - and so on. An important point here is that humans rarely notice the ambiguity, because they use their background knowledge and their current anticipations to go straight to one particular interpretation, and then stop. But different humans arrive at different interpretations, so misunderstandings and accidents sometimes occur.
7There are also unambiguous words or utterances that become ambiguous in translation, because the target language must separate meanings indistinguishable in the source language. This is often the case with elements of meaning systematically underspecified in the source (such as definiteness, modality, aspect, and number).
8There are also cases such as “eat”, which in German gives “essen” for humans and “fressen” for animals; or “river”, which corresponds in French to “fleuve” if it is large and flows into the sea, and to “rivière” otherwise.
9There are also "parasitic" ambiguities that appear unpredictably in the resulting translation because of ambiguities of the target language. For example, “faced with this amount, he pulls back (balks)” is not ambiguous in English, but its French translation “devant cette somme, il recule” may also mean “owing this sum, he pulls back (balks)” - because in French “devant” is ambiguous between “in front of/before/faced with” and “owing/because he owes”. Such accidental ambiguity can be detected only by carefully looking for it in the translation.
2.1.3 – Lack of sufficient knowledge or understanding
10It is evident that errors in translation can arise because the translator does not sufficiently know one of the languages involved. But it is rarely realised that even professional translators often make mistakes because they know some meanings for a term, but not one specialized meaning which is restricted to a certain topic. For instance “avtomat s magazinom” was once translated from Russian as “automaton with magazine” instead of “push-down stack automaton”. Fixed idioms like “to kick the bucket” (“die”) are also very numerous, and translators who don’t know them will translate them literally - and erroneously - if the result seems to make sense in the target language.
11Errors also occur because the translator does not know the domain well enough. For instance, there are many phrases which are usually abbreviated into one component term. The literal translation of the abbreviated form may or may not be appropriate in context. For example, the literal translation of “system” is all right for “operating system”, which becomes “système” (d’exploitation) in French; but “inertial guidance system” should instead be translated as “centrale” (inertielle).
12Technical manuals make up a sizeable part of the translation business. They often describe processes and procedures, which translators sometimes need to understand to translate correctly. For example, in “heat for 5 minutes and remove tap”, “and” may mean “and at the same time”, but more probably “and then”.
13Finally, translation may require a good understanding of the communicative situation. “The bus should come now” may give “Der Bus sollte/müßte/dürfte kommen” in German, meaning that the speaker thinks the bus will probably come, or must logically come, or would be permitted to come. “Hai” in Japanese may mean “yes” or “I understand” or “mm hmm” (I follow you, go on!), etc.
2.2 – Translation is a catchword!
2.2.1 – Variety of “translational situations”
14According to the situation, the required translation may be 1-to-1, that is, from one language into another one (e.g. Russian-English), or 1-to-N (from one source language into many target languages, e.g. for disseminating technical documentation), or M-to-1 (from many source languages into a single target language, e.g. for a monolingual reader), or M-to-N (from many languages into many other languages, as for international organisations).
15The translation purposes may also vary:
16(1) dissemination of critical technical information for end-users;
17(2) access to information: material of relatively temporary value, Web materials, intelligence materials;
18or (3) recreation: translation of publicity or poetry.
19Speech translation, or “interpretation”, is quite different from text translation because its product is totally ephemeral, and never reused. The most common interpretation situations are:
20(1) simultaneous interpretation (e.g. at conferences), where translation of an utterance begins before the utterance is complete;
21(2) consecutive interpretation (e.g. of monologues or serious discussions), where the interpreter takes notes for some time and then translates;
22and (3) liaison interpretation (e.g. of dialogues), the only kind comparable to translation, as the interpreter translates utterance by utterance.
2.2.2 – Misconceptions about translation & translators
23It is often said that perfect understanding is needed to produce high quality translations, with the frequent implicit assumption that human translation is perfect. But “perfect understanding” is extremely rare. Not even a good bilingual engineer-turned-translator can maintain a perfect grasp on new developments in his field. In reality, admittedly imperfect but very high quality translations can in fact be produced with less than full understanding, or even with minimal understanding, by trained translators used to translating within a particular field and very familiar with its technical terms.
24In any case, junior translators usually don’t produce very high quality results; so their first draft, a “raw” translation, must usually be revised by senior translators. Typically, one hour is needed to translate a standard page of 250 words, and 20 minutes to revise it. Given the low price of human translation, professional translators must produce results in the time available, even if the quality suffers: they can’t spend more time if they want to earn a living. In addition, they must often work with stringent deadlines, and within fields that they don’t know very well, so that much time is spent looking up terms.
25Interpreters can take no time to polish their translation, so the quality of their work is measured much more by its speed and regularity, provided the meaning is roughly conveyed, than by its exactitude and completeness.
26Finally, the cost of translation is often underestimated. If it is done in house, essential costs for such requirements as training, meetings, and research are not taken into account. If it is subcontracted, different departments often use different subcontractors, and these expenses are not consolidated.
2.3 – Automation is needed
27The automation of translation and interpretation has become necessary since the 50’s as new needs have appeared.
28First, there can never be enough translators to cover the perceived needs. For example, not even the US army could train enough translators to skim through all Soviet literature in the Cold War era. With increasing globalisation and the growth of the Internet, the need for all kinds of translation, from rough to raw to refined, is growing. The European Community now has 11 official languages, but still about 1200 in-house translators, the same number it had 25 years ago, when there were only 8 official languages.
29Second, many translation tasks are so boring or stressful that translators want to escape them. For example, the famous METEO system came about because translators working with the Canadian Meteorological Centre heard of the TAUM MT project at Montreal University and went there to ask for a system which would free them for some interesting work (revision, as opposed to repetitious translation of routine weather reporting formulae). Automation can thus be seen as a way to free translators from menial tasks and promote them to revisers and co-developers of the automated systems.
30Third, there is also increasing need for interpretation, especially to assist travellers abroad. There are many situations where sufficient knowledge of a common language is lacking: visiting a doctor, booking tickets for travel or leisure, asking for help on the roadside, calling motels, etc.
3 – Automated translation
31Translation may be automated with various goals and users in mind. Low quality is sufficient for end users wanting to track information, provided it is fast and cheap. Translators need either good support tools to do their job, or high quality machine output to replace a first draft.
3.1 – What translation tasks can be automated
32Translation jobs can be decomposed into subtasks that can be automated more or less independently.
3.1.1 – Preparation (of resources & source texts)
33Computers may first help in preparing resources for translation: they can
34(1) help to build bilingual terminological lexicons;
35(2) help to discover similar texts or dialogues, if possible already translated;
36(3) supply aligners, to build bitexts (aligned bilingual texts) from translations;
37(4) provide terminology extractors working on monolingual texts or bitexts.
38Computers may also be used to prepare the texts to be translated. This preparation involves first correcting the texts using spellcheckers and grammar checkers, because bad source text quality is a principal source of translation errors. Second, it is often useful to normalise the source text, using style and terminology checkers. For instance, pronouns with distant referents can be replaced by their referents, and implicit elements, such as subjects or objects in Japanese sentences, can be inserted.
39Third, long and complex sentences may be broken into simpler ones to help translation, even if they are admissible in the genre at hand. Finally, it is also possible to annotate the text by inserting tags to help the analyser. For example, “Click on save as HTML to save your file in HTML format” could become “Click on <menu_item>save as HTML</menu_item> to save your file in HTML format”.
3.1.2 – Translation per se (partial or total automation)
40Translation per se, i.e. the passage from one language to another, can be totally, partially, or “only apparently” automated.
41“Pure” Machine Translation operates fully automatically, and its result is used by the end user. As mentioned, high quality of “raw” output can only be obtained by tuning systems to a restricted domain or genre. In order to achieve professional quality output, and to give feedback to the machine translation developers so that they tune translation to the relevant sublanguage, human revision is used. It may take anywhere from one minute per page for very restricted domains [2] to 10-15 minutes for broader domains such as technical manuals or administrative documents [3].
42In “semi-automatic” translation, one or more humans help the system translate. There are several variants. The oldest methods [4] are highly interactive: as soon as the system encounters a problem in a local context (ambiguity in analysis, synonymy in generation, or both in transfer), a question is issued, and the user must give an answer. This procedure gives human assistants the impression of becoming “slaves of a stupid machine” - stupid because the answer to many questions is felt to be obvious, given the whole context of the utterance. It is also very costly if specialised knowledge is required of the human(s) concerning the internal linguistic representations of the system itself or of one or more target languages.
43More recent semi-automatic systems [5] avoid these problems by delaying questions until one or two specific points in the translation process. Such postponement implies that the unit of translation has been completely processed, and that all remaining questions and choices have been kept in the data structure representing the program’s current state.
44We say translation is “only apparently” automated when the automation consists only in retrieving past translations from a “translation memory”. When an “exact match” is found, the system may propose not only one, but several translations, produced in different contexts. In principle, all of them are very good, being of professional quality. When a “fuzzy” or partial match is found, however, translations of similar but different source language segments are proposed, all of them in principle inadequate and requiring editing by the translator to become correct. In typical situations with high redundancy, such as successive versions of the same set of manuals, one may find 20-30% exact matches and 40-60% partial matches.
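The retrieval step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name tm_lookup, the 0.7 threshold, and the character-level similarity ratio of the standard difflib module are choices made here, not the matching measure of any particular translation memory product.

```python
from difflib import SequenceMatcher


def tm_lookup(segment, memory, fuzzy_threshold=0.7):
    """Look up a source segment in a translation memory.

    `memory` is a list of (source, target) pairs. Returns a pair
    (kind, hits) where kind is "exact", "fuzzy" or "none" and hits is a
    list of (source, target, score) tuples, best score first.
    The threshold value is an illustrative assumption.
    """
    # Exact matches: the same source segment may have several stored
    # translations, produced in different contexts.
    exact = [(src, tgt, 1.0) for src, tgt in memory if src == segment]
    if exact:
        return "exact", exact

    # Fuzzy matches: similar but different source segments, whose
    # translations will need editing by the translator.
    hits = []
    for src, tgt in memory:
        score = SequenceMatcher(None, segment, src).ratio()
        if score >= fuzzy_threshold:
            hits.append((src, tgt, score))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return ("fuzzy", hits) if hits else ("none", [])


# Hypothetical memory content, for illustration only.
memory = [
    ("Press the red button.", "Appuyez sur le bouton rouge."),
    ("Press the red button.", "Enfoncez le bouton rouge."),
    ("Press the green button.", "Appuyez sur le bouton vert."),
]

kind, hits = tm_lookup("Press the red button.", memory)
print(kind, len(hits))
```

An exact match returns every stored translation of the segment, while a fuzzy match returns candidates ranked by similarity, which the translator must still edit.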
45These approaches may be combined, e.g. by providing translators with suggestions from a translation memory and from an MT system.
3.1.3 – Revision and workflow
46To assist the revision of a “first translation draft”, both standard and specialized tools can be used. Standard tools would include various kinds of checkers of the target language. Specialized aids include:
47(1) flagging (by the MT system) of dubious passages (for which no complete analysis has been found, or in which unreliable choices had to be made);
48(2) production of two or more alternate translations of ambiguous terms, with appropriate formatting;
49and (3) development of special macro commands in the text editor to automate the most frequent correction operations, such as the permutation of three not necessarily contiguous blocks of texts (…A…B…C… to …B…C…A…).
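The block permutation mentioned in item (3) can be illustrated by a small Python function. It is a sketch only: the name rotate_blocks and the representation of blocks as (start, end) character offsets are assumptions, standing in for whatever selection mechanism a given text editor provides.

```python
def rotate_blocks(text, a, b, c):
    """Rewrite ...A...B...C... as ...B...C...A...

    Each block is a (start, end) pair of character offsets. The blocks
    must appear in this order and must not overlap, but they need not
    be contiguous: the text between them is left in place.
    """
    (a0, a1), (b0, b1), (c0, c1) = a, b, c
    if not (0 <= a0 <= a1 <= b0 <= b1 <= c0 <= c1 <= len(text)):
        raise ValueError("blocks must be ordered and non-overlapping")
    block_a, block_b, block_c = text[a0:a1], text[b0:b1], text[c0:c1]
    return (
        text[:a0] + block_b      # B takes the place of A
        + text[a1:b0] + block_c  # C takes the place of B
        + text[b1:c0] + block_a  # A takes the place of C
        + text[c1:]
    )


print(rotate_blocks("x A y B z C w", (2, 3), (6, 7), (10, 11)))
```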
50Finally, automating the translation workflow involves very few linguistic operations (beyond the segmentation of texts into maximally homogenous translation units), but can be very useful.
4 – Possible workflow approaches
51Let us now examine possible approaches to the automation of translation per se and to evaluation of the results. Imitating humans is no more a viable approach for translation than for voice recognition or other complex mental processes such as chess playing, because today’s computers have very little in common with human brains. However, observation of human practice and introspection are nevertheless very useful: what is done can be reproduced independently of how it is done, and translator “rules of thumb” can be more or less exactly formalised as “heuristic rules”.
4.1 – Vauquois’ triangle and linguistic architectures
52The possible linguistic architectures of MT systems are best understood with the help of Vauquois’ triangle (figure 1). The levels of linguistic interpretation are on the left, the kinds of structures used to represent NL utterances are on the right, and translation processes are in the middle.
Figure 1 – Vauquois’ triangle
4.2 – Direct & semi-direct MT: lexical replacement + “massaging”
53Historically, the first MT paradigm was based on the supposed similarity of translation and deciphering. In this approach, a Russian text is viewed as the result of some encoding of an original English text, using word replacement and permutations of groups of words. To translate it into English, then, one tries to find the “best” English equivalents for the Russian words or groups of words, by looking at the context in the utterance, and to “massage” the result into a grammatical English utterance.
54This technique does not work very well, but mistranslation and agrammaticality in the output can often be compensated for by human readers if they are reading for comprehension only. The deciphering, or direct, approach is still used in many systems today [6].
55“Pure” statistical MT is a modern version of the direct approach. The IBM group which pioneered the successful use of statistical methods for speech recognition (Jelinek, Brown) has pioneered in the statistical translation area as well. The idea is to “learn” correspondences between (groups of) words of two languages from a large set of aligned sentences, such as the Hansard bilingual transcriptions of the debates of the Canadian Parliament, and then to use these learned correspondences to translate new sentences. Because of the huge number of possibilities to be examined (the combinatorial explosion), sentences used for training cannot be very long, and the correspondences are usually limited to 1-1, 1-0, 0-1, and 1-N, although M-N (many to many) correspondences would be desirable.
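To make the idea of “learning” word correspondences from aligned sentences concrete, here is a minimal sketch of expectation-maximisation training in the style of the simplest IBM word-alignment model (usually called Model 1). It is a deliberate simplification: there is no empty source word, so 0-1 correspondences are not modelled, and the three-sentence corpus is a toy example, not Hansard data. The function names are assumptions.

```python
from collections import defaultdict


def train_model1(bitext, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    `bitext` is a list of (source_tokens, target_tokens) pairs.
    """
    target_vocab = {tw for _, tgt in bitext for tw in tgt}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in bitext:
            for tw in tgt:
                # E-step: share each target word among the source words
                # of the same sentence, in proportion to current t.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # M-step: renormalise the expected counts.
        t = defaultdict(
            float, {pair: c / total[pair[1]] for pair, c in count.items()}
        )
    return t


def best_translation(t, source_word):
    """Most probable target word for a source word under table t."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Toy aligned corpus (German-English), for illustration only.
bitext = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

table = train_model1(bitext)
for word in ("das", "Haus", "Buch", "ein"):
    print(word, "->", best_translation(table, word))
```

Even on three sentences, the repeated co-occurrence of “das” with “the” lets the procedure separate “Haus” from “the”, which is the effect a large aligned corpus produces at scale.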
56Experiments have demonstrated that this technique is inferior to the older handcrafted direct approach. But its proponents have been able to use the same ideas at higher levels of linguistic interpretation, this time with success (e.g. using “stochastic grammars” at the syntagmatic, or syntactic, level, and more recently employing “semantic grammars”), in effect moving from a semi-direct approach to a transfer or pivot approach.
4.3 – Transfer approaches
57In the “strict transfer” approach, there are 3 steps: a strictly monolingual analysis into a descriptive structure at some level L of linguistic interpretation, a bilingual transfer (normally decomposed into a lexical and a structural phase) into a descriptive structure of the target language at the same level L, and a monolingual generation process. One can distinguish between surface [7] and deep syntactic transfer [8], semantic transfer [9], and multilevel transfer (if the descriptive structure contains both syntactic and semantic information) [10]. Whichever variant is used, the transfer-based translation technique makes it possible to reuse the analysis and generation steps in several “language pairs” (English-French, French-German, etc.). Due to a lack of appropriate tools, some developers have started with the strict transfer approach in mind, but have not been able to develop the “structural” part of a strict transfer process - that part which represents correspondences between arbitrarily complex source language and target language structures. The resulting systems [11] mix transfer and syntactic generation in one step, usually implemented by a recursive descent of the analysis tree structure, thus achieving a sort of “descending transfer”.
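The three steps of strict transfer can be shown on a toy example. The sketch below handles only English noun phrases of the form determiner, adjectives, noun, and translates them into French; the lexicons, the feature names, and the single structural rule (adjectives placed after the noun) are all assumptions invented for the illustration, not the design of any system cited here.

```python
# Toy resources, invented for this illustration.
EN_CATEGORIES = {"the": "DET", "a": "DET", "red": "ADJ", "green": "ADJ",
                 "car": "NOUN", "house": "NOUN"}
EN_FR_LEXICON = {"the": "le", "a": "un", "red": "rouge", "green": "vert",
                 "car": "voiture", "house": "maison"}
FR_FEMININE_NOUNS = {"voiture", "maison"}
FR_FEMININE_FORMS = {"le": "la", "un": "une", "vert": "verte"}


def analyse(text):
    """Monolingual analysis: English string -> descriptive structure."""
    words = text.lower().split()
    cats = [EN_CATEGORIES[w] for w in words]
    if (len(words) < 2 or cats[0] != "DET" or cats[-1] != "NOUN"
            or any(cat != "ADJ" for cat in cats[1:-1])):
        raise ValueError("input is outside the toy grammar")
    return {"det": words[0], "mods": words[1:-1], "head": words[-1],
            "order": ["det", "mods", "head"]}


def transfer(structure):
    """Bilingual transfer: lexical phase, then structural phase."""
    target = {
        "det": EN_FR_LEXICON[structure["det"]],
        "mods": [EN_FR_LEXICON[m] for m in structure["mods"]],
        "head": EN_FR_LEXICON[structure["head"]],
    }
    # Structural phase: these French adjectives follow the noun.
    target["order"] = ["det", "head", "mods"]
    return target


def generate(structure):
    """Monolingual generation: gender agreement and linearisation."""
    feminine = structure["head"] in FR_FEMININE_NOUNS

    def agree(word):
        return FR_FEMININE_FORMS.get(word, word) if feminine else word

    parts = {"det": [agree(structure["det"])],
             "head": [structure["head"]],
             "mods": [agree(m) for m in structure["mods"]]}
    return " ".join(w for slot in structure["order"] for w in parts[slot])


def translate(text):
    return generate(transfer(analyse(text)))


print(translate("the red car"))
print(translate("a green house"))
```

Only transfer refers to both languages: analyse could be reused for another target language and generate for another source language, which is the reuse argument made above.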
58By contrast, B. Vauquois introduced the technique of “ascending transfer”. In this approach, the analysis produces a “multilevel structure” containing both semantic and syntactic structures: that is, both a language-neutral level of logico-semantic interpretation (composed of predicate-argument structure and semantic relations) on one hand, and, additionally, some information about surface levels as well, such as information about syntagmatic categories (noun phrase, verb phrase, etc.) and syntactic functions (subject, direct object, etc.). During transfer, some surface level information may be transferred from the source language structure to that of the target language; but, during generation, any remaining surface information is recomputed from the language-neutral semantic level. At this generation stage, the surface information supplied by the transfer stage is used only as a stylistic guide or as a “safety net” in case sufficient language-neutral semantic information is missing in some part of the transfer structure. The overall goal is to minimize the transfer of structural surface information - to rely so far as possible on the transfer of semantic information, so that the target language can express this semantic information idiomatically, without excessive reliance on the surface structures of the source language.
4.4 – Pivot approaches
59Strictly speaking, the pivot approach simply consists in using a standard intermediate format to represent the information passed during translation from any language into any other language. But the nature of this “pivot” format may vary: it may be a natural language text, or it may be a meaning description in some artificial language.
60Using natural language text as a “pivot” structure is possible only if the output is of high quality and grammatically correct. Otherwise, the second translation system will be trying to produce target language based upon a faulty pivot input, one which may contain source language words or constructions which have not been properly recognised or translated. This poor input leads to incomprehensible results.
61The next possibility is to use structural descriptors of a natural language L0 as pivotal elements (rather than normal text of L0). To translate between L0 and another language Lk, only one transfer process is then needed, while two are combined to translate from Lj to Lk via L0. This structural pivot method works with the normal and ascending transfer approaches, but not with descending transfer. The advantage of a structural pivot technique is that no spurious ambiguities are introduced by going all the way “down” to a natural text in L0 and then having to analyze it again. This structural pivot technique was initially proposed and tried by CETA in Grenoble (Vauquois 1975) in the sixties, under the name of “hybrid pivot”. It was then used successfully in the DLT project at BSO research between 1982 and 1988. The pivot language was Esperanto, and the pivot descriptors were Esperanto utterances augmented with structural and grammatical tags. The same technique is currently used by the IPPI team in Moscow (Boguslavskij & al.) to translate from UNL (see below) into Russian through deep dependency structures of English.
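The economy of the structural pivot can be checked with a small count. The sketch below assumes n languages in all, with L0 being one of them, and counts directed transfer components only (analysers and generators are shared in both designs); the function names are assumptions.

```python
def pairwise_transfers(n):
    """Directed transfer components when every pair is linked directly."""
    return n * (n - 1)


def structural_pivot_transfers(n):
    """Directed transfer components when all translation goes via L0.

    L0 is one of the n languages, so each of the other n - 1 languages
    needs one transfer into L0 structures and one transfer out of them.
    """
    return 2 * (n - 1)


for n in (3, 8, 11):
    print(n, pairwise_transfers(n), structural_pivot_transfers(n))
```

For 11 languages this gives 110 pairwise transfers against 20 with a structural pivot, at the price of combining two transfers whenever neither language is L0.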
62With the structural pivot technique, the lexical level of description employs “lexical units” which are derivational families of lemmas (dictionary forms of words) linked by derivations (“lexico-semantic functions”) such as “passive potentiality” (such as observe_VERB => observable_ADJECTIVE with the paraphrase “which can be observed”). The possible derivations are very numerous, but only those which are productive and useful for generating paraphrases in translation are retained, in order to limit the cost of creating the dictionaries. In order to characterise the intended meaning of a lexical unit, however, one has to add some “sense identifier”, such as #1 for observe/observation and #2 for observe/observance.
63The next step is then to create an autonomous vocabulary of symbols to stand for the “interlingual acceptions” (or sets of acceptions, that is, word senses). If utterance representations use no surface level information, the result is a “linguistic interlingua”. The best example today is the UNL language (Universal Networking Language).
64UNL is a project of multilingual personal networking communication initiated in 1996 by the United Nations University (based in Tokyo) and addressing 14 languages in 2000. The representation of an utterance is a hypergraph with a unique “entry node” where normal nodes bear UWs (“Universal Words”, or interlingual acceptions/word senses) with semantic attributes, and arcs bear semantic relations (deep cases, such as agt, obj, goal, etc.). A “hypernode” groups a subgraph defined by a set of connected arcs and an entry node.
65Because all UNL developers know English, UNL symbols are built from English. A UW is of the form “<headword (English word or compound)> (<list of restrictions>)”. A restriction expresses a possible relation to or from another UW in the KB (knowledge base) of all UWs, a relation which can help to restrict or fix the intended meaning: <semantic relation><direction><UW>, where <direction> is “>” or “<”. An unrestricted UW denotes a collection of interlingual acceptions (word senses), although one often loosely speaks of “the” word sense denoted by a restricted UW. For example, the unrestricted UW “look for” denotes all the word senses associated with the English compound word “look for”. However, the restricted UW “look for(icl>action, agt>human, obj>thing)” represents only the word senses of the English word “look for” in which the expression represents an action which is performed by a human and affects a thing.
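The UW notation just described is regular enough to be read mechanically. The following Python sketch parses a UW string into a headword and a list of restrictions. It is an assumption-laden illustration: the class and function names are invented, and only flat restriction lists such as the one in the example are handled (a restriction whose target is itself a restricted UW is not supported).

```python
import re
from typing import NamedTuple, Tuple


class Restriction(NamedTuple):
    relation: str    # semantic relation, e.g. "icl", "agt", "obj"
    direction: str   # ">" or "<"
    target: str      # the other UW (here: a bare headword)


class UniversalWord(NamedTuple):
    headword: str
    restrictions: Tuple[Restriction, ...]


_UW = re.compile(r"^\s*(?P<head>[^()]+?)\s*(?:\((?P<body>.*)\))?\s*$")
_RESTRICTION = re.compile(
    r"^\s*(?P<rel>[a-z]+)\s*(?P<dir>[<>])\s*(?P<target>.+?)\s*$"
)


def parse_uw(text):
    """Parse 'headword(rel>uw, rel<uw, ...)' into a UniversalWord."""
    match = _UW.match(text)
    if match is None:
        raise ValueError(f"not a UW: {text!r}")
    restrictions = []
    body = match.group("body")
    if body:
        # Flat lists only: nested restrictions would need a real parser.
        for item in body.split(","):
            r = _RESTRICTION.match(item)
            if r is None:
                raise ValueError(f"bad restriction: {item!r}")
            restrictions.append(
                Restriction(r.group("rel"), r.group("dir"), r.group("target"))
            )
    return UniversalWord(match.group("head"), tuple(restrictions))


print(parse_uw("look for(icl>action, agt>human, obj>thing)"))
print(parse_uw("look for"))
```

An unrestricted UW simply yields an empty restriction tuple, which mirrors the fact that it denotes the whole collection of word senses of its headword.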
66Figure 2 gives an example of a UNL graph (with no subgraphs). The semantic attributes describe what is said from the speaker’s point of view. This annotation includes phenomena like speech acts, truth-values, time, etc.
Figure 2 – A simple UNL graph
67Finally, it is possible to develop purely semantic, task-oriented pivot representations within restricted domains such as transport and hotel reservation, schedule arrangement, etc. This approach has been taken in several speech translation projects such as CSTAR (which employs the Interchange Format, or “IF”, for intermediate representation). An utterance by a client such as “Hello Tom, how are you today?” may be represented by the IF expression “C:greetings (time=morning, level=familiar)”, which could then be re-expressed as “Good morning!”.
5 – Computer techniques
68Let us now briefly describe how the steps or phases of the translation process are computed.
5.1 – Algorithmic approaches
5.1.1 – Ways to handle ambiguities
69Traditionally, the goal of source language analysis is to produce one or several abstract representations (annotated or “decorated” trees, feature structures, conceptual graphs, logical formulae, etc.), each of them disambiguated.
70In the combinatorial approach to analysis, all possible analyses are computed. Subsequently, since a single best solution or small set of solutions is normally desired, filters, weights, or preferences are used. Filtering consists in applying increasingly stricter conditions of well-formedness until the set of solutions is satisfactorily reduced. Weights may be computed for solutions in a number of ways.
71In the heuristic approach, by contrast, only some possible analyses are computed. The worst heuristic technique is to use a standard search algorithm (e.g., backtracking) and to stop once the first solution is reached, because in this case potential ambiguities are simply ignored. A better technique is to use weights to guide the search and to seek more than one solution. (Heuristic search may in fact be the only practical technique for tackling noisy input.)
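The contrast between the combinatorial and the heuristic approach can be made tangible with a toy tagging problem. In the sketch below, every word has weighted candidate categories and the only well-formedness filter is “exactly one verb”; the lexicon, the weights, the filter, and the use of a beam search as the weight-guided heuristic are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical lexicon: word -> list of (category, weight).
READINGS = {
    "time": [("NOUN", 0.7), ("VERB", 0.3)],
    "flies": [("VERB", 0.6), ("NOUN", 0.4)],
    "fast": [("ADV", 0.8), ("ADJ", 0.2)],
}


def well_formed(tags):
    """Toy filter: an analysis must contain exactly one verb."""
    return tags.count("VERB") == 1


def combinatorial(words):
    """Compute all analyses, then filter them and rank by weight."""
    solutions = []
    for combo in product(*(READINGS[w] for w in words)):
        tags = [cat for cat, _ in combo]
        if well_formed(tags):
            weight = 1.0
            for _, w in combo:
                weight *= w
            solutions.append((weight, tags))
    return sorted(solutions, reverse=True)


def heuristic(words, beam_width=2):
    """Weight-guided search keeping only the best partial analyses."""
    beam = [(1.0, [])]
    for word in words:
        extended = [(weight * w, tags + [cat])
                    for weight, tags in beam
                    for cat, w in READINGS[word]]
        extended.sort(reverse=True)
        beam = extended[:beam_width]   # everything else is never explored
    return [(weight, tags) for weight, tags in beam if well_formed(tags)]


sentence = ["time", "flies", "fast"]
print(len(combinatorial(sentence)), "analyses survive filtering")
print(len(heuristic(sentence)), "analysis found by the beam search")
```

Both procedures agree on the best analysis here, but the beam search silently drops readings such as VERB NOUN ADV that the combinatorial approach retains, which illustrates why neither method can report where the ambiguity came from.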
72These two search techniques (standard and heuristic) don’t handle ambiguities; they are just ways to eliminate them from further consideration. The use of preferences is more to the point: one tries to recognise the cause of ambiguity and to choose on some principled basis. But then all ambiguity cases and their solutions must be foreseen, a very difficult task. Also, principled resolution based upon preferences is local, which makes it difficult to guarantee an overall “best” solution, and perhaps explains why other techniques often perform as well in practice.
73The main problem with these classical approaches is that they work… as demanded. That is, they produce a set or a (weighted) list of solutions, and discard all trace of the source of the ambiguities. B. Vauquois’ solution to this problem of information loss is to produce abstract structures that represent some ambiguities, either directly or through “tactical” annotations. In practical working systems, the heuristic and the preferential approaches are combined. This combination enables a system to produce “warnings” in the output which show the presence of ambiguities in a factorised way. However, ambiguities are coded in the same structures (annotated trees) used to describe disambiguated solutions, which leads to some lack of perspicuity.
5.1.2 – Symbolic, numerical and hybrid techniques
74All operational MT systems use symbolic (rule- or procedure-based) techniques, sometimes enhanced with numerical (statistical) techniques. Their linguistic programs are thus based either on formal grammars or on automata. The grammatical approach uses regular or context-free grammars, extended with attributes and abstract tree building mechanisms, much as is done to build compilers. Various models of transducing automata are also used directly.
75(1) Finite State Transducers (FST) are extensively used for morphological and bounded syntactic processes.
76(2) In 1970, W. Woods introduced Augmented Transition Networks (ATN), string to tree transducers, to build analysers, and used them in the LUNAR project. ATN-based analysers are used today in the Prompt (Softissimo), ENGSPAN/SPANAM (PAHO) and AS-TRANSAC (Toshiba) MT systems.
77(3) G. Stewart extended the model to Q-graphs transducers for the TAUM-Aviation project, but his remarkable REZO language has yet to find a successor.
78(4) Finally, various models of decorated tree transducers (transformational systems) have been used for structural analysis, transfer, and generation. The first was built at CETA (Grenoble) in 1965-66. Quite a few MT systems use some variants today (ROBRA in GETA’s Ariane-G5-based MT systems, L.C. Tong’s GWR in Tapestry at KRDL in Singapore, Nakamura’s GRADE in MU/Majestic in Japan, HICATS of Hitachi).
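As an illustration of the first model in this list, here is a minimal sketch of a finite-state transducer for one morphological process. It is a toy under stated assumptions: the three states (`q0`, `sib`, `end`), the `+PL` marker and the simplified English plural rule are all hypothetical, and real morphological transducers cover far more cases.

```python
from typing import List, Tuple

SIBILANTS = set("sxz")  # toy approximation; "ch", "sh" and "y" are ignored


def step(state: str, symbol: str) -> Tuple[str, str]:
    """One transition: (state, input symbol) -> (output string, next state)."""
    if state == "end":
        raise ValueError("no input is allowed after the plural marker")
    if symbol == "+PL":
        # The output depends only on the current state, not on the whole word.
        return ("es" if state == "sib" else "s"), "end"
    if len(symbol) == 1 and symbol.isalpha():
        # Letters are copied; the state remembers whether the last one was a sibilant.
        return symbol, ("sib" if symbol in SIBILANTS else "q0")
    raise ValueError(f"unknown symbol: {symbol!r}")


def transduce(symbols: List[str]) -> str:
    """Run the transducer from the start state over a lexical symbol sequence."""
    state, output = "q0", []
    for symbol in symbols:
        out, state = step(state, symbol)
        output.append(out)
    return "".join(output)


plural_cat = transduce(list("cat") + ["+PL"])
plural_fox = transduce(list("fox") + ["+PL"])
```

The transducer needs only a bounded memory (the current state), which is why this kind of device suits morphology and bounded syntactic processes but not unbounded structural analysis.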
79Numerical, or statistical, techniques are increasingly used in Speech Translation. The end-to-end evaluation of the German VerbMobil project (1992-2000) has shown that, after comparable development efforts, they can lower the error rate to 30%, while the symbolic techniques produced errors in at least 60% of all cases. This advantage is due not only to the noisy nature of the input (a word lattice produced by the speech recogniser), but also to the fact that numerical techniques can “learn” the weights of the transitions of a stochastic automaton far more complex than any automaton which could be built by hand.
80Another promising avenue is the “hybrid approach” to source language analysis, where symbolic and numerical techniques are mixed at different points. An exemplary case is the MT system developed by Microsoft in 2000-01. Analysers and generators are written in G, the successor of PLNLP, a language of symbolic rules. Weights are attached to rules as well as to words and word senses in the dictionaries. They are used to compute scores for the possible linguistic trees, and the best wins. By feedback in supervised training, the weights are learned so that the error rate decreases to 5% or less. Then the essential part of the transfer stage of translation is learned automatically from a collection of about 200,000 sentences in English and their translations into Spanish. For this training process, the 200,000 corresponding pairs of linguistic trees (LF structures, in essence deep dependency structures very similar for both languages) are produced by two separate analysers. Then, a “MindNet” is computed for the language pair. This is a graph putting into correspondence the “most useful” subtrees found in those pairs. A node can correspond to a node or a subtree, and a subtree can correspond to a subtree or even to a unique node (e.g., an expression like “kick the bucket” could correspond to “morir”). This computation of structural correspondences is very time-consuming, but the result can be used very efficiently to produce a translation. For translation, the input is first analysed. Then the resulting tree is “paved” in an optimal way with (English) MindNet subtrees. The associated Spanish tree fragments are attached to the English tree, and then “stitched” together to form a Spanish tree. If it is not a legal LF structure (something which does occur at times after only one year of development), handcrafted normalisation rules are applied. Then the (symbolic but possibly preference-driven) generator generates the Spanish utterance. The converse system (Spanish-English) can be obtained from the same data.
5.2 – Implementation techniques
81Specialised languages for linguistic programming or “SLLPs” (earlier called “metalanguages”) have been around since the sixties, at least to aid in the production of MT dictionaries. Linguistic programming has often been done using a classical programming language (macroassembler and the C-macroprocessor for Systran, LISP for parts of METAL, etc.).
82The idea is simply to define a symbolic language familiar to the linguist, and then to compile linguistic data or programs written in this SLLP into interpretable or directly executable binary code. The first complete SLLP was COMIT, developed at MIT in the sixties.
83There are two competing approaches here. The first is to define an SLLP as the implementation of a linguistic theory or linguistic model. For example, CETA built in 1962 an SLLP for context-free grammars in Chomsky’s normal form, extended with simple and “vehicular” attributes, the latter used to represent long-distance dependencies. When an SLLP is based upon a theory in this way, its underlying “engine”, or basic algorithm, is not controllable by the linguist. The second approach is to build a tool containing predefined powerful data structures (such as labelled or annotated trees, graphs of trees, etc.) and powerful control structures (such as pattern-matching, unification, rewriting, iteration, recursion, non-determinism, etc.).
5.3 – Linguistic resource acquisition
84Large-scale lexical acquisition has been a major problem since the beginnings of MT, almost 50 years ago. There are three main approaches, each with its context, methodology, and pros and cons. The first approach consists in working directly on dictionaries specialised for MT; the second in creating specialised lexical data bases, generally asymmetrical and proprietary, and sometimes usable for applications different from MT; and the third in building lexical data bases which can not only be used in multiple ways by both humans and machines, but are also intrinsically symmetrical, linguistically very detailed, potentially very large in terms of both the number of entries and the number of languages, and open. For the future, if MT systems are to cover all pairs of languages and all domains, it seems that the only approach which might circumvent prohibitive costs is that of Linux: a collaborative, Web-based approach to the creation and usage of lexical resources.
85For any of these methods, it is necessary to use corpora, to obtain primary information and to test “refined” information. In recent years, corpus linguistics has led to the development of very powerful automatic tools, based on low-level linguistic processing and sophisticated statistical techniques, which enormously help lexical acquisition. For example, there are now effective terminology extractors which work on monolingual documents and on bilingual aligned texts. However, their results must be refined by human labor: as many as 40% of the suggestions may be wrong, and information automatically obtained from the remaining 60% must still be revised.
86The acquisition of syntactic and semantic knowledge remains mostly a human responsibility, reserved for specialists. However, in hybrid techniques, programs can to some degree be used to learn from examples. Suppose we have a “tree bank” of utterances with associated (correct) linguistic trees. There is then the possibility, already mentioned, of learning the weights of the transitions of a non-deterministic automaton (such as an LR(k) left-corner all-path pushdown-stack automaton). Another possibility is to learn rules as generalisations of standard rules, but much more detailed, as is done by Y. Matsumoto’s lab at NAIST (Nara). These researchers use one hybrid variant of dependency grammars employing rules of the form (cat1, cat2, rel, p, w), where cat1 and cat2 are morpho-syntactic categories (noun, verb…), rel is a deep syntactic relation (agent, patient, cause…) from cat1 to cat2, w is a weight, and p is an integer representing the position of the determining word (cat1) relative to the determined word (cat2). From a tree bank, they extract richer vectors of the form (cat1, lex1, att1, cat2, lex2, att2, rel, p, w), containing not only the categories, but also the lexemes and their grammatical attributes, such as gender, sex, time, tense, number, quantity, etc. (These attributes may belong to different linguistic levels: syntactic, semantic, etc.) Using automatic classification techniques, the vectors collected from the tree bank are then clustered. The algorithm performing the deep syntactic analysis of a sentence - already analysed morphologically into a lattice of vectors of the form (cat, lex, att) - repeatedly asks questions of the form: “can word1 (cat1, lex1, att1) be related to word2 (cat2, lex2, att2) by relation rel, with weight w, given that word1 is at position p relative to word2?” The answer is Yes if and only if (cat1, lex1, att1, cat2, lex2, att2, rel, p, w) is in the automatically formed vector cluster. When the answer is in fact Yes, we can consider that the rule (cat1, lex1, att1, cat2, lex2, att2, rel, p, w) has been learned from the corpus.
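The question asked by the analyser can be sketched as a simple membership test. This is only an illustration under explicit assumptions: the clustering step is omitted, the learned vectors are stored as a plain dictionary, the single entry is invented, and the sign convention for p (negative meaning that word1 precedes word2) is a guess, not something stated by the original work.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# A morphologically analysed word: (cat, lex, att).
Word = Tuple[str, str, FrozenSet[str]]
# Key of a learned vector: (word1, word2, rel, p); the stored value is the weight w.
RuleKey = Tuple[Word, Word, str, int]

# Hypothetical result of tree-bank extraction and clustering (one invented entry).
learned: Dict[RuleKey, float] = {
    (("noun", "dog", frozenset({"sg"})),
     ("verb", "bark", frozenset({"pres"})),
     "agent", -1): 0.9,
}


def can_relate(word1: Word, word2: Word, rel: str, p: int,
               rules: Dict[RuleKey, float]) -> Optional[float]:
    """Return the weight w if the vector was learned, else None (answer 'No')."""
    return rules.get((word1, word2, rel, p))


dog: Word = ("noun", "dog", frozenset({"sg"}))
bark: Word = ("verb", "bark", frozenset({"pres"}))

yes = can_relate(dog, bark, "agent", -1, learned)   # vector present
no = can_relate(dog, bark, "agent", +1, learned)    # same words, wrong position
```

Storing the weight as the dictionary value rather than as part of the key is a design choice of this sketch: the analyser usually does not know w in advance and wants to retrieve it together with the Yes answer.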
6 – State of the art
6.1 – Existing systems: 3 categories
87To assess current MT systems, it is useful to distinguish three main goals: MT aiming at rough translations of texts, MT aiming at quality translation of texts, and MT of speech.
6.1.1 – "Rough" MT for comprehension
88Many MT systems are currently available, at low prices, for assimilation (that is, basic comprehension). The obtained translations are “rough” but often adequate. They cannot practically be directly revised to obtain quality translation, but readers understand the gist of the information, or at least its topics, and translators can use the result as a “suggestion”, just as they use suggestions from translation memories. The main uses for comprehension MT now are for surfing the Web and for seeking information, although military, economic, and scientific intelligence were early users, and demand continues in these areas. The number of available language pairs is in fact very low compared to the needs. For example, an official of the EU reported at LREC’2000 that EC-Systran had only 19 pairs after 24 years of development, 8 of them “almost satisfactory”. However, there are 110 language pairs in the EC. In Japan (and similarly in China), very few language pairs are offered besides English ↔ Japanese and English ↔ Chinese. Russian is offered for two or three pairs, and Thai only for English ↔ Thai. Some Web sites claim to offer many language pairs, by translating through English. Unfortunately, the results are terrible, for reasons explained above.
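The figure of 110 pairs follows from simple counting, which the short sketch below reproduces. The value of 11 languages is an assumption inferred from the total (11 × 10 = 110 ordered pairs); the counts 19 and 8 are those quoted in the paragraph above.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


EC_LANGUAGES = 11            # assumption: chosen because 11 * 10 = 110
total = directed_pairs(EC_LANGUAGES)

available = 19               # EC-Systran pairs reported at LREC'2000
satisfactory = 8             # of which "almost satisfactory"

coverage = available / total             # share of pairs that exist at all
good_coverage = satisfactory / total     # share judged almost satisfactory
```

Under these figures, fewer than one pair in five exists, and fewer than one in ten is considered almost satisfactory.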
89The identified obstacles here are
- the cost of developing the first commercial version of a new language pair (at least 40 man-years according to the CEO of Softissimo),
- the direct approach, which makes it impossible to combine two systems without dramatically lowering the quality,
- the law of diminishing returns: each new language pair to be developed usually corresponds to a lesser need than the previous one, hence there are fewer users/buyers, all expecting to pay no more than the cost of already available language pairs.
6.1.2 – Quality “raw” MT for dissemination of information
90We find here specialised systems for (rare) niches, such as METEO (Chandioux, 1988), ENGSPAN, SPANAM (Vasconcellos and León, 1988), METAL (Slocum, 1985), LMT (McCord, IBM), CATALYST (Caterpillar-CMU), perhaps some LOGOS systems, etc. In Japan, we might mention ALT/Flash (the NTT system for Nikkei stock market flash reports) and perhaps some specialised systems, mostly ENG-JAP, used internally for translating technical manuals (AS-Transac at Toshiba, ATLAS-II at Fujitsu, CrossRoad at NEC, SHALT at IBM, Pensée at OKI, etc.). In Europe, few such systems are now available, due to the relatively small market, and to the negative attitude of the EC and all governments towards funding quality MT.
91Quality MT systems for information dissemination are very rare. However, these systems are indeed very good and very useful (30 Mwords/year for METEO, 75% ENG-FRE and 25% FRE-ENG, with 1 min of revision per page, 0.15 cents/word for final output), because they are quite specialised.
92It is extremely difficult to prepare comparative benchmarks for such systems, because, like expert systems, they are very good on their domain, and fail miserably on other tasks. The best way to measure them is through some combined assessment of the buying, maintenance, and evolution prices, and through consideration of the human time needed to obtain a professional result.
93Technically, these systems almost always have a separate analysis component, producing a syntactic or syntactico-semantic descriptor of the source unit of translation (usually an annotated or decorated tree). Almost all use some flavour of the transfer approach (even systems like ATLAS-II by Fujitsu or PIVOT-CROSSROAD by NEC). In most cases, there is no syntactico-semantic descriptor of the target UT, transfer and generation being merged into a single phase using recursive descent of the analysis tree. Hence, changing the source language implies redoing all the work [12].
94The identified obstacles here are
- the cost of developing the first commercial version of a new language pair: at least 100 man-years according to H. Sakaki, the main author of KATE at KDD, and to Pr. Nagao, director of the MU project, and perhaps 300 man-years with large dictionaries, as H. Uchida estimated for ATLAS-II at Fujitsu,
- the impossibility of factorizing generation processes, when the situation changes from 1→N to m→N,
- the fact that, although systems with full transfer structure could be used to produce “all language pairs” by combining systems at the levels of the structural descriptors, there seem to be no industrial situations at the moment calling for high quality for many or all language pairs.
6.1.3 – Speech translation
95Current commercially available technology makes speech translation already possible and usable for “chat MT”. Such systems are usually built by combining speech recognition (SR), text MT, and speech synthesis. NEC has demonstrated a system for JAP↔ENG at Telecom’99. Linguatec, a subsidiary of IBM, markets Talk&Translate (it uses Via Voice and LMT for ENG↔GER). The quality is of course not very high in all components, but this drawback is compensated by the broad coverage, by some feedback (e.g., editable output of SR and written reverse translation), and by the fact that users are intelligent humans wanting to communicate.
96At the research level, the aim is to obtain higher quality while allowing more “spontaneous” speech, in task-oriented situations. The large German VerbMobil project (1992-2000) has shown the feasibility of reaching these goals, and has compared many alternative methods in the same task setting (Wahlster, 2000). The goals can also be reached in a multilingual setting, as demonstrated by the CSTAR consortium in intercontinental public demos with large media coverage in July 1999. Participants used a kind of “semantico-pragmatic” pivot designed to represent the utterances of participants in a limited set of situations (e.g. exchanging tourism information, booking hotels or tickets for events and transports, etc.).
97The higher quality is necessary here because at least one participant (the “agent”) is a professional who must work fast. S/he may adapt to the system, but still can not afford to repeat each utterance two or three times until the system understands it correctly. The higher spontaneity is necessary because at least one participant (the “client”) is supposed to be a naive and occasional user of the system.
98The identified obstacles here are
- the great difficulty of developing an adequate pivot (such as the IF or Interface Format of CSTAR),
- the cost of building the necessary lexical resources, as for MT of texts,
- the difficulty of handling the context (pragmatic, discursive, linguistic, lexical), in particular to compute correctly the speech acts and the referents of anaphoras and elisions.
6.2 – Evaluating MT
6.2.1 – Variety of evaluation “grids”
99The problem of evaluating MT systems has been the subject of many studies over the last fifty years. Since comprehension-MT systems have become commercial, comparative studies often appear in magazines. In fact, there are many possible sets of criteria.
100Here, we will distinguish between internal and external criteria. The first, not really important for users, concern the linguistic and algorithmic architecture of the system. The second are static or dynamic: a system is judged by its state or its possibilities of evolution. Let us detail the second, more user-oriented criteria.
6.2.2 – Subjective external criteria
101Readability is difficult to define. The impression of grammaticality contributes to it, as well as the severity of the most frequent errors, the preservation of the initial formatting, the typographic presentation of annotations (e.g. pointing out uncertain segments or multiple translations), and the techniques for displaying the correspondence with the original. To give a readability grade, it is not sufficient to measure the average reading speed; it is also necessary to evaluate the global impression on the reader.
102Intelligibility is first evaluated utterance by utterance, after which an average over the text is computed. The intelligibility of an utterance reflects the effort necessary to understand it and “rephrase” it correctly, whether or not it is a good or bad translation of the original.
103Fidelity is the quality of transmission of the “message” expressed by both the content and the form of an utterance. An exact paraphrase will be judged less faithful than a literal translation.
104Flexibility reflects the ease of installing, configuring, activating, and modifying the system. For example, if there are many dictionaries with different priority lists for different configurations, the flexibility score reflects the quality of the interface for building these lists, for naming them, for associating them with types of documents, etc. When the system must be modified, this score assesses the available tools, e.g. for modifying the system’s dictionaries and (more rarely) grammars and algorithms.
6.2.3 – Objective external criteria
105To grade grammaticality, one counts errors and gives penalties according to the error types, much as in language instruction.
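A penalty-based grade of this kind might be computed as in the sketch below. The error types, the penalty values and the normalisation per 100 words are all hypothetical choices made for illustration; the text does not prescribe any particular scale.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical penalty table: heavier penalties for more disruptive error types.
PENALTIES: Dict[str, float] = {
    "agreement": 1.0,
    "word_order": 2.0,
    "missing_word": 3.0,
}


def grammaticality_penalty(errors: Iterable[str], n_words: int,
                           penalties: Dict[str, float] = PENALTIES) -> float:
    """Weighted error count, normalised to penalty points per 100 words."""
    counts = Counter(errors)                 # how many errors of each type
    total = sum(penalties[kind] * n for kind, n in counts.items())
    return 100.0 * total / n_words


# Annotated errors found by a grader in a 50-word translated passage (invented).
observed = ["agreement", "agreement", "word_order"]
score = grammaticality_penalty(observed, n_words=50)
```

Normalising by text length makes scores comparable across passages; a lower value means a more grammatical output.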
106Terminological precision seems a somewhat less objective factor to measure, because different equivalent terms may be appropriate in different contexts. It is necessary that the context be specified, that a precise terminology be associated with it, and that the system be claimed to supply the right word in context.
107The cost of an MT system includes not only its list price, but also its speed and disk and memory usage. The number of words translated per hour is often presented as a good measure for speed. In fact, this figure depends on the situation. A better measure is the user’s total waiting time, which includes all auxiliary processes, such as filtering (format transformation), segmenting, management, and transmission. For Web surfing, the delay should be less than 2-3 seconds per page, low quality being acceptable, while a delay of 2-3 hours may be acceptable for information dissemination-MT - but then higher quality is needed.
108In dissemination-MT (“MT for revisers”), the time for human revision is the single most important objective criterion. If the translation quality is too low, this time will exceed the average time needed to revise a first draft produced by a human translator by more than 50%, at which point human revisers usually give up and prefer to start from scratch, and productivity is lost. But the revision time may actually become lower than after human translation, not only because the quality of the raw translation may be good to very good, but also because MT errors are more systematic than human errors, and because revisers are less reluctant to correct a machine than a human colleague (who usually receives a copy of the corrections).
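The 50% threshold described above can be expressed as a ratio test. This is a minimal sketch: the function names and the sample timings are hypothetical, and only the 1.5 ratio comes from the text.

```python
def revision_ratio(mt_revision_time: float, human_draft_revision_time: float) -> float:
    """Time to revise raw MT output relative to revising a human first draft."""
    return mt_revision_time / human_draft_revision_time


def worth_revising(mt_revision_time: float, human_draft_revision_time: float,
                   give_up_ratio: float = 1.5) -> bool:
    """False once MT revision costs over 50% more than revising a human draft."""
    return revision_ratio(mt_revision_time, human_draft_revision_time) <= give_up_ratio


# Invented timings, in minutes per page.
too_slow = worth_revising(20, 10)     # ratio 2.0: revisers start from scratch
acceptable = worth_revising(12, 10)   # ratio 1.2: revision is still worthwhile
gain = revision_ratio(8, 10)          # ratio 0.8: faster than after human translation
```

A ratio below 1 corresponds to the favourable case mentioned in the paragraph, where revising MT output takes less time than revising a human draft.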
109Time for dictionary insertion is a good objective measure of the extensibility of a system to new domains. This measure depends as much on the richness of the linguistic information to be entered as on the user-friendliness of the dictionary interface, which should supply efficient tools for importing lexical information from available sources before adapting them to the MT system.
6.3 – Commented examples of current translation techniques
6.3.1 – Examples of semi-direct MT (Systran)
110Here are two versions of the introduction of the HEREIN (European Heritage network) thesaurus, taken from http://www.european-heritage.net/en/Thesaurus/Contenu.html and http://www.european-heritage.net/fr/Thesaurus/Contenu.html. These French and English versions, written by humans, are aligned below with the output of the Systran French-English Web translator. Observe that, while the human language is of very high lexical and grammatical quality in both languages, some information present in the French version is totally absent in the English version (we have stricken out the missing information to highlight it). Thus, if the source language was English, it appears that information has been added; if the source was French, information has apparently been suppressed. We must conclude that, even if it looks very good, human translation is not always as good as it seems.
111Some errors of the MT output (italicised) can be repaired by looking only at the result, but many cannot. Some errors are evident because parts of the sentences don’t make sense, but others may be quite well hidden.
112Note also that translation ambiguities sometimes remain in the Systran output, shown by the use of slashes, as in the translation of French rapports by English reports/ratios.
ENGLISH (human version) | FRENCH (human version) | ENGLISH (Systran FRE-ENG version) |
---|---|---|
With around 500 terms in each language, the thesaurus was compiled by sifting through the available printed material, such as reports on policies and various existing glossaries or thesauruses, using methods which meet the current international standards. | Fort d’environ 500 termes dans chacune des langues, le thesaurus a été compilé | Extremely of approximately 500 terms in each language, the thesaurus was built starting from the systematic exploitation of the paper documents available: reports/ratios on the policies and various tools - lexicons or preexistent thesaurus, while following a methodology which respects the international standards into force. |
The European-Heritage.net thesaurus covers the fields of archaeology and architecture as defined in the Council of Europe conventions signed in Granada (1985) and Malta (1992). | Le thesaurus European-Heritage.net couvre les champs de l’archéologie et de l’architecture au sens des conventions du Conseil de l’Europe de Grenade (1985) et de Malte (1992). | The European-Heritage.net thesaurus covers the fields of archaeology and architecture within the meaning of conventions of the Council of Europe of Grenade (1985) and Malta (1992). |
It encompasses information ranging from the partners involved, categories of cultural assets and legislation, to activities, skills and funding. It is supplemented by a number of specific thesauruses compiled by each member state on a particular topic, such as the thesaurus on Andalusian heritage or the architectural thesaurus from the Mérimée database in France. | Il prend en compte des aspects aussi variés que les acteurs, les catégories de biens culturels, la législation ou encore les interventions, les métiers et les financements. Il est complété et prolongé par des thesaurus spécifiques développés par chaque Etat membre sur tel ou tel sujet spécifique, comme le thesaurus du patrimoine | It takes into account aspects as varied as the actors, the categories of cultural goods, the legislation or the interventions, the trades and the financings. It is supplemented and prolonged by thesaurus specific developed by each Member State on such or such specific subject, like the thesaurus of the Andalusian historical inheritance or the thesaurus of architecture of the documentation data base Mérimée in France. |
This new, open-ended search tool will come on line shortly, together with a management and administration system shared among the various contributors. | Cet instrument de recherche, | This instrument of search, inevitably evolutionary, will be put soon on line accompanied by a device of management and administration distributed between the various contributors. |
113Here are some sample results of Systran’s English-German and French-German Web translators.
GERMAN (Systran ENG-GER version) | GERMAN (Systran FRE-GER version) |
---|---|
Der European-Heritage.netthesaurus umfaßt die Felder von archaeology und von Architektur, wie in den Europaratvereinbarungen definiert, die in Granada (1985) unterzeichnet werden und in Malta (1992). | Der European-Heritage.net-Thesaurus bedeckt die Felder der Archäologie und der Architektur im Sinne der Übereinkommen des Europarats von Granada (1985) und von Malta (1992). |
Er gibt die Informationen um, die von den betroffenen Partnern, von den Kategorien der kulturellen Werte und der Gesetzgebung, bis zu Aktivitäten, von den Fähigkeiten und von der Finanzierung reichen. Er wird durch eine Anzahl von | Er berücksichtigt Aspekte dermaßen variierte, daß die Beteiligten, die Kategorien kultureller Güter, die Gesetzgebung oder noch die Interventionen, die Berufe und die Finanzierungen. Er wird vervollständigt und wird durch ein spezifische Thesaurus entwickelt durch jeder Mitgliedstaat über das eines oder andere spezifische Thema verlängert, als der Thesaurus des andalusischen historischen Kulturgutes oder der Thesaurus der Architektur der urkundlichen Datenbank Mérimée in Frankreich. |
Dieses neue, offene Suchhilfsmittel kommt auf Zeile kurz, zusammen mit einem Management-und Leitungssystem, das unter den verschiedenen Mitwirkenden geteilt wird. | Dieses notgedrungen entwicklungsfähige Forschungsinstrument wird gestellt demnächst online begleitet von einer Verwaltungs- und Verwaltungsvorrichtung, die aufgeteilt unter den verschiedenen Beitragenden. |
114The errors of Systran’s English-German Web translator are slightly more severe, but only the last paragraph really makes no sense in German. Although the French-German language pair is one of the highest quality pairs now available, its output is really not adequate for understanding the content. No translator would really start from it to produce a quality translation via revision of the usual sort, but it is possible to use the output as a source of suggestions, from which the translator can pick some well-translated parts.
6.3.2 – Examples of high quality transfer MT for revisers (EngSpan & SpanAm)
115EngSpan and SpanAm are the two MT systems developed by the Pan American Health Organization (PAHO) to translate texts concerning health, although these systems have quite large vocabularies outside the health field, and can handle standard press articles. The following text has been translated from English into Spanish by EngSpan, then revised manually, and then translated back into English by SpanAm. We give first the two “endpoints”, and then the raw and revised Spanish versions.
Original English text | SpanAm raw translation of revised Spanish EngSpan output |
---|---|
A Message from the Director-General, World Health Organization | Message of the Director-General of the World Health Organization |
Since their discovery, antibiotics have completely transformed humanity’s approach to infectious disease. Today, the use of antibiotics combined with improvements in sanitation, housing, and nutrition alongside the advent of widespread vaccination programmes, have led to a dramatic drop in once common infectious diseases that formerly laid low entire populations. | From its discovery, antibiotics have completely transformed the perspective of humankind with respect to infectious diseases. Today the use of antibiotics, combined with improvements in sanitation, housing, and nutrition, together with the advent of the vaccination programs generalized, have caused a notable reduction of infectious diseases that previously were common and annihilated entire populations. |
Scourges that once struck terror into the hearts of millions – plague, whooping cough, polio and scarlet fever – have been, or are, on the verge of being controlled. Now, at the dawn of a new millennium, humanity is faced with another crisis. | Scourges that terrified millions of people, as plague, whooping cough, poliomyelitis, and the scarlatina, have been controlled or are on the verge of being controlled. Now, in the dawn of a new millennium, humankind faces another crisis. |
Formerly curable diseases such as gonorrhoea and typhoid are rapidly becoming difficult to treat, while old killers such as tuberculosis and malaria are now arrayed in the increasingly impenetrable armour of antimicrobial resistance. | Previously curable diseases as the gonorrhea and typhoid fever are becoming rapidly difficult to treat, while old assassins as tuberculosis and malaria now are armed of the increasingly impenetrable resistance to the antimicrobial drugs. |
This phenomenon is potentially containable. It is a deepening and complex problem accelerated by the overuse of antibiotics in developed nations and the paradoxical underuse of quality antimicrobials in developing nations owing to poverty and a resultant dearth of effective healthcare. | This phenomenon is potentially contenible. The problem is increasingly profound and complex, accelerated by the abuse of antibiotics in the developed countries and the paradoxical underutilization of the quality antimicrobial drugs in the developing countries due to the poverty and to the scarcity resulting from an effective health care. |
Spanish EngSpan raw output | Postedited Spanish version |
---|---|
Un mensaje del Director General, Organización Mundial de la Salud | Mensaje de la Directora General de la Organización Mundial de la Salud |
Desde su descubrimiento, los antibióticos completamente han transformado el enfoque de la humanidad con respecto a la enfermedad infecciosa. Hoy, el uso de los antibióticos combinados con mejoras en el saneamiento, la vivienda y la nutrición al lado del advenimiento de los programas de vacunación generalizada, han conducido a una notable disminución | Desde su descubrimiento, los antibióticos han transformado completamente la perspectiva de la humanidad con respecto a las enfermedades infecciosas. Hoy día el uso de los antibióticos, combinado con mejoras en el saneamiento, la vivienda y la nutrición, junto con el advenimiento de los programas de vacunación generalizada, han dado lugar a una notable disminución de enfermedades infecciosas que antes eran comunes y aniquilaban a poblaciones enteras |
Flagelos que aterrorizaron a millones de personas, como la peste, la tos ferina, la poliomielitis y la escarlatina, se han controlado o están a punto de controlarse. Ahora, en el alba de un nuevo milenio, la humanidad se enfrenta con otra crisis. | |
Enfermedades antes curables como la gonorrea y la fiebre tifoidea se están volviendo rápidamente difíciles de tratar, mientras que viejos asesinos como la tuberculosis y el paludismo están ahora armados de la crecientemente impenetrable resistencia a los antimicrobianos. |
6.3.3 – Comparison of outputs from two systems (SpanAm & Reverso)
The following samples illustrate the differences between a specialized system aiming at high quality and a more generic system aiming at large coverage for accessing information. We have italicised dubious translations, stricken out words to be suppressed, and underlined corresponding correct fragments (if any) in the other translation.
SpanAm raw Spanish-English output (repeated) | Reverso raw Spanish-English output |
---|---|
Message of the Director-General of the World Health Organization | Message of the Chief operating officer of the World Organization of the Health |
From its discovery, antibiotics have completely transformed the perspective of humankind with respect to infectious diseases. Today the use of antibiotics, combined with improvements in sanitation, housing, and nutrition, together with the advent of the vaccination programs generalized, have caused a notable reduction of infectious diseases that previously were common and annihilated entire populations. | From his{*its*} discovery, the antibiotics have transformed completely the perspective of the humanity with regard to the infectious diseases. Today the use of the antibiotics, cocktail with improvements in the reparation, the housing and the nutrition, together with the advent of the programs of widespread vaccination, |
Scourges that terrified millions of people, as plague, whooping cough, poliomyelitis, and the scarlatina, have been controlled or are on the verge of being controlled. Now, in the dawn of a new millennium, humankind faces another crisis. Previously curable diseases as the gonorrhea and typhoid fever are becoming rapidly difficult to treat, while old assassins as tuberculosis and malaria now are armed of the increasingly impenetrable resistance to the antimicrobial drugs. | Scourges that terrified million persons, as the pest, the savage cough, the poliomyelitis and the scarlatina, they have been controlled or are on the verge of be controlling. Now, in the dawn of a new millenium, the humanity faces |
This phenomenon is potentially contenible. The problem is increasingly profound and complex, accelerated by the abuse of antibiotics in the developed countries and the paradoxical underutilization of the quality antimicrobial drugs in the developing countries due to the poverty and to the scarcity resulting from an effective health care. | This phenomenon is potentially contenible. The problem is increasingly deep and complex, accelerated by the abuse of the antibiotics in the developed countries and the paradoxical subutilization of the antimicrobial ones of quality in the countries in development due to the poverty and the resultant shortage of an attention of effective health. |
The report on the last year on infectious diseases titled «Elimination of the obstacles to the healthy development» has demonstrated that the communicable diseases continue to be a significant cause of disability, are responsible for high continuous mortality, and affect mainly the most vulnerable populations of the world. | The report of last year on the infectious diseases titled «Elimination of the obstacles to the healthy development» has demonstrated that the contagious diseases continue being a significant reason of disability, they are responsible for the high constant mortality and affect principally the most vulnerable populations of the world. |
7 – Perspectives: four keys to the generalization of MT in the Future
Despite considerable investment over the past 50 years, only a small number of language pairs is covered by MT systems designed for information access, and even fewer are capable of quality translation or speech translation. To open the door toward MT of adequate quality for all languages (at least in principle), four keys are needed.
On the technical side, one should
(1) dramatically increase the use of learning techniques which have demonstrated their potential at the research level,
and (2) use pivot architectures, the most universally usable pivot being UNL.
On the organisational side, the keys are
(3) the co-operative development of open source linguistic resources on the Web,
and (4) the construction of systems where quality can be improved “on demand” by users, either a priori through interactive disambiguation, or a posteriori by correcting the pivot representation through any language, thereby unifying MT, computer-aided authoring, and multilingual generation.
Glossary
Aspect. Semantic aspect is the relation of a process with time: finished or not, single or repeated action, beginning, ongoing or ending process, etc. Morphological aspect (e.g., perfective and imperfective in Russian) exists in many languages in the form of morphemes (e.g., in Russian, prefixes and infixes) and corresponds to one or more semantic aspects and other characteristics (e.g., imperfective to an unfinished, durative or habitual process; perfective to a finished or future process).
Context-free grammar. A CFG is a formal grammar G=<T,N,S,R> made of two disjoint finite vocabularies T and N, T for “terminal” elements (the words of a language) and N for “non-terminal” or “auxiliary” elements (names for syntagms such as Verbal Clause, Nominal Phrase, etc.), an axiom S and a set of rules R. The union T∪N is the vocabulary V of the grammar. The “axiom” (S for “Sentence”) is a non-terminal symbol. The rules are of the form A → X1 X2…Xn where A is a non-terminal, the Xi are terminal or non-terminal elements, and n ≥ 0. If n=0, the rule is an erasing rule and is written A → ε, where ε denotes the “empty string”.
A syntactic tree is any oriented and ordered tree labeled on V∪{ε} such that any subtree of height 1 B(Y1…Ym) corresponds to a rule B → Y1…Ym in R. L(G), the language associated with G, is the set of strings w over T such that w is the “frontier word” (the sequence of symbols labelling the leaves read from left to right) of a syntactic tree τ(w) having its root labeled by the axiom S. τ(w) is a “structural descriptor” of w according to G. w is ambiguous for G if it has more than one structural descriptor.
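As a minimal sketch of these definitions (not part of the original glossary), a CFG can be encoded in Python as a set of rules and a syntactic tree as nested pairs. The toy grammar and the names `RULES`, `frontier`, `is_syntactic_tree` and `is_structural_descriptor` are illustrative assumptions; an erasing rule is represented here by a node with no children rather than by an explicit empty-string leaf.

```python
# Toy grammar (hypothetical): T = {"a", "b"}, N = {"S"}, axiom S.
# A rule A -> X1 ... Xn is stored as (A, (X1, ..., Xn)); n = 0 is an erasing rule.
TERMINALS = {"a", "b"}
NONTERMINALS = {"S"}
AXIOM = "S"
RULES = {("S", ("a", "S", "b")), ("S", ())}

# A tree is either a terminal leaf (a string) or a pair (label, [children]).


def frontier(tree):
    """Return the frontier word: the leaf symbols read from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [symbol for child in children for symbol in frontier(child)]


def is_syntactic_tree(tree):
    """Check that every subtree of height 1, B(Y1...Ym), matches a rule in RULES."""
    if isinstance(tree, str):
        return tree in TERMINALS
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return (label, rhs) in RULES and all(is_syntactic_tree(c) for c in children)


def is_structural_descriptor(tree, word):
    """True if `tree` is rooted in the axiom, obeys the rules, and has frontier `word`."""
    return (
        not isinstance(tree, str)
        and tree[0] == AXIOM
        and is_syntactic_tree(tree)
        and frontier(tree) == list(word)
    )


# S(a S(a S() b) b) is a structural descriptor of the word "aabb".
tree = ("S", ["a", ("S", ["a", ("S", []), "b"]), "b"])
print(frontier(tree), is_structural_descriptor(tree, "aabb"))
```

In this encoding, a word belongs to L(G) exactly when at least one tree passes `is_structural_descriptor` for it, and it is ambiguous when more than one does.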
Chomsky’s normal form. A CFG G is in CNF if its rules are of the form A → B C or A → b or S → ε, where A, B, C are non-terminals and b is a terminal. Then, any structural descriptor is made of a binary tree of non-terminals, completed by the insertion of a terminal under each (non-terminal) leaf. The empty word may be in L(G), but then the rule S → ε must be in R, and ε has only one structural descriptor, S(ε).
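Because structural descriptors in CNF are binary trees, they can be counted with a chart in the style of the CYK algorithm. The sketch below is an illustration under stated assumptions, not a method described in this chapter: the function name `count_descriptors`, its parameters and the toy grammar are hypothetical, and the axiom is assumed not to occur on the right-hand side of any rule when the erasing rule for S is present.

```python
from collections import defaultdict


def count_descriptors(word, binary_rules, lexical_rules, axiom="S", allows_empty=False):
    """Count the structural descriptors of `word` for a grammar in CNF.

    binary_rules:  iterable of (A, B, C) for rules A -> B C
    lexical_rules: iterable of (A, b) for rules A -> b
    allows_empty:  True if the erasing rule for the axiom is in R
    """
    n = len(word)
    if n == 0:
        # The empty word has exactly one descriptor when the erasing rule exists.
        return 1 if allows_empty else 0
    # chart[i][j][A] = number of trees rooted in A whose frontier is word[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, symbol in enumerate(word):
        for a, b in lexical_rules:
            if b == symbol:
                chart[i][i + 1][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two subtrees
                for a, b, c in binary_rules:
                    chart[i][j][a] += chart[i][k][b] * chart[k][j][c]
    return chart[0][n][axiom]


# Toy ambiguous grammar (hypothetical): S -> S S, S -> a.
binary = [("S", "S", "S")]
lexical = [("S", "a")]
for w in ["a", "aa", "aaa", "aaaa"]:
    print(w, count_descriptors(w, binary, lexical))
```

A word is in L(G) when the count is at least 1 and is ambiguous for G when the count exceeds 1; with the toy grammar, every word of three or more symbols is ambiguous.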
Sublanguage. A part of a natural language defined by a particular domain and style: the domain restricts the words employed and their meanings, and the style restricts the grammatical constructs and their interpretations (e.g., infinitival sentences in French cooking recipes have an imperative interpretation).
Typology. A type of text, defined by a sublanguage and formatting conventions. For example, stock market flash reports all in uppercase, or IBM manuals for AIX in XHTML with special “entities” to denote command and product names.
References
- Boitet, C. & Nédobejkine, N. (1981) : Recent developments in Russian-French Machine Translation at Grenoble. Linguistics 19, 199-271.
- Boitet, C. (1993) : La TAO comme technologie scientifique : le cas de la TA fondée sur le dialogue. In La traductique, edited by Clas, A. and Bouillon, P., Montréal, Presses de l’Université de Montréal, 109-148.
- Boitet, C. & Blanchon, H. (1994) : Multilingual Dialogue-Based MT for Monolingual Authors: the LIDIA Project and a First Mockup. Machine Translation 9-2, 99-132.
- Boitet, C. (1996) : (Human-Aided) Machine Translation: a better future? In Survey of the State of the Art of Human Language Technology, edited by Cole R., Mariani J., Uszkoreit H., Zaenen A. and Zue V., Pisa, Giardini, 251-256.
- Chandioux, J. (1988) : 10 ans de METEO (MD). In Traduction Assistée par Ordinateur. Actes du séminaire international sur la TAO et dossiers complémentaires, edited by Abbou A., Paris, March 1988, Observatoire des Industries de la Langue (OFIL), 169-173.
- Hutchins, W.J. (1986) : Machine Translation : Past, Present, Future. Ellis Horwood, Wiley & Sons.
- Lehrberger, J. & Bourbeau, L. (1988) : Machine Translation. Linguistic characteristics of MT systems and general methodology of evaluation. Amsterdam, John Benjamins.
- Maruyama, H., Watanabe, H. & Ogino, S. (1990) : An Interactive Japanese Parser for Machine Translation. Proc. of COLING-90, 20-25 August 1990, ACL 2/3, 257-262.
- Nyberg, E.H. & Mitamura, T. (1992) : The KANT system: Fast, Accurate, High-Quality Translation in Practical Domains. Proc. of COLING-92, 23-28 July 1992, ACL 3/4, 1069-1073.
- Planas, E. (1999) : Formalizing Translation Memories. Proc. of MT Summit VII, Singapore, 13-17 September 1999, Asia Pacific Ass. for MT, 331-339.
- Slocum, J. (1985) : A Survey of Machine Translation : its History, Current Status, and Future Prospects. Computational Linguistics, 11-1, 1-17.
- Vasconcellos, M. & León, M. (1988) : SPANAM and ENGSPAN : Machine Translation at the Pan American Health Organization. In Machine Translation systems, edited by Slocum J., Cambridge University Press, 187-236.
- Vauquois, B. (1975) : Some problems of optimization in multilingual automatic translation. Proc. of First National Conference on the Application of Mathematical Models and Computers in Linguistics, Varna, May 1975, 7 p.
- Wahlster, W. (2000) : Verbmobil: Foundations of Speech-to-Speech Translation. In Artificial Intelligence. Springer, Berlin.
- Wehrli, E. (1992) : The IPS System. Proc. of COLING-92, Nantes, 23-28 July 1992, 3/4, 870-874.
- Whitelock, P.J., Wood, M.M., Chandler, B.J., Holden, N. & Horsfall, H.J. (1986) : Strategies for Interactive Machine translation : the experience and implications of the UMIST Japanese project. Proc. of COLING-86, Bonn, 25-29 Aug. 1986, IKS, 25-29.