Project description:
The TTC project focuses on the automatic acquisition of aligned bilingual terminologies for computer-assisted translation (CAT) and machine translation (MT). The key steps of the project are the automatic extraction of monolingual terminologies from a large set of multilingual corpora and the bilingual alignment of the extracted terminologies.
Such terminologies could be extracted from parallel corpora, i.e. from previously translated texts, but such corpora are scarce: previously translated data is available only for some language pairs and a few specific domains. Thus, no parallel corpora are available for most specialized domains, especially emerging ones (such as renewable energy). As a consequence, the project develops methods and tools for the automatic extraction of terminologies from comparable corpora, i.e. corpora that cover the same domain but are not necessarily translations of each other. It also develops tools for gathering these comparable corpora (a topical web crawler), for managing them, and for managing terminologies.
At the end of the TTC project, a platform will be set up to compile and manage comparable corpora using standards (TMF, TBX) and the existing open source UIMA framework. The consortium will evaluate and validate this work on CAT tools and MT tools. To assess the impact of the TTC project outputs, technical documents from the aerospace and IT domains will be translated using CAT and MT techniques.
Objectives of the project:
The need for linguistic resources (terminologies, lexicons, translation memories, etc.) is overwhelming in any natural language application, but the problem is especially difficult for translation applications because of the cross-linguistic divergences and mismatches that arise at the lexical level. Lexicons and terminologies indeed play a central role in any machine translation tool, regardless of the theoretical foundations on which the tool is based (statistical machine translation, rule-based machine translation, example-based translation, etc.). Computer-assisted translation tools rely heavily on terminologies and translation memories to assist the human translator in the translation process, and include functions to create and maintain terminologies. Another functionality of computer-assisted translation is the dictionary-based generation of rough translations. In fact, some advanced computer-assisted translation solutions include controlled machine translation.
Beyond MT, the automatic translation of terminologies from one controlled vocabulary into another is essential to the integration and use of diverse information systems. Bilingual terminologies covering several languages provide adaptive and interoperable solutions for managing multilingual content and communication.
Briefly, the TTC project aims at:
- Compiling and using comparable corpora;
- Using a minimum of linguistic knowledge for candidate term extraction;
- Defining and combining different strategies for term alignment;
- Developing an open platform for use with MT and CAT tools, including solutions to manage comparable corpora as well as terminologies;
- Demonstrating the operational benefits on MT tools and CAT tools.
All these target outcomes share the same intended impact: improving translation in order to overcome language barriers through technological means. The final outcomes of the TTC project aim to improve translation activities, from industry documentation to multilingual content management.