In November 2007, the European Commission's Directorate-General
for Translation (DGT) made its multilingual Translation Memory
for the Acquis Communautaire (the body of EU law) publicly
accessible. It is a collection of parallel texts (texts and their translations,
also referred to as bi-texts) in 22 languages. This is a page for
technical users, where you will find a summary of this unique resource
and instructions on where to download it and how to produce bilingual
aligned corpora for any of the 231 language pairs (462 language
pair directions). For an example of one sentence translated into
all 22 languages, click here.
Please note that DGT-TM is not machine translation software.
If you are a non-technical user, you may be more interested in our
freely accessible news analysis applications, which you can find at
http://emm.jrc.it/overview.html.
The release of this linguistic resource follows
the public release - in May 2006 - of the JRC-Acquis
multilingual parallel corpus with sentence alignment for 231 language
pairs. Version 3.0 of the JRC-Acquis, which now also contains Bulgarian
as a 22nd language and which comprises a total of over 1 billion
words, was made available in April 2007. The data releases
of DGT and JRC are in line with the general effort of the European
Commission to support multilingualism, language diversity and the
re-use of Commission information.
The Acquis Communautaire
is the entire body of European legislation, including all the treaties,
regulations and directives adopted by the European Union (EU) and
the rulings of the European Court of Justice (see the Wikipedia
entry). Since each new country joining the EU is required to
accept the whole Acquis Communautaire, this body of legislation
is translated into 22 official languages. As a result, the Acquis
now exists as parallel texts in the following 22 languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese,
Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish.
For the 23rd official EU language, Irish, the Acquis is not translated
on a regular basis.
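The figures of 231 language pairs and 462 language pair directions follow directly from the 22 languages listed above. As a minimal check in Python (the variable names are purely illustrative):

```python
from itertools import combinations, permutations

# The 22 languages of the Acquis, as listed above.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: n * (n - 1) / 2
pairs = list(combinations(LANGUAGES, 2))
# Ordered pairs (source -> target): n * (n - 1)
directions = list(permutations(LANGUAGES, 2))

print(len(LANGUAGES), len(pairs), len(directions))  # 22 231 462
```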
A translation memory is a collection
of small text segments and their translation. These segments can be sentences
or sentence parts. Translation memories are used to support translators by
ensuring that pieces of text that have already been translated do not need
to be translated again.
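To illustrate the idea, an exact-match translation memory can be modelled as a lookup table from source segments to stored translations. This is only a sketch: the segments below are invented for illustration and are not taken from the DGT data, and real translation tools usually also offer approximate ("fuzzy") matching, which is not shown here.

```python
from typing import Optional

# Hypothetical English -> French segments, invented for this sketch.
memory = {
    "Article 1": "Article premier",
    "This Regulation shall enter into force on the day of its publication.":
        "Le présent règlement entre en vigueur le jour de sa publication.",
}

def lookup(segment: str) -> Optional[str]:
    """Return the stored translation for an exact match, or None."""
    return memory.get(segment.strip())

for seg in ["Article 1", "Article 2"]:
    hit = lookup(seg)
    print(seg, "->", hit if hit is not None else "(no match: needs translating)")
```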
Both translation memories and parallel texts are important
linguistic resources that can be used for a variety of purposes,
including:
- training automatic systems for statistical machine translation
(SMT);
- producing monolingual or multilingual lexical and semantic
resources such as dictionaries and ontologies;
- training and testing multilingual information extraction
software;
- checking translation consistency automatically;
- testing and benchmarking alignment software (for sentences,
words, etc.).
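As one example of these uses, an automatic consistency check can be sketched as grouping translation units by source segment and reporting every source that has more than one distinct translation. The function name and the sample pairs below are assumptions made for illustration; they are not part of the DGT distribution.

```python
from collections import defaultdict

def inconsistent_sources(units):
    """Map each source segment to its target variants, keeping only
    segments that have more than one distinct translation."""
    variants = defaultdict(set)
    for source, target in units:
        variants[source].add(target)
    return {src: sorted(tgts) for src, tgts in variants.items() if len(tgts) > 1}

# Invented English -> German pairs, for illustration only.
sample_units = [
    ("Article 1", "Artikel 1"),
    ("Member States", "Mitgliedstaaten"),
    ("Member States", "Mitgliedsstaaten"),
    ("Article 1", "Artikel 1"),
]

print(inconsistent_sources(sample_units))
```

A real check would probably normalise whitespace and case first, since trivial differences would otherwise be reported as inconsistencies.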
Generally speaking, parallel corpora are useful for all
types of cross-lingual research. The value of a parallel corpus grows with
its size and with the number of languages for which translations exist.
While parallel corpora exist in abundance for some languages, there are few
or none for most other language pairs. To our knowledge,
the Acquis Communautaire is the biggest parallel corpus
in existence, if we take into consideration both its size and the large
number of languages involved. The most outstanding advantage of the Acquis
Communautaire - apart from being freely available - is the number of rare
language pairs (e.g. Maltese-Estonian, Slovene-Finnish, etc.).
This extraction of aligned sentences can be used to produce
a parallel multilingual corpus of the legislative documents (Acquis Communautaire)
of the European Union in 22 EU languages. The aligned sentences ("translation
units") have been provided by the Directorate-General for Translation
of the European Commission by extraction from one of its large shared translation
memories in Euramis (European advanced multilingual information
system). This memory contains most, although not all, of the documents
of the Acquis Communautaire, as well as some other documents which are not
part of the Acquis.
In order to cut down the size, the extraction takes English
as the source language. The sequence in the extracted files is not necessarily
the same as in the underlying documents, and redundancies of text segments
like "Article 1" are inevitable. The documents in the files are identified
by the document number (Numdoc) of the original legislative document in the
EUR-Lex database, but it should be noted that these documents have been modified
(see section on pre-processing below). The documents are in TMX format, a
widely used format provided by LISA.
In order to be backwards compatible, the header mentions TMX format 1.1, but
the files are also compliant with TMX 1.4b. The texts are encoded in UTF-16
Little Endian. The source language of the documents and sentences is not known,
but many of the documents were originally written in English and then translated
into the other languages.
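For readers who want to inspect the TMX data themselves rather than use the extraction program, the following Python sketch shows one way to read translation units from a UTF-16 Little Endian TMX document. The element names tu, tuv and seg come from the TMX format; everything else is an assumption made for illustration: the function names, the stripped-down sample document, the language codes such as "EN-GB", and the choice to accept both the older lang attribute and the xml:lang attribute of TMX 1.4b.

```python
import codecs
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tuv_lang(tuv):
    """Language code of a <tuv>: 'xml:lang' if present, else 'lang'."""
    return (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()

def extract_pairs(tmx_bytes, source, target):
    """Yield (source_text, target_text) for each <tu> that has both languages.
    Codes are matched by prefix, so 'EN' also matches 'EN-GB'."""
    root = ET.fromstring(tmx_bytes)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv_lang(tuv)] = "".join(seg.itertext()).strip()
        src = next((t for code, t in segs.items() if code.startswith(source)), None)
        tgt = next((t for code, t in segs.items() if code.startswith(target)), None)
        if src and tgt:
            yield src, tgt

# Invented, minimal sample; real files carry a full header and more metadata.
sample = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    '<tmx version="1.1"><header/><body>'
    '<tu><tuv lang="EN-GB"><seg>Article 1</seg></tuv>'
    '<tuv lang="FR-FR"><seg>Article premier</seg></tuv></tu>'
    '<tu><tuv lang="EN-GB"><seg>Only in English</seg></tuv></tu>'
    '</body></tmx>'
)
# Encode as UTF-16 Little Endian with a byte order mark.
data = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

print(list(extract_pairs(data, "EN", "FR")))
```

Units that lack one of the two requested languages are skipped, which mirrors the fact that not every document is available in every language.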
DGT cannot assume any responsibility for the quality and
the content.
Before the documents were aligned and corrected, they were
pre-processed to remove certain differences between the source and target
language versions (further
details). This means that the contents of the documents
might have changed. The documents were aligned in accordance with the segmentation
rules used in the Directorate-General for Translation of the European Commission.
The extraction keeps only the EUR-Lex document number (Numdoc), from
which other information (e.g. year and document type) can be derived. For
further information on the Numdoc structure, see the information provided
by EUR-Lex.
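As a rough illustration of deriving the year and document type, the sketch below assumes that a Numdoc follows a CELEX-style layout: one sector character, a four-digit year, a one- or two-letter document type, and then a number. This layout is an assumption and should be checked against the EUR-Lex information; the parser, its field names and the sample value are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Numdoc(NamedTuple):
    sector: str    # first character (assumed to be the sector)
    year: int      # four-digit year
    doc_type: str  # one or two capital letters
    rest: str      # remaining number, possibly with a suffix

# Assumed layout: sector, year, document type, number.
_PATTERN = re.compile(r"^([0-9A-Z])(\d{4})([A-Z]{1,2})(\d.*)$")

def parse_numdoc(value: str) -> Optional[Numdoc]:
    """Split a document number into its assumed parts, or return None."""
    m = _PATTERN.match(value.strip())
    if m is None:
        return None
    sector, year, doc_type, rest = m.groups()
    return Numdoc(sector, int(year), doc_type, rest)

print(parse_numdoc("32006D0291"))  # illustrative value only
```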
4) Statistics for the DGT Translation Memory
The DGT Translation Memory is currently available
in 22 languages. The following table shows the coverage, expressed in the
total number of translation units available for each language:
Under Commission Decision 2006/291/EC, Euratom of 7 April
2006 on the re-use of Commission information (Official
Journal L 107, 20.4.2006, pp. 38-41), this data may be disseminated, but
only within the limits set by the Decision. In particular, the Commission
is not liable for any consequence stemming from the re-use. Moreover,
the Commission is not liable for the quality of the alignment or for the correctness
of the data provided.
By agreement with the European Commission's Office for Official
Publications (OPOCE), the Acquis can be used and distributed for research
purposes, but the following conditions for use must be observed:
The European Communities consider legislative and quasi-legislative
documents published in the Official Journal of the European Union to be in
the public domain. Prior written permission is not required for their reproduction/translation,
and they may be reproduced freely without restriction, including for the purpose
of further non-commercial dissemination to final users, subject to the condition
that appropriate acknowledgement is given to the European Communities and
to the source, and provided that - whenever a document is reproduced verbatim
from a source other than the printed version of the Official Journal of the
European Union - a prominently positioned disclaimer should read: "Only
European Community legislation printed in the paper edition of the Official
Journal of the European Union is deemed authentic."
The DGT Translation Memory and the JRC-Acquis are rather similar in nature, as
both are based on the Acquis Communautaire, but they are not identical and can
serve different purposes. The main differences are the following:
- The collection of documents of both resources should mostly
be the same, but they are not identical as the two resources were collected
in different ways. Neither resource is exactly equivalent to the Acquis
Communautaire. The criteria for the collection of the JRC-Acquis were rather
loose (all documents were collected which were available in at least ten
languages, at least three of which were 'new' EU languages), so the JRC-Acquis
is bigger.
- The DGT Translation Memory is a collection of translation
units, from which the full text cannot be reproduced. The JRC-Acquis is mostly
a collection of full texts with additional information on which sentences
are aligned with which others.
- Most parts of the DGT Translation Memory have been corrected
manually using the Euramis alignment editor, while the alignment
of the JRC-Acquis documents was done using the alignment software tools
Vanilla (Versions 2.2 and 3) and HunAlign (Version 2.2), without manual
correction.
- For the cleaning and pre-processing of the texts, different
methods and tools were used.
- Most JRC-Acquis documents are accompanied
by information on the manually assigned Eurovoc subject domain classes so
that the JRC-Acquis can also be used to train automatic multi-label classification
software.
The distribution consists of 12 zip files (Volume_1.zip, ...
Volume_12.zip), each of approximately 100 MB. Each zip file has dozens of
tmx-files identified by the EUR-Lex number of the underlying documents of
the Acquis and a file list in txt specifying the languages in which the documents
are available.
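Because each archive bundles many tmx-files together with a file list, it can be handy to look inside the archives without unpacking them. The sketch below uses Python's standard zipfile module; the function names are illustrative, and the only assumption about the data is the Volume_N.zip naming described above.

```python
import zipfile
from pathlib import Path

def list_tmx_members(directory):
    """Map each Volume_*.zip in `directory` to the .tmx entries it contains,
    reading the archive index only (nothing is unpacked to disk)."""
    result = {}
    for archive in sorted(Path(directory).glob("Volume_*.zip")):
        with zipfile.ZipFile(archive) as zf:
            result[archive.name] = [
                name for name in zf.namelist() if name.lower().endswith(".tmx")
            ]
    return result

def read_member(archive_path, member):
    """Return the raw bytes of one entry without extracting the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(member)

# Lists the archives found in the current directory (empty if none are present).
print(list_tmx_members("."))
```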
You can download the data files from the site http://wt.jrc.it/lt/Acquis/DGT_TU_1.0/data/.
There is no need to unzip the files as the extraction program will
access the data in the zip files directly. The texts for the different
languages are spread over the various zip files so that you will
need to download all files if you want the full parallel corpus.
Downloading only a subset of the zip files is possible, but it will
result in producing only a subset of the parallel corpus.
You also need to download the extraction program
and copy it into the same directory as the zip files with the data.
The program is distributed in two versions:
a version with graphical user interface for the Windows operating
system, consisting of two files: the
program file and the
library, and a machine-independent command line version in java
bytecode that can be run on any machine supporting a Java runtime
of version 1.4 or newer.
The multilingual extraction has English as the source language.
Users can extract any language pair as follows, using the extraction tool
TMXtract:
For the Windows Operating System:
- download the zip files, the extraction tool TMXtract
(exe file) and the file swt-win32-3218.dll onto your
PC. The files must be in the same directory;
- open TMXtract;
- select Input files (Volume_1.zip, etc.;
multiple selection is possible);
- specify Output file (the result is always
1 file);
- choose Source and Target language;
- click on Start.
For other Operating Systems:
- download the zip files and the extraction tool TMXtract
(jar file) onto your computer. The files should be in the same
directory;
- Start a command shell;
- Invoke the program by the command java
-jar TMXtract.jar <Source> <Target> <Output file>
[ <Input files> ...];
- The progress of the extraction will be displayed
on the console. Example on Solaris:
