The DGT Multilingual Translation Memory of the Acquis Communautaire: DGT-TM
DGT-TM triples in size due to the release of new data (up to and including the year 2010)

Since November 2007 the European
Commission's Directorate-General for Translation has made
its multilingual Translation Memory for the Acquis Communautaire,
DGT-TM, publicly accessible in order to foster the European Commission’s
general effort to support multilingualism, language diversity and
the re-use of Commission information.
The Acquis
Communautaire is the entire body of European legislation, comprising
all the treaties, regulations and directives adopted by the European
Union (EU). Since each new country joining the EU is required to
accept the whole Acquis Communautaire, this body of legislation
has been translated into 22 official languages. As a result, the
Acquis now exists as parallel texts in the following 22
languages: Bulgarian, Czech, Danish, Dutch, English, Estonian,
German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian,
Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish
and Swedish. The Acquis is not translated on a regular basis into
the 23rd official EU language, Irish, which is why DGT-TM
does not include data in Irish.
Parallel texts
are texts and their manually produced translations. They are also
referred to as bi-texts. A translation memory
is a collection of small text segments and their translations (referred
to as translation units, TU). These TUs can be sentences
or parts of sentences. Translation memories are used to support
translators by ensuring that pieces of text that have already been
translated do not need to be translated again.
Both translation memories and parallel
texts are important linguistic resources that can be used
for a variety of purposes.
The value of a parallel corpus grows
with its size and with the number of languages for which translations
exist. While parallel corpora for some languages are abundant, there
are few or no parallel corpora for most language pairs. To our knowledge,
the Acquis Communautaire is the biggest parallel
corpus in existence, taking into consideration both its
size and the number of languages covered. The most outstanding advantage
of the Acquis Communautaire - apart from it being freely available
- is the number of rare language pairs (e.g. Maltese-Estonian, Slovene-Finnish,
etc.).
The first version of DGT-TM was
released in 2007 and included documents published up to the year
2006. The latest version of DGT-TM (released in April
2012, but referred to as DGT-TM-2011)
contains additional documents published from 2004 to 2010. While
the alignments between TUs and their translations were verified
manually for DGT-TM-2007, the TUs in DGT-TM-2011 were aligned automatically.
The data format is the same for both releases.
This page, which is meant for technical
users, provides a description of this unique linguistic resource
as well as instructions on where to download it and how to produce
bilingual aligned corpora for any of the 231 language pairs or 462
language pair directions. Here is an
example of one sentence translated into all 22 languages.
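The figures of 231 language pairs and 462 pair directions follow directly from having 22 languages. As a minimal sketch, the count can be checked as follows; the two-letter codes are the ones used in the coverage table on this page, and the variable names are only illustrative.

```python
from itertools import combinations, permutations

# Two-letter language codes as used in the coverage table on this page.
LANGUAGES = [
    "BG", "CS", "DA", "DE", "EL", "EN", "ES", "ET", "FI", "FR", "HU",
    "IT", "LT", "LV", "MT", "NL", "PL", "PT", "RO", "SK", "SL", "SV",
]

# Unordered pairs (e.g. MT-ET counted once) and ordered directions
# (MT->ET and ET->MT counted separately).
pairs = list(combinations(LANGUAGES, 2))
directions = list(permutations(LANGUAGES, 2))

print(len(LANGUAGES), len(pairs), len(directions))  # 22 231 462
```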
This extraction of aligned sentences
can be used to produce a parallel multilingual corpus of the European
Union’s legislative documents (Acquis Communautaire) in 22 EU languages.
The aligned translation units have been provided by the Directorate-General
for Translation of the European Commission by extraction from one
of its large shared translation memories in Euramis (European
advanced multilingual information system). This memory contains
most, although not all, of the documents which make up the Acquis
Communautaire, as well as some other documents which are not
part of the Acquis. To reduce its size, the extraction
uses English as the source language. The sequence in the extracted
files is not necessarily the same as in the underlying documents,
and repeated text segments such as "Article 1" are inevitable.
The documents are in the widely used Translation Memory eXchange (TMX)
format. To remain backwards compatible, the header mentions
TMX format 1.1, but the files are also compliant with TMX
1.4b. The texts are encoded in UTF-16 Little Endian. The source
language of the documents and sentences is not known, but many of
the documents were originally written in English and then translated
into the other languages.
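For users who want to read the TMX data with their own tools rather than with the supplied extraction program, the following is a minimal sketch in Python. It assumes only what the TMX standard defines (tu, tuv and seg elements, with the language given in a lang or xml:lang attribute). The sample document, its language codes and the function name read_units are invented for illustration and are not taken from the DGT-TM files.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 uses xml:lang, older TMX versions use a plain "lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Stripped-down, invented sample; real files carry a full header and more metadata.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-16"?>
<tmx version="1.1">
  <header/>
  <body>
    <tu>
      <tuv lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
  </body>
</tmx>"""

# Encode as UTF-16 Little Endian with a byte order mark, as described above.
SAMPLE_BYTES = ("\ufeff" + SAMPLE_TMX).encode("utf-16-le")


def read_units(tmx_bytes):
    """Return one dict per translation unit, mapping language code to segment text."""
    root = ET.fromstring(tmx_bytes)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get("lang") or tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang] = "".join(seg.itertext())
        units.append(unit)
    return units


print(read_units(SAMPLE_BYTES))
```

Passing the raw bytes to the parser, rather than a decoded string, lets the XML parser pick up the encoding from the byte order mark and the XML declaration.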
Before the documents were aligned,
the source material was pre-processed to reduce the number of entries
of low value for the translators (short sentences, long sentences,
obvious mismatches, etc.). This means that the contents of the documents might
have changed. The documents were aligned in accordance with the segmentation
rules used in the Directorate-General for Translation of the European
Commission. The extraction keeps only the EUR-Lex document number
(NumDoc), from which other information (e.g. year and document
type) can be derived. For further information on the NumDoc structure,
see the information provided by EUR-Lex.
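As a rough illustration of how year and document type might be derived, the sketch below assumes that a NumDoc follows the usual CELEX-style layout: one sector character, a four-digit year, a one- or two-letter document type and a running number. This layout, the small type table and the function name parse_numdoc are assumptions made for the example; the EUR-Lex documentation remains the authoritative description.

```python
import re

# Assumed layout: sector (1 char), year (4 digits), type (1-2 letters), remainder.
NUMDOC_PATTERN = re.compile(
    r"^(?P<sector>[0-9CE])(?P<year>\d{4})(?P<doctype>[A-Z]{1,2})(?P<rest>.+)$"
)

# A few common type letters for legislation; deliberately incomplete.
DOC_TYPES = {"R": "regulation", "L": "directive", "D": "decision"}


def parse_numdoc(numdoc):
    """Split a document number into its assumed parts, or return None if it does not fit."""
    match = NUMDOC_PATTERN.match(numdoc.strip())
    if match is None:
        return None
    return {
        "sector": match["sector"],
        "year": int(match["year"]),
        "doctype": match["doctype"],
        "doctype_name": DOC_TYPES.get(match["doctype"], "unknown"),
        "rest": match["rest"],
    }


# "32006R1234" is an invented example number, not a reference to a real document.
print(parse_numdoc("32006R1234"))
```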
The DGT Translation Memory is currently
available in 22 languages. The following table shows the coverage,
expressed in the total number of translation units available for
each language, separately for the DGT-TM releases 2007 and 2011.
For the number of aligned translation
units for each language pair and further statistics, see the
DGT-TM
reference publication.
| Language | Language code | Number of units in DGT - release 2007 | Number of units in DGT - release 2011 |
|---|---|---|---|
| English | EN | 2 187 504 | 2 286 514 |
| Bulgarian | BG | 708 658 | 454 812 |
| Czech | CS | 890 025 | 1 985 152 |
| Danish | DA | 433 871 | 1 997 649 |
| German | DE | 532 668 | 1 922 568 |
| Greek | EL | 371 039 | 1 901 490 |
| Spanish | ES | 509 054 | 1 907 649 |
| Estonian | ET | 1 047 503 | 1 867 786 |
| Finnish | FI | 514 868 | 1 881 558 |
| French | FR | 1 106 442 | 1 853 773 |
| Hungarian | HU | 1 159 975 | 1 869 246 |
| Italian | IT | 542 873 | 1 926 532 |
| Lithuanian | LT | 1 126 255 | 1 867 176 |
| Latvian | LV | 1 120 835 | 1 859 781 |
| Maltese | MT | 1 021 855 | 461 865 |
| Dutch | NL | 502 557 | 1 914 628 |
| Polish | PL | 1 052 136 | 1 879 469 |
| Portuguese | PT | 945 203 | 1 922 585 |
| Romanian | RO | 650 735 | 470 303 |
| Slovak | SK | 1 065 399 | 1 894 676 |
| Slovene | SL | 1 026 668 | 1 903 453 |
| Swedish | SV | 555 362 | 1 934 964 |
| ALL | ALL | 19 071 485 | 37 963 629 |
Size of DGT's Translation Memory
expressed as the total number of translation units
per language for each of the 22 official EU languages.
The DGT Translation Memory does
not include data in Irish.
I. Intellectual property and conditions of use of databases
The DGT-TM database is the exclusive property of the European
Commission. The Commission cedes its non-exclusive rights free
of charge and world-wide for the entire duration of the protection
of those rights to the re-user, for all kinds of use which comply
with the conditions laid down in the Commission
Decision of 12 December 2011 on the re-use of Commission documents,
published in Official Journal of the European Union L330 of 14
December 2011, pages 39 to 42.
Any re-use of the database or of the structured elements contained
in it is required to be identified by the re-user, who is under
an obligation to state the source of the documents used: the website
address, the date of the latest update and the fact that the European
Commission retains ownership of the data.
II. Conditions for use of software
The DGT-TM database is distributed with the software necessary
for its exploitation/extraction. Use of such software must be
carried out in accordance with the conditions laid down in the
EUPL
licence.
III. Responsibility
The database and the accompanying software are made available,
without any guarantee, explicit or tacit. The Commission cannot
be held responsible for any loss, injury or damage the re-user
may suffer due to the re-use. The Commission does not however
guarantee the absence of any irregularities which may be present
in the databases, within the structured data they contain or the
software itself. The Commission does not guarantee the on-going
distribution of said databases and software.
The Commission cannot be held responsible for any loss, injury
or damage caused to third parties as a result of the re-use. The
re-user shall bear sole responsibility for the re-use of the data
collection, the structured elements it contains and the software.
Re-use must not mislead third parties in respect of the contents
of the database and the structured elements it contains,
the source of the contents or the date of the last update thereto.
This disclaimer is not intended to limit the liability of the
Commission in violation of any requirements laid down in applicable
national law or to exclude its liability in cases where this is
not permitted by the applicable law.
IV. Definitions
Definitions of terms used in the Commission Decision of 12 December
2011 on the re-use of Commission documents, published in Official
Journal of the European Union L330 of 14 December 2011, pages
39 to 42, are supplemented by the following definitions:
Re-user: Any natural or legal person who re-uses
the documents, in accordance with the conditions laid down in
the Commission Decision of 12 December 2011 on the re-use of Commission
documents, published in Official Journal of the European Union
L330 of 14 December 2011, pages 39 to 42.
Databases: A collection of independent works,
data or other materials arranged in a systematic or methodical
way and individually accessible by electronic means or in any
other way.
The first version of the JRC-Acquis
(a multilingual full-text parallel corpus with sentence alignments
for 231 language pairs) was released in 2006. The first version
of the DGT-TM was released in 2007. The two resources are broadly
similar in nature as they are both based on the Acquis Communautaire,
but they are not identical and serve different purposes.
The main differences are the following:
- The document collections of the two resources
should largely overlap, but they are not identical because the
resources were collected in different ways. Neither of the resources
is exactly equivalent to the Acquis Communautaire.
The criteria for the collection of the JRC-Acquis were rather
loose (all documents were collected which were available in
at least ten languages, including at least three 'new' EU languages),
so the JRC-Acquis is bigger for the years both resources cover.
- The DGT Translation Memory is a collection of translation
units, from which the full text cannot be reproduced. The JRC-Acquis
is mostly a collection of full texts with additional information
on which sentences are aligned with each other.
- Most JRC-Acquis documents are accompanied by information on
the manually assigned EuroVoc
subject domain classes so that the JRC-Acquis can also be used
to train automatic multi-label classification software.
- Different methods and tools were used in cleaning
and pre-processing the texts.
The distribution consists of a collection of zip
files (see below), each not larger than 100 MB. Each zip file
contains tmx-files identified by the EUR-Lex number of the underlying
Acquis Communautaire documents and a file list in txt
specifying the languages in which the documents are available.
There is no need to unzip the files as the extraction
program will access the data in the zip files directly. The texts
for the different languages are spread over the various zip files
so that you will need to download all files if you want the full
parallel corpus. Downloading only a subset of the zip files is
possible, but it will result in producing only a subset of the
parallel corpus.
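The same direct access to zipped data is possible in user code. The sketch below is not the extraction program itself, only an illustration of reading tmx entries from a zip archive without unpacking it to disk. The archive is built in memory for the demonstration, and its member names and the content of the file list are invented.

```python
import io
import zipfile


def iter_tmx_members(archive):
    """Yield (member name, raw bytes) for every .tmx entry in a zip archive.

    `archive` may be a path or a file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".tmx"):
                yield name, zf.read(name)


# Build a tiny in-memory archive; names and contents are hypothetical.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("32006R1234.tmx", "<tmx/>".encode("utf-16"))
    zf.writestr("filelist.txt", "32006R1234 EN DE\n")
demo.seek(0)

found = [name for name, _ in iter_tmx_members(demo)]
print(found)  # ['32006R1234.tmx']
```

With real data, a path such as one of the downloaded volume files would be passed instead of the in-memory buffer.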
You also need to download the extraction program
and copy it into the same directory as the zip files with the
data. The program is distributed in two versions: a version with a
graphical user interface for the Windows operating system, consisting
of two files (the program file and the library), and a
machine-independent command-line version in Java
byte code that can be run on any machine supporting a Java
runtime of version 1.4 or newer.
You can download the files of the 2007 release from
http://optima.jrc.it/Acquis/DGT_TU_1.0/data/.
You can download the files of the 2011 release
by clicking on the links below.
The multilingual extraction has English as the
source language. Users can extract any language pair as follows,
using the extraction tool TMXtract:
For the Windows Operating System:
- download the zip files, the extraction tool
TMXtract (exe file) and the file swt-win32-3218.dll
onto your PC. The files must be in the same directory;
- open TMXtract;
- select Input files (Volume_1.zip,
etc.; multiple selection is possible);
- specify Output file (the result is
always 1 file);
- choose Source and Target language;
- click on Start.
For other Operating Systems:
- download the zip files and the extraction tool
TMXtract (jar file) onto your computer. The files should be
in the same directory;
- start a command shell;
- invoke the program with the command
java -jar TMXtract.jar <Source> <Target> <Output file> [ <Input files> ...];
- the progress of the extraction will be displayed on the console.
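For batch extraction of several language pairs, the command-line call described in the steps above can be wrapped in a small script. The following is a sketch only: the argument order follows the usage line given above, while the language code format ("EN", "DE"), the output file name and the helper names are assumptions that should be checked against the tool itself.

```python
import subprocess


def build_tmxtract_command(source, target, output, inputs, jar="TMXtract.jar"):
    """Assemble the argument list: java -jar TMXtract.jar <Source> <Target> <Output file> [<Input files> ...]."""
    return ["java", "-jar", jar, source, target, output, *inputs]


def run_tmxtract(source, target, output, inputs, jar="TMXtract.jar"):
    """Run the extraction; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_tmxtract_command(source, target, output, inputs, jar), check=True)


# Only the command is built and printed here; nothing is executed.
cmd = build_tmxtract_command("EN", "DE", "en-de.tmx", ["Volume_1.zip", "Volume_2.zip"])
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.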

For a more detailed description
of the DGT-TM, including more statistics on the resource, see the
following publication. When making reference to DGT-TM in scientific
publications, please refer to:
Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter
(2012). DGT-TM: A Freely Available Translation Memory in 22 Languages.
Proceedings of the 8th International Conference on Language Resources
and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
The Directorate-General
for Translation (DGT) is one of the biggest translation
services in the world. It is also the largest single department
in the European Commission with a total number of around 2500 staff
members and a total production of some 2 million pages a year. Various
computer tools are available to translators, who use them according
to their translation needs and personal preferences. Irrespective
of their preferred working methods, all translators need to be able
to reuse previously translated texts (translation
memories, electronic archives, etc.). To perform its tasks, DG Translation
has a wide variety of language resources at the disposal of its
staff: terminology in many different forms (multilingual
libraries, terminology databases, electronic dictionaries, etc.);
translation memories enabling genuine data sharing;
texts as such, to be retrieved from internal archiving
systems and other sources; and machine translation,
which, at the European Commission, is used both as a browsing tool to
view the gist of a text and as a genuine translation
aid.
The Joint Research Centre
(JRC) is
also a Directorate-General of the European Commission. The JRC has
for many years worked on highly multilingual text analysis applications.
The JRC has contributed to the dissemination of the DGT
Translation Memory and it has itself produced and disseminated
a number of further highly multilingual linguistic resources: the
JRC-Acquis,
JRC-Names,
the JRC Eurovoc Indexer JEX, and a series of further
smaller linguistic resources.
The JRC is the creator of the Europe
Media Monitor (EMM) family of news aggregation and analysis
applications. EMM aggregates news from about 3000 news portals world-wide
in about 50 languages (status 2012). EMM's news analysis tools always
show the latest news from around the world as its pages are updated
every five minutes. Because EMM not only displays the news articles
but also groups related articles, classifies them into
hundreds of news categories and displays automatically extracted
meta-information together with the news items, it has many users
from around the world, resulting in up to 1.2 million hits per day.
Much information is available via RSS feeds, allowing EMM output
to be combined with third-party tools. The JRC is scientifically
very active, as can be seen from the large number of international
scientific
publications in the field of multilingual text mining and media
monitoring. JRC's four publicly accessible media monitoring applications
are:
- NewsBrief:
Breaking news detection and display of the very latest thematically
organised news from around the world; grouping of related news;
RSS feeds and automatic email alerting;
50 languages.
- MedISys:
EMM's Medical Information System selects the health-related
EMM news in 50 languages and additionally gathers documents
from about 250 medical web sites. MedISys displays the medical
news according to diseases, symptoms, organisations and themes
and has statistics-based early warning functions for each category.
A second, restricted site offers more functionality to EU public
health organisations.
- NewsExplorer:
Summary of the news in 20 languages for each 24-hour period;
grouping of related news into clusters; linking of daily clusters
over time and across languages; visualisation of time lines
and of geographical news coverage; information extraction to
detect and disambiguate persons, organisations and locations;
recognition of quotations by and about people; individual,
daily-updated pages for over one million names;
automatic generation of social networks.
- EMM-Labs:
A collection of more experimental text analysis applications
in up to 50 languages not yet entirely integrated with the main
Europe Media Monitor pages. EMM-Labs includes tools for event
extraction (event scenario template filling), multi-document
summarisation, social networks, news maps, media impact analysis,
machine translation and more.
For more information, you can contact
the following persons:

Directorate-General for Translation (DGT)
Patrick Schlüter (Email address: Patrick.Schluter@ec.europa.eu)
Unit DGT.R.3 Informatics
Jean-Monnet Building A2/137
L-2920 Luxembourg
More
information on DGT.

Joint Research Centre (JRC)
Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
IPSC - GlobeSec - OPTIMA
Via E. Fermi 2749, T.P. 267
I-21027 Ispra (VA)
More information on the JRC
and its Language
Technology activity.