The JRC-Acquis
Multilingual Parallel Corpus
See also the related resource: DGT-TM
Translation Memory, as well as further free
linguistic resources
What are the Acquis Communautaire and the JRC-Acquis
The Acquis
Communautaire (AC) is the total body of European Union (EU)
law applicable in the EU Member States. This collection of
legislative text changes continuously and currently comprises
selected texts written between the 1950s and now. As of the beginning
of the year 2007, the EU had 27 Member States and 23 official
languages. The Acquis Communautaire texts exist in these languages,
although Irish translations are not currently available. The Acquis
Communautaire thus is a collection of parallel texts in the following
22 languages: Bulgarian, Czech, Danish, German,
Greek, English, Spanish, Estonian, Finnish, French, Hungarian,
Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese,
Romanian, Slovak, Slovene and Swedish.
The data release by the JRC is in line with the
general effort of the European Commission to support multilingualism,
language diversity and the re-use of Commission information.
The Language
Technology group of the European Commission's Joint
Research Centre did not receive an authoritative list
of documents that belong to the Acquis Communautaire. In order
to compile the document collection distributed here, we selected
all those CELEX documents (see below) that were available in at
least ten of the twenty EU-25 languages (the official languages
of the EU before Bulgaria and Romania joined in 2007) and that
additionally existed in at least three of the nine languages that
became official languages with the Enlargement of the EU in 2004
(i.e. Czech, Estonian, Hungarian, Lithuanian, Latvian, Maltese,
Polish, Slovak and Slovene). The collection distributed here is
thus an approximation of the Acquis Communautaire which
we call the JRC-Acquis. The JRC-Acquis must not be seen
as a legal reference corpus. Instead, the purpose of the JRC-Acquis
is to provide a large parallel corpus of documents for (computational)
linguistics research purposes.
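The selection rule described above can be sketched as a simple filter. This is only an illustration under stated assumptions, not the code actually used to build the collection: the function name `is_selected` and the constants are hypothetical, and the eleven pre-2004 language codes are inferred from the 22-language list given earlier.

```python
# The nine languages that became official with the 2004 enlargement
# (as listed in the text).
NEW_2004 = {"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"}

# Assumption: the eleven older official languages, derived from the
# 22-language list minus the 2004 and 2007 additions.
OLD = {"da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"}

# The twenty EU-25 languages (Bulgarian and Romanian are not included).
EU25 = OLD | NEW_2004


def is_selected(available_langs):
    """Return True if a document meets both availability thresholds."""
    langs = set(available_langs)
    enough_overall = len(langs & EU25) >= 10
    enough_new = len(langs & NEW_2004) >= 3
    return enough_overall and enough_new
```

Under this sketch, a document available in all eleven older languages but in none of the nine new ones would be rejected, because the second threshold is not met.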
The linguistic research interest of the JRC-Acquis
Generally speaking, parallel corpora are useful
for all types of cross-lingual research. The value of a parallel
corpus grows with its size and with the number of languages for
which translations exist. While parallel corpora for some languages
exist abundantly, there are few or no parallel corpora for most
other language pairs. To our knowledge, the Acquis Communautaire
is the biggest parallel corpus in existence,
if we take into consideration both its size and the large number
of languages involved. The most outstanding advantage of the Acquis
Communautaire - apart from being freely available - is the number
of rare language pair combinations (e.g. Maltese-Estonian, Slovene-Finnish,
etc.).
The AC and other Community legislation are publicly
available on the European Commission's web sites. The Language
Technology team of the Joint Research Centre (JRC, http://langtech.jrc.ec.europa.eu/)
in Ispra, Italy, has attempted to identify the documents that
are part of the AC, has downloaded them and converted them to
XML format. The Bulgarian and Romanian documents were processed
by the Romanian Academy of Sciences (http://www.racai.ro/).
In further processing steps, the texts were cleaned of their footers
and annexes, and they were sentence-aligned twice: once using
Vanilla and once using HunAlign. Instead of aligning every language
to a single pivot language, all 231 possible language pairs were aligned
individually. This is useful because sentences are often aligned
n-to-n, and the grouping often differs depending on the language
pair involved.
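The figure of 231 follows from choosing 2 languages out of 22 without regard to order. A minimal sketch that enumerates the pairs; the list name `LANGS` is an assumption, and the codes are the ISO codes used in the statistics table:

```python
from itertools import combinations

# ISO codes of the 22 languages in the corpus.
LANGS = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

# Every unordered pair of distinct languages: 22 * 21 / 2 = 231.
pairs = list(combinations(LANGS, 2))
print(len(pairs))  # 231
```

With a single pivot language, only 21 alignments would be needed, but rare pairs such as Maltese-Estonian would then have to be derived indirectly through the pivot.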
For some of the documents, only preliminary translations
were available. For the online texts in some of the languages,
only the title has been translated, but the text displayed is
English. An automatic language recognition tool was therefore
used to filter out those texts that are labelled as being in another
language but are actually in English. No manual check was
carried out.
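The text does not say which language recognition tool was used. Purely as an illustration of the filtering idea, the following sketch flags a text as English when a large share of its words are common English function words. The stopword list, the 0.2 threshold, and the names `looks_english` and `filter_mislabelled` are all assumptions made for this example.

```python
# Hypothetical, deliberately tiny list of frequent English function words.
ENGLISH_STOPWORDS = {
    "the", "of", "and", "to", "in", "that", "is", "for", "shall", "be",
}


def looks_english(text, threshold=0.2):
    """Crude heuristic: share of English function words among all words."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold


def filter_mislabelled(docs):
    """Drop documents labelled as non-English whose text looks English."""
    return [
        d for d in docs
        if d["lang"] == "en" or not looks_english(d["text"])
    ]
```

A real language identifier would typically rely on character n-gram statistics rather than a word list. The heuristic here is only meant to make the filtering step concrete.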
The European Commission's Office for Official
Publications OPOCE manages the distribution rights of this
aligned multilingual parallel corpus. OPOCE agreed that the corpus
can be given to research partners for non-commercial use. See
the section on licensing issues, below.
The JRC-Acquis corpus (version 3.0) is
currently available in 22 languages with the following distribution:
| Language ISO code | Number of texts | Text body: Total No words | Text body: Total No characters | Text body: Average No words | Signatures: Total No words | Annexes: Total No words | Total No words (text + signatures + annexes) |
| bg | 11384 | 16140819 | 104522671 | 1417.85 | 2170075 | 14114612 | 32425506 |
| cs | 21438 | 22843279 | 148972981 | 1065.55 | 7225300 | 16763733 | 46832312 |
| da | 23624 | 31459627 | 213468135 | 1331.68 | 2629786 | 16855213 | 50944626 |
| de | 23541 | 32059892 | 232748675 | 1361.87 | 2542149 | 16327611 | 50929652 |
| el | 23184 | 36453749 | 239583543 | 1572.37 | 2973574 | 16459680 | 55887003 |
| en | 23545 | 34588383 | 210692059 | 1469.03 | 3198766 | 17750761 | 55537910 |
| es | 23573 | 38926161 | 238016756 | 1651.3 | 3490204 | 19716243 | 62132608 |
| et | 23541 | 24621625 | 192700704 | 1045.9 | 1336051 | 14995748 | 40953424 |
| fi | 23284 | 24883012 | 212178964 | 1068.67 | 2677798 | 12547171 | 40107981 |
| fr | 23627 | 39100499 | 234758290 | 1654.91 | 3021013 | 19978920 | 62100432 |
| hu | 22801 | 28602380 | 213804614 | 1254.44 | 2529488 | 15056496 | 46188364 |
| it | 23472 | 35764670 | 230677013 | 1523.72 | 3120797 | 18331535 | 57217002 |
| lt | 23379 | 26937773 | 199438258 | 1152.22 | 2436585 | 15018484 | 44392842 |
| lv | 22906 | 27592514 | 196452051 | 1204.6 | 1673124 | 15437969 | 44703607 |
| mt | 10545 | 20926909 | 128906748 | 1984.53 | 1336042 | 15620611 | 37883562 |
| nl | 23564 | 35265161 | 231963539 | 1496.57 | 3039580 | 18467115 | 56771856 |
| pl | 23478 | 29713003 | 214464026 | 1265.57 | 2513141 | 17027393 | 49253537 |
| pt | 23505 | 37221668 | 227499418 | 1583.56 | 3034308 | 19350227 | 59606203 |
| ro | 6573 | 9186947 | 60537301 | 1397.68 | 514296 | 11185842 | 20887085 |
| ro-19211 (readme) | 19211 | 30832212 | 182631277 | 1604.92 | --- | --- | 30832212 |
| sk | 21943 | 26792637 | 179920434 | 1221.01 | 3227852 | 16190546 | 46211035 |
| sl | 20642 | 27702305 | 178651767 | 1342.04 | 3103193 | 16837717 | 47643215 |
| sv | 20243 | 29433037 | 199004401 | 1453.99 | 2575771 | 14965384 | 46974192 |
| Total | 463,792 | 636,216,050 | 4,288,962,348 | 1387.23 | 60,368,893 | 358,999,011 | 1,055,583,954 |
Size of version 3.0 of the JRC collection of the
Acquis Communautaire
in 22 of the official languages of the European Union.
Numbers are given separately for the text body (the main text),
the signature and the annexes.
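The columns of the table are related by simple arithmetic: the final total is the sum of the word counts for text body, signatures and annexes, and the average is the body word count divided by the number of texts. A small sketch that checks this for three rows copied from the table; the names `ROWS` and `check_row` are assumptions made for this example.

```python
# lang: (texts, body words, average body words, signature words,
#        annex words, reported total words), copied from the table.
ROWS = {
    "bg": (11384, 16140819, 1417.85, 2170075, 14114612, 32425506),
    "en": (23545, 34588383, 1469.03, 3198766, 17750761, 55537910),
    "mt": (10545, 20926909, 1984.53, 1336042, 15620611, 37883562),
}


def check_row(texts, body, average, signatures, annexes, total):
    """Return True if the total and the average match the other columns."""
    total_ok = body + signatures + annexes == total
    average_ok = round(body / texts, 2) == average
    return total_ok and average_ok


for lang, row in ROWS.items():
    print(lang, check_row(*row))
```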
Statistics on the alignment with Vanilla:
- Total of 4,350,447 aligned documents (all languages);
- Total of 243,187,303 links (all languages);
- Average of 18,833 aligned documents per language pair;
- Average of 1,052,759 links per language pair (average of all language pairs);
- On average, 85.43% of the links are one-to-one.
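Dividing the two totals by the 231 language pairs reproduces both reported averages, which suggests that both are per-pair figures. A minimal sketch of that arithmetic, with variable names chosen for this example:

```python
n_languages = 22
# Number of unordered language pairs: 22 * 21 / 2 = 231.
n_pairs = n_languages * (n_languages - 1) // 2

aligned_documents = 4_350_447  # total reported for the Vanilla alignment
links = 243_187_303            # total reported for the Vanilla alignment

docs_per_pair = aligned_documents / n_pairs
links_per_pair = links / n_pairs

print(round(docs_per_pair))   # 18833
print(round(links_per_pair))  # 1052759
```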
The JRC-Acquis and the DGT Translation Memory (DGT-TM) are rather similar
in nature, as both are based on the Acquis Communautaire, but
they are not identical and serve different purposes. The
main differences are the following:
- The document collections of the two resources
should mostly be the same, but they are not identical because the
resources were collected in different ways. Neither resource
is exactly equivalent to the Acquis Communautaire. The criteria
for the collection of the JRC-Acquis were rather loose (all
documents were collected that were available in at least ten
languages, including at least three 'new' EU languages), so
the JRC-Acquis is bigger.
- The DGT Translation Memory is a collection
of translation units, from which the full text cannot be reproduced.
The JRC-Acquis is mostly a collection of full texts with additional
information on which sentences are aligned with which others.
- Most parts of the DGT Translation Memory have
been corrected manually using the Euramis alignment
editor, while the alignment of the JRC-Acquis documents was
done using the two alternative alignment software tools Vanilla
and HunAlign, without manual correction.
- For the cleaning and pre-processing of the
texts, different methods and tools were used.
- Most JRC-Acquis documents
are accompanied by information on the manually assigned Eurovoc
subject domain classes so that the JRC-Acquis can also be used
to train automatic multi-label classification software.
Acquis Communautaire corpus
According to an agreement with the European Commission's
Office for Official Publications OPOCE, the AC corpus can
be used and distributed for research purposes, but the following
usage conditions must be adhered to:
The European Communities consider legislative and
quasi-legislative documents published in the Official Journal of
the European Union and related COM and SEC series as well as charters
and treaties and ECJ case-law to be in the public domain. Prior
written permission is thus not required for their reproduction/translation,
and they may be reproduced/translated freely without restriction,
including for the purpose of further non-commercial dissemination
to final users, subject to the condition that appropriate acknowledgement
is given to the European Communities and to the source, and provided
that the additional guidelines set out below are respected.
-
Whenever a document is reproduced verbatim
from a source other than the printed version of the Official
Journal of the European Union, a prominently positioned disclaimer
should read: 'Only European Community legislation printed in
the paper edition of the Official Journal of the European Union
is deemed authentic.'
-
For the reasons stated in the disclaimer
above, it is advisable to ensure that translations are made
from the printed, authentic version of the Official Journal.
This precaution, while minimizing the risk of error, does not
confer any legal status whatsoever to the translated text. The
following notice shall accompany the translated text, printed
below the acknowledgement: 'Originally published in the official
languages of the European Union in the Official Journal of the
European Union by the Office for Official Publications of the
European Communities. Responsibility for the translation into
[specify language] from the original [specify language] edition
lies entirely with [name of translation copyright holder].'
Moreover, please note that we do not consider a "further commercial
dissemination" the inclusion, as reference material for consultation
purposes, of small amounts of relevant legislative texts in
articles/thesis/studies/reports/books issued by third-party
authors or publishers, whatever the means, and disseminated
subject to payment.
Eurovoc thesaurus
Unlike the AC corpus, the EuroVoc
Thesaurus must not be used or disseminated without prior written
permission from the European Commission's Office for Official Publications
OPOCE. If you want to get the rights to use Eurovoc and to receive
a copy of the multilingual thesaurus, please contact OPOCE at OP-INFO-COPYRIGHT@publications.europa.eu,
mentioning the file reference number 2005-COP-395. To our knowledge,
the licence is free of charge for research purposes. For a commercial
licence, please contact OPOCE.
- AC Corpus - version 3.0 (by language)
- AC aligned corpus using Vanilla aligner
- AC aligned corpus using HunAlign
By downloading these resources, you agree to the usage
conditions.
Previous version: JRC-ACQUIS
Multilingual Parallel Corpus, Version 2.2.
Click here to see a history
of changes regarding the preparation of this corpus.
A description
of the JRC-Acquis corpus (version 2.2) was published in the paper
below. Please use this reference
publication when referring to the JRC-Acquis.
Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat,
Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). The JRC-Acquis:
A multilingual aligned parallel corpus with 20+ languages.
Proceedings of the 5th International Conference on Language
Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006.