What are the Acquis Communautaire and the JRC-Acquis
The Acquis Communautaire (AC) is the total body of
European Union (EU) law applicable in the EU Member States. This collection
of legislative text changes continuously and currently comprises selected
texts written between the 1950s and now. Since the beginning of the year 2007,
the EU has had 27 Member States and 23 official languages (see the Wikipedia
entry). The Acquis Communautaire texts exist in these languages, although
Irish translations are not currently available. The Acquis Communautaire is thus
a collection of parallel texts in the following 22 languages:
Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish,
French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese,
Romanian, Slovak, Slovene and Swedish.
The data release by the JRC is in line with the general effort
of the European Commission to support multilingualism, language diversity
and the re-use of Commission information.
The Language Technology group of the European Commission's Joint
Research Centre did not receive an authoritative list of documents
that belong to the Acquis Communautaire. In order to compile the document
collection distributed here, we selected all those CELEX documents (see below)
that were available in at least ten of the twenty EU-25 languages (the official
languages of the EU before Bulgaria and Romania joined in 2007) and that additionally
existed in at least three of the nine languages that became official languages
with the Enlargement of the EU in 2004 (i.e. Czech, Estonian, Hungarian, Lithuanian,
Latvian, Maltese, Polish, Slovak and Slovene). The collection distributed
here is thus an approximation of the Acquis Communautaire which we
call the JRC-Acquis. The JRC-Acquis must not be seen as a legal reference
corpus. Instead, the purpose of the JRC-Acquis is to provide a large parallel
corpus of documents for (computational) linguistics research purposes.
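
The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated criterion, not the script actually used to build the corpus; the ISO 639-1 language codes, the set names and the function `qualifies` are assumptions introduced for this example.

```python
# Illustrative sketch of the document selection rule (not the original tooling).
# Language codes are ISO 639-1; all names here are hypothetical.

# The nine languages that became official with the 2004 Enlargement.
NEW_2004 = {"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"}

# The remaining eleven EU-25 languages.
OLD_EU15 = {"da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"}

# The twenty EU-25 languages (Bulgarian and Romanian are not counted here).
EU25 = OLD_EU15 | NEW_2004


def qualifies(available_languages):
    """Return True if a document meets both availability thresholds."""
    langs = set(available_languages)
    in_eu25 = len(langs & EU25)
    in_new = len(langs & NEW_2004)
    return in_eu25 >= 10 and in_new >= 3


if __name__ == "__main__":
    doc_a = OLD_EU15 | {"cs", "et", "hu"}  # 14 EU-25 languages, 3 of them new
    doc_b = OLD_EU15                       # 11 EU-25 languages, none of them new
    print(qualifies(doc_a), qualifies(doc_b))
```

Note that a document available in all eleven older languages still fails the test if it lacks translations into at least three of the nine newer ones.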
The linguistic research interest of the JRC-Acquis
Generally speaking, parallel corpora are useful for all types
of cross-lingual research. The value of a parallel corpus grows with its size
and with the number of languages for which translations exist. While parallel
corpora for some language pairs exist in abundance, there are few or no parallel
corpora for most other language pairs. To our knowledge, the Acquis Communautaire
is the biggest parallel corpus in existence, if we take into
consideration both its size and the large number of languages involved. The
most outstanding advantage of the Acquis Communautaire - apart from being
freely available - is the number of rare language pair combinations (e.g.
Maltese-Estonian, Slovene-Finnish, etc.).
The AC and other Community legislation are publicly available
on the European Commission's web sites. The Language Technology team of the
Joint Research Centre (JRC, http://langtech.jrc.it/)
in Ispra, Italy, has attempted to identify the documents that are part of
the AC, has downloaded them and converted them to XML format. The Bulgarian
and Romanian documents were processed by the Romanian Academy of Sciences
(http://www.racai.ro/).
In further processing steps, the texts were cleaned of their footers and annexes,
and they were sentence-aligned. Instead of using a single pivot language,
all possible language pair combinations were aligned individually. This is
useful due to the n-to-n relationship between aligned sentences, which often
differs depending on the language pair involved.
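
To give an idea of what aligning every language pair individually implies, the following short Python sketch enumerates the unordered pairs for the 22 languages listed earlier. The ISO 639-1 codes and the variable names are assumptions made for this illustration; the sketch only counts pairs and does not perform any alignment.

```python
# Illustrative sketch: enumerate all unordered language pairs.
# Language codes are ISO 639-1; names are hypothetical.
from itertools import combinations

LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

# Each pair appears once, e.g. ("et", "mt") but not ("mt", "et").
PAIRS = list(combinations(LANGUAGES, 2))

if __name__ == "__main__":
    # n * (n - 1) / 2 pairs for n languages.
    print(len(LANGUAGES), "languages give", len(PAIRS), "language pairs")
```

With a single pivot language only 21 alignments would be needed; aligning each pair directly requires 231.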
For some of the documents, only preliminary translations were
available. For the online texts in some of the languages, only the title has
been translated, but the text displayed is English. An automatic language
recognition tool was therefore used to filter out those texts that are labelled
as being in one language but are actually in English. No manual check was
carried out.
The European Commission's Office for Official Publications
(OPOCE) manages the distribution rights of this aligned multilingual parallel
corpus. OPOCE agreed that the corpus can be given to research partners for
non-commercial use. See the section on licensing issues, below.
2) Statistics for version 3.0 of the JRC-Acquis corpus
The JRC-Acquis corpus (version 3.0) is currently available
in 22 languages with the following distribution: