
Release version 3.0: JRC-Acquis almost tripled in size; Bulgarian added as 22nd language.

A new Romanian corpus with 30 million words is now available (05/02/2008).

Bilingual alignments for all 231 language pairs with both Vanilla and HunAlign

See also the related resource: DGT-TM Translation Memory.


1) Introduction

What are the Acquis Communautaire and the JRC-Acquis

The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. This collection of legislative texts changes continuously and currently comprises selected texts written between the 1950s and now. As of the beginning of 2007, the EU has 27 Member States and 23 official languages (see the Wikipedia entry). The Acquis Communautaire texts exist in these languages, although Irish translations are not currently available. The Acquis Communautaire is thus a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.


The data release by the JRC is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.


The Language Technology group of the European Commission's Joint Research Centre did not receive an authoritative list of documents that belong to the Acquis Communautaire. In order to compile the document collection distributed here, we selected all those CELEX documents (see below) that were available in at least ten of the twenty EU-25 languages (the official languages of the EU before Bulgaria and Romania joined in 2007) and that additionally existed in at least three of the nine languages that became official languages with the Enlargement of the EU in 2004 (i.e. Czech, Estonian, Hungarian, Lithuanian, Latvian, Maltese, Polish, Slovak and Slovene). The collection distributed here is thus an approximation of the Acquis Communautaire which we call the JRC-Acquis. The JRC-Acquis must not be seen as a legal reference corpus. Instead, the purpose of the JRC-Acquis is to provide a large parallel corpus of documents for (computational) linguistics research purposes.
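The selection rule described above can be written as a small filter. The following is a minimal sketch, not the code actually used by the JRC: the function name is_selected and the representation of a document as a collection of ISO language codes are assumptions made for illustration.

```python
# Hypothetical sketch of the JRC-Acquis selection rule described above.
# The twenty official EU-25 languages (before Bulgarian and Romanian were added).
EU25_LANGUAGES = {
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "sk", "sl", "sv",
}

# The nine languages that became official with the 2004 Enlargement.
NEW_2004_LANGUAGES = {"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"}


def is_selected(available_languages, min_total=10, min_new=3):
    """Return True if a document meets both availability thresholds.

    available_languages: ISO codes of the languages in which the
    CELEX document exists (assumed input format).
    """
    # Only EU-25 languages count towards the thresholds.
    available = set(available_languages) & EU25_LANGUAGES
    enough_languages = len(available) >= min_total
    enough_new_languages = len(available & NEW_2004_LANGUAGES) >= min_new
    return enough_languages and enough_new_languages
```

For example, a document available in eleven "old" languages but in none of the nine 2004 languages would be rejected, while a document available in seven old and three new languages would be kept.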

The linguistic research interest of the JRC-Acquis

Generally speaking, parallel corpora are useful for all types of cross-lingual research. The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages exist abundantly, there are few or no parallel corpora for most other language pairs. To our knowledge, the Acquis Communautaire is the biggest parallel corpus in existence, if we take into consideration both its size and the large number of languages involved. The most outstanding advantage of the Acquis Communautaire - apart from being freely available - is the number of rare language pair combinations (e.g. Maltese-Estonian, Slovene-Finnish, etc.).


The AC and other Community legislation are publicly available on the European Commission's web sites. The Language Technology team of the Joint Research Centre (JRC, http://langtech.jrc.it/) in Ispra, Italy, has attempted to identify the documents that are part of the AC, downloaded them and converted them to XML format. The Bulgarian and Romanian documents were processed by the Romanian Academy of Sciences (http://www.racai.ro/). In further processing steps, the texts were cleaned of their footers and annexes, and they were sentence-aligned. Instead of using a single pivot language, all possible language pair combinations were aligned individually. This is useful because of the n-to-n relationship between aligned sentences, which often differs depending on the language pair involved.
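Aligning every language pair individually, rather than through a pivot language, means one alignment per unordered pair of the 22 languages. A minimal sketch of how these pairs can be enumerated follows; the function name language_pairs is an assumption for illustration and is not part of the corpus distribution.

```python
from itertools import combinations

# ISO codes of the 22 languages in the JRC-Acquis.
LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def language_pairs(languages):
    """Return all unordered language pairs, each pair sorted alphabetically."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # 22 * 21 / 2 = 231 pairs
```

The list includes rare combinations such as ("et", "mt") for Estonian-Maltese and ("fi", "sl") for Finnish-Slovene.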


For some of the documents, only preliminary translations were available. For the online texts in some of the languages, only the title had been translated, while the body text displayed was English. An automatic language recognition tool was therefore used to filter out those texts that are presented as being in one language but are actually in English. No manual check was carried out.


The European Commission's Office for Official Publications OPOCE manages the distribution rights of this aligned multilingual parallel corpus. OPOCE agreed that the corpus can be given to research partners for non-commercial use. See the section on licensing issues, below.

2) Statistics for version 3.0 of the JRC-Acquis corpus

The JRC-Acquis corpus (version 3.0) is currently available in 22 languages with the following distribution:

 

| Language (ISO code) | No. of texts | Text body: total words | Text body: total characters | Text body: average words per text | Signatures: total words | Annexes: total words | Total words (text + signatures + annexes) |
|---|---|---|---|---|---|---|---|
| bg | 11384 | 16140819 | 104522671 | 1417.85 | 2170075 | 14114612 | 32425506 |
| cs | 21438 | 22843279 | 148972981 | 1065.55 | 7225300 | 16763733 | 46832312 |
| da | 23624 | 31459627 | 213468135 | 1331.68 | 2629786 | 16855213 | 50944626 |
| de | 23541 | 32059892 | 232748675 | 1361.87 | 2542149 | 16327611 | 50929652 |
| el | 23184 | 36453749 | 239583543 | 1572.37 | 2973574 | 16459680 | 55887003 |
| en | 23545 | 34588383 | 210692059 | 1469.03 | 3198766 | 17750761 | 55537910 |
| es | 23573 | 38926161 | 238016756 | 1651.3 | 3490204 | 19716243 | 62132608 |
| et | 23541 | 24621625 | 192700704 | 1045.9 | 1336051 | 14995748 | 40953424 |
| fi | 23284 | 24883012 | 212178964 | 1068.67 | 2677798 | 12547171 | 40107981 |
| fr | 23627 | 39100499 | 234758290 | 1654.91 | 3021013 | 19978920 | 62100432 |
| hu | 22801 | 28602380 | 213804614 | 1254.44 | 2529488 | 15056496 | 46188364 |
| it | 23472 | 35764670 | 230677013 | 1523.72 | 3120797 | 18331535 | 57217002 |
| lt | 23379 | 26937773 | 199438258 | 1152.22 | 2436585 | 15018484 | 44392842 |
| lv | 22906 | 27592514 | 196452051 | 1204.6 | 1673124 | 15437969 | 44703607 |
| mt | 10545 | 20926909 | 128906748 | 1984.53 | 1336042 | 15620611 | 37883562 |
| nl | 23564 | 35265161 | 231963539 | 1496.57 | 3039580 | 18467115 | 56771856 |
| pl | 23478 | 29713003 | 214464026 | 1265.57 | 2513141 | 17027393 | 49253537 |
| pt | 23505 | 37221668 | 227499418 | 1583.56 | 3034308 | 19350227 | 59606203 |
| ro | 6573 | 9186947 | 60537301 | 1397.68 | 514296 | 11185842 | 20887085 |
| ro-19211 (readme) | 19211 | 30832212 | 182631277 | 1604.92 | --- | --- | 30832212 |
| sk | 21943 | 26792637 | 179920434 | 1221.01 | 3227852 | 16190546 | 46211035 |
| sl | 20642 | 27702305 | 178651767 | 1342.04 | 3103193 | 16837717 | 47643215 |
| sv | 20243 | 29433037 | 199004401 | 1453.99 | 2575771 | 14965384 | 46974192 |
| Total | 463792 | 636216050 | 4288962348 | 1387.23 | 60368893 | 358999011 | 1055583954 |

Size of version 3.0 of the JRC collection of the Acquis Communautaire
in 22 of the official languages of the European Union.
Numbers are given separately for the text body (the main text), the signature and the annexes.
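Two relationships hold within each row of the statistics: the total word count is the sum of text body, signatures and annexes, and the average is the number of text-body words divided by the number of texts. The following minimal sketch checks both for three rows copied from the figures above; the function name check_row and the choice of rows are illustrative assumptions.

```python
# Rows copied from the version 3.0 statistics:
# (texts, body words, signature words, annex words, total words, reported average)
ROWS = {
    "bg": (11384, 16140819, 2170075, 14114612, 32425506, 1417.85),
    "en": (23545, 34588383, 3198766, 17750761, 55537910, 1469.03),
    "mt": (10545, 20926909, 1336042, 15620611, 37883562, 1984.53),
}


def check_row(texts, body, signatures, annexes, total, reported_average):
    """Return (total_matches, average_matches) for one table row."""
    total_matches = body + signatures + annexes == total
    # The reported average is rounded to two decimals, so allow a small tolerance.
    average_matches = abs(body / texts - reported_average) < 0.005
    return total_matches, average_matches


for code, row in ROWS.items():
    print(code, check_row(*row))
```

The same check can be extended to the remaining rows by adding them to ROWS.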


Statistics on the alignment with Vanilla:


  • Total of 4,350,447 aligned documents (all languages);
  • Total of 243,187,303 links (all languages);
  • Average of 18,833 aligned documents per language pair;
  • Average of 1,052,759 links per language pair (average of all language pairs);
  • Average of 85.43% of one-to-one links.
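Both averages in the list above follow from dividing the totals by the number of language pairs. This minimal sketch reproduces the arithmetic, assuming the averages were computed over all 231 unordered pairs of the 22 languages; the variable names are chosen for illustration.

```python
from math import comb

N_LANGUAGES = 22
TOTAL_ALIGNED_DOCUMENTS = 4_350_447
TOTAL_LINKS = 243_187_303

# Number of unordered language pairs: 22 choose 2.
n_pairs = comb(N_LANGUAGES, 2)

# Averages per language pair, rounded to whole numbers as in the list above.
avg_documents = round(TOTAL_ALIGNED_DOCUMENTS / n_pairs)
avg_links = round(TOTAL_LINKS / n_pairs)

print(n_pairs, avg_documents, avg_links)
```

With these inputs the script yields 231 pairs, 18,833 aligned documents per pair and 1,052,759 links per pair, matching the reported figures.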

3) What is the difference between the DGT Translation Memory and the JRC-Acquis

The two resources are rather similar in nature as they are both based on the Acquis Communautaire, but they are not identical and can both serve different purposes. The main differences are the following:


  • The two document collections should mostly be the same, but they are not identical because the resources were collected in different ways. Neither resource is exactly equivalent to the Acquis Communautaire. The criteria for the collection of the JRC-Acquis were rather loose (all documents were collected that were available in at least ten languages, including at least three of the 'new' EU languages), so the JRC-Acquis is bigger.
  • The DGT Translation Memory is a collection of translation units, from which the full text cannot be reproduced. The JRC-Acquis is mostly a collection of full texts with additional information on which sentences are aligned with which others.
  • Most parts of the DGT Translation Memory have been corrected manually using the Euramis alignment editor, while the alignment of the JRC-Acquis documents was done using the two alternative alignment software tools Vanilla and HunAlign, without manual correction.
  • For the cleaning and pre-processing of the texts, different methods and tools were used.
  • Most JRC-Acquis documents are accompanied by information on the manually assigned Eurovoc subject domain classes, so the JRC-Acquis can also be used to train automatic multi-label classification software.

4) Related information

The JRC Workshop on Exploiting multilingual parallel corpora (26-27 September 2005) was dedicated to exploring methods to exploit the Acquis Communautaire and similar corpora. More information is available on the workshop web page http://langtech.jrc.it/0509_EU-Enlargement-Workshop.html.


A description of the Acquis Communautaire corpus (version 2.2) was published in the paper below. Please use this publication as a reference when you refer to the JRC-Acquis. You may want to check the web site http://langtech.jrc.it for more up-to-date publications on the subject.

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006. (PDF).




Please send comments on this page to Ralf Steinberger (Email address format: Firstname.Lastname@jrc.it)

Last update:  11 June 2009