
Abstract

Multilingual EuroVoc thesaurus descriptors are used by a large number of European parliaments and documentation centres to manually index their large document collections. The assigned descriptors are then used to search and retrieve documents in the collection and to summarise the document contents for users. As EuroVoc descriptors exist in one-to-one translations in almost thirty languages, they can be displayed in a language other than the text language, giving users cross-lingual access to the information contained in each document. At the same time, EuroVoc is an ideal means of searching in the user's language and retrieving documents in other languages.
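
The cross-lingual access described above can be illustrated with a minimal sketch. It assumes that every descriptor has a language-independent identifier with one label per language. The identifiers, labels, documents and function names below are hypothetical and are not taken from EuroVoc or from JEX.

```python
# Hypothetical descriptor identifiers and labels, for illustration only;
# these are not actual EuroVoc entries.
LABELS = {
    "D1": {"en": "fisheries policy", "de": "Fischereipolitik", "fr": "politique de la pêche"},
    "D2": {"en": "air quality", "de": "Luftqualität", "fr": "qualité de l'air"},
}

# Each document records its text language and the descriptor ids assigned to it.
DOCUMENTS = [
    {"id": "doc-de-1", "lang": "de", "descriptors": {"D1"}},
    {"id": "doc-fr-1", "lang": "fr", "descriptors": {"D1", "D2"}},
    {"id": "doc-en-1", "lang": "en", "descriptors": {"D2"}},
]


def find_descriptor(label, lang):
    """Return the id of the descriptor whose label in `lang` matches, or None."""
    wanted = label.casefold()
    for descriptor_id, translations in LABELS.items():
        if translations.get(lang, "").casefold() == wanted:
            return descriptor_id
    return None


def search(label, lang):
    """Find documents in any language indexed with the descriptor named in `lang`."""
    descriptor_id = find_descriptor(label, lang)
    if descriptor_id is None:
        return []
    return [doc["id"] for doc in DOCUMENTS if descriptor_id in doc["descriptors"]]


def describe(doc, lang):
    """List a document's descriptors using the labels of the user's language."""
    return sorted(LABELS[d][lang] for d in doc["descriptors"])
```

In this sketch, an English-speaking user calling search("fisheries policy", "en") retrieves the German and the French document, and describe() shows the descriptors of the French document with their English labels.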


The European Commission's (EC) Joint Research Centre (JRC) has developed, and makes available, software that automatically assigns EuroVoc descriptors to documents, currently in 22 languages. The system uses statistical Machine Learning methods that learn the multi-label categorisation rules from documents that were previously indexed manually. The method can be described as profile-based category ranking. This software, called JRC EuroVoc Indexer, or JEX for short, is available for download from this site. Users can re-train the software on their own data, even using their own, alternative classification systems.
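
To give an idea of what profile-based category ranking means in general, here is a minimal sketch. It is not the JEX implementation: the whitespace tokenizer, the tf-idf-style term weights and the cosine-style score are assumptions chosen for brevity, and all names and training texts are hypothetical. Each descriptor gets a profile of weighted terms learned from manually indexed documents, and a new document is scored against every profile to produce a ranked list of descriptors.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_profiles(training_docs):
    """Build one normalised term-weight profile per descriptor.

    training_docs is a list of (text, set_of_descriptors) pairs, i.e.
    documents that were indexed manually with one or more descriptors.
    """
    term_counts = defaultdict(Counter)  # descriptor -> term frequencies
    doc_freq = Counter()                # term -> number of documents containing it
    for text, descriptors in training_docs:
        tokens = tokenize(text)
        doc_freq.update(set(tokens))
        for descriptor in descriptors:
            term_counts[descriptor].update(tokens)

    n_docs = len(training_docs)
    profiles = {}
    for descriptor, counts in term_counts.items():
        # Assumed weighting: term frequency times a smoothed inverse document frequency.
        weights = {t: c * math.log(1 + n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        profiles[descriptor] = {t: w / norm for t, w in weights.items()}
    return profiles


def rank(text, profiles, top_n=6):
    """Return up to top_n (descriptor, score) pairs, best score first."""
    doc = Counter(tokenize(text))
    doc_norm = math.sqrt(sum(c * c for c in doc.values()))
    scores = {}
    for descriptor, profile in profiles.items():
        dot = sum(doc[t] * w for t, w in profile.items() if t in doc)
        scores[descriptor] = dot / doc_norm if doc_norm else 0.0
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical training data with made-up descriptor names.
TRAINING = [
    ("fishing quota fish stocks", {"fisheries"}),
    ("fish catch quota vessels", {"fisheries"}),
    ("air pollution emissions", {"environment"}),
    ("emissions fish stocks pollution", {"environment", "fisheries"}),
]
```

Because the result is a ranked list rather than a single class, the approach fits the multi-label setting: the top-ranked descriptors can be assigned automatically or offered to a human indexer as suggestions.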

Possible uses of the JEX software

This software has various uses. Traditional EuroVoc users can run it (a) as a very fast and efficient fully automatic system or (b) as an interactive application in which the program suggests EuroVoc descriptors and the human documentalist corrects the automatic results, benefiting both from the machine's speed and consistency and from the human specialist's accuracy. (c) The software can also serve as a component of further multilingual and cross-lingual Language Technology applications, for example to detect document translations or plagiarised text, to link related documents across languages, to support lexical choice in Machine Translation, or to rank sentences in topic-specific summarisation. (d) Finally, due to the high multilinguality of the software and the accompanying training data, the software can be used for educational purposes. For instance, students and researchers can run experiments to improve the software's performance, compare results across many languages and language families, or use the output of JEX to build further text mining applications.
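
As an illustration of use (c), the sketch below links documents across languages by comparing their descriptor assignments. It assumes that each document is represented as a mapping from language-independent descriptor identifiers to scores; the identifiers, scores and function names are hypothetical, and the actual output format of JEX may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse descriptor-score vectors."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return (doc_id, similarity) for the candidate most similar to `query`."""
    scored = ((doc_id, cosine(query, vector)) for doc_id, vector in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical descriptor scores for one English and two French documents.
ENGLISH_DOC = {"D1": 0.9, "D2": 0.4}
FRENCH_DOCS = {
    "fr-a": {"D1": 0.8, "D2": 0.5},
    "fr-b": {"D3": 0.7, "D2": 0.2},
}
```

Since the descriptors are the same in every language, no translation of the texts is needed for the comparison. A high similarity between documents in different languages only hints at a translation or a closely related document, so a real application would combine it with further evidence.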

The EuroVoc Thesaurus

The EuroVoc thesaurus was developed by the European Parliament (EP), in collaboration with the EC's Publications Office (PO) and several national organisations, for the indexing (cataloguing / classification / categorisation) of document collections in several languages. EuroVoc currently exists not only in 22 official EU languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish), but also in Basque, Catalan, Croatian, Russian and Serbian. Further non-official translations exist.

The number of EuroVoc users and language versions is steadily increasing. The thesaurus covers the major interests of the institutions involved. It is hierarchically organised into 21 fields and, at the next level, 127 micro-thesauri, with about 6,700 descriptor terms (classes) altogether. The maximum depth of the hierarchy is 8 levels. To browse the thesaurus, see the EuroVoc web site.

JEX usage conditions

The JEX software can in principle be downloaded and used free of charge, but the detailed usage conditions in the EU Licence Agreement (EULA) must be adhered to. Scientific work using JEX, or scientific publications referring to JEX, should cite at least one of the publications mentioned below (see the section More information on JEX).

 

Download JEX

JEX has been trained for twenty-two languages. Each language version can be downloaded separately. For each language, there are two versions of JEX: (1) a basic version for the typical end user who wants either to test the software or to use it in a production environment; (2) an advanced version for technically trained IT specialists. The advanced version additionally allows users to re-train the software with a new document collection and to run scientific experiments. It also includes the data on which the software has been trained, so its download packages are much larger. We suggest that you first download the basic version and only download the advanced version once you have confirmed that you really want to use it.

 

JEX is implemented in Java. It runs on Windows, on Unix-like operating systems and on Apple Mac. The software should run on most modern computers, but it requires at least 2 GB of memory.

 

When downloading, you agree to the JEX usage conditions, as formulated in the EU Licence Agreement (EULA).

 

Language  Version  Indexing (basic)   Indexing and Training (advanced)
bg        1.0      download (18 MB)   download (89 MB)
cs        1.0      download (20 MB)   download (75 MB)
da        1.0      download (29 MB)   download (116 MB)
de        1.0      download (32 MB)   download (131 MB)
el        1.0      download (27 MB)   download (156 MB)
en        1.0      download (15 MB)   download (99 MB)
es        1.0      download (17 MB)   download (110 MB)
et        1.0      download (21 MB)   download (72 MB)
fi        1.0      download (35 MB)   download (121 MB)
fr        1.0      download (24 MB)   download (117 MB)
hu        1.0      download (14 MB)   download (72 MB)
it        1.0      download (25 MB)   download (117 MB)
lt        1.0      download (18 MB)   download (117 MB)
lv        1.0      download (19 MB)   download (72 MB)
mt        1.0      download (16 MB)   download (68 MB)
nl        1.0      download (25 MB)   download (117 MB)
pl        1.0      download (18 MB)   download (76 MB)
pt        1.0      download (24 MB)   download (116 MB)
ro        1.0      download (22 MB)   download (119 MB)
sk        1.0      download (18 MB)   download (75 MB)
sl        1.0      download (18 MB)   download (70 MB)
sv        1.0      download (28 MB)   download (115 MB)

More information on JEX

You can find more information on JEX in the documents listed below, depending on your interests and needs.

 

The user manual gives an easy-to-understand overview of the software and explains how to use it, step by step:

  • Ebrahim Mohamed, Ralf Steinberger & Marco Turchi. JEX Manual.

The following document, published in 2012, explains JEX, its history and possible uses. It describes the documents JEX was trained on, gives an overview of the indexing methodology and presents automatic evaluation results for all 22 languages. It also explains how to use JEX:

This third document, mostly targeted at the scientific community, explains the categorisation algorithm in more depth and also describes the results of a manual evaluation of the automatic classification, performed by specialised human EuroVoc indexers, for English and Spanish documents.

You can find many more related publications on the publications page of the JRC's Language Technology website.

Acknowledgements

We would like to thank Bruno Pouliquen, who developed a major part of the main assignment method, and Mladen Kolar, who implemented an initial Java version of the tool. We would also like to acknowledge the support of Victoria Fernandez-Mera from the Spanish Congress of Deputies and Elisabet Lindkvist from the Swedish Riksdagen, who gave us a lot of advice on practices relating to manual EuroVoc indexing and who helped us to thoroughly evaluate the software. Finally, we are grateful to the Publications Office of the European Commission for having provided their collection of manually EuroVoc-indexed documents. The initial work on JEX was funded as a JRC Exploratory Research Project. The preparation of the first public release of JEX, in May 2012, was partially funded under the JRC’s Innovative Project Competition scheme.

 

 

Keywords (English, German, French):
EuroVoc, automatic EuroVoc indexing, multilingual, multilingual classification, multilingual categorization, controlled vocabulary indexing, official European Union languages; Klassifikation von Dokumenten, kontrolliertes Vokabular, Mehrsprachigkeit, automatische Verschlagwortung, sprachübergreifend, Computer-Linguistik, Taxonomie, Ontologie; indexation de documents, linguistique informatique, multilingue, traitement du langage naturel, linguistique, vocabulaire contrôlé, thésaurus.

 




Please send comments on this page to Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)

Last update:  15 May 2012