Workshop description

Workshop title:         Addressing the Language Barrier Problem in the Enlarged EU –
                        Automating Eurovoc Descriptor Assignment
Date:                   16-17 September 2004
Location:               JRC Ispra, Italy, Building 36, Room 3
Workshop language:      English
Workshop announcement:  http://www.jrc.cec.eu.int/enlargement/action2004/ipsc-w17.doc
Language Technology:    http://langtech.jrc.cec.eu.int/

Workshop participants (by invitation)

For each of the new EU Member States, we expect two participants: one representing a parliamentary documentation centre and one computational linguist working on the respective language. There will also be representatives of the Eurovoc Steering Committee, the Eurovoc Maintenance Committee, and parliaments that have experience with automatic Eurovoc indexing.

The Eurovoc Thesaurus

Eurovoc is the classification system used by the European Parliament’s Documentation Centre, by the Publications Office of the European Commission, and by many documentation centres of national and regional parliaments. It contains about 6000 hierarchically organised descriptor terms. Eurovoc will soon exist in one-to-one translations in all official EU languages (and more), so that descriptors assigned to a document in one language can be viewed in all other languages. Eurovoc is thus a powerful tool for searching and retrieving documents in a multilingual setting, and for providing information about indexed texts in languages other than the one in which they are written. See http://europa.eu/eurovoc/ for details.

Automating Eurovoc descriptor assignment

In current indexing practice, librarians or other documentation specialists choose a small set of the most appropriate Eurovoc descriptors for each text. The JRC has developed an automatic system that tries to imitate this human process by learning from sets of texts that librarians have already indexed. Several independent evaluations have shown that the automatic process does not reach the quality of human descriptor assignment, but that it comes close enough to be useful for some purposes. For instance, the software could automatically index documents that would otherwise not be indexed at all. Alternatively, it could serve as the first step of an interactive indexing process in which the machine suggests Eurovoc descriptors and a professional indexer then verifies them, improving indexing speed and consistency. The web page http://langtech.jrc.cec.eu.int/Eurovoc.html summarises the method and points to some related scientific publications. The most detailed presentation of the system is the report Cross-lingual Indexing by Steinberger and Pouliquen (2003), which is available on the same web site.
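The general idea of learning descriptor assignment from manually indexed texts can be illustrated with a minimal sketch. This is not the JRC system: the weighting scheme (share of a descriptor's documents containing a word, multiplied by an inverse document frequency), the function names `train` and `suggest`, the descriptor labels, and the four toy training texts are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train(indexed_docs):
    """Build one word-weight profile per descriptor.

    indexed_docs: list of (text, [descriptor, ...]) pairs, i.e. texts that
    a human indexer has already labelled.
    """
    doc_freq = Counter()                    # word -> number of documents containing it
    desc_word_freq = defaultdict(Counter)   # descriptor -> word -> number of documents
    desc_docs = Counter()                   # descriptor -> number of documents
    for text, descriptors in indexed_docs:
        words = set(tokenize(text))
        doc_freq.update(words)
        for descriptor in descriptors:
            desc_word_freq[descriptor].update(words)
            desc_docs[descriptor] += 1

    n_docs = len(indexed_docs)
    profiles = {}
    for descriptor, counts in desc_word_freq.items():
        # Assumed weighting: words typical of this descriptor's documents
        # and rare in the collection as a whole get the highest weight.
        profiles[descriptor] = {
            word: (count / desc_docs[descriptor]) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
            if doc_freq[word] < n_docs      # skip words found in every document
        }
    return profiles


def suggest(profiles, text, top_n=3):
    """Rank descriptors for a new text by summed weight of matching words."""
    words = set(tokenize(text))
    scores = {
        descriptor: sum(weight for word, weight in profile.items() if word in words)
        for descriptor, profile in profiles.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(d, round(s, 3)) for d, s in ranked[:top_n] if s > 0]


# Invented toy training set; the descriptor labels are illustrative only.
training_docs = [
    ("Fishing quotas for cod in the North Sea", ["fisheries"]),
    ("New rules on fishing vessels and catch limits", ["fisheries"]),
    ("Airline passengers and airport security rules", ["air transport"]),
    ("Airport slots for airline operators", ["air transport"]),
]
profiles = train(training_docs)
print(suggest(profiles, "Cod fishing limits"))  # 'fisheries' is ranked first
```

In an interactive setting, the ranked list returned by `suggest` would be what the machine proposes and the human indexer verifies.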

To date, automatic Eurovoc indexing has been applied to the eleven official pre-enlargement EU languages. Our aim is to extend this automatic text analysis capacity to the new EU languages, both to support the integration of the new Member States and to make texts written in the new EU languages more accessible to EU citizens (and vice versa). We believe that this technology will help lower the language barrier in the ever-increasing jungle of European languages.

Workshop contents

During the workshop, we will explain, in simple terms, how the automatic system works and what material is required to train the system for the new languages. No technical knowledge is required, but it would be useful if you could contribute information on the indexing practice in your organisation: Does your organisation use Eurovoc? Which version? How long have you been using Eurovoc? Do the documents indexed with Eurovoc exist in electronic, machine-readable form? How many Eurovoc-indexed documents could be made available to train our system for your language?

For those workshop participants who are interested in collaborating with us to get automatic Eurovoc indexing working for their languages, we will draw up a plan of action on day 2. Issues of interest include the material and effort needed, the data exchange format, different Eurovoc versions, and language-specific difficulties. We will also discuss a viable option for languages for which no training material (manually Eurovoc-indexed texts) exists.

Previous experiments have shown that the JRC’s statistical methods for Eurovoc indexing can be applied to languages of very different types (English, Spanish, Finnish, Greek). However, language-specific text normalisation at the lexical level (lemmatisation, stemming, etc.) is beneficial. We hope that workshop participants with a background in computational linguistics will be able to advise on language-specific difficulties and on existing text normalisation tools for their languages.
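Why lexical normalisation helps a statistical method can be shown with a small sketch: without it, inflected forms of one word are counted as unrelated words. The suffix list below is a hypothetical, deliberately tiny stand-in for English only, not one of the tools used by the JRC. Real stemmers and lemmatisers rely on much richer language-specific rules or dictionaries.

```python
import re

# Hypothetical, deliberately tiny suffix list (English only, longest first).
SUFFIXES = ("ations", "ation", "ing", "es", "s")


def normalise(word, min_stem=4):
    """Lowercase a word and strip one known suffix, keeping a minimum stem length."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def vocabulary(text, use_normalisation=True):
    """Return the set of distinct word forms in a text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {normalise(w) for w in words} if use_normalisation else set(words)


sample = "Indexing indexes index"
print(vocabulary(sample, use_normalisation=False))  # three distinct forms
print(vocabulary(sample))                           # one shared form
```

For highly inflected languages the number of surface forms per word is much larger, which is where language-specific tools are expected to matter most.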


Workshop Programme

Thursday 16 September (Pickup at hotel ‘Europa’: 8:00)

9:00   Registration

9:30   Thomas Barbas (EC, DG JRC)
       Welcome Note; Introduction to the JRC; Presentation of the EU Enlargement Action (slides)

9:45   Ralf Steinberger (EC, DG JRC, Language Technology)
       Purpose of the workshop; Presentation of the programme (slides)

10:00  Christine Laaboudi-Spoiden & Alexandros Athanassiadis (EC, DG OPOCE, Publications Office)
       Eurovoc as a means to access multilingual information (slides)

10:30  Suzanne Hanon (EC, DG Education & Culture, Central Library)
       Indexing with the ECLAS Thesaurus at the Central Library of the European Commission: Principles for descriptor assignment and problems encountered while indexing (text)

11:00  Coffee

11:30  Bruno Pouliquen (EC, DG JRC, Language Technology)
       Automatic Eurovoc indexing: approach (slides)

12:30  Lunch

14:00  Bruno Pouliquen (EC, DG JRC, Language Technology)
       Automatic Eurovoc indexing: evaluation and results (slides)

14:30  Victoria Fernandez Mera, Spanish Congress of Deputies
       Experiences of the Spanish Congress of Deputies with automatic Eurovoc indexing (slides; text)

15:00  Elisabet Lindkvist Michailaki, Swedish Parliament
       Automatic indexing with Eurovoc at the Swedish parliament (slides)

15:30  Coffee

16:00  Vaclav Sklenar & Anna Lhotská, Parliament of the Czech Republic
       Automatic Eurovoc indexing - An experiment in the Czech Parliament (slides)

16:15  Ralf Steinberger (EC, DG JRC, Language Technology)
       Multilingual text analysis applications based on automatic Eurovoc indexing (slides)

17:15  Travel expenses; formalities

18:00  End of day one; transfer to the hotel

19:30  Workshop Dinner

Friday 17 September (Pickup at hotel ‘Europa’: 9:00 – with luggage)

9:30   Bruno Pouliquen (EC, DG JRC, Language Technology)
       Next steps / Technical details (how to provide data, and how much; data format; Eurovoc versions; linguistic pre-processing of the texts; evaluation procedure and evaluation interface) (slides)

10:30  Tamás Váradi, Hungarian Academy of Sciences
       Indexing languages without a version of Eurovoc: The Hungarian experience (slides)

11:00  Coffee / Expenses / Formalities

11:30  Discussion / Questions and Answers

12:00  Ralf Steinberger (EC, DG JRC, Language Technology)
       Summary of the Workshop

12:15  Lunch

14:00  End of the Workshop