Workshop title: Addressing the Language Barrier Problem in the Enlarged EU – Automating Eurovoc Descriptor Assignment
Date: 16-17 September 2004
Location: JRC Ispra,
Workshop language: English
Workshop announcement: http://www.jrc.cec.eu.int/enlargement/action2004/ipsc-w17.doc
Language Technology: http://langtech.jrc.cec.eu.int/
Workshop participants (by invitation)
For each of the new EU Member States, we expect two participants: one representing a parliamentary documentation centre and one computational linguist working on the respective language. There will also be representatives of the Eurovoc Steering Committee, the Eurovoc Maintenance Committee, and parliaments that have experience with automatic Eurovoc indexing.
The Eurovoc Thesaurus
Eurovoc is the classification system used by the European Parliament’s
Documentation Centre, by the Publications Office of the European Commission,
and by many documentation centres of national and regional parliaments. Eurovoc
has about 6000 hierarchically organised descriptor terms. Eurovoc will soon exist in one-to-one
translations in all official EU languages (and more) so that Eurovoc descriptors assigned to
documents in one language can be viewed in all other languages. Eurovoc is thus a powerful tool to
search and retrieve documents in a multilingual setting, and also to provide
information about the indexed texts in languages other than the text language.
See http://europa.eu/eurovoc/ for details.
Automating Eurovoc descriptor assignment
According to current indexing practice, librarians or similar documentation specialists choose a small set of the most appropriate Eurovoc descriptors for each text. The JRC has developed an automatic system that tries to imitate this human process of Eurovoc descriptor assignment by learning from sets of texts that have been indexed by librarians. Several independent evaluations have shown that the automatic process does not reach the quality of human descriptor assignment, but that it comes close enough to be useful for some purposes. For instance, the software could be used to automatically index documents that would otherwise not be indexed at all. Another possibility is to use the automatic process as the first step in an interactive indexing process, in which the machine suggests Eurovoc descriptors that are then verified by a human indexing professional, improving indexing speed and consistency. The web page http://langtech.jrc.cec.eu.int/Eurovoc.html summarises the method used and points to some related scientific publications. The most detailed presentation of the system is the report Cross-lingual Indexing, by Steinberger and Pouliquen (2003), which is available on that web site.
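As a rough illustration of what "learning from texts indexed by librarians" can mean, the following minimal sketch builds a word profile for each descriptor from manually indexed texts and then suggests descriptors for a new text. It is not the JRC system, whose actual method is described in the report cited above. The function names, the log-ratio weighting, the toy texts, and the labels FISHERIES and CUSTOMS are all assumptions made for this example; the labels are placeholders, not real Eurovoc entries.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def train_profiles(indexed_docs):
    """Build one word-weight profile per descriptor.

    indexed_docs is a list of (text, descriptors) pairs, i.e. texts that
    were indexed manually. A word's weight for a descriptor is the log of
    how much more frequent it is in that descriptor's texts than in the
    whole collection; words that are not over-represented are dropped.
    """
    background = Counter()
    per_descriptor = defaultdict(Counter)
    for text, descriptors in indexed_docs:
        words = tokenize(text)
        background.update(words)
        for descriptor in descriptors:
            per_descriptor[descriptor].update(words)

    total = sum(background.values())
    profiles = {}
    for descriptor, counts in per_descriptor.items():
        descriptor_total = sum(counts.values())
        weights = {}
        for word, count in counts.items():
            ratio = (count / descriptor_total) / (background[word] / total)
            if ratio > 1.0:
                weights[word] = math.log(ratio)
        profiles[descriptor] = weights
    return profiles


def suggest_descriptors(text, profiles, top_n=3):
    """Return up to top_n descriptors with a positive score, best first."""
    words = Counter(tokenize(text))
    scores = {}
    for descriptor, weights in profiles.items():
        score = sum(weights.get(w, 0.0) * n for w, n in words.items())
        if score > 0:
            scores[descriptor] = score
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Toy training set with hypothetical descriptor labels.
docs = [
    ("fishing quotas for cod and herring in the north sea", {"FISHERIES"}),
    ("subsidies for fishing fleets and fishing ports", {"FISHERIES"}),
    ("customs tariffs on imported steel", {"CUSTOMS"}),
    ("tariffs and customs checks at the border", {"CUSTOMS"}),
]
profiles = train_profiles(docs)
print(suggest_descriptors("new fishing quotas for herring", profiles))
# -> ['FISHERIES']
```

In an interactive setting of the kind described above, the ranked list returned by the hypothetical `suggest_descriptors` would be shown to a human indexer for verification rather than stored directly.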
To date, automatic Eurovoc indexing has been applied to the eleven official pre-enlargement EU languages. Our aim is to extend this automatic text analysis capacity to the new EU languages in order to support the integration of the new Member States and to facilitate EU citizens’ access to texts written in the new EU languages (and vice versa). We believe that this technology will help lower the language barrier in the ever-increasing jungle of European languages.
Workshop contents
During the workshop, we will explain, in simple terms, how the automatic system works and what material is required to train the system for the new languages. No technical knowledge is required, but it would be useful if you could contribute information on the indexing practice in your organisation: Does your organisation use Eurovoc? Which version? How long have you been using Eurovoc? Do the documents indexed with Eurovoc exist in electronic, machine-readable form? How many Eurovoc-indexed documents could be made available to train our system for your language?
For those workshop participants who are interested in collaborating with us on making automatic Eurovoc indexing work for their languages, we will draw up a plan of action on day 2. Issues of interest include the material and effort needed, the data exchange format, different Eurovoc versions, language-specific difficulties, etc. We will also discuss a viable option for languages for which no training material (manually Eurovoc-indexed texts) exists.
Previous experiments have shown that the JRC’s statistical methods for Eurovoc indexing can be applied to languages of a very different nature (English, Spanish, Finnish, Greek). However, language-specific text normalisation at the lexical level (lemmatisation or stemming, etc.) is beneficial. We hope that the workshop participants with a background in computational linguistics will be able to provide advice regarding language-specific difficulties, existing tools for text normalisation in their languages, etc.
Thursday 16 September (Pickup at hotel ‘Europa’:
Registration
Thomas Barbas (EC, DG JRC) | Welcome Note; Introduction to the JRC; Presentation of the EU Enlargement Action
Purpose of the workshop;
Christine Laaboudi-Spoiden & Alexandros Athanassiadis (EC, DG OPOCE, Publications Office) | Eurovoc as a means to access multilingual information
Suzanne Hanon (EC, DG Education & Culture, Central Library) | Indexing with the ECLAS Thesaurus at the Central Library of the European Commission: Principles for descriptor assignment and problems encountered while indexing
Coffee
Bruno Pouliquen (EC, DG JRC, Language Technology) | Automatic Eurovoc indexing: approach
Lunch
Bruno Pouliquen (EC, DG JRC, Language Technology) | Automatic Eurovoc indexing: evaluation and results
Victoria Fernandez Mera, Spanish Congress of Deputies | Experiences of the Spanish Congress of Deputies with automatic Eurovoc indexing
Elisabet Lindkvist Michailaki, Swedish Parliament | Automatic indexing with Eurovoc at the Swedish parliament
Coffee
Vaclav Sklenar & Anna Lhotská, Parliament of the | Automatic Eurovoc indexing - An experiment in the Czech Parliament
Multilingual text analysis applications based on automatic Eurovoc indexing
Travel expenses; formalities
End of day one; transfer to the hotel
Workshop Dinner
Friday 17 September (Pickup at hotel ‘Europa’:
Bruno Pouliquen (EC, DG JRC, Language Technology) | Next steps / Technical details
Tamás Váradi, | Indexing languages without a version of Eurovoc –
Coffee / Expenses / Formalities
Discussion / Questions and Answers
Summary of the Workshop
Lunch
End of the Workshop