NEW: Romanian JRC-Acquis corpus with 30 million words now available.
NEW: HunAlign alignments available for the JRC-Acquis parallel corpus.
Summary of the Activity
At the Joint Research Centre
(JRC), we have been using Language Technology since 1998 to fight
the information overflow and to overcome the language
barrier with the purpose of supporting the European Commission
and Member State institutions. To this end, a number of text gathering
(retrieval), analysis and visualisation tools have been developed
with a focus on high multilinguality, on multilingual and
multi-document information aggregation, and on tools to provide
cross-lingual information access (read publication).
These text analysis tools have been integrated with the
news gathering engine Europe Media Monitor EMM to produce
several complex, high-level applications.
The publicly accessible
multilingual news analysis system
NewsExplorer shows some of our text analysis applications. Further
integration of text analysis tools into the multilingual breaking
news system
NewsBrief and the specialised Medical Information System
MedISys (read
publication)
is under way. A collection of more recent developments is visible
to the public under the umbrella name EMM-Labs.
These include the automatic extraction of violent events and generation
of social networks based on news analysis. For an overview, see
http://press.jrc.it/overview.html.
A big advantage of these large-scale
news gathering and analysis applications is their neutrality due
to the fact that they are independent from the viewpoints
of specific news providers and even countries.
A combination of text analysis tools
Our tool set consists of three main components with
the following functionality:
- Multilingual and cross-lingual
retrieval of potentially user-relevant documents.
(E.g. the OSILIA and IDoRA for OLAF projects
on the automatic gathering and classification of articles from
online news sites, but especially the EMM engine, and more).
- Analysis
of documents and
extraction of different information aspects from these
documents plus language-neutral representation of this information,
where possible. Examples of the kinds of analysis are:
- identifying the
language a document is written
in (Language recognition);
- identifying the
keywords for
a document, both free monolingual indexing terms and controlled
vocabulary cross-lingual indexing terms from the EUROVOC thesaurus;
- identifying and disambiguating
named entities
such as people's and organisations' names, geographical references,
dates, currencies, etc.;
- detecting relations
between entities (mostly, but not only persons) such as contact,
support, criticism, family relationship,
etc.; using the extracted information to produce social
networks;
- identifying quotations by and about
people; using the extracted data to produce quotation
networks;
- extracting information about events
(violence, disasters, accidents) from the news, including
information on the actors, the victims, the type of event,
as well as time and place of the event; displaying the latest
events on maps;
- multilingual multi-document
summarisation;
- sentiment analysis (opinion mining);
- social network analysis and visualisation;
- recognising products and product groups;
- similarity
to other documents, including the identification of near-duplicate
texts;
- detection of monolingual
and cross-lingual document plagiarism; identification of document
translations (cross-lingual document similarity);
- clustering
of documents;
- classification
(categorisation/categorization) of documents, including multi-label
categorisation (controlled-vocabulary
indexing);
- relevance-ranking
of documents;
- terminology extraction
from subject-specific text collections;
- Visualisation of the
contents
- contents of single documents
in a document profile;
- contents of document collections
in a document map
or in cluster trees;
- automatically identified
geographical references in a
geographical map;
- trends over time and early-warning
functionality in graphs;
- ...
Living and working in the multilingual
and multicultural setting of the European Union, the focus of our
work is on multilingual and cross-lingual applications. The ultimate
goal is to give users cross-language access to information
hidden in large amounts of multilingual text, ideally in all official
EU languages, and more.
Distribution of language resources
The JRC also helps distribute
some of the European Commission's multilingual linguistic
resources such as the sentence-aligned parallel corpus
(i.e. collection of texts and their translations) JRC-Acquis
and the DGT Translation Memory DGT-TM.
Both resources cover 22 languages and involve all
231 language pairs. To date, the JRC-Acquis is the largest
available parallel corpus world-wide, considering the number of
languages and the amount of text. Both resources are useful to academia
and industry to carry out research and development into multilingual
text analysis tools and especially into cross-lingual applications
such as Machine Translation and multilingual dictionaries. The JRC-Acquis
can also be used as a test bed for multi-label text categorisation,
and more. The most outstanding and useful feature of these two resources
is that they include less widely used languages and language pairs.
Find out more
For technical-scientific information
on the individual text analysis applications and for overview reports
on the high-level applications, see the list of publications,
reports, posters and presentation slides.
The news analysis
applications powered by the Europe Media Monitor news gathering
engine EMM are our main priority. Take your time to explore their
functionality by going online now. Four of our applications are
freely accessible to the wider public (restricted Commission-internal
sites provide more information and functionality). You may include
their RSS feeds in your website, subscribe to the daily subject-specific
email alerts, or let EMM only alert you when some major event happens
(breaking news). See also the overview
page of our applications for additional information:
(1) NewsBrief
(http://press.jrc.it/NewsBrief/): breaking news detection, clustering
of related news and thematic classification of news from around
the world; RSS feeds or email notification; 43 languages.
(2) The Medical Information System MedISys
(http://medusa.jrc.it/): display of the latest health-related news
from around the world according to themes, diseases, symptoms, etc.;
automatic alerting for category-specific breaking news; RSS feeds,
email notification; early-warning functionality; 43 languages.
(3) NewsExplorer
(http://press.jrc.it/NewsExplorer/): daily summary of the news in
19 languages; linking of related news over time (topic tracking)
and across languages (cross-lingual topic tracking); information
on almost 700,000 persons gathered from the news in all 19 NewsExplorer
languages; multilingual and multi-document information aggregation;
multilingual quotation detection (reported speech), and more.
Read a NewsExplorer
system description.
(4) EMM-Labs
(http://emm-labs.jrc.it/): a collection of further text analysis
results that are currently more experimental and not yet fully integrated
with the previous, more mature applications. Available tools include
automatically generated country and theme fact sheets, real-time
monitoring and map display of violent events world-wide, a browser
to explore social networks as found in our multilingual news collections,
theme-based news statistics, and more.
Some tools and subjects in more
detail
Retrieving documents
Document Retrieval:
Our analysis and visualisation tools can be applied to already existing
documents, but we have also developed intelligent agent software
that automatically retrieves documents satisfying certain criteria
from specific web sites on the internet (a crawler). In
the OSILIA project, this crawler visited
about twenty different English and German language online newspaper
sites and downloaded, cleaned and classified all documents covering
the area of internet abuse (hacking, viruses, denial-of-service
attacks, paedophilia on the internet, etc.), automatically and on
a daily basis. In the IDoRA for OLAF project, we have extended
the software to more languages, more subject domains and to about
650 news sites world-wide, benefiting from the JRC's Europe Media
Monitor system EMM. EMM currently monitors about 1,400 news portals,
visits approximately 150 specialist medical sites and receives about
20 pay-for newswires. We have also refined the software, its user
interface, its duplicate handling, and its classification and relevance-ranking
modules.
Analysing documents and
extracting information from them
Language Recognition: When dealing with large multilingual
document collections, it is sometimes necessary to automatically
identify the language a text is written in. We use a statistical
method which compares the frequency of a text's letter n-grams (groups
of two or more letters) with those
n-grams typical for different languages (the image shows only
11 EU languages). No dictionaries are used. The advantage of using
purely statistical methods is that the system can be expanded to
new languages simply by feeding it texts in those languages, from
which the tool learns their n-gram statistics.
Usually, ten words of a language are enough to identify the language.
In the picture, the different
languages recognised automatically are marked up using different
colour codes. The tool is currently trained for more than 25 languages.
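The n-gram comparison described above can be illustrated with a minimal Python sketch. It is not the JRC tool: the function names, the use of cosine similarity to compare profiles, the choice of bigrams and trigrams, and the three tiny training texts are all assumptions made for illustration. A real system would be trained on far more text per language.

```python
from collections import Counter
import math

def ngram_profile(text, n_min=2, n_max=3):
    """Relative frequencies of letter n-grams (here: bigrams and trigrams)."""
    cleaned = " ".join(text.lower().split())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    total = sum(counts.values()) or 1
    return {gram: count / total for gram, count in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def train(samples):
    """Learn one n-gram profile per language from sample text (no dictionary)."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}

def identify(text, profiles):
    """Return the language whose profile is most similar to the text's profile."""
    profile = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(profile, profiles[lang]))

# Toy training data, invented for this sketch.
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chien dort au soleil",
}
profiles = train(TOY_SAMPLES)
print(identify("the dog and the fox", profiles))
```

Adding a language in this sketch only requires one more entry in the training samples, which mirrors the extensibility argument made above.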
Keyword Assignment: Keywords are
words of a document which are particularly relevant and which are
representative of the contents of the document. Keywords give users
a rough idea of the document contents without them having to read
the whole document. We also use keywords as input to document
clustering, classification and document similarity
calculation.
- Assignment of free,
monolingual keywords: we use statistical
methods which compare the lemma (base form of a word) frequency
of a text with the lemma frequency of a standard reference corpus
(e.g. several years of general newspaper texts). Keywords are
those lemmas which appear much more frequently in the text than
they appear in the reference corpus (normalised by the text length).
Keywords are always words of the language the text is written
in, e.g. for a Swedish text they will be in Swedish. The
example keyword list refers to a text
on plutonium smuggling.
- Cross-lingual
keyword assignment, using the EUROVOC thesaurus:
we use statistical methods and manually keyword-assigned training
corpora from the documentation centres of the European Parliament
and the EC's Publications Office OPOCE to assign the
relevant subsets of the approximately 6,300 EUROVOC thesaurus
entries to texts. As EUROVOC exists in more than 25 one-to-one
language translations, these EUROVOC descriptors can
be displayed in any of the other languages. For more information,
see the publications at the Eurolan'2003
and
RANLP'2003 conferences, as well as publications describing
the cross-lingual functionalities of NewsExplorer (e.g. our article
in the Journal
of Computing and Information Technology, in CoLing'2004
and more). The JRC also held an international
workshop on this subject in September 2004. For more information
on the multilingual Eurovoc
thesaurus itself, go to http://eurovoc.europa.eu/.
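The idea behind the free, monolingual keywords (lemmas that are much more frequent in the text than in a reference corpus) can be sketched in Python as follows. The scoring formula here, a simple ratio of relative frequencies with add-one smoothing, is an assumption chosen for brevity and is not necessarily the statistic used in our tools. The function name, the `min_count` threshold and the toy data are likewise hypothetical.

```python
from collections import Counter

def keywords(text_lemmas, reference_lemmas, top_n=5, min_count=2):
    """Rank lemmas by how much more frequent they are in the text
    than in the reference corpus (both normalised by length)."""
    text_counts = Counter(text_lemmas)
    ref_counts = Counter(reference_lemmas)
    text_total = len(text_lemmas)
    ref_total = len(reference_lemmas)
    vocab_size = len(set(text_counts) | set(ref_counts))
    scored = []
    for lemma, count in text_counts.items():
        if count < min_count:
            continue  # ignore lemmas that occur too rarely to be reliable
        text_rate = count / text_total
        # Add-one smoothing: lemmas unseen in the reference get a small,
        # non-zero rate instead of causing a division by zero.
        ref_rate = (ref_counts[lemma] + 1) / (ref_total + vocab_size)
        scored.append((text_rate / ref_rate, lemma))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [lemma for _, lemma in scored[:top_n]]

# Toy lemmatised input, invented for this sketch.
text = ["the", "police", "seize", "plutonium", "at", "the", "airport",
        "the", "plutonium", "be", "in", "a", "suitcase"]
reference = ["the", "a", "in", "at", "be", "the", "of", "the", "a", "in", "the", "be"]
print(keywords(text, reference, top_n=2))
```

In this toy example "plutonium" outranks "the" because "the" is just as common in the reference corpus as in the text.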
Recognition of named entities in text: People's
and organisations' names, names of geographical
locations, date and currency expressions,
etc. are referred to as named entities. Many users are
particularly interested in knowing which named entities are mentioned
in their texts. Recognising them automatically in new texts is not
merely a question of performing a dictionary lookup, because new
names keep appearing all the time. For the automatic recognition,
the usage of lists of known names is usually combined with mechanisms
looking for certain patterns (e.g. everything following Mr./Ms./Dr.
etc. is likely to be a name). Different local grammars describing
these patterns have to be written for each language, and we have
developed methods to identify such patterns quickly for new languages
(currently about 20; see our publication in the MIT Press book Learning
Machine Translation, and others). In four years of daily news
analysis, we have identified almost 700,000 distinct names in multilingual
news. Name recognition software has been produced by a variety of
companies. The JRC's tools differ from commercially available software
in that they automatically identify variant spellings for the same
person, even across writing systems (Cyrillic, Arabic, Roman scripts,
etc.). In our daily news analysis, we have found up to 170 different
spellings for a single person (for technical details, see for example
the article in the specialist journal Linguisticae
Investigationes). We also recognise date expressions
and geographical references in many different
languages. Unlike in most commercial software, the JRC's software
disambiguates between places with the same name, such as between
the 18 places world-wide called 'Paris'. For
details on the
identification and visualisation of geographical references
(geo-tagging), see the publication at LREC'2006,
and others.
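The combination of a list of known names with a title-based pattern can be illustrated by the following minimal Python sketch. It is a simplification for English only: the regular expression, the list of titles and the one-entry name list are assumptions made for illustration, and the sketch does not attempt variant-spelling detection or place-name disambiguation.

```python
import re

# Hypothetical local pattern: a title followed by one or more capitalised words.
TITLE_PATTERN = re.compile(
    r"\b(?:Mr|Ms|Mrs|Dr)\.?\s+((?:[A-Z][\w'-]+)(?:\s+[A-Z][\w'-]+)*)"
)

# Hypothetical list of already known names (a real list would be very large).
KNOWN_NAMES = {"Kofi Annan"}

def find_names(text):
    """Return names found by list lookup or by the title pattern, sorted."""
    found = {name for name in KNOWN_NAMES if name in text}
    for match in TITLE_PATTERN.finditer(text):
        found.add(match.group(1))  # the capitalised words after the title
    return sorted(found)

print(find_names("Dr. Jane Smith met Kofi Annan and Mr Brown in Geneva."))
```

The pattern finds "Jane Smith" and "Brown" although neither is in the name list, which is the reason a pure dictionary lookup is not sufficient.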
Recognition of products and product groups
in text: For some applications, it may be particularly important
to know which products or product groups are referred to in
a text. Therefore, we have collaborated with the University of Munich's
Computational Linguistics department CIS to develop a tool that
identifies the products listed in the multilingual and hierarchically
organised Customs Tariff Code
TARIC automatically in texts (see
publication at IS'2004).
TARIC is available in at
least twenty languages. The advantage of using TARIC
is therefore that products identified in a text of one language
can be displayed in all the other languages. This means that users
can see lists of products referred to in their own language even
if the text is written in a language they may not understand (cross-language
information access, similar to EUROVOC indexing).
Hierarchical Clustering can be
carried out for any kind of data: words, texts, n-grams, etc.
Similarity calculation is based on chosen features. The degree
of similarity between two items or two groups of items is expressed
here using a numerical value between 0 and 1. The clustering tool
also calculates the features of whole groups of items. Here
are some examples:
-
Clustering of documents according to the keywords they share.
-
Clustering
of (key)words according to their co-occurrence.
-
Clustering of languages according to their bigram frequency statistics.
The most typical two-letter-combinations (bigrams) are listed for each
language. The picture shows that Northern-European languages cluster with
each other and that Latin-based languages cluster with each other. The
curious phenomenon that the Italian language does not cluster with the
Romance languages, and that it is in fact completely isolated, is due to the
fact that Italian uses very different letter combinations from the other
European languages.
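A bottom-up clustering of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: similarity is taken to be the cosine of non-negative feature weights (which lies between 0 and 1), the features of a group are taken to be the sum of its members' features, and merging stops at a hypothetical `threshold`. The function names and the toy keyword weights are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity; between 0 and 1 for non-negative feature weights."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(items, threshold=0.5):
    """Repeatedly merge the two most similar clusters until no pair
    reaches the threshold. `items` maps an item name to its features."""
    clusters = [((name,), Counter(features)) for name, features in items.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:
            break
        members = clusters[i][0] + clusters[j][0]
        features = clusters[i][1] + clusters[j][1]  # features of the whole group
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, features))
    return [sorted(members) for members, _ in clusters]

# Toy documents described by keyword weights, invented for this sketch.
docs = {
    "d1": {"plutonium": 2, "smuggling": 1},
    "d2": {"plutonium": 1, "smuggling": 2},
    "d3": {"football": 3, "match": 1},
    "d4": {"football": 1, "match": 2},
}
print(cluster(docs))
```

The same function could be applied to words described by their co-occurrence counts or to languages described by their bigram frequencies, since it only depends on the feature weights.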
Similarity calculation and identification
of near-duplicates: Knowing about the similarity between
documents can be useful when users have identified one document
of interest and want to find out about other documents which cover
the same subject (also called query by example). Knowledge
about document similarity is furthermore required for document clustering
and for the automatic classification of documents. We use several
different ways of identifying document similarity, including (a)
statistical comparison of the keyword lists of documents, (b) counting
the number of lemma n-grams two documents have in common, and (c) displaying
those sections two documents have in common (see
sample text comparison). The latter is useful for partially
identical documents such as news stories based on the same press
feed or revised copies of the same document. Near-duplicates add
very little or no new information for users and should therefore
be discarded when analysing or browsing text collections. The
application is also useful to detect plagiarism.
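Method (b), counting shared n-grams, can be illustrated with the following Python sketch. It assumes the texts have already been lemmatised and tokenised, uses trigrams, and normalises the count as a Jaccard ratio. The normalisation, the function names and the toy sentences are assumptions made for this example.

```python
def word_ngrams(tokens, n=3):
    """Set of all n-grams (as tuples) in a list of lemmas."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(a, b, n=3):
    """The n-grams that two documents have in common."""
    return word_ngrams(a, n) & word_ngrams(b, n)

def overlap(a, b, n=3):
    """Shared n-grams divided by all distinct n-grams of both documents."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

# Toy lemmatised documents, invented for this sketch.
doc_a = "the minister resign after the vote on monday".split()
doc_b = "the minister resign after the vote on tuesday".split()
doc_c = "rain be expect in the north".split()

print(len(shared_ngrams(doc_a, doc_b)), overlap(doc_a, doc_b), overlap(doc_a, doc_c))
```

A high overlap value, such as the one between the first two documents, would mark a pair as near-duplicates. The shared n-grams themselves indicate which passages the two documents have in common.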
Document similarity across
languages can be achieved by linking documents written in different
languages to the same multilingual thesaurus or other information
that can be represented in a language-neutral way, such as lists
of names, of locations, of subject domain categories, and more.
For details, see the publications at CoLing'2004, at
IS'2004,
the project on cross-lingual
indexing using EUROVOC
and the publications and the example
given there, as well as the activity on marking up references to
product groups, geographical locations and dates
in texts.
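The principle can be shown with a small Python sketch in which each document, whatever its language, is represented by weighted language-neutral identifiers. The identifier scheme (`thesaurus:…`, `place:…`, `person:…`), the weights and the use of cosine similarity are all hypothetical and serve only to illustrate the idea.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse, non-negative weight vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical language-neutral profiles: thesaurus descriptors, places and
# persons are stored as identifiers rather than as words of any one language.
doc_en = {"thesaurus:1234": 0.9, "place:FR-PAR": 0.5, "person:42": 0.7}
doc_de = {"thesaurus:1234": 0.8, "place:FR-PAR": 0.6, "person:42": 0.6}
doc_es = {"thesaurus:9876": 0.9, "place:ES-MAD": 0.4}

print(cosine(doc_en, doc_de), cosine(doc_en, doc_es))
```

Because the English and German profiles share the same identifiers, they score as highly similar although the documents have no words in common.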
Document classification, or categorisation:
While clustering is a way of
organising document collections bottom-up into naturally occurring
groups of documents, the term classification (also categorisation
or categorization) refers to the assignment of documents
into one or more given classes or categories. Automatic
classification on the basis of the vocabulary used can be achieved
in a variety of ways. These include (a) the usage of rules written
by subject-domain specialists that formulate conditions such as
words A, B and C have to occur at least X times for a document to
be assigned to a certain class, and (b) automatic comparison of
documents (using statistical methods or Machine Learning techniques)
to those which were previously manually assigned to the different
classes; the classes with the documents that are most similar to
the new document are the ones that are most suitable for this document.
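Both approaches can be illustrated with a short Python sketch. For (a), a rule requires each listed word to occur at least a given number of times. For (b), classes are ranked by the similarity of the new document to the most similar manually classified document of each class. The function names, the use of cosine similarity over word counts and the toy training data are assumptions made for this example.

```python
import math
from collections import Counter

def matches_rule(tokens, required_words, min_count):
    """(a) Rule: every required word must occur at least `min_count` times."""
    counts = Counter(tokens)
    return all(counts[word] >= min_count for word in required_words)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_classes(tokens, labelled_docs):
    """(b) Rank classes by the best similarity between the new document
    and any document previously assigned to that class."""
    vector = Counter(tokens)
    best = {}
    for label, doc_tokens in labelled_docs:
        sim = cosine(vector, Counter(doc_tokens))
        best[label] = max(best.get(label, 0.0), sim)
    return sorted(best, key=lambda label: (-best[label], label))

# Toy manually classified documents, invented for this sketch.
labelled = [
    ("health", "flu outbreak hospital vaccine".split()),
    ("security", "virus hacking attack network".split()),
]
new_doc = "hospital report flu vaccine shortage".split()

print(matches_rule("virus attack virus attack network".split(), ["virus", "attack"], 2))
print(rank_classes(new_doc, labelled))
```

For multi-label classification, the top few classes of the ranking could be assigned instead of only the first one.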
Visualising the contents
of documents and document collections
Document Profiles
are a way of displaying the information aspects which were
previously extracted from individual documents
in an organised and structured way. They allow users to focus on the kind of
information they are interested in and to decide quickly whether they are
interested in a given document or not. The contents of document profiles
depend on the information that was previously extracted. Our
example document profile
refers to a text on plutonium
smuggling.
Document Maps are a way of giving
users an idea of the contents and structure of a whole document
collection by clustering related documents into groups
and by assigning keywords to these clusters. The document map displayed here
(see also high-quality bmp) was produced
with the software ThemeScape from the US firm Cartia.
The JRC's Language Technology group has worked with a Machine Learning
group at the JRC to produce a couple of in-house systems which also
have the goal of visualising document collections, but no effort
has been spent so far on making the results look pleasant to the
eye (see the publications at the
PKDD'2000 and IJCAI'1999
conferences).
Visualisation of geographical
references: Geographical references
made in texts or in text collections can, of course, be visualised
with maps. The example image
was produced on the basis of the geographical references (see: named entity recognition) identified automatically
in 1496 English, German, French and Spanish documents. See also
our publications at LREC'2006.
In addition to geographical locations, more information can be
displayed on maps. See the publication at ESARDA'2005
for examples.