
NEW: Romanian JRC-Acquis corpus with 30 Million words now available.
NEW: HunAlign alignments available for JRC-Acquis parallel corpus.

Summary of the Activity

At the Joint Research Centre (JRC), we have been using Language Technology since 1998 to fight information overload and to overcome the language barrier, with the purpose of supporting the European Commission and Member State institutions. To this end, a number of text gathering (retrieval), analysis and visualisation tools have been developed with a focus on high multilinguality, on multilingual and multi-document information aggregation, and on tools to provide cross-lingual information access (read publication). These text analysis tools have been integrated with the news gathering engine Europe Media Monitor (EMM) to produce several complex, high-level applications.

 

The publicly accessible multilingual news analysis system NewsExplorer shows some of our text analysis applications. Further integration of text analysis tools into the multilingual breaking news system NewsBrief and the specialised Medical Information System MedISys (read publication) is under way. A collection of more recent developments is visible to the public under the umbrella name EMM-Labs. These include the automatic extraction of violent events and the generation of social networks based on news analysis. For an overview, see http://emm.newsbrief.eu/overview.html.


A major advantage of these large-scale news gathering and analysis applications is their neutrality: they are independent of the viewpoints of specific news providers and even countries.


A combination of text analysis tools

Our tool set consists of three main components with the following functionality:

  1. Multilingual and cross-lingual retrieval of potentially user-relevant documents (e.g. the OSILIA and IDoRA for OLAF projects on the automatic gathering and classification of articles from online news sites and, especially, the EMM engine).
  2. Analysis of documents and extraction of different information aspects from these documents, plus language-neutral representation of this information where possible. Examples of the kinds of analysis are:
    • identifying the language a document is written in (Language recognition);
    • identifying the keywords for a document, both free monolingual indexing terms and controlled vocabulary cross-lingual indexing terms from the EUROVOC thesaurus;
    • identifying and disambiguating named entities such as people's and organisations' names, geographical references, dates, currencies, etc.;
    • detecting relations between entities (mostly, but not only persons) such as contact, support, criticism, family relationship, etc.; using the extracted information to produce social networks;
    • identifying quotations by and about people; using the extracted data to produce quotation networks;
    • extracting information about events (violence, disasters, accidents) from the news, including information on the actors, the victims, the type of event, as well as time and place of the event; displaying the latest events on maps;
    • multilingual multi-document summarisation;
    • sentiment analysis (opinion mining);
    • social network analysis and visualisation;
    • recognising references to products and product groups;
    • calculating similarity to other documents, including the identification of near-duplicate texts;
    • detection of monolingual and cross-lingual document plagiarism; identification of document translations (cross-lingual document similarity);
    • clustering of documents;
    • classification (categorisation/categorization) of documents, including multi-label categorisation (controlled-vocabulary indexing);
    • relevance-ranking of documents;
    • terminology extraction from subject-specific text collections.
  3. Visualisation of the contents
    • contents of single documents in a document profile;
    • contents of document collections in a document map or in cluster trees;
    • automatically identified geographical references in a geographical map;
    • trends over time and early-warning functionality in graphs;
    • ...

As we live and work in the multilingual and multicultural setting of the European Union, the focus of our work is on multilingual and cross-lingual applications. The ultimate goal is to give users cross-language access to information hidden in large amounts of multilingual text, ideally in all official EU languages, and more.


Distribution of language resources

The JRC also helps distribute some of the European Commission's multilingual linguistic resources such as the sentence-aligned parallel corpus (i.e. collection of texts and their translations) JRC-Acquis and the DGT Translation Memory DGT-TM. Both resources cover 22 languages and involve all 231 language pairs (22 × 21 / 2). To date, the JRC-Acquis is the largest available parallel corpus world-wide, considering the number of languages and the amount of text. Both resources are useful to academia and industry for research and development into multilingual text analysis tools and especially into cross-lingual applications such as Machine Translation and multilingual dictionaries. The JRC-Acquis can also be used as a test bed for multi-label text categorisation, and more. The most outstanding and useful feature of these two resources is that they include less widely used languages and language pairs.

Find out more

For technical-scientific information on the individual text analysis applications and for overview reports on the high-level applications, see the list of publications, reports, posters and presentation slides.


The news analysis applications powered by the Europe Media Monitor news gathering engine EMM are our main priority. Take your time to explore their functionality by going online now. Four of our applications are freely accessible to the wider public (restricted Commission-internal sites provide more information and functionality). You may include their RSS feeds in your website, subscribe to the daily subject-specific email alerts, or have EMM alert you only when some major event happens (breaking news). See also the overview page of our applications for additional information:


(1) NewsBrief (http://emm.newsbrief.eu/): breaking news detection, clustering of related news and thematic classification of news from around the world; RSS feeds or email notification; 43 languages.

(2) The Medical Information System MedISys (http://medusa.jrc.it/): display of the latest health-related news from around the world according to themes, diseases, symptoms, etc.; automatic alerting for category-specific breaking news; RSS feeds, email notification; early-warning functionality; 43 languages.

(3) NewsExplorer (http://emm.newsexplorer.eu/): daily summary of the news in 19 languages; linking of related news over time (topic tracking) and across languages (cross-lingual topic tracking); information on almost 700,000 persons gathered from the news in all 19 NewsExplorer languages; multilingual and multi-document information aggregation; multilingual quotation detection (reported speech), and more. Read a NewsExplorer system description.

(4) EMM-Labs (http://emm-labs.jrc.it/): a collection of further text analysis results that are currently more experimental and not yet fully integrated with the previous, more mature applications. Available tools include automatically generated country and theme fact sheets, real-time monitoring and map display of violent events world-wide, a browser to explore social networks as found in our multilingual news collections, theme-based news statistics, and more.



Some tools and subjects in more detail

Retrieving documents

Document Retrieval: Our analysis and visualisation tools can be applied to already existing documents, but we have also developed intelligent agent software that automatically retrieves documents satisfying certain criteria from specific web sites on the internet (a crawler). In the OSILIA project, this crawler visited about twenty different English and German language online newspaper sites and downloaded, cleaned and classified all documents covering the area of internet abuse (hacking, viruses, denial-of-service attacks, paedophilia on the internet, etc.), automatically and on a daily basis. In the IDoRA for OLAF project, we have extended the software to more languages, more subject domains and to about 650 news sites world-wide, benefiting from the JRC's Europe Media Monitor system EMM. EMM currently monitors about 1,400 news portals, visits approximately 150 specialist medical sites and receives about 20 pay-for newswires. We have also refined the software, its user interface, its duplicate handling, and its classification and relevance-ranking modules. 

Analysing documents and extracting information from them

Language Recognition: When dealing with large multilingual document collections, it is sometimes necessary to automatically identify the language a text is written in. We use a statistical method which compares the frequencies of a text's letter n-grams (groups of two or more letters) with the n-gram frequencies typical of different languages (the image shows only 11 EU languages). No dictionaries are used. The advantage of using purely statistical methods is that the system can be extended to a new language simply by feeding it texts in that language, from which the tool learns the language's n-gram statistics. Usually, ten words of text are enough to identify the language. In the picture, the different languages recognised automatically are marked up using different colour codes. The tool is currently trained for more than 25 languages.
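
The n-gram comparison described above can be illustrated with a minimal sketch. It is not the JRC tool: the function names, the toy training sentences, the use of bigrams and trigrams, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngram_profile(text, n_min=2, n_max=3):
    """Relative frequencies of letter n-grams (spaces included) in a text."""
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def train(samples):
    """Learn one n-gram profile per language from sample text (no dictionaries)."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}

def identify(text, profiles):
    """Return the language whose profile is most similar to the text's profile."""
    profile = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(profile, profiles[lang]))

# Toy training data (hypothetical); a real system would learn from far more text.
profiles = train({
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in dem haus",
})
print(identify("the dog and the house", profiles))
print(identify("der hund und das haus", profiles))
```

Adding a language in this sketch only requires one more entry in the training samples, which mirrors the extensibility argument made above.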


Keyword Assignment: Keywords are words of a document which are particularly relevant and which are representative of the contents of the document. Keywords give users a rough idea of the document contents without them having to read the whole document. We use keywords furthermore as input to document clustering, classification and document similarity calculation.

  • Assignment of free, monolingual keywords: we use statistical methods which compare the lemma (base form of a word) frequency of a text with the lemma frequency of a standard reference corpus (e.g. several years of general newspaper texts). Keywords are those lemmas which appear much more frequently in the text than they appear in the reference corpus (normalised by the text length). Keywords are always words of the language the text is written in, e.g. for a Swedish text they will be in Swedish. The example keyword list refers to a text on plutonium smuggling.
  • Cross-lingual keyword assignment, using the EUROVOC thesaurus: we use statistical methods and manually keyword-assigned training corpora from the documentation centres of the European Parliament and the EC's Publications Office (OPOCE) to assign the relevant subsets of the approximately 6,300 EUROVOC thesaurus entries to texts. As EUROVOC exists in more than 25 one-to-one language translations, these EUROVOC descriptors can be displayed in any of the other languages. For more information, see the publications at the Eurolan'2003 and RANLP'2003 conferences, as well as publications describing the cross-lingual functionalities of NewsExplorer (e.g. our article in the Journal of Computing and Information Technology, in CoLing'2004 and more). The JRC also held an international workshop on this subject in September 2004. For more information on the multilingual Eurovoc thesaurus itself, go to http://eurovoc.europa.eu/.
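
The free, monolingual keyword assignment described in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the JRC implementation: the input is assumed to be already lemmatised, the score is a simple ratio of relative frequencies, and the add-one smoothing, the `min_count` filter and the function name are hypothetical choices.

```python
from collections import Counter

def keywords(text_lemmas, reference_lemmas, top_k=5, min_count=2):
    """Rank lemmas that are much more frequent in the text than in the reference corpus."""
    text_counts = Counter(text_lemmas)
    ref_counts = Counter(reference_lemmas)
    text_len = len(text_lemmas)
    ref_len = len(reference_lemmas)
    vocab_size = len(set(text_counts) | set(ref_counts))
    scores = {}
    for lemma, count in text_counts.items():
        if count < min_count:
            continue  # assumption: ignore lemmas seen only once in the text
        text_rate = count / text_len  # normalised by the text length
        # Add-one smoothing (an assumption) so that unseen lemmas do not divide by zero.
        ref_rate = (ref_counts[lemma] + 1) / (ref_len + vocab_size)
        scores[lemma] = text_rate / ref_rate
    # Highest ratio first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda lemma: (-scores[lemma], lemma))[:top_k]

# Toy data (hypothetical): a short "text" and a small "reference corpus".
text = "plutonium smuggle police plutonium border the the the smuggle".split()
reference = ("the of and police the a in the border of " * 10).split()
print(keywords(text, reference, top_k=2))
```

Frequent function words such as "the" receive a low score because they are just as common in the reference corpus, while topic-specific lemmas rise to the top.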

Recognition of named entities in text: People's and organisations' names, names of geographical locations, date and currency expressions, etc. are referred to as named entities. Many users are particularly interested in knowing which named entities are mentioned in their texts. Recognising them automatically in new texts is not merely a question of performing a dictionary lookup, because new names keep appearing all the time. For automatic recognition, lists of known names are usually combined with mechanisms that look for certain patterns (e.g. whatever follows Mr./Ms./Dr. etc. is likely to be a name). Different local grammars describing these patterns have to be written for each language, and we have developed methods to identify such patterns quickly for new languages (currently about 20; see our publication in the MIT Press book Learning Machine Translation, and others). In four years of daily news analysis, we have identified almost 700,000 distinct names in multilingual news. Name recognition software has been produced by a variety of companies. The JRC's tools differ from commercially available software in that they automatically identify variant spellings for the same person, even across writing systems (Cyrillic, Arabic, Roman scripts, etc.). In our daily news analysis, we have found up to 170 different spellings for a single person (for technical details, see for example the article in the specialist journal Linguisticae Investigationes). We also recognise date expressions and geographical references in many different languages. Unlike most commercial software, the JRC's software disambiguates between places with the same name, such as the 18 places world-wide called 'Paris'. For details on the identification and visualisation of geographical references (geo-tagging), see the publication at LREC'2006, and others.
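
The combination of a known-name list with trigger patterns can be sketched as below. This is a deliberately small illustration, not the JRC's local grammars: the regular expression, the title list, the restriction to ASCII capital letters and the function name are all assumptions.

```python
import re

# Hypothetical trigger pattern: a title followed by one or more capitalised words.
TITLES = r"(?:Mr|Mrs|Ms|Dr)"
NAME = r"[A-Z][\w'-]+"
TITLE_PATTERN = re.compile(rf"\b{TITLES}\.?\s+({NAME}(?:\s+{NAME})*)")

def find_person_names(text, known_names=()):
    """Return names found via title patterns first, then via the known-name list."""
    found = []
    # 1. Pattern-based recognition catches names that are not yet in any list.
    for match in TITLE_PATTERN.finditer(text):
        if match.group(1) not in found:
            found.append(match.group(1))
    # 2. Lookup of already known names (whole-word match).
    for name in known_names:
        if name not in found and re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(name)
    return found

sentence = "Dr. Jane Smith met Mr Brown and Kofi Annan in Paris."
print(find_person_names(sentence, known_names=["Kofi Annan"]))
```

A new name such as "Jane Smith" is picked up by the pattern even though it is absent from the list, which is why a lookup alone is not sufficient.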


Recognition of products and product groups in text: For some applications, it may be particularly important to know which products or product groups are referred to in a text. Therefore, we have collaborated with the University of Munich's Computational Linguistics department CIS to develop a tool that automatically identifies in texts the products listed in the multilingual and hierarchically organised Customs Tariff Code TARIC (see publication at IS'2004). TARIC is available in at least twenty languages. The advantage of using TARIC is therefore that products identified in a text of one language can be displayed in all the other languages. This means that users can see lists of the products referred to in their own language even if the text is written in a language they may not understand (cross-language information access, similar to EUROVOC indexing).


Hierarchical Clustering can be carried out for any kind of data: words, texts, n-grams, etc. Similarity calculation is based on chosen features. The degree of similarity between two items or two groups of items is expressed here using a numerical value between 0 and 1. The clustering tool also calculates the features of whole groups of items. Here are some examples:


  • Clustering of documents according to the keywords they share.
  • Clustering of (key)words according to their co-occurrence.
  • Clustering of languages according to their bigram frequency statistics. The most typical two-letter-combinations (bigrams) are listed for each language. The picture shows that Northern-European languages cluster with each other and that Latin-based languages cluster with each other. The curious phenomenon that the Italian language does not cluster with the Romance languages, and that it is in fact completely isolated, is due to the fact that Italian uses very different letter combinations from the other European languages.
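
A minimal sketch of bottom-up hierarchical clustering with similarity values between 0 and 1 is given below. It is not the JRC clustering tool: cosine similarity, the summing of member features to describe a group, and all names and toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two feature-count mappings (0..1 for non-negative counts)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(items):
    """Agglomerative clustering.

    items: dict mapping a label to its feature counts.
    Returns the merge history as (members_a, members_b, similarity) tuples.
    """
    clusters = {(label,): Counter(features) for label, features in items.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        pairs = [(x, y) for i, x in enumerate(keys) for y in keys[i + 1:]]
        # Merge the two most similar clusters.
        a, b = max(pairs, key=lambda p: cosine(clusters[p[0]], clusters[p[1]]))
        similarity = cosine(clusters[a], clusters[b])
        # The features of the new group are the summed features of its members.
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(similarity, 3)))
    return merges

# Toy documents (hypothetical), described by the keywords they contain.
docs = {
    "d1": {"plutonium": 1, "smuggling": 1},
    "d2": {"plutonium": 1, "border": 1},
    "d3": {"football": 1, "league": 1},
}
merges = cluster(docs)
for step in merges:
    print(step)
```

The same function can be applied to words described by their co-occurrences or to languages described by their bigram frequencies, since only the feature counts change.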

Similarity calculation and identification of near-duplicates: Knowing about the similarity between documents can be useful when users have identified one document of interest and want to find out about other documents which cover the same subject (also called query by example). Knowledge about document similarity is furthermore required for document clustering and for the automatic classification of documents. We use several different ways of identifying document similarity, including (a) statistical comparison of the keyword lists of documents, (b) counting the number of lemma n-grams two documents have in common, and (c) displaying those sections two documents have in common (see sample text comparison). The latter is useful for partially identical documents such as news stories based on the same press feed or revised copies of the same document. Near-duplicates add very little or no new information for users and should therefore be discarded when analysing or browsing text collections. The application is also useful to detect plagiarism.
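
Method (b), counting shared n-grams, can be sketched as follows. The sketch assumes the documents are already tokenised and lemmatised; the trigram size, the normalisation by the union of n-grams and the 0.7 threshold are arbitrary illustrative choices, not values taken from the JRC software.

```python
def word_ngrams(tokens, n=3):
    """Set of all sequences of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(a, b, n=3):
    """Number of distinct n-grams the two documents have in common."""
    return len(word_ngrams(a, n) & word_ngrams(b, n))

def overlap(a, b, n=3):
    """Shared n-grams divided by all distinct n-grams of both documents (0..1)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def is_near_duplicate(a, b, n=3, threshold=0.7):
    """Flag a pair as near-duplicates; the threshold is an arbitrary assumption."""
    return overlap(a, b, n) >= threshold

# Toy documents (hypothetical).
doc_a = "police seized plutonium at the border on monday".split()
doc_b = "police seized plutonium at the border on tuesday".split()
doc_c = "the football league starts next week".split()
print(shared_ngrams(doc_a, doc_b), is_near_duplicate(doc_a, doc_b))
print(shared_ngrams(doc_a, doc_c), is_near_duplicate(doc_a, doc_c))
```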


Document similarity across languages can be calculated by linking documents written in different languages to the same multilingual thesaurus or to other information that can be represented in a language-neutral way, such as lists of names, of locations, of subject domain categories, and more. For details, see the publications at CoLing'2004 and at IS'2004, the project on cross-lingual indexing using EUROVOC together with the publications and the example given there, as well as the activity on marking up references to product groups, geographical locations and dates in texts.
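
The idea of comparing documents through a language-neutral representation can be sketched as below. The tiny lexicon, its identifiers (such as "LOC:munich") and the function names are hypothetical stand-ins for resources like thesaurus descriptors or name and location lists; the sketch is not the JRC's method.

```python
import math
from collections import Counter

# Hypothetical lexicon: language-specific word forms mapped to language-neutral identifiers.
LEXICON = {
    "en": {"munich": "LOC:munich", "plutonium": "TERM:plutonium", "police": "TERM:police"},
    "de": {"münchen": "LOC:munich", "plutonium": "TERM:plutonium", "polizei": "TERM:police"},
}

def neutral_vector(tokens, lang):
    """Count the language-neutral identifiers referred to in a tokenised text."""
    vector = Counter()
    for token in tokens:
        identifier = LEXICON[lang].get(token.lower())
        if identifier:
            vector[identifier] += 1
    return vector

def cosine(a, b):
    """Cosine similarity of two identifier-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

english = neutral_vector("Police in Munich seized plutonium".split(), "en")
german = neutral_vector("Polizei in München stellt Plutonium sicher".split(), "de")
print(cosine(english, german))
```

Because both texts are reduced to the same identifiers, their similarity can be computed without translating either of them.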


Document classification, or categorisation: While clustering is a way of organising document collections bottom-up into naturally occurring groups of documents, the term classification (also categorisation or categorization) refers to the assignment of documents to one or more given classes or categories. Automatic classification on the basis of the vocabulary used can be achieved in a variety of ways. These include (a) the use of rules written by subject-domain specialists that formulate conditions such as "words A, B and C have to occur at least X times for a document to be assigned to a certain class", and (b) automatic comparison of documents (using statistical methods or Machine Learning techniques) to those which were previously manually assigned to the different classes; the classes with the documents that are most similar to the new document are the ones that are most suitable for this document.
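
Both approaches can be sketched in a few lines. The rule format, the class names, the toy training documents and the use of cosine similarity with the single nearest example are assumptions made for illustration; they are not taken from the JRC's classification modules.

```python
import math
from collections import Counter

def classify_by_rules(tokens, rules):
    """(a) Rule-based: a class is assigned if every word reaches its minimum count.

    rules: dict mapping a class name to a dict of word -> minimum number of occurrences.
    """
    counts = Counter(tokens)
    return [label for label, rule in rules.items()
            if all(counts[word] >= minimum for word, minimum in rule.items())]

def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_by_example(tokens, training, top_n=1):
    """(b) Example-based: return the classes of the most similar labelled documents.

    training: list of (tokens, [class names]) pairs that were assigned manually.
    """
    document = Counter(tokens)
    ranked = sorted(training, key=lambda ex: cosine(document, Counter(ex[0])), reverse=True)
    labels = []
    for _, example_labels in ranked[:top_n]:
        for label in example_labels:
            if label not in labels:
                labels.append(label)
    return labels

# Toy rule set and training data (hypothetical).
rules = {"nuclear smuggling": {"plutonium": 2, "smuggling": 1}}
training = [
    ("virus attack hacker network".split(), ["internet abuse"]),
    ("plutonium border smuggling".split(), ["nuclear smuggling"]),
]
new_doc = "plutonium smuggling ring caught with plutonium".split()
print(classify_by_rules(new_doc, rules))
print(classify_by_example(new_doc, training))
```

Raising `top_n` lets a document inherit the classes of several similar examples, which is one simple way to obtain multi-label assignments.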


Visualising the contents of documents and document collections

Document Profiles are a way of displaying the information aspects which were previously extracted from individual documents in an organised and structured way. They allow users to focus on the kind of information they are interested in and to decide quickly whether they are interested in a given document or not. The contents of document profiles depend on the information that was previously extracted. Our example document profile refers to a text on plutonium smuggling.


Document Maps are a way of giving users an idea of the contents and structure of a whole document collection by clustering related documents into groups and by assigning keywords to these clusters. The document map displayed here (see also high-quality bmp) was produced with the software ThemeScape from the US firm Cartia. The JRC's Language Technology group has worked with a Machine Learning group at the JRC to produce a couple of in-house systems which also aim to visualise document collections, but no effort has been spent so far on making the results visually appealing (see the publications at the PKDD'2000 and IJCAI'1999 conferences).


Visualisation of geographical references: Geographical references made in texts or in text collections can, of course, be visualised with maps. The example image was produced on the basis of the geographical references (see: named entity recognition) identified automatically in 1496 English, German, French and Spanish documents. See also our publications at LREC'2006. In addition to geographical locations, more information can be displayed on maps. See the publication at ESARDA'2005 for examples.




Please send comments on this page to Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)

Last update:  21 January 2010