NIPS 2006 Workshop
Machine Learning for
Multilingual Information Access

9 December 2006, Whistler, B.C.


Organizers:

Cyril Goutte (National Research Council Canada)
Nicola Cancedda (Xerox Research Centre Europe)
Marc Dymetman (Xerox Research Centre Europe)
George Foster (National Research Council Canada)


Sponsored by:
(tba)

Workshop description:

In many different settings, accessing information available in different languages is a challenge.

In Europe, the wide variety of languages is clearly a bottleneck for the efficient circulation of, and access to, information. More than half of EU citizens cannot hold a conversation in a language other than their mother tongue. Even in an officially bilingual country like Canada, fewer than one in five people are considered to have a good enough command of both official languages (2001 census data).

The traditional paradigm for addressing this issue is to perform human translation on a massive scale and to rely on monolingual information access technology. Although this model has worked reasonably well in the past, the rapid increase in the amount of information produced (and, in Europe, in the number of languages covered) raises questions about its sustainability. Machine Learning has the potential to help develop and deploy technology that provides:

  1. access to information across different languages,

  2. usable translation from one language to another.

We are interested in Machine Learning techniques that address, for example, the following problems:

  • Word alignment

  • Machine translation

  • Multilingual lexicon and terminology extraction

  • Cross-lingual information retrieval

  • Cross-lingual categorisation

Goals of the workshop:

Multilingual problems are also emerging as a promising application area for some Machine Learning techniques, for example the use of Kernel CCA for cross-language applications, or large-margin approaches to word alignment. This new trend meets a well-established interest of the Natural Language Processing community in learning approaches.

The purpose of this workshop is to provide a forum for discussion of current developments at the intersection of multilingual processing and machine learning. This includes developing new techniques to address various multilingual information access problems (e.g. translation), as well as scaling up existing techniques to the available NLP data, developing tools for cross-language information retrieval, etc.

We will promote and emphasize discussion of some interrelated key issues in applying Machine Learning to multilingual problems:

SCALING UP:

- Applying ML to 100-million-word corpora (e.g. SMT)

- Deploying ML solutions on new language pairs

SCARCE RESOURCES:

- Languages with limited bilingual corpora

- Bootstrapping limited resources

EVALUATION:

- Design of better performance measures

- Optimisation of application-specific measures

- Learning human evaluation

PRIOR LINGUISTIC KNOWLEDGE:

- Modelling and using linguistic knowledge in ML

- The continuum between all-data (SMT) and all prior knowledge (handcrafted rules)

Workshop format:

We intend to devote a good part of the workshop to panel discussions addressing relevant topics in multilingual information access (MLIA), as well as to invited talks presenting some important MLIA problems and the associated challenges for Machine Learning. Each half day will start with either a keynote or a short tutorial, continue with a few shorter technical presentations, and end with a panel discussion (topics to be decided depending on the confirmed list of speakers).

Invited speakers:

Related work:

Past NIPS workshops have addressed related topics such as learning with structured data, or the use of Machine Learning for Natural Language Processing. There is also some ongoing interest within the European network of excellence Pascal, as exemplified by the recent workshop on intelligent information access. However, none of these specifically targets multilingual aspects. We believe there is sufficient interest in, and genuine need for, a specific focus on multilingual information access.

The European project SMART (Statistical Multilingual Analysis for Retrieval and Translation) specifically targets advanced machine learning techniques for multilingual applications.