NIPS 2006 Workshop
Organizers:
Cyril Goutte
National Research Council Canada
Description
Sponsored by:
We gratefully acknowledge the support of the PASCAL Network of Excellence.
Program:
Morning session, 9 December: 7:30am to 10:30am
Afternoon session, 9 December: 3:30pm to 6:30pm
Abstracts of the keynotes and invited talks:

Discriminative Learning Methods in Syntax-Based Machine Translation
Michael Collins (CSAIL, MIT)
Phrase-based approaches to statistical machine translation have recently achieved impressive results, leading to significant improvements in accuracy over the original IBM models for statistical translation. However, phrase-based models lack a direct representation of syntactic information in the source or target languages; this has prompted several researchers to consider approaches that make use of syntactic information in statistical translation models. In this talk I'll describe a machine learning approach for tree-to-tree based translation, where parse trees in the source language are mapped to parse trees in the target language. A model is learned from a corpus of translation pairs, where each sentence in the source or target language has an associated parse tree. The model combines ideas from synchronous tree-adjoining grammars with a linear discriminative model trained using the perceptron. We see two major benefits of tree-to-tree based translation. First, it is possible to explicitly model the syntax of the target language, thereby improving grammaticality. Second, we can build a detailed model of the correspondence between the source and target parse trees, thereby attempting to construct translations that preserve the meaning of source language sentences. [slides]

Re-ranking for large-scale statistical machine translation
Kenji Yamada and Ion Muslea (Language Weaver, USA)
Statistical Machine Translation (SMT) systems, which are trained on corpora of bilingual text (e.g., French and English), typically work as follows: for each sentence to be translated, they generate a plethora of possible translations, among which they select a smaller N-best list of likely translations. Even though the N-best list contains high-quality candidates, the actual ranking is far from accurate: as shown in [Och et al. 04, Table 1], the list typically contains candidates that are significantly better than the one ranked first. In order to deal with this problem, researchers have proposed a variety of re-ranking approaches. Previous work [Liang et al. 06, Shen and Joshi 05, Roark et al. 04] shows that machine learning can be successfully used for this task. However, as the existing experiments were conducted on small corpora (the re-rankers were trained on fewer than 1 million words), it is unclear how the results scale to real-world applications, in which one has to deal with several orders of magnitude more data. In this paper we report our ongoing work towards building a re-ranking system for a commercial, large-scale SMT system. Our goal is to create a re-ranker that can be efficiently trained on corpora of at least a billion words. In this paper, we make two main contributions. First, our re-ranker is trained on a corpus of 80 million words, which is two orders of magnitude larger than previous work. Second, in order to speed up the process, we learn, in parallel, an ensemble of re-rankers, each of which uses only a fraction of the training data.

Scalable Discriminative Training for Tree-Structured Translation Models
I. Dan Melamed and Joseph Turian
Discriminative training methods have recently led to significant advances in the state of the art of machine translation (MT). Another promising trend is the incorporation of syntactic information into MT systems. Combining these trends is difficult for reasons of system complexity and computational complexity. We describe a syntax-driven MT system that is trained in a purely discriminative manner. Our main innovation is an approach to discriminative learning that is computationally efficient enough for large statistical MT systems, yet whose accuracy on translation sub-tasks is near the state of the art. Our source code is downloadable from http://nlp.cs.nyu.edu/GenPar .
A relevant paper is at http://nlp.cs.nyu.edu/pubs/papers/AMTA06.pdf .

Self-training for Machine Translation
Nicola Ueffing (NRC, Canada)
Statistical machine translation (SMT) systems are usually trained on large amounts of bilingual text and of monolingual text in the target language. This talk presents a self-training approach which additionally explores the use of monolingual source text, namely the documents to be translated, to improve the system's performance. In the work presented here, NRC's PORTAGE system (Johnson et al. 2006), which is a state-of-the-art phrase-based SMT system, is deployed. The translation process is done in two passes: an initial version of the translation system is used to translate the source text. Among the generated translations, target sentences of low quality are automatically identified and discarded. The reliable translations together with their sources are then used as a new bilingual corpus for training an additional phrase translation model. This test-corpus-specific phrase translation model is then used as an additional knowledge source in the SMT system, alongside the (statistical) models used by the initial system. Thus, the translation system can be adapted to the new source data even if no bilingual data in this domain is available. Those phrase pairs in the phrase translation model which are relevant to the test corpus are reinforced, and the probability distribution becomes more focused.

Correlation for multilingual analysis: Theory, Scaling and Sparsity
John Shawe-Taylor (CS, UCL, UK)
Canonical Correlation Analysis has been used successfully for cross-lingual information retrieval and, combined with support vector learning, can deliver cross-lingual classification; that is, a training set in one language can deliver a classifier for a second language. A hybrid algorithm known as SVM-2K combines these two phases into a single algorithm.
The talk will briefly review these approaches, giving experimental and theoretical results for the generalisation of correlation analysis for both semantic space learning and cross-lingual classification. The issue of scaling will be discussed and a sparse approach to addressing this issue will be proposed. The talk will also present some preliminary work on applying discriminative learning to statistical machine translation. [slides]

Semi-supervised learning for statistical machine translation
Anoop Sarkar and Gholamreza Haffari (CS, SFU)
We look at semi-supervised learning using the framework provided in Abney (2004), which defines an objective function on the label distribution in the unlabeled data. SMT offers an interesting domain in this respect because there are infinitely many outputs in the target language for each input source language sentence. That is, the label distribution is over a potentially infinite set. We look at semi-supervised learning in this setting. We will also review some previous uses of semi-supervised learning in SMT, where additional unlabeled data is used to improve word alignments or the full translation model. It is already well known that the use of large amounts of monolingual text in the target language can lead to better translation, as we can build better language models for the target. In this work, we look at the setting where adding monolingual source language sentences can also improve SMT output. We use a phrase-based SMT decoder (Moses) to produce new candidate parallel sentence pairs, and we use importance sampling to find new additions to the labeled data for the SMT system. We iterate this process until we obtain convergence of the label distribution. We present experiments in a transductive setting, where we learn a better SMT system by iteratively producing better translations for the test data.
We show that we can improve the BLEU score by an amount that is equivalent to the improvement of adding twice as much parallel text as training data. [slides]

Bag-of-Words Lexical Choice using Large Scale Classifiers
Patrick Haffner, Srinivas Bangalore and Stephan Kanthak (AT&T Labs)
A major limitation in the application of Machine Learning to Machine Translation is that one needs to scale highly structured models to very large datasets, which is not possible with current optimization techniques. While current research applies structured methods to smaller datasets or subproblems, we take here an alternate path. In a first lexical choice step, we explore highly scalable unstructured classification methods that produce bags of words (BOW) rather than sentences. This would be enough for Cross-Language Information Retrieval. Full translation requires a second lexical ordering step that produces an "optimal" sequencing for the words in the bag. We demonstrate improvements in both translation accuracy (BLEU score) and word retrieval accuracy (F-measure) on various tasks.

Named Entity Discovery from Multilingual Corpora
Alexandre Klementiev and Dan Roth (CS, UIUC)
Named Entity recognition (NER) is an important subtask of many natural language processing problems. Most successful approaches to NER employ machine learning techniques, which require supervised data. However, for many languages these resources are scarce. On the other hand, comparable multilingual data (such as multilingual news streams) are becoming increasingly available and can be cheaply harvested from the Web. In this work, we make several observations about Named Entities encountered in such corpora, and use them to develop an algorithm that learns to extract pairs of NEs across languages with almost no supervision or language-specific knowledge.
Specifically, given a bilingual corpus that is weakly temporally aligned and has one side annotated with Named Entities, the algorithm discovers the corresponding NEs in the second language text. We first observe that NEs often contain or are entirely made up of words that are phonetically transliterated or have a common etymological origin across languages, and thus are phonetically similar. Secondly, we observe that NEs in one language in such corpora tend to co-occur with their counterparts in the other. We can exploit such weak synchronicity of NEs across languages to associate them. In order to score a pair of entities, we can compute the similarity of their time distributions. Thirdly, an NE and its second language counterpart tend to appear in similar contexts across languages. If available, a dictionary could thus be used to score their contextual similarity. Lastly, we expect mentions of a Named Entity to appear in documents from a particular set of topics. Thus, we can measure the topic similarity of a pair of NEs by the overlap of topics of the documents in which they appear. In our setting, multilingual document clustering approaches can be used to first cluster documents in our bilingual corpus. We present an algorithm that iteratively exploits these observations to train a discriminative transliteration model using the temporal, contextual, and topic similarity scores as supervision signals. During discovery, the trained transliteration model is used to produce a list of highest scoring second language candidates for a given NE. If a dictionary is available, the list is also augmented with translations (if any). The list is then re-ranked by a combination of the three similarity scores between the NE and each candidate. Initial experiments with the temporal similarity score alone used during training show promising results on a bilingual English-Russian news corpus. They were presented at ACL in 2006.
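
As a rough illustration of the temporal similarity idea in the abstract above, the sketch below scores a pair of entities by comparing how their mention counts are distributed over time. The function names, the per-period binning, the toy counts and the choice of cosine similarity are all assumptions made for illustration; the abstract does not say which similarity measure the authors used.

```python
import math
from typing import Sequence


def normalise(counts: Sequence[float]) -> list[float]:
    """Turn raw per-period mention counts into a distribution summing to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def temporal_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two time distributions (an assumed measure)."""
    if len(a) != len(b):
        raise ValueError("time series must cover the same periods")
    p, q = normalise(a), normalise(b)
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


def rank_candidates(source_counts, candidates):
    """Sort hypothetical (name, counts) candidates by similarity to the source NE."""
    scored = [
        (name, temporal_similarity(source_counts, counts))
        for name, counts in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data: mentions per day of one annotated NE and two candidates.
    source_ne = [0, 5, 9, 1, 0, 0, 4]
    second_language_candidates = [
        ("candidate_a", [0, 4, 10, 2, 0, 0, 3]),  # bursts on the same days
        ("candidate_b", [6, 0, 0, 0, 7, 5, 0]),   # bursts on different days
    ]
    for name, score in rank_candidates(source_ne, second_language_candidates):
        print(f"{name}: {score:.3f}")
```

In this toy setting the candidate whose mentions peak on the same days as the source entity ranks first. A score of this kind could be one of the signals combined in re-ranking, alongside the contextual and topic scores.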

NewsExplorer -- Multilingual News Analysis with Cross-lingual Linking
Ralf Steinberger (EC Joint Research Centre, Ispra, Italy)
Software tools to provide cross-lingual information access usually rely on Machine Translation (MT) or on the use of bilingual dictionaries. Machine Learning approaches such as LSA (Landauer & Littman, 1989) and KCCA (Vinokourov et al., 2002) try to solve the problem by producing a vector space in which texts of both languages are represented together. The existing approaches to provide cross-lingual information access are restricted to two languages, although scientists in the European research network PASCAL are currently experimenting with vector space approaches involving more than two languages. In highly multilingual environments such as the European Union with its twenty official languages (190 language pairs, 380 language pair directions), bilingual approaches are not satisfactory, especially as MT and other linguistic resources are only available for a fraction of these language pairs. Highly multilingual environments call for interlingua-like solutions, but at the same time they must be simple and easy to extend to new languages, for which few or no linguistic resources may exist. We will present such an interlingua-like working solution and its current application to eight languages. The approach consists of first producing a language-independent representation, based mostly on subject domains, normalised named entities and cognates, and then applying a similarity measure to this interlingua representation. We use the multilingual Eurovoc thesaurus to classify multilingual documents into the same subject domains. Place names are first disambiguated and then represented by their geographical co-ordinates. Person and organisation names are normalised to achieve an automatic match of the name variants found in different languages. We will describe the challenges for each individual software component (e.g. homography and other types of ambiguity; word variants in highly inflected languages) and will present the adopted solutions to these problems. A demonstration of the publicly accessible, fully-automatic news aggregation and analysis system NewsExplorer (http://press.jrc.it/NewsExplorer) will show that the presented approach is workable and that information gathering can be enhanced by combining information extracted from texts in many different languages.
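
The language-pair figures quoted in the abstract above follow from simple counting: n languages give n(n-1)/2 unordered pairs and n(n-1) pair directions. The short sketch below checks this for the twenty languages mentioned; the function name `pair_counts` is hypothetical and introduced only for this example.

```python
import math


def pair_counts(n_languages: int) -> tuple[int, int]:
    """Return (unordered language pairs, directed language pairs)."""
    unordered = math.comb(n_languages, 2)  # n * (n - 1) / 2
    directed = math.perm(n_languages, 2)   # n * (n - 1)
    return unordered, directed


if __name__ == "__main__":
    print(pair_counts(20))  # (190, 380), matching the figures in the abstract
    print(pair_counts(8))   # (28, 56) for the eight languages currently covered
```

The quadratic growth in pairs is the reason bilingual resources become hard to maintain as languages are added.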