======================================================================
Title: Re-ranking for large-scale statistical machine translation

Authors: Kenji Yamada and Ion Muslea

Abstract:

Statistical Machine Translation (SMT) systems, which are trained on corpora of bilingual text (e.g., French and English), typically work as follows: for each sentence to be translated, they generate a large number of possible translations, from which they select a smaller N-best list of likely translations. Even though the N-best list contains high-quality candidates, the ranking itself is far from accurate: as shown in [Och, et al. 04, Table 1], the list typically contains candidates that are significantly better than the one ranked first.

To deal with this problem, researchers have proposed a variety of re-ranking approaches. Previous work [Liang, et al. 06, Shen and Joshi 05, Roark, et al. 04] shows that machine learning can be successfully used for this task. However, because the existing experiments were conducted on small corpora (the re-rankers were trained on less than 1 million words), it is unclear how the results scale to real-world applications, in which one has to deal with several orders of magnitude more data.

In this paper we report our ongoing work towards building a re-ranking system for a commercial, large-scale SMT system. Our goal is to create a re-ranker that can be efficiently trained on corpora of at least a billion words. We make two main contributions. First, our re-ranker is trained on a corpus of 80 million words, which is two orders of magnitude larger than in previous work. Second, to speed up training, we learn, in parallel, an ensemble of re-rankers, each of which uses only a fraction of the training data.

For the re-ranking task, we use the following features:
- the cost of each candidate in the N-best list (the baseline system uses only this feature to perform the initial ranking);
- the phrase pairs that were used to obtain this particular translation (e.g., during its training, the baseline SMT system may have learned that "the red house" translates into "la maison rouge").

In our experiments, we use a Chinese-English corpus that consists of about 4 million sentences (80 million English words); the development and test sets consist of 993 and 919 sentences, respectively. For each sentence, we generate the 200-best candidates, for a total of 800 million examples. This training set, which uses 12 million phrase-pair features, is extremely unbalanced: only the best of the 200 candidates is a positive example, while the others are negative ones.

We do not use all of these 12 million features; instead, we remove the ones that appear in the training corpus extremely frequently or infrequently (i.e., more than 100,000 times or only once, respectively). This pruning, which eliminates almost half the features, has a dual benefit: it reduces the computational cost and helps the system avoid overfitting.

As in previous approaches, we build our re-ranker around a perceptron learner. However, given the enormous amount of training data, training a single perceptron on the entire corpus would be extremely inefficient. Instead, we create an ensemble of perceptrons that are trained in parallel, each on a fraction of the data. This approach is similar to bagging [Breiman 96], except that we use arbitrary splits of 5000 sentences (given that each split consists of about 0.1% of the corpus, it does not make sense to use sampling with replacement). After the perceptrons are trained, the ensemble selects the best of the N candidates by averaging the predictions of all perceptrons.
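
The pruning and ensemble steps described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the candidate layout (a dict with "cost", "features", and "is_best"), the function names, the "__cost__" feature key, the initial cost weight of -1, and the mistake-driven update toward the best candidate are all assumptions made for illustration, since the abstract does not specify the update rule. The splits are trained sequentially here for brevity; because they are independent, each could be handed to a separate worker.

```python
from collections import Counter

COST = "__cost__"  # hypothetical key for the baseline cost feature


def prune_features(nbest_lists, min_count=2, max_count=100_000):
    """Keep phrase-pair features seen at least twice and at most max_count times."""
    counts = Counter()
    for candidates in nbest_lists:
        for cand in candidates:
            counts.update(cand["features"])
    return {f for f, c in counts.items() if min_count <= c <= max_count}


def featurize(cand, kept):
    """Sparse feature vector: baseline cost plus counts of the kept phrase pairs."""
    phi = {COST: cand["cost"]}
    for f in cand["features"]:
        if f in kept:
            phi[f] = phi.get(f, 0.0) + 1.0
    return phi


def score(w, phi):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())


def train_perceptron(nbest_lists, kept, epochs=5):
    """Mistake-driven perceptron: move weights toward the best candidate
    and away from the candidate currently ranked first."""
    w = {COST: -1.0}  # assumption: start from the baseline (lower cost is better)
    for _ in range(epochs):
        for candidates in nbest_lists:
            phis = [featurize(c, kept) for c in candidates]
            gold = next(i for i, c in enumerate(candidates) if c["is_best"])
            pred = max(range(len(phis)), key=lambda i: score(w, phis[i]))
            if pred != gold:
                for f, v in phis[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in phis[pred].items():
                    w[f] = w.get(f, 0.0) - v
    return w


def train_ensemble(nbest_lists, kept, split_size=5000):
    """Train one perceptron per disjoint split (no sampling with replacement)."""
    splits = [nbest_lists[i:i + split_size]
              for i in range(0, len(nbest_lists), split_size)]
    return [train_perceptron(split, kept) for split in splits]


def rerank(ensemble, candidates, kept):
    """Return the index of the candidate with the highest average score."""
    phis = [featurize(c, kept) for c in candidates]
    avg = [sum(score(w, phi) for w in ensemble) / len(ensemble) for phi in phis]
    return max(range(len(candidates)), key=avg.__getitem__)
```

A typical call sequence would be prune_features on the training N-best lists, then train_ensemble, then rerank on each test N-best list. Averaging raw scores is one reading of "averaging the predictions"; voting over each perceptron's top choice would be another.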

SMT systems typically use a development set to tune their parameters and bias the search towards translations that are more likely to be relevant to the test set. We found that adding 5 copies of the 993-sentence development set to the 5000-sentence training set of each perceptron significantly improves the quality of the translations.

Our empirical evaluation shows that the proposed re-ranker improves our baseline SMT system from a BLEU score [Papineni, et al. 02] of 31.19 to 31.72, an improvement that is statistically significant at p<0.001. In contrast, the BLEU score is 31.31 when training the ensemble only on the training set, and 31.41 when training only on the development set.

References

[Och, et al. 04] A Smorgasbord of Features for Statistical Machine Translation, HLT-NAACL 2004.
[Liang, et al. 06] An End-to-End Discriminative Approach to Machine Translation, ACL 2006.
[Shen and Joshi 05] Ranking and Reranking with Perceptron, Machine Learning, 60(1-3), 2005.
[Roark, et al. 04] Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm, ACL 2004.
[Papineni, et al. 02] Bleu: a Method for Automatic Evaluation of Machine Translation, ACL 2002.
[Breiman 96] Bagging Predictors, Machine Learning, 24(2):123-140, 1996.
======================================================================