===========================================
Self-training for Machine Translation
Nicola Ueffing
===========================================

Statistical machine translation (SMT) systems are usually trained on large amounts of bilingual text and of monolingual text in the target language. This talk presents a self-training approach which additionally explores the use of monolingual source text, namely the documents to be translated, to improve the system's performance.

The work presented here uses NRC's PORTAGE system (Johnson et al. 2006), a state-of-the-art phrase-based SMT system. Translation is done in two passes. An initial version of the translation system is used to translate the source text. Among the generated translations, target sentences of low quality are automatically identified and discarded. The reliable translations, together with their sources, are then used as a new bilingual corpus for training an additional phrase translation model. This test-corpus-specific phrase translation model is then used as an additional knowledge source in the SMT system, alongside the (statistical) models used by the initial system. Thus, the translation system can be adapted to the new source data even if no bilingual data in this domain is available. Those phrase pairs in the phrase translation model which are relevant to the test corpus are reinforced, and the probability distribution becomes more focused.

The use of self-training for adaptation purposes was proposed in (Ueffing 2006). It was shown there that the translation quality of a state-of-the-art phrase-based statistical machine translation system can be significantly improved through self-training. In this talk, a more detailed investigation of the proposed method will be presented. Extensions and variations to the approach and their effect on translation quality are studied.
Specifically, these include the following:

- The automatic identification of low-quality translations is done using confidence estimation techniques. The confidence of a translation is compared to a threshold, and only those translations whose confidence exceeds the threshold are kept for self-training. The talk will present an analysis of the importance of this filtering step. The relation between the strictness of the filtering (namely the value of the confidence threshold) and the translation quality of the resulting adapted system will be shown.

- When carrying out the self-training on the new monolingual source data, the system can either consider one translation per source sentence or allow for several different alternatives, represented e.g. in an N-best list. Often, several correct translations of a source sentence exist, so this approach allows the system to introduce some more variation into the adapted phrase table. The talk will give insight into whether this improves translation quality or whether the self-training algorithm should consider no more than one reliable translation per source sentence.

- The procedure described above can be iterated. That is, after adapting the system to a test corpus, the output of this system can be used to retrain the adapted phrase table, again keeping only the reliable translations to create a new bilingual corpus. We will show how this iteration affects the translation quality of the resulting system.

Experimental results will be presented on data from the NIST MT Chinese-English translation task. The focus lies on settings where the domain and/or the style of the test data is different from that of the training material. The training data consists mainly of text corpora, such as newswire, whereas the test data comprises transcribed speech data, e.g., political speeches and broadcast conversations. The latter in particular have characteristics of spontaneous speech.
They contain hesitations, repetitions, and incomplete and ungrammatical sentences. The use of self-training for adaptation can help deal with this mismatch.

(Ueffing 2006) Ueffing, Nicola: Using Monolingual Source-Language Data to Improve MT Performance. IWSLT 2006.

(Johnson et al. 2006) Johnson, J.H., Sadat, F., Foster, G., Kuhn, R., Simard, M., Joanis, E., Larkin, S.: PORTAGE: with Smoothed Phrase Tables and Segment Choice Models. NAACL 2006, Workshop on SMT.
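
The self-training procedure described in this abstract (translate the source text, discard translations whose confidence does not exceed a threshold, train an additional model on the remaining sentence pairs, optionally keep several N-best alternatives, and optionally iterate) can be illustrated with a minimal sketch. This is not PORTAGE code. All names (`filter_reliable`, `train_adapted_model`, `self_train`, `toy_translate`) and the default values are hypothetical, and the "adapted model" here is only a relative-frequency table over whole sentence pairs, standing in for real phrase extraction and phrase table training. The toy translator returns fixed hypotheses with made-up confidence values and ignores the adapted model, whereas an actual system would use it alongside its initial models.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, float]            # (translation, confidence score)
Model = Dict[str, Dict[str, float]]       # source -> {target: probability}
Translator = Callable[[str, Model], List[Hypothesis]]


def filter_reliable(sources: List[str], nbest_lists: List[List[Hypothesis]],
                    threshold: float, max_per_source: int = 1) -> List[Tuple[str, str]]:
    """Keep at most max_per_source hypotheses per source sentence,
    and only those whose confidence exceeds the threshold."""
    corpus = []
    for src, nbest in zip(sources, nbest_lists):
        ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
        kept = [hyp for hyp, conf in ranked if conf > threshold][:max_per_source]
        corpus.extend((src, hyp) for hyp in kept)
    return corpus


def train_adapted_model(corpus: List[Tuple[str, str]]) -> Model:
    """Toy stand-in for phrase table training (an assumption of this sketch):
    relative frequencies of target sentences given the source sentence."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src, tgt in corpus:
        counts[src][tgt] += 1
    return {src: {tgt: c / sum(tgts.values()) for tgt, c in tgts.items()}
            for src, tgts in counts.items()}


def self_train(sources: List[str], translate: Translator, threshold: float = 0.5,
               max_per_source: int = 1, iterations: int = 1) -> Model:
    """Translate, filter by confidence, retrain; repeat for the given
    number of iterations. The first pass uses an empty adapted model."""
    adapted: Model = {}
    for _ in range(iterations):
        nbest_lists = [translate(src, adapted) for src in sources]
        corpus = filter_reliable(sources, nbest_lists, threshold, max_per_source)
        adapted = train_adapted_model(corpus)
    return adapted


def toy_translate(src: str, adapted: Model) -> List[Hypothesis]:
    """Hypothetical translator with fixed N-best lists and invented
    confidence values; it ignores the adapted model."""
    table = {
        "ni hao": [("hello", 0.9), ("hi", 0.7)],
        "xie xie": [("thanks", 0.4)],
    }
    return table.get(src, [])
```

Calling `self_train(["ni hao", "xie xie"], toy_translate, threshold=0.5)` keeps only the single best hypothesis for the first sentence and drops the second sentence entirely, since its only hypothesis falls below the threshold. Setting `max_per_source=2` admits both alternatives for the first sentence, which corresponds to the N-best variant, and `iterations` controls how often the translate, filter, and retrain cycle is repeated.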