Statistical Machine Translation Tutorial Reading

The following is a list of papers that I think are worth reading for our
discussion of machine translation. I've tried to give a short blurb about
each of the papers to put them in context. I've included a number of papers
that I marked "OPTIONAL" that I think are interesting, but are either
supplementary or the material is more or less covered in the other papers.

If anyone would like more information on a particular topic or would
like to discuss any of these papers, feel free to e-mail me
dkauchakcs ucsd edu

Part 1 (Jan. 19)
A Statistical MT Tutorial Workbook. Kevin Knight. 1999.
Very good introduction to word-based statistical machine translation.
Written in an informal, understandable, tutorial-oriented style.

Automating Knowledge Acquisition for Machine Translation.
Kevin Knight. 1997.
(OPTIONAL) Another tutorial-oriented paper that steps through
how one can learn from bilingual data. Also introduces a number of
important concepts for MT.

Foundations of Statistical NLP, chapter 13. Manning and Schutze. 1999.
(OPTIONAL) Must be accessed from UCSD. Overview of statistical MT.
Spends a lot of time on sentence and word alignment of bilingual data.

Foundations of Statistical NLP, chapter 6. Manning and Schutze. 1999.
(OPTIONAL) Must be accessed from UCSD. Discusses n-gram language
modeling. Language modeling is crucial for SMT and many other natural
language applications. I won't spend much time discussing language
modeling, but for those that are interested this is a good introduction.
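
For a concrete feel of what an n-gram language model computes, here is a
minimal sketch of a bigram model with add-one smoothing. It is not taken from
the chapter; the function names, the boundary markers and the choice of
add-one smoothing are my own assumptions for illustration.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count contexts and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers (an assumption
    of this sketch, not something prescribed by the reading).
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])             # every token that acts as a context
        bigrams.update(zip(padded, padded[1:]))  # adjacent (previous, word) pairs
    # Words that can be predicted: all observed words plus the end marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

corpus = [["the", "house"], ["the", "blue", "house"]]
unigrams, bigrams, vocab = train_bigram_lm(corpus)
print(bigram_prob("the", "house", unigrams, bigrams, vocab))
```

Real systems use higher-order n-grams and better smoothing than add-one; the
sketch only shows the counting and normalization.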

Part 2 (Jan. 26)
Word models:
The Mathematics of Statistical Machine Translation:
Parameter Estimation. P. F. Brown, S. A. Della Pietra,
V. J. Della Pietra and R.L. Mercer. 1993.
(OPTIONAL) All you ever wanted to know about word-level
models. Describes IBM Models 1-5 and parameter estimation
for these models. It's about 50 pages and contains a lot of
material for the interested reader.
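
To make "parameter estimation" a little more concrete, below is a rough
sketch of EM training for the simplest of these models, Model 1, which learns
word translation probabilities t(f | e) from sentence pairs. This is my own
simplified version, not code from the paper; the function name, the "NULL"
token string and the toy data are assumptions for illustration.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """EM for word translation probabilities t[(f, e)], an estimate of P(f | e).

    pairs: list of (foreign_tokens, english_tokens) sentence pairs.
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialization over every co-occurring (foreign, English) pair.
    t = {}
    for fs, es in pairs:
        for f in fs:
            for e in ["NULL"] + es:
                t[(f, e)] = 1.0 / len(f_vocab)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: distribute each foreign word over the English words.
        for fs, es in pairs:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    posterior = t[(f, e)] / z
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: renormalize the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

toy = [(["das", "haus"], ["the", "house"]),
       (["das", "buch"], ["the", "book"]),
       (["ein", "buch"], ["a", "book"])]
t = train_ibm_model1(toy)
print(t[("haus", "house")], t[("das", "house")])
```

On this toy corpus the probability mass for "house" should shift toward
"haus" over the iterations, since "das" is better explained by "the".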

Word model decoding:
Decoding Algorithm in Statistical Machine Translation.
Ye-Yi Wang and Alex Waibel. 1997.
Early paper discussing decoding of IBM model 2. The paper
provides a fairly good introduction to word-level decoding
including multi-stack search (i.e. multiple beams) and rest
cost estimation (heuristic functions).

An Efficient A* Search Algorithm for Statistical Machine Translation.
Franz Josef Och, Nicola Ueffing, Hermann Ney. 2001.
(OPTIONAL) One of many papers on decoding with word-based SMT. They
discuss the basic idea of viewing decoding as state space search and
provide one method for doing this. They describe decoding for Model 3
and suggest a few different heuristics that are admissible, leading to few search errors.

Phrase based statistical MT:
Statistical Phrase-Based Translation.
Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003.
Good, short overview of phrase-based systems. If you want more
details, see the paper below.

The Alignment Template Approach to Statistical Machine Translation.
Franz Josef Och and Hermann Ney. 2004.
(OPTIONAL) This is a journal paper discussing one phrase-based statistical system,
including decoding. This is more or less the system used at ISI and
is probably the best current system (though syntax-based systems may beat
these in the next few years). Requires Acrobat 5 and must be accessed from UCSD.

Part 3 (Feb. 2)
Phrase-based decoding:
See the previous paper.

Syntax based translation:
What's in a Translation Rule? Galley, Hopkins, Knight and Marcu. 2004.
This is the current system being investigated at ISI, and the hope is that
these syntax-based systems will perform better than phrase-based systems.
The paper is a bit tough to read since it's a conference paper.

A Syntax-Based Statistical Translation Model. Yamada and Knight. 2001.
(OPTIONAL) Predecessor model to Galley et al., but similar.

Syntax based decoding:
Foundations of Statistical NLP, chapter 12. Manning and Schutze. 1999.
Must be on campus. This is a chapter on parsing (not actually decoding).
However, since the above rules are very similar to PCFGs, decoding
is very similar to parsing, just with more complications.

A Decoder for Syntax-Based Statistical MT. Kenji Yamada and Kevin Knight. 2001.
(OPTIONAL) Decoder for the above Yamada and Knight model.

Part 4 (Feb. 9)
Discriminative Training:
Discriminative Training and Maximum Entropy Models for Statistical Machine Translation.
Och and Ney. 2002.
Learns the best way to combine the different models (translation
model, language model, etc.) using maximum entropy parameter estimation.
This line of research is still very important and may be interesting to
many of you since it's very machine learningy.
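
The basic idea of combining models this way can be sketched as a log-linear
score: each model contributes a feature value (typically a log probability),
and the learned weights decide how much each one counts. The snippet below is
only an illustration of that scoring step with hand-set weights; the feature
names, values and weights are made up, and it does not do the parameter
estimation described in the paper.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(candidates, weights):
    """Pick the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

def posteriors(candidates, weights):
    """Normalize exponentiated scores into a distribution over the candidates."""
    scores = [loglinear_score(c["features"], weights) for c in candidates]
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

# Hypothetical candidates with made-up log probabilities from two models.
candidates = [
    {"text": "candidate A", "features": {"log_tm": -1.0, "log_lm": -5.0}},
    {"text": "candidate B", "features": {"log_tm": -2.0, "log_lm": -1.0}},
]
print(best_translation(candidates, {"log_tm": 1.0, "log_lm": 1.0})["text"])
print(best_translation(candidates, {"log_tm": 1.0, "log_lm": 0.0})["text"])
```

With equal weights the language model score tips the choice to candidate B;
with the language model weight set to zero, candidate A wins. Training is
about finding weights that make such choices agree with good translations.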

Discriminative Reranking for Machine Translation.
Shen, Sarkar and Och. 2004.
(OPTIONAL) Given a ranked output of possible translations from the
translation system, this paper uses the perceptron algorithm to learn
a reranking of the sentences to improve the top translation.

MT Evaluation:
BLEU: A Method for Automatic Evaluation of Machine Translation.
Papineni, Roukos, Ward and Zhu. 2001.
Foundational method for evaluating MT methods and still used currently.
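
For those who want to see the pieces of the metric, here is a simplified,
sentence-level sketch: clipped (modified) n-gram precision, a geometric mean
over n-gram orders, and a brevity penalty. The paper defines BLEU over a whole
test corpus, so treat this as an approximation for illustration only; the
function names and the choice to return 0 when any precision is zero are my
own assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU with uniform weights over n-gram orders."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated words cannot be over-rewarded.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand.values())))
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat".split()
print(bleu("the the the the the the the".split(), [reference], max_n=1))
print(bleu(reference, [reference]))
```

The first call shows the clipping at work: only two of the seven "the"
tokens are credited, because "the" appears twice in the reference.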