Statistical machine translation methods rely on large quantities of training data to learn various models. We are investigating applications of active learning to both minimize the amount of training data required (for example when moving to new domains or in languages where large amounts of text is not available) as well as integrating a human in the training process to help answer difficult questions, such as translations of phrases, that will help improve the overall translations.

Given a machine translation function and bilingual data, we can generate an aligned Enlish/English data set by translating the foreign sentences with the translation function. Differences between these sentences point to possible mistakes in the translation function. We are investigating various same language alignment methods on this data. Given these alignments, phrase correction rules can be learned. Initial investigation shows major improvements in commercial translation functions and some improvement on phrase based systems.

Research with Andy Kehler and Lara Taylor

Most data driven pronoun resolution techniques rely on labeled data for training. Recently, there has been some work showing that unlabeled data can be used to train these pronoun/antecedent models. Using a large corpus of text, unambiguous pronouns are identified and used to bootstrap a model. Using this model, new pronouns are labeled and the process is continued until convergence. The advantage of these unsupervised methods is that they have much larger data sets available than for supervised. We are investigating an expanded set of features for pronoun resolution that leverage this increased amount of data.

Research with Kevin Knight and Daniel Marcu at ISI

Recent translation models for statistical machine translation involve syntactically motivated translation rules (Galley et. al). Empirically and qualitatively this type of translation approach seems to be a good way to integrate syntax into the translation process while still maintaining many of the advantages of current state of the art phrase based systems. Decoding with these rules is similar to parsing with additional complications. CKY style, bottom up type parsing methods appear to be a good approach for this model.

One of the complications for many bottom up algorithms is that the grammar must be binary (i.e. at most two elements on the right hand side). Binarization of PCFG rules is fairly straightforward. Binarization for translation rules is an interesting area of research since the rules are more complicated (i.e. left hand side consists of a tree, not just a non-terminal) and because other models are used during decoding such as the language models. The binarization process can be biased so that partial hypothesis and future cost estimation are more accurate.

Research with Sanjoy Dasgupta and Anjum Gupta

In 2003, we introduced an iterative hierarchical clustering method. The problem with previous approaches is that they are greedy. Once a decision is made in constructing the hierarchy, this decision can never be undone. The iterative approach we presented searches the space of hierarchical clusterings and therefore does not have that problem.

The iterative approach attempts to minimize a criterion function based on the current clustering. In our initial paper, we used a squared Euclidean cost function similar to that of K-means. Euclidean distance is just one of a family of distance metrics called Bregman distances. We are examining how these distances can be integrated with our iterative method. The hope is that these new distance metrics will fit more naturally for a variety of different data sets where Euclidean distance is not appropriate.

Research with Vibhu Mittal at Google

Information retrieval of research papers provides a number of interesting problems that are different than traditional information retrieval. First, extracting information from non-text files is non-trivial and information such as authors, titles and references must be extracted from research articles before further processing. This can be difficult due to the wide variation of formatting and conventions. Once this information has been extracted, the citation graph of research papers provides a very interesting search structure that is much more reliable than traditional web link structure.

Preliminary paper available

In this paper we examine reinforcement learning problems which consist of a set of homogeneous entities. These problems tend to have extremely large state spaces making standard approaches unattractive. We study lane change selection in a car traffic control problem as an example of such a problem. We show how a single agent problem can be translated into an approximating multi-agent problem. We provide learning results in a traffic simulator using this multi-agent approximation with Q-learning and R-learning. Learning in the multi-agent problem proceeds quickly and outperforms heuristic methods. Experimental results show that learned methods perform better than heuristic methods as traffic densities increase towards rush hour conditions. We summarize the translation method used from a single agent problem to a related multi-agent problem for car traffic control and propose this as a starting place for related problems.

Preliminary paper available

In this paper we discuss the problem of audio retrieval. We examine artist recognition/retrieval as a problem instead of the traditional genre classification. This problem has the motivating benefit that there is a known, uncontroversial ground truth. Second, and more importantly, we suggest borrowing research from the image retrieval community. We provide results from one image retrieval technique ported over to audio retrieval. This technique consists of taking the discrete wavelet transform of the audio, histogramming the results and using statistical histogram comparison metrics to compare similarity. The results are not outstanding, but we do show that this sort of research can be done fairly easily and productively.

The increasing power of modern personal computers is making clusters, which are built up from desktop components increasingly popular. However, these clusters do have their drawbacks. When clusters are updated, the whole cluster may not be upgraded all at once. This will lead to an environment where there are some nodes that are faster than others. This irregular environment can occur more simply in the case where jobs are running on some of the nodes. These nodes will tend to run slower than the free nodes. There are a number of problems such as RedBlack3D that have highly regular rectangular partitioning schemes. Unfortunately, if these static partitions are used in the environment just described, idle waiting can occur when faster nodes have that have finished their iteration early and are stuck waiting at a synchronization point for slower processors to finish. We investigate a load balancer that can adaptively tailor per-node workloads to node speed to help reduce this inefficiency.