Scientists work to automate quick translation of obscure languages

Researchers get $16.7 million grant to retrieve relevant foreign-language documents and do more with less

January 23, 2018 Caitlin Dawson

A team of researchers from the Information Sciences Institute (ISI) at the USC Viterbi School of Engineering has received a $16.7 million grant from the Intelligence Advanced Research Projects Activity (IARPA) to develop an automated tool that quickly translates and summarizes information in obscure languages.

Principal investigator and ISI research team leader Scott Miller, ISI computer scientist Jonathan May and ISI research lead Elizabeth Boschee, with senior advisers Prem Natarajan and Kevin Knight, are leading a team of about 30 researchers, including academics from the University of Massachusetts, Northeastern University, the Massachusetts Institute of Technology and the University of Notre Dame. Natarajan is the Michael Keston executive director and research professor of computer science at ISI; Knight is ISI research director and Dean’s Professor of Computer Science.

The ISI team’s project is called SARAL, for Summarization and domain-Adaptive Retrieval, and includes experts in machine translation, speech recognition, morphology, information retrieval, representation and summarization. Saral is a Hindi word whose translations include “simple” and “ingenious.”

“The overall objective is to provide a Google-like capability, except the queries are in English, but the retrieved documents are in a low-resource foreign language,” said Miller, who is based at ISI’s new office in Boston.

“The aim is to retrieve relevant foreign-language documents and to provide English summaries explaining how each document is relevant to the English query.”

In this project, the team will initially test its systems using Tagalog and Swahili, two low-resource languages selected by IARPA for the task. Over the course of the project, the team will receive additional languages to translate using the systems.
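To make the idea of English queries over foreign-language documents concrete, here is a minimal sketch of dictionary-based query translation, a textbook approach to cross-language retrieval. It is not the SARAL system's method, which the article does not describe at this level. The glossary, the documents (bags of Swahili keywords rather than real sentences) and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy glossary: English query word -> Swahili equivalents.
# A real system would learn such mappings from data; these are hard-coded.
GLOSSARY = {
    "education": ["elimu"],
    "school": ["shule"],
    "environment": ["mazingira"],
    "water": ["maji"],
}

# Hypothetical documents, reduced to bags of keywords for simplicity.
DOCUMENTS = {
    "doc1": "elimu shule mwalimu shule",
    "doc2": "mazingira maji maji",
    "doc3": "shule maji",
}


def translate_query(query):
    """Map each English query word to its glossary entries, skipping unknown words."""
    terms = []
    for word in query.lower().split():
        terms.extend(GLOSSARY.get(word, []))
    return terms


def rank_documents(query):
    """Score documents by how often translated query terms occur, best first."""
    terms = translate_query(query)
    scores = {}
    for doc_id, text in DOCUMENTS.items():
        counts = Counter(text.split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, then by document id so ties are stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(rank_documents("school education"))
print(rank_documents("water"))
```

A glossary lookup like this breaks down quickly for low-resource languages, where dictionaries are small and words change form, which is part of why the project brings in expertise in morphology and machine translation.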

Millions of words

Although so-called “low-resource” languages are often spoken by millions of people worldwide, relatively little written material exists in these languages. This creates a challenge for current translation systems, which typically “learn” from seeing millions of written examples.

“Since we don’t have a lot of written data in these languages, we have to do more with less,” said May, who also holds an appointment as a research assistant professor in computer science at USC Viterbi.

“Ideally, we would use about 300 million words to train a machine-translation system — and in this case, we have around 800,000 words. There are about 100,000 words per novel, so we have only eight novels’ worth of words to work from.”
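The arithmetic behind May's comparison can be checked in a few lines. The figures below are the approximate ones he quotes; the variable names are illustrative only.

```python
# Approximate figures quoted in the article.
IDEAL_TRAINING_WORDS = 300_000_000  # words ideally used to train a system
AVAILABLE_WORDS = 800_000           # words available in this case
WORDS_PER_NOVEL = 100_000           # rough length of a novel

# How many novels' worth of text is available, and how many would be ideal.
novels_available = AVAILABLE_WORDS // WORDS_PER_NOVEL
novels_ideal = IDEAL_TRAINING_WORDS // WORDS_PER_NOVEL

# How many times smaller the available data is than the ideal amount.
shortfall_factor = IDEAL_TRAINING_WORDS / AVAILABLE_WORDS

print(f"Novels available: {novels_available}")
print(f"Novels ideally needed: {novels_ideal}")
print(f"Available data is {shortfall_factor:.0f} times smaller than ideal")
```

On these numbers, the team has eight novels' worth of text where about 3,000 would be ideal, a gap of 375 times.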

Getting started

The researchers will begin the project by compiling material in the test languages, including speech, online documents and video clips, that has previously been translated into English.

They will then develop algorithms to analyze the language patterns, such as sentence structure — subject, verb and object position, for example — and morphology, the structure of words and their relation to other words in the same language.

The system will be designed to respond to domain-specific queries, such as environmental protection in the “government and politics” domain or primary education in the “lifestyle” domain. For each result, it will produce a summary of about 100 words describing how that result is relevant to the search.

“You can think of the summary as something like Cliffs Notes, but with the added feature that it is indexed to the precise part you want to write your essay about,” May said.

ISI is not the only institution working toward this goal. Johns Hopkins, Columbia University and Raytheon BBN Technologies are also taking part in the IARPA program, called MATERIAL, for Machine Translation for English Retrieval of Information in Any Language.

“IARPA’s MATERIAL program is the first organized attempt at synthesizing recent advances in machine translation, speech recognition, cross-lingual retrieval and summarization into a powerful new capability that allows users to accurately access all relevant information, across languages and modalities,” Natarajan said. “We are tremendously grateful for the opportunity to contribute to this nationally important effort.”