Information Sciences Institute to develop translation and information-retrieval system for rarer languages

A sign of things to come? A team of researchers from the Information Sciences Institute (ISI) at the University of Southern California has received a $16.7 million grant from the Intelligence Advanced Research Projects Activity (IARPA) to develop an automated tool that quickly translates and summarizes information in “low-resource languages.”

The ISI team’s project is called SARAL, a Hindi word whose translations include “simple” and “ingenious.” The name stands for Summarization and domain-Adaptive Retrieval. The team includes experts in machine translation, speech recognition, morphology, information retrieval, representation, and summarization.

“The overall objective is to provide a Google-like capability, except the queries are in English but the retrieved documents are in a low-resource foreign language,” said Scott Miller, principal investigator and ISI research team leader. “The aim is to retrieve relevant foreign-language documents and to provide English summaries explaining how each document is relevant to the English query.”

In this project, the ISI team will initially test their systems using Tagalog and Swahili, two low-resource languages selected by IARPA for the task. Over the course of the project, the team will receive additional languages to translate using the systems.

The Translation of Low-Resource Languages

Although so-called “low-resource languages” are often spoken by millions of people worldwide, relatively little written material exists in these languages.

This creates a challenge for current translation systems, which typically “learn” from seeing millions of written examples.

“Since we don’t have a lot of written data in these languages, we have to do more with less,” says ISI computer scientist Jonathan May.

“Ideally, we would use about 300 million words to train a machine translation system, and in this case, we have around 800,000 words. There are about 100,000 words per novel, so we have only eight novels’ worth of words to work from.”

The researchers will begin the project by compiling material in the test languages, including speech, online documents, and video clips, that has previously been translated into English.

They will then develop algorithms to analyze the language patterns, such as sentence structure (for example, the position of subject, verb, and object) and morphology, the structure of words and their relation to other words in the same language.

In addition to ISI, several universities and research institutions will work towards the same goal: Johns Hopkins University, Columbia University, and Raytheon BBN Technologies are also taking part in the IARPA program, called MATERIAL, which stands for Machine Translation for English Retrieval of Information in Any Language.

About USC ISI

The Information Sciences Institute (ISI) at USC’s Viterbi School of Engineering is one of the nation’s largest, most technologically diverse, and impactful university-affiliated computer research institutes. Today, ISI researchers continue to perform groundbreaking work in cutting-edge areas such as natural language processing, deep machine learning, computer vision, cyber security, biometrics, space technologies, networking, high-performance and reconfigurable computing systems, medical informatics, and scientific computing workflows.

Headquartered in Marina del Rey, California, with research offices in Arlington, Virginia, and Greater Boston, Massachusetts, ISI employs about 300 engineers, research scientists, graduate students, and staff. In 2017, the Institute’s expenditures surpassed a record $100 million, with projects ranging from trusted electronics to quantum computing and automated forecasting of geopolitical events.