IEEE Signal Processing Society
Speech & Language Technical Committee


NIST Conducts Language Recognition Evaluation

BY SATANJEEV BANERJEE



In the fall of 2007, NIST - the National Institute of Standards and Technology - ran the fourth Language Recognition Evaluation exercise. The goal of the exercise was to evaluate the current state of the art in automatically detecting the language being spoken in a short audio snippet. This exercise followed similar exercises performed by NIST in 1996, 2003 and 2005.

The latest evaluation consisted of six tests, each defined by a fixed set of languages or dialects that the participants had to distinguish among. Each test comprised a set of audio snippets. In each set, one of the languages in that test was designated as the "target" language, and the remaining languages as "non-target". Participants had to detect whether the target language was being spoken in the snippet. Two versions of this evaluation were performed: closed set and open set. In "closed set" evaluation, the language in the snippet was guaranteed to be either the target language or one of the non-target languages. In "open set" evaluation, the snippet's language could also be outside the set of target and non-target languages.
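
As a rough illustration of the trial structure described above, the following Python sketch builds detection trials for a single snippet. The names `Trial` and `make_trials` are hypothetical, and pairing each snippet with every language in the test as a possible target is an assumption made for illustration, not a detail taken from the evaluation plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One detection question: is target_language spoken in this snippet?"""
    snippet_language: str  # ground-truth language of the snippet
    target_language: str   # language the system is asked to detect

    @property
    def is_target(self) -> bool:
        return self.snippet_language == self.target_language


def make_trials(snippet_language, test_languages, open_set=False):
    """Build one trial per candidate target language (an assumption).

    Closed set: the snippet's language must be one of the test's languages.
    Open set: the snippet's language may also lie outside that set.
    """
    if not open_set and snippet_language not in test_languages:
        raise ValueError("closed-set snippets must come from the test's languages")
    return [Trial(snippet_language, target) for target in test_languages]


# The four Chinese-group languages from the Chinese language test.
chinese_test = ["Cantonese", "Mandarin", "Min", "Wu"]
closed_trials = make_trials("Min", chinese_test)
open_trials = make_trials("Thai", chinese_test, open_set=True)
```

Under these assumptions, a closed-set snippet yields exactly one target trial, while an out-of-set snippet in the open-set condition yields only non-target trials.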

Of the six tests, there were two language discrimination tests (the general test and the Chinese test) and four dialect discrimination tests. In the general test, participants had to discriminate among 14 languages or language groups: Arabic, Bengali, Farsi, German, Japanese, Korean, Russian, Tamil, Thai, Vietnamese, Chinese (a mixture of Cantonese, Min, Wu, mainland Mandarin and Taiwanese Mandarin), English (a mixture of American and Indian), Hindustani (a mixture of Hindi and Urdu), and Spanish (a mixture of Caribbean and non-Caribbean). In the Chinese language test, participants had to discriminate among the four Chinese-group languages: Cantonese, Mandarin (a mixture of mainland and Taiwanese), Min and Wu. In the four dialect discrimination tests, participants had to distinguish between pairs of dialects: mainland Mandarin versus Taiwanese Mandarin, American English versus Indian English, Hindi versus Urdu, and Caribbean Spanish versus non-Caribbean Spanish. Together these six tests covered a wide range of difficulty, from languages that are mutually unintelligible to dialects that are similar to each other.

The audio snippets contained conversational telephone speech in 8-bit 8-kHz u-law format, and were either 3, 10 or 30 seconds long. There were between 80 and 168 audio snippets for each of the three durations for each language and dialect tested. Data from previous evaluations were provided to the participants as training data. Much of the test data was recently collected by the Linguistic Data Consortium (LDC) and others.

Participants' outputs were scored by computing a cost function that combined detection misses (the system failed to detect the target language when it was being spoken) and false alarms (the system reported the target language when a different language was being spoken). Average cost was computed over the audio snippets in each of the three duration classes, and in each of the six tests.
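
A cost of this kind can be sketched in Python as a weighted sum of the miss rate and the false-alarm rate. This is a minimal sketch, not the evaluation's official scoring code: the function name `detection_cost` is hypothetical, and the weights `c_miss`, `c_fa` and the prior `p_target` are illustrative placeholders, since the article does not give the exact formula or parameter values.

```python
def detection_cost(decisions, is_target, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted combination of miss rate and false-alarm rate.

    decisions[i] is True if the system said "target language present".
    is_target[i] is True if the target language really was spoken.
    c_miss, c_fa and p_target are assumed values, not taken from the article.
    """
    if len(decisions) != len(is_target):
        raise ValueError("decisions and is_target must have the same length")
    n_target = sum(1 for t in is_target if t)
    n_nontarget = len(is_target) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Miss: a target trial the system rejected.
    misses = sum(1 for d, t in zip(decisions, is_target) if t and not d)
    # False alarm: a non-target trial the system accepted.
    false_alarms = sum(1 for d, t in zip(decisions, is_target) if d and not t)

    p_miss = misses / n_target
    p_fa = false_alarms / n_nontarget
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Two target trials (one missed) and two non-target trials (one false alarm).
cost = detection_cost(
    decisions=[True, False, True, False],
    is_target=[True, True, False, False],
)
```

With equal weights and a prior of 0.5, the example above has a miss rate of 0.5 and a false-alarm rate of 0.5, giving a cost of 0.5. A per-duration or per-test average would apply the same function to the trials in each class.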

Twenty-one sites participated in the exercise: four each from China, France and Spain, three from the United States, two from Singapore, one each from the Czech Republic, Italy and Germany, and one combined team from the Netherlands and South Africa.

As expected, of the six tests, the best results were obtained for the general language discrimination test. The best-performing systems showed a marked improvement in all three snippet duration classes over the best results in the 2005 exercise, even though there were more languages this year than in previous years. The next best results this year were in the Chinese language discrimination test, followed by the English dialect and the Mandarin dialect tests. The best results in the English dialect test did not improve much over the 2005 results, except for a slight improvement in the 30-second snippet duration class. There were improvements over the 2005 results in the Mandarin dialect test. The worst results this year were in the Hindustani dialect (Hindi versus Urdu) and Spanish dialect tests. In a separate pilot experiment to measure how well humans fluent in the dialects can perform the same dialect discrimination tasks, NIST test coordinator Audrey Le found that while the humans' performance was, as expected, better than the systems', it was still "far from perfect".

"This was probably due to...the difficulty in defining hard boundaries among dialects", said Le.

The full evaluation results of this exercise are available here, and more information can be found on the evaluation's homepage.

The next evaluation is planned for the Spring/Summer of 2009.
