NIST Conducts Language Recognition Evaluation
BY SATANJEEV BANERJEE
In the fall of 2007, NIST
- the National Institute of Standards and Technology - ran
the fourth Language Recognition Evaluation exercise. The
goal of the exercise was to evaluate the current state of
the art in automatically detecting the language being spoken
in a short audio snippet. This exercise followed similar
exercises performed by NIST in 1996, 2003 and 2005.
The latest evaluation consisted of six tests, each defined
by a fixed set of languages or dialects that the
participants had to distinguish among. Each test consisted
of a set of audio snippets. In each set, one of the
languages in that test was designated as the "target"
language, and the remaining languages "non-target".
Participants had to detect whether the target language was
being spoken in the snippet. Two versions of this evaluation
were performed - closed set and open set. In "closed set"
evaluation, the language in the snippet was guaranteed to be
either the target language or one of the non-target
languages. In "open set" evaluation, the snippet's language
could also be outside the set of target and non-target
languages.
Of the six tests, there were two language discrimination
tests (the general test and the Chinese test) and four
dialect discrimination tests. In the general test,
participants had to discriminate between 14 languages or
language groups: Arabic, Bengali, Farsi, German, Japanese,
Korean, Russian, Tamil, Thai, Vietnamese, Chinese (a mixture
of Cantonese, Min, Wu, mainland Mandarin and Taiwanese
Mandarin), English (a mixture of American and Indian),
Hindustani (a mixture of Hindi and Urdu), and Spanish (a
mixture of Caribbean and non-Caribbean). In the Chinese
language test, participants had to discriminate between the
four Chinese-group languages: Cantonese, Mandarin (a mixture
of mainland and Taiwanese), Min and Wu. In the four dialect
discrimination tests, participants had to distinguish
between pairs of dialects - mainland Mandarin versus
Taiwanese Mandarin, American English versus Indian English,
Hindi versus Urdu and Caribbean Spanish versus non-Caribbean
Spanish. Together these six tests covered a wide range of
difficulty, from languages that are mutually unintelligible
to dialects that are similar to each other.
The audio snippets contained conversational telephone speech
in 8-bit 8-kHz u-law format, and were either 3, 10 or 30
seconds long. There were between 80 and 168 audio snippets
for each of the three durations for each language and
dialect tested. Data from previous evaluations were provided
to the participants as training data. Much of the test data
was recently collected by the Linguistic Data Consortium (LDC)
and others.
Participants' outputs were scored by computing a cost
function that combined detection misses (the system failed
to detect the target language when it was being spoken) and
false alarms (the system declared the target language when a
different language was being spoken). Average cost was
computed over audio snippets in each of the three duration
classes, and in each of the six tests.
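As a rough illustration of how such a cost combines the two error types, the sketch below computes a miss rate and a false-alarm rate over a list of detection trials and then takes a weighted sum. The names (Trial, detection_cost) are hypothetical, and the default weights and target prior (equal costs of 1.0 and a prior of 0.5) are assumptions made for the example, not values stated in this article.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trial:
    """One detection trial (hypothetical structure for illustration)."""
    is_target: bool    # ground truth: the snippet is in the target language
    said_target: bool  # system decision: "the target language is spoken"


def detection_cost(trials: Iterable[Trial],
                   c_miss: float = 1.0,    # assumed cost of a miss
                   c_fa: float = 1.0,      # assumed cost of a false alarm
                   p_target: float = 0.5   # assumed prior of the target
                   ) -> float:
    """Weighted sum of the miss rate and the false-alarm rate."""
    trials = list(trials)
    targets = [t for t in trials if t.is_target]
    nontargets = [t for t in trials if not t.is_target]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    # Miss: the target language was spoken but the system said "no".
    p_miss = sum(not t.said_target for t in targets) / len(targets)
    # False alarm: another language was spoken but the system said "yes".
    p_fa = sum(t.said_target for t in nontargets) / len(nontargets)

    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Example: 4 target trials with 1 miss, 4 non-target trials with 2 false alarms.
example_trials = (
    [Trial(True, True)] * 3 + [Trial(True, False)] +
    [Trial(False, True)] * 2 + [Trial(False, False)] * 2
)
example_cost = detection_cost(example_trials)
```

With these assumed settings, a system that always answers "yes" or always answers "no" scores 0.5, so lower values indicate better detection. Averaging this quantity over the snippets in one duration class of one test would give a per-class figure of the kind described above.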
Twenty-one sites participated in the exercise: four each
from China, France and Spain, three from the United States,
two from Singapore, one each from the Czech Republic, Italy
and Germany, and one combined team from the Netherlands and
South Africa.
As expected, of the six tests, the best results were
obtained for the general language discrimination test, and
the best performing systems showed a marked improvement in
all three snippet duration classes over the best results in
the 2005 exercise, even though there were more languages
this year than in previous years. The next best results
this year were in the Chinese language
discrimination test, followed by the English dialect and the
Mandarin dialect tests. The best results in the English
dialect test did not improve much over the 2005 results,
except for a slight improvement in the 30-second snippet
duration class. There were improvements over the 2005
results in the Mandarin dialect test. The worst performing
tests this year were the Hindustani dialect (Hindi versus
Urdu) and Spanish dialect tests. In a separate pilot
experiment to measure how well humans fluent in the dialects
can perform the same dialect discrimination tasks, NIST test
coordinator Audrey Le found that while the humans'
performance was, as expected, better than the systems', it
was still "far from perfect".
"This was probably due to...the difficulty in defining hard
boundaries among dialects", said Le.
The full evaluation results of this exercise are available
here, and more information can be found on the evaluation's
homepage.
The next evaluation is planned for the spring or summer of
2009.