NLP for non-English languages: challenges and resources
BY SVETLANA STOYANCHEV
While Natural Language Processing (NLP) research is dominated
by work on the English language, NLP activities and resources
in other languages have increased significantly.
Because these languages vary dramatically in their
characteristics, they pose different challenges for automatic
natural language and speech processing. For example,
Mandarin Chinese is a tonal language, so a speech recognizer
for it benefits from an estimate of fundamental frequency as
an additional feature to aid in the recognition of tones.
Because written Mandarin Chinese does not mark word
boundaries, word segmentation is required, a task typically
not needed for English. Another challenge is that Mandarin
Chinese nouns require numeral classifiers (measure words),
which complicates language generation.
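To make the segmentation task concrete, here is a minimal sketch of greedy forward maximum matching, one simple dictionary-based approach. It is an illustration only, not a method described in this article: the function name, the toy lexicon, and the max_len limit are all assumptions, and practical segmenters typically rely on statistical models.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


# Toy lexicon, assumed for illustration only.
lexicon = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
print(max_match("我喜欢自然语言处理", lexicon))
# → ['我', '喜欢', '自然语言', '处理']
```

Note that the greedy choice prefers "自然语言" over "自然" followed by "语言"; resolving such ambiguities well is what makes segmentation a research problem.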
According to the
final report of the 2002 Johns Hopkins Workshop on Novel
Speech Recognition Models for Arabic, "The most difficult
problems in developing high-accuracy speech recognition
systems for Arabic are the predominance of non-diacritized
text material, the enormous dialectal variety, and the
morphological complexity".
A diacritic is a small sign added to a
letter to alter pronunciation or to distinguish between
similar words. Other languages besides Arabic that use
diacritics include Hebrew, Greek, Danish, Faroese,
Icelandic, Norwegian, Swedish, Bosnian, Croatian,
Lithuanian, Latvian, and others.
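The effect of missing diacritics can be illustrated with a short sketch that removes them using Python's standard unicodedata module. The function name and the sample words are assumptions chosen for illustration; the point is only that once the marks are gone, the pronunciation information they carried cannot be read back from the text.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Malmö"))  # → Malmo

# Arabic k-t-b with three short-vowel marks (fatha, U+064E), written with
# escapes to avoid display issues. Stripping leaves only the consonants.
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
bare = strip_diacritics(vocalized)
print(len(vocalized), len(bare))  # → 6 3
```

Letters that have no canonical decomposition, such as the Danish "ø", are left unchanged by this approach.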
In 2006, the British Computer Society held a workshop on
The Challenge of Arabic for NLP/MT,
which noted that "processing of spoken colloquial Arabic speech
is challenged by the number of a continuum of Arabic
dialects". Research on model adaptation plays an important
role in speech recognition for languages with diverse
dialects, where it is infeasible to collect training data
for every dialect.
Morphologically rich languages such as Arabic, Turkish,
Finnish, and Korean pose a challenge to speech recognition.
Words can be formed by attaching multiple prefixes and
suffixes to a stem, which greatly expands the vocabulary a
recognizer must cover. Interspeech 2007 offered a
tutorial on processing morphologically rich languages.
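The vocabulary growth can be sketched with a toy example. The Turkish stem and the three optional suffix slots below are assumptions picked for illustration, and the sketch ignores vowel harmony and other phonological rules, so it does not model real Turkish morphology.

```python
from itertools import product

stem = "ev"  # "house"

# Each slot is optional: the empty string means the suffix is absent.
suffix_slots = [
    ["", "ler"],  # plural
    ["", "im"],   # first-person singular possessive
    ["", "de"],   # locative case
]

# Every combination of slot choices yields a distinct surface form.
forms = [stem + "".join(combo) for combo in product(*suffix_slots)]
print(len(forms))  # → 8
print(forms)
```

With only three optional suffixes, a single stem already yields eight word forms, and each further slot multiplies the count again, which is why fixed word-level vocabularies cover such languages poorly.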
Increased research
activity in these areas has resulted in many recent
workshops, conferences and other events studying the issues
around NLP in languages other than English. Below, we
provide a partial list of recent conference activities and
available speech and text resources.
Recent workshops, conferences, and events on non-English NLP and multilingual processing
The Cross-Language Evaluation Forum
(CLEF) has organized annual workshops on cross-language
processing since 2000. The first workshops focused mainly on
cross-language information retrieval and resources. In 2004,
a multilingual question answering track was introduced.
Since 2006, the workshop has featured a task on information
retrieval from speech data.
TREC 2002 had an English/Arabic cross-language information
retrieval track
that investigated the ability of retrieval systems to find
documents that pertain to a topic regardless of the language
in which the document is written.
In 2006, ISCA held the MULTILING 2006
workshop on multilingual speech recognition and generation.
Multilingual processing is also applied to semantic tasks:
SemEval 2007
featured a cross-language word sense disambiguation task.
The ACL 2007
conference featured two sessions on multilingualism.
COLING 2008 had a workshop
on Multi-source, Multilingual Information Extraction and
Summarization, and in 2007 a similar workshop
was held as part of RANLP.
Languages like Chinese and Arabic have attracted significant
attention from researchers. In 2001, EACL held a
workshop on Arabic language processing, and in 2002 LREC held
a Workshop on
Arabic Language Resources and Evaluation.
LREC 2008 has a workshop on
Multilingual and Comparative
Perspectives in Specialized Language Resources.
One of the groups at the 2002 summer workshop of the Johns
Hopkins University Center for Language and Speech Processing
focused on
Novel Speech Recognition Models for Arabic.
In 2004, the Johns Hopkins workshop focused on
Dialectal Chinese Speech Recognition,
and in 2005 a group at this workshop focused on
Parsing Arabic Dialect
and on creating standards for the transcription of Arabic
dialect.
SIGIR 2007 featured a workshop on
non-English web searching
where researchers tackled web search in languages such as
Greek, Arabic, and Basque, and in abugida scripts.
SIGHAN,
a special interest group of the ACL, organizes
workshops on Chinese natural language processing. The sixth
workshop
took place in January 2008 at IJCNLP.
Non-English speech and text resources
The availability of non-English corpora, both processed and
plain text as well as speech, has been increasing rapidly.
Since 2007, 30% of the new entries in the LDC repository
have been non-English text and speech resources and 30% have
been parallel English and non-English resources, while 40%
have been English-only datasets. Recent non-English
additions to the LDC include the Chinese Treebank; new
versions of the Arabic, Chinese, and Spanish Gigaword
corpora; Mandarin broadcast news and affective speech; the
Korean Treebank and PropBank, Korean broadcast news, and
free response speech; Turkish microphone speech;
Hungarian-English parallel text; the English-Arabic
Treebank; Russian and Arabic telephone speech; broadcasts
with extracted keyframes in Arabic, Chinese, and English;
Urdu transcribed speech; and treebank, PropBank, and word
sense annotations on English and Chinese news text.
The English (American) WordNet
developed at Princeton has become a widely used resource for
English. A European initiative created
EuroWordNet
in 1999, covering eight European languages: English, Dutch,
Spanish, Italian, German, French, Czech, and Estonian.
In 2006, an
Arabic WordNet
was also developed.
The Leipzig Corpora Collection,
presented at LREC 2006, includes processed text resources
from the Web and newspapers for 18 different languages, with
1-10 million sentences per language.
David Lee provides an extensive
list of corpora resources
in various languages.
The ACL Wiki maintains a list of resources by language:
http://aclweb.org/aclwiki/index.php?title=List_of_resources_by_language