
IEEE Signal Processing Society
Speech & Language Technical Committee


NLP for non-English languages: challenges and resources

BY SVETLANA STOYANCHEV

While Natural Language Processing research is dominated by work on English, there has been a significant increase in NLP activities and resources for other languages.

Because these languages vary dramatically in their characteristics, they pose different challenges to automatic natural language and speech processing tasks. For example, Mandarin Chinese is a tonal language, so it is important for a speech recognizer to include an estimate of fundamental frequency as an additional feature to aid in the recognition of tones. Because written Mandarin Chinese does not mark word boundaries, word segmentation is required, a task typically not needed for English. Another challenge is that Mandarin Chinese nouns require numeral classifiers (measure words), which complicates language generation.
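To make the word segmentation task concrete, here is a minimal sketch of forward maximum matching, one simple dictionary-based approach. The lexicon, the function name, and the `max_len` parameter are illustrative assumptions rather than part of any system mentioned in this article, and practical segmenters generally rely on statistical models instead.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon entry, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, assumed for illustration only.
LEXICON = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

# "I like natural language processing", written without spaces.
print(forward_max_match("我喜欢自然语言处理", LEXICON))
```

Greedy matching fails when the longest entry is not the right one in context, which is one reason segmentation is treated as a research problem of its own.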

According to the final report of the 2002 Johns Hopkins Workshop on Novel Speech Recognition Models for Arabic, "The most difficult problems in developing high-accuracy speech recognition systems for Arabic are the predominance of non-diacritized text material, the enormous dialectal variety, and the morphological complexity". A diacritic is a small sign added to a letter to alter pronunciation or to distinguish between similar words. Other languages besides Arabic that use diacritics include Hebrew, Greek, Danish, Faroese, Icelandic, Norwegian, Swedish, Bosnian, Croatian, Lithuanian, Latvian, and others.
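The effect of non-diacritized text can be illustrated with a small sketch that strips combining marks from two Arabic words. The helper name `strip_diacritics` and the example word pair are assumptions chosen for illustration; the sketch uses only Python's standard `unicodedata` module.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Two fully diacritized Arabic words, written with escapes for clarity.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"

# Without diacritics both words collapse to the same letter sequence.
print(strip_diacritics(KATABA) == strip_diacritics(KUTUB))
```

Once the short-vowel marks are removed the two words become indistinguishable in writing, so a recognizer trained on such text has to resolve the pronunciation from context.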

In 2006 the British Computer Society held a workshop on The Challenge of Arabic for NLP/MT, stating that "processing of spoken colloquial Arabic speech is challenged by the number of a continuum of Arabic dialects". Research on model adaptation plays an important role in speech recognition for languages with diverse dialects, where it is infeasible to obtain training data for every dialect.

Morphologically rich languages such as Arabic, Turkish, Finnish, and Korean pose a challenge to speech recognition. Words can be constructed by attaching multiple prefixes and suffixes to a stem, which greatly expands the vocabulary a recognizer must cover. Interspeech 2007 offered a tutorial on processing morphologically rich languages.
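The vocabulary growth caused by affixation can be illustrated with a toy enumeration. The Turkish stems and suffix slots below are a simplified assumption for illustration: the stems were picked so that the fixed suffix forms happen to be valid, and the sketch ignores vowel harmony and other morphophonology.

```python
from itertools import product

# Toy data, assumed for illustration: "ev" (house) and "el" (hand).
STEMS = ["ev", "el"]
# Each slot is optional: plural -ler, possessive -im (my), locative -de (in/at).
SUFFIX_SLOTS = [["", "ler"], ["", "im"], ["", "de"]]


def surface_forms(stems, slots):
    """Enumerate every stem combined with one choice from each suffix slot."""
    return {stem + "".join(choice) for stem in stems for choice in product(*slots)}


forms = surface_forms(STEMS, SUFFIX_SLOTS)
print(len(forms))     # number of distinct surface forms
print(sorted(forms))
```

Two stems and three optional suffixes already give 16 distinct word forms, and each additional optional slot doubles the count, which is why word-level vocabularies grow so quickly for such languages.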

Increased research activity in these areas has resulted in many recent workshops, conferences and other events studying the issues around NLP in languages other than English. Below, we provide a partial list of recent conference activities and available speech and text resources.

Recent workshops, conferences, and events on non-English NLP and multi-lingual processing

The Cross-Language Evaluation Forum (CLEF) has organized annual workshops on cross-language processing since 2000. The first workshops mainly focused on cross-language information retrieval and resources. In 2004, a multi-lingual question answering track was introduced. Since 2006, the workshop has featured a task on information retrieval from speech data.

TREC 2002 had an English/Arabic cross-language information retrieval track that investigated the ability of retrieval systems to find documents that pertain to a topic regardless of the language in which the documents are written.

In 2006 ISCA held the MULTILING 2006 workshop on multilingual speech recognition and generation.

Multi-lingual processing is applied to determining semantic information. SemEval 2007 featured a cross-language word sense disambiguation task.

The ACL 2007 conference featured two sessions on multilingualism.

COLING 2008 had a workshop on Multi-source, Multilingual Information Extraction and Summarization, and a similar workshop was held in 2007 as part of RANLP.

Languages like Chinese and Arabic have attracted significant attention from a number of researchers. In 2001, EACL held a workshop on Arabic language processing and in 2002 LREC held a Workshop on Arabic Language Resources and Evaluation.

LREC 2008 has a workshop on Multilingual and Comparative Perspectives in Specialized Language Resources.

One of the groups at the 2002 summer workshop of the Johns Hopkins University Center for Language and Speech Processing focused on Novel Speech Recognition Models for Arabic.

In 2004 the Johns Hopkins workshop focused on Dialectal Chinese Speech Recognition, and in 2005 a group at this workshop focused on Parsing Arabic Dialect and on the creation of standards for the transcription of Arabic dialect.

SIGIR 2007 featured a workshop on non-English web searching where researchers tackled web search on languages such as Greek, Arabic, Basque, and Abuida.

SIGHAN is a special interest group of the ACL. It organizes workshops on Chinese natural language processing. The sixth workshop took place in January 2008 at IJCNLP.

Non-English speech and text resources

The availability of non-English corpora, including processed text, plain text, and speech, has been increasing rapidly.

Since 2007, 30% of the new entries in the LDC repository have been non-English text and speech resources and 30% have been parallel English and non-English resources, while 40% of new datasets were English-only. Recent non-English additions to the LDC include the Chinese Treebank; new versions of the Arabic, Chinese, and Spanish Gigaword corpora; Mandarin broadcast news and affective speech; the Korean Treebank and PropBank, Korean broadcast news, and free-response speech; Turkish microphone speech; Hungarian-English parallel text; the English-Arabic Treebank; Russian and Arabic telephone speech; broadcasts with extracted keyframes in Arabic, Chinese, and English; Urdu transcribed speech; and Treebank, PropBank, and word sense annotations on English and Chinese news text.

The English (American) WordNet, developed at Princeton, has become a widely used resource for English. In 1999, a European initiative created EuroWordNet, covering 8 European languages: English, Dutch, Spanish, Italian, German, French, Czech and Estonian. In 2006, an Arabic WordNet was also developed.

The Leipzig Corpora Collection, presented at LREC 2006, includes processed text resources from the web and newspapers for 18 different languages, with 1-10 million sentences per language.

David Lee provides an extensive list of corpora resources in various languages.

The ACL Wiki maintains a list of resources by language: http://aclweb.org/aclwiki/index.php?title=List_of_resources_by_language

 


 