JHU 2007 Summer Workshops Tackle Language and Speech Topics
By BRIAN MAK
The Center for Language and Speech Processing (CLSP) of Johns Hopkins University
continues to hold its annual JHU Summer Workshop on Human Language Technology,
this year for 6 weeks from July 16th to Aug 23rd, 2007. During the workshop,
speech and language scientists from universities, industry, and government
research institutes will work with graduate and undergraduate students to probe
further into two specific areas, in the hope of advancing the current state of
the art in speech and language technologies:
- Exploiting Lexical & Encyclopedic Resources For Entity Disambiguation
- Recovery from Model Inconsistency in Multilingual Speech Recognition
We conducted an email interview with the two team leaders and asked them (1) how
they came up with these topics, (2) how they found their workshop team, and (3)
what outcomes they were expecting. Below are brief descriptions of the two
projects, followed by the comments from the project leaders.
Exploiting Lexical & Encyclopedic Resources For Entity Disambiguation
Group Leader: Massimo Poesio
Description: To improve entity disambiguation by developing better
techniques for tracking entities and for extracting their properties. A
particular focus will be improving entity tracking by using lexical and
encyclopedic knowledge extracted both from structured lexical databases and from
semi-structured repositories such as Wikipedia.
Comments from Massimo Poesio:
The topic of the project was born out of two
considerations that have informed our work on coreference resolution for a long
time; the decision to have it this year was taken because of a number of recent
advances.
The first consideration is that it is clear that the lack of commonsense
knowledge, or inability to use it, is still the main problem facing systems
doing semantic interpretation tasks such as coreference resolution and entity
disambiguation. In the '80s, we tried to solve the problem by hand-coding the
required knowledge and the inference rules. While this type of work greatly
increased our understanding of the required inference processes, the methods
that were developed could only be applied to systems working on very specific
domains. In the '90s the emphasis in HLT shifted to work with large amounts of
data; as a result, those early methods were abandoned, to concentrate on trying
to achieve as much as possible using surface information. In the meantime,
however, a great deal of effort was put into developing techniques for acquiring
the desired knowledge automatically. This effort is beginning to pay off: recent
methods for extracting knowledge from Web-sized corpora and from resources such
as Wikipedia, including those developed by myself, Ponzetto, Strube, and Versley
(all participants in the workshop), have been shown to lead to improvements in
coreference over standard methods using only surface information. In particular,
Wikipedia is proving a tremendous source of encyclopedic knowledge.
The second consideration is that ultimately HLT systems are only incorporated in
real applications when they are shown to improve performance in those
applications. In other work we showed that coreference technology is reaching
the point where it can improve the performance of automatic summarizers. Entity
disambiguation is a task in which we can expect an even bigger improvement from
the use of coreference resolvers.
The advances that made us decide to have the workshop include, in addition to
the already mentioned improvements in our techniques for extracting lexical and
encyclopedic knowledge, improvements in machine learning technology, and the
availability of new resources. Work by Moschitti and Yang on using tree kernels
for semantic interpretation, by Yang on models of training instances, and by
Andrew McCallum's group (including Rob Hall and Michael Wick, who are
participating in the workshop) on global models for entity disambiguation is
resulting in models much more appropriate for entity disambiguation,
particularly for using lexical and encyclopedic knowledge. Finally, the recent
release of the OntoNotes corpus, annotated for coreference, will provide a much
more solid foundation for work on coreference, and new corpora for entity
disambiguation have also been released; our group is also busy creating
resources for cross-document coreference under the coordination of David Day and
Janet Hitzeman, who are also participating (David Day is co-chair).
Based on these considerations, the team in a sense chose itself. (A number of
other groups will participate in the effort as external collaborators.) The
outcomes we hope for: first of all, a clear demonstration that automatically
extracted commonsense knowledge does help coreference resolution; secondly, a
demonstration that coreference leads to improvements in entity disambiguation. As bonuses, we
expect to deliver a publicly available system for coreference resolution that
incorporates the best ideas around, and new resources such as annotated corpora.
Recovery from Model Inconsistency in Multilingual Speech Recognition
Group Leader: Hynek Hermansky
Description: Current ASR has difficulty handling unexpected words, which are
typically replaced by acoustically acceptable words with high prior probability.
Identifying parts of the message where such a replacement could have happened
may allow for corrective strategies. The project will focus on the detection and
description of out-of-vocabulary and mispronounced words in the 6-language
CallHome database. Additionally, to describe the suspect parts of messages, a
language-independent recognizer of speech sounds will be developed and applied
to the phonetic transcription of the identified suspect parts of the recognized
message.
Comments from Hynek Hermansky:
The project was proposed since we believe that current ASR is too top-down heavy
and behaves like an idiot who is quite willing to give a wrong answer regardless
of the data evidence. Clearly, the issue of OOV words, mispronunciations, etc.
is big and important - wouldn't you agree?
As an outcome, we hope to know more about this by the end of the summer than we
know now and to have an idea of how to work towards ASR that would rely on top-down
constraints only when appropriate and also be able to tell when the data are
inconsistent with the prior knowledge. A proposed architecture for such a system
is shown in Figure 1 below.
The team was formed from interested people who participated in the preparatory
meeting (Chin, Geoff and me) and colleagues who were judged appropriate.

Figure 1: A new framework in developing acoustic models