SpeechFind Retrieves Documents for National Gallery of the Spoken Word
By JOHN H. L. HANSEN & WOOIL KIM
The SpeechFind system, developed by the Center for Robust Speech Systems at the University of Texas at Dallas, is a spoken document retrieval system that currently serves as the search engine for the National Gallery of the Spoken Word (NGSW). The system operates in two phases: (i) enrollment and (ii) query and retrieval. In the enrollment phase, large audio sets are submitted for audio segmentation, transcription generation, and metadata construction. Once this phase is complete, the audio, transcriptions, and metadata are entered into an online repository, and the audio material becomes available through the online audio search engine for the query and retrieval phase.

The system includes the following modules: an audio spider and transcoder, a spoken document transcriber, a "rich" transcription database, and a publicly accessible online audio search engine. The audio spider and transcoder automatically fetch available audio archives from a range of servers and convert the incoming audio files into the audio formats required for processing. This module also parses the metadata and extracts relevant information into the "rich" transcription database to guide information retrieval.
The spoken document transcriber has two components: the audio segmenter and the transcriber. The audio segmenter partitions audio data into small, manageable segments by detecting speaker, channel, and environmental change points. The transcriber decodes each speech segment into text using a large-vocabulary continuous speech recognition engine.
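The idea of partitioning audio at change points can be illustrated with a minimal sketch. This is not the SpeechFind segmenter: it is a simplified stand-in that assumes each frame has already been reduced to a single numeric feature (real systems use multidimensional acoustic features and statistical criteria), and the function names, window size, and threshold are all hypothetical.

```python
def find_change_points(features, window=5, threshold=1.0):
    """Flag frame indices where the mean of the next `window` frames differs
    from the mean of the previous `window` frames by more than `threshold`.
    Within each run of consecutive candidates, only the strongest is kept."""
    def mean(xs):
        return sum(xs) / len(xs)

    # Collect every index whose left/right window means differ enough.
    candidates = []
    for i in range(window, len(features) - window + 1):
        left = mean(features[i - window:i])
        right = mean(features[i:i + window])
        diff = abs(right - left)
        if diff > threshold:
            candidates.append((i, diff))

    # Collapse runs of adjacent candidates to their peak.
    change_points = []
    run = []
    for idx, diff in candidates:
        if run and idx != run[-1][0] + 1:
            change_points.append(max(run, key=lambda p: p[1])[0])
            run = []
        run.append((idx, diff))
    if run:
        change_points.append(max(run, key=lambda p: p[1])[0])
    return change_points


def split_segments(features, change_points):
    """Turn change points into (start, end) frame ranges, end exclusive."""
    bounds = [0] + change_points + [len(features)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]
```

For example, a feature track that jumps from one level to another midway yields a single change point at the jump, and `split_segments` returns the two frame ranges on either side, each of which could then be passed to a recognizer.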
The online search engine handles information retrieval tasks. It comprises a web-based user interface as the front end and search and index engines at the back end. The search engine responds to a user query by launching back-end retrieval commands, formatting the output as a list of relevant transcribed documents ranked by relevance score and associated with timing information, and providing the user with web links to the corresponding audio clips.
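The retrieval step described above can be sketched as follows. This is an illustrative example only, assuming a simple TF-IDF score over whitespace-tokenized transcripts; the actual SpeechFind ranking method is not specified here, and the `Segment` fields and the `search` function are hypothetical names.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed audio segment (all fields are hypothetical)."""
    doc_id: str
    start_sec: float
    end_sec: float
    text: str


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def search(segments, query, top_k=3):
    """Rank segments against the query with a simple TF-IDF score.
    Returns up to `top_k` (score, segment) pairs, best first."""
    n = len(segments)
    term_counts = [Counter(tokenize(s.text)) for s in segments]

    # Document frequency: number of segments containing each term.
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())

    results = []
    for seg, counts in zip(segments, term_counts):
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                score += counts[term] * math.log(1 + n / df[term])
        if score > 0:
            results.append((score, seg))

    # Sort by score only, so Segment objects are never compared.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results[:top_k]
```

Each returned segment carries its `start_sec` and `end_sec`, which is the kind of timing information a front end would need to link a hit to the matching portion of the audio clip.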
The NGSW speech corpus covers one of the widest ranges of audio material available today: up to 60,000 hours spanning the last 110 years. The audio content includes a diverse vocabulary that varies across time periods, and many recordings exhibit varied acoustic conditions (e.g., background noise, reverberation, channels, recording media, and speaking style). The system therefore requires a reliable audio segmenter and a robust speech recognition engine to support a variety of audio corpora.
Development of this system was supported through grants from the NSF Digital Libraries Initiative (II) and Project Emmitt at the University of Texas at Dallas.
[1] John H. L. Hansen, Rongqing Huang, Bowen Zhou, Michael Seadle, John R. (Jack) Deller, Jr., Aparna R. Gurijala, Mikko Kurimo, and Pongtep Angkititrakul, "SpeechFind: Advances in Spoken Document Retrieval for a National Gallery of the Spoken Word," IEEE Trans. on Speech and Audio Proc., vol. 13, no. 5, pp. 712-730, Sep. 2005.