Combining Structural Analysis and Computer Vision Techniques for Automatic Speech Summarization

2008-12-17
Similar to the verse and chorus sections that appear as repetitive structures in musical audio, the key concepts (or topics) of some speech recordings (e.g., presentations and lectures) may also repeat over time. Accurate detection of these repetitions may therefore contribute to the success of automatic speech summarization. Based on this motivation, we consider the applicability of music structural analysis methods to speech summary generation. Our method transforms a 1-D time-domain speech signal into a 2-D image representation, namely a (dis)similarity matrix, and detects possible repetitions within the matrix using appropriate computer vision techniques. In addition, the method does not transcribe the speech signal into words, phrases, or sentences; hence, it can be generalized as a speech-to-speech summarization method, in which summarization results are presented as speech instead of text. Furthermore, the method requires no prior knowledge about the language or grammar of the speech signal. Experiments show that our method captures the main theme of speech signals when compared to the ideal transcription sections defined by experts, and computational analysis shows that the proposed method performs well.
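The (dis)similarity-matrix idea behind the method can be sketched as follows. This is a minimal illustration with synthetic feature vectors, not the authors' actual pipeline; the function names and the toy repetition-detection step (scoring sub-diagonals of the matrix) are our own assumptions:

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine self-similarity of a (frames x dims) feature sequence.

    In the paper's setting, each row would be a short-time feature
    vector extracted from the speech signal; here we use synthetic data.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T  # S[i, j] = similarity of frames i and j

def diagonal_scores(S, min_lag=1):
    """Mean similarity along each sub-diagonal of S.

    A bright stripe parallel to the main diagonal at offset `lag`
    indicates that the sequence repeats itself `lag` frames later.
    """
    n = S.shape[0]
    return np.array([np.diagonal(S, offset=lag).mean()
                     for lag in range(min_lag, n)])

# Toy example: a 4-frame pattern repeated three times (12 frames total).
rng = np.random.default_rng(0)
block = rng.normal(size=(4, 12))
features = np.vstack([block, block, block])

S = self_similarity_matrix(features)
scores = diagonal_scores(S)
best_lag = int(np.argmax(scores)) + 1  # lag of the strongest repetition
```

With these synthetic features, `best_lag` recovers the repetition period of 4 frames; on real speech one would first extract frame-level acoustic features and then apply image-analysis techniques (as the abstract describes) to locate such stripes.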

Suggestions

Generating expressive summaries for speech and musical audio using self-similarity clues
Sert, Mustafa; Baykal, Buyurman; Yazıcı, Adnan (2006-07-12)
We present a novel algorithm for structural analysis of audio to detect repetitive patterns that are suitable for content-based audio information retrieval systems, since repetitive patterns can provide valuable information about the content of audio, such as a chorus or a concept. The Audio Spectrum Flatness (ASF) feature of the MPEG-7 standard, although it has not been considered as much as other feature types, has been utilized and evaluated as the underlying feature set. Expressive summaries are chosen a...
FROM ACOUSTICS TO VOCAL TRACT TIME FUNCTIONS
Mitra, Vikramjit; Oezbek, I. Yuecel; Nam, Hosung; Zhou, Xinhui; Espy-Wilson, Carol Y. (2009-04-24)
In this paper we present a technique for obtaining Vocal Tract (VT) time functions from the acoustic speech signal. Knowledge-based Acoustic Parameters (APs) are extracted from the speech signal and a pertinent subset is used to obtain the mapping between them and the VT time functions. Eight different vocal tract constriction variables, consisting of five constriction degree variables, lip aperture (LA), tongue body (TBCD), tongue tip (TTCD), velum (VEL), and glottis (GLO), and three constriction location ...
Integration of multimodal multimedia database system architecture with query level fusion
Sattari, Saeid; Yarman Vural, Fatoş Tunay; Department of Computer Engineering (2013)
Multimedia data particularly digital videos that contain various modalities (visual, audio, and text) are complex and time consuming to deal with. Therefore, managing large volume of multimedia data reveals the necessity for efficient methods of modeling, processing, storing and retrieving these data. In this study, we investigate some of the requirements to efficiently deal with multimedia data, especially video data. To satisfy such requirements we aim to integrate specific multimedia database architectur...
Automatic multi-modal dialogue scene indexing
Alatan, Abdullah Aydın (2001-10-10)
An automatic algorithm for indexing dialogue scenes in multimedia content is proposed. The content is segmented into dialogue scenes using the state transitions of a hidden Markov model (HMM). Each shot is classified using both audio and visual information to determine the state/scene transitions for this model. Face detection and silence/speech/music classification are the basic tools which are utilized to index the scenes. While face information is extracted after applying some heuristics to skin-colored regi...
Modeling phoneme durations and fundamental frequency contours in Turkish speech
Öztürk, Özlem; Çiloğlu, Tolga; Department of Electrical and Electronics Engineering (2005)
The term prosody refers to characteristics of speech such as intonation, timing, loudness, and other acoustical properties imposed by physical, intentional and emotional state of the speaker. Phone durations and fundamental frequency contours are considered as two of the most prominent aspects of prosody. Modeling phone durations and fundamental frequency contours in Turkish speech are studied in this thesis. Various methods exist for building prosody models. State-of-the-art is dominated by corpus-based me...
Citation Formats
M. Sert, B. Baykal, and A. Yazıcı, “Combining Structural Analysis and Computer Vision Techniques for Automatic Speech Summarization,” 2008, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/47151.