Combining Structural Analysis and Computer Vision Techniques for Automatic Speech Summarization
Date
2008-12-17
Author
Sert, Mustafa
Baykal, Buyurman
Yazıcı, Adnan
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Similar to the verse and chorus sections that appear as repetitive structures in musical audio, the key concepts (or topics) of some speech recordings (e.g., presentations, lectures) may also repeat over time. Accurate detection of these repetitions can therefore aid automatic speech summarization. Motivated by this, we consider the applicability of music structural analysis methods to speech summary generation. Our method transforms a 1-D time-domain speech signal into a 2-D image representation, namely a (dis)similarity matrix, and detects possible repetitions within the matrix using appropriate computer vision techniques. The method does not transcribe the speech signal into words, phrases, or sentences; hence, it generalizes to speech-to-speech summarization, in which the summary is presented as speech instead of text. Furthermore, the method requires no prior knowledge of the language or grammar of the speech signal. Experiments show that our method captures the main theme of speech signals when compared against ideal transcription sections defined by experts, and computational analysis shows that the proposed method performs well.
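The core transformation described in the abstract — turning a 1-D signal into a 2-D (dis)similarity image whose off-diagonal stripes reveal repetitions — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature representation (here, a generic frame-by-feature matrix) and the Euclidean distance measure are assumptions for the example.

```python
import numpy as np

def dissimilarity_matrix(features: np.ndarray) -> np.ndarray:
    """Build a 2-D (dis)similarity image from a frame-by-feature matrix.

    features: array of shape (n_frames, n_dims), e.g. one spectral
    feature vector per short-time frame of the speech signal.
    Returns an (n_frames, n_frames) matrix of pairwise Euclidean
    distances; repeated segments appear as low-distance diagonal stripes
    off the main diagonal, which vision techniques can then detect.
    """
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding

# Toy usage: frames 0-2 reappear as frames 6-8 (a "repetition"), so the
# entries D[0, 6], D[1, 7], D[2, 8] form a zero-distance diagonal stripe.
frames = np.array([[0.0], [1.0], [2.0], [9.0], [9.0], [9.0],
                   [0.0], [1.0], [2.0]])
D = dissimilarity_matrix(frames)
```

In practice the frame features would come from a short-time spectral analysis of the audio, and stripe detection on `D` (e.g. filtering along diagonals) would locate candidate repeated segments.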
Subject Keywords
Speech analysis, Computer vision, Speech synthesis, Natural languages, Signal analysis, Audio recording, Synthesizers, Computational complexity, Time domain analysis, Image representation
URI
https://hdl.handle.net/11511/47151
DOI
https://doi.org/10.1109/ism.2008.90
Collections
Department of Electrical and Electronics Engineering, Conference / Seminar
Suggestions
Generating expressive summaries for speech and musical audio using self-similarity clues
Sert, Mustafa; Baykal, Buyurman; Yazıcı, Adnan (2006-07-12)
We present a novel algorithm for structural analysis of audio to detect repetitive patterns that are suitable for content-based audio information retrieval systems, since repetitive patterns can provide valuable information about the content of audio, such as a chorus or a concept. The Audio Spectrum Flatness (ASF) feature of the MPEG7 standard, although not having been considered as much as other feature types, has been utilized and evaluated as the underlying feature set. Expressive summaries are chosen a...
FROM ACOUSTICS TO VOCAL TRACT TIME FUNCTIONS
Mitra, Vikramjit; Oezbek, I. Yuecel; Nam, Hosung; Zhou, Xinhui; Espy-Wilson, Carol Y. (2009-04-24)
In this paper we present a technique for obtaining Vocal Tract (VT) time functions from the acoustic speech signal. Knowledge-based Acoustic Parameters (APs) are extracted from the speech signal and a pertinent subset is used to obtain the mapping between them and the VT time functions. Eight different vocal tract constriction variables consisting of five constriction degree variables: lip aperture (LA), tongue body (TBCD), tongue tip (TTCD), velum (VEL), and glottis (GLO); and three constriction location ...
Integration of multimodal multimedia database system architecture with query level fusion
Sattari, Saeid; Yarman Vural, Fatoş Tunay; Department of Computer Engineering (2013)
Multimedia data particularly digital videos that contain various modalities (visual, audio, and text) are complex and time consuming to deal with. Therefore, managing large volume of multimedia data reveals the necessity for efficient methods of modeling, processing, storing and retrieving these data. In this study, we investigate some of the requirements to efficiently deal with multimedia data, especially video data. To satisfy such requirements we aim to integrate specific multimedia database architectur...
Automatic multi-modal dialogue scene indexing
Alatan, Abdullah Aydın (2001-10-10)
An automatic algorithm for indexing dialogue scenes in multimedia content is proposed. The content is segmented into dialogue scenes using the state transitions of a hidden Markov model (HMM). Each shot is classified using both audio and visual information to determine the state/scene transitions for this model. Face detection and silence/speech/music classification are the basic tools which are utilized to index the scenes. While face information is extracted after applying some heuristics to skin-colored regi...
Modeling phoneme durations and fundamental frequency contours in Turkish speech
Öztürk, Özlem; Çiloğlu, Tolga; Department of Electrical and Electronics Engineering (2005)
The term prosody refers to characteristics of speech such as intonation, timing, loudness, and other acoustical properties imposed by physical, intentional and emotional state of the speaker. Phone durations and fundamental frequency contours are considered as two of the most prominent aspects of prosody. Modeling phone durations and fundamental frequency contours in Turkish speech are studied in this thesis. Various methods exist for building prosody models. State-of-the-art is dominated by corpus-based me...
Citation Formats
IEEE
M. Sert, B. Baykal, and A. Yazıcı, “Combining Structural Analysis and Computer Vision Techniques for Automatic Speech Summarization,” 2008, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/47151.