Bimodal automatic speech segmentation based on audio and visual information fusion
Date
2011-07-01
Author
Akdemir, Eren
Çiloğlu, Tolga
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Item Usage Stats: 199 views, 0 downloads
Bimodal automatic speech segmentation using visual information together with audio data is introduced. The accuracy of automatic segmentation directly affects the quality of speech processing systems that use the segmented database. The collaboration of audio and visual data results in a lower average absolute boundary error between the manual and automatic segmentation results. The information from the two modalities is fused at the feature level and used in an HMM-based speech segmentation system. A Turkish audiovisual speech database has been prepared and used in the experiments. The average absolute boundary error decreases by up to 18% when different audiovisual feature vectors are used. The benefits of incorporating visual information are discussed for different phoneme boundary types. Each audiovisual feature vector yields a different performance at different types of phoneme boundaries. The average absolute boundary error decreases by approximately 25% when audiovisual feature vectors are used selectively for different boundary classes. Visual data is collected using an ordinary webcam, so the proposed method is very convenient to use in practice.
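As a rough illustration of the feature-level fusion described in the abstract, the sketch below concatenates per-frame audio and visual feature vectors into a single joint observation sequence that could be fed to an HMM-based aligner. The specific feature types (MFCCs, lip-motion parameters), frame rates, and the linear-interpolation synchronisation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def fuse_features(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Feature-level fusion: bring the visual stream to the audio frame rate
    and concatenate per-frame vectors into one joint observation sequence.

    audio_feats:  (T_a, D_a), e.g. MFCCs at 100 frames/s
    visual_feats: (T_v, D_v), e.g. lip-motion parameters at webcam rate (~30 fps)
    """
    t_a = audio_feats.shape[0]
    # Upsample the visual stream by linear interpolation so both streams share
    # the audio frame index (one common way to synchronise the modalities;
    # the paper's actual scheme may differ).
    src = np.linspace(0.0, 1.0, visual_feats.shape[0])
    dst = np.linspace(0.0, 1.0, t_a)
    visual_up = np.stack(
        [np.interp(dst, src, visual_feats[:, d]) for d in range(visual_feats.shape[1])],
        axis=1,
    )
    # Joint observation vectors of dimension D_a + D_v for the HMM aligner.
    return np.concatenate([audio_feats, visual_up], axis=1)

# Toy example with random data standing in for real MFCC and lip features.
audio = np.random.randn(500, 13)   # 5 s of 13-dim MFCCs at 100 frames/s
visual = np.random.randn(150, 4)   # 5 s of 4-dim lip features at 30 fps
joint = fuse_features(audio, visual)
print(joint.shape)                  # (500, 17)
```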
Subject Keywords
Automatic speech segmentation, Audiovisual, Lip motion, Text-to-speech
URI
https://hdl.handle.net/11511/35997
Journal
SPEECH COMMUNICATION
DOI
https://doi.org/10.1016/j.specom.2011.03.001
Collections
Department of Electrical and Electronics Engineering, Article
Suggestions
Wireless speech recognition using fixed point mixed excitation linear prediction (MELP) vocoder
Acar, D; Karci, MH; Ilk, HG; Demirekler, Mübeccel (2002-07-19)
A bit stream based front-end for wireless speech recognition system that operates on fixed point mixed excitation linear prediction (MELP) vocoder is presented in this paper. Speaker dependent, isolated word recognition accuracies obtained from conventional and bit stream based front-end systems are obtained and their statistical significance is discussed. Feature parameters are extracted from original (wireline) and decoded speech (conventional) and from the quantized spectral information (bit stream) of t...
Multilingual Video Indexing and Retrieval Employing an Information Extraction Tool for Turkish News Texts: A Case Study
Kucuk, Dilek; Yazıcı, Adnan (2011-10-28)
In this paper, a multilingual video indexing and retrieval system is proposed which relies on an information extraction tool, a hybrid named entity recognizer, for Turkish to determine the semantic annotations for the considered videos. The system is executed on a set of news videos in English and encompasses several other components including an automatic speech recognition system for English, an English-to-Turkish machine translation system, a news video database, and a semantic video retrieval interface....
Multimodal query-level fusion for efficient multimedia information retrieval
Sattari, Saeid; Yazıcı, Adnan (2018-10-01)
Managing a large volume of multimedia data containing various modalities such as visual, audio, and text reveals the necessity for efficient methods for modeling, processing, storing, and retrieving complex data. In this paper, we propose a fusion-based approach at the query level to improve query retrieval performance of multimedia data. We discuss various flexible query types including the combination of content as well as concept-based queries that provide users with the ability to efficiently perform mu...
MAGiC: A Multimodal Framework for Analysing Gaze in Dyadic Communication
Aydin, Ulku Arslan; Kalkan, Sinan; Acartürk, Cengiz (2018-01-01)
The analysis of dynamic scenes has been a challenging domain in eye tracking research. This study presents a framework, named MAGiC, for analyzing gaze contact and gaze aversion in face-to-face communication. MAGiC provides an environment that is able to detect and track the conversation partner's face automatically, overlay gaze data on top of the face video, and incorporate speech by means of speech-act annotation. Specifically, MAGiC integrates eye tracking data for gaze, audio data for speech segmentati...
On lexicon creation for Turkish LVCSR
Hacıoğlu, Kadri; Pellom, Bryan; Çiloğlu, Tolga; Öztürk, Özlem; Kurimo, Mikko; Creutz, Mathias (2003-09-14)
In this paper, we address the lexicon design problem in Turkish large vocabulary speech recognition. Although we focus only on Turkish, the methods described here are general enough that they can be considered for other agglutinative languages like Finnish, Korean etc. In an agglutinative language, several words can be created from a single root word using a rich collection of morphological rules. So, a virtually infinite size lexicon is required to cover the language if words are used as the basic units. T...
Citation Formats
IEEE
E. Akdemir and T. Çiloğlu, “Bimodal automatic speech segmentation based on audio and visual information fusion,” SPEECH COMMUNICATION, pp. 889–902, 2011, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/35997.