Speech emotion recognition using auditory models

Yüncü, Enes
With the advent of computational technology, human-computer interaction (HCI) has gone beyond simple logical calculations. Affective computing aims to improve human-computer interaction at the level of mental states, allowing computers to adapt their responses to human needs. To this end, it seeks to recognize emotions by capturing cues from visual, auditory, tactile and other biometric signals recorded from humans. Emotions play a crucial role in modulating how humans experience and interact with the outside world and have a strong effect on human decision making. They are an essential part of human social relations and play a role in important life decisions. Detecting emotions is therefore crucial in high-level interactions. Each emotion has distinctive properties that allow us to recognize it. The acoustic signal generated for the same utterance or sentence changes primarily due to biophysical changes (such as stress-induced constriction of the larynx) triggered by emotions. This relation between acoustic cues and emotions has made speech emotion recognition one of the trending topics in the affective computing domain. The main purpose of a speech emotion recognition algorithm is to detect the emotional state of a speaker from recorded speech signals. The human auditory system is a non-linear and adaptive mechanism that involves frequency-dependent filtering as well as temporal and simultaneous masking. While emotion can be manifested in acoustic signals recorded using a high-quality microphone and extracted using high-resolution signal processing techniques, a human listener has access only to the cues made available by the auditory system. This limited access to emotion cues also reduces subjective emotion recognition accuracy. In this thesis, a speech emotion recognition algorithm based on a model of the human auditory system is developed and its accuracy is evaluated.
A state-of-the-art human auditory filter bank model is used to process clean speech signals. Simple features are then extracted from the output signals and used to train binary classifiers for seven emotion classes (anger, fear, happiness, sadness, disgust, boredom and neutral). The classifiers are then tested on a validation set to assess recognition performance. Three emotional speech databases, in German, English and Polish, are used to test the proposed method, and recognition rates as high as 82% are achieved. A subjective experiment using the German emotional speech database, carried out with non-German-speaking subjects, indicates that the performance of the proposed system is comparable to human emotion recognition.
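The front end described above (auditory filter bank, then simple per-band features) can be illustrated with a minimal sketch. The abstract does not say which filter bank model or which features the thesis uses, so everything here is an assumption made for illustration: a 4th-order gammatone filter bank with centre frequencies spaced on the ERB-rate scale (Glasberg and Moore), log band energy as the "simple feature", and hypothetical function names and parameter values. The classification stage is left out.

```python
import math

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def centre_frequencies(f_lo, f_hi, n_bands):
    """Centre frequencies spaced uniformly on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi)
    step = (hi - lo) / (n_bands - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_bands)]

def gammatone_ir(fc_hz, fs, duration=0.05, order=4):
    """Sampled gammatone impulse response, truncated and scaled to unit energy."""
    b = 1.019 * erb_bandwidth(fc_hz)  # bandwidth parameter of the gammatone
    ir = []
    for k in range(int(duration * fs)):
        t = k / fs
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        ir.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    norm = math.sqrt(sum(v * v for v in ir))
    return [v / norm for v in ir]

def fir_filter(x, h):
    """Direct-form FIR filtering; output has the same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * x[n - k]
        y.append(acc)
    return y

def band_log_energies(signal, fs, n_bands=8, f_lo=100.0, f_hi=3000.0):
    """One log-energy feature per auditory band (assumed feature choice)."""
    feats = []
    for fc in centre_frequencies(f_lo, f_hi, n_bands):
        y = fir_filter(signal, gammatone_ir(fc, fs))
        energy = sum(v * v for v in y) / len(y)
        feats.append(math.log10(energy + 1e-12))  # small floor avoids log(0)
    return feats
```

Calling `band_log_energies(samples, fs)` on a list of samples returns one value per band. Feature vectors of this kind could then be fed to one binary classifier per emotion class, as the abstract describes. The pure-Python filtering is slow and is meant only to keep the sketch dependency-free.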


Automatic Speech Emotion Recognition using Auditory Models with Binary Decision Tree and SVM
Yuncu, Enes; Hacıhabiboğlu, Hüseyin; Bozsahin, Cem (2014-08-28)
Affective computing is a term for the design and development of algorithms that enable computers to recognize the emotions of their users and respond in a natural way. Speech, along with facial gestures, is one of the primary modalities with which humans express their emotions. While emotional cues in speech are available to an interlocutor in a dyadic conversation setting, their subjective recognition is far from accurate. This is due to the human auditory system which is primarily non-linear and adaptive....
A mathematical contribution of statistical learning and continuous optimization using infinite and semi-infinite programming to computational statistics
Özöğür-Akyüz, Süreyya; Weber, Gerhard Wilhelm; Department of Scientific Computing (2009)
A subfield of artificial intelligence, machine learning (ML), is concerned with the development of algorithms that allow computers to “learn”. ML is the process of training a system with a large number of examples, extracting rules and finding patterns in order to make predictions on new data points (examples). The most common machine learning schemes are supervised, semi-supervised, unsupervised and reinforcement learning. These schemes apply to natural language processing, search engines, medical diagnosis,...
Ozyer, Baris; Erkmen, İsmet; Erkmen, Aydan Müşerref (2010-07-24)
We propose a new fluidics based methodology to determine a continuum between preshaping and grasping so as to appropriately preshape a multifingered robot hand for creating an optimal initialization of grasp, with minimum energy loss towards task execution, upon landing on an object. In this paper, we investigate the effects of impact forces and momentum transfer between different hand preshapes landing on an object. Momentum transfer parameters lead to modification of object orientation and position at the...
Estimation of the user's cognitive load while interacting with the interface based on bayesian network
Saydam, Aysun; Barbaros, Yet; Department of Cognitive Science (2021-9-10)
The complexity of human-machine interfaces is increasing significantly in parallel with the development of technology and excessive data growth, but human cognitive capacity is limited. Therefore, measuring cognitive load is one of the most preferred and common ways to test the usability of user interfaces. There are many different physiological, behavioral and subjective methods to measure human performance and workload. Moreover, there are cognitive predictive models and many related applications based...
Brain Computer Interfaces
Halıcı, Uğur (2015-11-12)
Brain Computer Interface (BCI) systems provide control of external devices by using only brain activity. In recent years, there has been great interest in developing BCI systems for different applications. These systems are capable of solving daily life problems for both healthy and disabled people. One of the most important applications of BCI is to provide communication for disabled people who are totally paralysed. In this paper, different parts of a BCI system and different methods used in each part ...
Citation Formats
E. Yüncü, “Speech emotion recognition using auditory models,” M.S. - Master of Science, Middle East Technical University, 2013.