A gaze-centered multimodal approach to face-to-face interaction

Arslan Aydın, Ülkü
Face-to-face conversation implies that interaction is inherently multimodal, involving both verbal and nonverbal signals. Gaze is a nonverbal cue that plays a key role in achieving social goals during the course of conversation. The purpose of this study is twofold: (i) to examine gaze behavior (i.e., gaze aversion and gaze on face) and the relations between gaze and speech in face-to-face interaction, and (ii) to construct computational models that predict gaze behavior from high-level speech features. We employed a job interview setting in which pairs (a professional interviewer and an interviewee) conducted mock job interviews. Twenty-eight pairs of native speakers took part in the experiment. Two pairs of eye-tracking glasses recorded the scene video, the audio, and the eye gaze position of the participants. To achieve the first purpose, we developed an open-source framework named MAGiC (A Multimodal Framework for Analyzing Gaze in Communication) for analyzing multimodal data, including video recordings for face tracking, gaze data from the eye trackers, and audio data for speech segmentation. We annotated speech with two methods: (i) the ISO 24617-2 Standard for Dialogue Act Annotation, and (ii) tags employed by previous studies that examined gaze behavior in a social context. We then trained simplified versions of two CNN architectures (VGGNet and ResNet) using both speech annotation methods.
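
The two gaze categories named above (gaze on face and aversion) can be illustrated with a minimal Python sketch. It is not MAGiC's implementation: the `FaceBox` type, the `label_gaze` and `gaze_proportions` functions, and the rule that a gaze point inside the tracked face rectangle counts as gaze on face are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FaceBox:
    """Hypothetical face bounding box in scene-video pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def label_gaze(x: Optional[float], y: Optional[float],
               face: Optional[FaceBox]) -> str:
    """Label one gaze sample as 'face', 'aversion', or 'missing'."""
    # Assumption: without a gaze point or a detected face,
    # the sample cannot be classified and is marked as missing.
    if x is None or y is None or face is None:
        return "missing"
    inside = face.left <= x <= face.right and face.top <= y <= face.bottom
    return "face" if inside else "aversion"


def gaze_proportions(labels: List[str]) -> Dict[str, float]:
    """Share of valid samples per category, ignoring 'missing' samples."""
    valid = [label for label in labels if label != "missing"]
    if not valid:
        return {"face": 0.0, "aversion": 0.0}
    return {k: valid.count(k) / len(valid) for k in ("face", "aversion")}
```

Applied frame by frame to synchronized gaze and face-tracking output, labels of this kind could then be aligned with annotated speech segments to study how gaze relates to speech.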