Human presence detection in emergency situations using deep learning based audio-visual systems

Geneci, İzlen
Emergency event detection in surveillance systems has drawn growing attention from researchers in recent years. Existing methods mostly rely on visual data to identify abnormal events, since visual sensors are often the only sensors installed in public settings. Sound, however, can also be exploited in an emergency: when the line of sight is occluded, sound waves can still reach the sensor to some extent. Conversely, visual analysis can help when the audio is noisy and the scene is crowded. With the recent rapid growth of deep learning, the shift from single-modality to multimodal learning has therefore become crucial. In this work, audio analysis and visual analysis were performed separately. In the audio-based analysis, audio was split into samples using a sliding window technique in order to capture the brief window in which a target audio class occurs. As a result, a system operating in real time can recognize an emergency even when the target sound occurs only briefly. For the human sound classes "Speech", "Scream", and "Cry", the minimum sliding window sizes were 0.25 s, 1 s, and 0.30 s, respectively. In the visual analysis, face detection was performed together with facial alignment using five facial landmarks. The average precision (AP) for face detection was 77% on the WIDER Face dataset (IoU = 0.5). Using the detected faces, facial expression recognition (FER) as well as age and gender estimation were performed with an attention-based method. For the seven basic emotions, an accuracy of 64.14% was achieved on the AffectNet dataset. Combining these audio-based and visual-based systems addresses the limitations that each modality faces in perceptual tasks.
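
The sliding window step in the audio-based analysis can be illustrated with a minimal sketch. Only the three minimum window sizes come from the abstract; the function name `sliding_windows`, the 16 kHz sampling rate, the 50% overlap, and the choice to drop a trailing partial window are assumptions made for illustration and are not taken from the thesis.

```python
from typing import Dict, List, Sequence

# Minimum window sizes in seconds, as reported in the abstract.
MIN_WINDOW_S: Dict[str, float] = {"Speech": 0.25, "Scream": 1.0, "Cry": 0.30}


def sliding_windows(
    samples: Sequence[float],
    sample_rate: int,
    window_s: float,
    overlap: float = 0.5,  # assumed; the abstract does not state the hop size
) -> List[Sequence[float]]:
    """Split a mono signal into fixed-length, overlapping windows."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    window_len = int(round(window_s * sample_rate))
    if window_len <= 0:
        raise ValueError("window is shorter than one sample")
    hop = max(1, int(round(window_len * (1.0 - overlap))))
    # Only full windows are kept; a trailing partial window is dropped.
    return [
        samples[start:start + window_len]
        for start in range(0, len(samples) - window_len + 1, hop)
    ]


if __name__ == "__main__":
    sample_rate = 16_000  # assumed sampling rate
    signal = [0.0] * (2 * sample_rate)  # two seconds of silence as a stand-in
    for label, window_s in MIN_WINDOW_S.items():
        windows = sliding_windows(signal, sample_rate, window_s)
        print(label, len(windows), len(windows[0]))
```

Each window would then be passed to the audio classifier on its own, so a sound that lasts only as long as the smallest window for its class still falls entirely inside at least one sample.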


