Audio–Visual Sensor Fusion Strategies for Video Content Analytics II

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Physical Sensors".

Deadline for manuscript submissions: closed (10 January 2023) | Viewed by 2999

Special Issue Editor


Guest Editor
School of Electrical Engineering, Korea University, Intelligent Signal Processing Center, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, South Korea
Interests: computer vision; acoustic signal processing; multi-sensor fusion; deep learning; big data analytics

Special Issue Information

Dear Colleagues,

A two-hour movie, or a short clip taken from it, is intended to capture and present a meaningful (or significant) story in video that a human audience can recognize and understand. What if we replace the human audience with an intelligent machine or robot capable of capturing and processing the semantic information carried by the audio and visual cues in the video? The human brain processes audio (sound, speech) and video (background scene, moving objects, and written characters) modalities to extract spatial and temporal semantic information that is contextually complementary and robust. Smart machines equipped with audiovisual multisensors (e.g., CCTV equipped with cameras and microphones) should be capable of achieving the same task. An appropriate fusion strategy combining the audio and visual information would be key to developing such artificial general intelligence (AGI) systems.

This Special Issue calls for papers on sensor fusion techniques that combine audio–visual information cues for video content analytics. Fusion strategies can operate at various information levels (e.g., feature, decision, and semantic) and can extract meaningful information through an attention mechanism that weights the significance of each cue in representing the intended world. In light of recent advancements in deep learning, this Special Issue will provide an important forum for presenting new fusion strategies that address the relevant research issues in applications requiring artificial general intelligence.

Prof. Dr. Hanseok Ko
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • camera
  • microphone
  • multimodal
  • auditory
  • visual
  • fusion
  • semantic
  • deep-learning
  • artificial general intelligence

Published Papers (1 paper)


Research

16 pages, 5980 KiB  
Article
Sound Event Detection by Pseudo-Labeling in Weakly Labeled Dataset
by Chungho Park, Donghyeon Kim and Hanseok Ko
Sensors 2021, 21(24), 8375; https://doi.org/10.3390/s21248375 - 15 Dec 2021
Cited by 3 | Viewed by 2290
Abstract
Weakly labeled sound event detection (WSED) is an important task as it can facilitate the data collection efforts before constructing a strongly labeled sound event dataset. Recent high-performing deep learning-based WSED methods have used a segmentation mask to detect the target feature map. However, accurate detection has remained limited in real streaming audio for the following reasons. First, the convolutional neural networks (CNN) employed in the segmentation mask extraction process do not appropriately highlight the importance of features, because the features are extracted without pooling operations and, at the same time, a small kernel keeps the receptive field small, making it difficult to learn various patterns. Second, as feature maps are obtained in an end-to-end fashion, the WSED model would be vulnerable to unknown content in the wild. These limitations would lead to undesired feature maps, such as noise in an unseen environment. This paper addresses these issues by constructing a more efficient model that employs a gated linear unit (GLU) and dilated convolution to mitigate the de-emphasized feature importance and the limited receptive field. In addition, this paper proposes pseudo-label-based learning for classifying target content and unknown content by adding a 'noise label' and a 'noise loss' so that unknown content can be separated as much as possible through the noise label. The experiment is performed by mixing DCASE 2018 Task 1 acoustic scene data and Task 2 sound event data. The experimental results show that the proposed SED model achieves the best F1 performance with 59.7% at 0 SNR, 64.5% at 10 SNR, and 65.9% at 20 SNR. These results represent an improvement of 17.7%, 16.9%, and 16.5%, respectively, over the baseline.
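The two building blocks named in the abstract, dilated convolution and the gated linear unit, can be illustrated with a minimal pure-Python sketch. This is not the architecture from the paper: the function names (`receptive_field`, `dilated_conv1d`, `glu`), the kernel size of 3, and the dilation schedules are all assumptions chosen for illustration. It shows why dilation enlarges the receptive field without adding layers or pooling, and how a GLU scales each feature by a learned gate.

```python
import math

def receptive_field(kernel_size, dilations):
    """Receptive field, in input frames, of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def dilated_conv1d(x, weights, dilation):
    """'Valid' 1-D dilated cross-correlation over a list of floats.

    Taps are spaced `dilation` frames apart, so no padding is applied and
    the output is shorter than the input by (len(weights) - 1) * dilation.
    """
    span = (len(weights) - 1) * dilation
    return [
        sum(w * x[t + i * dilation] for i, w in enumerate(weights))
        for t in range(len(x) - span)
    ]

def glu(values, gates):
    """Gated linear unit: value * sigmoid(gate), applied element-wise."""
    return [v / (1.0 + math.exp(-g)) for v, g in zip(values, gates)]

# Three layers with kernel size 3 (illustrative settings, not from the paper).
plain = receptive_field(3, [1, 1, 1])    # no dilation
dilated = receptive_field(3, [1, 2, 4])  # exponentially growing dilation
print(plain, dilated)
```

With the same three layers, the undilated stack covers 7 input frames while the dilated stack covers 15. In a GLU, a gate value near zero passes half of the feature through, and strongly negative gates suppress it, which is one way a network can emphasize or de-emphasize features without pooling.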