
Audio, Image, and Multimodal Sensing Techniques

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (10 May 2024) | Viewed by 18062

Special Issue Editor


Dr. Alicja Wieczorkowska
Guest Editor
Multimedia Department, Polish-Japanese Academy of Information Technology, Warsaw, Poland
Interests: multimedia; audio signal analysis; music information retrieval; knowledge discovery in databases; data mining; artificial intelligence

Special Issue Information

Dear Colleagues,

This Special Issue of Sensors focuses on original research involving the use of various audio and image sensing devices, whether simultaneously or separately. The goal is to collect a diverse set of papers that span a wide range of analyses and possible applications.

Specifically, we are interested in papers that address the use of environment-based sensors, i.e., sensors placed on the ground, in the air, in water, etc., and the development of software utilizing the output of these sensors. Papers focused on the construction of optimized sensors are also welcome.

This Special Issue will cover, but is not limited to, the following areas:

  • digital signal processing
  • audio signal analysis
  • image analysis
  • pattern recognition
  • audio sensors
  • image sensors
  • multimodal and single-mode sensing
  • artificial intelligence
  • sensor applications

Dr. Alicja Wieczorkowska
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • digital signal processing
  • audio signal analysis
  • image analysis
  • pattern recognition
  • audio sensors
  • image sensors
  • multimodal and single-mode sensing
  • artificial intelligence
  • sensor applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (9 papers)


Research


19 pages, 1701 KiB  
Article
Non-Intrusive System for Honeybee Recognition Based on Audio Signals and Maximum Likelihood Classification by Autoencoder
by Urszula Libal and Pawel Biernacki
Sensors 2024, 24(16), 5389; https://doi.org/10.3390/s24165389 - 21 Aug 2024
Viewed by 795
Abstract
Artificial intelligence and the Internet of Things are playing an increasingly important role in monitoring beehives. In this paper, we propose a method for automatic recognition of honeybee type by analyzing the sound generated by worker bees and drone bees during flight close to the beehive entrance. We conducted a broad comparative study to determine the most effective preprocessing of audio signals for the detection problem. We compared the results for several different methods of signal representation in the frequency domain, including mel-frequency cepstral coefficients (MFCCs), gammatone cepstral coefficients (GTCCs), the multiple signal classification method (MUSIC), and parametric estimation of power spectral density (PSD) by the Burg algorithm. The coefficients serve as inputs for an autoencoder neural network that discriminates drone bees from worker bees. The classification is based on the reconstruction error of the signal representations produced by the autoencoder. We propose a novel approach to class separation by the autoencoder neural network with various thresholds between decision areas, including the maximum likelihood threshold for the reconstruction error. By classifying real-life signals, we demonstrated that it is possible to differentiate drone bees from worker bees based solely on audio signals. The attained level of detection accuracy enables the creation of an efficient automatic system for beekeepers. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
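To make the reconstruction-error idea concrete, here is a minimal sketch of one-class classification on time-averaged MFCCs, with scikit-learn's MLPRegressor standing in for the paper's autoencoder; the file paths, network size, and threshold are illustrative assumptions, not the authors' code.

```python
# Hedged sketch: train an autoencoder on one class (worker bees) and flag
# signals whose reconstruction error is high as the other class (drones).
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

def mfcc_vector(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # time-averaged MFCCs as a fixed-length vector

def train_autoencoder(worker_paths):
    # Trained on worker-bee recordings only; out-of-class signals
    # reconstruct poorly.
    X = np.stack([mfcc_vector(p) for p in worker_paths])
    ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000)
    ae.fit(X, X)  # learn to reproduce the input on the training class
    return ae

def is_drone(ae, path, threshold=1.0):  # threshold is a placeholder value
    x = mfcc_vector(path).reshape(1, -1)
    err = float(np.mean((ae.predict(x) - x) ** 2))  # reconstruction error
    return err > threshold  # high error => unlike the worker-bee class
```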

17 pages, 3295 KiB  
Article
A Data Matrix Code Recognition Method Based on L-Shaped Dashed Edge Localization Using Central Prior
by Yi Liu, Yang Song, Guiqiang Gu, Jianan Luo, Taoan Wang and Qiuping Jiang
Sensors 2024, 24(13), 4042; https://doi.org/10.3390/s24134042 - 21 Jun 2024
Viewed by 935
Abstract
The recognition of data matrix (DM) codes plays a crucial role in industrial production, and existing methods have made significant progress. However, for low-quality images with protrusions and interruptions on the L-shaped solid edge (finder pattern) and the dashed edge (timing pattern) of DM codes, as found in industrial production environments, the recognition accuracy of existing methods declines sharply because these interference issues are not taken into account. Ensuring recognition accuracy in their presence is therefore a highly challenging task. Unlike most existing methods, which locate the L-shaped solid edge for DM code recognition, we propose a novel DM code recognition method based on locating the L-shaped dashed edge by incorporating prior information about the center of the DM code. Specifically, we first use a deep learning-based object detection method to obtain the center of the DM code. Next, to enhance the accuracy of L-shaped dashed edge localization, we design a two-level screening strategy that combines general constraints with central constraints; the central constraints fully exploit the prior information about the center of the DM code. Finally, we employ libdmtx to decode the content from the precisely positioned DM code image, which is generated using the L-shaped dashed edge. Experimental results on various types of DM code datasets demonstrate that the proposed method outperforms the compared methods in terms of recognition accuracy and time consumption, and thus holds significant practical value in industrial production environments. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
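The abstract names libdmtx as the final decoding step; the snippet below shows that step in isolation via the pylibdmtx Python bindings, assuming the DM code has already been localized and cropped by the earlier stages of the pipeline (the filename is hypothetical).

```python
# Minimal decoding step with libdmtx via pylibdmtx; localization is assumed
# to have been done already, as in the paper's pipeline.
from PIL import Image
from pylibdmtx.pylibdmtx import decode

img = Image.open("dm_code_crop.png")       # hypothetical cropped DM image
for result in decode(img):
    print(result.data.decode("utf-8"), result.rect)  # payload + location
```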

17 pages, 3213 KiB  
Article
Postfilter for Dual Channel Speech Enhancement Using Coherence and Statistical Model-Based Noise Estimation
by Sein Cheong, Minseung Kim and Jong Won Shin
Sensors 2024, 24(12), 3979; https://doi.org/10.3390/s24123979 - 19 Jun 2024
Cited by 1 | Viewed by 680
Abstract
A multichannel speech enhancement system usually consists of spatial filters, such as adaptive beamformers, followed by postfilters that suppress the remaining noise. Accurate estimation of the power spectral density (PSD) of the residual noise is crucial for successful noise reduction in the postfilters. In this paper, we propose a postfilter built on new a posteriori speech presence probability (SPP) and noise PSD estimators, which are based on both coherence and statistical models. We model the coherence-based a posteriori SPP as a simple function of the magnitude of the coherence between the two microphone signals and combine it with a single-channel SPP based on statistical models. The coherence-based estimator for the PSD of the noise remaining in the beamformer output in the presence of speech is derived from the pseudo-coherence, which accounts for the effect of the beamformers. The final noise PSD estimator is then obtained by combining the coherence-based and statistical model-based noise PSD estimators with the proposed SPP. The spectral gain function is also modified to incorporate the proposed SPP. Experimental results demonstrate that the proposed method yields more accurate noise PSD estimates and higher perceptual evaluation of speech quality scores in various diffuse noise environments, and does not degrade speech quality in the presence of directional interference even though it utilizes coherence information. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
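A quick sketch of the underlying cue: magnitude-squared coherence between two microphone channels, computed with SciPy. The paper's actual SPP model combines a function of this coherence with a statistical single-channel estimator; the mapping below is only a crude illustration.

```python
# Illustrative only: diffuse noise is weakly coherent across spaced mics,
# while a directional talker produces highly coherent energy, so the
# coherence magnitude serves as evidence of speech presence.
import numpy as np
from scipy.signal import coherence

def coherence_magnitude(x_left, x_right, fs, nperseg=512):
    f, msc = coherence(x_left, x_right, fs=fs, nperseg=nperseg)
    return f, np.sqrt(np.clip(msc, 0.0, 1.0))  # |coherence| per frequency bin
```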

20 pages, 3721 KiB  
Article
HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos
by Naga Venkata Sai Raviteja Chappa, Pha Nguyen, Thi Hoang Ngan Le, Page Daniel Dobbs and Khoa Luu
Sensors 2024, 24(11), 3372; https://doi.org/10.3390/s24113372 - 24 May 2024
Viewed by 862
Abstract
Group-activity scene graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich scene-understanding capabilities, we introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving appearance, interaction, position, relationship, and situation attributes. This work also introduces an innovative approach, a Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory, to enhance GASG performance. Flow-attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional "values" and "keys" are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of our proposed flow-attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
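The source/sink framing can be made concrete with a toy NumPy rendering of flow-conserving attention: keys act as sources whose conserved outgoing flow creates competition, queries act as sinks allocated by their incoming flow. This is emphatically not the authors' implementation, and HAtt-Flow's exact normalization may differ.

```python
# Toy flow-conservation attention sketch (assumed formulation, not HAtt-Flow).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def toy_flow_attention(Q, K, V, eps=1e-6):
    phi = lambda x: np.maximum(x, 0.0) + eps       # non-negative feature map
    Qp, Kp = phi(Q), phi(K)
    incoming = Qp @ Kp.sum(axis=0)                 # flow into each sink (query)
    outgoing = Kp @ Qp.sum(axis=0)                 # flow out of each source (key)
    # Competition among sources: weight each key by its conserved outgoing
    # flow so no source trivially dominates the attention.
    context = (Kp * softmax(outgoing)[:, None]).T @ V
    # Allocation among sinks: each query receives in proportion to its
    # conserved incoming flow.
    return (Qp / incoming[:, None]) @ context
```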

18 pages, 2178 KiB  
Article
Performance Optimization in Frequency Estimation of Noisy Signals: Ds-IpDTFT Estimator
by Miaomiao Wei, Yongsheng Zhu, Jun Sun, Xiangyang Lu, Xiaomin Mu and Juncai Xu
Sensors 2023, 23(17), 7461; https://doi.org/10.3390/s23177461 - 28 Aug 2023
Viewed by 1159
Abstract
This research presents a comprehensive study of the dichotomous search iterative parabolic discrete time Fourier transform (Ds-IpDTFT) estimator, a novel approach for fine frequency estimation in noisy exponential signals. The proposed estimator leverages a dichotomous search process before iterative interpolation estimation, which significantly reduces computational complexity while maintaining high estimation accuracy. An in-depth exploration of the relationship between the optimal parameter p and the unknown parameter δ forms the backbone of the methodology. Through extensive simulations and real-world experiments, the Ds-IpDTFT estimator exhibits superior performance relative to other established estimators, demonstrating robustness in noisy conditions and stability across varying frequencies. This efficient and accurate estimation method is a significant contribution to the field of signal processing and offers promising potential for practical applications. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
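For readers unfamiliar with interpolated DTFT estimators, the sketch below shows the simplest relative: a coarse FFT peak refined by parabolic interpolation of the log-magnitude. The actual Ds-IpDTFT adds a dichotomous search and iterative interpolation on top of this idea; the window and test signal here are illustrative.

```python
# Simplified stand-in for fine frequency estimation (not the Ds-IpDTFT itself).
import numpy as np

def estimate_frequency(x, fs):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    k = int(np.argmax(spec))                      # coarse peak bin
    if 0 < k < len(spec) - 1:
        a, b, c = np.log(spec[k - 1 : k + 2])
        delta = 0.5 * (a - c) / (a - 2 * b + c)   # parabolic vertex offset
    else:
        delta = 0.0
    return (k + delta) * fs / len(x)              # refined frequency in Hz

fs = 8000.0
t = np.arange(1024) / fs
print(estimate_frequency(np.sin(2 * np.pi * 440.7 * t), fs))  # ~440.7
```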

35 pages, 10828 KiB  
Article
Audiovisual Tracking of Multiple Speakers in Smart Spaces
by Frank Sanabria-Macias, Marta Marron-Romera and Javier Macias-Guarasa
Sensors 2023, 23(15), 6969; https://doi.org/10.3390/s23156969 - 5 Aug 2023
Viewed by 1464
Abstract
This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to 50.3% average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to 69.7% average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to 18.1% average relative improvement in the MOT task for the CAV3D dataset (3D comparison). Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
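To anchor the probabilistic framework, here is a generic predict/update/resample cycle of a particle filter. GAVT's actual models (the face-detector-derived visual likelihood and the probabilistic steered response power) are far richer; the Gaussian motion model and function names below are assumptions for illustration.

```python
# One step of a generic particle filter (sketch, not GAVT's tracker).
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, likelihood, motion_std=0.05):
    # Predict: propagate particles with a simple Gaussian motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: reweight by the observation likelihood (audio/visual in GAVT).
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```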

13 pages, 3555 KiB  
Article
High-Level CNN and Machine Learning Methods for Speaker Recognition
by Giovanni Costantini, Valerio Cesarini and Emanuele Brenna
Sensors 2023, 23(7), 3461; https://doi.org/10.3390/s23073461 - 25 Mar 2023
Cited by 12 | Viewed by 3684
Abstract
Speaker Recognition (SR) is a common task in AI-based sound analysis, involving structurally different methodologies such as Deep Learning and "traditional" Machine Learning (ML). In this paper, we compared and explored the two methodologies on the DEMoS dataset, consisting of 8869 audio files of 58 speakers in different emotional states. A custom CNN is compared to several pre-trained nets using image inputs of spectrograms and cepstral-temporal (MFCC) graphs. An ML approach based on acoustic feature extraction, selection, and multi-class classification by means of a Naïve Bayes model is also considered. Results show how a custom, less deep CNN trained on grayscale spectrogram images obtains the most accurate results: 90.15% on grayscale spectrograms and 83.17% on colored MFCC. AlexNet provides comparable results, reaching 89.28% on spectrograms and 83.43% on MFCC. The Naïve Bayes classifier provides 87.09% accuracy and a 0.985 average AUC while being faster to train and more interpretable. Feature selection shows that F0, MFCC, and voicing-related features are the most characterizing for this SR task. The large number of training samples and the emotional content of the DEMoS dataset better reflect a real-case scenario for speaker recognition and account for the generalization power of the models. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
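A hedged sketch of the "traditional ML" branch described above: time-averaged MFCC plus an F0 estimate fed to a Gaussian Naïve Bayes classifier. The feature recipe, sample rate, and path/label handling are illustrative assumptions, not the paper's exact pipeline.

```python
# Acoustic features + Naive Bayes speaker classification (illustrative).
import numpy as np
import librosa
from sklearn.naive_bayes import GaussianNB

def speaker_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)   # fundamental frequency
    return np.concatenate([mfcc, [np.nanmean(f0)]])

def train_speaker_model(paths, labels):
    X = np.stack([speaker_features(p) for p in paths])
    return GaussianNB().fit(X, labels)              # one class per speaker
```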

17 pages, 5040 KiB  
Article
An Efficient Pest Detection Framework with a Medium-Scale Benchmark to Increase the Agricultural Productivity
by Suliman Aladhadh, Shabana Habib, Muhammad Islam, Mohammed Aloraini, Mohammed Aladhadh and Hazim Saleh Al-Rawashdeh
Sensors 2022, 22(24), 9749; https://doi.org/10.3390/s22249749 - 12 Dec 2022
Cited by 10 | Viewed by 3053
Abstract
Insect pests and crop diseases are considered major problems for agricultural production, as the severity and extent of their occurrence cause significant crop losses. To increase agricultural production, it is important to protect crops from harmful pests, which is possible via soft computing techniques. Soft computing techniques are based on traditional machine learning and deep learning approaches. However, in the traditional methods, manual feature extraction is ineffective, inefficient, and time-consuming, while deep learning techniques are computationally expensive and require a large amount of training data. In this paper, we propose an efficient pest detection method that accurately localizes pests and classifies them according to their class labels. In the proposed work, we modify the YOLOv5s model in several ways, such as extending the cross stage partial network (CSP) module, improving the select kernel (SK) attention module, and modifying the multiscale feature extraction mechanism, which plays a significant role in the detection and classification of pests of small and large sizes in an image. To validate model performance, we develop a medium-scale pest detection dataset that includes the five pests most harmful to agricultural products: ants, grasshoppers, palm weevils, shield bugs, and wasps. To check the model's effectiveness, we compare the results of the proposed model with several variants of the YOLOv5 model; the proposed model achieved the best results in the experiments. Thus, the proposed model has the potential to be applied in real-world applications and to further motivate research on pest detection to increase agricultural production. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
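For context, the unmodified YOLOv5s baseline that the paper extends can be run in a few lines through the public Ultralytics torch.hub entry point; the image filename is hypothetical, and this shows only the stock starting point, not the paper's modified architecture.

```python
# Stock YOLOv5s inference via torch.hub (baseline only, not the paper's model).
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("field_photo.jpg")  # detections with boxes and confidences
results.print()                     # summary of detected classes per image
```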

Review


26 pages, 493 KiB  
Review
A Survey on Low-Latency DNN-Based Speech Enhancement
by Szymon Drgas
Sensors 2023, 23(3), 1380; https://doi.org/10.3390/s23031380 - 26 Jan 2023
Cited by 6 | Viewed by 4078
Abstract
This paper presents recent advances in low-latency, single-channel, deep neural network-based speech enhancement systems. The sources of latency and their acceptable values in different applications are described. This is followed by an analysis of the constraints imposed on neural network architectures. Specifically, the causal units used in deep neural networks are presented and discussed in the context of their properties, such as the number of parameters, the receptive field, and computational complexity. This is followed by a discussion of techniques used to reduce the computational complexity and memory requirements of the neural networks used in this task. Finally, the techniques used by the winners of the latest speech enhancement challenges (DNS, Clarity) are shown and compared. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
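A minimal example of the causal units the survey analyzes: a 1-D convolution padded on the left only, so the output at time t never depends on future samples. Channel sizes and kernel length below are arbitrary choices for illustration.

```python
# Causal 1-D convolution for low-latency processing (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # pad the past only
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))    # no lookahead

# The receptive field of a stack grows as 1 + sum((k - 1) * d) over layers.
layer = CausalConv1d(1, 8, kernel_size=3, dilation=2)
print(layer(torch.randn(1, 1, 100)).shape)           # torch.Size([1, 8, 100])
```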
