
Audio, Image, and Multimodal Sensing Techniques

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (10 May 2024) | Viewed by 18062

Special Issue Editor


Dr. Alicja Wieczorkowska
Guest Editor
Multimedia Department, Polish-Japanese Academy of Information Technology, Warsaw, Poland
Interests: multimedia; audio signal analysis; music information retrieval; knowledge discovery in databases; data mining; artificial intelligence

Special Issue Information

Dear Colleagues,

This Special Issue of Sensors focuses on original research involving the use of various audio and image sensing devices, whether simultaneously or separately. The goal is to collect a diverse set of papers that span a wide range of analyses and possible applications.

Specifically, we are interested in papers that address the use of environment-based sensors, i.e., sensors placed on the ground, in the air, in water, etc., and the development of software utilizing the output of these sensors. Papers focused on the construction of optimized sensors are also welcome.

This Special Issue will cover, but is not limited to, the following areas:

  • digital signal processing
  • audio signal analysis
  • image analysis
  • pattern recognition
  • audio sensors
  • image sensors
  • multimodal and single-mode sensing
  • artificial intelligence
  • sensor applications

Dr. Alicja Wieczorkowska
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • digital signal processing
  • audio signal analysis
  • image analysis
  • pattern recognition
  • audio sensors
  • image sensors
  • multimodal and single-mode sensing
  • artificial intelligence
  • sensor applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (9 papers)


Research


19 pages, 1701 KiB  
Article
Non-Intrusive System for Honeybee Recognition Based on Audio Signals and Maximum Likelihood Classification by Autoencoder
by Urszula Libal and Pawel Biernacki
Sensors 2024, 24(16), 5389; https://doi.org/10.3390/s24165389 - 21 Aug 2024
Viewed by 795
Abstract
Artificial intelligence and the Internet of Things are playing an increasingly important role in monitoring beehives. In this paper, we propose a method for automatic recognition of honeybee type by analyzing the sound generated by worker bees and drone bees during flight close to the beehive entrance. We conducted a broad comparative study to determine the most effective preprocessing of audio signals for the detection problem. We compared the results for several different methods of signal representation in the frequency domain, including mel-frequency cepstral coefficients (MFCCs), gammatone cepstral coefficients (GTCCs), the multiple signal classification method (MUSIC), and parametric estimation of power spectral density (PSD) by the Burg algorithm. The coefficients serve as inputs for an autoencoder neural network that discriminates drone bees from worker bees. The classification is based on the reconstruction error of the signal representations produced by the autoencoder. We propose a novel approach to class separation by the autoencoder neural network with various thresholds between decision areas, including the maximum likelihood threshold for the reconstruction error. By classifying real-life signals, we demonstrated that it is possible to differentiate drone bees from worker bees based solely on audio signals. The attained level of detection accuracy enables the creation of an efficient automatic system for beekeepers. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
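To make the reconstruction-error idea concrete, here is a minimal sketch of one-class classification on time-averaged MFCCs, with scikit-learn's MLPRegressor standing in for the paper's autoencoder; the file paths, network size, and threshold are illustrative assumptions, not the authors' code.

```python
# Hedged sketch: train an autoencoder on one class (worker bees) and flag
# signals whose reconstruction error is high as the other class (drones).
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

def mfcc_vector(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # time-averaged MFCCs as a fixed-length vector

def train_autoencoder(worker_paths):
    # Trained on worker-bee recordings only; out-of-class signals
    # reconstruct poorly.
    X = np.stack([mfcc_vector(p) for p in worker_paths])
    ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000)
    ae.fit(X, X)  # learn to reproduce the input on the training class
    return ae

def is_drone(ae, path, threshold=1.0):  # threshold is a placeholder value
    x = mfcc_vector(path).reshape(1, -1)
    err = float(np.mean((ae.predict(x) - x) ** 2))  # reconstruction error
    return err > threshold  # high error => unlike the worker-bee class
```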

17 pages, 3295 KiB  
Article
A Data Matrix Code Recognition Method Based on L-Shaped Dashed Edge Localization Using Central Prior
by Yi Liu, Yang Song, Guiqiang Gu, Jianan Luo, Taoan Wang and Qiuping Jiang
Sensors 2024, 24(13), 4042; https://doi.org/10.3390/s24134042 - 21 Jun 2024
Viewed by 935
Abstract
The recognition of data matrix (DM) codes plays a crucial role in industrial production, and existing methods have made significant progress. However, for low-quality images with protrusions and interruptions on the L-shaped solid edge (finder pattern) and the dashed edge (timing pattern) of DM codes, as found in industrial production environments, the recognition accuracy of existing methods declines sharply because these interference issues are not taken into account. Ensuring recognition accuracy in their presence is therefore a highly challenging task. Unlike most existing methods, which locate the L-shaped solid edge for DM code recognition, we propose a novel DM code recognition method based on locating the L-shaped dashed edge by incorporating prior information about the center of the DM code. Specifically, we first use a deep learning-based object detection method to obtain the center of the DM code. Next, to enhance the accuracy of L-shaped dashed edge localization, we design a two-level screening strategy that combines general constraints with central constraints; the central constraints fully exploit the prior information about the center of the DM code. Finally, we employ libdmtx to decode the content from the precisely positioned DM code image, which is generated using the L-shaped dashed edge. Experimental results on various types of DM code datasets demonstrate that the proposed method outperforms the compared methods in terms of recognition accuracy and time consumption, and thus holds significant practical value in industrial production environments. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
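The abstract names libdmtx as the final decoding step; the snippet below shows that step in isolation via the pylibdmtx Python bindings, assuming the DM code has already been localized and cropped by the earlier stages of the pipeline (the filename is hypothetical).

```python
# Minimal decoding step with libdmtx via pylibdmtx; localization is assumed
# to have been done already, as in the paper's pipeline.
from PIL import Image
from pylibdmtx.pylibdmtx import decode

img = Image.open("dm_code_crop.png")       # hypothetical cropped DM image
for result in decode(img):
    print(result.data.decode("utf-8"), result.rect)  # payload + location
```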

17 pages, 3213 KiB  
Article
Postfilter for Dual Channel Speech Enhancement Using Coherence and Statistical Model-Based Noise Estimation
by Sein Cheong, Minseung Kim and Jong Won Shin
Sensors 2024, 24(12), 3979; https://doi.org/10.3390/s24123979 - 19 Jun 2024
Cited by 1 | Viewed by 680
Abstract
A multichannel speech enhancement system usually consists of spatial filters, such as adaptive beamformers, followed by postfilters that suppress the remaining noise. Accurate estimation of the power spectral density (PSD) of the residual noise is crucial for successful noise reduction in the postfilters. In this paper, we propose a postfilter built on new a posteriori speech presence probability (SPP) and noise PSD estimators, which are based on both coherence and statistical models. We model the coherence-based a posteriori SPP as a simple function of the magnitude of the coherence between the two microphone signals and combine it with a single-channel SPP based on statistical models. The coherence-based estimator for the PSD of the noise remaining in the beamformer output in the presence of speech is derived from the pseudo-coherence, which accounts for the effect of the beamformers. The final noise PSD estimator is then obtained by combining the coherence-based and statistical model-based noise PSD estimators with the proposed SPP. The spectral gain function is also modified to incorporate the proposed SPP. Experimental results demonstrate that the proposed method yields more accurate noise PSD estimates and higher perceptual evaluation of speech quality scores in various diffuse noise environments, and does not degrade speech quality in the presence of directional interference even though it utilizes coherence information. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
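A quick sketch of the underlying cue: magnitude-squared coherence between two microphone channels, computed with SciPy. The paper's actual SPP model combines a function of this coherence with a statistical single-channel estimator; the mapping below is only a crude illustration.

```python
# Illustrative only: diffuse noise is weakly coherent across spaced mics,
# while a directional talker produces highly coherent energy, so the
# coherence magnitude serves as evidence of speech presence.
import numpy as np
from scipy.signal import coherence

def coherence_magnitude(x_left, x_right, fs, nperseg=512):
    f, msc = coherence(x_left, x_right, fs=fs, nperseg=nperseg)
    return f, np.sqrt(np.clip(msc, 0.0, 1.0))  # |coherence| per frequency bin
```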

20 pages, 3721 KiB  
Article
HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos
by Naga Venkata Sai Raviteja Chappa, Pha Nguyen, Thi Hoang Ngan Le, Page Daniel Dobbs and Khoa Luu
Sensors 2024, 24(11), 3372; https://doi.org/10.3390/s24113372 - 24 May 2024
Viewed by 862
Abstract
Group-activity scene graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich scene-understanding capabilities, we introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving appearance, interaction, position, relationship, and situation attributes. This work also introduces an innovative approach, a Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory, to enhance GASG performance. Flow-attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional "values" and "keys" are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of our proposed flow-attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
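The source/sink framing can be made concrete with a toy NumPy rendering of flow-conserving attention: keys act as sources whose conserved outgoing flow creates competition, queries act as sinks allocated by their incoming flow. This is emphatically not the authors' implementation, and HAtt-Flow's exact normalization may differ.

```python
# Toy flow-conservation attention sketch (assumed formulation, not HAtt-Flow).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def toy_flow_attention(Q, K, V, eps=1e-6):
    phi = lambda x: np.maximum(x, 0.0) + eps       # non-negative feature map
    Qp, Kp = phi(Q), phi(K)
    incoming = Qp @ Kp.sum(axis=0)                 # flow into each sink (query)
    outgoing = Kp @ Qp.sum(axis=0)                 # flow out of each source (key)
    # Competition among sources: weight each key by its conserved outgoing
    # flow so no source trivially dominates the attention.
    context = (Kp * softmax(outgoing)[:, None]).T @ V
    # Allocation among sinks: each query receives in proportion to its
    # conserved incoming flow.
    return (Qp / incoming[:, None]) @ context
```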

18 pages, 2178 KiB  
Article
Performance Optimization in Frequency Estimation of Noisy Signals: Ds-IpDTFT Estimator
by Miaomiao Wei, Yongsheng Zhu, Jun Sun, Xiangyang Lu, Xiaomin Mu and Juncai Xu
Sensors 2023, 23(17), 7461; https://doi.org/10.3390/s23177461 - 28 Aug 2023
Viewed by 1159
Abstract
This research presents a comprehensive study of the dichotomous search iterative parabolic discrete time Fourier transform (Ds-IpDTFT) estimator, a novel approach for fine frequency estimation in noisy exponential signals. The proposed estimator leverages a dichotomous search process before iterative interpolation estimation, which significantly reduces computational complexity while maintaining high estimation accuracy. An in-depth exploration of the relationship between the optimal parameter p and the unknown parameter δ forms the backbone of the methodology. Through extensive simulations and real-world experiments, the Ds-IpDTFT estimator exhibits superior performance relative to other established estimators, demonstrating robustness in noisy conditions and stability across varying frequencies. This efficient and accurate estimation method is a significant contribution to the field of signal processing and offers promising potential for practical applications. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
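For readers unfamiliar with interpolated DTFT estimators, the sketch below shows the simplest relative: a coarse FFT peak refined by parabolic interpolation of the log-magnitude. The actual Ds-IpDTFT adds a dichotomous search and iterative interpolation on top of this idea; the window and test signal here are illustrative.

```python
# Simplified stand-in for fine frequency estimation (not the Ds-IpDTFT itself).
import numpy as np

def estimate_frequency(x, fs):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    k = int(np.argmax(spec))                      # coarse peak bin
    if 0 < k < len(spec) - 1:
        a, b, c = np.log(spec[k - 1 : k + 2])
        delta = 0.5 * (a - c) / (a - 2 * b + c)   # parabolic vertex offset
    else:
        delta = 0.0
    return (k + delta) * fs / len(x)              # refined frequency in Hz

fs = 8000.0
t = np.arange(1024) / fs
print(estimate_frequency(np.sin(2 * np.pi * 440.7 * t), fs))  # ~440.7
```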

35 pages, 10828 KiB  
Article
Audiovisual Tracking of Multiple Speakers in Smart Spaces
by Frank Sanabria-Macias, Marta Marron-Romera and Javier Macias-Guarasa
Sensors 2023, 23(15), 6969; https://doi.org/10.3390/s23156969 - 5 Aug 2023
Viewed by 1464
Abstract
This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to 50.3% average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to 69.7% average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to 18.1% average relative improvement in the MOT task for the CAV3D dataset (3D comparison). Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
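To anchor the probabilistic framework, here is a generic predict/update/resample cycle of a particle filter. GAVT's actual models (the face-detector-derived visual likelihood and the probabilistic steered response power) are far richer; the Gaussian motion model and function names below are assumptions for illustration.

```python
# One step of a generic particle filter (sketch, not GAVT's tracker).
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, likelihood, motion_std=0.05):
    # Predict: propagate particles with a simple Gaussian motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: reweight by the observation likelihood (audio/visual in GAVT).
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```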

13 pages, 3555 KiB  
Article
High-Level CNN and Machine Learning Methods for Speaker Recognition
by Giovanni Costantini, Valerio Cesarini and Emanuele Brenna
Sensors 2023, 23(7), 3461; https://doi.org/10.3390/s23073461 - 25 Mar 2023
Cited by 12 | Viewed by 3684
Abstract
Speaker Recognition (SR) is a common task in AI-based sound analysis, involving structurally different methodologies such as Deep Learning and "traditional" Machine Learning (ML). In this paper, we compared and explored the two methodologies on the DEMoS dataset, consisting of 8869 audio files of 58 speakers in different emotional states. A custom CNN is compared to several pre-trained nets using image inputs of spectrograms and cepstral-temporal (MFCC) graphs. An ML approach based on acoustic feature extraction, selection, and multi-class classification by means of a Naïve Bayes model is also considered. Results show how a custom, less deep CNN trained on grayscale spectrogram images obtains the most accurate results: 90.15% on grayscale spectrograms and 83.17% on colored MFCC. AlexNet provides comparable results, reaching 89.28% on spectrograms and 83.43% on MFCC. The Naïve Bayes classifier provides 87.09% accuracy and a 0.985 average AUC while being faster to train and more interpretable. Feature selection shows that F0, MFCC, and voicing-related features are the most characterizing for this SR task. The large number of training samples and the emotional content of the DEMoS dataset better reflect a real-case scenario for speaker recognition and account for the generalization power of the models. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
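A hedged sketch of the "traditional ML" branch described above: time-averaged MFCC plus an F0 estimate fed to a Gaussian Naïve Bayes classifier. The feature recipe, sample rate, and path/label handling are illustrative assumptions, not the paper's exact pipeline.

```python
# Acoustic features + Naive Bayes speaker classification (illustrative).
import numpy as np
import librosa
from sklearn.naive_bayes import GaussianNB

def speaker_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)   # fundamental frequency
    return np.concatenate([mfcc, [np.nanmean(f0)]])

def train_speaker_model(paths, labels):
    X = np.stack([speaker_features(p) for p in paths])
    return GaussianNB().fit(X, labels)              # one class per speaker
```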

17 pages, 5040 KiB  
Article
An Efficient Pest Detection Framework with a Medium-Scale Benchmark to Increase the Agricultural Productivity
by Suliman Aladhadh, Shabana Habib, Muhammad Islam, Mohammed Aloraini, Mohammed Aladhadh and Hazim Saleh Al-Rawashdeh
Sensors 2022, 22(24), 9749; https://doi.org/10.3390/s22249749 - 12 Dec 2022
Cited by 10 | Viewed by 3053
Abstract
Insect pests and crop diseases are considered major problems for agricultural production, as the severity and extent of their occurrence cause significant crop losses. To increase agricultural production, it is important to protect crops from harmful pests, which is possible via soft computing techniques. Soft computing techniques are based on traditional machine learning and deep learning approaches. However, in the traditional methods, manual feature extraction is ineffective, inefficient, and time-consuming, while deep learning techniques are computationally expensive and require a large amount of training data. In this paper, we propose an efficient pest detection method that accurately localizes pests and classifies them according to their class labels. In the proposed work, we modify the YOLOv5s model in several ways, such as extending the cross stage partial network (CSP) module, improving the select kernel (SK) attention module, and modifying the multiscale feature extraction mechanism, which plays a significant role in the detection and classification of pests of small and large sizes in an image. To validate model performance, we develop a medium-scale pest detection dataset that includes the five pests most harmful to agricultural products: ants, grasshoppers, palm weevils, shield bugs, and wasps. To check the model's effectiveness, we compare the results of the proposed model with several variants of the YOLOv5 model; the proposed model achieved the best results in the experiments. Thus, the proposed model has the potential to be applied in real-world applications and to further motivate research on pest detection to increase agricultural production. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
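For context, the unmodified YOLOv5s baseline that the paper extends can be run in a few lines through the public Ultralytics torch.hub entry point; the image filename is hypothetical, and this shows only the stock starting point, not the paper's modified architecture.

```python
# Stock YOLOv5s inference via torch.hub (baseline only, not the paper's model).
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("field_photo.jpg")  # detections with boxes and confidences
results.print()                     # summary of detected classes per image
```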

Review


26 pages, 493 KiB  
Review
A Survey on Low-Latency DNN-Based Speech Enhancement
by Szymon Drgas
Sensors 2023, 23(3), 1380; https://doi.org/10.3390/s23031380 - 26 Jan 2023
Cited by 6 | Viewed by 4078
Abstract
This paper presents recent advances in low-latency, single-channel, deep neural network-based speech enhancement systems. The sources of latency and their acceptable values in different applications are described. This is followed by an analysis of the constraints imposed on neural network architectures. Specifically, the causal units used in deep neural networks are presented and discussed in the context of their properties, such as the number of parameters, the receptive field, and computational complexity. This is followed by a discussion of techniques used to reduce the computational complexity and memory requirements of the neural networks used in this task. Finally, the techniques used by the winners of the latest speech enhancement challenges (DNS, Clarity) are shown and compared. Full article
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
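A minimal example of the causal units the survey analyzes: a 1-D convolution padded on the left only, so the output at time t never depends on future samples. Channel sizes and kernel length below are arbitrary choices for illustration.

```python
# Causal 1-D convolution for low-latency processing (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # pad the past only
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))    # no lookahead

# The receptive field of a stack grows as 1 + sum((k - 1) * d) over layers.
layer = CausalConv1d(1, 8, kernel_size=3, dilation=2)
print(layer(torch.randn(1, 1, 100)).shape)           # torch.Size([1, 8, 100])
```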
