1. Introduction
Audio classification, or sound classification, refers to the process of analyzing audio recordings and applying machine learning algorithms to raw audio data in order to categorize the type of audio present. Typically, this process relies on data that have been annotated and assigned to target categories by human listeners.
There is a wide range of applications for audio classification. Extensive research has been conducted in the field of speech recognition, leading to the advancement of speech-to-text systems. Similarly, audio classification technology has been applied to automate music categorization and to power music recommendation engines. The classification of environmental sounds has been proposed for identifying specific species of birds and whales, and the monitoring of environmental sounds in urban environments has been proposed to aid law enforcement by identifying sounds associated with crime (e.g., gunshots) or unauthorized construction (e.g., jackhammers). Pioneering efforts are directed toward developing small, versatile, and efficient deep networks for acoustic recognition on resource-limited edge devices. Audio classification is also a key component of many intelligent Internet of Things (IoT) applications, including predictive maintenance [1,2], surveillance [3,4], and ecosystem monitoring [5,6]. Environmental sound classification (ESC) is a significant research topic in human–computer interaction, with several possible applications, including audio surveillance [7] and smart room monitoring [8]. Designing suitable features for environmental sound classification is a challenging task because acoustic settings are dynamic and unstructured. In many existing ESC approaches, a classifier is trained on such features to determine the category likelihood of each environmental sound, and the features are frequently designed based on prior knowledge of acoustic settings. Intelligent fault diagnosis is one of the effective tools in the field of problem diagnosis [9,10]. Artificial intelligence techniques such as neural networks can replace human diagnosticians by rapidly evaluating the massively monitored signals of machines and automatically identifying mechanical health issues [11,12,13]. Intelligent fault identification is therefore vital in contemporary industry, particularly where vibration signals are abundant. Edge computing is the concept of performing computation at the edge of the network rather than in the cloud; it offers decreased latency, improved data integrity, and a reduced network load. The application of machine learning techniques to edge computing is known as edge AI [14].
Audio classification is widely adopted in numerous applications to enhance the accuracy and efficiency of audio analysis, and it typically relies on data that have been carefully annotated and assigned to predefined classes by expert human listeners [15]. A great deal of research has been completed on speech recognition and the development of speech-to-text systems [16].
Audio classification technology has also found use in the automation of music categorization and the development of music recommendation systems, and the classification of environmental sounds has been proposed for identifying specific species of birds and whales as well as for monitoring urban environments, where detecting sounds associated with crime (e.g., gunshots) or unauthorized construction (e.g., jackhammers) can aid law enforcement [17,18]. Pioneering efforts are directed toward developing small, versatile, and efficient deep networks for acoustic recognition on resource-limited edge devices. Edge devices can perform real-time audio classification, enabling an immediate response to audio events. By performing classification locally, they reduce the latency of the audio classification process and improve system responsiveness, and because sensitive audio data are not transmitted to the cloud, privacy is protected. Edge devices can also help reduce costs: by lowering the amount of data transmitted to the cloud, they reduce the expenses associated with audio classification [19].
Deep learning models have shown tremendous success in audio classification tasks, but several limitations arise when we want to deploy them on edge devices. In general, data collected at the edge of the network from different sensors are sent to the cloud for processing and decision making. Transmitting such massive amounts of data creates latency and raises privacy concerns, which makes it difficult to use edge devices for real-time analytics. If the analysis and recognition occur directly on the edge devices, this latency can be avoided, but doing so requires relying on the limited computational power of the edge devices [20].
Deep learning models require an ample amount of data and extended training time, and they produce large trained models. Thus, it is challenging to run deep learning models such as convolutional neural networks on edge devices that have low processing power, no GPU, and little memory [21,22]. Krizhevsky et al. [23] used 60 million parameters and 650,000 neurons for five convolutional layers and a 1000-way softmax; the ImageNet dataset they trained on consists of 15 million labeled high-resolution images in 22,000 categories. Another popular face-recognition method, DeepFace, trained about 120 million parameters on more than four million facial images [24].
The authors in [25] proposed a large deep convolutional network for audio classification using raw data and then compressed the model for resource-constrained edge devices. The model produced above-state-of-the-art accuracy on ESC-10 (96.65%), ESC-50 (87.10%), UrbanSound8K (84.45%), and AudioEvent (92.57%), and the described compression pipeline achieved a 97.22% size reduction and a 97.28% FLOP reduction. Audio classification on microcontrollers using XNOR-Net for end-to-end raw audio classification has also been explored and compared with pruning-and-quantization methods; XNOR-Net was found to be efficient for small numbers of classes, offering significant memory and computation savings, but its performance drops for larger class sets, where pruning-and-quantization methods are more effective. In [26], a knowledge distillation method enhances on-device audio classification by transferring temporal knowledge from large models to smaller on-device models; it focuses on incorporating the temporal information embedded in the attention weights of large transformer-based models into various on-device architectures, including CNNs and RNNs. In [27], a real-time audio enhancement system is proposed that uses convolutional neural networks for precise audio scene classification, optimizing sound quality with minimal latency; the system efficiently enhances audio frame by frame, overcoming the limitations of traditional scene-rendering methods in audio devices. A sequential self-teaching approach proposed in [28] for sound event recognition is especially effective in challenging scenarios such as weakly labeled or noisy data; the authors proposed a multi-stage learning process that enhances the generalization ability of sound models, demonstrated by up to a 9% performance improvement on the large-scale AudioSet dataset, and the method also shows enhanced transferability of knowledge, boosting generalization in transfer-learning tasks. In [29], LEAN, a lightweight and efficient deep learning model for audio classification on resource-limited devices, is introduced; it combines a trainable wave encoder with a pretrained YAMNet and cross-attention-based realignment, achieving high performance with a memory footprint of only 4.5 MB and improving the mean average precision on the FSD50K dataset by 22%. A similar sequential self-teaching strategy for sound event recognition under weakly labeled or noisy conditions is described in [30].
This study aims to perform model compression and acceleration on deep neural networks without significantly decreasing model performance. The current state of the art for deep learning model compression and acceleration includes pruning and quantization. We analyze deep learning algorithms with different model-compression techniques that can classify audio data with better accuracy on edge devices, using environmental sound datasets such as UrbanSound8K, ESC-50, and AudioSet for the experiments. Audio classification on edge devices has many different uses; the capacity for real-time audio analysis and categorization offers a wide range of opportunities for enhancing functionality, security, and convenience across numerous fields and spheres of life.
This research provides the following contributions:
We compare different DL models for audio classification on raw audio and Mel spectrograms.
We apply different model-compression techniques to the neural network and propose hybrid pruning techniques.
We deploy DL models for audio classification on the Raspberry Pi and NVIDIA Jetson Nano.
The remainder of this paper is structured as follows. Section 2 introduces the proposed method in detail, along with its theoretical and technical components. Section 3 presents the experimental details, and the results are discussed in Section 4. Conclusions and future work are presented in Section 5, and acknowledgements are given in the Funding section.
4. Results
In this study, the extraction of low-level features from raw audio data is a critical step, with a particular focus on the zero crossing rate (ZCR). ZCR, a key measure in the analysis of audio signals, quantifies the frequency at which the audio waveform crosses the zero amplitude axis, thereby providing insights into the frequency content of the signal. This metric is integral to various digital signal processing applications, including speech and music analysis, as well as broader audio classification tasks. The utility of ZCR lies in its ability to effectively differentiate between tonal sounds, which exhibit a lower ZCR, and more noisy or percussive elements, characterized by a higher ZCR.
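To make the ZCR computation concrete, the following is a minimal sketch (not the exact pipeline of this study) of extracting frame-wise ZCR from a raw waveform with librosa; the file name, sample rate, and frame/hop sizes are illustrative assumptions.
```python
# Minimal sketch: frame-wise zero crossing rate (ZCR) extraction.
# File name, sample rate, and frame/hop sizes are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("siren.wav", sr=22050)  # hypothetical input clip

# Library route: one ZCR value per analysis frame.
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)

# Equivalent manual computation for a single frame: the fraction of
# consecutive sample pairs whose signs differ.
def frame_zcr(frame: np.ndarray) -> float:
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

print(zcr.shape, frame_zcr(y[:2048]))
```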
A notable challenge arises when the audio contains significant ‘dead spots’ or segments of minimal amplitude, as these can obscure the distinctive features of the audio, leading to difficulties in classification. To mitigate this issue, the initial step involves the cleansing of audio data by removing dead space, utilizing a technique that involves the application of a signal envelope. The signal envelope, a conceptual curve outlining the extremes of the audio waveform, provides a framework for identifying and excising sections of the audio below a threshold of 20 dB.
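A rough sketch of this envelope-based dead-space removal is shown below, assuming a simple rolling-maximum envelope and a threshold 20 dB below the clip's peak; the window size and file name are placeholders, and librosa's built-in split/trim utilities offer a comparable shortcut.
```python
# Sketch of envelope-based dead-space removal, assuming a rolling-maximum
# envelope and a threshold 20 dB below the clip's peak. Window size and file
# name are placeholders; librosa.effects.split(y, top_db=20) is a comparable
# built-in alternative.
import numpy as np
import librosa

def remove_dead_space(y: np.ndarray, top_db: float = 20.0, win: int = 512) -> np.ndarray:
    # Rolling-maximum envelope of the rectified signal.
    padded = np.pad(np.abs(y), (win // 2, win // 2), mode="edge")
    env = np.array([padded[i:i + win].max() for i in range(len(y))])
    # Keep samples whose envelope is within `top_db` of the peak.
    env_db = 20.0 * np.log10(env / (env.max() + 1e-10) + 1e-10)
    return y[env_db > -top_db]

y, sr = librosa.load("siren.wav", sr=16000)  # hypothetical input clip
y_clean = remove_dead_space(y)
```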
For uniformity and computational efficiency, the audio clips were standardized to a fixed frame size. To facilitate the real-time GPU-based extraction of audio features from Mel spectrograms, the study employed Keras audio preprocessors (Kapre). Kapre’s capabilities extend to the optimization of signal parameters in real time, significantly simplifying and enhancing the reliability of model deployment.
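As an illustration of how Kapre can place Mel-spectrogram extraction inside the model graph, the sketch below builds a small Keras classifier whose first layer converts raw waveforms to log-Mel spectrograms on the GPU; the layer parameters, clip length, and downstream architecture are assumptions rather than the exact configuration used in this study.
```python
# Hedged sketch of on-GPU Mel-spectrogram extraction with Kapre inside a
# Keras model. Layer parameters, clip length, and the downstream architecture
# are assumptions, not the exact configuration used in this study.
import tensorflow as tf
from kapre.composed import get_melspectrogram_layer

SR, CLIP_SAMPLES = 16000, 16000  # assumed 1 s clips at 16 kHz

mel_layer = get_melspectrogram_layer(
    input_shape=(CLIP_SAMPLES, 1),  # raw waveform, channels-last
    sample_rate=SR,
    n_fft=1024,
    hop_length=256,
    n_mels=128,
    return_decibel=True,            # log-Mel output
)

model = tf.keras.Sequential([
    mel_layer,                                        # waveform -> Mel spectrogram on GPU
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 UrbanSound8K classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```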
Table 2 shows the comparison between audio classification using raw audio and Mel spectrograms. The Mel spectrograms achieve the highest accuracy of 95%.
Table 3 also shows that, with the proposed methodology, the experimental results were better than those of many existing models.
Hybrid pruning, which combines magnitude and Taylor pruning, offers superior model optimization by balancing the efficient size reduction of magnitude pruning with the precision of Taylor pruning. As shown in Table 4, hybrid pruning obtained better accuracy than the individual pruning methods. This approach enhances network performance and generalization while maintaining an optimal level of complexity, striking a fine balance between computational efficiency and the retention of crucial network features; a sketch of how such a combined criterion can be computed is given below.
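The snippet below is one possible, hedged interpretation of such a hybrid criterion in TensorFlow/Keras: each convolutional filter receives a score that averages a normalized weight-magnitude term with a normalized first-order Taylor term, and the lowest-scoring filters are zeroed out. The 50/50 weighting, the pruning ratio, and the soft (zeroing) pruning step are illustrative choices, not the exact procedure evaluated in Table 4.
```python
# Hedged interpretation of a hybrid pruning criterion in TensorFlow/Keras:
# each Conv2D filter is scored by averaging a normalized weight-magnitude
# (L1) term with a normalized first-order Taylor term |w * dL/dw|, and the
# lowest-scoring filters are zeroed. The 50/50 weighting and the pruning
# ratio are illustrative assumptions.
import numpy as np
import tensorflow as tf

def hybrid_filter_scores(model, x_batch, y_batch, loss_fn):
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch, training=False))
    convs = [l for l in model.layers if isinstance(l, tf.keras.layers.Conv2D)]
    grads = tape.gradient(loss, [l.kernel for l in convs])
    scores = {}
    for layer, grad in zip(convs, grads):
        w = layer.kernel.numpy()                             # (kh, kw, c_in, c_out)
        mag = np.abs(w).sum(axis=(0, 1, 2))                  # magnitude score per filter
        tay = np.abs(w * grad.numpy()).sum(axis=(0, 1, 2))   # Taylor score per filter
        norm = lambda v: (v - v.min()) / (np.ptp(v) + 1e-10)
        scores[layer.name] = 0.5 * norm(mag) + 0.5 * norm(tay)
    return scores

def prune_filters(model, scores, ratio=0.3):
    # "Soft" structured pruning: zero the kernels of the lowest-scoring filters.
    for layer in model.layers:
        if layer.name in scores:
            kernel = layer.kernel.numpy()
            drop = np.argsort(scores[layer.name])[: int(ratio * kernel.shape[-1])]
            kernel[..., drop] = 0.0
            layer.kernel.assign(kernel)
```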
Although we obtained better accuracy for audio classification using the pruning techniques, the model size and execution time were smaller with the quantization techniques. A comparison of accuracy and model size is shown in Table 5.
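For reference, the following hedged example shows the kind of post-training quantization that produces such smaller, faster models, using the TensorFlow Lite converter; `model` and `calibration_clips` are assumed to be a trained Keras model and a set of preprocessed clips, and the output file name is arbitrary.
```python
# Hedged example of post-training quantization with the TensorFlow Lite
# converter. `model` is assumed to be a trained Keras model and
# `calibration_clips` a collection of preprocessed input clips.
import tensorflow as tf

def representative_audio_samples():
    # Yield a few hundred clips so activation ranges can be calibrated.
    for clip in calibration_clips[:200]:
        yield [tf.convert_to_tensor(clip[tf.newaxis, ...], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_audio_samples

tflite_model = converter.convert()
with open("audio_classifier_int8.tflite", "wb") as f:
    f.write(tflite_model)
```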
Later, the audio classification model was deployed on the Raspberry Pi 4 and NVIDIA Jetson Nano to evaluate its performance. Table 6 shows the accuracy, inference time, and power consumption measured on these devices.
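A hedged sketch of how such on-device measurements could be reproduced with the `tflite_runtime` interpreter on a Raspberry Pi is given below; the model file, input data, and thread count are placeholders, and power consumption would be measured externally (e.g., with a USB power meter) rather than from this script.
```python
# Hedged sketch of reproducing the on-device latency measurements with the
# tflite_runtime interpreter (e.g., on a Raspberry Pi 4). Model file, input
# data, and thread count are placeholders.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="audio_classifier_int8.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # stand-in input frame
timings = []
for _ in range(100):
    interpreter.set_tensor(inp["index"], x)
    start = time.perf_counter()
    interpreter.invoke()
    timings.append(time.perf_counter() - start)

probs = interpreter.get_tensor(out["index"])
print(f"mean inference time: {1000 * np.mean(timings):.2f} ms")
```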