Article

Emergency Vehicle Classification Using Combined Temporal and Spectral Audio Features with Machine Learning Algorithms

by Dontabhaktuni Jayakumar 1, Modugu Krishnaiah 2, Sreedhar Kollem 3, Samineni Peddakrishna 1,*, Nadikatla Chandrasekhar 4 and Maturi Thirupathi 5

1 School of Electronics Engineering, VIT-AP University, Amaravati 522237, Andhra Pradesh, India
2 Department of ECE, Vidya Jyothi Institute of Technology, Hyderabad 500075, Telangana, India
3 Department of ECE, School of Engineering, SR University, Warangal 506371, Telangana, India
4 Department of ECE, GMRIT—GMR Institute of Technology, Rajam 532127, Andhra Pradesh, India
5 Department of ECE, St. Martin’s Engineering College, Secunderabad 500100, Telangana, India
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3873; https://doi.org/10.3390/electronics13193873
Submission received: 30 August 2024 / Revised: 25 September 2024 / Accepted: 29 September 2024 / Published: 30 September 2024
(This article belongs to the Special Issue Advances in AI Engineering: Exploring Machine Learning Applications)

Abstract:
This study presents a novel approach to emergency vehicle classification that leverages a comprehensive set of informative audio features to distinguish between ambulance sirens, fire truck sirens, and traffic noise. A unique contribution lies in combining time domain features, including root mean square (RMS) and zero-crossing rate, which capture temporal characteristics such as signal energy changes, with frequency domain features derived from the short-time Fourier transform (STFT). These include spectral centroid, spectral bandwidth, and spectral roll-off, providing insights into the sound’s frequency content for differentiating siren patterns from traffic noise. Additionally, Mel-frequency cepstral coefficients (MFCCs) are incorporated to capture the human-like auditory perception of the spectral information. This combination captures both temporal and spectral characteristics of the audio signals, enhancing the model’s ability to discriminate between emergency vehicles and traffic noise compared to using features from a single domain. A significant contribution of this study is the integration of data augmentation techniques that replicate real-world conditions, including the Doppler effect and noise environment considerations. This study further investigates the effectiveness of different machine learning algorithms applied to the extracted features, performing a comparative analysis to determine the most effective classifier for this task. This analysis reveals that the support vector machine (SVM) achieves the highest accuracy of 99.5%, followed by random forest (RF) and k-nearest neighbors (KNNs) at 98.5%, while AdaBoost lags at 96.0% and long short-term memory (LSTM) achieves 93%. We also demonstrate the effectiveness of a stacked ensemble classifier that utilizes these base learners, achieving an accuracy of 99.5%. Furthermore, this study conducted leave-one-out cross-validation (LOOCV) to validate the results, with SVM and RF achieving accuracies of 98.5%, followed by KNN and AdaBoost at 97.0% and 90.5%, respectively. These findings indicate the superior performance of advanced ML techniques in emergency vehicle classification.

1. Introduction

The significance of dependable emergency vehicle detection systems is becoming more apparent, especially in densely populated urban areas where the risk of accidents and emergencies is elevated. Traditional methods of detecting emergency vehicles typically rely on visual cues, like flashing lights and sirens. However, these visual signals may not always be effective when visibility is limited, at night, or when drivers are not paying attention. The ability of emergency vehicles such as ambulances, fire trucks, and police cars to quickly navigate through traffic is vital during emergencies. Traffic congestion during road emergencies can delay emergency services, and their speed in reaching destinations is often determined by how effectively they can move through congested roads. Therefore, real-time detection of emergency vehicles using audio sound can enhance their ability to navigate more swiftly, especially at intersections.
Automated systems utilizing audio analysis technology have emerged as a promising solution to address the shortcomings of traditional visual detection methods by providing enhanced public safety. Traditional visual detection methods, such as cameras and LiDAR systems, often face limitations in complex environments, including poor lighting, weather conditions, or occlusions [1,2]. Audio-based systems complement these methods by detecting sounds, such as sirens, vehicle horns, or other emergency signals, offering an additional layer of situational awareness. The study conducted by Li et al. [3] highlights the effective use of audio-based emergency vehicle detection systems in improving public safety. These systems have been shown to decrease response times by detecting emergency vehicles early, allowing for quicker emergency dispatch and coordination. Automated emergency vehicle sound detection systems not only improve public safety but also provide advantages in mitigating noise pollution [4]. These systems accurately locate approaching emergency vehicles, enabling a reduction in unnecessary noise disturbances from prolonged siren use.
Sophisticated advanced algorithms in signal processing and machine learning (ML) are used to analyze the distinct sound patterns of emergency vehicle sirens. This enables precise identification and tracking of approaching emergency vehicles, even in challenging conditions. Pattern recognition algorithms play a crucial role in the detection of emergency vehicle sirens. Pattern recognition is a branch of artificial intelligence (AI) focused on identifying recurring patterns within data. The primary goal of pattern recognition algorithms is to analyze audio data containing emergency vehicle sirens [5], extract key features unique to sirens, and distinguish them from background noise. This process involves several critical steps to ensure accurate detection. The first step in the process is feature extraction, where the raw audio signal is converted into its frequency domain using techniques like Fast Fourier Transform (FFT) [6]. This transformation allows the algorithm to focus on specific frequency bands relevant to emergency vehicle sirens. By analyzing these frequency components, the algorithm can effectively isolate and identify the unique features of siren sounds, such as dominant frequencies associated with the characteristic rise and fall in pitch [7]. However, FFT is highly sensitive to noise, making it challenging to extract accurate frequency components in noisy environments [7]. Techniques like filtering can help, but they might also remove relevant siren information. Moreover, sirens often change over time, and FFT is better suited for stationary signals. As a result, capturing the temporal dynamics of sirens can be problematic [8]. Beyond dominant frequencies, modulation patterns and spectral characteristics extracted from the audio signal also play a significant role in detection [9]. Extracting modulation patterns requires precise analysis of the amplitude and frequency variations over time, which can be computationally intensive [10]. Techniques like Hidden Markov Models (HMMs) can further enhance detection by modeling the temporal evolution of the siren sound, accounting for the rise and fall in pitch and modulation patterns [11]. However, modeling the rise and fall in pitch and other dynamic features with HMMs can be complex and computationally demanding.
Combining FFT, modulation pattern analysis, and spectral characteristics provides a more comprehensive representation of the siren sound. Each technique captures different aspects of the audio signal, contributing to a more holistic understanding [12]. Integrating features from multiple extraction methods can help mitigate the impact of noise, as certain types of noise might affect some features less than others [13]. Once these key features are identified, various algorithms employ classification techniques to categorize the incoming audio data. Different ML models, such as SVMs or CNNs, are then trained on these selected features to classify the sounds for detection [14]. SVMs can effectively handle high-dimensional data, while CNNs are excellent at capturing spatial hierarchies in the data [15,16]. Many existing approaches face performance challenges stemming from variability in audio environments, background noise, and the Doppler effect, all of which can significantly modify the characteristics of siren sounds. Conventional methods often do not generalize effectively across various real-world contexts, resulting in reduced accuracy in both detection and classification.
The major contributions of this research work are discussed below.
  • This study introduces a distinctive combination of temporal and spectral features to improve the classification accuracy of emergency vehicle sounds. By utilizing a dual-domain approach, incorporating features like MFCC, RMS, and spectral centroid, the system captures both the time and frequency characteristics of the sirens. This comprehensive feature representation significantly enhances the classification performance compared to using either domain alone.
  • To simulate real-world environments, this work applies various data augmentation techniques such as time stretching, pitch shifting, and adding white noise as well as Doppler effect feature extraction. These techniques expand the diversity of the training data, allowing the models to generalize better to noisy and unpredictable audio environments.
  • The results demonstrate that models trained with augmented data perform more robustly when exposed to real-world scenarios with background noise or varying distances, addressing the challenges of real-time emergency detection.
  • This research compares the performance of traditional ML models with advanced ensemble methods for classifying emergency vehicle sounds. The findings show that ensemble models provide superior classification accuracy and robustness, particularly when handling high-dimensional audio data.
  • Additionally, the computational efficiency of each model is evaluated in terms of inference time and memory usage, ensuring that the system is suitable for real-time deployment where low latency and fast decision making are critical.

2. Literature Review

Effective emergency vehicle detection has long been a critical component for the functioning of urban areas and the protection of residents. During emergencies, emergency vehicles utilize sirens to notify other drivers to yield the right of way on the road. The effective detection of these sounds facilitates timely notifications to other vehicles and automated systems, thereby enabling actions such as lane clearing, prioritization of traffic signals, and alerts for drivers who may not be aware of an approaching emergency vehicle. Timely and precise identification of emergency vehicles, such as ambulances, fire trucks, and police cars, can lead to faster response times and enhanced safety measures [8,17]. Additionally, the precise classification of emergency vehicle sounds distinguishing between ambulances, fire trucks, and traffic noise is crucial for ensuring appropriate and context-sensitive responses within intelligent transportation systems (ITS). These systems depend on reliable real-time data to make rapid decisions that impact traffic flow and the safety of all road users [18]. Early techniques for identifying emergency vehicle sounds were predominantly based on basic signal processing techniques and applying rule-based algorithms for sound classification. These methods were designed to detect unique sound characteristics associated with sirens.
In the study conducted by Ellis DPW [19], methods for detecting alarm sounds in high noise conditions were analyzed in order to identify consistent and reliable acoustic cues in real-world environments. Two distinct alarm detection schemes were utilized, one adapted from a speech recognizer within a neural network system and the other attempting to separate the alarm sound from background noise using a sinusoidal model system. Unfortunately, these methods were found to have high error rates, reducing their reliability in complex acoustic environments. A separate study was conducted to detect emergency vehicle sirens among hearing-impaired drivers [20,21]. One algorithm utilizes a linear prediction model and the Durbin algorithm to trigger siren detection if certain coefficients remain within a pre-selected tolerance for a specified duration, although it may result in false detections [20]. A study conducted by Beritelli F et al. [21] employed a speech recognition technique to create an automatic emergency signal recognition system. This system was designed to identify specific frequency patterns associated with emergency signals. While effective in controlled settings, its performance was found to be less reliable in noisy urban environments. Additionally, the system relied on a single extracted feature type, Mel-frequency cepstral coefficients (MFCCs). Another study by Liaw JJ et al. introduced a method for recognizing ambulance siren sounds using the Longest Common Subsequence (LCS) algorithm. This approach demonstrated 85% accuracy in identifying ambulance sirens in real sounds. However, its performance may be compromised in environments with significant noise or distorted siren sounds [22]. A significant challenge in the environmental noise context is that it complicates the model’s ability to distinguish between the emergency vehicle siren and the background noise. On the other hand, distortions, specifically from the Doppler effect, can change the frequency of the siren sound as the vehicle approaches or retreats from the observer. This frequency variation can result in traditional frequency-based classifiers misinterpreting the sound, which may lead to inaccurate classifications [23]. Hence, many conventional models, which depend heavily on clean and controlled audio data, struggle to perform effectively under these noisy and unpredictable conditions [24].
Subsequently, with the introduction of ML, more advanced algorithms were implemented for sound classification. Schroder et al. suggested implementing automatic siren detection in traffic noise using part-based models [25]. They utilized standard generative and discriminative training methods to enhance the system’s performance. The system demonstrated a high level of accuracy in identifying sirens within traffic noise and differentiating them from other urban sounds. In a study conducted by Massoud M et al. [26], a CNN model was utilized for simple audio classification based on MFCC feature extraction. The study reported an accuracy rate of 91% in urban area sounds. Another study in an urban scenario presented by Usaid M et al. [27] employed a multi-layered perceptron-based ambulance siren detection. Using MFCC feature extraction, the system showed high accuracy in identifying ambulance sirens. Similarly, Mecocci A and Grassi C [28] introduced a real-time system for detecting ambulances using a pyramidal part-based model that combines MFCCs and YOLOv8. This system demonstrated high accuracy in detecting ambulances in various test scenarios, significantly improving detection performance through the combination of MFCCs for audio processing and YOLOv8 for visual recognition. Nevertheless, the system’s performance could be impacted by high levels of urban noise and overlapping sounds, highlighting the need for further research to ensure robustness across multiple sound scenarios.
Sathruhan S et al. focused on classifying multiple vehicle sounds based on CNN and MFCC feature extraction [4]. The model achieved 93% accuracy. Tran VT and Tsai WH utilized neural networks to detect two emergency vehicle sirens based on their MFCCs. Their system showcased significant accuracy and reliability in distinguishing these sirens from other background noises. They achieved accuracies of 90% and 93.8% on simulated noisy sound and real-time sound, respectively [5].
Further, the most recent advancements, particularly RNNs and LSTM, employed distinct operational modes. Salem O et al. [29] developed an IoT-based system designed to assist drivers with hearing disabilities. The system evaluated five different ML algorithms to identify three different siren sounds. However, each specific phase of learning utilized one specific feature extraction method, progressing from a simple to the most complex architecture. Zohaib M et al. [30] explored a deep learning approach with multimodal fusion to enhance emergency vehicle detection. Their system integrated audio and visual data, leveraging the strengths of both modalities to improve detection accuracy. This approach demonstrated significant improvements in identifying emergency vehicles in complex urban environments. However, the integration of multimodal data required sophisticated synchronization and processing techniques, posing additional challenges.
The proposed method effectively addresses these challenges by integrating time domain and frequency domain features with advanced data augmentation techniques that replicate real-world distortions, including the Doppler effect and environmental noise. By analyzing both temporal and spectral characteristics of audio signals, this approach enhances the model’s capability to distinguish between emergency vehicle sirens and background noise. This methodology not only improves accuracy in controlled settings but also guarantees reliable performance in varied and dynamic urban environments, where traditional methods may be less effective.

3. Resources and Approaches

The proposed methodology of the emergency vehicle siren classification system using ML is shown in Figure 1. The ML classifier models of SVM, RF, KNN, AdaBoost, a stacked ensemble method, and the DL LSTM model are employed to categorize the audio samples accurately. These models predict the class of each audio sample, distinguishing between ambulance sirens, firetruck sirens, and traffic noise. The system’s performance is evaluated using performance metrics, which ensure the model’s accuracy. The first step in audio signal classification is to pre-process the audio data by analyzing the sampling frequency and duration of the audio and implementing data augmentation techniques that incorporate varying sound qualities to replicate real-world conditions. In the next phase, during feature extraction, these audio signals are converted into measurable characteristics, such as spectrograms, capturing essential information within the audio. Then, the data are standardized and split into training and testing sets for practical model training. The implementation of the model, from the detailed description of the dataset to the final classification, is discussed in the following subsections.

3.1. Dataset Description and Augmentation

The raw audio dataset containing 600 audio samples comprises three classes, ambulance, traffic, and fire truck, and is taken from Kaggle [31]. Each class consists of 200 samples. The available samples are segmented into a maximum 3 s duration with a sampling rate that varied between 11,025 Hz and 24,000 Hz and a resolution of 16 bits/sample. This dataset serves as the foundation for the classification task, providing the necessary input for feature extraction and model training. Although the controlled nature of the samples may introduce potential biases, such as a lack of real-world background noise, data augmentation techniques, like adding noise and simulating the Doppler effect, were applied to enhance the dataset’s realism and improve model generalization to real-world environments.
During the initial processing of the data, a visual inspection of the samples was conducted to identify potential patterns or similarities that could inform the feature selection process. The dataset was analyzed for sampling frequency and the duration of each audio sample to ensure uniformity. It was noted that there were deviations in both the sampling frequency and duration between samples. To standardize the data, all audio samples were converted to a uniform sampling rate of 44,100 Hz. Additionally, to achieve consistent sample duration, trimming or padding was performed based on the duration requirements.
In the subsequent step, data augmentation techniques were employed to enhance model robustness and generalization by simulating real-world conditions. This involved introducing varying sound qualities and noise to the original signals. Specifically, the dataset was augmented by adding white noise to the ambulance and fire truck signals. To effectively enhance the training process, noise is added in relation to the signal level. This begins with calculating the power of the input audio sample to determine the average energy of the signal. Subsequently, the noise power is computed based on a specified signal-to-noise ratio (SNR). Equation (1) outlines the method for calculating both the signal power and noise power of the audio signal, where x(i) represents the amplitude of the ith sample and N denotes the total number of samples. An SNR of 15 dB is typically sufficient for the expected noise levels in more realistic environments.
$$\text{Signal Power} = \frac{1}{N}\sum_{i=1}^{N} x(i)^2, \qquad \text{Noise Power} = \frac{\text{Signal Power}}{10^{\,SNR/10}} \tag{1}$$
Next, white noise is generated randomly with the calculated noise power, and it is modified by a dynamic factor that varies between 0.5 and 1.5 over time. This dynamic factor simulates more realistic scenarios, such as the varying intensity of a siren depending on the environment or distance from the source. The dynamically generated noise is then added to the original audio signal. Furthermore, both the signal and the noise are normalized before mixing to ensure consistent noise levels across different samples. The detailed steps of this implementation are illustrated in Figure 2. This process was carefully designed to replicate real-world conditions, where siren sounds can be impacted by background noise, traffic, and other environmental factors. As a result of this augmentation, the total number of audio samples increased from 600 to 1000, thereby creating a more diverse and challenging dataset for model training. This technique significantly enhances the model’s performance, particularly in noisy environments, where it had previously struggled. The improvement in classification accuracy after adding white noise demonstrates the effectiveness of augmentation under real-world conditions.
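The following is a minimal sketch of this augmentation step, assuming NumPy and a sinusoidal drift for the dynamic factor (the exact way the 0.5–1.5 factor evolves over time is not specified); the function name and scaling details are illustrative.

```python
import numpy as np

def augment_with_dynamic_noise(signal, snr_db=15.0, rng=None):
    """Add white noise at a target SNR, scaled by a factor drifting between 0.5 and 1.5."""
    rng = np.random.default_rng() if rng is None else rng
    signal = signal / (np.max(np.abs(signal)) + 1e-12)        # normalize the clean signal
    signal_power = np.mean(signal ** 2)                        # average energy of the signal
    noise_power = signal_power / (10 ** (snr_db / 10))         # Equation (1)
    noise = rng.normal(0.0, np.sqrt(noise_power), len(signal))
    t = np.linspace(0.0, 1.0, len(signal))
    dynamic_factor = 1.0 + 0.5 * np.sin(2 * np.pi * t)         # varies within [0.5, 1.5]
    noisy = signal + dynamic_factor * noise
    return noisy / (np.max(np.abs(noisy)) + 1e-12)             # normalize the mixture
```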

3.2. Feature Extraction

Feature extraction plays a vital role in extracting meaningful insights from audio data. Different aspects of the signal’s characteristics are captured by considering both time and frequency domains. Time domain features reveal the signal’s temporal dynamics and transient properties, while frequency domain features uncover its spectral content and tonal properties. Together, these features provide a comprehensive understanding of the audio signal, enabling practical analysis for tasks like vehicle classification.
In the process of extracting features from both the original and augmented data, a variety of features are evaluated in terms of both the time domain and frequency domain. The time domain features include root mean square (RMS) and zero-crossing rate (ZCR), which provide valuable insights into the signal’s amplitude over time, capturing variations and patterns essential for differentiating between various sound types. These features are effective in distinguishing sirens from background noise, as emergency vehicle sounds exhibit distinctive temporal patterns. In contrast, the frequency domain features encompass spectral centroid, spectral bandwidth (SBW), spectral roll-off, chroma short-time Fourier transform (STFT), Mel-frequency cepstral coefficients (MFCCs), minimum and maximum (Min and Max) frequency, instantaneous frequency (IF), spectral flux, and chirp rate. These frequency domain features facilitate the analysis of frequency content, spectral changes, and frequency modulation in audio signals. MFCCs, in particular, are widely used for their ability to mimic human sound perception, making them effective for differentiating siren types. This categorization of feature extraction methods into time domain and frequency domain is illustrated in Figure 3. By employing this dual approach, the intricacies of emergency vehicle sounds are accurately captured, thereby enhancing classification accuracy. The following subsections discuss the detailed analysis of both the time domain and frequency domain features, emphasizing their crucial role in the identification and classification of emergency vehicle sounds. Although features such as IF, chirp rate, and spectral flux belong to the frequency domain, they are discussed separately.

3.2.1. Time Domain Feature Extraction

In the context of siren classification, time domain feature extraction involves analyzing various aspects of the sound waveform to identify distinctive patterns that differentiate emergency sirens from other types of sounds, such as background noise or traffic. In the time domain, features are derived directly from essential aspects of the signal’s amplitude and time-based characteristics, and these features capture temporal variations and patterns in the signal.
RMS: provides valuable insight into the energy and intensity of an audio signal. RMS measures the average power of an audio signal. It quantifies the overall energy of the signal and is computed by taking the square root of the mean of the squared values of the signal within a specific time frame. It represents the energy of the audio signal and provides information about its overall loudness. In the context of emergency vehicle sirens, RMS can capture variations in the intensity of the siren sound, which may vary depending on the distance between the vehicle and the observer. By measuring the RMS value across different frames of the audio signal, variations in intensity can be detected, which may provide critical insights for distinguishing between different types of sirens and assessing the proximity of the emergency vehicle. The RMS value of the ith frame in an audio signal is calculated using Equation (2).
$$RMS_i = \sqrt{\frac{1}{N}\sum_{t=0}^{N-1} \big[x_{w_i}(t)\big]^2} \tag{2}$$
Here, RMSi represents the RMS value for the ith frame of the audio signal. The subscript i indicates that the RMS value is calculated for a specific frame i within a sequence of overlapping frames into which the audio signal has been divided. The term xwi(t) denotes the windowed signal for the ith frame, where xi(t) is the original signal of the ith frame and w(t) is the window function applied to the frame. The parameter N represents the number of samples in each frame, and t is the index of the sample within the frame, ranging from 0 to N−1. Figure 4 shows the amplitude variation and the corresponding RMS envelope of the input audio sample.
ZCR: The ZCR measures the rate at which the audio signal changes its sign (from positive to negative or vice versa) and provides information about its temporal characteristics. In the context of emergency vehicle sirens, a ZCR can capture the temporal variations in the siren sound, such as its modulation or pulsation. The ZCR for a given signal can be mathematically calculated using Equation (3) [26].
$$ZCR = \frac{1}{2}\sum_{n=2}^{N} \big|\,\mathrm{sign}(x_n) - \mathrm{sign}(x_{n-1})\,\big| \tag{3}$$
where xn is the amplitude value of the audio signal, N is the total number of samples, and sign(x) is the sign function that returns [−1, 0, 1] depending on the x sign. The visual representation of the variations in the amplitude of the audio signal over time and how the ZCR varies over time in response to the amplitude changes in the siren signal is shown in Figure 5.
To facilitate dynamic extraction of the RMS and ZCR features, various siren signals are employed from the fine-tuned and augmented dataset. The extraction procedure is shown in Figure 6. The computation of the RMS value begins with the proper normalization of the audio signal, which is then divided into frames of 1024 samples with 50% overlap. This process is consistently applied across all frames throughout the entire audio signal, resulting in a time series of RMS values that reflect the variations in signal energy over time. For each frame, the square root of the mean of the squared amplitude values of the signal is calculated to determine the RMS value for that specific frame. In a similar manner, the extraction of the ZCR involves dividing the audio signal into the same overlapping frames used for the RMS calculation. After computing the ZCR for each frame, the mean ZCR across all frames provides a comprehensive measure of the signal’s noisiness over time. While calculating RMS and ZCR, the number of frames and the overlap are chosen to provide a balance between temporal resolution and frequency resolution. A detailed flowchart illustrating this methodology is presented in Figure 6. For a given 3 s audio signal, this process produces about 257 frames, which is a manageable quantity for real-time processing while mitigating the computational overhead associated with increased overlap or larger windows.
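A minimal sketch of this frame-based extraction, assuming the librosa library (used later in this paper for feature computation) and a hypothetical file path, is shown below.

```python
import librosa
import numpy as np

# Hypothetical file path; frame settings follow the text (1024 samples, 50% overlap).
y, sr = librosa.load("siren_sample.wav", sr=44100, duration=3.0)
y = y / (np.max(np.abs(y)) + 1e-12)                            # normalize before framing

frame_length, hop_length = 1024, 512
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop_length)[0]

print(f"frames: {len(rms)}, mean RMS: {rms.mean():.4f}, mean ZCR: {zcr.mean():.4f}")
```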

3.2.2. Frequency Domain Feature Extraction

Frequency domain methods for feature extraction consider the spectral content and dynamics of the audio signals, representing a crucial aspect of this research. Features derived from the frequency domain are useful in examining the variability of frequencies over time. More specifically, these features enhance our understanding of tonal quality, frequency-modulated sounds, and the dynamics of sound that all differ across varying kinds of emergency vehicle sirens. Furthermore, frequency domain features provide a detailed perspective on these patterns, facilitating further analysis and categorization. This subsection discusses the importance of each frequency domain feature with an emphasis on how these features reveal the spectral characteristics of the audio signal over time.
Spectral centroid: The spectral centroid of an audio signal involves determining the weighted average of its frequency spectrum. It pertains to the perceived brightness of sound. A lower centroid value typically indicates the presence of lower-frequency components, resulting in a mellower sound. Conversely, a higher spectral centroid suggests a brighter sound characterized by high-frequency components. To calculate the spectral centroid of an audio signal, the process begins with transforming the analog signal x(t) into its digital form x[n]. Then, this signal x[n] is divided into 50% overlapping frames of 1024 samples. To taper the edges of each frame, the Hann window function is applied. For each frame, the STFT is computed using Equation (4) to obtain the frequency domain representation Xm[k].
$$X_m[k] = \sum_{n=0}^{N-1} x_m[n]\, e^{-j2\pi kn/N}, \qquad 0 \le k < N \tag{4}$$
Here Xm[k] represents the complex-valued Fourier coefficients for frame m, and k is the frequency bin index. Next, to obtain the power spectrum Pm[k], the magnitude of Xm[k] is squared, which represents the energy distribution across different frequency bins within the frame. Then, the spectral centroid for each frame m is calculated using Equation (5).
$$\text{Spectral Centroid }(C_m) = \frac{\sum_{k=0}^{N-1} f_k\, P_m[k]}{\sum_{k=0}^{N-1} P_m[k]} \tag{5}$$
Here, $f_k = \frac{k f_s}{N}$, where fs represents the signal’s sampling rate and N is the FFT size. Figure 7 shows a visual representation of the variations in the amplitude of the audio signal over time and the spectral centroid of the audio signal.
SBW: The SBW measures the spread or width of the frequency spectrum around its centroid and provides information about the range of frequencies present in the audio signal. The SBW is defined as the square root of the variance of the spectrum around the spectral centroid. Mathematically, it is calculated using Equation (6).
$$SBW\,(B_m) = \sqrt{\frac{\sum_{k=0}^{N-1} (f_k - C_m)^2\, P_m[k]}{\sum_{k=0}^{N-1} P_m[k]}} \tag{6}$$
The interpretation of the spectral bandwidth (SBW) of a wave is illustrated in Figure 8, which corresponds to the same sample presented in Figure 7. Analyzing both figures reveals that the bandwidth is narrow, indicating that most of the signal energy is concentrated near the spectral centroid. Consequently, this results in the sound being perceived as pure with fewer harmonic components.
Spectral roll-off: Spectral roll-off is a measure of the brightness of a sound and is used to distinguish between harmonic and non-harmonic content in audio signals. Spectral roll-off is calculated at a certain percentage of the total spectral energy. To determine the roll-off frequency based on the spectral energy of each audio sample, the spectral energy distribution of the signal is analyzed and an appropriate roll-off percentage is chosen. For this, first compute the STFT of the audio signal using Equation (4), which gives a time frequency representation of the signal. After performing the STFT, calculate the power spectral density Pm[k] for each frequency bin k at each frame, where Pm[k] is the squared magnitude of the STFT coefficient Xm[k], which represents the energy distribution across different frequency bins within the frame. Once the spectral energy for each frequency bin is calculated, the cumulative energy Ck is computed by summing the energy from the lowest frequency bin (0) up to a certain frequency index k using Equation (7).
$$\text{Cumulative Energy } C_k = \sum_{j=0}^{k} P_m[j] \tag{7}$$
where $P_m[j]$ is the spectral energy at index j. The roll-off frequency $f_{\text{roll-off}}$ is the frequency at which the cumulative energy reaches 85% of the total spectral energy.
$$f_{\text{roll-off}} = f_k \quad \text{such that} \quad C_k \ge 0.85 \sum_{j=0}^{N-1} P_m[j] \tag{8}$$
where $\sum_{j=0}^{N-1} P_m[j]$ is the total spectral energy across all frequency bins, from the lowest frequency (index 0) to the highest frequency. In other words, the frequency fk (corresponding to bin k) is the spectral roll-off frequency if it satisfies the condition that the cumulative energy at that frequency equals or exceeds 85% of the total spectral energy.
A lower spectral roll-off frequency indicates that a significant portion of the signal’s energy is concentrated in the lower frequencies, often resulting in a smoother and more mellow sound. Conversely, a higher roll-off frequency suggests that a greater amount of the signal’s energy is distributed across the higher frequencies, which is typically associated with the presence of noise or other high-frequency components. Figure 9 shows spectral roll-off variation for the input audio signal, which shows that the signal’s energy is concentrated in the lower frequencies.
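The spectral centroid, SBW, and spectral roll-off described above can be computed per frame with librosa, as sketched below; the y and sr variables are assumed from the earlier loading example, and the 85% threshold matches Equation (8).

```python
import librosa

# Assumes y and sr from the earlier loading step; 1024-sample frames with 50% overlap.
n_fft, hop_length = 1024, 512
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0]
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0]
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, roll_percent=0.85)[0]
```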
Chroma STFT: Chroma features represent the energy distribution of musical notes across different pitch classes and provide information about the harmonic content of the audio signal. While initially designed for music analysis, chroma features can still offer insights into the tonal characteristics of emergency vehicle sirens, which may have distinct harmonic structures, as shown in Figure 10. The chroma waveform displays the amplitude variations of the signal over a three-second interval, emphasizing the dynamic changes in loudness. In contrast, the chroma feature plot below analyzes the harmonic content of the audio by mapping the intensity of pitch classes (C to B) over the same period. This chromagram illustrates the presence of pitches and how their strengths fluctuate over time.
MFCCs: MFCCs are commonly used for speech and audio signal processing and capture the spectral envelope of the audio signal. In the case of emergency vehicle sirens, MFCCs can provide information about the spectral shape and timbral characteristics of the siren sound, which may help distinguish between different types of sirens. The MFCC is a standard feature extraction method for sound recognition because of its high efficiency.
The Mel scale compresses the higher frequencies and provides finer resolution at lower frequencies, where humans are more sensitive. Equation (9) shows the procedure of converting the linear frequency f to the Mel frequency m(f).
$$m(f) = 2595\, \log_{10}\!\left(1 + \frac{f}{700}\right) \tag{9}$$
To calculate the MFCC, first, the power spectrum is computed from the windowed frame, as described in Equation (4). This power spectrum illustrates the distribution of energy within the signal across various frequencies. Subsequently, the power spectrum is passed through a Mel filter bank, which is evenly spaced on the Mel scale. Each filter bank output corresponds to a single Mel frequency, and the energy in each filter bank is calculated using Equation (10).
$$E_m = \sum_{k=f_{min}}^{f_{max}} \big|X[k]\big|^2\, H_m[k] \tag{10}$$
where Hm[k] is the mth Mel filter.
Following this, to simulate the logarithmic perception of loudness by the human ear, the logarithm of the filter bank energies is computed using Equation (11). Finally, the discrete cosine transform (DCT) is applied to the logarithm of the Mel filter bank energies, as described in Equation (12), to derive the MFCCs.
$$\mathrm{LogE}[m] = \log(E_m) \tag{11}$$
$$C_n = \sum_{m=0}^{M-1} \mathrm{LogE}[m]\, \cos\!\left(\frac{\pi n (2m+1)}{2M}\right) \tag{12}$$
where Cn are the MFCCs and M is the number of Mel filters.
Figure 11 illustrates a visual representation of the sample audio alongside the computed MFCC extraction. In this process, only the first 13 coefficients are retained because they encapsulate the most significant characteristics of the spectral envelope. These coefficients effectively summarize the shape of the energy spectrum in a concise manner.
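A minimal sketch of the MFCC extraction with librosa, retaining only the first 13 coefficients as described above; averaging the coefficients over frames to obtain one descriptor per clip is an assumption about how the frame-level values are summarized.

```python
import librosa
import numpy as np

# Assumes y and sr from the earlier loading step.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=512)
mfcc_mean = np.mean(mfcc, axis=1)      # one 13-dimensional descriptor per audio clip
```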
Min and Max frequency: These features represent the Min and Max frequencies present in the audio signal, respectively. In the context of emergency vehicle sirens, these frequencies can provide information about the frequency range of the siren sound, which may indicate its spectral characteristics and distinguishability from background noise. To calculate the Min and Max frequencies, the STFT is computed by decomposing the signal into overlapping frames, and the frequencies corresponding to each bin in the magnitude spectrum are obtained. Subsequently, this amplitude spectrum is converted to a decibel (dB) scale using Equation (13) to compress the range of values and to identify only significant frequencies by setting a limit. Values below a certain limit, in this case −40 dB, are considered insignificant. Following this, the significant frequencies above the threshold are identified, and from those significant frequencies, the Min and Max frequencies are extracted.
$$S_{dB} = 10\, \log_{10}\!\left(\frac{D^2}{ref^2}\right) \tag{13}$$
where D is the amplitude of the spectrum and ref is the maximum amplitude of the particular frame. The extracted maximum and minimum frequencies of a particular audio signal are shown in Figure 12.
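A sketch of the Min and Max frequency extraction under these assumptions (per-frame reference amplitude, a −40 dB significance threshold, and the y and sr variables from earlier); the variable names are illustrative.

```python
import librosa
import numpy as np

D = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))       # magnitude spectrogram
ref = D.max(axis=0, keepdims=True) + 1e-12                    # per-frame maximum amplitude
S_db = 10 * np.log10((D ** 2 + 1e-12) / (ref ** 2))           # Equation (13)
freqs = librosa.fft_frequencies(sr=sr, n_fft=1024)
mask = (S_db > -40.0).any(axis=1)                             # bins significant in at least one frame
f_min, f_max = freqs[mask].min(), freqs[mask].max()
```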
The comprehensive process of frequency domain feature extraction in audio signal analysis is presented in Figure 13. Each stage in the process represents a significant step in transforming raw audio data into important features that can be used by ML models to perform the classification.
The process begins with the pre-processed audio signal, which refers to fine-tuned augmented data. From these data, the extraction process is performed by applying the STFT to the time domain signal to convert it into the time frequency domain. This transformation is achieved by dividing the audio signal into overlapping frames with a frame length of 1024 samples and 50% overlap per frame and applying Fourier transform to each frame. The result is a complex-valued matrix that represents the signal’s frequency content over time. Once the STFT is computed, the next step involves calculating the power spectrum of the signal. This is performed by squaring the magnitude of the STFT coefficients, resulting in a matrix that reflects the power distribution across frequencies and time. From this, most of the key features are extracted for accurately predicting and classifying different types of audio signals.

3.2.3. Doppler Effect Feature Extraction

The characterization of dynamic changes in the frequency and pitch of a sound plays an essential role in the detection and classification of emergency siren signals. These signals are highly distinctive and recognizable, designed to capture the attention of drivers. However, they can be influenced by various factors, including environmental noise, the relative speed between the vehicle and the observer, and the acoustic properties of the surrounding area. The Doppler effect is an important phenomenon to consider, as it causes variation in the frequency of the siren sound in real-world scenarios. Several features are considered critical when analyzing the Doppler effect in siren signals. These features are categorized as IF, chirp rate, and spectral flux.
IF: IF represents the rate of change in the phase of an audio signal at any specific moment. This metric offers valuable insights into the frequency characteristics of a signal as it progresses over time, making it particularly beneficial for the analysis of non-stationary signals where frequency variations may occur. The IF is calculated by applying the Hilbert transform to the real-valued signal to obtain the analytic signal, which provides the amplitude and phase information of the signal. Subsequently, the IF is computed using Equation (14) from the instantaneous phase $\phi(t)$ of the analytic signal. The mean instantaneous frequency is simply the average of the computed instantaneous frequencies over time.
$$f(t) = \frac{1}{2\pi}\, \frac{d\phi(t)}{dt} \tag{14}$$
Chirp rate: The chirp rate refers to the rate at which the frequency of a signal changes over a specified period, and it is particularly relevant in signals characterized by non-constant frequencies, such as chirp signals, where the frequency increases or decreases in a linear manner over time. By analyzing the slope of the frequency curve over time, it is possible to derive the rate of frequency change, or the chirp rate. This parameter is essential for the analysis of frequency-modulated signals, as it provides valuable insights into the signal’s characteristics.
Spectral flux: Spectral flux is used to assess the degree of variation in the power spectrum of an audio signal across consecutive frames. It quantifies the evolution of the spectral content over time, making it a useful feature for identifying onsets or abrupt changes in the audio. Spectral flux is computed by analyzing the magnitude spectra of consecutive frames. It aggregates the positive differences of the normalized spectra obtained from the sum of magnitudes across frequency bins for each frame. Figure 14 describes the variation of these features considered for the Doppler effect.
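The sketch below illustrates one way to compute these three Doppler-related features, assuming SciPy’s Hilbert transform for the IF, a linear fit of the IF curve for the chirp rate, and frame-normalized magnitude spectra for the spectral flux; it is not the authors’ exact implementation.

```python
import numpy as np
import librosa
from scipy.signal import hilbert

# Assumes y and sr from the earlier loading step.
analytic = hilbert(y)                                     # analytic signal via Hilbert transform
phase = np.unwrap(np.angle(analytic))                     # instantaneous phase phi(t)
inst_freq = np.diff(phase) * sr / (2.0 * np.pi)           # Equation (14), in Hz
mean_if = inst_freq.mean()

t = np.arange(len(inst_freq)) / sr
chirp_rate = np.polyfit(t, inst_freq, 1)[0]               # slope of the frequency curve (Hz/s)

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
S = S / (S.sum(axis=0, keepdims=True) + 1e-12)            # normalize each frame's spectrum
spectral_flux = np.maximum(np.diff(S, axis=1), 0.0).sum(axis=0)   # positive differences per frame
```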

3.3. Feature Scaling

Feature scaling is a crucial step in data pre-processing for ML models, ensuring that features are on a similar scale and making them more suitable for algorithms sensitive to data scales. Various features extracted from audio files include RMS, ZCR, spectral centroid, SBW, spectral roll-off, chroma STFT, MFCCs, and more. After extracting the features, they are organized along with their corresponding labels in a structured format for further analysis. It is essential to review and manage the data frame. The initial step in this process involves checking for any missing values and handling them within the loaded data frame. Since the extracted feature dataset does not contain any missing values, the next important step, standardization, is performed directly. Before standardization, the data are split into features (X) and labels (y).
Standardization transforms the features to have a mean of 0 and a standard deviation of 1. Mathematically, each feature xi is standardized using Equation (15).
$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i} \tag{15}$$
where $\hat{x}_i$ is the standardized value, $x_i$ is the original feature value, $\mu_i$ is the mean of the feature, and $\sigma_i$ is the standard deviation of the feature. This transformation ensures that the data are centered around zero with unit variance, which improves the performance of many ML algorithms by making the optimization process more efficient.
After standardization, normalization scales the standardized features to a range between 0 and 1 using Equation (16).
$$\tilde{x}_i = \frac{\hat{x}_i - \min(\hat{x})}{\max(\hat{x}) - \min(\hat{x})} \tag{16}$$
where $\tilde{x}_i$ is the normalized value, $\hat{x}_i$ is the standardized value, $\min(\hat{x})$ is the minimum value of the standardized feature, and $\max(\hat{x})$ is the maximum value. Normalization ensures that all features are on a comparable scale, which is particularly important for algorithms that rely on distance metrics, such as KNN.
Applying standardization and normalization ensures that the features are on a standard scale, thereby enhancing the performance of the ML models. Standardization centers the data and reduces the influence of outliers, while normalization scales the data to a range between 0 and 1, making them suitable for algorithms that use distance metrics. These pre-processing steps are essential for building robust and accurate models.
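A minimal sketch of this two-step scaling with scikit-learn is shown below; X_features and labels are illustrative names for the extracted feature matrix and class labels, and in practice the scalers would be fitted on the training split only.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# X_features: one row of extracted features per audio clip (illustrative name).
X_std = StandardScaler().fit_transform(X_features)     # Equation (15): zero mean, unit variance
X_scaled = MinMaxScaler().fit_transform(X_std)         # Equation (16): rescale to [0, 1]
```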
These pre-processing steps help prepare the data for training an ML model by ensuring data quality, appropriate feature-label separation, and proper scaling of features. This facilitates practical model training and evaluation, leading to better model performance and generalization ability. With the pre-processed data in hand, various classification models such as SVM, RF, KNN, and AdaBoost are trained on the training set. These models are then evaluated on the testing set using metrics like accuracy, precision, recall, and F1 score to assess their performance in classifying unseen data accurately. Moreover, stacked ensemble classifiers are constructed to combine predictions from multiple base classifiers, leveraging their complementary strengths to enhance overall performance. This comprehensive data pre-processing pipeline ensures that raw audio data are transformed into a format suitable for ML analysis, facilitating accurate classification of audio samples across different classes.

3.4. Performance Measures

Accuracy: Accuracy is a quantitative measure used to evaluate the overall performance of classifiers. It is calculated as the percentage of correct predictions with respect to the total number of instances. The formula for determining accuracy is given in Equation (17) [21].
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \tag{17}$$
In a multi-classification context, true positive (TP) refers to a situation where a positive outcome is correctly identified, while true negative (TN) denotes the correct identification of a negative outcome. Conversely, false positive (FP) signifies an incorrect identification of a positive outcome, and false negative (FN) represents an erroneous identification of a negative outcome. In the present context, the variables TP, TN, FP, and FN are employed to denote the quantities of accurate positive predictions, accurate negative predictions, inaccurate positive predictions, and inaccurate negative predictions made by the model.
Precision: The precision of a classifier is determined by the ratio of correct positive predictions to the total number of positive predictions made by the classifier. The precision rating can be determined using Equation (18) [32].
$$Precision = \frac{TP}{TP + FP} \tag{18}$$
Recall: The recall, alternatively referred to as sensitivity, quantifies the ratio of correctly identified positive predictions to the total number of positive instances. It can be computed using the subsequent mathematical expression shown in Equation (19) [32].
$$Recall = \frac{TP}{TP + FN} \tag{19}$$
F1 score: The F1 score represents the balance between precision and recall. The optimal F1 score is 1, while the lowest attainable score is 0. The metric is computed as the harmonic mean of the precision and recall scores. Presented below in Equation (20) [32] is a mathematical expression that can be utilized to calculate the F1 score.
$$F1\ score = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{20}$$
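These metrics can be computed with scikit-learn as sketched below; macro averaging over the three classes is an assumption, since the averaging scheme is not stated.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true and y_pred hold the true and predicted labels for the test split (illustrative names).
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
```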

4. ML Classification Algorithms and Data Analysis

4.1. Support Vector Machines (SVMs)

The SVM is effective for classification tasks, including audio signal analysis. The SVM works by finding the hyperplane that best separates different classes in the feature space. In audio signal processing, the SVM can be used for tasks such as speech recognition, speaker identification, and sound event classification. The SVM can handle high-dimensional feature spaces efficiently, making it suitable for audio signal processing where feature vectors can be complex.
During the training phase, each feature is introduced to the SVM and serves as a dimension within the feature space. It determines the optimal hyperplane that effectively separates the different classes in this high-dimensional environment. This process aims to maximize the distance between the hyperplane and the closest data points from each class, referred to as support vectors. The SVM also employs various kernel functions to transform the data into a higher-dimensional space when linear separability is not achievable, thus enabling efficient separation. Similarly, during the prediction phase, the new input features are transformed into the same high-dimensional space. The SVM computes a decision function using the weight vector and bias term based on the learned hyperplane to classify the new data.
The SVM uses various features extracted from audio signals as inputs during training. These features, computed using the librosa library, enable the SVM to learn a decision boundary that optimally separates different classes of audio signals. The SVM uses this boundary during testing to predict class labels such as ambulance, traffic, or firetruck.
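A minimal training sketch for the SVM with scikit-learn, assuming the scaled feature matrix and labels from Section 3.3 and an 80/20 stratified split (the split ratio and RBF kernel with default C and gamma are assumptions; tuning is discussed in Section 4.7):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X_scaled and labels are assumed from the feature scaling step.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, labels, test_size=0.2, stratify=labels, random_state=42)

svm = SVC(kernel="rbf", C=1.0, gamma="scale")      # hyperparameters tuned later via grid search
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
```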

4.2. Random Forest Classifier

RF is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. In audio signal processing, RF can be used for tasks such as siren audio event detection, music genre classification, and speech emotion recognition. It is robust to overfitting and noise in the data, making it suitable for complex audio datasets.
RF classification of audio siren signals encompasses two main steps. It operates by constructing a multitude of decision trees during training. Let $T_1, T_2, T_3, \ldots, T_N$ represent the decision trees in the RF ensemble, whose outputs are combined to make a final prediction.
The input is a feature matrix $X$, where each row represents a sample and each column denotes a feature. Mathematically, let $x_i$ denote the feature vector for the $i$th sample; the input feature matrix $X$ is represented in Equation (21).
$$X = \{x_1, x_2, x_3, \ldots, x_m\} \tag{21}$$
Following this representation, RF classification is applied. During classification, given a new input feature vector $x_n$, each decision tree independently predicts a class label $C_i(x)$. As shown in Equation (22), the final prediction $y$ is determined by majority voting among all decision trees in the forest.
$$y = \mathrm{mode}\{C_1(x), C_2(x), \ldots, C_N(x)\} \tag{22}$$
Similar to the SVM, an RF model uses the same features extracted from audio signals for training and testing. During training, it constructs multiple decision trees based on random subsets of the input features. Each tree independently predicts class labels for the audio signals. During testing, the predictions from all decision trees are aggregated, often by majority vote, to determine the final class label for the input audio signal.

4.3. K-Nearest Neighbor Classifier

KNN is a simple yet effective algorithm for classification and regression tasks. In KNN, the class of a sample is determined by the majority class among its k-nearest neighbors in the feature space. In audio signal processing, KNN can be used for speaker identification, audio fingerprinting, and sound classification tasks. KNN is non-parametric and does not require an explicit training phase, making it easy to implement.
This method calculates the envelope of a single frame of the signal. It involves first rectifying the siren signal to obtain its absolute values and then applying a low-pass filter to smooth the rectified signal, resulting in the amplitude envelope. Mathematically, for a single frame $x[n]$, we compute the rectified signal as $x_{rect}[n] = |x[n]|$ and then apply a moving average filter to obtain the amplitude envelope, as shown in Equation (23).
$$E[n] = \frac{1}{N}\sum_{k=0}^{N-1} x_{rect}[n-k] \tag{23}$$
Here, $E[n]$ is the amplitude envelope, $x[n]$ is the original signal frame from the siren sound, and $x_{rect}[n]$ is the rectified signal frame. The window size is N, and k is the index variable used in the summation for the moving average filter.
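A small sketch of this envelope computation, assuming NumPy and an illustrative window size of 64 samples:

```python
import numpy as np

def amplitude_envelope(frame, window=64):
    """Rectify a signal frame and smooth it with a causal moving average (Equation (23))."""
    rectified = np.abs(frame)                          # x_rect[n] = |x[n]|
    kernel = np.ones(window) / window                  # N-point moving-average filter
    return np.convolve(rectified, kernel, mode="full")[: len(frame)]
```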
The KNN algorithm also uses the same extracted audio features during training and testing. During training, KNN memorizes the feature vectors of the training instances. During testing, it calculates the distances between the input feature vector of a test audio signal and all training feature vectors, selects the class labels of the k-nearest training instances (where k is a predefined hyperparameter), and assigns the most common class label among them to the test instance.

4.4. AdaBoost

AdaBoost is an ensemble learning method that combines multiple weak learners (often decision trees) to create a robust classifier. It works by iteratively training models on weighted versions of the dataset, focusing more on the samples that were misclassified in previous iterations. In audio signal processing, AdaBoost is particularly effective when used with weak classifiers and can handle high-dimensional feature spaces efficiently.
In audio signal feature extraction, these algorithms can be used either for classification tasks directly on the extracted features or as part of a feature selection or dimensionality reduction process preceding classification. The choice of algorithm depends on factors such as the audio data’s nature, the classification task’s complexity, computational resources, and desired performance metrics.
AdaBoost is an ML algorithm that iteratively combines weak learners to create a robust classifier for audio siren signal classification. The algorithm starts by uniformly initializing the weights of the training samples, as shown in Equation (24).
$$D_1(i) = \frac{1}{N} \quad \text{for } i = 1, 2, \ldots, N \tag{24}$$
with N being the number of training samples. In each iteration t of training, a weak learner $h_t(x)$ is trained on the training data with weights $D_t(i)$, aiming to minimize the weighted error. Equation (25) is used to find the error.
$$\epsilon_t = \sum_{i=1}^{N} D_t(i)\, \mathbf{1}\big[h_t(x_i) \neq y_i\big] \tag{25}$$
where $\mathbf{1}[\cdot]$ is the indicator function, $x_i$ is the $i$th training sample, and $y_i$ is its corresponding true label. The weight of the weak learner $\alpha_t$ is then computed based on its weighted error, as given in Equation (26).
$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \tag{26}$$
After updating the weights of the training samples to emphasize misclassified samples, the weak learners are combined into a robust classifier $H(x)$, as shown in Equation (27).
$$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right) \tag{27}$$
where T is the number of iterations in practical implementation.
The AdaBoost model also uses features extracted from audio signals as inputs during training and testing. AdaBoost sequentially trains a series of weak learners (usually decision trees) on the training data, with each subsequent learner focusing more on the instances misclassified by previous learners. During testing, AdaBoost combines the predictions from all weak learners to make the final prediction for a given audio signal.

4.5. Stacked Ensemble Classifier

In a stacked ensemble model, the final prediction is obtained through a two-step process that utilizes the strengths of multiple base models. Initially, the training data are input into various base models, such as SVM, RF, KNN, and AdaBoost. Each of these models independently learns from the training data and makes predictions, resulting in what are categorized as prediction outputs. Rather than using these outputs as the final predictions, they are combined to create a new dataset commonly referred to as the stacked dataset. This dataset contains the predictions generated by each base model for every data point in the training set.
The stacked dataset is subsequently employed as input for a final estimator, which is typically another ML model, such as an SVM. This final estimator learns how to optimally combine the predictions from the base models to produce the most accurate final prediction. In this way, the stacked ensemble model leverages the diversity of the base models, potentially minimizing errors that may arise from relying on a single model. Hence, the final prediction is the outcome of this layered learning process, whereby the strengths of multiple models are integrated to enhance accuracy and generalization on unseen data. Figure 15 shows the stacked classifier models utilized for training, and Table 1 compares their classification performance.
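A minimal sketch of such a stacked ensemble using scikit-learn's StackingClassifier is given below. The data are synthetic placeholders, the base-model hyperparameters echo Table 2, and the final estimator is shown as an SVM, consistent with the meta-learner used for the classification report in Section 5.2.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))              # stand-in for the extracted feature matrix
y = rng.integers(0, 3, size=300)

base_learners = [
    ("svm", SVC(C=10, gamma="scale", kernel="rbf")),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=3, weights="distance")),
    ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
]

# cv=5 builds the stacked dataset from out-of-fold base predictions, so the final
# estimator (here an SVM) never sees predictions made on data the bases trained on
stack = StackingClassifier(estimators=base_learners, final_estimator=SVC(), cv=5)
stack.fit(X, y)
print(stack.predict(X[:3]))                 # the final prediction combines all base outputs
```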

4.6. Long Short-Term Memory (LSTM)

The classification of siren signals from ambulances, fire trucks, and traffic is also performed with an LSTM model. The dataset, which includes both original and augmented versions with added white noise, is pre-processed by encoding the labels into numerical format and scaling the features for consistency; this uniformity improves training efficiency. The LSTM model comprises two recurrent layers. The first LSTM layer contains 50 units that capture the sequence of hidden states and relay this information to the second layer, which consists of 20 units that further process the sequence, extracting the essential temporal patterns within the siren sounds. These LSTM layers are followed by a dense layer of 30 units that translates the processed sequences into the final classification outputs, and the last dense layer employs a softmax activation function to generate a probability distribution across the classes (ambulance, fire truck, traffic). For optimization, the model is compiled with the Adam optimizer, which is recognized for its effectiveness in training deep networks, and sparse categorical cross-entropy as the loss function, which is well suited to multi-class classification with integer labels. During training, an early stopping callback mitigates overfitting by halting the process once the validation performance plateaus, promoting better generalization to unseen data. After training, the model is evaluated on the validation dataset: it processes each audio sequence, identifies the distinct temporal patterns associated with each siren type, and generates predictions. Performance is appraised using metrics such as accuracy, precision, recall, and F1 score, which provide a thorough understanding of its classification effectiveness. Through this comprehensive approach, the LSTM model learns to differentiate between the various siren signals, even amidst background noise, establishing its robustness and reliability for practical applications.
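A compact Keras sketch of such an architecture is shown below. The input shape (40 time steps of 13 features per clip) and the 30-unit dense layer are illustrative assumptions rather than the exact configuration used in the study; the layer sizes, optimizer, loss, and early stopping follow the description above.

```python
import numpy as np
from tensorflow.keras import callbacks, layers, models

# Synthetic stand-in: 1000 clips, 40 time steps, 13 features per step (e.g., MFCC frames)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40, 13)).astype("float32")
y = rng.integers(0, 3, size=1000)            # 0 = ambulance, 1 = firetruck, 2 = traffic

model = models.Sequential([
    layers.Input(shape=(40, 13)),
    layers.LSTM(50, return_sequences=True),  # passes the full hidden-state sequence onward
    layers.LSTM(20),                         # summarizes the temporal patterns
    layers.Dense(30, activation="relu"),
    layers.Dense(3, activation="softmax"),   # probability distribution over the classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32,
          callbacks=[early_stop], verbose=0)
```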

4.7. Model Selection and Hyperparameter Tuning

The rationale for the choice of ML models and the details of hyperparameter tuning for the selected models are discussed here. The choice of models was based on their unique strengths and relevance to the problem at hand. The SVM model was selected for its robustness in high-dimensional data and its ability to perform well in noisy environments, which is crucial for detecting emergency vehicle sounds amid background noise; to fine-tune the SVM, the hyperparameters C and gamma were optimized. The KNN model was chosen for its simplicity and effectiveness, particularly in scenarios where there is no assumption about the underlying data distribution. KNN performs classification by finding the closest neighbors in the feature space, making it an intuitive and interpretable model.
The RF model was selected for its ability to handle large datasets and its resistance to overfitting, thanks to the bagging of multiple decision trees; RF is well suited to complex classification tasks, such as distinguishing between different types of emergency vehicle sounds. The AdaBoost model, on the other hand, was chosen for its ability to iteratively improve performance by focusing on misclassified examples in each boosting round, a characteristic that is particularly useful for datasets with noise or difficult-to-classify instances, such as overlapping vehicle sounds. Finally, the stacked ensemble model was employed to leverage the strengths of the individual base models (SVM, KNN, RF, and AdaBoost). A stacked ensemble combines the predictions from each of these models using a meta-learner, which, in this case, was logistic regression. The idea behind this approach is that, by combining multiple models, the system can overcome the weaknesses of individual models, resulting in a more accurate and robust prediction system for emergency vehicle detection. Additionally, hyperparameter tuning was performed using GridSearchCV with five-fold cross-validation to optimize the performance of each model. Table 2 summarizes the hyperparameter tuning performed for each model.
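The following sketch shows how such a search could be set up with scikit-learn's GridSearchCV; the parameter grid and the synthetic data are illustrative, while the five-fold setting and the tuned SVM values correspond to Table 2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))              # stand-in for the extracted feature matrix
y = rng.integers(0, 3, size=300)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")   # five-fold CV
search.fit(X, y)
print(search.best_params_)                  # e.g., {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
```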
Further, to provide a detailed analysis of the types of errors made by the models, the confusion matrices, precision, recall, and F1 scores for each class are examined. This analysis helps identify specific areas of weakness and provides insights into how the models can be improved. The confusion matrices of the models are given in Figure 16. Based on the results, the SVM model and the stacked ensemble demonstrated excellent performance, achieving an accuracy of 99.5% with only one misclassified example. The confusion matrix reveals that this misclassification occurred in the ambulance class, where one instance was predicted as a firetruck; no errors were observed in classifying firetrucks or traffic. According to the confusion matrices, the KNN and RF models performed slightly less accurately than the SVM, with three misclassified examples each, whereas the AdaBoost model had the lowest accuracy with eight misclassified examples, making it the weakest performer among the models. Its confusion matrix shows that the majority of errors occurred with the ambulance class, where seven instances were misclassified as firetrucks and one firetruck was misclassified as an ambulance.
One of the most consistent errors across all models is the misclassification of an ambulance as a firetruck. This indicates a significant overlap in the feature space between these two classes. Addressing this challenge could involve introducing new features that better capture subtle differences in siren patterns, such as spectral shape, harmonic content, or modulation rates. Additionally, using data augmentation techniques to introduce more diverse examples of emergency vehicle sounds could help improve the models’ robustness and performance.
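As a hedged illustration of such augmentation, the librosa-based sketch below adds white noise, a pitch shift, and a time stretch to a clip; the synthetic chirp, noise level, and shift amounts are placeholders rather than the augmentation settings used in this study.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int, noise_level: float = 0.005):
    """Return perturbed copies of a clip: added white noise, pitch shift, time stretch."""
    noisy = y + noise_level * np.random.randn(len(y))            # white-noise background
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # crude Doppler-like shift
    stretched = librosa.effects.time_stretch(y, rate=1.1)        # slight tempo variation
    return [noisy, shifted, stretched]

sr = 22050
siren_like = librosa.chirp(fmin=600, fmax=1200, sr=sr, duration=2.0)  # synthetic stand-in
variants = augment(siren_like, sr)
```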
Table 3 illustrates the accuracy percentages of the various models utilized for classifying siren signals, including SVM, RF, KNN, AdaBoost, a stacked ensemble, and LSTM networks. Both the SVM and stacked ensemble models achieved the highest accuracy of 99.5%, highlighting their effectiveness in accurately classifying siren signals. This indicates that aggregating predictions from multiple models in a stacked ensemble can lead to a performance that is comparable to or even exceeds that of individual models. The RF and KNN models showed strong performance with an accuracy of 98.5%, slightly lower than that of the SVM and stacked ensemble. AdaBoost demonstrated an accuracy of 96%, revealing a lower effectiveness relative to the other traditional ML models. The LSTM model, designed specifically for sequential data, attained an accuracy of 93%, below AdaBoost and the other models. Lastly, the CNN model attained a solid 97.66% accuracy, showing its effectiveness, though it still lagged behind the SVM and the stacked ensemble in terms of overall performance. These findings suggest that, for this specific classification task, the traditional models and the ensemble approach outperformed the LSTM and CNN models.

4.8. Leave-One-Out Cross-Validation (LOOCV)

LOOCV is a meticulous validation method in which the model undergoes training and evaluation multiple times. In each iteration, a single data point serves as the test set, while the remaining data points are utilized for training. This process is repeated for each data point within the dataset, ensuring that every data point is employed as a test case exactly once. By evaluating the model across all possible training and test splits, LOOCV provides an unbiased estimate of the model's generalization capabilities. LOOCV is particularly advantageous for small datasets, as it optimizes the use of the available data for both training and testing without the need to set aside a separate holdout set. In our case, with 1000 samples, LOOCV ensures that every data point is evaluated, providing a robust and comprehensive performance assessment; unlike traditional cross-validation methods, where part of the data are withheld for testing, LOOCV extracts the maximum value from limited data, ensuring that the model is trained and tested on all available samples.
Given the size of our dataset, LOOCV allows us to evaluate the performance of our stacked ensemble model without sacrificing data points for validation. This makes LOOCV a suitable choice for small datasets, where the loss of data in a holdout set could lead to underfitting or inaccurate performance estimates. As a result, LOOCV provides a reliable estimate of the model’s performance, as the metrics calculated from each iteration (such as accuracy, precision, recall, and F1 score) are averaged to give a detailed summary across all data points.
While LOOCV provides an unbiased estimate of model performance, it is important to consider the impact it may have on generalizability. One limitation of LOOCV is its potential to produce higher variance in performance metrics compared to other validation techniques, such as k-fold cross-validation. This is because each training set in LOOCV differs only by one data point, which may not fully capture the variance inherent in the dataset. For instance, leaving out one point at a time may not reflect the broader variability that could arise when larger subsets of the data are excluded, as happens in k-fold cross-validation. However, the small size of our dataset reduces this concern, as using LOOCV provides a comprehensive evaluation of model performance while still ensuring that all data points are involved in the validation process. Although LOOCV can result in slightly higher variance, the method remains a reliable estimate of performance for our dataset, providing a detailed understanding of the model’s strengths and limitations. In this context, we believe LOOCV strikes an appropriate balance between maximizing data usage and offering a robust assessment of model performance.
However, it is important to note that this method can be computationally demanding, especially when applied to larger datasets since the model must be trained as many times as there are data points. LOOCV is utilized to assess a stacked ensemble model, calculating metrics such as accuracy, precision, recall, and F1 score for each iteration. These metrics are subsequently averaged to deliver a thorough summary of the model’s performance across all data points, ensuring that the evaluation is both detailed and reliable. Table 4 provides an overview of the performance metrics for various models assessed through LOOCV, with a focus on accuracy, precision, recall, and F1 score.
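A minimal sketch of LOOCV with scikit-learn is shown below; the data are a synthetic stand-in, and the SVM hyperparameters echo Table 2. One model is fitted per sample, which is why the procedure becomes expensive on large datasets.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))              # stand-in features; LOOCV fits 100 models here
y = rng.integers(0, 3, size=100)

loo = LeaveOneOut()
scores = cross_val_score(SVC(C=10, gamma="scale"), X, y, cv=loo)  # one score per held-out clip
print(scores.mean())                        # LOOCV accuracy = mean over all leave-one-out folds
```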
Both the SVM and RF models achieved outstanding scores of 0.985 across all metrics, demonstrating their capability to accurately classify data and maintain balanced performance across the different classes. The KNN and AdaBoost models recorded lower scores of 0.971 and 0.906, respectively, across all metrics, indicating effective but somewhat weaker performance compared to the SVM and RF. Notably, the stacked ensemble model, which integrates predictions from multiple models, slightly surpassed the individual models with an accuracy, precision, recall, and F1 score of 0.99.
Moreover, in our experiment, the moderately complex AdaBoost model achieved a lower accuracy of 90.6%. This lower accuracy suggests that AdaBoost may not be the most suitable model for this specific task, especially in real-time scenarios. On the other hand, the stacked ensemble model demonstrated a significantly higher accuracy of 99%, which is comparable to the accuracy achieved by a simple SVM in our study. This high accuracy highlights the effectiveness of ensemble methods in capturing complex patterns in the data. However, the complexity of the stacked ensemble model could lead to increased processing time, which may not be ideal for real-time applications, while the SVM model, which also achieved 98.5% accuracy, offers a more balanced approach in terms of complexity and processing time. SVMs are relatively less complex than stacked ensemble models and can be optimized for faster processing, making them a strong candidate for real-time deployment. Therefore, we recommend the use of an SVM in real-time applications, as it strikes a better balance between accuracy and efficiency. Further, a high mean value and low standard deviation indicate a model that performs well on average and demonstrates consistent performance across different data points. LOOCV therefore offers a comprehensive assessment of the model's effectiveness and reliability in practice.

4.9. Cross-Validation, Accuracy, and Loss of Each Fold

This study also evaluated the performance of various ML models, including SVM, RF, KNN, AdaBoost, and a stacked ensemble, using five-fold cross-validation. The corresponding performance metrics for each fold and the mean performance are shown in Table 5.
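A brief sketch of such a five-fold evaluation, reporting per-fold accuracy and log loss as in Table 5, is given below; the synthetic data and the choice of RF as the example model are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))              # stand-in for the extracted feature matrix
y = rng.integers(0, 3, size=300)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
res = cross_validate(RandomForestClassifier(n_estimators=200, random_state=42),
                     X, y, cv=cv, scoring=["accuracy", "neg_log_loss"])
print(res["test_accuracy"])                 # per-fold accuracy
print(-res["test_neg_log_loss"])            # per-fold log loss
```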

5. Result and Comparative Analysis

5.1. Comparing Basic Models and LOOCV

A comparative accuracy analysis of the various ML models was performed, evaluating their basic accuracy alongside the accuracy achieved through LOOCV. The SVM model demonstrates consistent performance, with a basic accuracy of 0.995 and a LOOCV accuracy of 0.985, indicating a well-generalized model that performs reliably on unseen data. Similarly, the RF model shows the same accuracy from the basic evaluation to LOOCV, suggesting no tendency towards overfitting and underscoring its robustness in generalization. The KNN model, with a basic accuracy of 0.985 and a nearly equivalent LOOCV accuracy of 0.971, shows only a slight drop and continues to exhibit strong overall performance. In contrast, the AdaBoost model experiences a significant performance decline, dropping from a basic accuracy of 0.96 to a LOOCV accuracy of 0.906. This substantial decrease indicates that AdaBoost may be overfitting to the training data, resulting in diminished effectiveness when evaluated with new data.
The stacked ensemble model distinguishes itself with the highest basic accuracy of 0.995, experiencing only a minimal decrease to 0.99 in LOOCV accuracy. This showcases its strong capability for generalization across varied datasets. Lastly, the LSTM model maintains uniform performance in both basic and LOOCV evaluations, achieving an accuracy of 0.906, which suggests stable and reliable generalization. While most models demonstrate commendable generalization capabilities, the performance of the AdaBoost model indicates a potential requirement for further tuning or adjustments to mitigate overfitting. Figure 17 shows the comparison performance for basic models with LOOCV. The stacked ensemble model emerges as the most robust performer, effectively combining high accuracy with exceptional generalization.

5.2. Classification Report of Models

A stacked ensemble classifier is created by combining the predictions of the base classifiers using an SVM meta-learner. The stacked ensemble classifier is then trained on the training set and evaluated on the testing set, yielding an accuracy score and classification report shown in Figure 18.
Further, to assess the real-time performance of the ML models for emergency vehicle detection, metrics such as inference time (latency), feature extraction time, and end-to-end latency are evaluated. The analysis of each metric and its implications for real-time deployment are provided for the different ML models, and Table 6 summarizes the metrics for each model. This performance was measured on an i7-7700 CPU @ 3.60 GHz with 24 GB of RAM.
As observed in Table 6, the feature extraction time across all models remains between 0.1078 and 0.1419 s, indicating that the extraction process is fast and consistent. With an average of about 0.11 s, the system processes audio data efficiently in real time; further optimizations, such as parallel processing or GPU acceleration, could reduce this time even more, enhancing the system's response speed. SVM and AdaBoost have relatively short training times of 0.2350 s and 0.8594 s, respectively, making them efficient for retraining, while KNN requires no training time since its computations occur during inference. In contrast, the stacked ensemble has a long training time of 12.0349 s, which may be a drawback for scenarios requiring frequent model updates. Moreover, SVM and AdaBoost exhibit the fastest inference times of 0.0158 s and 0.0313 s, making them well suited to real-time applications. RF has a mid-range time of 0.0463 s, while KNN and the stacked ensemble have longer latencies of 0.2857 s and 0.3488 s, respectively, potentially introducing delays in time-sensitive scenarios.
The total latency combines feature extraction and inference times, providing a complete picture of system performance. AdaBoost has the lowest total latency at 0.1391 s, followed by SVM and RF with 0.1577 and 0.1557 s, respectively. KNN and stacked ensemble exceed 200 milliseconds, with 0.3964 s and 0.4588 s, making them less suitable for real-time applications where fast response is critical. In terms of memory usage, all models have similar memory footprints, ranging from 295 MB to 298 MB. The SVM is the most memory efficient at 295.12 MB, making it ideal for memory-constrained environments, like IoT devices or edge computing systems.
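A simple way to obtain such per-clip timings is sketched below; the dummy feature extractor, synthetic clips, and SVM are placeholders used only to show how feature-extraction and inference latency could be measured.

```python
import time
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-ins: a toy feature extractor and a fitted SVM classifier
def extract_features(clip: np.ndarray) -> np.ndarray:
    return np.array([clip.mean(), clip.std(), np.abs(clip).max()])

rng = np.random.default_rng(0)
clips = [rng.normal(size=22050) for _ in range(50)]            # ~1 s of audio per clip
X = np.vstack([extract_features(c) for c in clips])
model = SVC().fit(X, rng.integers(0, 3, size=len(clips)))

t0 = time.perf_counter()
feats = np.vstack([extract_features(c) for c in clips])
t_extract = (time.perf_counter() - t0) / len(clips)            # per-clip feature time

t0 = time.perf_counter()
model.predict(feats)
t_infer = (time.perf_counter() - t0) / len(clips)              # per-clip inference time
print(t_extract, t_infer, t_extract + t_infer)                 # end-to-end latency per clip
```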

5.3. Training and Validation Accuracy

Training accuracy assesses how well a model fits the training data, while validation accuracy evaluates its ability to generalize to new data, which is a critical indicator of a model’s performance and reliability. Table 7 presents the training and validation accuracies of SVM, RF, KNN, AdaBoost, and stacked ensemble classifiers. The training accuracy represents how well each classifier performs on the training data, while the validation accuracy indicates its performance on unseen validation data, providing insights into its generalization ability.
The line graph depicts the training and validation accuracies of the machine learning models. Training accuracy reflects how well a model fits the training data it was given, while validation accuracy shows how well the model performs on unseen data; ideally, both values should be high and close together. A significant gap between training and validation accuracy suggests that the model is overfitting the training data, meaning it has memorized the specifics of the training data rather than learning generalizable patterns. Figure 19 shows the training and validation accuracies of all models. The graph serves as a reminder that good generalization requires striking a balance between training accuracy and validation accuracy.

5.4. Ablation Study and Literature Comparison

Further, Table 8 delivers a thorough comparison of emergency vehicle siren sound classification performance for ambulances, firetrucks, and traffic noise across various feature types and classifiers. The analysis compares metrics for time domain features, frequency domain features, and a combination of both time and frequency domain features. For this comparison, the classifiers used are the SVM and the stacked ensemble method. In the time domain analysis, the SVM classifier yielded an overall accuracy of 80.5%, while the stacked ensemble improves on this by 5.5 percentage points, reaching 86%.
The frequency domain analysis was carried out using various sets of features, starting with MFCCs. The SVM classifier produced an accuracy of 96%, with balanced precision, recall, and F1 scores across all classes, and the stacked ensemble further increased the accuracy to 97.5%, with near-perfect classification for all classes. This demonstrates the robustness of MFCC features in capturing the critical aspects of siren sounds. The table also reports the performance of the remaining frequency domain features, excluding MFCCs: spectral centroid, spectral bandwidth, spectral roll-off, chroma STFT, Min and Max frequencies, IF, chirp rate, and spectral flux. For this feature set, the SVM classifier achieved an accuracy of 96.5%, maintaining consistent performance across all classes, while the stacked ensemble classifier boosted the accuracy to 99%, achieving nearly perfect precision, recall, and F1 scores for all classes and emphasizing the importance of incorporating diverse spectral features for accurate classification. Additionally, the Doppler-related features (IF, chirp rate, and spectral flux) on their own achieved accuracies of 85.5% and 89% for the SVM and stacked ensemble models, respectively. A comparison was also made between all frequency domain features, including those capturing the Doppler effect, and the combined time and frequency domain features. The results indicate that integrating the time and frequency domain features, along with the specialized Doppler effect features, is the most effective strategy, yielding near-perfect classification across all classes.
Table 9 presents a comparative analysis of the existing literature concerning various audio detection and classification methods. It outlines the techniques employed, feature extraction methods used, accuracy achieved, and their respective limitations. The analysis reveals that the majority of the literature focuses primarily on single MFCC feature extraction. In contrast, certain studies, such as references [12,25,27], have incorporated multiple features. Additionally, some research on emergency vehicle detection has not adequately addressed features from both the time and frequency domains or considered factors such as augmented noise and the Doppler effect. Incorporating these aspects could enhance the generalizability of the model.

6. Conclusions

This research presents a machine learning-based approach for classifying emergency vehicle siren sounds, with a focus on the integration of both temporal and spectral audio features. We developed an ensemble learning method that leverages the strengths of individual models, including SVM, RF, KNN, and AdaBoost, to improve classification accuracy and robustness. The experimental results indicate that the SVM model achieved the highest individual performance among all methods evaluated, and the ensemble learning strategy further enhanced the overall robustness and accuracy of the system. Moreover, we addressed challenges, such as the Doppler effect and environmental noise, through data augmentation techniques, thereby increasing the model’s applicability in real-world scenarios. To further enhance its effectiveness, future research should focus on integrating this system with intelligent traffic management systems (ITS), allowing it to adjust traffic signals dynamically in response to emergency sirens. A multimodal approach that combines audio detection with visual sensors, such as cameras or lidar, would increase accuracy, especially in noisy environments where sirens may be obscured. Additionally, deploying the system on edge computing and IoT devices located at intersections would enable real-time processing with minimal latency, ensuring efficient traffic flow. Future work should also explore scalability, with large-scale real-world testing across diverse urban environments to ensure that the system performs well under various conditions. Finally, incorporating deep learning models and advanced data augmentation techniques will further improve the system’s robustness, making it better suited for complex and dynamic urban soundscapes. These advancements will pave the way for more reliable emergency vehicle detection and faster response times, ultimately enhancing public safety.

Author Contributions

Conceptualization, D.J. and S.P.; methodology, D.J., M.K. and S.P.; software, D.J. and M.K.; validation, D.J., S.K. and S.P.; formal analysis, S.P.; investigation, D.J. and M.T.; resources, S.P.; data curation, N.C. and M.T.; writing—original draft preparation, D.J. and S.P.; writing—review and editing, S.P.; visualization, S.P.; supervision, S.P.; project administration, S.P.; funding acquisition, S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be provided upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, W.; Xie, H.; Chen, Y.; Roh, J.; Shin, H. PIFNet: 3D object detection using joint image and point cloud features for autonomous driving. Appl. Sci. 2022, 12, 3686. [Google Scholar] [CrossRef]
  2. Guo, L.; Lu, K.; Huang, L.; Zhao, Y.; Liu, Z. Pillar-based multilayer pseudo-image 3D object detection. J. Electron. Imaging 2024, 33, 013024. [Google Scholar] [CrossRef]
  3. Sun, H.; Liu, X.; Xu, K.; Miao, J.; Luo, Q. Emergency vehicles audio detection and localization in autonomous driving. arXiv 2021, arXiv:2109.14797. [Google Scholar]
  4. Sathruhan, S.; Herath, O.K.; Sivakumar, T.; Thibbotuwawa, A. Emergency Vehicle Detection using Vehicle Sound Classification: A Deep Learning Approach. In Proceedings of the 2022 6th SLAAI International Conference on Artificial Intelligence (SLAAI-ICAI), Colombo, Sri Lanka, 1–2 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  5. Tran, V.T.; Tsai, W.H. Acoustic-based emergency vehicle detection using convolutional neural networks. IEEE Access 2020, 8, 75702–75713. [Google Scholar] [CrossRef]
  6. Chu, S.; Narayanan, S.; Kuo, C.C.J. Environmental sound recognition with time–frequency audio features. IEEE Trans. Audio Speech Lang. Process 2009, 17, 1142–1158. [Google Scholar] [CrossRef]
  7. Hamsa, S.; Iraqi, Y.; Shahin, I.; Werghi, N. An enhanced emotion recognition algorithm using pitch correlogram, deep sparse matrix representation and random forest classifier. IEEE Access 2021, 9, 87995–88010. [Google Scholar] [CrossRef]
  8. Cruz, M.C.; Ferenchak, N.N. Emergency response times for fatal motor vehicle crashes, 1975–2017. Transp. Res. Rec. 2020, 2674, 504–510. [Google Scholar] [CrossRef]
  9. Chen, Y.; Li, H.; Hou, L.; Bu, X. Feature extraction using dominant frequency bands and time-frequency image analysis for chatter detection in milling. Precis. Eng. 2019, 56, 235–245. [Google Scholar] [CrossRef]
  10. Albouy, P.; Mehr, S.A.; Hoyer, R.S.; Ginzburg, J.; Zatorre, R.J. Spectro-temporal acoustical markers differentiate speech from song across cultures. bioRxiv 2023. [Google Scholar] [CrossRef] [PubMed]
  11. Benetos, E.; Dixon, S. Joint multi-pitch detection using harmonic envelope estimation for polyphonic music transcription. IEEE J. Sel. Top. Signal Process. 2011, 5, 1111–1123. [Google Scholar] [CrossRef]
  12. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning, 4th ed.; Springer: New York, NY, USA, 2006; p. 738. [Google Scholar]
  13. Khalid, S.; Khalil, T.; Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the 2014 Science and Information Conference, London, UK, 27–29 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 372–378. [Google Scholar]
  14. Dhanalakshmi, P.; Palanivel, S.; Ramalingam, V. Classification of audio signals using SVM and RBFNN. Expert Syst. Appl. 2009, 36, 6069–6075. [Google Scholar] [CrossRef]
  15. Razzaghi, P.; Abbasi, K.; Bayat, P. Learning spatial hierarchies of high-level features in deep neural network. J. Vis. Commun. Image Represent. 2020, 70, 102817. [Google Scholar] [CrossRef]
  16. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process Syst. 2012, 25. [Google Scholar] [CrossRef]
  17. Badi, I.; Bouraima, M.B.; Muhammad, L.J. The role of intelligent transportation systems in solving traffic problems and reducing environmental negative impact of urban transport. Decis. Mak. Anal. 2023, 1, 1–9. [Google Scholar] [CrossRef]
  18. Dimitrakopoulos, G.; Demestichas, P. Intelligent transportation systems. IEEE Veh. Technol. Mag. 2010, 5, 77–84. [Google Scholar] [CrossRef]
  19. Ellis, D.P.W. Detecting alarm sounds. In Proceedings of the Recognition of Real-World Sounds: Workshop on Consistent and Reliable Acoustic Cues, Aalborg, Denmark, 2 September 2001; pp. 59–62. [Google Scholar]
  20. Fatimah, B.; Preethi, A.; Hrushikesh, V.; Singh, A.; Kotion, H.R. An automatic siren detection algorithm using Fourier Decomposition Method and MFCC. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  21. Beritelli, F.; Casale, S.; Russo, S.; Serrano, S. An automatic emergency signal recognition system for the hearing impaired. In Proceedings of the 12th Digital Signal Processing Workshop and 4th Signal Processing Education Workshop, Wyoming, WY, USA, 24–27 September 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 179–182. [Google Scholar]
  22. Liaw, J.J.; Wang, W.S.; Chu, H.C.; Huang, M.S.; Lu, C.P. Recognition of the ambulance siren sound in Taiwan by the longest common subsequence. In Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 13–16 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 3825–3828. [Google Scholar]
  23. Choudhury, K.; Nandi, D. Review of emergency vehicle detection techniques by acoustic signals. Trans. Indian Natl. Acad. Eng. 2023, 8, 535–550. [Google Scholar] [CrossRef]
  24. Sivasankaran, S.; Prabhu, K.M.M. Robust features for environmental sound classification. In Proceedings of the 2013 IEEE International Conference on Electronics, Computing and Communication Technologies, Bangalore, India, 17–19 January 2013; IEEE: Piscataway, NJ, USA, 2013. [Google Scholar]
  25. Schroder, J.; Goetze, S.; Grutzmacher, V.; Anemuller, J. Automatic acoustic siren detection in traffic noise by part-based models. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 493–497. [Google Scholar]
  26. Massoudi, M.; Verma, S.; Jain, R. Urban sound classification using CNN. In Proceedings of the 2021 6th International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 20–22 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 583–589. [Google Scholar]
  27. Usaid, M.; Asif, M.; Rajab, T.; Rashid, M.; Hassan, S.I. Ambulance siren detection using artificial intelligence in urban scenarios. Sir. Syed. Univ. Res. J. Eng. Technol. 2022, 12, 92–97. [Google Scholar] [CrossRef]
  28. Mecocci, A.; Grassi, C. RTAIAED: A real-time ambulance in an emergency detector with a pyramidal part-based model composed of MFCCs and YOLOv8. Sensors 2024, 24, 2321. [Google Scholar] [CrossRef] [PubMed]
  29. Salem, O.; Mehaoua, A.; Boutaba, R. The Sight for Hearing: An IoT-Based System to Assist Drivers with Hearing Disability. In Proceedings of the 2023 IEEE Symposium on Computers and Communications (ISCC), Gammarth, Tunisia, 9–12 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1305–1310. [Google Scholar]
  30. Zohaib, M.; Asim, M.; ELAffendi, M. Enhancing Emergency Vehicle Detection: A Deep Learning Approach with Multimodal Fusion. Mathematics 2024, 12, 1514. [Google Scholar] [CrossRef]
  31. Available online: https://www.kaggle.com/datasets/vishnu0399/emergency-vehicle-siren-sounds/data. (accessed on 15 January 2024).
  32. Chandrasekhar, N.; Peddakrishna, S. Enhancing Heart Disease Prediction Accuracy through Machine Learning Techniques and Optimization. Processes 2023, 11, 1210. [Google Scholar] [CrossRef]
Figure 1. Audio signal feature extraction and classification workflow.
Figure 2. Detailed procedure for pre-processing and augmentation.
Figure 3. Categorization of feature extraction methods in the time domain and frequency domain.
Figure 4. RMS of an input sound waveform.
Figure 5. ZCR wave for the input waveform.
Figure 6. RMS and ZCR extraction procedure from the pre-processed dataset.
Figure 7. Spectral centroid of an audio signal.
Figure 8. The SBW representation of an audio signal.
Figure 9. Waveform of spectral roll-off for an input siren wave.
Figure 10. Chroma feature extraction of the wave.
Figure 11. MFCC feature extraction of the wave.
Figure 12. Min and Max frequency representation of the audio sample.
Figure 13. Frequency domain feature extraction process for audio signals.
Figure 14. The IF, chirp, and spectral flux representation of the audio signal over time.
Figure 15. Stacked ensemble classifier models.
Figure 16. Confusion matrix insights for sound classification models.
Figure 17. Comparison performance for basic models with LOOCV.
Figure 18. Classification report of the models.
Figure 19. Training and validation accuracies of the models.
Table 1. Comparison of precision, recall, and F1 score obtained over the classifier models.

Classifier | Class | Precision | Recall | F1 Score
SVM | ambulance | 1 | 0.99 | 0.99
SVM | firetruck | 0.99 | 1 | 0.99
SVM | traffic | 1 | 1 | 1
RF | ambulance | 1 | 0.96 | 0.98
RF | firetruck | 0.96 | 1 | 0.98
RF | traffic | 1 | 1 | 1
KNN | ambulance | 1 | 0.96 | 0.98
KNN | firetruck | 0.97 | 1 | 0.99
KNN | traffic | 0.98 | 1 | 0.99
AdaBoost | ambulance | 0.99 | 0.91 | 0.95
AdaBoost | firetruck | 0.91 | 0.99 | 0.95
AdaBoost | traffic | 1 | 1 | 1
Stacked Ensemble | ambulance | 1 | 0.99 | 0.99
Stacked Ensemble | firetruck | 0.99 | 1 | 0.99
Stacked Ensemble | traffic | 1 | 1 | 1
Table 2. Optimized hyperparameters for each model.

Model | Tuned Hyperparameters | Best Params
SVM | C, gamma, kernel | {‘C’: 10, ‘gamma’: ‘scale’, ‘kernel’: ‘rbf’}
KNN | n_neighbors, weights | {‘n_neighbors’: 3, ‘weights’: ‘distance’}
RF | n_estimators, max_depth, min_samples_split | {‘max_depth’: None, ‘min_samples_split’: 2, ‘n_estimators’: 200}
AdaBoost | n_estimators, learning_rate | {‘learning_rate’: 1.0, ‘n_estimators’: 100}
Table 3. Accuracy performance of the models.

Model | Accuracy %
SVM | 99.5
RF | 98.5
K-Nearest Neighbors | 98.5
AdaBoost | 96
Stacked Ensemble | 99.5
LSTM | 93.0
CNN | 97.66
Table 4. Performance metrics of LOOCV.

Classifier | Accuracy | Precision | Recall | F1 Score
SVM | 0.985 | 0.985 | 0.985 | 0.985
Random Forest | 0.985 | 0.985 | 0.985 | 0.985
KNN | 0.971 | 0.971 | 0.971 | 0.971
AdaBoost | 0.906 | 0.906 | 0.906 | 0.906
Stacking Ensemble | 0.99 | 0.99 | 0.99 | 0.99
Table 5. Five-fold cross-validation of the models.

Model | Performance Metric | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean
SVM | Accuracy | 0.995 | 0.985 | 0.96 | 0.98 | 0.98 | 0.980 ± 0.011
SVM | Log Loss | 0.0262 | 0.0721 | 0.0984 | 0.0523 | 0.0600 | 0.062 ± 0.024
RF | Accuracy | 0.9950 | 0.9550 | 0.9600 | 0.9950 | 0.9850 | 0.978 ± 0.017
RF | Log Loss | 0.0989 | 0.1614 | 0.1497 | 0.1150 | 0.1269 | 0.130 ± 0.023
KNN | Accuracy | 0.9850 | 0.9300 | 0.9500 | 0.9650 | 0.9550 | 0.957 ± 0.018
KNN | Log Loss | 0.0501 | 0.3052 | 0.0980 | 0.4330 | 0.0918 | 0.196 ± 0.148
AdaBoost | Accuracy | 0.9600 | 0.8800 | 0.9050 | 0.9100 | 0.9350 | 0.918 ± 0.027
AdaBoost | Log Loss | 0.9504 | 0.9544 | 0.9520 | 0.9493 | 0.9522 | 0.952 ± 0.002
Stacked Ensemble | Accuracy | 0.9950 | 0.9750 | 0.9500 | 0.9850 | 0.9850 | 0.978 ± 0.015
Stacked Ensemble | Log Loss | 0.0291 | 0.0816 | 0.0918 | 0.0504 | 0.0579 | 0.062 ± 0.022
Table 6. Real-time performance metrics of machine learning models.

Model | Feature Extraction Time (s) | Training Time (s) | Inference Time (s) | Total Latency (s) | Accuracy | Memory Usage (MB)
SVM | 0.1419 | 0.2350 | 0.0158 | 0.1577 | 99.50% | 295.12
KNN | 0.1107 | 0.0000 | 0.2857 | 0.3964 | 98.50% | 295.76
Random Forest | 0.1094 | 1.0253 | 0.0463 | 0.1557 | 98.50% | 295.91
AdaBoost | 0.1078 | 0.8594 | 0.0313 | 0.1391 | 96.00% | 295.91
Stacked Ensemble | 0.1100 | 12.0349 | 0.3488 | 0.4588 | 99.50% | 298.92
Table 7. Training and validation values for ML models.

Classifier | Training Accuracy | Validation Accuracy
SVM | 0.9914 | 0.99
Random Forest | 1 | 0.98
KNN | 0.9786 | 0.9567
AdaBoost | 0.9614 | 0.9333
Stacked Ensemble | 0.9957 | 0.9867
Table 8. Ablation study for performance comparison between different ways of feature extractions.

Feature Type | Features | Classifier | Accuracy % | Class | Precision % | Recall % | F1 Score %
Time domain | RMS and ZCR | SVM | 80.5 | Ambulance | 76 | 75 | 76
 | | | | Firetruck | 75 | 75 | 75
 | | | | Traffic | 98 | 100 | 99
 | | Stacked Ensemble | 86 | Ambulance | 84 | 80 | 82
 | | | | Firetruck | 80 | 84 | 82
 | | | | Traffic | 100 | 100 | 100
Frequency domain | MFCC | SVM | 96 | Ambulance | 94 | 96 | 95
 | | | | Firetruck | 96 | 93 | 95
 | | | | Traffic | 100 | 100 | 100
 | | Stacked Ensemble | 97.5 | Ambulance | 99 | 95 | 97
 | | | | Firetruck | 95 | 99 | 97
 | | | | Traffic | 100 | 100 | 100
 | Spectral centroid, SBW, spectral roll-off, chroma STFT, Min and Max Freq, IF, chirp rate, and spectral flux | SVM | 96.5 | Ambulance | 96 | 95 | 96
 | | | | Firetruck | 95 | 96 | 95
 | | | | Traffic | 100 | 100 | 100
 | | Stacked Ensemble | 99 | Ambulance | 100 | 99 | 99
 | | | | Firetruck | 99 | 100 | 99
 | | | | Traffic | 100 | 100 | 100
 | All frequency domain features | SVM | 99 | Ambulance | 100 | 99 | 99
 | | | | Firetruck | 99 | 100 | 99
 | | | | Traffic | 100 | 100 | 100
 | | Stacked Ensemble | 99 | Ambulance | 100 | 99 | 99
 | | | | Firetruck | 99 | 100 | 99
 | | | | Traffic | 100 | 100 | 100
 | IF, chirp rate, and spectral flux | SVM | 85.5 | Ambulance | 92 | 70 | 80
 | | | | Firetruck | 75 | 95 | 84
 | | | | Traffic | 100 | 98 | 99
 | | Stacked Ensemble | 89 | Ambulance | 93 | 79 | 85
 | | | | Firetruck | 81 | 95 | 87
 | | | | Traffic | 100 | 98 | 99
Time + frequency | All features | SVM | 99.5 | Ambulance | 100 | 99 | 99
 | | | | Firetruck | 99 | 100 | 99
 | | | | Traffic | 100 | 100 | 100
 | | Stacked Ensemble | 99.5 | Ambulance | 100 | 99 | 99
 | | | | Firetruck | 99 | 100 | 99
 | | | | Traffic | 100 | 100 | 100
Table 9. Comparative analysis of the existing literature concerning various audio detection and classification methods.

Ref. | Model/Method | Feature Extraction | Accuracy | Limitation
[4] | CNN | MFCC | 93% | Does not disclose the size and diversity of the data
[5] | SirenNet | MFCC | 98.24% | Limited to a specific dataset
[6] | Gaussian mixture model | Matching pursuit combined with MFCCs | 92% | High variance and randomness associated with environmental sounds
[7] | RF | Pitch correlogram and deep sparse matrix | 90% | Variability in speech affects recognition accuracy
[11] | Hidden Markov models | Harmonic envelope of candidate pitches | 91.86% | Limited to polyphonic music transcription
[14] | SVM and radial basis function neural network (NN) | Linear predictive coefficients, linear predictive cepstral coefficients, and MFCC | 93% | Sensitive to noise and environmental factors
[20] | KNN, SVM, and ensemble bagged trees | MFCC and Fourier decomposition | 98.49% | Single siren detection
[21] | Artificial NN | MFCC | 99% | External noise could impact recognition accuracy
[22] | Longest common subsequence | Min–max pattern analysis | 85% | The impact of background noise is not explicitly mentioned
[25] | Part-based models | Mel spectrograms | 86% | Performance of machine-learned PBMs reported for clean-condition training
[26] | CNN | MFCC | 91% | The model's ability to generalize to sounds not included in the training dataset is not demonstrated
[27] | Multi-layer perceptron | MFCC, spectral centroid, SBW, roll-off rate, ZCR, and chroma STFT | 90% | Limited dataset
[28] | YOLOv8 and NN | MFCC (audio) | 99.5% | Computational costs could be a concern
[29] | Majority voting mechanism | ZCR, spectral centroid, spectral roll-off point, and MFCC | 95% | Background noise could affect signal recognition accuracy
[30] | Multi-level spatial fusion YOLO | Attention-based temporal spectrum | 96.19% | Requires multimodal data
