Article

Enhancing Emergency Vehicle Detection: A Deep Learning Approach with Multimodal Fusion

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 EIAS Data Science and Blockchain Laboratory, College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
3 School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(10), 1514; https://doi.org/10.3390/math12101514
Submission received: 7 April 2024 / Revised: 3 May 2024 / Accepted: 6 May 2024 / Published: 13 May 2024

Abstract:

Emergency vehicle detection plays a critical role in ensuring timely responses and reducing accidents in modern urban environments. However, traditional methods that rely solely on visual cues face challenges, particularly in adverse conditions. The objective of this research is to enhance emergency vehicle detection by leveraging the synergies between acoustic and visual information. By incorporating advanced deep learning techniques for both acoustic and visual data, our aim is to significantly improve the accuracy and response times. To achieve this goal, we developed an attention-based temporal spectrum network (ATSN) with an attention mechanism specifically designed for ambulance siren sound detection. In parallel, we enhanced visual detection tasks by implementing a Multi-Level Spatial Fusion YOLO (MLSF-YOLO) architecture. To combine the acoustic and visual information effectively, we employed a stacking ensemble learning technique, creating a robust framework for emergency vehicle detection. This approach capitalizes on the strengths of both modalities, allowing for a comprehensive analysis that surpasses existing methods. Through our research, we achieved remarkable results, including a misdetection rate of only 3.81% and an accuracy of 96.19% when applied to visual data containing emergency vehicles. These findings represent significant progress in real-world applications, demonstrating the effectiveness of our approach in improving emergency vehicle detection systems.

1. Introduction

In real-world scenarios, the failure to detect emergency vehicles can pose significant threats to road safety. The timely detection of these vehicles is crucial for ensuring prompt medical assistance, as any delay in detection can result in more severe injuries or loss of life [1]. Extensive research consistently highlights the significance of precise detection systems, particularly in emergencies where even minor delays can significantly compromise patient care. Therefore, detection systems that are both reliable and capable of operating under varying environmental conditions and scenarios are necessary to effectively address the risks associated with the failure to detect emergency vehicles [2].
An emergency vehicle-related accident can be the result of various factors, including drivers who are distracted while operating their vehicles, the effects of in-car technologies, incidents where cars run red lights, and the requirement for efficient pedestrian crash prevention [3]. In addition to these factors, a significant contributor to such incidents is the lack of awareness among non-emergency vehicle drivers about the presence of approaching emergency vehicles. This lack of awareness may occur when emergency vehicles are outside the visual range of non-emergency drivers or when the sound of emergency vehicle sirens is hindered by external noise interference, the utilization of in-vehicle audio systems, or the driver’s own distractions. With this objective, the current study focused on developing real-time and precise systems for emergency vehicle detection (EVD).
EVD relies on advanced computer vision [4] and acoustic processing technologies for the rapid identification and classification of automobiles such as police cars, fire engines, and ambulances. Vision-based systems, including YOLO, ensure real-time visual recognition, while acoustic-based solutions like the ATSN excel in auditory cue discrimination, enhancing emergency response efficiency. Considering the potential drawbacks of the acoustic and visual EVD systems when used independently, especially in situations with noisy traffic and unfavorable weather conditions, we combined both acoustic and visual EVD into a multimodal system to concurrently detect siren sounds and emergency vehicle objects.
Vision-based emergency vehicle detection (VEVD) has received limited attention in research thus far, with only a few studies [5,6,7] dedicated to this area. A review conducted in [8] examined different object detection algorithms and highlighted the suitability of YOLO [5] for vision-based EVD due to its fast processing speed. In addition, the survey in [9] provided a comprehensive study of deep learning (DL) methods for vehicle detection. Comparing DL methods with traditional techniques, the authors noted that DL models improve efficiency by automating feature extraction, which is particularly valuable given the large size of vehicle detection datasets. They also reported that DL models outperform other approaches in terms of accuracy and time through techniques such as the use, modification, and fine-tuning of pre-trained models and the optimization of hyperparameters. In [10], a two-stage VEVD system was introduced that applies a vehicle classifier after an object detector; its accuracy reached 91.6% when evaluated on a small, purpose-built dataset. An alternative two-stage VEVD system following a similar approach was introduced in [6]. The work in [11] proposed a real-time vehicle detection method based on the YOLO-v5 architecture to balance speed and accuracy. To achieve accurate results, especially in congested areas, the authors gathered a new dataset covering different weather conditions and both low-density and high-density traffic and then fine-tuned YOLO-v5 on it, initializing the model with pre-trained COCO weights. The paper highlighted the effectiveness and potential of the proposed model while acknowledging the large amount of training data required. The results showed that the model works well in challenging conditions such as night, rain, and snow and that it achieves significantly higher accuracy and speed than competing methods.
Previous works on VEVD exhibit common limitations. First, the object detectors used in the vehicle detection stage of [6,10] were trained on the ImageNet [12] and MS COCO [7] datasets, which were not designed for the EVD problem and cover an extensive array of object classes that may not be directly relevant. Since these datasets include no emergency vehicle-related objects, a mismatch can arise between the vehicle detection and classification stages, as the features extracted by the vehicle detector may not fully capture the discriminative information of emergency vehicles. Second, the two-stage systems are computationally inefficient because they employ an additional vehicle classifier, failing to leverage the full advantages of object detection for both detection and classification. In terms of system evaluation, refs. [6,10] tested the vehicle detector and vehicle classifier separately on different datasets, without a precise evaluation of the complete two-stage system. Recognizing these limitations, this work addresses the VEVD problem more comprehensively by proposing a single-stage VEVD system that offers fast and accurate predictions. Our main priority was to use state-of-the-art DL-based object detection algorithms for the development of the VEVD system, rather than relying on conventional techniques commonly found in the existing literature [13].
Existing audio-based emergency vehicle detection (AEVD) systems suffer from several limitations. Firstly, they often rely on small experimental datasets consisting of simulated, replicated, and on-scene recordings. This approach neglects the variety of siren types and specifications, limiting the system's ability to generalize to real-world applications. Secondly, previous works [14,15,16,17,18] often rely on shallow learning algorithms trained on handcrafted features or on microcontrollers employed for signal processing tasks. As a result, these systems exhibit inferior detection accuracies, typically falling below 90%, and suffer from computational inefficiency. For instance, the system described in [16] requires 8 s to complete a single detection. To address these limitations, this study develops an AEVD system that incorporates an attention mechanism, following the groundwork provided in [19], to increase the accuracy and efficiency of our AEVD model. We carefully modified the existing approach to address the limitations of previous systems, incorporating novel techniques and methodologies. By leveraging this approach, the proposed model achieves superior accuracy and faster detection. Additionally, the model was trained on a diverse dataset that captures the differences in siren types and characteristics, enhancing its performance and practical applicability.
This study not only introduces innovative techniques for enhancing emergency vehicle detection but also focuses on improving the accuracy and efficiency of detection methods. By using integrated learning with acoustic and visual datasets, this technology represents a major advancement in emergency response system development. The practical applications of this study's findings open the path for the advancement of EVD systems. By incorporating them into private vehicle driver-assistance systems, drivers can receive vital notifications, which are especially helpful for those with hearing impairments. Applications also include improving safety features in smart traffic infrastructure and strengthening emergency response in self-driving cars [20].
We utilized an attention-based temporal spectrum network for ambulance siren sound detection. Concurrently, we improved visual detection using an enhanced YOLO architecture. Fusing acoustic and visual information, we employed a stacking ensemble learning technique to create a robust framework for emergency vehicle detection. This approach leverages the strengths of both modalities, resulting in a superior performance compared to those of existing methods. Listed below are the main contributions of this research:
  • This study proposes a stacking ensemble method to fuse acoustic and visual information, aiming to improve emergency vehicle detection under adverse conditions.
  • This study incorporated a multi-level spatial fusion technique into YOLO to accommodate the deep-level semantic information required for multi-modal fusion.
  • This study proposes an attention-based temporal spectrum network to aid in extracting semantic features for siren sound classification.
The remainder of this paper is organized as follows: Section 2 outlines the methodologies proposed within this research; Section 3 presents the experiments and results; Section 4 presents the conclusions; and Section 5 discusses limitations and future work.

2. Materials and Methods

In this study, we employed an attention-based temporal spectrum network for ambulance siren sound classification. Simultaneously, we enhanced the visual classification task using an improved YOLOv4 architecture. For the fusion of the acoustic and visual results, we utilized a stacking ensemble learning technique, creating a robust framework for emergency vehicle detection. This approach capitalizes on the strengths of both modalities, ensuring a comprehensive analysis that outperforms those of existing methods. Figure 1 displays the proposed framework.

2.1. Detection of Emergency Vehicles Using Visual Cues

Proposing the MLSF-YOLO Model for Visual-Based EVD

The basic concept of YOLO is to create a grid of $S \times S$ cells from an input image. For the $i$th cell ($1 \leq i \leq S^2$), if an object's center falls inside that cell, the responsibility for detecting that object lies with it. Modern YOLO iterations incorporate multiscale detection strategies, dividing the source image into several grids aligned with distinct levels of detection [21].
As shown in Figure 2a, the YOLO model consists of three essential parts: the prediction head [22], a neck used for structuring stages, and a backbone used for feature extraction. The model is based on the YOLOv4 architecture [23], incorporating the CSPDarknet53 backbone [24] and a multiscale detection method. To enhance detection accuracy, we introduced cross-stage partial (CSP) connections [25], augmenting the model's learning capacity. Figure 2a illustrates MLSF-YOLO, which is built from elements such as the CBM block; as shown in Figure 2b, the CBM comprises a convolutional layer for feature extraction, batch normalization for standardizing data, and the Mish activation function, which adds non-linearity to the network. The remaining building blocks are the ResUnit, the Branched Cross-Stage Partial (BCSP) connection, the Nested Cross-Stage Partial (NCSP) connection, the Spatial Pooling Layer (SPL), and the Multi-Level Spatial Fusion (MLSF) module, which combines a spatial pooling layer with a cross-stage partial connection. As shown in Figure 2f, the ResUnit prevents network degradation with increased depth. A CSP connection divides a given stage's feature map into two sections: one that is subjected to network block processing (the sequence of ResUnits) and another that is integrated into the transmitted feature map to produce the stage result. There are five CSP elements in the advanced YOLO backbone, labeled BCSP-n, with 'n' signifying the number of ResUnits in a BCSP-n unit, as shown in Figure 2e. To initiate the detector's neck, we used a modified Spatial Pooling Layer (SPL) [26] module to expand the receptive field. MLSF-YOLO integrates the SPL module within a CSP component, creating the MLSF module shown in Figure 2d at the start of the neck. For feature integration, we employed a path aggregation network (PANet) [27], using the PANet's outputs to enable three-scale detection. Within the detector neck, there are NCSP-n components; within the transition pathway, NCSP-n consists of n CBM elements, as shown in Figure 2g, instead of the originally used n ResUnits. Below, we explain the operation of emergency vehicle detection at a single prediction scale.
When MLSF-YOLO is used in our object detection model, the source image is segmented into a grid of size $S \times S$, and every grid cell is tasked with forecasting $B$ bounding boxes. Each box is defined by four key components: $x_b$, the x-coordinate of the bounding box center, computed as the sigmoid of $a$ plus an offset $p$; $y_b$, the y-coordinate of the center, determined by the sigmoid of $b$ plus an offset $q$; $w_b$, the width, computed by multiplying a factor $x$ with the sigmoid of $c$; and $h_b$, the height, found by multiplying a factor $y$ with the sigmoid of $d$. Equations (1) to (4) collectively enable precise object localization within the gridded image segmentation:
$$x_b = \sigma(a) + p \quad\quad (1)$$
$$y_b = \sigma(b) + q \quad\quad (2)$$
$$w_b = x \cdot \sigma(c) \quad\quad (3)$$
$$h_b = y \cdot \sigma(d) \quad\quad (4)$$
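For illustration, the short Python sketch below applies Equations (1)–(4) to a single raw prediction. It is a minimal illustration only, and the function and variable names (e.g., `decode_box`) are assumptions rather than part of the actual MLSF-YOLO implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(a, b, c, d, p, q, x, y):
    """Illustrative decoding of one raw prediction (a, b, c, d) into a box.

    p, q are the grid-cell offsets and x, y are the scaling factors used in
    Equations (1)-(4); all names here are hypothetical.
    """
    x_b = sigmoid(a) + p   # Equation (1): box center, x-coordinate
    y_b = sigmoid(b) + q   # Equation (2): box center, y-coordinate
    w_b = x * sigmoid(c)   # Equation (3): box width
    h_b = y * sigmoid(d)   # Equation (4): box height
    return x_b, y_b, w_b, h_b

# Example: raw outputs for a cell at grid offset (3, 5) with scale factors (2.1, 1.4)
print(decode_box(0.2, -0.4, 0.1, 0.3, p=3, q=5, x=2.1, y=1.4))
```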
We employed a weighted loss function, given in Equation (13), that combines three key elements: localization loss ($L_{CIoU}$), objectness loss ($L_{Obj}$), and classification loss ($L_{Cls}$), with weighting factors $\lambda_1$, $\lambda_2$, and $\lambda_3$. This loss encapsulates the core of our loss computation, emphasizing localization precision, objectness estimation, and class-specific classification while allowing for adjustable weighting to reflect their relative importance.
In terms of localization loss, we employed the CIoU loss [28] as specified in Equation (5). This loss considers the IoU between the predicted and actual bounding boxes, incorporating factors such as their spatial relationship ($\rho_c$) and aspect ratio ($\beta\alpha$). Equation (6) defines $\alpha$ as a trade-off parameter based on the expected box size ($S$) and the IoU. Equation (7) introduces $\nu$ to assess aspect ratio coherence by comparing $\alpha$ and $\beta$. These equations are crucial for accurate bounding box evaluation and optimization in object detection.
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{(\rho_c)^2}{\rho} - \beta\alpha\left(\frac{S}{C_{\mathrm{IoU}}}\right) \quad\quad (5)$$
$$\alpha = \frac{S}{\mathrm{IoU} + S} \quad\quad (6)$$
$$\nu = \frac{\alpha}{\beta} \quad\quad (7)$$
$$L_{\mathrm{obj}} = \sum_{x=1}^{S \times S} \sum_{y=1}^{S \times S} A_{xy} \cdot \max(\mathrm{IoU}_y) \quad\quad (8)$$
In Equation (8), the variables $x$ and $y$ represent the grid cell and bounding box indices, respectively. The binary indicator $A_{xy}$ takes a value of 1 when there are no objects in the detection field of the corresponding grid cell and 0 when an object is detected, and $\max(\mathrm{IoU}_y)$ represents the highest Intersection over Union (IoU) score between bounding box $y$ and the ground truth box. This formulation ensures that the loss function penalizes localization errors only in grid cells where objects are present and only for bounding boxes matched to the ground truth box.
In our methodology, we present the objectness loss ($L_{Obj}$) as a fundamental component of the image analysis evaluation. It is calculated as the sum of binary cross-entropy (BCE) losses for objectness scores within grid cells. This loss comprises two elements: one related to grid cells containing objects and the other concerning cells without objects. We employ a binary indicator $A_{xy}$, which takes the value 1 when there are no objects in the grid cell $(x, y)$ and 0 when an object is present. Additionally, we introduce the weight factor $\lambda_{\mathrm{noobj}}$ to regulate the impact of the loss in grid cells lacking objects. The specific formula for the objectness loss is given in Equation (9).
$$L_{Obj} = \sum_{x=1}^{N} \sum_{y=1}^{M} \mathrm{BCE}(OS_{xy}) \cdot (1 - A_{xy}) \cdot \lambda_{\mathrm{noobj}} \quad\quad (9)$$
where
$$\mathrm{BCE}(OS_{xy}) = -\,OS_{xy}^{*}\log(OS_{xy}) - \left(1 - OS_{xy}^{*}\right)\log\left(1 - OS_{xy}\right) \quad\quad (10)$$
Equation (11) evaluates the classification loss. We apply the same penalization method as we do for the localization loss.
$$L_{Cls} = \sum_{x=1}^{S \times S} \sum_{y=1}^{B} \delta_{xy}^{obj} \sum_{c \in \mathrm{classes}} \mathrm{BCE}\big(p_{xy}(c)\big) \quad\quad (11)$$
where
$$\mathrm{BCE}\big(p_{xy}(c)\big) = -\Big(p_{xy}^{*}(c)\log\big(p_{xy}(c)\big) + \big(1 - p_{xy}^{*}(c)\big)\log\big(1 - p_{xy}(c)\big)\Big) \quad\quad (12)$$
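To make the composition of the training objective explicit, the following minimal Python sketch combines precomputed component losses using the weighting scheme of Equation (13) and the BCE form of Equations (10) and (12). It is illustrative only and does not reproduce the actual MLSF-YOLO training code; the lambda values are placeholders.

```python
import numpy as np

def bce(p_true, p_pred, eps=1e-7):
    """Binary cross-entropy for a single probability (the form used in
    Equations (10) and (12)); eps guards against log(0)."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    return -(p_true * np.log(p_pred) + (1.0 - p_true) * np.log(1.0 - p_pred))

def total_loss(l_ciou, l_obj, l_cls, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss components (Equation (13)).
    The lambda weights here are placeholders, not the paper's settings."""
    l1, l2, l3 = lambdas
    return l1 * l_ciou + l2 * l_obj + l3 * l_cls

print(bce(1.0, 0.9), total_loss(0.4, 0.2, 0.1))
```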

2.2. Classifying Emergency Vehicles Using Acoustic Signals

2.2.1. Proposed Framework for Acoustic Classification

This paper introduces the attention-based temporal spectrum network to better understand selective time–frequency features in acoustic spectrograms.
Figure 3 displays the comprehensive model architecture. The model has two primary components: an attention-generating module and the backbone network. By applying two different attention processes to the Log-Mel spectrogram, the attention-generating module concentrates computational power on particular time intervals and frequency ranges. The backbone network is composed of pooling, fully connected, and convolutional layers to extract key time–frequency features and includes an attention mechanism for sound event prediction. In the concluding evaluation stage, a probability-based voting approach is employed to consolidate predictions from various acoustic segments, thereby mitigating classification errors arising from outlier data points.

2.2.2. Feature Processing

The “Log-Mel spectrogram” is highly regarded in the field of acoustic recognition because of its ability to replicate the human auditory system. It furnishes a two-dimensional feature map for each acoustic frame, capturing temporal and spectral attributes in both the time and frequency domains. In this study, we employed the Log-Mel spectrogram to extract time–frequency features from emergency vehicle sound events, enhancing classification accuracy through the use of convolutional neural networks (CNNs).
Our methodology commences with the standardization of the acoustic data format into mono. Subsequently, a 46 ms Hamming window with a 50% overlap is applied, and a Short-Time Fourier Transform (STFT) is executed to derive the amplitude spectrogram. This spectrogram is further processed using a 128-band Mel filter bank and logarithmic scaling to generate the Log-Mel spectrogram. To better align with the attention mechanism and the available training data, the Log-Mel spectrogram is divided into 64-frame segments with a 50% overlap, with zero padding applied to shorter segments. This yields Log-Mel feature vectors with dimensions of $1 \times 64 \times 128$, signifying channel × time × frequency.
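As an illustration of this feature pipeline, the sketch below uses the librosa library to compute a Log-Mel spectrogram with roughly the stated settings (mono audio at 22,050 Hz, a 46 ms Hamming window with 50% overlap, 128 Mel bands) and to split it into 64-frame segments. The helper name and the exact rounding of the window length are assumptions, not the authors' code.

```python
import numpy as np
import librosa

def log_mel_segments(path, sr=22050, win_ms=46, n_mels=128, seg_frames=64):
    """Illustrative Log-Mel pipeline: mono audio at 22,050 Hz, ~46 ms Hamming
    window with 50% overlap, 128 Mel bands, log scaling, then 64-frame
    sub-segments with 50% overlap (zero-padded at the tail)."""
    y, _ = librosa.load(path, sr=sr, mono=True)      # force mono, resample
    n_fft = int(sr * win_ms / 1000)                  # ~46 ms window length
    hop = n_fft // 2                                 # 50% overlap
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)               # shape: (128, T)

    segments, step = [], seg_frames // 2             # 50% overlap between segments
    for start in range(0, max(1, log_mel.shape[1]), step):
        seg = log_mel[:, start:start + seg_frames]
        if seg.shape[1] < seg_frames:                # zero-pad shorter segments
            seg = np.pad(seg, ((0, 0), (0, seg_frames - seg.shape[1])))
        segments.append(seg.T[np.newaxis])           # (1, 64, 128): channel x time x freq
        if start + seg_frames >= log_mel.shape[1]:
            break
    return np.stack(segments)
```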
$$\mathrm{Loss} = \lambda_1 L_{CIoU} + \lambda_2 L_{Obj} + \lambda_3 L_{Cls} \quad\quad (13)$$

2.2.3. Harmonic–Percussive Source Separation

In our investigative analysis, we seamlessly integrated the HPSS (Harmonic–Percussive Source Separation) algorithm [29] for input Log-Mel spectrogram processing. This method enabled us to efficiently divide the spectrograms into two fundamental constituents: harmonic spectrograms and percussive spectrograms. This approach offers a transparent and informative depiction of the frequency distribution and activity within discrete frequency bands present in the acoustic data.
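A minimal sketch of this separation step is shown below, assuming the librosa implementation of HPSS and a synthetic stand-in signal; the subsequent conversion of each component to a Log-Mel representation is our illustrative choice rather than a detail stated in the text.

```python
import numpy as np
import librosa

# Synthetic stand-in signal (a pure tone) so the sketch is self-contained.
sr = 22050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
y = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# STFT followed by median-filtering HPSS: split into harmonic and percussive parts.
S = librosa.stft(y, n_fft=1024, hop_length=512, window="hamming")
S_harmonic, S_percussive = librosa.decompose.hpss(S)

# Illustrative follow-up: a Log-Mel view of each component (our choice, not a stated detail).
mel_h = librosa.power_to_db(librosa.feature.melspectrogram(S=np.abs(S_harmonic) ** 2, sr=sr, n_mels=128))
mel_p = librosa.power_to_db(librosa.feature.melspectrogram(S=np.abs(S_percussive) ** 2, sr=sr, n_mels=128))
```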

2.2.4. Creating a Time–Frequency Attention Mechanism

The accurate detection of emergency vehicles from audio signals in real-world traffic environments is significantly impacted by background noise. Our methodology tackles this challenge by integrating advanced noise suppression techniques to enhance the classification accuracy. Specifically, we leveraged sophisticated attention mechanisms that dynamically adjust to prioritize informative portions of the audio signal while suppressing irrelevant background noise. Additionally, state-of-the-art signal processing algorithms, such as spectral subtraction and adaptive filtering, are applied to further improve the signal-to-noise ratio. Through these comprehensive approaches to addressing the audio noise in traffic, our classification framework exhibits a robust performance, enabling the reliable and precise detection of emergency vehicles. In complex, polyphonic acoustic scenarios with multiple concurrent sound sources, our approach employs both temporal and frequency attention mechanisms. Temporal attention emphasizes the importance of semantic segments while mitigating noise, whereas frequency attention highlights critical frequency bands while reducing less relevant ones. This approach enhances the extraction of crucial acoustic features in challenging acoustic environments, ensuring that our model remains effective even with diverse levels of background noise in real-world traffic conditions. By seamlessly integrating advanced noise suppression techniques with sophisticated attention mechanisms, our classification framework offers a robust solution for accurate emergency vehicle detection in complex acoustic environments.
As shown in Figure 4, the Log-Mel spectrogram $X$ undergoes standardization and processing through the HPSS algorithm to yield harmonic and percussive spectrograms; $(1 \times 3)$ and $(5 \times 1)$ kernel convolutions are then applied to extract features, leading to dimension reduction. A final $(1 \times 1)$ convolution condenses channel information, resulting in one-dimensional matrices $A_f(F, 1)$ and $A_t(1, T)$, which are normalized through the softmax function to produce the frequency and temporal weight matrices $F_w$ and $T_w$:
$$F_w(f) = \mathrm{Softmax}\big(A_f(f)\big)$$
$$T_w(t) = \mathrm{Softmax}\big(A_t(t)\big)$$
The Log-Mel spectrogram, denoted as $X$, undergoes element-wise multiplication with the attention weight matrices acquired from the frequency and time directions. This operation yields two distinct outputs: the frequency attention spectrogram, referred to as $S_F$, and the temporal attention spectrogram, labeled $S_T$. This transformation enhances the representation of the Log-Mel spectrogram by selectively attending to important frequency and temporal components, enabling the model to focus on relevant acoustic features for emergency vehicle detection. The resulting frequency and temporal attention spectrograms, $S_F$ and $S_T$, serve as crucial inputs for subsequent stages of our proposed framework. The mathematical formulation of this transformation can be expressed as follows:

$$S_F(f, t) = X(f, t) \cdot F_w(f)$$
$$S_T(f, t) = X(f, t) \cdot T_w(t)$$
In our initial ‘average combination’ approach, we created two attention spectrograms: $S_F$ for frequency emphasis and $S_T$ for temporal details. These spectrograms are applied to the Log-Mel spectrogram and equally combined into the temporal–frequency attention spectrogram $S_{T\&F}$ using the equation below:

$$S_{T\&F} = \frac{S_F + S_T}{2}$$
The ‘weighted combination’ approach introduces two adjustable network parameters, denoted as $\alpha$ and $\beta$, with the constraint that $\alpha + \beta = 1$. This approach synthesizes the ultimate temporal–frequency attention spectrogram, denoted as $S_{T\&F}^{Weight}$, by combining the two attention spectrograms, $S_F$ and $S_T$, in the proportions determined by the learnable parameters. Mathematically, this procedure can be expressed as follows:

$$S_{T\&F}^{Weight} = \alpha \cdot S_F + \beta \cdot S_T$$
The final approach, known as “channel combination”, integrates the two generated attention spectrograms, $S_F$ and $S_T$, by concatenating them to create a two-channel output:

$$S_{T\&F}^{Channel} = \mathrm{concatenate}(S_T, S_F)$$
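The sketch below summarizes the three combination strategies in NumPy, assuming the condensed attention outputs $A_f$ and $A_t$ have already been produced by the convolutional attention-generating module; it is an illustration of the equations above, not the network code itself.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_spectrograms(X, A_f, A_t, alpha=0.5, beta=0.5):
    """X: Log-Mel spectrogram of shape (F, T); A_f (F,) and A_t (T,) are the
    condensed outputs of the 1x1 convolutions (assumed precomputed here)."""
    F_w = softmax(A_f)                    # frequency weight matrix
    T_w = softmax(A_t)                    # temporal weight matrix
    S_F = X * F_w[:, None]                # frequency attention spectrogram
    S_T = X * T_w[None, :]                # temporal attention spectrogram

    S_avg = (S_F + S_T) / 2.0             # 'average combination'
    S_weight = alpha * S_F + beta * S_T   # 'weighted combination', alpha + beta = 1
    S_channel = np.stack([S_T, S_F])      # 'channel combination' (two channels)
    return S_avg, S_weight, S_channel

# Example with random placeholders for a 128 x 64 Log-Mel segment
X = np.random.rand(128, 64)
S_avg, S_weight, S_channel = attention_spectrograms(X, np.random.rand(128), np.random.rand(64))
```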

2.2.5. Network Structure

The architecture presented in this study, termed the ATSN, comprises six convolutional layers, three pooling layers, and two fully connected layers. The convolutional layers are organized into pairs, where each pair shares identical parameters, effectively forming a block, and every block is followed by a $2 \times 2$ max-pooling layer. Thirty-two $5 \times 3$ kernels with a stride of 2 are used in the first two convolutional layers. The subsequent four convolutional layers are equipped with 128 and 64 kernels, each measuring $3 \times 3$ and employing a stride of 1. Finally, the network incorporates two fully connected layers, each with 256 hidden units, for processing the flattened output, and the final result is fed into a Softmax classifier to produce predictions. The architecture uses the ReLU activation function, employs batch normalization at each convolutional layer to speed up training, and applies a dropout mechanism with a probability of 0.5 in the fully connected layers to prevent overfitting.
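A Keras sketch of this configuration is given below. Layer counts and sizes follow the description above, while details the text does not specify (padding, the exact filter ordering of the later blocks, and the loss wiring) are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_atsn_backbone(input_shape=(64, 128, 1), n_classes=2):
    """Sketch of the described layer stack; padding, the filter ordering of the
    later blocks, and the loss wiring are assumptions."""
    def conv_block(x, filters, kernel, stride):
        for _ in range(2):                               # layers come in identical pairs
            x = layers.Conv2D(filters, kernel, strides=stride, padding="same")(x)
            x = layers.BatchNormalization()(x)           # batch norm at every conv layer
            x = layers.ReLU()(x)
        return layers.MaxPooling2D(pool_size=(2, 2))(x)  # one 2x2 max-pool per block

    inputs = layers.Input(shape=input_shape)             # channels-last: time x freq x 1
    x = conv_block(inputs, 32, (5, 3), 2)                # first pair: 32 kernels, 5x3, stride 2
    x = conv_block(x, 64, (3, 3), 1)                     # later pairs: 3x3, stride 1
    x = conv_block(x, 128, (3, 3), 1)                    # (64/128 ordering assumed)

    x = layers.Flatten()(x)
    for _ in range(2):                                   # two fully connected layers, 256 units
        x = layers.Dense(256, activation="relu")(x)
        x = layers.Dropout(0.5)(x)                       # dropout 0.5 against overfitting
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_atsn_backbone()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),   # settings reported in Section 3.1
              loss="binary_crossentropy", metrics=["accuracy"])
```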

2.2.6. Decision Approach

The stage of feature processing includes segmenting the log-Mel spectrogram into sub-segments of 64 frames each, with a 50% overlap between them. Each sub-segment inherits its category label from the acoustic source. During the training stage, individual sub-segments are used as inputs for network training, leading to predictions for their respective categories. As we reach the concluding testing phase, the aim is to make predictions for the complete acoustic segment to determine its category. This involves employing a probabilistic voting strategy to aggregate predictions from multiple sub-segments for the final decision, as visualized in Figure 5.
The mathematical representation of this strategy is as follows:
$$D = \underset{1 \le i \le H}{\arg\max}\;\frac{1}{Z}\sum_{k=1}^{Z} P_k^{\,i}$$

Here, $Z$ denotes the number of sub-segments allocated to each acoustic sample, $H$ reflects the dataset's category count, and $P_k^{\,i}$ indicates the prediction outcome of an individual sub-segment $k$ for category $i$.
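A compact sketch of this voting rule, assuming the per-sub-segment class probabilities have already been collected, is shown below.

```python
import numpy as np

def probabilistic_vote(segment_probs):
    """segment_probs: array of shape (Z, H), one probability vector per
    sub-segment. The clip-level decision averages the per-segment class
    probabilities and takes the argmax over the H categories."""
    mean_probs = segment_probs.mean(axis=0)   # (1/Z) * sum over sub-segments
    return int(np.argmax(mean_probs)), mean_probs

# Example: three sub-segments, two classes (siren vs. noise)
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]])
label, avg = probabilistic_vote(probs)
```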

2.3. Harmonizing Hard and Soft Predictions

The method of stacking involves training a meta-model that combines the output of several models to produce an ultimate prediction. This tactic works particularly well when combining predictions from several models, such as those for visual detection and acoustic categorization [30]. The primary aim is to enhance the capability of the model to generalize and complete the execution of EVD. Stacking leverages the unique strengths of each model, presenting an effective approach for addressing diverse aspects of the classification problem. This approach results in predictions that are not only more accurate but also notably more reliable. This study features the implementation of the stacking ensemble learning technique. This highly effective ensemble method integrates outcomes from a range of independent models, including a dedicated ATSN classification model and MLSF-YOLO model tailored for the precise identification of emergency vehicles.

2.3.1. Hard Prediction

For a binary prediction, each model produces a discrete class label as its output. The stacking ensemble then draws conclusions based on both the predictions of these models and the corresponding ground-truth labels. In this approach, we use a meta-classifier, $M_{\mathrm{hard}}$, which uses the learned information to generate a precise binary prediction. Given a data record $x$ as input, the stacking ensemble combines the output of the acoustic classification model, denoted $h(x)$, with that of the visual detection approach, denoted $g(x)$. The meta-classifier then calculates the final binary prediction, $y_{\mathrm{hard}}$, as stated in the following equation:
$$y_{\mathrm{hard}} = M_{\mathrm{hard}}\big(h(x), g(x)\big)$$

2.3.2. Soft Prediction

For soft prediction, each model produces a probability distribution across the class categories. Similarly, the stacking ensemble uses this information together with the true labels to train an additional meta-classifier, $M_{\mathrm{soft}}$. As stated in Equation (23), the stacking ensemble considers an input data point $x$ and integrates the predictions of the acoustic classification model, denoted $h_s(x)$, with those of the visual detection model, denoted $g_s(x)$:
$$y_{\mathrm{soft}} = M_{\mathrm{soft}}\big(h_s(x), g_s(x)\big) \quad\quad (23)$$
We aim to leverage each model's distinct strengths so that the overall performance may be enhanced and the classification domain can be effectively generalized. The use of this ensemble technique considerably improves prediction reliability and accuracy, which is crucial for differentiating between emergency vehicles (EVs) and non-emergency vehicles (NEVs) in our research.
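The following sketch illustrates the stacking step with scikit-learn, using a logistic regression meta-classifier as a placeholder (the text does not name the specific meta-classifier) and synthetic base-model outputs for $h(x)$, $g(x)$, $h_s(x)$, and $g_s(x)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic base-model outputs collected on a held-out set (placeholders).
acoustic_hard = np.array([1, 0, 1, 1, 0])                  # h(x): ATSN class labels
visual_hard   = np.array([1, 0, 0, 1, 0])                  # g(x): MLSF-YOLO class labels
acoustic_soft = np.array([0.92, 0.10, 0.80, 0.70, 0.20])   # h_s(x): EV probability
visual_soft   = np.array([0.88, 0.05, 0.40, 0.95, 0.15])   # g_s(x): EV probability
y_true        = np.array([1, 0, 1, 1, 0])                  # ground-truth EV / NEV labels

# Hard stacking trains on predicted labels; soft stacking trains on probabilities.
M_hard = LogisticRegression().fit(np.column_stack([acoustic_hard, visual_hard]), y_true)
M_soft = LogisticRegression().fit(np.column_stack([acoustic_soft, visual_soft]), y_true)

# Fused predictions for a new sample x with h(x)=1, g(x)=0, h_s(x)=0.75, g_s(x)=0.35
y_hard = M_hard.predict([[1, 0]])
y_soft = M_soft.predict([[0.75, 0.35]])
```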

3. Experiments and Results

In our study, we performed our experiments on a system equipped with an Intel Core i7-14650HX CPU running at 2.30 GHz, 16 GB of RAM, and a 1 TB SSD for data storage. To accelerate computations, the setup also incorporated an Nvidia GeForce RTX 4060 GPU. We used the TensorFlow framework within a Python 3.9 development environment.

3.1. Hyperparameters Used

For the ATSN utilized in acoustic classification, we employed an initial learning rate of 0.001 and utilized the Adam optimizer for weight updates. The choice of the binary cross-entropy loss function was determined to suit the binary classification task inherent in our acoustic signal classification. Moreover, we incorporated batch normalization at every layer to expedite training and mitigate overfitting, while dropout regularization was applied specifically to fully connected layers to further prevent overfitting.
In the case of the improved YOLOv4 architecture, termed MLSF-YOLO, we adjusted the batch size to 16 and initialized training with a starting learning rate of 0.01. The momentum was set to 0.9, and we integrated a weight decay mechanism at a rate of 0.0005 to alleviate overfitting. The choice of the stochastic gradient descent (SGD) optimizer with momentum and weight decay was based on its proven effectiveness in optimizing object detection models. Furthermore, we adopted the Mish activation function throughout the MLSF-YOLO model, departing from the original approach of using leaky-ReLU and Mish activation solely for the backbone. This decision was made after comprehensive experimentation, considering Mish’s superior performance in facilitating gradient flow and convergence during training.
We conducted numerous experiments involving various combinations of hyperparameters, optimizers, and activation functions to optimize model performance. Hyperparameters such as the learning rate, batch size, and weight decay were carefully tuned to ensure that the models were adequately trained. Different optimizers, including Adam and SGD with momentum, were employed to optimize the learning process and minimize the loss function. Additionally, various activation functions, such as ReLU and Mish, were tested to introduce non-linearity and enhance the models' representational capacity. Through rigorous experimentation, we observed the effects of parameter changes on model performance, allowing us to fine-tune our approach. Ultimately, our experiments revealed that the Adam optimizer paired with the ReLU activation function yielded the best results for the ATSN model, while SGD with momentum combined with the Mish activation function proved to be the optimal choice for the MLSF-YOLO model, as shown in Table 1.
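Expressed in Keras, the reported optimizer settings look roughly as follows; this is a configuration sketch only, and learning-rate schedules or other unstated details are omitted.

```python
import tensorflow as tf

# ATSN: Adam optimizer with an initial learning rate of 0.001.
atsn_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# MLSF-YOLO: SGD with momentum 0.9 and learning rate 0.01; the 0.0005 weight decay
# is passed via the weight_decay argument available in recent Keras releases
# (older versions would apply it through kernel regularization instead).
mlsf_yolo_optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.9, weight_decay=0.0005)
MLSF_YOLO_BATCH_SIZE = 16
```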

3.2. Data for Visual EVD Experiments

Collecting data for VEVD experiments presents several challenges. One primary hurdle is the restricted accessibility of roadside traffic camera data due to legal constraints, making it unavailable to the public. Additionally, emergency vehicle sightings on roads are infrequent compared to non-emergency vehicles under standard traffic conditions. To address this, we opted to acquire data from YouTube, a widely used video-sharing platform. We selected relevant videos from YouTube channels specializing in emergency vehicle content, extracted images from these videos, and subsequently categorized them. Only two object classes, NEVs and EVs, were the focus of our annotation efforts.
Table 2 provides an overview of our visual-based emergency vehicle detection data, consisting of videos from various scenes. We extracted 19,144 frames, containing 23,448 emergency vehicle instances and 39,895 non-emergency vehicle instances, resulting in a visual emergency vehicle detection dataset with an average of 3.25 instances per frame. In the dataset, there are three distinct types of emergency vehicles: fire trucks (8129), police cars (6699), and ambulances (7920), with a total count of 22,748. The visual dataset was further partitioned into training and testing subsets, with a division ratio of 75:25. Additionally, we enriched the dataset’s diversity by incorporating the KITTI dataset [31], which encompasses a wide array of objects, both vehicular and non-vehicular. From the original KITTI dataset, we exclusively extracted images featuring vehicles to construct a refined subset. This subset consisted of 6850 images, which were subsequently partitioned into training, validation, and testing sets, aligning with the visual dataset’s ratio. In total, the combined dataset for visual emergency vehicle detection and KITTI (VEVD+KITTI) encompassed 25,994 images, with an average of 4.45 vehicular objects per image. Consequently, the VEVD system maintains its effectiveness and reliability under challenging lighting conditions due to its robust training on diverse datasets. The model’s ability to generalize across different environmental settings, including scenarios with poor lighting, underscores its adaptability and resilience in real-world applications. Through extensive exposure to diverse lighting conditions during training, our VEVD model is equipped to deliver consistent performances, ensuring reliable emergency vehicle detection even in adverse lighting environments. Furthermore, we also prioritized data anonymization to protect the privacy and confidentiality of individuals. Any personally identifiable information was removed or carefully anonymized in the dataset, ensuring that individuals cannot be identified or linked to specific data points. This approach safeguards the privacy of individuals and upholds ethical standards.

3.3. Data for Acoustic EVD Experiments

Using the same dataset as in [14], we trained and tested the attention-based convolutional audio classifier on the audio-based emergency vehicle detection dataset. This dataset encompasses 26,705 audio samples, incorporating EV horn sounds and other noises. It combines 15,947 entries that we gathered ourselves, 8758 entries from the UrbanSound8K dataset [32], and 2000 entries from the ESC-50 dataset [33]. The dataset was methodically divided into training, testing, and validation sets comprising 16,310, 5262, and 5133 samples, respectively. Given the binary classification nature of our acoustic-based emergency vehicle detection system, we restructured the acoustic dataset into two distinct acoustic groups: siren sounds and noise, with car horns classified under the noise category. Detailed statistics of the acoustic-based emergency vehicle detection dataset can be found in Table 3. Owing to the diverse sources of the experimental dataset, there may have been variations in the sampling rate, channel count, and bit depth across data samples. To mitigate this, we standardized the entire dataset into monophonic data with a sampling rate of 22,050 Hz and 16-bit encoding.

3.4. Visual EVD Experiments

The MLSF-YOLO model’s performance was assessed using the VEVD + KITTI datasets. In evaluating the effectiveness of the object identification algorithms, one of the crucial metrics employed was the Mean Average Precision (mAP). This metric is measured through the computation of the Average Precision (AP) for every distinct class, followed by the derivation of the collective mean from these class-specific values. It serves as an indicator of the accuracy of object identification across various object classes. The AP for an object class is determined by integrating precision values at various recall levels. The formula for mAP is expressed as follows:
$$mAP = \frac{\sum AP}{N}$$

Here, $\sum AP$ represents the summation of the AP values across the various object classes, $N$ is the total number of object classes, and mAP is the resulting mean value. By taking into account the model's performance over a wide variety of object classes, this metric provides a comprehensive evaluation of the model's detection skills.
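As a simple worked example of this metric, averaging hypothetical per-class AP values of 0.952 for EVs and 0.931 for NEVs gives an mAP of about 0.94:

```python
def mean_average_precision(ap_per_class):
    """Mean Average Precision: the average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical per-class AP values for the two classes (EV, NEV)
print(mean_average_precision([0.952, 0.931]))  # -> 0.9415
```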

Visual EVD with Different Object Detectors

For our initial experiment, our primary objective was to showcase the benefits of deploying the MLSF-YOLO framework for vision-based emergency vehicle detection. To achieve this, we conducted a comprehensive analysis comparing the performances of single-shot object detection models, including YOLOv3, YOLOv4, SSD, and five distinct variants of EfficientDet, on the VEVD + KITTI dataset. This dataset served as a suitable benchmark for evaluating the models' capabilities in real-world scenarios. The results of our evaluation are presented in Table 4, which provides a comprehensive overview of the outcomes. Furthermore, the MLSF-YOLO architecture exhibited exceptional efficiency, with inference times as low as 4.9 milliseconds. This efficiency is crucial for real-time applications in traffic management and road safety systems, where the timely and accurate detection of emergency vehicles is of utmost importance.
Almost all detectors performed admirably in terms of detection, with accuracy values ranging from 81.93 to 91.2 across a range of input dimensions from 300 × 300 to 1024 × 1024. Furthermore, all other detectors (SSD-512, EfficientDet-D3, and EfficientDet-D4) attained processing times of less than 40 ms per image. Impressively, YOLOv4 significantly outperforms its three counterparts, achieving the highest detection accuracies at an input size of 608 × 608: 68.9 (mAP@[0.5:0.95]), 92.8 (mAP@0.5), and 81.9 (mAP@0.75), all while maintaining the shortest inference time at just 4.5 milliseconds per image. Notably, YOLOv4's time cost is significantly lower than that of SSD and the various EfficientDet models, and almost half that of YOLOv3 at identical input sizes. The MLSF-YOLO architecture demonstrates superior efficiency compared to the other object detectors evaluated on the Visual-EVD + KITTI dataset. At a resolution of 608 × 608, MLSF-YOLO achieves a remarkable mean average precision of 71.1%, outperforming the other architectures in terms of both mAP@0.5 and mAP@0.75. Additionally, MLSF-YOLO exhibits significantly lower time costs, requiring only 4.9 milliseconds for inference. This exceptional efficiency highlights the effectiveness of MLSF-YOLO in accurately detecting emergency vehicles while minimizing computational overhead, making it a compelling choice for real-time applications in traffic management and road safety systems. Additionally, we emphasize that our model selection process involved thorough experimentation. We conducted extensive evaluations, exploring various models and architectures, before ultimately selecting the MLSF-YOLO model. We rigorously tested and compared the performances of different models to ensure that the MLSF-YOLO architecture exhibited the desired characteristics for our specific problem domain. The selection of this model was based on its demonstrated performance and suitability for emergency vehicle detection. The comprehensive evaluation and comparison of these models allowed us to draw meaningful conclusions about the benefits of deploying the MLSF-YOLO framework for vision-based emergency vehicle detection. By leveraging the MLSF-YOLO architecture, we can attain high accuracy and efficiency, making it an attractive option for various applications.

3.5. Acoustic EVD Experiments

To augment the acoustic data, we incorporated a technique described in [34]. This entailed sporadically adding road noise to the original data, thereby generating noisy samples with an adjustable signal-to-noise ratio (SNR). Moreover, batch normalization is applied at every layer to accelerate the training procedure, and dropout regularization is used on the fully connected layers to reduce the possibility of overfitting [35].
While the recordings in the AEVD dataset typically range from 4 s to 5 s in duration, we specifically focused on input lengths between 0.25 s and 1.5 s. This choice was motivated by the desire to reduce computational complexity and cater to the practical requirements of emergency vehicle detection applications, where swift response times are crucial; longer inputs tend to lead to extended response times. Additionally, we assessed the network's stability by measuring its performance on noisy test sets. These sets, corresponding to SNRs of +20 dB, +10 dB, 0 dB, −10 dB, −20 dB, and −30 dB, were produced by applying traffic noise at the respective SNR levels to each sample in the initial test set. It is important to note that the initial testing data were collected during real-time driving scenarios, where natural traffic noise was already present; deliberately introducing additional noise therefore makes it significantly more challenging to assess the proposed model.
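A minimal sketch of this kind of SNR-controlled noise mixing is given below, assuming mono waveforms of equal length; it illustrates the idea rather than the exact augmentation procedure of [34].

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Add noise to a clean sample at a target SNR in dB (both inputs are
    assumed to be mono waveforms of equal length)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

# Example: corrupt a placeholder clip with placeholder traffic noise at -10 dB SNR
sr = 22050
clean = np.random.randn(sr).astype(np.float32)    # stand-in siren clip
traffic = np.random.randn(sr).astype(np.float32)  # stand-in traffic noise
noisy = mix_at_snr(clean, traffic, snr_db=-10)
```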

Results of Attention-Based Temporal Spectrum Network

The performance of the ATSN framework across varying input lengths and noise levels is shown in Table 5. Notably, the proposed ATSN exhibits an exceptional performance, achieving high accuracy rates on the original test dataset. Specifically, for input durations of 1.5, 1.0, 0.5, and 0.25 s, the accuracy rates were 93.47%, 93.24%, 93.29%, and 93.19%, respectively. It is important to emphasize that the experimental data were collected in authentic, on-scene scenarios, underscoring the efficacy and robustness of the ATSN in typical traffic conditions across all input durations. Even with a notably brief input duration of 0.25 s, the ATSN achieved a commendable classification accuracy of 93.19%. In the presence of moderate noise levels, specifically signal-to-noise ratios (SNRs) of +20 dB and +10 dB, the ATSN consistently maintained a robust performance across all input durations, with accuracy rates surpassing 93%. At a 0 dB SNR, the proposed network continued to exhibit high accuracy on the 1.5 s test set (93.37%) and the 1.0 s test set (93.17%), with results of 91.75% and 93.42% for the 0.5 s and 0.25 s test sets, respectively. Under heightened noise levels, specifically at SNRs of −10 dB and −20 dB, a more pronounced decline in classification accuracy was observed for shorter input durations: the ATSN maintained accuracy rates above 93% on the two longer-duration test sets, while the accuracies on the 0.5 s and 0.25 s test sets dropped to 91.39% and 91.27%, respectively, at −20 dB. When the SNR falls to −30 dB, the accuracy suffers considerably, but the ATSN still consistently delivers a performance exceeding 88% across all four input lengths. The data indicate that input durations of 1.5 s and 1.0 s yield more favorable results for acoustic-based emergency vehicle detection, both on raw waveform data and using the ATSN model. For these durations, the accuracy of the network decreases only slightly across varying SNRs and stays above 90% even in very noisy environments (−20 dB). On the other hand, the ATSN performs worse at SNRs of −20 dB and −30 dB for shorter input lengths due to its increased sensitivity to high noise levels.
Furthermore, it is worth noting that the inference times of the ATSN model vary depending on the input lengths. When handling input lengths of 1.5, 1.0, 0.5, and 0.25 s, the corresponding inference times per sample were 8 milliseconds, 7 milliseconds, 4 milliseconds, and 2 milliseconds, respectively. These inference times are well within acceptable ranges for real-time applications, further highlighting the efficiency of the ATSN model.
From Table 6, we can observe that both Class 1 and Class 2 classifications achieved consistently high-performance metrics. This indicates the model’s ability to effectively differentiate between emergency and non-emergency vehicles based on acoustic characteristics. These findings reinforce the practicality and real-world applicability of the ATSN approach for acoustic-based classification applications.
During our extensive investigation into emergency vehicle detection, we utilized accuracy and loss metrics to evaluate the efficacy of our proposed ATSN architecture. The results reveal a remarkable accuracy of 93.47%, as shown in Figure 6a. To gain further insights into the model’s performance, we analyzed the confusion matrix, as depicted in Figure 6b. The confusion matrix provides a visual representation of how effectively the model distinguishes between emergency vehicles and non-emergency vehicles. Furthermore, we conducted a comprehensive performance comparison for the acoustic classification of emergency and non-emergency vehicles, as presented in Table 6. This comparison allowed us to assess the model’s ability to accurately classify vehicles based on their acoustic characteristics.
By employing these evaluation techniques, we were able to assess the effectiveness and accuracy of the ATSN architecture in detecting emergency vehicles. The high accuracy, supported by the confusion matrix analysis and performance comparison, demonstrates the robustness and reliability of the ATSN model for acoustic-based emergency vehicle classification.

3.6. Results of Multi-Level Spatial Fusion YOLO

Building upon the exceptional performance demonstrated by YOLOv4 in the first trial, MLSF-YOLO was introduced, using the YOLOv4 model as its basis. The primary modification centers on the neck of the detector, where we implemented the structure of a CSP connection. MLSF-YOLO's detailed structure is shown in Figure 2a. Within this design, we replaced the SPL module of YOLOv4 with an MLSF module, while also integrating NCSP components to create the PANet structure within the MLSF-YOLO neck.
Table 7 displays the effectiveness of the MLSF-YOLO model across various input sizes on the VEVD + KITTI dataset. MLSF-YOLO demonstrates commendable detection accuracies across all input sizes. Notably, with the smallest input size (320 × 320), MLSF-YOLO achieved an mAP@0.5 of 93.4, surpassing the results of YOLOv3, EfficientDet, YOLOv4, and SSD, which were examined with larger inputs (as detailed in Table 4). As the input size increases, the detection accuracy rises, with MLSF-YOLO reaching 94.9 mAP@0.5 for 640 × 640 test images. Notably, beyond an input size of 608 × 608, the MLSF-YOLO performance plateaus, hovering around 95.2 mAP@0.5. Consequently, 608 × 608 emerges as the optimal input size for VEVD when utilizing MLSF-YOLO trained on the VEVD + KITTI dataset. The computational efficiency of MLSF-YOLO is equally important, with time costs ranging from 2.2 milliseconds (320 × 320) to 6.9 milliseconds (768 × 768) per image, making it suitable for live execution. Figure 7 showcases a sample detection result generated using MLSF-YOLO.
Table 8 presents a comparison of visual-based emergency vehicle detection methods from previous research studies. Different approaches were evaluated, each employing distinct methodologies and achieving varying levels of accuracy. One study utilized an RCNN method and achieved an accuracy of 92% but faced challenges related to lighting conditions. Another study employed the YOLOv5 method, achieving an accuracy of 88%, but encountered high computation costs. A CNN model achieved an accuracy of 85% but faced difficulties in classification. Another method, YOLOv4_AF, achieved an accuracy of 83% but offers limited accuracy compared to other models. A YOLOv4_FIRI model achieved an accuracy of 94% but suffered from slow speed and limited adaptability. The MLSF-YOLO framework achieved the highest accuracy of 95%, although its practical implications were noted as a limitation. Considering the high accuracy achieved by MLSF-YOLO, further investigation is warranted to explore its real-world application and address the perceived limitations. Future research should focus on enhancing the practicality and adaptability of the MLSF-YOLO method, ensuring its effectiveness in real-time emergency vehicle detection scenarios.

3.7. Ensemble Learning Results

To assess the effectiveness of our multimodal emergency vehicle detection (MEVD) framework, we compiled a video repository exhibiting vehicular entities and the corresponding acoustic environment in diverse traffic scenarios. Broadly, we identified four traffic scenarios based on whether EVs can be detected through the visual and auditory channels. The first scenario involves the presence of an emergency vehicle within the camera's perspective, along with the audible blaring of its siren. The second is a standard traffic scenario with no visible or audible signs of an emergency vehicle. The remaining two cases involve either only the visual appearance or only the siren sound of an emergency vehicle in the video. In the initial stage of MEVD, we focus solely on the first two traffic scenarios, encompassing the detection of an emergency vehicle through both visual and auditory means.
In our comprehensive exploration of emergency vehicle detection, we harnessed loss and accuracy metrics to evaluate the efficacy of our innovative strategy. Our findings underscore the remarkable accuracy rate of 96.19%, as shown in Figure 8a. This outcome reflects the successful synergy of acoustic and visual combinations through ensemble learning, showing the potential of our approach in enhancing emergency vehicle detection systems. Similarly, in Figure 8b, the confusion matrix highlights how well the model differentiates between EVs and NEVs.
We conducted a performance comparison for the classification of emergency and non-emergency vehicles, as shown in Table 9. From Table 9, we observe that both Class 1 and Class 2 classifications exhibited similar high-performance metrics. These findings demonstrate the model’s appropriateness for real-world acoustic and visual-based categorization by showing how well it distinguishes between emergency and non-emergency vehicles.
As depicted in Figure 9, the hard prediction module demonstrated an exceptional performance in the context of acoustic and visual emergency vehicle detection. It achieved an accuracy score of 0.961, a precision score of 0.94, a recall score of 0.95, and an F1 score of 0.96, outperforming all other modules in these domains. These results highlight the usefulness of our proposed ensemble learning method, which fuses acoustic and visual data, for this crucial application. The approach provides a robust and reliable means to accurately detect and classify emergency vehicles. Because the integration of multiple components allows the system to utilize complementary information from both acoustic and visual sources, it achieves improved detection accuracy and enhanced decision-making capabilities. This outcome holds significant importance for emergency response systems and may potentially improve their effectiveness in ensuring public safety and rapid responses in emergencies.
In Table 10, we present a comprehensive comparative analysis of the performance of our proposed emergency vehicle detection framework and those of previous studies conducted in this field of research. The table encompasses a diverse range of studies, providing insights into the accuracy metrics achieved by each study, as well as whether multi-model techniques or ensemble learning methodologies were employed. By examining the table, we can gain valuable insights into the advancements made in emergency vehicle detection and the effectiveness of our proposed framework compared to prior research efforts. The inclusion of accuracy metrics allows for a quantitative assessment of the performances of different approaches, enabling a comprehensive evaluation of their effectiveness. Furthermore, the indication of whether multi-model techniques or ensemble learning were utilized in each study provides additional context regarding the methodologies employed. This information aids in understanding the complexities of the various approaches and their impact on the overall performance of emergency vehicle detection systems.

4. Conclusions

This study proposed a deep learning approach with multimodal fusion to construct an advanced system capable of identifying emergency vehicles in real-world traffic conditions. To address the visual aspect, we used the MLSF-YOLO model and compiled the VEVD dataset, tailoring it specifically for VEVD purposes. By training and evaluating MLSF-YOLO on a combined dataset that merged Visual-EVD and KITTI, we achieved exceptional outcomes, enabling real-time processing at an impressive speed of 4.9 ms per image while maintaining a mean average precision of 95.2%. Shifting our focus to auditory analysis, we proposed the ATSN model, a comprehensive end-to-end architecture designed to extract features from acoustic waveforms for classification. The experiments conducted demonstrate the effectiveness of the ATSN model, showcasing its consistent performance across varying input lengths and signal-to-noise ratios and surpassing the accuracy achieved by previous methodologies. Lastly, our results support the use of acoustic recognition and object detection techniques together, as this combination brings significant benefits and improves the overall results.

5. Limitations and Future Work

While our study demonstrates promising results in enhancing emergency vehicle detection through multimodal fusion, several avenues remain for further exploration and improvement. First, applying dimension reduction techniques to our dataset could provide deeper insight into model behavior and help optimize computational efficiency. Second, future research should investigate advanced synchronization techniques to refine the integration of the visual and acoustic data streams for real-world deployment, potentially improving system robustness. Third, the limited diversity of our test conditions should be addressed: experiments under different weather conditions, times of day, traffic scenarios, and camera angles would provide a more comprehensive evaluation of system performance. Finally, the impact of camera quality and sensor placement on detection accuracy should be examined to optimize system design and deployment strategies.
Although favorable results were obtained with VEVD, AEVD, and MEVD, further refinement and evaluation are needed to improve detection accuracy and reliability so that these EVD systems meet real-world application requirements. We will first increase the diversity of the experimental data by collecting VEVD data in both daytime and night-time scenarios and under a range of weather conditions, including rain, snow, sun, and fog. In addition, we will adapt the EVD systems for resource-constrained IoT devices, broadening the range of applications for the proposed EVD techniques.
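The synchronization direction mentioned above can be prototyped very simply by pairing each video frame with the audio prediction whose timestamp is closest, within a tolerance. The sketch below is a hypothetical illustration only; the timestamp sources, the 50 ms tolerance, and the data structures are assumptions rather than part of the reported system.

```python
from bisect import bisect_left

def pair_by_timestamp(frame_times, audio_times, tolerance=0.05):
    """Match each video frame time (s) to the nearest audio-clip time (s).

    Returns (frame_index, audio_index) pairs, where audio_index refers to the
    time-sorted audio sequence; frames with no audio prediction within
    `tolerance` seconds are left unmatched.
    """
    audio_times = sorted(audio_times)
    pairs = []
    for i, t in enumerate(frame_times):
        j = bisect_left(audio_times, t)
        # Candidates are the neighbours around the insertion point.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(audio_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(audio_times[k] - t))
        if abs(audio_times[best] - t) <= tolerance:
            pairs.append((i, best))
    return pairs

# Example: 30 fps frames against audio predictions emitted every 0.25 s.
frames = [k / 30 for k in range(10)]
audio = [k * 0.25 for k in range(4)]
print(pair_by_timestamp(frames, audio))
```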

Author Contributions

Conceptualization, M.Z., M.A., and M.E.; data curation, M.A. and M.E.; formal analysis, M.Z.; funding acquisition, M.A. and M.E.; investigation, M.A. and M.E.; methodology, M.Z. and M.A.; project administration, M.A. and M.E.; resources, M.A. and M.E.; software, M.Z.; supervision, M.A.; validation, M.A. and M.E.; visualization, M.A.; writing—original draft, M.Z.; writing—review and editing, M.A. and M.E. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the EIAS Data Science and Blockchain Laboratory College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia. The authors would like to thank Prince Sultan University for paying the APC of this article.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to thank Prince Sultan University for their valuable support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ghazi, M.U.; Khattak, M.A.K.; Shabir, B.; Malik, A.W.; Ramzan, M.S. Emergency message dissemination in vehicular networks: A review. IEEE Access 2020, 8, 38606–38621. [Google Scholar] [CrossRef]
  2. Damaševičius, R.; Bacanin, N.; Misra, S. From sensors to safety: Internet of Emergency Services (IoES) for emergency response and disaster management. J. Sens. Actuator Netw. 2023, 12, 41. [Google Scholar] [CrossRef]
  3. Wang, X.; Liu, Q.; Guo, F.; Xu, X.; Chen, X. Causation analysis of crashes and near crashes using naturalistic driving data. Accid. Anal. Prev. 2022, 177, 106821. [Google Scholar] [CrossRef] [PubMed]
  4. Razalli, H.; Ramli, R.; Alkawaz, M.H. Emergency vehicle recognition and classification method using HSV color segmentation. In Proceedings of the 2020 16th IEEE International Colloquium on Signal Processing & Its Applications (CSPA), Langkawi, Malaysia, 28–29 February 2020; pp. 284–289. [Google Scholar]
  5. Sarda, A.; Dixit, S.; Bhan, A. Object detection for autonomous driving using yolo [you only look once] algorithm. In Proceedings of the 2021 Third IEEE International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 1370–1374. [Google Scholar]
  6. Kherraki, A.; El Ouazzani, R. Deep convolutional neural networks architecture for an efficient emergency vehicle classification in real-time traffic monitoring. IAES Int. J. Artif. Intell. 2022, 11, 110. [Google Scholar] [CrossRef]
  7. Sorour, S.E.; Hany, A.A.; Elredeny, M.S.; Sedik, A.; Hussien, R.M. An Automatic Dermatology Detection System Based on Deep Learning and Computer Vision. IEEE Access 2023, 11, 137769–137778. [Google Scholar] [CrossRef]
  8. Goel, S.; Baghel, A.; Srivastava, A.; Tyagi, A.; Nagrath, P. Detection of emergency vehicles using modified YOLO algorithm. In Proceedings of the Intelligent Communication, Control and Devices (ICICCD 2018); Springer: Berlin/Heidelberg, Germany, 2020; pp. 671–687. [Google Scholar]
  9. Berwo, M.A.; Khan, A.; Fang, Y.; Fahim, H.; Javaid, S.; Mahmood, J.; Abideen, Z.U.; Syam, M.S. Deep Learning Techniques for Vehicle Detection and Classification from Images/Videos: A Survey. Sensors 2023, 23, 4832. [Google Scholar] [CrossRef] [PubMed]
  10. Baghel, A.; Srivastava, A.; Tyagi, A.; Goel, S.; Nagrath, P. Analysis of Ex-YOLO algorithm with other real-time algorithms for emergency vehicle detection. In Proceedings of the First International Conference on Computing, Communications, and Cyber-Security (IC4S 2019); Springer: Berlin/Heidelberg, Germany, 2020; pp. 607–618. [Google Scholar]
  11. Farid, A.; Hussain, F.; Khan, K.; Shahzad, M.; Khan, U.; Mahmood, Z. A Fast and Accurate Real-Time Vehicle Detection Method Using Deep Learning for Unconstrained Environments. Appl. Sci. 2023, 13, 3059. [Google Scholar] [CrossRef]
  12. Pan, M.; Liu, Y.; Cao, J.; Li, Y.; Li, C.; Chen, C.H. Visual recognition based on deep learning for navigation mark classification. IEEE Access 2020, 8, 32767–32775. [Google Scholar] [CrossRef]
  13. Tahir, N.U.A.; Zhang, Z.; Asim, M.; Chen, J.; ELAffendi, M. Object Detection in Autonomous Vehicles under Adverse Weather: A Review of Traditional and Deep Learning Approaches. Algorithms 2024, 17, 103. [Google Scholar] [CrossRef]
  14. Tran, V.T.; Tsai, W.H. Acoustic-based emergency vehicle detection using convolutional neural networks. IEEE Access 2020, 8, 75702–75713. [Google Scholar] [CrossRef]
  15. Pramanick, D.; Ansar, H.; Kumar, H.; Pranav, S.; Tengshe, R.; Fatimah, B. Deep learning based urban sound classification and ambulance siren detector using spectrogram. In Proceedings of the 2021 12th IEEE International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021; pp. 1–6. [Google Scholar]
  16. Fatimah, B.; Preethi, A.; Hrushikesh, V.; Singh, A.; Kotion, H.R. An automatic siren detection algorithm using Fourier Decomposition Method and MFCC. In Proceedings of the 2020 11th IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6. [Google Scholar]
  17. Mateen, A.; Hanif, M.Z.; Khatri, N.; Lee, S.; Nam, S.Y. Smart roads for autonomous accident detection and warnings. Sensors 2022, 22, 2077. [Google Scholar] [CrossRef]
  18. Tang, M.; Zhao, Q.; Ding, S.X.; Wu, H.; Li, L.; Long, W.; Huang, B. An improved lightGBM algorithm for online fault detection of wind turbine gearboxes. Energies 2020, 13, 807. [Google Scholar] [CrossRef]
  19. Mu, W.; Yin, B.; Huang, X.; Xu, J.; Du, Z. Environmental sound classification using temporal-frequency attention based convolutional neural network. Sci. Rep. 2021, 11, 21552. [Google Scholar] [CrossRef]
  20. Mahlous, A.R. Cyber security challenges in self-driving cars. Comput. Fraud Secur. 2022, 1873–7056. [Google Scholar] [CrossRef]
  21. Li, Q.; Garg, S.; Nie, J.; Li, X.; Liu, R.W.; Cao, Z.; Hossain, M.S. A highly efficient vehicle taillight detection approach based on deep learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4716–4726. [Google Scholar] [CrossRef]
  22. Yu, J.; Zhang, W. Face mask wearing detection algorithm based on improved YOLO-v4. Sensors 2021, 21, 3263. [Google Scholar] [CrossRef]
  23. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742. [Google Scholar] [CrossRef]
  24. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  25. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  26. Huang, Z.; Wang, J.; Fu, X.; Yu, T.; Guo, Y.; Wang, R. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Inf. Sci. 2020, 522, 241–258. [Google Scholar] [CrossRef]
  27. Hu, X.; Liu, Y.; Zhao, Z.; Liu, J.; Yang, X.; Sun, C.; Chen, S.; Li, B.; Zhou, C. Real-time detection of uneaten feed pellets in underwater images for aquaculture using an improved YOLO-V4 network. Comput. Electron. Agric. 2021, 185, 106135. [Google Scholar] [CrossRef]
  28. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  29. Ansari, S.; Alnajjar, K.A.; Khater, T.; Mahmoud, S.; Hussain, A. A Robust Hybrid Neural Network Architecture for Blind Source Separation of Speech Signals Exploiting Deep Learning. IEEE Access 2023, 11, 100414–100437. [Google Scholar] [CrossRef]
  30. Rehman, A.; Alam, T.; Mujahid, M.; Alamri, F.S.; Al Ghofaily, B.; Saba, T. RDET stacking classifier: A novel machine learning based approach for stroke prediction using imbalance data. Peerj Comput. Sci. 2023, 9, e1684. [Google Scholar] [CrossRef]
  31. Golchoubian, M.; Ghafurian, M.; Dautenhahn, K.; Azad, N.L. Pedestrian trajectory prediction in pedestrian-vehicle mixed environments: A systematic review. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11544–11567. [Google Scholar] [CrossRef]
  32. Guzhov, A.; Raue, F.; Hees, J.; Dengel, A. Audioclip: Extending clip to image, text and audio. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 976–980. [Google Scholar]
  33. Nanni, L.; Maguolo, G.; Brahnam, S.; Paci, M. An ensemble of convolutional neural networks for audio classification. Appl. Sci. 2021, 11, 5796. [Google Scholar] [CrossRef]
  34. Gatto, R.C.; Forster, C.H.Q. Audio-based machine learning model for traffic congestion detection. IEEE Trans. Intell. Transp. Syst. 2020, 22, 7200–7207. [Google Scholar] [CrossRef]
  35. Abdallah, M.; An Le Khac, N.; Jahromi, H.; Delia Jurcut, A. A hybrid CNN-LSTM based approach for anomaly detection systems in SDNs. In Proceedings of the 16th International Conference on Availability, Reliability and Security, Vienna, Austria, 17–20 August 2021; pp. 1–7. [Google Scholar]
  36. Kaushik, S.; Raman, A.; Rao, K.R. Leveraging computer vision for emergency vehicle detection-implementation and analysis. In Proceedings of the 2020 11th IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6. [Google Scholar]
  37. Raj, V.S.; Sai, J.V.M.; Yogesh, N.L.; Preetha, S.K.; Lavanya, R. Smart Traffic Control for Emergency Vehicles Prioritization using Video and Audio Processing. In Proceedings of the 2022 6th IEEE International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 25–27 May 2022; pp. 1588–1593. [Google Scholar]
  38. Shatnawi, M.; Audat, A.; Saraireh, M. Intelligent Requirements Engineering: Applying Machine Learning for Requirements Classification. In Proceedings of the 2023 14th IEEE International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 21–23 November 2023; pp. 1–6. [Google Scholar]
  39. Zhao, J.; Hao, S.; Dai, C.; Zhang, H.; Zhao, L.; Ji, Z.; Ganchev, I. Improved vision-based vehicle detection and classification by optimized YOLOv4. IEEE Access 2022, 10, 8590–8603. [Google Scholar] [CrossRef]
  40. Zhao, Y.; Zhao, H.; Zhang, X.; Liu, W. Vehicle classification based on audio-visual feature fusion with low-quality images and noise. J. Intell. Fuzzy Syst. 2023, 45, 1–14. [Google Scholar] [CrossRef]
  41. Jiang, K.; Su, D.; Zheng, Y. Intelligent acquisition model of traffic congestion information in the vehicle networking environment based on multi-sensor fusion. Int. J. Veh. Inf. Commun. Syst. 2019, 4, 155–169. [Google Scholar] [CrossRef]
  42. Al-Batat, R.; Angelopoulou, A.; Premkumar, S.; Hemanth, J.; Kapetanios, E. An end-to-end automated license plate recognition system using YOLO based vehicle and license plate detection with vehicle classification. Sensors 2022, 22, 9477. [Google Scholar] [CrossRef]
  43. Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar] [CrossRef]
  44. Bolboacă, R. Adaptive ensemble methods for tampering detection in automotive aftertreatment systems. IEEE Access 2022, 10, 105497–105517. [Google Scholar] [CrossRef]
Figure 1. Visual representation of the Multimodal-EVD (MEVD) system.
Figure 2. The MLSF-YOLO model and its subcomponents (a–g).
Figure 3. The complete framework of the ATSN classification model.
Figure 4. Creation of two distinct attention modules.
Figure 5. Utilizing probabilistic voting for predicting the complete acoustic category.
Figure 6. Evaluating the performance of the ATSN in the EVD model, where (a) comprises accuracy and loss graphs, and (b) presents the confusion matrix for the classification accuracy.
Figure 7. Detection results obtained using the MLSF-YOLO model.
Figure 8. Evaluating the performance of the ensemble learning in the EVD model, where (a) comprises accuracy and loss graphs and (b) presents the confusion matrix for the detection accuracy.
Figure 9. Evaluation of ensemble learning technique and individual models for emergency vehicle detection.
Table 1. Overview of model configuration.

Model     | Hyperparameters                                                    | Optimizer         | Activation Function
ATSN      | Loss function: BCE; learning rate: 0.001; regularization; dropout | Adam              | ReLU
MLSF-YOLO | Batch size: 16; learning rate: 0.01; weight decay: 0.0005          | SGD with momentum | Mish
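For readers who want to reproduce the training setup in Table 1, the snippet below sketches how those hyperparameters might be wired up in PyTorch. Only the values listed in the table come from the paper; the stand-in model definitions, the dropout rate, the regularization strength, and the momentum value are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-ins for the actual networks; only the hyperparameters below are from Table 1.
atsn = nn.Sequential(nn.Linear(16000, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 1))
mlsf_yolo = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Mish(), nn.Conv2d(16, 255, 1))

# ATSN branch: binary cross-entropy loss, Adam, learning rate 0.001 (Table 1).
atsn_criterion = nn.BCEWithLogitsLoss()
atsn_optimizer = torch.optim.Adam(
    atsn.parameters(),
    lr=0.001,
    weight_decay=1e-4,     # regularization strength not reported; value assumed
)

# MLSF-YOLO branch: SGD with momentum, lr 0.01, weight decay 0.0005, batch size 16 (Table 1).
BATCH_SIZE = 16
yolo_optimizer = torch.optim.SGD(
    mlsf_yolo.parameters(),
    lr=0.01,
    momentum=0.9,          # momentum value not reported; 0.9 is a common default
    weight_decay=0.0005,
)
```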
Table 2. Collection of data for the experiments on visual EVD.

Division   | Data Collection | Frame Count | EV Samples | NEV Samples | Frame-Average Instances
Training   | VEVD            | 11,930      | 12,705     | 24,522      | 3.12
Training   | KITTI           | 3995        | 0          | 18,960      | 4.74
Validation | VEVD            | 3902        | 3998       | 6519        | 2.43
Validation | KITTI           | 998         | 0          | 5750        | 5.77
Testing    | VEVD            | 4008        | 4432       | 7230        | 3.41
Testing    | KITTI           | 1349        | 0          | 6448        | 4.78
Total      | VEVD            | 19,144      | 23,448     | 39,895      | 3.25
Total      | KITTI           | 6850        | 0          | 32,752      | 4.78
Total      | VEVD + KITTI    | 25,994      | 23,448     | 72,647      | 4.45
Table 3. Collection of data for AEVD.

Subset     | Noise  | Siren Sound | Total
Training   | 11,020 | 5290        | 16,310
Validation | 3518   | 1744        | 5262
Test       | 3409   | 1724        | 5133
Total      | 17,947 | 8758        | 26,705
Table 4. Evaluating MLSF-YOLO and other object detectors on the Visual-EVD + KITTI dataset.

Detector        | Architecture  | Resolution  | mAP@[0.5:0.95] | mAP@0.5 | mAP@0.75 | Time Cost (ms)
SSD-300         | VGG-16        | 300 × 300   | 51.96          | 83.1    | 63.9     | 19.8
SSD-512         | VGG-16        | 512 × 512   | 54.98          | 84.8    | 64.8     | 39.8
YOLOv4          | CSPDarknet-53 | 512 × 512   | 67.8           | 92.7    | 81.1     | 3.5
YOLOv4          | CSPDarknet-53 | 608 × 608   | 68.9           | 92.8    | 81.9     | 4.3
YOLOv3          | Darknet-53    | 512 × 512   | 67.1           | 90.9    | 80.1     | 5.9
YOLOv3          | Darknet-53    | 608 × 608   | 67.1           | 90.8    | 80.7     | 7.8
EfficientDet-D0 | Efficient-B0  | 512 × 512   | 61.9           | 86.7    | 73.2     | 26.1
EfficientDet-D1 | Efficient-B1  | 640 × 640   | 62.7           | 87.8    | 74.9     | 29.7
EfficientDet-D2 | Efficient-B2  | 768 × 768   | 63.9           | 89.8    | 78.3     | 35.2
EfficientDet-D3 | Efficient-B3  | 896 × 896   | 67             | 91.8    | 81.5     | 44.1
EfficientDet-D4 | Efficient-B4  | 1024 × 1024 | 66.3           | 91      | 80.7     | 67.3
MLSF-YOLO       | CSPDarknet-53 | 608 × 608   | 71.1           | 95.2    | 86.2     | 4.9
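The mAP@0.5 and mAP@0.75 columns in Table 4 refer to the IoU threshold at which a predicted box counts as a true positive, and mAP@[0.5:0.95] averages over thresholds from 0.5 to 0.95 in steps of 0.05. A minimal IoU computation, independent of any particular detector and not reproducing the paper's exact evaluation protocol, is shown below.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction overlapping a ground-truth ambulance box:
print(iou((50, 30, 200, 180), (60, 40, 210, 190)))   # ~0.77, so it counts at IoU 0.5 and 0.75
```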
Table 5. ATSN performance across varied input lengths and signal-to-noise ratios (SNRs).

Input Length | Original Data | +20 dB | +10 dB | 0 dB  | −10 dB | −20 dB | −30 dB | Time
1.5 s        | 93.47         | 93.76  | 93.75  | 93.37 | 92.19  | 93.32  | 91.58  | 8 ms
1.0 s        | 93.24         | 93.68  | 93.68  | 93.17 | 93.46  | 93.19  | 89.24  | 7 ms
0.5 s        | 93.29         | 92.81  | 93.47  | 91.75 | 91.88  | 91.39  | 89.32  | 4 ms
0.25 s       | 93.19         | 92.47  | 92.29  | 93.42 | 90.72  | 91.27  | 88.12  | 2 ms
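The SNR conditions in Table 5 are typically produced by scaling a noise recording so that the mixture reaches a target signal-to-noise ratio before it is fed to the classifier. The sketch below shows the standard calculation; the synthetic siren tone, sample rate, and noise source are placeholders rather than the data used in the paper.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the mixture has a signal-to-noise ratio of `snr_db` dB."""
    noise = np.resize(noise, signal.shape)            # repeat/trim noise to match length
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Example: a 1.5 s siren-like tone at 16 kHz mixed with white noise at -10 dB SNR.
sr = 16000
t = np.arange(int(1.5 * sr)) / sr
siren = np.sin(2 * np.pi * 900 * t)
noisy = mix_at_snr(siren, np.random.randn(len(t)), snr_db=-10)
```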
Table 6. Performance of acoustic classification for EVD.

Class | Accuracy | Precision | Recall | F1-score
1     | 93.47%   | 0.93      | 0.94   | 0.93
2     | 92.98%   | 0.94      | 0.93   | 0.94
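The per-class metrics in Table 6 (and later in Table 9) follow the usual definitions derived from a binary confusion matrix. As a quick reference, the snippet below recomputes precision, recall, F1, and accuracy from hypothetical counts; the counts are illustrative only, since the paper reports just the derived percentages.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts only (not taken from the paper's confusion matrix).
p, r, f1, acc = binary_metrics(tp=1620, fp=120, fn=104, tn=3289)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.4f}")
```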
Table 7. Evaluating the MLSF-YOLO performance on the VEVD + KITTI dataset.

Resolution (h × w) | mAP@[0.5:0.95] | mAP@0.5 | mAP@0.75 | Time (ms)
320 × 320          | 66.9           | 93.4    | 81.1     | 2.2
416 × 416          | 69.2           | 94.5    | 84.3     | 3.1
512 × 512          | 69.8           | 94.8    | 84.7     | 4.2
608 × 608          | 71.1           | 95.2    | 86.2     | 4.9
640 × 640          | 71.1           | 94.9    | 85.9     | 5.0
768 × 768          | 71.0           | 94.7    | 86.2     | 6.9
Table 8. Comparison of Visual Emergency Vehicle Detection (VEVD) with previous research.

Study     | Method      | Accuracy | Limitations
Ref. [36] | RCNN        | 92%      | Lighting challenges
Ref. [37] | YOLOv5      | 88%      | High computation cost
Ref. [38] | CNN         | 85%      | Difficulty in classification
Ref. [39] | YOLOv4_AF   | 83%      | Limited accuracy compared to other models
Ref. [25] | YOLOv4_FIRI | 94%      | Slow speed and limited adaptability
Proposed  | MLSF-YOLO   | 95%      | Lacks practical implications
Table 9. Comparing ensemble classification performances for EVD.

Class | Accuracy | Precision | Recall | F1-score
1     | 96.19%   | 0.97      | 0.95   | 0.97
2     | 95.93%   | 0.95      | 0.97   | 0.96
Table 10. Comparison with previous classification research.

Study     | Accuracy | Ensemble Learning | Multimodal
Ref. [40] | 95%      | Yes               | No
Ref. [41] | 91%      | No                | No
Ref. [42] | 94%      | No                | No
Ref. [43] | 93%      | No                | Yes
Ref. [44] | 92%      | Yes               | No
Proposed  | 96%      | Yes               | Yes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
