Article

Enhancing Emergency Vehicle Detection: A Deep Learning Approach with Multimodal Fusion

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 EIAS Data Science and Blockchain Laboratory, College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
3 School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(10), 1514; https://doi.org/10.3390/math12101514
Submission received: 7 April 2024 / Revised: 3 May 2024 / Accepted: 6 May 2024 / Published: 13 May 2024

Abstract:

Emergency vehicle detection plays a critical role in ensuring timely responses and reducing accidents in modern urban environments. However, traditional methods that rely solely on visual cues face challenges, particularly in adverse conditions. The objective of this research is to enhance emergency vehicle detection by leveraging the synergies between acoustic and visual information. By incorporating advanced deep learning techniques for both acoustic and visual data, our aim is to significantly improve the accuracy and response times. To achieve this goal, we developed an attention-based temporal spectrum network (ATSN) with an attention mechanism specifically designed for ambulance siren sound detection. In parallel, we enhanced visual detection tasks by implementing a Multi-Level Spatial Fusion YOLO (MLSF-YOLO) architecture. To combine the acoustic and visual information effectively, we employed a stacking ensemble learning technique, creating a robust framework for emergency vehicle detection. This approach capitalizes on the strengths of both modalities, allowing for a comprehensive analysis that surpasses existing methods. Through our research, we achieved remarkable results, including a misdetection rate of only 3.81% and an accuracy of 96.19% when applied to visual data containing emergency vehicles. These findings represent significant progress in real-world applications, demonstrating the effectiveness of our approach in improving emergency vehicle detection systems.

1. Introduction

In real-world scenarios, the failure to detect emergency vehicles can pose significant threats to road safety. The timely detection of these vehicles is crucial for ensuring prompt medical assistance, as any delay in detection can result in more severe injuries or loss of life [1]. Extensive research consistently highlights the significance of precise detection systems, particularly in emergencies where even minor delays can significantly compromise patient care. Therefore, detection systems that are both reliable and capable of operating under varying environmental conditions and scenarios are necessary to effectively address the risks associated with the failure to detect emergency vehicles [2].
An emergency vehicle-related accident can be the result of various factors, including drivers who are distracted while operating their vehicles, the effects of in-car technologies, incidents where cars run red lights, and the requirement for efficient pedestrian crash prevention [3]. In addition to these factors, a significant contributor to such incidents is the lack of awareness among non-emergency vehicle drivers about the presence of approaching emergency vehicles. This lack of awareness may occur when emergency vehicles are outside the visual range of non-emergency drivers or when the sound of emergency vehicle sirens is hindered by external noise interference, the utilization of in-vehicle audio systems, or the driver’s own distractions. With this objective, the current study focused on developing real-time and precise systems for emergency vehicle detection (EVD).
EVD relies on advanced computer vision [4] and acoustic processing technologies for the rapid identification and classification of automobiles such as police cars, fire engines, and ambulances. Vision-based systems, including YOLO, ensure real-time visual recognition, while acoustic-based solutions like the ATSN excel in auditory cue discrimination, enhancing emergency response efficiency. Considering the potential drawbacks of the acoustic and visual EVD systems when used independently, especially in situations with noisy traffic and unfavorable weather conditions, we combined both acoustic and visual EVD into a multimodal system to concurrently detect siren sounds and emergency vehicle objects.
Vision-based emergency vehicle detection (VEVD) has received limited attention in research thus far, with only a few studies [5,6,7] dedicated to this area. A review conducted in [8] examined different object detection algorithms and highlighted the suitability of YOLO [5] for vision-based EVD due to its fast processing speed. In addition, the survey in [9] provided a comprehensive study of deep learning (DL) methods for vehicle detection. Comparing DL methods with traditional techniques, the authors noted that DL models improve efficiency by automating feature extraction, which is particularly valuable given the large size of vehicle detection datasets. They also reported that DL models outperform other approaches in terms of accuracy and time through techniques such as the use, modification, and fine-tuning of pre-trained models and the optimization of hyperparameters. In [10], a two-stage VEVD system was introduced that applies a vehicle classifier after an object detector; its accuracy reached 91.6% when evaluated on a small, purpose-built dataset. An alternative two-stage VEVD system following a similar approach was introduced in [6]. The work in [11] proposed a real-time vehicle detection method based on the YOLO-v5 architecture to balance speed and accuracy. To achieve accurate results, especially in congested areas, the authors gathered a new dataset covering different weather conditions and both low-density and high-density traffic and then fine-tuned YOLO-v5 on it, initializing the model with pre-trained COCO weights. The paper highlighted the effectiveness and potential of the proposed model while acknowledging the large amount of training data required. The results showed that the model works well in challenging conditions such as night, rain, and snow and that it achieves significantly higher accuracy and speed than competing methods.
Previous works on VEVD exhibit common limitations. First, the object detectors used in the vehicle detection stage of [6,10] were trained on the ImageNet [12] and MS COCO [7] datasets, which were not designed for the EVD problem and cover an extensive array of object classes that may not be directly relevant. Since these datasets include no emergency vehicle-related objects, a mismatch can arise between the vehicle detection and classification stages, as the features extracted by the vehicle detector may not fully capture the discriminative information of emergency vehicles. Second, the two-stage systems are computationally inefficient because they employ an additional vehicle classifier, failing to leverage the full advantages of object detection for both detection and classification. In terms of system evaluation, refs. [6,10] tested the vehicle detector and vehicle classifier separately on different datasets, without a precise evaluation of the complete two-stage system. Recognizing these limitations, this work addresses the VEVD problem more comprehensively by proposing a single-stage VEVD system that offers fast and accurate predictions. Our main priority was to use state-of-the-art DL-based object detection algorithms for the development of the VEVD system, rather than relying on conventional techniques commonly found in the existing literature [13].
Existing audio-based emergency vehicle detection (AEVD) systems suffer from several limitations. Firstly, they often rely on small experimental datasets consisting of simulated, replicated, and on-scene recordings. This approach neglects the variety of siren types and specifications, limiting the system's ability to generalize to real-world applications. Secondly, previous works [14,15,16,17,18] often rely on shallow learning algorithms trained on handcrafted features or on microcontrollers employed for signal processing tasks. As a result, these systems exhibit inferior detection accuracies, typically falling below 90%, and suffer from computational inefficiency. For instance, the system described in [16] requires 8 s to complete a single detection. To address these limitations, this study develops an AEVD system that incorporates an attention mechanism, following the groundwork provided in [19], to increase the accuracy and efficiency of our AEVD model. We carefully modified the existing approach to address the limitations of previous systems, incorporating novel techniques and methodologies. By leveraging this approach, the proposed model achieves superior accuracy and faster detection. Additionally, the model was trained on a diverse dataset that captures the differences in siren types and characteristics, enhancing its performance and practical applicability.
This study not only introduces innovative techniques for enhancing emergency vehicle detection but also focuses on improving the accuracy and efficiency of detection methods. By using integrated learning with acoustic and visual datasets, this technology represents a major advancement in emergency response system development. The practical applications of this study's findings open the path for the advancement of EVD systems. By incorporating them into private vehicle driver-assistance systems, drivers can receive vital notifications, which are especially helpful for those with hearing impairments. Applications also include improving safety features in smart traffic infrastructure and strengthening emergency response in self-driving cars [20].
We utilized an attention-based temporal spectrum network for ambulance siren sound detection. Concurrently, we improved visual detection using an enhanced YOLO architecture. Fusing acoustic and visual information, we employed a stacking ensemble learning technique to create a robust framework for emergency vehicle detection. This approach leverages the strengths of both modalities, resulting in a superior performance compared to those of existing methods. Listed below are the main contributions of this research:
  • This study proposes a stacking ensemble method to fuse acoustic and visual information, aiming to improve emergency vehicle detection under adverse conditions.
  • This study incorporated a multi-level spatial fusion technique into YOLO to accommodate the deep-level semantic information required for multi-modal fusion.
  • This study proposes an attention-based temporal spectrum network to aid in extracting semantic features for siren sound classification.
The remainder of this paper is organized as follows: Section 2 outlines the methodologies proposed within this research; Section 3 presents the experiments and results; Section 4 presents the conclusions; and Section 5 discusses limitations and future work.

2. Materials and Methods

In this study, we employed an attention-based temporal spectrum network for ambulance siren sound classification. Simultaneously, we enhanced the visual classification task using an improved YOLOv4 architecture. For the fusion of the acoustic and visual results, we utilized a stacking ensemble learning technique, creating a robust framework for emergency vehicle detection. This approach capitalizes on the strengths of both modalities, ensuring a comprehensive analysis that outperforms those of existing methods. Figure 1 displays the proposed framework.

2.1. Detection of Emergency Vehicles Using Visual Cues

Proposing the MLSF-YOLO Model for Visual-Based EVD

The basic concept of YOLO is to create a grid of $S \times S$ cells from an input image. For the $i$th cell ($1 \leq i \leq S^2$), if an object's center falls inside that cell, the responsibility for detecting that object lies with it. Modern YOLO iterations incorporate multiscale detection strategies, dividing the source image into several grids aligned with distinct levels of detection [21].
As shown in Figure 2a, the YOLO model consists of three essential parts: the prediction head [22], a neck used for structuring stages, and a backbone used for feature extraction. The model is based on the YOLOv4 architecture [23], incorporating the CSPDarknet53 backbone [24] and a multiscale detection method. To enhance detection accuracy, we introduced cross-stage partial (CSP) connections [25], augmenting the model's learning capacity. Figure 2a illustrates MLSF-YOLO, which is built from elements such as the CBM block; as shown in Figure 2b, the CBM comprises a convolutional layer for feature extraction, batch normalization for standardizing data, and the Mish activation function, which adds non-linearity to the network. The remaining building blocks are the ResUnit, the Branched Cross-Stage Partial (BCSP) connection, the Nested Cross-Stage Partial (NCSP) connection, the Spatial Pooling Layer (SPL), and the Multi-Level Spatial Fusion (MLSF) module, which combines a spatial pooling layer with a cross-stage partial connection. As shown in Figure 2f, the ResUnit prevents network degradation with increased depth. A CSP connection divides a given stage's feature map into two sections: one that is subjected to network block processing (the sequence of ResUnits) and another that is integrated into the transmitted feature map to produce the stage result. There are five CSP elements in the advanced YOLO backbone, labeled BCSP-n, with 'n' signifying the number of ResUnits in a BCSP-n unit, as shown in Figure 2e. To initiate the detector's neck, we used a modified Spatial Pooling Layer (SPL) [26] module to expand the receptive field. MLSF-YOLO integrates the SPL module within a CSP component, creating the MLSF module shown in Figure 2d at the start of the neck. For feature integration, we employed a path aggregation network (PANet) [27], using the PANet's outputs to enable three-scale detection. Within the detector neck, there are NCSP-n components; within the transition pathway, NCSP-n consists of n CBM elements, as shown in Figure 2g, instead of the originally used n ResUnits. Below, we explain the operation of emergency vehicle detection at a single prediction scale.
When MLSF-YOLO is used in our object detection model, the source image is segmented into a grid of size $S \times S$, and every grid cell is tasked with forecasting $B$ bounding boxes. Each box is defined by four key components: $x_b$, the x-coordinate of the bounding box center, computed as the sigmoid of $a$ plus an offset $p$; $y_b$, the y-coordinate of the center, determined by the sigmoid of $b$ plus an offset $q$; $w_b$, the width, computed by multiplying a factor $x$ with the sigmoid of $c$; and $h_b$, the height, found by multiplying a factor $y$ with the sigmoid of $d$. Equations (1) to (4) collectively enable precise object localization within the gridded image segmentation:
$$x_b = \sigma(a) + p \quad\quad (1)$$
$$y_b = \sigma(b) + q \quad\quad (2)$$
$$w_b = x \cdot \sigma(c) \quad\quad (3)$$
$$h_b = y \cdot \sigma(d) \quad\quad (4)$$
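For illustration, the short Python sketch below applies Equations (1)–(4) to a single raw prediction. It is a minimal illustration only, and the function and variable names (e.g., `decode_box`) are assumptions rather than part of the actual MLSF-YOLO implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(a, b, c, d, p, q, x, y):
    """Illustrative decoding of one raw prediction (a, b, c, d) into a box.

    p, q are the grid-cell offsets and x, y are the scaling factors used in
    Equations (1)-(4); all names here are hypothetical.
    """
    x_b = sigmoid(a) + p   # Equation (1): box center, x-coordinate
    y_b = sigmoid(b) + q   # Equation (2): box center, y-coordinate
    w_b = x * sigmoid(c)   # Equation (3): box width
    h_b = y * sigmoid(d)   # Equation (4): box height
    return x_b, y_b, w_b, h_b

# Example: raw outputs for a cell at grid offset (3, 5) with scale factors (2.1, 1.4)
print(decode_box(0.2, -0.4, 0.1, 0.3, p=3, q=5, x=2.1, y=1.4))
```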
We employed a weighted loss function, given in Equation (13), that combines three key elements: localization loss ($L_{CIoU}$), objectness loss ($L_{Obj}$), and classification loss ($L_{Cls}$), with weighting factors $\lambda_1$, $\lambda_2$, and $\lambda_3$. This loss encapsulates the core of our loss computation, emphasizing localization precision, objectness estimation, and class-specific classification while allowing for adjustable weighting to reflect their relative importance.
In terms of localization loss, we employed the CIoU loss [28] as specified in Equation (5). This loss considers the IoU between the predicted and actual bounding boxes, incorporating factors such as their spatial relationship ($\rho_c$) and aspect ratio ($\beta\alpha$). Equation (6) defines $\alpha$ as a trade-off parameter based on the expected box size ($S$) and the IoU. Equation (7) introduces $\nu$ to assess aspect ratio coherence by comparing $\alpha$ and $\beta$. These equations are crucial for accurate bounding box evaluation and optimization in object detection.
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{(\rho_c)^2}{\rho} - \beta\alpha\left(\frac{S}{C_{\mathrm{IoU}}}\right) \quad\quad (5)$$
$$\alpha = \frac{S}{\mathrm{IoU} + S} \quad\quad (6)$$
$$\nu = \frac{\alpha}{\beta} \quad\quad (7)$$
$$L_{\mathrm{obj}} = \sum_{x=1}^{S \times S} \sum_{y=1}^{S \times S} A_{xy} \cdot \max(\mathrm{IoU}_y) \quad\quad (8)$$
In Equation (8), the variables $x$ and $y$ represent the grid cell and bounding box indices, respectively. The binary indicator $A_{xy}$ takes a value of 1 when there are no objects in the detection field of the corresponding grid cell and 0 when an object is detected, and $\max(\mathrm{IoU}_y)$ represents the highest Intersection over Union (IoU) score between bounding box $y$ and the ground truth box. This formulation ensures that the loss function penalizes localization errors only in grid cells where objects are present and only for bounding boxes matched to the ground truth box.
In our methodology, we present the objectness loss ($L_{Obj}$) as a fundamental component of the image analysis evaluation. It is calculated as the sum of binary cross-entropy (BCE) losses for objectness scores within grid cells. This loss comprises two elements: one related to grid cells containing objects and the other concerning cells without objects. We employ a binary indicator $A_{xy}$, which takes the value 1 when there are no objects in the grid cell $(x, y)$ and 0 when an object is present. Additionally, we introduce the weight factor $\lambda_{\mathrm{noobj}}$ to regulate the impact of the loss in grid cells lacking objects. The specific formula for the objectness loss is given in Equation (9).
$$L_{Obj} = \sum_{x=1}^{N} \sum_{y=1}^{M} \mathrm{BCE}(OS_{xy}) \cdot (1 - A_{xy}) \cdot \lambda_{\mathrm{noobj}} \quad\quad (9)$$
where
$$\mathrm{BCE}(OS_{xy}) = -\,OS_{xy}^{*}\log(OS_{xy}) - \left(1 - OS_{xy}^{*}\right)\log\left(1 - OS_{xy}\right) \quad\quad (10)$$
Equation (11) evaluates the classification loss. We apply the same penalization method as we do for the localization loss.
$$L_{Cls} = \sum_{x=1}^{S \times S} \sum_{y=1}^{B} \delta_{xy}^{obj} \sum_{c \in \mathrm{classes}} \mathrm{BCE}\big(p_{xy}(c)\big) \quad\quad (11)$$
where
$$\mathrm{BCE}\big(p_{xy}(c)\big) = -\Big(p_{xy}^{*}(c)\log\big(p_{xy}(c)\big) + \big(1 - p_{xy}^{*}(c)\big)\log\big(1 - p_{xy}(c)\big)\Big) \quad\quad (12)$$
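To make the composition of the training objective explicit, the following minimal Python sketch combines precomputed component losses using the weighting scheme of Equation (13) and the BCE form of Equations (10) and (12). It is illustrative only and does not reproduce the actual MLSF-YOLO training code; the lambda values are placeholders.

```python
import numpy as np

def bce(p_true, p_pred, eps=1e-7):
    """Binary cross-entropy for a single probability (the form used in
    Equations (10) and (12)); eps guards against log(0)."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    return -(p_true * np.log(p_pred) + (1.0 - p_true) * np.log(1.0 - p_pred))

def total_loss(l_ciou, l_obj, l_cls, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss components (Equation (13)).
    The lambda weights here are placeholders, not the paper's settings."""
    l1, l2, l3 = lambdas
    return l1 * l_ciou + l2 * l_obj + l3 * l_cls

print(bce(1.0, 0.9), total_loss(0.4, 0.2, 0.1))
```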

2.2. Classifying Emergency Vehicles Using Acoustic Signals

2.2.1. Proposed Framework for Acoustic Classification

This paper introduces the attention-based temporal spectrum network to better understand selective time–frequency features in acoustic spectrograms.
Figure 3 displays the comprehensive model architecture. The model has two primary components: an attention-generating module and the backbone network. By applying two different attention processes to the Log-Mel spectrogram, the attention-generating module concentrates computational power on particular time intervals and frequency ranges. The backbone network is composed of pooling, fully connected, and convolutional layers to extract key time–frequency features and includes an attention mechanism for sound event prediction. In the concluding evaluation stage, a probability-based voting approach is employed to consolidate predictions from various acoustic segments, thereby mitigating classification errors arising from outlier data points.

2.2.2. Feature Processing

The “Log-Mel spectrogram” is highly regarded in the field of acoustic recognition because of its ability to replicate the human auditory system. It furnishes a two-dimensional feature map for each acoustic frame, capturing temporal and spectral attributes in both the time and frequency domains. In this study, we employed the Log-Mel spectrogram to extract time–frequency features from emergency vehicle sound events, enhancing classification accuracy through the use of convolutional neural networks (CNNs).
Our methodology commences with the standardization of the acoustic data format into mono. Subsequently, a 46 ms Hamming window with a 50% overlap is applied, and a Short-Time Fourier Transform (STFT) is executed to derive the amplitude spectrogram. This spectrogram is further processed using a 128-band Mel filter bank and logarithmic scaling to generate the Log-Mel spectrogram. To better align with the attention mechanism and the available training data, the Log-Mel spectrogram is divided into 64-frame segments with a 50% overlap, with zero padding applied to shorter segments. This yields Log-Mel feature vectors with dimensions of $1 \times 64 \times 128$, signifying channel × time × frequency.
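As an illustration of this feature pipeline, the sketch below uses the librosa library to compute a Log-Mel spectrogram with roughly the stated settings (mono audio at 22,050 Hz, a 46 ms Hamming window with 50% overlap, 128 Mel bands) and to split it into 64-frame segments. The helper name and the exact rounding of the window length are assumptions, not the authors' code.

```python
import numpy as np
import librosa

def log_mel_segments(path, sr=22050, win_ms=46, n_mels=128, seg_frames=64):
    """Illustrative Log-Mel pipeline: mono audio at 22,050 Hz, ~46 ms Hamming
    window with 50% overlap, 128 Mel bands, log scaling, then 64-frame
    sub-segments with 50% overlap (zero-padded at the tail)."""
    y, _ = librosa.load(path, sr=sr, mono=True)      # force mono, resample
    n_fft = int(sr * win_ms / 1000)                  # ~46 ms window length
    hop = n_fft // 2                                 # 50% overlap
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)               # shape: (128, T)

    segments, step = [], seg_frames // 2             # 50% overlap between segments
    for start in range(0, max(1, log_mel.shape[1]), step):
        seg = log_mel[:, start:start + seg_frames]
        if seg.shape[1] < seg_frames:                # zero-pad shorter segments
            seg = np.pad(seg, ((0, 0), (0, seg_frames - seg.shape[1])))
        segments.append(seg.T[np.newaxis])           # (1, 64, 128): channel x time x freq
        if start + seg_frames >= log_mel.shape[1]:
            break
    return np.stack(segments)
```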
$$\mathrm{Loss} = \lambda_1 L_{CIoU} + \lambda_2 L_{Obj} + \lambda_3 L_{Cls} \quad\quad (13)$$

2.2.3. Harmonic–Percussive Source Separation

In our investigative analysis, we seamlessly integrated the HPSS (Harmonic–Percussive Source Separation) algorithm [29] for input Log-Mel spectrogram processing. This method enabled us to efficiently divide the spectrograms into two fundamental constituents: harmonic spectrograms and percussive spectrograms. This approach offers a transparent and informative depiction of the frequency distribution and activity within discrete frequency bands present in the acoustic data.
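A minimal sketch of this separation step is shown below, assuming the librosa implementation of HPSS and a synthetic stand-in signal; the subsequent conversion of each component to a Log-Mel representation is our illustrative choice rather than a detail stated in the text.

```python
import numpy as np
import librosa

# Synthetic stand-in signal (a pure tone) so the sketch is self-contained.
sr = 22050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
y = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# STFT followed by median-filtering HPSS: split into harmonic and percussive parts.
S = librosa.stft(y, n_fft=1024, hop_length=512, window="hamming")
S_harmonic, S_percussive = librosa.decompose.hpss(S)

# Illustrative follow-up: a Log-Mel view of each component (our choice, not a stated detail).
mel_h = librosa.power_to_db(librosa.feature.melspectrogram(S=np.abs(S_harmonic) ** 2, sr=sr, n_mels=128))
mel_p = librosa.power_to_db(librosa.feature.melspectrogram(S=np.abs(S_percussive) ** 2, sr=sr, n_mels=128))
```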

2.2.4. Creating a Time–Frequency Attention Mechanism

The accurate detection of emergency vehicles from audio signals in real-world traffic environments is significantly impacted by background noise. Our methodology tackles this challenge by integrating advanced noise suppression techniques to enhance the classification accuracy. Specifically, we leveraged sophisticated attention mechanisms that dynamically adjust to prioritize informative portions of the audio signal while suppressing irrelevant background noise. Additionally, state-of-the-art signal processing algorithms, such as spectral subtraction and adaptive filtering, are applied to further improve the signal-to-noise ratio. Through these comprehensive approaches to addressing the audio noise in traffic, our classification framework exhibits a robust performance, enabling the reliable and precise detection of emergency vehicles. In complex, polyphonic acoustic scenarios with multiple concurrent sound sources, our approach employs both temporal and frequency attention mechanisms. Temporal attention emphasizes the importance of semantic segments while mitigating noise, whereas frequency attention highlights critical frequency bands while reducing less relevant ones. This approach enhances the extraction of crucial acoustic features in challenging acoustic environments, ensuring that our model remains effective even with diverse levels of background noise in real-world traffic conditions. By seamlessly integrating advanced noise suppression techniques with sophisticated attention mechanisms, our classification framework offers a robust solution for accurate emergency vehicle detection in complex acoustic environments.
As shown in Figure 4, the Log-Mel spectrogram $X$ undergoes standardization and processing through the HPSS algorithm to yield harmonic and percussive spectrograms; $(1 \times 3)$ and $(5 \times 1)$ kernel convolutions are then applied to extract features, leading to dimension reduction. A final $(1 \times 1)$ convolution condenses channel information, resulting in one-dimensional matrices $A_f(F, 1)$ and $A_t(1, T)$, which are normalized through the softmax function to produce the frequency and temporal weight matrices $F_w$ and $T_w$:
$$F_w(f) = \mathrm{Softmax}\big(A_f(f)\big)$$
$$T_w(t) = \mathrm{Softmax}\big(A_t(t)\big)$$
The Log-Mel spectrogram, denoted as $X$, undergoes element-wise multiplication with the attention weight matrices acquired from the frequency and time directions. This operation yields two distinct outputs: the frequency attention spectrogram, referred to as $S_F$, and the temporal attention spectrogram, labeled $S_T$. This transformation enhances the representation of the Log-Mel spectrogram by selectively attending to important frequency and temporal components, enabling the model to focus on relevant acoustic features for emergency vehicle detection. The resulting frequency and temporal attention spectrograms, $S_F$ and $S_T$, serve as crucial inputs for subsequent stages of our proposed framework. The mathematical formulation of this transformation can be expressed as follows:

$$S_F(f, t) = X(f, t) \cdot F_w(f)$$
$$S_T(f, t) = X(f, t) \cdot T_w(t)$$
In our initial ‘average combination’ approach, we created two attention spectrograms: $S_F$ for frequency emphasis and $S_T$ for temporal details. These spectrograms are applied to the Log-Mel spectrogram and equally combined into the temporal–frequency attention spectrogram $S_{T\&F}$ using the equation below:

$$S_{T\&F} = \frac{S_F + S_T}{2}$$
The ‘weighted combination’ approach introduces two adjustable network parameters, denoted as $\alpha$ and $\beta$, with the constraint that $\alpha + \beta = 1$. This approach synthesizes the ultimate temporal–frequency attention spectrogram, denoted as $S_{T\&F}^{Weight}$, by combining the two attention spectrograms, $S_F$ and $S_T$, in the proportions determined by the learnable parameters. Mathematically, this procedure can be expressed as follows:

$$S_{T\&F}^{Weight} = \alpha \cdot S_F + \beta \cdot S_T$$
The final approach, known as “channel combination”, integrates the two generated attention spectrograms, $S_F$ and $S_T$, by concatenating them to create a two-channel output:

$$S_{T\&F}^{Channel} = \mathrm{concatenate}(S_T, S_F)$$
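The sketch below summarizes the three combination strategies in NumPy, assuming the condensed attention outputs $A_f$ and $A_t$ have already been produced by the convolutional attention-generating module; it is an illustration of the equations above, not the network code itself.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_spectrograms(X, A_f, A_t, alpha=0.5, beta=0.5):
    """X: Log-Mel spectrogram of shape (F, T); A_f (F,) and A_t (T,) are the
    condensed outputs of the 1x1 convolutions (assumed precomputed here)."""
    F_w = softmax(A_f)                    # frequency weight matrix
    T_w = softmax(A_t)                    # temporal weight matrix
    S_F = X * F_w[:, None]                # frequency attention spectrogram
    S_T = X * T_w[None, :]                # temporal attention spectrogram

    S_avg = (S_F + S_T) / 2.0             # 'average combination'
    S_weight = alpha * S_F + beta * S_T   # 'weighted combination', alpha + beta = 1
    S_channel = np.stack([S_T, S_F])      # 'channel combination' (two channels)
    return S_avg, S_weight, S_channel

# Example with random placeholders for a 128 x 64 Log-Mel segment
X = np.random.rand(128, 64)
S_avg, S_weight, S_channel = attention_spectrograms(X, np.random.rand(128), np.random.rand(64))
```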

2.2.5. Network Structure

The architecture presented in this study, termed the ATSN, comprises six convolutional layers, three pooling layers, and two fully connected layers. The convolutional layers are organized into pairs, where each pair shares identical parameters, effectively forming a block, and every block is followed by a $2 \times 2$ max-pooling layer. Thirty-two $5 \times 3$ kernels with a stride of 2 are used in the first two convolutional layers. The subsequent four convolutional layers are equipped with 128 and 64 kernels, each measuring $3 \times 3$ and employing a stride of 1. Finally, the network incorporates two fully connected layers, each with 256 hidden units, for processing the flattened output, and the final result is fed into a Softmax classifier to produce predictions. The architecture uses the ReLU activation function, employs batch normalization at each convolutional layer to speed up training, and applies a dropout mechanism with a probability of 0.5 in the fully connected layers to prevent overfitting.
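A Keras sketch of this configuration is given below. Layer counts and sizes follow the description above, while details the text does not specify (padding, the exact filter ordering of the later blocks, and the loss wiring) are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_atsn_backbone(input_shape=(64, 128, 1), n_classes=2):
    """Sketch of the described layer stack; padding, the filter ordering of the
    later blocks, and the loss wiring are assumptions."""
    def conv_block(x, filters, kernel, stride):
        for _ in range(2):                               # layers come in identical pairs
            x = layers.Conv2D(filters, kernel, strides=stride, padding="same")(x)
            x = layers.BatchNormalization()(x)           # batch norm at every conv layer
            x = layers.ReLU()(x)
        return layers.MaxPooling2D(pool_size=(2, 2))(x)  # one 2x2 max-pool per block

    inputs = layers.Input(shape=input_shape)             # channels-last: time x freq x 1
    x = conv_block(inputs, 32, (5, 3), 2)                # first pair: 32 kernels, 5x3, stride 2
    x = conv_block(x, 64, (3, 3), 1)                     # later pairs: 3x3, stride 1
    x = conv_block(x, 128, (3, 3), 1)                    # (64/128 ordering assumed)

    x = layers.Flatten()(x)
    for _ in range(2):                                   # two fully connected layers, 256 units
        x = layers.Dense(256, activation="relu")(x)
        x = layers.Dropout(0.5)(x)                       # dropout 0.5 against overfitting
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_atsn_backbone()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),   # settings reported in Section 3.1
              loss="binary_crossentropy", metrics=["accuracy"])
```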

2.2.6. Decision Approach

The stage of feature processing includes segmenting the log-Mel spectrogram into sub-segments of 64 frames each, with a 50% overlap between them. Each sub-segment inherits its category label from the acoustic source. During the training stage, individual sub-segments are used as inputs for network training, leading to predictions for their respective categories. As we reach the concluding testing phase, the aim is to make predictions for the complete acoustic segment to determine its category. This involves employing a probabilistic voting strategy to aggregate predictions from multiple sub-segments for the final decision, as visualized in Figure 5.
The mathematical representation of this strategy is as follows:
$$D = \underset{1 \le i \le H}{\arg\max}\;\frac{1}{Z}\sum_{k=1}^{Z} P_k^{\,i}$$

Here, $Z$ denotes the number of sub-segments allocated to each acoustic sample, $H$ reflects the dataset's category count, and $P_k^{\,i}$ indicates the prediction outcome of an individual sub-segment $k$ for category $i$.
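A compact sketch of this voting rule, assuming the per-sub-segment class probabilities have already been collected, is shown below.

```python
import numpy as np

def probabilistic_vote(segment_probs):
    """segment_probs: array of shape (Z, H), one probability vector per
    sub-segment. The clip-level decision averages the per-segment class
    probabilities and takes the argmax over the H categories."""
    mean_probs = segment_probs.mean(axis=0)   # (1/Z) * sum over sub-segments
    return int(np.argmax(mean_probs)), mean_probs

# Example: three sub-segments, two classes (siren vs. noise)
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]])
label, avg = probabilistic_vote(probs)
```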

2.3. Harmonizing Hard and Soft Predictions

The method of stacking involves training a meta-model that combines the output of several models to produce an ultimate prediction. This tactic works particularly well when combining predictions from several models, such as those for visual detection and acoustic categorization [30]. The primary aim is to enhance the capability of the model to generalize and complete the execution of EVD. Stacking leverages the unique strengths of each model, presenting an effective approach for addressing diverse aspects of the classification problem. This approach results in predictions that are not only more accurate but also notably more reliable. This study features the implementation of the stacking ensemble learning technique. This highly effective ensemble method integrates outcomes from a range of independent models, including a dedicated ATSN classification model and MLSF-YOLO model tailored for the precise identification of emergency vehicles.

2.3.1. Hard Prediction

For a binary prediction, each model produces a discrete class label as its output. The stacking ensemble then draws conclusions based on both the predictions of these models and the corresponding ground-truth labels. In this approach, we use a meta-classifier, $M_{\mathrm{hard}}$, which uses the learned information to generate a precise binary prediction. Given a data record $x$ as input, the stacking ensemble combines the output of the acoustic classification model, denoted $h(x)$, with that of the visual detection approach, denoted $g(x)$. The meta-classifier then calculates the final binary prediction, $y_{\mathrm{hard}}$, as stated in the following equation:
$$y_{\mathrm{hard}} = M_{\mathrm{hard}}\big(h(x), g(x)\big)$$

2.3.2. Soft Prediction

For soft prediction, each model produces a probability distribution across the class categories. Similarly, the stacking ensemble uses this information together with the true labels to train an additional meta-classifier, $M_{\mathrm{soft}}$. As stated in Equation (23), the stacking ensemble considers an input data point $x$ and integrates the predictions of the acoustic classification model, denoted $h_s(x)$, with those of the visual detection model, denoted $g_s(x)$:
$$y_{\mathrm{soft}} = M_{\mathrm{soft}}\big(h_s(x), g_s(x)\big) \quad\quad (23)$$
We aim to leverage each model's distinct strengths so that the overall performance may be enhanced and the classification domain can be effectively generalized. The use of this ensemble technique considerably improves prediction reliability and accuracy, which is crucial for differentiating between emergency vehicles (EVs) and non-emergency vehicles (NEVs) in our research.
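The following sketch illustrates the stacking step with scikit-learn, using a logistic regression meta-classifier as a placeholder (the text does not name the specific meta-classifier) and synthetic base-model outputs for $h(x)$, $g(x)$, $h_s(x)$, and $g_s(x)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic base-model outputs collected on a held-out set (placeholders).
acoustic_hard = np.array([1, 0, 1, 1, 0])                  # h(x): ATSN class labels
visual_hard   = np.array([1, 0, 0, 1, 0])                  # g(x): MLSF-YOLO class labels
acoustic_soft = np.array([0.92, 0.10, 0.80, 0.70, 0.20])   # h_s(x): EV probability
visual_soft   = np.array([0.88, 0.05, 0.40, 0.95, 0.15])   # g_s(x): EV probability
y_true        = np.array([1, 0, 1, 1, 0])                  # ground-truth EV / NEV labels

# Hard stacking trains on predicted labels; soft stacking trains on probabilities.
M_hard = LogisticRegression().fit(np.column_stack([acoustic_hard, visual_hard]), y_true)
M_soft = LogisticRegression().fit(np.column_stack([acoustic_soft, visual_soft]), y_true)

# Fused predictions for a new sample x with h(x)=1, g(x)=0, h_s(x)=0.75, g_s(x)=0.35
y_hard = M_hard.predict([[1, 0]])
y_soft = M_soft.predict([[0.75, 0.35]])
```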

3. Experiments and Results

In our study, we performed our experiments on a system equipped with an Intel Core i7-14650HX CPU running at 2.30 GHz, 16 GB of RAM, and a 1 TB SSD for data storage. To accelerate computations, the setup also incorporated an Nvidia GeForce RTX 4060 GPU. We used the TensorFlow framework within a Python 3.9 development environment.

3.1. Hyperparameters Used

For the ATSN utilized in acoustic classification, we employed an initial learning rate of 0.001 and utilized the Adam optimizer for weight updates. The choice of the binary cross-entropy loss function was determined to suit the binary classification task inherent in our acoustic signal classification. Moreover, we incorporated batch normalization at every layer to expedite training and mitigate overfitting, while dropout regularization was applied specifically to fully connected layers to further prevent overfitting.
In the case of the improved YOLOv4 architecture, termed MLSF-YOLO, we adjusted the batch size to 16 and initialized training with a starting learning rate of 0.01. The momentum was set to 0.9, and we integrated a weight decay mechanism at a rate of 0.0005 to alleviate overfitting. The choice of the stochastic gradient descent (SGD) optimizer with momentum and weight decay was based on its proven effectiveness in optimizing object detection models. Furthermore, we adopted the Mish activation function throughout the MLSF-YOLO model, departing from the original approach of using leaky-ReLU and Mish activation solely for the backbone. This decision was made after comprehensive experimentation, considering Mish’s superior performance in facilitating gradient flow and convergence during training.
We conducted numerous experiments involving various combinations of hyperparameters, optimizers, and activation functions to optimize model performance. Hyperparameters such as the learning rate, batch size, and weight decay were carefully tuned to ensure that the models were adequately trained. Different optimizers, including Adam and SGD with momentum, were employed to optimize the learning process and minimize the loss function. Additionally, various activation functions, such as ReLU and Mish, were tested to introduce non-linearity and enhance the models' representational capacity. Through rigorous experimentation, we observed the effects of parameter changes on model performance, allowing us to fine-tune our approach. Ultimately, our experiments revealed that the Adam optimizer paired with the ReLU activation function yielded the best results for the ATSN model, while SGD with momentum combined with the Mish activation function proved to be the optimal choice for the MLSF-YOLO model, as shown in Table 1.
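Expressed in Keras, the reported optimizer settings look roughly as follows; this is a configuration sketch only, and learning-rate schedules or other unstated details are omitted.

```python
import tensorflow as tf

# ATSN: Adam optimizer with an initial learning rate of 0.001.
atsn_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# MLSF-YOLO: SGD with momentum 0.9 and learning rate 0.01; the 0.0005 weight decay
# is passed via the weight_decay argument available in recent Keras releases
# (older versions would apply it through kernel regularization instead).
mlsf_yolo_optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.9, weight_decay=0.0005)
MLSF_YOLO_BATCH_SIZE = 16
```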

3.2. Data for Visual EVD Experiments

Collecting data for VEVD experiments presents several challenges. One primary hurdle is the restricted accessibility of roadside traffic camera data due to legal constraints, making it unavailable to the public. Additionally, emergency vehicle sightings on roads are infrequent compared to non-emergency vehicles under standard traffic conditions. To address this, we opted to acquire data from YouTube, a widely used video-sharing platform. We selected relevant videos from YouTube channels specializing in emergency vehicle content, extracted images from these videos, and subsequently categorized them. Only two object classes, NEVs and EVs, were the focus of our annotation efforts.
Table 2 provides an overview of our visual-based emergency vehicle detection data, consisting of videos from various scenes. We extracted 19,144 frames, containing 23,448 emergency vehicle instances and 39,895 non-emergency vehicle instances, resulting in a visual emergency vehicle detection dataset with an average of 3.25 instances per frame. In the dataset, there are three distinct types of emergency vehicles: fire trucks (8129), police cars (6699), and ambulances (7920), with a total count of 22,748. The visual dataset was further partitioned into training and testing subsets, with a division ratio of 75:25. Additionally, we enriched the dataset’s diversity by incorporating the KITTI dataset [31], which encompasses a wide array of objects, both vehicular and non-vehicular. From the original KITTI dataset, we exclusively extracted images featuring vehicles to construct a refined subset. This subset consisted of 6850 images, which were subsequently partitioned into training, validation, and testing sets, aligning with the visual dataset’s ratio. In total, the combined dataset for visual emergency vehicle detection and KITTI (VEVD+KITTI) encompassed 25,994 images, with an average of 4.45 vehicular objects per image. Consequently, the VEVD system maintains its effectiveness and reliability under challenging lighting conditions due to its robust training on diverse datasets. The model’s ability to generalize across different environmental settings, including scenarios with poor lighting, underscores its adaptability and resilience in real-world applications. Through extensive exposure to diverse lighting conditions during training, our VEVD model is equipped to deliver consistent performances, ensuring reliable emergency vehicle detection even in adverse lighting environments. Furthermore, we also prioritized data anonymization to protect the privacy and confidentiality of individuals. Any personally identifiable information was removed or carefully anonymized in the dataset, ensuring that individuals cannot be identified or linked to specific data points. This approach safeguards the privacy of individuals and upholds ethical standards.

3.3. Data for Acoustic EVD Experiments

Using the same dataset as in [14], we trained and tested the attention-based convolutional audio classifier on the audio-based emergency vehicle detection dataset. This dataset encompasses 26,705 audio samples, incorporating EV horn sounds and other noises. It combines 15,947 entries that we gathered ourselves, 8758 entries from the UrbanSound8K dataset [32], and 2000 entries from the ESC-50 dataset [33]. The dataset was methodically divided into training, testing, and validation sets comprising 16,310, 5262, and 5133 samples, respectively. Given the binary classification nature of our acoustic-based emergency vehicle detection system, we restructured the acoustic dataset into two distinct acoustic groups: siren sounds and noise, with car horns classified under the noise category. Detailed statistics of the acoustic-based emergency vehicle detection dataset can be found in Table 3. Owing to the diverse sources of the experimental dataset, there may have been variations in the sampling rate, channel count, and bit depth across data samples. To mitigate this, we standardized the entire dataset into monophonic data with a sampling rate of 22,050 Hz and 16-bit encoding.

3.4. Visual EVD Experiments

The MLSF-YOLO model’s performance was assessed using the VEVD + KITTI datasets. In evaluating the effectiveness of the object identification algorithms, one of the crucial metrics employed was the Mean Average Precision (mAP). This metric is measured through the computation of the Average Precision (AP) for every distinct class, followed by the derivation of the collective mean from these class-specific values. It serves as an indicator of the accuracy of object identification across various object classes. The AP for an object class is determined by integrating precision values at various recall levels. The formula for mAP is expressed as follows:
$$mAP = \frac{\sum AP}{N}$$

Here, $\sum AP$ represents the summation of the AP values across the various object classes, $N$ is the total number of object classes, and mAP is the resulting mean value. By taking into account the model's performance over a wide variety of object classes, this metric provides a comprehensive evaluation of the model's detection skills.
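As a simple worked example of this metric, averaging hypothetical per-class AP values of 0.952 for EVs and 0.931 for NEVs gives an mAP of about 0.94:

```python
def mean_average_precision(ap_per_class):
    """Mean Average Precision: the average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical per-class AP values for the two classes (EV, NEV)
print(mean_average_precision([0.952, 0.931]))  # -> 0.9415
```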

Visual EVD with Different Object Detectors

For our initial experiment, our primary objective was to showcase the benefits of deploying the MLSF-YOLO framework for vision-based emergency vehicle detection. To achieve this, we conducted a comprehensive analysis comparing the performances of single-shot object detection models, including YOLOv3, YOLOv4, SSD, and five distinct variants of EfficientDet, on the VEVD + KITTI dataset. This dataset served as a suitable benchmark for evaluating the models' capabilities in real-world scenarios. The results of our evaluation are presented in Table 4, which provides a comprehensive overview of the outcomes. Furthermore, the MLSF-YOLO architecture exhibited exceptional efficiency, with inference times as low as 4.9 milliseconds. This efficiency is crucial for real-time applications in traffic management and road safety systems, where the timely and accurate detection of emergency vehicles is of utmost importance.
Almost all detectors performed admirably in terms of detection, with accuracy values ranging from 81.93 to 91.2 across a range of input dimensions from 300 × 300 to 1024 × 1024. Furthermore, all other detectors (SSD-512, EfficientDet-D3, and EfficientDet-D4) attained processing times of less than 40 ms per image. Impressively, YOLOv4 significantly outperforms its three counterparts, achieving the highest detection accuracies at an input size of 608 × 608: 68.9 (mAP@[0.5:0.95]), 92.8 (mAP@0.5), and 81.9 (mAP@0.75), all while maintaining the shortest inference time at just 4.5 milliseconds per image. Notably, YOLOv4's time cost is significantly lower than that of SSD and the various EfficientDet models, and almost half that of YOLOv3 at identical input sizes. The MLSF-YOLO architecture demonstrates superior efficiency compared to the other object detectors evaluated on the Visual-EVD + KITTI dataset. At a resolution of 608 × 608, MLSF-YOLO achieves a remarkable mean average precision of 71.1%, outperforming the other architectures in terms of both mAP@0.5 and mAP@0.75. Additionally, MLSF-YOLO exhibits significantly lower time costs, requiring only 4.9 milliseconds for inference. This exceptional efficiency highlights the effectiveness of MLSF-YOLO in accurately detecting emergency vehicles while minimizing computational overhead, making it a compelling choice for real-time applications in traffic management and road safety systems. Additionally, we emphasize that our model selection process involved thorough experimentation. We conducted extensive evaluations, exploring various models and architectures, before ultimately selecting the MLSF-YOLO model. We rigorously tested and compared the performances of different models to ensure that the MLSF-YOLO architecture exhibited the desired characteristics for our specific problem domain. The selection of this model was based on its demonstrated performance and suitability for emergency vehicle detection. The comprehensive evaluation and comparison of these models allowed us to draw meaningful conclusions about the benefits of deploying the MLSF-YOLO framework for vision-based emergency vehicle detection. By leveraging the MLSF-YOLO architecture, we can attain high accuracy and efficiency, making it an attractive option for various applications.

3.5. Acoustic EVD Experiments

To augment the acoustic data, we incorporated a technique described in [34]. This entailed sporadically adding road noise to the original data, thereby generating noisy samples with an adjustable signal-to-noise ratio (SNR). Moreover, batch normalization is applied at every layer to accelerate the training procedure, and dropout regularization is used on the fully connected layers to reduce the possibility of overfitting [35].
While the recordings in the AEVD dataset typically range from 4 s to 5 s in duration, we specifically focused on input lengths between 0.25 s and 1.5 s. This choice was motivated by the desire to reduce computational complexity and cater to the practical requirements of emergency vehicle detection applications, where swift response times are crucial; longer inputs tend to lead to extended response times. Additionally, we assessed the network's stability by measuring its performance on noisy test sets. These sets, corresponding to SNRs of +20 dB, +10 dB, 0 dB, −10 dB, −20 dB, and −30 dB, were produced by applying traffic noise at the respective SNR levels to each sample in the initial test set. It is important to note that the initial testing data were collected during real-time driving scenarios, where natural traffic noise was already present; deliberately introducing additional noise therefore makes it significantly more challenging to assess the proposed model.
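A minimal sketch of this kind of SNR-controlled noise mixing is given below, assuming mono waveforms of equal length; it illustrates the idea rather than the exact augmentation procedure of [34].

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Add noise to a clean sample at a target SNR in dB (both inputs are
    assumed to be mono waveforms of equal length)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

# Example: corrupt a placeholder clip with placeholder traffic noise at -10 dB SNR
sr = 22050
clean = np.random.randn(sr).astype(np.float32)    # stand-in siren clip
traffic = np.random.randn(sr).astype(np.float32)  # stand-in traffic noise
noisy = mix_at_snr(clean, traffic, snr_db=-10)
```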

Results of Attention-Based Temporal Spectrum Network

The performance of the ATSN framework across varying input lengths and noise levels is shown in Table 5. Notably, the proposed ATSN exhibits an exceptional performance, achieving high accuracy rates on the original test dataset. Specifically, for input durations of 1.5, 1.0, 0.5, and 0.25 s, the accuracy rates were 93.47%, 93.24%, 93.29%, and 93.19%, respectively. It is important to emphasize that the experimental data were collected in authentic, on-scene scenarios, underscoring the efficacy and robustness of the ATSN in typical traffic conditions across all input durations. Even with a notably brief input duration of 0.25 s, the ATSN achieved a commendable classification accuracy of 93.19%. In the presence of moderate noise levels, specifically signal-to-noise ratios (SNRs) of +20 dB and +10 dB, the ATSN consistently maintained a robust performance across all input durations, with accuracy rates surpassing 93%. At a 0 dB SNR, the proposed network continued to exhibit high accuracy on the 1.5 s test set (93.37%) and the 1.0 s test set (93.17%), with results of 91.75% and 93.42% for the 0.5 s and 0.25 s test sets, respectively. Under heightened noise levels, specifically at SNRs of −10 dB and −20 dB, a more pronounced decline in classification accuracy was observed for shorter input durations: the ATSN maintained accuracy rates above 93% on the two longer-duration test sets, while the accuracies on the 0.5 s and 0.25 s test sets dropped to 91.39% and 91.27%, respectively, at −20 dB. When the SNR falls to −30 dB, the accuracy suffers considerably, but the ATSN still consistently delivers a performance exceeding 88% across all four input lengths. The data indicate that input durations of 1.5 s and 1.0 s yield more favorable results for acoustic-based emergency vehicle detection, both on raw waveform data and using the ATSN model. For these durations, the accuracy of the network decreases only slightly across varying SNRs and stays above 90% even in very noisy environments (−20 dB). On the other hand, the ATSN performs worse at SNRs of −20 dB and −30 dB for shorter input lengths due to its increased sensitivity to high noise levels.
Furthermore, it is worth noting that the inference times of the ATSN model vary depending on the input lengths. When handling input lengths of 1.5, 1.0, 0.5, and 0.25 s, the corresponding inference times per sample were 8 milliseconds, 7 milliseconds, 4 milliseconds, and 2 milliseconds, respectively. These inference times are well within acceptable ranges for real-time applications, further highlighting the efficiency of the ATSN model.
From Table 6, we can observe that both Class 1 and Class 2 classifications achieved consistently high-performance metrics. This indicates the model’s ability to effectively differentiate between emergency and non-emergency vehicles based on acoustic characteristics. These findings reinforce the practicality and real-world applicability of the ATSN approach for acoustic-based classification applications.
During our extensive investigation into emergency vehicle detection, we utilized accuracy and loss metrics to evaluate the efficacy of our proposed ATSN architecture. The results reveal a remarkable accuracy of 93.47%, as shown in Figure 6a. To gain further insights into the model’s performance, we analyzed the confusion matrix, as depicted in Figure 6b. The confusion matrix provides a visual representation of how effectively the model distinguishes between emergency vehicles and non-emergency vehicles. Furthermore, we conducted a comprehensive performance comparison for the acoustic classification of emergency and non-emergency vehicles, as presented in Table 6. This comparison allowed us to assess the model’s ability to accurately classify vehicles based on their acoustic characteristics.
By employing these evaluation techniques, we were able to assess the effectiveness and accuracy of the ATSN architecture in detecting emergency vehicles. The high accuracy, supported by the confusion matrix analysis and performance comparison, demonstrates the robustness and reliability of the ATSN model for acoustic-based emergency vehicle classification.

3.6. Results of Multi-Level Spatial Fusion YOLO

Building upon the exceptional performance demonstrated by YOLOv4 in the first trial, MLSF-YOLO was introduced, using the YOLOv4 model as its basis. The primary modification centers on the neck of the detector, where we implemented the structure of a CSP connection. MLSF-YOLO's detailed structure is shown in Figure 2a. Within this design, we replaced the SPL module of YOLOv4 with an MLSF module, while also integrating NCSP components to create the PANet structure within the MLSF-YOLO neck.
Table 7 displays the effectiveness of the MLSF-YOLO model across various input sizes on the VEVD + KITTI dataset. MLSF-YOLO demonstrates commendable detection accuracies across all input sizes. Notably, with the smallest input size (320 × 320), MLSF-YOLO achieved an mAP@0.5 of 93.4, surpassing the results of YOLOv3, EfficientDet, YOLOv4, and SSD, which were examined with larger inputs (as detailed in Table 4). As the input size increases, the detection accuracy rises, with MLSF-YOLO reaching 94.9 mAP@0.5 for 640 × 640 test images. Notably, beyond an input size of 608 × 608, the MLSF-YOLO performance plateaus, hovering around 95.2 mAP@0.5. Consequently, 608 × 608 emerges as the optimal input size for VEVD when utilizing MLSF-YOLO trained on the VEVD + KITTI dataset. The computational efficiency of MLSF-YOLO is equally important, with time costs ranging from 2.2 milliseconds (320 × 320) to 6.9 milliseconds (768 × 768) per image, making it suitable for live execution. Figure 7 showcases a sample detection result generated using MLSF-YOLO.
Table 8 presents a comparison of visual-based emergency vehicle detection methods from previous research studies. Different approaches were evaluated, each employing distinct methodologies and achieving varying levels of accuracy. One study utilized an RCNN method and achieved an accuracy of 92% but faced challenges related to lighting conditions. Another study employed the YOLOv5 method, achieving an accuracy of 88%, but encountered high computation costs. A CNN model achieved an accuracy of 85% but faced difficulties in classification. Another method, YOLOv4_AF, achieved an accuracy of 83% but offers limited accuracy compared to other models. A YOLOv4_FIRI model achieved an accuracy of 94% but suffered from slow speed and limited adaptability. The MLSF-YOLO framework achieved the highest accuracy of 95%, although its practical implications were noted as a limitation. Considering the high accuracy achieved by MLSF-YOLO, further investigation is warranted to explore its real-world application and address the perceived limitations. Future research should focus on enhancing the practicality and adaptability of the MLSF-YOLO method, ensuring its effectiveness in real-time emergency vehicle detection scenarios.

3.7. Ensemble Learning Results

To assess the effectiveness of our multimodal emergency vehicle detection (MEVD) framework, we compiled a video repository exhibiting vehicular entities and the corresponding acoustic environment in diverse traffic scenarios. Broadly, we identified four traffic scenarios based on whether EVs can be detected through the visual and auditory channels. The first scenario involves the presence of an emergency vehicle within the camera's perspective, along with the audible blaring of its siren. The second is a standard traffic scenario with no visible or audible signs of an emergency vehicle. The remaining two cases involve either only the visual appearance or only the siren sound of an emergency vehicle in the video. In the initial stage of MEVD, we focus solely on the first two traffic scenarios, encompassing the detection of an emergency vehicle through both visual and auditory means.
In our comprehensive exploration of emergency vehicle detection, we harnessed loss and accuracy metrics to evaluate the efficacy of our innovative strategy. Our findings underscore the remarkable accuracy rate of 96.19%, as shown in Figure 8a. This outcome reflects the successful synergy of acoustic and visual combinations through ensemble learning, showing the potential of our approach in enhancing emergency vehicle detection systems. Similarly, in Figure 8b, the confusion matrix highlights how well the model differentiates between EVs and NEVs.
We conducted a performance comparison for the classification of emergency and non-emergency vehicles, as shown in Table 9. From Table 9, we observe that both Class 1 and Class 2 classifications exhibited similar high-performance metrics. These findings demonstrate the model’s appropriateness for real-world acoustic and visual-based categorization by showing how well it distinguishes between emergency and non-emergency vehicles.
As depicted in Figure 9, the hard prediction module demonstrated an exceptional performance in the context of acoustic and visual emergency vehicle detection. It achieved an accuracy score of 0.961, a precision score of 0.94, a recall score of 0.95, and an F1 score of 0.96, outperforming all other modules in these domains. These results highlight the usefulness of our proposed ensemble learning method, which fuses acoustic and visual data, for this crucial application. The approach provides a robust and reliable means to accurately detect and classify emergency vehicles. Because the integration of multiple components allows the system to utilize complementary information from both acoustic and visual sources, it achieves improved detection accuracy and enhanced decision-making capabilities. This outcome holds significant importance for emergency response systems and may potentially improve their effectiveness in ensuring public safety and rapid responses in emergencies.
In Table 10, we present a comprehensive comparative analysis of the performance of our proposed emergency vehicle detection framework and those of previous studies conducted in this field of research. The table encompasses a diverse range of studies, providing insights into the accuracy metrics achieved by each study, as well as whether multi-model techniques or ensemble learning methodologies were employed. By examining the table, we can gain valuable insights into the advancements made in emergency vehicle detection and the effectiveness of our proposed framework compared to prior research efforts. The inclusion of accuracy metrics allows for a quantitative assessment of the performances of different approaches, enabling a comprehensive evaluation of their effectiveness. Furthermore, the indication of whether multi-model techniques or ensemble learning were utilized in each study provides additional context regarding the methodologies employed. This information aids in understanding the complexities of the various approaches and their impact on the overall performance of emergency vehicle detection systems.

4. Conclusions

This study proposed a deep learning approach with multimodal fusion to construct an advanced system capable of identifying emergency vehicles in real-world traffic conditions. To address the visual aspect, we used the MLSF-YOLO model and compiled the VEVD dataset, tailoring it specifically for VEVD purposes. By training and evaluating MLSF-YOLO on a combined dataset that merged Visual-EVD and KITTI, we achieved exceptional outcomes, enabling real-time processing at an impressive speed of 4.9 ms per image while maintaining a mean average precision of 95.2%. Shifting our focus to auditory analysis, we proposed the ATSN model, a comprehensive end-to-end architecture designed to extract features from acoustic waveforms for classification. The experiments conducted demonstrate the effectiveness of the ATSN model, showcasing its consistent performance across varying input lengths and signal-to-noise ratios and surpassing the accuracy achieved by previous methodologies. Lastly, our results support the use of acoustic recognition and object detection techniques together, as this combination brings significant benefits and improves the overall results.

5. Limitations and Future Work

While our study demonstrates promising results in enhancing emergency vehicle detection through multimodal fusion, several avenues remain for further exploration and improvement. First, applying dimension reduction techniques to our dataset could provide deeper insight into model behavior and help optimize computational efficiency. Second, future research should investigate advanced synchronization techniques to refine the integration of the visual and acoustic data streams for real-world deployment, potentially improving system robustness. Third, the limited diversity of our test conditions should be addressed: experiments under different weather conditions, times of day, traffic scenarios, and camera angles would provide a more comprehensive evaluation of system performance. Finally, the impact of camera quality and sensor placement on detection accuracy should be examined to optimize system design and deployment strategies.
Although favorable results were obtained with VEVD, AEVD, and MEVD, further refinement and evaluation are needed to improve detection accuracy and reliability so that these EVD systems meet real-world application requirements. We will first increase the diversity of the experimental data by collecting VEVD data in both daytime and night-time scenarios and under a range of weather conditions, including rain, snow, sun, and fog. In addition, we will adapt the EVD systems for resource-constrained IoT devices, broadening the range of applications for the proposed EVD techniques.
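The synchronization direction mentioned above can be prototyped very simply by pairing each video frame with the audio prediction whose timestamp is closest, within a tolerance. The sketch below is a hypothetical illustration only; the timestamp sources, the 50 ms tolerance, and the data structures are assumptions rather than part of the reported system.

```python
from bisect import bisect_left

def pair_by_timestamp(frame_times, audio_times, tolerance=0.05):
    """Match each video frame time (s) to the nearest audio-clip time (s).

    Returns (frame_index, audio_index) pairs, where audio_index refers to the
    time-sorted audio sequence; frames with no audio prediction within
    `tolerance` seconds are left unmatched.
    """
    audio_times = sorted(audio_times)
    pairs = []
    for i, t in enumerate(frame_times):
        j = bisect_left(audio_times, t)
        # Candidates are the neighbours around the insertion point.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(audio_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(audio_times[k] - t))
        if abs(audio_times[best] - t) <= tolerance:
            pairs.append((i, best))
    return pairs

# Example: 30 fps frames against audio predictions emitted every 0.25 s.
frames = [k / 30 for k in range(10)]
audio = [k * 0.25 for k in range(4)]
print(pair_by_timestamp(frames, audio))
```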

Author Contributions

Conceptualization, M.Z., M.A., and M.E.; data curation, M.A. and M.E.; formal analysis, M.Z.; funding acquisition, M.A. and M.E.; investigation, M.A. and M.E.; methodology, M.Z. and M.A.; project administration, M.A. and M.E.; resources, M.A. and M.E.; software, M.Z.; supervision, M.A.; validation, M.A. and M.E.; visualization, M.A.; writing—original draft, M.Z.; writing—review and editing, M.A. and M.E. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the EIAS Data Science and Blockchain Laboratory College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia. The authors would like to thank Prince Sultan University for paying the APC of this article.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to thank Prince Sultan University for their valuable support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ghazi, M.U.; Khattak, M.A.K.; Shabir, B.; Malik, A.W.; Ramzan, M.S. Emergency message dissemination in vehicular networks: A review. IEEE Access 2020, 8, 38606–38621. [Google Scholar] [CrossRef]
  2. Damaševičius, R.; Bacanin, N.; Misra, S. From sensors to safety: Internet of Emergency Services (IoES) for emergency response and disaster management. J. Sens. Actuator Netw. 2023, 12, 41. [Google Scholar] [CrossRef]
  3. Wang, X.; Liu, Q.; Guo, F.; Xu, X.; Chen, X. Causation analysis of crashes and near crashes using naturalistic driving data. Accid. Anal. Prev. 2022, 177, 106821. [Google Scholar] [CrossRef] [PubMed]
  4. Razalli, H.; Ramli, R.; Alkawaz, M.H. Emergency vehicle recognition and classification method using HSV color segmentation. In Proceedings of the 2020 16th IEEE International Colloquium on Signal Processing & Its Applications (CSPA), Langkawi, Malaysia, 28–29 February 2020; pp. 284–289. [Google Scholar]
  5. Sarda, A.; Dixit, S.; Bhan, A. Object detection for autonomous driving using yolo [you only look once] algorithm. In Proceedings of the 2021 Third IEEE International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 1370–1374. [Google Scholar]
  6. Kherraki, A.; El Ouazzani, R. Deep convolutional neural networks architecture for an efficient emergency vehicle classification in real-time traffic monitoring. IAES Int. J. Artif. Intell. 2022, 11, 110. [Google Scholar] [CrossRef]
  7. Sorour, S.E.; Hany, A.A.; Elredeny, M.S.; Sedik, A.; Hussien, R.M. An Automatic Dermatology Detection System Based on Deep Learning and Computer Vision. IEEE Access 2023, 11, 137769–137778. [Google Scholar] [CrossRef]
  8. Goel, S.; Baghel, A.; Srivastava, A.; Tyagi, A.; Nagrath, P. Detection of emergency vehicles using modified YOLO algorithm. In Proceedings of the Intelligent Communication, Control and Devices (ICICCD 2018); Springer: Berlin/Heidelberg, Germany, 2020; pp. 671–687. [Google Scholar]
  9. Berwo, M.A.; Khan, A.; Fang, Y.; Fahim, H.; Javaid, S.; Mahmood, J.; Abideen, Z.U.; Syam, M.S. Deep Learning Techniques for Vehicle Detection and Classification from Images/Videos: A Survey. Sensors 2023, 23, 4832. [Google Scholar] [CrossRef] [PubMed]
  10. Baghel, A.; Srivastava, A.; Tyagi, A.; Goel, S.; Nagrath, P. Analysis of Ex-YOLO algorithm with other real-time algorithms for emergency vehicle detection. In Proceedings of the First International Conference on Computing, Communications, and Cyber-Security (IC4S 2019); Springer: Berlin/Heidelberg, Germany, 2020; pp. 607–618. [Google Scholar]
  11. Farid, A.; Hussain, F.; Khan, K.; Shahzad, M.; Khan, U.; Mahmood, Z. A Fast and Accurate Real-Time Vehicle Detection Method Using Deep Learning for Unconstrained Environments. Appl. Sci. 2023, 13, 3059. [Google Scholar] [CrossRef]
  12. Pan, M.; Liu, Y.; Cao, J.; Li, Y.; Li, C.; Chen, C.H. Visual recognition based on deep learning for navigation mark classification. IEEE Access 2020, 8, 32767–32775. [Google Scholar] [CrossRef]
  13. Tahir, N.U.A.; Zhang, Z.; Asim, M.; Chen, J.; ELAffendi, M. Object Detection in Autonomous Vehicles under Adverse Weather: A Review of Traditional and Deep Learning Approaches. Algorithms 2024, 17, 103. [Google Scholar] [CrossRef]
  14. Tran, V.T.; Tsai, W.H. Acoustic-based emergency vehicle detection using convolutional neural networks. IEEE Access 2020, 8, 75702–75713. [Google Scholar] [CrossRef]
  15. Pramanick, D.; Ansar, H.; Kumar, H.; Pranav, S.; Tengshe, R.; Fatimah, B. Deep learning based urban sound classification and ambulance siren detector using spectrogram. In Proceedings of the 2021 12th IEEE International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021; pp. 1–6. [Google Scholar]
  16. Fatimah, B.; Preethi, A.; Hrushikesh, V.; Singh, A.; Kotion, H.R. An automatic siren detection algorithm using Fourier Decomposition Method and MFCC. In Proceedings of the 2020 11th IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6. [Google Scholar]
  17. Mateen, A.; Hanif, M.Z.; Khatri, N.; Lee, S.; Nam, S.Y. Smart roads for autonomous accident detection and warnings. Sensors 2022, 22, 2077. [Google Scholar] [CrossRef]
  18. Tang, M.; Zhao, Q.; Ding, S.X.; Wu, H.; Li, L.; Long, W.; Huang, B. An improved lightGBM algorithm for online fault detection of wind turbine gearboxes. Energies 2020, 13, 807. [Google Scholar] [CrossRef]
  19. Mu, W.; Yin, B.; Huang, X.; Xu, J.; Du, Z. Environmental sound classification using temporal-frequency attention based convolutional neural network. Sci. Rep. 2021, 11, 21552. [Google Scholar] [CrossRef]
  20. Mahlous, A.R. Cyber security challenges in self-driving cars. Comput. Fraud Secur. 2022, 1873–7056. [Google Scholar] [CrossRef]
  21. Li, Q.; Garg, S.; Nie, J.; Li, X.; Liu, R.W.; Cao, Z.; Hossain, M.S. A highly efficient vehicle taillight detection approach based on deep learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4716–4726. [Google Scholar] [CrossRef]
  22. Yu, J.; Zhang, W. Face mask wearing detection algorithm based on improved YOLO-v4. Sensors 2021, 21, 3263. [Google Scholar] [CrossRef]
  23. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742. [Google Scholar] [CrossRef]
  24. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  25. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  26. Huang, Z.; Wang, J.; Fu, X.; Yu, T.; Guo, Y.; Wang, R. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Inf. Sci. 2020, 522, 241–258. [Google Scholar] [CrossRef]
  27. Hu, X.; Liu, Y.; Zhao, Z.; Liu, J.; Yang, X.; Sun, C.; Chen, S.; Li, B.; Zhou, C. Real-time detection of uneaten feed pellets in underwater images for aquaculture using an improved YOLO-V4 network. Comput. Electron. Agric. 2021, 185, 106135. [Google Scholar] [CrossRef]
  28. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  29. Ansari, S.; Alnajjar, K.A.; Khater, T.; Mahmoud, S.; Hussain, A. A Robust Hybrid Neural Network Architecture for Blind Source Separation of Speech Signals Exploiting Deep Learning. IEEE Access 2023, 11, 100414–100437. [Google Scholar] [CrossRef]
  30. Rehman, A.; Alam, T.; Mujahid, M.; Alamri, F.S.; Al Ghofaily, B.; Saba, T. RDET stacking classifier: A novel machine learning based approach for stroke prediction using imbalance data. Peerj Comput. Sci. 2023, 9, e1684. [Google Scholar] [CrossRef]
  31. Golchoubian, M.; Ghafurian, M.; Dautenhahn, K.; Azad, N.L. Pedestrian trajectory prediction in pedestrian-vehicle mixed environments: A systematic review. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11544–11567. [Google Scholar] [CrossRef]
  32. Guzhov, A.; Raue, F.; Hees, J.; Dengel, A. Audioclip: Extending clip to image, text and audio. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 976–980. [Google Scholar]
  33. Nanni, L.; Maguolo, G.; Brahnam, S.; Paci, M. An ensemble of convolutional neural networks for audio classification. Appl. Sci. 2021, 11, 5796. [Google Scholar] [CrossRef]
  34. Gatto, R.C.; Forster, C.H.Q. Audio-based machine learning model for traffic congestion detection. IEEE Trans. Intell. Transp. Syst. 2020, 22, 7200–7207. [Google Scholar] [CrossRef]
  35. Abdallah, M.; An Le Khac, N.; Jahromi, H.; Delia Jurcut, A. A hybrid CNN-LSTM based approach for anomaly detection systems in SDNs. In Proceedings of the 16th International Conference on Availability, Reliability and Security, Vienna, Austria, 17–20 August 2021; pp. 1–7. [Google Scholar]
  36. Kaushik, S.; Raman, A.; Rao, K.R. Leveraging computer vision for emergency vehicle detection-implementation and analysis. In Proceedings of the 2020 11th IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6. [Google Scholar]
  37. Raj, V.S.; Sai, J.V.M.; Yogesh, N.L.; Preetha, S.K.; Lavanya, R. Smart Traffic Control for Emergency Vehicles Prioritization using Video and Audio Processing. In Proceedings of the 2022 6th IEEE International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 25–27 May 2022; pp. 1588–1593. [Google Scholar]
  38. Shatnawi, M.; Audat, A.; Saraireh, M. Intelligent Requirements Engineering: Applying Machine Learning for Requirements Classification. In Proceedings of the 2023 14th IEEE International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 21–23 November 2023; pp. 1–6. [Google Scholar]
  39. Zhao, J.; Hao, S.; Dai, C.; Zhang, H.; Zhao, L.; Ji, Z.; Ganchev, I. Improved vision-based vehicle detection and classification by optimized YOLOv4. IEEE Access 2022, 10, 8590–8603. [Google Scholar] [CrossRef]
  40. Zhao, Y.; Zhao, H.; Zhang, X.; Liu, W. Vehicle classification based on audio-visual feature fusion with low-quality images and noise. J. Intell. Fuzzy Syst. 2023, 45, 1–14. [Google Scholar] [CrossRef]
  41. Jiang, K.; Su, D.; Zheng, Y. Intelligent acquisition model of traffic congestion information in the vehicle networking environment based on multi-sensor fusion. Int. J. Veh. Inf. Commun. Syst. 2019, 4, 155–169. [Google Scholar] [CrossRef]
  42. Al-Batat, R.; Angelopoulou, A.; Premkumar, S.; Hemanth, J.; Kapetanios, E. An end-to-end automated license plate recognition system using YOLO based vehicle and license plate detection with vehicle classification. Sensors 2022, 22, 9477. [Google Scholar] [CrossRef]
  43. Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar] [CrossRef]
  44. Bolboacă, R. Adaptive ensemble methods for tampering detection in automotive aftertreatment systems. IEEE Access 2022, 10, 105497–105517. [Google Scholar] [CrossRef]
Figure 1. Visual representation of the Multimodal-EVD (MEVD) system.
Figure 2. The MLSF-YOLO model and its subcomponents (a–g).
Figure 3. The complete framework of the ATSN classification model.
Figure 4. Creation of two distinct attention modules.
Figure 5. Utilizing probabilistic voting for predicting the complete acoustic category.
Figure 6. Evaluating the performance of the ATSN in the EVD model, where (a) comprises accuracy and loss graphs, and (b) presents the confusion matrix for the classification accuracy.
Figure 7. Detection results obtained using the MLSF-YOLO model.
Figure 8. Evaluating the performance of the ensemble learning in the EVD model, where (a) comprises accuracy and loss graphs and (b) presents the confusion matrix for the detection accuracy.
Figure 9. Evaluation of ensemble learning technique and individual models for emergency vehicle detection.
Table 1. Overview of model configuration.

Model     | Hyperparameters                                                    | Optimizer         | Activation Function
ATSN      | Loss function: BCE; learning rate: 0.001; regularization; dropout | Adam              | ReLU
MLSF-YOLO | Batch size: 16; learning rate: 0.01; weight decay: 0.0005          | SGD with momentum | Mish
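For readers who want to reproduce the training setup in Table 1, the snippet below sketches how those hyperparameters might be wired up in PyTorch. Only the values listed in the table come from the paper; the stand-in model definitions, the dropout rate, the regularization strength, and the momentum value are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-ins for the actual networks; only the hyperparameters below are from Table 1.
atsn = nn.Sequential(nn.Linear(16000, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 1))
mlsf_yolo = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Mish(), nn.Conv2d(16, 255, 1))

# ATSN branch: binary cross-entropy loss, Adam, learning rate 0.001 (Table 1).
atsn_criterion = nn.BCEWithLogitsLoss()
atsn_optimizer = torch.optim.Adam(
    atsn.parameters(),
    lr=0.001,
    weight_decay=1e-4,     # regularization strength not reported; value assumed
)

# MLSF-YOLO branch: SGD with momentum, lr 0.01, weight decay 0.0005, batch size 16 (Table 1).
BATCH_SIZE = 16
yolo_optimizer = torch.optim.SGD(
    mlsf_yolo.parameters(),
    lr=0.01,
    momentum=0.9,          # momentum value not reported; 0.9 is a common default
    weight_decay=0.0005,
)
```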
Table 2. Collection of data for the experiments on visual EVD.

Division   | Data Collection | Frame Count | EV Samples | NEV Samples | Frame-Average Instances
Training   | VEVD            | 11,930      | 12,705     | 24,522      | 3.12
Training   | KITTI           | 3995        | 0          | 18,960      | 4.74
Validation | VEVD            | 3902        | 3998       | 6519        | 2.43
Validation | KITTI           | 998         | 0          | 5750        | 5.77
Testing    | VEVD            | 4008        | 4432       | 7230        | 3.41
Testing    | KITTI           | 1349        | 0          | 6448        | 4.78
Total      | VEVD            | 19,144      | 23,448     | 39,895      | 3.25
Total      | KITTI           | 6850        | 0          | 32,752      | 4.78
Total      | VEVD + KITTI    | 25,994      | 23,448     | 72,647      | 4.45
Table 3. Collection of data for AEVD.

Subset     | Noise  | Siren Sound | Total
Training   | 11,020 | 5290        | 16,310
Validation | 3518   | 1744        | 5262
Test       | 3409   | 1724        | 5133
Total      | 17,947 | 8758        | 26,705
Table 4. Evaluating MLSF-YOLO and other object detectors on the Visual-EVD + KITTI dataset.

Detector        | Architecture  | Resolution  | mAP@[0.5:0.95] | mAP@0.5 | mAP@0.75 | Time Cost (ms)
SSD-300         | VGG-16        | 300 × 300   | 51.96          | 83.1    | 63.9     | 19.8
SSD-512         | VGG-16        | 512 × 512   | 54.98          | 84.8    | 64.8     | 39.8
YOLOv4          | CSPDarknet-53 | 512 × 512   | 67.8           | 92.7    | 81.1     | 3.5
YOLOv4          | CSPDarknet-53 | 608 × 608   | 68.9           | 92.8    | 81.9     | 4.3
YOLOv3          | Darknet-53    | 512 × 512   | 67.1           | 90.9    | 80.1     | 5.9
YOLOv3          | Darknet-53    | 608 × 608   | 67.1           | 90.8    | 80.7     | 7.8
EfficientDet-D0 | Efficient-B0  | 512 × 512   | 61.9           | 86.7    | 73.2     | 26.1
EfficientDet-D1 | Efficient-B1  | 640 × 640   | 62.7           | 87.8    | 74.9     | 29.7
EfficientDet-D2 | Efficient-B2  | 768 × 768   | 63.9           | 89.8    | 78.3     | 35.2
EfficientDet-D3 | Efficient-B3  | 896 × 896   | 67             | 91.8    | 81.5     | 44.1
EfficientDet-D4 | Efficient-B4  | 1024 × 1024 | 66.3           | 91      | 80.7     | 67.3
MLSF-YOLO       | CSPDarknet-53 | 608 × 608   | 71.1           | 95.2    | 86.2     | 4.9
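The mAP@0.5 and mAP@0.75 columns in Table 4 refer to the IoU threshold at which a predicted box counts as a true positive, and mAP@[0.5:0.95] averages over thresholds from 0.5 to 0.95 in steps of 0.05. A minimal IoU computation, independent of any particular detector and not reproducing the paper's exact evaluation protocol, is shown below.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction overlapping a ground-truth ambulance box:
print(iou((50, 30, 200, 180), (60, 40, 210, 190)))   # ~0.77, so it counts at IoU 0.5 and 0.75
```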
Table 5. ATSN performance across varied input lengths and signal-to-noise ratios (SNRs).

Input Length | Original Data | +20 dB | +10 dB | 0 dB  | −10 dB | −20 dB | −30 dB | Time
1.5 s        | 93.47         | 93.76  | 93.75  | 93.37 | 92.19  | 93.32  | 91.58  | 8 ms
1.0 s        | 93.24         | 93.68  | 93.68  | 93.17 | 93.46  | 93.19  | 89.24  | 7 ms
0.5 s        | 93.29         | 92.81  | 93.47  | 91.75 | 91.88  | 91.39  | 89.32  | 4 ms
0.25 s       | 93.19         | 92.47  | 92.29  | 93.42 | 90.72  | 91.27  | 88.12  | 2 ms
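The SNR conditions in Table 5 are typically produced by scaling a noise recording so that the mixture reaches a target signal-to-noise ratio before it is fed to the classifier. The sketch below shows the standard calculation; the synthetic siren tone, sample rate, and noise source are placeholders rather than the data used in the paper.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the mixture has a signal-to-noise ratio of `snr_db` dB."""
    noise = np.resize(noise, signal.shape)            # repeat/trim noise to match length
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

# Example: a 1.5 s siren-like tone at 16 kHz mixed with white noise at -10 dB SNR.
sr = 16000
t = np.arange(int(1.5 * sr)) / sr
siren = np.sin(2 * np.pi * 900 * t)
noisy = mix_at_snr(siren, np.random.randn(len(t)), snr_db=-10)
```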
Table 6. Performance of acoustic classification for EVD.

Class | Accuracy | Precision | Recall | F1-score
1     | 93.47%   | 0.93      | 0.94   | 0.93
2     | 92.98%   | 0.94      | 0.93   | 0.94
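The per-class metrics in Table 6 (and later in Table 9) follow the usual definitions derived from a binary confusion matrix. As a quick reference, the snippet below recomputes precision, recall, F1, and accuracy from hypothetical counts; the counts are illustrative only, since the paper reports just the derived percentages.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts only (not taken from the paper's confusion matrix).
p, r, f1, acc = binary_metrics(tp=1620, fp=120, fn=104, tn=3289)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.4f}")
```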
Table 7. Evaluating the MLSF-YOLO performance on the VEVD + KITTI dataset.

Resolution (h × w) | mAP@[0.5:0.95] | mAP@0.5 | mAP@0.75 | Time (ms)
320 × 320          | 66.9           | 93.4    | 81.1     | 2.2
416 × 416          | 69.2           | 94.5    | 84.3     | 3.1
512 × 512          | 69.8           | 94.8    | 84.7     | 4.2
608 × 608          | 71.1           | 95.2    | 86.2     | 4.9
640 × 640          | 71.1           | 94.9    | 85.9     | 5.0
768 × 768          | 71.0           | 94.7    | 86.2     | 6.9
Table 8. Comparison of Visual Emergency Vehicle Detection (VEVD) with previous research.

Study     | Method      | Accuracy | Limitations
Ref. [36] | RCNN        | 92%      | Lighting challenges
Ref. [37] | YOLOv5      | 88%      | High computation cost
Ref. [38] | CNN         | 85%      | Difficulty in classification
Ref. [39] | YOLOv4_AF   | 83%      | Limited accuracy compared to other models
Ref. [25] | YOLOv4_FIRI | 94%      | Slow speed and limited adaptability
Proposed  | MLSF-YOLO   | 95%      | Lacks practical implications
Table 9. Comparing ensemble classification performances for EVD.

Class | Accuracy | Precision | Recall | F1-score
1     | 96.19%   | 0.97      | 0.95   | 0.97
2     | 95.93%   | 0.95      | 0.97   | 0.96
Table 10. Comparison with previous classification research.

Study     | Accuracy | Ensemble Learning | Multimodal
Ref. [40] | 95%      | Yes               | No
Ref. [41] | 91%      | No                | No
Ref. [42] | 94%      | No                | No
Ref. [43] | 93%      | No                | Yes
Ref. [44] | 92%      | Yes               | No
Proposed  | 96%      | Yes               | Yes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
