Article

Object Detection and Monocular Stable Distance Estimation for Road Environments: A Fusion Architecture Using YOLO-RedeCa and Abnormal Jumping Change Filter

Beijing Key Laboratory of Information Service Engineering, College of Robotics, Beijing Union University, Beijing 100101, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3058; https://doi.org/10.3390/electronics13153058
Submission received: 24 June 2024 / Revised: 19 July 2024 / Accepted: 26 July 2024 / Published: 2 August 2024

Abstract

Enabling rapid and accurate comprehensive environmental perception for vehicles poses a major challenge. Object detection and monocular distance estimation are the two main technologies, though they are often used separately, so it is necessary to strengthen and optimize the interaction between them. Vehicle motion or object occlusions can cause sudden variations in the positions or sizes of detection boxes within temporal data, leading to fluctuations in distance estimates. We therefore propose a method that integrates a YOLOv5-RedeCa-based detector, a Bot-Sort tracker, and an anomaly jumping change filter. This combination allows for more accurate detection and tracking of objects, while the anomaly jumping change filter smooths distance variations caused by sudden changes in detection box sizes. Our method increases accuracy while reducing computational demands, showing outstanding performance on several datasets. Notably, on the KITTI dataset, the standard deviation of the continuous ranging results remains consistently low, especially in scenarios with multiple object occlusions or disappearances. These results validate our method's effectiveness and precision in managing dual tasks.

1. Introduction

Efficient perception and understanding of the road environment are crucial for autonomous driving technology [1]. However, achieving comprehensive perception of the surrounding environment remains a challenging problem. Such a perception system must perform multiple tasks: it detects the various objects around the vehicle and obtains the distances between these objects and the vehicle. We therefore focus on methods that address both tasks, simultaneously detecting objects around the vehicle and determining the distances to them. More details can be found in Figure 1.
In autonomous driving, accurate object detection is crucial for dynamic distance estimation, directly affecting the system’s safety and efficiency [2]. Dynamic distance estimation, as part of higher-level perception, can simulate the intuitive perception of human drivers.
Existing distance measurement methods based on radar and stereo vision, although effective, face challenges such as high costs, system complexity, and poor real-time performance. Furthermore, it is worth exploring whether current bird’s-eye view (BEV) perception technology requires deeper support from human visual characteristics [3]. Existing monocular distance measurement methods lack robustness when dealing with sudden value changes, which directly affects the perception and decision-making quality of autonomous driving systems in dynamic environments.
The goal of this study is to propose a monocular distance measurement method that both detects objects and determines the distance between those objects and the vehicle. By combining YOLOv5-RedeCa with an anomaly jumping change filter, the approach enhances the accuracy and robustness of dynamic distance estimation, thereby addressing perception and decision-making challenges in dynamic environments. Our method demonstrates significant potential in feature extraction and distance estimation, and it provides more reliable and richer perceptual information for autonomous driving systems, thereby significantly improving driving safety and efficiency. This makes our method a promising candidate for breakthrough advancements in autonomous driving technology.
Reviewing previous work in the field of 2D object detection, region proposal-based methods (e.g., R-CNN) [4] have significantly improved detection accuracy through a two-stage detection approach, performing exceptionally well in complex backgrounds. However, they are relatively slow in computation. In contrast, although single-stage detectors (e.g., YOLO) [5] slightly compromise on accuracy, they directly predict the positions and categories of objects in an image through regression/classification, simplifying the model structure and computation process, making them suitable for real-time processing [6]. Therefore, we use them as the baseline for our detector.
Object detection alone cannot fully perceive the environment without distance information [7]. Traditional depth estimation methods use monocular cues or manually designed features [8,9], but they often underperform in varying depth scenarios.
Deep learning, especially with convolutional neural networks (CNNs), offers direct pixel-to-depth mapping [10]. Multi-scale networks and scale-invariant loss functions have enhanced depth estimation accuracy. Unlike traditional methods, deep learning automatically generates disparity maps and depth information without extensive annotation, typically using video sequences or stereo images. However, these methods require substantial computational resources.
In autonomous driving, continuous object detection information from the time domain is used for tracking objects and distance estimation. However, vibrations caused by vehicle motion and occlusions can disrupt the stability of objects in the video sequence. This can lead to sudden changes in the position or size of the detection boxes, as shown in Figure 2. Therefore, it is necessary to smooth the detection boxes.
Although algorithms for 2D object detection have made significant progress, they remain specialized; monocular depth estimation algorithms have likewise advanced considerably. However, research in these two fields remains relatively separate, and integration between object detection and depth estimation is still lacking. In existing research, object detection often lacks integrated depth estimation, and depth estimation does not include object recognition capabilities. This highlights an urgent need for an integrated system that combines object detection and depth estimation. Such a system would optimize the interaction between the two techniques and, consequently, improve their efficiency in practical applications.
Therefore, we propose and design a holistic algorithm architecture that fuses object detection and distance estimation. Our main contributions are as follows:
  • We design the RedeCa module to enhance feature extraction capability, enabling more precise object detection in road traffic scenarios. To ensure real-time performance, we use a deep separable convolutional network architecture.
  • We integrate a multi-frame association mechanism, to improve the algorithm’s robustness in cases of object occlusion or temporary loss, enabling simultaneous tracking and ranging of multiple different objects.
  • We design an anomaly jumping change filter to enhance distance estimation accuracy and robustness. This filter eliminates anomalies caused by sudden changes in detection box size within a frame. It smooths the variation in object distance. Consequently, the reliability of ranging is improved.
Our proposed algorithm won the gold medal at the 3rd China (Shenyang) Intelligent Connected Vehicles Conference. It also achieved the bronze medal in the Perception & Decision Control group at the 2023 World Intelligent Driving Challenge Metaverse Virtual Simulation Competition.

2. Related Work

In the field of autonomous driving, how to make the vehicle fully perceive its surrounding environment is a challenging problem. This comprehensive perception includes not only the detection of various surrounding objects, but also the acquisition of distances between these objects and the vehicle. Currently, the use of visible light cameras is a common and cost-effective approach to addressing this problem. Equipped with photosensitive elements, visible light cameras can capture color images of the real world. Researchers conduct in-depth analysis of these color images to accomplish key perception tasks such as object detection and distance estimation.
Traffic Object Detection. Object detection is one of the core problems in the field of computer vision [11]. It aims to classify and determine the position of objects in images. Deep convolutional neural networks (DCNNs) have provided strong support for the development of object detection techniques with their outstanding feature representation capabilities [12,13]. Commonly, the main architectures for object detection can be divided into two categories: region proposal-based object detection methods and regression/classification-based object detection methods [6]. Each of them has its own characteristics, addressing different technical challenges.
Representative models of region proposal-based object detection architectures include RCNN [4] and its subsequent improvements, such as SPPNet [14], Fast RCNN [15], Faster R-CNN [16], Mask RCNN [17], and R-FCN [18]. The first stage of these architectures generates candidate region proposals, and the second stage performs detailed classification and bounding box refinement on these regions. This staged processing greatly enhances detection accuracy, particularly in complex backgrounds. For instance, to address the issue of slow proposal generation, Faster R-CNN introduces the region proposal network (RPN), which significantly improves the quality and speed of proposals. Mask RCNN further predicts object masks by adding a parallel branch, enabling the model to perform object detection and instance segmentation simultaneously, which is particularly valuable in applications requiring fine-grained visual information.
To simplify model structure and computational processes, regression/classification-based object detection architectures are designed to predict object positions and categories directly on images, such as OverFeat [19], the YOLO series [20,21,22,23,24,25], SSD [26], DSSD [27], and FSSD [28]. The main advantage of these architectures is their high detection speed, making them suitable for real-time processing scenarios. Among them, the YOLO series stands out. By reframing the detection problem as a single regression task, it can simultaneously predict the categories and positions of multiple objects in a single network forward pass. To effectively address detection challenges for objects of various sizes and improve the model’s ability to detect small objects, SSD introduces the concept of multi-scale feature maps.
Monocular Depth Estimation. The pursuit of determining the distance between objects and the camera has remained a pivotal research focus within the domain of computer vision since its inception. In the initial approaches, monocular depth estimation primarily relied on Shape from Shading (SFS) and Depth from Focus/Defocus (DFF/DFD). SFS infers the object’s shape based on changes in illumination on its surface [29]. DFF/DFD techniques estimate depth using focused and defocused information in images along with camera calibration parameters [30]. The fundamental assumption is that the focused parts of the image are clearer, while the defocused parts gradually become blurry. Although effective in specific scenarios, these methods often rely on additional scene assumptions and exhibit subpar performance when confronted with scenes featuring significant depth of field.
With the advancement of machine learning technology, researchers have begun to frame the single-pixel depth estimation problem as a probabilistic learning problem based on random fields. For instance, optimizing depth maps using Markov Random Field (MRF) [8,31], and modeling posterior probabilities with Conditional Random Field (CRF) for depth estimation [9]. The enhanced computational power of Graphics Processing Units (GPUs) and the availability of large-scale image datasets have driven deep learning, paving the way for new research directions in depth estimation.
Deep learning techniques typically require real depth data as training labels. These techniques directly map pixels to depth values by constructing convolutional neural networks (CNNs). Refs. [32,33] were the first to apply CNNs to depth estimation, employing a multi-scale network strategy in which a global coarse estimation network provides initial distance estimates and a local refinement network optimizes the details; they also introduced a scale-invariant loss function to handle errors. Subsequent studies have further improved estimation accuracy by refining network structures [29,34,35,36] and integrating auxiliary information [37,38]. Some research does not rely on a large amount of labeled depth data; instead, it utilizes video sequences or stereo images to automatically derive disparity maps and depth information using deep learning algorithms. For example, Ref. [39] processes the left view with a convolutional neural network to generate a disparity map, then reconstructs the left view by combining it with the right view through inverse mapping, using the reconstruction error as a supervisory signal for training. Ref. [40] simultaneously processes left and right views to optimize disparity map generation, enabling accurate depth estimation even when only monocular images are used during testing. Additionally, Ref. [41] derives pose and depth from video sequences of consecutive frames, demonstrating the effectiveness of combining multiple network models for view reconstruction. While these methods achieve good performance, they come at a significant computational cost.
Significant progress has been made in 2D object detection and monocular depth estimation, but the two remain largely separate in research, with no effective integration observed. At present, object detection systems often lack the capability for range estimation, and depth estimation systems do not integrate object recognition. This separation points to the need for a holistic algorithmic architecture that integrates object detection and range estimation. Such integration would enhance the synergy between these two techniques and, consequently, optimize their performance in practical applications.

3. Method

To address the integration of object detection and distance estimation, we design and develop an algorithmic architecture. It comprises three main components: a detector based on YOLOv5-RedeCa, a multi-object tracker, and an anomaly jumping change filter. In autonomous driving scenarios, vehicles often encounter complex and dynamic environments with multiple autonomously moving objects that may suddenly appear, become occluded, or overlap [21]. These situations impose stringent requirements on the accuracy and robustness of the algorithm. Additionally, due to the high speed of vehicle movement, the algorithm must process sensor data and react within milliseconds to ensure vehicle safety. Therefore, autonomous driving systems require a perception algorithm with high accuracy and robustness.
For these purposes, we designed the RedeCa module, which enhances the architecture of the object detection algorithm, improving the precision of object localization in road scenarios. Furthermore, we integrate the Bot-Sort tracker. This tracker employs a multi-frame association mechanism. It enhances the algorithm’s robustness when encountering object occlusion or temporary loss. Thus, the method can simultaneously track multiple objects and estimate their distances. The approach also periodically evaluates tracking continuity, monitors significant changes in object state, and checks the stability of the tracking process. To enhance the accuracy and robustness of distance estimation, we also integrate an anomaly jumping change filter. The filter effectively smooths the variation in object distance and eliminates potential anomalies caused by sudden changes in detection box size. Thereby, it can improve the reliability of distance measurement. The overall architecture is illustrated in Figure 3.
By implementing YOLOv5-RedeCa, our method enhances accuracy in object recognition and refines bounding box precision across video frames. Furthermore, we compute the distances between objects and the camera using a sophisticated physical imaging model.
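The exact physical imaging model is not reproduced in this section; as an illustration only, the snippet below is a minimal pinhole-camera sketch of how a detection box can be converted to a range estimate. The focal length in pixels and the assumed real-world object height are hypothetical parameters, not values from the paper.

def estimate_distance_pinhole(box_height_px: float,
                              focal_length_px: float = 721.5,   # assumed pixel focal length (illustrative)
                              real_height_m: float = 1.5) -> float:
    """Rough pinhole-model range estimate: Z = f * H / h.

    box_height_px   -- height of the detection box in pixels
    focal_length_px -- camera focal length expressed in pixels (assumption)
    real_height_m   -- assumed physical height of the object class in metres
    """
    if box_height_px <= 0:
        raise ValueError("bounding box height must be positive")
    return focal_length_px * real_height_m / box_height_px

# usage: a 90 px tall car box with the assumed intrinsics
print(f"{estimate_distance_pinhole(90.0):.1f} m")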
To effectively track and measure multiple objects, we integrate the Bot-Sort tracker. This multi-frame association mechanism significantly enhances robustness in scenarios with occlusions or temporary loss of objects: even if an object temporarily vanishes from view in certain frames, the method continues to track it and measure its distance consistently. We assign unique identifiers to newly detected objects using an incremental sequence and then use Bot-Sort to match these with existing objects, distinguishing between those that have newly appeared and those already being tracked. For objects that are successfully matched, their status information is updated and their original identifiers are preserved. Conversely, new objects are allocated fresh tracking records and identifiers. If an object is not detected in the current frame, its entry is temporarily preserved; if the object fails to reappear within a predetermined period, it is removed from the tracking list.
Currently, most monocular ranging methods focus on the accuracy of individual images, overlooking the correlation of distance changes across consecutive frames in the temporal domain. Therefore, we propose a multi-frame data analysis strategy that integrates Kalman filtering. It not only enhances the precision and robustness of distance estimation but also assists in the detailed analysis of the object's motion patterns and trends in distance changes. The Kalman filter smooths distance variations by analyzing changes in the same object across consecutive frames, and it eliminates potential errors caused by abrupt changes in the size of single-frame detection boxes.

3.1. Object Detection

To ensure accurate object recognition and precise bounding box extraction in each video frame, we employed the YOLOv5 object detection algorithm. It serves as the core framework for our detection module, which is based on deep convolutional neural networks. YOLOv5 predicts the class and position of objects, i.e., the bounding boxes, directly from the input image in a single forward pass. Compared to the two-stage detection methods, which first extract candidate regions and then classify them, YOLOv5 considerably improves detection speed and efficiency. It uses a deep convolutional network to extract features from images, progressively drawing out increasingly complex characteristics [42]. It enhances its ability to recognize various objects. By integrating a Feature Pyramid Network (FPN), YOLOv5 combines feature maps from different levels, thereby improving its detection capabilities across a range of object sizes.
Considering the diversity of object sizes in road scenes, the accuracy of object detection, and the specific requirements of road traffic, we have improved the YOLOv5 network architecture. First, we integrated the RedeCa module into the residual blocks. This allows for more accurate detection of smaller objects under varying lighting conditions and weather changes. Then, we replaced the traditional convolutions in the residual blocks with depthwise separable convolutions. This change reduces the number of parameters in the network and enhances the real-time performance. The architecture of YOLOv5-RedeCa network is shown in Figure 4.

RedeCa Module

YOLOv5 has demonstrated outstanding performance across a wide range of detection scenarios. However, the complex road environment demands a versatile approach to accommodate the diverse types and sizes of traffic objects. To address this challenge, we optimized YOLOv5’s network structure, specifically by designing the RedeCa module to enhance recognition accuracy for traffic objects of various sizes. As shown in Figure 5, the RedeCa module is based on a residual block. It incorporates a sequence of residual depthwise separable convolution blocks and ECA modules to enhance feature extraction capabilities.
Depthwise separable convolutions markedly reduce the number of parameters and computational load in models. This is achieved by breaking down standard convolutions into two simpler operations: depthwise convolutions and pointwise convolutions. The process is detailed in Figure 6. It involves first applying depthwise convolution independently to each input channel. Then, pointwise convolution aggregates these outputs using 1 × 1 convolution kernels. This approach greatly reduces the parameter count and computational demands, lowering memory usage and energy consumption. Consequently, it makes the model more suitable for deployment on devices with limited resources.
Depthwise separable convolutions minimize the risk of overfitting, which is particularly crucial when training models on limited data.
By reducing the number of parameters, training becomes simpler and the likelihood of overfitting is reduced. It enhances the model’s generalization capability in real-world applications. The computational complexity associated with depthwise separable convolutions is detailed below.
Depthwise Convolution:
$$Param_{DC} = K \times K \times C_{in}$$
$$FLOPS_{DC} = H_{out} \times W_{out} \times K \times K \times C_{in}$$
Pointwise Convolution:
$$Param_{PC} = C_{in} \times C_{out}$$
$$FLOPS_{PC} = H_{out} \times W_{out} \times C_{in} \times C_{out}$$
Depthwise Separable Convolution:
$$Param_{DW} = Param_{DC} + Param_{PC}$$
$$FLOPS_{DW} = FLOPS_{DC} + FLOPS_{PC}$$
where $Param_{DC}$ quantifies the parameter count of the depthwise convolution and $FLOPS_{DC}$ its computational demands. Correspondingly, $Param_{PC}$ and $FLOPS_{PC}$ give the parameter count and computational demands of the pointwise convolution, while $Param_{DW}$ and $FLOPS_{DW}$ give those of the complete depthwise separable convolution.
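To make these counts concrete, the following minimal PyTorch sketch builds a bias-free depthwise separable convolution and checks its parameter count against the formulas above; the channel sizes ($C_{in}=64$, $C_{out}=128$) and kernel size ($K=3$) are illustrative assumptions.

import torch
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

# Depthwise: one k x k filter per input channel (groups=c_in), then
# pointwise: a 1 x 1 convolution that mixes channels.
depthwise = nn.Conv2d(c_in, c_in, kernel_size=k, padding=k // 2,
                      groups=c_in, bias=False)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
dw_separable = nn.Sequential(depthwise, pointwise)

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, bias=False)

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Param_DW = K*K*C_in + C_in*C_out, as in the formulas above
assert n_params(dw_separable) == k * k * c_in + c_in * c_out
print(n_params(dw_separable), "vs standard", n_params(standard))  # 8768 vs 73728

x = torch.randn(1, c_in, 80, 80)
assert dw_separable(x).shape == standard(x).shape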
The RedeCa module begins with a depthwise separable convolution block that incorporates principles of a residual network. It uses identity mapping to improve the learning efficiency of deep network layers and combines the high parameter efficiency of depthwise separable convolutions with the rapid learning capability of residual connections, thereby enhancing the module's feature extraction ability. Then, a standard depthwise separable convolution block is integrated to effectively reduce the model's complexity and computational demands. Finally, the ECA module adjusts the dependencies between channels, dynamically optimizing feature responses to improve the accuracy of object detection.
The mathematical formulation of the RedeCa module is as follows:
$$F_{DWConvRes} = SiLU\big(BN(DWConv(F^*)) + Conv(F^*)\big)$$
$$F_{DWConv} = BN\big(Pointwise(Depthwise(F_{DWConvRes}))\big)$$
$$F_{RedeCa} = SiLU\big(F_{DWConv} \times \sigma(Conv(AdaptAvgPool(F_{DWConv}))) + F^*\big)$$
where $F^*$ represents the feature map input from the previous layer.
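A minimal PyTorch sketch of a RedeCa-style block, following the three equations above, is shown below. The channel width, the ECA kernel size, and the use of a 1x1 convolution for the $Conv(F^*)$ branch are illustrative assumptions; the authors' released code should be consulted for the exact configuration.

import torch
import torch.nn as nn

class DWSeparable(nn.Module):
    """Depthwise + pointwise convolution followed by BatchNorm (no activation)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return self.bn(self.pointwise(self.depthwise(x)))

class RedeCaSketch(nn.Module):
    """Residual depthwise-separable block with ECA-style channel attention.

    Mirrors the equations for F_DWConvRes, F_DWConv and F_RedeCa; the exact
    hyperparameters (e.g. the ECA kernel size of 3) are assumptions.
    """
    def __init__(self, channels, eca_k=3):
        super().__init__()
        self.dw_res = DWSeparable(channels, channels)              # BN(DWConv(F*)) branch
        self.skip = nn.Conv2d(channels, channels, 1, bias=False)   # Conv(F*) branch
        self.act = nn.SiLU()
        self.dw = DWSeparable(channels, channels)                  # standard DW separable block
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.eca = nn.Conv1d(1, 1, kernel_size=eca_k, padding=eca_k // 2, bias=False)

    def forward(self, x):
        f_res = self.act(self.dw_res(x) + self.skip(x))            # F_DWConvRes
        f_dw = self.dw(f_res)                                      # F_DWConv
        w = self.pool(f_dw).squeeze(-1).transpose(1, 2)            # (B, 1, C)
        w = torch.sigmoid(self.eca(w)).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1)
        return self.act(f_dw * w + x)                              # F_RedeCa

y = RedeCaSketch(64)(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])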

3.2. Enhancing the Robustness of Multi-Frame Association in Distance Estimation

In object detection and distance estimation in road traffic scenes, it is not enough to operate only on a single object. In practical applications, it is necessary to accurately estimate the distance of multiple objects simultaneously. And these objects may undergo sudden appearances, occlusions, overlaps, or disappearances. Our proposed method integrates the Bot-Sort tracker, assigning a unique identifier to each object in every frame, facilitating effective object identification and tracking. Although existing monocular ranging algorithms have improved accuracy on individual images, they often overlook the coherence of distance changes for the same object across consecutive frames. Ignoring this coherence can lead to fluctuations in distance measurements, especially when the bounding box size of the object detection changes drastically over a short period. Our method enhances distance estimation accuracy and robustness by integrating Kalman filtering techniques over time series data. The approach smooths out variations in object distances and eliminates anomalies caused by sudden changes in single-frame detection box sizes, resulting in superior performance in real-time road traffic scenarios.

3.2.1. Bot-Sort

In the domain of visual analysis for road traffic, the mere detection and distance estimation of individual objects are not adequate. Paramount to this field is the capability to accurately measure the distances of multiple objects within a scene and allocate distinct identifiers to each for effective tracking and differentiation. Our method integrates the Bot-Sort tracker, an advanced algorithm that adeptly identifies and analyzes every object within video frames and ascertains new objects for tracking. The methodology of Algorithm 1 is delineated in the form of pseudocode [43].
Algorithm 1 Frame-based Object Tracking System Workflow
1: Input: Video frames
2: Output: Tracked object trajectories
3: for each frame in Video do
4:    Detect potential objects using YOLOX
5:    Estimate the camera motion transformation matrix from the previous frame
6:    Extract features for detected objects using the SBS-S50 model
7:    Predict object locations in the current frame using a Kalman filter
8:    First association based on fused IoU and ReID scores
9:    Match current detections with existing tracklets
10:   Second association for unmatched tracklets based on IoU
11:   Update tracklets' state information
12:   Create new tracklets and assign new unique identifiers
13:   Remove inactive tracklets if not detected within a certain period
14:   (Optional) Interpolate missing parts in tracklets
15: end for
16: return Completed trajectories for all objects
BoT-SORT stands as a superior multi-object tracking system that significantly improves tracking accuracy through the synergistic integration of motion and visual appearance data. The system leverages advanced Kalman filters alongside camera motion compensation techniques to adeptly handle real-world dynamics. It distinctively amalgamates Intersection over Union (IoU) with cosine distance measures from Re-identification (ReID) algorithms to establish consistent and reliable object associations, particularly in densely dynamic environments.
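The following simplified sketch illustrates how an IoU cost and a ReID cosine-distance cost can be fused before Hungarian matching. The gating threshold and the element-wise minimum rule are illustrative assumptions and not necessarily the exact BoT-SORT formulation [43].

import numpy as np
from scipy.optimize import linear_sum_assignment

def fused_cost(iou_matrix: np.ndarray,
               reid_cos_sim: np.ndarray,
               iou_gate: float = 0.3) -> np.ndarray:
    """Combine IoU distance and appearance (cosine) distance.

    iou_matrix   -- IoU between each tracklet prediction and each detection
    reid_cos_sim -- cosine similarity between ReID embeddings, same shape
    iou_gate     -- pairs with IoU below this are not trusted for appearance (assumption)
    """
    iou_dist = 1.0 - iou_matrix
    app_dist = 1.0 - reid_cos_sim
    app_dist[iou_matrix < iou_gate] = 1.0     # gate appearance by spatial overlap
    return np.minimum(iou_dist, app_dist)     # keep the more confident cue

# toy example: 2 tracklets vs 3 detections
iou = np.array([[0.80, 0.10, 0.00], [0.05, 0.60, 0.20]])
sim = np.array([[0.90, 0.20, 0.10], [0.10, 0.70, 0.30]])
rows, cols = linear_sum_assignment(fused_cost(iou, sim))
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # [(0, 0), (1, 1)]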

3.2.2. Object Identification and Maintenance

Our method incorporates the following logic for object identification and maintenance. Firstly, for newly detected objects, the method assigns each object a unique identifier through an incremental sequence to differentiate between different tracking objects. Subsequently, the method applies the Bot-Sort algorithm. It compares newly discovered objects with those already tracked to determine their status.
For successfully matched objects, the system updates their status information in the tracking list, while keeping their original unique identifiers unchanged. Newly appearing objects are created as new tracking entries and assigned new unique identifiers. If a tracked object is not detected in the current frame, possibly due to occlusion or leaving the field of view, the method temporarily retains its entry and ID. If the object does not reappear within the set time, it will be removed from the tracking list.
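A minimal sketch of this identifier-maintenance bookkeeping is given below; the data-structure names and the retention window of 30 frames are illustrative assumptions.

from dataclasses import dataclass
from itertools import count

@dataclass
class Track:
    track_id: int
    box: tuple             # (x, y, w, h) of the latest matched detection
    missed_frames: int = 0

class TrackRegistry:
    """Keeps tracked objects, assigns incremental IDs, drops stale entries."""

    def __init__(self, max_missed: int = 30):   # retention window is an assumption
        self._next_id = count(1)
        self.tracks: dict[int, Track] = {}
        self.max_missed = max_missed

    def update(self, matches: dict[int, tuple], new_boxes: list[tuple]) -> None:
        # 1) matched objects: refresh state, keep the original identifier
        for tid, box in matches.items():
            self.tracks[tid].box = box
            self.tracks[tid].missed_frames = 0
        # 2) unmatched existing tracks: keep the entry, count the miss
        for tid, trk in list(self.tracks.items()):
            if tid not in matches:
                trk.missed_frames += 1
                if trk.missed_frames > self.max_missed:
                    del self.tracks[tid]        # remove after the retention window
        # 3) newly appearing objects: fresh record and identifier
        for box in new_boxes:
            tid = next(self._next_id)
            self.tracks[tid] = Track(tid, box)

reg = TrackRegistry()
reg.update(matches={}, new_boxes=[(10, 20, 50, 80)])  # first frame: one new object gets ID 1
print(list(reg.tracks))  # [1]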

3.2.3. Anomaly Jumping Change Filter

In modeling object motion on the image plane, we typically utilize a discrete Kalman filter equipped with a constant velocity model. The state vector of this filter is defined as a septuple, $x = [x_c, y_c, s, a, \dot{x}_c, \dot{y}_c, \dot{s}]^T$, where $(x_c, y_c)$ specifies the 2D coordinates of the object's center within the image plane, $s$ signifies the scale (area) of the bounding box, and $a$ indicates its aspect ratio. To augment the precision and resilience of distance estimation, to stabilize the variations in object distance, and to address anomalies potentially triggered by abrupt changes in detection box sizes within a frame, the state vector is expanded to an octuple, $x = [x_c, y_c, a, h, \dot{x}_c, \dot{y}_c, \dot{a}, \dot{h}]^T$. Accordingly, the state vector of the Kalman filter (KF) is defined as depicted in Equation (10), and the measurement vector is delineated in Equation (11).
$$x_k = [x_c(k), y_c(k), a(k), h(k), \dot{x}_c(k), \dot{y}_c(k), \dot{a}(k), \dot{h}(k)]^T$$
$$z_k = [z_{x_c}(k), z_{y_c}(k), z_a(k), z_h(k)]^T$$
This approach, which integrates Kalman filtering into time series analysis, is designed to significantly improve the accuracy and robustness of distance estimation. The Kalman filter algorithm methodically smoothens fluctuations in object distances by conducting an analysis of historical data. It effectively addresses anomalies that arise from sudden changes in the size of detection boxes within individual frames. The algorithm’s state prediction is formulated in Equations (12) and (13), while the observation update mechanism is detailed in Equations (14)–(16).
$$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1} + B_k u_k$$
$$P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k$$
where $\hat{x}_{k|k-1}$ represents the predicted state, $F_k$ is the state transition matrix, $B_k$ denotes the control input model, $u_k$ is the control vector, $P_{k|k-1}$ is the predicted covariance, and $Q_k$ signifies the process noise covariance.
$$K_k = P_{k|k-1} H_k^T \left(H_k P_{k|k-1} H_k^T + R_k\right)^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left(z_k - H_k \hat{x}_{k|k-1}\right)$$
$$P_{k|k} = \left(I - K_k H_k\right) P_{k|k-1}$$
where $K_k$ denotes the Kalman gain, $H_k$ represents the observation matrix, $z_k$ is the actual observation, $R_k$ indicates the observation noise covariance, and $I$ is the identity matrix.
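For concreteness, the predict and update steps of Equations (12)-(16) over the octuple state of Equation (10) can be written as the following NumPy sketch; the noise covariances and the zero control input are illustrative assumptions.

import numpy as np

dt = 1.0                          # one frame step
F = np.eye(8)                     # constant-velocity transition: position += dt * velocity
F[:4, 4:] = dt * np.eye(4)
H = np.hstack([np.eye(4), np.zeros((4, 4))])   # we observe (x_c, y_c, a, h) only
Q = 1e-2 * np.eye(8)              # process noise (assumption)
R = 1e-1 * np.eye(4)              # measurement noise (assumption)

def predict(x, P):
    x = F @ x                                  # Eq. (12), with zero control input
    P = F @ P @ F.T + Q                        # Eq. (13)
    return x, P

def update(x, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)             # Eq. (14)
    x = x + K @ (z - H @ x)                    # Eq. (15)
    P = (np.eye(8) - K @ H) @ P                # Eq. (16)
    return x, P

# usage: a box drifting right, with a transient jump in height at the second frame
x = np.array([100., 200., 0.5, 80., 0., 0., 0., 0.])
P = np.eye(8)
for z in ([102, 200, 0.5, 81], [104, 200, 0.5, 120], [106, 200, 0.5, 82]):
    x, P = predict(x, P)
    x, P = update(x, P, np.array(z, dtype=float))
print(np.round(x[:4], 1))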

4. Experiments

4.1. Datasets

COCO Datasets. The Common Objects in Context (COCO) dataset, developed by the Microsoft team, is a vital asset in computer vision research. It contains over 330,000 images that are richly annotated across 80 object categories. By providing a diverse, large-scale, and challenging benchmark, the COCO dataset has facilitated extensive research activities [44]. Consequently, the method we proposed has been rigorously evaluated on the COCO dataset to compare its performance with leading algorithms in the field. Table 1 summarizes the experimental setup and dataset partitioning, illustrating our method within the COCO dataset.
VOC Datasets. The PASCAL Visual Object Classes (VOC) dataset serves as a crucial benchmark in the field of computer vision, supporting tasks such as object detection. It encompasses annotations for 20 different object categories, including class labels and bounding boxes. The PASCAL VOC challenge series played a pivotal role in advancing vision-based technologies [45]. Additionally, we evaluate the improved YOLOv5 on the VOC dataset and compare it with other popular methods. Table 2 provides a detailed description of partitioning for the VOC dataset.
KITTI Datasets. The KITTI dataset is a crucial benchmark in the field of autonomous driving research. It encompasses a variety of real-world driving scenarios including urban, rural, and highway environments. It provides comprehensive annotations for tasks such as 3D object detection, tracking, and depth estimation [46]. We use the KITTI dataset to validate the accuracy of our proposed method in distance estimation, and to demonstrate its applicability and effectiveness in realistic driving conditions.

4.2. Implementation Details

4.2.1. Training Setup

To ensure the fairness and consistency of model training, the details of the hardware, software, and experimental environment are shown in Table 3. The training parameter settings are shown in Table 4.

4.2.2. Loss Function

The loss function used in the YOLOv5-RedeCa model is calculated as follows:
$$L(t_p, t_{gt}) = \sum_{k=0}^{K}\left[\alpha_k^{balance}\,\alpha_{box}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{kij}^{obj}L_{CIoU} + \alpha_{obj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{kij}^{obj}L_{obj} + \alpha_{cls}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{kij}^{obj}L_{cls}\right]$$
where $K$, $S^2$, and $B$ represent the number of output feature maps, the number of cells, and the number of anchors on each cell, respectively. $\alpha_x$ represents the weight of the corresponding item; $\alpha_{box}$, $\alpha_{obj}$, and $\alpha_{cls}$ take values of 0.05, 0.7, and 0.3, respectively. We based the values of $\alpha_{box}$, $\alpha_{obj}$, and $\alpha_{cls}$ on the original YOLOv5 algorithm, since using a stable and recognized baseline configuration is essential for consistent performance evaluation. $\alpha_k^{balance}$ balances the weights of the output feature maps at each scale, with a default value of [4.0, 1.0, 0.4]. $\mathbb{1}_{kij}^{obj}$ indicates whether the $j$th anchor box of the $i$th cell in the $k$th output feature map is a positive sample: it is 1 for a positive sample and 0 otherwise. $t_p$ and $t_{gt}$ are the predicted vector and the ground-truth vector.
The loss function of CIoU is as follows:
$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha\upsilon$$
where $b$ represents the center point of the predicted box, $b^{gt}$ represents that of the ground-truth box, and $c$ is the diagonal length of the smallest box enclosing both. $\alpha$ is the balance factor between the loss caused by the aspect ratio and the loss caused by the IoU. $\upsilon$ is the normalization of the difference in aspect ratio between the predicted box and the ground-truth box. They are mathematically defined as follows:
$$\rho(b, b^{gt}) = \left\lVert b - b^{gt} \right\rVert_2$$
$$\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}$$
$$\upsilon = \frac{4}{\pi^2}\left(\arctan\frac{w^G}{h^G} - \arctan\frac{w^P}{h^P}\right)^2$$
The loss functions for object and class are defined as follows:
$$L_{obj}(p_o, p_{IoU}) = BCE_{obj}^{sig}(p_o, p_{IoU}; w_{obj})$$
$$L_{cls}(c_p, c_{gt}) = BCE_{cls}^{sig}(c_p, c_{gt}; w_{cls})$$
where $(w^G, h^G)$ represents the width and height of the object box and $(w^P, h^P)$ represents the width and height of the predicted box. $p_o$ is the confidence score of the object in the predicted box, and $p_{IoU}$ is the IoU between the predicted box and the ground-truth box. $BCE_{obj}^{sig}$ and $BCE_{cls}^{sig}$ represent the binary cross-entropy loss with sigmoid activation. $w_{obj}$ and $w_{cls}$ represent the weights of positive samples.
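As an illustration of the CIoU term, a short PyTorch sketch is given below; the (cx, cy, w, h) box format and the numerical epsilon are assumptions for this example only.

import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss for boxes given as (cx, cy, w, h), shape (N, 4)."""
    # corner coordinates
    p_x1, p_y1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    p_x2, p_y2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    g_x1, g_y1 = target[:, 0] - target[:, 2] / 2, target[:, 1] - target[:, 3] / 2
    g_x2, g_y2 = target[:, 0] + target[:, 2] / 2, target[:, 1] + target[:, 3] / 2

    # IoU
    inter = (torch.min(p_x2, g_x2) - torch.max(p_x1, g_x1)).clamp(0) * \
            (torch.min(p_y2, g_y2) - torch.max(p_y1, g_y1)).clamp(0)
    union = pred[:, 2] * pred[:, 3] + target[:, 2] * target[:, 3] - inter + eps
    iou = inter / union

    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    cw = torch.max(p_x2, g_x2) - torch.min(p_x1, g_x1)
    ch = torch.max(p_y2, g_y2) - torch.min(p_y1, g_y1)
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term v and balance factor alpha
    v = (4 / math.pi ** 2) * (torch.atan(target[:, 2] / target[:, 3]) -
                              torch.atan(pred[:, 2] / pred[:, 3])) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

pred = torch.tensor([[50., 50., 20., 40.]])
gt = torch.tensor([[55., 52., 22., 38.]])
print(ciou_loss(pred, gt))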

4.3. Evaluation Index

In the following experiments, we evaluate the accuracy of object detection against the accuracy and stability of distance estimation using the following metrics.
  • Precision: This metric is pivotal for assessing the accuracy of object detection models. It quantifies the ratio of instances accurately identified as objects to all cases labeled as objects by the model. The formula is defined as
    $$Precision = \frac{TP}{TP + FP}$$
    where $TP$ (True Positives) represents the count of correctly detected objects, and $FP$ (False Positives) denotes the instances erroneously classified as objects.
  • Recall: Recall evaluates the proportion of true objects that the model successfully detects out of the entire set of actual objects. A higher recall implies that the model is capable of identifying most true objects, albeit potentially increasing the rate of false positives. The formula for recall is
    $$Recall = \frac{TP}{TP + FN}$$
    with $FN$ (False Negatives) indicating the number of objects that were not detected.
  • Average Precision (AP): AP calculates the mean value of precision across varying levels of recall. It can be determined for each category, and the Mean Average Precision (MAP) represents the average of APs across all categories. MAP is extensively used to gauge the comprehensive performance in tasks involving detection of multiple object classes.
    $$mAP = \frac{1}{classes}\sum_{i=1}^{classes}\int_0^1 P(R)\,dR$$
  • Standard Deviation (SD): SD serves as a statistical measure to quantify the extent of variation or dispersion from the mean within a dataset. In the domain of time-domain analysis, it is pivotal to precisely evaluate the closeness of distance estimations derived from various monocular ranging algorithms to the true values, as well as for gauging their dispersion. The calculation of standard deviation is defined by the following formula:
    $$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(f(x_i) - \mu_i\right)^2}$$
    where $\sigma$ denotes the SD, $f(x_i)$ represents the estimated distance at time instance $i$, and $\mu_i$ corresponds to the actual value at that instance. This metric is crucial for assessing the reliability and accuracy of these algorithms in real-world applications. A short computational sketch of these metrics follows the list.
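The sketch below computes Precision, Recall, and the ranging SD from illustrative inputs; the numbers are placeholders, not results from the paper.

import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def ranging_sd(estimates: np.ndarray, ground_truth: np.ndarray) -> float:
    """Dispersion of per-frame distance estimates around the true distances."""
    return float(np.sqrt(np.mean((estimates - ground_truth) ** 2)))

# placeholder counts and a short distance sequence with one jump
p, r = precision_recall(tp=716, fp=284, fn=521)
sd = ranging_sd(np.array([19.8, 20.3, 20.1, 26.5]),
                np.array([20.0, 20.2, 20.3, 20.5]))
print(f"precision={p:.3f} recall={r:.3f} SD={sd:.2f} m")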

4.4. Comparison Experiment

We have developed an improved YOLOv5 method, specifically optimized for detecting small objects in road scenes under various lighting and weather conditions. To evaluate the performance of our method in terms of feature representation and real-time processing capabilities, we conducted a series of experiments on the COCO and VOC datasets. Our improved methods, denoted according to scale as Ours-s, Ours-m, and Ours-l, were compared with several state-of-the-art object detection methods.
We used Precision, mAP, and Recall as the primary performance metrics. Additionally, we considered complexity by evaluating the number of parameters (Params) and floating-point operations per second (FLOPS). We performed a comparative analysis of different methods on the COCO and VOC datasets, ensuring the robustness of the experimental design and the validity of the data analysis. The results are presented in the accompanying Table 5.
Experimental results on the COCO dataset show that our optimized Ours-m model achieved 71.6% Precision, 61.5% mAP, and 57.9% Recall. These results are the best among all methods using similar computational resources. On the VOC dataset, the model also excelled, registering 77.5% mAP@0.5 and 73.4% Recall. Notably, despite having fewer parameters than the equivalent-scale YOLOv5 models, Ours-s, Ours-m, and Ours-l surpassed other methods in Precision and mAP, demonstrating our success in improving performance while optimizing computational efficiency.
Compared to SSD and Faster R-CNN, Ours-m demonstrated significant advantages on both the COCO and VOC datasets. It was particularly effective in Recall for small object detection. For instance, on the COCO dataset, the Recall of Ours-l was 21.3% higher than that of SSD and 14.1% higher than Faster R-CNN, highlighting the effectiveness of the RedeCa module in enhancing recall capabilities.
Although Mask-RCNN is designed for instance segmentation, the Ours-l model excelled in object detection tasks. On the VOC dataset, it achieved an mAP@0.5 of 80.0%, compared to Mask-RCNN's 57.2%. This emphasizes that our improved model delivers outstanding detection performance, even in domains in which Mask-RCNN specializes.
We also controlled for model complexity while striving for higher performance. Compared to the original YOLOv5, the Ours series demonstrated superior small object detection in complex environments with fewer parameters. On the COCO dataset, Ours-l improved mAP and Recall by 0.4% each over YOLOv5l. Our methods surpass existing methods in key performance metrics while maintaining computational efficiency, laying a solid foundation for practical applications.
We compared our method with the latest object detection methods from YOLOv6 to YOLOv10 [47,48,49,50,51]. As shown in Table 5, our method matches YOLOv8 and YOLOv9 in accuracy while surpassing YOLOv6, v8, v9, and v10 in model lightweighting and computational load. These advantages contribute to the deployment and operational efficiency of our method in practical applications.
The RedeCa module enhances our method’s ability of representing features, especially in accurately handling edge and detail information, which is crucial for small object detection. It also increases the adaptability to varying lighting and weather conditions, thereby boosting robustness in complex environments.
In road traffic scenarios, real-time performance is crucial. We evaluated our method against popular methods in terms of FPS and parameter count, as detailed in Table 5. Our method’s parameter count and computational complexity are markedly lower than those of Faster R-CNN and Mask-RCNN. Compared to the baseline YOLOv5, our method achieves comparable performance while reducing both parameter count and computational complexity.
The experimental results show that by adding the RedeCa module to YOLOv5, we markedly improve small object detection accuracy while maintaining high recall rates. These findings strongly support further exploration of deep learning methods for small object detection in complex environments, confirming the effectiveness and potential of our enhancements.
To validate the superior accuracy and robustness of our method, we conducted a comprehensive quantitative and qualitative comparative analysis. We compared our method with popular monocular distance measurement methods in road traffic scenarios. The results are shown in Figure 7.
As shown in Figure 7, our monocular distance measurement method effectively reduces potential errors caused by sudden changes in the detection bounding box size. This leads to reduced dispersion in distance measurements, ensuring they remain more consistent with actual distances over time.
We also conducted a qualitative analysis to compare our method with mainstream methods. Unlike methods that generate depth maps for the entire image to obtain object distances, our method detects objects and measures their distances simultaneously, improving efficiency. The distance measurement results are shown in Figure 8.

4.5. Ablation Experiment

We conducted ablation experiments to evaluate the impact of different modules on the detection performance of the YOLOv5m model on the VOC dataset. The experimental results are shown in Table 6. We investigate the effect of depthwise separable convolution (DW) on model performance, examine the impact of efficient channel attention (ECA), explore the direct combination of DW and ECA, and finally assess the performance of the RedeCa module.
As shown in Table 6, the baseline YOLOv5m model achieved a Precision of 82.5%, mAP of 74.3%, and IOU of 67.2% on the VOC dataset. It had 21.0 million parameters and 48.5 GFLOPS.
After adding Depthwise Separable Convolutions (DW) to the baseline model, Precision decreased to 70.5%, mAP dropped to 62.9%, and IOU significantly declined to 38.4%. However, the number of parameters and FLOPS were reduced to 9.1 million and 18.3 GFLOPS, respectively. This indicates that while the DW module effectively reduces model complexity, it negatively impacts detection performance.
Incorporating the Efficient Channel Attention (ECA) module resulted in a slight improvement in Precision to 82.7%, an increase in mAP to 76.3%, and a significant rise in IOU to 74.1%. The number of parameters and FLOPS slightly increased, demonstrating that the ECA module enhances detection performance without significantly increasing the computational burden.
Using Depthwise Separable Convolutions (DW) and Efficient Channel Attention (ECA) directly, the model saw slight improvements in Mean Average Precision (mAP) and Intersection over Union (IoU). But the increase was minimal, and there was no significant reduction in the network’s parameters and computational load. However, by effectively integrating DW and ECA into the RedeCa module, the model’s accuracy slightly decreased to 81.9%, but the mAP increased to 77.5%, and the IoU rose to 73.4%, while the parameters and FLOPS were reduced to 17.2 M and 34.6 G, respectively. This indicates that the use of the RedeCa module can balance detection performance and model complexity to some extent, providing an effective solution for real-time object detection applications.
To verify that the use of an anomaly jump filter helps smooth distance variations and eliminate potential errors caused by sudden changes in the detection bounding box size, we conducted a comparative analysis. We compared the algorithm architecture with the anomaly jump filter to the one without it.
As shown in Figure 9, using a Kalman filter integrated with the time series effectively eliminates potential errors caused by sudden changes in detection bounding box size. This reduces the dispersion in distance measurements, making them more consistent with the actual distances over time.

5. Conclusions

Our improved object detection network performs excellently on the COCO dataset, achieving 71.6% precision, 61.5% mean average precision (mAP@0.5), and 57.9% recall. These results are better than those of the comparison models under the same computing resources. In addition, on the VOC dataset, our improved object detection network also shows excellent performance, with an mAP@0.5 of 77.5% and a recall of 73.4%. By integrating the Bot-Sort algorithm, the robustness of the proposed method is significantly improved when dealing with object occlusion or temporary loss, so that the system can track multiple objects and measure their distances at the same time. Meanwhile, the abnormal jumping change filter effectively smooths sudden changes in object range and eliminates potential anomalies caused by abrupt changes in detection box size, thus enhancing the reliability of ranging.
The holistic algorithm architecture we designed and implemented successfully breaks through the traditional research gap between 2D object detection and monocular depth estimation. We combine the two techniques by designing a new algorithm architecture. This significantly improves the performance of object detection and distance estimation in dynamic and complex environments. The integration of these technologies not only optimizes the respective functions, but also improves the efficiency of the system in practical applications. In particular, it has shown great application potential in the field of traffic monitoring and automatic driving.

Author Contributions

Conceptualization, H.L., Y.M., Y.D. and Y.Y.; methodology, H.L. and Y.M.; software, H.L.; validation, H.L. and Y.Y.; formal analysis, H.L., Y.Y. and Y.M.; investigation, H.L.; resources, H.L., Y.D. and Y.M.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L., Y.M. and Y.Y.; visualization, H.L.; supervision, Y.D. and Y.M.; project administration, H.L.; funding acquisition, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work is mainly supported by the Vehicle-road Cooperative Autonomous Driving Fusion Control Project, and sponsored by the Academic Research Projects of Beijing Union University (Nos. ZK80202003, ZK90202105), the Beijing Municipal Education Commission Science and Technology Program (Nos. KM202111417007, KM202211417006).

Data Availability Statement

The COCO dataset used can be obtained from https://cocodataset.org/#home (accessed on 24 June 2024). The VOC dataset can be obtained from http://host.robots.ox.ac.uk/pascal/VOC/ (accessed on 24 June 2024). The KITTI dataset can be obtained from https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d (accessed on 24 June 2024). The fusion architecture using YOLO-RedeCa and abnormal jumping change filter is available on GitHub at https://github.com/Mr-Lv-BUU/The-Fusion-Architecture-Using-YOLO-RedeCa-and-Abnormal-Jumping-Change-Filter.git (accessed on 24 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tan, K.; Wu, J.; Zhou, H.; Wang, Y.; Chen, J. Integrating Advanced Computer Vision and AI Algorithms for Autonomous Driving Systems. J. Theory Pract. Eng. Sci. 2024, 4, 41–48. [Google Scholar]
  2. Haris, M.; Glowacz, A. Road object detection: A comparative study of deep learning-based algorithms. Electronics 2021, 10, 1932. [Google Scholar] [CrossRef]
  3. Li, Y.; Huang, B.; Chen, Z.; Cui, Y.; Liang, F.; Shen, M.; Liu, F.; Xie, E.; Sheng, L.; Ouyang, W.; et al. Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–14. [Google Scholar] [CrossRef] [PubMed]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  5. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef] [PubMed]
  6. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  7. Poggi, M.; Tosi, F.; Batsos, K.; Mordohai, P.; Mattoccia, S. On the synergies between machine learning and binocular stereo for depth estimation from images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5314–5334. [Google Scholar] [CrossRef] [PubMed]
  8. Saxena, A.; Sun, M.; Ng, A.Y. Make3d: Learning 3d scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 824–840. [Google Scholar] [CrossRef]
  9. Karsch, K.; Liu, C.; Kang, S.B. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2144–2158. [Google Scholar] [CrossRef]
  10. Fu, C.; Yuan, H.; Xu, H.; Zhang, H.; Shen, L. TMSO-Net: Texture adaptive multi-scale observation for light field image depth estimation. J. Vis. Commun. Image Represent. 2023, 90, 103731. [Google Scholar] [CrossRef]
  11. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  13. Xiao, Y.; Tian, Z.; Yu, J.; Zhang, Y.; Liu, S.; Du, S.; Lan, X. A review of object detection based on deep learning. Multimed. Tools Appl. 2020, 79, 23729–23791. [Google Scholar] [CrossRef]
  14. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  15. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  17. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001. [Google Scholar]
  18. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
  19. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229. [Google Scholar]
  20. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  21. Han, X.; Chang, J.; Wang, K. Real-time object detection based on YOLO-v2 for tiny vehicle object. Procedia Comput. Sci. 2021, 183, 61–72. [Google Scholar] [CrossRef]
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  23. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  24. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Figure 1. Left: the main framework of object detection with monocular depth estimation. Right: the architecture of the proposed algorithm and its output.
Figure 2. The bounding boxes for non-ideal and ideal captures of road scene objects are illustrated. (a–d) represent consecutive frames from four distinct video sequences. In (a,b), the bounding box dimensions exhibit abrupt changes when the target object undergoes sudden appearance, disappearance, or occlusion events. Conversely, (c,d) depict scenarios where the target is clearly visible, and the bounding box accurately encapsulates the target object under normal conditions.
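The abrupt box-size changes shown in Figure 2a,b are the events that a jump detector must flag. As a rough illustration of that idea only (not the paper's abnormal jumping change filter, and with a hypothetical area-ratio threshold), the following sketch flags consecutive detections whose box area changes by more than a fixed factor:

```python
def detect_box_jump(prev_box, curr_box, area_ratio_thresh=1.5):
    """Flag an abnormal jump between two boxes given as (x1, y1, x2, y2).

    Returns True when the box area changes by more than the given ratio
    between consecutive frames, i.e., the kind of sudden size change
    illustrated in Figure 2a,b. The threshold value is illustrative.
    """
    def area(box):
        x1, y1, x2, y2 = box
        return max(x2 - x1, 0.0) * max(y2 - y1, 0.0)

    a_prev, a_curr = area(prev_box), area(curr_box)
    if a_prev == 0 or a_curr == 0:
        return True  # sudden appearance/disappearance is treated as a jump
    ratio = max(a_prev, a_curr) / min(a_prev, a_curr)
    return ratio > area_ratio_thresh

# Example: a box that suddenly doubles in height is flagged as a jump.
print(detect_box_jump((100, 100, 200, 200), (100, 100, 200, 300)))  # True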
Figure 3. The overall architecture of the proposed method.
Figure 4. The YOLOv5-RedeCa network architecture.
Figure 5. The structure of the RedeCa module.
Figure 6. The structure of the depthwise separable convolution.
Figure 7. Statistical results and standard deviation calculations of different monocular distance measurement methods. The standard deviation of our method is 0.25. The standard deviation of Faster-RCNN is 0.33. The standard deviation of Mask R-CNN is 0.37. The standard deviation of SSD is 0.40.
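The standard deviations quoted in Figure 7 are the kind of statistic that can be computed directly from the per-frame distance estimates of a tracked object. A minimal sketch, using made-up readings rather than the paper's data:

```python
import numpy as np

# Hypothetical per-frame distance estimates (in meters) for one tracked
# object; the values below are illustrative, not from the experiments.
distances = np.array([20.3, 20.1, 20.4, 19.9, 20.2, 20.0, 20.5])

# Standard deviation of the continuous ranging results, the statistic
# reported per method in Figure 7.
std_dev = np.std(distances)
print(f"standard deviation: {std_dev:.2f} m")
```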
Figure 8. The visualization results of different monocular distance measurement methods. (a) is the ground truth. (b) is the result based on Faster R-CNN. (c) is the result based on YOLOv5. (d) is the result based on YOLO-RedeCa.
Figure 9. The impact of using Kalman filtering on the fluctuations in distance estimation. The standard deviation of our method is 0.72. The standard deviation for the YOLOv5-RedeCa-based method without the use of Kalman filtering is 1.29.
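Figure 9 compares distance fluctuations with and without Kalman filtering. As an illustration of the general idea only, the sketch below smooths per-frame distance readings with a generic one-dimensional constant-velocity Kalman filter; the noise parameters are assumptions, not the configuration used in the paper:

```python
import numpy as np

def smooth_distances(measurements, dt=1.0, process_var=1e-2, meas_var=0.5):
    """Smooth per-frame distance estimates with a 1D constant-velocity
    Kalman filter. Noise variances here are illustrative placeholders."""
    x = np.array([measurements[0], 0.0])   # state: [distance, rate of change]
    P = np.eye(2)                          # state covariance
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])             # we observe distance only
    Q = process_var * np.eye(2)            # process noise
    R = np.array([[meas_var]])             # measurement noise

    smoothed = []
    for z in measurements:
        # Predict step
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step
        y = np.array([z]) - H @ x              # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return smoothed

# Example: a sudden jump in the raw readings is damped in the smoothed output.
raw = [20.0, 20.1, 25.0, 20.2, 20.3]
print([round(d, 2) for d in smooth_distances(raw)])
```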
Table 1. COCO Dataset Training, Validation, and Test Set Splits.
Dataset Split | Number of Images | Covered Categories
Training Set | 118,287 | 80 classes
Validation Set | 5000 | 80 classes
Test Set | 40,670 | 80 classes
Table 2. VOC Dataset Training and Validation Set Splits.
Dataset Split | Number of Images | Covered Categories
Training Set | 11,540 | 20 classes
Validation Set | 10,991 | 20 classes
Table 3. Experiment environment.
Environment | Version
Operating System | Ubuntu 20.04
CPU | AMD Ryzen 9 5900HX
GPU | Nvidia GeForce RTX 3080 16 GB
Compiling Environment | Python 3.9.0
CUDA | 12.2
Deep Learning Framework | PyTorch 1.13.1
IDE | PyCharm 2022.3.1
Table 4. Training parameters.
Parameter | Value
Batch Size | 16
Init Learning Rate | 0.1
Final Learning Rate | 0.0001
Image Size | 640 × 640
Optimizer | SGD
Workers | 2
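If these settings were reproduced with the standard ultralytics/yolov5 training script, they would roughly map onto the flags below; the dataset, weights, and hyperparameter file paths are placeholders, and the learning-rate mapping (initial rate lr0, final rate lr0 × lrf in the hyp YAML) is an assumption about that script's conventions rather than a detail given in the paper:

```python
# Hypothetical mapping of the Table 4 settings onto ultralytics/yolov5
# train.py command-line flags; paths are placeholders, not values from the paper.
cmd = [
    "python", "train.py",
    "--data", "data/voc.yaml",       # placeholder dataset config
    "--weights", "yolov5m.pt",       # placeholder initial weights
    "--imgsz", "640",                # Image Size 640 x 640
    "--batch-size", "16",            # Batch Size
    "--optimizer", "SGD",            # Optimizer
    "--workers", "2",                # Workers
    "--hyp", "hyp.custom.yaml",      # placeholder hyp file; lr0/lrf would set
                                     # the initial and final learning rates
]
print(" ".join(cmd))
```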
Table 5. Detection results of different algorithms on COCO and VOC datasets.
Method | COCO Precision (%) | COCO mAP@0.5 (%) | COCO Recall (%) | VOC Precision (%) | VOC mAP@0.5 (%) | VOC Recall (%) | Params (M) | FLOPS (G)
SSD | 50.7 | 45.1 | 38.8 | 31.4 | 54.7 | 57.3 | 26.3 | 65.7
Faster R-CNN | 53.1 | 48.0 | 46.0 | 34.1 | 57.5 | 54.7 | 60.4 | 93.6
Mask R-CNN | 51.9 | 48.0 | 45.7 | 33.9 | 57.2 | 69.1 | 63.4 | 147.0
YOLOv5s | 63.5 | 54.3 | 52.1 | 69.4 | 60.3 | 52.9 | 7.1 | 16.1
YOLOv5m | 70.4 | 61.3 | 55.4 | 82.5 | 74.3 | 67.2 | 21.0 | 48.5
YOLOv5l | 72.9 | 64.0 | 59.7 | 82.5 | 75.4 | 69.3 | 46.2 | 108.6
YOLOv6s | 64.7 | 52.8 | 46.2 | 65.3 | 58.3 | 51.6 | 18.5 | 45.3
YOLOv7s | 65.8 | 47.6 | 49.5 | 66.8 | 58.3 | 52.5 | 6.1 | 13.2
YOLOv8s | 68.1 | 56.3 | 53.1 | 68.8 | 56.5 | 53.0 | 11.2 | 28.8
YOLOv9s | 67.1 | 55.7 | 50.7 | 66.7 | 61.7 | 54.1 | 9.6 | 38.8
YOLOv10s | 66.3 | 57.2 | 51.8 | 68.4 | 60.7 | 56.1 | 8.1 | 24.9
Ours-s | 67.2 | 55.0 | 51.3 | 67.0 | 57.5 | 53.5 | 6.7 | 13.2
Ours-m | 71.6 | 61.5 | 57.9 | 81.9 | 77.5 | 73.4 | 17.2 | 34.6
Ours-l | 73.6 | 64.4 | 60.1 | 82.5 | 80.0 | 76.0 | 34.3 | 70.4
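The Params (M) and FLOPS (G) columns in Table 5 are the usual model-complexity metrics. A minimal sketch of how such figures can be obtained for a PyTorch model follows; it uses the third-party thop package and a generic torchvision model as a stand-in, since the paper does not state which profiling tool was used:

```python
import torch
from thop import profile
from torchvision.models import resnet18

# Stand-in model; the YOLO variants evaluated in Table 5 would be loaded here.
model = resnet18()
dummy = torch.randn(1, 3, 640, 640)  # matches the 640 x 640 input size in Table 4

macs, params = profile(model, inputs=(dummy,))
# thop reports multiply-accumulate operations, which are often quoted loosely
# as FLOPs; conventions differ on whether the MAC count is doubled.
print(f"Params (M): {params / 1e6:.1f}")
print(f"FLOPS (G): {macs / 1e9:.1f}")
```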
Table 6. Detection results using different modules on VOC dataset.
Method | DW | ECA | RedeCa (DW + ECA) | Precision (%) | mAP (%) | IOU (%) | Params (M) | FLOPS (G)
YOLOv5m | | | | 82.5 | 74.3 | 67.2 | 21.0 | 48.5
YOLOv5m | | | | 70.5 | 62.9 | 38.4 | 9.1 | 18.3
YOLOv5m | | | | 82.7 | 76.3 | 74.1 | 21.8 | 48.6
YOLOv5m | | | | 79.3 | 74.6 | 71.3 | 20.8 | 47.6
YOLOv5m + RedeCa | | | | 81.9 | 77.5 | 73.4 | 17.2 | 34.6