Article

DAN-YOLO: A Lightweight and Accurate Object Detector Using Dilated Aggregation Network for Autonomous Driving

1 School of Mechanical and Automotive Engineering, Guangxi University of Science and Technology, Liuzhou 545006, China
2 School of Mechanics and Vehicles, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3410; https://doi.org/10.3390/electronics13173410
Submission received: 24 July 2024 / Revised: 19 August 2024 / Accepted: 26 August 2024 / Published: 27 August 2024

Abstract

Object detection is becoming increasingly critical in autonomous driving. However, the accuracy and effectiveness of object detectors are often constrained by the obscuration of object features and details in adverse weather conditions. Therefore, this paper presents DAN-YOLO, a vehicle object detector specifically designed for driving in adverse weather. Building on the YOLOv7-Tiny network, SPP was replaced with SPPF to form the SPPFCSPC structure, which improves processing speed. The concept of Hybrid Dilated Convolution (HDC) was also introduced to improve the SPPFCSPC and ELAN-T structures, expanding the network’s receptive field (RF) while maintaining a lightweight design. Furthermore, an efficient multi-scale attention (EMA) mechanism was introduced to enhance the effectiveness of feature fusion. Finally, the Wise-IoUv1 loss function was employed as a replacement for CIoU to improve the localization accuracy of the bounding box (bbox) and the convergence speed of the model. With an input size of 640 × 640, the proposed DAN-YOLO algorithm achieved increases in mAP0.5 of 3.4% and 6.3% over the YOLOv7-Tiny algorithm on the BDD100K and DAWN benchmarks, respectively, while achieving real-time detection (142.86 FPS). Compared with other state-of-the-art detectors, it achieves a better trade-off between detection accuracy and speed under adverse driving conditions, indicating its suitability for autonomous driving applications.

1. Introduction

With the continuous advancement of computer vision [1], there is an increasing demand for higher standards in autonomous driving object detection. Autonomous vehicles [2,3] must have the capability of fast and accurate object detection in real-world conditions in order to ensure safe and correct driving behavior and decisions [4]. In other words, an object detector intended for autonomous driving applications must exhibit attributes such as high accuracy, robustness, and light weight. However, adverse weather conditions, such as the presence of heavy snow, fog, rain, dust, or sandstorm conditions, can lead to a decrease in camera resolution, resulting in unclear object features and details, which can impact the accuracy of automatic driving detectors [5]. Given the current technological constraints, enhancing the camera resolution in such conditions within a short timeframe is challenging. Therefore, refining the algorithm for improving the object detector in autonomous driving systems represents the most viable approach.
In the field of computer vision, numerous methods for detecting vehicle objects have emerged [6], with deep learning-based object detection becoming a crucial technology in autonomous driving systems. Object detection models are typically categorized into two main approaches: single-stage and two-stage methods. The Region Proposal Network (RPN) used in two-stage object detectors of the R-CNN family (region-based convolutional neural networks) [7,8] has been shown to efficiently extract object details while integrating global information, thereby mitigating the effects of environmental conditions and achieving high-precision detection. However, the heavy computational cost of the RPN leads to slow inference, which hinders application in real-time systems. In contrast, single-stage detectors (YOLO [9,10,11] and SSD [12]) formulate object detection as a simple regression problem, computing bbox regression and classification simultaneously on convolutional feature maps, which gives them extremely fast detection speeds even in adverse conditions. Wang et al. [13] used an enhanced YOLOv4 model for road object detection, incorporating SENet modules into the PANet structure to enhance channel attention and improve the overall detection performance of the model. Cao et al. [14] developed an improved YOLOv5 model and introduced a multi-scale small object detection structure to enhance the sensitivity to dense small objects. Mao et al. [15] improved YOLOv7 by introducing the SIoU [16] loss function, which takes into account the angles, distances, and shapes between targets, resulting in better detection performance. Wang et al. [17] proposed an improved YOLOv8 model to address the challenges of long-range detection in driving scenarios. Their approach included structural reparameterization techniques and the Bidirectional Feature Pyramid Network (BiFPN) [18], which effectively reduces the model’s complexity and computational load while maintaining high performance. Undoubtedly, these studies provide significant assistance for autonomous driving. However, these methods face challenges in adverse driving conditions. Specifically, in environments such as rain, snow, fog, and sandstorms, image clarity is reduced and object features and details are obscured, leading to decreased detection accuracy and significantly limiting their widespread application in autonomous driving.
In recent years, there has been growing interest in enhancing object detection capabilities for autonomous driving applications, and numerous methods have been proposed to improve detector performance. Zeller et al. [19] utilized radar point clouds in lieu of images and introduced a full-resolution Backbone network to enable instance segmentation in challenging weather conditions. Piroli et al. [20] employed deep learning networks to preprocess LiDAR point cloud data for improved robustness of object detection. These approaches show significant improvements in both speed and accuracy over traditional algorithms. However, the substantial volume of data in radar and LiDAR point clouds hinders real-time detection speed when compared to images. Similarly, while two-stage object detectors offer high detection accuracy, their speed falls short of the requirements of autonomous driving applications, hindering their further development in this field [21]. Wibowo et al. [22] developed YOLOv7 MOD to achieve real-time detection in congested traffic, yet its accuracy diminishes significantly in snowy weather conditions. Hassaballah et al. [5] proposed a visibility restoration scheme to enhance image quality and object detection performance in adverse weather conditions, but it requires additional image processing prior to network input. To balance speed and accuracy, many studies have adopted dilated convolutions. Zhang et al. [23] proposed LS-YOLO, which improves the decoupled head using dilated convolutions, enhancing accuracy in landslide detection. Deng et al. [24] designed a weighted feature pyramid using dilated convolutions to improve YOLOv7’s performance in detecting maritime targets under foggy conditions. Qian et al. [25] improved YOLOv4’s architecture with dilated convolutions and skip connections, enhancing polyp detection accuracy in medical imaging. Chen et al. [26] used multiple parallel dilated convolutions to create convolutional kernels with varying receptive fields to capture multi-scale object information; however, the resulting Atrous Spatial Pyramid Pooling (ASPP) architecture did not fully exploit HDC to optimize the receptive field. Meanwhile, the use of HDC during the sampling process has gained prominence in object detection, as it helps to mitigate information loss [27,28,29]; it enhances a model’s ability to preserve spatial resolution and capture intricate details across different scales, which is particularly beneficial in challenging environments. Tian et al. [30] also attempted to integrate dilated convolutions and the Self-Attention Module (SAM) into a single-stage object detector to balance detection accuracy and speed, but did not account for driving under adverse weather conditions. Despite the existence of various techniques to optimize detector performance, little research has specifically targeted improving autonomous vehicle object detectors in adverse weather conditions using image-based algorithms. Currently, algorithms for driving in adverse weather conditions struggle to balance detection accuracy and speed, hindering the widespread adoption of vehicle object detectors. Therefore, enhancing the detection accuracy of single-stage object detectors holds promise for advancing the field of autonomous driving.
In response to the challenges posed by detecting objects in adverse driving conditions, this study drew inspiration from the way two-stage models rationally exploit object details to aggregate global information. The architecture of the original YOLOv7-Tiny model was improved to enhance its performance under such conditions. The main improvements are as follows: Firstly, SPP was replaced with SPPF in the YOLOv7-Tiny network, leading to the creation of SPPFCSPC, which resulted in speed improvements. Additionally, HDC was employed to expand the RF of the SPPFCSPC and ELAN-T structures. Next, the EMA mechanism was integrated into the feature fusion process to combine spatial and channel information, thereby enhancing the effectiveness of feature fusion. Furthermore, Wise-IoUv1 was implemented in place of CIoU to improve convergence speed and bbox positioning accuracy. Finally, various evaluation experiments were conducted on the BDD100K and DAWN datasets. The model showed a 3.4% increase in mAP0.5 on the BDD100K dataset while achieving real-time detection at 142.86 FPS, and a 6.3% increase in mAP0.5 on the DAWN dataset. Compared to other state-of-the-art detectors, DAN-YOLO achieved a favorable trade-off between detection accuracy and speed under adverse weather conditions, making it well-suited for applications in autonomous driving.

2. Related Technology

2.1. The YOLOv7-Tiny Network Model Architecture

Currently, the YOLO series models are widely regarded as dominant among single-stage object detectors, with YOLOv7-Tiny [11] standing out as a noteworthy iteration of YOLO due to its simplified structure and strong adaptability. As illustrated in Figure 1, the model is composed of three fundamental components: Backbone, Neck, and Head.
The CBL (Conv + BN + LeakyReLU), ELAN-T (Efficient Layer Aggregation Networks-Tiny), MP (Max Pooling), SPP, and SPPCSPC modules make up the majority of the Backbone network. CBL refers to the process wherein the input undergoes a convolution, followed by Batch Normalization, and is then activated by LeakyReLU. ELAN-T first splits into two paths: the upper path passes through one CBL convolution, while the lower path passes through three CBL convolutions. The results from both paths are then concatenated and passed through one more CBL convolution before the final features are output. MP stands for max-pooling downsampling, which aids in extracting high-level semantics. The SPP layer performs a convolution on the input, followed by MP layers with varying RF sizes, and finally integrates the features from these different RFs before outputting through another convolution. The SPPCSPC structure processes the input from the previous layer through the SPP and CBL convolutions separately, then merges the results and outputs them through another convolution. In the Neck, the Path Aggregation Network (PAN) is utilized to merge the features extracted from the Backbone and enhance feature texture through upsampling. The Head consists of 3 × 3 convolutional kernels and detection heads (IDetect), which receive features of different scales from the Neck, enabling the prediction of objects of different sizes on the feature maps. The structure described above makes YOLOv7-Tiny flexible and efficient. Current work on this model commonly relies on tuning optimization parameters, regularization techniques, and conventional training methods to enhance performance, without delving deeply into optimizing the internal network structure. However, structural modifications have proven to be relatively more effective than other techniques for improving object detection performance [31,32]. Therefore, this paper investigated how to achieve a trade-off between detection accuracy and speed, especially for detection in adverse weather conditions, at minimal cost through architectural changes, without using any additional techniques.
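To make the block descriptions concrete, a minimal PyTorch sketch of CBL and ELAN-T is given below. Channel widths, the exact set of concatenated feature maps, and class names are illustrative assumptions rather than the official YOLOv7-Tiny definitions.

```python
import torch
import torch.nn as nn


class CBL(nn.Module):
    """Conv + Batch Normalization + LeakyReLU, as described above."""

    def __init__(self, c_in, c_out, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class ELANT(nn.Module):
    """ELAN-T: a one-CBL upper path and a three-CBL lower path, concatenated and fused."""

    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.upper = CBL(c_in, c_mid, k=1, p=0)      # upper path: one CBL
        self.low1 = CBL(c_in, c_mid, k=1, p=0)       # lower path: three CBLs in sequence
        self.low2 = CBL(c_mid, c_mid)
        self.low3 = CBL(c_mid, c_mid)
        self.fuse = CBL(4 * c_mid, c_out, k=1, p=0)  # one more CBL after concatenation

    def forward(self, x):
        y0 = self.upper(x)
        y1 = self.low1(x)
        y2 = self.low2(y1)
        y3 = self.low3(y2)
        # Aggregate the branch outputs (intermediate maps included, in ELAN style)
        return self.fuse(torch.cat([y0, y1, y2, y3], dim=1))


print(ELANT(64, 32, 64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```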

2.2. Hybrid Dilated Convolution

Dilated convolutions were initially employed in the context of semantic segmentation to mitigate the loss of essential information that can occur due to pooling operations. The formulas for 1-D dilated convolution and its 2-D RF calculation are as follows:
$O[i] = \sum_{k=0}^{K-1} I[i + r \cdot k] \cdot w[k]$ (1)
$R = 1 + (k - 1) \cdot r$ (2)
In Equation (1), I[i] denotes the input feature, O[i] represents the output feature, w[k] signifies the k-th element of a filter of length K, and r indicates the dilation factor employed for sampling the input feature. In Equation (2), R denotes the receptive field size, k is the filter kernel size, and r is the dilation coefficient. According to Equation (2), substituting three 3 × 3 standard convolutions with dilated convolutions can substantially expand the network’s RF. This paper analyzed the HDC structure using 10 sets of systematically arranged dilation coefficients, with the resulting data presented in Table 1.
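As a quick numerical check of Equation (2) and of how stacking dilated 3 × 3 convolutions grows the RF (each layer adds (k − 1)·r), the short sketch below reproduces two of the values listed in Table 1; the helper functions are ours, not part of any detection framework.

```python
def single_layer_rf(k, r):
    """Equation (2): receptive field of one k x k convolution with dilation r."""
    return 1 + (k - 1) * r


def stacked_rf(k, dilations):
    """Receptive field of sequentially stacked dilated convolutions: each layer adds (k - 1) * r."""
    rf = 1
    for r in dilations:
        rf += (k - 1) * r
    return rf


print(single_layer_rf(3, 2))       # 5: a single 3 x 3 convolution with dilation 2
print(stacked_rf(3, [1, 1, 1]))    # 7: three standard 3 x 3 convolutions (Table 1, group 1)
print(stacked_rf(3, [1, 2, 5]))    # 17: the HDC setting adopted in this paper (Table 1, group 8)
```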
HDC involves stacking multiple dilated convolutions to expand the RF while preserving the internal data structure and spatial hierarchical information. As illustrated in Table 1, the RF increases with the dilation coefficient. However, irregular stacking can lead to an undesirable phenomenon known as the gridding effect. For instance, although setting the dilation factors of three 3 × 3 convolutions to [2, 3, 4], [3, 4, 5], or [5, 5, 5] can significantly increase the receptive field, it may result in irregular pixel utilization, causing gaps or loss of detailed boundaries (Figure 2). This results in the permanent loss of local details, diminishes the correlation across a wide range of information, and undermines the consistency of local information, ultimately significantly affecting the learning efficiency of the network. Wang et al. [33] conducted a detailed investigation into the principles of dilated convolution and introduced a formula for validating the dilation factors in HDC applications, as follows:
$X_i = \max\left[ X_{i+1} - 2r_i,\; X_{i+1} - 2(X_{i+1} - r_i),\; r_i \right]$ (3)
Multiple convolutional layers with a kernel size of k are stacked, each with dilation factors [r_1, …, r_i, …, r_n], where X_n = r_n, r_i represents the dilation factor of the i-th layer, and X_i denotes the maximum spacing between two non-zero elements in the i-th layer. It is crucial to follow the design guideline X_2 ≤ k. By selecting dilation factors that meet this criterion, for instance, r = [1, 2, 5], it is possible to prevent the gridding effect (as illustrated in Figure 3). This paper applied this strategy to the integration of the architecture, expanding the network’s RF and enabling it to capture a broader range of pixel information.
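The snippet below applies Equation (3) to a few candidate dilation sets and tests the guideline X2 ≤ k (with k = 3). The function name is an assumption for illustration; only the recursion itself comes from the text.

```python
def hdc_gap(dilations):
    """Evaluate Equation (3) from X_n = r_n down to X_2 for dilation factors [r_1, ..., r_n]."""
    x = dilations[-1]                     # X_n = r_n
    for r in reversed(dilations[1:-1]):   # recursion for i = n - 1, ..., 2
        x = max(x - 2 * r, x - 2 * (x - r), r)
    return x


k = 3
for candidate in ([1, 2, 5], [3, 4, 5], [5, 5, 5]):
    x2 = hdc_gap(candidate)
    print(candidate, x2, "meets X2 <= k" if x2 <= k else "risks the gridding effect")
# [1, 2, 5] gives X2 = 2 <= 3, whereas [3, 4, 5] and [5, 5, 5] give 4 and 5, violating the guideline.
```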

3. DAN-YOLO Network Model Architecture

The network architecture of the enhanced YOLOv7-Tiny model, named DAN-YOLO, is shown in Figure 4. Compared to the original YOLOv7-Tiny network (Figure 1), several improvements have been made, as indicated by the modules filled with green, blue, and brown boxes in Figure 4. Firstly, the improved SPPFCSPC-E structure is employed in the Backbone network to achieve the lossless capture of feature information across different receptive fields. Subsequently, the improved DAN structures supplant the ELAN-T modules in the Neck network to effectively aggregate global information. The EMA mechanism is then applied prior to feature fusion to enhance feature representation during the fusion process in the Neck. Finally, the network is trained with the Wise-IoUv1 loss [34] in place of the original loss function. In the following subsections, a detailed description of each module is provided.

3.1. SPPFCSPC-E Structure and DAN Structure

Figure 5 illustrates SPPF and SPPFCSPC, which are based on SPP but offer significantly improved speed. In SPPF, the upper branch is produced by applying three 5 × 5 MP operations sequentially to the features. By concatenating the features from each MP layer, whose effective receptive fields differ, SPPF reduces the computational cost associated with multiple pooling scales. This structure improves detection speed while preserving the same RF as before the enhancement.
Figure 1 and Figure 6 illustrate the original SPPCSPC structure and SPPFCSPC-E. Initially, SPP [35] in SPPCSPC was replaced with SPPF to create SPPFCSPC, which improves speed while preserving the same RF. Additionally, the serial structure of SPPF is advantageous for stacking HDC, effectively maintaining internal data structures and spatial hierarchy, as shown in Figure 5. In the SPPF structure of SPPFCSPC, the repeated use of the MP in multiscale fusion has raised concerns about the potential loss of internal data structure and spatial hierarchical information, which is critical for objects that have become unclear due to environmental influences. Therefore, this paper replaced the MP layer in the SPPF module with an HDC layer, incorporating a dilation factor of [1, 2, 5], as depicted in Figure 6. The modified SPPFCSPC-E accelerates network inference speed while expanding the RF, thereby enhancing detection in adverse weather conditions. By replacing MP with HDC, the network’s ability to accurately capture features across different receptive fields is improved, leading to more precise localization of detailed target objects.
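For clarity, the sketch below contrasts a plain SPPF branch with the SPPFCSPC-E variant in which the three MP layers are replaced by 3 × 3 dilated convolutions with dilation factors [1, 2, 5]. Channel widths, the omitted CSP branch, and normalization/activation details are simplifying assumptions.

```python
import torch
import torch.nn as nn


class SPPF(nn.Module):
    """Serial SPPF branch: three 5 x 5 MP operations whose outputs are concatenated with the input."""

    def __init__(self, c_in, c_out):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = nn.Conv2d(4 * c_hid, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))


class SPPFE(nn.Module):
    """SPPF branch of SPPFCSPC-E: the MP layers are replaced by HDC with dilations 1, 2, 5."""

    def __init__(self, c_in, c_out):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)
        # padding = dilation keeps the spatial size of a 3 x 3 convolution unchanged
        self.d1 = nn.Conv2d(c_hid, c_hid, 3, padding=1, dilation=1)
        self.d2 = nn.Conv2d(c_hid, c_hid, 3, padding=2, dilation=2)
        self.d3 = nn.Conv2d(c_hid, c_hid, 3, padding=5, dilation=5)
        self.cv2 = nn.Conv2d(4 * c_hid, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.d1(x)
        y2 = self.d2(y1)
        y3 = self.d3(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))


print(SPPFE(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```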
Figure 1 and Figure 7 illustrate the original ELAN-T structure and the DAN structure, respectively. The ELAN-T module in YOLOv7-Tiny is recognized for its simple and flexible structure. This module incorporates multi-level feature maps using the concat layer, which improves the network’s capacity to learn from input features and accomplish deep feature utilization objectives. However, ELAN-T employs regular convolutions with relatively small RFs, and this insufficient feature utilization restricts its ability to effectively integrate global information while capturing object details. To address this issue, this study introduced the DAN structure, aimed at expanding the RF while maintaining network flexibility. Firstly, a 3 × 3 convolution layer was added to the fifth layer of the ELAN-T structure, after which the dilation factors of the third, fourth, and fifth layers were set to 1, 2, and 5, respectively. Given that the Neck plays a crucial role in enhancing the spatial detail control of global information, the ELAN-T structure in the Neck was replaced with the DAN structure to effectively aggregate global information. Both measures significantly reduce the impact of adverse weather conditions and improve detection accuracy.
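Under the same assumptions about channel widths and the concatenation set, a rough sketch of the resulting DAN block is shown below; layers three to five of the lower path use dilation factors 1, 2, and 5, with the fifth 3 × 3 convolution being the added layer.

```python
import torch
import torch.nn as nn


def cbl(c_in, c_out, k=3, d=1):
    """Conv + BN + LeakyReLU; padding keeps the spatial size for any dilation."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=d * (k - 1) // 2, dilation=d, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )


class DAN(nn.Module):
    """ELAN-T-style block with an added fifth convolution and dilations 1, 2, 5 on layers 3-5."""

    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.l1 = cbl(c_in, c_mid, k=1)    # upper path
        self.l2 = cbl(c_in, c_mid, k=1)    # lower path entry
        self.l3 = cbl(c_mid, c_mid, d=1)   # third layer, dilation 1
        self.l4 = cbl(c_mid, c_mid, d=2)   # fourth layer, dilation 2
        self.l5 = cbl(c_mid, c_mid, d=5)   # added fifth layer, dilation 5
        self.fuse = cbl(5 * c_mid, c_out, k=1)

    def forward(self, x):
        y1 = self.l1(x)
        y2 = self.l2(x)
        y3 = self.l3(y2)
        y4 = self.l4(y3)
        y5 = self.l5(y4)
        return self.fuse(torch.cat([y1, y2, y3, y4, y5], dim=1))


print(DAN(128, 64, 128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```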

3.2. Efficient Multi-Scale Attention Mechanism

In the network, initial feature extraction is accomplished in the Backbone, and subsequent feature integration is carried out in the Neck to enhance the comprehensiveness of object features. Nevertheless, the features derived from the Backbone do not fully account for spatial and channel information. During the fusion process, the representational capacity of spatial and channel information is not enhanced, and the pixel-level correlation between features remains unimproved. Consequently, this paper incorporated the EMA attention mechanism to process the features extracted from the Backbone prior to feature fusion. This enables the network to emphasize spatial and channel information within features, leading to a more comprehensive understanding of contextual semantic relationships. Enhancing the feature representation in this way during the Neck fusion process lays the foundation for the integration of deep features in subsequent stages.
The EMA mechanism [36], known for its efficiency as a multi-scale attention mechanism, distinguishes itself from conventional channel and spatial attention mechanisms by its flexible and lightweight structure. This mechanism reduces network depth through parallel substructures, requiring less computational cost to achieve the same effect on a feature map without diminishing channel dimensions. Drawing inspiration from the Coordinate Attention (CA) mechanism [37], the EMA mechanism introduces a multi-scale parallel sub-network to address dependencies within network modules. As illustrated in Figure 8, the input features are first divided into G sub-features along the channel dimension (X ∈ R^{C×h×w}), with each sub-feature represented as X = [X_0, X_1, …, X_{G−1}]. Feature extraction for each sub-feature X_i is performed using learned weights to capture key information. These sub-features are then fed into three parallel subnetworks, enabling dynamic weight allocation to the grouped feature maps. In the parallel subnetworks, the first two sets, X Avg Pool and Y Avg Pool, perform average pooling in the horizontal and vertical directions, respectively, and are referred to as pooling networks. The third subnetwork applies a 3 × 3 convolution and is called a 3 × 3 network. The pooling network is defined as follows: $z_c^H$ represents pooling along the horizontal direction, and $z_c^W$ represents pooling along the vertical direction.
$z_c^H(H) = \frac{1}{W} \sum_{0 \le i < W} x_c(H, i)$ (4)
$z_c^W(W) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, W)$ (5)
In the pooling network, the batch dimensions (C//G × h × w) of the input are combined through feature concatenation using the pooling network described earlier. This approach avoids standard convolution downsampling and incorporates direction-related positional information, effectively utilizing spatial information. The sigmoid function is then applied to activate and produce outputs in both horizontal and vertical directions. These two parallel networks perform the first cross-channel interaction of features through multiplication, as defined in Equation (6), where Xi represents the i-th feature in the G sub-features, while Xw and Xh are the convolution results in the horizontal and vertical directions, respectively.
$\mathrm{output} = \sum_{i=0}^{G-1} X_i \times X_w \times X_h$ (6)
In the 3 × 3 network, 3 × 3 convolution and global average pooling are used for encoding, extracting local key information within the channel, and expanding the spatial feature map. The results from the three parallel networks are combined to achieve the fusion of overall cross-spatial feature information. Specifically, there are two stages of cross-spatial feature information fusion between the pooling networks and the 3 × 3 network. The first stage involves encoding global information through 2D global average pooling in one output of the pooling network, which is then matrix-multiplied with the output of the 3 × 3 network. This operation is represented as $N_{1 \times 1}^{C//G \times h \times w} \times N_{3 \times 3}^{C//G \times 1 \times 1}$, where $N_{1 \times 1}$ is the output of the pooling network and $N_{3 \times 3}$ is the output of the 3 × 3 network.
In the second stage, the global average pooling result of the 3 × 3 network output is matrix-multiplied with the output of the pooling network, expressed as $N_{1 \times 1}^{C//G \times 1 \times 1} \times N_{3 \times 3}^{C//G \times h \times w}$. Finally, the weight values of the cross-spatial feature information are aggregated to obtain pixel-level global key information features, thereby emphasizing global contextual information [38]. The experiments in this paper demonstrate the effectiveness of this method.
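The following PyTorch sketch follows the EMA description above (grouped channels, directional average pooling, a 3 × 3 branch, and two cross-spatial matrix multiplications). The grouping factor, the use of group normalization, and the variable names are assumptions modeled on the reference implementation accompanying [36], not a verbatim copy of the module used in DAN-YOLO.

```python
import torch
import torch.nn as nn


class EMA(nn.Module):
    """Efficient multi-scale attention: grouped channels, two pooling branches, one 3 x 3 branch."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups                                # channels must be divisible by groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # X Avg Pool (horizontal direction)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # Y Avg Pool (vertical direction)
        self.gap = nn.AdaptiveAvgPool2d(1)             # 2D global average pooling
        self.conv1x1 = nn.Conv2d(c, c, 1)
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)
        self.gn = nn.GroupNorm(c, c)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        gx = x.reshape(b * self.g, -1, h, w)           # split into G sub-features
        # Pooling branch: directional pooling, shared 1 x 1 convolution, sigmoid gating
        xh = self.pool_h(gx)                           # (b*g, c/g, h, 1)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)       # (b*g, c/g, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))
        xh, xw = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        # 3 x 3 branch
        x2 = self.conv3x3(gx)
        # Two cross-spatial interactions: GAP of one branch re-weights the other
        a1 = self.softmax(self.gap(x1).flatten(1).unsqueeze(1))   # (b*g, 1, c/g)
        a2 = self.softmax(self.gap(x2).flatten(1).unsqueeze(1))   # (b*g, 1, c/g)
        m1 = torch.matmul(a1, x2.flatten(2))           # (b*g, 1, h*w)
        m2 = torch.matmul(a2, x1.flatten(2))           # (b*g, 1, h*w)
        weights = (m1 + m2).reshape(b * self.g, 1, h, w).sigmoid()
        return (gx * weights).reshape(b, c, h, w)


print(EMA(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```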

3.3. Wise-IoU Loss Function

In adverse driving conditions, the model’s generalization ability and localization accuracy play a crucial role. For example, in heavy snow, a white car covered by snow while driving may be inaccurately localized and misclassified as part of the road, leading to reduced generalization. Therefore, high-quality fitting during training is necessary, which requires calculating the bbox regression loss based on the geometric elements of the predicted box and the ground truth box. During data fitting, the loss function should appropriately reduce the focus on the center point distance and weaken the geometric penalty when the anchor box and the target box (high-quality anchor box) overlap well, thus reducing training interference and effectively improving the model’s generalization ability and localization accuracy. However, the CIoU loss function only considers three key geometric factors: overlap area, center distance, and aspect ratio, without distinguishing between ordinary anchor boxes and high-quality anchor boxes [39]. To enhance the model’s generalization ability and localization accuracy, this paper replaced CIoU in the model with Wise-IoUv1.
Wise-IoU is a bbox regression loss function with a dynamic non-monotonic focusing mechanism, and Wise-IoUv1 is its first-generation version. In traditional IoU-based losses, the penalty applied when fitting bboxes to high-quality samples differs from that applied to low-quality samples; geometric metrics such as distance and aspect ratio often impose higher penalties on low-quality samples, which can hinder the model’s generalization ability and localization accuracy. To address this issue, Wise-IoUv1 reduces the geometric penalty for well-overlapping anchor boxes and target boxes (representing high-quality samples), constructs distance attention, reduces the emphasis on low-quality samples, minimizes training interference, and enhances the focus on typical samples, ultimately improving the model’s generalization ability and localization accuracy. The formula for Wise-IoUv1 is as follows:
$L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}$ (7)
$R_{WIoU} = \exp\!\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left( W_g^2 + H_g^2 \right)^{*}} \right)$ (8)
where x and y represent the center coordinates of the predicted bounding box, x_gt and y_gt represent the center coordinates of the ground truth bounding box, W_g and H_g represent the width and height of the minimum bounding box enclosing the predicted and ground truth boxes, and the superscript * indicates that this term is detached from the computational graph so that no gradient propagates through it. $L_{IoU}$ denotes the IoU loss computed from the overlap between the predicted box and the ground truth box.
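A hedged sketch of Equations (7) and (8) is given below. It assumes boxes in (x1, y1, x2, y2) format, treats L_IoU as 1 − IoU, and detaches the enclosing-box denominator from the gradient as indicated by the superscript *; the function name and numerical safeguards are ours.

```python
import torch


def wise_iou_v1(pred, target, eps=1e-7):
    """Wise-IoUv1 per Equations (7)-(8); pred and target are (N, 4) boxes as (x1, y1, x2, y2)."""
    # IoU and the corresponding IoU loss (taken here as 1 - IoU)
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1.0 - iou

    # Center coordinates of the predicted and ground-truth boxes
    cx, cy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_gt, cy_gt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # Width/height of the smallest enclosing box; detached, matching the superscript * in Equation (8)
    wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    r_wiou = torch.exp(((cx - cx_gt) ** 2 + (cy - cy_gt) ** 2)
                       / (wg ** 2 + hg ** 2 + eps).detach())
    return r_wiou * l_iou            # Equation (7): L_WIoUv1 = R_WIoU * L_IoU


print(wise_iou_v1(torch.tensor([[10., 10., 50., 50.]]), torch.tensor([[12., 12., 48., 52.]])))
```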

4. Experimentation Results and Discussion

To validate the effectiveness of the proposed method, object detection experiments and tests were performed using PyTorch (Python 3.8) on a 64-bit Windows 10 operating system equipped with an NVIDIA RTX 3090 GPU with 24 GB of memory. The training data were sourced from the BDD100K dataset [40] and the DAWN dataset [41]; detailed descriptions are given in Table 2. For BDD100K, images captured in snowy driving conditions were mainly selected, and data augmentation techniques were used to add 382 images.

4.1. Basic Testing and Implementation Details

The training configuration includes parameters such as epoch, batch size, image size, and workers, set at 200, 16, 640, and 8, respectively. Hyperparameters are established based on the official default file, and the cosine annealing learning rate scheduling algorithm is employed for updates. Additionally, both the YOLOv7-Tiny and DAN-YOLO models were trained from scratch to ensure impartiality in the evaluation process.
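Purely as an illustration of this schedule (not the official hyperparameter file), a cosine-annealed optimizer can be configured in PyTorch as follows; the optimizer settings shown are placeholders.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for the detector network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Cosine annealing of the learning rate over the 200 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one pass over the training loader (batch size 16, image size 640, 8 workers) goes here ...
    optimizer.step()
    scheduler.step()
```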
This paper evaluated the effectiveness of improving object detection algorithms by selecting Precision, Recall, Mean Average Precision (mAP), Frames Per Second (FPS), Params, and FLOPs as evaluation metrics. Predicted boxes were categorized into TP, TN, FP, and FN based on IoU, where T or F indicated whether a sample was correctly classified, and P or N indicated whether a sample was predicted as positive or negative. The calculations for Precision and Recall are as follows:
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (9)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (10)
$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_i$ (11)
$\mathrm{FPS} = \frac{\mathrm{FrameNum}}{\mathrm{UsageTime}}$ (12)
In the formula for mAP, n and AP represent the number of detection categories and the area under the Precision–Recall (P–R) curve for a specific category, respectively. The FPS formula calculates the frames per second by dividing the number of image frames by the time taken for processing.
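As a worked example of these definitions, the sketch below uses the per-class AP values of DAN-YOLO from Table 3 together with made-up TP/FP/FN counts and timing, reproducing an mAP0.5 of roughly 52% and an FPS close to 142.86.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)


def fps(frame_num, usage_time_s):
    return frame_num / usage_time_s


print(precision(80, 20))                      # 0.8 with 80 TP and 20 FP (illustrative counts)
print(recall(80, 40))                         # about 0.667 with 40 FN
print(mean_ap([0.716, 0.453, 0.444, 0.466]))  # about 0.52, the per-class APs of DAN-YOLO in Table 3
print(fps(1000, 7.0))                         # about 142.9 frames per second
```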

4.2. Performance Analysis of DAN-YOLO

Training and validation were performed on an NVIDIA RTX 3090 to demonstrate the effectiveness of the DAN-YOLO model for edge computing. The proposed object detector was compared with various state-of-the-art detectors, tracking computational complexity in terms of floating-point operations (FLOPs) and measuring the processing time per image to assess efficiency.
Table 3 details the detection performance of the algorithms on the BDD100K benchmark; all models follow official settings and are trained from scratch. From the perspective of mAP0.5, Table 3 shows that all other models have lower mAP0.5 values than DAN-YOLO, indicating that this model has better detection capabilities. Notably, DAN-YOLO (15.6 G) has a lower computational cost than YOLOv5s (16.3 G), YOLOv5m (50.3 G), YOLOv8s (28.4 G), and YOLOv10s (21.6 G), yet its mAP0.5 is higher by 5%, 2.2%, 0.5%, and 1.4%, respectively. In terms of real-time performance, DAN-YOLO achieves 142.86 FPS, fully meeting the real-time requirement (exceeding 60 FPS) and demonstrating that DAN-YOLO excels in balancing detection accuracy and lightweight design. Compared to YOLOv7-Tiny (mAP0.5 of 48.6%), the mAP0.5 improved by 3.4%, with only an 18.87% sacrifice in FPS. In terms of per-category precision, compared to the other models, DAN-YOLO shows improvements in the car, person, traffic sign, and traffic light categories, achieving 71.6%, 45.3%, 44.4%, and 46.6%, respectively, indicating that the model has stronger compatibility with objects in autonomous driving scenarios.
Additionally, Table 4 presents the evaluation results of the algorithms on the DAWN validation set. From the perspective of mAP0.5, all models except YOLOv8s have lower mAP0.5 values, indicating that DAN-YOLO (52.8%) has stronger detection capabilities. Although YOLOv8s ranked first on the DAWN benchmark with an mAP0.5 of 54.9%, 2.1% higher than DAN-YOLO’s 52.8%, this came at the cost of an additional 12.8 G in computational load. Compared to YOLOv7-Tiny, DAN-YOLO made significant progress, improving mAP0.5 by 6.3% with only a 2.6 G increase in FLOPs. This again demonstrates that DAN-YOLO excels in balancing detection accuracy and lightweight design.
Moreover, Table 5 presents the evaluation results of transfer learning on the DAWN dataset, using the weights of DAN-YOLO and YOLOv7-Tiny trained on the BDD100K dataset as pre-trained weights. In the weights column, “√” indicates that the corresponding pre-trained weights were used, while “×” indicates that they were not. The results show that, with pre-trained weights, the mAP0.5 and mAP0.5:0.95 of YOLOv7-Tiny improve by 4.7% and 2.5%, and those of DAN-YOLO improve by 1.7% and 4.2%, respectively. Moreover, based on the results obtained with pre-trained weights, DAN-YOLO outperforms YOLOv7-Tiny in terms of mAP0.5 and mAP0.5:0.95 by 3.3% and 1.6%, respectively. This demonstrates that the performance and generalization ability of the DAN-YOLO model are superior to those of YOLOv7-Tiny.
Consequently, the experimental results validate the superiority and competitiveness of the proposed DAN-YOLO, whose better trade-off between accuracy and inference speed makes it more suitable for autonomous driving in adverse weather conditions.

4.3. Ablation Study

This paper conducted ablation studies on the DAWN dataset to explore the contributions of the different components of the proposed structure, including SPPFCSPC-E, DAN, EMA, and the Wise-IoUv1 loss function. Table 6 indicates that applying each of these methods individually to YOLOv7-Tiny resulted in improvements of 2.6%, 3.6%, 3.6%, and 2.9% in mAP0.5, respectively, with corresponding improvements in the mAP0.5 of most classes, indicating the effectiveness of deploying the individual modules. Furthermore, applying all the proposed structural enhancements to YOLOv7-Tiny resulted in increases of 6.3% in mAP0.5 and 11.3% in precision, with parameters and FLOPs increasing by only 1.98 M (million) and 2.6 G, respectively, achieving an effective trade-off without excessive growth. Notably, the motorcycle + bicycle class, which accounts for only 1.36% of the total annotations, achieved a significant improvement of 30.9% in mAP0.5, indicating that the proposed method shows stronger compatibility in adverse conditions and is better suited to unpredictable autonomous driving scenarios.

4.4. Visualization Results

To better understand the detection performance of DAN-YOLO, this paper provides several visualizations, including P-R curves, Confusion Matrix Difference Heatmaps, and object detection results.
Figure 9 shows the P–R curves on the two datasets, which more intuitively display the recall and precision performance of the detectors. As shown in Figure 9a, the P–R curve of DAN-YOLO completely encloses that of YOLOv7-Tiny. The improved network also has a higher value at the break-even point (BEP), as shown by the intersection points of the two curves with the x = y line in Figure 9b. Therefore, the improved network model achieves better results.
To visually demonstrate the performance improvement of the DAN-YOLO model, this paper provides confusion matrix difference heatmaps. Specifically, DAN-YOLO and YOLOv7-Tiny were benchmarked on the BDD100K and DAWN datasets, producing a confusion matrix for each model on each dataset. By subtracting the values in YOLOv7-Tiny’s matrix from those in DAN-YOLO’s matrix, two difference heatmaps (one per dataset) were produced, as shown in Figure 10. In the figure, the blue areas indicate positive values and the orange areas indicate negative values. It can be observed that the improved DAN-YOLO achieves performance gains in each category while reducing confusion with the background and other categories. In Figure 10a, the probability of each category being correctly identified increases by 0.04, 0.05, 0.06, and 0.09, respectively, while the probability of misclassifying targets as background decreases by 0.04, 0.04, 0.06, and 0.09. Notably, in Figure 10b, the probability of correctly identifying motorcycle + bicycle increases by 0.3, while the probabilities of misclassifying motorcycle + bicycle and bus as background decrease by 0.22 and 0.19, respectively. Ultimately, these findings validate the effectiveness of the enhanced DAN-YOLO model, highlighting its exceptional performance in autonomous driving scenarios.
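A sketch of how such a difference heatmap can be generated is shown below. The confusion matrices are random placeholders rather than the paper’s actual data, and the plotting choices (colormap, value range, file name) are ours.

```python
import numpy as np
import matplotlib.pyplot as plt

classes = ["car", "person", "traffic sign", "traffic light", "background"]
cm_dan = np.random.rand(5, 5)    # placeholder for DAN-YOLO's normalized confusion matrix
cm_tiny = np.random.rand(5, 5)   # placeholder for YOLOv7-Tiny's normalized confusion matrix

diff = cm_dan - cm_tiny          # positive cells: DAN-YOLO assigns more probability mass here

fig, ax = plt.subplots()
im = ax.imshow(diff, cmap="coolwarm", vmin=-0.3, vmax=0.3)
ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes, rotation=45, ha="right")
ax.set_yticks(range(len(classes)))
ax.set_yticklabels(classes)
fig.colorbar(im, ax=ax, label="DAN-YOLO minus YOLOv7-Tiny")
plt.tight_layout()
plt.savefig("confusion_diff_heatmap.png")
```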
The paper presents a visualization of the detection results of YOLOv7-Tiny and DAN-YOLO on the test datasets in Figure 11, where the first column shows the detection results of YOLOv7-Tiny and the second column shows those of DAN-YOLO. As observed, when image clarity decreases and objects are occluded (driving in adverse conditions), the prediction confidence of YOLOv7-Tiny drops and it begins to produce numerous incorrect, false-positive, and missed detections. In contrast, following the structural enhancements, DAN-YOLO, with its improved feature extraction capability, demonstrates outstanding object localization precision, accuracy, and detection rates. It excels at identifying challenging samples that YOLOv7-Tiny may overlook, further underscoring the effectiveness of the proposed DAN-YOLO model.

5. Conclusions

This paper focuses on enhancing real-time object detection performance under adverse conditions by improving the internal structure of the YOLOv7-Tiny algorithm, leading to the development of the DAN-YOLO real-time vehicle object detector. Specifically, SPP was replaced with SPPF to create SPPFCSPC, and HDC was employed to optimize the ELAN-T and SPPFCSPC structures, thereby expanding the network’s RF and enhancing global information aggregation. The EMA mechanism was then integrated during the feature fusion process to comprehensively capture spatial and channel characteristics. Finally, the Wise-IoUv1 loss function was utilized instead of CIoU to improve localization precision. Comprehensive validation tests and analysis were carried out on the BDD100K and DAWN datasets, showing that DAN-YOLO outperformed YOLOv7-Tiny by 3.4% and 6.3% in mAP0.5, respectively, while achieving real-time detection at 142.86 FPS. Ablation studies confirmed the effectiveness of the methods employed in this research, and visual results demonstrated the model’s strong performance. Compared to other state-of-the-art detectors, DAN-YOLO achieved a favorable trade-off between detection accuracy and speed under adverse driving conditions, making it well-suited for autonomous driving applications. This paper provides insights into object detection technology for autonomous driving, although there is still potential for further development in practical applications.

Author Contributions

Conceptualization, X.Z.; Methodology, B.Y.; Software, H.L.; Validation, J.Y.; Writing—original draft, F.L.; Writing—review and editing, S.C. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Science and Technology Major Project (grant number GKAA23062024) and the Special Project for Central Government Guiding Local Science and Technology Development (Liuzhou City) (grant number 2022JRZ0102).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 17–30 June 2016; pp. 770–778.
2. Kanchana, B.; Peiris, R.; Perera, D.; Jayasinghe, D.; Kasthurirathna, D. Computer Vision for Autonomous Driving. In Proceedings of the 2021 3rd International Conference on Advancements in Computing (ICAC), Colombo, Sri Lanka, 9–11 December 2021; pp. 175–180.
3. Cahill, J.; Parsi, A.; Mullins, D.; Horgan, J.; Ward, E.; Eising, C.; Denny, P.; Deegan, B.; Glavin, M.; Jones, E. Exploring the Viability of Bypassing the Image Signal Processor for CNN-Based Object Detection in Autonomous Vehicles. IEEE Access 2023, 11, 42302–42313.
4. Lu, J.; Han, L.; Wei, Q.; Wang, X.; Dai, X.; Wang, F. Event-Triggered Deep Reinforcement Learning Using Parallel Control: A Case Study in Autonomous Driving. IEEE Trans. Intell. Veh. 2023, 8, 2821–2831.
5. Hassaballah, M.; Kenk, M.; Muhammad, K.; Minaee, S. Vehicle Detection and Tracking in Adverse Weather Using a Deep Learning Framework. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4230–4242.
6. Min, W.; Fan, M.; Guo, X.; Han, Q. A new approach to track multiple vehicles with the combination of robust detection and two classifiers. IEEE Trans. Intell. Transp. Syst. 2018, 19, 174–186.
7. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 17–30 June 2016; pp. 779–788.
10. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
11. Wang, C.; Bochkovskiy, A.; Liao, H.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37.
13. Wang, P.; Wang, X.; Liu, Y.; Song, J. Research on Road Object Detection Model Based on YOLOv4 of Autonomous Vehicle. IEEE Access 2024, 12, 8198–8206.
14. Cao, Y.; Li, C.; Peng, Y.; Ru, H. MCS-YOLO: A Multiscale Object Detection Method for Autonomous Driving Road Environment Recognition. IEEE Access 2023, 11, 22342–22354.
15. Mao, K.; Jin, R.; Ying, L.; Yao, X.; Dai, G.; Fang, K. SC-YOLO: Provide Application-Level Recognition and Perception Capabilities for Smart City Industrial Cyber-Physical System. IEEE Syst. J. 2023, 17, 5118–5129.
16. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740.
17. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 3379090.
18. Chen, J.; Mai, H.; Luo, L.; Chen, X.; Wu, K. Effective feature fusion network in BIFPN for small object detection. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 699–703.
19. Zeller, M.; Sandhu, V.; Mersch, B.; Behley, J.; Heidingsfeld, M.; Stachniss, C. Radar Instance Transformer: Reliable Moving Instance Segmentation in Sparse Radar Point Clouds. IEEE Trans. Robot. 2024, 40, 2357–2372.
20. Piroli, A.; Dallabetta, V.; Kopp, J.; Walessa, M.; Meissner, D.; Dietmayer, K. Towards Robust 3D Object Detection in Rainy Conditions. In Proceedings of the IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; pp. 3471–3477.
21. Turay, T.; Vladimirova, T. Toward Performing Image Classification and Object Detection with Convolutional Neural Networks in Autonomous Driving Systems: A Survey. IEEE Access 2022, 10, 14076–14119.
22. Wibowo, A.; Trilaksono, B.R.; Hidayat, E.M.I.; Munir, R. Object Detection in Dense and Mixed Traffic for Autonomous Vehicles With Modified YOLO. IEEE Access 2023, 11, 134866–134877.
23. Zhang, W.; Liu, Z.; Zhou, S.; Qi, W.; Wu, X.; Zhang, T.; Han, L. LS-YOLO: A Novel Model for Detecting Multiscale Landslides With Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4952–4965.
24. Deng, H.; Zhang, Y. FMR-YOLO: Infrared Ship Rotating Target Detection Based on Synthetic Fog and Multiscale Weighted Feature Fusion. IEEE Trans. Instrum. Meas. 2024, 73, 1–17.
25. Qian, Z.; Jing, W.; Lv, Y.; Zhang, W. Automatic Polyp Detection by Combining Conditional Generative Adversarial Network and Modified You-Only-Look-Once. IEEE Sens. J. 2022, 22, 10841–10849.
26. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
27. Zhao, Y.; Lu, J.; Li, Q.; Peng, B.; Han, J.; Huang, B. PAHD-YOLOv5: Parallel Attention and Hybrid Dilated Convolution for Autonomous Driving Object Detection. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 418–425.
28. Wang, Z.; Xia, F.; Zhang, C. FD_YOLOX: An improved YOLOX object detection algorithm based on dilated convolution. In Proceedings of the IEEE 18th Conference on Industrial Electronics and Applications (ICIEA), Ningbo, China, 18–22 August 2023; pp. 1263–1268.
29. Li, Y.; Lan, R.; Huang, H.; Zhou, H.; Liu, Z.; Pang, C.; Luo, X. Single Traffic Image Deraining via Similarity-Diversity Model. IEEE Trans. Intell. Transp. Syst. 2024, 25, 90–103.
30. Tian, D.; Lin, C.; Zhou, J.; Duan, X.; Cao, Y.; Zhao, D.; Cao, D. SA-YOLOv3: An Efficient and Accurate Object Detector Using Self-Attention Mechanism for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4099–4110.
31. Su, Q.; Wang, H.; Xie, M.; Song, Y.; Ma, S.; Li, B.; Yang, Y.; Wang, L. Real-time traffic cone detection for autonomous driving based on YOLOv4. IET Intell. Transp. Syst. 2022, 16, 1380–1390.
32. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
33. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460.
34. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051.
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
36. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
37. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. arXiv 2021, arXiv:2103.02907.
38. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149.
39. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
40. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2633–2642.
41. Kenk, M.; Hassaballah, M. DAWN: Vehicle Detection in Adverse Weather Nature Dataset. arXiv 2020, arXiv:2008.05402.
Figure 1. YOLOv7-Tiny network architecture diagram.
Figure 2. Illustration of the gridding effect. The numbers in the graph represent the frequency of utilization of input pixels by the convolution, and the heatmap reflects the extent to which input pixels are utilized.
Figure 3. The RF effect diagram for dilation factors r = [1, 2, 5]. The numbers in the graph represent the frequency of utilization of input pixels by the convolution, and the heatmap reflects the extent to which input pixels are utilized.
Figure 4. DAN-YOLO network architecture diagram.
Figure 5. Structure of the SPPF and SPPFCSPC.
Figure 6. Structure of the SPPFCSPC-E.
Figure 7. Structure of the DAN.
Figure 8. Efficient multi-scale attention mechanism.
Figure 9. Comparison of network model P–R curves before and after improvement. (a) Based on the BDD100K dataset; (b) based on the DAWN dataset.
Figure 10. Confusion matrix difference heatmap. (a) Based on the BDD100K test dataset; (b) based on the DAWN test dataset.
Figure 11. Detection results of YOLOv7-Tiny and DAN-YOLO. (a) Based on the BDD100K test dataset; (b) based on the DAWN test dataset.
Table 1. Relationship between multiple dilated convolutions and receptive field.

Group | Convolution | Dilation Coefficients of the Three Convolutions | Receptive Field
1 | 3 × 3 | 1-1-1 | 7
2 | 3 × 3 | 1-2-3 | 13
3 | 3 × 3 | 3-3-3 | 19
4 | 3 × 3 | 1-1-4 | 13
5 | 3 × 3 | 2-3-4 | 19
6 | 3 × 3 | 4-4-4 | 26
7 | 3 × 3 | 1-1-5 | 15
8 | 3 × 3 | 1-2-5 | 17
9 | 3 × 3 | 3-4-5 | 25
10 | 3 × 3 | 5-5-5 | 31
Table 2. The BDD100K dataset and the DAWN dataset.

Dataset | Image Size | Number of Images | Classes (Proportion of Annotations)
BDD100K [40] | 1280 × 720 | Train: 2841; Val: 1219; Test: 340 | car; person; traffic sign; traffic light
DAWN [41] | 900 × 562 | Train: 720; Val: 258; Test: 49 | person (6.07%); car (82.21%); motorcycle + bicycle (1.36%); bus (2.05%); truck (8.22%)
Table 3. Evaluation results on BDD100K benchmark.

Model | mAP0.5/% | Car/% | Person/% | Traffic Sign/% | Traffic Light/% | FPS | FLOPs/G
YOLOv5n | 41.2 | 62.1 | 35.7 | 33.3 | 33.8 | 172.41 | 4.1
YOLOv5s | 47.0 | 65.8 | 39.9 | 40.3 | 42.1 | 153.85 | 16.3
YOLOv5m | 49.8 | 67.9 | 42.4 | 43.5 | 45.4 | 81.96 | 50.3
YOLOv7-Tiny | 48.6 | 69.4 | 43.3 | 40.3 | 41.5 | 158.73 | 13.0
YOLOv8n | 46.7 | 67.8 | 41.6 | 37.9 | 39.5 | 126.58 | 8.1
YOLOv8s | 51.5 | 70.9 | 46.3 | 44.4 | 44.5 | 90.09 | 28.4
YOLOv9-T | 47.9 | 69.4 | 44.5 | 39.6 | 37.9 | 106.38 | 11.7
YOLOv10n | 43.3 | 65.5 | 38.4 | 34.3 | 35.1 | 172.41 | 6.7
YOLOv10s | 50.6 | 70.2 | 45.4 | 43.0 | 43.7 | 196.07 | 21.6
DAN-YOLO | 52.0 | 71.6 | 45.3 | 44.4 | 46.6 | 142.86 | 15.6
Table 4. Evaluation results on DAWN benchmark.

Model | mAP0.5/% | Person/% | Car/% | Motorcycle + Bicycle/% | Bus/% | Truck/% | FLOPs/G
YOLOv5n | 44.9 | 36.9 | 81.6 | 48.3 | 13.1 | 44.7 | 4.1
YOLOv5s | 49.8 | 40.3 | 83.5 | 49.0 | 26.7 | 49.5 | 16.3
YOLOv5m | 52.4 | 46.9 | 84.4 | 47.5 | 30.0 | 53.2 | 50.3
YOLOv7-Tiny | 46.5 | 42.0 | 84.4 | 33.3 | 22.5 | 50.4 | 13.0
YOLOv8n | 51.5 | 42.1 | 83.5 | 44.1 | 39.2 | 48.5 | 8.1
YOLOv8s | 54.9 | 45.8 | 84.6 | 51.5 | 39.3 | 53.3 | 28.4
YOLOv9-T | 41.9 | 33.9 | 76.8 | 39.9 | 16.1 | 42.7 | 11.7
YOLOv10n | 39.8 | 31.9 | 78.3 | 29.0 | 16.4 | 40.9 | 6.7
YOLOv10s | 42.8 | 35.9 | 83.6 | 21.2 | 28.6 | 44.7 | 21.6
DAN-YOLO | 52.8 | 41.3 | 83.9 | 64.2 | 24.7 | 49.8 | 15.6
Table 5. Evaluation results of DAN-YOLO and YOLOv7-Tiny on the DAWN benchmark.

Model | mAP0.5/% | mAP0.5:0.95/% | Person/% | Car/% | Motorcycle + Bicycle/% | Bus/% | Truck/% | Weights
YOLOv7-Tiny | 46.5 | 26.60 | 42.0 | 84.4 | 33.3 | 22.5 | 50.4 | ×
YOLOv7-Tiny | 51.2 | 30.90 | 56.0 | 88.2 | 31.6 | 18.4 | 61.7 | √
DAN-YOLO | 52.8 | 28.30 | 41.3 | 83.9 | 64.2 | 24.7 | 49.8 | ×
DAN-YOLO | 54.5 | 32.50 | 56.2 | 87.6 | 43.3 | 26.1 | 59.4 | √
Table 6. Ablation experiment.

Model | Precision/% | mAP0.5/% | Parameters/M | FLOPs/G | Person/% | Car/% | Motorcycle + Bicycle/% | Bus/% | Truck/%
YOLOv7-Tiny | 50.7 | 46.5 | 6.02 | 13.2 | 42.0 | 84.4 | 33.3 | 22.5 | 50.4
SPPFCSPC-E | 60.5 | 49.1 (2.6↑) | 7.79 | 14.6 | 44.7 | 84.5 | 40.6 | 26.3 | 49.5
DAN | 60.9 | 50.1 (3.6↑) | 6.31 | 13.8 | 41.0 | 83.6 | 43.8 | 24.1 | 54.8
EMA | 60.1 | 50.1 (3.6↑) | 6.04 | 13.8 | 46.3 | 84.3 | 42.9 | 21.6 | 55.6
Wise-IoUv1 | 70.6 | 49.4 (2.9↑) | 6.02 | 13.2 | 41.2 | 83.5 | 44.8 | 30.2 | 47.3
DAN-YOLO | 62.0 (11.3↑) | 52.8 (6.3↑) | 8.10 (1.98↑) | 15.8 | 41.3 | 83.9 | 64.2 | 24.7 | 49.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
