1. Introduction
In recent years, the global vehicle population has grown rapidly, increasing the complexity of traffic scenarios and posing significant challenges to driving safety. Intelligent driving is a key transportation innovation [1] that aims to enhance safety and efficiency. Among the various aspects of intelligent driving, vision-based vehicle-detection plays a crucial role in driving decision-making and automated control. However, multiple challenges are encountered in complex traffic environments, including variations in lighting conditions, diverse vehicle targets, and adverse weather conditions [2]. Moreover, the increasing complexities of current models result in high computational costs, and the intricate structures of vehicle-detection algorithms and their expensive hardware requirements hinder their use in edge and mobile-terminal devices [3]. Therefore, the creation of efficient and lightweight vehicle-detection algorithms is vital.
With continuous breakthroughs in artificial intelligence and computing power, target-detection algorithms have recently been the subject of rapid advancements that affect security monitoring [4,5], edge detection [6,7,8], autonomous driving [9,10,11], and pose-detection [12,13,14] domains. However, regardless of such advancements, detecting vehicles in low-light conditions poses a significant challenge. Night conditions create unique obstacles, including vehicle appearances being blurred, deformed, or partially obscured, which current night-detection technologies struggle to address efficiently. Contemporary deep learning networks are powerful, but require substantial computational and temporal resources. This presents significant obstacles for implementing these networks in real-time situations, particularly in nocturnal settings.
Notably, contemporary target-detection methods have begun shifting from traditional handcrafted models [15] to generalized deep-learning varieties. Architecturally, target detection has moved from two- to single-stage detection using multiscale feature fusion to support lightweight implementations. As such, remarkable detection performance has been achieved using publicly available datasets.
Two-stage convolutional neural networks (CNNs) divide the detection problem into candidate region generation [16] and classification activities. Representative models include the region-based CNN (R-CNN) [17], the “Fast” R-CNN [18], the “Faster” R-CNN [19], and the “Mask” R-CNN [20]. However, region candidate generation requires a significant amount of computational resources and is time-consuming, rendering such methods prohibitively costly for the real-time requirements of nighttime vehicle-detection scenarios.
In contrast, one-stage algorithms directly perform the regression and classification of candidate boxes. Typical versions include “You Only Look Once” (YOLO) models, for which there are currently eight versions [21,22,23,24,25,26,27,28]; the single-shot “multibox” detector (SSD); the deconvolutional SSD [29]; and the “Exceeding” YOLO (YOLOX) [30], among others. These algorithms complete feature sharing in a single training session, greatly improving speed. However, due to their inherent structural characteristics, one-stage detectors often suffer from class imbalances between positive and negative samples, which decrease their accuracy, particularly in challenging, low-light conditions.
Lightweight networks are being designed for portability and mobility, which are needed for driving target-detection. Representative models include MobileNet [31,32,33], EfficientNet [34], GhostNet [35], EfficientDet [36], and YOLOv4-tiny [37]. They achieve optimal speeds, but often suffer trade-offs in terms of stability and accuracy. Although these achievements have found extensive use across diverse fields, the distinct domain of nocturnal vehicle-detection presents challenges that remain unresolved. To address these challenges in nighttime vehicle-detection, this study introduces Light-YOLO and makes a number of significant contributions to science and safety:
We provide an efficient and accurate scale fusion attention module (SFAM) that aggregates features into a multilevel feature pyramid to enhance the accuracy of nighttime vehicle-detection. Our novel polarized self-attention-enhanced aggregated (PSEA) feature pyramid network (FPN) and its efficient pyramid-split attention module (PSA) are used to eliminate irrelevant contextual information, which helps overcome the efficiency–accuracy tradeoff.
We provide a powerful feature-enhancement module (FEM) that mitigates the information loss caused by feature channel reductions, resulting in strong, fused multiscale feature information for accurate vehicle-detection under varying lighting conditions in mobile or edge environments.
We leverage the lightweight EfficientNetv2 backbone network and add a “Swift” spatial pyramid pooling (SPP) layer to minimize computational resources and memory constraints and encourage portability and mobile use. The network operates efficiently while capturing comprehensive features, further prioritizing accuracy.
We provide a stable and accurate anchor box mechanism using a finely tuned K-means clustering algorithm to detect targets with precision, even in nighttime scenarios where vehicles would otherwise appear blurred, deformed, or partially obscured. We incorporate a focal-loss mechanism to overcome the imbalance between positive and negative samples and maximize target recall and model stability.
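The anchor-box contribution can be illustrated with a minimal sketch of K-means clustering over box widths and heights, using 1 − IoU as the distance. This is a common way to derive anchors for YOLO-style detectors and is shown here only as an assumption about the general approach; the function names, the seed, the iteration cap, and the sample box sizes are all hypothetical and are not taken from this study.

```python
import random


def iou_wh(box, anchor):
    """IoU of two boxes given as (width, height), both placed at the origin."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors; the nearest anchor has the highest IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # initialise from k distinct list entries
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:  # an empty cluster keeps its previous anchor
                new_anchors.append(anchors[i])
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_anchors.append((mean_w, mean_h))
        if new_anchors == anchors:  # assignments stopped changing
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Hypothetical box sizes in pixels (illustrative only).
boxes = [(12, 10), (14, 11), (13, 12), (40, 30), (42, 33), (38, 29),
         (120, 90), (115, 95), (125, 88)]
anchors = kmeans_anchors(boxes, k=3)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is why it is the usual choice for anchor estimation.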
2. Related Work
To foster computer vision advancements in a variety of scene conditions, several research teams have produced specialized datasets. Satzoda et al. [38] introduced a comprehensive annotated dataset at the Smart Safety Car Laboratory comprising over 5000 frames for evaluation and benchmarking and encompassing diverse and intricate traffic and lighting conditions. To enhance image quality and address the challenges posed by adverse lighting conditions, Lin et al. [39] introduced an innovative approach known as AugGAN, a generative adversarial network (GAN). Their approach facilitates domain transformations and seamless transitions from day to night while preserving the integrity of image objects. Ye et al. [40] pioneered an unsupervised domain adaptation framework, based on a transformer architecture, with a focus on nocturnal aerial object tracking. Their framework generates training patches via object discovery and employs transformer-based bridging-layer columns to facilitate domain alignment, thereby enhancing tracking performance in nighttime conditions via adversarial training.
Due to the heavy computational and temporal resources required by contemporary deep-learning networks like the ones mentioned, our research addresses the most pivotal areas of improvement: multiscale feature fusion and a lightweight backbone.
2.1. Multiscale Feature Fusion
Advanced feature maps capture and track ample global data, possess an expanded receptive scope, and exhibit enhanced semantic representations. Consequently, high-level versions are used for precise target localization, whereas low-level maps provide superior spatial resolution of edges, contours, and textures. An adept target-detection model will proficiently classify targets; thus, one needs an amalgamation of multiscale feature maps for effective and balanced performance. An FPN [41] is used to fuse multiscale features into integrated maps for retention and prioritization. Notably, there are several versions, including the path aggregation network (PANet) [42], which incorporates a bottom-up fusion path; the neural architecture search (NAS)-based FPN [43]; and the bidirectional FPN (BiFPN) [44]. However, these are not effective enough for our task, because their feature channel dimensionality reductions result in feature-map information losses, and their maps accumulate extraneous contextual data unrelated to the detection task. Hence, both computational efficiency and target recognition fall short of our targeting and recognition requirements.
Figure 1 provides a high-level illustration of FPN, PANet, and BiFPN functionality. The popular EfficientNet-YOLO network incorporates the PANet structure, and EfficientDet (built on EfficientNet) leverages a BiFPN to flexibly control network size by searching for and reusing the most effective FPN blocks.
Li et al. [45] introduced the “multi-attention” FPN to address noise and background interference in vehicle-target-detection tasks via the fusion of attention information within an FPN. Gu et al. [46] presented an improved FPN for small-target vehicle detection that seamlessly integrates deeper and shallower semantic information without increasing computational costs, by using cross-scale connecting lines. Although the FPN’s multiscale feature fusion has significantly advanced object detection in recent years, feature losses, inadequate small-target handling, and resource impracticalities persist.
2.2. Backbone
A seemingly straightforward deep-learning approach to providing onboard, real-time vehicle-detection would use a lightweight backbone. Hence, numerous researchers have investigated ways to apply them to general vehicle-detection. For example, Chen et al. [47] proposed an improved SSD for rapid detection using MobileNetv2 as the backbone, which approached real-time performance. With approximately 5/11 of the original model’s complexity, inference speeds were improved, achieving a single inference time of 73 ms while sustaining impressive accuracy. To address the computational constraints of edge devices in autonomous driving scenarios, Chen et al. [48] introduced a domain-specific lightweight network that employs a DenseNet201 backbone [49] that combines its best features with YOLO, MobileNet, and online capabilities. By leveraging group convolutions and replacing some of the dense blocks with alternating blocks, model embedding was made possible while maintaining excellent speed and accuracy.
For Light-YOLO, we sought to determine the most efficient backbone. Notably, EfficientNetv2 was found to outperform MobileNet, EfficientNet, GhostNet, and others in terms of recognition accuracy and speed. Nevertheless, extant models with capabilities similar to those which we require have not been sufficiently validated under nighttime and adverse conditions. This validation is crucial to robustness and dependability. EfficientNetv2 employs NAS to determine the types of convolutional operations needed (i.e., MBConv and fused-MBConv) and calculates layer numbers, kernel sizes, and expansion ratios to maximize training speed with minimal overhead.
3. Methodology
Light-YOLO applies multiscale feature fusion with the lightweight EfficientNetv2 backbone using a stable and accurate anchor-box mechanism to strike a balance among efficiency, stability, and detection accuracy. In this section, we provide a detailed overview of its framework, algorithm, architecture, and full operation.
For the backbone design, we replaced the standard MBConv with a fused-MBConv to improve training speeds while reducing parameter increments during the early stages of model operation.
Figure 2 illustrates a comparison of these structures.
As illustrated in Figure 3, the overall Light-YOLO architecture comprises the EfficientNetv2 backbone, the PSEA-FPN, and a prediction layer. First, image standardization and background enrichment operations take place, including resizing and data augmentation. The backbone is used to extract image features at different scales, incorporating the Swift-SPP to enhance feature extraction. Subsequently, the PSEA-FPN fuses semantic and positional features. Finally, the prediction module determines the category of the target.
3.1. PSEA-FPN
As illustrated in Figure 4, feature map fusion is simplified to improve FPN efficiency. Notably, the PSEA-FPN structure comprises crucial PSA, FEM, and SFAM components.
3.1.1. Feature Fusion
As depicted in Figure 4, like the conventional PANet, our PSEA-FPN comprises top-down and bottom-up branches. We denote the output of the backbone as
, and
is generated by the bottom-up path inside the FPN. To enhance detection efficiency, we removed node
, as it has only one input edge, rendering its contribution negligible. Feature map
is generated from
through the Swift-SPP and PSA and is fused with features from lower levels. In the top-down path, following each fusion action, the feature map expands its receptive field through the FEM and is further fused through the bottom-up path to ultimately generate
. Finally, via SFAM fusion, feature maps
are created with dimensions of 80 × 80, 40 × 40, and 20 × 20, respectively, making them suitable for predictions at three scales.
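The three output sizes can be related to the input resolution with a short calculation. The sketch below assumes a square 640 × 640 input and downsampling strides of 8, 16, and 32; neither value is stated in this section, so both are assumptions chosen because they reproduce the 80 × 80, 40 × 40, and 20 × 20 grids.

```python
def grid_sizes(input_size, strides=(8, 16, 32)):
    """Side length of each prediction grid for a square input image."""
    return [input_size // stride for stride in strides]


# 640 is an assumed input resolution, not a value reported in the text.
sizes = grid_sizes(640)
```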
3.1.2. PSA
The PSA [38] ensures that the network focuses on target objects while disregarding redundant background information. Attention mechanisms are broadly categorized into channel (e.g., squeeze-and-excitation [50] and efficient channel [51]) and spatial (self-attention [52]) types. Dual-attention mechanisms have also improved recently, with notable examples including the convolutional block attention module (CBAM) [53] and the dual attention module [54].
For lightweight vehicle-detection at night, we employ PSA, due to its more intricate attention mechanism, which is based on dual attention [55]. Notably, it effectively models long-range dependencies across high-resolution inputs and outputs with relatively low computational overhead. The structural diagram is shown in Figure 5.
The PSA is divided into channel and spatial branches, and the weight calculation formulas for the channel and spatial branches are presented as follows:
respectively, which are designed to maintain a high resolution.
Simultaneously, the input tensor is fully folded along the corresponding dimensions to mitigate the information loss caused by dimensionality reductions. Within the attention pathway module, the SoftMax function is applied to the smallest tensor to expand its attention scope, followed by dynamic mapping using a sigmoid function to enhance the preserved information.
Based on the results of these two branches, parallel and serial fusion approaches are formulated as follows:
From the perspective of fusion methods, PSA is similar to CBAM, differing primarily in how they combine the results from the channel and spatial branches (i.e., parallel or in series). However, CBAM often employs fully connected and convolutional layers to obtain attention weights, which are not as effective for retaining knowledge. In contrast, PSA utilizes a self-attention network to derive attention weights and applies dimensionality reductions to certain maps to achieve effective long-range modeling without increasing complexity.
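The difference between parallel and serial fusion can be shown with a toy sketch. The attention functions below (a sigmoid of a mean) are deliberately simple placeholders and are not the PSA weight formulas; only the order in which the two branches are combined is meant to be illustrative. All function names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def channel_weights(x):
    """Placeholder channel attention: one weight per channel."""
    return [sigmoid(sum(ch) / len(ch)) for ch in x]


def spatial_weights(x):
    """Placeholder spatial attention: one weight per position."""
    n_pos = len(x[0])
    return [sigmoid(sum(ch[p] for ch in x) / len(x)) for p in range(n_pos)]


def apply_channel(x):
    w = channel_weights(x)
    return [[w[c] * v for v in ch] for c, ch in enumerate(x)]


def apply_spatial(x):
    w = spatial_weights(x)
    return [[w[p] * v for p, v in enumerate(ch)] for ch in x]


def fuse_parallel(x):
    """Both branches see the same input; their outputs are summed."""
    a, b = apply_channel(x), apply_spatial(x)
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_serial(x):
    """The spatial branch operates on the output of the channel branch."""
    return apply_spatial(apply_channel(x))


x = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]  # 2 channels, 3 positions
y_par = fuse_parallel(x)
y_ser = fuse_serial(x)
```

Both variants preserve the input shape; they differ only in whether the spatial weights are computed from the raw input or from the channel-reweighted features.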
3.1.3. FEM
The FEM is a novel module introduced to capture receptive fields from feature maps of different scales. Its structure, illustrated in Figure 6, consists of a multibranch convolutional layer and a multibranch pooling layer. The convolutional layer is employed to capture receptive fields of varying sizes, and the pooling layer integrates information from the receptive fields of the three branches.
The multibranch convolutional layer is composed of dilated convolution, batch normalization, and rectified linear unit (ReLU) activation functions. Each branch in the multibranch convolutional layer employs dilated convolutions with the same kernel size, 3 × 3. However, they differ in their dilation rates, d, which we set to 1, 3, and 5 in this study. Doing so expands the receptive field and captures more contextual information, which is expressed as follows:
where
and
represent the convolution kernel size and dilation rate, respectively, and
d denotes the convolution stride.
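As a point of reference, the span covered by a single dilated kernel follows the standard relation k + (k − 1)(d − 1), where k is the kernel size and d the dilation rate. The sketch below applies this general relation to the three rates used here; it is not a transcription of the expression above, and the function name is hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by a kernel whose taps are spaced `dilation` apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# 3 x 3 kernels with the three dilation rates used in the FEM branches.
spans = [effective_kernel(3, d) for d in (1, 3, 5)]
```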
The multibranch pooling layer combines information from different parallel branches. During training, we employ an averaging operation to balance the contributions of the parallel branches. The equation is as follows:
where
represents the output of the multibranch pooling layer, and
B represents the number of parallel branches. We set
B to three in this case.
The FEM employs dilated convolutions to adaptively learn different receptive fields from various feature maps, depending on the different vehicle features detected at night, thereby enhancing the accuracy of multiscale object detection.
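The branch-and-average structure can be sketched in one dimension. This is a simplified illustration under several assumptions: it works on a 1-D signal rather than a 2-D feature map, it omits batch normalization, it requires odd kernel lengths, and the kernel weights are hypothetical rather than learned.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1-D dilated cross-correlation with zero padding (odd kernels)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def fem_1d(signal, kernels, dilations=(1, 3, 5)):
    """Run one dilated branch per rate, then average the B branch outputs."""
    branches = [
        relu(dilated_conv1d(signal, kernel, d))
        for kernel, d in zip(kernels, dilations)
    ]
    count = len(branches)
    return [sum(values) / count for values in zip(*branches)]


signal = [0.0] * 5 + [1.0] + [0.0] * 5   # unit impulse at index 5
kernels = [[1.0, 1.0, 1.0]] * 3          # hypothetical weights
out = fem_1d(signal, kernels)
```

Feeding an impulse makes the effect of each rate visible: the d = 1 branch responds at the neighbouring positions, whereas the d = 3 and d = 5 branches respond three and five positions away.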
3.1.4. SFAM
The goal of the SFAM is to aggregate multi-level multiscale features into a multi-level feature pyramid. The first step involves a channel-wise summation of features of the same scale, resulting in an aggregated channel representation denoted as . Here, represents the feature maps at different scales, denoted as . Thus, each scale in the aggregated feature pyramid contains features of the same scale from different layers.
The second step introduces a channel-wise attention mechanism to excite features, focusing on channels that provide the greatest detection assistance. This leverages SENet, where the squeeze stage channel information is generated using global pooling. To fully capture channel dependencies, the subsequent excitation step employs two fully connected layers to learn the attention mechanism.
Among these, σ represents the ReLU operation, δ represents the sigmoid,
,
, and
r is the reduction ratio, where
r = 16 in our experiment. The final output is obtained by reweighting input
X using activation
s, as follows:
where each element in
is rescaled for enhancement or weakening. A summary structural diagram of the SFAM is shown in
Figure 7.
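The squeeze, excitation, and rescaling steps can be sketched without any framework. The example assumes the usual SENet ordering (ReLU after the first fully connected layer, sigmoid after the second) and bias-free layers; how those operations map onto the σ and δ symbols above is an assumption. The function names are hypothetical, and a real layer would learn `w1` and `w2`.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in weights]


def se_reweight(x, w1, w2):
    """x: list of C channels, each a flat list of H*W values.

    w1 has shape (C/r, C) and w2 has shape (C, C/r).
    """
    z = [sum(ch) / len(ch) for ch in x]                      # squeeze: global average pool
    hidden = [max(0.0, h) for h in matvec(w1, z)]            # first FC layer + ReLU
    s = [1.0 / (1.0 + math.exp(-v)) for v in matvec(w2, hidden)]  # second FC layer + sigmoid
    scaled = [[s[c] * v for v in ch] for c, ch in enumerate(x)]   # rescale each channel
    return scaled, s


# Tiny illustrative case: C = 1 and zero weights, so the activation is 0.5.
scaled, s = se_reweight([[2.0, 4.0]], [[0.0]], [[0.0]])
```

Because every activation lies strictly between 0 and 1, the rescaling can only attenuate channels relative to each other, which is how less useful channels are weakened.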
3.2. Swift-SPP
Spatial pyramids employ pooling layers of different kernel sizes to capture receptive fields of various scales, and subsequently fuse features to enrich the information in the feature maps. Considering the real-time requirements and the need for high detection speed in vehicle detection tasks, we applied the Swift-SPP to improve inference speeds (Figure 8).
Swift-SPP employs a multibranch parallel structure, eliminating the repetitive operations of the contemporary SPP and significantly improving operational speeds. It also replaces the pooling structure with a convolutional structure with a kernel size of 5 × 5 and a stride of one. The three parallel convolution operations have receptive fields equivalent to those of convolutions with sizes of 5 × 5, 9 × 9, and 13 × 13. This design not only enhances the network’s detection speed, but it also enriches the information in the feature maps, thereby strengthening the network’s feature-extraction capability.
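The 5 × 5, 9 × 9, and 13 × 13 equivalents are consistent with the standard receptive-field relation for chained stride-1 operations, 1 + n(k − 1), where n is the number of chained layers and k the kernel size. The sketch below only checks that arithmetic; whether Swift-SPP obtains the larger fields by chaining is an assumption, since the text does not spell out the internal wiring.

```python
def stacked_receptive_field(kernel_size, layers):
    """Receptive field of `layers` chained stride-1 operations of one kernel size."""
    return 1 + layers * (kernel_size - 1)


# One, two, and three chained 5 x 5 stride-1 operations.
fields = [stacked_receptive_field(5, n) for n in (1, 2, 3)]
```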
3.3. Sample Equalization (SE)
Class imbalances consistently pose challenges to object-detection accuracy. Single-stage detectors are known for their speed, but they often suffer from lower accuracy, due to the fundamental issue of class imbalance. Employing an anchor-based mechanism results in the generation of thousands of candidate boxes from a single feature map, with only a small fraction of these containing potential targets (i.e., positive samples), whereas the rest are considered negative samples. Negative samples are usually easy to distinguish and do not contribute significantly to the training process. However, when there are too many negative samples and they dominate the loss function, the training process tends to focus excessively on them, overshadowing the positive ones and leading to a substantial loss in accuracy.
To address these issues, we employ a novel focal-loss function derived from the standard cross-entropy loss and modified to address our research problem. The specific form is as follows:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{9}$$
where $y$ is 1 or −1, representing the foreground and background, respectively. The value of $p$ lies in (0, 1) and reflects the probability of the model predicting a positive outcome. Function $p_t$ is defined as
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{10}$$
By combining Equations (9) and (10), a simplified formula can be obtained as follows:
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \tag{11}$$
To solve the problem of imbalanced positive and negative samples, modulation and weight factors are introduced into the cross-entropy loss structure to help distinguish samples. The focal loss formula is as follows:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{12}$$
where modulation factor $(1 - p_t)^{\gamma}$ is used to reduce the loss contribution of easily distinguishable samples (i.e., foreground or background). The larger the $p_t$, the easier the sample is to distinguish and the smaller the modulation factor. Weight factor $\alpha_t$ is used to adjust the proportion between positive and negative sample losses, where $\alpha$ applies to the foreground category, and $1 - \alpha$ applies to the corresponding background category.
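A minimal per-sample sketch of the loss follows, assuming the widely used form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) with labels of 1 (foreground) and −1 (background). The values α = 0.25 and γ = 2 are illustrative defaults and are not hyperparameters reported in this study.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction; y is 1 or -1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction; alpha and gamma are assumed values."""
    p_t = p if y == 1 else 1.0 - p                 # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha     # class-dependent weight factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = focal_loss(0.05, -1)  # confident and correct: strongly down-weighted
hard_foreground = focal_loss(0.10, 1)   # nearly missed vehicle: keeps a large loss
```

With γ = 0 and a weight of 1 the expression reduces to plain cross-entropy, which makes the role of the modulation factor easy to check in isolation.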
4. Experiments
4.1. Dataset
Our dataset consisted of 10,000 images obtained by extracting 6000 original frames from the classic and diverse BDD100K large-scale autonomous driving video dataset, focusing on nighttime driving scenes, and by capturing nighttime dashcam video frames, which resulted in 4000 more images. As shown in Figure 9, our self-assembled dataset contains a wide variety of nighttime driving scenarios, and its size promises good adaptability and robustness.
Our experimental dataset meets the requirements for diversity in terms of scenes, shapes, and lighting conditions, and it is suitable for training deep-learning networks. During model training, we employed data augmentation strategies to expand the dataset. The attributes are visualized in Figure 10. In Figure 10a, it can be seen that the dataset contains more than 14,000 labels. Figure 10b displays the central coordinate positions of the objects, with darker colors indicating higher label concentrations at those positions. Figure 10c illustrates the sizes of the objects, revealing that our dataset contains a good variety.
We partitioned the dataset into training and validation sets in an 80–20% ratio. Using a Python 3.8 environment, we employed the open-source LabelImg 1.8.6 tool for the manual annotation of regions encompassing target objects. Consequently, we produced the corresponding positional information files in an .XML format. For our experiment, we designated the objects as “cars.” Subsequently, we standardized the .XML positional files and transformed them into .TXT files to facilitate nocturnal vehicle labeling.
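The XML-to-TXT step can be sketched as follows, assuming the .XML files follow the Pascal VOC layout that LabelImg writes and that the .TXT files use the YOLO convention of a class index followed by a normalized box center, width, and height. The function name, the class name string, and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET


def voc_to_yolo(xml_text, class_names=("car",)):
    """Convert one Pascal VOC-style annotation into YOLO label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in class_names:   # skip labels outside the class list
            continue
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Normalize the box center and size by the image dimensions.
        cx = (xmin + xmax) / 2.0 / img_w
        cy = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_names.index(name)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical annotation for a single vehicle in a 1280 x 720 frame.
sample = """<annotation>
  <size><width>1280</width><height>720</height></size>
  <object><name>car</name>
    <bndbox><xmin>320</xmin><ymin>180</ymin><xmax>640</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""
labels = voc_to_yolo(sample)
```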
4.2. Experimental Environment
Our experiments were conducted using Python 3.8 and the PyTorch11.0 framework. The development platform was a 64-bit Linux system, and the processor was an Intel(R) Core(TM) i9-11900K CPU. To enhance training efficiency, an NVIDIA GeForce RTX 3090 GPU with CUDA 11.3 and CuDNN 10.0 was employed for graphics acceleration, facilitated through BAIDU AutoDL cloud server resources. Additionally, stochastic gradient descent was used to control the loss reduction, the batch size was set to 128, the initial learning rate was 0.01, and we used 200 training epochs.
4.3. Evaluation Criterion
We used standard precision (P), recall (R), average precision (AP), mean AP (mAP), number of parameters (Params), and speed (fps) to assess performance and accuracy. Higher values of P and R indicate higher detection accuracy, and mAP measures the overall model performance, which reflects the efficacy of training. Compared with P and R, mAP provides a more comprehensive estimation of algorithmic performance. In this experiment, we used mAP@0.5 and mAP@0.5:0.95 to provide a comprehensive evaluation.
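The sketch below shows how P and R can be computed at a single intersection-over-union (IoU) threshold, which is the building block behind both mAP variants: mAP@0.5 fixes the threshold at 0.5, whereas mAP@0.5:0.95 averages over thresholds from 0.5 to 0.95 in steps of 0.05. The greedy matching rule, the function names, and the sample boxes are assumptions for illustration and do not describe the evaluation code used in the experiments.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def precision_recall(preds, truths, iou_thr=0.5):
    """preds: list of (score, box); truths: list of boxes."""
    matched = set()
    tp = 0
    # Visit predictions from most to least confident.
    for _, box in sorted(preds, key=lambda item: item[0], reverse=True):
        best, best_iou = None, iou_thr
        for i, gt in enumerate(truths):
            if i in matched:          # each ground truth matches at most once
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(truths) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (21, 21, 31, 31))]
p, r = precision_recall(preds, truths)
```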
Because mAP reflects only the model accuracy, we also tracked the number of model parameters required and the inference speeds achieved. The fps measure reflects the algorithm’s execution speed:
$$\mathrm{fps} = \frac{\mathit{frameNum}}{\mathit{elapsedTime}}$$
where elapsedTime represents a fixed period, and frameNum represents the number of frames processed within that period.
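A minimal sketch of the speed measurement, assuming fps is the number of processed frames divided by the elapsed time in seconds. The helper that times a batch of frames is a hypothetical convenience; it fixes the frame count and measures the time, which yields the same ratio as counting frames over a fixed period.

```python
import time


def fps(frame_num, elapsed_time):
    """Frames per second from a frame count and an elapsed time in seconds."""
    return frame_num / elapsed_time


def measure_fps(process_frame, frames):
    """Time `process_frame` over all frames and return the resulting fps."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed_time = time.perf_counter() - start
    return fps(len(frames), elapsed_time)


# A no-op stands in for the detector here, so the number itself is meaningless.
dummy_rate = measure_fps(lambda frame: None, list(range(1000)))
```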
4.4. Ablation Experiment
4.4.1. Experimental Benchmark
Because our model is an improvement of the EfficientNetv2-YOLOv5, the latter serves as the baseline for our ablation experiments. Its metrics are shown in Figure 11. In Figure 11, with an increase in the number of training epochs, the values of the mAP@0.5 and mAP@0.5:0.95 scores gradually rise, whereas the loss values decrease. Model training attained relative stability after ~100 epochs, and the final training round consisted of 200 epochs, resulting in a mAP@0.5 score of 90.28%, a mAP@0.5:0.95 score of 42.07%, a box_loss of 0.0397, and an obj_loss of 0.0623. Thus, it is clear that there is room for improvement with EfficientNetv2-YOLOv5.
4.4.2. PSEA-FPN Internal Structure Validity Verification
The PSEA-FPN is a novel feature pyramid structure that includes several internal improvements. The model incorporates four small improvements that are sequentially removed or added for ablation testing: the F3 node, PSA, FEM, and SFAM (Table 1).
As shown in Table 1, Experiment 0 represents the full EfficientNetv2-YOLOv5 baseline. In Experiment 01, which involved deleting the F3 node, there was a slight decrease in the mAP@0.5 and mAP@0.5:0.95 scores, but there was a noticeable increase in detection speed (i.e., from 84.74 to 115.51 fps). In Experiment 06, where the PSA was added to Experiment 01, there were increases in the mAP@0.5 and mAP@0.5:0.95 scores of 1.30 and 0.62%, respectively. Experiment 02, which included the PSA module, showed improvements in both mAP@0.5 and mAP@0.5:0.95 scores, while also reducing the parameter count. In Experiment 04, the SFAM module was introduced, resulting in a significant improvement in both mAP@0.5 and mAP@0.5:0.95 scores, although there was a slight decrease in fps. Experiment 05 is an extension of Experiment 04, with the addition of the PSA module. The results demonstrate an increase in both the mAP@0.5 and the mAP@0.5:0.95 scores, while effectively reducing parameters. Experiments 08 and 07 improved upon Experiments 06 and 02 by utilizing the FEM method, which resulted in increased mAP@0.5 and mAP@0.5:0.95 scores. Finally, Experiment 09 incorporated all improvements, including the PSEA-FPN and SFAM, atop Experiment 08. For Experiment 09, the mAP@0.5 and mAP@0.5:0.95 scores were 92.72 and 44.76%, respectively, making them 2.44 and 2.69% higher than the baseline. The detection speed also reached 102.31 fps, an increase of 17.57 fps over the baseline.
The visualization results of the internal improvements given by the PSEA-FPN are shown in Figure 12a. Although adding the PSA and FEM slightly increased the model parameter count, they significantly enhanced recognition accuracy. In summary, combining these four improvements not only improved the detection speed, but also significantly enhanced detection accuracy.
4.4.3. Validation of the Effectiveness of Light-YOLO Improvement Points
PSEA-FPN represented the first improvement, Swift-SPP was the second, and SE was the third. The results are listed in Table 2. Experiment 0 used the EfficientNetv2-YOLOv5 baseline. Experiments 1–3 introduced single improvement factors to the baseline to demonstrate the effectiveness of each addition. The results of Experiments 1–3 show that each improvement point (i.e., PSEA-FPN, Swift-SPP, and SE, in that order) led to improvements in the mAP score. The most significant improvement was provided by the PSEA-FPN, which increased the mAP@0.5 from 90.28 to 92.72%, a gain of 2.44%. It also reduced the number of model parameters from 5.803 M to 4.226 M and improved the detection speed from 84.74 to 102.31 fps. Swift-SPP, with its parallel structure, reduced model complexity, resulting in a 13.08 fps increase in detection speed compared with the baseline. It also achieved a mAP@0.5 score of 90.91%, which is a 0.63% improvement. SE improved the mAP@0.5 score by 1.94% while maintaining the strong detection speed.
Experiments 4–6 were combinations of improvement points, still using EfficientNetv2-YOLOv5 as the baseline. The results show that Experiment 4, which added the Swift-SPP strategy to Experiment 3, increased the mAP@0.5 score by 2.41% over the baseline, while achieving a certain degree of lightweight performance and fps increase. Experiment 5, a combination of the PSEA-FPN and Swift-SPP, achieved a mAP score of 93.35%, a 3.07% improvement over the baseline, with a 19.84 fps increase in detection speed. Experiment 6, which added the SE to Experiment 5, achieved a mAP@0.5 score of 94.31%, a 4.03% improvement, with a detection speed of 102.04 fps, an increase of 17.30 fps over the baseline.
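The gains quoted in this subsection are absolute differences: percentage points for mAP and frames per second for speed. The short sketch below recomputes two of them from the reported baseline and Experiment 6 values and contrasts them with a relative change; the helper names are hypothetical.

```python
def point_gain(new, old):
    """Absolute difference (percentage points for mAP, fps for speed)."""
    return new - old


def relative_gain(new, old):
    """Change relative to the old value, in percent."""
    return (new - old) / old * 100.0


# Baseline versus Experiment 6, using the values reported in the text.
map_points = point_gain(94.31, 90.28)        # mAP@0.5, percentage points
fps_points = point_gain(102.04, 84.74)       # detection speed, fps
fps_relative = relative_gain(102.04, 84.74)  # detection speed, percent
```

Rounded to two decimals, the relative speed gain comes to 20.42%, which matches the speed improvement quoted in the conclusions.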
Visualizations of the results of the combination experiments are shown in Figure 12c,d, where the changes in detection accuracy (mAP@0.5 and mAP@0.5:0.95) are clearly and incrementally demonstrated. These results verify the effectiveness of the Light-YOLO model and highlight its improvements over the baseline.
4.5. Comparison with Other Classic Algorithms
Most extant studies on vehicle detection do not use lightweight models; hence, comparisons were made with the most prominent lightweight networks (i.e., YOLOX_s, YOLOv7-tiny, EfficientDet-D1, and varying combinations of MobileNetv3-YOLOv5, GhostNet-YOLOV5, and MobileNetv2-SSD). The results are listed in Table 3.
EfficientDet comprises a series of scalable and efficient object detectors, ranging from EfficientDet-D1 to EfficientDet-D7. The models gradually increase in accuracy as their real-time performance decreases. The fastest, EfficientDet-D1, was selected for comparison with Light-YOLO. From Table 3, it is evident that Light-YOLO has a significant advantage in terms of detection accuracy, with mAP@0.5 and mAP@0.5:0.95 scores that are higher by 4.99 and 4.55%, respectively. Although our model complexity was slightly higher, Light-YOLO achieved significantly higher detection speeds.
Although YOLOv5–YOLOv7 (v7 being the most advanced) are widely used for object detection, they are not considered lightweight algorithms. However, YOLOX_s, which is based on YOLOv5s, is. The results show that Light-YOLO outperformed YOLOX_s in terms of detection speed and accuracy. YOLOv7 has a simplified version (i.e., YOLOv7-tiny), which has a faster detection speed than Light-YOLO; however, it lagged behind in mAP@0.5 and mAP@0.5:0.95 scores by 8.29 and 5.91%, respectively.
The results of the variant combinations show that GhostNet-YOLOV5 had lower mAP and overall accuracy than Light-YOLO, in addition to higher model complexity. MobileNetv3-YOLOv5 and MobileNetv2-SSD had accuracy levels similar to Light-YOLO, but they lagged behind in recognition speed, by 8.49 fps.
The visualization results for mAP@0.5, mAP@0.5:0.95, Params, and fps are presented in Figure 13. Figure 14 shows a comparison between Light-YOLO and the other lightweight algorithms in terms of recognition speed and accuracy. Overall, Light-YOLO demonstrated a significant competitive advantage in all comprehensive metrics.
4.6. Comparison of Effects
To further validate the applicability of Light-YOLO in nighttime scenarios, a set of images was selected from our dataset for before-and-after comparisons of performance. As shown in Figure 15, with Nighttime Image I, three vehicles were detected. Both the EfficientNetv2-YOLOv5 and Light-YOLO models recognized all vehicle objects, but Light-YOLO showed a higher confidence level. In Nighttime Image II, EfficientNetv2-YOLOv5 produced false positives, due to lighting and shadow interference. In contrast, Light-YOLO demonstrated good robustness and higher confidence. For Nighttime Images III and IV, it becomes apparent that, in densely packed nighttime scenes, EfficientNetv2-YOLOv5 struggles with accurate vehicle localization, compared with Light-YOLO, which performs better.
The detection results shown in Figure 15 illustrate the effectiveness of the improvements made in this study.
5. Conclusions
This paper has proposed the Light-YOLO lightweight target-detection algorithm, designed for nocturnal vehicle-detection and mobility. This model was built on the EfficientNetv2-YOLOv5s baseline and incorporates a multitude of enhancements, including PSEA-FPN, Swift-SPP, and focal loss. The empirical findings demonstrate that Light-YOLO yields substantial performance improvements in nocturnal vehicle-detection tasks over the benchmark. The ultimate mAP score increased from 90.28 to 94.31%, while the parameter count was reduced by 44.37% and the recognition speed was increased by 20.42%. Additionally, internal validation experiments show that the incorporation of the four improvements within PSEA-FPN boosts mAP by 1.84%, while increasing the frame rate by 34.26%, supporting the efficacy of the proposed internal structure. Comparisons between different lightweight networks illustrate that Light-YOLO outperforms YOLOv7-tiny in mAP by 10.14% while using 46.35% fewer parameters, and it outperforms GhostNet-YOLOV5, with a 4% increase in mAP and a 3.82% improvement in frame rate. These outcomes underscore the capacity of Light-YOLO to considerably increase the precision of nocturnal vehicle-detection while preserving real-time operation, thus promoting its practical application potential.
In future investigations, we will evaluate the efficacy of this algorithm for detecting smaller targets during nocturnal conditions. Furthermore, the lightweight attributes of the model must be enhanced. Subsequently, we will embark on investigations of other lightweight models, seeking to enhance their suitability for real-time vehicle-target-detection in nocturnal settings.