1. Introduction
With the rapid development of intelligent transportation systems, research on nighttime pedestrian trajectory tracking has become increasingly important [
1]. During nighttime driving, poor lighting conditions pose significant challenges to pedestrian-detection and -tracking [
2], including multi-task detection [
3], pedestrian occlusion [
4], detection accuracy in complex environments [
5], and real-time performance. The goal of this study is to enhance the accuracy and real-time performance of nighttime pedestrian-detection and -tracking by improving existing multi-object detection and -tracking algorithms. We propose a method that combines an enhanced YOLOP detection algorithm with an improved DeepSORT tracking algorithm to address the challenges of pedestrian-detection and -tracking in nighttime environments.
Currently, commonly used object detection algorithms include the R-CNN series, YOLO series, and SSD series. Algorithms based on R-CNN have notably improved detection accuracy and stability, especially in pedestrian-detection. For instance, Akshatha, K. R., et al. [
6] proposed a pedestrian-detection algorithm based on R-CNN, which detects effectively in complex environments. However, R-CNN-based algorithms face issues such as redundant computation and high computational cost. To address these, Fast R-CNN and Faster R-CNN were developed. Avola, Danilo, et al. [
7] designed MS-CNN, which integrates shallow and deep features to enhance discriminability, though it struggles with uneven lighting and diverse pedestrian colors. Liu, YJ, et al. [
8] tackled this with a method combining ACE and Faster R-CNN, enhancing detection accuracy in color-complex scenes. For occlusion scenarios, Zhang et al. [
9] incorporated a cross-channel attention mechanism into Faster R-CNN, improving positioning accuracy and reducing false detections. The SSD algorithm used by Chen, Z., et al. [
10], which eliminates the candidate box extraction step, significantly enhances detection efficiency but has lower accuracy. To improve this, Ni, Y., et al. [
11] used a residual network with stronger representational capabilities as the SSD base, resulting in better real-time performance and robustness, though challenges with occluded and small targets persist. Liu et al. [
12] proposed an improved YOLOv5-based algorithm, while Kumar, Sunil, et al. [
13] enhanced YOLOv5 with DeepSort for multi-target-detection and -tracking, improving occluded target-recognition. For small target-detection, Li et al. [
14] introduced the YOLO-ACN algorithm, adding attention mechanisms and a CIoU loss function to effectively extract detailed features and address small-target pedestrian-detection.
Pedestrian-tracking algorithms are used to continuously monitor targets in video sequences. Traditional tracking methods, such as nearest neighbor algorithms [
15], multiple hypothesis tracking [
16], and joint data association [
17], often struggle with sustainability in complex scenes due to the large number of targets [
18]. To address this, Chen, Xuewen, et al. [
19] employed the DeepSORT algorithm, an extension of the earlier SORT algorithm [
20], which reduces data redundancy and enhances sustainable tracking. Additionally, Razzok, Mohammed, et al. [
21] proposed a pedestrian counting method that combines YOLO and DeepSORT, demonstrating high accuracy and robustness in real-time detection and tracking. Nighttime pedestrian-detection and -tracking introduce additional challenges, including poor lighting, shadows, and glare from artificial light sources, which can significantly degrade the performance of traditional algorithms. Recent studies have sought to overcome these limitations by integrating advanced image enhancement techniques and robust tracking models. For example, Zhang et al. [
22] developed a nighttime pedestrian-detection system that incorporates a low-light image enhancement module alongside a deep learning-based tracking algorithm, improving detection accuracy in low-visibility conditions. Similarly, Ngeni, et al. [
23] proposed an algorithm that combines a specialized illumination compensation technique with the DeepSORT framework, resulting in enhanced tracking stability and accuracy in nighttime scenarios. These advancements highlight the importance of adapting existing algorithms to handle the unique challenges posed by nighttime environments, thereby ensuring reliable pedestrian-detection and -tracking, even under difficult conditions.
Through research, it has been found that combining the YOLOP multi-task object detection algorithm with multi-object tracking (MOT) algorithms is well suited for the recognition and tracking of pedestrians and lane markings at night. This paper introduces a novel trajectory tracking model that enhances both the YOLOP and DeepSORT algorithms for more accurate detection and tracking of pedestrian trajectories. The proposed improvements include the integration of the C2f-faster structure and BiFormer attention mechanism within the YOLOP algorithm, significantly enhancing feature extraction capabilities and focusing on small area features. The CARAFE module is used to replace the original upsampling module, improving the fusion of shallow features, while the DyHead detection head achieves comprehensive fusion of scale, spatial, and task perception. To further enhance tracking accuracy and reduce model complexity, the ShuffleNetV2 lightweight module is integrated into the DeepSORT feature extraction network. The effectiveness of the proposed model was validated through experiments involving pedestrian activities near motorways in typical nighttime scenarios. This study provides valuable insights for the development of vehicle collision avoidance decision-making and active safety technologies, significantly improving the accuracy and real-time performance of nighttime pedestrian-detection and -tracking systems.
Key Innovations:
Improved Network Architecture: Introduction of the C2f-faster structure and BiFormer attention mechanism into the YOLOP algorithm to improve detection accuracy, particularly for small area features. Integration of the CARAFE module and DyHead detection head to enhance feature fusion, enabling more effective detection and tracking in complex environments.
Optimized Real-Time Performance: Implementation of the ShuffleNetV2 lightweight module into the DeepSORT network, reducing model complexity and improving real-time tracking performance.
Integration of the FBCD-YOLOP and DeepSORT algorithms: By integrating the FBCD-YOLOP and DeepSORT algorithms, detection and tracking can be performed for multiple tasks simultaneously. This integration allows the generation of lane markings and drivable areas while pedestrians are detected and tracked, as sketched in the example below.
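To make the integration concrete, the following is a minimal per-frame sketch of the combined pipeline, assuming a detector callable that returns pedestrian boxes together with lane-line and drivable-area masks, and a DeepSORT-style tracker exposing an update() method; the function and class names are illustrative placeholders, not the released implementation.

```python
# Illustrative per-frame pipeline: multi-task detection followed by DeepSORT-style tracking.
# The detector and tracker objects stand in for the paper's FBCD-YOLOP detector and improved
# DeepSORT tracker; only the data flow is sketched here.
import cv2

def track_video(video_path, detector, tracker):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # One forward pass yields pedestrian boxes plus lane-line and drivable-area masks.
        ped_boxes, lane_mask, drivable_mask = detector(frame)   # boxes: [x1, y1, x2, y2, conf]
        # Appearance features are extracted inside the tracker (ShuffleNetV2 re-ID branch)
        # and associated with existing tracks via the Kalman filter and Hungarian matching.
        tracks = tracker.update(ped_boxes, frame)               # -> [x1, y1, x2, y2, track_id]
        yield frame, tracks, lane_mask, drivable_mask
    cap.release()
```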
3. Experiment Results and Analysis
3.1. Data Introduction
This study utilized publicly available datasets, Market-1501 [
37] and BDD100K [
38]. Market-1501 focuses on pedestrian re-identification, while BDD100K covers various driving-related tasks, including pedestrian-detection and lane detection, under different lighting conditions. Images of nighttime pedestrians and lanes under varying lighting conditions were selected from these datasets for this research. The images were divided into training, validation, and test sets in a 7:2:1 ratio. This division ensures sufficient data for learning, facilitates hyperparameter tuning and model selection, and helps to avoid overfitting on the training set. The training set included 15,084 images, the validation set included 4,225 images, and the test set included 2,083 images, as shown in
Table 1. To evaluate the model’s performance under different lighting conditions, the nighttime pedestrian-detection dataset was classified into three environments: complete darkness, low light, and illuminated, as shown in
Table 2. The open-source annotation tool LabelImg was used to annotate pedestrians in the images, with the pedestrian category defined as “person”, labeled as 1, and saved in .txt format.
3.2. Experimental Platform
This experiment was conducted in a Windows 11 environment, using PyCharm as the programming platform, Python 3.8 as the programming language, PyTorch 1.12.1 as the deep learning framework, and CUDA 11.2. The hardware environment included an Intel® Core™ i7-12700H processor, 16 GB of memory, and an NVIDIA GeForce RTX 3060 12 GB graphics card (Santa Clara, CA, USA), as shown in
Table 3.
In the experiment, to better utilize the target-detection model for training on the dataset with nighttime pedestrian and lane line labels, several parameter settings were configured. The input image size was set to 640 × 640 pixels, the batch size to 16, and the learning rate to 0.01; the model was trained for 200 epochs. After ensuring training convergence, the model was re-evaluated on the dataset. For the SORT tracking stage, the IoU threshold was set to 0.7 and the confidence threshold to 0.5, while other parameters were kept at their default values.
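For reference, the settings listed above can be summarized in a single configuration sketch; the key names are illustrative and only the values are taken from the text:

```python
# Experimental settings from Section 3.2, collected into one illustrative config dict.
# Key names are arbitrary; only the values come from the text.
CONFIG = {
    "detector": {
        "input_size": (640, 640),   # input image size in pixels
        "batch_size": 16,
        "learning_rate": 0.01,
        "epochs": 200,
    },
    "tracker": {
        "iou_threshold": 0.7,       # association IoU threshold for the SORT stage
        "conf_threshold": 0.5,      # minimum detection confidence passed to the tracker
        # remaining tracker parameters left at their defaults
    },
}
```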
3.3. Evaluation Indicators
In order to verify the performance of nighttime pedestrian target-detection, we used precision (P), recall (R), frame rate (FPS), and mean average precision (mAP) as evaluation indicators. In addition, the lane line detection sub-task used two indicators, accuracy (Acc) and intersection over union (IoU), for evaluation; Acc is calculated by Formula (15). In the experiment, the IoU threshold for mAP was set to 0.5 (mAP@0.5) to comprehensively evaluate the accuracy of the model. mAP is the average of the average precision (AP) over all categories and represents the overall performance of the model in category detection; a high mAP value means that the model had good detection performance in all categories. For the evaluation of nighttime pedestrian-tracking, we chose multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), and the number of identity switches (IDS) to measure the performance of the tracking model. The higher the P, R, mAP, and Acc, the more accurate the target-detection and -recognition; the higher the FPS, the faster the target-detection speed.
- (1)
Evaluation Metrics for Object-Detection
This study evaluated detection performance by calculating the model’s precision, recall, mAP (mean average precision), Acc (accuracy), and FPS (frames per second). The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}$$

Acc follows Formula (15) introduced earlier, and FPS denotes the number of frames processed per second.
where True Positives (
TP) refer to the number of samples that the model correctly classified as the positive class, meaning both the predicted and actual values were positive. A sample is considered a True Positive if the Intersection over Union (
IoU) between the predicted box and the actual box exceeded a specified threshold. False Positives (
FP) refer to the number of samples that the model incorrectly classified as the positive class, where the predicted value was positive, but the actual value was negative. In this study, a sample was considered a False Positive if the
IoU between the predicted box and the actual box fell below a specific threshold. A false negative (FN) refers to the number of samples that the model incorrectly predicted as negative, meaning the predicted value was negative while the actual value was positive. This indicates that the model’s prediction for the negative class was inaccurate.
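A minimal sketch of the IoU computation and the TP/FP decision rule described above is given below; the 0.5 threshold mirrors the mAP@0.5 setting and is only an example value:

```python
# IoU between two axis-aligned boxes (x1, y1, x2, y2) and the TP/FP decision rule
# described above: a prediction counts as a true positive when IoU exceeds the threshold.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_true_positive(pred_box, gt_box, threshold=0.5):
    return iou(pred_box, gt_box) > threshold
```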
- (2)
Evaluation Metrics for Tracking Algorithm
In a multi-object tracking system, the number of Identity Switches (IDS) measures the frequency of changes and losses in target identification numbers during the tracking process. A lower IDS value indicates greater coherence and accuracy in tracking. Another key metric is Multiple Object Tracking Accuracy (MOTA), which reflects the system’s overall recognition accuracy and the cumulative degree of error throughout the tracking period. Additionally, Multiple Object Tracking Precision (MOTP) evaluates the positioning accuracy of a multi-object tracking system. It calculates the average error between the tracked target location and the actual target location, typically measured in pixels. MOTP focuses on the precision of target position prediction rather than the accuracy of target identification. A higher MOTP value indicates that the tracked target location is closer to the true position, demonstrating better positioning capability of the system.
The calculation formulas are as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(FN_{t} + FP_{t} + IDSW_{t}\right)}{\sum_{t} GT_{t}}$$

$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_{t}}$$

where $FN_{t}$ represents the miss rate (missed targets), $FP_{t}$ represents the false detection rate (false positives), $IDSW_{t}$ represents the number of identity switches, and $GT_{t}$ represents the target number in the t-th frame. $c_{t}$ represents the number of matches in the t-th frame, and the matching error $d_{t,i}$ is calculated for each pair of matches, representing the IoU of the detection box with the ground truth (GT) in the t-th frame. If the tracking match is perfect, MOTP is 100%; if it is completely incorrect, it is 0%.
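The two formulas can be translated directly into code once per-frame counts and per-match IoU values have been collected; the frame records below are illustrative placeholders:

```python
# MOTA/MOTP aggregation over a sequence, following the formulas above.
# Each frame record holds: fn (misses), fp (false detections), idsw (ID switches),
# gt (ground-truth targets), and a list of per-match IoU scores.
def mota(frames):
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    gt_total = sum(f["gt"] for f in frames)
    return 1.0 - errors / gt_total

def motp(frames):
    overlaps = [o for f in frames for o in f["match_ious"]]
    return sum(overlaps) / len(overlaps)   # 100% = perfect overlap, 0% = no overlap

frames = [
    {"fn": 1, "fp": 0, "idsw": 0, "gt": 12, "match_ious": [0.82, 0.91, 0.77]},
    {"fn": 0, "fp": 1, "idsw": 1, "gt": 11, "match_ious": [0.88, 0.79]},
]
print(mota(frames), motp(frames))
```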
3.4. Object-Detection Experiment and Result Analysis
3.4.1. Lane Line-Detection
For the lane line-detection task, the proposed algorithm was compared to single-task models SCNN and ENet-SAD, as well as to multi-task models YOLOP and TDL-YOLO. Evaluation metrics included Accuracy (Acc), Intersection over Union (IoU), and Frames Per Second (FPS). The experimental results are presented in
Table 4.
As shown in
Table 4, the algorithm proposed in this paper demonstrated outstanding performance in lane line-detection tasks, particularly in multi-task processing. Compared to single-task models such as SCNN and ENet-SAD, our algorithm significantly improved accuracy (Acc), intersection over union (IoU), and frame rate (FPS). Among multi-task models, YOLOP and TDL-YOLO handled both lane line detection and pedestrian-detection tasks. Our algorithm showed a 5.1% improvement in accuracy over YOLOP and a 3.3% improvement over TDL-YOLO. In IoU, it surpassed YOLOP by 0.8% and TDL-YOLO by 0.7%. In FPS, it outperformed YOLOP by 25 FPS and TDL-YOLO by 29 FPS. These results indicate that the proposed algorithm can achieve higher detection accuracy and faster processing speeds when simultaneously handling pedestrian-detection and lane line detection tasks.
The experimental results, as shown in
Figure 11, demonstrate that the proposed YOLOP-based improved lane-detection algorithm can effectively detect lane lines and drivable areas across various road scenarios and lane line configurations, including cases where obstacles partially obstruct the view. In the figure, the red lines represent the generated lane lines, the green areas indicate the segmented drivable areas, the red circles highlight differences between the lane line-generation of the proposed model and the baseline model, and the green circles indicate differences in the drivable area segmentation. In Scene 1, it can be observed that the baseline YOLOP model generated a more complete green area, indicating that the YOLOP algorithm performed well in detecting drivable areas. However, the proposed algorithm generated more complete and clearer red lines, reflecting superior lane line-detection. In Scene 2, under conditions with occlusions, the proposed algorithm exhibited stronger lane line-prediction capabilities and achieved better detection results compared to YOLOP. In Scene 3, for road surfaces with unclear lane lines, the proposed algorithm was able to more accurately identify the lane lines.
Overall, the proposed algorithm exhibits good robustness, handling complex nighttime road environments and accurately identifying the position and shape of lane lines. During the experiments, we observed that it successfully handled obstacles partially blocking the road and accurately distinguished different types of lane lines.
3.4.2. Nighttime Pedestrian-Detection
To validate the superiority and generalizability of the algorithm developed in this paper within the context of autonomous driving multi-task perception, we conducted a comparative experiment on nighttime pedestrian-detection. This experiment involved evaluating our proposed algorithm against current mainstream single-task and multi-task autonomous driving algorithms. Specifically, we compared our improved FBCD-YOLOP model with Faster R-CNN, YOLOv3, YOLOv5s, and TDL-YOLO. The comparison was based on metrics including precision, recall, mean average precision, frame rate, and the number of parameters for nighttime pedestrian-recognition. The results of this comparison are presented in
Table 5.
From the comparison of data in
Table 5, it can be observed that among single-task models, Faster R-CNN faced limitations in real-time detection due to its need to first generate candidate boxes and then perform classification and bounding box regression, which increases computational complexity. While YOLOv3 offered better real-time performance, its detection accuracy was still significantly limited. YOLOv5s achieved higher efficiency by predicting bounding boxes and class information directly from the input image in a single forward pass, thereby omitting the additional candidate box generation stage, resulting in an inference speed of 121 FPS. The algorithm in this paper had a lower FPS than YOLOv5s due to the larger number of parameters and the simultaneous detection of multiple tasks. However, it still met the requirements for practical tasks. Among multi-task models, the proposed model in this paper exhibited the highest detection accuracy, with Precision, Recall, and mAP reaching 89.6%, 91.3%, and 88.1%, respectively.
To assess the detection performance of the improved FBCD-YOLOP model under varying lighting conditions, experiments were conducted in three distinct environments: complete darkness, low light, and illuminated. Key metrics recorded included precision, recall, and miss rate. Optimal detection thresholds were established for each lighting condition: 0.6 for complete darkness, 0.5 for low light, and 0.4 for illuminated environments. Analysis of detection precision, miss rate, and recall rate at these thresholds revealed that the model exhibited some missed detections in complete darkness. In the low light environment, most pedestrians were accurately detected with a low false detection rate. In the illuminated environment, the model successfully detected all pedestrians, demonstrating its strong adaptability to well-lit conditions. The detailed experimental results are presented in
Table 6.
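In practice, these lighting-dependent thresholds can be applied as a simple lookup before detections are passed on; how a frame is classified into one of the three lighting conditions is assumed to be available and is not sketched here:

```python
# Per-lighting confidence thresholds reported for the Table 6 experiments.
# Estimating the lighting condition itself is outside the scope of this sketch.
THRESHOLDS = {"complete_darkness": 0.6, "low_light": 0.5, "illuminated": 0.4}

def filter_detections(detections, lighting):
    """Keep detections whose confidence meets the threshold for the given lighting condition."""
    thr = THRESHOLDS[lighting]
    return [d for d in detections if d["conf"] >= thr]
```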
To evaluate the effectiveness of the FBCD-YOLOP model, we used the YOLOP model as a baseline for comparison and conducted ablation experiments on various improvements, including C2f-faster (Strategy 1), BiFormer (Strategy 2), CARAFE (Strategy 3), and DyHead (Strategy 4). The results of these ablation experiments are detailed in
Table 7.
By incorporating C2f-faster (Strategy 1) into YOLOP, it was observed that the detection speed increased significantly without reducing the average precision of nighttime pedestrian-recognition. When the BiFormer (Strategy 2) attention mechanism was added to YOLOP, the average precision of nighttime pedestrian-recognition improved by 2.3 percentage points. This improvement is attributed to BiFormer's ability to capture the characteristic information of small targets, especially on validation images containing partially occluded and incomplete pedestrians, thereby enhancing recognition accuracy. Additionally, we found that integrating the CARAFE (Strategy 3) and DyHead (Strategy 4) modules into YOLOP significantly improved nighttime pedestrian-recognition, with an average precision increase of 2.5 percentage points. This is because CARAFE and DyHead are particularly effective at handling dynamic, posture-changing targets under a large receptive field. Finally, the FBCD-YOLOP model constructed in this study achieved a precision of 89.6%, an average precision of 91.3%, a recall of 88.1%, and a frame rate of 66 FPS in nighttime pedestrian-recognition. Overall, the FBCD-YOLOP model demonstrated excellent performance in handling nighttime pedestrian-detection tasks. The training process curve of the FBCD-YOLOP model for nighttime pedestrian-recognition is shown in
Figure 12.
Additionally, to validate whether the improvements made to the FBCD-YOLOP model resulted in statistically significant enhancements in detection performance, we conducted a significance test. We employed the
t-test method to compare the differences in model detection accuracy under various strategy conditions. A value of 1 was recorded when the difference was statistically significant (
p < 0.05), and 0 otherwise. The results are presented in
Table 8.
Based on the results of the significance tests, we can confirm that the differences in detection performance among the various strategies were statistically significant. This indicates that the improvements implemented in the FBCD-YOLOP model had a substantial impact on its performance.
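For illustration, the p < 0.05 criterion used for Table 8 could be checked with a standard two-sample t-test, for example via SciPy; the per-run accuracy lists below are placeholders standing in for repeated training and evaluation runs, not the values used in this study:

```python
# Pairwise significance check between two strategies' detection accuracies,
# mirroring the p < 0.05 criterion used for Table 8. Accuracy lists are placeholders
# standing in for repeated training/evaluation runs.
from scipy import stats

baseline_acc = [0.861, 0.858, 0.864, 0.860, 0.863]   # e.g. YOLOP baseline runs
improved_acc = [0.893, 0.897, 0.890, 0.895, 0.896]   # e.g. FBCD-YOLOP runs

t_stat, p_value = stats.ttest_ind(baseline_acc, improved_acc)
significant = int(p_value < 0.05)   # 1 = statistically significant, 0 = not
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant = {significant}")
```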
3.5. Target Re-Identification Experiment and Result Analysis
The nighttime pedestrian re-identification model enables the re-identification of the same pedestrian target across different frames in a video. To achieve this goal, we trained the DeepSORT-SNV2 model on a custom dataset for a total of 100 iterations. The convergence trends of loss and top1err for the DeepSORT-SNV2 model on the training set (train) and validation set (val) are shown in
Figure 13.
After 100 iterations, the DeepSORT-SNV2 model showed signs of convergence, achieving an accuracy of 87.9% on the training set and 78.2% on the validation set. These results demonstrate the model’s effectiveness in accurately extracting and re-identifying pedestrian appearance features. To further validate the tracking performance of our DeepSORT-SNV2 model, we conducted a comparative experiment with the standard DeepSORT model. The results of this comparison are presented in
Table 9.
From
Table 9, it can be seen that, after training the pedestrian re-identification model with DeepSORT-SNV2, the model size was roughly 18 times smaller than that of the original DeepSORT model, while good accuracy was maintained on both the training and validation sets. Overall, introducing the lightweight ShuffleNetV2 network into the DeepSORT model not only met the real-time and accuracy requirements for pedestrian-tracking in nighttime driving conditions, but also yielded a smaller model size and lower computational cost, making it more suitable for deployment on edge devices.
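One possible way to build such a lightweight appearance-embedding network is to reuse the ShuffleNetV2 backbone available in torchvision and replace its classifier with an embedding head, as sketched below; this is an illustration of the idea, not the exact DeepSORT-SNV2 architecture used in this study:

```python
# Lightweight appearance-feature extractor built on ShuffleNetV2, in the spirit of
# DeepSORT-SNV2: the classification head is replaced by a small embedding layer
# whose L2-normalized output serves as the re-identification feature.
import torch
import torch.nn as nn
from torchvision import models

class ShuffleNetV2Embedder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        backbone = models.shufflenet_v2_x1_0()                               # randomly initialized
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)      # replace classifier
        self.backbone = backbone

    def forward(self, x):                                   # x: (N, 3, H, W) pedestrian crops
        emb = self.backbone(x)
        return nn.functional.normalize(emb, dim=1)          # unit-length appearance features

model = ShuffleNetV2Embedder()
features = model(torch.randn(4, 3, 128, 64))                # typical re-ID crop size 128 x 64
print(features.shape)                                       # torch.Size([4, 128])
```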
To verify the performance of the proposed algorithm in nighttime pedestrian-tracking, tests were conducted on a custom dataset and compared with the YOLOP-DeepSORT algorithm. The results are shown in
Table 10.
As shown in
Table 10, the proposed algorithm achieved a Multiple Object Tracking Accuracy (MOTA) of 86.3%, a Multiple Object Tracking Precision (MOTP) of 84.9%, and an Identity Switch (IDS) count of 5 in nighttime multi-object pedestrian-tracking. Compared to the YOLOP-DeepSORT algorithm, the proposed algorithm improved MOTA by 5.6%, reduced IDS by 5, and increased video processing speed from 24 FPS to 59 FPS, effectively meeting the real-time requirements for nighttime pedestrian video tracking. To further validate the effectiveness of our algorithm in detecting and tracking pedestrians in the blind spots of nighttime drivers, we tested and compared our algorithm with two other advanced multi-object-tracking algorithms on a lighting environment dataset. The comparison of experimental results is shown in
Table 11.
As shown in
Table 11, compared to the two mainstream multi-object tracking algorithms, ByteTrack and StrongSORT, the proposed algorithm demonstrated superior tracking performance across the MOTA, IDS, and FPS metrics. Although the MOTP value of StrongSORT was 1.4% higher than that of our algorithm, when the other three metrics are considered, our algorithm clearly outperformed both ByteTrack and StrongSORT. Specifically, our algorithm's MOTA value was 2.9% higher than ByteTrack's and 0.4% higher than StrongSORT's. On the IDS metric, it also outperformed ByteTrack and StrongSORT by 5.9% and 3.4%, respectively, and had the lowest number of ID switches, with only five occurrences, indicating excellent continuity and stability during the tracking process. In terms of FPS, our algorithm outperformed ByteTrack and StrongSORT by 32 FPS and 36 FPS, respectively. At a near-real-time rate of 59 frames per second, our algorithm maintained high tracking accuracy. These results validate the effectiveness of our algorithm for pedestrian-detection and -tracking in nighttime driving blind spots.
The performance of pedestrian-detection methods on nighttime roads significantly influences the effectiveness of tracking tasks. To validate the impact of different detection algorithms, this paper applied various detection algorithms to the improved tracking algorithm, as shown in
Table 12. The results indicate that different detection methods exhibited varying levels of performance across multi-object-tracking accuracy (MOTA), multi-object-tracking precision (MOTP), and frames per second (FPS).
Notably, our method exhibited the highest accuracy in both MOTA and MOTP, achieving scores of 86.3% and 84.9%, respectively, which indicates its exceptional performance in accurately detecting and localizing pedestrians. Additionally, with an FPS value of 59, our method demonstrated outstanding real-time capability. In summary, our approach showed significant superiority in detecting and tracking pedestrians on nighttime roads.
In this study, the YOLOP-DeepSORT algorithm was utilized as the baseline. After enhancing various modules of the algorithm, significant improvements in tracking performance were observed, as illustrated in
Figure 14a–d, which presents two sets of video sequences: the first set includes images (a) and (b), while the second set includes images (c) and (d). In the first sequence, shown in (b), the proposed algorithm accurately distinguished and tracked ID3 as it passed through the crowd, whereas the baseline YOLOP-DeepSORT algorithm exhibited ID changes, highlighted by orange circles. In the second set of video sequences, (c) and (d), the proposed algorithm showed no ID changes, false detections, or missed detections. However, the YOLOP-DeepSORT algorithm mistakenly identified the tree trunk in (c) and the wall crack in (d) as pedestrians, highlighted by red circles.
It is evident that the baseline YOLOP-DeepSORT algorithm exhibited issues such as missed detections, false detections, and ID switches, resulting in suboptimal performance in nighttime pedestrian-detection and -tracking. In contrast, our proposed algorithm demonstrated superior performance in multi-target-tracking. The initial ID numbers assigned to pedestrian targets remained consistent throughout the tracking process. Even in cases of pedestrian occlusion or uneven lighting, where an ID may temporarily disappear, our algorithm was able to reassign the correct unique ID to the target by matching features in subsequent video frames. This indicates that the improved YOLOP-DeepSORT algorithm, when applied to nighttime pedestrian video-tracking scenarios, achieved excellent tracking results. Throughout the tracking process in a video sequence, the algorithm accurately located pedestrian targets in the video frames, with the tracking box size consistently matching the actual scale of the target. Furthermore, the algorithm demonstrated good real-time performance, with no target loss, effectively meeting the technical requirements for multi-target pedestrian-tracking in nighttime driver blind spot scenarios.