1. Introduction
With the rapid development of intelligent transportation systems, research on nighttime pedestrian trajectory tracking has become increasingly important [
1]. During nighttime driving, poor lighting conditions pose significant challenges to pedestrian-detection and -tracking [
2], including multi-task detection [
3], pedestrian occlusion [
4], detection accuracy in complex environments [
5], and real-time performance. The goal of this study is to enhance the accuracy and real-time performance of nighttime pedestrian-detection and -tracking by improving existing multi-object detection and -tracking algorithms. We propose a method that combines an enhanced YOLOP detection algorithm with an improved DeepSORT tracking algorithm to address the challenges of pedestrian-detection and -tracking in nighttime environments.
Currently, commonly used object detection algorithms include the R-CNN series, YOLO series, and SSD series. Algorithms based on R-CNN have notably improved detection accuracy and stability, especially in pedestrian-detection. For instance, Akshatha, K. R., et al. [
6] proposed a pedestrian-detection algorithm based on R-CNN, which detects effectively in complex environments. However, R-CNN-based algorithms face issues such as redundant computation and high computational cost. To address these, Fast R-CNN and Faster R-CNN were developed. Avola, Danilo, et al. [
7] designed MS-CNN, which integrates shallow and deep features to enhance discriminability, though it struggles with uneven lighting and diverse pedestrian colors. Liu, YJ, et al. [
8] tackled this with a method combining ACE and Faster R-CNN, enhancing detection accuracy in color-complex scenes. For occlusion scenarios, Zhang et al. [
9] incorporated a cross-channel attention mechanism into Faster R-CNN, improving positioning accuracy and reducing false detections. The SSD algorithm used by Chen, Z., et al. [
10], which eliminates the candidate box extraction step, significantly enhances detection efficiency but has lower accuracy. To improve this, Ni, Y., et al. [
11] used a residual network with stronger representational capabilities as the SSD base, resulting in better real-time performance and robustness, though challenges with occluded and small targets persist. Liu et al. [
12] proposed an improved YOLOv5-based algorithm, while Kumar, Sunil, et al. [
13] enhanced YOLOv5 with DeepSort for multi-target-detection and -tracking, improving occluded target-recognition. For small target-detection, Li et al. [
14] introduced the YOLO-ACN algorithm, adding attention mechanisms and a CIoU loss function to effectively extract detailed features and address small-target pedestrian-detection.
Pedestrian-tracking algorithms are used to continuously monitor targets in video sequences. Traditional tracking methods, such as nearest neighbor algorithms [
15], multiple hypothesis tracking [
16], and joint data association [
17], often struggle with sustainability in complex scenes due to the large number of targets [
18]. To address this, Chen, Xuewen, et al. [
19] employed the DeepSORT algorithm, an extension of the earlier SORT algorithm [
20], which reduces data redundancy and enhances sustainable tracking. Additionally, Razzok, Mohammed, et al. [
21] proposed a pedestrian counting method that combines YOLO and DeepSORT, demonstrating high accuracy and robustness in real-time detection and tracking. Nighttime pedestrian-detection and -tracking introduce additional challenges, including poor lighting, shadows, and glare from artificial light sources, which can significantly degrade the performance of traditional algorithms. Recent studies have sought to overcome these limitations by integrating advanced image enhancement techniques and robust tracking models. For example, Zhang et al. [
22] developed a nighttime pedestrian-detection system that incorporates a low-light image enhancement module alongside a deep learning-based tracking algorithm, improving detection accuracy in low-visibility conditions. Similarly, Ngeni, et al. [
23] proposed an algorithm that combines a specialized illumination compensation technique with the DeepSORT framework, resulting in enhanced tracking stability and accuracy in nighttime scenarios. These advancements highlight the importance of adapting existing algorithms to handle the unique challenges posed by nighttime environments, thereby ensuring reliable pedestrian-detection and -tracking, even under difficult conditions.
Through research, it has been found that combining the YOLOP multi-task object detection algorithm with multi-object tracking (MOT) algorithms is well suited for the recognition and tracking of pedestrians and lane markings at night. This paper introduces a novel trajectory tracking model that enhances both the YOLOP and DeepSORT algorithms for more accurate detection and tracking of pedestrian trajectories. The proposed improvements include the integration of the C2f-faster structure and BiFormer attention mechanism within the YOLOP algorithm, significantly enhancing feature extraction capabilities and focusing on small area features. The CARAFE module is used to replace the original upsampling module, improving the fusion of shallow features, while the DyHead detection head achieves comprehensive fusion of scale, spatial, and task perception. To further enhance tracking accuracy and reduce model complexity, the ShuffleNetV2 lightweight module is integrated into the DeepSORT feature extraction network. The effectiveness of the proposed model was validated through experiments involving pedestrian activities near motorways in typical nighttime scenarios. This study provides valuable insights for the development of vehicle collision avoidance decision-making and active safety technologies, significantly improving the accuracy and real-time performance of nighttime pedestrian-detection and -tracking systems.
Key Innovations:
Improved Network Architecture: Introduction of the C2f-faster structure and BiFormer attention mechanism into the YOLOP algorithm to improve detection accuracy, particularly for small area features. Integration of the CARAFE module and DyHead detection head to enhance feature fusion, enabling more effective detection and tracking in complex environments.
Optimized Real-Time Performance: Implementation of the ShuffleNetV2 lightweight module into the DeepSORT network, reducing model complexity and improving real-time tracking performance.
Integration of the FBCD-YOLOP and DeepSORT algorithms: By integrating the FBCD-YOLOP and DeepSORT algorithms, detection and tracking can be performed for multiple tasks simultaneously. This integration allows the generation of lane markings and drivable areas while pedestrians are detected and tracked, as sketched in the example below.
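To make the integration concrete, the following is a minimal per-frame sketch of the combined pipeline, assuming a detector callable that returns pedestrian boxes together with lane-line and drivable-area masks, and a DeepSORT-style tracker exposing an update() method; the function and class names are illustrative placeholders, not the released implementation.

```python
# Illustrative per-frame pipeline: multi-task detection followed by DeepSORT-style tracking.
# The detector and tracker objects stand in for the paper's FBCD-YOLOP detector and improved
# DeepSORT tracker; only the data flow is sketched here.
import cv2

def track_video(video_path, detector, tracker):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # One forward pass yields pedestrian boxes plus lane-line and drivable-area masks.
        ped_boxes, lane_mask, drivable_mask = detector(frame)   # boxes: [x1, y1, x2, y2, conf]
        # Appearance features are extracted inside the tracker (ShuffleNetV2 re-ID branch)
        # and associated with existing tracks via the Kalman filter and Hungarian matching.
        tracks = tracker.update(ped_boxes, frame)               # -> [x1, y1, x2, y2, track_id]
        yield frame, tracks, lane_mask, drivable_mask
    cap.release()
```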
3. Experiment Results and Analysis
3.1. Data Introduction
This study utilized publicly available datasets, Market-1501 [
37] and BDD100K [
38]. Market-1501 focuses on pedestrian re-identification, while BDD100K covers various driving-related tasks, including pedestrian-detection and lane detection, under different lighting conditions. Images of nighttime pedestrians and lanes under varying lighting conditions were selected from these datasets for this research. The images were divided into training, validation, and test sets in a 7:2:1 ratio. This division ensures sufficient data for learning, facilitates hyperparameter tuning and model selection, and helps to avoid overfitting on the training set. The training set included 15,084 images, the validation set included 4,225 images, and the test set included 2,083 images, as shown in
Table 1. To evaluate the model’s performance under different lighting conditions, the nighttime pedestrian-detection dataset was classified into three environments: complete darkness, low light, and illuminated, as shown in
Table 2. The open-source annotation tool LabelImg was used to annotate pedestrians in the images, with the pedestrian category defined as “person”, labeled as 1, and saved in .txt format.
3.2. Experimental Platform
This experiment was conducted in a Windows 11 environment, using PyCharm as the programming platform, Python 3.8 as the programming language, PyTorch 1.12.1 as the deep learning framework, and CUDA 11.2. The hardware environment included an Intel® Core™ i7-12700H processor, 16 GB of memory, and an NVIDIA GeForce RTX 3060 12 GB graphics card (Santa Clara, CA, USA), as shown in
Table 3.
In the experiment, to better utilize the target-detection model for training on the dataset with nighttime pedestrian and lane line labels, several parameter settings were configured. The input image size was set to 640 × 640 pixels, the batch size to 16, and the learning rate to 0.01; the model was trained for 200 epochs. After ensuring training convergence, the model was re-evaluated on the dataset. For the SORT tracking stage, the IoU threshold was set to 0.7 and the confidence threshold to 0.5, while other parameters were kept at their default values.
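For reference, the settings listed above can be summarized in a single configuration sketch; the key names are illustrative and only the values are taken from the text:

```python
# Experimental settings from Section 3.2, collected into one illustrative config dict.
# Key names are arbitrary; only the values come from the text.
CONFIG = {
    "detector": {
        "input_size": (640, 640),   # input image size in pixels
        "batch_size": 16,
        "learning_rate": 0.01,
        "epochs": 200,
    },
    "tracker": {
        "iou_threshold": 0.7,       # association IoU threshold for the SORT stage
        "conf_threshold": 0.5,      # minimum detection confidence passed to the tracker
        # remaining tracker parameters left at their defaults
    },
}
```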
3.3. Evaluation Indicators
In order to verify the performance of nighttime pedestrian target-detection, we used precision (P), recall (R), frame rate (FPS), and mean average precision (mAP) as evaluation indicators. In addition, the lane line detection sub-task used two indicators, accuracy (Acc) and intersection over union (IoU), for evaluation; Acc is calculated by Formula (15). In the experiment, the IoU threshold for mAP was set to 0.5 (mAP@0.5) to comprehensively evaluate the accuracy of the model. mAP is the average of the average precision (AP) over all categories and represents the overall performance of the model in category detection; a high mAP value means that the model had good detection performance in all categories. For the evaluation of nighttime pedestrian-tracking, we chose multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), and the number of identity switches (IDS) to measure the performance of the tracking model. The higher the P, R, mAP, and Acc, the more accurate the target-detection and -recognition; the higher the FPS, the faster the target-detection speed.
- (1)
Evaluation Metrics for Object-Detection
This study evaluated detection performance by calculating the model’s precision, recall, mAP (mean average precision), Acc (accuracy), and FPS (frames per second). The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}$$

Acc follows Formula (15) introduced earlier, and FPS denotes the number of frames processed per second.
where True Positives (
TP) refer to the number of samples that the model correctly classified as the positive class, meaning both the predicted and actual values were positive. A sample is considered a True Positive if the Intersection over Union (
IoU) between the predicted box and the actual box exceeded a specified threshold. False Positives (
FP) refer to the number of samples that the model incorrectly classified as the positive class, where the predicted value was positive, but the actual value was negative. In this study, a sample was considered a False Positive if the
IoU between the predicted box and the actual box fell below a specific threshold. A false negative (FN) refers to the number of samples that the model incorrectly predicted as negative, meaning the predicted value was negative while the actual value was positive. This indicates that the model’s prediction for the negative class was inaccurate.
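A minimal sketch of the IoU computation and the TP/FP decision rule described above is given below; the 0.5 threshold mirrors the mAP@0.5 setting and is only an example value:

```python
# IoU between two axis-aligned boxes (x1, y1, x2, y2) and the TP/FP decision rule
# described above: a prediction counts as a true positive when IoU exceeds the threshold.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_true_positive(pred_box, gt_box, threshold=0.5):
    return iou(pred_box, gt_box) > threshold
```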
- (2)
Evaluation Metrics for Tracking Algorithm
In a multi-object tracking system, the number of Identity Switches (IDS) measures the frequency of changes and losses in target identification numbers during the tracking process. A lower IDS value indicates greater coherence and accuracy in tracking. Another key metric is Multiple Object Tracking Accuracy (MOTA), which reflects the system’s overall recognition accuracy and the cumulative degree of error throughout the tracking period. Additionally, Multiple Object Tracking Precision (MOTP) evaluates the positioning accuracy of a multi-object tracking system. It calculates the average error between the tracked target location and the actual target location, typically measured in pixels. MOTP focuses on the precision of target position prediction rather than the accuracy of target identification. A higher MOTP value indicates that the tracked target location is closer to the true position, demonstrating better positioning capability of the system.
The calculation formulas are as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(FN_{t} + FP_{t} + IDSW_{t}\right)}{\sum_{t} GT_{t}}$$

$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_{t}}$$

where $FN_{t}$ represents the miss rate (missed targets), $FP_{t}$ represents the false detection rate (false positives), $IDSW_{t}$ represents the number of identity switches, and $GT_{t}$ represents the target number in the t-th frame. $c_{t}$ represents the number of matches in the t-th frame, and the matching error $d_{t,i}$ is calculated for each pair of matches, representing the IoU of the detection box with the ground truth (GT) in the t-th frame. If the tracking match is perfect, MOTP is 100%; if it is completely incorrect, it is 0%.
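The two formulas can be translated directly into code once per-frame counts and per-match IoU values have been collected; the frame records below are illustrative placeholders:

```python
# MOTA/MOTP aggregation over a sequence, following the formulas above.
# Each frame record holds: fn (misses), fp (false detections), idsw (ID switches),
# gt (ground-truth targets), and a list of per-match IoU scores.
def mota(frames):
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    gt_total = sum(f["gt"] for f in frames)
    return 1.0 - errors / gt_total

def motp(frames):
    overlaps = [o for f in frames for o in f["match_ious"]]
    return sum(overlaps) / len(overlaps)   # 100% = perfect overlap, 0% = no overlap

frames = [
    {"fn": 1, "fp": 0, "idsw": 0, "gt": 12, "match_ious": [0.82, 0.91, 0.77]},
    {"fn": 0, "fp": 1, "idsw": 1, "gt": 11, "match_ious": [0.88, 0.79]},
]
print(mota(frames), motp(frames))
```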
3.4. Object-Detection Experiment and Result Analysis
3.4.1. Lane Line-Detection
For the lane line-detection task, the proposed algorithm was compared to single-task models SCNN and ENet-SAD, as well as to multi-task models YOLOP and TDL-YOLO. Evaluation metrics included Accuracy (Acc), Intersection over Union (IoU), and Frames Per Second (FPS). The experimental results are presented in
Table 4.
As shown in
Table 4, the algorithm proposed in this paper demonstrated outstanding performance in lane line-detection tasks, particularly in multi-task processing. Compared to single-task models such as SCNN and ENet-SAD, our algorithm significantly improved accuracy (Acc), intersection over union (IoU), and frame rate (FPS). Among multi-task models, YOLOP and TDL-YOLO handled both lane line detection and pedestrian-detection tasks. Our algorithm showed a 5.1% improvement in accuracy over YOLOP and a 3.3% improvement over TDL-YOLO. In IoU, it surpassed YOLOP by 0.8% and TDL-YOLO by 0.7%. In FPS, it outperformed YOLOP by 25 FPS and TDL-YOLO by 29 FPS. These results indicate that the proposed algorithm can achieve higher detection accuracy and faster processing speeds when simultaneously handling pedestrian-detection and lane line detection tasks.
The experimental results, as shown in
Figure 11, demonstrate that the proposed YOLOP-based improved lane-detection algorithm can effectively detect lane lines and drivable areas across various road scenarios and lane line configurations, including cases where obstacles partially obstruct the view. In the figure, the red lines represent the generated lane lines, the green areas indicate the segmented drivable areas, the red circles highlight differences between the lane line-generation of the proposed model and the baseline model, and the green circles indicate differences in the drivable area segmentation. In Scene 1, it can be observed that the baseline YOLOP model generated a more complete green area, indicating that the YOLOP algorithm performed well in detecting drivable areas. However, the proposed algorithm generated more complete and clearer red lines, reflecting superior lane line-detection. In Scene 2, under conditions with occlusions, the proposed algorithm exhibited stronger lane line-prediction capabilities and achieved better detection results compared to YOLOP. In Scene 3, for road surfaces with unclear lane lines, the proposed algorithm was able to more accurately identify the lane lines.
Overall, the proposed algorithm exhibits good robustness, handling complex nighttime road environments and accurately identifying the position and shape of lane lines. During the experiments, we observed that it successfully handled obstacles partially blocking the road and accurately distinguished different types of lane lines.
3.4.2. Nighttime Pedestrian-Detection
To validate the superiority and generalizability of the algorithm developed in this paper within the context of autonomous driving multi-task perception, we conducted a comparative experiment on nighttime pedestrian-detection. This experiment involved evaluating our proposed algorithm against current mainstream single-task and multi-task autonomous driving algorithms. Specifically, we compared our improved FBCD-YOLOP model with Faster R-CNN, YOLOv3, YOLOv5s, and TDL-YOLO. The comparison was based on metrics including precision, recall, mean average precision, frame rate, and the number of parameters for nighttime pedestrian-recognition. The results of this comparison are presented in
Table 5.
From the comparison of data in
Table 5, it can be observed that among single-task models, Faster R-CNN faced limitations in real-time detection due to its need to first generate candidate boxes and then perform classification and bounding box regression, which increases computational complexity. While YOLOv3 offered better real-time performance, its detection accuracy was still significantly limited. YOLOv5s achieved higher efficiency by predicting bounding boxes and class information directly from the input image in a single forward pass, thereby omitting the additional candidate box generation stage, resulting in an inference speed of 121 FPS. The algorithm in this paper had a lower FPS than YOLOv5s due to the larger number of parameters and the simultaneous detection of multiple tasks. However, it still met the requirements for practical tasks. Among multi-task models, the proposed model in this paper exhibited the highest detection accuracy, with Precision, Recall, and mAP reaching 89.6%, 91.3%, and 88.1%, respectively.
To assess the detection performance of the improved FBCD-YOLOP model under varying lighting conditions, experiments were conducted in three distinct environments: complete darkness, low light, and illuminated. Key metrics recorded included precision, recall, and miss rate. Optimal detection thresholds were established for each lighting condition: 0.6 for complete darkness, 0.5 for low light, and 0.4 for illuminated environments. Analysis of detection precision, miss rate, and recall rate at these thresholds revealed that the model exhibited some missed detections in complete darkness. In the low light environment, most pedestrians were accurately detected with a low false detection rate. In the illuminated environment, the model successfully detected all pedestrians, demonstrating its strong adaptability to well-lit conditions. The detailed experimental results are presented in
Table 6.
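In practice, these lighting-dependent thresholds can be applied as a simple lookup before detections are passed on; how a frame is classified into one of the three lighting conditions is assumed to be available and is not sketched here:

```python
# Per-lighting confidence thresholds reported for the Table 6 experiments.
# Estimating the lighting condition itself is outside the scope of this sketch.
THRESHOLDS = {"complete_darkness": 0.6, "low_light": 0.5, "illuminated": 0.4}

def filter_detections(detections, lighting):
    """Keep detections whose confidence meets the threshold for the given lighting condition."""
    thr = THRESHOLDS[lighting]
    return [d for d in detections if d["conf"] >= thr]
```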
To evaluate the effectiveness of the FBCD-YOLOP model, we used the YOLOP model as a baseline for comparison and conducted ablation experiments on various improvements, including C2f-faster (Strategy 1), BiFormer (Strategy 2), CARAFE (Strategy 3), and DyHead (Strategy 4). The results of these ablation experiments are detailed in
Table 7.
By incorporating C2f-faster (Strategy 1) into YOLOP, it was observed that the detection speed increased significantly without reducing the average precision of nighttime pedestrian-recognition. When the BiFormer (Strategy 2) attention mechanism was added to YOLOP, the average precision of nighttime pedestrian-recognition improved by 2.3 percentage points. This improvement is attributed to BiFormer's ability to capture the characteristic information of small targets, especially on validation images containing partially occluded and incomplete pedestrians, thereby enhancing recognition accuracy. Additionally, we found that integrating the CARAFE (Strategy 3) and DyHead (Strategy 4) modules into YOLOP significantly improved nighttime pedestrian-recognition, with an average precision increase of 2.5 percentage points. This is because CARAFE and DyHead are particularly effective at handling dynamic, posture-changing targets under a large receptive field. Finally, the FBCD-YOLOP model constructed in this study achieved a precision of 89.6%, an average precision of 91.3%, a recall of 88.1%, and a frame rate of 66 FPS in nighttime pedestrian-recognition. Overall, the FBCD-YOLOP model demonstrated excellent performance in handling nighttime pedestrian-detection tasks. The training process curve of the FBCD-YOLOP model for nighttime pedestrian-recognition is shown in
Figure 12.
Additionally, to validate whether the improvements made to the FBCD-YOLOP model resulted in statistically significant enhancements in detection performance, we conducted a significance test. We employed the
t-test method to compare the differences in model detection accuracy under various strategy conditions. A value of 1 was recorded when the difference was statistically significant (
p < 0.05), and 0 otherwise. The results are presented in
Table 8.
Based on the results of the significance tests, we can confirm that the differences in detection performance among the various strategies were statistically significant. This indicates that the improvements implemented in the FBCD-YOLOP model had a substantial impact on its performance.
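For illustration, the p < 0.05 criterion used for Table 8 could be checked with a standard two-sample t-test, for example via SciPy; the per-run accuracy lists below are placeholders standing in for repeated training and evaluation runs, not the values used in this study:

```python
# Pairwise significance check between two strategies' detection accuracies,
# mirroring the p < 0.05 criterion used for Table 8. Accuracy lists are placeholders
# standing in for repeated training/evaluation runs.
from scipy import stats

baseline_acc = [0.861, 0.858, 0.864, 0.860, 0.863]   # e.g. YOLOP baseline runs
improved_acc = [0.893, 0.897, 0.890, 0.895, 0.896]   # e.g. FBCD-YOLOP runs

t_stat, p_value = stats.ttest_ind(baseline_acc, improved_acc)
significant = int(p_value < 0.05)   # 1 = statistically significant, 0 = not
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant = {significant}")
```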
3.5. Target Re-Identification Experiment and Result Analysis
The nighttime pedestrian re-identification model enables the re-identification of the same pedestrian target across different frames in a video. To achieve this goal, we trained the DeepSORT-SNV2 model on a custom dataset for a total of 100 iterations. The convergence trends of loss and top1err for the DeepSORT-SNV2 model on the training set (train) and validation set (val) are shown in
Figure 13.
After 100 iterations, the DeepSORT-SNV2 model showed signs of convergence, achieving an accuracy of 87.9% on the training set and 78.2% on the validation set. These results demonstrate the model’s effectiveness in accurately extracting and re-identifying pedestrian appearance features. To further validate the tracking performance of our DeepSORT-SNV2 model, we conducted a comparative experiment with the standard DeepSORT model. The results of this comparison are presented in
Table 9.
From
Table 9, it can be seen that, after training the pedestrian re-identification model with DeepSORT-SNV2, the model size was roughly 18 times smaller than that of the original DeepSORT model, while good accuracy was maintained on both the training and validation sets. Overall, introducing the lightweight ShuffleNetV2 network into the DeepSORT model not only met the real-time and accuracy requirements for pedestrian-tracking in nighttime driving conditions, but also yielded a smaller model size and lower computational cost, making it more suitable for deployment on edge devices.
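One possible way to build such a lightweight appearance-embedding network is to reuse the ShuffleNetV2 backbone available in torchvision and replace its classifier with an embedding head, as sketched below; this is an illustration of the idea, not the exact DeepSORT-SNV2 architecture used in this study:

```python
# Lightweight appearance-feature extractor built on ShuffleNetV2, in the spirit of
# DeepSORT-SNV2: the classification head is replaced by a small embedding layer
# whose L2-normalized output serves as the re-identification feature.
import torch
import torch.nn as nn
from torchvision import models

class ShuffleNetV2Embedder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        backbone = models.shufflenet_v2_x1_0()                               # randomly initialized
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)      # replace classifier
        self.backbone = backbone

    def forward(self, x):                                   # x: (N, 3, H, W) pedestrian crops
        emb = self.backbone(x)
        return nn.functional.normalize(emb, dim=1)          # unit-length appearance features

model = ShuffleNetV2Embedder()
features = model(torch.randn(4, 3, 128, 64))                # typical re-ID crop size 128 x 64
print(features.shape)                                       # torch.Size([4, 128])
```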
To verify the performance of the proposed algorithm in nighttime pedestrian-tracking, tests were conducted on a custom dataset and compared with the YOLOP-DeepSORT algorithm. The results are shown in
Table 10.
As shown in
Table 10, the proposed algorithm achieved a Multiple Object Tracking Accuracy (MOTA) of 86.3%, a Multiple Object Tracking Precision (MOTP) of 84.9%, and an Identity Switch (IDS) count of 5 in nighttime multi-object pedestrian-tracking. Compared to the YOLOP-DeepSORT algorithm, the proposed algorithm improved MOTA by 5.6%, reduced IDS by 5, and increased video processing speed from 24 FPS to 59 FPS, effectively meeting the real-time requirements for nighttime pedestrian video tracking. To further validate the effectiveness of our algorithm in detecting and tracking pedestrians in the blind spots of nighttime drivers, we tested and compared our algorithm with two other advanced multi-object-tracking algorithms on a lighting environment dataset. The comparison of experimental results is shown in
Table 11.
As shown in
Table 11, compared to the two mainstream multi-object tracking algorithms, ByteTrack and StrongSORT, the proposed algorithm demonstrated superior tracking performance across the MOTA, IDS, and FPS metrics. Although the MOTP value of StrongSORT was 1.4% higher than that of our algorithm, when the other three metrics are considered, our algorithm clearly outperformed both ByteTrack and StrongSORT. Specifically, our algorithm's MOTA value was 2.9% higher than ByteTrack's and 0.4% higher than StrongSORT's. On the IDS metric, it also outperformed ByteTrack and StrongSORT by 5.9% and 3.4%, respectively, and had the lowest number of ID switches, with only five occurrences, indicating excellent continuity and stability during the tracking process. In terms of FPS, our algorithm outperformed ByteTrack and StrongSORT by 32 FPS and 36 FPS, respectively. At a near-real-time rate of 59 frames per second, our algorithm maintained high tracking accuracy. These results validate the effectiveness of our algorithm for pedestrian-detection and -tracking in nighttime driving blind spots.
The performance of pedestrian-detection methods on nighttime roads significantly influences the effectiveness of tracking tasks. To validate the impact of different detection algorithms, this paper applied various detection algorithms to the improved tracking algorithm, as shown in
Table 12. The results indicate that different detection methods exhibited varying levels of performance across multi-object-tracking accuracy (MOTA), multi-object-tracking precision (MOTP), and frames per second (FPS).
Notably, our method exhibited the highest accuracy in both MOTA and MOTP, achieving scores of 86.3% and 84.9%, respectively, which indicates its exceptional performance in accurately detecting and localizing pedestrians. Additionally, with an FPS value of 59, our method demonstrated outstanding real-time capability. In summary, our approach showed significant superiority in detecting and tracking pedestrians on nighttime roads.
In this study, the YOLOP-DeepSORT algorithm was utilized as the baseline. After enhancing various modules of the algorithm, significant improvements in tracking performance were observed, as illustrated in
Figure 14a–d, which presents two sets of video sequences: the first set includes images (a) and (b), while the second set includes images (c) and (d). In the first sequence, shown in (b), the proposed algorithm accurately distinguished and tracked ID3 as it passed through the crowd, whereas the baseline YOLOP-DeepSORT algorithm exhibited ID changes, highlighted by orange circles. In the second set of video sequences, (c) and (d), the proposed algorithm showed no ID changes, false detections, or missed detections. However, the YOLOP-DeepSORT algorithm mistakenly identified the tree trunk in (c) and the wall crack in (d) as pedestrians, highlighted by red circles.
It is evident that the baseline YOLOP-DeepSORT algorithm exhibited issues such as missed detections, false detections, and ID switches, resulting in suboptimal performance in nighttime pedestrian-detection and -tracking. In contrast, our proposed algorithm demonstrated superior performance in multi-target-tracking. The initial ID numbers assigned to pedestrian targets remained consistent throughout the tracking process. Even in cases of pedestrian occlusion or uneven lighting, where an ID may temporarily disappear, our algorithm was able to reassign the correct unique ID to the target by matching features in subsequent video frames. This indicates that the improved YOLOP-DeepSORT algorithm, when applied to nighttime pedestrian video-tracking scenarios, achieved excellent tracking results. Throughout the tracking process in a video sequence, the algorithm accurately located pedestrian targets in the video frames, with the tracking box size consistently matching the actual scale of the target. Furthermore, the algorithm demonstrated good real-time performance, with no target loss, effectively meeting the technical requirements for multi-target pedestrian-tracking in nighttime driver blind spot scenarios.