Article

PVNet: A Used Vehicle Pedestrian Detection Tracking and Counting Method

Haitao Xie, Zerui Xiao, Wei Liu and Zhiwei Ye
College of Computer Science, Hubei University of Technology, Wuhan 430000, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(19), 14326; https://doi.org/10.3390/su151914326
Submission received: 8 August 2023 / Revised: 12 September 2023 / Accepted: 20 September 2023 / Published: 28 September 2023

Abstract

Advances in technology have made people’s lives more prosperous. However, the increase in the number of cars and the emergence of autonomous driving technology have led to frequent road accidents. Manual observation of traffic conditions involves high labor intensity and low work efficiency, and it poses safety risks. This paper proposes a deep learning-based pedestrian-vehicle detection model to replace manual observation, overcoming human resource limitations and safety concerns. The model optimizes the darknet53 backbone feature extraction network, reducing parameters and improving feature extraction capabilities, making it more suitable for pedestrian-vehicle scenarios. In addition, the PVFPN multi-scale feature fusion method is used to facilitate information exchange between different feature layers. Finally, the Bytetrack method is used for target counting and tracking. The proposed model shows excellent performance in pedestrian-vehicle detection and tracking in traffic scenarios. The experimental results show that the improved model achieves an mAP@0.5 of 0.952 with only 32% of the parameters of YOLOv8s. Furthermore, the proposed PVNet model, combined with the Bytetrack method, maintains high detection accuracy and is applicable to pedestrian-vehicle detection and tracking in traffic scenarios. In summary, this paper discusses the traffic issues arising from technological development and presents the optimization and performance of the deep learning-based pedestrian-vehicle detection model, along with its potential applications in traffic scenarios.

1. Introduction

Global Traffic Accident Statistics: According to the World Health Organization (WHO), more than a million people are killed and tens of millions are injured in road traffic accidents globally every year, making road accidents one of the top ten leading causes of death worldwide. The major causes of road accidents include speeding, drunk driving, distracted driving, poor road conditions, non-compliance with traffic rules, and technical problems with vehicles. Traffic accidents result in the loss of a large number of lives, causing immense grief for individuals and families; moreover, many traffic accident survivors suffer serious injuries that can lead to permanent disability, medical expenses, and a long-term need for rehabilitation. Traffic accidents impose significant economic costs, including medical expenses, vehicle repair costs, legal fees, insurance costs, and lost productivity. Accidents negatively impact the mental health of participants and witnesses, which can lead to problems such as post-traumatic stress disorder (PTSD). Traffic accidents often lead to congestion on the roads, further impacting the flow of traffic in cities and regions and wasting time and resources.
In recent years, with the rapid development of artificial intelligence technology, deep learning [1] has been widely applied in various fields, including autonomous driving. The development of the Internet of Vehicles (IoV) [2,3] has become increasingly popular worldwide, combining the Internet of Things, intelligent transportation, and cloud computing [4]. Among IoV applications, autonomous driving [5,6,7] stands out as one of the most well-known and flourishing. Target detection is a crucial technology for autonomous driving vehicles’ environmental perception, and convolutional neural networks have become a powerful tool in the field of vehicle detection due to their strong feature extraction capabilities [8]. However, the complex traffic environment poses increasing challenges to the environmental perception of autonomous vehicles. Advances in computer vision and computing tools provide theoretical and technical support for autonomous environmental perception [9]. Hence, vision-based object detection is a key means for autonomous environmental perception [10,11], and its detection performance, including accuracy, speed, efficiency, and robustness, is vital for vehicle and pedestrian detection in traffic scenes [12,13].
The model proposed in this paper can help self-driving cars detect and recognize targets on the road, such as other vehicles and lane lines, in a timely manner. By monitoring its surroundings in real time, the vehicle can take steps to avoid potential collisions, such as braking automatically or performing evasive maneuvers. As self-driving technology becomes more widespread, pedestrians and bicycles are becoming important traffic participants on the road. Higher-accuracy detection models can ensure that vehicles recognize and respond to these non-motorized vehicles and pedestrians, improving their safety. Such models also help monitor traffic flow, identify congestion or accidents, and assist traffic management systems in taking appropriate measures to improve road access and reduce congestion, thereby improving overall safety. Vehicle-pedestrian detection models can continue to work in adverse conditions such as nighttime, rain, and snow, helping vehicles recognize targets on the road and improving the safety of drivers and passengers. In conclusion, the model proposed in this paper plays a key role in autonomous driving technology and traffic management, helping to improve the overall safety of road traffic and reduce the likelihood of accidents, especially in complex traffic environments and highly automated future transportation systems.
This paper further explores a low-power, high-precision, lightweight framework for pedestrian-vehicle detection to meet the objective requirements of target diversity in traffic scenarios. The model adopts the darknet53 backbone feature extraction network, making it more suitable for the specifics of traffic scenes. This article mainly focuses on the following points:
  • Model construction: The target detection model is divided into two parts: the multi-scale pedestrian-vehicle prediction part and the target tracking part.
  • Model structure design: The detection network is optimized to better suit the intelligent driving environment, thereby improving detection performance.
  • Data preprocessing: Data preprocessing includes two modules: dataset creation and data augmentation. The dataset is gathered from different environments using cameras, car-mounted cameras capturing driving records with frame segmentation techniques, and pedestrian-vehicle images from various public datasets, merging them to create a new dataset.
  • Model training and application: The model is trained and applied for target tracking.

2. Related Work

2.1. Object Detection

Object detection involves locating and classifying variable quantities of objects in an image and providing category and location information for multiple objects. Currently, deep learning-based object detection algorithms are mainly divided into two frameworks: two-stage and one-stage. Two-stage detection algorithms (e.g., the R-CNN series, SPP-Net, and Faster R-CNN [14]) achieve higher detection accuracy but suffer from low computational efficiency and poor real-time performance. In contrast, one-stage detection algorithms (e.g., the YOLO [15] series and SSD [16] series) treat detection as a regression problem, directly predicting object categories and positions. They are characterized by simplicity, efficiency, and strong real-time capabilities.
Therefore, one-stage object detection algorithms, such as the YOLO and SSD series, are lightweight detection networks that offer high detection efficiency and real-time performance, making them well-suited for deployment on embedded devices.

2.2. Pedestrian-Vehicle Detection

Research on pedestrian-vehicle detection follows the general approach of object detection. Due to the increasing number of vehicles and the advancement of autonomous driving technology, pedestrian-vehicle detection has developed rapidly, focusing on issues such as low power consumption, high accuracy, and lightweight solutions. Various methods have been proposed, including DATMO [17], RCF-Faster R-CNN [18], CA-MobileNetv2-YOLOv4 [19], YCD [8], YOLOv3-promote [20], and RFCC [21]. In summary, many methods have been proposed for pedestrian-vehicle detection, and these approaches hold practical value in addressing challenges related to low power consumption, high accuracy, and lightweight design.

3. Methodology

3.1. Convolution

The paper proposes a new convolutional approach, SMGConv (as shown in Figure 1), which generates the same number of feature maps as a regular convolutional layer, allowing easy replacement of the convolutional layer to reduce computational costs. Excellent neural network models produce information-rich feature maps after training; however, spatial compression and channel expansion of feature maps can lead to the loss of some semantic information. SMGConv introduces a max-pooling residual module to retain feature information at a lower time complexity and incorporates DWConv to generate more mappings, enhancing the network’s understanding of input data.
The SMGConv module effectively reduces computational costs while maintaining high feature information, enabling neural networks to process input data more efficiently and with a deeper understanding.
Usually, the time complexity of convolutional computations is defined by FLOPs. Thus, the time complexity (without bias) of Conv and SMGConv is:
$\mathrm{Time}(\mathrm{Conv}) \sim O\left(M^{2} K^{2} C_{in} C_{out}\right)$
$\mathrm{Time}(\mathrm{SMGConv}) \sim O\left(\frac{M^{2} K^{2} C_{in}}{4}\left(C_{out}+1\right)+\left(\frac{M}{S}\right)^{2}\right)$
where $M$ is the edge length of the output feature map, $K$ is the edge length of each convolution kernel, $C_{in}$ is the number of channels of each convolution kernel (i.e., the number of input channels, which equals the number of output channels of the previous layer), $C_{out}$ is the number of convolution kernels in this layer (i.e., the number of output channels), and $S$ is the stride of the pooling operation.
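To make the construction concrete, the following is a minimal PyTorch sketch of an SMGConv-style block. It only illustrates the ideas stated above (a cheap primary convolution, DWConv to generate additional mappings, and a max-pooling residual branch); the channel split, activation, and 1×1 projection in the residual branch are assumptions of this sketch rather than the paper's exact design (see Figure 1).

```python
import torch
import torch.nn as nn

class SMGConvSketch(nn.Module):
    """Illustrative SMGConv-style block (assumed structure, not the paper's exact design)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        assert c_out % 2 == 0, "sketch assumes an even number of output channels"
        c_mid = c_out // 2
        # primary convolution produces half of the output feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_mid, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        # depthwise convolution (DWConv) generates the remaining "cheap" mappings
        self.dw = nn.Sequential(
            nn.Conv2d(c_mid, c_mid, k, 1, k // 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        # max-pooling residual branch retains input feature information at low cost;
        # the 1x1 projection aligning the channel count is an assumption of this sketch
        self.pool_res = nn.Sequential(
            nn.MaxPool2d(3, stride=s, padding=1),
            nn.Conv2d(c_in, c_out, 1, bias=False),
        )

    def forward(self, x):
        y1 = self.primary(x)
        y2 = self.dw(y1)
        y = torch.cat([y1, y2], dim=1)   # same number of maps as a regular convolution
        return y + self.pool_res(x)      # max-pooling residual re-injects input features
```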

3.2. SMGBottleneck

Based on the idea of ResNet, we combined the residual structure with SMGConv to propose the SMGBottleneck (SBN) module. SBN utilizes the residual structure to enable the network to be deeper, converge faster, and optimize more easily, while having fewer parameters and lower complexity compared to previous models. The module is divided into S1-Bottleneck and S2-Bottleneck based on the stride size. Figure 2 shows the structures of S-Bottleneck with shortcut = true and shortcut = false, respectively. S1-Bottleneck (as shown in Figure 2a) contains the input feature matrix, followed by two SMGConv operations. The output feature matrix obtained is concatenated with the initial input feature matrix through the shortcut to obtain the final result. The only difference between S1-Bottleneck and S2-Bottleneck (as shown in Figure 2b) is whether the shortcut is used.
Assuming the input is $x$ and the two SMGConv layers learn the mappings $S_1$ and $S_2$, the Bottleneck’s residual function can be formally defined as $y = F(x, \{S_i\}) + x$, where the residual function of S1-Bottleneck can be represented as $y = S_2(S_1(x)) + x$ and the residual function of S2-Bottleneck can be represented as $y = S_2(S_1(x))$. The network degradation problem reflects the difficulty of fitting an identity mapping with a multilayer network, i.e., it is difficult for $S(x)$ to fit $x$. With the residual structure, however, an identity mapping becomes easy to fit: all the network parameters can be learned as 0, leaving only the cross-layer connection to carry the identity mapping. Therefore, when the network does not need to be so deep, more of the intermediate layers can act as identity mappings, and vice versa.
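The two bottleneck variants follow directly from these residual functions. Below is a minimal sketch that reuses the hypothetical SMGConvSketch block from the previous listing; it only encodes $y = S_2(S_1(x)) + x$ versus $y = S_2(S_1(x))$, not the exact layer widths of the paper.

```python
import torch.nn as nn

class SBNSketch(nn.Module):
    """SMGBottleneck sketch: shortcut=True -> S1-Bottleneck, shortcut=False -> S2-Bottleneck."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.s1 = SMGConvSketch(c, c)   # first SMGConv layer
        self.s2 = SMGConvSketch(c, c)   # second SMGConv layer
        self.shortcut = shortcut

    def forward(self, x):
        y = self.s2(self.s1(x))
        return y + x if self.shortcut else y   # y = S2(S1(x)) + x  or  y = S2(S1(x))
```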

3.3. SMB

Building on SBN, we designed the SMB module (as shown in Figure 3), drawing on CSPNet and ResNeXt. The main purpose of designing SMB is to enable this architecture to achieve richer gradient combinations while reducing computational complexity. After SMGConv, there might be some loss of feature information; therefore, we aim to enhance the model’s learning capacity while maintaining sufficient accuracy in a lightweight manner. Excessive computational bottlenecks lead to more cycles being required to complete the inference process. Therefore, we intend to evenly distribute the computational load across each layer of the model, effectively improving the utilization of each computing unit and reducing unnecessary energy consumption. SMB achieves this objective through the following steps: partitioning the feature maps of the base layer into two parts and then merging them using the proposed cross-stage hierarchical structure. SMB segments the gradient flow, allowing the gradients to propagate through different network paths. By switching between concatenation and transition steps, the propagated gradient information can have significant differences in correlation. The formula for SMB can be expressed as:
$y=\sum_{i=1}^{n} S_{j}^{i}\left(S_{j}^{i-1}\right)+x_{2}, \quad S_{j}^{0}=x_{1}$
where $S_{j}^{i}$ is the SBN module, $i$ indicates how many SBN modules are stacked, $j$ indicates whether the SBN module uses shortcuts (1 = True, 0 = False), and $x_{1}$, $x_{2}$ are the two parts into which the input features are split.
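The cross-stage structure can be sketched as follows, reusing SBNSketch from the previous listing. The split ratio, the merge by concatenation followed by a 1×1 transition, and the default depth are illustrative assumptions that follow the CSPNet-style split-and-merge idea described above rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SMBSketch(nn.Module):
    """CSP-style SMB sketch: one part of the channels flows through stacked SBN modules,
    the other part is carried across the stage, and the two are merged at the end."""
    def __init__(self, c, n=2, shortcut=True):
        super().__init__()
        self.half = c // 2
        self.blocks = nn.Sequential(*[SBNSketch(self.half, shortcut) for _ in range(n)])
        self.fuse = nn.Conv2d(c, c, 1, bias=False)   # transition after the cross-stage merge

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]   # split the base-layer feature maps
        y1 = self.blocks(x1)                          # gradient flow through the SBN path
        return self.fuse(torch.cat([y1, x2], dim=1))  # cross-stage merge of the two paths
```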

3.4. FPN

A lightweight backbone typically has fewer parameters and computational complexity, reducing the computational cost of the model. In the context of object detection tasks, particularly in embedded devices or real-time applications, low latency and high efficiency are crucial. A lightweight backbone can offer faster inference speeds without sacrificing too much performance.
Object detection tasks require detecting objects at different scales because objects can appear in various sizes and aspect ratios. Multi-scale feature fusion involves combining feature maps from different levels to help the model capture object information at different scales. This can enhance the detection performance of the model for both small and large objects.
The proposed PVFPN multi-scale feature fusion method in this paper effectively propagates and fuses feature information from high resolution to low resolution, thereby expanding the model’s perception range and improving object detection performance.
Multi-scale feature fusion contributes to improving the accuracy of object localization. In tasks such as vehicle and pedestrian detection, it is essential not only to detect the presence of vehicles or pedestrians but also to accurately localize their bounding boxes. By leveraging multi-scale features, the model can better understand the shape and position of objects, thus enhancing the accuracy of bounding box predictions. When performing detection on complex backgrounds, multi-scale feature fusion can help the model suppress background interference and better identify objects. By integrating information from multiple scales, the model can better differentiate between objects and background variations.
In summary, optimizing Darknet53 and employing the PVFPN multi-scale feature fusion are key techniques in PVNet. They enhance the model’s performance, efficiency, and robustness, making it suitable for various scenarios and applications. Therefore, they are often considered essential components for improving object detection models in practice.
Due to the inability of a single-stage feature map to effectively represent objects at various scales, a Feature Pyramid Network (FPN) is used to simultaneously represent objects at different scales by combining feature maps from different stages. The forward pass of the network returns C2, C4, C6, and C9, which are Feature Maps obtained after each pooling operation. Through a bottom-up pathway, PVFPN (as shown in Figure 4) generates four sets of Feature Maps. Shallow Feature Maps such as C2 contain more low-level information (textures, colors, etc.), while deep Feature Maps such as C9 contain more semantic information. To combine these four sets of Feature Maps, each with a preference for different features, PVFPN employs a top-down and lateral connection strategy, ultimately obtaining outputs P3, P4, and P5.
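A schematic of this top-down fusion is sketched below. It assumes standard FPN-style lateral 1×1 projections, nearest-neighbor upsampling, and element-wise addition; the channel widths and the exact fusion operations of PVFPN (Figure 4) are not reproduced, so the sketch only illustrates how C2, C4, C6, and C9 are combined into P3, P4, and P5.

```python
import torch.nn as nn
import torch.nn.functional as F

class PVFPNSketch(nn.Module):
    def __init__(self, chans=(64, 128, 256, 512), out_c=128):
        super().__init__()
        # lateral 1x1 convolutions project C2, C4, C6, C9 to a common width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_c, 1) for c in chans])
        # 3x3 convolutions smooth the fused maps
        self.smooth = nn.ModuleList([nn.Conv2d(out_c, out_c, 3, padding=1) for _ in chans])

    def forward(self, c2, c4, c6, c9):
        laterals = [l(f) for l, f in zip(self.lateral, (c2, c4, c6, c9))]
        # top-down pathway: upsample the deeper, more semantic maps and add them to the
        # shallower, higher-resolution maps via the lateral connections
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p3, p4, p5 = (s(l) for s, l in zip(self.smooth[1:], laterals[1:]))  # three outputs
        return p3, p4, p5
```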

3.5. Target Tracking

To achieve counting and tracking of pedestrians and vehicles, we utilized ByteTrack, a tracking method based on the tracking-by-detection paradigm. Most multi-object tracking methods obtain object IDs by associating detection boxes with scores higher than a certain threshold. However, this approach leads to significant issues, such as a large number of missed detections and fragmented trajectories, especially for objects with low detection scores, such as occluded targets, which are simply discarded.
ByteTrack, on the other hand, employs Byte data association. It first classifies each detection box into either the high-confidence group (for scores above the threshold) or the low-confidence group (for scores below the threshold). In the second step, it computes the similarity between the detection boxes and the estimated results from the Kalman filter. Then, based on the Hungarian algorithm, it matches the target trajectories and decides whether to retain the low-confidence detection boxes. In the third step, it associates the low-confidence detection boxes with the remaining trajectories and removes the unassociated detection boxes. In the fourth step, high-confidence detection boxes that have not been part of any trajectory are initialized as new trajectories.
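The following is a simplified sketch of this two-step Byte association, assuming IoU as the similarity measure and SciPy's Hungarian solver; Kalman prediction of the track boxes and track lifecycle management are omitted, and the threshold values are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(a, b):
    """Pairwise IoU between two lists of [x1, y1, x2, y2] boxes."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area = lambda x: np.prod(x[:, 2:] - x[:, :2], axis=1)
    return inter / (area(a)[:, None] + area(b)[None, :] - inter + 1e-9)

def match(track_boxes, det_boxes, iou_gate=0.3):
    """Hungarian matching on 1 - IoU; returns matched pairs and unmatched indices."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    cost = 1.0 - iou_matrix(track_boxes, det_boxes)
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_gate]
    un_t = [i for i in range(len(track_boxes)) if i not in {p[0] for p in pairs}]
    un_d = [j for j in range(len(det_boxes)) if j not in {p[1] for p in pairs}]
    return pairs, un_t, un_d

def byte_associate(tracks, dets, scores, score_thresh=0.6):
    high = [i for i, s in enumerate(scores) if s >= score_thresh]   # high-confidence group
    low = [i for i, s in enumerate(scores) if s < score_thresh]     # low-confidence group
    # step 1: match all track boxes against high-confidence detections
    pairs_hi, un_tracks, un_hi = match(tracks, [dets[i] for i in high])
    # step 2: match the remaining tracks against low-confidence detections (e.g., occluded
    # targets); low-confidence boxes that stay unmatched are discarded
    pairs_lo, _, _ = match([tracks[i] for i in un_tracks], [dets[i] for i in low])
    pairs_lo = [(un_tracks[t], low[d]) for t, d in pairs_lo]
    pairs_hi = [(t, high[d]) for t, d in pairs_hi]
    # step 3: unmatched high-confidence detections initialize new trajectories
    new_tracks = [dets[high[j]] for j in un_hi]
    return pairs_hi, pairs_lo, new_tracks
```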
The general equations for the linear Gaussian system of the Kalman filter are as follows:
Prediction equation:
In this pedestrian-vehicle tracking process, the camera is used for shooting; therefore, the displacement of the pedestrian-vehicle between each frame is very small. Assuming that the motion state of the pedestrian vehicle is uniform linear motion, the equation of its motion state is as follows:
$x_{k}=A_{k} x_{k-1}+\omega_{k-1}$
where $A$ is the displacement matrix of the pedestrian or vehicle under uniform linear motion, $k$ and $k-1$ denote the current frame and the previous frame, and $\omega$ is the offset generated during the motion.
The result of pedestrian-vehicle tracking is taken as the observation value of the current frame, and the observation equation is as follows:
$z_{k}=H x_{k}+v_{k}$
H is the measurement matrix, which computes the state-value-to-observation-value transitions, and v is the Gaussian-distributed observation noise with an expectation of zero.
The pedestrian-vehicle state is represented as follows:
$x=\left[p, q, r, h, v_{p}, v_{q}, v_{r}, v_{h}\right]^{T}$
where $p$ and $q$ are the horizontal and vertical coordinates of the center of the target box, $r$ is the bounding box aspect ratio, $h$ is the bounding box height, and $v_{p}, v_{q}, v_{r}, v_{h}$ are the velocities of the corresponding components.
Updating the equations:
Time update equation:
$\hat{x}_{k}^{-}=A \hat{x}_{k-1}+\omega$
$P_{k}^{-}=A P_{k-1} A^{T}+Q$
where $\hat{x}_{k}^{-}$ is the a priori estimate of the pedestrian-vehicle state at frame $k$, $P_{k}^{-}$ is the a priori estimate of the covariance at frame $k$, $\hat{x}_{k-1}$ and $P_{k-1}$ are the posterior estimates of the state and covariance at frame $k-1$, and $Q$ is the offset (process noise) covariance.
State update equation:
$K_{k}=P_{k}^{-} H^{T}\left(H P_{k}^{-} H^{T}+R\right)^{-1}$
$\hat{x}_{k}=\hat{x}_{k}^{-}+K_{k}\left(z_{k}-H \hat{x}_{k}^{-}\right)$
$P_{k}=\left(I-K_{k} H\right) P_{k}^{-}$
In the second step of Byte data association, the Kalman filter estimates are used to compute the similarity between the predicted trajectories and the detection boxes.
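For reference, a minimal NumPy sketch of this constant-velocity predict/update cycle for the 8-dimensional state is given below; the concrete Q and R values are illustrative assumptions.

```python
import numpy as np

dt = 1.0                                        # one frame between updates
A = np.eye(8)
A[:4, 4:] = dt * np.eye(4)                      # position components += velocity * dt
H = np.hstack([np.eye(4), np.zeros((4, 4))])    # observe [p, q, r, h] only
Q = 1e-2 * np.eye(8)                            # process (offset) noise covariance (illustrative)
R = 1e-1 * np.eye(4)                            # observation noise covariance (illustrative)

def predict(x, P):
    x_prior = A @ x                             # time update: prior state estimate
    P_prior = A @ P @ A.T + Q                   # prior covariance estimate
    return x_prior, P_prior

def update(x_prior, P_prior, z):
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x_prior + K @ (z - H @ x_prior)         # state update with the observed box
    P = (np.eye(8) - K @ H) @ P_prior           # covariance update
    return x, P

# usage: propagate a track one frame, then correct it with the detected box [p, q, r, h]
x, P = np.zeros(8), np.eye(8)
x_prior, P_prior = predict(x, P)
x, P = update(x_prior, P_prior, z=np.array([320.0, 240.0, 0.5, 80.0]))
```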

3.6. PVNet

Darknet-53 is a Convolutional Neural Network (CNN) architecture in the Darknet framework that consists mainly of convolutional layers and residual connectivity, between which a Max-Pooling layer is used to reduce the size of the feature map. This helps in reducing the computational complexity and extracting features of different sizes. A Batch Normalization layer is used after the convolution to normalize the intermediate features in the neural network to accelerate convergence and improve training stability. Often, a leaky ReLU activation function (with a small negative slope) is used after batch normalization to introduce nonlinearity and increase the expressive power of the network.
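The basic unit described above can be written compactly as a Conv + BatchNorm + LeakyReLU block; the kernel size and negative slope below are typical values, not necessarily the paper's exact settings.

```python
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k=3, s=1):
    """Darknet-style basic unit: convolution, batch normalization, leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),  # convolution (bias folded into BN)
        nn.BatchNorm2d(c_out),                              # normalize intermediate features
        nn.LeakyReLU(0.1, inplace=True),                    # small negative slope for nonlinearity
    )
```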
Based on YOLO, this paper retains three YOLO output layers, reconstructs the entire model’s backbone network, and introduces the new lightweight feature extraction module composed of SMB modules using SMGConv. Additionally, the neck network is restructured by employing local cross-channel and non-reduction channel interaction methods to eliminate redundant features, enhance essential information features, and improve the detection capability for pedestrians and vehicles, making it suitable for traffic detection in various environments and named PVNet (as shown in Figure 5).

4. Experimental Results and Analysis

4.1. Description of the Experimental Environment and Parameters

In this experiment, the PyTorch framework was utilized to construct and train the network. The training and testing of the results were performed within this framework. A tower GPU professional workstation was used for neural network training, equipped with an Intel (R) Core (TM) i7-10700 CPU @ 2.90 GHz processor and 16 GB of RAM. The configuration of the machine used for testing was an Intel Xeon (R) Silver 4210R @ 2.40 GHz processor, 128 GB of RAM, and an NVIDIA Tesla M40 24 GB graphics card.
During the training process, the image input size, batch size, and number of epochs were set to 640 × 640, 64, and 300, respectively. The initial learning rate was set to 0.01 and then reduced from 0.01 to 0.0001 using cosine annealing. The loss values were recorded for each iteration, and the training results of PVNet are shown in Figure 6.
The loss convergence curves show that the training loss and validation loss continue to decrease and eventually converge to a minimum. The absence of divergence and overfitting shows the effectiveness of PVNet.
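The schedule above (300 epochs, learning rate annealed from 0.01 to 0.0001 with cosine annealing) can be reproduced with a few lines of PyTorch; the optimizer choice and its momentum/weight-decay values below are assumptions, since this section does not state them.

```python
import torch

def build_optimizer_and_scheduler(model, epochs=300, lr0=0.01, lr_final=1e-4):
    # SGD is assumed here; momentum and weight decay are illustrative values
    opt = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.937, weight_decay=5e-4)
    # cosine annealing from lr0 down to lr_final over the training epochs
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=lr_final)
    return opt, sched

# per-epoch loop (images resized to 640 x 640, batch size 64):
# for epoch in range(300):
#     train_one_epoch(model, loader, opt)   # hypothetical training helper
#     sched.step()
```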

4.2. Dataset

Data are the key to deep learning model training. Insufficient training data or an unbalanced data distribution may limit model performance. Collecting and labeling large-scale, high-quality data is often an expensive and time-consuming task.
This study relies on two main datasets: the MIT-CBCL pedestrian database and the KITTI dataset, in addition to data collected using in-vehicle recorders and surveillance cameras across various scenarios. Following data collection, a thorough screening and preprocessing phase is applied, resulting in the creation of a novel dataset encompassing pedestrians and vehicles. The dataset’s construction process primarily involves three stages: data collection, screening, and preprocessing. Subsequently, the developed model is rigorously tested and evaluated using this newly curated dataset to ascertain its performance across diverse scenarios and conditions.

4.3. Comparison Experiment

In pedestrian-vehicle detection, this paper uses a two-dimensional Gaussian kernel to represent key points. The center coordinates of the bbox are rounded to determine the center of the Gaussian circle, and the size of the bbox is used to determine the radius of the Gaussian circle. Substituting into the Gaussian formula yields values ranging from 0 to 1: the value at the center is the highest and decreases outward along the radius, so in the image the center point is the brightest and becomes darker outward along the radius. In the heatmap, areas outside the Gaussian kernel are set to 0; the heatmap represents classification information, with each category having a corresponding heatmap. In the pedestrian-vehicle experiment, there are two categories: “people” and “car.” To validate that the model designed in this paper is more sensitive to pedestrians and vehicles, Figure 7 shows a comparison between this paper’s model and YOLOv8s. When detecting the “people” category (as shown in Figure 7b), this paper’s model focuses more on pedestrians, and when detecting the “car” category (as shown in Figure 7c), this paper’s model is also more effective.
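Rendering such a per-class heatmap can be sketched as below. The radius rule used here (a fraction of the shorter box side) is a common CenterNet-style choice and is an assumption, not the paper's exact formula.

```python
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Write a 2D Gaussian (peak 1 at the rounded center, decaying outward) into `heatmap`."""
    sigma = max(radius / 3.0, 1.0)
    x0, y0 = int(round(center[0])), int(round(center[1]))
    h, w = heatmap.shape
    ys, xs = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(xs * xs + ys * ys) / (2 * sigma * sigma))   # values in (0, 1], max at center
    top, bottom = max(0, y0 - radius), min(h, y0 + radius + 1)
    left, right = max(0, x0 - radius), min(w, x0 + radius + 1)
    gy0, gx0 = top - (y0 - radius), left - (x0 - radius)
    patch = g[gy0:gy0 + bottom - top, gx0:gx0 + right - left]
    # keep the maximum where Gaussians from several objects overlap
    np.maximum(heatmap[top:bottom, left:right], patch, out=heatmap[top:bottom, left:right])
    return heatmap

# one heatmap per category ("people", "car"); the box size sets the radius
heat = {c: np.zeros((160, 160), np.float32) for c in ("people", "car")}
draw_gaussian(heat["car"], center=(80, 90), radius=max(1, int(0.3 * min(40, 24))))
```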
To validate the effectiveness of the model proposed in this paper, comparative experiments were conducted in the same environment, using equal-sized training, validation, and testing datasets, along with several popular models from the YOLO family. The evaluation was based on two performance metrics: mAP@0.5 and parameters. The experimental results are presented in Table 1. It is observed that the algorithm designed in this paper significantly improves the detection of pedestrians and vehicles, demonstrating its effectiveness in accurately detecting them across diverse scenarios. The “ours” algorithm achieved an mAP@0.5 of 95.2%. Compared to YOLOv8n, our model demonstrated superior detection performance while maintaining a similar parameter count. Moreover, compared to YOLOv8s, YOLOv5s, and YOLOv7n, our algorithm outperformed them in terms of both precision and parameter efficiency.
Overall, our algorithm exhibits excellent performance in both detection accuracy and lightweight design, ensuring accurate discrimination between pedestrians and vehicles in complex road environments. Additionally, to provide a more intuitive understanding of the detection differences between various algorithms, the results for mAP@0.5 and mAP@0.5:0.95 are visualized in Figure 8.
There are various current object detection networks. To better verify the effectiveness and scientific soundness of the proposed algorithm, this paper compares the PVNet algorithm with object detection algorithms from recent years. The comparison results are shown in Table 2. The focus is on the average class accuracy and the number of model parameters.
These results show that our model achieves excellent performance while being lightweight and outperforms the above architectures. Thus, our results confirm the superiority of our proposed model.
For further comparison experiments, this paper randomly selected images from the testing dataset and tested them using both YOLOv8s and our model. Figure 9 showcases some of the detection results obtained from this experiment.
From Figure 9, it can be observed that pedestrian detections are marked with deep red borders, while vehicle detections are indicated with light red borders. The vertical axis represents the detection models used, and the horizontal axis represents different traffic scenarios in various environments. Detections missed by YOLOv8s are framed with red circles. Figure 9A shows a street with high pedestrian flow under strong lighting conditions; our model demonstrates higher accuracy than YOLOv8s and successfully detects small, distant pedestrian targets that YOLOv8s misses. Figure 9B examines a busy traffic environment, also under strong lighting conditions, where our model outperforms YOLOv8s in detecting vehicles near the edges. Figure 9C showcases a complex and well-lit traffic scenario in which our model exhibits excellent detection performance. Figure 9D considers vehicles in high-speed motion. Lastly, Figure 9E presents a nighttime traffic environment in which YOLOv8s fails to detect a distant vehicle. Based on the detections in the various scenes mentioned above, it is evident that our model is better suited for pedestrian-vehicle detection in complex traffic environments.

4.4. Ablation Experiment

To validate the effectiveness of different components in the model, this paper conducted ablation experiments. The results are presented in Table 3, and the visualization of the training results after 300 epochs is shown in Figure 10.
A (SMGConv): In this study, we designed a novel convolution, SMGConv, as an alternative to regular convolutions to extract features, aiming to reduce model network parameters, lower computational complexity, and achieve lightweight model requirements. SMGConv efficiently reduces the model’s parameters and the number of floating-point operations.
B (SMB): SMB is the cross-stage bottleneck module built on SMGConv. After passing through SMGConv, the feature information may suffer some loss. Thus, we aim to aggregate and optimize the obtained features to achieve higher detection accuracy and preserve essential feature information to the maximum extent.
C (PVFPN): PVFPN performs an effective bi-directional cross-scale fusion of features, enabling the fusion of features with different resolutions. It significantly enhances the efficiency of extracting critical features while preserving more information during the fusion process. This not only improves accuracy but also effectively removes noise and redundant information.
The ablation experiments conducted on these components provide insights into their respective contributions to the overall model performance. It is evident that each component plays a crucial role in achieving a lightweight yet accurate pedestrian-vehicle detection model.

4.5. Target Tracking

To validate the effectiveness of the pedestrian-vehicle counting and tracking method proposed in this paper, we conducted video collection of pedestrians and vehicles in urban and campus traffic environments using onboard recorders and cameras. Subsequently, tracking experiments were performed, and the results are shown in Figure 11.
From Figure 11, it is evident that the tracking performance of the detection model proposed in this paper is satisfactory, with both pedestrian-vehicle tracking and counting clearly displayed. Vehicle tracking is represented by light red bounding boxes, with the left side indicating the target count and the right side showing the accuracy. Pedestrian tracking is denoted by deep red bounding boxes, with the left side representing the target count and the right side displaying the accuracy. The sequence of images is arranged from left to right and top to bottom. By observing the image numbers, it can be seen that the model proposed in this paper exhibits excellent tracking and counting capabilities in complex traffic environments, adequately meeting the requirements for pedestrian-vehicle tracking and counting in the current complex operating environments.

5. Discussion

Based on the improved YOLO network, this paper proposes a pedestrian-vehicle detection method called PVNet. The Bytetrack algorithm is employed for pedestrian-vehicle tracking and counting.
In this study, a novel feature extraction method is introduced, effectively reducing the model’s parameters and computational complexity. Additionally, the proposed cross-layer multi-scale neck structure further enhances feature fusion and representation. Comparison of the final experimental results shows that our model achieves higher detection accuracy, meeting the requirements for pedestrian-vehicle detection accuracy in current complex operating environments.
This paper focuses on addressing various challenges caused by multi-target scenarios and complexities in different driving environments, such as missed detections, false alarms, and tracking failures. It successfully addresses these issues arising from multiple factors that lead to multi-target tracking failures. The research outcomes of this paper hold significant reference value for the theoretical research and development of intelligent transportation systems, including applications in autonomous driving and vehicle-pedestrian collision avoidance.
The research in this paper focuses on the performance and improvement of the YOLO family of algorithms, so comparisons with algorithms outside this family are not the focus of this study. We believe that the YOLO family is most consistent with the problem and application scenario under study, and we therefore chose to compare against it; this consistency makes the research more relevant and interpretable. However, the fact that our work may not have covered all object detection algorithms is one of the limitations of our study, and future research may extend the comparison to other algorithms for a more comprehensive understanding of the state of the art in the vehicle-pedestrian domain.
In complex traffic scenarios, such as congested roads in large cities, the interaction between vehicles and pedestrians is very complicated, and current detection systems may have reduced accuracy in these situations. Adverse weather conditions, such as dense fog or smog, can also affect the detection performance of the model.
In the future, we will continue to improve PVNet to enhance detection performance, train with larger and more diverse datasets to increase the robustness and generalization ability of the model, and ensure the detection performance of our model in bad weather environments by studying air pollution detection models with excellent performance, such as DCNN [26] and VMFS [27].
Overall, vehicle and pedestrian target detection is an evolving field, and the future will focus on improving performance, robustness, and real-time performance to meet the growing demand for autonomous driving and traffic regulation.

Author Contributions

Conceptualization, H.X.; methodology, Z.X.; investigation, W.L.; validation, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant No. 62376089 and the Key Projects of the Hubei Provincial Department of Education under Grant No. D20161403.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

In this article, the data set PD-2022 was created based on all kinds of public data sets, such as the MIT-CBCL Pedestrian Database and the CUHK Square Dataset.

Conflicts of Interest

The authors declared no potential conflict of interest with respect to the research, authorship, and/or publication of this article.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  2. Al-Qerem, A.; Alauthman, M.; Almomani, A.; Gupta, B.B. IoT transaction processing through cooperative concurrency control on fog–cloud computing environment. Soft Comp. 2020, 24, 5695–5711. [Google Scholar] [CrossRef]
  3. Gupta, B.B.; Quamara, M. An overview of Internet of Things (IoT): Architectural aspects, challenges, and protocols. Concurr. Comput. Pract. Exp. 2020, 32, e4946. [Google Scholar] [CrossRef]
  4. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  5. Hussain, M.D.M.; Beg, M.M.S. Using vehicles as fog infrastructures for transportation cyber-physical systems (T-CPS): Fog computing for vehicular networks. Int. J. Softw. Sci. Comput. Intell. 2019, 11, 47–69. [Google Scholar] [CrossRef]
  6. Ahuja, S.P.; Wheeler, N. Architecture of fog-enabled and cloud-enhanced internet of things applications. IJCAC 2020, 10, 1–10. [Google Scholar] [CrossRef]
  7. Sejdiu, B.; Ismaili, F.; Ahmedi, L. Integration of semantics into sensor data for the IoT: A systematic literature review. Int. J. Semant. Web Inf. Syst. 2020, 16, 1–25. [Google Scholar] [CrossRef]
  8. Zhang, Y.; Zhou, A.; Zhao, F.; Wu, H. A Lightweight vehicle-pedestrian detection algorithm based on attention mechanism in traffic scenarios. Sensors 2022, 22, 8480. [Google Scholar] [CrossRef]
  9. Meng, C.C.; Bao, H.; Ma, Y. Vehicle Detection: A Review. In Proceedings of the 3rd International Conference on Computer Information Science and Application Technology (CISAT), Electr Network, Dali, China, 17 July 2020. [Google Scholar]
  10. Al-qaness, M.A.A.; Abbasi, A.A.; Fan, H.; Ibrahim, R.A.; Alsamhi, S.H.; Hawbani, A. An improved YOLO-based road traffic monitoring system. Computing 2021, 103, 211–230. [Google Scholar] [CrossRef]
  11. Duv, L.Y.; Chen, X.J.; Pei, Z.H.; Zhang, D.H.; Liu, B.; Chen, W. Improved Real-Time Traffic Obstacle Detection and Classification Method Applied in Intelligent and Connected Vehicles in Mixed Traffic Environment. J. Adv. Transp. 2022, 2022, 2259113. [Google Scholar]
  12. Zhou, Y.; Wen, S.; Wang, D.; Meng, J.; Mu, J.; Irampaye, R. MobileYOLO: Real-Time Object Detection Algorithm in Autonomous Driving Scenarios. Sensors 2022, 22, 3349. [Google Scholar] [CrossRef] [PubMed]
  13. Liu, H.; Sun, F.; Gu, J.; Deng, L. SF-YOLOv5: A Lightweight Small Object Detection Algorithm Based on Improved Feature Fusion Mode. Sensors 2022, 22, 5817. [Google Scholar] [CrossRef] [PubMed]
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
  15. Bochkovskiy, A.; Wang, C.Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  17. Mendes, A.; Bento, L.C.; Nunes, U. Multi-target detection and tracking with a laser scanner. In Proceedings of the IEEE Intelligent Vehicles Symposium, 2004, Parma, Italy, 14–17 June 2004. [Google Scholar]
  18. Wang, Z.; Miao, X.; Huang, Z.; Luo, H. Research of target detection and classification techniques using millimeter-wave radar and vision sensors. Remote Sens. 2021, 13, 1064. [Google Scholar] [CrossRef]
  19. Chen, X.; Jia, Y.; Tong, X.; Li, Z. Research on Pedestrian Detection and DeepSort Tracking in Front of Intelligent Vehicle Based on Deep Learning. Sustainability 2022, 14, 9281. [Google Scholar] [CrossRef]
  20. Xu, H.; Guo, M.; Nedjah, N.; Zhang, J.; Li, P. Vehicle and pedestrian detection algorithm based on lightweight YOLOv3-promote and semi-precision acceleration. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19760–19771. [Google Scholar] [CrossRef]
  21. Zhang, X.; He, L.; Lv, R.; Jin, C.; Wang, Y. Infrastructure 3D Target detection based on multi-mode fusion for intelligent and connected vehicles. IEEE Access 2023, 11, 72803–72812. [Google Scholar] [CrossRef]
  22. Liu, J.; Cai, Q.; Zou, F.; Zhu, Y.; Liao, L.; Guo, F. BiGA-YOLO: A Lightweight Object Detection Network Based on YOLOv5 for Autonomous Driving. Electronics 2023, 12, 2745. [Google Scholar] [CrossRef]
  23. He, Q.; Xu, A.; Ye, Z.; Zhou, W.; Cai, T. Object Detection Based on Lightweight YOLOX for Autonomous Driving. Sensors 2023, 23, 7596. [Google Scholar] [CrossRef] [PubMed]
  24. Shi, P.; Li, L.; Qi, H.; Yang, A. Mobilenetv2_CA Lightweight Object Detection Network in Autonomous Driving. Technologies 2023, 11, 47. [Google Scholar] [CrossRef]
  25. Wang, X.; Hua, X.; Xiao, F.; Li, Y.; Hu, X.; Sun, P. Multi-Object Detection in Traffic Scenes Based on Improved SSD. Electronics 2018, 7, 302. [Google Scholar] [CrossRef]
  26. Gu, K.; Xia, Z.; Qiao, J.; Lin, W. Deep Dual-Channel Neural Network for Image-Based Smoke Detection. IEEE Trans. Multimed. 2020, 22, 311–323. [Google Scholar] [CrossRef]
  27. Gu, K.; Zhang, Y.; Qiao, J. Vision-Based Monitoring of Flare Soot. IEEE Trans. Instrum. Meas. 2020, 69, 7136–7145. [Google Scholar] [CrossRef]
Figure 1. SMGConv.
Figure 2. Structure of SMGBottleneck. (a) S1-Bottleneck; (b) S2-Bottleneck.
Figure 3. Structure of SMB.
Figure 4. PVFPN.
Figure 5. PVNet.
Figure 6. Training curve.
Figure 7. Heat map of this paper’s model with YOLOv8s. (a) original; (b) people; (c) car.
Figure 8. Model training results for mAP@0.5 and mAP@0.5:0.95.
Figure 9. Detection results of the model in this paper and YOLOv8s. (A) high pedestrian flow street under strong lighting conditions; (B) busy traffic environment also under strong lighting conditions; (C) complex and well-lit traffic scenario; (D) vehicles in high-speed motion; (E) nighttime traffic environment.
Figure 10. Visualization of ablation experiment results (basic is the original model, YOLOv8s).
Figure 11. Target tracking results.
Table 1. Comparison of experimental results with multiple detection models.

Detection Model | Para/m | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | GFLOPs
YOLOv5s | 7.02 | 0.897 | 0.86 | 0.932 | 0.723 | 16.0
YOLOv7n | 6.01 | 0.889 | 0.84 | 0.918 | 0.695 | 13.2
YOLOv8n | 3.01 | 0.904 | 0.871 | 0.936 | 0.721 | 8.2
YOLOv8s | 11.1 | 0.907 | 0.868 | 0.942 | 0.749 | 28.6
ours | 3.62 | 0.913 | 0.88 | 0.952 | 0.779 | 15.3
Table 2. Comparison results of target detection algorithms in recent years.

Detection Model | mAP@0.5 | Para/m | GFLOPs
BiGA-YOLO [22] | 0.922 | 11.8 | 13.8
SSD | 0.743 | 13.1 | 9.35
Fast-RCNN | 0.769 | 16.01 | 22.34
ShuffYOLOX [23] | 0.922 | 35.43 | 89.99
MobileNetv2_CA [24] | 0.953 | 39.1 | -
M5 [25] | 0.918 | 7.13 | 5.19
ours | 0.952 | 3.62 | 15.3
Table 3. Results of ablation experiments (√ indicates that the component is used).

Model | A | B | C | P | R | mAP@0.5 | mAP@0.5:0.95 | P/m | G
Basic | | | | 0.907 | 0.868 | 0.942 | 0.749 | 11.1 | 28.6
Basic | √ | | | 0.901 | 0.884 | 0.946 | 0.762 | 9.28 | 26.6
Basic | √ | √ | | 0.879 | 0.89 | 0.948 | 0.769 | 5.57 | 17.5
Basic | √ | √ | √ | 0.913 | 0.88 | 0.952 | 0.779 | 3.62 | 15.3
