1. Introduction
Road traffic accidents are one of the most serious public health concerns, as they result in numerous injuries and deaths worldwide. Surveys indicate that human error is the main cause of most vehicular accidents [1]. Drivers need to be highly focused and always pay attention to changes in the surrounding environment while driving a conventional car. However, the judgment of drivers can be affected by many factors, such as fatigue, noise, and weather, which contribute to the risk of road traffic accidents. With the arrival of the information age, Cooperative Intelligent Transportation Systems (C-ITS) [2] have emerged. The vehicle–road cooperation system mainly equips vehicles and roadsides with advanced sensors to sense the road traffic environment in real time. With the availability of cooperative information regarding vehicles and road conditions, human error can be effectively avoided. Hence, vehicle–road cooperation has attracted increasing attention over the past few years.
The vehicle–road cooperative system contains two parts, i.e., the perception part and the cooperation part. The perception part relies primarily on roadside perception, with vehicle perception as a supplement. The cooperation part uses Cellular Vehicle-to-Everything (C-V2X) communication technology to realize the cooperative and integrated sensing of the road. The intelligent perception system is an important prerequisite for autonomous driving [3]. Vehicle perception mainly equips autonomous vehicles with sensors so that driving decisions can be made automatically by analyzing the acquired road and obstacle information [4]. However, there are obvious shortcomings in relying on vehicle perception alone [5]. Firstly, vehicle-mounted sensors cannot achieve multilevel, long-distance perception of the environment and suffer from perceptual blind spots, which makes it difficult to obtain effective traffic information at long distances. Secondly, perception vehicles need to be equipped with multiple sensors (e.g., LiDAR, millimeter-wave radar, high-definition cameras, among others) to obtain more comprehensive environmental information, which requires a more complex and expensive system. Finally, intelligent and nonintelligent vehicles will coexist for a long time to come, and the penetration rate of perception systems on intelligent vehicles is still relatively low. To compensate for the limitations of vehicle perception, combining it with roadside perception to sense road conditions is an efficient approach. Therefore, it is necessary to study the roadside perception technology of the vehicle–road cooperative system.
The main purpose of roadside perception is to improve the over-the-horizon perception capability of intelligent vehicles [6]; it can expand the perception range of intelligent networked vehicles and provide early warnings. As the main method of roadside perception, visual target detection aims to identify traffic objects on the road and thereby visualize the current road conditions. However, because of the wide roadside field of view, the images obtained by roadside visual sensors contain a large number of small-scale targets. Traditional target detection models extract limited information from shallow features [7,8], which makes it difficult to accurately classify and precisely localize small-scale targets. Especially in complex road conditions, the overlap and occlusion of dense objects are more pronounced, which leads to higher rates of missed and false detections. Moreover, the scale of roadside objects varies greatly due to the different installation heights of the visual sensors. Finally, the information sensed by the roadside perception system needs to be transmitted to the vehicle for decision and control via wireless communication technology [9], which imposes strict requirements on the real-time performance and efficient deployment of detection algorithms. Hence, designing an effective and efficient algorithm for roadside object detection is a pressing problem.
With the rapid development of deep learning in machine vision, deep-learning-based target detection algorithms are mainly classified into two categories. One is the two-stage object detection algorithms represented by SPP-Net [10] and the R-CNN series [11,12,13], which cannot meet real-time requirements in terms of detection speed due to structural limitations. The other is the one-stage object detection algorithms represented by SSD [14] and the YOLO series [15,16,17]; by performing classification and regression at the same time as generating boxes, they significantly improve detection speed. A large amount of research has been carried out in the field of road traffic vision. To analyze the movement behavior of traffic objects, Murugan et al. [18] used the two-stage object detection algorithm R-CNN for vehicle identification in a traffic monitoring system. Due to the poor performance of the R-CNN algorithm for small-target detection, Liang et al. [19] proposed a Sparse R-CNN combining a coordinate attention mechanism with ResNeSt, achieving better traffic sign detection performance on the edge devices of self-driving vehicles. The R-CNN series algorithms have some advantages in detection accuracy, but their detection speed is lower than that of YOLO. The YOLO series formulates detection as a regression problem, which makes the algorithm easier to generalize and offers a better balance between accuracy and speed. To address the poor detection of small targets in autonomous driving, Benjumea et al. [20] improved the small-target feature extraction ability of YOLOv5 by redirecting the feature maps delivered to the Neck network and Head layer. Du et al. [21] used YOLOv5 for pavement defect detection and introduced the BiFPN structure and Varifocal Loss into the algorithm, which effectively improved the performance of pavement defect detection.
As mentioned above, many methods achieve good target detection performance in the field of intelligent transportation, such as traffic sign detection, vehicle detection, and pavement defect detection. However, these models are difficult to deploy on roadside edge devices due to their increasing network complexity. To address the problems of existing models, this paper builds on the current lightweight and flexible detector YOLOv5. According to the characteristics of target detection in the roadside view, this paper optimizes YOLOv5 and proposes the roadside object detection algorithm RD-YOLO. To achieve real-time target recognition and the efficient deployment of an object detection algorithm for roadside perception systems, the following work is carried out in this paper:
- (1)
Based on the unique characteristics of roadside images, this study proposes the RD-YOLO roadside object detection algorithm by optimizing the network structure, channels, and parameters. Compared with the latest roadside target detection algorithms, the proposed model offers clear improvements in speed and accuracy, while the model size is significantly reduced, facilitating deployment on edge devices.
- (2)
To address the high complexity and poor small-scale target detection performance of current algorithms, we reconstructed the feature fusion layer by adding a 4× downsampling feature fusion layer and removing the 32× downsampling feature fusion layer, which improves small-scale object detection accuracy and compresses the size of the model.
- (3)
We replaced the original pyramid network with GFPN to deal with the large scale variance of objects, which improves the aggregation capability of multiscale features and the adaptability of the network to features of different scales.
- (4)
To address the poor performance of small-target detection in object-dense scenarios, we integrated the CA attention mechanism into the YOLOv5s Backbone network to enhance important channel and spatial feature information, enabling important information to be accurately localized and identified.
- (5)
To address the slow convergence and inaccurate regression results in small-target detection, we improved the loss function of the YOLOv5s prediction head to accelerate learning on high-confidence targets, which effectively improves the speed of bounding box regression and the positioning accuracy of the anchor box.
2. Related Work
This section describes the limitations of roadside object detection in detail. By analyzing the existing object detection algorithms, YOLOv5 is chosen as the benchmark model. Finally, the YOLOv5 network structure is introduced and its current problems are analyzed.
2.1. Roadside Object Detection
Roadside cameras are installed in varied road environments and capture scenes under diverse conditions, including daytime, nighttime, overlapping, occluded, and dense scenes. As shown in Figure 1, roadside object detection faces three problems. First, the object scale varies greatly because of the different installation heights of the roadside cameras. Second, in rainy and nighttime environments, the captured images tend to contain more blurred targets; especially under complex traffic conditions, the overlap and occlusion of dense objects are more pronounced. Third, because of the wide roadside field of view, the roadside images contain a large number of small-scale targets that are not easily identified. These three problems make roadside target detection very challenging.
Currently, the mainstream object detection algorithms are built on and improve convolutional neural networks. Although a convolutional neural network can effectively extract feature information, the locality of the convolution operation limits its ability to capture global context information. Some scholars have explored new methods. For multitarget detection in traffic scenes, Wang et al. [22] proposed a novel detection framework, AP-SSD, by introducing a feature extraction convolutional kernel library that enables the network to select convolution kernels adaptively. Although it improves detection accuracy, it also increases the computational cost. Zhu et al. [23] proposed a multisensor, multilevel enhanced convolutional network structure (MME-YOLO) based on YOLOv3, which effectively improved the average detection accuracy on the UA-DETRAC dataset by adding cross-level attention blocks and an image composite module. To improve the target detection capability and efficiency of detectors for autonomous driving, Cai et al. [24] introduced the CSPDarknet53 structure and five-scale detection layers into the YOLOv4 algorithm to improve detection accuracy, and lightweight modification of the model by network pruning effectively improved the inference speed. To solve the problems of false and missed detections of small targets in complex traffic scenes, Li et al. [25] proposed a method that combines depth information obtained by the end-to-end PSMNet with the YOLOv5s target detection algorithm to improve the feature extraction ability for small targets, which improves the detection accuracy of small-scale targets.
Although the emergence of large neural networks improves detection performance, it also raises the problem of efficiency. Because roadside object detection algorithms need to be deployed on edge computing devices, complex network models are not suitable for roadside perception systems with relatively limited computational power. Therefore, this paper proposes an improved object detection model, RD-YOLO, to address the difficulty of simultaneously improving the speed and accuracy of roadside object detection in the presence of small targets, complex backgrounds, and limited feature extraction.
2.2. YOLOv5 Network Structure
The YOLO series is a family of target detection algorithms based on deep learning and convolutional neural networks, with the advantages of fast inference, high detection accuracy, and real-time operation. With the rapid development of deep learning in machine vision, the YOLOv5 algorithm has emerged. Compared to previous generations of the algorithm, YOLOv5 has higher accuracy, faster speed, and a smaller size. YOLOv5 contains four derived models, namely YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. They share the same model architecture, but the model depth and width increase sequentially. As the smallest and fastest of the four, YOLOv5s was chosen as the benchmark model to build the roadside object detection algorithm. The YOLOv5s network structure is shown in Figure 2.
The YOLOv5s framework consists of four parts: the Input layer, Backbone network, Neck network, and Head layer. The main purpose of the Input layer is to perform preprocessing operations on the input images. The Input layer contains mosaic data enhancement, adaptive anchor box calculation, and adaptive image scaling, which effectively improve the efficiency of extracting features from the input images. YOLOv5 employs the Focus module, CSP structure [26], and spatial pyramid pooling (SPP) module [27,28] as the Backbone, which reduces the large amount of computation the detection algorithm requires during inference. The Neck network consists of a feature pyramid network (FPN) and a path aggregation network (PANet); the high-level features and the outputs of different CSP layers are aggregated by the top-down pathway, and then the shallow features are aggregated by the bottom-up pathway, which fully integrates the image features of different layers. As the final detection stage, the Head layer contains the bounding box loss function and non-maximum suppression (NMS) [29]. The output of the Neck network is fed into the Head layer to generate prediction boxes, and the locally redundant prediction boxes are then filtered out by the NMS operation to obtain the final detection results.
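For illustration, the following is a minimal PyTorch sketch of two of the Backbone components named above, the Focus slicing operation and a generic SPP block; the class names, channel arguments, and activation choices are simplified assumptions rather than the exact official YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Space-to-depth slicing at the start of the backbone: every 2x2 pixel
    block is rearranged into the channel dimension, halving spatial
    resolution without discarding information, then fused by a convolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in * 4, c_out, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2) via interleaved sub-sampling
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pools with different kernel
    sizes enlarge the receptive field; results are concatenated channel-wise."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, 1, bias=False)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels)
        self.fuse = nn.Conv2d(c_hidden * (len(kernels) + 1), c_out, 1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```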
2.3. Problems of YOLOv5 Algorithm
As a one-stage object detection algorithm, YOLOv5 offers a significant improvement in detection speed compared to two-stage object detection algorithms such as Faster R-CNN and Mask R-CNN [30], meeting the requirement of real-time detection. However, its detection accuracy and model size still need to be improved for practical applications with complex backgrounds.
Although YOLOv5s, the smallest of the YOLOv5-derived models, greatly simplifies the network structure by reducing its depth and width, this also reduces detection accuracy. YOLOv5s contains multiple convolutional modules, so the size of the feature map decreases as the number of downsampling convolution layers increases during feature extraction. Consequently, it is difficult for YOLOv5s to accurately classify and precisely localize small-scale targets because few shallow features are extracted, which leads to false and missed detections of small-scale targets.
The YOLOv5 Neck network, with its combined FPN + PANet structure, focuses on deep feature fusion, which weakens small-target detection and indirectly introduces interference noise through upsampling operations. However, in roadside object detection, the timely detection of small or distant targets is important for safe driving. Therefore, this paper improves the model and further enhances its roadside object detection capability.
3. Proposed Method
This section describes in detail the small-target feature extraction method, the method of feature fusion, the attention mechanism, and the improvement of the loss function. Finally, a lightweight and accurate roadside object detection method based on YOLOv5 is proposed.
3.1. Feature Fusion Layer Reconstruction
The roadside target detection algorithm not only needs to accurately identify targets in complex road environments but also needs to compress the size of the model as much as possible for deployment on roadside edge devices. Moreover, the field of perception is broader in the roadside view, so roadside images contain a large number of small-scale targets. Therefore, the Backbone network and Neck network of the YOLOv5s model are modified in this study. Under the premise of maintaining detection performance, the network parameters and model computations are reduced in order to realize a lightweight and improved design of the target detection network.

The Backbone network contains four downsampling modules, which extract features at different levels of the image by deep convolution operations. However, in the Backbone network, small-target features decrease or even disappear as the feature level increases, due to repeated downsampling. Therefore, an improved design of the feature fusion layer is adopted in this study. The top feature extraction layer of the YOLOv5s Backbone network is removed, which reduces the complexity of the network and, at the same time, reduces the amount of invalid information passed to the next stage. On the other hand, the low-level features contain more location and detail information due to their small receptive fields, which helps accurately detect small targets in roadside images. Therefore, a 4× downsampling feature fusion layer is added to the YOLOv5s Neck network, which captures more effective information about small targets and improves small-target detection capability. The network architecture of the improved feature fusion layer is shown in Figure 3.
Compared with the original network architecture, the improved structure is more suitable for detecting small targets in roadside images, reducing network complexity and improving detection accuracy at the same time.
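A simple back-of-the-envelope calculation illustrates why the stride-4 branch matters for small targets; the 640 × 640 input size and 16-pixel object width below are assumed values for illustration only.

```python
# Grid sizes for a 640x640 input at each downsampling stride, and how many
# grid cells a hypothetical 16-pixel-wide object covers per axis. At stride 32
# such an object falls well inside a single cell, which is why the stride-32
# fusion layer contributes little to small-target detection.
input_size = 640
object_px = 16
for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    grid = input_size // stride
    cells = object_px / stride
    print(f"{name}: stride {stride:2d} -> {grid:3d}x{grid} grid, "
          f"{object_px}-px object covers {cells:.2f} cells/axis")
```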
3.2. Multiscale Feature Fusion
Due to varying road conditions and the different installation heights of the visual sensors, the target scale varies greatly, which poses a great challenge for object recognition in roadside images. Especially under complex road conditions, the target information is more complex and overlaps severely. In the feature extraction network, multiscale features are extracted through successive downsampling convolution layers and input into the Neck network for feature fusion. The low-level feature maps have smaller receptive fields, which makes them more suitable for perceiving location and detail information. However, because they undergo fewer downsampling operations, the low-level feature maps carry relatively little semantic information and contain more noise. The high-level features carry more semantic information, but their receptive fields become large due to multiple convolution operations, which leads to a poor ability to perceive details in the image. Therefore, the effective fusion of the extracted feature information is the key to improving the detection performance of the model.
Feature fusion is currently the main method for dealing with the multiscale discrepancy problem, and the representative algorithms are the feature pyramid network (FPN) [31], path aggregation network (PANet) [32], and bidirectional feature pyramid network (BiFPN) [33]. Their core idea is to aggregate the multiscale features extracted by the Backbone network. However, these feature pyramid structures focus mainly on the scale of features and ignore the level of features. When the objects to be detected are of roughly the same size, it is difficult for the network to distinguish between objects with simple appearance and objects with complex appearance, because the feature map contains only single-level or few-level features. Therefore, a novel feature fusion method, the generalized feature pyramid network (GFPN), was proposed by Jiang et al. [34]. GFPN introduces a new cross-scale fusion that aggregates features of the same level and neighboring levels, which provides more efficient information transfer. In addition, GFPN introduces a new skip connection method, which effectively prevents gradient vanishing in a heavy neck and allows the network to be extended deeper. At different trade-offs between floating-point operations (FLOPs) and performance, GFPN outperforms other SOTA solutions; the network structure is shown in Figure 4d.
Since the GFPN structure has a higher complexity than other feature pyramid network structures, a new skip-layer connection method, named log₂n-link, was proposed to avert the vanishing gradient problem as the network computational volume increases. It not only increases the depth to which GFPN can be extended but also preserves effective features for reuse, as shown in Figure 5.
Sufficient information exchange should include both skip-layer connections and cross-scale connections. However, previous works that aggregate features of adjacent layers consider only the same-level or previous-level features, which leads to poor performance in scenarios with large scale variance of objects. Therefore, queen fusion was proposed to overcome large scale variation, and its structure is shown in Figure 6. Each node receives input not only from the previous node at the same level but also from the nodes diagonally above and below it, which helps target features transfer information effectively and improves the adaptability of the network to features of different scales. Moreover, GFPN fuses features by concatenation instead of summation, which effectively reduces the information loss of feature fusion.
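As a rough PyTorch sketch of the queen-fusion idea described above (the module name, channel arguments, and the use of a strided convolution for downsampling are our own assumptions, not the reference GFPN implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueenFusionNode(nn.Module):
    """Sketch of a queen-fusion node: a node at level l receives the previous
    node at the same level plus its diagonal neighbours from the finer level
    l-1 (downsampled) and the coarser level l+1 (upsampled). The inputs are
    concatenated rather than summed, then fused by a 1x1 convolution."""
    def __init__(self, c_same, c_finer, c_coarser, c_out):
        super().__init__()
        self.down = nn.Conv2d(c_finer, c_finer, 3, stride=2, padding=1)  # finer -> current scale
        self.fuse = nn.Sequential(
            nn.Conv2d(c_same + c_finer + c_coarser, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x_same, x_finer, x_coarser):
        x_finer = self.down(x_finer)                                       # stride-2 conv
        x_coarser = F.interpolate(x_coarser, scale_factor=2, mode="nearest")
        return self.fuse(torch.cat([x_same, x_finer, x_coarser], dim=1))  # concat, not sum
```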
3.3. Attention Mechanism
The YOLOv5 Neck network structure focuses on the fusion of deep features, which weakens the detection of small targets; especially in scenarios with dense objects, the detection accuracy of small targets is low. Moreover, due to multiple downsampling operations, the receptive field of the high-level feature map is relatively large and a large amount of detail information has been lost; small-target features in particular are likely to be missing entirely. To reduce the loss of small-target features during feature extraction and improve small-target detection, we introduce an attention mechanism that constructs a hierarchical attention structure similar to human perception to enhance the feature extraction capability of the network.
Coordinate attention (CA) is a network structure proposed by Hou et al. [35]; its main idea is to embed location information into channel attention. Channel relationships and long-range dependencies with precise positional information help the network extract important information from feature maps. The CA attention mechanism consists of two main components, i.e., coordinate information embedding and coordinate attention generation. As shown in Figure 7, given the input $X$, two spatial pooling kernels of size $(H, 1)$ and $(1, W)$ are used to encode each channel along the horizontal coordinate and the vertical coordinate, respectively. The outputs $z^{h}$ and $z^{w}$ are concatenated and then sent to a shared $1 \times 1$ convolutional transformation. The concatenated feature maps are passed through BatchNorm and a nonlinear activation to encode the spatial information in the vertical and horizontal directions. The output $f$ is split along the spatial dimension into two separate tensors, $f^{h}$ and $f^{w}$. Another two $1 \times 1$ convolutional transformations are used to separately transform $f^{h}$ and $f^{w}$ into tensors with the same number of channels as the input $X$, yielding $F_{h}(f^{h})$ and $F_{w}(f^{w})$. Then, the attention weight maps $g^{h}$ and $g^{w}$ in the two spatial directions are obtained after the activation function $\sigma$ [27]; each attention weight map carries long-range dependencies along a particular direction. Finally, the input feature map is multiplied by the two weight maps, which enhances the expressiveness of the feature map.
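A minimal PyTorch sketch of such a coordinate attention block, following the description above, is given below; the reduction ratio, Hardswish activation, and module name are assumptions rather than the exact configuration used in RD-YOLO.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention sketch: pooling along each spatial axis produces
    direction-aware descriptors, which are transformed by a shared 1x1 conv
    and turned into per-channel attention maps for the H and W directions."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1): pool over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W): pool over height
        self.shared = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.Hardswish(),
        )
        self.attn_h = nn.Conv2d(hidden, channels, 1)
        self.attn_w = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                           # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1)
        y = self.shared(torch.cat([x_h, x_w], dim=2))  # concatenate along spatial dim
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w                            # re-weight the input features
```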
To accurately identify and localize small targets in object-dense scenarios, we integrated the CA attention mechanism into the YOLOv5s Backbone network to enhance important channel and spatial feature information; the improved Backbone network is shown in Figure 8. The CA attention mechanism added at the end of the Backbone network facilitates the extraction of important feature information while adding negligible network parameters and model computation.
3.4. Loss Function
The loss function measures the degree of overlap between the prediction boxes and the ground-truth boxes. To address the slow convergence and inaccurate regression results in roadside small-target detection, we introduce Focal-EIOU Loss [36], which is better suited to this regression task. The original YOLOv5s model uses CIOU Loss [37] as its IOU loss function. The principle of CIOU Loss is as follows:
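$$
\mathcal{L}_{CIOU} = 1 - IOU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v, \quad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \quad
\alpha = \frac{v}{\left(1 - IOU\right) + v}
$$

where $b$ and $b^{gt}$ denote the centers of the prediction box and the ground-truth box, $\rho(\cdot)$ is the Euclidean distance between them, $c$ is the diagonal length of the smallest box enclosing the two boxes, $w$, $h$, $w^{gt}$, and $h^{gt}$ are the widths and heights of the two boxes, $v$ measures the consistency of their aspect ratios, and $\alpha$ is a positive trade-off parameter (notation follows the standard formulation in [37]).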
EIOU Loss consists of three parts, i.e., the IOU loss, the distance loss, and the aspect loss. The first two parts of EIOU Loss follow the approach of CIOU Loss and thus retain its profile characteristics. Meanwhile, the aspect loss directly minimizes the difference between the width and height of the target box and those of the anchor box, which effectively improves the convergence speed and positioning accuracy. The principle of EIOU Loss is as follows:
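$$
\mathcal{L}_{EIOU} = \mathcal{L}_{IOU} + \mathcal{L}_{dis} + \mathcal{L}_{asp}
= 1 - IOU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}}
+ \frac{\rho^{2}\left(w, w^{gt}\right)}{C_{w}^{2}}
+ \frac{\rho^{2}\left(h, h^{gt}\right)}{C_{h}^{2}}
$$

where $C_{w}$ and $C_{h}$ are the width and height of the smallest box enclosing the two boxes (again following the standard formulation in [36]).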
There is a problem of imbalanced training examples in bounding box regression: the number of high-quality anchor boxes with small regression errors in an image is much smaller than the number of low-quality examples with large errors, and the low-quality examples produce excessively large gradients that harm the training process. Therefore, Focal Loss is introduced to address the imbalance of training examples in the bounding box regression task; it separates high-quality anchor boxes from low-quality ones so that the regression process focuses on the high-quality anchor boxes. The principle of Focal-EIOU Loss is as follows:
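$$
\mathcal{L}_{Focal\text{-}EIOU} = IOU^{\gamma} \cdot \mathcal{L}_{EIOU}
$$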
where $\gamma$ is a parameter that controls the degree of outlier suppression.
3.5. RD-YOLO Network Structure
Based on the above improvements, this paper proposes an effective and efficient object detector for the roadside perception system. The network structure of the improved algorithm is shown in Figure 9.
To realize the lightweight deployment of the roadside target detection model and improve the detection accuracy of small targets, RD-YOLO removes the 32× downsampling feature fusion layer and adds a 4× downsampling feature fusion layer, which maximally preserves feature information and improves small-target detection capability. The CA attention mechanism is introduced at the end of the Backbone network, which enhances important channel and spatial feature information and improves the ability to locate small targets. After that, the features of different resolutions extracted by the Backbone network are input into the Neck network for feature fusion, which contains top-down, bottom-up, and queen-fusion information transfer paths. In the first path, the semantic information of deep features is passed down to shallow features to enhance the multiscale semantic representation. In the second path, the detail information of shallow features is passed up to deep features to enhance multiscale localization. The aggregated feature maps contain both abstract semantic information and rich detail information, which effectively improves the positioning accuracy and classification precision of the target detection algorithm. In the final path, to aggregate more feature information from different levels, each node accepts input not only from the previous node at the same level but also from the nodes diagonally above and below it. Moreover, the nodes of the same layer are also connected to the output nodes to fuse more feature information, which helps target features transfer information effectively and improves the adaptability of the network to features of different scales. Finally, the output of the GFPN network is fed into the Head layer to generate prediction boxes, and the locally redundant prediction boxes are then filtered out by the NMS operation to obtain the final detection results.
5. Conclusions
This paper proposes an effective and efficient algorithm for roadside object detection based on YOLOv5s, which mainly addresses the difficulty of simultaneously improving the speed and accuracy of roadside target detection in the presence of small targets, complex backgrounds, and limited feature extraction. The feature fusion layer is reconstructed to capture small-target features more effectively and to achieve a lightweight design of the roadside target detection model. Then, GFPN is used for multiscale feature fusion to improve the adaptability of the network to features of different scales. In addition, the CA module is introduced into the Backbone network to improve the detection of dense small targets. Finally, the loss function is optimized to improve the speed of bounding box regression and the positioning accuracy of the anchor box. Compared to YOLOv5s, RD-YOLO improves the mean average precision by 5.5% on the Rope3D dataset and 2.9% on the UA-DETRAC dataset. Furthermore, the size of the model weights is reduced by 55.9% while the inference speed remains almost unchanged. The algorithm proposed in this paper effectively improves the accuracy of object detection in roadside images and meets the requirements of efficient deployment and real-time detection on roadside edge devices.
Compared with ideal detection requirements, our network is prone to errors on some ambiguous targets, which decreases the detection accuracy of the model. In the future, we will further optimize detection by increasing the diversity and richness of the dataset. In addition, we will continue to tune the hyperparameters and optimize the model to further improve the speed and accuracy of roadside object detection.