1. Introduction
Damage to the road surface poses a potential threat to driving safety; maintaining excellent pavement quality is an important prerequisite for safe road travel and sustainable traffic. Failure to detect road damage promptly is one of the main obstacles to maintaining high pavement quality, so transportation agencies are required to evaluate pavement conditions regularly and carry out maintenance in time. Road damage assessment techniques have gone through three stages.
Manual inspection was used in the earliest days: an evaluator visually detects road damage by walking on or along the roadway. Such inspection requires domain expertise and field trips, and is tedious, expensive, and unsafe [1]. Moreover, it is inefficient, and the road must be closed during on-site inspection, which disrupts traffic flow; manual inspection is therefore unsuitable for large-scale road damage detection. Semiautomatic detection gradually became the primary method. Here, images are collected automatically by vehicles traveling at high speed, using dedicated equipment to capture road conditions and reproduce the damage, after which professional technicians identify the damage types and output the results manually [2,3,4]. This method greatly reduces the impact on traffic flow, but it still suffers from a heavy follow-up workload, a single detection target, and high equipment overhead. Over the past 20 years, with the development of sensor and camera technology, fully automatic road damage detection has made great progress. Established automatic detection systems include vibration-based methods [5], laser-scanning-based methods [6,7], and image-based methods. Vibration-based and laser-scanning-based methods require expensive specialized equipment, and their reliability depends strongly on the operator [8]. Although these methods achieve high detection accuracy, they are therefore impractical from an economic standpoint. The most attractive features of image-based methods are their low cost and the fact that they need no special equipment [9,10,11,12,13,14,15,16]. However, image-based methods cannot match the accuracy of vibration-based or laser-scanning-based methods, so image-based automated road damage detection requires highly intelligent and robust algorithms.
The advent of machine learning opened new directions in road damage detection, with pioneering applications in pavement crack classification [17,18,19]. However, these studies relied on shallow networks, which could neither capture the complex information in pavement images nor distinguish between road damage categories. The unprecedented growth of computing power laid the foundation for deep learning, which can effectively address the low accuracy of image-based road damage detection. Deep learning has gained widespread attention in smart cities, self-driving vehicles, transportation, medicine, agriculture, finance, and other fields [20,21,22,23,24,25]. It has two main advantages over traditional machine learning: models can be trained directly on raw data without hand-crafted pre-processing, and deeper, more complex networks outperform traditional machine learning in feature extraction and optimization. Road damage detection can be viewed as an object detection task, which classifies and localizes the objects of interest. In object detection, deep learning models offer powerful feature extraction that significantly outperforms manually designed feature detectors. Deep learning-based object detection algorithms fall into two main categories: two-stage and one-stage detectors. The former first generate region proposals from the input image and then classify and regress each proposal; typical examples are R-CNN [26], Fast R-CNN [27], Faster R-CNN [28], and SPP-Net [29]. The latter need no region proposals and directly predict class probabilities and location coordinates; typical examples are the YOLO series [30,31,32,33,34,35], SSD [36], and RetinaNet [37].
In recent years, many studies have applied neural network models to pavement measurement or damage detection. References [38,39,40] used neural network models to detect pavement cracks, but these methods only determine whether cracks are present. To address this limitation, ref. [41] photographed road surfaces with mobile devices, divided Japanese road damage into eight categories, named the collected dataset RDD-2018, and deployed the detector on a smartphone. Subsequent studies focused on adding more images or introducing entirely new datasets [42,43,44,45,46], but the vast majority of these datasets were limited to road conditions in a single country. To remedy this, in 2020, ref. [47] combined road damage data from the Czech Republic and India with the Japanese road damage dataset [48] to propose a new dataset, "Road Damage Dataset 2020 (RDD-2020)", which contains 26,620 images, nearly three times more than the 2018 dataset. In the same year, the Global Road Damage Detection Challenge (GRDDC) was held, producing several representative detection schemes [49,50,51,52,53,54]. These studies showed that object detection algorithms from other domains can also be applied to road damage detection. However, because the competition was evaluated only by F1-score, the models inevitably grew in scale as detection accuracy improved, occupying large storage space, increasing inference time, and placing high demands on hardware.
In road damage detection, mobile terminal devices are better suited to the detection task due to the constraints of the working environment. However, mobile devices have limited storage capacity and computing power, and the imbalance between accuracy and complexity makes it difficult to deploy existing models on them. In addition, state-of-the-art algorithms are trained on datasets such as Pascal VOC and COCO, which are not necessarily representative of road damage detection. Considering these shortcomings, this study proposes the YOLO-LWNet algorithm, which balances detection accuracy and model complexity, and applies it to the RDD-2020 road damage dataset. To maintain high detection accuracy while effectively reducing model scale and computational complexity, we designed a novel lightweight network building block, the LWC, which comprises a basic unit and a unit for spatial downsampling. Based on the LWC, we designed a lightweight backbone suited to road damage detection and used it to replace the backbone of YOLOv5. We also designed a more effective feature fusion network that achieves efficient inference while retaining strong feature fusion capability. The YOLO-LWNet model comes in two versions, tiny and small. Compared with state-of-the-art detection algorithms, YOLO-LWNet effectively reduces model scale and computational complexity while achieving better detection results.
The contributions of this study are as follows: ① To balance detection accuracy, model scale, and computational complexity, a novel lightweight network building block, the LWC, was designed. This lightweight module improves the efficiency of the network model while mitigating gradient vanishing and enhancing feature reuse, thus preserving object detection accuracy. An attention module inside the LWC re-weights and fuses features along the channel dimension, increasing the weights of informative features and suppressing useless ones, thereby improving the network's feature extraction ability; ② A novel lightweight backbone and an efficient feature fusion network were designed. In the backbone, to better detect small and weak objects, we stacked additional LWC modules to deepen the shallow stages of the network and maximize attention to shallow information. To strengthen the extraction of features at different scales, we also adopted the Spatial Pyramid Pooling-Fast (SPPF) [36] module. In the feature fusion network, we adopted a BiFPN-based topology [55] and replaced the C3 module of YOLOv5 with a more efficient structure, which markedly reduces model scale and computational complexity while preserving almost the same feature fusion capability; ③ To evaluate the effect of each design choice on detection performance, we conducted ablation experiments on the RDD-2020 road damage dataset, and compared the proposed YOLO-LWNet with state-of-the-art object detection models on the same dataset. The comparisons show that the proposed model improves detection accuracy to a certain extent, effectively reduces model scale and computational complexity, and better satisfies the deployment requirements of mobile terminal devices.
The rest of this paper is organized as follows. Section 2 reviews the development of lightweight networks and the YOLOv5 framework. Section 3 presents the structural details of the lightweight network YOLO-LWNet, which effectively balances detection accuracy and model complexity. Section 4 reports the experimental results and compares them with state-of-the-art methods. Finally, Section 5 concludes the paper and outlines future work.
4. Experiments on Road Damage Object Detection Network
In this section, the novel road damage object detection network YOLO-LWNet was trained using the RDD-2020 dataset, and its effectiveness was verified. First, we compared advanced lightweight object detection algorithms with the lightweight network (BLWNet) designed in this paper, in terms of detection accuracy, inference speed, model scale, and computational complexity. Second, ablation experiments were conducted to assess the contribution of each improvement. Finally, to demonstrate the performance of the final lightweight road damage detection models, they were compared with state-of-the-art object detection algorithms.
4.1. Datasets
The dataset used in this study was proposed in [73] and includes one training set and two test sets. The training set consists of 21,041 labeled road damage images collected with smart devices in Japan, the Czech Republic, and India. The damage categories are longitudinal crack (D00), transverse crack (D10), alligator crack (D20), and pothole (D40). Table 2 shows the four types of road damage and the distribution of damage categories across the three countries: Japan has the most images and damage instances, India fewer, and the Czech Republic the fewest. Because the labels of the RDD-2020 test sets are not publicly available, we divided the training data into a training set, a validation set, and a test set in the ratio 7:1:2.
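Since the official test labels are unavailable, the 7:1:2 split described above can be reproduced with a short script. The following is a minimal sketch, assuming images sit in a single directory and that YOLO-style list files are acceptable; the directory path and random seed are illustrative, not the authors' actual configuration.

```python
import random
from pathlib import Path

# Hypothetical location of the RDD-2020 training images; adjust to your layout.
image_dir = Path("RDD2020/train/images")
images = sorted(image_dir.glob("*.jpg"))

random.seed(0)          # fixed seed so the 7:1:2 split is reproducible
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}

# YOLOv5-style datasets accept plain text files listing one image path per line.
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```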
4.2. Experimental Environment
Our experiments used the following hardware environment: an NVIDIA GeForce RTX 3070 Ti GPU, an Intel Core i7-10700K CPU, and 32 GB of memory. The software environment was the Windows 10 operating system with Python 3.9, CUDA 11.3, cuDNN 8.2.1.32, and PyTorch 1.11.0. All network models were trained from scratch on the RDD-2020 dataset. To ensure fairness, the models were trained in the same experimental environment, and the detection performance of the trained models was validated on the test set. To ensure the reliability of the training process, the training hyperparameters were kept consistent throughout. They were set as follows: the input image size was 640 × 640, the number of training epochs was 300, the number of warmup epochs was 3, the batch size was 16, the optimizer weight decay was 0.0005, the initial learning rate was 0.01, the final (cycle) learning rate factor was 0.01, and the learning rate momentum was 0.937. Mosaic data augmentation and random data augmentation were enabled in every training epoch.
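For reference, these settings map onto YOLOv5-style training hyperparameters roughly as follows. The key names follow the YOLOv5 convention and are our assumption, not a verbatim copy of the authors' configuration file.

```python
# Sketch of the training configuration described above, using
# YOLOv5-style hyperparameter names (an assumption on our part).
train_config = dict(
    img_size=640,         # input image size 640 x 640
    epochs=300,           # total training epochs
    warmup_epochs=3,
    batch_size=16,
    weight_decay=0.0005,  # optimizer weight decay
    lr0=0.01,             # initial learning rate
    lrf=0.01,             # final (cycle) learning rate factor
    momentum=0.937,       # learning rate momentum
    mosaic=1.0,           # mosaic augmentation enabled each epoch
)
```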
4.3. Evaluation Indicators
In deep learning object detection tasks, detection models are usually evaluated in terms of recall, precision, average precision (AP), mean average precision (mAP), number of parameters (params), floating-point operations (FLOPs), frames per second (FPS), and latency.
In object detection tasks, precision alone cannot adequately evaluate a detector. We therefore introduce AP, which is the area under the precision-recall (P-R) curve of a given class. Computing AP first requires precision and recall: precision measures the percentage of correct predictions among all results predicted as positive samples, and recall measures the percentage of correct predictions among all positive samples. An AP value is computed for each category, and mAP is the average of the AP values over all categories. Precision, recall, and mAP are given by Formulas (5), (6), and (7), respectively:

\[ \text{Precision} = \frac{TP}{TP + FP} \quad (5) \]

\[ \text{Recall} = \frac{TP}{TP + FN} \quad (6) \]

\[ \text{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad AP_i = \int_0^1 P_i(R)\,\mathrm{d}R \quad (7) \]
To compute precision and recall, as in all machine learning problems, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) must be determined: TP denotes a positive sample correctly predicted as positive; FP denotes a negative sample incorrectly predicted as positive; and FN denotes a positive sample incorrectly predicted as negative. The IoU and confidence thresholds strongly influence precision and recall, and directly shape the P-R curve. In this paper, we set the IoU threshold to 0.6 and the confidence threshold to 0.1.
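As a concrete illustration of Formulas (5)-(7), the sketch below computes precision, recall, and AP as the area under the P-R curve from per-detection TP/FP flags. It is a simplified single-class example under the IoU-matching assumption described above, not the full evaluation pipeline.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve (single class).

    scores: confidence of each detection; is_tp: 1 if the detection
    matched a ground-truth box at the chosen IoU threshold (e.g., 0.6);
    num_gt: number of ground-truth boxes for this class.
    """
    order = np.argsort(-np.asarray(scores))        # sort by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    precision = cum_tp / (cum_tp + cum_fp)         # Formula (5)
    recall = cum_tp / num_gt                       # Formula (6)

    # Step integration of P over R gives the area under the P-R curve.
    ap = float(recall[0] * precision[0])
    ap += float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
    return ap

# mAP is then the mean of the per-class AP values (Formula (7)).
```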
In addition to accuracy, computational complexity is another important consideration. Engineering applications often aim for the best accuracy under limited computing resources, a constraint imposed by the target platform and application scenario. A widely used indicator of model complexity is FLOPs; another is the number of parameters (params). FLOPs represent the theoretical computation of the model, usually reported in GFLOPs for large models and MFLOPs for small ones; params determine the size of the model file and are usually reported in millions (M).
However, FLOPs are an indirect indicator. Detection speed is another important evaluation criterion for object detection models, since only a fast model can achieve real-time detection. Previous studies [74,75,76,77] have found that networks with the same FLOPs can have different inference speeds, because FLOPs only count the computation of the convolutional part; although that part takes most of the time, other operations (such as channel shuffling and element-wise operations) also take considerable time. Therefore, we used FPS and latency as direct indicators and FLOPs as an indirect indicator. FPS is the number of images processed per second, and latency is the time required to process one image. In this paper, we tested the network models on a GPU (NVIDIA GeForce RTX 3070 Ti), measuring with FP16 precision and a batch size of one.
4.4. Comparison with Other Lightweight Networks
To verify the performance of the lightweight network unit LWC in road damage detection, we used MobileNetV3-Small, MobileNetV3-Large, ShuffleNetV2-x1, and ShuffleNetV2-x2 as substitute backbone feature extraction networks in YOLOv5, and compared them with the lightweight network built from LWC modules (BLWNet) on the RDD-2020 public dataset. Specifically, MobileNetV3-Small, ShuffleNetV2-x1, and BLWNet-Small served as the backbone of the YOLOv5-s model, while MobileNetV3-Large, ShuffleNetV2-x2, and BLWNet-Large served as the backbone of the YOLOv5-l model. To compare the structures fairly, the LWC module used the same attention mechanism and activation functions as MobileNetV3 and ShuffleNetV2. In addition, we designed different versions of the model by adjusting the output channel value, the exp channel value, and the number n of stacked LWC modules. The specific parameters of the two model sizes used at this stage are shown in Table 3 and Table 4, respectively.
The neck of YOLOv5 takes three feature maps of different resolutions from the backbone, named P3, P4, and P5. P3 is the output of the last layer with stride 8, P4 the output of the last layer with stride 16, and P5 the output of the last layer with stride 32. For BLWNet-Large, P3 is the 7th LWConv layer, P4 the 13th LWConv layer, and P5 the last LWConv layer. For BLWNet-Small, P3 is the 6th LWConv layer, P4 the 9th LWConv layer, and P5 the last LWConv layer.
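The stride relationship above can be made concrete with a toy backbone. This sketch is purely illustrative: the channel widths and layer counts are placeholders, not the BLWNet values from Table 3 and Table 4.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Illustrative backbone exposing P3/P4/P5 at strides 8/16/32."""
    def __init__(self):
        super().__init__()
        def down(cin, cout):  # a stride-2 conv halves the resolution
            return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1),
                                 nn.BatchNorm2d(cout), nn.SiLU())
        self.s2, self.s4 = down(3, 16), down(16, 32)
        self.s8, self.s16, self.s32 = down(32, 64), down(64, 128), down(128, 256)

    def forward(self, x):
        x = self.s4(self.s2(x))
        p3 = self.s8(x)       # stride 8  -> 80 x 80 for a 640 x 640 input
        p4 = self.s16(p3)     # stride 16 -> 40 x 40
        p5 = self.s32(p4)     # stride 32 -> 20 x 20
        return p3, p4, p5

feats = ToyBackbone()(torch.randn(1, 3, 640, 640))
print([f.shape[-1] for f in feats])  # [80, 40, 20]
```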
We compared mAP, params, FLOPs, FPS, and latency; Table 5 shows the test results of each network model on the RDD-2020 test set. As can be seen, the BLWNet-based YOLOv5 is not only the smallest of the three models but also the most accurate and the fastest. BLWNet-Small is 0.8 ms faster than MobileNetV3-Small with a 2.9% higher mAP, and, compared with ShuffleNetV2-x1, has 1.3 ms less latency and 3.1% higher mAP. The mAP of BLWNet-Large, with increased channels, is 1.1% and 3.1% higher than those of MobileNetV3-Large and ShuffleNetV2-x2, respectively, at similar latency. BLWNet thus outperforms MobileNetV3 and ShuffleNetV2 in model scale, computational cost, and mAP, and the BLWNet variant with fewer channels is particularly effective at reducing latency. Although BLWNet performs very well in reducing model scale and computational cost, its mAP still leaves considerable room for improvement, so our subsequent experiments focus mainly on improving mAP on the RDD-2020 dataset.
4.5. Ablation Experiments
To investigate the effect of different improvement techniques on the detection results, we conducted ablation experiments on the RDD-2020 dataset.
Table 6 lists all the schemes in this study. In the LW scheme, only the backbone of YOLOv5-s was replaced with BLWNet-Small. In the LW-SE scheme, a CBAM attention module replaced the attention in the basic unit, and an ECA attention module replaced the attention in the spatial downsampling unit. In the LW-SE-H scheme, the hardswish nonlinearity replaced swish in the LWC module. In the LW-SE-H-depth scheme, the numbers of LWConv layers between P5 and P4, P4 and P3, and P3 and P2 in BLWNet-Small were increased. In the LW-SE-H-depth-spp scheme, the SPPF module was added as the last layer of the backbone. In the LW-SE-H-depth-spp-bi scheme, building on LW-SE-H-depth-spp, the BiFPN weighted bi-directional pyramid structure replaced PANet to generate the feature pyramid. In the LW-SE-H-depth-spp-bi-ENeck scheme, the C3 blocks in the YOLOv5 neck were replaced with the LWblock proposed in this paper to achieve efficient feature fusion. In the experiments, we found that the CBAM attention modules introduced considerable latency. Therefore, the LW-SE-H-depth-spp-bi-fast scheme, based on LW-SE-H-depth-spp-bi, replaced the CBAM modules in the basic unit of the LWC with ECA modules, and the number of channels output by the focus module was reduced from 64 to 32. The LW-SE-H-depth-spp-bi-ENeck-fast scheme differs from LW-SE-H-depth-spp-bi-fast in that the CBAM modules in the basic units of the LWC in both the backbone and the neck were replaced with ECA modules.
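Of the two attention modules compared in these schemes, ECA is the lighter one. The following is a minimal sketch of the standard ECA block, not the exact layer used inside the LWC; for simplicity the kernel size is fixed to 3, whereas the original ECA derives it adaptively from the channel count.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel re-weighting via a 1-D conv,
    avoiding the dimensionality reduction used in SE blocks."""
    def __init__(self, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # (B, C, H, W) -> (B, C, 1, 1): global channel descriptor
        y = self.avg_pool(x)
        # 1-D conv across channels captures local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(-1, -2))
        y = y.transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)     # re-weight each channel of the input
```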
The results of the ablation experiments are shown in Figure 10. The YOLOv5-s model still has room for improvement in params, FLOPs, and mAP. The LW scheme reduces the parameter size and computational complexity of YOLOv5-s by nearly half, but the mAP decreases by 2.1%. To recover accuracy, the LW-SE scheme increases the mAP by 0.7% and further reduces the model scale, but increases the latency. LW-SE-H replaces the activation functions with hardswish, which improves inference speed without reducing detection accuracy or changing model scale and computational complexity. LW-SE-H-depth enhances the extraction of features at the 80 × 80, 40 × 40, and 20 × 20 resolutions and increases the mAP by 2.0%; deepening the network inevitably increases model scale and computational complexity, at the cost of 4.4 ms of inference time. Introducing the improved SPPF module in LW-SE-H-depth-spp extracts features from different receptive fields and further improves the mAP by 1.0% with essentially unchanged params, FLOPs, and latency. Applying the BiFPN structure increases the mAP of the network by 0.3% while hardly increasing params, FLOPs, or latency. Applying the efficient feature fusion network greatly reduces the params and FLOPs of the model, at the cost of 6.9 ms of latency. To better balance detection accuracy, model scale, computational complexity, and inference speed, the LW-SE-H-depth-spp-bi-fast scheme replaces the attention module of LW-SE-H-depth-spp-bi and reduces the output channels of the focus module; the mAP decreases by 0.2%, the FLOPs are effectively reduced, and the inference speed increases greatly. Compared with the LW-SE-H-depth-spp-bi-ENeck scheme, LW-SE-H-depth-spp-bi-ENeck-fast reduces the mAP by 0.1%, optimizes the model scale and computational complexity to some extent, and cuts the latency by 8.2 ms. Finally, weighing params, FLOPs, mAP, and latency, we chose the LW-SE-H-depth-spp-bi-fast scheme as the small version of the proposed road damage detection model YOLO-LWNet; the tiny version simply reduces the width factor of the model. YOLO-LWNet-Small outperforms the original YOLOv5-s in both performance and complexity: it increases the mAP on the test set by 1.7% with almost half the parameters of YOLOv5-s, and its computational complexity is much lower, which makes it more suitable for mobile terminal devices. Compared with YOLOv5-s, the inference time of YOLO-LWNet-Small is 3.3 ms longer, a phenomenon caused mainly by the depthwise separable convolution. Depthwise separable convolution is an operation with low FLOPs but a high volume of data reads and writes, so it incurs a large memory access cost (MAC); limited by GPU bandwidth, the network spends much of its time reading and writing data and uses the computing power of the GPU inefficiently. For mobile terminal devices with limited computing power, the influence of MAC can be neglected, so this paper mainly considers the number of parameters, computational complexity, and detection accuracy of the network model.
To better observe the effects of the different improvements during training, Figure 11 shows the mAP curves of the nine experimental schemes over the training process. The horizontal axis indicates the number of training epochs, and the vertical axis indicates the mAP value. The mAP increases with the number of training epochs until, after about 225 epochs, the mAP of every scheme reaches its maximum and the models begin to converge. LW-SE-H-depth-spp-bi-fast is the final network proposed in this paper. The figure shows that each improvement effectively increases the detection accuracy of the network, and that the final model trains better than the original one.
4.6. Comparison with State-of-the-Art Object Detection Networks
YOLO-LWNet is a road damage detection model based on YOLOv5 and comes in two versions, small and tiny; the specific parameters of their backbones (LWNet) are given in Table 1. Specifically, YOLO-LWNet-Small uses LWNet-Small as the backbone of YOLOv5-s with BiFPN as the feature fusion network, while YOLO-LWNet-Tiny uses LWNet-Tiny as the backbone of YOLOv5-n, likewise with BiFPN as the feature fusion network.
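The BiFPN neck referenced here fuses incoming features with learnable non-negative weights ("fast normalized fusion" in the BiFPN paper [55]). A minimal sketch of that fusion step, independent of the exact YOLO-LWNet wiring, follows.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of n same-shape feature maps:
    out = sum_i(w_i * x_i) / (eps + sum_j(w_j)), with w_i >= 0 via ReLU."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, xs):
        w = torch.relu(self.w)            # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so the weights sum to ~1
        return sum(wi * xi for wi, xi in zip(w, xs))

# Usage: fuse an upsampled P5 with P4 before further convolution.
fuse = WeightedFusion(2)
p4_td = fuse([torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)])
```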
Figure 12 shows the results of the YOLO-LWNet model and state-of-the-art object detection algorithms for the experiments on the RDD-2020 dataset.
Compared with YOLOv5 and YOLOv6, YOLO-LWNet improves markedly on most indicators, especially mAP, params, and FLOPs. Compared with YOLOv6-s, YOLO-LWNet-Small reduces params and FLOPs by 78.9% and 74.6%, respectively, increases mAP by 0.8%, and decreases latency by 3.9 ms. Compared with YOLOv6-tiny, YOLO-LWNet-Small has 62.6% fewer params, 54.8% fewer FLOPs, 0.9% higher mAP, and saves 1.8 ms of latency. Against YOLOv6-nano, YOLO-LWNet-Small has no advantage in inference speed or computational complexity, but it reduces params by 15.8% and increases mAP by 1.6%. Compared with YOLOv5-s, the small version achieves 1.7% higher mAP with 48.4% fewer params and 30% fewer FLOPs. As shown in Figure 12, YOLO-LWNet-Tiny has the smallest model scale and computational complexity of all the models: compared with YOLOv5-nano, its params decrease by 35.8%, its FLOPs by 4.9%, and its mAP increases by 1.8%. Overall, YOLO-LWNet offers smaller model scale, lower computational complexity, and higher detection accuracy in road damage detection tasks, balancing inference speed, detection accuracy, and model scale. Figure 13, Figure 14, Figure 15 and Figure 16 compare YOLO-LWNet with five other detection methods on each of the four damage types, displaying the predicted labels and confidence values for these samples. The results show that YOLO-LWNet accurately detects and classifies the different road damage locations, outperforming the other detection models, which indicates that our network is well suited to the road damage detection task.
Figure 17 shows the detection results of the YOLO-LWNet.
5. Conclusions
In road damage object detection, mobile terminal devices are better suited to the detection task due to the constraints of the working environment, but their storage capacity and computing power are limited. To balance accuracy, model scale, and computational complexity, a novel lightweight LWC module was designed in this paper, and the attention mechanism and activation function within the module were optimized. On this basis, a lightweight backbone and an efficient feature fusion network were designed. Finally, guided by the principle of balancing detection accuracy, inference speed, model scale, and computational complexity, we experimentally determined the specific structure of the lightweight road damage detection network and defined two versions (small and tiny) according to the network width. On the RDD-2020 dataset, compared with YOLOv6-s, YOLO-LWNet-Small reduces model scale by 78.9% and computational complexity by 74.6%, while increasing detection accuracy by 0.8% and decreasing inference time by 3.9 ms. Compared with YOLOv5-s, YOLO-LWNet-Small reduces model scale by 48.4% and computational complexity by 30%, and increases detection accuracy by 1.7%. YOLO-LWNet-Tiny achieves a 35.8% reduction in model scale, a 4.9% reduction in computational complexity, and a 1.8% increase in detection accuracy compared with YOLOv5-nano. These experiments show that YOLO-LWNet better meets the accuracy, model scale, and computational complexity requirements of mobile terminal devices in road damage detection tasks.
In future work, we will further optimize the network model to improve its detection accuracy and speed, and expand the training data to make the model more capable in road damage detection. We will also deploy the network model on mobile devices, such as smartphones, so that it can be fully used in engineering practice.