1. Introduction
Against the backdrop of growth in both vehicle ownership and average vehicle age, the automotive aftermarket is expanding rapidly, bringing more development opportunities to the automotive industry. The automotive aftermarket, in the narrow sense, refers to automotive after-sales services centered on repair and maintenance; automotive repair mainly includes engine repair, transmission repair, suspension system repair, electrical system repair, and body repair. At present, the grinding and repair of damaged car parts is mainly carried out manually. Because staff turnover in the automotive repair service industry is high and workers' skill levels are uneven, the overall quality of repair work is difficult to control effectively. Therefore, to improve the quality and efficiency of automotive aftermarket repair services and to reduce the work intensity of repair workers, it is essential to analyze and design a body damage area detection scheme that serves automated grinding and repair of automobiles.
Image detection algorithms can be divided into two types: traditional detection algorithms and deep learning detection algorithms. The traditional approach relies on manually designed features and shallow classifiers. Although it is highly interpretable, feature extraction methods designed for specific scenarios are difficult to adapt to complex application scenarios, and their generalization ability is weak. The deep learning approach, on the other hand, automatically extracts multi-level feature representations through neural networks, obtains more abstract semantic information, adapts effectively to complex environments, and achieves better detection accuracy than traditional methods. Among deep learning detectors, Fast R-CNN [1] avoided repeated convolutional operations by mapping candidate regions onto a shared feature map and extracting fixed-size features from each region, which significantly improved speed. However, it relied on selective search to generate candidate boxes, which made region proposal the speed bottleneck; this was subsequently addressed by Faster R-CNN [2]. In addition, the single-stage SSD [3] network improves real-time performance by directly predicting bounding boxes and categories, and it is fast and detects prominent targets well. Meanwhile, the YOLO series [4] is known for its real-time performance and end-to-end design. It has undergone multiple development iterations and is widely used in various industries, including inspection in intelligent manufacturing [5]. As of 2024, it has been updated to YOLOv11.
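The idea of extracting fixed-size features from candidate regions on a shared feature map can be illustrated with a minimal RoI max-pooling sketch. It is not the Fast R-CNN implementation: it assumes a single-channel feature map stored as a nested list and an RoI already expressed in feature-map cells, and the names `roi_max_pool` and `ceil_div` are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature, roi, out_h, out_w):
    """Pool one RoI of a 2D feature map into a fixed out_h x out_w grid.

    feature: list of rows (H x W), single channel (assumption for brevity).
    roi: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    if roi_h <= 0 or roi_w <= 0:
        raise ValueError("RoI must cover at least one cell")
    pooled = []
    for i in range(out_h):
        # Bin start uses floor and bin end uses ceil, so no bin is empty.
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# The feature map is computed once and shared by every RoI.
feature_map = [[r * 6 + c + 1 for c in range(6)] for r in range(4)]
whole = roi_max_pool(feature_map, (0, 0, 6, 4), 2, 2)
small = roi_max_pool(feature_map, (1, 1, 4, 3), 2, 2)
```

Both calls return a 2 x 2 grid even though the two regions differ in size, which is what lets a single fully connected head process every candidate box.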
YOLOv1 [6] proposed a single-stage detection framework for the first time, reducing object detection to a regression problem, but it detected small targets poorly and missed detections when multiple targets were present; YOLOv2 [7] introduced anchor boxes and multi-scale training, which significantly improved detection recall, and optimized anchor box sizes through K-means clustering to adapt to targets of different scales; YOLOv3 [8] used three-scale prediction, which improved small-target detection, and replaced Softmax with logistic regression to support multi-label classification; YOLOv4 [9] introduced the PANet and SPP modules to strengthen feature fusion, and used Mosaic data augmentation and the CIoU loss function to improve training stability; YOLOv5 [10] introduced the Focus layer and C3 module to improve computational efficiency; YOLOv6 [11] used a decoupled head, the Rep-PAN structure, and the SIoU loss function to reduce parameter redundancy and improve bounding box regression accuracy; YOLOv7 [12] designed the ELAN module and MP downsampling layer to enhance feature extraction; YOLOv8 [13] replaced the C3 module with the C2f module and adopted a decoupled head that separates the classification and regression tasks, improving detection accuracy; YOLOv9 [14] proposed programmable gradient information to enhance gradient propagation and address information loss in deep networks; YOLOv10 [15] combined large-kernel convolution and partial self-attention to balance computational overhead and global awareness; and YOLOv11 [16] replaced the C2f module of YOLOv8 with the improved C3K2 module, which accelerates feature extraction by using two small convolutions instead of one large convolution, and added a new C2PSA module to enhance multi-scale feature fusion.
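The K-means anchor clustering mentioned for YOLOv2 can be sketched as follows. This is an illustrative version under stated assumptions, not the original implementation: boxes are (width, height) pairs, a box is assigned to the centroid with the highest IoU (equivalent to the smallest 1 - IoU distance), and centroids are updated with the mean. The function names and the toy box sizes are hypothetical.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) boxes into k anchors using IoU as the similarity."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    # Sort by area so the smallest anchor comes first.
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Toy data: thin strip-like boxes and roughly square boxes (made-up sizes).
toy_boxes = [(100, 4), (110, 5), (90, 6), (40, 40), (42, 38), (38, 44)]
anchors = kmeans_anchors(toy_boxes, k=2)
```

On this toy data the two anchors separate the thin, strip-like boxes from the roughly square ones, which is the kind of scale and aspect-ratio adaptation the clustering is meant to provide.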
YOLOv11 consists of a backbone network, a neck network, and a detection head. The input image first undergoes feature extraction and feature weighting by the C3K2 and C2PSA modules of the backbone network. The features at different scales from the 4th, 6th, and 10th layers are then sent to the neck for feature fusion. Finally, the features of the 16th, 19th, and 22nd layers are sent to the three detection heads (large, medium, and small, respectively) to predict the results. Although YOLOv11 has substantial advantages, when applied to body damage area detection it still produces missed and false detections because of the characteristics of the detection objects: the color and texture of pit (dent) damage differ only slightly from the normal paint surface, and scratch damage takes the form of thin strips. Therefore, this paper takes the YOLOv11 algorithm as the baseline, optimizes the convolutions of the backbone feature extraction part, improves the neck feature fusion network, and adjusts the bounding box loss function, and proposes an improved body damage region detection algorithm, YOLOv11-BSS.
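To make the three-scale output concrete, the short sketch below computes the grid size seen by each detection head. The 640 x 640 input and the strides 8, 16, and 32 are assumptions taken from common YOLO configurations, not values stated in this section, and `head_grids` is a hypothetical helper.

```python
def head_grids(img_size=640, strides=(8, 16, 32)):
    """Return (stride, grid_h, grid_w) for each detection head.

    img_size and strides are assumed defaults, not values from the paper.
    """
    if any(img_size % s for s in strides):
        raise ValueError("img_size must be divisible by every stride")
    return [(s, img_size // s, img_size // s) for s in strides]


grids = head_grids()
# Total number of grid cells that produce a prediction across the three heads.
total_cells = sum(h * w for _, h, w in grids)
```

Under these assumptions the heads work on 80 x 80, 40 x 40, and 20 x 20 grids. Thin scratches occupy very few cells on the coarser grids, which is one reason multi-scale fusion in the neck matters for this task.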
Our main contributions are summarized as follows:
This work summarizes the body polishing process and collects and builds an image dataset for damaged body area detection that is applicable to automatic body polishing and repair, providing a basis for further research on polishing and repair services in the automotive aftermarket.
Based on deformable convolution and the characteristics of damaged body regions, a bi-deformable convolution is designed and used to replace part of the convolutions in the backbone feature extraction network, improving its feature extraction capability. In addition, by combining the bi-deformable convolution with the spatial and channel synergistic attention (SCSA) module, the C2PSA-SCSA module is designed to reweight the features obtained by the backbone feature extraction network.
In the neck feature fusion network, the slim-neck feature fusion network is improved using DWConv to reduce the overall number of network parameters and offset the parameters added by the bidirectional deformable convolution. At the same time, the piecewise linear mapping idea of Focaler-CIoU is adopted to optimize the bounding box (Bbox) loss function, balancing the attention paid during training to the two types of damage to be detected.
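The piecewise linear mapping behind Focaler-style losses can be sketched as below. This follows the commonly cited Focaler-IoU formulation, in which IoU values inside an interval [d, u] are rescaled linearly to [0, 1] and the base loss is adjusted by the difference. The thresholds `d` and `u`, the function names, and the placeholder base-loss value are assumptions for illustration; the exact variant used in this paper is not specified in this section.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def focaler_iou(iou, d, u):
    """Piecewise linear remapping of IoU onto [0, 1] over the interval [d, u]."""
    if not d < u:
        raise ValueError("d must be smaller than u")
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_loss(base_loss, iou, d, u):
    """Shift a base box loss (for example CIoU) by IoU - IoU_focaler."""
    return base_loss + iou - focaler_iou(iou, d, u)


pred, target = (0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)
iou = iou_xyxy(pred, target)
# 0.9 is a placeholder base-loss value, not a computed CIoU loss.
loss = focaler_loss(base_loss=0.9, iou=iou, d=0.2, u=0.8)
```

Choosing the interval [d, u] controls which samples dominate the regression loss, which is how such a mapping can rebalance attention between damage types of differing difficulty.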