YOLOv11-BSS: Damaged Region Recognition Based on Spatial and Channel Synergistic Attention and Bi-Deformable Convolution in Sanding Scenarios

Li, Yinjiang; Zhou, Zhifeng; Pan, Ying

doi:10.3390/electronics14071469


by Yinjiang Li, Zhifeng Zhou * and Ying Pan

School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1469; https://doi.org/10.3390/electronics14071469

Submission received: 7 March 2025 / Revised: 31 March 2025 / Accepted: 3 April 2025 / Published: 5 April 2025


Abstract

The paint surface in damaged regions of a car body closely resembles the intact paint surface in color and texture, which leads to missed and false detections. To address this problem, an algorithm for detecting damaged body regions based on an improved YOLOv11 is proposed. Firstly, bi-deformable convolution is proposed to constrain the offset direction of the convolution kernel shape, which effectively improves the feature representation power of the backbone network; secondly, the C2PSA-SCSA module is designed to couple spatial attention and channel attention, which enhances the perceptual power of the backbone network and makes the model pay closer attention to damaged-region features. Then, a slim-neck feature fusion network is built from the GSConv and DWConv modules, which effectively fuses local and global features to enrich the semantic features; finally, the Focaler-CIoU bounding-box loss function is designed, which uses the piecewise linear mapping of Focaler-IoU to adjust the attention that the bounding-box loss pays to different samples and improves the model's convergence when learning features at various scales. The experimental results show that the enhanced YOLOv11-BSS network improves precision by 7.9%, recall by 1.4%, and mAP@50 by 3.7% over the baseline network, which effectively reduces missed and false detections of damaged areas of the car body.

Keywords:

YOLOv11; damaged areas of the car body; bi-deformable convolution; spatial and channel synergistic attention; Focaler-CIoU

1. Introduction

Against the backdrop of growth in both vehicle ownership and average vehicle age, the automotive aftermarket is expanding rapidly, bringing more development opportunities to the automotive industry. The automotive aftermarket, in the narrow sense, refers to after-sales services centered on repair and maintenance, of which repair mainly includes engine repair, transmission repair, suspension system repair, electrical system repair, and body repair. At present, the grinding and repair of damaged parts of the car is mainly carried out manually. Because of the high turnover of personnel in the automotive repair service industry and the uneven skill level of workers, the overall quality of repair work is difficult to control effectively. Therefore, in order to improve the quality and efficiency of automotive repair services and reduce the work intensity of repair workers, it is essential to analyze and design a body-damaged area detection scheme that serves automated grinding and repair of automobiles.

Image detection algorithms can be divided into two types: traditional detection algorithms and deep learning detection algorithms. The traditional approach relies on manually designed features and shallow classifiers. Although it is highly interpretable, feature extraction methods designed for one scenario are hard to adapt to complex application scenarios, and the generalization ability is weak. The deep learning approach, on the other hand, automatically extracts multi-level feature expressions through neural networks, obtains more abstract semantic information, adapts effectively to complex environments, and achieves better detection accuracy than traditional methods. Among deep learning detectors, Fast R-CNN [1] avoided repeated convolutional operations by mapping candidate regions to a shared feature map and extracting fixed-size features uniformly, which significantly improved speed. However, it relied on selective search to generate candidate boxes, which slowed region proposal; this was subsequently optimized by Faster R-CNN [2]. In addition, the single-stage SSD [3] network improves real-time performance by directly predicting bounding boxes and categories, and offers high speed and good detection of prominent targets. Meanwhile, the YOLO series [4] is known for its real-time performance and end-to-end design. It has undergone multiple development iterations and is widely used in various industries, including intelligent manufacturing inspection [5]. As of 2024, it has been updated to YOLOv11.

YOLOv1 [6] proposed a single-stage detection framework for the first time, reducing target detection to a regression problem, but it suffers from poor detection of small targets and missed detections in multi-target scenes. YOLOv2 [7] introduced anchor boxes and multi-scale training, which significantly improved detection recall, and optimized the anchor box sizes through K-means clustering to adapt to targets of different scales. YOLOv3 [8] used three-scale prediction to improve small-target detection and replaced Softmax with logistic regression to support multi-label classification. YOLOv4 [9] introduced the PANet and SPP modules to enhance feature fusion and used Mosaic data augmentation and the CIoU loss function to improve training stability. YOLOv5 [10] introduced the Focus layer and C3 module to improve computational efficiency. YOLOv6 [11] used a decoupled head, the Rep-PAN structure, and the SIoU loss function to reduce parameter redundancy and improve bounding-box regression accuracy. YOLOv7 [12] designed the ELAN module and MP downsampling layer to enhance feature extraction. YOLOv8 [13] replaced the C3 module with the C2f module and adopted a decoupled head that separates the classification and regression tasks, improving detection accuracy. YOLOv9 [14] proposed programmable gradient information to enhance gradient propagation and mitigate information loss in deep networks. YOLOv10 [15] combined large-kernel convolution and partial self-attention to balance computational overhead and global awareness. YOLOv11 [16] replaces the C2f module of YOLOv8 with the improved C3K2 module, accelerates feature extraction by using two small convolutions instead of one large convolution, and adds a new C2PSA module to enhance multi-scale feature fusion.

YOLOv11 consists of a backbone network, a neck network, and a detection head. The input image undergoes feature extraction and weight adjustment in the C3K2 and C2PSA modules of the backbone network. Then, the features of different scales from the 4th, 6th, and 10th layers are sent to the neck for feature fusion. Finally, the features of the 16th, 19th, and 22nd layers are sent to the three detection heads (large, medium, and small, respectively) for prediction. Although the YOLOv11 network has substantial advantages, when applied to body damage area detection it still produces missed and false detections because of the characteristics of the detection objects: the color and texture of pit damage differ only slightly from the normal paint surface, and scratch damage takes the form of a thin strip. Therefore, this paper takes the YOLOv11 algorithm as the baseline, optimizes the convolution of the backbone feature extraction part, improves the neck feature fusion network, adjusts the bounding-box loss function, and proposes an improved YOLOv11-BSS body damage region detection algorithm.

Our main contributions are summarized as follows:

We summarize the body polishing process and collect and produce an image dataset for damaged body area detection applicable to the automatic body polishing repair process, providing a basis for further research on polishing repair services in the automotive aftermarket.
Based on deformable convolution and the characteristics of damaged body regions, the bi-deformable convolution is designed and replaces part of the convolutions in the backbone feature extraction network to improve its feature extraction capability. Meanwhile, combining the bi-deformable convolution and the spatial and channel synergistic attention module, the C2PSA-SCSA module is designed to adjust the importance of the features obtained by the backbone feature extraction network.
In the neck, the slim-neck feature fusion network is improved using DWConv to reduce the overall number of network parameters and offset the parameters added by the bi-deformable convolution. At the same time, the piecewise linear mapping of Focaler-IoU is combined with CIoU to optimize the Bbox loss function so that training balances the attention paid to the two types of damage to be detected.

2. Related Work

2.1. Vehicle Body Damage Inspection

Body damage detection has been studied in the insurance industry. Ramazhan et al. [17] developed an auto body damage detection system using the YOLOv7 algorithm. The system detects different types of auto body damage, such as cracks and scratches, through data collection, data labeling, data preprocessing, and data augmentation. YOLOv7x was finally selected as the best algorithm for the detection system; it achieves high accuracy and recall and detects damage automatically, which facilitates the claims process for insurance companies and reduces the risk of fraud. Wang et al. [18] produced a vision-based automotive damage detection dataset, CarDD, and discussed the challenges of annotating automotive damage images, the performance of existing methods, and the dataset construction process. The dataset supports four tasks, namely classification, object detection, instance segmentation, and salient object detection, providing researchers with multiple perspectives for exploring automotive damage detection and facilitating the development of the automotive insurance industry and automated car damage assessment systems. Elgargouh et al. [19] presented an approach to damage detection in vehicle images using computer vision techniques. The research team constructed their own dataset and achieved automatic detection and classification of vehicle damage using deep learning models such as Mask R-CNN and Faster R-CNN, as well as tools such as Detectron2. They demonstrated the superiority of this approach by refining the damage region, applying bounding box filtering, and using multi-damage classification models, and achieved good results in insurance claim scenarios. Gustian et al. [20] discussed the use of deep learning algorithms to detect car body damage. The article describes the advantages of the YOLO deep learning algorithm for real-time object detection and how to detect scratches, cracks, and other damage to the car body by collecting a vehicle image dataset, preprocessing and segmenting the images, and constructing a detection model. It also mentions the importance of parallel computing in accelerating the training and testing of deep learning models, as well as the application of deep reinforcement learning in vehicle damage detection. Gandhi [21] discussed deep-learning-based automotive damage detection, classification, and severity assessment, using models such as CNN, YOLO, and DenseNet for detection, localization, and classification of damaged cars, together with techniques such as data augmentation and transfer learning. The study also created its own dataset, used different pre-trained models, and explored model performance and room for improvement.

All of the above studies have merits, but the body polishing repair process does not need to identify broken glass, deformed bumpers, damaged headlamps, etc., so applying the above datasets to the process of detecting the damaged areas of body polishing repairs has some limitations.

2.2. Paint Defect Detection

From the perspective of body paint defect detection, many scholars have also made different attempts. Xu et al. [22] introduced an automatic defect detection method for vehicle painting based on the APF-ACO algorithm. The method includes an edge detection algorithm, a reflection region elimination algorithm, and a defect region identification algorithm. By optimizing the ant colony algorithm, edge details are effectively retained during detection. Experimental validation shows that the algorithm can accurately identify defects in the vehicle body coating film with an accuracy of 97.76%. Kamani et al. [23] presented a method for automatic detection and classification of automotive body painting defects that uses the joint distribution of the local binary pattern (LBP) and local variance to detect defective regions and a Bayesian classifier to predict defect types. Experimental results show that the machine vision system meets real-time requirements while achieving high defect detection and classification efficiency. Zhang et al. [24] introduced an improved MobileNet-SSD algorithm for automatic defect detection in body painting. An offline data augmentation algorithm that increases the diversity of defect images through a multi-angle and multi-level cutting strategy is proposed to improve the anti-interference ability of the detection model, and a K-means clustering algorithm is used to determine suitable aspect ratios for body paint defect detection to improve accuracy. The results show that the enhanced MobileNet-SSD algorithm significantly improved detection accuracy and speed in automotive body paint defect detection. Akhtar et al. [25] proposed an automotive paint defect detection system based on phase deflection measurement, aimed at inspecting the exterior painted parts of automobiles. A sinusoidal fringe pattern is displayed on an LCD screen, a high-resolution camera captures the reflected image to calculate the surface phase map, and defects are detected based on their phase characteristics. Experimental results show that the system is able to detect most of the defects under study. Still, the detection performance suffers on white parts because white surfaces reflect all incident light, resulting in sensitivity to ambient light and reducing the visibility of defects. Borsu et al. [26] proposed a feature extraction and classification technique based on 3D surface meshes. A structured light sensor generates a 3D reconstruction of the surface of an automotive panel, and a stereo camera captures an image of the surface and acquires high-density 3D point cloud data, which makes it possible to accurately identify unwanted deformations (e.g., depressions and protrusions) on automotive components. The experimental results show that the proposed automated inspection and labeling system can meet the standardized inspection requirements of the automotive manufacturing industry while ensuring production efficiency.

Although the above studies can effectively detect different types of paintwork defects, the damage encountered in the body repair process usually appears as pits and scratches of widely varying size, which these methods cannot effectively detect and localize. Thus, they serve only as a reference for ideas.

3. Our Approach

In this section, we provide an introduction to the process related to body polishing repair and point out the two types of damage that the process needs to detect and identify. We also present the improved YOLOv11 network and related module improvement sections based on the characteristics of the two types of damage.

3.1. Introduction to Sanding Repair Processes

Figure 1 summarizes the grinding and repair process flow, compiled from a field study at a car repair 4S shop.

First, the repairer determines the repair area by inspecting the location of the body damage and uses a grinding tool to remove the paint at that location. After the old paint is removed by surface grinding, the result shown in Figure 1b is obtained, which contains bare steel plate where the color paint has been removed and a greyish-white area where the varnish has been removed. Then, the sanded area is filled with atomic grey putty and sanded to make a uniform transition between the repaired portion and the original paint surface. Atomic grey cannot be filled in a single pass; it must be applied by scraping on multiple thin layers. On the one hand, this avoids bubbles and other defects caused by filling too thickly; on the other hand, atomic grey generates heat while curing, so the filler must cool fully before subsequent repair work to prevent uneven putty thickness in the repair area. As shown in Figure 1c–f, when multiple layers are scraped on, each layer must cover a larger area than the previous one to avoid forming multiple steps, which would increase the difficulty of sanding. At the same time, the putty should be only slightly higher than the original surface, because if it is much higher than the normal paint surface, removing the excess putty in the subsequent finishing and grinding stage takes a lot of time and effort. In addition, to prevent depressions in the sanded putty surface, the scraped surface should not be high around the edges and low in the middle, since the lower part could then not be sanded effectively. Next, the atomic grey surface, which is slightly higher than the original paint surface after filling, is finish-sanded to obtain a smooth surface with a uniform transition, as shown in Figure 1g. Finally, after painting and finish grinding of the surface, a repair result with the same color and finish as the original paint surface is obtained, as shown in Figure 1h.

In this process, automated repair needs to detect the damaged areas, which include dents and scratches, as shown in Figure 2. Since both types of damage are similar to the standard paint surface in color and texture, the neural network cannot effectively distinguish the extracted features, which easily leads to false detections and missed detections.

3.2. Introduction to the YOLOv11-BSS Algorithm

Therefore, this paper proposes an improved algorithm based on YOLOv11 to improve the detection of damaged body areas. The improved YOLOv11-BSS model structure is shown in Figure 3. The image to be tested is sent to the backbone feature extraction network. After multiple feature extractions by the improved bidirectional deformable convolution and C3K2 modules, the SPPF module performs multi-scale feature fusion and passes the result to the C2PSA-SCSA module for feature screening, completing the feature extraction of the image to be tested. The features of different scales from the 4th, 6th, and 10th layers of the backbone network are then sent to the improved slim-neck network for feature fusion to enrich the semantic expression of the features. Finally, the fused features from the 16th, 19th, and 22nd layers are sent to the three detection heads for damage location prediction to obtain the location of the damaged area that needs to be polished.

3.2.1. Bi-Deformable Convolution

In traditional convolutional modules, the kernel size and shape are fixed during feature extraction. This restricts the receptive field and prevents the sampling locations from accurately matching the damaged areas of the paint surface, which can easily reduce the accuracy of the feature information. Therefore, this paper uses an improved deformable convolution [27] for feature extraction in the backbone network, adjusting the kernel shape according to the characteristics of different damaged areas of the paint surface to obtain higher-quality image features and improve the feature extraction capability of the network.

The deformable convolution changes the sampling area by adding an offset to the normal convolution, allowing the feature extraction stage to adapt to different detection shapes. The principle is shown in Equation (1).

$$y(p_{0}) = \sum_{p_{n} \in R} w(p_{n}) \cdot x(p_{0} + p_{n} + \Delta p_{n}) \tag{1}$$

where $y(p_{0})$ represents the specific value of each point on the output feature map, $w(p_{n})$ represents the weight at the corresponding position on the convolutional kernel, $\Delta p_{n}$ represents the offset of the convolutional kernel deformation process, $x(p_{0} + p_{n} + \Delta p_{n})$ represents the specific value at the corresponding offset sampling point, and the offset of each point on the convolutional kernel relative to the center point is defined by $R = \{(-1, -1), (-1, 0), \dots, (0, 1), (1, 1)\}$.

The offset is generated by convolving the input feature map with another convolutional kernel. It is generally fractional, so the offset sampling position does not coincide with an actual pixel on the feature map. The pixel value at this position is therefore calculated using bilinear interpolation, as shown in Equation (2), where $G(q, p)$ is given by Equation (3).

$$x(p) = \sum_{q} G(q, p) \cdot x(q) \tag{2}$$

$$G(q, p) = \max(0, 1 - |q_{x} - p_{x}|) \cdot \max(0, 1 - |q_{y} - p_{y}|) \tag{3}$$
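As an illustration of Equations (1)–(3), the following minimal sketch evaluates a deformable 3 × 3 convolution at a single output position in plain Python. The function names, the nested-list feature map, the (row, column) ordering of positions, and the zero-padding behavior outside the map are all assumptions made for this example; it is not the authors' implementation.

```python
import math

def bilinear_kernel(q, p):
    """G(q, p) from Eq. (3): product of two 1-D triangular kernels."""
    return max(0.0, 1.0 - abs(q[0] - p[0])) * max(0.0, 1.0 - abs(q[1] - p[1]))

def sample(feature, p):
    """x(p) from Eq. (2) for a fractional position p = (row, col).

    Only the four integer neighbours of p can have a non-zero kernel
    value, so the sum over all q reduces to those grid points.
    Neighbours outside the map contribute zero (assumed zero padding).
    """
    rows, cols = len(feature), len(feature[0])
    r0, c0 = math.floor(p[0]), math.floor(p[1])
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < rows and 0 <= c < cols:
                value += bilinear_kernel((r, c), p) * feature[r][c]
    return value

def deformable_output(feature, weights, offsets, p0):
    """y(p0) from Eq. (1).

    weights maps each kernel position p_n in R to w(p_n);
    offsets maps p_n to its learned offset (missing keys mean no offset).
    """
    total = 0.0
    for pn, w in weights.items():
        dp = offsets.get(pn, (0.0, 0.0))
        pos = (p0[0] + pn[0] + dp[0], p0[1] + pn[1] + dp[1])
        total += w * sample(feature, pos)
    return total

feature = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0]]
# The regular 3x3 grid R with uniform weights (an averaging kernel).
grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
weights = {pn: 1.0 / 9.0 for pn in grid}
print(sample(feature, (0.5, 0.5)))                        # midpoint of four pixels
print(deformable_output(feature, weights, {}, (1, 1)))    # zero offsets: plain conv
```

With all offsets set to zero the result reduces to an ordinary convolution, which is a convenient sanity check before adding learned offsets.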

A near-rectangular pit or elongated scratch characterizes the damaged area of the car body. When using deformable convolution for feature extraction, the bias direction is not restricted, and it is easily affected by the slight differences in paint color and texture, which causes errors in the scope of the convolution and, in turn, affects the feature extraction results. Therefore, this paper restricts the bias direction of the convolution kernel in the convolution process to the horizontal and vertical directions and designs a bidirectional deformable convolution to reduce the disorderliness of the selection of the bias direction of the damaged features and better match the features of the damaged areas of the body.

Figure 4 shows the improved bidirectional deformable convolution (Bi-DCN). The bias vector generated by the convolution is projected horizontally or vertically, with the four quadrant angle bisector lines as the boundary conditions, as shown in Equation (4), to constrain the bias vector and better adapt to the surface characteristics.

$$\Delta p_{n}^{'} = \begin{cases} \Delta p_{n} \cdot \cos\theta, & 0^{\circ} < \theta < 45^{\circ} \\ \Delta p_{n} \cdot \sin\theta, & 45^{\circ} < \theta < 90^{\circ} \\ \Delta p_{n} \cdot \sin\theta, & 90^{\circ} < \theta < 135^{\circ} \\ \Delta p_{n} \cdot \cos\theta, & 135^{\circ} < \theta < 180^{\circ} \end{cases} \tag{4}$$

where $\theta$ is the angle between the bias vector and the positive horizontal direction of the coordinate system, and $\Delta p_{n}^{'}$ is the bias vector after constraint.
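One way to read Equation (4) is that each offset vector is projected onto whichever image axis it is closer to, with the quadrant bisectors as the switching boundaries. The sketch below follows that reading. The function name `constrain_offset` is hypothetical, the lower half-plane (which Equation (4) does not list) is assumed to be handled symmetrically, and offsets lying exactly on a bisector are assigned to the horizontal axis as an arbitrary choice.

```python
import math

def constrain_offset(dx, dy):
    """Project an offset (dx, dy) onto the nearer of the two image axes.

    |theta| closer to 0 or 180 degrees -> keep the horizontal component,
    i.e. |dp| * cos(theta); otherwise keep the vertical component,
    i.e. |dp| * sin(theta).
    """
    magnitude = math.hypot(dx, dy)
    theta = math.atan2(dy, dx)          # angle to the positive horizontal axis
    if abs(dx) >= abs(dy):              # within 45 degrees of the horizontal axis
        return (magnitude * math.cos(theta), 0.0)
    return (0.0, magnitude * math.sin(theta))

print(constrain_offset(2.0, 1.0))    # mostly horizontal offset
print(constrain_offset(0.5, -1.5))   # mostly vertical offset
```

Under this interpretation the constrained sampling points can only slide along rows or columns, which matches the stated aim of aligning them with near-rectangular pits and elongated scratches.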

This paper introduces this module into the backbone feature extraction network part of the YOLOv11 network, replacing the Conv modules in layers 1, 3, 5, and 7 of the original network to improve the quality of deep features obtained by downsampling the backbone feature extraction network and improve the expressiveness of the model. The schematic diagram of the module’s feature extraction for dents and scratches is shown in Figure 5. Due to the limitation of the direction of the convolution bias, the model can better focus the feature extraction points on the edges of pit damage and scratch damage during the feature extraction process, improving the feature extraction capability of the backbone network.

3.2.2. C2PSA-SCSA Module

The backbone feature extraction network performs an efficient multi-scale pooling of the feature map through the SPPF module and feeds it into the C2PSA module for the enhancement of essential features. Due to the slight difference in the characteristics of damaged areas of the paint surface, the multi-head attention mechanism in the C2PSA module cannot efficiently screen features through global dependencies. Therefore, this paper introduces the spatial and channel synergistic attention mechanism [28] (SCSA) into the C2PSA module, fully considers the relationship between different feature images and within the image, and designs the C2PSA-SCSA module to replace the original C2PSA for screening paint damage features and enhancing the feature expression of the damage location.

The structure of the spatial and channel synergistic attention module is shown in Figure 6 and is divided into shared multi-semantic spatial attention and progressive channel self-attention. First, the input feature map is decomposed along the height and width directions, and global average pooling is performed to generate two unidirectional one-dimensional sequences, which are then equally divided into four independent sub-features along the channel dimension, facilitating the extraction of multi-semantic spatial features. Then, the four sub-features are passed through depthwise one-dimensional convolutions with different kernel sizes to obtain different semantic spatial structures. After alignment by shared convolution, feature normalization and feature activation are performed to generate spatial attention. Finally, the results after spatial attention processing are pooled and mapped to create Q, K, and V, which are used to compute attention between channels. This ultimately achieves the fusion of spatial and channel attention, enhancing the information fusion capabilities of local and global features.

The improved C2PSA-SCSA module structure is shown in Figure 7. The original multi-head attention is replaced with spatial and channel synergistic attention, and the two traditional convolutions are replaced with bi-deformable convolutions for better feature selection, thus forming the PSABlock-SCSA-embedded C2PSA module. The residual connection is retained in the C2PSA-SCSA module, and the number of PSABlock-SCSA modules is set to 1.

3.2.3. Slim-Neck Fusion Network

The original network neck fusion processing mainly relies on the C3K2 module, which replaces the bottleneck module of C2f with a C3K module composed of two bottleneck modules and a traditional convolution in parallel. While the increase in convolution improves feature fusion, it also reduces the differences in the paint feature map, which in turn causes a loss of information. To address this shortcoming, the GSConv and VoV-GSCSP modules are introduced into the neck to construct the slim-neck [29] feature fusion network, thereby reducing the information loss in the feature fusion process.

The structure of GSConv is shown in Figure 8. In this structure, the input feature map is passed through a traditional convolution to generate a feature map with only half the number of output channels. This feature map is then passed through a depthwise separable convolution, which produces a new feature map with the same number of channels. The two feature maps are concatenated, and the channels are shuffled to obtain the final feature map. In mapping C1 input channels to C2 output channels, half of the output channels come from the traditional convolution and half from the depthwise separable convolution, which avoids redundant calculations and limits the loss of information interaction between channels caused by depthwise separable convolution. Through the cooperation of traditional convolution, depthwise separable convolution, and channel shuffling, the amount of convolution calculation is reduced while the effect of conventional convolution is approximated and the non-linear expression ability of the convolution is increased.
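To make the saving concrete, the following sketch counts convolution weights for a standard convolution and for the GSConv layout described above, and shows a channel shuffle on a list of channel labels. The helper names are hypothetical, biases and normalization parameters are ignored, the depthwise stage is modeled as a single depthwise kernel per channel, the 5 × 5 depthwise kernel size is an assumption not stated in the text, and the interleaving order of the shuffle is one common choice rather than a confirmed detail of the module.

```python
def conv_params(c1, c2, k):
    """Weights in a standard k x k convolution from c1 to c2 channels."""
    return c1 * c2 * k * k

def gsconv_params(c1, c2, k, k_dw=5):
    """Weights in the GSConv layout described in the text.

    A standard convolution produces c2 / 2 channels, then a depthwise
    convolution (one k_dw x k_dw filter per channel, assumed size)
    produces the other c2 / 2 channels.
    """
    assert c2 % 2 == 0, "output channels must be even"
    half = c2 // 2
    dense = conv_params(c1, half, k)    # standard convolution branch
    depthwise = half * k_dw * k_dw      # one filter per channel
    return dense + depthwise

def channel_shuffle(channels, groups=2):
    """Interleave channels from `groups` equal-sized groups."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]

standard = conv_params(128, 256, 3)
slim = gsconv_params(128, 256, 3)
print(standard, slim, round(slim / standard, 3))
print(channel_shuffle(["conv0", "conv1", "dw0", "dw1"]))
```

For these illustrative channel counts the GSConv layout needs roughly half the weights of the standard convolution, and the shuffle mixes the two halves so that later layers see both kinds of channel side by side.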

The structure of the VoV-GSCSP module designed using GSConv and DWConv is shown in Figure 9. This module first replaces the two traditional convolutions in the bottleneck module with GSConv and replaces the conventional convolution on the residual connection with a depthwise separable convolution to form a new GSbottleneck. This further offsets the increase in computational complexity brought about by the bidirectional deformable convolutions and also avoids the partial loss of semantic information caused by spatial dimension compression and channel dimension expansion. Based on this bottleneck module, a VoV-GSCSP module is formed in conjunction with a traditional convolution module to reduce computational complexity while maintaining accuracy. Finally, the two conventional convolutions in the neck are replaced with GSConv, and the four C3K2 modules are replaced with VoV-GSCSP modules to form the slim-neck feature fusion network, which better integrates the features extracted by the backbone feature extraction network.

3.2.4. Focaler-CIoU Loss Function

The damaged area of the car body differs only slightly from the standard paint surface, which makes it difficult for the Bbox loss function to adjust the parameters according to such small differences in the feature map. Therefore, based on the CIoU bounding-box loss function, the idea of linear mapping is introduced to redefine the bounding-box loss function [30], as shown in Equation (5), so that the loss balances hard and easy samples, increases the IoU value of the harder-to-recognize samples, and makes the model pay more attention to the difficult-to-recognize samples.

$$L_{\text{Focaler-CIoU}} = L_{\text{CIoU}} + \text{IoU} - \text{IoU}_{\text{Focaler}} \tag{5}$$

where $L_{\text{CIoU}}$ represents CIoU converted to a loss form, which can be expressed as Equation (6); $\text{IoU}$ represents the degree of overlap between the predicted box and the ground-truth box, which can be expressed as Equation (10); and $\text{IoU}_{\text{Focaler}}$ represents the adjusted IoU, which can be expressed as Equation (11).

$$L_{\text{CIoU}} = 1 - \text{CIoU} \tag{6}$$

where CIoU is the original form of the bbox loss term, as expressed in Equation (7).

$$\text{CIoU} = \text{IoU} - \left(\frac{d^{2}}{c^{2}} + \alpha v\right) \tag{7}$$

where $d$ represents the distance between the center points of the predicted box and the ground-truth box, $c$ represents the diagonal length of the minimum enclosing rectangle of the two boxes, $\alpha$ is the weight coefficient, as expressed in Equation (8), and $v$ is the shape difference, as expressed in Equation (9).

$$\alpha = \frac{v}{(1 - \text{IoU}) + v} \tag{8}$$

$$v = \frac{4}{\pi^{2}} \left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{9}$$

$$\text{IoU} = \frac{|B \cap B_{GT}|}{|B \cup B_{GT}|} \tag{10}$$

$$\text{IoU}_{\text{Focaler}} = \begin{cases} 0, & \text{IoU} < d_{1} \\ \dfrac{\text{IoU} - d_{1}}{u - d_{1}}, & d_{1} \leq \text{IoU} \leq u \\ 1, & \text{IoU} > u \end{cases} \tag{11}$$

where $[d_{1}, u] \in [0, 1]$, and $d_{1}$ and $u$ are taken as 0 and 0.95, respectively.
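Equations (5)–(11) can be traced numerically with a short sketch. The code below assumes axis-aligned boxes given as (x1, y1, x2, y2) with positive width and height, and follows the standard CIoU convention in which d is the distance between box centers and c is the diagonal of the smallest enclosing rectangle. The function names are chosen for this example only and do not come from the paper's code.

```python
import math

def iou(box, gt):
    """Eq. (10): intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    union = ((box[2] - box[0]) * (box[3] - box[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def ciou(box, gt):
    """Eqs. (7)-(9): IoU penalized by center distance and shape mismatch."""
    i = iou(box, gt)
    # d^2: squared distance between the two box centers
    d2 = (((box[0] + box[2]) - (gt[0] + gt[2])) / 2) ** 2 \
        + (((box[1] + box[3]) - (gt[1] + gt[3])) / 2) ** 2
    # c^2: squared diagonal of the smallest rectangle enclosing both boxes
    c2 = (max(box[2], gt[2]) - min(box[0], gt[0])) ** 2 \
        + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2
    w, h = box[2] - box[0], box[3] - box[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    v = 4 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = (1 - i) + v
    alpha = v / denom if denom > 0 else 0.0   # Eq. (8), guarded for perfect overlap
    return i - (d2 / c2 + alpha * v)

def focaler_iou(i, d1=0.0, u=0.95):
    """Eq. (11): piecewise linear remapping of IoU onto [0, 1]."""
    if i < d1:
        return 0.0
    if i > u:
        return 1.0
    return (i - d1) / (u - d1)

def focaler_ciou_loss(box, gt, d1=0.0, u=0.95):
    """Eq. (5): L_CIoU + IoU - IoU_Focaler, with L_CIoU = 1 - CIoU (Eq. 6)."""
    i = iou(box, gt)
    return (1 - ciou(box, gt)) + i - focaler_iou(i, d1, u)

print(focaler_ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # perfect match
print(focaler_ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))   # partial overlap
```

With d1 = 0 and u = 0.95, any prediction whose IoU exceeds 0.95 is treated as fully matched by the Focaler term, so the remaining gradient signal is concentrated on lower-IoU samples.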

4. Experiment

4.1. Dataset

No public dataset exists for vehicle body repair area detection, and the CarDD dataset, which targets the insurance industry, covers damage categories different from those needed for body repair area detection. Therefore, in this paper, we cooperated with an automobile repair 4S shop to collect and produce the data required for the inspection task. The images were provided by the 4S shop and given to an experienced repair technician for initial marking of the damage locations; after repeated discussion with and learning from the technician, the initial dataset was obtained by labeling the damage images with annotation software. The technician then screened the labeling results, and the final dataset was obtained after both sides jointly checked and corrected the data to ensure a shared understanding of the damage locations. The labeling of scratches and pit damage using labelme is shown in Figure 10. The original inspection dataset was further divided into three parts, namely, a training set, a test set, and a validation set, which contain 414, 209, and 183 images, respectively. Data augmentation was performed on the original detection images by rotating, adding salt-and-pepper noise, and adjusting the contrast. The training set was expanded to 3726 images to prevent overfitting during network training. The composition of the dataset after data augmentation is shown in Table 1, and the augmentation effect is shown in Figure 11.
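The three augmentation operations named above can be sketched on a small grayscale image stored as nested lists. The rotation angle, noise ratio, and contrast factor used here are illustrative assumptions, since the text does not state the settings, and the function names are hypothetical. Note that 3726 = 9 × 414, which would be consistent with nine variants per original training image, although that breakdown is an inference rather than something the text states.

```python
import random

def rotate_90(image):
    """Rotate a 2-D image clockwise by 90 degrees (angle is an assumption)."""
    return [list(row) for row in zip(*image[::-1])]

def salt_and_pepper(image, amount=0.05, seed=None):
    """Set roughly `amount` of the pixels to 0 (pepper) or 255 (salt)."""
    rng = random.Random(seed)
    noisy = [row[:] for row in image]       # leave the input untouched
    for r in range(len(noisy)):
        for c in range(len(noisy[r])):
            roll = rng.random()
            if roll < amount / 2:
                noisy[r][c] = 0
            elif roll < amount:
                noisy[r][c] = 255
    return noisy

def adjust_contrast(image, factor):
    """Scale each pixel's distance from mid-grey (128), clipped to [0, 255]."""
    return [[min(255, max(0, round(128 + factor * (px - 128)))) for px in row]
            for row in image]

image = [[10, 100],
         [200, 250]]
print(rotate_90(image))
print(adjust_contrast(image, 1.5))
print(salt_and_pepper(image, amount=0.5, seed=0))
```

For detection data the bounding-box labels must be transformed together with the image when rotating; noise and contrast changes leave the labels unchanged.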

4.2. Experimental Environment and Parameter Settings

The experiments were run on Windows 10 Pro with an AMD 7500F CPU, an NVIDIA GeForce RTX 4070 GPU (12 GB), and 32 GB of memory. The development language is Python 3.9.20, and the deep learning framework is PyTorch 1.10.1 with CUDA 11.3. The specific training parameter settings are shown in Table 2.
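The settings in Table 2 can be collected into a single configuration. The sketch below assumes the Ultralytics Python package, which the paper does not name as its training toolchain. The keyword names follow that package's training arguments, while `data_yaml` and the default `yolo11n.pt` weights are placeholders for illustration.

```python
# Hyperparameters copied from Table 2. The keyword names follow the
# Ultralytics training API, which is an assumption about the toolchain.
TRAIN_ARGS = {
    "epochs": 150,
    "batch": 16,
    "lr0": 0.01,             # initial learning rate
    "weight_decay": 0.0005,
    "optimizer": "SGD",
    "momentum": 0.937,
    "imgsz": 640,            # images are resized to 640 x 640
}


def train(data_yaml, weights="yolo11n.pt"):
    """Launch training; `data_yaml` and `weights` are placeholder paths."""
    # Imported lazily so TRAIN_ARGS can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_ARGS)


print(TRAIN_ARGS)
```

Calling `train("damage.yaml")`, with a hypothetical dataset file, would start a run with these settings on the stock YOLO11 nano model; the modified YOLOv11-BSS architecture would need its own model definition.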

4.3. Evaluation Indicator

In this paper, model performance is evaluated using precision (P), recall (R), and the mean average precision metrics mAP@50 and mAP@50-95. Precision reflects the accuracy of the model's predictions and is calculated as shown in Equation (12). Recall reflects the model's ability to detect actual positive samples and is calculated as shown in Equation (13). mAP denotes the mean average precision, and the number after it indicates the IoU threshold: mAP@50 is the mean average precision at an IoU threshold of 0.5, and mAP@50-95 is the mean average precision over IoU thresholds in the range of 0.5–0.95. The calculations are shown in Equations (14) and (15).

Precision = \frac{TP}{TP + FP}

(12)

Recall = \frac{TP}{TP + FN}

(13)

AP = \int_{0}^{1} P(R) \, dR

(14)

mAP = \frac{1}{C} \sum_{i = 1}^{C} AP_{i}

(15)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively; AP is the area enclosed by the P-R curve and the coordinate axes; C is the number of object classes; and mAP is the mean of the AP values over all classes.
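Equations (12)–(15) translate directly into code. The sketch below is a simplified illustration: the integral in Equation (14) is approximated with the trapezoidal rule over sampled P-R points, whereas detection toolkits typically use an interpolated precision curve. The counts and curve points in the example are made up and do not come from the paper.

```python
def precision(tp, fp):
    """Eq. (12): fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Eq. (13): fraction of actual positives that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(recalls, precisions):
    """Eq. (14): area under P(R) by the trapezoidal rule.

    The points must be sorted by increasing recall.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Eq. (15): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts and P-R samples, for illustration only.
print(precision(tp=92, fp=8), recall(tp=92, fn=18))
ap_a = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
ap_b = average_precision([0.0, 1.0], [1.0, 0.5])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```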

4.4. Performance Comparison

To further demonstrate the advantages of the improved model, comparative experiments were conducted against Faster R-CNN, SSD, YOLOX-s, YOLOv7-tiny, YOLOv8n, YOLOv9-t, YOLOv10n, LW-YOLOv11, and G-YOLOv11. YOLOv7-tiny, YOLOv8n, YOLOv9-t, YOLOv10n, G-YOLOv11, the baseline YOLOv11, and the improved YOLOv11-BSS were also compared visually. The detection metrics of the different models are listed in Table 3, and the visual comparison results are shown in Figure 12.

Table 3 shows that the YOLOv11-BSS model performed best overall, leading in precision (92.0%), mAP@50 (92.1%), and mAP@50-95 (55.9%), with a recall of 83.5%. YOLOv10n and YOLOv9-t also performed well, reaching 90.1% and 53.2%, and 89.7% and 52.7%, for mAP@50 and mAP@50-95, respectively. YOLOv8n was slightly worse but still maintained a high mAP@50 (86.5%) and mAP@50-95 (50.2%); YOLOv7-tiny performed well in terms of precision (87.1%) and recall (81.5%); LW-YOLOv11 and G-YOLOv11 were relatively balanced, achieving 51.8% and 46.3% in mAP@50-95, respectively; Faster R-CNN had the highest recall (85.1%), but its precision (55.3%) was the lowest and its mAP was average; SSD was also average, with mAP@50 and mAP@50-95 of 81.7% and 46.0%, respectively; and YOLOX-s had the lowest recall (65.2%) and, despite a precision of 82.4%, the lowest mAP@50 (78.1%) and mAP@50-95 (39.7%). Overall, YOLOv11-BSS ranked first in P, mAP@50, and mAP@50-95 and trailed only Faster R-CNN (85.1%) and YOLOv9-t (83.9%) in R by a small margin, giving it the best overall performance.
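The per-metric ranking can be recomputed directly from the values in Table 3. The short sketch below copies the table into a dictionary and reports the rank of a model on each metric; the helper name `rank` is introduced here for illustration.

```python
# Values copied from Table 3: (P, R, mAP@50, mAP@50-95), all in percent.
TABLE3 = {
    "Faster R-CNN": (55.3, 85.1, 82.8, 45.7),
    "SSD": (80.4, 76.4, 81.7, 46.0),
    "YOLOX-s": (82.4, 65.2, 78.1, 39.7),
    "YOLOv7-tiny": (87.1, 81.5, 87.9, 49.9),
    "YOLOv8n": (84.9, 80.7, 86.5, 50.2),
    "YOLOv9-t": (83.5, 83.9, 89.7, 52.7),
    "YOLOv10n": (86.4, 83.2, 90.1, 53.2),
    "LW-YOLOv11": (78.8, 77.7, 85.9, 51.8),
    "G-YOLOv11": (83.3, 79.2, 83.9, 46.3),
    "YOLOv11-BSS": (92.0, 83.5, 92.1, 55.9),
}
METRICS = ("P", "R", "mAP@50", "mAP@50-95")


def rank(model, metric):
    """Return the 1-based rank of `model` on `metric` (1 = highest value)."""
    idx = METRICS.index(metric)
    ordered = sorted(TABLE3, key=lambda name: TABLE3[name][idx], reverse=True)
    return ordered.index(model) + 1


for metric in METRICS:
    print(metric, rank("YOLOv11-BSS", metric))
```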

Figure 12a shows the detection results for dented damage. The improved network achieves higher confidence scores than the baseline and the other reference networks (e.g., YOLOv7-tiny and YOLOv8n). Notably, it avoids the issue observed in the baseline network, where two targets were detected at the same damaged location. It also eliminates the false positives and missed detections present in YOLOv7-tiny and YOLOv8n, demonstrating the effectiveness of the model improvements in dent defect recognition.

Figure 12b illustrates the results for scratch detection. The improved network also exhibits higher confidence scores than the baseline and other comparative models. Specifically, it resolves the problem in G-YOLOv11, where two targets were detected at a single damaged location, further validating the enhanced capability of the modified model in identifying scratch defects.

4.5. Ablation Experiment

To further clarify the contributions of the individual improved modules to the network, ablation experiments were conducted by sequentially integrating each enhanced module into the original baseline network, as shown in Table 4.

According to the metrics in Table 4, the Bi-DCN module improved mAP@50 from 88.4% to 91.2% (+2.8 percentage points) and mAP@50-95 from 52.7% to 57.0% (+4.3 percentage points), demonstrating its significant enhancement of feature extraction capability, though at the cost of a 35% increase in parameters (from 2.58 M to 3.49 M) and an 83% rise in computational cost (from 6.3 to 11.5 GFLOPs). When the C2PSA-SCSA module was added on its own, precision improved from 84.1% to 90.6% (+6.5 percentage points), but recall decreased by 2.4 percentage points (from 82.1% to 79.7%), indicating a trade-off between suppressing false positives and missing detections. However, when combined with the Bi-DCN module, recall recovered to 83.8%, suggesting that bidirectional feature fusion alleviates missed detection issues. Integrating the slim-neck module on its own raised precision by 5.0 percentage points (from 84.1% to 89.1%) while reducing parameters by 3.5% (from 2.58 M to 2.49 M), validating the efficiency of its lightweight design, although it caused a slight decline of 0.8 percentage points in mAP@50 (from 88.4% to 87.6%). When slim-neck was combined with the Bi-DCN and C2PSA-SCSA modules, mAP@50 reached 91.1%. Introducing the Focaler-CIoU loss then improved precision from 91.1% to 92.0% (+0.9 percentage points) and mAP@50-95 from 54.7% to 55.9% (+1.2 percentage points), indicating enhanced localization accuracy through loss function refinement. Overall, compared to the baseline model, the improved network achieved 7.9 percentage points higher precision (from 84.1% to 92.0%), 1.4 percentage points higher recall (from 82.1% to 83.5%), and improvements of 3.7 and 3.2 percentage points in mAP@50 and mAP@50-95, respectively. These results validate the effectiveness of the proposed model enhancements.
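The ablation figures mix two kinds of change: absolute differences between percentage metrics and relative changes in model size and cost. The sketch below recomputes both from three rows of Table 4; the dictionary keys and helper names are introduced here for illustration.

```python
# Rows copied from Table 4 (baseline, baseline + Bi-DCN, full YOLOv11-BSS).
BASE = {"P": 84.1, "R": 82.1, "mAP@50": 88.4, "mAP@50-95": 52.7,
        "params_M": 2.58, "gflops": 6.3}
BI_DCN = {"P": 87.8, "R": 84.3, "mAP@50": 91.2, "mAP@50-95": 57.0,
          "params_M": 3.49, "gflops": 11.5}
FULL = {"P": 92.0, "R": 83.5, "mAP@50": 92.1, "mAP@50-95": 55.9,
        "params_M": 3.34, "gflops": 10.9}


def point_gain(new, old, key):
    """Absolute difference between two percentage metrics."""
    return round(new[key] - old[key], 1)


def relative_change(new, old, key):
    """Relative change in percent, used for parameters and GFLOPs."""
    return round(100.0 * (new[key] - old[key]) / old[key], 1)


print(relative_change(BI_DCN, BASE, "params_M"))  # parameter growth with Bi-DCN
print(relative_change(BI_DCN, BASE, "gflops"))    # compute growth with Bi-DCN
for key in ("P", "R", "mAP@50", "mAP@50-95"):
    print(key, point_gain(FULL, BASE, key))
```

Keeping the two helpers separate makes it explicit that a "+2.8" on mAP@50 and a "35%" on parameters are not the same kind of quantity.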

4.6. Comparison of Different Modules

For the convolutional operations on the backbone and neck, this paper introduces different downsampling methods such as omni-dimensional dynamic convolution (ODConv), lightweight adaptive extraction (LAE), context-guided block (CG block), online convolutional re-parameterization (OREPA), and dynamic convolution (Dynamic) to compare their feature extraction capabilities, and the results are shown in Table 5.

Table 5 shows that the Dynamic module performs best in terms of precision, with an improvement of 2.9 percentage points over the baseline model, but is only average on the other metrics; the CG block is average overall compared with the other modules; in contrast, the Bi-DCN module performs well overall, improving recall, mAP@50, and mAP@50-95 over the baseline model by 1.7, 1.3, and 1.5 percentage points, respectively. For the two kinds of damage to be detected in Figure 5, Figure 13 visualizes the features of Bi-DCN together with those of the Dynamic and CG block modules, which have the better convolutional effects among the alternatives in Table 5. As can be seen from the figure, the Bi-DCN module extracts the features around the damage well, with irrelevant features accounting for a smaller share; the Dynamic module also extracts the features around the damage well, but retains unimportant feature regions; and the CG block also captures the features at the damage location, but its overall quality is lower than that of Bi-DCN.

The combined results in Table 5 and Figure 13 show that the proposed approach of introducing the Bi-DCN module in the backbone feature extraction network part for improvement has substantial advantages.

For the C2PSA module, this paper introduces different attention mechanisms, such as spatially enhanced attention (SEAM), large separable kernel attention (LSKA), mixed local channel attention (MLCA), efficient multi-scale attention (EMA), and deformable attention transformer (DAT), to improve the multi-head attention part. Five C2PSA variants were designed, and their feature selection capabilities were compared and analyzed. The results are shown in Table 6.

Table 6 shows that the LSKA module performs well in terms of precision, with an improvement of 6.5 percentage points over the baseline model, but is only average on the other metrics; the MLCA module performs better in terms of recall, with an improvement of 2.5 percentage points over the baseline model; and the SCSA module performs best of all the compared modules, improving precision by 6.5 percentage points, mAP@50 by 2.3 percentage points, and mAP@50-95 by 4.3 percentage points over the baseline model. For the two kinds of damage to be detected in Figure 5, Figure 14 visualizes the features of SCSA together with those of the LSKA and MLCA modules, which have the better weight adjustment effects among the alternatives in Table 6. As can be seen from the figure, the SCSA module concentrates the feature weights near the damage location and better reflects the feature morphology; the LSKA module concentrates the features near the damage, but some features are distributed over non-damaged locations; and the MLCA module pays less attention to the feature area than the LSKA module, but fewer of its features fall in irrelevant areas.

The combined results of Table 6 and Figure 14 show that the C2PSA-SCSA module designed in this paper, in combination with the SCSA module, has a strong advantage in feature screening.

To demonstrate the improvement brought by the feature fusion network, this paper visualizes the damage features at the small-target detection head, and the resulting feature maps are shown in Figure 15. The figure shows that before slim-neck is introduced, feature detail is lost at the damaged location. After the slim-neck network structure is introduced, the feature details near the damage increase and the feature weights rise, which reduces the loss of feature information and effectively improves the network's overall recognition accuracy.

For the improvement of Focaler-CIoU, this paper compares the effect of different losses on the detection results, as shown in Table 7. The loss change process is also visualized, as shown in Figure 16.

Table 7 shows that GIOU performs best on precision (87.9%) but poorly on recall (76.9%) and mAP@50 (86.8%); DIOU performs best on recall (83.0%) but has a relatively low precision (83.4%); SIOU performs best on mAP@50-95 (54.1%); and Focaler-CIoU performs best on mAP@50 (88.9%), gives a more balanced overall performance, and has a lower loss value in Figure 16, which makes it suitable as the final loss function.

4.7. Misdetection and Missed Detection Analysis

For the final detection results, this paper counts the false and missed detections on the validation set before and after the algorithm improvement, and the results are shown in Table 8.

Table 8 shows that the improved YOLOv11-BSS algorithm proposed in this paper effectively improves feature extraction, reducing the false and missed detection rate by 9.8 percentage points for dented damage (from 13.4% to 3.6%) and by 8.5 percentage points for scratch damage (from 11.3% to 2.8%). When the differences in color and texture between the damaged paint layer and the normal paint layer are minimal, the improved algorithm can effectively reduce the occurrence of missed or misidentified damaged areas.
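The rates and reductions follow from the raw counts in Table 8. The sketch below recomputes them; the dictionary layout and function names are chosen here for illustration.

```python
# Counts copied from Table 8: (images, false + missed detections).
TABLE8 = {
    ("YOLOv11", "dented"): (112, 15),
    ("YOLOv11", "scratch"): (71, 8),
    ("YOLOv11-BSS", "dented"): (112, 4),
    ("YOLOv11-BSS", "scratch"): (71, 2),
}


def error_rate(model, damage):
    """False and missed detection rate in percent, rounded to one decimal."""
    images, errors = TABLE8[(model, damage)]
    return round(100.0 * errors / images, 1)


def reduction(damage):
    """Drop in the rounded error rate from YOLOv11 to YOLOv11-BSS."""
    return round(error_rate("YOLOv11", damage)
                 - error_rate("YOLOv11-BSS", damage), 1)


for damage in ("dented", "scratch"):
    print(damage, error_rate("YOLOv11", damage),
          error_rate("YOLOv11-BSS", damage), reduction(damage))
```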

5. Conclusions

The damaged areas of the car body differ only slightly from the normal paint surface in color and texture, which leads to false and missed detections and reduces the accuracy of the detection results. To address this problem, this paper proposes an improved detection algorithm based on YOLOv11. Compared with the baseline network, the improved YOLOv11-BSS network comprehensively improved detection performance while reducing missed and false detections. Future research on the detection of damaged areas on the body will need to focus on further improving detection accuracy to provide better results for automated body repair. At the same time, the dataset of damaged body areas needs to be expanded further to provide a better data basis for network feature learning and to further enhance the robustness of the algorithm.

Author Contributions

Conceptualization, Z.Z.; methodology, Y.L.; software, Y.L.; validation, Y.P.; formal analysis, Z.Z.; writing—original draft, Y.L.; writing—review and editing, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
2. Zhao, X.; Li, W.; Zhang, Y.; Gulliver, T.A.; Chang, S.; Feng, Z. A faster RCNN-based pedestrian detection system. In Proceedings of the 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall), Montreal, QC, Canada, 18–21 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5.
3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016; Proceedings, Part I 14; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
4. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
5. Zendehdel, N.; Chen, H.; Leu, M.C. Real-time tool detection in smart manufacturing using You-Only-Look-Once (YOLO) v5. Manuf. Lett. 2023, 35, 1052–1059.
6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
7. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
8. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
9. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
10. Jocher, G.; Stoken, A.; Borovec, J.; Liu, C.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/yolov5: v3.0. Zenodo 2020.
11. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
12. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
13. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
14. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 1–21.
15. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2025, 37, 107984–108011.
16. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
17. Ramazhan, M.R.S.; Bustamam, A.; Anwar, R. Car Body Damage Detection System Using YOLOv7. In Proceedings of the 2023 3rd International Conference on Electronic and Electrical Engineering and Intelligent System (ICE3IS), Yogyakarta, Indonesia, 9–10 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 498–502.
18. Wang, X.; Li, W.; Wu, Z. Cardd: A new dataset for vision-based car damage detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7202–7214.
19. Elgargouh, Y.; Ghazali, M.; Louhdi, M.R.C.; Zemmouri, E.M.; Behja, H. Computer Vision for Damage Detection in Cars Images. In Proceedings of the 2024 7th International Conference on Advanced Communication Technologies and Networking (CommNet), Rabat, Morocco, 4–6 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8.
20. Gustian, Y.W.; Rahman, B.; Hindarto, D.; Wedha, A.B.P.B. Detects Damage Car Body using YOLO Deep Learning Algorithm. Sink. J. Dan Penelit. Tek. Inform. 2023, 7, 1153–1165.
21. Gandhi, R. Deep learning based car damage detection, classification and severity. Int. J. 2021, 10, 2947–2953.
22. Xu, J.; Zhang, J.; Zhang, K.; Liu, T.; Wang, D.; Wang, X. An APF-ACO algorithm for automatic defect detection on vehicle paint. Multimed. Tools Appl. 2020, 79, 25315–25333.
23. Kamani, P.; Noursadeghi, E.; Afshar, A.; Towhidkhah, F. Automatic paint defect detection and classification of car body. In Proceedings of the 2011 7th Iranian Conference on Machine Vision and Image Processing, Tehran, Iran, 16–17 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1–6.
24. Zhang, J.; Xu, J.; Zhu, L.; Zhang, K.; Liu, T.; Wang, D.; Wang, X. An improved MobileNet-SSD algorithm for automatic defect detection on vehicle body paint. Multimed. Tools Appl. 2020, 79, 23367–23385.
25. Akhtar, S.; Tandiya, A.; Moussa, M.; Tarry, C. An efficient automotive paint defect detection system. Adv. Sci. Technol. Eng. Syst. J. 2019, 4, 171–182.
26. Borsu, V.; Yogeswaran, A.; Payeur, P. Automated surface deformations detection and marking on automotive body panels. In Proceedings of the 2010 IEEE International Conference on Automation Science and Engineering, Toronto, ON, Canada, 21–24 August 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 551–556.
27. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
28. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the Synergistic Effects Between Spatial and Channel Attention. arXiv 2024, arXiv:2407.05128.
29. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62.
30. Dong, X.; Liu, Y.; Dai, J. Concrete Surface Crack Detection Algorithm Based on Improved YOLOv8. Sensors 2024, 24, 5252.
31. Huang, J.; Wang, K.; Hou, Y.; Wang, J. LW-YOLO11: A Lightweight Arbitrary-Oriented Ship Detection Method Based on Improved YOLO11. Sensors 2024, 25, 65.
32. Ferdi, A. Lightweight G-YOLOv11: Advancing Efficient Fracture Detection in Pediatric Wrist X-rays. arXiv 2024, arXiv:2501.00647.

Figure 1. Schematic diagram of the body sanding repair process.

Figure 2. Automatic polishing and repair requires detection of two types of damage; (a) dented damage; and (b) scratch damage.

Figure 3. YOLOv11-BSS algorithm structure diagram.

Figure 4. Schematic diagram of a bi-deformable convolution.

Figure 5. Schematic representation of feature extraction.

Figure 6. Schematic diagram of the spatial and channel synergistic attention.

Figure 7. Schematic diagram of the C2PSA-SCSA modular structure.

Figure 8. Schematic diagram of the GSConv module structure.

Figure 9. Schematic diagram of the VoV-GSCSP module structure.

Figure 10. Schematic of data labeling; (a) dented image labeling; and (b) scratch image labeling.

Figure 11. Schematic diagram of data enhancement; (a) dented damage enhanced; and (b) scratch damage enhanced.

Figure 12. Visual comparison of different model tests; (a) visualization of dented defects; and (b) visualization of scratch defects.

Figure 13. Comparison of feature maps for different downsampling methods; (a) Bi-DCN downsampling feature map for dented defects; (b) Bi-DCN downsampling feature map for scratch defects; (c) dynamic downsampling feature map for dented defects; (d) dynamic downsampling feature map for scratch defects; (e) CG-block downsampling feature map for dented defects; and (f) CG-block downsampling feature map for scratch defects.

Figure 14. Comparison of feature maps for different attention; (a) SCSA attentional feature map for dented defects; (b) SCSA attentional feature map for scratch defects; (c) LSKA attentional feature map for dented defects; (d) LSKA attentional feature map for scratch defects; (e) MLCA attentional feature map for dented defects; and (f) MLCA attentional feature map for scratch defects.

Figure 15. Comparison of feature maps before and after feature fusion network improvement; (a) dented defects feature map of the original feature fusion network; (b) scratch defects feature map of the original feature fusion network; (c) dented defects feature map of the slim-neck feature fusion network; and (d) scratch defects feature map of the slim-neck feature fusion network.

Figure 16. Comparison of results with different loss functions; (a) Valcls_loss; (b) Valdfl_loss; and (c) Valbox_loss.

Table 1. Dataset composition.

Dataset Category	Total Number of Pictures	Type of Damage	Number of Pictures per Type
Train	3726	dented	2100
Train	3726	scratch	1626
Test	209	dented	106
Test	209	scratch	103
Val	183	dented	112
Val	183	scratch	71

Table 2. Training parameters.

Experimental Parameter	Value
Epoch	150
Batch size	16
Learning rate	0.01
Weight decay	0.0005
Optimizer	SGD
Momentum	0.937
Image size	640 × 640

Table 3. Comparison of detection indicators for different models.

Model Name	P (%)	R (%)	mAP@50 (%)	mAP@50-95 (%)
Faster R-CNN	55.3	85.1	82.8	45.7
SSD	80.4	76.4	81.7	46.0
YOLOX-s	82.4	65.2	78.1	39.7
YOLOv7-tiny	87.1	81.5	87.9	49.9
YOLOv8n	84.9	80.7	86.5	50.2
YOLOv9-t	83.5	83.9	89.7	52.7
YOLOv10n	86.4	83.2	90.1	53.2
LW-YOLOv11 [31]	78.8	77.7	85.9	51.8
G-YOLOv11 [32]	83.3	79.2	83.9	46.3
YOLOv11-BSS	92.0	83.5	92.1	55.9

Table 4. Ablation experiment.

Base	Bi-DCN	C2PSA-SCSA	Slim-Neck	Focaler-CIoU	P (%)	R (%)	mAP@50 (%)	mAP@50-95 (%)	Params (M)	GFLOPs
√					84.1	82.1	88.4	52.7	2.58	6.3
√	√				87.8	84.3	91.2	57.0	3.49	11.5
√		√			90.6	79.7	90.7	57.0	2.54	6.4
√			√		89.1	76.5	87.6	50.2	2.49	5.9
√	√	√			81.1	87.2	89.7	52.7	3.44	11.4
√	√		√		87.9	83.8	89.8	54.3	3.64	11.5
√		√	√		86.9	79.5	87.3	50.4	2.44	5.8
√	√	√	√		91.1	84.4	91.1	54.7	3.34	10.9
√	√	√	√	√	92.0	83.5	92.1	55.9	3.34	10.9

(where “√” indicates the addition of a module).

Table 5. Comparison of different downsampling methods.

Convolution Module	P (%)	R (%)	mAP@50 (%)	mAP@50-95 (%)
Base	84.1	82.1	88.4	52.7
+Bi-DCN	84.4	83.8	89.7	54.2
+ODConv	85.1	78.8	87.6	50.8
+LAE	81.7	80.0	88.2	51.5
+OREPA	82.7	80.4	87.2	50.6
+CG block	86.2	79.3	88.3	51.1
+Dynamic	87.0	78.7	87.2	53.4

Table 6. Comparison of different attention modes.

Attention Module	P (%)	R (%)	mAP@50 (%)	mAP@50-95 (%)
Base	84.1	82.1	88.4	52.7
+SCSA	90.6	79.7	90.7	57.0
+SEAM	84.5	81.3	89.5	52.4
+LSKA	90.6	83.9	89.7	52.5
+MLCA	86.6	84.6	90.1	54.4
+EMA	80.9	83.2	87.3	50.9
+DAT	81.5	79.3	86.8	51.0

Table 7. Comparison of detection indicators for different loss.

Loss Function	P (%)	R (%)	mAP@50 (%)	mAP@50-95 (%)
Base	84.1	82.1	88.4	52.7
GIOU	87.9	76.9	86.8	51.7
DIOU	83.4	83.0	87.9	52.8
EIOU	85.7	80.7	88.7	51.2
SIOU	83.9	80.8	87.5	54.1
Focaler-CIoU	84.8	81.3	88.9	52.7

Table 8. False and missed detection statistics on the validation set.

Model Name	Type of Damage	Number of Pictures	Number of False and Missed Detections	False and Missed Detection Rate
YOLOv11	Dented	112	15	13.4%
YOLOv11	Scratch	71	8	11.3%
YOLOv11-BSS	Dented	112	4	3.6%
YOLOv11-BSS	Scratch	71	2	2.8%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
