1. Introduction
The insurance industry continues to evolve with the integration of advanced artificial intelligence (AI) solutions, driving transformative innovations in the processing and analysis of complex datasets. The potential of these solutions to streamline and enhance processes remains undeniable. In particular, the proposed study builds on recent findings in [1] and from Boston Consulting Group [2], which emphasize the transformative potential of AI in insurance, especially in damage assessment, underwriting, and customer service, provided that high-quality data are available.
Building upon our previous work with the Insoore AI pipeline [3], a system designed for the Italian market to automate vehicle damage assessment using computer vision, this study introduces a significant advancement in damage detection and classification. By leveraging one of the latest deep learning architectures in computer vision, namely DiffusionDet [4], this research improves damage detection and classification performance in the claim management process, marking a pivotal improvement over the Faster R-CNN [5] model employed in earlier iterations.
In particular, this paper thoroughly examines the methodology and outcomes associated with retraining Insoore AI over the original dataset using DiffusionDet, a Generative AI (GenAI) framework that is gaining increasing popularity in the scientific community focused on object detection. The retraining process was carried out on the Booster module of the LEONARDO HPC system [6], since less powerful setups could not accommodate such a training session. This transition to advanced computational platforms facilitated the exploration of enhanced model capabilities, allowing for a deeper understanding of the potential benefits and limitations of diffusion-based detection.
The retraining process resulted in substantial improvements in performance metrics, underscoring the effectiveness of diffusion-based detection techniques for solving the problem of automatic damage detection and classification for the claim management process. These findings not only highlight the transformative potential of computer vision in addressing real-world challenges faced by the insurance industry but also affirm the suitability of diffusion models as a powerful tool for tasks requiring high precision and reliability. The outcomes demonstrated through this research further solidify the role of diffusion-based detection as a viable and efficient approach for advancing automation and decision-making processes in insurance applications.
To the best of the authors' knowledge, this work represents the first attempt in the current literature at solving the problem of automatic car damage detection and classification using DiffusionDet [4]. In addition, relying on two-stage transfer learning makes it possible to push the performance beyond the state of the art, represented in chronological order by [3, 7].
This paper is organized in the following way:
Section 2 presents a recap of our previous work concerning the Insoore AI platform;
Section 3 provides the motivation behind this paper;
Section 4 provides the necessary background on DiffusionDet for object detection;
Section 5 describes the Insoore AI pipeline as enhanced by the improvements documented in the current paper;
Section 6 highlights the emerging need for HPC resources;
Section 7 discusses the dataset and the experimental setup with the final numerical results. Concluding remarks end the paper.
2. Recap of Previous Work
The initial development of the Insoore AI pipeline [
3] marked a significant step forward in integrating artificial intelligence into the Italian insurance market. Designed to automate the assessment of vehicle damages, the system addressed key challenges in the claims management process, including the inefficiency, subjectivity, and susceptibility to fraud inherent in manual evaluation methods. By leveraging computer vision and deep learning technologies, the pipeline aimed to provide a robust decision-support tool for insurance experts, reducing their reliance on physical vehicle inspections and expediting the claims settlement process.
The pipeline was structured around three main tasks: (i) car damage detection, (ii) detection of damaged vehicle components, and (iii) calculation of repair cost. The first step, damage detection, focused on localizing and classifying damages within images sent by clients. This was accomplished by employing the Faster R-CNN architecture, a deep learning model extensively used for object detection. In the second step, the identified damages were assigned to specific vehicle components, which the system classified into 53 distinct categories. Finally, in the third step, the pipeline estimated the severity of the identified damages and calculated the repair costs based on the updated prices of the vehicle components and the labor rates provided by the insurance company.
The dataset used in the initial implementation consisted of real-world accident images annotated into four damage classes: scratches, dents, cracks, and broken clips. Images were annotated following strict guidelines to ensure consistency and accuracy, accounting for the unique characteristics of the Italian insurance market. While the Faster R-CNN [
5] model demonstrated reasonable performance in terms of damage detection and classification, limitations were observed, particularly in handling complex damages and small, intricate details. This highlighted the need for exploring advanced architectures to overcome these shortcomings.
Extensive experiments were conducted to fine-tune the Faster R-CNN model, optimizing hyperparameters and evaluating performance using metrics such as Average Precision (AP), AP50, and AP75. Despite achieving competitive results in comparison with the main competing players in the Insurtech market (namely, Bdeo [
8] and Tractable [
9]), the experiments revealed inherent challenges in adapting the architecture to the specific requirements of the problem. Object detection performed better than segmentation-based approaches like Mask R-CNN [
10], as the latter struggled with generating accurate masks for damage localization. However, the overall detection accuracy indicated room for improvement in both precision and recall.
The findings from [
3] underscored the potential of AI-driven solutions for claims management while highlighting the limitations of the existing methodology. The system effectively automated significant portions of the process, reducing the burden on insurance experts and expediting claim resolutions. Nevertheless, the relatively moderate performance of Faster R-CNN motivated further exploration of more sophisticated models to enhance detection accuracy and robustness. This paper builds on those insights, presenting an updated pipeline powered by the DiffusionDet [
4] architecture, which leverages advanced diffusion-based detection methods to achieve superior results.
3. Motivation
The advancements in AI-driven car damage detection have significantly transformed the insurance industry, enhancing the efficiency and accuracy of damage assessment. While early methods relied on CNN-based object detectors like YOLO and Faster R-CNN, the evolution of transformer-based models and diffusion-based approaches has opened new possibilities for improved detection accuracy and robustness. The integration of these modern architectures addresses several challenges associated with conventional methods, such as difficulty in detecting subtle or complex damages and the reliance on large annotated datasets.
A key contribution of the present work is the application of DiffusionDet in a novel way, as demonstrated in Section 7. DiffusionDet redefines object detection by leveraging the iterative refinement process of diffusion models, allowing for better localization of damages even in challenging scenarios. Traditional CNN-based detectors often struggle with damages that lack clear contours or are small relative to the vehicle’s surface. By formulating detection as a denoising process, DiffusionDet mitigates these issues by systematically improving its predictions over multiple iterations, effectively reducing false positives and enhancing recall.
An additional and particularly noteworthy aspect of this work is the adoption of a two-step transfer learning strategy. While the COCO dataset is widely used and provides a strong foundation, it remains distant from the specific domain of cars and car damages. Therefore, the decision to pre-train on an intermediate dataset that bridges this gap before fine-tuning on the target domain is both innovative and strategic. This step is not commonplace in the literature and reflects a deep understanding gained through practical experience rather than conventional practice. It is a key move that significantly contributes to the model’s ability to generalize and perform effectively on complex, domain-specific tasks. Emphasizing this approach is crucial, as it highlights a level of methodological refinement that is rarely encountered and speaks to the domain-specific expertise behind the proposed approach.
The above-mentioned innovative aspects are tested and compared with the most relevant state-of-the-art approaches addressing the same problem, namely, [3, 7].
We also demonstrate that real-life data, such as ours, is significantly more challenging than CarDD [11], as evidenced by the comparatively lower performance metrics we obtain in Section 7. Object segmentation is notably more difficult than object detection, as shown in the numerical results in Section 7. In particular, our results indicate that Mask R-CNN [10] struggles to learn masks, whereas its detection network performs reasonably well. Therefore, we maintain that object detection is a more suitable solution for this problem than object segmentation, which has been widely attempted in other studies, such as [7]. Consequently, we adopt object detection and fine-tune our model, surpassing both the Mask R-CNN detection network and the Faster R-CNN baseline experiment. As later shown in the numerical results appearing in Section 7, the best overall performance is ultimately achieved by adopting DiffusionDet with a two-step transfer learning strategy, thanks to the HPC computational resource discussed in Section 6.
4. Background on DiffusionDet for Object Detection
DiffusionDet [
4] represents a remarkable and significant milestone in the continuous evolution of object detection frameworks. It leverages foundational principles derived from diffusion models, which were initially rooted in the disciplines of thermodynamics and statistical physics. These models, known for their theoretical elegance and mathematical rigor, have rapidly gained widespread attention and acclaim for their exceptional generative capabilities. They have been particularly impactful in tasks such as image synthesis, where their ability to generate high-quality and realistic outputs has set new benchmarks in the field.
Diffusion models themselves form a class of probabilistic generative models that have seen extensive application across various domains. Their core mechanism involves gradually removing noise from an initial, noisy representation to yield a clean, structured output. While their initial success was closely tied to the domain of image denoising, diffusion models have since broadened their scope to include tasks like image segmentation and, more recently, object detection. Reference [
4] marked the introduction of the latest diffusion model network specifically designed for object detection.
Building on this success, diffusion models have been adapted for discriminative applications, including object detection, marking a pivotal shift in their use cases. DiffusionDet specifically formulates the task of object detection as a denoising process. This methodology involves iteratively refining initially noisy bounding box proposals into precise and accurate object predictions. By treating object detection as a denoising process, DiffusionDet capitalizes on the inherent strengths of diffusion models, such as their ability to progressively reduce uncertainty and generate high-fidelity outputs.
This novel formulation not only aligns with the principles of diffusion but also represents a paradigm shift in how object detection is conceptualized and executed. The iterative refinement process employed by DiffusionDet ensures that the final predictions are both robust and reliable, highlighting its potential as a transformative tool in the field of computer vision. At its core, DiffusionDet incorporates two primary processes:
Forward diffusion [
4]: according to [
12], this process gradually adds Gaussian noise to object bounding boxes, rendering them indistinguishable from a random distribution. This process begins by systematically introducing noise to the bounding boxes in small increments, which gradually transforms the structured information of the bounding boxes into a completely noisy representation. The forward diffusion process essentially models the degradation of bounding box information over time, ensuring that the final noisy representation is sufficiently close to a random distribution. This step is critical, as it lays the foundation for the reverse diffusion process by providing the necessary starting point for noise removal.
Reverse diffusion [
4]: a neural network is trained to reverse this process, progressively removing noise and recovering precise bounding boxes. During this phase, the neural network works iteratively, step-by-step, to undo the effects of the forward diffusion process. Each iteration removes a small amount of noise, gradually refining the noisy bounding boxes and steering them toward their original, precise configurations. The reverse diffusion process relies heavily on the neural network’s ability to model the underlying data distribution and accurately predict the denoising steps required to recover the bounding boxes.
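The forward process described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the DiffusionDet implementation: it uses the standard closed-form corruption from denoising diffusion models, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, together with a cosine noise schedule, which is assumed here. The function names `cosine_alpha_bar` and `forward_diffuse_box`, the normalized (cx, cy, w, h) box format, and the default of 1000 steps are hypothetical choices made for the example.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar(t) under a cosine schedule (assumed)."""
    def f(u):
        return math.cos(((u / num_steps) + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def forward_diffuse_box(box, t, num_steps=1000, rng=random):
    """Corrupt one normalized (cx, cy, w, h) box with Gaussian noise at step t.

    At t = 0 the box is returned unchanged; as t approaches num_steps the
    result is dominated by noise, i.e. it approaches a random box.
    """
    a_bar = cosine_alpha_bar(t, num_steps)
    return [
        math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
        for v in box
    ]


if __name__ == "__main__":
    gt_box = [0.50, 0.50, 0.20, 0.10]  # made-up ground-truth box
    for step in (0, 250, 500, 1000):
        print(step, forward_diffuse_box(gt_box, step, rng=random.Random(0)))
```

In this sketch the reverse process would correspond to a trained network that, given the noisy box and the image features, predicts the clean box; that part depends on the actual model and is not reproduced here.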
By employing these two complementary processes, DiffusionDet leverages the strengths of diffusion models in transforming a traditionally discriminative task like object detection into a generative-inspired framework. This dual-process approach not only enhances the robustness of the model but also introduces a novel perspective on tackling object detection problems. This framework introduces several innovations over traditional architectures such as Faster R-CNN [
5] or Mask R-CNN [
10]:
Dynamic proposal generation: unlike pre-defined anchor boxes, DiffusionDet dynamically generates proposals during training, perturbing them with noise and refining them iteratively. In other words, Faster R-CNN relies on region proposals and predefined anchor boxes, which limits its ability to detect irregular and subtle damages. Instead, DiffusionDet formulates object detection as a denoising process, progressively refining object features without relying on predefined regions. Additionally, DiffusionDet’s stepwise noise removal allows for better recognition of faint scratches, small cracks, and subtle dents, improving classification accuracy.
Iterative evaluation: the model processes bounding boxes in multiple stages, correcting errors at each step and improving accuracy over time. Unlike Faster R-CNN, DiffusionDet does not depend on predefined bounding boxes, making it more adaptable to damages of varying shapes and sizes.
Robustness to noise: the underlying diffusion mechanism inherently provides robustness, making it suitable for noisy or imperfect data scenarios.
The architecture of DiffusionDet comprises two main components that work in tandem to achieve high-precision object detection.
Figure 1 explains the basic structure of the DiffusionDet framework [
4], namely, the image encoder and the detection decoder. The first component, the ‘Image Encoder’, is responsible for extracting multi-scale feature maps from input images. It utilizes advanced feature extraction network architectures, such as ResNet [
13], ResNeXt [
14] or Swin Transformer [
15], to effectively capture rich semantic and spatial information across different scales. These networks are well established for their ability to model complex visual patterns, making them an ideal choice for encoding image features that are essential for accurate object detection.
Swin Transformer, in particular, addresses the challenges of high-resolution and multi-scale visual data with a hierarchical design built around non-overlapping local windows. In each stage, image tokens are partitioned into local windows to limit self-attention, thus reducing the computational overhead. A shifted window mechanism further interleaves the windows across layers, facilitating cross-window interactions without imposing significant cost. This architecture allows Swin Transformer to effectively capture both local and global context while maintaining linear computational complexity relative to image size. Notably, the Swin Transformer is available in several variants—Swin-T (Tiny), Swin-S (Small), Swin-B (Base), and Swin-L (Large)—each scaling in model size and complexity. Swin-T is often compared to ResNet-50 in terms of capacity, and Swin-S is comparable to ResNet-101. Swin-B, considered the original Swin Transformer configuration, has a model size and computational complexity on a par with ViT-B [
16]/DeiT-B [
17]. These flexible configurations make Swin Transformer a robust backbone choice for a broad spectrum of vision tasks, from image classification to dense prediction. In general, the Swin Transformer backbone enhances the model’s ability to detect texture variations, leading to more precise localization and classification.
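The linear-complexity claim for window-based self-attention can be checked with simple counting. The sketch below is an illustration only: it counts query-key pairs for global attention versus attention restricted to non-overlapping M x M windows, ignoring channels, heads, and the shifted-window step. The function names and the 56 x 56 token grid with window size 7 are choices made for the example.

```python
def global_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    return (h * w) ** 2


def window_attention_pairs(h, w, m):
    """Query-key pairs when attention is limited to non-overlapping m x m windows."""
    if h % m or w % m:
        raise ValueError("grid size must be divisible by the window size")
    num_windows = (h // m) * (w // m)
    return num_windows * (m * m) ** 2


if __name__ == "__main__":
    for side in (56, 112):
        print(
            side,
            global_attention_pairs(side, side),
            window_attention_pairs(side, side, 7),
        )
```

Doubling the side of the token grid multiplies the number of tokens by 4. The global count grows by a factor of 16, while the windowed count grows only by a factor of 4, i.e. linearly in the number of tokens.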
The second component, the ‘Detection Decoder’, processes noisy bounding boxes in conjunction with the feature maps extracted by the image encoder. The Detection Decoder was introduced by Sparse R-CNN [
18]; it employs iterative refinement techniques to enhance the accuracy of both bounding box coordinates and their associated classifications. This process involves progressively refining the initial noisy bounding boxes through a series of iterations, with each step leveraging the encoded feature maps to guide the refinement.
Empirical evaluations of DiffusionDet have demonstrated superior performance compared to conventional object detection frameworks. By effectively modeling noisy input distributions and refining predictions iteratively, DiffusionDet sets a new benchmark for accuracy and robustness in computer vision applications. Its success in real-world use cases, such as claim management in insurance, underscores its transformative potential in the industry.
5. Enhanced Insoore AI Pipeline
Insoore’s AI pipeline pushes the boundaries of the state of the art in two key ways:
Insoore AI’s automatic damage severity estimation integrates insights from input images, considering both damage size and type. Unlike existing market solutions, which overlook damage type, our approach leverages this information to enhance severity estimation accuracy. For more details, refer to Section 4.2 in [
3].
To our knowledge, the proposed pipeline achieves superior performance in two integrated tasks: (i) detection of car damages and of the affected vehicle components, and (ii) calculation of damage severity [7]. This is demonstrated on a diverse test dataset from real-world accidents, utilizing a DiffusionDet architecture based on a Swin Transformer backbone. Further details are provided in Section 7.
To implement the algorithmic pipeline of Insoore AI, we begin by gathering vehicle images. These may be acquired either by the insurance company’s customer or by a professional with the required expertise, using a dedicated Android app that prompts users to capture photos from nine distinct angles.
The first step, detection of damages on the vehicle, is carried out by a neural network that classifies damages into four categories: broken clips, cracks, dents, and scratches.
Damages to vehicle components may vary in severity. Severe accidents may cause extensive deformation or destruction, while minor accidents often result in scratches, dents, cracks, or broken clips.
A crack on a vehicle’s exterior signifies a structural break affecting the paint, metal, or plastic components. It is more serious than mere scratches or dents. Cracks may weaken the overall strength and durability and often require advanced repairs, such as welding or panel replacement, followed by refinishing.
A dent is an indentation or concave deformation on the vehicle’s exterior, typically produced by impacts or collisions. A dent’s size can vary from minor surface impressions to significant deep distortions and is often cosmetic, though severe cases may affect structural integrity.
A scratch is surface-level damage caused by objects scuffing the paint or protective coating. Scratches differ in severity, ranging from surface-level marks to deeper abrasions. If left untreated, scratches can lead to rust or further deterioration, especially deeper ones.
Broken clips are damaged fasteners that secure vehicle components. These clips are made of plastic or metal and hold panels, trim, and bumpers in place. When broken, vehicle components that were supported by these clips may become loose or misaligned.
For each detected damage instance, the model outputs the damage’s location by defining the damaged area with a bounding box. The bounding box coordinates within the image, along with a confidence score, follow a structure similar to the panel detector’s output.
These four types of car damage are the most frequently observed classes. By focusing on them, our AI system enhances the reliability and efficiency of the claim management process.
The pipeline then segments car parts (panels) into fifty-three distinct classes, covering the main components in a vehicle’s exterior.
Front Section. This section includes the front bumper, grille, fog lights, headlights, bonnet, and windscreen.
Side Sections. Covers side mirrors, fenders, doors, wheels, windows, quarter panels, and rails.
Rear Section. Comprises the boot, rear windscreen, bumper, tail lights, and spoiler.
Additional Components. Comprises the license plate, door handles, sensors, roof, caps, reflectors, and indicators.
By dividing the vehicle into these different panels, our AI system achieves precise damage detection and assessment, yielding improved claim processing accuracy.
Figure 2 shows a sample output of Insoore AI’s current car panel detector [3], illustrating how each type of car panel is detected and segmented. For each panel, the model provides a segmentation map, the panel type, and a confidence score for the prediction.
By reconciling outputs from the panel detector with outputs from the damage detector, each detected damage is mapped to the corresponding panel, and the damage area is calculated. Based on the calculated area and the detected damage type, damage severity is classified into three possible levels: low, medium, or high.
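A minimal sketch of this reconciliation step is given below. It is not the Insoore AI implementation: the rule of assigning a damage to the panel whose segmentation mask overlaps its bounding box the most, the use of the overlap-to-panel area ratio, and every threshold in `SEVERITY_THRESHOLDS` are assumptions made purely for illustration. Masks are plain 0/1 grids and boxes are (x1, y1, x2, y2) in pixel coordinates with exclusive upper bounds.

```python
# Hypothetical (low_max, medium_max) area-ratio thresholds per damage type.
SEVERITY_THRESHOLDS = {
    "scratch": (0.05, 0.20),
    "dent": (0.03, 0.15),
    "crack": (0.02, 0.10),
    "broken clip": (0.02, 0.10),
}


def overlap_pixels(mask, box):
    """Number of mask pixels that fall inside the damage bounding box."""
    x1, y1, x2, y2 = box
    return sum(mask[y][x] for y in range(y1, y2) for x in range(x1, x2))


def assign_panel(damage_box, panel_masks):
    """Return the panel whose mask overlaps the damage box most, or None."""
    best = max(panel_masks, key=lambda name: overlap_pixels(panel_masks[name], damage_box))
    return best if overlap_pixels(panel_masks[best], damage_box) > 0 else None


def severity(damage_type, damage_area, panel_area):
    """Classify severity as low, medium, or high from damage type and relative area."""
    low_max, medium_max = SEVERITY_THRESHOLDS[damage_type]
    ratio = damage_area / panel_area
    if ratio <= low_max:
        return "low"
    return "medium" if ratio <= medium_max else "high"


if __name__ == "__main__":
    # Toy 4 x 8 image: the door covers the left half, the fender the right half.
    door = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(4)]
    fender = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(4)]
    masks = {"door": door, "fender": fender}
    box = (1, 0, 5, 2)
    panel = assign_panel(box, masks)
    area = overlap_pixels(masks[panel], box)
    panel_area = sum(map(sum, masks[panel]))
    print(panel, severity("scratch", area, panel_area))
```

Making the thresholds depend on the damage type mirrors the point made in the text that severity is derived from both the area and the type of damage. The actual values would have to come from the insurer's own criteria.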
Finally, the Insoore AI pipeline aggregates severity assessments for each damaged panel, eventually assessing whether repair or replacement is needed. The decision is based on the severity and type of each damage instance, together with updated part and labour prices from the repair shop and the car manufacturer.
In light of the above, Figure 3 illustrates the proposed functional architecture, breaking it into four consecutive steps: segmentation of car panels, damage detection, mapping damages to panels, and finally, severity estimation and decision suggestion (whether to repair or replace based on damage assessment).
All in all, the transition from a damage detector based on Faster R-CNN in [
3] to DiffusionDet in this paper enhances Insoore AI’s ability to detect and classify car damages. DiffusionDet achieves a substantial improvement in AP50 (see
Appendix A below for further details), confirming stronger feature representation and better accuracy at detecting and classifying damages. By leveraging a generative denoising process, DiffusionDet significantly improves damage detection accuracy, robustness, and efficiency, making it more suitable for insurance claim assessments.
Balancing the forward and reverse diffusion processes in DiffusionDet is key to improving detection accuracy while maintaining efficiency. The forward process must add just enough noise to enable effective learning without obscuring fine details, while the reverse process should efficiently remove noise without excessive computational steps. This can be achieved through adaptive noise scheduling, feature guidance from the Swin Transformer, and latent space processing to reduce complexity. Additionally, optimizing the loss function ensures the model learns to reconstruct objects with high precision. By refining these elements, DiffusionDet achieves a faster, more accurate, and computationally efficient damage detection system.
6. On the Need for HPC Resources
The study focuses on training GenAI-based deep learning models while leveraging GPU parallel computing to make the training process more efficient and effective. By utilizing GPU parallelism, the training pipeline can significantly reduce computation time, ensuring that even complex models are trained within a reasonable timeframe. For the inference phase, the models are executed sequentially to maintain consistency and simplify the evaluation process. A comprehensive benchmarking task was carried out, as reported in Table 1, to identify the most suitable HPC configuration for the successful execution of the tasks. This ensures that the computational resources are optimized and effectively allocated throughout the study. In particular, Table 1 compares the training speedup achieved on the Leonardo Booster setup against a standard configuration that employs 32 GB Tesla V100 GPUs in parallel.
The benchmarking process involved three experimental setups to assess the scalability and performance of the system under varying GPU configurations: a single-GPU setup, a two-GPU setup, and a four-GPU setup. These configurations help determine the impact of scaling GPU resources on training speed and overall system performance. By conducting this systematic benchmarking, the study aims to identify the optimal balance between computational power and resource utilization, ensuring that the deep learning models achieve their highest potential performance. This process is vital for understanding the efficiency of different configurations and serves as a foundation for making informed decisions about resource allocation during the project.
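For readers who wish to reproduce this kind of comparison, the two quantities usually derived from such a benchmark are the speedup relative to the single-GPU run and the parallel efficiency (speedup divided by the number of GPUs). The helper below is a minimal sketch. The function name `scaling_report` is hypothetical, and the timings in the example are placeholders invented for illustration, not measurements from this study.

```python
def scaling_report(time_per_epoch_by_gpus):
    """Compute speedup and parallel efficiency relative to the 1-GPU run.

    time_per_epoch_by_gpus maps a GPU count to a wall-clock time, with all
    times expressed in the same unit.
    """
    baseline = time_per_epoch_by_gpus[1]
    report = {}
    for gpus, elapsed in sorted(time_per_epoch_by_gpus.items()):
        speedup = baseline / elapsed
        report[gpus] = {"speedup": speedup, "efficiency": speedup / gpus}
    return report


if __name__ == "__main__":
    # Placeholder timings (minutes per epoch); NOT results from the paper.
    placeholder = {1: 100.0, 2: 55.0, 4: 30.0}
    for gpus, row in scaling_report(placeholder).items():
        print(f"{gpus} GPU(s): speedup {row['speedup']:.2f}, efficiency {row['efficiency']:.2f}")
```

An efficiency close to 1 indicates near-linear scaling, whereas lower values point to communication or data-loading overheads that grow with the number of GPUs.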
The entries in Table 1 summarize the overall training speed on the Leonardo Booster setup relative to the standard configuration with parallel 32 GB Tesla V100 GPUs, providing insight into the advantages of advanced HPC setups for deep learning tasks. The benchmarking results, sourced from data provided by Whoosnap, highlight the significant improvements in training speed achievable with optimized GPU configurations and underscore the potential of HPC systems to accelerate the training of computationally intensive GenAI-based models.
Given the above-mentioned HPC setup, we have trained a DiffusionDet architecture based on a Swin Transformer backbone to address the task of damage detection and classification, thus yielding improved performance with respect to the results achieved in [3]. The other tasks appearing in Figure 3, by contrast, have been implemented with the same setup as in [3].
7. Results
7.1. Dataset
For every experiment presented in this work, from number 1 (C.P.) to number 9 (C.P.) as listed in Table 2 below, we maintained the same train and test datasets used in our previous paper [3] to ensure consistency in benchmarking and performance assessment.
Following [3], the evaluation was carried out on a proprietary dataset comprising images of vehicle damages captured from real-world road accidents. In accordance with the design framework introduced in the previous paper, this dataset was annotated with four damage categories: cracks, dents, scratches, and broken clips.
The dataset’s annotation process followed rigorous guidelines to ensure accuracy and consistency:
Exclusion of non-damaged or disassembled vehicles: images without visible damage or displaying disassembled vehicles from mechanic workshops were removed.
Multiple body parts: when multiple body parts were damaged, a single annotation encompassed all affected regions.
Different damage types in the same area: if various types of damage co-occurred in the same area, each damage type was annotated separately.
Proximate damages: for dents in close proximity, one annotation was used to represent them collectively. However, damages on distinct parts of the panel were annotated individually.
Localized scratch swipes: multiple scratch swipes within a localized region were consolidated into the smallest feasible number of annotations.
Dent annotations: for dents, bounding boxes were aligned with the shadowed regions caused by the impact deformation.
Crack annotations: the entire length of a crack was covered by a single bounding region.
Broken clips: when clips were broken, the annotation encompassed the visible black gap caused by the separation of hooks joining body parts.
This standardized approach ensured clear and consistent representation of varying damage patterns across vehicles.
The dataset composition, as outlined in
Table 3, is structured as follows:
the training set comprises 21,846 annotations derived from a total of 6782 images;
the test set includes 540 annotations extracted from 326 images.
The annotation process adhered to the COCO dataset format, ensuring compatibility with widely used standards for object detection and segmentation tasks. This structured approach facilitates consistency and robustness in the evaluation of the proposed methodology.
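For reference, a COCO-style detection file has three top-level lists (`images`, `annotations`, `categories`), and each bounding box is stored as `[x, y, width, height]` in pixels. The sketch below builds and checks one such record. The file name, image size, box coordinates, and category ids are invented for illustration, since the paper only names the four damage classes, and the `validate` helper is a hypothetical convenience rather than part of any COCO tooling.

```python
import json

# Hypothetical category ids for the four damage classes named in the paper.
CATEGORIES = [
    {"id": 1, "name": "scratch"},
    {"id": 2, "name": "dent"},
    {"id": 3, "name": "crack"},
    {"id": 4, "name": "broken clip"},
]


def make_annotation(ann_id, image_id, category_id, x, y, w, h):
    """Build one COCO-style detection annotation with bbox = [x, y, w, h]."""
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,
        "iscrowd": 0,
    }


def validate(dataset):
    """Check id references and that every box lies inside its image."""
    images = {img["id"]: img for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    for ann in dataset["annotations"]:
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        if ann["category_id"] not in category_ids:
            return False
        if x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            return False
    return True


dataset = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 1280, "height": 960}],
    "annotations": [make_annotation(1, 1, 2, 410.0, 305.0, 120.0, 80.0)],
    "categories": CATEGORIES,
}

if __name__ == "__main__":
    print(validate(dataset))
    print(json.dumps(dataset["annotations"][0]))
```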
Approximately 120,000 images were collected, along with relevant business attributes, such as the monetary value of the damage, the brand, and the model of the vehicle. Based on these attributes, out of the above-mentioned 120,000, a dedicated dataset of 6782 images was carefully selected, exhibiting an attribute distribution arranged in such a way as to mimic their real-world distribution in road accident scenarios. To ensure that not just the training set but also the test set accurately represented these conditions, the same attribute distribution was replicated, resulting in a test sample of 326 images that closely aligns with the actual distribution of these business features. This methodology ensures that the resulting dataset reflects realistic conditions and provides a reliable foundation for evaluation.
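One common way to obtain a subset whose attribute distribution mirrors that of a larger pool is proportional stratified sampling. The sketch below illustrates the idea for a single categorical attribute. It is an assumption about how such a selection could be carried out, not a description of the procedure actually used for this dataset, and the function name `stratified_sample` and the `brand` attribute in the example are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, n, seed=0):
    """Draw n records so that each stratum keeps its share of the population.

    Quotas are rounded with the largest-remainder method so they sum to n.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    total = len(records)
    quotas = {g: n * len(items) / total for g, items in groups.items()}
    counts = {g: int(q) for g, q in quotas.items()}

    # Hand the remaining slots to the strata with the largest fractional parts.
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1

    sample = []
    for g, items in groups.items():
        sample.extend(rng.sample(items, counts[g]))
    return sample


if __name__ == "__main__":
    pool = [{"brand": "A"}] * 60 + [{"brand": "B"}] * 30 + [{"brand": "C"}] * 10
    picked = stratified_sample(pool, key=lambda r: r["brand"], n=10)
    print(sorted(r["brand"] for r in picked))
```

With several business attributes, the same idea can be applied to their joint values (for example, brand combined with a damage-value bucket), provided that each combination has enough images.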
An intermediate dataset was later introduced (concerning solely Experiment 10 listed in Table 2) as a crucial pre-training step to bridge the gap between general object detection models and the specific task of car damage classification. This dataset provides a structured domain adaptation phase, ensuring that the model learns meaningful features related to vehicle damage before fine-tuning on the final dataset. By incorporating a diverse set of damage patterns and real-world variations, it improves the model’s generalization capability while mitigating overfitting when transitioning from generic datasets like COCO.
However, while the intermediate dataset facilitates domain alignment, it is not designed for final optimization. The transition to the training set of 21,846 annotations (as per
Table 3) is necessary to refine model performance with high-quality, task-specific annotations that adhere to stricter labeling standards. These annotations provide finer granularity, contain fewer labeling inconsistencies, and align better with industry requirements for insurance claim assessments. This shift results in significant performance improvements, particularly in AP50, demonstrating that fine-tuning on a well-curated dataset is essential for achieving reliable and robust damage detection in real-world scenarios.
7.2. Experiments and Performance Assessment
As in our previous paper [
3], we employed the following performance metrics for evaluation:
AP (average precision),
AP50 (average precision at an IoU threshold of 50%), and
AP75 (average precision at an IoU threshold of 75%).
Among these, AP50 is the primary metric of interest.
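The role of the IoU threshold in these metrics can be illustrated with a short sketch. It assumes boxes in the COCO `[x_min, y_min, width, height]` layout and treats a prediction as a match when its IoU with a ground-truth box reaches the threshold; the function names are hypothetical. A full AP computation additionally ranks predictions by confidence and integrates the precision-recall curve, which is omitted here.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union of two [x_min, y_min, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou_xywh(pred_box, gt_box) >= threshold


gt = [0.0, 0.0, 10.0, 10.0]
pred = [2.0, 0.0, 10.0, 10.0]  # same size, shifted 2 px to the right

overlap = iou_xywh(pred, gt)
match_at_50 = is_true_positive(pred, gt, 0.50)
match_at_75 = is_true_positive(pred, gt, 0.75)
print(overlap, match_at_50, match_at_75)
```

In this example the overlap is 80 / 120, so the same prediction counts as correct under the AP50 criterion but not under the stricter AP75 criterion, which is why AP75 values are typically lower.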
The first experiment relevant to this paper (i.e., the one denoted with 1 (C.P.) in
Figure 4) was conducted to establish a baseline for DiffusionDet, aligning with the constraints of the previous experiments in [
3], where all trainings were performed on a single 16 GB GPU. Consequently, the initial baseline setup consisted of using one GPU with a batch size of 1 per GPU.
In the current study, however, we used LEONARDO’s high-performance computing (HPC) infrastructure, which provides four GPUs with 64 GB of memory each. This additional capacity allowed for larger-scale experimentation with more GPUs and larger batch sizes.
Following the established configuration, we decided to use the Swin base transformer backbone over the available ResNet backbones, taking advantage of its ability to capture long-range dependencies and multi-scale features—both critical for fine-grained damage identification in car components. This backbone is integrated within the diffusion-based detection framework in the image encoder.
We standardized the experimental setup by fixing all hyper-parameters and changing only one per experiment to measure its direct impact on performance. The selected backbone, namely, Swin Base [
4], is the result of an initial training on the 22k ImageNet dataset [
19] and a second training on the COCO 2017 dataset [
20], providing robust feature extraction capabilities and aligning well with the iterative refinement required by the diffusion-based detection process.
We systematically investigated the impact of scaling GPU resources and batch size on the DiffusionDet model’s performance. Our experimental design involved incrementally modifying two key parameters: the number of GPUs and the batch size per GPU. To isolate the effect of each parameter, we conducted a series of experiments where only one variable was altered at a time. The results are summarized in
Figure 4, and the values for each experiment are reported in
Table 2.
The first set of experiments focused on GPU scaling, progressing from one to two and from two to four GPUs. Subsequently, we explored the impact of increasing the batch size per GPU, incrementing it by one unit at a time and carefully monitoring training performance. Our analysis revealed a clear trend of performance improvement across the three evaluation metrics: AP, AP50, and AP75.
The optimal configurations were consistently achieved using four GPUs, with peak performance metrics varying by specific measure: batch size of 6 for maximum AP, batch size of 3 for maximum AP50, and batch size of 5 for maximum AP75. The experimental bounds were established at a maximum batch size per GPU of 7 in the ninth experiment.
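The one-variable-at-a-time design can be sketched as follows. The schedule below is reconstructed from the prose only (GPU scaling first, then per-GPU batch size up to 7 in the ninth experiment) and is an assumption; the authoritative settings are those in Table 2. The names `RunConfig` and `build_schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    experiment: int
    num_gpus: int
    batch_per_gpu: int

    @property
    def effective_batch(self):
        # Images processed per optimization step across all GPUs.
        return self.num_gpus * self.batch_per_gpu


def build_schedule(gpu_steps=(1, 2, 4), max_batch_per_gpu=7):
    """Vary the GPU count first, then the per-GPU batch size, one at a time."""
    runs = []
    # Phase 1: scale the number of GPUs with the batch size per GPU fixed at 1.
    for gpus in gpu_steps:
        runs.append(RunConfig(len(runs) + 1, gpus, 1))
    # Phase 2: keep the largest GPU count and grow the batch size per GPU.
    for batch in range(2, max_batch_per_gpu + 1):
        runs.append(RunConfig(len(runs) + 1, gpu_steps[-1], batch))
    return runs


schedule = build_schedule()
for run in schedule:
    print(run.experiment, run.num_gpus, run.batch_per_gpu, run.effective_batch)
```

Under this assumed schedule, the effective batch size grows from 1 in the baseline to 28 in the ninth experiment, and consecutive runs differ in exactly one of the two parameters.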
Following the determination of optimal values for the number of GPUs and batch size in Experiment 9, an additional experiment was conducted to further enhance model performance. This experiment incorporated a two-stage transfer learning approach. The model was first initialized with COCO pre-trained weights and trained on an intermediate dataset specifically curated to bridge the domain gap. A second training phase was then performed on the training set.
All experiments prior to Experiment 10 used COCO pre-trained weights as the starting point, from which the model was fine-tuned directly on the training set. In Experiment 10, however, training was first conducted on an intermediate dataset, referred to as the intermediate set, which contained the same four object classes as the training set. This intermediate training phase acted as a domain adaptation step, allowing the model to acquire more relevant feature representations before fine-tuning on the target dataset. This approach aligned the COCO pre-trained weights with the domain-specific task of car damage detection, providing a more suitable initialization than the one used in Experiment 9.
All other parameters and experimental settings remained consistent with those of Experiment 9. The results demonstrated significant improvements in model performance, with AP50 achieving its highest recorded value of 40.92. Furthermore, the overall AP reached its highest value of 16.76, indicating the effectiveness of the proposed two-stage transfer learning strategy.
A comparison with our previous work in [
3], which is summarized in
Figure 4 and reported in detail in
Table 2, shows significant performance gains. The improvement was particularly notable in AP50, which increased from 30.45 to 40.92, a gain of 10.47 points. AP increased from 12.65 to 16.76, and AP75 increased from 9.63 to 11.9.
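The gains quoted above can be checked with a few lines of arithmetic. The sketch below uses only the metric values stated in the text; the relative percentages are derived here for illustration and are not figures reported in the paper.

```python
# Metric values as quoted in the text: previous work vs. Experiment 10.
baseline = {"AP": 12.65, "AP50": 30.45, "AP75": 9.63}
improved = {"AP": 16.76, "AP50": 40.92, "AP75": 11.9}


def gains(before, after):
    """Absolute gain (in points) and relative gain (in percent) per metric."""
    report = {}
    for metric, old in before.items():
        new = after[metric]
        report[metric] = {
            "absolute": round(new - old, 2),
            "relative_pct": round(100.0 * (new - old) / old, 1),
        }
    return report


report = gains(baseline, improved)
for metric, values in report.items():
    print(metric, values["absolute"], values["relative_pct"])
```

The absolute gains are 4.11 points for AP, 10.47 points for AP50, and 2.27 points for AP75, which correspond to relative improvements of roughly a third for AP and AP50 and roughly a quarter for AP75.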
8. Conclusions and Future Work
In this study, we have presented an improved version of Insoore AI, with specific reference to the task of automatic car damage detection and classification, by relying on a GenAI architecture, namely DiffusionDet with a Swin Transformer backbone, and by leveraging high-performance computing resources from the LEONARDO HPC infrastructure. The strategic architectural shift and computational scaling enabled a remarkable
improvement in AP50. This result represents a further leap forward with respect to the previous
improvement of [
3] over [
7].
The comparison of our latest solution with the baseline is made in terms of average precision, specifically AP50. The improvement measured with respect to this metric, when it comes to practical use in the claim management process, translates to having, in general, no more than false positive predictions should Insoore AI be fully in charge of damage detection, which is a really encouraging result along the roadmap towards the full automation of claim management.
The principal advancement in this study, namely the integration of a diffusion architecture with a Swin Transformer backbone, enabled by high-performance computing resources, represents a novel and significant contribution to the field, and its effectiveness is further strengthened by two-stage transfer learning. This exploration has not been previously conducted in the context of car damage claim management systems, and it provides valuable insights into the scalability and adaptability of diffusion-based architectures for this domain.
In a nutshell, the Swin Transformer backbone, by contrast with a ResNet one, provides multi-scale, window-based self-attention that helps preserve and aggregate fine-grained details, which is exactly what is needed for detecting small or subtle car damage. Because small dents, scratches, or cracks can easily be overlooked by coarser feature extractors, the localized attention, together with the hierarchical structure of the Swin Transformer backbone, significantly boosts the sensitivity and precision of the DiffusionDet architecture. This leads to more accurate detection of tiny or nuanced car damage, a critical factor in fine-grained tasks such as insurance claims, maintenance assessments, or automated quality inspections. By integrating these refined detection capabilities into the Insoore AI pipeline, the results presented in this study offer a more reliable and efficient basis for automatic claim generation. The demonstrated robustness in challenging real-world scenarios not only strengthens the pipeline’s performance in identifying vehicle damage but also enables a more streamlined end-to-end workflow for insurance claim processing. Consequently, these advancements reinforce the broader impact of diffusion-based methods in industrial applications, laying the groundwork for further optimization and extension to other segments of insurance automation.
Future work will explore additional architectural optimizations and investigate alternative feature extraction techniques.
It is important to note that further validating and improving the model in real-world scenarios raises several privacy-related challenges. Balancing privacy and model performance might require a combination of synthetic data, federated learning, and privacy-preserving AI techniques, which may in turn help validate and enhance model accuracy in real-world insurance claim assessments.
Finally, using the proposed model in other markets may require further effort to adapt it to the vehicle types, damage patterns, and insurance policies that are specific to each market.