On the Application of DiffusionDet to Automatic Car Damage Detection and Classification via High-Performance Computing

Arconzo, Vito; Gorga, Gerardo; Gutierrez, Gonzalo; Omar, Ahmed; Rangisetty, Meher Anvesh; Ricciardi Celsi, Lorenzo; Santini, Federico; Scianaro, Enrico

doi:10.3390/electronics14071362

by Vito Arconzo, Gerardo Gorga, Gonzalo Gutierrez, Ahmed Omar, Meher Anvesh Rangisetty, Lorenzo Ricciardi Celsi *, Federico Santini and Enrico Scianaro

Whoosnap Srl, Via Marsala 29H/I, 00185 Rome, Italy

* Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1362; https://doi.org/10.3390/electronics14071362

Submission received: 29 January 2025 / Revised: 21 March 2025 / Accepted: 27 March 2025 / Published: 28 March 2025

(This article belongs to the Special Issue Application of Artificial Intelligence to Image Processing: Advantages and Prognosis)

Abstract

Claim management is a critical process for insurance companies, requiring fairness, transparency, and efficiency to maintain policyholder trust and minimize financial impact. In our previous work, we introduced Insoore AI, an insurtech solution leveraging deep learning-based computer vision to automate car damage recognition and localization from user-provided pictures. While this approach demonstrated the potential of AI in claims management, it faced limitations in terms of performance and computational efficiency due to resource constraints. In this study, we present an improved version of Insoore AI, enabled by the High-Performance Computing (HPC) resources offered by the Booster module of the LEONARDO HPC system located at the CINECA datacenter in Bologna, Italy. By leveraging the computational capabilities of this HPC infrastructure, we trained larger and more complex deep learning models, processed higher-resolution images, and significantly reduced training and inference times. Our results show marked performance improvements in damage detection, paving the way for more efficient, effective, and scalable claims management solutions. This work underscores the potential of HPC resources in advancing AI-driven innovations in the insurance sector. It improves on our previous contribution by relying on the DiffusionDet architecture with a Swin Transformer backbone to solve the problem of automatic car damage detection and classification.

Keywords:

insurance claim management; GenAI; deep learning; damage detection and classification; computer vision

1. Introduction

The insurance industry continues to evolve with the integration of advanced artificial intelligence (AI) solutions, driving transformative innovations in the processing and analysis of complex datasets. Their potential to streamline and enhance processes remains undeniable. In particular, the proposed study builds on recent findings in [1] and from Boston Consulting Group [2], which emphasize the transformative potential of AI in insurance, especially in damage assessment, underwriting, and customer service, provided that high-quality data are available.

Building upon our previous work with the Insoore AI pipeline [3]—a system designed for the Italian market to automate vehicle damage assessment using computer vision—this study introduces a significant advancement in terms of damage detection and classification. By leveraging one of the latest deep learning architectures in computer vision, namely, DiffusionDet [4], this research enhances the performance achieved at damage detection and classification in the claim management process, also marking a pivotal improvement with respect to the Faster R-CNN [5] model employed in earlier iterations.

In particular, this paper thoroughly examines the methodology and outcomes associated with retraining Insoore AI over the original dataset using DiffusionDet, a Generative AI (GenAI) framework that is gaining increasing popularity in the scientific community focused on object detection. The retraining process was carried out on the Booster module of the LEONARDO HPC system [6], as less powerful setups do not allow launching such a training session. This transition to advanced computational platforms facilitated the exploration of enhanced model capabilities, allowing for a deeper understanding of the potential benefits and limitations of diffusion-based detection.

The retraining process resulted in substantial improvements in performance metrics, underscoring the effectiveness of diffusion-based detection techniques for solving the problem of automatic damage detection and classification for the claim management process. These findings not only highlight the transformative potential of computer vision in addressing real-world challenges faced by the insurance industry but also affirm the suitability of diffusion models as a powerful tool for tasks requiring high precision and reliability. The outcomes demonstrated through this research further solidify the role of diffusion-based detection as a viable and efficient approach for advancing automation and decision-making processes in insurance applications.

To the best of the authors’ knowledge, this work represents the first attempt in the current literature at solving the problem of automatic car damage detection and classification using DiffusionDet [4]. In addition, relying on two-stage transfer learning allows us to push the performance beyond the state of the art, represented in chronological order by [3,7].

This paper is organized as follows: Section 2 presents a recap of our previous work concerning the Insoore AI platform; Section 3 provides the motivation behind this paper; Section 4 provides the necessary background on DiffusionDet for object detection; Section 5 describes the Insoore AI pipeline as enhanced by the improvements documented in the current paper; Section 6 highlights the emerging need for HPC resources; Section 7 discusses the dataset and the experimental setup with the final numerical results. Concluding remarks end the paper.

2. Recap of Previous Work

The initial development of the Insoore AI pipeline [3] marked a significant step forward in integrating artificial intelligence into the Italian insurance market. Designed to automate the assessment of vehicle damages, the system addressed key challenges in the claims management process, including the inefficiency, subjectivity, and susceptibility to fraud inherent in manual evaluation methods. By leveraging computer vision and deep learning technologies, the pipeline aimed to provide a robust decision-support tool for insurance experts, reducing their reliance on physical vehicle inspections and expediting the claims settlement process.

The pipeline was structured around three main tasks: (i) car damage detection, (ii) detection of damaged vehicle components, and (iii) calculation of repair cost. The first step, damage detection, focused on localizing and classifying damages within images sent by clients. This was accomplished by employing the Faster R-CNN architecture, a deep learning model extensively used for object detection. In the second step, the identified damages were assigned to specific vehicle components, which the system classified into 53 distinct categories. Finally, in the third step, the pipeline estimated the severity of the identified damages and calculated the repair costs based on the updated prices of the vehicle components and the labor rates provided by the insurance company.

The dataset used in the initial implementation consisted of real-world accident images annotated into four damage classes: scratches, dents, cracks, and broken clips. Images were annotated following strict guidelines to ensure consistency and accuracy, accounting for the unique characteristics of the Italian insurance market. While the Faster R-CNN [5] model demonstrated reasonable performance in terms of damage detection and classification, limitations were observed, particularly in handling complex damages and small, intricate details. This highlighted the need for exploring advanced architectures to overcome these shortcomings.

Extensive experiments were conducted to fine-tune the Faster R-CNN model, optimizing hyperparameters and evaluating performance using metrics such as Average Precision (AP), AP50, and AP75. Despite achieving competitive results in comparison with the main competing players in the Insurtech market (namely, Bdeo [8] and Tractable [9]), the experiments revealed inherent challenges in adapting the architecture to the specific requirements of the problem. Object detection performed better than segmentation-based approaches like Mask R-CNN [10], as the latter struggled with generating accurate masks for damage localization. However, the overall detection accuracy indicated room for improvement in both precision and recall.
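
The AP50 and AP75 metrics mentioned above depend on the Intersection over Union (IoU) between a predicted box and a ground-truth box. The following minimal sketch assumes the COCO-style convention, in which a prediction counts as a true positive when its IoU reaches the threshold (0.5 for AP50, 0.75 for AP75). The function names and the (x1, y1, x2, y2) box layout are illustrative assumptions and are not taken from the Insoore AI code base.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the intersection are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold


# Illustrative boxes (made-up coordinates): the prediction is shifted by 2 pixels.
gt_box = (0, 0, 10, 10)
pred_box = (2, 0, 12, 10)
print(box_iou(pred_box, gt_box))                 # IoU of the two boxes
print(is_true_positive(pred_box, gt_box, 0.50))  # match at the AP50 threshold?
print(is_true_positive(pred_box, gt_box, 0.75))  # match at the AP75 threshold?
```

A box that is slightly misplaced can therefore count as correct for AP50 while being rejected for AP75, which is why the stricter metric is more sensitive to localization quality.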

The findings from [3] underscored the potential of AI-driven solutions for claims management while highlighting the limitations of the existing methodology. The system effectively automated significant portions of the process, reducing the burden on insurance experts and expediting claim resolutions. Nevertheless, the relatively moderate performance of Faster R-CNN motivated further exploration of more sophisticated models to enhance detection accuracy and robustness. This paper builds on those insights, presenting an updated pipeline powered by the DiffusionDet [4] architecture, which leverages advanced diffusion-based detection methods to achieve superior results.

3. Motivation

The advancements in AI-driven car damage detection have significantly transformed the insurance industry, enhancing the efficiency and accuracy of damage assessment. While early methods relied on CNN-based object detectors like YOLO and Faster R-CNN, the evolution of transformer-based models and diffusion-based approaches has opened new possibilities for improved detection accuracy and robustness. The integration of these modern architectures addresses several challenges associated with conventional methods, such as difficulty in detecting subtle or complex damages and the reliance on large annotated datasets.

A key contribution of the present work is the application of DiffusionDet in a novel way, as demonstrated in Section 7. DiffusionDet redefines object detection by leveraging the iterative refinement process of diffusion models, allowing for better localization of damages even in challenging scenarios. Traditional CNN-based detectors often struggle with damages that lack clear contours or are small relative to the vehicle’s surface. By formulating detection as a denoising process, DiffusionDet mitigates these issues by systematically improving its predictions over multiple iterations, effectively reducing false positives and enhancing recall.

An additional noteworthy aspect of this work is the adoption of a two-step transfer learning strategy. While the COCO dataset is widely used and provides a strong foundation, it remains distant from the specific domain of cars and car damages. Therefore, we pre-train on an intermediate dataset that bridges this gap before fine-tuning on the target domain. This step is not commonplace in the literature and stems from practical experience rather than conventional practice. It contributes significantly to the model’s ability to generalize and perform effectively on complex, domain-specific tasks.

The above-mentioned innovative aspects are tested and compared with the most relevant state-of-the-art approaches addressing the same problem, namely, [3,7].

We also demonstrate that real-life data, such as ours, are significantly more challenging than CarDD [11], as evidenced by the comparatively lower performance metrics we obtain in Section 7. Object segmentation is notably more difficult than object detection, as shown in the numerical results in Section 7. In particular, our results indicate that Mask R-CNN [10] struggles to learn masks, whereas its detection network performs reasonably well. Therefore, we maintain that object detection is a more suitable solution for this problem than object segmentation, which has been widely attempted in other studies, such as [7]. Consequently, we adopt object detection and fine-tune our model, surpassing both the Mask R-CNN detection network and the Faster R-CNN baseline experiment. As later shown in the numerical results appearing in Section 7, the best overall performance is ultimately achieved by adopting DiffusionDet with a two-step transfer learning strategy, thanks to the HPC computational resources discussed in Section 6.

4. Background on DiffusionDet for Object Detection

DiffusionDet [4] represents a remarkable and significant milestone in the continuous evolution of object detection frameworks. It leverages foundational principles derived from diffusion models, which were initially rooted in the disciplines of thermodynamics and statistical physics. These models, known for their theoretical elegance and mathematical rigor, have rapidly gained widespread attention and acclaim for their exceptional generative capabilities. They have been particularly impactful in tasks such as image synthesis, where their ability to generate high-quality and realistic outputs has set new benchmarks in the field.

Diffusion models themselves form a class of probabilistic generative models that have seen extensive application across various domains. Their core mechanism involves gradually removing noise from an initial, noisy representation to yield a clean, structured output. While their initial success was closely tied to the domain of image denoising, diffusion models have since broadened their scope to include tasks like image segmentation and, more recently, object detection. Reference [4] marked the introduction of the latest diffusion model network specifically designed for object detection.

Building on this success, diffusion models have been adapted for discriminative applications, including object detection, marking a pivotal shift in their use cases. DiffusionDet specifically formulates the task of object detection as a denoising process. This methodology involves iteratively refining initially noisy bounding box proposals into precise and accurate object predictions. By treating object detection as a denoising process, DiffusionDet capitalizes on the inherent strengths of diffusion models, such as their ability to progressively reduce uncertainty and generate high-fidelity outputs.

This novel formulation not only aligns with the principles of diffusion but also represents a paradigm shift in how object detection is conceptualized and executed. The iterative refinement process employed by DiffusionDet ensures that the final predictions are both robust and reliable, highlighting its potential as a transformative tool in the field of computer vision. At its core, DiffusionDet incorporates two primary processes:

Forward diffusion [4]: according to [12], this process gradually adds Gaussian noise to object bounding boxes, rendering them indistinguishable from a random distribution. This process begins by systematically introducing noise to the bounding boxes in small increments, which gradually transforms the structured information of the bounding boxes into a completely noisy representation. The forward diffusion process essentially models the degradation of bounding box information over time, ensuring that the final noisy representation is sufficiently close to a random distribution. This step is critical, as it lays the foundation for the reverse diffusion process by providing the necessary starting point for noise removal.
Reverse diffusion [4]: a neural network is trained to reverse this process, progressively removing noise and recovering precise bounding boxes. During this phase, the neural network works iteratively, step-by-step, to undo the effects of the forward diffusion process. Each iteration removes a small amount of noise, gradually refining the noisy bounding boxes and steering them toward their original, precise configurations. The reverse diffusion process relies heavily on the neural network’s ability to model the underlying data distribution and accurately predict the denoising steps required to recover the bounding boxes.
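
The forward process described above can be sketched in a few lines. The example below is a simplified illustration and not the DiffusionDet implementation: it assumes a cosine noise schedule, boxes stored as normalized (cx, cy, w, h) values in [0, 1], and a hypothetical `scale` parameter that stretches the boxes to a symmetric range before noise is added. It uses the standard closed-form sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a standard Gaussian.

```python
import numpy as np


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal-retention coefficients of a cosine noise schedule."""
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    # Drop the t = 0 entry and keep the values strictly positive.
    return np.clip(f[1:] / f[0], 1e-5, 1.0)


def forward_diffuse(boxes, t, alpha_bar, rng, scale=2.0):
    """Sample noisy boxes x_t from q(x_t | x_0) in closed form.

    boxes: (N, 4) array of normalized (cx, cy, w, h) values in [0, 1].
    scale: hypothetical signal-scaling factor (an assumption of this sketch).
    """
    x0 = (boxes * 2.0 - 1.0) * scale          # map [0, 1] to [-scale, scale]
    noise = rng.standard_normal(x0.shape)     # Gaussian noise, same shape as boxes
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise


rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)
boxes = np.array([[0.5, 0.5, 0.2, 0.1],
                  [0.3, 0.7, 0.4, 0.4]])
slightly_noisy, _ = forward_diffuse(boxes, 0, alpha_bar, rng)    # close to the scaled boxes
almost_random, _ = forward_diffuse(boxes, 999, alpha_bar, rng)   # close to pure noise
```

At small t the sample stays near the original boxes, while at the last step almost no signal remains, which is the starting point that the reverse process has to recover from.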

By employing these two complementary processes, DiffusionDet leverages the strengths of diffusion models in transforming a traditionally discriminative task like object detection into a generative-inspired framework. This dual-process approach not only enhances the robustness of the model but also introduces a novel perspective on tackling object detection problems. This framework introduces several innovations over traditional architectures such as Faster R-CNN [5] or Mask R-CNN [10]:

Dynamic proposal generation: unlike pre-defined anchor boxes, DiffusionDet dynamically generates proposals during training, perturbing them with noise and refining them iteratively. In other words, Faster R-CNN relies on region proposals and predefined anchor boxes, which limits its ability to detect irregular and subtle damages. Instead, DiffusionDet formulates object detection as a denoising process, progressively refining object features without relying on predefined regions. Additionally, DiffusionDet’s stepwise noise removal allows for better recognition of faint scratches, small cracks, and subtle dents, improving classification accuracy.
Iterative evaluation: the model processes bounding boxes in multiple stages, correcting errors at each step and improving accuracy over time. Unlike Faster R-CNN, DiffusionDet does not depend on predefined bounding boxes, making it more adaptable to damages of varying shapes and sizes.
Robustness to noise: the underlying diffusion mechanism inherently provides robustness, making it suitable for noisy or imperfect data scenarios.

The architecture of DiffusionDet comprises two main components that work in tandem to achieve high-precision object detection. Figure 1 explains the basic structure of the DiffusionDet framework [4], namely, the image encoder and the detection decoder. The first component, the ‘Image Encoder’, is responsible for extracting multi-scale feature maps from input images. It utilizes advanced feature extraction network architectures, such as ResNet [13], ResNeXt [14] or Swin Transformer [15], to effectively capture rich semantic and spatial information across different scales. These networks are well established for their ability to model complex visual patterns, making them an ideal choice for encoding image features that are essential for accurate object detection.

Swin Transformer, in particular, addresses the challenges of high-resolution and multi-scale visual data with a hierarchical design built around non-overlapping local windows. In each stage, image tokens are partitioned into local windows to limit self-attention, thus reducing the computational overhead. A shifted window mechanism further interleaves the windows across layers, facilitating cross-window interactions without imposing significant cost. This architecture allows Swin Transformer to effectively capture both local and global context while maintaining linear computational complexity relative to image size. Notably, the Swin Transformer is available in several variants—Swin-T (Tiny), Swin-S (Small), Swin-B (Base), and Swin-L (Large)—each scaling in model size and complexity. Swin-T is often compared to ResNet-50 in terms of capacity, and Swin-S is comparable to ResNet-101. Swin-B, considered the original Swin Transformer configuration, has a model size and computational complexity on a par with ViT-B [16]/DeiT-B [17]. These flexible configurations make Swin Transformer a robust backbone choice for a broad spectrum of vision tasks, from image classification to dense prediction. In general, the Swin Transformer backbone enhances the model’s ability to detect texture variations, leading to more precise localization and classification.
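
The window partitioning and shifting described above can be illustrated with plain array operations. This is a minimal sketch, assuming an (H, W, C) feature map whose sides are multiples of the window size. The function names are illustrative, and the attention mask that a full Swin implementation applies to the wrapped-around windows is omitted.

```python
import numpy as np


def window_partition(x, window):
    """Split an (H, W, C) map into non-overlapping (window, window, C) tiles."""
    H, W, C = x.shape
    assert H % window == 0 and W % window == 0, "sides must be multiples of window"
    x = x.reshape(H // window, window, W // window, window, C)
    # Bring the two tile indices to the front, then flatten them into one axis.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, C)


def shifted_window_partition(x, window):
    """Cyclically shift the map by half a window before partitioning."""
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)


feature_map = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(feature_map, 4)          # 4 tiles of 4 x 4 tokens
shifted = shifted_window_partition(feature_map, 4)  # tiles straddle the old borders
print(regular.shape, shifted.shape)
```

Because self-attention is computed inside each tile of M x M tokens, the number of token pairs is (H * W / M^2) * M^4 = H * W * M^2, which grows linearly with the image area for a fixed window size M.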

The second component, the ‘Detection Decoder’, processes noisy bounding boxes in conjunction with the feature maps extracted by the image encoder. The Detection Decoder was introduced by Sparse R-CNN [18]; it employs iterative refinement techniques to enhance the accuracy of both bounding box coordinates and their associated classifications. This process involves progressively refining the initial noisy bounding boxes through a series of iterations, with each step leveraging the encoded feature maps to guide the refinement.
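
The iterative refinement can be sketched as a deterministic, DDIM-style sampling loop. The example below is an illustration under stated assumptions rather than the DiffusionDet code: `predict_x0` is a hypothetical stand-in for the detection decoder, the linear `alpha_bar` schedule is a toy choice, and steps that a real detector would add (such as replacing low-confidence boxes between iterations) are left out.

```python
import numpy as np


def ddim_refine(x_T, predict_x0, alpha_bar, timesteps):
    """Refine noisy boxes with deterministic (eta = 0) DDIM-style updates.

    x_T:        (N, 4) array of noisy boxes at the first timestep.
    predict_x0: callable (x_t, t) -> estimate of the clean boxes
                (hypothetical stand-in for the detection decoder).
    alpha_bar:  1-D array of cumulative signal-retention coefficients.
    timesteps:  decreasing schedule indices, e.g. [999, 666, 333, 0].
    """
    x_t = x_T
    for i, t in enumerate(timesteps):
        x0_hat = predict_x0(x_t, t)
        if i == len(timesteps) - 1:
            return x0_hat  # the last prediction is the final set of boxes
        t_next = timesteps[i + 1]
        # Noise implied by the current sample and the predicted clean boxes.
        eps_hat = (x_t - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Re-noise the prediction to the (lower) noise level of the next step.
        x_t = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_next]) * eps_hat
    return x_t


# Toy usage: an oracle predictor that always returns the true boxes.
alpha_bar = np.linspace(0.999, 0.001, 1000)   # toy schedule, decreasing in t
target = np.array([[0.0, 0.0, -1.2, -1.6]])
start = np.random.default_rng(0).standard_normal(target.shape)
refined = ddim_refine(start, lambda x_t, t: target, alpha_bar, [999, 666, 333, 0])
```

With a learned predictor in place of the oracle, each pass gives the decoder a less noisy input than the previous one, which is the sense in which predictions are corrected step by step.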

Empirical evaluations of DiffusionDet have demonstrated superior performance compared to conventional object detection frameworks. By effectively modeling noisy input distributions and refining predictions iteratively, DiffusionDet sets a new benchmark for accuracy and robustness in computer vision applications. Its success in real-world use cases, such as claim management in insurance, underscores its transformative potential in the industry.

5. Enhanced Insoore AI Pipeline

Insoore’s AI pipeline pushes the boundaries of the state of the art in two key ways:

Insoore AI’s automatic damage severity estimation integrates insights from input images, considering both damage size and type. Unlike existing market solutions, which overlook damage type, our approach leverages this information to enhance severity estimation accuracy. For more details, refer to Section 4.2 in [3].
To our knowledge, the proposed pipeline achieves superior performance in two integrated tasks: (i) detection of car damages and of the affected vehicle components, and (ii) calculation of damage severity [7]. This is demonstrated on a diverse test dataset from real-world accidents, utilizing a DiffusionDet architecture based on a Swin Transformer backbone. Further details are provided in Section 7.

To implement the algorithmic pipeline of Insoore AI, we begin by gathering vehicle images. These may be acquired either by the insurance company’s customer or by a professional with the required expertise, using a dedicated Android app that prompts users to capture photos from nine distinct angles.

The first step, detection of damages on the vehicle, is carried out by a neural network that classifies damages into four categories: broken clips, cracks, dents, and scratches.

Damage to vehicle components may vary in severity. Severe accidents may cause extensive deformation or destruction, while minor accidents often result in scratches, dents, cracks, or broken clips.

A crack on a vehicle’s exterior signifies a structural break affecting the paint, metal, or plastic components. It is more serious than mere scratches or dents. Cracks may weaken the overall strength and durability and often require advanced repairs, such as welding or panel replacement, followed by refinishing.

A dent is an indentation or concave deformation on the vehicle’s exterior, typically produced by impacts or collisions. A dent’s size can vary from minor surface impressions to significant deep distortions and is often cosmetic, though severe cases may affect structural integrity.

A scratch is surface-level damage caused by objects scuffing the paint or protective coating. Scratches differ in severity, ranging from surface-level marks to deeper abrasions. If left untreated, scratches can lead to rust or further deterioration, especially deeper ones.

Broken clips are damaged fasteners that secure vehicle components. These clips are made of plastic or metal and hold panels, trim, and bumpers in place. When broken, vehicle components that were supported by these clips may become loose or misaligned.

For each detected damage instance, the model outputs the damage’s location by defining the damaged area with a bounding box. The bounding box coordinates within the image, along with a confidence score, follow a structure similar to the panel detector’s output.

These four types of car damage are the most frequently observed classes. By focusing on them, our AI system enhances the reliability and efficiency of the claim management process.

The pipeline then segments car parts (panels) into fifty-three distinct classes, covering the main components in a vehicle’s exterior.

Front Section. This section includes the front bumper, grille, fog lights, headlights, bonnet, and windscreen.
Side Sections. Covers side mirrors, fenders, doors, wheels, windows, quarter panels, and rails.
Rear Section. Comprises the boot, rear windscreen, bumper, tail lights, and spoiler.
Additional Components. Comprises the license plate, door handles, sensors, roof, caps, reflectors, and indicators.

By dividing the vehicle into these different panels, our AI system achieves precise damage detection and assessment, yielding improved claim processing accuracy.

In Figure 2, a sample of Insoore AI’s current car panel detector [3] shows how each type of car panel is detected and segmented. For each panel, the model provides a segmentation map, the panel type, and a confidence score for the prediction.

By reconciling outputs from the panel detector with outputs from the damage detector, each detected damage is mapped to the corresponding panel, and the damage area is calculated. Based on the calculated area and the detected damage type, damage severity is classified into three possible levels: low, medium, or high.
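
The reconciliation step can be sketched as follows. This is a minimal illustration and not the Insoore AI implementation: it assumes panel outputs are boolean masks, damage outputs are pixel-coordinate boxes (x1, y1, x2, y2), and the damaged area is approximated by the overlap between the box and the panel mask. The threshold table is entirely hypothetical and serves only to show how area and damage type can jointly determine a low, medium, or high level.

```python
import numpy as np

# Hypothetical per-type thresholds on the damaged fraction of a panel:
# (upper bound for "low", upper bound for "medium"). Not taken from the paper.
SEVERITY_THRESHOLDS = {
    "scratch": (0.10, 0.40),
    "dent": (0.05, 0.25),
    "crack": (0.02, 0.10),
    "broken clip": (0.02, 0.10),
}


def assign_panel(damage_box, panel_masks):
    """Map a damage box to the panel it overlaps most.

    Returns (panel_name, damaged_fraction_of_panel), or (None, 0.0)
    when the box overlaps no panel.
    """
    x1, y1, x2, y2 = damage_box
    best_name, best_overlap = None, 0
    for name, mask in panel_masks.items():
        overlap = int(mask[y1:y2, x1:x2].sum())  # panel pixels inside the box
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    if best_name is None:
        return None, 0.0
    return best_name, best_overlap / float(panel_masks[best_name].sum())


def classify_severity(damage_type, area_ratio):
    """Combine damage type and relative area into low / medium / high."""
    low_max, medium_max = SEVERITY_THRESHOLDS[damage_type]
    if area_ratio < low_max:
        return "low"
    if area_ratio < medium_max:
        return "medium"
    return "high"


# Toy scene with two rectangular panels in a 200 x 200 image.
door = np.zeros((200, 200), dtype=bool)
door[0:100, 0:100] = True
bumper = np.zeros((200, 200), dtype=bool)
bumper[100:200, :] = True
panel, ratio = assign_panel((10, 10, 30, 30), {"door": door, "bumper": bumper})
print(panel, ratio, classify_severity("scratch", ratio), classify_severity("crack", ratio))
```

Using type-specific thresholds reflects the point made above that the same damaged area can warrant a different severity level depending on whether it is, for instance, a scratch or a crack.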

Finally, the Insoore AI pipeline aggregates severity assessments for each damaged panel, eventually assessing whether repair or replacement is needed. The decision is based on the severity and type of each damage instance, together with updated part and labour prices from the repair shop and car manufacturer.

In light of the above, Figure 3 illustrates the proposed functional architecture, breaking it into four consecutive steps: segmentation of car panels, damage detection, mapping damages to panels, and finally, severity estimation and decision suggestion—whether to repair or replace based on damage assessment.

All in all, the transition from a damage detector based on Faster R-CNN in [3] to DiffusionDet in this paper enhances Insoore AI’s ability to detect and classify car damages. DiffusionDet achieves a substantial improvement in AP50 (see Appendix A below for further details), confirming stronger feature representation and better accuracy at detecting and classifying damages. By leveraging a generative denoising process, DiffusionDet significantly improves damage detection accuracy, robustness, and efficiency, making it more suitable for insurance claim assessments.

Balancing the forward and reverse diffusion processes in DiffusionDet is key to improving detection accuracy while maintaining efficiency. The forward process must add just enough noise to enable effective learning without obscuring fine details, while the reverse process should efficiently remove noise without excessive computational steps. This can be achieved through adaptive noise scheduling, feature guidance from the Swin Transformer, and latent space processing to reduce complexity. Additionally, optimizing the loss function ensures the model learns to reconstruct objects with high precision. By refining these elements, DiffusionDet achieves a faster, more accurate, and computationally efficient damage detection system.

6. On the Need for HPC Resources

The study focuses on training GenAI-based deep learning models while leveraging the capabilities of GPU parallel computing to enhance the efficiency and effectiveness of the training process. By utilizing GPU parallelism, the training pipeline can significantly reduce computation time, ensuring that even complex models are trained within a reasonable timeframe. For the inference phase, the models are executed sequentially to maintain consistency and simplify the evaluation process. A comprehensive benchmarking task was carried out, as reported in Table 1, to identify the most suitable HPC configuration required for the successful execution of the tasks. This ensures that the computational resources are optimized and effectively allocated throughout the study. In particular, Table 1 compares the training speedup achieved on the Leonardo Booster setup against a standard configuration that employs 32 GB Tesla V100 GPUs in parallel.

The benchmarking process involved three experimental setups to assess the scalability and performance of the system under varying GPU configurations: a single-GPU setup, a two-GPU setup, and a four-GPU setup. These configurations help determine the impact of scaling GPU resources on the training speed and overall system performance. By conducting this systematic benchmarking, the study aims to identify the optimal balance between computational power and resource utilization, ensuring that the deep learning models achieve their highest potential performance. This process is vital for understanding the efficiency of different configurations and serves as a foundation for making informed decisions about resource allocation during the project.

To further illustrate the scalability and performance of the proposed application, a detailed benchmarking table is presented to compare the training speedup achieved on the Leonardo Booster setup against a standard configuration that employs parallel 32 GB Tesla V100 GPUs. The entries in the table summarize the overall training speed, providing insights into the advantages of leveraging advanced HPC setups for deep learning tasks. The benchmarking results, sourced from data provided by Whoosnap, highlight the significant improvements in training speed achievable with optimized GPU configurations. This comparison underscores the transformative potential of HPC systems in accelerating the training of GenAI-based models, showcasing their utility in handling computationally intensive tasks effectively.
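
For reference, the two quantities usually derived from such a benchmark are the speedup relative to a baseline and the scaling efficiency across GPUs. The sketch below shows the arithmetic only. The timings are placeholders chosen for illustration and are not the values reported in Table 1.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate setup is than the baseline."""
    return baseline_seconds / candidate_seconds


def scaling_efficiency(single_gpu_seconds, multi_gpu_seconds, num_gpus):
    """Fraction of ideal linear scaling achieved when using num_gpus GPUs."""
    return speedup(single_gpu_seconds, multi_gpu_seconds) / num_gpus


# Placeholder timings per epoch, in seconds (illustrative only).
one_gpu, two_gpus, four_gpus = 100.0, 55.0, 30.0
for n, seconds in [(1, one_gpu), (2, two_gpus), (4, four_gpus)]:
    print(n, speedup(one_gpu, seconds), scaling_efficiency(one_gpu, seconds, n))
```

An efficiency close to 1 indicates that adding GPUs pays off almost linearly, whereas a value well below 1 suggests that communication or data-loading overhead limits the benefit of further scaling.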

Given the above-mentioned HPC setup, we have trained a DiffusionDet architecture based on a Swin Transformer backbone to address the task of damage detection and classification, thus yielding improved performance with respect to the results achieved in [3]. Instead, the other tasks appearing in Figure 3 have been implemented with the same setup as in [3].

7. Results

7.1. Dataset

For every experiment presented in this work from number 1 (C.P.) to number 9 (C.P.) listed in Table 2 below, we maintained the same train and test datasets used in our previous paper [3] to ensure consistency in benchmarking and performance assessment.

Following [3], the evaluation was carried out on a proprietary dataset comprising images of vehicle damages captured from real-world road accidents. In accordance with the design framework introduced in the previous paper, this dataset was annotated with four damage categories: cracks, dents, scratches, and broken clips.

The dataset’s annotation process followed rigorous guidelines which were applied to ensure accuracy and consistency:

Exclusion of non-damaged or disassembled vehicles: images without visible damage or displaying disassembled vehicles from mechanic workshops were removed.
Multiple body parts: when multiple body parts were damaged, a single annotation encompassed all affected regions.
Different damage types in the same area: if various types of damage co-occurred in the same area, each damage type was annotated separately.
Proximate damages: for dents in close proximity, one annotation was used to represent them collectively. However, damages on distinct parts of the panel were annotated individually.
Localized scratch swipes: multiple scratch swipes within a localized region were consolidated into the smallest feasible number of annotations.
Dent annotations: for dents, bounding boxes were aligned with the shadowed regions caused by the impact deformation.
Crack annotations: the entire length of a crack was covered by a single bounding region.
Broken clips: when clips were broken, the annotation encompassed the visible black gap caused by the separation of hooks joining body parts.

This standardized approach ensured clear and consistent representation of varying damage patterns across vehicles.

The dataset composition, as outlined in Table 3, is structured as follows:

the training set comprises 21,846 annotations derived from a total of 6782 images;
the test set includes 540 annotations extracted from 326 images.

The annotation process adhered to the COCO dataset format, ensuring compatibility with widely used standards for object detection and segmentation tasks. This structured approach facilitates consistency and robustness in the evaluation of the proposed methodology.
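To make the COCO detection format concrete, the following minimal sketch builds one image entry with two annotations. The category ids, file name, image size, and box coordinates are hypothetical placeholders; only the four category names and the field layout (`bbox` as `[x, y, width, height]`, plus `area` and `iscrowd`) follow the text and the public COCO convention.

```python
import json

# The four damage categories named in the text; the ids are illustrative.
CATEGORIES = [
    {"id": 1, "name": "scratch"},
    {"id": 2, "name": "dent"},
    {"id": 3, "name": "crack"},
    {"id": 4, "name": "broken clip"},
]


def make_annotation(ann_id, image_id, category_id, bbox):
    """Build one COCO detection annotation; bbox is [x, y, width, height]."""
    x, y, w, h = bbox
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,
        "iscrowd": 0,
    }


# Hypothetical example: one photo showing a dent and a scratch in the same area,
# annotated separately as required by the guidelines.
dataset = {
    "images": [{"id": 1, "file_name": "claim_0001.jpg", "width": 1280, "height": 960}],
    "categories": CATEGORIES,
    "annotations": [
        make_annotation(1, 1, 2, [410.0, 220.0, 150.0, 90.0]),
        make_annotation(2, 1, 1, [430.0, 240.0, 60.0, 12.0]),
    ],
}
print(json.dumps(dataset["annotations"][0]))

# Average number of annotations per image, from the figures reported above.
train_density = 21846 / 6782
test_density = 540 / 326
print(f"train: {train_density:.2f} annotations/image, test: {test_density:.2f}")
```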

Approximately 120,000 images were collected, along with relevant business attributes such as the monetary value of the damage and the brand and model of the vehicle. From these 120,000 images, a dedicated dataset of 6782 images was selected so that its attribute distribution mimics the real-world distribution observed in road accident scenarios. The same attribute distribution was replicated in the test set, resulting in a test sample of 326 images that closely aligns with the actual distribution of these business features. This ensures that the dataset reflects realistic conditions and provides a reliable foundation for evaluation.
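One common way to obtain a subset whose attribute distribution mirrors that of a larger pool is proportional stratified sampling. The sketch below illustrates the idea on a single attribute; it is an assumption about how such a selection could be done, not a description of the authors' actual procedure, and the `brand` key and record layout are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, n, seed=0):
    """Draw about n records so that the shares of `key` match the full pool."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    total = len(records)
    sample = []
    for members in groups.values():
        # Proportional allocation; rounding may shift the total slightly from n.
        quota = round(n * len(members) / total)
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical pool: 70% of the vehicles are of brand A and 30% of brand B.
pool = [{"brand": "A"} for _ in range(700)] + [{"brand": "B"} for _ in range(300)]
subset = stratified_sample(pool, "brand", 100)
print(Counter(rec["brand"] for rec in subset))
```

With several attributes (damage value, brand, model), the same idea applies to the joint strata, at the cost of sparser groups.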

An intermediate dataset, comprising 101,891 annotations (Table 3), was later introduced solely for Experiment 10 in Table 2 as a pre-training step to bridge the gap between general object detection models and the specific task of car damage classification. This dataset provides a structured domain adaptation phase, ensuring that the model learns meaningful features related to vehicle damage before fine-tuning on the final dataset. By incorporating a diverse set of damage patterns and real-world variations, it improves the model’s generalization capability while mitigating overfitting when transitioning from generic datasets like COCO.

However, while the intermediate dataset facilitates domain alignment, it is not designed for final optimization. The transition to the training set of 21,846 annotations (as per Table 3) is necessary to refine model performance with high-quality, task-specific annotations that adhere to stricter labeling standards. These last annotations provide improved granularity, reduced labeling inconsistencies, and align better with industry requirements for insurance claim assessments. This shift results in significant performance improvements, particularly in AP50, demonstrating that fine-tuning on a well-curated dataset was essential for achieving reliable and robust damage detection in real-world scenarios.

7.2. Experiments and Performance Assessment

As in our previous paper [3], we employed the following performance metrics for evaluation:

AP (average precision, averaged over IoU thresholds from 50% to 95%),
AP50 (average precision at an IoU threshold of 50%), and
AP75 (average precision at an IoU threshold of 75%).

Among these, AP50 is the primary metric of interest.

The first experiment relevant to this paper (i.e., the one denoted with 1 (C.P.) in Figure 4) was conducted to establish a baseline for DiffusionDet, aligning with the constraints of the previous experiments in [3], where all trainings were performed on a single 16 GB GPU. Consequently, the initial baseline setup consisted of using one GPU with a batch size of 1 per GPU.

In the current study, however, we used Leonardo’s high-performance computing (HPC) infrastructure, which provides four GPUs with 64 GB of memory each. This resource expansion allowed for larger-scale experimentation with increased GPU usage and higher batch sizes.

Following the established configuration, we decided to use the Swin base transformer backbone over the available ResNet backbones, taking advantage of its ability to capture long-range dependencies and multi-scale features—both critical for fine-grained damage identification in car components. This backbone is integrated within the diffusion-based detection framework in the image encoder.

We standardized the experimental setup by fixing all hyper-parameters and changing only one per experiment to measure its direct impact on performance. The selected backbone, namely, Swin Base [4], is the result of an initial training on the 22k ImageNet dataset [19] and a second training on the COCO 2017 dataset [20], providing robust feature extraction capabilities and aligning well with the iterative refinement required by the diffusion-based detection process.

We systematically investigated the impact of scaling GPU resources and batch size on the DiffusionDet model’s performance. Our experimental design involved incrementally modifying two key parameters: the number of GPUs and the batch size per GPU. To isolate the effect of each parameter, we conducted a series of experiments where only one variable was altered at a time. The results are plotted in Figure 4 and reported per experiment in Table 2.

The first set of experiments focused on GPU scaling, progressing from one to two and from two to four GPUs. Subsequently, we explored the impact of increasing the batch size per GPU, incrementing it by one unit at a time and carefully monitoring training performance. Every configuration improved on the single-GPU baseline across all three evaluation metrics: AP, AP50, and AP75.
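The progression of Experiments 1 to 9 (C.P.) can be written down explicitly. The sketch below lists the (GPUs, batch size per GPU) pairs from Table 2, checks that consecutive runs differ in exactly one parameter, and derives the number of images processed per optimization step under the assumption of plain data-parallel training, where each GPU handles its own mini-batch. That assumption, and the helper names, are ours.

```python
# (experiment, number of GPUs, batch size per GPU) for rows 1-9 (C.P.) of Table 2.
RUNS = [
    (1, 1, 1), (2, 2, 1), (3, 4, 1), (4, 4, 2), (5, 4, 3),
    (6, 4, 4), (7, 4, 5), (8, 4, 6), (9, 4, 7),
]


def global_batch_size(n_gpus, per_gpu_batch):
    """Images per optimization step, assuming data-parallel training."""
    return n_gpus * per_gpu_batch


def changed_parameters(prev, curr):
    """Names of the parameters that differ between two consecutive runs."""
    names = ("gpus", "batch_per_gpu")
    return [name for name, a, b in zip(names, prev[1:], curr[1:]) if a != b]


for exp, n_gpus, per_gpu in RUNS:
    print(f"Experiment {exp}: global batch size {global_batch_size(n_gpus, per_gpu)}")

one_at_a_time = all(
    len(changed_parameters(a, b)) == 1 for a, b in zip(RUNS, RUNS[1:])
)
print("one parameter changed per step:", one_at_a_time)
```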

The optimal configurations were consistently achieved using four GPUs, with peak performance metrics varying by specific measure: batch size of 6 for maximum AP, batch size of 3 for maximum AP50, and batch size of 5 for maximum AP75. The experimental bounds were established at a maximum batch size per GPU of 7 in the ninth experiment.
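The per-metric optima stated above can be re-derived from Table 2. The following sketch copies rows 1 to 9 (C.P.) and selects the best run for each metric; the tuple layout and function name are illustrative.

```python
# Rows 1-9 (C.P.) of Table 2: (experiment, GPUs, batch per GPU, AP, AP50, AP75).
ROWS = [
    (1, 1, 1, 12.79, 36.02, 7.766),
    (2, 2, 1, 14.57, 37.02, 10.04),
    (3, 4, 1, 14.33, 38.53, 8.935),
    (4, 4, 2, 13.82, 37.11, 9.025),
    (5, 4, 3, 14.44, 38.87, 8.834),
    (6, 4, 4, 14.28, 38.11, 9.626),
    (7, 4, 5, 14.52, 38.08, 11.9),
    (8, 4, 6, 15.04, 38.26, 8.848),
    (9, 4, 7, 14.10, 38.20, 8.54),
]
METRIC_COLUMN = {"AP": 3, "AP50": 4, "AP75": 5}


def best_run(metric):
    """Return the Table 2 row with the highest value for the given metric."""
    column = METRIC_COLUMN[metric]
    return max(ROWS, key=lambda row: row[column])


for metric in METRIC_COLUMN:
    exp, n_gpus, per_gpu = best_run(metric)[:3]
    print(f"{metric}: Experiment {exp} ({n_gpus} GPUs, batch size {per_gpu} per GPU)")
```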

Building on the configuration of Experiment 9 (four GPUs and a batch size of 7 per GPU), an additional experiment was conducted to further enhance model performance. This experiment incorporated a two-stage transfer learning approach: the model was initialized with pre-trained COCO weights and first trained on an intermediate dataset specifically curated to bridge the domain gap; a second training phase was then performed on the training set.

Prior to Experiment 10, all previous experiments used COCO pre-trained weights as the starting point, from which the model was fine-tuned on the training set. In Experiment 10, however, training was first conducted on an intermediate dataset, named the intermediate set, which contained the same four object classes as the training set. This intermediate training phase facilitated a domain adaptation process, allowing the model to acquire more relevant feature representations before fine-tuning on the target dataset. This approach effectively aligned the COCO pre-trained weights with the domain-specific task of car damage detection, providing a more suitable initialization compared to Experiment 9.
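The difference between the single-stage setup of Experiments 1 to 9 and the two-stage setup of Experiment 10 lies only in which checkpoint each training phase starts from. The schematic sketch below encodes that chain; all checkpoint paths and dataset identifiers are hypothetical placeholders, and no training framework is invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    init_weights: str  # checkpoint the stage starts from
    train_set: str     # dataset the stage trains on
    output: str        # checkpoint the stage writes


def single_stage_schedule():
    """Experiments 1-9: fine-tune the COCO weights directly on the training set."""
    return [Stage("fine-tuning", "coco_pretrained.pth", "training_set", "baseline.pth")]


def two_stage_schedule():
    """Experiment 10: adapt on the intermediate set, then fine-tune."""
    stage1 = Stage("domain adaptation", "coco_pretrained.pth",
                   "intermediate_set", "stage1.pth")
    # The second stage starts from the checkpoint produced by the first one.
    stage2 = Stage("fine-tuning", stage1.output, "training_set", "stage2.pth")
    return [stage1, stage2]


for stage in two_stage_schedule():
    print(f"{stage.name}: {stage.init_weights} -> {stage.train_set} -> {stage.output}")
```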

All other parameters and experimental settings remained consistent with those of Experiment 9. The results demonstrated significant improvements in model performance, with AP50 achieving its highest recorded value of 40.92. Furthermore, the overall AP reached its highest value of 16.76, indicating the effectiveness of the proposed two-stage transfer learning strategy.

The comparison with our previous work [3], plotted in Figure 4 and detailed in Table 2, shows significant performance gains. The improvement is particularly notable in AP50, which increased from 30.45 to 40.92, a gain of 10.47 points. Across the experiments of the current paper, the best AP increased from 12.65 to 16.76 and the best AP75 from 9.63 to 11.9; the latter value was obtained in Experiment 7 (C.P.), whereas Experiment 10 (C.P.) reached an AP75 of 10.85.
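The absolute and relative gains follow directly from the best values per metric in Table 2. A small sketch of the arithmetic (the dictionary and function names are illustrative):

```python
# Best value per metric in the previous paper (P.P.) and in the current one (C.P.).
PREVIOUS_BEST = {"AP": 12.65, "AP50": 30.45, "AP75": 9.63}
CURRENT_BEST = {"AP": 16.76, "AP50": 40.92, "AP75": 11.9}


def gains(metric):
    """Return (absolute gain in points, relative gain in percent)."""
    old, new = PREVIOUS_BEST[metric], CURRENT_BEST[metric]
    absolute = new - old
    return round(absolute, 2), round(100 * absolute / old, 2)


for metric in PREVIOUS_BEST:
    absolute, relative = gains(metric)
    print(f"{metric}: +{absolute} points ({relative}% relative)")
```

For AP50 this gives a gain of 10.47 points, i.e., a relative improvement of about 34.38%.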

8. Conclusions and Future Work

In this study, we have presented an improved version of Insoore AI, with specific reference to the task of automatic car damage detection and classification, by relying on a GenAI architecture (namely, DiffusionDet based on a Swin Transformer backbone) and leveraging high-performance computing resources from the LEONARDO HPC infrastructure. The strategic architectural shift and computational scaling enabled a remarkable $\frac{40.92 - 30.45}{30.45} = 34.38\%$ improvement in AP50. This result represents a further leap forward with respect to the previous $8.13\%$ improvement of [3] over [7].

The comparison of our latest solution with the baseline is made in terms of average precision, specifically AP50. The improvement measured with respect to this metric, when it comes to practical use in the claim management process, translates to having, in general, no more than 13% false positive predictions should Insoore AI be fully in charge of damage detection, which is a really encouraging result along the roadmap towards the full automation of claim management.

The principal advancement in this study—namely, the integration of a diffusion architecture with a Swin transformer backbone, enabled by high-performance computing resources—represents a novel and significant contribution to the field and its effectiveness is strengthened by resorting to two-stage transfer learning. This exploration has not been previously conducted in the context of car damage claim management systems, and it provides valuable insights into the scalability and adaptability of diffusion-based architectures for this domain.

In a nutshell, the Swin Transformer backbone, by contrast with a ResNet one, provides multi-scale, window-based self-attention that helps preserve and aggregate fine-grained details—exactly what is needed for detecting small or subtle car damages. Because small dents, scratches, or cracks can easily be overlooked by coarser feature extractors, the localized attention together with the hierarchical structure of the Swin Transformer backbone significantly boost the sensitivity and precision of the DiffusionDet architecture. This leads to more accurate detection of tiny or nuanced car damages—a critical factor in fine-grained tasks, such as insurance claims, maintenance assessments, or automated quality inspections. By integrating these refined detection capabilities into Insoore’s AI pipeline, the results presented in this study offer a more reliable and efficient basis for automatic claim generation. The demonstrated robustness in challenging real-world scenarios not only strengthens the pipeline’s performance in identifying vehicle damages but also enables a more streamlined end-to-end workflow for insurance claim processing. Consequently, these advancements reinforce the broader impact of diffusion-based methods in industrial applications, laying the groundwork for further optimization and extension to other segments of insurance automation.

Future work will explore additional architectural optimizations and investigate alternative feature extraction techniques.

It is important to note that, given the several potential privacy-related challenges that emerge when it comes to further validating and improving the model performance in real-world scenarios, balancing privacy and model performance might require a combination of synthetic data, federated learning, and privacy-preserving AI techniques. This may help further validate and enhance model accuracy in real-world insurance claim assessments.

Also, to use the proposed model in different markets, some further effort may be needed in order to adjust to the very different vehicle types, damage patterns, and insurance policies that are relevant to each considered market.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization, G.G. (Gonzalo Gutierrez), A.O., M.A.R., and L.R.C.; supervision, project administration, funding acquisition, V.A., G.G. (Gerardo Gorga), L.R.C., F.S., and E.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by CINECA and EuroHPC within the framework of project EHPC-DEV-2024D05-017—EuroHPC Development Access Call. The Call provided access to the Booster module of LEONARDO HPC system, which offered the computational resources necessary for conducting the research and achieving the reported results.

Data Availability Statement

The data are unavailable due to privacy restrictions.

Acknowledgments

The authors thank the software development team at Insoore for the support provided in terms of data engineering and platform implementation.

Conflicts of Interest

We report that E. Scianaro, V. Arconzo, and G. Gorga are founders of Whoosnap Srl, F. Santini is executive chairman of Whoosnap Srl, whereas G. Gutierrez, M. A. Rangisetty, and L. Ricciardi Celsi are employees at Whoosnap Srl. A. Omar is currently an independent researcher; however, the work presented in this paper has been carried out in the context of A.O.’s activity at Whoosnap until October 2024.

Appendix A. Evaluation Metrics

As specified in [21], the COCO metrics, which were initially proposed together with the publication of the COCO dataset, have become the most used evaluation method for models aimed at solving object detection and segmentation tasks. All experiments proposed in this paper have been assessed using COCO metrics, as proposed in Microsoft COCO: Common Objects in Context [22] and later listed on the Microsoft COCO website [20]. In this respect, the terms Average Precision (AP) and mean Average Precision (mAP) can be used interchangeably.

In [21], average precision is defined using COCO metrics. Precision is the ratio of true positives to the sum of true positives and false positives. On its own, this standard formula is not particularly effective for object detection, because models may produce numerous predictions, some of them with low confidence scores. Average Precision (AP) addresses this limitation and is considered a robust general-purpose metric for measuring the performance of object detection models.

As defined in [21], “Intersection over Union, or IoU, is an area-based metric used in Computer Vision to determine if two objects can be considered to be the same. As the name suggests, the metric is computed by dividing the area of the intersection of two objects by the area of their union. The range of IoU is 0 to 1, with two identical objects having an IoU of 1 and objects with no intersection possessing an IoU of 0”.
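For axis-aligned bounding boxes, the definition quoted above reduces to a few lines of arithmetic. The sketch below assumes boxes given as `(x1, y1, x2, y2)` corner coordinates; note that COCO annotation files store boxes as `[x, y, width, height]`, so a conversion would be needed first.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # half-overlapping boxes
```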

AP: the average precision averaged over ten IoU thresholds ranging from IoU = 0.50 to IoU = 0.95 in steps of 0.05 (IoU = 0.50:0.05:0.95) [20].
AP50: the average precision calculated at an IoU threshold of 50% [20].
AP75: the average precision calculated at an IoU threshold of 75% [20].
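The relation between the three metrics can be illustrated with the ten IoU thresholds themselves. The sketch below is a deliberate simplification: it only shows at which thresholds a single matched detection counts as a true positive and how AP averages the per-threshold values, leaving out confidence ranking, one-to-one matching, and the precision-recall integration performed by a full COCO evaluation. The per-threshold values in the usage lines are made up for illustration.

```python
# The ten COCO IoU thresholds 0.50:0.05:0.95.
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def is_true_positive(iou_value, threshold):
    """A matched detection counts as correct when its IoU reaches the threshold."""
    return iou_value >= threshold


def mean_over_thresholds(ap_per_threshold):
    """AP is the mean of the per-threshold average precisions."""
    return sum(ap_per_threshold.values()) / len(ap_per_threshold)


# A detection overlapping its ground truth with IoU = 0.62 is correct for AP50
# but not for AP75.
accepted = [t for t in COCO_IOU_THRESHOLDS if is_true_positive(0.62, t)]
print(accepted)

# Made-up per-threshold values, decreasing as the threshold gets stricter.
example = {t: 1.0 - t for t in COCO_IOU_THRESHOLDS}
print(round(mean_over_thresholds(example), 3))
```

This also shows why AP is always lower than or equal to AP50 in practice: the stricter thresholds can only reject detections that the 0.50 threshold accepts.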

References

1. Andreozzi, A.; Ricciardi Celsi, L.; Martini, A. Enabling the Digitalization of Claim Management in the Insurance Value Chain Through AI-Based Prototypes: The ELIS Innovation Hub Approach. In Digitalization Cases; Urbach, N., Röglinger, M., Kautz, K., Alias, R.A., Saunders, C., Wiener, M., Eds.; Management for Professionals; Springer: Cham, Switzerland, 2021; Volume 2.
2. Boston Consulting Group. Why Insurance Leaders Need to Leverage GenAI. Available online: https://www.bcg.com/publications/2023/why-insurance-leaders-need-to-leverage-gen-ai (accessed on 26 March 2025).
3. Atanasious, M.M.H.; Becchetti, V.; Giuseppi, A.; Pietrabissa, A.; Arconzo, V.; Gorga, G.; Gutierrez, G.; Omar, A.; Pietrini, M.; Rangisetty, M.A.; et al. An Insurtech Platform to Support Claim Management Through the Automatic Detection and Estimation of Car Damage from Pictures. Electronics 2024, 13, 4333.
4. Chen, S.; Sun, P.; Song, Y.; Luo, P. DiffusionDet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 19830–19843.
5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497.
6. Consorzio Interuniversitario del Nord-Est per il Calcolo Automatico (CINECA), Leonardo Booster. Available online: https://leonardo-supercomputer.cineca.eu/it/about-it/#jump-leonardo (accessed on 26 March 2025).
7. Maiano, L.; Montuschi, A.; Caserio, M.; Ferri, E.; Kieffer, F.; Germanò, C.; Baiocco, L.; Celsi, L.R.; Amerini, I.; Anagnostopoulos, A. A deep-learning–based antifraud system for car-insurance claims. Expert Syst. Appl. 2023, 231, 120644.
8. Bdeo. Available online: https://bdeo.io/en/motor/motor-claims-management/ (accessed on 26 March 2025).
9. Tukra, S.; Hoffman, F.; Chatfield, K. Improving visual representation learning through perceptual understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
10. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
11. Wang, X.; Li, W.; Wu, Z. CarDD: A New Dataset for Vision-based Car Damage Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7202–7214.
12. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
14. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. arXiv 2017, arXiv:1611.05431v2.
15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
17. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2021, arXiv:2012.12877v2.
18. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. arXiv 2021, arXiv:2011.12450v2.
19. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet Website. Available online: https://www.image-net.org/ (accessed on 26 March 2025).
20. Microsoft-COCO. Microsoft COCO: Common Objects in Context. Available online: https://cocodataset.org/#detection-eval (accessed on 26 March 2025).
21. Wood, L.; Chollet, F. Efficient Graph-Friendly COCO Metric Computation for Train-Time Model Evaluation. arXiv 2022, arXiv:2207.12120.
22. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.

Figure 1. DiffusionDet framework. The image encoder extracts features from the input image. Then, the detection decoder processes noisy boxes to predict category labels and box coordinates for object detection.

Figure 2. Recognition results for panels across multiple car models from various angles. In each pair of images, the left one is the original image and the right one shows the detection result. Shapes are non-homogeneous, corresponding to the actual shape of each car panel appearing in the images as detected by the algorithm, while colors are different in order to distinguish each car panel from the other.

Figure 3. Flowchart of Insoore AI pipeline.

Figure 4. Performance curve of the experiments carried out. At the left of the vertical dashed line, we see all experiments from the previous paper [3], and at the right side of it, we see the experiments from our current paper.

Table 1. Benchmarking training speedup across GPU configurations.

No. of Parallel GPUs	V100 32 GB	Leonardo Booster
1	1×	2.5×
2	1.96×	4.32×
4	3.88×	8.63×

Table 2. Results of the DiffusionDet experiments (P.P. refers to our previous paper [3] and C.P. refers to the current paper). The asterisk denotes our baseline experiment. The highest values per metric are AP = 16.76 and AP50 = 40.92 (Experiment 10 (C.P.)) and AP75 = 11.9 (Experiment 7 (C.P.)).

N. (Paper)	DiffusionDet Experiments	AP	AP50	AP75
1 (P.P.)	Mask-RCNN (Seg)	1.02	4.89	0.14
2 (P.P.)	Mask-RCNN (BB)	11.92	28.16	8.2
3 (P.P.)	Default in [3]: Faster RCNN + FPN:X101	11.24	28.16	7.2
4 (P.P.)	Data Aug.1	8	20.55	5.2
5 (P.P.)	Data Aug.2	6.7	18.27	3.87
6 (P.P.)	Backbone1	12.4	28.68	9.09
7 (P.P.)	Backbone2	11.93	28.22	8.47
8 (P.P.)	FPN1	8.86	24.34	4.22
9 (P.P.)	FPN2	6.3	18.04	2.21
10 (P.P.)	RPN: Anchors Size1	12.38	29.52	6.62
11 (P.P.)	RPN: Anchors Size2	12.31	30.45	7.29
12 (P.P.)	RPN: Anchors Size3	12.15	29.05	7.72
13 (P.P.)	RPN:AnchorsAspect Ratio1	12.65	29.28	9.63
14 (P.P.)	RPN:AnchorsAspect Ratio2	12.41	30.04	8.21
15 (P.P.)	RPN: Hybrid1	11	26.71	8.52
1 (C.P.)	COCO weights + GPU:1 + Batch size:1 (*)	12.79	36.02	7.766
2 (C.P.)	GPU: 2 + Batch size:1	14.57	37.02	10.04
3 (C.P.)	GPU: 4 + Batch size:1	14.33	38.53	8.935
4 (C.P.)	GPU: 4 + Batch size:2	13.82	37.11	9.025
5 (C.P.)	GPU: 4 + Batch size:3	14.44	38.87	8.834
6 (C.P.)	GPU: 4 + Batch size:4	14.28	38.11	9.626
7 (C.P.)	GPU: 4 + Batch size:5	14.52	38.08	11.9
8 (C.P.)	GPU: 4 + Batch size:6	15.04	38.26	8.848
9 (C.P.)	GPU: 4 + Batch size:7	14.1	38.2	8.54
10 (C.P.)	training stages:2 + GPU:4 + Batch size:7	16.76	40.92	10.85

Table 3. Sample distribution per class for the training and test sets [3] and for the intermediate set.

Data	Scratch	Dent	Crack	Broken Clips	Annotations
training set	47.64%	23.89%	16.40%	12.07%	21,846
test set	46%	33%	11%	10%	540
intermediate set	81.24%	10.90%	5.02%	2.83%	101,891

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Arconzo, V.; Gorga, G.; Gutierrez, G.; Omar, A.; Rangisetty, M.A.; Ricciardi Celsi, L.; Santini, F.; Scianaro, E. On the Application of DiffusionDet to Automatic Car Damage Detection and Classification via High-Performance Computing. Electronics 2025, 14, 1362. https://doi.org/10.3390/electronics14071362
