1. Introduction
The advancements in artificial intelligence (AI), particularly neural networks using deep learning, have revolutionized high-stakes tasks such as medical analysis. For instance, natural language processing (NLP) models like GPT have been applied to diagnose health conditions via chat applications [1], and generative AI has been used to discover new drug designs [2]. Similarly, deep learning models handling medical images contribute significantly to tasks such as translating PET to CT images [3], detecting oral cancer and classifying its risk [4], and segmenting cells [5,6].
The emergence of diffusion models has further advanced image generation tasks, providing a more stable training process compared to generative adversarial networks (GANs). Trained diffusion models, such as Stable Diffusion [7], have demonstrated the ability to generate high-quality outputs, which opens new possibilities for medical applications.
In the context of medical tasks utilizing diffusion models, refs. [8,9] employed denoising diffusion probabilistic models (DDPMs) to perform organ segmentation and coloring from reference images to MRI input images. Furthermore, Konz et al. [10] proposed SegGuidedDiff, which uses the denoising diffusion implicit model (DDIM) algorithm to generate realistic MRI and CT images from input mask images. To enhance image quality further, researchers have introduced more complex denoising modules, such as frequency filters [8] and advanced multi-head attention mechanisms [11]. However, these sophisticated diffusion models often incur high computational costs during training and require substantial GPU memory for storage. Such resource-intensive models are challenging to implement in practical medical settings, where hardware constraints often limit the feasibility of deploying large-scale models.
In this study, we employed the DDIM method due to its potential for fast and high-quality sampling. Our goal was to address the limitations of existing diffusion models by developing a lightweight DDIM capable of delivering comparable segmentation quality while reducing computational requirements. Through iterative experimentation, we arrived at a simple yet effective solution: replacing standard convolutional neural networks (CNNs) in each layer with grouped CNNs.
We evaluated our lightweight DDIM on three distinct medical segmentation tasks spanning different imaging modalities: (1) extracting lung regions from chest X-ray images, (2) segmenting melanoma from skin images, and (3) segmenting polyps from endoscopy images.
The contributions of our study can be summarized as follows:
By implementing grouped CNNs in residual blocks and self-attention layers, the trained model achieved segmentation results closely matching ground truth data.
In terms of quality metrics, our lightweight DDIM demonstrated high recall and dice scores, comparable to those of standard DDIM models. In addition, the runtime required to output a predicted segmentation image was 29.9% shorter than that of the standard DDIM.
The model’s file size was reduced by approximately 75% compared to the standard DDIM, thus significantly lowering storage requirements.
3. Our Lightweight DDIM
3.1. Overview
Diffusion models have demonstrated potential in medical segmentation tasks due to their ability to reconstruct high-quality outputs through iterative denoising processes. To explore this potential, we implemented a denoising model based on U-Net within the DDIM framework.
Figure 1 illustrates the overall architecture of our proposed model, which consists of three main components: an encoder, a bottleneck, and a decoder.
In the encoder, concatenated input images are progressively downsampled through residual blocks and convolutional layers to extract hierarchical features. At each stage, feature maps are stored for concatenation in the corresponding decoder layers. The bottleneck includes a self-attention layer, which processes the feature maps from the final residual block to capture long-range dependencies. In the decoder, the stored feature maps are upsampled through multiple residual blocks, and the predicted noise is output, which is subsequently used in the reverse process to reconstruct the final segmentation.
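To make this data flow concrete, the following is a minimal PyTorch sketch of the encoder–bottleneck–decoder path described above. It is an illustrative reading of Figure 1, not the authors' implementation: the channel sizes, the single downsampling stage, and the use of a standard multi-head attention layer are placeholder assumptions, and the timestep embedding discussed in Section 3.2 is omitted for brevity.

```python
import torch
import torch.nn as nn

class DenoisingUNetSketch(nn.Module):
    """Sketch of the conditional denoising U-Net: encoder, attention bottleneck, decoder."""

    def __init__(self, in_ch=2, base_ch=64):  # in_ch = noisy input x_t + conditional image c
        super().__init__()
        self.enc = nn.Conv2d(in_ch, base_ch, 3, padding=1)                # encoder stage
        self.down = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)
        self.attn = nn.MultiheadAttention(base_ch * 2, num_heads=4, batch_first=True)
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)   # decoder stage
        self.out = nn.Conv2d(base_ch * 2, 1, 3, padding=1)                # predicted noise

    def forward(self, x_t, c):
        h = torch.cat([x_t, c], dim=1)          # concatenated input images
        skip = self.enc(h)                      # stored for the corresponding decoder layer
        b = self.down(skip)                     # downsampling to the bottleneck
        n, ch, hh, ww = b.shape
        seq = b.flatten(2).transpose(1, 2)      # (N, H*W, C) tokens for self-attention
        a, _ = self.attn(seq, seq, seq)         # long-range dependencies in the bottleneck
        b = a.transpose(1, 2).reshape(n, ch, hh, ww)
        d = self.up(b)                          # upsampling in the decoder
        d = torch.cat([d, skip], dim=1)         # concatenation with stored encoder features
        return self.out(d)                      # output: predicted noise
```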
The noisy input $x_t$ is generated by blending random noise $\epsilon$ and the target segmentation image $x_0$, as described in Equation (1):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad (1)$$

where $\bar{\alpha}_t$ denotes the cumulative product of the noise schedule up to timestep $t$.
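As a concrete illustration of Equation (1), the sketch below forms $x_t$ from $x_0$ and random noise; the linear beta schedule and the value of $T$ are placeholder assumptions, not the settings used in our experiments.

```python
import torch

def q_sample(x0, t, alpha_bar, eps=None):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps  (Equation (1))."""
    if eps is None:
        eps = torch.randn_like(x0)                  # random noise epsilon
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)          # cumulative schedule value at timestep t
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

# Example: a linear beta schedule over T = 1000 timesteps (illustrative values)
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```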
As the timestep $t$ increases, $x_t$ becomes progressively noisier. The noisy input $x_t$, along with the conditional input $c$, is fed into the denoising U-Net $\epsilon_\theta$. During training, $\epsilon_\theta$ minimizes the mean squared error (MSE) loss between $\epsilon$ and $\epsilon_\theta(x_t, c, t)$, as shown in Equation (3):

$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x_0, c, \epsilon, t}\left[\, \left\| \epsilon - \epsilon_\theta(x_t, c, t) \right\|^2 \,\right]. \qquad (3)$$
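A minimal training-step sketch of this objective is shown below; `model` stands for the conditional denoising U-Net $\epsilon_\theta$ and is assumed to accept $(x_t, c, t)$, the forward noising of Equation (1) is repeated inline, and all tensors are assumed to reside on the same device.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, c, alpha_bar):
    """One optimization step of the MSE objective in Equation (3)."""
    t = torch.randint(0, alpha_bar.numel(), (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)                             # target noise
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # Equation (1)
    eps_pred = model(x_t, c, t)                            # eps_theta(x_t, c, t)
    return F.mse_loss(eps_pred, eps)                       # Equation (3)
```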
In the reverse process, segmentation images are reconstructed iteratively from noise. The sampling process is governed by Equation (4), which follows the DDIM framework:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, c, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, c, t), \qquad (4)$$

where $\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, c, t)\right) / \sqrt{\bar{\alpha}_t}$ in the first term represents the prediction of $x_0$, and the second term captures the residual noise.
Figure 2 shows the reverse process, which generates a segmentation from an input of random noise. Through iterative steps over the timesteps $t, t - N, t - 2N, \ldots$ (where $N$ denotes the skipping interval), the segmentation image is progressively reconstructed from the predicted noise. Unlike DDPM, which relies on Markov chains for sampling, DDIM uses a deterministic sampling procedure, allowing for consistent segmentation predictions even with varying random noise inputs. Therefore, the generation of the predicted segmentation can be accelerated by setting a large value of $N$.
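The deterministic reverse process with a skipping interval $N$ can be sketched as follows; this is an illustrative reading of Equation (4), and the default values of `T`, `N`, and the output shape are placeholders rather than our experimental settings.

```python
import torch

@torch.no_grad()
def ddim_sample(model, c, alpha_bar, T=1000, N=50, shape=(1, 1, 256, 256)):
    """Reconstruct a segmentation from random noise via the DDIM update of Equation (4)."""
    x = torch.randn(shape, device=c.device)              # start from random noise
    timesteps = list(range(T - 1, -1, -N))               # visit t, t - N, t - 2N, ...
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        a_t = alpha_bar[t]                                # alpha_bar assumed on the same device as c
        a_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.ones_like(a_t)
        t_batch = torch.full((shape[0],), t, device=c.device, dtype=torch.long)
        eps = model(x, c, t_batch)                                   # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()          # prediction of x_0
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps      # Equation (4)
    return x
```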
3.2. Our Lightweight Residual Block and Self-Attention
Traditional diffusion models employ numerous residual blocks to address the gradient vanishing problem [24]. However, the extensive use of these blocks, coupled with standard self-attention mechanisms, significantly increases the computational cost and memory requirements of the model.
To mitigate these issues, we introduce lightweight grouped convolution layers. Grouped convolutions divide input feature maps into smaller groups based on channel size and perform convolutions independently within each group, reducing computational overhead and memory usage [25].
Figure 3 illustrates the architecture of our modified residual blocks and self-attention layers.
In our residual blocks, input feature maps with dimensionality $D$ are processed through grouped convolutions, with the group parameter set to $\min(32, D)$, i.e., the smaller value between 32 and $D$. Embedded time data is injected into intermediate layers to provide temporal context during training.
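A minimal sketch of such a residual block is given below; it assumes $D$ is a multiple of the group size (as is typical for U-Net channel widths), and the SiLU activation and linear time projection are illustrative choices rather than the exact configuration of our model.

```python
import torch
import torch.nn as nn

class GroupedResBlock(nn.Module):
    """Residual block whose convolutions use groups = min(32, D)."""

    def __init__(self, D, time_dim):
        super().__init__()
        g = min(32, D)                                       # group parameter
        self.conv1 = nn.Conv2d(D, D, 3, padding=1, groups=g)
        self.conv2 = nn.Conv2d(D, D, 3, padding=1, groups=g)
        self.time_proj = nn.Linear(time_dim, D)              # injects embedded time data
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.time_proj(t_emb)[:, :, None, None]      # temporal context
        h = self.act(self.conv2(h))
        return x + h                                         # residual connection
```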
For self-attention, the Q (query), K (key), and V (value) matrices are computed using grouped convolutional layers. The dimensionality of each matrix is set to $D$. The grouped convolution parameter for calculating K and V is fixed at 32, further enhancing computational efficiency while preserving the effectiveness of the attention mechanism.
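The corresponding self-attention layer can be sketched as follows; the single-head formulation, the grouping of the Q projection, and the output projection are assumptions made for illustration, with the K and V projections grouped by the fixed parameter of 32 as described above (this requires $D$ to be divisible by 32).

```python
import torch
import torch.nn as nn

class GroupedSelfAttention(nn.Module):
    """Position-based self-attention with Q, K, V computed by grouped 1x1 convolutions."""

    def __init__(self, D):
        super().__init__()
        self.q = nn.Conv2d(D, D, 1, groups=min(32, D))   # grouping of Q: an assumption
        self.k = nn.Conv2d(D, D, 1, groups=32)           # fixed group parameter for K
        self.v = nn.Conv2d(D, D, 1, groups=32)           # fixed group parameter for V
        self.proj = nn.Conv2d(D, D, 1)

    def forward(self, x):
        n, d, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (N, HW, D)
        k = self.k(x).flatten(2)                          # (N, D, HW)
        v = self.v(x).flatten(2).transpose(1, 2)          # (N, HW, D)
        attn = torch.softmax(q @ k / d ** 0.5, dim=-1)    # attention over spatial positions
        out = (attn @ v).transpose(1, 2).reshape(n, d, h, w)
        return x + self.proj(out)                         # residual connection
```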
5. Results
5.1. Visualizing Predicted Segmentation
To evaluate the segmentation quality of the diffusion models visually, we fed five samples of random noise into each model and obtained five candidate segmentations.
The segmentation results for the RSUA dataset are shown in Figure 4a. U-Net demonstrated basic segmentation capabilities, but its predictions often deviated significantly from the ground truth. MedSegDiff, leveraging its diffusion-based architecture, exhibited improved performance in delineating these regions; however, noise occasionally appeared in its outputs, highlighting stability issues. In contrast, both DDIM and Lightweight DDIM consistently achieved accurate and smooth segmentation results, closely matching the ground truth. Our Lightweight DDIM maintained performance similar to DDIM despite its reduced computational complexity, proving its efficiency in handling grayscale medical imaging tasks.
For the ISBI 2016 dataset, the segmentation results, as visualized in Figure 4b, further emphasize the strengths of diffusion-based models. U-Net had difficulty adapting to input images with strong coloration, often producing segmentation maps that failed to align with the target melanoma regions. MedSegDiff provided predictions that generally adhered to the target features; however, it showed instability when handling large melanoma regions. By contrast, DDIM demonstrated consistently precise segmentation, even under challenging conditions involving high noise during the reverse process. Similarly, Lightweight DDIM produced outputs closely aligned with the ground truth, preserving spatial accuracy and shape consistency.
Segmentation results for the PolypDB dataset, depicted in Figure 4c, reflect the challenges of localizing multiple small or intricate target regions within endoscopic images. U-Net produced coarse and often imprecise segmentation maps, failing to capture smaller polyp structures. MedSegDiff exhibited a high degree of instability, with many outputs misaligned with the target regions. On the other hand, DDIM effectively and accurately segmented both large and small regions, aligning closely with the ground truth. Lightweight DDIM, while simplified, retained its ability to handle intricate segmentation tasks, matching the ground truth effectively in most cases.
Overall, visualization of the results confirmed that the diffusion-based approaches, particularly DDIM and its lightweight variant, could perform well in medical segmentation tasks. Lightweight DDIM, despite its reduced computational demands, maintained comparable accuracy to DDIM.
5.2. Score Results: Recall and Dice Score
When outputting segmentations with the diffusion models, we fed five samples of random noise as input and used the ensemble of the resulting outputs for evaluation. In addition, we normalized each predicted segmentation to the range [0, 1] using min-max normalization. For evaluation, a threshold of 0.9 was applied per pixel in the normalized output, where a pixel was considered part of the segmented region if its value exceeded this threshold. We then calculated the recall, defined as $\mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, and the dice score, defined as $\mathrm{Dice} = 2\,\mathrm{TP} / (2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, where true positives (TPs) are the regions where the predicted segmentation correctly overlapped with the target; false negatives (FNs) are the regions where the target is present but was not detected by the predicted segmentation; and false positives (FPs) are the regions where the predicted segmentation identified a region as positive but no corresponding target exists. A pixel was counted as a TP if its value in the predicted segmentation exceeded the threshold and its corresponding pixel in the target segmentation was also positive.
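The sketch below illustrates this evaluation protocol; averaging is assumed as the ensembling operation over the five predicted outputs, and the small epsilon terms only guard against division by zero.

```python
import numpy as np

def recall_and_dice(preds, target, threshold=0.9):
    """Ensemble, normalize, threshold, and score a set of predicted segmentations."""
    ens = np.mean(preds, axis=0)                                # ensemble of the 5 outputs (assumed mean)
    ens = (ens - ens.min()) / (ens.max() - ens.min() + 1e-8)    # min-max normalization to [0, 1]
    pred_mask = ens > threshold                                 # per-pixel threshold of 0.9
    tgt_mask = target > 0
    tp = np.logical_and(pred_mask, tgt_mask).sum()              # true positives
    fp = np.logical_and(pred_mask, ~tgt_mask).sum()             # false positives
    fn = np.logical_and(~pred_mask, tgt_mask).sum()             # false negatives
    recall = tp / (tp + fn + 1e-8)
    dice = 2 * tp / (2 * tp + fp + fn + 1e-8)
    return recall, dice
```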
Table 1 presents the evaluation results for the ISBI 2016 and PolypDB datasets across the trained models, showcasing their performance in terms of recall and dice scores. Notably, the proposed Lightweight DDIM model demonstrated performance that closely approximated the standard DDIM while requiring significantly fewer computational resources.
For the ISBI 2016 dataset, which features centrally located segmentations, the standard DDIM achieved a dice score exceeding 91% with a recall score surpassing 90%. Lightweight DDIM, while slightly trailing at 88.16% in dice score and 84.50% in recall score, exhibited competitive results. Its ability to balance segmentation accuracy with reduced computational demands marks a significant advancement.
In the PolypDB dataset, which is characterized by diverse and intricate target regions, the standard DDIM achieved superior scores, with dice and recall values of 81.52% and 76.82%, respectively. Lightweight DDIM also performed commendably, recording dice and recall scores of 70.78% and 63.45%. This highlights its capability to handle challenging tasks, despite its simplified architecture.
Overall, the Lightweight DDIM model consistently delivered segmentation results on par with the standard DDIM, proving its practicality for resource-constrained environments. Its performance, as evident in Table 1, underscores its potential as a viable alternative, especially in applications where computational efficiency is critical.
5.3. The Memory Size and Time to Output
We compared the computational costs, file storage requirements, and inference times of the models. The results are shown in Table 2.
While the standard DDIM requires significant GPU memory and computational overhead due to its extensive use of multiple residual blocks and deeper network layers, Lightweight DDIM demonstrates remarkable improvements in resource efficiency. The GPU memory usage of Lightweight DDIM is 88.3 MB, which is 89.9% smaller than the 871.2 MB required by the standard DDIM. Similarly, the file size of Lightweight DDIM is reduced to 84.6 MB, achieving an 80.5% reduction compared to the 433 MB required by DDIM.
In addition to these reductions, Lightweight DDIM significantly optimizes inference time, requiring only 5.80 s to generate an image, which is 29.9% faster than the 8.27 s needed by DDIM. These results demonstrate that Lightweight DDIM not only maintains segmentation performance but also substantially reduces resource demands, making it a more practical choice for applications in resource-constrained medical environments.
6. Discussion
6.1. Highlighting Our Lightweight DDIM Through Experiments
This study focused on evaluating segmentation performance across various types of medical images using models such as the DDPM-based MedSegDiff, DDIM, and the proposed Lightweight DDIM. Our experiments revealed that both DDIM and Lightweight DDIM consistently generated stable segmentations, even when five different samples of random noise were used as input. In contrast, the MedSegDiff model, which is based on a DDPM framework, often produced segmentations with significant variability, especially when tested on the PolypDB dataset.
The instability in the DDPM-based model’s segmentations can be attributed to its reverse process. This process relies on a Markov chain, where each timestep’s prediction influences the next. Moreover, noise is deliberately added at every step to introduce diversity. While this design enables stochastic outputs, it also leads to inconsistent and irregular segmentations, which are less desirable for medical imaging tasks.
On the other hand, DDIM-based models employ a non-Markovian reverse process, where noise is only used during the initial input phase and does not affect subsequent predictions. This deterministic approach allows the reverse process to consistently aim at reconstructing $x_0$, guided by the direction of the predicted noise $\epsilon_\theta(x_t, c, t)$.
As a result, DDIM-based models can generate more stable and shape-consistent segmentations, aligning closely with ground truth targets.
In medical imaging, where precision and consistency are paramount, models that produce reliable and reproducible segmentations are preferred. Therefore, DDIM-based models are inherently more suited for medical segmentation tasks. Furthermore, our proposed Lightweight DDIM not only inherits these advantages but also significantly reduces file size and computational costs compared to the standard DDIM. This efficiency makes it an ideal solution for deployment in resource-constrained environments, such as rural clinics or portable medical devices.
6.2. Limitations of Our Lightweight DDIM
Despite the promising results, our proposed Lightweight DDIM has some limitations. For instance, in certain cases it produces less stable segmentation outputs than the standard DDIM, particularly when tested on the PolypDB dataset. Figure 5 shows cases in which the predictions obtained using our Lightweight DDIM deviated from the ground truth.
The instability observed with PolypDB may be attributed to the dataset’s inherent complexity. Unlike RSUA and ISBI 2016, where consistent features such as uniform lung shapes or centrally aligned melanoma regions exist, PolypDB includes diverse disease region shapes, positions, and imaging conditions. Some images exhibit dark backgrounds, while others are affected by overexposure due to camera flash, leading to inconsistent contextual information. This high variability poses challenges for segmentation models.
Additionally, the use of grouped convolutional neural networks (Grouped CNNs) in Lightweight DDIM may contribute to these issues. Grouped CNNs divide input feature maps into smaller groups for independent convolutional operations. While this approach reduces computational costs, it may fail to effectively capture the global spatial information necessary for accurately segmenting complex regions.
Moreover, the current self-attention mechanism in our Lightweight DDIM focuses solely on position-based attention. Incorporating a dual-attention mechanism could enhance the model’s ability to capture intricate relationships across both spatial and channel dimensions [30]. This improvement might address the segmentation challenges posed by datasets with high variability, such as PolypDB.
7. Conclusions
In this study, we demonstrated the significant potential of our Lightweight DDIM model in extracting diseased regions through the segmentation of both external and internal medical images. By introducing a lightweight architecture that employs grouped convolutional layers and a simplified residual structure, the proposed Lightweight DDIM achieved comparable segmentation accuracy to the standard DDIM while significantly reducing computational costs and file sizes. Notably, our Lightweight DDIM produces consistent and reliable segmentations across datasets, even under resource-constrained conditions, highlighting its adaptability for real-world medical applications.
The experimental results showcase the superior performance of our model in terms of dice and recall scores when compared to other state-of-the-art segmentation models, such as U-Net and MedSegDiff. Additionally, the model’s ability to maintain segmentation stability, even with noise variations during the reverse process, underscores its robustness and suitability for diverse medical imaging scenarios. The substantial reduction in GPU memory usage and inference time achieved by the Lightweight DDIM further emphasizes its practicality for deployment in environments with limited computational resources.
However, the reliance on paired datasets consisting of raw medical images and their corresponding segmentation labels presents a notable limitation. The process of collecting such datasets is both time-intensive and expertise-driven, often requiring manual annotations by medical professionals. Addressing this challenge is critical for expanding the applicability of Lightweight DDIM. Developing an architecture capable of learning from unpaired datasets would represent a significant advancement, reducing dependency on paired data while enhancing training efficiency and flexibility.
Future research will focus on adapting the Lightweight DDIM for unpaired segmentation tasks by leveraging techniques from unpaired image-to-image translation. CycleGAN, which employs adversarial and cycle consistency losses to retain target domain features while ensuring fidelity between source and reconstructed images, provides an inspiring foundation. Similarly, UNIT-DDPM, which integrates diffusion probabilistic models with consistency loss for handling random noise, offers promising insights for unpaired training [31]. Building upon these approaches, we aim to extend Lightweight DDIM to learn effectively from unpaired datasets, enabling its application to a broader range of medical imaging tasks.
Furthermore, improving the model’s robustness to blurry or low-resolution input images is crucial for ensuring reliable segmentation performance in diverse clinical scenarios. Medical images often suffer from degradation due to factors such as motion artifacts, imaging noise, and limited hardware resolution, which can significantly impact segmentation accuracy. To address this, previous research has explored diffusion-based image restoration techniques, such as Real-SRGD (real-world image super-resolution with classifier-free guided diffusion) [32], which employs classifier-free guidance to reduce image degradation, effectively restoring high-frequency details in low-quality images. By integrating a pre-trained Real-SRGD model as a preprocessing step before segmentation, we can enhance input image quality, thereby improving the accuracy and consistency of our Lightweight DDIM model. Future work will explore the optimal integration of Real-SRGD with our segmentation framework, assessing its impact on both computational efficiency and segmentation performance.
Additionally, enhancing the model’s robustness against noise and adversarial attacks is essential for ensuring reliability in real-world medical applications. Adversarial examples, which introduce small but carefully crafted perturbations to input images, can significantly degrade the performance of deep learning models by leading them to incorrect predictions [33]. In medical imaging, such perturbations could result in critical misdiagnoses, making it imperative to develop defense mechanisms against them. Future improvements will focus on integrating adversarial training techniques and noise-resistant architectures, such as denoising diffusion models, to strengthen Lightweight DDIM’s resistance to perturbations. Additionally, exploring self-supervised learning and contrastive learning approaches could further enhance the model’s ability to generalize under noisy and adversarial conditions.
By addressing the current limitations and incorporating advanced learning strategies, Lightweight DDIM has the potential to become a transformative tool in medical image segmentation. Its lightweight and efficient design positions it as a viable solution for widespread use, even in settings with limited computational resources, while paving the way for more accessible and scalable medical imaging solutions in the future.