Article

SALM: A Unified Model for 2D and 3D Region of Interest Segmentation in Lung CT Scans Using Vision Transformers

by Hadrien T. Gayap * and Moulay A. Akhloufi
Perception, Robotics, and Intelligent Machines (PRIME), Department of Computer Science, Université de Moncton, Moncton, NB E1A 3E9, Canada
*
Author to whom correspondence should be addressed.
Appl. Biosci. 2025, 4(1), 11; https://doi.org/10.3390/applbiosci4010011
Submission received: 13 January 2025 / Revised: 11 February 2025 / Accepted: 13 February 2025 / Published: 17 February 2025

Abstract

Accurate segmentation of Regions of Interest (ROI) in lung Computed Tomography (CT) is crucial for early lung cancer diagnosis and treatment planning. However, the variability in size, shape, and location of lung lesions, along with the complexity of 3D spatial relationships, poses significant challenges. In this work, we propose SALM (Segment Anything in Lung Model), a deep learning model for 2D and 3D ROI segmentation. SALM leverages Vision Transformers and adapts their positional encoding functions to effectively capture spatial relationships in both 2D slices and 3D volumes using a single, unified model. Evaluation on the LUNA16 dataset demonstrated strong performance in both modalities. In 2D segmentation, SALM achieved a Dice score of 93% on 124,662 slices. For 3D segmentation using 174 3D images from the same dataset, SALM attained a Dice score of 81.88%. We also tested SALM on the external PleThora database, on a subset of 255 pulmonary CT scans from diseased patients, where it achieved a Dice score of 78.82%. These results highlight SALM’s ability to accurately segment lung ROI in both 2D and 3D, demonstrating its potential to improve the accuracy and efficiency of computer-aided diagnosis for lung cancer.

1. Introduction

Lung cancer remains a leading cause of cancer-related mortality worldwide [1]. Early detection and accurate diagnosis are crucial for improving patient outcomes. CT imaging plays a vital role in the diagnosis and treatment planning of lung cancer, providing detailed cross-sectional images of the lungs. Within these images, the precise segmentation of ROI, such as lung lesions and surrounding anatomical structures, is essential for accurate disease assessment, staging, and treatment planning [2]. However, manual segmentation of ROI in lung CT images is a time-consuming and labor-intensive process, prone to inter-observer variability.
The development of automated and reliable methods for lung CT image segmentation has thus become a critical area of research. While traditional image processing techniques have been applied, they often face limitations in handling the inherent complexities of lung CT images [3]. These complexities include variations in lesion size, shape, density, and location, as well as the presence of similar-appearing structures like blood vessels. Furthermore, segmenting 3D CT volumes adds another layer of complexity due to the need to capture complex spatial relationships across multiple slices.
Recently, deep learning technologies have emerged as powerful tools for medical image analysis, offering promising solutions for automated lung CT segmentation [4,5,6]. Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in various image recognition tasks, including medical image segmentation [7,8]. However, CNNs are limited in their ability to capture long-range dependencies, which are crucial for understanding the global context within medical images [9].
The emergence of Transformers [10], originally developed for natural language processing, has revolutionized the field of computer vision. Transformers leverage self-attention mechanisms to model relationships between different parts of an input sequence, enabling them to effectively capture long-range dependencies. This capability makes them particularly well suited for analyzing complex medical images, where understanding the spatial relationships between different anatomical structures is essential.
Furthermore, the use of Foundation Models (FMs) [11], which are large-scale models pretrained on massive datasets, has further enhanced the performance of Vision Transformers. These FMs learn rich visual representations from diverse data, providing a strong starting point for fine-tuning on specific tasks, such as lung CT segmentation.
In this study, we present SALM (Segment Anything in Lung Model), a deep learning model for both 2D and 3D ROI segmentation in lung CT images. SALM is built upon the Vision Transformer architecture and incorporates a unique adaptation of positional encoding using weights of the FM Segment Anything Model (SAM) from Meta AI [12]. This adaptation, inspired by the need to capture spatial relationships at multiple scales within lung CT volumes, involves modifying the periodicity of the standard sinusoidal functions used in positional encoding. This allows SALM to effectively integrate information across different spatial scales, improving its ability to delineate ROI accurately. Another major contribution of our work lies in our approach to managing the size variability of lung CT images. Our SALM incorporates advanced image processing techniques to standardize image scale and resolution during preprocessing, enabling the accurate segmentation of CT images, irrespective of their original size. This standardization allows the model to maintain high accuracy even on smaller original images by ensuring a standardized and optimal size input for the network.
Our model is refined by incorporating elements from MedSAM [13], specifically tailored for medical image segmentation. We leverage a unified architecture capable of processing both 2D slices and 3D volumes without requiring separate training, making SALM a versatile tool for comprehensive lung image analysis.
We evaluate SALM’s performance on the publicly available LUNA16 dataset [14], a benchmark dataset for lung nodule analysis. Our experiments demonstrate that SALM achieves strong performance in both 2D and 3D segmentation tasks. In 2D segmentation, SALM achieves an Area under the Curve (AUC) of 99% and a Dice Similarity Coefficient (DSC) of 93% on 124,662 slices. In 3D segmentation, evaluated on 174 3D images, SALM attains an F1-score of 75.57% and a Dice score of 81.88%. We also tested SALM on a subsample of 255 pulmonary images from diseased individuals contained in the PleThora dataset [15,16], where it achieved a Dice score of 78.82% and an accuracy of 97.64%. These results underline SALM’s ability to accurately segment ROI in lung CT images, highlighting its potential to enhance the accuracy and efficiency of computer-aided diagnosis for lung cancer.
The main contributions of this work are as follows:
  • Development of a Unified 2D and 3D Segmentation Model: We propose SALM, a unified Vision Transformer-based model capable of performing both 2D and 3D ROI segmentation in lung CT images. This unified architecture simplifies the segmentation process and demonstrates the versatility of Vision Transformers in handling different data modalities within a single framework.
  • Novel Adaptation of Positional Encoding for Enhanced Spatial Context Awareness: We demonstrate that modifying the periodicity of the standard sinusoidal functions used in positional encoding significantly enhances the model’s ability to capture multi-scale spatial relationships, particularly within 3D volumetric data. By introducing a 3π modulation factor, we effectively increase the spatial frequency of the encoding. This allows the model to integrate information across different scales, leading to improved segmentation accuracy, especially in complex structures like those found in lung CT images. This finding provides insights into optimizing Transformer architectures for volumetric medical image analysis.
  • Adaptation to Lung Image Size Variability: Another major contribution of our work lies in our approach to managing the size variability of lung CT images. Our SALM model incorporates advanced image processing techniques to standardize image scale and resolution during preprocessing, enabling accurate segmentation of CT images, irrespective of their original size. This standardization not only ensures high accuracy in the detection and segmentation of ROI of varying sizes, but also helps maintain optimal model performance on a wide range of lung imaging data. By directly tackling the challenge posed by the diversity of lung CT images, we are helping to make lung image segmentation models efficient and reliable, without compromising performance even for smaller original images.
The remainder of this paper is organized as follows: Section 2 provides an overview of related work in medical image segmentation, with a focus on lung CT analysis. Section 3 details the proposed SALM model, including the architecture, the modified positional encoding, and the training methodology. Section 4 presents the experimental results, both quantitative and qualitative, and discusses the performance of SALM. Finally, Section 5 concludes the paper and outlines directions for future research. Additionally, Appendix A provides a detailed analysis of the carbon footprint associated with the development and training of the SALM model, reflecting our commitment to responsible and sustainable AI research.

2. Related Works

The field of medical image segmentation has witnessed significant advancements, driven by the increasing availability of medical imaging data and the development of powerful computational techniques [17]. In particular, the segmentation of lung CT images has received considerable attention due to its crucial role in the diagnosis, treatment planning, and monitoring of lung diseases, especially lung cancer.
Early approaches to lung CT segmentation relied on traditional image processing techniques, such as thresholding, region growing, and active contour models [18]. While these methods can achieve reasonable results in simple cases, they demonstrate limitations in handling the complex and heterogeneous nature of lung pathologies, as well as variations in image quality and acquisition parameters. For instance, Bellotti et al. [18] proposed a system based on region growing and active contour models for lung nodule detection, achieving a detection rate of 88.5% with 6.6 false positives per CT scan on a limited dataset.
More recently, supervised methods, such as the one proposed by Li et al. [19], have utilized geometric active contours for semi-3D segmentation of lung tissue. These methods have shown improvements in accuracy and speed, particularly for segmenting pleural nodules. However, they often require manual initialization or interaction, which can limit their applicability in large-scale studies.
The introduction of deep learning has revolutionized the field of medical image segmentation, with CNNs demonstrating significant success in a wide range of applications [17,20,21]. In the context of lung CT segmentation, CNN-based methods have been extensively explored for tasks such as lung field segmentation, nodule detection, and tumor segmentation. Wu et al. [22] proposed a 3D-UNET optimized by a three-dimensional conditional random field (3D-CRF) for pulmonary nodule segmentation, achieving a Dice score of 80.1% on the LIDC-IDRI dataset. Zhao et al. [23] developed a deep learning system based on 3D CNN and multitask learning for lung nodule classification and segmentation, outperforming radiologists in terms of F1-score.
Several studies have focused on specific challenges in lung CT segmentation, such as handling variations in nodule size and shape. For example, Wang et al. [24] developed a 3D CNN with a stratified training strategy for automated lung tumor segmentation in radiotherapy planning, achieving a Dice Similarity Coefficient of 83% for large tumor volumes. Riaz et al. [25] addressed the segmentation of cancerous lesions by merging MobileNetV2 and UNET architectures, demonstrating improved performance on the Medical Segmentation Decathlon (MSD) dataset.
Despite their success, CNN-based methods have inherent limitations in capturing long-range dependencies within images, which can be crucial for accurately segmenting complex structures in lung CT scans. This limitation has motivated the exploration of alternative architectures, such as Transformers [10], which have shown great promise in natural language processing and, more recently, in computer vision tasks [26].
Transformers leverage self-attention mechanisms to model relationships between different parts of an input sequence, enabling them to effectively capture global context. This capability has led to their increasing adoption in medical image analysis, including lung CT segmentation. Several recent studies have explored the use of Transformers for lung nodule detection, classification, and segmentation [27,28,29,30]. However, the application of Transformers to 3D medical image segmentation, particularly for lung CT scans, remains an active area of research.
Our work builds upon these recent advances and proposes a Transformer-based model, SALM, for both 2D and 3D ROI segmentation in lung CT images. SALM leverages a unique adaptation of positional encoding, specifically designed to capture multi-scale spatial relationships within lung CT volumes. This adaptation, which involves modifying the periodicity of the standard sinusoidal functions used in positional encoding, enables SALM to effectively integrate information across different spatial scales. Furthermore, SALM utilizes a unified architecture capable of processing both 2D slices and 3D volumes, making it a versatile tool for comprehensive lung image analysis. Table 1 summarizes these related works, presenting the objectives, databases used, and results obtained.

3. Methods

In this section, we detail the proposed SALM model for 2D and 3D ROI segmentation in lung CT images. We describe the dataset used; the preprocessing steps; the model architecture, with a particular emphasis on the novel adaptation of positional encoding; and the training procedure.

3.1. Dataset and Preprocessing

The LUNA16 (Lung Nodule Analysis 2016) dataset [14] was used for training and evaluating our model. This publicly available dataset is a subset of the larger LIDC-IDRI database and contains 888 thoracic CT scans with accompanying annotations for lung nodules. The scans are provided in the MetaImage format (.mhd/.raw) and have a typical resolution of 512 × 512 pixels per slice (Figure 1), with varying slice thicknesses ranging from 0.6 to 3 mm.
To prepare the data for our model, we performed several preprocessing steps. First, since LUNA16 focuses on lung imaging, we normalized the Hounsfield units (HU) to standardized window width and level values (W: 1500, L: −160) specifically for the lung, as recommended in [42]. This important step homogenizes pixel intensity values and highlights the anatomical structures of interest in the lung. Following this, the intensity values were scaled to a range of [0, 255].
In order to standardize the image size and resolution and ensure consistent input for the model, we applied the following technical steps:
  • Isotropic resampling (for 3D images): We first resampled each CT volume to an isotropic voxel spacing of 1 × 1 × 1 mm using linear interpolation. This step, performed using linear interpolation functions available in standard imaging libraries, ensures that spatial relationships between voxels are consistent across all dimensions, regardless of the original slice spacing.
  • Slice-by-slice processing: Each 3D volume was then processed slice by slice. This approach allows us to apply SALM, designed for 2D inputs, to 3D volumetric data. While the Vision Transformer processes each slice independently, the modified 3D positional encoding, described in detail in Section 3.2.2, allows for the integration of 3D spatial context.
  • Resizing to 1024 × 1024: Each slice, whether from an original 2D image or extracted from a 3D volume, was resized to a standard size of 1024 × 1024 pixels using linear interpolation. This resizing operation ensures that the model receives input images of uniform dimensions, which is crucial for the performance of the Vision Transformer. The choice of the 1024 × 1024 size represents a compromise between preserving anatomical details and memory and computational constraints.
  • Channel repetition (for 2D images): For original 2D images and slices extracted from 3D volumes, we repeated the intensity channel three times to create a 3-channel input image, compatible with the ViT architecture and consistent with the processing of color images, even though CT images are grayscale.
While each slice is processed as a 2D image by the Vision Transformer, the 3D positional encoding provides crucial context about the slice’s location within the entire 3D volume. This combined approach allows SALM to efficiently process 3D data while leveraging the power of 2D Vision Transformers.
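For illustration, a minimal sketch of this preprocessing pipeline is shown below. The specific libraries (SimpleITK, OpenCV, NumPy) and helper names are our assumptions; the text above only states that linear interpolation functions from standard imaging libraries were used.

```python
import numpy as np
import SimpleITK as sitk
import cv2

def window_and_scale(volume_hu, level=-160.0, width=1500.0):
    """Clip Hounsfield units to the lung window (W: 1500, L: -160) and scale to [0, 255]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    vol = np.clip(volume_hu, lo, hi)
    return (vol - lo) / (hi - lo) * 255.0

def resample_isotropic(image, spacing=(1.0, 1.0, 1.0)):
    """Resample a CT volume to 1 x 1 x 1 mm voxels using linear interpolation."""
    orig_spacing, orig_size = image.GetSpacing(), image.GetSize()
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(orig_size, orig_spacing, spacing)]
    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing(spacing)
    resampler.SetSize(new_size)
    resampler.SetOutputOrigin(image.GetOrigin())
    resampler.SetOutputDirection(image.GetDirection())
    resampler.SetInterpolator(sitk.sitkLinear)
    return resampler.Execute(image)

def preprocess_volume(mhd_path):
    """Yield (slice index, 1024 x 1024 x 3 slice) pairs ready for the image encoder."""
    volume = resample_isotropic(sitk.ReadImage(mhd_path))
    array = window_and_scale(sitk.GetArrayFromImage(volume))        # shape (z, y, x)
    for z, slc in enumerate(array):
        slc = cv2.resize(slc, (1024, 1024), interpolation=cv2.INTER_LINEAR)
        yield z, np.stack([slc, slc, slc], axis=-1)                 # repeat the channel 3 times
```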

3.2. Model Architecture

SALM is based on the Vision Transformer (ViT) architecture [26], which has demonstrated remarkable performance in various computer vision tasks. We adopt the base ViT model (ViT-B) due to its balance between computational efficiency and accuracy. Larger ViT variants (e.g., ViT-L, ViT-H) offer marginal improvements in accuracy at the cost of significantly increased computational requirements [43,44], making them less suitable for our task, where processing large 3D volumes is essential.
The SALM architecture consists of three main components: an image encoder, a 3D positional encoder, and a mask decoder. Figure 2 illustrates the overall architecture of SALM.

3.2.1. Image Encoder

The image encoder is responsible for extracting visual features from the input CT slices. It follows the standard ViT architecture, which processes the input image as a sequence of patches. Given an input slice of size 1024 × 1024 × 3, we divide it into a sequence of non-overlapping patches of size 16 × 16 × 3. Each patch is then linearly projected into a fixed-dimensional embedding vector. These patch embeddings, along with the positional encodings, are fed into a series of Transformer layers. Each Transformer layer consists of a Multi-head Self-Attention (MSA) module and a Feedforward Network (FFN), with layer normalization applied before each module and residual connections applied after each module. The image encoder produces a feature map that captures both local and global visual information from the input slice (Figure 3).
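As an illustration of this patch-embedding step, the sketch below uses the dimensions stated above (1024 × 1024 × 3 input, 16 × 16 patches) together with the standard 768-dimensional embedding of ViT-B; the strided-convolution implementation is the usual way to realize the linear patch projection and is our choice for illustration, not a detail confirmed here.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a 1024 x 1024 x 3 slice into 16 x 16 patches and project each to an embedding vector."""
    def __init__(self, img_size=1024, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # 64 * 64 = 4096 patches
        # A convolution with stride equal to the patch size implements the linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                           # x: (B, 3, 1024, 1024)
        x = self.proj(x)                                            # (B, embed_dim, 64, 64)
        return x.flatten(2).transpose(1, 2)                         # (B, 4096, embed_dim)
```

The positional encodings described in Section 3.2.2 are added to this sequence of patch embeddings before it enters the Transformer layers.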

3.2.2. 3D Positional Encoding

To enable SALM to effectively process 3D volumes, we propose a novel adaptation of the standard positional encoding used in Transformers. While the image encoder processes each slice independently, the 3D positional encoding provides crucial context about the slice’s location within the 3D volume, allowing the model to learn spatial relationships across slices.
Standard positional encodings in Transformers use sinusoidal functions to generate unique positional vectors for each element in a sequence. These encodings are typically defined as
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d}}\right)$$
where $pos$ is the position of the element in the sequence, $i$ is the dimension index, and $d$ is the dimensionality of the embedding.
In our adaptation, we modify the periodicity of these functions by introducing a factor of 3π, as follows:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos \cdot 3\pi}{10000^{\,2i/d}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos \cdot 3\pi}{10000^{\,2i/d}}\right)$$
This modification effectively increases the spatial frequency of the encoding, allowing it to capture finer-grained positional information (Figure 4). The 3π factor was empirically determined to be optimal for our task through experimentation with different values, which showed that it provided the best balance between capturing detailed spatial relationships and avoiding overfitting to specific positional patterns.
The choice of 3π is thus primarily empirical: across a series of experiments, this value yielded the best segmentation performance on the LUNA16 dataset, and the higher spatial frequency of the encoding appears to promote the capture of finer details and multi-scale spatial relationships specific to the complex structures of lung CT images.
For each 3D volume, we generate a unique 3D positional encoding for each slice based on its spatial coordinates (x, y, z) within the volume. The x and y coordinates correspond to the patch position within the slice, while the z coordinate represents the slice index. These 3D positional encodings are added to the corresponding patch embeddings before being fed into the Transformer layers, providing the model with explicit information about the spatial location of each patch within the 3D volume.
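A minimal sketch of the modified sinusoidal encoding is given below; the same function can be applied to the patch coordinates x and y and to the slice index z, although the exact way the three encodings are combined with the patch embeddings is a simplification for illustration rather than a detail specified above.

```python
import math
import torch

def sinusoidal_encoding(positions, dim, factor=3 * math.pi):
    """Sinusoidal positional encoding with its periodicity modulated by a 3*pi factor.

    positions: 1-D tensor of positions (patch x or y index, or slice index z).
    Returns a (len(positions), dim) tensor; factor=1.0 recovers the standard encoding.
    """
    positions = positions.float().unsqueeze(1)                  # (N, 1)
    i = torch.arange(0, dim, 2).float()                         # even dimension indices 2i
    denom = torch.pow(10_000.0, i / dim)                        # 10000^(2i/d)
    angles = positions * factor / denom                         # (N, dim/2)
    pe = torch.zeros(positions.shape[0], dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Example: encode the slice index z for every slice of a 300-slice volume.
z_encoding = sinusoidal_encoding(torch.arange(300), dim=768)
```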

3.2.3. Mask Decoder

The mask decoder is responsible for generating the final segmentation mask from the encoded image features and positional information. It consists of a series of Transformer decoder layers, followed by a simple upsampling module. Each Transformer decoder layer takes as input the output of the image encoder and the positional encodings, and uses multi-head self-attention and cross-attention mechanisms to integrate information from both the image features and the positional context.
The upsampling module progressively increases the spatial resolution of the feature maps using transposed convolutions until it matches the original input resolution (1024 × 1024). Finally, a sigmoid activation function is applied to produce the segmentation mask, where each pixel value represents the probability of that pixel belonging to the ROI. Figure 5 shows the architecture of the proposed mask decoder.
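The sketch below illustrates such an upsampling head; the number of stages, channel widths, and activation function are illustrative choices rather than the exact decoder configuration used in SALM.

```python
import torch
import torch.nn as nn

class UpsamplingHead(nn.Module):
    """Upsample decoder features back to 1024 x 1024 and predict a probability mask."""
    def __init__(self, in_chans=256):
        super().__init__()
        layers, c = [], in_chans
        # Features for 16 x 16 patches live on a 64 x 64 grid; four 2x upsamplings reach 1024.
        for _ in range(4):
            layers += [nn.ConvTranspose2d(c, c // 2, kernel_size=2, stride=2), nn.GELU()]
            c //= 2
        layers.append(nn.Conv2d(c, 1, kernel_size=1))            # one output channel (ROI probability)
        self.up = nn.Sequential(*layers)

    def forward(self, feats):                                    # feats: (B, in_chans, 64, 64)
        return torch.sigmoid(self.up(feats))                     # (B, 1, 1024, 1024) probabilities
```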

3.3. Training Procedure

We trained SALM using a combination of binary cross-entropy (BCE) loss and Dice loss. The BCE loss measures the pixel-wise difference between the predicted segmentation mask and the ground truth mask, while the Dice loss measures the overlap between the two. The total loss is a weighted sum of the two losses, as follows:
Loss = α · BCE Loss + β · Dice Loss
where α and β are hyperparameters that control the relative importance of the two losses. In our experiments, we set α = 0.5 and β = 0.5, giving equal weight to both losses.
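For concreteness, the following is a minimal sketch of this combined objective using a standard soft-Dice formulation (one possible variant among several):

```python
import torch
import torch.nn.functional as F

def combined_loss(probs, target, alpha=0.5, beta=0.5, eps=1e-6):
    """Weighted sum of binary cross-entropy and Dice loss (alpha = beta = 0.5).

    probs: predicted probabilities in [0, 1]; target: binary ground-truth mask (float), same shape.
    """
    bce = F.binary_cross_entropy(probs, target)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    dice = (2.0 * intersection + eps) / (probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)
    return alpha * bce + beta * (1.0 - dice).mean()
```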
We used the Adam optimizer [45] with a learning rate of 1 × 10⁻⁴ and a batch size of 8 for 2D segmentation and 1 for 3D segmentation due to the larger memory requirements of processing 3D volumes. The model was trained for 100 epochs, and the best-performing model on the validation set was selected for evaluation on the test set.
To augment the training data, we applied random horizontal and vertical flips, random rotations (−20 to +20 degrees), and random scaling (0.8 to 1.2) to the input images and corresponding masks during training.
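A minimal sketch of this augmentation policy using torchvision is shown below; the same random parameters are applied to the image and its mask, as stated above, while the specific function calls are an illustrative choice.

```python
import random
import torchvision.transforms.functional as TF

def augment(image, mask):
    """Apply the same random flips, rotation (-20 to +20 deg), and scaling (0.8 to 1.2) to image and mask."""
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    angle = random.uniform(-20.0, 20.0)
    scale = random.uniform(0.8, 1.2)
    # Default nearest-neighbour resampling keeps the mask binary after the affine transform.
    image = TF.affine(image, angle=angle, translate=[0, 0], scale=scale, shear=[0.0])
    mask = TF.affine(mask, angle=angle, translate=[0, 0], scale=scale, shear=[0.0])
    return image, mask
```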
The model was implemented using the PyTorch 2.0 framework [46] and trained on a single NVIDIA A100 GPU [47].

4. Results and Discussion

4.1. Quantitative Results

We evaluated SALM’s performance on both 2D and 3D segmentation tasks using the LUNA16 dataset. For 2D segmentation, we report the accuracy, the Area under the ROC Curve (AUC), and the Dice Similarity Coefficient (DSC) on the test set. For 3D segmentation, we report the F1-score and DSC, following the evaluation protocol commonly used in previous studies [12,48,49,50,51].
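For reference, the per-case DSC can be computed as in the short sketch below (the 0.5 binarization threshold is an assumption made for illustration):

```python
import numpy as np

def dice_coefficient(pred_probs, target, threshold=0.5, eps=1e-6):
    """Dice Similarity Coefficient between a thresholded prediction and a binary ground-truth mask."""
    pred = (pred_probs >= threshold).astype(np.float64)
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```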
Table 2 shows the results of SALM on the 2D segmentation task compared with several state-of-the-art methods. SALM achieves an accuracy of 99%, an AUC of 99%, and a DSC of 93%, demonstrating its strong performance in accurately segmenting ROI in 2D slices.
Table 3 shows the results of SALM on the 3D segmentation task compared with several state-of-the-art methods. SALM achieves an F1-score of 75.57% and a DSC of 81.88% on the test set of 174 3D images (124,662 slices). These results demonstrate SALM’s ability to effectively leverage 3D spatial information for accurate ROI segmentation in lung CT volumes. Notably, SALM outperforms the original SAM [12] evaluated on our 3D test set, highlighting the importance of our modified positional encoding for capturing 3D spatial relationships.
To further evaluate the effectiveness of our modified positional encoding, we conducted an ablation study comparing the performances of SALM with and without the 3π modification. Table 4 shows the results of this study on the 3D segmentation task. We observe that the modified positional encoding leads to a significant improvement in both F1-score and DSC, demonstrating its importance for capturing multi-scale spatial relationships in 3D volumes.

4.2. Qualitative Results

In addition to the quantitative results, we also present qualitative results to visually demonstrate the effectiveness of SALM in segmenting ROI in lung CT images. Figure 6 and Figure 7 show example segmentation results of SALM applied to 2D slices from the LUNA16 test set. These results demonstrate that SALM is able to accurately segment ROI of varying sizes and shapes.
However, we also observed some cases where SALM produced less accurate segmentations, particularly for very small ROI or ROI with irregular shapes (Figure 8).

4.3. External Validation on PleThora Dataset

To further assess the generalizability of SALM, we evaluated the model on 245 DICOM CT scans of the publicly available PleThora dataset [15,16]. PleThora (Pleural Effusion and Thoracic Organ Segmentation) is a completely external dataset comprising diseased lung CT scans, specifically designed for benchmarking chest CT processing pipelines and focusing on the challenging task of segmenting pleural effusions and thoracic organs in pathological cases. This dataset provides a valuable independent validation of SALM’s performance on data beyond the LUNA16 dataset used for training and initial evaluation.
The inference was performed on a single NVIDIA GeForce GTX 1080 Ti GPU [55] with 12 GB of memory, at an average processing speed of approximately 1.68 images per second. To further assess the model’s efficiency across different hardware, we also conducted inference on an Apple MacBook Pro (2024) equipped with an M3 chip and 16 GB of memory [56], achieving an average processing speed of approximately 1.92 images per second. A detailed comparison of processing times across these platforms is presented in Table 5. The quantitative results on the PleThora dataset are summarized in Table 6.
The results on the PleThora dataset demonstrate robust segmentation performance, achieving a mean Dice score of 78.82% and a high accuracy of 97.64%. An AUC-ROC of 85.92% further indicates good discriminatory power of the model’s segmentation. While the Dice score is slightly lower compared with the results on the LUNA16 dataset (81.88% in 3D segmentation), it is important to consider that PleThora presents a more challenging scenario due to the presence of diseased lungs and pleural effusions, which introduce greater anatomical variability and complexity (Figure 9). Furthermore, a Dice score of 78.82% on a completely external and challenging dataset represents a strong and clinically relevant performance, demonstrating SALM’s ability to generalize beyond the training data and maintain good accuracy in more complex, pathological lung CT scans. Additionally, Figure 10 highlights how SALM, while processing 3D volumes slice by slice, effectively segments lung ROI in a 2D viewing format, which is typical in clinical practice for lung CT analysis. The 2D slice visualization allows clinicians to readily interpret the segmentation results within their familiar diagnostic workflow, demonstrating the practical applicability of SALM for 3D lung CT data when viewed and analyzed in standard 2D slice-based displays.
To further illustrate the impact of our modified positional encoding, Figure 11 presents a qualitative comparison of SALM’s segmentation performance on a PleThora dataset slice, with and without the 3π modification. The top row of Figure 11 shows the segmentation using the standard positional encoding, while the bottom row shows the result with our modified positional encoding (SALM). As we can observe in the “Prediction Errors” panels, the model using the standard positional encoding exhibits a significantly larger overpredicted area (shown in red), indicating a tendency to falsely identify regions outside the ground truth ROI. In contrast, SALM, with the modified positional encoding, substantially reduces this overprediction, achieving a more precise segmentation that aligns more closely with the ground truth mask. While SALM still shows some limitations in segmenting the diseased portion of the lung in this challenging PleThora example, as indicated by the missed areas (blue), the visual comparison clearly demonstrates the effectiveness of our modified positional encoding in improving segmentation precision and reducing noise-induced false-positive detections, particularly in complex pathological lung tissue. This qualitative finding complements the quantitative results of our ablation study (Table 4).

4.4. Discussion

The quantitative and qualitative results demonstrate that SALM is a promising model for both 2D and 3D ROI segmentation in lung CT images. The model’s good performance can be attributed to several factors, including the use of the Vision Transformer architecture, FM weights, the proposed adaptation of positional encoding, and the effective training procedure.
The ViT architecture enables SALM to capture long-range dependencies within the input images, which is crucial for understanding the global context of lung CT scans. The modified positional encoding further enhances the model’s ability to capture multi-scale spatial relationships, particularly in 3D volumes, by providing explicit information about the location of each slice within the volume.
The ablation study demonstrates the significant impact of our modified positional encoding on the model’s performance. By increasing the spatial frequency of the encoding, we enable the model to capture finer-grained positional information, leading to improved segmentation accuracy.
The training procedure, which combines BCE loss and Dice loss, effectively balances the need for pixel-wise accuracy and overall segmentation quality. Data augmentation techniques further improve the model’s robustness and generalization ability.
As shown in Figure 9, SALM demonstrates a degree of robustness in segmenting diseased lungs, successfully capturing the overall lung regions despite the presence of significant anatomical variability and heterogeneous tissue densities. However, the segmentations on PleThora also exhibit increased roughness and less smooth boundaries, particularly in areas affected by disease. This suggests that while SALM generalizes reasonably well to pathological cases, the segmentation accuracy is impacted by the increased complexity and heterogeneity of diseased lung tissue compared with the relatively cleaner LUNA16 dataset. The model tends to include diseased areas within the segmented lung regions, which, depending on the clinical application, might be considered either appropriate or requiring further refinement to specifically delineate only healthy lung parenchyma. These qualitative observations on PleThora are consistent with the slightly lower Dice scores obtained on this dataset compared with LUNA16, further emphasizing the challenges of lung segmentation in pathological contexts and highlighting areas for future improvement in model robustness and refinement.
Despite its good performance, SALM has some limitations. The model’s accuracy can be affected by the quality of the input images, particularly in cases of low contrast or high noise. Additionally, the model may struggle with very small ROI or ROI with highly irregular shapes, as observed in some of the qualitative results. The overpredicted areas (false positives) can be attributed to the following specific technical and anatomical factors:
  • Tissue Density Similarity: Anatomical structures with attenuation coefficients (Hounsfield units) similar to the targeted ROI can induce classification errors. This densitometric ambiguity is particularly pronounced in transition zones between soft tissues and vascular structures. For example, blood vessels, scar tissue, and areas of fibrosis can exhibit Hounsfield values that overlap with those of lung nodules, especially part-solid or ground-glass nodules. This overlap makes it challenging for the model to solely rely on intensity information for accurate differentiation, leading to false-positive detections in regions with dense vasculature or fibrotic changes.
  • Morphological Characteristics: The presence of anatomical structures with geometric characteristics similar to irregular or hidden ROI can generate segmentation errors. The model may incorrectly interpret these formations as potential ROI. In advanced lung disease, ROI such as tumors or affected lung parenchyma can exhibit highly irregular shapes and may be partially obscured or hidden by surrounding diseased tissue, pleural effusions, or consolidation, as frequently observed in the PleThora dataset. SALM, trained primarily on datasets with relatively well-defined ROI like LUNA16, may struggle to accurately delineate these highly irregular and poorly defined ROI in advanced pathological cases. The model’s reliance on shape features learned from less complex examples might lead to under-segmentation or inaccurate boundary delineation when confronted with the complex and atypical morphologies characteristic of advanced lung pathologies. This is further compounded by the fact that, in diseased lungs, the contrast between the ROI and the surrounding abnormal tissue may be reduced, making morphological differentiation even more challenging for the model.
  • Positional Encoding Limitations: Overpredictions can result from an imperfect capture of three-dimensional spatial relationships, despite the use of modified positional encoding. Peripheral areas of anatomical structures, where intensity gradients are less pronounced, are particularly susceptible to misclassification. While our modified 3D positional encoding enhances spatial awareness, it is not perfect. In regions where anatomical structures are complex or boundaries are ill-defined, particularly at the periphery of organs or lesions where intensity transitions are gradual, the model might struggle to precisely delineate the ROI. This can lead to over-segmentation extending into surrounding tissues or misclassification of adjacent structures due to limitations in fully capturing the intricate 3D spatial context.
  • Acquisition Artifacts: Artifacts inherent to CT imaging, such as beam hardening or partial volume effects, can create intensity variations that mislead the model, particularly at interfaces between different tissue types. Beam hardening artifacts, often seen near dense bone structures like ribs, and partial volume averaging, especially in regions with thin slices or complex anatomy, can introduce artificial intensity gradients and noise. These artifacts can create spurious features that the model might misinterpret as relevant structures or boundaries, leading to false-positive detections or inaccurate segmentation, especially at the lung periphery or near the chest wall.
This observation about false positives emphasizes the importance of optimizing the model’s attention mechanisms and incorporating stricter anatomical constraints in the segmentation architecture.
We also acknowledge that further refinements are possible through post-processing techniques. Specifically, methods such as hole filling [57,58] and morphological operations [59] hold the potential to improve the segmentation masks further by addressing common imperfections in segmentation outputs. Hole filling can effectively close small gaps and cavities within the segmented regions, leading to more complete and solid object masks, which is particularly useful for lung nodules that may appear with internal low-density areas. Morphological operations, such as erosion and dilation, can smooth out jagged boundaries, remove small spurious pixels or noise outside the main object, and refine the overall shape of the segmented ROI to better align with anatomical structures.
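The sketch below illustrates such a refinement step with SciPy; the structuring element and iteration counts are arbitrary illustrative choices rather than values used in this work.

```python
import numpy as np
from scipy import ndimage

def refine_mask(mask):
    """Illustrative post-processing: fill internal holes, then smooth the boundary with
    a morphological opening followed by a closing."""
    mask = ndimage.binary_fill_holes(mask)
    structure = ndimage.generate_binary_structure(mask.ndim, connectivity=1)
    mask = ndimage.binary_opening(mask, structure=structure, iterations=2)   # remove small spurious pixels
    mask = ndimage.binary_closing(mask, structure=structure, iterations=2)   # close small gaps at the boundary
    return mask.astype(np.uint8)
```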
Future work could focus on addressing these limitations and further enhancing SALM’s capabilities and clinical utility. Specifically, the following promising avenues for future research could be explored:
  • Refinement for Small, Irregular, and Hidden ROI: To improve segmentation accuracy for ROI that are not only small and irregularly shaped but also poorly defined or hidden due to the complexities of advanced lung disease, future work could focus on incorporating more advanced data augmentation techniques, such as localized deformations and synthetic ROI generation [60], specifically targeting the variability in size, shape, and visibility of lesions, particularly in advanced pathological cases like those observed in the PleThora dataset. Furthermore, research could investigate the incorporation of attention mechanisms that are more sensitive to fine-grained details and boundary information, potentially through novel Transformer architectures or modifications to the existing self-attention layers, to better capture the indistinct and ambiguous boundaries of hidden or irregular ROI. Finally, exploring loss functions that explicitly emphasize boundary sharpness and penalize over-segmentation, such as boundary loss [61] or contour loss [62], could further improve delineation accuracy for challenging ROI with complex and ill-defined borders, often encountered in diseased lung tissue.
  • Addressing Over-segmentation and False Positives: To challenge over-segmentation and reduce false-positive detections, particularly arising from tissue density similarity and acquisition artifacts, future research could explore the incorporation of anatomical priors and contextual information into the model architecture. This might involve integrating anatomical segmentation maps [63] as additional input channels or exploring hierarchical segmentation approaches [64] that first segment broader anatomical regions before refining the ROI segmentation. Furthermore, exploring contrastive learning techniques [65] to better discriminate between ROI and similar-appearing background structures could enhance the model’s robustness to densitometric ambiguities.
  • Exploration of Advanced Architectures for Efficiency and Performance: To further improve both the efficiency and the performance of SALM, future work could investigate the integration of novel Transformer architectures. Specifically, the exploration of xLSTM [66], with its ability to process sequential data efficiently across multiple timescales, could be beneficial for modeling temporal dependencies in longitudinal studies or dynamic contrast-enhanced CT. Mamba [67], with its selective state space model, offers a promising avenue for capturing long-range dependencies in both spatial and temporal dimensions with high computational efficiency, potentially allowing for more efficient processing of large 3D volumes. The integration of these advanced architectures, combined with the modified positional encoding scheme described in this paper, could lead to significant improvements in segmentation accuracy, robustness, and clinical utility, paving the way for real-time applications and deployment on resource-constrained devices.
  • Incorporation of Clinical and Demographic Data: To further enhance the clinical relevance and diagnostic potential of SALM, future work could explore the possibility of incorporating additional clinical information, such as patient demographics, medical history, and radiology reports, as auxiliary inputs to the model. This could enable SALM to learn more nuanced relationships between image features and clinical context, potentially improving segmentation accuracy and enabling more informed diagnostic predictions or risk stratification.

5. Conclusions

In this paper, we presented SALM, a novel deep learning model for 2D and 3D ROI segmentation in lung CT images. SALM leverages the Vision Transformer architecture and incorporates a unique adaptation of positional encoding that enables it to effectively capture multi-scale spatial relationships within 3D volumes. We evaluated SALM on the publicly available LUNA16 dataset and demonstrated its strong performance in both 2D and 3D segmentation tasks. We further validated SALM’s robustness and generalizability by testing it on the completely external and clinically challenging PleThora dataset of diseased lung CT scans.
The quantitative results show that SALM achieves state-of-the-art performance on LUNA16, with an AUC of 99% and a DSC of 93% for 2D segmentation and an F1-score of 75.57% and a DSC of 81.88% for 3D segmentation. The ablation study further highlights the importance of our modified positional encoding, which leads to significant improvements in segmentation accuracy. Moreover, the external validation on PleThora, achieving a mean Dice score of 78.82% on diseased lungs, provides strong evidence for SALM’s ability to generalize to unseen data and maintain clinically relevant performance in more complex pathological scenarios. The qualitative results demonstrate that SALM is able to accurately segment ROI of varying sizes, shapes, and locations, in both 2D slices and 3D volumes, with visual examples on PleThora highlighting both the model’s robustness and the challenges posed by diseased lung tissue. However, some challenges remain, particularly in segmenting very small nodules or nodules with irregular shapes.
Future work could focus on addressing the remaining limitations and further expanding SALM’s capabilities to enhance its clinical utility and robustness. Specifically, future research could prioritize refinement for small, irregular, and hidden ROI, exploring advanced data augmentation techniques, more sensitive attention mechanisms, and specialized loss functions to improve delineation accuracy in challenging pathological contexts. Furthermore, to challenge over-segmentation and false positives, incorporating anatomical priors, contextual information, and contrastive learning strategies could be investigated. Expanding SALM’s generalizability beyond lung CT is another interesting direction, with future work focusing on validation across diverse datasets and modalities like PET-CT and MRI, potentially through fine-tuning and architectural adaptations for multimodal input. To improve both efficiency and performance, the exploration of novel architectures such as xLSTM and Mamba holds significant promise, potentially enabling real-time applications and deployment on resource-constrained devices. Finally, incorporating clinical and demographic data as auxiliary inputs could further enhance SALM’s clinical relevance and diagnostic potential, leading to more nuanced and context-aware segmentation and potentially enabling downstream diagnostic or prognostic tasks.

Author Contributions

Conceptualization, M.A.A. and H.T.G.; methodology, H.T.G. and M.A.A.; validation, H.T.G. and M.A.A.; formal analysis, H.T.G. and M.A.A.; writing—original draft preparation, H.T.G.; writing—review and editing, M.A.A.; funding acquisition, M.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the New Brunswick Innovation Foundation (NBIF) and the New Brunswick Priority Occupation Student Support Fund (NBPOSS) POF2021-006.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This work uses the public dataset LUNA16 (https://luna16.grand-challenge.org/, accessed on 7 February 2025). For external validation, we used PleThora (https://www.cancerimagingarchive.net/analysis-result/plethora/, accessed on 7 February 2025). See Section 3.1 and Section 4.3, respectively, for more details.

Acknowledgments

This research was enabled in part by the support provided by Calcul Quebec (https://www.calculquebec.ca/, accessed on 10 February 2025) and the Digital Research Alliance of Canada (https://alliancecan.ca/en, accessed on 10 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
2D: two-dimensional
3D: three-dimensional
ACDC: Automatic Cancer Detection and Classification
ACM: active contour model
AI: artificial intelligence
AUC: Area under the Curve
BCE: binary cross-entropy
CAD: computer-aided detection
CNN: Convolutional Neural Network
CRF: conditional random field
CT: Computed Tomography
DSC: Dice Similarity Coefficient
FMs: Foundation Models
FCN: Fully Convolutional Network
FFN: Feedforward Network
HU: Hounsfield unit
kg CO2 eq/kWh: kilograms of carbon dioxide equivalent per kilowatt-hour
LIDC-IDRI: Lung Image Database Consortium and Image Database Resource Initiative
LUNA16: Lung Nodule Analysis 2016
MRI: Magnetic Resonance Imaging
MSA: Multi-head Self-Attention
NSCLC: Non-Small Cell Lung Cancer
PE: positional encoding
PUE: Power Usage Effectiveness
ROC: Receiver Operating Characteristic
ROI: Region of Interest
SALM: Segment Anything in Lung Model
UNET: U-Shaped Network
ViT: Vision Transformer

Appendix A. Carbon Footprint

In recent years, there has been growing awareness of the environmental impact of artificial intelligence research, particularly the energy consumption and carbon emissions associated with training large deep learning models [68,69]. As part of our commitment to responsible and sustainable research, we provide an estimate of the carbon footprint associated with the development and training of SALM.
We followed the methodology proposed in [70] to estimate the energy consumption and carbon emissions. The total energy consumption in watt-hours (Wh) is estimated using the following formula:
Energy Consumption (Wh) = # GPUs × GPU Power Consumption (W) × TT (h) × PUE
where
  • # GPUs: the number of GPUs used for training;
  • GPU Power Consumption (W): the power consumption of a single GPU in watts;
  • TT (h): the total training time in hours;
  • PUE (Power Usage Effectiveness): a factor that accounts for the energy used by the entire data center infrastructure, including cooling and other overhead. We use a PUE of 1.1, which is in line with recommendations from [71] and represents a relatively efficient data center.
The carbon emissions are then calculated by multiplying the total energy consumption in kilowatt-hours (kWh) by the carbon intensity factor of the electricity grid used to power the data center. In this study, we used the carbon intensity factor for Quebec, Canada, which is 0.0017 kg CO2eq/kWh, according to Canada’s Greenhouse Gas Offset Credit System [72]. This factor represents the average amount of carbon dioxide equivalent (CO2eq) emitted per unit of electricity generated.
The formula for calculating carbon emissions in kilograms of CO2eq is
Carbon Emissions (kg CO2eq) = Energy Consumption (kWh) × Carbon Intensity Factor (kg CO2eq/kWh)
For the development and training of SALM, we used a single NVIDIA A100-SXM4-40 GB GPU [47], which has a power consumption of approximately 400 W. The total training time for our experiments, including hyperparameter tuning and multiple runs, was approximately 2688 h (4 fine-tuning runs of 4 full weeks: 4 × 4 × 7 × 24 = 2688).
Using these values in Equation (A1), we obtain
Energy Consumption = 1 × 400 W × 2688 h × 1.1 = 1,182,720 Wh = 1182.72 kWh
Then, using Equation (A2), we calculate the carbon emissions as follows:
Carbon Emissions = 1182.72 kWh × 0.0017 kg CO2eq/kWh = 2.010624 kg CO2eq
Therefore, the estimated carbon footprint for training SALM is approximately 2.01 kg CO2eq.
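The estimate can be reproduced with a few lines of Python following Equations (A1) and (A2):

```python
def carbon_footprint(num_gpus, gpu_power_w, hours, pue=1.1, intensity_kg_per_kwh=0.0017):
    """Energy consumption (kWh) and carbon emissions (kg CO2eq) per Equations (A1) and (A2)."""
    energy_kwh = num_gpus * gpu_power_w * hours * pue / 1000.0
    return energy_kwh, energy_kwh * intensity_kg_per_kwh

energy, emissions = carbon_footprint(num_gpus=1, gpu_power_w=400, hours=2688)
print(f"{energy:.2f} kWh, {emissions:.3f} kg CO2eq")   # 1182.72 kWh, 2.011 kg CO2eq
```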
Table A1 summarizes the energy consumption and carbon emissions for the different stages of SALM’s development.
It is important to note that these are estimates, and the actual carbon footprint may vary depending on the specific hardware, data center efficiency, and electricity grid used. However, these calculations provide a reasonable approximation and highlight the relatively low environmental impact of training SALM, especially when compared with larger models trained on massive datasets.
Table A1. Estimated carbon footprint of training SALM.

Stage | No. of GPUs | Training Time (h) | Energy Consumption (kWh) | Carbon Emissions (kg CO2eq)
Hyperparameter Tuning | 1 | 2016 | 887.04 | 1.508
Final Model Training | 1 | 672 | 295.68 | 0.503
Total | – | 2688 | 1182.72 | 2.011
We believe that transparency regarding the carbon footprint of AI research is crucial for promoting responsible and sustainable development in the field.

References

  1. World Health Organization. Cancer. 2023. Available online: https://www.who.int/news-room/fact-sheets/detail/cancer (accessed on 24 June 2024).
  2. Razzak, M.I.; Naz, S.; Zaib, A. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps: Automation of Decision Making; Springer: Berlin/Heidelberg, Germany, 2018; pp. 323–350. [Google Scholar]
  3. Bates, K.; Le, K.N.; Lu, H. Deep learning for robust and flexible tracking in behavioral studies for C. elegans. PLoS Comput. Biol. 2022, 18, e1009942. [Google Scholar] [CrossRef] [PubMed]
  4. Lambert, T.; Waters, J. Towards effective adoption of novel image analysis methods. Nat. Methods 2023, 20, 971–972. [Google Scholar] [CrossRef]
  5. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed]
  6. Leibig, C.; Allken, V.; Ayhan, M.; Berens, P.; Wahl, S. Leveraging uncertainty information from deep neural networks for disease detection. Sci. Rep. 2016, 7, 17816. [Google Scholar] [CrossRef]
  7. Ghali, R.; Akhloufi, M.A. ARSeg: An Attention RegSeg Architecture for CXR Lung Segmentation. In Proceedings of the 2022 IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI), Virtual, 9–11 August 2022; pp. 291–296. [Google Scholar]
  8. Ghali, R.; Akhloufi, M.A. Vision Transformers for Lung Segmentation on CXR Images. SN Comput. Sci. 2023, 4, 414. [Google Scholar] [CrossRef] [PubMed]
  9. Li, F.; Zhou, L.; Wang, Y.; Chen, C.; Yang, S.; Shan, F.; Liu, L. Modeling long-range dependencies for weakly supervised disease classification and localization on chest X-ray. Quant. Imaging Med. Surg. 2022, 12, 3364. [Google Scholar] [CrossRef] [PubMed]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
  11. Jung, K.H. Uncover this tech term: Foundation model. Korean J. Radiol. 2023, 24, 1038. [Google Scholar] [CrossRef] [PubMed]
  12. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  13. Ma, J.; Wang, B. Segment anything in medical images. arXiv 2023, arXiv:2304.12306. [Google Scholar] [CrossRef] [PubMed]
  14. Setio, A.A.A.; Traverso, A.; de Bel, T.; Berens, M.S.; Van Den Bogaard, C.; Cerello, P.; Chen, H.; Dou, Q.; Fantacci, M.E.; Geurts, B.; et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med. Image Anal. 2017, 42, 1–13. [Google Scholar] [CrossRef]
  15. Kiser, K.; Ahmed, S.; Stieb, S.M.; Mohamed, A.A.S.R.; Elhalawani, H.; Park, P.Y.S.; Doyle, N.S.; Wang, B.J.; Barman, A.; Fuller, C.D.; et al. Thoracic Volume and Pleural Effusion Segmentations in Diseased Lungs for Benchmarking Chest CT Processing Pipelines. The Cancer Imaging Archive (TCIA). Note: Dataset Published by The Cancer Imaging Archive. Available online: https://www.cancerimagingarchive.net/analysis-result/plethora/ (accessed on 12 February 2025). [CrossRef]
  16. Aerts, H.J.W.L.; Wee, L.; Velazquez, E.R.; Leijenaar, R.T.H.; Parmar, C.; Grossmann, P.; Carvalho, S.; Bussink, J.; Monshouwer, R.; Haibe-Kains, B.; et al. Data From NSCLC-Radiomics. The Cancer Imaging Archive (TCIA). Note: Dataset Published by The Cancer Imaging Archive. Available online: https://www.cancerimagingarchive.net/collection/nsclc-radiomics/ (accessed on 12 February 2025).
  17. Gayap, H.T.; Akhloufi, M.A. Deep machine learning for medical diagnosis, application to lung cancer detection: A review. BioMedInformatics 2024, 4, 236–284. [Google Scholar] [CrossRef]
  18. Bellotti, R.; Carlo, F.D.; Gargano, G.; Tangaro, S.; Cascio, D.; Catanzariti, E.; Cerello, P.; Cheran, S.; Delogu, P.; Mitri, I.D.; et al. A CAD system for nodule detection in low-dose lung CTs based on region growing and a new active contour model. Med. Phys. 2007, 34, 4901–4910. [Google Scholar] [CrossRef] [PubMed]
  19. Li, X.; Wang, X.; Dai, Y.; Zhang, P. Supervised recursive segmentation of volumetric CT images for 3D reconstruction of lung and vessel tree. Comput. Methods Programs Biomed. 2015, 122, 316–329. [Google Scholar] [CrossRef]
  20. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  21. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: New York, NY, USA, 2016; pp. 565–571. [Google Scholar]
  22. Wu, W.; Gao, L.; Duan, H.; Huang, G.; Ye, X.; Nie, S. Segmentation of pulmonary nodules in CT images based on 3D-UNET combined with three-dimensional conditional random field optimization. Med. Phys. 2020, 47, 4054–4063. [Google Scholar] [CrossRef]
  23. Zhao, W.; Yang, J.; Sun, Y.; Li, C.; Wu, W.; Jin, L.; Yang, Z.; Ni, B.; Gao, P.; Wang, P.; et al. 3D deep learning from CT scans predicts tumor invasiveness of subcentimeter pulmonary adenocarcinomas. Cancer Res. 2018, 78, 6881–6889. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, S.; Mahon, R.; Weiss, E.; Jan, N.; Taylor, R.J.; McDonagh, P.R.; Quinn, B.; Yuan, L. Automated lung cancer segmentation using a PET and CT dual-modality deep learning neural network. Int. J. Radiat. Oncol. Biol. Phys. 2023, 115, 529–539. [Google Scholar] [CrossRef] [PubMed]
  25. Riaz, Z.; Khan, B.; Abdullah, S.; Khan, S.; Islam, M.S. Lung tumor image segmentation from computer tomography images using MobileNetV2 and transfer learning. Bioengineering 2023, 10, 981. [Google Scholar] [CrossRef] [PubMed]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Mkindu, H.; Wu, L.; Zhao, Y. Lung nodule detection in chest CT images based on vision transformer network with Bayesian optimization. Biomed. Signal Process. Control 2023, 85, 104866. [Google Scholar] [CrossRef]
  28. Niu, C.; Wang, G. Unsupervised contrastive learning based transformer for lung nodule detection. Phys. Med. Biol. 2022, 67, 204001. [Google Scholar] [CrossRef] [PubMed]
  29. Xie, Y.; Zhang, J.; Xia, Y. Semi-supervised adversarial model for benign–malignant lung nodule classification on chest CT. Med. Image Anal. 2019, 57, 237–248. [Google Scholar] [CrossRef]
  30. Tang, T.; Zhang, R.; Lin, K.; Li, F.; Xia, X. SM-RNet: A Scale-aware-based Multi-attention Guided Reverse Network for Pulmonary Nodules Segmentation. IEEE Trans. Instrum. Meas. 2023, 72, 3315365. [Google Scholar] [CrossRef]
  31. Bhattacharyya, D.; Thirupathi Rao, N.; Joshua, E.S.N.; Hu, Y.C. A bi-directional deep learning architecture for lung nodule semantic segmentation. Vis. Comput. 2023, 39, 5245–5261. [Google Scholar] [CrossRef] [PubMed]
  32. Said, Y.; Alsheikhy, A.A.; Shawly, T.; Lahza, H. Medical images segmentation for lung cancer diagnosis based on deep learning architectures. Diagnostics 2023, 13, 546. [Google Scholar] [CrossRef] [PubMed]
  33. Antonelli, M.; Reinke, A.; Bakas, S.; Farahani, K.; Kopp-Schneider, A.; Landman, B.A.; Litjens, G.; Menze, B.; Ronneberger, O.; Summers, R.M.; et al. The Medical Segmentation Decathlon. Nat. Commun. 2022, 13, 4128. [Google Scholar] [CrossRef] [PubMed]
  34. Zhang, F.; Wang, Q.; Fan, E.; Lu, N.; Chen, D.; Jiang, H.; Yu, Y. Enhancing non-small cell lung cancer tumor segmentation with a novel two-step deep learning approach. J. Radiat. Res. Appl. Sci. 2024, 17, 100775. [Google Scholar] [CrossRef]
  35. Chen, W.; Wei, H.; Peng, S.; Sun, J.; Qiao, X.; Liu, B. HSN: Hybrid segmentation network for small cell lung cancer segmentation. IEEE Access 2019, 7, 75591–75603. [Google Scholar] [CrossRef]
  36. He, B.; Hu, W.; Zhang, K.; Yuan, S.; Han, X.; Su, C.; Zhao, J.; Wang, G.; Wang, G.; Zhang, L. Image segmentation algorithm of lung cancer based on neural network model. Expert Syst. 2022, 39, e12822. [Google Scholar] [CrossRef]
  37. Armato, S.G.; McLennan, G.; Bidaut, L.; McNitt-Gray, M.F.; Meyer, C.R.; Reeves, A.P.; Zhao, B.; Aberle, D.R.; Henschke, C.I.; Hoffman, E.A.; et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans. Med. Phys. 2011, 38, 915–931. [Google Scholar] [CrossRef] [PubMed]
  38. Li, Z.; Zhang, J.; Tan, T.; Teng, X.; Sun, X.; Zhao, H.; Liu, L.; Xiao, Y.; Lee, B.; Li, Y.; et al. Deep learning methods for lung cancer segmentation in whole-slide histopathology images—The ACDC@LungHP Challenge 2019. IEEE J. Biomed. Health Inform. 2020, 25, 429–440. [Google Scholar] [CrossRef] [PubMed]
  39. Šarić, M.; Russo, M.; Stella, M.; Sikora, M. CNN-based Method for Lung Cancer Detection in Whole Slide Histopathology Images. In Proceedings of the 2019 4th International Conference on Smart and Sustainable Technologies (SpliTech), Split, Croatia, 18–21 June 2019; pp. 1–4. [Google Scholar] [CrossRef]
  40. Baek, S.; He, Y.; Allen, B.G.; Buatti, J.M.; Smith, B.J.; Tong, L.; Sun, Z.; Wu, J.; Diehn, M.; Loo, B.W.; et al. Deep segmentation networks predict survival of non-small cell lung cancer. Sci. Rep. 2019, 9, 17286. [Google Scholar] [CrossRef]
  41. Jiang, J.; Hu, Y.C.; Tyagi, N.; Zhang, P.; Rimner, A.; Mageras, G.S.; Deasy, J.O.; Veeraraghavan, H. Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part II 11. Springer: Berlin/Heidelberg, Germany, 2018; pp. 777–785. [Google Scholar]
  42. Reeves, T.; Mah, P.; McDavid, W. Deriving Hounsfield units using grey levels in cone beam CT: A clinical application. Dentomaxillofacial Radiol. 2012, 41, 500–508. [Google Scholar] [CrossRef] [PubMed]
  43. Xu, G.; Hao, Z.; Luo, Y.; Hu, H.; An, J.; Mao, S. DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices. IEEE Trans. Mob. Comput. 2023, 23, 5917–5932. [Google Scholar] [CrossRef]
  44. Li, C.; Kim, K.; Wu, B.; Zhang, P.; Zhang, H.; Dai, X.; Vajda, P.; Lin, Y. An Investigation on Hardware-Aware Vision Transformer Scaling. ACM Trans. Embed. Comput. Syst. 2023, 23, 1–19. [Google Scholar] [CrossRef]
  45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  46. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024. [Google Scholar]
  47. Digital Research Alliance of Canada. Narval Documentation. Note: Documentation for Narval, an Advanced Computing Resource Provided by the Digital Research Alliance of Canada. Available online: https://docs.alliancecan.ca/wiki/Narval/en (accessed on 12 February 2025).
  48. Mohammed, K.K.; Hassanien, A.E.; Afify, H.M. A 3D image segmentation for lung cancer using V.Net architecture based deep convolutional networks. J. Med. Eng. Technol. 2021, 45, 337–343. [Google Scholar] [CrossRef]
  49. Gan, W.; Wang, H.; Gu, H.; Duan, Y.; Shao, Y.; Chen, H.; Feng, A.; Huang, Y.; Fu, X.; Ying, Y.; et al. Automatic segmentation of lung tumors on CT images based on a 2D & 3D hybrid convolutional neural network. Br. J. Radiol. 2021, 94, 20210038. [Google Scholar] [CrossRef] [PubMed]
  50. Kamal, U.; Rafi, A.M.; Hoque, R.; Wu, J.; Hasan, M.K. Lung Cancer Tumor Region Segmentation Using Recurrent 3D-DenseUNet. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 36–47. [Google Scholar] [CrossRef]
  51. Peixoto, S.A.; Medeiros, A.G.; Hassan, M.M.; Dewan, M.A.A.; Albuquerque, V.H.C.d.; Filho, P.P.R. Floor of log: A novel intelligent algorithm for 3D lung segmentation in computer tomography images. Multimed. Syst. 2022, 28, 1151–1163. [Google Scholar] [CrossRef]
  52. Saood, A.; Hatem, I. COVID-19 lung CT image segmentation using deep learning methods: U-Net versus SegNet. BMC Med. Imaging 2021, 21, 19. [Google Scholar] [CrossRef] [PubMed]
  53. Zhao, T.; Gao, D.; Wang, J.; Yin, Z. Lung segmentation in CT images using a fully convolutional neural network with multi-instance and conditional adversary loss. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; IEEE: New York, NY, USA, 2018; pp. 505–509. [Google Scholar]
  54. Murugappan, M.; Bourisly, A.K.; Prakash, N.B.; Sumithra, M.G.; Acharya, U.R. Automated semantic lung segmentation in chest CT images using deep neural network. Neural Comput. Appl. 2023, 35, 15343–15364. [Google Scholar] [CrossRef] [PubMed]
  55. NVIDIA. NVIDIA GeForce GTX 1080 Ti: The Ultimate GeForce. Note: Official Product Announcement Page for the NVIDIA GeForce GTX 1080 Ti Graphics Card. Available online: https://www.nvidia.com/en-us/geforce/news/gfecnt/nvidia-geforce-gtx-1080-ti/ (accessed on 12 February 2025).
  56. Apple. MacBook Pro-Buy. Note: Product Page for Purchasing MacBook Pro Computers from Apple Canada. Manufacturer: Apple Inc., City: Cupertino, Country: USA. Available online: https://www.apple.com/ca/shop/buy-mac/macbook-pro (accessed on 12 February 2025).
  57. Wang, J.; Oliveira, M.M. A hole-filling strategy for reconstruction of smooth surfaces in range images. In Proceedings of the 16th Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 2003), São Carlos, Brazil, 12–15 October 2003; IEEE: New York, NY, USA, 2003; pp. 11–18. [Google Scholar]
  58. Hasan, M.M.; Mishra, P.K. Improving morphology operation for 2D hole filling algorithm. Int. J. Image Process. IJIP 2012, 6, 635–646. [Google Scholar]
  59. Chudasama, D.; Patel, T.; Joshi, S.; Prajapati, G.I. Image segmentation using morphological operations. Int. J. Comput. Appl. 2015, 117, 8887. [Google Scholar] [CrossRef]
  60. Kim, Y.; Lee, J.H.; Kim, C.; Jin, K.N.; Park, C.M. GAN based ROI conditioned synthesis of medical image for data augmentation. In Proceedings of the Medical Imaging 2023: Image Processing, San Diego, CA, USA, 19–23 February 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12464, pp. 752–758. [Google Scholar]
  61. Kervadec, H.; Bouchtiba, J.; Desrosiers, C.; Granger, E.; Dolz, J.; Ayed, I.B. Boundary loss for highly unbalanced segmentation. In Proceedings of the International Conference on Medical Imaging with Deep Learning, PMLR, London, UK, 8–10 July 2019; pp. 285–296. [Google Scholar]
  62. Chen, Z.; Zhou, H.; Lai, J.; Yang, L.; Xie, X. Contour-aware loss: Boundary-aware learning for salient object segmentation. IEEE Trans. Image Process. 2020, 30, 431–443. [Google Scholar] [CrossRef]
  63. Evans, A.; Collins, D.; Holmes, C. Automatic 3D regional MRI segmentation and statistical probability anatomy maps. In Quantification of Brain Function Using PET; Elsevier: Amsterdam, The Netherlands, 1996; pp. 123–130. [Google Scholar]
  64. Lee, D.U.; Cheung, R.C.; Luk, W.; Villasenor, J.D. Hierarchical segmentation for hardware function evaluation. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2008, 17, 103–116. [Google Scholar] [CrossRef]
  65. Hu, H.; Wang, X.; Zhang, Y.; Chen, Q.; Guan, Q. A comprehensive survey on contrastive learning. Neurocomputing 2024, 610, 128645. [Google Scholar] [CrossRef]
  66. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517. [Google Scholar]
  67. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  68. Lacoste, A.; Luccioni, A.; Schmidt, V.; Dandres, T. Quantifying the carbon emissions of machine learning. arXiv 2019, arXiv:1910.09700. [Google Scholar]
  69. Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13693–13696. [Google Scholar]
  70. Wu, C.J.; Raghavendra, R.; Gupta, U.; Acun, B.; Ardalani, N.; Maeng, K.; Chang, G.; Aga, F.; Huang, J.; Bai, C.; et al. Sustainable ai: Environmental implications, challenges and opportunities. Proc. Mach. Learn. Syst. 2022, 4, 795–813. [Google Scholar]
  71. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  72. Environment and Climate Change Canada. Federal Greenhouse Gas Offset System: Emission Factors and Reference Values. 2023. Available online: https://www.canada.ca/en/environment-climate-change/services/climate-change/pricing-pollution-how-it-will-work/output-based-pricing-system/federal-greenhouse-gas-offset-system/emission-factors-reference-values.html (accessed on 12 February 2025).
Figure 1. Overview of 2D image slices in LUNA16: original 2D image (top); original segmentation label (bottom).
Figure 2. The proposed SALM architecture.
Figure 3. Image encoder architecture. Our image encoder uses a Vision Transformer to learn a 3D representation of the input image. The model flattens the input and adds our reformulated positional encoding before passing it to a Transformer encoder.
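To illustrate the data flow summarized in Figure 3, the following minimal PyTorch sketch patch-embeds an input slice, flattens the patch grid into a token sequence, adds a sinusoidal positional encoding, and passes the sequence through a Transformer encoder. It is an illustrative approximation rather than the exact SALM implementation; the patch size, embedding dimension, depth, and the positional_encoding helper are assumptions made for the example.

```python
import math
import torch
import torch.nn as nn

class SimpleViTEncoder(nn.Module):
    """Illustrative ViT-style image encoder: patchify -> flatten -> add positional encoding -> Transformer encoder."""
    def __init__(self, in_channels=1, patch_size=16, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        # Patch embedding implemented as a strided convolution
        self.patch_embed = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.embed_dim = embed_dim

    def positional_encoding(self, num_tokens, dim, device):
        # Standard sinusoidal encoding; the article's modified variant adds a 3*pi factor
        # (see Figure 4), which would be applied to the sinusoid argument here.
        position = torch.arange(num_tokens, device=device).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, dim, 2, device=device).float() * (-math.log(10000.0) / dim))
        pe = torch.zeros(num_tokens, dim, device=device)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, x):                            # x: (B, 1, H, W) CT slice
        tokens = self.patch_embed(x)                 # (B, D, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, D) flattened patch sequence
        pe = self.positional_encoding(tokens.size(1), self.embed_dim, tokens.device)
        return self.encoder(tokens + pe)             # contextualized patch embeddings
```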
Figure 4. Comparison of standard and modified positional encodings PE(pos, 2i). Curve (a) oscillates much more slowly, with wide cycles. Curve (b) shows a much faster, denser oscillation, with cycles closer together. Adding a factor of 3π significantly changes the granularity with which the model encodes positions, increasing its sensitivity to fine details.
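For reference, the two curves in Figure 4 correspond to the following expressions, assuming (consistent with the caption's description of faster, denser oscillation) that the 3π factor scales the argument of the sinusoid:

PE_std(pos, 2i) = sin(pos / 10000^(2i / d_model)),  PE_mod(pos, 2i) = sin(3π · pos / 10000^(2i / d_model))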
Figure 5. Mask decoder architecture. The Transformer decoder takes as input a small, fixed number of learned positional embeddings, called "object queries", and additionally processes the encoder output. Each decoder output embedding is passed to a shared Feedforward Network (FFN), which then produces the CT segmentation.
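The object-query decoding pattern described in Figure 5 can be sketched as follows in PyTorch: learned query embeddings attend to the encoder output through a Transformer decoder, and a shared feed-forward network maps each decoded embedding to a mask logit map. The number of queries, the dimensions, and the low-resolution mask head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SimpleMaskDecoder(nn.Module):
    """Illustrative object-query mask decoder: learned queries + Transformer decoder + shared FFN head."""
    def __init__(self, embed_dim=256, num_queries=8, num_heads=8, depth=2, mask_hw=(32, 32)):
        super().__init__()
        self.object_queries = nn.Embedding(num_queries, embed_dim)  # learned positional embeddings ("object queries")
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        # Shared feed-forward network producing a low-resolution mask logit map per query
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, mask_hw[0] * mask_hw[1]),
        )
        self.mask_hw = mask_hw

    def forward(self, encoder_tokens):                # encoder_tokens: (B, N, D) from the image encoder
        b = encoder_tokens.size(0)
        queries = self.object_queries.weight.unsqueeze(0).expand(b, -1, -1)  # (B, Q, D)
        decoded = self.decoder(tgt=queries, memory=encoder_tokens)           # queries attend to encoder output
        masks = self.ffn(decoded)                                            # (B, Q, H*W) mask logits
        return masks.view(b, -1, *self.mask_hw)                              # (B, Q, H, W)
```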
Figure 6. Result of segmentation with SALM-3D (left to right): original slice, original label, and segmentation result.
Figure 7. Result of segmentation with SALM on different image sizes: segmentation on a 685 × 522 2D image (top); segmentation on a 512 × 512 2D image (bottom). For each image, each color represents a distinct side of the segmented lung.
Figure 8. Example of SALM over-segmentation (left to right): original CT slice, ground truth mask, model prediction, and false-positive detection map. Yellow represents the predicted region; red represents the areas overpredicted relative to the label.
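An over-prediction map such as the red regions in Figure 8 can be obtained directly from the binary masks; a minimal sketch, assuming prediction and ground truth tensors of identical shape:

```python
import torch

def false_positive_map(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Voxels/pixels predicted as ROI but absent from the ground truth (over-predicted areas)."""
    return pred.bool() & ~target.bool()
```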
Figure 9. Qualitative segmentation results on the PleThora dataset. Each row displays a sample slice from the PleThora dataset, showing the original CT slice (left), the corresponding ground truth mask, where each color represents a distinct part of the ROI (center), and the SALM-generated segmentation mask overlaid on the original slice, where yellow represents the segmentation produced by SALM (right). These examples illustrate the challenges of segmenting diseased lungs, characterized by increased anatomical complexity and heterogeneous tissue densities.
Figure 10. SALM processing of 3D PleThora CT scans visualized in 2D slices. Each row illustrates SALM’s segmentation on a 2D slice extracted from a 3D PleThora CT volume. We show the original CT slice (left), the corresponding ground truth mask (center), and the SALM prediction overlaid on the original slice (right). This figure highlights how SALM, while processing 3D volumes slice by slice, effectively segments lung ROI in a 2D viewing format, which is typical in clinical practice for lung CT analysis.
Figure 11. Qualitative comparison of segmentation with standard vs. modified positional encoding on the PleThora dataset. Each row presents a qualitative comparison on a sample PleThora CT slice, contrasting segmentation performance using the standard positional encoding (top row) and our modified positional encoding (SALM) (bottom row). For each encoding method, we display (left to right): the original image, the ground truth mask, the model prediction (with bounding box prompt), and a prediction error map highlighting overpredicted areas (red) and missed areas (blue). This figure visually demonstrates the benefit of our modified positional encoding in reducing over-segmentation and false positives, particularly in challenging cases within the PleThora dataset.
Table 1. Summary of work on lung CT segmentation.

Ref. | Objective | Database | Results (%)
[31] | Detection of pulmonary nodules | LUNA16 [14] | DSC = 88.89
[32] | Early detection of lung cancer | MSD [33] | Seg. accuracy = 97.83; Class. accuracy = 98.77
[24] | Tumor segmentation for radiotherapy planning | Private clinical database | DSC = 83
[25] | Cancerous lesion segmentation | MSD [33] | DSC = 87.93; Recall = 86.02; Precision = 93
[34] | Accurate tumor segmentation in NSCLC treatment | NSCLC cases | DSC = 80
[35] | Segmentation | 134 CT scans | Dice = 88.8; Sensitivity = 87.2; Precision = 90.9
[36] | Recognition of lung cancer | LIDC-IDRI [37] | Sensitivity = 95.7
[38] | Lung cancer detection challenge | ACDC@LungHP database [39] | -
[40] | Lung cancer segmentation | NSCLC | DSC = 86.1
[41] | Automatic tumor segmentation from T2 MRI | 377 CT patients + 6 MRI | DSC = 80
[18] | Automatic detection of lung nodules using region growing algorithms and active contour models | 15 CT scans with approx. 4700 slices | Detection rate: 88.5; false positives: 6.6 per CT
[19] | Supervised semi-3D segmentation of lung tissue and reconstruction of untrimmed 3D models | 15 scans of healthy and early-stage lung tumor patients | Outperforms in accuracy and speed
[22] | Accurate segmentation of pulmonary nodules using 3D-UNET optimized by 3D-CRF | 936 lung nodules from LIDC-IDRI, validated on clinical data | Dice score: 80.1
[23] | Automatic classification and segmentation of lung nodules using 3D CNN and multitask learning | 651 nodules annotated with segmentation masks and pathological labels | Weighted average F1-score: 63.3 vs. radiologists 51.0 to 56.6
Seg: segmentation; Class: classification.
Table 2. Comparison of SALM with state-of-the-art methods on 2D segmentation using the LUNA16 dataset.

Model | Accuracy | AUC | DSC
SegNet [52] | 95.00 | N/A | N/A
UNET [52] | 91.00 | N/A | N/A
FCN [53] | N/A | N/A | 92.00
DeepLabV3 [54] | 94.90 | N/A | N/A
SALM | 99.00 | 99.00 | 93.00
N/A = not available; in bold: best result.
Table 3. SALM comparison with different architectures for lung CT segmentation in 3D.

Model | F1-Score | DSC | Test Size | Database (No. of Patients)
FCN V.Net [48] | N/A | 80.00 | 32 | 96
Hybrid CNN [49] | N/A | 58.00 | N/A | 260
Recurrent 3D-DenseUNet [50] | N/A | 72.28 | N/A | 300
FoL [51] | N/A | 83.68 | 430 slices | 40
SAM [12] | 42.83 | 43.95 | 174 | -
SALM-3D | 75.57 | 81.88 | 174 | 888
N/A = not available; in bold: best result.
Table 4. Ablation study on the effect of the modified positional encoding.

Positional Encoding | F1-Score (%) | DSC (%)
Standard | 56.23 | 59.53
Modified (3π) | 75.57 | 81.88
In bold: best result.
Table 5. Inference time comparison across different GPUs on 245 CT scans.

GPU | Processing Time (Minutes) | Images per Second
NVIDIA GeForce GTX 1080 Ti (12 GB) | 2.25 | 1.68
Apple M3 (16 GB) | 7.51 | 1.92
Table 6. SALM performance on PleThora dataset (external validation).

Metric | Value
Mean Dice Score | 0.7882 ± 0.1514
Mean Accuracy | 0.9764 ± 0.0146
AUC-ROC | 0.8592
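For clarity on how metrics such as those in Table 6 are typically computed, the snippet below evaluates the Dice score and pixel accuracy on binary masks; the 0.5 threshold and smoothing constant are illustrative choices, not necessarily the exact evaluation code used in this study.

```python
import torch

def dice_score(pred, target, eps=1e-6):
    """Dice coefficient between two binarized masks of the same shape."""
    pred = (pred > 0.5).float()
    target = (target > 0.5).float()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def pixel_accuracy(pred, target):
    """Fraction of voxels/pixels where the binarized prediction matches the label."""
    pred = (pred > 0.5).float()
    target = (target > 0.5).float()
    return (pred == target).float().mean()

# Example on random tensors shaped like a CT slice mask
pred = torch.rand(1, 512, 512)
label = (torch.rand(1, 512, 512) > 0.5).float()
print(f"Dice: {dice_score(pred, label):.4f}, Accuracy: {pixel_accuracy(pred, label):.4f}")
```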
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
