1. Introduction
The latest advancements in digital pathology present significant practical benefits compared to traditional manual diagnosis [1,2]. The exponential growth of medical imaging technologies has resulted in an accumulation of high-resolution histological images, necessitating automated annotation and diagnosis processes [3]. In this context, artificial intelligence (AI) algorithms have emerged as promising tools in digital pathology, holding immense potential for streamlining diagnostic workflows [4,5].
However, the diversity of biological tissue structures complicates the automated analysis of histopathology slides [6,7]. This challenge is particularly pronounced in the segmentation of tissue and substructures, such as glands and nodules, which are pivotal in cancer diagnosis across various tissue types and cancer subtypes [8,9]. Disruptions in glandular structures are often indicative of malignant cases [10], presenting irregular shapes in contrast to the circular structures commonly observed in benign cases. This distinction is crucial in diagnosing colorectal cancer, where gland anatomy plays a vital role [11,12]. Therefore, the precision and accuracy of tissue segmentation processes are critical for advancing toward an AI-aided cancer detection pipeline.
The AI research community is currently experiencing a significant transformation, driven by the development of large-scale models like DALL-E [13], GPT-4 [14], and SAM [15]. These models provide frameworks for solving a wide range of problems. SAM, in particular, is a notable segmentation model capable of generalizing to new objects and images without further training. This adaptability results from SAM’s training on millions of images and masks, refined through iterative feedback and model improvements. However, applying SAM in the medical field requires fine-tuning it for a given downstream task.
The contributions of our work are the following:
Introducing a new strategy for fine-tuning the SAM model using granular box prompts derived from ground truth masks, enhancing gland morphology segmentation accuracy.
Demonstrating, through experiments on the CRAG, GlaS, and Camelyon16 datasets, that our training strategy improves SAM’s segmentation performance.
Showcasing SAM’s superior performance and adaptability in digital pathology, which is particularly beneficial for cases with limited data availability.
Highlighting SAM’s consistent performance and exceptional ability to generalize to new and complex data types, such as lymph node segmentation.
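To make the first contribution concrete, the sketch below derives one bounding-box prompt per connected gland region in a binary ground-truth mask, using scipy's connected-component labeling. It is a minimal illustrative reconstruction of the granular box prompt idea, not our exact implementation; the function name and box format are ours for illustration (SAM's prompt encoder accepts boxes in `(x_min, y_min, x_max, y_max)` form).

```python
import numpy as np
from scipy import ndimage

def granular_box_prompts(mask: np.ndarray) -> list:
    """Derive one box prompt per connected gland region in a binary
    ground-truth mask. Boxes are (x_min, y_min, x_max, y_max).
    Illustrative sketch, not the paper's exact implementation."""
    labeled, num_regions = ndimage.label(mask > 0)
    boxes = []
    for sl in ndimage.find_objects(labeled):
        y, x = sl  # (row slice, column slice) per labeled region
        boxes.append((x.start, y.start, x.stop - 1, y.stop - 1))
    return boxes

# Toy mask with two separate "glands"
mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 1:4] = 1   # gland A
mask[5:8, 5:7] = 1   # gland B
print(granular_box_prompts(mask))  # → [(1, 1, 3, 2), (5, 5, 6, 7)]
```

Each box tightly encloses one gland, so the model receives one prompt per structure rather than a single large box over the whole patch.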
This paper is organized as follows: Section 2 discusses related work in medical image segmentation using SAM. Section 3 describes the datasets and the methodology used, including the training procedures for GB-SAM. Section 4 presents our experimental results and provides a comparative analysis with other models. Finally, Section 5 concludes the paper, summarizing our findings and discussing future research directions.
2. Related Work
The application of the Segment Anything Model (SAM) in the pathology field remains relatively unexplored. SAM-Path improves SAM’s semantic segmentation in digital pathology by introducing trainable class prompts. Experiments on the BCSS and CRAG datasets show significant improvements over the standard SAM model, highlighting its enhanced performance in pathology applications [16].
Another study evaluated SAM’s zero-shot segmentation on whole slide imaging (WSI) tasks, such as tumor, nontumor tissue, and cell nuclei segmentation. While SAM obtained good results in segmenting large connected objects, it struggled with dense instance object segmentation [17].
WSI-SAM focuses on precise object segmentation for histopathology images using multiresolution patches. This variant maintains SAM’s zero-shot adaptability and introduces a dual mask decoder to integrate features at multiple scales, demonstrating superior performance on tasks like ductal carcinoma in situ (DCIS) and breast cancer metastasis segmentation [18].
MedSAM, a foundation model for universal medical image segmentation, covers various imaging modalities and cancer types. It outperforms modality-wise specialist models in internal and external validation tasks, indicating its potential to revolutionize diagnostic tools and treatment plans [19].
Another paper explores SAM’s application in medical imaging, particularly radiology and pathology. Through fine-tuning, SAM significantly improves segmentation accuracy and reliability, offering insights into its utility in healthcare [20].
Lastly, all-in-SAM, a SAM pipeline for the AI development workflow, has shown that leveraging pixel-level annotations from weak prompts can enhance the SAM segmentation model. This method surpasses state-of-the-art methods in nuclei segmentation and achieves competitive performance with minimal annotations [21].
We summarize these works in Table 1.
4. Results and Discussion
In this section, we present the experiments conducted to evaluate the segmentation performance of our GB-SAM model and compare its results with those of the U-Net model. Both models were initially fine-tuned using the CRAG dataset, followed by testing on the CRAG testing dataset, GlaS, and Camelyon16 datasets. Additionally, we explore the impact of reduced training dataset sizes. Further details on our experiments and results are provided in the following sections.
4.1. Impact of Dataset Size on Tuning GB-SAM and U-Net Models with CRAG
We selected 100%, 50%, and 25% of the CRAG training dataset for training. We then trained GB-SAM and U-Net on these subsets and evaluated both models on the CRAG testing (validation) dataset.
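The paper does not show how the subsets were drawn; one reproducible way to build nested 100%/50%/25% splits is a fixed-seed shuffle followed by prefix slicing, as sketched below. The helper name, seed, and filename pattern are hypothetical.

```python
import random

def dataset_fractions(filenames, fractions=(1.0, 0.5, 0.25), seed=42):
    """Draw nested, reproducible subsets of a training file list.
    Hypothetical helper: the paper does not specify its sampling
    procedure. Prefix slicing makes each smaller subset a strict
    subset of the larger ones."""
    rng = random.Random(seed)
    shuffled = list(filenames)
    rng.shuffle(shuffled)
    return {f: shuffled[: max(1, int(len(shuffled) * f))] for f in fractions}

files = [f"train_{i:03d}.png" for i in range(100)]
subsets = dataset_fractions(files)
print({f: len(v) for f, v in subsets.items()})  # → {1.0: 100, 0.5: 50, 0.25: 25}
```

Nesting the subsets (25% ⊂ 50% ⊂ 100%) keeps the comparison across training sizes free of sampling noise from independently drawn splits.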
4.1.1. Comparative Results
The two models differ markedly in their sensitivity to reductions in training dataset size, as shown in Table 2. The U-Net model exhibits a pronounced decrease in all evaluated performance metrics (Dice, IoU, mAP) as the dataset size decreases, with Dice scores dropping from 0.937 at full dataset size to 0.857 at 25%, IoU scores from 0.883 to 0.758, and mAP scores from 0.904 to 0.765. This trend underscores U-Net’s significant dependency on large amounts of training data for optimal performance, as reflected in the standard deviations of its metrics across dataset sizes (Dice: 0.041, IoU: 0.064, mAP: 0.075).
In contrast, the GB-SAM model degrades far less across the same metrics. Dice scores decrease only slightly, from 0.900 on the full dataset to 0.885 at 25%, and IoU scores from 0.813 to 0.793. The average mAP score declines modestly from 0.814 at full dataset size to 0.788 at 25%, a far smaller drop than U-Net’s. These results suggest that GB-SAM generalizes better from limited data, or suffers less performance loss from overfitting on smaller datasets. The low standard deviations for GB-SAM (Dice: 0.012, IoU: 0.016, mAP: 0.019) underscore its consistent performance across varying dataset sizes, in contrast to U-Net.
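For reference, the Dice and IoU scores discussed above follow their standard definitions for binary masks, Dice = 2|P∩T|/(|P|+|T|) and IoU = |P∩T|/|P∪T|. The sketch below computes both; it is a textbook-definition illustration, not our exact evaluation code.

```python
import numpy as np

def dice_iou(pred: np.ndarray, truth: np.ndarray):
    """Dice coefficient and IoU for a pair of binary masks.
    Standard definitions; empty-vs-empty is scored as a perfect match."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

pred = np.array([[1, 1, 0, 0]])
truth = np.array([[1, 1, 1, 0]])
d, i = dice_iou(pred, truth)
print(round(d, 3), round(i, 3))  # → 0.8 0.667
```

Note that Dice is always at least as large as IoU for the same prediction, which is why the Dice columns in Table 2 sit above the IoU columns.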
Notably, the performance degradation from the 100% to the 25% dataset size was more acute for the U-Net model, especially in mAP scores, indicating a sharper decline in precision compared to the GB-SAM model.
It is important to note that while U-Net outperformed GB-SAM when trained on 100% and 50% of the CRAG training dataset, GB-SAM’s results show greater stability, and it outperforms U-Net when only 25% of the training data is used (see Table 2).
4.1.2. Segmentation Performance
Upon analyzing the performance metrics of the GB-SAM and U-Net models across varying training dataset sizes, we identified the H&E patch images that exhibited the lowest scores for each training size (Table 3), offering a detailed view of each model’s segmentation capability.
A recurring observation in Table 3 is the challenge that image test_23 presents to the GB-SAM model, which struggles on it across all metrics (Dice, IoU, mAP) and dataset sizes. As demonstrated in Figure 3d, GB-SAM’s segmentation is noticeably noisy, with a tendency toward higher false positives and reduced accuracy, evidenced by over-segmentation and spurious noise alongside actual features.
In contrast, U-Net’s segmentation more closely matches the ground truth but lacks finer details, missing smaller features and producing smoother edges. This indicates that U-Net better captures the overall structure in this image, though it struggles with detailed aspects (Figure 3c).
In our analysis, we computed the differences between the ground truth and the predicted masks. The segmentation results for GB-SAM and U-Net on image test_23, shown in Figure 4 and Figure 5, reveal several key observations. The color-coded segmentation, with red indicating underpredictions (pixels predicted as 0 but labeled 1 in the true mask) and green highlighting overpredictions (pixels predicted as 1 but labeled 0 in the true mask), allows a visual comparison of model performance across training dataset sizes. GB-SAM consistently balances over- and underpredictions, regardless of the training dataset size. This consistency shows that GB-SAM maintains stable segmentation performance, indicating its robustness and ability to generalize from the available data.
On the other hand, U-Net tends to increase overpredictions as the size of the training dataset decreases. This pattern points to a potential overfitting issue with U-Net when trained on smaller datasets, where the model might compensate for the lack of data by overestimating the presence of features. It also reflects U-Net’s sensitivity to training dataset size, suggesting that its performance and accuracy in segmenting specific features become less reliable with reduced data availability.
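The red/green disagreement maps described above can be produced with a few lines of NumPy; the sketch below illustrates the color coding (red for underprediction, green for overprediction) and is not our exact plotting code.

```python
import numpy as np

def error_overlay(pred: np.ndarray, truth: np.ndarray) -> np.ndarray:
    """RGB disagreement map: red where the model underpredicts
    (pred 0, truth 1), green where it overpredicts (pred 1, truth 0),
    black where the masks agree. Illustrative sketch."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    overlay = np.zeros(pred.shape + (3,), dtype=np.uint8)
    overlay[~pred & truth] = (255, 0, 0)   # red: underprediction
    overlay[pred & ~truth] = (0, 255, 0)   # green: overprediction
    return overlay

pred = np.array([[1, 1, 0]])
truth = np.array([[1, 0, 1]])
ov = error_overlay(pred, truth)
print(ov.tolist())  # → [[[0, 0, 0], [0, 255, 0], [255, 0, 0]]]
```

Summing the red and green pixel counts per image gives a quick scalar view of the under-/overprediction balance discussed for GB-SAM and U-Net.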
For added context, and based on Table 3, U-Net struggles with several images (test_39, test_15, test_18) across various training sizes. Notably, U-Net tends to hallucinate gland structures, as illustrated in Figure 6, and incorrectly segments scanner artifacts as glandular tissue, as demonstrated in Figure 7.
Our findings show the importance of dataset size in training segmentation models and reveal distinct characteristics of GB-SAM and U-Net in managing data scarcity. In our case, GB-SAM’s stable performance across varying dataset sizes offers an advantage in applications with limited annotated data, such as in the digital pathology field.
4.2. Assessing Model Generalizability across Diverse Datasets
After fine-tuning and evaluating the performance of the GB-SAM and U-Net models using 100% of the training data from the CRAG dataset, we explored their ability to generalize to unseen data. The following section presents and discusses our findings, analyzing how effectively each model applies its learned segmentation capabilities to new images. Our analysis aims to highlight the strengths and weaknesses of GB-SAM and U-Net in terms of model generalization, offering insights into their practical utility and flexibility in digital pathology applications.
4.2.1. Evaluating on GlaS
The GlaS test dataset consists of 37 benign and 43 malignant samples. In this section, we aim to evaluate the segmentation performance of the GB-SAM and U-Net models, trained on the CRAG dataset, across these categories.
Based on Table 4, we found that for benign areas, GB-SAM performs better in terms of the Dice coefficient and IoU, with average scores of 0.901 and 0.820, respectively. These metrics indicate that GB-SAM is highly effective in accurately identifying benign tumor areas, ensuring a strong match between the predicted segmentation and the ground truth. On the other hand, while still performing well, U-Net is slightly behind GB-SAM, with Dice and IoU scores of 0.878 and 0.797, respectively. However, it is noteworthy that U-Net outperforms GB-SAM in the mAP metric with an average score of 0.873, compared to GB-SAM’s 0.840. While U-Net does not match GB-SAM in segmentation precision and overlap, it has a small advantage in detecting relevant areas within benign contexts.
When analyzing performance on malignant tumors, the gap between GB-SAM and U-Net widens, particularly in the Dice and IoU metrics. GB-SAM maintains its lead with Dice and IoU scores of 0.871 and 0.781, respectively, versus U-Net’s 0.831 (Dice) and 0.745 (IoU). This indicates a consistent trend in which GB-SAM delineates tumor boundaries with greater precision, which is especially crucial for malignant tumors, where accurate segmentation can significantly impact clinical outcomes. Interestingly, in the mAP metric for malignant tumors, U-Net (0.821) surpasses GB-SAM (0.796), suggesting that U-Net may be more adept at recognizing malignant features across a dataset, despite its lower overall segmentation accuracy. Visual segmentation results are shown in Figure 8 and Figure 9.
Both GB-SAM and U-Net exhibit strengths that make them valuable tools in the digital pathology domain. However, GB-SAM’s consistent accuracy and robustness across tumor types highlight its potential benefits for improved tumor segmentation and classification in clinical settings.
4.2.2. Evaluating on Camelyon16
Lymph nodes present significant segmentation challenges, primarily due to often indistinct boundaries and the complexity of surrounding structures. In this context, analyzing the performance of GB-SAM and U-Net models, trained on the CRAG dataset, in segmenting lymph nodes within a subset of the Camelyon16 dataset offers valuable insights into their utility for complex pathological analyses.
Table 5 shows that GB-SAM outperforms U-Net across all metrics. Specifically, GB-SAM achieves a Dice score of 0.740, indicating a significantly higher degree of overlap between the predicted segmentation and the ground truth, in contrast to U-Net’s score of 0.491. This disparity suggests that GB-SAM more effectively identifies lymph node boundaries within WSIs.
Similarly, GB-SAM’s IoU score of 0.612 exceeds U-Net’s 0.366, demonstrating that GB-SAM’s predictions more closely match the actual lymph node areas. Regarding mAP, GB-SAM leads with a score of 0.632 compared to U-Net’s 0.565. Although the gap in mAP between the two models is less pronounced than in Dice and IoU, GB-SAM’s higher score underlines its superior reliability in recognizing lymph nodes.
Figure 10 shows a visual segmentation result.
Furthermore, even when trained on gland data from the CRAG dataset, GB-SAM’s excellent performance showcases its remarkable capacity for generalization to lymph node segmentation. This flexibility highlights GB-SAM’s robust and adaptable modeling approach, which can be used for different yet histologically related tissue types. In contrast, despite achieving high scores in segmenting gland structures, U-Net exhibits constraints in extending its applicability to other tissue types.
4.2.3. Comparative Analysis: GB-SAM, SAM-Path, and Med-SAM
We summarize the results in Table 6. Despite a moderately lower IoU score of 0.813 compared to SAM-Path’s 0.883, GB-SAM obtained a higher Dice score of 0.900, which is essential for medical segmentation tasks. Notably, GB-SAM achieved a Dice score of 0.885 on the CRAG test dataset, surpassing SAM-Path’s 0.883, even though it used only 25% of the CRAG training data (see Table 2). This efficiency highlights GB-SAM’s capability to achieve high performance with limited training data, making it particularly suitable for scenarios with constrained data availability.
Compared to Med-SAM, which achieved a Dice score of 0.956 with a large training dataset, GB-SAM’s Dice score of 0.885 represents only a modest gap. Given the considerably smaller training dataset, this performance underscores GB-SAM’s effectiveness in generalizing from limited data to unseen cases. This ability to perform well with less data is critical for practical clinical deployment, where extensive annotated datasets may not always be available.
Moreover, while we found no existing studies that use SAM for lymph node segmentation, GB-SAM’s performance in this task is noteworthy. Despite not being explicitly trained on lymph node data, GB-SAM’s acceptable segmentation performance indicates its potential to serve as a reliable tool across a diverse range of pathological tissues. This adaptability suggests that GB-SAM could be a valuable tool in clinical settings, offering robust segmentation capabilities across various medical images.
It is important to note that we did not run the experiments for SAM-Path and Med-SAM ourselves. Although their code is available on GitHub, replicating their experiments requires substantial computational resources. Therefore, we reported the statistics published in the original papers for comparison. This approach ensures that we provide a fair and accurate comparison based on the reported performance metrics of these models.
5. Conclusions
In this study, we adopted a new strategy of employing granular box prompts based on ground truth masks for fine-tuning our GB-SAM model, which is based on SAM. This approach aims to achieve more precise gland morphology segmentation, moving away from the traditional single-large box approach used in other works. This technique notably enhanced GB-SAM’s gland segmentation accuracy by supplying detailed data and mitigating ambiguity in regions with complex morphology.
Our experiments across the CRAG, GlaS, and Camelyon16 datasets showed that granular box prompts enable GB-SAM to focus on specific gland features, thus improving learning and generalization across various histopathological patterns. This method highlighted GB-SAM’s outstanding segmentation performance and adaptability, which is particularly helpful in digital pathology cases with limited data availability.
GB-SAM’s consistent performance and capability to generalize to new data, like lymph node segmentation, emphasize its potential for clinical applications. Although both GB-SAM and U-Net contribute valuable tools to digital pathology, GB-SAM’s robustness and success in extending beyond gland segmentation establish it as a strong option for tumor segmentation within the field of digital pathology.
Despite its promising performance, GB-SAM has some limitations. The Camelyon16 dataset is particularly challenging for segmentation due to the unclear boundaries and surrounding tissue structures in WSIs. As discussed in Section 4.2.2, the complexity of accurately detecting and segmenting lymph node metastases in Camelyon16 highlights the difficulty of handling images with complex details. Additionally, while we implemented early stopping based on validation loss to mitigate overfitting, maintaining state-of-the-art generalization across different datasets remains challenging. The variability in the complexity of tissue structures can impact GB-SAM’s performance, especially when segmenting highly complex or rare tissue structures.
To overcome these limitations, our future work will focus on enhancing the preprocessing pipeline to better standardize images and exploring advanced training strategies such as transfer learning and domain adaptation. We aim to ensure that GB-SAM can effectively generalize to a broader range of histopathological images, offering a promising path for its future development.
Addressing ethical and practical considerations is crucial for the responsible deployment of the GB-SAM model in clinical settings. This study represents the first phase of our project, focusing on testing the GB-SAM model’s segmentation capabilities. In the next phase, we will thoroughly address these considerations.
Future work will also include developing an interactive tool for pathologists and promoting the integration of GB-SAM into clinical workflows. This tool will enable medical professionals to interact with the model’s segmentation results, provide feedback, and validate its real-time performance.
While our study demonstrates GB-SAM’s robustness and generalization capabilities with reduced data, further evaluation on more diverse and extensive datasets is essential to capture its full performance for real-world applications. In future work, we plan to include additional datasets containing a more comprehensive range of histopathological variations and larger sample sizes to validate the generalizability of GB-SAM. This approach will help us confirm that our model is well suited for practical deployment in diverse clinical environments.
Moreover, while GB-SAM was explicitly trained for 2D segmentation tasks, we acknowledge the need to address the challenges associated with 3D medical imaging. Future work will focus on extending our approach to 3D data, exploring methods to process volumetric data efficiently, and ensuring consistency across slices.
GB-SAM shows great promise in digital pathology. Addressing its current limitations and expanding its validation to a wider range of datasets will be critical steps in its development. We look forward to enhancing GB-SAM’s capabilities and ensuring its robust performance in diverse clinical applications.