1. Introduction
Colorectal cancer (CRC) is an intestinal cancer that typically begins as polyps and gradually progresses to malignancy, and it continues to account for high incidence and mortality rates [1]. The stage at which CRC is diagnosed is crucial in determining survival outcomes. The American Joint Committee on Cancer (AJCC) TNM system is the standard staging framework for CRC, assessing three key factors: tumor (T), which evaluates the depth of cancer invasion into the colon or rectal wall; lymph nodes (N), which assesses whether cancer has spread to nearby lymph nodes; and metastasis (M), which identifies the presence of cancer in distant lymph nodes or organs such as the liver or lungs [2,3,4]. This staging system is crucial for informing treatment decisions and predicting patient outcomes. Pathology slides for cancer diagnosis are routinely prepared from biopsy samples by staining tumor tissue with hematoxylin and eosin [5]. Histopathological analysis remains the most reliable method for diagnosing malignant tumors and various diseases [6]. However, this process is labor-intensive, time-consuming, and demands high-level expertise, placing a heavy burden on pathologists [7]. The complexity of analyzing histopathological images can lead to pathologist fatigue, increasing the risk of diagnostic errors [8]. Improving the efficiency and accuracy of histopathological assessment is therefore a critical need, particularly in the context of CRC diagnostics.
Digital pathology converts histology into multi-gigapixel whole slide images (WSIs), often reaching several gigabytes in size. Due to their large scale, loading WSIs entirely into memory for training machine learning models presents a considerable challenge [9]. To overcome such a challenge, WSIs are typically segmented into smaller image patches, as full-image analysis is computationally impractical [10,11,12,13]. A widely adopted approach involves dividing these large images into patches and applying deep-learning models to analyze each patch individually [10]. Such methods enable computational processing and facilitate automatic histopathological image classification. By leveraging deep learning-based classification, this approach enhances diagnostic efficiency by rapidly distinguishing between malignant and benign tissues, thereby improving patient outcomes [14].
Among deep learning-based methods, convolutional neural networks (CNNs) have become a standard method for pathological image analysis due to their effectiveness in image classification [15]. Architectures such as Inception V3, Xception, and DenseNet have demonstrated strong performance in medical imaging tasks. However, annotating medical images is expensive and time-consuming, leading to limited labeled datasets [16]. To address this limitation, transfer learning techniques are extensively applied in medical image analysis [16]. Typically, models are initialized with weights pre-trained on the ImageNet dataset [17] and then fine-tuned on histopathological images [18]. This approach allows the pre-trained layers to learn the task-specific characteristics essential for accurate CRC classification. According to [19], fine-tuning pre-trained networks can be performed using two primary methods: layer-wise fine-tuning and partial training. In layer-wise fine-tuning, individual layers are trained sequentially, with careful selection of which ones remain fixed and which ones undergo training. In contrast, partial training keeps the early layers frozen while training only the higher layers on the new dataset. The choice of layers to fine-tune is critical for the final classification performance of the proposed models, as it directly affects their capacity to extract relevant features from CRC histopathological images.
The objective of this study is to enhance the performance and generalizability of CRC histopathological image classification by applying targeted fine-tuning methods to pre-trained CNN models through transfer learning. Specifically, we investigate how different fine-tuning strategies influence model adaptability and diagnostic accuracy across multiple CRC datasets. This research is motivated by the need for deep learning models that can perform reliably across diverse histopathological datasets, despite the challenges posed by limited annotations and morphological variability in CRC tissue samples. To address this, we propose CRCHisto (Colorectal Cancer Histopathology) models based on three widely used CNN architectures: DenseNet121 (CRCHistoDense), InceptionV3 (CRCHistoIncep), and Xception (CRCHistoXcep). The explicit method of fine-tuning them on CRC histopathological images can serve as a reference for other CNN models. These architectures were selected for their complementary strengths: feature preservation, multi-scale processing, and efficiency in capturing complex image structures. The models were fine-tuned and evaluated using public, private, and integrated CRC datasets to ensure robustness. Furthermore, we investigated the impact of depth fine-tuning and feature randomization on performance and evaluated model generalization on internal, external, and unseen test sets.
The key contributions of our study are as follows: (1) A structured fine-tuning approach was applied to multiple pre-trained models (DenseNet121, InceptionV3, Xception), revealing performance dynamics specific to CRC histopathological image classification; (2) an analysis across internal, external, and unseen datasets revealed that fine-tuning enhanced the classification performance of CRC histopathological image models, with consistent improvements observed across all dataset sources; (3) we measured how different random initializations impact model performance by running multiple experiments, highlighting variability and ensuring reproducibility; (4) a comparative evaluation between baseline and fine-tuned models showed consistent and statistically significant improvements in CRC histopathological image classification; (5) a clinically curated dataset was developed to reflect CRC progression, incorporating key tissue types (adipose, muscle, and lymph node) to enhance the proposed models’ diagnostic relevance.
2. Related Works
Deep learning models excel in various medical imaging tasks because they automatically extract and learn intricate, hidden patterns within medical images, as shown by Ijaz et al. [20], enabling more accurate and efficient analysis. Transfer learning with pre-trained models has demonstrated potential in accurately classifying CRC tissue images. As a result, this section explores methodologies that employ deep learning techniques, with a focus on transfer learning strategies using CNNs. By examining recent studies, we highlight how these architectures have been effectively applied to CRC image classification, demonstrating advancements in leveraging pre-trained models to enhance diagnostic accuracy.
In the study by Tsai and Tao [21], CRC histopathological image classification was enhanced by utilizing transfer learning with pre-trained models, including AlexNet, SqueezeNet, VGGNet, GoogLeNet, and ResNet50. They used two public datasets of 5000 and 100,000 CRC images, along with 7180 external validation images. The data were split into training (70%), testing (15%), and validation (15%) sets. ResNet50 achieved a classification accuracy of 94.86%. Likewise, Al Shawesh and Chen [22] employed 100,000 CRC histopathological images for training and 7180 for validation. Using transfer learning with fine-tuning, they applied the ResNet50 model, initializing it with parameters trained on the ImageNet dataset and freezing all layers except the final ones. The model achieved a validation accuracy of 97.7%. Vidyun et al. [23] used transfer learning to fine-tune the VGG19 architecture, training only the top five layers while keeping the remaining 11 layers unchanged. They applied this approach to a public CRC histology image dataset of 5000 images, each annotated into eight classes and resized to 150 × 150 pixels. The fine-tuned model achieved an accuracy of 91.2%. Sarwinda et al. [24] utilized a public dataset of 165 benign and malignant tumor images, applying transfer learning with ResNet18 and ResNet50 architectures for classification. ResNet50 outperformed ResNet18, achieving accuracy rates between 73% and 88%, with sensitivity values ranging from 64% to 96%. Tasnim et al. [25] used the MobileNetV2 model on a public dataset of colon tissue images, achieving 99.67% accuracy with an 80/20 train–test split.
Several studies have investigated customized CNN architectures for the classification of CRC histopathological images. For example, Ibrahim et al. [26] developed a CNN model for classifying CRC images using a public dataset of 2500 images resized to 64 × 64 pixels. The architecture consisted of two convolutional layers with 3 × 3 filters and a fully connected layer with softmax activation, achieving an accuracy of 83%. Kumar et al. [27] explored multiple CNN architectures on a public dataset of normal colon and colon adenocarcinoma images, with the top-performing model reaching an accuracy of 99.40%. However, because the dataset consists primarily of normal colon and colon adenocarcinoma images, its limited diversity may restrict the model’s ability to generalize to variations commonly encountered in clinical practice.
Transfer learning using ResNet152, ResNet50, and VGG16 architectures was employed in the study by Naga Raju and Rao [28] to classify CRC images from a publicly available dataset of 5000 images. The dataset was split into 60% training, 30% testing, and 10% validation. The metrics indicated that ResNet152 delivered the highest accuracy at 98.38%, followed closely by ResNet50 at 97.08% and VGG16 at 96.16%. To classify colorectal cancer categories, Gupta et al. [29] utilized transfer learning with five pre-trained CNN architectures: ResNet50, Inception V3, VGG16, VGG19, and ResNet152V2. They experimented with a publicly available dataset of 5000 images, split into 70% training, 15% validation, and 15% testing. Inception V3 outperformed the other models, achieving the highest accuracy of 89.87%. This demonstrates the potential of transfer learning in CRC classification, though the relatively low accuracy suggests limitations in the dataset’s diversity and the potential for further improvement in model performance.
Transfer learning with fine-tuned VGG16 and MobileNetV2 has been used by Parelanickal et al. [30] for colon cancer tissue classification, utilizing an open-source dataset of 7200 histopathological images categorized into nine classes. The dataset was divided into 60% training, 20% testing, and 20% validation. MobileNetV2 achieved 97% accuracy, while VGG16 achieved 95%. Abhishek et al. [31] applied transfer learning with ResNet34 and EfficientNetB4 to classify CRC using a public dataset of 5000 histopathological images across eight tissue classes. ResNet34 achieved an accuracy of 99.97%, while EfficientNetB4 reached 99.8%. Despite the high accuracy, reliance on a single public dataset raises concerns about dataset diversity, which could limit the models’ generalizability to real-world settings. Furthermore, as highlighted by Davila et al. [32], fine-tuning in medical image classification remains a challenging task, requiring careful model selection to balance complexity, accuracy, data availability, and computational efficiency.
Although notable progress has been made in CRC classification using deep learning, several critical gaps remain unaddressed in the literature. First, most studies lack rigorous statistical validation [22,23,27], often reporting only accuracy without confirming significance across experimental variations. Second, dataset bias remains a persistent issue, as public datasets offer limited diversity, thereby restricting model generalizability. Third, studies such as [24,26,28,31] offer insufficient methodological exploration of how the depth and architecture-specific characteristics of fine-tuning affect model learning, insights that could make the fine-tuning of other CNN models more reliable. Fourth, model interpretability is frequently absent or insufficient, which limits trust in real-world clinical settings. To address these gaps, our study investigates the impact of transfer learning and fine-tuning techniques on improving CRC histopathological image classification. We established a structured fine-tuning process, which led to consistent and statistically significant performance improvements across internal, external, and unseen datasets. Additionally, a clinically curated dataset was developed, reflecting the progression of CRC through the inclusion of key tissue types, further enhancing the relevance of the models. Visualization was integrated to improve the interpretability of the proposed models, facilitating an understanding of the features that drive classification decisions. These advancements strengthen the potential for refining deep learning models to improve medical diagnostics in CRC.
3. Dataset and Preprocessing
We utilized diverse datasets, with the primary input features consisting of RGB image patches extracted from colorectal histopathology slides. These patches served as the foundational data for model training and evaluation, capturing tissue-level characteristics critical for accurate classification. To ensure clinical relevance, the datasets were curated in consultation with the Pathology Department, focusing on representative tissue types associated with CRC progression, specifically adipose, muscle, and lymph node. The curation strategy was uniformly applied across all datasets, including private, public, and integrated sources, to enhance the generalizability and real-world applicability of the proposed models.
The datasets include Dataset 1 (private source: Pathology Department, the Royal Hospital, the Sultanate of Oman), Dataset 2 (public source: [33]), and Dataset 3 (a combination of public and private sources), obtained from three public datasets, “CRC-VAL-HE-7K”, “NCT-CRC-HE-100K”, and “Colorectal Histology MNIST” [33,34], and the private dataset.
Figure 1 illustrates adipose, muscle, and lymph node samples from Dataset 1, while Table 1 presents the distribution of CRC histopathological images across all used datasets.
Hematoxylin and eosin dyes, which are commonly used for staining tissue samples and aid pathologists in the histopathological analysis essential for accurate CRC diagnosis [35], were also applied in this study. As part of the data collection process, we employed preprocessing methods to improve the quality and usability of these stained images. All datasets underwent preprocessing steps such as resizing images to 224 × 224 pixels and normalizing pixel values for compatibility with pre-trained models. The private dataset consisted of stained histopathological images scanned at magnifications of 20×, 40×, and, in a few cases, 60×, depending on the tissue slide. The original images were captured at a resolution of 1920 × 1080 pixels. During preprocessing, WSIs containing artifacts were excluded to ensure the quality of the data. Representative tissue regions were manually selected using QuPath software [36], version 0.5.0, and then cropped into fixed-size patches of 224 × 224 pixels to ensure consistency and compatibility across datasets for learning and performance validation. The data were allocated as 65% for training, 15% for validation, and 20% for testing. These steps were designed to improve computational efficiency, focus on critical regions, and enhance the accuracy and reliability of CRC histopathological image classification.
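To make the preprocessing concrete, the sketch below shows how patches can be loaded, resized to 224 × 224 pixels, and normalized with TensorFlow Keras (the library used in this study, as noted in Section 5); the directory layout and function choices are illustrative assumptions rather than our exact pipeline.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

# Load patches from a class-per-subdirectory layout (hypothetical path);
# labels are one-hot encoded for use with categorical cross-entropy.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "crc_patches/train",
    image_size=IMG_SIZE,        # resize every patch to 224 x 224
    batch_size=BATCH_SIZE,
    label_mode="categorical",
)

# Normalize pixel values to [0, 1] for compatibility with pre-trained models.
rescale = tf.keras.layers.Rescaling(1.0 / 255)
train_ds = train_ds.map(lambda images, labels: (rescale(images), labels))
```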
5. Experimental Setup
In our experiment, the training process began with pre-trained models as the baseline, followed by fine-tuning on the CRC datasets presented in Table 1. The proposed models were developed and trained using the TensorFlow Keras library with Python version 3.10.13. All experiments were conducted on a Dell laptop equipped with a 13th-generation Intel Core i9-11390HX processor and an NVIDIA GeForce RTX 4060 GPU, sourced in Muscat, Oman. We initially evaluated the proposed models on various datasets, assessing their performance using multiple metrics, as outlined in Equations (1)–(7). In addition, each model was applied to the integrated dataset, which combines diverse CRC histopathological images, to evaluate the impact of depth fine-tuning with various feature initializations and ensure robust and reliable results. Furthermore, statistical significance tests were used to assess the hypothesis that depth fine-tuning affects the performance of the proposed models. The statistical tests employed repeated-measures ANOVA with 95% confidence intervals and the Shapiro–Wilk test [43] to assess the normality of the data distribution. These tests evaluated the reliability of the results across different runs with varying initializations. In addition, validation was performed on external sources and unseen data to assess the models’ generalization capability.
In Cohen’s kappa, κ = (p_o − p_e)/(1 − p_e), p_o is the observed agreement between raters, denoted by the accuracy, while p_e is the expected accuracy, represented by the chance agreement.
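As a quick illustration of the formula, the following toy computation (using scikit-learn, which is an assumption; the study’s own metric implementation is not specified) shows how kappa discounts chance agreement:

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels over the three tissue classes; values are illustrative only.
y_true = ["adipose", "muscle", "lymph", "adipose", "muscle", "lymph"]
y_pred = ["adipose", "muscle", "lymph", "adipose", "lymph",  "lymph"]

# Here p_o = 5/6 (five of six patches agree) and p_e = 1/3 from the
# marginal label frequencies, so kappa = (5/6 - 1/3) / (1 - 1/3) = 0.75.
print(cohen_kappa_score(y_true, y_pred))  # 0.75
```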
As detailed in Table 2 and illustrated in Figure 2, the proposed models begin by leveraging pre-trained models, which are fine-tuned for classifying CRC images. To adapt the pre-trained models, we employed global average pooling to reduce the feature maps to a single feature vector for the dense layer. A dropout layer is incorporated to combat overfitting, while batch normalization is applied to stabilize and accelerate the training process, improving the models’ generalization capability on the CRC classification task. Each model employs a distinct configuration tailored to enhance the performance of CRC histopathological image classification, as illustrated in Table 3. The hyperparameter settings for all proposed models included an initial learning rate of 1 × 10⁻⁴ with the Adam optimizer [44], a categorical cross-entropy loss function, a batch size of 32, and training for 20 epochs, with early stopping to mitigate overfitting.
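A minimal sketch of this head in TensorFlow Keras is shown below, using DenseNet121 as the backbone; the dropout rate and early-stopping patience are assumptions, since the exact per-model configurations are given in Table 3 rather than restated here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crchisto_model(num_classes=3):
    # Pre-trained backbone without its ImageNet classification top.
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3)
    )
    x = layers.GlobalAveragePooling2D()(base.output)  # feature maps -> one vector
    x = layers.BatchNormalization()(x)                # stabilize/accelerate training
    x = layers.Dropout(0.5)(x)                        # assumed rate; combats overfitting
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(base.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Training for up to 20 epochs with early stopping (patience is an assumption):
# model.fit(train_ds, validation_data=val_ds, epochs=20,
#           callbacks=[tf.keras.callbacks.EarlyStopping(
#               patience=3, restore_best_weights=True)])
```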
In our study, we introduce layer-depth fine-tuning, tailored for CRC histopathological image classification. Our approach evaluates the impact of tuning different portions of each model architecture. The proposed method includes (i) a controlled scheme for selectively unfreezing layer blocks within pre-trained models, (ii) a comprehensive evaluation using repeated statistical tests over multiple randomized runs to ensure performance consistency, and (iii) validation across internal, external, and integrated datasets to assess generalization under domain shift. We integrate interpretability through Grad-CAM visualizations and expert pathologist review to confirm that model attention aligns with clinically relevant regions. This approach contributes to a reproducible and domain-adapted strategy for improving model adaptation in CRC histopathology datasets.
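The core of the unfreezing scheme can be expressed in a few lines; the sketch below, assuming a Keras model such as the one above, freezes all but the last n layers and recompiles so the change takes effect. The depth values mirror those evaluated in Section 6 (e.g., 20, 50, 100, or 150).

```python
import tensorflow as tf

def set_finetune_depth(model, n_trainable):
    """Freeze every layer except the last `n_trainable`, then recompile."""
    for layer in model.layers[:-n_trainable]:
        layer.trainable = False
    for layer in model.layers[-n_trainable:]:
        layer.trainable = True
    # Recompiling is required for trainability changes to take effect.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

# Example: fine-tune only the last 50 layers (the depth that proved
# most effective for CRCHistoDense in our experiments).
# set_finetune_depth(model, 50)
```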
6. Results and Discussion
To address our research objective, we applied a fine-tuning protocol across several CNN architectures. Rather than relying on ad hoc training, our approach varied the number and depth of unfrozen layers in a structured, layer-wise manner while standardizing optimizer settings, learning rates, and training epochs. The design isolates the effect of fine-tuning depth on model performance, enabling a controlled investigation into how different levels of representational adaptation impact CRC histopathological image classification. Furthermore, by applying such a strategy across internal, external, and integrated datasets, we ensured that observed improvements are attributable to effective feature transfer rather than dataset-specific fitting, an essential consideration in medical image analysis under domain shift.
In our study, we assessed the performance of the baseline models (DenseNet121, InceptionV3, and Xception) and the fine-tuned models (CRCHistoDense, CRCHistoIncep, and CRCHistoXcep) on the CRC datasets, comparing average F1-scores to evaluate their effectiveness in classifying CRC histopathological images. We selected the F1-score for its balance between precision and recall, which is vital in CRC image classification, as it reduces both missed diagnoses and unnecessary treatments. Fine-tuning consistently improved F1-scores across datasets, with the gains confirmed as statistically significant by a paired Wilcoxon signed-rank test. The proposed models achieved a significantly higher average F1-score (mean = 0.993, 95% confidence interval [0.990, 0.996]) than the baseline models (mean = 0.986, 95% confidence interval [0.979, 0.993]); Wilcoxon W = 28.0, p = 0.018, with a large effect size. The result confirms that fine-tuning based on the model’s architectural characteristics is an effective strategy for improving model performance, thereby enhancing precision and recall in CRC histopathological image classification.
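For reference, the paired comparison can be reproduced with SciPy as sketched below; the F1-score values are illustrative placeholders, not the study’s actual results.

```python
from scipy.stats import wilcoxon

# Paired average F1-scores per experimental condition (illustrative values).
baseline_f1  = [0.979, 0.983, 0.985, 0.987, 0.988, 0.990, 0.991]
finetuned_f1 = [0.990, 0.991, 0.992, 0.993, 0.994, 0.995, 0.996]

# Two-sided paired Wilcoxon signed-rank test on the per-condition differences.
stat, p_value = wilcoxon(finetuned_f1, baseline_f1)
print(f"W = {stat}, p = {p_value:.3f}")
```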
The fine-tuned models maintained consistently high performance across three distinct CRC histopathological datasets, reflecting their adaptability to data variation. As shown in Figure 3, the average classification metrics remained strong across all architectures, indicating that fine-tuning enabled effective feature extraction regardless of dataset source.
Figure 4 shows the accuracy and loss curves of the proposed models across the private, public, and integrated datasets, highlighting stable convergence and consistent generalization. CRCHistoDense exhibited the most stable learning behavior, with consistent convergence and minimal fluctuations, indicating robust generalization across diverse data sources. Although CRCHistoXcep marginally surpassed CRCHistoIncep in final accuracy, both followed comparable convergence patterns. Notably, CRCHistoDense and CRCHistoXcep required more training epochs to reach convergence, reflecting a deeper representational adaptation through fine-tuning. The smooth decline in loss across all models further confirms the stability of training and the consistency of optimization. These trends highlight the effectiveness of our fine-tuning strategy in achieving balanced generalization across diverse CRC datasets. Such integrated analysis is essential for rigorous assessment and informed decision-making in advanced deep learning applications.
To empirically assess the impact of depth fine-tuning on model performance in CRC histopathological image classification, we employed an experimental design in which each proposed architecture was fine-tuned at multiple, predefined layer depths. For each depth setting, training was repeated 15 times with random seeds to capture variance due to initialization and sampling effects. Classification performance was assessed using test accuracy via repeated-measures ANOVA, which allowed us to evaluate the impact of fine-tuning depth while accounting for intra-model variability. This approach confirmed statistically significant improvements (p < 0.05) with deeper tuning across all models, supporting our hypothesis that depth fine-tuning influences feature adaptation and generalization. CRCHistoDense showed notable performance when the last 50 layers were fine-tuned (p = 0.028), while CRCHistoIncep benefited most from fine-tuning the last 100 and 150 layers (p < 0.05). For CRCHistoXcep, improved performance was achieved when the last 20 layers were fine-tuned, with statistically significant results. These findings, consistently replicated over multiple randomized trials (Table 4), highlight the critical importance of model-specific fine-tuning depth selection as a key factor in stabilizing training dynamics and enhancing cross-domain generalization.
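A sketch of this statistical protocol using SciPy and statsmodels follows; the synthetic accuracies stand in for the 15 seeded runs per depth, so the numbers are assumptions and only the analysis structure mirrors the study.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro
from statsmodels.stats.anova import AnovaRM

# Synthetic test accuracies: 15 seeded runs at each fine-tuning depth.
rng = np.random.default_rng(42)
runs = pd.DataFrame([
    {"seed": seed, "depth": depth,
     "accuracy": 0.975 + 5e-5 * depth + rng.normal(0.0, 0.004)}
    for depth in (20, 50, 100, 150)
    for seed in range(15)
])

# Shapiro-Wilk normality check on the accuracy values.
print(shapiro(runs["accuracy"]))

# Repeated-measures ANOVA: depth as the within-subject factor, seed as subject.
print(AnovaRM(runs, depvar="accuracy", subject="seed", within=["depth"]).fit())
```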
To evaluate the generalizability of the proposed fine-tuned models, we performed cross-dataset validation using external and unseen data sources (Table 5). On the “CRC-VAL-HE-7K” dataset, model performance remained consistent across internal, external, and unseen splits, demonstrating strong generalization when exposed to datasets with similar underlying distributions. In contrast, when the private dataset (Dataset 1) was used as the primary evaluation source, the models exhibited a noticeable performance drop when tested on external (public) and unseen samples from the same private source (672 samples). This decline reflects the increased complexity and heterogeneity of real-world histopathological images. The Kruskal–Wallis test [45] confirmed that the differences among internal, external, and unseen sets were statistically significant (p = 0.003), with post hoc pairwise analysis showing meaningful drops from internal to both external and unseen data (p = 0.011). However, there was no notable difference between the external and unseen groups (p = 0.652), suggesting consistent generalization behavior in the presence of data variability. These findings highlight the practical challenge of domain shift in medical imaging and the effectiveness of our depth fine-tuning strategy. Specifically, the private dataset was curated in collaboration with clinical experts to reflect real-world diagnostic conditions, capturing adipose, muscle, and lymph node tissues under consistent imaging and preparation protocols. In contrast, the public datasets, although covering the same tissue types, were collected from independent sources with differing specimen handling practices and image resolution settings. These differences resulted in morphological variability and structural heterogeneity across CRC datasets. The consistent performance of the proposed models across internal, external, and unseen datasets highlights their robustness to such domain shifts and supports the generalizability of the depth fine-tuning strategy. Furthermore, the private dataset’s greater diversity served as a robust benchmark, revealing that model performance is sensitive to the complexity of the source data. This cross-dataset analysis confirms that the proposed models enhance reliability and generalizability in CRC histopathological classification. They achieve strong internal performance and maintain statistically significant robustness across unseen and heterogeneous samples, a crucial requirement for real-world clinical settings.
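The cross-dataset comparison follows the same pattern, sketched below with SciPy; the scores are illustrative, and Mann–Whitney U is shown as an assumed post hoc pairwise test, since the text does not name one.

```python
from scipy.stats import kruskal, mannwhitneyu

# Illustrative per-run scores for the three evaluation settings.
internal = [0.990, 0.991, 0.988, 0.992, 0.989]
external = [0.951, 0.944, 0.958, 0.949, 0.946]
unseen   = [0.947, 0.952, 0.945, 0.950, 0.948]

# Omnibus test across the three groups.
h_stat, p = kruskal(internal, external, unseen)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p:.4f}")

# Post hoc pairwise comparisons (apply a multiple-comparison correction in practice).
for name, group in (("external", external), ("unseen", unseen)):
    print("internal vs", name, mannwhitneyu(internal, group))
```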
While the proposed models’ outcomes are promising, we recognize the importance of evaluating potential sources of bias. The high results suggest effective model performance but also require careful interpretation, given the risk of overfitting in deep networks, especially with relatively limited or imbalanced medical datasets. We addressed this concern through structured fine-tuning and testing on diverse datasets. Nonetheless, future work should validate performance on larger colorectal cancer datasets and explore model calibration to ensure reliability in clinical deployment. Overall, the proposed models significantly improve CRC image classification while maintaining robustness and adaptability.
A key goal of our study is to enhance the performance and interpretability of the proposed CRC classification models that leverage three core characteristics: (1) depth-specific fine-tuning across pre-trained architectures to optimize feature adaptation to histopathological patterns, guided by statistical significance testing; (2) dataset curation based on clinical relevance, incorporating diverse tissue types (adipose, muscle, lymph node) and informed by pathologist input to better reflect CRC progression; and (3) model interpretability integration using Grad-CAM, enabling visualization of discriminative regions and validating model attention against expert-annotated tissue features.
Most existing CRC classification studies adopt off-the-shelf models with minimal architectural modification and apply standard fine-tuning strategies, typically freezing early layers or fine-tuning the entire network, without explicitly analyzing the impact of tuning depth. Our approach diverges by introducing a depth-aware fine-tuning strategy that explores intermediate adaptation levels. We selectively unfroze layers across different model blocks to evaluate their specific contributions to feature alignment and cross-domain generalization, particularly in the presence of morphological variability across private, public, and integrated CRC datasets. Repeated-measures ANOVA evaluations guided this fine-grained strategy to quantify the learning gains associated with each adaptation depth. We aligned the overall performance trends with Grad-CAM visualizations, which confirmed that models with deeper tuning yielded more coherent and clinically relevant attention patterns. We further observed that greater adaptation depth was necessary to accommodate semantic shifts across CRC datasets, but this could be controlled effectively using targeted learning rates and early stopping. These insights demonstrate that structured depth-aware fine-tuning can improve both generalization and diagnostic relevance in CRC classification.
Building on this interpretability focus, Grad-CAM [46] visualizes the spatial focus of the proposed models during prediction. These visualizations provided insight into the regions influencing classification decisions, enabling us to assess whether performance improvements through fine-tuning were also accompanied by meaningful diagnostic reasoning (Figure 5c). To validate these interpretability outputs, we collaborated with pathologists who reviewed the heatmaps across different CRC classes. Their feedback confirmed that the highlighted regions corresponded to CRC histological findings, supporting the clinical coherence of the models’ attention. This validation confirms that the improved performance of the models is both statistically significant and clinically relevant. Our interpretability pipeline addresses the transparency of deep learning models by providing visual justifications for the predictions of the proposed models and integrating clinical expert feedback. The outcomes provide evidence that is both technically sound and practically valuable, prioritizing approaches that offer accuracy together with explanatory depth.
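For completeness, a minimal Grad-CAM sketch in Keras is given below, following the standard formulation of [46]; the target layer name depends on the backbone (e.g., "conv5_block16_concat" in Keras’ DenseNet121) and is an assumption here.

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    """Return a [0, 1] heatmap of the regions driving a prediction."""
    grad_model = tf.keras.models.Model(
        model.input,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[tf.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)         # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # global-average-pooled grads
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)  # keep positive, scale to [0, 1]
    # Upsample to 224 x 224 and overlay on the patch for visualization.
    return cam.numpy()
```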
To ensure the robustness and generalizability of our proposed fine-tuning strategy, we evaluated it across three widely used histopathology datasets (CRC5000, CRC7180, and LC25000). The objective of using these datasets was not simply to replicate single-dataset benchmarks but rather to demonstrate that our approach consistently improves performance across diverse data distributions and collection settings. In contrast to prior studies that relied on single datasets without addressing model transferability, our approach evaluated whether depth fine-tuning generalizes across internal, external, and heterogeneous datasets. This multi-dataset design enabled us to assess model stability and highlight dataset-specific challenges. Comparative results in Table 6 confirm that our proposed models achieve competitive performance, particularly when compared to studies using the same datasets, many of which lack explanations of the training protocol, statistical validation, or interpretability.
In the evolving landscape of deep learning in digital pathology, innovation does not rest solely on introducing novel architectures but also on effectively adapting and validating existing ones for real clinical impact. This study addresses that challenge by exploring depth fine-tuning strategies across established CNNs, DenseNet121, InceptionV3, and Xception, tailored for CRC histopathological image classification. While these models are not the newest in the field, their architectural stability, parameter efficiency, and well-characterized training behavior make them highly suitable for clinical tasks that require interpretability, generalizability, and reproducibility. Our objective was to investigate how depth fine-tuning and transfer learning strategies influence model performance and adaptability across various datasets. As highlighted by [47], fine-tuning strategies in histopathology remain underdeveloped, and task adaptation has not yet been systematically explored. This study also raises the question of whether transfer learning architectures optimized for natural images generalize effectively to medical domains. Our findings confirm that selective fine-tuning, rather than full retraining or shallow transfer, consistently enhances classification accuracy and robustness, even across heterogeneous datasets with varied staining protocols and imaging conditions. To address concerns regarding model relevance, we also evaluated EfficientNetV2B0 under various fine-tuning configurations (50, 160, and 200 layers), using the Adam optimizer (learning rate = 1 × 10⁻⁴), a dropout rate of 0.5, and a dense layer of 512 units. Despite its recent design, its performance (test accuracy ranging from 66.56% to 74.96%) was suboptimal across datasets. Resource limitations similarly hindered attempts to explore ConvNeXtTiny. As [48] emphasized, the success of deep learning in clinical applications depends not only on newer models but also on the ability to fine-tune existing ones efficiently and interpretably, especially in resource-constrained environments. Crucially, we integrated Grad-CAM-based visual interpretability, validated by expert pathologists, to ensure that the fine-tuned models not only performed well statistically but also localized diagnostically relevant regions. This interpretability validation reinforces the clinical reliability of our approach and addresses the gap between AI predictions and pathological decision-making, a step still lacking in many deep learning studies in histopathology.
We compared our models against prior studies using the same datasets. While a few works report slightly higher accuracy, many do not detail the fine-tuning strategies, optimization settings, or generalization capacity under domain shift. In contrast, our study contributes a structured and reproducible fine-tuning framework, validated statistically and tested across diverse CRC datasets. The analysis demonstrates that performance gains are critically shaped by how depth fine-tuning aligns with dataset variability. The observed consistency across internal, external, and unseen data, alongside clinically relevant interpretations, underscores the robustness and adaptability of our proposed configurations. These contributions offer valuable insights essential for advancing AI-assisted diagnostics in CRC histopathology.