1. Introduction
Mammographic breast density refers to the visual appearance of the breast on an X-ray mammogram and relates to the relative amount of fibroglandular versus adipose tissue [
1]. Breast tissue with a high relative abundance of fibroglandular tissue appears white on mammograms, while breast tissue with a high relative abundance of adipose tissue appears dark. Breast tissue that appears mostly white on a mammogram is considered extremely dense, and breast tissue that appears mostly dark is considered non-dense or mostly fatty. Mammographic breast density is a strong independent risk factor for breast cancer; compared with females with mostly fatty breasts, females who have extremely dense breasts have a 4- to 6-fold elevated risk of breast cancer when body mass index and age are matched [
2,
3]. Understanding the biological mechanisms that regulate mammographic breast density and breast cancer risk has the potential to provide new therapeutic opportunities for breast cancer prevention [
4].
Mammographic breast density is a radiological finding extracted from a 2-dimensional mammogram image. It refers to the overall whiteness of the image, although there can be significant heterogeneity in mammographic density within an individual breast [
5]. Mammographic breast density can be classified through the Breast Imaging Reporting and Data System (BIRADS), which defines four categories through visual and subjective classification by the radiologist [
6]. There are also established quantitative methods for classifying mammographic breast density based on the integration of texture, gray-level information [
7], and image digitization. Both current qualitative and quantitative approaches, however, come with limitations: the former is less reproducible, and the latter tends to exaggerate the extent of density [
2]. Deep learning techniques have been successfully applied to classify mammographic breast density on mammograms [
8] and ultrasound images [
9].
Research on the underlying biology of mammographic breast density is hampered by the difficulty of defining the mammographic density of small tissue samples, which may not represent the density of the whole breast. To overcome this limitation, fibroglandular breast density has been used for research purposes as a surrogate measure for mammographic density [
4]. Unlike mammograms, which are used as a non-invasive technique for breast cancer diagnosis and breast density analysis, H&E-stained sections are derived from tissue samples taken during biopsy or surgery, making them invaluable sources for more detailed analysis [
10]. Classified H&E-stained sections enable researchers to investigate the biological mechanisms underlying mammographic breast density by comparing AI-classified high- and low-density breast tissues. Discovering novel biomarkers and pathways involved in mammographic breast density could lead to novel therapeutic and prevention approaches for breast cancer.
To the best of our knowledge, no machine learning model has yet been developed for classifying fibroglandular breast density in H&E-stained sections [
11]. Convolutional neural networks (CNNs) have become the leading tools for classification tasks in computer-aided diagnostic systems for medical applications [
12]. Particularly, CNN-based approaches have been successfully employed for extracting characteristic features from histopathology images of breast parenchymal tissue [
10]. Previously, CNN-based models have been used to classify H&E-stained breast tissue samples based on tumor type [
11]. MobileNet-v2, a specific CNN architecture, has demonstrated promising outcomes in medical image classification [
MobileNet-v2 is pre-trained on millions of images from the ImageNet dataset, enabling it to perform effectively with limited data compared with other currently available CNN architectures [
14]. In contrast, vision transformers [
15] are a more recent class of neural networks that utilize a self-attention mechanism originally developed for natural language processing tasks but have since shown strong capability for image classification [
16], particularly for histopathology, ultrasound, and mammography images [
17]. ViTs are proving to be a valuable tool for a broad range of tasks including classification, object detection, and image segmentation [
18].
In this paper, deep learning algorithms have been developed for the classification of fibroglandular breast density in H&E-stained formalin-fixed paraffin-embedded (FFPE) sections of human breast tissue using a transferred and modified version of MobileNet-v2 and a ViT model. FFPE refers to a tissue preparation technique in which human samples are fixed in formalin for preservation and embedded in paraffin for detailed microscopic analysis. Using a standardized deep learning algorithm to classify H&E-stained sections would avoid subjective errors, provide a consistent approach, and enhance the robustness of data generated in this field.
2. Materials and Methods
This study received ethics approval from the Central Adelaide Local Health Network Human Ethics Research Committee (TQEH Ethics Approval #2011120) and the University of Adelaide Human Ethics Committee (#H-2014-175).
2.1. Tissue Processing
Women aged between 18 and 75 attending The Queen Elizabeth Hospital (TQEH) for prophylactic mastectomy or reduction mammoplasty were consented for the study. The tissue was confirmed as healthy non-neoplastic by the TQEH pathology department. The validation sample set was collected following informed consent from women undergoing breast reduction surgery at the Flinders Medical Centre, Adelaide, SA. Breast tissue was dissected into small pieces using surgical scalpel blades. Breast tissue was then fixed in 4% paraformaldehyde (Sigma-Aldrich; 3050 Spruce Street, St. Louis, MO 63103, USA, Cat# P6148) for 7 days at 4 °C, washed twice in PBS (1X), and transferred to 70% ethanol until further processing. Tissue was processed using the Excelsior tissue processor (Thermo Fisher Scientific; 168 Third Avenue, Waltham, MA 02451, USA) followed by the dehydration, clearing, and embedding protocol: incubation in 70%, 80%, and 90% ethanol for an hour each, followed by incubation in 100% ethanol with 3 changes, 1 h each, and xylene with 3 changes, 1 h each. Finally, tissue was infiltrated with paraffin wax with 3 changes, 1 h each. The resulting formalin-fixed paraffin-embedded (FFPE) tissue blocks were stored at room temperature before sectioning.
2.2. Hematoxylin and Eosin (H&E) Staining
Five-micrometer sections were cut from FFPE blocks using a microtome (Leica Biosystems; 495 Blackburn Road, Mount Waverley, VIC, Australia). These sections were then floated onto a warm (42 °C) water bath and transferred to super adhesive glass slides (Trajan Series 3 Adhesive microscope slides, Ringwood, Victoria, Cat#473042491). The slides were incubated at 37 °C overnight until fully dry. Sections were dewaxed through three changes in xylene (Merck Millipore, Frankfurter Str. 250, Darmstadt, Germany; Cat# 108298) and rehydrated through a gradient of 100%, 95%, 70%, and 50% ethanol, followed by distilled water. Tissue sections were stained with hematoxylin (Sigma Aldrich, St. Louis, MO, USA; Cat#HHS16) for 30 s and eosin (Sigma Aldrich, St. Louis, MO, USA; Cat#318906) for 5 s. Slides were then dehydrated with 100% and 95% ethanol and cleared with two changes in xylene. The tissue slides were then mounted using a mounting medium (Proscitech; 6/118 Wecker Road, Morningside, QLD, Australia; Cat#IM022). Finally, the stained slides were scanned using a digital Nanozoomer 2.0-HT slide scanner (Hamamatsu Photonics K.K. 325-6, Sunayama-cho, Higashi-ku, Hamamatsu-shi, Shizuoka, Japan, Adelaide, SA, Australia) with a 40X objective lens, generating high-resolution (0.23 µm) images for computer-based analysis.
2.3. Fibroglandular Breast Density Score Classification
Tissue staining was performed by multiple laboratory specialists over several years, which may result in variations in staining intensity across the images (
Figure 1). The staining protocol and reagents used remained consistent throughout this study. These images were used to set up the training and test database.
A validation study was performed using H&E-stained human breast tissue independently collected from a different laboratory that was not part of the training and test dataset. These de-identified H&E-stained breast tissue sections are described in the validation section of the Results.
Each patient had an average of 10 tissue blocks. One tissue section was assessed in each FFPE tissue block. In total, 965 images were collected from 93 patients. A panel of scientists (HH, LH, WI) classified each image semi-quantitatively. The panel reached a consensus on density through discussion. Higher density scores were assigned to sections containing a greater percentage of stroma and epithelium and a smaller amount of adipose tissue. The fibroglandular density classification scale was defined by Archer [
5] and demonstrated a correlation with mammographic breast density in tissue samples obtained by X-ray image-guided biopsy [
19]. The classification scale assigned each tissue sample a number from 1 to 5, where 1 represented 0–10%, 2 represented 10–25%, 3 represented 25–50%, 4 represented 50–75%, and 5 represented >75% fibroglandular tissue (
Figure 1).
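As an illustration, the scale above corresponds to the following mapping from the percentage of fibroglandular tissue in a section to its density class; how values falling exactly on a boundary are handled is an assumption, as only the ranges are specified.

```python
# Illustrative mapping from percentage of fibroglandular tissue to the 1-5
# density class. Boundary handling is an assumption; only the ranges
# 0-10%, 10-25%, 25-50%, 50-75% and >75% are specified in the text.
def density_class(fibroglandular_pct: float) -> int:
    if fibroglandular_pct <= 10:
        return 1
    if fibroglandular_pct <= 25:
        return 2
    if fibroglandular_pct <= 50:
        return 3
    if fibroglandular_pct <= 75:
        return 4
    return 5
```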
2.4. Image Pre-Processing
A total of 965 high-resolution original images were generated from the human breast tissues. All H&E-stained images were resized to 224 × 224 pixels. Then, for processing and manipulation, the images were converted into an array format and stored in a data frame. Libraries including scikit-learn 0.23.2, Pandas 1.5.3, and NumPy 1.23.5 were used for pre-processing the images. Diverse data augmentation techniques were applied to improve model generalization and expand the training dataset [
20]. These techniques included horizontal flipping, vertical flipping, and rotating each image by 0, 90, 180, and 270 degrees. As this work focuses on the breast density classification of tissue sections and not on the density classification of the whole breast, the choice of rotation angles for data augmentation is not restricted and can extend beyond small magnitudes. However, for density classification of whole-breast mammograms, it is recommended to limit the rotation angle to 0–15 degrees, as higher rotations might alter the biological relevance and spatial relationships between fundamental tissue structures. It should be noted that cropping an H&E-stained image changes its breast density score; thus, if cropping is used as an augmentation method, it requires a fibroglandular breast density re-assessment.
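A minimal sketch of the resizing and flip/rotation augmentation described above is given below; the file paths, DataFrame layout, and use of Pillow are illustrative assumptions rather than the exact pipeline used in this study.

```python
# Sketch of the image resizing and flip/rotation augmentation described above.
# File paths and the DataFrame layout are illustrative assumptions.
import numpy as np
import pandas as pd
from PIL import Image

def load_and_resize(path, size=(224, 224)):
    """Load an H&E-stained image and resize it to 224 x 224 pixels."""
    return np.asarray(Image.open(path).convert("RGB").resize(size))

def augment(image):
    """Return rotated (0/90/180/270 degrees) and flipped copies of one image."""
    variants = [np.rot90(image, k) for k in range(4)]    # 0, 90, 180, 270 degrees
    variants += [np.fliplr(image), np.flipud(image)]     # horizontal and vertical flips
    return variants

# Build a DataFrame of image arrays and density labels (hypothetical file list).
records = []
for path, label in [("example_density3.png", 3)]:
    for arr in augment(load_and_resize(path)):
        records.append({"image": arr, "density_class": label})
df = pd.DataFrame(records)
```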
Medical images from many individuals come with intrinsic imbalance, where some classes of fibroglandular density are represented in the sample set more than others. To ensure each class is properly represented, the number of images in each class was balanced by implementing an undersampling strategy to reach an equal data distribution (
Figure 2) [
21]. Before balancing, class 3 had the largest number of samples, while density classes 1 and 5 were the minority classes. After undersampling, each class contained approximately 2000 samples, preventing potential bias toward a specific class.
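As an illustration, this undersampling step can be sketched with pandas; the column name and the target of roughly 2000 images per class follow the description above, while the rest is an assumption.

```python
# Sketch of random undersampling to equalize the five density classes.
# The column name and the ~2000-per-class target follow the description above;
# this is not the exact code used in the study.
import pandas as pd

def undersample(df, label_col="density_class", n_per_class=2000, seed=42):
    """Randomly keep at most n_per_class rows from each density class."""
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), n_per_class), random_state=seed))
          .reset_index(drop=True)
    )

balanced_df = undersample(df)   # df: DataFrame of images and labels from pre-processing
print(balanced_df["density_class"].value_counts())
```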
2.5. Deep Learning Model
We introduce a unique data augmentation approach specifically tailored to images of H&E-stained human breast tissue sections, offering significant benefits over traditional techniques used for mammography images. Unlike mammograms, where data augmentation is often restricted to small rotation angles (0–15 degrees) to preserve diagnostic relevance and the natural structure of the whole breast, our method allows rotations at any angle. Meanwhile, cropping is not suitable for human tissue sections, as it can change the composition of fibroglandular and adipose tissues, potentially altering the breast density score. To address variability in H&E-stained images arising from differences in laboratory protocols, we applied stain normalization as part of our analysis. In addition, to the best of our knowledge, no ViT model has yet been developed for breast density classification of H&E-stained human breast sections. Moreover, our modification of the MobileNet-v2 model is unique and specific to H&E images of human breast tissues: eight additional layers were added to the existing MobileNet-v2 model to tailor it specifically for breast density classification in histopathology images. All existing conventional deep learning models for breast density classification are limited to mammograms and are not applicable to H&E-stained tissue sections.
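As an illustration of the stain normalization step, one possible approach is a Reinhard-style color transfer in LAB space, sketched below; the specific normalization algorithm is not stated above, so this choice is an assumption for illustration only.

```python
# One possible stain normalization approach: Reinhard-style colour transfer in
# LAB space. The actual normalization method used in the study is not specified,
# so this is an illustrative sketch only.
import numpy as np
from skimage import color

def reinhard_normalize(source_rgb, target_rgb):
    """Match the LAB-space mean and std of a source image to a target image."""
    src = color.rgb2lab(source_rgb / 255.0)
    tgt = color.rgb2lab(target_rgb / 255.0)
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    tgt_mean, tgt_std = tgt.mean(axis=(0, 1)), tgt.std(axis=(0, 1))
    normalized = (src - src_mean) / (src_std + 1e-8) * tgt_std + tgt_mean
    rgb = color.lab2rgb(normalized)
    return (np.clip(rgb, 0, 1) * 255).astype(np.uint8)
```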
2.5.1. MobileNet-v2
This study implemented a convolutional neural network (CNN) with the MobileNet-v2 architecture, built using the sequential Keras API with a TensorFlow backend (version 2.12.0). MobileNet-v2 is a pre-trained CNN that is 53 layers deep [
14] and has been trained on over a million images from the ImageNet database [
22]. The model was developed and trained using TensorFlow 2.15.0.
In this application, the final fully connected layer of the MobileNet-v2 was excluded, allowing for adjustments based on the specific needs of our target application (breast density classification). Meanwhile, the initial layers of the model were fixed during training by freezing their parameters.
The GlobalAveragePooling2D layer was called to reduce the dimensionality and number of parameters by computing the average of each feature map as a single value. A BatchNormalization layer and a dropout layer with a rate of 30% were added after the global average pooling layer to normalize batches and prevent potential overfitting, respectively. An intermediate dense layer with 64 neurons and ReLU activation was added to account for nonlinearity and handle complex patterns. The final dense layer had 5 neurons with softmax activation to conduct the multi-class classification and calculate the probability of each fibroglandular breast density class (
Figure 3). The learning rate followed an exponential decay schedule, beginning at an initial value of 0.001 with a decay rate of 0.9. To optimize model weights, an Adam optimizer was used with a categorical cross-entropy loss function. An early stopping callback terminated the training process when the validation loss did not decrease for 10 consecutive epochs; otherwise, training could continue for a maximum of 200 epochs.
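A Keras sketch of the modified MobileNet-v2 head described above is shown below; details not stated in the text (input preprocessing, the number of decay steps, and freezing the whole backbone rather than only its initial layers) are assumptions.

```python
# Keras sketch of the modified MobileNet-v2 described above. Decay steps and
# full-backbone freezing are assumptions where the text is not specific.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                        # freeze pre-trained layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),          # average each feature map to one value
    layers.BatchNormalization(),
    layers.Dropout(0.3),                      # 30% dropout against overfitting
    layers.Dense(64, activation="relu"),      # intermediate non-linear layer
    layers.Dense(5, activation="softmax"),    # one probability per density class
])

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.9)  # decay_steps assumed
model.compile(optimizer=tf.keras.optimizers.Adam(lr_schedule),
              loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```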
2.5.2. Vision Transformer
Vision Transformer [
15] is an emerging type of deep neural network model based on transformer encoders with a self-attention mechanism [
15]. ViTs have shown stronger capabilities than previous models by using sequences of image patches to classify the full image [
16]. ViTs work by dividing an image into small fixed-size patches, which are linearly embedded and fed into a number of transformer encoders to extract image features.
As an alternative to the MobileNet-v2 model, a ViT model was developed using PyTorch version 2.3.1 and CUDA version cu118, allowing a graphics processing unit (GPU) to accelerate the training process. As ViT models demand large amounts of training data, we applied more extensive image augmentation to our database, including small adjustments to brightness, contrast, and saturation, plus random erasing. Random erasing removed a small portion of an image with a probability of 50%, covering between 1% and 7% of the image with an aspect ratio of 0.3 to 3.3, to resemble human technical errors. In this study, we set the patch size to 16, the model depth to 12, and the number of attention heads to 8. Patch and position embedding were implemented with an internal feature dimension of 64. The multi-layer perceptron (MLP) part had a hidden layer size of 64, and a dropout rate of 0.1 was applied to both the overall and embedding dropouts. An overview of the ViT model used for the classification of H&E-stained images of human breast tissue is shown in
Figure 4. For the loss function, we applied cross-entropy, and to optimize the model, an Adam optimizer with a learning rate of 0.001 was used. Moreover, we applied a learning rate scheduler, using the PyTorch StepLR scheduler, decaying the learning rate by a factor of gamma equal to 0.7. In this model, patience was set to 15, stopping the training process when the validation loss did not decrease for 15 consecutive epochs.
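A PyTorch sketch of the augmentation and ViT configuration described above is given below; the vit_pytorch implementation, the ColorJitter magnitudes, and the StepLR step size are assumptions, while the patch size, depth, number of heads, feature dimensions, and dropout rates follow the text.

```python
# PyTorch sketch of the ViT set-up described above. The vit_pytorch package,
# ColorJitter magnitudes, and StepLR step size are assumptions; patch size,
# depth, heads, dimensions, and dropouts follow the text.
import torch
from torch import nn, optim
from torchvision import transforms
from vit_pytorch import ViT   # assumed implementation; not named in the text

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),   # small adjustments
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.01, 0.07), ratio=(0.3, 3.3)),  # mimic technical artefacts
])

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViT(image_size=224, patch_size=16, num_classes=5,
            dim=64, depth=12, heads=8, mlp_dim=64,
            dropout=0.1, emb_dropout=0.1).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)   # step size assumed
```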
4. Discussion
Breast cancer is the most commonly diagnosed cancer in women and the incidence is rising across all age groups [
24,
25]. Current research that aims to reduce breast cancer risk and improve early detection increasingly uses data-intensive approaches, which rely on computational methods for analysis. Although this brings invaluable information, the generation of large amounts of data can make the information difficult to analyze. Deep learning, which uses artificial neural networks inspired by the human brain [
26], has emerged as a breakthrough approach to support medical research including single-cell transcriptomics, DNA sequencing, protein interactions, drug development, and disease diagnosis [
27,
28].
Early deep learning models relied on manually extracted features that were fed into the model. More recently, deep learning models have used networks pre-trained on large databases through a technique known as transfer learning [
29]. Models that use transfer learning benefit from a wide range of features learned from massive datasets. This makes them ideal for applications with limited data, such as medical image analysis [
30]. Medical image analysis for breast cancer research can benefit from this advancement in deep learning, which is preparing the way for using image analysis algorithms to detect and diagnose breast cancer. Of significance, deep learning approaches can reduce radiologist screening time by triaging the digital mammogram images most likely to require recall and further assessment [
31] and identify those women most at risk of a future breast cancer diagnosis [
32,
33]. Promising results from histopathological classification of breast biopsies suggest that deep learning could also be employed for quality control in breast cancer detection [
34,
35].
The application of deep learning in medical image analysis is not limited to breast cancer detection. Mammographic breast density classification reached radiologist-level accuracy through this advancement. Despite using standardized protocols and advanced digital imaging methods, operator miscalculation, variation in the operator’s perception, and a heavy workload can still lead to inaccuracies in mammographic density assessment [
25]. To minimize these errors, new automated methods of mammographic density measurement have been developed that use a computerized analysis algorithm that improves the consistency of results [
8,
23,
36]. Here, we use deep learning to develop an automated histology image analysis tool to classify fibroglandular breast density in H&E-stained FFPE tissue sections. Whilst this tool is relatively simplistic in comparison to other medical research applications of deep learning, it has the potential to provide a foundation for future applications in research image analysis.
This study investigated a CNN model with four different MobileNet-v2 architectures and several ViT models to classify fibroglandular breast density in H&E-stained human breast tissue samples. CNNs are the most mature and most common deep learning approach in medical image classification. They offer highly accurate automated feature extraction from various medical image sources, such as mammograms, X-rays, and histopathology images, and can classify images into multiple categories, which is well suited to this research [
37]. Moreover, employing pre-trained CNN models, including MobileNet-v2, is advantageous when dealing with limited data such as labeled medical images, as it minimizes the required training. MobileNet-v2 was chosen in particular for its efficiency in terms of computational cost and memory usage, making it suitable for resource-constrained environments. Its depthwise separable convolutions allow for a significant reduction in the number of parameters while maintaining competitive performance, which is crucial for real-time applications [
14].
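The parameter saving from depthwise separable convolutions can be illustrated with a small Keras comparison; the layer sizes below are arbitrary examples rather than layers taken from MobileNet-v2 itself.

```python
# Illustration of the parameter saving from depthwise separable convolutions.
# The layer sizes are arbitrary examples, not layers from MobileNet-v2.
import tensorflow as tf

inputs = tf.keras.Input(shape=(56, 56, 64))
standard = tf.keras.layers.Conv2D(128, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(128, 3, padding="same")(inputs)

print(tf.keras.Model(inputs, standard).count_params())    # 64*3*3*128 + 128 = 73,856
print(tf.keras.Model(inputs, separable).count_params())   # 64*3*3 + 64*128 + 128 = 8,896
```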
Vision Transformer (ViT), on the other hand, was selected for its superior ability to model long-range dependencies and capture global contextual information, which are essential for complex feature representations. ViT has demonstrated state-of-the-art performance in various vision tasks and offers advantages in handling diverse and complex data compared to traditional convolutional architectures [
38]. Unlike CNNs, ViTs dynamically learn relationships across the entire image, enabling them to adapt to complex medical image classification tasks. The self-attention mechanism helps the model focus on the most relevant parts of the image. ViTs are particularly well suited to large, high-resolution images with complex patterns [
38]. Furthermore, ViTs can be integrated with other models, such as CNNs, to create more powerful hybrid architectures, highlighting the importance of further research into ViT-based models [
39].
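As a minimal illustration of the self-attention mechanism discussed above, a scaled dot-product attention over a sequence of patch embeddings can be written as follows; the dimensions are arbitrary and this is not the exact ViT used in this study.

```python
# Minimal scaled dot-product self-attention over a sequence of patch embeddings.
# Dimensions are arbitrary; this is not the exact ViT model used in this study.
import torch
import torch.nn.functional as F

patches = torch.randn(1, 196, 64)            # 196 patch embeddings (14 x 14 grid), dim 64
w_q = torch.nn.Linear(64, 64, bias=False)
w_k = torch.nn.Linear(64, 64, bias=False)
w_v = torch.nn.Linear(64, 64, bias=False)

q, k, v = w_q(patches), w_k(patches), w_v(patches)
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)   # similarity between every pair of patches
weights = F.softmax(scores, dim=-1)              # attention weights, regardless of distance
attended = weights @ v                           # each patch aggregates context from all others
```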
Both models achieved a high level of accuracy in classifying fibroglandular breast density (ViT model 3: 0.94 and MobileNet Arc 1: 0.93). However, their performance was not identical. MobileNet-v2 using convolutional layers achieved success across each of the evaluated architectural configurations, except MobileNet-Arc 3. The use of depthwise separable convolutions and residual linear bottlenecks substantially reduces the number of parameters and training time [
40]. Additionally, it enables the model to learn effectively from smaller datasets and operate with lower computational resources, allowing deployment on mobile and embedded devices [
40]. ViTs, using multi-head self-attention, can learn more characteristic features and deliver a high accuracy score. Four different ViT models were evaluated, and almost all of them delivered satisfactory results. The ViT approach to analyzing images involves not just understanding individual image patches, but also the relationships between patches regardless of their distance within an image [
41,
42]. This allows ViTs to generalize well across advanced image-analysis tasks.
In agreement with other studies [
43], we found that ViT excels in classifying medical images. Rather than a large improvement in overall performance, the advantage of ViT lies in fewer incorrect predictions when classifying fibroglandular breast density in each class. However, ViT required a larger dataset and a longer training process. As a result, ViT has a high computational cost for training due to the intensive use of GPUs (graphics processing units). Both models almost perfectly classified fibroglandular breast tissue samples in classes 1 and 5, while the most challenging class was class 3, followed by classes 2 and 4. The challenge arises because the distinction between classes can be narrow. Most errors arise when the models classify data points that are close to the boundaries between neighboring classes. For instance, some images that truly belonged to class 3 were incorrectly classified as class 2 or 4, leading to increased prediction errors and diminished sensitivity for this class.
The limitations of this study are the relatively small sample size and the use of a private single-laboratory dataset. This limitation particularly impacts the performance of ViTs. We anticipate that with a larger and more diverse dataset, ViTs could achieve better overall performance, as they are inherently strong models. The lack of significant improvement of ViTs over MobileNet-v2 in this study is likely attributable to the constraints of the database. The dataset used in this study mainly consists of H&E-stained images from our laboratory. This limitation is particularly relevant in challenging classifications, such as class 3, where the model struggled to distinguish between classes due to insufficient diversity in the training data. To enhance robustness and discriminative ability, the models require training on larger datasets from different laboratories.
5. Conclusions
This research has developed deep learning models for the classification of fibroglandular breast density, implementing MobileNet-v2 and vision transformers. The MobileNet-Arc 1 and ViT 3 models, with accuracies of 0.93 and 0.94, respectively, were identified as the best-performing architectures. These results were validated by evaluating model performance on unseen H&E-stained sections prepared in another laboratory. The accuracy and F1-score of the deep learning models (both the ViT and MobileNet) decreased slightly from classes 1 and 5 to intermediate classes such as class 3, highlighting the inherent challenge in precisely defining class 3, which may include a mix of overlapping characteristics. After performing a comprehensive analysis, we found that ViT offers a slight performance improvement, although it requires a higher computational cost to achieve high accuracy, uses a larger number of parameters, and has a longer processing time. For large datasets where high accuracy is important, ViT models are recommended, as they generalize better while minimizing overfitting. However, when limited data are available, MobileNet-v2, which already has a considerable number of pre-trained parameters that allow effective learning from a small number of H&E-stained images, is preferred.