1. Introduction
The literature on breast cancer detection and diagnosis using image processing and artificial intelligence (AI) reveals significant advancements and varied methodologies, reflecting the ongoing evolution in this critical field. Central to these advancements is the integration of AI techniques, such as machine learning (ML) and deep learning (DL), which have significantly enhanced the accuracy and efficiency of breast cancer detection.
The article [
1] systematically reviews the application of image processing in breast cancer recognition, detailing advancements in detection, segmentation, registration, and fusion techniques. The authors emphasize the promising future of unsupervised and transfer learning in enhancing diagnostic accuracy and patient privacy protection. Similarly, Zerouaoui and Idri [
2] conducted a structured literature review, identifying deep learning as the predominant method for classification tasks in breast cancer imaging, with mammograms being the most extensively studied imaging modality. They highlight the importance of image preprocessing, feature extraction, and public datasets in improving diagnostic performance.
The early and accurate detection of breast cancer remains a critical challenge in medical diagnostics, with significant implications for patient outcomes. Advances in medical imaging and artificial intelligence (AI) have opened new avenues for enhancing the precision of tumor detection, particularly in mammographic imaging. SA Khan et al. [
3] conducted a comprehensive survey of medical imaging fusion techniques, emphasizing the strengths and limitations of various methods in improving diagnostic accuracy. Their work highlights key challenges, such as noise sensitivity, computational complexity, and the difficulty of preserving essential image details, which continue to impede the broader application of these techniques in clinical practice. Addressing these challenges is vital in developing more reliable and effective fusion methods for medical imaging.
Building on this foundation, SU Khan et al. [
4] explored the application of deep learning models for semantic segmentation in breast tumor detection. Through a comparative analysis, they identified the Dilation 10 (global) model as particularly effective, achieving high pixel accuracy in differentiating tumor regions in mammograms. However, their study also uncovered significant challenges, including dataset imbalance and the risk of over-segmentation, which can lead to false positives. These findings underscore the need for careful model selection, balanced datasets, and further refinement of AI-based methods to enhance the reliability and accuracy of early breast cancer detection.
Expert human knowledge is essential in traditional cancer image recognition paradigms. The process involves image segmentation, feature extraction, and the application of machine learning algorithms to these handcrafted features in order to develop predictive models. In contrast, deep learning offers an end-to-end solution that processes raw images directly. Deep learning systems use biologically inspired neural networks to transform data through multiple nonlinear layers, yielding progressively more abstract representations. This hierarchical approach enables the formation of complex, highly discriminative models, significantly enhancing the ability to classify cancerous images accurately. These studies underscore the pivotal role of advanced image processing and AI technologies in enhancing the early detection, diagnosis, and treatment of breast cancer. Integrating these technologies improves diagnostic accuracy and efficiency and holds promise for personalized medicine, ultimately aiming to improve patient outcomes and reduce mortality rates associated with breast cancer.
Despite significant advancements in the application of artificial intelligence (AI) and image processing techniques for breast cancer detection, several critical challenges remain unresolved. The current body of literature extensively documents the efficacy of machine learning (ML) and deep learning (DL) models, which have markedly improved the accuracy of breast cancer diagnostics. However, a persistent gap exists concerning the generalizability and robustness of these models when applied across a diverse range of medical imaging modalities. A predominant limitation within the existing research is the heavy reliance on single-modality datasets. This dependency constrains the performance of AI models, particularly when these models are deployed across various imaging modalities, such as mammography, ultrasound, magnetic resonance imaging (MRI), and histopathology [
5,
6]. The heterogeneity in image quality and the presence of modality-specific noise further exacerbate this issue, leading to variability in diagnostic outcomes and diminishing the models’ efficacy in clinical settings. In response to these identified gaps, the present study undertakes a systematic exploration of the integration of advanced image processing techniques with state-of-the-art AI models. The primary objective is to enhance diagnostic performance across multiple medical imaging modalities. To this end, the study focuses on the application of Wiener filtering and total variation filtering as preprocessing steps to refine image quality. These preprocessing techniques are then evaluated in conjunction with cutting-edge AI models, specifically the vision transformer (ViT) and the U-KAN model.
Breast cancer remains one of the most significant health challenges worldwide, demanding continual improvements in diagnostic accuracy and early detection. This study investigates the impact of various image processing techniques, notably Wiener filtering and total variation filtering, on the quality and diagnostic precision of breast cancer detection across different medical imaging modalities. Additionally, it evaluates the consistency and robustness of advanced AI models, such as vision transformers (ViTs) and U-KAN, when applied to diverse datasets, including dynamic contrast-enhanced MRI (DCE-MRI), ultrasound, mammography, and histopathology. By addressing these research questions, this study fills a critical gap in the literature, providing a comprehensive evaluation of AI models in conjunction with tailored image preprocessing techniques. The findings aim to contribute to developing more robust, generalizable, and clinically applicable diagnostic systems for early breast cancer detection.
The remainder of this paper is organized as follows.
Section 2 reviews related work on preprocessing techniques and AI-driven breast cancer detection and segmentation. The methodology, including image processing and ViT model training, is detailed in
Section 3.
Section 4 presents the validation of the theoretical framework through experimental studies. Concluding remarks are provided in
Section 5.
3. Proposed Methods
In this section, we describe the research techniques we chose, explaining the principles behind each method, the reasons for its selection, and its advantages, and we demonstrate how each method addresses the project’s concerns. We first list the datasets we plan to use, which include multiple types of imagery. We then introduce traditional image processing models, discussing image quality assessment methods, Wiener filtering, and total variation filtering. This lays the groundwork for a further exploration of how traditional processing can enhance the early diagnostic performance of artificial intelligence algorithms. Finally, we transition to artificial intelligence algorithms, introducing the recent Kolmogorov–Arnold network (KAN) architecture. We then discuss the challenges of early breast cancer diagnosis in terms of classification and image segmentation, introducing the ViT model and the U-Net model and exploring the potential of combining the U-Net model with the KAN framework to achieve superior performance. For a detailed technical roadmap, refer to
Figure 15.
3.1. Work Dataset Presentation
In this section, we detail the datasets utilized in our project aimed at AI-based early breast cancer detection and classification. The success of such a project heavily relies on the availability and diversity of relevant medical image datasets. Previous studies have highlighted a significant limitation in current AI-based early breast cancer diagnosis models, specifically their dependence on a single type of medical image, which leads to weak generalization across different datasets. To address this issue, we propose analyzing and training models using multiple datasets comprising multimodal medical images. The datasets employed in this research are shown in
Table 4, including the Breast Ultrasound Images Dataset [
31], MIAS Mammography [
32], Mini-DDSM [
33], BreakHis [
34], and BreastDM [
35]. These datasets encompass four mainstream medical image types: ultrasound, mammography, histopathological images, and DCE-MRI. By leveraging these diverse datasets, we aim to enhance the robustness and generalizability of AI models in early breast cancer detection and classification.
The selection of the five datasets—BreastDM, BUSI, MIAS, BreakHis, and DDSM—was made to comprehensively evaluate the proposed AI models across a wide range of imaging modalities and clinical scenarios in breast cancer detection. The BreastDM dataset, with 232 cases in the DCE-MRI domain, provides a robust foundation for both segmentation and classification tasks, offering a unique emphasis on MRI, which is critical for detecting tumors in dense breast tissue. The BUSI dataset introduces the challenges of ultrasound imaging, such as speckle noise and lower resolution, ensuring that the models are tested on imaging modalities where mammography might be less effective. The MIAS and DDSM datasets, both containing mammographic images with various abnormalities and tissue densities, are pivotal for assessing the models’ performance in one of the most widely used breast cancer screening methods. BreakHis, with its histopathological images, adds another layer of complexity by requiring the models to differentiate between benign and malignant tissues at the cellular level. By leveraging these datasets, the study not only covers a diverse spectrum of imaging types—each with its specific challenges—but also ensures that the models are robust, generalizable, and applicable across different clinical contexts, ultimately enhancing their potential utility in real-world breast cancer diagnostics.
3.2. Restoration Image Modeling
In medical imaging, the phenomenon of image degradation refers to the deterioration in the quality and clarity of images, which can adversely affect diagnostic accuracy. This degradation arises from various factors, including motion artifacts caused by patient movement, the technical limitations of imaging equipment, and noise from electronic interference or low signal strength. Additional factors include beam hardening in CT imaging, in which X-ray beams passing through denser tissues lead to artifacts and reduced contrast, and the partial volume effect, in which voxels containing multiple tissue types produce blurred images. The attenuation and scattering of signals in modalities such as ultrasound and MRI further contribute to degradation. The manifestations of these issues are evident in blurring, artifacts, noise, and contrast reduction, all of which impair the visibility and differentiation of anatomical structures. Understanding these causes and manifestations is essential for enhancing image acquisition techniques and developing methods to mitigate degradation, thereby improving the diagnostic utility of medical imaging.
To address the degradation caused by blurring, distortion, and noise in images, it is necessary to perform image restoration. Image restoration aims to recover the original appearance of a degraded image as closely as possible. This process involves reversing the degradation effects, which means that, if we know the specific processes that led to the degradation, we can restore the image by applying the inverse of these processes. The process of image degradation is shown on the left side of
Figure 16.
As illustrated in
Figure 16, the image degradation model can be represented by Equation (1): the input image is convolved with the degradation function and subsequently linearly superimposed with noise to yield the degraded image:

g(x, y) = h(x, y) * f(x, y) + n(x, y).    (1)

Through the application of a Fourier transform to this equation, the image degradation model can be expressed in the frequency domain as Equation (2):

G(u, v) = H(u, v) F(u, v) + N(u, v),    (2)

where:
f(x, y) is the input image;
h(x, y) is the point spread function (PSF) that represents the blurring effects and other imperfections;
n(x, y) is the noise added to the image;
g(x, y) is the resulting degraded image;
F(u, v), H(u, v), and N(u, v) are the Fourier transforms of f(x, y), h(x, y), and n(x, y), respectively;
G(u, v) is the Fourier transform of the degraded image g(x, y).
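The degradation model of Equations (1) and (2) can be sketched numerically. The following NumPy example is a minimal illustration, assuming a 5 × 5 box-blur PSF and Gaussian noise (both illustrative choices, not values from our experiments):

```python
import numpy as np

def degrade(f, h, noise_sigma, rng):
    """Image degradation model g = h * f + n, computed in the
    frequency domain as G = H F + N (Equations (1) and (2))."""
    H = np.fft.fft2(h, s=f.shape)               # transfer function H(u, v)
    F = np.fft.fft2(f)                          # spectrum F(u, v)
    blurred = np.real(np.fft.ifft2(H * F))      # h convolved with f
    n = rng.normal(0.0, noise_sigma, f.shape)   # additive Gaussian noise
    return blurred + n

# Illustrative example: 64 x 64 image, normalized 5 x 5 box-blur PSF
rng = np.random.default_rng(0)
f = rng.random((64, 64))
h = np.ones((5, 5)) / 25.0
g = degrade(f, h, noise_sigma=0.01, rng=rng)
```

Convolution is performed via the FFT here because the frequency-domain form G = HF + N is exactly the model that the restoration filters in the following subsections invert.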
In the initial analysis of the image recovery model, it becomes evident that different types of noise or blurring possess distinct functional expressions. Consequently, the effectiveness of various recovery filters varies, depending on the type of noise encountered. Understanding the characteristics of different types of noise is, therefore, crucial in selecting the appropriate image processing methods.
Article [
36] provides a detailed explanation of the common types of noise found in medical imaging. Gaussian noise arises from atomic thermal vibrations and intermittent radiation from hot objects, as well as sensor noise due to temperature or brightness variations. Salt noise consists of randomly bright pixels (value 255), while pepper noise involves random dark pixels (value 0). Speckle noise, inherent in ultrasound images, is multiplicative and degrades diagnostic quality by reducing contrast and resolution. Poisson noise, resulting from the quantized nature of electromagnetic waves like gamma rays, X-rays, and visible light, introduces signal-dependent fluctuations as photons interact with the body. Thus, traditional additive noise removal techniques are ineffective for Poisson noise. The specific expressions are summarized in
Table 5.
3.3. Image Quality Assessment
Image quality assessment (IQA) ensures that medical images meet the standards for accurate diagnosis and effective treatment, especially in early breast cancer detection. The three main types of IQA are Full-Reference (FR-IQA), Reduced-Reference (RR-IQA), and No-Reference (NR-IQA). Our research focuses on using appropriate IQA standards to evaluate the quality of our image processing results, aiming to obtain reliable, high-quality medical images for further analysis to improve the accuracy and sensitivity of early breast cancer diagnosis.
FR-IQA methods, such as the mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), require a pristine reference image for comparison. MSE measures the average squared differences between the original and distorted images, while PSNR provides a logarithmic scale of these differences. SSIM evaluates image quality based on structural information, luminance, and contrast, aligning closely with human visual perception. RR-IQA methods use partial information from the reference image to assess quality, balancing the need for reference data with evaluation accuracy. These techniques extract and compare specific features from both the reference and distorted images. NR-IQA, or blind IQA, is particularly valuable in medical imaging, for which reference images are often unavailable. The Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) is a prominent NR-IQA metric that assesses image quality based on natural scene statistics, operating in the spatial domain to quantify deviations from expected natural statistics.
Chow and Paramesran, in [
37], mention that, in real-time medical imaging, there is no original or perfect reference image to evaluate. Therefore, NR-IQA becomes the most suitable method to evaluate medical images. Among NR-IQA methods, the BRISQUE method does not require the computation of specific distortion features; instead, it utilizes scene statistics of locally normalized luminance coefficients to quantify potential losses in the image’s ‘naturalness’. In terms of statistical performance, this method surpasses PSNR and SSIM, and it demonstrates high competitiveness and computational efficiency compared to other NR-IQA methods. Therefore, in this study, several evaluation criteria are employed: MSE, PSNR, SSIM, standard deviation (STD), and BRISQUE. MSE and PSNR provide foundational error measurements, while SSIM offers a perceptually aligned evaluation. STD captures image variability, and BRISQUE excels in scenarios lacking reference images. This comprehensive approach ensures rigorous and versatile IQA, providing reliable, high-quality medical images for further analysis to enhance the accuracy and sensitivity of early breast cancer diagnosis.
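As a minimal illustration of the FR-IQA metrics above, the following NumPy sketch implements MSE, PSNR, and a simplified single-window SSIM (the standard SSIM averages this quantity over local windows; BRISQUE is omitted, as it depends on a trained natural-scene-statistics model):

```python
import numpy as np

def mse(ref, img):
    """Mean squared error between reference and distorted images."""
    return np.mean((ref.astype(float) - img.astype(float)) ** 2)

def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio, in dB; infinite for identical images."""
    m = mse(ref, img)
    return float('inf') if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

def ssim_global(ref, img, max_val=255.0):
    """Single-window SSIM over the whole image (a simplification:
    the standard metric averages this over local sliding windows)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = ref.mean(), img.mean()
    var_x, var_y = ref.var(), img.var()
    cov = ((ref - mu_x) * (img - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Illustrative check against a uniformly shifted copy
a = np.full((8, 8), 100.0)
b = a + 10.0   # constant +10 offset -> MSE of 100
```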
3.4. Wiener Image Filtering
The objective of image restoration is to estimate the original image
from the observed degraded image
and the degradation function
, along with any available information about additive noise. The simplest approach to restoring an image could be implemented in the absence of noise, as follows:
This direct and simple method is known as inverse filtering, where
is the Fourier transform of the estimated image. In practical scenarios, due to the presence of noise, directly applying this formula often results in the amplification of noise, leading to poor restoration. Therefore, according to Equation (
2), Equation (
3) can be modified under the condition of considering noise, giving the following:
When performing inverse filtering, if H(u, v) is very small or zero in certain areas while N(u, v) is not zero and relatively large, the second term in the equation can become significantly larger than the first term, leading to substantial errors. The Wiener filter is highly effective for this problem, as it is a form of linear minimum mean square error (LMMSE) estimation.
Linear indicates that the estimation is linear in nature, while minimum variance refers to the optimization criterion used in constructing the filter. Specifically, it aims to minimize the variance of the error between the actual signal s(n) and the estimate ŝ(n) (Equation (5)):

e² = E[(s(n) − ŝ(n))²].    (5)

The goal of the Wiener filter is to design a filter such that the output signal, obtained via filtering the observed signal, is the minimum mean square error estimate of the actual signal.
The Wiener filter, in its many variations, can be single-input–output or multiple-input–output, depending on the problem at hand. However, the basic idea of Wiener filtering is always the same: a signal is extracted from a mixture of signal and noise via a filter (in the form of a matrix or other model). The core of Wiener filtering is, therefore, to compute this filter (the parameters of the matrix or model) by solving the Wiener–Hopf equation. To facilitate the derivation of its principle, assuming that the system is a single-input–output type and considering only finite-length filtering (i.e., the signal at the current moment is only correlated with the signal at the previous finite number of time points), it can be seen from Figure 16 that the output of the Wiener filter is as follows:

ŝ(n) = Σ_{k=0}^{N−1} w(k) x(n − k).    (6)
Following the derivation process detailed in Appendix A, we obtain the fundamental formula for the simplest single-input, single-output Wiener filter, namely the Wiener–Hopf normal equations:

Σ_{k=0}^{N−1} w(k) r_xx(m − k) = r_xs(m),    m = 0, 1, …, N − 1,    (7)

whose matrix-form solution is w = R_xx⁻¹ r_xs.
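As a concrete illustration, the following NumPy sketch implements the widely used frequency-domain form of Wiener deconvolution, F̂ = (H* / (|H|² + K)) G, where the constant K stands in for the usually unknown noise-to-signal power ratio (the box-blur PSF, noise level, and K value below are illustrative assumptions):

```python
import numpy as np

def wiener_deconvolve(g, h, K=0.01):
    """Frequency-domain Wiener filter:
    F_hat = conj(H) / (|H|^2 + K) * G,
    with constant K approximating the noise-to-signal power ratio."""
    H = np.fft.fft2(h, s=g.shape)
    G = np.fft.fft2(g)
    F_hat = np.conj(H) / (np.abs(H) ** 2 + K) * G
    return np.real(np.fft.ifft2(F_hat))

# Illustrative round trip: blur a random image in the frequency domain,
# add mild Gaussian noise, then restore.
rng = np.random.default_rng(0)
f = rng.random((64, 64))
h = np.ones((5, 5)) / 25.0                    # box-blur PSF
H = np.fft.fft2(h, s=f.shape)
g = np.real(np.fft.ifft2(H * np.fft.fft2(f)))
g += rng.normal(0.0, 0.001, f.shape)          # mild additive noise
restored = wiener_deconvolve(g, h, K=1e-3)
```

Note how K > 0 prevents the blow-up that plain inverse filtering suffers wherever |H(u, v)| is close to zero.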
3.5. Total Variation Filtering
The previous subsection showed that the Wiener filter is a frequency-domain method that relies on the known noise and signal power spectra. Denoising and deblurring are achieved through deconvolution, focusing on global noise suppression and blur correction, which makes the method best suited to linear, smoothly varying noise. The total variation filter, by contrast, is based on the variational method: it retains edges by minimizing the total variation of the image. By iteratively solving this nonlinear optimization problem in the spatial domain, noise can be effectively removed while edges are preserved.
Here is a simple derivation of the equation. The total variation filter constitutes an anisotropic model leveraging gradient descent to achieve image smoothing, with a primary objective of maximizing smoothness across the image domain by minimizing discrepancies between adjacent pixels while concurrently preserving edges to the utmost extent feasible. The term “variation” refers to |f(x + Δx) − f(x)|, where Δx approaches 0 for continuous functions. Total variation pertains to intervals defined for functions, where variations accumulate over the interval. Thus, by observing the definitions of the variation and the total variation of continuous real functions, we can derive equations for their discrete forms, specifically the total variation equation of one-dimensional discrete signals. For a discrete signal sequence x(n), n = 1, 2, …, N, the total variational form of the one-dimensional discrete signal is given by Equation (8):

TV(x) = Σ_{n=2}^{N} |x(n) − x(n−1)|.    (8)
Upon obtaining the observed signal x, the objective is to smooth x, effectively denoising it. An intuitive approach is to minimize the total variation of the signal, which corresponds to the physical meaning of the input signal’s smoothness. Let the recovered signal be y, which should satisfy two conditions: y should not deviate significantly from the observed signal x, expressed as Equation (9),

E(x, y) = Σ_{n=1}^{N} (x(n) − y(n))²,    (9)

and the total variation of y should be small. Under these constraints, y can be represented as in Equation (10),

y = argmin_y { Σ_{n=1}^{N} (x(n) − y(n))² + λ TV(y) },    (10)

where the parameter λ is a positive constant used to balance the influence of the two constraints.
As early as 1992, Rudin et al. proposed the total variation equation for two-dimensional discrete signals (images) in Article [38], as shown in Equation (11):

TV(u) = Σ_{i,j} sqrt( (u_{i+1,j} − u_{i,j})² + (u_{i,j+1} − u_{i,j})² ).    (11)

Solving this equation of total variation is relatively difficult; therefore, there is another commonly used (anisotropic) definition of two-dimensional total variation (Equation (12)):

TV(u) = Σ_{i,j} ( |u_{i+1,j} − u_{i,j}| + |u_{i,j+1} − u_{i,j}| ).    (12)

The minimization problem of this equation is relatively simple to solve.
In this paper, we selected Wiener filtering and total variation (TV) filtering as our primary preprocessing techniques due to their complementary capabilities in addressing the dual challenges of noise reduction and edge preservation in breast cancer imaging. Wiener filtering was chosen for its effectiveness in mitigating Gaussian noise, which is a common issue across medical imaging modalities such as DCE-MRI and ultrasound. Its adaptive approach, based on the local mean and variance estimation, allows for significant noise reduction while preserving critical image details, making it particularly useful for enhancing the visibility of subtle tumor features.
To complement this, total variation filtering was employed in order to maintain the integrity of edge information, which is crucial for accurate tumor delineation in modalities like mammography and histopathology. TV filtering minimizes noise while preserving sharp transitions in an image, ensuring that essential structural details are retained.
Both filters were carefully optimized to align with the specific characteristics of each dataset. For Wiener filtering, the noise-to-signal ratio was fine-tuned to balance noise reduction with the preservation of tissue contrast, which is especially important in DCE-MRI and ultrasound. Similarly, the regularization parameter in TV filtering was adjusted to prioritize edge preservation while achieving effective noise suppression, particularly in datasets in which clear tumor boundaries are critical.
The strategic combination and optimization of these two filtering techniques enhance the overall image quality, providing the AI models with superior input data that supports improved performance and generalizability across diverse imaging modalities.
3.6. Kolmogorov–Arnold Networks
Traditional multilayer perceptrons (MLPs) have achieved significant success in machine learning but face challenges such as large parameter counts and limited interpretability. To address these issues, Liu and Wang et al., in the article [
39], propose the Kolmogorov–Arnold network (KAN), a novel neural network architecture designed to enhance model flexibility and expressiveness while maintaining interpretability.
KAN’s design is inspired by the Kolmogorov–Arnold representation theorem, as shown in Equation (13), which posits that a multivariate, continuous function can be decomposed into a finite composite of univariate continuous functions and binary additive operations:

f(x₁, …, xₙ) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^{n} φ_{q,p}(x_p) ).    (13)

Instead of using fixed activation functions at the nodes, KAN employs learnable activation functions at the network’s edges. This allows each weight parameter to be replaced with a univariate function, typically parameterized as a spline function. By applying learnable activation functions to the weights, KAN can more flexibly and accurately capture complex relationships in input data.
Figure 17 illustrates a structural comparison between multilayer perceptrons (MLPs) and Kolmogorov–Arnold networks (KANs). The primary distinction lies in the sequence of operations: MLP applies linear combinations followed by nonlinear activations, whereas KAN employs nonlinear activations for each input prior to the linear combinations. Crucially, KAN features parameterizable and learnable activation functions, unlike fixed functions like Sigmoid or ReLU in MLP. This adaptability enables KAN to represent complex curves with greater efficiency, thereby achieving higher accuracy with fewer parameters.
Theoretically, two KAN layers (one representing the inner function and one learning the outer function) are sufficient to model various supervised learning tasks over the real number domain. This is analogous to the Kolmogorov–Arnold (KA) representation theorem. However, the activation functions in KANs can sometimes become very non-smooth, making it difficult to approximate any function using smooth splines in practice. Hence, the necessity for multi-layer KANs arises. Unlike the KA theorem, which restricts each input to produce 2n + 1 nonlinear activations, as indicated in Equation (13), KANs can be more flexible and stacked to form deeper networks, resulting in more practical activation functions. The essence of deep learning is representation learning, which involves composing simple modules to learn complex functions. Therefore, extending KANs to multiple layers aligns with this principle. In article [39], a KAN layer with n_in-dimensional inputs and n_out-dimensional outputs is defined as a matrix of one-dimensional functions using the following equation:

Φ = {φ_{q,p}},    p = 1, 2, …, n_in,    q = 1, 2, …, n_out.    (14)
To further compute x_{l+1}, we can use Equation (15); each value x_{l,i} from the l-th layer corresponds to an activation function φ_{l,j,i}, for i = 1, …, n_l and j = 1, …, n_{l+1}. After processing each value through the corresponding activation function, we simply sum them up to get x_{l+1,j}:

x_{l+1,j} = Σ_{i=1}^{n_l} φ_{l,j,i}(x_{l,i}).    (15)
As shown in Figure 18, the two-layer KANs in the article [39] have the 0-th layer (bottom) representing the inner function, changing the variable dimensionality from n to 2n + 1. The first layer represents the outer function, changing the dimensionality from 2n + 1 to 1 and resulting in a real number. Extending the basic two-layer KANs to a general form, a deeper KAN is the composition of L such layers:

KAN(x) = (Φ_{L−1} ∘ Φ_{L−2} ∘ ⋯ ∘ Φ_1 ∘ Φ_0)(x).
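The layer computation of Equation (15) can be sketched as follows. In this deliberately simplified NumPy illustration, each learnable edge function φ is parameterized as a cubic polynomial rather than the B-spline basis used in [39]:

```python
import numpy as np

class KANLayer:
    """Simplified KAN layer: each edge (j, i) carries its own learnable
    univariate function phi_{j,i} (here a cubic polynomial; the KAN
    paper uses B-splines).  Per Equation (15):
        x_out[j] = sum_i phi_{j,i}(x_in[i])."""
    def __init__(self, n_in, n_out, rng):
        # coeffs[j, i, k] is the coefficient of x^k for edge (j, i)
        self.coeffs = rng.normal(0.0, 0.1, (n_out, n_in, 4))

    def __call__(self, x):
        powers = np.stack([x ** k for k in range(4)], axis=-1)       # (n_in, 4)
        edge_vals = np.einsum('jik,ik->ji', self.coeffs, powers)     # phi_{j,i}(x_i)
        return edge_vals.sum(axis=1)                                 # sum over inputs i

# A two-layer KAN mirroring the KA theorem: n -> 2n + 1 -> 1
rng = np.random.default_rng(0)
n = 3
inner = KANLayer(n, 2 * n + 1, rng)   # inner functions phi_{q,p}
outer = KANLayer(2 * n + 1, 1, rng)   # outer functions Phi_q
y = outer(inner(np.array([0.1, -0.2, 0.3])))   # a single real number
```

Stacking more such layers gives the deeper, more practical KANs discussed above; training would fit the polynomial (or spline) coefficients by gradient descent.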
3.7. U-Net
The UNet algorithm is a convolutional neural network (CNN) architecture for image segmentation. It was proposed by Olaf Ronneberger et al. in [
40], and it is mainly used to solve the problem of medical image segmentation. The key innovation of UNet is its U-shaped architecture, which allows for high segmentation accuracy even with a limited number of training images.
UNet is a fully convolutional neural network for image segmentation, comprising an encoder and a decoder. The encoder extracts features using convolutional layers and pooling operations, reducing spatial resolution while capturing crucial details. The decoder then upsamples these low-resolution, high-level feature maps, combining them with corresponding encoder feature maps via skip connections. This technique enhances segmentation accuracy and detail preservation by utilizing both high-level abstract and low-level detailed features.
In the final stage, two convolutional layers generate feature maps, followed by 1 × 1 convolutions to produce class-specific heatmaps. The softmax function processes these heatmaps to compute probabilities, which are then used for loss calculation and backpropagation.
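The final 1 × 1 convolution and softmax step can be sketched as a per-pixel linear map over channels followed by a per-pixel softmax (the channel and class counts below are illustrative):

```python
import numpy as np

def conv1x1(feat, W, b):
    """1x1 convolution = an independent linear map at every pixel.
    feat: (H, W, C_in); W: (C_in, C_out); b: (C_out,)."""
    return feat @ W + b

def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative: an 8 x 8 feature map with 16 channels -> 2-class heatmap
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 16))
W, b = rng.normal(size=(16, 2)), np.zeros(2)
probs = softmax(conv1x1(feat, W, b))   # per-pixel class probabilities
```

Each pixel of `probs` sums to one across classes; during training, these probabilities feed the loss calculation and backpropagation described above.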
The UNet algorithm excels in segmentation and is well suited to small-sample learning, but it demands high computational resources and faces challenges with data imbalance and large image processing. The article [
41] notes that, despite various innovative enhancements incorporating transformers or MLPs, these networks remain constrained by linear modeling paradigms and lack sufficient interpretability. To address these problems, Li et al. proposed the U-KAN architecture, as illustrated in
Figure 19. This design incorporates elements from KANs, which are renowned for their high accuracy and interpretability. KANs transform neural network learning by incorporating nonlinear, learnable activation functions derived from the Kolmogorov–Arnold representation theorem.
The U-KAN architecture consists of a two-phase encoder–decoder structure. The encoder phase starts with three convolutional blocks that progressively reduce the feature map resolution, followed by two tokenized Kolmogorov–Arnold network (Tok-KAN) blocks. Conversely, the decoder phase includes two Tok-KAN blocks and three convolutional blocks that restore the feature map resolution. Skip connections link corresponding blocks in the encoder and decoder to facilitate feature reuse. Channel counts for the convolution and Tok-KAN phases are defined by hyperparameters C1 to C3 and D1 to D2, respectively. This architecture effectively integrates convolutional and tokenized KAN blocks, enhancing segmentation accuracy and interpretability and setting it apart from conventional UNet designs.
3.8. Vision Transformer
The vision transformer (ViT), developed by Google, repurposes the transformer architecture for computer vision tasks using an attention mechanism. While CNNs have traditionally been the cornerstone for computer vision, transformers are primarily used in NLP for tasks such as translation and text generation. Researchers have adapted the transformer’s multi-head self-attention to vision tasks in order to address the limitations of CNNs in capturing long-range dependencies. ViT has proven effective in image classification, object detection, and segmentation by leveraging its capability to process images of varying scales and resolutions and capture global contextual information.
The vision transformer (ViT) architecture (
Figure 20), designed for computer vision tasks, consists of three main modules. The Linear Projection of Flattened Patches module converts input images into a serialized format suitable for the transformer encoder by incorporating patch, positional, and learnable class embeddings. The transformer encoder, the core component, utilizes multi-head self-attention and feed-forward neural networks to capture global information and learn feature representations. Finally, the MLP head processes the output from the transformer encoder using a multi-layer perceptron for classification or other vision tasks.
The ViT model starts by segmenting an input image into fixed-size patches, which are then linearly transformed into lower-dimensional patch embeddings. Positional and learnable embeddings are added to retain spatial and global information. These embeddings are input to multiple layers of transformer encoders, which apply self-attention to extract features. The final output vectors are processed through a fully connected layer for classification. By converting image data into a sequence format, ViT effectively leverages the transformer’s attention mechanisms for efficient image analysis and classification.
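The patch-embedding pipeline described above can be sketched in NumPy as follows (the image size, patch size, and embedding dimension are illustrative; a real ViT learns W_proj, the positional embeddings, and the class token):

```python
import numpy as np

def patch_embed(img, patch, W_proj, pos_emb, cls_tok):
    """Split an image into fixed-size patches, flatten and linearly
    project them, prepend a learnable class token, and add positional
    embeddings.  img: (H, W, C); W_proj: (patch*patch*C, D)."""
    H, Wd, C = img.shape
    p = patch
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = img.reshape(H // p, p, Wd // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)
    tokens = patches @ W_proj                  # linear patch embeddings
    tokens = np.vstack([cls_tok, tokens])      # prepend [class] token
    return tokens + pos_emb                    # retain spatial order

# Illustrative sizes: 32 x 32 RGB image, 8 x 8 patches, embedding dim 64
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
D, p = 64, 8
n_patches = (32 // p) ** 2                     # 16 patches
W_proj = rng.normal(size=(p * p * 3, D))
pos_emb = rng.normal(size=(n_patches + 1, D))
cls_tok = rng.normal(size=(1, D))
tokens = patch_embed(img, p, W_proj, pos_emb, cls_tok)
```

The resulting token sequence (here 16 patch tokens plus one class token, each of dimension 64) is exactly the serialized input that the transformer encoder layers consume.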
3.9. Comparative Analysis with Previous Works
The application of machine learning and deep learning techniques to breast cancer detection has been extensively explored, yet challenges related to generalizability across different imaging modalities remain significant. Previous methods, such as those proposed in [
21], focused on specific imaging modalities, such as mammography and MRI, and employed traditional image processing techniques such as edge detection and thresholding. These methods have shown efficacy within their targeted applications; however, their adaptability to other imaging modalities is limited. For instance, while edge detection may work effectively in mammography by highlighting distinct boundaries, it often fails to capture the more nuanced variations present in ultrasound images, where tissue interfaces are less clear. Similarly, thresholding techniques that perform well in MRI may not adequately handle the complex textures seen in histopathological images, where contrasts between different tissue types can be subtle and varied.
In contrast, studies like [
10] examined a broad range of deep learning architectures combined with various preprocessing techniques, placing significant emphasis on the architecture’s influence on model accuracy. However, these studies did not sufficiently explore how different preprocessing techniques affect performance across various imaging modalities, leading to limited generalizability. Our approach differs by systematically applying preprocessing techniques, specifically Wiener filtering and total variation filtering, across multiple modalities, including DCE-MRI, ultrasound, mammography, and histopathology. This strategic use of preprocessing enhances image quality uniformly across different datasets, thereby improving the overall performance and generalizability of AI models, which is an area where previous studies have often fallen short.
Moreover, by systematically applying image processing techniques, we are able to enhance the generalizability of AI models. Traditional approaches often rely on a one-size-fits-all strategy for preprocessing, which may not account for the nuanced differences between imaging modalities. Our method diverges from this by optimizing the filtering parameters for each dataset, ensuring that the preprocessing is tailored to the specific characteristics of the imaging data. This tailored approach not only improves the diagnostic accuracy within each modality but also enhances the robustness of the models when applied to diverse datasets.
Overall, the proposed method addresses the ongoing challenge of developing generalizable and robust diagnostic models applicable across multiple imaging modalities. By strategically applying advanced preprocessing techniques and integrating state-of-the-art AI models, this study seeks to offer an approach that navigates some of the limitations observed in previous methodologies. While further validation and exploration are needed, the findings presented here contribute to the ongoing dialog in the field, with the potential to inform future developments in breast cancer diagnostics.
3.10. Highlight of the Proposed Methods
The proposed method distinguishes itself through a comprehensive approach that integrates advanced preprocessing techniques with cutting-edge AI models. Key aspects include the following:
- 1.
Multimodal dataset utilization: Unlike previous approaches that primarily focus on single-modality datasets, our method leverages a diverse range of medical imaging datasets. This strategy ensures that the AI models developed are robust and generalizable across various imaging conditions, enhancing their applicability in different clinical scenarios.
- 2.
Advanced image processing techniques: By systematically comparing and integrating Wiener filtering with total variation filtering, our approach is designed to tackle specific challenges inherent to medical imaging, such as noise reduction and edge preservation. These challenges are crucial for improving image quality before applying AI models. Additionally, we tailor filtering parameters to the characteristics of each specific dataset, thereby enhancing the adaptability and performance of the models across different imaging modalities.
- 3.
Integration of ViT and U-KAN models: The incorporation of vision transformer (ViT) and U-KAN models represents an innovative application in the context of breast cancer detection. These models have demonstrated superior performance in both classification and segmentation tasks when compared to traditional CNN-based models. Their integration provides a more robust and interpretable framework capable of being effectively applied across a variety of imaging modalities.
4. Experimental Results and Discussion
4.1. Results of Image Filtering
Assuming that the degraded features in the mammographic images of the dataset are due to Gaussian blur, this study utilized specific Python modules to estimate the Gaussian blur kernel and employed a Wiener filter to deblur the images. The implementation of this functionality requires a manual estimation of the Gaussian kernel. The Gaussian blur kernel function in image processing is defined by two primary variables: kernel size and standard deviation. The kernel size, represented as a pair of integers (k_width, k_height) or a single integer for square kernels, specifies the dimensions of the Gaussian kernel and determines the number of pixels considered around each target pixel when applying the blur. A larger kernel size results in a more extensive blur by averaging values over a wider area. The standard deviation sigma controls the spread or width of the Gaussian function, influencing the degree of blur. It dictates how much neighboring pixels affect the center pixel, with a larger sigma producing a broader, smoother blur and a smaller sigma resulting in a sharper, more localized blur. Often, a single standard deviation value is used for both the x and y directions to maintain a uniform blur effect. The setting of the Gaussian kernel is closely related to the final deblurring effect.
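For reference, a 2-D Gaussian kernel of the kind described above can be constructed as follows (a minimal NumPy sketch, not the exact Python module used in this study):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build a normalized size x size Gaussian blur kernel."""
    ax = np.arange(size) - (size - 1) / 2.0        # coordinates centered at 0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()                   # weights sum to 1

k = gaussian_kernel(5, 1.0)   # larger size/sigma -> broader, smoother blur
```

The kernel's weights peak at the center pixel and decay with distance, which is why a larger sigma spreads the averaging over a wider neighborhood.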
To enable the subsequent early classification and diagnosis research, we searched the parameters over an interval range, taking the Gaussian kernel size to be 3–9 with a step size of 2 and sigma to be 0.5–3.0 with a step size of 0.25. We calculated the optimal parameter selection for the current image database through image quality assessment (IQA) for subsequent research. This section demonstrates the processed images, all based on the MIAS database (
https://www.mammoimage.org/databases/ (accessed on 20 July 2024)).
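The parameter sweep described above amounts to a small grid search over the stated ranges; the sketch below enumerates the combinations and scores each one with a placeholder function (`iqa_score` is a toy surrogate, assumed here to stand in for the actual deblur-then-evaluate step using PSNR/SSIM/MSE/BRISQUE):

```python
import itertools

kernel_sizes = range(3, 10, 2)                  # 3, 5, 7, 9
sigmas = [0.5 + 0.25 * i for i in range(11)]    # 0.5 ... 3.0

def iqa_score(size, sigma):
    # Placeholder: a real version would Wiener-deblur the image with this
    # (size, sigma) kernel estimate and aggregate PSNR/SSIM/MSE/BRISQUE.
    return -abs(size - 5) - abs(sigma - 1.5)    # toy surrogate score

best_size, best_sigma = max(itertools.product(kernel_sizes, sigmas),
                            key=lambda p: iqa_score(*p))
```

The pair with the best aggregate IQA score is then taken as the relatively optimal setting for that dataset.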
The processing effect of the Wiener filter under different variables is shown in
Figure 21. We calculated the image quality evaluation metrics PSNR, SSIM, MSE, and BRISQUE to find the better variable settings under this dataset.
In
Figure 22,
Figure 23,
Figure 24,
Figure 25 and
Figure 26, the Wiener filter’s performance across various image quality metrics—PSNR, SSIM, MSE, STD, and BRISQUE—reveals that the choice of kernel size and sigma significantly influences the quality of the denoised images. Optimal image quality, indicated by higher PSNR and SSIM values and lower MSE and BRISQUE scores, is generally achieved with smaller kernel sizes (3 to 4) and lower sigma values (0.5 to 1.0). Under these conditions, the filter effectively reduces noise while preserving structural details and minimizing deviations from the original image. As the kernel size and sigma increase, there is a noticeable decline in PSNR (from around 23 dB to 15 dB) and SSIM (from approximately 0.55 to 0.2), reflecting a loss of detail and structural fidelity. Concurrently, MSE values escalate (from around 250 to over 2250), highlighting increased error due to excessive smoothing.
Furthermore, the standard deviation (STD) and BRISQUE metrics show a similar trend, where larger kernels and higher sigma values lead to increased uniformity and perceived quality degradation. The STD values rise from 62.5 to 65.5, indicating a reduction in texture variability, while BRISQUE scores increase from 12 to 72, suggesting diminished visual quality. These findings suggest that, while larger kernels and higher sigma values may be effective for noise reduction, they also introduce substantial over-smoothing, resulting in a loss of crucial image details and texture. Therefore, the careful selection of kernel size and sigma is essential for optimizing image quality, particularly in applications requiring a balance between noise suppression and the preservation of fine details for accurate early classification and diagnosis.
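Of the metrics reported here, MSE and PSNR are straightforward full-reference computations (SSIM and BRISQUE require dedicated implementations and are omitted); a minimal NumPy sketch:

```python
import numpy as np

def mse(ref, test):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((ref.astype(float) - test.astype(float)) ** 2))

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(ref, test)
    return float("inf") if err == 0 else float(10 * np.log10(max_val**2 / err))
```

Because PSNR is a log-scaled inverse of MSE, the two always move in opposite directions, which is consistent with the trends reported above.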
The analysis of the contour map of BRISQUE values in relation to kernel size and sigma parameters for Wiener-filtered images reveals distinct patterns. Smaller kernel sizes (3–5) are highly sensitive to variations in the sigma parameter, whereas larger kernel sizes (7–9) exhibit greater stability. Within the tested parameter range, a kernel size of 8 and a sigma value of 1.25 yield the best image quality, indicated by a lower BRISQUE score of 13.47. This combination effectively balances noise reduction and detail preservation. As shown in
Figure 26, this specific combination results in the lowest BRISQUE score. Therefore, within the established range, these parameters are optimal for processing images in the current dataset. Through this method, we can determine the relatively optimal points within the assumed range of the dataset. The average evaluation metrics for the relatively optimal points within the parameter ranges of all used datasets are shown in
Table 6. The optimal parameter sets for each dataset are as follows: for BreastDM, the optimal parameters (Size, Sigma) are (4, 3); for BreakHis, they are (5, 2.5); for DDSM, they are (5, 1.5); and for BUSI, they are (7, 1).
In addition, total variation filtering was applied to five datasets, with specific IQA parameters detailed in
Table 7. It can be observed that, compared to Wiener filtering, the performance of total variation filtering is inferior. This is particularly evident in the BRISQUE parameter, which will likely significantly impact deep learning models. The substantial increase in BRISQUE values indicates a notable decline in image quality.
Wiener filtering and total variation filtering are complementary in dealing with noise and preserving details. Wiener filtering is very effective in reducing Gaussian noise, while total variation filtering excels in preserving edges and details. Therefore, we processed images by applying total variation filtering to both the original dataset and the Wiener-filtered dataset and then evaluated the quality of the images, expecting that the processed images would lead to superior performance in early diagnosis for AI.
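The total variation stage of this combined pipeline can be sketched as gradient descent on a smoothed TV-regularized objective, shown below in plain NumPy. The Wiener stage (e.g., `scipy.signal.wiener`) would run first and is omitted; the weight, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def tv_denoise(noisy, weight=0.1, step=0.1, n_iter=50, eps=1e-8):
    """Minimize 0.5*||x - noisy||^2 + weight * TV(x) by gradient descent,
    using a smoothed TV term so the gradient is defined everywhere."""
    x = noisy.astype(float).copy()
    for _ in range(n_iter):
        gx = np.roll(x, -1, axis=1) - x            # forward differences
        gy = np.roll(x, -1, axis=0) - x
        norm = np.sqrt(gx**2 + gy**2 + eps)
        px, py = gx / norm, gy / norm              # normalized gradient field
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        x -= step * ((x - noisy) - weight * div)   # fidelity + TV smoothing
    return x

def total_variation(x):
    gx = np.roll(x, -1, axis=1) - x
    gy = np.roll(x, -1, axis=0) - x
    return float(np.sqrt(gx**2 + gy**2).sum())
```

Applied after Wiener filtering, this second stage smooths residual noise while the TV penalty's weak response at strong gradients helps keep edges sharp.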
Table 8 shows the sample filtering effects for each dataset (the experimental settings are the same as for the MIAS dataset, both assuming that the images are Gaussian blurred). Similar to the Wiener filter, the total variation filter has its own setup, with a variable regularization parameter that controls the strength of the filtering and determines the balance between noise reduction and detail retention. Following the Wiener filter treatment, we explored the relatively optimal combination of parameters.
Table 9 shows the evaluation metrics for each dataset under the relatively optimal parameters of the combined filters.
Figure 27,
Figure 28 and
Figure 29 use image quality assessment (IQA) metrics to compare the effects of different filtering techniques on various datasets. These datasets include MIAS (benign: 64 images; malignant: 51 images; normal: 207 images), BreakHis (benign: 2480 images; malignant: 5429 images), BUSI (benign: 437 images; malignant: 210 images; normal: 133 images), BreastDM (benign: 88 images; malignant: 147 images), and Mini-DDSM (benign: 671 images; malignant: 679 images; normal: 602 images). The metrics used are PSNR, SSIM (scaled), MSE (scaled by 10), and BRISQUE. These figures clearly illustrate the differences in image quality across different filtering methods, with particular emphasis on the BRISQUE metric, which indicates significant variations in image quality.
4.2. AI Diagnostic Results
Evaluated using IQA metrics alone,
Table 9 shows that the combined filter-treated images are degraded across all metrics. However, to draw accurate conclusions and validate whether Wiener filtering improves AI early diagnostic performance, we used five datasets, each subjected to three different treatments (including the original images), resulting in 15 different combinations for deep learning training. The primary task in the early diagnosis of breast cancer is to classify medical images to determine whether the condition is benign or malignant for targeted treatment. For the classification task, training was conducted using the vision transformer (ViT) framework with a fixed 15 epochs for all datasets, a learning rate of 0.001, and a learning rate factor of 0.01. The performance of the same dataset under different treatments was compared and analyzed.
Figure 30 illustrates the model training process data for the BUSI dataset.
In
Table 10, we analyzed the performance results of five different datasets after applying Wiener filtering and total variation filtering. It is evident that the performance varies significantly across different datasets, depending on the filtering technique used. For instance, in the Mini-DDSM dataset, although the performances of Wiener filtering and total variation filtering are relatively similar, the raw images perform the worst. However, the BreakHis dataset shows a significant performance improvement after applying Wiener filtering, particularly in accuracy, recall, and AUC.
Further analysis reveals that the BreastDM dataset achieves the best results after applying Wiener filtering, with all performance metrics reaching their highest values. This indicates that our chosen range of parameters and parameter combinations are well suited to this dataset. The BUSI dataset exhibits excellent performance with both Wiener and total variation filtering, although Wiener filtering performs slightly better, suggesting that the effectiveness of different filtering methods varies across specific datasets.
Overall, these results indicate that filtering can significantly improve model performance in some cases but may have negative effects on certain datasets. Therefore, in practical applications, it is crucial to select the most appropriate image processing method based on the characteristics of the specific dataset to achieve optimal performance.
Further predictions using the trained model reveal more diverse performance outcomes.
Figure 31 shows that, while the model can correctly classify images in the BreakHis dataset, there is a decrease in the likelihood of correctly classifying benign images post-processing, whereas the likelihood increases for malignant images.
Figure 32 and
Figure 33 demonstrate that filtered images significantly improve classification probabilities for the BreastDM and BUSI datasets. However,
Figure 34 indicates that, despite improved accuracy and probability in processed images, the model misclassifies original DDSM images, which is critical in real diagnostics; specifically, a malignant image is diagnosed as benign with a probability of 0.504.
Figure 35 illustrates a more severe issue in the MIAS dataset, where the model misclassifies benign images as normal with a probability of 0.421 (benign prob: 0.337; malignant prob: 0.242) and malignant images as normal with a probability of 0.526 (benign prob: 0.191; malignant prob: 0.283) after combined filtering. Image processing degraded the model’s classification performance with MIAS, and the original dataset’s classification probabilities were already low, indicating that the model is not well suited to the MIAS dataset.
Table 11 presents a comparison between the mask images generated using three different frameworks and the ground truth masks for image segmentation. In
Table 12, the performance comparison of various models on the BUSI and BreastDM datasets is presented, emphasizing their respective performance metrics. The methods for calculating performance metrics are detailed in
Appendix A.
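As a concrete illustration of how such performance metrics derive from confusion-matrix counts (a generic sketch, not necessarily the appendix's exact formulation):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}
```

In segmentation, the counts are taken per pixel against the ground truth mask, which is why specificity is dominated by the large background region.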
For the BUSI dataset, the U-KAN model demonstrates superior performance across most metrics relative to U-Net and U-Net++. Specifically, U-KAN achieves the highest accuracy (0.933), precision (0.754), and F1 score (0.747). Additionally, it records the highest specificity (0.963) and AUC (0.935), although its recall (0.740) is marginally lower than that of U-Net++ (0.749). These results indicate that U-KAN offers balanced and robust performance, excelling particularly in accuracy and specificity, which are critical for reliable image segmentation.
Regarding the BreastDM dataset, all three models exhibit high accuracy, yet U-KAN again shows the best overall performance. U-KAN achieves the highest accuracy (0.986), recall (0.870), F1 score (0.728), specificity (0.993), and AUC (0.838). In comparison, U-Net++ demonstrates slightly lower performance with an accuracy of 0.985 and an AUC of 0.822, while U-Net exhibits an accuracy of 0.983 and an AUC of 0.815.
These findings suggest that U-KAN is particularly effective for image segmentation tasks, especially in the context of early breast cancer diagnosis. It provides superior accuracy, specificity, and balanced performance across other metrics compared to U-Net and U-Net++, thus offering enhanced capabilities for detecting early-stage breast cancer lesions. Consequently, U-KAN’s advanced performance metrics underscore its potential as a reliable model for clinical applications in breast cancer detection.
5. Conclusions
This study systematically explored the effects of various image processing techniques and AI models on the performance of early breast cancer diagnostic systems. Experiments were conducted on multiple datasets, including BreastDM, BUSI, MIAS, BreakHis, and DDSM, ensuring a comprehensive evaluation across diverse medical image types. The results provided critical insights into the effectiveness of these techniques and models.
The application of Wiener filtering produced distinct image quality metrics across various datasets, as evidenced by measures such as the peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM), the mean squared error (MSE), and the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE). With the BreastDM dataset, Wiener filtering achieved a PSNR of 23.06 dB, an SSIM of 0.79, and an MSE of 341.72, indicating effective noise suppression and good structural preservation, with a BRISQUE score of 28.88. For the BUSI dataset, Wiener filtering resulted in an SSIM of 0.35, a PSNR of 20.09 dB, and an MSE of 637.35, demonstrating moderate structural preservation and reasonable perceptual quality with a BRISQUE score of 26.23. In the MIAS dataset, Wiener filtering showed significant noise reduction with an MSE of 1114.22, a PSNR of 17.68 dB, and an SSIM of 0.30, maintaining perceptual quality with a BRISQUE score of 13.47. For the DDSM dataset, Wiener filtering achieved a PSNR of 17.14 dB, an SSIM of 0.55, and an MSE of 1267.13, reflecting moderate improvements and challenges in balancing noise reduction and perceptual quality with a BRISQUE score of 63.71. In the BreakHis dataset, Wiener filtering showed a PSNR of 19.69 dB, an SSIM of 0.73, and an MSE of 697.58, indicating good noise reduction and structural preservation, with a BRISQUE score of 15.12, suggesting good perceptual quality.
Combined filtering techniques produced mixed results that were highly dependent on the dataset. To evaluate their relative performance, these methods were compared against Wiener filtering. In the MIAS dataset, combined filtering decreased SSIM values from 0.30 (Wiener) to 0.21, reflecting potential over-enhancement issues. Additionally, MSE increased from 1114.22 (Wiener) to 1334.17 (combined), indicating slightly less effective noise reduction. The BRISQUE score increased from 13.47 (Wiener) to 16.87 (combined), suggesting a decrease in perceptual quality. Conversely, in the BUSI dataset, combined filtering improved structural preservation even though its BRISQUE score rose from 26.23 (Wiener) to 44.80 (combined), indicating some loss of perceptual quality. PSNR was slightly lower for combined filtering (19.86 dB) than Wiener filtering (20.09 dB) but still acceptable, and SSIM improved from 0.35 (Wiener) to 0.44 (combined), indicating better structural preservation. This comparison indicates that, while Wiener filtering generally enhances image quality, combined filtering methods can offer additional benefits in specific contexts, especially for structural preservation in datasets like BUSI.
The use of the vision transformer (ViT) framework provided a robust approach to evaluating the impact of different preprocessing techniques on AI diagnostic performance, particularly in the BreastDM and BUSI datasets. For the BreastDM dataset, the Wiener filter treatment achieved an accuracy of 96.9%, a precision of 97.2%, a recall of 96.7%, a specificity of 93.3%, and an AUC of 0.954. In comparison, the combined filter treatment yielded an accuracy of 99.3%, a precision of 99.6%, a recall of 98.3%, a specificity of 95.4%, and an AUC of 0.969, demonstrating superior performance. The raw image treatment showed lower metrics with an accuracy of 90.6%, a precision of 90.6%, a recall of 90.8%, a specificity of 93.3%, and an AUC of 0.937. For the BUSI dataset, the Wiener filter treatment resulted in an accuracy of 98.0%, a precision of 98.8%, a recall of 96.6%, a specificity of 98.5%, and an AUC of 0.998. The combined filter treatment had an accuracy of 83.2%, a precision of 81.1%, a recall of 79.2%, a specificity of 90.0%, and an AUC of 0.938, whereas the raw image treatment exhibited significantly lower performance with an accuracy of 71.6%, a precision of 74.8%, a recall of 61.6%, a specificity of 80.3%, and an AUC of 0.864. These results highlight the effectiveness of the ViT framework, particularly when combined with appropriate preprocessing techniques, in enhancing diagnostic accuracy and reliability.
The U-KAN model consistently outperformed the U-Net and U-Net++ models in breast cancer lesion segmentation across the datasets used. In the BUSI dataset, U-KAN achieved an accuracy of 93.3%, with a precision of 75.4%, a recall of 74.0%, an F1-score of 74.7%, a specificity of 96.3%, and an AUC of 0.935. These results indicate U-KAN’s robustness in accurately segmenting breast cancer lesions. Notably, with the BreastDM dataset, U-KAN maintained high segmentation accuracy with an accuracy of 98.6%, a precision of 62.6%, a recall of 87.0%, an F1-score of 72.8%, a specificity of 99.3%, and an AUC of 0.838 despite the presence of noise and variability in image quality. Furthermore, the U-KAN model’s ability to maintain high-performance metrics across these datasets underscores its adaptability and robustness.
In conclusion, this study underscores the importance of tailored preprocessing techniques for different datasets, given the variability in image quality and characteristics. Wiener filtering emerged as generally effective, particularly for datasets with significant noise, such as BreastDM. However, the choice of filtering techniques should be guided by the specific noise characteristics and diagnostic requirements of each dataset; for instance, combined filtering methods were more suitable in enhancing image clarity and feature preservation in the BUSI dataset. Integrating advanced AI models like ViT and U-KAN with optimal preprocessing techniques has proven to enhance diagnostic performance. The superior performance of the U-KAN model across various metrics suggests that combining innovative network architectures with tailored preprocessing can significantly improve early breast cancer detection. These findings highlight the need for further research to refine and validate these methods across larger and more diverse datasets.
Future work should address the challenge of selecting optimal parameters for different datasets, which remains difficult even when choosing a parameter range to find the best values. Integrating image processing with deep learning may offer a solution to this problem. By training models with various parameter combinations, it may be possible to achieve the adaptive selection of the optimal processing parameters. This approach could significantly enhance the performance of deep learning applications in early breast cancer detection, leading to more accurate and reliable diagnostic outcomes.