1. Introduction
Medical imaging is a fundamental field in modern medicine, playing a critical role in the diagnosis, monitoring, and treatment of various diseases. It allows clinicians to visualize the internal structures of the human body in a non-invasive manner, supporting more informed medical decisions. Common imaging techniques include ultrasound, X-rays, computed tomography (CT), magnetic resonance imaging (MRI), and nuclear medicine [1]. Medical imaging has revolutionized the diagnosis and treatment of several life-threatening conditions. For example, the detection of tumors, the identification of fractures, and the evaluation of cardiac conditions are greatly enhanced through imaging techniques [2]. However, with the growing complexity of medical cases and the increasing volume of data generated, the need for more advanced and automated tools has become evident.
AI has emerged as a transformative force in medical imaging, particularly in the areas of image analysis, pattern recognition, and decision support systems [3]. Traditional medical imaging relies heavily on the expertise of radiologists to interpret complex images. However, the introduction of deep learning and machine learning algorithms has enabled faster and more accurate interpretation of medical images, thus reducing human error and increasing diagnostic efficiency. AI-based systems, especially convolutional neural networks (CNNs), have demonstrated exceptional capabilities in analyzing large datasets of medical images [4]. These systems can be trained to recognize patterns that may be invisible to the human eye, allowing them to detect early-stage diseases such as cancer or subtle fractures. Furthermore, AI can assist in segmenting images, classifying tissue types, and even predicting patient outcomes based on imaging data. AI is particularly impactful at automating routine tasks, reducing radiologists’ workload and allowing them to focus on more complex cases. AI’s automated detection algorithms can flag suspicious areas on mammograms or CT scans, prompting a more detailed examination by a clinician [5].
With the increasing adoption of AI in medical imaging, the hardware that powers these advanced algorithms has gained significance. AI models, particularly deep neural networks, require significant computational power for training and inference. Medical images are often high-resolution and three-dimensional, resulting in massive amounts of data that must be processed in real time [6]. This computational demand is especially high in applications like breast cancer detection, where accuracy and speed are critical [7]. To address these challenges, the use of specialized hardware accelerators such as field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) has become essential. Hardware accelerators significantly improve the performance of AI algorithms by reducing computation time, increasing energy efficiency, and enabling parallel processing.
For breast cancer detection, where timely and precise diagnostics are crucial, hardware accelerators ensure that AI systems can analyze mammograms or ultrasound images far more efficiently than traditional computing methods. This speed and accuracy are vital in clinical environments, where prompt diagnosis can significantly affect patient care [8,9]. Moreover, hardware acceleration makes it possible to deploy AI models in resource-constrained environments, such as rural clinics or mobile healthcare units. By using embedded systems with FPGA or GPU-based hardware, these advanced AI models can be run locally without the need for constant cloud connectivity, thus increasing accessibility to high-quality diagnostics even in remote areas [10,11].
The objective of this study is to investigate the implementation of artificial intelligence (AI) algorithms, particularly convolutional neural networks (CNNs), within FPGA (Field-Programmable Gate Array) hardware to optimize the performance of our proposed AI model. This research focuses on an application in breast cancer detection and classification systems. In our AI model, we have designed and integrated three key intellectual property (IP) blocks: conv2d (2D convolutional layer), average pooling, and ReLU (Rectified Linear Unit) activation, within the initial layers. These components are specifically implemented to accelerate the execution time of our AI model, thereby enhancing its efficiency in breast cancer detection tasks. The contributions of this work are as follows:
Developing a CNN-based model for breast cancer classification that uses data augmentation to improve accuracy and robustness.
Implementing and evaluating the proposed model on the PYNQ-Z2 platform with an ARM Cortex-A9 processor (software implementation).
Designing an optimized FPGA-based hardware accelerator, which significantly improves throughput and reduces latency and power consumption, while maintaining competitive accuracy levels for breast cancer classification tasks.
Proposing a hybrid approach for breast cancer detection that combines software and hardware-based solutions, offering a balanced and scalable method for real-time medical diagnostics.
The remainder of this paper is organized as follows:
Section 2 reviews related works on hardware-accelerated CNNs, focusing on their applications in breast cancer classification.
Section 3 provides an overview of the project, describing the motivation, objectives, and system architecture.
Section 4 details the AI-based breast cancer classification model, including preprocessing, model design, and training strategies.
Section 5 presents the design and hardware-software implementation, covering FPGA acceleration, precision settings, and integration with the embedded system.
Section 6 discusses the results, analyzing classification accuracy, execution time, and hardware efficiency.
Finally, Section 7 concludes the paper, summarizing key findings and suggesting future research directions.
2. Related Works
Breast cancer classification and diagnosis have been extensively explored using a variety of approaches, ranging from traditional machine learning techniques to advanced deep learning models and hardware acceleration. Early studies, such as those by [12], demonstrated the effectiveness of hybrid machine learning methods in achieving high diagnostic accuracy, though these approaches often require significant computational resources. More recent advancements have introduced sophisticated algorithms like XGBoost [13] and fuzzy logic-based systems [14], which leverage domain knowledge and handle uncertainty in medical data to improve classification performance. Additionally, Adapala et al. [15] explored the use of Support Vector Machines (SVM) and K-Nearest Neighbor (KNN) for breast cancer classification, highlighting the robustness of these algorithms in handling medical datasets.
Pathological and molecular classification studies, such as those by [16,17], have emphasized the importance of integrating histopathological and molecular data for accurate diagnosis and treatment planning. Meanwhile, radiomics-based approaches [18] have shown promise in extracting quantitative features from medical images to enhance predictive models. To address the computational challenges of real-time classification, recent works like [19,20] have explored FPGA-based hardware acceleration, achieving significant speedups and scalability. Additionally, the need for secure computation in medical analytics has been addressed by [21], who proposed an FPGA-based accelerator for privacy-preserving K-Nearest Neighbor (KNN) classification on encrypted data. Collectively, these studies highlight the evolution of breast cancer classification techniques, from algorithmic improvements to hardware acceleration and secure computation, paving the way for more efficient, scalable, and privacy-preserving diagnostic systems.
3. Project Overview
Our work introduces a high-performance FPGA-based prototyping framework specifically designed for CNN acceleration on the Xilinx PYNQ platform. This framework offers an optimized balance between computational speed, power efficiency, and deployment flexibility. Unlike prior works, which primarily focus on either algorithmic improvements or hardware acceleration in isolation, our approach provides an end-to-end solution that optimizes CNN deployment while leveraging both the ARM processor and FPGA resources in a synergistic manner.
The primary objective of this project is to design and implement a high-performance FPGA-based prototyping framework tailored for the efficient deployment of Convolutional Neural Networks (CNNs). This framework aims to streamline the process of prototyping CNNs on FPGA hardware while maximizing the utilization of both the ARM processor and FPGA resources. By leveraging the unique capabilities of the PYNQ platform, the project seeks to achieve state-of-the-art performance in terms of deployment speed, accuracy, and power efficiency for CNN-based applications.
The framework is developed specifically for the Xilinx™ PYNQ Development Board (manufactured by TUL Corporation, Taipei, Taiwan), which integrates a Zynq FPGA with a Dual-Core ARM Cortex-A9 processor running a Linux operating system. This platform is particularly advantageous for this project due to its combination of high-performance computing (HPC) capabilities provided by the FPGA and the flexibility of a Linux-based environment, which supports high-level programming interfaces. These features make the PYNQ platform an ideal choice for building a robust and efficient prototyping framework.
At the core of this framework is a library of FPGA-based Intellectual Property (IP) designs, packaged as block designs, which serve as the fundamental building blocks for constructing CNN models on the FPGA. These IP blocks are meticulously optimized for performance, enabling efficient and flexible deployment of CNN architectures. The framework also includes a Vivado project that synthesizes CNN models into FPGA bitstreams. These bitstreams configure the FPGA to perform the necessary computations, facilitating high-speed inference on the embedded platform. By integrating these components, the framework provides a seamless workflow for deploying CNNs on FPGA hardware, bridging the gap between high-level design and low-level hardware implementation.
4. AI-Based Breast Cancer Classification
The AI-based breast cancer classification system focuses on enhancing the accuracy and efficiency of detecting and categorizing breast tumors into malignant, benign, and normal types (Figure 1). This application leverages convolutional neural network (CNN) architectures to analyze ultrasound images, a prevalent and non-invasive medical imaging modality that provides detailed internal views without exposing patients to harmful radiation. By employing CNNs, the system automates the identification and classification of breast tissue abnormalities, reducing the reliance on manual interpretation.
This approach is particularly significant for early detection, enabling timely and precise diagnosis of breast cancer. It streamlines the workflow for healthcare professionals, offering a reliable tool to differentiate between tumor types. The integration of AI in this process not only supports medical practitioners in making informed decisions but also contributes to improved patient outcomes by facilitating prompt and targeted treatment plans. Through its efficiency and accuracy, this AI-based solution addresses critical challenges in breast cancer diagnosis, advancing the field of medical imaging and oncology.
4.1. Overview of the AI Model
The AI model developed in this project is based on a Convolutional Neural Network (CNN) architecture, specifically designed for high-accuracy classification tasks, such as breast cancer detection. CNNs are particularly effective for image-based applications due to their ability to automatically extract hierarchical features from input data, making them ideal for medical image analysis. The workflow of the proposed AI model for breast cancer classification is illustrated in Figure 2.
Solid Black Arrows: Represent the primary flow of data through the process: from Data → Training Data → Learning Algorithm → Evaluation → Test.
Dashed Orange Arrows: Indicate feedback or validation processes:
From Validation Data to Learning Algorithm (model improvement or tuning).
From Test to Validation Data (post-evaluation validation for robustness).
Solid Gray Arrow: Represents an internal feedback loop within the Learning Algorithm stage for iterative learning or optimization.
The process of creating a CNN model for classification involves several critical steps, each aimed at ensuring accuracy and efficiency. It begins with data collection and preprocessing, where a relevant dataset is gathered and prepared through resizing, normalization, and augmentation to enhance data quality and model generalization.
The training process involves dividing the dataset into three distinct subsets: training data, validation data, and test data. The training data is used to teach the model to recognize patterns and features relevant to breast cancer detection. The validation data helps in tuning the model’s hyperparameters and preventing overfitting, ensuring that the model generalizes well to unseen data. Finally, the test data is used to evaluate the model’s performance and accuracy in a real-world scenario.
This is followed by defining the CNN architecture, where layers like convolutional, pooling, and fully connected layers are designed with specific parameters such as the number of filters, kernel sizes, and activation functions to suit the dataset’s complexity. The training phase involves feeding the preprocessed data into the model, allowing it to learn features through optimization techniques like Adam. The model’s performance is then assessed through validation and hyperparameter tuning, where parameters such as learning rate and batch size are adjusted to improve accuracy and generalization. Next, the model undergoes testing and evaluation using metrics like accuracy, precision, recall, and F1-score to measure its effectiveness on unseen data.
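As a concrete illustration of this split-train-validate-evaluate workflow, a minimal sketch using TensorFlow/Keras is given below. It assumes integer class labels and the 80:20 train/test split described in Section 4.2; the batch size is an illustrative assumption rather than the exact training configuration used in this work.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_and_evaluate(model, images, labels):
    # 80:20 train/test split, as described in Section 4.2.
    x_train, x_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.20, stratify=labels, random_state=0)
    # Adam optimization, as mentioned above; the loss assumes integer labels.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # A further split of the training set provides the validation data used
    # for hyperparameter tuning and overfitting checks (50 epochs, per Figure 4).
    history = model.fit(x_train, y_train, validation_split=0.2,
                        epochs=50, batch_size=32)
    # Final assessment on unseen test data.
    test_loss, test_acc = model.evaluate(x_test, y_test)
    return history, test_acc
```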
4.2. Dataset and Preprocessing
In the context of our breast cancer detection system, the dataset comprises ultrasound images categorized into three distinct classes, each representing a specific condition of breast tissue. These categories are essential for training the supervised learning model to accurately classify the images. The three main classes include:
Malignant Tumors: Images that show cancerous growths, which require immediate medical attention and treatment.
Benign Tumors: Images that display non-cancerous growths, which are generally less harmful but still need monitoring.
Normal Tissue: Images that do not show any signs of abnormal growths, indicating healthy breast tissue.
The dataset used in this study consists of 1380 breast ultrasound scans collected from the Fattouma Bourguiba University Hospital in Monastir, Tunisia [22]. It is distributed across the three categories as 410 images of malignant tumors, 637 images of benign tumors, and 333 images of normal tissue. Training on a representative sample of each class reduces the risk of bias and improves generalization; the residual class imbalance is addressed through the data augmentation described below.
Before training the model, the dataset undergoes a series of preprocessing steps to enhance the quality and consistency of the images. These steps include resizing the images to a uniform resolution, normalizing pixel values to a standard range, and applying data augmentation techniques such as rotation, flipping, and scaling to increase the diversity of the training data. These preprocessing measures not only improve the robustness of the model but also help mitigate overfitting, ensuring better performance on unseen data.
The preprocessing steps for the dataset were carefully designed to ensure optimal performance of the AI model. Initially, all ultrasound images were resized to a consistent resolution of 256 × 256 pixels to standardize the input dimensions. This step ensures compatibility with the CNN architecture and reduces computational overhead. Additionally, the image intensities were normalized to a standard range, which helps in stabilizing the training process and improving convergence. The images were also converted to a uniform format (JPEG) to streamline their use during model training.
Table 1 summarizes the preprocessing parameters used.
Given the relatively small size of the dataset, data augmentation techniques were employed to enhance the dataset’s diversity and improve the model’s generalization capabilities. The dataset was split into training and testing sets using an 80:20 ratio, a common practice in machine learning to ensure a robust evaluation of the model’s performance. Data augmentation played a crucial role in addressing class imbalance and enriching the training set with varied tumor representations. Techniques such as rotation (90°, 180°, 270°), flipping (horizontal and vertical), and elastic deformation were applied to the images. These transformations simulate natural variations in breast ultrasound scans, enabling the model to recognize tumors in different orientations and under varying conditions (Table 1). Specifically:
Rotation: Generated diverse views of tumor regions, enhancing the model’s ability to detect tumors regardless of their orientation in the image.
Flips: Ensured the model could identify tumors from all spatial perspectives, improving its robustness to changes in image alignment.
Elastic Deformations: Simulated natural variability in ultrasound scans, helping the model handle subtle differences in tumor shape and size.
These augmentation techniques were applied uniformly across the entire dataset, including all tumor regions, to create a more comprehensive and balanced training set. By incorporating these preprocessing and augmentation steps, the dataset was prepared to train a robust and generalizable AI model capable of accurately classifying breast ultrasound images.
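As an illustration, the sketch below expresses these preprocessing and augmentation steps with Pillow, numpy, and SciPy; the elastic deformation parameters (alpha, sigma) are illustrative assumptions, since the exact values are not specified here.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter, map_coordinates

def preprocess(path, size=(256, 256)):
    """Resize to 256 x 256 and normalize intensities to [0, 1] (Table 1)."""
    img = Image.open(path).convert("L").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def augment(img, alpha=34.0, sigma=4.0, seed=0):
    """Yield the rotated, flipped, and elastically deformed variants."""
    rng = np.random.default_rng(seed)
    for k in (1, 2, 3):                     # 90, 180, and 270 degree rotations
        yield np.rot90(img, k)
    yield np.flip(img, axis=0)              # vertical flip
    yield np.flip(img, axis=1)              # horizontal flip
    # Elastic deformation: a smoothed random displacement field warps the
    # image, simulating natural variability in tumor shape and size.
    dx = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]),
                       indexing="ij")
    yield map_coordinates(img, [y + dy, x + dx], order=1)
```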
By carefully curating and preprocessing the dataset, we ensure that the AI model is trained on high-quality, representative data, which is critical for achieving accurate and reliable breast cancer classification.
4.3. Proposed CNN Model Training
The training of the proposed Convolutional Neural Network (CNN) model was carefully designed to ensure high accuracy and robustness for breast cancer classification. The model architecture consists of multiple convolutional layers for feature extraction, pooling layers for dimensionality reduction, and fully connected layers for classification. The training process leverages the preprocessed and augmented dataset to enhance the model’s ability to generalize across different tumor types and orientations.
Figure 3 illustrates the proposed CNN model for breast cancer classification.
The proposed Convolutional Neural Network (CNN) model is designed as a sequential model, carefully structured to extract meaningful features from breast ultrasound images and classify them into three distinct categories: malignant tumors, benign tumors, and normal tissue. The architecture is composed of multiple layers, each serving a specific purpose in the feature extraction and classification process.
The model’s architecture alternates between convolutional and pooling layers to extract and reduce features from ultrasound images. The final dense layers then use these features to perform the classification. This CNN is designed to capture complex visual patterns in breast tissue images, enabling it to effectively differentiate between malignant tumors, benign tumors, and normal tissue. Once trained on a comprehensive and well-labeled dataset, the model will be capable of classifying ultrasound images based on the presence and type of abnormalities. The model’s effectiveness relies on the quality of the training data and the optimization of the model’s parameters.
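A minimal Keras sketch of such a sequential architecture is shown below. The first block follows the dimensions stated in Section 5 (a 32-filter convolution on a 256 × 256 × 1 input, ReLU activation, and 2 × 2 average pooling with stride 2); the depth, kernel sizes, and filter counts of the remaining blocks are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(num_classes=3):
    return tf.keras.Sequential([
        layers.Input(shape=(256, 256, 1)),                 # grayscale ultrasound
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.AveragePooling2D(pool_size=2, strides=2),   # 256 -> 128
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.AveragePooling2D(2),                        # 128 -> 64
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.AveragePooling2D(2),                        # 64 -> 32
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),   # malignant/benign/normal
    ])
```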
4.4. Analysis of Accuracy and Loss Graphs
The accuracy and loss graphs are essential tools for evaluating the performance of the proposed Convolutional Neural Network (CNN) model during training and validation. These graphs provide insights into how well the model is learning and generalizing to unseen data.
The graphs (Figure 4) illustrating the evolution of accuracy and loss for training and validation of the CNN model over 50 epochs provide valuable insights into the model’s performance and behavior over time. On the left, the accuracy graph shows that the model’s performance improves rapidly during the initial epochs, reaching approximately 90% accuracy after about 10 epochs. This indicates a phase of rapid learning where the model is capturing the primary features of the training data. After this initial phase, the accuracy curves for training (blue line) and validation (yellow line) stabilize around 93–94% after 20 to 30 epochs. This stabilization suggests that the model has reached an optimal performance level, with minimal disparity between training and validation accuracy. This is a positive sign of good generalization and reduced overfitting.
The loss graph shows a rapid decrease in training loss (blue line) during the early epochs, which correlates with the increase in accuracy, reaching a loss of about 0.3 after 10 epochs. The training and validation loss curves (yellow line) converge to a stabilization around 0.2 after 20 to 30 epochs. The minimal difference between training and validation losses indicates that the model is well-regularized and capable of maintaining low loss on unseen validation data, reinforcing the absence of overfitting. These graphs reveal that the CNN model is learning effectively from the training data, quickly achieving high accuracy while maintaining low and stable losses. The model’s performance is robust, with minimal divergence between the training and validation curves, demonstrating its ability to generalize to new data.
5. Design and Hardware-Software Implementation
The objective of this section is to provide a step-by-step guide to the implementation process, beginning with the configuration of IP blocks such as convolutional layers, pooling layers, and ReLU activation functions on the PYNQ-Z2 FPGA. We will outline the process of translating high-level AI algorithms, often described in software, into hardware implementations that can be executed on the FPGA, ensuring a significant boost in processing speed and efficiency.
5.1. Overview of the Proposed Co-Design Architecture for the CNN Model
This section provides a comprehensive overview of the architecture of the framework for the proposed hardware-software implementation, which is divided into two primary components: the ARM Linux OS implementation and the Zynq FPGA implementation.
Figure 5 presents an overview of the CNN model architecture.
The left side of the diagram represents the ARM CPU, controlled by Python 3.6 running on the Linux OS. This part of the architecture manages the high-level interface of the framework, overseeing tasks such as loading input feature maps into DDR memory and outputting the final classification results. The ARM side is responsible for orchestrating the overall workflow, ensuring seamless communication between software and hardware components. The right side of the diagram illustrates the Zynq FPGA, which is configured with a custom overlay IP designed specifically for high-speed CNN operations. This hardware component is optimized to perform the forward propagation of the CNN using the Synchronous Dataflow (SDF) paradigm, which allows for efficient parallel processing and data handling within the FPGA. The Zynq FPGA side is crucial for accelerating the computationally intensive tasks of the CNN, enabling real-time inference and high throughput.
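The sketch below illustrates this ARM-side orchestration with the PYNQ Python API. The bitstream file name, the DMA instance name (axi_dma_0), and the 8-bit buffer types are hypothetical; they depend on the actual Vivado block design.

```python
import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("cnn_accelerator.bit")   # configure the Zynq FPGA fabric
dma = overlay.axi_dma_0                    # AXI DMA bridging DDR and the IP

# Contiguous DDR buffers visible to both the ARM CPU and the FPGA.
in_buf = allocate(shape=(256, 256), dtype=np.uint8)        # input feature map
out_buf = allocate(shape=(32, 128, 128), dtype=np.uint8)   # conv+pool output

in_buf[:] = 0                              # placeholder for a preprocessed scan
dma.sendchannel.transfer(in_buf)           # stream the image into the overlay
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()                     # block until the transfer completes
dma.recvchannel.wait()                     # out_buf now holds the IP's result
```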
The proposed co-design architecture for the Convolutional Neural Network (CNN) model integrates both hardware and software components to optimize performance, efficiency, and flexibility. This approach leverages the strengths of FPGA (Field-Programmable Gate Array) hardware for high-speed parallel processing and the versatility of software for model training and control.
5.2. Workflow for FPGA-Based Design and Implementation of AI Intellectual Property (AI-IP)
The traditional workflow for designing FPGA IPs often involves manually developing component modules using Register Transfer Level (RTL) coding. While finely optimized RTL can deliver top-tier performance with minimal resource usage, large, hand-written RTL designs tend to suffer from low readability and are challenging to maintain. Additionally, RTL development is time-consuming, which poses challenges in fast-paced projects with strict deadlines. To overcome these issues, this project adopts Vivado High-Level Synthesis (HLS) to design FPGA IPs. Vivado HLS significantly streamlines IP creation by enabling developers to use C, C++, and SystemC code directly, converting these high-level specifications into FPGA-ready IP cores without the need to manually create RTL. Vivado HLS is compatible with both the ISE and Vivado design environments, offering system and design architects a faster, more efficient approach to IP creation.
Given that this framework is designed to develop Intellectual Property (IP) blocks for Convolutional Neural Networks (CNNs), the FPGA IP design must be modular and highly parameterizable (Figure 6). This approach ensures maximum reconfigurability and scalability while minimizing the need for user intervention. By adopting a modular design, the IP blocks can be easily adapted to various CNN architectures without requiring significant redesign efforts, making the framework versatile and user-friendly.
In addition to simplifying the prototyping process, the hardware design must prioritize performance optimization. This is achieved by maximizing parallelism during computation, fully leveraging the FPGA’s resources to enhance processing speed and efficiency. By utilizing Vivado High-Level Synthesis (HLS), this project not only streamlines the development workflow but also delivers a high-performance, flexible, and easily adaptable IP solution tailored for CNN applications on FPGA platforms.
5.3. Proposed Hardware-Software Implementation
The proposed hardware-software implementation framework integrates FPGA IPs for critical CNN components, including convolutional, average pooling, and ReLU layers. These IPs are designed as modular blocks, enabling the seamless chaining of layers to construct complete CNN architectures. Using the Xilinx Vivado “Block Diagram” utility, the framework facilitates the graphical assembly of neural networks, enhancing the design process by providing a user-friendly and efficient platform. This modular approach allows for the rapid development and customization of neural network models directly on the FPGA, leveraging the hardware’s parallel processing capabilities for optimized performance. By combining intuitive design tools with high-performance FPGA implementations, the framework significantly accelerates the development cycle for AI applications.
To efficiently implement the 2D convolution (Conv2D) operation for deep learning inference on FPGA, we designed a custom IP core optimized for parallel processing. The target convolutional layer processes an input feature map of 256 × 256 with 1 channel, producing 32 output feature maps. Given the computational intensity of Conv2D operations, the implementation leverages loop unrolling and pipelining to enhance throughput and reduce latency. The 2D convolution operation between an input feature map and a filter is mathematically expressed as:

$$O(m, n, k) = \sum_{c=0}^{C-1} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} I(m+i,\, n+j,\, c)\, W(i, j, c, k) + B(k)$$

where:
O(m, n, k) represents the output feature map at position (m, n) for channel k.
I(m + i, n + j, c) denotes the input feature map at position (m + i, n + j) for channel c.
W(i, j, c, k) is the convolution kernel (filter) of size H × W for input channel c and output channel k.
B(k) is the bias term for output channel k.
H and W represent the filter height and width.
C is the number of input channels.
A key optimization applied is loop unrolling with a factor of 8 (#pragma HLS unroll factor = 8), which allows eight parallel operations per clock cycle, significantly accelerating the computation. This strategy effectively reduces the number of cycles required per convolution window, thereby improving performance while maintaining resource efficiency. The FPGA-based Conv2D module integrates on-chip memory buffering, reducing external memory access and further enhancing efficiency. This implementation achieves a high-performance, low-latency convolution operation, making it well-suited for real-time deep learning inference on FPGA-based systems. The combination of loop unrolling and on-chip memory optimizations ensures an efficient balance between resource usage and computational speed.
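As a software reference for this operation, the numpy sketch below evaluates the same equation, assuming "valid" padding (no border handling); such a golden model is useful for verifying the FPGA IP output bit-for-bit, while the IP itself is written in C/C++ for Vivado HLS.

```python
import numpy as np

def conv2d_ref(I, W, B):
    """Golden model of the Conv2D equation: I is (Hin, Win, C),
    W is (H, Wk, C, K), B is (K,); returns (Hout, Wout, K)."""
    H, Wk, C, K = W.shape
    Hout, Wout = I.shape[0] - H + 1, I.shape[1] - Wk + 1
    O = np.empty((Hout, Wout, K), dtype=np.float32)
    for m in range(Hout):
        for n in range(Wout):
            # Contract the H x Wk x C window against all K filters at once.
            window = I[m:m + H, n:n + Wk, :]
            O[m, n, :] = np.tensordot(window, W, axes=([0, 1, 2], [0, 1, 2])) + B
    return O
```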
Figure 7 illustrates the proposed hardware design of the convolutional layer.
To optimize the average pooling operation for FPGA-based deep learning inference, we designed a custom IP core that efficiently performs downsampling while maintaining computational efficiency. The target pooling layer processes an input feature map of 256 × 256 × 32 using a 2 × 2 kernel with a stride of 2, effectively reducing the spatial dimensions to 128 × 128 × 32. The average pooling operation applies a downsampling function over a local window, computing the average of the values within the window. It is expressed as:

$$O(m, n, c) = \frac{1}{K_h K_w} \sum_{i=0}^{K_h-1} \sum_{j=0}^{K_w-1} I(mS+i,\, nS+j,\, c)$$

where:
O(m, n, c) represents the output feature map at position (m, n) for channel c.
I(mS + i, nS + j, c) denotes the input feature map at position (mS + i, nS + j) for channel c.
Kh, Kw are the pooling window height and width.
S is the stride of the pooling operation.
To enhance performance, we apply loop unrolling with a factor of 16 (#pragma HLS unroll factor = 16), enabling the system to compute sixteen pooling operations in parallel per cycle. This significantly reduces latency and improves throughput. Additionally, on-chip memory buffering minimizes external memory accesses, further optimizing execution time and power consumption. This highly optimized average pooling implementation significantly reduces computational complexity, memory bandwidth usage, and latency, making it ideal for real-time embedded AI applications on FPGAs. The combination of loop unrolling and memory-efficient processing ensures an effective trade-off between speed, power efficiency, and hardware resource utilization.
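A corresponding numpy golden model of this pooling stage, again intended for verification rather than deployment, could look as follows (window size and stride default to the 2 × 2, stride-2 configuration above).

```python
import numpy as np

def avgpool_ref(I, k=2, s=2):
    """Average pooling reference: (256, 256, 32) -> (128, 128, 32) for k=s=2."""
    Hout, Wout = (I.shape[0] - k) // s + 1, (I.shape[1] - k) // s + 1
    O = np.empty((Hout, Wout, I.shape[2]), dtype=np.float32)
    for m in range(Hout):
        for n in range(Wout):
            # Mean over the k x k window, computed independently per channel.
            O[m, n, :] = I[m * s:m * s + k, n * s:n * s + k, :].mean(axis=(0, 1))
    return O
```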
Figure 8 illustrates the proposed hardware design of the average pooling layer.
The Rectified Linear Unit (ReLU) activation function is a crucial component of deep learning models, introducing non-linearity while keeping computations simple. The function is defined as:

$$f(x) = \max(0, x)$$
This operation is highly efficient, requiring only a comparison and assignment, making it well-suited for FPGA acceleration. However, for optimal performance, our design leverages parallel processing, loop unrolling, and on-chip memory buffering to ensure high throughput and low latency. A loop unrolling factor of 8 (#pragma HLS unroll factor = 8) is applied to process eight activation operations in parallel per cycle, significantly improving efficiency. The ReLU function is implemented as a streaming operation, ensuring minimal latency and seamless integration with preceding convolution and pooling layers.
By efficiently pipelining and parallelizing ReLU computations, this implementation ensures minimal latency and high throughput, making it ideal for real-time deep learning inference on FPGA. The combination of loop unrolling and memory-efficient processing further enhances performance while keeping power consumption low.
Figure 9 illustrates the proposed hardware design of the ReLU activation function.
The proposed hardware acceleration for the AI model in breast cancer detection is designed to leverage the reconfigurable capabilities of FPGA technology. This design incorporates three optimized hardware IP cores: 2D Convolution (Conv2D), ReLU activation, and Average Pooling, which are interconnected via the AXI interface for seamless data communication. The hardware architecture utilizes a clocking wizard to generate a stable clock signal and a system reset module to initialize the design, ensuring robust and reliable operation.
Figure 10 illustrates the proposed hardware acceleration for the AI model in breast cancer detection.
The Conv2D IP core handles the computationally intensive convolutional operations, which are critical for feature extraction in AI models. The ReLU IP core performs the activation function, introducing non-linearity into the model to enhance its learning capability, while the Average Pooling IP core reduces the spatial dimensions of feature maps, thereby lowering computational complexity and memory requirements. These IPs are integrated into the FPGA fabric, allowing for highly parallel and pipelined execution that drastically improves processing speed compared to traditional CPU or GPU implementations. The flexibility of FPGAs allows for tailoring the IP core’s parameters, such as kernel size, stride, and number of channels, to meet specific application requirements. By using these hardware-accelerated Conv2D, ReLU, and average pooling IP cores, the system achieves faster processing times, which is critical for real-time tasks in applications like image recognition and video processing, where rapid response and high computational efficiency are essential.
6. Results and Discussion
This section presents the performance evaluation of key neural network layers, including Conv2D, Average Pooling, and ReLU, implemented as IP cores on the PYNQ-Z2 FPGA. Each layer was optimized using Vivado HLS to enhance execution speed, reduce latency, and maximize throughput while efficiently utilizing the FPGA’s resources. This evaluation focuses on execution time, resource utilization, and latency for each layer, highlighting the benefits of hardware acceleration for deep learning tasks.
One of the primary goals of this hardware-accelerated implementation was to reduce execution time compared to a software-only solution. By leveraging the FPGA’s resources, including dedicated multipliers and parallel processing units, the IP core achieved a significant reduction in execution time. The pipelining optimization allowed data to flow through the processing pipeline with minimal stalls, minimizing latency and enabling the IP core to process each convolution operation more rapidly. Resource utilization on the PYNQ-Z2 FPGA was analyzed to determine the efficiency of the hardware implementation.
Table 2 presents the hardware implementation of Conv2D, Average Pooling, and ReLU as IP cores on the PYNQ-Z2 FPGA.
The results showed that the IP core made effective use of available resources without exceeding the constraints of the PYNQ-Z2, leaving room for potential future expansion or additional functionalities.
This table summarizes the hardware resources and performance metrics for each IP core implemented on the PYNQ-Z2 FPGA, indicating resource usage in terms of slices, look-up tables (LUTs), flip-flops (FFs), block random access memory (BRAM), and digital signal processing (DSP) blocks. Additionally, latency (in cycles) and operating frequency are provided for each core, showcasing the variation in complexity and efficiency across the different neural network layers. The Conv2D layer, being the most resource-intensive, achieves significant acceleration but utilizes the most slices and DSPs, while simpler layers like Average Pooling and ReLU are more efficient in terms of resource usage and latency, making them suitable for high-speed tasks.
To evaluate the proposed hardware IP for breast cancer detection, we implemented the model on two different platforms: the Dual-Core ARM Cortex-A9 and the Zynq XC7Z020 FPGA. The goal was to compare the execution time and accuracy of the model on a traditional CPU versus a hardware-accelerated FPGA setup (Figure 11). On the Zynq XC7Z020, we implemented the first convolutional layer as a custom IP core, leveraging the FPGA’s parallel processing capabilities to accelerate computation. The remaining layers were processed on the ARM Cortex-A9 core within the Zynq SoC, allowing us to examine the impact of partial hardware acceleration on overall performance. The results, including execution time and classification accuracy, are summarized in Table 3. This comparison highlights the trade-offs involved in FPGA-based hardware acceleration for real-time applications in medical imaging: execution time is faster on the Zynq XC7Z020, showcasing the benefit of hardware acceleration, though there is a slight trade-off in accuracy when moving to the FPGA-based implementation.
The evaluation of the proposed model demonstrates the impact of hardware optimization, particularly through the use of 8-bit fixed-point operations instead of floating-point, to reduce computational complexity and improve efficiency. On the Dual-Core ARM Cortex-A9, the model achieves an execution time of 0.981 s with an accuracy of 94.11%. In contrast, the Zynq XC7Z020 (FPGA) implementation achieves a faster execution time of 0.821 s, representing a 16.3% improvement, but with a slightly reduced accuracy of 89.87%, a decrease of 4.24%. Additionally, power consumption analysis highlights the advantage of FPGA acceleration. The ARM Cortex-A9 consumes 3.8 W, whereas the FPGA implementation requires only 1.4 W, leading to a 63.15% reduction in power consumption. This efficiency gain is particularly beneficial for embedded and edge AI applications, where energy constraints are critical.
Overall, the results demonstrate that FPGA-based implementations provide substantial gains in speed and power efficiency, making them well-suited for real-time applications. However, these advantages come with a slight trade-off in accuracy due to the reduced numerical precision of 8-bit fixed-point arithmetic. This trade-off arises from quantizing the data, which introduces minor precision loss but significantly reduces computational complexity and optimizes resource utilization. Despite this slight decrease in accuracy, the FPGA implementation excels in both performance and energy efficiency, making it an ideal choice for applications requiring fast and power-efficient inference while maintaining an acceptable level of accuracy.
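To make this precision trade-off concrete, the sketch below quantizes floating-point values to signed 8-bit fixed point and measures the round-trip error. The Q3.4 format (4 fractional bits) is an illustrative assumption; the exact fixed-point format used on the FPGA is not detailed here.

```python
import numpy as np

def to_fixed8(x, frac_bits=4):
    """Quantize to signed 8-bit fixed point and dequantize back."""
    scale = 1 << frac_bits
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, q.astype(np.float32) / scale

weights = np.random.default_rng(0).normal(0.0, 0.5, 1000).astype(np.float32)
q, deq = to_fixed8(weights)
# For in-range values the error is bounded by half an LSB (1/32 here).
print("max quantization error:", np.abs(weights - deq).max())
```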
6.1. Comparison with Traditional Machine Learning Approaches
This section provides a detailed comparison between the proposed CNN implementation and traditional machine learning methods for breast cancer detection across various platforms. Our Convolutional Neural Network (CNN) implementation on the ARM Cortex-A9 platform demonstrates significant advantages over traditional machine learning methods in terms of both performance and accuracy. As shown in Table 4, our approach outperforms methods such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGB), and Lightweight Deep Convolutional Neural Networks (LWDCNN), despite using more modest hardware, a Dual-Core ARM Cortex-A9 processor, compared to the high-performance CPU platforms used in other studies.
Our approach demonstrates superior performance compared to existing methods in terms of accuracy (Figure 12), efficiency, and hardware utilization. As shown in the comparison, our method, based on a Convolutional Neural Network (CNN), achieves an accuracy of 94.11%, surpassing other techniques such as K-Nearest Neighbors (KNN) (89.2%), Support Vector Machines (SVM) (93.39%), Extreme Gradient Boosting (XGB) (94.10%), and Lightweight Deep Convolutional Neural Networks (LWDCNN) (91.89%).
Unlike previous works that rely on high-performance CPU-based platforms, our approach operates on a more modest Dual-Core ARM Cortex-A9 processor. Despite this hardware limitation, our method not only achieves the highest accuracy but also maintains an efficient execution time of 0.981 s, demonstrating its suitability for real-time applications. Additionally, our model’s power consumption is measured at 3.8 W, which is crucial for energy-efficient processing, especially in embedded and edge computing scenarios. Overall, our work highlights the potential of deploying CNN-based models on resource-constrained hardware while achieving state-of-the-art performance in classification accuracy, making it a promising solution for real-time and power-sensitive applications.
Despite the ARM Cortex-A9 platform being considered resource-constrained compared to high-performance processors, the results illustrate that, with proper optimization, deep learning models such as CNN can surpass traditional machine learning methods, even on embedded systems. This highlights the potential for deploying CNNs in real-time, practical medical applications on mobile and embedded devices.
6.2. Performance Analysis
Our modular IP core design balances computational performance with power efficiency while making effective use of the FPGA’s available resources.
Table 5 and Figure 13 present the performance analysis of various machine learning approaches, highlighting the accuracy, power consumption, and execution time across different implementations.
Our work demonstrates a significant improvement in performance compared to existing implementations in terms of accuracy, latency, throughput, and power consumption while utilizing a resource-constrained PYNQ-Z2 platform.
Among previous studies, Saeed et al. [26] implemented a CNN on the ZCU104, reporting a power consumption of 11.65 W, significantly higher than our approach (1.4 W). Laxmisagar et al. [27], using an SVM on the KC705, achieved an accuracy of 91.08% but with higher power consumption (4.57 W) compared to our CNN-based approach. Maria et al. [19] and Guptha et al. [28] explored alternative models, BI-RADS and RL-BLED, with accuracy levels of 85% and 89.4%, respectively, but suffered from high latency (7.2 s and 5.57 s, respectively), making them less suitable for real-time applications.
Regarding bit precision, Guo et al. [29] and Kim et al. [30] used 8-bit and 16-bit CNN/RNN models, respectively, but did not provide detailed accuracy or latency results. However, their power consumption values (3.7–5.47 W) were higher than our optimized implementation. Aditya et al. [25], implementing LWDCNN on PYNQ-Z2, achieved an accuracy of 85.36% with a latency of 6.44 s, significantly slower than our approach (0.821 s), while also having a lower throughput (0.15 FPS compared to our 1.22 FPS).
Our CNN-based implementation on PYNQ-Z2 achieves an accuracy of 89.87%, with an optimized latency of 0.821 s, a throughput of 1.22 FPS, and a power consumption of only 1.4 W. These results demonstrate the efficiency of our method in balancing accuracy, speed, and energy consumption, making it well-suited for real-time, power-sensitive embedded AI applications.
7. Conclusions
In this study, we designed and implemented Intellectual Property (IP) cores for Conv2D, Average Pooling, and ReLU on FPGA platforms, focusing on accelerating the first layers of a CNN-based breast cancer detection model. To evaluate the benefits of FPGA acceleration, we compared both software-based execution on an ARM Cortex-A9 processor and hardware implementation on the PYNQ-Z2 FPGA, analyzing their impact on performance, resource utilization, and energy efficiency.
The software implementation on the ARM Cortex-A9 achieved an execution time of 0.981 s, with a classification accuracy of 94.11%. However, the intensive computational requirements of CNN operations resulted in higher processing latency and power consumption, posing challenges for real-time processing on embedded platforms. Conversely, the hardware implementation using FPGA-based IP cores demonstrated a 16.3% speedup, reducing execution time to 0.821 s, with an accuracy of 89.87% under the 8-bit fixed-point quantization discussed in Section 6. Moreover, the FPGA implementation significantly improved energy efficiency and resource utilization, achieving 68% LUT usage, 72% DSP utilization, 65% memory consumption, and 1.4 W power consumption. These results validate the effectiveness of hardware acceleration in achieving both higher computational efficiency and lower power consumption, making it an ideal solution for real-time AI applications in medical imaging.
To further enhance the performance and scalability of the proposed approach, future work will focus on implementing the entire breast cancer detection model in hardware, utilizing iterative architectures and reconfigurable IP cores for greater flexibility and efficiency. Additionally, we plan to explore different fixed-point representations (16-bit and 32-bit) to optimize the trade-off between precision, resource utilization, and execution time. These enhancements will further boost computational efficiency while maintaining high classification accuracy, making FPGA-based AI solutions even more effective for real-time medical imaging applications.