1. Introduction
Medical imaging is a fundamental field in modern medicine, playing a critical role in the diagnosis, monitoring, and treatment of various diseases. It allows clinicians to visualize the internal structures of the human body in a non-invasive manner, supporting more informed medical decisions. Common imaging techniques include ultrasound, X-rays, computed tomography (CT), magnetic resonance imaging (MRI), and nuclear medicine [1]. Medical imaging has revolutionized the diagnosis and treatment of several life-threatening conditions. For example, the detection of tumors, the identification of fractures, and the evaluation of cardiac conditions are greatly enhanced through imaging techniques [2]. However, with the growing complexity of medical cases and the increasing volume of data generated, the need for more advanced and automated tools has become evident.
AI has emerged as a transformative force in medical imaging, particularly in the areas of image analysis, pattern recognition, and decision support systems [3]. Traditional medical imaging relies heavily on the expertise of radiologists to interpret complex images. However, the introduction of deep learning and machine learning algorithms has enabled faster and more accurate interpretation of medical images, thus reducing human error and increasing diagnostic efficiency. AI-based systems, especially convolutional neural networks (CNNs), have demonstrated exceptional capabilities in analyzing large datasets of medical images [4]. These systems can be trained to recognize patterns that may be invisible to the human eye, allowing them to detect early-stage diseases such as cancer or subtle fractures. Furthermore, AI can assist in segmenting images, classifying tissue types, and even predicting patient outcomes based on imaging data. AI is particularly impactful at automating routine tasks, reducing radiologists’ workload and allowing them to focus on more complex cases. AI’s automated detection algorithms can flag suspicious areas on mammograms or CT scans, prompting a more detailed examination by a clinician [5].
With the increasing adoption of AI in medical imaging, the hardware that powers these advanced algorithms has gained significance. AI models, particularly deep neural networks, require significant computational power for training and inference. Medical images are often high-resolution and three-dimensional, resulting in massive amounts of data that must be processed in real time [6]. This computational demand is especially high in applications like breast cancer detection, where accuracy and speed are critical [7]. To address these challenges, the use of specialized hardware accelerators such as field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) has become essential. Hardware accelerators significantly improve the performance of AI algorithms by reducing computation time, increasing energy efficiency, and enabling parallel processing.
For breast cancer detection, where timely and precise diagnostics are crucial, hardware accelerators ensure that AI systems can analyze mammograms or ultrasound images far more efficiently than traditional computing methods. This speed and accuracy are vital in clinical environments, where prompt diagnosis can significantly affect patient care [8,9]. Moreover, hardware acceleration makes it possible to deploy AI models in resource-constrained environments, such as rural clinics or mobile healthcare units. By using embedded systems with FPGA or GPU-based hardware, these advanced AI models can be run locally without the need for constant cloud connectivity, thus increasing accessibility to high-quality diagnostics even in remote areas [10,11].
The objective of this study is to investigate the implementation of artificial intelligence (AI) algorithms, particularly convolutional neural networks (CNNs), within FPGA (Field-Programmable Gate Array) hardware to optimize the performance of our proposed AI model. This research focuses on an application in breast cancer detection and classification systems. In our AI model, we have designed and integrated three key intellectual property (IP) blocks: conv2d (2D convolutional layer), average pooling, and ReLU (Rectified Linear Unit) activation, within the initial layers. These components are specifically implemented to accelerate the execution time of our AI model, thereby enhancing its efficiency in breast cancer detection tasks. The contributions of this work are as follows:
Developing a CNN-based model for breast cancer classification that uses data augmentation to improve accuracy and robustness.
Implementing and evaluating the proposed model on the PYNQ-Z2 platform with an ARM Cortex-A9 processor (software implementation).
Designing an optimized FPGA-based hardware accelerator, which significantly improves throughput and reduces latency and power consumption, while maintaining competitive accuracy levels for breast cancer classification tasks.
Proposing a hybrid approach for breast cancer detection that combines software and hardware-based solutions, offering a balanced and scalable method for real-time medical diagnostics.
The remainder of this paper is organized as follows:
Section 2 reviews related works on hardware-accelerated CNNs, focusing on their applications in breast cancer classification.
Section 3 provides an overview of the project, describing the motivation, objectives, and system architecture.
Section 4 details the AI-based breast cancer classification model, including preprocessing, model design, and training strategies.
Section 5 presents the design and hardware-software implementation, covering FPGA acceleration, precision settings, and integration with the embedded system.
Section 6 discusses the results, analyzing classification accuracy, execution time, and hardware efficiency.
Finally, Section 7 concludes the paper, summarizing key findings and suggesting future research directions.
2. Related Works
Breast cancer classification and diagnosis have been extensively explored using a variety of approaches, ranging from traditional machine learning techniques to advanced deep learning models and hardware acceleration. Early studies, such as those by [12], demonstrated the effectiveness of hybrid machine learning methods in achieving high diagnostic accuracy, though these approaches often require significant computational resources. More recent advancements have introduced sophisticated algorithms like XGBoost [13] and fuzzy logic-based systems [14], which leverage domain knowledge and handle uncertainty in medical data to improve classification performance. Additionally, Adapala et al. [15] explored the use of Support Vector Machines (SVM) and K-Nearest Neighbor (KNN) for breast cancer classification, highlighting the robustness of these algorithms in handling medical datasets.
Pathological and molecular classification studies, such as those by [16,17], have emphasized the importance of integrating histopathological and molecular data for accurate diagnosis and treatment planning. Meanwhile, radiomics-based approaches [18] have shown promise in extracting quantitative features from medical images to enhance predictive models. To address the computational challenges of real-time classification, recent works like [19,20] have explored FPGA-based hardware acceleration, achieving significant speedups and scalability. Additionally, the need for secure computation in medical analytics has been addressed by [21], who proposed an FPGA-based accelerator for privacy-preserving K-Nearest Neighbor (KNN) classification on encrypted data. Collectively, these studies highlight the evolution of breast cancer classification techniques, from algorithmic improvements to hardware acceleration and secure computation, paving the way for more efficient, scalable, and privacy-preserving diagnostic systems.
3. Project Overview
Our work introduces a high-performance FPGA-based prototyping framework specifically designed for CNN acceleration on the Xilinx PYNQ platform. This framework offers an optimized balance between computational speed, power efficiency, and deployment flexibility. Unlike prior works, which primarily focus on either algorithmic improvements or hardware acceleration in isolation, our approach provides an end-to-end solution that optimizes CNN deployment while leveraging both the ARM processor and FPGA resources in a synergistic manner.
The primary objective of this project is to design and implement a high-performance FPGA-based prototyping framework tailored for the efficient deployment of Convolutional Neural Networks (CNNs). This framework aims to streamline the process of prototyping CNNs on FPGA hardware while maximizing the utilization of both the ARM processor and FPGA resources. By leveraging the unique capabilities of the PYNQ platform, the project seeks to achieve state-of-the-art performance in terms of deployment speed, accuracy, and power efficiency for CNN-based applications.
The framework is developed specifically for the Xilinx™ PYNQ Development Board (manufactured by TUL Corporation, Taipei, Taiwan), which integrates a Zynq FPGA with a Dual-Core ARM Cortex-A9 processor running a Linux operating system. This platform is particularly advantageous for this project due to its combination of high-performance computing (HPC) capabilities provided by the FPGA and the flexibility of a Linux-based environment, which supports high-level programming interfaces. These features make the PYNQ platform an ideal choice for building a robust and efficient prototyping framework.
At the core of this framework is a library of FPGA-based Intellectual Property (IP) designs, packaged as block designs, which serve as the fundamental building blocks for constructing CNN models on the FPGA. These IP blocks are meticulously optimized for performance, enabling efficient and flexible deployment of CNN architectures. The framework also includes a Vivado project that synthesizes CNN models into FPGA bitstreams. These bitstreams configure the FPGA to perform the necessary computations, facilitating high-speed inference on the embedded platform. By integrating these components, the framework provides a seamless workflow for deploying CNNs on FPGA hardware, bridging the gap between high-level design and low-level hardware implementation.
4. AI-Based Breast Cancer Classification
The AI-based breast cancer classification system focuses on enhancing the accuracy and efficiency of detecting and categorizing breast tumors into malignant, benign, and normal types (Figure 1). This application leverages convolutional neural network (CNN) architectures to analyze ultrasound images, a prevalent and non-invasive medical imaging modality that provides detailed internal views without exposing patients to harmful radiation. By employing CNNs, the system automates the identification and classification of breast tissue abnormalities, reducing the reliance on manual interpretation.
This approach is particularly significant for early detection, enabling timely and precise diagnosis of breast cancer. It streamlines the workflow for healthcare professionals, offering a reliable tool to differentiate between tumor types. The integration of AI in this process not only supports medical practitioners in making informed decisions but also contributes to improved patient outcomes by facilitating prompt and targeted treatment plans. Through its efficiency and accuracy, this AI-based solution addresses critical challenges in breast cancer diagnosis, advancing the field of medical imaging and oncology.
4.1. Overview of the AI Model
The AI model developed in this project is based on a Convolutional Neural Network (CNN) architecture, specifically designed for high-accuracy classification tasks, such as breast cancer detection. CNNs are particularly effective for image-based applications due to their ability to automatically extract hierarchical features from input data, making them ideal for medical image analysis. The workflow of the proposed AI model for breast cancer classification is illustrated in Figure 2.
Solid Black Arrows: Represent the primary flow of data through the process: from Data → Training Data → Learning Algorithm → Evaluation → Test.
Dashed Orange Arrows: Indicate feedback or validation processes:
From Validation Data to Learning Algorithm (model improvement or tuning).
From Test to Validation Data (post-evaluation validation for robustness).
Solid Gray Arrow: Represents an internal feedback loop within the Learning Algorithm stage for iterative learning or optimization.
The process of creating a CNN model for classification involves several critical steps, each aimed at ensuring accuracy and efficiency. It begins with data collection and preprocessing, where a relevant dataset is gathered and prepared through resizing, normalization, and augmentation to enhance data quality and model generalization.
The training process involves dividing the dataset into three distinct subsets: training data, validation data, and test data. The training data is used to teach the model to recognize patterns and features relevant to breast cancer detection. The validation data helps in tuning the model’s hyperparameters and preventing overfitting, ensuring that the model generalizes well to unseen data. Finally, the test data is used to evaluate the model’s performance and accuracy in a real-world scenario.
This is followed by defining the CNN architecture, where layers like convolutional, pooling, and fully connected layers are designed with specific parameters such as the number of filters, kernel sizes, and activation functions to suit the dataset’s complexity. The training phase involves feeding the preprocessed data into the model, allowing it to learn features through optimization techniques like Adam. The model’s performance is then assessed through validation and hyperparameter tuning, where parameters such as learning rate and batch size are adjusted to improve accuracy and generalization. Next, the model undergoes testing and evaluation using metrics like accuracy, precision, recall, and F1-score to measure its effectiveness on unseen data.
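As a concrete illustration of this split-train-validate-evaluate workflow, a minimal sketch using TensorFlow/Keras is given below. It assumes integer class labels and the 80:20 train/test split described in Section 4.2; the batch size is an illustrative assumption rather than the exact training configuration used in this work.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_and_evaluate(model, images, labels):
    # 80:20 train/test split, as described in Section 4.2.
    x_train, x_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.20, stratify=labels, random_state=0)
    # Adam optimization, as mentioned above; the loss assumes integer labels.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # A further split of the training set provides the validation data used
    # for hyperparameter tuning and overfitting checks (50 epochs, per Figure 4).
    history = model.fit(x_train, y_train, validation_split=0.2,
                        epochs=50, batch_size=32)
    # Final assessment on unseen test data.
    test_loss, test_acc = model.evaluate(x_test, y_test)
    return history, test_acc
```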
4.2. Dataset and Preprocessing
In the context of our breast cancer detection system, the dataset comprises ultrasound images categorized into three distinct classes, each representing a specific condition of breast tissue. These categories are essential for training the supervised learning model to accurately classify the images. The three main classes include:
Malignant Tumors: Images that show cancerous growths, which require immediate medical attention and treatment.
Benign Tumors: Images that display non-cancerous growths, which are generally less harmful but still need monitoring.
Normal Tissue: Images that do not show any signs of abnormal growths, indicating healthy breast tissue.
The dataset used in this study consists of 1380 breast ultrasound scans collected from the Fattouma Bourguiba University Hospital in Monastir, Tunisia [22]. It is distributed across the three categories as 410 images of malignant tumors, 637 images of benign tumors, and 333 images of normal tissue. Training on a representative sample of each class reduces the risk of bias and improves generalization; the residual class imbalance is addressed through the data augmentation described below.
Before training the model, the dataset undergoes a series of preprocessing steps to enhance the quality and consistency of the images. These steps include resizing the images to a uniform resolution, normalizing pixel values to a standard range, and applying data augmentation techniques such as rotation, flipping, and scaling to increase the diversity of the training data. These preprocessing measures not only improve the robustness of the model but also help mitigate overfitting, ensuring better performance on unseen data.
The preprocessing steps for the dataset were carefully designed to ensure optimal performance of the AI model. Initially, all ultrasound images were resized to a consistent resolution of 256 × 256 pixels to standardize the input dimensions. This step ensures compatibility with the CNN architecture and reduces computational overhead. Additionally, the image intensities were normalized to a standard range, which helps in stabilizing the training process and improving convergence. The images were also converted to a uniform format (JPEG) to streamline their use during model training.
Table 1 summarizes the preprocessing parameters used.
Given the relatively small size of the dataset, data augmentation techniques were employed to enhance the dataset’s diversity and improve the model’s generalization capabilities. The dataset was split into training and testing sets using an 80:20 ratio, a common practice in machine learning to ensure a robust evaluation of the model’s performance. Data augmentation played a crucial role in addressing class imbalance and enriching the training set with varied tumor representations. Techniques such as rotation (90°, 180°, 270°), flipping (horizontal and vertical), and elastic deformation were applied to the images. These transformations simulate natural variations in breast ultrasound scans, enabling the model to recognize tumors in different orientations and under varying conditions (Table 1). Specifically:
Rotation: Generated diverse views of tumor regions, enhancing the model’s ability to detect tumors regardless of their orientation in the image.
Flips: Ensured the model could identify tumors from all spatial perspectives, improving its robustness to changes in image alignment.
Elastic Deformations: Simulated natural variability in ultrasound scans, helping the model handle subtle differences in tumor shape and size.
These augmentation techniques were applied uniformly across the entire dataset, including all tumor regions, to create a more comprehensive and balanced training set. By incorporating these preprocessing and augmentation steps, the dataset was prepared to train a robust and generalizable AI model capable of accurately classifying breast ultrasound images.
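As an illustration, the sketch below expresses these preprocessing and augmentation steps with Pillow, numpy, and SciPy; the elastic deformation parameters (alpha, sigma) are illustrative assumptions, since the exact values are not specified here.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter, map_coordinates

def preprocess(path, size=(256, 256)):
    """Resize to 256 x 256 and normalize intensities to [0, 1] (Table 1)."""
    img = Image.open(path).convert("L").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def augment(img, alpha=34.0, sigma=4.0, seed=0):
    """Yield the rotated, flipped, and elastically deformed variants."""
    rng = np.random.default_rng(seed)
    for k in (1, 2, 3):                     # 90, 180, and 270 degree rotations
        yield np.rot90(img, k)
    yield np.flip(img, axis=0)              # vertical flip
    yield np.flip(img, axis=1)              # horizontal flip
    # Elastic deformation: a smoothed random displacement field warps the
    # image, simulating natural variability in tumor shape and size.
    dx = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]),
                       indexing="ij")
    yield map_coordinates(img, [y + dy, x + dx], order=1)
```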
By carefully curating and preprocessing the dataset, we ensure that the AI model is trained on high-quality, representative data, which is critical for achieving accurate and reliable breast cancer classification.
4.3. Proposed CNN Model Training
The training of the proposed Convolutional Neural Network (CNN) model was carefully designed to ensure high accuracy and robustness for breast cancer classification. The model architecture consists of multiple convolutional layers for feature extraction, pooling layers for dimensionality reduction, and fully connected layers for classification. The training process leverages the preprocessed and augmented dataset to enhance the model’s ability to generalize across different tumor types and orientations.
Figure 3 illustrates the proposed CNN model for breast cancer classification.
The proposed Convolutional Neural Network (CNN) model is designed as a sequential model, carefully structured to extract meaningful features from breast ultrasound images and classify them into three distinct categories: malignant tumors, benign tumors, and normal tissue. The architecture is composed of multiple layers, each serving a specific purpose in the feature extraction and classification process.
The model’s architecture alternates between convolutional and pooling layers to extract and reduce features from ultrasound images. The final dense layers then use these features to perform the classification. This CNN is designed to capture complex visual patterns in breast tissue images, enabling it to effectively differentiate between malignant tumors, benign tumors, and normal tissue. Once trained on a comprehensive and well-labeled dataset, the model will be capable of classifying ultrasound images based on the presence and type of abnormalities. The model’s effectiveness relies on the quality of the training data and the optimization of the model’s parameters.
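A minimal Keras sketch of such a sequential architecture is shown below. The first block follows the dimensions stated in Section 5 (a 32-filter convolution on a 256 × 256 × 1 input, ReLU activation, and 2 × 2 average pooling with stride 2); the depth, kernel sizes, and filter counts of the remaining blocks are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(num_classes=3):
    return tf.keras.Sequential([
        layers.Input(shape=(256, 256, 1)),                 # grayscale ultrasound
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.AveragePooling2D(pool_size=2, strides=2),   # 256 -> 128
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.AveragePooling2D(2),                        # 128 -> 64
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.AveragePooling2D(2),                        # 64 -> 32
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),   # malignant/benign/normal
    ])
```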
4.4. Analysis of Accuracy and Loss Graphs
The accuracy and loss graphs are essential tools for evaluating the performance of the proposed Convolutional Neural Network (CNN) model during training and validation. These graphs provide insights into how well the model is learning and generalizing to unseen data.
The graphs (Figure 4) illustrating the evolution of accuracy and loss for training and validation of the CNN model over 50 epochs provide valuable insights into the model’s performance and behavior over time. On the left, the accuracy graph shows that the model’s performance improves rapidly during the initial epochs, reaching approximately 90% accuracy after about 10 epochs. This indicates a phase of rapid learning where the model is capturing the primary features of the training data. After this initial phase, the accuracy curves for training (blue line) and validation (yellow line) stabilize around 93–94% after 20 to 30 epochs. This stabilization suggests that the model has reached an optimal performance level, with minimal disparity between training and validation accuracy. This is a positive sign of good generalization and reduced overfitting.
The loss graph shows a rapid decrease in training loss (blue line) during the early epochs, which correlates with the increase in accuracy, reaching a loss of about 0.3 after 10 epochs. The training and validation loss curves (yellow line) converge to a stabilization around 0.2 after 20 to 30 epochs. The minimal difference between training and validation losses indicates that the model is well-regularized and capable of maintaining low loss on unseen validation data, reinforcing the absence of overfitting. These graphs reveal that the CNN model is learning effectively from the training data, quickly achieving high accuracy while maintaining low and stable losses. The model’s performance is robust, with minimal divergence between the training and validation curves, demonstrating its ability to generalize to new data.
5. Design and Hardware-Software Implementation
The objective of this section is to provide a step-by-step guide to the implementation process, beginning with the configuration of IP blocks such as convolutional layers, pooling layers, and ReLU activation functions on the PYNQ-Z2 FPGA. We will outline the process of translating high-level AI algorithms, often described in software, into hardware implementations that can be executed on the FPGA, ensuring a significant boost in processing speed and efficiency.
5.1. Overview of the Proposed Co-Design Architecture for the CNN Model
This section provides a comprehensive overview of the architecture of the framework for the proposed hardware-software implementation, which is divided into two primary components: the ARM Linux OS implementation and the Zynq FPGA implementation.
Figure 5 presents an overview of the CNN model architecture.
The left side of the diagram represents the ARM CPU, controlled by Python 3.6 running on the Linux OS. This part of the architecture manages the high-level interface of the framework, overseeing tasks such as loading input feature maps into DDR memory and outputting the final classification results. The ARM side is responsible for orchestrating the overall workflow, ensuring seamless communication between software and hardware components. The right side of the diagram illustrates the Zynq FPGA, which is configured with a custom overlay IP designed specifically for high-speed CNN operations. This hardware component is optimized to perform the forward propagation of the CNN using the Synchronous Dataflow (SDF) paradigm, which allows for efficient parallel processing and data handling within the FPGA. The Zynq FPGA side is crucial for accelerating the computationally intensive tasks of the CNN, enabling real-time inference and high throughput.
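The sketch below illustrates this ARM-side orchestration with the PYNQ Python API. The bitstream file name, the DMA instance name (axi_dma_0), and the 8-bit buffer types are hypothetical; they depend on the actual Vivado block design.

```python
import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("cnn_accelerator.bit")   # configure the Zynq FPGA fabric
dma = overlay.axi_dma_0                    # AXI DMA bridging DDR and the IP

# Contiguous DDR buffers visible to both the ARM CPU and the FPGA.
in_buf = allocate(shape=(256, 256), dtype=np.uint8)        # input feature map
out_buf = allocate(shape=(32, 128, 128), dtype=np.uint8)   # conv+pool output

in_buf[:] = 0                              # placeholder for a preprocessed scan
dma.sendchannel.transfer(in_buf)           # stream the image into the overlay
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()                     # block until the transfer completes
dma.recvchannel.wait()                     # out_buf now holds the IP's result
```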
The proposed co-design architecture for the Convolutional Neural Network (CNN) model integrates both hardware and software components to optimize performance, efficiency, and flexibility. This approach leverages the strengths of FPGA (Field-Programmable Gate Array) hardware for high-speed parallel processing and the versatility of software for model training and control.
5.2. Workflow for FPGA-Based Design and Implementation of AI Intellectual Property (AI-IP)
The traditional workflow for designing FPGA IPs often involves manually developing component modules using Register Transfer Level (RTL) coding. While finely optimized RTL can deliver top-tier performance with minimal resource usage, large, hand-written RTL designs tend to suffer from low readability and are challenging to maintain. Additionally, RTL development is time-consuming, which poses challenges in fast-paced projects with strict deadlines. To overcome these issues, this project adopts Vivado High-Level Synthesis (HLS) to design FPGA IPs. Vivado HLS significantly streamlines IP creation by enabling developers to use C, C++, and SystemC code directly, converting these high-level specifications into FPGA-ready IP cores without the need to manually create RTL. Vivado HLS is compatible with both the ISE and Vivado design environments, offering system and design architects a faster, more efficient approach to IP creation.
Given that this framework is designed to develop Intellectual Property (IP) blocks for Convolutional Neural Networks (CNNs), the FPGA IP design must be modular and highly parameterizable (Figure 6). This approach ensures maximum reconfigurability and scalability while minimizing the need for user intervention. By adopting a modular design, the IP blocks can be easily adapted to various CNN architectures without requiring significant redesign efforts, making the framework versatile and user-friendly.
In addition to simplifying the prototyping process, the hardware design must prioritize performance optimization. This is achieved by maximizing parallelism during computation, fully leveraging the FPGA’s resources to enhance processing speed and efficiency. By utilizing Vivado High-Level Synthesis (HLS), this project not only streamlines the development workflow but also delivers a high-performance, flexible, and easily adaptable IP solution tailored for CNN applications on FPGA platforms.
5.3. Proposed Hardware-Software Implementation
The proposed hardware-software implementation framework integrates FPGA IPs for critical CNN components, including convolutional, average pooling, and ReLU layers. These IPs are designed as modular blocks, enabling the seamless chaining of layers to construct complete CNN architectures. Using the Xilinx Vivado “Block Diagram” utility, the framework facilitates the graphical assembly of neural networks, enhancing the design process by providing a user-friendly and efficient platform. This modular approach allows for the rapid development and customization of neural network models directly on the FPGA, leveraging the hardware’s parallel processing capabilities for optimized performance. By combining intuitive design tools with high-performance FPGA implementations, the framework significantly accelerates the development cycle for AI applications.
To efficiently implement the 2D convolution (Conv2D) operation for deep learning inference on FPGA, we designed a custom IP core optimized for parallel processing. The target convolutional layer processes an input feature map of 256 × 256 with 1 channel, producing 32 output feature maps. Given the computational intensity of Conv2D operations, the implementation leverages loop unrolling and pipelining to enhance throughput and reduce latency. The 2D convolution operation between an input feature map and a filter is mathematically expressed as:

$$O(m, n, k) = \sum_{c=0}^{C-1} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} I(m+i,\, n+j,\, c)\, W(i, j, c, k) + B(k)$$

where:
O(m, n, k) represents the output feature map at position (m, n) for channel k.
I(m + i, n + j, c) denotes the input feature map at position (m + i, n + j) for channel c.
W(i, j, c, k) is the convolution kernel (filter) of size H × W for input channel c and output channel k.
B(k) is the bias term for output channel k.
H and W represent the filter height and width.
C is the number of input channels.
A key optimization applied is loop unrolling with a factor of 8 (#pragma HLS unroll factor = 8), which allows eight parallel operations per clock cycle, significantly accelerating the computation. This strategy effectively reduces the number of cycles required per convolution window, thereby improving performance while maintaining resource efficiency. The FPGA-based Conv2D module integrates on-chip memory buffering, reducing external memory access and further enhancing efficiency. This implementation achieves a high-performance, low-latency convolution operation, making it well-suited for real-time deep learning inference on FPGA-based systems. The combination of loop unrolling and on-chip memory optimizations ensures an efficient balance between resource usage and computational speed.
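As a software reference for this operation, the numpy sketch below evaluates the same equation, assuming "valid" padding (no border handling); such a golden model is useful for verifying the FPGA IP output bit-for-bit, while the IP itself is written in C/C++ for Vivado HLS.

```python
import numpy as np

def conv2d_ref(I, W, B):
    """Golden model of the Conv2D equation: I is (Hin, Win, C),
    W is (H, Wk, C, K), B is (K,); returns (Hout, Wout, K)."""
    H, Wk, C, K = W.shape
    Hout, Wout = I.shape[0] - H + 1, I.shape[1] - Wk + 1
    O = np.empty((Hout, Wout, K), dtype=np.float32)
    for m in range(Hout):
        for n in range(Wout):
            # Contract the H x Wk x C window against all K filters at once.
            window = I[m:m + H, n:n + Wk, :]
            O[m, n, :] = np.tensordot(window, W, axes=([0, 1, 2], [0, 1, 2])) + B
    return O
```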
Figure 7 illustrates the proposed hardware design of the convolutional layer.
To optimize the average pooling operation for FPGA-based deep learning inference, we designed a custom IP core that efficiently performs downsampling while maintaining computational efficiency. The target pooling layer processes an input feature map of 256 × 256 × 32 using a 2 × 2 kernel with a stride of 2, effectively reducing the spatial dimensions to 128 × 128 × 32. The average pooling operation applies a downsampling function over a local window, computing the average of the values within the window. It is expressed as:

$$O(m, n, c) = \frac{1}{K_h K_w} \sum_{i=0}^{K_h-1} \sum_{j=0}^{K_w-1} I(mS+i,\, nS+j,\, c)$$

where:
O(m, n, c) represents the output feature map at position (m, n) for channel c.
I(mS + i, nS + j, c) denotes the input feature map at position (mS + i, nS + j) for channel c.
Kh, Kw are the pooling window height and width.
S is the stride of the pooling operation.
To enhance performance, we apply loop unrolling with a factor of 16 (#pragma HLS unroll factor = 16), enabling the system to compute sixteen pooling operations in parallel per cycle. This significantly reduces latency and improves throughput. Additionally, on-chip memory buffering minimizes external memory accesses, further optimizing execution time and power consumption. This highly optimized average pooling implementation significantly reduces computational complexity, memory bandwidth usage, and latency, making it ideal for real-time embedded AI applications on FPGAs. The combination of loop unrolling and memory-efficient processing ensures an effective trade-off between speed, power efficiency, and hardware resource utilization.
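A corresponding numpy golden model of this pooling stage, again intended for verification rather than deployment, could look as follows (window size and stride default to the 2 × 2, stride-2 configuration above).

```python
import numpy as np

def avgpool_ref(I, k=2, s=2):
    """Average pooling reference: (256, 256, 32) -> (128, 128, 32) for k=s=2."""
    Hout, Wout = (I.shape[0] - k) // s + 1, (I.shape[1] - k) // s + 1
    O = np.empty((Hout, Wout, I.shape[2]), dtype=np.float32)
    for m in range(Hout):
        for n in range(Wout):
            # Mean over the k x k window, computed independently per channel.
            O[m, n, :] = I[m * s:m * s + k, n * s:n * s + k, :].mean(axis=(0, 1))
    return O
```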
Figure 8 illustrates the proposed hardware design of the average pooling layer.
The Rectified Linear Unit (ReLU) activation function is a crucial component of deep learning models, introducing non-linearity while keeping computations simple. The function is defined as:

$$f(x) = \max(0, x)$$
This operation is highly efficient, requiring only a comparison and assignment, making it well-suited for FPGA acceleration. However, for optimal performance, our design leverages parallel processing, loop unrolling, and on-chip memory buffering to ensure high throughput and low latency. A loop unrolling factor of 8 (#pragma HLS unroll factor = 8) is applied to process eight activation operations in parallel per cycle, significantly improving efficiency. The ReLU function is implemented as a streaming operation, ensuring minimal latency and seamless integration with preceding convolution and pooling layers.
By efficiently pipelining and parallelizing ReLU computations, this implementation ensures minimal latency and high throughput, making it ideal for real-time deep learning inference on FPGA. The combination of loop unrolling and memory-efficient processing further enhances performance while keeping power consumption low.
Figure 9 illustrates the proposed hardware design of the ReLU activation function.
The proposed hardware acceleration for the AI model in breast cancer detection is designed to leverage the reconfigurable capabilities of FPGA technology. This design incorporates three optimized hardware IP cores: 2D Convolution (Conv2D), ReLU activation, and Average Pooling, which are interconnected via the AXI interface for seamless data communication. The hardware architecture utilizes a clocking wizard to generate a stable clock signal and a system reset module to initialize the design, ensuring robust and reliable operation.
Figure 10 illustrates the proposed hardware acceleration for the AI model in breast cancer detection.
The Conv2D IP core handles the computationally intensive convolutional operations, which are critical for feature extraction in AI models. The ReLU IP core performs the activation function, introducing non-linearity into the model to enhance its learning capability, while the Average Pooling IP core reduces the spatial dimensions of feature maps, thereby lowering computational complexity and memory requirements. These IPs are integrated into the FPGA fabric, allowing for highly parallel and pipelined execution that drastically improves processing speed compared to traditional CPU or GPU implementations. The flexibility of FPGAs allows for tailoring the IP core’s parameters, such as kernel size, stride, and number of channels, to meet specific application requirements. By using these hardware-accelerated Conv2D, ReLU, and average pooling IP cores, the system achieves faster processing times, which is critical for real-time tasks in applications like image recognition and video processing, where rapid response and high computational efficiency are essential.
6. Results and Discussion
This section presents the performance evaluation of key neural network layers, including Conv2D, Average Pooling, and ReLU, implemented as IP cores on the PYNQ-Z2 FPGA. Each layer was optimized using Vivado HLS to enhance execution speed, reduce latency, and maximize throughput while efficiently utilizing the FPGA’s resources. This evaluation focuses on execution time, resource utilization, and latency for each layer, highlighting the benefits of hardware acceleration for deep learning tasks.
One of the primary goals of this hardware-accelerated implementation was to reduce execution time compared to a software-only solution. By leveraging the FPGA’s resources, including dedicated multipliers and parallel processing units, the IP core achieved a significant reduction in execution time. The pipelining optimization allowed data to flow through the processing pipeline with minimal stalls, minimizing latency and enabling the IP core to process each convolution operation more rapidly. Resource utilization on the PYNQ-Z2 FPGA was analyzed to determine the efficiency of the hardware implementation.
Table 2 presents the hardware implementation of Conv2D, Average Pooling, and ReLU as IP cores on the PYNQ-Z2 FPGA.
The results showed that the IP core made effective use of available resources without exceeding the constraints of the PYNQ-Z2, leaving room for potential future expansion or additional functionalities.
This table summarizes the hardware resources and performance metrics for each IP core implemented on the PYNQ-Z2 FPGA, indicating resource usage in terms of slices, look-up tables (LUTs), flip-flops (FFs), block random access memory (BRAM), and digital signal processing (DSP) blocks. Additionally, latency (in cycles) and operating frequency are provided for each core, showcasing the variation in complexity and efficiency across the different neural network layers. The Conv2D layer, being the most resource-intensive, achieves significant acceleration but utilizes the most slices and DSPs, while simpler layers like Average Pooling and ReLU are more efficient in terms of resource usage and latency, making them suitable for high-speed tasks.
To evaluate the proposed hardware IP for breast cancer detection, we implemented the model on two different platforms: the Dual-Core ARM Cortex-A9 and the Zynq XC7Z020 FPGA. The goal was to compare the execution time and accuracy of the model on a traditional CPU versus a hardware-accelerated FPGA setup (Figure 11). On the Zynq XC7Z020, we implemented the first convolutional layer as a custom IP core, leveraging the FPGA’s parallel processing capabilities to accelerate computation. The remaining layers were processed on the ARM Cortex-A9 core within the Zynq SoC, allowing us to examine the impact of partial hardware acceleration on overall performance. The results, including execution time and classification accuracy, are summarized in Table 3. This comparison highlights the trade-offs involved in FPGA-based hardware acceleration for real-time applications in medical imaging: execution time is faster on the Zynq XC7Z020, showcasing the benefit of hardware acceleration, though there is a slight trade-off in accuracy when moving to the FPGA-based implementation.
The evaluation of the proposed model demonstrates the impact of hardware optimization, particularly through the use of 8-bit fixed-point operations instead of floating-point, to reduce computational complexity and improve efficiency. On the Dual-Core ARM Cortex-A9, the model achieves an execution time of 0.981 s with an accuracy of 94.11%. In contrast, the Zynq XC7Z020 (FPGA) implementation achieves a faster execution time of 0.821 s, representing a 16.3% improvement, but with a slightly reduced accuracy of 89.87%, a decrease of 4.24%. Additionally, power consumption analysis highlights the advantage of FPGA acceleration. The ARM Cortex-A9 consumes 3.8 W, whereas the FPGA implementation requires only 1.4 W, leading to a 63.15% reduction in power consumption. This efficiency gain is particularly beneficial for embedded and edge AI applications, where energy constraints are critical.
Overall, the results demonstrate that FPGA-based implementations provide substantial gains in speed and power efficiency, making them well-suited for real-time applications. However, these advantages come with a slight trade-off in accuracy due to the reduced numerical precision of 8-bit fixed-point arithmetic. This trade-off arises from quantizing the data, which introduces minor precision loss but significantly reduces computational complexity and optimizes resource utilization. Despite this slight decrease in accuracy, the FPGA implementation excels in both performance and energy efficiency, making it an ideal choice for applications requiring fast and power-efficient inference while maintaining an acceptable level of accuracy.
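To make this precision trade-off concrete, the sketch below quantizes floating-point values to signed 8-bit fixed point and measures the round-trip error. The Q3.4 format (4 fractional bits) is an illustrative assumption; the exact fixed-point format used on the FPGA is not detailed here.

```python
import numpy as np

def to_fixed8(x, frac_bits=4):
    """Quantize to signed 8-bit fixed point and dequantize back."""
    scale = 1 << frac_bits
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, q.astype(np.float32) / scale

weights = np.random.default_rng(0).normal(0.0, 0.5, 1000).astype(np.float32)
q, deq = to_fixed8(weights)
# For in-range values the error is bounded by half an LSB (1/32 here).
print("max quantization error:", np.abs(weights - deq).max())
```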
6.1. Comparison with Traditional Machine Learning Approaches
This section provides a detailed comparison between the proposed CNN implementation and traditional machine learning methods for breast cancer detection across various platforms. Our Convolutional Neural Network (CNN) implementation on the ARM Cortex-A9 platform demonstrates significant advantages over traditional machine learning methods in terms of both performance and accuracy. As shown in Table 4, our approach outperforms methods such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGB), and Lightweight Deep Convolutional Neural Networks (LWDCNN), despite using more modest hardware, a Dual-Core ARM Cortex-A9 processor, compared to the high-performance CPU platforms used in other studies.
Our approach demonstrates superior performance compared to existing methods in terms of accuracy (Figure 12), efficiency, and hardware utilization. As shown in the comparison, our method, based on a Convolutional Neural Network (CNN), achieves an accuracy of 94.11%, surpassing other techniques such as K-Nearest Neighbors (KNN) (89.2%), Support Vector Machines (SVM) (93.39%), Extreme Gradient Boosting (XGB) (94.10%), and Lightweight Deep Convolutional Neural Networks (LWDCNN) (91.89%).
Unlike previous works that rely on high-performance CPU-based platforms, our approach operates on a more modest Dual-Core ARM Cortex-A9 processor. Despite this hardware limitation, our method not only achieves the highest accuracy but also maintains an efficient execution time of 0.981 s, demonstrating its suitability for real-time applications. Additionally, our model’s power consumption is measured at 3.8 W, which is crucial for energy-efficient processing, especially in embedded and edge computing scenarios. Overall, our work highlights the potential of deploying CNN-based models on resource-constrained hardware while achieving state-of-the-art performance in classification accuracy, making it a promising solution for real-time and power-sensitive applications.
Despite the ARM Cortex-A9 platform being considered resource-constrained compared to high-performance processors, the results illustrate that, with proper optimization, deep learning models such as CNN can surpass traditional machine learning methods, even on embedded systems. This highlights the potential for deploying CNNs in real-time, practical medical applications on mobile and embedded devices.
6.2. Performance Analysis
Our modular IP core design balances computational performance with power efficiency while making effective use of the FPGA’s available resources.
Table 5 and Figure 13 present the performance analysis of various machine learning approaches, highlighting the accuracy, power consumption, and execution time across different implementations.
Our work demonstrates a significant improvement in performance compared to existing implementations in terms of accuracy, latency, throughput, and power consumption while utilizing a resource-constrained PYNQ-Z2 platform.
Among previous studies, Saeed et al. [26] implemented a CNN on the ZCU104, reporting a power consumption of 11.65 W, significantly higher than our approach (1.4 W). Laxmisagar et al. [27], using an SVM on the KC705, achieved an accuracy of 91.08% but with higher power consumption (4.57 W) compared to our CNN-based approach. Maria et al. [19] and Guptha et al. [28] explored alternative models, BI-RADS and RL-BLED, with accuracy levels of 85% and 89.4%, respectively, but suffered from high latency (7.2 s and 5.57 s, respectively), making them less suitable for real-time applications.
Regarding bit precision, Guo et al. [29] and Kim et al. [30] used 8-bit and 16-bit CNN/RNN models, respectively, but did not provide detailed accuracy or latency results. However, their power consumption values (3.7–5.47 W) were higher than our optimized implementation. Aditya et al. [25], implementing LWDCNN on PYNQ-Z2, achieved an accuracy of 85.36% with a latency of 6.44 s, significantly slower than our approach (0.821 s), while also having a lower throughput (0.15 FPS compared to our 1.22 FPS).
Our CNN-based implementation on PYNQ-Z2 achieves an accuracy of 89.87%, with an optimized latency of 0.821 s, a throughput of 1.22 FPS, and a power consumption of only 1.4 W. These results demonstrate the efficiency of our method in balancing accuracy, speed, and energy consumption, making it well-suited for real-time, power-sensitive embedded AI applications.
7. Conclusions
In this study, we designed and implemented Intellectual Property (IP) cores for Conv2D, Average Pooling, and ReLU on FPGA platforms, focusing on accelerating the first layers of a CNN-based breast cancer detection model. To evaluate the benefits of FPGA acceleration, we compared both software-based execution on an ARM Cortex-A9 processor and hardware implementation on the PYNQ-Z2 FPGA, analyzing their impact on performance, resource utilization, and energy efficiency.
The software implementation on the ARM Cortex-A9 achieved an execution time of 0.981 s, with a classification accuracy of 94.11%. However, the intensive computational requirements of CNN operations resulted in higher processing latency and power consumption, posing challenges for real-time processing on embedded platforms. Conversely, the hardware implementation using FPGA-based IP cores demonstrated a 16.3% speedup, reducing execution time to 0.821 s, with an accuracy of 89.87% under the 8-bit fixed-point quantization discussed in Section 6. Moreover, the FPGA implementation significantly improved energy efficiency and resource utilization, achieving 68% LUT usage, 72% DSP utilization, 65% memory consumption, and 1.4 W power consumption. These results validate the effectiveness of hardware acceleration in achieving both higher computational efficiency and lower power consumption, making it an ideal solution for real-time AI applications in medical imaging.
To further enhance the performance and scalability of the proposed approach, future work will focus on implementing the entire breast cancer detection model in hardware, utilizing iterative architectures and reconfigurable IP cores for greater flexibility and efficiency. Additionally, we plan to explore different fixed-point representations (16-bit and 32-bit) to optimize the trade-off between precision, resource utilization, and execution time. These enhancements will further boost computational efficiency while maintaining high classification accuracy, making FPGA-based AI solutions even more effective for real-time medical imaging applications.