Optimized YOLOv5 Architecture for Superior Kidney Stone Detection in CT Scans

Abdimurotovich, Khasanov Asliddin; Cho, Young-Im

doi:10.3390/electronics13224418


by Khasanov Asliddin Abdimurotovich and Young-Im Cho *

Department of Computer Engineering, Gachon University Sujeong-gu, Seongnam-si 461-701, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2024, 13(22), 4418; https://doi.org/10.3390/electronics13224418

Submission received: 5 October 2024 / Revised: 5 November 2024 / Accepted: 8 November 2024 / Published: 11 November 2024


Abstract

The early and accurate detection of kidney stones is crucial for effective treatment and improved patient outcomes. This paper proposes a novel modification of the YOLOv5 model, specifically tailored for detecting kidney stones in CT images. Our approach integrates the squeeze-and-excitation (SE) block within the C3 block of the YOLOv5m architecture, thereby enhancing the ability of the model to recalibrate channel-wise dependencies and capture intricate feature relationships. This modification leads to significant improvements in the detection accuracy and reliability. Extensive experiments were conducted to evaluate the performance of the proposed model against standard YOLOv5 variants (nano-sized, small, and medium-sized). The results demonstrate that our model achieves superior performance metrics, including higher precision, recall, and mean average precision (mAP), while maintaining a balanced inference speed and model size suitable for real-time applications. The proposed methodology incorporates advanced noise reduction and data augmentation techniques to ensure the preservation of critical features and enhance the robustness of the training dataset. Additionally, a novel color-coding scheme for bounding boxes improves the clarity and differentiation of the detected stones, facilitating better analysis and understanding of the detection results. Our comprehensive evaluation using essential metrics, such as precision, recall, mAP, and intersection over union (IoU), underscores the efficacy of the proposed model for detecting kidney stones. The modified YOLOv5 model offers a robust, accurate, and efficient solution for medical imaging applications and represents a significant advancement in computer-aided diagnosis and kidney stone detection.

Keywords:

YOLOv5; kidney stone detection; squeeze-and-excitation block; CT images; deep learning; medical imaging; object detection

1. Introduction

The incidence of kidney stones is high worldwide. Computed tomography (CT) is the modality most commonly used to diagnose kidney stones [1]. It relies on the differences between kidney stones and the surrounding tissues to produce detailed CT images that reveal the structure and morphology of the stones. Clinically, stone volume and hardness are key factors in determining surgical treatment plans. During diagnosis, urologists typically search for and segment stones manually in a slice-by-slice manner using CT image sequences. This process requires extensive clinical experience, domain expertise, and significant manual effort [2]. Additionally, urologists often assume that stones are perfect spheres or ellipsoids and estimate their quantity and volume using formulas or software methods that are not suitable for irregularly shaped stones. Small kidney stones are often expelled through urine without affecting the body [3]. However, as stones increase in size, they can cause symptoms such as bloody urine, nausea, painful urination, and severe pain in the lower abdomen or back. Patients experience intense pain when stones move from the kidney to the urinary canal. The longer it takes to detect these stones, the more deteriorated the quality of life becomes, leading to a worsening of kidney function and potentially life-threatening situations. Therefore, the early diagnosis of kidney stones is crucial for the treatment process. Many patients with kidney stones visit hospitals with symptoms such as fever, severe pain in the lower back and sides, and blood in their urine [4]. Occasionally, this condition is mistaken for appendicitis, cholecystitis, ovarian torsion, or mesenteric ischemia. Because of the wide range of differential diagnoses and high number of patients in emergency evaluations, physicians may misdiagnose kidney stones, especially in patients with milder symptoms [5].

Recent advancements in MRI scanners and other medical imaging technologies have increased the demand for computer-aided diagnostic tools [6]. Currently, healthcare professionals must meticulously analyze complex medical images to identify issues, a process that is both challenging and time-consuming [7]. Consequently, there is a growing need for systems capable of autonomously recognizing organs, detecting potential anomalies, and providing essential information to alleviate the burden on clinicians and enhance diagnostic efficiency. The implementation of an artificial intelligence (AI) model to rapidly detect kidney-related radiological findings has significant potential to aid healthcare professionals and reduce patient suffering [8]. Considering the prevalence of renal disorders, global shortage of nephrologists and radiologists, and advancements in deep learning research for visual tasks, this approach is particularly crucial [5]. Extensive research utilizing deep learning technologies has focused on the detection of pathological regions and diseases. In the case of kidney stone disease, significant progress has been made in detection tasks [9]. Moreover, deep learning systems enhance detection accuracy by providing reliable and consistent interpretations of complex imaging results, thereby aiding healthcare professionals in making precise diagnoses [10]. Through continuous learning and adaptation, these systems keep up with the ever-expanding volume of medical data, ensuring that healthcare providers have the most current and relevant information available. Advancements in deep learning technologies have revolutionized kidney disease diagnosis by offering faster, more accurate, and more efficient methods to overcome the challenges associated with conventional techniques. Therefore, in this study, a deep-learning-based technique was developed to predict and classify various kidney diseases. The contributions of this study are as follows.

By integrating the squeeze-and-excitation (SE) block within the C3 block of the YOLOv5 architecture, the proposed model significantly improves the recalibration of channel-wise dependencies, thereby enhancing the network’s ability to capture and differentiate intricate feature relationships. This leads to better detection accuracy and reliability in identifying kidney stones in CT images.
The proposed YOLOv5m modification achieves a balanced performance in terms of model size, inference speed, and detection accuracy. With an inference speed of 8.2 ms per image and a model size of approximately 41 MB, it offers a viable solution for real-time medical applications requiring precise object detection without compromising speed.
The use of a modified CSPDarknet53 as the backbone network enhances feature extraction efficiency. The incorporation of cross-stage partial (CSP) connections optimizes learning efficiency, reduces model size, and improves the overall detection capability across different scales.
The integration of attention mechanisms into the YOLOv5m architecture enables the model to focus on the most pertinent parts of the input images. This selective attention enhances detection accuracy by allowing the model to better differentiate between significant and insignificant features within the images.
The proposed model outperformed the standard YOLOv5 variants (nano-sized, small, and medium) in key performance metrics such as precision, recall, and mean average precision (mAP). This superior performance highlights its efficacy in detecting kidney stones, making it a suitable choice for medical imaging applications.
The use of bilateral filtering for noise reduction ensures the preservation of critical features and sharpness in CT images, which are essential for accurate kidney stone detection. In addition, data augmentation techniques enhance the diversity and robustness of the training dataset, contributing to improved model performance.
The proposed model employs a distinct color-coding approach to improve the clarity of the kidney stone detection results. Using uniquely colored bounding boxes for closely located stones resolves potential overlap issues and facilitates a better analysis and understanding of detection performance.

AI is increasingly used in medical imaging, including for kidney stone detection in CT scans. Existing AI models, such as those based on CNNs, have made significant strides in automating the detection process. However, these models often struggle with the detection of small or irregularly shaped kidney stones, particularly in noisy or low-quality images. Furthermore, most current systems require significant computational resources, limiting their real-time applicability in clinical settings. The present study aims to fill these gaps by proposing a modified YOLOv5 model that incorporates attention mechanisms and noise reduction techniques. This model not only improves detection accuracy but also maintains a balance between computational efficiency and precision, making it suitable for real-time clinical use. By addressing these limitations, our study contributes to the development of more robust AI tools for CT-based kidney stone detection. This study presents a modified YOLOv5 model with SE block integration that offers a robust, accurate, and efficient solution for kidney stone detection in CT images, thereby significantly contributing to the field of medical imaging and diagnostics.

2. Related Works

In recent years, various techniques for kidney stone detection have been explored, employing medical imaging modalities such as ultrasound, MRI, CT scans, and color Doppler [11]. These methods primarily aim to identify the presence of kidney stones but often face challenges related to accurate stone localization, boundary delineation, and detection of small stones. Additionally, many models have shown limitations in maintaining precision in noisy or low-contrast images, which is critical in medical diagnostics [12]. For instance, Ref. [13] employed a deep residual learning network combined with a pre-trained ResNet-101 model to reduce speckle noise in ultrasound images. While this approach successfully improved image quality, the model struggled with accurately detecting small or irregularly shaped stones. Similarly, Ref. [14] applied an iterative neighborhood component analysis with a k-nearest neighbor classifier on features extracted using DarkNet19 to detect kidney stones, but this method faced difficulties in distinguishing closely located stones due to overlapping regions in CT images. Another study [15] evaluated kidney stones using the S.T.O.N.E. scoring system. Ref. [16] used a Kronecker product-based convolution method to enhance feature extraction from CT images. However, it did not address the issue of convolutional overlap, which can reduce the model’s efficiency in identifying stone boundaries. Furthermore, Ref. [17] introduced a novel framework inspired by urologists’ diagnostic procedures for CT images, improving localization but failing to differentiate stones in complex environments with high variability in stone shapes. Ref. [18] proposed a computer-aided diagnostic system using deep neural networks to detect kidney stones in direct urinary system X-ray (DUSX) images. They used a YOLOv4 model combining bilateral filtering and CLAHE (CBC).
One study [19] discussed using the YOLOv7 architecture to detect kidney diseases, including stones and cysts, as well as healthy images. The YOLO architecture was effective because it identified the regions within an image and applied a neural network to the entire image, generating bounding boxes around the detected areas. Ref. [20] detected kidney stones using a combination of discrete wavelet transform and deep learning techniques. They discovered that this hybrid approach was more effective and efficient than traditional methods. The researchers emphasized the importance of using high-quality medical images to train the model and diagnose the disease, which ultimately improved patient outcomes. One study [21] proposed an automated method for detecting kidney stones in CT images using an ensemble deep neural network (DNN). This method utilizes pre-trained models such as DarkNet19, InceptionV3, and ResNet101, with feature selection and classification performed via iterative ReliefF and a Bayesian-optimized K-nearest neighbor classifier. Ref. [22] introduced two ensemble models to detect kidney stones in CT images: StackedEnsembleNet combined predictions from four models (InceptionV3, InceptionResNetV2, MobileNet, and Xception) to enhance accuracy, whereas PSOWeightedAvgNet used particle swarm optimization to optimize the model weights. Ref. [23] presented a hybrid model combining CNN and ResNet to improve kidney stone detection. Using ultrasound and CT scans, the model outperformed individual CNN and ResNet models in terms of accuracy, sensitivity, and specificity. Additionally, this model was found to be resilient to noise and variability. This hybrid approach offers a promising tool for the early and accurate diagnosis of kidney stones, thereby enhancing patient outcomes. This method [20] presented an AI-based system for predicting and classifying kidney diseases using a dataset of 12,446 images, including cysts, tumors, stones, and healthy samples. 
The images were preprocessed and segmented to identify regions of interest. Various deep learning models, such as DenseNet201, EfficientNetB0, InceptionResNetV2, MobileNetv2, ResNet50V2, and Xception, were trained using the RMSprop, SGD, and Adam optimizers.

While these approaches have contributed valuable advancements to kidney stone detection, there are notable gaps that remain unaddressed. One significant issue is the model’s ability to recalibrate feature maps in the presence of noise or irregular stone boundaries. Furthermore, the misclassification of closely located stones due to overlapping bounding boxes is a frequent challenge in these detection systems. To overcome these limitations, our work integrates an SE block within the C3 block of the YOLOv5 architecture, specifically addressing the issue of channel-wise dependencies. The SE block enhances the model’s ability to recalibrate these dependencies, improving the detection of small, irregular stones. Moreover, our novel color-coding scheme for bounding boxes resolves the problem of overlapping bounding boxes, allowing clearer detection of closely located stones, which prior works failed to address. By incorporating attention mechanisms and advanced data augmentation techniques, our proposed model not only improves precision and recall but also demonstrates enhanced robustness in noisy medical images. These advancements position our model as a more efficient and accurate tool for kidney stone detection compared to previous methods, which primarily focused on segmentation or feature extraction without effectively handling noise or overlapping detections. While the body of existing research has made significant strides in kidney stone detection, our approach uniquely addresses these remaining gaps by improving feature recalibration and detection accuracy in complex imaging scenarios.

3. Proposed Methodology

This section describes the proposed methodology for the detection of kidney stones, tailored for medical applications. Section 3.1 provides an overview of the YOLOv5m baseline model. In Section 3.2, we describe the core architectural components and operational mechanisms of YOLOv5m. Section 3.3 provides a comprehensive description of the proposed approach.

3.1. YOLOv5m

YOLOv5m, the medium variant within the YOLOv5 family, exemplifies a harmonious balance among model size, inference speed, and detection accuracy (Figure 1). With a model size of approximately 41 MB, YOLOv5m was engineered to be more substantial and powerful than the nano-sized and small variants, while maintaining a lightweight and efficient profile compared with the large and extra-large versions, as shown in Table 1. In terms of performance, YOLOv5m attains an inference speed of 8.2 ms per image when executed on an NVIDIA V100 GPU utilizing the FP16 (half-precision) mode. This rapid processing capability renders it highly suitable for real-time applications that demand moderate to high accuracy without compromising speed. YOLOv5m achieved an mAP of 45.2 on the COCO dataset, markedly surpassing the performance of YOLOv5n and YOLOv5s. This enhanced accuracy makes YOLOv5m a viable option for applications requiring precise object detection. The YOLOv5m architecture incorporates several advanced components that are designed to enhance its performance. The backbone network, which is used for feature extraction, is a modified CSPDarknet53. This backbone integrates cross-stage partial (CSP) connections to optimize learning efficiency and reduce the overall model size. Moreover, YOLOv5m employs a path aggregation network (PANet) for its neck, which enhances the fusion of features from various layers, thereby bolstering the model’s ability to detect objects across different scales. YOLOv5m utilizes the YOLO head design, which applies anchor boxes and grid cells to predict bounding boxes, object classes, and confidence scores. Attention mechanisms are also integrated into YOLOv5m, enabling the model to focus on the most pertinent parts of the input images. This integration enhances detection accuracy by allowing the model to better differentiate between significant and insignificant features within the images. 
YOLOv5m is ideally suited to a diverse array of real-world applications, including real-time object detection in videos, autonomous driving systems, surveillance and security, and industrial automation. Its balanced design makes it a versatile choice for scenarios where both speed and accuracy are paramount.

YOLOv5 offers significant improvements over earlier versions, particularly in terms of speed, accuracy, and ease of modification. One of the key reasons for selecting YOLOv5 is its enhanced modularity and flexibility, which allow seamless integration of advanced components, such as the SE blocks we implemented. These blocks are essential in recalibrating channel-wise dependencies, thereby improving detection accuracy for small and irregularly shaped kidney stones—something earlier YOLO versions struggle with, especially in noisy or complex images. YOLOv5 demonstrates an optimal balance between detection accuracy and computational efficiency, which is critical for real-time medical applications. It is particularly well suited for detecting objects across multiple scales, a feature that is crucial for kidney stone detection, where stones vary significantly in size. Earlier YOLO versions, including YOLOv3 and YOLOv4, are less efficient in terms of scalability and model size, making them less adaptable to the specific needs of medical imaging tasks such as ours. Another factor in choosing YOLOv5 is its superior support for modern deep learning frameworks, making it more accessible for modifications and optimizations such as the custom noise reduction techniques and the color-coding scheme we applied to improve detection clarity. These technical advantages, combined with YOLOv5’s streamlined training process and improved performance on object detection benchmarks, make it a more suitable candidate for this study than earlier YOLO versions. Although YOLOv4 and YOLOv3 were considered during the initial stages of the project, our experiments revealed that YOLOv5 outperformed them in terms of precision, recall, and mAP, particularly when handling small, hard-to-detect objects such as kidney stones.

3.2. C3 Block

The C3 block, an integral component of the YOLOv5 architecture, is a CSP bottleneck with three convolutions and is derived from the CSPNet concept. This block was engineered to enhance the efficiency and efficacy of the model by optimizing gradient flow, reducing parameter counts, and preserving feature diversity. Fundamentally, the C3 block leverages CSP connections that divide the input feature map into two distinct parts. One part undergoes a series of convolutions, whereas the other is retained for later concatenation. This design facilitates an improved gradient flow throughout the network and reduces computational overhead. Within the C3 block, several bottleneck blocks are employed. These bottleneck blocks, each consisting of two convolutional layers with a residual connection, contribute to parameter reduction while preserving the representational capacity of the model, as shown in Figure 2a. Following the convolutional processing, the feature map that bypassed the convolutions is concatenated with the processed feature map. This concatenation preserves the original feature information while enriching it with more complex representations. A final convolutional layer is subsequently applied to the concatenated features, combining and refining them before passing them to the next layer or block in the network. In the wider architecture, the backbone network, tasked with feature extraction, utilizes a modified CSPDarknet53, incorporating CSP connections to enhance the learning efficiency and reduce the overall model size. Furthermore, attention mechanisms are integrated into the C3 block, enabling the model to focus on the most pertinent regions of the input images. This integration augments detection accuracy by allowing the model to better differentiate between significant and insignificant features within the images.
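
The data flow described above (two branches, bottlenecks with residual connections, concatenation, and a final convolution) can be sketched in NumPy as follows. This is a minimal illustration under simplifying assumptions, not the YOLOv5 implementation: every convolution is reduced to a pointwise (1 × 1) operation, batch normalization is omitted, SiLU is assumed as the activation, the bypass branch passes through a single pointwise convolution before concatenation, and all function names, weight shapes, and channel counts are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x) (assumed activation for this sketch).
    return x / (1.0 + np.exp(-x))

def conv1x1(x, w):
    # Pointwise convolution: x has shape (H, W, C_in), w has shape (C_in, C_out).
    return silu(x @ w)

def bottleneck(x, w_a, w_b):
    # Two convolutions plus a residual connection (channel count is unchanged).
    return x + conv1x1(conv1x1(x, w_a), w_b)

def c3(x, w1, w2, w3, bottleneck_weights):
    # Main branch: convolution followed by a chain of bottlenecks.
    main = conv1x1(x, w1)
    for w_a, w_b in bottleneck_weights:
        main = bottleneck(main, w_a, w_b)
    # Bypass branch, kept aside for the later concatenation.
    skip = conv1x1(x, w2)
    # Concatenate along the channel axis and fuse with the third convolution.
    return conv1x1(np.concatenate([main, skip], axis=-1), w3)

rng = np.random.default_rng(0)
H, W, c_in, c_hidden, c_out = 4, 4, 8, 4, 8   # illustrative sizes only
x = rng.standard_normal((H, W, c_in))
w1 = rng.standard_normal((c_in, c_hidden)) * 0.1
w2 = rng.standard_normal((c_in, c_hidden)) * 0.1
w3 = rng.standard_normal((2 * c_hidden, c_out)) * 0.1
pairs = [(rng.standard_normal((c_hidden, c_hidden)) * 0.1,
          rng.standard_normal((c_hidden, c_hidden)) * 0.1) for _ in range(2)]
out = c3(x, w1, w2, w3, pairs)   # shape (H, W, c_out)
```

In the proposed model, the output of the third convolution is the tensor that is passed on to the SE block.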

3.3. Proposed Model

An SE block constitutes an advanced neural network module aimed at augmenting the representational capacity of CNNs by explicitly modeling the interdependencies among the channels of its convolutional features. First introduced by Hu, Shen, and Sun [24], the SE block seeks to adaptively recalibrate channel-wise feature responses by harnessing global contextual information. In CNNs, each convolutional filter identifies distinct patterns within a receptive field. Nevertheless, conventional CNNs process each channel in isolation during convolution operations, which may restrict their ability to capture complex interchannel dependencies. The SE block mitigates this limitation by introducing a mechanism specifically designed to explicitly model these dependencies, thereby enhancing the network’s ability to capture the intricate relationships among channels. As shown in Figure 2b, the SE block includes three main operations: squeezing, excitation, and scaling. The incorporation of SE blocks into CNNs offers several advantages. First, SE blocks contribute to enhanced accuracy. By recalibrating channel-wise feature responses, these blocks significantly augment the representational power of the network, thereby leading to improved accuracy across a variety of tasks. Second, SE blocks exhibit remarkable flexibility. They can be seamlessly integrated into existing CNN architectures with minimal computational overhead, making them attractive additions to numerous network designs.

Finally, SE blocks demonstrate robust generalization capabilities. Their implementation has been shown to enhance performance across a diverse array of tasks and datasets, underscoring their broad applicability and effectiveness.

To enhance the recalibration of channel-wise dependencies and achieve more robust and accurate feature representations, we propose modifying the YOLOv5m architecture by integrating the SE block into the C3 block. This novel approach involves embedding an attention mechanism inside the C3 block, thereby deepening and strengthening the capacity of the model to capture channel-wise features. By emphasizing more valuable features, this integration aims to improve the overall performance and efficacy of the network. In Figure 2a, the output $x_{out\_conv3} \in \mathbb{R}^{H \times W \times C}$ of the third convolution layer in the C3 block is fed into the proposed additional SE attention block. The first step within the SE block is the squeeze operation, which consolidates the feature maps across spatial dimensions (height and width) to generate a channel descriptor. This is achieved through global average pooling, which reduces each channel of the feature map to a single scalar value. Specifically, the input feature map $x_{out\_conv3} \in \mathbb{R}^{H \times W \times C}$ undergoes adaptive average pooling to produce the channel descriptor, as illustrated in Figure 2.

$$F_{squeeze}(x_{out\_conv3}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{out\_conv3}(i, j) \qquad (1)$$

Here, H and W denote the height and width of the input feature map, respectively. $x_{out\_conv3}(i, j)$ represents the value of the input feature map at spatial location (i, j) for the cth channel, whereas the double summation in Equation (1) sums over all the spatial locations within the input channels. $\frac{1}{H \times W}$ is the normalization factor of the squeezed feature map. The excitation operation processes $F_{squeeze}$ through a gating mechanism to produce channel-wise scaling factors. This process employs a two-layer fully connected network that incorporates a ReLU activation function in the first layer and a sigmoid activation function in the second layer. The first layer reduces the dimensionality of $F_{squeeze}$ and applies a ReLU activation function, which introduces nonlinearity. The second layer restores the dimensionality and applies a sigmoid activation function, ensuring that the scaling factors are within the range [0, 1]. This series of transformations allows the model to learn the adaptive recalibration of the channel-wise feature responses, thereby enhancing the representational capacity of the network.

$$F_{excitation}(F_{squeeze}, W) = \sigma(W_{2} \, \delta(W_{1} F_{squeeze})) \qquad (2)$$

Here, $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of the two fully connected layers, where r is the channel reduction ratio. The ReLU activation function $\delta$ follows the first layer, and $\sigma$ denotes the sigmoid activation function. Moreover, the output $F_{excitation} \in \mathbb{R}^{C}$ contains the scaling factors for each channel.
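
Equations (1) and (2), followed by the channel-wise multiplication that completes the block, can be sketched in NumPy as below. This is a minimal illustration rather than the trained module: the reduction ratio r = 4, the feature map size, and the random weights are assumptions chosen for the example, and bias terms are omitted to match the equations as written.

```python
import numpy as np

def se_block(x, w1, w2):
    """Apply squeeze-and-excitation to a feature map x of shape (H, W, C)."""
    # Squeeze (Eq. 1): global average pooling over the spatial dimensions.
    z = x.mean(axis=(0, 1))                        # shape (C,)
    # Excitation (Eq. 2): reduce with ReLU, then restore with sigmoid.
    hidden = np.maximum(w1 @ z, 0.0)               # shape (C/r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))       # shape (C,), one factor per channel
    # Scale: multiply every channel of x by its scaling factor.
    return x * s, s

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4                           # illustrative sizes; r is assumed
x = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C // r, C)) * 0.1        # first layer, shape (C/r, C)
w2 = rng.standard_normal((C, C // r)) * 0.1        # second layer, shape (C, C/r)
y, s = se_block(x, w1, w2)                         # y has the same shape as x
```

Because the scaling factors depend only on the channel means, the block adds two small matrix products per feature map, which is why its computational overhead is modest.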

The final scale operation of the block applies the channel-wise scaling factors to the original feature map and recalibrates the channel responses.

$$x'_{c} = F_{scale}(F_{excitation}, x_{out\_conv3}) \qquad (3)$$

Here, $x'_{c}$ is the scaled (recalibrated) feature map for the $c^{th}$ channel, and Equation (3) represents the channel-wise multiplication of the feature map $x_{out\_conv3}$ and $F_{excitation}$.

In YOLOv5, the overall loss is calculated as an aggregation of three distinct loss components (Table 2). The class loss quantifies the error in the classification task using binary cross-entropy (BCE), and the objectness loss is also calculated using BCE:

$$L_{class} = -\sum_{i=1}^{N} \sum_{c=1}^{C} [y_{i,c} \log(y'_{i,c}) + (1 - y_{i,c}) \log(1 - y'_{i,c})]$$

$$L_{obj} = -\sum_{i=1}^{N} [y_{i} \log(y'_{i}) + (1 - y_{i}) \log(1 - y'_{i})] \qquad (4)$$

where N denotes the number of grid cells, $y_{i}$ is the ground-truth objectness score for grid cell i (0 or 1, indicating whether an object is present), $y'_{i}$ is the predicted objectness score for grid cell i, C is the number of classes, and $y_{i,c}$ and $y'_{i,c}$ are the corresponding ground-truth and predicted scores for class c. The location loss is computed using the complete intersection over union (CIoU) loss, which is formulated as follows:

$$L_{loc} = 1 - CIoU(b, b') \qquad (5)$$

where b and b′ represent the ground-truth and predicted bounding box coordinates, respectively. The final loss function is the combined loss, where the overall loss in YOLOv5 is the weighted sum of the three individual losses.

$$L_{total} = \lambda_{class} L_{class} + \lambda_{obj} L_{obj} + \lambda_{loc} L_{loc} \qquad (6)$$

Here, $\lambda_{class}$, $\lambda_{obj}$, and $\lambda_{loc}$ are hyperparameters that control the relative importance of each loss component.
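
The three loss terms and their weighted sum can be illustrated with a small plain-Python sketch. It uses the standard CIoU definition (IoU minus a center-distance penalty and an aspect-ratio penalty) for boxes given as (x1, y1, x2, y2). The weights and the toy targets and predictions below are illustrative assumptions, not values reported in this study.

```python
import math

def bce(targets, preds, eps=1e-7):
    """Binary cross-entropy summed over all entries (the form of Eq. 4)."""
    total = 0.0
    for y, p in zip(targets, preds):
        p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total

def ciou(a, b):
    """Complete IoU of two boxes (x1, y1, x2, y2) with positive width and height."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    iou = inter / (wa * ha + wb * hb - inter)
    # Squared distance between the two box centers.
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2 + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wa / ha) - math.atan(wb / hb)) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v

# Hypothetical weights and toy values, for illustration only.
lam_class, lam_obj, lam_loc = 0.5, 1.0, 0.05
l_class = bce([1.0], [0.9])                        # one cell, one class
l_obj = bce([1.0, 0.0], [0.8, 0.1])                # two grid cells
l_loc = 1.0 - ciou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))   # Eq. 5
l_total = lam_class * l_class + lam_obj * l_obj + lam_loc * l_loc  # Eq. 6
```

For identical boxes the CIoU equals 1, so the location loss vanishes; the penalties grow as the predicted box drifts away from, or changes shape relative to, the ground truth.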

4. Experiments

In this section, we present the dataset employed for kidney stone detection, detail the data preprocessing procedures undertaken for model training, describe the environmental settings, elucidate our recommended color approach, and conclude with the results of the model, including a comparison with state-of-the-art (SOTA) methods.

4.1. Dataset

In our study, we used the TEZ_ROI_AUG dataset [25], which consists of 778 CT images, each normalized to a resolution of 320 × 320 pixels to standardize the input for the model. This dataset includes a diverse range of kidney stone cases, covering different types of stones varying in size, shape, and density. The dataset includes both small and large kidney stones, as well as irregularly shaped stones, allowing the model to be trained on a broad spectrum of cases. The diversity of stone types in the dataset is crucial for improving the model’s robustness in detecting different kinds of stones that may be encountered in clinical settings. In addition to the variation in stone morphology, the dataset includes images from a range of patients with different anatomical structures, which helps ensure that the model can generalize across different individuals. This diversity is important for ensuring that the model performs well across various clinical environments and patient demographics.

For CT images used for kidney stone detection, meticulous noise reduction is imperative to preserve critical features, brightness, and the sharpness of contours between different anatomical structures. To achieve these objectives, we employed a bilateral filter, which is renowned for its efficacy in noise reduction while preserving edge details. The bilateral filter weights each neighboring pixel by both its spatial proximity to the center pixel and its similarity in intensity, so that the averaging process is dominated by nearby pixels with similar intensity levels. Initially, the CT images in our dataset were normalized to a resolution of 320 × 320 pixels. This normalization standardized the input data, thereby facilitating more consistent processing and analysis. Furthermore, data augmentation techniques were applied to enhance the diversity and robustness of the training dataset. Each source image underwent the following augmentations to create two additional versions: a 50% probability of a horizontal flip and a random rotation within the range of −10° to +10°.
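
The augmentation step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the rotation uses nearest-neighbor sampling with zero fill outside the image, since the interpolation used in the study is not specified, and the function names are hypothetical. Bounding-box labels would need the same flip and rotation applied to them, which is not shown here.

```python
import numpy as np

def rotate_nearest(img, angle_deg):
    """Rotate a 2-D image about its center using nearest-neighbor sampling."""
    h, w = img.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, locate its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)                      # pixels with no source stay zero
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

def augment(img, rng):
    """Horizontal flip with probability 0.5, then a rotation in [-10, +10] degrees."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    angle = rng.uniform(-10.0, 10.0)
    return rotate_nearest(img, angle)

rng = np.random.default_rng(0)
image = rng.random((320, 320))                    # stand-in for a normalized CT slice
versions = [augment(image, rng) for _ in range(2)]  # two additional versions per image
```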

In terms of noise reduction, bilateral filtering was the primary technique applied in our model, particularly to reduce the impact of Gaussian and speckle noise. Bilateral filtering is well suited for preserving edges while suppressing noise, which is critical in medical imaging where the fine details of structures, such as the boundaries of kidney stones, need to be preserved. This method was effective in reducing Gaussian noise, which is common in CT scans due to sensor and acquisition processes. Speckle noise, often found in ultrasound imaging, can also be reduced to some extent, although it is not the primary noise type in CT scans. However, motion blur, caused by patient movement or lower-quality imaging hardware, is more challenging to address with this technique and remains a source of potential performance degradation. Regarding the SE block, while it does not specifically target a particular type of noise, it enhances the detection of small, irregularly shaped kidney stones by ensuring that key features in the image are highlighted and emphasized. This recalibration of channel dependencies allows the model to prioritize more important regions of the image, even in the presence of noise. However, it is not designed to explicitly suppress noise as a traditional denoising algorithm would. Despite these noise reduction techniques, model performance can still degrade under certain imaging conditions. In particular, motion blur remains one of the most significant challenges. When patient movement or poor imaging hardware introduces motion blur into the CT scans, the clarity of the kidney stones can be severely diminished, leading to reduced detection accuracy. Bilateral filtering, while effective for edge-preserving noise reduction, is not designed to address motion blur, which can obscure the boundaries of stones and make it difficult for the model to differentiate between stone structures and surrounding tissues. 
Under these conditions, the precision and recall of the model can degrade despite the applied noise reduction. Very low-resolution images or images from older CT scanners pose similar difficulties: when contrast is extremely low or structural degradation is severe, the signal-to-noise ratio becomes too low for meaningful feature extraction, and the SE block’s recalibration of feature importance becomes less effective. In short, bilateral filtering handles Gaussian and speckle noise well, but motion blur and markedly reduced image quality still impair detection, particularly of small or irregularly shaped stones. We are considering further enhancements, such as integrating more advanced motion correction techniques, to address these limitations in future work.

4.2. Evaluation Metrics

To comprehensively assess the performance of our kidney stone detection model, we employed a combination of essential evaluation metrics widely recognized in object detection tasks, particularly within the medical imaging domain. The selected metrics provide a thorough evaluation of both the detection accuracy and the efficiency of the model. The key evaluation metrics utilized were as follows:

Precision, defined in Equation (7), indicates the proportion of true positive detections, such as correctly detected kidney stones, out of all positive detections, both true and false positives. Precision is crucial when false positives are a major concern.

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (7)

Recall, defined in Equation (8), is the proportion of actual kidney stones in the dataset that the model correctly detects, that is, true positives out of all actual positives. A high recall ensures that most of the actual kidney stones are detected.

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (8)
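As a worked illustration of Equations (7) and (8), the short sketch below computes both quantities from detection counts. The counts are hypothetical and are not taken from the paper's experiments.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 80 stones found correctly, 20 false alarms, 40 missed.
p, r = precision_recall(tp=80, fp=20, fn=40)
```

With these counts, precision is 80/100 and recall is 80/120, which shows how a detector can raise few false alarms yet still miss a third of the stones.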

The mAP, defined in Equation (9), is the mean of the per-class average precision (AP) values; in this study it is reported at an IoU threshold of 0.5 (mAP@0.5). It provides a comprehensive measure of the performance of the model across varying levels of detection difficulty.

\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{AP}_k \qquad (9)

Here, \mathrm{AP}_k denotes the average precision for class k, and n represents the total number of classes.
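Equation (9) can be sketched in code as follows. The paper does not state how each AP value is computed, so the all-point interpolation of the precision-recall curve used here is an assumption, as are the function names and the toy detections.

```python
def average_precision(scored_hits, num_ground_truth):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    Uses all-point interpolation of the precision-recall curve (an assumption;
    the paper does not specify the AP computation).
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the stepwise precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Equation (9): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: three detections for a single "stone" class, two real stones.
ap_stone = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
map_value = mean_average_precision([ap_stone])
```

With a single class, as in this toy example, the mAP equals the AP of that class.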

The IoU, defined in Equation (10), evaluates the overlap between the predicted and ground-truth bounding boxes. The IoU is a standard metric in object detection, and a higher IoU indicates better localization performance.

\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \qquad (10)
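Equation (10) can be made concrete for axis-aligned boxes with the sketch below. The (x_min, y_min, x_max, y_max) box layout and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes that overlap over half their width: overlap 50, union 150.
overlap_score = iou((0, 0, 10, 10), (5, 0, 15, 10))
disjoint_score = iou((0, 0, 10, 10), (20, 20, 30, 30))
```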

To address the issue of overlapping bounding boxes that may arise from annotations by multiple radiologists, a bounding box merging process was implemented. Bounding boxes with an IoU greater than 0.45 were considered for merging, and their coordinates were averaged to form a single consolidated bounding box. This step not only reduces redundancy but also enhances annotation accuracy.
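The merging step can be sketched as follows. Only the 0.45 threshold and the coordinate averaging come from the text; the greedy grouping order, the helper names, and the sample boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def average_box(group):
    """Coordinate-wise mean of a list of boxes."""
    count = len(group)
    return tuple(sum(box[i] for box in group) / count for i in range(4))


def merge_boxes(boxes, iou_threshold=0.45):
    """Greedily group boxes whose IoU with a group's mean box exceeds the
    threshold, then return one averaged box per group.
    The greedy grouping order is an assumption, not the authors' procedure."""
    groups = []
    for box in boxes:
        for group in groups:
            if iou(average_box(group), box) > iou_threshold:
                group.append(box)
                break
        else:
            groups.append([box])  # no sufficiently overlapping group found
    return [average_box(group) for group in groups]


# Two annotations of the same stone plus one separate stone (hypothetical).
annotations = [(10, 10, 30, 30), (12, 12, 32, 32), (100, 100, 120, 120)]
merged = merge_boxes(annotations)
```

The first two boxes have an IoU of about 0.68 and collapse into one averaged box, while the third remains untouched.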

4.3. Experimental Setup

The training process was configured with a batch size of 16, and the learning rate was initially set to 0.01. A step decay learning rate schedule was applied, reducing the learning rate by a factor of 0.1 at epochs 30 and 60 to ensure stable convergence. The optimizer used was Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005 to prevent overfitting. The total number of epochs for training was 50, and input images were normalized to a resolution of 320 × 320 pixels. This configuration was chosen to maintain a balance between performance and computational efficiency, enabling real-time application of the model in medical settings. To ensure balanced training, the loss function was carefully configured with the following component weights: λ_class = 1.0, λ_obj = 1.0, and λ_loc = 1.0. This configuration ensures that classification, objectness, and localization losses contribute equally to the overall training process, preventing any one aspect from dominating the learning process.
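The learning-rate schedule and the loss weighting described above can be written out as a small sketch. Zero-based epoch numbering and the function names are assumptions, and the sample loss values are hypothetical. Under these settings, a 50-epoch run reaches only the milestone at epoch 30; the milestone at epoch 60 would apply only if training were extended.

```python
def step_decay_lr(epoch, base_lr=0.01, milestones=(30, 60), gamma=0.1):
    """Learning rate at a given epoch (0-based, an assumption):
    multiplied by gamma once each milestone has been reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr


def total_loss(cls_loss, obj_loss, loc_loss, w_cls=1.0, w_obj=1.0, w_loc=1.0):
    """Weighted sum of classification, objectness and localization losses;
    the default weights of 1.0 match the values stated in the text."""
    return w_cls * cls_loss + w_obj * obj_loss + w_loc * loc_loss


# Learning rate for each of the 50 training epochs.
schedule = [step_decay_lr(epoch) for epoch in range(50)]
# Hypothetical component losses, combined with equal weights.
combined = total_loss(cls_loss=0.5, obj_loss=0.2, loc_loss=0.3)
```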

The primary objective of this research was to diagnose kidney stones using CT images. In this study, we employed a kidney stone detection dataset. A comprehensive analysis was conducted using a computing environment equipped with an NVIDIA GeForce RTX 3080 GPU and the PyTorch framework, which significantly enhanced the computational performance and processing speed.

4.4. Experimental Result and Analysis

Figure 3 provides a visual representation of the results obtained from the kidney stone detection algorithm applied to a series of medical CT scans. The images are arranged in a grid, where each row represents a different slice or section of the scan and each column shows different views or slices of the same section. Blue bounding boxes highlight areas where potential stones are detected. Each box carries the label “stone” followed by a confidence score (e.g., “stone 0.78”), which measures the model’s certainty that the detected object is a kidney stone. Higher confidence scores signify greater certainty and provide a quick assessment of detection reliability. The first column displays a vertical CT scan of the kidney. Both the YOLOv5n model and the proposed model showed a confidence score of 0.77 (77%) for stone detection. In contrast, the YOLOv5m model performed considerably worse on this image, with a confidence score near 0.51 (51%). The second and third columns show the upper CT scans of the human kidney, where the models were tasked with detecting two stones. All models showed high confidence scores; however, the proposed model surpassed YOLOv5s, with confidence scores of 0.89 (89%) and 0.84 (84%). In the third column, the proposed model again showed superior performance, with a confidence score 0.02 higher than that of YOLOv5s, which had a score of 0.84 (84%). The last column shows a partially cropped CT scan of the kidney. The YOLOv5n and YOLOv5s models produced identical results, with confidence scores of 0.79 (79%). The proposed model slightly outperformed these two models, with a confidence score approximately 0.01 higher.
This comparative analysis underscores the enhanced performance of the proposed model in detecting kidney stones and demonstrates its potential for achieving higher accuracy in medical imaging applications.

4.5. Different Coloring Approaches

Figure 4 illustrates the color-coding approach used with our proposed model to distinguish closely located kidney stones and avoid confusion resulting from overlap. The first row of images displays the kidney stone detection results using a standard method, with red boxes highlighting the detected stones. These red boxes are sometimes very close to each other, which can give the impression of overlap and lead to misreading of the detection results. To address this issue, the second row employs a color-coding scheme to improve clarity. Each detected stone is highlighted with a uniquely colored box (green, blue, yellow, purple, and red), allowing clear differentiation between closely spaced stones. This approach resolves the apparent overlap observed in the first row, making it easier to identify and count the stones. The enhanced visualization provides a clearer and more precise presentation of the kidney stone detection results, facilitating better analysis and understanding of the detection performance of different models.

One of the primary post-processing steps we implemented is a novel color-coded bounding box scheme. This method assigns unique colors to closely located stones, particularly when they overlap in the CT images. The purpose of this approach is to resolve potential ambiguities caused by overlapping bounding boxes, which can occur when multiple stones are located near one another. By visually differentiating each detected stone, the model output becomes clearer, facilitating easier interpretation and reducing misclassification in cases where stones are tightly clustered. To further address the issue of overlapping stones, we employed a bounding box merging strategy during post-processing. This approach involves analyzing predicted bounding boxes with a high intersection over union (IoU) score. When the IoU between two bounding boxes exceeded a set threshold (in our case, 0.45), we merged these bounding boxes to form a single, consolidated prediction. This technique helps avoid redundant or erroneous detections that may arise from noise or small, irregular stones being detected multiple times in slightly different locations. By refining the bounding boxes in this manner, the overall detection accuracy is improved, particularly in cases where stones overlap or are located in complex anatomical regions.
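One way the color assignment could work is sketched below. The five color names follow the description of Figure 4, but the RGB values, the left-to-right ordering rule, and the function name are assumptions; the paper does not specify how colors are assigned.

```python
# Color names follow the paper's figure; the RGB triples are assumptions.
PALETTE = {
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "purple": (128, 0, 128),
    "red": (255, 0, 0),
}


def assign_colors(boxes):
    """Give each box a color name so that spatially adjacent boxes differ.

    Boxes are (x_min, y_min, x_max, y_max). They are ranked left to right
    (then top to bottom) and colors are taken from the palette in turn,
    so neighbours in that ordering never share a color.
    """
    names = list(PALETTE)
    ordered = sorted(range(len(boxes)), key=lambda i: (boxes[i][0], boxes[i][1]))
    colors = [None] * len(boxes)
    for rank, index in enumerate(ordered):
        colors[index] = names[rank % len(names)]
    return colors


# Three hypothetical detections; the last two are close together.
detections = [(50, 40, 60, 50), (10, 10, 20, 20), (22, 12, 32, 22)]
box_colors = assign_colors(detections)
```

The palette is reused cyclically when an image contains more than five stones, so only boxes that are five positions apart in the ordering can share a color.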

One of the key benefits of the model is its ability to capture intricate feature relationships due to the integration of the SE block within the YOLOv5 architecture. This modification allows the model to recalibrate channel-wise dependencies, which enhances its sensitivity to smaller or less distinct features that might be missed by traditional detection methods. As a result, this model is particularly well suited for detecting small kidney stones that may be difficult to identify in noisy or complex CT images. Another scenario where the model could excel is in detecting irregularly shaped stones. Conventional detection methods often assume a more regular or predictable shape, such as ellipsoidal or spherical, when identifying stones. However, kidney stones can have highly irregular morphologies, making them harder to detect. The enhanced feature extraction capabilities of the model, along with its attention mechanisms, help it focus on the relevant parts of the image, allowing it to better differentiate between irregularly shaped stones and surrounding tissue. Moreover, in situations where stones are located near dense or overlapping structures, the model’s use of noise reduction and multi-scale detection features could improve detection accuracy. These scenarios, where visual complexity and noise can obscure stone boundaries, are areas where this model would provide radiologists with a more reliable tool for accurate diagnosis. The proposed model offers the greatest benefit in challenging cases involving small or irregularly shaped stones and in noisy environments where precise feature differentiation is essential for accurate detection.
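The channel recalibration performed by an SE block can be illustrated with a minimal pure-Python sketch of its three steps (squeeze, excitation, scale). This is not the authors' implementation, which is integrated into the C3 block of YOLOv5m; the weight matrices `w1` and `w2`, the omission of bias terms, and the toy feature map are assumptions.

```python
import math


def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation on a feature map shaped [channels][height][width].

    w1: reduction weights, shape [hidden][channels]
    w2: expansion weights, shape [channels][hidden]
    Both matrices are hypothetical; biases are omitted for brevity.
    Returns (recalibrated feature map, per-channel gates).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input: two 2x2 channels with constant values 1.0 and 3.0.
features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[0.5, 0.5]]      # 2 channels -> 1 hidden unit (reduction ratio 2)
w2 = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel gates
recalibrated, gates = se_block(features, w1, w2)
```

In this toy case the first channel is emphasized (gate near 0.88) and the second is suppressed (gate near 0.12), which is the channel-wise reweighting that lets the network favor informative feature maps.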

4.6. Comparison

Table 3 presents a comparative analysis of the performance of various YOLOv5 models (nano, small, and medium) and the proposed model across multiple metrics after 50 epochs of training. The evaluated metrics were precision, recall, mAP@0.5, the number of parameters in millions (params), and the number of floating-point operations in billions (FLOPs (G)). The YOLOv5s (small) model outperformed the YOLOv5n (nano) version, achieving a precision of 0.772, recall of 0.604, and mAP@0.5 of 0.617. This model comprised 19 M parameters and required 28.9 GFLOPs. The increase in parameter count and computational complexity yielded better performance metrics, making it more effective for medical imaging applications than the nano version. Further improvements were observed with the YOLOv5m (medium) model, which achieved a precision of 0.808, recall of 0.628, and mAP@0.5 of 0.655. This model consisted of 21 M parameters and required 47.9 GFLOPs. The medium model balanced computational complexity and performance, offering clear gains in precision and recall while maintaining a manageable computational load. The proposed model demonstrated the highest performance among all compared models. It achieved a precision of 0.816, recall of 0.637, and mAP@0.5 of 0.664, with 20.3 M parameters and 48.1 GFLOPs. Although its computational cost is slightly higher than that of the medium model, the superior performance metrics underscore its efficacy in detecting kidney stones. These results indicate that the proposed model is the most suitable of the compared models for medical imaging applications, particularly kidney stone detection.
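To make the size of the reported improvements explicit, the sketch below computes the absolute gains of the proposed model from the precision, recall, and mAP@0.5 values quoted above. Only figures stated in the text are used; the dictionary layout and function name are illustrative.

```python
# Values quoted in the text for Table 3 (YOLOv5n figures are not quoted there).
reported = {
    "YOLOv5s": {"precision": 0.772, "recall": 0.604, "mAP@0.5": 0.617},
    "YOLOv5m": {"precision": 0.808, "recall": 0.628, "mAP@0.5": 0.655},
    "Proposed": {"precision": 0.816, "recall": 0.637, "mAP@0.5": 0.664},
}


def gains(model, baseline):
    """Absolute difference per metric, rounded to three decimals."""
    return {
        metric: round(reported[model][metric] - reported[baseline][metric], 3)
        for metric in reported[model]
    }


gain_over_medium = gains("Proposed", "YOLOv5m")
gain_over_small = gains("Proposed", "YOLOv5s")
```

The gains over YOLOv5m are below 0.01 on every metric, whereas the gains over YOLOv5s range from 0.033 to 0.047.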

Table 3 provides the same comparative analysis by focusing on post-training loss values: the training box, object, and class losses and the validation box, object, and class losses. For YOLOv5n, the training box loss was approximately 0.0723, the highest training loss in its column; the object loss of 0.0003 was better than that of YOLOv5s, and the class loss was near 1.3021. The validation metrics for this model were a box loss of 0.0821, an object loss of 0.0008 lower than the training object loss, and a class loss lower than the training class loss, indicating that there was no overfitting. These results suggest that, despite its efficiency, the nano model struggles with higher class and box losses than the other YOLOv5 models, which affects its overall accuracy in complex medical imaging tasks. However, the validation losses were 0.0799 (box) and 0.9653 (class), demonstrating its high performance. The reduction in class loss indicates that it is more suitable for medical imaging than the nano model. The YOLOv5m model strikes a balance between computational complexity and performance, offering significant improvements in class loss and accuracy. The proposed model performed better across all metrics. The training box loss was greater than that of the nano and medium systems by approximately 0.0100 and 0.0020, respectively. The proposed system achieved the best validation results, including 0.9298 for the class loss, which was 0.200 greater than that of YOLOv5m. These results underscore the efficiency and accuracy of the proposed model, establishing it as the most suitable model for medical imaging applications, particularly for the detection of kidney stones.

This study demonstrates that the modified YOLOv5m outperforms other YOLOv5 variants, including YOLOv5n and YOLOv5s, in key metrics such as precision, recall, and mAP. This highlights the improvements achieved by the proposed modifications within the YOLO family. To provide a more comprehensive evaluation, we compared the modified YOLOv5m model against state-of-the-art (SOTA) object detection models outside the YOLO architecture, including Faster R-CNN [26], EfficientDet [27], RetinaNet [28], and CenterNet [29]. These models were chosen because they are widely used in medical imaging and object detection tasks, each offering a distinct architectural approach. Faster R-CNN has proven effective in high-accuracy scenarios but lacks the inference speed necessary for real-time applications. EfficientDet’s compound scaling approach strikes a balance between accuracy and efficiency, which is critical for medical applications such as kidney stone detection. RetinaNet addresses class imbalance through a focal loss mechanism, making it particularly suitable for datasets containing kidney stones of various sizes and shapes. CenterNet, with its keypoint-based detection, shows potential in localizing small and irregular stones more effectively.

In the comparative analysis in Table 4, our modified YOLOv5m model demonstrated superior performance in precision (0.816), recall (0.637), and mAP@0.5 (0.664) compared to these state-of-the-art models. The proposed model also maintained a favorable inference time of 8.2 ms and a model size of 41 MB, making it suitable for real-time applications. This comparison indicates that our modifications to YOLOv5m are not only effective within the YOLO family but also outperform other leading models in the field of medical imaging.
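The real-time claim can be checked with simple arithmetic on the reported inference time. The sketch below assumes that the 8.2 ms figure is the per-image latency with one image processed at a time; whether it includes preprocessing and post-processing is not stated in the text.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency in milliseconds,
    assuming images are processed one at a time."""
    return 1000.0 / latency_ms


# Reported inference time of the proposed model: 8.2 ms per image.
fps = frames_per_second(8.2)
```

Under that assumption the model would process roughly 122 images per second, well above typical video frame rates.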

By evaluating the computational complexity, inference time, and FLOPs, we provide a holistic view of the proposed model’s suitability for medical applications. The results of this study validate the effectiveness of the modifications, reinforcing the position of the modified YOLOv5m as a leading method in kidney stone detection. This broader evaluation not only showcases the strengths of the proposed architecture but also highlights its advantages over existing models outside the YOLO family.

5. Discussion

The primary objective of this study was to improve the detection accuracy of kidney stones in CT images by modifying the YOLOv5m architecture, specifically through the integration of SE blocks and an optimized color-coding scheme for bounding boxes. The results demonstrate that the proposed modifications enhance the performance of the model compared to both standard YOLOv5 variants and other state-of-the-art object detection models. Our modified YOLOv5m outperformed all competing models, including Faster R-CNN, EfficientDet, RetinaNet, and CenterNet, in key performance metrics such as precision, recall, and mean average precision (mAP@0.5). The model achieved a precision of 0.816, indicating its ability to minimize false positives more effectively than other methods. The recall, at 0.637, was the highest among the models tested, showing that the proposed model was able to detect a higher proportion of true positives. Furthermore, the model’s mAP@0.5 score of 0.664 underscores its superior ability to localize and classify kidney stones, especially compared to the baseline YOLOv5 variants and other established object detection models. One of the key contributions of this study is the inclusion of the SE block in the C3 block of the YOLOv5 architecture. The SE block enhances the recalibration of channel-wise dependencies, allowing the model to focus on the most informative features in the image. This proves particularly beneficial in medical imaging, where small and irregularly shaped objects, such as kidney stones, may otherwise be overlooked. The introduction of a novel color-coding scheme for closely located kidney stones further enhanced the model’s ability to differentiate between overlapping stones, improving both the clarity of the detection output and the interpretability of the results.
The model’s ability to maintain a relatively fast inference time of 8.2 ms per image while achieving a high level of accuracy makes it suitable for real-time applications in clinical settings. This speed–accuracy trade-off positions the proposed model as a viable solution for automated kidney stone detection, providing a faster, more reliable alternative to manual analysis by radiologists. The proposed model, while effective in detecting kidney stones in controlled CT imaging environments, may face certain limitations in more challenging scenarios. One such limitation arises when processing CT images with high levels of noise or artifacts caused by low image quality, motion blur, or scanner-specific issues. For instance, images obtained from older or lower-resolution scanners may introduce noise that complicates the accurate detection of small or irregularly shaped stones. While the integration of the SE block helps mitigate some of these challenges by recalibrating feature dependencies, extreme noise can still reduce the precision of detection.

Another area for future research involves expanding the scope of the model’s application to other medical imaging tasks beyond kidney stone detection. The integration of SE blocks and attention mechanisms could potentially enhance the performance of models applied to other anomaly detection tasks in medical imaging, such as tumor or cyst detection. Furthermore, additional comparisons with more recent advancements in object detection architectures, such as Vision Transformers (ViTs), could provide further insight into the efficacy of the proposed approach relative to emerging technologies. This study presents a modified YOLOv5m architecture that delivers improvements in precision, recall, and mAP for kidney stone detection in CT images. The proposed model not only outperforms established object detection models but also demonstrates potential for real-time application in medical diagnostics. By addressing the limitations noted and exploring broader applications, the proposed methodology can pave the way for more effective and efficient medical imaging solutions in the future.

6. Conclusions

This paper presents a novel modification of the YOLOv5 model, specifically tailored for the early and accurate detection of kidney stones in CT images. By integrating an SE block within the C3 block of the YOLOv5m architecture, the ability of the model to recalibrate channel-wise dependencies and capture intricate feature relationships was enhanced. The results of extensive experiments demonstrated that the modified YOLOv5 model achieved superior performance metrics, including higher precision, recall, and mAP, compared to standard YOLOv5 variants. The proposed model effectively balances inference speed and model size, making it suitable for real-time medical applications. Additionally, the incorporation of noise reduction and data augmentation techniques ensured the preservation of critical features and enhanced the robustness of the training dataset. The use of a novel color-coding scheme for the bounding boxes further improved the clarity and differentiation of the detected stones, facilitating better analysis and understanding of the detection results. Our comprehensive evaluation using essential evaluation metrics underscores the efficacy of the proposed model in detecting kidney stones and offers a robust, accurate, and efficient solution for medical imaging applications. This advancement in computer-aided diagnosis holds significant potential for improving patient outcomes by providing healthcare professionals with a reliable tool for the early and precise detection of kidney stones. In addition to its real-time capabilities, YOLOv5’s precision is highly advantageous for medical applications. Kidney stones, especially when small or irregularly shaped, can be challenging to detect. The YOLOv5 architecture, particularly after the modifications we introduced, such as the integration of the SE block, enhances the model’s ability to capture and recalibrate channel-wise dependencies.
This improvement ensures that the model can accurately detect even small stones in noisy CT images, which are often difficult to analyze using traditional methods. Moreover, YOLOv5’s flexibility and scalability allow it to adapt to various computational environments, making it versatile for different clinical setups. The architecture can be fine-tuned to balance computational resources and performance, ensuring that it remains efficient in resource-constrained environments while maintaining high detection accuracy. These features, combined with YOLOv5’s ability to handle complex medical data and perform multi-scale detection, make it a robust and efficient solution for kidney stone detection in CT images. The model’s adaptability, precision, and speed are key factors that motivated its selection for this task.

From a clinical perspective, the AI system developed in this study has significant potential to be integrated into radiological workflows. Its ability to provide real-time, automated detection of kidney stones could streamline the diagnostic process, allowing radiologists to focus on more complex cases while reducing the likelihood of missed diagnoses. By enhancing both speed and accuracy, the system could be particularly useful in high-throughput environments such as emergency departments, where timely detection of small stones can be critical for patient outcomes. Additionally, the model’s lightweight architecture makes it feasible for deployment on a range of computational systems, further supporting its integration into existing clinical infrastructures. Future work will involve collaborating with healthcare professionals to validate the model in real-world clinical settings, ensuring that it can seamlessly complement radiologists’ expertise and improve overall diagnostic workflows.

The integration of SE blocks into the YOLOv5 architecture represents a significant advancement in the field of medical imaging and diagnostics and offers a powerful tool for enhancing the accuracy and efficiency of kidney stone detection in clinical settings. Future studies should focus on further optimizing this model and exploring its application in other medical imaging tasks, thereby broadening its impact on healthcare diagnostics.

7. Additional Information

In this study, the preparation of the methodology, including model design, data preprocessing, and augmentation techniques, took approximately six months. This phase involved finalizing the architecture modifications, tuning the hyperparameters, and ensuring that the dataset was properly processed for use with the model.

The main reason we chose YOLOv5 from the YOLO family is that it had established itself as a stable and well-documented model, widely adopted in various applications. Its maturity provided a solid foundation for our modifications to enhance kidney stone detection. YOLOv5 benefits from a robust community, offering extensive resources, tutorials, and pre-trained models. This support facilitated our development process and allowed for efficient customization to meet the specific requirements of medical imaging. Our goal was to develop a model suitable for deployment in clinical settings, which often have hardware constraints. YOLOv5’s balance between performance and computational efficiency made it an appropriate choice for real-time applications on standard medical equipment.

The experimental phase, which included training the model on the TEZ_ROI_AUG dataset, tuning the hyperparameters, and running various tests, required two months. During this period, we conducted multiple training sessions across different variants of YOLOv5 and state-of-the-art models such as Faster R-CNN, EfficientDet, RetinaNet, and CenterNet. The comparison and analysis of results, including detailed performance evaluation metrics and visualization, took an additional two weeks. This phase was crucial to ensure that the proposed model’s superiority in precision, recall, mAP, and inference time was comprehensively demonstrated against competing models. The entire process, from methodology preparation to obtaining and comparing results, spanned approximately eight and a half months. This timeline is described in the paper to provide a complete view of the study’s duration and the rigor involved in achieving the outcomes.

The sizes of the kidney stones detected in this study ranged from 2 mm to 12 mm based on the annotations present in the TEZ_ROI_AUG dataset. The minimum size of 2 mm represents small stones, which are often challenging to detect manually, especially in the early stages. The maximum size of 12 mm corresponds to larger stones, which are typically easier to detect even in clinical practice. In clinical settings, larger stones (>5 mm) are often more easily identifiable, and radiologists are typically proficient at locating them. However, the real value of AI, particularly in the domain of kidney stone detection, lies in its ability to consistently identify smaller stones, which may be overlooked in standard radiological assessments. Our proposed AI model demonstrated its ability to detect stones as small as 2 mm, including those located not only in the renal system but also in the ureters. The inclusion of attention mechanisms and the SE blocks in the model architecture allowed improved detection of smaller stones by emphasizing important image features and reducing noise, which can obscure such small anomalies. AI’s ability to reliably detect smaller stones could greatly assist in early diagnosis, reducing the risk of stone growth and the potential for more severe complications. Additionally, early detection can guide timely treatment, potentially preventing more invasive procedures. Therefore, we believe that the AI-driven approach is particularly beneficial for identifying smaller stones in both the renal excretory system and the ureters, enhancing clinical decision-making and patient outcomes.

Author Contributions

Methodology, K.A.A. and Y.-I.C.; software, K.A.A.; validation, K.A.A. and Y.-I.C.; formal analysis, K.A.A. and Y.-I.C.; resources, K.A.A.; data curation, K.A.A. and Y.-I.C.; writing—original draft, K.A.A. and Y.-I.C.; writing—review and editing, K.A.A. and Y.-I.C.; supervision, Y.-I.C.; project administration, K.A.A. and Y.-I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Korea Institute of Marine Science & Technology Promotion (KIMST) funded by the Ministry of Oceans and Fisheries (G22202202102401), by the Korean Agency for Technology and Standards under the Ministry of Trade, Industry and Energy in 2023 (project number 1415180835), and by a Gachon University 2024 research grant (GCU-202406240001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All used datasets are available online with open access.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Akram, M.; Jahrreiss, V.; Skolarikos, A.; Geraghty, R.; Tzelves, L.; Emilliani, E.; Davis, N.F.; Somani, B.K. Urological guidelines for kidney stones: Overview and comprehensive update. J. Clin. Med. 2024, 13, 1114.
Jebir, R.M.; Mustafa, Y.F. Kidney stones: Natural remedies and lifestyle modifications to alleviate their burden. Int. Urol. Nephrol. 2024, 56, 1025–1033.
Cheraghian, B.; Meysam, A.; Hashemi, S.J.; Hosseini, S.A.; Malehi, A.S.; Khazaeli, D.; Rahimi, Z. Kidney stones and dietary intake in adults: A population-based study in southwest Iran. BMC Public Health 2024, 24, 955.
Ahmed, F.; Abbas, S.; Athar, A.; Shahzad, T.; Khan, W.A.; Alharbi, M.; Khan, M.A.; Ahmed, A. Identification of kidney stones in KUB X-ray images using VGG16 empowered with explainable artificial intelligence. Sci. Rep. 2024, 14, 6173.
Liu, H.; Ghadimi, N. Hybrid convolutional neural network and Flexible Dwarf Mongoose Optimization Algorithm for strong kidney stone diagnosis. Biomed. Signal Process. Control. 2024, 91, 106024.
Muksimova, S.; Umirzakova, S.; Mardieva, S.; Cho, Y.I. Enhancing medical image denoising with innovative teacher–student model-based approaches for precision diagnostics. Sensors 2023, 23, 9502.
Muksimova, S.; Umirzakova, S.; Kang, S.; Im Cho, Y. CerviLearnNet: Advancing cervical cancer diagnosis with reinforcement learning-enhanced convolutional networks. Heliyon 2024, 10, e29913.
Mardieva, S.; Ahmad, S.; Umirzakova, S.; Rasool, M.A.; Whangbo, T.K. Lightweight image super-resolution for IoT devices using deep residual feature distillation network. Knowl.-Based Syst. 2024, 285, 111343.
Dangle, P.; Tasian, G.E.; Chu, D.I.; Shannon, R.; Spiardi, R.; Xiang, A.H.; Jadcherla, A.; Arenas, J.; Ellison, J.S. A systematic scoping review of comparative effectiveness studies in kidney stone disease. Urology 2024, 183, 3–10.
Umirzakova, S.; Ahmad, S.; Mardieva, S.; Muksimova, S.; Whangbo, T.K. Deep learning-driven diagnosis: A multi-task approach for segmenting stroke and Bell’s palsy. Pattern Recognit. 2023, 144, 109866.
Pan, W.; Yun, T.; Ouyang, X.; Ruan, Z.; Zhang, T.; An, Y.; Wang, R.; Zhu, P. A blood-based multi-omic landscape for the molecular characterization of kidney stone disease. Mol. Omics 2024, 20, 322–332.
Zhang, M.; Ye, Z.; Yuan, E.; Lv, X.; Zhang, Y.; Tan, Y.; Xia, C.; Tang, J.; Huang, J.; Li, Z. Imaging-based deep learning in kidney diseases: Recent progress and future prospects. Insights Into Imaging 2024, 15, 50.
Sudharson, S.; Kokil, P. Computer-aided diagnosis system for the classification of multi-class kidney abnormalities in the noisy ultrasound images. Comput. Methods Programs Biomed. 2021, 205, 106071.
Baygin, M.; Yaman, O.; Barua, P.D.; Dogan, S.; Tuncer, T.; Acharya, U.R. Exemplar Darknet19 feature generation technique for automated kidney stone detection with coronal CT images. Artif. Intell. Med. 2022, 127, 102274.
Chiou, T.; Meagher, M.F.; Berger, J.H.; Chen, T.T.; Sur, R.L.; Bechis, S.K. Software-estimated stone volume is better predictor of spontaneous passage for acute nephrolithiasis. J. Endourol. 2023, 37, 85–92.
Patro, K.K.; Allam, J.P.; Neelapu, B.C.; Tadeusiewicz, R.; Acharya, U.R.; Hammad, M.; Yildirim, O.; Pławiak, P. Application of Kronecker convolutions in deep learning technique for automated detection of kidney stones with coronal CT images. Inf. Sci. 2023, 640, 119005.
Xu, W.; Lai, C.; Mo, Z.; Liu, C.; Li, M.; Zhao, G.; Xu, K. Clinical-Inspired Framework for Automatic Kidney Stone Recognition and Analysis on Transverse CT Images. IEEE J. Biomed. Health Inform. 2024, 1–12.
Kilic, U.; Karabey Aksakalli, I.; Tumuklu Ozyer, G.; Aksakalli, T.; Ozyer, B.; Adanur, S. Exploring the Effect of Image Enhancement Techniques with Deep Neural Networks on Direct Urinary System (DUSX) Images for Automated Kidney Stone Detection. Int. J. Intell. Syst. 2023, 2023, 3801485.
Bayram, A.F.; Gurkan, C.; Budak, A.; Karataş, H. A detection and prediction model based on deep learning assisted by explainable artificial intelligence for kidney diseases. Avrupa Bilim Ve Teknol. Derg. 2022, 40, 67–74. [Google Scholar]
Tahir, F.S.; Abdulrahman, A.A. Kidney stones detection based on deep learning and discrete wavelet transform. Indones. J. Electr. Eng. Comput. Sci. 2023, 31, 1829. [Google Scholar] [CrossRef]
Chaki, J.; Ucar, A. An efficient and robust approach using inductive transfer-based ensemble deep neural networks for kidney stone detection. IEEE Access 2024, 12, 32894–32910. [Google Scholar] [CrossRef]
Asif, S.; Zheng, X.; Zhu, Y. An optimized fusion of deep learning models for kidney stone detection from CT images. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102130. [Google Scholar] [CrossRef]
Kumar, P.; Singh, D.; Samagh, J.S. A Hybrid Model for Kidney Stone Detection Using Deep Learning. IJSTM 2024, 13, 65–85. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
TEZ. “TEZ_ROI_AUG Dataset”. Roboflow Universe, Roboflow, April 2023. Available online: https://universe.roboflow.com/tez-nwkf5/tez_roi_aug (accessed on 3 August 2024).
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]

Figure 1. Architecture of YOLOv5 with a C3 block.

Figure 2. Bottleneck blocks, each consisting of two convolutional layers with a residual connection, contribute to parameter reduction while preserving the representational capacity of the model. (a) C3 block with three convolutions. (b) SE block. (c) Bottleneck of the C3 block.

Figure 3. Comparative analysis of different YOLOv5 model variants (nano-sized, small, and medium) along with the proposed modified YOLOv5 model for the detection of kidney stones in CT images.

Figure 4. Different coloring approaches for the bounding boxes of detected objects.

Table 1. Five YOLOv5 models pretrained on the COCO dataset, where n, s, m, l, and x denote the nano-sized, small, medium, large, and extra-large variants, respectively. The table shows the model size, inference time, and mAP.

Model	YOLOv5n	YOLOv5s	YOLOv5m	YOLOv5l	YOLOv5x
Model size (MB)	4	14	41	89	166
Inference time (ms)	6.3	6.4	8.2	10.1	12.1
mAP	28.4	37.2	45.2	48.8	50.7

Table 2. Training and validation losses (box, object, and class) of the YOLOv5 variants (nano-sized, small, and medium) and the proposed model after training.

Models	Train Box Loss	Train Object Loss	Train Class Loss	Val Box Loss	Val Object Loss	Val Class Loss
YOLOv5n (nano-sized)	0.0723	0.0093	1.3021	0.0821	0.0085	1.1245
YOLOv5s (small)	0.0671	0.0096	1.0122	0.0799	0.0088	0.9653
YOLOv5m (medium)	0.0624	0.0084	0.9863	0.0785	0.0084	0.9403
Ours	0.0607	0.0076	0.9746	0.0767	0.0079	0.9298

Table 3. Comparison of the performances of different YOLOv5 models (nano-sized, small, and medium) with the proposed model across various metrics after 50 epochs of training.

Models	Precision	Recall	mAP@0.5	Params	FLOPs (G)	Epochs
YOLOv5n (nano-sized)	0.719	0.578	0.567	17	4.1	50
YOLOv5s (small)	0.772	0.604	0.617	19	28.9	50
YOLOv5m (medium)	0.808	0.628	0.655	21	47.9	50
Ours	0.816	0.637	0.664	20.3	48.1	50
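
Precision and recall in Table 3 can be folded into a single F1 score to make the ranking of the four models easier to read. The sketch below is illustrative only: the F1 values are derived here from the reported precision and recall and are not reported in the paper, and the names `table3` and `f1_score` are assumptions introduced for this example.

```python
# Precision and recall copied from Table 3.
# F1 is a derived value computed here; it is not a metric reported in the paper.
table3 = {
    "YOLOv5n": (0.719, 0.578),
    "YOLOv5s": (0.772, 0.604),
    "YOLOv5m": (0.808, 0.628),
    "Ours": (0.816, 0.637),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Map each model name to its derived F1 score.
f1 = {name: f1_score(p, r) for name, (p, r) in table3.items()}

for name, score in f1.items():
    print(f"{name}: F1 = {score:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, so the relatively low recall values in Table 3 dominate the derived scores.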

Table 4. Comparison of the performance of our modified YOLOv5m model with that of other detection models for kidney stone detection.

Model	Precision	Recall	mAP@0.5	Inference Time (ms)	Model Size (MB)
Faster R-CNN	0.785	0.612	0.628	15.4	148
EfficientDet (D2)	0.799	0.619	0.640	13.0	52
RetinaNet	0.782	0.605	0.635	14.1	80
CenterNet	0.804	0.625	0.649	12.7	70
YOLOv5n (Nano-Sized)	0.719	0.578	0.567	6.3	17
YOLOv5s (Small)	0.772	0.604	0.617	6.4	19
YOLOv5m (Medium)	0.808	0.628	0.655	8.2	41
Proposed Model (Ours)	0.816	0.637	0.664	8.2	41
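
The speed and accuracy columns of Table 4 can be related with two simple derived quantities: an idealized throughput in frames per second (1000 divided by the inference time in milliseconds) and the relative mAP@0.5 gain of the proposed model over each baseline. The sketch below is a rough illustration under stated assumptions: it treats the reported inference time as the full per-image latency, ignoring pre- and post-processing, and the names `table4`, `frames_per_second`, and `relative_gain` are hypothetical helpers, not part of the paper.

```python
# (mAP@0.5, inference time in ms) copied from Table 4.
table4 = {
    "Faster R-CNN": (0.628, 15.4),
    "EfficientDet (D2)": (0.640, 13.0),
    "RetinaNet": (0.635, 14.1),
    "CenterNet": (0.649, 12.7),
    "YOLOv5m": (0.655, 8.2),
    "Proposed": (0.664, 8.2),
}


def frames_per_second(latency_ms: float) -> float:
    """Idealized throughput, assuming one image per forward pass."""
    return 1000.0 / latency_ms


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


ours_map, ours_ms = table4["Proposed"]

# For each baseline: (its idealized FPS, relative mAP gain of the proposed model).
summary = {
    name: (frames_per_second(ms), relative_gain(ours_map, m))
    for name, (m, ms) in table4.items()
    if name != "Proposed"
}

print(f"Proposed: {frames_per_second(ours_ms):.1f} FPS (idealized)")
for name, (fps, gain) in summary.items():
    print(f"{name}: {fps:.1f} FPS, proposed mAP@0.5 gain = {gain:+.2f}%")
```

These derived figures only restate the table; actual throughput depends on hardware, batch size, and image resolution, none of which are captured here.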

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Abdimurotovich, K.A.; Cho, Y.-I. Optimized YOLOv5 Architecture for Superior Kidney Stone Detection in CT Scans. Electronics 2024, 13, 4418. https://doi.org/10.3390/electronics13224418

