1. Introduction
An endoscope—the most direct examination device for gastrointestinal diseases—is introduced into the human body through natural cavities. This has significant advantages in both diagnosis and treatment, and it serves as a primary means for subsequent minimally invasive surgery and noninvasive treatment [
1]. The collected endoscopic images are used to determine the patient’s condition and formulate subsequent treatment plans. However, endoscopic images contain numerous interfering factors, such as mirror reflection, motion blur, and bubbles, which can hinder the visual interpretation of endoscopic examinations, impede clinicians’ observation and diagnosis of lesion areas, and degrade computer-aided diagnosis (CAD) processes [
2]. In addition, the presence of interference is an important reference criterion for the assessment of image quality in gastrointestinal endoscopy images, contributing to the quality evaluation of clinical endoscopic examination procedures [
3].
Image quality assessment aims to simulate human perception; algorithm outputs are typically validated against subjective ratings provided by human observers. A typical approach involves comparing a distortion metric with an ideal imaging model or a perfect reference image [
4]. Depending on how much information from the original reference image is available, assessment methods can be classified as full-reference, reduced-reference, or no-reference (the latter also known as blind image quality assessment, BIQA) [
5]. No-reference image quality assessment (NR-IQA) is one of the most challenging problems in image quality assessment because it solely relies on distorted images, and thus has garnered significant research interest in recent years.
NR-IQA has significant practical value and is widely used in applications where reference images are unavailable. Mittal et al. proposed a blind image spatial quality assessment method called the blind/referenceless image spatial quality evaluator (BRISQUE) [
6]. This method uses locally normalized luminance coefficients to quantify the “naturalness” loss caused by distortion; it exhibits a low computational complexity, making it suitable for real-time applications. Liu et al. introduced RankIQA [
7], which generates ordered degraded images to train a Siamese network for relative quality ranking. The trained network is then transferred to a traditional CNN, enabling it to estimate absolute image quality from a single image. Ke et al. proposed a multiscale image quality transformer called MUSIQ [
8], which can handle full-resolution image inputs with different resolutions, sizes, and aspect ratios; capture image quality at various granularities; and perform well on multiple large-scale IQA datasets. Talebi et al. presented a neural network based on deep object recognition called NIMA [
9], which can predict the distribution of human evaluations of images in terms of both perceptual quality (from a technical standpoint) and attractiveness (from an aesthetic standpoint). The proposed neural network exhibited scores that closely resembled the human subjective ratings, making it suitable for natural image quality assessment tasks. With the profound influence of machine learning, particularly deep learning, in various fields, image quality assessment in endoscopy is undergoing continuous innovation. Alexander et al. developed a new fidelity score for quantitative image quality assessment based on the structural similarity maps adopted in the human visual system (HVS), where the measure indicated the extent to which the structural information of relevant structures was preserved in the panorama [
10]. Aubreville et al. [
11] proposed an improved version of the Inception V3 network for detecting motion artifacts in endoscopic images. Kamen et al. [
12] selected high-quality and information-rich images by calculating the image entropy. Outtas et al. [
13] studied the usability of two general algorithms based on natural image quality assessments (NIQE [
14] and BRISQUE [
6]) in medical image environments. Prerna et al. [
15] proposed the use of gray-level co-occurrence matrices for image quality assessment. Zhang et al. [
16] performed a no-reference quality assessment of capsule endoscopy images by calculating the gradient field using the Sobel operator.
Considering the limitations of current no-reference endoscopic image quality assessment methods, which primarily focus on evaluating individual artifacts and often yield less interpretable results, we propose a framework for detecting endoscopic image artifacts and assessing image quality based on an improved cascade R-CNN designed to overcome these challenges. The architecture of the framework is shown in
Figure 1. First, the original endoscopic image is obtained. After undergoing image preprocessing, the image is fed into the improved cascade region-based CNN for artifact detection. Subsequently, after extracting the detection result data, both location and area information are acquired and transmitted to the endoscopic image NR-IQA method. This process culminates in the generation of interpretable image quality assessment results.
The artifact detection method retains the multistage structure of the cascade R-CNN, with ResNeXt101 serving as the backbone network. We improved the original feature pyramid network (FPN) structure and introduced Generalized Intersection over Union (GIoU) loss as a new evaluation metric loss function, replacing the original IoU metric. Using the improved model, we successfully located and identified artifacts in the gastrointestinal tract. Additionally, we propose a method that combines multiple weights to calculate the image quality score based on artifact detection results.
The main contributions of this study are as follows:
(1) Gastrointestinal endoscopy images (still frames rather than video sequences) from multiple clinical hospitals were used. The images have high content and modality diversity, including images from white light imaging (WLI), narrow-band imaging (NBI), and iodine staining, making them robust and relevant for general applications in gastrointestinal endoscopy.
(2) An improved feature pyramid network that incorporates channel attention mechanisms into the feature extraction process is proposed. Shallow and deep features are fused by introducing an additional channel from the shallow to the deep layers, thereby enhancing the utilization of spatial information in the shallow layers. This not only strengthens the extraction of semantic and positional information through path aggregation but also establishes global spatial feature attention for mapping and representing artifact features across multiple branches in the image.
(3) The proposed endoscopic image quality assessment method successfully detects anomalies in small targets and handles multiple interferences. Moreover, it uses the detection results of high-quality artifact targets to provide numerical scores that are close to expert ratings, thereby demonstrating strong interpretability.
Overall, these contributions enhance the detection and assessment of endoscopic image artifacts and provide valuable insights for clinical multicenter endoscopic image quality assessment.
3. Methods
The overall approach of this method is divided into two main parts: artifact target detection using the improved cascade R-CNN model and endoscopic image quality assessment based on the detection results.
3.1. Cascade R-CNN
The cascade R-CNN [
18] is an extension of the Faster R-CNN that aims to incorporate more semantic information into the detection task [
19]. Unlike the traditional Faster R-CNN, the cascade R-CNN employs a cascaded modular structure, enabling additional context and features to be extracted through successive feature extraction stages. These features are then utilized in the ROI pooling layer to enhance network performance by providing richer representations. In addition, the cascaded structure provides more supervision signals during training, leading to more accurate models.
The cascade R-CNN consists of a feature extraction network, an FPN, a region proposal network, and cascade detectors. The feature extraction network, ResNeXt101, is used to extract image features. The input image passes through the convolution stages Conv1, Conv2, Conv3, Conv4, and Conv5, and features from different levels are fused to obtain feature maps P2, P3, P4, and P5 at different scales. These feature maps are then input into the region proposal network to obtain candidate target regions. Subsequently, an ROI alignment operation is performed on the resulting candidate regions to obtain ROI feature maps.
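For reference, the overall structure described above can be expressed as an MMDetection-style configuration. The sketch below is a minimal, partial illustration only; the key names follow MMDetection 2.x conventions, but the specific values (cardinality, channel counts) are assumptions rather than the exact configuration used in this study.

```python
# Minimal MMDetection 2.x-style sketch of the Cascade R-CNN structure described above.
# Key names follow MMDetection conventions; specific values (groups, base_width,
# channel counts) are illustrative assumptions, and head/loss details are omitted.
model = dict(
    type='CascadeRCNN',
    backbone=dict(                      # feature extraction network
        type='ResNeXt', depth=101, groups=64, base_width=4,
        num_stages=4, out_indices=(0, 1, 2, 3),
        norm_cfg=dict(type='BN', requires_grad=True)),
    neck=dict(                          # feature pyramid network (P2-P5 fusion)
        type='FPN', in_channels=[256, 512, 1024, 2048],
        out_channels=256, num_outs=5),
    rpn_head=dict(                      # region proposal network
        type='RPNHead', in_channels=256, feat_channels=256),
    roi_head=dict(                      # cascade detectors with ROI alignment
        type='CascadeRoIHead', num_stages=3))
```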
ResNeXt101 is a deep CNN built on the ResNet architecture [
20]. It introduces the concept of “group convolution,” in which multiple parallel convolutional branches with the same structure are employed. Each branch processes different input characteristics, enabling an increased network width without overfitting. The ResNeXt101 network incorporates techniques such as batch normalization and residual connections to enhance its performance. Residual structures have been widely employed in medical imaging to mitigate degradation between the network’s shallow and deep layers [
21]. The residual structure is shown in
Figure 5.
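To make the grouped-convolution residual structure concrete, the following PyTorch sketch shows a simplified ResNeXt-style bottleneck block; the channel sizes and cardinality are illustrative assumptions, and downsampling variants are omitted for brevity.

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """Simplified ResNeXt-style bottleneck: 1x1 reduce -> 3x3 grouped conv -> 1x1 expand,
    with an identity residual connection. Channel sizes and cardinality are illustrative."""
    def __init__(self, channels=256, width=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),   # grouped convolution
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # residual (skip) connection

x = torch.randn(1, 256, 64, 64)
print(ResNeXtBottleneck()(x).shape)  # torch.Size([1, 256, 64, 64])
```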
In this study, an improved cascade R-CNN detection model is proposed, utilizing ResNeXt101 as the backbone network to locate and identify artifacts in the digestive tract. The original evaluation metric in the cascade R-CNN, Intersection over Union (IoU), was replaced with GIoU loss as a new evaluation-metric loss function. IoU measures the degree of overlap between a predicted bounding box and a ground-truth bounding box. GIoU is an improved version of IoU: it considers not only the intersection and union areas but also the spatial relationship between the two bounding boxes, making it more robust to variations in object shape and rotation.
Figure 6 illustrates the enhanced FPN with added attention mechanisms that enable better learning capabilities.
The advantages of this model are as follows:
(1) It provides an effective feature extraction structure that maximizes the utilization of shallow feature information and enhances the detection of small targets.
(2) The incorporation of the channel attention mechanism captures the feature dependencies between different channel maps in the feature extraction network. This reduces the missed detection rate and leads to more reliable results.
(3) GIoU was employed as the new evaluation index loss function, replacing IoU, the original evaluation index in cascade R-CNN. This ensures scale invariance in the loss function target detection frame regression and maintains consistency between the optimization objective and loss function of the detection frame.
3.2. Improved Feature Pyramid Structure
Each layer of the feature pyramid employs a distinct convolution kernel size to extract the features. The lower layers capture large-scale features, such as edges, whereas the final layer captures fine details. These feature maps are then concatenated to create a comprehensive feature map for classification. The key advantage of a feature pyramid is its ability to capture features at various scales in an image, resulting in enhanced classification performance. This is particularly effective for handling images of different sizes by encompassing the features of diverse scales.
Although this approach exhibits high accuracy in classifying and localizing larger objects, it faces challenges in accurately detecting smaller objects. This is due to the nature of deep CNNs, where extensive convolution and pooling operations result in expanded receptive fields and decreased resolution in the network’s feature layers. Consequently, there is a risk of overlooking small targets.
In contrast, low-level features obtained from shallow neural networks have a higher resolution and contain more information, making them valuable for detecting small objects. By fusing features at different scales, the recognition accuracy for small targets can be improved while maintaining the accuracy for larger targets.
A feature enhancement method was employed in this study to fully use shallow feature information and increase the resolution of the feature maps for small targets. Deep features, which contain rich semantic information, were upsampled through bilinear interpolation and added elementwise to the shallow features. However, propagating accurate localization information from the shallow to the deep features becomes increasingly difficult. To address this issue, an additional pathway, called the pathway enhancement channel, was introduced. This pathway reduces the number of convolution layers traversed by the information flow from the shallow to the deep layers, enabling shallow information to propagate to the deep layers and enhancing the localization accuracy of deep positional information. The improved structure of the FPN is shown in
Figure 7.
The original convolutional output layers in the bottom-up pathway are C2, C3, C4, and C5. First, convolution was applied to the input image, followed by 1 × 1 convolution for dimensionality reduction of C2, C3, and C4. After applying the attention mechanism to C5, it underwent two-fold upsampling to match the size of C4; the corresponding elements were then added, and the result was input into P4. The same process was used to obtain P3 and P2. P3, P4, and P5 were upsampled by 2×, 4×, and 8×, respectively, using bilinear interpolation and added elementwise to the shallow feature maps, increasing the utilization of deep features by the shallow features. The bilinear interpolation of f(x, y) is as follows:
$$f(x,y) \approx \frac{f(Q_{11})(x_2-x)(y_2-y) + f(Q_{21})(x-x_1)(y_2-y) + f(Q_{12})(x_2-x)(y-y_1) + f(Q_{22})(x-x_1)(y-y_1)}{(x_2-x_1)(y_2-y_1)},$$
where $Q_{11}=(x_1,y_1)$, $Q_{12}=(x_1,y_2)$, $Q_{21}=(x_2,y_1)$, and $Q_{22}=(x_2,y_2)$ are four known points. The features of each layer after bilinear interpolation were then obtained by adding the upsampled deep feature maps elementwise to the corresponding shallow feature maps.
The final generated feature layers fully utilize the deep and shallow features by fusing these different levels, resulting in improved prediction performance.
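As a rough illustration of the fusion described above, the sketch below upsamples deeper pyramid levels with bilinear interpolation and adds them elementwise to a shallow level; the layer names, shapes, and channel counts are assumptions for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def fuse_to_shallow(p2, p3, p4, p5):
    """Upsample deeper FPN levels (P3, P4, P5) by 2x, 4x, 8x with bilinear
    interpolation and add them elementwise to the shallow level P2.
    All inputs are assumed to share the same channel count (e.g., 256)."""
    target = p2.shape[-2:]
    fused = p2
    for deep in (p3, p4, p5):
        fused = fused + F.interpolate(deep, size=target,
                                      mode='bilinear', align_corners=False)
    return fused

p2 = torch.randn(1, 256, 160, 160)
p3 = torch.randn(1, 256, 80, 80)
p4 = torch.randn(1, 256, 40, 40)
p5 = torch.randn(1, 256, 20, 20)
print(fuse_to_shallow(p2, p3, p4, p5).shape)  # torch.Size([1, 256, 160, 160])
```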
Digestive endoscopy images are complex, and the presence of artifacts makes it difficult to distinguish between true targets. To enhance the expressive power of the features of images, a channel attention mechanism was employed in the FPN object detection model. For feature-mapping layers of different scales, the channel attention mechanism was used to capture the feature dependencies between different channel maps. This involves calculating the weighted feature vectors that represent the explicit correlation between the feature channels, as shown in
Figure 8.
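A common way to realize such channel attention is a squeeze-and-excitation-style module; the sketch below is one plausible form under that assumption, with an arbitrarily chosen reduction ratio, and is not necessarily identical to the module used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global average pooling
    produces a per-channel descriptor, a small bottleneck MLP predicts channel
    weights, and the input feature map is rescaled channel-wise."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))        # (b, c) channel weights
        return x * w.view(b, c, 1, 1)          # reweight each channel map

feat = torch.randn(1, 256, 40, 40)
print(ChannelAttention()(feat).shape)  # torch.Size([1, 256, 40, 40])
```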
3.3. Loss Function
We introduced GIoU [
22] as a new evaluation index loss function to replace the original evaluation index, IoU, in the cascade R-CNN. The GIoU formula ensures that the loss function for the target detection box regression is scale-invariant and maintains consistency between the optimization objective and the loss function. In the context of artifact identification in digestive endoscopy, there is a significant imbalance between positive and negative samples, which makes the training of the bounding box scores more challenging. Metrics based on L1 and L2 norms may yield significantly different IoU values for the two bounding boxes at the same distance. As shown in
Figure 9, in which each bounding box is represented by two corner points. In all three cases, the L2-norm distance between the two rectangles’ representations is the same; however, the IoU and GIoU values differ considerably. Furthermore, identical IoU values only indicate that the Intersection over Union of the target box and the detection box is the same, while the actual sizes of the predicted boxes may be completely different. This highlights that neither IoU nor the L2 norm can adequately reflect the detection performance, so such loss functions are not ideal for bounding box prediction. Unlike IoU, which considers only the overlapping area, GIoU accounts for both the overlapping and nonoverlapping regions, better reflecting how well the two boxes coincide.
The introduction of the GIoU loss function has significant implications in the proposed method; it overcomes the limitations associated with traditional loss functions in the context of artifact detection. This loss function promotes a more precise bounding box regression and enhances the model’s ability to detect artifacts of varying sizes and shapes. This improvement augments the depth and accuracy of our research, thereby contributing to a more reliable analysis of endoscopic images, as shown in
Figure 10.
The GIoU formula is:
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C| - |U|}{|C|},$$
where U is the union of the prediction box and the GT box, and C is the minimum closure (smallest enclosing box) of the prediction box and the real box.
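The following sketch computes GIoU and the corresponding loss (1 − GIoU) for axis-aligned boxes in (x1, y1, x2, y2) format, matching the formula above; it is a minimal illustration rather than the exact loss implementation used in training.

```python
def giou_loss(pred, gt):
    """pred, gt: boxes as (x1, y1, x2, y2). Returns (GIoU, 1 - GIoU)."""
    # intersection area
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union area U
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou = inter / union
    # minimum closure (smallest enclosing box) C
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    closure = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (closure - union) / closure
    return giou, 1.0 - giou

print(giou_loss((10, 10, 50, 50), (30, 30, 70, 70)))
```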
3.4. Image Quality Score
Reading poor-quality medical images can significantly reduce the efficiency of diagnostic work. Therefore, based on the artifact detection results discussed in the previous section, we propose a method to assess the quality of endoscopic images and provide an intuitive indication of an image’s usability. The image quality score (QS) is based on the (a) type, (b) area, (c) location, and (d) confidence of the detected artifacts. We improved upon the method of Ali et al. [
3], where weights are assigned to each category, and the average weight is computed as the quality score.
Class weight: artifact (0.50), specularity (0.10), saturation (0.10), blur (0.20), contrast (0.10), bubbles (0.02), instrument (0.02), and blood (0.02).
Area weight: the percentage of the total image area occupied by all detected artifact and normal areas.
Location weight: center (0.5), left (0.25), right (0.25), top (0.25), bottom (0.25), top left (0.125), top right (0.125), bottom left (0.125), and bottom right (0.125).
Confidence weight: the confidence level of the detection results.
To align with the numerical range of subjective ratings by experts and present the image assessment results in a more intuitive manner, this study mapped the QS values to a range of 0–5. The mapped results are presented in
Figure 11.
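As a hedged sketch of how such a score could be computed from detection outputs, the code below combines the listed class, area, location, and confidence weights per detection, averages them, and maps the result to the 0–5 range. The multiplicative aggregation shown is an illustrative assumption; the exact combination used in this study may differ.

```python
# Hedged sketch of the quality score (QS) computation. The aggregation below
# (penalty = class_w * area_w * loc_w * conf per detection, averaged and mapped
# to 0-5) is an illustrative assumption, not the paper's exact formula.
CLASS_W = {'artifact': 0.50, 'specularity': 0.10, 'saturation': 0.10,
           'blur': 0.20, 'contrast': 0.10, 'bubbles': 0.02,
           'instrument': 0.02, 'blood': 0.02}
LOC_W = {'center': 0.5, 'left': 0.25, 'right': 0.25, 'top': 0.25,
         'bottom': 0.25, 'top left': 0.125, 'top right': 0.125,
         'bottom left': 0.125, 'bottom right': 0.125}

def quality_score(detections, image_area):
    """detections: list of dicts with keys 'cls', 'box' (x1, y1, x2, y2),
    'loc' (region name), and 'conf' (detector confidence)."""
    if not detections:
        return 5.0                                        # no artifacts detected
    penalties = []
    for d in detections:
        x1, y1, x2, y2 = d['box']
        area_w = (x2 - x1) * (y2 - y1) / image_area       # fraction of image covered
        penalties.append(CLASS_W[d['cls']] * area_w * LOC_W[d['loc']] * d['conf'])
    qs = 1.0 - sum(penalties) / len(penalties)            # average penalty
    return round(5.0 * max(0.0, min(1.0, qs)), 2)         # map to the 0-5 range

dets = [{'cls': 'blur', 'box': (100, 100, 300, 300), 'loc': 'center', 'conf': 0.9},
        {'cls': 'bubbles', 'box': (0, 0, 50, 50), 'loc': 'top left', 'conf': 0.7}]
print(quality_score(dets, image_area=512 * 512))
```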
4. Results and Discussion
4.1. Evaluation of Object Detection Results
The experimental setup in this study was as follows: Ubuntu 18.04 operating system, a GeForce RTX 2060 graphics card, and an Intel(R) Core(TM) i9-10940X processor. The improved cascade R-CNN network model proposed in this study was implemented within the MMDetection framework (version 2.25.0). The selected optimizer was stochastic gradient descent with an initial momentum of 0.9, and the total number of iterations was 5800. The training time for the first complete epoch was 453 s and gradually decreased over time; the total training time for 50 epochs was 18,537 s (5.15 h).
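For reference, in MMDetection 2.x this optimizer is typically declared in the configuration file as shown below; the momentum value follows the text above, whereas the learning rate and weight decay are placeholder assumptions rather than the values used in this study.

```python
# MMDetection 2.x-style optimizer setting; momentum is taken from the text,
# while lr and weight_decay are placeholder assumptions.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
```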
This study employed commonly accepted evaluation metrics for object detection methods. Using a specific IoU threshold, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were defined as the classification outcomes. Furthermore, “all detections” refers to the total number of predicted bounding boxes, while “all ground truths” represents the total number of GT annotations.
Average precision (AP) is a commonly used metric in object detection that evaluates model performance by calculating the area under the precision–recall (PR) curve, reflecting the precision at each unique recall level. Following PASCAL VOC 2010 [23], AP is defined as:
$$\mathrm{AP} = \sum_{n}\left(r_{n+1} - r_{n}\right)\, p_{\mathrm{interp}}\left(r_{n+1}\right), \qquad p_{\mathrm{interp}}\left(r_{n+1}\right) = \max_{\tilde{r} \ge r_{n+1}} p(\tilde{r}),$$
where $r_n$ denotes the $n$-th unique recall level and $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$.
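For clarity, the all-point interpolated AP can be computed from a precision–recall curve as in the minimal NumPy sketch below (an illustration of the VOC 2010-style definition, not the evaluation code used in this study).

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP (VOC 2010 style): make precision monotonically
    non-increasing, then sum precision * recall-step over unique recall levels."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):        # interpolate precision from the right
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(average_precision(np.array([0.2, 0.4, 0.4, 1.0]),
                        np.array([1.0, 0.8, 0.6, 0.5])))  # 0.66
```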
Through comparative experiments, we compared the performance of the proposed method with that of other networks in our dataset. From the AP values shown in
Table 4 and
Table 5, it can be seen that the proposed method achieved the best performance among all the typical networks compared.
Through the implementation of ablative experiments, our primary objective was to investigate the performance characteristics of the network components. While keeping the other experimental conditions constant, we conducted separate experiments on the improved FPN and GIoU components using the proposed method. Based on the AP in
Table 6, we observed that these components each yielded improvements over the original model, thereby demonstrating their assisting role in learning artifact features.
The confusion matrix is an N × N matrix used to evaluate the performance of a classification model, where N represents the number of target categories. The matrix compares the actual and predicted class labels and assesses the predictive performance of the model. By analyzing the entire dataset, one can determine the number of correctly and incorrectly predicted samples, thus measuring the model’s predictive effectiveness. The confusion matrix, as presented in
Figure 12, illustrates the performance for each class, with all classes except “blood” exceeding 50%. This finding substantiates the existence of significant differences in the learned features among the various classes. The “blood” class displays a slightly lower performance because blood within the gastrointestinal tract presents diverse morphological characteristics and lacks distinct typical features.
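As an illustration, a per-class confusion matrix and row-normalized per-class accuracy can be computed with scikit-learn as sketched below; the labels shown are hypothetical examples, not the study's data.

```python
from sklearn.metrics import confusion_matrix
import numpy as np

# Hypothetical ground-truth and predicted class labels for illustration only.
y_true = ['blur', 'blood', 'bubbles', 'blur', 'specularity', 'blood']
y_pred = ['blur', 'bubbles', 'bubbles', 'blur', 'specularity', 'blood']
classes = ['blood', 'blur', 'bubbles', 'specularity']

cm = confusion_matrix(y_true, y_pred, labels=classes)   # N x N count matrix
per_class = cm.diagonal() / cm.sum(axis=1)              # fraction correct per class
print(cm)
print(dict(zip(classes, np.round(per_class, 2))))
```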
The test results using the proposed method are presented in
Table 7, using images from WLI, NBI, and iodine-staining modalities as examples. The table shows that the accurate detection of most defects was achieved in these three types of images. The detected bounding boxes effectively encompassed the abnormal regions without redundancy.
Figure 13 depicts the region-based statistical distributions of the GT (left) and predicted results (right) in the test dataset, arranged based on the region size. We observed that the predicted results exhibited a similar trend to those of the GT, indicating stable and reliable detection. This observation was further supported by specific evaluation metrics. In addition, the figure shows magnified views of the detection results for small targets, demonstrating the network’s ability to accurately capture defects with lengths smaller than 30 pixels.
4.2. Evaluation of the Quality Score
This study used a set of 100 randomly selected images with different quality ratings to evaluate the proposed quality assessment scheme against the quality rankings provided by three experts. Endoscopic images differ from natural images: during image acquisition, clinicians capture the images of interest at their discretion, and no distortion-free reference image exists. Therefore, no-reference image quality assessment methods were used for comparison. Four no-reference image quality assessment methods, NIQE [
14], BRISQUE [
6], MUSIQ [
8], and NIMA [
9], were compared with the method proposed in this paper.
Because the numerical values of the image evaluation scores represent only the quality levels of the images and lack practical significance, we employed the Spearman correlation coefficient to assess the monotonic relationship between the two variables. A larger value indicated a stronger correlation. Correlations were calculated using the following formula:
$$\rho = \frac{\sum_{i=1}^{N}\left(R_{x_i}-\bar{R}_x\right)\left(R_{y_i}-\bar{R}_y\right)}{\sqrt{\sum_{i=1}^{N}\left(R_{x_i}-\bar{R}_x\right)^2}\sqrt{\sum_{i=1}^{N}\left(R_{y_i}-\bar{R}_y\right)^2}} \qquad (9)$$
In Equation (9), N is the total number of observations, $R_{x_i}$ and $R_{y_i}$ are the grades (ranks) of the observed value i, and $\bar{R}_x$ and $\bar{R}_y$ are the average grades of the variables x and y, respectively.
A p-value test was used to determine the probability of the observed outcome occurring randomly.
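In practice, the Spearman correlation coefficient and the associated p-value can be obtained with scipy.stats.spearmanr, as in the sketch below; the scores shown are made-up illustrative values, not the study's measurements.

```python
from scipy.stats import spearmanr

# Made-up example: expert scores vs. the proposed QS for five images.
expert_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
qs_scores = [4.2, 2.4, 3.1, 3.8, 1.9]

rho, p_value = spearmanr(expert_scores, qs_scores)   # rank correlation and p-value
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")
```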
Table 8 shows that the proposed quality score (QS) method has an average Spearman correlation coefficient of 60.71% with the rankings provided by the three experts. This relatively high correlation coefficient indicates a significant relationship between the scores produced by the proposed method and the quality rankings suggested by the experts.
Figure 14 presents four example images displaying the scores given by the three experts, the average score, and QS computed using the proposed method. The QS scores of the images of different qualities are close to the expert scores.
Figure 15 illustrates the correlation between the scores determined by the three experts and the proposed method.
4.3. Discussion
The evaluation scores of the proposed quality assessment method for endoscopic images had a strong correlation with the expert ratings. This correlation was based on the category, confidence, area, and position information extracted from the artifact detection results, indicating that the quality scores have practical significance. The poor performance of existing no-reference image quality assessment methods, such as NIQE, BRISQUE, MUSIQ, and NIMA, in evaluating digestive endoscopic images may stem from these methods being designed primarily for natural images. This highlights the need for approaches tailored to the unique characteristics and imaging environment of digestive endoscopy. Owing to the tight integration of the proposed image quality assessment method with the object detection results, it achieves stronger interpretability than the four methods mentioned above. In clinical applications, clinicians can rely on the quality scores of digestive endoscopic images as a primary quality control measure, thereby reducing the time required for the manual inspection of patient images.
The proposed object detection method is more sensitive to contrast, blur, instruments, and specularity, owing to the availability of sufficient data and features. However, the detection of blood objects showed a slightly lower performance owing to the limited training data and diverse morphologies of blood presentations. Future studies should focus on improving the detection of blood-related issues. Gastrointestinal endoscopy images encompass various modalities, and the proposed object detection method was validated on WLI, NBI, and iodine-stained modalities based on the imaging modalities offered by Olympus and Fujifilm instruments. However, because of the lack of data for linked color imaging and blue laser imaging modalities, the detection performance of these modalities requires further verification.
Based on the research presented in this paper, future work should focus on image restoration for images containing artifacts, building upon the foundation of object detection, which has clinical significance. In future work, we will also consider conducting comparative experiments with other versions of the YOLO series models. To enhance the robustness and generalizability of the conclusions, we propose replicating the study in diverse geographical regions around the world. Such multilocation replication is necessary to validate the applicability and reproducibility of the methods and results across different contexts and populations. Validating this research in other parts of the world will significantly enhance the credibility of the research and provide solid support for clinical practice and applications.