We compared the MugenNet model with several existing CNN-based models in terms of abilities such as learning, generalization, and qualitative segmentation.
4.3. Performance Indicators
To evaluate the proposed MugenNet for the segmentation of colon polyp images, we used the performance indicators of mIoU, mDice, the IoU-based loss, MAE, the weighted F-measure, the S-measure, and the E-measure. The mathematical expression for each performance indicator, along with its significance in the image segmentation task, is introduced below.
IoU stands for Intersection over Union, which represents the ratio of the intersection to the union between the predicted bounding box and the ground truth. The mean IoU (mIoU) over $N$ images can be calculated by
$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{|A_i \cap B_i|}{|A_i \cup B_i|},$$
where $A_i$ is the predicted region and $B_i$ is the ground truth of the $i$-th image. It is noted that during the training of a segmentation model, the loss value is used to evaluate whether the prediction of the model is good or valid. If the IoU exceeds a preset threshold, the prediction is considered valid. In addition, we calculated the IoU-based loss relative to the ground truth for each image in a training batch and took the average as the loss for that batch.
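As a rough sketch (not the authors' released code), the mIoU and the IoU-based batch loss described above could be computed from binary masks as follows; the function names and the 0.5 binarization threshold are assumptions for this illustration.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """IoU between a binarized prediction mask and a ground-truth mask."""
    pred = pred > 0.5          # assumed binarization threshold
    gt = gt > 0.5
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))

def miou(preds, gts) -> float:
    """Mean IoU over a batch of prediction/ground-truth mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def iou_loss(preds, gts) -> float:
    """IoU-based loss averaged over a training batch (1 - mean IoU)."""
    return 1.0 - miou(preds, gts)
```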
The Dice coefficient is a measure of the similarity between two sets (for example, A and B) and is widely used in medical image segmentation. The mean Dice coefficient (mDice) over $N$ images is calculated as follows:
$$\mathrm{mDice} = \frac{1}{N}\sum_{i=1}^{N}\frac{2\,|A_i \cap B_i|}{|A_i| + |B_i|}.$$
The range of the Dice coefficient is [0, 1]. The closer the coefficient is to 1, the higher the similarity between sets A and B. We calculated the Dice coefficient between the model's predicted results and the ground truth to evaluate the reliability of the model in colon polyp image segmentation.
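The mDice computation can be sketched in the same way (again only an illustration, with an assumed 0.5 threshold); the identity Dice = 2·IoU/(1 + IoU) noted in the comment holds for any pair of binary masks.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between a binarized prediction and the ground truth."""
    pred = pred > 0.5
    gt = gt > 0.5
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

def mdice(preds, gts) -> float:
    """Mean Dice over a batch; note Dice = 2*IoU / (1 + IoU) for the same masks."""
    return float(np.mean([dice(p, g) for p, g in zip(preds, gts)]))
```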
MAE stands for Mean Absolute Error, which measures the error between the predicted values and the ground truth. For a prediction map $P$ and a ground-truth map $G$ of size $W \times H$, MAE is calculated by
$$\mathrm{MAE} = \frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|P(x,y) - G(x,y)\right|.$$
An advantage of MAE is that it is relatively insensitive to outliers. This study took it as one of the criteria to evaluate the reliability of the prediction results.
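A minimal sketch of the MAE computation for a predicted probability map and a ground-truth mask (the array names are assumed):

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a predicted probability map and the
    ground-truth mask, averaged over all W x H pixels."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    return float(np.abs(pred - gt).mean())
```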
The parameters for the precision and recall metrics are TP (true positive), TN (true negative), FP (false positive), and FN (false negative). Each pixel in the prediction map was compared with the ground truth and classified as either (i) correct (TP or TN) or (ii) incorrect (FP or FN). The four counts were accumulated over all pixels, and the precision and recall were then calculated by
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
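The pixel-level counting and the two ratios above can be sketched as follows (not the released code; the 0.5 binarization threshold is an assumption):

```python
import numpy as np

def precision_recall(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Pixel-level precision and recall from the four confusion-matrix counts."""
    pred = pred > 0.5          # assumed binarization threshold
    gt = gt > 0.5
    tp = np.logical_and(pred, gt).sum()      # polyp pixels predicted as polyp
    tn = np.logical_and(~pred, ~gt).sum()    # background predicted as background
    fp = np.logical_and(pred, ~gt).sum()     # background predicted as polyp
    fn = np.logical_and(~pred, gt).sum()     # polyp predicted as background
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return float(precision), float(recall)
```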
We also used the weighted F-measure [47], the S-measure [48], and the E-measure [49] to evaluate the model performance, similar to [43]. These three metrics jointly assess the precision and recall of the model during testing, the structural similarity of the segmentation, and the pixel-level agreement with the ground truth. The formulas for these three metrics are as follows.
The weighted F-measure can be calculated by
$$F_\beta^{\omega} = \frac{(1+\beta^{2})\cdot \mathrm{Precision}^{\omega}\cdot \mathrm{Recall}^{\omega}}{\beta^{2}\cdot \mathrm{Precision}^{\omega} + \mathrm{Recall}^{\omega}},$$
where $\omega$ denotes the pixel weighting and $\beta$ is an adjustable coefficient that balances precision and recall; in this study, $\beta$ was set to a fixed value. It is worth noting that the weighted F-measure is a robust metric because it incorporates both precision and recall. The S-measure is calculated by
$$S = \alpha \cdot S_{\mathrm{FG}} + (1-\alpha)\cdot S_{\mathrm{BG}},$$
where FG stands for foreground, BG stands for background, and $\alpha$ is a coefficient that balances the two terms. $S_{\mathrm{FG}}$ represents the similarity of the foreground region between the predicted result and the ground truth, and $S_{\mathrm{BG}}$ represents the similarity of the background region between the predicted result and the ground truth. The S-measure is used to assess the structural similarity of the targets in the region.
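As an illustrative sketch only, the two combination formulas above can be written as follows; the weighted precision/recall and the foreground/background similarity terms are placeholders whose full definitions are given in [47] and [48], and the default argument values are assumptions, not the settings used in this study.

```python
def f_beta_weighted(precision_w: float, recall_w: float, beta2: float = 1.0,
                    eps: float = 1e-8) -> float:
    """Combine weighted precision and weighted recall into the weighted
    F-measure; beta2 (beta squared) is the adjustable trade-off coefficient.
    Computing precision_w and recall_w themselves requires the pixel-importance
    weighting defined in [47], which is omitted here."""
    return (1.0 + beta2) * precision_w * recall_w / (beta2 * precision_w + recall_w + eps)

def s_measure(s_fg: float, s_bg: float, alpha: float = 0.5) -> float:
    """Combine foreground and background structural-similarity terms into the
    S-measure; s_fg and s_bg stand in for the region scores defined in [48],
    and alpha = 0.5 is only an assumed balance value for this sketch."""
    return alpha * s_fg + (1.0 - alpha) * s_bg
```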
The E-measure is calculated from the bias matrix of a binary map,
$$\varphi_I = I - \mu_I \cdot A,$$
where $I$ is the input binary foreground map, $\mu_I$ is the global mean of $I$, and $A$ is an all-ones matrix of the same size as the matrix $I$. The bias matrices of the prediction and the ground truth are then correlated and enhanced to obtain an alignment score at each pixel [49]. The E-measure was used to evaluate the performance of the model at the pixel level, and we report its mean value.
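A sketch of this computation, following the standard enhanced-alignment definition in [49] (the degenerate all-foreground and all-background special cases handled in [49] are omitted here):

```python
import numpy as np

def e_measure(pred_bin: np.ndarray, gt_bin: np.ndarray, eps: float = 1e-8) -> float:
    """Enhanced-alignment (E-measure) sketch: build the bias matrices
    phi = I - mean(I) * A (A is the all-ones matrix), correlate them, and
    average the quadratically enhanced alignment over all pixels."""
    fm = pred_bin.astype(np.float64)
    gt = gt_bin.astype(np.float64)
    ones = np.ones_like(gt)                 # the matrix A in the text
    phi_fm = fm - fm.mean() * ones          # bias matrix of the prediction
    phi_gt = gt - gt.mean() * ones          # bias matrix of the ground truth
    align = 2.0 * phi_gt * phi_fm / (phi_gt ** 2 + phi_fm ** 2 + eps)
    enhanced = (align + 1.0) ** 2 / 4.0     # quadratic enhancement
    h, w = gt.shape
    return float(enhanced.sum() / (h * w - 1 + eps))
```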
4.4. Comparison of the Performance
The results of the comparison are shown in
Table 1, where “Null” means that the corresponding result is not provided in the literature. The best performance in each column of the table is highlighted in bold. The data for the other existing models were obtained from the corresponding literature, and their implementations were run using the published code.
From the results on the five different datasets (
Table 1,
Table 2,
Table 3,
Table 4 and
Table 5), it can be found that the proposed MugenNet shows excellent performance on the CVC-ColonDB and ETIS polyp segmentation datasets (
Table 2 and
Table 3), showing the best overall performance across the three metrics. It is worth mentioning that our training datasets are sourced from Kvasir and CVC-ClinicDB (
Table 4 and
Table 5), indicating that our model has good generalization ability and is suitable for predicting segmentation results on unseen datasets. Although the results on the other three datasets indicate that our model does not achieve the best performance, it falls short of PraNet, the current state-of-the-art (SOTA) model, by only 1.6%. We compared the inference time of the different models in
Table 6. The results show that, with the same training dataset, our model reduces the inference time by 12% compared with PraNet. At the same time, our model significantly outperforms traditional biomedical image semantic segmentation models (such as U-Net); in particular, on the ETIS dataset, our model's performance improvement reaches as high as 13.7% compared with the PraNet model.
The poor performance of models such as SFA and U-Net on the ETIS dataset suggests that they have weak generalization ability on unseen datasets. In contrast, our model outperforms these traditional biomedical image segmentation models.
Figure 3 displays the polyp segmentation results of MugenNet on the Kvasir dataset, comparing our model with U-Net, U-Net++, SFA, and PraNet, where GT represents the ground truth. Our model can accurately locate polyps of different sizes and textures and then generate semantic segmentation maps. It is worth noting that the results of the other models are taken from [
43].
In addition to the comparison with the four existing models (U-Net, U-Net++, SFA, and PraNet) in
Figure 3 on the Kvasir dataset, we also tested our model on the other four datasets (CVC-300, CVC-ClinicDB, CVC-ColonDB, and ETIS-LaribPolypDB). The results are shown in
Figure 4. From
Figure 4, it can be seen that our model (MugenNet) can accurately locate the position and size of polyps across the five different datasets. The results show that our model possesses stable and good generalization ability.
We also trained our model (MugenNet) on the CVC-300, CVC-ClinicDB, CVC-ColonDB, and ETIS-LaribPolypDB datasets, with the results shown in
Figure 5. The examples of the datasets can be found in
Figure 5 in the original image and ground truth columns. It can be seen from
Figure 5 that our model performs well when trained on the other four polyp datasets. This shows that our model has good robustness (because its performance is insensitive to the training datasets) and can quickly adapt to multiple different datasets.
Using the pre-trained weights of the DeiT distillation neural network, as shown in
Table 6, our model converges in fewer than 30 epochs, and the training time is only about 15 min when the batch size is set to 16. In Table 6, LR represents the learning rate and FPS represents frames per second. Our model's real-time running speed is approximately 56 frames per second (FPS), which means that our model can be used to process video streaming data during colonoscopy examinations and perform real-time polyp image segmentation.
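As a hedged illustration of how such an FPS figure can be obtained (not the authors' benchmarking script; the 352 x 352 input size, warm-up count, and frame count are assumptions):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, n_frames: int = 100,
                input_size=(1, 3, 352, 352)) -> float:
    """Rough FPS estimate: time n_frames forward passes on dummy frames."""
    device = next(model.parameters()).device
    dummy = torch.randn(*input_size, device=device)
    model.eval()
    for _ in range(10):                      # warm-up iterations
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()             # wait for queued GPU work
    start = time.time()
    for _ in range(n_frames):
        model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return n_frames / (time.time() - start)
```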
We tested the frames per second (FPS) of ten models on the training dataset to evaluate their inference speed, as shown in
Figure 6. Since the FPS values of U-Net and U-Net++ are both below 10 (as shown in
Table 6), their inference speed is too slow; hence, they are not shown in
Figure 6. It can be observed from
Figure 6 that the average FPS of our model reaches 56. It is noted that when the FPS is greater than 30, a model is considered suitable for processing video streaming data [
43]. It can be seen from
Figure 6 and
Table 6 that our model (MugenNet) has advantages in processing video stream data for the colonic polyp image segmentation.
The comparative results indicate that our model performs well on three unseen datasets, far surpassing traditional CNN-based biomedical image segmentation models such as U-Net, U-Net++, SFA and ResUNet++, and even slightly outperforming PraNet on certain datasets.
It is worth mentioning that the models we compared against are all based on convolutional neural networks, while our model (MugenNet) combines a Transformer and a CNN for colon polyp image segmentation. The experimental results show that the incorporation of the Transformer can effectively improve the performance of the neural network while achieving a faster processing speed. In essence, our model (MugenNet) possesses both the global learning capability of the Transformer for capturing overall information and the local learning capability of the CNN for focusing on specific regions. Therefore, our model achieves excellent performance in colon polyp image segmentation. The results of the statistical test on the eight models are shown in
Figure 7.
4.5. Ablation Studies
In this section, five sets of ablation experiments were conducted to evaluate the effectiveness of each component of MugenNet on two different datasets. The results are shown in
Table 7, where TB represents the Transformer branch, CB represents the CNN branch, and MM represents the Mugen module.
We used three indicators to evaluate the models, i.e., mDice, mIoU, and the weighted F-measure. Among them, mDice and mIoU were used to evaluate the accuracy of the models, while the weighted F-measure was used to evaluate the precision and recall of the models.
We removed the Transformer branch, the CNN branch, and the Mugen module from MugenNet, respectively, to examine their contributions. The results show that the full model (MugenNet) significantly improves the accuracy of semantic segmentation. In particular, compared with the variant without the Transformer branch, our model performs better on the CVC-ColonDB dataset, with the two accuracy metrics increasing from 0.591 to 0.678 and from 0.667 to 0.758, respectively.
The results in
Table 7 show that our model can segment colonic polyp images accurately. Our model outperforms the other two models in the ablation study, in which either the CNN branch or the Transformer branch was omitted. Compared with the backbone neural network, the performance of our model improved by about 43.34% on the test dataset (CVC-ColonDB). These results demonstrate the superiority of our model (MugenNet).