1. Introduction
Among all types of cancer, colorectal cancer is the third most commonly diagnosed [1] and has the second highest cancer fatality rate worldwide [2]. It is the most common cancer in both men and women in Malaysia, as reported in [3]. Colorectal cancer is a malignant tumour of the colon or rectum. In the early stages, the tumour (polyp) is an abnormal tissue growth on the intestinal wall. Polyps are initially benign, but can develop into malignant tumours over time [4]. Diagnosing a polyp at an early stage is vital in clinical routines to prevent it from mutating into cancerous cells, so regular screening is required to search for polyps. Several methods can be used to detect colorectal cancer, such as high-sensitivity fecal occult blood tests (FOBT), stool deoxyribonucleic acid (DNA) tests, optical colonoscopy, and computed tomographic colonography [5]. Optical colonoscopy is the preferred tool for diagnosing polyps [6]. During the colonoscopy procedure, a flexible endoscope with a tiny video camera at its tip is gradually advanced under direct vision through the anus into the proximal part of the colon. The video camera generates thousands of images of the internal mucosa of the colon for real-time manual analysis by the doctor. When polyps are found, they are resected for histological analysis.
Although optical colonoscopy is preferable, a single colon screening procedure still misses approximately 10–30% of polyps [7], primarily due to the complex structure of the colon and human error [8]. To extract information on polyps from images of the colon and rectum, a doctor usually screens for hours to seek, observe, and diagnose. This overwhelming workload not only causes physical and mental fatigue, but can also lead to false diagnoses even by well-trained and experienced clinicians [9]. In addition, diagnostic outcomes show large inter-observer and intra-observer variation [10]. These errors correlate with the clinician's level of expertise and with the disparity in the appearance and nature of the polyps themselves. Polyps vary widely in shape, size, colour, and texture, and the distraction caused by convoluted folds and bends inside the colon further complicates the diagnostic task [11].
An effective computer-aided diagnosis (CAD) system should improve diagnostic performance, save time, and integrate seamlessly into the clinical workflow. With recent advancements, CAD can improve the accuracy and consistency of radiological diagnosis and ease the work of doctors by decreasing their workload and speeding up medical diagnosis. In a real-time colon screening process, only a small number of video frames contain polyps, while most are uninformative; a CAD system can ease the interpretation of this invaluable source of information. CAD systems can generally be divided into traditional handcrafted methods and modern deep learning methods. A handcrafted method extracts the intrinsic features of a polyp as determined by the researcher, whereas a deep learning method enables a computer model to learn features automatically from raw data and use them in decision making. Given the recent achievements of convolutional neural network (CNN) architectures (a type of deep learning model) in image processing, CNNs have become the state-of-the-art methodology for detection, classification, and segmentation tasks [12]. Many ensemble methods yield excellent results by combining several CNN models; however, the computational power and time required for training and testing must be considered for practical applications. To use deep learning methods in real-time applications such as polyp segmentation by embedding models in specific devices, e.g., a medical capsule robot, attention must be paid to developing an efficient overall model during implementation. An efficient model should have relatively low requirements for hardware, computational power, and training time. CNNs such as DeepLab_v3 are computationally intensive, as they involve numerous learnable parameters [13]. From 2014 onwards, ImageNet benchmark performance reached a bottleneck, where slight gains in network performance came only from deeper and more sophisticated architectures [14].
The goal of our study is to implement a less complex network with a minimal trade-off in segmentation performance, i.e., accuracy, sensitivity, and specificity. With existing computer hardware technologies, embedding a simple model (e.g., AlexNet and Visual Geometry Group (VGG)) or a high-accuracy, high-sensitivity model (e.g., an ensemble) into a real-time device for segmentation tasks is nearly impossible, because the fully connected layers contribute the largest number of trainable parameters to the network architecture. We therefore used SegNet VGG-19 [15] as the framework for our proposed automatic polyp segmentation model, as it is well known for its smaller number of network parameters and shorter training times. To reduce the network training time, SegNet uses an efficient method of retaining pixel information from the downsampling path for reuse in the upsampling path: the max-pooling indices recorded in the encoder drive non-parametric upsampling in the decoder.
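To make this mechanism concrete, here is a minimal PyTorch sketch (our own illustration; the paper's implementation is in Matlab) of how max-pooling indices recorded in the encoder drive non-parametric unpooling in the decoder:

```python
import torch
import torch.nn as nn

# Encoder-side pooling: record the argmax location of every pooling window.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Decoder-side unpooling: scatter values back to the recorded locations,
# producing a sparse map that subsequent convolutions densify.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 224, 224)      # a batch of encoder feature maps (toy sizes)
pooled, indices = pool(x)             # (1, 64, 112, 112) plus the saved indices
upsampled = unpool(pooled, indices)   # (1, 64, 224, 224), no learned parameters
print(pooled.shape, upsampled.shape)
```

Because the upsampling itself is non-parametric, only the convolutions that follow it need training, which is one reason for SegNet's comparatively small parameter count.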
Although SegNet has been popular due to its simple and efficient architecture, its main drawback is its lower performance compared with more sophisticated and deeper networks. SegNet has a more linear architecture than networks such as U-Net and fully convolutional networks (FCNs), and the resulting loss of pixel information limits its segmentation performance. Hence, in this study, we modify the SegNet architecture to enhance segmentation performance in terms of accuracy, sensitivity, and specificity.
In summary, the main contributions of this paper are as follows:
We propose a novel CNN based on the SegNet VGG-19 encoder–decoder architecture for automatic polyp segmentation. The goal of this work is to improve the segmentation performance (e.g., accuracy and sensitivity) of the original SegNet model and to reduce the model's complexity without losing too much performance compared with other existing methods. We introduce three modifications to SegNet VGG-19: skip connections, 5 × 5 convolutional filters, and a concatenation of four dilated convolutions applied in parallel (see the sketch following this list).
We train and evaluate our proposed model on a combination of three publicly available datasets, i.e., CVC-ClinicDB, CVC-ColonDB, and ETIS-Larib Polyp DB. Our proposed model outperforms the original SegNet VGG-19.
Our proposed model performs as well as or better than previous schemes in the literature. Our segmentation performance achieved accuracy, sensitivity, specificity, precision, mean intersection over union, and dice coefficient of 96.06%, 94.55%, 97.56%, 97.48%, 92.3%, and 95.99%, respectively. Although our proposed model slightly underperforms one of the methods in the literature, our network is less complex, with almost half as many trainable parameters as that method.
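As an illustration of the third modification, the sketch below shows one way to concatenate four dilated convolutions applied in parallel; the dilation rates (1, 2, 4, 8) and channel widths here are assumptions for illustration, not the exact configuration used in our network.

```python
import torch
import torch.nn as nn

class ParallelDilatedBlock(nn.Module):
    """Four 3 x 3 convolutions with increasing dilation rates, applied in
    parallel and concatenated along the channel axis (illustrative sketch;
    the rates and widths in the actual model may differ)."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3,
                      padding=d, dilation=d)   # padding=d keeps H x W unchanged
            for d in (1, 2, 4, 8)
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees a different receptive field; concatenation fuses
        # the multi-scale context into one feature map.
        return torch.cat([self.relu(b(x)) for b in self.branches], dim=1)

block = ParallelDilatedBlock(in_ch=512, branch_ch=128)
y = block(torch.randn(1, 512, 14, 14))   # -> (1, 512, 14, 14) after concatenation
```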
The rest of the paper is organised as follows. Section 2 discusses previous work related to polyp segmentation. Section 3 reviews the SegNet model used in this research, explains its limitations, and describes the modifications made to the existing SegNet to tackle them. In Section 4, our segmentation results and the performance of the modified SegNet are evaluated and analysed in terms of network training time and evaluation metrics such as accuracy, sensitivity, and specificity. Finally, Section 5 summarises and concludes the paper.
2. Related Works
The conventional methods proposed in earlier studies often extracted intrinsic polyp features determined by the researcher, typically texture information, colour, edges, and contours. The authors in [16] used principal component pursuit (PCP) to decompose images into low-rank and sparse components for background subtraction; in the last stage, the image was segmented via an active contour applied to the background-subtracted image. Another active contour segmentation method was proposed in [17], in which the active contour framework was modified to work without edges. The authors claimed that this remodelled active contour framework had the advantage that contours could merge and split on their own, whereas the older snake model can merely extend and straighten, with little versatility. Polyp segmentation based on contour region analysis was proposed in [18]. The authors extracted features using the curvature characteristics and colours of the polyps, the shapes of their edges, and their areas. For edges identified as belonging to polyps, the start and end points were detected and linked to form the polyp area. Traditional methods have had limited success, owing to the need for researcher expertise and the numerous factors affecting the environment of the colon; in particular, it is difficult to differentiate polyps from non-polyp backgrounds with very similar characteristics.
Deep learning techniques have been widely adopted in the image analysis domain and have outperformed traditional methods. The features of polyps can be extracted from training datasets by fitting a non-linear neural network with non-convex objectives, and the selected features can be further refined by the deep architecture of the network for classification at a later stage. The significant achievements of CNNs have made them prevalent in medical image analysis. The authors of [19] developed and compared FCNs based on three different network architectures (FCN-AlexNet, FCN-GoogLeNet, and FCN-VGG). These CNNs were converted into FCNs by replacing the fully connected and classification scoring layers with 1 × 1 convolution, binary scoring, deconvolution, and dense prediction layers. In [20], traditional handcrafted methods were combined with deep learning approaches for polyp segmentation: FCNs were used to extract polyp features and generate region proposals, which were then refined using texton patch representation to separate polyp and non-polyp regions before classification with a random forest algorithm. In [21], the identification of polyp features and the spatial resolution recovery of polyp images were realised using FCN-8, a method that upsamples the feature maps generated from the last two pooling layers and the last convolution layer. The upsampled feature maps containing polyp regions were then post-processed with Otsu thresholding, and the largest connected component was used for polyp segmentation. The authors of [22] proposed an FCN for polyp segmentation inspired by U-Net and FCN. Convolution, Rectified Linear Unit (ReLU), and pooling layers were built for feature extraction, and a fully connected convolutional layer was added at the end of the feature extraction stage to increase the comprehensive connectivity between the pixels of the image. In the prediction map reconstruction phase, the feature maps were upsampled to the original input image resolution via deconvolution, and a skip connection structure was used to concatenate the low-level fine features from the downsampling path with the high-level coarse features in the upsampling path. The FCN architecture has the drawback that spatial information in the polyp image is lost in the pooling layers. When the global context is lacking, an FCN typically captures and generates prediction maps based on features at a high abstraction level at the centre of the object of interest, resulting in a rough and discontinuous segmentation effect [23].
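As a concrete rendering of the post-processing step described in [21], the following sketch (our own illustration using scikit-image; the probability-map input and image size are assumptions) applies Otsu thresholding and keeps the largest connected component:

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

def postprocess(prob_map: np.ndarray) -> np.ndarray:
    """Binarise a polyp probability map with Otsu's threshold, then keep
    only the largest connected component (illustrative sketch)."""
    binary = prob_map > threshold_otsu(prob_map)
    labels = label(binary)                   # label connected foreground regions
    if labels.max() == 0:                    # nothing segmented
        return np.zeros_like(binary)
    sizes = np.bincount(labels.ravel())[1:]  # component sizes, skipping background
    return labels == (np.argmax(sizes) + 1)

mask = postprocess(np.random.rand(288, 384))   # dummy probability map
```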
In [24], a fully convolutional densely connected convolutional network (DenseNet) approach was proposed for polyp segmentation. The proposed network consisted of an FCN, a skip connection, a traditional dense connection, and a pruned dense connection in dense blocks. However, due to the dense network architecture, which generates more feature layers, this approach required a relatively large amount of memory to store those layers. The authors of [25] used a LinkNet with a new combination of colour spaces as the network input: the red and green components of the red, green, and blue (RGB) colour space were combined with b* from the International Commission on Illumination (CIE) L*a*b* colour space to give a new colour space for the input image. The prediction maps output by LinkNet were post-processed using the convex hull algorithm, and the largest connected component was selected for polyp segmentation. A Mask Regional-CNN (Mask R-CNN) was implemented with a residual neural network (ResNet)-101 backbone for feature extraction in [26]; transfer learning was used to train the Mask R-CNN, which had been pre-trained on the Microsoft Common Objects in Context (MS-COCO) image dataset and colorectal image datasets, for polyp segmentation. The authors of [27] modified two deep neural networks: DeepLab_v3 with long short-term memory (LSTM), and SegNet with LSTM. LSTMs can retain localisation information and prevent it from vanishing in the encoder networks. An LSTM has memory cells and three gates: an input gate, a forget gate, and an output gate. The input gate determines which new information should be written into the memory cells, the forget gate determines which memory cells should be overwritten, and the output gate determines which information should be released. The methods proposed in [27] did not reach the expected level of performance: SegNet + LSTM obtained a mean intersection over union (IoU) of 77.28%, worse than DeepLab_v3 + LSTM at 79.30%, and even the latter could not connect the interrelationships between the features of ResNet, atrous convolution, and atrous spatial pyramid pooling. The authors of [28] proposed two methods, i.e., Enhanced FCN-8 (EFCN-8) and Enhanced SegNet (ESegNet). ESegNet obtained a mean IoU (mIoU) of 72.8% and an accuracy of 93.5%, while EFCN-8 obtained 76.7% and 94.9%, respectively.
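For reference, the gates described above follow the standard LSTM formulation (the textbook form; the exact variant used in [27] may differ):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```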
The authors of [29] proposed a loss function called the eigenloss, combining seven loss functions: Jaccard, Dice, binary cross entropy, binary focal, Tversky, focal Tversky, and Lovász hinge. These losses were combined linearly, with the coefficients extracted using principal component analysis (PCA). The eigenloss was tested on two networks (U-Net and LinkNet) and two backbones (VGG-16 and Densenet121), and the LinkNet–Densenet121 combination yielded the best result, with accuracy, sensitivity, and specificity of 94.59%, 80.32%, and 99.05%, respectively. However, the sensitivity of this approach is not high enough; sensitivity is important in polyp segmentation, as it represents the ability to correctly identify the polyp region.
In [30], the authors proposed a Mask R-CNN for polyp segmentation, using Res2Net, a modified version of ResNet, as the bottleneck structure. In Res2Net, the 3 × 3 filters in the bottleneck block of the original ResNet are replaced with smaller filter groups connected in a hierarchical residual-like manner. Mask R-CNN is based on Faster R-CNN and a feature pyramid network (FPN), and can be regarded as a two-stage segmentation process. The authors of [31] proposed an ensemble of two Mask R-CNNs, with ResNet50 (25.6 million trainable parameters) and ResNet101 (44.6 million trainable parameters) as backbone architectures; a bitwise combination merged the two networks into an ensemble aimed at improving segmentation performance. The authors of [32] proposed an ensemble of three deep encoder–decoder networks with DeepLabV3X, in which several different convolution strides are used in each component network to achieve finer feature extraction for medical object segmentation tasks. Each single deep encoder–decoder network was trained on input datasets of different resolutions, and the resulting ensemble was able to extract the important features of a polyp. The ensemble method has both advantages and disadvantages, depending on the objectives of the research. Its advantage is that several networks can be combined to distinguish more detailed information, as the component networks complement each other and offset one another's weaknesses. However, models combining several different networks, such as the ResNet50 and ResNet101 Mask R-CNNs in [31], with approximately 70.2 million trainable parameters, and the three deep encoder–decoder networks in [32], with approximately 137.4 million trainable parameters, are extremely complex. The time required for training and image processing with a complex ensemble model is longer than for a single network.
The authors of [33] created an ensemble of three CNNs, the U-Net-VGG, SegNet-VGG, and Pyramid Scene Parsing Network (PSPNet), as a polyp segmentation model. A weight voting method was used to combine the segmentation results from the three networks. The weight distributions among U-Net-VGG, PSPNet, and SegNet-VGG that provided the best results in [33] were 2:2:1 and 3:3:2 on the CVC-300 dataset, 3:1:1 on CVC-ClinicDB, and 3:2:2 on ETIS-LaribPolypDB. The U-Net and SegNet were estimated to have 27.5 million and 29.5 million trainable parameters, respectively [28], whereas the PSPNet with a ResNet-50 feature extractor was estimated to have 44.6 million trainable parameters [4]. The ensemble model in [33] therefore has a total of approximately 101.6 million trainable parameters. The approach in [33] performs well, as the use of three different networks in one model enables it to learn different polyp features and to aggregate multi-scale contextual information from the same database. Its disadvantage is that the three networks, U-Net, SegNet, and PSPNet, must be trained separately, which requires a great deal of effort and computational power.
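To illustrate the weight-voting idea (our own sketch; the exact aggregation rule in [33] may differ), the predicted masks can be fused as a weighted sum thresholded at half the total weight:

```python
import numpy as np

def weighted_vote(masks: list, weights: list) -> np.ndarray:
    """Fuse binary segmentation masks by weighted majority voting
    (illustrative sketch of the ensemble idea in [33])."""
    score = sum(w * m.astype(float) for w, m in zip(weights, masks))
    return score >= 0.5 * sum(weights)   # a pixel is polyp if weighted votes reach half

rng = np.random.default_rng(0)
m1, m2, m3 = (rng.random((288, 384)) > 0.5 for _ in range(3))   # dummy masks
# e.g., U-Net-VGG : PSPNet : SegNet-VGG weighted 3:1:1, as reported for CVC-ClinicDB
fused = weighted_vote([m1, m2, m3], [3, 1, 1])
```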
4. Results and Discussion
4.1. Network Training Protocol
We propose a model based on a modified SegNet, with VGG-19 as the framework architecture. Our model was trained in a supervised manner, using ground truths manually annotated and verified by expert clinicians as the reference. Of 1187 images, 831 were randomly selected for training, i.e., 437 (CVC-ClinicDB), 260 (CVC-ColonDB), and 134 (ETIS-Larib Polyp DB); the remaining 356 were used for testing, i.e., 175 (CVC-ClinicDB), 119 (CVC-ColonDB), and 62 (ETIS-Larib Polyp DB). There was no overlap between the training and testing sets. We trained our networks using stochastic gradient descent to update the network weights. All of the models were trained with the following parameter settings: momentum 0.9, learning rate 0.001, 100 epochs, and a mini-batch size of four. The learning rate was reduced by a factor of 0.3 after every 10 epochs. Our models were implemented and trained in Matlab with an Nvidia GeForce RTX 2080 Ti GPU.
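The training protocol translates into the following hedged sketch (a PyTorch equivalent of our Matlab setup; the placeholder model and dummy data are assumptions, and we read "reduced by a factor of 0.3" as multiplying the learning rate by 0.3):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model standing in for the modified SegNet VGG-19.
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)

# Dummy data standing in for the 831 training images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8, 224, 224))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=4)  # mini-batch of 4

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.3)
criterion = nn.CrossEntropyLoss()        # pixel-wise polyp / non-polyp loss

for epoch in range(100):                 # 100 epochs, as in the paper
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                     # lr <- 0.3 * lr every 10 epochs
```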
4.2. Network Performance Evaluation
A network was formulated to produce dense pixel-wise polyp segmentations on testing images for performance evaluation. In the task of polyp segmentation, there are two class labels, polyp and non-polyp. By comparing the segmentation results with the reference label map, four measurements can be obtained: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). TP represents polyp pixels that are predicted as polyps, while FP represents non-polyp pixels that are predicted as polyps, TN represents non-polyp pixels that are predicted as non-polyps, and FN represents polyp pixels that are predicted as non-polyps.
We evaluated and quantified the effectiveness of our models in terms of accuracy, sensitivity, specificity, precision, mean IoU, dice coefficient, and F2 score. Accuracy is the proportion of true positives and true negatives over all pixels in the test set, corresponding to the correct prediction of polyp and non-polyp pixels; the higher the accuracy, the better the segmentation method. Sensitivity (also known as recall) is the proportion of ground-truth positives that are returned as positive by the segmentation. Specificity is the proportion of ground-truth negatives that are returned as negative, while precision is the ratio of true positives to all positives predicted by the segmentation method. The mean IoU reflects the similarity between the ground truth and the predicted results. The dice coefficient indicates how well the predicted boundary of each class aligns with the true boundary, and the F2 score measures the accuracy of a test based on precision and recall. The formulae for all the evaluation metrics used here can be found in Appendix D.
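Assuming the standard definitions (which should match Appendix D), all seven metrics can be computed from the four pixel counts as follows:

```python
import numpy as np

def metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise segmentation metrics from boolean masks
    (standard formulae, assumed to match Appendix D)."""
    tp = np.sum(pred & gt)      # polyp pixels predicted as polyp
    fp = np.sum(pred & ~gt)     # background pixels predicted as polyp
    tn = np.sum(~pred & ~gt)    # background pixels predicted as background
    fn = np.sum(~pred & gt)     # polyp pixels predicted as background
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)     # sensitivity
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision":   precision,
        "iou":         tp / (tp + fp + fn),
        "dice":        2 * tp / (2 * tp + fp + fn),
        "f2":          5 * precision * recall / (4 * precision + recall),
    }
```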
4.3. Analysis of Training and Testing Results
In addition to our modified SegNet model (154 layers), we also compared two further modified SegNet models with different numbers of layers (100 and 208). The aim of this comparison was to determine the best-fitting number of network layers and depths, with different image channel dimensions, for our modified SegNet in the task of polyp segmentation. All three modified SegNet models had a similar network structure; the only difference lay in the depth of the encoder and the corresponding depth of the decoder. An overview of all three networks is shown in Figure A2 in Appendix B.
The 100-layer network had an encoder three blocks deep, with 64, 128, and 256 image channels. It was implemented based on our modified SegNet, except that the fourth and fifth encoder blocks were eliminated, together with their corresponding decoder networks. This elimination was performed to reduce the complexity of the proposed network, since the fourth and fifth encoder blocks of our modified SegNet have 512 image channels, which increases the network's complexity. The 154-layer network had an encoder five blocks deep, with 64, 128, 256, and two 512 image channels. The 208-layer network was an extended version of our modified SegNet, with an encoder seven blocks deep, comprising 64, 128, 256, two 512, and two 1024 image channels. The decoder paths in all three networks were symmetrical with respect to the encoder paths. We compared the segmentation results for the three networks and the time required to train each network. All three networks were fully trained for 100 epochs (20,775 iterations) and stabilised at a training accuracy of around 96–98%. The training accuracy curves for all three networks are shown in Figure 2, and the training times are given in Table 2.
For the same training dataset and the same number of training epochs, the 208-layer network required the longest time for training (475.19 min), followed by the 154-layer network (355.52 min), and finally the 100-layer network (331.16 min). Deeper networks required longer training times, as there were more parameters to be trained. The results for the evaluation metrics for the models with different numbers of layers are given in Table 3.
Based on the results shown in Table 3, the 154-layer network gave the best overall performance. On average, this network obtained an accuracy of 96.06%, a sensitivity of 94.55%, a specificity of 97.56%, a precision of 97.48%, a mean IoU of 92.3%, a dice coefficient of 95.99%, and an F2 score of 95.12%. Although the 100-layer network achieved specificity and precision values that were 0.2–0.3% higher, the 154-layer network outperformed it by a large margin on the other evaluation metrics. The 208-layer network performed worst on every evaluation metric. The training and testing accuracies of the 100-layer network were 98.49% and 93.84%, respectively, a difference of 4.65%. The 154-layer network yielded training and testing accuracies of 98.49% and 96.06%, a difference of 2.43%. The 208-layer network gave 97.59% and 92.09%, a difference of 5.5%. The segmentation accuracies of all three networks on the testing dataset are shown in Table 4.
The training accuracy curve of the 100-layer network is similar to that of the 154-layer network, although its performance on the testing set is poorer. Table 4 shows that 70 of the 356 testing images resulted in an accuracy below 90% for the 100-layer network, compared with 52 of 356 for the 154-layer network, i.e., 18 fewer. We speculate that the 100-layer network had an insufficient number of trainable parameters to fully learn and recognise the more detailed polyp features, although it performed similarly to the 154-layer network in the training stage.
The 100-layer network has 10.4 million trainable parameters, almost five times fewer than the 154-layer network. It was not sufficiently complex or deep to extract the finer polyp information from new, unseen images, and therefore performed poorly on the testing dataset. As a network becomes deeper, the number of trainable parameters increases, which generally improves model performance, but this did not hold for the 208-layer network: surprisingly, it performed the worst of all three networks, possibly due to overfitting. Overfitting issues mainly occur in very deep networks because of their extreme complexity, as their tremendous number of trainable parameters may become biased towards a specific dataset or batch of images. The 208-layer network has 206.7 million trainable parameters, almost four times and 20 times more than the modified SegNet and the 100-layer network, respectively.
Overfitting can be observed for the 208-layer network in the training accuracy curve in Figure 2. The training accuracy for both the 100-layer and 154-layer networks increased smoothly until convergence, while that of the 208-layer network fluctuated throughout the training process. The reason could be that the network became biased towards certain batches of training images, with most of the trainable parameters fitted to the polyp features extracted from a particular batch. In that case, when the next batch of training images is fed into the network to update the trainable parameters, the network cannot find a global minimum suitable for the entire training dataset. Overfitting can also be seen when the testing performance is far poorer than the training performance, i.e., the segmentation results on the testing images are poorer than expected; this network had a training accuracy of 97.59% but a testing accuracy of 92.09%. From Table 4, the 208-layer network had an accuracy below 90% on 84 of the 356 testing images (14 and 32 more than the 100-layer and 154-layer networks, respectively). Overall, the 154-layer model outperformed the 100-layer and 208-layer networks by an average of 2.7% and 5%, respectively.
4.4. Discussion of Results and Comparison with Previous SegNet Research
Based on the results of this comparison, the 154-layer network was used in the subsequent experiments as our modified SegNet model. An original SegNet with a VGG-19 architecture was also implemented for comparison with our modified SegNet. To achieve a fair comparison, all datasets and training parameters were identical for the original and modified SegNet models. We also compared the segmentation performance of our model with the networks in [27,28], which used SegNet as their basis. A comparison of the results is given in Table 5, where our modified SegNet model shows a clear improvement over the other methods.
It can be seen from Table 5 that our proposed model performed better than the other SegNet models, and also better than the models proposed in [27,28]. This improvement may be due to the use of VGG-19 as the network backbone, whereas VGG-16 was used in [27,28]; VGG-19 has more convolutional layers than VGG-16, and these additional layers may have enabled our model to learn more polyp features. The IoU score of 73.72% indicates that our model detected more true positive polyps, whereas the scores of the other methods were well below 70%. In terms of the mean IoU, only our modified SegNet model managed to exceed 90%.
Figure 3 compares the results from our modified SegNet with those of the original SegNet VGG-19. The red boxes in the first column of Figure 3 indicate the location of polyps in the raw images.
As shown in rows (a) to (c) of Figure 3, the experimental results from our method are smoother at the edges of the polyp and less over-segmented than those of the original SegNet model. This may be due to the use of skip connections in our improved model: the fine-grained information from the encoder network can compensate for pixel information vanishing along the network, giving a smoother polyp segmentation. We can also observe from rows (d) to (f) of Figure 3 that our modified SegNet model produced fewer false positive regions than the original SegNet. According to the ground truth, some background regions were incorrectly recognised as polyp regions by the original SegNet, whereas our modified SegNet correctly identified all the locations and numbers of polyps in the same images. We believe this improvement stems from our modifications to SegNet, particularly the convolutional layers with a 5 × 5 kernel and the parallel dilated convolutions in the final encoding layers. We speculate that incorporating these convolutional layers in the two shallower convolution blocks of both the encoder and decoder networks helps to further remove noise and stabilise the prominent features of polyps. The dilated convolutions applied in parallel enlarge the receptive field of the model and enable it to extract and aggregate multi-scale contextual information; in this way, more spatial information on the polyp features can be consolidated, resulting in better discrimination between polyp and non-polyp pixels.
We visualise and compare the activations of SegNet VGG-16, SegNet VGG-19, and our modified SegNet model. The activations of different SegNet layers can be displayed and visualised when a colonoscopy image is fed to the model; by observing and comparing the activation areas with the original image, the features learned by the network can be discovered. The visualisations were compared separately for the encoder network, the intermediate sector between the encoder and decoder, the decoder network, and the final prediction map. Our modified SegNet model captures more stable and more continuous features of the polyp edges as a whole than SegNet VGG-16 and SegNet VGG-19. The activation visualisation shows that the strongest white pixel activations were accurately located within the polyp area. In the visualisation of the intermediate sector between the encoder and decoder, our modified SegNet model gave output activation maps with different receptive fields, and the concatenation of the four dilated convolution blocks gave the strongest pixel activation within the correct polyp locations. In the visualisation of the decoder network, our modified SegNet model also performed better than SegNet VGG-16 and SegNet VGG-19: its activation maps contained less noise and were more accurate and more centralised on the polyp region. For more information about activation visualisation, the reader can refer to Appendix C.
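Activation maps like those discussed above can be captured with framework hooks; the sketch below (our own PyTorch illustration, independent of the Matlab tooling used here) records the output of a chosen layer during a forward pass:

```python
import torch
import torch.nn as nn

# Toy two-convolution encoder standing in for the trained segmentation model.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # store the layer's feature maps
    return hook

# Register a forward hook on the layer whose activations we want to inspect.
model[2].register_forward_hook(save_activation("conv2"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))   # one input image (toy size)

fmap = activations["conv2"][0]           # (64, 224, 224) feature maps
heatmap = fmap.mean(dim=0)               # channel-averaged map for display
print(heatmap.shape)                     # visualise e.g. with matplotlib's imshow
```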
4.5. Discussion of Results and Comparison with Other Previous Research Works
In addition to the networks in [27,28], we also compared our modified SegNet model with other recent CNN-based methods. A comparison of the performance of all the networks is given in Table 6.
The comparison shows that our method outperformed most other methods. Our approach performed better than the model in [31] on every performance index, with a significantly higher sensitivity (by 18.3%) and better precision and mean IoU, with gains of 0.25% and 0.33%, respectively. Compared with the model in [32], although our method had a lower accuracy and sensitivity (by 2.64% and 1.65%, respectively), it yielded better specificity, precision, and dice coefficient, with improvements of 3.36%, 2.48%, and 6.89%, respectively. Our method also outperformed the two most recent methods identified in our literature review, i.e., those in [29,30]. As we can observe from Table 6, although the model in [29] yielded a specificity 1.49% higher than our method, our model outperformed it overall by a significant margin, with values of 96.06% for accuracy, 94.55% for sensitivity, 97.48% for precision, 92.3% for mean IoU, and 95.99% for dice coefficient, compared with 94.59%, 80.23%, 88.65%, 71.84%, and 79.07%, respectively. Based on this comparison with [29], we can see that our method detected more TP regions and segmented fewer FP areas. For the task of polyp segmentation, it is essential to improve the TP segmentation rate, as sensitivity is the decisive factor for model performance; this is due to the class imbalance between polyp and background in a colorectal polyp image, where the polyp regions occupy few pixels and most of the image is background. A comparison between our method and the model in [30] shows that our method achieves a precision that is 7.98% higher. Our proposed method was only outperformed by the network in [33], obtaining slightly lower values for the mean IoU and dice coefficient of 92.3% and 95.99%, as opposed to 96.95% and 98.45%, respectively. The reason could be that our method used a single network as the segmentation model, while the approach in [33] used an ensemble. In terms of network complexity, our proposed model has approximately 51.5 million trainable parameters, about half the number in [33], yet it performed only slightly worse, with differences of 4.65% and 2.46% in mean IoU and dice coefficient, respectively. The concatenation of four dilated convolutions applied in parallel in our modified SegNet serves the same purpose of aggregating multi-scale contextual information as the ensemble approach in [33], but accomplishes it within a single network. Overall, our proposed method yielded promising results, outperforming most existing CNN-based methods or giving comparable results. This meets the objectives of this research, namely, enhancing the performance of existing polyp segmentation models while reducing model complexity through three modifications to SegNet VGG-19: skip connections, 5 × 5 convolutional filters, and a concatenation of four dilated convolutions applied in parallel.