1. Introduction
In recent years, image segmentation has been widely applied in fields such as autonomous driving [1,2], remote sensing image processing [3,4], and medical image processing [5,6], owing to its excellent performance in visual tasks. Segmentation quality evaluation refers to a quantitative evaluation of the segmentation quality, so that the evaluation result can be used to measure the performance of segmentation algorithms and guide the adjustment of algorithm parameters. Furthermore, evaluation criteria can serve as a standard for designing a good segmentation algorithm. In short, segmentation quality evaluation is an essential process for image segmentation. In contrast to image quality evaluation [7,8], which assesses the quality of the image itself (distortion, blur, etc.), segmentation quality evaluation is concerned with how well the segmentation extracts the object of interest from the original image.
There are full-reference evaluation methods, such as Mean Intersection over Union (MIoU) [9,10], Mean Pixel Accuracy (MPA) [9,11], F-Measure [12], Probabilistic Rand Index (PRI) [13], and the Dice coefficient (Dice) [14]. These methods produce scores by calculating the similarities or differences between the segmentation and the ground truth (reference image). However, there are several problems with such methods. Firstly, they require ground truth labels as a reference, which demands a great deal of manual pixel-level labeling and cannot capture the varied semantic meanings of real-world objects. Secondly, they only evaluate the spatial relationship between the segmentation and the ground truth (e.g., region areas or boundary locations) and do not utilize image cues in the evaluation process, resulting in evaluations that are inconsistent with human visual standards.
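For concreteness, the following minimal NumPy sketch illustrates how MIoU, MPA, and the Dice coefficient are typically computed; the exact class-averaging conventions vary between implementations, so this is illustrative rather than a reproduction of any specific library.

```python
# Minimal sketches of the full-reference measures named above, assuming
# integer label maps `pred` and `gt` of identical shape.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union, averaged over classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def mean_pixel_accuracy(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Per-class pixel accuracy, averaged over classes present in gt."""
    accs = []
    for c in range(num_classes):
        mask = gt == c
        if mask.sum() > 0:
            accs.append((pred[mask] == c).mean())
    return float(np.mean(accs))

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())
```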
No-reference methods ease the dependence on a reference, such as the ground truth, in practical applications and have therefore become a promising solution for online segmentation evaluation tasks. Traditional no-reference methods [15,16,17] mainly use low-level image features (e.g., textures and colours) for evaluation; however, they are inefficient in semantic segmentation scenarios. It has been widely observed that learning the semantic information of objects requires a large number of samples of class-specific objects, which traditional evaluation methods cannot exploit. Intuitively, a good evaluation method should extract meaningful semantic information from the image and distinguish the quality of segmentations based on it. Exploring the relationship between the original image and the segmentation result and quantifying the distance between segmentations are two important ingredients of a reasonable no-reference evaluation.
Meta-measures [18] are designed to measure the appropriateness of evaluation methods and contain a series of principles for general-purpose evaluation. They are usually based on the ability to distinguish between segmentation images, e.g., identifying which segmentation was produced from a different original image, or identifying which segmentation has a higher quality. This provides a natural way to evaluate the performance of different segmentation measures and is designed independently from those measures.
In light of the superior performance of deep convolutional neural networks (CNNs) in feature representation, we propose a feature contrastive learning method for no-reference segmentation quality evaluation. Contrastive learning is a popular topic in computer vision research, with many applications such as person re-identification [19], image matching [20], and visual tracking [21]. It is a type of self-supervised learning [22] that learns by comparing the commonalities and differences between pairs, which can reveal more about the relationships between parts of the data than other learning methods. More interestingly, the concept of comparing paired candidates coincides with the principles of meta-measures. Therefore, we integrate contrastive learning into the segmentation evaluation task and propose a CNN framework for no-reference segmentation evaluation.
The proposed framework does not perform contrastive learning directly, because the amount of data required for direct contrastive learning is too large. Instead, it first learns the pixel-level similarity between the original image and the segmentation image and extracts the feature space. Next, a Siamese network is constructed. This network has two branches that share parameters; each branch is based on a two-channel network [23] and loaded with the preliminary learning parameters. The segmentation images are grouped into pairs according to their quality, concatenated with the original images, and input into the Siamese network for feature extraction. After that, a contrastive learning module is designed to learn feature similarities by calculating the extent to which each segmentation in the pair is related to the original image. In the prediction phase, in order to simulate the human perception process, we add a class activation map (CAM) [24] to the network, weighting the score towards the regions of attention.
To verify the effectiveness of the proposed framework, we produce 17,774 segmentations from the Pascal VOC2012 dataset using four state-of-the-art (SoA) algorithms, comprising 8887 well-segmented images and 8887 poorly segmented images. Two meta-measure criteria, the Swapped-Image SoA Discrimination (SISD) [18] and the newly proposed Corresponding Image SoA Discrimination (CISD), are used to compare our method with both no-reference and full-reference evaluation methods. The comparison results demonstrate the effectiveness of our method.
The main contributions of this paper are two-fold:
1. We present a new no-reference segmentation evaluation framework. It incorporates deep semantic information not covered by previous no-reference methods, yielding substantial performance benefits. From a learning perspective, we propose a prediction network that applies contrastive learning and a CAM module to segmentation quality evaluation.
2. We construct a new segmentation evaluation dataset and design a new meta-measure, CISD. The CISD, together with the SISD criterion, is used to test various segmentation evaluation methods on the new dataset, providing a reference for segmentation validation and analysis. Extensive experiments are performed to validate the effectiveness of our evaluation framework, including preliminary pixel-level learning results, intervals of the score distribution, examples of actual evaluation scores, etc.
The rest of the paper is organized as follows. Section 2 introduces the related work on segmentation quality evaluation. Section 3 describes the problem and presents the proposed evaluation framework, which consists of three important modules: pixel-level similarity learning, feature contrastive learning, and score adjustment with a CAM. Section 4 demonstrates the experimental settings, the dataset construction, the meta-measure methods, and the experimental results. Section 5 concludes the paper and discusses future work.
3. Proposed Framework of Feature Contrastive Learning
The goal of no-reference segmentation quality evaluation is to evaluate the quality of segmentation results without the ground truth. Most existing no-reference methods suffer from performance deficiencies: their evaluation results do not conform to human vision and are difficult to apply in the field of semantic segmentation. To address these issues, we explore the use of a learning-based method to better utilize the relationship between the original image and the segmentation image. A novel method is proposed for no-reference segmentation evaluation with contrastive learning.
Contrastive learning complies with the concept of segmentation evaluation meta-measures, which compare image pairs. However, direct contrastive learning provides a weak learning signal, requires a large amount of data, and is difficult to fit. Therefore, our framework first performs pixel-level similarity learning [50], which allows the network to understand the pixel-level relationship between the original image and the segmentation image. Then, contrastive learning is performed in the feature space. By comparing the differences between segmentation images of different quality, the network learns the global, image-level relationship between the segmentation and the original image.
In the following subsections, we describe the details of the proposed method, including the pixel-level similarity learning and feature contrastive learning processes. Then, the prediction phase is introduced, focusing on how a class activation map (CAM) is incorporated to generate the final evaluation score.
3.1. Pixel-Level Similarity Learning
Owing to the limited amount of data, the direct use of contrastive learning easily encounters problems such as overfitting. Instead of using contrastive learning directly, pixel-level similarity learning is performed as preliminary learning with a two-channel network. In contrast to some methods [51,52], which use traditional similarity metrics such as the Euclidean distance, our method directly uses deep learning to measure the similarity; it only extracts inter-sample similarity features and does not extract features for a single sample.
As shown in Figure 1, firstly, the original image and the segmentation image are concatenated into an $H \times W \times 2C$ matrix for the network input, where $H$ and $W$ represent the height and the width and $C$ is the number of channels of each image. Then, the similarity features are obtained from ResNet-50 [53] with an upsample module. They are sent into the pixel-level decision layer to obtain an $H \times W$ feature map. Following some classical deep learning methods [54,55] validated on the Pascal VOC2012 dataset, we set $H$ and $W$ to fixed values (all images are resized to the same dimensions; see Section 4.2).
Specifically, we define a dataset $D = \{(I_i, S_i, G_i)\}_{i=1}^{N}$, composed of $N$ independent and identically distributed training samples, where $I_i$ is the original image with dimensions of $H \times W \times C$, $S_i$ refers to the set of segmentation images corresponding to $I_i$, and $G_i$ is the ground truth of $I_i$. The elements of $S_i$ and $G_i$ have the same dimensions as $I_i$.
For an original image $I_i$ and a segmentation image $s \in S_i$ to be evaluated, the label $L$ is defined as in Equation (1):

$$L = s \odot G_i, \qquad (1)$$

where $\odot$ denotes the inclusive-OR operation, applied pixel-wise between the segmentation and the ground truth. $L$ is a similarity matching map with dimensions of $H \times W$.
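A minimal sketch of this label construction is shown below; the paper's $\odot$ operation is assumed here to reduce to a pixel-wise agreement test between the segmentation and the ground truth.

```python
# Hedged sketch of the Equation (1) label: a binary map marking pixels where
# the segmentation agrees with the ground truth (our reading of the paper's
# inclusive-OR operation).
import torch

def similarity_label(seg: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """seg, gt: (H, W) integer label maps; returns an (H, W) {0, 1} label map."""
    return (seg == gt).long()
```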
With the input set determined as $\{(I_i, s)\}$ and the label set as $\{L\}$, a function pair $(f, d)$ is used, where $f$ is an upsample module that extracts features and $d$ is a decision function for pixel-level similarity. $F = f(I_i, s)$ represents the feature space with a size of $H \times W \times 256$ and $M = d(F)$ represents the prediction result with a size of $H \times W$. The number of feature channels is set empirically to 256.
In the proposed method, a fully convolutional network (FCN) [54] structure, $f$, is used as the upsample module to extract features, and a simple convolutional module is used as the decision function, $d$. Corresponding to pixel-level similarity learning, we use the Mean Pixel Accuracy (MPA) as the loss function in Equation (2) to learn the pixel-level relationship between the original image and the segmentation image.
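The sketch below gives one possible PyTorch realization of this stage, under the assumptions stated in the comments; the class name PixelSimilarityNet, the bilinear upsampling, and the use of binary cross-entropy as a stand-in for the MPA-based loss of Equation (2) are illustrative choices, not the paper's exact implementation.

```python
# A possible realization of the pixel-level similarity network: a 6-channel
# input (image + segmentation), a ResNet-50 encoder, an upsample module f
# producing the 256-channel feature space, and a 1x1-conv decision layer d.
import torch
import torch.nn as nn
import torchvision

class PixelSimilarityNet(nn.Module):
    def __init__(self, feat_channels: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        # The first layer takes 6 channels, so its pretrained weights are
        # not loaded (cf. Section 4.1).
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # f: upsample module producing the (B, 256, H, W) feature space.
        self.f = nn.Sequential(
            nn.Conv2d(2048, feat_channels, kernel_size=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )
        # d: pixel-level decision layer producing a (B, H, W) similarity map.
        self.d = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, image: torch.Tensor, seg: torch.Tensor):
        x = torch.cat([image, seg], dim=1)   # (B, 6, H, W) concatenated input
        feat = self.f(self.encoder(x))       # feature space F
        sim_map = torch.sigmoid(self.d(feat)).squeeze(1)
        return sim_map, feat

# Training stand-in for Equation (2): per-pixel BCE against the Equation (1)
# label (the exact MPA-based loss is not reproduced in the extracted text).
loss_fn = nn.BCELoss()
```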
3.2. Feature Contrastive Learning with a Siamese Network
In most of the literature, there is no direct correlation between evaluation scores and the meta-evaluation principle [14]. In this work, we attempt to integrate these two factors in the evaluation by using a contrastive learning strategy. The basic idea is to construct a contrastive loss function according to the meta-measure criteria for learning and then produce the evaluation scores. In particular, we only carry out contrastive learning on the feature maps extracted in the preliminary learning stage. Unlike other contrastive learning methods, which are mainly applied to upstream tasks, contrastive learning in this work is applied to a downstream task to obtain the evaluation scores.
As shown in Figure 2, the similarity learning network is extended into a two-branch structure, forming a Siamese network. The original image concatenated with the positive segmentation is input to the upper branch, and the original image concatenated with the negative segmentation is input to the lower branch; features are extracted by $f$ in both branches. After that, a contrastive module is constructed for contrastive learning. It contains a separate convolution operation $g$ for obtaining the upper- and lower-branch global features, and these features are averaged to obtain the scores. A contrastive loss is calculated between the upper- and lower-branch scores. We adopt the contrastive principle that the similarity score of the upper branch should be higher than that of the lower branch.
Specifically, for the original image $I_i$, we select a positive segmentation $s^{+}$ and a negative segmentation $s^{-}$ from $S_i$. The positive sample can be a good segmentation from a state-of-the-art algorithm or the segmentation ground truth, and the negative sample can be a poor segmentation from the algorithms or a segmentation from a different image. The upper-branch input pair is $(I_i, s^{+})$ and the lower-branch input pair is $(I_i, s^{-})$. A function pair $(f, g)$ replaces $(f, d)$, where $f$ is the feature-extracting function for pixel-level similarity and $g$ is the global decision function. It is worth noting that $d$ is a pixel-level similarity decision function: it only learns pixel-level relationships and its output dimensions are $H \times W$. In contrast, $g$ performs image-level learning: it is a global similarity decision function whose output is averaged into a scalar score. In this phase, the function $f$ is not used for back-propagation.
For the upper-branch input pair $(I_i, s^{+})$ and the lower-branch input pair $(I_i, s^{-})$, the features $F^{+}$ and $F^{-}$ are extracted separately using the function $f$. The function $g$, which contains a convolution operation, is applied and averaged to obtain the upper and lower scores, $P^{+}$ and $P^{-}$. These are the evaluation scores that the network outputs for the good and poor segmentations, and they form a pair of scalar values. Based on the contrastive learning principle, a hyper-parameter $m$ is chosen to expand the interval between the two classes, whereby the contrastive loss function is set as in Equation (3):

$$\mathcal{L}_{con} = \max\left(0, \; m - (P^{+} - P^{-})\right). \qquad (3)$$

The hyper-parameter $m$ is pre-set empirically.
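A hedged sketch of this stage follows: the frozen feature extractor f feeds a shared convolutional head g, and a margin loss enforces the contrastive principle that the upper-branch score exceeds the lower-branch score; the margin value shown is a placeholder, since the paper's pre-set m is not given in the extracted text.

```python
# Sketch of the feature contrastive step over frozen features from f.
import torch
import torch.nn as nn

class GlobalDecision(nn.Module):
    """g: global similarity decision over the 256-channel feature space."""
    def __init__(self, feat_channels: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Convolve, then average spatially to a scalar score per image.
        return self.conv(feat).mean(dim=(1, 2, 3))

def contrastive_loss(p_up: torch.Tensor, p_low: torch.Tensor,
                     m: float = 0.5) -> torch.Tensor:
    # Equation (3): the positive score should exceed the negative by m.
    return torch.clamp(m - (p_up - p_low), min=0).mean()
```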
3.3. Prediction with Class Activation Map (CAM)
Since the Siamese branches share parameters, one branch can be cropped in the application phase. As shown in Figure 3, after cropping the Siamese structure into a single-branch structure, the original and segmentation images are concatenated as the input to the neural network to obtain the evaluation score. According to a previous study [55], semantically meaningful regions usually play an important role in deciding the segmentation quality. To better capture these regions and integrate this information into the quality calculation, a CAM is used to adjust the score. Considering performance, we use the Smooth Grad-CAM++ method to obtain the CAM, and on this basis, a FreqCAM [56] module is added. FreqCAM is a simple module for weakly supervised object localization, which gives higher weights to the attention region while eliminating most of the weight of the background regions, especially noise regions, in line with the intention of using a CAM in this study. Therefore, we apply FreqCAM to our no-reference segmentation quality evaluation scenario.
We define the original image as $I$, the segmentation image as $s$, and the CAM as $A$. For the pair $(I, s)$, the feature $F$ is obtained by feeding it into the function $f$, the similarity matching map $M$ with dimensions $H \times W$ is obtained by the function $d$, and the score $P$ is obtained by the function $g$ followed by the averaging operation.
In the CAM, the attention region is assigned a higher weight. Therefore, using $A$ as the weight map and $M$ as the base map, a weighted average value $w$ is computed to reflect the accuracy of the attention region, which is defined as Equation (4):

$$w = \frac{\sum_{x,y} A(x,y)\, M(x,y)}{\sum_{x,y} A(x,y)}. \qquad (4)$$
Using $w$ as a coefficient reflecting the accuracy of the attention component, a threshold penalty method is used to calculate the final score $S$. That is, a threshold $t$ is set; when $w$ is higher than the threshold, the score is unchanged, and when it is lower than the threshold, the score is reduced, with the coefficient used directly as the penalty ratio. The value of $t$ is set empirically in this paper (see the ablation study in Section 4.6). The final score $S$ is calculated by Equation (5):

$$S = \begin{cases} P, & w \geq t \\ w \cdot P, & w < t. \end{cases} \qquad (5)$$
The motivation for the threshold penalty is to keep the scores of segmentations with higher attention weights unchanged and only penalize those with lower weights. As shown in Figure 4, if the segmentation score in case (1) were adjusted, the score would become too low; it is more appropriate to only adjust the score in scenarios such as (2), keeping (1) unchanged. After comprehensive consideration, we set the threshold $t$ accordingly (the ablation in Section 4.6.1 compares different choices).
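The following sketch summarizes Equations (4) and (5) as code, assuming cam is a non-negative attention map aligned with the similarity matching map; the threshold value is a placeholder for the paper's empirically chosen t.

```python
# CAM-based score adjustment: `cam` is the (H, W) attention map A,
# `match_map` the (H, W) similarity matching map M from d, and `score`
# the scalar P from g.
import torch

def adjusted_score(score: float, match_map: torch.Tensor, cam: torch.Tensor,
                   threshold: float = 0.8) -> float:
    # Equation (4): CAM-weighted average of the matching map.
    w = (cam * match_map).sum() / cam.sum().clamp(min=1e-8)
    # Equation (5): keep the score if the attention region is accurate
    # enough, otherwise scale it down by w (threshold penalty).
    return score if w.item() >= threshold else w.item() * score
```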
4. Experiments
4.1. Experimental Configuration
Our method was trained and validated on an NVIDIA GeForce 2080Ti GPU with 11 GB memory, Python 3.7, and PyTorch 1.1. We used FCN with ResNet-50 as the backbone network. ResNet-50 was initialized with parameters pre-trained on the ImageNet dataset; the parameters of the first layer were not loaded because our structure changed from three input channels to six. We used the Stochastic Gradient Descent (SGD) optimizer with fixed learning rate, momentum, and weight decay. The batch size was set to 16 for both similarity learning and contrastive learning. The number of similarity learning epochs was 120, and for contrastive learning it was 5.
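This configuration can be summarized with the sketch below; since the exact learning rate, momentum, and weight decay are not given in the extracted text, the numeric values are placeholders, and the backbone call is a generic torchvision stand-in.

```python
# Illustrative optimizer setup for Section 4.1 (values are placeholders).
import torch
import torchvision

model = torchvision.models.segmentation.fcn_resnet50(num_classes=1)  # stand-in
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3, momentum=0.9, weight_decay=1e-4)
batch_size = 16                                 # both learning stages
similarity_epochs, contrastive_epochs = 120, 5
```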
4.2. Dataset
In this paper, the original images and ground truth segmentation labels were selected from the Pascal VOC2012 dataset, which includes 20 foreground classes and 1 background class. Four SoA methods were chosen to produce the segmentation image sets: FCN [25], U-Net [57], Mask-RCNN [58], and DeepLabV3 [59]. The segmentation results at 15 and 25 epochs were selected for each method, for a total of eight segmentations per image. Then, one good segmentation sample and one poor segmentation sample were selected from the eight candidates to build the two segmentation image sets. Finally, a total of 8887 samples were used, with 7937 samples for training and 950 samples for validation; each sample contains four kinds of data: the original image, the good segmentation image, the poor segmentation image, and the ground truth image. All images were unified into the same dimensions.
4.3. Experimental Criteria
We verify multiple evaluation methods on both the SISD and CISD meta-measures. SISD measures two segmentation datasets separately: a good segmentation dataset and a poor segmentation dataset output by state-of-the-art (SoA) methods. A poor segmentation has some regions that are over-segmented, under-segmented, or misclassified, but still has a strong correlation with the original image and should be rated better than segmentations generated from a different original image. CISD compares good segmentations and poor segmentations of the same original image.
4.3.1. Meta-Measure for SISD
As shown on the left of Figure 5, SISD (Swapped-Image SoA Discrimination) compares the results created by a SoA segmentation method with results created by the same method on other original images. For each SoA segmentation technique, SISD counts the images in the dataset for which an evaluation measure correctly judges that the corresponding SoA result is better than the swapped-image result. The meta-measure SISD is defined as the percentage of results in the database that are correctly discriminated [18].
Specifically, SISD is defined in Equation (6):

$$\mathrm{SISD} = \frac{1}{N} \sum_{i=1}^{N} T\left(S(I_i, s_i) > S(I_i, s_j)\right), \quad j \neq i, \qquad (6)$$

where $I_i$ refers to the $i$th original image, $s_i$ refers to the segmentation image for $I_i$, and the function $S(\cdot)$ outputs the matching score between the original image and the segmentation image. $T(\cdot)$ is a judgement that is considered true when $S(I_i, s_i) > S(I_i, s_j)$; $S(I_i, s_j)$ is the matching score between $I_i$ and a segmentation image from a different original image.
4.3.2. Meta-Measure for CISD
The SISD meta-measure only tests the discrimination of swapped images, which is not sufficient, since most segmentation methods are compared on the same image. Thus, a new meta-measure is proposed in this paper, named CISD (Corresponding Image SoA Discrimination): the percentage of images for which the good SoA segmentation result is correctly judged to be more correlated with the corresponding image than the bad result (the corresponding image refers to the original image from which both the good and poor segmentation images in the compared pair were produced). CISD is shown on the right of Figure 5.
Specifically, CISD is defined in Equation (7):

$$\mathrm{CISD} = \frac{1}{N} \sum_{i=1}^{N} T\left(S(I_i, s_i^{+}) > S(I_i, s_i^{-})\right), \qquad (7)$$

where $I_i$ refers to the $i$th original image and $s_i^{+}$ and $s_i^{-}$ refer to segmentation images for $I_i$, with $s_i^{+}$ having a higher quality than $s_i^{-}$. The function $S(\cdot)$ outputs the matching score between the original image and the segmentation image, and $T(\cdot)$ is a judgement that is considered true when $S(I_i, s_i^{+}) > S(I_i, s_i^{-})$.
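Both meta-measures reduce to counting correctly ordered score pairs, as in the sketch below; the score callable stands in for the matching function $S(\cdot)$, and drawing a single swapped segmentation per image is an illustrative simplification.

```python
# Sketches of the SISD and CISD meta-measures over index-aligned lists.
import random

def sisd(images, segs, score) -> float:
    """Fraction of images where the corresponding segmentation outscores a
    segmentation swapped in from a different image (Equation (6))."""
    correct = 0
    for i, (img, seg) in enumerate(zip(images, segs)):
        j = random.choice([k for k in range(len(segs)) if k != i])
        correct += score(img, seg) > score(img, segs[j])
    return correct / len(images)

def cisd(images, good_segs, poor_segs, score) -> float:
    """Fraction of images where the good segmentation outscores the poor
    segmentation of the same image (Equation (7))."""
    correct = sum(score(img, g) > score(img, p)
                  for img, g, p in zip(images, good_segs, poor_segs))
    return correct / len(images)
```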
4.4. Comparison of Meta-Measure Results of Our Method and Other Methods
As shown in Table 1, SISD (good) represents the SISD measure on the good segmentation sets and SISD (bad) represents the SISD measure on the poor segmentation sets. The global accuracy is defined as the average of the above three meta-measure results, and the global accuracy of our method exceeds that of the other no-reference methods. On CISD, the traditional no-reference methods are somewhat effective, but less so than our method. On SISD, our method demonstrates clear superiority, even over full-reference methods, while traditional no-reference methods are not very effective.
4.5. Comparison of the Evaluation Score of Our Method and Other Methods
To further verify the validity of our method, in addition to the meta-measures, we also show all the evaluation scores on the validation set for the reference methods and our method. Other no-reference methods are not shown because of their low computational efficiency and irregular values. In particular, as shown in Figure 6, for each original image displayed on the vertical axis, the yellow and green points are the means of the evaluation scores of the segmentation images generated from all the other original images. It can be seen that our no-reference method is comparable with the reference methods. MPA and Recall assign high scores to the segmentation results of other images. Precision, F-measure, MIoU, Dice, and VI produce evaluation scores that are completely separated from those of other images' segmentation results, but assign too many low scores to the original image's own segmentation results.
4.6. Ablation Study
In this section, we analyze the effectiveness of our proposed method, performing ablation studies on each proposed component of the network architecture and empirically analyzing the corresponding reasons. The ablation studies cover both the meta-measure accuracy and the details of different learning epochs. SL represents using only pixel-level similarity learning, CL represents contrastive learning, SCAM represents Smooth Grad-CAM++, FCAM represents FreqCAM, and the remaining column labels denote the penalty thresholds (two empirically chosen values and 1).
4.6.1. Comparison of Accuracy with the Addition of Different Modules
As shown in Table 2, when using only similarity learning, the result is a pixel-level accuracy map. Instead of setting a threshold to predict correctness or incorrectness, a soft average [60] is used directly as the evaluation score. This method gives poor results, as it only evaluates pixel-level relationships. When feature contrastive learning is added, there is a large improvement in accuracy; thus, feature contrastive learning is effective in this scenario. With the adoption of a CAM with different penalty thresholds, the meta-measure performance of our method is further improved. The best performance is achieved using FCAM accompanied by an intermediate threshold. When the threshold is set to 1, the threshold penalty method is effectively disabled and the evaluation results of all segmentation images are adjusted by the weight; in this case, a lower accuracy is achieved, which validates our assumption in Section 3.3.
4.6.2. Details of Meta-Measure Results in Different Epochs
We show the training process and results in more detail in Figure 7, which contains the accuracy of the meta-measures over 120 epochs of pixel-level similarity learning and 5 epochs of contrastive learning. In order to show the differences between epochs more clearly, the y-axis scales are not consistent across plots.
4.7. Evaluation Examples from Different Methods
Figure 8 and Table 3, Table 4 and Table 5 show examples of how different methods evaluate scores. Figure 8 shows the selected sample images. We select three typical full-reference methods and three no-reference methods for comparison. For each original image, we choose six segmentation images for evaluation: the good and poor segmentations of the image itself, denoted good and poor; good and bad segmentation images from different original images with similar categories, denoted other-good (1) and other-bad (1); and good and poor segmentations from different original images that are close to the target location but of different categories, denoted other-good (2) and other-bad (2).
In Table 3, Table 4 and Table 5, it is important to note that the values for the no-reference methods F, E, and Ecw represent errors, with higher scores indicating a poorer quality. For F, the displayed values were rescaled by a constant factor for readability.
Table 3 shows the corresponding evaluation results of part (1). In part (1), the evaluation scores of the full-reference methods on other-good (1) and other-bad (1) are too high, as these methods evaluate based on the segmentation space only. The scores of our method are more reasonable compared to the other methods. Table 4 shows the corresponding evaluation results of part (2). In part (2), the full-reference methods perform well on the segmentation results of other images, but produce evaluation scores that are too low for the corresponding good and bad segmentations. The evaluation scores of our method are more reasonable. Table 5 shows the corresponding evaluation results of part (3). In part (3), due to the small foreground target, the full-reference Precision and F-measure methods produce excessive scores for poor segmentations; our method overcomes this problem.
In summary, it can be seen that traditional no-reference methods have difficulty distinguishing a segmentation of the corresponding image from a segmentation of another image, and their scores are also very unreasonable. For our method, three variants are included in the score evaluation: one using only pixel-level similarity learning, one adding contrastive learning, and one further adding a CAM. For the poor segmentations, i.e., good/bad segmentations from other images, the first two variants do not produce fully reasonable scores, whereas with the CAM added, the scores are more compatible with human vision than those of other methods.