Article

Addressing Noisy Pixels in Weakly Supervised Semantic Segmentation with Weights Assigned

1 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
2 College of Engineering, Shantou University, Shantou 515063, China
3 China Mobile Communications Group Guangdong Co., Ltd. Shantou Branch, Shantou 515041, China
4 School of Telecommunications Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523808, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2520; https://doi.org/10.3390/math12162520
Submission received: 15 July 2024 / Revised: 10 August 2024 / Accepted: 13 August 2024 / Published: 15 August 2024

Abstract

Weakly supervised semantic segmentation (WSSS) aims to segment objects without the heavy burden of dense annotations. Pseudo-masks serve as the supervisory information for training segmentation models, so their quality is crucial to segmentation performance. However, the generated pseudo-masks contain a significant number of noisy labels, which leads to poor performance of the segmentation models trained on them. Few studies address this issue, as noisy labels remain inevitable even after the pseudo-masks are improved. In this paper, we propose an uncertainty-weight transform module to mitigate the impact of noisy labels on model performance. Notably, our approach does not aim to eliminate noisy labels but rather to enhance the robustness of the model to them. The proposed method adopts a frequency-based approach to estimate pixel uncertainty, and the uncertainty of pixels is transformed into loss weights through a set of well-designed functions. After weights are assigned dynamically, the model allocates its attention to each pixel in a clearly differentiated manner, and the impact of noisy labels on model performance is weakened. Experiments validate the effectiveness of the proposed method, which achieves state-of-the-art results of 69.3% mIoU on PASCAL VOC 2012 and 39.3% mIoU on MS COCO 2014.

1. Introduction

Semantic segmentation is a fundamental task in computer vision that aims to predict pixel-wise classification results on images. Thanks to the flourishing development of deep learning in recent years, the performance of semantic segmentation models has made significant breakthroughs [1,2]. However, this task requires pixel-level manual annotations, and high-quality pixel-level annotations are expensive and time-consuming to produce. Recently, many efforts have been devoted to weakly supervised semantic segmentation (WSSS), which aims to achieve segmentation performance comparable to fully supervised methods. Weak supervision comprises coarse and incomplete supervision [3]. Incomplete supervision can be categorized into semi-supervision [4], domain-specific supervision [5], and partial supervision [6]. Coarse supervision means that the supervision is coarser than pixel-level, such as scribble-level annotations [7], box-level annotations [8,9], and image-level annotations [10,11]. As shown in Figure 1, existing methods for WSSS based on image-level annotations primarily follow a two-stage process [12]. In the first stage, the class activation map (CAM) method proposed by Zhou et al. [13] is typically used to obtain an initial seed region that coarsely localizes the target, and pixel-level pseudo-masks are then obtained by refining this seed region. In the second stage, the pseudo-masks generated in the first stage are employed as supervisory information to train the semantic segmentation model. However, pseudo-masks are inevitably noisy, meaning that some pixels are mislabeled (noisy pixels). The presence of noisy labels significantly hinders the performance of the trained segmentation model [14,15]. Research [16,17] has demonstrated that the label quality of training samples seriously affects the performance of the algorithms.
Current methods [18,19] place their primary emphasis on generating high-quality pseudo-masks and overlook the adverse effects of noisy labels on model performance. The impact of noisy labels on segmentation model performance has been noticed by PMM [20] and URN [21]. However, these methods have their limitations: PMM [20] cannot pinpoint noisy pixels, while the universality of URN [21] across different datasets is questionable. Therefore, we focus on mitigating the influence of noisy pixels in a more effective way.
How can noisy labels be recognized in pseudo-masks? Inspired by URN, a powerful tool is uncertainty estimation. Kendall et al. [22] distinguished epistemic uncertainty and aleatoric uncertainty. In computer vision, uncertainty estimation requires multiple predictions for each image. Li et al. [21] proposed generating multiple predictions by simulating response maps at different activation scales through probability scaling. Dense-CRF [23] is then applied to the response maps to form pseudo-masks, and uncertainty is estimated from these scaled responses. Finally, the uncertainty is normalized and transformed into loss weights through a predefined threshold. However, the predefined threshold is a hyperparameter that is difficult to determine, and the most appropriate threshold for a particular dataset may not be suitable for others [24,25].
Therefore, we propose an uncertainty-weight transform module that dynamically transforms pixel uncertainty into loss weights through a frequency-based method rather than a predefined threshold. The proposed module is a set of functions and incorporates multiple scales to adapt to different datasets. At each scale, the frequency with which each pixel is labeled as each class is calculated. The class with the maximum frequency is then chosen as the classification of the pixel, and this maximum frequency serves as the confidence level of the pixel. Finally, statistical analysis is performed on the confidence levels of all pixels, and the confidence value with the highest frequency is selected as the threshold for weight allocation. Confidence levels exceeding the threshold correspond to higher weights, while those falling below it are assigned smaller weights. As the confidence level varies, the weight follows the trend prescribed by the module's functions.
The uncertainty-weight transform module is practical and effective for weak supervision. We evaluated the proposed method against the baseline URN on PASCAL VOC 2012 [26] and MS COCO 2014 with image-level supervision, achieving outstanding results of 69.3% mIoU on VOC and 39.3% mIoU on COCO.
In summary, our main contributions are as follows:
  • We recognize that the elimination of noisy labels is challenging to achieve. Therefore, we focus on transforming the uncertainty of pixels into loss weights, thereby mitigating the impact of noisy pixels on model performance.
  • The uncertainty-weight transform module is proposed to dynamically transform pixel uncertainty into loss weight. The critical aspect of the module lies in a set of functions with different thresholds but of the same form.
  • The experimental results illustrate the effectiveness of the proposed method. The designed functions are also efficient in mitigating the impact of noisy pixels from other datasets under different threshold controls.

2. Related Work

2.1. Weakly Supervised Semantic Segmentation

Semantic segmentation plays a crucial role in the field of computer vision and has rich application scenarios. Owing to the advancement in deep learning, fully supervised semantic segmentation methods based on deep learning have achieved significant progress [27]. However, these methods often require a large amount of training data with pixel-level annotations, which entails substantial labeling costs. To reduce the cost of data annotation and further expand the application scenarios of semantic segmentation, researchers are increasingly focusing on weakly supervised semantic segmentation (WSSS) methods based on deep learning. WSSS refers to the training of semantic segmentation models using relatively less annotation information rather than detailed pixel-level annotation information [28,29].
The WSSS method for image-level annotations follows a two-stage process, and the generated pseudo-masks are prone to noisy labels. To enhance the performance of WSSS, most researchers refine the CAM generated by the classification network to achieve more accurate segmentation masks. Having observed the tendency of CAM to shift across different regions of the target class during training, Jiang et al. [30] applied an online accumulation method. The method involves accumulating CAM outputs from various training stages to generate a CAM-like map. Chang et al. [31] proposed an approach that generates pseudo-sub-category labels through clustering and establishes a sub-category objective function. This method enables the training process to overcome the limitations imposed by the focus on noisy parts, resulting in a response map of enhanced quality. Additionally, several methods are employed to enhance WSSS by incorporating additional information, such as saliency detection models [32]. Fan et al. [33] introduced an intra-class discriminator (ICD) framework, addressing the classification boundary mismatch prevalent in image-level WSSS. This innovative approach refines the segmentation process by ensuring intra-class consistency. Yao et al. [34] proposed a saliency guided self-attention network (SGAN). The model is designed to improve the performance of WSSS by integrating class-independent saliency priors and utilizing class-specific attention cues as additional supervision.
In the aforementioned approaches, the majority of researchers have concentrated on generating high-quality pseudo-masks to enhance segmentation performance, often neglecting the potential of leveraging noisy labels to further refine model capabilities. Moreover, it is challenging to eliminate the presence of noisy labels in generated pseudo-masks [35,36]. Therefore, we propose an uncertainty-weight transform module to mitigate the impact of these noisy labels on the performance of segmentation models.

2.2. Label Noise Learning

In machine learning, the label quality of training samples seriously affects the final performance of classification algorithms. Although clean labels produce better results, they are time-consuming and laborious to collect and use. In recent years, researchers have gradually studied label noise learning to adapt models to general situations and to save costs [14]. In the application of label noise learning, a common approach is to perform label noise cleaning, with the aim of removing incorrectly labeled instances from the training data [37]. However, these methods may lead to serious data loss, ultimately reducing the accuracy of the algorithm. Recently, researchers have advanced the field of label noise learning by improving the loss function and adjusting the loss value. Loss adjustments play a pivotal role in mitigating the detrimental effects of noisy labels by reassigning the loss values of training examples prior to the parameter updates in the deep neural network (DNN) [16,17]. Unlike the newly designed loss function, loss adjustment aims to make traditional optimization processes robust to label noise. The related methods of loss adjustment can be described as follows:
(1)
Loss correction: The method modifies the loss for each example by multiplying the estimated label transition probability with the output of the specified DNN. Several advanced methods have been proposed to validate the effectiveness of loss correction. Clean validation data were utilized by Gold loss correction [38] as additional information to obtain a more accurate transition matrix, thereby further improving the robustness of loss correction. T-revision [39] was proposed to infer the transition matrix without anchor points. However, the effectiveness of loss correction depends on the accuracy of estimating the transition matrix. Obtaining a precise transition matrix typically requires prior knowledge, such as anchor points or clean validation data.
(2)
Loss reweighting: Loss reweighting alleviates the harm of noisy labels by modifying the weights of the loss function. Specifically, it aims to assign smaller weights to the pixels with false labels and greater weights to those with true labels [14]. DualGraph [40] leveraged graph neural networks to adjust the weights of examples based on the structural relationships among labels, effectively filtering out anomalous noisy examples. Active bias [41] focused on examples with inconsistent label predictions, utilizing the variances of their predictions as weights during the training process. However, these methods need the manual prespecification of weight functions and the selection of additional hyperparameters, which can be challenging to implement in practice due to the significant variations in appropriate weights.
(3)
Label refurbishment: Refurbished labels are a convex combination of noisy labels and DNN output labels. Bootstrapping [42] was the first method to propose label refurbishment for updating the target labels of training samples, using a more coherent network to better evaluate the consistency of noisy labels. AdaCorr [43] selectively refurbishes the labels of noisy examples and comes with a theoretical error bound. Alternatively, SEAL [44] averages the softmax outputs of a DNN for each sample over the entire training process and subsequently retrains the DNN using the averaged soft labels. Unlike loss correction and reweighting, label refurbishment explicitly replaces all noisy labels with approximately clean labels. However, when the proportion of noisy labels is high, there is a risk of overfitting to incorrectly refurbished samples.
To avoid spending excessive time constructing a perfect transformation matrix and to prevent the model from overfitting to incorrectly refurbished samples, we focus attention on loss reweighting. Moreover, a novel module is applied to overcome the manual predefinition of weights. The proposed uncertainty-weight transform module is designed to dynamically allocate weights to disparate pixels through the application of a set of functions, thereby enhancing the model’s segmentation performance. In contrast to approaches typically employed in loss reweighting, the functions for loss weight allocation presented in our study are designed to be adaptable to various datasets.

3. Methodology

3.1. Overview

The overview of the proposed method is shown in Figure 2. First, we train a deep classification model using image-level annotated data and use it to generate class activation maps for the input images. Initial pseudo-masks are obtained from the CAMs with the conditional random field algorithm, providing preliminary segmentation guidance for subsequent training. The generated pseudo-masks are then used to train a segmentation model that predicts pixel-level masks for the input image. Multi-scale Dense-CRF is adopted to postprocess the segmentation results and refine the mask boundaries. The proposed uncertainty-weight transform module estimates the uncertainty of the predicted masks after CRF postprocessing and transforms these results into loss weights. Finally, combining the pseudo-masks and the loss weights generated by the uncertainty-weight transform module, we retrain the segmentation model to further improve its robustness.

3.2. Preliminaries

3.2.1. Dense-CRF

The dense conditional random field (Dense-CRF) is a postprocessing method based on probabilistic graphical models capable of taking into account the spatial relationships between pixels and making fine adjustments to each pixel. The postprocessing of deep neural network output using Dense-CRF includes the following steps. Firstly, the network output is converted to the applicable format of Dense-CRF. Subsequently, an energy function containing data items, smoothing items, and model parameters is constructed. Next, the minimum of the function is solved through an optimization algorithm to determine the pixel label. Finally, the obtained labels are used to produce the final image segmentation result.
The postprocessing of Dense-CRF for semantic segmentation can be described with the following equation:
P(Y|X) = (1/Z) exp(−E(Y|X)).
Here, Y represents the label configuration, and X represents the observed image. E(Y|X) is the energy function, and Z is the normalization factor that ensures the probability distribution is properly normalized. P(Y|X) is the posterior probability of the label configuration Y.
The energy function E(Y|X) consists of a data term and a regularization term, given by
E(Y|X) = ∫ f(Y, X) dY + ∬ V(Y, Y′, X) dY dY′.
Here, the data term function f quantifies the consistency between the label Y and the observed image X. The regularization term function V promotes label smoothness and consistency. The variables Y and Y′ denote different label configurations.
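For concreteness, the following sketch shows how this kind of Dense-CRF postprocessing is typically applied to a network's softmax output using the pydensecrf package; the kernel parameters are common illustrative defaults, not values taken from this paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def dense_crf_refine(image, softmax_probs, iters=5):
    """Refine per-pixel class probabilities with a dense CRF.

    image:         H x W x 3 uint8 RGB image.
    softmax_probs: C x H x W float32 array of class probabilities.
    Returns an H x W array of refined hard labels.
    """
    n_classes, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)

    # Data (unary) term: negative log-probabilities from the network output.
    d.setUnaryEnergy(np.ascontiguousarray(unary_from_softmax(softmax_probs)))

    # Smoothness terms: a spatial kernel and an appearance (color) kernel.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)

    # Mean-field inference approximately minimizes the energy E(Y|X).
    q = np.array(d.inference(iters)).reshape(n_classes, h, w)
    return q.argmax(axis=0)
```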

3.2.2. Class Activation Map

The last layer feature map of the CNN contains the richest semantic information, but it also represents highly abstract categorical features. A class activation map, based on the visualization of the final layer of feature maps, provides a powerful interpretative function for CNN networks. The core concept of a CAM is to multiply the feature maps of the last convolutional layer with the weights of the fully connected layer, producing class activation maps for each category. These activation maps visually depict the regions in the original image that are relevant to specific classes. By overlaying or applying a weighted overlay to these activation maps on the original image, a CAM emphasizes the regions of interest that the network model prioritizes in the classification task. In WSSS, a CAM is utilized to identify regions within the image that correspond to the target category. Pseudo-masks are generated based on these regions for pixel-level semantic segmentation.
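As an illustration of this computation, the sketch below derives a CAM from a standard torchvision ResNet-50 by weighting the last-stage feature maps with the fully connected weights of a target class; the backbone and layer names are illustrative, not the exact classifier used in this work.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def compute_cam(model, image, target_class):
    """image: 1 x 3 x H x W normalized tensor; returns an H x W CAM in [0, 1]."""
    model.eval()
    with torch.no_grad():
        # Feature maps of the last convolutional stage (1 x C x h x w).
        x = model.maxpool(model.relu(model.bn1(model.conv1(image))))
        feats = model.layer4(model.layer3(model.layer2(model.layer1(x))))

        # Fully connected weights of the target class (C,), used to weight
        # and sum the feature maps channel by channel.
        fc_w = model.fc.weight[target_class]
        cam = torch.einsum('c,chw->hw', fc_w, feats[0])

        # Keep positive evidence, normalize, and upsample to the input size.
        cam = torch.relu(cam)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                            mode='bilinear', align_corners=False)[0, 0]
    return cam

# Illustrative usage; in practice the classifier would be trained on image-level labels.
model = models.resnet50()
cam = compute_cam(model, torch.randn(1, 3, 224, 224), target_class=7)
```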

3.2.3. Problem Definition

Given an input image I ∈ ℝ^{H×W} with an image-level annotation, a well-trained image-level classifier F_C is applied. The image I is converted into a class activation map F_CAM ∈ ℝ^{C×H×W} by the operation CAM. This process is expressed mathematically as follows:
F_CAM = CAM(F_C(I)).
Then F_CAM is refined by CRF to obtain pseudo-masks M_p ∈ ℝ^{C×H×W} as follows:
M_p = CRF(F_CAM).
In this paper, the operation CRF is Dense-CRF. A semantic segmentation network F is then trained from the pseudo-masks M_p and used for inference on the image I. The final segmentation result O ∈ ℝ^{C×H×W} can be described as follows:
O = F(I, M_p).
Finally, losses are calculated and backpropagated to update the parameters of F .
The above process constitutes a general framework for WSSS, but two issues remain unresolved. The first is that the pseudo-masks M_p are not accurate enough, which causes the network to learn useless features. Many previous research efforts focus on enhancing the quality of pseudo-masks; unfortunately, a CAM in WSSS cannot generate perfect pseudo-masks. The second issue arises in the weight assignment following uncertainty estimation. Because pixel uncertainty varies widely, setting thresholds manually is challenging. A high threshold excludes many uncertain labels, causing an imbalance in category-specific learning and limiting the use of noisy labels. Conversely, a low threshold inevitably introduces low-quality pseudo-masks. Furthermore, a manually set threshold requires adjustment when transitioning between datasets, which limits its generalizability.
To address these problems, a novel method is proposed in our paper. The aim of our method is not to enhance the quality of noisy labels but rather to mitigate the impact of noisy labels on model performance. In addition, a dynamic weight allocation scheme is proposed to address the limitations of manually setting thresholds.

3.3. Label Noise Learning in WSSS

URN [21] applied the Dense-CRF function CRF and the argmax operation ARG to generate pseudo-masks at different scales; the process can be shown as follows:
M̄ = CRF(F_CAM),
M̄_P = ARG(M̄, dim = 2),
where M̄ ∈ ℝ^{C̄×N×C×H×W} denotes the scaled predictions after Dense-CRF. Subsequently, we obtain pseudo-masks at various scales, denoted as M̄_P ∈ ℝ^{C̄×N×H×W}, by reducing the third (category) channel through the argmax operation.

3.3.1. Probability Statistic

Once M̄_P is obtained, the probability statistic step is executed. First, we calculate the probability of each pixel in the image being assigned to the various categories; the result of this calculation can be expressed as P ∈ ℝ^{N×H×W}. Subsequently, the category with the highest probability is chosen as the classification of the given pixel, and this highest probability serves as the confidence level of the pixel. The relevant formula is as follows:
P̄ = ARG(P, dim = 1).
The output of the probability statistic process is denoted by P̄ ∈ ℝ^{H×W}. The result P̄ is then fed into the proposed uncertainty-weight transform module.
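As a simplified illustration of this statistic, the NumPy sketch below takes pseudo-masks predicted at several scales, computes each pixel's per-class voting frequency, and keeps the most frequent class together with its frequency as the confidence level; the variable names are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def pixel_confidence(scaled_masks, n_classes):
    """scaled_masks: S x H x W integer pseudo-masks from S scales.

    Returns (labels, confidence): the most frequently assigned class per
    pixel and the fraction of scales voting for it (the confidence level).
    """
    # Frequency of each class at every pixel across the S scaled predictions.
    freq = np.stack([(scaled_masks == c).mean(axis=0) for c in range(n_classes)])

    labels = freq.argmax(axis=0)      # class with the maximum frequency
    confidence = freq.max(axis=0)     # that frequency = per-pixel confidence
    return labels, confidence

# Toy example: three scales voting on a 2 x 2 image with 2 classes.
masks = np.array([[[0, 1], [1, 1]],
                  [[0, 1], [0, 1]],
                  [[0, 0], [1, 1]]])
labels, conf = pixel_confidence(masks, n_classes=2)   # conf values in {2/3, 1}
```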

3.3.2. Uncertainty-Weight Transform Module

The function included in the proposed module is RLF, where RLF is a combination of the LF function and RF function. The function curve of RLF is depicted in Figure 3. α represents the value that occurs most frequently among the confidence levels of all pixels in the image. It serves as a parameter that governs the trend of the function’s variation. Different α values mean that the function assigns weights to the pixels of different images. Curves below α correspond to the LF function group, while those above correspond to the RF function group.
In the LF function, the trend of pixel weight variation corresponding to frequencies less than α can be expressed by the following formula:
LF = (L/α) ln((C − D)/(C + D)).
The RF function incorporates SAF, proposed by Zhang et al. [45], which was originally intended to measure the probability under varying distances in the original task. In the RF function, the trend of pixel weight variation corresponding to frequencies greater than or equal to α can be expressed by the following formula:
RF = 2/(1 + e^{α − D/L}) − 1.
The definition formulas for C and D in the formula above are as follows:
C = (1 + e^α)/(1 − e^α),
D = (P̄ − min(P̄))/(max(P̄) − min(P̄) + 0.0001).
The above equations involve several variables. L represents the maximum value of D ∈ ℝ^{H×W}, and C serves as an adjustment factor. P̄ ∈ ℝ^{H×W} is the matrix of confidence levels corresponding to all pixels in the i-th image, and D denotes the normalization of P̄.
The combined function RLF is obtained by merging LF and RF with the following equation:
RLF = (L/α) ln((C − H_1)/(C + H_1)) ∪ (2/(1 + e^{α − H_2/L}) − 1),
where H_1 = {D < α} refers to the subset of D whose values are smaller than α, and H_2 = {D ≥ α} refers to the subset of D whose values are greater than or equal to α.
The output matrix P̄ from the probability statistic process is passed into the RLF function:
W_o = RLF(P̄).
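To make the mechanism concrete, the sketch below estimates α as the mode of the confidence histogram and applies the LF branch below α and the RF branch at or above it. The analytic forms follow the equations above as written here, and some details (such as the number of histogram bins) are assumptions; it should be read as an illustration of the mechanism rather than the authors' implementation.

```python
import numpy as np

def rlf_weights(conf, n_bins=20, eps=1e-4):
    """Transform a confidence map P-bar (H x W, values in [0, 1]) into loss weights."""
    # alpha: the confidence value that occurs most frequently (histogram mode).
    hist, edges = np.histogram(conf, bins=n_bins, range=(0.0, 1.0))
    k = int(hist.argmax())
    alpha = 0.5 * (edges[k] + edges[k + 1])

    # Normalized confidence D, its maximum L, and the adjustment factor C.
    d = (conf - conf.min()) / (conf.max() - conf.min() + eps)
    big_l = d.max()
    c = (1.0 + np.exp(alpha)) / (1.0 - np.exp(alpha))

    weights = np.empty_like(d)
    low = d < alpha
    # LF branch for low-confidence (likely noisy) pixels.
    weights[low] = (big_l / alpha) * np.log((c - d[low]) / (c + d[low]))
    # RF branch for high-confidence pixels.
    weights[~low] = 2.0 / (1.0 + np.exp(alpha - d[~low] / big_l)) - 1.0
    return weights
```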

3.4. Loss Function

By performing the aforementioned computations, the weight of each pixel in the i-th image is obtained. Finally, multiplying the weight mask W_o ∈ ℝ^{H×W} with the loss mask L_loss from the cross-entropy loss yields the final segmentation loss. The process can be described as follows:
L_seg = Σ_{i=1}^{H} Σ_{j=1}^{W} W_o^{i,j} · L_loss^{i,j}.
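In PyTorch, this pixel-wise weighted loss can be sketched by computing the unreduced cross-entropy (the loss mask) and multiplying it by the weight mask; this is a minimal illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def weighted_seg_loss(logits, pseudo_mask, weight_mask, ignore_index=255):
    """logits: N x C x H x W, pseudo_mask: N x H x W (long), weight_mask: N x H x W."""
    # Per-pixel cross-entropy without reduction: the loss mask L_loss.
    per_pixel = F.cross_entropy(logits, pseudo_mask,
                                ignore_index=ignore_index, reduction='none')
    # Weight each pixel by W_o and sum over the spatial dimensions.
    return (weight_mask * per_pixel).sum()

# Toy usage: 2 images, 21 classes (VOC), 8 x 8 resolution.
logits = torch.randn(2, 21, 8, 8, requires_grad=True)
pseudo = torch.randint(0, 21, (2, 8, 8))
weights = torch.rand(2, 8, 8)
weighted_seg_loss(logits, pseudo, weights).backward()
```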

4. Experiments

In this section, we provide a detailed description of the experimental settings and the experimental results.

4.1. Experiments Setting

4.1.1. Dataset

PASCAL VOC 2012: With its moderate level of complexity and large volume, this dataset is one of the most widely used in the field of WSSS. The dataset comprises 20 foreground categories and 1 background class, divided into three subsets: training, validation, and test sets. Specifically, the training set contains 1464 images, the validation set has 1449 images, and the test set includes 1456 images. In addition to the original data, supplementary images and annotations from the SBD dataset [46] are also used, forming an extended training set known as ‘trainaug’, which contains 10,582 images. In WSSS, pixel-level annotations are transformed into image-level multi-label annotations during the classification phase.
MS COCO 2014: This dataset poses greater challenges than PASCAL VOC due to its larger number of categories and smaller average object size. The MS COCO 2014 dataset comprises 80 valid foreground categories and 1 background class, while another 10 categories are not evaluated. The training set includes 82,081 images, and the validation set consists of 40,137 images. As with the PASCAL VOC 2012 dataset, the evaluation metric is mean intersection over union (mIoU).
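For reference, the mIoU metric used on both datasets averages, over classes, the ratio of intersection to union between predictions and ground truth; a minimal sketch of the computation is shown below.

```python
import numpy as np

def mean_iou(preds, gts, n_classes, ignore_index=255):
    """preds, gts: iterables of H x W integer label maps; returns mIoU."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        valid = g != ignore_index
        # Accumulate the (ground truth, prediction) confusion matrix.
        idx = n_classes * g[valid].astype(np.int64) + p[valid].astype(np.int64)
        conf += np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)        # avoid division by zero
    return float(iou.mean())
```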

4.1.2. Baseline

For comparison with our baseline URN, the proposed method is also evaluated on the PMM [20], which enhances the generation and utilization of pseudo-masks based on the SEAM algorithm [10]. The backbone of the SEAM algorithm is ResNet-38, and the PMM uses Res2Net-101 and ScaleNet-101 to achieve state-of-the-art performance on both the VOC and COCO datasets. In this paper, the proposed method is verified on these three backbones, with ResNet-101 employed as an additional backbone.

4.1.3. Hyperparameters Setting

To ensure a fair comparison with the baseline URN, identical hyperparameters were applied in our experiments. Our experiments were conducted on an NVIDIA DGX Station, an advanced platform equipped with four NVIDIA V100 GPUs. They leveraged several key libraries and dependencies, including CUDA and cuDNN for GPU-accelerated computation, as well as PyTorch as the underlying deep learning framework. Our codebase is built upon MMSegmentation [47], a widely recognized and robust library for semantic segmentation tasks. Our experiments employed pseudo-mask distillation, which is consistent with the baseline URN.
For the PASCAL VOC dataset, we maintained a batch size of 16 across the four GPUs, utilizing a learning rate of 0.005 for a total of 20,000 iterations with a poly-policy learning rate scheduler. For the more extensive COCO dataset, we adjusted the parameters to a batch size of 16 and a learning rate of 0.02, and we increased the iterations to 40,000 to accommodate the larger and more complex dataset. During training, we applied data augmentation techniques. The input images were resized to ensure their dimensions ranged between 512 and 2048 pixels, followed by random cropping to a fixed size of 512 × 512. This process was complemented by random horizontal flips and distortions to enhance the model’s generalization capabilities. In the testing phase, we retained the same cropping size as in training to ensure consistency. Postprocessing was performed using Dense-CRF to refine the segmentation masks and improve accuracy.
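As an illustration, the settings described above roughly correspond to an MMSegmentation-style (v0.x) configuration fragment like the one below; values not stated in the text (momentum, weight decay, minimum learning rate, normalization statistics) are common defaults and should be treated as assumptions.

```python
# Illustrative MMSegmentation (v0.x) config fragment for the VOC setup described above.
crop_size = (512, 512)
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375], to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),                      # pseudo-masks as labels
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),                # random color distortions
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]

data = dict(samples_per_gpu=4, workers_per_gpu=2)      # 4 GPUs x 4 = batch size 16
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=20000)
```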

4.2. Experimental Results and Analysis

4.2.1. Comparisons with Baseline

As shown in Table 1, we conducted experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets. Most of the results are trained from image-level annotations, as we focus on methods supervised at the image level. The results are organized by dataset and compared with previous state-of-the-art methods.
PASCAL VOC 2012: We present results on both the validation and test sets of the VOC dataset. Compared to URN, the proposed method with ResNet-101 as the backbone improves by 1.7% on the validation set and 1.8% on the test set. Another common backbone is ResNet-38, which is wider than ResNet-101; the model with ResNet-38 as the backbone achieves 67.9% and 69.1% on the validation and test sets, respectively. For the two more powerful backbone networks, our model also demonstrates excellent performance; in particular, the performance with Res2Net-101 as the backbone is enhanced by 2.1% on the test set.
MS COCO 2014: Compared to the VOC dataset, this dataset is more challenging, with more training images and smaller targets. For the ResNet-101 backbone, URN reaches 35.1%, while our method achieves 36.5%. The performance of the proposed model with Res2Net-101 as the backbone improves by 1.3%, and the model with ResNet-38 as the backbone improves by 1.9%.
The experimental results demonstrate that the proposed model exhibits superior performance on both the PASCAL VOC 2012 and MS COCO 2014 datasets compared to the baseline models. A significant reason for the outstanding performance of our model is the incorporation of an uncertainty-weight transform module. The module dynamically translates the uncertainty of pixels into their corresponding loss weights, thereby mitigating the impact of noisy labels on model performance. The proposed model showed consistent performance improvement across different datasets, indicating its strong generalization capability.

4.2.2. Ablation Studies

Segmentation results: The per-class segmentation results of the proposed method and URN on the VOC12 val set, measured by mIoU, are shown in Table 2. The results clearly indicate that the proposed method outperforms URN in segmentation accuracy for many categories. In particular, for the categories bottle, sheep, and TV monitor, mIoU increases by 9.88%, 8.68%, and 4.72%, respectively. Although the manual weighting of URN is effective, that model requires manual adjustment of appropriate thresholds when adapted to other tasks. In contrast, our uncertainty-weight transform module can customize the weight distribution for every image in any task. Moreover, the performance of our model demonstrates a significant improvement in mitigating the impact of noisy labels.
Functions comparison: The designed RLF function is a combination of LF and RF, and we aim to evaluate the effectiveness of the function combination. The constant value 1 replaces the function RF to form a combined function with the function LF, which is named LFCON. Furthermore, its function curve is shown in Figure 4. Additionally, a constant value of 0.05 is used to replace the LF function, forming another combined function CONRF with the RF function. The curve of this function is shown in Figure 5. The performance comparison of the two formed functions and the RLF function on the VOC12 val and VOC12 test is presented in Table 3. From the table, it is evident that the designed RLF function significantly outperforms other functions. It demonstrates that the uncertainty-weight transform module mitigates the impact of noisy pixels on model performance, effectively reducing the harm caused by noisy labels. Additionally, the constants 1 and 0.05 were selected based on empirical experience from the previous literature.
Weight visualization: The proposed method and the weight postprocessing of URN are visualized in Figure 6. The brightness of each pixel represents the corresponding loss weight. Higher brightness suggests higher weight and lower uncertainty, indicating the pixel belongs to a credible region. Importantly, the proposed model effectively assigns weights to different degrees of noisy pixels. The model can more accurately segment the semantic categories of the corresponding pixels and mitigate the impact of noisy labels. Furthermore, the uncertainty-weight transform module is available for any image pixel. Therefore, our approach enhances the robustness, generalization ability, and noise resistance of the model.
Results visualization: The visualization of segmentation results between the proposed method and URN on the VOC12 validation set is shown in Figure 7. Additionally, Figure 8 shows the visualizations for COCO14 val. The figures demonstrate that the proposed method outperforms the baseline method, yielding more accurate predictions with fewer instances of background being misclassified as foreground. The reason for this is that pixel weights are reasonably allocated in pseudo-masks, enabling the model to learn more accurate semantic information during the training phase.

5. Limitations and Future Work

Although the proposed method has shown promising results in mitigating the impact of noisy labels on model performance, it still has limitations. The current method relies mainly on image-level annotations and has a single source of information, limiting the scope of knowledge encoded by the model. In future research, a promising direction is to incorporate textual information into the model. Text data contain rich semantic information that can help the model better understand the relationships between objects in the image. By introducing textual context, models can be trained to generate pseudo-labels with reduced noise and improved precision.

6. Conclusions

In this paper, we propose a novel method for weakly supervised semantic segmentation that mitigates the impact of noisy labels on segmentation model performance. Specifically, the proposed method does not aim to eliminate noisy labels to create perfect pseudo-masks. Instead, we propose a novel uncertainty-weight transform module, which computes the uncertainty of noisy pixels and transforms it into loss weights. Consequently, the method can effectively reduce the impact of noisy labels on model performance by assigning varying weights to each pixel. Experimental results on both PASCAL VOC 2012 and MS COCO 2014 show that the proposed method achieves state-of-the-art performance, which demonstrates its effectiveness and generalizability.

Author Contributions

All authors contributed in a substantial way to the manuscript. F.Q. contributed to the investigation and the basic structure of the manuscript. J.Y. (Juan Yang) wrote the manuscript. S.T. made contribution to the experiments. G.C. contributed to the review, editing, and analysis of the manuscript. J.Y. (Jingwen Yan) supervised the study for all the stages. All authors have read and agreed to submit the manuscript.

Funding

This work was supported by the State Key Laboratory Major Special Projects of the Jilin Province Science and Technology Development Plan (Grant No. SKL202402024), the Guangdong Provincial University Innovation Team Project (Grant No. 2020KCXTD012), the Guangdong Province Natural Science Foundation (2024A1515011766), and the Songshan Lake Sci-tech Commissioner Program (20234426-01KCJ-G).

Data Availability Statement

The datasets presented in this article are openly available at http://host.robots.ox.ac.uk/pascal/VOC/. The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors sincerely appreciate the academic editors and reviewers who provided help, comments, and constructive suggestions.

Conflicts of Interest

Author Sipeng Tang was employed by the company China Mobile Communications Group Guangdong Co., Ltd. Shantou Branch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kong, L.; Ren, J.; Pan, L.; Liu, Z. Lasermix for semi-supervised lidar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21705–21715. [Google Scholar]
  2. Xie, B.; Li, S.; Li, M.; Liu, C.H.; Huang, G.; Wang, G. Sepico: Semantic-guided pixel contrast for domain adaptive semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9004–9021. [Google Scholar] [CrossRef]
  3. Shen, W.; Peng, Z.; Wang, X.; Wang, H.; Cen, J.; Jiang, D.; Xie, L.; Yang, X.; Tian, Q. A survey on label-efficient deep image segmentation: Bridging the gap between weak supervision and dense prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9284–9305. [Google Scholar] [CrossRef] [PubMed]
  4. Lai, X.; Tian, Z.; Jiang, L.; Liu, S.; Zhao, H.; Wang, L.; Jia, J. Semi-supervised semantic segmentation with directional context-aware consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1205–1214. [Google Scholar]
  5. Hu, R.; Dollár, P.; He, K.; Darrell, T.; Girshick, R. Learning to segment every thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4233–4241. [Google Scholar]
  6. Zhang, P.; Zhang, B.; Zhang, T.; Chen, D.; Wang, Y.; Wen, F. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12414–12424. [Google Scholar]
  7. Tang, M.; Perazzi, F.; Djelouah, A.; Ben Ayed, I.; Schroers, C.; Boykov, Y. On regularized losses for weakly-supervised cnn segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 507–522. [Google Scholar]
  8. Oh, Y.; Kim, B.; Ham, B. Background-aware pooling and noise-aware loss for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6913–6922. [Google Scholar]
  9. Sun, W.; Zhang, J.; Barnes, N. 3d guided weakly supervised semantic segmentation. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November 2020; pp. 585–602. [Google Scholar]
  10. Wang, Y.; Zhang, J.; Kan, M.; Shan, S.; Chen, X. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12275–12284. [Google Scholar]
  11. Fan, J.; Zhang, Z.; Tan, T.; Song, C.; Xiao, J. Cian: Cross-image affinity net for weakly supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10762–10769. [Google Scholar]
  12. Lee, S.; Lee, M.; Lee, J.; Shim, H. Railroad is not a train: Saliency as pseudo-pixel supervision for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5495–5505. [Google Scholar]
  13. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar] [CrossRef]
  14. Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8135–8153. [Google Scholar] [CrossRef] [PubMed]
  15. Liu, S.; Liu, K.; Zhu, W.; Shen, Y.; Fernandez-Granda, C. Adaptive early-learning correction for segmentation from noisy annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2606–2616. [Google Scholar]
  16. Hendrycks, D.; Mazeika, M.; Wilson, D.; Gimpel, K. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31. [Google Scholar]
  17. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.; Mcguinness, K. Unsupervised Label Noise Modeling and Loss Correction. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; ML Research Press: Maastricht, Dutch, 2019; Volume 97, pp. 312–321. [Google Scholar]
  18. Zhang, B.; Xiao, J.; Wei, Y.; Sun, M.; Huang, K. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12765–12772. [Google Scholar]
  19. Chen, Z.; Wang, T.; Wu, X.; Hua, X.S.; Zhang, H.; Sun, Q. Class re-activation maps for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 24 June 2022; pp. 969–978. [Google Scholar]
  20. Li, Y.; Kuang, Z.; Liu, L.; Chen, Y.; Zhang, W. Pseudo-mask matters in weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 6964–6973. [Google Scholar]
  21. Li, Y.; Duan, Y.; Kuang, Z.; Chen, Y.; Zhang, W.; Li, X. Uncertainty estimation via response scaling for pseudo-mask noise mitigation in weakly-supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1447–1455. [Google Scholar]
  22. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 2017, 30, 5574–5584. [Google Scholar]
  23. Krähenbühl, P.; Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. Adv. Neural Inf. Process. Syst. 2011, 24, 109–117. [Google Scholar]
  24. Murphy, C.; Tawn, J.A.; Varty, Z. Automated threshold selection and associated inference uncertainty for univariate extremes. arXiv 2023, arXiv:2310.17999. [Google Scholar]
  25. Kamble, P.M.; Ruikar, D.D.; Houde, K.V.; Hegadi, R.S. Adaptive threshold-based database preparation method for handwritten image classification. In Proceedings of the International Conference on Recent Trends in Image Processing and Pattern Recognition, Msida, Malta, 8–10 December 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 280–288. [Google Scholar]
  26. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  27. Song, C.; Ouyang, W.; Zhang, Z. Weakly Supervised Semantic Segmentation via Box-Driven Masking and Filling Rate Shifting. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15996–16012. [Google Scholar] [CrossRef]
  28. Zhang, D.; Li, H.; Zeng, W.; Fang, C.; Cheng, L.; Cheng, M.M.; Han, J. Weakly supervised semantic segmentation via alternate self-dual teaching. IEEE Trans. Image Process. 2023; early access. [Google Scholar] [CrossRef]
  29. Li, R.; Mai, Z.; Zhang, Z.; Jang, J.; Sanner, S. Transcam: Transformer attention-based cam refinement for weakly supervised semantic segmentation. J. Vis. Commun. Image Represent. 2023, 92, 103800. [Google Scholar] [CrossRef]
  30. Jiang, P.T.; Hou, Q.; Cao, Y.; Cheng, M.M.; Wei, Y.; Xiong, H.K. Integral object mining via online attention accumulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2070–2079. [Google Scholar]
  31. Chang, Y.T.; Wang, Q.; Hung, W.C.; Piramuthu, R.; Tsai, Y.H.; Yang, M.H. Weakly-supervised semantic segmentation via sub-category exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8991–9000. [Google Scholar]
  32. Liu, Y.; Zhang, Y.; Wang, Z.; Yang, F.; Qiu, F.; Coleman, S.; Kerr, D. A novel seminar learning framework for weakly supervised salient object detection. Eng. Appl. Artif. Intell. 2024, 126, 106961. [Google Scholar] [CrossRef]
  33. Fan, J.; Zhang, Z.; Song, C.; Tan, T. Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4283–4292. [Google Scholar]
  34. Yao, Q.; Gong, X. Saliency guided self-attention network for weakly and semi-supervised semantic segmentation. IEEE Access 2020, 8, 14413–14423. [Google Scholar] [CrossRef]
  35. Ma, Z.; Chen, D.; Zhang, C.Y. A Weakly Supervised Semantic Segmentation Method Based on Local Superpixel Transformation. Neural Process. Lett. 2023, 55, 12039–12060. [Google Scholar] [CrossRef]
  36. Zhong, L.; Wang, G.; Liao, X.; Zhang, S. HAMIL: High-Resolution Activation Maps and Interleaved Learning for Weakly Supervised Segmentation of Histopathological Images. IEEE Trans. Med. Imaging 2023, 42, 2912–2923. [Google Scholar] [CrossRef]
  37. Bernhardt, M.; de Castro, D.C.; Tanno, R.; Schwaighofer, A.; Tezcan, K.C.; Monteiro, M.A.B.; Bannur, S.; Lungren, M.P.; Nori, A.; Glocker, B.; et al. Active label cleaning for improved dataset quality under resource constraints. Nat. Commun. 2021, 13, 1161. [Google Scholar] [CrossRef]
  38. Zhang, Y.; Sugiyama, M. Approximating Instance-Dependent Noise via Instance-Confidence Embedding. arXiv 2021, arXiv:2103.13569. [Google Scholar]
  39. Xia, X.; Liu, T.; Wang, N.; Han, B.; Gong, C.; Niu, G.; Sugiyama, M. Are anchor points really indispensable in label-noise learning? Adv. Neural Inf. Process. Syst. 2019, 32, 6838–6849. [Google Scholar]
  40. Zhang, H.; Xing, X.; Liu, L. Dualgraph: A graph-based method for reasoning about label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9654–9663. [Google Scholar]
  41. Chang, H.S.; Learned-Miller, E.; McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. Adv. Neural Inf. Process. Syst. 2017, 30, 1002–1012. [Google Scholar]
  42. Reed, S.E.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; Rabinovich, A. Training Deep Neural Networks on Noisy Labels with Bootstrapping. arXiv 2014, arXiv:1412.6596. [Google Scholar]
  43. Zheng, S.; Wu, P.; Goswami, A.; Goswami, M.; Metaxas, D.; Chen, C. Error-bounded correction of noisy labels. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 11447–11457. [Google Scholar]
  44. Chen, P.; Ye, J.; Chen, G.; Zhao, J.; Heng, P.A. Beyond class-conditional assumption: A primary attempt to combat instance-dependent label noise. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11442–11450. [Google Scholar]
  45. Zhang, S.X.; Zhu, X.; Chen, L.; Hou, J.B.; Yin, X.C. Arbitrary shape text detection via segmentation with probability maps. IEEE Trans. Pattern Anal. Mach. Intell 2022, 45, 2736–2750. [Google Scholar] [CrossRef]
  46. Hariharan, B.; Arbeláez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 991–998. [Google Scholar]
  47. MMSegmentation Contributors. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 3 July 2020).
  48. Lee, J.; Kim, E.; Lee, S.; Lee, J.; Yoon, S. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5267–5276. [Google Scholar]
  49. Li, X.; Zhou, T.; Li, J.; Zhou, Y.; Zhang, Z. Group-wise semantic mining for weakly supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 1984–1992. [Google Scholar]
  50. Liu, Y.; Wu, Y.H.; Wen, P.; Shi, Y.; Qiu, Y.; Cheng, M.M. Leveraging instance-, image-and dataset-level information for weakly supervised instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1415–1428. [Google Scholar] [CrossRef]
  51. Ahn, J.; Kwak, S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4981–4990. [Google Scholar]
  52. Zhang, D.; Zhang, H.; Tang, J.; Hua, X.S.; Sun, Q. Causal intervention for weakly-supervised semantic segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 655–666. [Google Scholar]
  53. Ahn, J.; Cho, S.; Kwak, S. Weakly supervised learning of instance segmentation with inter-pixel relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2209–2218. [Google Scholar]
  54. Shimoda, W.; Yanai, K. Self-supervised difference detection for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5208–5217. [Google Scholar]
  55. Sun, G.; Wang, W.; Dai, J.; Van Gool, L. Mining cross-image semantics for weakly supervised semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 347–365. [Google Scholar]
  56. Chen, Q.; Yang, L.; Lai, J.H.; Xie, X. Self-supervised image-specific prototype exploration for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4288–4298. [Google Scholar]
  57. Pan, J.; Zhu, P.; Zhang, K.; Cao, B.; Wang, Y.; Zhang, D.; Han, J.; Hu, Q. Learning self-supervised low-rank network for single-stage weakly and semi-supervised semantic segmentation. Int. J. Comput. Vis. 2022, 130, 1181–1195. [Google Scholar] [CrossRef]
Figure 1. Two-stage training pipeline for weakly supervised semantic segmentation.
Figure 2. The overview of the proposed method. The blue arrows mean the multi-scaled CRF, which is leveraged for postprocessing of the prediction results. The proposed uncertainty estimation module is applied to estimate the uncertainty of these results and convert pixel uncertainty into loss weight.
Figure 3. The RLF curves generated by different α values.
Figure 4. Curve of the function obtained by replacing the RF function with a constant value of 1. The part below α is still represented by the LF function.
Figure 5. Curve of the function using a constant value of 0.05 instead of the LF function. The part greater than α is still represented by the RF function.
Figure 6. Visualization weights of the proposed method and URN on the VOC12 val dataset. The first column (a) shows the input images with ground truth, the second column (b) shows the input images with pseudo-masks, the third column (c) shows the segmentation results of URN, and the fourth column (d) shows the segmentation results of the proposed method.
Figure 7. Visualization results of the proposed method and URN on the VOC12 val dataset. The first column (a) shows the input images with ground truth, the second column (b) shows the segmentation results of URN, and the third column (c) shows the segmentation results of our method.
Figure 8. Visualization results of our method and URN on the COCO 2014 val dataset. The first column (a) shows the input images with ground truth, the second column (b) shows the segmentation results of URN, and the third column (c) shows the segmentation results of the proposed method on COCO.
Table 1. Performance comparison with state-of-the-art WSSS methods on VOC 2012 and COCO 2014. The middle part lists the methods with extra supervision, including image-level, saliency, and SOP. SOP is segment-based object proposals. Other extra information is about data. For each backbone, the best result is bolded.
Method | Backbone | Supervision | VOC12 Val | VOC12 Test | COCO14 Val
FickleNet [48] | ResNet-101 | Image-level + Saliency | 64.9% | 65.3% | -
OAA [30] | ResNet-101 | Image-level + Saliency | 65.2% | 66.4% | -
SGAN [34] | ResNet-101 | Image-level + Saliency | 67.1% | 67.2% | 33.6%
ICD [33] | ResNet-101 | Image-level + Saliency | 67.8% | 68.0% | -
GWSM [49] | ResNet-101 | Image-level + Saliency | 68.2% | 68.5% | 28.4%
LIID [50] | ResNet-101 | Image-level + SOP | 66.5% | 67.5% | -
LIID [50] | Res2Net-101 | Image-level + SOP | 68.4% | 68.0% | -
AffinityNet [51] | ResNet-38 | Image-level | 61.7% | 63.7% | -
SEAM [10] | ResNet-38 | Image-level | 64.5% | 65.7% | 31.7%
CONTA [52] | ResNet-38 | Image-level | 66.1% | 66.7% | 32.8%
PMM [20] | ResNet-38 | Image-level | 68.5% | 69.0% | 34.7%
IRNet [53] | ResNet-50 | Image-level | 63.5% | 64.8% | -
OAA [30] | ResNet-101 | Image-level | 63.9% | 65.6% | -
ICD [33] | ResNet-101 | Image-level | 64.1% | 64.3% | -
SSDD [54] | ResNet-101 | Image-level | 64.9% | 65.5% | -
SC-CAM [31] | ResNet-101 | Image-level | 66.1% | 65.9% | -
MCIS [55] | ResNet-101 | Image-level | 66.2% | 66.9% | -
SIPE [56] | ResNet-101 | Image-level | 68.7% | 67.8% | 36.7%
SLRNet [57] | ResNet-101 | Image-level | 68.0% | 68.4% | 35.0%
PMM [20] | ScaleNet-101 | Image-level | 67.1% | 67.7% | 35.2%
PMM [20] | Res2Net-101 | Image-level | 68.7% | 68.7% | 34.7%
URN [21] | ResNet-38 | Image-level | 67.1% | 67.9% | 34.8%
URN [21] | ResNet-101 | Image-level | 65.9% | 66.3% | 35.1%
URN [21] | ScaleNet-101 | Image-level | 68.4% | 69.0% | 35.2%
URN [21] | Res2Net-101 | Image-level | 67.6% | 67.7% | 36.0%
Our Method | ResNet-38 | Image-level | 67.9% | 69.1% | 36.7%
Our Method | ResNet-101 | Image-level | 67.6% | 68.1% | 36.5%
Our Method | ScaleNet-101 | Image-level | 69.0% | 69.9% | 36.6%
Our Method | Res2Net-101 | Image-level | 69.3% | 69.8% | 37.7%
Table 2. Performance comparison with baseline URN on VOC12 val. The experimental results are obtained by using Res2Net-101 as the backbone network. The comparison results included 20 classes and 1 background. When the proposed model performs better than the baseline, the model results are bolded.
Class | URN (mIoU) | Our Method (mIoU)
background | 90.60% | 91.14%
aeroplane | 78.83% | 79.74%
bicycle | 33.56% | 35.13%
bird | 88.95% | 87.19%
boat | 53.42% | 57.88%
bottle | 61.60% | 71.48%
bus | 85.68% | 85.63%
car | 82.18% | 79.21%
cat | 89.24% | 88.50%
chair | 31.18% | 30.64%
cow | 87.06% | 85.88%
diningtable | 54.70% | 55.33%
dog | 82.16% | 85.34%
horse | 84.23% | 85.23%
motorbike | 74.70% | 73.87%
person | 76.12% | 76.66%
pottedplant | 46.03% | 48.08%
sheep | 72.36% | 81.04%
sofa | 45.32% | 44.45%
train | 56.81% | 57.81%
tvmonitor | 49.45% | 54.17%
Table 3. Performance comparison of the RLF function and its two variant functions on the VOC12 val and VOC12 test.
Method | Backbone | VOC12 Val | VOC12 Test
LFCON/CONRF/RLF | ResNet-38 | 65.41%/66.64%/67.9% | 67.03%/66.28%/69.1%
LFCON/CONRF/RLF | ResNet-101 | 65.92%/66.21%/67.6% | 66.89%/67.30%/68.1%
LFCON/CONRF/RLF | ScaleNet-101 | 67.59%/68.14%/69.0% | 67.52%/68.46%/69.9%
LFCON/CONRF/RLF | Res2Net-101 | 67.49%/67.67%/69.3% | 67.70%/67.42%/69.8%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qian, F.; Yang, J.; Tang, S.; Chen, G.; Yan, J. Addressing Noisy Pixels in Weakly Supervised Semantic Segmentation with Weights Assigned. Mathematics 2024, 12, 2520. https://doi.org/10.3390/math12162520

AMA Style

Qian F, Yang J, Tang S, Chen G, Yan J. Addressing Noisy Pixels in Weakly Supervised Semantic Segmentation with Weights Assigned. Mathematics. 2024; 12(16):2520. https://doi.org/10.3390/math12162520

Chicago/Turabian Style

Qian, Feng, Juan Yang, Sipeng Tang, Gao Chen, and Jingwen Yan. 2024. "Addressing Noisy Pixels in Weakly Supervised Semantic Segmentation with Weights Assigned" Mathematics 12, no. 16: 2520. https://doi.org/10.3390/math12162520


