3.1. Architecture
Figure 2 illustrates an overview of the proposed panoptic segmentation network. In this section, we introduce the proposed center-guided query selection module and transformer decoder with decoupling mask.
Backbone and transformer encoder: The backbone extracts image features from an input image $I \in \mathbb{R}^{3 \times H \times W}$, and the transformer encoder generates a new feature map $F \in \mathbb{R}^{C \times H_F \times W_F}$ from the image features, where $C$, $H_F$, and $W_F$ denote the channel, height, and width dimensions of $F$, respectively. We employ ResNet50 [21] for the backbone and the transformer encoder in [9]. The transformer encoder consists of deformable attention [10], layer normalization, and a feed-forward network (FFN). Feature map $F$ is gradually upsampled to a center embedding $E_{\mathrm{ctr}} \in \mathbb{R}^{C \times H_E \times W_E}$ and a mask embedding $E_{\mathrm{mask}} \in \mathbb{R}^{C \times H_E \times W_E}$ through two sets of a convolution layer and a bilinear interpolation operation, where $H_E$ and $W_E$ are the spatial dimensions of the upsampled embeddings. Also, $F$ is fed into the transformer decoder for attention mechanisms with queries.
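For concreteness, the upsampling path from $F$ to the embeddings can be written as a minimal PyTorch sketch; the module name, kernel sizes, and $\times 2$ scale factors below are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingHead(nn.Module):
    """Sketch of the upsampling path from the encoder feature map to an
    embedding; kernel sizes and the x2 scale factors are assumptions."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Two sets of (convolution layer + bilinear interpolation), as in the text.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (1, C, H_F, W_F) encoder feature map F
        x = F.interpolate(self.conv1(feat), scale_factor=2, mode="bilinear")
        x = F.interpolate(self.conv2(x), scale_factor=2, mode="bilinear")
        return x  # (1, C, H_E, W_E)


# One instance each for the center and mask embeddings (an assumption):
# e_ctr, e_mask = EmbeddingHead()(f), EmbeddingHead()(f)
```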
Center-guided query selection: Traditional transformer-based panoptic segmentation models [6,8,9] typically use randomly initialized embeddings to learn queries without distinguishing between things and stuff. The proposed network learns things and stuff separately to prevent thing queries and stuff queries from interfering with each other. Inspired by center-based learning for object detection [16,17,18,19], we develop the mechanism of center-guided query selection for thing queries. The center regions of individual instances in input images contain the cues to distinguish different instances. Thus, we estimate a center heatmap to guide effective thing query selection.
Figure 3 shows the diagram of the proposed center-guided query selection. Center embedding $E_{\mathrm{ctr}}$ passes through the FFN to estimate center heatmap $H_{\mathrm{ctr}} \in \mathbb{R}^{1 \times H_E \times W_E}$, which contains the location information of the instances. Then, we obtain center-guided feature $F_{\mathrm{cg}}$ using element-wise multiplication between the estimated center heatmap $H_{\mathrm{ctr}}$ and each channel of $E_{\mathrm{ctr}}$. To this end, we employ the feature selection process in [10,11] to determine the top-$K$ query features from center-guided feature $F_{\mathrm{cg}}$: $F_{\mathrm{cg}}$ passes through a linear layer and softmax to obtain class probability map $P_{\mathrm{th}} \in \mathbb{R}^{N_{\mathrm{th}} \times H_E \times W_E}$ for things, where $N_{\mathrm{th}}$ is the number of thing classes. Then, we pick the highest probability from $P_{\mathrm{th}}$ for each pixel and construct thing query $Q_{\mathrm{th}} \in \mathbb{R}^{K \times C}$ by selecting the top-$K$ features from $F_{\mathrm{cg}}$ in terms of the highest probabilities extracted from $P_{\mathrm{th}}$. Since center heatmap $H_{\mathrm{ctr}}$ has high values on the central parts of the instances, $F_{\mathrm{cg}}$ conveys strong visual patterns related to the instances to obtain effective thing query $Q_{\mathrm{th}}$. Note that we only perform center-guided query selection for thing queries $Q_{\mathrm{th}}$, while we simply set stuff queries $Q_{\mathrm{st}}$ as $N_{\mathrm{st}}$ randomly initialized embeddings.
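The selection procedure can be summarized in a short PyTorch sketch; the function and tensor names are ours, and the single-layer heads (`heatmap_ffn` as a stand-in for the FFN, `cls_linear` for the classification layer) are simplifying assumptions.

```python
import torch


def center_guided_query_selection(e_ctr, heatmap_ffn, cls_linear, k):
    """Sketch of center-guided thing-query selection (see Figure 3).
    e_ctr: center embedding E_ctr, shape (C, H_E, W_E).
    heatmap_ffn / cls_linear: e.g., nn.Linear(C, 1) and nn.Linear(C, N_th)."""
    C, H, W = e_ctr.shape
    pix = e_ctr.flatten(1).t()                        # (H*W, C) pixel features
    # 1. Estimate center heatmap H_ctr from the center embedding.
    h_ctr = torch.sigmoid(heatmap_ffn(pix)).reshape(H, W)
    # 2. Multiply each channel of E_ctr by the heatmap (element-wise).
    f_cg = e_ctr * h_ctr.unsqueeze(0)                 # (C, H, W) center-guided feature
    f_pix = f_cg.flatten(1).t()                       # (H*W, C)
    # 3. Per-pixel thing-class probabilities via a linear layer and softmax.
    p_th = torch.softmax(cls_linear(f_pix), dim=-1)   # (H*W, N_th)
    # 4. Highest class probability per pixel, then top-K pixels as thing queries.
    scores = p_th.max(dim=-1).values
    topk = scores.topk(k).indices
    return f_pix[topk], h_ctr                         # Q_th: (K, C)
```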
Transformer decoder with decoupling mask: The queries must be trained to carry enough information to derive classes and masks. For this purpose, $Q_{\mathrm{th}}$ and $Q_{\mathrm{st}}$ are concatenated as $Q = [Q_{\mathrm{th}}; Q_{\mathrm{st}}]$ and fed to the transformer decoder, which includes self-attention, deformable attention, and the FFN, as in Figure 4. Considering the different properties between things and stuff, we apply decoupling mask $M$ to self-attention, where $M$'s element $M_{ij}$ is defined as
$$M_{ij} = \begin{cases} 0, & \text{if the } i\text{-th and } j\text{-th queries are of the same type (both thing or both stuff)}, \\ -\infty, & \text{otherwise}. \end{cases}$$
Then, self-attention in the transformer decoder is formulated as
$$\mathrm{SelfAttn}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{\top}}{\sqrt{C}} + M\right) v,$$
where $q$, $k$, and $v$ are the query, key, and value extracted from $Q$ through a linear layer, respectively. We prevent interference between thing and stuff queries using decoupling mask $M$. For the stability of the learning process, we use the residual connection with $Q$ and perform layer normalization after the residual connection. After the self-attention process, we use deformable attention to inject $F$ into $Q$, resulting in enhanced query set $\hat{Q}$.
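A minimal single-head PyTorch sketch of this decoupled self-attention step follows; the projection modules `w_q`, `w_k`, `w_v` and the convention that thing queries occupy the first $K$ rows of $Q$ are our assumptions.

```python
import torch
import torch.nn.functional as F


def decoupled_self_attention(q_all, w_q, w_k, w_v, k_th):
    """Single-head sketch of self-attention with the decoupling mask M.
    q_all: concatenated queries Q, shape (N, C), thing queries first.
    w_q/w_k/w_v: linear projections (e.g., nn.Linear(C, C)); k_th = K."""
    N, C = q_all.shape
    # M_ij = 0 if queries i and j are the same type (thing/thing or
    # stuff/stuff) and -inf otherwise, so attention never mixes the groups.
    is_thing = torch.zeros(N, dtype=torch.bool)
    is_thing[:k_th] = True
    same_type = is_thing.unsqueeze(0) == is_thing.unsqueeze(1)  # (N, N)
    m = torch.zeros(N, N).masked_fill(~same_type, float("-inf"))
    # Masked scaled dot-product attention.
    q, k, v = w_q(q_all), w_k(q_all), w_v(q_all)
    out = torch.softmax(q @ k.t() / C ** 0.5 + m, dim=-1) @ v
    # Residual connection with Q followed by layer normalization.
    return F.layer_norm(q_all + out, (C,))
```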
Estimation: Masks and classes are estimated from enhanced query set $\hat{Q}$. First, masks are computed using the dot product between $\hat{Q}$ and mask embedding $E_{\mathrm{mask}}$. Second, $\hat{Q}$ passes through a fully connected layer to predict the class probability. Finally, we obtain panoptic segmentation results from the mask and class predictions.
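The estimation step reduces to a per-query dot product and a linear classifier, as in the following sketch; the names and the `num_classes + 1` no-object convention are assumptions.

```python
import torch


def estimate_masks_and_classes(q_hat, e_mask, class_fc):
    """Sketch of the estimation step.
    q_hat:  enhanced query set, shape (N, C).
    e_mask: mask embedding, shape (C, H_E, W_E).
    class_fc: fully connected layer, e.g., nn.Linear(C, num_classes + 1)."""
    # Dot product of each query with every pixel embedding -> one mask per query.
    masks = torch.einsum("nc,chw->nhw", q_hat, e_mask)  # (N, H_E, W_E) logits
    # Class probabilities per query.
    probs = class_fc(q_hat).softmax(dim=-1)             # (N, num_classes + 1)
    return masks.sigmoid(), probs
```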
3.2. Loss
The proposed network outputs $N = K + N_{\mathrm{st}}$ predictions, including masks and classes. Then, we perform the Hungarian algorithm [22] to match predictions and ground truths, following [6,7,8,9]. For each match, we compute the focal loss [23] between class probability prediction $\hat{p}$ and ground truth $p$ as follows:
$$L_{\mathrm{cls}} = -\lambda_{\mathrm{cls}}\,\alpha_t\,(1 - p_t)^{\gamma}\log(p_t), \qquad (p_t, \alpha_t) = \begin{cases} (\hat{p}, \alpha), & \text{if } p = 1, \\ (1 - \hat{p}, 1 - \alpha), & \text{otherwise}, \end{cases}$$
where $\lambda_{\mathrm{cls}}$, $\alpha$, and $\gamma$ were experimentally set to 4, 0.25, and 2, respectively.
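With the stated settings, the classification loss amounts to a few lines of PyTorch; the mean reduction and the log clamp below are our assumptions.

```python
import torch


def class_focal_loss(p_hat, p, lam=4.0, alpha=0.25, gamma=2.0):
    """Focal loss between predicted class probability p_hat and binary
    ground truth p, with lambda_cls=4, alpha=0.25, gamma=2 as in the text."""
    p_t = torch.where(p == 1, p_hat, 1.0 - p_hat)
    a_t = torch.where(p == 1,
                      torch.full_like(p_hat, alpha),
                      torch.full_like(p_hat, 1.0 - alpha))
    # clamp avoids log(0); mean reduction is an assumption
    return (-lam * a_t * (1.0 - p_t) ** gamma
            * torch.log(p_t.clamp_min(1e-8))).mean()
```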
Also, to compare the estimated mask $\hat{m}$ and ground truth $m$, we employ the mask loss ($L_{\mathrm{mask}}$) in [8], which is composed of per-pixel cross-entropy loss $L_{\mathrm{ce}}$ and dice loss [24] $L_{\mathrm{dice}}$:
$$L_{\mathrm{mask}} = \lambda_{\mathrm{ce}}\,L_{\mathrm{ce}}(\hat{m}, m) + \lambda_{\mathrm{dice}}\,L_{\mathrm{dice}}(\hat{m}, m),$$
where $\lambda_{\mathrm{ce}}$ and $\lambda_{\mathrm{dice}}$ were set to 5 and 5, according to [9].
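A corresponding sketch of the mask loss, taking mask logits as input; the smoothing constant in the dice term is our assumption.

```python
import torch
import torch.nn.functional as F


def mask_loss(m_hat, m, lam_ce=5.0, lam_dice=5.0, eps=1.0):
    """Per-pixel binary cross-entropy plus dice loss, both weighted by 5.
    m_hat: predicted mask logits; m: binary ground-truth mask."""
    prob = m_hat.sigmoid()
    l_ce = F.binary_cross_entropy_with_logits(m_hat, m)
    inter = (prob * m).sum()
    l_dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + m.sum() + eps)
    return lam_ce * l_ce + lam_dice * l_dice
```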
Additionally, to train the center-guided query selection module, we generate ground-truth heatmap $H_{\mathrm{gt}}$ by applying Gaussian distributions to all instance center points for each image. Then, we compute the focal loss between the predicted center heatmap $H_{\mathrm{ctr}}$ and $H_{\mathrm{gt}}$.
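Ground-truth heatmap generation can be sketched as follows, in the style of center-based detectors [16,17,18,19], taking the per-pixel maximum where Gaussians overlap; the standard deviation `sigma` is an assumption.

```python
import torch


def gaussian_center_heatmap(centers, height, width, sigma=2.0):
    """Sketch of ground-truth heatmap H_gt: one Gaussian per instance
    center (cx, cy), combined by per-pixel maximum."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    heatmap = torch.zeros(height, width)
    for cx, cy in centers:
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heatmap = torch.maximum(heatmap, g)  # overlapping Gaussians: take max
    return heatmap
```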