Instance Sequence Queries for Video Instance Segmentation with Transformers
Abstract
1. Introduction
2. Related Work
2.1. Video Instance Segmentation
2.2. Transformer
3. Materials and Methods
3.1. Background: DETR
- A CNN backbone (e.g., ResNet [22]) takes an input image and extracts a compact feature representation;
- A transformer encoder encodes the image features with multi-head self-attention modules;
- A transformer decoder uses multi-headed attention mechanisms to decode the N learnt object queries;
- A feed forward network (FFN) computes the final prediction of the class and box for each object query.
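The four stages above can be sketched as a minimal, runnable outline. This is an illustration only, not the authors' implementation: the attention computations are elided, and the names and shapes (`N_QUERIES`, `HIDDEN_DIM`, `detr_head`, the stub class/box heads) are assumptions chosen for the sketch. The one point it makes concrete is the fixed mapping DETR uses: each of the N object queries yields exactly one (class, box) prediction, with an extra "no object" class.

```python
# Illustrative sketch of the DETR prediction flow (backbone, encoder and
# decoder attention elided). Names and shapes are assumptions for this
# sketch, not the paper's implementation.
from dataclasses import dataclass
from typing import List, Tuple

N_QUERIES = 100   # N learnt object queries (DETR uses N = 100)
HIDDEN_DIM = 256  # transformer hidden size assumed here

@dataclass
class Prediction:
    class_logits: List[float]  # one logit per class, plus "no object"
    box: Tuple[float, float, float, float]  # (cx, cy, w, h), normalised

def detr_head(decoded_queries: List[List[float]],
              num_classes: int) -> List[Prediction]:
    """FFN prediction head: one (class, box) output per decoded query."""
    preds = []
    for q in decoded_queries:
        # Stub in place of the real FFNs: summarise the query embedding
        # into a scalar so the sketch stays self-contained and runnable.
        s = sum(q) / len(q)
        class_logits = [s] * (num_classes + 1)       # classification branch
        size = min(abs(s), 1.0)
        box = (0.5, 0.5, size, size)                 # box-regression branch
        preds.append(Prediction(class_logits, box))
    return preds

# The decoder always emits N_QUERIES embeddings, so the head always emits
# N_QUERIES predictions, many of which are classified as "no object".
queries = [[0.1] * HIDDEN_DIM for _ in range(N_QUERIES)]
predictions = detr_head(queries, num_classes=40)
assert len(predictions) == N_QUERIES
assert len(predictions[0].class_logits) == 41  # 40 classes + "no object"
```

The fixed-size output set is what makes the bipartite (Hungarian) matching between predictions and ground-truth objects possible during training.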
3.2. Instance Sequence Queries
3.2.1. Definition
3.2.2. Inference Pipeline
3.2.3. Architecture
3.2.4. Instance Sequence Queries Update
3.2.5. Idle Queries Rollback
3.3. Training
4. Results
4.1. Dataset
4.2. Implementation Details
4.3. Main Results
5. Discussion
5.1. DETR Baseline
5.2. New Instance Detection
5.3. Ability to Adjust Queries
5.4. Query Update
5.5. Idle Queries Rollback
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y. YOLACT: Real-Time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 9156–9165.
- Liang, J.; Homayounfar, N.; Ma, W.; Xiong, Y.; Hu, R.; Urtasun, R. PolyTransform: Deep Polygon Transformer for Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 9128–9137.
- Cao, J.; Anwer, R.M.; Cholakkal, H.; Khan, F.; Pang, Y.; Shao, L. SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–18.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
- Lee, Y.; Park, J. CenterMask: Real-Time Anchor-Free Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 13903–13912.
- Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid Task Cascade for Instance Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4969–4978.
- Chen, X.; Girshick, R.B.; He, K.; Dollár, P. TensorMask: A Foundation for Dense Object Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 2061–2069.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27 October–2 November 2019; pp. 1971–1980.
- Yang, L.; Fan, Y.; Xu, N. Video Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5187–5196.
- Bertasius, G.; Torresani, L. Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 9736–9745.
- Feng, Q.; Yang, Z.; Li, P.; Wei, Y.; Yang, Y. Dual Embedding Learning for Video Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27 October–2 November 2019; pp. 717–720.
- Athar, A.; Mahadevan, S.; Osep, A.; Leal-Taixé, L.; Leibe, B. STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 158–177.
- Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-End Video Instance Segmentation with Transformers. arXiv 2020, arXiv:2011.14503.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872.
- Hwang, S.; Heo, M.; Oh, S.W.; Kim, S.J. Video Instance Segmentation using Inter-Frame Communication Transformers. arXiv 2021, arXiv:2106.03299.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. arXiv 2017, arXiv:1706.03762.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Sun, P.; Jiang, Y.; Zhang, R.; Xie, E.; Cao, J.; Hu, X.; Kong, T.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple-Object Tracking with Transformer. arXiv 2020, arXiv:2012.15460.
- Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. arXiv 2021, arXiv:2101.02702.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Kuhn, H. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
- Rezatofighi, S.H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666.
- Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
- Lin, T.Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
- Lin, T.Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
Method | Backbone | Resolution | AP | AP50 | AP75 | AR1 | AR10 | FPS
---|---|---|---|---|---|---|---|---
MT RCNN [10] | ResNet-50 | 640 × 360 | 30.3 | 51.1 | 32.6 | 31.0 | 35.5 | 20.0 |
SipMask [3] | ResNet-50 | 640 × 360 | 33.7 | 54.1 | 35.8 | 35.4 | 40.1 | 30.0 |
MaskProp [11] | ResNet-50 | 640 × 360 | 40.0 | − | 42.9 | − | − | − |
MaskProp [11] | ResNet-101 | 640 × 360 | 42.5 | − | 45.6 | − | − | − |
STEm-Seg [13] | ResNet-50 | 640∼1196 | 30.6 | 50.7 | 33.5 | 31.6 | 37.1 | − |
STEm-Seg [13] | ResNet-101 | 640∼1196 | 34.6 | 55.8 | 37.9 | 34.4 | 41.6 | − |
VisTR [14] | ResNet-50 | 540 × 300 | 34.4 | 55.7 | 36.5 | 33.5 | 38.9 | 30.0 |
VisTR [14] | ResNet-101 | 540 × 300 | 35.3 | 57.0 | 36.2 | 34.3 | 40.4 | 27.7 |
IFC (online) [16] | ResNet-50 | 640 × 360 | 41.0 | 62.1 | 45.4 | 43.5 | 52.7 | 46.5 |
IFC (offline) [16] | ResNet-50 | 640 × 360 | 42.8 | 65.8 | 46.8 | 43.8 | 51.2 | 107.1 |
IFC (offline) [16] | ResNet-101 | 640 × 360 | 44.6 | 69.2 | 49.5 | 44.0 | 52.1 | 89.4 |
Our Method | ResNet-50 | 640 × 360 | 34.4 | 54.9 | 37.7 | 33.3 | 38.1 | 33.5 |
Our Method | ResNet-101 | 640 × 360 | 35.5 | 56.6 | 39.2 | 34.8 | 39.9 | 26.6 |
Excluded Frames | Random Initial | Learnt Initial |
---|---|---|
No exclusion | 32.7 | 34.1 |
1 | 29.0 | 30.5 |
1 to 3 | 22.1 | 23.2 |
1 to 5 | 16.9 | 17.3 |
1 to 7 | 12.1 | 12.4 |
1 to 9 | 7.3 | 7.5 |
Margin | w/o | 0 | 0.05 | 0.10 | 0.15 | 0.20
---|---|---|---|---|---|---
AP | 33.6 | 33.1 | 34.4 | 34.1 | 34.2 | 34.2
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xu, Z.; Vivet, D. Instance Sequence Queries for Video Instance Segmentation with Transformers. Sensors 2021, 21, 4507. https://doi.org/10.3390/s21134507