FreeMix: Open-Vocabulary Domain Generalization of Remote-Sensing Images for Semantic Segmentation
Abstract
1. Introduction
1. We introduce a new setting for semantic segmentation, open-vocabulary domain generalization (OVDG), an important yet unstudied problem. We also propose an effective framework, FreeMix, for OVDG, which learns a generalized model by integrating entity masks to enhance the diversity and completeness of masks for both base and novel classes.
2. We propose a dual-branch universal segmentation module that unifies the base segmentation branch (BSB) and the entity segmentation branch (ESB) in an end-to-end trainable framework, where the BSB leverages a self-supervised pretrained model, CMID, to extract domain-agnostic visual features for decoding masks and semantic logits.
3. To integrate and leverage information from multiple source domains, we propose a simple yet effective training strategy called dataset-aware sampling (DAS); a minimal sketch of one plausible DAS loop follows this list. Extensive experiments on four benchmark datasets show that the proposed method outperforms state-of-the-art methods on both the open-vocabulary segmentation and OVDG benchmarks.
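As a concrete reference for the dataset-aware sampling in contribution 3, the sketch below shows one plausible DAS training loop in PyTorch, assuming each source domain gets its own DataLoader and every batch is drawn from exactly one dataset. The helper name `dataset_aware_batches`, the uniform choice over datasets, and the restart-on-exhaustion behavior are our assumptions, not details taken from the paper.

```python
import random
from typing import Dict, Iterator, Tuple

from torch.utils.data import DataLoader


def dataset_aware_batches(loaders: Dict[str, DataLoader],
                          steps: int) -> Iterator[Tuple[str, object]]:
    """Yield (dataset_name, batch) pairs, one source dataset per batch.

    Sampling is uniform over datasets (not over samples), so small source
    domains are not drowned out by large ones; exhausted loaders restart.
    """
    iterators = {name: iter(loader) for name, loader in loaders.items()}
    names = list(loaders)
    for _ in range(steps):
        name = random.choice(names)  # pick a source domain for this step
        try:
            batch = next(iterators[name])
        except StopIteration:
            # Restart this domain's loader once it is exhausted.
            iterators[name] = iter(loaders[name])
            batch = next(iterators[name])
        yield name, batch
```

The ablation tables later in the paper contrast DAS only with a "random" training tactic, so whether the per-batch dataset choice is uniform or size-weighted is left open in this sketch.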
2. Related Works
2.1. Open-Vocabulary Semantic Segmentation
2.2. Domain Generalization
2.3. Self-Supervised Learning in Remote Sensing
3. Proposed Method
3.1. Problem Definition
3.2. Overview
3.3. Universal Segmentation Module
3.3.1. Base Segmentation Branch (BSB)
3.3.2. Entity Segmentation Branch (ESB)
1. Entity mask extractor. An entity is a single semantically coherent region within an image. Entity segmentation is an emerging task targeting open-world, class-agnostic, dense image segmentation, and it is designed to generalize well to novel classes [80]. CropFormer [79], trained on a large-scale, high-quality entity segmentation dataset spanning multiple domains, including remote-sensing imagery, is therefore well suited to extracting class-agnostic masks. In the entity mask extractor, we use N K-dimensional queries to generate the entity masks M.
2. Visual feature extractor. The visual feature extractor is based on the vision Transformer (ViT) architecture and consists of 12 Transformer blocks, denoted $\{B_1, B_2, \ldots, B_{12}\}$. Each block comprises a multihead attention layer followed by two MLP layers with GELU [81] non-linearity; layer normalization is applied before each layer, and residual connections are added after each layer. In the first k Transformer blocks, the visual feature extractor encodes the entire image into a representation $F \in \mathbb{R}^{(hw+1) \times d}$, where h and w denote the height and width of the attention map in the ViT, the additional 1 corresponds to the semantic logits for the entire image, and d denotes the feature dimensionality. In the remaining Transformer blocks, to extract the semantic logits for the entity masks M, we assign an independent classification query to each entity mask by repeating the semantic logits of the entire image N times. We then use the entity masks as the attention bias A in the multihead attention mechanism:

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + A\right)V, \qquad A_{ij} = \begin{cases} 0, & \text{if entity mask } i \text{ covers location } j, \\ -\infty, & \text{otherwise,} \end{cases}$

so that each classification query attends only to the image locations covered by its entity mask.
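The attention-bias step can be made concrete with a short PyTorch sketch. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`entity_attention_bias`, `biased_attention`), the single-head formulation, the 0.5 binarization threshold, and the toy shapes are ours; only the additive softmax(QK^T/sqrt(d) + A)V mechanism mirrors the formula above.

```python
import torch
import torch.nn.functional as F


def entity_attention_bias(entity_masks: torch.Tensor) -> torch.Tensor:
    """Turn binary entity masks (N, h*w) into an additive attention bias.

    Locations inside an entity keep a bias of 0; all other locations get a
    very large negative value, so softmax drives their weight toward zero.
    """
    zeros = torch.zeros_like(entity_masks, dtype=torch.float32)
    neg = torch.full_like(zeros, torch.finfo(torch.float32).min)
    return torch.where(entity_masks > 0.5, zeros, neg)


def biased_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     bias: torch.Tensor) -> torch.Tensor:
    """Single-head attention with an additive bias: softmax(QK^T/sqrt(d) + A)V."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias  # (N, h*w)
    return F.softmax(logits, dim=-1) @ v                # (N, d)


# Toy shapes: N entity queries attend only to the patches their mask covers.
N, hw, d = 4, 16 * 16, 64
queries = torch.randn(N, d)                # repeated image-level logits, one per mask
patches = torch.randn(hw, d)               # ViT patch tokens (keys == values here)
masks = (torch.rand(N, hw) > 0.7).float()  # stand-in binary entity masks
out = biased_attention(queries, patches, patches, bias=entity_attention_bias(masks))
print(out.shape)  # torch.Size([4, 64])
```

Multi-head splitting, the first k unbiased blocks, and the N-fold repetition of the image-level semantic logits are omitted for brevity.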
3.4. Training Tactics: Dataset-Aware Sampling
4. Materials and Experimental Settings
4.1. Experimental Datasets and Processing
4.2. Implementation Details
During training, the following components are kept frozen (a minimal freezing sketch follows this list):
- The text and image encoders of CLIP, to preserve the learned multimodal representations;
- The entity mask extractor in the ESB, to maintain the robustness of the extracted masks;
- The backbone of the base segmentation branch, after initializing it with self-supervised pretraining weights.
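A minimal PyTorch sketch of this freezing scheme is shown below. The stand-in modules and the CMID checkpoint path are hypothetical placeholders; only the initialize-then-freeze pattern reflects the list above.

```python
import torch
from torch import nn


def freeze(module: nn.Module) -> None:
    """Disable gradients and switch to eval mode so BN/Dropout stay fixed."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()


# Stand-in modules; in FreeMix these would be the CLIP encoders, the ESB
# entity mask extractor, and the self-supervised-initialized BSB backbone.
clip_text_encoder = nn.Linear(512, 512)
entity_mask_extractor = nn.Conv2d(3, 16, 3)
bsb_backbone = nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64))

# 1) (Hypothetical path) load self-supervised CMID weights into the backbone:
# bsb_backbone.load_state_dict(torch.load("cmid_pretrained.pth"), strict=False)
# 2) Freeze all three parts before training the rest of the model:
for m in (clip_text_encoder, entity_mask_extractor, bsb_backbone):
    freeze(m)

trainable = [n for n, p in bsb_backbone.named_parameters() if p.requires_grad]
print(trainable)  # [] -- nothing in the frozen modules will be updated
```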
4.3. Evaluation Metrics
5. Results and Discussion
5.1. Comparison with SOTA Methods
5.1.1. Results of Open-Vocabulary Semantic Segmentation
5.1.2. Results of OVDG
5.2. Experiments on Multi-Source Domain
5.3. Ablation Experiments
5.4. Additional Experimental Results
5.4.1. Performance on Different Image Encoders of BSB
5.4.2. Performance on Scaling Model Size of ESB
5.4.3. Comparison of the Extracted Proposal Masks
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D. ISPRS Semantic Labeling Contest; ISPRS: Leopoldshöhe, Germany, 2014; Volume 1, p. 4.
- Hong, J.; Li, W.; Han, J.; Zheng, J.; Fang, P.; Harandi, M.; Petersson, L. Goss: Towards generalized open-set semantic segmentation. Vis. Comput. 2024, 40, 2391–2404.
- Nunes, I.; Laranjeira, C.; Oliveira, H.; dos Santos, J.A. A systematic review on open-set segmentation. Comput. Graph. 2023, 115, 296–308.
- Nunes, I.M.; Poggi, M.; Oliveira, H.; Pereira, M.B.; Dos Santos, J.A. Deep open-set segmentation in visual learning. In Proceedings of the 2022 35th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Natal, Brazil, 24–27 October 2022; Volume 1, pp. 314–319.
- Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 5830–5840.
- Bendale, A.; Boult, T. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1893–1902.
- Yang, J.; Zhou, K.; Li, Y.; Liu, Z. Generalized out-of-distribution detection: A survey. arXiv 2021, arXiv:2110.11334.
- Liu, J.; Shen, Z.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; Cui, P. Towards out-of-distribution generalization: A survey. arXiv 2021, arXiv:2108.13624.
- Zhang, H.; Ding, H. Prototypical matching and open set rejection for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 6974–6983.
- He, S.; Ding, H.; Jiang, W. Semantic-promoted debiasing and background disambiguation for zero-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19498–19507.
- Baek, D.; Oh, Y.; Ham, B. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 9536–9545.
- Gu, Z.; Zhou, S.; Niu, L.; Zhao, Z.; Zhang, L. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Washington, DC, USA, 12–16 October 2020; pp. 1921–1929.
- Zheng, Y.; Wu, J.; Qin, Y.; Zhang, F.; Cui, L. Zero-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2593–2602.
- He, S.; Ding, H.; Jiang, W. Primitive generation and semantic-related alignment for universal zero-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 11238–11247.
- Bucher, M.; Vu, T.H.; Cord, M.; Pérez, P. Zero-shot semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
- Ding, J.; Xue, N.; Xia, G.S.; Dai, D. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11583–11592.
- Ma, C.; Yang, Y.; Wang, Y.; Zhang, Y.; Xie, W. Open-vocabulary semantic segmentation with frozen vision-language models. arXiv 2022, arXiv:2210.15138.
- Chen, X.; Li, S.; Lim, S.N.; Torralba, A.; Zhao, H. Open-vocabulary panoptic segmentation with embedding modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1141–1150.
- Ding, Z.; Wang, J.; Tu, Z. Open-Vocabulary Panoptic Segmentation with MaskCLIP. arXiv 2022, arXiv:2208.08984.
- Ghiasi, G.; Gu, X.; Cui, Y.; Lin, T.Y. Scaling open-vocabulary image segmentation with image-level labels. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 540–557.
- Zhou, C.; Loy, C.C.; Dai, B. Extract free dense labels from CLIP. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 696–712.
- Huynh, D.; Kuen, J.; Lin, Z.; Gu, J.; Elhamifar, E. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 7020–7031.
- Liang, F.; Wu, B.; Dai, X.; Li, K.; Zhao, Y.; Zhang, H.; Zhang, P.; Vajda, P.; Marculescu, D. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7061–7070.
- Qin, J.; Wu, J.; Yan, P.; Li, M.; Yuxi, R.; Xiao, X.; Wang, Y.; Wang, R.; Wen, S.; Pan, X.; et al. FreeSeg: Unified, universal and open-vocabulary image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19446–19455.
- Ren, S.; Zhang, A.; Zhu, Y.; Zhang, S.; Zheng, S.; Li, M.; Smola, A.J.; Sun, X. Prompt pre-training with twenty-thousand classes for open-vocabulary visual recognition. In Proceedings of the Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024.
- Zhang, H.; Li, F.; Zou, X.; Liu, S.; Li, C.; Yang, J.; Zhang, L. A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1020–1031.
- Wu, J.; Li, X.; Xu, S.; Yuan, H.; Ding, H.; Yang, Y.; Li, X.; Zhang, J.; Tong, Y.; Jiang, X.; et al. Towards open vocabulary learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5092–5113.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Ding, H.; Cohen, S.; Price, B.; Jiang, X. PhraseClick: Toward achieving flexible interactive segmentation by phrase and click. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part III; Springer: Cham, Switzerland, 2020; pp. 417–435.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, NV, USA, 5–8 December 2013.
- Zhu, C.; Chen, L. A survey on open-vocabulary detection and segmentation: Past, present, and future. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8954–8975.
- Liang, C.; Li, W.; Dong, Y.; Fu, W. Single Domain Generalization Method for Remote Sensing Image Segmentation via Category Consistency on Domain Randomization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
- Wang, M.; Liu, J.; Luo, G.; Wang, S.; Wang, W.; Lan, L.; Wang, Y.; Nie, F. Smooth-Guided Implicit Data Augmentation for Domain Generalization. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–12.
- You, K.; Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Universal domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2720–2729.
- Saito, K.; Kim, D.; Sclaroff, S.; Saenko, K. Universal domain adaptation through self supervision. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; pp. 16282–16292.
- Kundu, J.N.; Venkat, N.; Babu, R.V. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 4544–4553.
- Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
- Niu, X.; Zeng, Q.; Luo, X.; Chen, L. FCAU-Net for the Semantic Segmentation of Fine-Resolution Remotely Sensed Images. Remote Sens. 2022, 14, 215.
- Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065.
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
- Gui, R.; Xu, X.; Wang, L.; Yang, R.; Pu, F. A generalized zero-shot learning framework for PolSAR land cover classification. Remote Sens. 2018, 10, 1307.
- Jia, X.; Khandelwal, A.; Nayak, G.; Gerber, J.; Carlson, K.; West, P.; Kumar, V. Incremental dual-memory LSTM in land cover prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 867–876.
- Li, A.; Lu, Z.; Wang, L.; Xiang, T.; Wen, J.R. Zero-shot scene classification for high spatial resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4157–4167.
- Sumbul, G.; Cinbis, R.G.; Aksoy, S. Fine-grained object recognition and zero-shot learning in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2017, 56, 770–779.
- Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7472–7481.
- Zheng, Z.; Yang, Y. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. Int. J. Comput. Vis. 2021, 129, 1106–1120.
- Muhtar, D.; Zhang, X.; Xiao, P.; Li, Z.; Gu, F. CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916.
- Chen, Y.; Bruzzone, L. Toward Open-World Semantic Segmentation of Remote Sensing Images. In Proceedings of the 2023 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, USA, 16–21 July 2023; pp. 5045–5048.
- Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; Bai, X. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 736–753.
- Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2945–2954.
- Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M: A large scale vision-language dataset for remote sensing vision-language foundation model. arXiv 2023, arXiv:2306.11300.
- Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5805–5813.
- Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. arXiv 2024, arXiv:2401.16822.
- Mall, U.; Phoo, C.P.; Liu, M.K.; Vondrick, C.; Hariharan, B.; Bala, K. Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv 2023, arXiv:2312.06960.
- Iizuka, R.; Xia, J.; Yokoya, N. Frequency-based Optimal Style Mix for Domain Generalization in Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 62, 1–14.
- Zheng, J.; Wu, W.; Yuan, S.; Fu, H.; Li, W.; Yu, L. Multisource-domain generalization-based oil palm tree detection using very-high-resolution (VHR) satellite images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
- Zhang, Y.; Zhang, M.; Li, W.; Wang, S.; Tao, R. Language-aware domain generalization network for cross-scene hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12.
- Li, D.; Yang, Y.; Song, Y.Z.; Hospedales, T. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Balaji, Y.; Sankaranarayanan, S.; Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018.
- Li, Y.; Yang, Y.; Zhou, W.; Hospedales, T. Feature-critic networks for heterogeneous domain generalization. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3915–3924.
- Shankar, S.; Piratla, V.; Chakrabarti, S.; Chaudhuri, S.; Jyothi, P.; Sarawagi, S. Generalizing across domains via cross-gradient training. arXiv 2018, arXiv:1804.10745.
- Wang, Y.; Li, H.; Kot, A.C. Heterogeneous domain generalization via domain mixup. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3622–3626.
- Shu, Y.; Cao, Z.; Wang, C.; Wang, J.; Long, M. Open domain generalization with domain-augmented meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9624–9633.
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
- Segu, M.; Tonioni, A.; Tombari, F. Batch normalization embeddings for deep domain generalization. Pattern Recognit. 2023, 135, 9.
- Bhattacharya, A.; Singha, M.; Jha, A.; Banerjee, B. C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing. In Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing, Rupnagar, India, 15–17 December 2023; pp. 1–10.
- Kang, J.; Fernandez-Beltran, R.; Duan, P.; Liu, S.; Plaza, A.J. Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2598–2610.
- Ayush, K.; Uzkent, B.; Meng, C.; Tanmay, K.; Burke, M.; Lobell, D.; Ermon, S. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10181–10190.
- Manas, O.; Lacoste, A.; Giró-i Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 9414–9423.
- Muhtar, D.; Zhang, X.; Xiao, P. Index your position: A novel self-supervised learning method for remote sensing images semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
- Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–22.
- Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4088–4099.
- Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing plain vision transformer toward remote sensing foundation model. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–15.
- Jakubik, J.; Roy, S.; Phillips, C.; Fraccaro, P.; Godwin, D.; Zadrozny, B.; Szwarcman, D.; Gomes, C.; Nyirjesy, G.; Edwards, B.; et al. Foundation models for generalist geospatial artificial intelligence. arXiv 2023, arXiv:2310.18660.
- Dong, Z.; Gu, Y.; Liu, T. Generative ConvNet Foundation Model with Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Qi, L.; Kuen, J.; Shen, T.; Gu, J.; Guo, W.; Jia, J.; Lin, Z.; Yang, M.H. High Quality Entity Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4024–4033.
- Qi, L.; Kuen, J.; Wang, Y.; Gu, J.; Zhao, H.; Torr, P.; Lin, Z.; Jia, J. Open world entity segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8743–8756.
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415.
- Shi, B.; Zhang, X.; Xu, H.; Dai, W.; Zou, J.; Xiong, H.; Tian, Q. Multi-dataset pretraining: A unified model for semantic segmentation. arXiv 2021, arXiv:2106.04121.
- Chen, Y.; Wang, M.; Mittal, A.; Xu, Z.; Favaro, P.; Tighe, J.; Modolo, D. ScaleDet: A scalable multi-dataset object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7288–7297.
- Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322.
- Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181.
- Ji, D.; Zhao, F.; Lu, H.; Tao, M.; Ye, J. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23621–23630.
- Shi, J.X.; Wei, T.; Xiang, Y.; Li, Y.F. How Re-sampling Helps for Long-Tail Learning? In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023.
- Zhou, X.; Koltun, V.; Krähenbühl, P. Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 7571–7580.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Yu, Q.; He, J.; Deng, X.; Shen, X.; Chen, L.C. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP. In Proceedings of the Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024.
Dataset | Type | Base Classes | Novel Classes | |||||
---|---|---|---|---|---|---|---|---|
Potsdam | Original | impervious surface | building | car | low vegetation | tree | ||
Mapped | impervious surface | building | car | meadow | tree | |||
GID5 | Original | built up | farmland | forest | meadow | water | ||
Mapped | building | farmland | forest land | meadow | water | |||
DeepGlobe | Original | urban land | agriculture land | range land | forest land | water | barren land | |
Mapped | building | farmland | range land | forest land | water | bare land | ||
URUR | Original | building | farmland | greenhouse | wood land | bare land | water | road |
Mapped | building | farmland | greenhouse | forest land | bare land | water | road |
Method | Year | Image Encoder | VLM | Potsdam | IoU of Base Classes | IoU of Novel Classes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
imper. | Building | Car | Tree | Meadow | ||||||||||||
ZSSeg | 2021 | ResNet50 | CLIP-B/16 | 54.27 | 78.49 | 17.94 | 51.02 | 66.71 | 66.98 | 88.05 | 34.75 | 59.74 | 85.31 | 90.42 | 0.00 | 35.88 |
ZegFormer | 2022 | ResNet50 | CLIP-B/16 | 49.20 | 71.99 | 15.01 | 45.24 | 61.73 | 62.27 | 84.67 | 27.99 | 53.93 | 75.52 | 86.51 | 0.00 | 30.03 |
MaskCLIP | 2023 | ResNet50 | CLIP-L/16 | 15.58 | 21.84 | 6.19 | 21.50 | 28.54 | 39.46 | 60.16 | 7.78 | 32.24 | 33.29 | 0.00 | 11.23 | 1.16 |
SAN | 2023 | ResNet50 | CLIP-B/16 | 38.56 | 60.25 | 6.02 | 38.82 | 59.80 | 60.71 | 96.01 | 6.70 | 52.94 | 69.04 | 58.77 | 2.12 | 9.92 |
OVSeg | 2023 | ResNet101 | CLIP-B/16 | 31.56 | 50.43 | 3.25 | 35.07 | 43.49 | 54.28 | 87.44 | 3.54 | 41.72 | 74.62 | 34.96 | 0.21 | 6.28 |
FC-CLIP | 2023 | ConvNeXt_L | CLIP-RN50 | 44.78 | 73.74 | 1.32 | 39.03 | 59.12 | 59.76 | 97.85 | 1.48 | 48.12 | 81.32 | 91.79 | 0.05 | 2.60 |
FreeSeg | 2023 | ResNet50 | CLIP-B/16 | 51.25 | 75.89 | 14.29 | 46.57 | 64.12 | 65.10 | 95.89 | 18.00 | 53.99 | 81.86 | 91.82 | 3.54 | 25.05 |
FreeMix(ours) | 2024 | ResNet50 | CLIP-B/16 | 63.44 | 86.46 | 28.92 | 64.45 | 73.87 | 75.87 | 89.92 | 54.37 | 83.89 | 90.16 | 85.32 | 11.11 | 46.73 |
Training Dataset | Model | Testing Type | Testing Dataset: Potsdam | Testing Dataset: GID5 | Testing Dataset: DeepGlobe | Testing Dataset: URUR | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Potsdam | ZSSeg | MS | 78.49 | 17.94 | 54.27 | 66.71 | 1.73 | 13.66 | 6.50 | 34.53 | 0.15 | 10.83 | 3.71 | 18.85 | 0.15 | 3.13 | 1.43 | 14.99 | 16.47 | 33.77 |
ZegFormer | MS | 71.99 | 15.01 | 49.20 | 61.73 | 0.58 | 4.87 | 2.30 | 20.33 | 12.56 | 2.13 | 9.08 | 15.95 | 14.17 | 1.61 | 8.79 | 13.85 | 17.34 | 27.96 | |
MaskCLIP | MS | 21.84 | 6.19 | 15.58 | 28.54 | 13.85 | 0.35 | 8.45 | 23.68 | 7.07 | 0.00 | 4.71 | 14.68 | 7.08 | 0.00 | 4.04 | 12.12 | 8.19 | 19.75 | |
FC-CLIP | MS | 73.74 | 1.32 | 44.78 | 59.12 | 12.51 | 0.00 | 7.51 | 16.64 | 5.55 | 0.00 | 3.70 | 13.44 | 7.87 | 0.00 | 4.50 | 8.39 | 15.12 | 24.39 | |
FreeSeg | MS | 75.89 | 14.29 | 51.25 | 64.12 | 2.99 | 18.80 | 9.31 | 33.35 | 2.96 | 11.87 | 5.93 | 22.65 | 1.56 | 10.34 | 5.32 | 21.59 | 17.95 | 35.42 | |
FreeMix(ours) | SS | 86.46 | 28.92 | 63.44 | 73.87 | 15.73 | 16.31 | 15.96 | 43.90 | 3.41 | 8.93 | 5.25 | 19.95 | 3.53 | 3.30 | 3.43 | 15.76 | 22.02 | 38.37 | |
GID5 | ZSSeg | MS | 0.00 | 10.77 | 4.31 | 20.00 | 33.15 | 0.63 | 20.14 | 37.60 | 3.03 | 5.85 | 3.97 | 18.61 | 9.45 | 3.13 | 6.74 | 19.43 | 8.79 | 23.91 |
ZegFormer | MS | 6.35 | 12.30 | 8.73 | 23.93 | 28.16 | 4.18 | 18.57 | 38.21 | 28.46 | 0.27 | 19.06 | 29.75 | 6.35 | 12.30 | 8.73 | 23.93 | 13.77 | 28.95 | |
MaskCLIP | MS | 21.06 | 8.60 | 16.08 | 29.19 | 16.45 | 0.66 | 10.13 | 20.78 | 10.71 | 0.00 | 7.14 | 16.49 | 9.89 | 0.00 | 5.65 | 9.98 | 9.75 | 19.11 | |
FC-CLIP | MS | 22.78 | 10.00 | 17.67 | 36.48 | 6.12 | 0.13 | 3.72 | 19.66 | 3.59 | 0.00 | 2.40 | 16.54 | 3.87 | 0.01 | 2.21 | 10.55 | 6.50 | 20.80 | |
FreeSeg | MS | 3.32 | 17.59 | 9.02 | 23.46 | 73.36 | 22.22 | 52.91 | 61.88 | 19.05 | 8.81 | 15.64 | 26.30 | 15.86 | 1.72 | 9.80 | 15.51 | 21.84 | 31.78 | |
FreeMix(ours) | SS | 8.33 | 15.83 | 11.33 | 26.36 | 76.47 | 22.55 | 54.90 | 65.44 | 23.01 | 12.38 | 19.47 | 35.81 | 20.95 | 9.79 | 16.17 | 26.48 | 25.46 | 38.52 | |
DeepGlobe | ZSSeg | MS | 5.43 | 11.51 | 7.86 | 23.29 | 14.59 | 9.85 | 12.69 | 32.53 | 0.85 | 5.60 | 2.44 | 17.18 | 0.93 | 5.32 | 2.81 | 20.31 | 6.45 | 23.32 |
ZegFormer | MS | 0.00 | 12.27 | 4.91 | 20.95 | 0.17 | 0.25 | 0.20 | 20.10 | 7.20 | 5.70 | 6.70 | 19.71 | 0.01 | 1.13 | 0.49 | 14.28 | 3.07 | 18.76 | |
MaskCLIP | MS | 16.43 | 7.51 | 12.86 | 23.85 | 6.51 | 0.00 | 3.90 | 20.36 | 9.95 | 0.00 | 6.63 | 26.26 | 5.59 | 0.00 | 3.19 | 14.53 | 6.64 | 21.25 | |
FC-CLIP | MS | 24.89 | 5.92 | 17.30 | 37.53 | 5.77 | 0.00 | 3.46 | 19.72 | 2.71 | 0.00 | 1.80 | 14.25 | 3.74 | 0.00 | 2.14 | 8.73 | 6.17 | 20.05 | |
FreeSeg | MS | 17.62 | 22.61 | 19.61 | 37.10 | 41.40 | 15.96 | 31.22 | 44.04 | 9.44 | 7.03 | 8.63 | 23.16 | 8.44 | 2.71 | 5.99 | 17.81 | 16.36 | 30.52 | |
FreeMix(ours) | SS | 17.89 | 17.14 | 17.59 | 39.37 | 32.89 | 20.26 | 27.84 | 49.37 | 24.97 | 9.35 | 19.76 | 33.88 | 19.12 | 6.49 | 13.71 | 24.97 | 19.72 | 36.89 | |
URUR | ZSSeg | MS | 2.92 | 11.12 | 6.20 | 21.64 | 7.53 | 8.36 | 7.86 | 33.10 | 5.52 | 6.39 | 5.81 | 20.30 | 5.18 | 1.87 | 3.76 | 16.37 | 5.90 | 22.85 |
ZegFormer | MS | 0.59 | 5.34 | 2.49 | 22.72 | 0.76 | 0.25 | 0.56 | 20.47 | 10.56 | 0.00 | 7.04 | 22.31 | 0.02 | 1.13 | 0.50 | 14.30 | 2.64 | 19.95 | |
MaskCLIP | MS | 12.94 | 12.63 | 12.82 | 28.19 | 15.39 | 0.44 | 9.41 | 21.48 | 10.39 | 0.00 | 6.93 | 17.01 | 12.24 | 0.00 | 6.99 | 13.17 | 9.03 | 19.96 | |
FC-CLIP | MS | 26.37 | 8.89 | 19.38 | 35.17 | 5.74 | 0.00 | 3.44 | 19.89 | 2.97 | 0.77 | 2.24 | 16.36 | 3.78 | 0.00 | 2.16 | 10.19 | 6.80 | 20.40 | |
FreeSeg | MS | 12.95 | 22.56 | 16.79 | 32.22 | 43.92 | 21.93 | 35.12 | 57.84 | 21.25 | 8.05 | 16.85 | 32.12 | 21.00 | 5.81 | 14.49 | 24.71 | 20.81 | 36.72 | |
FreeMix(ours) | SS | 15.70 | 21.73 | 18.11 | 36.12 | 33.06 | 16.97 | 26.62 | 54.02 | 28.39 | 13.08 | 23.28 | 37.33 | 33.29 | 9.80 | 23.22 | 32.45 | 22.80 | 39.98 |
Model | Training dataset | Testing type | Potsdam | GID5 | DeepGlobe | URUR | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ZSSeg | GPDU | MS | 9.84 | 10.85 | 8.32 | 3.66 | 0.02 | 9.13 | 2.38 | 0.00 | 7.16 | 1.94 | 1.65 | 2.32 | 4.46 |
ZegFormer | 4.31 | 0.00 | 10.77 | 0.50 | 0.83 | 0.00 | 8.11 | 11.59 | 1.15 | 0.48 | 0.00 | 1.13 | 3.35 | ||
MaskCLIP | 11.72 | 18.78 | 1.13 | 13.62 | 22.70 | 0.00 | 9.20 | 13.80 | 0.00 | 6.87 | 12.02 | 0.00 | 10.35 | ||
SAN | 23.84 | 23.99 | 23.61 | 35.10 | 57.67 | 1.25 | 29.38 | 42.86 | 2.41 | 30.44 | 48.39 | 6.49 | 29.69 | ||
OVSeg | 9.32 | 14.96 | 0.86 | 15.58 | 22.15 | 5.73 | 25.87 | 37.57 | 2.47 | 19.56 | 31.86 | 3.16 | 17.58 | ||
FC-CLIP | 21.02 | 28.76 | 9.40 | 2.87 | 4.79 | 0.00 | 1.81 | 2.72 | 0.00 | 1.48 | 2.60 | 0.00 | 6.80 | ||
FreeSeg | 17.58 | 14.75 | 21.83 | 25.26 | 33.94 | 12.24 | 24.55 | 31.58 | 10.49 | 23.71 | 38.99 | 3.34 | 22.78 | ||
FreeMix†(ours) | GPDU | SS | 19.98 | 17.25 | 24.06 | 57.26 | 75.91 | 29.27 | 32.03 | 41.17 | 13.75 | 29.30 | 42.41 | 11.83 | 34.64
FreeMix(ours) | GPDU | SS | 47.03 | 69.54 | 13.26 | 43.13 | 67.91 | 5.97 | 35.14 | 52.69 | 0.04 | 35.72 | 60.85 | 2.22 | 40.26 |
Training Dataset | RS_SSL | ESB | DAS | Potsdam | GID5 | DeepGlobe | URUR | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPDU | 17.58 | 32.58 | 25.26 | 32.92 | 24.55 | 39.75 | 23.71 | 30.98 | 22.78 | 34.05 | |||||
√ | 19.73 | 37.32 | 38.53 | 57.22 | 31.97 | 47.06 | 26.46 | 34.49 | 29.17 | 44.02 | +6.39 | +9.96 | |||
√ | √ | 19.98 | 35.92 | 57.26 | 71.10 | 32.03 | 46.49 | 29.30 | 37.85 | 34.64 | 47.84 | +5.47 | +3.81 | ||
√ | √ | √ | 47.03 | 62.56 | 43.13 | 58.44 | 35.14 | 47.37 | 35.72 | 45.75 | 40.26 | 53.53 | +5.62 | +5.69 |
Backbone | Pre-Train Type | Pre-Train Dataset | Training Tactic | Potsdam | GID5 | DeepGlobe | URUR | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet50 | Supervised | In1K | random | 19.73 | 43.66 | 38.53 | 76.23 | 31.97 | 63.66 | 26.46 | 70.45 | 29.17 | 63.50 |
ResNet50 | Self-Supervised | MillionAID | random | 19.98 | 40.43 | 57.26 | 87.65 | 32.03 | 63.24 | 29.30 | 71.70 | 34.64 | 65.75 |
ResNet50 | Supervised | In1K | DAS | 39.49 | 61.35 | 41.18 | 78.04 | 7.81 | 22.71 | 6.39 | 19.02 | 23.71 | 45.28 |
ResNet50 | Self-Supervised | MillionAID | DAS | 47.03 | 62.47 | 43.13 | 81.90 | 35.14 | 69.38 | 35.72 | 78.45 | 40.25 | 73.05 |
Swin-B | – | – | random | 11.79 | 30.92 | 33.76 | 73.99 | 22.20 | 62.06 | 26.05 | 71.96 | 23.45 | 59.73 |
Swin-B | Self-Supervised | MillionAID | random | 11.54 | 31.09 | 39.31 | 78.63 | 23.22 | 61.91 | 27.71 | 72.38 | 25.44 | 61.00 |
Swin-B | – | – | DAS | 36.26 | 60.70 | 34.29 | 72.58 | 19.78 | 48.39 | 15.05 | 47.18 | 26.34 | 57.21 |
Swin-B | Self-Supervised | MillionAID | DAS | 43.85 | 66.57 | 40.11 | 79.49 | 23.81 | 56.59 | 18.44 | 54.53 | 31.55 | 64.29 |
Backbone of ESB | Training Tactic | Potsdam | GID5 | DeepGlobe | URUR | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Swin-T | random | 19.98 | 40.43 | 57.26 | 87.65 | 32.03 | 63.24 | 29.30 | 71.70 | 34.64 | 65.75 |
Swin-L | 23.31 | 45.90 | 55.07 | 88.71 | 24.85 | 50.92 | 23.40 | 59.67 | 31.65 | 61.30 | |
Hornet-L | 21.86 | 45.65 | 53.14 | 88.63 | 30.40 | 64.15 | 27.75 | 72.03 | 33.28 | 67.61 | |
Swin-T | DAS | 47.03 | 62.47 | 43.13 | 81.90 | 35.14 | 69.38 | 35.72 | 78.45 | 40.25 | 73.05 |
Swin-L | 47.39 | 57.09 | 57.11 | 85.90 | 10.89 | 31.00 | 14.66 | 24.70 | 32.51 | 49.67 | |
Hornet-L | 53.68 | 66.17 | 53.40 | 82.55 | 12.91 | 36.81 | 15.67 | 30.10 | 33.91 | 53.90 |