Article

Boosting Urban Openspace Mapping with the Enhancement Feature Fusion of Object Geometry Prior Information from Vision Foundation Model

Zijian Xu, Jiajun Chen, Hongyang Niu, Runyu Fan, Dingkun Lu and Ruyi Feng

1 School of Computer Science, China University of Geosciences, Wuhan 430078, China
2 Changjiang Basin Ecology and Environment Monitoring and Scientific Research Center, Changjiang Basin Ecology and Environment Administration, Wuhan 430015, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(7), 1230; https://doi.org/10.3390/rs17071230
Submission received: 21 January 2025 / Revised: 9 March 2025 / Accepted: 24 March 2025 / Published: 30 March 2025

Abstract

Urban open spaces (UO) play a crucial role in urban environments, particularly in areas where social and economic activities are rapidly increasing. However, the challenges of high inter-class similarity, complex environmental surroundings, and scale variations often result in suboptimal performance in UO mapping. To address these issues, this paper proposes UOSAM, a novel approach that leverages the Segment Anything Model (SAM) for efficient UO mapping using high-resolution remote sensing images. Our method employs a pyramid transformer to extract feature pyramids at multiple scales, capturing multi-scale semantic context and addressing the issue of scale variation. Additionally, SAM is used to achieve more precise geometric segmentation of ubiquitous objects within the images, effectively tackling the challenges posed by their high inter-class similarity and environmental complexity. Furthermore, we introduce a feature fusion module (FFM) that integrates multi-level features from the remote sensing images. Extensive experiments conducted on the Urban Openspace China Ten Cities (UOCTC) dataset, built from manually annotated samples in ten major cities in China, demonstrate the superiority of the proposed UOSAM.

1. Introduction

Open space is outdoor space in an urban area that provides public service and recreational facilities for urban residents. Cities usually offer extensive public facilities such as housing, transportation, utilities, and commodity production and communication systems, which attract a large number of people to cities for employment, entertainment, and education [1,2]. Two centuries ago, urban residents constituted only a small fraction of the global population. However, after two centuries of rapid urbanization, more than half of the world’s population now resides in cities. By 2022, urban residents accounted for 57% of the global population, while urbanization in China has progressed even more rapidly, with the proportion of urban residents reaching 64% [3]. Although population growth has begun to slow and is expected to continue slowing in the coming decades, slower growth does not mean no growth. At the current growth rate, 60% of the world’s population (nearly 5 billion people) will live in cities by 2030; by 2050, this proportion will be close to 70%. Therefore, urbanization is inevitable, and urban growth planning is crucial to future sustainable development [4].
Driven by the accelerating global urbanization, urban areas are rapidly expanding outward. Simultaneously, existing built-up areas are becoming increasingly dense, resulting in large-scale land conversions from open spaces to built-up zones. Resident-friendly open spaces that meet the needs of a region have become a major focus of sustainable development [5,6]. Many studies have found that open spaces are beneficial to improving people’s mental health [7], recovery from illness [8], and quality of life [9]. Ensuring that there are enough open spaces in cities can also help maintain ecological balance [10,11,12], improve air quality [13,14], and reduce the urban heat island effect [15,16]. Urban open space recognition is an important scientific issue for automatically identifying vegetation, parks, water bodies, roads, parking lots, outdoor stadiums, etc., in cities and providing corresponding labels. It plays an important role in urban transformation, urban planning, urban sustainable development, and improving the quality of urban life.
In the past few decades, as the spatial and spectral resolution of satellite remote sensing images [17,18] has continued to improve [19,20], more and more information [21] can be obtained from remote sensing images [19,22], supporting more sophisticated mapping work [23,24,25,26] and change monitoring [27,28]. Ultra-high-resolution remote sensing images have also been used successfully to derive various kinds of urban information, such as urban land use maps [29,30,31,32], as well as more specific products including urban road maps [33,34,35,36,37], urban population maps [38], coastal maps [39], urban functional area maps [40,41], and urban informal residential area maps [42,43,44]. Ultra-high-resolution remote sensing images therefore provide an important basis for the identification of urban open spaces. However, the high inter-class similarity, complex contextual environments [45], and large scale differences of urban open spaces lead to unsatisfactory results in open space mapping. This is mainly due to the following factors:
  • High interclass similarities. Different Urban Objects (UOs) often exhibit similar visual characteristics, such as color, spectral properties, geometry, and texture, making their accurate identification challenging. For instance, outdoor parking lots and roads share highly similar visual appearances. Both UO categories typically consist of the same material, often cement, resulting in similar colors and spectral information. Additionally, parking lots and roads often exhibit an analogous object composition and contextual elements, such as vehicles and white markings. These similarities make distinguishing between these UO categories particularly difficult.
  • Complex environment surroundings. Urban areas, when viewed at high resolutions, are fragmented and heterogeneous, creating intricate spatial relationships between various urban objects. This complexity in spatial relationships leads to challenging environmental conditions, including shadows and mutual occlusion, which further complicate the accurate identification of pixels associated with UOs. For example, parking lots are often located near tall buildings or street trees. Satellite imagery, typically captured from a bird’s-eye perspective, results in these areas being prone to occlusion and shadowing, further hindering object detection.
  • Scale variations. Scale variation in urban object identification can arise from two primary factors. The first type refers to objects of the same physical size appearing at different scales within an image due to perspective effects and the camera’s distance from the scene. The second type is related to the actual differences in physical size between objects. In the context of UOs, the latter is more significant. For instance, an outdoor gymnasium typically occupies a much larger area than a parking lot. Additionally, large parking lots often have a significantly greater scale compared to smaller, roadside parking areas.
The challenges associated with open space mapping arise from various factors, including the high interclass similarities, complex environmental surroundings, and scale variations. Most studies on open space extraction focus on land use classification, land use mapping, and road network extraction to identify specific open space categories. To effectively identify Urban Objects (UOs) from remote sensing images, researchers have proposed several models, which can broadly be categorized into pixel-based methods and object-based methods.
Pixel-based machine learning methods primarily rely on spectral and texture features for classification [46], while deep learning approaches leverage automated feature extraction to construct complex models for image processing. For example, Feng et al. [47] proposed a method using ultra-high-resolution images captured by drones. This approach combines random forests with texture analysis to accurately distinguish land cover types in urban vegetation areas. Zhao et al. [48] proposed an efficient spectral–structural feature bag scene classifier, which integrates spectral and structural information from the imagery to perform scene classification.
In contrast, object-based machine learning methods not only consider intra-object features but also incorporate inter-object features such as connectivity, continuity, and the distance and direction between adjacent objects. Pan et al. [49] proposed an improved object-based random forest method using drone multispectral images, which achieved high accuracy in land use classification. Zhang et al. [50] proposed an object-based convolutional neural network (OCNN) for urban land use classification using very fine spatial resolution (VFSR) images. The OCNN model analyzes and labels objects, identifies changes within and between objects, and performs decision fusion based on predefined rules to generate land use classification results. Object-based deep learning methods offer robust solutions for automated interpretation in complex urban environments, enabling accurate object detection and classification. Jin et al. [51] introduced a method combining object-oriented techniques with deep convolutional neural networks (COCNN), where feature samples are first constructed from multi-scale segmented remote sensing images and then processed using modified convolutional neural networks (CNNs) for feature extraction.
Currently, semantic segmentation methods are widely used for open space extraction tasks. These methods involve pixel-level classification, where each pixel is assigned an open space category. In such semantic segmentation tasks, spatial accuracy is of paramount importance, particularly with respect to boundary and location accuracy. Huang et al. [52] proposed a dual-function feature aggregation network (DFFAN) that uses residual neural networks as a backbone. The DFFAN first generates a feature pyramid of remote sensing images via multiple downsampling operations. The context of each feature map is then obtained using the affinity matrix module (AMM), and the boundary feature fusion module (BFF) is employed to combine context and spatial information, producing the final land cover map. U-shaped networks, which typically have symmetric encoder–decoder architectures, also contribute to enhancing spatial positioning accuracy. These networks improve the accuracy of results by downsampling and upsampling the input image while integrating information from multiple scales through skip connections. Xu et al. [53] proposed a high-resolution U-Net (HRU-Net) that enhances spatial accuracy by refining the skip connection structure and loss function in U-Net. The HRU-Net outperforms the basic model, particularly in terms of edge detail accuracy.
Additionally, the attention mechanism has been integrated into open space semantic segmentation due to its ability to capture long-range spatial dependencies and improve task-specific computational efficiency. Men et al. [54] introduced a cascaded residual attention U-Net (CRAUNet) that combines an enhanced residual structure with a convolutional block channel attention (CBCA) module. This method preserves more feature information, boosts effective features, suppresses irrelevant features, and strengthens the extraction of deep convolutional features. Shi et al. [55] proposed a general deep learning framework for large-scale urban green space mapping, which includes both a generator and a discriminator. The generator incorporates an attention mechanism and a point tearing strategy to enhance the spatial resolution of the output results. This framework was successfully applied to generate urban green space maps for 31 major cities in China using Google Earth imagery.
In this paper, we propose UOSAM, a novel approach that utilizes the Segment Anything Model (SAM) to extract ubiquitous objects for efficient urban open space (UO) mapping using high-resolution remote sensing images. UOSAM employs a pyramidal transformer to generate multi-scale feature pyramids, capturing semantic context across various scales and addressing the challenge of scale variations in UO mapping. Additionally, SAM is used to achieve a more accurate segmentation of ubiquitous objects, effectively resolving the issues related to high inter-class similarity and complex environment surroundings. SAM, which has been pre-trained on a vast amount of visual data (enabling it to recognize a variety of objects, such as buildings, trees, roads, etc.), provides robust support for urban UO mapping by leveraging its learned visual priors.
Comprehensive experiments conducted on open-space semantic segmentation datasets from ten major cities in China demonstrate that our proposed UOSAM significantly outperforms other state-of-the-art (SOTA) semantic segmentation algorithms. The key contributions of this paper can be summarized as follows:
(1)
To address the issues of high inter-class similarity, complex environment surroundings, and scale variations, the UOSAM model is proposed, which integrates multi-scale semantic information and ubiquitous objects to enhance the performance of UO mapping.
(2)
A pyramid Transformer encoder is utilized to extract feature pyramids at different scales, capturing multi-scale semantic context and compensating for scale variations in UO mapping.
(3)
The Segment Anything model leverages geometry prior information about ubiquitous objects to capture geometric details of UO, resulting in more structured and accurate UO mapping.

2. Materials

2.1. Study Area

The study area of this research includes ten major cities in China: Shanghai, Beijing, Shenzhen, Chongqing, Guangzhou, Chengdu, Tianjin, Wuhan, Nanchang, and Changsha. This selection is based on our ultimate goal of creating a national land cover map.

2.2. Data Sources

The remote sensing images used in this study are high-resolution satellite images obtained from Google Earth, with a spatial resolution of 1.19 m. The images were captured around 2021 and are derived from the RGB bands of Google Earth. Google’s high-resolution remote sensing imagery integrates data from commercial satellites, aerial photography, and open-source satellite data, providing high-resolution, globally available, and continuously updated imagery.

2.3. Dataset Generation and Visualization

The categories of urban open space (UO) in this study include green space (GS), outdoor sports field (OSF), transportation hub (TH), water body (WB), and not open space (NOS). The images were sourced from the RGB bands of Google Earth. After grid-based cropping, a total of 37,705 images of 931 × 931 pixels were obtained, with a horizontal and vertical resolution of 96 dpi. In this study, the labeling process was carried out by overlaying a hierarchical buffer, established from OSM road network data, onto existing OpenStreetMap (OSM) AOI data to generate initial open space labels. These preliminary labels were then refined and supplemented through expert manual correction to obtain high-quality open space label samples. We selected regions with diverse open space categories, including outdoor sports fields, from the ten major cities across China. Each city contributes at least 50 labeled samples, with the specific distribution as follows: Beijing (58 samples), Chengdu (50 samples), Guangzhou (53 samples), Nanchang (52 samples), Shanghai (75 samples), Shenzhen (78 samples), Tianjin (50 samples), Wuhan (174 samples), Changsha (65 samples), and Chongqing (63 samples), resulting in a total of 718 labeled samples. Figure 1 illustrates the labels for each city.
We performed a statistical analysis of our labels, and the results are shown in Figure 2a. It can be observed that not open space (NOS) and green space (GS) have the largest number of pixels, while water body (WB) and transportation hub (TH) contain approximately one-third of the pixels of NOS and GS, reflecting that larger areas are occupied by NOS and GS in urban environments compared to the smaller areas of WB and TH. Outdoor sports field (OSF) had the fewest pixels, accounting for only a small portion of the total, indicating that OSF is relatively scarce in urban areas.
Figure 2b illustrates the proportion of each category in different cities. While the proportions vary, GS and NOS consistently account for the largest shares, with their combined proportion exceeding 70% in all cities. In Tianjin, NOS has the highest proportion, whereas in Shenzhen, GS dominates. WB represents nearly 20% of the labeled area in Wuhan but only about 5% in Tianjin and Chengdu. TH contributes 8–15% across all cities, while OSF accounts for less than 5% of the total area.
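The per-class pixel statistics in Figure 2 can be reproduced directly from the label rasters. The following minimal sketch assumes the labels are stored as single-channel PNG files whose integer values 0–4 encode the five UO categories; the directory layout and class-ID order are hypothetical and not specified in the paper.

```python
import numpy as np
from PIL import Image
from pathlib import Path

# Hypothetical class-ID order; the actual UOCTC encoding is not given in the text.
CLASSES = ["GS", "OSF", "TH", "WB", "NOS"]

def class_pixel_stats(label_dir: str):
    """Count pixels per class over all label rasters in a directory and return proportions."""
    counts = np.zeros(len(CLASSES), dtype=np.int64)
    for path in Path(label_dir).glob("*.png"):
        label = np.asarray(Image.open(path))  # (H, W) array of integer class IDs
        counts += np.bincount(label.ravel(), minlength=len(CLASSES))[: len(CLASSES)]
    proportions = counts / counts.sum()
    return dict(zip(CLASSES, counts)), dict(zip(CLASSES, proportions))

# Example (hypothetical path): counts, props = class_pixel_stats("labels/wuhan")
```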

3. Methods

This paper introduces UOSAM, a novel method for UO mapping, which consists of the Semantic Pyramid Feature Module (SPFM) and the Geometric Feature Module (GFM). It integrates both the semantic and geometric features of remote sensing images, utilizing a joint loss function to weight the semantic features. Figure 3 illustrates the overall network architecture of UOSAM. It contains the following modules:
  • The SPFM. The SPFM uses a hierarchical structure, where each layer gradually reduces the spatial resolution through downsampling operations while increasing the number of channels. This allows the model to capture image information at different scales and preserve important global and local features. SPFM uses overlapping block embedding layers and four Transformer blocks to hierarchically extract spatial features of different scales from the input remote sensing image. The features at each scale are converted to the required embedding dimension through MLP, then upsampled and fused. The fused features are passed through a convolutional layer to obtain semantic features.
  • The GFM. The GFM uses the Segment Anything Model, a foundation model for image segmentation that can effectively capture highly complex geometric features. The GFM first uses a Vision Transformer for image encoding, handling high-resolution inputs while ensuring scalability. The Prompt Encoder in the SAM model handles two types of prompts—sparse prompts and dense prompts—and can convert multiple prompts into a unified feature representation; in this work, the GFM does not use prompts. The GFM’s Mask Decoder uses a modified Transformer decoder to update the embedding information through self-attention and cross-attention between prompts and image embeddings. The final mask is then generated through upsampling and a dynamic linear classifier. A minimal sketch of how the two branches are combined is given after this list.
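The sketch below illustrates this two-branch design at a high level, assuming the SPFM returns class logits of shape (B, cls, H, W) and the GFM returns single-channel low-resolution mask logits. Module names, the frozen geometric branch, and the fusion channel count are assumptions consistent with Sections 3.1 and 3.2 rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UOSAMSketch(nn.Module):
    """Two-branch sketch: a semantic branch (SPFM) and a SAM-based geometric branch (GFM)."""

    def __init__(self, spfm: nn.Module, gfm: nn.Module, num_classes: int = 5):
        super().__init__()
        self.spfm = spfm  # pyramid-Transformer semantic branch -> (B, cls, H, W) logits
        self.gfm = gfm    # prompt-free SAM branch -> (B, 1, h', w') low-resolution mask logits
        self.fuse = nn.Conv2d(num_classes + 2, num_classes, kernel_size=1)

    def forward(self, image: torch.Tensor):
        h, w = image.shape[-2:]
        x_semantic = self.spfm(image)
        with torch.no_grad():  # treating the SAM prior as frozen is an assumption
            low_res_logits = self.gfm(image)
        x_logits = F.interpolate(low_res_logits, size=(h, w), mode="bilinear", align_corners=False)
        x_geometric = torch.sigmoid(x_logits)                  # geometric probability map
        x_merge = torch.cat([x_semantic, x_logits, x_geometric], dim=1)
        x_out = self.fuse(x_merge)                             # final (B, cls, H, W) prediction
        return x_semantic, x_out                               # both terms enter the joint loss
```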

3.1. Semantic Pyramid Feature Module

As shown in Figure 3, in UOSAM, the SPFM uses Mix-Transformer (MiT) as its backbone. It takes an image of shape $(C_0, H, W)$ as input and produces hierarchical multi-scale features with the following shapes: $(C_1, \frac{H}{4}, \frac{W}{4})$, $(C_2, \frac{H}{8}, \frac{W}{8})$, $(C_3, \frac{H}{16}, \frac{W}{16})$, and $(C_4, \frac{H}{32}, \frac{W}{32})$. These features are generated by the four stacked Transformer blocks within the MiT backbone.
The Transformer block is the fundamental building unit of SegFormer and is responsible for capturing global information from the image. In the Transformer block, the input $x \in \mathbb{R}^{C_0 \times H \times W}$ is first passed through the overlapping block embedding layer, which is defined as follows:

$$x = \mathrm{LayerNorm}\big(\mathrm{Transpose}\big(\mathrm{Flatten}\big(\mathrm{Conv}_{C_i \times C_{i+1}}(x)\big)\big)\big), \quad i \in \{0, \ldots, 3\} \tag{1}$$

where $\mathrm{Conv}_{C_i \times C_{i+1}}$ represents a two-dimensional convolutional layer with $C_i$ input channels and $C_{i+1}$ output channels. When $i = 0$, the kernel size, stride, and padding of $\mathrm{Conv}_{C_i \times C_{i+1}}$ are set to $7 \times 7$, 4, and 3, respectively. For $i \neq 0$, these parameters are set to $3 \times 3$, 2, and 1, respectively.
The overlapping patch embedding layer utilizes a 2D convolutional layer to divide the image $x$ into several 3D patches. These 3D patches are then flattened and transposed into 2D patches, denoted as $x_p^i \in \mathbb{R}^{N \times (P_i^2 \times C_{i+1})}$, where $N$ represents the batch size and $P_i$ is defined as $\frac{H}{4 \times 2^i}$, with $i \in \{0, \ldots, 3\}$. Finally, a LayerNorm is applied to $x_p^i$.
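A minimal PyTorch sketch of this overlapping patch embedding, following Formula (1) and the kernel/stride/padding settings above (class and variable names are ours), is as follows:

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding: strided Conv2d -> Flatten -> Transpose -> LayerNorm."""

    def __init__(self, in_ch: int, out_ch: int, first_stage: bool):
        super().__init__()
        # Stage 0 uses a 7x7 kernel, stride 4, padding 3; later stages use 3x3, stride 2, padding 1.
        k, s, p = (7, 4, 3) if first_stage else (3, 2, 1)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x: torch.Tensor):
        x = self.proj(x)                  # (B, C_{i+1}, H/s, W/s)
        _, _, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)  # (B, N_tokens, C_{i+1}) patch sequence
        return self.norm(x), h, w
```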
Next, a self-attention (SA) layer with a residual connection is added. The formula for SA is as follows:

$$\mathrm{SA}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V \tag{2}$$

In the SA mechanism, a reduction ratio $R_i$ is used to reduce the sequence length of the patches (denoted as $\mathrm{Length}_p$) as follows:

$$\hat{K} = \mathrm{Reshape}\!\left(\frac{\mathrm{Length}_p}{R_i}, C_i \cdot R_i\right)(K) \tag{3}$$

$$K = \mathrm{Linear}(C_i \cdot R_i, C_i)(\hat{K}) \tag{4}$$

where Formula (3) reshapes the tensor $K$ to a shape of $\frac{\mathrm{Length}_p}{R_i} \times (C_i \cdot R_i)$.
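A simplified sketch of this reduced self-attention is given below. The Reshape/Linear reduction of Formulas (3) and (4) is written literally here; SegFormer's reference implementation achieves the same effect with a strided convolution, so this is an illustrative stand-in rather than the authors' code.

```python
import torch
import torch.nn as nn

class ReducedSelfAttention(nn.Module):
    """Self-attention of Formula (2) with the sequence reduction of Formulas (3)-(4)."""

    def __init__(self, dim: int, num_heads: int, reduction: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.reduction = reduction
        if reduction > 1:
            # Reshape(Length_p / R, C * R) followed by Linear(C * R, C).
            self.sr = nn.Linear(dim * reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        q = self.q(x).view(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)
        kv_in = x
        if self.reduction > 1:
            # The sequence length must be divisible by the reduction ratio in this sketch.
            kv_in = self.sr(x.reshape(b, n // self.reduction, c * self.reduction))
        kv = self.kv(kv_in).view(b, -1, 2, self.num_heads, c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```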
The tensor output from the previous layer is then fed into a feed-forward network to provide positional information for the Transformer. The feed-forward network is defined as:

$$x^i = \mathrm{MLP}\big(\mathrm{GELU}\big(\mathrm{DWConv}\big(\mathrm{MLP}(x_p^i)\big)\big)\big) + x_p^i \tag{5}$$

where $x_p^i$ represents the feature output from the $i$-th self-attention (SA) layer, and $\mathrm{DWConv}$ denotes depth-wise convolution with $3 \times 3$ kernels. The feature $x^i$ is then successively passed through LayerNorm, Reshape, and Permute operations to convert the two-dimensional patches into three-dimensional feature maps. Ultimately, the feature maps $x^i$ from the Transformer blocks at stage $i$ combine to form the pyramidal feature maps. In this study, $H$ and $W$ of the RS image are set to 310. The channel dimensions $C_0, C_1, C_2, C_3, C_4$ for the RS image are 3, 64, 128, 320, and 512, respectively. The reduction ratios $R_0, R_1, R_2, R_3$ are 64, 16, 4, and 1, respectively.
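The feed-forward block of Formula (5) can be sketched as follows, assuming the token sequence is reshaped back to a feature map so that the 3 × 3 depth-wise convolution can inject positional information:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixFFN(nn.Module):
    """Feed-forward block of Formula (5): MLP -> 3x3 depth-wise conv -> GELU -> MLP, plus residual."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        y = self.fc1(x)                                # (B, N, hidden)
        b, n, c = y.shape
        y = y.transpose(1, 2).reshape(b, c, h, w)      # tokens -> feature map for the DWConv
        y = self.dwconv(y).flatten(2).transpose(1, 2)  # back to (B, N, hidden)
        y = self.fc2(F.gelu(y))
        return y + x                                   # residual connection
```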
The features from the four scales are transformed to a uniform shape of $\left(768 \times \frac{H}{4} \times \frac{W}{4}\right)$ through MLP layers. These feature maps are then concatenated from the highest to the lowest layer, and feature fusion is performed using the ConvModule function. Next, the features are mapped to categories via convolution and subsequently upsampled to produce the semantic feature $X_{Semantic}$, with a size of $(\mathrm{cls}, H, W)$, where $\mathrm{cls}$ represents the total number of UO categories, which is set to five in this paper.
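A sketch of this fusion head is shown below; the ConvModule is approximated by a Conv–BatchNorm–ReLU block, and all names are ours rather than the authors’ implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticFusionHead(nn.Module):
    """Fuse the four pyramid features into the semantic map X_Semantic of shape (cls, H, W)."""

    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim: int = 768, num_classes: int = 5):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in in_dims)
        self.fuse = nn.Sequential(                       # stand-in for mmseg's ConvModule
            nn.Conv2d(4 * embed_dim, embed_dim, 1, bias=False),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        self.cls = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats, out_size):
        # feats: list of (B, C_i, H/(4*2^i), W/(4*2^i)) maps from the four Transformer stages
        target = feats[0].shape[-2:]                     # (H/4, W/4)
        ups = []
        for f, proj in zip(feats, self.proj):
            b, c, h, w = f.shape
            f = proj(f.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(b, -1, h, w)
            ups.append(F.interpolate(f, size=target, mode="bilinear", align_corners=False))
        x = self.fuse(torch.cat(ups[::-1], dim=1))       # concatenate highest-to-lowest, then fuse
        x = self.cls(x)                                  # map to the cls = 5 UO categories
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
```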

3.2. Geometric Feature Module

As shown in Figure 3, in UOSAM, the Geometric Feature Module (GFM) consists of three components: an Image Encoder, a Prompt Encoder, and a Mask Decoder. The input image is processed by the Image Encoder to generate image embeddings. Next, the prompt information (including sparse and dense prompts) is fed into the Prompt Encoder; in this case, the prompt inputs are set to empty, using the default settings. The image embeddings and the prompt embeddings are then passed to the Mask Decoder, which produces low-resolution logits and an IoU score.
The low-resolution logits are upsampled to the size of the original image using bilinear interpolation, resulting in $x_{Logits}$. Finally, the Sigmoid function is applied to compute probability values between 0 and 1, which are recorded as the geometric feature $x_{Geometric}$.
As shown in Figure 3, the semantic feature $x_{Semantic}$ and the geometric features $x_{Logits}$ and $x_{Geometric}$ are concatenated along the channel dimension to form a fused feature map $x_{merge}$. This fused feature map is then passed through a $1 \times 1$ convolution layer, with the output channels set to $\mathrm{cls}$, and the final semantic feature $x_{out}$ is obtained.
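For reference, the prompt-free SAM pass described above can be sketched with the public segment-anything package as follows. The "vit_b" backbone and checkpoint path are placeholders (the paper does not state which SAM variant is used), the input is assumed to be a single RGB tensor scaled to [0, 1], and resizing to 1024 × 1024 follows SAM's default input size.

```python
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry

# Placeholder backbone and checkpoint; the paper does not specify them.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").eval()

@torch.no_grad()
def geometric_features(image: torch.Tensor):
    """Prompt-free SAM pass for a single (1, 3, H, W) image scaled to [0, 1]:
    image embeddings -> empty prompts -> low-resolution logits and an IoU score,
    then bilinear upsampling and a Sigmoid, yielding x_Logits and x_Geometric."""
    h, w = image.shape[-2:]
    x = F.interpolate(image, size=(1024, 1024), mode="bilinear", align_corners=False)
    embeddings = sam.image_encoder(sam.preprocess(x * 255.0))                # (1, 256, 64, 64)
    sparse, dense = sam.prompt_encoder(points=None, boxes=None, masks=None)  # empty prompts
    low_res_logits, iou_scores = sam.mask_decoder(
        image_embeddings=embeddings,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )
    x_logits = F.interpolate(low_res_logits, size=(h, w), mode="bilinear", align_corners=False)
    x_geometric = torch.sigmoid(x_logits)
    return x_logits, x_geometric, iou_scores
```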

3.3. Loss Function

The loss function of the model is based on the cross-entropy loss. To balance the contributions of the semantic and geometric tasks, we enhance the selection and weighting of the semantic feature channels. A weighted loss function is used to ensure that both tasks influence the training process appropriately. By adjusting the weight of the semantic loss ($W_{Semantic}$), the model can focus more on the semantic task, thereby improving the performance of UO mapping. The formula for the loss function is as follows:

$$\mathrm{Loss} = W_{Semantic} \cdot L_{CE}(X_{Semantic}, \mathrm{Label}) + (1 - W_{Semantic}) \cdot L_{CE}(X_{Out}, \mathrm{Label}) \tag{6}$$
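A direct translation of Formula (6) into PyTorch is given below; the value of $W_{Semantic}$ is a placeholder, since the chosen weight is not reported in this section.

```python
import torch
import torch.nn.functional as F

def uosam_loss(x_semantic: torch.Tensor, x_out: torch.Tensor,
               label: torch.Tensor, w_semantic: float = 0.5) -> torch.Tensor:
    """Weighted sum of two cross-entropy terms, as in Formula (6).
    w_semantic = 0.5 is a placeholder, not the value used in the paper."""
    loss_semantic = F.cross_entropy(x_semantic, label)  # loss on X_Semantic
    loss_out = F.cross_entropy(x_out, label)            # loss on the fused output X_Out
    return w_semantic * loss_semantic + (1.0 - w_semantic) * loss_out
```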

4. Results

We conducted experiments on the Urban Openspace China Ten Cities (UOCTC) dataset and compared the proposed UOSAM with other SOTA semantic segmentation methods.

4.1. Experimental Setup

In the experiments, we used the Adam optimizer to adjust the model parameters, with the learning rate set to 0.0001. All implementations were based on PyTorch 2.4.1 and executed on an NVIDIA RTX 4080 GPU. The models were trained for 300 epochs with a batch size of 16. The mean Intersection over Union (mIoU) metric was used to evaluate the performance of the models.
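For completeness, the mIoU metric can be computed from a confusion matrix as in the sketch below (function name and tensor layout are ours); predictions and labels are integer class maps of the same shape.

```python
import torch

def per_class_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int = 5):
    """Per-class IoU and mIoU from integer (long) prediction and label maps."""
    idx = target.view(-1) * num_classes + pred.view(-1)
    conf = torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = conf.diag().float()                                   # true positives per class
    union = conf.sum(dim=0).float() + conf.sum(dim=1).float() - tp
    iou = tp / union.clamp(min=1)
    return iou, iou.mean()

# Example (hypothetical tensors): pred = logits.argmax(dim=1); iou, miou = per_class_iou(pred, labels)
```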
The semantic segmentation models compared in our experiments include FCN [56], FPN [57], DeepLabV3 [58], UperNet [59], SFNet [60], FaPN [61], and SegFormer-B1 [62]. Additionally, we compared object-based random forest methods [63,64]. OBRF (object-based random forest) utilized only spectral features, generating 100 objects per 465 × 465 image, with a random forest consisting of 200 decision trees. In contrast, OBRF-HL (object-based random forest with Haralick and Local Binary Pattern features) incorporated spectral features together with Local Binary Pattern (LBP) and Haralick texture features, generating 200 objects per 465 × 465 image while maintaining the same random forest configuration of 200 decision trees.
The comparison results obtained using UOSAM and the object-based random forest methods are shown in Table 1 and Table 2. It can be observed that the accuracy of the machine learning methods is not ideal for complex urban areas, especially for outdoor sports fields and transportation hubs.
The experimental results are shown in Table 3 and Table 4. Among the compared models, the FPN model exhibits the lowest accuracy (Acc) and Intersection over Union (IoU), with a mean IoU (mIoU) of less than 60%. The SFNet and UperNet models have similar mIoU values, ranging from 61% to 62%, placing them in the second-to-last tier of performance; their IoU and accuracy for each category are relatively consistent. Notably, compared to FPN, the improvements in SFNet and UperNet are mainly observed in the outdoor sports field and background categories. The FaPN model achieved an mIoU of 63.52%, only a slight improvement over the previous tier; its accuracy gains are primarily reflected in the outdoor sports field category, with minor improvements in the road and water body categories. This suggests that multi-scale features contribute to enhancing the accuracy of the outdoor sports field category. The FCN and DeepLabV3 models fall into the next performance tier, achieving significant accuracy improvements across all categories compared to FaPN. The SegFormer model ranks second in terms of mIoU; its superior performance is attributed to its attention mechanism, which effectively enhances feature representation. Our UOSAM model outperforms all other models, achieving the highest overall accuracy (OA) and mIoU values.
Figure 4 provides some predicted UO maps from the comparison models. As seen in Figure 4, our model produces fewer misclassified pixels. Additionally, the predicted maps from UOSAM more accurately segment the UO regions and better preserve their geometric features, such as smoother edges and more regular shapes. For instance, the results from models like FPN, FaPN, UperNet, and SFNet contain numerous internal holes and excessive salt-and-pepper noise around the UO objects, and fail to preserve the edges and shapes of UO objects. Meanwhile, the predicted UO maps from FCN, DeepLabV3, and SegFormer include some misclassified areas. Overall, the UOSAM model proposed in this paper is capable of accurately segmenting UO objects while maintaining their geometric features. The predicted UO maps highlight the advantages of UOSAM in UO mapping.

4.2. Ablation Study

We also conducted an ablation study to evaluate the contributions of the different components in our proposed UOSAM. The results are shown in Table 5. The SPFM module alone achieves high accuracy, with an mIoU of 68.79%. In contrast, the accuracy when using the GFM module alone is lower, likely because GFM primarily focuses on extracting geometric features and is less accurate in semantic classification. However, when the GFM module is combined with SPFM, the mIoU increases by 0.46%, resulting in UOSAM achieving the best overall accuracy (OA) of 84.15% and an mIoU of 69.25%. These results highlight the effectiveness of each component within the UOSAM model.

4.3. The Visualization and Analysis of Class-Specific Attention Maps

To demonstrate the effectiveness of the proposed UOSAM, we visualized class-specific attention maps (CAMs) for each class to highlight the locations of UO-specific regions. The CAMs are shown in Figure 5. First, we extracted the feature map preceding the softmax layer, which contains five channels corresponding to the green space, outdoor sports field, transportation hub, water body, and not open space categories. The feature map of each UO category in the corresponding channel is defined as its CAM. In the CAMs, red indicates high confidence in the presence of the UO category, with darker red signifying higher confidence, while darker purple indicates lower confidence.
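The CAM extraction described here amounts to taking the pre-softmax feature map channel by channel; a minimal sketch, with min–max normalization added only for visualization, is given below.

```python
import torch

def class_activation_maps(logits: torch.Tensor) -> torch.Tensor:
    """Take the (B, cls, H, W) map preceding the softmax and return one CAM per UO category,
    min-max normalised to [0, 1] for colour mapping (red = high, purple = low)."""
    cams = []
    for c in range(logits.shape[1]):
        cam = logits[:, c]                         # channel c is the CAM of category c
        lo = cam.amin(dim=(1, 2), keepdim=True)
        hi = cam.amax(dim=(1, 2), keepdim=True)
        cams.append((cam - lo) / (hi - lo + 1e-6))
    return torch.stack(cams, dim=1)                # (B, cls, H, W)
```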
It can be observed that the CAMs of our UOSAM model are the closest to the UO ground truth. This is reflected in the similarity of the high-confidence regions between the five CAMs and the five UO categories. Furthermore, in the CAMs of our UOSAM model, there are more prominent purple areas in non-corresponding categories, while corresponding categories exhibit a more distinct presence of red regions. In contrast, the CAMs of the other two models display less contrast between red and purple regions, with many areas appearing yellow or light red. This indicates that these models have lower confidence in determining UO regions and categories. Compared to other models, UOSAM shows a more pronounced contrast between high-confidence and low-confidence regions for each category, demonstrating the superiority of the UOSAM model. The visualization results of the aforementioned CAMs indicate that, compared to other models, our UOSAM can more accurately determine the categories and locations of UOs. The experimental results confirm the superior performance of UOSAM.

4.4. Semantic Segmentation Results of Ten Major Cities

The results for the ten major cities are shown in Figure 6. They indicate that UOSAM offers certain improvements in urban open space recognition compared to traditional models, particularly in areas with clearly defined boundaries, where it more accurately reflects the actual spatial distribution characteristics of the urban environment. This highlights the model’s strong boundary-extraction capability, making it well suited for analyzing complex urban scenes.

5. Discussion

This study proposes UOSAM, a model for high-resolution urban open space segmentation, achieving a series of significant results. The experimental results demonstrate that UOSAM can accurately identify urban open spaces while optimizing the geometric shapes of their boundaries. Compared to existing deep learning models, UOSAM also shows an improvement in segmentation accuracy. Its structural design enables the extraction of multi-scale semantic and geometric features, leading to enhanced boundary detail optimization in open space segmentation.
Although UOSAM achieves promising results in urban open space semantic segmentation, it still has some limitations. The performance improvement over traditional deep learning models is not substantial, largely due to challenges in distinguishing transportation hubs and outdoor sports fields. In complex scenarios with narrow roads or unclear boundaries between roads and their surroundings, roads tend to merge with the environment, forming irregular objects. While UOSAM effectively identifies large green spaces, it may misclassify areas with blurred boundaries, such as transitions between green spaces and water bodies; this issue may be caused by lighting variations or feature similarities. The model performs exceptionally well in segmenting outdoor sports fields, particularly those with regular shapes, but it sometimes confuses vacant land or uncultivated farmland with outdoor sports fields. The building segmentation results are relatively robust, particularly in densely built environments, where segmentation completeness is well maintained. In addition, this study only uses the RGB bands and therefore lacks the additional spectral bands available in multispectral imagery. In open space semantic segmentation tasks, multispectral images can provide additional spectral channels, potentially enhancing the model’s recognition ability. For example, the near-infrared (NIR) band, which is highly sensitive to vegetation, could help distinguish green spaces more effectively. However, while the spectral resolution of Sentinel-2’s multispectral bands is high, its spatial resolution is low. As shown in Figure 7, within the RGB bands of Sentinel-2, small-scale outdoor sports fields, such as the basketball court marked by the red box, are difficult to recognize. While commercial satellites (e.g., Planet’s SuperDove) provide high-resolution multispectral imagery that can compensate for the resolution issue, acquiring such data is expensive, making it less suitable for large-scale experiments and widespread applications.

6. Conclusions

In the context of accelerating global urbanization, maintaining sufficient urban open space helps preserve ecological balance, improve air quality, and mitigate the urban heat island effect. This paper proposes UOSAM for the urban open space (UO) mapping task to address challenges such as high inter-class similarity, complex environments, and large scale variations. The proposed UOSAM uses the SPFM to extract semantic features and the GFM to extract geometric features, integrates the semantic and geometric features of remote sensing images, and enhances the selection of semantic feature channels to improve UO mapping. We conducted extensive experiments using 1.19 m resolution Google RGB images from the UOCTC dataset covering the ten major cities in China. The experimental results show that UOSAM achieves 69.25% mIoU and 84.15% overall accuracy on the UOCTC dataset, outperforming models such as FPN and DeepLabV3 in the urban open space mapping task and demonstrating advantages over traditional deep learning semantic segmentation methods. In future research, we plan to integrate multispectral data with high-resolution remote sensing imagery to enhance the accuracy of open space extraction. Additionally, we aim to further expand the categories of open space extraction to support more refined urban planning and environmental assessment.

Author Contributions

Conceptualization, R.F. (Runyu Fan), D.L. and Z.X.; methodology, R.F. (Runyu Fan), D.L. and Z.X.; software, Z.X.; validation, H.N., J.C. and R.F. (Ruyi Feng); formal analysis, Z.X. and R.F. (Ruyi Feng); investigation, H.N. and R.F. (Ruyi Feng); resources, J.C. and R.F. (Ruyi Feng); data curation, J.C., H.N. and Z.X.; writing—original draft preparation, Z.X.; writing—review and editing, Z.X.; visualization, Z.X.; supervision, R.F. (Runyu Fan) and D.L.; project administration, R.F. (Runyu Fan) and R.F. (Ruyi Feng). All authors have read and agreed to the published version of the manuscript.

Funding

The project was supported by the National Natural Science Foundation of China (No. 42401469), and the “CUG Scholar” Scientific Research Funds at China University of Geosciences (Wuhan) (Project No. 2024013).

Data Availability Statement

The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Thwaites, K.; Helleur, E.; Simkins, I. Restorative urban open space: Exploring the spatial configuration of human emotional fulfilment in urban open space. Landsc. Res. 2005, 30, 525–547. [Google Scholar]
  2. Byrne, J.; Sipe, N. Green and Open Space Planning for Urban Consolidation—A Review of the Literature and Best Practice. Issues Paper 11. 2010. Available online: https://core.ac.uk/download/pdf/143882947.pdf (accessed on 25 July 2024).
  3. World Bank. Available online: https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS (accessed on 25 July 2024).
  4. United Nations Human Settlements Programme (UN-Habitat). Available online: https://unhabitat.org/wcr/ (accessed on 25 July 2024).
  5. Topcu, U. Reflections of gender on the urban green space. Archnet-IJAR Int. J. Archit. Res. 2020, 14, 70–76. [Google Scholar]
  6. Yung, E.H.; Conejos, S.; Chan, E.H. Public open spaces planning for the elderly: The case of dense urban renewal districts in Hong Kong. Land Use Policy 2016, 59, 1–11. [Google Scholar]
  7. Wortzel, J.D.; Wiebe, D.J.; DiDomenico, G.E.; Visoki, E.; South, E.; Tam, V.; Greenberg, D.M.; Brown, L.A.; Gur, R.C.; Gur, R.E.; et al. Association between urban greenspace and mental wellbeing during the COVID-19 pandemic in a US cohort. Front. Sustain. Cities 2021, 3, 686159. [Google Scholar]
  8. Høj, S.B.; Paquet, C.; Caron, J.; Daniel, M. Relative ‘greenness’ and not availability of public open space buffers stressful life events and longitudinal trajectories of psychological distress. Health Place 2021, 68, 102501. [Google Scholar] [PubMed]
  9. Crossley, A.J.; Russo, A. Has the pandemic altered public perception of how local green spaces affect quality of life in the United Kingdom? Sustainability 2022, 14, 7946. [Google Scholar] [CrossRef]
  10. Alberti, M. Maintaining ecological integrity and sustaining ecosystem function in urban areas. Curr. Opin. Environ. Sustain. 2010, 2, 178–184. [Google Scholar]
  11. Han, W.; Wang, L.; Wang, Y. A novel framework for leveraging geological environment big data to assess Sustainable Development Goals. Innovation 2025, 3, 100122. [Google Scholar]
  12. Tao, L.; Xu, Y.; He, K.; Ma, X.; Wang, L. Pan-spatial Earth information system: A new methodology for cognizing the earth system. Innovation 2025, 6, 100770. [Google Scholar] [CrossRef]
  13. Hewitt, C.N.; Ashworth, K.; MacKenzie, A.R. Using green infrastructure to improve urban air quality (GI4AQ). Ambio 2020, 49, 62–73. [Google Scholar]
  14. Bigazzi, A.Y.; Rouleau, M. Can traffic management strategies improve urban air quality? A review of the evidence. J. Transp. Health 2017, 7, 111–124. [Google Scholar]
  15. Elghonaimy, I.; Mohammed, W.E. Urban heat islands in Bahrain: An urban perspective. Buildings 2019, 9, 96. [Google Scholar] [CrossRef]
  16. Deilami, K.; Kamruzzaman, M.; Liu, Y. Urban heat island effect: A systematic review of spatio-temporal factors, data, methods, and mitigation measures. Int. J. Appl. Earth Obs. Geoinf. 2018, 67, 30–42. [Google Scholar]
  17. Wang, L.; Ma, Y.; Zomaya, A.Y.; Ranjan, R.; Chen, D. A parallel file system with application-aware data layout policies for massive remote sensing image processing in digital earth. IEEE Trans. Parallel Distrib. Syst. 2014, 26, 1497–1508. [Google Scholar]
  18. Li, L.; Liu, P.; Wu, J.; Wang, L.; He, G. Spatiotemporal remote-sensing image fusion with patch-group compressed sensing. IEEE Access 2020, 8, 209199–209211. [Google Scholar]
  19. Wang, L.; Zuo, B.; Le, Y.; Chen, Y.; Li, J. Penetrating remote sensing: Next-generation remote sensing for transparent earth. Innovation 2023, 4, 100519. [Google Scholar]
  20. Zhao, T.; Wang, S.; Ouyang, C.; Chen, M.; Liu, C.; Zhang, J.; Yu, L.; Wang, F.; Xie, Y.; Li, J.; et al. Artificial intelligence for geoscience: Progress, challenges, and perspectives. Innovation 2024, 5, 100691. [Google Scholar]
  21. Song, W.; Liu, P.; Wang, L. Sparse representation-based correlation analysis of non-stationary spatiotemporal big data. Int. J. Digit. Earth 2016, 9, 892–913. [Google Scholar]
  22. Fan, R.; Li, J.; Song, W.; Han, W.; Yan, J.; Wang, L. Urban informal settlements classification via a transformer-based spatial-temporal fusion network using multimodal remote sensing and time-series human activity data. Int. J. Appl. Earth Obs. Geoinf. 2022, 111, 102831. [Google Scholar]
  23. Fan, R.; Feng, R.; Wang, L.; Yan, J.; Zhang, X. Semi-MCNN: A semisupervised multi-CNN ensemble learning method for urban land cover classification using submeter HRRS images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4973–4987. [Google Scholar]
  24. Li, W.; Wu, B.; Fan, R.; Tian, F.; Zhang, M.; Zhou, Z.; Hu, J.; Feng, R.; Wu, F. Multiclass Crop Interpretation via a Lightweight Attentive Feature Fusion Network Using Vehicle-View Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 496–509. [Google Scholar] [CrossRef]
  25. Han, W.; Li, J.; Wang, S.; Zhang, X.; Dong, Y.; Fan, R.; Zhang, X.; Wang, L. Geological remote sensing interpretation using deep learning feature and an adaptive multisource data fusion network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4510314. [Google Scholar]
  26. He, K.; Dong, J.; Ma, H.; Cai, Y.; Feng, R.; Dong, Y.; Wang, L. Remote sensing image interpretation of geological lithology via a sensitive feature self-aggregation deep fusion network. Int. J. Appl. Earth Obs. Geoinf. 2025, 137, 104384. [Google Scholar] [CrossRef]
  27. Zheng, Z.; Ermon, S.; Kim, D.; Zhang, L.; Zhong, Y. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 725–741. [Google Scholar]
  28. Zheng, Z.; Zhong, Y.; Zhang, L.; Burke, M.; Lobell, D.B.; Ermon, S. Towards transferable building damage assessment via unsupervised single-temporal change adaptation. Remote Sens. Environ. 2024, 315, 114416. [Google Scholar]
  29. Zhong, Y.; Yan, B.; Yi, J.; Yang, R.; Xu, M.; Su, Y.; Zheng, Z.; Zhang, L. Global urban high-resolution land-use mapping: From benchmarks to multi-megacity applications. Remote Sens. Environ. 2023, 298, 113758. [Google Scholar]
  30. Ma, H.; Yang, X.; Fan, R.; Han, W.; He, K.; Wang, L. Refined Water-Body Types Mapping Using a Water-Scene Enhancement Deep Models by Fusing Optical and SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 17430–17441. [Google Scholar]
  31. He, K.; Zhang, Z.; Dong, Y.; Cai, D.; Lu, Y.; Han, W. Improving Geological Remote Sensing Interpretation via a Contextually Enhanced Multiscale Feature Fusion Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 6158–6173. [Google Scholar]
  32. Lu, Y.; He, K.; Xu, H.; Dong, Y.; Han, W.; Wang, L.; Liang, D. Remote-sensing interpretation for soil elements using adaptive feature fusion network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4505515. [Google Scholar] [CrossRef]
  33. Pan, D.; Zhang, M.; Zhang, B. A generic FCN-based approach for the road-network extraction from VHR remote sensing images–using openstreetmap as benchmarks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2662–2673. [Google Scholar]
  34. Chen, D.; Zhong, Y.; Zheng, Z.; Ma, A.; Lu, X. Urban road mapping based on an end-to-end road vectorization mapping network framework. ISPRS J. Photogramm. Remote Sens. 2021, 178, 345–365. [Google Scholar] [CrossRef]
  35. Lu, X.; Zhong, Y.; Zheng, Z.; Liu, Y.; Zhao, J.; Ma, A.; Yang, J. Multi-scale and multi-task deep learning framework for automatic road extraction. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9362–9377. [Google Scholar] [CrossRef]
  36. Lu, X.; Weng, Q. Multi-LoRA Fine-Tuned Segment Anything Model for Urban Man-Made Object Extraction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5637519. [Google Scholar] [CrossRef]
  37. Chen, W.; Zhou, G.; Liu, Z.; Li, X.; Zheng, X.; Wang, L. NIGAN: A framework for mountain road extraction integrating remote sensing road-scene neighborhood probability enhancements and improved conditional generative adversarial network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5626115. [Google Scholar] [CrossRef]
  38. Cheng, L.; Wang, L.; Feng, R.; Yan, J. Remote sensing and social sensing data fusion for fine-resolution population mapping with a multimodel neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5973–5987. [Google Scholar] [CrossRef]
  39. Zhang, X.; Chen, Y.; Le, Y.; Zhang, D.; Yan, Q.; Dong, Y.; Han, W.; Wang, L. Nearshore bathymetry based on ICESat-2 and multispectral images: Comparison between Sentinel-2, Landsat-8, and testing Gaofen-2. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2449–2462. [Google Scholar] [CrossRef]
  40. Sanlang, S.; Cao, S.; Du, M.; Mo, Y.; Chen, Q.; He, W. Integrating aerial LiDAR and very-high-resolution images for urban functional zone mapping. Remote Sens. 2021, 13, 2573. [Google Scholar] [CrossRef]
  41. Zhou, W.; Ming, D.; Lv, X.; Zhou, K.; Bao, H.; Hong, Z. SO–CNN based urban functional zone fine division with VHR remote sensing image. Remote Sens. Environ. 2020, 236, 111458. [Google Scholar] [CrossRef]
  42. Fan, R.; Li, F.; Han, W.; Yan, J.; Li, J.; Wang, L. Fine-scale urban informal settlements mapping by fusing remote sensing images and building data via a transformer-based multimodal fusion network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5630316. [Google Scholar] [CrossRef]
  43. Fan, R.; Niu, H.; Xu, Z.; Chen, J.; Feng, R.; Wang, L. Refined Urban Informal Settlements’ Mapping at Agglomeration Scale With the Guidance of Background Knowledge From Easy-Accessed Crowdsourced Geospatial Data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4401716. [Google Scholar] [CrossRef]
  44. Niu, H.; Fan, R.; Chen, J.; Xu, Z.; Feng, R. Urban informal settlements interpretation via a novel multi-modal Kolmogorov–Arnold fusion network by exploring hierarchical features from remote sensing and street view images. Sci. Remote Sens. 2025, 11, 100208. [Google Scholar] [CrossRef]
  45. Wang, S.; Han, W.; Zhang, X.; Li, J.; Wang, L. Geospatial remote sensing interpretation: From perception to cognition. Innov. Geosci. 2024, 2, 100056. [Google Scholar]
  46. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [Google Scholar]
  47. Feng, Q.; Liu, J.; Gong, J. UAV remote sensing for urban vegetation mapping using random forest and texture analysis. Remote Sens. 2015, 7, 1074–1094. [Google Scholar] [CrossRef]
  48. Zhao, B.; Zhong, Y.; Zhang, L. A spectral–structural bag-of-features scene classifier for very high spatial resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2016, 116, 73–85. [Google Scholar] [CrossRef]
  49. Pan, L.; Gu, L.; Ren, R.; Yang, S. Land cover classification based on machine learning using UAV multi-spectral images. In Proceedings of the Earth Observing Systems XXV, SPIE, Online, 24 August–4 September 2020; Volume 11501, pp. 297–308. [Google Scholar]
  50. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. An object-based convolutional neural network (OCNN) for urban land use classification. Remote Sens. Environ. 2018, 216, 57–70. [Google Scholar]
  51. Jin, B.; Ye, P.; Zhang, X.; Song, W.; Li, S. Object-oriented method combined with deep convolutional neural networks for land-use-type classification of remote sensing images. J. Indian Soc. Remote Sens. 2019, 47, 951–965. [Google Scholar]
  52. Huang, J.; Weng, L.; Chen, B.; Xia, M. DFFAN: Dual function feature aggregation network for semantic segmentation of land cover. ISPRS Int. J. Geo-Inf. 2021, 10, 125. [Google Scholar] [CrossRef]
  53. Xu, W.; Deng, X.; Guo, S.; Chen, J.; Sun, L.; Zheng, X.; Xiong, Y.; Shen, Y.; Wang, X. High-resolution u-net: Preserving image details for cultivated land extraction. Sensors 2020, 20, 4064. [Google Scholar] [CrossRef]
  54. Men, G.; He, G.; Wang, G. Concatenated residual attention unet for semantic segmentation of urban green space. Forests 2021, 12, 1441. [Google Scholar] [CrossRef]
  55. Shi, Q.; Liu, M.; Marinoni, A.; Liu, X. UGS-1m: Fine-grained urban green space mapping of 34 major cities in China based on the deep learning framework. Earth Syst. Sci. Data Discuss. 2022, 2022, 1–23. [Google Scholar]
  56. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  57. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  58. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  59. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  60. Lee, J.; Kim, D.; Ponce, J.; Ham, B. Sfnet: Learning object-aware semantic correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2278–2287. [Google Scholar]
  61. Huang, S.; Lu, Z.; Cheng, R.; He, C. FaPN: Feature-aligned pyramid network for dense image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 864–873. [Google Scholar]
  62. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar]
  63. Stefanski, J.; Mack, B.; Waske, B. Optimization of object-based image analysis with random forests for land cover mapping. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 2492–2504. [Google Scholar] [CrossRef]
  64. Çömert, R.; Matcı, D.K.; Avdan, U. Object based burned area mapping with random forest algorithm. Int. J. Eng. Geosci. 2019, 4, 78–87. [Google Scholar] [CrossRef]
Figure 1. Remote sensing images and labels of cities.
Figure 2. (a) The pixel count of each category across all labels. (b) The proportion of each category within the labels of each city.
Figure 3. The overall structure of the proposed UOSAM.
Figure 4. Results of the compared models in UOCTC datasets.
Figure 5. The CAMs over the images. (A), (B), (C), (D), and (E) are the CAMs corresponding to the GS, OSF, TH, WB, and NOS categories, respectively. (F) represents the ground truth of the UO classes. (a,b) show the CAMs of two remote sensing images.
Figure 6. UO mapping results of ten major cities. (a) The UO mapping results of Beijing, Changsha, Chengdu, Chongqing, and Guangzhou. (b) The UO mapping results of Nanchang, Shanghai, Shenzhen, Tianjin, and Wuhan.
Figure 7. Comparison between our results, Google imagery, and Sentinel-2 imagery. The orange areas are outdoor sports fields. The green box marks a large running track, and the red boxes mark small-scale outdoor sports fields.
Table 1. IoU comparison between the object-based random forest method and UOSAM on the UOCTC datasets. The highest performances are bolded.
IoU (%)
Model      GS      OSF     TH      WB      NOS     mIoU
OBRF       50.33   6.15    10.67   56.57   57.58   36.26
OBRF-HL    48.26   12.75   18.66   73.89   54.37   41.59
UOSAM      72.56   61.01   51.65   85.35   75.70   69.25
Table 2. Accuracy comparison between the object-based random forest method and UOSAM on the UOCTC datasets. The highest performances are bolded.
Acc (%)
Model      GS      OSF     TH      WB      NOS     OA
OBRF       66.95   6.55    13.01   62.95   83.07   66.74
OBRF-HL    57.87   26.55   42.15   80.33   72.04   64.47
UOSAM      83.03   69.68   65.78   92.10   88.34   84.15
Table 3. IoU of each comparison model in UOCTC datasets. The highest performances are bolded.
IoU (%)
Model      GS      OSF     TH      WB      NOS     mIoU
FCN        71.72   57.98   51.08   84.76   75.08   68.12
FPN        66.87   39.65   41.10   78.99   69.36   59.19
DeepLabV3  72.19   60.86   51.49   84.43   75.51   68.89
FaPN       67.82   53.02   44.09   80.18   72.47   63.52
SFNet      66.87   47.68   42.85   78.17   72.04   61.52
UperNet    66.98   46.51   43.30   79.88   71.10   61.55
SegFormer  72.42   60.40   50.64   85.51   74.98   68.79
UOSAM      72.56   61.01   51.65   85.35   75.70   69.25
Table 4. Accuracy of each comparison model in UOCTC datasets. The highest performances are bolded.
Acc (%)
Model      GS      OSF     TH      WB      NOS     OA
FCN        82.30   66.85   64.31   92.26   88.31   83.68
FPN        79.11   45.86   52.35   86.17   86.61   79.36
DeepLabV3  82.89   69.48   64.06   91.82   88.57   83.98
FaPN       79.47   63.90   54.92   90.13   87.58   81.10
SFNet      79.17   55.00   53.56   87.38   87.87   80.40
UperNet    79.60   53.03   54.88   87.98   86.79   80.27
SegFormer  84.29   70.39   64.65   92.41   86.61   83.84
UOSAM      83.03   69.68   65.78   92.10   88.34   84.15
Table 5. Ablation study of the proposed method. The highest performances are bolded.
Method   SPFM   GFM   OA (%)   mIoU (%)
UOSAM    ×      ✓     59.63    30.01
UOSAM    ✓      ×     83.84    68.79
UOSAM    ✓      ✓     84.15    69.25
