2.2.2. Extraction of Physical Features
Physical features are derived from remote sensing images, street view images, road networks, and building footprints. These features capture the spatial configuration and morphological characteristics of urban areas.
Remote sensing images inherently contain both global and local features. Global features capture long-range spatial dependencies, while local features capture pixel-level and neighborhood-level details. Previous research has rarely considered global and local features concurrently, leading to inadequate feature representation. To enable robust and accurate land use classification, our framework combines global features extracted by the Masked Autoencoder (MAE) [30] with local features derived from DeepLabV3+ [31].
For global feature extraction from remote sensing images, we employ the MAE model, a self-supervised learning model widely used in computer vision. Compared with traditional CNNs, the MAE, built on the Vision Transformer (ViT), captures spatial correlations over longer distances through global self-attention and can therefore extract global image features more effectively. Moreover, given the scarcity of labeled samples, masking-based self-supervised training enriches the training data and improves the generalization ability of the model. The MAE operates through a masking mechanism and consists of an encoder and a decoder. The encoder, implemented with a ViT architecture, processes only the unmasked patches. Inspired by the success of Transformer models in natural language processing, ViT treats image patches as sequential elements and captures global dependencies among them. After encoding, a lightweight decoder reconstructs the original image by processing both masked and unmasked patches. Through this process, the model learns spatial correlations among patches, providing an effective representation of long-range spatial dependencies. The features extracted by the encoder serve as the global features of the remote sensing images.
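To make the masking mechanism concrete, the following is a minimal sketch of MAE-style random masking in PyTorch, in which only the visible patch tokens would be passed to the encoder; the mask ratio, patch count, and token dimension are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of MAE random masking (illustrative dimensions only).
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; the MAE encoder sees only these.

    patches: (batch, num_patches, embed_dim) patch embeddings.
    Returns the visible tokens and the shuffle indices needed to restore order.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                       # random score per patch
    ids_shuffle = noise.argsort(dim=1)             # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]             # first n_keep patches stay visible
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_shuffle

# e.g. 196 patches from a 224x224 image with 16x16 patches, 768-dim tokens
tokens = torch.randn(8, 196, 768)
visible, _ = random_masking(tokens)                # (8, 49, 768): 75% masked out
```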
Because the performance of the MAE depends on the diversity of its training samples, and because our study area contains relatively few samples (about 12,000 research units), we adopted a pre-training and fine-tuning paradigm to train a robust and generalizable MAE model. This approach allowed the model to learn universal and robust image features from a broader dataset before being fine-tuned on our study area to adapt to the specific task of land use classification. During pre-training, we constructed a dataset using remote sensing images from built-up areas of Beijing, Shanghai, Guangzhou, and Shenzhen. To ensure compatibility with the fine-tuning task, we resampled the pre-training samples so that the empirical cumulative distribution function (ECDF) of their shape features followed the ECDF of the shape features of samples in our study area. For fine-tuning, the task shifted from image reconstruction to land use classification; we therefore discarded the MAE decoder and retained only the ViT encoder. To prevent leakage between the training and test sets, we fine-tuned a separate classification model for each of the five partitions in the five-fold cross-validation described in Section 2.2.5. Finally, the image features extracted by the encoder were used as the global features of the remote sensing images.
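The ECDF matching step can be illustrated with a simple quantile-based resampling scheme. The sketch below assumes a single scalar shape feature and a hypothetical nearest-value matching rule; the exact procedure used in our pipeline may differ.

```python
# Sketch of ECDF-matched sampling: draw pre-training samples whose
# shape-feature distribution follows the study area's distribution.
import numpy as np

def ecdf_matched_sample(candidate_shape: np.ndarray,
                        target_shape: np.ndarray,
                        n_samples: int,
                        rng=np.random.default_rng(0)) -> np.ndarray:
    """Return indices into `candidate_shape` whose ECDF follows `target_shape`."""
    # Inverse-transform sampling from the study-area ECDF ...
    u = rng.uniform(0, 1, n_samples)
    target_values = np.quantile(target_shape, u)
    # ... then, for each drawn value, pick the closest candidate sample.
    order = np.argsort(candidate_shape)
    pos = np.searchsorted(candidate_shape[order], target_values)
    pos = np.clip(pos, 0, len(order) - 1)
    return order[pos]

# candidate_shape: shape feature of 4-city pre-training image chips
# target_shape:    the same feature for the ~12,000 study-area units
```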
For local feature extraction from remote sensing images, we employed semantic segmentation as the proxy task, which is well-suited for capturing pixel-level and neighborhood-level image features. The segmentation was accomplished using the DeepLabV3+ architecture, which also follows an encoder–decoder structure. DeepLabV3+ employs Atrous Spatial Pyramid Pooling (ASPP), which effectively captures image features at different scales; this is highly beneficial for accurately delineating the boundaries of multi-scale objects in remote sensing images. In DeepLabV3+, the encoder module downsamples the feature map and captures high-level semantic information, while the decoder module recovers spatial detail. The encoder consists of two modules: a Deep Convolutional Neural Network (DCNN) and an ASPP module. The DCNN module employs an Xception_65 backbone network for feature extraction, and the ASPP module captures convolutional features at multiple scales by applying atrous convolutions with different dilation rates. The decoder is simple but effective: the encoder features are first up-sampled bilinearly by a factor of 4, then concatenated with low-level features of the same spatial resolution from the encoder backbone; after convolution, the result is up-sampled again to obtain a segmentation map of the same size as the input image.
During the training phase of DeepLabV3+, we defined seven land cover categories: lawn, shrub, ground, impervious surface, road, building, and water. The Xception network parameters pre-trained on the ImageNet-1k dataset were used as the initial parameters of the DeepLabV3+ backbone, and we applied the stochastic gradient descent (SGD) optimizer with the 'poly' learning rate policy. After semantic segmentation, we calculated the proportion of land cover types within each parcel as the local features from remote sensing images.
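As an illustration, the per-parcel local features can be derived from the segmentation output as in the sketch below; the class id ordering and the availability of a rasterized parcel mask are assumptions of the sketch.

```python
# Sketch of the local features: per-parcel proportions of the seven
# land cover classes predicted by DeepLabV3+ (class ids 0..6 assumed).
import numpy as np

CLASSES = ["lawn", "shrub", "ground", "impervious", "road", "building", "water"]

def land_cover_proportions(seg_map: np.ndarray, parcel_mask: np.ndarray) -> np.ndarray:
    """seg_map: (H, W) predicted class ids; parcel_mask: (H, W) bool for one parcel."""
    pixels = seg_map[parcel_mask]
    counts = np.bincount(pixels, minlength=len(CLASSES))
    return counts / max(pixels.size, 1)            # 7-dim proportion vector
```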
Street view images can complement remote sensing data by providing ground-level visual details, such as building facades, street conditions, and vegetation, enhancing the granularity and accuracy of land use classification. However, owing to limitations of the acquisition equipment, the visual range of street view images is restricted, and they contain considerable noise irrelevant to land use classification. We therefore adopt scene category recognition as a proxy task to extract the image subject and filter out this noise. Referring to [25], we chose the ResNet-50 network [32] pre-trained on the Places2 dataset [33] as the classification network. This network assigns images to 365 scene categories, covering both indoor and outdoor scenes. We calculated the proportion of street view images in each scene category for each research unit as the second type of physical feature.
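A minimal sketch of this step is shown below. It assumes the publicly released Places365 ResNet-50 checkpoint; the file name resnet50_places365.pth.tar and the key-renaming step follow that public release and are assumptions here rather than a description of our exact pipeline.

```python
# Sketch of the street-view feature: classify each image into one of 365
# Places scene categories and compute per-unit proportions.
import numpy as np
import torch
import torchvision.models as models

model = models.resnet50(num_classes=365)
ckpt = torch.load("resnet50_places365.pth.tar", map_location="cpu")
state = {k.replace("module.", ""): v for k, v in ckpt["state_dict"].items()}
model.load_state_dict(state)
model.eval()

def scene_proportions(image_batch: torch.Tensor) -> np.ndarray:
    """image_batch: (N, 3, 224, 224) normalized street-view images of one unit."""
    with torch.no_grad():
        preds = model(image_batch).argmax(dim=1)   # scene id per image
    counts = np.bincount(preds.numpy(), minlength=365)
    return counts / max(len(preds), 1)             # 365-dim proportion vector
```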
To enrich the physical features of the urban built-up environment, the road networks and building footprints were used to further characterize the urban connectivity and architectural form. The connectivity is characterized by the density of road networks within a land parcel and its neighboring area. The neighboring area is defined as a buffer zone with a radius equal to the square root of the parcel area. The architectural form is represented by the intrinsic features and spatial distribution features of the buildings in a land parcel. Intrinsic features are constructed by calculating both the mean value and standard deviation of the building attributes, including construction year, building area, height, perimeter, shape coefficient (the ratio of perimeter to area), and shape complexity (the number of vertices). Spatial distribution features are characterized by the floor area ratio (FAR) and the nearest neighbor distance (NND), which reflect the density and spatial arrangement of buildings. The calculation formulas for the FAR and the NND are as follows:
$$\mathrm{FAR} = \frac{\sum_{i=1}^{N} S_i \times F_i}{A}$$
where $N$ denotes the total number of buildings within a region, $S_i$ and $F_i$, respectively, represent the floor area and the number of floors of building $i$, and $A$ is the total area of the region.
$$\mathrm{NND} = \frac{1}{N} \sum_{i=1}^{N} \min_{j \neq i} d_{ij}$$
where $N$ denotes the total number of buildings within a region, and $d_{ij}$ represents the distance from building $i$ to building $j$.
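Both quantities can be computed directly from the building attribute table, for example as in the following sketch; representing building locations by their centroids and measuring centroid-to-centroid distances is an assumption of the sketch.

```python
# Sketch of the two spatial distribution features, computed from building
# centroids and attributes (column semantics are assumptions).
import numpy as np
from scipy.spatial import cKDTree

def far(floor_area: np.ndarray, n_floors: np.ndarray, region_area: float) -> float:
    """FAR = sum_i S_i * F_i / A."""
    return float(np.sum(floor_area * n_floors) / region_area)

def nnd(centroids: np.ndarray) -> float:
    """NND = (1/N) * sum_i min_{j != i} d_ij, using centroid-to-centroid distance."""
    tree = cKDTree(centroids)
    # k=2: nearest neighbor of each point excluding the point itself
    dist, _ = tree.query(centroids, k=2)
    return float(dist[:, 1].mean())
```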
Furthermore, due to the spatial spillover effect of human activities, activities occurring in one area may influence the spatial structure of surrounding areas [34], especially when the research units are small. Therefore, we also constructed a set of building features that consider the neighborhood of a land parcel. Specifically, when calculating the intrinsic and spatial distribution features of the buildings, we considered not only the buildings within the land parcel but also those in its neighborhood. As before, the neighboring area is defined as a buffer zone with a radius equal to the square root of the parcel area (see the sketch below). Another potential benefit of this operation is that, for sparse data, expanding the target area makes the overall trend of the features more prominent, thereby reducing the impact of random variability from small samples.
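A minimal sketch of this neighborhood construction with Shapely is given below; selecting buildings that intersect the buffer is an assumption of the sketch.

```python
# Sketch of the neighborhood used for the spillover features:
# a buffer around the parcel with radius sqrt(parcel area).
import math
from shapely.geometry import Polygon

def neighborhood(parcel: Polygon) -> Polygon:
    """Buffer zone whose radius equals the square root of the parcel area."""
    radius = math.sqrt(parcel.area)
    return parcel.buffer(radius)

# Buildings intersecting neighborhood(parcel) are then pooled with those
# inside the parcel before recomputing the intrinsic and spatial features.
```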
2.2.3. Extraction of Socioeconomic Features
Socioeconomic features were extracted from data such as POIs, social media check-ins, and mobile phone signaling. These features reflect the functional dynamics and human activities of urban areas.
To capture the economic characteristics and functional distribution of urban land use, we constructed a category feature and a spatial feature based on POIs. The POIs were classified into 23 categories according to a customized classification system: residence, company, education, office, hospital, parking, shop, food, domestic, university, government, car service, hotel, leisure, beauty, sport, finance, media, tourism, transport, school, research, and other. For each land parcel, the density and the proportion of each POI category were calculated, yielding a 46-dimensional category feature vector. This feature provides insights into the intensity and composition of urban facilities within a land parcel. To extract the spatial co-occurrence information contained in POIs, we constructed a Delaunay triangulation network based on the spatial locations of POIs and performed spatially explicit random walks on it. Each random walk started at a randomly selected POI and traversed the network according to a transition probability that accounts for spatial distance decay and the balance between local and long-range co-occurrences [35]. After capturing the spatial co-occurrence patterns, the aggregation function based on LSTM networks and attention mechanisms proposed in [36] was used to aggregate the sequences generated by the random walks. With this POI aggregation function, we obtained the POI distribution feature of the land parcels.
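The 46-dimensional category feature can be assembled as in the following sketch, assuming the 23 POI categories have been mapped to integer ids 0–22.

```python
# Sketch of the 46-dimensional POI category feature: per-category density
# and proportion for one land parcel.
import numpy as np

N_CATEGORIES = 23

def poi_category_feature(poi_cats: np.ndarray, parcel_area: float) -> np.ndarray:
    """poi_cats: integer category ids of all POIs inside one parcel."""
    counts = np.bincount(poi_cats, minlength=N_CATEGORIES).astype(float)
    density = counts / parcel_area                   # POIs per unit area
    proportion = counts / max(counts.sum(), 1.0)     # facility composition
    return np.concatenate([density, proportion])     # 46-dim vector
```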
Considering that different land uses often host different human activities, we further characterized the spatiotemporal and socioeconomic features of urban land parcels using social media check-ins and mobile phone signaling. Social media check-ins capture the spatiotemporal distribution of the active population, while mobile phone signaling supplements this with the spatial and temporal patterns of the resident population. These two datasets provide complementary perspectives on population distribution and dynamics, enabling a comprehensive representation of human activity patterns. For each land parcel, we computed temporal variation curves of the number of check-ins at two temporal scales: hourly variations within a day and daily variations within a week. For the mobile phone signaling, we generated only the 24-h temporal variation curves of the number of users for each land parcel, because the data span is insufficient for calculating weekly changes. The number of users per hour in each land parcel was obtained by resampling the user counts recorded at a 250-m grid resolution; when a grid cell covers multiple parcels, the users were distributed in proportion to the covered area. These curves represent the temporal dynamics of population distribution and activity intensity, providing insights into the diurnal and weekly patterns of human behavior. To further extract high-level temporal features from the variation curves, we applied a one-dimensional convolutional neural network (1D-CNN) to the hourly curves, as the 1D-CNN can automatically learn local patterns and dependencies in sequential data. Finally, both the original temporal variation curves and the extracted 1D-CNN feature vectors were included in the socioeconomic features.
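A minimal sketch of such a 1D-CNN in PyTorch is shown below; the layer sizes and output dimension are illustrative assumptions rather than our tuned architecture.

```python
# Sketch of a 1D-CNN feature extractor for the 24-h temporal variation curves.
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    def __init__(self, out_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),   # local hourly patterns
            nn.ReLU(),
            nn.MaxPool1d(2),                             # 24 steps -> 12
            nn.Conv1d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # global temporal summary
        )
        self.fc = nn.Linear(16, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, 24) hourly counts -> (batch, out_dim) feature vector."""
        h = self.net(x.unsqueeze(1)).squeeze(-1)
        return self.fc(h)
```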
Similar to the building features, we also developed a set of socioeconomic features that consider the spatial spillover effect. The density and proportion of POIs across different categories within a land parcel and its neighborhood were calculated, and temporal variation curves based on the number of check-ins and mobile phone signaling users within a land parcel and its surrounding area were constructed.
2.2.5. Model Training and Validation of Land Use Classification
Referring to Huang [25], we adopted the land use taxonomy predefined by the Land Administration Law of the People's Republic of China (GB/T 21010-2017) to train and validate the proposed model. After filtering out the categories absent within the Fifth Ring Road of Beijing, nine refined land use categories were generated (see Table 2). Among them, the 'other' category was excluded because it does not undertake the main urban functions and has few distinguishing characteristics. Based on the land use planning map provided by the Beijing Municipal Commission of Planning and Natural Resources, we semi-automatically assigned a land use category to each land parcel [25]. In detail, we rescaled the land use planning map to land parcels by computing the area of each land use type within each parcel. The label of a parcel was then assigned based on the dominant land use type or by visual interpretation. Label assignment based on the dominant land use type was applied in two scenarios: (1) the area proportion of a specific land use type exceeds 60%; or (2) the proportion of the largest land use type is greater than 40% while that of the second largest is less than 20%. In all other cases, labels were assigned through visual interpretation, as sketched below.
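The rule-based part of this assignment can be expressed as in the following sketch; returning None to signal deferral to visual interpretation is a convention of the sketch, not part of the original procedure.

```python
# Sketch of the semi-automatic label assignment rule described above.
def assign_label(area_by_type: dict[str, float]) -> str | None:
    """Return the dominant land use label, or None to defer to visual interpretation."""
    total = sum(area_by_type.values())
    shares = sorted((a / total for a in area_by_type.values()), reverse=True)
    top_type = max(area_by_type, key=area_by_type.get)
    if shares[0] > 0.60:                                             # scenario (1)
        return top_type
    if shares[0] > 0.40 and (len(shares) < 2 or shares[1] < 0.20):   # scenario (2)
        return top_type
    return None                                                      # visual interpretation
```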
To effectively leverage all the aforementioned features for land use classification, we employed the XGBoost model for cross-modal feature fusion and classification. XGBoost is a state-of-the-art ensemble learning algorithm widely adopted in both academia and industry for its robustness, scalability, and ability to handle complex, heterogeneous datasets; it is particularly well suited here because of its effectiveness with imbalanced datasets, sparse data, and multi-source feature fusion. Before training and testing, all constructed and enhanced features were directly concatenated without dimensionality reduction to preserve their complete informational content, leveraging XGBoost's inherent capability to handle heterogeneous feature spaces. The regularized objective function and feature importance mechanism of XGBoost can automatically optimize multi-dimensional feature integration while resolving potential dimensional inconsistencies, and its built-in modeling of feature interactions further supports the utilization of cross-modal features without pre-processing dimension alignment, thus avoiding the information loss associated with conventional reduction techniques.
To train and validate the model, we fed the concatenated features into the XGBoost classifier and evaluated model performance by five-fold cross-validation, a technique widely adopted in machine learning for assessing performance and generalization capability. It randomly partitions the dataset into five mutually exclusive subsets of approximately equal size; the training and testing process is then repeated five times, with each subset (20% of the data) serving as the test set once while the remaining four subsets (80% of the data) serve as the training set. The final performance metrics are averaged across all folds to mitigate bias from data partitioning. In this way, the variance of the performance estimate is reduced through rotating validation, which is critical for small datasets affected by sampling fluctuations, while 80% of the available data is still used for training in each iteration, maximizing learning capacity. Furthermore, a 5 × 5 repeated stratified five-fold cross-validation (five repetitions of stratified five-fold cross-validation) was performed to train and validate the model multiple times, from which the 95% confidence intervals (CIs) for the classification accuracy were computed using a t-distribution, as sketched below.
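A minimal sketch of this evaluation protocol, using scikit-learn's RepeatedStratifiedKFold together with the xgboost package, is given below; the hyperparameters are placeholders rather than our tuned settings.

```python
# Sketch of 5x5 repeated stratified five-fold CV with a t-distribution 95% CI.
import numpy as np
from scipy import stats
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier

def evaluate(X: np.ndarray, y: np.ndarray):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
    accs = []
    for train_idx, test_idx in cv.split(X, y):
        clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
        clf.fit(X[train_idx], y[train_idx])
        accs.append((clf.predict(X[test_idx]) == y[test_idx]).mean())
    accs = np.asarray(accs)
    # 95% CI from the t-distribution over the 25 fold-level accuracies
    half = stats.t.ppf(0.975, df=len(accs) - 1) * accs.std(ddof=1) / np.sqrt(len(accs))
    return accs.mean(), (accs.mean() - half, accs.mean() + half)
```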