2.2.2. Extraction of Physical Features
Physical features are derived from remote sensing images, street view images, road networks, and building footprints. These features capture the spatial configuration and morphological characteristics of urban areas.
Remote sensing images inherently contain both global and local features. Global features capture long-range spatial dependencies, while local features capture pixel-level and neighborhood-level details. Previous research has rarely considered global and local features concurrently, leading to inadequate feature representation. To enable robust and accurate land use classification, our framework combines global features extracted by the Masked Autoencoder (MAE) [30] with local features derived from DeepLabV3+ [31].
For global feature extraction from remote sensing images, we employ the MAE model, a self-supervised learning model widely used in computer vision. Compared with traditional CNNs, the MAE, built on the Vision Transformer (ViT), captures spatial correlations over longer distances through global self-attention and can therefore extract global image features more effectively. Moreover, given the scarcity of labeled samples, masking-based self-supervised training enriches the training data and improves the generalization ability of the model. The MAE operates through a masking mechanism and consists of an encoder and a decoder. The encoder, implemented with a ViT architecture, processes only the unmasked patches. Inspired by the success of Transformer models in natural language processing, ViT treats image patches as sequential elements and captures global dependencies among them. After encoding, a lightweight decoder reconstructs the original image by processing both masked and unmasked patches. Through this process, the model learns spatial correlations among patches, providing an effective representation of long-range spatial dependencies. The features extracted by the encoder serve as the global features of the remote sensing images.
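To make the masking mechanism concrete, the following is a minimal sketch of MAE-style random masking in PyTorch, in which only the visible patch tokens would be passed to the encoder; the mask ratio, patch count, and token dimension are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of MAE random masking (illustrative dimensions only).
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; the MAE encoder sees only these.

    patches: (batch, num_patches, embed_dim) patch embeddings.
    Returns the visible tokens and the shuffle indices needed to restore order.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                       # random score per patch
    ids_shuffle = noise.argsort(dim=1)             # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]             # first n_keep patches stay visible
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_shuffle

# e.g. 196 patches from a 224x224 image with 16x16 patches, 768-dim tokens
tokens = torch.randn(8, 196, 768)
visible, _ = random_masking(tokens)                # (8, 49, 768): 75% masked out
```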
Because the performance of the MAE depends on the diversity of its training samples, and because our study area contains relatively few samples (about 12,000 research units), we adopted a pre-training and fine-tuning paradigm to train a robust and generalizable MAE model. This approach allowed the model to learn universal and robust image features from a broader dataset before being fine-tuned on our study area to adapt to the specific task of land use classification. During pre-training, we constructed a dataset using remote sensing images from built-up areas of Beijing, Shanghai, Guangzhou, and Shenzhen. To ensure compatibility with the fine-tuning task, we resampled the pre-training samples so that the empirical cumulative distribution function (ECDF) of their shape features followed the ECDF of the shape features of samples in our study area. For fine-tuning, the task shifted from image reconstruction to land use classification; we therefore discarded the MAE decoder and retained only the ViT encoder. To prevent leakage between the training and test sets, we fine-tuned a separate classification model for each of the five partitions in the five-fold cross-validation described in Section 2.2.5. Finally, the image features extracted by the encoder were used as the global features of the remote sensing images.
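The ECDF matching step can be illustrated with a simple quantile-based resampling scheme. The sketch below assumes a single scalar shape feature and a hypothetical nearest-value matching rule; the exact procedure used in our pipeline may differ.

```python
# Sketch of ECDF-matched sampling: draw pre-training samples whose
# shape-feature distribution follows the study area's distribution.
import numpy as np

def ecdf_matched_sample(candidate_shape: np.ndarray,
                        target_shape: np.ndarray,
                        n_samples: int,
                        rng=np.random.default_rng(0)) -> np.ndarray:
    """Return indices into `candidate_shape` whose ECDF follows `target_shape`."""
    # Inverse-transform sampling from the study-area ECDF ...
    u = rng.uniform(0, 1, n_samples)
    target_values = np.quantile(target_shape, u)
    # ... then, for each drawn value, pick the closest candidate sample.
    order = np.argsort(candidate_shape)
    pos = np.searchsorted(candidate_shape[order], target_values)
    pos = np.clip(pos, 0, len(order) - 1)
    return order[pos]

# candidate_shape: shape feature of 4-city pre-training image chips
# target_shape:    the same feature for the ~12,000 study-area units
```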
For local feature extraction from remote sensing images, we employed semantic segmentation as the proxy task, which is well-suited for capturing pixel-level and neighborhood-level image features. The segmentation was accomplished using the DeepLabV3+ architecture, which also follows an encoder–decoder structure. DeepLabV3+ employs Atrous Spatial Pyramid Pooling (ASPP), which effectively captures image features at different scales; this is highly beneficial for accurately delineating the boundaries of multi-scale objects in remote sensing images. In DeepLabV3+, the encoder module downsamples the feature map and captures high-level semantic information, while the decoder module recovers spatial detail. The encoder consists of two modules: a Deep Convolutional Neural Network (DCNN) and an ASPP module. The DCNN module employs an Xception_65 backbone network for feature extraction, and the ASPP module captures convolutional features at multiple scales by applying atrous convolutions with different dilation rates. The decoder is simple but effective: the encoder features are first up-sampled bilinearly by a factor of 4, then concatenated with low-level features of the same spatial resolution from the encoder backbone; after convolution, the result is up-sampled again to obtain a segmentation map of the same size as the input image.
During the training phase of DeepLabV3+, we defined seven land cover categories: lawn, shrub, ground, impervious surface, road, building, and water. The Xception network parameters pre-trained on the ImageNet-1k dataset were used as the initial parameters of the DeepLabV3+ backbone, and we applied the stochastic gradient descent (SGD) optimizer with the 'poly' learning rate policy. After semantic segmentation, we calculated the proportion of land cover types within each parcel as the local features from remote sensing images.
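As an illustration, the per-parcel local features can be derived from the segmentation output as in the sketch below; the class id ordering and the availability of a rasterized parcel mask are assumptions of the sketch.

```python
# Sketch of the local features: per-parcel proportions of the seven
# land cover classes predicted by DeepLabV3+ (class ids 0..6 assumed).
import numpy as np

CLASSES = ["lawn", "shrub", "ground", "impervious", "road", "building", "water"]

def land_cover_proportions(seg_map: np.ndarray, parcel_mask: np.ndarray) -> np.ndarray:
    """seg_map: (H, W) predicted class ids; parcel_mask: (H, W) bool for one parcel."""
    pixels = seg_map[parcel_mask]
    counts = np.bincount(pixels, minlength=len(CLASSES))
    return counts / max(pixels.size, 1)            # 7-dim proportion vector
```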
Street view images can complement remote sensing data by providing ground-level visual details, such as building facades, street conditions, and vegetation, enhancing the granularity and accuracy of land use classification. However, owing to limitations of the acquisition equipment, the visual range of street view images is restricted, and they contain considerable noise irrelevant to land use classification. We therefore adopt scene category recognition as a proxy task to extract the image subject and filter out this noise. Referring to [25], we chose the ResNet-50 network [32] pre-trained on the Places2 dataset [33] as the classification network. This network assigns images to 365 scene categories, covering both indoor and outdoor scenes. We calculated the proportion of street view images in each scene category for each research unit as the second type of physical feature.
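A minimal sketch of this step is shown below. It assumes the publicly released Places365 ResNet-50 checkpoint; the file name resnet50_places365.pth.tar and the key-renaming step follow that public release and are assumptions here rather than a description of our exact pipeline.

```python
# Sketch of the street-view feature: classify each image into one of 365
# Places scene categories and compute per-unit proportions.
import numpy as np
import torch
import torchvision.models as models

model = models.resnet50(num_classes=365)
ckpt = torch.load("resnet50_places365.pth.tar", map_location="cpu")
state = {k.replace("module.", ""): v for k, v in ckpt["state_dict"].items()}
model.load_state_dict(state)
model.eval()

def scene_proportions(image_batch: torch.Tensor) -> np.ndarray:
    """image_batch: (N, 3, 224, 224) normalized street-view images of one unit."""
    with torch.no_grad():
        preds = model(image_batch).argmax(dim=1)   # scene id per image
    counts = np.bincount(preds.numpy(), minlength=365)
    return counts / max(len(preds), 1)             # 365-dim proportion vector
```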
To enrich the physical features of the urban built-up environment, the road networks and building footprints were used to further characterize the urban connectivity and architectural form. The connectivity is characterized by the density of road networks within a land parcel and its neighboring area. The neighboring area is defined as a buffer zone with a radius equal to the square root of the parcel area. The architectural form is represented by the intrinsic features and spatial distribution features of the buildings in a land parcel. Intrinsic features are constructed by calculating both the mean value and standard deviation of the building attributes, including construction year, building area, height, perimeter, shape coefficient (the ratio of perimeter to area), and shape complexity (the number of vertices). Spatial distribution features are characterized by the floor area ratio (FAR) and the nearest neighbor distance (NND), which reflect the density and spatial arrangement of buildings. The calculation formulas for the FAR and the NND are as follows:
$$\mathrm{FAR} = \frac{\sum_{i=1}^{N} S_i \times F_i}{A}$$
where $N$ denotes the total number of buildings within a region, $S_i$ and $F_i$, respectively, represent the floor area and the number of floors of building $i$, and $A$ is the total area of the region.
$$\mathrm{NND} = \frac{1}{N} \sum_{i=1}^{N} \min_{j \neq i} d_{ij}$$
where $N$ denotes the total number of buildings within a region, and $d_{ij}$ represents the distance from building $i$ to building $j$.
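Both quantities can be computed directly from the building attribute table, for example as in the following sketch; representing building locations by their centroids and measuring centroid-to-centroid distances is an assumption of the sketch.

```python
# Sketch of the two spatial distribution features, computed from building
# centroids and attributes (column semantics are assumptions).
import numpy as np
from scipy.spatial import cKDTree

def far(floor_area: np.ndarray, n_floors: np.ndarray, region_area: float) -> float:
    """FAR = sum_i S_i * F_i / A."""
    return float(np.sum(floor_area * n_floors) / region_area)

def nnd(centroids: np.ndarray) -> float:
    """NND = (1/N) * sum_i min_{j != i} d_ij, using centroid-to-centroid distance."""
    tree = cKDTree(centroids)
    # k=2: nearest neighbor of each point excluding the point itself
    dist, _ = tree.query(centroids, k=2)
    return float(dist[:, 1].mean())
```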
Furthermore, due to the spatial spillover effect of human activities, activities occurring in one area may influence the spatial structure of surrounding areas [34], especially when the research units are small. Therefore, we also constructed a set of building features that consider the neighborhood of a land parcel. Specifically, when calculating the intrinsic and spatial distribution features of the buildings, we considered not only the buildings within the land parcel but also those in its neighborhood. As before, the neighboring area is defined as a buffer zone with a radius equal to the square root of the parcel area (see the sketch below). Another potential benefit of this operation is that, for sparse data, expanding the target area makes the overall trend of the features more prominent, thereby reducing the impact of random variability from small samples.
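A minimal sketch of this neighborhood construction with Shapely is given below; selecting buildings that intersect the buffer is an assumption of the sketch.

```python
# Sketch of the neighborhood used for the spillover features:
# a buffer around the parcel with radius sqrt(parcel area).
import math
from shapely.geometry import Polygon

def neighborhood(parcel: Polygon) -> Polygon:
    """Buffer zone whose radius equals the square root of the parcel area."""
    radius = math.sqrt(parcel.area)
    return parcel.buffer(radius)

# Buildings intersecting neighborhood(parcel) are then pooled with those
# inside the parcel before recomputing the intrinsic and spatial features.
```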
2.2.3. Extraction of Socioeconomic Features
Socioeconomic features were extracted from data such as POIs, social media check-ins, and mobile phone signaling. These features reflect the functional dynamics and human activities of urban areas.
To capture the economic characteristics and functional distribution of urban land use, we constructed a category feature and a spatial feature based on POIs. The POIs were classified into 23 categories according to a customized classification system: residence, company, education, office, hospital, parking, shop, food, domestic, university, government, car service, hotel, leisure, beauty, sport, finance, media, tourism, transport, school, research, and other. For each land parcel, the density and the proportion of each POI category were calculated, yielding a 46-dimensional category feature vector. This feature provides insights into the intensity and composition of urban facilities within a land parcel. To extract the spatial co-occurrence information contained in POIs, we constructed a Delaunay triangulation network based on the spatial locations of POIs and performed spatially explicit random walks on it. Each random walk started at a randomly selected POI and traversed the network according to a transition probability that accounts for spatial distance decay and the balance between local and long-range co-occurrences [35]. After capturing the spatial co-occurrence patterns, the aggregation function based on LSTM networks and attention mechanisms proposed in [36] was used to aggregate the sequences generated by the random walks. With this POI aggregation function, we obtained the POI distribution feature of the land parcels.
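The 46-dimensional category feature can be assembled as in the following sketch, assuming the 23 POI categories have been mapped to integer ids 0–22.

```python
# Sketch of the 46-dimensional POI category feature: per-category density
# and proportion for one land parcel.
import numpy as np

N_CATEGORIES = 23

def poi_category_feature(poi_cats: np.ndarray, parcel_area: float) -> np.ndarray:
    """poi_cats: integer category ids of all POIs inside one parcel."""
    counts = np.bincount(poi_cats, minlength=N_CATEGORIES).astype(float)
    density = counts / parcel_area                   # POIs per unit area
    proportion = counts / max(counts.sum(), 1.0)     # facility composition
    return np.concatenate([density, proportion])     # 46-dim vector
```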
Considering that different land uses often host different human activities, we further characterized the spatiotemporal and socioeconomic features of urban land parcels using social media check-ins and mobile phone signaling. Social media check-ins capture the spatiotemporal distribution of the active population, while mobile phone signaling supplements this with the spatial and temporal patterns of the resident population. These two datasets provide complementary perspectives on population distribution and dynamics, enabling a comprehensive representation of human activity patterns. For each land parcel, we computed temporal variation curves of the number of check-ins at two temporal scales: hourly variations within a day and daily variations within a week. For the mobile phone signaling, we generated only the 24-h temporal variation curves of the number of users for each land parcel, because the data span is insufficient for calculating weekly changes. The number of users per hour in each land parcel was obtained by resampling the user counts recorded at a 250-m grid resolution; when a grid cell covers multiple parcels, the users were distributed in proportion to the covered area. These curves represent the temporal dynamics of population distribution and activity intensity, providing insights into the diurnal and weekly patterns of human behavior. To further extract high-level temporal features from the variation curves, we applied a one-dimensional convolutional neural network (1D-CNN) to the hourly curves, as the 1D-CNN can automatically learn local patterns and dependencies in sequential data. Finally, both the original temporal variation curves and the extracted 1D-CNN feature vectors were included in the socioeconomic features.
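A minimal sketch of such a 1D-CNN in PyTorch is shown below; the layer sizes and output dimension are illustrative assumptions rather than our tuned architecture.

```python
# Sketch of a 1D-CNN feature extractor for the 24-h temporal variation curves.
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    def __init__(self, out_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),   # local hourly patterns
            nn.ReLU(),
            nn.MaxPool1d(2),                             # 24 steps -> 12
            nn.Conv1d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # global temporal summary
        )
        self.fc = nn.Linear(16, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, 24) hourly counts -> (batch, out_dim) feature vector."""
        h = self.net(x.unsqueeze(1)).squeeze(-1)
        return self.fc(h)
```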
Similar to the building features, we also developed a set of socioeconomic features that consider the spatial spillover effect. The density and proportion of POIs across different categories within a land parcel and its neighborhood were calculated, and temporal variation curves based on the number of check-ins and mobile phone signaling users within a land parcel and its surrounding area were constructed.
2.2.5. Model Training and Validation of Land Use Classification
Referring to Huang [25], we adopted the land use taxonomy predefined by the Land Administration Law of the People's Republic of China (GB/T 21010-2017) to train and validate the proposed model. After filtering out the categories absent within the Fifth Ring Road of Beijing, nine refined land use categories were generated (see Table 2). Among them, the 'other' category was excluded because it does not undertake the main urban functions and has few distinguishing characteristics. Based on the land use planning map provided by the Beijing Municipal Commission of Planning and Natural Resources, we semi-automatically assigned a land use category to each land parcel [25]. In detail, we rescaled the land use planning map to land parcels by computing the area of each land use type within each parcel. The label of a parcel was then assigned based on the dominant land use type or by visual interpretation. Label assignment based on the dominant land use type was applied in two scenarios: (1) the area proportion of a specific land use type exceeds 60%; or (2) the proportion of the largest land use type is greater than 40% while that of the second largest is less than 20%. In all other cases, labels were assigned through visual interpretation, as sketched below.
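The rule-based part of this assignment can be expressed as in the following sketch; returning None to signal deferral to visual interpretation is a convention of the sketch, not part of the original procedure.

```python
# Sketch of the semi-automatic label assignment rule described above.
def assign_label(area_by_type: dict[str, float]) -> str | None:
    """Return the dominant land use label, or None to defer to visual interpretation."""
    total = sum(area_by_type.values())
    shares = sorted((a / total for a in area_by_type.values()), reverse=True)
    top_type = max(area_by_type, key=area_by_type.get)
    if shares[0] > 0.60:                                             # scenario (1)
        return top_type
    if shares[0] > 0.40 and (len(shares) < 2 or shares[1] < 0.20):   # scenario (2)
        return top_type
    return None                                                      # visual interpretation
```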
To effectively leverage all the aforementioned features for land use classification, we employed the XGBoost model for cross-modal feature fusion and classification. XGBoost is a state-of-the-art ensemble learning algorithm widely adopted in both academia and industry for its robustness, scalability, and ability to handle complex, heterogeneous datasets; it is particularly well suited here because of its effectiveness with imbalanced datasets, sparse data, and multi-source feature fusion. Before training and testing, all constructed and enhanced features were directly concatenated without dimensionality reduction to preserve their complete informational content, leveraging XGBoost's inherent capability to handle heterogeneous feature spaces. The regularized objective function and feature importance mechanism of XGBoost can automatically optimize multi-dimensional feature integration while resolving potential dimensional inconsistencies, and its built-in modeling of feature interactions further supports the utilization of cross-modal features without pre-processing dimension alignment, thus avoiding the information loss associated with conventional reduction techniques.
To train and validate the model, we fed the concatenated features into the XGBoost classifier and evaluated model performance by five-fold cross-validation, a technique widely adopted in machine learning for assessing performance and generalization capability. It randomly partitions the dataset into five mutually exclusive subsets of approximately equal size; the training and testing process is then repeated five times, with each subset (20% of the data) serving as the test set once while the remaining four subsets (80% of the data) serve as the training set. The final performance metrics are averaged across all folds to mitigate bias from data partitioning. In this way, the variance of the performance estimate is reduced through rotating validation, which is critical for small datasets affected by sampling fluctuations, while 80% of the available data is still used for training in each iteration, maximizing learning capacity. Furthermore, a 5 × 5 repeated stratified five-fold cross-validation (five repetitions of stratified five-fold cross-validation) was performed to train and validate the model multiple times, from which the 95% confidence intervals (CIs) for the classification accuracy were computed using a t-distribution, as sketched below.
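A minimal sketch of this evaluation protocol, using scikit-learn's RepeatedStratifiedKFold together with the xgboost package, is given below; the hyperparameters are placeholders rather than our tuned settings.

```python
# Sketch of 5x5 repeated stratified five-fold CV with a t-distribution 95% CI.
import numpy as np
from scipy import stats
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier

def evaluate(X: np.ndarray, y: np.ndarray):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
    accs = []
    for train_idx, test_idx in cv.split(X, y):
        clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
        clf.fit(X[train_idx], y[train_idx])
        accs.append((clf.predict(X[test_idx]) == y[test_idx]).mean())
    accs = np.asarray(accs)
    # 95% CI from the t-distribution over the 25 fold-level accuracies
    half = stats.t.ppf(0.975, df=len(accs) - 1) * accs.std(ddof=1) / np.sqrt(len(accs))
    return accs.mean(), (accs.mean() - half, accs.mean() + half)
```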