1. Introduction
Remote sensing image classification plays a significant role in fields such as land cover monitoring and forest management [1, 2]. Hyperspectral imagery (HSI) stands out for its ability to differentiate materials based on their unique spectral signatures [3, 4]. However, in real-world applications, different objects may display similar or even identical spectral signatures under certain environmental conditions [5]. Conversely, the same type of object may exhibit varying spectral characteristics due to noise, temperature fluctuations, and other factors [6] during the imaging process [7]. As a result, relying solely on HSI is limiting when the analysis involves complex ground details in large-scale scenes [8].
Given the limitations of single-modality data, integrating multi-modal images has shown significant advantages across various research fields [9, 10, 11]. For instance, light detection and ranging (LiDAR) data [12], acquired by emitting laser beams and analyzing the reflected signals to determine precise distances, provide valuable altitude information over large areas through sensor scanning and support tasks such as 3D instance segmentation [13] and point cloud segmentation [14]. Combining LiDAR with HSI data collected from the same area helps to partly overcome the limitations of single-modal information and substantially enhances land cover classification accuracy [15, 16]. Unlike natural images, remote sensing images often contain more intricate information, with variations in the positions of similar land cover features, making it crucial to extract global information from multi-modal remote sensing data [17]. In recent years, numerous scholars have explored this challenge using a variety of deep learning models. These models can be broadly grouped into three categories: convolutional neural network (CNN)-based methods, transformer-based methods, and graph-based methods [18, 19, 20, 21].
In the CNN-based methods, various strategies have been developed to capture global information from multi-modal data. For example, Song et al. [22] introduced a two-stream deep CNN in their hashing-based deep metric learning method to separately extract and then fuse spectral-spatial features, and enhanced classification accuracy with a unique loss function that incorporates both semantic and metric losses. Wang et al. [23] improved the traditional CNN with pyramid convolutions of different kernel sizes to extract features at different scales, and subsequently used effective feature recalibration modules to enhance the multi-scale spatial-spectral features. Zhao et al. [24] proposed a hierarchical random walk network, which used the predicted distribution of a dual-tunnel CNN as a global prior on the fusion of HSI and LiDAR data and employed a pixel-wise affinity branch as pixel priors to enforce spatial consistency in the deeper layers of the network. Feng et al. [25] designed a CNN-based dynamic scale hierarchical fusion network, which dynamically selected and integrated features across scales to address the high dimensionality of multiscale features. It used spatial attention for shallow feature fusion and modal attention for deep fusion to improve classification accuracy. Wang et al. [26] proposed a nearest neighbor-based contrastive learning network (NNCNet), which brings self-supervised contrastive learning and a bilinear attention fusion module into CNN-based joint classification to interpret ground objects at a more precise level. Xue et al. [27] augmented the attentional feature fusion module on the basis of CNN and designed a global average pooling layer to enhance the representation of global information in the features. Gao et al. [28] proposed a CNN-based adaptive multiscale spatial–spectral enhancement network (AMSSE-Net) to fuse features from HSI and LiDAR data and improve classification performance. Because its kernel is shared across the feature channels within a group, the involution operator was introduced in the network to enhance the correlation of spectral dimensions. In addition, dynamically assigned weights were used to guide the selection of the optimal model, which is determined by the joint loss across three feature fusion methods (maximum, adaptive addition, and concatenation). Mohla et al. [29] devised deeper networks for multi-modal feature extraction, incorporating two spatial attention modules and one modality attention module. As the number of network layers grows, deeper convolutional layers obtain larger receptive fields, which correspond to larger regions of the original image and enable broader feature perception. However, blindly stacking convolutional layers increases network depth and training difficulty, and with them the risk of overfitting.
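The link between depth and receptive field noted above can be made concrete with the standard recurrence for stacked convolutions: each layer adds (kernel size - 1) times the current spacing between neighbouring outputs, measured in input pixels. The following is a minimal sketch only; the function name and the layer configurations are illustrative assumptions and are not taken from any of the cited networks, and dilation and padding effects are ignored.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of convolution layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    Hypothetical helper for illustration; dilation is not modelled.
    """
    rf = 1    # a single input pixel covers itself
    jump = 1  # spacing, in input pixels, between adjacent outputs of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Stacking 3x3, stride-1 convolutions grows the receptive field by 2 per layer.
for depth in (1, 3, 5):
    print(depth, receptive_field([(3, 1)] * depth))
```

With stride-1 layers the growth is only linear in depth, which is one way to see why covering a large scene purely by stacking convolutions requires many layers.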
Besides CNN, the transformer has attracted significant attention in computer vision due to its remarkable ability to model global relationships among samples in the visual domain [30, 31]. In deep networks based on transformers, images often need to be serialized and input to the network in the form of image block sequences. In Ref. [32], a cross-modal enhanced CNN and transformer module was incorporated into a dual-branch feature fusion network to enhance interactive information from multi-source data both locally and globally, thereby enabling the robust integration of diverse features. Zhao et al. [33] proposed a novel dual-branch method combining a hierarchical CNN and a transformer network to enhance multisource data fusion and improve classification accuracy, with a cross-token attention fusion encoder that leverages CNN’s spatial extraction capabilities and the transformer’s long-range dependency modeling. Song et al. [34] designed a height information-guided hierarchical fusion-and-separation network, in which the transformer and CNN were introduced in the dual-structure feature encoders to encode the spectral and spatial information, while deformable convolution-based modules were employed in feature fusion-and-separation blocks for modality-shared and modality-specific feature extraction. Yang et al. [35] selected HSI bands based on LiDAR data by introducing a cross-attention mechanism from the transformer architecture to reduce the redundancy of HSI and improve the classification accuracy. Zhang et al. [36] proposed a local information interaction transformer (LIIT) model to capture and fuse multi-modal data dynamically. A dual-branch transformer was designed in LIIT to fully extract the sequence features, and a local-based multisource feature interactor was developed to coordinate local spatial features with the global-based transformer. Ni et al. [37] introduced a multiscale head selection transformer (MHST) network to capture nonredundant features across multiple dimensions of data. An adaptive global feature extraction module was designed in MHST, which leveraged head selection pooling within the transformer to dynamically reduce token redundancy. Sun et al. [38] introduced a morphological feature enhancement module and a transformer-based deep dilated convolution module in the encoder, enabling efficient integration of data features. Feng et al. [39] proposed a spectral-spatial-elevation fusion transformer framework (S2EFT) that takes patches as the transformer input to take full advantage of sequence data and spatial features. Additionally, Zhao et al. [40] used a CNN with residual connections to extract features from multi-modal data; the features were then serialized for further feature learning that integrates the transformer with the Fourier transform, ultimately predicting the categories of land objects. While transformer-based classification methods excel at extracting global features, images must be serialized into image blocks to fit the structure of the transformer. Capturing global information is thereby reduced to extracting associations between image blocks, which may lead to insufficient extraction of local information within each image block.
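The serialization step discussed above can be illustrated with a toy example. The sketch below splits a small single-band grid into non-overlapping blocks and flattens each block into one token; `patchify` is a hypothetical helper written for this illustration, not the interface of any cited model, and real implementations operate on multi-band tensors with a learned projection per patch.

```python
def patchify(image, patch):
    """Split an H x W grid into non-overlapping patch x patch blocks and
    flatten each block into one token (row-major order).

    Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one block; its internal 2D layout is lost at this point.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), tokens[0])
```

Attention then relates these tokens to one another, so pixel-level structure inside a block reaches the model only through the flattened token, which is the limitation the paragraph above points to.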
Compared to CNNs, graph models inherently possess an advantage in modeling and extracting global information. In a topological graph, any two nodes can be connected by associating node features to establish edges, and the relationship between the features of two nodes is characterized by the weight of the edge between them, thus overcoming the limitations of the two-dimensional structure of images. For example, Feng et al. [41] developed a multi-branch dual-channel graph convolutional network to remove redundant information when integrating spectral–elevation–spatial features. Cai et al. [42] constructed an undirected weighted graph with modality-specific tokens in their multimodal fusion network to address the problem of long-distance dependencies. Wan et al. [43] segmented HSI into regions, with each segmented region serving as a node, to establish a complete topological graph by associating regions with each other. They then designed a dynamic graph convolutional network for feature learning, continuously updating the connection relationships between edges during training. Cai et al. [44] employed principal component analysis to reduce the HSI dimensionality. The principal components were input into a CNN for feature extraction, and the features were taken as a series of graph nodes to establish a topological graph. Graph CNNs were subsequently employed for feature extraction, with cross-attention added to guide the features. Sha et al. [45] adopted graph attention networks for feature extraction from the topological graph constructed with HSI, ensuring that the features included both spatial and spectral characteristics. The aforementioned graph-based methods are effective in modeling global information, yet they face two main challenges. Firstly, these methods are still restricted to feature extraction within a single modality, without considering the fusion of global information from multiple modalities. Secondly, these methods still have limited capability in capturing local information. For instance, Sha et al. [45] directly converted pixels in images into the nodes of the topological graph, disregarding the local spatial information in the original images, while Wan et al. [43] used superpixel segmentation to retain some spatial information, but the result was highly sensitive to the segmentation scale.
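To illustrate how edge weights can encode feature relationships independently of image position, the sketch below builds a dense weighted adjacency matrix from node features. The Gaussian (RBF) kernel and the `sigma` parameter are assumptions chosen for illustration; the cited methods each define their own graph construction, which this does not reproduce.

```python
import math


def feature_affinity(features, sigma=1.0):
    """Dense weighted adjacency with w_ij = exp(-||f_i - f_j||^2 / (2 * sigma^2)).

    `features` is a list of equal-length feature vectors, one per node.
    Hypothetical helper; the kernel choice is an assumption, and self-loops
    are left at zero.
    """
    n = len(features)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
            adjacency[i][j] = adjacency[j][i] = weight  # undirected graph
    return adjacency


# Nodes 0 and 1 have identical features, node 2 is far away in feature space.
adjacency = feature_affinity([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
print(adjacency[0][1], adjacency[0][2])
```

Because the weights depend only on feature distance, two nodes from distant parts of a scene are linked as strongly as two neighbours when their features match.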
To address these issues, a multi-scale graph encoder–decoder network (MGEN) is proposed for multi-modal data classification. MGEN extracts multi-modal image information at multiple scales, achieving local-global information fusion and robust feature representation. Specifically, MGEN consists of a graph encoder, a graph feature extraction module, and a graph decoder, each of which comprises three hierarchical levels of information.
In the graph encoder, a segmentation algorithm performs unsupervised region segmentation on the images of the two modalities, dividing spatial regions according to the semantic information of the images and aggregating pixels from the original images into a series of superpixels. The superpixels are then transformed into the graph space, where the nodes and edges of the topological graph are generated from superpixel features and adjacency relationships. Topological graphs at various scales are generated by controlling the number of regions in superpixel segmentation. In the graph feature extraction module, separate network branches extract features from the multi-scale topological graphs using the multiscale short- and long-range graph convolutional network (MSLGCN) [46]. In the graph decoder, the features are mapped from graph nodes back to the original pixels, and feature alignment is performed at the pixel scale, followed by multi-scale feature fusion. Finally, a classifier constructed from fully connected networks predicts pixel categories to obtain a category map.
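The pixel-to-node and node-to-pixel mappings described above can be sketched on a toy grid. The code below assumes a precomputed superpixel label map and uses mean pooling for node features and 4-connectivity for edges; these are common choices adopted here as assumptions, not a statement of how MGEN implements its encoder and decoder. The function names `encode_superpixels` and `decode_to_pixels` are hypothetical.

```python
def encode_superpixels(pixels, labels):
    """Pool pixel features into superpixel nodes and find touching regions.

    pixels: H x W grid of feature vectors (lists of floats).
    labels: H x W grid of superpixel ids 0..K-1, assumed precomputed and
            with every id used at least once.
    Returns (node_features, edges): the mean feature per region and the set
    of region pairs that are adjacent under 4-connectivity.
    """
    height, width = len(labels), len(labels[0])
    dim = len(pixels[0][0])
    count = max(max(row) for row in labels) + 1
    sums = [[0.0] * dim for _ in range(count)]
    sizes = [0] * count
    edges = set()
    for r in range(height):
        for c in range(width):
            k = labels[r][c]
            sizes[k] += 1
            for d in range(dim):
                sums[k][d] += pixels[r][c][d]
            # Look right and down only, so each neighbouring pair is seen once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < height and cc < width and labels[rr][cc] != k:
                    other = labels[rr][cc]
                    edges.add((min(k, other), max(k, other)))
    nodes = [[s / sizes[k] for s in sums[k]] for k in range(count)]
    return nodes, edges


def decode_to_pixels(node_features, labels):
    """Copy each node feature back to every pixel of its superpixel."""
    return [[node_features[k] for k in row] for row in labels]


labels = [[0, 0, 1],
          [0, 2, 1]]
pixels = [[[1.0], [3.0], [10.0]],
          [[2.0], [7.0], [20.0]]]
nodes, edges = encode_superpixels(pixels, labels)
restored = decode_to_pixels(nodes, labels)
print(nodes, sorted(edges))
```

Running the same encoder with label maps of different granularity yields graphs at different scales, and decoding each of them returns features on the common pixel grid where they can be aligned and fused.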
The main contributions of this work are summarized as follows:
A multi-scale graph encoder–decoder network is proposed for the classification of remote sensing multi-modal data. This network is able to extract features from the graph space with multiple scales, achieving multi-level fusion of multi-modal global and local information.
A graph encoder and a graph decoder are proposed for extracting modality-independent multi-scale features, while simultaneously measuring the direct similarity between short-range and long-range samples in multi-modal images to enhance feature discriminability.
Experiments on remote sensing multi-modal datasets are conducted, showing that the proposed MGEN achieves performance comparable to that of state-of-the-art methods.
The remainder of the paper is organized as follows. Section 2 describes the proposed network in detail. The experimental results and analyses are presented in Section 3. The effects of the network parameters are then discussed in Section 4. Finally, Section 5 presents concluding remarks.