1. Introduction
Remote sensing image classification technology extracts and analyzes the spatial and spectral information of objects in remote sensing images, thereby achieving accurate classification and identification of objects; it is currently widely used in fields such as urban design, agricultural planning, and military applications [1,2,3,4,5,6]. However, the image obtained from a single sensor is constrained by its imaging principle and other technical factors, so it cannot fully and comprehensively reflect the spatial and spectral information of objects and thus has inherent limitations. For example, hyperspectral images [7,8,9,10,11,12,13] can provide rich spectral and fine feature information, but owing to the limitations of the spectral imager, the phenomena of “same spectrum, different objects” and “same object, different spectra” may occur in classification tasks.
Figure 1 shows partial spectral information of the Houston dataset [14]. From this figure, we can see that the spectral information of certain features is quite similar, such as Highway, Railway, and Road, which are made of similar materials. Hyperspectral images often have difficulty effectively distinguishing target object categories with similar spectral information but different height information, such as roads and rooftops with similar concrete textures, and may perform poorly in some complex fine-grained object classification tasks [15].
At the same time, LiDAR data [16,17,18,19,20] can provide high-precision stereo structural information and elevation information of objects. Because they lack spectral and texture features, however, they struggle to distinguish objects of the same height, such as grassland and road; conversely, for objects with different heights but similar concrete textures, such as roads and rooftops, accurate classification can be achieved through the height information. Therefore, exploiting the complementarity and synergy of hyperspectral and LiDAR data through multi-source remote sensing data fusion has important theoretical and practical value, which has been validated in the literature on land-cover and land-use classification [21,22,23].
To fully leverage the complementary information between hyperspectral and LiDAR data, many methods have been proposed. One of the widely adopted methods is feature-level fusion [24]. Pedergnana et al. [25] applied morphological extended attribute profiles to hyperspectral and LiDAR data; these profiles, together with the raw spectral information of the hyperspectral data, were stacked for classification. However, directly stacking high-dimensional features often leads to the Hughes phenomenon when only relatively few training samples are available. To address this problem, many works have been proposed. For example, Réjichi et al. [26] introduced principal component analysis (PCA) to reduce the dimensionality of hyperspectral data. Liao et al. [27] proposed a graph embedding framework. Rasti et al. [28] used extinction profiles to extract spatial and elevation information from hyperspectral and LiDAR images and fused them with the spectral information using a feature fusion method based on the orthogonal total variation component analysis (OTVCA) algorithm, which allows the fused features to be processed in a low-dimensional space. In the same year, Rasti et al. [15] improved this method by replacing the OTVCA algorithm with a new sparse low-rank technique, which retains more key information while reducing dimensionality and significantly improves multi-source remote sensing image classification accuracy.
Another widely used method is decision-level fusion [24]. Liao et al. [29] fed spatial features, elevation features, and their fused features into support vector machines (SVMs) [30] to generate four classifiers, and the final classification result is determined by their collaborative decision-making. Ge et al. [31] proposed a classification model based on extinction profiles, local binary patterns, and kernel collaborative representation; the idea is to use a collaborative representation classifier to separately generate intermediate results, namely residual matrices, for hyperspectral and LiDAR images, and then fuse them. Zhong et al. [32] employed three different classifiers, namely a maximum likelihood classifier, an SVM [30], and multinomial logistic regression, to classify the extracted features, and then used a differential evolution algorithm to calculate the weights of these classifiers for fusion.
To sum up, the difference between feature-level fusion and decision-level fusion lies in when the fusion operation is performed [24]: the former fuses before the classifier and the latter after it. However, both require a significant amount of time for manual feature extraction to obtain features containing rich spectral, texture, and spatial information, and the resulting land-cover classification results are often unsatisfactory.
Compared with traditional remote sensing image fusion classification algorithms that rely on handcrafted features, deep learning can learn high-level semantic information from data in an end-to-end manner [33]. Therefore, some researchers have used deep learning methods [34,35] to extract features from multi-source data, significantly improving the accuracy of land-cover classification. For example, Mohla et al. [36] proposed FusAtNet, an attention-based feature extraction and fusion framework that emphasizes the spatial features of HSI and obtains modality-specific joint features. Cao et al. [37] proposed a feature fusion network called CNN-MRF, which fuses spectral and spatial information within the same Bayesian framework and trains it using convolutional neural networks. Hong et al. [38] effectively utilized multimodal information by reconstructing features of different modalities with a forced fusion module. Wu et al. [39] proposed a network called C-RNN, which consists of convolutional and recurrent layers, where the former extracts high-dimensional local shift-invariant features and the latter captures long-distance dependencies between frames. Hu et al. [40] showed that using separate feature extractors for different modalities in multimodal fusion tasks can significantly improve the feature extraction capability and efficiency of the model, and that a Transformer structure [41] can effectively fuse multimodal data and achieve good results.
In general, traditional convolutional neural networks (CNNs) typically receive supervised feedback only at the output layer, making it challenging for intermediate layers to learn target features directly and transparently during training [42]. As the depth of the network increases, the likelihood of vanishing or exploding gradients increases significantly. Moreover, simply concatenating the features cannot fully exploit the correlation between images of different modalities, leading to inaccurate classification results. Finally, the performance of CNNs is typically influenced by the number of available training samples; however, in the context of hyperspectral and LiDAR data fusion, the number of training samples is often severely restricted, so making maximal use of the limited data is a formidable challenge.
Given the issues mentioned above, this paper proposes a new binary-tree-style Transformer network that fuses hyperspectral and LiDAR data to achieve high-precision land-cover classification. The network uses a multi-source Transformer complementor to capture the correlations between multi-modal features and adopts a tree structure to improve the stability of the network.
The main contributions of this paper are summarized as follows:
To address the issues of inconsistent data structures, insufficient information utilization, and imperfect feature fusion methods, this paper proposes a new binary-tree-style Transformer network that achieves high-precision land-cover classification by fusing heterogeneous data.
To fully exploit the complementarity between different modal data, this paper proposes a multi-source Transformer complementor (MSTC) that organically fuses hyperspectral and LiDAR data to better capture the correlations between multi-modal feature information and improve the accuracy of land cover classification. Furthermore, the internal multi-head complementary attention mechanism (MHCA) of this module helps the network better learn the long-distance dependency relationships.
To fully obtain the feature information of multi-source remote sensing images, this paper proposes a full binary tree structure, named binary feature search tree (BFST), to extract feature maps of the images layer by layer to obtain features at different network levels. The fusion of multi-modal features at different levels is used to obtain multiple image feature representations that are stronger and more stable. This module effectively improves the stability, robustness, and prediction capability of the network.
The remaining sections of this paper are organized as follows. Section 2 describes the implementation details of the proposed model. Section 3 introduces the datasets and experimental results. Section 4 presents the parameter tuning and ablation experiments. Section 5 summarizes this paper.
2. Methodology
2.1. Overall Framework
To better address issues such as inconsistent data structures, unrelated physical properties, scarce training data, insufficient utilization of information, and imperfect feature fusion methods, we propose an end-to-end neural network called BTRF-Net, as shown in Figure 2.
The BTRF-Net uses HSI-Net and LiDAR-Net to extract, respectively, the spectral and spatial features of hyperspectral images and the elevation information of LiDAR data from different levels of the network. For the HSI-Net, PCA [26] is first used to reconstruct the hyperspectral data in order to retain the main detail information and compress or even discard redundant information, as shown in Equation (1). Secondly, appropriate spatial patches are chosen to obtain the spatial neighborhood information around given pixels and are input into the HSI-Net for classification. Then, a 3D convolutional neural network is applied to the hyperspectral image, using 3D convolution kernels to sample the three dimensions simultaneously so as to extract spectral and spatial features and fuse spatial and spectral information, thereby fully utilizing the advantages of hyperspectral images; this is followed by 2D convolutional processing, as shown in Equation (2). This hybrid 3D and 2D convolutional neural network [7] can effectively extract the texture information of the image and the inter-channel dependencies. For the LiDAR-Net, the spatial patch located at the same spatial position as the hyperspectral data can be chosen directly. To prevent overfitting of the network, batch normalization (BN) is performed at the same time.
where the two variables respectively represent the original hyperspectral data and the hyperspectral data processed by PCA [26] for dimensionality reduction.
where the input represents the hyperspectral data processed by PCA [26] for dimensionality reduction, and B is the resulting feature map of the hyperspectral data.
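As a rough, non-authoritative illustration of this preprocessing and of the hybrid 3D/2D convolution, the following PyTorch sketch assumes a hypothetical number of retained principal components, kernel sizes, and channel widths (none of which are specified above); it is not the exact HSI-Net used in BTRF-Net.

```python
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

def pca_reduce(hsi_cube, n_components=30):
    """Equation (1)-style preprocessing: reduce the spectral dimension of an
    (H, W, Bands) hyperspectral cube with PCA. n_components is an assumption."""
    h, w, b = hsi_cube.shape
    flat = hsi_cube.reshape(-1, b)
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

class HSIBranch(nn.Module):
    """Hybrid 3D/2D CNN sketch: 3D convolutions sample the spatial and spectral
    dimensions jointly, then the output is flattened along the spectral axis and
    refined with a 2D convolution (the deep feature map B in the text)."""
    def __init__(self, n_components=30):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(8), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(16), nn.ReLU(),
        )
        depth = n_components - 7 + 1 - 5 + 1          # spectral size left after the 3D convolutions
        self.conv2d = nn.Sequential(
            nn.Conv2d(16 * depth, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )

    def forward(self, x):                              # x: (batch, 1, n_components, patch, patch)
        x = self.conv3d(x)
        x = x.flatten(1, 2)                            # merge the channel and spectral axes for the 2D convolution
        return self.conv2d(x)                          # deep hyperspectral feature map
```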
After being processed by the two sub-networks (HSI-Net and LiDAR-Net), feature representations at different levels (A, B, C, and D) are first obtained for the two modalities of data; the specific operations are shown in Equations (2)–(5). Secondly, three fused feature maps M, N, and P are generated by fusing the feature maps A, B, C, and D according to the fusion strategy shown in Equation (6). We then input the hyperspectral and LiDAR feature maps (B and D) into the multi-source Transformer complementor (MSTC) to combine them organically by utilizing the complementarity between them, in order to better capture the correlations among multi-modal features. Then, the four fusion feature maps (M, N, P, and Q) are input into the tree structure for processing, in order to obtain multiple image features with stronger representational ability and successfully achieve high-precision land-object classification. Finally, two classifiers are constructed and their outputs are combined using a weighted sum method.
where the input represents the hyperspectral data processed by PCA [26] for dimensionality reduction, and Cut denotes the clipping of the obtained feature map to remove the insignificant information at its edges. A represents the shallow-level features of the hyperspectral data, while B represents the deep-level features.
where the input represents the original LiDAR data, and Cut denotes the clipping of the obtained feature map to remove the insignificant information at its edges. C represents the shallow-level features of the LiDAR data, while D represents the deep-level features.
where M, N, and P respectively represent the fused feature maps at different levels of the network.
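Purely as an illustration of Equation (6), the sketch below assumes that each fused map is formed by channel-wise concatenation of one shallow and one deep feature map from the two branches; the actual pairing and fusion operator of Equation (6) may differ.

```python
import torch

def fuse_levels(A, B, C, D):
    """Hypothetical realization of Equation (6): combine shallow (A, C) and deep
    (B, D) hyperspectral/LiDAR feature maps into multi-level fusion maps M, N, P.
    Channel-wise concatenation and this pairing are assumptions."""
    M = torch.cat([A, C], dim=1)   # shallow HSI + shallow LiDAR
    N = torch.cat([A, D], dim=1)   # shallow HSI + deep LiDAR
    P = torch.cat([B, C], dim=1)   # deep HSI + shallow LiDAR
    return M, N, P
```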
The BTRF-Net can simultaneously and fully utilize the spectral information of hyperspectral images and the elevation information of LiDAR data, and it can fully explore the correlation between them so as to achieve high-precision land-cover classification. The network adopts two strategies, an attention mechanism and joint training, to address the issue of inconsistent data structures. Specifically, the attention mechanism enhances the feature representation of each data source by weighting its features and then fuses them together, while the joint training strategy trains multiple datasets of different modalities together, sharing model parameters and feature representations to improve the model’s generalization ability. To compensate for the impact of data scarcity on classification results, the network adopts data augmentation and joint training. Specifically, data augmentation uses diverse data processing techniques, such as cropping, to exploit the shallow information in the data and thereby reduce the risk of overfitting. The network also mitigates the insufficient utilization of information by extracting feature information from different network layers. Finally, the design of the binary feature search tree (BFST) enhances the robustness and anti-interference capability of the network.
2.2. Multi-Source Transformer Complementor
In order to fully utilize the complementary advantages and potential interactions between data of different modalities and to improve the data fusion method, this paper proposes a multi-source Transformer complementor (MSTC), as shown in Figure 3.
The hyperspectral and LiDAR features Ki, Qi, and Vi (i = 1 represents hyperspectral, i = 2 represents LiDAR), with position-encoding information added, are mapped to several lower dimensions through linear layers to obtain the key vectors ki, query vectors qi, and value vectors vi, as shown in Equation (7).
Then, multiple heads of complementary attention (MHCA) are computed separately in these lower-dimensional subspaces and concatenated for subsequent computation, as shown in Equation (8). This helps the network capture different patterns of the data in different attention computations, thereby simulating the multi-channel characteristics of convolutional neural networks. The MHCA can effectively capture the global features and local texture information of images, hence achieving full feature fusion.
Based on complementarity theory and experimental verification, we ultimately chose k2, q1, and v1 to compute the multi-head complementary attention (MHCA). A dot-product operation is performed between the query vector and the transpose of the key vector to calculate the correlation between the features at the current position and those at other positions, yielding the attention matrix. The softmax function is then used to normalize it, and a dot product with the value vector is performed so that the feature at each position contains global positional feature information.
We then apply a linear mapping to the result, obtaining an output sequence with the same shape as Q2, and add it to the input feature Q2 to form a residual connection, which alleviates the degradation problem of deep neural networks, as shown in Equation (9). Next, the result is normalized to accelerate the convergence of the network.
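The cross-modal attention of Equations (7)–(9) could be sketched in PyTorch as follows; the head count and embedding size are assumptions, and both modalities are assumed to be tokenized to the same sequence length. This is an illustrative sketch rather than the exact MHCA implementation.

```python
import torch
import torch.nn as nn

class MHCA(nn.Module):
    """Sketch of multi-head complementary attention: queries and values come from
    the hyperspectral tokens, keys from the LiDAR tokens (q1, v1, k2 in the text)."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)    # Equation (7): linear maps producing q1
        self.k_proj = nn.Linear(dim, dim)    #                                     k2
        self.v_proj = nn.Linear(dim, dim)    #                                     v1
        self.out_proj = nn.Linear(dim, dim)  # Equation (8): merge the concatenated heads
        self.norm = nn.LayerNorm(dim)

    def forward(self, hsi_tokens, lidar_tokens):
        # Both inputs: (batch, seq_len, dim), position encoding already added;
        # the two modalities are assumed to share the same sequence length.
        b, n, d = hsi_tokens.shape
        split = lambda t: t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q = split(self.q_proj(hsi_tokens))
        k = split(self.k_proj(lidar_tokens))
        v = split(self.v_proj(hsi_tokens))
        # Scaled dot-product attention across modalities, then softmax normalization
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        # Equation (9): linear mapping, residual connection with the LiDAR feature
        # (denoted Q2 in the text), then normalization
        return self.norm(lidar_tokens + self.out_proj(out))
```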
The multi-source Transformer complementor (MSTC) first uses a multimodal feature extractor to extract features from different modalities of remote sensing data. Then, using a multi-head complementary attention mechanism, the features of different modalities are cross-encoded to promote complementarity between them. Next, multiple complementary attention heads and feedforward neural networks are used to encode and fuse the features. Finally, the fused feature, along with three other features, is input into the BFST module for further processing.
2.3. Binary Feature Search Tree
To fully utilize the feature information of hyperspectral and LiDAR data, this paper organizes the fused features into a hierarchical network structure with a tree-like form. Combining knowledge of deep learning and data structures, we note that learning features requires optimization with a loss function, whose backpropagation mechanism can essentially be viewed as retrieving and selecting features, and that among tree structures full n-ary trees have the shortest path length. For simplicity, we use a full binary tree for feature construction and name it the binary feature search tree (BFST).
As shown in Figure 4, after the network extracts and fuses the image features of the two modalities, the feature maps M, N, P, and Q are obtained. These four feature maps are then concatenated and subjected to adaptive average pooling to preserve more spatial information and reduce the risk of overfitting, as shown in Equation (10).
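As a minimal sketch of Equation (10), the root of the BFST could concatenate the four fusion maps along the channel axis and adaptively average-pool the result, as below; the pooled output size and any layers inside the tree nodes are assumptions.

```python
import torch
import torch.nn as nn

def bfst_root(M, N, P, Q, out_size=1):
    """Equation (10) sketch: stack the four fusion feature maps (assumed to share the
    same spatial size) along the channel axis, then adaptively average-pool them."""
    fused = torch.cat([M, N, P, Q], dim=1)            # (batch, C_M + C_N + C_P + C_Q, H, W)
    return nn.AdaptiveAvgPool2d(out_size)(fused)      # (batch, C_total, out_size, out_size)
```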
This method fully utilizes the spectral information, spatial information, and elevation information of the two modalities of data, thereby enhancing the representational ability of the multiple feature maps. By exploiting the feature correlations between different levels of the images, it combines local and global features to describe the information of target objects well. In addition, the full binary tree structure effectively enhances the stability and robustness of the network.
Since the BTRF-Net adopts a multi-level structure and makes predictions at different network levels, all losses must be weighted to obtain the final loss function. Each output’s classification accuracy on the training data determines its weight in the weighted sum, as shown in Equation (11). Weighting the coefficients of the different branch paths balances the importance of the gradient flows and suppresses errors during backpropagation.
This loss function effectively updates the parameters located far from the output layer, thereby improving the network’s ability to learn low-level abstract features.
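A minimal sketch of such a weighted loss, assuming cross-entropy per branch and weights obtained by normalizing the per-branch training accuracies, is given below; the exact weighting in Equation (11) may differ.

```python
import torch
import torch.nn as nn

def weighted_branch_loss(branch_logits, labels, branch_accuracies):
    """Equation (11) sketch: each branch output contributes a cross-entropy term
    weighted by its classification accuracy on the training data."""
    criterion = nn.CrossEntropyLoss()
    weights = torch.tensor(branch_accuracies, dtype=torch.float32)
    weights = weights / weights.sum()                 # normalize so the weights sum to one (assumption)
    losses = [criterion(logits, labels) for logits in branch_logits]
    return sum(w * l for w, l in zip(weights, losses))
```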