Article

A Multi-Scale Convolution and Multi-Layer Fusion Network for Remote Sensing Forest Tree Species Recognition

1 School of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou 311300, China
2 Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Zhejiang A&F University, Hangzhou 311300, China
3 College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
4 State Key Laboratory of CAD & CG, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4732; https://doi.org/10.3390/rs15194732
Submission received: 28 July 2023 / Revised: 21 September 2023 / Accepted: 24 September 2023 / Published: 27 September 2023

Abstract

Forest tree species identification in the field of remote sensing has become an important research topic. Few current methods combine global and local features, making it difficult to handle the similarity between different categories accurately. Moreover, using only a single deep layer for feature extraction overlooks the distinctive feature information at intermediate levels. This paper proposes a remote sensing image forest tree species classification method based on the Multi-Scale Convolution and Multi-Layer Fusion Network (MCMFN) architecture. In the MCMFN network, the Shallow Multi-Scale Convolution Attention Combination (SMCAC) module replaces the original 7 × 7 convolution at the first layer of ResNet-50. This module uses multi-scale convolutions to capture different receptive fields and combines them with an attention mechanism, effectively strengthening shallow features and yielding richer feature information. Additionally, to make efficient use of intermediate and deep-level feature information, the Multi-layer Selection Feature Fusion (MSFF) module is employed to improve classification accuracy. Experimental results on the Aerial forest dataset demonstrate a classification accuracy of 91.03%. The comprehensive experiments indicate the feasibility and effectiveness of the proposed MCMFN network.

1. Introduction

In the current field of environmental protection and ecological research, accurate identification and classification of tree species in forests have become important and challenging tasks. This is of significant importance for ecological conservation, resource management, forestry planning, and biodiversity research. However, traditional tree species identification methods mainly rely on manual observation and expert experience, which can be limited in efficiency and accuracy when dealing with large-scale forest areas and complex ecosystems. With the rapid development of remote sensing and deep learning technologies, the extraction of tree species features from remotely sensed imagery has become a hot topic in current remote sensing research. Due to the large field of view and strong global characteristics of remote sensing images, tree classification can be more comprehensive and effective. Deep learning algorithms, represented by convolutional neural networks, show strong potential for remote sensing image classification. Nevertheless, current tree species classification methods still face several challenges due to the diversity and similarity of species.
The exploration of satellite data, such as Landsat, for land cover classification dates back to as early as 1980, when Walsh [1] identified and mapped 12 land cover types, including seven types of coniferous forests. Subsequently, research in tree species classification predominantly focused on pixel-level analysis. Dymond et al. [2] demonstrated that forest phenology changes could be accurately captured by combining image differencing and tasseled cap index mapping. In recent years, with the advancement of remote sensing imaging technology and the proliferation of high-resolution satellites, we have gained access to satellite and aerial remote sensing images of higher spatial, spectral, radiometric, and temporal resolution. However, this also places higher demands on our interpretation of remote sensing images. High-resolution remote sensing images provide rich information and data for remote sensing tree species image processing tasks [3]. Alonzo et al. [4] fused Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) imagery and airborne LiDAR data to map tree species at the crown level in urban areas, achieving an Overall Accuracy (OA) of 83.4% using canonical discriminant analysis. Zhang et al. [5] employed unmanned aerial vehicles to obtain remote sensing images and combined deep learning with RGB optical imagery for urban tree species classification. Li et al. [6] demonstrated that the use of LiDAR intensity images, in conjunction with panchromatic sharpening of WorldView-2 imagery, improved individual tree species (ITS) classification accuracy without the need for atmospheric correction. These advancements have presented new opportunities for forest tree species identification but have also introduced new challenges, including tree species diversity and similarity, large-scale data processing, and variations caused by lighting conditions and occlusion.
The most challenging aspect lies in the diversity and similarity of forest tree species. Forests encompass a wide array of tree species, each characterized by unique morphological, textural, and color features. However, certain species may exhibit visually similar attributes, making it arduous to distinguish them in remote sensing images, as depicted in Figure 1. The overall branch and leaf textures, as well as leaf colors of these tree species, bear striking resemblance, posing a significant challenge in differentiation to the naked eye. Such intricacy underscores the necessity for more precise and robust feature representation and classification models in tree species classification tasks.
In recent years, a series of deep learning models have been widely applied in the field of image classification, such as AlexNet [7], GoogleNet [8], VGGNet [9], and more. These models have been improved to varying extents. For example, ResNet [10], proposed by He et al., addresses the gradient vanishing problem during backpropagation by adding skip connections. Xie et al. introduced ResNeXt [11], which utilizes group convolutions and ResNet-like merge operations. Huang et al. presented DenseNet [12], aiming to include information from all layers in the output of each block, achieved through multiple dense blocks and a classification layer similar to ResNet. Chen et al. proposed Dual Path Networks (DPN) [13], effectively combining the advantages of ResNet and DenseNet by incorporating both residual and dense blocks in a parallel structure. MobileNetv2 [14] and MobileNetv3 [15] are improved versions of MobileNet, designed for efficient image recognition on mobile devices with limited computational resources. These improvements employ depthwise separable convolutions, breaking down the standard convolution operation into depthwise and pointwise convolutions, reducing parameter count and computation cost. Zhang et al. introduced ShuffleNet [16], which reduces parameter count and computation complexity through channel shuffling, while enhancing the network’s non-linearity. ShuffleNet performs impressively when deployed on resource-limited devices. Vision Transformer (ViT) [17] is an image classification model based on the Transformer architecture, proposed by the Google Brain team in 2020. Originally designed for natural language processing tasks, the ViT demonstrated excellent representation learning capabilities when applied to the image domain. ViT splits the image into small image patches and uses multi-head self-attention mechanisms to capture relationships within the image and between different patches. While ViT performs well on some image classification tasks, it may incur high computational costs for larger images. To address this, Liu et al. introduced Swin Transformer [18] as an improvement to ViT. Swin Transformer introduces Swin Blocks, a multi-scale, multi-level attention mechanism. Swin Blocks leverage local windows to capture information within the image and combine this information at different levels, effectively handling large-sized images and achieving impressive performance in image classification and object detection tasks.
Hu et al. [19] introduced the Squeeze-and-Excitation (SE) mechanism into ResNet, proposing SE-Net. In the SE module, the output features are first compressed through global average pooling and then passed through two fully connected layers to produce a per-channel weight, which is multiplied with the original feature information of each channel. Yao et al. [20] proposed a multimodal model that extends parallel branches of location-shared ViTs with separable convolutional modules, providing an economical way to exploit spatial and modality-specific channel information; by fusing the classification tokens of each modality with cross-modal attention modules, the model significantly improves their discriminative power and achieves higher performance. Hong et al. [21] proposed a mini-batch GCN (miniGCN), which combines CNN and GCN models, trains large-scale GCNs in a mini-batch manner, and extrapolates to out-of-sample data, eliminating the need to re-train the network and improving classification accuracy. Li et al. [22] proposed a hyperspectral anomaly detection baseline network (LRR-Net) that combines the LRR model with deep learning techniques; it efficiently solves the LRR model via an alternating direction method of multipliers (ADMM) optimizer, uses its solution as prior knowledge to guide parameter optimization of the deep network, and converts the regularization parameters into trainable parameters, reducing the need for manual parameter tuning. Liu et al. [23] proposed an improved Res-UNet tree species classification model based on a point-based deep neural network. This network segments the Euclidean space of trees into multiple overlapping layers from LiDAR data of forest areas, thereby obtaining partial 3D structures of trees; the overall features of trees are then extracted using convolutional operations that consider the characteristics of each layer. Nezami et al. [24] compared the performance of 3D-CNN models trained with hyperspectral (HS) channels, red-green-blue (RGB) channels, and canopy height models (CHM); the 3D convolution demonstrated excellent classification accuracy. Guo et al. [25] proposed a morphological feature-based dual-concentrated network. First, mathematical morphology methods were used to extract morphological features from hyperspectral images. On this basis, the original hyperspectral remote sensing images were processed for coarsening and refining, and morphological and spectral feature information was simultaneously fed into the network to obtain comprehensive evaluation indices and visual results. The advantage of this method lies in decoupling the spatial and spectral data before fusion. He et al. [26] proposed the “Spatial Pyramid Pooling” fusion strategy, leading to a novel network architecture named SPP-net, which generates data representations independent of size and scale; the pyramid fusion is also stable in handling object deformations. Leveraging these advantages, the SPP neural network is generally an improvement upon CNN-based image classification algorithms. Wang et al. [27] proposed a weakly supervised fine-grained classification network based on a multi-scale pyramid, replacing ordinary convolutional kernels in residual networks with pyramid convolutional kernels to expand the receptive field and obtain features from different scales.
Spatial and channel attentions were then introduced to acquire more informative feature maps. While the aforementioned models effectively achieve forest tree species classification, some of them focus solely on extracting local or global features, neglecting the joint feature extraction. Models with a single focus may face limitations in recognizing highly similar forest tree species. Concentrating solely on local features may cause the model to overlook the overall layout of the forest, while focusing solely on global features may lead to the neglect of individual tree details, such as textures and shapes. Additionally, the improvement of these models often overlooks the correlation between intermediate and deep-level feature information.
The main contributions of this paper are as follows:
  • In order to further improve the accuracy of forest tree species classification in remote sensing images, this paper proposes a remote sensing image forest tree species classification method based on the MCMFN network. The residual network model ResNet-50 is used as the baseline network to extract image features. Experiments are conducted on the Aerial dataset to evaluate the classification performance of the proposed MCMFN network. The results demonstrate that this method effectively enhances the accuracy of forest tree species classification in remote sensing images.
  • To effectively improve the capability of extracting shallow features and to obtain richer feature information, the 7 × 7 convolution in the ResNet-50 network is replaced with the SMCAC module. The SMCAC module first utilizes convolutions of different scales and global average pooling to obtain different receptive fields. Then, point-wise convolutions are added to include pixel positional information, and the ACmix attention mechanism is introduced to focus on more informative features.
  • To address the correlation between intermediate and deep-level feature information and improve classification accuracy, the MSFF module is incorporated after the last residual block in the ResNet-50 network. This module extracts feature information and fuses it with the feature information obtained through a fully connected layer from the last residual block, resulting in the final feature representation.

2. Materials and Methods

2.1. Overall Architecture

The overall structure of our proposed method is shown in Figure 2 and Table 1. It consists of three parts: multi-scale feature extraction, multi-layer feature fusion and classification. In terms of multi-scale feature extraction, the combination of convolution and attention mechanism is used to extract richer shallow feature information. In multi-layer feature fusion, the feature information extracted from the middle and deep layers is added and fused. In classification, the feature maps after multi-layer fusion are classified.

2.2. Multi-Scale Feature Extraction

2.2.1. ResNet-50 Net

The Residual Network (ResNet), proposed by He et al., employs shortcut connections in residual blocks to enhance the network's identity mapping capability. This allows the network to be made deeper without suffering from the degradation problem, thereby improving its performance. The ResNet-50 network consists of five stages for feature extraction.
The first stage has only one convolutional layer, which is used to extract initial features. The remaining four stages are composed of convolutional layers containing 3, 4, 6, and 3 residual blocks, respectively. As shown in Figure 3, each residual block follows a bottleneck design: a 1 × 1 convolutional layer first reduces the channel dimension, a 3 × 3 convolutional layer then processes the reduced features, and a final 1 × 1 convolutional layer restores the dimension, preserving accuracy while reducing computational overhead.
The ResNet architecture revolutionized deep learning and has been widely adopted in various computer vision tasks, including image classification, object detection, and semantic segmentation, due to its ability to effectively train very deep neural networks. The use of residual blocks and shortcut connections helps to address the vanishing gradient problem, enabling the successful training of deeper networks. The ResNet-50 variant, with 50 layers, is particularly popular for its balance between model complexity and performance.
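For reference, the bottleneck structure described above can be sketched in PyTorch as follows. This is a minimal illustration of the standard ResNet bottleneck; the BatchNorm/ReLU placement and variable names follow common practice rather than the figure.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore, plus a shortcut."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the shape changes, identity mapping otherwise
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The skip connection adds the input back onto the convolutional branch
        return self.relu(self.body(x) + self.shortcut(x))
```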

2.2.2. SMCAC Module

Using traditional single-scale CNNs to extract feature information from remote sensing images leads to the loss of some effective feature information, directly impacting the classification performance. Feature extraction at a single scale is limited and neglects local detail information in the image. Tree species classification tasks require accurate recognition of local features such as texture and shape of trees, which cannot be effectively captured by single-scale feature extraction methods, resulting in inaccurate classification results.
To enhance the quality of local feature extraction, we introduce the Shallow Multi-Scale Convolution Attention Combination (SMCAC) module (Figure 4). This module leverages a convolutional pyramid with varying kernel sizes for feature extraction and incorporates global average pooling to capture global features. The pyramid comprises three parallel convolutional layers with different kernel sizes (1 × 1, 3 × 3, and 5 × 5), denoted as W1, W3, and W5, respectively. This arrangement enables the capture of diverse feature structures and contextual information across different scales, resulting in feature maps with multiple receptive fields.
In this context, the input feature map is denoted as ‘X’ and the output feature map as ‘Y’. ‘Gap’ denotes the global average pooling operation, ‘·’ represents the standard convolution operation, and ‘cat’ represents feature map concatenation.
The formula for this process is as follows:
Y = \mathrm{cat}(W_1 \cdot X,\; W_3 \cdot X,\; W_5 \cdot X,\; \mathrm{Gap}(X))
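A minimal PyTorch sketch of this multi-scale pyramid is given below. The per-branch channel counts and the way the pooled global feature is projected and broadcast back onto the spatial grid are our assumptions; the paper specifies only the kernel sizes and the concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePyramid(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions plus a global-average-pooling branch,
    concatenated along the channel axis."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.w1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.w3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.w5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.gap_proj = nn.Conv2d(in_ch, branch_ch, 1)   # projection after pooling (assumption)

    def forward(self, x):
        h, w = x.shape[-2:]
        g = F.adaptive_avg_pool2d(x, 1)                   # global context, 1x1 spatial size
        g = self.gap_proj(g).expand(-1, -1, h, w)         # broadcast back onto the grid (assumption)
        # Y = cat(W1·X, W3·X, W5·X, Gap(X))
        return torch.cat([self.w1(x), self.w3(x), self.w5(x), g], dim=1)
```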
Finally, to extract more effective information from the rich feature maps, we introduce the ACmix attention mechanism to enhance the focus on important information, achieve a balance between local and global aspects, and further improve the model's recognition performance. As illustrated in Figure 5, the ACmix attention mechanism introduced by Pan et al. [28] guides feature learning toward important regions, thereby increasing precision. ACmix is a hybrid model that combines the advantages of self-attention and convolution while maintaining a lower computational cost than pure convolutional or self-attention models. It aims to capture both local and global contextual information, enabling the model to selectively focus on relevant regions and effectively enhance feature extraction. By incorporating the ACmix attention mechanism into the feature extraction process, the model becomes more attentive to crucial information, leading to improved classification accuracy and robustness. This attention mechanism has shown promising results in various computer vision tasks, including image classification, object detection, and semantic segmentation. Its ability to strike a balance between local and global information, while maintaining computational efficiency, makes it a valuable addition to the MCMFN network for remote sensing image forest tree species classification.
In image processing, traditional convolution methods utilize convolutional kernel weights to aggregate local receptive fields and share them across the entire feature map. This characteristic can lead to significant induction bias in image processing. In contrast, self-attention modules dynamically compute attention weights using similarity functions between pixels in the weighted average operation of input feature contexts. This flexibility allows the attention module to adaptively adjust its focus on different regions, enabling the extraction of more effective feature information. Considering the distinct characteristics of both methods, combining them can greatly enhance feature extraction efficiency while reducing computational overhead. The ACmix process, as shown in Figure 6, first applies three 1 × 1 convolutions to the input feature map. The resulting feature maps are then subjected to feature extraction using convolution with a kernel size of k and self-attention. Finally, the feature values from both operations are added to obtain the final feature map.
Finally, the number of channels is changed by a 1 × 1 convolution, denoted as W11, and restored to the original input feature size. ‘Fa’ represents the ACmix attention mechanism, and ‘Y’ corresponds to the output feature map of the SMCAC module.
The final formula of SMCAC is:
Y = Y \cdot F_a \cdot W_{11}
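Putting the pieces together, the SMCAC forward pass can be sketched as follows. ACmix itself is the published module of Pan et al. [28] and is treated here as an externally supplied block; the channel counts are illustrative assumptions, not values from the paper.

```python
import torch.nn as nn

class SMCAC(nn.Module):
    """Sketch: multi-scale pyramid -> ACmix attention -> 1x1 convolution (W11)
    restoring the expected channel count."""
    def __init__(self, in_ch, branch_ch, out_ch, acmix_module):
        super().__init__()
        self.pyramid = MultiScalePyramid(in_ch, branch_ch)  # from the previous sketch
        self.acmix = acmix_module                            # ACmix attention [28], supplied externally
        self.restore = nn.Conv2d(4 * branch_ch, out_ch, 1)   # W11

    def forward(self, x):
        y = self.pyramid(x)      # multi-receptive-field features
        y = self.acmix(y)        # focus on the more informative responses (Fa)
        return self.restore(y)   # Y = Y · Fa · W11 in the notation above
```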

2.3. Multilayer Feature Fusion

Traditional convolutional networks propagate features layer by layer, and the obtained feature maps at each layer lack correlation with each other, leading to a loss of some contextual information. To address this issue, we propose a multi-layer fusion approach called the Multi-layer selection feature fusion (MSFF) module. The primary goal of this module is to perform feature fusion on intermediate and deep-level feature maps, as these layers possess varying levels of abstraction and semantic information. By automatically selecting the most informative features from multiple layers, the MSFF module enhances the inter-layer correlation between intermediate and deep-level features.
Through multi-layer fusion, the MSFF module combines the fine-grained details from intermediate layers with the abstract representations from deep layers, thereby improving the overall feature representation capability. This helps the model gain a better understanding of various hierarchical features present in the input data and extract more comprehensive and robust feature representations. The specific structure of the MSFF module is depicted in Figure 7.
The original input feature map, denoted as P ∈ R^(H×W×C), has C channels, height H, and width W. Firstly, feature map P undergoes a series of convolution operations to produce three feature maps: ‘a’ of size H/4 × W/4 × 2C, ‘b’ of size H/8 × W/8 × 4C, and ‘c’ of size H/16 × W/16 × 8C. Then, a and b are transformed to the same size as ‘c’ using 1 × 1 convolutions C1 and added together with c to fuse the features, resulting in a new feature map ‘O’ of size H/16 × W/16 × 8C. Next, feature extraction is performed on ‘O’ through global average pooling and two fully connected layers. After reshaping and applying the Softmax operation, a feature map ‘Q’ of size 3 × 8C × 1 is obtained. Q is divided into three equal parts, and each part undergoes a spatial transformation before being multiplied with feature map O for fusion. This fusion selects different feature information of size H/16 × W/16 × 8C and adds the re-weighted features together, resulting in a new feature map ‘d’ with the same size as c.
The formula for this process is as follows:
O = C_1(a) + C_1(b) + c
Q = \mathrm{Softmax}(\mathrm{reshape}(fc_2(fc_1(\mathrm{Gap}(O)))))
d = \sum_{i=1}^{3} \left( O \cdot \mathrm{reshape}(Q_i) \right)
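A hedged PyTorch sketch of this selection-and-fusion step is given below. The hidden width of the two fully connected layers and the use of average pooling to match the spatial sizes of a and b to c are our assumptions; only the overall flow (1 × 1 projections, addition, Gap–fc–fc–Softmax selection weights, and the weighted sum) follows the description above.

```python
import torch.nn as nn
import torch.nn.functional as F

class MSFF(nn.Module):
    """Sketch of the multi-layer selection feature fusion (MSFF) described above."""
    def __init__(self, ch_8c, reduction=4):
        super().__init__()
        # 1x1 projections C1 that lift the 2C / 4C maps to 8C channels
        self.proj_a = nn.Conv2d(ch_8c // 4, ch_8c, 1)
        self.proj_b = nn.Conv2d(ch_8c // 2, ch_8c, 1)
        # two fully connected layers; the hidden width (reduction) is an assumption
        self.fc1 = nn.Linear(ch_8c, ch_8c // reduction)
        self.fc2 = nn.Linear(ch_8c // reduction, 3 * ch_8c)

    def forward(self, a, b, c):
        size = c.shape[-2:]
        # O = C1(a) + C1(b) + c  (a and b pooled to c's spatial size; assumption)
        o = self.proj_a(F.adaptive_avg_pool2d(a, size)) \
            + self.proj_b(F.adaptive_avg_pool2d(b, size)) + c
        # Q = Softmax(reshape(fc2(fc1(Gap(O))))): three channel-selection weight vectors
        q = self.fc2(F.relu(self.fc1(F.adaptive_avg_pool2d(o, 1).flatten(1))))
        q = F.softmax(q.view(-1, 3, o.shape[1]), dim=1)
        # d = sum_i O * reshape(Q_i): each weight vector re-weights O channel-wise
        return sum(o * q[:, i].unsqueeze(-1).unsqueeze(-1) for i in range(3))
```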

2.4. Classification

During the classification phase, the MSFF module processes the extracted feature information by applying average pooling and fully connected layer operations. Subsequently, these processed features are fused with the feature maps obtained from the fully connected (FC) layer of feature map ‘c’. The final step involves applying the Softmax function to achieve the ultimate classification of forest tree species. ‘fc’ is denoted as the fully connected operation; ‘Avgpool’ is represented as average pooling operation; ‘d’ corresponds to the feature maps obtained from the MSFF module, and ‘c’ represents the feature maps obtained from the last residual block. ‘e’ is denoted as the output classification.
The formula for this process is as follows:
e = \mathrm{Softmax}(fc(\mathrm{Avgpool}(d)) + fc(\mathrm{Avgpool}(c)))
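The classification head can be sketched as follows. Whether the two branches share the same fully connected layer is not specified by the formula, so separate layers are used here as an assumption; the class count of 20 matches the Aerial dataset.

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Sketch of the classification head: pooled MSFF output d and pooled deep map c
    each pass through a fully connected layer, and the logits are summed."""
    def __init__(self, channels, num_classes=20):
        super().__init__()
        self.fc_d = nn.Linear(channels, num_classes)
        self.fc_c = nn.Linear(channels, num_classes)

    def forward(self, d, c):
        zd = self.fc_d(F.adaptive_avg_pool2d(d, 1).flatten(1))
        zc = self.fc_c(F.adaptive_avg_pool2d(c, 1).flatten(1))
        # e = Softmax(fc(Avgpool(d)) + fc(Avgpool(c)))
        return F.softmax(zd + zc, dim=1)
```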

3. Experiment and Analysis

3.1. Datasets

The experiment uses the Aerial dataset from the TreeSatAI project (a public dataset, available at https://zenodo.org/record/6780578, accessed on 1 August 2022) [29], which covers federal forest tree species in Lower Saxony, Germany, and is derived from aerial image patches of summer orthophotos acquired between 2011 and 2020. The patch size is 60 m × 60 m (304 × 304 pixels). The band order is near-infrared, red, green, blue, and the spatial resolution is 20 cm. Figure 8 shows some images of the dataset. The dataset contains 20 tree species categories, each with roughly 200–6000 images, and 50,381 images in total.
Due to the multi-band nature of the dataset and the imbalance in the number of images per class, preprocessing steps were applied to facilitate training and reduce errors, as depicted in Figure 9. Specifically, the split function was employed to extract the four bands from the original image. Subsequently, the near-infrared (NIR), green, and blue bands were merged using the merge function to create a new composite image. Then, for each class in the resulting dataset, 500 images were randomly sampled (if a class had fewer than 500 images, all available images were included).
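The paper does not name the image library behind the split and merge functions; the sketch below assumes OpenCV and the NIR, red, green, blue band order given above, and both the function names and the file-handling details are illustrative rather than the authors' actual preprocessing code.

```python
import random
import cv2  # assumed library; the paper only mentions generic split/merge functions

def make_composite(src_path, dst_path):
    """Split a 4-band patch and re-merge NIR, green, and blue into a 3-band composite.
    The stored band order is assumed to match the dataset description (NIR, R, G, B)."""
    img = cv2.imread(src_path, cv2.IMREAD_UNCHANGED)   # H x W x 4 array
    nir, r, g, b = cv2.split(img)
    cv2.imwrite(dst_path, cv2.merge([nir, g, b]))

def sample_class(image_paths, n=500, seed=0):
    """Randomly keep at most n images per class to reduce class imbalance."""
    random.seed(seed)
    return list(image_paths) if len(image_paths) <= n else random.sample(image_paths, n)
```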

3.2. Evaluating Indexes

Accuracy, precision, recall, F1 score, overall accuracy (OA), the Kappa coefficient, and the confusion matrix (CM) are used as evaluation metrics for forest tree species classification.
1. Accuracy is a statistical metric defined as the ratio of correct predictions (true positives (TP) and true negatives (TN)) made by the classifier to the total number of predictions made by the classifier, including both false positives (FP) and false negatives (FN).
Therefore, the formula for quantifying algorithm accuracy is as follows:
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
2. Precision, from a statistical perspective, represents the ratio of correctly identified positive cases to all predicted positive cases.
Its formula is as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
3. Recall, also known as sensitivity, is the ratio of correctly identified positive cases to all actual positive cases.
Its formula is as follows:
\mathrm{Recall} = \frac{TP}{TP + FN}
4. The F1 score is the harmonic mean of precision and recall, and it is a metric used to measure the accuracy of a classification model.
Its formula is as follows:
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
5. Overall accuracy (OA) is defined as the proportion of correctly classified samples to the total number of samples in the test set, and it effectively expresses the predictive capability of the model on the entire dataset. In the formula below, T represents the total number of samples in the test set, m denotes the total number of classes, and n represents the number of samples in each class. The function f(·) is the classification function that predicts the class of an individual sample x from the test set, y denotes the true class label of the sample, and I(·) is an indicator function that takes the value 1 when its argument is true and 0 otherwise.
The calculation method for OA is as follows:
\mathrm{OA} = \frac{1}{T} \sum_{i}^{m} \sum_{j}^{n} I\left( f(x_{i,j}) = y_{i,j} \right)
6. The confusion matrix (CM) is a matrix of size N × N used to represent the classification performance, where each row represents the predicted values and each column represents the true values. It can visually represent the misclassified categories, providing a better and more intuitive representation of the algorithm's performance.
7. The Kappa coefficient is a comprehensive evaluation metric that takes into account multiple factors in the classification results. It provides a more reliable performance measurement for image classification tasks, aiding a better understanding of classification consistency and model performance. However, in practical applications, it is essential to consider the characteristics of the dataset and the evaluation requirements to ensure that the most appropriate evaluation method is selected. ‘p0’ is the sum of the number of correctly classified samples of each category divided by the total number of samples, i.e., the overall classification accuracy. ‘pe’ is the sum, over all categories, of the products of the number of true samples and the number of predicted samples of each category, divided by the square of the total number of samples.
Its formula is as follows:
k = \frac{p_0 - p_e}{1 - p_e}
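These metrics can be reproduced, for example, with scikit-learn as sketched below. This is our illustration, not the authors' evaluation code; macro averaging over the 20 classes is an assumption.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, confusion_matrix)

def evaluate(y_true, y_pred):
    """Compute the evaluation metrics listed above for multi-class predictions."""
    return {
        "OA": accuracy_score(y_true, y_pred),                                     # overall accuracy
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "CM": confusion_matrix(y_true, y_pred),
    }
```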

3.3. Implementation Details

The experiments were conducted on the Windows 10 operating system, using the PyTorch deep learning framework (version 1.12.0). The CPU was an Intel i7-9700K, the graphics card was an NVIDIA RTX 2070 Super, and the system had 16 GB of RAM. The experimental parameters are provided in Table 2.

3.4. The Performance of the Proposed Method

To demonstrate the superiority of our proposed method, we compared it with other methods on the Aerial dataset. We trained and validated models based on deep learning algorithms such as EfficientNet, MobileNet, ResNet, ShuffleNet, and Vision Transformer. The dataset was split at a ratio of 8:2, with 20% used as the validation set, and the best validation accuracy obtained during training was used as the preliminary evaluation index. The best validation accuracy obtained by the different models is shown in Table 3. The results indicate that our method achieved the highest validation accuracy, 72.45%, outperforming the other methods.
To verify the individual impact of each module on the experiment, we trained ResNet-50 models with the addition of SMCAC and MSFF modules and compared their training results. To ensure fair comparison of the model optimization performance, consistent hyperparameter settings were used during the training phase, including learning rate, batch size, input image size, maximum iteration times, and training strategy. In order to further analyze the performance of the network structure, we evaluated the performance using overall accuracy (OA) and Kappa coefficient. In this experiment, a 5-fold cross-validation method (k = 5) was employed. It randomly selected 20% of the images from the entire Aerial dataset as the test set for each fold. The results were averaged and are presented in Table 4.
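The 5-fold split can be sketched, for example, with scikit-learn as below; stratifying by class label is our choice and is not stated in the paper, which only specifies that 20% of the images are randomly held out per fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_indices(labels, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out 20% of the images."""
    labels = np.asarray(labels)
    dummy_x = np.zeros((len(labels), 1))               # features are not needed for the split itself
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(dummy_x, labels):
        yield train_idx, test_idx
```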
The confusion matrix of the Aerial dataset for the MCMFN model is depicted in Figure 10. Our proposed method achieved excellent classification accuracy for each tree species. Among the 20 tree species categories, 15 classes obtained an OA of over 90%, while four classes achieved an OA of over 80%. When the Kappa coefficient approaches 1, it indicates a high level of agreement, signifying a strong consensus among evaluators or classifiers. The Kappa experiment results demonstrate that the addition of the SMCAC and MSFF modules individually resulted in Kappa value improvements of 4.42% and 2.07%, respectively. When both modules were added, the Kappa value reached 90.36%. This signifies that the model can perform forest tree species classification more reliably.
In order to gain a more detailed evaluation of each class on the Aerial dataset, precision, recall, and F1 score were used as the evaluation metrics. The results on the Aerial dataset are presented in Table 5.

4. Discussion

In order to discuss the contributions of the proposed SMCAC and MSFF modules to the entire MCMFN model, this section will provide a detailed analysis based on the experimental data.

4.1. Analysis of SMCAC Modules

The SMCAC module is designed to address the feature extraction at shallow layers by combining different scales of convolutional kernels in a pyramid structure, followed by pointwise convolution and attention mechanisms. In the original ResNet-50, a single 7 × 7 convolutional layer is used for shallow feature extraction, but this approach has limitations in capturing fine-grained local features. For instance, in forest tree species classification, some individual tree species may be small with intricate patterns, which are difficult to extract using the conventional convolutional kernels. In contrast, the SMCAC module utilizes 1 × 1, 3 × 3, and 5 × 5 convolutional kernels to capture more precise and detailed local features of tree species, enhancing the extraction of unique characteristics. The global average pooling is then applied to complement the global features and facilitate mutual supplementation of local and global information.
To better emphasize the extracted features, the ACmix attention mechanism is employed to focus on the most relevant local features within the entire image. The SMCAC module enhances the extraction of both local and global features, considering the interconnections among forest tree species, leading to more robust and informative feature representations. The performance of the SMCAC module in the relevant experiments is compared in Table 6.
From the experimental data, the original ResNet-50 achieved an Overall Accuracy (OA) of 84.95% on the dataset. When the SMCAC module was added without replacing the 7 × 7 convolutional layer, the OA obtained was 81.18%, a decrease of 3.77% compared to the original ResNet-50. However, when the 7 × 7 convolutional layer was removed, the OA improved to 89.33%, an overall accuracy increase of 4.38%. Finally, the SMCAC module without the ACmix attention mechanism, and with the 7 × 7 convolutional layer removed, achieved an OA of 88.32%, an improvement of 3.37%.
These results indicate that, after extracting features using the SMCAC module, applying the 7 × 7 convolutional layer to extract features would compromise the effectiveness of the features obtained from the SMCAC module. The 7 × 7 convolutional layer has a large receptive field, which may establish correlations between important and less relevant features, leading to a significant reduction in the effectiveness of feature extraction at intermediate and deep layers and, subsequently, a decrease in overall accuracy. Furthermore, the pyramid structure within the SMCAC module efficiently extracts tree contour feature information with varying receptive fields. After undergoing processing through the ACmix attention mechanism, this module can focus more on the subtle and significant feature information within tree species, further enhancing the precision of Overall Accuracy (OA). Although this module reduces the parameter count by 0.01 M, it also results in a fourfold increase in complexity, leading to longer training times. This increase in complexity may impose certain hardware constraints and requirements. Considering these factors, there is a trade-off between sacrificing some training time for increased accuracy, which can be acceptable in some contexts.

4.2. Analysis of MSFF Modules

The MSFF module performs feature fusion using intermediate and deep layer feature information, where different layers have varying degrees of abstraction and semantic information. There are certain correlations between the feature information from different layers, and by fusing them together, the lost feature information can be compensated for. In the MSFF module, three layers from the intermediate and deep stages are selected for fusion, combining the most effective detailed and contextual feature information from the intermediate layers with the most effective abstract and contextual feature information from the deep layers, thereby enhancing the feature representation capability. The experimental results related to the MSFF module are shown in Table 7.
From the experimental results, the original ResNet-50 achieved an overall accuracy (OA) of 84.95% on the dataset. However, by incorporating the MSFF module, the OA improved to 86.91%, resulting in a significant increase of 1.96%. This indicates that, in traditional CNN architectures, the feature extraction process proceeds from shallow to deep layers, overlooking the unique feature information present in each layer. Moreover, the feature information extracted at each layer is interrelated. The MSFF module helps the model better understand the distinct hierarchical features in the input data and extract more comprehensive and robust feature representations.

4.3. General Discussion of the Modules

Through ablation experiments involving the two modules, we have generated a performance chart, as depicted in Figure 11. Here, the x-axis represents FLOPs (floating-point operations), the y-axis signifies OA (overall accuracy), and the size of the circles reflects the model's parameter count.
In our model presented in this paper, we have employed the SMCAC module primarily in the shallow layers for the extraction of features related to the contours of forest tree species. This module consists of three convolutional layers with different kernel sizes (1, 3, and 5) along with a global average pooling operation, forming a pyramid structure. These features are further extracted using the ACmix attention mechanism. The selection of convolutional kernels at sizes 1, 3, and 5 aims to capture features from different perspectives. When considering the trade-off between parameter count and complexity, choosing these three convolutional layers proves to be an optimal choice. However, if dealing with images of varying resolutions, it may be necessary to adjust the number of convolutional layers or utilize dilated convolutions to adapt feature extraction accordingly.
On the other hand, the MSFF module is primarily designed for finer-grained feature extraction related to forest tree species. It operates by fusing features from both mid-level and deep-level feature maps. Each of these levels offers unique feature information, and their combination through addition results in a richer set of feature representations. When applying the MSFF module to different models, one may consider the selection of the number of deep layers or weight coefficients for each layer, allowing for proportional feature fusion. Such adjustments can lead to varying effects in different model architectures.

5. Conclusions

The diversity and similarity of forest tree species pose challenges to accurate classification in remote sensing. Forests consist of numerous tree species with distinct morphological, textural, and color features. To address this issue, we propose a new model called MCMFN. In MCMFN, two modules are used to enhance classification accuracy. The first is the SMCAC module, which replaces the original 7 × 7 convolutional kernel with convolutional kernels of different scales and global average pooling for feature concatenation. It further incorporates the ACmix attention mechanism, effectively integrating the advantages of convolution and self-attention for efficient and adaptive feature extraction. This combination enables the model to capture both local and global contextual information, thereby improving the performance of various image processing tasks. Moreover, compared to using pure convolution or self-attention alone, the ACmix mechanism reduces computational costs, making it a valuable tool for enhancing feature extraction efficiency in image processing and achieving a more effective fusion of local and global features. The second is the MSFF module, which effectively utilizes the diverse information available at different layers, addressing the feature disconnection problem between intermediate and deep layers. By leveraging the advantages of various feature representations, the MSFF module enhances feature learning and improves the performance of image classification tasks. We conducted classification evaluations on the Aerial dataset, and the experimental results demonstrate that the MCMFN model outperforms existing baseline models with the highest accuracy of 91.03%. The overall accuracy for each tree species category has also been improved. The proposed SMCAC and MSFF modules individually contribute a 4.38% and 1.96% increase in accuracy compared to the original ResNet-50 model. In future work, we will investigate techniques to reduce the parameter count of the shallow convolutional layers without compromising accuracy. Overall, the MCMFN network with the integrated SMCAC and MSFF modules effectively enhances the extraction of shallow and mid/deep-level features while considering their correlations, leading to significantly improved accuracy in forest tree species classification within remote sensing images.

Author Contributions

Conceptualization, J.H. (Jinjing Hou), H.Z., J.H. (Junguo Hu), H.Y. and H.H.; methodology, J.H. (Jinjing Hou); software, J.H. (Jinjing Hou); validation, H.Z., J.H. (Junguo Hu), H.Y. and H.H.; formal analysis, J.H. (Jinjing Hou), H.Z., J.H. (Junguo Hu), H.Y. and H.H.; investigation, J.H. (Jinjing Hou); resources, J.H. (Jinjing Hou), H.Z., H.Y. and H.H.; data curation, J.H. (Jinjing Hou); writing—original draft preparation, J.H. (Jinjing Hou); writing—review and editing, J.H. (Jinjing Hou), H.Z., J.H. (Junguo Hu), H.Y. and H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab). https://zenodo.org/record/6778154 (accessed on 1 August 2022).

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Walsh, S.J. Coniferous tree species mapping using LANDSAT data. Remote Sens. Environ. 1980, 9, 11–26.
  2. Dymond, C.C.; Mladenoff, D.J.; Radeloff, V.C. Phenological Differences in Tasseled Cap Indices Improve Deciduous Forest Classification. Remote Sens. Environ. 2002, 80, 460–472.
  3. Cheng, G.; Li, Z.; Han, J.; Yao, X.; Guo, L. Exploring Hierarchical Convolutional Features for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6712–6722.
  4. Alonzo, M.; Bookhagen, B.; Roberts, D.A. Urban Tree Species Mapping Using Hyperspectral and Lidar Data Fusion. Remote Sens. Environ. 2014, 148, 70–83.
  5. Zhang, C.; Xia, K.; Feng, H.; Yang, Y.; Du, X. Tree Species Classification Using Deep Learning and RGB Optical Images Obtained by an Unmanned Aerial Vehicle. J. For. Res. 2021, 32, 1879–1888.
  6. Li, H.; Hu, B.; Li, Q.; Jing, L. CNN-Based Individual Tree Species Classification Using High-Resolution Satellite Imagery and Airborne LiDAR Data. Forests 2021, 12, 1697.
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
  8. Szegedy, C.; Wei, L.; Yangqing, J.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  9. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  11. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
  12. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
  13. Chen, Y.; Jin, X.; Feng, J.; Yan, S. Training Group Orthogonal Neural Networks with Privileged Information. arXiv 2017, arXiv:1701.06772.
  14. Mehta, S.; Rastegari, M. Separable Self-Attention for Mobile Vision Transformers. arXiv 2022, arXiv:2206.02680.
  15. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
  16. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
  19. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
  20. Yao, J.; Zhang, B.; Li, C.; Hong, D.; Chanussot, J. Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
  21. Hong, D.; Gao, L.; Wu, X.; Yao, J.; Zhang, B. Revisiting Graph Convolutional Networks with Mini-Batch Sampling for Hyperspectral Image Classification. In Proceedings of the 2021 11th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Amsterdam, Netherlands, 24 March 2021; pp. 1–5.
  22. Li, C.; Zhang, B.; Hong, D.; Yao, J.; Chanussot, J. LRR-Net: An Interpretable Deep Unfolding Network for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12.
  23. Liu, M.; Han, Z.; Chen, Y.; Liu, Z.; Han, Y. Tree Species Classification of LiDAR Data Based on 3D Deep Learning. Measurement 2021, 177, 109301.
  24. Nezami, S.; Khoramshahi, E.; Nevalainen, O.; Pölönen, I.; Honkavaara, E. Tree Species Classification of Drone Hyperspectral and RGB Imagery with Deep Learning Convolutional Neural Networks. Remote Sens. 2020, 12, 1070.
  25. Guo, Z.; Zhang, M.; Jia, W.; Zhang, J.; Li, W. Dual-Concentrated Network With Morphological Features for Tree Species Classification Using Hyperspectral Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7013–7024.
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
  27. Wang, G.; Cheng, L.; Lin, J.; Dai, Y.; Zhang, T. Fine-Grained Classification Based on Multi-Scale Pyramid Convolution Networks. PLoS ONE 2021, 16, e0254054.
  28. Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the Integration of Self-Attention and Convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 815–825.
  29. Ahlswede, S.; Schulz, C.; Gava, C.; Helber, P.; Bischke, B.; Förster, M.; Arias, F.; Hees, J.; Demir, B.; Kleinschmit, B. TreeSatAI Benchmark Archive: A Multi-Sensor, Multi-Label Dataset for Tree Species Classification in Remote Sensing. Earth Syst. Sci. Data Discuss. 2022, 15, 681–695.
  30. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
Figure 1. Difficult-to-Distinguish Remote Sensing Forest Tree Species.
Figure 2. The overall architecture of our proposed method.
Figure 3. Shortcut connection of ResNet.
Figure 4. SMCAC module structure.
Figure 5. Schematic diagram of ACmix.
Figure 6. Structure of ACmix attention mechanism.
Figure 7. MSFF module structure.
Figure 8. Aerial dataset.
Figure 9. The preprocessed image.
Figure 10. Aerial confusion matrix for forest tree species classification.
Figure 11. Performance Analysis Chart.
Table 1. Framework of the proposed MCMFN.
Definition Layer | Layer Name | Type | Filters | Size
Shallow layer | SMCAC | Multiscale convolution | 3 | 1 × 1; 3 × 3; 5 × 5; 3 × 3 (Gap)
– | MaxPooling | MaxPooling | – | 3 × 3
Shallow layer | Layer 1 | Convolution | 64; 64; 256 | 1 × 1; 3 × 3; 1 × 1 (×3)
Middle | Layer 2 | Convolution | 128; 128; 512 | 1 × 1; 3 × 3; 1 × 1 (×4)
Middle | Layer 3 | Convolution | 256; 256; 1024 | 1 × 1; 3 × 3; 1 × 1 (×6)
Deep layer | Layer 4 | Convolution | 512; 512; 2048 | 1 × 1; 3 × 3; 1 × 1 (×3)
– | FC, MSFF | Multiscale layer fusion | – | 2048
Table 2. Experimental parameters.
Name of the Parameter | Parameter Values
learning rate | 1 × 10−3
optimizer | SGD
loss function | binary cross entropy
batch size | 8
epochs | 300
Table 3. Comparative experiments between different models.
Method | Validation Accuracy
efficientnet_b0 [30] | 31.94%
mobilenet_v2 [14] | 66.40%
mobilenet_v3_small [15] | 35.31%
mobilenet_v3_large [15] | 42.52%
ResNet-34 [10] | 62.07%
ResNet-50 [10] | 68.69%
ResNext50_32x4d [11] | 69.28%
shuflenet_v2_x0_5 [16] | 60.31%
shuflenet_v2_x1_0 [16] | 65.17%
Vit_base_patch_window7_224 [17] | 34.03%
Vit_large_patch16_224 [17] | 37.87%
Swin_tiny_patch4_window7_224 [18] | 30.88%
Swin_base_patch4_window7_224 [18] | 34.03%
MCMFN (ours) | 72.45%
Table 4. Ablation experiment.
Method | OA (%) | Kappa (%)
ResNet-50 | 84.95 (±1.31) | 84.14
ResNet-50 + SMCAC | 89.33 (±0.65) | 88.56
ResNet-50 + MSFF | 86.91 (±0.71) | 86.21
ResNet-50 + SMCAC + MSFF | 91.03 (±0.51) | 90.36
Table 5. Per-class precision, recall, and F1 score on the Aerial dataset.
Tree Species | Precision | Recall | F1
aerial_60m_abies_alba | 0.98 | 0.96 | 0.97
aerial_60m_acer_pseudoplatanus | 0.905 | 0.76 | 0.826
aerial_60m_alnus_spec | 0.885 | 0.92 | 0.902
aerial_60m_betula_spec | 0.848 | 0.84 | 0.844
aerial_60m_cleared | 0.867 | 0.91 | 0.888
aerial_60m_fagus_sylvatica | 0.91 | 0.91 | 0.91
aerial_60m_fraxinus_excelsior | 0.843 | 0.86 | 0.851
aerial_60m_larix_decidua | 0.92 | 0.92 | 0.92
aerial_60m_larix_kaempferi | 0.959 | 0.93 | 0.944
aerial_60m_picea_abies | 0.857 | 0.96 | 0.906
aerial_60m_pinus_nigra | 1.0 | 0.99 | 0.995
aerial_60m_pinus_strobus | 0.97 | 0.96 | 0.965
aerial_60m_pinus_sylvestris | 0.882 | 0.97 | 0.924
aerial_60m_populus_spec | 0.968 | 0.919 | 0.943
aerial_60m_prunus_spec | 0.958 | 0.902 | 0.929
aerial_60m_pseudotsuga_menziesii | 0.94 | 0.94 | 0.94
aerial_60m_quercus_petraea | 0.804 | 0.86 | 0.831
aerial_60m_quercus_robur | 0.862 | 0.81 | 0.835
aerial_60m_quercus_rubra | 0.912 | 0.93 | 0.921
aerial_60m_tilia_spec | 0.969 | 0.959 | 0.964
Table 6. Performance metrics for SMCAC module ablation experiments.
Module | OA (%) | FLOPs (MB) | Parameters of Networks (M)
ResNet-50 | 84.95 | 4131.79 | 23.60
ResNet-50 + SMCAC (with 7 × 7) | 81.18 | 4269.92 | 23.60
ResNet-50 + SMCAC (without 7 × 7) | 89.33 | 16,229.07 | 23.59
ResNet-50 + SMCAC (without 7 × 7, without ACmix) | 88.32 | 16,112.76 | 23.59
Table 7. Performance metrics for MSFF module ablation experiments.
Module | OA (%) | FLOPs (MB) | Parameters of Networks (M)
ResNet-50 | 84.95 | 4131.79 | 23.60
ResNet-50 + MSFF | 86.91 | 5366.48 | 27.80
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
