This study applied the Mask R-CNN deep learning instance segmentation algorithm as the base model for identifying the PWD infection stage of individual trees. First, composited RGB imagery was input into Mask R-CNN to identify PWD-infected individual trees at different infection stages. Then, feature band selection was applied to the hyperspectral data to screen out sensitive bands that could improve the accuracy of early PWD identification. Next, the network input layer structure was adjusted to construct an improved Mask R-CNN model that accepts the hyperspectral full-band and screened-band data, in order to verify the usefulness of rich spectral information for identifying PWD-infected individual trees at different infection stages. Finally, an integrated framework combining a prototypical network classification method with an individual tree segmentation algorithm was proposed for infection stage identification and compared with the improved Mask R-CNN, so as to select the optimal model for PWD identification based on UAV hyperspectral technology and deep learning. The technology flow chart of this research is shown in
Figure 4.
2.4.1. Mask R-CNN
Mask R-CNN is a pixel-level multi-objective instance segmentation algorithm proposed by He et al. [38]. Unlike traditional object detection models, which output only the bounding box of a target, Mask R-CNN outputs pixel-level segmentation results for each target while retaining the target's location and class information; this means that the tasks of detecting the degree of PWD susceptibility and segmenting individual tree canopies can be performed simultaneously.
To further improve the performance of the model, Mask R-CNN utilizes a Feature Pyramid Network (FPN) in conjunction with ResNet for feature extraction. Mask R-CNN also replaces the RoI Pooling layer with an RoI Align layer, solving the quantization mismatch problem and improving the accuracy of the bounding box proposals. The RoI Align layer generates a more accurate feature map of the region of interest, improving the accuracy of instance segmentation. The network structure of Mask R-CNN is shown in
Figure 5.
The loss function of Mask R-CNN consists of three components: the target classification loss ($L_{cls}$), the target bounding box regression loss ($L_{box}$), and the target mask segmentation loss ($L_{mask}$), as shown in Equation (3).

$L = L_{cls} + L_{box} + L_{mask}$ (3)
where the mask loss $L_{mask}$ is the average binary cross-entropy loss of the mask branch, which is decoupled from the classification branch. For an RoI belonging to the $k$th class, only the $k$th mask contributes to the loss. This definition allows a mask to be generated for each class without inter-class competition.
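The three-term loss and the per-class mask selection described above can be sketched in PyTorch as follows; the function name, tensor shapes, and the smooth-L1 box loss are illustrative choices, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def mask_rcnn_loss(cls_logits, cls_targets,
                   box_preds, box_targets,
                   mask_logits, mask_targets):
    """Illustrative multi-task loss L = L_cls + L_box + L_mask.

    mask_logits: (N, K, H, W) -- one mask per class for each of N RoIs.
    Only the mask of the ground-truth class enters L_mask, so masks of
    different classes do not compete with each other.
    """
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    l_box = F.smooth_l1_loss(box_preds, box_targets)

    n = mask_logits.shape[0]
    # For an RoI of class k, select only the k-th predicted mask
    per_class_masks = mask_logits[torch.arange(n), cls_targets]
    l_mask = F.binary_cross_entropy_with_logits(per_class_masks, mask_targets)
    return l_cls + l_box + l_mask
```

The indexing line is the key step: it implements the rule that only the $k$th mask of a class-$k$ RoI is penalized.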
2.4.3. Integrated Framework Combining Prototypical Network Classification Model and Individual Tree Segmentation Method
Although Mask R-CNN can achieve PWD infection stage classification and individual tree canopy segmentation simultaneously, it requires high-quality input sample labels, and its identification accuracy may therefore not meet the needs of early PWD detection. We therefore propose an integrated method combining a prototypical network classification model with individual tree segmentation to improve the accuracy of early PWD monitoring.
In this study, a prototypical network structure is used to map sliced data with a window size of S × S into a low-dimensional embedding space by means of an embedding function. The embedding function consists of multiple convolutional structures (Layer 1 … Layer N, Layer last), each consisting of a convolutional layer (Conv2d), a batch normalization layer (Batch_norm), a non-linear activation function (Relu), and a max pooling layer (Max_pool2d). The convolutional layer extracts features by convolving the input data with a set of learnable filters, and the batch normalization layer normalizes each feature map to zero mean and unit variance to speed up training and improve generalization. The non-linear activation function increases the representational capability of the model by introducing a non-linear transformation, and the max pooling layer divides each feature map into regions and selects the maximum value in each region as the output, reducing the feature map size and improving translation invariance. In this way, the prototypical network can efficiently extract features from imagery and perform accurate classification.
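A minimal PyTorch sketch of such an embedding function follows; the number of blocks (four), the 8-band input (matching the screened-band imagery), and the per-layer channel width are assumptions for illustration:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One layer of the embedding function: Conv2d -> Batch_norm -> Relu -> Max_pool2d."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# A four-block embedding network mapping an S x S patch to a feature vector.
embed = nn.Sequential(
    conv_block(8, 64),    # e.g. 8 screened hyperspectral bands as input
    conv_block(64, 64),
    conv_block(64, 64),
    conv_block(64, 64),
    nn.Flatten(),
)

x = torch.randn(2, 8, 51, 51)   # two 51 x 51 sample windows
print(embed(x).shape)           # torch.Size([2, 576]); 51 -> 25 -> 12 -> 6 -> 3
```

Each max pooling halves the spatial size (with flooring), so a 51 × 51 window shrinks to 3 × 3 × 64 features before flattening.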
At the end of the embedding function, the feature maps are flattened (Flatten) and processed by the fully connected layer and the softmax function, which turn the F feature values in the prototypical network into probability values for each category as the basis for classification. Specifically, the fully connected layer maps the embedding vectors into an F-dimensional vector space, and the softmax function normalizes the values in that space into a probability distribution representing the probability of each category. The input data are then assigned to the category with the highest probability. The classification process is shown in
Figure 8.
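The distance-based classification step of a prototypical network can be sketched as follows, assuming the class prototypes have already been computed as the means of the support-set embeddings; the tiny 2-D prototypes and queries are purely illustrative:

```python
import torch
import torch.nn.functional as F

def classify(query_emb, prototypes):
    """Class probabilities from squared Euclidean distances to the
    per-class prototypes; softmax over negative distances yields the
    probability of each category."""
    d = torch.cdist(query_emb, prototypes) ** 2   # (Q, C) squared distances
    return F.softmax(-d, dim=1)                   # (Q, C) probabilities

protos = torch.tensor([[0.0, 0.0], [4.0, 4.0]])   # two class prototypes
queries = torch.tensor([[0.1, 0.2], [3.9, 4.1]])
probs = classify(queries, protos)
print(probs.argmax(dim=1))  # tensor([0, 1]): each query goes to the nearer prototype
```

Because the softmax is taken over negative distances, a query close to a prototype receives a probability near 1 for that class.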
The prototypical network identifies infected pixels across the entire image through pixel-level classification, but it does not provide the exact location or canopy edge of the infected trees, i.e., it cannot directly detect PWD-infected individual trees. This study therefore uses eCognition's object-oriented multi-scale segmentation algorithm for canopy extraction from imagery, constructing an integrated method that combines a prototypical network classification model with individual tree segmentation for PWD-infected individual tree identification.
2.4.4. Experimental Design
First, the experiment used Mask R-CNN, an instance segmentation model based on deep learning, with RGB synthetic imagery as input to recognize infected individual trees at different PWD infection stages. Second, the hyperspectral full-band and screened-band data were separately input into the improved Mask R-CNN, and the results were compared with those of the RGB dataset to explore the effect of including hyperspectral information on the identification of PWD-infected individual trees and to verify the role of sensitive bands in improving the accuracy of early PWD identification. ResNet-50 was used as the backbone feature extraction network, and the pre-trained ResNet-50 model was fine-tuned to reduce training time and improve accuracy. In addition, we optimized the anchor box parameters. The batch size was set to 1 and the number of epochs to 300. A polynomial learning rate schedule was used, decaying at a fixed rate from an initial learning rate of 0.001 to a final learning rate of 0.0001. SGD was used as the optimizer, with momentum and weight decay set to 0.9 and 0.0001, respectively.
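The optimizer and polynomial learning rate schedule described above can be sketched in PyTorch; since the decay power is not stated in the text, a linear decay (power = 1.0) is assumed, and the model is a placeholder:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder standing in for the network

# SGD with the momentum and weight decay reported in the text
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)

epochs = 300

def poly_lr(epoch, base=1e-3, end=1e-4, power=1.0):
    """Polynomial decay from 0.001 to 0.0001 over 300 epochs
    (power=1.0 gives linear decay; the exact power is an assumption)."""
    return end + (base - end) * (1 - epoch / epochs) ** power

# LambdaLR expects a multiplier on the base learning rate
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda e: poly_lr(e) / 1e-3)

print(poly_lr(0), poly_lr(epochs))  # ~0.001 at the start, ~0.0001 at the end
```

In a training loop, `scheduler.step()` would be called once per epoch after `optimizer.step()`.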
Afterwards, the UAV hyperspectral imagery (3759 × 4061 × 270) and the synthetic imagery of the preferred feature bands (3759 × 4061 × 8) were used as data sources for the prototypical network classification. Based on the ground survey data, the mean north–south canopy width of Masson pine in the sample plots of the study area is 2.6 m, the mean east–west canopy width is 2.5 m, and the overall mean canopy width is 2.55 m. Therefore, given the imagery resolution of 0.05 m, a sample window size of 51 pixels was adopted. The sample data cropped to a final window size of 51 × 51 were used as input to the prototypical network.
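The window cropping step can be sketched as follows; the cube dimensions are shrunk from the real 3759 × 4061 × 270 imagery for illustration, and non-overlapping windows are assumed:

```python
import numpy as np

# Hypothetical hyperspectral cube (rows, cols, bands), much smaller than
# the real 3759 x 4061 x 270 scene so the example runs instantly
cube = np.random.rand(204, 204, 8)

S = 51  # window size: mean crown width 2.55 m / 0.05 m resolution = 51 px
windows = [cube[r:r + S, c:c + S, :]
           for r in range(0, cube.shape[0] - S + 1, S)
           for c in range(0, cube.shape[1] - S + 1, S)]
print(len(windows), windows[0].shape)  # 16 (51, 51, 8)
```

Each 51 × 51 × bands patch then becomes one input sample for the prototypical network.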
The prototypical network structure used in this study had a convolutional layer output space dimension of 64 (F) and a convolutional kernel of size 3 × 3. The max pooling layer had a pooling kernel of size 2 × 2. The same embedding function was used for the support and query sets, whose embeddings served as input for calculating loss and accuracy. The model was trained with Adam-SGD at an initial learning rate of 0.0001, halved every 2000 training iterations. The Euclidean distance was used as the metric function, and the negative log-likelihood was chosen as the loss function for training the prototypical network. We adjusted the number of epochs/iterations to 60/100 so that the training accuracy was as close to 1 as possible.
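A sketch of this training objective, combining the Euclidean metric with the negative log-likelihood loss, is shown below; the function and variable names are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def proto_loss(support_emb, support_y, query_emb, query_y, n_classes):
    """Prototypical-network loss: class prototypes are the means of the
    support embeddings, the metric is squared Euclidean distance, and the
    negative log-likelihood of the softmax over negative distances is
    minimized. Returns (loss, accuracy) for one episode."""
    protos = torch.stack([support_emb[support_y == c].mean(dim=0)
                          for c in range(n_classes)])
    d = torch.cdist(query_emb, protos) ** 2
    log_p = F.log_softmax(-d, dim=1)
    loss = F.nll_loss(log_p, query_y)
    acc = (log_p.argmax(dim=1) == query_y).float().mean()
    return loss, acc

# The learning-rate halving every 2000 iterations could be expressed as:
# scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=2000, gamma=0.5)
```

Each training episode would draw a support and a query set, embed both with the shared embedding function, and backpropagate this loss.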
Table 8 lists the specific parameters of the prototypical network structure.
For canopy segmentation, this study used eCognition's object-oriented multi-scale segmentation technique for canopy extraction from imagery. eCognition (Version 9.0, Trimble, Sunnyvale, CA, USA) is remote sensing imagery analysis software based on object-oriented analysis that can segment and classify objects in images. After several trials, the best canopy segmentation results were obtained when the scale parameter was set to 51, the shape parameter to 0.8, and the compactness to 0.9. We overlaid the prototypical network classification results with the individual tree canopy segmentation results to obtain the final identification results of individual trees at the different PWD infection stages.
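One simple way to realize this overlay is a per-segment majority vote over the pixel-level classes; the tiny arrays below stand in for the eCognition segment labels and the prototypical network's classification map:

```python
import numpy as np

def majority_vote(class_map, segment_ids):
    """Assign each segmented crown the most frequent pixel class inside it,
    fusing the pixel-level classification with the crown segments."""
    out = np.zeros_like(class_map)
    for seg in np.unique(segment_ids):
        mask = segment_ids == seg
        classes, counts = np.unique(class_map[mask], return_counts=True)
        out[mask] = classes[counts.argmax()]
    return out

# 3 x 3 toy example: three segments (crowns) over a pixel classification
class_map = np.array([[0, 1, 1], [0, 1, 1], [2, 2, 2]])
segments  = np.array([[1, 1, 2], [1, 2, 2], [3, 3, 3]])
print(majority_vote(class_map, segments))
```

Each crown ends up with a single infection-stage label, which is the per-tree result the integrated framework reports.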
We completed this study with the support of ArcGIS (Version 10.2, Esri, Redlands, CA, USA), ENVI (Version 5.3, Exelis Visual Information Solutions, Boulder, CO, USA), eCognition (Version 9.0, Trimble, Sunnyvale, CA, USA), Labelme (Version 5.0.5, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA), and Python programming based on the PyTorch framework. Windows 10 Professional and PyCharm were chosen as the software runtime platform. The hardware platform was selected with appropriate GPU, CPU, and memory according to the computational load: (1) GPU: NVIDIA GeForce RTX 3090; (2) CPU: Intel(R) Xeon(R) W-2275 CPU @ 3.30 GHz; (3) Memory: 128 GB of RAM.