1. Introduction
Forestry resources are one of China’s most important natural resources, and the control of forest pests and diseases has always been a crucial task in the field of forestry ecology. Pine wilt disease (PWD), also known as pine wood nematode (
Bursaphelenchus xylophilus) disease, is referred to as the cancer of pine trees. It is one of the most destructive and dangerous forest pests and diseases; it was first discovered in the United States and later spread to various regions of the world [
1]. In China, the first case of pine wood nematode infection was detected in Jiangsu Province in 1982; since then, PWD has rapidly spread across the country [
2]. According to a 2024 announcement by the National Forestry and Grassland Administration of China, PWD has spread to 664 county-level epidemic areas in 18 provinces (including municipalities and autonomous regions). The economic losses caused by this disease are enormous and have caused significant damage to China’s environmental protection work and forestry resources.
PWD is characterized by a wide transmission range, rapid spread, and high mortality, and it is difficult and costly to control. An infected tree can die within as little as 40 days, and an entire pine forest can be infested within only 3 to 5 years. The optimal strategy for controlling the disease is early detection and early intervention, aiming to manage the disease at the initial stages of tree infection [
3]. However, traditional survey methods rely mainly on manual surveys, which are time-consuming and costly and make it difficult to quickly grasp the dynamics of PWD occurrence in an epidemic area. Consequently, traditional survey methods fail to meet the needs of regular pest and disease monitoring [
4]. Fortunately, remote sensing technology offers wide coverage, high temporal resolution, and short revisit cycles, providing a robust technical foundation for timely monitoring of PWD. Currently, two main approaches are used to monitor forest pests and diseases with remote sensing: satellite remote sensing and unmanned aerial vehicle (UAV) remote sensing. Among these, numerous studies have focused on the application of high-resolution satellite images in monitoring forest pests and diseases, such as WorldView [
5], QuickBird [
6], and GeoEye-1 [
7]. However, the application of satellite images still has several limitations: optical satellite images are easily affected by cloudy weather, atmospheric conditions, and spatial resolution constraints, while radar satellite images are susceptible to interference from mountainous terrain. Additionally, due to revisit-cycle limitations, satellite images cannot always provide the timely data needed for epidemic areas. In contrast, UAV images overcome these limitations and offer higher spatial resolution, giving them greater potential for identifying PWD-infected trees. Currently, scholars have conducted many studies applying UAV remote sensing technology to detect PWD [
8], such as multispectral [
9], hyperspectral [
10], and LiDAR [
11].
In addition to the original bands of remote sensing images, indices derived from the original bands of remote sensing images are commonly used features for detecting forest pests and diseases. Numerous studies have demonstrated that commonly used remote sensing vegetation indices can effectively detect diseased trees, such as the normalized difference vegetation index (NDVI) [
12] and enhanced vegetation index (EVI) [
13]. Additionally, some studies have focused on the pathogenesis of PWD to develop specific indices targeting PWD and have achieved promising results, such as single red-edge [
14], green normalized difference vegetation index (GNDVI) [
15,
16], Green-Red Spectral Area Index (GRSAI) [
17], Green to Red Region Spectral anGle Index (GRRSGI) [
18], and Multiple Ratio Disease-Water Stress Indices (MR-DSWIs) [
19]. Besides vegetation indices, the use of texture features has also been proven to significantly improve the accuracy of identifying diseased trees [
20].
In recent years, due to the rapid development of machine-learning and computer vision technologies, numerous scholars have conducted studies on integrating UAV remote sensing technology with machine-learning methods to identify diseased trees. Some of the machine-learning methods used include random forest [
11], support vector machines [
21], and spatiotemporal change detection [
22]. With the rise of deep learning methods, an increasing number of studies have focused on the integration of deep learning and remote sensing and have demonstrated that deep learning techniques can be effectively applied to the detection and extraction of diseased trees [
23,
24,
25]. For applications of deep learning to disease monitoring, two types of techniques have been widely used: (1) Object detection, which involves improving the performance of existing object detection models such as the you only look once (YOLO) series, including modifying the backbone architecture of the YOLOv3 model [
26], making lightweight improvements to the YOLOv4 model to obtain the YOLOv4-Tiny-3Layers model [
27], combining the YOLOv5 model with spatial pyramid pooling and focus mechanism to create the YOLO-PWD model [
28], integrating UAV images with Sentinel-2 satellite data to develop the YOLOv5-PWD model [
29], and combining the YOLOv8 model with attention mechanisms for an improved model [
30]. Apart from the YOLO series models, the Faster R-CNN model has also been proven to perform well in the field of diseased tree detection [
31]; (2) Semantic segmentation technique. Few studies have focused on the applications of such techniques in diseased tree detection, such as the novel segmentation model SCANet [
32] and the semi-supervised semantic segmentation model [
33]. However, compared with object detection, the application of semantic segmentation techniques in PWD detection remains limited; in particular, few attempts have been made to combine semantic segmentation models with UAV images for this purpose.
This study compared three classic semantic segmentation models combined with different feature schemes to detect individual PWD-infected trees. Specifically, this study aims to achieve the following three objectives: (1) compare the effects of the red–green–blue (RGB), hue saturation value (HSV), and RGBHSV feature schemes on identification accuracy; (2) compare the performance of the U-Net, DeepLabV3+, and feature pyramid network (FPN) semantic segmentation models in detecting individual PWD-infected trees; (3) compare the performance of these deep learning models with a traditional machine-learning method and the newly proposed segment anything model (SAM) in the detection of individual PWD-infected trees. The study utilizes UAV images and semantic segmentation techniques to rapidly and accurately monitor PWD while directly localizing individual PWD-infected trees. The proposed approach provides an efficient and straightforward solution for the prevention and management of PWD.
3. Methods
3.1. Color Space Model
Common color space models mainly include the RGB color space model and the HSV color space model. The RGB color space model consists of three components: red (R), green (G), and blue (B). It is widely used in the field of computer vision due to its simple representation principle. However, the three color components in the RGB color space model exhibit a certain degree of correlation, which may lead to information redundancy and potentially affect classification accuracy.
The HSV color space model consists of three components: hue (H), saturation (S), and value (V). Previous studies have shown that the HSV color space model has advantages over the RGB color space model in the field of remote sensing digital image processing. For instance, converting an RGB color space model to an HSV color space model could improve classification accuracy [
34]. For an RGB image with pixel values ranging from 0 to 255, the process for calculating HSV is as follows [
35]:
First, the RGB values are normalized to the range [0, 1]:

$R' = R/255, \quad G' = G/255, \quad B' = B/255$

Second, value (V), saturation (S), and hue (H) are calculated:

$V = C_{\max}$

$S = \begin{cases} 0, & C_{\max} = 0 \\ \Delta / C_{\max}, & C_{\max} \neq 0 \end{cases}$

$H = \begin{cases} 0^{\circ}, & \Delta = 0 \\ 60^{\circ} \times \left( \dfrac{G' - B'}{\Delta} \bmod 6 \right), & C_{\max} = R' \\ 60^{\circ} \times \left( \dfrac{B' - R'}{\Delta} + 2 \right), & C_{\max} = G' \\ 60^{\circ} \times \left( \dfrac{R' - G'}{\Delta} + 4 \right), & C_{\max} = B' \end{cases}$

where $C_{\max}$ is the maximum value of the normalized RGB channels, $C_{\min}$ is the minimum value, and $\Delta$ is the difference between $C_{\max}$ and $C_{\min}$:

$C_{\max} = \max(R', G', B'), \quad C_{\min} = \min(R', G', B'), \quad \Delta = C_{\max} - C_{\min}$
To assess the impact of input features on classification accuracy, we tested three input feature schemes, including RGB, HSV, and RGBHSV. After converting the RGB images to HSV images, we stacked the RGB and HSV images to create a six-band RGBHSV image. The NumPy and OpenCV libraries in Python were adopted to perform the transformation and overlay processing.
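The conversion and stacking steps can be sketched in NumPy alone; the helper name below is ours, and the study itself used OpenCV's `cv2.cvtColor` (which additionally rescales H to [0, 180] for 8-bit images, a detail this sketch does not replicate):

```python
import numpy as np

def rgb_to_hsv(rgb):
    """Convert an (..., 3) uint8 RGB array to HSV following the standard
    piecewise formulas: H in degrees [0, 360), S and V in [0, 1]."""
    rgb = rgb.astype(np.float64) / 255.0           # normalize to [0, 1]
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cmax = rgb.max(axis=-1)                        # C_max
    cmin = rgb.min(axis=-1)                        # C_min
    delta = cmax - cmin                            # Delta

    v = cmax                                       # value
    s = np.where(cmax > 0, delta / np.where(cmax > 0, cmax, 1), 0.0)

    # Hue: piecewise by which channel attains the maximum; 0 where delta == 0.
    safe = np.where(delta > 0, delta, 1)           # avoid division by zero
    h = np.zeros_like(cmax)
    h = np.where(cmax == r, (60 * ((g - b) / safe)) % 360, h)
    h = np.where(cmax == g, 60 * ((b - r) / safe) + 120, h)
    h = np.where(cmax == b, 60 * ((r - g) / safe) + 240, h)
    h = np.where(delta == 0, 0.0, h)
    return np.stack([h, s, v], axis=-1)

# Two stand-in pixels: pure red and medium green.
rgb = np.array([[[255, 0, 0], [0, 128, 0]]], dtype=np.uint8)
hsv = rgb_to_hsv(rgb)
# Stack along the channel axis to form the six-band RGBHSV input.
rgbhsv = np.concatenate([rgb.astype(np.float64) / 255.0, hsv], axis=-1)
assert rgbhsv.shape == (1, 2, 6)
```

For pure red the sketch yields H = 0°, S = 1, V = 1, matching the piecewise definition above.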
3.2. Semantic Segmentation Model
Compared to traditional machine-learning algorithms, convolutional neural network (CNN) techniques can determine the target variable (e.g., PWD-infected trees) through operations such as convolution and fully connected layers, thereby improving classification accuracy. However, this technology still cannot provide the exact location information of the PWD-infected trees, necessitating manual visual interpretation in the later stages [
36]. The fully convolutional network (FCN) proposed in 2015 not only can accurately provide health diagnostic information on the PWD-infected trees but also can provide their precise location information [
37]. Therefore, this study employed and compared three advanced semantic segmentation models to identify the PWD-infected trees, including U-Net, FPN, and DeepLabV3+. The input layers of these models were extended to accept either three or six feature dimensions, allowing for a comparative analysis of the results produced by these models.
3.2.1. U-Net Model
The U-Net model is a deep learning model designed for image segmentation and is widely used for medical image segmentation [
38]. Subsequently, numerous scholars have introduced the U-Net model into the field of remote sensing combined with remote sensing image segmentation techniques to achieve significant advancements, such as using improved U-Net models for image segmentation [
39].
The U-Net model is a CNN-based model that consists of a symmetrical encoder and decoder. A complete U-Net model generally comprises three parts: the encoder, the decoder, and skip connections. The encoder is composed of a series of convolutional layers and pooling layers, which are responsible for capturing context and extracting features from the input images. The encoder progressively reduces the spatial dimensions while increasing the depth of the feature maps, downsampling the input image to a smaller feature map and extracting high-level semantic information. The decoder consists of a series of upsampling and convolutional layers. It upsamples the feature maps generated by the encoder back to the original resolution and merges them with the features from the corresponding encoder layers to progressively reconstruct the original-resolution image. In this process, U-Net uses skip connections to directly connect corresponding layers of the encoder and decoder. These connections enable the model to maintain fine-grained spatial information and ensure that the decoder is aware of both low-level and high-level features, yielding high-precision segmentation outcomes. Finally, U-Net employs a convolutional layer in the last layer of the decoder to conduct pixel-level classification and produce the final segmentation results. Taking a six-band image input as an example, the U-Net model structure is illustrated in
Figure 4.
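The encoder–decoder–skip structure described above can be sketched as a toy two-level U-Net in PyTorch; this is illustrative only and not the ResNet-152-backboned configuration used in this study, and all layer sizes are our own choices:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Two 3x3 convolutions with ReLU, as in a standard U-Net stage.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Two-level U-Net sketch: encoder, decoder, and one skip connection."""
    def __init__(self, in_ch=6, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.pool = nn.MaxPool2d(2)                        # encoder downsampling
        self.enc2 = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # decoder upsampling
        self.dec1 = conv_block(64, 32)                     # 64 = 32 (skip) + 32 (up)
        self.head = nn.Conv2d(32, n_classes, 1)            # pixel-wise classifier

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        # Skip connection: concatenate the encoder feature with the upsampled one.
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)

model = MiniUNet(in_ch=6, n_classes=2)
out = model(torch.zeros(1, 6, 64, 64))   # six-band RGBHSV-style input
assert out.shape == (1, 2, 64, 64)       # per-pixel class logits
```

The output retains the input's spatial resolution, which is what makes the architecture suitable for pixel-level classification.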
3.2.2. DeepLabv3+ Model
The DeepLabv3+ model, proposed by the Google Brain team in 2018, is the latest version in the DeepLab series [
40]. It is a further improvement and optimization of the DeepLabv3 model and is also commonly used in remote sensing image classification tasks. Compared to the DeepLabv3 model, the DeepLabv3+ model offers higher segmentation accuracy and efficiency. Previous studies further improved the DeepLabv3+ model by building the semantic segmentation decoder on the DeepLabv3+ architecture and obtained high accuracy while reducing computational costs and processing time [
41].
Due to the strided pooling and convolution operations within the DeepLabv3 model's backbone network, the model struggles to capture detailed information related to object boundaries. The DeepLabv3+ model uses DeepLabv3 as the encoder and adds a decoder module to recover boundary information. Moreover, by combining the atrous spatial pyramid pooling module with the encoder–decoder structure, the DeepLabv3+ model produces more refined segmentation results that not only contain rich semantic information but also restore boundary details. The DeepLabv3+ model structure is illustrated in
Figure 5.
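The atrous (dilated) convolutions underlying the spatial pyramid pooling can be illustrated with a toy parallel branch stack; the channel counts and dilation rates below are illustrative, not the model's actual configuration:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate r has the same 9 weights per channel
# pair but covers a (2r + 1) x (2r + 1) receptive field, enlarging context
# without downsampling. ASPP applies several rates in parallel and
# concatenates the results.
x = torch.zeros(1, 8, 64, 64)
rates = [1, 6, 12]  # example rates; DeepLab papers commonly use 1, 6, 12, 18
branches = nn.ModuleList(
    nn.Conv2d(8, 16, 3, padding=r, dilation=r) for r in rates
)
pyramid = torch.cat([b(x) for b in branches], dim=1)  # ASPP-style concat
assert pyramid.shape == (1, 48, 64, 64)  # spatial size preserved at every rate
```

Setting `padding=r` with `dilation=r` keeps the spatial size constant, which is what lets the branches be concatenated channel-wise.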
3.2.3. FPN Model
The FPN model constructs feature pyramids through cross-layer connections and a top-down feature pyramid structure. It simultaneously retains the semantic information of high-level features and the spatial information of low-level features [
42]. FPN consists of two pathways: a bottom-up pathway and a top-down pathway. The bottom-up pathway extracts features through convolution operations, during which the spatial resolution continuously decreases while the semantic level of the features increases. The top-down pathway restores spatial resolution through upsampling operations. Like the U-Net model, FPN fuses feature maps from different levels through lateral connections between the bottom-up and top-down pathways to obtain richer, multi-scale feature representations. Owing to its ability to capture contextual information and increase feature map resolution, FPN obtains more useful information about small objects and is highly effective in small-object segmentation. Additionally, the FPN model can adaptively construct feature pyramids without requiring scale transformations of the images, thus reducing computational load and time costs while improving segmentation accuracy. The FPN model structure is illustrated in
Figure 6.
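The lateral connections and top-down fusion can be sketched in a few lines; the stage shapes and the unified channel width are illustrative stand-ins for actual backbone outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Bottom-up feature maps (e.g., two backbone stages) at decreasing resolution.
c3 = torch.zeros(1, 64, 32, 32)
c4 = torch.zeros(1, 128, 16, 16)

# 1x1 lateral connections unify the channel depth across levels.
lat3 = nn.Conv2d(64, 32, 1)
lat4 = nn.Conv2d(128, 32, 1)

p4 = lat4(c4)
# Top-down pathway: upsample the coarser map and add the lateral feature.
p3 = lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
assert p3.shape == (1, 32, 32, 32) and p4.shape == (1, 32, 16, 16)
```

The resulting {p3, p4} pyramid carries high-level semantics at every resolution, which is the property the paragraph above attributes to FPN.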
3.2.4. Settings for Training Framework
The PyTorch framework was used as the training framework with Compute Unified Device Architecture (CUDA) version 12.4. All three semantic segmentation models used ResNet-152 as the backbone network. During training, the batch size was set to 4, the maximum number of iterations to 50, and the learning rate to 0.0001. Dice loss was used as the loss function, and Adam was used as the optimizer. The computer hardware included an NVIDIA GeForce RTX 4060 Ti graphics card with 16 GB of memory.
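A minimal sketch of this training setup is shown below, with a stand-in one-layer model in place of the actual ResNet-152-backboned networks; the soft-Dice implementation is a common variant, not necessarily the exact one used in the study:

```python
import torch
import torch.nn as nn

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss for binary masks; target is 0/1 with shape (N, 1, H, W)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

# Stand-in model accepting the six RGBHSV bands.
model = nn.Conv2d(6, 1, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate from the paper

x = torch.rand(4, 6, 32, 32)                  # batch size 4, as in the paper
y = (torch.rand(4, 1, 32, 32) > 0.5).float()  # random stand-in masks
for _ in range(2):                            # the study trained for 50 iterations
    opt.zero_grad()
    loss = dice_loss(model(x), y)
    loss.backward()
    opt.step()
assert 0.0 <= loss.item() <= 1.0  # soft Dice is bounded in [0, 1]
```

Dice loss is a common choice here because PWD-infected crowns occupy a small fraction of each image, and Dice is less sensitive to that class imbalance than plain cross-entropy.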
3.3. Random Forest Model
This study selected the random forest (RF) model, a commonly used machine-learning model typically applied to classification and regression problems [
43], to compare with the model performance of the semantic segmentation model. Random forest uses a combination of multiple decision trees to make predictions and decisions. These decision trees can be seen as different classifiers, with each decision tree generated independently, making random forest better adapted to complex data and nonlinear relationships. Due to its excellent generalization ability and resistance to overfitting, random forest is often used to handle large and high-dimensional datasets [
44].
This study implemented the random forest model using the Google Earth Engine platform. Two important parameters needed to be set for the random forest model: the number of trees and mtry. The former is the number of decision trees, and the latter is the number of variables considered at each split; they were set to 500 and the square root of the number of variables, respectively. A total of 11,504 random points were selected based on UAV images for training and validation, including 5518 PWD-infected tree points and 5986 non-diseased tree points. The training was conducted using the six RGBHSV features of these random points. The classification results of the random forest model were then compared with those of the best of the three semantic segmentation models.
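Although the study ran on Google Earth Engine, the same parameter settings can be mirrored in scikit-learn; the data below are synthetic stand-ins for the labeled random points, with a toy labeling rule in place of real field labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in samples: rows are points with six RGBHSV features; label 1 = infected.
X = rng.random((200, 6))
y = (X[:, 0] > 0.5).astype(int)  # toy rule standing in for field-verified labels

# Mirrors the paper's settings: 500 trees, mtry = sqrt(number of variables).
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)
acc = rf.score(X, y)  # training accuracy on the stand-in data
assert acc > 0.9
```

In practice the 11,504 points would be split into training and validation subsets rather than scored on the training data as in this toy sketch.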
3.4. SAM Model
The segment anything model (SAM) is a semantic segmentation model developed by Meta [
45]. SAM produces high-quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks and has strong zero-shot performance on a variety of segmentation tasks. Since the SAM model can only accept three-channel inputs [
46], we fine-tuned the SAM model using both the RGB and HSV images rather than the RGBHSV images.
In this study, we froze all VisionEncoder and PromptEncoder layers. During the training process, bounding boxes were used as prompts, while during the inference process, random points were used for prediction.
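The freezing step can be sketched with stand-in modules; the real SAM exposes its image and prompt encoders under implementation-specific attribute names, so the names below are illustrative:

```python
import torch.nn as nn

# Stand-in modules; the real SAM has an image encoder, a prompt encoder,
# and a mask decoder (attribute names vary by implementation).
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(8, 8),
    "prompt_encoder": nn.Linear(8, 8),
    "mask_decoder": nn.Linear(8, 8),
})

# Freeze both encoders so only the mask decoder receives gradient updates.
for name in ("vision_encoder", "prompt_encoder"):
    for p in model[name].parameters():
        p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
assert all(n.startswith("mask_decoder") for n in trainable)
```

Passing only `filter(lambda p: p.requires_grad, model.parameters())` to the optimizer then restricts fine-tuning to the decoder, which is what freezing the encoder layers accomplishes.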
3.5. Accuracy Assessment
This study selected three indexes to evaluate the classification accuracies of the three semantic segmentation models, including Precision, F1-score, and intersection over union (IoU) [
47]. The formulas for calculating these indexes are as follows:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$

$F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

$IoU = \dfrac{TP}{TP + FP + FN}$

where TP (true positives) is the number of pixels correctly predicted by the model as positives (PWD-infected trees); FP (false positives) is the number of pixels predicted as positives that are actually negatives; TN (true negatives) is the number of pixels correctly predicted as negatives; and FN (false negatives) is the number of pixels predicted as negatives that are actually positives. In this study, positives refer to PWD-infected tree pixels, and negatives refer to non-diseased tree pixels. Recall is calculated as follows:

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
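These four metrics can be computed directly from binary masks; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Precision, recall, F1, and IoU for binary masks (1 = PWD-infected)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # predicted positive, actually positive
    fp = np.sum(pred & ~truth)   # predicted positive, actually negative
    fn = np.sum(~pred & truth)   # predicted negative, actually positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
p, r, f1, iou = segmentation_metrics(pred, truth)
# Here tp = 1, fp = 1, fn = 1, so precision = recall = f1 = 0.5 and IoU = 1/3.
assert abs(iou - 1 / 3) < 1e-9
```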
5. Discussion
5.1. The Optimal Input Feature Scheme
When assessing the performance of the three input feature schemes, it is necessary to consider not only the models’ performance but also the practical applications of these feature schemes in image processing and computer vision. Although the RGB feature scheme is highly intuitive and widely used, it has poor feature separation. In contrast, the HSV feature scheme offers better feature independence and is more suitable for color segmentation. It is commonly used for land cover classification in the field of remote sensing [
48]. The RGBHSV feature scheme comprehensively utilizes information from both the RGB and HSV color spaces and fully considers the contributions of color and brightness information to the image content, allowing it to more accurately capture object features in various scenarios. In the task of detecting PWD-infected tree samples, it indeed demonstrates the highest performance compared with the RGB and HSV feature schemes. The superiority of the RGBHSV input feature scheme lies in its relatively lower rate of misclassification, meaning that the model could more accurately distinguish different categories of objects, which is crucial for object detection tasks [
49] and image classification [
50].
In previous studies, multispectral and hyperspectral UAV images have been more widely used than UAV RGB images because of their rich feature information and higher identification accuracy. However, multispectral and hyperspectral UAV images are costly and difficult to obtain. As a result, lower-cost UAV RGB images are a complementary choice for detecting diseased trees. Unlike previous studies, which typically utilized only a single color space model [
51], this study explores color space information by combining information from different color space models as model input. Experimental results show that the synthesized color space model (i.e., the RGBHSV input feature scheme) achieves higher identification accuracy and lower false detection rates than a single color space model. The superiority of the RGBHSV input feature scheme lies not only in the enhancement of model performance but also in its ability to meet the demands of various scenarios, offering better versatility and flexibility. This finding is validated both theoretically and practically, providing an important reference for the application of UAV RGB images in the detection of PWD-infected trees.
5.2. The Optimal Semantic Segmentation Model
When selecting a suitable semantic segmentation model for a specific object detection task, it is important to consider factors including model architecture, feature extraction capability, and the ability to handle multi-scale features. Among the three semantic segmentation models used in this study, the unique structure of the U-Net model allows it to perform exceptionally well in many scenarios, which could be attributed to the encoder–decoder structure [
52,
53]. The encoder part captures the global features of the image through multi-level feature extraction, while the decoder part effectively reconstructs these features into pixel-level prediction results, thus preserving rich detail information. This makes the U-Net model highly advantageous for performing object detection tasks that require precise segmentation of targets. The ability to maintain detailed spatial information and context throughout the process enables the U-Net model to achieve high accuracy in segmenting complex images [
54].
The FPN and DeepLabv3+ models also have their advantages in specific scenarios. The FPN model primarily addresses multi-scale issues by using a top-down feature fusion mechanism and effectively deals with targets at different scales. This makes the FPN model perform well in tasks such as object detection and instance segmentation, especially when dealing with large-scale scenes or targets of varying scales [
55]. The DeepLabv3+ model expands the receptive field of the model by incorporating techniques such as atrous (dilated) convolutions and spatial pyramid pooling. This enables it to better handle segmentation tasks in large-scale scenes and large-scale images [
56]. However, for tasks that require fine segmentation or the preservation of details, the U-Net model generally outperforms the DeepLabv3+ and FPN models [
47]. The results of the comparison experiments of this study confirm this characteristic. In this study, PWD-infected trees usually occupy small areas in the image, which requires a high degree of fine extraction by the segmentation model. Fortunately, the U-Net model meets these requirements well, which is why the performance of the U-Net model is higher than that of the DeepLabv3+ and FPN models.
Additionally, it is worth noting that the differences between the random forest model and the U-Net model are quite pronounced and that the U-Net model is superior to the random forest model, which has also been observed in previous studies [
53,
57]. This is mainly because the random forest model essentially performs pixel-by-pixel classification and lacks the semantic information of surrounding pixels; unable to capture the context of neighboring pixels, it performs more poorly. The advantage of deep learning methods over traditional machine-learning methods in image classification tasks lies in their ability to fully utilize the correlations between pixels, whereas traditional machine-learning methods often rely on pixel-based classification and neglect the spatial relationships between pixels. This results in poor performance for traditional machine-learning methods when dealing with salt-and-pepper noise or local discontinuities in the image. By comparison, deep learning methods can better capture the semantic information between pixels through operations such as convolution and pooling, improving robustness to noise and local discontinuities and yielding better performance in semantic segmentation tasks in complex scenes.
The SAM model has achieved significant success in the field of computer vision and demonstrated notable potential in the field of remote sensing [
46,
58], which may benefit from one of SAM's major advantages: its powerful recognition ability derived from pre-training on the SA-1B dataset. However, the SAM model performed the worst in this study. A probable reason is that most of the images in SA-1B were taken by smartphones or cameras with fine spatial resolution, while the image data in this study were captured by drones at a coarser spatial resolution. This difference in spatial resolution may be one reason for SAM's poor performance. Additionally, most of the SA-1B data consist of RGB images, making the SAM model better adapted to RGB imagery. This also explains why the SAM model achieved relatively higher accuracy in recognizing RGB images than HSV images.
5.3. Error Source of the Semantic Segmentation Model
The sources of error in identifying PWD-infected trees using the semantic segmentation models primarily encompass two aspects: missed detections and false positives. Regarding missed detections, the UAV RGB images collected for this study were captured in November 2021, when most PWD-infected trees were in the middle to late stages of infection, while a small number were still in the early stages. Given the significant differences in physiological condition and morphological characteristics between early-stage and middle-to-late-stage PWD-infected trees, the presence of trees at different infection stages within the study area can lead the method to miss many early-stage or middle-to-late-stage trees, depending on the features learned by the semantic segmentation model during training. To address this issue, early-stage and middle-to-late-stage PWD-infected trees can be sampled separately and treated as distinct classes in a multi-class classification. Regarding false positives, they are primarily influenced by local features in the RGB images. If the detection area contains a large amount of land cover with RGB digital values similar to those of PWD-infected trees, misclassifications may occur. For example, middle-to-late-stage PWD-infected trees often exhibit reddish-brown colors in RGB images, similar to the colors of some building surfaces; this similarity may cause the model to mistakenly classify buildings as PWD-infected trees. To address this issue, new feature information, such as near-infrared bands and vegetation indices derived from satellite images, could be introduced to help differentiate buildings from trees and reduce false positives.
5.4. Model Generalization
The prediction results of different models are shown in
Figure 11. It was found that the predictive abilities of the U-Net, DeepLabv3+, and FPN models were similar, with the number of predicted suspected PWD-infected trees ranging from 650 to 750. When using the RGBHSV feature scheme, aside from the confirmed PWD-infected trees, the U-Net, DeepLabv3+, and FPN models identified 206, 174, and 178 more suspected PWD-infected trees, respectively. There were no significant misclassifications, but a few omissions were present. These suspected diseased trees cannot be distinguished from the drone images alone and require field investigations for confirmation. When using the HSV feature scheme, these three models identified 209, 148, and 137 more suspected PWD-infected trees, respectively, with no significant misclassifications and a few omissions. When using the RGB feature scheme, these three models identified 166, 186, and 185 more suspected PWD-infected trees, respectively. Unlike the previous two feature schemes, the RGB feature scheme presented noticeable misclassification issues, which were mainly due to serious confusion between some red roofs and PWD-infected trees showing reddish-brown colors, along with a few omissions. Generally, all three models have excellent generalization capabilities and can achieve high detection accuracy of PWD-infected trees at a large scale.
In contrast, the RF model detected a total of 1043 suspected PWD-infected trees, of which only 526 were confirmed diseased trees. Moreover, there were many misclassifications, such as incorrectly classifying red roofs, mountain shadows, and other discolored trees as diseased trees. The SAM model using the RGB feature scheme and the HSV feature scheme detected 862 and 1041 suspected PWD-infected trees, respectively, of which only 458 and 402 were confirmed PWD-infected trees. Similar to the RF model, the SAM model also exhibited a large number of misclassifications, and the omission of confirmed PWD-infected trees was severe, with 103 and 159 confirmed PWD-infected trees missed, respectively. Compared to the three classical segmentation models (i.e., the U-Net, DeepLabv3+, and FPN models), both the RF and SAM models exhibited much poorer generalization capability.