1. Introduction
Pine wilt disease (PWD), caused by the pine wood nematode, is a forest disease characterized by high pathogenicity, rapid spread, and multiple transmission pathways, and it has caused severe damage to pine forest resources in China [1]. The disease has been classified as a quarantine pest in more than 40 countries, and China has suffered substantial direct economic losses and depletion of ecological service value [2]. Owing to the extensive forested area and the high cost and limited coverage of manual inspection and monitoring, efficient, cost-effective, and accurate monitoring techniques are needed. In recent years, the advancement of UAV (Unmanned Aerial Vehicle) remote sensing technology has demonstrated significant potential for monitoring pine wood nematode disease, leveraging its ease of operation, adaptability, extensive coverage, and real-time capability [3].
The use of UAV remote sensing for monitoring pine wilt outbreaks has evolved significantly over the past few decades. Traditional machine learning algorithms, such as SVM (Support Vector Machine), RF (Random Forest), and ANNs (Artificial Neural Networks), have been developed and optimized by integrating spectral and spatial features, and have been successfully employed to identify damaged trees in Multi-Spectral Imagery (MSI) and Hyper-Spectral Imagery (HSI) datasets. However, classical machine learning requires intricate feature selection and combination, which makes it difficult to exploit in-depth image information [4].
In recent years, with the development of deep learning-based classification and object detection, researchers have gradually applied these techniques to PWD detection [5,6]. For instance, Qin et al. [7] used their proposed SCANet (spatial-context-attention network) to diagnose pine nematode disease in UAV-based MSI datasets, achieving an average overall accuracy of 79.33%. Wu et al. [8] applied Faster R-CNN (Region-CNN) and YOLOv3 to the early diagnosis of infected trees, demonstrating that YOLOv3 is more suitable for PWD detection. Gong et al. [9] identified discolored trees affected by pine wilt using YOLOv5, achieving a mean Average Precision (mAP) of 84.5%. Similarly, Sun et al. [10] used an improved MobileNetv2-YOLOv4 algorithm to identify the abnormal discoloration caused by pine wilt nematode disease, and the improved model achieved a higher detection accuracy of 86.85%.
Although current deep learning methods have achieved promising results in disease detection, real-time detection on UAV platforms still faces great challenges. Variations in UAV flight altitude and speed cause diseased-tree targets to appear at very small and widely varying scales, making detection difficult. In addition, constrained by the computational, storage, and communication resources of the UAV platform, existing deep learning-based methods struggle to balance detection accuracy and speed because of their model complexity.
To address the above problems, this paper takes the YOLOv5 model as the baseline network, redesigns and optimizes the feature extraction network, neck network, and loss function, and proposes Light-ViTeYOLO, a lightweight pine wilt detection method based on Vision Transformer-enhanced YOLO, which improves detection accuracy for PWD while remaining lightweight. The main contributions of this paper are as follows:
A lightweight Multi-Scale Attention module (MSA) is introduced to construct an EfficientViT feature extraction network, which achieves efficient global information extraction and multi-scale learning through efficient hardware operations, reducing network computational complexity;
A Content-Aware Cross-Scale bidirectional fusion neck network (CACSNet) is proposed, which uses the Content-Aware ReAssembly of FEatures (CARAFE) operator to replace the bilinear interpolation in PANet (Path Aggregation Network) for upsampling and applies cross-scale weighting for feature fusion, improving the expression of fine-grained features of diseased trees, preventing the loss of small-target features, and improving detection accuracy;
The loss function is optimized by introducing the EIOU (Efficient Intersection over Union) loss, which helps the model better balance the size and shape information of the target, improving the accuracy and robustness of PWD detection.
3. Materials and Methods
In this paper, we redesigned the feature extraction network based on the baseline network YOLOv5 and propose a lightweight pine wilt detection method based on ViT-enhanced YOLO. Firstly, we constructed a lightweight EfficientViT feature extraction network with the lightweight MSA as its core to replace YOLOv5’s CSPDarkNet53 (DarkNet53 with Cross-Stage Partial connections). Secondly, a content-aware cross-scale feature fusion neck network (CACSNet) was designed, which uses the CARAFE operator to replace the bilinear interpolation in the original model for upsampling and then performs cross-scale feature fusion. Finally, EIOU was introduced to optimize the loss function. The overall architecture of the proposed Light-ViTeYOLO is shown in Figure 2.
Below, we will analyze YOLOv5 and provide a detailed explanation of the proposed Light-ViTeYOLO.
3.1. Baseline Network YOLOv5
The network structure of YOLOv5 can be divided into three parts: the backbone, the neck, and the head output segment, with the specific structure depicted in Figure 3.
After preprocessing, the input image is fed into the backbone feature extraction network, CSPDarkNet53, which applies multiple convolutional operations. This process transforms the image into feature maps and extracts semantic and structural information from the input image. Subsequently, at the neck layer, a PANet feature pyramid is built at varying scales, with each feature map resolution corresponding to a receptive field of a different scale. Finally, YOLOv5 applies NMS (Non-Maximum Suppression) to the predicted detection boxes to produce the final target detection result.
However, YOLOv5’s use of a series of convolutional modules for feature extraction results in a complex network that cannot effectively capture global information. Moreover, the bilinear interpolation used in the neck network cannot exploit the semantic information of the feature maps, and its receptive field is limited to the sub-pixel neighborhood. This design is therefore unsuited to the real-time detection of multi-scale and small targets in PWD detection tasks. In light of this, the following improvements are made.
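To make the post-processing step of the baseline concrete, the following is a minimal greedy NMS sketch in NumPy. It is an illustrative stand-in, not YOLOv5's actual implementation, which additionally handles class labels and confidence thresholds:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """Greedy NMS over (x1, y1, x2, y2) boxes: repeatedly keep the
    highest-scoring box and suppress boxes overlapping it above iou_thr."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thr]  # survivors proceed to the next round
    return keep
```

For example, given two heavily overlapping detections of the same tree and one distant detection, only the higher-scoring overlapping box and the distant box are kept.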
3.2. Redesign of Backbone Feature Extraction Network
An examination of the structure of the Vision Transformer (ViT) reveals that its main computational bottleneck is the softmax attention module, which exhibits quadratic computational complexity with respect to the input resolution. To address this issue, the lightweight Multi-Scale Attention (MSA) module introduced in Section 2.2 is specifically designed to enhance execution speed, delivering a substantial inference speedup while maintaining accuracy.
Based on this, we construct the EfficientViT module with the lightweight MSA as its core and use it to design the feature extraction network in this paper. The redesigned EfficientViT feature extraction model is shown in Figure 4 (left), with the EfficientViT module shown in Figure 4 (right).
The EfficientViT module comprises the lightweight MSA module and the MBConv module [22]. The lightweight MSA module is employed for contextual information extraction, while the MBConv module handles local information extraction. Notably, the linear attention used by the lightweight MSA module is limited in capturing localized details, potentially leading to a notable loss in accuracy. To mitigate this shortcoming, an MBConv module based on depthwise convolution is placed after the MSA to enhance the linear attention. This strategy incurs low computational overhead while significantly strengthening the local feature extraction capability of linear attention.
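To make the complexity advantage concrete, the following NumPy sketch shows ReLU-based linear attention of the kind used in EfficientViT-style lightweight MSA: by aggregating K^T V once, the cost becomes linear in the number of tokens N instead of quadratic. This is a simplified single-head sketch; the actual module adds multi-scale aggregation and depthwise convolutions:

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """O(N) linear attention with a ReLU feature map.
    Q, K: (N, d) queries/keys; V: (N, d_v) values."""
    Qp, Kp = np.maximum(Q, 0), np.maximum(K, 0)   # ReLU kernel feature maps
    kv = Kp.T @ V                                 # (d, d_v), computed once
    out = Qp @ kv                                 # (N, d_v), linear in N
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T   # (N, 1) normalizer
    return out / (norm + eps)
```

The result matches the naive formulation sum_j (ReLU(q_i)·ReLU(k_j)) v_j / sum_j (ReLU(q_i)·ReLU(k_j)), but avoids materializing the N × N attention matrix.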
The EfficientViT model adheres to the standard backbone-head/decoder architecture, reflecting the following design features:
- (1) The backbone network incorporates an input stem and four stages, with the feature map size diminishing and the number of channels escalating;
- (2) Lightweight MSA modules are integrated into Stages 3 and 4;
- (3) For downsampling, the model employs MBConv with a stride of 2.
The outputs of Stage 2, Stage 3, and Stage 4 collectively form a feature map pyramid, which serves as the input for feature fusion in the neck network. The detailed architecture configurations of the EfficientViT variants are shown in Table 1.
Here, C denotes the number of channels, L denotes the number of blocks, H denotes the feature map height, and W denotes the feature map width.
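As a quick sanity check on the halving pattern described above, the helper below computes per-stage feature map sizes. It assumes a stride-2 input stem and one stride-2 downsampling per stage; these exact stride values are our illustrative assumption, not a specification from Table 1:

```python
def stage_shapes(h, w, num_stages=4, stem_stride=2):
    """Feature map (H, W) after each backbone stage, assuming the stem
    downsamples by stem_stride and every stage halves resolution once
    via a stride-2 MBConv."""
    H, W = h // stem_stride, w // stem_stride
    shapes = []
    for _ in range(num_stages):
        H, W = H // 2, W // 2
        shapes.append((H, W))
    return shapes
```

Under these assumptions, a 640 × 640 input yields Stage 2–4 outputs of 80 × 80, 40 × 40, and 20 × 20 (overall strides 8, 16, and 32), the three levels that feed the neck's feature pyramid.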
In this paper, the above-designed EfficientViT model replaces YOLOv5’s feature extraction network CSPDarkNet53. The lightweight MSA design enables efficient hardware execution and accelerated inference, while global awareness and multi-scale learning ensure that performance is not sacrificed, ultimately enabling the proposed model to perform real-time PWD detection.
3.3. Design of CACSNet Neck Networks
YOLOv5 uses PANet as the neck network for feature extraction and fusion, and as a key operation of the feature pyramid, its feature upsampling uses bilinear interpolation. This method cannot exploit the semantic information of the feature map, and its receptive field is limited to the sub-pixel neighborhood. To further improve performance, this paper improves PANet and designs a content-aware cross-scale bidirectional fusion network (CACSNet) as the new neck network. The specific improvements are described as follows.
Firstly, we use the CARAFE [23] operator as the new upsampling kernel to perform the upsampling operations of the neck network (P7_u, P6_u, P5_u, P4_u in Figure 5b), realizing upsampling conditioned on the input content. The implementation consists of two steps: the first predicts a reassembly kernel for each target location based on its content, and the second reassembles the features with the predicted kernel.
Given a feature map X of size C × H × W and an upsampling rate α (α is an integer), CARAFE generates a new feature map X′ of size C × αH × αW. For any target location l′ = (i′, j′) of X′, its corresponding source location in X is l = (i, j), where i = ⌊i′/α⌋ and j = ⌊j′/α⌋. Here, we denote by N(X_l, k) the k × k subregion of X centered at location l, i.e., the neighborhood of X_l.
In the first step, the kernel prediction module ψ predicts a spatially variant kernel W_l′ for each target position l′ based on the neighborhood of X_l, as shown in Equation (3). The second step is the reassembly step shown in Equation (4), where φ is the content-aware reassembly module, which reassembles the neighborhood of X_l with the kernel W_l′:

W_l′ = ψ(N(X_l, k_encoder))  (3)

X′_l′ = φ(N(X_l, k_up), W_l′)  (4)
The weights are generated in a content-aware manner. In addition, multiple sets of such upsampling weights exist for each source location, and feature upsampling is then accomplished by rearranging the generated features into spatial blocks. CARAFE upsampling aggregates and reorganizes contextual information around the target within a large receptive field, which improves the ability to express feature details while introducing little computational overhead.
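The reassembly step can be sketched as follows in NumPy. The kernel prediction module ψ is omitted: the per-location kernels W are taken as given and assumed to be already softmax-normalized. This is an illustrative sketch, not the optimized implementation:

```python
import numpy as np

def carafe_reassemble(X, W, alpha, k):
    """Content-aware reassembly (second CARAFE step).
    X: (C, H, W_in) input features.
    W: (alpha*H, alpha*W_in, k*k) predicted, normalized kernels.
    Returns the upsampled (C, alpha*H, alpha*W_in) feature map."""
    C, H, Win = X.shape
    # edge-pad so every k x k neighborhood is defined
    Xp = np.pad(X, ((0, 0), (k // 2, k // 2), (k // 2, k // 2)), mode="edge")
    out = np.zeros((C, alpha * H, alpha * Win))
    for i2 in range(alpha * H):
        for j2 in range(alpha * Win):
            i, j = i2 // alpha, j2 // alpha            # source location l
            patch = Xp[:, i:i + k, j:j + k].reshape(C, -1)  # neighborhood of X_l
            out[:, i2, j2] = patch @ W[i2, j2]         # weighted reassembly
    return out
```

With a one-hot kernel at the patch center this reduces to nearest-neighbor upsampling; content-dependent kernels instead blend each output pixel from the k × k neighborhood of its source location.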
Furthermore, to prevent the loss of feature information for small targets during feature extraction, this paper incorporates cross-scale weighting for feature fusion in the neck layer (see Figure 5). This is achieved by introducing additional connections (depicted as curved edges in Figure 5) between the feature input nodes from the backbone network and the output nodes of the neck network at the same level. This approach fuses more original image features so as to maximize the retention of features of individual diseased trees.
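The weighted fusion at each output node can be sketched as a fast normalized weighted sum (in the style of BiFPN-like cross-scale fusion). The weight values in the example are placeholders; in practice they are learned during training:

```python
import numpy as np

def weighted_fusion(feats, w, eps=1e-4):
    """Fuse same-resolution feature maps with non-negative weights
    normalized to (approximately) sum to one."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # clamp to keep weights non-negative
    w = w / (w.sum() + eps)                          # fast normalization
    return sum(wi * f for wi, f in zip(w, feats))
```

Here, the extra skip connection simply adds the backbone's same-level feature map as one more input to this weighted sum.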
3.4. Optimization of Loss Function
In target detection, the loss function quantifies the disparity between the model’s predicted output and the actual target, driving learning during training to improve detection performance. Loss functions in object detection typically comprise bounding box regression loss, classification loss, and objectness loss. YOLOv5 employs the CIOU (Complete Intersection over Union) loss for bounding box regression, but this approach has limitations in handling variations in object location and size. Because the CIOU loss does not directly consider target location information, the model may prioritize the wrong bounding box location during optimization, leading to mismatches between detected and actual disease areas and reducing detection accuracy. Moreover, the CIOU loss is less sensitive to the degree of deformation in small targets, resulting in suboptimal performance for small-target detection.
To address these limitations, this paper adopts the EIOU loss function instead. The EIOU loss better balances detection accuracy by integrating the position and size information of the target box. By combining the width and height information of the target box and considering the ratio of the intersection region to the minimum enclosing region, the EIOU loss effectively handles target size changes and deformation, enhancing detection accuracy and robustness. The EIOU loss function is calculated as follows:
L_EIOU = L_IOU + L_dis + L_asp = 1 − IoU + ρ²(b, b_gt)/c² + ρ²(w, w_gt)/C_w² + ρ²(h, h_gt)/C_h²

The loss function comprises three components: the overlap loss (L_IOU), the center distance loss (L_dis), and the width–height loss (L_asp). The first two components follow the approach used in CIOU. However, the width–height loss directly minimizes the disparity between the widths and heights of the target box and the predicted box, thereby accelerating convergence. Here, C_w and C_h are the width and height of the minimum enclosing box covering both boxes, and c is its diagonal length. ρ(b, b_gt) represents the Euclidean distance between the center points of the anchor box and the ground-truth box, while ρ(w, w_gt) and ρ(h, h_gt) represent the differences between the widths and between the heights of the anchor box and the ground-truth box, respectively.
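The decomposition above can be sketched for a single pair of (x1, y1, x2, y2) boxes as plain Python; this is an illustrative scalar sketch, whereas training code computes the same quantity in batches on tensors:

```python
def eiou_loss(box_p, box_g, eps=1e-9):
    """EIOU loss: (1 - IoU) + center-distance term + separate
    width and height terms, each normalized by the enclosing box."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # IoU term
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + eps)
    # minimum enclosing box (C_w, C_h) and its squared diagonal c^2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # squared center distance rho^2(b, b_gt)
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # squared width/height differences
    dw2 = ((px2 - px1) - (gx2 - gx1)) ** 2
    dh2 = ((py2 - py1) - (gy2 - gy1)) ** 2
    return (1 - iou) + rho2 / (c2 + eps) + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)
```

For identical boxes every term vanishes and the loss is zero; for non-overlapping boxes the IoU term saturates at 1 while the distance terms continue to provide a gradient signal.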
5. Conclusions
In current methods for pine wilt disease detection, convolutional neural networks (CNNs) are commonly used as the network architecture, leveraging their strong feature extraction performance. However, the receptive field of CNNs is constrained by kernel size and network depth, limiting their capacity to model long-range dependencies. Transformers, by contrast, are adept at capturing global and rich contextual information, but their high computational demands hinder their practicality in real-time monitoring scenarios such as UAV-based applications. To address these challenges, this paper introduced Light-ViTeYOLO, a lightweight PWD detection method based on Vision Transformer-enhanced YOLOv5. By incorporating a lightweight Multi-Scale Attention (MSA) module to redesign the backbone network and enhancing the neck and head, the proposed method achieves strong performance in terms of detection accuracy, model complexity, and inference speed. Notably, it exceeds the detection accuracy of many target detectors with significantly fewer parameters, marking a successful balance between model accuracy and efficiency and underscoring its robustness. Using drones carrying our detection method for real-time detection of the discolored trees caused by pine wilt disease may yield economic benefits, including improved detection efficiency, reduced costs, reduced risk of disease transmission, and optimized decision support; however, the specific economic effects still need to be assessed professionally based on actual applications and relevant cost data. Therefore, we outline the following future work:
The method proposed in this paper has been experimentally verified on a standard platform. The next step is to deploy it on a drone hardware platform to further verify its feasibility and potential economic benefits;
Combining the method proposed in this paper with satellite-based forest monitoring to further strengthen the monitoring of pine tree discoloration caused by pine wilt disease. Integrating drone images with satellite images for multi-scale analysis from both macroscopic and local perspectives, comprehensively monitoring diseases through data fusion and analysis;
Applying the method proposed in this paper to the detection of other forest diseases, such as bark beetle damage.