1. Introduction
Citrus Greening (CG) disease, caused by the pathogen
Candidatus Liberibacter asiaticus, is a destructive disease of citrus that is spread by grafting or transmitted by the vector insect citrus psyllid [
1]. The most typical symptoms of the disease are “blotchy mottling”, partial yellowing of green leaves, leaf-thickening, and corking of veins [
2]. As the disease develops, infected trees gradually decline and finally die. No curative measures for this disease have been developed for practical citrus cultivation. The current management of CG thus mainly involves insecticide application to control vectors and frequent surveillance of trees to detect and remove infected trees as soon as possible. Hence, the detection of diseased trees is one of the most important management practices, as it can reduce the risks of both primary and secondary infections. The most widely adopted disease-detection practice is the use of Polymerase Chain Reaction (PCR) [
3], which requires collecting plant materials from trees and subsequently processing them for chemical analysis to detect pathogen genes in the samples. However, this method requires laboratory skills and is time- and labor-intensive, so results are slow to arrive. These drawbacks make growers reluctant to use the method. Therefore, there is a demand for the development of simple and rapid diagnostic methods.
Digital image analyses for plant disease diagnosis are increasingly being used. Convolutional neural networks (CNNs) incorporating image analyses of individual plant leaves have been examined as classification models, particularly under a controlled laboratory environment [
4]. For instance, deep learning techniques with machine learning are used in citrus disease diagnosis systems [
5]. This approach, applied to a dataset with five categories of leaves based on disease development, attained an average accuracy of 87% on test sets. In a related study, healthy and diseased citrus leaf images were first divided into five categories, and transfer learning with VGG19 and AlexNet models then distinguished the groups with 94.3% average accuracy on the test set [
6]. Although the approach can perform well in a laboratory under stable conditions, its reliability is limited in fields where conditions, e.g., weather, light conditions, and background noise, easily vary [
7]. Another factor that seriously affects the accuracy of image analyses is the coexistence of other diseases with symptoms similar to CG, which easily reduces confidence in the analyses.
On the other hand, object-detection technology using deep learning techniques is used to identify and locate specific objects (targets) in images and videos. This procedure is widely applied in the diagnosis of plant diseases. Although this technique is computationally expensive, it can recognize different categories of objects and draw bounding boxes around each of them. An optimized YOLO-V4 model was used to examine six different disease images obtained from fruits in a citrus orchard [
8]. The EfficientNet model was used for classification and achieved 84.2% accuracy in CG disease. This model was implemented with different object detection models to effectively detect citrus disease by focusing on the spot where the symptoms occur [
9]. As citrus psyllid is the only insect vector of CG disease, Dai F et al. [
10] aimed to prevent CG disease by detecting citrus psyllids on citrus leaves taken from a natural environment and achieved an average precision of 90.21%.
The above approaches have been tested for their ability to detect various targets, but their practical applicability remains to be studied. The fruit-based approach can make a diagnosis only once the trees bear fruit, which is not practical for early diagnosis and management [
8]. The research in [
9] attempted to identify diseases from the fine features of citrus leaves, but since symptoms of CG disease appear across the entire leaf, it was difficult to make judgments based on specific localized areas; thus, they did not consider diagnosing CG disease. Another study [
10] has the potential to help prevent the early spread of CG disease, but because the target is too small, it is difficult to grasp the overall situation in an orchard. Additionally, it cannot detect CG disease in images that do not contain vector insects.
This study reports on how a simple and precise diagnostic system for CG disease can be developed, focusing on the following issues:
A non-invasive method that involves collecting high-resolution, in-field images taken in the natural environment of an orchard and performing annotations of leaves on branches;
A diagnostic approach using the Faster R-CNN object detection architecture, enabling simultaneous identification and localization of CG disease, thereby improving detection efficiency;
The integration of the Convolution Block Attention Module (CBAM) attention mechanism into the VGGNet and ResNet models to improve CG disease detection capability;
The development of a web application tool for real agricultural scenarios.
We examined whether this system could quickly determine the CG-infection status of leaves from simple photographs of citrus branches, without interference from the shooting location or background noise. Based on the results, this study considered the potential of the tested models for practical use by growers.
4. Methods
4.1. Faster R-CNN-Based Diagnosis System
The Faster R-CNN [
15] is a type of deep learning architecture designed to locate and identify objects within multi-scale images, achieving high accuracy in object-detection tasks. Its capability to fine-tune parameters for specific datasets makes it particularly effective for transfer-learning applications.
Figure 13 outlines our diagnosis system’s configuration, which utilizes the Faster R-CNN framework. In this system, the input images are resized to 900 × 600 pixels and processed through the Faster R-CNN for training, resulting in a well-trained model. This model is hosted on a web server, facilitating an online CG disease diagnosis platform. Users can upload images of any size via a web application and receive results featuring bounding boxes, labels, and confidence scores of the identified targets.
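Because uploaded images may be of any size while the model works on 900 × 600 inputs, predicted boxes have to be expressed in the coordinates of the original upload before they are drawn. The following is a minimal sketch of that mapping, assuming a plain (non-aspect-preserving) resize; the function name `rescale_box` and the box layout `(x_min, y_min, x_max, y_max)` are illustrative assumptions, not taken from our implementation.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

MODEL_W, MODEL_H = 900, 600  # model input size stated in the text


def rescale_box(box: Box, orig_w: int, orig_h: int,
                model_w: int = MODEL_W, model_h: int = MODEL_H) -> Box:
    """Map a box predicted on the resized image back to the original image.

    Assumes the upload was stretched to model_w x model_h without padding.
    """
    sx = orig_w / model_w  # horizontal scale factor
    sy = orig_h / model_h  # vertical scale factor
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


# Hypothetical detection on the 900 x 600 input for a 3600 x 2400 upload.
print(rescale_box((90.0, 60.0, 450.0, 300.0), orig_w=3600, orig_h=2400))
# -> (360.0, 240.0, 1800.0, 1200.0)
```

If the resize instead preserved the aspect ratio with padding, the padding offsets would need to be subtracted before scaling.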
4.2. Backbone and Transfer Learning
The backbone is a CNN model located in the initial stages of object-detection models like Faster R-CNN, playing a crucial role in extracting useful features from the input image. By using a backbone, it is possible to capture features ranging from low-level characteristics such as edges, textures, and shapes to more advanced abstract features. Therefore, the method of using a pre-trained CNN model as a backbone through transfer learning is widely employed to adapt to new tasks [
16].
Transfer learning involves applying the knowledge (weights and feature extractors) of a model trained on a large-scale task as initial values for another specific task. This can reduce the amount of data required for learning, shorten the learning time, and improve performance, especially when the data are scarce or the task is complex [
17]. When transfer learning is used with Faster R-CNN, the pre-trained backbone's capability to capture various image features immediately provides high-level feature extraction for new object-detection tasks [
15].
In this study, we utilized pre-trained models based on the ImageNet subset of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [
18] as the backbone of Faster R-CNN. The ILSVRC dataset is a large-scale collection of over 1.2 million annotated training images across 1000 object categories and is widely used as a standard benchmark in many computer vision studies. Specifically, the pre-trained models used in this study were VGGNet and ResNet, which have demonstrated high performance across various tasks.
4.2.1. VGGNet
VGGNet [
19], a deep CNN model, is highly regarded in the field of image classification for its simplicity and uniform design. It is characterized by stacks of convolutional layers with small 3 × 3 filters, with each stack followed by a max-pooling layer. The defining feature of VGGNet is its repetitive stacking of convolutional layers, enabling deeper representations [
19]. VGGNet has several variations, with VGG16 and VGG19 being the most widely used. VGG16 consists of 13 convolutional layers and 3 fully connected layers, while VGG19 has an additional 3 convolutional layers, enhancing its ability to capture more complex features. In this study, we utilized these two pre-trained VGGNet models for transfer learning.
In transfer learning, a common approach involves freezing certain blocks of the network. This means keeping the weights of the frozen blocks unchanged while updating the weights of the other layers during training. This strategy effectively leverages pre-trained knowledge while tailoring the model for a new task, especially with limited data [
20]. In our study, we found that freezing the first two blocks of VGGNet yielded optimal training performance. The architecture of our employed VGGNet is illustrated in
Figure 14.
4.2.2. ResNet
ResNet [
21] is a deep learning model designed so that very deep networks can be trained effectively, and it is a common baseline for many computer vision tasks. CNN models before ResNet tended to suffer from vanishing or exploding gradients as the network deepened, making learning difficult [
20]. ResNet introduced the “residual block” structure, using “skip connections” that add the input directly to the output, effectively avoiding these issues.
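In symbols, a residual block can be sketched as follows, using the usual formulation from [21]. The notation is introduced here for illustration only: $\mathbf{x}$ is the block input, $\mathcal{F}$ is the mapping learned by the stacked layers with weights $\{W_i\}$, and $\mathbf{y}$ is the block output.

```latex
\[
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x},
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}
  = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I .
\]
```

The identity term $I$ in the derivative gives the gradient a direct path through the skip connection, which is why stacking many such blocks does not by itself shrink the gradient.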
There are various ResNet model variations. In this study, we adopted the ResNet50, ResNet101, and ResNet152 models, which have demonstrated effectiveness across a wide range of computer vision tasks. To achieve optimal learning effects in transfer learning, experiments were conducted similarly to VGGNet, and it was found that freezing the layers before the third block was optimal. The structure of the ResNet models used is shown in
Figure 15.
4.3. Attention Mechanism
When reading a text, one may “pay attention” to certain words according to the context and deeply understand their meaning. The attention mechanism in deep learning models attempts to mimic this human process. Since the advent of the Transformer [
22] model, the attention mechanism has garnered significant interest, especially in natural language processing, and has since been widely applied to other areas like image recognition. By using the attention mechanism, deep learning models can focus on important parts of the data, making it a powerful tool that improves task performance [
23]. In this study, to achieve higher diagnostic performance, we integrated the CBAM attention mechanism into the two types of backbone models mentioned in
Section 4.2 and conducted experiments.
4.3.1. CBAM
The CBAM [
24] is designed to bolster the representational capabilities of CNNs by focusing on both spatial and channel-wise attention. CBAM comprises two sub-modules: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM).
Figure 16 shows the structure of the CBAM.
CAM concentrates on the inter-channel relationships of features, emphasizing “what” aspects of input images are meaningful. The process starts with global max pooling and global average pooling of the input features, each then fed into a two-layer neural network. The reduction ratio parameter controls how far the neuron count is reduced in the first layer, which is followed by ReLU activation. The second layer restores the neuron count to its original number. The two outputs are then combined using element-wise summation and passed through a sigmoid activation to form a channel attention map; multiplying this map element-wise with the input features gives the channel-refined feature passed to SAM.
SAM leverages the inter-spatial relationships of features, focusing on “where” informative parts are located. The channel-refined feature undergoes channel-based max pooling and average pooling, followed by concatenation. A 7 × 7 convolution operation with ReLU and sigmoid activation is then applied. Finally, an element-wise multiplication with the channel-refined feature generates the refined features.
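The two sub-modules can be summarized compactly as below. This is a sketch in the notation of the original CBAM formulation [24], not a transcription of our code: $F \in \mathbb{R}^{C \times H \times W}$ is the input feature, $\sigma$ is the sigmoid, $\mathrm{MLP}$ is the two-layer network with $C/r$ hidden units for reduction ratio $r$, $f^{7 \times 7}$ is the 7 × 7 convolution, and $\otimes$ is element-wise multiplication with broadcasting. The original formulation applies only a sigmoid after the convolution, so the ReLU step mentioned above is omitted here.

```latex
\[
\begin{aligned}
M_c(F)  &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)
         && \in \mathbb{R}^{C \times 1 \times 1}, \\
F'      &= M_c(F) \otimes F, \\
M_s(F') &= \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}_c(F');\ \mathrm{MaxPool}_c(F')])\big)
         && \in \mathbb{R}^{1 \times H \times W}, \\
F''     &= M_s(F') \otimes F'.
\end{aligned}
\]
```

Here $\mathrm{AvgPool}_c$ and $\mathrm{MaxPool}_c$ pool along the channel axis, $F'$ is the channel-refined feature, and $F''$ is the final refined feature.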
4.3.2. Proposed Model
CBAM is a lightweight module that can be seamlessly integrated into any position within any CNN model [
24]. Chougui A. et al. [
25] demonstrated enhanced feature extraction by adding CBAM after each of the five blocks of the VGGNet model on a large-scale plant disease dataset. In this study, given the use of a small-scale dataset, optimal results were achieved by incorporating CBAM only after Block5 of VGGNet and Block4 of ResNet. The structure of our proposed model is detailed in
Figure 17.
4.4. Platform and Hyperparameters
Even with the use of transfer learning, there are instances where the pre-trained model may not perfectly adapt to the specifics of a given task or new dataset. Therefore, it becomes necessary to experiment with various hyperparameters to tailor and optimize the model for the specific task. In this study, we compared different optimizers and regularization techniques using the same dataset to find the optimal hyperparameters. We looked at two types of optimizers: Momentum and Adam. For regularization techniques, we evaluated L1 Lasso and L2 Ridge. Additionally, we experimented with various settings for learning rate, weight decay, dropout, and batch size to optimize our results.
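For reference, the two regularization techniques compared here add the following standard penalties to the training loss. The symbols are illustrative assumptions: $L_0$ denotes the unregularized detection loss, $w_i$ the trainable weights, and $\lambda$ the regularization strength.

```latex
\[
L_{\mathrm{L1}}(w) = L_0(w) + \lambda \sum_i |w_i|,
\qquad
L_{\mathrm{L2}}(w) = L_0(w) + \lambda \sum_i w_i^{2}.
\]
```

The L1 (Lasso) penalty tends to drive individual weights to exactly zero, whereas the L2 (Ridge) penalty shrinks all weights smoothly toward zero.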
In this research, we utilized PyCharm Community Edition 2023.2.1 for building and generating deep learning models; Anaconda3 for managing library files; and LabelImg v1.5.2 for annotating. The details regarding the experimental platform and recommended hyperparameters are presented in
Table 8.
4.5. Model Evaluation
In this study, we tested 10 models using a 5-fold cross-validation (5-fold CV) method, including VGGNet, ResNet, and these models integrated with CBAM (VGGNet+CBAM and ResNet+CBAM). We recorded the average precision (AP) for the three categories “greening”, “healthy”, and “others”, as well as the mean AP (mAP) across all categories to compare the overall performance of the models.
4.5.1. Evaluation Metric
Given the imbalanced nature of the three labels in our dataset, with a majority being “greening”, there exists a risk that the model could achieve high accuracy by predominantly predicting the majority labels while neglecting the minority labels. To address this potential bias and to evaluate the model’s performance more comprehensively, AP [
26] was employed as the primary evaluation metric. By calculating the AP for each category and using their mean value mAP, it is possible to evaluate the overall performance of the model. In this study, we used AP and mAP to evaluate the detection capability of each category and the overall performance of the models.
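To make the metric concrete, the following minimal sketch computes AP for one category and mAP across categories. It assumes that each predicted box has already been marked as a true or false positive by an IoU match against the ground truth, and it uses the all-point interpolated area under the precision-recall curve; the exact AP variant, the function names, and the example numbers are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """AP for one category.

    detections: (confidence, is_true_positive) for every predicted box; the
        flag is assumed to come from an IoU match performed beforehand.
    num_gt: number of ground-truth boxes of this category.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolation: make precision non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangle areas under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP: unweighted mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical example: three predictions for a category with two true boxes.
example_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(round(example_ap, 4))
```

Because mAP weights the three categories equally, a model cannot score well by favoring the majority “greening” label alone.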
4.5.2. k-Fold Cross-Validation
k-fold CV [
27] is a widely used method for evaluating the performance of models in deep learning. This method involves dividing the dataset into k mutually exclusive folds and alternately conducting training and validation to aim for a more accurate estimation of the model’s performance. Specifically, k cycles of training and validation are carried out, where one of the k folds is selected as the validation dataset in each cycle, and the remaining k − 1 folds are used as the training dataset. Performance evaluations are recorded in each cycle, and the average of these evaluations is calculated to estimate the model’s average performance.
Using k-fold CV allows all data to be used for both training and validation, enabling a fairer assessment of the model’s generalization ability, particularly when dealing with small datasets. This maximizes data utilization and evaluates the model across multiple independent validation sets, making the performance estimation more stable and reliable. Typically, k is chosen between 5 and 10, but for our small dataset, k was set to 5. The approach of a 5-fold CV is illustrated in
Figure 18.
In detail, we initially divided the 82 collected images randomly into 5 folds, labeled 1 through 5. We then applied data-augmentation techniques of rotation and flipping to the images in each fold, ensuring that both original and augmented images remained within the same fold. For each iteration of the 5-fold CV, 4 folds (e.g., 1, 2, 3, 4) were used as the training set, and the remaining fold (e.g., 5) served as the validation set. This process was repeated such that each fold acted as the validation set once, thereby completing the 5-fold CV cycle.
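The splitting procedure described above can be sketched as follows. The key point is that the originals are split first and augmented afterwards, so an augmented copy never lands in a different fold from its source image. The image identifiers, the `_rot` and `_flip` suffixes, and the number of augmented copies per image are placeholders, not our actual file names or augmentation settings.

```python
import random
from typing import Iterator, List, Tuple


def make_folds(image_ids: List[str], k: int = 5, seed: int = 0) -> List[List[str]]:
    """Randomly split the original images into k mutually exclusive folds."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def augment(fold: List[str]) -> List[str]:
    """Return each original plus its augmented copies, all kept in one fold."""
    suffixes = ["", "_rot", "_flip"]  # placeholder names for augmented variants
    return [img + s for img in fold for s in suffixes]


def cv_splits(folds: List[List[str]]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, validation) pairs; each fold is the validation set once."""
    for i in range(len(folds)):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, val


originals = [f"img_{n:03d}" for n in range(82)]  # 82 collected images
folds = [augment(f) for f in make_folds(originals, k=5)]
for i, (train, val) in enumerate(cv_splits(folds), start=1):
    print(f"round {i}: {len(train)} training images, {len(val)} validation images")
```

Augmenting before splitting would instead let near-duplicates of a validation image appear in the training set and inflate the measured AP.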
4.6. Web Application
Alongside developing the CG disease-detection model, we also created a web application for practical use. This application, developed using the Django [
28] web framework in Python, allows users to upload leaf images (supporting multiple image uploads) for real-time diagnosis. The user interface of our web application is depicted in
Figure 19. Local farmers can take images of leaves on branches and upload them to the web application via smartphone or computer. Uploaded images are transferred to a server computer in the laboratory for disease diagnosis. Frames are drawn directly on the targeted leaves, displaying the classification category and confidence scores on each frame. The diagnosis results are then immediately shown as output images on the web application. The web application is hosted on the server of the Faculty of Informatics at Kansai University and is accessible via the following URL: citrus.kutc.kansai-u.ac.jp.
Using the web application allows for the real-time monitoring of disease conditions, enabling the early diagnosis and management of diseases, which can reduce the spread and impact of the diseases. Additionally, by accurately identifying and treating infected trees, the use of chemical pesticides can be reduced, contributing to environmental protection and reducing production costs.
5. Discussion
This study focuses on identifying optimal networks and solutions for the simple and efficient detection of CG disease in field applications. We explored the Faster R-CNN architecture with transfer learning, which demonstrated strong recognition capabilities even under challenging conditions, such as distant targets or backgrounds lacking similar objects, highlighting its robust anti-interference abilities. Users can take advantage of our CG disease-diagnosis system by uploading photos directly from the field to our web application for real-time diagnosis, which makes the system practical for immediate use. However, transfer-learning models, which are built on limited datasets, typically excel within similar feature spaces but may struggle with out-of-domain data [
29].
Our data collection was restricted to leaves from a single citrus variety, gathered only under sunny conditions in January, and from trees aged 5 to 7 years. These limitations could impact the effectiveness of the model, for instance, when applied to different citrus varieties. Nevertheless, CG disease exhibits minimal variation in disease characteristics (appearance and manifestation) across different seasons and varieties [
30]. The universality of these disease characteristics suggests that our system could be effectively adapted for use with other varieties and in various regions. Future improvements will focus on enhancing the model’s versatility. We plan to expand our dataset to include a wider range of citrus species, age groups, and lighting conditions, aiming to address variations in leaf color and size that could affect recognition accuracy. Additionally, we believe that collecting samples from other plants and training models on them with our proposed method will also help in diagnosing other plant diseases.
To enhance the model’s ability to recognize CG disease, we presented a novel approach by integrating the CBAM with VGGNet and ResNet models, which is, to our knowledge, the first attempt within the field to conduct a precision comparison using this combination.
Table 9 shows the performance difference between models before and after the integration of CBAM with 5-fold CV. Positive values indicate that the integration of CBAM improved the model, and negative values indicate that it reduced performance.
Table 9 demonstrates that the integration of CBAM yielded a notable improvement in the AP for CG disease detection, with enhancements ranging from 1.06% to 2.02%. This underscores the effective role of CBAM in enhancing feature extraction specific to CG disease. For the ResNet50 model, there was an increase of 1.71% in detecting CG disease; however, this was accompanied by declines of 0.45% and 0.35% in the “healthy” and “others” categories, respectively. One possible explanation is that, because ResNet50 is relatively shallow, adding CBAM leads to an over-reliance on attention-weighted features, so that other pertinent information is neglected or the features refined by CBAM are not fully exploited, which lowers identification accuracy in these categories. In terms of overall model performance (mAP), however, VGG16 and VGG19 exhibited enhancements of 1.81% and 2.01%, respectively, while ResNet50, ResNet101, and ResNet152 achieved improvements of 0.30%, 1.15%, and 1.58%, respectively. Within each model family, greater depth was thus associated with a larger overall improvement. This suggests that for small-scale target-detection tasks requiring the learning of a great number of detailed features, more complex models with sufficient capacity to learn and utilize these enhanced features may benefit more substantially from the integration of CBAM.
Although we successfully enhanced the feature extraction for CG disease by combining the CBAM with VGGNet and ResNet models, the highest AP achieved was 89.92% with the ResNet152+CBAM model. To further improve the detection capability for CG disease, it is worth exploring the potential for increased detection accuracy through the integration of other backbone models like EfficientNet [
31] or ViT [
32] with other attention mechanisms such as ECA-Net [
33]. Moreover, to further improve the practical efficiency of our system, we aim to develop it to be capable of extracting frames from videos taken with smartphones or drones to diagnose diseases. Essential improvements in the response speed of the web application, such as using other object-detection architectures like RetinaNet [
34] or YOLOv7 [
35], which have a faster object-detection speed, are also necessary. Furthermore, by registering disease information in a geographic information system, it is possible to track the spread trends of CG disease within a region and provide a scientific basis for disease prediction and prevention. However, the accuracy of location information at the level of individual trees is insufficient with smartphone Global Positioning System (GPS) functions, so management is currently performed at the plantation level. In practice, in the mountainous areas of Chiang Mai Province, Thailand, the HRDI is developing a geographic information system for managing plantations, and it is believed that the results of this research can be utilized there.
6. Conclusions
In this study, we explored a fundamental yet innovative approach for diagnosing CG disease using in-field images of citrus leaves taken in orchards in Thailand through transfer learning with the Faster R-CNN architecture. The focus of our research was to compare the effects of transfer learning using VGGNet and ResNet and the integration of the CBAM attention mechanism into CNN models, providing valuable insights for future research. We used AP and mAP as the evaluation metrics and conducted a 5-fold CV, assessing a total of 10 models based on VGGNet and ResNet. The key findings are as follows:
The ResNet models demonstrated superior performance compared to the VGGNet models;
The integration of CBAM into VGGNet and ResNet models improved the AP for CG disease detection by 1.06% to 2.02%;
The ResNet152+CBAM model performed best in both the accuracy of CG disease detection and overall performance;
The implementation of Faster R-CNN with in-field images notably improved the efficiency and practical application of CG disease detection.
By using our system for real-time CG disease diagnosis, the efficiency of early in-field detection will be improved with a relatively high level of accuracy. Given the severe impact of CG disease on global citrus production, the results of this study facilitate the development of techniques to mitigate this disease problem and even support economic citriculture to some extent. Furthermore, this study not only contributes to the stable production of citrus and the improvement of plant quarantine systems but also has the potential to be applied to research on other plant disease diagnoses.