1. Introduction
In agriculture, plant diseases cause an estimated 10–15% annual loss of the world's major crops [1]; 70–80% of these diseases are caused by pathogenic fungi that adversely affect crop growth, quality, and yield. Disease management is therefore important to agricultural systems, including the wild lowbush blueberry production system. Wild blueberry (mainly Vaccinium angustifolium Aiton) is a perennial shrub that spreads by underground rhizomes, with aerial shoots emerging every 2–30 cm. Wild blueberry plants are not planted [2,3] but grow naturally in rocky hills and sandy fields, where they are managed to form a carpet for berry production [4]. Wild blueberry is one of the most important crops in Maine, USA, and the Canadian provinces of Quebec and the Maritimes, and it is a major source of income for growers in these regions [5,6]. The state of Maine is one of the largest producers of wild blueberries in the world, accounting for 97% of total production in the US [7,8,9]. The yield and quality of blueberries are affected by several factors, but one of the most important is mummy berry disease, caused by the fungus Monilinia vaccinii-corymbosi [10].
Monilinia vaccinii-corymbosi ascospores attack opening flower clusters and axillary buds in the spring and kill the infected tissues [11]. These tissues then produce secondary asexual spores that infect healthy flowers, and the fungus colonizes the developing fruit. High levels of infection can kill up to 90% of the leaves and flower buds early in the growing season [10,12]. Infection of the developing fruit directly reduces yield, and the loss of flowers and leaves can reduce it indirectly [8]. The loss of yield (berry weight harvested) can be substantial and poses an economic challenge to growers.
The current method of early-warning monitoring for mummy berry disease is based on predicting potential infection periods from weather conditions and the developmental stages of the plants and the fungus [13]. If a high likelihood of infection is predicted, based on the duration of leaf wetness and suitable air temperature, growers are advised to protect their crops with fungicide applications [14]. Follow-up field scouting by crop protection experts and experienced blueberry growers is often used to judge the effectiveness of the infection forecasts and fungicide applications. However, monitoring for the presence of disease and rating its level is extremely time-consuming and labor-intensive, since infected plants can be scattered in patches around the field; typically, multiple transects are walked to observe many individual stems across a field. Scouting is also prone to error because mummy berry symptoms can be confused with frost damage or with other diseases such as Botrytis blight. These are some of the main reasons why researchers are looking for alternative methods to identify diseases in the field [15,16,17]. Previous studies on other crops and diseases using traditional machine learning algorithms have mainly relied on manually extracted features of image texture, color, and shape [18] to locate disease. However, symptoms of the same disease may have different visual characteristics at different stages of infection and when infecting flowers, leaves, or fruit, with possible occlusions and high spatial variation among individual plants. Consequently, when environmental conditions and symptom traits vary, the generalization ability of these algorithms decreases significantly.
In recent years, with the rapid advancement of computer vision and deep learning, various plant disease detection and classification methods have been developed in agriculture with highly accurate results [16,19,20]. Despite their success in plant disease detection, deep neural network architectures depend heavily on large quantities of varied training data to "learn the breadth of behavior" required for a well-trained model. However, no available wild blueberry disease dataset contains the abundance of images collected and labeled in a real field environment that is essential for highly accurate models [19]. Levels of mummy berry symptoms vary with field characteristics, weather, and inoculum level, and symptoms of the first stage of infection of leaves and flower buds last only one to three weeks, depending on the field and the weather. Clean, background-free images of diseased and healthy plant parts are also difficult to obtain in blueberry fields, and accurately labeling images for model training is very labor-intensive. To address data scarcity in training deep learning models, researchers have proposed various techniques for generating synthetic images from an available dataset to obtain diverse and inexpensive training data [21,22], rather than collecting and annotating field images, which is expensive and time-consuming.
Although computer vision techniques for plant disease detection have greatly improved, practical problems such as the small size of lesions, occlusion of shoots, interference from complex backgrounds, and uncontrollable light conditions in the field remain unsolved for mummy berry disease identification. For instance, masses of conidia (a sign of primary mummy berry infection on leaves and flowers) on blueberry shoots are tiny (<33 µm long [11]) and account for only a very small portion of a field-taken image, which makes them unlikely to be identified automatically by computer vision techniques. Moreover, the highly branched, dense structure of blueberry bushes often occludes small diseased plant parts such as those exhibiting conidia. Multiple shoots or branches also complicate the background of field-taken images, which further challenges disease detection. Examples of field-taken images of mummy berry disease and conidia on shoots are shown in Figure 1. In addition, disease traits (such as size, color, and portion of the image occupied) in field samples used for disease detection and severity rating may vary considerably with camera shooting angle and distance. These high spatial variations inevitably degrade identification performance, even when the most advanced object detection algorithms are employed [23].
In deep learning, as in human vision, an attention mechanism focuses on key regions of the input and ignores irrelevant information. Recent studies have demonstrated the remarkable effectiveness of attention-based methods for boosting deep learning networks and have proven their usefulness in a variety of computer vision tasks, such as object detection [20,24,25]. CBAM [26] is a widely used attention mechanism that combines channel and spatial attention. SE [27], on the other hand, models the relationships between channels: it learns, through the loss function, to increase the weight of relevant image features and decrease the weight of irrelevant ones. In plant disease detection, the lightness of the model determines whether it can be deployed on embedded devices, which is of great importance for growers monitoring the growth and disease status of blueberries in real time in the field [15]. Given the limited computational power and storage capacity of mobile and embedded devices, SE and CBAM remain the most popular attention methods. However, SE neglects location information, and CBAM captures only local relationships and cannot model the long-range dependencies essential for capturing object structure in visual tasks [28]. In contrast, coordinate attention (CA) considers both inter-channel relationships and position information.
Therefore, to overcome these problems in current plant disease detection methods and to address data scarcity for mummy berry disease detection on wild blueberry plants in a real field environment, we implemented the cut-and-paste method [29] for synthetically augmenting the available dataset with annotated training images for object detection, which reduces the effort required to collect and manually annotate large datasets. We then improved the backbone of the original Yolov5s network by integrating the lightweight coordinate attention (CA) module, which highlights important features by capturing channel and location information, yielding a mummy berry disease detection model for real field environments at little extra computational cost. The main contributions of this study are summarized as follows:
The coordinate attention (CA) module is integrated into the Yolov5s backbone. This allows the network to increase the weight of key features and pay more attention to disease-related visual features, improving detection performance at various spatial scales.
The loss function, General Intersection over Union (GIoU), is replaced by the loss function, Complete Intersection over Union (CIoU) to enhance bounding box regression and localization performance in identifying diseased plant parts with a complex background.
A synthetic dataset generation method is presented that can reduce the effort of collecting and annotating large datasets and boost the performance of identification by artificially increasing available features in deep model training.
3. Materials and Methods
In this section, we briefly present the field-collected image dataset that was used for model training and evaluation, as well as for generating synthetic images. We then introduce the synthetic dataset generation method for object detection tasks. The section concludes with a description of the improved Yolov5 model based on an attention mechanism, and the evaluation metrics.
3.1. Data Source
The first step in developing a deep learning model is to prepare a dataset. As the primary source of data in this study, images of healthy and diseased flowers, fruits, and leaves of the blueberry crop in a field environment with complex backgrounds were obtained from the University of Maine wild blueberry experimental fields at Blueberry Hill Farm, Jonesboro, ME, USA [47]. However, the total number of field images collected was not adequate for training a deep learning network. Therefore, to achieve high performance and reduce the risk of overfitting a predictive model for mummy berry disease detection, we first produced annotated synthetic images with complex backgrounds that mimicked real field situations. We then collected blueberry images with mummy berry disease from online sources such as the National Ecological Observatory Network (www.bugwood.org, accessed on 23 April 2022) and Google Images (www.google.com, accessed on 2 May 2022) to add variety to the training images, since deep learning models show better results and higher generalization ability when a large dataset is available. A total of 459 field images of blueberries with mummy berry disease were obtained from the University of Maine wild blueberry experimental fields and online sources. From these field images, a total of 1661 annotated images were produced by the synthetic data generation method (Table A1 in Appendix A).
3.2. Synthetic Data Generation
In this study, we applied the cut-and-paste technique [29] to create synthetic images and their annotations by randomly scaling and rotating segmented images of interest and adding them to a background. Unlike Mixup and CutMix, our method copies only the exact pixels that correspond to an object, rather than all the pixels in the object's bounding box. To generate the synthetic dataset, we randomly selected images of 55 flowers, 48 fruits, and 58 leaves of diseased blueberry plant tissue from the field dataset (Section 3.1) and created masks. A total of 83 "healthy" background photographs containing only healthy, uninfected flowers and leaves were then collected in a lowbush blueberry field at the University of Maine Blueberry Hill Farm (Jonesboro, ME, USA). To make the backgrounds more complex, seven distractor images of healthy fruits were obtained from online sources and masked. Object-of-interest masks were created with Adobe Photoshop, unlike a previous study [29] that automated this step by training a machine learning model to segment and extract the objects.
Once the image data were ready, we randomly selected a background image and resized it to 1080 × 1320 pixels (vertical) or 1320 × 1080 pixels (horizontal). Then, to diversify the backgrounds of the synthetic dataset, we randomly selected at most 10 segmented distractor images and iteratively resized, rotated, and added them to the background at random positions. Under field conditions in agricultural production systems, occlusion is a common challenge that must be considered; hence, when generating the synthetic dataset, a newly added image may partially or fully overlap a previously added one. To control the degree of overlap while still including cases of occlusion, the overlap threshold was set at 25%. Finally, in an iterative process, we randomly chose up to 15 segmented images of diseased leaves, flowers, and fruits, randomly resized and rotated them, and added them to the background on top of the distractor images (Figure 2).
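As an illustration, the overlap-controlled pasting step can be sketched in Python with NumPy. The function names `paste_object` and `bbox_overlap_ratio` are illustrative, random scaling and rotation are omitted for brevity, and real segmented images and masks would replace the synthetic arrays used in the test; this is a sketch of the idea, not the exact pipeline used in the study.

```python
import numpy as np

def bbox_overlap_ratio(a, b):
    """Overlap of two boxes (x1, y1, x2, y2), relative to the smaller box's area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller else 0.0

def paste_object(background, obj, mask, placed_boxes, rng,
                 max_overlap=0.25, tries=20):
    """Paste the masked pixels of `obj` at a random location, rejecting
    placements that overlap an existing box by more than `max_overlap`
    (25% in this study). Returns the pasted box, i.e. the annotation."""
    H, W = background.shape[:2]
    h, w = obj.shape[:2]
    for _ in range(tries):
        x = int(rng.integers(0, W - w))
        y = int(rng.integers(0, H - h))
        box = (x, y, x + w, y + h)
        if all(bbox_overlap_ratio(box, b) <= max_overlap for b in placed_boxes):
            region = background[y:y + h, x:x + w]
            region[mask] = obj[mask]   # copy only object pixels, not the full bbox
            placed_boxes.append(box)
            return box
    return None                        # no valid placement found
```

In this sketch, only the pixels selected by the boolean mask are copied, which matches the paper's distinction from Mixup and CutMix (the full bounding box is never pasted).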
3.3. Coordinate Attention Module
When detecting mummy berry disease, the infection can be randomly distributed on the plant stem, resulting in a mixture of overlapping occlusions, and the infected region may occupy a relatively small proportion of the image area, leading to missed or incorrect detections. In our study, we introduce the coordinate attention (CA) module to help the deep learning model focus on the most significant information related to infection and ignore minor features. The CA mechanism is an efficient and lightweight module that embeds position information into the attention map, allowing the model to attend over a large region without introducing significant computational cost. The coordinate attention block can be considered a computational unit that increases the expressive power of the learned features. It takes an intermediate feature tensor $X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a transformed tensor $Y = [y_1, y_2, \ldots, y_C]$ with enhanced representations, of the same size as $X$.
In the structure of the coordinate attention module, the operation is divided into two steps: (1) coordinate information embedding; and (2) coordinate attention generation (
Figure 3). The first step factorizes global pooling into two 1D feature encoding operations that encode each channel along the horizontal and vertical directions, respectively, as given in Equation (1):

$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \tag{1}$$

where $x_c$ denotes the input, and $z_c^h(h)$ and $z_c^w(w)$ indicate the outputs of the $c$-th channel at height $h$ and width $w$, respectively. The second step concatenates the produced feature maps and sends them through a shared 1 × 1 convolutional transformation $F_1$ to obtain the intermediate feature map $f$, as formulated in Equation (2):

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right) \tag{2}$$

where $[\cdot\,, \cdot]$ denotes the concatenation operation along the spatial dimension and $\delta$ is a non-linear activation function. The feature map $f$ is then split along the spatial dimension into two separate tensors $f^h$ and $f^w$, followed by another two 1 × 1 convolutional transformations $F_h$ and $F_w$, as determined by Equation (3):

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right) \tag{3}$$

where $\sigma$ denotes the sigmoid activation function. The final attention-weighted output $y_c$ is generated according to Equation (4):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{4}$$
Therefore, in this study, we integrated the coordinate attention (CA) module into the Yolov5s backbone. This offers three clear advantages: (1) it captures cross-channel and position-sensitive information that helps the model accurately locate and recognize objects of interest; (2) it is more lightweight than other attention mechanisms [26,27]; and (3) it can be flexibly plugged into object detection models such as Yolov5 with little additional computational overhead.
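A minimal PyTorch sketch of the coordinate attention block described by Equations (1)–(4) might look as follows. The reduction ratio, the Hardswish activation for $\delta$, and the module name are assumptions for illustration rather than values taken from this paper.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention sketch: pool along H and W separately, transform
    with a shared 1x1 conv, split, and produce direction-aware attention maps."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)   # shared F1
        self.act = nn.Hardswish()                  # non-linearity (delta)
        self.conv_h = nn.Conv2d(mid, channels, 1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)  # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        # Eq. (1): 1D average pooling along width and height
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        # Eq. (2): concatenate along the spatial dimension and transform
        y = self.act(self.conv1(torch.cat([x_h, x_w], dim=2)))
        # split back into the two directions
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Eq. (3): per-direction attention weights
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        # Eq. (4): reweight the input feature map
        return x * a_h * a_w
```

The output has the same shape as the input, so the block can be dropped into a backbone (e.g. after a C3 module) without changing the surrounding layer dimensions.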
3.4. Yolov5 Method
Object detection is a computer vision technique for locating instances of a certain class of objects in an image. Recent object detection methods fall into two main types: one-stage and two-stage. One-stage methods prioritize inference speed and include the YOLO series of detectors [15,35,48,49,50], SSD [36,51], and RetinaNet [52]. Typical two-stage methods prioritize detection accuracy and include R-CNN [53], Fast R-CNN [54], Mask R-CNN [55], Cascade R-CNN [56], and others.
Yolov5 is the latest generation of one-stage object detection models in the YOLO series, released by Ultralytics in May 2020 [57]. Based on network depth and feature map width, Yolov5 comes in four variants: Yolov5s, Yolov5m, Yolov5l, and Yolov5x [23]. Compared with two-stage detection models, Yolov5 greatly improves running speed while maintaining detection accuracy; this not only meets the needs of real-time detection but also keeps the model structure small. The Yolov5 network is an improved model based on Yolov3, with improvements such as multi-scale prediction, which can simultaneously detect objects of different sizes [20]. We therefore propose a lightweight mummy berry disease detection model based on Yolov5s, improving the network backbone with an attention mechanism. The architecture of the improved Yolov5s-CA network model is shown in Figure 3.
3.5. Improvement of Yolov5s-CA Network Model
Figure 3 shows the structure of the improved Yolov5s-CA network model for detecting mummy berry disease. A lightweight CA module [28] was introduced into the backbone of Yolov5s to strengthen the feature representation ability of the network and select useful information, which enhances detection performance. The Yolov5s-CA network consists of four parts: input, backbone, neck, and head.
The backbone of the Yolov5s-CA network model contains Conv, C3, CA, and Spatial Pyramid Pooling Fast (SPPF) modules. Conv is the basic convolution unit, which performs two-dimensional convolution, regularization, and activation on its input. The C3 module appears in both the backbone and the neck; in the backbone it is implemented with a shortcut structure. It splits the input tensor into two branches: one passes through a Conv module and then through multiple residual structures to avoid degradation in the deep computational process, while the other passes through a single Conv module; the two branches are then concatenated and fused by a final Conv module. As shown in Figure 4, the CA modules are integrated into the backbone after the C3 modules to highlight and select the most important disease-related visual features and improve the representation ability of the object detection model for mummy berry disease in a field environment. The last layer of the backbone, SPPF (Figure 5), comprises three 5 × 5 MaxPool layers in series; the input passes through the MaxPool layers in turn, their outputs are concatenated, and a final Conv operation is applied. SPPF achieves feature extraction results similar to SPP but runs faster. With the MaxPool layers and skip connections, the network learns features at multiple scales and increases the representativeness of the feature map by combining global and local features.
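A minimal PyTorch sketch of the SPPF structure described above (three serial 5 × 5 MaxPool layers, concatenation, then a Conv) could look like this. Channel counts and module names are illustrative, and the real Yolov5 Conv block also includes batch normalization and SiLU activation, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """SPPF sketch: serial 5x5 MaxPools whose outputs are concatenated with
    the input, then fused by a 1x1 Conv (a faster equivalent of SPP)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        mid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, mid, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = nn.Conv2d(mid * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)    # 5x5 receptive field
        p2 = self.pool(p1)   # stacking pools widens the field (roughly 9x9)
        p3 = self.pool(p2)   # roughly 13x13, similar to SPP's parallel pools
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```

Because the pools use stride 1 with matching padding, the spatial size is preserved and only the channel count changes, which is why SPPF can sit at the end of the backbone without altering the feature map resolution.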
The neck module is a feature aggregation layer between the backbone and the head. It collects as much information as possible from the backbone before feeding it to the head, and consists of two parts: the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). The FPN structure transmits semantically strong features top-down, while the PAN transmits localization information bottom-up, strengthening the feature representation capability of the network model. In addition, C3 modules were added to enhance the network's feature extraction capability; in the neck, the C3 module replaces the residual structure with multiple Conv modules.
The head outputs a vector containing the object category probability, the objectness score, and the position of the bounding box. The loss function of Yolov5s consists of three parts: the confidence loss, the classification loss, and the position loss between the target and prediction boxes. The original Yolov5s uses GIoU_loss as the bounding box regression loss to evaluate the distance between the predicted box and the ground truth box, expressed in Equations (5)–(7):

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{5}$$

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|} \tag{6}$$

$$L_{GIoU} = 1 - GIoU \tag{7}$$

where $A$ is the predicted box, $B$ is the ground truth box, $IoU$ is the intersection-over-union ratio of the predicted box and the ground truth box, $A \cap B$ is their intersection, $C$ is the smallest circumscribed rectangle enclosing both boxes, and $L_{GIoU}$ is the GIoU loss.
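A scalar Python sketch of Equations (5)–(7) for axis-aligned boxes in (x1, y1, x2, y2) form might look as follows; this is an illustration of the formula, not the vectorized implementation used inside Yolov5s.

```python
def giou_loss(pred, gt):
    """GIoU loss for two boxes (x1, y1, x2, y2), following Equations (5)-(7)."""
    # intersection |A ∩ B|
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union |A ∪ B|
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    iou = inter / union                               # Eq. (5)
    # smallest enclosing rectangle C
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c_area = cw * ch
    giou = iou - (c_area - union) / c_area            # Eq. (6)
    return 1.0 - giou                                 # Eq. (7)
```

For non-overlapping boxes the enclosing-rectangle term still produces a gradient signal (the loss exceeds 1), which is exactly the property that distinguishes GIoU from plain IoU loss.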
Compared with the IoU loss function, the GIoU loss can handle non-overlapping bounding boxes. However, GIoU loss cannot distinguish between prediction boxes of the same size that lie entirely inside the target box, since they all yield the same value. CIoU loss, on the other hand, incorporates the scale information of the bounding box aspect ratio and measures three properties: (1) overlap area, (2) center point distance, and (3) aspect ratio, which makes prediction box regression more efficient. Therefore, in this study, we use CIoU loss as the regression loss function, defined in Equations (8)–(10):

$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{8}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{9}$$

$$\alpha = \frac{v}{(1 - IoU) + v} \tag{10}$$

where $b$ and $b^{gt}$ denote the center points of the prediction and ground truth boxes, $\rho(\cdot)$ is the Euclidean distance between them, $c$ is the diagonal length of the smallest box enclosing both, $w$ and $h$ are the width and height of the prediction box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box, respectively.
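Equations (8)–(10) can be sketched for a single pair of boxes in (x1, y1, x2, y2) form as follows; again, this is a scalar illustration rather than the batched PyTorch implementation used in training.

```python
import math

def ciou_loss(pred, gt):
    """CIoU loss for two boxes (x1, y1, x2, y2): 1 - IoU + rho^2/c^2 + alpha*v,
    following Equations (8)-(10)."""
    # IoU (overlap-area term)
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)
    # squared center distance rho^2 over squared enclosing-box diagonal c^2
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term v (Eq. 9) and trade-off weight alpha (Eq. 10)
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_g, h_g = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v) if v else 0.0
    return 1 - iou + rho2 / c2 + alpha * v            # Eq. (8)
```

When the two boxes coincide, all three penalty terms vanish and the loss is zero; any center offset or aspect-ratio mismatch adds a positive penalty, which is why CIoU gives more informative gradients than GIoU for boxes nested inside one another.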
3.6. Model Evaluation
Three metrics were used to evaluate the performance of the models. First, we used precision (P), the proportion of true positives among all positive detections (Equation (11)). Second, we used recall (R), the proportion of true positives among all actual objects (Equation (12)). Third, we used mean average precision (mAP@0.5), the mean of the per-class AP values at an IoU threshold of 0.5 (Equation (13)):

$$P = \frac{TP}{TP + FP} \tag{11}$$

$$R = \frac{TP}{TP + FN} \tag{12}$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{13}$$

In Equations (11)–(13), TP is the number of correctly detected disease regions, FP is the number of healthy regions incorrectly detected as diseased, FN is the number of disease regions that were missed, AP is the area under the precision-recall curve, and N is the number of classes.
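Equations (11)–(13) and the AP computation can be sketched in plain Python. The all-points interpolation used in `average_precision` below is one common convention and is an assumption here, since the paper does not specify the interpolation scheme.

```python
def precision(tp, fp):
    """Equation (11): fraction of detections that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (12): fraction of actual diseased regions that are found."""
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve
    (all-points interpolation over increasing recall values)."""
    prec = list(precisions)
    # make the precision envelope monotonically non-increasing, right to left
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """Equation (13): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For mAP@0.5, a detection counts as a TP only when its IoU with a ground truth box is at least 0.5; the functions above then aggregate those counts into the reported metrics.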
The experiments were carried out following the improved Yolov5s-CA model (
Figure 3). To implement the mummy berry disease detection model, we used Pytorch version 1.11.0. The code was written, edited, and run using Google Colab Pro’s notebook, a subscription-based service provided by Google Research that allows users to write and run Python code in web browsers. The hardware configuration that we used was: NVIDIA Tesla P100 GPU, 16 GB RAM, 127 GB hard disk, and CUDA version 11.2. The hyperparameters of the two models were set uniformly. The initial learning rate of the model was set to 0.01, and the momentum of the learning rate to 0.9. The batch size was set to process 16 images per iteration. The resolution of the input image was set to 640 × 640 pixels, and the number of epochs was set to 300. The training, validation, and test set images were in the ratio of 8:1:1 with no overlap between the three sets. To demonstrate the effectiveness of improving the original Yolov5s, we conducted experiments with and without modifying the backbone of Yolov5s with an attention mechanism. Each experiment was validated on the field-collected test dataset.
5. Discussion
In the present study, a deep learning model based on the improved Yolov5s for automatic detection of mummy berry disease in a real wild blueberry field environment is proposed. In order to highlight important information that is relevant to the current task and improve the effectiveness of the network model, the coordinate attention (CA) module was introduced on the backbone structure of the original Yolov5s. In addition, to overcome the problem of data scarcity, we present a method for generating synthetic training images for object detection models, which greatly reduces the effort required to collect and annotate large datasets.
The overall performance of the improved network model was better than that of the original Yolov5s. A one-way ANOVA on precision found a significant difference between the means of the two network models (F(1, 299) = 18.069, p < 0.001). The precision of the improved network model reached 71.4%, which is 1.2% higher than that of Yolov5s. This result is consistent with previous studies on plant disease recognition. Yan et al. [58] compared the original Yolov5s network model with an improved Yolov5s for real-time apple disease detection, and the improved model's mAP@0.5 increased by 5.1%. Similar results and comparisons with Yolov5 models were reported in [59], where the authors found that, with the joint contribution of a coordinate attention module and SoftPool pooling, a multi-scale feature fusion (MFF) convolutional neural network (CNN) obtained the best detection accuracy, a 1.6% improvement over Yolov5s. Another study [60] developed an accurate apple fruitlet detection method with a small model size, in which a channel-pruned Yolov5s model detected apple fruitlets effectively under different conditions. For tomato disease detection, the study in [20] used a mobile phone to collect images of tomato disease in a greenhouse, and the improved SE-Yolov5 achieved a mAP@0.5 1.78% higher than the Yolov5 model.
The performance of our improved network model was evaluated on the field-collected, synthetic, and mixed datasets. A model trained only on synthetic images achieved satisfactory performance on field-collected images, but a significant increase in performance was achieved when training on a mixed dataset of field-collected and synthetic images. Our proposed Yolov5s-CA network model trained on a mixed dataset of 70% of the real field images and 80% of the synthetic images outperformed the baseline model trained only on field-collected images by 1.2% in precision and 0.5% in mAP@0.5. These results indicate that labeled real-world field-collected datasets are key to overcoming the domain gap when training a plant disease detection model with synthetic data.
The improved Yolov5s network model improved disease prediction performance under a certain degree of occlusion, leaf overlap, and different spatial scales (Table 5, Figure 7, Figure 8, Figure 9 and Figure A4). This is because the coordinate attention (CA) mechanism integrated into the backbone of the Yolov5s network suppresses less relevant information and highlights key disease-related visual features, helping identify mummy berry disease in a field environment. The lightweight CA module captures long-range dependencies along one spatial direction while retaining precise disease location information along the other, forming a pair of direction-aware and position-aware feature maps that help the model locate and identify potential targets more precisely and enhance the representation of useful information. In addition, the CIoU loss used in this study accounts for the overlap area, the center point distance, and the aspect ratio similarity between the actual and predicted boxes, which improves the network's regression accuracy and its sensitivity to small diseased organs [61]. The advantages of our method become even more obvious in large-spatial-scale scenarios where many interacting and overlapping plant parts are present in a clone-level image. Therefore, the improved network model is clearly more effective than the original Yolov5s for mummy berry disease detection and meets the needs of real-time detection under field conditions.
In general, promising results were obtained for training object detection models by combining a small number of field-collected images with synthetic datasets. The presented synthetic image generation method is essential when the collection and annotation of a large dataset are expensive and/or prohibitive. In addition, the coordinate attention (CA) module integrated into the Yolov5s backbone has contributed to the detection of mummy berry disease in a commercial lowbush blueberry field environment by efficiently discriminating important features.
6. Conclusions
This study focused on detecting mummy berry disease in a real natural environment with a deep learning method and proposed an improved Yolov5s network model. By integrating the coordinate attention module into the backbone of Yolov5s, the visual features associated with mummy berry disease are well focused and extracted, which boosts the model's performance in identifying disease symptoms. In addition, we presented a cut-and-paste method for synthetically augmenting the available dataset with annotated training images, which greatly reduces the effort required to collect and annotate large datasets. To test the generalization ability of the improved network model and demonstrate the usefulness of the synthetic dataset for enhancing deep learning-based object detection, we quantitatively compared the improved network model and Yolov5s trained on field-collected, synthetic, and mixed datasets (Table 1, Table 2 and Table 3). The model trained on the synthetic dataset combined with 70% of the real field images outperformed the baseline model trained on 100% of the real field dataset (Table 3). On all three datasets, the overall performance of the improved Yolov5s-CA network model was superior to that of the Yolov5s model, with only slightly higher computational cost. Moreover, the improved model showed better disease prediction performance under occlusion, leaf overlap, and different spatial scales. Overall, the improved network model is more effective than the original Yolov5s for mummy berry disease detection and meets the needs of real-time detection under field conditions. However, because the synthetic data generation process and the network model were trained on a small number of field-collected images with limited variability in disease symptoms and camera shooting distances, some missed or incorrect detections were observed. In addition, the presented cut-and-paste synthetic data generation method is highly influenced by the quality of the segmentation of objects from the images.
In the future, taking images with high-resolution cameras at different shooting distances will contribute to a more robust model and help address missed or incorrect detections across different occlusions and spatial scales. Furthermore, we will automate the segmentation process that extracts objects from images. Finally, we will work on deploying the models on a cloud server so that web and mobile applications can access them to make predictions.