*4.5. Comparison of Detection at Different Spatial Scales*

To compare the detection results of the models, nine images similar to those shown in Figures 7–9 were selected from the test set, as these images represented detection scenarios at different spatial scales (camera shooting distances) in the dataset. In the figures, the labels fl, fr, and lf represent infected flower, infected fruit, and infected leaf, respectively. The improved Yolov5s-CA network model proposed in this study was superior in detecting diseased plant parts at different camera shooting distances (Figures 7–9). There was almost no difference between the improved network model and Yolov5s in detecting large target plant parts taken at close distances (Figure 7a–f). The detection results focused on the plant stem (Figure 8a–f) show that the two network models differ in detecting small plant parts in the image. As shown in Figure 8d–f, the improved network model accurately detected small, partially occluded target plant parts that the original Yolov5s model could not detect. In the clone-level detection results shown in Figure 9a–f, the Yolov5s model produced more wrong and missed detections than the improved network model. In Figure 9a, Yolov5s predicted two wrong detections and nine correct detections, while the improved network model in Figure 9d predicted thirteen correct detections with one wrong detection. Both models predicted three correct detections in Figure 9b,e, but Yolov5s predicted two wrong detections while the improved network model had only one. Additionally, in Figure 9f, the improved network model predicted twelve correct detections, while Yolov5s predicted eight (Figure 9c). However, although the improved model retained satisfactory detection ability under some degree of occlusion and leaf overlap, both models failed to detect small diseased leaves imaged at long distances, as shown in Figure 9c,f.

To further verify the effectiveness of the improved Yolov5s-CA model proposed in the present study, nine test sets representing different spatial scale detection scenarios were analyzed (Table 5). The nine test sets contained 78 mummy berry disease objects in total. Yolov5s detected 47 objects and Yolov5s-CA detected 54, of which 41 (Yolov5s) and 52 (Yolov5s-CA) were mummy berry disease. The recall, precision, and misdetection rates were 52.56%, 87.23%, and 12.77% for Yolov5s and 66.67%, 96.30%, and 3.70% for the improved Yolov5s-CA.
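The rates above follow directly from the raw detection counts; a minimal sketch of the computation (function and variable names are illustrative, not part of the paper's code):

```python
def detection_rates(total_objects, detections, correct):
    """Recall, precision, and misdetection rate (in %) from raw counts."""
    recall = 100 * correct / total_objects              # share of true objects found
    precision = 100 * correct / detections              # share of detections that are correct
    misdetection = 100 * (detections - correct) / detections
    return round(recall, 2), round(precision, 2), round(misdetection, 2)

print(detection_rates(78, 47, 41))  # Yolov5s    -> (52.56, 87.23, 12.77)
print(detection_rates(78, 54, 52))  # Yolov5s-CA -> (66.67, 96.3, 3.7)
```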

From Table 5 and Figure 8d–f, it can be seen that detection is best in the plant stem scenario, with recall and precision rates of 80.95% and 100.00%, respectively. Plant parts taken at close distances were also accurately detected: both methods correctly detected the plant parts in the image, with a recall rate of 77.78%. In addition, the proposed method effectively detected mummy berry disease objects at long distances in the clone, with a recall rate of 58.33% and a precision of 93.33%.

The loss and mAP curves for the two network models tested in the present study are shown in Appendix A. The loss curves of both models had a downward trend, and the values of the loss function decreased rapidly when training on the real field and mixed datasets (Figures A1b and A3b). When the number of network iterations reached approximately 150, the loss curves slowed their rate of change and stabilized. In contrast, the loss curves in Figure A2b for the synthetic dataset initially trended downward, but after approximately 25 iterations they turned upward, indicating noisy movements and no further improvement in the values of the loss function. Analysis of the loss functions in Figures A1 and A3 shows that integrating the attention module into the Yolov5s backbone can effectively accelerate network convergence and improve model performance.


**Table 5.** Detailed detection results of mummy berry disease at different spatial scales.

<sup>1</sup> Clone is a term that refers to a genetically distinct plant (range: <1 to >25 m in diameter).

### **5. Discussion**

In the present study, a deep learning model based on the improved Yolov5s for automatic detection of mummy berry disease in a real wild blueberry field environment is proposed. In order to highlight important information that is relevant to the current task and improve the effectiveness of the network model, the coordinate attention (CA) module was introduced on the backbone structure of the original Yolov5s. In addition, to overcome the problem of data scarcity, we present a method for generating synthetic training images for object detection models, which greatly reduces the effort required to collect and annotate large datasets.
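As a rough illustration of what the CA module computes, the following NumPy sketch mimics its two direction-aware pooling paths and gating. The matrices `w1`, `w2_h`, and `w2_w` stand in for the learned 1×1 convolutions; all names and shapes here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w1, w2_h, w2_w):
    """Sketch of coordinate attention for a single feature map x of shape (C, H, W)."""
    C, H, W = x.shape
    pool_h = x.mean(axis=2, keepdims=True)                     # (C, H, 1): pool along width
    pool_w = x.mean(axis=1, keepdims=True).transpose(0, 2, 1)  # (C, W, 1): pool along height
    y = np.concatenate([pool_h, pool_w], axis=1)               # (C, H+W, 1)
    y = np.maximum(0.0, np.einsum('mc,chw->mhw', w1, y))       # shared "1x1 conv" + ReLU
    a_h = sigmoid(np.einsum('cm,mhw->chw', w2_h, y[:, :H]))    # (C, H, 1) height attention
    a_w = sigmoid(np.einsum('cm,mhw->chw', w2_w, y[:, H:]))    # (C, W, 1) width attention
    # direction-aware gating: reweight every (h, w) position of the input
    return x * a_h * a_w.transpose(0, 2, 1)
```

Gating along both spatial directions simultaneously is what gives the module its direction-aware and position-aware behaviour.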

The overall performance of the improved network model was better than that of the original Yolov5s. A one-way ANOVA test on precision found a significant difference between the means of the two network models (F(1, 299) = 18.069, *p* < 0.001). The precision of the improved network model reached 71.4%, which is 1.2% higher than that of Yolov5s. This result is consistent with previous studies conducted to recognize plant diseases. Yan et al. [58] compared the original Yolov5s network model with an improved Yolov5s for real-time apple disease target detection, and the improved Yolov5s model's mAP@0.5 increased by 5.1%. Similar results and comparisons with Yolov5 models were reported in a study [59], where the authors found that, through the joint contribution of the coordinate attention module and Softpool pooling, the multi-scale feature fusion (MFF) convolutional neural network (CNN) obtained the optimal detection accuracy with a 1.6% improvement compared to Yolov5s. Another study [60] developed an accurate apple fruitlet detection method with a small model size, and the channel-pruned Yolov5s model provided an effective method to detect apple fruitlets under different conditions. For tomato disease detection, the study in [20] used a mobile phone to collect images of tomato disease in a greenhouse, and the improved SE-Yolov5 achieved a mAP@0.5 that was 1.78% higher than the Yolov5 model.

The performance of our improved network model was evaluated on the field-collected, synthetic, and mixed datasets. Training the object detection model only on synthetic images yielded satisfactory performance on field-collected images, but a significant increase in performance was achieved when training on a mixed dataset of field-collected and synthetic images. Our proposed Yolov5s-CA network model trained on a mixed dataset of 70% of the real field images and 80% of the synthetic images outperformed the baseline model trained using only field-collected images by 1.2% in precision and 0.5% in mAP@0.5. The results indicate that labeled real-world field-collected datasets are key to improving performance by overcoming domain gaps when training a plant disease detection model with synthetic datasets.

The improved Yolov5s network model has improved disease prediction performance under a certain degree of occlusion, leaf overlap, and different spatial scale scenarios (Table 5, Figures 7–9 and Figure A4). This is because the coordinate attention (CA) mechanism integrated into the backbone of the Yolov5s network model suppresses less relevant information and highlights key disease-related visual features to help identify mummy berry disease in a field environment. The lightweight CA module captures long-range dependencies along one spatial direction, retains accurate disease location information along the other, and forms a pair of direction-aware and position-aware feature maps, which help the model locate and identify potential targets more precisely and enhance the representation of effective information. In addition, the CIoU loss used in this study takes into account the overlap area, the center point distance, and the aspect ratio similarity between the actual box and the prediction box, which improves the network's regression accuracy and sensitivity to small disease organs [61]. The advantages of our method become even more obvious in scenarios of large spatial scale, where a huge number of interacting and overlapping plant parts are present in a clone-level image. Therefore, the effectiveness of the improved network model for mummy berry disease detection makes it clearly better than the original Yolov5s and meets the needs of real-time detection of mummy berry disease under field conditions.
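For concreteness, the CIoU loss combines exactly the three terms named above: IoU overlap, normalized center distance, and aspect-ratio consistency. A minimal single-box sketch (illustrative only, not the Yolov5s implementation; boxes are `(x1, y1, x2, y2)`):

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss between a predicted box and a ground-truth box."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # Term 1: overlap area (IoU)
    inter = max(0.0, min(px2, gx2) - max(px1, gx1)) * \
            max(0.0, min(py2, gy2) - max(py1, gy1))
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # Term 2: squared center distance over squared enclosing-box diagonal
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    rho2 = ((px1 + px2 - gx1 - gx2) / 2) ** 2 + ((py1 + py2 - gy1 - gy2) / 2) ** 2
    # Term 3: aspect-ratio consistency, weighted by alpha
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

A perfectly matching prediction gives a loss of zero, and the loss grows with displacement even when boxes no longer overlap, which is what keeps the gradient informative for small, hard-to-localize disease organs.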

In general, promising results were obtained for training object detection models by combining a small number of field-collected images with synthetic datasets. The presented synthetic image generation method is essential when the collection and annotation of a large dataset are expensive and/or prohibitive. In addition, the coordinate attention (CA) module integrated into the Yolov5s backbone has contributed to the detection of mummy berry disease in a commercial lowbush blueberry field environment by efficiently discriminating important features.

### **6. Conclusions**

This study focused on detecting mummy berry disease in a real natural environment based on a deep learning method and proposed an improved Yolov5s network model. By integrating the coordinate attention module into the backbone of Yolov5s, the visual features associated with mummy berry disease are well focused and extracted, which boosts the performance of the model in identifying disease symptoms. In addition, we presented a cut-and-paste method for synthetically augmenting the available dataset to generate annotated training images, which greatly reduces the effort required to collect and annotate large datasets. To test the generalization ability of the improved network model and demonstrate the usefulness of the synthetic dataset for enhancing the performance of deep learning-based object detection models, quantitative performance comparisons of the improved network model and Yolov5s trained on field-collected, synthetic, and mixed datasets were conducted (Tables 1–3). The synthetic dataset combined with 70% of the real field dataset outperformed the baseline model trained on the full real field dataset (Table 3). In all three datasets tested, the overall performance of the improved Yolov5s-CA network model was superior to that of the Yolov5s model, with only slightly higher computational costs. Moreover, the improved Yolov5s network model improved disease prediction performance under occlusion, leaf overlap, and different spatial scales. In general, the effectiveness of the improved network model for mummy berry disease detection is better than the original Yolov5s and meets the needs of real-time detection of mummy berry disease under field conditions. However, as the synthetic data generation process and the network model were trained on small numbers of field-collected images with limited variability in disease symptoms and camera shooting distances, some missed or incorrect detection cases were observed.
In addition, the presented cut–paste synthetic data generation method is highly influenced by the quality of segmentation of the object from the image.
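The cut-and-paste idea itself is simple to sketch. The fragment below composites a segmented object patch onto a background image using its binary mask and emits a YOLO-format label; nested lists and all names are illustrative assumptions only, and a real pipeline would also randomize position, scale, and blending:

```python
def paste_object(background, patch, mask, top, left, cls=0):
    """Cut-and-paste compositing: overlay a segmented object patch onto a
    background image using its binary mask, and return the composited image
    together with a YOLO-format annotation (class, x_c, y_c, w, h; normalized).
    Images are nested lists of shape H x W x 3 purely for illustration."""
    H, W = len(background), len(background[0])
    ph, pw = len(patch), len(patch[0])
    out = [[list(px) for px in row] for row in background]  # deep copy
    for i in range(ph):
        for j in range(pw):
            if mask[i][j]:                      # copy only the object's pixels
                out[top + i][left + j] = list(patch[i][j])
    ann = (cls, (left + pw / 2) / W, (top + ph / 2) / H, pw / W, ph / H)
    return out, ann
```

Because the mask decides which pixels are copied, a poor segmentation directly degrades the realism of the composited image, which is the sensitivity noted above.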

In the future, taking images with high-resolution cameras at different shooting distances will contribute to creating a more robust model, as well as addressing the limitations of missed or incorrect detections across different occlusions and spatial scales. Furthermore, we will automate the segmentation process used to extract objects from images. Finally, we will work on deploying the models on a cloud server so that web and mobile applications can access them to make predictions.

**Author Contributions:** E.Y.O.: Conceptualization, methodology, software, formal analysis, writing original draft preparation. H.Q.: conceptualization, methodology, formal analysis, writing—original draft preparation and review and editing, supervision, project administration, funding acquisition. Y.-J.Z.: writing—review and editing, data curation. S.A.: field collection of images, writing—review and editing. F.D.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** We acknowledge funding by the National Natural Science Foundation of China (61871061). This is also a publication of the Project of Advanced Scientific Research Institute of CQUPT under Grant E011A2022329. Y.-J.Z. is supported by the USDA National Institute of Food and Agriculture (Hatch Project number ME0-22021) through the Maine Agricultural & Forest Experiment Station.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The dataset is available upon request from the corresponding author at hcchyu@gmail.com.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Appendix A**

**Figure A1.** Experimental results of each model on the field-collected dataset. (**a**) mAP@0.5 curves for the real field dataset. (**b**) Loss curves for the real field dataset.



**Figure A2.** Experimental results of each model on the synthetic dataset. (**a**) mAP@0.5 curves for the synthetic dataset. (**b**) Loss curves for the synthetic dataset.

**Figure A3.** Experimental results of each model on the mixed dataset. (**a**) mAP@0.5 curves for the mixed dataset. (**b**) Loss curves for the mixed dataset.

**Figure A4.** Comparison of correct detection between the Yolov5s and Yolov5s-CA models.
