#### *3.6. Model Evaluation*

Three metrics were used to evaluate the performance of the models. First, we used precision (P), defined as the proportion of true positives among all positive detections (Equation (11)). Second, we used recall (R), defined as the proportion of true positives among all actual objects (Equation (12)). Third, we used mean average precision (mAP@0.5), the mean of the per-class average precision (AP) computed at an intersection-over-union (IoU) threshold of 0.5 (Equation (13)), reported as a percentage.

$$Precision = \frac{TP}{TP + FP} \tag{11}$$

$$Recall = \frac{TP}{TP + FN} \tag{12}$$

$$mAP = \frac{\sum AP}{n} \tag{13}$$

In Equations (11)–(13), TP is the number of correctly detected disease regions, FP is the number of healthy regions incorrectly detected as diseased, FN is the number of disease regions that were missed, AP is the area under the precision-recall curve for a class, and *n* is the number of classes.
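
To make the three metrics concrete, the following minimal Python sketch implements Equations (11)–(13) directly from TP, FP, and FN counts and per-class AP values; the function names and the toy counts are illustrative, not from the original implementation.

```python
def precision(tp: int, fp: int) -> float:
    """Equation (11): true positives over all positive detections."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Equation (12): true positives over all actual disease regions."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class: list[float]) -> float:
    """Equation (13): mean of the per-class average precision (AP)."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy example: 50 correctly detected disease regions, 10 false alarms,
# and 15 missed regions, with a single class of AP = 0.658.
print(precision(50, 10))                # 0.833...
print(recall(50, 15))                   # 0.769...
print(mean_average_precision([0.658]))  # 0.658
```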

The experiments were carried out with the improved Yolov5s-CA model (Figure 3). To implement the mummy berry disease detection model, we used PyTorch version 1.11.0. The code was written, edited, and run in Google Colab Pro notebooks, a subscription-based service provided by Google Research that allows users to write and run Python code in a web browser. The hardware configuration was: NVIDIA Tesla P100 GPU, 16 GB RAM, 127 GB hard disk, and CUDA version 11.2. The two models used identical hyperparameters. The initial learning rate was set to 0.01 and the learning-rate momentum to 0.9. The batch size was set to 16 images per iteration, the input image resolution to 640 × 640 pixels, and the number of epochs to 300. The training, validation, and test sets were split in the ratio 8:1:1 with no overlap between the three sets. To demonstrate the effectiveness of improving the original Yolov5s, we conducted experiments with and without modifying the backbone of Yolov5s with an attention mechanism. Each experiment was validated on the field-collected test dataset.
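
For reproducibility, the sketch below collects the reported training settings in code. It assumes a standard Ultralytics YOLOv5 training pipeline; the 8:1:1 split function and all names are an illustrative reconstruction, not the authors' exact code.

```python
import random

def split_dataset(image_paths, seed=0):
    """8:1:1 train/validation/test split with no overlap between sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.8 * len(paths)), int(0.1 * len(paths))
    return (paths[:n_train],                 # training set (80%)
            paths[n_train:n_train + n_val],  # validation set (10%)
            paths[n_train + n_val:])         # test set (10%)

# Hyperparameters reported in the text; all other YOLOv5 defaults unchanged.
TRAIN_CONFIG = {
    "lr0": 0.01,       # initial learning rate
    "momentum": 0.9,   # learning-rate momentum
    "batch_size": 16,  # images processed per iteration
    "img_size": 640,   # input resolution (640 x 640 pixels)
    "epochs": 300,
}
```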

#### **4. Results**

We designed and conducted five experiments. The first experiment compared the two models (Yolov5s vs. Yolov5s-CA) on disease detection when they were trained only on the field-collected data. The second experiment compared the two models when they were trained only on the synthetic data. The third experiment compared the two models when they were trained on a combined dataset of synthetic and field-collected data. The fourth experiment compared the detection speeds of the two models. The fifth experiment compared the detection performance of the two models at different spatial scales (camera shooting distances).

#### *4.1. Comparison of Disease Detection Models Trained Only on the Field-Collected Dataset*

This experiment established a baseline model and evaluated the effect of varying the amount of training data on model performance. To this end, the improved Yolov5s-CA and Yolov5s models were trained and evaluated only on field-collected images. The precision of the Yolov5s-CA model is 70.2%, the recall is 61.3% and the mAP@0.5 is 65.8%, an increase of 2.7%, 0.5% and 1.1% in precision, recall and mAP@0.5, respectively, over the Yolov5s model (Table 1). Increasing the amount of field-collected training data from 10% to 100% consistently improved model performance (Figure 6).

**Table 1.** Performance comparisons of models trained only on real field dataset.


<sup>1</sup> Bold type reflects the best precision, recall and mAP@0.5 values.


**Figure 6.** Effects of training data size. (**a**) Field-collected dataset. (**b**) Synthetic dataset.

The comparison of experimental indicators shows that the improved Yolov5s-CA model performs better than Yolov5s, confirming the effectiveness of integrating an attention mechanism into the backbone of the Yolov5s model. The attention mechanism suppresses less important features and improves the rate of correct detections.
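
For reference, the sketch below is a minimal PyTorch implementation of a coordinate-attention block, the kind of "CA" module suggested by the Yolov5s-CA name; the paper's actual module may differ in reduction ratio, activation, and placement within the backbone, so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate-attention sketch: encodes feature responses
    along height and width separately, then reweights the input."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool across width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool across height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        _, _, h, w = x.shape
        xh = self.pool_h(x)                      # (n, c, h, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (n, c, h, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * ah * aw  # reweight features along both spatial directions

# Example: a backbone feature map keeps its shape after reweighting.
# y = CoordinateAttention(64)(torch.randn(1, 64, 80, 80))  # (1, 64, 80, 80)
```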

#### *4.2. Comparison of Disease Detection Models Trained Only on the Synthetic Dataset*

This experiment evaluated models trained exclusively on synthetically generated images, in contrast to Section 4.1, where the models were trained only on a limited number of field-collected images. We created a synthetic dataset containing 1661 images (method in Section 3.2). As in Section 4.1, we varied the amount of synthetic training data to investigate its effect on model performance (Figure 6). Increasing the training data size beyond 80% of the available images yielded no further improvement in model performance. Compared with Yolov5s, the recall of the improved Yolov5s-CA model increased by 4.9%; however, precision and mAP@0.5 decreased by 5.9% and 0.4%, respectively (Table 2).

**Table 2.** Performance comparisons of the model trained only on the synthetic dataset.


<sup>1</sup> Bold type reflects the best precision, recall and mAP@0.5 values.

Moreover, comparing the precision, recall and mAP@0.5 values in Tables 1 and 2 shows that a model trained only on the synthetic dataset generalized poorly relative to one trained on the field-collected dataset. This suggests that, although synthetic images are fast to generate, the domain gap between the synthetic and the field-collected data prevents a disease detection model trained only on synthetic data from matching the performance of the field-trained models.

#### *4.3. Comparison of Disease Detection Models Trained on a Combination of Synthetic and Field-Collected Datasets*

In this experiment, we explored the effects of varying the proportions of field-collected and synthetic data in mixed training datasets. We trained the models on mixed datasets that combined 10%, 25%, 40%, 55%, or 70% of the field-collected images with 80% of the synthetic images. The aim was to reach baseline detection performance with less field-collected data and more synthetic data; a minimal sketch of the dataset-mixing procedure is given below. The evaluation results are shown in Table 3.
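
The following sketch illustrates one way to build such mixed training sets, assuming simple random sampling without replacement from each source; the function and variable names are hypothetical, not the authors' code.

```python
import random

def build_mixed_training_set(field_images, synthetic_images,
                             field_fraction, synthetic_fraction=0.8, seed=0):
    """Combine a fraction of the field-collected images (e.g., 0.10-0.70)
    with 80% of the synthetic images into a single training set."""
    rng = random.Random(seed)
    n_field = int(field_fraction * len(field_images))
    n_synth = int(synthetic_fraction * len(synthetic_images))
    mixed = rng.sample(field_images, n_field) + rng.sample(synthetic_images, n_synth)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Example: 25% of field images mixed with 80% of the 1661 synthetic images.
# mixed = build_mixed_training_set(field_paths, synthetic_paths, field_fraction=0.25)
```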

**Table 3.** Performance comparison of the model trained on a combination of synthetic and real field datasets.


<sup>1</sup> Bold type reflects the best precision, recall and mAP@0.5 values.

The improved Yolov5s-CA and Yolov5s models were trained on the mixed datasets, and precision, recall and mAP@0.5 were calculated for both. The precision of the Yolov5s-CA model is 71.4%, the recall is 59.2% and the mAP@0.5 is 66.3% (Table 3), an increase of 8.8%, 3.3% and 5.2% in precision, recall and mAP@0.5, respectively, over the Yolov5s model. Figures 7–9 show comparative results of the model predictions.

**Figure 7.** Comparison of detection results focused on the plant part. Yolov5s (**a**–**c**) and Yolov5s-CA (**d**–**f**).

**Figure 8.** Comparison of detection results focused on the plant stem. Yolov5s (**a**–**c**) and Yolov5s-CA (**d**–**f**).

**Figure 9.** Comparison of detection results focused on the clone. Yolov5s (**a**–**c**) and Yolov5s-CA (**d**–**f**).

In addition, as the number of field-collected images in the training datasets increased, a slight increase in the experimental indicators of the models was observed. In particular, the mixed model trained on 70% of the field-collected images had the best detection performance and outperformed the baseline model trained only on field-collected images (Section 4.1) by 1.2% in precision and 0.5% in mAP@0.5.
