*4.3. Accuracy Estimation*

**Detector evaluation:** We report the average test accuracy over 5-fold cross-validation. Two standard object detection metrics, mean average precision (*mAP*) and Recall, were used to evaluate the performance of the model; they are expressed as Equations (6) and (7), respectively.

$$mAP = \frac{1}{|\mathcal{Q}|} \sum_{q=1}^{|\mathcal{Q}|} AP(q) \tag{6}$$

$$Recall = \frac{tp}{tp + fn} \tag{7}$$

where $|\mathcal{Q}|$ is the number of queries in the set and *AP*(*q*) is the average precision score for query *q*. Our goal is to identify as many potential PWD-infected trees as possible, so we use Recall as the primary metric to evaluate how much of the potential disease area has been located. Recall is the number of correctly detected diseased trees (TP) divided by the total number of diseased trees (TP + FN).
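As a concrete illustration, Equations (6) and (7) reduce to the following minimal sketch (the function names are ours, not from the paper):

```python
def recall(tp: int, fn: int) -> float:
    """Eq. (7): Recall = TP / (TP + FN), the fraction of true
    disease objects that the detector located."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def mean_average_precision(ap_scores: list) -> float:
    """Eq. (6): mAP is the mean of the per-query average
    precision scores AP(q)."""
    return sum(ap_scores) / len(ap_scores)
```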

**The setup for real-world environment evaluation:** Operating in the real world is even more challenging, as the cropped patches contain not only PWD-infected trees but also varied background content (including "disease-like" objects). Real-world PWD objects are small, irregular, and distributed across a wide area, and the GT is a point-wise annotation (*x*, *y* geographic coordinates); converting point-wise annotations into bounding-box annotations is challenging. Our goal was to correctly identify as many disease objects as possible along with their precise positions. In the field investigation to locate a disease-infected tree, an offset error of less than 8 m was deemed acceptable, so we use the *x*, *y* geographic coordinates of each diseased tree to generate an 8 × 8 m GT bounding-box annotation for each hotspot. We assume that a disease is found correctly when the Intersection over Union (IoU) of the GT and predicted bounding boxes is larger than 0.3; otherwise, the system prediction is considered a false detection.
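The point-to-box conversion and the 0.3-IoU matching rule described above can be sketched as follows (a minimal illustration in planar map coordinates; helper names and the fixed 8 m box size are our assumptions, not code from the paper):

```python
def point_to_box(x: float, y: float, size: float = 8.0) -> tuple:
    """Expand a point-wise GT (x, y) into an axis-aligned size x size
    box centred on the point (8 m x 8 m here, matching the acceptable
    field-offset error)."""
    half = size / 2.0
    return (x - half, y - half, x + half, y + half)

def iou(a: tuple, b: tuple) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(gt_box: tuple, pred_box: tuple,
                         thr: float = 0.3) -> bool:
    """A prediction counts as a hit when its IoU with the GT box
    exceeds 0.3; otherwise it is a false detection."""
    return iou(gt_box, pred_box) > thr
```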

Specifically, we use an overlap strategy to scan the orthophotograph and crop the test patches with 25% overlap. The IoU threshold is set to 0.4 for NMS and to 0.6 for merging the predicted bounding boxes in the TTA process. Due to insufficient context information, false positives typically appear on the edges of a test patch. Therefore, we remove bounding boxes that are less than 5 pixels from the patch boundary to reduce false alarms; for more information, please refer to Section **??**.
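The two post-processing steps above, greedy NMS at an IoU threshold of 0.4 and removal of boxes within 5 pixels of the patch boundary, can be sketched as below (a self-contained illustration; the 512-pixel patch size is an assumed example, not stated in the paper):

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes: list, scores: list, iou_thr: float = 0.4) -> list:
    """Greedy non-maximum suppression: keep the highest-scoring box,
    then discard any remaining box overlapping it with IoU > iou_thr.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thr]
    return keep

def filter_edge_boxes(boxes: list, patch: int = 512,
                      margin: int = 5) -> list:
    """Drop boxes with any side closer than `margin` pixels to the
    patch boundary; such detections lack context and are usually
    false alarms."""
    return [(x1, y1, x2, y2) for (x1, y1, x2, y2) in boxes
            if x1 >= margin and y1 >= margin
            and x2 <= patch - margin and y2 <= patch - margin]
```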
