*3.1. System Overview*

The main challenges associated with PWD detection are a lack of training data covering the various stages of infection, and the presence of background objects that resemble the object of interest. We collected a large dataset of observations from various districts in South Korea. Suppressing ambiguous background objects is particularly important in this domain [?]. The hard negative samples for PWD were further divided into six categories according to their appearance and texture. This problem-solving strategy proved beneficial for guiding an object detector, particularly when the object of interest is visually similar to other background objects.
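The hard-negative collection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect` is a hypothetical stand-in for a trained first-stage detector, and the six-way categorization in the paper was an appearance-based grouping rather than anything automated here.

```python
# Hard-negative mining sketch: run a first-stage detector over imagery known
# to contain no PWD-infected trees. Any confident detection there is, by
# definition, a false positive, and is kept as a hard negative for retraining.
# `detect(img)` is a hypothetical callable returning (box, score) pairs.

def mine_hard_negatives(images, detect, score_thr=0.5):
    """Collect confident false positives from disease-free images."""
    hard_negatives = []
    for img in images:
        for box, score in detect(img):
            if score >= score_thr:  # confident yet wrong -> "disease-like"
                hard_negatives.append((img, box, score))
    return hard_negatives
```

The collected samples would then be labeled into the six "disease-like" categories by appearance before retraining.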

As shown in Figure ??, we first augmented the training samples with suitable augmentation methods and then trained our first DNN, which was designed to distinguish background from PWD objects. The detector was robust at detecting most background regions, including healthy trees, land, buildings, and lakes, but PWD-infected trees were confused with "disease-like" areas (false positives, FPs) such as yellow land or maple trees. In this phase, it is easy to collect many "disease-like" areas, a process referred to as hard negative mining [?]. We further categorized these ambiguous FP samples into six distinct categories (Figure ??). The disease and "disease-like" samples were then passed through the DNN to perform fine-level object detection.

The UAV images in our system contained disease regions varying in size from 12 × 8 to 360 × 300 pixels. Accordingly, we applied a feature pyramid network (FPN) to capture features at arbitrary scales through both bottom-up and top-down connections. ResNet was selected as the backbone network, and the features of each residual block were processed into a pyramidal representation. The bottom-up pathway produced the feature-map hierarchy, and the top-down pathway fused higher-resolution features by upsampling the spatially coarser ones. This combined bottom-up and top-down process generates semantically stronger features. For the feature maps at each hierarchical stage, we appended a 3 × 3 convolution layer to reduce the aliasing effect of upsampling. The merged features of each FPN stage were then used for prediction. In the inference stage, we cropped 800 × 800 image patches from the large orthophotograph (with "\*.tif" extension). We assembled the results from the augmented inference images and fused them with the weighted boxes fusion algorithm to improve localization and classification accuracy. A more detailed description of each module is provided later.
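The FPN merge step described above can be sketched in NumPy. This is a minimal illustration under assumed shapes and channel counts, not the paper's network: a coarser top-down map is upsampled 2× (nearest neighbour), added to the lateral map of the finer stage, and then smoothed by a 3 × 3 convolution to reduce the upsampling aliasing.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def smooth3x3(x, kernel):
    """'Same'-padded 3x3 convolution (shared across channels) acting as the
    anti-aliasing layer appended to each FPN stage."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[:, i:i + h, j:j + w]
    return out

def merge_fpn_level(top_down, lateral, kernel):
    """One FPN merge: upsample coarse features, add the lateral map, smooth."""
    fused = upsample2x(top_down) + lateral
    return smooth3x3(fused, kernel)
```

In the real network the lateral maps come from 1 × 1 projections of ResNet residual blocks and the 3 × 3 kernels are learned; here they are plain arrays so the data flow is visible.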

**Figure 2. Left figure**: examples of hard negative sample objects. **Middle figure**: the appearance relationship between "disease-like" objects and ground-truth disease. **Right figure**: the reasons the network predicted the different types of false positive samples. In a high-quality image, branches and leaves are clear and easily identifiable. In a low-resolution image, however, real PWD-infected trees appear similar to maple trees and yellow land.
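The inference-stage fusion mentioned earlier (merging detections from the augmented passes over each 800 × 800 patch) can be sketched as a simplified weighted boxes fusion. This is an illustrative reduction of the published algorithm: it omits per-model weights and score rescaling, and clusters each box against a cluster's seed box only.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Group overlapping boxes and replace each group by the
    confidence-weighted average of its members (simplified WBF)."""
    order = np.argsort(scores)[::-1]        # visit boxes by descending score
    clusters = []                           # each cluster: list of (box, score)
    for idx in order:
        box, score = np.asarray(boxes[idx], float), scores[idx]
        for cl in clusters:
            if iou(cl[0][0], box) >= iou_thr:   # compare with cluster seed
                cl.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    fused = []
    for cl in clusters:
        w = np.array([s for _, s in cl])
        b = np.array([bx for bx, _ in cl])
        fused.append(((b * w[:, None]).sum(0) / w.sum(), float(w.mean())))
    return fused
```

Unlike non-maximum suppression, which discards overlapping boxes, this fusion averages their coordinates, which is why it tends to improve localization accuracy when combining augmented inference results.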
