#### *3.2. Synthetic Data Generation*

In this study, we applied the cut-and-paste technique [29] to create synthetic images and the corresponding annotations by randomly scaling, rotating, and pasting segmented images of interest onto background images. Unlike Mixup and CutMix, our method copied only the exact pixels belonging to an object, rather than all the pixels in the object's bounding box. To generate a synthetic dataset with this method, we randomly selected images of 55 flowers, 48 fruits, and 58 leaves of diseased blueberry plant tissue from the field dataset (discussed in Section 3.1) and created masks for them. A total of 83 "healthy" background photographs containing only healthy, uninfected flowers and leaves were then collected in a lowbush blueberry field at the University of Maine Blueberry Hill Farm (Jonesboro, ME, USA). To make the backgrounds more complex, seven distractor images of healthy fruits were obtained from online sources and masked as well. Masks for the objects of interest were created manually in Adobe Photoshop, unlike a previous study [29] that automated this step by training a machine learning model to segment and extract the objects.
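As an illustration of the pixel-exact copying described above, the following minimal Python sketch extracts an object cutout using its binary mask; the file names and the use of PIL are assumptions for demonstration, not the study's actual tooling.

```python
# A minimal sketch of pixel-exact object extraction, assuming a binary
# mask (white = object) exported alongside each image. File names are
# illustrative, not those used in the study.
from PIL import Image

def extract_object(image_path: str, mask_path: str) -> Image.Image:
    """Return an RGBA cutout containing only the masked object pixels."""
    image = Image.open(image_path).convert("RGBA")
    mask = Image.open(mask_path).convert("L")  # 8-bit binary mask

    # Use the mask as the alpha channel: background pixels become fully
    # transparent, so a later paste copies the object pixels only, not
    # the whole bounding box (the key difference from CutMix-style mixing).
    image.putalpha(mask)

    # Crop to the mask's bounding box to keep the cutout compact.
    return image.crop(mask.getbbox())

# Example call with hypothetical file names:
# cutout = extract_object("diseased_leaf_017.jpg", "diseased_leaf_017_mask.png")
```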

Once the image data were ready, we randomly selected a background image and resized it to 1080 × 1320 pixels (vertical orientation) or 1320 × 1080 pixels (horizontal orientation). Then, to diversify the backgrounds of the synthetic dataset, we randomly selected at most 10 segmented distractor images and iteratively resized, rotated, and pasted them onto the background at random. Under field conditions in agricultural production systems, occlusion is a common challenge that must be considered; hence, when generating the synthetic dataset, a newly added image was allowed to partially or fully overlap a previously added one. To control the degree of overlap while still including cases of occlusion, the overlap threshold was set at 25%. Finally, in an iterative process, we randomly chose a maximum of 15 segmented images of diseased leaves, flowers, and fruits, randomly resized and rotated them, and pasted them onto the background on top of the distractor images (see Figure 2). A condensed sketch of this paste step follows.
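The sketch below is a simplified rendering of the procedure above, reusing the cutouts from the previous snippet. The overlap measure (intersection over the smaller object's area), the scale and rotation ranges, and sampling with replacement are illustrative assumptions; the paper does not specify them.

```python
# Hedged sketch of the paste step with a 25% overlap-rejection test.
import random
from PIL import Image

MAX_OVERLAP = 0.25  # overlap threshold from the procedure above

def overlap_ratio(box_a, box_b):
    """Intersection area divided by the smaller box's area (assumed measure)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    smaller = min((ax1 - ax0) * (ay1 - ay0), (bx1 - bx0) * (by1 - by0))
    return (iw * ih) / smaller if smaller else 0.0

def paste_objects(background, cutouts, max_objects=15):
    """Paste randomly transformed RGBA cutouts, rejecting heavy overlap."""
    placed = []
    for cutout in random.choices(cutouts, k=max_objects):  # with replacement
        scale = random.uniform(0.5, 1.5)  # assumed scale range
        obj = cutout.resize((max(1, int(cutout.width * scale)),
                             max(1, int(cutout.height * scale))))
        obj = obj.rotate(random.uniform(0, 360), expand=True)
        if obj.width >= background.width or obj.height >= background.height:
            continue  # skip cutouts that no longer fit the canvas
        x = random.randint(0, background.width - obj.width)
        y = random.randint(0, background.height - obj.height)
        box = (x, y, x + obj.width, y + obj.height)
        if any(overlap_ratio(box, prev) > MAX_OVERLAP for prev in placed):
            continue  # enforce the 25% overlap threshold
        background.paste(obj, (x, y), obj)  # alpha channel as paste mask
        placed.append(box)
    return background, placed  # boxes serve as bounding-box annotations
```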

#### *3.3. Coordinate Attention Module*

When detecting mummy berry disease, the infection can be randomly distributed along the plant stem, producing overlapping and occluded targets, and the infected region may occupy a relatively small proportion of the image, leading to missed or incorrect detections. In our study, we introduce the coordinate attention (CA) module to help the deep learning model focus on the most significant information related to infection and ignore minor features. The CA mechanism is an efficient and lightweight module that embeds position information into the attention map, allowing the model to capture information over a large area without appreciable additional computational cost. The coordinate attention block can be considered a computational unit that increases the expressive power of the learned features. It takes an intermediate feature tensor $\mathbf{X} = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a transformed tensor with enhanced representations, $\mathbf{Y} = [y_1, y_2, \ldots, y_C]$, of the same size as $\mathbf{X}$.

**Figure 2.** The procedure of synthetic image dataset generation.

In the coordinate attention module, the operation is divided into two steps: (1) coordinate information embedding; and (2) coordinate attention generation (Figure 3). The first step factorizes global pooling into two 1D feature encoding operations that encode each channel along the horizontal and vertical directions, respectively, as given in Equation (1).

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{1}$$

where $X$ denotes the input, and $z_c^h(h)$ and $z_c^w(w)$ denote the outputs of the $c$-th channel at height $h$ and width $w$, respectively. The second step concatenates the two feature maps and sends them through a shared 1 × 1 convolutional transformation $F_1$ to obtain the intermediate feature map $f$, as formulated in Equation (2),

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right) \tag{2}$$

where $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension and $\delta$ is a non-linear activation function. The feature map $f$ is then split along the spatial dimension into two separate tensors $f^h$ and $f^w$, which are passed through another two 1 × 1 convolutions $F_h$ and $F_w$, as determined by Equation (3),

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right) \tag{3}$$

where $\sigma$ denotes the sigmoid activation function. The outputs $g^h$ and $g^w$ serve as attention weights, and the final output $\mathbf{Y}$ is generated according to Equation (4),

$$y_c(i,j) = x_c(i,j) \times g_c^h(i) \times g_c^w(j) \tag{4}$$
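For concreteness, Equations (1)–(4) can be realized in PyTorch roughly as follows. This is a minimal sketch: the channel-reduction ratio in the shared transformation $F_1$ and the choice of batch normalization with ReLU for the non-linearity $\delta$ are assumptions the section does not fix.

```python
# Minimal PyTorch sketch of the CA block in Equations (1)-(4).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)  # assumed reduction ratio
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # z^h: average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # z^w: average over H
        self.f1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared F_1
        self.bn = nn.BatchNorm2d(mid)                       # assumed BN
        self.act = nn.ReLU(inplace=True)                    # delta (assumed)
        self.f_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.f_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Step 1: coordinate information embedding (Equation (1)).
        z_h = self.pool_h(x)                      # N x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)  # N x C x W x 1
        # Step 2: concatenate along the spatial dim and transform (Equation (2)).
        f = self.act(self.bn(self.f1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)             # back to N x mid x 1 x W
        # Attention weights along each direction (Equation (3)).
        g_h = torch.sigmoid(self.f_h(f_h))        # N x C x H x 1
        g_w = torch.sigmoid(self.f_w(f_w))        # N x C x 1 x W
        # Reweight the input, broadcasting over H and W (Equation (4)).
        return x * g_h * g_w
```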

Therefore, in this study, we integrated the coordinate attention (CA) module into the Yolov5 backbone. This offers three clear advantages: (1) it captures cross-channel and position-sensitive information, which helps the model accurately locate and recognize objects of interest; (2) it is more lightweight than other attention mechanisms [26,27]; and (3) it can be flexibly plugged into object detection models such as Yolov5 with little additional computational overhead.
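A hypothetical usage example of the plug-in property in advantage (3): because the block preserves the tensor shape, it can be dropped after any convolutional stage. The layer sizes below are arbitrary and do not reflect the study's actual insertion points in the Yolov5 backbone.

```python
# Hypothetical placement of the CA block (sketched above) after a
# convolutional stage; channel count and input size are arbitrary.
import torch
import torch.nn as nn

stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.SiLU(),
    CoordinateAttention(64),  # output shape is unchanged by the block
)
x = torch.randn(1, 3, 640, 640)
print(stage(x).shape)  # torch.Size([1, 64, 320, 320])
```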

**Figure 3.** Structure of the coordinate attention (CA) module.
