#### *4.2. Rice Tiller Number Recognition Algorithm*

After the images with laser light are obtained and preprocessed, a rice tiller counting algorithm is used to obtain tiller numbers from the images. In practical applications, counting the exact tiller number is both difficult and unnecessary. The aim of gene-editing breeding is to promote effective tillering (tillers with panicles) to obtain high yields, while eliminating ineffective tillering (tillers without panicles) to reduce nutrient consumption [32]. Since panicle numbers can be statistically estimated by drone detection, we aim to statistically estimate the total number of under-canopy tillers, from which the number of effective tillers can then be estimated. Therefore, we divide the tiller numbers into several grades, and the task in this paper is to obtain the approximate range of the tiller number.

In this paper, a deep learning method based on an attentional residual network (AtResNet) is proposed. Figure 5 illustrates the network structure. Resized grayscale images are directly input into the network and processed through stacked layers. The backbone is a deep convolutional neural network (CNN) with residual connections, following ResNet [33], to alleviate overfitting. There are three convolutional blocks with similar structures, each of which first processes its input through a two-dimensional convolution operation as follows.

$$\mathbf{x}\_{i}^{l} = f\_{conv}^{l} \left( \mathbf{x}\_{i}^{l-1}; \theta^{c,l} \right) = \mathbf{x}\_{i}^{l-1} \* w^{c,l} + b^{c,l},\tag{5}$$

where $\mathbf{x}\_{i}^{l-1}$ denotes the input of the convolutional layer and $\theta^{c,l} = \{w^{c,l}, b^{c,l}\}$ are the parameters of this layer. Then, a batch normalization (BN) [31] layer is introduced to speed up network convergence, which is formulated for each mini-batch as follows.

$$\hat{\mathbf{x}}\_i^l = \frac{\mathbf{x}\_i^l - E\left[\mathbf{x}\_i^l\right]}{\sqrt{Var\left[\mathbf{x}\_i^l\right]}},\tag{6}$$

$$\mathbf{y}\_i^l = \gamma^l \hat{\mathbf{x}}\_i^l + \boldsymbol{\beta}^l,\tag{7}$$

where $\gamma^{l}$ and $\beta^{l}$ are learnable parameters, and $E[\cdot]$ and $Var[\cdot]$ denote the mean and variance, respectively. Then, a rectified linear unit (ReLU) layer is applied, which is formulated as

$$ReLU(x) = \max(0, x).\tag{8}$$

Then, a max-pooling layer is adopted, which calculates the maximum values within the receptive field.
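
To make the block structure concrete, the following is a minimal PyTorch sketch of one convolutional block (Equations (5)–(8)). The kernel size, padding, and pooling size are illustrative assumptions; the actual values are those listed in Table 2.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolutional block: Conv2d -> BatchNorm2d -> ReLU -> MaxPool2d (Eqs. (5)-(8))."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Eq. (5): two-dimensional convolution with weights w and bias b (3x3 kernel assumed)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_channels)   # Eqs. (6)-(7): batch normalization
        self.relu = nn.ReLU(inplace=True)        # Eq. (8): rectified linear unit
        self.pool = nn.MaxPool2d(kernel_size=2)  # maximum within the receptive field

    def forward(self, x):
        return self.pool(self.relu(self.bn(self.conv(x))))
```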

**Figure 5.** AtResNet model for rice tiller number recognition.

Residual connections are introduced in the second and last convolutional blocks to accelerate network training and prevent overfitting. A convolutional layer with a 1 × 1 kernel is used on the shortcut path so that the input and output sizes of the convolutional block match. Then, the output of the *l*-th convolutional block can be calculated as follows.

$$\mathbf{x}\_{i}^{l} = \sigma \left[ f\_{CB} \left( \mathbf{x}\_{i}^{l-1}; \theta^{\text{CB}} \right) + BN \left( f\_{1 \times 1} \left( \mathbf{x}\_{i}^{l-1}, \theta^{1 \times 1} \right) \right) \right], \tag{9}$$

where $f\_{CB}$ is the mapping function of the convolutional block, $f\_{1\times 1}$ is the mapping function of the 1 × 1 convolutional layer in the residual connection, and $\sigma$ denotes the ReLU function. The output of the last convolutional block is processed by an adaptive average pooling (AAP) layer and two fully connected (FC) layers, and the final output is a vector whose length equals the number of tiller number grades.
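
A hedged sketch of such a residual block is given below. The stride-2 shortcut convolution is an assumption made so that the shortcut output matches the max-pooled main path; the real channel numbers and strides follow Table 2.

```python
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Convolutional block with a 1x1 shortcut convolution, as in Equation (9)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # main path f_CB: conv -> BN -> ReLU -> max-pool (halves the spatial size)
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # shortcut f_1x1: 1x1 convolution + BN; stride 2 (assumed) matches the pooled size
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)   # sigma in Equation (9)

    def forward(self, x):
        return self.relu(self.block(x) + self.shortcut(x))
```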

Since these images are dark in most regions and the laser light spots only occupy small areas, attention mechanisms [34] are introduced to help the model focus on the informative regions. First, a channel attention block [35] is adopted to allocate different weights to different feature channels. The channel attention block aggregates spatial information through adaptive average pooling and adaptive max pooling operations. Then, a shared convolutional network generates an attention map for each aggregated feature vector, and the two maps are summed to obtain the final channel attention map. In short, these channel attention operations are summarized as follows.

$$A\_{c}(\mathbf{x}) = \sigma\_{s}\left[f\_{conv}^{c}(AvgPool(\mathbf{x})) + f\_{conv}^{c}(MaxPool(\mathbf{x}))\right],\tag{10}$$

where $\mathbf{x} \in \mathbb{R}^{W \times H \times C}$ represents the input features, and $f\_{conv}^{c}$ denotes the mapping function of the shared convolutional network, which consists of a 1 × 1 convolutional layer with $C/r$ channels, a ReLU layer, and a 1 × 1 convolutional layer with $C$ channels. $\sigma\_{s}$ denotes the sigmoid function. Finally, the calculated channel attention map $A\_{c}(\mathbf{x})$ is applied to the input feature by element-wise multiplication, as follows:

$$\mathbf{x}' = A\_{c}(\mathbf{x}) \otimes \mathbf{x}.\tag{11}$$
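
A possible PyTorch implementation of this channel attention block (Equations (10) and (11)) is sketched below; the reduction ratio r = 16 is an assumed value, not one reported in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention map A_c(x) of Equation (10), applied as in Equation (11)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # adaptive average pooling
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # adaptive max pooling
        # shared convolutional network: 1x1 conv (C/r) -> ReLU -> 1x1 conv (C)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )

    def forward(self, x):
        # sum the two attention maps and apply the sigmoid function
        attn = torch.sigmoid(self.shared(self.avg_pool(x)) + self.shared(self.max_pool(x)))
        return attn * x   # element-wise multiplication, broadcast over H x W
```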

Similarly, a spatial attention block [36] is adopted afterwards to obtain spatial attention maps that help the network focus on informative spatial regions. Channel information is aggregated by average and maximum values. The two resulting features are concatenated and then processed by a convolutional layer to produce the spatial attention map. The spatial attention operations can be summarized as follows.

$$A\_{s}(\mathbf{x}) = \sigma\_{s}\left[f\_{conv}^{s}([Avg(\mathbf{x}); Max(\mathbf{x})])\right],\tag{12}$$

where $f\_{conv}^{s}$ denotes the mapping function of the convolutional layer. Finally, the calculated spatial attention map $A\_{s}(\mathbf{x}')$ is applied to the input feature by element-wise multiplication, as follows:

$$\mathbf{x}'' = A\_{s}\left(\mathbf{x}'\right) \otimes \mathbf{x}'.\tag{13}$$
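
The spatial attention block of Equations (12) and (13) might look like the following sketch; the 7 × 7 convolution kernel is an assumption borrowed from common practice rather than a value given in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention map A_s(x') of Equation (12), applied as in Equation (13)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # aggregate channel information by average and maximum values
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        # concatenate, convolve, and apply the sigmoid function
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return attn * x   # element-wise multiplication, broadcast over channels
```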

The whole network outputs a vector $\hat{\mathbf{y}}\_{i}$, which represents the predicted probabilities that the $i$-th sample belongs to each tiller number grade. $\hat{\mathbf{y}}\_{i}$ is obtained by applying a softmax function to the output $\mathbf{y}\_{fc}$ of the last FC layer, as follows:

$$\hat{y}\_{i,j} = \frac{e^{y\_{fc,j}}}{\sum\_{k=1}^{K} e^{y\_{fc,k}}},\tag{14}$$

where $y\_{fc,j}$ and $\hat{y}\_{i,j}$ denote the $j$-th elements of $\mathbf{y}\_{fc}$ and $\hat{\mathbf{y}}\_{i}$, respectively, and $K$ is the number of tiller number grades. The network is trained by minimizing the cross-entropy loss, which is defined as follows.

$$L = -\frac{1}{N} \sum\_{i=1}^{N} \sum\_{k=1}^{K} I(y\_i = k) \log\left(\hat{y}\_{i,k}\right),\tag{15}$$

where $I(\cdot)$ is the indicator function, $y\_{i}$ is the true tiller number grade label of the $i$-th sample, and $N$ is the number of samples.
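
As a numerical illustration of Equations (14) and (15), the short PyTorch sketch below uses placeholder logits and labels to show that the explicit softmax and cross-entropy computation agrees with the built-in `F.cross_entropy` call.

```python
import torch
import torch.nn.functional as F

# placeholder logits y_fc from the last FC layer (N = 8 samples, K = 4 grades) and labels y_i
y_fc = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))

y_hat = F.softmax(y_fc, dim=1)                                    # Equation (14)
loss_manual = -torch.log(y_hat[torch.arange(8), labels]).mean()   # Equation (15)

# equivalent library call (log-softmax and negative log-likelihood in one step)
loss = F.cross_entropy(y_fc, labels)
assert torch.allclose(loss_manual, loss)
```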

### **5. Experiment and Results**

#### *5.1. Data Description*

Following the image acquisition procedure illustrated in Section 4.1, a set of images is obtained in the field using the structured light system. These images are then categorized into four classes according to the rice plant tiller number. In large-scale variant breeding, we found that the total tiller numbers of most variants lie mainly between 21 and 25 [37], and we hoped to achieve relatively accurate tiller counting in this range. Therefore, we subdivide this range finely, while tiller numbers below 21 and above 25 are grouped more coarsely. Some image examples are shown in Figure 6, and the details of these images are given in Table 1. The images are transformed to grayscale and resized to 256 × 256. Then, they are randomly split into a training set and a testing set with a ratio of 3:1.
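
The grayscale conversion, resizing, and 3:1 split described above could be implemented roughly as follows; the folder layout (`tiller_images/` with one sub-directory per grade) is a hypothetical arrangement used only for illustration.

```python
from torchvision import datasets, transforms
from torch.utils.data import random_split

# grayscale conversion and resizing to 256 x 256
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# hypothetical layout: tiller_images/<grade>/<image>.png, one folder per grade I-IV
dataset = datasets.ImageFolder("tiller_images/", transform=preprocess)

# random split into training and testing sets with a 3:1 ratio
n_train = int(0.75 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
```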

**Figure 6.** Some image examples of four tiller number grades (improved brightness).

**Table 1.** Dataset details.


#### *5.2. Experiment Setup*

We use all the images in the training set to train the AtResNet and test the model using the testing set samples. The detailed parameter settings used in the experiment are listed in Table 2.

**Table 2.** Parameter details of AtResNet.


The network is implemented in PyTorch on an NVIDIA GTX 1660 GPU. It is trained with the Adam optimizer at a learning rate of 0.001 for 50 epochs, and each mini-batch contains 64 samples. A convolutional neural network (CNN) without residual connections and attention operations, and a ResNet without attention operations, are also implemented for performance comparison. They share the same backbone structure and parameters as the AtResNet, and all experiments are repeated for 10 trials to reduce randomness.
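
Under these settings, the training loop might be configured as in the sketch below; the `AtResNet` constructor and the `train_set` from the earlier data sketch are assumptions for illustration, not code released with the paper.

```python
import torch
from torch.utils.data import DataLoader

# AtResNet and train_set are assumed to be defined as in the earlier sketches
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AtResNet(num_classes=4).to(device)                  # hypothetical constructor
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, learning rate 0.001
criterion = torch.nn.CrossEntropyLoss()                     # cross-entropy loss, Eq. (15)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)  # mini-batches of 64

for epoch in range(50):                                     # 50 epochs
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```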

#### *5.3. Results*

The experimental results of all three methods are shown in Table 3. From the recognition results, we can observe that these deep learning-based methods achieve more than 93% tiller number recognition accuracy, which is satisfactory for practical applications. In addition, the proposed AtResNet outperforms the other two methods. We also illustrate the training and testing accuracy and loss values during the training process in Figure 7. We can observe that the AtResNet shows smaller fluctuations in testing accuracy and loss, possibly because the introduction of residual connections and attention operations helps the model converge faster.

**Table 3.** Tiller number recognition accuracy (%) of three methods.


**Figure 7.** Training and testing accuracy and loss value curve. (**a**) CNN accuracy. (**b**) CNN loss. (**c**) AtResNet accuracy. (**d**) AtResNet loss. Blue line denotes training process and orange line denotes testing process.

To further explore the recognition results, we also analyze the confusion matrices of the results, as shown in Figure 8. It is observed that all three methods can accurately recognize images with grade IV tiller numbers. For grades II and III, the AtResNet achieves higher recognition accuracy than the other two methods. Figure 9 shows some examples of spatial attention maps, in which different colors represent different relative attention values. We can observe that the laser spot regions receive attention values distinct from the other dark areas, so the network can selectively focus on the informative regions.

**Figure 8.** Confusion matrix of tiller number recognition results.

**Figure 9.** Examples of spatial attention maps in AtResNet.
