#### 2.4.2. ResNet Model

ResNet, which emerged in 2015, marks a milestone in deep learning [32]. It adjusts the structure of traditional CNN models; its most critical component, the residual structure, adds an identity mapping to the basic network unit [33]. The residual structures are shown in Figure 3. The original fitting target of the residual structure is *H*(*x*), which becomes extremely difficult to learn as the network deepens. The residual structure therefore transforms the fitting target into the residual function *F*(*x*) (*F*(*x*) = *H*(*x*) − *x*), so that the output becomes the superposition of the fitted residual and the input, which makes learning relatively easy for the network. Residual learning is adopted for each stacked layer in ResNet and is defined as:

$$y = F(x, \{w_i\}) + x \tag{1}$$

where *x* and *y* are the input and output vectors of the residual structure in this layer, and $F(x, \{w_i\})$ represents the residual mapping to be learned. For the two-layer example in Figure 3, $F = w_2\,\mathrm{ReLU}(w_1 x)$, where ReLU denotes the ReLU activation function. In addition, the dimensions of $F(x, \{w_i\})$ and *x* must be consistent. When the input or output dimensions need to change, a linear projection $w_s$ can be applied to the shortcut connection to match the dimensions, as shown in Figure 3b:

$$y = F(x, \{w_i\}) + w_s x \tag{2}$$

**Figure 3.** Residual structure. (**a**) Residual-A structure. (**b**) Residual-B structure.
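To make Equations (1) and (2) concrete, the following is a minimal PyTorch sketch of the two residual variants. Placing batch normalization after each convolution follows the standard ResNet design; the class names, default stride, and normalization placement are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class ResidualA(nn.Module):
    """Identity-shortcut block, Equation (1): y = F(x, {w_i}) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # w1 followed by ReLU
        out = self.bn2(self.conv2(out))           # w2: F = w2 ReLU(w1 x)
        return self.relu(out + x)                 # add the identity shortcut

class ResidualB(nn.Module):
    """Projection-shortcut block, Equation (2): y = F(x, {w_i}) + w_s x."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # w_s realized as a 1x1 convolution matching channels (and stride)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # add the projected shortcut
```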

#### 2.4.3. Custom Model

The Inception-v3 structure offers the characteristics of fusing multi-scale features and accelerating network computation, while the residual structure in ResNet prevents gradient explosion, gradient vanishing, and network degradation as the number of network layers grows. Consequently, in this study, we integrated the Inception-v3 structure and the residual module and established a multi-scale information fusion CNN model based on the ResNet34 architecture, named the InceptionResNet–BOA model, or IRBOA model for short. The model enriches the rice feature information and improves recognition performance. The structure of the IRBOA model is shown in Figure 4. The input of the model is a 224 × 224 × 3 color image, and the model architecture consists of an Inception-A structure (Figure 5a), a max-pooling layer, five Residual-A structures, two Residual-B structures, an Inception-B structure (Figure 5b), and an average pooling layer. The input of the fully connected layer is the flattened feature maps of the average pooling layer, and the number of neurons in this layer equals the number of rice DOM classes to be distinguished.
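To make the stage ordering concrete, here is a minimal PyTorch sketch of the IRBOA composition. It reuses the ResidualA and ResidualB blocks sketched above and the InceptionA and InceptionB modules sketched after the discussion of Table 1 below; the channel widths, strides, and example class count are placeholders, not the authors' exact settings, which are listed in Table 1.

```python
import torch
import torch.nn as nn

class IRBOA(nn.Module):
    """Stage ordering from the text: Inception-A -> max pool -> 5x Residual-A
    -> 2x Residual-B -> Inception-B -> average pool -> fully connected layer.
    Channel widths here are placeholders; see Table 1 for the reported values."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            InceptionA(in_channels=3),                 # Figure 5a stem (80 ch in this sketch)
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            *[ResidualA(80) for _ in range(5)],        # five identity-shortcut blocks
            ResidualB(80, 128),                        # projection shortcuts widen channels
            ResidualB(128, 256),
            InceptionB(256),                           # Figure 5b asymmetric-conv module
            nn.AdaptiveAvgPool2d(1),                   # average pooling
        )
        # one output neuron per rice DOM class; LazyLinear infers the flattened size
        self.classifier = nn.LazyLinear(num_classes)

    def forward(self, x):                              # x: (N, 3, 224, 224)
        x = self.features(x)
        x = torch.flatten(x, 1)                        # flatten the pooled feature maps
        return self.classifier(x)

# usage sketch (the class count of 4 is a placeholder, not the paper's value)
model = IRBOA(num_classes=4)
logits = model(torch.randn(1, 3, 224, 224))
```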

**Figure 4.** The architecture of the IRBOA model.

**Figure 5.** Inception-v3 structure. (**a**) Inception-A structure. (**b**) Inception-B structure.

Table 1 displays the parameter settings for each layer of the IRBOA model. The Inception-A structure is a parallel combination of 1 × 1 convolution layers, 3 × 3 convolution layers, and a 5 × 5 convolution layer factorized into two 3 × 3 convolution layers; the numbers of convolution kernels from branch1 to branch4 are 8, 12, 24, 8, 12, 24, and 24, respectively. The Residual-A structure contains two convolutional layers with 3 × 3 kernels and an identity mapping, and the numbers of convolution kernels in Residual-A1 to A4 are 64, 128, 256, and 256, respectively. The Residual-B structure extends Residual-A by replacing the identity mapping with a 1 × 1 convolution that matches the number of channels between the two pathways, with 128 and 256 convolution kernels for Residual-B1 and B2, respectively. The Inception-B structure combines 1 × 1 convolution layers with asymmetric 1 × 7 and 7 × 1 convolution layers; the numbers of convolution kernels from branch1 to branch4 are 64, 128, 64, 64, 128, 192, 192, 192, 192, and 128, respectively.
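The two Inception modules can likewise be sketched in PyTorch from the kernel counts above. The assignment of the listed counts to individual branches, the pooling-branch type, and the batch normalization placement are inferences from the text and standard Inception-v3 layouts, not configurations confirmed by the paper.

```python
import torch
import torch.nn as nn

def conv_bn(cin, cout, k, p=0):
    """Convolution -> batch norm -> ReLU helper; the BN placement is assumed."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=p, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class InceptionA(nn.Module):
    """Inception-A (Figure 5a): parallel 1x1 and 3x3 branches, with the 5x5
    convolution factorized into two 3x3s; the branch assignment of the counts
    8, 12, 24, 8, 12, 24, 24 is inferred, not taken from the paper."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.branch1 = conv_bn(in_channels, 8, 1)                   # 1x1
        self.branch2 = nn.Sequential(conv_bn(in_channels, 12, 1),   # 1x1 -> 3x3
                                     conv_bn(12, 24, 3, p=1))
        self.branch3 = nn.Sequential(conv_bn(in_channels, 8, 1),    # 1x1 -> two 3x3s
                                     conv_bn(8, 12, 3, p=1),        # (factorized 5x5)
                                     conv_bn(12, 24, 3, p=1))
        self.branch4 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                     conv_bn(in_channels, 24, 1))   # pool -> 1x1

    def forward(self, x):
        # concatenate the multi-scale branches along the channel axis
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

class InceptionB(nn.Module):
    """Inception-B (Figure 5b): 1x1 plus asymmetric 1x7 / 7x1 convolutions;
    the mapping of the ten listed kernel counts onto branches is a guess."""
    def __init__(self, in_channels):
        super().__init__()
        self.branch1 = conv_bn(in_channels, 64, 1)                  # 1x1
        self.branch2 = nn.Sequential(conv_bn(in_channels, 128, 1),
                                     conv_bn(128, 64, (1, 7), p=(0, 3)),
                                     conv_bn(64, 64, (7, 1), p=(3, 0)))
        self.branch3 = nn.Sequential(conv_bn(in_channels, 128, 1),
                                     conv_bn(128, 192, (7, 1), p=(3, 0)),
                                     conv_bn(192, 192, (1, 7), p=(0, 3)),
                                     conv_bn(192, 192, (7, 1), p=(3, 0)),
                                     conv_bn(192, 192, (1, 7), p=(0, 3)))
        self.branch4 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                     conv_bn(in_channels, 128, 1))  # pool -> 1x1

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```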


**Table 1.** Parameters of the IRBOA model structure.

"–" represents that there is no corresponding parameter.
