*Article* **Multi-Scale Convolution-Capsule Network for Crop Insect Pest Recognition**

**Cong Xu \*, Changqing Yu, Shanwen Zhang and Xuqi Wang**

College of Information Engineering, Xijing University, Xi'an 710123, China; xaycq@163.com (C.Y.); wjdw716@163.com (S.Z.); 18811166421@163.com (X.W.)

**\*** Correspondence: xucong0623@126.com

**Abstract:** Accurate identification of crop insect pests in the field helps to control pests and benefits agricultural yield and quality. However, it is a challenging problem because crop insect pests are small, vary in size, posture, and shape, and appear against disorganized backgrounds. A multi-scale convolution-capsule network (MSCCN) is constructed for crop insect pest identification. It consists of a multi-scale convolution module, a capsule network (CapsNet) module, and a Softmax classification module. Multi-scale convolution is used to extract multi-scale discriminative features, CapsNet is employed to encode the hierarchical structure of the size-variant insect pests in the crop images, and Softmax is adopted for insect pest identification. MSCCN combines the advantages of the convolutional neural network (CNN), CapsNet, and multi-scale CNN; it can learn robust multi-scale features from pest images of different shapes and sizes and identify pests that vary in morphology. Experimental results on the crop pest image dataset show that the method achieves a recognition rate of 91.4%.

**Keywords:** crop insect pest identification; convolutional neural network (CNN); capsule network (CapsNet); multi-scale convolution-capsule network (MSCCN)

#### **1. Introduction**

To control pests, avoid economic losses, and reduce pesticide costs, early detection and identification of crop pests is an important task. However, it is challenging to detect and recognize crop pests in the field, because the insect pest images are photographed in complex crop environments. The images contain not only insect pests of various types, sizes, postures, and shapes, but also changeable light, viewpoints, and irregular backgrounds. Moreover, the insect pest is small in proportion to the whole image, and its color and texture are similar to those of the background in the crop image, as shown in Figure 1. Therefore, traditional pattern recognition and image processing algorithms usually achieve low identification accuracy [1].

With the development of computer vision technology, computer computing power, and various algorithms of artificial intelligence (AI) [1,2], machine learning [3,4], and modern digital and deep learning [5], many crop pest detection and recognition methods have been presented [6]. Martineau et al. [7] investigated forty-four studies on this topic, covering many methods of image capture, feature extraction, and classification as well as the tested datasets, and discussed the questions that might still remain unsolved. Costa et al. [8] constructed a knowledge-based crop pest identification system, which provides a convenient way for farmers to manage crop pests and diseases. Liu et al. [9] introduced the definition and connotation of crop disease-pest knowledge and analyzed and classified the key techniques and methods of crop disease-pest detection and recognition in recent years, including knowledge representation, feature extraction and fusion, reasoning, and classifiers. Huo et al. [10] introduced the research progress of disease-pest identification and of pest number and position detection, together with an existing dataset and some methods used in previous articles. Li et al. [11] proposed a few-shot cotton pest recognition method and verified its effectiveness and feasibility on two datasets, namely that of the National Bureau of Agricultural Insect Resources and a dataset of natural scenes. The results of the above methods show that traditional pest identification methods rely on hand-crafted features and matching templates, on shallow learned features with limited representation power, and on low-level features only, while ignoring the hierarchical features of pest images, so their recognition rate and generalization ability are limited.

**Citation:** Xu, C.; Yu, C.; Zhang, S.; Wang, X. Multi-Scale Convolution-Capsule Network for Crop Insect Pest Recognition. *Electronics* **2022**, *11*, 1630. https://doi.org/10.3390/electronics11101630

Academic Editor: Javid Taheri

Received: 9 April 2022 Accepted: 12 May 2022 Published: 20 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Figure 1.** Crop insect pest image examples. (**a**) Maize army worm. (**b**) Cotton bollworm. (**c**) Bean larvae.

Convolutional neural network (CNN) has made remarkable achievements in various target detection and recognition tasks. It has been widely used in pest detection and recognition as it can automatically learn the essential features of the pest images from a large amount of data and produce fewer high-quality candidate features for pest recognition [12]. Ai et al. [13] used CNN to automatically identify crop disease pests as they trained the Inception-ResNet-v2 model, utilizing the public dataset of the AI Challenger Competition in 2018, with 27 disease images of 10 crops, and designed and implemented the Wechat applet of crop disease-insect pest recognition. Xie et al. [14] proposed an automatic crop pest classification method by learning multi-level features from a large number of unlabeled image patches using unsupervised feature learning methods and utilized the filters in multiple scales coupling them with several pooling granularities. Labaa et al. [15] proposed a crop pest recognition method based on CNN, improved by combining different technologies such as CNNs and REST services. Li et al. [16] proposed a fine-tuned GoogLeNet model to deal with the complicated backgrounds presented by farmland scenes and achieved better pest classification results than the original model.

Compared to traditional handcraft-feature extraction algorithms, CNN is effective in image classification tasks. It can automatically learn features during the training process, avoiding the error generated by manual selection, but its pooling operation (downsampling) can only give rough location information, allowing the model to ignore some small spatial changes and failing to accurately learn the location association of different objects, such as the location, size, direction, and even deformation degree and texture of entities in a region. Although the pooling operation of CNN can maintain the invariability of the location and direction of the entity, it will lose the characteristics of small pests, so the recognition rate of crop pests may not be high. Therefore, pooling operations may cause some problems: they may lose the low-level features and spatial hierarchical features, and the data of small pests (under certain conditions) may be lost after down-sampling.

The capsule network (CapsNet) is a new kind of deep learning architecture aiming to encode the features of the images and their spatial relationships [17]. It can overcome the shortcomings of CNN. It only uses the shallow CNN to preserve the spatial information, and can capture not only the discriminant features, but also the underlying relationships between these features. A capsule is a group of neurons whose output represents the various perspectives of an entity, such as pose, texture, scale, or the relative relationship between the entity and its parts. In this case, CapsNet is more robust to affine transformations and achieves good results with fewer training samples. Paoletti et al. [18] constructed a CapsNet for hyperspectral image classification, where several spectral-spatial capsules are used to learn HSI spectral-spatial features while significantly reducing the network complexity. Mensah et al. [19] proposed Gabor CapsNet for plant disease detection and evaluated its performance on three publicly available plant disease datasets containing disease leaf images with high similarity and background objects. Wang et al. [20] proposed a multiscale convolutional CapsNet for hyperspectral image classification, which is composed of a multi-scale convolutional layer, a single-scale convolutional layer, a PrimaryCaps layer, a DigitCaps layer, and a fully connected layer. Peker [21] proposed a multi-channel CapsNet ensemble for plant disease detection and individually trained the network on the image set. Thenmozhi et al. [22] proposed a deep CNN model to classify insects, where transfer learning was applied to fine-tune the pre-trained models.

From the above analysis, conventional CNN-based crop pest image classification suffers from quite limited training samples, which leads to overfitting and unsatisfactory performance in describing the correlation between features. CapsNet can address the disadvantages of CNN, but the representation capability of the low-level features extracted by its shallow CNN is limited. Therefore, neither the original CNN nor the original CapsNet is well suited to crop pest recognition tasks. Inspired by the multi-scale convolutional CapsNet and the multi-channel CapsNet, a multi-scale convolution-capsule network (MSCCN) is constructed for crop insect pest recognition, combining the advantages of the traditional CNN and CapsNet. It consists of a multi-scale convolutional module, a CapsNet module, and a Softmax classification module. The main contributions of this work are as follows:


The remainder of the paper is organized as follows. Section 2 reviews the related works including Inception and CapsNet. MSCCN is introduced in detail in Section 3. Experiments are presented in Section 4. Section 5 concludes the paper and puts forward some opinions and suggestions for the future research direction.

#### **2. Related Methods**

*2.1. Inception*

Inception is a module in GoogLeNet that has been validated to perform well in complex image classification tasks. It extracts features at different scales from the input images by increasing the number of convolutional kernels and introducing multi-scale convolutional kernels. The Inception structure has been improved over time in terms of speed and accuracy. There are multiple versions of Inception: Inception V1, Inception V2, Inception V3, Inception V4, and Inception-ResNet, each of which is an iterative evolution of the previous version. In general, a lower version of the Inception module may work better in classification tasks. Figure 2 shows Inception V1. As shown in Figure 2, 1 × 1, 3 × 3, and 5 × 5 convolutional kernels convolve the outputs of the upper layer at the same time to form a multi-branch structure. Feature maps obtained from the different branches are then concatenated to obtain different classification features of the input images. Processing these operations in parallel and combining all the results yields a better image representation. To give the feature maps the same size, each branch adopts the same padding mode with a stride of 1. The 1 × 1 convolution is applied before the 3 × 3 and 5 × 5 convolutions and after max-pooling to reduce the amount of computation.

**Figure 2.** The structure of Inception V1.
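The saving obtained from the 1 × 1 reduction can be illustrated with a simple weight count. The following is only a sketch under stated assumptions: the channel widths (192 input, 16 reduced, 32 output channels, and the per-branch widths) are hypothetical values chosen for illustration and are not taken from this paper, and biases are ignored.

```python
def conv_weights(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def same_padding_size(size: int, stride: int = 1) -> int:
    """Spatial output size under 'same' padding: ceil(size / stride)."""
    return -(-size // stride)


# Hypothetical channel widths, chosen only for illustration.
IN_CH, REDUCED_CH, OUT_CH = 192, 16, 32

# 5 x 5 convolution applied directly to the input feature maps.
direct = conv_weights(5, IN_CH, OUT_CH)

# 1 x 1 reduction first, then the 5 x 5 convolution on fewer channels.
reduced = conv_weights(1, IN_CH, REDUCED_CH) + conv_weights(5, REDUCED_CH, OUT_CH)

# With 'same' padding and stride 1 every branch keeps the spatial size,
# so the branch outputs can be concatenated along the channel axis.
branch_channels = {"1x1": 64, "3x3": 128, "5x5": OUT_CH, "pool": 32}
concat_channels = sum(branch_channels.values())

print(direct, reduced, same_padding_size(128), concat_channels)
```

With these illustrative widths the reduced branch needs roughly one tenth of the weights of the direct 5 × 5 convolution, which is the motivation for placing the 1 × 1 convolution first.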

#### *2.2. Capsule Network (CapsNet)*

CNN is composed of multiple neurons stacked together, and it takes a lot of computation to compute convolution between neurons, so the pooling operation is used to reduce the size of the network layer. However, classification information may be lost by pooling. CapsNet is constructed to overcome the limitations and shortcomings of CNN. It can encode spatial information and calculate the existence probability of objects, and is good at dealing with changeable object recognition with different positions, sizes, directions, deformations, speeds, textures, and other features. Its architecture is shown in Figure 3, consisting of a traditional convolution layer, a primary capsule layer, and a digital capsule layer.

**Figure 3.** The architecture of CapsNet.

In CapsNet, the primary capsule layer transforms the scalar representation of the preceding layer into a vector representation, so its output is a vector. The digital capsule layer uses the dynamic routing algorithm to update the network. The final output is a set of vectors, and the length of each vector is the probability of belonging to a class.

#### **3. Multi-Scale Convolution-Capsule Network (MSCCN)**

Motivated by the fact that crop insect pests are changeable, with various postures and sizes ranging from less than 1 mm to more than 100 mm, a multi-scale convolution-capsule network (MSCCN) is proposed for crop insect pest recognition. Its architecture is shown in Figure 4.

**Figure 4.** Architecture of MSCCN.

The input image is resized to 128 × 128, 96 × 96, and 64 × 64, and the three versions are processed in parallel. MSCCN first extracts high-level features describing the pest images through three multi-scale convolutions and three Inception modules. Three CapsNets then use these features to construct the vector-based capsule structure and form the final discriminative feature vector of the pests in the image, which is fed directly to the final Softmax classifier without any feature reduction. Finally, pest recognition is performed by the Softmax classifier. MSCCN is designed as an end-to-end structure for easy convolution-CapsNet training and deployment.
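The parallel multi-scale input can be sketched as follows. This is an illustration only: nearest-neighbor resampling is an assumption made for brevity (the interpolation method is not stated in the text), and the function names `resize_nearest` and `multi_scale_inputs` are hypothetical.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2D image (list of rows) by nearest-neighbor sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def multi_scale_inputs(image, sizes=(128, 96, 64)):
    """Return one resized copy of the image per branch of the network."""
    return [resize_nearest(image, size, size) for size in sizes]


# Tiny synthetic single-channel image standing in for a pest photograph.
image = [[row * 256 + col for col in range(256)] for row in range(256)]
branches = multi_scale_inputs(image)
print([(len(b), len(b[0])) for b in branches])
```

Each branch then receives its own copy of the image, so small pests are seen at a coarser scale in the 64 × 64 branch and with more detail in the 128 × 128 branch.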

In CapsNet, three multi-dimensional primary capsules are employed to encode the hierarchical multi-scale features extracted by the three multi-scale convolutions, yielding 12D, 8D, and 4D capsules, respectively. Then, the prediction vectors are computed through the different weight matrices $W$, $V$, and $U$ as follows:

$$\begin{cases} \hat{u}^{1}_{j|i} = W \cdot u^{1}_{i} \\ \hat{u}^{2}_{j|i} = V \cdot u^{2}_{i} \\ \hat{u}^{3}_{j|i} = U \cdot u^{3}_{i} \end{cases} \tag{1}$$

where $u^1$, $u^2$, and $u^3$ are the feature maps of the three multi-scale convolutions; $W$, $V$, and $U$ are the weight matrices applied to $u^1$, $u^2$, and $u^3$, respectively; $u^k_i$ is the $i$-th primary capsule of the $k$-th branch; $\hat{u}^k_{j|i}$ is the prediction vector between the $j$-th parent capsule and the $i$-th child capsule of the $k$-th branch; and $\hat{u}$ is the output of this multi-scale capsule encoding structure, which concatenates the results of the three branches by the function *concat*().

The classification features are encoded using a weight matrix between the $i$-th child capsule and the $j$-th parent capsule. During training, the part–whole relationship for each capsule pair is learned by adjusting the transformation matrices $W$, $V$, and $U$.
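The computation of the prediction vectors and their concatenation across branches can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes one transformation matrix per (child, parent) pair, as in the original CapsNet formulation, whereas Equation (1) writes a single symbol per branch; the tiny dimensions and the names `matvec`, `predict`, and `concat` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def predict(weights, child_capsules):
    """Prediction vectors u_hat[i][j] = weights[i][j] . u_i for one branch.

    weights[i][j] is the matrix linking child capsule i to parent capsule j.
    """
    return [
        [matvec(w_ij, u_i) for w_ij in w_i]
        for w_i, u_i in zip(weights, child_capsules)
    ]


def concat(*branches):
    """Concatenate the prediction vectors of all branches along the child axis."""
    merged = []
    for branch in branches:
        merged.extend(branch)
    return merged


# Branch 1: one 3D child capsule, one parent, 2D prediction vectors.
u1 = [[1.0, 2.0, 3.0]]
W = [[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]]
# Branch 2: one 2D child capsule, same parent, 2D prediction vectors.
u2 = [[4.0, 5.0]]
V = [[[[2.0, 0.0], [0.0, 2.0]]]]

u_hat = concat(predict(W, u1), predict(V, u2))
print(u_hat)
```

Because every branch maps its capsules into the same parent space, child capsules of different dimensions (12D, 8D, and 4D in the text) can be merged into one list of prediction vectors before routing.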

There is a dynamic routing between the multi-scale capsule encoding unit and the digit capsule layer. It is used to ensure that the outputs of child capsules are sent to the proper parent capsules. The prediction vectors $\hat{u}$ in the previous section are computed through a weight matrix. The relationship between each parent capsule $s_j$ and prediction vector $\hat{u}$ is determined by dynamic routing. All the prediction vectors are denoted as $\hat{u}_{j|i}$ $(i = 1, \cdots, n)$. In the first iteration, $c^1_i = \frac{1}{n}$ and $s^1_j = \sum_{i=1}^{n} c^1_i \hat{u}_{j|i}$, where $\sum_j c_j = 1$ and $c_j \geq 0$. Then, the routing coefficients $c^1$ are adjusted to $c^2$ by the function *update*() as follows:

$$\begin{array}{l} b^{i+1} = b^{i} + \hat{u}_{j|i} \cdot v_j \\ c^{i+1} = \operatorname{softmax}(b^{i+1}) \end{array} \tag{2}$$

where $b$ is the coupling coefficient before normalization, with $b^1 = 0$, and $v_j$ is the $j$-th output capsule of the parent capsule layer, calculated by

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|} \tag{3}$$

where $s_j$ is the total input vector of the $j$-th capsule, obtained as the weighted sum over the child capsules $i$ connected to the $j$-th parent capsule; $\frac{\|s_j\|^2}{1 + \|s_j\|^2}$ is the reduction coefficient of $s_j$; $\frac{s_j}{\|s_j\|}$ is the normalized unit vector of $s_j$; $s_j = \sum_i c_{ij} \hat{u}_{j|i}$; and the prediction vector $\hat{u}_{j|i}$ is obtained by multiplying the output features of the BN layer with the weight matrix of the primary capsule layer.
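The squashing function and the routing loop described above can be sketched in a few lines. This is a minimal illustration under assumptions rather than the authors' code: the softmax is taken over the parent capsules of each child, following the original CapsNet routing procedure, and the input layout `u_hat[i][j]` (prediction vector from child `i` to parent `j`) is a convention chosen here.

```python
import math


def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(values):
    """Numerically stable softmax of a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """Route child capsules to parent capsules by agreement.

    u_hat[i][j] is the prediction vector from child i to parent j.
    Returns the list of parent output vectors v_j.
    """
    n_child, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_child)]  # routing logits, start at 0
    v = []
    for _ in range(iterations):
        c = [softmax(b_i) for b_i in b]  # coupling coefficients per child
        v = []
        for j in range(n_parent):
            s_j = [
                sum(c[i][j] * u_hat[i][j][d] for i in range(n_child))
                for d in range(dim)
            ]
            v.append(squash(s_j))
        # Increase the logit where prediction and parent output agree.
        for i in range(n_child):
            for j in range(n_parent):
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v


# Two children agree on parent 0 and disagree on parent 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [-1.0, 0.0]]]
outputs = dynamic_routing(u_hat, iterations=3)
print([math.sqrt(sum(x * x for x in v_j)) for v_j in outputs])
```

In this toy example the parent on which both children agree ends up with the longer output vector, which is the behavior routing-by-agreement is meant to produce.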

The objective function of MSCCN is expressed as follows:

$$L_c = \sum_{k=1}^{\mathrm{CNum}} \left[ T_k \max(0, m^{+} - \|V_k\|)^2 + \lambda (1 - T_k) \max(0, \|V_k\| - m^{-})^2 \right] \tag{4}$$

where the former term is the loss of the correctly classified digital capsule, the latter term is the loss of the wrongly classified digital capsules, $m^{+} = 0.9$ and $m^{-} = 0.1$ are the default category prediction values, $\lambda = 0.5$ is the default balance coefficient, $T_k$ is the label of the data category ($T_k = 1$ for the correct label and $T_k = 0$ for an incorrect label), CNum is the number of categories, $\|V_k\|$ is the length of the vector representing the probability of discriminating the pest as the $k$-th class, and the total loss is the sum of all digital capsule loss functions.
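A small numerical sketch of this margin loss is given below. It is written in the usual CapsNet form, in which each hinge term is squared, and uses the default values stated above; the function name `margin_loss` and the example capsule lengths are hypothetical.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss over the output capsule lengths of one sample.

    lengths[k] is the length of the k-th output capsule and
    target is the index of the correct class.
    """
    loss = 0.0
    for k, length in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0
        # Penalize a short capsule for the correct class.
        loss += t_k * max(0.0, m_plus - length) ** 2
        # Penalize long capsules for the other classes, down-weighted by lam.
        loss += lam * (1.0 - t_k) * max(0.0, length - m_minus) ** 2
    return loss


confident = margin_loss([0.95, 0.05, 0.05], target=0)
uncertain = margin_loss([0.5, 0.3], target=0)
print(confident, uncertain)
```

A confident, correct prediction incurs no loss, whereas a short capsule for the true class or a long capsule for a wrong class is penalized.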

The main processes of MSCCN-based crop pest recognition are shown in Figure 5.

**Figure 5.** Crop pest recognition process based on MSCCN.

First, all pest images are divided into a training set and a test set. Both sets are preprocessed so that MSCCN can extract multi-scale features. The preprocessed images are then used as the input of the multi-scale convolution, and the network automatically extracts multi-scale features of color, texture, and shape from the training samples. The multi-scale CapsNet is used to encode the multi-scale convolution features. Each layer of CapsNet is composed of neurons whose inputs are vectors. The vector length represents the approximate probability that the pest is present, and the vector direction represents the instantiation parameters of the pest. The output of a capsule is routed only to the corresponding capsule in the next layer, which therefore receives a clearer input signal and can accurately determine the posture of the pest. The feature combination method is adopted for the different feature vectors. The MSCCN structure and network parameters are then set up. After training, the classification model classifies and recognizes pest images with Softmax. The $k$-dimensional feature vector $Y_i$ extracted by CapsNet is input into the trained Softmax classifier, as follows:

$$P(Y = i \mid x) = \operatorname{Softmax}(Y_i) = \frac{\exp(\theta_i Y_i)}{\sum_{k=1}^{K} \exp(\theta_k Y_k)} \tag{5}$$

where $P$ is the probability that the feature vector $x$ belongs to the $i$-th category, $K$ is the total number of categories, and $\theta$ denotes the weight terms.
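The Softmax classification step can be illustrated with a short sketch. It assumes that each class score is the product of a per-class weight and the corresponding feature value; the names `theta`, `class_probabilities`, and `predict_class`, and all numbers, are hypothetical.

```python
import math


def class_probabilities(features, theta):
    """Softmax over the weighted class scores theta[k] * features[k]."""
    scores = [w * y for w, y in zip(theta, features)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(features, theta):
    """Index of the most probable class."""
    probs = class_probabilities(features, theta)
    return max(range(len(probs)), key=probs.__getitem__)


features = [0.2, 0.9, 0.1]   # illustrative capsule-derived features
theta = [1.0, 1.0, 1.0]      # illustrative weights
print(class_probabilities(features, theta), predict_class(features, theta))
```

The probabilities sum to one, and the predicted pest category is the class with the largest probability.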

From the above analysis, the pseudocode of MSCCN is given in Algorithm 1:

**Algorithm 1:** Multi-scale CapsNet

**Input:** training pest images; parameters *η* = 0.001, *α* = 0.9, *β* = 0.99, and *ε* = 0.00001; batch size = 128; number of iterations = 3000; number of dynamic routing iterations = 3; threshold *ρ*.

1. Image processing.
2. Reshape each image into three images with different sizes.
3. For each iteration:
4. For $k$ = 1 to 3:
   - Carry out the $k$-th convolution with different kernels of sizes 7 × 7, 5 × 5, and 3 × 3.
   - Carry out the $k$-th Inception convolution.
   - Carry out the procedure routing($k$, $\hat{u}_{j|i}$, $r$, $l$) of the $k$-th CapsNet:
     - (1) for all capsules $i$ in layer $l$ and capsules $j$ in layer $(l + 1)$: $b_{ij} \leftarrow 0$;
     - (2) for $r$ iterations do:
       - for all capsules $i$ in layer $l$: $c_i \leftarrow \operatorname{softmax}(b_i)$;
       - for all capsules $j$ in layer $(l + 1)$: $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$;
       - for all capsules $j$ in layer $(l + 1)$: $v_j \leftarrow \operatorname{squash}(s_j)$ by Equation (3);
     - (3) for all capsules $i$ in layer $l$ and capsules $j$ in layer $(l + 1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$;
     - (4) return $v_j$.
   - Return $v_k$.
5. Integrate $v_k$.
6. Compute $v_k$ × Mask.
7. Input $v_k$ × Mask into the Softmax classifier.
8. Calculate the loss $L_c$ by Equation (4).
9. If $L_c$ is more than *ρ*, return to step 3.
10. Stop iterations.

#### **4. Experiments**

To evaluate the performance of the proposed MSCCN-based method, extensive experiments were conducted on the rice pest image set of the IP102 dataset, and the method was compared with five existing mainstream deep learning methods: AlexNet [12,23,24], CapsNet [19], MS-CapsNet [20], DCNN + transfer learning (DCNNTL) [22], and ResNet50 [25]. AlexNet consists of five convolutional layers, three max-pooling layers, and three fully connected layers. ResNet50 is composed of 49 convolutional layers and a fully connected layer, where the residual network unit contains cross-layer connections. MS-CapsNet consists of a multi-scale convolutional layer, a single-scale convolutional layer, a PrimaryCaps layer, a DigitCaps layer, and a fully connected layer. DCNNTL consists of six convolutional layers, five max-pooling layers, and a fully connected layer. It uses VGG16 for transfer learning to pre-train the deep CNN model on the constructed dataset. In all models, categorical cross-entropy is used as the loss function, stochastic gradient descent (SGD) is used as the optimizer, and a Softmax classifier is used in the output layer to classify pest categories.

The hardware and software conditions of the experiments are as follows: the operating system is 64-bit Microsoft Windows 10 on an x64-based processor; the CPU is an Intel i5-6200U; the GPUs are an NVIDIA GeForce RTX 2080 SUPER (8 GB) and an NVIDIA GeForce RTX 2080 Ti (11 GB GDDR6) on an Intel i7/i8/i9 motherboard; the programming language is Python 3.7, run in Jupyter Notebook; and the deep learning framework is Keras 2.3.0.

#### *4.1. IP102 Dataset*

IP102 (https://github.com/xpwu95/IP102 (accessed on 7 April 2019)) is often used to test insect pest detection and recognition methods based on deep learning [26]. It contains 75,222 images belonging to 102 common crop insect pest categories, with an average of 737 images per class. Most images were collected by common image search engines and cover different growth stages, and about 19,000 images were annotated with bounding boxes for pest detection. Some images are shown in Figure 6. Figure 6 shows that the insect pest images were collected in the field with various sizes, shapes, and complex backgrounds, and that a pest has different sizes, postures, and shapes at different stages of its life cycle [23].

**Figure 6.** Insect pest image examples in IP102. (**a**) The first 20 original images. (**b**) Different forms of a kind of insect pest at different stages of the life cycle. (**c**) Nine kinds of rice annotated images.

IP102 contains 5701 original images of nine rice pests. Their pest image names and corresponding numbers and serial numbers are shown in Table 1, where the maximum number is 1115, and the minimum number is 369. The number of each kind of rice pest image is increased to more than 500 by the augmentation algorithm. All images are converted to JPEG format. Repeated or damaged images are deleted. In this study, all rice insect pest images are used to conduct rice insect pest recognition experiments, where each original image is firstly reshaped to 128 × 128, closer to the actual application, because the sizes of the collected images are not uniform. Finally, 5000 preprocessed images are used for experiments, except for 1034 poor quality or negative images.



Differences in data collection conditions, illumination, and the parameter settings of the digital scanner often cause color differences between digital pest images. Size and color normalization can not only ensure the color consistency of the original images, but also preserve the biological information in the pest images, thereby improving the recognition performance of the model. As the pest image sizes in the dataset differ, ranging from 220 × 220 to 512 × 512, the input original image and its ROI label are uniformly resized to 128 × 128. At the same time, the pixel values of all images are scaled to between 0 and 255 when entering the channel expansion module. After channel expansion, min-max normalization is applied to each channel of the pest images to normalize the range of pixel values to between 0 and 1, completing the channel expansion of the original pest images and better meeting the input requirements of the deep learning network. Min-max normalization is defined as follows:

$$x_{nor} = \frac{x - X_{\min}}{X_{\max} - X_{\min}} \tag{6}$$

where *x* is the pixel value of the original image, *X*min is the minimum value of the pixel value set, and *X*max represents the maximum value of the pixel value set.
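A minimal sketch of this normalization for one image channel is shown below. The handling of a constant channel (where the maximum equals the minimum) is an assumption added here to avoid division by zero; it is not discussed in the text.

```python
def min_max_normalize(pixels):
    """Scale the pixel values of one channel to the range [0, 1]."""
    low, high = min(pixels), max(pixels)
    if high == low:
        # Constant channel: assumption, map every pixel to 0.
        return [0.0 for _ in pixels]
    return [(p - low) / (high - low) for p in pixels]


channel = [0, 51, 102, 255]  # illustrative pixel values in [0, 255]
print(min_max_normalize(channel))
```

After this step the smallest pixel value of the channel maps to 0 and the largest to 1, regardless of the original range.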

The 5-fold-cross-validation (5FCV) strategy is used to evaluate the performance of the proposed model. The 5FCV experiment is conducted 50 times, and the results are the average of 50 5FCV experiments.
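The repeated 5FCV protocol can be sketched as follows. This is an illustration under assumptions: the shuffling scheme, the use of the repetition index as the random seed, and the `evaluate` callback (which would train a model on the training indices and return its accuracy on the test indices) are hypothetical choices, not details given in the text.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs over shuffled sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def repeated_cross_validation(evaluate, n_samples, k=5, repeats=50):
    """Average the score returned by evaluate over repeats runs of k-fold CV."""
    scores = []
    for repeat in range(repeats):
        for train, test in k_fold_splits(n_samples, k, seed=repeat):
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


# Placeholder evaluation: returns the fraction of samples held out for testing.
def dummy_evaluate(train, test):
    return len(test) / (len(train) + len(test))


print(repeated_cross_validation(dummy_evaluate, n_samples=5000, repeats=2))
```

With 5000 images, each fold holds out 1000 images for testing and trains on the remaining 4000, and every image is used for testing exactly once per repetition.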

#### *4.2. Experiment Results*

Rice pest recognition relies on MSCCN to extract the pest image features and on the Softmax classifier to complete the pest identification. MSCCN is a deep learning algorithm. The CapsNet module in MSCCN uses activity vectors to represent the instantiation parameters of specific pest types, and the length of an output vector characterizes the probability that the corresponding pest is present in the current input. After preprocessing, the pest images are input into MSCCN, which uses multi-scale convolution to extract the pest image features, trains the image classification, and predicts the output vector based on the routing-by-agreement protocol. In this paper, the characteristics of MSCCN are used to address the difficulty of crop pest identification caused by multiple overlapping pests. The most vital step, according to routing-by-agreement, is to analyze the pest images with overlapped objects. The parameters of MSCCN are initially set as follows: the batch size is 128, the weight decay factor is 0.00001, the number of training epochs is 100, and the initial weights are drawn randomly from a Gaussian distribution with a mean of 0 and a variance of 1. In the multi-scale convolution module, the dropout rate is set to 0.4, and the learning rate is initialized to 1 × 10<sup>−3</sup>, decreasing 0.05 times as the number of iterations increases. In the CapsNet module, the number of training iterations is 3000, the number of dynamic routing iterations is 3, and the dropout ratio is 0.9. Adam is employed as the gradient descent algorithm to perform the training. In Adam, the initial parameters are set as *η* = 0.001, *α* = 0.9, *β* = 0.99, and *ε* = 0.00001.

Three hyperparameters of MSCCN are not trainable but can be tuned: the dropout rate, the learning rate, and the mini-batch size. They are fixed at the start of training and are tuned according to validation accuracy. Dropout is used to reduce overfitting, and a dropout layer is often added after each dense layer except the last. MSCCN is trained with three dropout rates of 0.3, 0.4, and 0.5. The results are 0.891, 0.575, and 0.843, respectively, so the dropout rate of 0.3 gives the best result. In the experiments, the dropout rate is therefore set to 0.3, which means that the MSCCN model randomly ignores 30% of the neurons of the previous layer. The learning rate determines how fast the weights of MSCCN are adjusted to find a local or global minimum of the loss function. MSCCN is tested with learning rates of 0.01, 0.001, 0.0001, and 0.00001. In terms of convergence speed and accuracy, the learning rate of 0.0001 gives the best accuracy of 0.906. MSCCN is evaluated with mini-batch sizes of 10, 16, 32, 64, and 128. The accuracy of MSCCN improves as the mini-batch size increases from 10 to 64 and then decreases for 128, as shown in Figure 7. A mini-batch size of 64 is therefore selected to train the model, which improves the convergence precision.

**Figure 7.** Accuracy versus Mini-Batch sizes.

Gradient descent and backpropagation algorithms are used to update the weight parameters of the model. As gradient descent with momentum is used, the momentum factor is set to 0.9 to prevent the overfitting problem. To show the performance of MSCCN, it is compared with the classical CNN and CapsNet and with three modified models: MS-CapsNet, DCNNTL, and ResNet50.

Figure 8 shows the loss values of the six models on the training set. Figure 8 shows that MSCCN converges fastest and that its curve is relatively stable after 2000 iterations. All models have basically converged when the number of iterations exceeds 2000. For a fair comparison, in the following experiments, all trained models are taken after 3000 iterations to recognize pest categories. Table 2 shows the average recognition rates of 50 5FCV experiments obtained by the six pest recognition models.

**Figure 8.** The loss versus iterations of three models.

**Table 2.** Average recognition rates of five models.


#### *4.3. Discussion*

From Figure 8 and Table 2, it is found that MSCCN outperforms the other five models. Its generalization is enhanced because the multi-scale input, multi-scale convolution, and time-spatial characteristics can be extracted from various pest images, ultimately allowing the pest images to be characterized at a higher level of abstraction. The main reason is that MSCCN makes use of the advantages of multi-scale input, Inception, multi-scale CNN, and multi-scale CapsNet, so it can quickly extract features from pests of various sizes and shapes. MS-CapsNet is the second best because, similar to MSCCN, it employs multi-scale convolution to extract the image features of pests with scale changes and uses a capsule network to extract the image features of pests with shape, position, and angle changes. DCNNTL is better than AlexNet and CapsNet because it employs transfer learning to speed up training. Though ResNet50 has the deepest layers, it does not work very well, because it needs a large number of training samples and there are not enough samples. AlexNet has the worst performance, because it is difficult to optimize pest distortions at the same time by simply changing the size of the projection model, while AlexNet requires a large training database as a comparison library to improve its classification performance and overcome the overfitting problem. CapsNet does not perform very well because it has a shallow convolution layer, which cannot extract deep classification features.

The results in the references [12–16] verify that CNN and its variants are suitable for classifying images that are very close to the training data set, but they perform poorly in various pest images because pest images vary greatly. Pooling in CNN can establish invariance of location and size, but this invariance can also lead to objects with harmful colors and shapes being mistaken for pests. Humans can identify various pests through a few training images of pests, while CNN needs a large number of training samples, even tens of thousands, to train a good model, which is obviously too substantial. Unlike CNN, CapsNet extracts feature vectors, not feature maps. The vector modulus represents the probability of the feature existence, and the vector direction represents the attitude feature information. The moving features will change the CapsNet vector without affecting the probability of the features' existence. Therefore, CapsNet is more suitable to describe the characteristics of various pests. As CapsNet collects the pose information of pests, a good representation effect can be learned from a small number of samples, so the identification performance of pests is improved.

Unlike other deep models such as CNN and CapsNet, the components of MSCCN are intended to reveal typical time–spatial features and their corresponding instantiation parameters. These features allow the various pest images to be described at a higher level of abstraction while reducing the overfitting inherent in complex and deep networks.

#### **5. Conclusions**

Traditional pest identification methods cannot effectively extract robust classification features from the changeable images of pests. Many methods based on deep learning have great advantages in image recognition, but they require a large number of training samples and time for training parameters. To improve the recognition performance, a multi-scale convolution-capsule network (MSCCN) is constructed for crop insect pest identification. MSCCN combines the advantages of CNN, CapsNet, and multi-scale CNN to recognize various pests, including small-size ones, in complex fields. We implement a series of experiments involving the pest images in the complex fields. Experimental results with the IP102 dataset consistently produce the best identification performance with the highest accuracy and least training time. The proposed model has the advantages of good generalization, high recognition rate, and fast convergence, and provides technical support for the practical application of capsule network in crop pest identification system.

In this study, there are some problems in identifying pests in the field because the same pest may have completely different shapes and sizes during its growth period. Future research is expected to apply this method to crop pest control systems to make the system more intelligent.

**Author Contributions:** Conceptualization, C.X. and S.Z.; methodology, C.X.; software, C.Y.; writing original draft preparation, C.X.; and writing—review and editing, X.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the National Natural Science Foundation of China (Nos. 62172338 and 62072378). National Natural Science Foundation of Education Department of Shaanxi Province (No. 20JK0960).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data were obtained from the experimental and simulation software designed in this study, which we obtained by rigorous calculation and logical reasoning.

**Acknowledgments:** We thank the project side for the use of the site and equipment required for the experiment in this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

