1. Introduction
To control pests, avoid economic losses, and reduce pesticide costs, early detection and identification of crop pests is an important task. However, detecting and recognizing crop pests in the field is difficult and challenging because insect pest images are photographed in complex crop environments. These environments involve not only various types, sizes, postures, and shapes of insect pests, but also changeable illumination, viewpoints, and irregular backgrounds; moreover, an insect pest is often small in proportion to the whole image, and its color and texture characteristics are similar to those of the background, as shown in Figure 1. Therefore, traditional pattern recognition and image processing algorithms usually achieve low identification accuracy [1].
With the development of computer vision technology, computing power, and various algorithms of artificial intelligence (AI) [1,2], machine learning [3,4], and deep learning [5], many crop pest detection and recognition methods have been presented [6]. Martineau et al. [7] surveyed forty-four studies on this topic, covering methods of image capture, feature extraction, and classification as well as the test datasets, and discussed the questions that remain unsolved. Costa et al. [8] constructed a knowledge-based crop pest identification system that provides a convenient way for farmers to manage crop pests and diseases. Liu et al. [9] introduced the definition and scope of crop disease and pest knowledge, and analyzed and classified the key techniques of crop disease and pest detection and recognition in recent years, including knowledge representation, feature extraction and fusion, reasoning, and classifiers. Huo et al. [10] reviewed the research progress of disease and pest identification, pest counting, and position detection, as well as the existing datasets and the methods used in previous articles. Li et al. [11] proposed a few-shot cotton pest recognition method and verified its effectiveness and feasibility on two datasets, namely the National Bureau of Agricultural Insect Resources dataset and a dataset of natural scenes. The results of the above methods show that traditional pest identification methods rely on hand-crafted features and matching templates, or on shallow learning-based features with limited representation power; they exploit only low-level features and ignore the hierarchical features of pest images, so their recognition rates and generalization ability are limited.
Convolutional neural networks (CNNs) have made remarkable achievements in various target detection and recognition tasks. They have been widely used in pest detection and recognition because they can automatically learn the essential features of pest images from a large amount of data and produce a small number of high-quality candidate features for pest recognition [12]. Ai et al. [13] used a CNN to automatically identify crop diseases and pests by training the Inception-ResNet-v2 model on the public dataset of the 2018 AI Challenger Competition, covering 27 diseases of 10 crops, and designed and implemented a WeChat applet for crop disease and pest recognition. Xie et al. [14] proposed an automatic crop pest classification method that learns multi-level features from a large number of unlabeled image patches via unsupervised feature learning, using filters at multiple scales coupled with several pooling granularities. Labaa et al. [15] proposed a CNN-based crop pest recognition method, improved by combining different technologies such as CNNs and REST services. Li et al. [16] proposed a fine-tuned GoogLeNet model to deal with the complicated backgrounds of farmland scenes and achieved better pest classification results than the original model.
Compared to traditional hand-crafted feature extraction algorithms, CNNs are effective in image classification tasks. They automatically learn features during training, avoiding the errors introduced by manual feature selection. However, the pooling operation (down-sampling) provides only rough location information: it allows the model to ignore small spatial changes and fails to accurately learn the spatial relationships among objects, such as the location, size, direction, deformation degree, and texture of entities in a region. Although pooling maintains invariance to the location and direction of an entity, it loses the fine characteristics of small pests, so the recognition rate for crop pests may be low. In short, pooling may discard low-level features and spatial hierarchical features, and the information of small pests may be lost after down-sampling.
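To make the pooling argument concrete, the following minimal NumPy sketch (an illustration added here, not part of the original method) shows that 2 × 2 max pooling maps two inputs with a feature at different positions to the same output, discarding the precise location:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps with the same activation at different positions.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

# After pooling, both collapse to the same map: the location is lost.
print(max_pool_2x2(a))  # [[1. 0.], [0. 0.]]
print(max_pool_2x2(b))  # [[1. 0.], [0. 0.]]
```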
The capsule network (CapsNet) is a new kind of deep learning architecture that aims to encode the features of images and their spatial relationships [17]. It can overcome the above shortcomings of CNNs: it uses only a shallow CNN, preserves spatial information, and captures not only discriminative features but also the underlying relationships between these features. A capsule is a group of neurons whose output represents various properties of an entity, such as pose, texture, scale, or the relative relationship between the entity and its parts. As a result, CapsNet is more robust to affine transformations and achieves good results with fewer training samples. Paoletti et al. [18] constructed a CapsNet for hyperspectral image (HSI) classification, where several spectral-spatial capsules learn HSI spectral-spatial features while significantly reducing the network complexity. Mensah et al. [19] proposed a Gabor CapsNet for plant disease detection and evaluated its performance on three publicly available plant disease datasets containing disease leaf images with high similarity and background objects. Wang et al. [20] proposed a multi-scale convolutional CapsNet for hyperspectral image classification, composed of a multi-scale convolutional layer, a single-scale convolutional layer, a PrimaryCaps layer, a DigitCaps layer, and a fully connected layer. Peker [21] proposed a multi-channel CapsNet ensemble for plant disease detection, training each network individually on the image set. Thenmozhi et al. [22] proposed a deep CNN model to classify insects, applying transfer learning to fine-tune pre-trained models.
From the above analysis, it is known that conventional CNN-based crop pest image classification faces the problem of quite limited training samples, which leads to overfitting and an unsatisfactory ability to describe the correlations between features. CapsNet can deal with these disadvantages of CNN, but the representation capability of the low-level features extracted by its shallow convolutional layers is limited. Therefore, neither the original CNN nor the original CapsNet alone is well suited to crop pest recognition tasks. Inspired by the multi-scale convolutional CapsNet and the multi-channel CapsNet, a multi-scale convolution-capsule network (MSCCN) is constructed for crop insect pest recognition, combining the advantages of traditional CNN and CapsNet. It consists of a multi-scale convolutional module, a CapsNet module, and a Softmax classification module. The main contributions of this work are as follows:
Inception is introduced into the convolutional module, with different-scale convolutional kernels in the different branches of the Inception structure. Multi-scale image features are extracted by the different receptive fields of each branch, which increases the width of the network and its adaptability to pest scale (a sketch of such a block is given after this list);
An improved dropout is proposed on the encoded capsules to enhance the robustness of the model at the capsule layer.
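As a rough illustration of the first contribution, the sketch below builds an Inception-style block in Keras with parallel branches of different kernel sizes; the filter counts and exact branch layout are assumptions for illustration, not the paper's exact configuration:

```python
from tensorflow.keras import layers

def inception_block(x, filters=32):
    """Inception-style block: parallel branches with different receptive fields.
    Filter counts here are illustrative assumptions, not the paper's settings."""
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(filters, 1, padding="same", activation="relu")(bp)
    # Concatenating the branches fuses features from several receptive fields.
    return layers.Concatenate()([b1, b3, b5, bp])
```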
The remainder of the paper is organized as follows. Section 2 reviews the related works, including Inception and CapsNet. Section 3 introduces MSCCN in detail. Section 4 presents the experiments. Section 5 concludes the paper and puts forward some suggestions for future research directions.
3. Multi-Scale Convolution-Capsule Network (MSCCN)
Motivated by the fact that the crop insect pests are changeable with various postures, and their sizes range from less than 1 mm to more than 100 mm, a multi-scale convolution-capsule network (MSCCN) is proposed for crop insect pest recognition. Its architecture is shown in
Figure 4.
The input image is reshaped to 128 × 128, 96 × 96, and 64 × 64, and the three scales are processed in parallel. MSCCN first extracts high-level features describing the pest image through three multi-scale convolutions, three Inception modules, and three CapsNets, and uses these features to construct the vector-based capsule structure that forms the final discriminative feature vector of the pests in the image; this vector is fed directly to the final Softmax classifier without any feature reduction. Finally, pest recognition is implemented by the Softmax classifier. MSCCN is designed as an end-to-end structure for easy convolution-CapsNet training and deployment.
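A minimal sketch of the parallel multi-scale input stage described above, assuming simple library resizing and [0, 1] scaling (the paper specifies only the three target sizes):

```python
import numpy as np
from PIL import Image

def multiscale_inputs(path, sizes=((128, 128), (96, 96), (64, 64))):
    """Resize one pest image to the three scales fed to the parallel branches."""
    img = Image.open(path).convert("RGB")
    return [np.asarray(img.resize(s), dtype=np.float32) / 255.0 for s in sizes]
```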
In CapsNet, three multi-dimensional primary capsules are employed to encode the hierarchical multi-scale features extracted by the three multi-scale convolutions, yielding 12D, 8D, and 4D capsules, respectively. Then, the prediction vectors are computed through the different weight matrices W, V, and U as follows:

$$\hat{u}^{(1)}_{j|i} = W_{ij}\, u^{(1)}_i, \qquad \hat{u}^{(2)}_{j|i} = V_{ij}\, u^{(2)}_i, \qquad \hat{u}^{(3)}_{j|i} = U_{ij}\, u^{(3)}_i, \qquad u = \mathrm{concat}\big(\hat{u}^{(1)}, \hat{u}^{(2)}, \hat{u}^{(3)}\big) \tag{1}$$

where $u^{(k)}_i$ is the $i$-th primary capsule built from $F_k$, the feature map of the $k$-th multi-scale convolution; $W$, $V$, and $U$ are the three weight matrices of the 12D, 8D, and 4D capsules, respectively; $\hat{u}^{(k)}_{j|i}$ is the prediction vector between the $j$-th parent capsule and the $i$-th child capsule of the $k$-th branch; and $u$ is the output of this multi-scale capsule encoding structure, which concatenates the results of the three branches via the function concat().
The classification features are encoded using a weight matrix between the $i$-th child capsule and the $j$-th parent capsule. During training, the part–whole relationship of each capsule pair is learned by adjusting the transformation matrices $W$, $V$, and $U$.
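A minimal NumPy sketch of Equation (1) for one branch: each primary capsule $u_i$ is multiplied by its transformation matrix to give the prediction vector $\hat{u}_{j|i}$; all dimensions here are illustrative assumptions:

```python
import numpy as np

num_child, num_parent = 32, 9   # illustrative capsule counts
d_in, d_out = 8, 16             # illustrative capsule dimensions

u = np.random.randn(num_child, d_in)                     # primary capsules u_i
W = np.random.randn(num_child, num_parent, d_out, d_in)  # transformation matrices W_ij

# u_hat[i, j] = W[i, j] @ u[i] -- prediction vector of child i for parent j
u_hat = np.einsum("ijab,ib->ija", W, u)
print(u_hat.shape)  # (32, 9, 16)
```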
There is a dynamic routing procedure between the multi-scale capsule encoding unit and the digit capsule layer. It ensures that the outputs of the child capsules are sent to the proper parent capsules. The prediction vectors $\hat{u}_{j|i}$ computed above through the weight matrices are related to each parent capsule $v_j$ by dynamic routing. In the first iteration, $b_{ij} = 0$ and $c_{ij} = \mathrm{softmax}(b_{ij})$, where $b_{ij}$ is the routing logit between the $i$-th child capsule and the $j$-th parent capsule and $c_{ij}$ is the corresponding coupling coefficient. Then, the routing coefficients are adjusted by the update function as follows:

$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j \tag{2}$$

where $b_{ij}$ is the coupling coefficient before normalization and $v_j$ is the $j$-th output capsule of the parent capsule layer, calculated by

$$v_j = \mathrm{squash}(s_j) = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2}\,\frac{s_j}{\lVert s_j \rVert}, \qquad s_j = \sum_i c_{ij}\, \hat{u}_{j|i} \tag{3}$$

where $s_j$ is the total input vector of the $j$-th parent capsule, obtained as the weighted sum of the prediction vectors coming from the $i$-th child capsules; $\lVert s_j \rVert^2 / (1 + \lVert s_j \rVert^2)$ is the reduction coefficient of $s_j$; $s_j / \lVert s_j \rVert$ is the normalized unit vector of $s_j$; and the prediction vector $\hat{u}_{j|i}$ is obtained by multiplying the output features of the BN layer by the weight matrix of the primary capsule layer.
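The routing loop of Equations (2) and (3) can be sketched in NumPy as follows (a standard routing-by-agreement implementation; array shapes are assumptions):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|), Equation (3)."""
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, r=3):
    """u_hat: (num_child, num_parent, d_out) prediction vectors."""
    num_child, num_parent, _ = u_hat.shape
    b = np.zeros((num_child, num_parent))           # initial logits b_ij = 0
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # c_ij = softmax_j(b_ij)
        s = np.einsum("ij,ijd->jd", c, u_hat)       # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                               # v_j = squash(s_j)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # b_ij += u_hat . v_j, Equation (2)
    return v

# Example: route 32 child capsules (16-D predictions) to 9 parent capsules.
v = dynamic_routing(np.random.randn(32, 9, 16))
```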
The objective function of MSCCN is expressed as follows:

$$L_c = T_c \max(0,\, m^+ - \lVert v_c \rVert)^2 + \lambda\,(1 - T_c)\max(0,\, \lVert v_c \rVert - m^-)^2 \tag{4}$$

where the former term calculates the loss of a correctly classified digit capsule and the latter term calculates the loss of wrongly classified digit capsules; $m^+$ = 0.9 and $m^-$ = 0.1 are the default category prediction values; $\lambda$ = 0.5 is the default balance coefficient; $T_c$ is the label indicator of category $c$, equal to 1 for the correct label and 0 for an incorrect label; CNum is the number of categories; $\lVert v_c \rVert$ is the length of the vector representing the probability of discriminating the input as a pest of the $c$-th class; and the total loss is the sum of the loss functions of all digit capsules.
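A direct NumPy transcription of the margin loss in Equation (4), using the stated defaults $m^+ = 0.9$, $m^- = 0.1$, and $\lambda = 0.5$:

```python
import numpy as np

def margin_loss(v, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v: (CNum, d) digit capsules; T: (CNum,) one-hot label indicator T_c."""
    v_len = np.linalg.norm(v, axis=-1)                        # ||v_c||
    L = (T * np.maximum(0.0, m_pos - v_len) ** 2
         + lam * (1.0 - T) * np.maximum(0.0, v_len - m_neg) ** 2)
    return L.sum()                                            # total loss over all capsules
```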
The main processes of MSCCN-based crop pest recognition are shown in
Figure 5.
First, all the pest images are divided into a training set and a test set. Both sets are preprocessed to facilitate the extraction of multi-scale features by MSCCN. The preprocessed images are then used as the input of the multi-scale convolution, and the network automatically extracts multi-scale color, texture, and shape features from the training samples. A multi-scale CapsNet encodes the multi-scale convolution features. Each layer of CapsNet is composed of capsules whose inputs and outputs are vectors: the vector length represents the approximate probability that a pest is present, and the vector direction represents the instantiation parameters of the pest. The output of a capsule is routed only to the appropriate capsule in the next layer, which returns a clearer signal and allows the pose of the pest to be determined accurately. Different feature vectors are combined by feature concatenation. After the MSCCN structure and network parameters are set up and the model is trained, the classification model recognizes pest images via Softmax. The $k$-dimensional feature vector $x$ extracted by CapsNet is input into the trained Softmax classifier as follows:

$$p(y = j \mid x) = \frac{\exp(w_j^{\mathrm{T}} x)}{\sum_{c=1}^{C} \exp(w_c^{\mathrm{T}} x)} \tag{5}$$

where $p(y = j \mid x)$ is the probability that the feature vector $x$ belongs to the $j$-th category, $C$ is the total number of categories, and $w_j$ are the weight terms.
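Equation (5) in code form (a sketch; the weight shape is an assumption):

```python
import numpy as np

def softmax_classify(x, W):
    """x: (k,) CapsNet feature vector; W: (C, k) classifier weights (assumed shape)."""
    logits = W @ x
    logits -= logits.max()                  # subtract max for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return p                                # p[j] = P(y = j | x), Equation (5)
```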
From the above analysis, the pseudocode of MSCCN is given in Algorithm 1:
Algorithm 1: Multi-scale CapsNet
Input: training pest images; parameters $m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$, and the learning rate; batch-size = 128; the numbers of training iterations and dynamic-routing rounds are 3000 and 3, respectively; loss threshold $\varepsilon$;
1: Preprocess the images;
2: Reshape each image into three images of different sizes;
3: For each training iteration:
4: for k = 1 to 3
    Carry out the k-th convolution with kernels of sizes 7 × 7, 5 × 5, and 3 × 3;
    Carry out the k-th Inception convolution;
    Carry out procedure routing($\hat{u}^{(k)}_{j|i}$, r, l) of the k-th CapsNet:
      (1) for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow 0$;
      (2) for r iterations do
          for all capsules i in layer l: $c_i \leftarrow \mathrm{softmax}(b_i)$
          for all capsules j in layer (l + 1): compute $s_j$ and $v_j$ by Equation (3)
          (3) for all capsules i in layer l and capsules j in layer (l + 1): $b_{ij} \leftarrow b_{ij} + \hat{u}^{(k)}_{j|i} \cdot v_j$
      (4) return $v^{(k)}$
    return $v^{(k)}$;
5: integrate $v = \mathrm{concat}(v^{(1)}, v^{(2)}, v^{(3)})$;
6: compute $v$ × Mask;
7: input $v$ × Mask into the Softmax classifier;
8: calculate the loss $L_c$ by Equation (4);
9: if $L_c$ is greater than $\varepsilon$, return to step 3;
10: Stop iterations.
4. Experiments
To evaluate the performance of the proposed MSCCN-based method, extensive experiments were conducted on the rice pest images of the IP102 dataset, and the method was compared with five existing mainstream deep learning methods: AlexNet [12,23,24], CapsNet [19], MS-CapsNet [20], DCNN + transfer learning (DCNNTL) [22], and ResNet50 [25]. AlexNet consists of five convolutional layers, three max pooling layers, and three fully connected layers. ResNet50 is composed of 49 convolutional layers and a fully connected layer, where the residual units contain cross-layer connections. MS-CapsNet consists of a multi-scale convolutional layer, a single-scale convolutional layer, a PrimaryCaps layer, a DigitCaps layer, and a fully connected layer. DCNNTL consists of six convolutional layers, five max pooling layers, and a fully connected layer; it uses VGG16 for transfer learning to pre-train the deep CNN model on the constructed dataset. In all the compared models, categorical cross-entropy is used as the loss function, stochastic gradient descent (SGD) is used as the optimizer, and a Softmax classifier is used in the output layer to classify pest categories.
The hardware and software environment of the experiments is as follows: a 64-bit Microsoft Windows 10 operating system with an x64-based Intel i5-6200U CPU, an NVIDIA GeForce RTX 2080 SUPER GPU with 8 GB of memory, Python 3.7 with Jupyter Notebook as the programming environment, and Keras 2.3.0 as the deep learning framework.
4.1. IP102 Dataset
IP102 (https://github.com/xpwu95/IP102 (accessed on 7 April 2019)) is often used to test deep learning-based insect pest detection and recognition methods [26]. It contains 75,222 images belonging to 102 common crop insect pest categories, with an average of 737 images per class. Most images were collected by common image search engines at different growth stages, and about 19,000 images are annotated with bounding boxes for pest detection. Some example images are shown in Figure 6. From Figure 6, it is found that the insect pest images were collected in fields at various sizes and shapes against complex backgrounds, and that each pest has different sizes, postures, and shapes at different stages of its life cycle [23].
IP102 contains 5701 original images of nine rice pests. The pest names and their corresponding image counts and serial numbers are listed in Table 1, where the maximum count is 1115 and the minimum is 369. The number of images of each rice pest class is increased to more than 500 by an augmentation algorithm. All images are converted to JPEG format, and repeated or damaged images are deleted. In this study, all rice insect pest images are used to conduct the rice pest recognition experiments; because the sizes of the collected images are not uniform, each original image is first reshaped to 128 × 128, which is closer to the actual application. After removing 1034 poor-quality or negative images, 5000 preprocessed images are finally used in the experiments.
Due to differences in data collection conditions, illumination, and the parameter settings of the digital scanner, digital pest images often exhibit color differences. Size and color normalization can both ensure the color consistency of the original images and preserve the biological information in the pest images, thereby improving the recognition performance of the model. As the pest image sizes in the dataset vary from 220 × 220 to 512 × 512, the input original images and ROI labels are uniformly resized to 128 × 128. At the same time, the pixel values of all images are regularized to between 0 and 255 when entering the channel expansion module. After channel expansion, minimax normalization is applied to each channel of the pest images to scale the pixel values to between 0 and 1, completing the channel expansion of the original pest images and better matching the input of the deep learning network. Minimax normalization is defined as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{6}$$

where $x$ is the pixel value of the original image, $x_{\min}$ is the minimum value of the pixel value set, and $x_{\max}$ is the maximum value of the pixel value set.
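A one-function sketch of Equation (6) applied to one image channel (the small epsilon guarding against a constant channel is an addition):

```python
import numpy as np

def minmax_normalize(img):
    """Scale the pixel values of one channel to [0, 1], Equation (6)."""
    x_min, x_max = img.min(), img.max()
    return (img - x_min) / (x_max - x_min + 1e-12)  # epsilon avoids division by zero
```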
The 5-fold cross-validation (5FCV) strategy is used to evaluate the performance of the proposed model. The 5FCV experiment is conducted 50 times, and the reported results are the averages over the 50 5FCV runs.
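A hedged sketch of this evaluation protocol using scikit-learn; `build_model` and the sklearn-style `fit`/`score` interface are placeholders standing in for MSCCN training:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_5fcv(X, y, build_model, repeats=50, seed=0):
    """Average accuracy over repeated 5-fold cross-validation runs."""
    scores = []
    for rep in range(repeats):
        kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in kf.split(X, y):
            model = build_model()                   # fresh model per fold
            model.fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```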
4.2. Experiment Results
Rice pest recognition mainly relies on MSCCN to extract the pest image features and complete pest identification via the Softmax classifier. MSCCN is a deep learning algorithm. The CapsNet module in MSCCN uses activity vectors to represent the instantiation parameters of specific pest types, and the length of an output vector characterizes the probability that the corresponding pest is present in the current input. After preprocessing, the images are fed into MSCCN, where the CapsNet module uses multi-scale convolution to extract the pest image features, trains the classifier, and predicts the output vector based on the routing-by-agreement protocol. In this paper, these characteristics of MSCCN are used to address the difficulty of identifying crop pests when multiple pests overlap; the most vital step, according to routing-by-agreement, is analyzing pest images with overlapped objects. The parameters of MSCCN are initially set as follows: batch size 128, weight decay factor 0.00001, and 100 training epochs, with the initial weights drawn randomly from a Gaussian distribution with a mean of 0 and a variance of 1. In the multi-scale convolution module, the dropout rate is set to 0.4, and the learning rate is initialized to 1 × 10−3 and decreased by a factor of 0.05 as the number of iterations increases. In the CapsNet module, the number of training iterations is 3000, the number of dynamic-routing rounds is 3, and the dropout ratio is 0.9. Adam is employed as the gradient descent algorithm to perform the training, with its original parameter settings ($\alpha$, $\beta_1$, $\beta_2$, and $\varepsilon$).
Three parameters of MSCCN are not trained but can be fine-tuned: the dropout rate, the learning rate, and the mini-batch size. They are fixed at the start of training and tuned according to validation accuracy. Dropout is used to reduce overfitting, and a dropout layer is often added after each dense layer except the last. MSCCN is trained with dropout rates of 0.3, 0.4, and 0.5, yielding accuracies of 0.891, 0.575, and 0.843, respectively; the dropout rate of 0.3 gives the best result overall. In the experiments, the dropout rate is therefore set to 0.3, which means that MSCCN randomly ignores 30% of the neurons of the previous layer. The learning rate determines how fast the weights of MSCCN are adjusted to find a local or global minimum of the loss function. MSCCN is tested with learning rates of 0.01, 0.001, 0.0001, and 0.00001; in terms of convergence speed and accuracy, a learning rate of 0.0001 achieves the best accuracy of 0.906. MSCCN is also evaluated with mini-batch sizes of 10, 16, 32, 64, and 128. The accuracy of MSCCN improves as the mini-batch size increases from 10 to 64 and then decreases at 128, as shown in Figure 7. A mini-batch size of 64 is therefore selected to train the model, which improves the convergence precision.
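The one-at-a-time sweep described above could be sketched as follows; `train_and_validate` is a hypothetical helper returning validation accuracy, and the default values held fixed for the other parameters are assumptions:

```python
def tune_one_at_a_time(train_and_validate):
    """Sweep each hyperparameter separately, as described in the text.

    `train_and_validate` is a hypothetical helper that trains MSCCN with the
    given setting (the other parameters held at assumed defaults) and returns
    validation accuracy.
    """
    results = {}
    for dr in (0.3, 0.4, 0.5):                  # dropout rates from the text
        results[("dropout", dr)] = train_and_validate(dropout=dr)
    for lr in (1e-2, 1e-3, 1e-4, 1e-5):         # learning rates from the text
        results[("learning_rate", lr)] = train_and_validate(learning_rate=lr)
    for bs in (10, 16, 32, 64, 128):            # mini-batch sizes from the text
        results[("batch_size", bs)] = train_and_validate(batch_size=bs)
    best = max(results, key=results.get)        # best (parameter, value) pair
    return best, results
```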
Gradient descent and backpropagation algorithms are used to update the weight parameters of the model. As a momentum-based gradient descent algorithm is used, the momentum factor is set to 0.9 to help prevent overfitting. To demonstrate the performance of MSCCN, it is compared with the classical CNN (AlexNet) and CapsNet, and with three modified models: MS-CapsNet, DCNNTL, and ResNet50.
Figure 8 shows the loss curves of the six models on the training set. From Figure 8, it is found that MSCCN converges fastest and that its curve is relatively stable after 2000 iterations; all models have basically converged when the number of iterations exceeds 2000. For a fair comparison, in the following experiments all trained models are taken after 3000 iterations to recognize pest categories.
Table 2 shows the average recognition rates over the 50 5FCV experiments for the six pest recognition models.
4.3. Discussion
From Figure 8 and Table 2, it is found that MSCCN outperforms the other five models. Its generalization is enhanced because multi-scale input, multi-scale convolution, and time-spatial characteristics can be extracted from various pest images, ultimately allowing the pest images to be characterized at a higher level of abstraction. The main reason is that MSCCN exploits the advantages of multi-scale input, Inception, multi-scale CNN, and multi-scale CapsNet, so it can quickly extract features from pests of various sizes and shapes. MS-CapsNet is the second best because, similar to MSCCN, it employs multi-scale convolution to extract the image features of pests with scale changes and uses a capsule network to extract the image features of pests with shape, position, and angle changes. DCNNTL is better than AlexNet and CapsNet because it employs transfer learning to speed up training. Although ResNet50 has the deepest architecture, it does not work very well; the reason is that it needs a large number of training samples, and there are not enough here. AlexNet has the worst performance, because it is difficult to accommodate pest distortions simply by changing the size of the projection model, and AlexNet requires a large training database as a comparison library to improve its classification performance and overcome the overfitting problem. CapsNet performs modestly because its convolution layers are shallow and cannot extract deep classification features.
The results in references [12,13,14,15,16] verify that CNNs and their variants are suitable for classifying images that closely resemble the training set, but they perform poorly on diverse pest images because pest appearance varies greatly. Pooling in a CNN establishes invariance to location and size, but this invariance can also cause objects with similar colors and shapes to be mistaken for pests. Humans can identify various pests from a few training images, while a CNN needs a large number of training samples, even tens of thousands, to train a good model, which is obviously too demanding. Unlike CNN, CapsNet extracts feature vectors, not feature maps: the vector modulus represents the probability that a feature exists, and the vector direction represents its pose information. Moving a feature changes the CapsNet vector without affecting the probability of the feature's existence. Therefore, CapsNet is better suited to describing the characteristics of various pests. As CapsNet captures the pose information of pests, a good representation can be learned from a small number of samples, so pest identification performance is improved.
Unlike other deep models such as CNN and CapsNet, the components of MSCCN are designed to reveal typical time–spatial features and their corresponding instantiation parameters. These features allow various pest images to be described at a higher level of abstraction while reducing the overfitting inherent in complex, deep networks.