Article

AM-MSFF: A Pest Recognition Network Based on Attention Mechanism and Multi-Scale Feature Fusion

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
2 Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Entropy 2024, 26(5), 431; https://doi.org/10.3390/e26050431
Submission received: 18 March 2024 / Revised: 12 May 2024 / Accepted: 16 May 2024 / Published: 20 May 2024
(This article belongs to the Section Entropy and Biology)

Abstract

Traditional methods for pest recognition have certain limitations in addressing the challenges posed by diverse pest species, varying sizes, diverse morphologies, and complex field backgrounds, resulting in a lower recognition accuracy. To overcome these limitations, this paper proposes a novel pest recognition method based on attention mechanism and multi-scale feature fusion (AM-MSFF). By combining the advantages of attention mechanism and multi-scale feature fusion, this method significantly improves the accuracy of pest recognition. Firstly, we introduce the relation-aware global attention (RGA) module to adaptively adjust the feature weights of each position, thereby focusing more on the regions relevant to pests and reducing the background interference. Then, we propose the multi-scale feature fusion (MSFF) module to fuse feature maps from different scales, which better captures the subtle differences and the overall shape features in pest images. Moreover, we introduce generalized-mean pooling (GeMP) to more accurately extract feature information from pest images and better distinguish different pest categories. In terms of the loss function, this study proposes an improved focal loss (FL), known as balanced focal loss (BFL), as a replacement for cross-entropy loss. This improvement aims to address the common issue of class imbalance in pest datasets, thereby enhancing the recognition accuracy of pest identification models. To evaluate the performance of the AM-MSFF model, we conduct experiments on two publicly available pest datasets (IP102 and D0). Extensive experiments demonstrate that our proposed AM-MSFF outperforms most state-of-the-art methods. On the IP102 dataset, the accuracy reaches 72.64%, while on the D0 dataset, it reaches 99.05%.

1. Introduction

Agriculture plays a crucial role in ensuring food security, promoting economic development, and maintaining ecological balance [1]. However, pests are one of the primary factors limiting agricultural development [2]. Traditionally, the early identification of pests relied heavily on agricultural experts. However, this approach was labor-intensive and lacked real-time capabilities [3]. With advancements in computer vision technology, automated pest recognition based on pest images has gained attention from researchers. Automated systems utilize computer vision techniques to analyze and interpret pest images, enabling farmers and agricultural practitioners to quickly and accurately identify specific pests that negatively impact crops. This technology reduces reliance on human experts and provides real-time pest detection capabilities, facilitating timely intervention measures and targeted pest control methods [4].
In insect identification tasks, extracting useful features from images faces several challenges due to the high diversity of pest species, as well as variability in their sizes and shapes. Past studies predominantly employed traditional machine learning methods using manually designed features, such as GIST [5], HOG [6], SIFT [7], and SURF [8]. However, these handcrafted features have limitations in capturing the large-scale variations in the shapes of target objects.
In recent years, deep learning has achieved robust feature learning and demonstrated state-of-the-art performance in various image classification tasks. Consequently, deep learning models based on convolutional neural networks (CNNs) have been widely applied to image classification tasks in the agricultural domain, such as crop disease identification, crop classification, weed detection, and crop pest identification [9]. However, current pest datasets are still very limited, with most containing samples of only a few insect species. In addition, these datasets consist mainly of pest images collected under controlled laboratory conditions [3,9]. This limitation hampers the ability of deep learning models to perform insect pest recognition under real field conditions. Additionally, different insect pest species may have highly similar appearances, and there are also differences between the various forms of the same species (such as eggs, larvae, pupae, and adults) [10,11]. This implies that insect pest recognition tasks face challenges of significant intra-class variation and high inter-class similarity.
This paper proposes an identification network based on attention mechanism and multi-scale feature fusion (AM-MSFF) to address challenges such as complex backgrounds, large intra-class differences, small inter-class differences, and uneven data distribution in pest recognition. Our contributions are summarized as follows:
  • The introduction of relation-aware global attention (RGA) helps the model focus on the pest regions, suppresses interference from complex backgrounds, and enhances the model’s attention to pests;
  • We propose the multi-scale feature fusion (MSFF) module, which extracts features at different scales and integrates these features to capture both the characteristics and contextual information of pests across different scales. This enables the model to better adapt to variations in the morphology and appearance of different pests. Additionally, we introduce generalized-mean pooling (GeMP) to better preserve important features and enhance the sensitivity to detailed information;
  • An improved version of the cross-entropy loss function, called balanced focal loss (BFL), is proposed based on the focal loss (FL). BFL takes into consideration the number of samples for each class and adjusts the weights for each class accordingly. This adjustment allows the model to pay more attention to minority samples and hard-to-classify samples, thereby allowing the model to better handle class imbalance situations.

2. Related Work

With the advancement of computer vision technology, pest recognition methods have been continuously improved and innovated upon. Based on the approach of feature extraction, these methods can be broadly categorized into two types: traditional handcrafted feature-based methods and deep feature-based methods.

2.1. Handcrafted Features

In previous research, the process of feature extraction and classification required manual intervention, where researchers had to manually segment the insects of interest from the background. For instance, Mayo et al. [12] employed the image processing tool ImageJ (can be accessed at: https://imagej.net/ij/) [13] for feature extraction from insect images and utilized a support vector machine (SVM) for classification. Although this method successfully achieved automatic species identification of live specimens in the field without manually specifying regions of interest, it still required the segmentation of insects and background when the image background was highly cluttered. Yalcin [14] separated insects from the background using background subtraction and active contour models, extracting the outer boundaries of insects. Subsequently, they extracted features using Hu moments, elliptic Fourier descriptors (EFD), the radial distance function (RDF), and local binary patterns (LBP), finding that LBP features performed best. Venugoban et al. [15] utilized the histogram of oriented gradients (HOG) and speeded-up robust features (SURF) for image feature extraction, making full use of their ability to capture characteristics of local shape edges or gradient structures. They employed SVM for the multi-class classification of feature histograms. Xie et al. [16] employed sparse coding histograms of multiple feature modalities, combining multiple features of insect images for feature extraction. This method effectively quantifies original features such as color, shape, and texture, significantly enhancing recognition performance.
Handcrafted feature methods typically rely on raw image patches or manually designed image features, making them very sensitive to noise and background interference in natural images. They also struggle to adapt to the variations in the appearance of the same insect species at different stages. Furthermore, these methods often fail to capture mid-level and high-level features in insect appearances, and they also pose a significant computational burden. To address these issues, there is a need to develop more robust and discriminative feature descriptors that can automatically extract relevant information from insect appearances and adapt to changes in appearance across different insect species.

2.2. Deep Features

In recent years, CNNs such as ResNet [17] and GoogleNet [18] have achieved significant advancements in image classification tasks, garnering widespread attention. Consequently, an increasing number of researchers are exploring and adopting CNNs to address insect pest recognition problems. Li et al. [19] utilized CNNs to extract feature vectors from images and employed triplet loss training to distinguish between different insect pest species, ensuring the stable and reliable performance of the recognition system under various circumstances. Cheng et al. [20] introduced deep residual learning to overcome the problem of network degradation. Through optimization with deep residual learning, their method significantly improved the accuracy of insect pest image recognition in complex agricultural field backgrounds compared to simple CNNs like AlexNet [21]. Liu et al. [22] proposed the deep feature fusion residual network (DFF-ResNet), which enhances the model’s generalization ability by introducing feature fusion residual blocks that merge features from the previous layer with convolutional layers in the residual signal branch. Coulibaly et al. [23] introduced a crop pest recognition and localization network based on an interpretable approach, selecting inception-v3 as the backbone for feature extraction and highlighting the captured colors and shapes through visualized graphs. A combination of various interpretability methods better explains the reasoning process of deep learning systems and determines the optimal number of feature extraction layers. Hu et al. [24] proposed an insect recognition network based on a multi-scale dual-branch GAN-ResNet, utilizing ConvNeXt residual blocks to adjust computational scale and constructing a dual-branch structure to capture insect features of different sizes in input images while effectively extracting subtle features.
CNN models can automatically extract rich spatial and semantic information from images without the need for manual feature extraction, thereby reducing the workload of human involvement. However, CNN models rely heavily on large-scale data and are prone to recognition errors when dealing with complex backgrounds and lighting variations. Therefore, there is still room for improvement in this field, and further exploration is needed on how to enhance the model’s adaptability to complex backgrounds and lighting variations.

3. Proposed Method

As shown in Figure 1, the proposed AM-MSFF is based on the architecture of ResNet-50 [17] pre-trained on the ImageNet dataset. The network consists of the RGA module, MSFF module, and GeMP module, along with an improved focal loss [25] function called balanced focal loss (BFL). In the specific network structure, the RGA module models relationships between different positions in the image and weights them using attention mechanisms. The MSFF module fuses multi-scale feature information, focusing on both details and global features. The GeMP module better preserves the spatial information within feature maps. BFL adjusts the sample weights to balance the influence between different classes.

3.1. Relation-Aware Global Attention

Although the relation-aware global attention (RGA) module [26] was initially designed to address the problem of person re-identification, we can draw inspiration from its design principles and incorporate it into pest-related tasks. We combine the RGA module with deep residual networks to construct a feature extraction network based on relation-aware global attention. By learning the relationships between feature nodes and computing attention weights, the network can effectively explore discriminative regional features.
The RGA module weights input features through two subsidiary modules: the spatial relation-aware attention (RGA-S) submodule and the channel relation-aware attention (RGA-C) submodule. Firstly, the RGA-S submodule emphasizes the critical spatial information by analyzing the spatial relationships among input features and subsequently weighting the original features based on the learned weights. Subsequently, the RGA-C submodule further processes the weighted output from RGA-S by leveraging channel relationships and highlighting important channel information. This two-stage attention mechanism enables the model to more accurately focus on essential input feature information, thereby enhancing the model’s representation learning capability and performance.
The RGA module’s structure, as shown in Figure 2, involves processing the input image through a frontend network to produce a feature map. Each feature vector in the feature map is represented as a feature node $x_i$, where $i = 1, 2, \ldots, N$ and $N$ denotes the number of feature nodes. For each feature node $x_i$, its correlation with all other nodes $x_j$ ($j = 1, 2, \ldots, N$) is computed, resulting in correlation values $r(i,j)$ and $r(j,i)$. The relationship vector for feature node $x_i$ is represented as $r_i = [r(i,1), r(i,2), \ldots, r(i,N), r(1,i), r(2,i), \ldots, r(N,i)]$. Subsequently, feature node $x_i$ and its relationship vector $r_i$ are concatenated to obtain the relation-aware feature $E_i$. Then, the attention weight $a_i$ for the current feature node is computed.

3.1.1. Spatial Relation-Aware Global Attention

RGA-S is a method for learning each feature node in the spatial dimension of the feature map. It compactly represents the pairwise relationships between all feature nodes and extracts structural information with a global context, as illustrated in Figure 3. Our approach incorporates RGA-S into the ResNet-50 network to learn the correlations between all feature nodes in the spatial dimension of the feature map, enabling the network to better focus on important spatial positions and feature nodes.
Specifically, for the input feature map $X \in \mathbb{R}^{C \times H \times W}$ obtained from ResNet-50, each $C$-dimensional feature vector at every spatial position is regarded as a feature node. These nodes construct a node graph $G_s$ consisting of a total of $N = W \times H$ nodes. Each feature node is represented as $x_i$, where $i = 1, 2, \ldots, N$. By performing a dot product operation, we can obtain the correlation $r(i,j)$ between feature nodes $x_i$ and $x_j$, which is defined by Equation (1):
$$r_{i,j} = f_s(x_i, x_j) = \theta_s(x_i)^{\mathrm{T}}\,\phi_s(x_j), \qquad \theta_s(x_i) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i))), \qquad \phi_s(x_j) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_j))) \tag{1}$$
Here, the function $f_s$ represents the dot product operation, $\theta_s$ and $\phi_s$ are two embedding functions, BN stands for batch normalization, ReLU denotes the rectified linear unit activation function, and Conv represents a $1 \times 1$ convolution operation. Similarly, the pairwise relationship between node $j$ and node $i$ is denoted as $r_{j,i} = f_s(x_j, x_i)$, and $(r_{i,j}, r_{j,i})$ represents the bidirectional relationship between $x_i$ and $x_j$. Finally, the correlations between all nodes can be represented by the relation matrix $R_S \in \mathbb{R}^{N \times N}$, where $r_{i,j} = R_S(i,j)$.
For the $i$-th feature node, the pairwise relationships with all nodes are stacked in a certain order to obtain a spatial relation vector $r_i = [R_S(i,:), R_S(:,i)] \in \mathbb{R}^{2N}$. Then, the spatial relation vector is concatenated with the original feature information to incorporate both the global structural information and the local original information, resulting in the spatial relation attention $E_S$, which is defined in Equation (2):
$$E_S = C(x_i, r_i) = \big[\,\mathrm{pool}_c(\psi_S(x_i)),\ \delta_S(r_i)\,\big], \qquad \psi_S(x_i) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i))), \qquad \delta_S(r_i) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(r_i))) \tag{2}$$
where $\psi_S$ and $\delta_S$ represent operations on the original features and the spatial relation features, respectively, $C$ denotes the concatenation operation, $\mathrm{pool}_c$ denotes global average pooling (GAP) along the channel dimension, and Conv reduces the channel dimension to one.
Through the spatial relation-aware attention $E_S$, the attention weight $s_i$ is computed for each position. This attention weight is then multiplied with the original features to obtain the intermediate feature $Y_S$ weighted by spatial relation-aware attention. The computation process is given in Equations (3) and (4):
$$s_i = \mathrm{sigmoid}\big(\mathrm{BN}(\mathrm{Conv}_2(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_1(E_S)))))\big) \tag{3}$$
$$Y_S = \sum_{i=1}^{N} s_i \cdot x_i \tag{4}$$
where $\mathrm{sigmoid}$ represents the sigmoid activation function, $\mathrm{Conv}_2$ reduces the number of channels to one, and $\mathrm{Conv}_1$ reduces the channel dimensionality by a fixed ratio.
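To make the spatial branch concrete, the following is a minimal PyTorch sketch of an RGA-S block in the spirit of Equations (1)-(4). It is an illustrative re-implementation rather than the authors' released code: the class name, the embedding ratio, and collapsing $\mathrm{pool}_c(\psi_S(x))$ to a single channel are assumptions, and the spatial size $N = H \times W$ must be known when the module is built.

```python
import torch
import torch.nn as nn

class SpatialRGA(nn.Module):
    """Sketch of spatial relation-aware global attention (RGA-S), Eqs. (1)-(4)."""

    def __init__(self, in_channels, spatial_size, embed_ratio=8):
        super().__init__()
        c_embed = max(in_channels // embed_ratio, 1)
        # theta / phi embeddings used for the pairwise relations (Eq. 1)
        self.theta = nn.Sequential(nn.Conv2d(in_channels, c_embed, 1),
                                   nn.BatchNorm2d(c_embed), nn.ReLU(inplace=True))
        self.phi = nn.Sequential(nn.Conv2d(in_channels, c_embed, 1),
                                 nn.BatchNorm2d(c_embed), nn.ReLU(inplace=True))
        # psi compresses the original feature to one channel (stand-in for pool_c(psi_S(x)))
        self.psi = nn.Sequential(nn.Conv2d(in_channels, 1, 1),
                                 nn.BatchNorm2d(1), nn.ReLU(inplace=True))
        # delta compresses the 2N-dimensional relation vector to one channel (Eq. 2)
        self.delta = nn.Sequential(nn.Conv2d(2 * spatial_size, 1, 1),
                                   nn.BatchNorm2d(1), nn.ReLU(inplace=True))
        # Conv1 -> ReLU -> Conv2 bottleneck producing the attention map (Eq. 3)
        self.score = nn.Sequential(nn.Conv2d(2, 2, 1), nn.BatchNorm2d(2),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(2, 1, 1), nn.BatchNorm2d(1))

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w                                     # must equal spatial_size
        t = self.theta(x).flatten(2)                  # (B, C', N)
        p = self.phi(x).flatten(2)                    # (B, C', N)
        rel = torch.bmm(t.transpose(1, 2), p)         # (B, N, N): r(i, j), Eq. (1)
        rel_vec = torch.cat([rel, rel.transpose(1, 2)], dim=2)   # [R_S(i,:), R_S(:,i)]
        rel_vec = rel_vec.transpose(1, 2).reshape(b, 2 * n, h, w)
        e_s = torch.cat([self.psi(x), self.delta(rel_vec)], dim=1)  # E_S, Eq. (2)
        s = torch.sigmoid(self.score(e_s))            # attention weights s_i, Eq. (3)
        return x * s                                  # Y_S = sum_i s_i * x_i, Eq. (4)
```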

3.1.2. Channel Relation-Aware Global Attention

RGA-C learns various feature nodes along the channel dimension, compactly representing pairwise relationships among all feature nodes to obtain global structural information along the channel dimension, as illustrated in Figure 4. The approach in this paper incorporates RGA-C into the ResNet-50 network, enabling the learning of the correlations among all feature nodes in the channel dimension. This allocates different weights to each channel, enhancing the network’s focus on different channel information in pest images.
Specifically, the intermediate feature $Y_S$ obtained from the RGA-S submodule is used as the input to the RGA-C submodule. For the feature map $Y_S \in \mathbb{R}^{C \times H \times W}$, each channel's feature map is considered a feature node, forming a graph $G_C$ with a total of $C$ nodes; each such node is denoted as $y_i$, where $i = 1, 2, \ldots, C$. The input feature $Y_S$ is reshaped into $\mathbb{R}^{(HW) \times C \times 1}$ and then transformed using two $1 \times 1$ convolutions to obtain two feature node embeddings, whose dot product forms the channel relation matrix $R_C \in \mathbb{R}^{C \times C}$. The element $r(i,j)$ of $R_C$ represents the pairwise relationship between node $i$ and node $j$, defined by Equation (5):
$$r_{i,j} = f_c(y_i, y_j) = \theta_C(y_i)^{\mathrm{T}}\,\varphi_C(y_j), \qquad \theta_C(y_i) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(y_i))), \qquad \varphi_C(y_j) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(y_j))) \tag{5}$$
where $f_c$ represents the dot product operation. Similarly, the correlation $r_{j,i}$ between feature nodes $y_j$ and $y_i$ can be obtained, and the pairwise relationships between all nodes are collected in the matrix $R_C \in \mathbb{R}^{C \times C}$. Stacking the relationships of the $i$-th feature node with all nodes, we obtain the channel relation vector $r_i = [R_C(i,:), R_C(:,i)] \in \mathbb{R}^{2C}$. Similar to Equation (3), we can then obtain the final channel attention weight $c_i$. The attention weights are multiplied by the intermediate feature $Y_S$ to obtain the final output feature representation $Y$, as shown in Equation (6):
$$Y = \sum_{i=1}^{C} c_i \cdot y_i \tag{6}$$
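A correspondingly simplified channel branch for Equations (5) and (6) is sketched below. For brevity, the pairwise channel relation is computed as a normalized Gram matrix of the flattened channel maps instead of through two learned $1 \times 1$ embeddings, so this is only a rough approximation of RGA-C; the reduction ratio and layer names are assumptions.

```python
import torch
import torch.nn as nn

class ChannelRGA(nn.Module):
    """Sketch of channel relation-aware global attention (RGA-C), Eqs. (5)-(6)."""

    def __init__(self, in_channels, reduction=8):
        super().__init__()
        hidden = max(in_channels // reduction, 1)
        # embed the 2C-dimensional channel relation vector of each node
        self.delta = nn.Sequential(nn.Linear(2 * in_channels, hidden),
                                   nn.ReLU(inplace=True))
        # score head: relation embedding + per-channel descriptor -> channel weight c_i
        self.score = nn.Sequential(nn.Linear(hidden + 1, hidden),
                                   nn.ReLU(inplace=True),
                                   nn.Linear(hidden, 1))

    def forward(self, y):
        b, c, h, w = y.shape
        nodes = y.flatten(2)                                     # (B, C, H*W): one node per channel
        rel = torch.bmm(nodes, nodes.transpose(1, 2)) / (h * w)  # (B, C, C): r(i, j), Eq. (5)
        rel_vec = torch.cat([rel, rel.transpose(1, 2)], dim=2)   # [R_C(i,:), R_C(:,i)]
        glob = self.delta(rel_vec)                               # (B, C, hidden)
        local = nodes.mean(dim=2, keepdim=True)                  # (B, C, 1) per-channel descriptor
        weights = torch.sigmoid(self.score(torch.cat([glob, local], dim=2)))  # (B, C, 1)
        return y * weights.unsqueeze(-1)                         # Y = sum_i c_i * y_i, Eq. (6)
```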

3.2. Multi-Scale Feature Fusion

The purpose of adaptive spatial feature fusion (ASFF) [27] is to address the consistency issue among feature pyramids in object detection. By filtering conflicting information spatially, ASFF can weaken the inconsistency between features at different scales, thereby improving the scale invariance of features. Pest objects in pest images often vary in size and shape and may exhibit different detail and texture features. Therefore, relying solely on features from a single scale may not fully capture all useful information. Hence, inspired by the idea of ASFF, we propose multi-scale feature fusion (MSFF) to extract rich detail and global information from feature maps at different scales. The structure of MSFF is illustrated in Figure 5, and implementing MSFF involves two steps: feature scale adjustment and adaptive fusion.
Specifically, L2, L3, and L4 denote different levels of the ResNet-50 model with the attention mechanism, representing features at different scales. First, the sizes of the L2 and L3 feature layers are adjusted to match the size of the L4 feature map: $X^{2 \to 4}$ denotes the L2 feature resized to the size of the L4 feature, $X^{3 \to 4}$ denotes the L3 feature resized to the size of the L4 feature, and the L4 feature layer itself is denoted as $X^{4 \to 4}$. Then, each layer's channel count is compressed to eight through a $1 \times 1$ convolution, the results are concatenated along the channel dimension, and another $1 \times 1$ convolution produces a 3-channel weight map used for the weighted fusion of the different feature levels. A softmax operation bounds the fusion weights within [0, 1], yielding fusion coefficients $\alpha^3$, $\beta^3$, and $\gamma^3$ for L2, L3, and L4, respectively. The input feature maps are then weighted and fused using these coefficients to obtain the fused feature map $y^3$. Finally, the fused feature map is convolved with a $3 \times 3$ kernel with a stride of one to extract higher-level feature representations, enhancing the expressiveness and discriminability of the features. The fused feature layer produced by MSFF is therefore given by Equation (7):
$$y^3 = \alpha^3 \cdot X^{2 \to 4} + \beta^3 \cdot X^{3 \to 4} + \gamma^3 \cdot X^{4 \to 4} \tag{7}$$
where $\alpha^3 + \beta^3 + \gamma^3 = 1$ and $\alpha^3$, $\beta^3$, $\gamma^3$ each lie in the range [0, 1].
MSFF effectively integrates information from different feature layers by adaptively adjusting their fusion ratios. Thus, MSFF dynamically adjusts the weights of features based on the importance of different parts of the image, enabling a more accurate capture of pest-related information present in the image. Through this approach, MSFF efficiently filters out conflicting information in the image, thereby improving the model’s accuracy in identifying pests.
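The fusion in Equation (7) can be sketched in PyTorch as follows. Since the ResNet-50 levels differ in both resolution and channel count, this sketch also projects all levels to a common channel width before the weighted sum; that projection, the output width, and the bilinear resizing are assumptions not fixed by the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFF(nn.Module):
    """Sketch of multi-scale feature fusion (Eq. 7) over ResNet-50 levels L2-L4."""

    def __init__(self, channels=(512, 1024, 2048), out_channels=2048):
        super().__init__()
        # project every level to a common channel width so the maps can be summed
        self.align = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in channels])
        # compress each aligned level to 8 channels for weight prediction
        self.compress = nn.ModuleList([nn.Conv2d(out_channels, 8, 1) for _ in channels])
        # 1x1 conv turning the concatenated 24 channels into 3 fusion weights
        self.weight = nn.Conv2d(8 * len(channels), len(channels), 1)
        # 3x3 conv refining the fused map
        self.refine = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, l2, l3, l4):
        size = l4.shape[-2:]
        feats = [F.interpolate(self.align[i](x), size=size, mode="bilinear",
                               align_corners=False)
                 for i, x in enumerate((l2, l3, l4))]
        w = torch.cat([self.compress[i](f) for i, f in enumerate(feats)], dim=1)
        w = torch.softmax(self.weight(w), dim=1)       # alpha, beta, gamma (sum to 1)
        fused = sum(w[:, i:i + 1] * feats[i] for i in range(len(feats)))  # Eq. (7)
        return self.refine(fused)
```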

3.3. Generalized-Mean Pooling

Traditional max pooling or average pooling methods are ineffective at capturing salient features with domain specificity. To address this issue, we introduce a learnable pooling layer known as generalized-mean pooling (GeMP) [28]. GeMP applies element-wise power operation to the input features followed by averaging, thereby better capturing important features with subtle differences in pest recognition tasks. Thus, through GeMP, the model can better understand detailed information in pest images and differentiate between different types of pests. Mathematically, GeMP can be represented by Equation (8):
$$f = [f_1 \cdots f_k \cdots f_K]^{\mathrm{T}}, \qquad f_k = \left( \frac{1}{|X_k|} \sum_{x_i \in X_k} x_i^{\,p_k} \right)^{\!1/p_k} \tag{8}$$
where $f_k$ represents the output feature, $K$ is the number of feature maps in the last layer, $X$ is the input feature map with $X \in \mathbb{R}^{H \times W \times C}$ and $X_k \in \mathbb{R}^{H \times W}$, and $p_k$ is a pooling hyperparameter whose value is learned during backpropagation.
It is worth noting that when $p_k = 1$, GeMP degenerates into average pooling, as shown in Equation (9):
$$f = [f_1 \cdots f_k \cdots f_K]^{\mathrm{T}}, \qquad f_k = \frac{1}{|X_k|} \sum_{x_i \in X_k} x_i \tag{9}$$
When $p_k \to \infty$, GeMP degenerates into max pooling, as shown in Equation (10):
$$f = [f_1 \cdots f_k \cdots f_K]^{\mathrm{T}}, \qquad f_k = \max_{x_i \in X_k} x_i \tag{10}$$
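A compact GeMP layer matching Equation (8) can be written as below; for simplicity it learns a single exponent $p$ shared by all feature maps, whereas Equation (8) allows one $p_k$ per map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeMP(nn.Module):
    """Generalized-mean pooling (Eq. 8) with a learnable exponent p."""

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # learned via backpropagation
        self.eps = eps

    def forward(self, x):
        # clamp avoids negative bases before the fractional power
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1)           # mean over each H x W map
        return x.pow(1.0 / self.p)                # (B, C, 1, 1)

# p = 1 recovers average pooling (Eq. 9); p -> infinity approaches max pooling (Eq. 10).
```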

3.4. Balanced Focal Loss

In pest recognition tasks, difficult samples are a key factor behind inefficient model learning. Difficult samples mainly stem from the imbalance in the number of samples per pest category and the low discriminability of small individual features. To address this issue, this study builds on the focal loss.
Focal loss (FL) [25] adds the modulating factor $(1 - p_t)^{\gamma}$ to the standard cross-entropy loss (CEL), adaptively adjusting the contribution of each sample to the loss based on its prediction accuracy. This helps the model focus more on misclassified and difficult samples. Specifically, the model reduces its attention on samples that are already predicted very accurately, as these samples are well classified and do not significantly affect the model's classification ability. For samples that are predicted inaccurately or even incorrectly, the model increases its attention, thereby improving its prediction capability for these samples and enhancing the overall performance. This design ensures that even if there are many easily classifiable samples, they do not dominate the training process. Mathematically, FL can be represented by Equation (11):
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \tag{11}$$
where $p_t$ is the predicted probability of the true class and $\gamma$ is the focusing parameter. By adjusting the weights of easy and hard samples, the model pays more attention to difficult-to-classify samples. When a sample is correctly classified, $p_t$ approaches 1 and $(1 - p_t)^{\gamma}$ approaches 0, so the loss term shrinks and attention to easy samples is reduced. Conversely, when a sample is misclassified, $p_t$ approaches 0, making the loss term larger and increasing attention to difficult samples.
In practical applications, a variant of FL with an additional balancing factor $\alpha$ often yields better results. The FL variant with $\alpha$ is defined in Equation (12):
$$FL(p_t) = -\alpha\,(1 - p_t)^{\gamma} \log(p_t) \tag{12}$$
In the original FL, $\alpha$ is a manually set balancing factor used to adjust the weights of easy and hard samples. However, a fixed value may not adapt well to changes in the dataset or to the dynamic adjustment required during training. Therefore, we propose the balanced focal loss (BFL), which computes $\alpha$ adaptively from the distribution of target categories. First, we count the occurrences of each class in the targets to obtain a histogram. Then, we divide the count of each class by the total number of samples to obtain its frequency. Next, to convert the frequencies into $\alpha$ weight coefficients, we divide the frequency of each class by 10 and subtract the result from 1. This maps the frequencies into a narrow range just below 1, so that more frequent classes receive slightly smaller weights than rarer ones. In this way, we obtain the weight coefficient of each class relative to the other classes, which is used to balance the differences in sample counts between classes. The definition of $\alpha$ is given in Equation (13):
$$f_i = \frac{n_i}{N}, \qquad \alpha_i = 1 - \frac{f_i}{10} \tag{13}$$
where $f_i$ represents the occurrence frequency of the $i$-th category, $n_i$ represents the number of samples in the $i$-th category, $\alpha_i$ represents the weight coefficient of the $i$-th category, and $N$ is the total number of samples.
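The following sketch shows one way to implement BFL as described above, deriving the per-class $\alpha_i$ from the class histogram of the targets (Eq. 13) and plugging it into the focal term of Eq. (12). Whether the histogram is taken per batch (as here) or once over the whole training set is an implementation choice not fixed by the text, and the focusing parameter $\gamma = 2$ is an assumed default.

```python
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, targets, num_classes, gamma=2.0):
    """Sketch of balanced focal loss (Eqs. 12-13)."""
    # class frequencies f_i from the target histogram (Eq. 13, first part)
    counts = torch.bincount(targets, minlength=num_classes).float()
    freq = counts / counts.sum()
    alpha = 1.0 - freq / 10.0                    # alpha_i = 1 - f_i / 10
    # per-sample probability of the true class, p_t
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # FL(p_t) = -alpha_i * (1 - p_t)^gamma * log(p_t)   (Eq. 12)
    loss = -alpha[targets] * (1.0 - pt).pow(gamma) * log_pt
    return loss.mean()

# hypothetical usage: logits = model(images); loss = balanced_focal_loss(logits, labels, 102)
```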

4. Experiments

In this section, we compare AM-MSFF with relevant state-of-the-art methods and validate the effectiveness of the added modules through a series of ablation studies.

4.1. Datasets

The IP102 dataset [29] is currently the largest publicly available benchmark dataset for insect pest recognition, covering eight crops: rice, corn, wheat, sugar beet, alfalfa, grapes, citrus, and mango. The dataset comprises a total of 75,222 insect images distributed across 102 categories and exhibits a natural long-tailed distribution. It adopts a hierarchical classification scheme: each insect is assigned to a superclass according to the crop it preys on and to a subclass labeling the specific pest species, with images covering different developmental stages such as egg, larva, pupa, and adult. Insects at different growth stages may exhibit distinct appearance features, and different species may share similar characteristics, further complicating insect classification. In our study, 45,095 images are used for training, 7508 for validation, and the remaining 22,619 for testing. Examples from the IP102 dataset are illustrated in Figure 6, and detailed information is provided in Table 1.
The D0 dataset [30] consists of 4508 insect images with a resolution of 200 × 200 pixels, covering most of the common insect species found in several major field crops, including corn, soybean, wheat, and rapeseed. In our study, we randomly divided D0 into three subsets, with 70% used for training. The remaining 30% was further split, with 30% of it allocated for validation and the remaining 70% forming the test set. Thus, 3155 images were used for training, 406 for validation, and the remaining 947 for testing. The names of the various insects and their image counts are listed in Table 2; as the table indicates, there is some degree of imbalance in the number of images per species. Examples from the D0 dataset are shown in Figure 7.

4.2. Evaluation Metrics

Due to the class imbalance in both the IP102 and D0 datasets, we evaluate our proposed model using the macro average precision ($MPre$), macro average recall ($MRec$), macro average F1-score ($MF1$), accuracy ($Acc$), and geometric mean ($GM$). To weigh each class equally, we compute the recall for each class and then take the average to obtain $MRec$, as follows:
$$Rec_c = \frac{TP_c}{TP_c + FN_c}$$
$$MRec = \frac{1}{C}\sum_{c=1}^{C} Rec_c$$
where $C$ is the number of classes, and $TP_c$ and $FN_c$ represent the true positives and false negatives for class $c$, respectively. Similarly, $Pre_c$ and $MPre$ are calculated using the following formulas:
$$Pre_c = \frac{TP_c}{TP_c + FP_c}$$
$$MPre = \frac{1}{C}\sum_{c=1}^{C} Pre_c$$
where $FP_c$ represents the false positives for class $c$. $MF1$ is the harmonic mean of $MRec$ and $MPre$, calculated using the following formula:
$$MF1 = \frac{2 \cdot MPre \cdot MRec}{MPre + MRec}$$
$Acc$ is calculated from the correct predictions across all classes, computed as follows:
$$Acc = \frac{TP}{N}$$
where $TP$ is the total number of correctly classified samples and $N$ is the total number of samples. $GM$ is computed from the sensitivity of each class (denoted $S_c$), calculated as follows:
$$S_c = \frac{TP_c}{TP_c + FN_c}$$
$$GM = \left( \prod_{c=1}^{C} S_c \right)^{1/C}$$
$GM$ equals 0 if and only if at least one $S_c$ equals 0. To avoid this issue, we replace sensitivity values of 0 with 0.001.
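All five metrics can be computed from a single confusion matrix; the following sketch applies the 0.001 floor to zero sensitivities before taking the geometric mean.

```python
import numpy as np

def macro_metrics(conf):
    """Compute Acc, MPre, MRec, MF1 and GM from a C x C confusion matrix
    (rows = true classes, columns = predicted classes)."""
    conf = conf.astype(float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    prec = tp / np.maximum(tp + fp, 1e-12)
    rec = tp / np.maximum(tp + fn, 1e-12)
    mpre, mrec = prec.mean(), rec.mean()
    mf1 = 2 * mpre * mrec / (mpre + mrec)
    acc = tp.sum() / conf.sum()
    sens = np.where(rec == 0, 0.001, rec)          # floor zero sensitivities
    gm = np.exp(np.log(sens).mean())               # geometric mean of per-class sensitivity
    return acc, mpre, mrec, mf1, gm
```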

4.3. Experiment Settings

We applied preprocessing to input images of size $h \times w$, where $h$ and $w$ denote the height and width of the image, respectively. Firstly, we resized each image while keeping its aspect ratio: the smaller of $h$ and $w$ was set to 256, and the larger side was scaled by the same ratio. This preserves the aspect ratio of the original image and adapts it to the input requirements of the model. During the training phase, we applied random cropping with a 256 × 256 window as a data augmentation technique to address overfitting. Random cropping selects different sub-regions of the image at random, increasing the diversity of the data and the generalization capability of the model. In the testing phase, we used center cropping with the same window size as in the training phase. This ensures that the image region used at test time is similar to that used during training, yielding comparable results.
During training, we used ResNet-50 pre-trained on the ImageNet dataset as the backbone network, with BFL as the classification loss function. To optimize the model, we used the Adam optimizer with the learning rate initialized to $1 \times 10^{-4}$ and the coefficients $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively. To control the decay of the learning rate, we employed exponential decay with a decay rate of 0.96. We partitioned the training data into batches, with a batch size of 64 for the IP102 dataset and 32 for the smaller D0 dataset. We set the maximum number of training epochs to 100.
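Put together, the preprocessing and optimization described above might be configured as follows; the torchvision transform choices, the placeholder backbone, and stepping the scheduler once per epoch are assumptions on top of the stated hyperparameters.

```python
import torch
import torchvision
from torchvision import transforms

# aspect-ratio-preserving resize of the shorter side to 256, then 256x256 crops
train_tf = transforms.Compose([
    transforms.Resize(256),        # shorter side -> 256, longer side scaled by the same ratio
    transforms.RandomCrop(256),    # random 256x256 window as augmentation
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),    # deterministic crop so test regions match training
    transforms.ToTensor(),
])

# placeholder backbone; in the paper this is the full AM-MSFF network
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(100):           # at most 100 epochs; batch size 64 (IP102) or 32 (D0)
    # ... iterate over the training loader, compute BFL, call optimizer.step() ...
    scheduler.step()               # exponential learning-rate decay, applied per epoch (assumed)
```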

4.4. Experimental Results

To assess the effectiveness of our approach, we compared it with several state-of-the-art methods and conducted experiments on the IP102 and D0 datasets. The experimental results are shown in Table 3 and Table 4.
On the IP102 dataset, we compared AM-MSFF with ResNet-50 as implemented in [29], as well as with two ResNet variants, FR-ResNet [31] and DMF-ResNet [32]. The results indicate that AM-MSFF outperforms ResNet-50 and its variants. Additionally, our method is competitive with other state-of-the-art models. Specifically, as shown in Table 3, the AM-MSFF method achieves state-of-the-art recognition accuracy, surpassing MMAL [34] by 0.48%. It ranks slightly below GAEnsemble [33] in MPre but still takes second place, and it is 1.76% below MMAL in MRec while maintaining second position. It is 0.14% below GAEnsemble in MF1 but outperforms MMAL by 3.04% in GM.
On the D0 dataset, we compared AM-MSFF with ResNet-50 as implemented in [33] and with other state-of-the-art methods, achieving the best results. Specifically, as shown in Table 4, compared with the currently best-performing GAEnsemble, AM-MSFF is higher by 0.24%, 0.04%, 0.05%, and 0.03% in terms of ACC, MPre, MRec, and MF1, respectively. This result further demonstrates that our model has higher accuracy and better generalization ability.

4.5. Ablative Study

To evaluate the performance of the AM-MSFF model, we conducted ablation experiments to analyze the contributions of its four components to the model’s performance. Since the recognition difficulty of the IP102 dataset is greater, we chose to conduct ablation experiments on this dataset, and the experimental results are presented in Table 5.
Our ResNet-50 baseline performs significantly better than the implementation in [29]. Our model adopts random cropping augmentation, which randomly crops the input images during training, increasing the diversity and richness of the data. This improves the model's generalization ability and robustness, making it better suited to different scenes and variations. Additionally, we used the Adam optimizer, a gradient-based adaptive optimization algorithm. Compared to the stochastic gradient descent (SGD) optimizer used in [29], Adam converges faster and finds better local minima, improving the efficiency and performance of model training.

4.5.1. The Impact of Relation-Aware Global Attention

To validate the impact of RGA on model performance, we conducted a series of ablation experiments, and the results are shown in Table 5. Firstly, compared to the baseline model, adding only the RGA module led to a performance improvement: an increase of 0.46% in ACC, 0.45% in MPre, 0.58% in MRec, 0.47% in MF1, and 0.13% in GM. Furthermore, compared to the AM-MSFF variant without the RGA module, the full AM-MSFF model improved on all metrics: an increase of 0.39% in ACC, 0.52% in MPre, 0.12% in MRec, 0.38% in MF1, and 0.51% in GM.
The results indicate that the introduction of the RGA module effectively enhances model performance. The RGA module improves the representation capacity of features related to pests and captures relationships between features, effectively enhancing the model’s ability to recognize pests.

4.5.2. The Impact of Multi-Scale Feature Fusion

To validate the impact of MSFF on model performance, we conducted a series of ablation experiments, and the results are shown in Table 5. Firstly, compared to the baseline model, adding only the MSFF module led to a 0.33% increase in ACC, a 0.36% increase in MPre, a slight decrease of 0.17% in MRec, a 0.12% increase in MF1, and a 0.36% increase in GM. Although MRec decreased slightly, the improvement in the other metrics indicates the module's effectiveness. Secondly, compared to the AM-MSFF variant without the MSFF module, adding MSFF led to more significant improvements across all performance metrics: a 0.39% increase in ACC, a 0.51% increase in MPre, a 0.12% increase in MRec, a 0.38% increase in MF1, and a 0.51% increase in GM.
The experimental results demonstrate that by adding the MSFF module, our model can fully utilize feature maps from different scales. By integrating feature information from different scales, the model can better understand and capture the details and features of pests comprehensively, enhancing its perception ability towards pest targets, and improving the accuracy of identification results.

4.5.3. The Impact of Generalized-Mean Pooling

To assess the impact of GeMP on model performance, we conducted a set of comparative ablation experiments, and the results are listed in Table 6. They show that GeMP noticeably improves model performance. Firstly, we compared GeMP with global average pooling (GAP) and global max pooling (GMP) in the baseline model. Compared to the model using GAP, the model using GeMP achieved an increase of 0.49% in ACC, 0.47% in MPre, 0.87% in MRec, 0.66% in MF1, and 0.49% in GM. Compared to the model using GMP, the model using GeMP achieved an increase of 0.18% in ACC, 0.05% in MPre, 0.46% in MRec, 0.30% in MF1, and 0.33% in GM.
Secondly, replacing GAP with GeMP in the AM-MSFF model also improved performance markedly: compared to the model using GAP, the model using GeMP achieved an increase of 0.76% in ACC, 0.50% in MPre, 0.52% in MRec, 0.52% in MF1, and 0.34% in GM. Similarly, compared to the AM-MSFF model using GMP, GeMP achieved an increase of 0.66% in ACC, 1.21% in MPre, 0.70% in MF1, and 0.35% in GM. Although MRec decreased slightly by 0.14%, the improvements in the other metrics indicate the effectiveness of GeMP.
The experimental results indicate that GeMP achieves a higher recognition accuracy than GMP, and GMP outperforms GAP. In pest identification tasks, crucial information in feature maps tends to be concentrated in local regions, with other areas being relatively less important. GMP retains only the most significant parts of each feature map during pooling, whereas GAP simply averages the entire feature map, potentially leading to the loss or blurring of local information.
However, GeMP performs a weighted average based on the activation level of features, which more accurately reflects the crucial information in feature maps by considering the intensity of each feature during activation. In contrast, GAP cannot distinguish the importance of different features, and GMP overlooks other important activated information. Through GeMP, the model can more effectively utilize useful information in feature maps, thereby enhancing its perception and recognition accuracy of pest targets. Additionally, this weighted averaging approach helps the model better adapt to variations in different scenes and features, improving its generalization ability and robustness.

4.5.4. The Impact of Balanced Focal Loss

To validate the impact of BFL on model performance, we conducted a series of ablation experiments, and the results are presented in Table 7. In the experiments, CEL was used by default, and replacing CEL with BFL on the baseline led to improvements in multiple metrics: ACC increased by 0.6%, MPre by 0.36%, MRec by 0.37%, and MF1 by 0.31%. Although GM decreased slightly by 0.03%, the improvements in the other metrics suggest that BFL is effective in enhancing the model's classification ability. When using BFL in the AM-MSFF model, the recognition accuracy was higher than with FL, despite slight decreases in MRec and MF1. Compared to using CEL, there was a 0.24% increase in ACC, a 0.09% increase in MPre, a 0.14% increase in MRec, a 0.39% increase in MF1, and a 0.09% increase in GM.
Since BFL assigns lower weights to easily classifiable samples and higher weights to difficult-to-classify samples during training, it may lead to a misclassification of some easily classifiable samples, thereby reducing MRec and MF1. However, compared to CEL, BFL still effectively improves the problem of imbalanced data distribution and enhances the model’s classification performance.

4.5.5. Discussion of Results

From Table 3 and Table 4, it can be observed that the AM-MSFF method outperforms other state-of-the-art methods in terms of recognition accuracy on the IP102 and D0 datasets, demonstrating a higher performance. Additionally, in terms of evaluation metrics such as MPre, MRec, MF1, and GM, the AM-MSFF method also demonstrates competitiveness comparable to other state-of-the-art models.
However, in Table 5, we notice that the “AM-MSFF without GeMP” shows a decrease in accuracy compared to the “Baseline + RGA”. After analysis, we believe that this result may be due to an imbalance in local–global information. Although the RGA module is designed as a global attention mechanism, it also tends to focus more on the relationships between local regions, potentially leading to the neglect of some important contextual information when processing global information. With the addition of the MSFF module, the model’s ability to utilize global information is enhanced, as MSFF can better integrate features at different scales. However, this may also result in the model overly focusing on local information, leading to an imbalance when processing global information. Therefore, despite the overall improvement in utilizing information by the model, the imbalance may lead to a decrease in accuracy. It is worth noting that although the model’s recognition accuracy decreases, there is a significant improvement in metrics such as MPre, MRec, MF1, and GM compared to only adding a single RGA and MSFF module, indicating that the model is able to more accurately capture and express features at different scales.

4.6. Visualization

In this section, we use the Grad-CAM method [42] to visualize the attention regions of our proposed model in the input images, helping to interpret the model’s predictions and understand its behavior. Grad-CAM calculates the gradients of the target class to identify the feature map regions that play a crucial role in the final prediction and visualize them as class activation maps.
Figure 8 displays the visual representation of the attention regions of the model in the input images. Even though insects like alfalfa seed chalcid are small, AM-MSFF is still able to focus on the insects in the input images. In contrast, although ResNet-50 can correctly focus on the insect’s location in most cases, it seems to prefer larger and less accurate regions for prediction, leading to relatively poorer performance.
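For reference, a bare-bones Grad-CAM of the kind used to produce Figure 8 can be implemented with forward and backward hooks on a convolutional stage. The sketch below is generic rather than the authors' visualization code; the choice of model.layer4 as the target layer and the min-max normalization of the map are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

def grad_cam(model, layer, image, class_idx=None):
    """Minimal Grad-CAM sketch: class-gradient-weighted average of feature maps."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image)                        # image: (1, 3, H, W)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over gradients
        cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam.squeeze().detach()
    finally:
        h1.remove()
        h2.remove()

# hypothetical usage with a plain ResNet-50:
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
# cam = grad_cam(model, model.layer4, some_preprocessed_image_tensor)
```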

5. Conclusions

This paper proposes a pest identification network based on attention mechanism and multi-scale feature fusion (AM-MSFF), which consists of three key modules: the RGA module, the MSFF module, and the GeMP module. Additionally, an improved loss function called BFL is proposed for the classification task. In the specific network architecture, the RGA module models the relationship between different positions in the image and weights them using an attention mechanism. This enables the network to focus on and highlight pest areas while suppressing interference from irrelevant regions. The MSFF module enhances the model’s perception and representation capabilities by fusing multi-scale feature information and paying attention to both details and global features. Unlike traditional GAP and max pooling, the GeMP module better preserves spatial information in the feature map, improving the perception of local details. In addition, to address the issue of class imbalance, this study proposes the use of BFL as a replacement for the cross-entropy loss to adjust sample weights. Experimental results on the IP102 and D0 datasets demonstrate the outstanding performance of the AM-MSFF method. On the IP102 dataset, the accuracy reaches 72.64%, while on the D0 dataset, it reaches 99.05%. Compared to other networks, the AM-MSFF method achieves high levels of accuracy.
In future research, on the one hand, we plan to delve into the characteristics and features of pest image data and design more targeted, efficient, and streamlined network architectures. On the other hand, we also aim to further enhance pest recognition performance through multimodal fusion. In addition to image data, pests may also be accompanied by other sensory data, such as sound and vibration. We can fuse these different modalities of data to obtain more comprehensive and accurate pest information.

Author Contributions

Conceptualization, M.Z.; methodology, M.Z.; validation, M.Z.; writing—original draft, M.Z.; writing—review and editing, M.Z., W.Y. and D.C.; supervision, W.Y., C.F. and F.W.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key Research and Development Program of China (Grant No. 2022ZD0115802), the Key Research and Development Program of the Autonomous Region (Grant No. 2022B01008), the National Natural Science Foundation of China (Grant No. 62262065) and the Tianshan Elite Science and Technology Innovation Leading Talents Program of the Autonomous Region (Grant No. 2022TSYCLJ0037).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kang, S.; Hao, X.; Du, T.; Tong, L.; Su, X.; Lu, H.; Li, X.; Huo, Z.; Li, S.; Ding, R. Improving agricultural water productivity to ensure food security in China under changing environment: From research to practice. Agric. Water Manag. 2017, 179, 5–17. [Google Scholar] [CrossRef]
  2. Waddington, S.R.; Li, X.; Dixon, J.; Hyman, G.; De Vicente, M.C. Getting the focus right: Production constraints for six major food crops in Asian and African farming systems. Food Secur. 2010, 2, 27–48. [Google Scholar] [CrossRef]
  3. Li, W.; Zheng, T.; Yang, Z.; Li, M.; Sun, C.; Yang, X. Classification and detection of insects from field images using deep learning for smart pest management: A systematic review. Ecol. Inform. 2021, 66, 101460. [Google Scholar] [CrossRef]
  4. Damos, P. Modular structure of web-based decision support systems for integrated pest management. A review. Agron. Sustain. Dev. 2015, 35, 1347–1372. [Google Scholar] [CrossRef]
  5. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [Google Scholar] [CrossRef]
  6. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  7. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  8. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  9. Thenmozhi, K.; Reddy, U.S. Crop pest classification based on deep convolutional neural network and transfer learning. Comput. Electron. Agric. 2019, 164, 104906. [Google Scholar] [CrossRef]
  10. Truman, J.W.; Riddiford, L.M. The origins of insect metamorphosis. Nature 1999, 401, 447–452. [Google Scholar] [CrossRef] [PubMed]
  11. Gilbert, L.I.; Schneiderman, H.A. Some biochemical aspects of insect metamorphosis. Am. Zool. 1961; 11–51. [Google Scholar]
  12. Mayo, M.; Watson, A.T. Automatic species identification of live moths. Knowl. Based Syst. 2007, 20, 195–202. [Google Scholar] [CrossRef]
  13. Rasband, W. ImageJ: Image Processing and Analysis in Java. Astrophysics Source Code Library. 2012, p. ascl-1206. Available online: https://ui.adsabs.harvard.edu/abs/2012ascl.soft06013R/abstract (accessed on 18 March 2024).
  14. Yalcin, H. Vision based automatic inspection of insects in pheromone traps. In Proceedings of the 2015 Fourth International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Istanbul, Turkey, 20–24 July 2015; pp. 333–338. [Google Scholar]
  15. Venugoban, K.; Ramanan, A. Image classification of paddy field insect pests using gradient-based features. Int. J. Mach. Learn. Comput. 2014, 4, 1. [Google Scholar] [CrossRef]
  16. Xie, C.; Zhang, J.; Li, R.; Li, J.; Hong, P.; Xia, J.; Chen, P. Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning. Comput. Electron. Agric. 2015, 119, 123–132. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  19. Li, Y.; Yang, J. Few-shot cotton pest recognition and terminal realization. Comput. Electron. Agric. 2020, 169, 105240. [Google Scholar] [CrossRef]
  20. Cheng, X.; Zhang, Y.; Chen, Y.; Wu, Y.; Yue, Y. Pest identification via deep residual learning in complex background. Comput. Electron. Agric. 2017, 141, 351–356. [Google Scholar] [CrossRef]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  22. Liu, W.; Wu, G.; Ren, F.; Kang, X. DFF-ResNet: An insect pest recognition model based on residual networks. Big Data Min. Anal. 2020, 3, 300–310. [Google Scholar] [CrossRef]
  23. Coulibaly, S.; Kamsu-Foguem, B.; Kamissoko, D.; Traore, D. Explainable deep convolutional neural networks for insect pest recognition. J. Clean. Prod. 2022, 371, 133638. [Google Scholar] [CrossRef]
  24. Hu, K.; Liu, Y.; Nie, J.; Zheng, X. Rice pest identification based on multi-scale double-branch GAN-ResNet. Front. Plant Sci. 2023, 14, 1167121. [Google Scholar] [CrossRef] [PubMed]
  25. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  26. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195. [Google Scholar]
  27. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  28. Radenović, F.; Tolias, G.; Chum, O. Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1655–1668. [Google Scholar] [CrossRef] [PubMed]
  29. Wu, X.; Zhan, C.; Lai, Y.K.; Cheng, M.M.; Yang, J. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8787–8796. [Google Scholar]
  30. Xie, C.; Wang, R.; Zhang, J.; Chen, P.; Dong, W.; Li, R.; Chen, T.; Chen, H. Multi-level learning features for automatic classification of field crop pests. Comput. Electron. Agric. 2018, 152, 233–241. [Google Scholar] [CrossRef]
  31. Ren, F.; Liu, W.; Wu, G. Feature reuse residual networks for insect pest recognition. IEEE Access 2019, 7, 122758–122768. [Google Scholar] [CrossRef]
  32. Liu, W.; Wu, G.; Ren, F. Deep multibranch fusion residual network for insect pest recognition. IEEE Trans. Cogn. Dev. Syst. 2020, 13, 705–716. [Google Scholar] [CrossRef]
  33. Nanni, L.; Maguolo, G.; Pancino, F. Insect pest image detection and recognition based on bio-inspired methods. Ecol. Inform. 2020, 57, 101089. [Google Scholar] [CrossRef]
  34. Ung, H.T.; Ung, H.Q.; Nguyen, B.T. An efficient insect pest classification using multiple convolutional neural network based models. arXiv 2021, arXiv:2107.12189. [Google Scholar]
  35. Yang, X.; Luo, Y.; Li, M.; Yang, Z.; Sun, C.; Li, W. Recognizing pests in field-based images by combining spatial and channel attention mechanism. IEEE Access 2021, 9, 162448–162458. [Google Scholar] [CrossRef]
  36. Setiawan, A.; Yudistira, N.; Wihandika, R.C. Large scale pest classification using efficient Convolutional Neural Network with augmentation and regularizers. Comput. Electron. Agric. 2022, 200, 107204. [Google Scholar] [CrossRef]
  37. An, J.; Du, Y.; Hong, P.; Zhang, L.; Weng, X. Insect recognition based on complementary features from multiple views. Sci. Rep. 2023, 13, 2966. [Google Scholar] [CrossRef]
  38. Lin, S.; Xiu, Y.; Kong, J.; Yang, C.; Zhao, C. An effective pyramid neural network based on graph-related attentions structure for fine-grained disease and pest identification in intelligent agriculture. Agriculture 2023, 13, 567. [Google Scholar] [CrossRef]
  39. Li, Y.; Sun, M.; Qi, Y. Common pests classification based on asymmetric convolution enhance depthwise separable neural network. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 8449–8457. [Google Scholar] [CrossRef]
  40. Yu, J.; Shen, Y.; Liu, N.; Pan, Q. Frequency-enhanced channel-spatial attention module for grain pests classification. Agriculture 2022, 12, 2046. [Google Scholar] [CrossRef]
  41. Su, Z.; Luo, J.; Wang, Y.; Kong, Q.; Dai, B. Comparative study of ensemble models of deep convolutional neural networks for crop pests classification. Multimed. Tools Appl. 2023, 82, 29567–29586. [Google Scholar] [CrossRef]
  42. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The structure of AM-MSFF.
Figure 2. The structure of relation-aware global attention.
Figure 3. The structure of spatial relation-aware global attention.
Figure 4. The structure of channel relation-aware global attention.
Figure 5. The structure of multi-scale feature fusion.
Figure 6. The example images from the IP102 dataset include various morphologies of insects, such as eggs, larvae, pupae, and adults.
Figure 7. Example of Cletus punctiger (Dallas) in the D0 dataset.
Figure 8. Visualization of Grad-CAMs produced by ResNet-50 and AM-MSFF.
Table 1. The detailed information about the IP102 dataset.

Crop Type              Rice    Corn     Wheat   Beet    Alfalfa   Vitis    Citrus   Mango
Number of Categories   14      13       9       8       13        16       19       10
Number of Images       8417    14,004   3418    4420    10,390    17,551   7273     9738
Table 2. The detailed information about the D0 dataset.

No.  Name                                    Quantity    No.  Name                                     Quantity
1    Dolycoris baccarum (Linnaeus)           87          21   Stollia ventralis (Westwood)             72
2    Lycorma delicatula (White)              92          22   Nilaparvata lugens (Stål)                62
3    Eurydema dominulus (Scopoli)            150         23   Diostrombus politus Uhler                238
4    Pieris rapae (Linnaeus)                 71          24   Phyllotreta striolata (Fabricius)        187
5    Halyomorpha halys (Stål)                101         25   Aulacophora indica (Gmelin)              78
6    Spilosoma obliqua (Walker)              66          26   Laodelphax striatellus (Fallén)          61
7    Graphosoma rubrolineata (Westwood)      116         27   Ceroplastes ceriferus (Anderson)         100
8    Luperomorpha suturalis Chen             101         28   Corythucha marmorata (Uhler)             98
9    Leptocorisa acuta (Thunberg)            133         29   Dryocosmus kuriphilus Yasumatsu          50
10   Sesamia inferens (Walker)               126         30   Porthesia taiwana Shiraki                141
11   Cicadella viridis (Linnaeus)            138         31   Chromatomyia horticola (Goureau)         114
12   Callitettix versicolor (Fabricius)      156         32   Iscadia inexacta (Walker)                79
13   Scotinophara lurida (Burmeister)        117         33   Plutella xylostella (Linnaeus)           69
14   Cletus punctiger (Dallas)               169         34   Empoasca flavescens (Fabricius)          133
15   Nezara viridula (Linnaeus)              175         35   Dolerus tritici Chu                      88
16   Dicladispa armigera (Olivier)           150         36   Spodoptera litura (Fabricius)            130
17   Riptortus pedestris (Fabricius)         110         37   Corythucha ciliata (Say)                 90
18   Maruca testulalis Geyer                 73          38   Bemisia tabaci (Gennadius)               147
19   Chauliops fallax Scott                  68          39   Ceutorhynchus asper Roelofs              146
20   Chilo suppressalis (Walker)             93          40   Strongyloides variegatus (Fairmaire)     135
Table 3. The comparison of classification performance on the IP102 dataset. Bold text indicates the best result, and underline is used to indicate the second-best result.

Model                                                            ACC     MPre    MRec    MF1     GM
ResNet-50 [29] (2019)                                            49.4    43.7    39.1    40.5    30.7
FR-ResNet [31] (2019)                                            55.24   -       -       54.18   -
DMF-ResNet [32] (2020)                                           59.22   -       -       58.37   -
GAEnsemble [33] (2020)                                           67.13   67.17   67.13   65.76   -
MMAL [34] (2021)                                                 72.15   62.63   69.13   64.53   58.43
STN-SE-ResNet50 [35] (2021)                                      69.84   -       -       -       -
MobileNetV2 + Sparse + CutMix + DynamicLR [36] (2022)            71.32   -       -       -       -
ResNet152 + Vision-Transformer + Swin-Transformer [37] (2023)    65.6    60.9    59.7    60.3    -
GPA-Net [38] (2023)                                              56.9    45.9    43.8    45.0    -
AM-MSFF                                                          72.64   64.54   67.37   65.62   61.48
Table 4. The comparison of classification performance on the D0 dataset. Bold text indicates the best result, and underline is used to indicate the second-best result.

Model                       ACC     MPre    MRec    MF1
MLLF + MKB [30] (2018)      89.3    -       -       -
CNNs [9] (2019)             95.97   -       -       -
GAEnsemble [33] (2020)      98.81   98.88   98.81   98.81
ResNet-50 [33] (2020)       92.18   92.74   92.18   92.07
ACEDSNet [39] (2022)        96.15   -       -       -
FcsNet [40] (2022)          98.33   98.49   98.33   98.34
SBPEnsemble [41] (2023)     96.18   96.45   95.37   -
AM-MSFF                     99.05   98.92   98.86   98.84
Table 5. Ablation experiment results of RGA, GeMP, and MSFF on the IP102 dataset. Bold text indicates the best result.

Model                     ACC     MPre    MRec    MF1     GM
ResNet-50 (baseline)      71.30   63.46   65.24   64.12   60.64
baseline + RGA            71.76   63.91   65.82   64.59   60.77
baseline + GeMP           71.79   63.93   66.11   64.78   61.13
baseline + MSFF           71.63   63.82   65.07   64.24   61.00
AM-MSFF without MSFF      72.01   63.94   66.39   64.85   60.84
AM-MSFF without GeMP      71.64   63.95   65.99   64.71   61.01
AM-MSFF without RGA       72.01   63.93   66.39   64.85   60.84
AM-MSFF                   72.40   64.45   66.51   65.23   61.35
Table 6. Comparative ablation experiment results of GAP, GMP, and GeMP on the IP102 dataset. Bold text indicates the best result.

Model                   ACC     MPre    MRec    MF1     GM
Baseline with GAP       71.30   63.46   65.24   64.12   60.64
Baseline with GMP       71.61   63.88   65.65   64.48   60.80
Baseline with GeMP      71.79   63.93   66.11   64.78   61.13
AM-MSFF with GAP        71.64   63.95   65.99   64.71   61.01
AM-MSFF with GMP        71.74   63.24   66.65   64.53   61.00
AM-MSFF with GeMP       72.40   64.45   66.51   65.23   61.35
Table 7. Loss function ablation experiment results on the IP102 dataset. Bold text indicates the best result, and underline is used to indicate the second-best result.

Model                   ACC     MPre    MRec    MF1     GM
Baseline with CEL       71.30   63.46   65.24   64.12   60.64
Baseline with FL        71.84   63.73   66.33   64.74   60.85
Baseline with BFL       71.95   63.82   65.61   64.43   60.61
AM-MSFF with CEL        72.40   64.45   66.51   65.23   61.35
AM-MSFF with FL         72.61   64.54   67.73   65.73   61.39
AM-MSFF with BFL        72.64   64.54   67.37   65.62   61.48
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
