1. Introduction
Aphid infestation seriously reduces grain yield by sucking plant sap and transmitting wheat virus diseases. Pesticides are often used to deal with pest infestations [1,2]. However, the overuse and misuse of pesticides lead to environmental degradation and food safety issues. Accurately and efficiently detecting pests is the foundation of integrated pest management (IPM) [3]. Because manual recognition and localization are time-consuming and laborious, researchers have attempted to solve this problem with computer vision techniques. Traditional machine-learning-based algorithms [4,5,6,7] identify specific pests with hand-designed feature extraction methods, which results in inadequate generalization for practical applications. Since the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [8], deep-learning-based methods have achieved state-of-the-art (SOTA) performance in general object detection. Therefore, researchers have transferred deep-learning-based detectors to pest recognition and localization.
Rustia et al. used insect sticky paper traps and wireless imaging devices to construct a greenhouse dataset for detecting and recognizing pests in a fixed environment [9]. With light-trap devices, Liu et al. designed a pest detector that uses global and local activation features to recognize and localize 16 species from 2 orders, Lepidoptera and Coleoptera [10]. Jiao et al. proposed an anchor-free network to identify and locate pests of 24 types, but incomplete feature fusion limited the improvement in detection performance [11]. Subsequently, a sampling-balanced region proposal network was designed to improve performance on small-size pests by introducing an attention mechanism into the residual network (ResNet) [12] to obtain richer pest appearance features [13]. To address the problem of detecting small, highly similar pests, Dong et al. designed CRA-Net, which improves the feature extraction capability of CNN-based methods with a channel recalibration feature pyramid network and an adaptive anchor module [14].
Light-trap methods automatically detect crop pests with light-trap devices, but the expensive equipment overhead limits the development of IPM. In addition, the above-mentioned methods accurately detect pests against a fixed background but are not suitable for the in-field environment because of complex lighting, various shooting angles, differing image quality, and intricate backgrounds. Due to these limitations of light-trap methods, researchers have turned to recognizing and locating pests in the field environment. Wu et al. constructed the large-scale insect dataset IP102, containing 75,000 images of 102 pest species, which laid a foundation for pest recognition and localization [15]. Pattnaik et al. explored the feasibility of deep-learning-based pest identification methods on a 10-class tomato pest dataset [16]. Ayan et al. combined different convolutional neural networks (CNNs) into a unified pest identification network and automatically selected the combination weights with a genetic algorithm [17]. Thenmozhi et al. evaluated four deep-learning-based methods (AlexNet [8], ResNet [12], LeNet [18], and VGG [19]) on three pest datasets via transfer learning [20]. Xie et al. used multi-task sparse representation and multi-kernel learning to identify 24 classes of common field pests [21].
The above methods use CNN-based models to recognize pests in a simple in-field environment, in which most images consist of one or two pest close-ups. Although these methods obtain satisfactory performance, they lack practical application value. In the complex in-field environment, Wang et al. addressed the difficulty of small-size pest recognition by combining context-aware information (longitude, latitude, temperature, and humidity) with Faster R-CNN [22]. Because of the clustering habits of pests, real in-field data exhibit dense distributions. Li et al. proposed a coarse-to-fine network to recognize and detect aphids by combining a two-stage network and a one-stage network into a unified pipeline. The network used the two-stage architecture to capture aphid regions and employed another fine network to detect aphids by treating each region from the two-stage network as a new image, which resulted in poor timeliness [23]. Subsequently, a data augmentation method was designed to improve detection performance on multi-scale and multi-attitude pests. It expanded the data by rotation and scaling in the training phase and detected pests with multi-resolution images in the testing phase. This method improved performance but ignored the time cost in both the training and testing phases, limiting its practical applicability [24]. Du et al. defined the problem of densely clustered tiny pest detection and proposed an aphid detector that used a cluster region proposal network to find aphid regions and employed a local detector group to recognize each aphid by transforming each aphid region into a single image [25]. The method accurately detects aphids, but its significantly slow inference speed limits practical application. Because of the difficulty of detecting tiny, densely distributed pests in the real field environment, existing methods improve performance at the cost of time. In addition, the limited feature-enhancement capacity of existing methods restricts further gains in aphid detection performance.
In summary, aphid data have three characteristics: tiny size, dense distribution, and multiple viewpoints.
Figure 1 shows the characteristics of aphid detection compared with other pest datasets (the simple in-field environment dataset IP102 [15] and the light-trap pest dataset Pest-26 [26]). First, the average relative object size in the APHID-4K dataset is 0.067%, significantly smaller than in IP102 (37.622%) and Pest-26 (2.674%). Second, because pests live in groups, the APHID-4K dataset exhibits dense distributions: it contains an average of 12.60 aphid objects per image, significantly more than IP102 (1.17 pests per image) and Pest-26 (6.73 pests per image). Third, because of the focusing difficulty of the data-collection device, multi-viewpoint aphids (aphids with varying degrees of clarity) exist in the images.
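To make these statistics concrete, the sketch below shows one way the per-image object count and the average relative size could be computed from bounding-box annotations; the COCO-style field names are assumptions for illustration, not the authors' tooling.

```python
from collections import defaultdict

def dataset_statistics(images, annotations):
    """images: list of dicts with 'id', 'width', 'height';
    annotations: list of dicts with 'image_id' and 'bbox' = [x, y, w, h]."""
    image_area = {im["id"]: im["width"] * im["height"] for im in images}
    counts = defaultdict(int)
    relative_sizes = []
    for ann in annotations:
        _, _, w, h = ann["bbox"]
        counts[ann["image_id"]] += 1
        # Relative size: bounding-box area as a fraction of image area.
        relative_sizes.append(w * h / image_area[ann["image_id"]])
    avg_objects_per_image = sum(counts.values()) / len(images)            # e.g., ~12.60 for APHID-4K
    avg_relative_size_pct = 100.0 * sum(relative_sizes) / len(relative_sizes)  # in percent, e.g., ~0.067
    return avg_objects_per_image, avg_relative_size_pct
```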
Because of the above-mentioned characteristics, existing methods have several limitations in aphid detection: (1) Due to the tiny size of aphids, their features are difficult to extract, resulting in unsatisfactory detection performance [11,14,26]. The tiny-size features gradually disappear during convolution operations, and the resulting blurred features are insufficient for accurately locating densely distributed aphids. (2) Because of multi-viewpoint aphids in the images, vague aphids are missed. (3) Because of the dense distribution, existing methods have to process the same aphid image multiple times, even re-detecting each aphid region as a new image, which limits practical application value (performance is improved without considering efficiency) [23,24,25]. To solve the above-mentioned defects, we design a tiny-size dense aphid detection network (TD-Det) to improve performance and efficiency simultaneously with two core designs: a Transformer feature pyramid network (T-FPN) and a multi-resolution training method (MTM). The T-FPN improves the feature expression ability for tiny-size, densely distributed aphids through a feature-wise Transformer module (FTM) and a channel-wise feature recalibration module (CFRM), while the MTM makes training more robust (in both accuracy and efficiency) by using a coarse-to-fine resolution setting without extra time cost. In addition, extensive experiments on the APHID-4K dataset verify the feasibility of this study, and the results show that it improves both detection performance and training efficiency. Ablation experiments show that our T-FPN and MTM can improve the detection performance of other methods in a plug-and-play manner.
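For intuition, the following is a minimal sketch of a coarse-to-fine resolution schedule in the spirit of the MTM; the epoch boundaries, resolutions, and function names are illustrative assumptions rather than the exact TD-Det settings.

```python
import torch
import torch.nn.functional as F

# Hypothetical coarse-to-fine schedule: (start_epoch, input resolution).
RESOLUTION_SCHEDULE = [
    (0, (512, 512)),     # early epochs: coarse, cheap inputs
    (8, (768, 768)),     # middle epochs: intermediate resolution
    (12, (1024, 1024)),  # final epochs: full (fine) resolution
]

def resolution_for_epoch(epoch):
    """Return the input resolution assigned to the given training epoch."""
    size = RESOLUTION_SCHEDULE[0][1]
    for start_epoch, s in RESOLUTION_SCHEDULE:
        if epoch >= start_epoch:
            size = s
    return size

def resize_batch(images, epoch):
    """images: float tensor of shape (N, C, H, W); resize to the epoch's resolution."""
    size = resolution_for_epoch(epoch)
    return F.interpolate(images, size=size, mode="bilinear", align_corners=False)

# Example: a 1024x1024 batch is downsampled to 512x512 during the coarse phase.
batch = torch.rand(2, 3, 1024, 1024)
coarse_batch = resize_batch(batch, epoch=0)   # shape (2, 3, 512, 512)
```

In a detection setting, the ground-truth boxes would of course need to be rescaled by the same factor as the images at each resolution switch.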