1. Introduction
Crops and stored products have historically been, and will continue to be, attacked by pests [1]. In-depth research on insect behavior deepens our understanding of insects and thereby helps to formulate safe and effective control strategies. Bactrocera minax (Diptera: Tephritidae) is an important citrus pest mainly distributed in China and is also a target of external quarantine [2,3,4].
Grooming is a broad term covering all forms of body-surface care and is a common, habitual behavior of many insects [5]. Although the insect groups that groom differ, the main functions of grooming are surprisingly similar [6,7]: removing foreign dust particles from the cuticle and sensory organs [8], removing body-surface secretions and cuticular lipids [9,10], collecting pollen grains as food [11], and removing ectoparasites or pathogens [12]. Grooming also plays a significant role in maintaining the sensitivity of sensory organs [13,14]. Because grooming is an important part of the insect defense mechanism, identifying and classifying grooming behaviors is essential for systematically exploring their physiological, neurological, and pharmacological basis [6,12,15]. A better understanding of grooming will provide new insight toward the development of control practices that cause less damage to beneficial insects, opening new possibilities for sustainable agricultural activity [6].
With the rapid development of computer vision technology, using computers to process and analyze video data in order to reduce manual labor has become an inevitable trend across many industries [16,17]. Computer vision has been widely applied in daily life with excellent results, for example in face recognition [18] and object detection [19]. These technologies not only match the accuracy of human vision without needing rest, but are also tens of times faster than manual recognition. The same applies to agriculture. In recent years, agriculture has played a key role in the global economy [20], and applying computer vision throughout agricultural production is more efficient than manual work, providing a reliable and accurate basis for regulating and controlling production [20,21]. In agricultural insect behavior analysis, establishing a behavioral spectrum by observing and recording insect behavior is the most fundamental task [22,23,24]. However, most researchers still rely on manual observation and statistics, playing videos frame by frame to find and record the start and end time of each behavior [25]. Manually locating behavior intervals and judging behavior types in this way is not only inefficient; under long-term observation, statistical errors caused by observer fatigue also gradually accumulate.
In the course of the experiment, we tried several deep learning algorithms developed to track key body parts or predict behavior [26,27,28] to classify grooming, but they are not well suited to our experimental environment. For example, identifying the grooming behavior of Bactrocera minax by tracking key body parts requires the forelegs to be visible most of the time [14]. However, the forelegs move very quickly during grooming, and mouth grooming and hind leg grooming are often obscured when the fly's back faces the camera; such situations occur constantly [27]. We also want a single detection pass to yield a complete partition of the video into behavior intervals for subsequent analysis [29], not just real-time detection feedback.
In this paper, we propose an improved method based on spatio-temporal context and a Convolutional Neural Network (CNN) to detect the grooming behavior of Bactrocera minax. The background color of Bactrocera minax is extracted and added to the spatio-temporal feature image as an additional color channel. This enriches the spatial information of the feature image and enhances the distinction between front grooming (head, foreleg) and posterior grooming (hind leg, abdomen), so that a CNN-based detection model can classify these images. In this way, the method achieves automatic detection of the grooming behavior of Bactrocera minax and provides a reliable approach for improving our ability to document grooming.
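To make the idea concrete, the following minimal Python sketch illustrates one way such a feature image could be assembled from a short window of cropped frames, with a temporal motion channel, an appearance channel, and a background-color channel; the function name, window handling, and hue range are illustrative assumptions and do not reproduce the exact pipeline described in our Methods.

```python
import numpy as np
import cv2

def build_feature_image(frames_bgr, bg_hue_range=(35, 85)):
    """Illustrative sketch: fuse temporal motion cues with a color channel.

    frames_bgr   : list of consecutive BGR frames cropped around the fly.
    bg_hue_range : assumed HSV hue range of the arena background.
    Returns a 3-channel uint8 image whose channels encode
    (motion energy, latest-frame intensity, background-color mask).
    """
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32)
            for f in frames_bgr]

    # Temporal channel: mean absolute frame-to-frame difference over the window.
    motion = np.mean([np.abs(gray[i + 1] - gray[i])
                      for i in range(len(gray) - 1)], axis=0)
    motion = cv2.normalize(motion, None, 0, 255, cv2.NORM_MINMAX)

    # Spatial channel: intensity of the most recent frame.
    appearance = gray[-1]

    # Color channel: pixels falling in the assumed background hue range,
    # which helps separate the fly body from the arena background.
    hsv = cv2.cvtColor(frames_bgr[-1], cv2.COLOR_BGR2HSV)
    bg_mask = cv2.inRange(hsv, (bg_hue_range[0], 0, 0),
                          (bg_hue_range[1], 255, 255))

    return np.dstack([motion, appearance, bg_mask]).astype(np.uint8)
```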
4. Discussion
The rapid development of computer-vision-based agricultural insect recognition and animal pose estimation inspired us to develop a reliable statistical system for the grooming behavior of Bactrocera minax, so that video data can be processed more efficiently. Computer vision technology is now widely used in agricultural research [30,31,32,33], for example in crop pest detection [34,35,36] or pest activity detection [37], crop disease detection [38], identification of crop growth [39,40], crop yield prediction [41], and animal behavior detection [26,27,42]. The first four kinds of application can obtain good results by processing and analyzing only a few clear images. For example, in the fruit-yield prediction method of Ulzii-Orshikh Dorj et al. (2017) [41], the fruit targets were separated from the background by transforming and processing the original RGB images. A similar approach is used in our paper: when the target and the background differ in color information, this traditional and simple method is effective. Animal behavior detection, by contrast, analyzes video streams, where the size of the target, its state, and the recording environment have a much greater impact on detection performance. We therefore needed to tailor the detection method to Bactrocera minax, and the final experimental results confirm the effectiveness of our approach.
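As a hedged illustration of this kind of color-based separation (not the exact transformation used in [41] or in our own pipeline), a simple channel-difference threshold can isolate a target whose color differs from the background; the threshold value and channel choice below are assumed placeholders.

```python
import numpy as np
import cv2

def separate_by_color(image_bgr, diff_threshold=30):
    """Sketch of simple color-based target/background separation.

    Assumes the target differs from the background mainly in the balance
    of its red and green channels; the threshold value is illustrative.
    """
    r = image_bgr[..., 2].astype(np.int16)
    g = image_bgr[..., 1].astype(np.int16)
    # Keep pixels where one channel clearly dominates as foreground.
    mask = (np.abs(r - g) > diff_threshold).astype(np.uint8) * 255
    # Light morphological opening removes isolated noise pixels.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(image_bgr, image_bgr, mask=mask)
```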
Machine-vision identification of adult fruit fly species [43] and methods for identifying and counting other insects [44,45,46] are mature and have practical applications. Popular deep learning object detection algorithms such as YOLO [47,48] and Mask R-CNN [49,50] can also achieve good identification results. However, only a few methods detect the grooming behavior of flies. One approach analyzes video with deep neural networks such as DeepLabCut and LEAP. The former uses deep residual networks (ResNet) trained on a small number of labeled images to predict key body parts [51], while the latter is trained on hundreds of labeled examples to predict the locations of body parts and classifies behaviors through unsupervised learning [26]. Both deep-neural-network methods have been applied in Drosophila experiments with good results and strong generalization. However, the relatively low overall video quality, the small proportion of the frame occupied by the fly, and the rapid movement of the observed body parts limit their use in our setting. Our experiments differ from both: Bactrocera minax is relatively small in the video, and grooming produces motion blur and mutual occlusion between body parts, making key points difficult to predict. This is mainly due to insufficient spatial information in individual video frames. Although the above two methods achieve better results by using deeper network structures, in our experimental environment the loss of information through pooling and the vanishing-gradient problem in deep neural networks [52] mean that simply increasing network depth cannot continue to improve accuracy [53].
Benefiting from studies of spatio-temporal context in predicting human or animal behavior, we can use temporal information to compensate for insufficient spatial information. Inspired by ABRS [27], we can build better behavioral spatio-temporal features for Bactrocera minax. By fusing spatial information with temporal features, a more intuitive spatio-temporal feature image is generated, so that the behavior category of each feature image can be judged directly by the human eye or by computer vision. We chose a CNN for feature image classification, not only because of its reliability in computer vision [54,55], but also because of the convenience offered by current CNN libraries: labeling, training, and prediction are easy to understand for people outside the field, and the results are clear.
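For readers outside the field, the following minimal Keras sketch shows how a small CNN could be built, trained, and used for prediction on such feature images; the layer sizes, number of classes, and data arrays are assumptions for illustration and are not the exact architecture used in this study.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 5  # assumed, e.g., front grooming, posterior grooming, wing, walking, other

def build_classifier(input_shape=(64, 64, 3)):
    """Small illustrative CNN for classifying spatio-temporal feature images."""
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage with pre-built feature images and integer labels:
# model = build_classifier()
# model.fit(train_images, train_labels, validation_split=0.1, epochs=20)
# predicted_classes = np.argmax(model.predict(test_images), axis=1)
```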
Finally, we achieve more than 95% accuracy after a small amount of manual verification, and the difference from fully manual statistics remains stable between 10% and 15%, indicating that the final result is credible and stable. Part of the difference stems from differences in judgment; for example, the three specific behaviors within head grooming are somewhat ambiguous across repeated observations. The rest arises because the system is more sensitive to behavior boundaries: fully manual observation may not reliably detect front grooming bouts of less than 1 s, which are common in the experiments and generally involve head grooming and foreleg grooming. After long hours of work, a purely manual observer may not catch such short bouts every time, but the machine can.
Our next research direction is to further optimize the way spatio-temporal feature images are generated. On the one hand, we will consider making better use of CPU multi-threading at run time, or using the GPU to accelerate frame processing and the subsequent FFT, so as to achieve faster detection than the current implementation [56,57]. Improved performance would let us retain more frame detail during cropping, thereby improving the quality of the generated spatio-temporal feature images; it would also make deeper or larger CNN structures feasible, such as the currently highly robust ResNet models [53].
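As a hedged sketch of the kind of computation such acceleration would target (the window handling and array backend here are assumptions, not our current implementation), a per-pixel FFT over a temporal window of frames could be written as follows; array libraries with a numpy-compatible fft module, such as CuPy, would in principle allow the same code to run on a GPU by swapping the backend.

```python
import numpy as np
# import cupy as xp  # hypothetical GPU drop-in backend; numpy is used here
xp = np

def temporal_fft_energy(frame_window):
    """Per-pixel spectral energy over a sliding window of grayscale frames.

    frame_window : array of shape (T, H, W), float32.
    Returns an (H, W) map of summed FFT magnitudes (DC bin excluded),
    highlighting pixels with strong periodic motion such as leg rubbing.
    """
    stack = xp.asarray(frame_window, dtype=xp.float32)
    # Real FFT along the time axis, after removing each pixel's mean.
    spectrum = xp.fft.rfft(stack - stack.mean(axis=0), axis=0)
    magnitude = xp.abs(spectrum)
    # Drop the DC bin and sum the remaining frequency magnitudes per pixel.
    return magnitude[1:].sum(axis=0)
```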
On the other hand, we may use a fully convolutional network (FCN) [58] or U-Net [59] to segment the object from the background, which should give better results than segmentation based on RGB features. The former is an end-to-end fully convolutional network for semantic segmentation that combines semantic information from deep, coarse layers with surface information from shallow, fine layers to produce accurate segmentations [58]. The latter improves on the FCN: it has a large number of feature channels in its up-sampling path, which allows the network to propagate context information to higher-resolution layers [59]. In short, further optimization of each step of the method lays the groundwork for applying it more widely in the future.