1. Introduction
The estimated world area planted with vines in 2020 was 7.3 million hectares, resulting in an estimated world production of 260 million hectoliters of wine [
1]. Despite these promising indicators for the wine industry, a substantial body of literature is already available regarding the harmful effects of climate change on viticulture. These effects range from impacts on wine quality, grape composition, and vine physiology and phenology to unexpected disease and pest outbreaks [
2].
Automation in agriculture via computer vision (CV) and machine learning (ML) is particularly challenging due to the uncontrolled field conditions and uncertainties of the outdoor environment. Nevertheless, edge artificial intelligence (AI) is already reshaping the way farming is carried out. Digitalization and process automation via AI-powered agricultural systems are already relieving common pain points, such as the prevention and monitoring of pest outbreaks in viticulture. The monitoring of seasonal dynamics and the accurate species identification of key insects are crucial for optimizing the effectiveness, insecticide use, and costs of pest management programs. Deploying sticky traps on grape plantations to attract key insects has been the backbone of conventional pest management programs. However, trap inspection is a time-consuming process for winegrowers, conducted visually through the manual identification and counting of key insects. Additionally, winegrowers usually lack the taxonomy expertise required for accurate species identification.
In this work, we aim to improve pest monitoring and prevention processes by developing lightweight AI algorithms, suitable for deployment on edge devices, that automatically identify and count pests. In particular, we envisioned that the proposed approach should work on-site with mobile-acquired images of yellow sticky and delta traps, thus avoiding instrumented traps and complex hardware infrastructure. Convolutional neural networks (CNNs) are currently considered the state-of-the-art in the CV domain for object detection, providing higher accuracies than traditional methods such as scale-invariant feature transform (SIFT) and histograms of oriented gradients (HOGs) [
3]. These traditional methods are based on hand-crafted ML features, which would require a substantial amount of feature engineering effort to encompass all of the variable factors usually present in hand-held images acquired in the open field, e.g., the variability and heterogeneity of illumination, or artifacts such as dust and leaves. We started by selecting five different object detection deep learning models suitable to run locally on mobile devices, namely: (i) SSD ResNet50 (RetinaNet50); (ii) CenterNet ResNet50; (iii) SSD MobileNet V2; (iv) EfficientDet-D0; and (v) Faster R-CNN ResNet101. This selection considered model lightness (i.e., it should be suitable for the computational power available on regular Android devices) and the availability of the model in the Tensorflow Object Detection API (to ensure that models can run on Android devices after conversion to TensorFlow Lite). The selected backbones have the following parameter numbers (PN) and time complexities in floating-point operations (FLOPs): ResNet50: 23.6M PN and 5G FLOPs; ResNet101: 42.7M PN and 8G FLOPs; MobileNet V2: 3.4M PN and 0.31G FLOPs; and EfficientDet-D0: 3.9M PN and 0.02G FLOPs. For benchmarking purposes, we applied both model-centric and data-centric training strategies to find the most suitable model and respective parametrization.
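As a rough illustration of the relative lightness of these backbones, the parameter counts of the corresponding classification backbones can be checked with the standard Keras implementations. This is only a sketch under the assumption that the classification backbones approximate the detector backbones: detection heads add further parameters, and the EfficientDet-D0 backbone is omitted since it is not available in tf.keras.applications in the same form.

```python
import tensorflow as tf

# Approximate backbone sizes only: classification backbones without detection
# heads, so the counts differ slightly from those of the full detectors.
backbones = {
    "ResNet50": tf.keras.applications.ResNet50(weights=None, include_top=False),
    "ResNet101": tf.keras.applications.ResNet101(weights=None, include_top=False),
    "MobileNetV2": tf.keras.applications.MobileNetV2(weights=None, include_top=False),
}
for name, model in backbones.items():
    print(f"{name}: {model.count_params() / 1e6:.1f}M parameters")
```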
This paper has the following structure:
Section 1 presents the motivation of the work and respective objectives;
Section 2 summarizes the relevant related work;
Section 3 provides the technical background regarding key insects in vineyards;
Section 4 describes the used dataset and respective data preparation procedures; in
Section 5, the proposed methodology is detailed;
Section 6 presents the results and discussion; and, finally,
Section 7 draws the conclusions and future work.
2. Related Work
Object detection is a key field in AI and CV, allowing computer systems to “see” their environments by detecting objects in images or videos. During the last decade, the advent of deep learning (DL) techniques led to significant and promising advances in the field of object detection. In particular, CNNs and GPUs are the main drivers behind the significant advancement of CV-based object detection. CNNs have exhibited superior skills in learning invariant features in various categories of objects with large quantities of training data [
4,
5]. Compared with conventional computer vision approaches based on hand-crafted features, CNNs facilitate the search for regions of interest for detecting target objects and the extraction of more discriminative features.
Currently, object detection frameworks can be divided into two categories: (1) two-stage detectors such as region-based CNN (R-CNN) [
6], faster R-CNN [
7], and other detectors that use feature pyramid networks (FPN) [
8] that first output multiple region proposals for object candidates and then generate more accurate regions with corresponding class labels; and (2) one-stage detectors such as SSD [
9], YOLO [
10], and RetinaNet [
11] that simultaneously regress object locations and classes. One-stage detectors are significantly more time-efficient, structurally simpler, and have a greater applicability in real-time applications, but the two-stage detectors usually achieve a better detection performance.
Computational approaches for the automatic identification of key insects for vineyards monitoring are starting to emerge in the literature and commercial solutions. Image acquisition and the annotation of sticky traps were already reported in different works to foster the development of CV-based automated approaches [
12,
13,
14]. In most cases, these traps also capture insects that are non-threatening to the crop, or artifacts such as dust and leaves, which might interfere with the automated detection and identification of pests. Additionally, open-field image acquisition is especially challenging due to light variability and the adequacy of trap images to be further analyzed (e.g., properly focused, the trap is entirely visible, proper perspective, etc.) [
15].
Several studies are exclusively focused on camera-equipped trap prototypes to facilitate the acquisition process in a controlled environment [
16,
17,
18,
19,
20,
21]. In fact, controlling the acquisition via instrumented traps with static set-ups allows for standardizing the image acquisition process by fixing variables such as illumination conditions or the image acquisition angle. However, the benefits of using camera-equipped traps can be overshadowed by the need to establish a costly hardware infrastructure on the field and the respective maintenance.
To avoid such infrastructure restrictions, smartphones appear as an exciting alternative, being widely disseminated devices that simultaneously allow portability, communication, and computational power to run AI algorithms on the edge [
15]. A recent study [
22] investigated the impact of using images captured with scanners in laboratory settings versus smartphone images acquired under more realistic conditions and concluded that smartphone images were crucial to ensure the robustness of the trained network. Furthermore, the same authors assessed the impact of using images captured from different devices for insect classification using CNNs [
23], and concluded that the training process benefits from the variability of using images acquired with different devices, resulting in improved classification performances. Although several annotated sticky trap datasets have already been created [
14,
19,
24,
25], to the best of our knowledge, none of them provides annotations of key insects for viticulture.
In terms of CV and AI approaches for pest detection, earlier studies used traditional CV and AI algorithms such as the watershed algorithm and Mahalanobis distance [
26], multiple-task sparse representation and multiple-kernel learning [
24], unsupervised feature learning [
25], or the two-dimensional Fourier transform spectrum [
27]. More recently, several DL-based approaches have been proposed to identify and count a wide range of insects on sticky traps [
16,
18,
19,
21,
23,
26,
28,
29,
30]. In particular, Zhong et al. [
16] developed specialized hardware using a Raspberry Pi system for counting and recognizing six species of insects on sticky traps. They used a hybrid approach that merges DL (YOLO) with a conventional ML classifier (SVM), obtaining an accuracy increase of 30% for all six species in comparison with using only the YOLO model. Espinoza et al. [
29] also developed an approach that combines classical image processing, detecting objects using morphological and color features, with DL, achieving F-measures of 0.96 and 0.94 for whiteflies and thrips, respectively. Hong et al. [
30] conducted a study comparing various CNN-based object detectors for moths in images of delta traps. Using a Faster R-CNN ResNet101, they achieved a mean average precision (mAP) of around 90% for four classes of moth.
In summary, it is clear that CV approaches based on object detection and DL have brought significant improvements to insect detection and classification. However, DL methods might not perform well if the object resolution is low (e.g., small insects) or if only a few training samples are available. Data collection and annotation processes can be relatively expensive and time-consuming, and the usage of data-centric ML strategies to mitigate this problem in this field remains unexplored. Additionally, we found that research on using DL object detection for pest monitoring in viticulture, particularly to detect and count key insects in vineyards, is very scarce.
3. Pest Monitoring in Viticulture
The implementation of pest assessment processes through trap monitoring, damage assessment, and risk estimation allows for the application of preventive and protective measures. These procedures maximize the effectiveness of the protective interventions and consequently promote the sustainable use of pesticides, contributing to the implementation of the EU Directive 2009/128/EC [
31]. Both yellow sticky and delta traps must be monitored weekly to count the number of captured adults and to obtain the insect flight curve. Based on this information, risk estimates are performed, from which, the most appropriate strategy for the control of different pests is defined. This work addresses the automated detection of five different insects relevant for pest monitoring in viticulture (see
Figure 1):
European Grapevine Moth (GM),
Lobesia botrana (Denis & Schiffermüller) (Lepidoptera: Tortricidae), is among the most economically important vine pests, in particular because of its relationship with the development of grey and acetic rot, which affect the quality of the grape [
32]. The adult is brownish-grey, 11–13 mm long, and approximately 6–8 mm wide [
33]. It hibernates in the pupal stage and can develop three to four generations, causing damage to the vine's flowers or bunches from March to October. This pest is monitored via delta traps, in which a glue base is placed together with a capsule containing a synthetic pheromone ((E,Z)-7,9-dodecadien acetate), similar to the pheromone released by females to attract males [
34]. In this way, the males are attracted by the trap and are retained in the glue base, which allows for monitoring the flight of this pest.
Green Leafhopper (GL) (
Empoasca vitis (Göthe) or
Jacobiasca lybica (Bergevin & Zanon)) (Hemiptera: Cicadellidae) is a piercing-sucking insect that migrates to the vine when it has foliage, sucking the sap from the leaves and causing necrosis that can, on the one hand, damage grape maturation and, on the other, weaken the vine. In its adult stage, the insect is 3 mm long. Yellow sticky traps, to which adults are attracted and glued, are used for monitoring [
35].
“
Flavescence Dorée”
Leafhopper (FDL),
Scaphoideus titanus Ball (Hemiptera: Cicadellidae) is the vector of the “Flavescence dorée”, one of the most destructive diseases of the vine, caused by the phytoplasma Grapevine flavescence dorée [
36]. The size of this leafhopper in its adult stage varies between 4.5 and 5.5 mm, with a yellowish-brown color [
37]. This insect is also monitored with yellow sticky traps.
Tomato Moth (TM),
Tuta absoluta (Meyrick) (Lepidoptera: Gelechiidae) is a key pest of tomato, which may also be present on other hosts, such as the weed
Solanum nigrum, which is very common in vineyards. It is a small moth, approximately 7 mm long, with a pair of greyish wings speckled with darker spots and a second pair of darker-colored wings [
38]. It is possible that the pheromone released by the grapevine moth contains some chemical compound similar to the pheromone released by the tomato moth, as a considerable number of tomato moth adults are captured in the delta-type sex traps used for the grapevine moth, which can cause errors in the counts.
Idaea degeneraria (Hübner) (Lepidoptera: Geometridae), hereinafter referred to as Morphotype C (MC) for practical reasons, is a moth that is approximately 18 mm long and is usually recorded in a delta trap. This moth does not negatively impact the vine culture, but might interfere with the detection and counting of other relevant insects in delta traps.
4. Dataset and Data Preparation
This section details the used image dataset and the respective data preparation procedures. The creation of this dataset was motivated by the lack of publicly available sticky trap datasets with annotations of key insects for viticulture.
4.1. Dataset
This mobile-acquired image database was specifically created under the scope of this work. Different mobile devices were used by several winegrowers and taxonomy specialists to acquire a dataset of 168 images of yellow sticky and delta traps. The images were acquired in three different vineyards of the Douro Wine Region, namely Quinta de São Luíz (Sogevinus Quintas SA), Quinta do Seixo (Sogrape Vinhos, SA) and Quinta do Bom Retiro (Adriano Ramos Pinto—Vinhos SA), with image resolutions ranging from 720 × 540 to 4096 × 3072 pixels (see
Figure 2).
Each image was annotated manually by experienced taxonomy specialists regarding the presence of five key insects for viticulture (as detailed in
Section 3) in the form of bounding boxes enclosing the target insect and respective species labeling. The distribution of annotations by species is summarized in
Table 1, the presence of more than one species being frequent in a single trap image.
4.2. Subset Division
The dataset division in training, validation, and test subsets was made in two distinct phases:
(i) Division of the dataset into training and test subsets to train and evaluate the model, respectively. This division was performed at the image level using a routine that ensures that around 20% of each class's annotations is assigned to the test set (a minimal sketch of this image-wise stratified split is provided after this list). The used dataset is quite variable regarding the volume of annotations per image, ranging from images with few annotations to images with more than 500 annotated insects. Thus, this image-wise separation decreases the risk of potential bias by avoiding the usage of regions of the same image for both training and test purposes. By ensuring that 20% of each class's annotations belongs to the test set, data balancing and representativeness are improved, since similar per-class distributions can be guaranteed in each subset.
(ii)
Division of the training subset into three training/validation subset splits to allow for the usage of a three-fold cross-validation strategy during training. The selection of three folds allowed for an appropriate amount of training instances while ensuring sufficient validation images. The same strategy regarding image-wise separation and similar per-class distributions was followed, resulting in three mutually exclusive subsets. For each split, one subset was selected for validation and the remaining two for training. The details about the dataset regarding the stratification of insect annotations for each subset can be consulted in
Table A1 (
Appendix A).
4.3. Image Patches Extraction
Most object detection models require input images of fixed dimensions to restrain the computational resources required for training and inference. Still, resizing (downsampling) trap images should be avoided, since it can severely impact the detection of small objects such as GL insects due to the critical decrease in object resolution. Thus, we followed an approach similar to that presented in [
39], where the images are divided into adjacent patches. An overlap between adjacent patches is allowed to reduce the risk of an insect not being detected when split between patches; the overlap percentage is a setting that is further optimized. It should be noted that patches without annotations are also fed to the model as negative examples; these usually contain insects that are non-threatening for viticulture, as well as artifacts such as dust, leaves, debris, or different spread patterns of the glue used in the trap.
To standardize the image patches extraction, CV algorithms were applied to segment and correct the trap perspective according to the work presented in [
15] (see
Figure 3B,C). After the model performs patch-wise inference, the coordinates of the insect bounding boxes in each patch are recomputed to the reference frame of the original image according to the patch position. In this work, a fixed patch dimension of 640 × 640 pixels was selected; patches were padded with a black border whenever they extended beyond the remaining image pixels (see
Figure 3D).
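A minimal sketch of this tiling and padding step is shown below, assuming NumPy images; the 40-pixel default overlap corresponds to the value selected later in the results.

```python
import numpy as np

def extract_patches(image, patch_size=640, overlap=40):
    """Split a perspective-corrected trap image into adjacent patches with a
    small overlap, padding border patches with black pixels. Returns the
    patches and their top-left offsets, used later to map patch-wise
    detections back to the original image coordinates."""
    stride = patch_size - overlap
    h, w = image.shape[:2]
    patches, offsets = [], []
    for y in range(0, h, stride):
        for x in range(0, w, stride):
            tile = image[y:y + patch_size, x:x + patch_size]
            padded = np.zeros((patch_size, patch_size, image.shape[2]),
                              dtype=image.dtype)
            padded[:tile.shape[0], :tile.shape[1]] = tile
            patches.append(padded)
            offsets.append((x, y))
    return patches, offsets

# A box (x1, y1, x2, y2) predicted in a patch maps back to the original image
# as (x1 + x_off, y1 + y_off, x2 + x_off, y2 + y_off).
```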
5. Methodology
The model-centric approach is currently the prevalent strategy for designing ML models in the ML community, i.e., it focuses on optimizing model architectures and their respective parameters. However, the performance of ML models does not depend exclusively on these types of optimizations: the quality of the data used to develop the model is also a crucial factor. Techniques that aim to improve the performance of ML models by enhancing the data are referred to as data-centric approaches.
The proposed methodology explores three groups of optimization strategies: model-centric, data-centric, and deployment-centric. Each group comprises different types of optimizations, which are executed and assessed iteratively (see
Figure 4).
5.1. Model-Centric
In the model-centric phase, the dataset is fixed, and the ML model is improved iteratively. The main goal is the optimization of the model to deal with noisy data. In particular, a series of comparative studies were made to further fine-tune the ML models’ hyperparameters, where the model architecture, loss function, anchor boxes, learning rate and respective decay, batch size, and patch overlap were optimized.
5.1.1. Model Architecture
For benchmarking purposes, we tested and compared five different object detection deep learning models suitable to run locally on mobile devices, namely: (i) SSD ResNet50 (RetinaNet50); (ii) CenterNet ResNet50; (iii) SSD MobileNet V2; (iv) EfficientDet-D0; and (v) Faster R-CNN ResNet101.
5.1.2. Loss Function
Focal loss introduced by [
11] assigns higher weights to hard examples, so that all of the easy examples contribute less to the loss and learning can focus on the hard examples, with the aim of alleviating the high background–foreground imbalance. Assuming that all objects are correctly annotated, focal loss modifies the standard cross-entropy equation by adding a modulating factor, $FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$. When $\gamma > 0$, the relative loss for easy and well-classified samples is reduced, putting the effort on the classification of hard examples.
The three major challenges on our dataset that affect the suitability of the standard implementation of the focal loss are the following: (i) unlabeled insects that will be treated as the background and potentially confuse the model during training (e.g., annotations discarded due to a lack of image quality, occlusions, or a bounding box split during patches division); (ii) the classes’ imbalance, which might compromise the correct detection of easy examples on certain classes; and (iii) the heterogeneity of the classes in terms of annotations sizes, as depicted in
Figure 1. To solve these problems, we introduced some modifications to the focal loss for this particular dataset. We combined improvements presented by Zhang et al. [
40] and by Sergievskiy and Ponamarev [
41]. In particular, we modified the focal loss to soften its response on hard samples, mirroring the response of the easy examples at given threshold values.
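The exact formulation of the modified loss is not reproduced here; the sketch below illustrates one possible reading of the modification, in which the modulating factor for hard samples (p_t below a threshold) is replaced by the mirrored response of an equally easy sample, softening their contribution. The function name and parameter values are illustrative assumptions.

```python
import tensorflow as tf

def softened_focal_loss(y_true, y_pred, gamma=2.0, threshold=0.5):
    """Binary focal loss whose modulating factor mirrors the response of easy
    examples for hard ones (p_t < threshold), so that potentially unlabeled
    insects treated as background do not dominate the loss. Illustrative
    sketch only; the paper combines the ideas of [40] and [41]."""
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    p_t = tf.clip_by_value(p_t, 1e-7, 1.0 - 1e-7)
    cross_entropy = -tf.math.log(p_t)
    # Reflect hard samples around the threshold before computing the factor,
    # so their weight matches that of an equally easy example.
    mirrored = tf.where(p_t < threshold, 2.0 * threshold - p_t, p_t)
    modulating = tf.pow(1.0 - tf.clip_by_value(mirrored, 0.0, 1.0), gamma)
    return modulating * cross_entropy
```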
5.1.3. Anchor Boxes
Given the inter-class heterogeneity regarding insect sizes previously reported, it was crucial to adjust the hyperparameters associated with the anchor boxes, i.e., the bounding box priors that can be specified in terms of scales and aspect ratios, which define the object candidates proposed by the model. In our work, an exploratory data analysis was made regarding the dimensions and aspect ratio of bounding box annotations for each class, thus allowing for the adjustment of the anchors’ scales and aspect ratios to match our dataset. A clustering-based approach was then applied to find the hyperparameters’ optimal values through a methodology described in [
42] and already applied in a healthcare use case [
43].
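A simplified sketch of this clustering-based adjustment is shown below, using scikit-learn's KMeans; the exact method of [42] may differ, and the function name is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_priors_from_boxes(widths, heights, n_clusters=5):
    """Cluster annotated bounding-box dimensions (in pixels) to derive anchor
    scales and aspect ratios matched to the dataset."""
    boxes = np.stack([widths, heights], axis=1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(boxes)
    centers = km.cluster_centers_                    # one (w, h) per cluster
    scales = np.sqrt(centers[:, 0] * centers[:, 1])  # geometric mean size
    aspect_ratios = centers[:, 0] / centers[:, 1]    # width / height
    return np.sort(scales), np.sort(aspect_ratios)
```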
5.2. Data-Centric
In the data-centric phase, the ML model is fixed, and the data quality is improved iteratively. Rather than collecting more data, the main goal is to improve the existing data quality through methods such as selecting the most suitable image size, data augmentation techniques, down-sampling empty patches, and removing inadequate images.
5.2.1. Image Size
This optimization step involved resizing all the images to a fixed width of 1500 pixels. As detailed in
Section 4.1, there was a significant discrepancy between image sizes in our dataset (and, consequently, in insect sizes), which could promote misclassifications and missed detections. It should be noted that this is a suitable standardization approach for our use case because: (i) traps have a fixed width; and (ii) we applied the pre-processing techniques for trap segmentation and perspective correction described in
Section 4.3.
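A minimal sketch of this standardization, assuming OpenCV and a perspective-corrected trap image:

```python
import cv2

def resize_to_fixed_width(image, target_width=1500):
    """Resize a perspective-corrected trap image to a fixed width of 1500 px,
    preserving its aspect ratio; annotations must be scaled by the same factor."""
    h, w = image.shape[:2]
    scale = target_width / w
    new_size = (target_width, int(round(h * scale)))  # (width, height) for cv2
    return cv2.resize(image, new_size, interpolation=cv2.INTER_AREA), scale
```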
5.2.2. Data Augmentation
In the model-centric stage, we already used online data augmentation techniques, performed randomly for each training batch during training. The applied transformations included random variations in hue (up to 5%), rotations, flips, and random variations in brightness, contrast, and saturation (up to 10%). In the data-centric stage, we further applied offline data augmentation techniques before training to balance classes. In particular, we applied random brightness, saturation, and contrast alterations, on top of random rotations and flips, as illustrated in
Figure 5. This balancing was performed on the patch level, and only patches containing annotations of a single label were augmented.
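A sketch of the offline, class-balancing augmentation is given below using tf.image; the parameter ranges follow the percentages mentioned above, and geometric transforms are restricted to flips and 90° rotations so that bounding boxes can be transformed exactly (an assumption of this sketch).

```python
import random
import tensorflow as tf

def augment_patch(patch):
    """Offline augmentation of a single-label patch: random brightness,
    saturation, and contrast changes on top of random flips and 90-degree
    rotations. Bounding boxes must be flipped/rotated accordingly."""
    patch = tf.image.random_brightness(patch, max_delta=0.1)
    patch = tf.image.random_saturation(patch, lower=0.9, upper=1.1)
    patch = tf.image.random_contrast(patch, lower=0.9, upper=1.1)
    if random.random() < 0.5:
        patch = tf.image.flip_left_right(patch)
    patch = tf.image.rot90(patch, k=random.randint(0, 3))
    return patch
```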
5.2.3. Down-Sampling Empty Patches
The number of empty patches (patches without any annotation) was down-sampled to balance the number of empty and annotated patches during training. In this step, empty patches with more than half of their area composed of padding (black pixels) were discarded (as exemplified in
Figure 6). Such patches not only contain no annotations but also lack a relevant amount of background, meaning that they would be highly uninformative for the networks.
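A sketch of the rule used to discard padding-dominated empty patches, assuming NumPy patches with black padding:

```python
import numpy as np

def keep_empty_patch(patch, max_padding_fraction=0.5):
    """Return True if an un-annotated patch should be kept: patches with more
    than half of their area made of black padding are discarded. Counting
    pure-black pixels is an approximation of the padded area."""
    padding_mask = np.all(patch == 0, axis=-1)
    return padding_mask.mean() <= max_padding_fraction
```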
5.2.4. Removal of Inadequate Images
This step involved removing images deemed inadequate for training purposes. This was performed through manual observation; as such, there was a certain subjectivity attached to this process (see
Figure 7). The main factors analyzed were: (i) the sharpness/quality of the image and its annotated areas; and (ii) the positioning of the camera with regard to the trap (i.e., images too far away or too angled in relation to the trap were discarded). In addition to this, the quantity of under-represented classes in each image was also taken into account in order to avoid discarding too many images containing these classes.
5.3. Deployment-Centric
In the deployment-centric phase, the ML model and data are fixed, and the model compression is improved iteratively. The main goal is to find the best trade-off between the performance, model size, and inference time. Specifically, several comparative studies were performed, including training the fixed model with quantization awareness, model compression using TensorFlow Lite, and model size reduction techniques.
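For reference, the TensorFlow Lite conversion step can be sketched as follows; the path and converter flags are assumptions rather than the exact configuration used, and quantization-aware training requires the model to have been built with quantization-aware layers beforehand.

```python
import tensorflow as tf

# Convert an exported detection SavedModel to TensorFlow Lite with default
# optimizations (weight quantization); illustrative configuration only.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("pest_detector.tflite", "wb") as f:
    f.write(tflite_model)
```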
6. Results and Discussion
This section presents and discusses the obtained results sequentially, where the results obtained at each step are compared to the best baseline found until that point. In particular, we explored model-centric, data-centric, and deployment-centric optimization strategies, executed and assessed iteratively, to find the most suitable ML model for our use case.
6.1. Model-Centric
To ensure the proper benchmark of the selected object detection deep learning models, we used the Tensorflow Object Detection API version 2.3 [
44], which is an open-source machine learning framework built upon Tensorflow [
45]. The Tensorflow Object Detection API provides several pre-trained models and procedures for experimentation. These detection models were pre-trained on the COCO [
4], KITTI [
46], and Open Images [
47] datasets.
The configuration for all training sessions encompassed 200,000 steps with a batch size of 8, resulting in 925 epochs. This value was chosen after several experiments using different models, which showed that they do not need more than 200,000 steps to converge. All experiments were performed through a Singularity container in a high-performance computing cluster equipped with a 48-core Intel® Xeon® Silver 4214 CPU (2.20 GHz) and a 32 GB NVIDIA® V100 GPU.
The next model-centric optimization experiment had the goal of fine-tuning all aspects related to the learning rate, in order to find a suitable trade-off between convergence time and the risk of missing good local minima. Using the Adam optimizer with default exponential rates, each meta-architecture/backbone combination was then trained with different learning rate and decay step values. By using the three-fold cross-validation strategy described in
Section 4.2, the learning rate was then fixed to 1.0 × 10
, with a decay rate of 0.95 per 45,000 steps.
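For illustration, the reported decay configuration (0.95 every 45,000 steps, Adam with default exponential rates) can be expressed as follows; the base learning rate below is a placeholder, since its exponent is not reproduced here, and the staircase behavior is assumed.

```python
import tensorflow as tf

BASE_LR = 1.0e-3  # placeholder value for illustration only
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=BASE_LR,
    decay_steps=45_000,
    decay_rate=0.95,
    staircase=True,  # step-wise decay every 45,000 steps (assumed)
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```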
A comparative study regarding the most favorable patch overlapping ratio was also performed. Different overlap percentages were tested, where the best average performance on validation sets of each fold was obtained for an overlap of 6.25%, i.e., 40 pixels between adjacent patches. The average three-fold cross-validation detection performance obtained for the five models tested is supplied in
Table 2.
By analyzing the AP values, it is clear that the most promising model architectures are the SSD ResNet50, Faster R-CNN ResNet101, and CenterNet ResNet50. We then extracted the performance of these three models on the test set, which is depicted in
Table 3.
6.2. Data-Centric
In the data-centric stage, we focused on increasing the performance of the most promising ML models found previously by improving the data training quality through iterative data-oriented optimization steps. From the five model architectures initially considered, we only selected the two most promising options for further experiments, namely the SSD ResNet50 and Faster R-CNN ResNet101. The results for the different data-oriented optimization steps described in
Section 5.2 are presented in
Table 4 and
Table 5 for the SSD ResNet50 and Faster R-CNN ResNet101, respectively.
For both architectures, the approach that achieved the highest mAP consisted of simultaneously applying image resizing and empty image down-sampling. It is worth noting that combining the inadequate image removal step with these two steps yielded similar but slightly lower results; as such, it was not the ideal approach to maximize the models' performance.
6.3. Deployment-Centric
In the deployment-centric stage, our goal was to test and compare different techniques that optimize the best-performing models for deployment. Unlike previous stages, in this stage, the criteria used to select the most suitable model are not restricted to performance indicators. Our models are envisioned to be deployed on mobile devices; thus, we aimed to find the best trade-off between several criteria, such as performance, model size, and inference time. The results for the different deployment-oriented optimization steps described in
Section 5.3 are presented in
Table 6 and
Table 7 for the SSD ResNet50 and the Faster R-CNN ResNet101, respectively.
Given the differing complexity and computational requirements of the two selected architectures, they target different deployment scenarios. As SSD ResNet50 is more lightweight, it is our choice for deployment on mobile devices. Thus, we selected a high-end and a low-end smartphone model to test the deployment of our model, namely the Samsung S10 (16 MP camera, 8GB RAM, Exynos 9820 Octa-core CPU and Mali-G76 MP12 GPU) and Samsung S6 (16 MP camera, 3GB RAM, Exynos 7420 Octa-core CPU and Mali-T760 MP8 GPU), respectively. As shown in
Table 8, the usage of model size reduction clearly benefits both devices in terms of the inference time, RAM usage, and model size.
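On-device inference with the compressed model follows the standard TensorFlow Lite Interpreter flow, sketched below; the file name and dummy input are assumptions, and on Android the equivalent Interpreter API of the TensorFlow Lite runtime is used.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="pest_detector.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()

# Run one 640x640 patch through the model (dummy input for illustration).
patch = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], patch)
interpreter.invoke()
detections = [interpreter.get_tensor(o["index"]) for o in output_details]
```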
For both architectures, the approach that we considered to provide the best trade-off consisted of training with quantization awareness, applying model compression using TensorFlow Lite, and finally employing a model size reduction (MSR) technique. While it is an obvious choice for the faster R-CNN ResNet101 (simultaneously maximizing performance and model compression), the decision for the SSD ResNet50 is not straightforward, since the usage of MSR affects the capacity of the model to detect the FDL class.
To further discuss this topic, the confusion matrix and corresponding performance metrics for the final SSD ResNet50 model running on a Ubuntu virtual desktop infrastructure are provided in
Table 9 and
Table 10, respectively.
The model obtained a high accuracy and recall across most classes, with the exception of FDL, as all 26 instances were missed or misclassified. This is likely due to its similarity to other classes, such as TM, and the comparatively small number of training examples (124 FDL vs. 744 TM instances). Further supporting this is the contrasting good performance on the GM class, obtaining a higher precision, recall, and F1 score than on any other class, again likely due to its disproportionate number of training examples (8522 instances). Illustrative examples of SSD ResNet50 model predictions are provided in
Figure 8.
Though the SSD ResNet50 model seems to be the most suitable architecture for deployment on edge devices, we also tried to deploy the faster R-CNN ResNet101 model on both mobile devices. However, we were unable to run this model on the low-end device due to a lack of computational resources. Nevertheless, it is worth noting that the faster R-CNN ResNet101 presents slightly better results in several classes, as shown in the results depicted in
Table 11 and
Table 12, obtained on a Ubuntu virtual desktop infrastructure. Thus, we considered this model to be the most suitable for cloud deployment, i.e., the trap images could be acquired with a mobile device and sent to a virtual machine, where the model is executed and the results are provided back to the mobile device.
Despite the marginally better results in most cases, there are some results worth noting, e.g., the case of FDL, for which 7 out of 26 examples were correctly identified by the network (as opposed to none with the SSD ResNet50). Another noteworthy case is the one regarding TM, which had a considerable increase in precision (+27%) in comparison with the SSD ResNet50, albeit with a slight decrease in recall (−4%). Illustrative examples of faster R-CNN ResNet101 model predictions are provided in
Figure 9.
6.4. Limitations
Despite the promising results obtained, there are certain limitations in the presented work worth noting. One of them is the models' sub-par performance with regard to the FDL class, which is especially noticeable in the final SSD ResNet50 model, as it was not able to detect any of those instances in the test set. Due to the highly destructive potential of this species, this limitation must be considered in real-world applications. The system's poor performance for this class is certainly related to the disproportionately lower number of FDL instances in the collected dataset, due to the low incidence of this species in the three vineyards involved in this study. Given that deep learning approaches similar to the ones presented usually require a substantial volume of data to achieve a good generalization performance, data scarcity might limit the adaptation of the proposed methodology for the detection of pest outbreaks in other agricultural productions where data acquisition is challenging.
Additionally, the inadequate image removal in the data-centric stage was performed through manual observation, leading to a certain subjectivity attached to this image selection process. Coupled with the down-sampling of partially empty patches, this might have led to the unnecessarily excessive removal of images from the dataset, thus negatively impacting the models’ performance. In the future, more objective approaches for assessing the presence of adequacy criteria should be considered to potentially overcome this limitation.
7. Conclusions and Future Work
This paper provides three main original contributions. The first contribution consists of a novel CV-based approach based on deep object detection networks to improve pest monitoring and prevention processes in viticulture. The development of lightweight AI algorithms suitable for edge scenarios was explored, allowing for automated insect detection and counting on conventional sticky traps via mobile devices.
Given the lack of freely available image datasets of key insects for viticulture, different mobile devices were used to acquire a dataset of yellow sticky and delta traps, consisting of 168 images with 8966 key insects manually annotated by experienced taxonomy specialists. This collected dataset is also an original contribution of this work. Five different deep learning models suitable to run locally on mobile devices were selected, trained, and benchmarked to detect five different species.
To find the most suitable model, we proposed a new methodology based on three groups of optimization strategies: model-centric, data-centric, and deployment-centric. Each group comprises different types of optimizations, executed and assessed iteratively. While this methodology is an original contribution of this work, it has so far only been validated in the reported use case; nevertheless, we believe that it provides an important foundation for future research in the field of edge-compatible deep learning models. Refinements to the proposed methodology should be explored in future work, e.g., reapplying the model-centric phase after the best data-centric and deployment-oriented optimization steps have been found and fixed. This approach might allow us to obtain better performances with a different architecture and/or hyperparameter combination for the fixed set-up of ideal data-centric and deployment-centric optimization steps.
Regarding the findings of this study, they can be grouped into four different categories: model-centric; data-centric; deployment-centric; and the selection of the most suitable model for deployment on edge devices. The findings for each category are separately detailed in the following paragraphs.
In the model-centric stage, the dataset was fixed and the ML models were improved iteratively. The most promising results at this stage were obtained by SSD ResNet50 and faster R-CNN ResNet101 models, with mAPs of 0.418 and 0.407 and average recalls (ARs) of 0.24 and 0.21, respectively.
In the data-centric stage, we focused on increasing the performance of the most promising ML models found previously by improving the data training quality through iterative data-oriented optimization steps. By applying image resizing and empty image down-sampling techniques, we were able to increase the mAPs to 0.515 (+23.2%) and 0.491 (+20.6%), as well as the ARs to 0.288 (+20.0%) and 0.262 (+24.8%) for SSD ResNet50 and faster R-CNN ResNet101, respectively.
In the deployment-centric phase, models were iteratively improved to find the best trade-off between the performance, model size, and inference time. For both architectures, the most suitable model was obtained by training with quantization awareness, applying model compression using TensorFlow Lite, and finally employing a model size reduction technique. While the application of model size reduction had a slight negative impact on the SSD ResNet50 model, with a decrease in mAP to 0.452 (−12.2%) and AR to 0.245 (−14.9%), the inference time on low-end and high-end mobile devices improved drastically from 120.2 s to 62.7 s (−47.8%) and 40.2 s to 19.4 s (−51.7%), respectively. As expected, the model size for both models decreased significantly from 204 MB to 52 MB (−74.5%), with registered improvements in terms of the RAM usage from 1200 MB to 1000 MB (−16.7%) on low-end devices and 600 MB to 540 MB (−11.7%) on high-end devices.
Given this, the SSD ResNet50 model seems to be the most suitable architecture for deployment on edge devices, whereas the faster R-CNN ResNet101 model seems to be the most suitable for cloud deployment, i.e., sending mobile-acquired trap images to a virtual machine, where inference is executed and the results are sent back to the mobile device. With this clear identification of the best-performing models for both edge and cloud scenarios, the selection of the most suitable model can be made according to the goals and requirements of the pest monitoring system in development. For instance, if compatibility with both low-end and high-end devices is mandatory, as well as near-real-time, on-site insect detection without the need for a network connection, the best option is clearly the SSD ResNet50 model. In contrast, if maximizing the detection performance is a priority and the inference can be made asynchronously (e.g., the images are only sent to the cloud for analysis when a network connection is available), the best option would probably be the faster R-CNN ResNet101 model.
In terms of future work, we plan to perform field trials with stakeholders to validate the solution in a real-world scenario. We also aim to complement the created image database with more annotations, namely balancing the number of annotations of under-represented classes, as well as expanding the automated detection to other key species in viticulture. Additional strategies used to deal with class imbalance should also be explored in the future, such as hard negative mining or generative methods to generate new artificial samples for under-represented classes. Furthermore, adapting the proposed methodology to other agricultural productions can also be considered in the future, e.g., detecting key insects in olive production such as Bactrocera oleae in yellow sticky traps or Palpita unionalis and Prays oleae in delta traps.
As a last remark, the proposed approach represents one component of a solution based on mobile devices for pest prevention in viticulture currently in development. The proposed models will be integrated into a decision support tool for winegrowers and taxonomy specialists that allows them to: (i) automatically verify the quality and adequacy of sticky trap images acquired with mobile devices; (ii) automatically detect and count key insects; and (iii) optimize insecticide usage for the detected pests.
Author Contributions
Conceptualization, L.R., T.N., A.F. and C.C.; methodology, J.G., E.S., P.F., T.N., A.F., C.C. and L.R.; software, J.G., E.S., P.F. and L.R.; validation, J.G., E.S., P.F., T.N., A.F., C.C. and L.R.; formal analysis, J.G., E.S., P.F. and L.R.; investigation, J.G., E.S., P.F. and L.R.; resources, T.N., A.F. and C.C.; data curation, J.G., E.S., P.F., A.F., C.C., T.N. and L.R.; writing—original draft preparation, J.G., E.S. and L.R.; writing—review and editing, J.G., E.S., P.F., T.N., A.F., C.C. and L.R.; visualization, J.G., E.S., P.F. and L.R.; supervision, T.N., A.F., C.C. and L.R.; project administration, T.N., A.F., C.C. and L.R.; funding acquisition, T.N., C.C. and L.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the European Regional Development Fund (ERDF) within the framework of Norte 2020 (Programa Operacional Regional do Norte), through the project EyesOnTraps+ - Smart Learning Trap and Vineyard Health Monitoring, NORTE-01-0247-FEDER-039912.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors would like to thank the stakeholders from Douro Wine Region that collaborated in this work, namely Sogevinus Quintas SA, Adriano Ramos Pinto—Vinhos SA, and Sogrape Vinhos, SA.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
AI | Artificial Intelligence |
ML | Machine Learning |
DL | Deep Learning |
CV | Computer Vision |
CNN | Convolutional Neural Network |
SIFT | Scale-Invariant Feature Transform |
HOG | Histogram of Oriented Gradients |
FLOPs | FLoating-point OPerations |
EU | European Union |
GPU | Graphics Processing Unit |
CPU | Central Processing Unit |
RAM | Random Access Memory |
FPN | Feature Pyramid Networks |
YOLO | You Only Look Once |
SSD | Single Shot Multibox Detector |
CT | Chromotropic Traps |
DT | Delta Traps |
GM | European Grapevine Moth |
GL | Green Leafhopper |
FDL | "Flavescence Dorée" Leafhopper |
TM | Tomato Moth |
MC | Morphotype C |
GB | Gigabyte |
MB | Megabyte |
mAP | Mean Average Precision |
AP | Average Precision |
AR | Average Recall |
Appendix A
This appendix details the dataset division, particularly the stratification of insect annotations for each subset (see
Table A1).
Table A1.
Dataset division: stratification of insect annotations by insect species for the different subsets.
Per-species cells show the number of annotations, with the corresponding percentage in parentheses.

| Subset | % Images | # Images | GM | GL | FDL | MC | TM |
|---|---|---|---|---|---|---|---|
| Train + Validation Set | 79% | 132 | 4053 (77%) | 2356 (82%) | 149 (85%) | 168 (79%) | 363 (83%) |
| Test Set | 21% | 36 | 1218 (23%) | 516 (18%) | 26 (15%) | 44 (21%) | 73 (17%) |
| Total | 100% | 168 | 5271 (100%) | 2872 (100%) | 175 (100%) | 212 (100%) | 436 (100%) |

FOLD 1 Division (3-fold Cross-validation)

| Subset | % Images | # Images | GM | GL | FDL | MC | TM |
|---|---|---|---|---|---|---|---|
| Train | 59% | 78 | 2214 (55%) | 1545 (66%) | 92 (62%) | 86 (51%) | 217 (60%) |
| Validation Set | 41% | 54 | 1839 (45%) | 811 (34%) | 57 (38%) | 82 (49%) | 146 (40%) |
| Total | 100% | 132 | 4053 (100%) | 2356 (100%) | 149 (100%) | 168 (100%) | 363 (100%) |

FOLD 2 Division (3-fold Cross-validation)

| Subset | % Images | # Images | GM | GL | FDL | MC | TM |
|---|---|---|---|---|---|---|---|
| Train | 80% | 106 | 3228 (80%) | 1570 (67%) | 125 (84%) | 149 (89%) | 308 (85%) |
| Validation Set | 20% | 26 | 825 (20%) | 786 (33%) | 24 (16%) | 19 (11%) | 55 (15%) |
| Total | 100% | 132 | 4053 (100%) | 2356 (100%) | 149 (100%) | 168 (100%) | 363 (100%) |

FOLD 3 Division (3-fold Cross-validation)

| Subset | % Images | # Images | GM | GL | FDL | MC | TM |
|---|---|---|---|---|---|---|---|
| Train | 61% | 80 | 2664 (66%) | 1597 (68%) | 81 (54%) | 101 (60%) | 201 (55%) |
| Validation Set | 39% | 52 | 1389 (34%) | 759 (32%) | 68 (46%) | 67 (40%) | 162 (45%) |
| Total | 100% | 132 | 4053 (100%) | 2356 (100%) | 149 (100%) | 168 (100%) | 363 (100%) |
References
- OIV. State of the World Vitivinicultural Sector in 2020. In International Organisation of Vine and Wine; OIV: Paris, France, 2021. [Google Scholar]
- van Leeuwen, C.; Destrac-Irvine, A.; Dubernet, M.; Duchêne, E.; Gowdy, M.; Marguerit, E.; Pieri, P.; Parker, A.; de Rességuier, L.; Ollat, N. An Update on the Impact of Climate Change in Viticulture and Potential Adaptations. Agronomy 2019, 9, 514. [Google Scholar] [CrossRef] [Green Version]
- Sultana, F.; Sufian, A.; Dutta, P. A review of object detection models based on convolutional neural network. Intell. Comput. Image Process. Based Appl. 2020, 1175, 1–16. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context; Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef] [Green Version]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 379–387. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Kottayam, India, 2015; Volume 28. [Google Scholar]
- Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2016, arXiv:1512.02325. [Google Scholar] [CrossRef] [Green Version]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Rustia, D.J.A.; Chao, J.; Chiu, L.; Wu, Y.; Chung, J.; Hsu, J.; Lin, T. Automatic greenhouse insect pest detection and recognition based on a cascaded deep learning classification method. J. Appl. Entomol. 2021, 145, 206–222. [Google Scholar] [CrossRef]
- Martin, V.; Paris, B.; Nicolás, O. O.50-Towards a Video Camera Network for Early Pest Detection in Greenhouses. In Proceedings of the ENDURE International Conference on Diversifying Crop Protection, La Grande Motte, France, 12–15 October 2008. [Google Scholar]
- Nieuwenhuizen, A.T.; Hemming, J.; Janssen, D.; Suh, H.K.; Bosmans, L.; Sluydts, V.; Brenard, N.; Rodríguez, E.; del Mar Tellez, M. Raw Data from Yellow Sticky Traps with Insects for Training of Deep Learning Convolutional Neural Network for Object Detection. 2019. Available online: https://doi.org/10.4121/uuid:8b8ba63a-1010-4de7-a7fb-6f9e3baf128e (accessed on 15 October 2022).
- Faria, P.; Nogueira, T.; Ferreira, A.; Carlos, C.; Rosado, L. AI-Powered Mobile Image Acquisition of Vineyard Insect Traps with Automatic Quality and Adequacy Assessment. Agronomy 2021, 11, 731. [Google Scholar] [CrossRef]
- Zhong, Y.; Gao, J.; Lei, Q.; Zhou, Y. A Vision-Based Counting and Recognition System for Flying Insects in Intelligent Agriculture. Sensors 2018, 18, 1489. [Google Scholar] [CrossRef] [Green Version]
- Preti, M.; Verheggen, F.; Angeli, S. Insect pest monitoring with camera-equipped traps: Strengths and limitations. J. Pest Sci. 2021, 94, 203–217. [Google Scholar] [CrossRef]
- Rustia, D.; Lin, T. An IoT-based Wireless Imaging and Sensor Node System for Remote Greenhouse Pest Monitoring. Chem. Eng. Trans. 2017, 58, 601–606. [Google Scholar] [CrossRef]
- Li, W.; Wang, D.; Li, M.; Gao, Y.; Wu, J.; Yang, X. Field detection of tiny pests from sticky trap images using deep learning in agricultural greenhouse. Comput. Electron. Agric. 2021, 183, 106048. [Google Scholar] [CrossRef]
- Yang, Z.; Li, W.; Li, M.; Yang, X. Automatic greenhouse pest recognition based on multiple color space features. J. Agric. Biol. Eng. 2021, 14, 188–195. [Google Scholar] [CrossRef]
- Hong, S.J.; Nam, I.; Kim, S.Y.; Kim, E.; Lee, C.H.; Ahn, S.; Park, I.K.; Kim, G. Automatic Pest Counting from Pheromone Trap Images Using Deep Learning Object Detectors for Matsucoccus thunbergianae Monitoring. Insects 2021, 12, 342. [Google Scholar] [CrossRef] [PubMed]
- Barbedo, J.; Castro, G.B. Influence of image quality on the identification of psyllids using convolutional neural networks. Biosyst. Eng. 2019, 182, 151–158. [Google Scholar] [CrossRef]
- Barbedo, J.G.A.; Castro, G.B. A Study on CNN-Based Detection of Psyllids in Sticky Traps Using Multiple Image Data Sources. AI 2020, 1, 198–208. [Google Scholar] [CrossRef]
- Xie, C.; Zhang, J.; Li, R.; Li, J.; Hong, P.; Xia, J.; Chen, P. Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning. Comput. Electron. Agric. 2015, 119, 123–132. [Google Scholar] [CrossRef]
- Xie, C.; Wang, R.; Zhang, J.; Chen, P.; Dong, W.; Li, R.; Chen, T.; Chen, H. Multi-level learning features for automatic classification of field crop pests. Comput. Electron. Agric. 2018, 152, 233–241. [Google Scholar] [CrossRef]
- Xia, C.; Chon, T.S.; Ren, Z.; Lee, J.M. Automatic identification and counting of small size pests in greenhouse conditions with low computational cost. Ecol. Inform. 2015, 29, 139–146. [Google Scholar] [CrossRef]
- Sun, Y.; Cheng, H.; Cheng, Q.; Zhou, H.; Li, M.; Fan, Y.; Shan, G.; Damerow, L.; Schulze Lammers, P.; Jones, S.B. A smart-vision algorithm for counting whiteflies and thrips on sticky traps using two-dimensional Fourier transform spectrum. Biosyst. Eng. 2017, 153, 82–88. [Google Scholar] [CrossRef]
- Ding, W.; Taylor, G. Automatic moth detection from trap images for pest management. Comput. Electron. Agric. 2016, 123, 17–28. [Google Scholar] [CrossRef] [Green Version]
- Espinoza, K.; Valera, D.L.; Torres, J.A.; López, A.; Molina-Aiz, F.D. Combination of image processing and artificial neural networks as a novel approach for the identification of Bemisia tabaci and Frankliniella occidentalis on sticky traps in greenhouse agriculture. Comput. Electron. Agric. 2016, 127, 495–505. [Google Scholar] [CrossRef]
- Hong, S.J.; Kim, S.Y.; Kim, E.; Lee, C.H.; Lee, J.S.; Lee, D.S.; Bang, J.; Kim, G. Moth Detection from Pheromone Trap Images Using Deep Learning Object Detectors. Agriculture 2020, 10, 170. [Google Scholar] [CrossRef]
- Official Journal of the European Union. Directive 2009/128/EC of the European Parliament and of the Council of 21 October 2009 Establishing a Framework for Community Action to Achieve the Sustainable Use of Pesticides; Official Journal of the European Union: Brussels, Belgium, 2009. [Google Scholar]
- Gilligan, T.M.; Epstein, M.E.; Passoa, S.C.; Powell, J.A.; Sage, O.C.; Brown, J.W. Discovery of Lobesia botrana ([Denis & Schiffermüller]) in California: An invasive species new to North America (Lepidoptera: Tortricidae). Proc. Entomol. Soc. Wash. 2011, 113, 14–30. [Google Scholar]
- Carlos, C. Cadernos técnicos da ADVID Caderno técnico nº1–“A Traça da Uva”; ADVID–Associação para o Desenvolvimento da Viticultura Duriense: Vila Real, Portugal, 2007. [Google Scholar]
- Gonçalves, F.; Carlos, C.; Ferreira, A.; Salvação, J.; Bagnoli, B.; Torres, L. Contribuição para a adequada monitorização da traça-da-uva com armadilhas sexuais. 2020. Available online: https://doi.org/10.13140/RG.2.2.34964.42888 (accessed on 15 October 2022).
- Carlos, C.; Alves, F. Instrumentos de Apoio à Proteção Integrada. Utilização de armadilhas para traça da uva e cigarrinha verde. 2013. Available online: https://www.advid.pt/uploads/DOCUMENTOS/Subcategorias/manuais/Instrumentos%20de%20apoio%20a%CC%80%20Protecc%CC%A7a%CC%83o%20integrada_%20U%20lizac%CC%A7a%CC%83o%20de%20armadilhas%20para%20trac%CC%A7a%20da%20uva%20e%20cigarrinha%20verde-abril2013.pdf (accessed on 15 October 2022).
- Mazzoni, V.; Prešern, J.; Lucchi, A.; Virant-Doberlet, M. Reproductive strategy of the nearctic leafhopper Scaphoideus titanus Ball (Hemiptera: Cicadellidae). Bull. Entomol. Res. 2009, 99, 401–413. [Google Scholar] [CrossRef] [PubMed]
- Quartau, J.; Guimaraes, J.; André, G. On the occurrence in Portugal of the nearctic Scaphoideus titanus Ball (Homoptera, Cicadellidae), the natural vector of the grapevine “Flavescence dorée” (FD). IOBC WPRS Bull. 2001, 24, 273–276. [Google Scholar]
- Soares, C. A traça-do-tomateiro (Tuta absoluta). Horticultura—Sanidade. Revista Voz do Campo 2010, 66. [Google Scholar]
- Ozge Unel, F.; Ozkalayci, B.O.; Cigla, C. The Power of Tiling for Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Zhang, H.; Chen, F.; Shen, Z.; Hao, Q.; Zhu, C.; Savvides, M. Solving Missing-Annotation Object Detection with Background Recalibration Loss. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1888–1892. [Google Scholar] [CrossRef] [Green Version]
- Sergievskiy, N.; Ponamarev, A. Reduced Focal Loss: 1st Place Solution to xView object detection in Satellite Imagery. arXiv 2019, arXiv:1903.01347. [Google Scholar]
- Sampaio, A.F.; Gonçalves, J.; Rosado, L.; Vasconcelos, M.J.M. Cluster-based Anchor Box Optimisation Method for Different Object Detection Architectures, July 2021. Available online: https://recpad2020.uevora.pt/wp-content/uploads/2020/10/RECPAD_2020_paper_42.pdf (accessed on 15 October 2022).
- Sampaio, A.F.; Rosado, L.; Vasconcelos, M.J.M. Towards the Mobile Detection of Cervical Lesions: A Region-Based Approach for the Analysis of Microscopic Images. IEEE Access 2021, 9, 152188–152205. [Google Scholar] [CrossRef]
- Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7310–7311. [Google Scholar]
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; USENIX Association: Savannah, GA, USA, 2016; pp. 265–283. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
- Krasin, I.; Duerig, T.; Alldrin, N.; Ferrari, V.; Abu-El-Haija, S.; Kuznetsova, A.; Rom, H.; Uijlings, J.; Popov, S.; Kamali, S.; et al. OpenImages: A Public Dataset for Large-Scale Multi-Label and Multi-Class Image Classification. 2017. Available online: https://storage.googleapis.com/openimages/web/index.html (accessed on 15 October 2022).
Figure 1. Illustrative examples of detected insects for viticulture pest monitoring (to scale): (A) Grapevine Moth; (B) Green Leafhopper; (C) “Flavescence Dorée” Leafhopper; (D) Tomato Moth; and (E) Morphotype C.
Figure 2. Illustrative images of original delta (A) and yellow sticky (B) traps included in the mobile-acquired image database. Images (C,D) illustrate manual annotations of GM and GL, respectively.
Figure 3. Image patch extraction: (A) original image; (B) trap segmentation; (C) trap perspective correction; (D) division into patches.
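To make the patch-extraction pipeline of Figure 3 concrete, the following is a minimal sketch, assuming OpenCV and four trap corner points already obtained from the segmentation step; the output resolution and patch size are illustrative placeholders, not the settings used in this work.

```python
import cv2
import numpy as np

def rectify_and_tile(image, corners, out_size=(2000, 1200), patch=640):
    """Warp the trap region to a fronto-parallel view and split it into patches.

    `corners` are the trap's four corners (top-left, top-right, bottom-right,
    bottom-left) in image coordinates, e.g., derived from the segmentation mask.
    `out_size` and `patch` are illustrative defaults.
    """
    out_w, out_h = out_size
    src = np.asarray(corners, dtype=np.float32)
    dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]], dtype=np.float32)

    # Perspective correction (Figure 3C).
    homography = cv2.getPerspectiveTransform(src, dst)
    rectified = cv2.warpPerspective(image, homography, (out_w, out_h))

    # Division into fixed-size patches (Figure 3D); border patches may be smaller.
    patches = [rectified[y:y + patch, x:x + patch]
               for y in range(0, out_h, patch)
               for x in range(0, out_w, patch)]
    return rectified, patches
```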
Figure 4. Methodology overview.
Figure 5. Examples of data augmentation: (A) original patch image; (B,C) random brightness, saturation, contrast, and rotation alterations.
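As an illustration of the augmentation types shown in Figure 5, the sketch below applies random brightness, saturation, contrast, and rotation changes with TensorFlow's tf.image utilities; the parameter ranges are assumptions, and a real training pipeline would also transform the bounding boxes consistently.

```python
import tensorflow as tf

def augment_patch(image):
    """Random photometric and rotation changes in the spirit of Figure 5.

    Parameter ranges are assumptions for illustration only.
    """
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # A random multiple of 90 degrees keeps the patch rectangular.
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    return image
```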
Figure 6. Examples of down-sampling empty patches: (A) patch kept; and (B) patch removed.
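A minimal sketch of the down-sampling of empty patches illustrated in Figure 6 follows, assuming each patch is paired with its (possibly empty) list of annotations; the keep ratio is a hypothetical value, not the one used in this work.

```python
import random

def downsample_empty_patches(patches, annotations, keep_ratio=0.1, seed=42):
    """Keep every annotated patch and only a fraction of the empty ones."""
    rng = random.Random(seed)
    return [(patch, boxes)
            for patch, boxes in zip(patches, annotations)
            if boxes or rng.random() < keep_ratio]
```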
Figure 7. Examples of removal of inadequate images due to: (A) camera distance; (B) camera angle and illumination; and (C) low image quality.
Figure 8. Examples of detections predicted by the SSD ResNet50 model on delta (left) and sticky trap (right) images from the test set. Ground truth annotations are marked in white, GM predictions in cyan, MC in purple, and GL in green.
Figure 9. Examples of detections predicted by the Faster R-CNN ResNet101 model on delta (left) and sticky trap (right) images from the test set. Ground truth annotations are marked in white, whereas GM predictions are in cyan, TM in turquoise, MC in purple, GL in green, and FDL in red.
Table 1. Dataset distribution of annotations by species.
| | GM | GL | FDL | TM | MC | Total |
|---|---|---|---|---|---|---|
| #Train annotations | 4053 | 2356 | 149 | 363 | 168 | 7089 |
| #Test annotations | 1218 | 516 | 26 | 73 | 44 | 1877 |
| #Total annotations | 5271 | 2872 | 175 | 436 | 212 | 8966 |
| #Train images | - | - | - | - | - | 132 |
| #Test images | - | - | - | - | - | 36 |
| #Total images | - | - | - | - | - | 168 |
Table 2. Model-centric results for 3-fold cross-validation.
| Model | Fold | mAP@0.5 | AP GM | AP MC | AP TM | AP GL | AP FDL | AR@100 |
|---|---|---|---|---|---|---|---|---|
| SSD ResNet50 | 1 | 0.364 | 0.626 | 0.337 | 0.107 | 0.509 | 0.240 | 0.218 |
| | 2 | 0.250 | 0.435 | 0.284 | 0.145 | 0.251 | 0.136 | 0.172 |
| | 3 | 0.315 | 0.439 | 0.454 | 0.023 | 0.373 | 0.285 | 0.173 |
| | AVG | 0.310 | 0.500 | 0.358 | 0.092 | 0.378 | 0.220 | 0.188 |
| Faster R-CNN ResNet101 | 1 | 0.333 | 0.596 | 0.375 | 0.068 | 0.471 | 0.154 | 0.167 |
| | 2 | 0.251 | 0.422 | 0.434 | 0.091 | 0.239 | 0.068 | 0.152 |
| | 3 | 0.325 | 0.448 | 0.564 | 0.091 | 0.379 | 0.139 | 0.174 |
| | AVG | 0.303 | 0.489 | 0.458 | 0.083 | 0.363 | 0.120 | 0.164 |
| EfficientDet-D0 | 1 | 0.276 | 0.530 | 0.183 | 0.212 | 0.409 | 0.046 | 0.129 |
| | 2 | 0.202 | 0.367 | 0.177 | 0.198 | 0.234 | 0.034 | 0.117 |
| | 3 | 0.231 | 0.405 | 0.279 | 0.030 | 0.305 | 0.135 | 0.102 |
| | AVG | 0.236 | 0.434 | 0.213 | 0.147 | 0.316 | 0.072 | 0.116 |
| SSD MobileNet V2 | 1 | 0.309 | 0.532 | 0.357 | 0.097 | 0.451 | 0.106 | 0.176 |
| | 2 | 0.187 | 0.341 | 0.220 | 0.073 | 0.246 | 0.052 | 0.127 |
| | 3 | 0.241 | 0.347 | 0.412 | 0.018 | 0.256 | 0.169 | 0.128 |
| | AVG | 0.246 | 0.407 | 0.329 | 0.063 | 0.318 | 0.109 | 0.144 |
| CenterNet ResNet50 | 1 | 0.363 | 0.641 | 0.365 | 0.196 | 0.465 | 0.144 | 0.176 |
| | 2 | 0.251 | 0.425 | 0.236 | 0.273 | 0.284 | 0.040 | 0.149 |
| | 3 | 0.295 | 0.482 | 0.468 | 0.010 | 0.306 | 0.209 | 0.169 |
| | AVG | 0.303 | 0.516 | 0.356 | 0.159 | 0.352 | 0.131 | 0.165 |
Table 3. Model-centric results of the best model architectures on the test set.
| Model | mAP@0.5 | AP GM | AP MC | AP TM | AP GL | AP FDL | AR@100 |
|---|---|---|---|---|---|---|---|
| SSD ResNet50 | 0.418 | 0.733 | 0.318 | 0.495 | 0.279 | 0.263 | 0.240 |
| Faster R-CNN ResNet101 | 0.407 | 0.786 | 0.363 | 0.396 | 0.379 | 0.110 | 0.210 |
| CenterNet ResNet50 | 0.352 | 0.769 | 0.221 | 0.487 | 0.278 | 0.044 | 0.212 |
Table 4. Data-centric results for SSD ResNet50.
| Data-Centric Step | mAP@0.5 | AP GM | AP MC | AP TM | AP GL | AP FDL | AR@100 |
|---|---|---|---|---|---|---|---|
| None | 0.418 | 0.733 | 0.318 | 0.495 | 0.279 | 0.263 | 0.240 |
| Image Size | 0.462 | 0.729 | 0.404 | 0.332 | 0.544 | 0.302 | 0.247 |
| Image Size + Data Augmentation | 0.403 | 0.727 | 0.343 | 0.274 | 0.558 | 0.115 | 0.229 |
| Image Size + Removal of Inadequate Images | 0.494 | 0.757 | 0.479 | 0.395 | 0.563 | 0.276 | 0.274 |
| Image Size + Down-sampling Empty Patches | 0.515 | 0.744 | 0.501 | 0.386 | 0.549 | 0.396 | 0.288 |
| Image Size + Removal of Inadequate Images + Down-sampling Empty Patches | 0.451 | 0.748 | 0.406 | 0.363 | 0.579 | 0.162 | 0.268 |
Table 5. Data-centric results for Faster R-CNN ResNet101.
| Data-Centric Step | mAP@0.5 | AP GM | AP MC | AP TM | AP GL | AP FDL | AR@100 |
|---|---|---|---|---|---|---|---|
| None | 0.407 | 0.786 | 0.363 | 0.396 | 0.379 | 0.110 | 0.210 |
| Image Size | 0.404 | 0.790 | 0.397 | 0.257 | 0.531 | 0.045 | 0.200 |
| Image Size + Data Augmentation | 0.447 | 0.781 | 0.354 | 0.422 | 0.542 | 0.136 | 0.241 |
| Image Size + Removal of Inadequate Images | 0.476 | 0.795 | 0.476 | 0.421 | 0.552 | 0.136 | 0.225 |
| Image Size + Down-sampling Empty Patches | 0.491 | 0.792 | 0.483 | 0.486 | 0.542 | 0.151 | 0.262 |
| Image Size + Removal of Inadequate Images + Down-sampling Empty Patches | 0.419 | 0.709 | 0.380 | 0.424 | 0.509 | 0.073 | 0.223 |
Table 6. Deployment-centric results for SSD ResNet50.
| Deployment-Centric Step | mAP@0.5 | AP GM | AP MC | AP TM | AP GL | AP FDL | AR@100 |
|---|---|---|---|---|---|---|---|
| None | 0.515 | 0.744 | 0.501 | 0.386 | 0.549 | 0.396 | 0.288 |
| Quantization-Aware Training | 0.519 | 0.750 | 0.483 | 0.447 | 0.553 | 0.362 | 0.287 |
| Quantization-Aware Training + TensorFlow Lite Compression | 0.528 | 0.754 | 0.550 | 0.456 | 0.548 | 0.332 | 0.301 |
| Quantization-Aware Training + TensorFlow Lite Compression + Model Size Reduction | 0.453 | 0.756 | 0.564 | 0.393 | 0.551 | 0.000 | 0.245 |
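For context on the TensorFlow Lite compression step above, the snippet below is a minimal conversion sketch using the standard TensorFlow Lite converter with default optimizations (weight quantization); the model path and output file name are placeholders, and the exact quantization settings used in this work are not reproduced here.

```python
import tensorflow as tf

# Convert an exported detection SavedModel to TensorFlow Lite.
# "exported_model/saved_model" and "pest_detector.tflite" are placeholder names.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("pest_detector.tflite", "wb") as f:
    f.write(tflite_model)
```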
Table 7. Deployment-centric results for Faster R-CNN ResNet101.
| Deployment-Centric Step | mAP@0.5 | AP GM | AP MC | AP TM | AP GL | AP FDL | AR@100 |
|---|---|---|---|---|---|---|---|
| None | 0.491 | 0.792 | 0.483 | 0.486 | 0.542 | 0.151 | 0.262 |
| Quantization-Aware Training | 0.489 | 0.801 | 0.478 | 0.488 | 0.519 | 0.159 | 0.240 |
| Quantization-Aware Training + TensorFlow Lite Compression | 0.501 | 0.754 | 0.497 | 0.489 | 0.558 | 0.207 | 0.277 |
| Quantization-Aware Training + TensorFlow Lite Compression + Model Size Reduction | 0.527 | 0.792 | 0.558 | 0.505 | 0.597 | 0.182 | 0.286 |
Table 8. Impact of model size reduction (MSR) on the SSD ResNet50 model deployed on mobile devices.
| | Samsung S6, No MSR | Samsung S6, With MSR | Samsung S10, No MSR | Samsung S10, With MSR |
|---|---|---|---|---|
| Inference Time (s) | 120.2 | 62.7 | 40.2 | 19.4 |
| RAM Usage (MB) | 1200 | 1000 | 600 | 530 |
| Model Size (MB) | 204 | 52 | 204 | 52 |
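As a rough illustration of how the inference times in Table 8 could be measured, the sketch below times a single invocation of a TensorFlow Lite interpreter on a dummy input; the model file name is the placeholder from the earlier conversion sketch, and the real on-device measurements would use the TensorFlow Lite Android runtime rather than the Python interpreter.

```python
import time
import numpy as np
import tensorflow as tf

# Time one TensorFlow Lite inference on a randomly generated input tensor.
interpreter = tf.lite.Interpreter(model_path="pest_detector.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
dummy = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])
interpreter.set_tensor(inp["index"], dummy)

start = time.perf_counter()
interpreter.invoke()
print(f"Single inference took {time.perf_counter() - start:.2f} s")
```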
Table 9. Confusion matrix for the final SSD ResNet50 model on the test set, using the selected score and NMS thresholds.
| Ground Truth \ Predicted | GL | MC | FDL | GM | TM | Miss (FN) |
|---|---|---|---|---|---|---|
| GL | 378 | 0 | 0 | 0 | 0 | 138 |
| MC | 0 | 31 | 0 | 0 | 0 | 13 |
| FDL | 0 | 0 | 0 | 0 | 10 | 16 |
| GM | 0 | 2 | 0 | 1013 | 3 | 200 |
| TM | 0 | 0 | 0 | 16 | 44 | 13 |
| False Alarm (FP) | 293 | 13 | 0 | 171 | 23 | - |
Table 10. Performance metrics for the final SSD ResNet50 model on the test set, using the selected score and NMS thresholds.
| Metric | GL | MC | FDL | GM | TM |
|---|---|---|---|---|---|
| Accuracy | 82% | 99% | - | 83% | 97% |
| Precision | 56% | 67% | - | 84% | 55% |
| Recall | 73% | 70% | - | 83% | 60% |
| F1 Score | 64% | 69% | - | 84% | 58% |
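As a worked check of how the per-class metrics in Table 10 follow from the confusion matrix in Table 9, the sketch below recomputes precision, recall, and F1 for the GL class from its true positives, false positives (false alarms plus confusions from other classes), and false negatives; the variable names are illustrative only.

```python
# Recompute the GL metrics of Table 10 from the Table 9 confusion matrix.
gl_row = {"GL": 378, "MC": 0, "FDL": 0, "GM": 0, "TM": 0, "miss": 138}
gl_false_alarms = 293   # background detections predicted as GL
gl_confusions_in = 0    # other species predicted as GL (none in Table 9)

tp = gl_row["GL"]
fp = gl_false_alarms + gl_confusions_in
fn = sum(count for key, count in gl_row.items() if key != "GL")

precision = tp / (tp + fp)                            # ~0.56
recall = tp / (tp + fn)                               # ~0.73
f1 = 2 * precision * recall / (precision + recall)    # ~0.64
print(f"GL: precision={precision:.0%}, recall={recall:.0%}, F1={f1:.0%}")
```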
Table 11. Confusion matrix for the final Faster R-CNN ResNet101 model on the test set, using the selected score and NMS thresholds.
| Ground Truth \ Predicted | GL | MC | FDL | GM | TM | Miss (FN) |
|---|---|---|---|---|---|---|
| GL | 374 | 0 | 0 | 0 | 0 | 142 |
| MC | 0 | 28 | 0 | 0 | 0 | 16 |
| FDL | 0 | 0 | 7 | 0 | 0 | 19 |
| GM | 0 | 6 | 0 | 1033 | 1 | 178 |
| TM | 0 | 0 | 0 | 18 | 41 | 14 |
| False Alarm (FP) | 237 | 11 | 5 | 142 | 8 | - |
Table 12. Performance metrics for the final Faster R-CNN ResNet101 model on the test set, using the selected score and NMS thresholds.
| Metric | GL | MC | FDL | GM | TM |
|---|---|---|---|---|---|
| Accuracy | 84% | 98% | 99% | 85% | 98% |
| Precision | 61% | 62% | 58% | 87% | 82% |
| Recall | 72% | 64% | 27% | 85% | 56% |
| F1 Score | 66% | 63% | 37% | 86% | 67% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).