1. Introduction
In spite of their small size, insects play a huge role in ecosystems around the world—outnumbering all other terrestrial species on Earth [1,2]. Reasons for detecting and monitoring insects are numerous: controlling malaria [3], increasing agricultural yields [4], investigating population decline [5], understanding changes in biodiversity [6,7], etc. Traditional monitoring methods, such as traps and netting [8], are not capable of the wide-scale monitoring needed to understand and address global insect population decline [9]. These traditional techniques usually kill the insects, lack precise spatial and temporal resolution, and are labor-intensive. The downsides of traditional techniques, along with the rising importance of wide-scale monitoring, have led to numerous calls for improved monitoring and data analysis methods [7,10,11,12], many of which rely on remote sensing and machine learning.
Automated monitoring techniques aim to increase the ease and scale of insect monitoring; they are typically based on audio [13], radar [14,15,16], or optical sensors, such as lidar [15,17,18,19,20,21] or camera traps [22,23,24]. Given the large amounts of data these automated monitoring techniques generate, machine learning has been applied to help automate data analysis tasks. For example, computer vision has been used in many camera trap studies [11,25,26,27,28,29]. In lidar studies, machine learning has been used for numerous tasks, such as identifying gravid mosquitoes [30], estimating biodiversity [19], and classifying species [31]. Since all the aforementioned remote sensing techniques have unique advantages and disadvantages, no single sensing modality will be sufficient to solve all insect monitoring problems; moreover, these remote sensing techniques need to be combined with traditional in-situ data collection to fully address ecological challenges [7].
The study presented here focuses on a lidar-based technique, which can be used for high-resolution spatio-temporal monitoring of insects [18]. The lidar instrument was developed by the Optical Remote Sensor Laboratory at Montana State University (MSU) [18]. This paper compares supervised machine learning and unsupervised changepoint detection techniques for detecting insects in pulsed lidar data.
Early entomological lidar development started with groups at Sandia National Laboratories [32] and MSU [33,34,35] demonstrating that lidar can detect honeybees; Shaw et al. [33] located unexploded landmines by mapping the locations of honeybees that were trained to locate the landmines. In the following two decades, several entomological lidar systems have been developed using time-of-flight [18], Scheimpflug [20,36], and fluorescence [37,38] techniques. For an overview, see the review papers by Brydegaard et al. [17] and Wang et al. [21]. However, these entomological lidar systems produce a large amount of data, making the data analysis time-consuming and labor-intensive. Beyond this, the measurements contain few insects, and the background environment can vary over time and space.
Due to the variable background in lidar measurements, as well as the fact that most measurements contain no insects, automatically detecting insects is challenging. Previous field studies by Brydegaard et al. [20,39,40] detected insect events using variations of the method described by Malmqvist et al. [41]. This method determines the static background signal via measurements when the laser is off and subtracts this background from the signal measurements. Any samples with an amplitude above a range-dependent threshold are considered insect events. This method has been successful at finding insect events in several studies [19,20,39,40,42]; however, other non-insect objects that pass through the beam are likely to be above the detection threshold. Rather than relying on a statistical threshold, Rydhmer et al. trained a convolutional neural network to distinguish insect observations from non-insect observations [43]; this technique is most similar to the learning methods presented in this paper.
Most previous entomological lidar studies that used machine learning classification focused exclusively on feature engineering and supervised learning methods [30,31,44,45]. However, since the insects are being remotely sensed, ground truth labels are difficult to generate or infeasible to collect. For example, ground-truth species labels in field studies are typically unknown because researchers cannot collect the remotely sensed insects; even if insects were collected during a field experiment, correlating individual insects with individual measurements would be challenging. Consequently, some work has explored unsupervised methods, such as clustering of power spectra [19,39,43]; changepoint detection is another unsupervised method, which our research group has previously explored [46]. To the best of our knowledge, only two studies have used end-to-end machine learning [31,43]. It is worth noting that each entomological lidar instrument has been developed as a unique research instrument, making direct comparisons of data and results difficult.
Although previous entomological lidar studies used a variety of machine learning algorithms, most studies used only one algorithm or compared a small set of algorithms. The main aim of this paper is to compare the insect-detection performance of a wide variety of machine learning techniques. In particular, we test traditional supervised learning, deep learning, and changepoint detection, as each method category has unique advantages and disadvantages. To that end, we present performance comparisons of six supervised and two changepoint methods. For the supervised methods, we present four feature engineering and two deep learning [47] approaches. We evaluate and compare these methods on a new entomological lidar dataset, which we openly publish [48].
2. Materials and Methods
2.1. Lidar System
We collected data using the wingbeat-modulation lidar system developed by MSU's Optical Remote Sensor Laboratory [18].
Figure 1 shows the intuition behind detecting insects using wingbeat-modulation lidar. As an insect flaps its wings, the amount of reflected light changes, which in turn leads to modulations in the lidar data. By analyzing the changes in reflected light in the frequency domain, one can determine the insect's wingbeat spectrum [17,18]. The unique signal characteristics of the insect's wingbeats can also be used to distinguish insects from other objects. Figure 1b shows an example image containing one stationary object and two insects: one where reflections from the insect's wings dominate the signal, and one where reflections from the insect's body dominate the signal. Figure 1c shows time series plots of the three targets. The stationary object looks like noise. The reflection from the insect's body creates a noticeable dip in the signal, while the wing's oscillations are superimposed on top of the body reflection. The wing signal in the lower plot shows distinct periodic peaks. Figure 1d shows the frequency spectra for each of the targets. The stationary object looks like noise in the frequency domain; the insect's body produced a dominant low-frequency component, while the wing signal produced obvious harmonic structure.
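As an illustration of this frequency-domain analysis, the following MATLAB sketch computes the one-sided energy spectral density of a single range bin's time series; here, row is one row of a lidar image and fs is the image's average pulse repetition frequency. The windowing and normalization details are our assumptions and may differ from the processing in our published code [59].

```matlab
% Hedged sketch: wingbeat spectrum of one range bin (not our exact pipeline).
x = double(row(:)) - mean(row);        % remove the static (DC) background return
N = numel(x);
X = fft(x .* hann(N));                 % windowed FFT along the pulse (slow-time) axis
esd = abs(X(1:floor(N/2)+1)).^2;       % one-sided energy spectral density
f = (0:floor(N/2))' * fs / N;          % frequency axis in Hz
plot(f, esd); xlabel('Frequency (Hz)'); ylabel('ESD');
```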
Figure 2 shows an optical schematic of the lidar system, and Figure 3 shows a picture of the lidar system during the experiments. The first component in the transmitting subsystem was the JDS Uniphase NG-10320-100 532 nm laser, which had a pulse width of 0.6 ns, a pulse energy of 3 µJ, and a pulse repetition frequency (PRF) between approximately 2.9 kHz and 5.4 kHz. The laser was housed in a light-tight box along with a Hamamatsu H6780-20 photomultiplier tube (PMT), the output of which was used as the trigger input for the analog-to-digital converter (ADC). A series of mirrors and lens tubes directed the light out of the system at the center of the receiver telescope's central obstruction. Before exiting the system, the beam was expanded to a 50.8 mm diameter with a half-angle divergence of 0.2 mrad.
The first component in the receiving subsystem was the Meade LX200 Classic Schmidt-Cassegrain telescope with a 305 mm diameter and a focal length of 3048 mm. At the back of the telescope, a collimating lens, two laser line filters, and a focusing lens were housed in a lens tube. A Hamamatsu H9305-04 PMT was attached to the end of the lens tube. The lidar system was mounted on a MOOG QPT-130 pan-tilt mount.
The data were digitized using a Compuscope 14200 PCIe ADC with a 200 MS/s sampling rate, 100 MHz bandwidth, and 14-bit resolution. The 200 MS/s sampling rate limited the system's range resolution to 0.75 m. The output of the H9305-04 PMT was connected to the ADC's input channel, and the output of the H6780-20 PMT was connected to the ADC's trigger input. For each laser pulse, the ADC was triggered to collect 200 samples. Due to the transit time for the light to leave the system, as well as variance in when the ADC started sampling, 178 samples were usable [49]. MATLAB was used to control the ADC and collect 1024 pulses, resulting in images of 178 rows (range bins) × 1024 columns (pulses). For compatibility with the ADC, MATLAB 2015b was used.
Figure 4 shows a generic example of how the data were arranged into a matrix; the data are later visualized as false-color images. Since the laser’s PRF was not constant, the total duration of 1024 pulses was not constant; this resulted in images of varying durations.
2.2. Dataset
The data used for the experiments were collected in June and July 2022 at the MSU horticulture farm. The data comprise 9977 images taken in front of the beehives. As noted previously, the image durations vary because the laser's PRF varied; this can be seen in Figure 5.
For the data collection process, the lidar was mounted in the back of a van, as shown in Figure 3, and pointed in front of the beehives, some of which are shown in Figure 6. The data were collected with the lidar beam pointed in various discrete directions; at each position, multiple images were collected while the lidar's position was fixed. This was done to create a diverse set of measurements to train the machine learning algorithms on; having various levels of bee activity and background clutter helps avoid overfitting. The position of the lidar beam was controlled by changing the pan and tilt angles of the pan-tilt mount. The probability of bees flying through the lidar's beam was a function of the beam's position, as well as the various factors that influence bee activity (temperature, time of day, etc.).
After collecting the data, we manually labeled the bounding box of each insect in each image, then converted the bounding boxes into binary labels that indicate whether a row contains an insect. Each potential insect was labeled with a confidence rating between 1 and 4 because some bees were more obvious than others. In general, high confidence ratings were assigned to bee events with obvious harmonic structure and a prominent return amplitude in the time domain. Low confidence ratings were given to events with very short transit times, such as those shown in Figure 5c. During the labeling process, we found 4671 probable bees. All labeled bees, regardless of confidence rating, were included in the dataset; our insect-detection algorithms did not account for confidence ratings. Since we were not able to collect ground-truth data in the field, it is possible that our labels are imperfect: some insects might have been missed, and some non-insects might have been labeled as insects. Our ground truth labels are the primary source of uncertainty. Using imperfect labels during training and validation may result in false insect detections; conversely, the classifiers may find insects that the ground truth labels missed. Section 4.1 and Section 4.3 provide additional discussion on the effect of imperfect labels.
Of the 9977 images, 3498 (35.14%) contain one or more bees. In total, the dataset has 1,775,906 rows, 11,492 (0.647%) of which contain an insect measurement. Due to sampling jitter in the ADC, most insects span multiple range bins, which increases the number of rows labeled as containing insects. The dataset therefore has a large class imbalance, particularly at the row level. Large class imbalances can cause machine learning algorithms to ignore the minority class [50], which would result in most insects going undetected. Section 2.4.1 discusses the methods we used to alleviate the class imbalance.
Figure 5 shows four example images from the dataset containing different types of bee activity and stationary objects. The bee signals vary depending on whether the laser primarily hit the body or the wings of the bee. There is a low chance that the lidar pulses hit only the body, but the return from the body often dominates the signal when the incidence angle is such that the more specular wing reflection is not large. There are also images containing small blips of a bee, where the laser might have barely hit the bee for a few pulses; these blips were labeled as bees but with low confidence. When the intensity of returned light was large, reflections showed up in the data as positive voltages at later range bins, as indicated by the orange dashed boxes in Figure 5. These artifacts are undesirable because they have characteristics similar to the actual returns that caused them. After the data were collected, we discovered that the reflections were caused by an impedance mismatch in the lidar's wiring; by the time we discovered this, we could not recollect the data. Section 2.3 discusses our method for reducing the reflection artifacts.
Figure 7a shows a histogram of the insect transit times observed in the dataset. The shortest transit time was 744 µs, the longest was 307.9 ms, and the average was 36.7 ms. Most bees were in the beam for less than 50 ms, which is roughly one-fifth of the average image duration of 230 ms. Note that the image durations are not all the same. None of the recorded transit times can be longer than the associated image's duration, though the object may have been in the beam for longer than the image's duration. Nine events that were labeled as insects had transit times that lasted the entire image duration; most of these were labeled as low-confidence insects.
Figure 7b shows the distribution of fundamental wingbeat frequencies. Previous literature has found that honeybees' fundamental wingbeat frequency can range from 150 Hz to 600 Hz and varies with environmental conditions [51]. The most common wingbeat frequencies are in the low 200 Hz region [52,53]. The fundamental frequency of each insect observation was calculated using the harmonic product spectrum algorithm [54]. Since most insects showed up in multiple range bins due to sampling jitter, the fundamental frequency was estimated on the average of the insect's range bins to increase the quality of the insect signals. Before estimating the fundamental frequency, the insect observations were bandpass filtered to help isolate the expected wingbeat frequencies. We used a 10th-order Butterworth filter with a passband between 50 Hz and 1200 Hz; the upper passband frequency was chosen to ensure that the third harmonics in the spectra were left untouched, so as to not affect the harmonic product spectrum algorithm. The wingbeat frequency distribution shown in Figure 7b is distinctly bimodal. In agreement with previous literature, we see a strong distribution of wingbeat frequencies between approximately 175 Hz and 250 Hz [51,52,53]. The other distribution of wingbeat frequencies, between 0 Hz and approximately 125 Hz, came from bee observations where the body dominated the return signal and very little wingbeat motion was captured.
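To make the estimation procedure concrete, the sketch below combines the bandpass filter with a basic harmonic product spectrum. The filtering details (e.g., zero-phase filtering) and the number of harmonics folded in are our assumptions, and fs is again the average pulse repetition frequency; the implementation in our published code [59] may differ.

```matlab
% Hedged sketch of wingbeat fundamental-frequency estimation (illustrative only).
function f0 = estimateWingbeatF0(x, fs)
x = x(:) - mean(x);
[b, a] = butter(5, [50 1200] / (fs/2), 'bandpass');  % 5th-order prototype -> 10th-order bandpass
x = filtfilt(b, a, x);                               % zero-phase filtering (assumption)
N = numel(x);
spec = abs(fft(x .* hann(N)));
spec = spec(1:floor(N/2));                           % one-sided magnitude spectrum
hps = spec;
for k = 2:3                                          % fold in the 2nd and 3rd harmonics
    down = spec(1:k:end);
    hps(1:numel(down)) = hps(1:numel(down)) .* down;
end
[~, idx] = max(hps);
f0 = (idx - 1) * fs / N;                             % convert bin index to Hz
end
```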
The data from June were used for training and validation, and the data from July were used for testing. This split simulates the scenario of training algorithms after one field campaign and then using the algorithms during the next campaign. The June data were split into stratified training and validation sets, with 80% for training and 20% for holdout validation. Table 1 shows the number of images, observations, and bees in each set.
2.3. Preprocessing
As mentioned in Section 2.2, an impedance mismatch in the lidar's wiring caused reflections, which resulted in undesirable artifacts in the data. These artifacts can easily confuse the bee detection algorithms, as they share the same characteristics as the bee returns that produced them. Since the PMT output current is negative, any positive voltage values were due to reflections. Since the samples in each row form a time series at a particular range bin, any large changes between subsequent samples would be caused by a moving object at that range bin. Using this knowledge, we thresholded all positive voltage values to the corresponding row's average value. Figure 8 shows an example image before and after thresholding. The thresholding reduced most of the reflections, but not all, because some of the reflections resulted in small negative voltages near 0. Since we cannot easily distinguish between a small negative voltage caused by a reflection and a small negative voltage caused by a weak return from a bee or other object, we settle for leaving some artifacts in the data. While leaving these artifacts increases the false positive rate of the algorithms, a more aggressive removal would result in missed insect detections.
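A minimal sketch of this thresholding step is shown below; img is a 178 × 1024 lidar image containing the PMT's (negative) voltages.

```matlab
% Replace positive samples (wiring-reflection artifacts) with the row mean.
cleaned = img;
for r = 1:size(img, 1)
    row = img(r, :);
    row(row > 0) = mean(row);   % positive voltages cannot come from the PMT itself
    cleaned(r, :) = row;
end
```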
2.4. Supervised Methods
The supervised learning methods we used can be broken into two categories: feature engineering and deep learning. The feature engineering methods use the features described in Section 2.4.2 as input, whereas the deep learning methods take the raw data as input. The methods can also be classified by whether they take rows or entire images as input. The feature engineering methods, as well as the 1-dimensional deep learning methods, predict which rows contain insects; the 2-dimensional deep learning methods take entire images as input and predict which images contain insects. For the row-based methods, the row predictions were aggregated to predict which images contain insects.
The overall supervised learning process comprised tuning dataset sampling parameters, tuning model hyperparameters, training, and testing. While our code attempts to make training as reproducible as possible by carefully seeding MATLAB’s random number generator, some algorithms were still not deterministic; consequently, we ran the whole process three times for each supervised learning algorithm.
2.4.1. Data Sampling
As mentioned previously, our dataset is highly imbalanced. To deal with this, we used combinations of data augmentation and random undersampling [50] while training the row-based methods. Data sampling was not performed for the image-based methods because the image-based dataset is not highly imbalanced.
Random undersampling was used to remove rows from each image that did not contain insects. To increase the number of insect examples used during training, we created synthetic insects. Our augmentation procedure performs six transformations, applied in order: (1) random circular shifting; (2) interpolating the circularly-shifted vector; (3) decimating the circularly-shifted vector; (4) vector reversal; (5) interpolating the reversed vector; and (6) decimating the reversed vector. Random Gaussian noise is added to each synthetic observation. When the number of synthetic observations per insect, N, is not a multiple of 6, the last synthetic observations are created using the first N mod 6 transformations from the list; for example, when the number of synthetic insects is 10, the last 4 synthetic insects are created using the first 4 transformations.
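The sketch below illustrates the six transformations; the stretch/compression factors and the noise level are placeholders rather than the values used in our published code [59].

```matlab
% Hedged sketch of the synthetic-insect transformations (illustrative factors).
function y = augmentInsectRow(x, k, noiseStd)
x = x(:);  N = numel(x);
switch k
    case 1                                            % (1) random circular shift
        y = circshift(x, randi(N));
    case 2                                            % (2) interpolate (stretch) the shifted vector
        s = circshift(x, randi(N));
        y = interp1(1:N, s, linspace(1, N, 2*N))';    % 2x upsample (assumed factor)
        y = y(1:N);                                   % keep the original length
    case 3                                            % (3) decimate (compress) the shifted vector
        s = circshift(x, randi(N));
        d = decimate(s, 2);
        y = repmat(d, 2, 1);  y = y(1:N);
    case 4                                            % (4) vector reversal
        y = flipud(x);
    case 5                                            % (5) interpolate the reversed vector
        y = interp1(1:N, flipud(x), linspace(1, N, 2*N))';
        y = y(1:N);
    case 6                                            % (6) decimate the reversed vector
        d = decimate(flipud(x), 2);
        y = repmat(d, 2, 1);  y = y(1:N);
end
y = y + noiseStd * randn(N, 1);                       % Gaussian noise on every synthetic row
end
```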
For the row-based methods, we performed a grid search over the following undersampling and data augmentation parameters:
undersampling ratio:
# of synthetic insects per insect observation:
During the grid search, the Matthews correlation coefficient (MCC) [55] was used to evaluate classification performance on the validation dataset. For each model, the sampling parameters with the highest MCC value were chosen as the final sampling parameters. During the grid search, the models were trained with their default hyperparameters.
2.4.2. Feature Engineering Methods
For the feature engineering methods, we trained four different types of models: AdaBoost [56], RUSBoost [57], a linear support vector machine (SVM), and multilayer perceptrons (MLP). AdaBoost and RUSBoost are both boosted decision tree ensembles, while the MLP is a feedforward neural network. We trained MLPs with 1, 3, 5, and 7 hidden layers.
The inputs to these models were 46 hand-selected features, which were extracted from each row of each image. From the time domain, we extracted the following features from each row: range, standard deviation, maximum absolute first difference, maximum absolute second difference, root-mean-square (RMS), impulse factor, crest factor, shape factor, median, median absolute deviation, mean-centered skewness, mean-centered kurtosis, and the observation mean minus the image mean. From the frequency domain, we extracted various statistics from the normalized one-sided energy spectral density (ESD): mean, median, median absolute deviation, standard deviation, mean-centered skewness, and mean-centered kurtosis. Most frequency-domain features were based upon the insect's wingbeat frequency and harmonics, such as the height and location of each harmonic, as well as ratios of each harmonic's features. The fundamental frequency was estimated from the normalized ESD using the harmonic product spectrum algorithm [54]. Before extracting any of the frequency-domain features, the signals were bandpass filtered with a 10th-order Butterworth filter with a passband between 50 Hz and 1200 Hz.
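For illustration, the sketch below computes a handful of the time-domain features listed above; the exact definitions used in our published code [59] may differ slightly.

```matlab
% Hedged sketch: a subset of the hand-engineered time-domain features.
function feats = exampleRowFeatures(row)
x = row(:);
xAbs = abs(x);
feats.range         = max(x) - min(x);
feats.stdDev        = std(x);
feats.maxAbsDiff1   = max(abs(diff(x)));        % maximum absolute first difference
feats.maxAbsDiff2   = max(abs(diff(x, 2)));     % maximum absolute second difference
feats.rmsVal        = rms(x);
feats.impulseFactor = max(xAbs) / mean(xAbs);
feats.crestFactor   = max(xAbs) / rms(x);
feats.shapeFactor   = rms(x) / mean(xAbs);
feats.medianVal     = median(x);
feats.madVal        = mad(x, 1);                % median absolute deviation
feats.skew          = skewness(x);              % mean-centered skewness
feats.kurt          = kurtosis(x);              % mean-centered kurtosis
end
```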
In addition to time- and frequency-domain features, we used generalized Morse wavelets [58] to compute the continuous wavelet scalogram, from which we extracted additional features. Across the time axis of the time-frequency plane, we computed the average intensity of each frequency bin, then selected the maximum of those average intensities. Similarly, we computed the standard deviation of each frequency bin and extracted the mean of those values. We also computed the maximum and mean values of the scalogram, the maximum change between adjacent values along the frequency axis, and the average skewness of the frequency bins. For a full list of features, see the feature ranking in Figure 9 and our code [59].
Individual model hyperparameters and the cost of predicting false negatives were tuned using Bayesian optimization [60] via MATLAB's bayesopt function. Fifteen iterations were used during parameter tuning, and MCC was again used to select the final hyperparameters. The search range for the false negative cost was the integers between 1 and 10; see our source code [59] for the model-specific hyperparameter search ranges. The data sampling and model hyperparameter tuning were performed sequentially, rather than together. Our assumption was that the optimal data sampling parameters were independent of the individual model hyperparameters; although this assumption is likely not strictly true, performing the two tuning steps sequentially drastically reduces the parameter search space.
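A hedged sketch of this tuning loop with bayesopt is shown below. Only the false-negative cost range (1 to 10) and the 15 iterations come from the text; validationMCC is a placeholder for a function that trains a model with the candidate parameters and returns its validation MCC, and the second variable is only an example of a model-specific hyperparameter.

```matlab
% Hedged sketch of Bayesian hyperparameter tuning with MATLAB's bayesopt.
vars = [ optimizableVariable('falseNegativeCost', [1 10],  'Type', 'integer')
         optimizableVariable('numLearners',       [50 500], 'Type', 'integer') ];  % example parameter

objective = @(p) 1 - validationMCC(p, trainData, valData);   % bayesopt minimizes, so use 1 - MCC

results = bayesopt(objective, vars, 'MaxObjectiveEvaluations', 15, 'Verbose', 0);
best = bestPoint(results);                                   % chosen hyperparameters
```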
2.4.3. Deep Learning Methods
Given the rise of deep learning, we wanted to see whether deep learning could outperform our manually engineered features. To this end, we tested two deep learning algorithms: a 2D convolutional neural network (CNN) that took the raw images as input, and a 1D CNN that operated on each row. For both CNNs, we tested models with 1, 3, 5, and 7 convolutional layers. Each convolutional layer was followed by a batch normalization layer, a layer of rectified linear unit (ReLU) activation functions, and a dropout layer with a 20% dropout probability to reduce the chance of overfitting. Between the ReLU layer and the dropout layer, the 2D CNNs included a 2-by-2 max pooling layer with a stride of 2. Each model used a weighted cross-entropy loss at the output layer. The default filter size for the 1D CNNs was 16; a rectangular filter size, wider than tall, was chosen as the default for the 2D CNNs because insect events are typically wider than they are tall.
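The layer stack below sketches a 1-block version of the 2D CNN in MATLAB's Deep Learning Toolbox; the filter size, number of filters, and class weights shown here are placeholders, since the actual values were set by the tuning procedure described below and in our code [59].

```matlab
% Hedged sketch of a 1-convolutional-layer 2D CNN (placeholder filter size/count).
layers = [
    imageInputLayer([178 1024 1], 'Normalization', 'none')
    convolution2dLayer([4 16], 8, 'Padding', 'same')  % rectangular filter: wider (pulses) than tall (range bins)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    dropoutLayer(0.2)
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer('Classes', categorical({'noInsect', 'insect'}), ...
                        'ClassWeights', [1 5])        % weighted cross-entropy; weights are placeholders
];
```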
All networks were trained using the Adam optimizer [61], though no effort was made to optimize the training parameters: 20 training epochs and a learning rate of 0.01 were used for all models, and the mini-batch size was 2048 for the 1D CNNs and 64 for the 2D CNNs. The filter size, number of filters, and false negative cost for each model were tuned using the same procedure described in the previous section. For more architecture and training details, see our source code [59].
2.5. Changepoint Detection
The motivation for using changepoint algorithms was that bees show up as abrupt changes in both rows and columns. The experiments were conducted using MATLAB's findchangepts function, a primarily statistical changepoint detection algorithm, and gfpop [62,63], a graph-based changepoint detection algorithm.
The findchangepts function is part of MATLAB's Signal Processing Toolbox and returns the indices where the input signal changes most significantly. The algorithm uses a parametric global method: it divides the signal into two sections at a candidate division point, estimates the desired statistical property for each section, measures the deviation of every sample from the corresponding estimate, computes the total residual error, and then varies the division point until the total residual error is minimized. In this study, the mean was used as the statistical property. An experimentally determined minimum threshold of 0.0025 was used; this acts as a penalty value for adding a changepoint and sets the minimum improvement in total residual error required when calculating the changepoints.
The gfpop (Graph Constrained Functional Pruning Optimal Partitioning) algorithm is a graph-based changepoint detection algorithm created primarily for applications, such as medicine, that involve large data sequences with abrupt changes. The algorithm requires the user to supply a built-in or manually created graph consisting of nodes representing the states of the signal, edges for the transitions between states, and a loss function used to analyze the graph. The graph edges have a variety of parameters, although for this application the only parameter that was tuned was the penalty on the edges for transitioning between states. The graph for the bee dataset, shown in Figure 10, contained four states representing air, bees, and the increase and decrease between the air and bee states; this graph gave better results than a binary graph with only air and bee states. The edges transitioning from air to an increase and from bee to a decrease had a penalty of 0.005. The graph penalties, as well as the procedures described below, were manually tuned on a few images from the training dataset.
For the image analysis using the changepoint detection algorithms, both the rows and columns of the bee images were fed into the algorithms. Results were gathered for three configurations: analyzing only the rows, analyzing only the columns, and checking whether both the rows and columns gave a changepoint within some margin of each other.
The procedure for analyzing rows comprises the following steps: applying a heuristic to filter out rows that obviously do not contain bees, applying the changepoint detection algorithm to the remaining rows, and then applying further heuristics to determine whether the changepoints are likely associated with a bee signal. Concretely, we first check whether the range of the row is higher than the mean of the row. This check is meant to determine whether the target row is either empty air or a stationary object: empty air has a low range and a low mean, while stationary objects generally have a low range and a high mean. If the range is greater than the mean, the row is marked as potentially containing a bee. The changepoint algorithm is then applied to these marked rows. After the algorithm is applied, the number of changepoints is checked, and if it is more than 5, the row is discarded; the stationary objects that pass the first check often contain upwards of 10 changepoints, while true bees often contain only around 5. The last check verifies that the voltage at the changepoint is larger than the mean of the row; if it is, the changepoint is classified as a valid bee.
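The sketch below puts these heuristics together for the MATLAB-based row analysis; the MinThreshold value and the 5-changepoint limit come from the text, while the helper structure is our own and is only illustrative.

```matlab
% Hedged sketch of the row-analysis procedure.
isBee = false(size(img, 1), 1);
for r = 1:size(img, 1)
    row = img(r, :);
    rowMean = mean(row);
    if (max(row) - min(row)) <= rowMean, continue; end   % screen out empty air / stationary objects
    cpts = findchangepts(row, 'Statistic', 'mean', 'MinThreshold', 0.0025);
    if isempty(cpts) || numel(cpts) > 5, continue; end   % stationary clutter tends to produce many changepoints
    if any(row(cpts) > rowMean)                          % changepoint voltage must exceed the row mean
        isBee(r) = true;
    end
end
```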
The procedure for analyzing the columns is similar. First, the algorithm checks the range of each row and compares it to the mean; from those results, the algorithm creates a vector of the valid rows where an insect could be. It then iterates through each column and checks whether the changepoints it detects are in valid rows before checking the value of the changepoint against the mean of the row. If the changepoint fits the same criteria used in the row analysis process, it is classified as a valid bee.
The process for checking both rows and columns iterates through the rows in the same manner as before; after detecting changepoints in a row, it iterates through the corresponding columns, and for each changepoint found in a column, it computes the distance between the row changepoint and the column changepoint. If the two changepoints are within a distance of 4, as measured by the L2 norm, the changepoint is considered to be a bee.
2.6. Evaluation Metrics
All the methods were evaluated using their recall, precision, and Matthews correlation coefficient (MCC) [55] scores. It is well known that accuracy is not an informative metric for imbalanced data [64]; thus, accuracy is not used in this paper. Recall is the fraction of the target class that the method was able to identify and ranges from 0 to 1. Precision is the fraction of true positives relative to the total number of positive identifications and also ranges from 0 to 1. MCC is a metric that gives a high score only if the method performs well in each of the four categories of the confusion matrix. The MCC ranges between −1 and +1. Positive values indicate a positive correlation between the predictions and the true labels, with +1 being perfect prediction. A value of 0 indicates that the predictions are no better than random guessing. Negative values indicate a negative correlation between the predictions and the true labels, with −1 indicating complete disagreement between the predictions and the true labels.
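For reference, with TP, TN, FP, and FN denoting the four entries of the confusion matrix, these metrics are computed as

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$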
When evaluating our methods, we consider recall more important than precision because the focus was to identify as many of the bees as possible. However, a low precision is not entirely outweighed by a high recall: low precision means there were many false positives, which researchers would have to look through to determine whether bees were actually present. Thus, a balance of recall and precision is desirable, and a high MCC score indicates a good balance of the two. While the F1 score could be used as a single metric that combines recall and precision, we chose MCC because it is invariant to which class is labeled "positive", unlike F1 [65].
4. Discussion
4.1. Sources of Uncertainty and Error
Our ground truth labels are the primary source of uncertainty. Since we manually labeled all insects via visual inspection, it is possible that some insects were missed and some non-insects were labeled as insects. Without independent ground truth to validate our labels, we cannot easily verify that all labeled insects truly were insects; this is especially pertinent for observations that were labeled as low-confidence insects. The digitizer's sampling jitter, which resulted in most insects spanning multiple 0.75 m range bins during a single laser pulse (a physically impossible event), also contributed to uncertainty in the labels. When labeling each insect's bounding box, deciding exactly how many rows the insect spanned was somewhat arbitrary because the effect of the sampling jitter decayed over multiple range bins. This decay and smearing can be seen in Figure 5.
4.2. Effects of Confidence Ratings
As noted in Section 2.2, the bee events were labeled with confidence ratings between 1 and 4, with 4 being the highest confidence. High-confidence bees had obvious signatures in both the time and frequency domains, whereas medium-confidence bees generally had obvious signatures in only one domain. For example, the two insect examples in Figure 1 are high-confidence bees, whereas the blips in Figure 5b,c are low-confidence bees. Thus, the confidence ratings are tied to labeling uncertainty and data quality. There are 73 rows with a confidence of 1, 1479 rows with a confidence of 2, 5288 rows with a confidence of 3, and 402 rows with a confidence of 4. Figure 13 shows recall as a function of confidence rating. For the majority of the algorithms, high-confidence bees are identified more often than low-confidence bees. This makes intuitive sense because high-confidence bees should be the most distinct. Interestingly, many of the supervised learning algorithms had higher recall for bees with a confidence of 1 than for bees with a confidence of 2. This suggests that the algorithms may have identified some salient features of low-confidence bees, such as the blips shown in Figure 5. These results could also be due to labeling uncertainty in the bounding boxes, as discussed in the following section, or imprecision in the confidence labels.
Interestingly, the gfpop "rows" and MATLAB "rows" methods exhibit the opposite trend: recall decreases as confidence increases. The "both" methods exhibit the same trend because they are similar to the "rows" methods. It is possible that some of the high-confidence bee rows were discarded by the heuristic that bees typically have fewer than five changepoints. Conversely, low-confidence bees that were in the beam for a short period of time would have few changepoints and would thus pass that heuristic. The column-based changepoint detection algorithms exhibit the expected behavior. High-confidence bees, especially ones with a prominent body return signal, exhibit abrupt changes between range bins, since the range bins before and after the bee have a low return amplitude; this likely contributed to the behavior of the column-based changepoint methods shown in Figure 13.
4.3. Model Generalization
As noted in Section 3, many of the supervised methods did not perform as well on the testing set as they did on the validation set. This suggests some amount of overfitting to the training and validation sets, which were collected at a different time than the testing set. Although the training/validation and testing data were collected at the same location, the conditions were not identical: the ambient temperature was different, resulting in different bee activity; the height of the foliage in front of the beehives was different; and some of the pan and tilt angles at which the data were collected were intentionally different. These variations resulted in different bee and background characteristics between the two datasets. Consequently, the training/validation and testing dataset distributions are not identical, nor is either set representative of the entire possible data distribution at the data-collection location.
As seen in Table 3 and Table 4, the methods generalized better when classifying images than when classifying rows. This suggests that these methods were still detecting the insects but not detecting every row the insect occupied due to sampling jitter, or that the methods simply missed an insect in images that contained multiple insects. Figure 14a shows an example where a 3-layer MLP correctly found the insect but predicted more rows than were labeled. In this case, only one row was labeled, but the insect spanned three rows, which the MLP predicted. This leads to an artificially low row-based precision, as the human labels were imperfect and did not span all the insect's rows. Additionally, the MLP predicted the reflection artifacts as additional bees. Figure 14b shows the opposite case, where the 3-layer MLP predicted fewer rows than were labeled. Although the algorithm still found the bee, the missed rows led to an artificially low row-based recall. The situations demonstrated in Figure 14 show that the label ambiguity caused by the digitizer's sampling jitter and the human-labeled bounding boxes contributed to artificially low row-based performance metrics.
Better generalization performance might have been achieved if we had randomly split the combined June and July datasets into training/validation/testing sets, rather than using June for training/validation and July for testing. However, we wanted to model a real-world scenario where labeled data from a previous campaign could be used to train models for a new data collection campaign.
Overall, lack of generalization is one of the primary limitations of both the supervised and changepoint detection methods. It is unlikely that either category of methods will generalize to a new data-collection location or species without being retrained or having some algorithm parameters adjusted. Additionally, recent insights have shown that CNNs with subsampled convolution and pooling operations are sensitive to small image translations [70]; it is possible that our 2D CNNs are sensitive to the bee locations, leading to a lack of generalization, but some preliminary analysis suggests this is not the case. Eventually, as more data are collected across various locations, seasons, and species, the methods should start to generalize.
4.4. Algorithm Recommendations
For supervised learning, the general rule for model selection is to choose the model with the best validation performance. Based on the row-based MCC, which was used during hyperparameter tuning, the 5-layer MLP would be chosen as the "best" model. Indeed, the 5-layer MLP also performed reasonably well on the testing set, so this would not be a bad decision. However, RUSBoost, which had the lowest row-based MCC on the validation set, performed best on the testing set; RUSBoost did, however, have the highest row-based recall on the validation set, as well as the highest image-based recall on the validation set. Most methods generalized better in the image-based results; given this, one might consider using the image-based validation results for final model selection, rather than the row-based results.
In general, for the supervised methods we tested, we recommend using RUSBoost and/or a feature-based MLP, as those methods performed consistently well across the validation and testing sets. The 1D CNNs achieved good average performance on the testing set but had high variability between runs; this variability indicates that the models were not robust. The 2D CNNs did not generalize to the testing set and also had high variability between runs. With more fine-tuned architectures, it is possible that CNNs could perform and generalize well.
Compared to the changepoint detection methods, the higher performance of the supervised methods comes at a cost: the need for labeled data. This is the major limitation of the supervised methods because the primary point of automated detection is not having to manually detect insects. If labeled data already exist, then we recommend using one of the supervised methods. However, if labeled data do not exist, the changepoint detection methods are a promising place to start because they can be tuned on a few images and still achieve high recall of insect-containing images. If finding as many insect images as possible is paramount, and looking through false positives is not a huge burden, then we recommend the gfpop "rows" or "both" methods; these methods achieved image-based recall scores around 93%. If looking through many false-positive images is too large a burden, then we recommend the MATLAB "rows" or "both" algorithms because they achieved precision scores around 85%.
When the data are not labeled, a semi-supervised learning approach could be employed: use a changepoint detection method to create an initial set of imperfect labels, and then use those labels to train one of the supervised learning algorithms. The data could then be fed through the trained supervised algorithm, and any new positive predictions that were not predicted by the changepoint detection method can be inspected and labeled. This process can be repeated until no new insect images are found.
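A possible sketch of this bootstrapping loop is shown below; changepointDetect, trainSupervised, predictInsects, and manuallyVerify are hypothetical placeholders, not functions from our published code.

```matlab
% Hedged sketch of the semi-supervised bootstrapping loop described above.
labels = changepointDetect(images);                  % initial, imperfect labels (e.g., gfpop "rows")
while true
    model = trainSupervised(images, labels);         % e.g., RUSBoost or a feature-based MLP
    preds = predictInsects(model, images);           % logical prediction per image
    newPositives = preds & ~labels;                  % detections the current labels do not contain
    if ~any(newPositives), break; end                % stop when no new insect images are found
    labels = labels | manuallyVerify(images, newPositives);   % human review of the new detections
end
```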
4.5. Future Research Directions
While good image-based recall and precision, like that achieved in this paper, would help reduce the number of non-insect images requiring manual analysis, manual analysis would still need to be performed on the insect images to obtain quantities of entomological interest, such as the total number of insects observed. Consequently, future research should work towards improving row-based insect detection performance and reducing sampling jitter; high row-based precision and recall would allow scientists to easily calculate the total number of insects, their spatial and temporal distributions, their wingbeat frequency distribution, etc. Towards this goal, the fundamental frequency estimation algorithm should be improved to detect fewer spurious harmonics in stationary-object and noise observations, which would lead to better class separability in the harmonic features.
The sampling jitter is the biggest impediment to having an automatic count of the actual number of insects. The smearing of insect observations across multiple range bins limits the ability of the row-based methods to accurately detect insects and leads the row-based methods to produce artificially large insect counts. If the sampling jitter cannot be reduced significantly, post-processing methods should be explored. Using an image segmentation CNN is one possible solution, as the model could be trained to predict the insects’ bounding boxes.
Incorporating the labels' confidence ratings into the detection process is an interesting avenue for future work, as the confidence ratings provide extra probability information that could prove useful. Lastly, we are also interested in trying more unsupervised and semi-supervised methods, such as anomaly detection methods.