The final step collected the results and derived a summary of the counted individuals of known moth species and unknown insects detected and tracked by the algorithm. The summary was annotated onto the image, with the track of each insect highlighted.
2.2.2. Insect Tracking and Counting
Tracking was used to reduce each visit of an insect in the camera field of view to one observation. However, we note that an individual insect could be counted again if it left the light trap and returned later during the night.
The individual insects were relatively stationary during their stay, and images were captured at two-second intervals whenever activity occurred. Therefore, it was assumed that a match was most likely for the shortest distance between two bounding boxes. That is, two boxes that were close to each other were likely to belong to the same individual.
The position and size of each insect were estimated for every frame, and tracking could therefore be solved by finding the optimal assignment of insects in two consecutive images. The Hungarian Algorithm [29] was our chosen method for finding the optimal assignment for a given cost matrix. In this application, the cost matrix should represent how likely it was that an insect in the previous image had moved to a given position in the current image. The cost function was defined as a weighted cost of the distance and area of matching bounding boxes in the previous and current image. The Euclidean distance between the center positions (x_p, y_p) and (x_c, y_c) of the bounding boxes in the two images was calculated as follows:

d = sqrt((x_c - x_p)^2 + (y_c - y_p)^2)    (1)

This distance was normalized according to the diagonal of the image of width W and height H:

d_n = d / sqrt(W^2 + H^2)    (2)

The area cost was defined as the relative cost between the sizes a_p and a_c of the bounding boxes:

c_a = |a_c - a_p| / max(a_c, a_p)    (3)

A final cost function in Equation (4) was defined with a weighted cost of distance w_d and a weighted cost of area w_a:

cost = w_d · d_n + w_a · c_a    (4)
The Hungarian Algorithm required the cost matrix to be square and, in our case, it was defined as an n × n matrix, where each entry (i, j) was the cost of assigning insect i in the previous image to insect j in the current image. After a match with minimum cost, the matched entry in the current image was assigned the Track ID from the corresponding entry in the previous image. The found Track IDs and entries were stored and used in the next iteration. Dummy rows and columns were added to the matrix to ensure that it was always square. All entries in a dummy row or column were given a cost significantly larger than all other entries in the cost matrix to ensure that the algorithm did not make a "wrong" assignment to a dummy. An insect assigned to a dummy indicated which insect from the previous image had left, or which insect had entered the current image.
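The matching step can be sketched as follows. This is a minimal illustration in Python: the distance and area weights (here an assumed 0.5/0.5 split), the box format (x, y, width, height), and the example coordinates are all hypothetical, and a brute-force search over permutations stands in for the Hungarian Algorithm, which solves the same problem in polynomial time.

```python
import math
from itertools import permutations

def assignment_cost(prev_box, curr_box, diag, w_d=0.5, w_a=0.5):
    """Weighted cost of distance and area between two boxes (x, y, w, h)."""
    xp, yp, wp, hp = prev_box
    xc, yc, wc, hc = curr_box
    # Euclidean distance between box centers, normalized by the image diagonal
    d = math.hypot((xc + wc / 2) - (xp + wp / 2),
                   (yc + hc / 2) - (yp + hp / 2)) / diag
    # Relative difference between bounding-box areas
    ap, ac = wp * hp, wc * hc
    a = abs(ac - ap) / max(ac, ap)
    return w_d * d + w_a * a

def match_insects(prev_boxes, curr_boxes, diag):
    """Return the minimum-cost assignment of previous to current boxes.

    Brute force over permutations for illustration only; assumes equal
    numbers of boxes (dummy rows/columns would pad a real cost matrix).
    """
    n = len(prev_boxes)
    cost = [[assignment_cost(p, c, diag) for c in curr_boxes]
            for p in prev_boxes]
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))  # (previous index, current index) pairs

# Two insects that each moved slightly between consecutive frames
prev = [(100, 100, 40, 40), (500, 300, 60, 50)]
curr = [(505, 305, 60, 50), (110, 95, 40, 40)]
print(match_insects(prev, curr, diag=math.hypot(1920, 1080)))
# → [(0, 1), (1, 0)]: each insect is matched to its nearby box
```

Because both cost terms are normalized to [0, 1], neither distance nor area dominates the assignment by virtue of its units.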
To evaluate the performance of the tracking algorithm, two metrics were defined based on the work in [30]. The False Alarm Rate (FAR) expressed the probability that a given track was incorrect. It describes the number of false alarms relative to the total number of tracks, that is, how many times the tracker made a wrong track compared to the total number of tracks it made.
While a True Positive (TP) was defined as an individual that retained its uniquely assigned Track ID throughout its presence in the observation, a False Positive (FP) was defined as an individual that was either counted multiple times or assigned a new Track ID.
The Tracking Detection Rate (TDR) measures the number of individuals that maintained their own Track ID, relative to the established Ground Truth (GT), during the course of the observation. TDR was therefore used as the primary measure of the tracker's ability to maintain the same Track ID for each individual insect in an observation. GT was defined as the total number of unique individuals in the test set, measured by manual counting.
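One formulation consistent with the definitions above is FAR = FP / (TP + FP) and TDR = TP / GT; the exact expressions in [30] may differ in detail, and the counts below are hypothetical.

```python
def tracking_metrics(tp, fp, gt):
    """False Alarm Rate and Tracking Detection Rate.

    tp: tracks that kept a unique Track ID for the whole visit
    fp: individuals counted multiple times or given a new Track ID
    gt: ground-truth number of unique individuals (manual count)
    """
    far = fp / (tp + fp)  # share of produced tracks that were wrong
    tdr = tp / gt         # share of real individuals tracked correctly
    return far, tdr

# Hypothetical example counts
far, tdr = tracking_metrics(tp=82, fp=6, gt=85)
print(f"FAR = {far:.3f}, TDR = {tdr:.3f}")
```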
2.2.3. Moth Species Classification
In the field of deep learning, specific architectures of CNNs have provided particularly positive results in many areas of computer vision [
31]. CNNs use both pixel intensity values and spatial information about objects in the image. Finding a suitable CNN architecture for the classification of moth species was a challenging task. Based on an investigation of several CNN architectures [32,33], a customized network was designed, inspired by the work in [34]. Hyperparameters of the architecture were explored to find the optimal network for classifying moth species. The model was designed to be light and fast so that it could be executed on the embedded Raspberry Pi computer used in the light trap.
As the camera recorded images of insects at a constant working distance, the insects did not change in size in the images. The moths were labeled with bounding boxes with an average size of 368 × 353 × 3 pixels and a standard deviation of 110 pixels for height and width. Initial experiments gave poor results with a resized input size of 224 × 224 × 3, which many CNNs [35] use. Improved results were achieved by reducing the input size while still being able to visually identify the moth species. Based on the given camera setup, the bounding boxes were finally downscaled by approximately a factor of three to a fixed window size of 128 × 128 × 3 as input for the customized CNN model.
Only 2000 images, with an equal number of images for each of eight different classes of moths, were used for training the CNN model. A customized model was designed to work with this limited amount of training data.
The CNN model had four layers for feature detection and two fully connected layers for the final species classification. The optimal architecture was found by trying combinations of hyperparameters for the first and last layers of the CNN. The following parameters were used to train the different CNNs for species classification:

Fixed pool size and stride (2 × 2, stride 2),
Kernel size of the first convolutional layer,
Convolutional depth of the last layer,
Fully connected layer size,
Optimizer (Adam or SGD).
The optimal chosen CNN architecture is shown in
Figure 5. The first layer performed convolution using 32 kernels with a kernel size of 5 × 5, followed by maximum pooling of size 2 × 2 and stride 2. All the following layers used a kernel size of 3 × 3. The second and third layers performed convolution using 64 kernels with the same pooling as mentioned above. The final convolutional layer also used 64 kernels, based on the optimization of hyperparameters. All convolutional layers used the Rectified Linear Unit (ReLU) activation function. The fully connected part had two hidden layers with 4096 and 512 neurons and a softmax activation function in the output layer. Two of the most commonly used optimizers, Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD), were investigated. While Adam converged relatively quickly, it did so at the expense of a greater loss. SGD, on the other hand, converged more slowly but achieved a smaller loss.
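A quick sanity check on the architecture is to count its learnable parameters. Under one reading of the description, an input of 128 × 128 × 3, four convolutional layers (32, 64, 64, 64 kernels) each followed by 2 × 2 pooling, a 512-neuron hidden layer, and a ten-class softmax output, the 4096 corresponds to the flattened 8 × 8 × 64 feature map, and the count below reproduces the 2,197,578 parameters reported for the chosen model. This reading is an assumption, not stated explicitly in the text.

```python
def conv_params(k, c_in, c_out):
    """Weights (k * k * c_in * c_out) plus one bias per kernel."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Fully connected weights plus one bias per output neuron."""
    return n_in * n_out + n_out

# Four conv layers: 5x5 with 32 kernels, then 3x3 with 64 kernels, three times
layers = [conv_params(5, 3, 32),
          conv_params(3, 32, 64),
          conv_params(3, 64, 64),
          conv_params(3, 64, 64)]

side = 128 // 2 ** 4          # four 2x2 poolings halve 128 down to 8
flat = side * side * 64       # 4096 flattened features
layers += [dense_params(flat, 512),  # 512-neuron hidden layer
           dense_params(512, 10)]    # softmax output, ten classes

total = sum(layers)
print(total)  # 2197578, matching the reported parameter count
```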
Developing a deep learning model for classifying species was an iterative process, alternating between selecting images and training CNN models. When selecting and annotating images, experiments showed that it was important to include varied images of different individuals and preferably equal numbers of each species. From the experiment mentioned in
Section 2.3, images of nine different frequently occurring insect species were selected to train the different CNN models. The classes of species were chosen based on the recorded data from where a sufficient number of images could be selected to train the CNN models. According to our screening of the data, no other species were found in sufficient quantity to allow for their inclusion in the dataset. Examples of images of each individual species are shown in
Figure 6.
The dataset was created by selecting images of different individual insects from the captured sequences of images in the period of observation. To collect a sufficient number of samples for each class, several images with different orientations of the same individual were chosen. Because sugar water was used, the dataset also contained many wasps (Vespula vulgaris); they were therefore included as a separate class. A separate class containing a variety of cropped background images was also included, to classify false blob detections without insects. A total of ten classes was used for training: nine different insect classes, including seven moth species and one subtribe, plus the background class. These images were annotated with a bounding box for each individual, and the moth species determination was reviewed by an expert from the image alone.
Table 1 shows an overview of the occurrence of all species in the chosen dataset for training and validation of the CNN algorithm. The moth class
Hoplodrina complex consists of the three moth species
Hoplodrina ambigua,
Hoplodrina blanda, and
Hoplodrina octogenaria. These species are very similar in appearance and it was too difficult to distinguish between them from the images alone.
It was a challenge to obtain a sufficient number of images; Agrotis puta and Mythimna pallens, in particular, occurred less frequently than the other species. This was the main reason for the limited number of images (250) for each species. Data augmentation was therefore applied to all images, with vertical and horizontal flips, zoom, different illumination intensities, and rotations of various degrees. This operation provided more training data and was used to create a uniform distribution of species. The dataset was scaled by a factor of 32, resulting in 72,000 images, where each class contained 8000 data points after augmentation. From this dataset, separate splits were used for training and validation of the CNN model.
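The geometric part of such an augmentation can be illustrated on a toy pixel grid; a real pipeline would use an image library, and zoom and illumination changes are omitted here. Note that 250 images per class scaled by a factor of 32 gives exactly the 8000 data points per class reported above.

```python
def flip_horizontal(img):
    """Mirror each row of a pixel grid (a list of rows)."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the order of the rows."""
    return img[::-1]

def rotate_90(img):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
print(flip_horizontal(img))  # [[2, 1], [4, 3]]
print(flip_vertical(img))    # [[3, 4], [1, 2]]
print(rotate_90(img))        # [[3, 1], [4, 2]]

# The reported counts are consistent: 250 images per species, scaled 32x
assert 250 * 32 == 8000      # data points per class after augmentation
```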
To find the best CNN architecture for species classification, different hyperparameters were adjusted as described in
Section 2.2.3. A total of 64 architectures were trained using a dropout probability of 0.3 after the second-to-last hidden layer. The average F1-score over all classes was used as a measure of a given architecture's performance.
The five best architectures had high F1-scores, which varied by only 0.02, but had varying numbers of learnable parameters (Table 2). Compared to SGD, Adam turned out to be the superior optimizer for training all models. In the end, the architecture with an F1-score among the three highest but the lowest number of learnable parameters (2,197,578) was chosen. The reason is that an architecture with many parameters and few training data would increase the risk of overfitting the neural network.
The chosen model shown in Figure 5 had an F1-score of 92.75%, which indicated that the trained CNN was very accurate in its predictions. This final architecture achieved an average precision, recall, and F1-score of 93%, indicating a suitable classification model.
The confusion matrix (
Figure 7) was based upon the validation of the chosen model. The confusion matrix has a diagonal trend, which indicates that the model matched the validation set well. The model had a recall of 93%, indicating that only 7% of the moth species in the validation set were missed. A similar precision of 93% was obtained, indicating that only 7% were wrongly classified.
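The per-class figures follow directly from the confusion matrix; below is a minimal sketch with a hypothetical 3 × 3 matrix (rows as true class, columns as predicted class is an assumed layout, and the counts are invented, not the paper's).

```python
def recall_precision_f1(cm):
    """Macro-averaged recall, precision, and F1 from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    # Recall: diagonal over row sum (how many of class i were found)
    recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
    # Precision: diagonal over column sum (how many predictions were right)
    precisions = [cm[i][i] / sum(row[i] for row in cm) for i in range(n)]
    r = sum(recalls) / n
    p = sum(precisions) / n
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return r, p, f1

# Hypothetical counts: a strong diagonal means few misclassifications
cm = [[93, 4, 3],
      [2, 95, 3],
      [5, 2, 93]]
r, p, f1 = recall_precision_f1(cm)
print(f"recall={r:.2%} precision={p:.2%} F1={f1:.2%}")
```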
Finally, the customized CNN architectures were compared with selected state-of-the-art optimized CNN architectures. EfficientNetB0 [36] is scaled to work with a small image input size of 224 × 224 pixels and has 4,030,358 learnable parameters. Using the moth dataset with the same data augmentation, EfficientNetB0 achieved an F1-score of 88.62%, which is lower than our five best architectures. DenseNet121 [37], with 7,047,754 learnable parameters, gave an F1-score of 84.93%, which is even lower. CNN architectures with many parameters (more than 20,000,000), such as ResNet50 [38] and InceptionNetV3 [39], gave a high training accuracy but lower validation F1-scores of 69.1% and 81.7%, respectively. This result indicates overfitting and that more training data are needed when such large deep learning networks are used. A very high F1-score of 96.6% was finally achieved by transfer learning on ResNet50, using pretrained weights and training only the output layers. This indicates that the state of the art was able to outperform our proposed model, but it requires pretrained weights and many more parameters.