1. Introduction
Compared to traditional wildlife monitoring methods, camera traps offer advantages such as low cost and high concealment, allowing wild animals to be monitored and surveyed automatically and without disturbance [1,2,3,4]. Equipped with motion or infrared sensors, such cameras automatically capture images or videos of passing animals [5,6], resulting in billions of images/videos being collected annually worldwide [7]. The massive amount of raw camera trap data contains valuable information on the species, sex, age, health, number, behaviors, and locations of animals, making camera traps an indispensable tool for conservation and management efforts [8,9]. However, manual extraction of this information is a laborious, knowledge-intensive, time-consuming, and expensive endeavor. To address these challenges, researchers have turned to deep learning, a technique that enables computers to learn multiple levels of abstraction from images [10]. Many researchers have attempted to use deep learning techniques to detect and classify animals in camera trap images [11,12], with some yielding promising results that showcase the potential of these methods, which can speed up complex tasks such as species recognition and individual counting [13].
Deep learning models learn the intrinsic features of the training data by updating the model parameter weights, and subsequently output inference results on unseen data; the distribution of the data therefore has a crucial impact on model performance. As real-world datasets, camera trap datasets typically exhibit an imbalanced distribution, in which certain classes contain numerous images while others contain only a few, resulting in a long-tailed distribution [14]. Deep learning models often perform well for species that are frequently captured by camera traps, but may perform poorly for species that are rarely captured, especially rare or endangered species with inherently low population sizes. Such class imbalance makes training deep learning models on long-tailed camera trap datasets highly challenging [15]. In recent years, several datasets have been proposed for different long-tailed learning tasks. For long-tailed image classification, ImageNet-LT [16], CIFAR-10/100-LT [17], Places-LT [16], and iNaturalist 2018 [18] are four benchmark datasets, while LVIS 0.5/1.0 [19] is the most widely used benchmark for long-tailed object detection and instance segmentation. To investigate the long-tailed problem more effectively, researchers have used quantitative metrics, such as the Imbalance Factor [16,17,18] and the Gini Coefficient [20], to precisely assess the degree of long-tailedness in the aforementioned datasets. Lu et al. [20] demonstrated the reasonableness and effectiveness of the Gini Coefficient, and revealed significant variations in long-tailedness among existing long-tailed datasets. This raises the following questions: what is the degree of class imbalance in camera trap datasets, and is this imbalance prevalent and severe? To date, no studies have quantitatively investigated the long-tailedness of camera trap datasets.
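For readers who wish to reproduce such measurements, both metrics can be computed directly from per-class image counts. The sketch below follows the standard definitions (the Imbalance Factor as the ratio of the largest to the smallest class, and the Gini Coefficient computed over sorted class counts); the toy distribution is illustrative and is not drawn from any of the datasets discussed here.

```python
def imbalance_factor(counts):
    """Imbalance Factor: size of the largest class over the smallest class."""
    return max(counts) / min(counts)

def gini_coefficient(counts):
    """Gini Coefficient of per-class sample counts.

    0 means perfectly balanced; values approaching 1 indicate an
    extremely long-tailed distribution.
    """
    x = sorted(counts)  # ascending
    n = len(x)
    total = sum(x)
    # Standard closed form over sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    weighted = sum((i + 1) * v for i, v in enumerate(x))
    return 2 * weighted / (n * total) - (n + 1) / n

# A toy long-tailed distribution: one head class and several tail classes.
counts = [1000, 100, 50, 20, 5]
print(imbalance_factor(counts))              # 200.0
print(round(gini_coefficient(counts), 3))
```

A perfectly balanced distribution, e.g. `[10, 10, 10]`, yields a Gini Coefficient of 0, which makes the metric easy to sanity-check.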
Object detection is the task of locating and classifying objects in an image using bounding boxes [21]. It is a critical problem in computer vision with many applications, such as autonomous driving and pedestrian detection [22,23,24]. Compared to object detection on a balanced dataset, long-tailed object detection is more complex and challenging, primarily due to extreme class imbalance [25]. This imbalance often results in missed detections or misclassification of rare classes, leading to poor overall detection performance. Thus, long-tailed object detection requires appropriate strategies to address the imbalance issue. Species with similar characteristics, particularly those within the same class, family, or genus, tend to share similar body parts: for instance, squirrels generally share similar body and tail shapes. Transferring this shared knowledge between images of different species can mitigate the effects of class imbalance and improve long-tailed object detection, especially for rare classes. Exploiting the invariant features between images of the same species is also helpful. The diverse and stable sample relationships in camera trap datasets can thus be leveraged to alleviate the shortage of images for rare classes. Hou et al. [26,27] devised a simple yet effective batch transformer block that enables deep recognition and object detection models to explore sample relationships along the batch dimension, facilitating the transfer of knowledge from frequent classes to rare ones. Although they extensively evaluated this approach on multiple benchmark long-tailed datasets, it has not yet been tested on any camera trap dataset.
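The core idea of such a batch transformer can be illustrated with attention computed across the batch axis rather than within an image. The sketch below is a deliberate simplification of the BatchFormer block (single head, no learned projections or feed-forward sublayer, plain NumPy rather than a deep learning framework), intended only to show how samples in a mini-batch exchange information.

```python
import numpy as np

def batch_attention(feats):
    """Single-head self-attention across the batch axis: each sample's
    pooled feature vector attends to every other sample in the mini-batch,
    letting rare-class samples borrow information from related ones."""
    b, d = feats.shape
    scores = feats @ feats.T / np.sqrt(d)        # (b, b) sample-to-sample scores
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats                       # each row: mixture of all samples

feats = np.random.rand(8, 16)                    # a toy mini-batch of pooled features
mixed = batch_attention(feats)
print(mixed.shape)                               # (8, 16)
```

In the full method, one classifier is shared between the original and the transformed features during training, so the module can be removed at inference time at no cost.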
The distribution of object sizes can also affect the performance of object detection models, particularly for small objects, which usually have indistinguishable features and limited context information [28,29,30]. Oksuz et al. [25] observed that the distribution in the MS-COCO dataset is skewed in favor of small objects, and defined the over-representation of certain object or input bounding-box sizes as object/box-level scale imbalance. From this perspective, what is the distribution of animal object sizes in camera trap datasets? Is it similar to that of MS-COCO, or is it long-tailed, with certain object sizes accounting for a large number of images while others have very few? Is exploiting sample relationships also effective at improving detection performance across different object sizes? Thus far, no research has addressed these questions.
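Scale imbalance of this kind is typically measured by binning boxes by size. The sketch below assigns a bounding box to a size bin by the square root of its area; the first three thresholds (16, 32, and 64 px) match the very small, small, and small–medium bins used later in this paper, while the remaining bin names and edges are illustrative assumptions rather than values from any specific benchmark.

```python
import math

# sqrt(area) thresholds in pixels. Bins below 16, 32, and 64 px correspond to
# the "very small", "small", and "small-medium" objects discussed in the text;
# the edges above 64 px are illustrative assumptions.
EDGES = [16, 32, 64, 128, 256, 512, 1024]
NAMES = ["very small", "small", "small-medium", "medium",
         "medium-large", "large", "very large", "huge"]

def size_bin(width, height):
    """Map a bounding box to its object-scale bin via sqrt(width * height)."""
    side = math.sqrt(width * height)
    for edge, name in zip(EDGES, NAMES):
        if side < edge:
            return name
    return NAMES[-1]

print(size_bin(10, 12))   # "very small"   (sqrt(120)  ~ 11 px)
print(size_bin(50, 60))   # "small-medium" (sqrt(3000) ~ 55 px)
```

Counting boxes per bin in this way yields the size histogram over which a scale-imbalance metric such as the Gini Coefficient can be computed.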
In this work, we focused on long-tailed metrics and object detection in camera trap datasets. The main contributions of this paper are as follows: (1) We utilized four commonly used metrics to quantitatively analyze the class imbalance in several camera trap datasets, revealing that it is more severe than in the benchmark long-tailed datasets. (2) We also quantitatively analyzed the object/box-level scale imbalance in camera trap datasets, revealing that it, too, is more severe than in the benchmark datasets; moreover, the camera trap datasets contain too few small objects, making object detection more challenging. (3) For the first time on camera trap data, we utilized a simple yet effective module, named BatchFormer (Batch transFormer), to explore the effectiveness of sample relationships. The results demonstrated that exploiting sample relationships can improve object detection performance in long-tailed camera trap datasets with respect to both class and object/box-level scale imbalance.
The structure of this paper is as follows. In Section 2, we present the Materials and Methods used in this study. The camera trap dataset details, the metrics, and the object detection experiment results for multiple camera trap datasets are presented in Section 3. In Section 4, we discuss our findings and look ahead to future work. In Section 5, we present the conclusions.
4. Discussion
The application and development of deep learning technologies have revolutionized the analysis and utilization of camera trap data. However, the imbalanced distribution of data in these datasets often leads to poor deep learning model performance. In this study, we analyzed 12 camera trap datasets collected from various habitats worldwide. These datasets exhibited significant differences in the number of species and the number of camera trap images, but in each the empty class accounted for the largest proportion of images, reaching up to 92.17% in Snapshot Mountain Zebra. Next, we utilized four quantitative metrics to objectively and accurately quantify long-tailedness in camera trap datasets. Based on our results, we recommend the Gini Coefficient as an effective and appropriate measure of imbalance in camera trap datasets. Compared to the balanced CIFAR benchmark and the long-tailed ImageNet-LT and LVIS 1.0 benchmarks, class imbalance in camera trap datasets was prevalent and very severe, with the Gini Coefficient consistently surpassing 0.7 across all 12 datasets. Moreover, the GC of the three object detection datasets was greater than that of COCO and LVIS 1.0, indicating that for various deep learning tasks on camera trap data, such as animal recognition and detection, the long-tailed distribution is a very challenging problem. In addition, the rarity of samples of some tail species in the camera trap datasets was mainly due to the fact that these species are rare or even endangered in the wild. The ongoing global trend of anthropogenic biodiversity loss, which involves extinction or dramatic declines in both species and population sizes, will further exacerbate the class imbalance in camera trap datasets. Therefore, compared with the head species, which may even be over-represented, determining whether deep learning can accurately extract information on tail species is both more difficult and more urgent.
Object detection accuracy varies greatly for objects of different sizes, and accurate detection of small objects remains particularly challenging. Because animals differ in body size and in distance from the camera, the size of animal objects varies greatly; thus, we also need to pay attention to the object/box-level scale imbalance in camera trap datasets. In this study, we calculated the GC of object area to measure the object/box-level scale imbalance in three object detection datasets: the results were all greater than 0.5, demonstrating that camera trap datasets exhibit an object/box-level long-tailed scale distribution as well. As shown in Table 5, camera trap datasets exhibit a positive correlation between the number of samples and object size, which is completely different from the skew toward small objects in COCO and LVIS 1.0. However, we also note that the numbers of very small (0–16 × 16), small (16 × 16–32 × 32), and small–medium (32 × 32–64 × 64) objects are too few. Even worse, camera trap image resolution is generally much higher than in the benchmark datasets, the natural backgrounds of the images are very complex, lighting conditions are variable, and small animals move quickly. These factors cause detection-performance issues for small objects and make object detection on camera trap datasets more challenging.
To exploit the diverse and stable sample relationships, we introduced the simple yet effective BatchFormer module into the DINO model to transfer shared knowledge from head to tail species, thereby enhancing the representation of tail species. In our experiments, the BatchFormer module improved DINO's overall detection performance by 2.3% on SWG, 0.8% on WCS, and 1.2% on CBL. Regarding class imbalance, the BatchFormer module improved the performance of DINO by up to 2.9% on Rare, 2.4% on Common, and 1.7% on Frequent classes. Regarding object/box-level scale imbalance, the BatchFormer module also improved the performance of DINO by up to 3.3% across the eight object-size categories, although the AP on very small, small, and small–medium objects remained very low, demonstrating that it cannot compensate for the scarcity of very small objects. Exploiting sample relationships is thus a simple yet effective way to address long-tailed deep learning problems in camera trap datasets.
Finally, due to the limitations of the experimental environments, this study did not include generalization experiments to test performance on new camera trap data. In practice, images from camera traps situated at new locations not included in the training sets have different backgrounds (grasslands, forests, etc.), different prominent objects (tree stumps, rocks, etc.), and different environmental conditions (day, night, season, etc.), and should be considered different domains. The generalization performance of deep learning models declines at new locations [14]. In future work, we will leverage few-shot and zero-shot learning approaches to improve generalization. Beyond the imbalance and generalization problems, the classification and detection of nocturnal animals, such as rodents, are considerably more challenging due to issues such as low light, fast movement, and small body size. Data augmentation methods such as deblurring, colorization, and low-light enhancement can be implemented to increase the quality of night-time images, further improving classification and detection accuracy [56,57,58].
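As a concrete illustration of the simplest of these augmentations, the sketch below brightens dark frames with plain gamma correction; it is a minimal stand-in for the learned deblurring, colorization, and low-light enhancement methods cited above, which are far more capable.

```python
import numpy as np

def gamma_enhance(img, gamma=0.5):
    """Brighten a low-light image by gamma correction: with gamma < 1,
    dark pixel values are lifted proportionally more than bright ones."""
    norm = img.astype(np.float32) / 255.0
    return (np.power(norm, gamma) * 255.0).astype(np.uint8)

night = np.full((4, 4), 25, dtype=np.uint8)   # a uniformly dark toy "image"
print(gamma_enhance(night)[0, 0])             # 79: noticeably brighter
```

Because the mapping is monotonic and leaves 0 and 255 fixed, it lifts shadow detail without clipping highlights, which is why gamma correction is a common baseline before applying heavier night-time enhancement models.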