1. Introduction
Wildlife plays a crucial role in maintaining ecological balance and stability [1]. However, the continuous expansion of human activities in recent years has led to significant encroachment and destruction of animal habitats. This has pushed many species to the brink of extinction or into a critically endangered state, posing a serious threat to biodiversity both in China and globally [2]. It is imperative to implement effective and feasible conservation measures to address this crisis. Among them, precise monitoring and identification of animal species and individuals is an extremely effective means of understanding the characteristics of animal populations [3].
Recent studies have shown that individual identification is becoming more important, since it helps us to understand the density, space use, and behaviors of different animal species [4] (Roy et al., 2023 [5]; Akçay et al., 2020 [6]; Dave et al., 2023 [7]; Schütz et al., 2021 [8]; Xu et al., 2024 [9]). Individual identification is a necessary basis not only for unbiased data collection (such as recording individual behavior data), but also for recording individual changes in the variables of concern (social relations, special behaviors, etc.) [10]. Therefore, the identification of individual wild animals is conducive to a deeper understanding of the ecological habits, population distribution, breeding status, and habitat status of animal populations. There are various methods for identifying individual animals, including vocalization [11,12], fur color [13], gait [14,15], and DNA from body tissues [16,17]. Among these, identification based on appearance, such as fur, is very common and effective [18]. For animals with distinct fur stripes, such as Amur tigers [19] and zebras [20,21,22,23], using the differences in individual stripes to identify different animals has become an active research topic. Amur tigers, for example, have black stripes on their bodies that are highly distinctive and invariant, so individuals can be identified by comparing the stripes in different areas of the body. At present, this method is one of the most important means of distinguishing individual tigers in the northeast region. Additionally, akin to human faces, each Amur tiger has unique facial features, which can also be used for individual identification.
In recent years, deep learning technologies such as convolutional neural networks (CNNs) have become widely used for the intelligent identification of animal species and individuals, owing to their strong feature extraction capabilities and their high recognition efficiency and accuracy. In species identification, for example, Chen et al. [24] used a CNN for the first time in 2014 to classify 20 species in 20,000 pictures. In 2016, Gomez et al. [25] trained a network on the African Serengeti wildlife dataset SSe. In 2017, Cheema et al. [26] used Faster R-CNN to detect species with different patterns (such as tigers, zebras, and other animals) in pictures. In 2018, Norouzzadeh et al. [27] realized the automatic classification, quantity detection and behavior description of 48 species, with an accuracy rate of 92%. In 2019, Gong Yinan et al. [28] carried out automatic species identification from infrared camera images of Amur tigers and leopards taken under natural conditions based on the YOLOv3 model, and the average identification accuracy for the eight species ranged from 84.9% to 96.0%. CNNs have also been widely applied to individual animal identification. For example, in 2018, Fan et al. [29] applied the improved convolutional neural network model GKP-Net to analyze the facial features of 48 golden monkeys, with an accuracy of up to 93.69%. In 2018, Zhao et al. [30] built an individual recognition model for three leopards based on the Cifar-10 model, with an accuracy rate of up to 99.3%. In 2020, He et al. [3] extracted panda individual facial features with YOLOv3 and a CNN, with an accuracy of up to 98%.
Additionally, some research focuses specifically on individual Amur tiger recognition with CNNs. For example, in 2020, Shi et al. [19] built a nine-layer deep convolutional neural network framework and identified 40 individual Amur tigers based on the stripe features of the left and right sides of each animal, with an accuracy rate of 93.5%. In 2021, Shi et al. [31] applied several classical convolutional neural networks to recognize the different body parts of 38 Amur tiger individuals, with an accuracy rate of 97.01%. However, these studies are all based on independently collected Amur tiger stripe data. The sample size is very limited due to the rarity of the Amur tiger, and the datasets have not been publicly verified, making it difficult to objectively evaluate the effectiveness and performance of the algorithms. Furthermore, constrained by factors such as image resolution, lighting, clarity, obstruction levels, and the posture and behavior of the Amur tigers, these studies did not present Amur tiger photos taken in the wild. Meanwhile, the network models and methods used in these studies are relatively primitive. For example, [19] mainly employed a simple layer-stacking approach, while [31] directly used classic convolutional neural networks, such as LeNet, AlexNet, ZFNet, VGG16, and ResNet34, rather than more advanced networks. These networks need further improvement in aspects such as feature extraction capability, overfitting suppression, and multi-scale image recognition.
In view of the above reasons, this paper conducts individual recognition based on the different physiological characteristics of Amur tigers and builds an individual recognition network based on the improved InceptionResNetV2 model. The InceptionResNet model combines residual blocks with an Inception module to decompose large-scale convolution kernels into multiple small-scale convolution kernels for dimensionality reduction, which can achieve a balance between network width and depth while retaining rich feature expression capability. At the same time, the residual connection is used to deepen the network and accelerate the convergence speed. However, if the InceptionResNet model is directly applied to the Amur tiger individual recognition problem, it may lead to insufficient expression ability and overfitting. Therefore, this paper introduces a dropout layer and a dual-attention mechanism module to improve model recognition accuracy and loads pre-trained models with transfer learning to further improve model training efficiency. It provides an important reference value for the accurate identification of wildlife images and further improves the intelligence level of wildlife monitoring.
The rest of this paper is organized as follows.
Section 2 detects and segments the facial, left stripe, and right stripe areas from Amur tiger images with YOLOv5.
Section 3 introduces the improved InceptionResNetV2 model.
Section 4 verifies the validity of our proposed method by experiments.
Section 5 summarizes this paper.
4. Experimental Results and Analysis
4.1. Data Preprocessing
This paper performs individual recognition based on the image segmentation results from Section 2, with three categories: face, left-side stripes, and right-side stripes. The number of images and the number of Amur tigers corresponding to each category are shown in Table 2.
Since the fixed input size of the InceptionResNetV2 model is 299 × 299 pixels, and most images in the dataset used in this paper have different aspect ratios, the images must be made square and then scaled. To avoid distortion and retain more information, this paper pads each image to a square whose side equals its long side, as shown in Figure 8.
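The geometry of long-side padding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and splitting the padding evenly on both sides of the short dimension is an assumption, since the text does not say where the padding is placed.

```python
TARGET_SIZE = 299  # InceptionResNetV2 input side length, as stated in the text


def square_padding(width, height):
    """Return (side, left, top, right, bottom) padding that makes an image square.

    Assumption: padding is split evenly on both sides of the short dimension.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    side = max(width, height)  # the long side sets the size of the square
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return side, left, top, pad_w - left, pad_h - top


def resize_factor(width, height, target=TARGET_SIZE):
    """Uniform scale factor that maps the padded square to target x target."""
    return target / max(width, height)


# Hypothetical 400 x 200 image: 100 rows of padding above and below, then scale.
print(square_padding(400, 200))  # (400, 0, 100, 0, 100)
print(resize_factor(400, 200))   # 0.7475
```

Because both dimensions are scaled by the same factor after padding, the stripe pattern keeps its original aspect ratio.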
To avoid overfitting due to the small dataset size, this paper also introduces data augmentation techniques to expand the dataset, including adjusting brightness, contrast, hue, and saturation, and applying blur to images. The augmentation effects are shown in Figure 9.
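As a rough illustration of the photometric augmentations listed above, the following sketch adjusts brightness, contrast, hue, and saturation for a single RGB pixel using only the Python standard library. The function name, the parameter names, and the choice of mid-grey as the contrast pivot are assumptions made for this example; the paper does not specify its implementation or parameter ranges, and blur is omitted because it operates on neighbourhoods rather than single pixels.

```python
import colorsys


def _clamp(x):
    """Keep a channel value inside the valid [0, 1] range."""
    return min(1.0, max(0.0, x))


def augment_pixel(rgb, brightness=0.0, contrast=1.0, hue_shift=0.0, saturation=1.0):
    """Apply photometric jitter to one RGB pixel with channels in [0, 1]."""
    # Contrast stretches values around mid-grey; brightness adds an offset.
    r, g, b = (_clamp((c - 0.5) * contrast + 0.5 + brightness) for c in rgb)
    # Hue and saturation are easiest to change in HSV space.
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + hue_shift) % 1.0
    s = _clamp(s * saturation)
    return colorsys.hsv_to_rgb(h, s, v)


# Hypothetical settings: a slightly brighter, less saturated copy of one pixel.
print(augment_pixel((0.8, 0.4, 0.2), brightness=0.05, saturation=0.7))
```

Applying the same randomly drawn settings to every pixel of an image yields one augmented copy of that image.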
After augmenting the dataset, it is randomly divided into training and test sets in a 4:1 ratio.
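A random 4:1 split of this kind could look like the sketch below. The function name and the fixed seed are assumptions added for reproducibility of the example and are not taken from the paper.

```python
import random


def split_train_test(items, train_parts=4, test_parts=1, seed=0):
    """Shuffle the items and split them in a train_parts:test_parts ratio."""
    items = list(items)
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(items)
    n_train = len(items) * train_parts // (train_parts + test_parts)
    return items[:n_train], items[n_train:]


# Hypothetical dataset of 100 image identifiers.
train, test = split_train_test(range(100))
print(len(train), len(test))  # 80 20
```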
4.2. Pre-Trained Model Loading
Since the sample size of the Amur tiger dataset used in this paper is relatively small, to improve the classification accuracy of the model and effectively avoid overfitting, this paper further introduces transfer learning technology, using the InceptionResNetV2 model parameters trained on the ImageNet dataset. Given the distinct features of the Amur tiger dataset compared to ImageNet, we load the InceptionResNetV2 pre-trained model from tensorflow.keras.applications and include its full architecture in our training to leverage transfer learning effectively.
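A minimal sketch of loading the pre-trained network from tensorflow.keras.applications is given below. It rests on two assumptions that the text does not confirm: the ImageNet classifier head is dropped (`include_top=False`) so that a tiger-specific classifier can be attached, and "including the full architecture in training" is read as leaving every layer trainable. The function name is hypothetical.

```python
INPUT_SHAPE = (299, 299, 3)  # input size stated in Section 4.1, RGB images


def build_pretrained_backbone(weights="imagenet", trainable=True):
    """Load InceptionResNetV2 without its ImageNet classifier head."""
    # Imported inside the function so the module can be read and inspected
    # on a machine without TensorFlow installed.
    from tensorflow.keras.applications import InceptionResNetV2

    backbone = InceptionResNetV2(
        include_top=False,        # assumption: the 1000-class head is replaced
        weights=weights,          # "imagenet" downloads the pre-trained weights
        input_shape=INPUT_SHAPE,
    )
    backbone.trainable = trainable  # assumption: fine-tune all layers
    return backbone
```

Calling `build_pretrained_backbone()` requires TensorFlow and, on first use, network access to fetch the ImageNet weights.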
4.3. Model Training
4.3.1. Experimental Environment Setup and Training Parameters
The software and hardware environment configurations employed in the experiments of this paper are shown in Table 3.
The training parameters are set as follows:
- (1) Number of training epochs: epoch = 20;
- (2) Batch size: batch_size = 16;
- (3) Initial learning rate: lr0 = 0.001;
- (4) Optimizer: optimizer = Adam;
- (5) Loss function: loss = Cross-entropy.
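For illustration, these settings can be gathered into a small configuration object, which also makes it easy to see how many parameter updates they imply. The class and method names are hypothetical, and mapping "Cross-entropy" to the Keras identifier `categorical_crossentropy` (one-hot labels) is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training parameters listed in Section 4.3.1."""
    epochs: int = 20
    batch_size: int = 16
    initial_lr: float = 0.001
    optimizer: str = "adam"
    loss: str = "categorical_crossentropy"  # assumption: one-hot encoded labels

    def steps_per_epoch(self, num_train_images):
        """Number of mini-batches needed to see every training image once."""
        return math.ceil(num_train_images / self.batch_size)

    def total_updates(self, num_train_images):
        """Parameter updates over the whole run."""
        return self.epochs * self.steps_per_epoch(num_train_images)


cfg = TrainConfig()
# Hypothetical training set of 1000 images.
print(cfg.steps_per_epoch(1000), cfg.total_updates(1000))  # 63 1260
```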
4.3.2. Model Training Process
The training process of the improved InceptionResNetV2 model is shown in Figure 10. The preprocessed Amur tiger images are fed to the model as input, and the series of convolutional and pooling layers stacked in the Stem structure produces shallow feature maps of the images. The InceptionResNet modules extract feature maps of different sizes through parallel convolutions and merge them to further extract deep features of the images. The CBAM module focuses on important features to obtain more refined feature maps. The extracted features are then passed to the global average pooling layer, which computes the average of all values in each channel. Subsequently, a certain proportion of neurons are randomly dropped, and the Softmax classification layer outputs the classification results. After the forward pass, the loss is calculated and the parameters are updated by the backpropagation algorithm. These steps are repeated until the loss falls below the pre-set threshold and no longer changes, which marks the end of training.
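The final stages of this forward pass (global average pooling, dropout, the Softmax classifier, and the cross-entropy loss) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the feature values, weights, and class count are hypothetical toy numbers, and the use of inverted dropout (rescaling the kept units) follows common framework practice rather than anything stated in the paper.

```python
import math
import random


def global_average_pool(feature_maps):
    """Average every channel (a 2-D list) down to a single value."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(values, weights, biases):
    """Fully connected layer: one weight row and one bias per output class."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[true_class])


# Hypothetical 2-channel, 2 x 2 feature map and a 2-class head.
features = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
pooled = global_average_pool(features)  # [4.0, 2.0]
dropped = dropout(pooled, rate=0.4, rng=random.Random(0))
probs = softmax(dense(dropped, [[0.1, 0.2], [0.3, -0.1]], [0.0, 0.0]))
loss = cross_entropy(probs, true_class=0)
```

During evaluation, `training=False` disables dropout so that every pooled feature reaches the classifier unchanged.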
4.4. Comparison of Results
4.4.1. Comparison and Analysis of Different Dropout Rates
In this paper, the dropout layer is added after the global average pooling layer of the original InceptionResNetV2 model. To illustrate the impact of different dropout rates on the training results and select the optimal dropout rate, we trained the model with dropout rates of 0, 0.1, 0.2, 0.3, 0.4, and 0.5, respectively. The experimental results show that the optimal dropout rate for the facial, left stripe, and right stripe data is consistent. Taking the facial data as an example, the training results are shown in Figure 11.
Each training epoch yields a Top1 accuracy, and models are usually compared by their maximum Top1 accuracy, that is, the highest Top1 accuracy over the 20 training epochs. However, the maximum can be affected by chance deviations and often differs only slightly between settings, so this paper additionally defines the fifth-highest Top1 accuracy, that is, the fifth-largest Top1 accuracy over the 20 training epochs, and compares both values. The maximum and fifth-highest Top1 accuracies for each dropout rate are shown in Table 4 below.
From the table, the fifth-highest Top1 accuracy is highest when the dropout rate is 0.4, so this paper uses a dropout rate of 0.4.
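The two summary metrics can be computed from a per-epoch accuracy history as sketched below. The function name and the sample accuracies are hypothetical, and the sketch assumes that tied accuracies from different epochs are counted separately when ranking.

```python
def max_and_fifth_top1(accuracies, k=5):
    """Return (maximum, k-th highest) Top1 accuracy over all epochs."""
    if len(accuracies) < k:
        raise ValueError(f"need at least {k} epochs, got {len(accuracies)}")
    ranked = sorted(accuracies, reverse=True)  # best epoch first
    return ranked[0], ranked[k - 1]


# Hypothetical per-epoch Top1 accuracies (not results from the paper).
history = [0.90, 0.95, 0.93, 0.97, 0.91, 0.96, 0.94]
print(max_and_fifth_top1(history))  # (0.97, 0.93)
```

The fifth-highest value is less sensitive to a single unusually good epoch than the maximum, which is the motivation given above for reporting both.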
4.4.2. Comparison and Analysis of Training Effectiveness with Different Attention Mechanisms
We added an attention mechanism after the last Inception-ResNet-C module in the original InceptionResNetV2 model. To verify the impact of different attention mechanisms on model training effectiveness, we experimented with the SE channel attention mechanism, the ECA [38] module, and the CBAM dual attention mechanism separately. Similarly, the experimental results for the facial, left stripe, and right stripe regions are consistent. Taking the face data as an example, the training results are shown in Figure 12.
The specific accuracy results are shown in Table 5. As indicated in the table, when the InceptionResNetV2 model is augmented with the CBAM module, it achieves the highest recognition accuracy. Moreover, it exhibits the fastest convergence speed among all methods incorporating attention mechanisms, along with higher stability.
4.4.3. Comparison with Other Models
After adding the dropout layer and CBAM module to the original InceptionResNetV2 model, the network structure is shown in Table 6 below.
To evaluate the performance of the Amur tiger individual recognition method proposed in this paper, we trained several different network models, including VGG19, ResNet50, ResNet152, InceptionV3, InceptionResNetV2, and the improved InceptionResNetV2, with the same dataset and training parameters, and tested all models on the same testing dataset. Taking the facial data as an example, the accuracy and loss comparison results of each model on the testing dataset are illustrated in Figure 13 and Figure 14, respectively.
From Figure 13, it can be seen that, in terms of accuracy variation, the proposed improved model consistently outperforms the first four models from the initial stage to the final stage. Although it is sometimes on par with the InceptionResNetV2 model, its accuracy curve is relatively smoother.
From Figure 14, it can be observed that, in terms of the change in loss values, the model proposed in this paper converges faster and exhibits higher stability. As indicated by the specific accuracy values in Table 7, the improved InceptionResNetV2 model achieves the highest recognition accuracy, with an improvement of 1.38% compared to the original InceptionResNetV2 model. These comparative experiments validate the effectiveness of the proposed model.
4.4.4. Comparison and Analysis of Different Physiological Characteristic Parts
The improved model is applied to different physiological features of Amur tigers, and its test accuracy is compared with that of the original model, as shown in Table 8 below.
As shown in the table, the recognition accuracy of the left-side stripes is the highest, and the accuracy for the right-side stripes is slightly higher than that of the face. Since most Amur tigers only show their side faces, the recognition accuracy for the face would be expected to be lower than that of the left-side and right-side stripes. However, almost every image contains a facial part, and the facial dataset is somewhat larger than the right-side stripe dataset. As a result of these two opposing factors, the difference in accuracy between the face and the right-side stripes is not significant.
It is important to note that in the actual process of individual recognition conducted in this study, if a picture contains multiple physiological feature parts of an Amur tiger, the part with the highest recognition accuracy is taken as the final identification result. For example, if a picture shows both the face and the right stripe of an Amur tiger, based on experimental results, the order of recognition accuracy from high to low for different parts is left stripe > right stripe > face. Therefore, the result of recognizing the right stripe will be considered as the final identification conclusion.
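This selection rule can be expressed as a simple priority lookup, sketched below. The part labels, the function name, and the tiger identifiers are hypothetical; only the ordering left stripe > right stripe > face comes from the text.

```python
# Parts ordered from highest to lowest recognition accuracy, as reported above.
PART_PRIORITY = ("left_stripe", "right_stripe", "face")


def final_identity(predictions):
    """Pick the prediction from the most reliable body part that is present.

    `predictions` maps a detected part name to the individual predicted for it.
    """
    for part in PART_PRIORITY:
        if part in predictions:
            return part, predictions[part]
    raise ValueError("no recognised body part in this image")


# Hypothetical image showing both the face and the right-side stripes.
print(final_identity({"face": "tiger_07", "right_stripe": "tiger_12"}))
# ('right_stripe', 'tiger_12')
```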
4.5. Discussion
The individual identification of Amur tigers presents a unique challenge due to the intricate details of facial features and stripes, which require feature extraction at varying granular levels. The InceptionResNetV2 network is well-suited for this task due to its complex network structure that offers robust representation capabilities and rich spatial features. Its design allows for effective analysis of stripe patterns while reducing the number of parameters typically associated with such detailed features, making it particularly adept at addressing these issues.
In addition to the Amur tiger, many other animals such as golden snub-nosed monkeys, Amur leopards, giant pandas, zebras, and lions also possess unique body or facial stripes. When identifying the species or individuals of these animals, fine-grained feature recognition is similarly required. Therefore, the method proposed in this paper can be applied to similar recognition scenarios, making it suitable for identifying a wider range of animal species or individuals. This demonstrates that our method has broad prospects for promotion and application.
To adapt and refine our method for broader use in individual identification across various species, we can employ similar strategies tailored to different animal characteristics. For example, to apply our approach to the identification of zebras, we could start by using the YOLOv5 model to detect and segment specific stripe patterns on different body parts, and then carry out individual identification through the InceptionResNetV2 network. Finally, the corresponding model could be trained and validated on a specialized zebra dataset, ultimately resulting in a high-precision model capable of accurately identifying different body part features of individual zebras.
Additionally, the improved method proposed in this paper can be applied to other network models such as EfficientNet to obtain models with higher recognition accuracy. Although the average recognition accuracy of the proposed method in this paper reaches 95.36%, there are still a few images that were not accurately recognized. Based on the characteristics of the dataset and the recognition process, the reasons for these misidentifications include the high similarity of certain body parts, interference from the angle and lighting conditions in which the photos were taken, and the relatively limited number of Amur tigers in the dataset, which leaves room for further improvement in the model’s accuracy. Therefore, in future research, we could consider incorporating techniques such as multi-scale feature fusion and multi-head attention mechanisms to enhance the recognition accuracy and efficiency of the model, providing better solutions for individual identification across various species.
Despite the significant advantages achieved by the proposed method in terms of recognition accuracy, it also has some limitations. For instance, the dataset used for training and validating the recognition model is still quite limited. To further improve the model’s accuracy and generalization, it is necessary to collect more diverse and comprehensive datasets for training and use more image augmentation techniques to expand the dataset, including geometric distortions. Additionally, the method was only tested on the Amur tiger dataset. While the results are promising, its applicability and generalizability to other animal species still need further verification. Enhancing the model’s generalization capabilities and ensuring its robustness across different contexts are critical areas for future research.
5. Conclusions
As the Amur tiger is a flagship species of precious and endangered wildlife, the accurate identification of individual Amur tigers holds significant symbolic importance for biodiversity conservation. This paper focuses on the methods of object detection and individual recognition of Amur tigers based on convolutional neural networks. Initially, the YOLOv5 model is employed for object detection on the Amur tiger dataset, obtaining the facial, left-side stripe, and right-side stripe data of individual tigers, followed by image segmentation. To further improve the individual recognition performance of Amur tigers, this paper proposes an improved model based on InceptionResNetV2, which incorporates dropout regularization to prevent overfitting and introduces a dual-attention mechanism to strengthen the feature representation capabilities at different levels. Then, transfer learning is performed by employing a pre-trained model on the ImageNet dataset to improve model training efficiency. Finally, comparative experiments are conducted on the open large-scale wildlife dataset ATRW, comparing the improved InceptionResNetV2 model with other classic image recognition models. The experimental results demonstrate the superior performance of the proposed model, validating its effectiveness and practicality.
Our work provides a meaningful exploration for the accurate identification of individual wild animals. In our future work, we aim to further enhance aspects such as dataset richness, model performance, and system functionality.