1. Introduction
The wide distribution, low population density, unpredictable behavior, and sensitivity to disturbance of some wildlife species pose significant challenges for monitoring. Traditional wildlife survey techniques mainly include manual surveys, line transect sampling, collar tracking, and acoustic tracking with sound recording instruments [1,2]. However, each of these methods has disadvantages, and researchers continue to work on improving them. The Amur tiger, also known as the Siberian tiger, is a subspecies of tiger. It is mainly distributed in northeastern Asia and is listed as endangered on the Red List of Threatened Species of the World Conservation Union (IUCN). Only slightly more than 500 Amur tigers remain in the world, so strengthening their protection is crucial [3]. Moreover, the survival and reproduction of species populations are closely related to regional biodiversity and the functional integrity of ecosystems [4]. Therefore, reassessing the Amur tiger and its prey resources in natural environments such as nature reserves and national parks supports statistical analysis of the state of the Amur tiger population and provides reference data for subsequent protection work [5,6]. At present, the most commonly used method for the re-identification (re-ID) of wild animals is manual discrimination. After specialized training, wildlife protection professionals must screen and distinguish large volumes of image data based on the fur patterns on the abdomen, head, neck, and other body parts of the Amur tiger [7,8]. To reduce errors, multiple people must identify the same images at the same time and verify one another's results, which entails a heavy workload, high cost, and low efficiency.
With the continuous development of machine learning and deep learning, algorithms and models such as linear clustering [9], classification [10,11], detection [12,13], and generative adversarial networks [14] are gradually being applied to the intelligent monitoring and protection of wildlife. Research on intelligent wildlife recognition mainly focuses on re-ID, species classification, population counting, and attribute recognition [15]. In wildlife re-ID, computer vision techniques can greatly improve recognition efficiency and accuracy, and research in this area has grown steadily. The main methods currently in use are clustering algorithms based on image hotspots [9] and convolutional neural network models based on VGG [16], AlexNet [17], and ResNet [18]. These methods have been improved through optimized feature extraction, feature fusion, and the incorporation of prior knowledge of pose. Zheng et al. proposed a Transformer network with a cross-attention block (CAB) and local awareness (CATLA Transformer) [19], which captures global information about an animal's body surface and local differences in fur, color, texture, or face, and fuses the global and local features. Zhang et al. proposed using texture features as global and local features for re-ID, together with a pyramid feature fusion model that extracts features from both local and global perspectives to match entities effectively [20]. Li et al. proposed an Amur tiger re-ID method that introduces precise pose parts into deep neural networks to handle the large pose variation of tigers [3]. Liu et al. proposed a Partial Pose Guided Network (PPGNet), which uses local image features based on pose data to guide the network in extracting features from the original image, and applied it to an Amur tiger re-ID system based on automatic detection and pose estimation [21]. He et al. proposed a Multi-pose Feature Fusion Network (MPFNet) with three pose modules: standing, sitting, and lying. In each module, two parallel branches extract global and local features, which are then fused [22].
Person re-identification, a task similar to Amur tiger re-ID, has also seen advanced studies. Sun et al. proposed a Part-based Convolutional Baseline (PCB) framework with a uniform partitioning strategy to extract part-level features effectively, together with Refined Part Pooling (RPP), which assigns elements to the parts they are closest to in order to improve within-part consistency [23]. Sun et al. also considered the problem of partial re-ID and proposed a Visibility-aware Part Model (VPM). Through self-supervised learning, the model perceives which regions are visible, extracts region-level features, and compares two images within their shared regions to suppress noise from unshared regions. It thus extracts fine-grained image features better and reduces misalignment [24]. Liu et al. proposed a Multi-scale Feature Enhancement (MFE) re-ID model and a Feature Preserving Generative Adversarial Network (FPGAN). In MFE, semantic feature maps of the person's body are segmented, and multi-scale feature extraction and enhancement are then performed on the body region. In FPGAN, the source domain is transferred to the target domain in an unsupervised manner while preserving the integrity of person information as far as possible [25].
Current research suffers from issues such as the need for prior knowledge and the complexity of training large models. Although some models achieve excellent average accuracy and other metrics, and their improvements have been verified through ablation experiments, heavy reliance on prior knowledge leads to poor transferability: staff with wildlife expertise must label and process large amounts of data before the model can be retrained, which limits feasibility in practical applications. Networks with four, six, or more branches, or networks that require preprocessing with an instance segmentation model before data are fed into the re-ID model, are overly complex, large, and difficult to train. Therefore, this paper proposes a serial multi-scale feature fusion and enhancement re-ID network for Amur tiger images, which combines global inverted pyramid multi-scale feature fusion with local dual-domain attention feature enhancement. Considering existing Amur tiger re-ID methods and the fine-grained nature of the task, we introduce the feature fusion idea of the Path Aggregation Network (PANet) [26] and propose a bottom-up unidirectional feature fusion method with an inverted pyramid structure. This helps to better integrate high-level features with large receptive fields and rich semantic information while preserving multi-scale features. We also propose a local dual-domain attention feature enhancement method, serially connected to the global branch, to enhance local feature extraction and fusion. Our goal is not to surpass state-of-the-art (SOTA) re-ID models, but to propose an end-to-end model that does not depend on animal pose priors or other additional attribute information and that offers good transferability and re-ID performance. Our core contributions are as follows:
We propose a lightweight, efficient, end-to-end network for Amur tiger re-ID that does not require prior knowledge such as posture. It can be quickly and conveniently adapted to the re-ID of other large mammals. The specific network innovations and design are as follows.
To better extract and integrate the global information in the high-level and low-level layers, we propose a global inverted pyramid multi-scale feature fusion method. We introduce the ideas of the Feature Pyramid Network (FPN) [27] and PANet into the global branch of the model for wildlife re-ID, and improve on the top-down connection of traditional feature pyramid models, which greatly compresses key deep semantic information [28].
To deepen feature extraction for the various body parts of the Amur tiger and capture fine-grained features such as fur texture, we introduce a serial local branch and design an attention module and an output feature fusion method for it.
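The bottom-up, unidirectional fusion idea named in the contributions above can be illustrated with a minimal sketch. It is not the network proposed in this paper: it assumes single-channel feature maps stored as nested lists, a fixed 2x2 average pooling in place of learned strided convolutions, and element-wise addition as the fusion operator. The names `avg_pool_2x2`, `add_maps`, and `bottom_up_fuse` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def avg_pool_2x2(x: Matrix) -> Matrix:
    """Halve height and width by averaging non-overlapping 2x2 windows."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]


def add_maps(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two feature maps of identical shape."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def bottom_up_fuse(levels: List[Matrix]) -> List[Matrix]:
    """Fuse feature maps from shallow (high resolution) to deep (low resolution).

    Assumption: each level has exactly half the height and width of the
    previous one. Information flows in one direction only: every fused map
    is downsampled and added to the next, deeper level.
    """
    fused = [levels[0]]
    for deeper in levels[1:]:
        fused.append(add_maps(avg_pool_2x2(fused[-1]), deeper))
    return fused


if __name__ == "__main__":
    shallow = [[1.0] * 4 for _ in range(4)]   # 4x4 map, fine detail
    middle = [[2.0] * 2 for _ in range(2)]    # 2x2 map
    deep = [[3.0]]                            # 1x1 map, coarse semantics
    for level in bottom_up_fuse([shallow, middle, deep]):
        print(level)
```

In a real model, the pooling step would typically be a learned strided convolution and the channel dimensions would need to be aligned before fusion; the sketch only shows the direction of information flow.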
4. Discussion
In photos captured by camera traps, the Amur tiger often occupies a relatively large portion of the frame against a relatively complex background [34]. This is because shooting is triggered only when a wild animal is relatively close to the infrared camera and the infrared sensor detects its body heat [34,35]. Moreover, in the complex forest environment, the tiger moves along narrow paths and therefore appears as a large target. Therefore, we propose a new serial multi-scale feature fusion and enhancement re-ID network for the Amur tiger, which extracts and learns features of the input image through global and local branches connected in series. We also propose a global inverted pyramid multi-scale feature fusion method and a local dual-domain attention feature enhancement method to learn Amur tiger images at multiple scales, making the network better suited to this re-ID task. For validation, we used the Amur tiger re-ID dataset of ATRW. The experimental results show that the proposed model performs well without additional prior knowledge or complex labeling, and that both mAP and hit rate improve. Beyond the Amur tiger, the proposed network can be applied to other large quadrupeds through retraining. Its structure can be adjusted to the specific species and task without introducing other prior knowledge, which reduces the cost of early labeling and gives the network a degree of generality and transferability.
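For readers less familiar with the two metrics mentioned above, the following sketch shows how mAP and a rank-k hit rate are commonly computed for re-ID. It assumes that "hit rate" means the fraction of queries with at least one correct identity among the top k gallery results, and that each query already comes with a gallery ranking sorted from most to least similar. The function names and the toy identities are hypothetical and are not taken from the ATRW evaluation code.

```python
from typing import List, Sequence, Tuple


def average_precision(ranked_ids: Sequence[str], query_id: str) -> float:
    """Average precision of one ranked gallery list for one query identity."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0


def evaluate(results: List[Tuple[str, List[str]]], k: int = 1) -> Tuple[float, float]:
    """Return (mAP, rank-k hit rate) over all queries.

    Each item in `results` is (query identity, ranked gallery identities).
    Assumption: the query image itself has been removed from its gallery.
    """
    if not results:
        raise ValueError("at least one query is required")
    aps = [average_precision(ranked, qid) for qid, ranked in results]
    hits = [qid in ranked[:k] for qid, ranked in results]
    return sum(aps) / len(aps), sum(hits) / len(hits)


if __name__ == "__main__":
    toy_results = [
        ("tiger_A", ["tiger_A", "tiger_B", "tiger_A"]),
        ("tiger_B", ["tiger_A", "tiger_B", "tiger_C"]),
    ]
    mean_ap, rank1 = evaluate(toy_results, k=1)
    print(f"mAP={mean_ap:.4f} rank-1={rank1:.2f}")
```

In the toy data, the first query has correct matches at ranks 1 and 3 and the second at rank 2, so the two average precisions are 5/6 and 1/2, and only the first query counts as a rank-1 hit.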
In summary, because our model partitions the extracted features into vertical strips along the horizontal direction, it is well suited to large quadruped mammals that are identified mainly by body surface patterns, for example in snow leopard re-ID and leopard species classification. However, we have not yet validated or fine-tuned the model on datasets of other large quadrupeds. For upright animals such as monkeys, performance may be poor, because partitioning along the horizontal direction may split key features that need to remain intact, so that important information cannot be learned. At present, no suitable dataset is available for comparative experiments. This is a limitation of the present paper, and further in-depth research is needed. In the future, we will work to create datasets and complete research and comparative experiments on model transfer. In addition, this work was conducted on a public dataset in which each entity has an average of 14.5 and a minimum of 8 training images before data augmentation. In practice, however, data collected in the wild or from surveillance videos may contain small and unevenly distributed samples, which is another issue we need to address in the future.
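The partitioning limitation discussed above can be made concrete with a small sketch. It assumes that the local branch splits a feature map into equal strips along its width and average-pools each strip into one local descriptor. The number of parts, the single-channel nested-list representation, and the name `pool_width_parts` are illustrative assumptions rather than the exact design of the proposed network.

```python
from typing import List

Matrix = List[List[float]]


def pool_width_parts(fmap: Matrix, parts: int) -> List[float]:
    """Split a feature map into `parts` equal strips along its width and
    return the mean activation of each strip, ordered left to right."""
    height, width = len(fmap), len(fmap[0])
    if parts <= 0 or width % parts != 0:
        raise ValueError("width must be divisible by a positive number of parts")
    step = width // parts
    pooled = []
    for p in range(parts):
        values = [fmap[i][j]
                  for i in range(height)
                  for j in range(p * step, (p + 1) * step)]
        pooled.append(sum(values) / len(values))
    return pooled


if __name__ == "__main__":
    # Toy map of a side-on quadruped: the three strips roughly cover
    # the head, torso, and hindquarters.
    side_view = [
        [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
        [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    ]
    print(pool_width_parts(side_view, parts=3))
```

For a side-on quadruped, each strip tends to cover a coherent body region. For an upright animal, the same cuts would run through the body lengthwise, which is one way to see why the key features may be split.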