1. Introduction
Procapra przewalskii is an endangered ungulate endemic to the Qinghai-Tibet Plateau. Its type specimen was collected by Nikolai M. Przewalski on the Ordos Plateau of Inner Mongolia in 1875 [1]. It is listed as a national key protected wild animal in the List of National Key Protected Wild Animals issued in 1988, and in 2001 it was listed as one of the 15 species in urgent need of rescue in the National Wildlife Conservation and Nature Reserve Construction Project Master Plan. In December 2002, the State Forestry Administration formulated the overall plan for the national Procapra przewalskii protection project. Meanwhile, the species was assessed as critically endangered (CR) on the Red List of the International Union for Conservation of Nature (IUCN) in 1996 and 2003. Owing to the recovery of local populations and the discovery of new populations, the IUCN Red List later downlisted it from CR to endangered (EN), although it is still listed as critically endangered in the assessment results of the Red List of Biodiversity in China [2].
The perennial cold and oxygen deficiency of the plateau make continuous manual monitoring of animal behavior particularly difficult.
As a non-contact data acquisition method, unmanned aerial vehicles (UAVs) offer unique advantages in wildlife monitoring. Thanks to significant progress in computer vision, autonomous UAVs equipped with edge inference units have been applied in preliminary real-time monitoring [3,4]. UAVs are constrained by size, weight, and power (SWaP), so a single camera is usually the sensor of choice. However, when single-shot pose estimation is applied frame by frame to video data, it suffers from considerable noise and can confuse visually similar but spatially distinct image features because temporal information is ignored. A temporal filter is therefore an effective way to increase the accuracy of pose estimation. The Kalman filter (KF) [5] is a widely used choice owing to its simplicity and versatility, and the extended KF (EKF) [6] can additionally handle nonlinear measurement and transition models.
However, these measurement and transition models often cannot be specified a priori, which severely restricts the application of KFs. To overcome this limitation, the motion models can instead be learned directly from training data, for example with support vector machines [7] or with long short-term memory (LSTM) networks [8]. Learning the motion models frees the modeler from time-consuming selection and tuning of the KF while simultaneously enriching the underlying models. However, sufficient training data must be available to cover all plausible motion paths of the tracked object when learned motion models are used to enforce temporal consistency in pose estimation.
This work contributes an autonomous UAV target-tracking system based on an improved LSTM-KF model. (1) A three-dimensional (3D) pose-tracking algorithm that integrates a learning-based real-time target detector, stereo reconstruction, and the KF model is implemented on a low-cost UAV embedded system to automatically identify, locate, and track Procapra przewalskii. (2) In the LSTM-KF model, the output-gate computation Ot of the three standard LSTM networks is modified by connecting it to the input gate, which mitigates overfitting when training on the large Procapra przewalskii video-sequence datasets. (3) During the training iterations of the LSTM-KF model, the Adam optimizer is modified by extending its update rule based on the L2 norm to one based on the Lp norm. The improved model accurately tracks the behavioral poses of Procapra przewalskii, outperforms the original LSTM-KF model, and reduces the modeler's a priori burden of specifying motion and noise models.
This paper is structured as follows. The “Introduction” section presents the background of Procapra przewalskii, autonomous UAV object-tracking systems, the KF, and the LSTM model. The “Related Works” section reviews CNN-based animal-monitoring algorithms and learning-based KF structures. The “Materials and Methods” section describes the hardware architecture used for UAV object tracking, the study area and dataset, the definition and principles of animal 3D pose estimation, the YOLOX model, and the structure of the improved LSTM-KF model. Next, the experimental results are presented and analyzed: the performance of the YOLOX model at different resolutions is compared, the tracking results and average errors of the improved LSTM-KF model are verified, and the tracking of Procapra przewalskii in both the simulation system and actual flight scenarios is presented. In the “Conclusions” section, the results and discussion are summarized, and future research directions are outlined.
2. Related Works
2.1. Animal-Monitoring Algorithm Based on Convolutional Neural Network (CNN)
CNN-based methods, which achieve accurate image feature extraction and representation, have been increasingly adopted in vision and image processing [9,10,11]. They have been widely applied to animal identification without pre-specifying any features [12] but have seen limited use in monitoring the activities of farmed animals [13]. The two-stream network proposed in [14] is one model for tracking moving objects: its multiple layers and optical-flow stream allow relevant information about an object to be captured in each frame and the object’s movement to be tracked across frames. Shortly after the two-stream network was proposed, long-term recurrent convolutional networks (LRCNs) were developed. LRCNs [15] generally comprise several CNNs, such as Inception modules, ResNet, VGG, and Xception, enabling the extraction of both spatial and temporal features.
LRCN has been the most widely applied tracking model thanks to its well-suited architecture for object tracking. Generic object tracking using regression networks (GOTURN) [16] is another lightweight network that achieves 100 frames per second (fps) in object tracking. GOTURN is first trained on datasets filled with generic objects; the regions of interest (ROIs) in the frames are then fed to the trained network as input, allowing the location of the target to be predicted continuously. The SlowFast network [17], in contrast, tracks objects using two streams of frames, namely a slow pathway and a fast pathway. Many other algorithms can be used for animal monitoring, including, but not limited to, simple online and real-time tracking (SORT), the Hungarian algorithm (HA), the Munkres variant of the Hungarian assignment algorithm (MVHAA), the spatial-aware temporal response filter (STRF), and the channel and spatial reliability discriminative correlation filter (CSRDCF) [18,19,20,21].
These algorithms are combined with object-detection models, such as Faster R-CNN, FCN, SSD, VGG, and YOLO, to detect and track animals in images using their geometric features across consecutive frames [13]. To provide a reliable and efficient way to monitor behavioral activity in cows, [22] presented a tracking system based on ultra-wideband technology. Similarly, [23] employed a computer vision module to detect positive and negative social interactions in the feeding behavior of cows. The system was implemented and tested on seven dairy cows in the feeding area, achieved a localization accuracy with a mean error of 0.39 m and a standard deviation of 0.62 m, and detected social interactions with an accuracy of 93.2%. However, the real-time locating system (RTLS) identifies individual cows poorly when they are in close body contact.
CNN-based algorithms have seen limited application in monitoring farmed animals because they do not pre-specify any features of the target. In this paper, the YOLOX model, a lightweight network with an anchor-free detection head, is therefore adopted; it is equipped with several high-performance detectors and converges faster, so that animals with specified features can be monitored. LRCN is the most widely used tracking model owing to its well-suited structure for target tracking, but it cannot effectively monitor the pose behavior of animals or account for variations in measurement noise. Therefore, an LSTM-KF model with good robustness, reliability, and validity in target tracking and 3D pose estimation is proposed in this paper; it alleviates the effect of measurement noise and corrects the measurement results.
2.2. Learning-Based KF Architecture
In the current work, machine learning and KF models are combined for temporal regularization. Existing approaches can be divided into those that learn static KF parameters and those that actively regress the parameters during filtering. The noise covariance matrices (NCMs) were optimized statically by [24] to replace the manual fine-tuning of noise parameters in robotic navigation; a coordinate ascent algorithm was employed to optimize each element of the NCM. However, this approach is only applicable to noisy but time-invariant systems. In contrast to the dynamic model adopted in this study, changes in measurement noise cannot be accommodated, which lowers the accuracy of the state estimates.
Reference [8] learned the underlying state transition function that governs the dynamics of a hidden process state. However, only the state-space equations of the KF are used, rather than the prediction and update scheme that performs well under linear state transitions and additive Gaussian noise [6]. Neural network models were trained to jointly propagate the state, incorporate measurement updates, and react to control inputs, with the covariances kept constant throughout the estimation. A distinct prediction and update model can estimate the state better than this approach, especially when large-scale training data are unavailable, as demonstrated in the experimental section of the present study.
The dynamic regression of KF parameters was put forward by [7], who adopted support vector regression (SVR) to estimate a linear state transition function together with the predicted NCM. The SVR-based system can handle time-variant systems and outperforms manually tuned KF models in object tracking. In contrast to the model adopted here, however, the measurement noise covariances (MNCs) are kept constant and the transition function is modeled as a matrix multiplication, so only linear motion models can be estimated, whereas the model employed in the present study can estimate nonlinear transition functions based on all previous state observations.
Reference [25] focused on integrating a one-shot estimate as a measurement into a KF model, which required a prediction of the MNC. The integrated model was shown to perform better than two other models. In contrast, the model designed in the present work treats the measurement update as a black-box system and automatically estimates the MNC, so that it can be combined with current one-shot estimators.
Previous work has extensively investigated temporal regularization for pose estimation, with particular attention to approaches that rely on implicit regularization schemes and those that explicitly use a learning-based KF structure to enforce temporal coherence. In contrast to these models, the proposed LSTM-KF mitigates the modeler’s influence in specifying motion and noise models a priori, while allowing rich models that are extremely difficult to write down explicitly to be learned from data. An extensive series of experiments shows that the LSTM-KF outperforms both the stand-alone KF and the stand-alone LSTM in temporal regularization.
3. Materials and Methods
3.1. Overall Technical Architecture
The videos in the research area were acquired using a Prometheus 230 intelligent UAV (Chengdu Bobei Technology Co., Ltd., Chengdu, China), as shown in Figure 1. An Intel RealSense D435i stereo camera was selected to acquire the sensing and depth data because of its light weight, wide field of view (FoV), high depth accuracy, and good stability. Because a powerful graphics processing unit (GPU) was required for the embedded system, the NVIDIA Jetson AGX Xavier platform was selected to run the deep-learning-based algorithms. A Pixhawk 4 (PX4) flight controller was deployed and communicated through the MAVROS package, which was connected to the planner node.
As illustrated in Figure 2, the designed system consists of (1) a perception module, (2) an object-tracking algorithm, (3) a UAV maneuver module, and (4) a ground-station visualization module. In brief, the UAV system first perceives the red-green-blue (RGB) images and the depth data, and the drone recognizes Procapra przewalskii with YOLOX (a deep-learning-based detector). Next, the 2D bounding boxes are fused with the depth measurements to estimate the 3D pose of the Procapra przewalskii. Finally, the improved LSTM-KF model proposed here is integrated to help predict the motion of the Procapra przewalskii, and a visualization user interface is provided.
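The per-frame processing just described can be summarized as a simple loop. In the following sketch, every name (camera, detector, pose_estimator, tracker, controller) is an illustrative placeholder rather than the actual onboard implementation.

```python
def run_surveillance_loop(camera, detector, pose_estimator, tracker, controller):
    """Hypothetical per-frame loop of the perception-to-tracking pipeline;
    every argument is a placeholder component, not a real API."""
    while True:
        rgb, depth = camera.read()                    # (1) perception: RGB + depth frames
        boxes = detector.detect(rgb)                  # (2) YOLOX 2D bounding boxes
        for box in boxes:
            pose_world = pose_estimator(box, depth)   # depth fusion + frame transform (Section 3.3)
            prediction = tracker.update(pose_world)   # (3) improved LSTM-KF estimate (Section 3.4)
            controller.track(prediction)              # (4) steer the UAV toward the target
```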
3.2. Dataset Establishment and Training
Perceiving an object in the 3D world is essential for detecting and tracking it. A deep-learning-based detector is employed here to generate the related 2D information and support the 3D stereo reconstruction. This is challenging because the object may move fast (e.g., running), the training data are limited, the detection accuracy is not always high, and the position of the object changes continuously. As mentioned above, the YOLOX algorithm is selected as the baseline model to detect and track Procapra przewalskii more accurately. Its structural framework is illustrated in Figure 3.
YOLOX follows YOLOv3 in its structural design and adopts the Darknet-53 backbone together with a spatial pyramid pooling (SPP) layer. In addition, the model is equipped with several high-performance detectors.
In August 2022, we went to the research area (Qinghai Lake) (Figure 4a) to acquire the UAV data and verify the actual flight. There were 40 flights over five days, and each flight lasted about half an hour. The average flight height was about 100 m, and the aerial photography covered an area of 9744 km². Figure 4b shows the flight landing sites. From the captured videos, an object-tracking database was established to identify the moving Procapra przewalskii, match them across frames, and track their motions.
A total of six video-sequence databases, comprising three training databases and three test databases, were annotated. The data were divided into a training set, a test set, and a verification set at a ratio of 3:2:1 (the Supplementary Data Set is available at https://pan.baidu.com/s/1vEYdFFTKUE9Z9cC67lCH_Q?pwd=56vx). Three major motions of Procapra przewalskii were considered, namely standing, walking, and running, as displayed in Figure 5a–c (male), Figure 5d–f (female), and Figure 5g–i (young). The database was trained with the YOLOX model by adjusting the weight ratio, the confidence threshold, the intersection-over-union (IoU) threshold of non-maximum suppression (NMS), and the activation function. In this way, a stable and accurate model was obtained.
3.3. 3D Pose Estimation
Herein, the predicted bounding box is saved as SROI, and the 3D pose of the object is recovered and dynamically tracked from the object coordinates in the 2D frame together with the depth information obtained from the stereo camera. An interior rectangle Si is first generated by contracting SROI with a scaling factor, as computed in Equations (1) and (2). In these equations, Si is the contracted interior rectangle, the scaling factor adjusts the size of the bounding box, the coordinate variables give the location of the bounding box in the image, and w and h represent the width and height of the bounding box, respectively. Si, as displayed in Figure 6b, serves as the ROI for obtaining the depth information. The unfilled pixels are filtered out of the depth image captured by the stereo camera, and the remaining depth values in Si are averaged as S, which is taken as the distance between the observer and the target object.
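To make the ROI computation concrete, the following minimal sketch contracts the detected box and averages the valid depth pixels inside it; the scaling-factor value and the convention that unfilled depth pixels are encoded as zero are assumptions for illustration.

```python
import numpy as np

def roi_depth(depth_img, box, scale=0.6):
    """Contract the detected box (x, y, w, h) by an assumed scaling factor and
    average the valid depth pixels inside the interior rectangle Si (Eqs. (1)-(2))."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0              # box centre
    wi, hi = w * scale, h * scale                  # contracted width and height
    x0, y0 = int(cx - wi / 2), int(cy - hi / 2)
    x1, y1 = int(cx + wi / 2), int(cy + hi / 2)
    patch = depth_img[y0:y1, x0:x1]
    valid = patch[patch > 0]                       # drop unfilled pixels (assumed encoded as 0)
    return float(valid.mean()) if valid.size else None   # averaged distance S
```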
Then, using the bounding-box coordinates, a coordinate transformation is performed to obtain the relative pose in the camera frame and the global pose in the world frame. The frame transformation is carried out according to Equations (3) and (4), in which u and v are the pixel coordinates of the object, K is the intrinsic matrix of the onboard camera, and the object pose vector is expressed first in the camera frame and then in the world frame. Specifically, the transformation matrix can be calculated using Equations (5) and (6).
In Equations (5) and (6), the elements of the observer pose rotation matrix and the position of the observer (UAV) relative to the world frame define the pose transformation matrices. Rotation of the coordinate system is usually represented by a rotation matrix or a quaternion.
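A minimal sketch of the back-projection and frame transformation described by Equations (3)–(6) is given below, assuming a standard pinhole model with intrinsic matrix K and a known camera-to-world rotation and translation; the numeric values are placeholders.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, R_wc, t_wc):
    """Back-project pixel (u, v) with averaged depth S into the camera frame
    (Eqs. (3)-(4)) and transform the result into the world frame (Eqs. (5)-(6)).
    R_wc and t_wc describe the camera pose in the world frame (assumed known
    from the UAV state estimate)."""
    p_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])   # camera-frame position
    return R_wc @ p_cam + t_wc                                 # world-frame position

# Example with placeholder intrinsics and pose:
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]])
p_world = pixel_to_world(350, 260, depth=25.0, K=K, R_wc=np.eye(3), t_wc=np.zeros(3))
```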
3.4. Tracking Based on the Improved LSTM-KF Model
In this study, the YOLOX algorithm is employed because it balances speed and accuracy. The dynamic states of both the target Procapra przewalskii and the quadrotor reduce the robustness of the pose estimation described in Section 3.3. The target Procapra przewalskii cannot always be kept in the FoV during surveillance, because false positive or false negative detections may occur; in addition, partial or full occlusion may occasionally arise. To address these issues, the KF model is utilized to enhance tracking, but it requires a motion model and a measurement model to be specified in advance, which increases the burden on the modeler.
3.4.1. Model Structure and Prediction Steps
As introduced above, an improved LSTM-KF model is proposed in the current study as a temporal regularization model for pose estimators. Its main idea is to use the KF without having to specify a linear transition function or fixed process and measurement covariance matrices Q and R.
The standard LSTM network (Figure 7) consists of memory cells, forget gates (ft), input gates (it), and output gates (Ot). Part of the information in the previous cell state Ct−1 is retained in the current cell state Ct, and the amount of retained information is determined by ft, as given in Equation (7). Meanwhile, it and Ot are calculated with Equations (8) and (9), respectively.
Here, Ot of the standard LSTM network is modified as shown in Equation (10), in which {X0, X1, …, Xt−1} and {Yt, Yt+1, …, Yt+n} are the input and output of the LSTM network, respectively; {W0(t), Wt, …, W(t−1)t} and {W0(t+1), Wt(t+1), …, W(t−1)(t+1)} represent the corresponding input and output weights; C refers to the current state of the LSTM network; and V is a coefficient.
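As a rough illustration of this gate coupling (the exact form is given by Equation (10)), the sketch below shows one plausible way to let the output gate depend on the input gate inside an LSTM cell; the coupling layer v_o and all tensor shapes are assumptions rather than the exact formulation used in the model.

```python
import torch
import torch.nn as nn

class CoupledGateLSTMCell(nn.Module):
    """LSTM cell sketch in which the output gate also receives the input gate
    (an assumed realization of the coupling described for Equation (10))."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.v_o = nn.Linear(hidden_size, hidden_size, bias=False)  # assumed coupling term

    def forward(self, x, state):
        h, c = state
        z = self.gates(torch.cat([x, h], dim=-1))
        f, i, g, o = z.chunk(4, dim=-1)
        f, i, g = torch.sigmoid(f), torch.sigmoid(i), torch.tanh(g)
        o = torch.sigmoid(o + self.v_o(i))      # output gate coupled to the input gate
        c_new = f * c + i * g                   # Eq. (7): retained plus new information
        h_new = o * torch.tanh(c_new)
        return h_new, (h_new, c_new)
```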
The KF is an optimal state estimator under the assumptions of linearity and Gaussian noise. If the state and the measurement are denoted xt and Zt, respectively, the model assumed here can be expressed by Equations (11) and (12). Because the incoming measurements are noisy estimates of the underlying state and H = I in Equation (11), Equations (11) and (12) can be rewritten as Equations (13) and (14), which form the basic model of the LSTM-KF. In these equations, Zt denotes the measurement, wt is the process noise at time t, Q and R are the process and measurement noise covariance matrices, and f is a nonlinear transition function.
The prediction step is defined by Equations (15) and (16), where f is modeled by an LSTM module, F is the Jacobian matrix of f evaluated at the previous state estimate, and the predicted process covariance is the output of the second LSTM module. The update step is then specified by Equations (17)–(19).
In Equations (17)–(19), the predicted measurement covariance is the output of the third LSTM module, and Zt refers to the observed measurement at time t. A compact numerical sketch of this filtering cycle is given below, and the individual LSTM modules are described in detail in the next subsection.
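In the following sketch of Equations (15)–(19), with H = I as stated above, the learned quantities (the LSTMf transition output, its Jacobian F, and the covariances predicted by LSTMQ and LSTMR) are simply passed in as arrays.

```python
import numpy as np

def lstm_kf_step(P_prev, z_t, f_pred, F, Q_hat, R_hat):
    """One LSTM-KF cycle with H = I.
    P_prev       : previous state covariance
    z_t          : current measurement (e.g., the YOLOX/depth 3D pose)
    f_pred, F    : LSTMf prediction and its Jacobian (Eqs. (15)-(16))
    Q_hat, R_hat : covariances predicted by LSTMQ and LSTMR
    """
    # Prediction (Eqs. (15)-(16))
    y_pred = f_pred
    P_pred = F @ P_prev @ F.T + Q_hat
    # Update (Eqs. (17)-(19))
    K = P_pred @ np.linalg.inv(P_pred + R_hat)      # Kalman gain
    y_new = y_pred + K @ (z_t - y_pred)             # corrected state estimate
    P_new = (np.eye(len(z_t)) - K) @ P_pred         # corrected covariance
    return y_new, P_new
```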
3.4.2. Architecture and Loss Function
In this paper, LSTMf, LSTMQ, and LSTMR denote the three LSTM modules that predict the transition function f, the process covariance, and the measurement covariance, respectively. LSTMf is composed of three stacked LSTM layers (1024 hidden cells each) and three fully connected (FC) layers (with 1024, 1024, and 48 hidden cells). The standard LSTM baseline is built in the same way as LSTMf, but the output gate Ot of LSTMf is modified to connect to the input gate it. In addition, ReLU nonlinearity is applied to all FC-layer activations except the last, and each LSTM layer is followed by a dropout layer with a retention probability of 0.7. LSTMQ and LSTMR use single-layer architectures with 256 hidden units followed by an FC layer with 48 hidden units; their output gates are likewise connected to it to avoid overfitting when training on the Procapra przewalskii video sequences.
Figure 8 shows each module of the LSTM-KF model and
Figure 9 displays an overview of the system.
At each time step t, LSTMf takes the previous estimate as input and generates the intermediate state prediction without depending on the current measurement; LSTMQ takes this prediction as input and estimates the process covariance; and LSTMR takes the measurement Zt as input and estimates the measurement covariance. Finally, the intermediate prediction and Zt, along with these covariance estimates, are fed into a standard KF (Equations (16)–(19)), ultimately yielding a new state estimate. Moreover, Q and R are restricted to be diagonal and positive definite by exponentiating the outputs of the LSTMQ and LSTMR modules in this study.
Preliminarily, the standard Euclidean loss summed over the sequence was applied, but the LSTMf module then failed to learn a reasonable mapping. Therefore, the loss function is augmented with a term that enhances the gradient flow to the LSTMf module. Equation (20) expresses the specific loss function.
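As an illustration only (the exact form is given by Equation (20)), one plausible loss of this kind adds a term that supervises the raw LSTMf prediction directly, so that gradients reach LSTMf without passing through the Kalman update; the weighting factor lam is an assumption.

```python
import torch

def lstm_kf_loss(y_true, y_filtered, y_lstmf, lam=1.0):
    """Sketch of a loss with an extra gradient-flow term for the LSTMf module.
    y_filtered : final LSTM-KF estimates
    y_lstmf    : raw predictions of the LSTMf module
    lam        : assumed weighting of the auxiliary term
    """
    main_term = torch.sum(torch.norm(y_true - y_filtered, dim=-1))  # standard Euclidean loss
    aux_term = torch.sum(torch.norm(y_true - y_lstmf, dim=-1))      # direct supervision of LSTMf
    return main_term + lam * aux_term
```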
3.4.3. Optimization of Parameters
All parameters θ, i.e., the concatenation of all weight matrices and biases of the three LSTM modules (each a combination of LSTM layers and linear layers, Figure 8), are optimized to minimize the loss given by Equation (20).
The LSTM-KF model can be trained end to end, and the gradient can be obtained through the backpropagation-through-time algorithm [26]. All computations and states in the model are represented by a single dataflow graph, which makes the communication between sub-computations explicit and facilitates the parallel execution of independent computations, so the gradient can be obtained as quickly as possible.
In addition, an Adam-based optimizer [27] is used in the training iterations to apply the gradient updates, ensuring high stability and predictability of the model. Its update rules based on the L2 norm are extended to rules based on the Lp norm. A large p generally makes such variants numerically unstable; however, in the special case of p → ∞ the algorithm becomes simple and stable, and it can be calculated with Equation (21).
With the Lp norm, the step size at time t is taken to be inversely proportional to vt^(1/p), so Equation (21) can be rewritten as Equation (22). Note that the decay term is here equivalently parameterized as β2^p instead of β2. If p → ∞ and vt is defined as the limit limp→∞ (vt)^(1/p), Equations (23)–(26) are obtained, which correspond to the very simple recursive update in Equation (27). The initial value is v0 = 0, and, conveniently, the initialization bias does not need to be corrected in this case. The improved Adam-based optimizer is simpler than the original and makes gradient updating easier.
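The p → ∞ limit corresponds to the AdaMax-style update described in [27]; a compact sketch of the resulting recursion (Equation (27)) and the parameter step is given below, with the usual default hyperparameters taken as assumptions.

```python
import numpy as np

def adamax_step(theta, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update in the p -> infinity limit of the Lp-norm Adam variant:
    the second-moment term becomes an exponentially weighted infinity norm."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = np.maximum(beta2 * v, np.abs(grad))      # Eq. (27): v_t = max(beta2 * v_{t-1}, |g_t|)
    m_hat = m / (1 - beta1 ** t)                 # bias correction is only needed for m
    theta = theta - lr * m_hat / (v + eps)
    return theta, m, v
```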
4. Results
The performance of the trained model was assessed on a Jetson AGX Xavier onboard computer, where the coupled detection head of the original YOLO model was replaced with a decoupled head, which greatly accelerated training convergence. As mentioned, the robustness of the proposed model was then verified on streaming video with various techniques for quantitative analysis. Lastly, several intensive flight tests were performed on a self-assembled quadrotor platform to evaluate the overall performance.
4.1. Detection Effect of the YOLOX Model
The YOLOX model outputs a predictive bounding box that classifies detected objects and marks their locations, which plays a key role in the subsequent pose estimation by the UAV. The model was trained for 1000 iterations, during which the loss did not degrade. Because the accuracy of the model is affected by the input resolution of the neural network, the YOLOX model was trained at different resolutions (i.e., 416 × 416, 512 × 512, and 640 × 640) to determine the best-performing configuration. The input resolutions are compared in Table 1.
Furthermore, the surveillance performed by UAVs relies on real-time perception, which places the emphasis on object detection and tracking. The detection speed and accuracy therefore had to be balanced so that detection and tracking remain consistent, the delay is negligible, and the accuracy is sufficiently high. Thus, the YOLOX and YOLOv4 models were compared at the same network resolution to examine their accuracy and speed.
After training, the model’s performance in detecting the target Procapra przewalskii in real-time video captured with the Intel RealSense D435i stereo camera was evaluated. The trained model proved robust in various environments and exhibited few false positives and false negatives. Procapra przewalskii tracking was then assessed after the validity of the model had been confirmed.
4.2. Tracking Performance on Target
In this study, the LSTM-KF model was employed to track and identify the behaviors and gestures in the video-sequence dataset of Procapra przewalskii. Six object-tracking sequences were generated from the Procapra przewalskii dataset, and the 6-DOF ground-truth pose was available. The LSTM-KF model was trained with a learning rate that decayed by a factor of 0.95 from the second epoch, and gradients were propagated over 100 time steps using truncated backpropagation through time.
For the single-layer LSTM baseline with 16 hidden units, the batch size was set to 2 and a separate learning rate was used. After this model was trained for 120 epochs, gradients were propagated over 10 time steps in the same way. The same settings were used for the standard LSTM method evaluated in this work.
The tracking algorithm is evaluated on successive depth-frame sequences in which the 3D pose is tracked against a 3D CAD model. All compared methods are evaluated here to verify that the proposed target-tracking method is superior to the existing ones.
Table 2 displays the tracking and recognition results under this scenario.
4.3. Verification of Field Tracking Flight
To integrate perception with reaction and to evaluate the surveillance system, four sites (the red points in Figure 4b) were selected to verify the tracking of Procapra przewalskii by the UAV. The parameter settings for the flight test are shown in Table 3.
Additionally, the estimated dynamic position was compared with the ground truth of the tracked Procapra przewalskii. Figure 12 and Figure 13 demonstrate that the system can track the pose of Procapra przewalskii in 3D space. Despite jittering and occasional drift, the Procapra przewalskii can be relocated accurately after several frames. In addition, the figures show that the error remains largely within 0.5 m on all axes of the world frame.
In addition, the root-mean-square error (RMSE) and mean absolute error (MAE), defined in Equations (28) and (29), respectively, were calculated, as presented in Table 4. RMSE indicates the magnitude of the prediction error generated by the model and weights large errors more heavily; MAE reflects the average error between the predicted and actual values, and a smaller MAE corresponds to better model performance. The quantities compared in Equations (28) and (29) are the true values and the predicted values, respectively.
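For reference, the two metrics can be computed per axis as follows from the ground-truth and predicted position sequences.

```python
import numpy as np

def rmse_mae(y_true, y_pred):
    """Root-mean-square error (Eq. (28)) and mean absolute error (Eq. (29))
    between ground-truth and predicted positions along one axis."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae
```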
In addition, during monitoring, the distance between the UAV and the target Procapra przewalskii constantly changed, so it is necessary to further analyze the accuracy at different distances. As shown in
Table 5, the performance of the proposed method remained basically stable regardless of the distance between the UAV and the target Procapra przewalskii.
5. Discussion
It should be highlighted that a larger input resolution increases the best achievable mAP but slows down both training and detection. Therefore, higher input resolutions were not trained in this work, because the speed and accuracy obtained at 640 × 640 are acceptable, with an mAP of 88.67% at an intersection-over-union threshold of 0.50 (AP50), although the fps is not as high as at the 512 × 512 resolution. Meanwhile, at the same resolution the YOLOX model achieved a higher mAP than the YOLOv4 model as well as a higher fps. Given the greater importance of fps for real-time prediction, the YOLOX model with the higher fps was selected to balance accuracy and speed.
Table 2 clearly shows that the motion models that do not exploit the training data, i.e., Kalman Vel., Kalman Acc., and EMA, do not meaningfully improve the translational and rotational estimates. In contrast, the improved LSTM-KF model proposed in this paper predicts the target position with a mean error of 0.82 mm, an improvement of 61.26% over the original estimate, and it also achieves a lower average error than the results in [28], which used the KF algorithm alone for target tracking. In addition, the LSTM-KF model greatly improves the original measurements for all actions, outperforming the standard LSTM and improving on the state-of-the-art method by an average of 14%. By contrast, the standard LSTM estimates the position and rotation with such large errors that it fails to meet the requirements.
As shown in Table 3 and Table 4, the 3D object pose-estimation systems studied by other scholars [29,30] focus on objects in static states, whereas the model applied in this paper estimates the dynamic position of Procapra przewalskii in real time, which naturally leads to higher errors. Nevertheless, it possesses better robustness and is also more accurate than the model in [28], which uses the KF algorithm alone for 3D pose estimation of the target, and it mitigates the modeler’s influence on the a priori specified motion and noise models. Compared with the model in [31], which uses an improved spatio-temporal context algorithm for 3D pose estimation of dynamic targets, the proposed model exhibits higher accuracy, faster network convergence, and less sensitivity to measurement noise. Overall, the proposed LSTM-KF performs better in pose estimation than the KF algorithm alone or the improved spatio-temporal context algorithm. If the target animal moves suddenly, redundant overshoot periods follow, slightly affecting the overall performance of the model. These results further demonstrate that the proposed model can be applied to autonomous UAV monitoring systems in a real-time and controllable manner.
6. Conclusions
In this paper, a deep-learning-based approach was employed to build an autonomous UAV tracking system to help monitor Procapra przewalskii. The improved LSTM-KF model is proposed to track the target, and the YOLOX model is employed to identify it; together they estimate the 3D pose of the Procapra przewalskii, thereby improving the object-tracking performance. In addition, the three standard LSTM networks in the LSTM-KF model are optimized by modifying the computation of the output gate Ot and connecting it to the input gate. During the training iterations of the LSTM-KF model, the Adam optimizer is improved by extending the L2-norm-based update rule to an Lp-norm-based rule. The results show that the improved LSTM-KF model achieves the best result in predicting the target position (a mean error of 0.82 mm), an improvement of 61.26% over the original estimate, significantly increasing the accuracy of the measurement results. In addition, the YOLOX model achieves an mAP of 93.25% and an fps of 13.56 on 640 × 640 images, both higher than those of the YOLOv4 model. Overall, the proposed improved LSTM-KF model is robust, valid, and reliable for animal tracking, recognition, and pose estimation.
However, this work still has several shortcomings with respect to Procapra przewalskii tracking. For example, when UAVs are used to track dense herds of Procapra przewalskii, accuracy decreases when individuals occlude each other; we intend to address this problem in future research using depth-sorting-based algorithms. In addition, a visual servo controller could be designed to let the UAV explore the environment or avoid obstacles using only one camera, ensuring that the tracked object always remains in the camera’s field of view.
Author Contributions
Conceptualization, X.L.; Methodology, Y.W. and X.W.; Validation, Z.Y.; Formal analysis, Y.Z.; Investigation, Y.H.; Data curation, T.Z.; Writing–original draft, W.L.; Writing–review & editing, D.W.; Visualization, G.Z.; Supervision, L.D.; Funding acquisition, Q.S. and F.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (No.: 42071289); the Open Fund of Key Laboratory of Agricultural Monitoring and Early Warning Technology, Ministry of Agriculture and Rural Affairs (JCYJKFKT2204); the Open Fund of Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, P.R. China (No.: 2022ZJUGP004); the Innovation Fund of Production, Study and Research in Chinese Universities (No.: 2021ZYA08001); the Central Government Guided Local Science and Technology Development Fund Project (No.: 216Z0303G); the National Key Research and Development Plan “Establishment of Spectral Earth with Medium Spatial Resolution and Its Application Research” (No.: 2019YFE0127300); Hebei Province Full-time Introduction of Top Talent Research Project (No.: 2020HBQZYC002); the National Science and Technology Major Project “Application and Demonstration of High Resolution Remote Sensing Monitoring Platform for Ecological Environment in Xiong’an New Area” (No.: 67-Y50G04-9001-22/23); the High Resolution Earth Observation System National Science and Technology Major Project; the National Basic Research Plan Project (No.: 2019YFE0126600); and the Doctoral Research Startup Fund Project (No.: BKY-2021-32).
Institutional Review Board Statement
This study does not require “ethics committee” approval.
Informed Consent Statement
Informed consent was obtained from all members who participated in this study.
Data Availability Statement
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Ellerman, J.R.; Morrison-Scott, T.C. Checklist of Palaearctic and Indian Mammals, 1758 to 1946; British Museum: London, UK, 1951.
- Jiang, Z.G.; Jiang, J.P.; Wang, Y.Z.; Zhang, Y.Y.; Li, L.L.; Xie, F.; Cai, B.; Cao, L.; Zheng, G.M.; Dong, L.; et al. Red List of China’s Vertebrates. Biodivers. Sci. 2016, 24, 500–551.
- Luo, W.; Zhang, Z.; Fu, P.; Wei, G.; Wang, D.; Li, X.; Shao, Q.; He, Y.; Wang, H.; Zhao, Z.; et al. Intelligent Grazing UAV Based on Airborne Depth Reasoning. Remote Sens. 2022, 14, 4188.
- Luo, W.; Li, X.; Zhang, G.; Shao, Q.; Zhao, Y.; Li, D.; Zhao, Y.; Li, X.; Zhao, Z.; Liu, Y.; et al. High-Accuracy and Low-Latency Tracker for UAVs Monitoring Tibetan Antelopes. Remote Sens. 2023, 15, 417.
- Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
- Welch, G.; Bishop, G. An Introduction to the Kalman Filter; Technical Report 1; University of North Carolina: Chapel Hill, NC, USA, 2006.
- Salti, S.; Di Stefano, L. Online support vector regression of the transition model for the Kalman filter. Image Vis. Comput. 2012, 31, 487–501.
- Krishnan, R.G.; Shalit, U.; Sontag, D. Deep Kalman filters. In Proceedings of the NIPS Workshop on Advances in Approximate Bayesian Inference and Black Box Inference, Montreal, QC, Canada, 11 December 2015.
- Bello, R.W.; Mohamed, A.S.A.; Talib, A.Z. Contour extraction of individual cattle from an image using enhanced mask R-CNN instance segmentation method. IEEE Access 2021, 9, 56984–57000.
- Bello, R.W.; Mohamed, A.S.A.; Talib, A.Z. Enhanced mask R-CNN for herd segmentation. Int. J. Agric. Biol. Eng. 2021, 14, 238–244.
- Bello, R.W.; Mohamed, A.S.A.; Talib, A.Z.; Olubummo, D.A.; Enuma, O.C. Enhanced deep learning framework for cow image segmentation. IAENG Int. J. Comput. Sci. 2021, 48, 1182–1191.
- Kumar, S.; Pandey, A.; Satwik, K.S.R.; Kumar, S.; Singh, S.K.; Singh, A.K.; Mohan, A. Deep learning framework for recognition of cattle using muzzle point image pattern. Measurement 2018, 116, 1–17.
- Li, G.; Huang, Y.; Chen, Z.; Chesser, G.D.; Purswell, J.L.; Linhoss, J.; Zhao, Y. Practices and applications of convolutional neural network-based computer vision systems in animal farming: A review. Sensors 2021, 21, 1492.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. arXiv 2014, arXiv:1406.2199.
- Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
- Held, D.; Thrun, S.; Savarese, S. Learning to track at 100 fps with deep regression networks. In Computer Vision—ECCV 2016, Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 749–765.
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
- Alameer, A.; Kyriazakis, I.; Bacardit, J. Automated recognition of postures and drinking behaviour for the detection of compromised health in pigs. Sci. Rep. 2020, 10, 13665.
- Chen, G.; Shen, S.; Wen, L.; Luo, S.; Bo, L. Efficient pig counting in crowds with keypoints tracking and spatial-aware temporal response filtering. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020.
- Cowton, J.; Kyriazakis, I.; Bacardit, J. Automated individual pig localisation, tracking and behaviour metric extraction using deep learning. IEEE Access 2019, 7, 108049–108060.
- Zhang, L.; Gray, H.; Ye, X.; Collins, L.; Allinson, N. Automatic individual pig detection and tracking in pig farms. Sensors 2019, 19, 1188.
- Ren, K.; Bernes, G.; Hetta, M.; Karlsson, J. Tracking and analysing social interactions in dairy cattle with real-time locating system and machine learning. J. Syst. Archit. 2021, 116, 102139.
- Salau, J.; Lamp, O.; Krieter, J. Dairy cows’ contact networks derived from videos of eight cameras. Biosyst. Eng. 2019, 188, 106–113.
- Abbeel, P.; Coates, A.; Montemerlo, M.; Ng, A.Y.; Thrun, S. Discriminative Training of Kalman Filters. In Robotics: Science and Systems I, Proceedings of the Robotics: Science and Systems Conference, Cambridge, MA, USA, 8–11 June 2005; MIT Press: Cambridge, MA, USA, 2005; pp. 289–296.
- Haarnoja, T.; Ajay, A.; Levine, S.; Abbeel, P. Backprop KF: Learning Discriminative Deterministic State Estimators. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016.
- Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560.
- Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Lo, L.-Y.; Yiu, C.H.; Tang, Y.; Yang, A.-S.; Li, B.; Wen, C.-Y. Dynamic Object Tracking on Autonomous UAV System for Surveillance Applications. Sensors 2021, 21, 7888.
- Feng, Y.; Tse, K.; Chen, S.; Wen, C.-Y.; Li, B. Learning-Based Autonomous UAV System for Electrical and Mechanical (E&M) Device Inspection. Sensors 2021, 21, 1385.
- Steich, K.; Kamel, M.; Beardsley, P.; Obrist, M.K.; Siegwart, R.; Lachat, T. Tree cavity inspection using aerial robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016.
- Zhang, K.; Zhang, L.; Liu, Q.; Zhang, D.; Yang, M.H. Fast Visual Tracking via Dense Spatio-temporal Context Learning. In Computer Vision—ECCV 2014, Part V, Proceedings of the 2014 European Conference on Computer Vision—ECCV, Zurich, Switzerland, 6–12 September; LNCS; Springer: Cham, Switzerland, 2014; Volume 8693, pp. 127–141.
Figure 1.
Prototype of the proposed autonomous object-tracking system.
Figure 2.
Software architecture of the system.
Figure 3.
Structural framework of the YOLOX algorithm.
Figure 4.
Research area: (a) geographical location of Qinghai Lake; (b) distribution of UAV landing sites in August.
Figure 5.
Different motions of male, female, and young Procapra przewalskii.
Figure 6.
(a) Bounding box coordinates; (b) ROI for depth.
Figure 7.
Network structure of the standard LSTM.
Figure 8.
Structure of the LSTM-KF model.
Figure 9.
Overview of the LSTM-KF model: (a) a high-level depiction of the architecture which uses three LSTM modules to predict the internals of the KF; (b) LSTM-KF unrolled over time, which can be trained end to end with backpropagation.
Figure 10.
Verification scenario for the tracking algorithm of Procapra przewalskii based on the Prometheus simulation system.
Figure 11.
First-person views of the UAV system during the Procapra przewalskii track (the yellow and the red bounding boxes are the detected and predicted states of the object, respectively).
Figure 12.
Comparison between the estimated position and ground truth of Procapra przewalskii.
Figure 13.
Error throughout the mission time.
Table 1.
Performances of YOLOv4-Tiny and YOLOX with respect to different resolutions.
| Method | Backbone | Size | mAP@0.5 (AP50) | FPS |
|---|---|---|---|---|
| YOLOv4 | CSPDarknet-53 | 416 × 416 | 83.19% | 6.12 |
| YOLOv4 | CSPDarknet-53 | 512 × 512 | 85.87% | 6.24 |
| YOLOv4 | CSPDarknet-53 | 640 × 640 | 88.67% | 6.02 |
| YOLOX | Darknet-53 | 640 × 640 | 93.25% | 13.56 |
Table 2.
Effects of temporal regularization on object-tracking estimations of Procapra przewalskii. The errors in translation are denoted as [mm].
| Method | Male Trans. | Female Trans. | Young Trans. | Mean Trans. |
|---|---|---|---|---|
| D.J. Tan et al. | 1.58 | 2.55 | 3.85 | 2.66 |
| +Kalman Vel. | 1.49 | 2.42 | 3.34 | 2.42 |
| +Kalman Acc. | 1.49 | 2.41 | 3.32 | 2.41 |
| +EMA | 1.59 | 2.56 | 3.87 | 2.67 |
| +Std. LSTM | 40.98 | 45.98 | 50.15 | 45.7 |
| LSTM-KF (ours) | 0.58 | 0.67 | 1.22 | 0.82 |
Table 3.
Defined parameters for flight test.
| Parameter | Value |
|---|---|
| θ | 12 deg |
| Vθmax | 45 deg/s |
| VZmax | 2 m/s |
| VXmax | 2 m/s |
| Rsafe | 5 m |
| Rsur | 30 m |
Table 4.
Calculated RMSE and MAE for position estimation of the dynamic Procapra przewalskii.
| Error Evaluation | X (m) | Y (m) | Z (m) |
|---|---|---|---|
| RMSE | 0.0477 | 0.0478 | 0.0221 |
| MSE | 0.0022 | 0.0023 | 0.0005 |
Table 5.
Calculated RMSE and MAE of position estimation of dynamic Procapra przewalskii with different object distances.
| Object Distance (m) | Error Evaluation | X (m) | Y (m) | Z (m) |
|---|---|---|---|---|
| 10–30 | RMSE | 0.0477 | 0.0478 | 0.0221 |
| 10–30 | MSE | 0.0022 | 0.0023 | 0.0005 |
| 30–50 | RMSE | 0.0773 | 0.0805 | 0.0656 |
| 30–50 | MSE | 0.0059 | 0.0648 | 0.0431 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).