Article

Pose Estimation Method for Non-Cooperative Target Based on Deep Learning

Heilongjiang Provincial Key Laboratory of Complex Intelligent System and Integration, School of Automation, Harbin University of Science and Technology, Harbin 150080, China
*
Author to whom correspondence should be addressed.
Aerospace 2022, 9(12), 770; https://doi.org/10.3390/aerospace9120770
Submission received: 10 October 2022 / Revised: 19 November 2022 / Accepted: 24 November 2022 / Published: 29 November 2022

Abstract

The strength of scientific research in the aerospace field has become an essential criterion for measuring a country's scientific and technological level and comprehensive national power. As is well known, the difficulty of rendezvous and docking with a non-cooperative target is that the target cannot provide attitude information autonomously, and existing non-cooperative target pose estimation methods suffer from low accuracy and high resource consumption. This paper proposes a deep-learning-based pose estimation method to address these problems. The proposed method consists of two innovative parts. First, the lightweight You Only Look Once v5 (YOLOv5) network is used to pre-recognize non-cooperative targets. Second, concurrent spatial and channel squeeze-and-excitation modules are introduced into a lightweight High-Resolution Network (HRNet) to extend its real-time advantages, yielding the spatial and channel Squeeze and Excitation—Lightweight High-Resolution Network (scSE-LHRNet) for pose estimation. To verify the superiority of the proposed network, experiments were conducted on a publicly available dataset with multiple evaluation metrics, comparing it against existing methods. The experimental results show that the proposed method dramatically reduces model complexity, effectively decreases the amount of computation, and achieves strong pose estimation performance.

1. Introduction

Aerospace technology is one of the most cutting-edge fields in science and technology and an essential manifestation of a country's scientific and technological level and comprehensive national strength [1]. The identification and pose estimation of non-cooperative targets play a vital role in the aerospace field, and how to accurately identify such targets and estimate their pose has become a key item on the field's agenda [2]. Although pose estimation of non-cooperative targets has broad application prospects in on-orbit servicing and space debris removal, considerable challenges remain because no information is exchanged with the target [2,3]. Pose estimation of non-cooperative targets based on monocular vision has attracted extensive attention from researchers due to its low power, low mass, and small size, and has therefore been studied extensively [4]. Compared with the binocular pose estimation scheme, monocular vision also offers advantages in field of view and reliability. As a result, we focus on non-cooperative target recognition and pose estimation with monocular vision.
At present, pose estimation methods for non-cooperative targets fall into two categories: traditional methods based on hand-crafted geometric features and end-to-end methods based on deep learning [5,6]. Traditional pose estimation methods rely on image processing algorithms and a priori knowledge of the pose [7]. Their critical step is to extract feature points and compute the corresponding descriptors manually, using methods such as Scale Invariant Feature Transform (SIFT) [8], Maximally Stable Extremal Regions (MSER) [9], Speeded Up Robust Features (SURF) [10], and Binary Robust Independent Elementary Features (BRIEF) [11]. Using the Canny operator and Hough transform to extract edge and line features from grayscale images of non-cooperative targets, D’Amico et al. [12] combined known 3D models to construct a function relating pose parameters to pixel points and thereby obtained the pose. Based on monocular camera images, Sharma et al. [13] introduced a hybrid image processing method into an attitude initialization system, combining weak-gradient elimination with the Sobel operator and Hough transform to detect the dimensional features of the target and obtain its attitude; however, this method struggles with complex background images.
Deep learning methods are divided into indirect and direct methods. The indirect method uses deep learning in place of hand-crafted point, line, and other features of the target, and then solves the pose from the resulting 2D–3D correspondences. The Render for CNN classification model predicts the viewpoint: a convolutional neural network is trained to perform fine-grained classification of three angles (pitch, yaw, and roll) for pose estimation [14]. The PoseCNN model for object pose estimation has two branches: one locates the object center and predicts its distance to the camera to estimate the translation vector, while the other regresses a quaternion to estimate the 3D rotation of the object [15]. Sharma et al. [6] used threshold segmentation to remove background pixels, but this approach generates false edges and cannot eliminate background interference completely. To address this problem, Chen et al. [16] first detected the target and cropped it from the background, then used the High-Resolution Network (HRNet) to predict the 2D key points of the target, established the target's 3D model by multi-view triangulation, and finally solved the 2D–3D relationship with PnP to obtain the pose. Nan et al. [17] used the high-resolution, multi-scale prediction network HRNet in place of a feature pyramid network, mitigating the information loss caused by resolution reduction and enhancing the detection of space objects of different scales. In addition, applying the Transformer model to rigid-target pose estimation, Wang et al. [18] first proposed a representation based on sets of key points and designed an end-to-end key point regression network to strengthen the relationships between key points.
The direct method obtains the pose information directly, without solving a 2D–3D relationship. The Spacecraft Pose Network proposed by Sharma et al. [19] has three branches: the first performs target detection to determine the bounding box of the target spacecraft, while the other two use the bounding box region to estimate the relative attitude and then use relative attitude constraints to compute the relative position. UrsoNet, proposed by Proença et al. [20], uses ResNet as the backbone, treats position estimation as a simple regression branch and attitude estimation as a classification branch, and solves the attitude with probabilistic soft classification and a Gaussian mixture model.
To sum up, when solving the pose estimation problem for non-cooperative targets, methods based on hand-crafted features have significant limitations and insufficient robustness; the indirect deep learning method is accurate but computationally complicated, while the direct method is simpler but less accurate than the indirect method. In rendezvous and approach scenarios, the onboard resources of a spacecraft are so limited that stringent requirements are imposed on computation speed and pose estimation accuracy. Therefore, the direct method, with its advantages in memory usage and computing power consumption, is chosen as the pose estimation method for non-cooperative targets, and in this paper it is further improved and optimized to achieve higher pose estimation accuracy.
The remainder of the paper is structured as follows: Section 2 introduces the theoretical basis of the target detection and pose estimation network structures; Section 3 reports the specific experiments, testing, and verification procedures, and compares the proposed network with other networks to illustrate its superiority; Section 4 concludes the paper and outlines future work.

2. Proposed Method

The non-cooperative target pose estimation method proposed in this paper is shown in Figure 1. It consists of two stages: first, the target detection network predicts the bounding box of the target spacecraft, and the target is cropped out to remove background interference such as deep space and the earth, so that the next stage can focus on pose estimation; then, the cropped target image is fed into the pose estimation network, and the position and attitude of the target are predicted by the direct method.
Among target detection algorithms, two-stage algorithms such as Faster R-CNN [21] have high detection accuracy, while one-stage algorithms such as YOLOv3-tiny [22], YOLOv4 [23], YOLOv5 [24], and YOLOv5s [25] are fast. YOLOv5s is small, computes quickly, and its detection accuracy is comparable to that of Faster R-CNN. The High-Resolution Network (HRNet) [26] can effectively handle spatially sensitive tasks, but its network is complex and occupies a large amount of memory.
In rendezvous and approach missions, the equipment carried by a spacecraft is expensive, memory and computing power are limited, and real-time performance is demanded. To address this, we select YOLOv5s as the target detection network and use an improved HRNet as the pose estimation network. The improved HRNet has a lightweight structure while improving prediction accuracy.

2.1. Target Detection Network Based on YOLOv5s

The structure of YOLOv5s is shown in Figure 2. It consists of an input layer, a backbone, a neck, and an output layer. The network is small and fast at inference, making it suitable for rendezvous and approach tasks.
The input layer includes Mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling. It has a strong ability to recognize small objects, which suits the case considered here, where the satellite appears as a small target in the far field of view.
The backbone includes CSP1_X and Focus modules. The Focus module, shown in Figure 3, uses the slicing operation shown in Figure 4 to turn a 4 × 4 × 3 feature map into a 2 × 2 × 12 feature map; a code sketch is given below. The CSP structure, shown in Figure 5, alleviates the vanishing gradient problem. The use of CSP gives YOLOv5s excellent accuracy and computation speed while reducing the model's size.
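As an illustration, the slicing step of the Focus module can be sketched in NumPy as follows; this is a minimal, framework-agnostic sketch of the rearrangement only (the real module follows it with a convolution), and the function name focus_slice is ours.

```python
import numpy as np

def focus_slice(x):
    """Rearrange an H x W x C feature map into (H/2) x (W/2) x 4C by
    sampling every second pixel in four phase-shifted grids and
    concatenating them along the channel axis (the Focus slicing idea)."""
    return np.concatenate(
        [x[0::2, 0::2, :],   # top-left pixels
         x[1::2, 0::2, :],   # bottom-left pixels
         x[0::2, 1::2, :],   # top-right pixels
         x[1::2, 1::2, :]],  # bottom-right pixels
        axis=-1)

# Example: a 4 x 4 x 3 map becomes 2 x 2 x 12, as described in the text.
x = np.random.rand(4, 4, 3)
print(focus_slice(x).shape)  # (2, 2, 12)
```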
The neck layer is shown in Figure 6. The FPN, PAN, and CSP2_X modules enhance the network’s feature fusion ability and improve the accuracy of network prediction.
The weighted non-maximum suppression and loss function used in the output layer solve the problems of multi-target occlusion and bounding box mismatch, respectively. The loss function [27] is:
$$
GIoU\_Loss = 1 - GIoU = 1 - \left( IoU - \frac{\left| C \setminus (A \cup B) \right|}{|C|} \right), \qquad IoU = \frac{|A \cap B|}{|A \cup B|} \tag{1}
$$
where IoU is the intersection over union, which measures the overlap between the predicted box and the ground truth box; GIoU is the generalized intersection over union, which measures the quality of the detection; A and B are the predicted box and the ground truth box, respectively; and C is the smallest enclosing box of A and B.
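For concreteness, a minimal sketch of the GIoU loss in Equation (1) for a single pair of axis-aligned boxes might look as follows (plain Python, with boxes given as corner coordinates; this is an illustration, not the YOLOv5s implementation):

```python
def giou_loss(box_a, box_b):
    """GIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).
    A rough sketch of Equation (1)."""
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C; |C \ (A U B)| = area_c - union
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou

print(giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # predicted vs. ground truth box
```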

2.2. High-Resolution Network

The network structure of HRNet is shown in Figure 7. Its four parallel sub-networks perform multi-resolution feature fusion in four stages, so the feature maps always keep a high resolution.
The four parallel sub-networks of HRNet are shown in Figure 8. Layer 1 always maintains the high resolution; each subsequent layer increases the feature depth but halves the resolution. In Figure 8, $N$ denotes the resolution of the image, $s$ the stage index, and $r$ the sub-network index, so $N_{sr}$ is the resolution of the $r$-th sub-network in stage $s$, which equals $1/2^{\,r-1}$ of the resolution of Layer 1.
Each sub-network can repeatedly receive feature information from the other sub-networks through the exchange units in the blue boxes in Figure 7. The exchange unit of Stage 3 is shown in Figure 9, where $C_{sr}^{a}$ denotes the convolution unit of the $r$-th sub-network in the $a$-th exchange block of stage $s$, and $\varepsilon_{s}^{a}$ is the corresponding exchange unit. The exchange unit computes

$$
Y_k = \sum_{i=1}^{s} a(X_i, k)
$$

where $X_i$ is the input, $Y_k$ is the output, and $a(X_i, k)$ denotes upsampling or downsampling the input $X_i$ from resolution $i$ to resolution $k$. The fusion process is shown in Figure 10: feature maps of different resolutions are upsampled or downsampled to fuse cross-scale information, so the feature maps always keep a high resolution and the high-resolution features are supplemented multiple times.
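To make the fusion rule concrete, the following sketch implements $Y_k = \sum_i a(X_i, k)$ with nearest-neighbour resampling in NumPy; in HRNet itself the up/downsampling is done with (strided) convolutions and channel-matching 1 × 1 convolutions, so this only illustrates the data flow.

```python
import numpy as np

def resample(x, src_r, dst_r):
    """Nearest-neighbour stand-in for the a(X_i, k) operator: bring a feature
    map from resolution level src_r to level dst_r (level 1 is the finest,
    each level halves height and width)."""
    if dst_r > src_r:                      # downsample: stride 2 per level
        step = 2 ** (dst_r - src_r)
        return x[::step, ::step, :]
    if dst_r < src_r:                      # upsample: repeat pixels per level
        rep = 2 ** (src_r - dst_r)
        return np.repeat(np.repeat(x, rep, axis=0), rep, axis=1)
    return x

def exchange_unit(inputs, k):
    """Y_k = sum_i a(X_i, k): fuse all parallel branches at resolution level k.
    inputs[i] is the feature map of branch i+1 (equal channel counts here for
    simplicity; HRNet also adapts channels with 1x1 convolutions)."""
    return sum(resample(x, i + 1, k) for i, x in enumerate(inputs))

# Three branches at 32x32, 16x16 and 8x8 with 4 channels each.
branches = [np.random.rand(32 // 2**i, 32 // 2**i, 4) for i in range(3)]
print(exchange_unit(branches, 1).shape)  # fused high-resolution map: (32, 32, 4)
```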
The residual blocks used by HRNet are shown in Figure 11. In the Bottleneck block, two 1 × 1 convolutions perform dimension reduction and restoration while the 3 × 3 convolution in the middle extracts features; in the Basicblock, two 3 × 3 convolutions extract features, and the shortcut connection adds the input to the output. The residual blocks increase the depth of HRNet, allowing it to extract deeper information.

2.3. Pose Estimation Network Based on scSE-LHRNet

Although HRNet's parallel sub-networks and repeated feature fusion maintain high-resolution feature maps, it has the following problems: the complex network model leads to a large number of parameters, many repeated calculations, and a large memory footprint; and the direct fusion of multi-resolution features cannot effectively exploit the channel and spatial feature information of the feature maps.
To address these problems, this paper proposes a Lightweight HRNet (LHRNet) equipped with the concurrent spatial and channel squeeze and excitation block (scSE) [28], which is used for pose estimation of non-cooperative targets.
As shown in Figure 12, the backbone of scSE-LHRNet is HRNet, with lightweight Bottleneck and Basicblock residual blocks. In the multi-resolution feature fusion stage, we add the scSE module with its hybrid attention mechanism, and at the end of the network we attach two fully connected branches that perform position estimation and attitude estimation, respectively.
Standard convolution and Depthwise Separable Convolution (DSC) [29] are shown in Figure 13. If the input is $D_F \times D_F \times M$ and the output is $D_F \times D_F \times N$, then standard convolution uses $N$ kernels of size $D_K \times D_K \times M$ and its computational cost is $D_K \times D_K \times M \times D_F \times D_F \times N$. DSC consists of a depthwise convolution with $M$ kernels of size $D_K \times D_K \times 1$ and a pointwise convolution with $N$ kernels of size $1 \times 1 \times M$, so its cost is $D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F$. The ratio of the two [29] is:

$$
\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times D_F \times D_F \times N} = \frac{1}{N} + \frac{1}{D_K^{2}}
$$
Using DSC for the convolutions in Bottleneck and Basicblock therefore significantly reduces the number of network parameters and computations, and the complexity of the constructed multi-resolution sub-network structure is reduced [29], making HRNet more lightweight; a numerical sketch of the cost ratio is given below.
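As a quick check of the ratio above, the following snippet counts multiply–accumulate operations for both convolution types; the layer sizes in the example are illustrative values of our choosing, not taken from the paper.

```python
def conv_costs(d_f, d_k, m, n):
    """Multiply-accumulate counts for a standard convolution and a depthwise
    separable convolution on a D_F x D_F x M input producing N channels."""
    standard = d_k * d_k * m * d_f * d_f * n
    dsc = d_k * d_k * m * d_f * d_f + m * n * d_f * d_f
    return standard, dsc

# Example with plausible layer sizes (our assumption):
std, dsc = conv_costs(d_f=56, d_k=3, m=64, n=128)
print(dsc / std)                 # measured ratio, about 0.119
print(1 / 128 + 1 / 3 ** 2)      # 1/N + 1/D_K^2, matching the formula above
```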
The attention mechanism enables the network to focus on extracting the vital feature information. The scSE belongs to a hybrid attention mechanism, which can focus on both spatial and channel features. Its structure is shown in Figure 14.
The sSE branch, which attends to the spatial domain, applies a 1 × 1 convolution to the $C \times H \times W$ feature map $U$ to compress the channel dimension to 1, and a Sigmoid activation turns the resulting $1 \times H \times W$ map into a spatial mask; recalibrating $U$ with this mask yields the spatially calibrated feature map $\hat{U}_{sSE}$. The cSE branch, which attends to the channel domain, first applies global pooling to obtain a $1 \times 1 \times C$ descriptor, halves the number of channels with a convolution followed by ReLU, restores the original number of channels, and applies a Sigmoid to obtain a channel mask; recalibrating the features with this mask yields the channel-calibrated feature map $\hat{U}_{cSE}$. The outputs of the two branches are added to obtain the final feature map of the scSE module, $\hat{U}_{scSE}$.
Adding the scSE block before the multi-resolution feature fusion of HRNet extracts more useful spatial and channel feature information, making the fusion of features at different resolutions more efficient; a sketch of the block is given below.
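As an illustration of the scSE block described above, a tf.keras sketch might look as follows; the reduction ratio and exact layer choices are our assumptions, and the paper's implementation may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def scse_block(u, reduction=2):
    """Concurrent spatial and channel squeeze-and-excitation (scSE) applied to
    a feature map u of shape (batch, H, W, C). A sketch, not the paper's code."""
    c = int(u.shape[-1])

    # cSE: channel attention from a globally pooled descriptor.
    z = layers.GlobalAveragePooling2D()(u)                 # (batch, C)
    z = layers.Dense(c // reduction, activation="relu")(z)
    z = layers.Dense(c, activation="sigmoid")(z)
    z = layers.Reshape((1, 1, c))(z)
    u_cse = layers.Multiply()([u, z])

    # sSE: spatial attention from a 1x1 convolution that squeezes channels.
    s = layers.Conv2D(1, kernel_size=1, activation="sigmoid")(u)  # (batch, H, W, 1)
    u_sse = layers.Multiply()([u, s])

    # Combine the two recalibrated maps.
    return layers.Add()([u_cse, u_sse])

inputs = layers.Input(shape=(64, 64, 32))
model = tf.keras.Model(inputs, scse_block(inputs))
model.summary()
```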
The scoring criteria of the Kelvins Pose Estimation Challenge (KPEC) [30] are used: for image $i$, the orientation score is $score_{orientation}(i)$, the position score is $score_{position}(i)$, and the pose score is $score_{pose}(i)$; the total score on the dataset is $score$. They are defined as follows:

$$
score_{orientation}(i) = 2 \arccos\left( \left| \left\langle q_{est}(i), q_{gt}(i) \right\rangle \right| \right)
$$

$$
score_{position}(i) = \frac{\left\| t_{gt}(i) - t_{est}(i) \right\|_2}{\left\| t_{gt}(i) \right\|_2}
$$

$$
score_{pose}(i) = score_{orientation}(i) + score_{position}(i)
$$

$$
score = \frac{1}{N} \sum_{i=1}^{N} score_{pose}(i)
$$

where $q_{est}(i)$ and $q_{gt}(i)$ are the quaternions of the predicted and ground truth attitude, respectively; $t_{est}(i)$ and $t_{gt}(i)$ are the translation vectors of the predicted and ground truth position, respectively; and the smaller the value of $score$, the more accurate the pose estimation.
This paper uses the KPEC dataset, and the pose estimation network uses the direct method to perform both position estimation and attitude estimation. To evaluate accuracy on the same scale as the competition, the network loss function is:
$$
Loss = \beta_1 \frac{1}{N} \sum_{i=1}^{N} score_{orientation}(i) + \beta_2 \frac{1}{N} \sum_{i=1}^{N} score_{position}(i)
$$
where $\beta_1$ and $\beta_2$ are adjustable weights. Experiments show that the pose estimation network is most accurate when $\beta_1$ is 0.6 and $\beta_2$ is 0.4; other values of $\beta_1$ and $\beta_2$ slightly reduce the prediction accuracy.
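For reference, the scoring terms and the weighted loss above can be written as the following NumPy sketch (evaluation-style code on arrays, not the TensorFlow training graph used in the paper):

```python
import numpy as np

def score_orientation(q_est, q_gt):
    """2 * arccos(|<q_est, q_gt>|) for unit quaternions."""
    return 2.0 * np.arccos(np.clip(abs(np.dot(q_est, q_gt)), 0.0, 1.0))

def score_position(t_est, t_gt):
    """Relative translation error ||t_gt - t_est|| / ||t_gt||."""
    return np.linalg.norm(t_gt - t_est) / np.linalg.norm(t_gt)

def loss(q_est, q_gt, t_est, t_gt, beta1=0.6, beta2=0.4):
    """Weighted training loss over a batch, following the formula above.
    Inputs are arrays of shape (N, 4) for quaternions and (N, 3) for translations."""
    s_orient = np.mean([score_orientation(qe, qg) for qe, qg in zip(q_est, q_gt)])
    s_pos = np.mean([score_position(te, tg) for te, tg in zip(t_est, t_gt)])
    return beta1 * s_orient + beta2 * s_pos
```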

3. Evaluation

The experimental platform runs Windows 10; the CPU is an Intel Core i9-10900K with 64 GB of memory, the GPU is a 32 GB NVIDIA 2080Ti, the programming language is Python 3.6.13, and the deep learning framework is TensorFlow 1.13.1.

3.1. Dataset and Metrics

To test the performance of the proposed non-cooperative target pose estimation method, KPEC's Spacecraft Pose Estimation Dataset (SPEED) is used for the experiments. The SPEED dataset provides 15,303 Tango satellite images of 1920 × 1200 pixels, including real and synthetic images. Since the SPEED dataset currently discloses ground truth only for its training set, we augmented that training set with translation, rotation, scaling, and added Gaussian noise, and split it into the training and test sets of our experiments at a ratio of 7:3; a sketch of the augmentation step is given below. In addition, for the detection task, the LabelImg tool [31] was used to annotate the SPEED dataset.
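The augmentation step can be sketched with OpenCV as follows; the parameter values are our own illustrative choices, not the ranges used for the experiments.

```python
import cv2
import numpy as np

def augment(img, angle=5.0, shift=(10, 5), scale=1.1, noise_std=5.0):
    """One illustrative augmentation pass: rotation, translation, scaling and
    additive Gaussian noise applied to an 8-bit image."""
    h, w = img.shape[:2]
    # Rotation and scaling about the image centre.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    # Translation in pixels.
    m[:, 2] += shift
    out = cv2.warpAffine(img, m, (w, h))
    # Additive Gaussian noise.
    out = out.astype(np.float32) + np.random.normal(0.0, noise_std, out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)
```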
The target detection evaluation metrics of the non-cooperative target are Average Precision (AP) and Mean Intersection over Union (mIoU). The area under the Precision-Recall curve is defined as AP, and the calculation formula is:
$$
Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}
$$
where TP is the number of positive samples correctly identified as positive, FP is the number of negative samples incorrectly identified as positive, and FN is the number of positive samples incorrectly identified as negative. The mIoU, the average IoU over the test set, measures the quality of target detection; IoU is given in Formula (1).
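These metrics reduce to a few lines of Python; the counts below are hypothetical and only illustrate the definitions.

```python
def precision_recall(tp, fp, fn):
    """Precision and Recall from detection counts, as in the formulas above."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_iou(ious):
    """mIoU: the average IoU of the detections in the test set."""
    return sum(ious) / len(ious)

# Hypothetical counts and IoU values for illustration only.
print(precision_recall(tp=90, fp=5, fn=10))   # (0.947..., 0.9)
print(mean_iou([0.82, 0.91, 0.88]))
```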
The pose estimation of the non-cooperative target is evaluated with the KPEC scoring standard. For convenience, "error" is used instead of "score" when analyzing the experimental results. In addition, model complexity is measured by the number of parameters and by the computation in FLOPs.

3.2. Experiments

To test the target detection performance of YOLOv5s on non-cooperative targets, we selected YOLOv3-tiny, YOLOv4, and Faster RCNN for comparative experiments. The results are shown in Table 1. The YOLOv3-tiny model is smaller, but its AP and mIoU are much lower than those of the other three models; the model size of YOLOv5s is 11.44% and 16.24% of that of YOLOv4 and Faster RCNN, respectively, while its AP and mIoU are only slightly lower. This is partly because the SPEED dataset contains only one class of detection target, the Tango satellite, which is a simple task for these three models. Considering all three evaluation indicators, YOLOv5s is the most suitable target detection network for this dataset.
The detection results of YOLOv5s on test images are shown in Figure 15. The Tango satellite is detected against both the deep space background and the earth background, under strong and weak illumination, and detection remains good when the satellite lies at the image boundary. These results show that YOLOv5s is effective as the target detection network for non-cooperative spacecraft.
To objectively evaluate the performance of our proposed non-cooperative spacecraft pose estimation method, ablation experiments and comparative experiments are carried out below.
We choose HRNet-W32 as the original network. The network model with only the lightweight strategy is called LHRNet, and the network model with only the scSE module is called scSE-HRNet. The experimental results of the pose estimation of each network model are shown in Table 2.
Table 2 shows that the lightweight strategy alone reduces the parameter count and computation of the model by (28.50 − 9.23)/28.50 = 67.61% and (11.26 − 5.48)/11.26 = 51.33%, respectively, but increases the error by 0.0126. The scSE module alone reduces the error by 0.0233, but increases the parameter count and computation by (39.76 − 28.50)/28.50 = 39.51% and (14.31 − 11.26)/11.26 = 27.09%, respectively. When the two act together, the parameter count and computation are reduced by (28.50 − 11.82)/28.50 = 58.53% and (11.26 − 6.95)/11.26 = 38.28%, respectively, and the error is reduced by 0.0217. The analysis shows that the lightweight structure alone greatly reduces model complexity, but the loss of parameters degrades accuracy; the hybrid attention mechanism of the scSE module alone markedly improves accuracy, but increases the number of parameters and computations; adopting the lightweight structure and adding the scSE module together both preserves accuracy and reduces the number of parameters and calculations.
The loss function curve of scSE-LHRNet is shown in Figure 16. The curve drops rapidly and then levels off, indicating that the model converges stably. In summary, scSE-LHRNet balances parameter count, computation speed, and prediction accuracy, and can effectively predict the pose of non-cooperative spacecraft.
The pose estimation results of the proposed method and several existing methods on the SPEED dataset are shown in Table 3. Reference [16] adopts the indirect method with an improved HRNet as the network model, and its error is 0.0279 lower than that of [19], indicating that the high-resolution features of HRNet help improve model accuracy. Reference [17] adopts the indirect method with an attention-based Transformer as the network model, and its error is 0.0342 lower than that of [19]. The method in this paper applies the lightweight improvement and the hybrid attention mechanism to HRNet, giving it advantages in both prediction accuracy and model complexity; its error is 0.0402 lower than that of [19], and although it is a direct method, it is still more accurate than the indirect methods discussed above. In summary, the proposed non-cooperative spacecraft pose estimation method is superior.
The attitude information provided by the SPEED dataset is given as quaternions. To make the prediction results more intuitive, we convert the quaternions to Euler angles; a conversion sketch is given below. The visualization results are shown in Figure 17. Taking Figure 17a as an example, the red, green, and blue arrows on the left are the X-, Y-, and Z-axes, and their intersection is the satellite centroid; the upper left image shows the ground truth position and the lower left image shows the predicted position. The plots on the right show, from top to bottom, the roll, pitch, and yaw angles; the solid red line is the ground truth attitude and the blue dotted line is the predicted attitude. As the figure shows, the predicted position and attitude deviate from the ground truth only under the complex background composed of deep space and the earth with weak illumination, as in Figure 17b, and there is almost no deviation under other conditions. The proposed method therefore achieves good pose estimation results.
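For reference, a common quaternion-to-Euler conversion is sketched below; the scalar-first quaternion ordering and ZYX convention are our assumptions and must match the dataset's definition in practice.

```python
import numpy as np

def quat_to_euler(q):
    """Convert a unit quaternion (w, x, y, z) to roll, pitch, yaw in radians
    (ZYX convention)."""
    w, x, y, z = q
    roll = np.arctan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    return roll, pitch, yaw

print(np.degrees(quat_to_euler((1.0, 0.0, 0.0, 0.0))))  # identity -> (0, 0, 0)
```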

4. Conclusions

This paper has presented a deep-learning-based pose estimation method for non-cooperative targets. First, we introduced the overall framework of the method; then, we adopted YOLOv5s, with its small model and fast computation, as the target detection network for non-cooperative spacecraft; finally, we used DSC and scSE modules to improve HRNet's multi-resolution sub-networks and multi-resolution fusion stage, making the network model lightweight while further improving prediction accuracy. The experimental results on the SPEED dataset show that, compared with several state-of-the-art methods, the proposed pose estimation method achieves a good balance among the number of parameters, the amount of computation, and accuracy, demonstrating its effectiveness and superiority.
Since the dataset used consists of static images rather than real space scenes of actual spacecraft motion, the scope of application is still relatively narrow. In follow-up research, we will apply the proposed method to video datasets to improve the generality of the pose estimation, and we will continue to explore new, efficient pose estimation methods to improve performance.

Author Contributions

Methodology, H.S.; software and simulation, H.S.; writing—original draft preparation, Y.J.; writing—review and editing, Y.J.; supervision, C.H.; project administration, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of Heilongjiang Province (Grant No. LH2019F024) and the National Science Foundation for Young Scientists of China (Grant No. 52102455).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study did not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Bao, W. Research status and development trend of aerospace vehicle control technology. Acta Autom. Sin. 2013, 39, 697–702.
2. Liang, B.; Du, X.; Li, C.; Xu, W. Advances in Space Robot on-orbit Servicing for Non-cooperative Spacecraft. Jiqiren 2012, 34, 242–256.
3. Li, R.; Wang, S.; Long, Z.; Gu, D. Undeepvo: Monocular Visual Odometry Through Unsupervised Deep Learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7286–7291.
4. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-time 6-dof Camera Relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946.
5. Cassinis, L.P.; Fonod, R.; Gill, E. Review of the Robustness and Applicability of Monocular Pose Estimation Systems for Relative Navigation with an Uncooperative Spacecraft. Prog. Aerosp. Sci. 2019, 110, 100548.
6. Sharma, S.; Ventura, J.; D’Amico, S. Robust Model-based Monocular Pose Initialization for Noncooperative Spacecraft Rendezvous. J. Spacecr. Rocket. 2018, 55, 1414–1429.
7. Park, T.H.; Sharma, S.; D’Amico, S. Towards Robust Learning-based Pose Estimation of Noncooperative Spacecraft. arXiv 2019.
8. Lowe, D.G. Distinctive Image Features from Scale-invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
9. Biondi, G.; Mauro, S.; Mohtar, T.; Pastorelli, S.; Sorli, M. Attitude Recovery from Feature Tracking for Estimating Angular Rate of Non-cooperative Spacecraft. Mech. Syst. Signal Process. 2017, 83, 321–336.
10. Bay, H.; Tuytelaars, T.; Gool, L.V. Surf: Speeded Up Robust Features. In Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417.
11. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary Robust Independent Elementary Features. In Proceedings of the 11th European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; pp. 778–792.
12. D’Amico, S.; Benn, M.; Jørgensen, J.L. Pose Estimation of an Uncooperative Spacecraft from Actual Space Imagery. Int. J. Space Sci. Eng. 2014, 2, 171–189.
13. Sharma, S.; D’Amico, S. Reduced-dynamics Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Monocular Vision. In Proceedings of the 38th AAS Guidance and Control Conference, Breckenridge, CO, USA, 30 January–4 February 2015; pp. 361–374.
14. Su, H.; Qi, C.R.; Li, Y.; Guibas, L.J. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3d Model Views. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2686–2694.
15. Ahn, B.; Park, J.; Kweon, I.S. Real-time Head Orientation from a Monocular Camera Using Deep Neural Network. In Proceedings of the 12th Asian Conference on Computer Vision, Singapore, 1–5 November 2014; pp. 82–96.
16. Chen, B.; Cao, J.; Parra, A.; Chin, T.-J. Satellite Pose Estimation with Deep Landmark Regression and Nonlinear Pose Refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2816–2824.
17. Yang, X.; Nan, X.; Song, B. D2N4: A Discriminative Deep Nearest Neighbor Neural Network for Few-shot Space Target Recognition. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3667–3676.
18. Wang, Z.; Zhang, Z.; Sun, X.; Li, Z.; Yu, Q. Transformer based Monocular Satellite Pose Estimation. Acta Aeronaut. Astronaut. Sin. 2022, 43, 325298.
19. Sharma, S.; D’Amico, S. Neural Network-based Pose Estimation for Noncooperative Spacecraft Rendezvous. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4638–4658.
20. Proença, P.F.; Gao, Y. Deep Learning for Spacecraft Pose Estimation From Photorealistic Rendering. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6007–6013.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
22. Fu, L.; Feng, Y.; Wu, J.; Liu, Z.; Gao, F.; Majeed, Y.; Al-Mallahi, A.; Zhang, Q.; Li, R.; Cui, Y. Fast and Accurate Detection of Kiwifruit in Orchard Using Improved YOLOv3-tiny Model. Precis. Agric. 2021, 22, 754–776.
23. Zhu, Q.; Zheng, H.; Wang, Y.; Cao, Y.; Guo, S. Study on the Evaluation Method of Sound Phase Cloud Maps Based on an Improved YOLOv4 Algorithm. Sensors 2020, 20, 4314.
24. Jocher, G. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 15 June 2021).
25. Ashraf, A.H.; Imran, M.; Qahtani, A.M.; Alsufyani, A.; Almutiry, O.; Mahmood, A.; Attique, M.; Habib, M. Weapons Detection for Security and Video Surveillance Using CNN and YOLOV5s. CMC-Comput. Mater. Contin. 2022, 70, 2761–2775.
26. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
27. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
28. Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks with Spatial and Channel “Squeeze and Excitation” Blocks. IEEE Trans. Med. Imaging 2018, 38, 540–549.
29. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
30. Kisantal, M.; Sharma, S.; Park, T.H.; Izzo, D.; Märtens, M.; D’Amico, S. Satellite Pose Estimation Challenge: Dataset, Competition Design, and Results. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4083–4098.
31. Isell. LabelImg. Available online: https://github.com/heartexlabs/labelImg (accessed on 20 February 2022).
32. ACT; SLAB. Pose Estimation Challenge. Available online: https://kelvins.esa.int/satellite-pose-estimation-challenge/leaderboard/results (accessed on 5 July 2022).
Figure 1. Overall framework of the proposed non-cooperative target pose estimation method.
Figure 2. YOLOv5s network structure.
Figure 3. Focus structure.
Figure 4. Slice operation.
Figure 5. CSP structure. (a) CSP1_X. (b) CSP2_X.
Figure 6. Neck structure.
Figure 7. HRNet network structure.
Figure 8. Multi-resolution subnet.
Figure 9. Exchange unit.
Figure 10. Multi-resolution feature fusion.
Figure 11. Bottleneck structure and Basicblock structure.
Figure 12. scSE-LHRNet network structure.
Figure 13. The type of convolution used in this article. (a) Ordinary convolution. (b) Depthwise separable convolution.
Figure 14. scSE block.
Figure 15. Object detection results of non-cooperative targets. (a) Deep space background intense lighting. (b) Dark background dark lighting. (c) Earth background strong lighting. (d) Earth background dark lighting.
Figure 16. Loss function curve.
Figure 17. Pose estimation results of non-cooperative spacecraft. (a) Deep space background strong lighting. (b) Complex background dark lighting. (c) Earth background strong lighting. (d) Complex background strong lighting.
Table 1. Comparison results of object detection.

Model          AP        mIoU      Model Size/M
YOLOv3-tiny    86.23%    79.36%    32.97
YOLOv4         94.98%    87.35%    244.59
Faster RCNN    95.75%    88.43%    172.32
YOLOv5s        94.22%    86.61%    27.98
Table 2. Comparison results of the ablation experiment.

Model          Error     Parameters/M    GFLOPs
HRNet          0.0386    28.50           11.26
LHRNet         0.0512    9.23            5.48
scSE-HRNet     0.0153    39.76           14.31
scSE-LHRNet    0.0169    11.82           6.95
Table 3. Comparison results of pose estimation.

Method                  Error
Magpies [32]            0.1393
Motokimural [32]        0.0759
Team_Platypus [32]      0.0701
Stanford_slab [32]      0.0621
SPN [19]                0.0571
Pedro_fairspace [32]    0.0570
UniAdelaide [16]        0.0292
D2N4 [17]               0.0229
EPFL_cvlab [32]         0.0215
Our method              0.0169
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
