1. Introduction
Single-shot camera localization (also known as model-based localization, image-based localization, visual localization, or camera relocalization) estimates the camera pose from a single image of a previously visited scene. It has two phases. In phase 1, a model (e.g., a 3D point cloud or a neural network) is built from a set of images. In phase 2, the camera pose relative to the model is estimated from a single image. Single-shot camera localization can be divided into six categories: the Structure-Based Method [1], Structure-Based Using Image Retrieval [2], Pose Interpolation [3,4], Relative Pose Estimation [5,6], Scene Point Regression [7,8,9], and Absolute Pose Regression [10,11,12,13,14,15].
Assume that in phase 1 the scene model has been built. In phase 2, the Structure-Based Method [1] first extracts local features from the query image, then performs 2D-3D feature matching between the query image and the model, and finally estimates the camera pose from the resulting 2D-3D correspondences with a PnP algorithm. The Structure-Based Using Image Retrieval method [2] is similar to the Structure-Based Method; the difference is that it uses image retrieval to narrow the search region of 2D-3D feature matching, which enables fast pose estimation in large environments. The Pose Interpolation method [3,4] retrieves the top-k images most similar to the query image from a reference database and interpolates the camera poses of the retrieved images to obtain the pose estimate of the query image. After image retrieval, the Relative Pose Estimation method [5,6] employs a relative pose regression (RPR) network to achieve more accurate localization: the query image and its nearest-neighbor reference image are fed into the RPR network simultaneously, and the network learns both the relative pose between the query image and each retrieved image and scene-agnostic global features of a single image. The Scene Point Regression method [7,8,9] is similar to the Structure-Based Method; the difference is that it no longer relies on feature extraction for 2D-3D feature matching, but instead uses machine-learning or deep-learning methods to regress the correspondences between 2D image points and 3D model points. The Absolute Pose Regression method [10,11,12,13,14,15] uses an end-to-end deep-learning model to regress the camera pose of the query image directly.
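The Pose Interpolation step can be sketched as follows. This is a minimal illustration, not the exact method of [3,4]: it assumes global image descriptors are already available, retrieves the top-k neighbors by cosine similarity, and blends their poses with similarity weights (translations by weighted mean, rotations by a normalized weighted quaternion sum, which is a reasonable approximation only for nearby rotations). All names are illustrative.

```python
import numpy as np

def interpolate_pose(query_desc, ref_descs, ref_trans, ref_quats, k=3):
    """Blend the poses of the k reference images most similar to the query.

    query_desc: (d,) global descriptor of the query image
    ref_descs:  (n, d) descriptors of the reference images
    ref_trans:  (n, 3) reference translations
    ref_quats:  (n, 4) reference unit quaternions
    """
    # cosine similarity between the query and every reference descriptor
    sims = ref_descs @ query_desc / (
        np.linalg.norm(ref_descs, axis=1) * np.linalg.norm(query_desc))
    top = np.argsort(-sims)[:k]              # indices of the top-k neighbors
    w = sims[top] / sims[top].sum()          # normalized similarity weights

    t = w @ ref_trans[top]                   # weighted mean of the translations

    # flip quaternions into a common hemisphere, then blend and renormalize
    q_top = ref_quats[top]
    q_top = q_top * np.sign(q_top @ q_top[0])[:, None]
    q = w @ q_top
    return t, q / np.linalg.norm(q)
```

When all retrieved neighbors share the same pose, the interpolated pose reduces to that pose; for widely spaced rotations, a proper spherical interpolation would be preferable to the linear quaternion blend used here.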
Among these six categories, the Absolute Pose Regression (APR) method regresses the camera pose of the query image quickly with an end-to-end deep-learning model and needs little storage beyond the network weights. However, the localization accuracy of the APR method is not as good as that of the other methods. Moreover, the pose estimate of an APR method can be viewed as a linear combination of base poses [16]. In previous APR methods, the base poses are learned from the training data; APR methods are thus data-driven, and the distribution of the ground-truth camera trajectories used for training limits the base poses that can be learned. In principle, three orthonormal vectors suffice for their linear combinations to cover the entire 3D space; since the number of base poses is usually far greater than three, learned base poses are redundant. However, because the training trajectory does not necessarily cover diverse directions, even these redundant base poses may fail to span the space outside the training trajectory. In other words, an APR model that adopts learned base poses may not generalize. To reduce the impact of this issue, we turn to handcrafted base poses instead of learning-based base poses.
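A toy numerical illustration of this argument (the numbers are invented for the example): if all learned base translations lie in the plane of the training trajectory, no linear combination of them reaches a pose with out-of-plane motion, whereas a handcrafted orthonormal basis reaches any target exactly.

```python
import numpy as np

# Base translations "learned" from a planar training trajectory all lie in
# the x-y plane, so no linear combination can produce motion along z.
learned_bases = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [1.0, 1.0, 0.0],
                          [2.0, -1.0, 0.0]])   # redundant (4 bases), but rank 2

# A handcrafted orthonormal basis spans the whole 3D space by construction.
handcrafted_bases = np.eye(3)

target = np.array([0.5, -0.2, 1.3])            # pose outside the training plane

def best_fit(bases, target):
    """Least-squares coefficients, returning the closest reachable pose."""
    coeffs, *_ = np.linalg.lstsq(bases.T, target, rcond=None)
    return bases.T @ coeffs

err_learned = np.linalg.norm(best_fit(learned_bases, target) - target)
err_handcrafted = np.linalg.norm(best_fit(handcrafted_bases, target) - target)
# err_learned equals 1.3 (the entire z component is unreachable),
# while err_handcrafted is zero up to numerical precision.
```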
In this paper, we propose an absolute pose regression (APR) method for camera localization using an RGB-D dual-stream network and handcrafted base poses. The method fuses color and depth information for more accurate localization: a dual-stream network architecture processes color images and depth images separately, and handcrafted base poses reduce the network's dependence on the movement trajectories in the training data.
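A rough sketch of how such a design can be wired together (a simplified stand-in for the architecture, not its exact implementation): each stream predicts one coefficient vector over the handcrafted base poses, the two vectors are averaged, and the fused coefficients weight the bases to produce the pose.

```python
import numpy as np

def fuse_and_decode(c_rgb, c_depth, base_t, base_q):
    """Average the two streams' coefficient vectors and decode a pose.

    c_rgb, c_depth: per-stream coefficients; the first len(base_t) entries
    weight the translation bases, the remainder weight the quaternion bases.
    base_t: (m, 3) handcrafted translation bases
    base_q: (n, 4) handcrafted rotation bases (quaternions)
    """
    c = 0.5 * (c_rgb + c_depth)             # fuse the color and depth streams
    m = base_t.shape[0]
    t = c[:m] @ base_t                      # translation as a base combination
    q = c[m:] @ base_q                      # rotation as a base combination
    return t, q / np.linalg.norm(q)         # unit-normalize the quaternion
```

In the real network the coefficient vectors would come from the pooled feature embeddings of each stream; here they are passed in directly to keep the sketch self-contained.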
We conducted detailed ablation studies that demonstrate the efficacy of the proposed method. Furthermore, comprehensive evaluations show that on the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, the proposed method achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods.
To summarize, this work has three contributions:
We use handcrafted base poses instead of learning-based base poses (e.g., [12,13,14,15]) to estimate camera poses, which prevents the estimated poses from overfitting the camera trajectories of the training data.
We use a dual-stream network to fuse color and depth information to obtain more accurate localization results.
On the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, it achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods.
The remainder of this paper is organized as follows. Section 2 reviews related work, and Section 3 describes the proposed APR method. Section 4 presents experimental results that attest to the effectiveness of the proposed method, and Section 5 concludes this study.
5. Conclusions & Future Work
We propose a novel APR method using an RGB-D dual-stream network and handcrafted base poses. On the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, it achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods. The main contribution of this work is the use of handcrafted base poses instead of learning-based base poses (e.g., [10,11,12,13,14,15]) for estimating camera poses, which prevents overfitting to the camera poses of the training data.
To further improve the proposed method, a depth completion network (such as PENet [24], NLSPN [25], or FusionNet [26]) could be added to the depth stream to enable end-to-end training. In addition, we could use a weighted sum of embeddings instead of their average for the base pose coefficients, to better integrate the feature embeddings of the color stream and the depth stream; the embedding weights could be obtained by learning the confidence of each stream.
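The proposed confidence-weighted fusion could look like the following sketch. This is a hypothetical formulation: in the actual extension the per-stream confidences would be predicted by the network rather than passed in as scalars.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_fusion(e_rgb, e_depth, conf_rgb, conf_depth):
    """Weight the two streams' embeddings by learned per-stream confidences
    instead of averaging them (names and scalar confidences are illustrative)."""
    w = softmax(np.array([conf_rgb, conf_depth]))   # weights sum to 1
    return w[0] * e_rgb + w[1] * e_depth
```

With equal confidences this reduces to the plain average used by the current method; a stream that the network deems unreliable (e.g., a sparse or noisy depth map) would receive a smaller weight.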
If 3D point clouds are available, it may be better to use them as input instead of depth images and adopt a network designed for unordered point sets (such as PointNet [27] or PointNet++ [28]) to extract features. Using 3D point clouds has three advantages. First, compared to depth images, point clouds provide richer structural information. Second, since sparse point cloud data can be fed directly into the network, no depth completion is needed, which improves localization efficiency. Third, 3D LiDAR is usually adopted for distance measurement in outdoor scenes, so using point clouds as input removes the dual-stream network's limitation to indoor scenes.