Article

Three-View Relative Pose Estimation Under Planar Motion Constraints

Naval Aviation University, Yantai 264001, China
* Author to whom correspondence should be addressed.
Vision 2025, 9(3), 72; https://doi.org/10.3390/vision9030072
Submission received: 3 July 2025 / Revised: 16 August 2025 / Accepted: 18 August 2025 / Published: 25 August 2025

Abstract

Vision-based relative pose estimation serves as a core technology for high-precision localization in autonomous vehicles and mobile platforms. To overcome the limitations of conventional three-view pose estimation methods that rely heavily on dense feature matching and incur high computational costs, this paper proposes an efficient three-point correspondence algorithm based on planar motion constraints. The method constructs trifocal tensor constraint equations and develops a linearized three-point solution framework, enabling rapid relative pose estimation using merely three corresponding points in three views. In simulation experiments, we systematically analyzed the robustness of the algorithm under complex conditions that included image noise, angular deviation, and vibration. The method was further validated in real-world scenarios using the KITTI public dataset. Experimental results demonstrate that under the condition of satisfying the planar motion assumption, the proposed method achieves significantly improved computational efficiency compared with traditional methods (including general three-view methods, two-view planar motion estimation methods, and classical two-view methods), with the single-solution time reduced by more than 80% compared to general three-view methods. On the public dataset, our algorithm achieves a median rotation estimation error below 0.0545 degrees and maintains a translation estimation error below 2.1319 degrees. The proposed method exhibits higher computational efficiency and better numerical stability compared to conventional algorithms. This research provides an effective pose estimation solution with real-time performance and high accuracy for planar motion platforms such as autonomous vehicles and indoor mobile robots, demonstrating substantial engineering application value.

1. Introduction

For moving platforms such as drones and unmanned vehicles, estimating position and attitude from onboard sensors is a prerequisite for meeting application requirements [1]. Traditional pose estimation methods primarily rely on Global Navigation Satellite Systems (GNSS) and Inertial Measurement Units (IMUs). However, in indoor environments, urban canyons, or complex electromagnetic conditions, GNSS signals are susceptible to multipath effects, non-line-of-sight reception, and even spoofing interference [2,3]. Vision-based pose estimation has attracted significant research attention in recent years as an effective complementary means of improving localization reliability, owing to its non-contact nature, high accuracy, and low cost [4]. Nevertheless, visual pose estimation still faces many challenges; in particular, efficiently and robustly solving for the camera's relative pose in complex, dynamic scenarios has become a research hotspot.
Among vision-based relative pose estimation methods, two-view techniques are already very mature. However, the presence of mismatches frequently leads to erroneous pose solutions, making the estimation process prone to degenerate cases. In contrast, the three-view method introduces additional geometric constraints that effectively resolve correspondence ambiguity and improve the robustness of pose estimation [5]. Notably, in many practical applications (e.g., urban road navigation, indoor mobile robots), the platform motion often satisfies the planar motion assumption. This assumption reduces the 6-DoF pose problem to a 3-DoF problem, significantly simplifying the solution. Therefore, this paper proposes a three-view pose estimation method based on three-point correspondences by exploiting planar motion constraints. The method establishes linear equations through the trifocal tensor and efficiently solves for rotation and translation parameters via singular value decomposition (SVD), as illustrated in Figure 1. The main contributions of this paper are as follows:
  • Theoretical innovation: By integrating planar motion constraints with trifocal tensor theory, we derive a linear solver requiring only three point correspondences, eliminating the dependency on dense feature points in conventional methods. This theoretical framework reduces the 6-DoF problem to a 3-DoF problem, significantly lowering computational complexity.
  • Computational efficiency: Experiments show that our method achieves a single-solution time of only 0.142 ms, representing a 6× speedup compared to traditional three-view pose estimation methods. This efficiency advantage arises from the linearized solver design, which eliminates the need for iterative optimization and thus guarantees reliable real-time performance.
  • Robustness verification: The method maintains strong stability under challenging conditions, including image noise (≤2 pixels), angular deviation (≤1°), and minor camera vibration. Notably, in the motion sequences of the KITTI dataset, the median rotational error is below 0.0545°, and the translational error is below 2.1319°, validating its applicability in real-world scenarios.

2. Related Work

2.1. Three-View Pose Estimation

Heyden et al. [6] proposed the trifocal tensor (TFT), which provides a theoretical framework for three-view geometry. Torr et al. [7] demonstrated that the tensor can be computed from six corresponding point triplets across three images. Indelman et al. [8] developed a vision-aided navigation method based on three-view geometry, combining it with inertial navigation to estimate vehicle pose. Ponce et al. [9] introduced six homogeneous constraints and explored a novel approach to characterizing the three-view problem. Ding et al. [5] investigated solving the three-view problem from four feature-point correspondences and demonstrated GPU acceleration for their HC solver, which requires an initial solution. Li et al. [10] introduced two novel solvers that incorporate vertical direction constraints from the IMU to accurately estimate relative poses in three-view geometry. Although these methods offer some performance enhancement, their general-purpose designs do not adequately exploit domain-specific motion constraints (e.g., planar motion), resulting in unnecessary computational overhead.

2.2. Planar Motion Pose Estimation

The planar-motion pose estimation problem was first addressed by Ortin and Montiel [11], who proposed two solvers for computing camera motion between two images: an iterative 2-point algorithm and a 3-point linear algorithm. Chou and Wang [12] developed a non-iterative 2-point algorithm based on an ellipse formulation to address image matching under significant viewpoint changes. Building on this work, Choi et al. [13] transformed the ellipse formulation into finding intersections between a line and a unit circle, proposing two non-iterative 2-point algorithms for relative pose estimation under planar motion. Guan et al. [14] proposed a novel minimal solution for planar motion estimation by utilizing constraints from image feature descriptors. While these methods mainly target two-view configurations, research on three-view pose estimation under planar motion remains relatively underdeveloped. Thus, developing efficient and robust three-view pose estimation methods for planar motion scenarios remains an unresolved research problem.

3. Model Building

3.1. The Projection Matrix Under Planar Motion

In 3D space, camera motion can be fully described by a rigid transformation comprising two components: rotation and translation. The rotation has 3 degrees of freedom (DOF), corresponding to rotations about the X-axis, Y-axis, and Z-axis, commonly parameterized as pitch, yaw, and roll angles, respectively. The rotation matrices R x , R y , and R z are defined as follows:
R_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix}, \quad
R_y = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}, \quad
R_z = \begin{pmatrix} \cos\beta & -\sin\beta & 0 \\ \sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{pmatrix}.
where α , θ , and β represent the camera’s pitch, yaw, and roll angles, respectively. The 3D rotation matrix R is the product of the three aforementioned rotation matrices:
R = R_x R_y R_z.
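The three elementary rotations and their product in Equation (4) can be sketched as follows (a minimal NumPy illustration; the angle values are arbitrary examples, not taken from the paper):

```python
import numpy as np

def rot_x(alpha):
    """Rotation about the X-axis (pitch)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(theta):
    """Rotation about the Y-axis (yaw)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(beta):
    """Rotation about the Z-axis (roll)."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Full 3-DoF rotation as the product of Equation (4); angles are example values.
R = rot_x(0.1) @ rot_y(0.2) @ rot_z(0.05)
```

Each factor is orthonormal with unit determinant, so their product is again a valid rotation matrix.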
The camera’s translational motion in general space also possesses 3 DOFs, corresponding to displacements along the X-axis, Y-axis, and Z-axis, which can be represented by a translation vector t :
t = \begin{pmatrix} t_x & t_y & t_z \end{pmatrix}^T,
where t x , t y , and t z represent the camera’s translations along the X-axis, Y-axis, and Z-axis, respectively. The translation vector explicitly captures the camera’s positional displacement, which, together with the rotation matrix, fully determines its rigid motion in 3D space.
The projection of a world point X to image point x is given by matrix P :
x = P X .
The projection matrix P is given by
P = K [\, R \mid -R t \,].
The matrix K denotes the camera calibration matrix, which contains intrinsic parameters such as focal length and principal point coordinates, and can be precisely obtained through camera calibration.
When the calibration matrix K is known, (6) can be expressed as
K^{-1} x = K^{-1} P X.
This yields the normalized image coordinates \hat{x} = K^{-1} x and the normalized projection matrix \hat{P} = K^{-1} P. The point \hat{x} represents the image projection of the 3D world point X under a normalized camera model in which the calibration matrix K equals the identity matrix I. After normalization, the projection model becomes independent of specific camera parameters. All computations are performed in a unified coordinate system, eliminating parameter variations across different camera models and enhancing the generality and robustness of the algorithm. For computational convenience, this paper employs normalized projection matrices and normalized image coordinates. The projection matrix P_k can be expressed as
P_k = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} R_k & -R_k t_k \\ 0^T & 1 \end{pmatrix} = [\, R_k \mid -R_k t_k \,],
where R k is the rotation matrix and t k the translation vector for the k-th view.
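The normalization step \hat{x} = K^{-1} x described above can be sketched as follows (a minimal NumPy example; the calibration values below are hypothetical, not taken from the paper):

```python
import numpy as np

# Hypothetical pinhole calibration matrix K (focal lengths and principal point).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

def normalize(x_pix, K):
    """Normalized image coordinates x_hat = K^-1 x for a homogeneous pixel point."""
    x_hat = np.linalg.solve(K, x_pix)  # same result as inv(K) @ x_pix, numerically safer
    return x_hat / x_hat[2]            # rescale so the third coordinate is 1

x_hat = normalize(np.array([640.0, 200.0, 1.0]), K)
```

Mapping the result back through K recovers the original pixel coordinates, which is the sense in which the normalized model is parameter-free.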
Mobile robots and autonomous vehicles exhibit characteristic planar motion in practical applications. A representative case is indoor navigation robots operating on ground surfaces, where the camera’s Y-axis remains normal to the ground plane, and motion between different views is restricted to Y-axis rotation and two-dimensional planar displacement. Under these constraints, the rotation matrix R k reduces to a simplified form dependent solely on the yaw angle θ :
R_k = R_y = \begin{pmatrix} C_{y_k} & 0 & S_{y_k} \\ 0 & 1 & 0 \\ -S_{y_k} & 0 & C_{y_k} \end{pmatrix}.
Here C_{y_k} = \cos\theta_k and S_{y_k} = \sin\theta_k, where \theta_k denotes the yaw angle of the k-th view. This simplification reduces the complexity of the rotation matrix, resulting in a more concise and efficient representation of rotation while maintaining consistency with the physical characteristics of planar motion.
Under planar motion, the camera’s Y-axis translation is always zero, and the translation vector t k simplifies to the following:
t_k = \begin{pmatrix} t_{x_k} & 0 & t_{z_k} \end{pmatrix}^T,
where t x k and t z k represent the translations along the X-axis and the Z-axis for the k-th view, respectively. This simplification accounts for the positional invariance along the camera’s Y-axis (normal to the motion plane) during planar movement, resulting in a more compact projection matrix representation.
Substituting Equations (10) and (11) into Equation (9) yields the planar motion projection matrix P k as
P_k = \begin{pmatrix} C_{y_k} & 0 & S_{y_k} & -C_{y_k} t_{x_k} - S_{y_k} t_{z_k} \\ 0 & 1 & 0 & 0 \\ -S_{y_k} & 0 & C_{y_k} & S_{y_k} t_{x_k} - C_{y_k} t_{z_k} \end{pmatrix}.
The projection matrix P k in Equation (12) represents the core expression under planar motion constraints. By incorporating the physical characteristics of practical camera motion and simplifying both rotation and translation components, it yields a compact form that directly facilitates the construction of linear equation systems.
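The planar-motion projection matrix of Equation (12) can be sketched as follows (a NumPy illustration; the yaw angle and translation values are hypothetical examples):

```python
import numpy as np

def planar_projection(theta, tx, tz):
    """Normalized projection matrix P_k = [R_k | -R_k t_k] under planar motion (Eq. (12))."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])   # yaw-only rotation, Eq. (10)
    t = np.array([tx, 0.0, tz])                        # in-plane translation, Eq. (11)
    return np.hstack([R, (-R @ t).reshape(3, 1)])

P = planar_projection(0.3, 1.0, 2.0)   # hypothetical yaw and translation
```

The last column equals -R_k t_k, so the explicit entries of Equation (12) fall out of the general form [R | -Rt] directly.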

3.2. Linear 3-Point Method

In three-view pose estimation, the first view’s coordinate system is assumed as the world reference frame, i.e.,
R_1 = I_{3 \times 3}, \quad t_1 = (0, 0, 0)^T.
Here, R k and t k denote the rotation matrix and translation vector of the k-th view relative to the first view (k = 1, 2, 3). Using Equation (12), the projection matrices for all three views are expressed as
P_1 = [\, I \mid 0 \,], \quad P_2 = [\, A \mid a_4 \,], \quad P_3 = [\, B \mid b_4 \,],
where A and B are 3 × 3 matrices, with vectors a i and b i denoting the i-th columns of the corresponding matrices (i = 1, …, 4).
In line with the foundational framework laid out in Hartley and Zisserman's Multiple View Geometry in Computer Vision (2nd Edition, Chapters 15 and 16) [15], the trifocal tensor is the core mathematical entity of three-view geometry, encapsulating the intrinsic geometric constraints between views. The trifocal tensor T is a 3 × 3 × 3 tensor whose 3 × 3 slices T_i are given by
T_i = a_i b_4^T - a_4 b_i^T, \quad i = 1, 2, 3.
In general, the trifocal tensor has 18 DOFs, which can be derived by counting the DOFs of the projection matrices: the three projection matrices P_1, P_2, and P_3 contain 33 DOFs in total. Each 3 × 4 projection matrix has 12 elements, but scale ambiguity removes one DOF, so a single-view projection matrix has 11 DOFs. However, since the definition of the trifocal tensor is independent of the projective transformation of the world coordinate system (the choice of world frame does not alter the essential properties of the tensor), and a 3D projective transformation itself has 15 DOFs, this redundant component must be excluded, ultimately yielding 3 × 11 − 15 = 18 independent parameters. This corresponds precisely to the number of independent parameters the trifocal tensor requires to describe the geometric relationships among three views [16,17].
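Equation (15), together with the point incidence relation that appears later as Equation (22), can be checked numerically as follows (a NumPy sketch; the random camera matrices and 3D point are purely illustrative and carry no planar constraint):

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(a) @ b = np.cross(a, b)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def trifocal_slices(P2, P3):
    """Slices T_i = a_i b4^T - a4 b_i^T for canonical cameras P1 = [I | 0] (Eq. (15))."""
    a, b = P2.T, P3.T          # a[i], b[i] are the matrix columns (0-based)
    return [np.outer(a[i], b[3]) - np.outer(a[3], b[i]) for i in range(3)]

# Random synthetic configuration: P1 canonical, P2 and P3 arbitrary 3x4 matrices.
rng = np.random.default_rng(7)
P2, P3 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
T = trifocal_slices(P2, P3)

X = np.append(rng.standard_normal(3), 1.0)       # homogeneous 3D point
x1, x2, x3 = X[:3], P2 @ X, P3 @ X               # projections (P1 = [I | 0])
res = skew(x2) @ sum(x1[i] * T[i] for i in range(3)) @ skew(x3)
```

For exact correspondences the 3 × 3 residual `res` vanishes identically, which is the incidence constraint the proposed solver later turns into linear equations.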
The simplified method for the tensor under planar motion proposed in this paper is an extension of classical theory under specific motion constraints: By reducing the DOFs of camera motion from 6 (3 rotational + 3 translational) to 3 (only rotation around the vertical axis + translation within the plane), the camera centers are strictly constrained within the X-Z plane. In this scenario, the normal vector of the plane formed by a spatial point and the three camera centers is fixed along the Y-axis (as shown in Figure 2). This stronger geometric constraint directly simplifies the tensor structure—the minimum number of corresponding point pairs required for solving is reduced from 7 to 3.
By substituting the simplified projection matrices Equations (12) and (14) into Equation (15), the specialized expression of the trifocal tensor under planar motion constraints is derived [10,18]:
T_1 = \begin{pmatrix} Q_1 & 0 & Q_2 \\ 0 & 0 & 0 \\ Q_3 & 0 & Q_4 \end{pmatrix}, \quad
T_2 = \begin{pmatrix} 0 & Q_5 & 0 \\ Q_6 & 0 & Q_7 \\ 0 & Q_8 & 0 \end{pmatrix}, \quad
T_3 = \begin{pmatrix} Q_9 & 0 & Q_{10} \\ 0 & 0 & 0 \\ Q_{11} & 0 & Q_{12} \end{pmatrix}.
It should be clarified that the planar motion constraint applies to the camera’s motion trajectory, specifically the position of the camera center, rather than the 3D points in the scene. While 3D points can still be located arbitrarily in space, as the camera centers are confined to the X-Z plane, the coplanar relationship formed between them and the spatial points is strengthened with a fixed normal vector. This enables the tensor parameters, which originally required more point correspondences for a solution, to be uniquely determined with fewer point correspondences—this constitutes the core logic of the simplified method.
To express the trifocal tensor more concisely, we introduce the symbols Q_1, Q_2, \ldots, Q_{12} to denote the nonzero entries in Equations (16)–(18). The specific expressions are as follows:
Q_1 = C_{y_3} Q_5 + C_{y_2} Q_6, \quad Q_2 = C_{y_2} Q_7 - S_{y_3} Q_5, \quad Q_3 = C_{y_3} Q_8 - S_{y_2} Q_6, \quad Q_4 = -S_{y_3} Q_8 - S_{y_2} Q_7, \\
Q_5 = C_{y_2} t_{x_2} + S_{y_2} t_{z_2}, \quad Q_6 = -C_{y_3} t_{x_3} - S_{y_3} t_{z_3}, \quad Q_7 = S_{y_3} t_{x_3} - C_{y_3} t_{z_3}, \quad Q_8 = -S_{y_2} t_{x_2} + C_{y_2} t_{z_2}, \\
Q_9 = S_{y_3} Q_5 + S_{y_2} Q_6, \quad Q_{10} = C_{y_3} Q_5 + S_{y_2} Q_7, \quad Q_{11} = S_{y_3} Q_8 + C_{y_2} Q_6, \quad Q_{12} = C_{y_3} Q_8 + C_{y_2} Q_7.
An examination of Equation (19) reveals the following findings:
Q_{10} Q_8 - Q_3 Q_5 = (C_{y_3} Q_5 + S_{y_2} Q_7) Q_8 - (C_{y_3} Q_8 - S_{y_2} Q_6) Q_5 = S_{y_2} (Q_7 Q_8 + Q_5 Q_6), \\
Q_{11} Q_5 + Q_2 Q_8 = (S_{y_3} Q_8 + C_{y_2} Q_6) Q_5 + (C_{y_2} Q_7 - S_{y_3} Q_5) Q_8 = C_{y_2} (Q_5 Q_6 + Q_7 Q_8), \\
Q_{11} Q_7 - Q_2 Q_6 = (S_{y_3} Q_8 + C_{y_2} Q_6) Q_7 - (C_{y_2} Q_7 - S_{y_3} Q_5) Q_6 = S_{y_3} (Q_7 Q_8 + Q_5 Q_6), \\
Q_3 Q_7 + Q_{10} Q_6 = (C_{y_3} Q_8 - S_{y_2} Q_6) Q_7 + (C_{y_3} Q_5 + S_{y_2} Q_7) Q_6 = C_{y_3} (Q_5 Q_6 + Q_7 Q_8), \\
C_{y_2} Q_5 - S_{y_2} Q_8 = C_{y_2} (C_{y_2} t_{x_2} + S_{y_2} t_{z_2}) - S_{y_2} (-S_{y_2} t_{x_2} + C_{y_2} t_{z_2}) = (C_{y_2}^2 + S_{y_2}^2) t_{x_2}, \\
S_{y_2} Q_5 + C_{y_2} Q_8 = S_{y_2} (C_{y_2} t_{x_2} + S_{y_2} t_{z_2}) + C_{y_2} (-S_{y_2} t_{x_2} + C_{y_2} t_{z_2}) = (S_{y_2}^2 + C_{y_2}^2) t_{z_2}, \\
S_{y_3} Q_7 - C_{y_3} Q_6 = S_{y_3} (S_{y_3} t_{x_3} - C_{y_3} t_{z_3}) - C_{y_3} (-C_{y_3} t_{x_3} - S_{y_3} t_{z_3}) = (S_{y_3}^2 + C_{y_3}^2) t_{x_3}, \\
-S_{y_3} Q_6 - C_{y_3} Q_7 = -S_{y_3} (-C_{y_3} t_{x_3} - S_{y_3} t_{z_3}) - C_{y_3} (S_{y_3} t_{x_3} - C_{y_3} t_{z_3}) = (S_{y_3}^2 + C_{y_3}^2) t_{z_3}.
Since C_{y_2}^2 + S_{y_2}^2 = 1 and S_{y_3}^2 + C_{y_3}^2 = 1, we derive
S_{y_2} = \frac{Q_8 Q_{10} - Q_3 Q_5}{Q_5 Q_6 + Q_7 Q_8}, \quad S_{y_3} = \frac{Q_7 Q_{11} - Q_2 Q_6}{Q_5 Q_6 + Q_7 Q_8}, \quad C_{y_2} = \frac{Q_5 Q_{11} + Q_2 Q_8}{Q_5 Q_6 + Q_7 Q_8}, \quad C_{y_3} = \frac{Q_3 Q_7 + Q_6 Q_{10}}{Q_5 Q_6 + Q_7 Q_8}, \\
t_{x_2} = C_{y_2} Q_5 - S_{y_2} Q_8, \quad t_{x_3} = S_{y_3} Q_7 - C_{y_3} Q_6, \quad t_{z_2} = S_{y_2} Q_5 + C_{y_2} Q_8, \quad t_{z_3} = -S_{y_3} Q_6 - C_{y_3} Q_7.
Considering a set of corresponding points x_1 \leftrightarrow x_2 \leftrightarrow x_3 across the three views, the relationship between these corresponding points is as follows:
[\, x_2 \,]_\times \left( \sum_i x_1^i T_i \right) [\, x_3 \,]_\times = 0_{3 \times 3},
where x_1, x_2, and x_3 represent the image coordinates in View 1, View 2, and View 3, respectively, and x_1^i denotes the i-th coordinate of x_1. Equation (22) contains nine equations, but rank analysis reveals that only four are independent. With three point triplets, we construct 12 equations, forming the following system:
\begin{pmatrix} h_1^1 & h_2^1 & \cdots & h_{12}^1 \\ h_1^2 & h_2^2 & \cdots & h_{12}^2 \\ \vdots & \vdots & \ddots & \vdots \\ h_1^{12} & h_2^{12} & \cdots & h_{12}^{12} \end{pmatrix} \begin{pmatrix} Q_1 \\ Q_2 \\ \vdots \\ Q_{12} \end{pmatrix} = 0_{12 \times 1}.
Equation (23) has the form H q = 0, where each element h_m^n of H is known, with n indexing the n-th equation and m indexing the coefficient of Q_m. Therefore q = (Q_1, Q_2, \ldots, Q_{12})^T can be solved by singular value decomposition (SVD). Then S_{y_k}, C_{y_k}, t_{x_k}, and t_{z_k} (k = 2, 3) follow from Equation (21). Thus, the rotation matrix R_k and translation vector t_k can be determined, and the relative pose between the three views is obtained.
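The linear three-point pipeline described above can be sketched end to end as follows (a NumPy illustration, not the authors' MATLAB implementation; the planar poses and 3D points are hypothetical examples, and the coefficient rows of H are extracted numerically by probing basis vectors of q, which is equivalent to expanding Equation (22) symbolically):

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def planar_tensor_slices(q):
    """Arrange the 12 parameters Q_1..Q_12 into the sparse slices of Eqs. (16)-(18)."""
    Q = np.concatenate([[0.0], q])   # pad so Q[1]..Q[12] mirror the text's indexing
    T1 = np.array([[Q[1], 0, Q[2]], [0, 0, 0], [Q[3], 0, Q[4]]])
    T2 = np.array([[0, Q[5], 0], [Q[6], 0, Q[7]], [0, Q[8], 0]])
    T3 = np.array([[Q[9], 0, Q[10]], [0, 0, 0], [Q[11], 0, Q[12]]])
    return T1, T2, T3

def rows_for_triplet(x1, x2, x3):
    """9 rows of H: coefficients of each Q_m in [x2]_x (sum_i x1^i T_i) [x3]_x = 0."""
    H = np.zeros((9, 12))
    for m in range(12):              # the constraint is linear in q: probe each basis vector
        e = np.zeros(12)
        e[m] = 1.0
        T1, T2, T3 = planar_tensor_slices(e)
        M = skew(x2) @ (x1[0] * T1 + x1[1] * T2 + x1[2] * T3) @ skew(x3)
        H[:, m] = M.ravel()
    return H

def solve_planar_3pt(triplets):
    """Linear 3-point solver: SVD null vector of H, then the closed form of Eq. (21)."""
    H = np.vstack([rows_for_triplet(*t) for t in triplets])
    q = np.linalg.svd(H)[2][-1]      # right singular vector of the smallest singular value
    Q = np.concatenate([[0.0], q])
    d = Q[5] * Q[6] + Q[7] * Q[8]
    Sy2 = (Q[8] * Q[10] - Q[3] * Q[5]) / d
    Cy2 = (Q[5] * Q[11] + Q[2] * Q[8]) / d
    Sy3 = (Q[7] * Q[11] - Q[2] * Q[6]) / d
    Cy3 = (Q[3] * Q[7] + Q[6] * Q[10]) / d
    t2 = np.array([Cy2 * Q[5] - Sy2 * Q[8], Sy2 * Q[5] + Cy2 * Q[8]])    # (tx2, tz2), up to scale
    t3 = np.array([Sy3 * Q[7] - Cy3 * Q[6], -Sy3 * Q[6] - Cy3 * Q[7]])   # (tx3, tz3), same scale
    return np.arctan2(Sy2, Cy2), np.arctan2(Sy3, Cy3), t2, t3

# Self-check on exact synthetic data (hypothetical planar poses and 3D points).
def rot_y(th):
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

th2_gt, th3_gt = 0.1, 0.25
t2_gt, t3_gt = np.array([0.3, 0.0, 1.0]), np.array([0.7, 0.0, 2.0])
points = [np.array([0.5, 0.2, 4.0]), np.array([-0.6, -0.4, 5.0]), np.array([0.2, 0.7, 6.0])]
triplets = [(X / X[2], rot_y(th2_gt) @ (X - t2_gt), rot_y(th3_gt) @ (X - t3_gt))
            for X in points]
th2, th3, t2, t3 = solve_planar_3pt(triplets)
```

The yaw angles are scale-invariant ratios and come back exactly; the two translations are recovered up to one common scale factor, reflecting the usual monocular scale ambiguity.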

3.3. The Corrective Methods in Practical Applications

In practical applications, especially when dealing with real image data, to ensure the algorithm's feasibility we often need to first correct the camera's rotation matrix and translation vector so that they satisfy the planar motion constraints. Before correction, the camera rotation matrix between the first and second images is R and the translation vector is t; after correction, they are R' and t'. From Section 3.1, the following can be seen:
R = R_z R_x R_y, \quad t = (t_x, t_y, t_z)^T, \quad R' = R_y, \quad t' = (t_x, 0, t_z)^T.
It is easy to see that R = R_z R_x R'. Defining the correction matrix J = \begin{pmatrix} R_p & t_p \\ 0^T & 1 \end{pmatrix}, the following formula holds:
\begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix} = \begin{pmatrix} R_p & t_p \\ 0^T & 1 \end{pmatrix} \begin{pmatrix} R' & t' \\ 0^T & 1 \end{pmatrix}.
Setting R_p = R_z R_x, we can solve t_p = t - R_p t'. Thus, the correction matrix J is determined.
After solving for the correction matrix J, the image coordinates can be corrected so that the transformation between the coordinates of different images is represented by the corrected rotation matrix R' and translation vector t'. Let the homogeneous coordinates of a matching point be X_{c_1} in the first image and X_{c_2} in the second image; then
X_{c_2} = \begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix} X_{c_1}.
Substitute Equation (25) into Equation (26):
J^{-1} X_{c_2} = \begin{pmatrix} R' & t' \\ 0^T & 1 \end{pmatrix} X_{c_1}.
In other words, the corrected image coordinates satisfy X'_{c_2} = J^{-1} X_{c_2}.
When correcting the image coordinates, note that the image coordinates used in the dataset are normalized coordinates \tilde{x}_2, with X_{c_2} = (\lambda_2 \tilde{x}_2^T, 1)^T; the corrected image coordinates are therefore X'_{c_2} = J^{-1} (\lambda_2 \tilde{x}_2^T, 1)^T = ((R_p^{-1}(\lambda_2 \tilde{x}_2 - t_p))^T, 1)^T. However, the depth \lambda_2 of the matching point is unknown. To overcome this problem, we adopt the triangulation method [15] to recover depth, as shown in Figure 3.
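The correction step can be sketched as follows (a NumPy illustration; the Euler-angle extraction assumes the factorization R = R_z R_x R_y used above, and the test pose is a hypothetical example with small pitch and roll):

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(th):
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def correction(R, t):
    """Build J = [Rp tp; 0 1] with Rp = Rz Rx and tp = t - Rp t' (Section 3.3)."""
    # Recover the angles of the factorization R = Rz(beta) Rx(alpha) Ry(theta).
    alpha = np.arcsin(R[2, 1])
    beta = np.arctan2(-R[0, 1], R[1, 1])
    theta = np.arctan2(-R[2, 0], R[2, 2])
    Rp = rot_z(beta) @ rot_x(alpha)
    R_corr = rot_y(theta)                    # corrected rotation R' = Ry(theta)
    t_corr = np.array([t[0], 0.0, t[2]])     # corrected translation t' = (tx, 0, tz)
    tp = t - Rp @ t_corr
    J = np.eye(4)
    J[:3, :3], J[:3, 3] = Rp, tp
    return J, R_corr, t_corr
```

By construction, the uncorrected transform factors as [R t; 0 1] = J [R' t'; 0 1], which is exactly Equation (25); corrected coordinates then follow by applying J^{-1}.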
Triangulation refers to observing the same point P from different positions and inferring its coordinates from the corresponding points in the different images. Suppose \tilde{x}_1 and \tilde{x}_2 are the normalized coordinates of two corresponding points, \tilde{x}_1 = (u_1, v_1, 1)^T and \tilde{x}_2 = (u_2, v_2, 1)^T, and let P = (x, y, z, 1)^T denote the homogeneous coordinates of the point in the world coordinate system. From Section 3.1 we have:
\lambda_1 \tilde{x}_1 = [\, R_1 \mid t_1 \,] P = T_1 P, \quad \lambda_2 \tilde{x}_2 = [\, R_2 \mid t_2 \,] P = T_2 P.
Decompose T 1 into row vectors T 11 , T 12 , and T 13 , so we have:
\lambda_1 \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} = \begin{pmatrix} T_{11} \\ T_{12} \\ T_{13} \end{pmatrix} P,
i.e.,
\lambda_1 u_1 = T_{11} P, \quad \lambda_1 v_1 = T_{12} P, \quad \lambda_1 = T_{13} P.
In this system of equations, substituting \lambda_1 = T_{13} P into the first two equations gives:
(u_1 T_{13} - T_{11}) P = 0, \quad (v_1 T_{13} - T_{12}) P = 0.
Similarly, decomposing T_2 into row vectors T_{21}, T_{22}, and T_{23} yields
(u_2 T_{23} - T_{21}) P = 0, \quad (v_2 T_{23} - T_{22}) P = 0.
The systems of Equations (31) and (32) together form four equations of the form A P = 0. Since the point P has three unknowns, it can be computed using the least-squares method.
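The linear triangulation above can be sketched as follows (a NumPy illustration; the two camera matrices and the 3D point are hypothetical examples):

```python
import numpy as np

def triangulate(x1, x2, T1, T2):
    """Linear triangulation: stack the rows of Eqs. (31)-(32) and solve A P = 0."""
    A = np.array([x1[0] * T1[2] - T1[0],
                  x1[1] * T1[2] - T1[1],
                  x2[0] * T2[2] - T2[0],
                  x2[1] * T2[2] - T2[1]])
    # Homogeneous least squares: right singular vector of the smallest singular value.
    P_h = np.linalg.svd(A)[2][-1]
    return P_h[:3] / P_h[3]

# Hypothetical setup: camera 1 at the origin, camera 2 with T2 = [I | t2].
T1 = np.hstack([np.eye(3), np.zeros((3, 1))])
T2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
P_gt = np.array([0.4, -0.2, 5.0])                 # ground-truth 3D point
x1 = P_gt / P_gt[2]                               # normalized projection in view 1
x2 = P_gt + np.array([-1.0, 0.0, 0.0])
x2 = x2 / x2[2]                                   # normalized projection in view 2
P_rec = triangulate(x1, x2, T1, T2)
```

With exact correspondences the 4 × 4 matrix A has a one-dimensional null space, and the SVD solution recovers the point exactly; with noisy data the same call returns the least-squares point.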

4. Experimental Results and Analysis

This section presents extensive experiments on both synthetic data and real image sequences to validate the proposed method, with comparative analysis against existing algorithms in terms of computational efficiency, numerical stability, and noise robustness.
The proposed algorithm is designated as the Three-Point Method (abbreviated as Our-3pt). In our experiments, Our-3pt was systematically compared with the following classical algorithms: (1) Hartley-7pt, a general normalized solution for three-view problems (with its code available in [19]), which serves as our primary baseline; (2) Choi-2pt [13], a two-view method specifically designed for planar motion; and (3) two well-established two-view algorithms: Nister-5pt [20] and Hartley-8pt [15]. It should be noted that although Choi-2pt, Nister-5pt, and Hartley-8pt were originally designed for two-view problems, their mathematical frameworks can be extended to three-view relative pose estimation scenarios. This selection strategically covers both specialized planar solutions and general-purpose algorithms, facilitating multidimensional performance analysis. Our rigorous experimental protocol reveals fundamental insights into the efficacy of three-view geometric constraints for planar motion estimation, while objectively characterizing the method’s operational boundaries.

4.1. Simulated Data Experiment

4.1.1. Computational Efficiency and Numerical Stability

Table 1 presents the average computation time of all algorithms after 5000 random trials on simulated data. All methods were tested under the same computer environment, equipped with an Intel(R) Core(TM) i9-14900HX 2.20 GHz processor and implemented in MATLAB R2023a. Experimental results in Table 1 demonstrate the superior computational efficiency of our proposed Our-3pt method. Specifically, the conventional three-view normalized solution Hartley-7pt requires an average runtime of 0.860 ms, which is over 6× longer than Our-3pt (0.142 ms). While the plane-specific Choi-2pt method achieves the shortest computation time (0.097 ms), it is limited to two-view scenarios and can only estimate a single camera pose. In particular, Our-3pt maintains an efficiency comparable to Hartley-8pt (0.183 ms) while simultaneously solving for two camera poses, demonstrating its practical advantage in planar motion estimation.
In our experimental evaluation, we employ two quantitative metrics to assess the performance of the algorithm: rotation error ε R and translation error ε t . For rotation error, we compute the angular difference (in degrees) between the estimated and ground-truth rotation matrices. To address the scale ambiguity inherent in monocular vision systems, we similarly evaluate translation accuracy using angular deviation, specifically, the angle between the estimated and true translation directions (in degrees). The precise computational formulations follow those established in [21]:
\varepsilon_R = \arccos\!\left( \frac{\mathrm{trace}(R_{gt} R^T) - 1}{2} \right), \quad \varepsilon_t = \arccos\!\left( \frac{t_{gt}^T t}{\| t_{gt} \| \cdot \| t \|} \right),
where R gt and t gt denote the ground-truth rotation matrix and translation vector, while R and t represent their estimated values from the algorithm.
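The two metrics of Equation (33) can be sketched as follows (a NumPy illustration; the clipping guards the arccos argument against round-off just outside [-1, 1]):

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """eps_R = arccos((trace(R_gt R^T) - 1) / 2), reported in degrees."""
    c = (np.trace(R_gt @ R_est.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """eps_t = angle between translation directions (scale-invariant), in degrees."""
    c = t_gt @ t_est / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

Measuring translation error as an angle makes the metric independent of the unknown monocular scale, which is why it is preferred over a Euclidean distance here.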
Without considering the influence of image noise, the experimental results of numerical stability are shown in Figure 4. All algorithms were executed 5000 times. The experimental results indicate that in terms of the estimation of the rotation matrix, Choi-2pt performs the best, followed by Our-3pt. Nister-5pt and Hartley-8pt show moderate performance, while Hartley-7pt is the worst. In terms of translation vector estimation, Our-3pt achieves the optimal performance, with Choi-2pt ranking second. Nister-5pt and Hartley-8pt still maintain a moderate level, and Hartley-7pt remains the worst. It is worth noting that Our-3pt not only excels in translation estimation but also maintains near-optimal performance in rotation estimation. The comprehensive evaluation demonstrates that Our-3pt exhibits significant overall advantages in numerical stability.

4.1.2. Accuracy Analysis Under the Influence of Noise

To assess the accuracy and robustness of the proposed method against varying image noise levels, Gaussian noise (0–2 pixels) was introduced in simulations to replicate pixel-level errors arising from sensor noise and image processing during actual image acquisition. The experimental results are presented in Figure 5. It can be observed that the Our-3pt method exhibits significantly higher accuracy in solving the translation vector compared to the other four methods, characterized by both the smallest mean error and the least degree of error fluctuation. In terms of solving the rotation matrix, Choi-2pt achieves the best performance, while the precision of Our-3pt is almost comparable to that of Nister-5pt. In contrast, both Hartley-8pt and Hartley-7pt exhibit notable sensitivity to noise, with their errors increasing rapidly as noise levels grow.
In real-world scenarios, ideal conditions are difficult to achieve. Although we assume that the camera only rotates around the Y-axis, in practice there may be some angular noise about the X-axis and Z-axis. Noise about the X-axis represents pitch-angle noise, simulating camera pitch jitter caused by the vehicle traveling up and down slopes or on bumpy roads. Noise about the Z-axis represents roll-angle noise, reflecting the camera's roll deviation when the vehicle turns or tilts. To investigate the impact of angular noise on algorithm accuracy, we set the angular noise range between 0° and 1°, with the image noise fixed at 1 pixel. Since the Hartley-7pt, Nister-5pt, and Hartley-8pt algorithms are designed for non-planar motion scenarios, they demonstrated relatively stable accuracy in these experiments. As shown in Figure 6, when the angular error is less than 1°, both the translation and rotation accuracy of Our-3pt are superior to those of Hartley-8pt and Hartley-7pt. In terms of translation accuracy, Our-3pt achieves the best performance, significantly outperforming Nister-5pt and Choi-2pt. Although Our-3pt does not outperform Choi-2pt and Nister-5pt in rotation accuracy, its performance remains within an acceptable range.

4.1.3. Accuracy Analysis Under the Influence of Vibration

During actual motion, the camera may experience minor vibrations along the Y-axis, preventing ideal planar motion. To evaluate the effect of these vibrations on the accuracy of the algorithm, we set their magnitude range to 0–1% of the displacement norm ‖t‖ between the current and initial positions, while maintaining a constant image noise level of 1 pixel. As clearly observed in Figure 7, in terms of translation estimation, Our-3pt consistently outperforms Choi-2pt, demonstrating a stable precision advantage. Compared with Hartley-7pt, Our-3pt exhibits significant superiority in both rotation and translation accuracy. Compared with Hartley-8pt, Our-3pt achieves better rotation accuracy when the vibration amplitude is below 0.4%‖t‖, while maintaining consistently higher translation accuracy within the 0–1.4%‖t‖ vibration range. Although Nister-5pt outperforms Our-3pt in rotation accuracy, Our-3pt exhibits better translation accuracy when vibration remains below 0.6%‖t‖.

4.2. Real Image Sequences Experiment

4.2.1. Comparison of Pose Estimation Accuracy in Real-World Scenarios

This section evaluates the proposed method on real-world image sequences from the KITTI dataset [22], with comprehensive performance comparisons against existing approaches. The KITTI dataset provides ground truth poses for 11 image sequences, obtained through high-precision GPS/IMU fusion. We evaluate performance using median errors per sequence. For fair comparison, all methods use standard RANSAC [23] without additional optimizations (e.g., inlier refinement, nonlinear optimization, or bundle adjustment [24]). All compared methods employ SIFT feature matching [23] to establish three-view geometric constraints under identical parameter thresholds. Our method is developed based on a planar motion assumption, which the KITTI dataset scenes do not strictly satisfy. Therefore, comparing its accuracy with other algorithms is not the primary objective of this experiment, as such a comparison would be inherently unfair to our method. Instead, this experiment aims to qualitatively verify whether the proposed method can be effectively applied to real-world scenarios. To address the discrepancy between the dataset and our assumptions, we preprocess the KITTI data through camera pose rectification and image coordinate transformation. This processing approximates planar motion conditions, enabling more valid experimental evaluation (refer to Section 3.3 for details).
The experimental results in Table 2 demonstrate that in terms of rotation estimation, as a three-view method, Our-3pt exhibits a significantly lower overall median rotation error compared to the classical three-view method Hartley-7pt, fully reflecting the optimization effect of the three-view model under planar motion constraints. When compared with other two-view methods, Choi-2pt performs optimally overall, with Our-3pt closely following. This trend is consistent with the simulation results, primarily because the three-view model needs to simultaneously constrain the geometric correlations among the three views, resulting in higher model complexity that may introduce minor cumulative errors. However, Our-3pt has already shown rotation accuracy close to that of Choi-2pt and outperforms Nister-5pt and Hartley-8pt.
The experimental results in Table 3 indicate that in terms of translation estimation accuracy, Our-3pt, as a three-view method, shows an overall performance comparable to the classical three-view method Hartley-7pt. In comparison with two-view methods, Nister-5pt achieves generally better translation estimation accuracy; nevertheless, Our-3pt demonstrates slightly superior translation accuracy in sequences 04, 06, and 07. In the remaining sequences, although the translation error of Our-3pt increases slightly, it remains within a reasonable range. Meanwhile, Our-3pt outperforms Choi-2pt in all test sequences and is generally superior to Hartley-8pt.

4.2.2. Verification of Computational Efficiency and Feature Robustness in Real-World Scenarios

Table 4 presents a comparative analysis of the computational efficiency of the evaluated algorithms on the KITTI dataset, conducted under a unified RANSAC framework with standardized parameters. The framework uses the Sampson distance as the inlier criterion and adopts a 99% confidence level to ensure statistically robust sampling of minimal point sets. To balance efficiency and robustness, both the maximum number of iterations and the maximum number of sampling attempts are limited to 500. All compared algorithms share identical parameter configurations to ensure a fair evaluation. For runtime measurement, a robust two-stage statistical protocol is employed: first, the median execution time for each sequence is computed to mitigate the influence of outliers; then, the final performance metric is taken as the mean of these median values across all 11 sequences, ensuring reliable and representative timing comparisons.
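The two ingredients of this protocol — the confidence-based RANSAC iteration bound with the 500-iteration cap, and the mean-of-medians timing statistic — can be sketched as follows (a minimal illustration; function names are ours):

```python
import math
import numpy as np

def ransac_max_iters(inlier_ratio, sample_size=3, confidence=0.99, cap=500):
    """Standard RANSAC iteration bound N = log(1-p) / log(1-w^s),
    capped at 500 as in the experiments. Assumes 0 < inlier_ratio < 1."""
    denom = math.log(1.0 - inlier_ratio ** sample_size)
    return min(cap, math.ceil(math.log(1.0 - confidence) / denom))

def mean_of_sequence_medians(times_per_seq):
    """Two-stage runtime statistic: median per sequence, then the
    mean of those medians across all sequences."""
    return float(np.mean([np.median(t) for t in times_per_seq]))
```

For example, with a 50% inlier ratio and a three-point minimal sample, about 35 iterations suffice at 99% confidence, while low inlier ratios hit the 500-iteration cap.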
Experimental results demonstrate that the proposed Our-3pt method achieves an average computational time of merely 1.26 ms on the KITTI dataset, exhibiting significant efficiency advantages. Compared to the conventional three-view Hartley-7pt approach (20.6 ms), our method achieves a 16.3× speedup, which validates the efficiency of our linearized solution framework based on planar motion constraints. Even against the planar-motion-optimized Choi-2pt method (1.66 ms), our approach maintains a 1.3× speed advantage while additionally estimating poses across three views. In broader comparisons, our method is 2.1× and 5.5× faster than Nister-5pt (2.70 ms) and Hartley-8pt (6.99 ms), respectively. These results demonstrate that our method successfully balances computational efficiency with estimation accuracy under planar motion assumptions, providing a viable technical solution for real-time applications, such as autonomous driving, with stringent timing requirements.
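The quoted speedups follow directly from the ratios of the Table 4 runtimes; a quick sanity check:

```python
# Mean-of-medians runtimes (ms) from Table 4.
runtimes = {"Our-3pt": 1.26, "Hartley-7pt": 20.6,
            "Choi-2pt": 1.66, "Nister-5pt": 2.70, "Hartley-8pt": 6.99}

# Speedup of Our-3pt relative to each competitor, rounded to one decimal.
speedups = {name: round(t / runtimes["Our-3pt"], 1)
            for name, t in runtimes.items() if name != "Our-3pt"}
# Hartley-7pt: 16.3x, Choi-2pt: 1.3x, Nister-5pt: 2.1x, Hartley-8pt: 5.5x
```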
To further verify the robustness of the proposed method against feature-point noise, this section uses sequence 00 of the KITTI dataset. By adjusting PeakThresh (the peak threshold), the core parameter of SIFT feature extraction, we examine how changes in the number of feature points affect the pose estimation accuracy of the five algorithms. In SIFT feature extraction, a higher PeakThresh value yields fewer extracted feature points [25]. In the experiment, PeakThresh was increased from 0.03 to 0.04 in steps of 0.01, and the rotation and translation errors of each algorithm were measured at each setting. The results are shown in Figure 8.
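The effect of raising the peak threshold can be illustrated with a toy filter over keypoint contrast responses — this is not the SIFT detector itself, only a sketch of the thresholding behavior: candidate keypoints whose difference-of-Gaussians response falls below PeakThresh are discarded, so a higher threshold retains fewer, more stable features:

```python
import numpy as np

def filter_by_peak_thresh(responses, peak_thresh):
    """Keep only keypoints whose absolute DoG response meets the peak
    threshold; raising the threshold retains fewer features."""
    responses = np.asarray(responses)
    return responses[np.abs(responses) >= peak_thresh]

# Toy responses standing in for detected keypoint contrasts.
rng = np.random.default_rng(0)
responses = rng.uniform(0.0, 0.1, size=1000)
n_low = filter_by_peak_thresh(responses, 0.03).size
n_high = filter_by_peak_thresh(responses, 0.04).size  # fewer survive
```

In OpenCV's SIFT implementation, the corresponding knob is the `contrastThreshold` argument of `cv2.SIFT_create`.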
The experimental results show that as PeakThresh increases and the number of feature points decreases, the pose estimation accuracy of all algorithms is noticeably affected. For rotation estimation, Our-3pt and Choi-2pt perform best, followed by Nister-5pt, with Hartley-7pt and Hartley-8pt performing worst. For translation estimation, Nister-5pt performs best; the translation error of Our-3pt is higher than that of Hartley-7pt and Hartley-8pt but lower than that of Choi-2pt. Overall, Our-3pt achieves a good balance between computational efficiency and accuracy, making it especially suitable for real-time pose estimation in planar motion scenarios.

5. Conclusions

This paper proposes a three-view relative pose estimation method based on planar motion constraints. By establishing trifocal tensor constraints and developing a linearized solution framework, the method achieves efficient pose estimation from only three point correspondences. Experimental results demonstrate superior accuracy, computational efficiency, and robustness in planar motion scenarios compared to existing approaches, providing a reliable solution for real-time pose estimation on mobile platforms such as autonomous vehicles. The main limitation is the method's strong dependence on the planar motion assumption: when significant non-planar motion occurs in real scenarios, pose estimation accuracy degrades. Future work will explore multisensor fusion strategies (e.g., combining with an IMU) to relax the planar motion constraint, and robust feature screening mechanisms for dynamic environments to further improve the algorithm's applicability.

Author Contributions

Conceptualization, W.L. and Z.D.; methodology, W.L.; software, Z.D.; validation, Z.D., W.L. and L.L.; formal analysis, L.L.; investigation, Z.D.; resources, W.L.; data curation, Z.D.; writing—original draft preparation, Z.D.; writing—review and editing, W.L. and L.L.; visualization, L.L.; supervision, W.L.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This manuscript does not involve any human subjects or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study (KITTI sequences) are publicly available in the KITTI Vision Benchmark Suite repository: http://www.cvlibs.net/datasets/kitti/ (accessed on 11 August 2025). The specific implementation data and results generated during our experiments cannot be publicly shared due to technical constraints of our evaluation framework, but the complete methodological details are provided in the article to enable reproduction of our findings.

Acknowledgments

During the preparation of this manuscript, the authors used DeepSeek Chat (July 2024 version) for initial language polishing and grammar checking. The authors have thoroughly reviewed and edited all content and take full responsibility for the publication’s content.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, C.; Guan, B.; Shang, Y.; Li, Z.; Yu, Q. A Ground Positioning Method with Rigidly Mounted IMU and Camera. Chin. J. Lasers 2023, 50, 118–125. [Google Scholar]
  2. Gu, M.; Li, H.; Zhang, J.; Bai, X.; Zheng, J. A Survey on Vision-Based UAV Positioning and Navigation Methods. Acta Electron. Sin. 2024, 53, 651–685. [Google Scholar]
  3. Gyagenda, N.; Hatilima, J.V.; Roth, H.; Zhmud, V. A review of GNSS-independent UAV navigation techniques. Robot. Auton. Syst. 2022, 152, 104069. [Google Scholar] [CrossRef]
  4. Ma, N.; Cao, Y.-F. A survey of vision-based perception and pose estimation methods for autonomous UAV landing. Acta Autom. Sin. 2024, 50, 1284–1304. [Google Scholar]
  5. Ding, Y.; Yang, J.; Ponce, J.; Kong, H. Minimal solutions to relative pose estimation from two views sharing a common direction with unknown focal length. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7045–7053. [Google Scholar]
  6. Heyden, A. Reconstruction from image sequences by means of relative depths. Int. J. Comput. Vis. 1997, 24, 155–161. [Google Scholar] [CrossRef]
  7. Torr, P.H.S.; Zisserman, A. Robust parameterization and computation of the trifocal tensor. Image Vis. Comput. 1997, 15, 591–605. [Google Scholar] [CrossRef]
  8. Indelman, V.; Gurfil, P.; Rivlin, E.; Rotstein, H. Real-time vision-aided localization and navigation based on three-view geometry. IEEE Trans. Aerosp. Electron. Syst. 2012, 48, 2239–2259. [Google Scholar] [CrossRef]
  9. Ponce, J.; Hebert, M. Trinocular geometry revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 17–24. [Google Scholar]
  10. Li, T.; Yu, Z.; Guan, B.; Han, J.; Lv, W.; Fraundorfer, F. Trifocal Tensor and Relative Pose Estimation With Known Vertical Direction. IEEE Robot. Autom. Lett. 2024, 10, 1305–1312. [Google Scholar] [CrossRef]
  11. Ortin, D.; Montiel, J.M.M. Indoor robot motion based on monocular images. Robotica 2001, 19, 331–342. [Google Scholar] [CrossRef]
  12. Chou, C.C.; Wang, C.-C. 2-point RANSAC for scene image matching under large viewpoint changes. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 3646–3651. [Google Scholar]
  13. Choi, S.; Kim, J.-H. Fast and reliable minimal relative pose estimation under planar motion. Image Vis. Comput. 2018, 69, 103–112. [Google Scholar] [CrossRef]
  14. Guan, B.-L.; Zhao, J.; Shang, Y.; Yu, Q.-F. Minimal solutions for relative pose estimation under planar motion constraints. Sci. Sin. Technol. 2024, 54, 2122–2130. [Google Scholar] [CrossRef]
  15. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  16. Lu, L.; Tsui, H.T.; Hu, Z.Y. A novel method for camera planar motion detection and robust estimation of the 1D trifocal tensor. In Proceedings of the 15th International Conference on Pattern Recognition, ICPR-2000, Barcelona, Spain, 3–8 September 2000; IEEE: Piscataway, NJ, USA, 2000; Volume 3, pp. 807–810. [Google Scholar]
  17. Chen, J.; Jia, B.; Zhang, K. Trifocal tensor-based adaptive visual trajectory tracking control of mobile robots. IEEE Trans. Cybern. 2016, 47, 3784–3798. [Google Scholar] [CrossRef] [PubMed]
  18. Guan, B.; Vasseur, P.; Demonceaux, C. Trifocal tensor and relative pose estimation from 8 lines and known vertical direction. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 6001–6008. [Google Scholar]
  19. Julià, L.F.; Monasse, P. A critical review of the trifocal tensor estimation. In Proceedings of the Pacific-Rim Symposium on Image and Video Technology, Wuhan, China, 20–24 November 2017; pp. 337–349. [Google Scholar]
  20. Nistér, D. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 756–770. [Google Scholar] [CrossRef] [PubMed]
  21. Guan, B.; Zhao, J.; Barath, D.; Fraundorfer, F. Minimal solvers for relative pose estimation of multi-camera systems using affine correspondences. Int. J. Comput. Vis. 2023, 131, 324–345. [Google Scholar] [CrossRef]
  22. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  23. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  24. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment—A modern synthesis. In International Workshop on Vision Algorithms; Springer: Berlin, Germany, 1999; pp. 298–372. [Google Scholar]
  25. Scholkmann, F.; Boss, J.; Wolf, M. An efficient algorithm for automatic peak detection in noisy periodic and quasi-periodic signals. Algorithms 2012, 5, 588–603. [Google Scholar] [CrossRef]
Figure 1. Estimation of the three-view pose in planar motion.
Figure 2. Point correspondences among three views under planar motion.
Figure 3. The principle of triangulation.
Figure 4. Probability density function of relative pose estimation error: (a) rotation error, (b) translation error.
Figure 5. Accuracy under image noise: (a) rotation error, (b) translation error (unit: degree).
Figure 6. Accuracy under angular noise: (a) rotation error with pitch angle noise, (b) translation error with pitch angle noise, (c) rotation error with roll angle noise, (d) translation error with roll angle noise (unit: degree).
Figure 7. Accuracy under vibration effects: (a) rotational error, (b) translational error (unit: degree).
Figure 8. Pose estimation errors under different SIFT feature extraction thresholds: (a) rotational error, (b) translational error (unit: degree).
Table 1. Comparison of computation time for different methods (unit: ms).

Method   Our-3pt   Hartley-7pt   Choi-2pt   Nister-5pt   Hartley-8pt
Time     0.142     0.860         0.097      0.424        0.183

All time measurements are in milliseconds.
Table 2. Median rotational error of KITTI sequences (unit: degree).

Seq.   Our-3pt   Hartley-7pt   Choi-2pt   Nister-5pt   Hartley-8pt
00     0.0509    0.1190        0.0380     0.0866       0.1591
01     0.0537    0.4269        0.0429     0.0755       0.4803
02     0.0540    0.1331        0.0383     0.0887       0.1640
03     0.0545    0.1230        0.0478     0.0741       0.1674
04     0.0285    0.1469        0.0330     0.0762       0.1736
05     0.0347    0.1114        0.0327     0.0676       0.1414
06     0.0330    0.1154        0.0351     0.0672       0.1540
07     0.0362    0.1075        0.0340     0.0734       0.1390
08     0.0413    0.1204        0.0333     0.0798       0.1515
09     0.0406    0.1354        0.0367     0.0807       0.1704
10     0.0455    0.1226        0.0349     0.0794       0.1498

All values represent rotational error in degrees. Bold values indicate the best performance for each sequence.
Table 3. Median translation error of KITTI sequences (unit: degree).

Seq.   Our-3pt   Hartley-7pt   Choi-2pt   Nister-5pt   Hartley-8pt
00     1.5463    1.2917        1.8002     1.2103       1.4457
01     2.1319    2.8976        2.1547     1.8397       2.3018
02     1.3496    1.2865        1.6883     1.1022       1.3273
03     1.6843    2.0605        1.5764     1.2532       1.7406
04     0.7774    1.0534        1.4140     1.0244       1.2120
05     1.1093    1.5729        1.1420     0.9484       1.2034
06     0.8023    0.8500        1.4336     0.8147       1.0001
07     1.3239    1.4630        1.8569     1.3784       1.6365
08     1.6226    1.7043        1.4997     1.3753       1.6101
09     1.0632    1.6247        1.1977     1.0043       1.3199
10     1.5088    1.6111        1.3993     1.1301       1.4817

All values represent translation error in degrees. Bold values indicate the best performance for each sequence.
Table 4. Comparative runtime analysis of different methods under RANSAC framework on KITTI sequences (unit: ms).

Method   Our-3pt   Hartley-7pt   Choi-2pt   Nister-5pt   Hartley-8pt
Time     1.26      20.6          1.66       2.70         6.99

All time measurements are in milliseconds.

Share and Cite

MDPI and ACS Style

Dai, Z.; Lv, W.; Liu, L. Three-View Relative Pose Estimation Under Planar Motion Constraints. Vision 2025, 9, 72. https://doi.org/10.3390/vision9030072
