2.1. Configuration and Construction of the VMOS
The configuration of the proposed galvanometer–camera combined VMOS is shown in Figure 1a. The VMOS consisted of a galvanometer scanner, a camera with an appropriate lens, and a set of control units. The galvanometer scanner was fixed in front of the camera, and the control unit drove the camera and the galvanometer scanner simultaneously so that pictures were taken when the galvanometer deflected to a specified position. Light from the scene was deflected twice by the two mirrors in the galvanometer scanner and then captured by the camera sensor through the lens. By changing the turning angles of the two mirrors, the camera boresight and FOV could be adjusted, as shown in Figure 1b.
According to the principle of mirror transformation, changing the camera's field of view through the mirror deflections is equivalent to changing the camera's pose (including the position and direction), as shown in Figure 2. In Figure 2, $O_c\text{-}x_c y_c z_c$ is the camera coordinate system, which represents the pose of the real camera. The rotation angles of Mirror-1 and Mirror-2 are denoted as $\theta_1$ and $\theta_2$, respectively, which are uniquely determined by a pair of control values $(v_1, v_2)$. Suppose Mirror-1 and Mirror-2 are at the initial turning angles; then $O_c^1$ is the virtual camera position, which is specularly transformed from $O_c$ with Mirror-1, and $O_c^2$ is the virtual camera position, which is specularly transformed from $O_c^1$ with Mirror-2 in the initial status. The boresight of the real camera is identically transformed and marked as the blue dotted lines. When the turning angles $\theta_1$ and $\theta_2$ are changed to an arbitrary status, the corresponding virtual camera positions induced by the two mirror transformations are denoted as $O_c^{1\prime}$ and $O_c^{2\prime}$, respectively, and the virtual camera boresight in this status is marked as the red dotted lines.
To sum up, the virtual camera pose was related to the deflection angles $\theta_1$ and $\theta_2$, the distance between the rotation axes of the two mirrors, and the relative installation pose between the real camera and the galvanometer scanner. However, it is not trivial to directly calculate the pose matrices of the virtual cameras in practice for the following reasons: (1) The turning angles $\theta_1$ and $\theta_2$ are determined by a pair of control parameters $v_1$ and $v_2$, respectively. The non-linear mapping $\theta_k = f_k(v_k)$ between the turning angles and the control parameters needs to be carefully calibrated, and the calibration errors of $f_k$ may reduce the accuracy of the calculated virtual camera poses. (2) The distance between the rotation axes of the two mirrors is determined by the manufacturing process of the galvanometer scanner and is difficult to measure accurately in practice. (3) The relative installation pose between the camera and the galvanometer scanner is difficult to determine.
Instead of trying to calculate the virtual camera poses through specular reflection transformation, we enabled the galvanometer–camera to work as a virtual multi-ocular system, which needed to know neither the nonlinear mapping $f_k$, nor the rotation axis distance of the two mirrors, nor the installation pose of the camera. This scheme took advantage of the high repeatability of the galvanometer scanner. Specifically, the high repeatability of the scanner meant that whenever a specific control parameter $(v_1, v_2)$ was transmitted to the scanner, the corresponding deflection angles $\theta_1$ and $\theta_2$ remained almost unchanged every time, and hence, the imaging area of the system was always the same. In other words, given the control parameter $(v_1, v_2)$, the pose of the virtual camera was definitely determined. Therefore, we sampled the 2D control parameter domain in advance and endeavored to calibrate the virtual poses that corresponded to the sampled parameters. A one-to-one mapping from the sampled control parameters to the corresponding virtual cameras was established. All the virtual cameras constituted the virtual multi-ocular system.
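To make the sampling scheme concrete, the following is a minimal Python sketch; the control value ranges and sampling numbers are hypothetical, and `virtual_camera_of` simply records the one-to-one mapping described above.

```python
import itertools
import numpy as np

# Hypothetical control value ranges of the two galvanometer axes.
V1_RANGE = (-10.0, 10.0)   # control parameter v1 (drives Mirror-1)
V2_RANGE = (-10.0, 10.0)   # control parameter v2 (drives Mirror-2)
M, N = 5, 5                # sampling numbers m and n

def sample_control_grid(v1_range, v2_range, m, n):
    """Evenly sample the 2D control parameter domain into m*n pairs."""
    v1s = np.linspace(v1_range[0], v1_range[1], m)
    v2s = np.linspace(v2_range[0], v2_range[1], n)
    return list(itertools.product(v1s, v2s))

# One-to-one mapping: sampled control pair -> virtual camera index s.
# The scanner's high repeatability means each pair always reproduces
# the same mirror angles, so the pose calibrated for index s stays valid.
control_samples = sample_control_grid(V1_RANGE, V2_RANGE, M, N)
virtual_camera_of = {pair: s for s, pair in enumerate(control_samples)}
```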
In order to perform the camera imaging within the deflection range of the galvanometer scanner, the camera and the galvanometer scanner should be properly configured to guarantee that the view pyramid of any virtual camera resulting from the deflection of Mirror-1 intersects with Mirror-2, as shown in Figure 3.
More specifically, the parameters of the galvanometer–camera combination should meet the condition in Equation (1), where $\theta_1$ is the turning angle of Mirror-1, $\alpha$ is the FOV angle of the camera, $W$ is the width of Mirror-2, $O_c$ is the optical center point of the camera, $O_c^1$ is the optical center point of the virtual camera formed by Mirror-1, $P_1$ is the center point of Mirror-1, and $P_2$ is the center point of Mirror-2.
To guarantee that each virtual camera in the VMOS shared common FOVs with some of the others, the sampling numbers of the control parameters $(v_1, v_2)$ should satisfy the condition in Equation (2), where $m$ and $n$ are the least sampling numbers of the control parameters $v_1$ and $v_2$, respectively; $\theta_1^{\max}$ and $\theta_2^{\max}$ are the maximum turning angles of Mirror-1 and Mirror-2, respectively; and $\alpha_h$ and $\alpha_v$ are the camera FOV angles in the horizontal and vertical directions, respectively. Having determined $m$ and $n$, the 2D control parameter domain is evenly sampled. Then we have a number of $m \times n$ virtual cameras corresponding to the sampled control parameters $(v_1^s, v_2^s)$. The virtual cameras are denoted as $VC_s$, $s = 1, 2, \ldots, m \times n$.
The above control parameter sampling rule can ensure that adjacent virtual cameras share common FOVs. Most viewable regions of the VMOS have a fourfold overlap, as shown in Figure 4. The larger the sampling numbers $m$ and $n$ are, the more folds the viewing regions overlap, and the more constraints can be supplied for 3D reconstruction.
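Equation (2) itself is not reproduced above; a common form for such an overlap criterion bounds the angular step between adjacent boresights by the camera FOV, noting that a mirror rotation of $\theta$ deflects the beam by $2\theta$. The sketch below assumes exactly that form; the formula and the numeric values are illustrative assumptions, not the paper's condition.

```python
import math

def least_sampling_numbers(theta1_max, theta2_max, fov_h, fov_v):
    """Assumed overlap rule: a mirror turning through theta deflects the
    boresight by 2*theta, so the boresight sweeps 4*theta_max in total;
    adjacent views overlap if the step between samples stays below the FOV."""
    m = math.ceil(4.0 * theta1_max / fov_h) + 1
    n = math.ceil(4.0 * theta2_max / fov_v) + 1
    return m, n

# Illustrative values: +/-10 deg mirror range, 15 x 12 deg camera FOV.
m, n = least_sampling_numbers(math.radians(10), math.radians(10),
                              math.radians(15), math.radians(12))
print(m, n)  # least sampling numbers for v1 and v2
```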
2.2. Calibration Method of the VMOS
According to Section 2.1, the VMOS was composed of $m \times n$ virtual cameras $VC_s$ corresponding to the sampled control parameters $(v_1^s, v_2^s)$. Since all the virtual cameras were induced from the same real camera, the intrinsic parameters, including the pinhole imaging matrix and the distortion parameters, were the same for each virtual camera, while the poses $T_s$ of all the virtual cameras needed to be calculated.
Due to the large FOV of the VMOS, the calibration was difficult to realize with a calibration target all at once. We proposed a global optimization method for zonal calibration, combining Zhang's camera calibration method [28], the PnP (perspective-n-point) method [29], and the bundle adjustment (BA) method [30]. The main steps are summarized in Figure 5.
To realize the calibration method, we built a planar calibration target on which coded points were evenly distributed. The identification of each coded point in the images could be easily recognized by decoding. Denote the calibration target coordinate system as C-CS, and the coordinates of the coded points in C-CS as $P_C^i$, where the superscript $i$ represents the identification of a coded point. The specific steps of the proposed calibration method are as follows:
For image collection and calibration data preparation, put the calibration target at position $t$ in the working volume of the VMOS. Capture the image $I_{s,t}$ of the target at position $t$ with the virtual camera $VC_s$. Then extract the image coordinates $p_{s,t}^i$ of the coded points in image $I_{s,t}$. The 3D coordinates $P_t^i$ under a global coordinate system (G-CS) of the coded points on the calibration target at position $t$ are measured utilizing a photogrammetric device.
For the calibration of the camera intrinsic parameters, among the images $I_{s,t}$ with different index $t$ and fixed index $s$, match the 3D points $P_C^i$ with the image points $p_{s,t}^i$. Take the matched pairs $(P_C^i, p_{s,t}^i)$ into Zhang's monocular camera calibration process [28,31] for calibrating the intrinsic matrix $K$ in the pinhole camera model as shown in Equation (3) and the distortion parameters $D = (k_1, k_2, k_3, p_1, p_2)$ expressed in Equation (4).
$$ z\,\tilde{p} = K \left[ R \mid t \right] \tilde{P} \tag{3} $$
where $\tilde{P} = [X\; Y\; Z\; 1]^T$ is the homogeneous coordinates of the spatial point, $\tilde{p} = [u\; v\; 1]^T$ is the ideal homogeneous pixel coordinates of the corresponding image point, $[R \mid t]$ is the pose parameters of the camera, and $z$ is the depth coefficient.
$$ \begin{aligned} u_d &= u + (u - u_0)\left(k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_1 (u - u_0)(v - v_0) + p_2 \left(r^2 + 2 (u - u_0)^2\right) \\ v_d &= v + (v - v_0)\left(k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1 \left(r^2 + 2 (v - v_0)^2\right) + 2 p_2 (u - u_0)(v - v_0) \end{aligned} \tag{4} $$
where $u_d$ and $v_d$ are the observed pixel coordinates with distortion corresponding to the ideal coordinates $u$ and $v$, respectively; $r$ is the distance between the pixel point $(u, v)$ and the principal point $(u_0, v_0)$ of the pixel plane; $k_1$, $k_2$, and $k_3$ are the radial distortion parameters; and $p_1$ and $p_2$ are the tangential distortion parameters.
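As a sketch of this step, OpenCV's implementation of Zhang's method recovers $K$ and $D$ from the matched pairs; the function below assumes the coded points have already been matched per target position, with the planar target coordinates used as the object points.

```python
import numpy as np
import cv2

def calibrate_intrinsics(object_points, image_points, image_size):
    """Zhang's calibration for one fixed virtual camera index s.

    object_points: list over target positions t of (N, 3) float32 arrays,
                   the coded-point coordinates in the planar target frame
                   C-CS (z = 0), matched by identification i.
    image_points:  list over t of (N, 2) float32 arrays, the extracted
                   pixel coordinates p_{s,t}^i of the same coded points.
    image_size:    (width, height) of the camera sensor in pixels.
    """
    rms, K, D, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, image_size, None, None)
    # D contains (k1, k2, p1, p2, k3): the radial and tangential
    # distortion parameters of Equation (4), in OpenCV's ordering.
    return K, D, rms
```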
For the calibration of the virtual camera poses, to calculate the $s$th virtual camera pose, gather the coded points $p_{s,t}^i$ in the images $I_{s,t}$ as a group $G_s$ with the same index $s$ and different index $t$. Match the image points $p_{s,t}^i$ in each $G_s$ with the 3D points $P_t^i$ according to the indices $i$ and $t$. Utilizing the matched pairs $(P_t^i, p_{s,t}^i)$ in the specific group $G_s$, the pose of the virtual camera $VC_s$, i.e., the transformation matrix $T_s$ from G-CS to the virtual camera coordinate system $VC_s$-CS, is calculated through the PnP method [29].
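A sketch of the pose step for a single virtual camera; OpenCV's EPnP solver is used here as an assumed stand-in for the PnP method of [29].

```python
import numpy as np
import cv2

def estimate_virtual_camera_pose(points_gcs, pixels, K, D):
    """Solve the pose T_s of virtual camera VC_s from group G_s.

    points_gcs: (N, 3) coded-point coordinates P_t^i in G-CS, stacked
                over all target positions t observed by this camera.
    pixels:     (N, 2) matched image points p_{s,t}^i.
    Returns the 3x4 matrix T_s = [R_s | t_s] mapping G-CS to VC_s-CS.
    """
    ok, rvec, tvec = cv2.solvePnP(
        points_gcs.astype(np.float32), pixels.astype(np.float32),
        K, D, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed for this virtual camera")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return np.hstack([R, tvec.reshape(3, 1)])
```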
For global optimization, to improve the calibration accuracy, the BA method [32] is applied to optimize the intrinsic parameters and all the virtual camera poses. In consideration of the lens distortion, we add radial distortion and tangential distortion to the BA model. The objective function of the nonlinear optimization is
$$ \min_{K,\, D,\, \{T_s\}} \sum_{s} \sum_{t} \sum_{i} \left\| p_{s,t}^i - \hat{p}_{s,t}^i \right\|^2 \tag{5} $$
where $\hat{p}_{s,t}^i$ is the reprojection pixel coordinates of the spatial point $P_t^i$ in virtual camera $VC_s$ calculated through Equations (3) and (4).
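A minimal sketch of this refinement with SciPy; the parameter packing and the observation layout are assumptions, and a production bundle adjustment would exploit the sparse Jacobian structure for speed.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def ba_residuals(params, n_cams, observations, points_3d):
    """Stacked reprojection residuals of Equation (5).

    params packs fx, fy, cx, cy, five distortion terms, then a 6-vector
    (rvec, tvec) per virtual camera; observations is a list of tuples
    (cam_index, point_indices, observed_pixels).
    """
    K = np.array([[params[0], 0.0, params[2]],
                  [0.0, params[1], params[3]],
                  [0.0, 0.0, 1.0]])
    dist = params[4:9]                      # (k1, k2, p1, p2, k3)
    poses = params[9:].reshape(n_cams, 6)
    residuals = []
    for cam, idx, pix in observations:
        rvec, tvec = poses[cam, :3], poses[cam, 3:]
        proj, _ = cv2.projectPoints(points_3d[idx], rvec, tvec, K, dist)
        residuals.append((proj.reshape(-1, 2) - pix).ravel())
    return np.concatenate(residuals)

# x0 packs the Zhang/PnP estimates; least_squares then minimizes
# the sum of squared residuals, i.e., the objective of Equation (5):
# result = least_squares(ba_residuals, x0,
#                        args=(n_cams, observations, points_3d))
```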
Figure 6 shows the schematic diagram of the entire calibration process. Finally, the intrinsic matrix $K$, the distortion parameters $D$, and the extrinsic matrices $T_s$ of the virtual cameras were determined.
2.3. The 3D Reconstruction Method with the VMOS
Having completed the VMOS calibration, the intrinsic matrix $K$, the distortion parameters $D$, and the extrinsic pose matrices $T_s$ of all the virtual cameras $VC_s$ were obtained. The control parameter sampling rule described in Section 2.1 guarantees that the scene in the working volume of the VMOS can, in most regions, be observed by four or more virtual cameras. According to the triangulation method, the region observed by multiple virtual cameras can be 3D reconstructed, as shown in Figure 7.
In Figure 7, the image point $m_s$ corresponding to the spatial point $M$ can be expressed as
$$ z_s\, \tilde{m}_s = K T_s \tilde{M} \tag{6} $$
where $T_s$ is the $3 \times 4$ extrinsic pose matrix of $VC_s$; $\tilde{M}$ is the homogeneous coordinates of $M$ in the world coordinate system; $\tilde{m}_s = [u_s\; v_s\; 1]^T$ is the undistorted pixel coordinates of $M$ in the pixel coordinate system of virtual camera $VC_s$, which can be calculated from the observed image coordinates with Equation (4); and $z_s$ is the depth coefficient of point $M$ in the coordinate system of virtual camera $VC_s$. By eliminating $z_s$, Equation (6) can be reorganized as
$$ \begin{bmatrix} u_s h_s^3 - h_s^1 \\ v_s h_s^3 - h_s^2 \end{bmatrix} \tilde{M} = 0 \tag{7} $$
where $h_s^i$ represents the $i$th row of the matrix $H_s = K T_s$.
According to Equation (7), one camera can provide a $2 \times 4$ coefficient matrix. When there are $n$ cameras having observed the target point, an overdetermined linear system shown in Equation (8) can be obtained:
$$ A \tilde{M} = 0, \qquad A = \begin{bmatrix} u_1 h_1^3 - h_1^1 \\ v_1 h_1^3 - h_1^2 \\ \vdots \\ u_n h_n^3 - h_n^1 \\ v_n h_n^3 - h_n^2 \end{bmatrix} \tag{8} $$
Perform singular value decomposition (SVD) [30] on the coefficient matrix $A$. The 3D coordinates of $M$ can be obtained from the right singular vector corresponding to the minimum singular value.
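A sketch of the linear triangulation of Equations (7) and (8) in NumPy, where each projection matrix is $H_s = K T_s$ for a virtual camera that observed the point.

```python
import numpy as np

def triangulate(projections, pixels):
    """Linear multi-view triangulation via SVD (Equations (7)-(8)).

    projections: list of 3x4 matrices H_s = K @ T_s.
    pixels:      list of undistorted pixel coordinates (u_s, v_s).
    Returns the 3D point M in G-CS.
    """
    rows = []
    for H, (u, v) in zip(projections, pixels):
        rows.append(u * H[2] - H[0])   # u_s * h^3 - h^1
        rows.append(v * H[2] - H[1])   # v_s * h^3 - h^2
    A = np.asarray(rows)               # 2n x 4 coefficient matrix
    _, _, Vt = np.linalg.svd(A)
    M_h = Vt[-1]                       # right singular vector of the
    return M_h[:3] / M_h[3]            # minimum singular value, dehomogenized
```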
2.4. Pose Estimation Method Using the VMOS
Object pose estimation is one of the most common applications of machine vision. The PnP algorithm is the most common means of monocular pose estimation [33,34]. From the 2D image coordinates observed by one camera and the corresponding known 3D coordinates of the target, the transformation between the object coordinate system and the camera coordinate system can be calculated. Then the six-degree-of-freedom (DOF) pose parameters of the object with respect to the camera coordinate system can be obtained.
However, due to the limited field of view of each virtual camera, it may be impossible to obtain enough points for the PnP calculation in a single perspective. In addition, the pose calculated by PnP from a single view is expressed in the current virtual camera coordinate system and needs to be converted to the VMOS coordinate system using the pose parameters of each virtual camera, which is cumbersome and inconvenient. Fortunately, the proposed VMOS could observe the same object point with different virtual cameras, which had the potential to provide more constraints for determining the object pose than ordinary cameras. Taking advantage of the large FOV of the VMOS, we proposed a global pose estimation algorithm to directly obtain the object pose in the VMOS coordinate system by utilizing the images from multiple virtual cameras.
In our pose estimation scheme with the VMOS, not all the virtual cameras but only those having observed the feature points for the pose estimation participated in the calculation. As shown in Figure 8, suppose the calibrated virtual camera $VC_{s_j}$ observes a point $Q^j$ in the object coordinate system (O-CS) concerning the pose estimation, and the corresponding undistorted pixel coordinates $m_j$ are obtained. Then, $m_j$ can be transformed to the normalization plane in $VC_{s_j}$-CS according to (9):
$$ x_j = K^{-1} \tilde{m}_j \tag{9} $$
where $x_j$ is the coordinates of point $q_j$, which is on the normalization plane in $VC_{s_j}$-CS corresponding to $m_j$.
The correspondence between the spatial point $Q^j$ and the line $l_j$, which passes through $q_j$ and the optical center of $VC_{s_j}$, was established. Utilizing the extrinsic matrix $T_{s_j} = [R_{s_j} \mid t_{s_j}]$, the line $l_j$ was transformed from $VC_{s_j}$-CS to the VMOS coordinate system (i.e., G-CS), as shown in Equations (10) and (11):
$$ d_j = R_{s_j}^{T} d_j^{c} \tag{10} $$
$$ c_j = -R_{s_j}^{T} t_{s_j} \tag{11} $$
where $d_j$ is the normalized orientation vector of line $l_j$ in G-CS, $c_j$ is a passing point of line $l_j$ in G-CS, and $d_j^{c} = x_j / \| x_j \|$ is the normalized orientation vector of $l_j$ in $VC_{s_j}$-CS.
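A sketch of Equations (9)-(11): back-projecting an undistorted pixel to the normalization plane and expressing the resulting viewing line in G-CS.

```python
import numpy as np

def pixel_to_line(K, T_s, m_j):
    """Build the viewing line l_j in G-CS for undistorted pixel m_j.

    K:   3x3 intrinsic matrix.
    T_s: 3x4 extrinsic matrix [R_s | t_s] mapping G-CS to VC_s-CS.
    Returns (d_j, c_j): unit direction and a passing point of l_j in G-CS.
    """
    R, t = T_s[:, :3], T_s[:, 3]
    x_j = np.linalg.inv(K) @ np.array([m_j[0], m_j[1], 1.0])  # Equation (9)
    d_cam = x_j / np.linalg.norm(x_j)   # unit direction in VC_s-CS
    d_j = R.T @ d_cam                   # Equation (10): direction in G-CS
    c_j = -R.T @ t                      # Equation (11): optical center in G-CS
    return d_j, c_j
```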
Given $N$ pairs of 3D point–line correspondences $(Q^j, l_j)$, formed by the virtual cameras $VC_{s_j}$ observing points $Q^j$ on an object, the pose estimation can be modeled as the non-perspective PnP (NPnP) [35] problem: every transformed object point should lie on its corresponding viewing line, i.e.,
$$ \left( I - d_j d_j^{T} \right) \left( R\, Q^j + t - c_j \right) = 0, \qquad j = 1, \ldots, N \tag{12} $$
which is solved in the least-squares sense as
$$ \min_{R,\, t} \sum_{j=1}^{N} \left\| \left( I - d_j d_j^{T} \right) \left( R\, Q^j + t - c_j \right) \right\|^2 \tag{13} $$
where $(d_j, c_j)$ are the parameters of line $l_j$ and $T_O^G = [R \mid t]$ is the transformation matrix from O-CS to G-CS. We utilized the Procrustean solution provided in [35] to estimate the transformation matrix $T_O^G$ in Equation (13). After obtaining the result, a BA optimization was performed to improve the accuracy. Taking $T_O^G$ as the initial value, we minimized the reprojection errors expressed in Equation (14) to finally obtain the object pose parameters:
$$ \min_{R,\, t} \sum_{j=1}^{N} \left\| m_j - \hat{m}_j \right\|^2 \tag{14} $$
where $\hat{m}_j$ is the reprojection pixel coordinates of the spatial point $Q^j$ in virtual camera $VC_{s_j}$ and can be expressed as in Equation (15):
$$ z_j\, \hat{\tilde{m}}_j = K T_{s_j} T_O^G\, \tilde{Q}^j \tag{15} $$
The Gauss–Newton iteration method was used to minimize the objective function in Equation (14).
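A sketch of this refinement follows; SciPy's `least_squares` is used in place of a hand-written Gauss–Newton loop (its default trust-region method takes Gauss–Newton-type steps), with the NPnP result supplying the initial pose.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def pose_residuals(params, points_ocs, pixels, cam_idx, K, T_list):
    """Reprojection residuals of Equation (14) for the object pose.

    params:     rvec (3) + tvec (3) parameterizing T_O^G.
    points_ocs: (N, 3) object points Q^j in O-CS.
    pixels:     (N, 2) undistorted observations m_j.
    cam_idx:    index s_j of the virtual camera observing each point.
    K, T_list:  shared intrinsic matrix and the calibrated 3x4 poses T_s.
    """
    R_o, _ = cv2.Rodrigues(params[:3])
    pts_g = points_ocs @ R_o.T + params[3:]          # O-CS -> G-CS
    res = []
    for p_g, m, s in zip(pts_g, pixels, cam_idx):
        p_c = T_list[s][:, :3] @ p_g + T_list[s][:, 3]  # G-CS -> VC_s-CS
        proj = K @ p_c
        res.append(proj[:2] / proj[2] - m)           # Equation (15) reprojection
    return np.concatenate(res)

# x0 = np.hstack([cv2.Rodrigues(R_npnp)[0].ravel(), t_npnp])
# result = least_squares(pose_residuals, x0,
#                        args=(points_ocs, pixels, cam_idx, K, T_list))
```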