1. Introduction
Precise pose estimation of multi-camera setups is a challenge in computer vision, especially in scenarios where fixed cameras are positioned to avoid overlapping fields of view. While single-camera pose estimation has been widely explored and applied in fields such as robotics and the positioning of objects and people [1,2,3], traditional methodologies for multi-camera pose estimation predominantly rely on overlapping visual data to establish correspondences and compute relative poses [4,5,6]. When elements must be positioned within the same reference system, it becomes essential to accurately determine and manage the six-degree-of-freedom (6DoF) pose of each camera. However, this reliance on overlapping views is not always feasible in real-world applications, where consistent camera overlap cannot be guaranteed due to environmental constraints or the need for wider spatial coverage [7,8,9]. A representation of this problem is shown in Figure 1. Being able to accurately position cameras without relying on overlapping fields of view would be highly beneficial in various scenarios, such as surveillance systems and indoor robotic navigation [10,11,12,13].
Several solutions have been proposed to address the challenges associated with fixed camera positioning. Feature-based approaches, which identify and rely on distinctive points or features in the scene, have demonstrated effectiveness in environments with rich visual details. Techniques such as structure from motion (SfM) [14,15] and simultaneous localization and mapping (SLAM) [16,17] have been foundational in this domain, leveraging feature detection and matching across overlapping images to estimate camera poses. These methods have driven significant advancements in computer vision by capitalizing on the geometric and photometric relationships inherent in overlapping visual data.
Despite their success, these methodologies face substantial limitations in scenarios where camera views do not overlap. In such cases, the absence of essential correspondences between views severely hampers pose estimation. To address this limitation, several methods designed for non-overlapping camera setups have been developed [6,18]. While these approaches represent progress toward solving the problem, they encounter significant challenges in achieving robust and reliable performance in non-overlapping scenarios, leaving room for further innovation and improvement in this area.
Marker-based approaches provide an alternative solution for pose estimation by relying on the strategic placement of predefined markers within the scene [4,5]. These methods have demonstrated significant robustness in controlled environments where markers remain consistently visible. However, their applicability is significantly reduced in non-overlapping camera setups, as the absence of shared markers across distinct viewpoints hinders effective pose estimation. Although advancements such as those proposed by Zhao et al. [19] have explored this domain, the problem remains unresolved, particularly in addressing the challenges posed by complex and large-scale scenarios. This limitation underscores the inherent constraints of both feature-based and marker-based approaches, highlighting the urgent need for innovative solutions to tackle pose estimation in non-overlapping camera configurations effectively.
This paper makes several contributions, starting with a novel approach capable of accurately positioning any set of fixed cameras in large indoor scenarios, regardless of whether their fields of view overlap, as represented in Figure 2. Our method combines advanced optimization techniques with strategic marker placement, facilitated by an auxiliary mobile camera, to estimate the pose of any fixed camera accurately. To our knowledge, this is the first work to tackle the problem of fixed camera pose estimation without overlap in large and complex conditions.
Additionally, we introduce an algorithm to automatically detect the set of markers that remain static between recordings made with the mobile camera, facilitating the use of our method under any circumstances. Our last contribution is a novel set of datasets specifically designed to test the capability of our method to accurately position fixed cameras without overlap, arranged in various configurations and scenarios. Our experiments validate our method as the first to achieve state-of-the-art results in positioning any fixed camera arrangement with and without overlapping fields of view.
This work builds upon and significantly extends our previous methodology [5] by addressing its key limitations. Our previous approach requires cameras to have overlapping fields of view, thus forcing the use of a very large number of cameras to solve the problem. In contrast, the method presented in this paper does not require any camera overlap, allowing its use in realistic scenarios. This more challenging problem is addressed through strategically placed reusable fiducial markers and an auxiliary mobile camera that captures marker observations. The main contributions of this paper are as follows: (i) a novel approach is devised to estimate the pose of sparse cameras (without overlapping fields of view) by combining mobile cameras and fixed views; (ii) a method is created to automatically determine the markers that remain static between recordings made with the mobile camera; (iii) a meta-graph is introduced to fuse all visual information into a single optimization process that reduces the reprojection error and enforces other structural restrictions; (iv) datasets are created to evaluate the proposal.
The rest of this work is organized as follows. Section 2 describes the related works that frame the context of our research. Section 3 describes our methodology in detail. Section 4 presents experimental results validating the effectiveness of our approach. Section 5 discusses the implications of our findings and outlines potential future research directions.
2. Related Works
Estimating the pose of a set of sparse cameras under the same coordinate system is a fundamental task in computer vision applications such as 3D reconstruction, robotics, and augmented reality. Prior methods often rely on overlapping fields of view between cameras to establish correspondences and compute relative poses [4,5,6]. However, in real-world scenarios, cameras may be positioned with non-overlapping views, which reduces cost by minimizing the number of cameras required but presents unique challenges for accurate pose estimation. Recent developments have introduced various methods for general indoor positioning [20,21,22]. However, these techniques are not able to address the pose estimation of fixed sparse cameras, highlighting the importance of image-based solutions for tackling this problem.
In non-overlapping camera networks, the absence of shared visual information complicates finding correspondences between camera views, which is essential for conventional pose estimation techniques. Despite significant advancements, as highlighted in previous studies [6,23,24], achieving reliable pose estimation in large-scale environments without overlapping views remains an unresolved challenge. Each of these methods has advanced the field, yet they continue to rely on environments that provide substantial visual texture, indicating that this remains a critical area for ongoing research, particularly in large scenarios where such details are scarce.
In scenarios requiring precise camera positioning over large distances, total stations and laser measurement tools provide an alternative approach, offering high accuracy but at increased cost and operational complexity. These instruments are particularly advantageous in scenarios where traditional methods face challenges in precision or scale. However, the exploration of camera positioning using these methodologies remains underdeveloped, especially for multi-camera pose estimation in sparse environments. While some studies have examined their application in camera pose estimation [25,26], further research is essential to fully integrate high-precision measurement tools into practical and scalable solutions for complex indoor environments.
Various methodologies, prominently feature-based and marker-based, have been developed to tackle camera pose estimation. Feature-based systems, effective in conditions with sufficient overlap, detect and match key points across images to compute relative camera poses. However, in non-overlapping camera networks, the absence of shared features poses challenges, as traditional feature-based approaches cannot establish the necessary correspondences for pose estimation.
Structure from Motion (SfM) is a computational technique initially detailed by Ullman [27], which reconstructs 3D structures through the analysis of motion sequences captured from multiple viewpoints. Essential tools such as OpenDroneMap [28], Pix4D [29], and COLMAP [30] facilitate this process by using feature-matching techniques across overlapping images to estimate camera poses and construct 3D models. A significant hurdle in SfM is the lack of common features between images, which is critical for ensuring accurate view alignment and thus affects the overall efficacy of the reconstruction. This makes its application challenging in indoor environments or in scenarios where the cameras capturing the images have no overlapping views.
While deep learning has broadened the scope of SfM, challenges persist in scenarios with non-overlapping views, despite advances such as those by Wang et al. [14] and Ren et al. [15]. Gaussian splatting, developed by Kerbl et al. [31], marks a significant evolution of traditional SfM methods by utilizing neural networks to model scenes as continuous volumetric functions. This technique significantly improves the handling of complex scenes and streamlines the reconstruction process. Tools from PolyCam [32] support the rapid testing and deployment of Gaussian splatting. However, the efficacy of this technique relies heavily on the quality of the initial SfM reconstruction. Further expanding on this concept, a novel method by Cao et al. [33] integrates object detection into Gaussian splatting, potentially offering solutions to the challenges discussed in this paper. However, like other existing methods, its application remains limited without overlapping views.
Simultaneous localization and mapping (SLAM) is a principal method among feature-based techniques, enabling a camera to construct a map of an unknown environment while concurrently determining its location within that map. Existing methods, such as those proposed by Romero-Ramirez et al. [16] and Campos et al. [17], demonstrate significant effectiveness. However, despite their capabilities, SLAM methodologies encounter difficulties in non-overlapping camera networks, primarily due to the lack of shared visual features, which are essential for traditional SLAM operations.
For these reasons, new methodologies based on SLAM, such as those proposed by Dai et al. [6] and Ataer-Cansizoglu et al. [18], are being developed to tackle the challenges posed by non-overlapping environments. While these methods showcase innovative approaches to extend SLAM techniques beyond their traditional constraints, they inherently rely on environments rich in identifiable details. This reliance stems from their dependence on feature-based techniques, necessitating dense environmental data to function effectively.
Despite the effectiveness of feature-based systems, marker-based systems provide a superior solution for accurate pose estimation in controlled scenarios. These systems use fiducial markers such as ArUco [34], which are robust and easy to detect, but they generally require markers to be visible in multiple camera views to establish correspondences. In non-overlapping scenarios, however, the effectiveness of marker-based systems is reduced, as the lack of shared marker observations prevents the direct computation of relative poses. To address this, methods such as the one proposed by Zhao et al. [19] offer solutions using an additional camera and a chessboard pattern. However, these solutions are constrained to smaller scales and are far removed from real-world applications.
At a larger scale, MarkerMapper [4] offers a notable solution designed to perform simultaneous localization and mapping using fiducial markers. It capitalizes on the detection of markers to build a map of the environment and estimate the camera’s pose within that map. While effective in environments where markers are visible across multiple views, its dependency on overlapping fields of view limits its utility in non-overlapping camera networks.
Lastly, the method proposed by Garcia et al. [5] extends MarkerMapper specifically to navigate the complexities of large-scale scenarios with some overlapping views. Their strategy utilizes reusable markers alongside scene geometry constraints, significantly enhancing pose estimation accuracy in vast and intricate environments. The method is based on placing markers in the overlapping view of two fixed cameras to enable the computation of the inter-camera pose. However, it is not suited for scenarios where cameras operate without any overlapping fields of view, as the method relies on the presence of at least some shared markers for accurate positioning.
This paper introduces a novel approach for estimating the pose of sparse (i.e., non-overlapping) fixed camera networks. Our approach fundamentally differs from existing methodologies by utilizing a reusable set of markers, which are iteratively mapped using an auxiliary camera that records every placed marker. We enhance this method by integrating advanced optimization techniques and leveraging scene geometry, aiming to achieve precise pose estimation without relying on overlapping fields of view. To our knowledge, this innovative approach addresses a challenge that existing methods have not effectively resolved.
3. Proposed Method
This section presents the proposed methodology for estimating the poses of sparse fixed indoor cameras using fiducial markers. In our previous work [5], a solution was proposed assuming that some fixed cameras are close to each other, so that they share part of their field of view. Thus, a large area can be covered by many fixed cameras, creating a path of shared fields of view. In that use case, markers placed on the ground are employed to obtain the pairwise relationship (pose) between the cameras and, thus, their global pose on a map. However, in many scenarios, this option is not possible, and we only have a small subset of sparse fixed cameras completely unconnected from each other.
Our solution to the sparse camera pose estimation problem relies on the idea that the pose of a camera with respect to a global reference system can be easily obtained from an image showing one or more markers whose poses are known. The problem then becomes placing markers over a large area and accurately estimating their poses. This problem can be solved by the method proposed in MarkerMapper [4], in which a set of markers is placed in the environment and their poses are estimated from images, i.e., structure from motion using markers instead of key points. The problem with that approach is that it requires an extremely large number of unique markers, which is unfeasible in even relatively small scenarios.
Our approach, illustrated in Figure 2, consists of placing a small set of markers on the ground and taking several images with a moving camera (i.e., a phone camera). The images, showing the markers from several viewpoints, are employed to obtain a reconstruction of the markers relative to one of them; this is called a group. Then, a subset of the markers is moved while the rest are left in their positions, and the operation is repeated, creating another group. This second group is connected to the previous one by the fixed markers. The process is repeated until the whole area to be mapped is covered, making sure that markers are placed under the fixed cameras we aim to locate. Thus, our set of images contains not only images from the moving camera but also from the fixed cameras.
We first estimate the poses of the markers and the images involved within a group using a local reference system for each group (e.g., centered at one of the markers). Then, since groups are connected, it is possible to find the transformation between any group and one of them, which acts as a global reference system. In doing so, we obtain the poses of the fixed cameras in that global reference system.
The rest of this section explains the proposed method in detail. First, Section 3.1 provides some mathematical definitions necessary for the rest of the paper. Second, Section 3.2 explains how groups are analyzed, and Section 3.3 presents how they are merged into a metagroup to solve the proposed problem. Finally, Section 3.4 explains how to automatically determine the connections between groups.
3.1. Mathematical Definitions
Let us represent a three-dimensional point within a given reference system $a$ as $\mathbf{p}^a \in \mathbb{R}^3$. To convert this point to a different reference system, labeled as $b$, rotation and translation are required. Let us denote $T_{ba}$ as the SE(3) homogeneous transformation matrix that transforms points from $a$ to $b$ as follows:

$$\begin{pmatrix} \mathbf{p}^b \\ 1 \end{pmatrix} = T_{ba} \begin{pmatrix} \mathbf{p}^a \\ 1 \end{pmatrix}, \qquad T_{ba} = \begin{pmatrix} R_{ba} & \mathbf{t}_{ba} \\ \mathbf{0}^\top & 1 \end{pmatrix}.$$

To ease the notation, we employ the operator ($\cdot$) as follows:

$$\mathbf{p}^b = T_{ba} \cdot \mathbf{p}^a.$$

It should also be noted that the transformation from system $c$ to system $b$ ($T_{bc}$), followed by the transformation from $b$ to $a$ ($T_{ab}$), can be combined through matrix multiplication into a single transformation $T_{ac}$ as shown below:

$$T_{ac} = T_{ab}\, T_{bc}.$$

Image formation uses the pinhole camera model: a point in three-dimensional space, $\mathbf{p}$, is projected onto a camera pixel $\mathbf{u}$. Given known camera parameters, this projection can be described by the function $\Psi$:

$$\mathbf{u} = \Psi(\delta, T, \mathbf{p}),$$

where $\delta$ denotes the intrinsic parameters of the camera, and $T$ represents the camera pose at the time the image was captured, i.e., the transformation that relocates a point from any reference system to the camera’s system.
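To make this notation concrete, the following minimal Python sketch (not part of the original method; NumPy only, with illustrative function names) shows how an SE(3) transform is built and composed, how the ($\cdot$) operator acts on a 3D point, and how the pinhole projection $\Psi$ maps a point to a pixel given the intrinsics $\delta$.

```python
# Minimal sketch of the Section 3.1 notation (illustrative, not the authors' code).
import numpy as np

def make_transform(R, t):
    """Build a 4x4 SE(3) matrix from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_point(T_ba, p_a):
    """Apply T_ba to a 3D point expressed in system a (the '.' operator)."""
    p_h = np.append(p_a, 1.0)           # homogeneous coordinates
    return (T_ba @ p_h)[:3]

def project(K, T_cw, p_w):
    """Pinhole projection Psi: world point -> pixel, K holds the intrinsics."""
    p_c = transform_point(T_cw, p_w)     # point in the camera reference system
    u = K @ p_c
    return u[:2] / u[2]                  # perspective division

# Composition: T_ac = T_ab @ T_bc chains transforms through system b.
```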
3.2. Group Pose Graphs
As previously indicated, we use a set of markers $\mathcal{M}$ that are first placed on the ground, and then take a set of images $\mathcal{I}$ of them. We will use Figure 3 to guide the explanation. While Figure 3a shows the group’s configuration, Figure 3b shows the images obtained by the moving camera. Please note that some images belong to fixed cameras, while others belong to the moving camera.
Let $\mathbb{G}$ be the set of groups. We define a group as $g = (\mathcal{M}^g, \mathcal{I}^g, \mathcal{O}^g)$, where $\mathcal{M}^g$ denotes the set of poses of the markers in the reference system of group $g$, $\mathcal{I}^g$ represents the set of image poses in the reference system of group $g$, and $\mathcal{O}^g$ represents the set of observations of markers in images. In other words, the tuple $(m, i) \in \mathcal{O}^g$ indicates that the marker $m$ is observed in the image $i$.
Our goal is to estimate the poses $\mathcal{M}^g$ and $\mathcal{I}^g$ in the group’s local reference system. To do so, we select one of the markers of the group, $m_0$, and assume its center is the group reference system, i.e., $T^g_{m_0} = \mathbf{I}_{4\times 4}$. To that end, we first create a group pose quiver (Figure 3c), where nodes represent markers, and edges are the pair-wise pose relationships between markers obtained from their image observations. The quiver is then refined into a group pose graph that is ultimately optimized using sparse graph optimization.
A marker is a squared planar object whose four corners $\mathbf{c}_l,\ l \in \{1,\dots,4\}$, can be expressed with respect to the center of the marker as follows:

$$\mathbf{c}_1 = \left(-\tfrac{s}{2}, \tfrac{s}{2}, 0\right)^{\top}, \quad \mathbf{c}_2 = \left(\tfrac{s}{2}, \tfrac{s}{2}, 0\right)^{\top}, \quad \mathbf{c}_3 = \left(\tfrac{s}{2}, -\tfrac{s}{2}, 0\right)^{\top}, \quad \mathbf{c}_4 = \left(-\tfrac{s}{2}, -\tfrac{s}{2}, 0\right)^{\top},$$

where $s$ is the length of the marker sides. Let us denote $\mathbf{u}^l_{i,m}$ as the observed position of the corner $l$ of marker $m$ in image $i$. Using the PnP solution [35], we estimate the relative pose $T_{im}$ from the marker to the image.
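As an illustration of this step, the sketch below estimates the marker-to-camera pose from the four detected corners using OpenCV’s PnP solver. It is a hedged example under the assumption of an OpenCV-based pipeline; the corner ordering matches the layout given above, and the function name marker_pose is illustrative.

```python
# Sketch: marker pose from its four detected corners via PnP (illustrative).
import cv2
import numpy as np

def marker_pose(corners_px, s, K, dist=None):
    """corners_px: 4x2 array of detected corner pixels, in the same order as obj_pts."""
    half = s / 2.0
    obj_pts = np.array([[-half,  half, 0],   # corner layout in the marker frame
                        [ half,  half, 0],
                        [ half, -half, 0],
                        [-half, -half, 0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, corners_px.astype(np.float64), K, dist,
                                  flags=cv2.SOLVEPNP_IPPE_SQUARE)
    R, _ = cv2.Rodrigues(rvec)
    T_im = np.eye(4)                          # transform from marker to camera frame
    T_im[:3, :3], T_im[:3, 3] = R, tvec.ravel()
    return T_im
```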
Furthermore, if more than one marker is observed in image $i$, we obtain a pair-wise relationship between any two observed markers $m$ and $m'$. Thus, we shall denote $T^{i}_{m'm} = T^{-1}_{im'}\, T_{im}$ as the transform that moves from marker $m$ to marker $m'$ given the observations from image $i$.
A pose quiver, as shown in Figure 3c, is constructed from all pair-wise combinations, with nodes symbolizing markers and edges depicting their pair-wise relationships. Among all potential edges connecting two markers, the one exhibiting the minimal reprojection error is chosen to form a group pose graph (see Figure 3d), where each edge carries the selected relative pose $T_{m'm}$ between markers.
To obtain an initial estimation of the marker poses before the optimization process, we select one of the markers, $m_0$, as the local reference system (i.e., $T^g_{m_0} = \mathbf{I}_{4\times 4}$) and apply Dijkstra’s algorithm to determine the best path to all other nodes. Then, an initial estimation $\hat{T}^g_m$ of the local marker poses $\mathcal{M}^g$ is obtained by chaining the relative poses along the path from $m_0$ to $m$:

$$\hat{T}^g_m = T_{m_0 m_1}\, T_{m_1 m_2} \cdots T_{m_{n-1} m},$$

where $m_1, \dots, m_{n-1}$ are the intermediate markers along the path. Similarly, we obtain an initial estimation $\hat{T}^g_i$ of the image poses $\mathcal{I}^g$:

$$\hat{T}^g_i = \hat{T}^g_m\, T^{-1}_{im},$$

from any of the markers $m$ observed by image $i$, i.e., $(m, i) \in \mathcal{O}^g$.
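The following sketch illustrates this initialization step under the assumption that the best pairwise relative poses and their reprojection errors have already been computed; it builds the group pose graph with NetworkX and chains transforms along Dijkstra paths. Names such as initial_marker_poses are illustrative, not the authors’ implementation.

```python
# Sketch: initial marker poses by chaining pairwise transforms along Dijkstra paths.
import networkx as nx
import numpy as np

def initial_marker_poses(edges, m0):
    """edges: dict {(a, b): (T_ab, err)} where T_ab moves points from marker b to a."""
    G = nx.Graph()
    for (a, b), (T, err) in edges.items():
        G.add_edge(a, b, T=T, err=err)
    poses = {m0: np.eye(4)}                       # m0 defines the group reference
    for m in G.nodes:
        if m == m0:
            continue
        path = nx.dijkstra_path(G, m0, m, weight="err")
        T = np.eye(4)
        for a, b in zip(path[:-1], path[1:]):     # chain pairwise transforms
            T_edge = G[a][b]["T"]
            # stored transform maps b -> a only if the edge was inserted as (a, b)
            T = T @ (T_edge if (a, b) in edges else np.linalg.inv(T_edge))
        poses[m] = T
    return poses
```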
Given the initial estimation of the values, they are refined by minimizing the reprojection error of the observed markers in the images using the sparse version of the Levenberg–Marquardt algorithm, which exploits the sparsity of the Jacobian matrix to efficiently handle large-scale optimization problems typical in graph-based formulations [36,37]:

$$\{T^g_m, T^g_i\} = \arg\min \sum_{(m,i)\in\mathcal{O}^g}\ \sum_{l=1}^{4} \left\| \mathbf{u}^l_{i,m} - \Psi\!\left(\delta_i,\ (T^g_i)^{-1} T^g_m,\ \mathbf{c}_l\right) \right\|^2,$$

where $\delta_i$ denotes the intrinsic parameters of the camera that captured image $i$, and $(T^g_i)^{-1} T^g_m$ transforms the marker corners into the reference system of image $i$ before projection.
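A possible way to implement this refinement is sketched below with SciPy. Since SciPy does not expose a sparse Levenberg–Marquardt variant, its sparsity-aware trust-region least-squares solver (method='trf' with jac_sparsity) is used as a stand-in; residual_fn and blocks are assumed to be provided by the caller and are illustrative.

```python
# Sketch: refining stacked marker/image poses with a sparsity-aware least-squares solver.
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

def refine_group(x0, residual_fn, blocks):
    """
    x0          : initial stacked 6-DoF parameters (markers + images).
    residual_fn : x -> vector of corner reprojection errors.
    blocks      : list of (residual_rows, parameter_cols) pairs that interact,
                  used to declare the sparsity pattern of the Jacobian.
    """
    n_res = len(residual_fn(x0))
    sparsity = lil_matrix((n_res, len(x0)), dtype=int)
    for rows, cols in blocks:                 # each observation touches one marker
        for r in rows:                        # pose and one image pose only
            sparsity[r, cols] = 1
    sol = least_squares(residual_fn, x0, jac_sparsity=sparsity,
                        method="trf", verbose=0)
    return sol.x
```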
3.3. Metagroup Pose Graph Optimization
The previous process is repeated independently for each one of the groups $g \in \mathbb{G}$, obtaining their poses relative to a local reference. Our goal now is to obtain the poses of both the cameras and the markers of all groups in a common reference system, e.g., centered at the reference marker of the first group, $g_0$.
As already explained, there is a connection between some groups $g$ and $g'$, i.e., there is a subset of markers $\mathcal{F}_{g,g'}$ between them that have not been moved. If $\mathcal{F}_{g,g'}$ is known, it is possible to obtain the transform $T_{g'g}$ that moves from one group to the other. Then, it holds that

$$T^{g'}_m = T_{g'g}\, T^{g}_m, \qquad \forall m \in \mathcal{F}_{g,g'}.$$
The subset $\mathcal{F}_{g,g'}$ can be manually annotated when collecting the images or automatically obtained, as explained later in Section 3.4. In either case, it allows us to build the metagroup pose graph, where nodes are groups and edges their relationships $T_{g'g}$ (see Figure 4). As in the previous case, we employ Dijkstra’s algorithm to determine the best path from any group to the reference group $g_0$. Then, we shall denote $\dot{T}_m$ as the pose $T^g_m$ transformed to the global reference system (i.e., that of $g_0$). The same is applied to obtain the transformed image poses $\dot{T}_i$.
Now, the poses are referred to the reference system of the first group, $g_0$. However, in practical scenarios, it is often preferable to express these poses relative to a CAD model or map of the building. To achieve this, control points can be utilized. A control point represents the known position of a marker or a camera within the map, and $\mathcal{Q}$ denotes the collection of these points. By applying the Horn [38] algorithm with at least three control points, we determine the optimal rigid transformation $T_w$ that aligns the image and marker poses with the map’s reference system:

$$\ddot{T}_m = T_w\, \dot{T}_m \qquad \text{and} \qquad \ddot{T}_i = T_w\, \dot{T}_i.$$
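The sketch below shows a closed-form rigid alignment between matched 3D control points using an SVD-based (Kabsch/Umeyama-style) formulation, which yields the same rigid transform as Horn’s quaternion method for this purpose. Function and variable names are illustrative.

```python
# Sketch: rigid alignment of the reconstruction to the map from >= 3 control points.
import numpy as np

def rigid_alignment(src, dst):
    """Return the 4x4 transform T such that T applied to src_i best matches dst_i."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```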
Our final optimization function combines multiple objectives into a single global error:

$$\mathcal{E} = \mathcal{E}_{rep} + \mathcal{E}_{cam} + \mathcal{E}_{mrk} + \mathcal{E}_{ctrl}.$$

The term $\mathcal{E}_{rep}$ represents the global reprojection error of the metagraph:

$$\mathcal{E}_{rep} = N_{rep} \sum_{g \in \mathbb{G}}\ \sum_{(m,i)\in\mathcal{O}^g}\ \sum_{l=1}^{4} \left\| \mathbf{u}^l_{i,m} - \Psi\!\left(\delta_i,\ \ddot{T}_i^{-1} \ddot{T}_m,\ \mathbf{c}_l\right) \right\|^2,$$

where $N_{rep}$ serves as a normalization factor such that $\mathcal{E}_{rep} = 1$ when each individual error equals one. It is computed as the inverse of the total number of projected points, i.e., $N_{rep} = 1 / \left(4 \sum_{g\in\mathbb{G}} |\mathcal{O}^g|\right)$. The normalization factor allows us to combine the different error terms independently of the number of markers, images, or control points optimized.
The optional error term $\mathcal{E}_{cam}$ ensures that cameras known to be on the same plane remain coplanar. This is particularly relevant in indoor settings, where cameras are often installed on the ceiling. We define $\mathcal{C}$ as a set of fixed cameras that share a single plane. In scenarios with multiple planes, such as buildings with several floors, each plane is associated with its own $\mathcal{C}$ group. The collection of all these groups is represented as $\mathbb{C}$. The error term is then defined as follows:

$$\mathcal{E}_{cam} = N_{cam} \sum_{\mathcal{C} \in \mathbb{C}}\ \sum_{i \in \mathcal{C}} d\!\left(\pi_{\mathcal{C}},\ \mathbf{t}_i\right)^2.$$

Here, $\pi_{\mathcal{C}}$ denotes the optimal plane derived from the camera poses in the set $\mathcal{C}$. Meanwhile, $d(\pi_{\mathcal{C}}, \mathbf{t}_i)$ represents the Euclidean distance from the translational component $\mathbf{t}_i$ of the camera pose $\ddot{T}_i$ to the plane. The normalization factor $N_{cam}$ is adjusted so that $\mathcal{E}_{cam} = 1$ when all distances are precisely 1 centimeter, thus equating one pixel of error in $\mathcal{E}_{rep}$ to one centimeter in $\mathcal{E}_{cam}$. The normalization factor $N_{cam}$ can be defined as follows:

$$N_{cam} = \frac{1}{\sum_{\mathcal{C} \in \mathbb{C}} |\mathcal{C}|}.$$
Similarly, the optional error term $\mathcal{E}_{mrk}$ ensures that markers known to lie on the same plane remain coplanar. This is commonly encountered in indoor settings where markers are positioned on the floor. The error term is computed as follows:

$$\mathcal{E}_{mrk} = N_{mrk} \sum_{\mathcal{P} \in \mathbb{P}}\ \sum_{m \in \mathcal{P}} d\!\left(\pi_{\mathcal{P}},\ \mathbf{t}_m\right)^2,$$

where $\pi_{\mathcal{P}}$ denotes the optimal plane fitted from the set $\mathcal{P}$ of coplanar markers, $\mathbb{P}$ is the collection of such sets, and $d(\pi_{\mathcal{P}}, \mathbf{t}_m)$ is the Euclidean distance from the translational component of the marker pose $\ddot{T}_m$ to the plane. The normalization factor $N_{mrk}$ ensures that $\mathcal{E}_{mrk} = 1$ when each distance measures exactly 1 centimeter. It is given by the following:

$$N_{mrk} = \frac{1}{\sum_{\mathcal{P} \in \mathbb{P}} |\mathcal{P}|}.$$
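Both coplanarity terms reduce to fitting a plane to a set of 3D centers and accumulating point-to-plane distances. The sketch below shows one possible way to compute such a term, assuming the positions are expressed in centimeters so that the result equals one when every distance is 1 cm; the function name is illustrative.

```python
# Sketch: coplanarity error for a set of camera or marker centers (illustrative).
import numpy as np

def coplanarity_error(centers_cm):
    """centers_cm: Nx3 array of positions expressed in centimetres."""
    pts = np.asarray(centers_cm, float)
    centroid = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - centroid)
    normal = Vt[-1]                            # plane normal = smallest singular vector
    dists = (pts - centroid) @ normal          # signed point-to-plane distances
    return np.sum(dists ** 2) / len(pts)       # equals 1.0 when every |d| is 1 cm
```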
Finally, the optional term $\mathcal{E}_{ctrl}$ refers to the use of control points, which forces cameras and markers to be at specified locations using the following term:

$$\mathcal{E}_{ctrl} = N_{ctrl} \left( \sum_{m \in \mathcal{Q}_m} \left\| \mathbf{t}_m - \mathbf{q}_m \right\|^2 + \sum_{i \in \mathcal{Q}_i} \left\| \mathbf{t}_i - \mathbf{q}_i \right\|^2 \right),$$

where $\mathcal{Q}_m$ and $\mathcal{Q}_i$ correspond to the control points of the markers and cameras, respectively, and $\mathbf{q}_m$ and $\mathbf{q}_i$ are their known locations on the map. As noted, the errors are calculated by measuring the distances between the estimated positions of cameras and markers and their respective ground-truth locations. The normalization factor $N_{ctrl}$ is set so that $\mathcal{E}_{ctrl} = 1$ when all Euclidean distances are exactly 1 millimeter, demanding higher precision for these points compared to the previous cases. The corresponding normalization factor is thus defined as follows:

$$N_{ctrl} = \frac{1}{|\mathcal{Q}_m| + |\mathcal{Q}_i|}.$$
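Putting the pieces together, the following sketch illustrates how the normalized terms could be combined into the single global error $\mathcal{E}$; the argument names and unit conventions follow the description above and are illustrative rather than the authors’ code.

```python
# Sketch: combining the normalized error terms into one global cost (illustrative).
def global_error(reproj_px, cam_dists_cm, mrk_dists_cm, ctrl_dists_mm):
    """Each argument is a list of per-element errors for the corresponding term."""
    def normalised(errors):
        return sum(e ** 2 for e in errors) / len(errors) if errors else 0.0
    e_rep = normalised(reproj_px)       # pixels: 1.0 when every residual is 1 px
    e_cam = normalised(cam_dists_cm)    # centimetres: 1.0 when every distance is 1 cm
    e_mrk = normalised(mrk_dists_cm)    # centimetres
    e_ctrl = normalised(ctrl_dists_mm)  # millimetres: stricter 1 mm target
    return e_rep + e_cam + e_mrk + e_ctrl
```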
3.4. Automatic Estimation of Metagroup Edges
Obtaining the metagroup edges requires determining the set of markers that have not moved between two groups. This can be carried out manually by annotating them at the time of recording the images; however, this process is subject to human error. We propose a method to automatically obtain these markers, knowing that a connection exists between the two groups. Our solution relies on the following idea. Starting with an empty set $\mathcal{F}_{g,g'}$, we add the pair of markers whose relative position in both groups is most similar and compute the rigid transform that moves their corners from one group to the other. Then, we keep adding markers to the set as long as the transform error remains below a threshold. Let us formally describe it below.
Our algorithm starts by computing the relative pose between all marker pairs within the two groups:

$$T^{g}_{m'm} = \left(T^{g}_{m'}\right)^{-1} T^{g}_{m} \qquad \text{and} \qquad T^{g'}_{m'm} = \left(T^{g'}_{m'}\right)^{-1} T^{g'}_{m}.$$

As a starting point for $\mathcal{F}_{g,g'}$, we use the pair of markers $(m, m')$ whose relative position is most similar in the two groups, i.e.,

$$(m, m') = \arg\min_{m, m'} \left\| T^{g}_{m'm} - T^{g'}_{m'm} \right\|.$$

The corners of the markers expressed in the reference systems of the two groups, i.e., $T^{g}_m \cdot \mathbf{c}_l$ and $T^{g'}_m \cdot \mathbf{c}_l$ for $m \in \mathcal{F}_{g,g'}$, are used to obtain the transform between the groups, $T_{g'g}$, using the Horn transform. Then, we can calculate the error of the transform by analyzing how far these corners are when we move them from one group to the other:

$$e\!\left(\mathcal{F}_{g,g'}\right) = \frac{1}{4\,|\mathcal{F}_{g,g'}|} \sum_{m \in \mathcal{F}_{g,g'}}\ \sum_{l=1}^{4} \left\| T_{g'g} \cdot \left(T^{g}_m \cdot \mathbf{c}_l\right) - T^{g'}_m \cdot \mathbf{c}_l \right\|.$$
The more fixed markers found between the groups in $\mathcal{F}_{g,g'}$, the more accurate the transform will be, so we want to add all the common fixed markers. We proceed iteratively by adding the next unselected marker that produces the smallest increment in the error $e$, and we stop when the error added by a particular marker is above a given threshold $\tau$.
The above method has a problem, though: the pair initially selected by Equation (33) may not be correct. It may occur that some of the moved markers were placed in positions very similar to each other. If that happens, they could be selected by Equation (33), and the process would end with only a few markers selected in $\mathcal{F}_{g,g'}$. To overcome this problem, we repeat the process several times, selecting as the starting set not only the best pair of markers but also the second best, the third best, and so on. Thus, multiple candidate sets are obtained. Amongst them, we select the one with the most markers, and in the case of a tie, we choose the one with the lowest error (Equation (35)).
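The sketch below outlines this greedy procedure under simplifying assumptions: the seeding criterion compares inter-marker centroid distances as a proxy for “most similar relative position”, rigid_alignment refers to the SVD-based routine from the earlier alignment sketch, and all helper names are illustrative rather than the authors’ implementation.

```python
# Sketch: automatic estimation of the static markers shared by two groups (illustrative).
import numpy as np

def stack(corners, markers):
    return np.vstack([corners[m] for m in sorted(markers)])

def transfer_error(T, c_g, c_h):
    """Mean distance of one marker's corners moved from group g into group g'."""
    moved = (np.c_[c_g, np.ones(len(c_g))] @ T.T)[:, :3]
    return np.mean(np.linalg.norm(moved - c_h, axis=1))

def rank_seed_pairs(corners_g, corners_h, common):
    """Pairs sorted by how similar their relative placement is in both groups."""
    scored = []
    for i, a in enumerate(common):
        for b in common[i + 1:]:
            d_g = np.linalg.norm(corners_g[a].mean(0) - corners_g[b].mean(0))
            d_h = np.linalg.norm(corners_h[a].mean(0) - corners_h[b].mean(0))
            scored.append((abs(d_g - d_h), (a, b)))
    return [pair for _, pair in sorted(scored)]

def estimate_static_markers(corners_g, corners_h, tau, n_seeds=3):
    """corners_g, corners_h: dict marker id -> 4x3 corners in each group's frame."""
    common = sorted(set(corners_g) & set(corners_h))
    best = set()
    for seed in rank_seed_pairs(corners_g, corners_h, common)[:n_seeds]:
        selected = set(seed)
        while True:
            T = rigid_alignment(stack(corners_g, selected), stack(corners_h, selected))
            remaining = [m for m in common if m not in selected]
            if not remaining:
                break
            err, m = min((transfer_error(T, corners_g[m], corners_h[m]), m)
                         for m in remaining)
            if err > tau:            # the next marker no longer fits: stop growing
                break
            selected.add(m)
        if len(selected) > len(best):
            best = selected          # keep the largest consistent set
    return best
```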
4. Experiments
This section details the experiments conducted to validate our approach. To the best of our knowledge, there are no specific public datasets available to test our method, necessitating the creation of our own. Although our method can be applied to the datasets of our previous work [5], where cameras have overlapping fields of view, these datasets do not contain sparse cameras and thus pose a more straightforward problem. The reverse is not possible, i.e., the previously proposed method cannot solve the problem we tackle in this paper. Consequently, we recorded three datasets in our building with different levels of complexity. We employed fixed cameras located on the ceiling with no overlap in their fields of view and a moving camera (a phone camera). The datasets present no loop closure. The ground-truth positions of the fixed cameras were manually verified to establish a baseline for accuracy. The control points were obtained using the available map of the building. Please note that it is only possible to establish control points in very salient regions of the environment, such as a door or the intersection between two walls. Lastly, the number of fixed cameras in each dataset was selected to maximize camera coverage while ensuring no overlap in their fields of view. This setup was chosen to have as many error measures as possible for evaluation purposes. However, our method could have been applied with only two cameras per environment, obtaining similar results.
To achieve good precision in pose estimation using fiducial markers, it is crucial that the markers cover a significant area in the images and are distributed throughout. Adequate coverage and sufficient angle between the camera plane and the normal to the marker plane ensure accurate pose estimation.
The experiments were conducted on a system equipped with an Intel(R) Xeon(R) Silver 4316 CPU @ 2.30 GHz, running Ubuntu 20.04.6 LTS. The average processing time per dataset was approximately 3 hours and 32 minutes with a non-optimized implementation. Please note that, thanks to the sparse graph optimization techniques employed, the solution remains efficient despite the large number of variables.
This section is divided into six subsections. In Section 4.1, we present our first dataset, in which we reconstruct a single corridor. Section 4.2 presents our second dataset, employed to reconstruct a complete floor of our laboratory building, comprising four corridors and a large hall. Section 4.3 presents our most complex dataset, an entire building comprising two floors interconnected by stairs. Then, Section 4.4 evaluates the method proposed to find static markers between groups, i.e., the method proposed in Section 3.4. Section 4.5 performs a comparative analysis of the proposed method with prior works. Finally, Section 4.6 discusses potential limitations and the adaptability of our system to various environmental conditions.
4.1. Corridor Dataset
In this experiment, our methodology was evaluated within a real corridor (refer to Figure 5a). The setup included six fixed cameras positioned linearly along the ceiling without overlapping views. Each camera had a resolution of … pix and a focal length of approximately 2201 pix. The experiment required two groups to reconstruct the corridor with the aid of a moving camera and 50 ArUco [34] markers. The moving camera operated at a resolution of … pix and a focal length of approximately 1339 pix. This setup was chosen to evaluate the accuracy of our method in preserving spatial continuity across cameras that do not share fields of view. Ground-truth data were generated through manual annotation. Our method is not run in real time; instead, it is executed once each dataset has been fully recorded. An overview summarizing the datasets used in this work, including the number of markers, fixed cameras, frames, and groups required for reconstruction, can be found in Table 1.
We created two different datasets in the same scenario, labeled A and B, by varying the orientations of the fixed cameras (see Figure 5a). In dataset A, the cameras were positioned to face directly downward, whereas in dataset B, they were slightly tilted to provide a broader view of the corridor. To assess the robustness of our approach, we performed ablation studies to evaluate how different optimization terms impact reconstruction quality. These studies involved optimizing the reprojection term $\mathcal{E}_{rep}$ alone, as well as testing all possible combinations of error terms. The results are presented in Table 2.
As observed, the exclusive use of the reprojection error obtains the worst results. Adding marker coplanarity constraints seems to be the most effective one, probably because it helps to mitigate the well-known doming effect [39] that occurs in this type of problem. In any case, combining all errors allows a substantial reduction in the errors in both datasets. In this particular case, dataset B seems to obtain better results.
Although certain combinations of error terms resulted in the lowest errors in these datasets, such outcomes are not consistent across different scenarios. Generally, integrating all optimization terms is the most effective strategy, as shown in the next experiments. In practical applications where direct validation of the results is not possible, it is recommended to employ all available optimization terms to ensure the most robust and reliable performance.
4.2. Complete Floor Dataset
This experiment deals with a more complex environment consisting of an entire floor within a building. The floor layout includes four interconnected corridors and a large central hall, a nexus for all corridors (see Figure 5b). This setup incorporates 22 fixed cameras, all strategically positioned along the ceiling, with no overlap between their fields of view. We employed 50 ArUco markers [34] to cover the area, obtaining 10 different groups using our moving camera. An overview summarizing the datasets used in this work, including the complete floor configuration, can be found in Table 1.
As in the previous case, two variations of the dataset were generated, labeled A and B. In dataset A, the fixed cameras were oriented to face directly downward, while in dataset B, they were angled slightly to provide a different view of the floor layout. As with the previous dataset, we conducted ablation studies on these datasets to assess how different combinations of optimization terms affect the reconstruction accuracy. These results are summarized in Table 3.
As evidenced by the results, just as in the previous datasets, relying solely on the reprojection error term ($\mathcal{E}_{rep}$) leads to suboptimal outcomes compared to including the additional optimization terms. Applying all optimization terms achieves the most robust and consistent results, yielding superior outcomes in dataset B. This enhancement underscores the importance of integrating multiple constraints to effectively address the complexities of pose estimation in expansive environments. Finally, we achieved errors of approximately 27 cm and 24 cm for datasets A and B of the complete floor, respectively.
4.3. Entire Building Dataset
This experiment presents our last dataset, which extends our methodology to an entire building comprising two floors connected by a stairwell section (see Figure 6). This setup allowed us to evaluate the robustness of our camera positioning approach over large vertical distances and across multiple interconnected levels. By testing these configurations, we aimed to assess the method’s accuracy in maintaining spatial continuity and precise camera alignment over extended and complex structures. The fixed and moving camera models were the same as in the previous experiments.
The experiment involved 42 fixed cameras distributed uniformly across the floors, ensuring no overlap in their fields of view. In addition, each floor and stair section included pathways recorded using a moving camera and 50 ArUco markers, obtaining 23 different groups. We performed ablation studies on the dataset to examine the effects of different optimization term combinations on the reconstruction quality. An overview summarizing the datasets used in this work, including the entire building configuration, can be found in Table 1.
The results of these evaluations, as detailed in Table 4, demonstrate the significant role of control points in enhancing pose estimation across the entire building, achieving an optimal outcome with an approximate error of 42 cm. The inclusion of control points proved especially critical in this large-scale dataset, contributing markedly to the accuracy of the results due to the high-dimensional nature of the environment.
4.4. Metagroup Edges Automatic Estimation
This experiment evaluates our algorithm’s capability to automatically find the markers that have not been moved between groups. The accuracy of this marker labeling is crucial, as it directly impacts the quality and reliability of the linkages within the reconstructed environment.
We tested each generated dataset against manually annotated ground-truth data, which provided the precise labeling of markers for evaluation purposes. Table 5 presents the results, detailing the rates of true positives (TP), false positives (FP), and false negatives (FN) for each dataset. These metrics offer a comprehensive assessment of the algorithm’s performance across various configurations.
The results of these evaluations are detailed in Table 5. The high true positive rates across all datasets, reaching 100% in some cases, indicate near-perfect accuracy in identifying markers that have remained stationary between groups. The absence of false positives confirms that no markers were incorrectly classified as common, ensuring that all detected common markers were indeed unchanged. This level of precision highlights our algorithm’s effectiveness in consistently identifying valid sets of common markers, which are crucial for accurate environment reconstruction, especially in large and complex settings, like entire buildings.
4.5. Comparison with Other Works
In this section, we assess how our proposed method compares to established techniques within the field, including the approaches by García et al. [5] and the MarkerMapper system [4], which are both adept at handling camera field-of-view overlaps. Additionally, we extend our comparison to state-of-the-art structure from motion (SfM) implementations such as OpenDroneMap [28], Pix4D [29], and COLMAP [30], as well as the Gaussian splatting approach PolyCam [32]. Despite numerous attempts, these methods failed to produce complete reconstructions due to insufficient camera overlap and a lack of distinctive key points for reliable matching, except for MarkerMapper [4] and the method of Garcia et al. [5], which successfully operated on some datasets. The datasets used for this evaluation, collectively referred to as dense, were introduced by García et al. [5] and test the capacity to manage overlapping views.
These datasets include artificial and real scenarios, representing a single corridor, an entire floor, and multiple floors. In these scenarios, fixed cameras are positioned on the ceilings, while ArUco [34] fiducial markers are placed on the floor. Similar to our work, these datasets feature various camera arrangements as different configurations.
On the other hand, the datasets introduced in our study, labeled as sparse, are specifically designed to assess the efficacy of methods in positioning fixed cameras with no overlap. However, there are no existing methods with publicly available code that can be evaluated using our proposed datasets. Consequently, we cannot compare our results on these datasets with any other existing method.
Table 6 presents the results of our comparative analysis. The symbol ‘×’ indicates that a method was inapplicable, while the symbol ‘−’ denotes that specific error data were unavailable due to a lack of ground truth.
The results in Table 6 show that our method achieves a level of precision comparable to that of the approach proposed by García et al. [5], which itself showed improvements over the results obtained by MarkerMapper [4]. Notably, our method achieves significantly better outcomes in the more complex datasets for which comparisons are possible, reducing the error on the dense complete floor datasets A and B from the previously reported 14.82 cm and 15.72 cm to 5.22 cm and 4.88 cm, respectively.
Our method achieves consistent results for datasets featuring cameras with no overlapping views. However, the error margins increase as the dataset’s complexity increases, similar to what is observed in datasets with overlapping cameras. This consistency across different datasets confirms the effectiveness of our approach under varied conditions.
4.6. Discussion on Limitations and Potential Improvements
In this section, we address the potential limitations of our proposed method and suggest areas for enhancement, particularly with involuntary movements of markers, low-light conditions, outdoor applications, and generalization across various indoor settings.
Our approach assumes that fiducial markers remain stationary throughout the capture process and are removed afterward. Markers do not need to remain in the environment once the calibration is conducted. In environments where there may be slight movements of markers while recording, the accuracy of the system depends on the redundancy of markers and a robust optimization algorithm. However, significant movement would require recalibration or algorithmic adjustments to ensure continued accuracy. Future enhancements could incorporate real-time tracking and adjustment capabilities to better manage such dynamics.
The design of our system is agnostic to the type of marker and the detection method employed. While traditional fiducial markers, such as ArUco [34], demonstrate high robustness in most conditions, they may struggle in extremely poor lighting. In such cases, considering alternative technologies like DeepArUco++ [40] could be beneficial to ensure high accuracy and robustness.
While our primary investigations have focused on indoor environments, extending our system to outdoor scenarios could prove beneficial. Outdoor conditions introduce complexities such as changing weather and variable lighting, which could be mitigated by integrating our system with technologies such as satellite imagery or GPS data. A hybrid approach, which combines fiducial markers with natural feature tracking, could provide more reliable solutions for outdoor navigation and mapping, similar to the concept explored in UcoSLAM [41].
Finally, our method has demonstrated reliable performance across diverse indoor settings in our datasets. However, the efficacy of our system is contingent upon the presence of the markers within the camera images, regardless of variations in room geometry or furniture arrangement. As a consequence, we believe our method can adapt to any indoor layout as long as it is possible to place markers and take pictures of them.
5. Conclusions
This work introduced a novel approach for indoor camera positioning using fiducial markers, uniquely designed to accommodate both overlapping and non-overlapping camera setups in extensive environments. To our knowledge, this is the first method to accurately position any set of fixed cameras in large scenarios, regardless of their overlap status. Our technique employs a mobile camera along with a set of fiducial markers that are strategically placed and moved iteratively in groups. This system facilitates the precise positioning of fixed cameras without overlap. Additionally, we developed an algorithm that automatically identifies markers acting as connectors between these groups, significantly simplifying the camera positioning process.
Experimental validations across multiple complex scenarios confirm the feasibility and robustness of our system, demonstrating consistent accuracy even in environments without visual overlap while achieving state-of-the-art results in environments with camera overlap. This establishes our method as capable of matching the capabilities of previous methods while additionally handling non-overlapping situations effectively. We believe that the proposed method could significantly enhance automated surveillance systems and improve augmented reality applications within complex indoor environments. However, the accuracy of our system still degrades as the scale of the environment grows; therefore, further efforts must be made to improve the method.
Future efforts will focus on refining the automation process for marker detection and enhancing the system’s adaptability to dynamically changing environments. We plan to integrate machine learning algorithms to optimize marker placement and improve camera network configurations. Extending this research to outdoor environments could significantly broaden the applicability of our method, potentially impacting urban planning and automated vehicle navigation. Additionally, we aim to explore the feasibility of incorporating total station technology as a complementary or alternative method for camera positioning. Addressing challenges such as adapting the system to varying outdoor lighting and weather conditions will be crucial.