1. Introduction
Precise pose estimation of multi-camera setups is a challenge in computer vision, especially in scenarios where fixed cameras are positioned to avoid overlapping fields of view. While single-camera pose estimation has been widely explored and applied in fields such as robotics and the positioning of objects and people [1,2,3], traditional methodologies for multi-camera pose estimation predominantly rely on overlapping visual data to establish correspondences and compute relative poses [4,5,6]. When elements must be positioned within the same reference system, it becomes essential to accurately determine and manage the six-degree-of-freedom (6DoF) pose of each camera. However, this reliance on overlapping views is not always feasible in real-world applications, where consistent camera overlap cannot be guaranteed due to environmental constraints or the need for wider spatial coverage [7,8,9]. A representation of this problem is shown in Figure 1. Being able to accurately position cameras without relying on overlapping fields of view would be highly beneficial in various scenarios, such as surveillance systems and indoor robotic navigation [10,11,12,13].
Several solutions have been proposed to address the challenges associated with fixed camera positioning. Feature-based approaches, which identify and rely on distinctive points or features in the scene, have demonstrated effectiveness in environments with rich visual details. Techniques such as structure from motion (SfM) [14,15] and simultaneous localization and mapping (SLAM) [16,17] have been foundational in this domain, leveraging feature detection and matching across overlapping images to estimate camera poses. These methods have driven significant advancements in computer vision by capitalizing on the geometric and photometric relationships inherent in overlapping visual data.
Despite their success, these methodologies face substantial limitations in scenarios where camera views do not overlap. In such cases, the absence of essential correspondences between views severely hampers pose estimation. To address this limitation, several methods designed for non-overlapping camera setups have been developed [6,18]. While these approaches represent progress toward solving the problem, they encounter significant challenges in achieving robust and reliable performance in non-overlapping scenarios, leaving room for further innovation and improvement in this area.
Marker-based approaches provide an alternative solution for pose estimation by relying on the strategic placement of predefined markers within the scene [4,5]. These methods have demonstrated significant robustness in controlled environments where markers remain consistently visible. However, their applicability is significantly reduced in non-overlapping camera setups, as the absence of shared markers across distinct viewpoints hinders effective pose estimation. Although advancements such as those proposed by Zhao et al. [19] have explored this domain, the problem remains unresolved, particularly in addressing the challenges posed by complex and large-scale scenarios. This limitation underscores the inherent constraints of both feature-based and marker-based approaches, highlighting the urgent need for innovative solutions to tackle pose estimation in non-overlapping camera configurations effectively.
This paper makes several contributions, starting with a novel approach capable of accurately positioning any set of fixed cameras in large indoor scenarios, regardless of whether their fields of view overlap, as represented in Figure 2. Our method combines advanced optimization techniques with strategic marker placement, facilitated by an auxiliary mobile camera, to estimate the pose of any fixed camera accurately. To our knowledge, this is the first work to tackle the problem of fixed camera pose estimation without overlap in large and complex conditions.
Additionally, we introduce an algorithm to automatically detect the set of markers that remain static between recordings made with the mobile camera, facilitating the use of our method under any circumstances. Our last contribution is a novel set of datasets specifically designed to test the capability of our method to accurately position fixed cameras without overlap, arranged in various configurations and scenarios. Our experiments validate our method as the first to achieve state-of-the-art results in positioning any fixed camera arrangement with and without overlapping fields of view.
This work builds upon and significantly extends our previous methodology [5] by addressing its key limitations. Our previous approach requires cameras to have overlapping fields of view, thus forcing the use of a very large number of cameras to solve the problem. In contrast, the method presented in this paper does not require any camera overlap, allowing its use in realistic scenarios. This more challenging problem is addressed through strategically placed reusable fiducial markers and an auxiliary mobile camera that captures marker observations. The main contributions of this paper are as follows: (i) a novel approach is devised to estimate the pose of sparse cameras (without overlapping fields of view) by combining mobile cameras and fixed views; (ii) a method is created to automatically determine the markers that remain static between recordings made with the mobile camera; (iii) a meta-graph is introduced to fuse all visual information into a single optimization process that reduces the reprojection error and enforces other structural restrictions; (iv) datasets are created to evaluate the proposal.
The rest of this work is organized as follows. Section 2 describes the related works that frame the context of our research. Section 3 describes our methodology in detail. Section 4 presents experimental results validating the effectiveness of our approach. Section 5 discusses the implications of our findings and outlines potential future research directions.
2. Related Works
Estimating the pose of a set of sparse cameras under the same coordinate system is a fundamental task in computer vision applications such as 3D reconstruction, robotics, and augmented reality. Prior methods often rely on overlapping fields of view between cameras to establish correspondences and compute relative poses [4,5,6]. However, in real-world scenarios, cameras may be positioned with non-overlapping views, which reduces cost by minimizing the number of cameras required but presents unique challenges for accurate pose estimation. Recent developments have introduced various methods for general indoor positioning [20,21,22]. However, these techniques are not able to address the pose estimation of fixed sparse cameras, highlighting the importance of image-based solutions for tackling this problem.
In non-overlapping camera networks, the absence of shared visual information complicates finding correspondences between camera views, which is essential for conventional pose estimation techniques. Despite significant advancements, as highlighted in previous studies [6,23,24], achieving reliable pose estimation in large-scale environments without overlapping views remains an unresolved challenge. Each of these methods has advanced the field, yet they continue to rely on environments that provide substantial visual texture, indicating that this remains a critical area for ongoing research, particularly in large scenarios where such details are scarce.
In scenarios requiring precise camera positioning over large distances, total stations and laser measurement tools provide an alternative approach, offering high accuracy but at increased cost and operational complexity. These instruments are particularly advantageous in scenarios where traditional methods face challenges in precision or scale. However, the exploration of camera positioning using these methodologies remains underdeveloped, especially for multi-camera pose estimation in sparse environments. While some studies have examined their application in camera pose estimation [25,26], further research is essential to fully integrate high-precision measurement tools into practical and scalable solutions for complex indoor environments.
Various methodologies, prominently feature-based and marker-based, have been developed to tackle camera pose estimation. Feature-based systems, effective in conditions with sufficient overlap, detect and match key points across images to compute relative camera poses. However, in non-overlapping camera networks, the absence of shared features poses challenges, as traditional feature-based approaches cannot establish the necessary correspondences for pose estimation.
Structure from Motion (SfM) is a computational technique initially detailed by Ullman [27], which reconstructs 3D structures through the analysis of motion sequences captured from multiple viewpoints. Essential tools such as OpenDroneMap [28], Pix4D [29], and COLMAP [30] facilitate this process by using feature-matching techniques across overlapping images to estimate camera poses and construct 3D models. A significant hurdle in SfM is the lack of common features between images, which is critical for ensuring accurate view alignment and thus affects the overall efficacy of the reconstruction. This makes its application challenging in indoor environments or in scenarios where the cameras capturing the images have no overlapping views.
While deep learning has broadened the scope of SfM, challenges persist in scenarios with non-overlapping views, despite advances such as those by Wang et al. [14] and Ren et al. [15]. Gaussian splatting, developed by Kerbl et al. [31], marks a significant evolution of traditional SfM methods by utilizing neural networks to model scenes as continuous volumetric functions. This technique significantly improves the handling of complex scenes and streamlines the reconstruction process. Tools from PolyCam [32] support the rapid testing and deployment of Gaussian splatting. However, the efficacy of this technique relies heavily on the quality of the initial SfM reconstruction. Further expanding on this concept, a novel method by Cao et al. [33] integrates object detection into Gaussian splatting, potentially offering solutions to the challenges discussed in this paper. However, like other existing methods, its application remains limited without overlapping views.
Simultaneous localization and mapping (SLAM) is a principal method among feature-based techniques, enabling a camera to construct a map of an unknown environment while concurrently determining its location within that map. Existing methods, such as those proposed by Romero-Ramirez et al. [16] and Campos et al. [17], demonstrate significant effectiveness. However, despite their capabilities, SLAM methodologies encounter difficulties in non-overlapping camera networks, primarily due to the lack of shared visual features, which are essential for traditional SLAM operations.
For these reasons, new methodologies based on SLAM, such as those proposed by Dai et al. [6] and Ataer-Cansizoglu et al. [18], are being developed to tackle the challenges posed by non-overlapping environments. While these methods showcase innovative approaches to extend SLAM techniques beyond their traditional constraints, they inherently rely on environments rich in identifiable details. This reliance stems from their dependence on feature-based techniques, necessitating dense environmental data to function effectively.
Despite the effectiveness of feature-based systems, marker-based systems provide a superior solution for accurate pose estimation in controlled scenarios. These systems use fiducial markers such as ArUco [34], which are robust and easy to detect, but they generally require markers to be visible in multiple camera views to establish correspondences. In non-overlapping scenarios, however, the effectiveness of marker-based systems is reduced, as the lack of shared marker observations prevents the direct computation of relative poses. To address this, methods such as the one proposed by Zhao et al. [19] offer solutions using an additional camera and a chessboard pattern. However, these solutions are constrained to smaller scales and are far removed from real-world applications.
At a larger scale, MarkerMapper [4] offers a notable solution designed to perform simultaneous localization and mapping using fiducial markers. It capitalizes on the detection of markers to build a map of the environment and estimate the camera’s pose within that map. While effective in environments where markers are visible across multiple views, its dependency on overlapping fields of view limits its utility in non-overlapping camera networks.
Lastly, the method proposed by Garcia et al. [5] extends MarkerMapper specifically to navigate the complexities of large-scale scenarios with some overlapping views. Their strategy utilizes reusable markers alongside scene geometry constraints, significantly enhancing pose estimation accuracy in vast and intricate environments. The method is based on placing markers in the overlapping view of two fixed cameras to enable the computation of the inter-camera pose. However, it is not suited for scenarios where cameras operate without any overlapping fields of view, as the method relies on the presence of at least some shared markers for accurate positioning.
This paper introduces a novel approach for estimating the pose of sparse (i.e., non-overlapping) fixed camera networks. Our approach fundamentally differs from existing methodologies by utilizing a reusable set of markers, which are iteratively mapped using an auxiliary camera that records every placed marker. We enhance this method by integrating advanced optimization techniques and leveraging scene geometry, aiming to achieve precise pose estimation without relying on overlapping fields of view. To our knowledge, this innovative approach addresses a challenge that existing methods have not effectively resolved.
3. Proposed Method
This section presents the proposed methodology for estimating the poses of sparse fixed indoor cameras using fiducial markers. In our previous work [5], a solution was proposed assuming that some fixed cameras are close to each other, so that they share part of their field of view. Thus, a large area can be covered by many fixed cameras, creating a path of shared fields of view. In that use case, markers placed on the ground are employed to obtain the pairwise relationship (pose) between the cameras and, thus, their global pose on a map. However, in many scenarios, this option is not possible, and we only have a small subset of sparse fixed cameras completely unconnected from each other.
Our solution to the sparse camera pose estimation problem relies on the idea that the pose of a camera with respect to a global reference system can be easily obtained from an image showing one or more markers whose poses are known. The problem then becomes placing markers over a large area and accurately estimating their poses. This problem can be solved by the method proposed in MarkerMapper [4], in which a set of markers is placed in the environment and their poses are estimated from images, i.e., structure from motion using markers instead of key points. The problem with that approach is that it requires an extremely large number of unique markers, which is unfeasible in even relatively small scenarios.
Our approach, illustrated in Figure 2, consists of placing a small set of markers on the ground and taking several images with a moving camera (i.e., a phone camera). The images, showing the markers from several viewpoints, are employed to obtain a reconstruction of the markers relative to one of them; this is called a group. Then, a subset of the markers is moved while the rest are left in their positions, and the operation is repeated, creating another group. This second group is connected to the previous one by the fixed markers. The process is repeated until the whole area to be mapped is covered, making sure that markers are placed under the fixed cameras we aim to locate. Thus, our set of images contains not only images from the moving camera but also from the fixed cameras.
We first estimate the poses of the markers and the images involved within a group using a local reference system for each group (e.g., centered at one of the markers). Then, since groups are connected, it is possible to find the transformation between any group and one of them, which acts as a global reference system. In doing so, we obtain the poses of the fixed cameras in that global reference system.
The rest of this section explains the proposed method in detail. First, Section 3.1 provides some mathematical definitions necessary for the rest of the paper. Second, Section 3.2 explains how groups are analyzed, and Section 3.3 presents how they are merged into a metagroup to solve the proposed problem. Finally, Section 3.4 explains how to automatically determine the connections between groups.
3.1. Mathematical Definitions
Let us represent a three-dimensional point within a given reference system $a$ as $\mathbf{p}^a \in \mathbb{R}^3$. To convert this point to a different reference system, labeled as $b$, rotation and translation are required. Let us denote $T_{ba}$ as the SE(3) homogeneous transformation matrix that transforms points from $a$ to $b$ as follows:

$$\begin{pmatrix} \mathbf{p}^b \\ 1 \end{pmatrix} = T_{ba} \begin{pmatrix} \mathbf{p}^a \\ 1 \end{pmatrix}, \qquad T_{ba} = \begin{pmatrix} R_{ba} & \mathbf{t}_{ba} \\ \mathbf{0}^\top & 1 \end{pmatrix}.$$

To ease the notation, we employ the operator ($\cdot$) as follows:

$$\mathbf{p}^b = T_{ba} \cdot \mathbf{p}^a.$$

It should also be noted that the transformation from system $c$ to system $b$ ($T_{bc}$), followed by the transformation from $b$ to $a$ ($T_{ab}$), can be combined through matrix multiplication into a single transformation $T_{ac}$ as shown below:

$$T_{ac} = T_{ab}\, T_{bc}.$$

Image formation uses the pinhole camera model: a point in three-dimensional space, $\mathbf{p}$, is projected onto a camera pixel $\mathbf{u}$. Given known camera parameters, this projection can be described by the function $\Psi$:

$$\mathbf{u} = \Psi(\delta, T, \mathbf{p}),$$

where $\delta$ denotes the intrinsic parameters of the camera, and $T$ represents the camera pose at the time the image was captured, i.e., the transformation that relocates a point from any reference system to the camera’s system.
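To make this notation concrete, the following minimal Python sketch (not part of the original method; NumPy only, with illustrative function names) shows how an SE(3) transform is built and composed, how the ($\cdot$) operator acts on a 3D point, and how the pinhole projection $\Psi$ maps a point to a pixel given the intrinsics $\delta$.

```python
# Minimal sketch of the Section 3.1 notation (illustrative, not the authors' code).
import numpy as np

def make_transform(R, t):
    """Build a 4x4 SE(3) matrix from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_point(T_ba, p_a):
    """Apply T_ba to a 3D point expressed in system a (the '.' operator)."""
    p_h = np.append(p_a, 1.0)           # homogeneous coordinates
    return (T_ba @ p_h)[:3]

def project(K, T_cw, p_w):
    """Pinhole projection Psi: world point -> pixel, K holds the intrinsics."""
    p_c = transform_point(T_cw, p_w)     # point in the camera reference system
    u = K @ p_c
    return u[:2] / u[2]                  # perspective division

# Composition: T_ac = T_ab @ T_bc chains transforms through system b.
```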
3.2. Group Pose Graphs
As previously indicated, we use a set of markers $\mathcal{M}$ that are first placed on the ground, and then take a set of images $\mathcal{I}$ of them. We will use Figure 3 to guide the explanation. While Figure 3a shows the group’s configuration, Figure 3b shows the images obtained by the moving camera. Please note that some images belong to fixed cameras, while others belong to the moving camera.
Let $\mathbb{G}$ be the set of groups. We define a group as $g = (\mathcal{M}^g, \mathcal{I}^g, \mathcal{O}^g)$, where $\mathcal{M}^g$ denotes the set of poses of the markers in the reference system of group $g$, $\mathcal{I}^g$ represents the set of image poses in the reference system of group $g$, and $\mathcal{O}^g$ represents the set of observations of markers in images. In other words, the tuple $(m, i) \in \mathcal{O}^g$ indicates that the marker $m$ is observed in the image $i$.
Our goal is to estimate the poses $\mathcal{M}^g$ and $\mathcal{I}^g$ in the group’s local reference system. To do so, we select one of the markers of the group, $m_0$, and assume its center is the group reference system, i.e., $T^g_{m_0} = \mathbf{I}_{4\times 4}$. To that end, we first create a group pose quiver (Figure 3c), where nodes represent markers, and edges are the pair-wise pose relationships between markers obtained from their image observations. The quiver is then refined into a group pose graph that is ultimately optimized using sparse graph optimization.
A marker is a squared planar object whose four corners $\mathbf{c}_l,\ l \in \{1,\dots,4\}$, can be expressed with respect to the center of the marker as follows:

$$\mathbf{c}_1 = \left(-\tfrac{s}{2}, \tfrac{s}{2}, 0\right)^{\top}, \quad \mathbf{c}_2 = \left(\tfrac{s}{2}, \tfrac{s}{2}, 0\right)^{\top}, \quad \mathbf{c}_3 = \left(\tfrac{s}{2}, -\tfrac{s}{2}, 0\right)^{\top}, \quad \mathbf{c}_4 = \left(-\tfrac{s}{2}, -\tfrac{s}{2}, 0\right)^{\top},$$

where $s$ is the length of the marker sides. Let us denote $\mathbf{u}^l_{i,m}$ as the observed position of the corner $l$ of marker $m$ in image $i$. Using the PnP solution [35], we estimate the relative pose $T_{im}$ from the marker to the image.
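As an illustration of this step, the sketch below estimates the marker-to-camera pose from the four detected corners using OpenCV’s PnP solver. It is a hedged example under the assumption of an OpenCV-based pipeline; the corner ordering matches the layout given above, and the function name marker_pose is illustrative.

```python
# Sketch: marker pose from its four detected corners via PnP (illustrative).
import cv2
import numpy as np

def marker_pose(corners_px, s, K, dist=None):
    """corners_px: 4x2 array of detected corner pixels, in the same order as obj_pts."""
    half = s / 2.0
    obj_pts = np.array([[-half,  half, 0],   # corner layout in the marker frame
                        [ half,  half, 0],
                        [ half, -half, 0],
                        [-half, -half, 0]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, corners_px.astype(np.float64), K, dist,
                                  flags=cv2.SOLVEPNP_IPPE_SQUARE)
    R, _ = cv2.Rodrigues(rvec)
    T_im = np.eye(4)                          # transform from marker to camera frame
    T_im[:3, :3], T_im[:3, 3] = R, tvec.ravel()
    return T_im
```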
Furthermore, if more than one marker is observed in image $i$, we obtain a pair-wise relationship between any two observed markers $m$ and $m'$. Thus, we shall denote $T^{i}_{m'm} = T^{-1}_{im'}\, T_{im}$ as the transform that moves from marker $m$ to marker $m'$ given the observations from image $i$.
A pose quiver, as shown in Figure 3c, is constructed from all pair-wise combinations, with nodes symbolizing markers and edges depicting their pair-wise relationships. Among all potential edges connecting two markers, the one exhibiting the minimal reprojection error is chosen to form a group pose graph (see Figure 3d), where each edge carries the selected relative pose $T_{m'm}$ between markers.
To obtain an initial estimation of the marker poses before the optimization process, we select one of the markers, $m_0$, as the local reference system (i.e., $T^g_{m_0} = \mathbf{I}_{4\times 4}$) and apply Dijkstra’s algorithm to determine the best path to all other nodes. Then, an initial estimation $\hat{T}^g_m$ of the local marker poses $\mathcal{M}^g$ is obtained by chaining the relative poses along the path from $m_0$ to $m$:

$$\hat{T}^g_m = T_{m_0 m_1}\, T_{m_1 m_2} \cdots T_{m_{n-1} m},$$

where $m_1, \dots, m_{n-1}$ are the intermediate markers along the path. Similarly, we obtain an initial estimation $\hat{T}^g_i$ of the image poses $\mathcal{I}^g$:

$$\hat{T}^g_i = \hat{T}^g_m\, T^{-1}_{im},$$

from any of the markers $m$ observed by image $i$, i.e., $(m, i) \in \mathcal{O}^g$.
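The following sketch illustrates this initialization step under the assumption that the best pairwise relative poses and their reprojection errors have already been computed; it builds the group pose graph with NetworkX and chains transforms along Dijkstra paths. Names such as initial_marker_poses are illustrative, not the authors’ implementation.

```python
# Sketch: initial marker poses by chaining pairwise transforms along Dijkstra paths.
import networkx as nx
import numpy as np

def initial_marker_poses(edges, m0):
    """edges: dict {(a, b): (T_ab, err)} where T_ab moves points from marker b to a."""
    G = nx.Graph()
    for (a, b), (T, err) in edges.items():
        G.add_edge(a, b, T=T, err=err)
    poses = {m0: np.eye(4)}                       # m0 defines the group reference
    for m in G.nodes:
        if m == m0:
            continue
        path = nx.dijkstra_path(G, m0, m, weight="err")
        T = np.eye(4)
        for a, b in zip(path[:-1], path[1:]):     # chain pairwise transforms
            T_edge = G[a][b]["T"]
            # stored transform maps b -> a only if the edge was inserted as (a, b)
            T = T @ (T_edge if (a, b) in edges else np.linalg.inv(T_edge))
        poses[m] = T
    return poses
```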
Given the initial estimation of the values, they are refined by minimizing the reprojection error of the observed markers in the images using the sparse version of the Levenberg–Marquardt algorithm, which exploits the sparsity of the Jacobian matrix to efficiently handle large-scale optimization problems typical in graph-based formulations [36,37]:

$$\{T^g_m, T^g_i\} = \arg\min \sum_{(m,i)\in\mathcal{O}^g}\ \sum_{l=1}^{4} \left\| \mathbf{u}^l_{i,m} - \Psi\!\left(\delta_i,\ (T^g_i)^{-1} T^g_m,\ \mathbf{c}_l\right) \right\|^2,$$

where $\delta_i$ denotes the intrinsic parameters of the camera that captured image $i$, and $(T^g_i)^{-1} T^g_m$ transforms the marker corners into the reference system of image $i$ before projection.
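A possible way to implement this refinement is sketched below with SciPy. Since SciPy does not expose a sparse Levenberg–Marquardt variant, its sparsity-aware trust-region least-squares solver (method='trf' with jac_sparsity) is used as a stand-in; residual_fn and blocks are assumed to be provided by the caller and are illustrative.

```python
# Sketch: refining stacked marker/image poses with a sparsity-aware least-squares solver.
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

def refine_group(x0, residual_fn, blocks):
    """
    x0          : initial stacked 6-DoF parameters (markers + images).
    residual_fn : x -> vector of corner reprojection errors.
    blocks      : list of (residual_rows, parameter_cols) pairs that interact,
                  used to declare the sparsity pattern of the Jacobian.
    """
    n_res = len(residual_fn(x0))
    sparsity = lil_matrix((n_res, len(x0)), dtype=int)
    for rows, cols in blocks:                 # each observation touches one marker
        for r in rows:                        # pose and one image pose only
            sparsity[r, cols] = 1
    sol = least_squares(residual_fn, x0, jac_sparsity=sparsity,
                        method="trf", verbose=0)
    return sol.x
```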
3.3. Metagroup Pose Graph Optimization
The previous process is repeated independently for each one of the groups $g \in \mathbb{G}$, obtaining their poses relative to a local reference. Our goal now is to obtain the poses of both the cameras and the markers of all groups in a common reference system, e.g., centered at the reference marker of the first group, $g_0$.
As already explained, there is a connection between some groups $g$ and $g'$, i.e., there is a subset of markers $\mathcal{F}_{g,g'}$ between them that have not been moved. If $\mathcal{F}_{g,g'}$ is known, it is possible to obtain the transform $T_{g'g}$ that moves from one group to the other. Then, it holds that

$$T^{g'}_m = T_{g'g}\, T^{g}_m, \qquad \forall m \in \mathcal{F}_{g,g'}.$$
The subset $\mathcal{F}_{g,g'}$ can be manually annotated when collecting the images or automatically obtained, as explained later in Section 3.4. In either case, it allows us to build the metagroup pose graph, where nodes are groups and edges their relationships $T_{g'g}$ (see Figure 4). As in the previous case, we employ Dijkstra’s algorithm to determine the best path from any group to the reference group $g_0$. Then, we shall denote $\dot{T}_m$ as the pose $T^g_m$ transformed to the global reference system (i.e., that of $g_0$). The same is applied to obtain the transformed image poses $\dot{T}_i$.
Now, the poses are referred to the reference system of the first group, $g_0$. However, in practical scenarios, it is often preferable to express these poses relative to a CAD model or map of the building. To achieve this, control points can be utilized. A control point represents the known position of a marker or a camera within the map, and $\mathcal{Q}$ denotes the collection of these points. By applying the Horn [38] algorithm with at least three control points, we determine the optimal rigid transformation $T_w$ that aligns the image and marker poses with the map’s reference system:

$$\ddot{T}_m = T_w\, \dot{T}_m \qquad \text{and} \qquad \ddot{T}_i = T_w\, \dot{T}_i.$$
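The sketch below shows a closed-form rigid alignment between matched 3D control points using an SVD-based (Kabsch/Umeyama-style) formulation, which yields the same rigid transform as Horn’s quaternion method for this purpose. Function and variable names are illustrative.

```python
# Sketch: rigid alignment of the reconstruction to the map from >= 3 control points.
import numpy as np

def rigid_alignment(src, dst):
    """Return the 4x4 transform T such that T applied to src_i best matches dst_i."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```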
Our final optimization function combines multiple objectives into a single global error:

$$\mathcal{E} = \mathcal{E}_{rep} + \mathcal{E}_{cam} + \mathcal{E}_{mrk} + \mathcal{E}_{ctrl}.$$

The term $\mathcal{E}_{rep}$ represents the global reprojection error of the metagraph:

$$\mathcal{E}_{rep} = N_{rep} \sum_{g \in \mathbb{G}}\ \sum_{(m,i)\in\mathcal{O}^g}\ \sum_{l=1}^{4} \left\| \mathbf{u}^l_{i,m} - \Psi\!\left(\delta_i,\ \ddot{T}_i^{-1} \ddot{T}_m,\ \mathbf{c}_l\right) \right\|^2,$$

where $N_{rep}$ serves as a normalization factor such that $\mathcal{E}_{rep} = 1$ when each individual error equals one. It is computed as the inverse of the total number of projected points, i.e., $N_{rep} = 1 / \left(4 \sum_{g\in\mathbb{G}} |\mathcal{O}^g|\right)$. The normalization factor allows us to combine the different error terms independently of the number of markers, images, or control points optimized.
The optional error term $\mathcal{E}_{cam}$ ensures that cameras known to be on the same plane remain coplanar. This is particularly relevant in indoor settings, where cameras are often installed on the ceiling. We define $\mathcal{C}$ as a set of fixed cameras that share a single plane. In scenarios with multiple planes, such as buildings with several floors, each plane is associated with its own $\mathcal{C}$ group. The collection of all these groups is represented as $\mathbb{C}$. The error term is then defined as follows:

$$\mathcal{E}_{cam} = N_{cam} \sum_{\mathcal{C} \in \mathbb{C}}\ \sum_{i \in \mathcal{C}} d\!\left(\pi_{\mathcal{C}},\ \mathbf{t}_i\right)^2.$$

Here, $\pi_{\mathcal{C}}$ denotes the optimal plane derived from the camera poses in the set $\mathcal{C}$. Meanwhile, $d(\pi_{\mathcal{C}}, \mathbf{t}_i)$ represents the Euclidean distance from the translational component $\mathbf{t}_i$ of the camera pose $\ddot{T}_i$ to the plane. The normalization factor $N_{cam}$ is adjusted so that $\mathcal{E}_{cam} = 1$ when all distances are precisely 1 centimeter, thus equating one pixel of error in $\mathcal{E}_{rep}$ to one centimeter in $\mathcal{E}_{cam}$. The normalization factor $N_{cam}$ can be defined as follows:

$$N_{cam} = \frac{1}{\sum_{\mathcal{C} \in \mathbb{C}} |\mathcal{C}|}.$$
Similarly, the optional error term $\mathcal{E}_{mrk}$ ensures that markers known to lie on the same plane remain coplanar. This is commonly encountered in indoor settings where markers are positioned on the floor. The error term is computed as follows:

$$\mathcal{E}_{mrk} = N_{mrk} \sum_{\mathcal{P} \in \mathbb{P}}\ \sum_{m \in \mathcal{P}} d\!\left(\pi_{\mathcal{P}},\ \mathbf{t}_m\right)^2,$$

where $\pi_{\mathcal{P}}$ denotes the optimal plane fitted from the set $\mathcal{P}$ of coplanar markers, $\mathbb{P}$ is the collection of such sets, and $d(\pi_{\mathcal{P}}, \mathbf{t}_m)$ is the Euclidean distance from the translational component of the marker pose $\ddot{T}_m$ to the plane. The normalization factor $N_{mrk}$ ensures that $\mathcal{E}_{mrk} = 1$ when each distance measures exactly 1 centimeter. It is given by the following:

$$N_{mrk} = \frac{1}{\sum_{\mathcal{P} \in \mathbb{P}} |\mathcal{P}|}.$$
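Both coplanarity terms reduce to fitting a plane to a set of 3D centers and accumulating point-to-plane distances. The sketch below shows one possible way to compute such a term, assuming the positions are expressed in centimeters so that the result equals one when every distance is 1 cm; the function name is illustrative.

```python
# Sketch: coplanarity error for a set of camera or marker centers (illustrative).
import numpy as np

def coplanarity_error(centers_cm):
    """centers_cm: Nx3 array of positions expressed in centimetres."""
    pts = np.asarray(centers_cm, float)
    centroid = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - centroid)
    normal = Vt[-1]                            # plane normal = smallest singular vector
    dists = (pts - centroid) @ normal          # signed point-to-plane distances
    return np.sum(dists ** 2) / len(pts)       # equals 1.0 when every |d| is 1 cm
```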
Finally, the optional term $\mathcal{E}_{ctrl}$ refers to the use of control points, which forces cameras and markers to be at specified locations using the following term:

$$\mathcal{E}_{ctrl} = N_{ctrl} \left( \sum_{m \in \mathcal{Q}_m} \left\| \mathbf{t}_m - \mathbf{q}_m \right\|^2 + \sum_{i \in \mathcal{Q}_i} \left\| \mathbf{t}_i - \mathbf{q}_i \right\|^2 \right),$$

where $\mathcal{Q}_m$ and $\mathcal{Q}_i$ correspond to the control points of the markers and cameras, respectively, and $\mathbf{q}_m$ and $\mathbf{q}_i$ are their known locations on the map. As noted, the errors are calculated by measuring the distances between the estimated positions of cameras and markers and their respective ground-truth locations. The normalization factor $N_{ctrl}$ is set so that $\mathcal{E}_{ctrl} = 1$ when all Euclidean distances are exactly 1 millimeter, demanding higher precision for these points compared to the previous cases. The corresponding normalization factor is thus defined as follows:

$$N_{ctrl} = \frac{1}{|\mathcal{Q}_m| + |\mathcal{Q}_i|}.$$
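Putting the pieces together, the following sketch illustrates how the normalized terms could be combined into the single global error $\mathcal{E}$; the argument names and unit conventions follow the description above and are illustrative rather than the authors’ code.

```python
# Sketch: combining the normalized error terms into one global cost (illustrative).
def global_error(reproj_px, cam_dists_cm, mrk_dists_cm, ctrl_dists_mm):
    """Each argument is a list of per-element errors for the corresponding term."""
    def normalised(errors):
        return sum(e ** 2 for e in errors) / len(errors) if errors else 0.0
    e_rep = normalised(reproj_px)       # pixels: 1.0 when every residual is 1 px
    e_cam = normalised(cam_dists_cm)    # centimetres: 1.0 when every distance is 1 cm
    e_mrk = normalised(mrk_dists_cm)    # centimetres
    e_ctrl = normalised(ctrl_dists_mm)  # millimetres: stricter 1 mm target
    return e_rep + e_cam + e_mrk + e_ctrl
```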
3.4. Automatic Estimation of Metagroup Edges
Obtaining the metagroup edges requires determining the set of markers that have not moved between two groups. This can be carried out manually by annotating them at the time of recording the images; however, this process is subject to human error. We propose a method to automatically obtain these markers, knowing that a connection exists between the two groups. Our solution relies on the following idea. Starting with an empty set $\mathcal{F}_{g,g'}$, we add the pair of markers whose relative position in both groups is most similar and compute the rigid transform that moves their corners from one group to the other. Then, we keep adding markers to the set as long as the transform error remains below a threshold. Let us formally describe it below.
Our algorithm starts by computing the relative pose between all marker pairs within the two groups:

$$T^{g}_{m'm} = \left(T^{g}_{m'}\right)^{-1} T^{g}_{m} \qquad \text{and} \qquad T^{g'}_{m'm} = \left(T^{g'}_{m'}\right)^{-1} T^{g'}_{m}.$$

As a starting point for $\mathcal{F}_{g,g'}$, we use the pair of markers $(m, m')$ whose relative position is most similar in the two groups, i.e.,

$$(m, m') = \arg\min_{m, m'} \left\| T^{g}_{m'm} - T^{g'}_{m'm} \right\|.$$

The corners of the markers expressed in the reference systems of the two groups, i.e., $T^{g}_m \cdot \mathbf{c}_l$ and $T^{g'}_m \cdot \mathbf{c}_l$ for $m \in \mathcal{F}_{g,g'}$, are used to obtain the transform between the groups, $T_{g'g}$, using the Horn transform. Then, we can calculate the error of the transform by analyzing how far these corners are when we move them from one group to the other:

$$e\!\left(\mathcal{F}_{g,g'}\right) = \frac{1}{4\,|\mathcal{F}_{g,g'}|} \sum_{m \in \mathcal{F}_{g,g'}}\ \sum_{l=1}^{4} \left\| T_{g'g} \cdot \left(T^{g}_m \cdot \mathbf{c}_l\right) - T^{g'}_m \cdot \mathbf{c}_l \right\|.$$
The more fixed markers found between the groups in $\mathcal{F}_{g,g'}$, the more accurate the transform will be, so we want to add all the common fixed markers. We proceed iteratively by adding the next unselected marker that produces the smallest increment in the error $e$, and we stop when the error added by a particular marker is above a given threshold $\tau$.
The above method has a problem, though: the pair initially selected by Equation (33) may not be correct. It may occur that some of the moved markers were placed in positions very similar to each other. If that happens, they could be selected by Equation (33), and the process would end with only a few markers selected in $\mathcal{F}_{g,g'}$. To overcome this problem, we repeat the process several times, selecting as the starting set not only the best pair of markers but also the second best, the third best, and so on. Thus, multiple candidate sets are obtained. Amongst them, we select the one with the most markers, and in the case of a tie, we choose the one with the lowest error (Equation (35)).
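The sketch below outlines this greedy procedure under simplifying assumptions: the seeding criterion compares inter-marker centroid distances as a proxy for “most similar relative position”, rigid_alignment refers to the SVD-based routine from the earlier alignment sketch, and all helper names are illustrative rather than the authors’ implementation.

```python
# Sketch: automatic estimation of the static markers shared by two groups (illustrative).
import numpy as np

def stack(corners, markers):
    return np.vstack([corners[m] for m in sorted(markers)])

def transfer_error(T, c_g, c_h):
    """Mean distance of one marker's corners moved from group g into group g'."""
    moved = (np.c_[c_g, np.ones(len(c_g))] @ T.T)[:, :3]
    return np.mean(np.linalg.norm(moved - c_h, axis=1))

def rank_seed_pairs(corners_g, corners_h, common):
    """Pairs sorted by how similar their relative placement is in both groups."""
    scored = []
    for i, a in enumerate(common):
        for b in common[i + 1:]:
            d_g = np.linalg.norm(corners_g[a].mean(0) - corners_g[b].mean(0))
            d_h = np.linalg.norm(corners_h[a].mean(0) - corners_h[b].mean(0))
            scored.append((abs(d_g - d_h), (a, b)))
    return [pair for _, pair in sorted(scored)]

def estimate_static_markers(corners_g, corners_h, tau, n_seeds=3):
    """corners_g, corners_h: dict marker id -> 4x3 corners in each group's frame."""
    common = sorted(set(corners_g) & set(corners_h))
    best = set()
    for seed in rank_seed_pairs(corners_g, corners_h, common)[:n_seeds]:
        selected = set(seed)
        while True:
            T = rigid_alignment(stack(corners_g, selected), stack(corners_h, selected))
            remaining = [m for m in common if m not in selected]
            if not remaining:
                break
            err, m = min((transfer_error(T, corners_g[m], corners_h[m]), m)
                         for m in remaining)
            if err > tau:            # the next marker no longer fits: stop growing
                break
            selected.add(m)
        if len(selected) > len(best):
            best = selected          # keep the largest consistent set
    return best
```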
4. Experiments
This section details the experiments conducted to validate our approach. To the best of our knowledge, there are no specific public datasets available to test our method, necessitating the creation of our own. Although our method can be applied to the datasets of our previous work [5], where cameras have overlapping fields of view, these datasets do not contain sparse cameras and thus pose a more straightforward problem. The reverse is not possible, i.e., the previously proposed method cannot solve the problem we tackle in this paper. Consequently, we recorded three datasets in our building with different levels of complexity. We employed fixed cameras located on the ceiling with no overlap in their fields of view and a moving camera (a phone camera). The datasets present no loop closure. The ground-truth positions of the fixed cameras were manually verified to establish a baseline for accuracy. The control points were obtained using the available map of the building. Please note that it is only possible to establish control points in very salient regions of the environment, such as a door or the intersection between two walls. Lastly, the number of fixed cameras in each dataset was selected to maximize camera coverage while ensuring no overlap in their fields of view. This setup was chosen to have as many error measures as possible for evaluation purposes. However, our method could have been applied with only two cameras per environment, obtaining similar results.
To achieve good precision in pose estimation using fiducial markers, it is crucial that the markers cover a significant area in the images and are distributed throughout. Adequate coverage and sufficient angle between the camera plane and the normal to the marker plane ensure accurate pose estimation.
The experiments were conducted on a system equipped with an Intel(R) Xeon(R) Silver 4316 CPU @ 2.30 GHz, running Ubuntu 20.04.6 LTS. The average processing time per dataset was approximately 3 hours and 32 minutes with a non-optimized implementation. Please note that, thanks to the sparse graph optimization techniques employed, the solution remains efficient despite the large number of variables.
This section is divided into six subsections. In Section 4.1, we present our first dataset, in which we reconstruct a single corridor. Section 4.2 presents our second dataset, employed to reconstruct a complete floor of our laboratory building, comprising four corridors and a large hall. Section 4.3 presents our most complex dataset, an entire building comprising two floors interconnected by stairs. Then, Section 4.4 evaluates the method proposed to find static markers between groups, i.e., the method proposed in Section 3.4. Section 4.5 performs a comparative analysis of the proposed method with prior works. Finally, Section 4.6 discusses potential limitations and the adaptability of our system to various environmental conditions.
4.1. Corridor Dataset
In this experiment, our methodology was evaluated within a real corridor (refer to Figure 5a). The setup included six fixed cameras positioned linearly along the ceiling without overlapping views. Each camera had a resolution of … pix and a focal length of approximately 2201 pix. The experiment required two groups to reconstruct the corridor with the aid of a moving camera and 50 ArUco [34] markers. The moving camera operated at a resolution of … pix and a focal length of approximately 1339 pix. This setup was chosen to evaluate the accuracy of our method in preserving spatial continuity across cameras that do not share fields of view. Ground-truth data were generated through manual annotation. Our method is not run in real time; instead, it is executed once each dataset has been fully recorded. An overview summarizing the datasets used in this work, including the number of markers, fixed cameras, frames, and groups required for reconstruction, can be found in Table 1.
We created two different datasets in the same scenario, labeled A and B, by varying the orientations of the fixed cameras (see Figure 5a). In dataset A, the cameras were positioned to face directly downward, whereas in dataset B, they were slightly tilted to provide a broader view of the corridor. To assess the robustness of our approach, we performed ablation studies to evaluate how different optimization terms impact reconstruction quality. These studies involved optimizing the reprojection term $\mathcal{E}_{rep}$ alone, as well as testing all possible combinations of error terms. The results are presented in Table 2.
As observed, the exclusive use of the reprojection error obtains the worst results. Adding marker coplanarity constraints seems to be the most effective one, probably because it helps to mitigate the well-known doming effect [39] that occurs in this type of problem. In any case, combining all errors allows a substantial reduction in the errors in both datasets. In this particular case, dataset B seems to obtain better results.
Although certain combinations of error terms resulted in the lowest errors in these datasets, such outcomes are not consistent across different scenarios. Generally, integrating all optimization terms is the most effective strategy, as shown in the next experiments. In practical applications where direct validation of the results is not possible, it is recommended to employ all available optimization terms to ensure the most robust and reliable performance.
4.2. Complete Floor Dataset
This experiment deals with a more complex environment consisting of an entire floor within a building. The floor layout includes four interconnected corridors and a large central hall, a nexus for all corridors (see Figure 5b). This setup incorporates 22 fixed cameras, all strategically positioned along the ceiling, with no overlap between their fields of view. We employed 50 ArUco markers [34] to cover the area, obtaining 10 different groups using our moving camera. An overview summarizing the datasets used in this work, including the complete floor configuration, can be found in Table 1.
As in the previous case, two variations of the dataset were generated, labeled A and B. In dataset A, the fixed cameras were oriented to face directly downward, while in dataset B, they were angled slightly to provide a different view of the floor layout. As with the previous dataset, we conducted ablation studies on these datasets to assess how different combinations of optimization terms affect the reconstruction accuracy. These results are summarized in Table 3.
As evidenced by the results, just as in the previous datasets, relying solely on the reprojection error term ($\mathcal{E}_{rep}$) leads to suboptimal outcomes compared to including the additional optimization terms. Applying all optimization terms achieves the most robust and consistent results, yielding superior outcomes in dataset B. This enhancement underscores the importance of integrating multiple constraints to effectively address the complexities of pose estimation in expansive environments. Finally, we achieved errors of approximately 27 cm and 24 cm for datasets A and B of the complete floor, respectively.
4.3. Entire Building Dataset
This experiment presents our last dataset, which extends our methodology to an entire building comprising two floors connected by a stairwell section (see Figure 6). This setup allowed us to evaluate the robustness of our camera positioning approach over large vertical distances and across multiple interconnected levels. By testing these configurations, we aimed to assess the method’s accuracy in maintaining spatial continuity and precise camera alignment over extended and complex structures. The fixed and moving camera models were the same as in the previous experiments.
The experiment involved 42 fixed cameras distributed uniformly across the floors, ensuring no overlap in their fields of view. In addition, each floor and stair section included pathways recorded using a moving camera and 50 ArUco markers, obtaining 23 different groups. We performed ablation studies on the dataset to examine the effects of different optimization term combinations on the reconstruction quality. An overview summarizing the datasets used in this work, including the entire building configuration, can be found in Table 1.
The results of these evaluations, as detailed in Table 4, demonstrate the significant role of control points in enhancing pose estimation across the entire building, achieving an optimal outcome with an approximate error of 42 cm. The inclusion of control points proved especially critical in this large-scale dataset, contributing markedly to the accuracy of the results due to the high-dimensional nature of the environment.
4.4. Metagroup Edges Automatic Estimation
This experiment evaluates our algorithm’s capability to automatically find the markers that have not been moved between groups. The accuracy of this marker labeling is crucial, as it directly impacts the quality and reliability of the linkages within the reconstructed environment.
We tested each generated dataset against manually annotated ground-truth data, which provided the precise labeling of markers for evaluation purposes. Table 5 presents the results, detailing the rates of true positives (TP), false positives (FP), and false negatives (FN) for each dataset. These metrics offer a comprehensive assessment of the algorithm’s performance across various configurations.
The results of these evaluations are detailed in Table 5. The high true positive rates across all datasets, reaching 100% in some cases, indicate near-perfect accuracy in identifying markers that have remained stationary between groups. The absence of false positives confirms that no markers were incorrectly classified as common, ensuring that all detected common markers were indeed unchanged. This level of precision highlights our algorithm’s effectiveness in consistently identifying valid sets of common markers, which are crucial for accurate environment reconstruction, especially in large and complex settings, like entire buildings.
4.5. Comparison with Other Works
In this section, we assess how our proposed method compares to established techniques within the field, including the approaches by García et al. [5] and the MarkerMapper system [4], which are both adept at handling camera field-of-view overlaps. Additionally, we extend our comparison to state-of-the-art structure from motion (SfM) implementations such as OpenDroneMap [28], Pix4D [29], and COLMAP [30], as well as the Gaussian splatting approach PolyCam [32]. Despite numerous attempts, these methods failed to produce complete reconstructions due to insufficient camera overlap and a lack of distinctive key points for reliable matching, except for MarkerMapper [4] and the method of Garcia et al. [5], which successfully operated on some datasets. The datasets used for this evaluation, collectively referred to as dense, were introduced by García et al. [5] and test the capacity to manage overlapping views.
These datasets include artificial and real scenarios, representing a single corridor, an entire floor, and multiple floors. In these scenarios, fixed cameras are positioned on the ceilings, while ArUco [34] fiducial markers are placed on the floor. Similar to our work, these datasets feature various camera arrangements as different configurations.
On the other hand, the datasets introduced in our study, labeled as sparse, are specifically designed to assess the efficacy of methods in positioning fixed cameras with no overlap. However, there are no existing methods with publicly available code that can be evaluated using our proposed datasets. Consequently, we cannot compare our results on these datasets with any other existing method.
Table 6 presents the results of our comparative analysis. The symbol ‘×’ indicates that a method was inapplicable, while the symbol ‘−’ denotes that specific error data were unavailable due to a lack of ground truth.
The results in Table 6 show that our method achieves a level of precision comparable to that of the approach proposed by García et al. [5], which itself showed improvements over the results obtained by MarkerMapper [4]. Notably, our method achieves significantly better outcomes in the more complex datasets for which comparisons are possible, reducing the error on the dense complete floor datasets A and B from the previously reported 14.82 cm and 15.72 cm to 5.22 cm and 4.88 cm, respectively.
Our method achieves consistent results for datasets featuring cameras with no overlapping views. However, the error margins increase as the dataset’s complexity increases, similar to what is observed in datasets with overlapping cameras. This consistency across different datasets confirms the effectiveness of our approach under varied conditions.
4.6. Discussion on Limitations and Potential Improvements
In this section, we address the potential limitations of our proposed method and suggest areas for enhancement, particularly with involuntary movements of markers, low-light conditions, outdoor applications, and generalization across various indoor settings.
Our approach assumes that fiducial markers remain stationary throughout the capture process and are removed afterward. Markers do not need to remain in the environment once the calibration is conducted. In environments where there may be slight movements of markers while recording, the accuracy of the system depends on the redundancy of markers and a robust optimization algorithm. However, significant movement would require recalibration or algorithmic adjustments to ensure continued accuracy. Future enhancements could incorporate real-time tracking and adjustment capabilities to better manage such dynamics.
The design of our system is agnostic to the type of marker and the detection method employed. While traditional fiducial markers, such as ArUco [34], demonstrate high robustness in most conditions, they may struggle in extremely poor lighting. In such cases, considering alternative technologies like DeepArUco++ [40] could be beneficial to ensure high accuracy and robustness.
While our primary investigations have focused on indoor environments, extending our system to outdoor scenarios could prove beneficial. Outdoor conditions introduce complexities such as changing weather and variable lighting, which could be mitigated by integrating our system with technologies such as satellite imagery or GPS data. A hybrid approach, which combines fiducial markers with natural feature tracking, could provide more reliable solutions for outdoor navigation and mapping, similar to the concept explored in UcoSLAM [41].
Finally, our method has demonstrated reliable performance across diverse indoor settings in our datasets. However, the efficacy of our system is contingent upon the presence of the markers within the camera images, regardless of variations in room geometry or furniture arrangement. As a consequence, we believe our method can adapt to any indoor layout as long as it is possible to place markers and take pictures of them.
5. Conclusions
This work introduced a novel approach for indoor camera positioning using fiducial markers, uniquely designed to accommodate both overlapping and non-overlapping camera setups in extensive environments. To our knowledge, this is the first method to accurately position any set of fixed cameras in large scenarios, regardless of their overlap status. Our technique employs a mobile camera along with a set of fiducial markers that are strategically placed and moved iteratively in groups. This system facilitates the precise positioning of fixed cameras without overlap. Additionally, we developed an algorithm that automatically identifies markers acting as connectors between these groups, significantly simplifying the camera positioning process.
Experimental validations across multiple complex scenarios confirm the feasibility and robustness of our system, demonstrating consistent accuracy even in environments without visual overlap while achieving state-of-the-art results in environments with camera overlap. This establishes our method as capable of matching the capabilities of previous methods while additionally handling non-overlapping situations effectively. We believe that the proposed method could significantly enhance automated surveillance systems and improve augmented reality applications within complex indoor environments. However, the accuracy of our system still degrades as the scale of the environment grows; therefore, further efforts must be made to improve the method.
Future efforts will focus on refining the automation process for marker detection and enhancing the system’s adaptability to dynamically changing environments. We plan to integrate machine learning algorithms to optimize marker placement and improve camera network configurations. Extending this research to outdoor environments could significantly broaden the applicability of our method, potentially impacting urban planning and automated vehicle navigation. Additionally, we aim to explore the feasibility of incorporating total station technology as a complementary or alternative method for camera positioning. Addressing challenges such as adapting the system to varying outdoor lighting and weather conditions will be crucial.