**1. Introduction**

Visual-based localization involves estimating the pose of a robot from a query image, taken with an on-board camera, within a previously mapped environment. The widely adopted approach relies on detecting local image features (e.g., points, segments) [1,2] that are projections of 3D physical landmarks. Although feature-based localization has achieved great accuracy in recent years [3–5], it presents two major drawbacks that hinder long-term localization and mapping: (i) lack of robustness against image radiometric alterations; (ii) inefficiency of 2D-to-3D matching against large-scale 3D models [6]. A much less explored alternative to feature-based localization consists of localizing the robot through the scene appearance, represented by a descriptor of the whole image. According to this framework, localization is accomplished by comparing the appearance descriptor against a map composed of descriptor–pose pairs, without any explicit model of the scene's geometric entities [7,8]. This approach turns out to be particularly robust against perceptual changes and also appropriate for large-scale localization, as demonstrated by the fact that it is included in the front-end of state-of-the-art Simultaneous Localization and Mapping (SLAM) pipelines to perform relocalization and loop closure, typically in the form of Place Recognition (PR) [3,5].

The accuracy of appearance-based localization is, however, quite limited. Good results are reported only when the camera follows a previously mapped trajectory (i.e., in one dimension) [9–11] or when it is very close to any of the poses of the map [8,12]. In this work, we investigate whether localization based only on appearance can deliver *continuous* solutions that are accurate and robust enough to become a practical alternative to methods based on 3D geometric features. We restrict this work to planar motion, that is, we aim to estimate the camera pose given by its 2D position and orientation (3 d.o.f.).

To this end, we first assume that images acquired in a certain environment are samples of a low-dimensional Image Manifold (IM) that can be locally parameterized (or articulated) by the camera pose. This assumption has been justified by previous works [13,14], but only exploited under unrealistic conditions, where the IM was sampled from a fine grid of poses in the environment under fixed lighting conditions. This IM is embedded in an extremely high-dimensional space, $\mathbb{R}^{H \times W \times 3}$, where $W$ and $H$ stand for the width and height of the images, respectively. Working in such an extremely high-dimensional space is not only unfeasible, but also inadequate, since it lacks radiometric invariance. That is, the IM of a given environment might change drastically with the scene illumination and the automatic camera accommodation to light (e.g., gain and exposure time). Thus, it is essential to project the images (i.e., samples of the IM) to a lower-dimensional space with a transformation that also provides such radiometric invariance [15]. This projection can be carried out by encoding the image into a descriptor vector, hence obtaining a new appearance space, the Descriptor Manifold (DM), which is still articulated by the camera pose. In this paper, we leverage Deep Learning (DL)-based holistic descriptors [8,16] to project the IM into a locally smooth DM. We are aware that such smoothness is not guaranteed for the descriptors employed here, since this feature has not been explicitly taken into account in their design. This issue will be addressed in future work, but in the context of our proposal, the selected DL-based descriptors perform reasonably well under this assumption.

Another key aspect in appearance-based localization is that it requires an appropriate map, which, in our case, is built from samples of the DM that are annotated with their poses. In this paper, we assume that such samples, in the form of descriptor–pose image pairs, are given in advance and are representative of the visual appearance of the environment. Upon this set of pairs, we propose creating *Patches of Smooth Appearance Change* (PSACs), that is, regions that locally approximate the geometry of the DM using neighboring samples (see Figure 1). A tessellation of such PSACs results in a piecewise approximation of the DM that constitutes our *appearance map*, where pose data is only available at the vertices of the PSACs. The appearance smoothness within each PSAC allows us to accurately regress a descriptor for any pose within the pose space covered by the PSAC. This is accomplished through a Gaussian Process (GP), which delivers the Gaussian distribution of the regressed descriptor (refer to Figure 1).

Our proposal solves continuous sequential localization indoors by tracking the robot pose using a Gaussian Process Particle Filter (GPPF) [17,18] within the described *appearance map*. The particles are propagated with the robot odometry and weighted through the abovementioned Gaussian Process, which is implemented as the GPPF observation model for the image descriptor.

To improve the robustness of our method against appearance changes, we model the descriptor variations in such situations as a white noise distribution that is introduced into the estimation of the observation likelihood. Finally, it is worth mentioning that our proposal can easily recover from the common PF particle degeneracy problem by launching a fast, multihypothesis camera relocalization procedure through global Place Recognition.

Our localization system has been validated with different indoor datasets affected by significant appearance changes, yielding notable results that outperform current state-of-the-art techniques, hence demonstrating its capability to reduce the gap between feature-based and appearance-based localization in terms of accuracy, while still leveraging the invariant nature of holistic descriptors.

**Figure 1.** The Gaussian Process (GP) associated with a *Patch of Smooth Appearance Change* (PSAC) approximates the geometry of a neighborhood of the Descriptor Manifold (assumed to be locally smooth) with respect to the pose space, predicting the local likelihood $p(\mathbf{d}_q \mid \mathbf{x}, \text{PSAC}_m)$ of the observation $\mathbf{d}_q$ at a given pose $\mathbf{x}$. In this example, the descriptor–pose pairs are extracted from two previous trajectories of the robot (in red and blue).

#### **2. Related Work**

This section reviews two concepts that are essential for the scope of this work: Global image descriptors and Appearance-based localization.

#### *2.1. Global Image Descriptors*

A well-founded way of getting a consistent dimensionality reduction from the Image Manifold to the Descriptor Manifold is through *Manifold Learning* tools, like LLE [19] or Isomap [20]. Their performance, however, is limited to relatively simple IMs that result from sequences of quasi-planar motions or deformations, like face poses, person gait, or hand-written characters [21]. Unfortunately, images taken in a real 3D scene give rise to complex, highly twisted IMs, which also present discontinuities due to occlusion borders [22]. Moreover, typical *Manifold Learning* tools are hardly able to generalize their learned representations to images captured under different appearance settings [15]. This prevents their application to generating low-dimensional embeddings adequate for camera localization.

Nevertheless, Deep Learning (DL)-based holistic descriptors have recently proven their ability to encode information from complete images, effectively reducing their dimensionality while adding invariance to extreme radiometric changes [8,12,23,24]. This feature has made DL-based descriptors highly suitable for diverse long-term robot applications [9,25], e.g., Place Recognition (PR), where the goal is to determine if a certain place has already been seen by comparing a query image against a lightweight set of images [26]. In addition, these descriptors have been shown to reflect pose variations more sensitively than local features [26], a behavior described, for example, by the *equivariance* property [27,28]. Since we are targeting robust robot operation under different appearance conditions, global descriptors arise as the natural choice to address appearance-based localization.

#### *2.2. Appearance-Based Localization*

Appearance-based localization is typically formulated as a two-step estimation problem: first, PR is performed to find the most similar images within the map and, subsequently, the pose of the query image is approximated from the location of the retrieved ones [29,30]. In this scenario, DL-based works have proposed to improve the second stage through Convolutional Neural Network architectures that estimate relative pose transformations between covisible images [31–33].

The addition of temporal and spatial sequential information to appearance-based localization methods based on single instances provides more consistency to the pose estimation, as it reduces the possibility of losing camera tracking due to, for instance, perceptual aliasing [26]. Following this idea, SeqSLAM [10,34] proposes a sequence-to-sequence matching framework that reformulates PR to incorporate sequentiality, leading to substantial improvements under extreme appearance changes. Building upon SeqSLAM, SMART [11] integrates odometry readings to provide more consistent results. More recently, *Network-Flow*-based formulations have also been proposed to solve appearance-based sequence matching under challenging conditions, addressing camera localization [35,36] or position-based navigation and mapping [37]. Despite their relevant results, all these works are discrete in nature, unlike our proposal, restricting the possible estimates to the locations present in the map.

Conversely, CAT-SLAM [38] employs image sequentiality as a source of topometric information to improve the discrete maps used by FAB-MAP [9], allowing interpolation within the sequence map through the association of continuous increments on appearance and pose. Although the estimates produced by this approach are continuous, they are restricted to the mapped trajectory. Our work overcomes this constraint by requiring multiple sequences or pose grids as a source for constructing the map PSACs. This way, we can perform localization even at unvisited map locations near the PSACs, achieving, consequently, more accuracy and reliability.

An interesting alternative to pose interpolation is the use of Gaussian Process (GP) regression [39], a nonparametric, general-purpose tool that allows generalizing discrete representations to a continuous model and, hence, can be adapted to perform continuous localization within discrete maps. For instance, [40,41] employed GPs to generate position estimates for omnidirectional images in indoor maps, achieving good performance, although without support for robot rotations. Our approach is instead designed to work with both 2D positions and rotations for conventional cameras.

In turn, the authors of [18] proposed Gaussian Process Particle Filters (GPPFs) to solve appearance-based localization in maps of descriptor–pose pairs. The GP works as the observation model of the PF, estimating the likelihood of the observed holistic descriptor at each of the particle positions. This localization pipeline was later improved in [42] by using only the nearest map neighbors in the GP regression, allowing efficient localization within large environments. Despite being promising, both works have three major drawbacks: (i) they define a single Gaussian Process between poses and descriptors for the whole environment, assuming that the manifold geometry has a similar shape across the entire environment, thus leading to inaccurate estimations; (ii) they do not propose a relocalization process in case of losing tracking; and (iii) they only consider localization under the same appearance as the map, lacking robustness to radiometric alterations.

Inspired by these works, we employ a GPPF to solve appearance-based localization in a continuous and sequential fashion within challenging indoor environments. We solve their first problem by locally modeling the mapping between poses and descriptors via specific GPs for each PSAC, providing refined estimates for each neighborhood. Our proposal solves the second issue through a fast and multihypothesis relocalization process based on global PR within the map. Finally, the last issue is addressed by incorporating into the map a model of the appearance variation between the mapped and query images.

#### **3. System Description**

This section describes our proposal for the process of appearance-based camera localization. First, we define the elements that form the *appearance map*, which are key contributions in this work, and then we address PR-based localization and camera tracking using a probabilistic formulation based on a GPPF.

#### *3.1. Patches of Smooth Appearance Change*

The *Patches of Smooth Appearance Change* (PSACs) are regions that locally model the interrelation between camera poses and image descriptors, and represent areas where the change in appearance is small.

#### 3.1.1. Definition

The basic building unit of a PSAC is the pair $p_i = (\mathbf{d}_i, \mathbf{x}_i)$, composed of the global descriptor $\mathbf{d}_i \in \mathbb{R}^d$ of an image and the pose $\mathbf{x}_i \in SE(2)$ where it was captured.

We assume that these pairs are extracted from either of these two environmental representations (Figure 2): either from several **robot navigation sequences** (at least two) or from **pose grids** where the cameras have densely sampled the environment given fixed position and rotation increments (i.e., a regular grid). Optimally, a subset of these pairs should be selected so that they constitute the smallest number of samples from which the Descriptor Manifold (DM) can be approximated with sufficiently good accuracy. These *key* samples can be viewed as the equivalent to the *key-frames* in traditional, feature-based visual localization, and hence, we denote them *key-pairs* (*KPs*). Since determining such an optimal subset is a challenging issue itself, out of the scope of this work, the *KPs* are sampled at a constant rate from the total collection of pairs.
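To make the map construction concrete, the following minimal Python sketch shows one way descriptor–pose pairs and constant sampling of *key-pairs* could be represented; the class and function names (`KeyPair`, `constant_sampling`) and the default stride are illustrative choices, not part of the original formulation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeyPair:
    """Descriptor-pose pair p_i = (d_i, x_i), with x_i = (t_x, t_y, theta) in SE(2)."""
    descriptor: np.ndarray  # holistic image descriptor, shape (d,)
    pose: np.ndarray        # planar pose, shape (3,)

def constant_sampling(pairs, stride=20):
    """Constant Sampling (CS): pick key-pairs (KPs) at a fixed rate from a sequence of pairs."""
    return [pairs[i] for i in range(0, len(pairs), stride)]
```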

**Figure 2.** Each PSAC is constructed from descriptor–pose pairs that can be obtained from two different robot trajectories (left, blue and red) or a grid of poses (right, green). This is better seen in color.

Each PSAC is built from a group of adjacent *KPs* and approximates the DM in the region that they delimit. As explained later, the robot localization takes place within these PSACs by defining a suitable observation model for the GPPF (refer to Figure 1). Formally, let the *m*-th PSAC be

$$\text{PSAC}_m = \Big( \{\, KP_{m,i} \mid i = 1, \dots, Q \,\},\; GP_m \Big), \tag{1}$$

where *Q* ≥ 3 is the number of *KPs* forming the PSAC. In turn, *GP<sup>m</sup>* is a Gaussian Process specifically optimized for that particular PSAC that delivers a Gaussian distribution over the image descriptor for any pose nearby the PSAC (further explained in Section 3.1.2).

Thus, in order to determine the closeness between a query pair $p_q = (\mathbf{d}_q, \mathbf{x}_q)$ and a particular PSAC, we define two distance metrics as follows:

• The *appearance distance* $D^a_{m,q}$ from the query descriptor $\mathbf{d}_q$ to the *m*-th PSAC is defined as the average of the descriptor distances to each of its constituent *key-pairs*:

$$D^a_{m,q} = D^a(\text{PSAC}_m, \mathbf{d}_q) = \frac{1}{Q} \sum_{i}^{Q} \|\mathbf{d}_q - \mathbf{d}_{m,i}\|_2. \tag{2}$$

• Similarly, but in the pose space, we define the *translational distance* $D^t_{m,q}$ from $\mathbf{x}_q$ to the *m*-th PSAC as

$$D^t_{m,q} = D^t(\text{PSAC}_m, \mathbf{x}_q) = \frac{1}{Q} \sum_{i}^{Q} \|\mathbf{t}_q - \mathbf{t}_{m,i}\|_2, \tag{3}$$

with $\mathbf{t}_q$ being the translational component of the pose $\mathbf{x}_q = (\mathbf{t}_q, \theta_q)$.
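Both distances can be computed directly from the *key-pair* arrays of a PSAC. A minimal sketch follows, assuming the PSAC descriptors and translations are stacked as NumPy arrays (these variable names are ours, not the paper's):

```python
import numpy as np

def appearance_distance(psac_descriptors, d_q):
    """Eq. (2): average L2 distance from the query descriptor to the Q key-pair descriptors."""
    return np.mean(np.linalg.norm(psac_descriptors - d_q, axis=1))   # psac_descriptors: (Q, d)

def translational_distance(psac_translations, t_q):
    """Eq. (3): average L2 distance from the query position to the Q key-pair positions."""
    return np.mean(np.linalg.norm(psac_translations - t_q, axis=1))  # psac_translations: (Q, 2)
```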

Finally, the set of all PSACs covering the environment that has been sampled forms the **appearance map** M:

$$\mathcal{M} = \{ \text{PSAC}\_{m} | m = 1, \ldots, M \}, \tag{4}$$

with $M$ being the number of PSACs. This way, we achieve a much more accurate approximation of the relation between the pose space and the DM within each PSAC, ultimately modeling, patchwise, the complete shape of the DM in $\mathcal{M}$.

#### 3.1.2. GP Regression

GPs are powerful regression tools [39] that have previously demonstrated their validity as observation models in Particle Filters [17,18,42].

In this work, we learn a specific GP for each PSAC from its vertex *KPs* and the nearest pairs in terms of translational distance. Then, for a certain query pose $\mathbf{x}_q$, the $GP_m$ delivers an isotropic Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_{m,q}, \sigma^2_{m,q}\mathbf{I}_d)$, where $\boldsymbol{\mu}_{m,q} \in \mathbb{R}^d$ and $\sigma^2_{m,q} \in \mathbb{R}$ stand for its mean and uncertainty, respectively. This distribution is finally employed to estimate the likelihood $p(\mathbf{d}_q \mid \mathbf{x}_q, \text{PSAC}_m)$ of an observed image descriptor $\mathbf{d}_q$, given the query pose $\mathbf{x}_q$ within $\text{PSAC}_m$.

For this, the GP regression employs a *kernel* $k$, which measures the similarity between two input 2D poses $(\mathbf{x}_i, \mathbf{x}_j)$, with the following structure:

$$k(\mathbf{x}_i, \mathbf{x}_j) = k_{RBF}(\mathbf{t}_i, \mathbf{t}_j) \cdot k_{RBF}(\theta_i, \theta_j) + k_W(\mathbf{x}_i, \mathbf{x}_j). \tag{5}$$

This *kernel* $k$ first multiplies two Radial Basis Function (RBF) *kernels* $k_{RBF}(a_i, a_j) = \beta_a \exp(-\alpha_a \|a_i - a_j\|_2^2)$ (where $\alpha_a$ and $\beta_a$ are optimizable parameters) for the separated translational and rotational components of the evaluated poses $\mathbf{x} = (\mathbf{t}, \theta)$. Then, a White Noise *kernel* $k_W(a_i, a_j) = \sigma^2_W \delta(a_i - a_j)$ is added, which models the variation suffered by the image descriptors taken at the same pose but under different appearances (refer to Figure 3). This is justified because, although global PR descriptors have demonstrated outstanding results in terms of invariance, such invariance is not ideal and small differences might appear. Thereby, since the construction of the map $\mathcal{M}$ is typically carried out considering just one particular appearance, and we aim for the robot localization to be operational under diverse radiometric settings, we propose the inclusion of a white noise distribution accounting for this circumstance in the regression. We model such descriptor variation with the variance $\sigma^2_W$, computed as the average discrepancy between the descriptor variances of pose-adjacent pairs $p_i$ under the same ($\sigma^2_{i,\text{same}}$) and different ($\sigma^2_{i,\text{diff}}$) illumination settings:

$$
\sigma\_W^2 = \frac{1}{N} \sum\_{i}^{N} \left( \sigma\_{i, \text{diff}}^2 - \sigma\_{i, \text{same}}^2 \right). \tag{6}
$$
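As a rough illustration of how such a per-PSAC GP could be set up with the GPy library cited in Section 4, the sketch below builds the composite kernel of Equation (5) and fixes the white-noise variance to the value of Equation (6). The function name, array shapes, and the omission of angle wrap-around handling are simplifications on our part, not the authors' implementation.

```python
import numpy as np
import GPy

def fit_psac_gp(X, D, sigma_w2):
    """Fit a per-PSAC GP with the kernel of Eq. (5): RBF(t) * RBF(theta) + White.

    X: (n, 3) training poses (t_x, t_y, theta); D: (n, d) training descriptors.
    """
    k_trans = GPy.kern.RBF(input_dim=2, active_dims=[0, 1])   # RBF on the translation
    k_rot = GPy.kern.RBF(input_dim=1, active_dims=[2])        # RBF on the orientation
    k_white = GPy.kern.White(input_dim=3, variance=sigma_w2)  # appearance-change noise (Eq. 6)
    k_white.variance.fix()                                     # sigma_W^2 is estimated offline
    model = GPy.models.GPRegression(X, D, k_trans * k_rot + k_white)
    model.optimize(messages=False)                             # fit the remaining hyperparameters
    return model

# For a query pose x_q of shape (1, 3):
#   mu, var = model.predict(x_q)   # predictive mean (1, d) and variance of the descriptor
```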

**Figure 3.** These three images have been captured at the same pose but with different appearances. Ideally, their descriptors (blue, red, and purple dots) should be identical but, in practice, certain inaccuracies appear. As the GP learns the descriptor distribution uniquely from the appearance of the map, this variation is not considered, leading to an underestimated GP uncertainty (red area, $\hat{\sigma}^2_{m,q}$). The inclusion of white noise expands such uncertainty (green area, $\sigma^2_{m,q}$), solving this issue.

#### *3.2. Robot Localization*

Once we have defined all the elements involved in the representation of the environment, we address here the process of localization within the appearance map $\mathcal{M}$. We aim to estimate the robot pose through appearance-based continuous tracking using a Gaussian Process Particle Filter (GPPF), namely, a PF that employs the GPs described above as observation models. Since Particle Filters are well-known robot localization tools, we do not provide a deep description of them here but instead refer the reader to the seminal work [43] for further information.

At time-step *t*, each of the $P$ particles in the filter represents a robot pose estimate $\mathbf{x}^{(t)}_p$ with an associated weight $w^{(t)}_p$ proportional to its likelihood. Besides, each particle is assigned to a certain region $\text{PSAC}^{(t)}_{m,p}$, as explained next.

#### 3.2.1. System Initialization

When the PF starts, we perform global localization based on Place Recognition to select the PSAC in $\mathcal{M}$ that is most similar to the query descriptor in terms of appearance:

$$\text{PSAC}_{\hat{m}} = \underset{\text{PSAC}_m \in \mathcal{M}}{\arg\min}\; D^a_{m,q}. \tag{7}$$

To account for multihypothesis initialization, we also consider as candidates those PSACs whose *appearance distance* is under a certain threshold proportional to $D^a_{\hat{m},q}$. Subsequently, the particles are uniformly assigned and distributed among all candidate PSACs, setting their initial weights to $w^{(t_0)}_p = \frac{1}{P}$.

Note that, if the robot tracking is lost during navigation, this procedure is launched again to reinitialize the system and perform relocalization.
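A possible sketch of this PR-based initialization is given below; the candidate-selection ratio, the `descriptors` attribute, and the `sample_pose_within` helper are hypothetical placeholders (the paper only states that the threshold is proportional to the best appearance distance), and `appearance_distance` refers to the helper sketched in Section 3.1.1.

```python
import numpy as np

def initialize_particles(psacs, d_q, n_particles, ratio=1.5):
    """Multihypothesis initialization based on global Place Recognition."""
    dists = np.array([appearance_distance(p.descriptors, d_q) for p in psacs])
    candidates = [p for p, dist in zip(psacs, dists) if dist <= ratio * dists.min()]
    particles, psac_of = [], []
    per_candidate = max(1, n_particles // len(candidates))
    for psac in candidates:                               # spread particles uniformly
        for _ in range(per_candidate):
            particles.append(sample_pose_within(psac))    # hypothetical pose sampler over the PSAC
            psac_of.append(psac)
    weights = np.full(len(particles), 1.0 / len(particles))   # w_p = 1/P
    return particles, psac_of, weights
```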

#### 3.2.2. Robot Tracking

Once each particle is assigned to a candidate PSAC, the robot pose estimation is carried out following the traditional *propagation-weighting* sequence:

**Propagation**. First, the particles are propagated according to the robot odometry:

$$\mathbf{x}\_p^{(t)} = \mathbf{x}\_p^{(t-1)} \oplus \boldsymbol{\upsilon}^{(t)},\tag{8}$$

with $\boldsymbol{\upsilon}^{(t)} \sim \mathcal{N}(\bar{\boldsymbol{\upsilon}}^{(t)}, \Sigma_{\upsilon})$ representing the noisy odometry reading, and $\oplus$ being the pose composition operator in $SE(2)$ [44].
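For reference, the propagation step of Equation (8) can be sketched as follows, with the pose composition operator $\oplus$ written out explicitly for planar poses; the noise-sampling details are an assumption consistent with the text.

```python
import numpy as np

def se2_compose(x, u):
    """Pose composition x ⊕ u for planar poses (t_x, t_y, theta)."""
    tx, ty, th = x
    ux, uy, uth = u
    return np.array([tx + ux * np.cos(th) - uy * np.sin(th),
                     ty + ux * np.sin(th) + uy * np.cos(th),
                     (th + uth + np.pi) % (2 * np.pi) - np.pi])   # wrap angle to [-pi, pi)

def propagate(particles, odom_mean, odom_cov, rng=None):
    """Eq. (8): compose each particle with a noise-corrupted odometry increment."""
    rng = rng or np.random.default_rng()
    return [se2_compose(x, rng.multivariate_normal(odom_mean, odom_cov)) for x in particles]
```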

**Weighting**. After the propagation, the *translational distance* between each particle's pose and all the PSACs is computed, so that the particle is assigned to the nearest PSAC, denoted $\text{PSAC}^{(t)}_{m,p}$. Then, we use the GP regressed in that PSAC to locally evaluate the likelihood of the observed descriptor $\mathbf{d}_q$ at the particle pose $\mathbf{x}^{(t)}_p$ as follows:

$$w_p^{(t)} = p\big(\mathbf{d}_q \mid \mathbf{x}_p^{(t)}, \text{PSAC}_{m,p}^{(t)}\big) \propto \exp\left(-\frac{d}{2}\ln\!\big(\sigma_{m,p}^{2\,(t)}\big) - \frac{\|\mathbf{d}_q - \boldsymbol{\mu}_{m,p}^{(t)}\|_2^2}{2\,\sigma_{m,p}^{2\,(t)}}\right), \tag{9}$$

with *d* being the dimension of the descriptor.
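In practice, Equation (9) is conveniently evaluated in the log domain for numerical stability. The sketch below assumes a fitted per-PSAC GP model (e.g., the GPy model from Section 3.1.2) and reduces its predictive variance to a single isotropic value:

```python
import numpy as np

def particle_log_weight(gp_model, x_p, d_q):
    """Eq. (9): log-likelihood of the observed descriptor d_q at the particle pose x_p."""
    mu, var = gp_model.predict(np.atleast_2d(x_p))   # GP of the assigned PSAC
    sigma2 = float(np.mean(var))                      # isotropic predictive variance
    sq_err = float(np.sum((d_q - mu.ravel()) ** 2))
    return -0.5 * d_q.size * np.log(sigma2) - sq_err / (2.0 * sigma2)

# Weights are then normalized over all particles, e.g., by a softmax of the log-weights.
```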

Finally, apart from *propagation* and *weighting*, two more operations can be occasionally applied to the particles.

**Resampling:** In order to prevent particle degeneracy, the GPPF resamples when the number of effective particles is too low, promoting particles with higher weights.

**Reinitializing:** During normal operation, the GPPF may lose the tracking of the camera, mainly due to extremely challenging conditions in the images (e.g., very strong appearance changes, presence of several dynamic objects). We identify this situation by inspecting the *translational distance* between each particle and the *centroid* of its assigned PSAC, defined as the average pose of all the key-pairs forming the PSAC. If all particles are at least twice farther from the centroid of their assigned PSAC than its constituent *key-pairs*, the tracking is considered lost. Consequently, the PF relocalizes by following the aforementioned initialization procedure.
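These two occasional operations can be sketched as simple checks; the effective-sample-size threshold, the attribute names, and the use of the mean key-pair distance as the PSAC radius are our assumptions, since the text only fixes the factor of two:

```python
import numpy as np

def needs_resampling(weights, frac=0.5):
    """Resample when the effective sample size N_eff = 1 / sum(w^2) drops below frac * P."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2) < frac * w.size

def tracking_lost(particles, assigned_psacs):
    """Lost when every particle is more than twice as far from its PSAC centroid
    as the PSAC's own key-pairs are (on average)."""
    for x, psac in zip(particles, assigned_psacs):
        kp_radius = np.mean(np.linalg.norm(psac.kp_translations - psac.centroid, axis=1))
        if np.linalg.norm(np.asarray(x)[:2] - psac.centroid) <= 2.0 * kp_radius:
            return False   # at least one particle is still within range
    return True
```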

#### **4. Experimental Results**

In this section, we present three experiments to evaluate the performance of our appearance-based localization system.

First, we carry out a verification of the regression outcome in Section 4.1, with the aim to experimentally validate the hypothesis of a smooth Descriptor Manifold within the regions covered by each PSAC. In Sections 4.2 and 4.3, we test our proposal with four different state-of-the-art global descriptors in two different datasets, with a combination of setups for the map sampling. This has provided us with an insight into the error incurred by our proposal and has allowed us to determine the best configuration for localization. The last experiment, in Section 4.4, compares the resulting setup with three appearance-based localization alternatives in terms of accuracy and robustness, revealing that our system equals or improves their performance in every scenario.

It is important to highlight that this evaluation does not include comparisons with feature-based localization methods, since they cannot cope with appearance changes between the images used for mapping and those used for localization, whereas handling such changes is one of the main benefits of our proposal.

We employed two different indoor datasets for the experiments:


In turn, regarding the global image representation, we have tested the following state-of-the-art appearance descriptors:


Although none of these descriptors has been specifically designed to fulfill suitable properties for our pose regression approach, they have achieved promising results in terms of localization accuracy, as shown next. We used the GPy tool [50] to implement the proposed Gaussian Processes and empirically determined $\sigma^2_W$ from Equation (6), for each descriptor, by randomly sampling $N = 2000$ adjacent pairs with diverse illumination settings from the COLD-Freiburg database.

In this evaluation, we employ as metrics the median errors in translation and rotation (to inspect our method's accuracy), as well as the percentage of correctly localized frames (which illustrates the tracking and relocalization capabilities of our method). It is worth mentioning that traditional trajectory-based evaluation metrics such as the Absolute Trajectory Error (ATE) or the Relative Pose Error (RPE) are not applicable to this approach, since our proposal, and appearance-based localization methods in general, yield global pose estimations that are not guaranteed to belong to a trajectory, due to possible tracking losses and relocalization situations.
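A minimal implementation of these metrics might look as follows; the (0.5 m, 10°) tolerances mirror the correctness criterion used later in Section 4.4 and are passed as parameters:

```python
import numpy as np

def localization_metrics(est_poses, gt_poses, t_tol=0.5, r_tol=np.deg2rad(10)):
    """Median translational/rotational errors and % of correctly localized frames."""
    est, gt = np.asarray(est_poses), np.asarray(gt_poses)       # both of shape (n, 3)
    t_err = np.linalg.norm(est[:, :2] - gt[:, :2], axis=1)
    r_err = np.abs((est[:, 2] - gt[:, 2] + np.pi) % (2 * np.pi) - np.pi)
    pct_correct = 100.0 * np.mean((t_err < t_tol) & (r_err < r_tol))
    return np.median(t_err), np.median(r_err), pct_correct
```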

**Figure 4.** Environments and sequences employed for evaluation. (**a**) Map of the COLD-Freiburg Part A environment. Samples of the standard and extended routes are depicted in blue and red, respectively (image from [45]). (**b**) Map of the house rendered by the SUNCG environment (the house employed was *034e4c22e506f89d668771831080b291*). The dense grid poses are shown in red and the test sequences in different shades of green. Black regions depict objects, where the robot is not able to navigate.

#### *4.1. Corridor: Sanity Check*

The main assumption of our proposal is the hypothesis of a locally smooth Descriptor Manifold with respect to the pose, on which PSACs are based. Since this assumption is not justified by previous work, we have conducted a basic test to evaluate the regression outcome of the PSACs in a simple scenario.

The proposed experiment studies the evolution of the image descriptor along a simple, linear trajectory, by comparing the observed descriptor and the mean of the descriptor distribution resulting from the regression within the PSAC. In this manner, the behavior of the descriptor can be examined along the corridor axis in order to prove its continuity and the validity of the PSAC approximation. For this, we have selected a portion of an artificially illuminated (*night*) sequence where the robot traveled along a ∼8 m-long corridor, as well as the NetVLAD image descriptor. For the PSACs, we used a map constructed with images with the same appearance selected every 20 frames.

In order to represent the evolution of the descriptors, we have applied Principal Component Analysis (PCA) to them and represented the first PCA component (the one with the largest variance). Thus, Figure 5 depicts the trajectory of the robot through the corridor along with the value of said first PCA component for both the observed descriptor and the mean of the descriptor distribution regressed by the GPs at each PSAC. The displayed results demonstrate that the descriptor has a continuous evolution along the corridor, almost linear in the central part. Besides, the PSACs are also shown to have a continuous outcome and to approximate the values of the observed descriptor very accurately along the sampled trajectory.
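The 1-D projection used in this sanity check can be reproduced with a standard PCA, fitting the basis on the observed descriptors and reusing it for the GP means (a sketch under those assumptions; variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def first_pca_component(observed_descriptors, regressed_means):
    """Project observed and GP-regressed descriptors onto the first principal
    component fitted on the observed ones, for a 1-D view of their evolution."""
    pca = PCA(n_components=1)
    obs_1d = pca.fit_transform(observed_descriptors)   # (n_frames, d) -> (n_frames, 1)
    reg_1d = pca.transform(regressed_means)             # same basis for the GP means
    return obs_1d.ravel(), reg_1d.ravel()
```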

**Figure 5.** Experimental study of the global image descriptor behavior along a corridor of the COLD-Freiburg database (left image). The central figure depicts the robot pose trajectory and the PSACs (shadowed areas) through which it navigates. The rightmost figure compares, along the corridor, the behavior of the first Principal Component Analysis (PCA) component of the observed descriptor (coral) and the PSAC regression output (blue and green). This is better seen in color.

#### *4.2. COLD: Sequential Map Testing*

The COLD-Freiburg database (*part A*) provides odometry readings and real images for two different itineraries, namely, (i) **extended** (∼100 m-long), which covers the whole environment; (ii) **standard** (∼70 m), covering a subset of the environment, both depicted in Figure 4a. The dataset also provides images gathered under three different lighting conditions: at *night* (with artificial illumination), and on *cloudy* and *sunny* days.

In order to create the *appearance maps* for the experiments, we employed the first and second *night* sequences of the extended itinerary, since images captured under artificial illumination do not suffer from severe exposure changes or saturation like under the remaining conditions. From here on, we will refer to these as *map sequences*. Specifically, *key-pairs* from both *map sequences* have been obtained through Constant Sampling (CS) every 10, 20, and 30 pairs, resulting in three different maps with diverse density (described in more detail in Table 1), using *Q* = 4 *KPs* to construct every PSAC.



Finally, we have set up an extensive evaluation with six other sequences including different routes and illumination conditions: the first *night*, *cloudy*, and *sunny* sequences of the **standard** part, and the first *cloudy*, first *sunny*, and third *night* sequences of the **extended** part.

Figure 6 shows a comprehensive test study depicting the localization performance of our proposal after twenty runs for all test sequences at each map, using the median translational (top) and rotational (bottom) errors as metrics. Note that the number of particles for the PF has been set to $P = 10^3$, as we have empirically found that increasing that number does not improve the accuracy results. The overall performance shows a median error predominantly below 0.3 m and 6°, which denotes promising results given the pure appearance nature of our approach, i.e., with no geometrical feature employed for localization.

**Figure 6.** Comparison of the median translational and rotational errors of our proposal in the COLD-Freiburg dataset, tested on every sequence and with different Constant Sampling (CS) rates, using each holistic descriptor with $10^3$ particles. Note that the maps were constructed employing sequences under similar conditions to the *night* sequences. This is better seen in color.

The results in Figure 6 show that scene appearance seems to be a key issue regarding the system's accuracy, as our proposal achieves better results in less-demanding lighting conditions like artificial illumination (night) or cloudy. Nevertheless, our system still demonstrates notable performance under challenging radiometric conditions, such as in sunny sequences (e.g., presence of lens flares and image saturation), hence proving its suitability for robust appearance-based localization.

On the other hand, the number of *KPs* that form the map is another factor influencing the performance, since the PSACs approximate the pose–descriptor interrelation more closely the nearer their *KPs* are to each other. Although not particularly significant under advantageous conditions, this factor severely affects performance under challenging situations, as in *sunny* sequences, where localization is hindered as the sampling frequency decreases. Note that a more elaborate mapping technique than CS would improve these results, since an optimal selection of *KPs* would yield PSACs that describe the DM geometry more precisely. Nevertheless, this will be explored in future work; in this paper, we rely on CS to obtain results that, although not optimal, are still notable.

Finally, regarding the tested PR descriptors, the results show that in most cases, all perform similarly, with NetVLAD mostly achieving slightly better results. In turn, 1M COLD Quadruplet seems to struggle under complex illumination conditions, which might indicate that its empirically estimated white noise variance is unlikely to account adequately for these cases. The similar performance shown by all descriptors agrees with the fact that none of them has been specifically trained for appearance-based localization.

#### *4.3. SUNCG: Grid Map Testing*

The SUNCG Dataset provides a set of synthetic houses where a virtual camera can be placed at any pose. This feature allowed us to create a regular grid map in the space of planar poses with the camera and then to evaluate the impact of the map density in our proposal. Note that this dataset does not present appearance changes, and hence, the effect of such a characteristic cannot be evaluated in this experiment.

First, our *dense* grid map was created by selecting *KPs* with constant increments of 0.5 m in translation and 36° in rotation (refer to the red dots in Figure 4b). Then, we used subsampling to generate more grid maps for the evaluation, as described in Table 1, namely, the *sparse-position map* (subsampling half of the positions); the *sparse-rotation map* (subsampling half of the orientations); and the *sparse-position-rotation map* (subsampling both at the same time). In this case, we created PSACs with $Q = 8$ *KPs* for all maps. Additionally, we have recorded three ∼30 m-long test sequences following the trajectories shown in shades of green in Figure 4b, generating a *synthetic odometry* corrupted by zero-mean Gaussian noise with $\sigma_u = (0.06 \text{ m}, 1°)$.

The results of this experiment are shown in Figure 7, comparing the median errors in translation (left) and rotation (right) for all the descriptors employed in the previous experiment and for the described versions of the grid map. Again, we have set $P = 10^3$ particles for the PF. As can be seen, our proposal yields median errors under 0.2 m and 6° in the *dense map*, while using subsampled maps hinders the process of localization. It can be noted that subsampling exclusively on rotations does not worsen the accuracy, while subsampling positions has a noticeable impact on the overall performance. Consequently, PSACs prove to handle information sparseness more efficiently in orientation than in position. Not surprisingly, subsampling in both position and orientation clearly achieves the worst localization performance due to the combined loss of information.

Regarding the holistic descriptors, all of them again demonstrate a similar behavior for each subsampling case, with 1M RobotCar Volume performing worst. ResNet-101 GeM and NetVLAD, in turn, achieve the best performance.

These results demonstrate that uniform grid sampling is a rough strategy for mapping environments, yielding results that are highly dependent on the sampling density. Besides, the construction of such maps with real robots is largely time-consuming, being realizable mostly in virtual environments. Future work should investigate more elaborate strategies, designed to fulfill more adequate criteria concerning map creation, ultimately pursuing an optimal approximation of the DM geometry.

#### *4.4. Comparative Study*

Finally, we compare the localization performance between our proposal and state-of-the-art appearance-based methods of diverse nature in both datasets. For the setup of our method, we selected the NetVLAD descriptor due to its performance against appearance changes, and sampled the *KPs* every 20 pairs for the COLD dataset, as this represents a fair trade-off between accuracy and the number of *KPs* employed.

These are the appearance-based localization methods involved in the comparison:


duce continuous estimations. For that, we used the following weighting after the bipartite matching:

$$\mathbf{x}_i = \frac{\sum_{j}^{N_k} \frac{\mathbf{y}_{ij}}{\mathbf{y}_i}\, \mathbf{x}_j}{\sum_{j}^{N_k} \frac{\mathbf{y}_{ij}}{\mathbf{y}_i}}, \tag{10}$$

where $\mathbf{x}_j$ is the pose of each of the $N_k = 5$ most contributing *KPs*, $\mathbf{y}_{ij}$ is the flow connecting the *i*-th query and the *j*-th *KP*, and $\mathbf{y}_i = \sum_{j}^{N_k} \mathbf{y}_{ij}$ represents the query flow from the nearest *KPs* (refer to [37] for further details; a small code sketch of this weighting is given after the method list).

• **Our approach**, configured with $P = 10^3$ particles.
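For illustration, the flow-based weighting of Equation (10) reduces to a normalized weighted average of the contributing *key-pair* poses; the sketch below averages orientations naively, which is a simplification of ours:

```python
import numpy as np

def flow_weighted_pose(kp_poses, flows):
    """Eq. (10): pose estimate as the flow-weighted average of the N_k key-pair poses."""
    w = flows / flows.sum()                       # y_ij / y_i, with y_i = sum_j y_ij
    return (w[:, None] * kp_poses).sum(axis=0)    # kp_poses: (N_k, 3), flows: (N_k,)
```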

Table 2 compares the performance of all the described algorithms after twenty runs for each scenario. Note that, apart from the median errors, we include the percentage of correctly localized frames along the trajectory, showing the tracking robustness and relocalization potential. Concretely, a frame is considered to be correctly localized when the distance between the estimate and its true pose is below (0.5 m, 10°).

**Table 2.** Comparative median position and rotation errors and percentage of correctly localized frames (m, °, %) for different state-of-the-art appearance-based localization methods. A frame is correctly localized when the distance between the estimate and its true pose is below (0.5 m, 10°) (L: sequences where the tracking got lost; N/A: not applicable). PRP—Pairwise Relative Pose estimator; PR—Place Recognition. In bold, best result for each sequence.


As can be seen, the challenging radiometric conditions in the COLD-Freiburg database caused the GPPF method to lose tracking, while the PRP estimator achieved low accuracy as a result of not exploiting the trajectory sequentiality, performing PR at every time-step instead. In turn, the solutions based on Network flow provide very accurate estimations in general, with the best results achieved with the uniformly sampled map under favorable conditions (i.e., *night* and *cloudy* sequences) and slightly worse results in the case of severe appearance changes (i.e., *sunny* sequences). Our proposal, in contrast, provides consistent results regardless of the appearance settings, achieving similar results to the Network flow solution in favorable conditions and outperforming all other methods under challenging radiometric settings.

In the case of the SUNCG dataset, the formulation proposed by the Network flow is incompatible with grid maps covering multiple rotations at the same location, as they are conceived to work only with positions. In turn, PRP and GPPF obtain low performance, even worsened in subsampled maps, while our proposal achieves the best results.

Despite its similarity with our approach, GPPF has shown to be unable to achieve robust localization due to the abovementioned issues: (i) deficiencies from considering a single pose–descriptor mapping for the whole environment, (ii) the absence of a relocalization process, and (iii) its limitation to environments without radiometric changes.

In conclusion, the presented comparison proves that these state-of-the-art localization methods based on appearance cannot provide both consistent and accurate localization estimations while operating within maps of diverse nature and captured under different appearance conditions. Our method, in turn, achieves higher performance in these conditions in terms of precision and robustness, showing notable results given its pure appearance nature.

#### **5. Conclusions**

We have presented a system for appearance-based robot localization that provides accurate, continuous pose estimations for camera navigation within a 2D environment under diverse radiometric conditions. Our proposal relies on the assumption that image global descriptors form a manifold articulated by the camera pose that adequately approximates the Image Manifold. This way, we gather pose–descriptor pairs into a lightweight map in order to create locally smooth regions called *Patches of Smooth Appearance Change* (PSACs) that shape, piecewise, the Descriptor Manifold geometry. Additionally, we robustly deal with appearance changes by modeling the descriptor variations with a white noise distribution.

We implemented a sequential camera tracking system built upon a Gaussian Process Particle Filter, which allows for multihypothesis pose estimation. Thus, our system optimizes a specific GP for each PSAC, subsequently being employed to define a local observation model of the descriptor for the Particle Filter. Furthermore, our method includes a relocalization process based on PR in case of tracking loss.

A first set of experiments has shown our proposal's error baseline in different environments and for a selection of holistic descriptors, revealing the most suitable configuration for our system. Finally, we have presented a comprehensive evaluation of the localization performance, showing that our approach outperforms state-of-the-art appearance-based localization methods in both tracking accuracy and robustness, even using images with challenging illuminations, yielding median errors below 0.3 m and 6°. Consequently, we have proven that pure appearance-based systems can produce continuous estimations with promising results in terms of accuracy, while working with lightweight maps and achieving robustness under strong appearance changes.

Future work includes research about (i) building the appearance map in an optimal way and wisely selecting where to sample the Descriptor Manifold; (ii) the design of a novel holistic descriptor that is more adequate to perform pose regression while keeping high invariance to radiometric changes.

**Author Contributions:** Conceptualization, A.J., F.-A.M., and J.G.-J.; methodology, A.J.; software, A.J.; validation, A.J.; investigation, A.J.; writing—original draft preparation, A.J., F.-A.M., and J.G.-J.; writing—review and editing, A.J., F.-A.M., and J.G.-J.; supervision, F.-A.M. and J.G.-J.; funding acquisition, J.G.-J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by: Government of Spain grant number FPU17/04512; by the "I Plan Propio de Investigación, Transferencia y Divulgación Científica" of the University of Málaga; and under projects ARPEGGIO (PID2020-117057) and WISER (DPI2017-84827-R) financed by the Government of Spain and European Regional Development's funds (FEDER).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal used for this research.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**Calibration of Visible Light Positioning Systems with a Mobile Robot**

**Robin Amsters 1,\*, Eric Demeester <sup>1</sup>, Nobby Stevens <sup>2</sup> and Peter Slaets <sup>1</sup>**


**Abstract:** Most indoor positioning systems require calibration before use. Fingerprinting requires the construction of a signal strength map, while ranging systems need the coordinates of the beacons. Calibration approaches exist for positioning systems that use Wi-Fi, radio frequency identification or ultrawideband. However, few examples are available for the calibration of visible light positioning systems. Most works focused on obtaining the channel model parameters or performed a calibration based on known receiver locations. In this paper, we describe an improved procedure that uses a mobile robot for data collection and is able to obtain a map of the environment with the beacon locations and their identities. Compared to previous work, the error is almost halved. Additionally, this approach does not require prior knowledge of the number of light sources or the receiver location. We demonstrate that the system performs well under a wide range of lighting conditions and investigate the influence of parameters such as the robot trajectory, camera resolution and field of view. Finally, we also close the loop between calibration and positioning and show that our approach has similar or better accuracy than manual calibration.


**Keywords:** indoor positioning; visible light positioning; sensor fusion; mobile robot; calibration

### **1. Introduction**

Since their introduction, Global Navigation Satellite Systems (GNSS) have been the enabling technology for applications such as navigation, autonomous vehicles and emergency services. While GNSS can provide worldwide coverage and require only a receiver to use, they are typically not useful for indoor spaces. On the one hand, building walls significantly reduce the signal strength, often making positioning impossible or reducing the accuracy [1]. However, even the nominal accuracy of GNSS (around 5 m [2]) is insufficient for indoor positioning, where an error of a couple of meters can mean that the user is located in one of several rooms. In order to provide indoor location, researchers have developed many Indoor Positioning Systems (IPS), yet a single standard like GNSS has not been achieved. Indoor environments come in many different varieties and can favor different positioning technologies. Current systems are often based on Wi-Fi in order to reduce infrastructure cost; however, their accuracy is limited to a couple of meters [3]. Other technologies such as Ultra-WideBand (UWB) [4] and ultrasound [5] can provide much higher accuracy (centimeters), at the cost of additional specialized infrastructure.

With the introduction of solid-state lighting, a new type of indoor positioning has emerged. In Visible Light Positioning (VLP), light intensities are modulated at speeds imperceptible to the human eye, which allows for a one-way transmission from transmitter to receiver. Similar to other positioning systems, the Received Signal Strength (RSS) or signal travel time can be used to determine the receiver location. Due to the local character of light, the influence of multipath is significantly reduced, resulting in an accuracy that can be as low as a couple of centimeters [6]. Existing lighting infrastructure can also be reused for positioning, thereby reducing the overall system cost significantly. These advantages led to an increasing research interest in recent years. However, the installation of new VLP systems remains an important issue. The majority of indoor positioning systems require some form of calibration. For example, systems that use range measurements to determine the receiver position via triangulation assume the locations of all transmitters to be known. Manually measuring transmitter locations can be a cumbersome process, as the transmitters are often mounted on ceilings and walls [7]. Fingerprinting systems also require a calibration procedure, in order to build an RSS map that can later be used for positioning. These site surveys can be lengthy and labor-intensive processes. Additionally, this RSS map may have to be updated when changes occur in the environment.

Several calibration procedures have been proposed for IPS using technologies such as UWB [8], Wi-Fi [9,10] and Radio Frequency Identification (RFID) [11,12]. In the field of visible light positioning, little literature is available on this subject. Our previous work [13] proposed a proof-of-concept calibration procedure with a mobile robot. In that procedure, the total number of lights needed to be determined manually [13]. Counting the number of transmitters is significantly less time-consuming compared to manually measuring the positions. However, it is still a tedious process that is prone to errors. In this work, we therefore introduce an improved calibration algorithm. Detected light sources are filtered based on their measured coordinates, as well as their place in the frequency spectrum. Using this new procedure, the number of light sources is no longer required. Additionally, accuracy is significantly improved. As [13] is a proof-of-concept, much remains unknown about the robustness of the approach. For example, which parameters have an effect on the accuracy of the procedure? To find out, we investigate the impact of a variety of factors on the calibration procedure. Finally, the goal of a calibration procedure is to prepare the system for positioning. The relation between calibration errors and positioning errors may be complex. In order to determine whether our system has satisfactory performance, we use the calibrated parameters for positioning. Following this approach, we close the loop between calibration and positioning and enable high-performance systems that are easy to deploy.

Our main contributions can therefore be summarized as follows:


The rest of this paper is structured as follows: Section 2 describes related work, and Section 3 introduces the materials and methods used in this paper. Experimental results are presented in Section 4 and discussed in Section 5. Finally, a conclusion is drawn in Section 6.

#### **2. Related Work**

Table 1 provides an overview of calibration procedures proposed for different types of indoor positioning systems. In Table 1, "positioning technology" refers to the technology that is actually used for positioning (after the calibration has completed). During the calibration itself, other signals such as RGB-D cameras [10] or PDR [14] may be used. The following section will describe the broad categories of calibration methods in more detail.


**Table 1.** Calibration approaches for indoor positioning systems.

Fingerprinting-based IPS operate in two stages. In the first (offline) stage, a signal strength map is constructed. RSS values are measured at known locations throughout the entire space. It is possible to record just one type of signal (e.g., Wi-Fi). However, accuracy is generally improved by including multiple sources of information (e.g., magnetic field, Bluetooth, etc.) [38]. Signals already present in the environment are often used, in order to avoid the need for additional infrastructure. In the second (online) stage, the receiver location is unknown and one or more RSS values are measured. By matching the current signal fingerprint to the database, the receiver position is recovered. Contrary to triangulation-based IPS, fingerprinting approaches do not require transmitter coordinates. To ensure positioning accuracy, it is however important that the signal strength map is accurate. The map may also have to be updated periodically, if changes to the environment are made.
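As a rough illustration of the online stage, a simple nearest-fingerprint estimator might look as follows; the k-nearest averaging is just one common variant and not the specific method of any of the cited works:

```python
import numpy as np

def match_fingerprint(survey_locations, survey_rss, rss_query, k=3):
    """Online stage: estimate the receiver position as the average of the k survey
    locations whose stored RSS vectors are closest to the measured fingerprint."""
    d = np.linalg.norm(survey_rss - rss_query, axis=1)    # survey_rss: (n, m), rss_query: (m,)
    nearest = np.argsort(d)[:k]
    return survey_locations[nearest].mean(axis=0)          # survey_locations: (n, 2)
```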

We distinguish four methods to construct the signal strength map:


In manual site surveys, a trained expert records signal fingerprints at known locations. The entire space needs to be visited by the surveyor and, as mentioned before, this process may have to be repeated. Manual site surveys are time-consuming and labor-intensive and thus not always practical in large indoor spaces [35]. The use of mobile robots has therefore been proposed to simplify this task. Mobile robots have been used to collect fingerprints for RFID [11,12] and Wi-Fi systems [10]. Some authors have even proposed algorithms that enable a robot to collect data without human intervention [9,15]. These navigation algorithms were relatively simple and did not follow the optimal trajectory (in terms of accuracy or time required), but they did succeed in covering the space eventually.

The goal of Simultaneous Localization And Mapping (SLAM) is to reconstruct a map of the environment, while simultaneously estimating the trajectory of the observer relative to that map. Solutions to the SLAM problem are most commonly based on Bayesian filtering [39]. SLAM algorithms have mainly been used for robotics applications, as earlier implementations required expensive sensors such as laser scanners (LIDAR) or depth cameras [40]. Recently, researchers started using sensors embedded in conventional smartphones to construct signal strength maps. This approach is sometimes also referred to as signalSLAM. Pedestrian Dead Reckoning (PDR) is often used to obtain a rough estimate of the user's trajectory, and drift is corrected by using absolute location fixes (for example, from GNSS signals or near-field communication tags) [16]. Alternatively, other signals of opportunity such as Wi-Fi, magnetic field or even ambient light [35] can be used to compensate PDR drift. When using signalSLAM to calibrate fingerprinting IPS, the main goal is to reconstruct the trajectory of the user and to add the measured signal strengths to the map based on that trajectory. Recent approaches tend to use a modified version of graphSLAM [36]. The main challenge with graph-based signalSLAM is the reduction of false positives when performing loop closures [14,16,17].

SignalSLAM calibration still requires a surveyor to visit the entire indoor space. It is more efficient compared to a manual site survey, as the surveyor can walk around continuously. In manual calibration, the surveyor has to stop and record his or her location periodically. Crowdsourcing approaches attempt to improve efficiency even further by removing the dedicated surveyor entirely. Initially, users can go about their regular tasks, while the systems collect both inertial and signal strength data from their smartphones in the background. As more data are collected, these systems obtain a more complete picture of the indoor environment, and position accuracy increases. In contrast to single site surveys, the map can continuously be updated. Crowdsourcing presents a number of interesting advantages, yet some challenges still remain. Kim et al. [18] assumed the initial location of the user was known and suggested it can be obtained from GNSS when the user enters the building. In contrast, the system described in [19] did not require the initial position, stride length or phone placement. Instead, a map of the environment was used to impose constraints that can filter improbable locations. The obtained trajectories were optimized through backpropagation, and Wi-Fi signal strength was added to the map based on the optimized path. In the work of Wang et al. [20], seed landmarks were extracted from the floor plan (e.g., doors), which can be used to obtain global observations. Additional landmarks were learned as more data entered the system. Yang et al. [21] first transformed the map into a stress-free floor plan, which is a high dimensional space in which the distance between two points reflects their walking distance (taking constraints such as walls into account). The similarity between the stress-free floor plan and the fingerprint space was used to label RSS signatures with their real locations. Crowdsourcing-based calibration does require users to give up their personal data, which may be an important barrier to some. Moreover, the approaches discussed above often required a floor plan, which may not always be available. Finally, the accuracy of both signalSLAM and crowdsourcing is typically low (in the range of several meters). Due to the relatively low quality input data (PDR and radio frequency signals), it is challenging to obtain robust and accurate systems with signalSLAM or crowdsourcing.

The calibration methods discussed so far are only applicable to fingerprinting-based IPS. Another category of positioning systems obtains the receiver position based on ranging. The travel time of a signal or the signal strength is used to determine the distance between transmitter and receiver. From the measured distances, the receiver position can then be obtained via trilateration. These types of IPS require accurate knowledge of the transmitter locations. Depending on the positioning technology used, additional parameters may also be required. For example, UWB systems often correct the bias on the distance measurements [8]. VLP systems based on RSS sometimes calibrate the gain [30] or Lambertian emission order [33]. For the calibration of range-based systems, we can again distinguish a few possible methods:


Similar to fingerprinting systems, range-based IPS can be calibrated manually. In this case, the transmitter locations would be measured relative to some reference with rulers or laser-based measurement devices. While measuring the transmitter locations manually requires less work than performing a manual site survey for fingerprinting, it is still a tedious process. Transmitters are often mounted on the ceiling, which can make the process somewhat inconvenient. Ranging systems can also be calibrated based on known receiver locations, which may be easier to obtain than the transmitter coordinates [22–24]. However, ground truth measurements of the receiver locations are still required, which often requires an additional positioning system. Moreover, errors on the receiver location while calibrating will subsequently lead to errors on the transmitter locations. If sufficient transmitter positions are known, the others can be extrapolated without extra measurements [25].

Some IPS can use the same ranging techniques that enable receiver positioning to obtain the distances between transmitters, from which the transmitter locations can also be derived [27,28]. These interbeacon ranging techniques (sometimes also referred to as autocalibration) do assume that beacons can communicate with each other. Additionally, the transmitters must be placed sufficiently close together to be within measurement range of each other, a condition that may not be satisfied for some positioning technologies, such as Bluetooth.

Finally, range-based IPS can also be calibrated based on a set of transmitter–receiver distances. If the quantity of data is large enough, no receiver or transmitter locations are required; a set of range measurements is sufficient. Calibration can then be formulated as an optimization problem that minimizes the residuals of the trilateration equations [7,8,26,27,29]. Results from these approaches are not always unique, for example, in the case of rotational symmetry. Additionally, the accuracy of the solution can be heavily dependent on the initial conditions [29].
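
As an illustration of this formulation, the sketch below jointly estimates unknown transmitter and receiver coordinates from a set of pairwise range measurements by minimizing the trilateration residuals with a generic nonlinear least-squares solver. The function and variable names, the random initialization and the use of `scipy.optimize.least_squares` are illustrative choices for this example, not the solvers used in the cited works.

```python
import numpy as np
from scipy.optimize import least_squares

def calibrate_from_ranges(ranges, n_tx, n_rx, dim=2, seed=0):
    """Jointly estimate transmitter and receiver positions from pairwise
    range measurements given as (tx_index, rx_index, distance) tuples.

    The recovered geometry is only defined up to a rigid transform (and a
    possible reflection), which is the ambiguity mentioned in the text.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.normal(scale=1.0, size=(n_tx + n_rx) * dim)  # random initial guess

    def residuals(x):
        pts = x.reshape(n_tx + n_rx, dim)
        tx, rx = pts[:n_tx], pts[n_tx:]
        # One residual per measurement: predicted minus measured distance.
        return np.array([np.linalg.norm(tx[i] - rx[j]) - d for i, j, d in ranges])

    sol = least_squares(residuals, x0)
    pts = sol.x.reshape(n_tx + n_rx, dim)
    return pts[:n_tx], pts[n_tx:]
```

Because only distances are observed, the solution depends strongly on the random initial guess and may converge to a mirrored or otherwise incorrect configuration, which is exactly the non-uniqueness and initialization sensitivity noted above.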

Both ranging and fingerprinting can be used for VLP. Ranging is generally more accurate and robust. However, as the transmitters are lights that also illuminate the space, they are generally mounted on the ceiling and point downwards. Therefore, transmitters likely do not have a line of sight (LOS) to each other. Even when VLP transmitters are within range of each other, they lack the necessary hardware for receiving signals. Therefore, autocalibration methods cannot be used by most conventional VLP systems. In fact, VLP calibration in general has not yet been explored in depth. Rodríguez-Navarro et al. [30] proposed a method for calibrating the electrical parameters of a VLP amplification circuit. They performed an extensive parameter study and found that manufacturing tolerances on the resistors and capacitors contributed most to positioning errors due to incorrect calibration. By performing multiple intensity measurements at known locations, a system of equations can be constructed. The solution that minimizes the error provides the optimal calibration of the receiver parameters. In [31,32], calibration of transmitter coordinates based on known receiver locations was proposed. Similarly, the works in [33,37] calibrated the channel model based on known receiver locations. However, these works either did not indicate how the receiver position should be obtained [31–33] or used an additional positioning system to obtain it [37]. Note that not all VLP systems require a calibrated channel model. Camera-based implementations such as [41] only detect the position of the light relative to the camera center, while photodiode-based systems use the signal strength to obtain the transmitter–receiver distance. Camera-based VLP systems therefore only need the location of each transmitter. However, the channel model of VLP is relatively well known; therefore, model-based fingerprinting is sometimes also possible given the transmitter locations [34].

In this work, we will focus on the calibration of light source locations and identifiers without prior knowledge or additional positioning systems, of which there are few examples. Liang and Liu [35] crowdsourced the construction of a signal strength map of opportunistic signals. Similar to [14,16], user trajectories were obtained with the help of a modified graphSLAM algorithm. Contrary to similar works, they also mapped the location of light sources and used them as landmarks in the positioning stage. However, as the lights were not modulated, their identities could not be uniquely determined, resulting in a relatively low positioning accuracy (several meters) [35]. Additionally, unmodulated light sources are not easily distinguished from sunlight, as both increase the ambient lighting. In contrast, Yue et al. [36] did use modulated Light Emitting Diodes (LEDs). A modified version of graphSLAM was again used to construct the signal strength database. Absolute location fixes were obtained by detecting doors through changes in light intensity and magnetic field strength. Following calibration, positioning was performed by fusing PDR with fingerprint observations via a Kalman filter. Positioning accuracy after calibration was about 0.8 m on average, which is an improvement of approximately 70% over Wi-Fi-based fingerprinting under the same conditions. However, on rare occasions the positioning error could exceed 2 m.

#### **3. Materials and Methods**

In our proposed system, specific hardware was placed between the power lines and the lights, which modulated the intensity of each LED at a unique frequency (see Figure 1). Contrary to VLC, no data were transmitted. Instead, we used the modulated lights as landmarks. Detection and identification of the lights themselves are not the focus of this paper; they were explained in our previous work [13]. The main variables of interest were the identity (i.e., frequency) and the coordinates in the camera frame of each light source. If the position of each light is known beforehand, this information can be used to obtain the receiver location. However, in this work, we will focus on the calibration itself. We chose frequency division for its easy implementation (see Section 4.2.1), but another modulation technique could also be used. As long as the light sources can be detected (within the field of view) and identified, the calibration procedure remains applicable.

**Figure 1.** Model of the proposed Visible Light Positioning (VLP) system.

#### *3.1. Experimental Setup*

In order to evaluate the proposed calibration procedure, we used the same experimental setup as in our previous works [13,41] (see Figure 2). Four VLP transmitters were mounted at a height of approximately 1.5 m. The light intensity of every transmitter was modulated by a Metal-Oxide-Semiconductor Field-Effect Transistor (MOSFET) that was connected to a signal generator. Every LED had a unique frequency between approximately 1.5 kHz and 5 kHz, so that both low and high frequency modulation could be evaluated. The complete methodology used to obtain suitable transmitter frequencies was detailed in [13]. Table 2 lists the selected modulation frequencies, along with the other main hardware specifications.

**Figure 2.** Experimental setup used in this paper, which was also used in our previous work [13,41].


**Table 2.** Hardware specifications.

A mobile robot was equipped with a custom sensor platform that contained a laser scanner and a camera (see Figure 3). Two different cameras are investigated in this paper. Initially, we used the OpenMV M7 camera during experiments, as its low resolution allowed us to process the images more quickly and therefore speed up development. Later experiments used the Raspberry Pi (RPi) camera module. Similar to the OpenMV M7, the RPi camera allows high flexibility over settings such as the exposure time. However, the RPi camera has a much greater resolution than the OpenMV camera. Section 4 will investigate whether this higher resolution improves accuracy. The sensor platform also contained a laptop, which recorded all data so that calibration could be performed offline. When the RPi camera module was used, a Raspberry Pi single-board computer recorded the images separately from the laptop, as the RPi camera does not have a Universal Serial Bus (USB) interface. The cost of the main components of the experimental setup is detailed in Appendix A. The robot platform was driven by a human operator via a remote control.

**Figure 3.** Mobile robot with custom sensor platform used in this paper, which was also used in our previous work [13,41].

#### *3.2. Calibration Procedure*

Figure 4 provides a graphical overview of the calibration procedure. The parameters used to obtain the results in Section 4 are listed in Table 3. The initial steps of the improved procedure are still the same as in [13]. Due to the rolling shutter of our Complementary Metal-Oxide-Semiconductor (CMOS) camera, modulated light sources are visible as stripe patterns in the images. The width of the stripes is inversely proportional to the transmitter frequency [42]. The complete image processing pipeline is detailed in our previous work [13] and returns the frequency and pixel coordinates of the lights as output. The effects of lens distortion are largest at the edges of an image. Therefore, only the images where a light was detected close to the image center were processed further. Measurements from the laser scanner were used to reconstruct the followed trajectory and a map of the environment by using the Google Cartographer SLAM algorithm [43]. By combining the trajectory of the robot with the detected light sources, we obtained a map of the environment with the light source locations relative to the map frame.
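
The relation between stripe width and frequency can be sketched as follows. For a square-wave (on–off) modulated light captured with a rolling shutter, each bright or dark stripe spans roughly half a modulation period of the sensor's row readout, so the frequency follows from the measured stripe width and the row readout time. The snippet below is a minimal sketch of this relation only; the row readout time and stripe width values are illustrative placeholders, and the actual pipeline in [13] involves additional image processing (e.g., Canny edge detection and centroid extraction).

```python
def frequency_from_stripe_width(stripe_width_px: float, row_time_s: float) -> float:
    """Estimate the modulation frequency of a square-wave transmitter from the
    average stripe width in a rolling-shutter image.

    One bright plus one dark stripe correspond to one full period, so the
    period is approximately 2 * stripe_width_px row readout times.
    """
    period_s = 2.0 * stripe_width_px * row_time_s
    return 1.0 / period_s

# Illustrative values (not measured on our hardware): with a row readout time
# of 20 microseconds, a stripe width of 12.5 pixels corresponds to ~2 kHz.
print(frequency_from_stripe_width(12.5, 20e-6))  # ~2000 Hz
```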

**Figure 4.** Calibration procedure overview.

**Table 3.** Calibration parameters.


At this point, we have obtained the world coordinates of the detected light sources, in addition to their frequencies. Unfortunately, the same LED is occasionally labeled with different frequencies, depending on the image. The Canny edge detection step may estimate the stripe width to be one pixel larger or smaller than its actual value, resulting in a small spread in the frequency spectrum (see Figure 5). Additionally, the detected world coordinates of a light source may also not form a single point, due to noise on the centroid detection and robot localization. The light may therefore appear as a cluster of coordinates in the light map (see Figure 5). Previously, we solved this problem by averaging the world coordinates per detected frequency. Then, the frequencies which were detected most often were kept, depending on the number of light sources. For example, in our experimental setup four lights are present; therefore, the calibration procedure selected the four frequencies that were detected most often, and the rest were discarded. This approach has the disadvantage that one first needs to know how many light sources are installed in the environment, and therefore some manual measurements may still be required.

**Figure 5.** Filtering pipeline. The top row always indicates a light map, and a unique marker is used to indicate different frequencies. The bottom row shows the frequency spectrum corresponding to a certain processing step. From left to right: (1) raw output of the light mapping step, (2) light map after averaging coordinates of unique frequencies and removing outliers, (3) light map after removing spurious observations, and (4) final results of spatial and spectral filtering.

We now propose a different method, whereby we filter the light sources based on their physical coordinates and the frequency spectrum. The intermediate results of the processing steps can be seen in Figure 5. First, outliers are removed from the detected coordinates per frequency. Outliers are defined as detected positions that are more than 2 standard deviations away from the average position. Then, the coordinates of the remaining light sources are averaged per frequency. Light sources that were only observed a few times are removed. Next, light sources that are close to each other are combined. Filtering is first performed based on the coordinates of the light sources. Whenever the distance between two transmitters is below a certain threshold (Table 3), the LED with the lowest number of observations is removed. In case both lights have an equal number of detections, their detected frequencies and positions are averaged. We call this step "spatial filtering". Finally, we combine light sources of approximately equal frequency. Whenever the distance in the frequency spectrum between two light sources is below a certain threshold (Table 3), the frequency with the largest number of observations is kept and the other is removed. Similar to the spatial filtering step, we average the frequencies and positions of light sources with an equal number of detections. We call this final step "spectral filtering". The result of these additional processing steps is a light map with the correct number of transmitters. The number of light sources therefore no longer needs to be known beforehand. Instead, one only needs to know (approximately) the minimum spacing between lights, which is much easier to obtain.
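
The spatial and spectral merging steps can be summarized in the following sketch, which operates on the per-frequency averages obtained after outlier removal. The dictionary layout, threshold names and greedy merging loop are illustrative choices for this example; the thresholds correspond to the spatial and spectral parameters listed in Table 3.

```python
import numpy as np

def merge_lights(lights, min_dist_m, min_freq_gap_hz):
    """Greedy spatial and spectral merging of candidate light sources.

    `lights` is a list of dicts {"freq": Hz, "pos": (x, y), "count": n},
    i.e., the per-frequency averages after outlier removal and after
    discarding rarely observed candidates.
    """
    def merge_pair(a, b):
        # Equal observation counts: average frequency and position.
        # Otherwise keep the candidate with the most observations.
        if a["count"] == b["count"]:
            return {"freq": 0.5 * (a["freq"] + b["freq"]),
                    "pos": tuple(0.5 * (np.array(a["pos"]) + np.array(b["pos"]))),
                    "count": a["count"]}
        return max(a, b, key=lambda c: c["count"])

    def merge_close(cands, too_close):
        merged = True
        while merged:
            merged = False
            for i in range(len(cands)):
                for j in range(i + 1, len(cands)):
                    if too_close(cands[i], cands[j]):
                        new = merge_pair(cands[i], cands[j])
                        cands = [c for k, c in enumerate(cands) if k not in (i, j)]
                        cands.append(new)
                        merged = True
                        break
                if merged:
                    break
        return cands

    # Spatial filtering: combine candidates closer than the distance threshold.
    lights = merge_close(lights, lambda a, b: np.linalg.norm(
        np.array(a["pos"]) - np.array(b["pos"])) < min_dist_m)
    # Spectral filtering: combine candidates with nearly equal frequencies.
    lights = merge_close(lights, lambda a, b:
                         abs(a["freq"] - b["freq"]) < min_freq_gap_hz)
    return lights
```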

#### *3.3. Data Processing*

The result of the calibration procedure is a map of the environment that includes the transmitter locations and their frequencies. Evaluating the accuracy of the frequency detection is relatively straightforward and was performed by comparing the frequency applied by the signal generator to the frequency determined by the calibration procedure. Evaluating the location of the light sources is more complex. Ideally, we could simply compare the coordinates determined by the calibration procedure to the coordinates in the physical setup. However, the calibration procedure produces coordinates relative to the map, which is not necessarily the same coordinate frame as the experimental setup, and obtaining the transformation between these frames is challenging. However, we can still compare the relative placement of the light sources. The distance between different light sources is independent of the coordinate frame. Therefore, in order to obtain the transmitter position accuracy, we compared the distance measured in the physical setup with the distance obtained from the light map. The distance error between two lights was therefore calculated by:

$$\varepsilon_{r,ij} = \left| d_{\text{meas},ij} - d_{\text{est},ij} \right| = \left| d_{\text{meas},ij} - \sqrt{\left(x_{\text{est},i} - x_{\text{est},j}\right)^2 + \left(y_{\text{est},i} - y_{\text{est},j}\right)^2} \right| \tag{1}$$

where $\varepsilon_{r,ij}$ is the distance error between light sources $i$ and $j$, $d_{\text{meas},ij}$ is the distance between them measured in the physical setup, $d_{\text{est},ij}$ is the distance between their estimated positions in the light map, and $(x_{\text{est},i}, y_{\text{est},i})$ are the estimated coordinates of light source $i$.
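
For completeness, this error metric can be computed directly from the calibrated light map and a few hand-measured inter-light distances, for example as in the following sketch (the function name and argument layout are only illustrative):

```python
import math

def distance_error(est_i, est_j, measured_distance):
    """Equation (1): absolute difference between the measured inter-light
    distance and the distance between the estimated light coordinates."""
    d_est = math.hypot(est_i[0] - est_j[0], est_i[1] - est_j[1])
    return abs(measured_distance - d_est)
```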


In Section 4, we will investigate the performance of the system under a range of conditions. Unless explicitly mentioned otherwise, three experiments are conducted for every condition and the results from all three experiments are combined before further processing (for example, in order to obtain the cumulative error distribution).

#### **4. Results**

#### *4.1. Baseline Results*

Figure 6 compares the calibration results obtained in [13] with the method proposed in this paper. It is clear that the additional filtering steps significantly improve the calibration accuracy. On average, light sources can now be detected with an accuracy of approximately 6 cm, compared to 11 cm in [13]. Larger improvements are also visible in the higher percentiles of the distribution. More than 80% of light sources can be positioned with an accuracy of 10 cm, compared to 20 cm in [13]. Note that the same experimental data were used to obtain both error distributions; the difference is therefore purely due to improvements in data processing.

While this new method can localize the LEDs more accurately, the results of the frequency detection remain unchanged. As was the case in [13], it is challenging to calibrate the high frequency transmitter. As the frequency increases, the width of the stripes decreases, and a detection error of a few pixels results in a large frequency error (several hundred Hz). The lower and medium frequency sources can, however, be identified with relatively high accuracy (a maximum error of 130 Hz).
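
Under an idealized square-wave model, the stripe width $w$ (in pixel rows) and the modulation frequency $f$ are related by $f \approx 1/(2\,w\,t_{\text{row}})$, where $t_{\text{row}}$ is the row readout time. A detection error of $\Delta w$ pixels then shifts the estimate by approximately

$$\left|\Delta f\right| \approx \frac{f}{w}\,\Delta w \propto f^2\,\Delta w,$$

since $w \propto 1/f$ for a fixed $t_{\text{row}}$. The frequency error caused by a one-pixel detection error therefore grows quadratically with the transmitter frequency, which explains why the high-frequency transmitter is affected most.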

#### *4.2. Parameter Study*

Using a mobile robot is a completely new approach to the calibration of VLP systems. As with any new technique, much is currently unknown about the effects of certain parameters on the calibration results. In the following sections, we will therefore investigate the influence of a number of factors on the calibration procedure. One parameter will be changed at a time, and unless otherwise specified, we will use the results from the previous section as a baseline to compare against. In doing so, we aim to create a better understanding of the strengths and limitations of the proposed approach.

**Figure 6.** Cumulative error distribution of this paper compared with our previous work [13].

#### 4.2.1. Transmitter Waveform

In visible light positioning, the light intensity of every LED is modulated in such a way that they are uniquely identifiable, even when multiple lights are in view at the same time. Generally, the lights continuously transmit a unique code or frequency [44]. The former is referred to as Code Division Multiple Access (CDMA), the latter as Frequency Division Multiple Access (FDMA). In this paper, we use FDMA as a multiple access technology. However, the calibration procedure can be adapted to support CDMA as well with relatively minor changes.

When using FDMA, both sine waves and square waves can be used as transmitter waveforms. A square wave is easier to generate, and therefore the cost of the transmitters can be lower; consequently, most VLP systems in the literature use square waves. However, square waves have harmonics in the frequency spectrum, so more care is needed to avoid interference when selecting square wave frequencies. Photodiode-based VLP systems use the Fourier spectrum to separate the received signal into the components of each transmitter and are therefore impacted most by these harmonics. Camera-based VLP systems can use spatial multiplexing and are therefore less affected by this interference.

On the other hand, ideal sine waves have no harmonics, and therefore the available bandwidth can be used much more efficiently when photodiodes are used as receivers. The downside is that a sine wave is not as straightforward to generate with low-cost components. Additionally, the light intensity changes much more gradually with a sine wave, which makes frequency detection significantly more challenging (see Figure 7). In the following sections, we will therefore only use square waves, as our calibration procedure is not able to detect sine waves with sufficient accuracy.
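
The interference issue can be illustrated numerically: a square wave of frequency f contains odd harmonics at 3f, 5f, ... in its Fourier spectrum, whereas an ideal sine wave only contains the fundamental. The short sketch below uses plain NumPy with illustrative sampling parameters (they are not the settings of our hardware).

```python
import numpy as np

fs, f0, duration = 100_000, 2_000, 0.1           # sample rate, fundamental, seconds
t = np.arange(0, duration, 1 / fs)
square = np.sign(np.sin(2 * np.pi * f0 * t))     # on-off square wave
sine = np.sin(2 * np.pi * f0 * t)

freqs = np.fft.rfftfreq(t.size, 1 / fs)
for name, sig in (("square", square), ("sine", sine)):
    spectrum = np.abs(np.fft.rfft(sig))
    peaks = freqs[spectrum > 0.1 * spectrum.max()]   # dominant spectral components
    print(name, np.round(peaks))  # square: 2000, 6000, 10000, ... Hz; sine: 2000 Hz only
```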

#### 4.2.2. Robot Trajectory

The motion of the robot platform may influence the calibration results. If the robot can stay in motion, the calibration time will be reduced. On the other hand, we expect that motion blur may negatively impact the results. During the experiments, the robot was driven manually via a remote control. Two different types of trajectories were tested. In the first trajectory, the robot was continuously in motion and passed by every light source while covering the experimental space in a zigzag pattern. Figure 8a shows an example light map constructed from data recorded during a zigzag experiment. Performance is quite poor in this example: one light source was not mapped at all. Other zigzag experiments occasionally resulted in even fewer light sources. Moreover, Figure 8c shows that the positioning accuracy of the fixtures that were detected is quite low. We therefore also tested a second type of trajectory, whereby the robot drives towards each light source sequentially. Once the camera is directly below the LED, data are recorded for a few seconds, before continuing to the next light. We call this a "stop and go" trajectory. Figure 8b shows a light map constructed from data of such an experiment. The light sources are now placed closer to the ITEM profiles (see Figure 2). Additionally, Figure 8c shows that the relative placement is significantly more accurate.

**Figure 7.** Comparison of 2 kHz square wave (**left**) and sine wave (**right**).

**Figure 8.** Calibration experiments with different trajectories. (**a**) Zigzag trajectory. (**b**) Stop and go trajectory. (**c**) Cumulative error distribution.

Table 4 compares the duration of the two types of trajectories, based on the average of three experiments for each type. Contrary to our expectations, we can observe that performing the calibration with the stop and go trajectory does not take significantly more time. The zigzag is an exhaustive search, and therefore takes a long time. In contrast, the stop and go trajectory is not a continuous motion but only gathers the data that is really required. Processing time for the stop and go trajectory is slightly increased compared to the zigzag. Fewer images were rejected, as more light sources were detected close to the center for this type of experiment. Consequently, more images needed to be processed and the processing time increased. However, the majority of this time was actually spent on LIDAR mapping (approximately 80%). Therefore, an increase in image processing only had a small impact on the overall time required. Additionally, the difference is less than 1 s. As the calibration can be performed offline, this time delay does not present an obstacle.

**Table 4.** Duration and processing time of calibration trajectories.


Many other types of trajectories could be considered. Determining the optimal calibration trajectory is outside the scope of this paper. With these results, we can however conclude that the robot should briefly stop at each LED in order to obtain accurate results. Unless otherwise specified, results in the following sections are obtained with a stop and go trajectory.

#### 4.2.3. Lighting Conditions

Results from the previous section were all obtained under the same lighting conditions. It is well known that changing illumination levels can influence computer vision algorithms, and could thus negatively impact our proposed calibration procedure. In this section, we will therefore calibrate the experimental setup under a range of lighting conditions. More specifically, we distinguish 4 scenarios: (1) the baseline conditions of Section 4.1, (2) the same conditions repeated on another day ("other day"), (3) the window shutters opened to let in daylight ("shutters open") and (4) the fluorescent room lighting switched on, with the shutters also open ("fluorescent light").


For every condition, three experiments were again conducted with the stop and go trajectory. For every experiment, we determined the error on the light source locations as explained in Section 3. Figure 9 shows that all conditions have similar performance. The "other day" experiments are very similar to the baseline, indicating that the parameters (e.g., exposure time) are not overfit to a specific point in time; the difference between the two error distributions is generally no larger than 2 cm. Whether or not the shutters are opened also does not seem to negatively impact the calibration accuracy, as unmodulated light sources are easily ignored by our calibration procedure. Similarly, we can observe that switching on the fluorescent light has little impact. While fluorescent light is also modulated, its frequency is much lower than that of the LEDs. Therefore, the additional light sources are simply ignored.

**Figure 9.** Influence of environmental conditions on calibration accuracy.

The results of the frequency detection were very similar under different lighting conditions. In fact, they were identical with only one exception. During one "shutters open" experiment, one medium frequency light source had a slightly larger error compared to the other experiments. This phenomenon did not occur during the experiments with the fluorescent lights, even though the shutters were also opened in this case.

#### 4.2.4. Transmitter–Receiver Distance

Experiments so far were performed with LEDs mounted at a height of approximately 1.5 m. This makes them easily accessible, which simplifies prototyping and experimentation. As the distance between transmitter and receiver increases, the LEDs will take up a relatively smaller portion of the image. Consequently, fewer stripes will be visible, and it will be more challenging to determine the transmitter frequency. This section characterizes the influence of increasing this distance on the accuracy of the frequency estimation. To that end, we placed the camera and an LED on a table and ensured that their normal planes were parallel. This horizontal setup (Figure 10) differs from how the LEDs are normally installed. However, whether the lights are mounted horizontally or vertically makes no difference for this experiment; only the relative distance is important. Moreover, placing the light on a table rather than on the ceiling allowed us to change the distance much more easily and also enabled us to test performance at larger distances.

The distance between transmitter and receiver was increased from 1 m to 5 m. The light was modulated at a frequency of approximately 2 kHz. At every distance increment, images were recorded for approximately 30 s. In postprocessing, we determined the LED frequency using the process described in Section 3.2. Next, we calculated the difference between the true frequency (applied by the signal generator) and the frequency estimated by the calibration procedure. Table 5 shows the frequency estimation accuracy as a function of the transmitter–receiver distance. For short distances, the accuracy is approximately 95%, similar to Section 4.1. However, starting at a distance of 2 m, light sources can no longer be detected, hence the accuracy is 0%. The cause for this problem can be found by comparing images captured at different distances (Figure 11). At a distance of 1 m, several horizontal stripes are visible, from which the transmitter frequency can be calculated. At a distance of 3 m, only 1 stripe is visible, and the transmitter frequency can no longer be determined. Interestingly, the difference between 1 m and 3 m is much more pronounced than the difference between 3 m and 5 m.

**Figure 10.** Experimental setup for transmitter–receiver distance experiments.

**Table 5.** Frequency estimation accuracy as a function of the transmitter–receiver distance. Field Of View (FOV) is kept fixed at 60 degrees.


**Figure 11.** Example images taken at different transmitter–receiver distances. Images are cropped around the light source.

Results from the previous sections were obtained by using a camera lens with a field of view of 60 degrees. When a smaller FOV is used, the LED occupies a relatively larger portion of the image, and we may be able to detect it at greater distances. In order to test this hypothesis, we used a lens set (https://www.arducam.com/product/m12-mountcamera-lens-kit-arduino-raspberry-pi/, accessed on 12 March 2021) and varied the FOV from 10 to 60 degrees, at a fixed distance of 5 m. Again, the transmitter frequency was approximately 2 kHz, images were recorded for 30 s in every experiment, and the calibration procedure was used to estimate the transmitter frequency. Table 6 shows the accuracy of frequency estimation as a function of the FOV. It is clear that by sufficiently decreasing the FOV, the LED can still be detected. Even at a distance of 5 m, we can obtain the same accuracy of 95% as in Section 4.1 by using a lens with a FOV of 10 degrees. At shorter distances, a larger FOV can potentially also be used.
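
The benefit of a narrow FOV follows from simple pinhole geometry: the apparent diameter of the LED in pixels scales with the image width and inversely with both the distance and tan(FOV/2), and the number of visible stripes is roughly that diameter divided by the stripe width. The sketch below illustrates this trend; the LED diameter, image width and stripe width are illustrative values, not measurements from our setup.

```python
import math

def apparent_led_size_px(led_diameter_m, distance_m, fov_deg, image_width_px):
    """Approximate projected LED diameter in pixels for a pinhole camera."""
    half_angle = math.radians(fov_deg / 2)
    scene_width_m = 2 * distance_m * math.tan(half_angle)  # visible scene width at that distance
    return led_diameter_m / scene_width_m * image_width_px

# Illustrative: 0.2 m LED, 640 px wide image, stripe width of roughly 10 px.
for fov in (60, 10):
    for d in (1, 3, 5):
        size = apparent_led_size_px(0.2, d, fov, 640)
        print(f"FOV {fov:2d} deg, {d} m: {size:5.1f} px (~{size / 10:.1f} stripes)")
```

With a 60 degree lens the projected LED shrinks from roughly a hundred pixels at 1 m to a few tens of pixels at 5 m, leaving too few stripes, while a 10 degree lens keeps the LED large enough even at 5 m, consistent with Tables 5 and 6.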

**Table 6.** Frequency estimation accuracy as a function of the FOV. Transmitter–receiver distance is kept fixed at 5 m.


#### 4.2.5. Camera Resolution

Similar to lighting conditions, the camera resolution can have a significant impact on the performance of computer vision approaches. In the previous sections, the OpenMV M7 camera was used as an image sensor, which has a relatively low resolution of 640 × 480 pixels. We now hypothesize that a higher resolution will lead to a higher accuracy, for the calibration of both the position and frequency of the LEDs. The increased resolution provides a higher granularity and thus potentially a greater accuracy in distinguishing the location of the light source. Additionally, a higher resolution provides additional stripes in the image, which may improve frequency detection. To verify this hypothesis, additional experiments were performed with a Raspberry Pi (RPi) camera sensor. Similar to Sections 4.2.2 and 4.2.3, the robot followed a stop and go trajectory, and data were recorded for offline postprocessing. Figure 12 compares images captured with both cameras. The RPi camera is designed for the low-cost single-board computer of the same name, which is particularly popular for embedded applications. The maximum resolution of 3280 × 2464 pixels was used, in order to amplify any effects related to the resolution. Images taken at such a high resolution take up a lot of space in memory. Therefore, pictures were not recorded continuously. Rather, 10 images were taken when the robot was located underneath a light source. Both cameras were equipped with a 60 degree FOV lens. Due to an error in data recording, there are only two experiments with the RPi camera instead of the usual three.

**Figure 12.** Example images of the same light taken with different cameras.

Table 7 contains the main quality metrics for the calibration with both cameras. The results appear to support our hypothesis. The accuracy of transmitter positioning is improved, although the improvement is rather small. A larger improvement can be seen in the frequency identification. The high-frequency source can now be identified much more accurately, due to the increased bandwidth of the RPi camera.


**Table 7.** Calibration results for different camera sensors.

On the other hand, the time required to collect data with the RPi camera is more than double that of the OpenMV camera. The RPi camera is designed to be used with the Raspberry Pi single-board computer, which has significantly lower computational power than the laptop that collected the images from the OpenMV camera. Therefore, capturing each image takes significantly longer. The main bottleneck here is the RPi itself; if one were to use a camera with a USB interface, a laptop could again be used to capture the images and the calibration time would decrease. However, the duration of an experiment would likely still be longer at higher resolutions, albeit less drastically so with the right hardware. The processing time is also significantly increased when using the RPi camera. The higher resolution of the images means that the image processing takes a few seconds longer. The majority of the time increase can be attributed to the larger number of LIDAR measurements, which in turn is caused by the longer stationary time needed to capture the images with the RPi camera.

#### 4.2.6. Field of View

Section 4.2.4 mentioned the effect of changing the field of view to improve the detection rate. Due to the nature of those experiments, we could not determine the error on the LED positions. We therefore performed additional experiments with a changing FOV and multiple LEDs. These experiments were performed in the normal experimental setup (Figure 2). Since Section 4.2.5 showed that a larger resolution improves calibration performance, the RPi camera was again used in this section, and images were only captured when the camera was approximately underneath a light source. By changing the lens, the FOV was varied from 10 to 80 degrees. Figure 13 shows example pictures of the same light source for every FOV.

Figure 14 shows the mean accuracy of transmitter positioning and frequency detection as a function of the FOV. At 80 degrees, the calibration procedure starts failing, resulting in large errors. For the sake of clarity, these results are not included in the figures. In general, a smaller FOV leads to a better positioning accuracy, though the improvement is relatively small. The exception is the 40 degree lens, which actually has a larger error than the 60 degree lens. In contrast, a smaller FOV lowers the frequency detection accuracy; here, the 40 degree lens fits the overall trend better. One can therefore trade off positioning accuracy against frequency accuracy. However, the 60 degree FOV used in the previous sections already appears to be a good compromise for our setup. As discussed in Section 4.2.4, a larger mounting height may still necessitate a smaller FOV, as the light sources would otherwise become difficult to detect.

**Figure 13.** Example images for different FOV lenses.

**Figure 14.** Top: positioning accuracy as a function of the camera FOV. Bottom: frequency accuracy as a function of the camera FOV.

#### *4.3. Influence on Positioning*

The goal of a calibration procedure is to accurately measure the environmental parameters needed for determining the receiver location. Calibration errors will likely result in positioning errors, though this is often not a simple linear relation. Therefore, it is challenging to determine how accurately the calibration needs to be performed in order to guarantee adequate positioning performance in a later stage. In this section, we evaluate the impact of calibration errors on visible light positioning, in order to determine if our calibration provides satisfactory results.

First, we calibrated the setup with the configuration that was determined to be the best trade-off between position and frequency accuracy. More specifically, we used an RPi camera with a field of view of 60 degrees and drove the robot in a stop and go trajectory. This setup was also calibrated manually as a point of reference. In our previous work, we described sensor fusion based robot positioning with three filters, namely an Extended Kalman Filter (EKF), a Particle Filter (PF) and a hybrid Particle/Kalman filter (PaKa) [41]. We now use the parameters from both the manual and the robot calibration in these positioning approaches. We used all data from the conditions described in Section 4.2.3. Positioning accuracy results were obtained in the same way as described in [41]. Note that, contrary to previous sections, "positioning accuracy" here does not refer to the accuracy of the transmitter positions; it refers to the accuracy of the robot position, the calculation of which is described in [41].

The results from all experiments were combined into one cumulative distribution per filter, which are shown in Figure 15. It is clear that the proposed calibration procedure has little impact on positioning accuracy. Occasionally, the new method even improves accuracy. However, as explained in [41], the accuracy results of the PF and the PaKa can have a small variation due to the sampling of probability distributions. In general, the robot calibration is more accurate, albeit only slightly. Of the three positioning approaches, the hybrid filter seems to be least impacted by the new calibration method. The difference between the error distributions of the PaKa filter in Figure 15 is often less than 1 mm. All filters also have no trouble identifying the lights correctly. Even though the new calibration method introduces an error on the modulation frequency, it is not large enough to cause ambiguity among the transmitters.

Note that this calibration was performed with an RPi camera but that positioning was performed with the OpenMV camera. The above results therefore show that device heterogeneity is not an issue that needs to be specifically taken into account, contrary to some RSS-based positioning systems [45–48].

**Figure 15.** Cumulative positioning error distributions of calibration methods for different filters.

#### **5. Discussion**

Amsters et al. [13] proposed a proof of concept for a calibration procedure of VLP systems. Contrary to [13], the improved procedure described in this paper does not require prior knowledge of the number of light sources. In the vast majority of cases, the algorithm was able to determine the correct number of light sources. When using the zigzag trajectory, or a FOV of 80 degrees, the number of transmitters could be underestimated. In all the other tests that were performed (29 experiments in total), the calibration algorithm correctly determined the number of LEDs.

This type of calibration procedure for VLP systems has not been used before. Therefore, it was unclear how robust the approach is and which factors can influence the results. During our parameter study (Section 4.2), we obtained several key insights. For example, a limitation of the procedure is that it cannot be used for calibrating transmitters modulated with sine waves, which is a consequence of using a camera as a receiver. However, the majority of VLP systems described in literature use On-Off Keying (OOK) as a modulation scheme, even if only frequencies are transmitted [49]. In case code division is used, researchers also often opt for OOK. While we performed experiments with FDMA as a multiple access technology, it would be relatively straightforward to include code division multiplexing by expanding the image processing pipeline. Another limitation is that we can only determine the two-dimensional position of the LEDs. Some positioning approaches require knowledge of the ceiling height, which would have to be measured separately.

The proposed calibration procedure was also not influenced significantly by the ambient lighting, similar to the positioning approach used for the evaluation [41]. In contrast, the robot trajectory, mounting height, FOV and resolution all had an impact on the calibration accuracy. A high resolution should be used to increase the accuracy of both frequency detection and transmitter positioning. However, we recommend the use of a USB camera in order to capture pictures faster. In our experiments, a FOV of 60 degrees provided a good trade-off between positioning and frequency detection accuracy. In case the distance between transmitter and receiver is large (as is the case with high ceilings), a smaller FOV may be required to detect the light sources. Finally, care should be taken to stop the robot at each light source, rather than using a continuous motion. The latter could lead to poor accuracy and an underestimation of the number of light sources.

The main objective of the technique is to calibrate the parameters of the system, so that these can be used for positioning in a later stage. The experimental results in Section 4.3 showed that the parameters of the experimental setup can be determined with sufficient accuracy. The error on the light source locations did not result in increased positioning errors. In the case of our experimental setup, the transmitter frequencies could also be determined with sufficient accuracy so as to not cause ambiguity. It is important to note that one should take care that the modulation frequencies are sufficiently far apart, as some error is introduced when calibrating the modulation frequency. We should also note that certain positioning approaches are more susceptible to calibration errors than others. The positioning approach used as an evaluation tool made use of sensor fusion. In case of large measurement errors, the filters can fall back on odometry data. However, this is only the case when the error on the observation is sufficiently large. More subtle disturbances such as errors on the transmitter coordinates cannot be filtered. Additionally, with this work we showed that it is possible to close the loop between calibration and positioning. That is, we can efficiently calibrate the setup with a mobile robot and then use the determined parameters for high-accuracy positioning. Manual calibration also leads to errors on the transmitter locations. As evidenced by our positioning case, these errors are likely of the same order of magnitude as the robot calibration.

Our work shares similarities with robot-based RFID calibration. Hähnel et al. [11] also used a mobile robot equipped with a LIDAR and used it to reconstruct a map of the environment. The position of RFID tags was later estimated based on the path of the robot. Similarly, Milella et al. [12] also mapped indoor spaces with a mobile robot in order to localize RFID tags. They used fuzzy logic to determine the likelihood of a tag location. Mirowski et al. [16] proposed the use of a mobile robot for calibration of Wi-Fi localization systems. Contrary to [11,12], they used Quick Response (QR) codes to aid with loop closures, which raises the question as to how these QR codes should be localized.

Literature on the subject of VLP calibration is limited. Most examples focused on obtaining the parameters of the channel model [33,37], which we cannot calibrate. However, the approach which we used as an evaluation tool does not require these parameters, as the channel model is not used [41]. This does limit the calibration procedure to mostly camera-based positioning systems. It is possible to further extend the proposed calibration system by including a photodiode on the robot platform and using the intensity measurements to obtain the parameters of the channel model.

To determine the transmitter locations and identities, we obtained the receiver position through SLAM, rather than through the manual measurements used in [31,50]. Contrary to [35], we were also able to obtain the light source identities. Yue et al. [36] did use modulated LEDs, yet their accuracy is significantly lower than that of our work. However, our approach required a dedicated procedure rather than crowdsourcing the required data. Additionally, our robot needed to be driven manually by a human operator. Nevertheless, it may be possible to let the robot perform the calibration autonomously, whereas crowdsourcing will always require the cooperation of humans.

#### **6. Conclusions**

In this work, we outlined an improved calibration procedure for VLP systems, based on data collection with a mobile robot. The new approach has significantly improved performance compared to previous work: the accuracy of LED localization was almost doubled. Additionally, whereas previous work remained a proof of concept, we performed an extensive parameter study to characterize the strengths and limitations of the approach. Based on these results, we suggested the use of a high resolution camera with a FOV of 60 degrees to further improve the accuracy of LED placement and frequency detection. We showed that ambient lighting has little influence on the proposed procedure. Through positioning experiments, we determined that the approach is also accurate enough to calibrate high-performance VLP systems. In doing so, an important barrier to entry for visible light positioning systems is removed.

Our approach required a dedicated site survey, rather than crowdsourcing. While less convenient, it did result in much greater accuracy. The procedure was also unable to calibrate the channel model. In future work, we could add a photodiode to the sensor platform in order to obtain the Lambertian emission parameters. Additionally, we could investigate the possibility of letting the robot perform the procedure autonomously, in order to reduce the human labor required.

**Author Contributions:** Conceptualization, R.A.; Funding acquisition, R.A., E.D., N.S. and P.S.; Investigation, R.A.; Methodology, R.A.; Project administration, R.A. and P.S.; Supervision, E.D., N.S. and P.S.; Writing—original draft, R.A.; Writing—review and editing, R.A., E.D., N.S. and P.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Research Foundation Flanders (FWO) under grant number 1S57720N.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**


**Table A1.** Cost of the experimental setup.

#### **References**

