**4. Using the Compact Topological Maps to Localize the Robot**

At this point, the robot is provided with a model of the environment, which in this case is a hierarchical map. Using it, the robot first carries out a rough localization with the high-level layer, and then a fine localization is tackled through the low-level layer. The visual localization problem has been solved by many authors with local features, using probabilistic approaches such as particle filters or Monte Carlo localization [51,52]. Nevertheless, works based on global appearance descriptors are scarce. Hence, this paper presents a comparison of this kind of descriptor to estimate the position of the robot hierarchically within a hierarchical map at a specific time instant.

In order to test the accuracy of the localization method proposed in this work, the coordinates where the images were captured within the environment are known (ground truth). Nevertheless, they are not used to estimate the position of the robot since, as mentioned before, the presented method only considers visual information. This decision makes it possible to study the feasibility of visual sensors as the only source of information to create a compact topological map and, more specifically, the feasibility of global appearance descriptors. Therefore, not using the position information in the mapping and localization algorithms isolates the effect of the main parameters of these descriptors and reveals the performance of this kind of information. The remainder of this section is structured as follows: Section 4.1 outlines the types of distances used to quantify how different two global appearance descriptors are. Section 4.2 explains the localization step within maps that have not been previously compacted, i.e., on which no clustering has been carried out (the full information about the environment is available). Finally, Section 4.3 explains the localization task within hierarchical topological maps.
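The two-stage lookup described above (rough localization with the high-level layer, fine localization with the low-level layer) can be sketched as follows. The function name and data layout (one representative descriptor per cluster, plus a list of low-level descriptors per cluster) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def euclidean(a, b):
    # Euclidean distance between two descriptors (see Section 4.1)
    return float(np.linalg.norm(a - b))

def localize_hierarchical(query, representatives, clusters, dist=euclidean):
    # Rough localization: compare the query descriptor against the
    # high-level layer (one representative descriptor per cluster).
    best_cluster = min(range(len(representatives)),
                       key=lambda k: dist(query, representatives[k]))
    # Fine localization: compare only against the low-level descriptors
    # belonging to the selected cluster.
    best_image = min(range(len(clusters[best_cluster])),
                     key=lambda j: dist(query, clusters[best_cluster][j]))
    return best_cluster, best_image

# Toy example with 2D "descriptors"
representatives = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
clusters = [
    [np.array([0.0, 0.0]), np.array([1.0, 1.0])],
    [np.array([9.0, 9.0]), np.array([11.0, 11.0])],
]
print(localize_hierarchical(np.array([9.5, 9.0]), representatives, clusters))
# → (1, 0): second cluster, first image within it
```

Only the winning cluster's low-level descriptors are compared, which is what makes the hierarchical map computationally cheaper than an exhaustive search over all images.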

#### *4.1. Distance Measures between Descriptors*

In order to know how similar two panoramic images are through their global appearance descriptors, some distance measures are used. This way, a comparison can be carried out by calculating the distance between the descriptors of two images captured from different positions in the environment: the lower the distance, the more similar the images are. This kind of distance is used in the localization step. We consider two descriptors $\vec{a} \in \mathbb{R}^{l \times 1}$ and $\vec{b} \in \mathbb{R}^{l \times 1}$, where $a_i$ and $b_i$ are the $i$-th components of $\vec{a}$ and $\vec{b}$, with $i = 1, \ldots, l$. The distances used in this work are:

• Euclidean distance: This is a particular case of the weighted metric distance and is defined as:

$$dist_{euclidean}(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{l} (a_i - b_i)^2} \tag{3}$$

• Cosine distance: Departing from a similarity metric, which is defined as the scalar product between the two vectors divided by the product of their norms, the distance is defined as:

$$\begin{aligned} dist_{cosine}(\vec{a}, \vec{b}) &= 1 - sim_{cosine}(\vec{a}, \vec{b}) \\ sim_{cosine}(\vec{a}, \vec{b}) &= \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|} \end{aligned} \tag{4}$$

• Correlation distance: Again, departing from a similarity metric, in this case a normalized scalar product between the two mean-centered vectors, the distance is defined as:

$$\begin{aligned} dist_{correlation}(\vec{a}, \vec{b}) &= 1 - sim_{correlation}(\vec{a}, \vec{b}) \\ sim_{correlation}(\vec{a}, \vec{b}) &= \frac{(\vec{a} - \overline{a})^{T}(\vec{b} - \overline{b})}{\sqrt{(\vec{a} - \overline{a})^{T}(\vec{a} - \overline{a})}\sqrt{(\vec{b} - \overline{b})^{T}(\vec{b} - \overline{b})}} \end{aligned} \tag{5}$$

where:

$$\overline{a} = \frac{1}{l} \sum_{i=1}^{l} a_i; \qquad \overline{b} = \frac{1}{l} \sum_{i=1}^{l} b_i \tag{6}$$
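As a sketch, the three distance measures of Equations (3)-(6) can be implemented directly with NumPy. This is an illustrative implementation under the definitions above, not the authors' code:

```python
import numpy as np

def dist_euclidean(a, b):
    # Equation (3): square root of the summed squared component differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dist_cosine(a, b):
    # Equation (4): one minus the scalar product divided by the
    # product of the vector norms
    return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def dist_correlation(a, b):
    # Equations (5)-(6): one minus the normalized scalar product
    # of the mean-centered vectors
    ac, bc = a - a.mean(), b - b.mean()
    return 1.0 - float(np.dot(ac, bc)) / (np.linalg.norm(ac) * np.linalg.norm(bc))

a = np.array([1.0, 2.0, 4.0])
b = np.array([2.0, 4.0, 8.0])  # same direction as a, so cosine distance ≈ 0
print(dist_euclidean(a, b), dist_cosine(a, b), dist_correlation(a, b))
```

Note that the cosine and correlation distances are invariant to the magnitude of the descriptors (and, for correlation, also to a constant offset), whereas the Euclidean distance is not; this is visible in the example, where `b` is simply `a` scaled by two.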

Previous research works [21,36] have evaluated the relation between the distance between global appearance descriptors and the geometric distance between their capture points. These works show that even if the robot moves only a short distance, the descriptor changes measurably. Therefore, global appearance descriptors can be used to detect even small movements.
