*Article* **Towards Multi-Robot Visual Graph-SLAM for Autonomous Marine Vehicles**

**Francisco Bonin-Font † and Antoni Burguera \*,†**

Systems, Robotics and Vision Group, Department of Mathematics and Informatics, University of the Balearic Islands, 07122 Palma, Spain; francisco.bonin@uib.es

**\*** Correspondence: antoni.burguera@uib.es

† These authors contributed equally to this work.

Received: 16 May 2020; Accepted: 10 June 2020; Published: 14 June 2020

**Abstract:** State of the art approaches to Multi-robot localization and mapping still present multiple issues to be improved, leaving a wide range of open possibilities for research and technology. This paper presents a new algorithm for visual Multi-robot simultaneous localization and mapping, used to join, in a common reference system, several trajectories of different robots that participate simultaneously in a common mission. One of the main problems in centralized configurations, where the leader can receive multiple data from the rest of the robots, is the limited communications bandwidth, which delays the data transmission and can be overloaded quickly, restricting the reactive actions. This paper presents a new approach to Multi-robot visual graph *Simultaneous Localization and Mapping* (SLAM) that aims to build a joint topological map, which evolves in different directions according to the different trajectories of the different robots. The main contributions of this new strategy are centered on: (a) reducing the visual data to be exchanged among all agents to hashes of small dimensions, diminishing, in consequence, the data delivery time, (b) running two different phases of SLAM, intra- and inter-session, with their respective loop-closing tasks and a trajectory joining action in between, with high flexibility in their combination, (c) simplifying the complete SLAM process, in concept and implementation, and addressing it to correct the trajectories of several robots, initially and continuously estimated by means of a visual odometer, and (d) executing the process online, in order to ensure a successful accomplishment of the mission, with the planned trajectories and at the planned points. Preliminary results included in this paper show a promising performance of the algorithm on visual datasets obtained at different points on the coast of the Balearic Islands, either by divers or by an *Autonomous Underwater Vehicle* (AUV) equipped with cameras.

**Keywords:** multi robot; Simultaneous Localization and Mapping; visual loop closure; image global signatures

#### **1. Introduction and Related Work**

*Simultaneous Localization and Mapping* (SLAM) [1] is an essential task for *Autonomous Underwater Vehicles* (AUV) to accomplish their programmed missions successfully and precisely. SLAM consists of building a map of the environment and, at the same time, estimating the vehicle pose within this map. SLAM is presently a de facto localization standard for any kind of autonomous vehicle. Laser range finders or sonar were the sensor modality of choice at first [2–4]. However, research turned to computer vision as soon as the price and capabilities of cameras made it possible [5], since cameras provide higher temporal and spatial data resolutions and richer representations of the world.

However, large-scale or long-term operations with a single robot equipped with cameras generate huge amounts of visual data that can collapse the vehicle computer if they are not treated properly. A common strategy to overcome this problem is to explore the areas of interest in different, separated missions, so-called *sessions*, run with a single robot in different time periods (a Multi-session configuration [6]) or with several robots running simultaneously (Multi-robot configurations [7]). Therefore, the limited capability of a robot to operate robustly during long periods of time can be alleviated by running different transits with different agents, at the same time, through common areas, and joining all individual estimated trajectories in a single coordinate frame. Multi-robot systems also increase robustness in case of failure of any of the robots; however, they need complex coordination and multiple localization systems. Typical applications using teams of robots include aerial surveillance [8], underwater exploration [9], maintenance of industrial infrastructures or intervention in archaeological sites [10], among others.

The first approaches to Multi-robot SLAM were based on particle filters [11], and introduced the concept of *encounters* as the relative pose between two robots that can mutually recognize each other and determine their relative poses. These *encounters* are introduced as additional pose constraints in the particle filter. Some Multi-robot approaches are based on the *Anchor-nodes* [12,13] proposal, which defined two concepts unconsidered for multiple trajectories until that moment: (a) the *Anchor*, defined as the offset of a complete trajectory with respect to a global system of coordinates, and (b) the *encounter*, re-defined as a transformation between two different poses of two different robots that observe the same part of the environment, without necessarily requiring that both robots recognize each other. In visual-based systems this can be achieved, for instance, by detecting overlapping scenes. In Multi-robot systems, *encounters* represent additional constraints between different graphs corresponding to different sessions.

Schuster et al. conceived a very precise SLAM approach to localize a team of planetary rovers equipped with an *Inertial Measurement Unit* (IMU) and a stereo camera [14]. IMU data, visual odometry and wheel odometry are integrated in a local localization *Extended Kalman Filter* (EKF), and the 3D point clouds of all robots, computed from their respective stereo views, are first stored in each agent and then matched to be joined in global 3D maps. The use of stereo vision and advanced techniques for 3D feature matching and alignment complicates considerably the whole system and generates huge amounts of data to work with and to be exchanged. This solution turns out to be very difficult for underwater missions, given the limited options for fast communication in this medium.

Saeedi et al. offered an extensive survey of Multi-robot systems and strategies, pointing also towards the upcoming trends and challenges [15], such as extending the systems to dynamic and/or large-scale environments or increasing the number of agents in the working teams.

Another issue to consider in SLAM is the detection of loop closings and their use to correct the robot trajectory estimated by means of dead-reckoning sensors, such as inertial units, acoustic beacons, laser-based or visual odometers. Loop closing is the problem of recalling revisited scenes, and approaches to visual loop closure detection try to recognize the same scene in different images, taken at different and relatively distant time instants, regardless of evident differences in scale or viewpoint [16]. In single-session SLAM, since the robot pose is continuously estimated, the search for images that are candidates to close a loop with a *query* (from now on called intra-session loop closings) is constrained to a region around the robot pose associated with that query [17]. In contrast, multi-robot loop detection, i.e., the detection of loop closings among different sessions of different robots (from now on called inter-session loop closings), cannot rely on the AUV poses to constrain the search since, at first, the relative poses between sessions are unknown. Consequently, every *query* of one session would need to be compared with all the images obtained until that moment in the other sessions, increasing considerably the time dedicated to this task and the amount of visual data to be exchanged.

Exchanging image hashes instead of entire images or sets of image salient points is a way to reduce data transfer requirements in Multi-robot configurations. Hash functions are usually used to authenticate messages sent between a source and a receiver, so that the latter can verify the authenticity of the source. Conventional hashing algorithms are extremely sensitive to the hashed messages: a change in 1 bit of the input message causes dramatic changes in the output. Applications of hashes, understood as exposed before, include image retrieval in large databases, authentication and watermarking, among many others [18]. However, in applications of scene recognition, localization or visual loop closing detection, it is accepted that similar or overlapping images are expected to produce similar or close hashes while distinct images produce clearly distinctive hashes, this concept being known as perceptual image hashing [19–23]. In particular, McDonald et al. [6] proposed to detect loop closings using a solution based on *Bag of Words* (BoW) [24] combined with iSAM [12] for batch map optimization, and Negre et al. [22] showed how their new global image descriptor HALOC outperformed other techniques, such as BoW and VLAD [20], in the task of loop closing detection with image hashes. From now on, this text uses hash and global image descriptor interchangeably to refer to the same concept.

All these aforementioned references apply hashes to detect loops in SLAM applications for single robots. However, our interest is now focused on Multi-robot systems, and on the application of global image signatures to find loops between images captured by different robots that operate in the same mission on a common area of interest. A few authors have already explored this idea. For instance, *Decentralized Visual Simultaneous Localization and Mapping* (DSLAM) [25] is a powerful tool for pose-graph Multi-robot decentralized applications in environments where absolute positioning is not available. DSLAM reduces every image to its NetVLAD (a Neural Network Architecture for Place Recognition) global descriptor [26]. To find loop closings, DSLAM seeks, for every *query* of one robot, the image of another robot whose NetVLAD descriptor presents the shortest distance to the descriptor of the *query*, provided that this distance is below a certain threshold. This process is done for every frame of every robot that is inside a predefined cluster. DSLAM uses ORB-SLAM [27] for continuous localization, which includes a global image characterization based on BoW for initial odometric estimates, and ORB [28] feature matching and RANSAC to calculate the 3D transform between confirmed visual loop closings.

The idea of *Cloud Computing* is applied in some cases to alleviate the computational load needed for a set of robots to localize themselves and map the environment, running a software architecture based on a multi-layer cloud platform [29]. In this latter reference, robots use ORB-SLAM for self-localization and the multi-agent SLAM is tested with the KITTI [30] public dataset and using a quadrotor drone in an outdoor environment. A few solutions integrate inertial with image data to perform Multi-robot graph SLAM. In [31], ORB visual features are tracked along consecutive frames and integrated together with the motion given by an IMU in a graph optimization context. BoW is also used to detect candidates to close inter-session loops. The BoW-based global image descriptor of a query image is compared to the global descriptors of all other images of the other agents, selecting a set of candidates to close inter-session loops with the query. Afterwards, a brute-force feature matching with RANSAC is applied to confirm or reject the candidates. Experiments in [31] are performed with aerial robots in industrial environments.

Previous references have been tested only in terrestrial indoor and outdoor environments. The literature is extremely scarce in Multi-robot SLAM addressed, implemented and tested in underwater scenarios with AUVs [9,32]. Underwater computer vision is affected by several challenging problems, such as flickering, reduced range, lack of illumination, haze, light absorption, refraction, and reflection. These limitations increase the need for more robust visual SLAM approaches, which start with accurate camera calibrations. Accuracy in the camera calibration process is critical to reduce drift in the visual odometry and to increase precision in the pose transform obtained from images that close loops [33,34]. Furthermore, none of the papers cited previously consider the potential impact of limited communications among robots, because either they are applied in ground or aerial robots or they simply assume full, high-bandwidth connectivity. This assumption is clearly unrealistic in underwater environments, where blue-light laser communications need to be highly directive and acoustic USBL modems work, on average, at 13 Kbps for long-range devices and up to 65 Kbps in mid-range devices. Additionally, the latter speeds are not suitable to transmit medium-high resolution visual data between two robots without a previous compression. For instance, Pfingsthorn et al. [35] proposed a visual pose-graph SLAM approach in which compressed JPEG-format images are sent via acoustic links only between robots that can mutually recognize their positions and are viewing overlapping areas. Paull et al. [36] forgo the use of images for SLAM and entrust the whole localization process to an acoustic modem for data transmission and to instruments that give relatively small amounts of data compared with images: a compass, a *Doppler Velocity Log* (DVL) and a Side Scan Sonar. Besides, they also apply a new strategy to marginalize unnecessary local information to reduce the dimensions of the transferable packages.

In the context of the ongoing national project TWINBOT (*TWIN roBOTs for Cooperative Underwater Intervention Missions*) [37], diverse missions of exploration and cooperative intervention have to be run using one or several AUVs in underwater areas with multiple appearances and different benthic habitats. In this project, accurate, fast and reliable robot localization, loop closing and navigation algorithms are crucial for the success of the missions. This paper presents a new approach to Multi-robot visual graph-SLAM, especially designed for 2.5D configurations, where vehicles move at a constant altitude with a camera pointing downwards, with the lens axis constantly perpendicular to the ground or to the vehicle longitudinal axis. This condition simplifies the visual system to 3 *Degrees of Freedom* (DoF): two for an in-plane translation (*x*, *y*) and another for rotation in yaw (*θ*). This simplification fits aerial and underwater vehicle configurations, if the navigation altitude is large enough compared with the height of the terrain relief [38,39]. However, so far, tests have been made only with underwater datasets, since the research developed by our team in general [40], and the TWINBOT project in particular, is applied entirely and solely underwater, and this approach emerged as a solution to be applied on the robots that participate in our project missions.

The approach presented now includes several contributions that represent clear advantages with respect to the existing solutions, namely:

(a) As in [25], images are reduced to global descriptors, decreasing drastically the amount of visual data to be exchanged among robots; however, the global descriptor used now is HALOC [22] instead of NetVLAD. The construction of HALOC is simpler and faster than that of NetVLAD, consisting of projecting all image features on a base of orthogonal vectors, without any need for tedious and long training tasks. This is a clear advantage over previous work, since HALOC has already been shown to outperform VLAD [20] and BoW, in speed and performance for loop closing detection, in both terrestrial public benchmarks and underwater environments. HALOC also showed a better performance than ORB-SLAM, although only underwater. Additionally, extensive experiments with HALOC performed in marine areas partially colonized with seagrass [41,42] also revealed an excellent efficiency, capacity and utility for loop closing detection in this type of environment.

(b) A second important contribution is the simplification of the whole system with respect to previous approaches. Ours requires neither the computation of relative poses among robots nor a specific strategy to limit their communication and interaction. At every SLAM iteration, the quantity of bytes to be exchanged between robots is small enough not to limit the communication between all agents that participate in the mission, if needed.

(c) The global procedure includes local and global SLAM tasks, with a map joining process in between. The advantage of this point lies in the flexibility to choose the moment at which the map joining is performed, either giving priority to making local routes as accurate as possible, or delaying the major corrections until all maps have been joined.

(d) The present approach goes one step beyond its predecessors, since the joined graph incorporates and reflects, online, the successive poses of all robots that move simultaneously.

(e) The localization and motion problem is simplified to 2D. Furthermore, it avoids complex multi-layer software architectures or Cloud computing strategies present in previous solutions.

(f) One of the principal objectives has been the reduction of the computational requirements of the algorithm, since, in general, they are limited in lightweight underwater vehicles. Running the algorithm online onboard the vehicles is a must, since it is especially addressed to multi-robot configurations, and these configurations imply controlling, mapping and guiding several robots moving simultaneously, where usually, one centralizes the processing of the localization data of the whole group.

Furthermore, although they are not directly novel contributions, it is worth mentioning two additional advantages in the implementation: (i) similarly to [32] or [35], once maps of different robots are joined, standard graph-based topology representations are used, where images form nodes and transforms between two images (either between consecutive frames or between two images that close a loop) form edges or links, and (ii) the graph is optimized by means of standard *g*2*o* [43]; this standardization facilitates the exchange of the different modules on a variety of software platforms and their reuse among different implementations.

Although this is out of the scope of this paper, this vision-based algorithm can complement the navigation facilities of underwater vehicles equipped with multiple types of sensors. In fact, this algorithm can integrate additional sensorial data in the first estimation of the vehicle motion, combining visual odometry with other means of laser or sonar-based dead reckoning [44,45].

The source code of the whole approach has been made publicly available for the scientific community in several GitHub repositories, together with a simple underwater dataset to test the whole procedure. Links to sources are provided in Section 3.

Section 2 contextualizes and details all algorithms proposed to: (a) estimate the visual odometry, (b) detect intra- and inter-session loop closings, (c) perform the local trajectory-based SLAM and (d) join maps and optimize the global graph. Section 3 presents some qualitative and quantitative preliminary results. Finally, Section 4 concludes the paper and gives some indications of ongoing and upcoming tasks to continue and improve this work.

#### **2. Materials and Methods**

#### *2.1. Overview*

The proposed localization module is based only on vision, with no intervention of either dead-reckoning navigation instruments, such as IMU or DVL, or global positioning systems, such as GPS for surface vehicles or Ultra-Short Baselines (USBL) for underwater vehicles.

The structure of the proposed system is as follows:

(1) Let us simplify the problem by assuming that there are, for instance, two vehicles moving simultaneously over the same area of interest in such a way that there is no possibility of collision, and that part of the area will be explored by both robots.

(2) The approach starts by estimating the trajectory of each robot separately, applying the trajectory-based visual-SLAM strategy included in the multi-session scheme of Burguera and Bonin-Font [46]. Let us refer to this step as the *intra-session SLAM*. The indicated trajectory-based scheme implies that the trajectory of each robot is estimated by compounding [47] the successive displacements calculated from one point to the next. These successive displacements form the state of an Extended Kalman Filter (EKF), which is updated using the transforms given by the confirmed loop closings. In our particular case, this displacement corresponds to the visual odometry calculated between consecutive images, and the candidate images to close a loop with the current image are found by comparing the corresponding image hashes and confirmed by a RANSAC-based algorithm applied on a brute-force visual feature-matching process. The main difference with respect to [46] is that, while in a multi-session localization procedure the trajectory of the currently running robot is joined to another trajectory already completed and available in its totality, now, in a Multi-robot scheme, both robots are moving at the same time to complete a mission in which both participate simultaneously. When joined, both trajectories are incomplete, and continue running until all robot missions are finished.

(3) Simultaneously to both *intra-session SLAM* tasks, the system also searches for inter-session loop closings using HALOC. The global signature of each new image captured by each robot is compared with the global signatures of all images of the other robot. Once a certain number of inter-session loop closings are confirmed, both routes are joined in a single graph. Let us refer to this step as the *Map Joining*.

(4) Once joined, the global map (graph) must be completed with the successive poses of both robots until the end of both sessions. Furthermore, the trajectory-based localization approach applied to both agents separately is no longer valid. Each new displacement of both robots is included in the form of new nodes on the graph. Each new node of the graph will follow the direction of motion of each vehicle, which means that the graph will grow in two different directions, according to the two different trajectories, but forming a single entity. Let us denote this step as the *Multi-robot SLAM*.

(5) The global graph completion and optimization is done using a pose-based scheme, i.e., all new nodes corresponding to each robot will contain their successive global poses with respect to the origin of a unique world coordinate system, while the links between nodes will contain the displacement between them. After the map joining, only the inter-session loop closings are used to optimize the global map. If one loop closing between different sessions is confirmed, its resulting transform is used as an additional constraint between two nodes to optimize the whole graph. In our case, the optimization solver used is the *g*2*o* implementation [43] of the *Levenberg Marquardt* algorithm [48,49]; a minimal sketch of how such a pose graph can be handed to *g*2*o* is given below.
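The following minimal sketch (not the project's actual code; file names and the helper function are assumptions) shows how such a pose graph, with nodes holding global (*x*, *y*, *θ*) poses and links holding relative displacements or loop-closing transforms, could be exported in the plain-text SE2 format understood by the standalone *g*2*o* optimizer:

```python
# Minimal sketch: export a 2D pose graph in g2o's plain-text SE2 format so it
# can be optimized offline with the standalone g2o tool (e.g. "g2o -o
# optimized.g2o graph.g2o"). Nodes hold global poses (x, y, theta); edges hold
# relative displacements or loop-closing transforms plus an information matrix.

def write_g2o(filename, poses, edges):
    """poses: dict {node_id: (x, y, theta)}
       edges: list of (id_from, id_to, (dx, dy, dtheta), info6), where info6
       is the upper-triangular information matrix (i11, i12, i13, i22, i23, i33)."""
    with open(filename, "w") as f:
        for nid, (x, y, th) in sorted(poses.items()):
            f.write(f"VERTEX_SE2 {nid} {x} {y} {th}\n")
        for i, j, (dx, dy, dth), info in edges:
            f.write("EDGE_SE2 {} {} {} {} {} {} {} {} {} {} {}\n".format(
                i, j, dx, dy, dth, *info))

# Example: two odometry nodes plus one relative-pose (odometric) constraint.
poses = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0)}
edges = [(0, 1, (1.0, 0.0, 0.0), (100, 0, 0, 100, 0, 400))]
write_g2o("graph.g2o", poses, edges)
```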

#### *2.2. Intra-Session SLAM and Map Joining*

#### 2.2.1. Visual Odometry

Figure 1 shows the global idea behind this first step of the approach. The visual odometry gives the estimated 2D motion between consecutive images by means of a SIFT feature detection and matching procedure.

**Figure 1.** Single Session SLAM overview.

Algorithm 1 shows the RANSAC-based method used to register two images, i.e., get the transform (if it exists) between them in translation and rotation. *apply*\_*altitude*() is a function that converts image feature coordinates from pixels to meters by considering the altitude at which the AUV navigates, as well as the camera parameters.

**Algorithm 1:** RANSAC approach to estimate the motion $\hat{X}_B^A$ from image $I_A$ to image $I_B$.

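As an illustration only, a minimal sketch of this kind of RANSAC-based 2D registration follows (helper names are hypothetical, and this is not the authors' exact implementation); the matched feature coordinates are assumed to be already converted to metres with *apply*\_*altitude*():

```python
# Illustrative sketch of a RANSAC registration in the spirit of Algorithm 1:
# given matched feature coordinates between images I_A and I_B (in metres),
# estimate the 2D roto-translation (x, y, theta) from A to B, or fail.
import numpy as np

def fit_rigid_2d(a, b):
    """Least-squares rigid transform mapping points a -> b (Nx2 arrays)."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    H = (a - ca).T @ (b - cb)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # enforce a proper rotation
        Vt[1, :] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t

def ransac_estimate_motion(a, b, iters=200, corr_thr=0.05, n_min=10):
    """a, b: Nx2 arrays of matched coordinates. Returns (x, y, theta, fail)."""
    rng = np.random.default_rng(0)
    best_err, best = np.inf, None
    for _ in range(iters):
        idx = rng.choice(len(a), size=3, replace=False)    # random subset R
        R, t = fit_rigid_2d(a[idx], b[idx])
        res = np.linalg.norm((a @ R.T + t) - b, axis=1)    # per-match error
        inliers = res < corr_thr                           # expand R
        if inliers.sum() >= n_min:
            R, t = fit_rigid_2d(a[inliers], b[inliers])    # refit on inliers
            err = np.mean(np.linalg.norm((a[inliers] @ R.T + t) - b[inliers], axis=1))
            if err < best_err:
                best_err = err
                best = (t[0], t[1], np.arctan2(R[1, 0], R[0, 0]))
    if best is None:
        return 0.0, 0.0, 0.0, True                         # fail = True
    return best[0], best[1], best[2], False
```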

The idea behind this algorithm is that correct correspondences lead to the same roto-translation, while wrong feature matchings lead to different and wrong roto-translations. The algorithm selects a random subset *R* of correspondences from the total set of correspondences *C* between two images, and then computes the roto-translation *X* and the subsequent error using only this subset. Afterwards, if the error introduced by the non-selected matchings of *C* is below a threshold *corr*, these matchings are included in *R*. If, at any moment, the number of elements in *R* surpasses a threshold *Nmin*, the roto-translation and the error are computed again using this expanded *R*. If the error is below the smallest error obtained until this moment, the roto-translation is kept as a good model. This process is iterated a certain number of times. If partial roto-translations are inconsistent and *R* never reaches the minimum number of items required, the algorithm will not return any transform, but a boolean called *fail* set to *true*. The obtained transform can be assumed to be the odometric displacement between consecutive images, and the trajectory of the robot between steps *i* and *j*, $X_j^i$ (assuming step *j* is subsequent to *i*), can be estimated using the compounding operator ⊕ on the successive odometric displacements $(X_{i+1}^i, X_{i+2}^{i+1}, \dots, X_j^{j-1})$, as described in [50]:

$$X\_j^i = X\_{i+1}^i \oplus X\_{i+2}^{i+1} \oplus \cdots \oplus X\_j^{j-1}, \quad j > i. \tag{1}$$
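As an aside, a minimal sketch of the 2D compounding operator ⊕ acting on (*x*, *y*, *θ*) poses, and of how Equation (1) chains successive visual-odometry displacements, could be (the numerical displacements are made up for the example):

```python
# Sketch of the 2D pose compounding operator "⊕" used in Equation (1).
import numpy as np

def compound(x1, x2):
    """x1 ⊕ x2 for poses/displacements expressed as (x, y, theta)."""
    x, y, th = x1
    dx, dy, dth = x2
    return (x + dx * np.cos(th) - dy * np.sin(th),
            y + dx * np.sin(th) + dy * np.cos(th),
            th + dth)

# X_j^i obtained by compounding consecutive odometric displacements (toy values).
odometry = [(0.5, 0.0, 0.05), (0.48, 0.02, 0.04), (0.51, -0.01, 0.06)]
X_ij = (0.0, 0.0, 0.0)
for step in odometry:
    X_ij = compound(X_ij, step)
```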

#### 2.2.2. Local Loop Detection and Trajectory Optimization

Local loops are those found within a single SLAM session. The Loop Candidates set $\mathcal{LC}_t$ is the set of images $I_i$ that may close a loop with the last gathered image $I_t$ obtained in each running trajectory. This set is built by searching in a region within a predefined radius *δ* [38] around the current robot pose as estimated by the odometry:

$$\mathcal{LC}\_t = \{ i : ||X\_t^i||\_2 \le \delta,\ i < t - 1 \} \tag{2}$$

where $X_t^i$ is computed by Equation (1).

Every image contained in the set of loop closing candidates ($I_i \in \mathcal{LC}_t$) is registered with $I_t$ using Algorithm 1, in order to build the set of local loops $\mathcal{LL}_t$, defined as $\mathcal{LL}_t = \{ Z_t^i : i \in \mathcal{LC}_t \wedge \neg fail(i, t) \}$, where $\neg fail(i, t)$ indicates that Algorithm 1 did not fail to get a roto-translation $Z_t^i$ between $I_i$ and $I_t$.
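A compact sketch of this candidate search and registration step, assuming a hypothetical `register(i, t)` helper that performs the feature matching plus Algorithm 1, could look as follows (all names are illustrative):

```python
# Sketch of Equation (2) plus the construction of LL_t: keep as candidates the
# previous images whose estimated relative pose to I_t lies within radius delta,
# then try to register each of them with the RANSAC procedure of Algorithm 1.
import numpy as np

def local_loops(rel_poses, t, delta, register):
    """rel_poses[i]: X_t^i = (x, y, theta), from Equation (1), for i < t-1.
       register(i, t): runs feature matching + Algorithm 1, returns (Z, fail).
       Returns LL_t as {i: Z_t^i} for every candidate that registered."""
    candidates = [i for i in range(t - 1)
                  if np.hypot(rel_poses[i][0], rel_poses[i][1]) <= delta]
    loops = {}
    for i in candidates:
        Z, fail = register(i, t)
        if not fail:
            loops[i] = Z
    return loops
```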

The trajectory estimation obtained by means of compounding the successive odometric displacements between two points *A* and *B* will most likely not coincide with the direct transform between images obtained in *A* and *B* provided by the image registration process of Algorithm 1, if *A* and *B* close a loop:

$$X\_{A+1}^A \oplus X\_{A+2}^{A+1} \oplus X\_{A+3}^{A+2} \oplus \cdots \oplus X\_{B-1}^{B-2} \oplus X\_B^{B-1} \neq Z\_B^A \tag{3}$$

due to the drift introduced by the visual odometry and the error inherent to the transform directly obtained from the image registration procedure. Figure 2 illustrates these concepts.

Afterwards, a process of global optimization is run to get a trajectory that best combines the pose constraints imposed by the set of local loops $\mathcal{LL}_t$ and the odometry. As mentioned before, the trajectory of the robot is the state vector of an *Iterative Extended Kalman Filter* (IEKF). Each new odometric displacement $X_t^{t-1}$, computed between the last image and the previous one, is used to augment the state vector at time *t*, $X_t^-$: $X_t^- = \left( (X_{t-1})^T \; (X_t^{t-1})^T \right)^T$. In the prediction stage of the IEKF the state vector does not change at all.

If $\mathcal{LL}_t$ is not empty, the trajectory is optimized by performing the Update stage of the IEKF using the set of loop closings as measurements. The observation function $h_t^i$ associated with each measurement $Z_t^i$ (the transform of each real loop closing) can be defined as $h_t^i(X_t^-) = X_{i+1}^i \oplus X_{i+2}^{i+1} \oplus \cdots \oplus X_{t-1}^{t-2} \oplus X_t^{t-1}$, the innovation of the IEKF for each measurement being $Z_t^i - h_t^i(X_t^-)$. With all these elements, one can iterate the classical equations of an EKF to get the optimized trajectory, until $||X_{t,j}^- - X_{t,j-1}^-|| < \gamma$, where *j* represents the last iteration and *γ* is a predefined threshold. The classical format of the IEKF involves iterating until the changes between consecutively estimated states are below a certain threshold. However, in the experiments we verified that, after a certain and almost constant number of iterations, the filter had already converged, with a difference between consecutive results below the threshold ($||X_{t,j}^- - X_{t,j-1}^-|| < \gamma$). Because of that, it was decided to repeat all the experiments with a fixed number of iterations, in order to limit the number of executions and save computational resources. In any case, both options can be used in other circumstances, depending on the needs and environmental conditions of each different system and field case.
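A generic, simplified sketch of this iterated update is given below; the actual state vector, observation functions and noise models are those of the trajectory-based filter described above, and the function/parameter names are assumptions:

```python
# Illustrative iterated EKF update: refine the state with the loop-closing
# measurements until the change between consecutive iterates drops below gamma,
# or a fixed number of iterations is reached (the variant adopted in the paper).
import numpy as np

def iekf_update(x, P, z, h, H_jac, R, gamma=1e-4, max_iters=10):
    """x: predicted state, P: covariance, z: stacked loop-closing measurements,
       h(x): predicted measurements, H_jac(x): measurement Jacobian, R: noise."""
    x_i = x.copy()
    for _ in range(max_iters):
        H = H_jac(x_i)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x_new = x + K @ (z - h(x_i) - H @ (x - x_i))   # relinearized update
        if np.linalg.norm(x_new - x_i) < gamma:
            x_i = x_new
            break
        x_i = x_new
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_i, P_new
```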

#### 2.2.3. Inter-Session Loop Closings

The main problem for joining two trajectories of two robots operating simultaneously is the lack of geometric relation between their corresponding sessions. Every robot geo-localizes itself with respect to the origin of its own trajectory, but it has no knowledge about the origin of the other trajectories. By finding inter-session loop closings, i.e., images that show, partially or totally, the same area, but taken by two different robots in two different sessions running simultaneously, the maps of the two robots can be joined into a single one [46]. Due to this lack of geometric relation between the two trajectories, the search for candidates of one robot to close a loop with the last image captured by the other robot cannot be restricted to a certain area. In this case, one should compare the last image gathered by one of the robots with all images taken by the other robot, from the start of the mission until the current moment. Applying a brute-force feature matching algorithm between all these involved images is unfeasible for online applications, due to the great amount of computing resources and time needed. One way to alleviate this problem is to reduce all images to global descriptors. As in [46], all images of both sessions are reduced to their HALOC global descriptor. The size of HALOC is fixed at 384 floats, since the size of the used projective orthogonal vectorial space is 3 [22]. This length is independent of the number of visual features found in each image. The global descriptor of every new image of one of the sessions (called the query image) is compared with the global descriptor of each and every image taken during the other session. According to [22], the 5 images that give the lowest L1-norm of the difference between their hash and the query hash, provided this norm is lower than a certain threshold *δ*′, are taken as the inter-session loop closing candidates ($\mathcal{GC}_t$):

$$\mathcal{G}\mathcal{C}\_t = \{ i : ||H\_i - H\_t||\_1 \le \delta', \forall I\_i \in V\_p \} \tag{4}$$

where $V_p$ is the set of images taken by one of the robots from the start of its session until the current moment, $H_i$ the hash of each of these images and $H_t$ the hash of the query image. The value of *δ*′ will be selected experimentally.
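A minimal sketch of this candidate selection by L1 distance between hashes (illustrative only; the real system relies on the HALOC library) is:

```python
# Sketch of Equation (4): the inter-session candidates are the images of the
# other robot whose 384-float hash is closest, in L1 norm, to the query hash,
# keeping at most the 5 best ones below the threshold delta'.
import numpy as np

def intersession_candidates(query_hash, other_hashes, delta_p, k=5):
    """query_hash: (384,) array; other_hashes: dict {image_id: (384,) array}."""
    dists = {i: np.sum(np.abs(h - query_hash)) for i, h in other_hashes.items()}
    ranked = sorted(dists.items(), key=lambda kv: kv[1])
    return [i for i, d in ranked[:k] if d <= delta_p]
```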

**Figure 2.** In theory, the transform between A and B, if both close a loop, should be very close to the transform obtained by compounding the odometric displacements $X_{i+1}^i$.

Once the set of candidates is established, the true positives are confirmed by means of the RANSAC-based Algorithm 1, forming the definitive set of images $\mathcal{GL}_t$ of the first session that, in principle, close a loop with the last image of the second session, as: $\mathcal{GL}_t = \{ Z_t^i : i \in \mathcal{GC}_t \wedge \neg fail(i, t) \}$, where $Z_t^i$ is the transformation found by Algorithm 1. Inter-session loops are accumulated at every iteration of both single SLAM sessions. These transforms $Z_t^i$ are, in fact, a set of geometric relations between the two different sessions. Assessing the performance of HALOC in loop closing detection, in terms of accuracy, recall and fall-out, is out of the scope of this paper, since it has already been presented in [22,42], with considerably good results underwater.

#### 2.2.4. Map Joining

As mentioned in the previous section, the loop closings between different sessions can be used to infer the geometrical relation between the two trajectories of the two robots that perform both missions simultaneously. The objective now is to align, at a certain moment, both surveys in a single global graph, and, maintain this single graph from the moment of the joining to the end of both missions.

Let $X_1$ denote the trajectory of one of the sessions. Let the first and last images of this trajectory be denoted as $I_{1s}$ and $I_{1e}$, respectively. Let $X_2$ denote the trajectory of the second session and let us denote its first and last images as $I_{2s}$ and $I_{2e}$, respectively. Let us denote the number of accumulated inter-session loop closings at a certain moment *t* as *K*. Let us also denote this set of loop closings as $Z_G$, each one relating one image of the first session with another image of the second session.

$$Z\_G = \left( \begin{array}{cccc} (Z\_{20}^{10})^T & (Z\_{21}^{11})^T & \dots & (Z\_{2(K-1)}^{1(K-1)})^T \end{array} \right)^T \tag{5}$$

where each $Z_{2i}^{1i}$ represents a transform from image $I_{1i}$ of session 1 to image $I_{2i}$ of session 2; that is, the transforms of the loop closings. Each $Z_{2i}^{1i}$ belongs to a certain $\mathcal{GL}_t$.

Let us define $X_{2s}^{1e}$ as the transform, or the relative motion, from $I_{1e}$ to $I_{2s}$. For every loop closing, ideally, $Z_{2i}^{1i} = X_{1e}^{1i} \oplus X_{2s}^{1e} \oplus X_{2i}^{2s}$, where $X_{1e}^{1i}$ is the displacement from $I_{1i}$ to $I_{1e}$, and $X_{2i}^{2s}$ is the displacement from $I_{2s}$ to $I_{2i}$, being:

$$X\_{1e}^{1i} = (x\_{1e}^{1i}, y\_{1e}^{1i}, \theta\_{1e}^{1i})^T \tag{6}$$

$$X\_{2s}^{1e} = (x\_{2s}^{1e}, y\_{2s}^{1e}, \theta\_{2s}^{1e})^T \tag{7}$$

$$X\_{2i}^{2s} = (x\_{2i}^{2s}, y\_{2i}^{2s}, \theta\_{2i}^{2s})^T \tag{8}$$

The proposal consists of an IEKF that will give the value of $X_{2s}^{1e}$ that best matches all the loop closures found until the moment *t*. The state vector of the IEKF is just the transform $X_{2s}^{1e}$, the observation function for each loop closing will be $g_G^i = X_{1e}^{1i} \oplus X_{2s}^{1e} \oplus X_{2i}^{2s}$, and $Z_{2i}^{1i}$ is the corresponding measurement. With this, one can form the innovation and apply the classical EKF equations iteratively, as explained in Section 2.2.2. $X_{2s}^{1e}$ is the transformation that can be used to join the two sessions, in such a way that the state vectors of both trajectories, formed by displacements, are joined by this recently computed transformation as $X_J = \left( (X_1)^T \; (X_{2s}^{1e})^T \; (X_2)^T \right)^T$, where $X_J$ represents the joined trajectory, and $X_1$ / $X_2$ the state vectors of the first and second trajectories, respectively, from their starting points until the instant *t*. The idea is illustrated in Figure 3.

**Figure 3.** (**a**) Separated trajectories with intersession loop closings. (**b**) The joined trajectory.

Once both sessions have been joined, the trajectory-based schema is no longer valid, and the resulting map is transformed into a pose-based graph. All the robot displacements included in *XJ* are transformed into global poses that constitute each node of the global graph as:

$$X^{i} = \begin{cases} X\_1^0 \oplus X\_2^1 \oplus \dots \oplus X\_{i-1}^{i-2} \oplus X\_i^{i-1} & i \le 1e\\ X\_1^0 \oplus X\_2^1 \oplus \dots \oplus X\_{1e}^{1e-1} \oplus X\_{2s}^{1e} \oplus X\_{2s+1}^{2s} \oplus \dots \oplus X\_i^{i-1} & i > 1e \end{cases} \tag{9}$$

where $X^i$ is the pose associated with node *i*. The case $i \le 1e$ refers to the global pose of a node corresponding to an element of the first trajectory, and the case $i > 1e$ refers to the global pose of a node of the second trajectory. All the displacements included in $X_J$ become the links between successive nodes, and each node of the graph is associated with its corresponding image.
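A short sketch of this conversion (reusing the `compound()` helper sketched in Section 2.2.1; names are illustrative) could be:

```python
# Sketch of Equation (9): convert the joined state vector X_J (a sequence of
# relative displacements) into global poses by compounding from the origin;
# these poses become the graph nodes and the displacements become the links.

def displacements_to_global_poses(X_J):
    """X_J: list of (dx, dy, dtheta) displacements of the joined trajectory."""
    poses, current = [], (0.0, 0.0, 0.0)
    for d in X_J:
        current = compound(current, d)   # X^i = X^{i-1} ⊕ X_i^{i-1}
        poses.append(current)
    return poses
```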

#### *2.3. Multi-Robot Graph SLAM*

Let us assume that, (a) after both trajectories have been joined in a single graph, both robots are still running their own missions, and, (b) the successive poses of both robots must be included in the global graph as new nodes, each one following the corresponding trajectory.

The Multi-robot Graph-SLAM procedure detailed below follows the indications of [51], in terms of structure, node generation, inclusion of loop closings as additional pose constraints, and graph optimization, but in our case particularized for a Multi-robot configuration. The algorithm includes the next points:

(1) The local SLAM algorithm explained in Section 2.2 is continuously executed for both trajectories until they are joined, when a certain number of inter-session loop closings have been accumulated. It is better to optimize the local trajectories every *N* frames, even if only a couple of loop closings are available, so that, when both sessions are joined, the drift has already been reduced locally as much as possible. Otherwise, trajectories could be joined before local optimizations have been applied, transferring local drifts to the global map.

Let us assume that the Multi-robot localization is centralized in the first robot, which will receive, from the second robot: (a) the set of visual features and the global descriptor of the last gathered image, and (b), only if the map joining has to be done, the state vector and the last odometric displacement (a sketch of this message content is given below). The state vector is needed so that it can be attached to that of the first robot, when required. The global descriptor of the last frame is needed to find possible loop closures with frames of trajectory 1, the set of features is needed to confirm or reject the possible candidates, and the last odometric displacement of trajectory 2 will be used as a reference after the map joining.
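As an illustration, the content of such an update message could be organized as follows (hypothetical field names; this is a sketch, not the actual inter-vehicle protocol):

```python
# Sketch of the per-iteration message sent by the second robot to the first:
# the hash and visual features of its last image always, and the state vector
# plus last odometric displacement only when the map joining is to be performed.
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class RobotUpdate:
    image_id: int
    haloc_hash: np.ndarray                            # 384 floats (~1.5 KB)
    features: np.ndarray                              # SIFT keypoints + descriptors
    state_vector: Optional[np.ndarray] = None         # only for map joining
    last_odometry: Optional[Tuple[float, float, float]] = None
```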

(2) Once both trajectories have been joined in a single global graph, it is time to feed the map with the successive displacements of both robots. It is important to note that the last node of the graph corresponds to the last displacement of the second trajectory, since the first set of elements corresponds to the trajectory of robot 1, then comes the link between trajectories, and finally the elements of trajectory 2. Let us denote the identifier (ID) of the last node of the global graph as $N_{n2}$, where $n2$ represents the number of nodes in the graph and is equal to the length of the joined state vector $X_J$ ($|X_J|$). Accordingly, the ID of the graph node corresponding to the end of trajectory 1 will be $N_{n1}$, where $n1 = |X_J| - 1 - |X_2|$.

Storing $N_{n1}$ and $N_{n2}$ is necessary, since they will be the points of the global graph from which the successive nodes corresponding to trajectories 1 and 2, respectively, will be placed according to the ongoing motion of both vehicles.

The set of iterated actions performed for the Multi-robot SLAM are:


node of trajectory 2. The link between nodes $X_{n1+1}$ and $X_{n1}$, and the link between nodes $X_{n2+1}$ and $X_{n2}$, will contain the values of $X_{n1}$ and $X_{n2}$, respectively. Each new node added to the graph is associated in the code with its corresponding image, regardless of the trajectory it belongs to. In this way, with the node ID one can find its associated image, and with an image identifier, one can find its associated node ID.

	- (a) Recover the node IDs of the images associated with each inter-session loop closing classified as true positive, and every corresponding transform.
	- (b) Add one additional link in the graph between nodes $IDa_i$ and $IDb_i$, whose content is $Z_{IDb_i}^{IDa_i}$, $\forall i$, $1 \le i \le N_{ALC}$.

The idea is illustrated by Algorithm 2.

#### **Algorithm 2:** Multi-robot Visual Graph SLAM.

#### **Inputs**


**begin**

```
NALC = NTP = n = 0 ;
Robot 1 takes the next image → I_{n1+1} ;
Robot 2 takes the next image → I_{n2+1} ;
H1 = hash(I_{n1+1}) ; H2 = hash(I_{n2+1}) ;
Htrajectory1 ← H1 ;
[X_{n1+1}] = RansacEstimateMotion(I_{n1}, I_{n1+1}) ;
[X_{n2+1}] = RansacEstimateMotion(I_{n2}, I_{n2+1}) ;
X_{n1+1} = X_{n1} ⊕ X_{n1+1} ; N_{n1+1} → node graph ID of X_{n1+1} ;
X_{n2+1} = X_{n2} ⊕ X_{n2+1} ; N_{n2+1} → node graph ID of X_{n2+1} ;
Store the correspondences N_{n1+1} → I_{n1+1} and N_{n2+1} → I_{n2+1} ;
[ListOfCandidates] = LibHALOC(Htrajectory1, Nc, H2) ;
for j ← 0 to Nc do
    store Ia_n = I_{1j}, Ib_n = I_{n2+1} ;
    [Z_{IDb_n}^{IDa_n}] = RansacEstimateMotion(Ia_n, Ib_n) ;
    if number of inliers between Ia_n and Ib_n ≥ MinRansacInliers then
        NTP = NTP + 1 ;
        store Z_{IDb_n}^{IDa_n} and IDa_n, IDb_n ;
        n = n + 1 ;
    end
end
NALC = NALC + NTP ;
NTP = 0 ;
if NALC ≥ N_IsLoopClosings then
    for i ← (n − NALC) to n + NALC do
        AddRelativePose(Z_{IDb_i}^{IDa_i}, IDa_i, IDb_i) ;
    end
    NALC = 0 ;
end
OptimizeGraph() ;
n1 = n1 + 1 ; n2 = n2 + 1 ;
end
```
#### **3. Experimental Results**

#### *3.1. Experimental Setup*

A set of preliminary experiments was performed simulating the Multi-robot configuration with real underwater data. Three different datasets were used. Every dataset consists of two different video sequences, partially overlapping. Sequences were recorded either by a diver or by an AUV, both carrying a bottom-looking camera with its lens axis perpendicular to the longitudinal axis of the vehicle or of the diver. Divers or autonomous underwater vehicles moved at an approximately constant altitude with respect to the sea bottom. The lack of any other sensorial data which could be supplied by an AUV makes the localization system a pure vision-based approach. The three datasets were recorded in several coastal sites of the north and south of Mallorca, at depths between 5 and 13 m. All the environments where the datasets were grabbed present a great variety of bottom textures, including seagrass, stones, sand, algae, moss and pebbles. The first dataset was recorded on the north coast of the island by a diver with an attached GoPro camera pointing downwards, swimming on the surface at an approximately constant altitude of 4 m. Let us refer to both video sequences of this first dataset as *S*11 and *S*12. The camera altitude was obtained at the beginning of the video sequence by means of a visual marker of known size, placed at the sea bottom at the starting point of each trajectory.

A second dataset, formed by two partially overlapping trajectories named *S*21 and *S*22, was recorded also on the north coast of Mallorca, also by a diver equipped with a GoPro looking at the bottom, far from *S*11 and *S*12, swimming on the water surface at an approximately constant altitude of 4 m. In this case, the initial altitude was computed thanks to the known dimensions of a structure formed by markers and PVC tubes placed on the sea floor at the origin of both trajectories. The video resolution was 1920 × 1080 pixels, grabbed at 30 frames per second (fps), and prior to their use, all images were scaled down to 320 × 180 pixels.

A third dataset, with two video sequences named *S*31 and *S*32, was recorded by a SPARUS II AUV [52], property of the University of the Balearic Islands, at 7.5 fps, moving at a constant altitude of 3 m, in an area of the south of the island with an almost constant depth of 16 m. The navigation altitude is obtained from the vehicle navigation filter, which integrates a Doppler Velocity Log (DVL), an Inertial Measurement Unit (IMU), a pressure sensor, an Ultra-Short Baseline (*USBL*) acoustic modem [53], and a stereo 3D odometer. This dataset permitted testing the approach in larger environments with complex imagery due to the presence of seagrass on the sea bottom. In particular, *S*31 was recorded during a trajectory 93 m long, and *S*32 during a trajectory of 114 m, each one covering an approximate area of 300 m².

Figure 4 shows some samples of images included in the three datasets.

All these images show how all regions are colonized with *Posidonia oceanica*, a seagrass that forms dense and large meadows. Images of dataset 3 show a lack of illumination which increases at larger depths. Under these conditions, the feature matching process decreases its performance, which directly affects the accuracy of the visual odometry and the loop closing detection using HALOC. In this type of marine environment, moving our robot and its equipment at approximately 1 knot, at altitudes between 3 m and 5 m, in areas with depths between 16 m and 20 m, gave a good tradeoff between image overlap and illumination conditions.

Due to the particular texture of the Posidonia and the slight motion of its leaves caused by the currents, tracking stable visual features in consecutive overlapping frames is complicated and requires an accurate selection of the type of features and the feature detection/tracking parameters. Errors in this process will compromise the accuracy of the visual odometry and the image registration task for the loop closing confirmation. Previous pieces of work [22,41,42] already showed the high efficiency of SIFT features for underwater SLAM in areas colonized with Posidonia, in all the tasks involved in the process: visual odometry, image hashing with HALOC, loop closing detection, and pose refinement. Although the SIFT feature detector is slower than other descriptors and delays the RANSAC-based matching process of Algorithm 1, given the robustness and trackability of SIFT, and according to our experience, this additional processing time is preferable, since it yields more reliable trajectories than other simpler features that take less time to be computed and tracked, but can cause larger inaccuracies in the camera trajectory estimation or in the image registration process.

**Figure 4.** Examples of images of dataset 1, in (**a**–**c**), dataset 2, in (**d**–**f**), and dataset 3, in (**g**–**i**).

#### *3.2. Experiments and Results*

All experiments performed to get these first preliminary results were run offline, simulating the Multi-robot configuration with the visual data obtained at sea. Key images of each video sequence that forms the different datasets mentioned in the previous section were extracted, indexed, stored separately on a hard disk, and processed consecutively according to the algorithms exposed in this paper.

Successive field experiments showed that an overlap between consecutive images of 35% to 50% was necessary to obtain a robust visual odometry. On the other hand, reducing the number of images stored was also required in order to save as much memory space as possible. Consequently, a good trade-off between both requirements was obtained by selecting the keyframes of datasets 1 and 2 down-sampling the initial video frame rates to 1.1 fps, on average, and dataset 3 to 3 fps. 226 key images were extracted from *S*11 and 199 from *S*12. 152 key frames were extracted from the video sequence *S*21 and 57 from *S*22. Finally, a total of 400 key images were extracted from dataset 3, 200 belonging to *S*31 and 200 to *S*32.

Local SLAM is continuously executed for both trajectories in sequential steps of a predefined number of frames *N*; each local trajectory is accumulated and optimized every *N* frames, alternating both sessions every *N* frames. *N* is set differently for each dataset.

Figure 5 shows several samples of inter-session loop closings found by HALOC and confirmed by Algorithm 1.


**Figure 5.** (**a**–**d**) Multi-session loops between sequences *S*11 and *S*12. (**e**–**h**) Multi-session loops between sequences *S*21 and *S*22. (**i**–**l**) Multi-session loops between sequences *S*31 and *S*32. (**a**) closes a loop with (**b**), (**c**) closes a loop with (**d**), (**e**) closes a loop with (**f**), (**g**) closes a loop with (**h**), (**i**) closes a loop with (**j**), and (**k**) closes a loop with (**l**).

Once images and inter-session loops are available, the whole localization and mapping process starts. In other words, the sequence of actions is as follows:

	- (a) Start the algorithm of Section 2.2, building the state vector of each session, in steps of *N* consecutive frames, using the displacements included in each odometry file.
	- (b) For each newly gathered image (let us call it the query image), search for local loop closings among other images of the same dataset whose positions are near the query. This search is done only among the images gathered before the query.
	- (c) Optimize both local graphs according to Section 2.2.2.
	- (d) For each image of trajectory 2 (called *the query*), the algorithm searches for the 5 best HALOC loop closing candidates in trajectory 1. Each candidate, if any, is confirmed by means of Algorithm 1, and filtered out if the number of inliers is lower than the predefined threshold.
	- (e) Accumulate the number of inter-session true loop closings.
	- (f) When the number of accumulated inter-session loop closings is greater than a certain threshold, join both sessions in a single pose-based graph. That means transforming all members of the joined state vector into global poses and the corresponding graph nodes, associating with each node the corresponding image.

This simulation does not permit assessing anything related to execution times because, although the whole process works with real data, the Multi-robot configuration has been simulated, split into three different software packages of different natures. The purpose of these preliminary results is not to give an accurate set of quantitative and numerical results to assess the process in terms of execution time or trajectory accuracy. The aim of this section is to present the implementation of a new approach and a set of preliminary results that provide: (a) a proof of viability and feasibility of the solution, (b) a proof of its utility and suitability to manage, in a single map, two sessions of two different robots that operate simultaneously, in a simple way, (c) a qualitative proof of concept and, (d) the source code and a dataset to be tested, open to further improvements.

Obtaining a ground truth trajectory underwater is a challenging task, unless one can install an infrastructure of acoustic beacons or Long Baseline (LBL) systems, which is costly and complicated to run and manage, and imposes spatial restrictions on the motion of the robots. In our experiments, there is no ground truth and no possibility to get it. The planned trajectories, in the case of those performed with the AUV, cannot be used as a ground truth either, since they differ substantially from the ones that the AUV ends up performing (which is usual in underwater robotics). In consequence, we cannot compare the trajectories estimated by the system with another one that serves as reference. In our case, robustness has been qualitatively validated by two means: (a) comparing the resulting global maps with the mosaics obtained with BIMOS [54] and, (b) comparing the direct transforms between images that close loops, obtained with Algorithm 1, with the transforms between the graph nodes related to the same loop closing images. These two points have already been used and validated in previous pieces of work [46].

Figure 6a,b,d,e show, respectively, the trajectory of *S*11, *S*12, *S*21 and *S*22 estimated by the local SLAM procedure described in Section 2.2. Figure 6c,f show, respectively, the global graph obtained applying the Multi-robot SLAM procedure described in Section 2.3.

Figures 7 and 8 show four photo-mosaics corresponding to sequences *S*11, *S*12, *S*21 and *S*22. These photomosaics have been obtained using BIMOS [54], a mosaicing algorithm based on bags of binary words that already demonstrated a great performance in underwater environments [55,56].

**Figure 6.** (**a**) Trajectory *S*11, (**b**) Trajectory *S*12, (**c**) Multi-robot final graph from *S*11 and *S*12. (**d**) Trajectory *S*21, (**e**) Trajectory *S*22, (**f**) Multi-robot final graph from *S*21 and *S*22.

**Figure 7.** (**a**) Photo-mosaic of *S*11. (**b**) Photo-mosaic of *S*12.

The resulting mosaics have an associated implicit trajectory, which imposes the position of each image with respect to the origin of the mosaic system of coordinates. Due to the lack of any trajectory ground truth and the impossibility to get it, the mosaic is, to a certain extent, a qualitative reference to assess the quality of the resulting joined trajectories of Figure 6c,f, since BIMOS has already demonstrated its good performance on land and underwater. Notice how the mosaic of Figure 7b shows a montage very close to the SLAM trajectory of *S*11, and Figure 7a a mosaic fitting the SLAM trajectory of *S*12. It can be assumed that the quality of the single and joined graphs is similar to the quality of the mosaics obtained with BIMOS.

**Figure 8.** (**a**) Photo-mosaic of *S*21. (**b**) Photo-mosaic of *S*22.

If both mosaics are aligned by their left sides, where the marker is located, the result fits very well with the global graph. The same applies to the mosaics of Figure 8. The structure containing the two markers is the point joining both trajectories. Then, the alignment of both figures by the markers gives a result that also fits perfectly with the global graph of Figure 6.

Figure 9a,b show the local SLAM trajectory of *S*31 and *S*32, and Figure 9c shows the global graph estimated after joining both trajectories and applying the Multi-robot graph SLAM approach.

An illustrative video of the whole process involving the three datasets can be seen in [57]. The video shows, at the beginning, some sequences grabbed underwater and used to test our approach. Afterwards, it shows the whole process for the three datasets exposed: (1) The local SLAM for both separated trajectories, (2) the moment when both sessions are joined and converted into a single global graph, and (3) how the graph continues growing in different directions, each one corresponding to each trajectory involved in the Multi-robot mission. As mentioned in previous sections, the joined graph is optimized every time a set of inter-session loop closings are confirmed.

As this is a pure vision-based SLAM approach and no other sensorial input is included in the multi-localization process, the localization is first computed in pixel units. To find the relation between pixels and metric units, the known real dimensions of the markers used to establish the starting and end points of each trajectory (see pictures (a) and (e) of Figure 4) were related to the pixel dimensions of the markers in the images. These relations resulted in coefficients ranging between 0.0015 and 0.0019, depending on each video sequence.
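As a toy illustration (the marker size below is hypothetical, chosen only to land in the same order of magnitude as the reported coefficients), the coefficient can be obtained as:

```python
# Toy example of deriving the pixel-to-metre coefficient from a marker of
# known size visible at the start of each sequence (hypothetical numbers).
marker_side_m = 0.30            # real side length of the marker, in metres
marker_side_px = 180.0          # measured side length in the image, in pixels
metres_per_pixel = marker_side_m / marker_side_px    # ~0.0017 m/pixel
```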

Another way to verify the consistency of the resulting optimized global map is to compare the transform between two images (called *I*1 and *I*2) that close an inter-session loop, when it is obtained by two different means: (1) using Algorithm 1, and (2) computing ⊖*P*1 ⊕ *P*2 (the inverse of *P*1 composed with *P*2), where *P*1 is the global pose associated with the graph node corresponding to *I*1, and *P*2 is the global pose associated with the graph node corresponding to *I*2. The idea is illustrated in Figure 10: *P*1 and *P*2 are the global poses associated with two loop-closing nodes, and *T*<sub>*I*1*I*2</sub> is the direct transform between *P*1 and *P*2 obtained with the RANSAC-based algorithm applied directly on *I*1 and *I*2. In principle, if the graph is consistent, *T*<sub>*I*1*I*2</sub> has to be equivalent to the inverse of *P*1 composed with *P*2, which is the transform between *I*1 and *I*2 obtained by composing the global poses of the respective nodes of the loop-closing images.
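A minimal sketch of this consistency check for 2D poses (x, y, θ), using the standard pose compounding operations, is shown below; the node poses and the RANSAC estimate are hypothetical values used only for illustration.

```python
import numpy as np

def inv_compose(P1, P2):
    """Relative 2D transform from pose P1 to pose P2, i.e., the inverse of P1
    composed with P2, with poses given as (x, y, theta) in the global graph frame."""
    x1, y1, t1 = P1
    x2, y2, t2 = P2
    dx, dy = x2 - x1, y2 - y1
    c, s = np.cos(t1), np.sin(t1)
    # Express the displacement of P2 in the frame of P1.
    return (c * dx + s * dy, -s * dx + c * dy, (t2 - t1 + np.pi) % (2 * np.pi) - np.pi)

def transform_difference(T_graph, T_ransac):
    """Difference between the graph-derived and the RANSAC-derived transforms:
    translation modulus and absolute orientation error."""
    ex, ey = T_graph[0] - T_ransac[0], T_graph[1] - T_ransac[1]
    etheta = abs((T_graph[2] - T_ransac[2] + np.pi) % (2 * np.pi) - np.pi)
    return np.hypot(ex, ey), etheta

# Hypothetical node poses and RANSAC estimate, for illustration only.
P1, P2 = (10.0, 4.0, 0.1), (12.0, 5.5, 0.3)
T_ransac = (2.12, 1.28, 0.19)
print(transform_difference(inv_compose(P1, P2), T_ransac))
```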

**Figure 9.** (**a**) Trajectory *S*31, (**b**) Trajectory *S*32, (**c**) Multi-robot final graph from *S*31 and *S*32.

**Figure 10.** Transforms between nodes P1 and P2.

Tables 1 and 2 show some samples of quantitative results of inter-session loop closings, with the corresponding transforms calculated in both aforementioned ways. The first to fourth columns show, respectively, the numbers of the two images that close an inter-session loop and of the graph nodes with which each image is associated. Column I1 contains images of sessions *S*11 and *S*21, and column I2 contains images of sessions *S*12 and *S*22. The fifth and sixth columns indicate the 2D transform, in translation and rotation (*x*, *y*, *θ*), computed indirectly through the graph and directly using RANSAC (Algorithm 1). The units of these transforms are pixels and radians. The seventh column indicates the difference between both transforms, in modulus (meters) and orientation (radians). These samples indicate that: (a) for *S*11 and *S*12, the difference of transforms ranges between 0.15 mm and 1.8 cm in modulus and between 0.0089 rad (0.5°) and 0.05 rad (2.86°) in orientation, and (b) the difference for *S*21 and *S*22 ranges between 1.02 cm and 14.8 cm in modulus and between 0.0085 rad (0.5°) and 0.21 rad (12.03°) in orientation. These differences are totally acceptable, taking into account that there are errors inherent to the RANSAC transform estimation process, due to the possible (and usual) presence of inconsistent inliers, and differences (or errors) due to the successive graph optimizations, which also cause subsequent readjustments of all node poses.


**Table 1.** Comparison of Transforms between images of *S*11 and *S*12.

**Table 2.** Comparison of Transforms between images of *S*21 and *S*22.


#### *3.3. Some Considerations on the Data Reduction*

The length of the HALOC global descriptors used in the aforementioned tests is 384 floats, that is, a total of 1536 bytes per image, considering that in C++ a float needs 4 bytes of memory. This holds regardless of the image resolution and the number of SIFT features per image: no matter how big the image is or how many visual features are detected in it, the length of the hash remains constant. Conversely, the size of a color image with a very reduced resolution of 320 × 240 pixels would be 320 × 240 × 3 = 230,400 bytes. The memory needed for data storage is thus clearly reduced when using the HALOC hash instead of the original images. The set of image features must still be stored for every image, in any case, since they are needed in the later loop-closing confirmation processes. However, the computational cost of comparing two hashes to retrieve the best loop-closing candidates, just calculating the L1-norm of two vectors, is much lower than finding the best candidates with a brute-force recursive feature double-matching with RANSAC. From this point on, one can think of applying additional strategies to limit the communication between robots, complementing, for instance, the solution proposed in [35], where the images are sent only among robots that simultaneously view a common point. In this case, we introduce an additional layer to compress the information to be exchanged, since, instead of sending JPEG images, robots would send their respective hashes.
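The following sketch illustrates the storage sizes involved and the L1-norm candidate ranking described above; the random vectors are placeholders for real HALOC descriptors.

```python
import numpy as np

HASH_LEN = 384          # length of the HALOC global descriptor used in these tests
BYTES_PER_FLOAT = 4

hash_bytes = HASH_LEN * BYTES_PER_FLOAT     # 1536 bytes per image
image_bytes = 320 * 240 * 3                 # 230,400 bytes for a small colour image

def rank_loop_candidates(query_hash, stored_hashes, k=5):
    """Rank previously received hashes by L1 distance to the query hash and
    return the indices of the k most similar ones (loop-closing candidates)."""
    d = np.sum(np.abs(stored_hashes - query_hash), axis=1)   # L1-norm per stored hash
    return np.argsort(d)[:k]

# Illustration with random hashes (real ones would come from HALOC).
rng = np.random.default_rng(0)
stored = rng.random((100, HASH_LEN)).astype(np.float32)
query = stored[42] + 0.01 * rng.random(HASH_LEN).astype(np.float32)
print(hash_bytes, image_bytes, rank_loop_candidates(query, stored))
```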

#### *3.4. Sources Availability*

The source code for the odometry computation has been developed in Matlab and is available at [58]. The source code of the HALOC library is available at [59] for its C++ version and at [60] for its Python version.

The sources for the local SLAM, the map joining and the subsequent Multi-robot graph SLAM have also been developed in Matlab, and they are available to the community at [61]. The pose-based graph management has been programmed using the Matlab library for localization and pose estimation specifically addressed to mobile autonomous vehicles [62].

#### **4. Discussion, Conclusions and Future Work**

This paper presents a new approach to visual SLAM for Multi-robot configurations, based on joining, in a single pose-based graph, several trajectories of different robots which operate simultaneously in a common area of interest. The system finds loop closings between images of different robot trajectories by means of a hash-based methodology (HALOC) and uses them to add additional constraints to the global graph. As discussed above, the use of HALOC guarantees an important reduction in storage space, in the amount of data to be transferred and in the time dedicated to loop-closing detection, especially in centralized Multi-robot configurations. Using HALOC also supports a proper graph optimization, since it has already been validated underwater, showing excellent results in terms of success ratios in loop-closing detection.

The strategy for map joining comes from [46], but adapted to a Multi-robot configuration, which differs from a Multi-session case in some aspects. This strategy is simple, easy to replicate, effective and, more importantly, flexible enough to modulate the moment of map joining depending on the mission conditions and convenience. This implies a trade-off between the accuracy of the local maps before they are joined and the need to join all the trajectories as soon as possible in order to centralize the global localization of the whole robotic team in a single agent.

Preliminary experiments showed that applying the new approach to join multiple ongoing sessions online is feasible, and suggested a certain consistency and reliability in the results from a qualitative point of view.

Although, until now, we have focused our efforts exclusively on the estimation of the camera pose and trajectory, this algorithm has been designed to be applied on board a vehicle. Therefore, one priority ongoing task is testing this algorithm in a team of real vehicles operating at sea. To this end, a ROS [63] wrapper in C++ is currently being developed and tested.

So far, we have focused our efforts exclusively on the estimation of the camera pose and trajectory, assuming that navigation and control are solved issues. In fact, most of the existing research on SLAM makes the same assumption. However, the continuous re-estimation of the vehicle poses by the SLAM algorithm surely affects the control of the vehicles, because the control modules are fed with the poses and velocities. In addition, changes in control affect, in turn, the vehicle navigation. At the moment, the SLAM algorithm is completely decoupled from the control module, but once it is installed on a vehicle, the vehicle velocity and pose provided by our SLAM modules, together with the mission goal points, will be input to the navigation and control subsystems. Another possible line of research that is also under consideration is to make the goal points also depend on SLAM, in order to add exploration to the AUV capabilities.

Other future work includes:

(1) Extending the tests to additional environments with longer trajectories. (2) Extending the assessment of the approach by evaluating the performance of the SLAM pose corrections in the presence of additive Gaussian noise in the visual odometry, and by applying all the evaluation techniques employed in [46]. (3) Comparing with other Multi-robot software packages not yet tested underwater, such as DSLAM.

Matlab sources are available in a public repository, giving the scientific community the chance to test, replicate and also improve them.

**Author Contributions:** Both authors contributed equally to this work, including the Conceptualization, theoretical methodology, the implementation of the software, the validation with the datasets obtained in the sea, writing the original draft, and the final supervision. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is partially supported by Ministry of Economy and Competitiveness under contract DPI2017-86372-C3-3-R (AEI,FEDER,UE).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Underwater Pipe and Valve 3D Recognition Using Deep Learning Segmentation**

**Miguel Martin-Abadal \*, Manuel Piñar-Molina, Antoni Martorell-Torres, Gabriel Oliver-Codina and Yolanda Gonzalez-Cid**

Departament de Matemàtiques i Informàtica, Universitat de les Illes Balears, Carretera de Valldemossa Km. 7.5, 07122 Palma, Spain; manuel.pinar@uib.es (M.P.-M.); antonimartorelltorres@gmail.com (A.M.-T.); goliver@uib.es (G.O.-C.); yolanda.gonzalez@uib.es (Y.G.-C.)

**\*** Correspondence: miguel.martin@uib.es

Received: 10 December 2020; Accepted: 18 December 2020; Published: 23 December 2020

**Abstract:** During the past few decades, the need to intervene in underwater scenarios has grown due to the increasing necessity to perform tasks like underwater infrastructure inspection and maintenance or archaeology and geology exploration. In the last few years, the usage of Autonomous Underwater Vehicles (AUVs) has eased the workload and risks of such interventions. To automate these tasks, the AUVs have to gather the information of their surroundings, interpret it and make decisions based on it. The two main perception modalities used at close range are laser and video. In this paper, we propose the usage of a deep neural network to recognise pipes and valves in multiple underwater scenarios, using 3D RGB point cloud information provided by a stereo camera. We generate a diverse and rich dataset for the network training and testing, assessing the effect of a broad selection of hyperparameters and values. Results show *F1-scores* of up to 97.2% for a test set containing images with similar characteristics to the training set and up to 89.3% for a secondary test set containing images taken at different environments and with distinct characteristics from the training set. This work demonstrates the validity and robust training of the PointNet neural network in underwater scenarios and its applicability for AUV intervention tasks.

**Keywords:** point cloud segmentation; deep learning; pipe and valve recognition; underwater perception; computer vision

#### **1. Introduction**

During the past few decades, the interest in underwater intervention has grown exponentially as more often it is necessary to perform underwater tasks like surveying, sampling, archaeology exploration or industrial infrastructure inspection and maintenance of offshore oil and gas structures, submerged oil wells or pipeline networks, among others [1–5].

Historically, scuba diving has been the prevailing method of conducting the aforementioned tasks. However, performing these missions in a harsh environment like open water scenarios is slow, dangerous, and resource consuming. More recently, thanks to technological advances such as Remotely Operated Vehicles (*ROVs*) equipped with manipulators, deeper and more complex underwater scenarios are accessible for scientific and industrial activities.

Nonetheless, these *ROVs* have complex dynamics that make their piloting a difficult and error-prone task, requiring trained operators. In addition, these vehicles require a support vessel, which leads to expensive operational costs. To mitigate that, some research centres have started working towards intervention Autonomous Underwater Vehicles (*AUVs*) [6–8]. In addition, due to the complexity of Underwater Vehicle Manipulator Systems (UVMS), recent studies have been published towards their control [9,10].

Traditionally, when operating in unknown underwater environments, acoustic bathymetric maps are used to get a first identification of the environment. Once the bathymetric information is available, *ROVs* or *AUVs* can be sent to obtain more detailed information using short-distance sensors with higher resolution. The two main perception modalities used at close range are laser and video, thanks to their high resolution. They are used during the approach, object recognition and intervention phases. Existing solutions for all perception modalities are reviewed in Section 2.1.

The underwater environment is one of the most problematic in terms of sensing in general and in terms of object perception in particular. The main challenges of underwater perception include distortion in signals, light propagation artefacts like absorption and scattering, water turbidity changes or depth-dependent colour distortion.

Accurate and robust object detection, identification of target objects in different experimental conditions and pose estimation are essential requirements for the execution of manipulation tasks.

In this work, we propose a deep learning based approach to recognise pipes and valves in multiple underwater scenarios, using the 3D RGB point cloud information provided by a stereo camera, for real-time *AUV* inspection and manipulation tasks.

The remainder of this paper is structured as follows: Section 2 reviews related work on underwater perception and pipe and valve identification and highlights the main contributions of this work. Section 3 describes the adopted methodology and materials used in this study. The experimental results are presented and discussed in Section 4. Finally, Section 5 outlines the main conclusions and future work.

#### **2. Related Work and Contributions**

#### *2.1. State of the Art*

Even though computer vision is one of the most complete and used perception modalities in robotics and object recognition tasks, it has not been widely used in underwater scenarios. Light transmission problems and water turbidity affect the images' clarity and colouring and produce distortions; these factors have favoured the usage of other perception techniques.

Sonar sensing has been largely used for object localisation or environment identification in underwater scenarios [11,12]. In [13], Kim et al. present an AdaBoost based method for underwater object detection, while Wang et al. [14] propose a combination of non-local spatial information and frog leaping algorithm to detect underwater objects in sonar images. More recently, object detection deep learning techniques have started to be applied over sonar imaging in applications such as detection of underwater bodies in [15,16] or underwater mine detection in [17]. Sonar imaging also presents some drawbacks, as it tends to generate noisy images, losing texture information, and it is not capable of gathering colour information, which is useful in object recognition tasks.

Underwater laser scans are another perception technique used for object recognition, providing accurate 3D data. In [18], Palomer et al. present the calibration and integration of a laser scanner on an AUV for object manipulation. Himri et al. [19,20] use the same system to detect objects using a recognition and pose estimation pipeline based on point cloud matching. Inzartsev et al. [21] simulate the use of a single beam laser paired with a camera to capture its deformation and track an underwater pipeline. Laser scans are also affected by light transmission problems, have a very high initial cost and can only provide colourless point clouds.

The only perception modality that allows gathering colour information of the scene is computer vision. Furthermore, some of its aforementioned weaknesses can be mitigated by adapting to the environmental conditions, adjusting the operation range, calibrating the cameras or colour correcting the obtained images.

Traditional computer vision approaches have been used to detect and track submerged artifacts [22–25], cables [26–28] and even pipelines [28–31]. Some works are based on shape and texture descriptors [28,31] or template matching [32,33], while others exploit colour segmentation to find regions of interest in the images, which are later further processed [25,34].

On pipeline detection, Kallasi et al. in [35] and Razzini et al. in [7,36] present traditional computer vision methods combining shape and colouring information to detect pipes in underwater scenarios and later project them into point clouds obtained from stereo vision. In these works, the point cloud information is not used to assist the pipe recognition process.

The first found trainable system to detect pipelines is presented in [37] by Rekik et al., using the object's structure and content features along with a Support Vector Machine to classify between positive and negative underwater pipe image samples. Later, Nunes et al. introduced the application of a Convolutional Neural Network in [38] to classify up to five underwater objects, including a pipeline. In both of these works, no position of the object is given, but simply a binary output on the object's presence.

The application of computer vision approaches based on deep learning in underwater scenarios has been limited to the detection and pose estimation of 3D-printed objects in [39] or to the detection of living organisms like fishes [40] or jellyfishes [41]. The few research studies involving pipelines are restricted to damage evaluation [42,43] or valve detection for navigation [44], working with images taken from inside the pipelines. The only known work addressing pipeline recognition using deep learning is from Guerra et al. in [45], where a camera-equipped drone is used to detect pipelines in industrial environments.

To the best knowledge of the authors, there are no works applying deep learning techniques to underwater computer vision pipeline and valve recognition, nor implementing the usage of point cloud information in the detection process itself.

#### *2.2. Main Contributions*

The main contributions of this paper are the following:


#### **3. Materials and Methods**

This section presents an overview of the selected network; explains the acquisition, labelling and organisation of the data; and details the studied network hyperparameters, the validation process and the evaluation metrics.

#### *3.1. Deep Learning Network*

To perform the pipe and valve 3D recognition from point cloud segmentation, we selected the PointNet deep neural network [46]. This is a unified architecture for applications ranging from object classification and part segmentation to scene semantic segmentation. PointNet is a highly efficient and effective network, obtaining great metrics in both object classification and segmentation tasks in indoor and outdoor scenarios [46]. However, it has never been tested in underwater scenarios. The whole PointNet architecture is shown in Figure 1.

**Figure 1.** PointNet architecture. Reproduced from [46], with permission from publisher Hao Su, 2020.

In this paper, we use the *Segmentation Network* of PointNet. This network is an extension of the *Classification Network*, as can be seen in Figure 1. Some of its key features include:


The PointNet architecture takes point clouds as input and outputs a class label for each point. During training, the network is also fed with ground truth point clouds, where each point is labelled with its corresponding class. The labelling process is further detailed in Section 3.2.2.

As in the original PointNet implementation, we use a softmax cross-entropy loss along with an Adam optimiser. The decay rate for batch normalisation starts at 0.5 and is gradually increased to 0.99. In addition, we apply a dropout with keep ratio 0.7 on the last fully connected layer, before class score prediction. Other hyperparameter values, such as the learning rate or the batch size, are discussed, along with other parameters, in Section 3.3.

Furthermore, to improve the network performance, we implemented an early stopping strategy based on the work of Prechelt [47], ensuring that the training process stops at an epoch with minimum divergence between validation and training losses. This technique yields a more general and broad training, helping to avoid overfitting.
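A minimal sketch of such an early stopping rule, loosely inspired by Prechelt's generalization-loss criterion, is given below; the exact rule and the threshold value are our assumptions, not necessarily those used in the experiments.

```python
def generalization_loss(val_losses):
    """Relative increase (in %) of the current validation loss over the minimum
    validation loss observed so far (Prechelt-style GL criterion)."""
    return 100.0 * (val_losses[-1] / min(val_losses) - 1.0)

def should_stop(train_losses, val_losses, gl_threshold=5.0):
    """Stop when the training loss keeps decreasing while the validation loss has
    degraded past the threshold, i.e., when both losses start to diverge."""
    if len(val_losses) < 2:
        return False
    still_improving_on_train = train_losses[-1] < train_losses[-2]
    return still_improving_on_train and generalization_loss(val_losses) > gl_threshold
```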

#### *3.2. Data*

This subsection explains the acquisition, labelling and organisation of the data used to train and test the PointNet neural network.

#### 3.2.1. Acquisition

As mentioned in Section 3.1, PointNet uses point clouds for its training and inference. To obtain the point clouds, we set up a Bumblebee2 FireWire stereo rig [48] on an Autonomous Surface Vehicle (ASV) through a *Robot Operating System* (ROS) framework.

First, we calibrated the stereo rig both in fresh and salt water using the ROS package *image\_pipeline/camera\_calibration* [49,50]. It uses a chessboard pattern to obtain the camera, rectification and projection matrices, along with the distortion coefficients, for both cameras.

The acquired synchronised pairs of left-right images (resolution: 1024 × 768 pixels) are then processed by the *image\_pipeline/stereo\_image\_proc* ROS package [51], which calculates the disparity between pairs of images based on epipolar matching [52], obtaining the depth of each pixel from the stereo rig.

Finally, combining this depth information with the RGB colouring from the original images, we generate the point clouds. An example of the acquisition is pictured in Figure 2.
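The back-projection step can be sketched as follows, assuming rectified images and a standard pinhole model; the function and parameter names are ours and do not correspond to the API of the ROS packages mentioned above.

```python
import numpy as np

def disparity_to_pointcloud(disparity, left_rgb, f, baseline, cx, cy):
    """Back-project a disparity map into a coloured 3D point cloud.
    f: focal length in pixels, baseline: stereo baseline in metres,
    (cx, cy): principal point of the rectified left camera."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > 0                       # keep only pixels with a valid match
    z = f * baseline / disparity[valid]         # depth from disparity
    x = (u[valid] - cx) * z / f
    y = (v[valid] - cy) * z / f
    xyz = np.stack([x, y, z], axis=1)
    rgb = left_rgb[valid]                       # colour taken from the left image
    return np.hstack([xyz, rgb])                # N x 6 array: X, Y, Z, R, G, B
```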

**Figure 2.** Data acquisition process. (**a**) left and right stereo images, (**b**) disparity image, (**c**) point cloud.

3.2.2. Ground Truth Labelling

Ground truth annotations are manually built from the point clouds, where the points corresponding to each class are marked with a different label. The studied classes and their RGB labels are: *Pipe* (0, 255, 0), *Valve* (0, 0, 255) and *Background* (0, 0, 0). Figure 3 shows a couple of point clouds along with their corresponding ground truth annotations.
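For illustration, these colour-coded annotations can be turned into per-point integer class labels as in the following sketch; the 0/1/2 class indices are our assumption.

```python
import numpy as np

# RGB labels of the ground truth annotations, as listed above.
CLASS_COLOURS = {(0, 0, 0): 0,      # Background
                 (0, 255, 0): 1,    # Pipe
                 (0, 0, 255): 2}    # Valve

def colours_to_labels(gt_rgb):
    """Convert an N x 3 array of annotated point colours into per-point labels."""
    labels = np.zeros(len(gt_rgb), dtype=np.int64)
    for colour, cls in CLASS_COLOURS.items():
        labels[np.all(gt_rgb == colour, axis=1)] = cls
    return labels
```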

**Figure 3.** (**a**) Original point cloud; (**b**) ground truth annotations, points corresponding to pipes are marked in green; to valves, in blue; and to background, in black.

#### 3.2.3. Dataset Managing

Following the steps described in the previous section, we generated two datasets. The first one includes a total of 262 point clouds along with their ground truths. It was obtained in an artificial pool and contains diverse connections between pipes of different diameters and 2/3-way valves. It also contains other objects, such as cement blocks and ceramic vessels, always over a plastic sheeting simulating different textures. This dataset is split into a train-validation set (90% of the data, 236 point clouds) and a test set (10% of the data, 26 point clouds). The different combinations of elements and textures increase its diversity, helping to assure robustness in the training and reduce overfitting. From now on, we will refer to this dataset as the *Pool* dataset.

The second dataset includes a total of 22 point clouds and their corresponding ground truths. It was obtained in the sea and contains different pipe connections and valve positions. In addition, these 22 point clouds were obtained over diverse types of seabed, such as sand, rocks, algae, or a combination of them. This dataset is used to perform a secondary test, as it contains point clouds with different characteristics from the ones used to train and validate the network, allowing us to assess how well the network generalises its training to new conditions. From now on, we will refer to this dataset as the *Sea* dataset.

Figure 4 illustrates the dataset managing, while Figure 5 shows some examples of point clouds from both datasets.


**Figure 5.** Examples of point clouds from (**a**) *Pool* dataset and (**b**) *Sea* dataset.

#### *3.3. Hyperparameter Study*

When training a neural network, there are hyperparameters which can be tuned, changing some of the features of the network or the training process itself. We selected some of these hyperparameters and trained the network using different values to study their effect over its performance in underwater scenarios. The considered hyperparameters were:


The tested values for each hyperparameter are shown in Table 1. In total, 13 experiments are conducted: one using the hyperparameter values of the original PointNet implementation [46] (marked in bold in Table 1), and 12 more, each one fixing three of the aforementioned hyperparameters to their original values and using one of the other tested values for the fourth hyperparameter. In this way, the effect of each hyperparameter and its value on the performance is isolated.
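This one-factor-at-a-time design can be sketched as follows; the base and alternative values used here are hypothetical placeholders, the actual values being those listed in Table 1.

```python
# Hypothetical hyperparameter values, only to illustrate the experiment layout.
base = {"batch_size": 32, "learning_rate": 0.001,
        "block_stride": (1.0, 1.0), "num_points": 4096}
alternatives = {"batch_size": [16, 24],
                "learning_rate": [0.0005, 0.002, 0.005, 0.01],
                "block_stride": [(1.0, 0.75), (2.0, 2.0)],
                "num_points": [1024, 512, 256, 128]}

experiments = [("Base", dict(base))]
for name, values in alternatives.items():
    for value in values:
        cfg = dict(base)
        cfg[name] = value              # vary a single hyperparameter at a time
        experiments.append((f"{name}={value}", cfg))

print(len(experiments))                # 1 + 2 + 4 + 2 + 4 = 13 experiments
```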

**Table 1.** Tested hyperparameter values. Original values are marked in bold.


#### *3.4. Validation*

3.4.1. Validation Process

To ensure the robustness of the results generated for the 13 experiments, we used the 10-fold cross-validation method [53]. Using this method, the train-validation set of the *Pool* dataset is split into ten equally sized subsets. The network is trained ten times, each time using a different subset for validation (23 point clouds) and the nine remaining subsets for training (213 point clouds), generating ten models which are tested against both the *Pool* and *Sea* test sets. Finally, each experiment's performance is computed as the mean of the results of its 10 cross-validation models. This method reduces the variability of the results, as these are less dependent on the selected training and validation subsets, therefore providing a more accurate performance estimation. Figure 6 depicts the k-fold cross-validation technique applied to the dataset managing described in Section 3.2.3.
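A sketch of this validation scheme using scikit-learn's KFold is shown below; train_model and evaluate are placeholders standing in for the actual PointNet training and evaluation routines.

```python
import numpy as np
from sklearn.model_selection import KFold

def train_model(train_ids):
    """Placeholder for training PointNet on the given point clouds."""
    return {"trained_on": train_ids}

def evaluate(model, test_ids):
    """Placeholder for evaluating a trained model; returns a dummy score."""
    return 0.0

indices = np.arange(236)            # train-validation point clouds of the Pool dataset
pool_test_ids = np.arange(26)       # primary test set, kept apart from the folds

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(indices):
    # Each fold: ~213 clouds for training, ~23 for validation.
    model = train_model(indices[train_idx])
    fold_scores.append(evaluate(model, pool_test_ids))

experiment_score = float(np.mean(fold_scores))   # mean over the 10 cross-validation models
```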

**Figure 6.** Implementation of the 10-fold cross-validation method.

#### 3.4.2. Evaluation Metrics

To evaluate a model's performance, we make a point-wise comparison between its predictions and the corresponding ground truth annotations, generating a multi-class confusion matrix. This confusion matrix indicates, for each class: the number of points correctly identified as belonging to that class, *True Positives* (TP), and as not belonging to it, *True Negatives* (TN); the number of points misclassified as the studied class, *False Positives* (FP); and the number of points belonging to that class but misclassified as another one, *False Negatives* (FN). Finally, the TP, FP and FN values are used to calculate the *Precision*, *Recall* and *F1-score* for each class, following Equations (1)–(3):

$$Precision = \frac{TP}{TP + FP},\tag{1}$$

$$Recall = \frac{TP}{TP + FN},\tag{2}$$

$$F1\text{-}score = \frac{2 \times Recall \times Precision}{Recall + Precision}.\tag{3}$$
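These metrics can be computed directly from the multi-class confusion matrix, for instance as in the following sketch; the row/column convention and the example values are ours.

```python
import numpy as np

def per_class_metrics(conf):
    """Precision, recall and F1-score per class from a square confusion matrix,
    with rows = ground truth classes and columns = predicted classes."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp      # predicted as the class but belonging to another
    fn = conf.sum(axis=1) - tp      # belonging to the class but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

# Illustrative confusion matrix for (background, pipe, valve); values are made up.
conf = [[9800,  60,  40],
        [  50, 930,  20],
        [  30,  25, 445]]
print(per_class_metrics(conf))
```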

Additionally, the mean time that a model takes to perform the inference of a point cloud is calculated. This metric is very important, as it defines the frequency at which information is provided to the system. In underwater applications, it would directly affect the agility and responsiveness of the AUV that this network could be integrated in, having an impact on the final operation time.

#### **4. Experimental Results and Discussion**

This section reports the performance obtained for each experiment over the *Pool* and *Sea* test sets and discusses the effect of each hyperparameter over it. The notation used to name each experiment is as follows: "Base" for the experiment conducted using the original hyperparameter values, marked in bold in Table 1; the other experiments are denoted by an abbreviation of the modified hyperparameter ("Batch" for batch size, "Lr" for learning rate, "BS" for block-stride and "Np" for number of points) followed by the value used for that hyperparameter in that experiment. For instance, experiment *Batch 24* uses all the original hyperparameter values except for the batch size, which in this case is 24.

#### *4.1. Pool Dataset Results*

Table 2 shows the *F1-scores* obtained for the studied classes and their mean for all experiments when evaluated over the *Pool* test set. The mean inference time for each experiment is shown in Figure 7.


The results presented in Table 2 show that all experiments achieved a mean *F1-score* greater than 95.5%, with the highest value of 97.2% for the experiment *BS 1\_075*, which uses a stride smaller than its block size, overlapping information between blocks. Considering the mean *F1-score* figures for all experiments, it is safe to say that no hyperparameter represented a major shift in the network behaviour.

Looking at the metrics presented by the best performing experiment for each class, it can be seen that the *Pipe* class achieved an *F1-score* of 97.1%, outperforming other state-of-the-art methods for underwater pipe segmentation: [35]—traditional computer vision algorithms over 2D underwater images achieving an *F1-score* of 94.1%, [7]—traditional computer vision algorithms over 2D underwater images achieving a mean *F1-score* over three datasets of 88.0%, and [45]—a deep learning approach for 2D drone imagery achieving a pixel-wise accuracy of 73.1%. For the valve class, the *BS 1\_075* experiment achieved an *F1-score* of 94.9%, this being a more challenging class due to its complex geometry. As far as the authors know, no comparable work on underwater valve detection has been identified. Finally, for the more prevalent *Background* class, the best performing experiment achieved an *F1-score* of 99.7%.

The results on mean inference time presented in Figure 7 show that the batch size and learning rate hyperparameter values have little or no influence on the inference time, as their values are very similar to the one obtained in the *Base* experiment. On the contrary, the block and stride size highly affect the inference time: the bigger the information block or the stride between blocks, the faster the network can analyse a point cloud, and vice versa. Finally, the maximum number of allowed points per block also has a direct impact on the inference time: the lower it is, the faster the network can analyse a point cloud, as it becomes less dense. The time analysis was carried out on a computer with the following specs—processor: Intel i7-7700, RAM: 16 GB, GPU: NVIDIA GeForce GTX 1080.

Taking into account both metrics, *BS 1\_075* presented the best *F1-score* but also the highest inference time. In this experiment, the network uses a small block size and stride, being able to analyse the data and extract its features better, at the cost of taking longer. The hyperparameter values of this experiment are a good fit for a system in which quick responsiveness to changes and a high frequency of information are not a priority, allowing the recognition performance to be maximised.

On the other hand, experiments such as *BS 2\_2* or *Np 1024*, *512*, *256*, *128* were able to maintain very high *F1-scores* while significantly reducing the inference time. The hyperparameter values tested in these experiments are a good fit for more agile systems that need a higher frequency of information and responsiveness to changes.

Figure 8 shows some examples of original point clouds from the *Pool* test set along with their corresponding ground truth annotations and network predictions.

**Figure 8.** Qualitative results for the *Pool* test set. (**a**) original point cloud, (**b**) ground truth annotations, (**c**) network prediction.

#### *4.2. Sea Dataset Results*

Table 3 shows the *F1-scores* obtained for the studied classes and their mean for all experiments when evaluated over the *Sea* test set. The mean inference time for each experiment is shown in Figure 9.

#### **Table 3.** *Sea* test set *F1-scores*.


**Figure 9.** *Sea* test set mean inference time.

The results presented in Table 3 show that all experiments achieved a mean *F1-score* greater than 84.9% with the highest value of 89.3% for the experiment *Batch 16*. On average, the mean *F1-score* was around 9% lower than for the *Pool* test set. Even so, all experiments maintained high *F1-scores*. Again, the *F1-scores* of the *Pipe* and *Valve* classes are relatively lower than for the *Background* class. Even though the *Sea* test set is more challenging, as it contains unseen pipe and valve connections and environment conditions, the network was able to generalise its training and avoid overfitting.

The results on mean inference time presented in Figure 9 show that the mean inference times for the *Sea* test set are proportionally lower than those for the *Pool* test set, for all experiments. This occurs because the *Sea* test set contains smaller point clouds with fewer points.

Figure 10 shows some examples of original point clouds from the *Sea* test set along with their corresponding ground truth annotations and network predictions.

**Figure 10.** Qualitative results for the *Sea* test set. (**a**) original point cloud; (**b**) ground truth annotations; (**c**) network prediction.

#### **5. Conclusions and Future Work**

This work studied the implementation of the PointNet deep neural network in underwater scenarios to recognise pipes and valves from point clouds. First, two datasets of point clouds were gathered, providing enough data for the training and testing of the network. From these, a train-validation set and two test sets were generated: a primary test set with similar characteristics to the training data and a secondary one containing unseen pipe and valve connections and environment conditions to test the network's training generalisation and overfitting. Then, diverse hyperparameter values were tested to study their effect on the network performance, both in the recognition task and in inference time.

Results from the recognition task concluded that the network was able to identify pipes and valves with high accuracy for all experiments in both *Pool* and *Sea* test sets, reaching *F1-scores* of 97.2% and 89.3%, respectively. Regarding the network inference time, results showed that it is highly dependent on the size of the information block and its stride, and on the point cloud density.

From the performed experiments, we obtained a range of models covering different trade-offs between detection performance and inference time, enabling the network's implementation in a wider spectrum of systems, adapting to their detection and computational cost requirements. The *BS 1\_075* experiment presented metrics that fit a slower, more static system, while experiments like *BS 2\_2* or *Np 1024*, *512*, *256*, *128* are a good fit for more agile and dynamic systems.

The implementation of the PointNet network in underwater scenarios presented some challenges, like ensuring its recognition performance when trained with point clouds obtained from underwater images, and its suitability to be integrated on an AUV due to its computational cost. With the results obtained in this work, we have demonstrated the validity of the PointNet deep neural network to detect pipes and valves in underwater scenarios for AUV manipulation and inspection tasks.

The datasets and code, along with one of the *Base* experiment trained models, are publicly available at http://srv.uib.es/3d-pipes-1/ (UIB-SRV-3D-pipes) for the scientific community to test or replicate our experiments.

Further steps need to be taken in order to achieve an underwater object localisation and positioning for ROV and AUV intervention using the object recognition presented in this work. We propose the following future work:


**Author Contributions:** Conceptualisation, G.O.-C. and Y.G.-C.; methodology, M.M.-A.; software, M.M.-A., M.P.-M. and A.M.-T.; validation, M.M.-A.; investigation, M.M.-A. and M.P.-M.; resources, G.O.-C. and Y.G.-C.; data curation, M.M.-A., M.P.-M. and A.M.-T.; writing—original draft preparation, M.M.-A. and M.P.-M.; writing—review and editing, M.M.-A., M.P.-M., A.M.-T., G.O.-C. and Y.G.-C.; supervision, Y.G.-C.; project administration, G.O.-C. and Y.G.-C.; funding acquisition, G.O.-C. and Y.G.-C. All authors have read and agreed to the published version of the manuscript.

**Funding:** Miguel Martin-Abadal was supported by the Ministry of Economy and Competitiveness (AEI,FEDER,UE), under contract DPI2017-86372-C3-3-R. Gabriel Oliver-Codina was supported by Ministry of Economy and Competitiveness (AEI,FEDER,UE), under contract DPI2017-86372- C3-3-R. Yolanda Gonzalez-Cid was supported by the Ministry of Economy and Competitiveness (AEI,FEDER,UE), under contracts TIN2017-85572-P and DPI2017-86372-C3-3-R; and by the Comunitat Autonoma de les Illes Balears through the Direcció General de Política Universitaria i Recerca with funds from the Tourist Stay Tax Law (PRD2018/34).

**Data Availability Statement:** Publicly available datasets were analysed in this study. This data can be found here: http://srv.uib.es/3d-pipes-1/.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

