### 4.1. Obstacle Detection

The detection-recognition methodology can be summarized as follows:

(a) Saliency map estimation based on human eye fixation from the input RGB image, using a GAN (Section 4.1.1).

(b) Risk assessment of the depicted regions based on the corresponding depth map, using fuzzy membership functions (Section 4.1.2).

(c) Aggregation of the saliency and risk maps into obstacle maps indicating probable obstacle regions, followed by ground plane removal (Section 4.1.2).

(d) Obstacle recognition using a deep learning model based on probable obstacle regions obtained in Step (c).

#### 4.1.1. Human Eye Fixation Estimation

The saliency maps used in this work are generated by a GAN [47]. The generated saliency maps derive from human eye fixation points and thus make the significance of a region in a scene more intuitive. Such information can be exploited for the obstacle detection procedure and, at the same time, enhance the interpretability of the methodology. Additionally, the machine learning aspect enables the extensibility of the methodology, since it can be trained with additional eye fixation data collected from individuals during their navigation through rough terrains. An example of the saliency maps estimated from a given image can be seen in Figure 4. Since the model is trained on human eye fixation data, it identifies as salient those regions in the image on which the attention of a human would be focused. As can be observed in Figure 4, in the first image, the most salient region corresponds to the fire extinguisher cabinet; in the second image, to the people on the left side; and in the last image, to the elevated ground and the tree branch.

**Figure 4.** Examples of the generated saliency maps given an RGB image. (**a**) Input RGB images. (**b**) Respective generated saliency maps.

The GAN training utilizes two different CNN models, namely, a discriminator and a generator. During the training, the generator learns to generate imagery related to a task, and the discriminator assists in optimizing the resemblance of the generated images to the target images. In our case, the target data are composed of visual saliency maps based on human eye tracking data.

The generator architecture is a VGG-16 [40] encoder-decoder model. The encoder is architecturally identical to VGG-16 without its fully connected layers and is used to create a latent representation of the input image. The encoder weights are initialized by training the model on the ImageNet dataset [48]. During training, the encoder weights were not updated, except for those of the last two convolutional blocks.

The decoder has the same architectural structure as the encoder network, except that the layers are placed in reverse order and the max pooling layers are replaced with up-sampling layers. To generate the saliency map, the decoder has an additional 1 × 1 convolutional layer at the output, with sigmoidal activation. The decoder weights were initialized randomly. The generator accepts an RGB image $I_{RGB}$ as input and generates a saliency map that resembles the human eye fixation on that $I_{RGB}$.

The discriminator of the GAN has a simpler architecture. It consists of 3 × 3 convolutional layers combined with 3 max pooling layers, followed by 3 Fully Connected (FC) layers. The Rectified Linear Unit (ReLU) and hyperbolic tangent (tanh) functions are deployed as activation functions for the convolutional and FC layers, respectively. The only exception is the last FC layer, where the sigmoid activation function is used. The architecture of the GAN generator network is illustrated in Figure 5.

**Figure 5.** Illustration of the generator architecture. The generator takes as input an RGB image *IRGB* and outputs a saliency map based on human eye fixation.
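To make the architecture description concrete, the following is a minimal PyTorch sketch of the two networks. It assumes torchvision's pretrained VGG-16 for the encoder and a 4-channel (RGB + saliency) discriminator input; the decoder channel widths, the discriminator widths, and the 192 × 256 input resolution are illustrative assumptions rather than the exact configuration used here.

```python
import torch.nn as nn
from torchvision import models

class SaliencyGenerator(nn.Module):
    """VGG-16 encoder-decoder generator: RGB in, saliency map in [0, 1] out."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.encoder = vgg.features              # VGG-16 conv blocks, no FC layers
        # Freeze the encoder except its last two convolutional blocks
        for p in self.encoder[:17].parameters():
            p.requires_grad = False

        def up_block(in_ch, out_ch, n_convs):
            # Mirrored VGG block: up-sampling replaces max pooling
            layers = [nn.Upsample(scale_factor=2, mode='nearest')]
            for i in range(n_convs):
                layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                           nn.ReLU(inplace=True)]
            return layers

        self.decoder = nn.Sequential(
            *up_block(512, 512, 3), *up_block(512, 256, 3),
            *up_block(256, 128, 3), *up_block(128, 64, 2), *up_block(64, 64, 2),
            nn.Conv2d(64, 1, kernel_size=1),      # extra 1x1 conv at the output
            nn.Sigmoid(),                         # sigmoidal activation
        )

    def forward(self, rgb):                       # rgb: (N, 3, H, W), H and W divisible by 32
        return self.decoder(self.encoder(rgb))

class SaliencyDiscriminator(nn.Module):
    """3x3 convolutions with 3 max-pooling stages, then 3 FC layers:
    ReLU on convolutions, tanh on FC layers, sigmoid on the last layer."""
    def __init__(self, in_ch=4, h=192, w=256):    # RGB + saliency input (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * (h // 8) * (w // 8), 100), nn.Tanh(),
            nn.Linear(100, 2), nn.Tanh(),
            nn.Linear(2, 1), nn.Sigmoid(),        # real/fake probability
        )

    def forward(self, rgb_and_saliency):
        return self.classifier(self.features(rgb_and_saliency))
```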

#### 4.1.2. Uncertainty-Aware Obstacle Detection

In general, an object that interferes with the safe navigation of a person can be perceived as salient. Considering this, the location of an obstacle is likely to be in regions of a saliency map that indicate high importance, i.e., regions with high intensities. A saliency map produced by the model described in Section 4.1.1 can therefore be treated as a weighted region of interest in which an obstacle may be located, with high-intensity regions indicating a high probability of the presence of an object of interest. Among all the salient regions in the saliency map, we need to identify those that may pose a threat to the person navigating in the scenery depicted in $I_{RGB}$. Thus, we follow an approach where both the saliency map and a depth map derived from an RGB-D sensor are used for the risk assessment. The combination of the saliency and depth maps is achieved with the utilization of Fuzzy Sets [49].

For assessing the risk, objects/areas that are close to the VCP navigating in an area and are salient with regard to the human gaze may pose a certain degree of threat to the VCP. Therefore, as a first step, the regions that are within a certain range of the navigating person need to be extracted, so that they can be determined as threatening. Hence, we consider a set of 3 fuzzy sets, namely, $R_1$, $R_2$, and $R_3$, describing three different risk levels, which can be described with the linguistic values of high, medium, and low risk, respectively. Each of the fuzzy sets $R_1$, $R_2$, and $R_3$ represents a different degree of risk, and their universe of discourse is the range of depth values of a depth map. Regarding the fuzzy aspect of these sets and taking into consideration the uncertainty in the risk assessment, there is an overlap between the fuzzy sets describing low and medium risk, and between those describing medium and high risk. The fuzzy sets $R_1$, $R_2$, and $R_3$ are described by the membership functions $r_i(z)$, $i$ = 1, 2, 3, where $z \in [0, \infty)$. The membership functions are illustrated in Figure 6c.

**Figure 6.** Membership functions of fuzzy sets used for the localization of objects in the 3D space using linguistic variables. (**a**) Membership functions for far left ($h_1$), left ($h_2$), central ($h_3$), right ($h_4$), and far right ($h_5$) positions on the horizontal axis. (**b**) Membership functions for upper ($v_1$), central ($v_2$), and bottom ($v_3$) positions on the vertical axis. (**c**) Membership functions for low ($r_3$), medium ($r_2$), and high risk ($r_1$) based on the distance of the user from an obstacle.
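As an illustration of the fuzzification, the sketch below implements three overlapping membership functions over depth. The breakpoints (in meters) are assumptions chosen to reproduce the overlapping shapes of Figure 6c, not the values used in the experiments.

```python
import numpy as np

# Illustrative membership functions for the risk fuzzy sets R1 (high),
# R2 (medium), and R3 (low risk). Depth z is in meters; the breakpoints
# are assumptions for this sketch, not values reported in the text.
def r1(z):
    """High risk: full membership up to 1 m, fading out by 2 m."""
    return np.clip(2.0 - np.asarray(z, dtype=float), 0.0, 1.0)

def r2(z):
    """Medium risk: trapezoid between 1 m and 4 m, overlapping r1 and r3."""
    z = np.asarray(z, dtype=float)
    return np.clip(np.minimum(z - 1.0, 4.0 - z), 0.0, 1.0)

def r3(z):
    """Low risk: rising from 3 m, full membership beyond 4 m."""
    return np.clip(np.asarray(z, dtype=float) - 3.0, 0.0, 1.0)
```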

A major aspect of an obstacle detection methodology is the localization of obstacles and the description of their position in a manner that can be communicated to and easily perceived by the user. In our system, the description of the spatial location of an object is performed using linguistic expressions. We propose an approach based on fuzzy logic to interpret the obstacle position using linguistic expressions (linguistic values) represented by fuzzy sets. Spatial localization of an obstacle in an image can be achieved by defining 8 additional fuzzy sets. More specifically, we define 5 fuzzy sets for the localization along the horizontal axis of the image, namely, $H_1$, $H_2$, $H_3$, $H_4$, and $H_5$, corresponding to the far left, left, central, right, and far right portions of the image. Additionally, to express the location of the obstacle along the vertical axis of the image, we define 3 fuzzy sets, namely, $V_1$, $V_2$, and $V_3$, denoting the upper, central, and bottom portions of the image. The respective membership functions of these fuzzy sets are $h_j(x)$, $j$ = 1, 2, 3, 4, 5, and $v_i(y)$, $i$ = 1, 2, 3, where $x, y \in [0, 1]$ are normalized image coordinates. An illustration of these membership functions can be seen in Figure 6.
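The location sets can be sketched in the same way over the normalized image coordinates; again, the breakpoints are assumptions approximating the overlapping shapes of Figure 6a,b.

```python
import numpy as np

# Illustrative membership functions for the horizontal sets H1-H5 and the
# vertical sets V1-V3 over normalized image coordinates x, y in [0, 1].
def _tri(t, a, b, c):
    """Triangular membership peaking at b, zero outside (a, c)."""
    t = np.asarray(t, dtype=float)
    return np.clip(np.minimum((t - a) / (b - a), (c - t) / (c - b)), 0.0, 1.0)

h = [
    lambda x: np.clip((0.25 - np.asarray(x, float)) / 0.25, 0.0, 1.0),  # h1: far left
    lambda x: _tri(x, 0.00, 0.25, 0.50),                                # h2: left
    lambda x: _tri(x, 0.25, 0.50, 0.75),                                # h3: central
    lambda x: _tri(x, 0.50, 0.75, 1.00),                                # h4: right
    lambda x: np.clip((np.asarray(x, float) - 0.75) / 0.25, 0.0, 1.0),  # h5: far right
]
v = [
    lambda y: np.clip((0.33 - np.asarray(y, float)) / 0.33, 0.0, 1.0),  # v1: upper
    lambda y: _tri(y, 0.00, 0.50, 1.00),                                # v2: central
    lambda y: np.clip((np.asarray(y, float) - 0.67) / 0.33, 0.0, 1.0),  # v3: bottom
]
```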

Some obstacles, such as tree branches, may be in close proximity to the individual with respect to depth, but at a height such that safe passage is not affected. Thus, a personalization step was introduced to the methodology to eliminate such false alarms. The personalization aspect and the minimization of false positive obstacle detections are implemented through an additional fuzzy set $P$, addressing the risk an obstacle poses to a person with respect to height. For the description of this fuzzy set $P$, we define a two-dimensional membership function $p(h_o, h_u)$, where $h_o$ and $h_u$ are the heights of the obstacle and the user, respectively. The personalization methodology is described in Section 4.1.3.

For the risk assessment, once the membership functions describing each fuzzy set have been defined, the next step is the creation of 3 risk maps $R_M^i$. The risk maps $R_M^i$ derive from the responses of the membership functions $r_i(z)$ and are formally expressed as:

$$R_M^i(x, y) = r_i(D(x, y)) \tag{1}$$

where $D$ is the depth map that corresponds to an RGB image $I_{RGB}$. Using all the risk assessment membership functions, namely $r_1$, $r_2$, and $r_3$, 3 different risk maps, $R_M^1$, $R_M^2$, and $R_M^3$, are derived. Each of these risk maps depicts regions that may pose a different degree of risk to the VCP navigating in the area. In detail, risk map $R_M^1$ represents regions that may pose a high degree of risk, $R_M^2$ a medium degree of risk, and $R_M^3$ a low degree of risk. A visual representation of these maps can be seen in Figure 7. Figure 7b–d illustrates the risk maps derived from the responses of the $r_1$, $r_2$, and $r_3$ membership functions on the depth map of Figure 7a. Brighter pixel intensities represent higher participation in the respective fuzzy set, while darker pixel intensities represent lower participation.

**Figure 7.** Example of $R_M^i$ creation. (**a**) Depth map $D$, where lower intensities correspond to closer distances; (**b**) visual representation of $R_M^1$ representing regions of high risk; (**c**) $R_M^2$ representing regions of medium risk; (**d**) $R_M^3$ depicting regions of low risk. Higher intensities in (**b**–**d**) correspond to higher participation in the respective fuzzy set. All images have been normalized for better visualization.
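Equation (1) amounts to a pixel-wise application of each membership function to the depth map, which is a one-liner with the vectorized functions sketched above; the dummy depth map below is only for demonstration.

```python
import numpy as np

# Equation (1) applied pixel-wise: each risk map is the response of a
# membership function on the depth map D (assumed float32, in meters).
# r1, r2, r3 are the membership functions sketched above.
def risk_maps(D):
    return [r(D) for r in (r1, r2, r3)]   # [R_M^1, R_M^2, R_M^3]

# Example with a dummy depth map:
D = np.random.uniform(0.3, 6.0, size=(480, 640)).astype(np.float32)
R1_M, R2_M, R3_M = risk_maps(D)
```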

In the proposed methodology, the obstacle detection combines the risk assessed from the depth maps with the degree of saliency obtained from the GAN described in the previous subsection. The saliency map $S_M$ that is produced from a given $I_{RGB}$ is aggregated with each risk map $R_M^i$, $i$ = 1, 2, 3, using the fuzzy AND (∧) operator (Gödel t-norm) [50], formally expressed as:

$$F_1 \wedge F_2 = \min(F_1(x, y), F_2(x, y)) \tag{2}$$

In Equation (2), $F_1$ and $F_2$ denote two generic 2D fuzzy maps with values within the [0, 1] interval, and $x$, $y$ are the coordinates of each value of the 2D fuzzy map. The risk maps $R_M^i$ are, by definition, fuzzy 2D maps, since they derive from the responses of the membership functions $r_i$ on a depth map. The saliency map $S_M$ can be considered a fuzzy map whose values represent the degree of participation of a given pixel in the salient domain. Therefore, it can be combined with the risk maps using the fuzzy AND operator to produce a new fuzzy 2D map $O_M^i$ as follows:

$$O_M^i = R_M^i \wedge S_M \tag{3}$$

The non-zero values of the 2D fuzzy map $O_M^i$ (obstacle map) at each coordinate $(x, y)$ indicate the location of an obstacle and express the degree of participation in the risk domain of the respective $R_M^i$. Figure 8d illustrates the obstacle map produced by applying the fuzzy AND operator between the high-risk map $R_M^1$ (Figure 8b) and the saliency map $S_M$ (Figure 8c). Higher pixel values of $O_M^i$ indicate higher participation in the respective risk category and a higher probability that an obstacle is located there.

**Figure 8.** Example of the aggregation process between the saliency map $S_M$ and the high-risk map $R_M^1$. (**a**) Original $I_{RGB}$ used for the generation of the saliency map $S_M$; (**b**) high-risk map $R_M^1$ used in the aggregation; (**c**) saliency map $S_M$ based on the human eye fixation on image (**a**); (**d**) the aggregation product using the fuzzy AND operator between images (**b**) and (**c**).
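With fuzzy maps stored as arrays in [0, 1], the Gödel t-norm of Equations (2) and (3) reduces to an element-wise minimum, as in the sketch below; it assumes the saliency map has already been rescaled to [0, 1] and resized to the resolution of the risk maps.

```python
import numpy as np

# Equations (2)-(3): element-wise fuzzy AND (Gödel t-norm) between each
# risk map and the saliency map S_M.
def obstacle_maps(R_maps, S_M):
    return [np.minimum(R_i, S_M) for R_i in R_maps]  # [O_M^1, O_M^2, O_M^3]
```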

Theoretically, the $O_M^i$ can be directly used to detect obstacles posing different degrees of risk to the VCP navigating in the area. However, if the camera is oriented towards the ground, the ground plane can often be falsely perceived as an obstacle. Consequently, a refinement step is needed to optimize the obstacle detection results and reduce the occurrence of false alarms. Therefore, a simple but effective approach for ground plane extraction is adopted.

The ground plane has a distinctive gradient representation along the $y$-axis in depth maps, which can be exploited in order to remove it from the $O_M^i$. As a first step, the gradient of the depth map $D$ is estimated by:

$$\nabla D = \left(\frac{\partial D}{\partial x}, \frac{\partial D}{\partial y}\right) \tag{4}$$

A visual representation of a difference map $\frac{\partial D}{\partial y}$ normalized to the [0, 255] interval can be seen in Figure 9. As can be seen, the regions corresponding to the ground have smaller differences than the rest of the depth map. In the next step, a basic morphological gradient $g$ [51] is applied on the gradient of $D$ along the $y$ direction, $\frac{\partial D}{\partial y}$. The basic morphological gradient is the difference between the dilation and the erosion of $\frac{\partial D}{\partial y}$ given an all-ones kernel $k_{5 \times 5}$:

$$g\left(\frac{\partial D}{\partial y}\right) = \delta_{k_{5 \times 5}}\left(\frac{\partial D}{\partial y}\right) - \varepsilon_{k_{5 \times 5}}\left(\frac{\partial D}{\partial y}\right) \tag{5}$$

where $\delta$ and $\varepsilon$ denote the operations of dilation and erosion, and their subscripts indicate the kernel used. In contrast to the usual gradient of an image, the basic morphological gradient $g$ corresponds to the maximum variation in an elementary neighborhood rather than a local slope. The morphological gradient is followed by consecutive operations of erosion and dilation with a kernel $k_{5 \times 5}$. As can be noticed in Figure 9c, the basic morphological filter $g$ gives higher responses on non-ground regions, and thus, the subsequent operations of erosion and dilation are able to eliminate the ground regions quite effectively. The product of these consecutive operations is a ground removal mask $G_M$, which is then multiplied with $O_M^i$, setting the values corresponding to the ground to zero. This ground removal approach has been experimentally proven to be sufficient (Section 5) to eliminate the false identification of the ground as an obstacle. A visual representation of the ground mask creation and the ground removal can be seen in Figures 9 and 10, respectively.

**Figure 9.** Example of the creation steps of $G_M$. (**a**) Depth map $D$, normalized for better visualization; (**b**) visual representation of the difference map $\Delta_M$; (**c**) difference map $\Delta_M$ after the application of the basic morphological gradient; (**d**) the final ground removal mask $G_M$.
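A possible OpenCV realization of the ground-mask pipeline of Equations (4) and (5) is sketched below. The text fixes only the 5 × 5 all-ones kernel and the order of operations; the binarization threshold and the number of erosion/dilation iterations are assumptions.

```python
import cv2
import numpy as np

# Sketch of the ground-mask construction of Equations (4)-(5).
def ground_mask(D, thresh=0.1, iters=3):
    k = np.ones((5, 5), np.uint8)
    dDdy = np.gradient(D.astype(np.float32), axis=0)       # ∂D/∂y, Eq. (4)
    dDdy = cv2.normalize(np.abs(dDdy), None, 0.0, 1.0, cv2.NORM_MINMAX)
    g = cv2.morphologyEx(dDdy, cv2.MORPH_GRADIENT, k)      # dilation - erosion, Eq. (5)
    mask = (g > thresh).astype(np.uint8)                   # high response = non-ground
    mask = cv2.erode(mask, k, iterations=iters)            # suppress speckle
    G_M = cv2.dilate(mask, k, iterations=iters)            # restore object extent
    return G_M                                             # 1 on non-ground, 0 on ground

# The ground is then suppressed by element-wise multiplication:
# O1_M_masked = O1_M * ground_mask(D)
```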

Once the obstacle map of the depicted scene is estimated following the process described above, the next step is the spatial localization of the obstacle in linguistic values. This step is crucial for the communication of the surroundings to a VCP. For this purpose, Fuzzy Sets are utilized in this work. As presented earlier in this section, 5 membership functions are used to determine the location of an obstacle along the horizontal axis ($x$-axis) and 3 along the vertical axis ($y$-axis).

**Figure 10.** Example of the ground removal procedure. (**a**) Original $I_{RGB}$ image; (**b**) corresponding obstacle map $O_M^1$; (**c**) respective ground removal mask $G_M$; (**d**) masked obstacle map $O_M^1$. In (**d**), the ground has been effectively removed.

Initially, the boundaries of the obstacles depicted in the obstacle maps need to be determined. For the obstacle detection task, the obstacle map $O_M^1$, through which the high-risk obstacles are represented, is chosen. Then, the boundaries $b_l$, $l$ = 1, 2, 3, ..., of the obstacles are calculated using the border following methodology presented in [52]. Once the boundaries of each probable obstacle depicted in $O_M^1$ are acquired, their centers $c_l = (c_x, c_y)$, $l$ = 1, 2, 3, ..., are derived by exploiting the properties of the image moments [53] of the boundaries $b_l$. The centers $c_l$ can be defined using the raw moments $m_{00}$, $m_{10}$, and $m_{01}$ of $b_l$ as follows:

$$m_{qk} = \iint_{b_l} x^q y^k I_{RGB}(x, y)\, dx\, dy \tag{6}$$

$$c_l = \left(\frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}\right) \tag{7}$$

where $q$ = 0, 1, 2, ..., $k$ = 0, 1, 2, ..., and $x$, $y$ denote image coordinates along the $x$-axis and $y$-axis, respectively. An example of the obstacle boundary detection can be seen in Figure 11, where the boundaries of the obstacles are illustrated with green lines (Figure 11b) and the centers of the obstacles are marked with red circles (Figure 11c).

**Figure 11.** Example of the obstacle boundary extraction and obstacle center calculation. (**a**) Obstacle map $O_M^1$ used for the detection of high-risk obstacles; (**b**) boundary (green outline) estimation of the obstacles; (**c**) respective centers of the detected obstacles.

Once the centers have been calculated, their location can be determined and described with linguistic values using the horizontal and vertical membership functions $h_j$, $j$ = 1, 2, 3, 4, 5, and $v_i$, $i$ = 1, 2, 3. If the responses $h_j(c_x)$ and $v_i(c_y)$ are greater than 0.65, then the respective obstacle with boundary center $c_l = (c_x, c_y)$ is described with the linguistic values that these $h_j$ and $v_i$ represent. Additionally, the distance between the object and the person is estimated using the depth value $D(c_x, c_y)$ of the depth map. Using this information, the VCP can be warned regarding the location and distance of the obstacle and, by extension, be assisted in avoiding it.
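The complete localization step can be sketched as follows: cv2.findContours implements the border following algorithm of [52], cv2.moments provides the raw moments of Equations (6) and (7), and the 0.65 acceptance threshold comes from the text, while the binarization threshold on $O_M^1$ is an assumption.

```python
import cv2
import numpy as np

# Linguistic labels corresponding to the membership functions h1-h5 and v1-v3.
H_LABELS = ["far left", "left", "central", "right", "far right"]
V_LABELS = ["upper", "central", "bottom"]

def describe_obstacles(O1_M, D, bin_thresh=0.5):
    binary = (O1_M > bin_thresh).astype(np.uint8)            # binarize obstacle map
    boundaries, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                     cv2.CHAIN_APPROX_SIMPLE)  # border following [52]
    rows, cols = O1_M.shape
    warnings = []
    for b_l in boundaries:
        m = cv2.moments(b_l)                                 # raw moments m00, m10, m01
        if m["m00"] == 0:
            continue
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]    # center c_l, Eq. (7)
        x, y = cx / cols, cy / rows                          # normalized coordinates
        for j, h_j in enumerate(h):                          # h, v sketched earlier
            for i, v_i in enumerate(v):
                if h_j(x) > 0.65 and v_i(y) > 0.65:          # threshold from the text
                    distance = float(D[int(cy), int(cx)])    # depth at the center
                    warnings.append((H_LABELS[j], V_LABELS[i], distance))
    return warnings   # e.g., [("central", "bottom", 1.2)]
```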
