*Article* **Complex Color Space Segmentation to Classify Objects in Urban Environments**

**Juan-Jose Cardenas-Cornejo †, Mario-Alberto Ibarra-Manzano †, Daniel-Alberto Razo-Medina † and Dora-Luz Almanza-Ojeda \*,†**

> Electronics Engineering Department, DICIS, University of Guanajuato, Carr. Salamanca-Valle de Santiago KM. 3.5 + 1.8 Km., Salamanca 36885, Mexico

**\*** Correspondence: dora.almanza@ugto.mx

† These authors contributed equally to this work.

**Abstract:** Color image segmentation divides the image into areas that represent different objects and focus points. One of the biggest problems in color image segmentation is the lack of homogeneity in the color of real urban images, which generates areas of over-segmentation when traditional color segmentation techniques are used. This article describes an approach to detecting and classifying objects in urban environments based on a new chromatic segmentation to locate focus points. Based on components *a* and *b* on the CIELab space, we define a *chromatic map* on the complex space to determine the highest threshold values by comparing neighboring blocks and thus divide various areas of the image automatically. Even though thresholds can result in broad segmentation areas, they suffice to locate centroids of patches on the color image that are then classified using a convolutional neural network (CNN). Thus, this broadly segmented image helps to crop only outlying areas instead of classifying the entire image. The CNN is trained to use six classes based on the patches drawn from the database of reference images from urban environments. Experimental results show a high score for classification accuracy that confirms the contribution of this segmentation approach.

**Keywords:** image segmentation; complex numbers; CNN classifier; outdoor environments

**MSC:** 68T45

#### **1. Introduction**

Autonomous systems need to recognize objects and their position in the real world to interact. Ideally, autonomous systems label objects and regions on an image to understand the environment [1]. Commonly used strategies in smart systems are based on image segmentation and automatic-learning techniques. Image segmentation is a key task in computer vision involving the analysis of standard features, such as texture and color, among others, on the image. However, most models and techniques used in image segmentation are unique, that is to say, only used for a specific purpose, and their performance only differs depending on the color space involved [2]. Therefore, choosing a suitable space to represent color is essential during the segmentation process.

CIELab, HSI [3] or HSV [4] are the most common color spaces used to segment images. Others, such as *Munsell* or *YIQ* spaces [5], are used for several purposes and need specific methodologies to work. The CIELab color space mimics how humans perceive color; it is useful for modifying brightness and color values on an image independently [6]. Most processing techniques based on the CIELab color space analyze each plane individually. According to the CIELab theory, chromatic components *a* and *b* are orthogonal axes on a 2D plane. Thus, the 2D chromatic plane of CIELab can be mapped directly onto the complex plane, which makes it possible to use complex numbers to facilitate algebraic calculations on image data.

**Citation:** Cardenas-Cornejo, J.-J.; Ibarra-Manzano, M.-A.; Razo-Medina, D.-A.; Almanza-Ojeda, D.-L. Complex Color Space Segmentation to Classify Objects in Urban Environments. *Mathematics* **2022**, *10*, 3752. https://doi.org/10.3390/math10203752

Academic Editor: Liliya Demidova

Received: 8 September 2022 Accepted: 6 October 2022 Published: 12 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

A complex number is a pair of real numbers *a* and *b* ordered as (*a*, *b*), and expressed as *a* + *bi*, where *i* is the imaginary unit defined by *i*<sup>2</sup> = −1. The symbol *z* can represent any complex number and is a complex variable subject to operational definitions, such as addition and multiplication [7]. Each complex number corresponds to a single point on the complex plane.

On the other hand, automatic learning only extracts data from the most representative objects and regions to classify as segmented images. A good selection of segmentation techniques considers the relevant context, hardware resources, the number of classes, and the size of the dataset [8]. For instance, when classifying objects for self-driving, hardware resources and the number of classes play a key role because the size of the training data and validation labels could restrict decision-making. A self-driving car that uses deep learning needs to consider hardware resources to process the dataset [9]. The design of a convolutional neural network (CNN) is determined by the number of convolutional layers, the kind of pooling layers, the activation function used, the number of fully connected layers, and the size of the image to be processed, as well as the techniques used to prevent overfitting. Even though the training phase of a CNN is computationally costly, these models can reach high classification accuracy levels, making them popular.

This study proposes using a color image segmentation algorithm based on a *chromatic map* defined on a space using complex numbers to analyze the best color distribution. Complex algebra is used spatially to obtain final representative thresholds to segment the image. The segmented images represent similar chromatic values on components *a* and *b* of the CIELab space and the image's most relevant areas. Patches from representative areas are extracted based on both aspects. A convolutional neural network (CNN) classifies the extracted patches to label them on the color image. This study's contribution is to propose a new representation of chromatic components based on complex numbers defined as a *chromatic map*. The map can facilitate localizing the most representative areas across the image using fundamental algebra for complex numbers. This segmentation method renders broadly segmented images; however, instead of refining the segmented areas and labeling them, several patches from the color images are extracted using the location of the segmented area as the input for a CNN classifier. Thus, this segmentation strategy is a phase prior to the classifier that looks for similar chromatic patterns that represent the essential content of the image. This approach to segmentation and classification has been tested using urban-context images, and the results include data about the reliability of each predicted image class.

#### **2. Related Works**

Labeling segmented areas requires high computational resources to recognize objects during human-machine interaction. Image segmentation is often based on graph theory and clustering algorithms. In [10], the authors propose a general scheme for scene segmentation based on the spectral clustering algorithm for normalized cuts, fusing geometric and color information in a parameter-free framework. The study in [11] presents a segmentation scheme to combine color and depth information. Under this scheme, segmentation happens in three steps. The study by Karimpoulit [12] identifies the types of rocks using images of rocky settings. Segmentation has been extended to video; for instance, the authors of [13] have developed a method to combine the appearance of an object with the temporal consistency between frames. Using the features of a normalized color histogram and CNN features, the GrabCut algorithm is applied to different bounding boxes to segment the object from the background. When detecting objects in motion, the background is obtained using videos taken from a static camera. The study presented in [14] proposes a detection and segmentation method based on consecutive stereo images that processes dynamic objects found in an urban environment. This is a pixel-by-pixel approach applied to the KITTI dataset [15], and the bounding boxes are generated bearing in mind color and difference data for each moving object.

Unlike object-detection strategies, object recognition focuses on the objects in the image and provides a specific class for each one [16]. As the object is better adjusted to the bounding box, the classification results become more reliable without a background. The fast development of smart vehicles makes object detection and recognition essential in self-driving [1]. In addition, road sign detection provides key information for safe navigation. Often, road detection is based on standard low-level features used to process the image and isolate the borders. In [17], a real-time two-stage YOLOv2-based road-sign detection system is used. In the first stage, the YOLOv2 detection framework is modified to adapt it to the road-sign detection task and predicts the bounding boxes, class, and reliability of road signage. In the second stage, an invariant light road-sign transformation network (RM-Net) reclassifies the samples with low accuracy to increase accuracy.

The CNN architectures used for segmentation purposes are usually of three kinds: fully convolutional networks (FCN) [18], encoder-decoder networks [19] and "atrous-convolutional" networks [20]. The authors in [21] introduce the Mask R-CNN method, an extension of the Faster R-CNN method [22] to segment images instead of just detecting bounding boxes. There are also some approaches whereby Deep Neural Networks are modified, for instance, semantic-aware segmentation [23] to use semantic segmentation and instance segmentation. Recent strategies propose general DWT and IDWT layers for various wavelets and design wavelet integrated CNNs (WaveCNets) for image classification using ImageNet and ImageNet-C, achieving an accuracy of 78.51% [24]. Moreover, a new architecture (VOLO) implements a novel outlook attention operation that dynamically conducts the local feature aggregation mechanism in a sliding window across the input image. This approach uses transformers and CNNs to complement their model and achieves 87.1% accuracy using ImageNet-1k [25]. Another natural color image approach is described in [26]. In this approach, the image is split into patches that feed the embedding module to expand the feature dimensions used for image classification. This method achieves a Top-1 accuracy rate of 83.9%.

The main aims of this study are: (1) to develop an automatic strategy to obtain areas on a natural, outdoor image transforming components *a* and *b* of the CIELab image as a complex space to represent image tonality, saturation, and contrast; (2) to build a *chromatic map* that concentrates the distribution of the tone density of pixels from the image using algebra for complex numbers; (3) to provide a strategy that includes sky and road categories, which are usually considered in semantic-based methods but not in object-classifier methods.

#### **3. Image Segmentation Approach**

Figure 1 shows an overview of the proposed method to segment images and identify objects. First, input color images are transformed to the CIELab space. Next, chromatic planes *a* and *b* of the CIELab space are used as real and imaginary elements to form complex image *I*. The representative chromatic values of image *I* are calculated using the complex image to build a *chromatic map*. The number of thresholds per image depends on the colors of the image. The segmented areas group pixels with similar chromatic values and carry no classification label. The next step consists of extracting several patches from the color image from each segmented area to build a database of images in six categories. A CNN uses the database to train, validate and test the identification of the object on the image. Note that color image patches are the input to the CNN model instead of the segmented areas. The implementation details of the method are shown in the following subsections.

**Figure 1.** Proposed segmentation method to identify objects.

#### *3.1. Image in the Complex Space*

As said earlier, planes *a* and *b* of the CIELab space, referred to as *imA* and *imB*, are combined to generate complex image *I*. Figure 2 shows how chromatic planes *imA* and *imB* form complex image *I* for a specific color image. Each pixel of image *I* is a complex number *z* = *a* + *ib*, processed using algebra for complex numbers. In this case, basic operations such as division, modulus, and argument have been used [27], but division is the main operation. Each pixel of *I* is divided by a reference point *P*(*r*,*c*); the resulting image is known as the division image and is referred to as *D*. Equation (1) defines division image *D* as the complex image *I* of size *u* × *v* divided by reference point *P*(*r*,*c*). Values close to unity in *D* represent pixels similar to *P*(*r*,*c*), within a small tolerance. Therefore, image *D* shows the relevance of point *P*(*r*,*c*) on the color image. However, as *D* is in the complex space, searching for values close to 1 cannot be done directly. Taking the modulus of *D* yields the image |*D*|, which contains only non-negative real values.

$$D\_{[u \times v]} = \frac{I\_{[u \times v]}}{P\_{(r, c)}} \tag{1}$$

**Figure 2.** Generating complex image *I* using the chromatic images *imA* and *imB*.

In Equation (2), values of |*D*| close to unity (within a small tolerance) are chosen to obtain the thresholded image *F* and to highlight areas with a color similar to that of *P*(*r*,*c*). Figure 3 represents the division image *D* and the corresponding modulus |*D*|. The areas whose values are similar to *P*(*r*,*c*) appear in white in |*D*|.
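The division and thresholding steps described above can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function name `similar_to_reference`, the tolerance parameter `eps`, and the toy chromatic values are assumptions introduced here, and the near-unity test mirrors the `abs(1 − abs(D))` comparison used in Algorithm 1.

```python
import numpy as np

def similar_to_reference(im_a, im_b, ref_rc, eps):
    """Divide the complex image by the pixel at ref_rc and flag |D| values near 1."""
    I = im_a.astype(np.float64) + 1j * im_b.astype(np.float64)  # I = a + ib
    P = I[ref_rc]                       # complex reference value P(r, c)
    if P == 0:
        raise ValueError("reference pixel has a = b = 0; division is undefined")
    D = I / P                           # Equation (1), element-wise
    mod_D = np.abs(D)                   # |D|: real and non-negative
    F = np.abs(1.0 - mod_D) < eps       # assumed tolerance test around unity
    return D, mod_D, F

# Toy 2x2 chromatic planes (hypothetical values, not taken from the datasets).
im_a = np.array([[10.0, 10.5], [-20.0, 5.0]])
im_b = np.array([[5.0, 5.0], [30.0, 5.0]])
D, mod_D, F = similar_to_reference(im_a, im_b, ref_rc=(0, 0), eps=0.1)
```

In this toy case only the top row is flagged, because only those pixels have a chromatic modulus close to that of the reference pixel.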

**Figure 3.** Complex division using a representative chromatic point.

To obtain a final segmented image *F*, first, representative *P*(*r*,*c*) thresholds must be found. Each threshold requires the division process. A *chromatic map* AB makes it possible to obtain several thresholds for the image automatically.

#### *3.2. Chromatic Map*

*Chromatic map* AB can be defined in the context of a bidimensional histogram. Chromatic components *a* and *b* on the CIELab space make up the horizontal *Xa* and vertical *Yb* axis on the map AB. This can be illustrated as shown in Figure 4a,c for a real and an artificial image, respectively.

Figure 4d shows five representative points on the *chromatic map* AB, one for each area of the artificial image shown in Figure 4c. These points separate the chromatic components of the image. In the case of images such as those in Figure 4a, chromatic values are calculated by seeking the most representative values, that is to say, the highest density of points. Therefore, *chromatic map* AB is divided into *k* areas, resulting from *m* and *n* divisions on the *Yb* and *Xa* axes, respectively. Thus, the map is divided into *k* = *m* × *n* areas based on the combinations of *m* and *n* within the set of values {4, 8, 16, 32}. Because these values are powers of two, they reduce computational complexity and make the methodology suitable for hardware implementation. For instance, dividing the map with *m* = 8 and *n* = 16 yields *k* = 128 blocks.
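Since the *chromatic map* is described as a bidimensional histogram, its block counts can be sketched with `numpy.histogram2d`. The function name `chromatic_map` and the random test planes are assumptions for illustration; the sketch follows the text in placing the *m* divisions on the *b* axis and the *n* divisions on the *a* axis.

```python
import numpy as np

def chromatic_map(im_a, im_b, m, n):
    """Count pixels per block: m divisions along b (rows), n along a (columns)."""
    counts, b_edges, a_edges = np.histogram2d(
        im_b.ravel(), im_a.ravel(), bins=[m, n]
    )
    return counts, a_edges, b_edges

# Hypothetical chromatic planes drawn at random, only to exercise the function.
rng = np.random.default_rng(0)
im_a = rng.uniform(-40, 40, size=(60, 80))
im_b = rng.uniform(-40, 40, size=(60, 80))
counts, a_edges, b_edges = chromatic_map(im_a, im_b, m=8, n=16)
```

With *m* = 8 and *n* = 16 the returned array has 8 × 16 = 128 entries, one per block, and its sum equals the number of pixels.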

**Figure 4.** *Chromatic map* AB for two color images. (**a**) Color image 1. (**b**) *Chromatic map* AB of image 1. (**c**) Color image 2. (**d**) *Chromatic map* AB of image 2.

In Equation (3), *n<sub>px</sub>* is a fraction of the total number of pixels on the image, which is used as the minimum pixel count to label blocks as representative. Each block has a chromatic range Δ*a* and Δ*b* defined by Equations (4) and (5). Figure 5 shows the division into *k* blocks on a *chromatic map*, whose axes take the chromatic values from planes *a* and *b* of the CIELab space used to build complex image *I*.

$$n\_{px} = (u \times v) \cdot \frac{1}{\max(m, n)}\tag{3}$$

$$
\Delta a = \frac{\max\left(X\_a\right) - \min\left(X\_a\right)}{m} \tag{4}
$$

$$
\Delta b = \frac{\max(\mathbf{Y}\_b) - \min(\mathbf{Y}\_b)}{n} \tag{5}
$$
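As a worked example of Equations (3) to (5), the following sketch evaluates the pixel-count limit for a CamVid-sized image and the chromatic extent of one block. The helper names and the chromatic range of −40 to 40 are assumptions chosen for illustration.

```python
def min_pixels(u, v, m, n):
    """Equation (3): pixel count a block needs to be labeled representative."""
    return (u * v) / max(m, n)

def block_range(lo, hi, divisions):
    """Equations (4)-(5): chromatic extent of one block along one axis."""
    return (hi - lo) / divisions

# CamVid images are 720 x 960 pixels; m = 8 and n = 16 is one tested combination.
n_px = min_pixels(720, 960, m=8, n=16)      # 691200 / 16
delta = block_range(-40.0, 40.0, 8)         # hypothetical chromatic range
```

For this combination a block must hold at least 43,200 pixels, that is, one sixteenth of the image, to be considered representative.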

**Figure 5.** *Chromatic map* AB divided into *m* × *n* blocks on the chromatic range given by Δ*a* and Δ*b*.

#### *3.3. Segmentation Approach*

This study uses complex numbers to segment the complex image *I*. As shown in the previous subsection, the *chromatic map* AB represents the pixel density distribution along the *k* blocks of the complex image. In each block (*i*, *j*) of the *chromatic map* AB, the number of pixels *M<sub>p</sub>* is counted and the density *M<sub>μ</sub>* is calculated as the average of the complex values of *I* that fall in the block, as shown in Equation (6).

$$M\_{\mu}(i,j) = \begin{cases} \dfrac{\sum\_{p=1}^{M\_p} I\_{i,j}(p)}{M\_p} & \text{if } M\_p > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
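A possible way to evaluate Equation (6) for every block at once is sketched below. The function name `block_stats` and the equal-width binning between the minimum and maximum chromatic values are assumptions; the real and imaginary parts are accumulated separately because `numpy.bincount` accepts only real weights.

```python
import numpy as np

def block_stats(I, m, n):
    """Return (M_p, M_mu): pixel count and complex mean for each of the m x n blocks."""
    a, b = I.real.ravel(), I.imag.ravel()
    a_edges = np.linspace(a.min(), a.max(), n + 1)
    b_edges = np.linspace(b.min(), b.max(), m + 1)
    # Interior edges only, so indexes fall in 0..n-1 and the maximum lands in the last block.
    col = np.clip(np.digitize(a, a_edges[1:-1]), 0, n - 1)
    row = np.clip(np.digitize(b, b_edges[1:-1]), 0, m - 1)
    flat = row * n + col
    M_p = np.bincount(flat, minlength=m * n)
    sums = (np.bincount(flat, weights=a, minlength=m * n)
            + 1j * np.bincount(flat, weights=b, minlength=m * n))
    M_mu = np.zeros(m * n, dtype=complex)
    np.divide(sums, M_p, out=M_mu, where=M_p > 0)   # empty blocks keep the value 0
    return M_p.reshape(m, n), M_mu.reshape(m, n)

# Toy complex image (hypothetical values).
I = np.array([[1 + 1j, 1 + 1j], [3 + 3j, 3 + 5j]])
M_p, M_mu = block_stats(I, m=2, n=2)
```

In the toy image two blocks are populated: one holds the two identical pixels 1 + *i*, and the other averages 3 + 3*i* and 3 + 5*i* to 3 + 4*i*.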

Equation (7) calculates the indexes *ind<sub>Mp</sub>* of the blocks whose number of pixels is at least *n<sub>px</sub>*. In Equation (8), a second criterion is applied to obtain the final index vector *ind<sub>Mμ</sub>*, which stores the indexes of the blocks of *M<sub>μ</sub>* that also agree with *ind<sub>Mp</sub>*. The number of thresholds *n<sub>th</sub>* used in the segmentation process is obtained from the cardinality of vector *ind<sub>Mμ</sub>* (see Equation (9)).

$$ind\_{Mp} = M\_p \ge n\_{px} \tag{7}$$

$$ind\_{M\mu} = M\_{\mu} \left( ind\_{Mp} \right) \tag{8}$$

$$n\_{th} = \operatorname{card}\left(\operatorname{ind}\_{M\mu}\right) \tag{9}$$
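The selection of representative blocks can be sketched as a simple comparison against the pixel-count limit. This is an assumption-laden illustration: the function name `representative_blocks` and the toy counts are hypothetical, and only the count criterion of Equation (7) is reproduced, since the second criterion of Equation (8) is not specified in enough detail to implement.

```python
import numpy as np

def representative_blocks(M_p, n_px):
    """Flat indexes of blocks holding at least n_px pixels, and their number."""
    ind = np.flatnonzero(M_p >= n_px)   # count criterion of Equation (7)
    return ind, ind.size                # ind.size plays the role of n_th

# Hypothetical 2 x 2 map of pixel counts.
M_p = np.array([[500, 20], [0, 310]])
ind, n_th = representative_blocks(M_p, n_px=300)
```

Here two of the four blocks pass the limit, so two thresholds would be used in the segmentation.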

Vector *V<sub>μ</sub>* is calculated using *M<sub>μ</sub>* and *ind<sub>Mμ</sub>*, as shown in Equation (10). *V<sub>μ</sub>* is the vector of average values used as thresholds in the segmentation process, which are still represented using complex numbers. The correlation matrix *M<sub>corr</sub>* is obtained by dividing each threshold value by all the other values, as shown in Equation (11). Equation (12) represents the areas for average values bound by a circle |*z* − *z*<sub>0</sub>| = *R*. In this case, areas are defined as being within a unitary circle centered on each threshold value on the matrix *M<sub>corr</sub>*.

$$V\_{\mu} = M\_{\mu} \left( \operatorname{ind}\_{M\mu} \right) \tag{10}$$

$$M\_{corr}(i,j) = \frac{V\_{\mu}(i)}{V\_{\mu}(j)} \qquad \qquad i,j = \{1, \ldots, n\_{th}\} \tag{11}$$

$$\begin{aligned} M\_{r\_{\mu}} &= \left| 1 - \left| M\_{corr} \right| \right| \\ &= \begin{vmatrix} 1 - \left| \frac{V\_{\mu}(1)}{V\_{\mu}(1)} \right| & 1 - \left| \frac{V\_{\mu}(1)}{V\_{\mu}(2)} \right| & \cdots & 1 - \left| \frac{V\_{\mu}(1)}{V\_{\mu}(n\_{th})} \right| \\ 1 - \left| \frac{V\_{\mu}(2)}{V\_{\mu}(1)} \right| & 1 - \left| \frac{V\_{\mu}(2)}{V\_{\mu}(2)} \right| & \cdots & 1 - \left| \frac{V\_{\mu}(2)}{V\_{\mu}(n\_{th})} \right| \\ \vdots & \vdots & \ddots & \vdots \\ 1 - \left| \frac{V\_{\mu}(n\_{th})}{V\_{\mu}(1)} \right| & 1 - \left| \frac{V\_{\mu}(n\_{th})}{V\_{\mu}(2)} \right| & \cdots & 1 - \left| \frac{V\_{\mu}(n\_{th})}{V\_{\mu}(n\_{th})} \right| \end{vmatrix} \end{aligned} \tag{12}$$

The matrix *M<sub>rμ</sub>* is used to analyze the average values. The minimum of *M<sub>rμ</sub>* is computed over its off-diagonal entries. The minimum values obtained are then divided by two to ensure there is no overlap between areas centered around average values; this is expressed by *V<sub>rμ</sub>* in Equation (13). The *n<sub>th</sub>* values stored in *V<sub>rμ</sub>* are the thresholds used to conduct color segmentation. Algorithm 1 explains the implementation of the multi-threshold segmentation process on a color image.

$$V\_{r\_{\mu}} = \frac{\min\left(M\_{r\_{\mu}}(i, j)\right)}{2} \quad \forall i \neq j \tag{13}$$
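Equations (11) to (13) can be sketched with NumPy broadcasting. The function name `threshold_radii` and the example thresholds are assumptions. Equation (13) does not state along which index the minimum is taken; the sketch assumes the minimum over *i* for each fixed *j*, which keeps the resulting radii from overlapping when the image is later divided by the *j*-th threshold. It also assumes at least two thresholds, all non-zero.

```python
import numpy as np

def threshold_radii(V_mu):
    """Pairwise ratios, distance of their modulus from unity, half the nearest gap."""
    M_corr = V_mu[:, None] / V_mu[None, :]     # M_corr[i, j] = V_mu[i] / V_mu[j]
    M_r = np.abs(1.0 - np.abs(M_corr))         # Equation (12), element-wise
    np.fill_diagonal(M_r, np.inf)              # exclude the entries with i == j
    return M_r.min(axis=0) / 2.0               # assumed: one radius per column j

# Hypothetical complex thresholds with moduli 5, 10 and 20.
V_mu = np.array([3 + 4j, 6 + 8j, 0 + 20j])
V_r = threshold_radii(V_mu)
```

For these values the radii are 0.5, 0.25 and 0.25, which correspond to the modulus intervals (2.5, 7.5), (7.5, 12.5) and (15, 25); none of them overlap.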

#### **Algorithm 1** Segmentation method

**Input:** Input image **im**, number of blocks *m*, *n* in the *chromatic map*

**Output:** Segmented image **imSeg**

1. *n<sub>px</sub>* ← *size*(**im**) / *max*(*m*, *n*)
2. [*imL*, *imA*, *imB*] ← *to*\_*cielab*(**im**)
3. *I* ← *imA* + *i* *imB* % complex image
4. **for** *i* = 1 **to** *m* **do**
5. &emsp;**for** *j* = 1 **to** *n* **do**
6. &emsp;&emsp;*M<sub>p</sub>* ← *card*(*block<sub>i,j</sub>*)
7. &emsp;&emsp;**if** *M<sub>p</sub>* > 0 **then**
8. &emsp;&emsp;&emsp;*M<sub>μ</sub>* ← *mean*(*block<sub>i,j</sub>*)
9. &emsp;&emsp;**end if**
10. &emsp;**end for**
11. **end for**
12. *ind<sub>Mp</sub>* ← (*M<sub>p</sub>* ≥ *n<sub>px</sub>*)
13. *ind<sub>Mμ</sub>* ← *M<sub>μ</sub>*(*ind<sub>Mp</sub>*)
14. [*V<sub>μ</sub>*, *n<sub>th</sub>*] ← [*M<sub>μ</sub>*(*ind<sub>Mμ</sub>*), *card*(*ind<sub>Mμ</sub>*)]
15. **for** *i*, *j* = 1 **to** *n<sub>th</sub>* **do**
16. &emsp;*M<sub>corr</sub>*(*i*, *j*) ← *V<sub>μ</sub>*(*i*) / *V<sub>μ</sub>*(*j*)
17. **end for**
18. *M<sub>rμ</sub>* ← *abs*(1 − *abs*(*M<sub>corr</sub>*))
19. *V<sub>rμ</sub>* ← *min*(*M<sub>rμ</sub>*(*i*, *j*)) / 2 for *i* ≠ *j*
20. **for** *k* = 1 **to** *n<sub>th</sub>* **do**
21. &emsp;**if** *V<sub>μ</sub>*(*k*) ≠ 0 **then**
22. &emsp;&emsp;*D* ← *I* / *V<sub>μ</sub>*(*k*)
23. &emsp;**else**
24. &emsp;&emsp;*D* ← *abs*(*I*)
25. &emsp;**end if**
26. &emsp;*F* ← *abs*(1 − *abs*(*D*))
27. &emsp;*S*(:, :, *k*) ← *k* · (*F* < *V<sub>rμ</sub>*(*k*))
28. &emsp;*imSeg*(*S*(:, :, *k*) ≡ *k*) ← *k*
29. **end for**
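The final labeling loop of Algorithm 1 (steps 20 to 29) can be sketched as follows. This is a hedged reading rather than the authors' code: it assumes that the complex image is divided by the complex threshold value, as in Equation (1), and that the result is compared with the corresponding radius from Equation (13). The function name `label_pixels` and the toy data are hypothetical; in practice the chromatic planes could come from a CIELab conversion such as `skimage.color.rgb2lab`.

```python
import numpy as np

def label_pixels(I, V_mu, V_r):
    """Assign label k (1-based) to pixels whose |I / V_mu[k]| lies within V_r[k] of 1."""
    seg = np.zeros(I.shape, dtype=np.int32)          # 0 means "no representative area"
    for k, (v, radius) in enumerate(zip(V_mu, V_r), start=1):
        D = I / v if v != 0 else np.abs(I)           # steps 21-25 (assumed divisor)
        F = np.abs(1.0 - np.abs(D))                  # step 26
        seg[F < radius] = k                          # steps 27-28
    return seg

# Toy complex image: left half a = 10, b = 0; right half a = 0, b = 40.
I = np.zeros((4, 8), dtype=complex)
I[:, :4] = 10 + 0j
I[:, 4:] = 0 + 40j
V_mu = np.array([10 + 0j, 0 + 40j])   # hypothetical thresholds (block means)
V_r = np.array([1.5, 0.375])          # hypothetical radii for those thresholds
seg = label_pixels(I, V_mu, V_r)
```

Each half of the toy image receives its own label, because the modulus ratio is 1 inside the matching half and far from 1 in the other.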

#### **4. Results**

#### *4.1. Experimental Results*

Segmentation and classification results are obtained using the Cityscape [28] and CamVid [29] datasets. Similar datasets, i.e., KITTI, Waymo [30], and nuScenes [31], are used for 2D and 3D object detection for self-driving. The Cityscape dataset is divided into 20 folders obtained from several European cities; in this case, the Munster subfolder with 174 images of 1024 × 2048 pixels was chosen. In contrast, the CamVid dataset has 701 images of 720 × 960 pixels. Both datasets show urban contexts but under different seasonal and lighting conditions.

The color image and the number of blocks on the *chromatic map* are the inputs for the segmentation algorithm. Each input image is processed using 16 different block sizes, generating 16 segmented images. Each segmented image is colored by area according to the values of a 256-color map. Figures 6 and 7 show the segmentation results for an image taken from the CamVid dataset, using different block sizes on the *chromatic map*. The validation method used shows that the *chromatic map* for the CamVid dataset produces better results with the (8 × 16) combination, unlike the Cityscape dataset, which produced better results with the (16 × 8) combination, as shown in the following subsection.


**Figure 6.** CamVid segmented images for different block sizes. (**a**) Block size 4 × 8. (**b**) Block size 4 × 16. (**c**) Block size 8 × 8. (**d**) Block size 8 × 16.


**Figure 7.** CamVid segmented images for different block sizes. (**a**) Block size 16 × 8. (**b**) Block size 16 × 16. (**c**) Block size 32 × 8. (**d**) Block size 32 × 16.

#### *4.2. Segmentation Performance*

The number of representative areas segmented is validated through a quantitative analysis of ground-truth images provided by Cityscape and CamVid datasets. Table 1 shows the number of representative areas *nth* found by Algorithm 1 for various block sizes on the *chromatic map*. Bear in mind that as the number of blocks on the axes increases, the number of representative areas increases too. Segmented images can be empty in both cases, meaning no representative area was found.


**Table 1.** Number of representative areas generated by each dataset.

A second validation of segmented images consists of selecting the most common categories and their semantics to compare with the segmented areas. The most common categories from the urban context are enough for a general description of the scene. The selected categories are building, car, pedestrian, road, sky, and tree. Each segmented image is analyzed by area. The results for categories building, pedestrian, and road using the CamVid dataset are shown in Table 2, and those using the Cityscape dataset are shown in Table 3. Both tables show the segmented pixel-by-pixel relationship between the results and the ground-truth images, which makes it possible to consider some criteria to establish block sizes (*m*, *n*):


Therefore, block sizes (*m*, *n*) where *m* = *n* are used to comply with the last criterion, and the number of areas is enough to represent the categories.


**Table 2.** Analysis of segmented image categories for CamVid.


**Table 3.** Analysis of segmented image categories for CityScape.

#### *4.3. CNN Architecture*

Figure 8 shows the network architecture based on VGG-16 [32] used in this study. This architecture has 16 layers and about 138 million trainable parameters. The network consists of five blocks of convolutional layers. Each block consists of two or three convolutional layers followed by a pooling layer. The number of filters increases by a factor of 2, from 64 to 512. *Dropout* layers are added between one block and the next to avoid overfitting [33]. Each *Dropout* layer reduces the connections between one block and the next. The flatten layer connects the convolutional blocks with the fully connected layers. The fully connected layers have 4096 neurons, including bias, and use a ReLU activation function. The last fully connected layer is the output of the network. The number of neurons on this layer is the same as the number of categories. The activation function associated with the last fully connected layer is the Softmax or normalized exponential function for a multi-class problem.
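The figure of about 138 million parameters can be checked with a short count. The sketch below assumes the standard VGG-16 layout (3 × 3 convolutions, 2 × 2 pooling after each block, two 4096-neuron fully connected layers) rather than the modified network of Figure 8; the function name `vgg16_param_count` is hypothetical. *Dropout* layers add no parameters, so they are not counted.

```python
def vgg16_param_count(num_classes, input_size=224):
    """Count weights and biases of a standard VGG-16 for a square RGB input."""
    blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]  # (convs, filters)
    params, in_ch, side = 0, 3, input_size
    for n_convs, out_ch in blocks:
        for _ in range(n_convs):
            params += 3 * 3 * in_ch * out_ch + out_ch   # 3x3 kernels plus biases
            in_ch = out_ch
        side //= 2                                      # 2x2 pooling halves each side
    fan_in = side * side * in_ch                        # size after the flatten layer
    for width in (4096, 4096, num_classes):
        params += fan_in * width + width                # dense weights plus biases
        fan_in = width
    return params

imagenet_params = vgg16_param_count(num_classes=1000)
six_class_params = vgg16_param_count(num_classes=6)
```

The count reproduces the usual 138,357,544 parameters for 1000 classes; replacing the output layer with six neurons lowers it to 134,285,126, most of which sit in the first fully connected layer.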

**Figure 8.** Modified convolutional neural network VGG-16.

Algorithm 1 calculates the segmentation of the input images used to build the training and validation dataset. This process is illustrated in Figure 9. A binary mask per category, known as the *class mask*, is generated for each image on the dataset. The *class mask* is then used to randomly crop *p* patches of size [*lu* × *lv*] = [60 × 80] for each category. About 30,000 patches were generated for all the classes using the Cityscape database, with approximately 3000 images.
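The patch extraction step can be sketched as rejection sampling on the *class mask*. The acceptance rule is not detailed in the text, so the coverage limit `min_cover`, the cap `max_tries`, the fixed seed and the function name `crop_patches` are all assumptions introduced for illustration.

```python
import numpy as np

def crop_patches(image, class_mask, p, size=(60, 80), min_cover=0.9,
                 max_tries=1000, seed=0):
    """Randomly crop up to p patches whose area is mostly inside class_mask."""
    rng = np.random.default_rng(seed)
    lu, lv = size
    H, W = class_mask.shape
    patches = []
    for _ in range(max_tries):                   # bounded number of attempts
        if len(patches) == p:
            break
        r = rng.integers(0, H - lu + 1)          # top-left corner, patch stays inside
        c = rng.integers(0, W - lv + 1)
        if class_mask[r:r + lu, c:c + lv].mean() >= min_cover:
            patches.append(image[r:r + lu, c:c + lv])
    return patches

# Hypothetical 200 x 300 image whose left half belongs to the category.
image = np.zeros((200, 300, 3), dtype=np.uint8)
mask = np.zeros((200, 300), dtype=bool)
mask[:, :150] = True
patches = crop_patches(image, mask, p=5)
```

Bounding the number of attempts keeps the running time limited when a category covers too little of the image to host a patch.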

**Figure 9.** *Class mask* obtained from a segmented image to generate patches for the category *building*. A similar process is followed for all the categories to process the training dataset.

All patches were resized to [96 × 96], [128 × 128] and [224 × 224] for use on the CNN. Figure 10a,b show the accuracy and loss chart, respectively, for training of 100 epochs using an image size of (224 × 224).

**Figure 10.** Results from training the CNN using 224 × 224 images. (**a**) Accuracy graph. (**b**) Loss graph.

The *SmallVGG* network model was also used to reduce resource use while keeping good performance. This network model reduces the original architecture presented in [32]. Even though the VGG-16 model for resized 96 × 96 patches shows greater accuracy than the results shown in Table 4, 224 × 224 images have had a more stable performance during the training and validation phase.


Additional experimental tests were performed using the ResNet CNN model, and the results are included in Table 4. In [34], the authors describe the residual blocks used for training deeper layers in the network. Using skip connections, it is possible to take the activation of one layer and feed it to deeper layers in the network. ResNet CNN architectures are built by grouping a set of residual blocks. It is important to point out that the addition in the residual block can only be performed if both layers have the same dimension. For six categories and three different sizes of patches, we obtained an accuracy of 94% for 224 × 224 image patches.
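The dimension constraint on the skip connection can be illustrated with a small NumPy sketch. It uses dense layers instead of convolutions to stay short, so it is a simplified stand-in for a ResNet block; the function name `residual_block` and the weight shapes are assumptions.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Two-layer residual block: relu(F(x) + x), with F built from dense layers."""
    relu = lambda t: np.maximum(t, 0.0)
    out = relu(x @ W1) @ W2                       # residual branch F(x)
    if out.shape != x.shape:
        raise ValueError("skip connection needs matching shapes")
    return relu(out + x)                          # element-wise addition with the input

# With zero weights the residual branch vanishes and the block passes x through.
x = np.ones((2, 4))
W1 = np.zeros((4, 8))
W2 = np.zeros((8, 4))
y = residual_block(x, W1, W2)
```

If `W2` mapped to a different width, the addition would be undefined, which is why ResNet variants insert a projection on the skip path when dimensions change.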

The network model is validated using a training dataset from segmented images. Ground-truth information is not used with the validation dataset, and therefore, patches generated depend only on the areas obtained from the segmented image and the equivalent input color image. Unlike the training dataset, these patches are chosen randomly by area and do not have a predetermined category. This is represented visually in Figure 11. The patches are cropped from the color image using a fixed size [*lu* × *lv*] = [224 × 224], which is the classifier input size. A different number of patches is cropped for each segmented area depending on the size and the number of regions obtained. The fixed size of bounding boxes allows the classification of undefined categories, such as sky and road, which most object-detection methodologies cannot detect and classify. This is one of our contributions to classifiers in urban environments.

**Figure 11.** Image patches generated to validate the classifier.

Experimental tests to validate this approach use patches generated using the CamVid dataset. The classifier assigns a label and a reliability value to each patch. The output image shows the bounding boxes with the label and the reliability value corresponding to each image patch. Some results are shown in Figure 12. The CNN architecture was trained using the CityScape dataset, which has bigger images, and therefore, the process of generating patches was more straightforward.
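The label and reliability value attached to each patch can be derived from the Softmax output of the network. The sketch below assumes that the reliability is the largest Softmax probability; the class order in `CLASSES`, the function name `label_with_reliability` and the example scores are hypothetical.

```python
import numpy as np

# The six categories used in this study; their order here is an assumption.
CLASSES = ("building", "car", "pedestrian", "road", "sky", "tree")

def label_with_reliability(logits):
    """Apply a numerically stable Softmax and return the top class and its probability."""
    z = np.asarray(logits, dtype=float)
    probs = np.exp(z - z.max())          # subtracting the maximum avoids overflow
    probs /= probs.sum()
    k = int(np.argmax(probs))
    return CLASSES[k], float(probs[k])

# Hypothetical raw scores for one patch.
label, reliability = label_with_reliability([0.0, 0.0, 0.0, 5.0, 0.0, 0.0])
```

For these scores the patch is labeled as road with a reliability of roughly 0.97.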


**Figure 12.** Classification of objects using CamVid. (**a**) Test image #1. (**b**) Test image #2. (**c**) Test image #3. (**d**) Test image #4. (**e**) Test image #5. (**f**) Test image #6.

Experimental tests were conducted using a PC with Intel Core i5 9th generation, 32 GB of RAM, and an NVIDIA GeForce GTX 1650 graphics card. Table 5 shows the time for the segmentation algorithm.

**Table 5.** Execution time for the segmentation algorithm.


Bear in mind that the time presented in Table 5 depends on the number of areas on the segmented image and their sizes. When a patch for an area cannot be obtained, this increases processing time significantly. To limit the execution time, a maximal number of tries to generate the patches has been established. In addition, the number of patches per image depends on the number of representative areas obtained by the segmentation algorithm for each image divided into four partitions (see Figure 11). Thus, the number of patches changes from image to image, and so does the total processing time. Processing times were analyzed, including those recorded in the classification phase. Table 6 shows the number of patches generated and the time. A general processing time per image can be produced by adding the segmentation and classification times. For instance, the time for the CamVid dataset is 4 seconds; for the CityScape image, it is twice as long.

**Table 6.** Classification execution time.


#### **5. Discussion**

Table 7 shows a comparison between our proposal and other methodologies in the literature. Using ResNet, we achieve 94% accuracy, whereas the YOLOv3 [35] and YOLOv4 [36] architectures achieve over 95% accuracy for the ImageNet dataset. Other approaches compared were VOLO [25] and SPPNet [37], which also achieved good accuracy in the top rate. Even if our classification accuracy is lower, in this work, we provide an alternative method to classify image content without performing a whole refined segmentation of the image and without using semantic image information. Therefore, sky and road classes have been included as categories. In contrast, YOLO and other object classification architectures do not consider them because a bounding box cannot be defined for either category.

Our accuracy results also depend on the size of the bounding boxes extracted from the image; note in Table 4 that our accuracy increases as the selected size does. Our methodology is an alternative region-based approach that has been trained with one dataset and validated with another. Both datasets only have the urban context in common; resolution and illumination are different, which makes the validation task more difficult.

**Table 7.** Comparison with related works.


A final experimental test was performed by training boosted trees and several machine learning classifiers using the patches extracted from our method; the obtained results are illustrated in Table 8. For six categories and three different sizes, the highest accuracy was 79.80%, achieved by the Bagged Trees classifier. Considering that CNN architectures extract the main and representative features through the layers, machine learning-based classifiers require a more careful feature extractor strategy to improve their classification accuracy.


**Table 8.** Classification result using machine learning approaches.

#### **6. Conclusions**

This study presents a new approach to image segmentation to identify objects in structured outdoor spaces. The approach extracts representative features by applying complex-number algebra to planes *a* and *b* of the CIELab color space. The complex image makes it possible to develop and implement a multi-threshold segmentation algorithm. The methodology follows a typical machine learning technique. The features required as input to the classifier are chosen from specific areas on the segmented image. Despite lighting and overcrowding issues in outdoor environments, the number of classes and images used in the training and validation phases of the model is sufficient to identify objects.
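As a reminder of how the *a* and *b* planes combine into a single complex image, the following minimal sketch builds z = a + jb per pixel and derives its magnitude and phase, which in CIELab correspond to chroma and hue angle. The function names and the list-of-lists image format are hypothetical, and which quantity the thresholds are computed on is an assumption not restated in this section.

```python
import cmath

def chromatic_map(a_plane, b_plane):
    """Combine the a and b planes into one complex-valued image z = a + jb."""
    return [
        [complex(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(a_plane, b_plane)
    ]

def polar_planes(z_map):
    """Split the complex map into magnitude (chroma) and phase (hue angle)."""
    magnitude = [[abs(z) for z in row] for row in z_map]
    phase = [[cmath.phase(z) for z in row] for row in z_map]  # radians, in [-pi, pi]
    return magnitude, phase
```

Representing both chromatic components as one complex value lets a single operation act on the two planes jointly, rather than thresholding *a* and *b* separately.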

The execution time of the multi-threshold segmentation algorithm varies with the size and features of the image to be processed, and also with the computing power available. In addition, the different sets of images used for CNN training and validation are created using random conditions. Thus far, this approach cannot be used in real-time conditions that require execution times on the order of milliseconds. However, a dispersal strategy to select different areas on the scene could provide lighter techniques for classification purposes. Given the modular nature of the methodology, modifications to increase hardware performance are possible.

The VGG-16 network responds well to conditions such as those in this study, showing a uniform and flexible architecture; however, better accuracy results were achieved using the ResNet-150 network. Execution times for classification are affected by the various phases of the methodology and by the different features of the images in the databases. Hence the decision to train the CNN architecture using the Cityscape dataset and validate it using the CamVid dataset, since both show similar outdoor urban environments.

Finally, this study has focused on a less computationally intensive alternative for color segmentation and object detection tasks, with the flexibility to adapt to different hardware architectures and scenarios.

**Author Contributions:** Conceptualization, D.-L.A.-O. and M.-A.I.-M.; methodology, D.-A.R.-M. and D.-L.A.-O.; software, J.-J.C.-C. and M.-A.I.-M.; validation, J.-J.C.-C. and D.-A.R.-M.; formal analysis, J.-J.C.-C., D.-L.A.-O. and M.-A.I.-M.; investigation, J.-J.C.-C. and D.-A.R.-M.; data curation, J.-J.C.-C. and D.-L.A.-O.; writing—original draft preparation, J.-J.C.-C.; writing—review and editing, D.-L.A.-O. and M.-A.I.-M.; visualization, D.-L.A.-O., D.-A.R.-M. and M.-A.I.-M.; project administration, D.-L.A.-O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was conducted as part of the doctoral studies of Juan-Jose Cardenas-Cornejo, funded through scholarship number 2021-000018-02NACF-07210, awarded by CONACYT.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors are grateful to the University of Guanajuato. The authors would like to give special thanks to Carlos Montoro for his technical support in the English revision of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

