#### *5.4. ResNet*

Residual Networks (ResNets) are CNNs that learn residual functions with reference to the layer inputs. He et al. [53] proposed a framework for deep networks that prevents saturation and accuracy degradation by using shortcut connections that add identity mappings to the outputs of the skipped stacked layers. ResNets showed that simply stacking more layers to improve results on complex problems has a limit: learning capacity does not grow indefinitely with depth. To address this, He et al. [53] introduced the residual block, which creates an identity mapping by bypassing the output of a previous layer directly into a later layer. Figure 12 shows the residual block of ResNet.

**Figure 12.** ResNet's residual block.

The input *x* of a layer is added to the residual output *F*(*x*). Since *x* and *F*(*x*) can have different dimensions due to the convolutional operations, a linear projection *W* is used to adjust the shortcut, matching the number of channels to the residual dimension so that *x* and *F*(*x*) can be combined. This operation is illustrated in Equations (4) and (5), where *x<sub>i</sub>* represents the input and *x<sub>i+1</sub>* the output of the *i*-th layer, and *F* is the residual function. The identity map *h*(*x<sub>i</sub>*) is equal to *x<sub>i</sub>*, and *f* is a Rectified Linear Unit (ReLU) [54] applied in the block, which provides better generalization and speeds up convergence. Equation (6) shows the ReLU function

$$y\_i = h(x\_i) + F(x\_i, \mathcal{W}\_i);\tag{4}$$

$$x\_{i+1} = f(y\_i);\tag{5}$$

$$f(x\_{relu}) = \max(0, x\_{relu}).\tag{6}$$
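
To make the block concrete, the following is a minimal PyTorch sketch of a residual block implementing Equations (4)–(6); it is an illustration rather than the authors' implementation, and the class name, layer sizes, and the choice of two 3 × 3 convolutions with batch normalization and a 1 × 1 projection shortcut are assumptions based on the common ResNet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: y_i = h(x_i) + F(x_i, W_i), x_{i+1} = ReLU(y_i)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Residual function F(x, W): two 3x3 convolutions with batch normalization.
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Shortcut h(x): identity, or a 1x1 projection W when dimensions differ.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(self.shortcut(x) + residual)  # Equations (4)-(6)
```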

Moreover, ResNets are trained with stochastic gradient descent (SGD) rather than adaptive learning-rate techniques [55]. The authors also introduced a pre-processing stage that divides the input data into patches before feeding it to the network. This operation aims to improve the training performance of a network that stacks residual blocks rather than plain layers. The activation of a unit is the combination of the activation of a shallower unit with a residual function, which helps gradients propagate back to shallow units and improves training efficiency in ResNets [56].

One version of ResNet, ResNet50, was originally trained on the ImageNet 2012 dataset for the classification of 1000 different classes [57]. In total, ResNet50 performs 3.8 × 10<sup>9</sup> operations. Figure 13 illustrates the ResNet50 architecture.

**Figure 13.** ResNet50 architecture.

#### *5.5. SqueezeNet*

Iandola et al. [58] proposed a smaller CNN with 50 times fewer parameters than AlexNet while keeping comparable accuracy, which brings faster training, easier model updates, and feasible FPGA and embedded deployment. This architecture is called SqueezeNet.

SqueezeNet uses three strategies to build its structure. First, 3 × 3 filters are replaced by 1 × 1 filters in order to reduce the number of parameters. Second, the number of input channels to the 3 × 3 filters is decreased. Third, SqueezeNet downsamples late in the network, creating larger activation maps. Iandola et al. [58] made use of fire modules, each composed of a squeeze convolution layer and an expand layer. Figure 14 shows the fire module. It can be tuned in three places: the number of 1 × 1 filters in the squeeze layer, represented by *s*<sub>1x1</sub>; the number of 1 × 1 filters in the expand layer (*e*<sub>1x1</sub>); and the number of 3 × 3 filters in the expand layer, represented by *e*<sub>3x3</sub>. The first two hyperparameters implement strategy 1. The fire blocks also implement strategy 2, limiting the number of input channels by following the rule that *s*<sub>1x1</sub> must be less than *e*<sub>1x1</sub> + *e*<sub>3x3</sub>.

**Figure 14.** SqueezeNet's fire block.
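
For illustration, a minimal PyTorch sketch of a fire module is shown below; it is not the reference SqueezeNet code, and the class and argument names are ours, but the structure (a squeeze layer of 1 × 1 filters followed by concatenated 1 × 1 and 3 × 3 expand branches) follows the design described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FireModule(nn.Module):
    """Fire module: a squeeze layer of 1x1 filters followed by an expand
    layer mixing 1x1 and 3x3 filters, with s_1x1 < e_1x1 + e_3x3."""

    def __init__(self, in_channels, s_1x1, e_1x1, e_3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, s_1x1, kernel_size=1)
        self.expand_1x1 = nn.Conv2d(s_1x1, e_1x1, kernel_size=1)
        self.expand_3x3 = nn.Conv2d(s_1x1, e_3x3, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.relu(self.squeeze(x))
        # Concatenate both expand branches along the channel dimension.
        return torch.cat([F.relu(self.expand_1x1(x)),
                          F.relu(self.expand_3x3(x))], dim=1)

# Example: the first fire block of SqueezeNet uses s_1x1=16, e_1x1=64, e_3x3=64.
fire2 = FireModule(in_channels=96, s_1x1=16, e_1x1=64, e_3x3=64)
```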

SqueezeNet has three variants, summarized in Figure 15: plain SqueezeNet, SqueezeNet with simple skip connections, and SqueezeNet with complex skip connections that use 1 × 1 filters on the bypass. The fire modules are represented by the "fire" blocks, and the three numbers on them are the hyperparameters *s*<sub>1x1</sub>, *e*<sub>1x1</sub>, and *e*<sub>3x3</sub>, respectively. SqueezeNets are fully convolutional networks. The late downsampling of strategy 3 is implemented by placing max pooling layers relatively late in the network, creating larger activation maps.

**Figure 15.** SqueezeNet architectures.

These bypasses are implemented to improve accuracy, ease training, and alleviate the representational bottleneck. The authors trained the regular SqueezeNet on ImageNet ILSVRC-2012, achieving 60.4% top-1 and 82.5% top-5 accuracy with a model size of 4.8 MB.

#### *5.6. DenseNet*

Huang et al. [59] proposed a CNN architecture that connects every layer to every other layer. This architecture, called DenseNet, is built from dense blocks, in which the feature maps of all preceding layers are passed as input to every subsequent layer. This approach reuses features, producing compact models and providing a form of implicit deep supervision.

DenseNet has four variants, each with a different number of layers in its four dense blocks. The first variant, DenseNet-121, has the shallowest blocks and is represented in Figure 16. Transition layers lie between the dense blocks and are composed of convolution and pooling operations. The architecture ends with a softmax that performs the final classification. As depth increases, accuracy also increases, with no sign of degradation or overfitting reported [59].

**Figure 16.** DenseNet architecture.
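
To make the dense connectivity concrete, the sketch below shows a minimal PyTorch dense block in which each layer receives the concatenation of all preceding feature maps; it is an illustrative reconstruction, not the reference DenseNet implementation, and the layer composition (batch normalization, ReLU, 3 × 3 convolution) and the `growth_rate` parameter follow the common design.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenated feature maps of all preceding layers."""

    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate every previous feature map along the channel axis.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```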

#### *5.7. EfficientNet*

EfficientNets [60] are architectures with compound scaling, balancing depth, width, and resolution to achieve better performance. The authors proposed a family of models that follow a new compound scaling method, using a coefficient *φ* to scale the three dimensions: depth *d* = *α*<sup>*φ*</sup>, width *w* = *β*<sup>*φ*</sup>, and resolution *r* = *γ*<sup>*φ*</sup>, subject to *α* × *β*<sup>2</sup> × *γ*<sup>2</sup> ≈ 2 and *α* ≥ 1, *β* ≥ 1, *γ* ≥ 1. The constants *α*, *β*, and *γ* are set by a grid search, and the coefficient *φ* indicates how many computational resources are available.
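
As a worked illustration of the compound scaling rule (a sketch, not the authors' code), the snippet below scales a hypothetical baseline with the constants reported for EfficientNet, *α* ≈ 1.2, *β* ≈ 1.1, *γ* ≈ 1.15, which satisfy *α* × *β*<sup>2</sup> × *γ*<sup>2</sup> ≈ 2.

```python
# Compound scaling: d = alpha**phi, w = beta**phi, r = gamma**phi,
# with alpha * beta**2 * gamma**2 ~= 2 (constants from the grid search in [60]).
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return depth, width, and resolution multipliers for a given phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(1, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```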

The models are generated by a multi-objective neural architecture search that optimizes both accuracy and floating-point operations (FLOPs). They have a baseline structure inspired by residual networks and built around the mobile inverted bottleneck block (MBConv). The base model, known as EfficientNet-B0, is presented in Figure 17.

As more computational resources become available, the blocks can be scaled up, from EfficientNet-B1 to B7. The approach searches for the three constants on a small baseline network, avoiding the cost of searching on the larger models [60]. EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet, a state-of-the-art result at the time of its publication.

#### **6. Ensemble Learning Approaches**

Ensemble learning has gained attention in artificial intelligence, machine learning, and neural networks. It builds multiple classifiers and combines their outputs to reduce variance. By mixing classifiers, ensemble learning improves task accuracy compared to a single classifier [61]. Improving predictive uncertainty estimation is a difficult task, and an ensemble of models can help with this challenge [62]. There are several ensemble approaches, each suited to specific tasks: Dynamic Selection, Sampling Methods, Cost-Sensitive Scheme, Patch-Ensemble Classification, Bagging, Boosting, Adaboost, Random Forest, Random Subspace, Gradient Boosting Machine, Rotation Forest, Deep Neural Decision Forests, Bounding Box Voting, Voting Methods, Mixture of Experts, and Basic Ensemble [61,63–68].

For deep learning applications, an ensemble classification model generates results from "weak classifiers" and integrates them through a fusion function to reach the final result [65,66]. This fusion can be performed by hard voting, choosing the most voted class, or by soft voting, taking a plain or weighted average of the predicted probabilities [68]. Equation (7) illustrates the soft voting ensemble method. The vector *P*(*n*) represents the class probabilities of the *n*-th classifier, and *W*(*n*) is the weight of the *n*-th classifier, in the range from 0 to 1. The final classification is the sum of all probability vectors used in the ensemble, each multiplied by its respective weight. The weights must satisfy the constraint described in Equation (8)

$$E = \sum\_{1}^{n} P(n)\mathcal{W}(n);\tag{7}$$

$$\sum\_{1}^{n} W(n) = 1.\tag{8}$$
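
A minimal NumPy sketch of the soft voting rule in Equations (7) and (8) is given below; the function and variable names are illustrative and do not come from the cited works.

```python
import numpy as np

def soft_vote(probabilities, weights):
    """Weighted soft voting: E = sum_n W(n) * P(n), with sum_n W(n) = 1."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "weights must sum to 1 (Equation (8))"
    probabilities = np.asarray(probabilities, dtype=float)  # shape: (n_classifiers, n_classes)
    ensemble = np.sum(weights[:, None] * probabilities, axis=0)  # Equation (7)
    return ensemble, int(np.argmax(ensemble))

# Example with two classifiers and two classes.
probs, winner = soft_vote([[0.7, 0.3], [0.4, 0.6]], weights=[0.5, 0.5])
print(probs, winner)  # [0.55 0.45] 0
```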

Ensembles remain a topic of discussion and have been applied with deep learning and computer vision in tasks such as insulator fault detection using aerial images [67], recognition of transmission equipment in images taken by an unmanned aerial vehicle [68], and remote sensing image classification [69]. There are also studies that compare imbalanced ensemble classifiers [63].

#### **7. Methodology**

The authors defined a methodology divided into two stages. The first stage assesses the performance of CNNs trained on a synthetic dataset and then improves it by fine-tuning the CNNs with real images. The second stage evaluates a blending ensemble approach that uses both RGB and RGB-D images.

#### *7.1. Stage I-Training the CNNs*

The first stage consists of training different architectures on both synthetic domains and then training the best models with real data. Figure 18 shows the procedure of Stage I. Each CNN will have two models, one trained on RGB and the other on RGB-D images. Starting from weights previously trained on ImageNet, fine-tuning will update the models. The datasets used for this training procedure are S-RGB and S-RGBD (step 1 in Figure 18).

**Figure 18.** Diagram for Stage I-Training the CNNs.

After the training step, all models will be tested on the real datasets corresponding to each domain, namely R-RGB-1 and R-RGBD-1 (step 2). The CNN with the best overall performance on RGB will be selected (step 3). The chosen CNN will be fine-tuned on a real dataset called R-RGB-2 in order to mitigate the domain shift (step 4). The same procedure will be applied to the best model on RGB-D, which will be fine-tuned on R-RGBD-2, also a real dataset.

Then, they will be tested again on R-RGB-1 and R-RGBD-1 (step 5). The expectation is to improve on the models trained only on synthetic images. Finally, a comparison of the tests conducted in steps 2 and 5 will show whether the procedure outperformed the CNNs trained only on the synthetic domain (step 6).
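
As an illustrative sketch of the fine-tuning step (not the authors' code; the chosen backbone, optimizer, and hyperparameters are assumptions), an ImageNet-pretrained ResNet50 could be adapted to the two-class problem as follows:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the classifier head
# with a two-class output (insulator vs. brace band).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

# Fine-tune all layers with SGD on the synthetic (or real) training set.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:  # loader yields (batch, 3, H, W) tensors and class indices
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```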

#### *7.2. Stage II-Blending Pipelines*

In this stage, the chosen fine-tuned models will be blended in an ensemble approach. The goal is to extract information from the scenes using one pipeline for RGB and another for RGB-D, seeking to increase classification performance over the single-CNN approach of Stage I. The ensemble blends the CNNs by applying soft voting to their outputs, averaging the probabilities with equal weights for the two pipelines, even though the architectures were trained separately. Equation (9) expresses the result of the ensemble approach

$$E = P\_c W\_c + P\_d W\_d.\tag{9}$$

The *P<sub>c</sub>* in Equation (10) is the vector of probabilities produced by the RGB pipeline. The value *c<sub>ci</sub>* is the probability of the class being an insulator, whereas *c<sub>cb</sub>* is the probability of it being a brace band

$$P\_c = [c\_{ci}, \; c\_{cb}].\tag{10}$$

Similarly, *P<sub>d</sub>* in Equation (11) is the vector of probabilities produced by the RGB-D pipeline. The value *c<sub>di</sub>* is the probability of the class being an insulator, whereas *c<sub>db</sub>* is the probability of it being a brace band

$$P\_d = [c\_{di}, \; c\_{db}].\tag{11}$$

The probability vector of the color pipeline, *P<sub>c</sub>*, is multiplied by its corresponding weight *W<sub>c</sub>*. The same holds for the depth pipeline, where the vector *P<sub>d</sub>* is multiplied by its weight *W<sub>d</sub>*. Both weights receive the value of 0.5 to guarantee equal influence of the classifiers. The ensemble results in a vector of probabilities, which is handled by the final decision step: the class with the highest probability is chosen as the final decision for the scene.
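
A minimal sketch of this blending and decision step, assuming the two fine-tuned pipelines expose their softmax probabilities (the function name `blend_decision` and the example values are illustrative, not from the original work):

```python
import numpy as np

CLASSES = ["insulator", "brace_band"]
W_C, W_D = 0.5, 0.5  # equal weights for the RGB and RGB-D pipelines

def blend_decision(p_c, p_d):
    """Blend the two probability vectors (Equation (9)) and pick the final class."""
    ensemble = W_C * np.asarray(p_c) + W_D * np.asarray(p_d)
    return CLASSES[int(np.argmax(ensemble))], ensemble

# Example: RGB pipeline favors insulator, RGB-D pipeline slightly favors brace band.
label, probs = blend_decision([0.8, 0.2], [0.45, 0.55])
print(label, probs)  # insulator [0.625 0.375]
```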

The blend will be tested on the real scenes and compared with the results of the best CNNs from Stage I. Figure 19 shows the structure of the blended approach.

**Figure 19.** Blending CNN pipelines approach.

One advantage of the blended pipeline can be a more general classification, taking into account features extracted from both color and depth images. The two domains can present different features, and combining them in a blend may improve classification, as the tool becomes more sensitive to features from both RGB and RGB-D images. Nevertheless, the pipelines must be tuned to similar and suitable accuracies; otherwise, the weaker one would pull the average down.
