*7.3. Dataset Description*

This experiment involves four domains: synthetic RGB, synthetic RGB-D, real RGB, and real RGB-D. Each scene has data in two domains (color and depth); for instance, scene 01 has a corresponding image in both RGB and RGB-D. This holds for scenes in both the synthetic and the real datasets. Figure 20 shows all four domains of the experiment.

**Figure 20.** The four domains of the experiment: (**a**) real RGB; (**b**) real RGB-D; (**c**) synthetic RGB and (**d**) synthetic RGB-D.

Figure 20 shows two scenes with color and depth images. Figure 20a,b offer a good example of why blending helps classification: Figure 20a has poor lighting conditions, and (b) can be used to capture features that are difficult to obtain from (a).

For the experiments, six datasets were created from the synthetic and real images. Two synthetic datasets, S-RGB and S-RGBD, were used to train the CNNs. Two real RGB datasets, R-RGB-1 and R-RGB-2, were used to test and to fine-tune the CNNs, respectively. Likewise, two real RGB-D sets, R-RGBD-1 and R-RGBD-2, were created to test and fine-tune the CNNs. Table 1 lists these datasets.


**Table 1.** Datasets of the experiment.

#### **8. Results**

The extensive tests conducted in this paper led to the results presented in this section. As mentioned in Section 7, they are organized into subtopics: the first assesses the CNNs' performance when trained on synthetic data (the best models were then fine-tuned on a set of real images), and the second evaluates the ensemble method using color and depth images.

#### *8.1. Stage I-Training the CNNs*

In this stage, the proposed architectures were trained with a transfer learning technique: the models were pre-trained on ImageNet and fine-tuned on the synthetic datasets. The experiment was configured as follows: 2 classes; batch size of 16; 30 epochs; 5-fold cross-validation; adaptive moment estimation (Adam) as the optimizer (except for ResNet and AlexNet, which used Stochastic Gradient Descent (SGD) due to incompatibilities); and a learning rate of 0.001 [55,70–73].

At the end of each training and validation run, the CNNs were evaluated on the test datasets R-RGB-1 and R-RGBD-1. These test sets were not used for tuning the hyperparameters or weights of the networks; they are separate sets of real images, used to verify the ability of models trained on synthetic data to perform in a real domain. The metrics used to evaluate the models are accuracy, precision, recall, and f1-score. Figure 21 shows the evaluation of the models trained on synthetic RGB and tested on R-RGB-1.
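The four evaluation metrics can be computed with scikit-learn. The labels below are purely illustrative placeholders, not data from the paper:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions for the 2-class task
# (0 = insulator, 1 = brace band); values are illustrative only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall   :", recall_score(y_true, y_pred))     # 0.75
print("f1-score :", f1_score(y_true, y_pred))         # 0.75
```

With two classes, `precision_score` and `recall_score` default to treating class 1 as the positive class; a per-class breakdown would use `average=None`.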

**Figure 21.** Evaluation of models on R-RGB-1.

DenseNet achieved the second-highest accuracy and recall, as well as the best f1-score. Since the f1-score takes both precision and recall into account, DenseNet was more stable than Inception-V3 on these metrics. Inception-V3 achieved the best accuracy; however, DenseNet was chosen as the best overall performer, mainly due to its stability and its marginally higher f1-score. EfficientNet and VGGNet suffered from the domain shift and did not perform well on the test, with f1-scores lower than 50%.

For the depth domain, the results are shown in Figure 22. The models were trained on the synthetic S-RGBD dataset and tested on R-RGBD-1.

**Figure 22.** Evaluation of models on R-RGBD-1.

ResNet-50 outperformed the other models, achieving the best accuracy and f1-score on the test. VGGNet and EfficientNet had the lowest values for accuracy, recall, and f1-score.

The best models for each domain were DenseNet (RGB images) and ResNet (RGB-D images). Therefore, these architectures were selected for fine-tuning on the sets of real images, R-RGB-2 and R-RGBD-2. The models trained on synthetic data were fine-tuned on the real sets to tackle the domain shift problem.

DenseNet was fine-tuned on R-RGB-2 with a batch size of 8, 30 epochs, 5-fold cross-validation, the Adam optimizer, and a learning rate of 0.001. The average training accuracy and loss were 91.31% and 0.236, respectively. The model was then tested on R-RGB-1. To verify whether training first with synthetic and then with real images is the best approach, DenseNet was also fine-tuned directly on R-RGB-2. The confusion matrices for these two tests are shown in Figure 23.

Fine-tuning the model on S-RGB and then on R-RGB-2 (a) outperformed the model trained directly on R-RGB-2 (b). The model pre-trained on the synthetic domain missed only 5 insulators and 8 brace bands, reaching an accuracy of 94.84%; while (b) missed only 1 insulator, it also missed 20 brace bands, achieving an accuracy of 91.67%.

ResNet was fine-tuned on the real set R-RGBD-2 with a batch size of 8, 30 epochs, 5-fold cross-validation, the Adam optimizer, and a learning rate of 0.001. The average training accuracy and loss were 95.78% and 0.097, respectively. The model was tested on R-RGBD-1. To verify the proposed approach, ResNet was also fine-tuned directly on the real dataset R-RGBD-2. The confusion matrices for these two tests are shown in Figure 24.

**Figure 23.** Confusion matrices of R-RGB-1 tested on: (**a**) DenseNet trained on the S-RGB and R-RGB-2 datasets and (**b**) DenseNet trained on R-RGB-2.

The approach of fine-tuning the model on the synthetic S-RGBD dataset and then on the real set R-RGBD-2 (Figure 24a) outperformed the model trained directly on the real set R-RGBD-2 (Figure 24b). The model pre-trained on the synthetic domain (a) missed only 4 insulators and 19 brace bands, with an accuracy of 90.87%, in contrast to 7 insulators and 22 brace bands and an accuracy of 88.49% for (b).
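As a sanity check, the accuracies reported for Figures 23 and 24 can be reproduced from the confusion-matrix error counts. The total of 252 test scenes used below is inferred from those counts; it is not stated explicitly in the text:

```python
# Error counts (missed insulators + missed brace bands) and reported accuracy
# for Figures 23a, 23b, 24a, and 24b. The test-set size of 252 scenes is an
# inference from these numbers, not a figure stated in the text.
total = 252
cases = [(5 + 8, 94.84), (1 + 20, 91.67), (4 + 19, 90.87), (7 + 22, 88.49)]
for errors, reported in cases:
    acc = round(100 * (total - errors) / total, 2)
    print(acc, reported)  # the computed and reported values match in every case
```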

**Figure 24.** Confusion matrices of R-RGBD-1 tested on: (**a**) ResNet trained on the synthetic and real datasets and (**b**) ResNet trained on the real dataset.

### *8.2. Stage II-Blending Pipelines*

The DenseNet and ResNet models fine-tuned on the synthetic and real domains were blended in an ensemble approach. The class probabilities inferred by the RGB pipeline (DenseNet) and the RGB-D pipeline (ResNet) were averaged in a soft-voting operation with equal weights, meaning that both pipelines had the same influence on the final result. The test set used for the blended pipelines is the same one used for testing the CNNs in Stage I: the scenes from R-RGB-1 and R-RGBD-1. The confusion matrix for this test is shown in Figure 25.
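Equal-weight soft voting reduces to averaging the two probability vectors and taking the argmax. A minimal NumPy sketch, with purely illustrative probabilities (not values from the experiment):

```python
import numpy as np

# Hypothetical per-scene class probabilities (columns: insulator, brace band)
# produced by each pipeline; the numbers are illustrative only.
p_rgb = np.array([[0.9, 0.1],    # DenseNet on the RGB image of each scene
                  [0.4, 0.6]])
p_rgbd = np.array([[0.7, 0.3],   # ResNet on the RGB-D image of each scene
                   [0.2, 0.8]])

# Equal-weight soft voting: average the probabilities, then take the argmax.
p_blend = (p_rgb + p_rgbd) / 2
pred = p_blend.argmax(axis=1)
print(p_blend)  # [[0.8 0.2] [0.3 0.7]]
print(pred)     # [0 1]
```

Unequal weights would replace the plain average with `w * p_rgb + (1 - w) * p_rgbd`, which is the tuning direction mentioned in the future-work discussion.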

As can be seen in Figure 25, the blended approach did not miss any insulators and missed only 12 brace bands. The test accuracy was 95.24%.

The blended approach was then compared with the performance of each CNN individually; this comparison is shown in Table 2. The blend outperformed DenseNet and ResNet in accuracy, precision, and f1-score. It also achieved a better recall than ResNet, although it did not outperform DenseNet on this particular metric: since DenseNet misclassified only eight brace bands, it performed better on recall, but it also wrongly classified five insulators.

**Figure 25.** Confusion matrix of blended pipelines.



Table 3 shows the percentage difference across all metrics between the single CNNs and the blend. The proposed mixed pipelines achieved improvements of 0.39% in accuracy, 2.84% in precision, and 0.21% in f1-score over the best result of each single CNN. They also showed a drop of 2.57% in recall compared with DenseNet.

**Table 3.** Difference in percentage of Blended CNN and single CNNs.


#### **9. Discussion and Conclusions**

To create the synthetic dataset, Blender was used with Eevee as the render engine because of its speed: Eevee was almost seven times faster than the Cycles engine.

For the real dataset, the data gathering was performed by a mechanical device built before the automated stacker was constructed; this made it possible to test the CNNs on real data. The BRISQUE IQA algorithm and histogram analysis provided quality control for the real images.
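The histogram-based processing mentioned here and in the conclusions (equalizing the images to enhance their features) can be illustrated with a minimal NumPy sketch for an 8-bit grayscale image. This is a generic textbook procedure, assumed for illustration; the paper's exact pipeline is not detailed in this section:

```python
import numpy as np

def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale image (NumPy sketch)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero CDF value
    # Map each grey level so the output histogram is roughly uniform;
    # max(..., 1) guards against a constant image.
    lut = np.round((cdf - cdf_min) / max(img.size - cdf_min, 1) * 255)
    return lut.astype(np.uint8)[img]

# Low-contrast test image: values squeezed into the narrow range [100, 131].
img = (np.arange(64, dtype=np.uint8).reshape(8, 8) // 2) + 100
out = equalize_hist(img)
print(out.min(), out.max())  # 0 255 (contrast stretched to the full range)
```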

Of the seven CNNs fine-tuned on the synthetic dataset and tested in the real domain, the best performers for RGB and RGB-D were DenseNet and ResNet-50, respectively. Training on synthetic datasets and testing on real samples revealed a domain shift problem, a common issue discussed in recent studies. To counter this problem, a set of real images was set aside and used to fine-tune DenseNet and ResNet. Fine-tuning the models first with synthetic and then with real images yielded better classification than fine-tuning directly on real images. The use of synthetic images generated by Blender and rendered with Eevee proved helpful to classification performance.

The proposed blend of RGB and RGB-D pipelines was used to improve the recognition of insulators and brace bands. Since the scenes carry both color and depth information, the two pipelines were applied in an ensemble approach, each contributing equally to the output prediction in a soft-voting decision; the final classification is the average of the probabilities from the color and depth pipelines. The blended approach outperformed the best results of the single CNNs, the only exception being the recall metric in the DenseNet color test; even so, the blended pipelines misclassified fewer items than DenseNet. Blending color and depth CNN pipelines also achieved better accuracy than the previous study, in which a ResNet-50 classified the insulators and brace bands from RGB images [2]: the present approach reached an accuracy of 95.23%, compared with the 92.87% reported in Piratelo et al. [2].

This paper presented a blend of convolutional neural network pipelines that classifies products from an electrical company's warehouse using color and depth images from real and synthetic domains. The research also compared training the architectures only with real data against training with synthetic and then real data. Stage I consisted of training several CNNs on a synthetic dataset and testing them in the real domain. The architectures that performed best on RGB and RGB-D images were DenseNet and ResNet-50, although all models suffered from the domain shift. This issue was addressed by fine-tuning the CNNs on a real set of data, which improved the accuracy, precision, recall, and f1-score of the models compared with training on real data only, showing that the synthetic images helped to train the models. Training all CNNs on synthetic and real images was time-consuming and computationally demanding, yet it proved to be a valid method for improving accuracy.

In Stage II, the DenseNet model trained on RGB images served as the first pipeline, and the ResNet model trained on RGB-D images composed the second. Each contributed equally to the final classification through the average of their probabilities in a soft-voting method. A poor performance by one CNN could drag down the final accuracy of the ensemble, so both pipelines must be tuned to a suitable accuracy. Moreover, working with two models requires more computational resources than the single-CNN inferences of Stage I. However, the ensemble led to a more robust classification, taking advantage of the color and depth information provided by the images and extracting features from both domains. The blended pipelines outperformed the single CNNs in accuracy, precision, and f1-score.

This approach is intended to be encapsulated and used to keep the warehouse inventory up to date. The classification task is handled by computer vision and artificial intelligence, making full use of RGB and RGB-D images from the synthetic and real domains in a blended-CNN-pipeline approach. The quality of the captured images is also examined by an IQA method called BRISQUE, and the images are equalized to enhance their features.

This study classifies two types of objects: insulators and brace bands. A transfer learning method called fine-tuning took advantage of pre-trained models, using their feature detection capacity and adapting it to the problem; this technique was applied twice, first with synthetic and then with real images. The extensive tests conducted in this paper evaluated the performance of the CNNs individually and blended in an ensemble approach. In the end, the blend was able to classify the objects with an average accuracy of 95.23%. The tool will be included in the prototype of an AGV that travels the entire warehouse capturing images of the shelves, combining automation, deep learning, and computer vision in a real engineering problem. This is the first step toward automating and digitizing the warehouse, pursuing the company's goals of real-time data analysis, smart sensing, and perception.

Future work includes testing different weights for the pipelines to find their best combination, exploring alternatives to the ensemble learning technique, and studying how the combination of the synthetic and real domains influences model training.

**Author Contributions:** Conceptualization, P.H.M.P.; methodology, P.H.M.P.; software, P.H.M.P., R.N.d.A. and F.S.M.L.; validation, E.M.Y., L.d.S.C. and G.V.L.; formal analysis, J.F.B.F.; writing original draft preparation, P.H.M.P. and R.N.d.A.; writing—review and editing, P.H.M.P.; R.N.d.A.; E.M.Y.; J.F.B.F.; G.M.; F.S.M.L.; L.P.d.J.; R.d.A.P.N.; L.d.S.C.; G.V.L.; supervision, E.M.Y.; project administration, E.M.Y.; funding acquisition, E.M.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by COPEL (power utility from the state of Paraná, Brazil), under the Brazilian National Electricity Agency (ANEEL) Research and Development program PD-02866-0011/2019. The APC was funded by LACTEC.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The authors thank LACTEC for the assistance.

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

### **References**

