*4.1. Automated 2D Image Generation by the Unreal Engine*

In the previous research [24], the Blender 3D software was combined with the Python scripting language for automated generation of the CNN training set, but the Blender rendering engine does not provide cinematic quality of the generated samples. This disadvantage decreases the classification precision of the trained convolutional networks significantly, by about 20 to 30%. The new approach is based on cinematic rendering in the Unreal Engine combined with the Blueprint and Python scripting languages. Another new feature is dynamic collision, used for realistic rendering of the shadow placed under the generated samples on different surfaces. An example of the part position in the Unreal Engine editor before the dynamic simulation (left) and the part with its shadow after the simulation (right) is shown in Figure 6.

**Figure 6.** An example project in the Unreal Engine: (**a**) 3D virtual model of a screw in the Unreal Engine editor; (**b**) the model after dynamic simulation with cinematic texture and shadow.

The basic algorithm is coded in the Blueprint Scripting Language; an example of a subprogram for 3D virtual part rotation about the Z-axis is shown in Figure 7.

**Figure 7.** An example of a visual script for automated part rotation about the Z-axis, coded in the Blueprint Scripting Language.
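The same rotate-and-capture idea can also be expressed with the Unreal Editor Python API. The following is only a minimal sketch, not the authors' Blueprint: the use of the currently selected actor, the angle step, the output resolution, and the file names are illustrative assumptions.

```python
# A minimal sketch of the rotate-and-capture loop using the Unreal Editor Python API.
# The selected actor, angle step, resolution, and file names are illustrative assumptions.
import unreal

ANGLE_STEP = 20            # assumed Z-axis rotation step in degrees (20-360 allowed)
RES_X, RES_Y = 1024, 600   # assumed output resolution matching the chosen CNN model

# Use the actor currently selected in the editor as the part to rotate.
part = unreal.EditorLevelLibrary.get_selected_level_actors()[0]

for angle in range(0, 360, ANGLE_STEP):
    # Rotate the part about the Z-axis (yaw) only; pitch and roll stay at zero.
    part.set_actor_rotation(unreal.Rotator(roll=0.0, pitch=0.0, yaw=float(angle)), False)
    # Capture one 2D training sample per pose.
    unreal.AutomationLibrary.take_high_res_screenshot(
        RES_X, RES_Y, "part_rot_{:03d}.png".format(angle))
```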

The initial parameters for 2D sample generation are the rotation angle, the type of CNN model (which determines the basic image resolution), the number of generated backgrounds acting as floors, and the annotation file type. The selection of the basic parameters in the Unreal HUD menu before the automated generation starts is shown in Figure 8. The full assembly of the cam switch consists of 31 different parts, the basic setup uses 5 different floor textures, and the rotation angle about the Z-axis can be set from 20° to 360°.

**Figure 8.** The Unreal Engine application setup of basic parameters for automated generation of the 2D training and testing sample set.
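For clarity, the initial parameters listed above can also be written down as a small configuration record. The following sketch is illustrative only; the field names and default values are assumptions, not the authors' actual HUD data structure.

```python
# Illustrative summary of the generation parameters from the HUD in Figure 8;
# field names and defaults are assumptions, not the authors' data structure.
from dataclasses import dataclass

@dataclass
class GenerationSetup:
    rotation_step_deg: int = 20                   # Z-axis rotation step, 20-360 degrees
    cnn_model: str = "faster_rcnn_inception_v2"   # determines the basic image resolution
    image_size: tuple = (1024, 600)               # width x height derived from the model
    floor_textures: int = 5                       # number of generated background floors
    annotation_type: str = "xml"                  # "xml" for SSD, "json" for segmentation

setup = GenerationSetup()
```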

### *4.2. Automated Annotation by OpenCV Algorithms*

The input condition for automated annotation is a black background and a binary threshold to obtain clear object edges. Two basic annotation methods were selected for the 2D images generated from the 3D virtual models: bounding box annotation for single shot detection (SSD), stored as XML, and polygon contour annotation for instance segmentation, stored as JSON.


The process of automated SSD annotation and evaluation is shown in Figure 9. The resolution of the generated images can be matched exactly to the CNN model used. The first tested model was Faster R-CNN with Inception v2 with a default resolution of 600 × 1024 × 3. Test samples are randomly separated from the training set for every generated part, with a default ratio of 25%.

**Figure 9.** An autogeneration of the part localization by OpenCV for SSD: (**a**) binary threshold; (**b**) contour detection with bounding box generation; (**c**) LabelImg check of XML data generation.
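A minimal OpenCV sketch of the bounding box autogeneration from Figure 9 could look as follows (assuming OpenCV 4.x); the threshold value, file names, and helper function are illustrative, while the Pascal VOC style XML is the layout read by LabelImg for checking.

```python
# Minimal sketch of automated bounding box annotation (Figure 9), assuming OpenCV 4.x.
# Threshold value, file names, and the helper itself are illustrative assumptions.
import cv2
import xml.etree.ElementTree as ET

def annotate_ssd(image_path, xml_path, class_name):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binary threshold: everything brighter than the black background is the object.
    _, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # One part per image: take the largest contour and its bounding box.
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))

    # Write a Pascal VOC style XML annotation that LabelImg can open for checking.
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = image_path
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(img.shape[1])
    ET.SubElement(size, "height").text = str(img.shape[0])
    ET.SubElement(size, "depth").text = str(img.shape[2])
    obj = ET.SubElement(root, "object")
    ET.SubElement(obj, "name").text = class_name
    box = ET.SubElement(obj, "bndbox")
    ET.SubElement(box, "xmin").text = str(x)
    ET.SubElement(box, "ymin").text = str(y)
    ET.SubElement(box, "xmax").text = str(x + w)
    ET.SubElement(box, "ymax").text = str(y + h)
    ET.ElementTree(root).write(xml_path)

annotate_ssd("plastic_button_00007.png", "plastic_button_00007.xml", "plastic_button")
```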

Segmentation needs much more precise thresholding than single shot detection. A morphological closing algorithm was used to obtain a precise contour of the object. An example of thresholding with object contour closing is shown in Figure 10.

**Figure 10.** An autogeneration of the part localization by OpenCV for Segmentation: (**a**) the input image with black background from the Unreal Engine; (**b**) the result of OpenCV algorithm for precise contour generation; (**c**) an example of manual contour selection in VGG Image Annotator.
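The thresholding and closing step from Figure 10 can be sketched in OpenCV as follows; the kernel size and threshold value are illustrative choices, not the values used by the authors.

```python
# Minimal sketch of the thresholding and morphological closing step from Figure 10.
# Kernel size and threshold value are illustrative assumptions (OpenCV 4.x).
import cv2
import numpy as np

gray = cv2.imread("part_render.png", cv2.IMREAD_GRAYSCALE)
_, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)

# Closing (dilation followed by erosion) fills small holes and gaps in the mask
# so that a single precise outer contour of the part can be extracted.
kernel = np.ones((7, 7), np.uint8)
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
part_contour = max(contours, key=cv2.contourArea)
```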

The XML format for SSD is accepted as a standard, but instance segmentation has many formats: COCO JSON, CSV, LabelMe JSON, RLE, etc. The simplest JSON structure is the LabelMe format with a polygon shape, which can be easily implemented into the automated process of contour annotation by OpenCV. Automated contour detection by OpenCV can provide a better contour than the manual process and can significantly improve CNN instance segmentation after the training process, as can be seen in Figure 11.

**Figure 11.** OpenCV detected contour representation: (**a**) binarized image; (**b**) all detected points; (**c**) reduced number of points optimized by Douglas-Peucker algorithm.
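Once the precise contour is available, the point reduction from Figure 11 and the LabelMe-style export can be sketched as follows; the epsilon value, label, and file names are illustrative assumptions.

```python
# Minimal sketch of Douglas-Peucker point reduction (cv2.approxPolyDP) and export
# of a contour as a LabelMe-style JSON polygon; values are illustrative assumptions.
import cv2
import json

def export_labelme(contour, image_path, json_path, label, width, height):
    # Reduce the number of contour points with the Douglas-Peucker algorithm.
    epsilon = 0.002 * cv2.arcLength(contour, True)
    polygon = cv2.approxPolyDP(contour, epsilon, True).reshape(-1, 2)

    annotation = {
        "shapes": [{
            "label": label,
            "points": polygon.tolist(),      # [[x1, y1], [x2, y2], ...]
            "shape_type": "polygon",
        }],
        "imagePath": image_path,
        "imageWidth": width,
        "imageHeight": height,
    }
    with open(json_path, "w") as f:
        json.dump(annotation, f, indent=2)

# Example call, using the contour obtained after morphological closing:
# export_labelme(part_contour, "part_render.png", "part_render.json",
#                "plastic_button", 1024, 600)
```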

#### *4.3. The Generated Training and Testing Sample Set*

An example of the results of automated sample generation, with the XML annotations converted to CSV files, is shown in Figure 12.

To start the CNN model training process, it is only necessary to copy the *train* and *test* folders together with the two cumulated annotation *CSV* files into the TensorFlow folder and create TF\_record files for training and testing.

**Figure 12.** An example of the autogenerated training/tests set with XML bounding box annotation.
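The conversion of a cumulated CSV annotation file into a TF\_record file can be sketched as follows (TensorFlow object-detection style); the CSV column names, the label map, and the file names are illustrative assumptions based on the XML-to-CSV conversion described above.

```python
# Minimal sketch of converting one cumulated CSV annotation file into a TF record
# (TensorFlow object-detection style). Column names, the label map, and file names
# are illustrative assumptions, not the authors' conversion script.
import pandas as pd
import tensorflow as tf

LABEL_MAP = {"plastic_button": 3}   # part name -> class id, assumed

def row_to_example(row, image_bytes):
    w, h = float(row["width"]), float(row["height"])
    feature = {
        "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "image/format": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"png"])),
        "image/height": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(h)])),
        "image/width": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(w)])),
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=[row["xmin"] / w])),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=[row["xmax"] / w])),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=[row["ymin"] / h])),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=[row["ymax"] / h])),
        "image/object/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=[LABEL_MAP[row["class"]]])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

df = pd.read_csv("train_labels.csv")
with tf.io.TFRecordWriter("train.record") as writer:
    for _, row in df.iterrows():
        with open(row["filename"], "rb") as f:
            writer.write(row_to_example(row, f.read()).SerializeToString())
```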

All necessary information to identify the sample is encoded in the sample image name, e.g., **00007 3plastic\_button pic.png** or **00007 3plastic\_button picCV.png**, where:

```
00007—number of the generated image [image 7]
3—part identification in the assembly [part 3]
plastic_button—part name
pic—image type
CV—OpenCV generated binary image with black background
.png/.xml/.json—the type of file necessary for the TF record algorithm
```
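For illustration only, the fields above can also be decoded programmatically; the regular expression below is one possible interpretation of the naming convention, and the exact separators are an assumption.

```python
# Illustrative decoder of the sample naming convention above; the exact separators
# used in the regular expression are an assumption.
import re

NAME_PATTERN = re.compile(
    r"(?P<image>\d{5}) (?P<part_id>\d+)\s?(?P<part_name>[a-z_]+)\s?pic(?P<cv>CV)?"
    r"\.(?P<ext>png|xml|json)")

m = NAME_PATTERN.match("00007 3plastic_button picCV.png")
print(m.group("image"))           # '00007' -> generated image number 7
print(m.group("part_id"))         # '3'     -> part 3 of the assembly
print(m.group("part_name"))       # 'plastic_button'
print(m.group("cv") is not None)  # True    -> OpenCV binary image, black background
print(m.group("ext"))             # 'png'
```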

#### **5. Experimental Results and Implementation into the Assembly Process**

An initial experiment of cam switch parts recognition was executed using a small set of training samples (five per part, 155 samples altogether) with different floor textures. Considering this small training set, the obtained results are acceptable (see Table 1). The training process for Inception V2 is shown in Figure 13, where the X-axis shows the number of training cycles and the Y-axis shows the mAP.

**Table 1.** The acquired results by autogenerated training samples based on virtual 3D models.


<sup>1</sup> Both CNN models used weights pretrained on the COCO dataset.

CNN models with single shot detection can be retrained very fast by transfer learning, with acceptable results within less than 2 h of training without a dedicated GPU (for example, the used pretrained Faster RCNN Inception V2 SSD reached the required accuracy within 1.35 h). CNN models with segmentation need 4–5 times more training time for the same input samples (for example, the used pretrained Mask RCNN Resnet101 reached the required accuracy in up to 5.20 h). However, in contrast to SSD, where a model is saved after an unpredictable number of iterations, the models with segmentation are stored after each training epoch. The results of recognition of other images (not from the training set) of some parts for both tested CNN models (SSD and instance segmentation) are shown in Figure 14.

**Figure 13.** Training graphs for classification (**left**) and localization (**right**).

**Figure 14.** Inference experiments with the SSD trained model (**top**) and instance segmentation (**bottom**).

The inference time experiments with the trained CNN models (SSD and Mask) were performed on many different platforms. The inference times presented in Tables 2 and 3 are average values obtained from repeated test loops recognizing 40 sample images. The obtained results for the CNN SSD model Inception V2 with TensorFlow 1 are shown in Table 2, and those for the CNN segmentation model Resnet101 with TensorFlow 2 and Pixelib in Table 3.
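A minimal sketch of how such an averaged inference delay could be measured is shown below (TensorFlow 2 saved-model style); the model path, the random test images, and the absence of a warm-up pass are illustrative assumptions, not the authors' benchmark script.

```python
# Minimal sketch of averaging the inference delay over 40 sample images
# (TensorFlow 2 saved-model style); the model path and image data are illustrative.
import time
import numpy as np
import tensorflow as tf

detect_fn = tf.saved_model.load("exported_model/saved_model")
samples = [np.random.randint(0, 255, (600, 1024, 3), np.uint8) for _ in range(40)]

delays = []
for img in samples:
    batch = tf.convert_to_tensor(img)[tf.newaxis, ...]   # add the batch dimension
    start = time.perf_counter()
    _ = detect_fn(batch)                                  # one forward pass per image
    delays.append(time.perf_counter() - start)

print("average inference delay: {:.1f} ms".format(1000 * sum(delays) / len(delays)))
```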

**Table 2.** The acquired inference results for the SSD CNN model.



**Table 3.** The acquired inference results for the Mask CNN model.

The FP16 SSD Inception V2 CNN model can reach about 3 FPS, which is an acceptable parts identification delay for checking worker assembly tasks and the collaborative robot assembly status. The experiment with the Mask RCNN segmentation model reached a delay of about 700 ms on the AGX device, which is acceptable in comparison to a desktop PC with a high-performance CPU.

The mixed reality device is based on the ARM64 architecture, which does not provide enough computing power for the execution of the trained inference model. The new approach is to stream the video data to the NVIDIA Xavier AGX, which runs the inference model and sends back only the extracted data: the bounding box, the contour polygon, and the classification result.
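The paper states only which data are returned as feedback; a minimal sketch of such a feedback message could look as follows, where the JSON encoding, field names, endpoint address, and port are all illustrative assumptions.

```python
# Minimal sketch of the feedback data returned by the Xavier inference unit;
# field names, JSON encoding, address, and port are illustrative assumptions.
import json
import socket

feedback = {
    "class_name": "plastic_button",                   # classification result
    "score": 0.97,
    "bounding_box": [412, 228, 655, 403],             # xmin, ymin, xmax, ymax in pixels
    "contour": [[430, 240], [640, 251], [612, 395]],  # reduced polygon points
}

with socket.create_connection(("192.168.0.10", 5005)) as sock:  # assumed endpoint
    sock.sendall(json.dumps(feedback).encode("utf-8"))
```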

The collaborative work cell contains a SMART vision system consisting of three cameras. The primary camera is connected to the NVIDIA Xavier AGX (Figure 15a), where the trained CNN model is uploaded. The second camera is integrated into the mixed reality device (Figure 15b), and the third camera is integrated into the right hand of the collaborative robot (Figure 15c). The principle of parts detection can be described in these steps:


**Figure 15.** SMART vision system of the collaborative work cell: (**a**) CNN processing unit NVIDIA Xavier AGX; (**b**) Mixed reality device camera; (**c**) Collaborative robot ABB Yumi Vision system integrated into the right hand.

The implementation principle of parts recognition in the collaborative work cell is shown in Figure 16.

**Figure 16.** The principle of an experimental assisted assembly work cell with CNN processing unit, mixed reality device, and collaborative robot.

Images captured by all vision systems (the static dual 4K e-con cameras connected to the NVIDIA Xavier AGX running JetPack 4.4, the Cognex 7200 camera integrated into the ABB Yumi right hand, and the Microsoft Hololens 2 internal head camera) are shown in Figure 17.

**Figure 17.** Images captured in the assisted assembly work cell by: (**a**) Nvidia Xavier AGX e-con camera; (**b**) Mixed reality Microsoft Hololens 2 head camera; (**c**) ABB Yumi Cognex Vision system.

The application for the mixed reality device Hololens 2 is coded in the same software (Unreal Engine) as the sample generation application, but it does not use the Python programming language and is coded only in the Blueprint programming language with the UXTool library.

The designed calibration principle that aligns all used vision systems to one Cartesian coordinate system is shown in Figure 18, and their synchronization is realized in these steps:


**Figure 18.** The principle of vision systems synchronization to one Cartesian coordinate.
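Regardless of the concrete calibration steps, expressing a detection from one camera in the shared work-cell frame reduces to applying a rigid homogeneous transform obtained from calibration; the following is a minimal sketch with a purely illustrative calibration matrix.

```python
# Minimal sketch of mapping a point detected in one camera frame into the common
# work-cell Cartesian frame; the calibration matrix values are purely illustrative.
import numpy as np

# Rigid transform (rotation + translation) from the camera frame to the cell frame.
T_cell_cam = np.array([
    [0.0, -1.0, 0.0, 0.35],
    [1.0,  0.0, 0.0, 0.10],
    [0.0,  0.0, 1.0, 0.72],
    [0.0,  0.0, 0.0, 1.00],
])

p_cam = np.array([0.12, 0.05, 0.60, 1.0])   # detected part position in camera frame [m]
p_cell = T_cell_cam @ p_cam                  # the same point in the common cell frame
print(p_cell[:3])
```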

Current deep learning frameworks provide only image augmentation, which merely reduces the number of images that need to be prepared. This means that the most monotonous work in implementing deep learning in real applications still remains. This is the main reason why it is not profitable to use deep learning in small-series assembly tasks. On the other hand, automated generation of training samples from CAD models, which are available before production starts, can help implement more assisted assembly solutions in practice.

The research field of automated generation of training samples for CNN models from 3D virtual models has great potential to expand. Current progress in GPUs with real-time ray tracing can provide new rendering possibilities to reach cinematic quality in object visualization and fast preparation of virtual samples. An interesting project is Kaolin from NVIDIA, a modular differentiable renderer for applications like high-resolution simulation environments, although it is still available only as a research library.

Early research progress is being made on extending the presented and tested solution with parts overlay recognition implemented in the Unreal Engine. An example of the first test implementation with overlays is shown in Figure 19.

**Figure 19.** Parts overlay early research implemented into the Unreal Engine.

#### **6. Conclusions**

Automated generation of training samples based on 3D virtual models is a new approach in the field of deep learning that can save many hours of manual work. The research presented in this article introduces a methodology of CNN training for deep learning implementation in the assisted assembly process. This methodology was evaluated in an experimental SMART manufacturing system with an assisted assembly work cell, using a cam switch as the chosen assembly product from real production, where a fully manual assembly process is still used [31].

To summarize, the following experiments have been performed and the following main research results have been acquired in the field of CNN training for parts recognition in the assisted assembly process:


Future work can be divided into several steps, as there are further plans mainly for extending the current software:


**Author Contributions:** Conceptualization, K.Ž. and J.P.; methodology, M.B.; software, P.L.; validation, V.H. and A.I.; formal analysis, M.B.; investigation, A.H.; resources, J.P.; data curation, K.Ž.; writing—original draft preparation, K.Ž.; writing—review and editing, J.P.; visualization, J.D.; supervision, A.H.; project administration, K.Ž.; funding acquisition, J.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Slovak Research and Development Agency under the contract No. APVV-19-0590 and also by the projects VEGA 1/0700/20, 055TUKE-4/2020 granted by the Ministry of Education, Science, Research and Sport of the Slovak Republic.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The article was written as a result of the successful solving of the Project of the Structural Funds of the EU, ITMS code: 26220220103.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

