**1. Introduction and Related Works**

Collaborative robots and their implementation in the assisted assembly process are an important part of the Industry 4.0 concept. They can work in the same workspace as human workers and perform basic manipulations or simple, monotonous assembly tasks. This area is open to new research, methodology development, and the definition of basic requirements, because real applications in production processes are currently still limited.

The main advantage of using collaborative robots in the assembly process is the minimal transport delay of assembly parts between manual and automated operations. Other benefits include, for example, an integrated vision system for additional inspection of manual operations, and the possibility to provide an interface for digital data collection from sensors and for communication with external cloud platforms.

Appropriate human-robot cooperation can significantly improve assembly time, but both sides must have exactly defined methods of mutual communication. For example, the collaborative robot can check the success of a worker's assembly operation via its integrated camera, and the worker can receive information about this status through a mixed reality device. A mixed reality device can also shorten staff training time. Configuration principles of a collaborative robot in an assembly task were introduced in [1], a framework to implement collaborative robots in manual assembly in [2], and a human-robot collaboration framework for improving ergonomics in [3]. An important condition for the assisted assembly process is the synchronization of augmented (AR), virtual (VR), or mixed (MR) reality devices with the digital twin for full digitalization of the used technology. A comprehensive review of virtual, mixed, and augmented reality for immersive systems research is presented in [4]. Some other research results on the mixed assembly process between humans and collaborative robots are described in [5–7]. An AR-based worker support system for human-robot collaboration using AR libraries was proposed in [8], and an anchoring support system using the AR toolkit was developed in [9]. A novel approach for end-user-oriented no-code process modeling in IoT domains using mixed reality technology is presented in [10], and a holistic analysis towards understanding consumer perceptions of virtual reality devices in the post-adoption phase in [11]. Technologies of virtual, augmented, and mixed reality also play an important role in education and training, as stated, for example, in [12,13].

**Citation:** Židek, K.; Pitel', J.; Balog, M.; Hošovský, A.; Hladký, V.; Lazorík, P.; Iakovets, A.; Demčák, J. CNN Training Using 3D Virtual Models for Assisted Assembly with Mixed Reality and Collaborative Robots. *Appl. Sci.* **2021**, *11*, 4269. https://doi.org/10.3390/app11094269

Academic Editor: Pavol Božek

Received: 13 April 2021; Accepted: 6 May 2021; Published: 8 May 2021

Our research in the field of digital twin implementation into assembly processes started with the digitalization of the experimental manufacturing assembly system described in [14]. A digital twin can visualize the real status of a manufacturing system as a 3D simulation model with real-time actualization, so there is a need for an appropriate simulation model, for example, the generic simulation model of a flexible manufacturing system developed in [15]. The automatic generation of a simulation-based digital twin of an industrial process plant is described in [16]. A basic overview of the usability of digital twins can be found in [17], and the possibility of improving production efficiency through the implementation of digital twins is presented in [18]. Some learning experiences after establishing digital twins are described in [19], and the use of a digital twin to enhance the integration of ergonomics in workplace design in [20]. Recommendations for future research and practice in the use of digital twins in the field of collaborative robotics are given in [21]. This brief overview shows that a digital twin is an important element in the implementation of assisted assembly into the production process.

The next condition for successful implementation of assisted assembly into the production process is synchronization with a recognition system, mainly based on vision devices with SMART features such as object detection and assembly part identification and localization with the parts' actual orientation in a workspace. Based on the knowledge obtained from research on diagnostics of errors on component surfaces by vision recognition systems using machine learning algorithms [22], we started to use a convolutional neural network (CNN) for the recognition of standardized industrial parts (hexagon screw/nut and circular hole assembly elements) trained on real image input [23]. A CNN with deep learning can work reliably for parts recognition, but the problem is the manual preparation of input data for the learning process, because a very large quantity of input samples needs to be prepared, usually several hundred per part. The solution is to replace real image samples with 3D virtual models [24], but an important task is automated sample generation from 3D models, which can be simplified using a web interface [25]. Besides automated input data preparation for CNN training, automated image analysis is also important. Some principles of this process were described in [26]. An interesting case study on the recognition of mark images using deep CNNs was published in [27]. A methodology for synthesizing novel 3D image classifiers by generative enhancement of existing nongenerative classifiers was proposed and verified in [28]. Viewpoint estimation in images using CNNs trained with rendered 3D model views was published in [29], and accurate and fast CNN-based 6DoF object pose estimation using synthetic training in [30].

The rest of this article is structured as follows: after the introduction and related works in this section, the methodology of CNN training using 3D virtual models is introduced in Section 2. Section 3 describes the experimental assisted assembly work cell and the assembled product, and Section 4 presents the principles of 3D virtual model preparation and 2D sample generation for CNN training. Section 5 contains the results and discussion, including the implementation of parts recognition into the collaborative work cell. Finally, Section 6 summarizes the article along with some ideas for future work.

The main novelty and innovation contribution of this article is a complex methodology for CNN training using virtual 3D models and the design of a communication framework for assisted assembly devices such as a collaborative robot and a mixed reality device.

#### **2. Methodology of Deep Learning Implementation into the Assisted Assembly Process**

The methodology of CNN training using 3D virtual models for deep learning implementation into the assisted assembly process is based on the automated generation of input sample data for learning, without any monotonous manual work. All tasks, such as object positioning, background changes, and material changes, can be automated by a scripting language. This methodology can be divided into eight steps:
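The scripted automation described above can be sketched as a simple generation loop. This is only an illustrative sketch: the `render` and `save` helpers are hypothetical placeholders that would be backed by a concrete 3D tool's scripting API, not part of any specific library.

```python
# Minimal sketch of scripted sample generation: every combination of part,
# background (floor) and material is rendered at each rotation step.
# `render` and `save` are hypothetical callbacks supplied by the caller.
from itertools import product


def generate_samples(parts, backgrounds, materials, n_rotations=18,
                     increment_deg=20.0, render=None, save=None):
    """Render one labeled 2D training image per combination of
    part, background, material, and rotation step."""
    for part, background, material in product(parts, backgrounds, materials):
        for step in range(1, n_rotations + 1):
            angle = step * increment_deg  # 20..360 degrees for 18 steps
            image = render(part, background, material, angle)
            save(image, label=part)
```

With 2 parts, 1 background, 1 material, and 3 rotation steps, the loop produces 2 × 1 × 1 × 3 = 6 labeled images.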


The detected objects are placed on the floor and can be rotated only around one axis, with a chosen angular increment from 20° to 360° (i.e., from 1 to 18 rotation steps). The translation and rotation of the virtual parts in the scene are computed by standard translation and rotation matrices for placement in the 3D environment [24].
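The single-axis placement above can be expressed with a standard rotation matrix. The following sketch assumes rotation about the vertical (*z*) axis with the 20° increment mentioned in the text; function names are illustrative.

```python
import numpy as np


def rotation_about_z(step: int, increment_deg: float = 20.0) -> np.ndarray:
    """Standard rotation matrix about the vertical axis for a given step.

    With a 20 degree increment, steps 1..18 cover 20..360 degrees.
    """
    theta = np.deg2rad(step * increment_deg)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [c,   -s,  0.0],
        [s,    c,  0.0],
        [0.0, 0.0, 1.0],
    ])


def place_part(vertices: np.ndarray, step: int,
               translation: np.ndarray) -> np.ndarray:
    """Rotate part vertices (N x 3) around the vertical axis, then translate."""
    return vertices @ rotation_about_z(step).T + translation
```

Step 18 corresponds to a full 360° turn, so `rotation_about_z(18)` is (numerically) the identity matrix.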

The detected object can be too small (for example, nuts or washers), so it is necessary to apply magnification to change the camera's field of view, according to Equation (1):

$$\text{if } \frac{FOV_{H,V}}{D_{X,Y}} > 3, \text{ then } H = M f_L \tag{1}$$

where *FOVH,V* is the field of view (horizontal or vertical), *DX,Y* is the object dimension in the *X* or *Y* axis, *H* is the distance to the object [mm], *M* is the magnification (set to 0.5× or 0.25×), and *fL* is the focal length [mm].
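Equation (1) reduces to a small conditional helper. The sketch below uses the variable names and units from the text; the function name and the concrete numbers in the example are illustrative only.

```python
from typing import Optional


def camera_distance(fov: float, d: float, m: float, f_l: float) -> Optional[float]:
    """Equation (1): if FOV / D > 3, the camera distance becomes H = M * f_L.

    fov : field of view FOV_{H,V}, horizontal or vertical [mm]
    d   : object dimension D_{X,Y} along the X or Y axis [mm]
    m   : magnification M (0.5 or 0.25)
    f_l : focal length f_L [mm]
    Returns the adjusted distance H [mm], or None when no magnification is needed.
    """
    if fov / d > 3:
        return m * f_l
    return None
```

For instance, a 10 mm nut in a 120 mm field of view (ratio 12 > 3) with *M* = 0.5 and *fL* = 50 mm yields *H* = 25 mm, while a 60 mm part (ratio 2) needs no magnification.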

The number of 2D images generated from all imported parts is given by the simple Equation (2):

$$N = p f n_{\alpha} \tag{2}$$

where *p* is the number of imported parts, *f* is the number of used floors, and *n<sub>α</sub>* is the number of rotations for each part (in the range from 1 to 18).
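Equation (2) is a straightforward product; the helper below also guards the stated 1–18 range of *n<sub>α</sub>*. The choice of 3 floors in the example is an assumed illustrative value, not a figure from the article.

```python
def sample_count(p: int, f: int, n_alpha: int) -> int:
    """Equation (2): N = p * f * n_alpha, the total number of generated 2D images.

    p       : number of imported parts
    f       : number of used floors (backgrounds)
    n_alpha : number of rotations per part, 1..18
    """
    if not 1 <= n_alpha <= 18:
        raise ValueError("n_alpha must lie in the range 1..18")
    return p * f * n_alpha
```

For example, the 31-part cam switch rendered on 3 floors with all 18 rotation steps would give N = 31 × 3 × 18 = 1674 training images.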

Figure 1 presents a diagram of the proposed methodology of automatic data preparation for CNN automated training using experimental values, evaluation, and execution in the embedded device.

**Figure 1.** The simplified algorithm for samples generation from 3D virtual models of assembly parts.

## **3. Experimental Platform**

The research was conducted in the SmartTechLab for Industry 4.0 at the Faculty of Manufacturing Technologies of the Technical University of Kosice. An experimental SMART manufacturing system is installed there, established primarily for research purposes, but also for collaboration with companies and for teaching. An important part of this system is a work cell for assisted assembly with incorporated technologies for parts recognition, mixed reality, and collaborative robotics (Figures 2 and 3).

**Figure 2.** A scheme of the experimental assisted assembly work cell with CNN processing unit, mixed reality device and collaborative robot.

**Figure 3.** The experimental SMART manufacturing system; red frame: the workplace with an assisted assembly work cell with collaborative robot ABB Yumi and Microsoft Hololens 2 mixed reality device.

The product of interest for the experimental assembly is a cam switch consisting of 31 parts made from different materials: plastic, rubber, stainless steel, and brass. The disassembled parts are shown in Figure 4a, and the assembled product is shown in Figure 4b.

**Figure 4.** The experimental product used in research: (**a**) The disassembled product parts for identification by CNN; (**b**) The assembled cam switch.

#### **4. Input Data Preparation for CNN Training**

CNNs can work reliably for assembly part recognition, but a problem is the preparation of input data for their training. A very large quantity of input samples needs to be prepared, usually several hundred per assembly part, because each part has to be captured with different angular/translation variations and also with different backgrounds and materials. Replacing real images of assembly parts with their 3D virtual models can significantly accelerate this process. Applying virtual models is also a trend of the Industry 4.0 concept, as they can represent the real production process or product. Such virtual models digitally replicate all aspects of real products and are called digital twins. They consist of 3D models of parts grouped into assemblies, with the possibility of data synchronization with the real product. The first step in the methodology of deep learning implementation into the assisted assembly process is therefore the creation of a digital twin of the assembled product, in our case a cam switch (Figure 5). This digital twin will also serve as an input model for a mixed reality device for staff training.

**Figure 5.** An exploded view of a cam switch digital twin used in the assisted assembly process for implementation into mixed reality devices.
