**1. Introduction**

Human pose estimation is one of the most actively studied areas of computer vision research. It has applications in various fields such as games, healthcare, augmented reality and sports [1]. Estimating the operator's position is of great importance in collaborative robotics, since solving this task increases the efficiency of robotic systems and expands the range of their applications. A real-time response is required not only in security applications [2,3], where human actions must be detected in time, but also in industrial applications, where human movement is predicted to prevent collisions with robots in shared workspaces. For these reasons, research on motion-capture systems that do not rely on markers or wearable sensors has been in particularly high demand in recent years.

When solving this problem with computer vision methods, it is necessary to find the coordinates of each joint (arms, head, torso, etc.), called keypoints, in every video frame and to form a skeletal representation of the human body. Thus, keypoint detection includes the simultaneous detection of people and the localization of their keypoints.

This problem is particularly difficult due to the heterogeneity of objects that have various and potentially complex shapes, as well as the difficulties arising from background noise and partial overlaps between objects (occlusions).

Thus, the most crucial problems can be pointed out as follows [4]:



There are several approaches to the human body modeling:


Though posture estimation has been studied for many years, the task remains very complex and largely unsolved. There is still no universal approach that gives satisfactory results in general, non-laboratory conditions.

The paper [5] considers a linear Tensor-on-Tensor regression model to predict human behavior. However, that work constructs only the reference points and connections of the human upper limbs, i.e., it analyzes only hand movements in the working area.

An approach to detect hazardous operator behavior was proposed in [6]. The authors simulate dangerous behavior through time series analysis to detect hazards. Unfortunately, the authors did not give examples of real testing of the proposed algorithms, and all tests were carried out only in simulation.

Industrial cobots must be able to detect human presence and determine human intentions based on hand gestures, actions, etc. There is a promising trend here towards marker-less recognition [7]. High-level motion planning is usually combined with the robot's ability to recognize the intentions of its human partner [8–10].

In [11], the authors applied the promising Vision Transformer (ViT) technology and obtained the top score on the MPII Human Pose estimation benchmark [12].

Approaches that use multiple sensors or camera networks appear promising, since they increase reliability in the presence of occlusions and visual field limitations.

The paper [13] proposed an approach to estimate 3D poses of multiple persons in calibrated RGB-Depth camera networks. Each single-view outcome was computed by using a CNN for 2D pose estimation and extending the resulting skeletons to 3D by means of the sensor depth. The authors presented their solution in the form of an open-source library OpenPTrack [14].

The study [15] proposed a convolutional neural network (CNN) approach for estimating human body pose using a small number of cameras, including outdoor scenes. The authors of [16] proposed a purely geometric approach to infer a multiview pose from a synchronous set of 2D skeletons.

The paper [17] proposed an approach to estimate a 3D human pose from multiview video recordings, taken with unsynchronized and uncalibrated cameras. A unique approach to self-calibrate the system using the detected keypoints was employed in the research.

It should be noted that, despite the high accuracy of some approaches, they are rather resource intensive and rely on transmitting a video stream and performing computations on server graphics processing units (GPUs). In industrial systems, however, it is often necessary to use embedded computing modules based on lower-performance GPUs as on-board computers for robots. Their low power consumption and small size impose significant restrictions on the complexity of the algorithms.

It is important to emphasize once again that the application possibilities of cobots will be significantly extended by timely recognition, analysis and prediction of human actions based on data from a multimodal sensory system, together with the generation of control actions that account for emergency situations and extreme conditions in real time. Thus, developing technological solutions, including software and algorithmic support for the joint work of various types of robots with a person, is a promising task.

From our point of view, the most promising approach is one that combines a multi-camera system with additional sensor devices, thereby providing multimodal data processing based on neural network methods. For this reason, this article proposes using an ensemble of deep neural networks (NNs) to determine the operator's spatial pose from keypoints and to link these keypoints to the world coordinate system of the robot.

The scientific novelty of the project lies in the proposed set of methods, approaches and algorithms aimed at ensuring effective interaction between the components of the operator-cobot system.

#### **2. Problem Statement**

The task of increasing the efficiency of cobot-person interaction is formulated as follows. Based on the incoming video stream from surveillance cameras, microphones and other sensors installed in the working area, it is necessary to recognize the operator, localize them in space and build a predictive dynamic model of their behavior. This model should then be synchronized and adapted to the collaborative control system of a manipulation robot with an arbitrary number of degrees of freedom in order to perform joint, diverse and previously unknown scenarios.

The research deals with the problem of detecting and localizing human keypoints. Thus, the particular task solved in this article can be formulated as follows.

From the incoming video stream of several surveillance cameras, it is necessary to recognize and localize the operator's keypoints in three-dimensional space. The system must first be calibrated using "by template" calibration methods, after which computer vision methods based on deep neural networks are applied to localize the keypoints.

One needs to develop software in the Python environment and conduct a full-scale experiment with video scenes coming from surveillance cameras installed at a production facility.

#### *Mathematical Formulation of the Keypoint Detection Task*

The task of detecting human keypoints in an image can be formulated as a regression task. Let there be a set of images $\omega \in \Omega$, described by features $x_i$, $i = 1, \dots, n$, which for an image $\omega$ are collected into the vector description $\mathbf{\Phi}(\omega) = (x_1(\omega), \dots, x_n(\omega)) = \mathbf{x}$, and a set of corresponding values of the dependent variable $\mathbf{y}$, each of which is a vector $\mathbf{y} = (y_1, y_2, \dots, y_n)$ with $y_i \in \mathbb{R}$.

A priori information is represented by a training set (dataset) $D = ((\mathbf{x}_j, \mathbf{y}_j))$, $j = 1, \dots, L$, defined as a table in which each row $j$ contains a vector image description $\mathbf{\Phi}(\omega)$ and the value of the target variable. Note that the training set characterizes the unknown mapping $F^{*} : \Omega \to Y$.

We now specialize the regression task to the keypoint detection task. Let there be a video stream frame $\mathbf{I}_t$, where $t$ is the index of the current frame. Given the available frames $\mathbf{I}_t$ of a continuous video stream $\mathbf{V} = (\mathbf{I}_1, \dots, \mathbf{I}_t, \dots, \mathbf{I}_{\tau})$ and the a priori information provided by the training set $D = ((\mathbf{x}_j, \mathbf{y}_j))$, $j = 1, \dots, L$, for supervised deep learning of an NN, it is required to solve the image recognition problem: to detect the keypoints of an object in a video stream frame. The keypoints can be represented as a vector $\mathbf{y} = (y_1, y_2, \dots, y_n)$ containing the object coordinates.

Generally accepted metrics [18] are used as performance criteria for the human keypoint detection task.

Average precision (AP) is averaged over Object Keypoint Similarity (OKS) thresholds ranging from 0.50 to 0.95 in steps of 0.05.

OKS is calculated from the distance between predicted points and ground-truth points, normalized by the scale of the person. The scale and a per-keypoint constant are needed to equalize the importance of each keypoint: for instance, the neck position can be localized more precisely than the hip position [19]: $\text{OKS} = \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right)$, where $d_i$ is the Euclidean distance between the true keypoint and the predicted keypoint; $s$ is the scale, i.e., the square root of the object segment area; and $k_i$ is a per-keypoint constant that controls the falloff (the circles for the shoulders and knees can be larger than those for the nose or eyes). The OKS metric shows how close a predicted keypoint is to the true keypoint (a value between 0 and 1). Perfect predictions have OKS = 1, while predictions in which all keypoints are off by more than a few standard deviations $s \cdot k_i$ have OKS ≈ 0. Mean average precision is also calculated using OKS with thresholds of 0.50 and 0.75 (AP50 and AP75).
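To make the metric concrete, the following is a minimal sketch of an OKS computation in Python with NumPy; the function name and the per-keypoint constants passed in `k` are illustrative assumptions, not the official COCO values or the authors' evaluation code.

```python
import numpy as np

def oks(pred, gt, area, k):
    """Object Keypoint Similarity between matched keypoint sets.

    pred, gt : (n, 2) arrays of (x, y) keypoint coordinates
    area     : object segment area, so that s = sqrt(area) is the scale
    k        : (n,) array of per-keypoint constants controlling the falloff
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances d_i^2
    s2 = area                                    # s^2 = (sqrt(area))^2
    per_keypoint = np.exp(-d2 / (2.0 * s2 * k ** 2))
    return per_keypoint.mean()                   # averaged over the keypoints

# AP is then averaged over OKS thresholds 0.50:0.05:0.95
thresholds = np.arange(0.50, 0.96, 0.05)
```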

Percentage of Detected Joints (PDJ). A detected joint is considered correct if the distance between the predicted keypoint and the true one is within a certain fraction of the bounding box diagonal (the torso diameter).

The use of the PDJ metric implies that the accuracy of all joints is estimated using the same error threshold.

$$\text{PDJ} = \frac{\sum_{i=1}^{n} \text{bool}(d_i < 0.05 \cdot \text{diagonal})}{n},$$

where $d_i$ is the Euclidean distance between the true keypoint and the predicted keypoint; $\text{bool}(condition)$ is a function that returns 1 if the condition is true and 0 otherwise; and $n$ is the number of keypoints in the image.
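Under the same assumptions (NumPy arrays of matched keypoints), a minimal PDJ sketch following the definition above might look like this:

```python
import numpy as np

def pdj(pred, gt, bbox_diagonal, fraction=0.05):
    """Percentage of Detected Joints: a joint counts as detected when the
    prediction lies within `fraction` of the bounding-box diagonal."""
    d = np.linalg.norm(pred - gt, axis=1)            # Euclidean distances d_i
    return np.mean(d < fraction * bbox_diagonal)     # share of correctly detected joints
```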

Percentage of Correct Keypoints (PCK) [20] considers a body part to be correctly located if the estimated endpoints of the body segment lie within 50% of the segment length from their true locations. PCKh-0.5 [21] is a PCK modification that uses a matching threshold of 50% of the head segment length.

#### **3. Problem Solution**

The algorithm aimed at detecting and localizing keypoints in the world coordinate system is divided into the following subtasks (Figure 1):


**Figure 1.** Algorithm for detecting and localizing keypoints in the world coordinate system

#### *3.1. Calibration of the Multi-Camera System*

Let there be a multi-camera system consisting of two cameras observing one scene. The configuration and location of the system are shown in Figure 2. Each camera registers a scene containing $N$ reference points. The task is to use the three-dimensional coordinates of the reference points $(X_i^p, Y_i^p, Z_i^p)$ and the coordinates of their projections in the camera image plane $(c_{x_i}, c_{y_i})$, where $i = 1, \dots, N$, to estimate the elements of the matrix $\mathbf{A}$. As a rule, a calibration object in the form of a checkerboard is used for camera calibration, since the alternating black and white squares produce a sharp gradient in two directions. The intersections of the checkerboard lines are used as corners.

**Figure 2.** Observed scene

$K$ images of the calibration object are acquired for each camera, depending on the number of rotation angles $N$. It is necessary to calculate four internal parameters $(f/w, f/h, c_x, c_y)$, where $(c_x, c_y)$ are the coordinates of the principal point, $f$ is the distance from the optical center, and $w$, $h$ are the dimensions along the $ox$ and $oy$ axes, as well as six external parameters: the rotation angles $(\psi, \varphi, \theta)$ and the translation parameters $(T_x, T_y, T_z)$. The number of frames and corners must therefore satisfy $2 \cdot N \cdot K \geq 6 + 4$. To obtain the coordinates of the reference points $(X_i^p, Y_i^p, Z_i^p)$ and the coordinates of their projections in the camera image plane $(c_x, c_y)$, the *findChessboardCorners* function for finding corners on the chessboard, provided by the OpenCV library, is used (Figure 3). Using the method [22], the matrix of camera internal parameters is estimated:

$$\mathbf{A} = \begin{bmatrix} f/w & 0 & c\_{x\_0} \\ 0 & f/h & c\_{y\_0} \\ 0 & 0 & 1 \end{bmatrix}.$$

The result of the function is the matrix of the camera's internal characteristics $\mathbf{A}$, the distortion vector $\mathbf{r}$, the rotation vector $\mathbf{m}$ and the translation vector $\mathbf{q}$.
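A condensed calibration sketch using OpenCV is shown below; `cv2.findChessboardCorners` and `cv2.calibrateCamera` are the library functions referred to above, while the board size, square size and image paths are illustrative assumptions.

```python
import glob
import cv2
import numpy as np

board = (9, 6)       # inner corners of the checkerboard (assumed)
square = 0.025       # square size in metres (assumed)

# 3D reference points of the board corners in the board's own coordinate system
obj = np.zeros((board[0] * board[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/cam1/*.png"):           # the K calibration images (assumed path)
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        obj_points.append(obj)
        img_points.append(corners)

# A: intrinsic matrix, r: distortion coefficients, m, q: per-view rotation and translation vectors
ret, A, r, m, q = cv2.calibrateCamera(obj_points, img_points, gray.shape[::-1], None, None)
```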

**Figure 3.** Camera calibration

#### *3.2. Human KeyPoint Detection*

The approach developed by the authors of [23], based on the OpenPifPaf project [24], was used as the algorithm for detecting the operator's keypoints. The authors of [23] proposed a number of modifications that improve the quality of keypoint detection. The underlying OpenPifPaf is a bottom-up detector based on composite fields. The architecture of the model is shown in Figure 4. The input is an $h \times w$ image with three color channels. The neural network encoder generates PIF and PAF fields with 17 × 5 and 19 × 7 channels, respectively. The decoder converts the PIF and PAF fields into pose estimates containing 17 joints each. Each joint is represented by x and y coordinates and a confidence score. ResNet [25] or ShuffleNetV2 [26] is used as the encoder.

The Part Intensity Field (PIF) and Part Association Field (PAF) blocks are 1 × 1 convolutions followed by subpixel convolutions. These blocks are trained to detect and link keypoints. The architecture was trained on the COCO dataset, which can also be supplemented with synthetic data to adapt it to a specific task. Retraining the network on synthetic data alone, starting from pretrained weights, appears even more promising.
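A minimal inference sketch with the OpenPifPaf library is given below. The `Predictor` API and the `shufflenetv2k16` checkpoint name correspond to recent OpenPifPaf releases and are assumptions about the setup; the sketch does not reproduce the modifications from [23].

```python
import numpy as np
import openpifpaf
from PIL import Image

# Pretrained bottom-up detector with a ShuffleNetV2 encoder (checkpoint name assumed)
predictor = openpifpaf.Predictor(checkpoint='shufflenetv2k16')

image = np.asarray(Image.open('frame.png').convert('RGB'))    # h x w x 3 input frame
predictions, gt_anns, image_meta = predictor.numpy_image(image)

for ann in predictions:
    keypoints = ann.data    # (17, 3) array: x, y and confidence for each COCO joint
    print(keypoints)
```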

**Figure 4.** Keypoint detector architecture.

To obtain synthetic data, the Unity3D engine was used, along with images obtained from the cameras observing the scene. This approach makes it possible to adapt the algorithm to specific conditions.

Figure 5 shows the recognition results for the video surveillance cameras installed on the experimental site (a) and for synthetic data (b). As a result, the algorithm detects keypoints in the COCO format.

**Figure 5.** Keypoint detector: (**a**) real camera (**b**) synthetic data.

#### *3.3. Mapping Keypoints in 3D Space*

The result of the second stage is the set of operator keypoints detected in the images of each surveillance camera. These 2D coordinates must then be mapped to points in the 3D world coordinate system.

There are methods for calculating the 3D positions of points for stereoscopic systems consisting of two cameras whose optical axes are parallel and for which the straight line passing through the optical centers is perpendicular to the optical axes. In this case, the projections of a point with known three-dimensional coordinates satisfy the following expressions:

$$x_1 = \frac{f\left(X^w + \frac{b}{2}\right)}{Z^w}, \quad x_2 = \frac{f\left(X^w - \frac{b}{2}\right)}{Z^w}, \quad y_1 = y_2 = \frac{f Y^w}{Z^w},$$

where $X^w$, $Y^w$, $Z^w$ are the world coordinates of the point, $x_1$, $y_1$ are the projection coordinates in the image plane of the first camera, and $x_2$, $y_2$ are those in the image plane of the second camera. The coordinates of the point in three-dimensional space are then as follows:

$$X^w = b\,\frac{x_1 + x_2}{2(x_1 - x_2)}, \quad Y^w = b\,\frac{y_1 + y_2}{2(x_1 - x_2)}, \quad Z^w = \frac{f b}{x_1 - x_2},$$

where $f$ is the distance from the optical center and $b$ is the length of the segment between the optical centers (the baseline). When the camera axes are not parallel and the displacement of the cameras' optical centers is arbitrary, calculating the point coordinates for either camera requires the following parameters to be computed:

$$\mathbf{v}\_{1} = \frac{\mathbf{A}\_{1}\mathbf{M}\_{1}}{Z\_{1}}, \quad \mathbf{v}\_{2} = \frac{\mathbf{A}\_{2}\mathbf{M}\_{2}}{Z\_{2}},$$

$$\begin{bmatrix} Z_1^t \\ Z_2^t \end{bmatrix} = \begin{bmatrix} \mathbf{v}_1^T \mathbf{A}_1^{-T} \mathbf{A}_1^{-1} \mathbf{v}_1 & -\mathbf{v}_1^T \mathbf{A}_1^{-T} \mathbf{R}^T \mathbf{A}_2^{-1} \mathbf{v}_2 \\ -\mathbf{v}_1^T \mathbf{A}_1^{-T} \mathbf{R}^T \mathbf{A}_2^{-1} \mathbf{v}_2 & \mathbf{v}_2^T \mathbf{A}_2^{-T} \mathbf{A}_2^{-1} \mathbf{v}_2 \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{v}_1^T \mathbf{A}_1^{-T} \\ -\mathbf{v}_2^T \mathbf{A}_2^{-T} \mathbf{R} \end{bmatrix} \mathbf{t},$$

where $\mathbf{M}_1$, $\mathbf{M}_2$ are the coordinates of a point in three-dimensional space expressed in the coordinate systems of the first and second cameras, and $\mathbf{A}_1$, $\mathbf{A}_2$ are the matrices of internal parameters of the first and second cameras. $\mathbf{R}$ is an orthogonal matrix that describes the orientation of the coordinate system of the second camera relative to the first one, and $\mathbf{t}$ is the translation vector that determines the position of the optical center of the second camera in the coordinate system of the first one. The obtained parameters can be used to compute the vector of 3D point coordinates for either camera: $\mathbf{M}_1^p = Z_1^t \mathbf{A}_1^{-1} \mathbf{v}_1$, $\mathbf{M}_2^p = Z_2^t \mathbf{A}_2^{-1} \mathbf{v}_2$.
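The following is a small sketch of this triangulation step in Python, assuming $\mathbf{R}$ rotates camera-1 coordinates into camera-2 coordinates and $\mathbf{t}$ is the position of the second camera's optical center in the first camera's frame, as described above; `triangulate` is an illustrative helper, not the authors' implementation.

```python
import numpy as np

def triangulate(v1, v2, A1, A2, R, t):
    """Estimate the depths Z1, Z2 and the 3D point for a pair of matched
    keypoints v1, v2 given in homogeneous pixel coordinates [x, y, 1]."""
    a = np.linalg.inv(A1) @ v1            # ray of the keypoint in camera 1
    b = R.T @ np.linalg.inv(A2) @ v2      # ray of the keypoint in camera 2, expressed in camera 1
    # Least-squares solution of Z1 * a - Z2 * b = t for the two depths
    M = np.array([[a @ a, -(a @ b)],
                  [-(a @ b), b @ b]])
    rhs = np.array([a @ t, -(b @ t)])
    Z1, Z2 = np.linalg.solve(M, rhs)
    M1 = Z1 * np.linalg.inv(A1) @ v1      # 3D keypoint in the first camera's coordinate system
    return M1, Z1, Z2
```

Running this for every matched keypoint pair yields the operator's skeleton in the coordinate system of the first camera, which can then be transferred to the world coordinate system of the robot.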

#### **4. Semi-Natural Experiment**

The proposed approach was implemented in Python using the PyTorch library. The following computer configuration was used for testing: Intel Core i5, 8 GB RAM, NVIDIA GeForce 1080 Ti. Figure 6 demonstrates the algorithm in action on the current industrial line of the robotics center.

It has been experimentally confirmed that this approach allows the pose to be restored when only a subset of the keypoints is visible in the video scene, as well as when they partially overlap.

**Figure 6.** The proposed approach application.

### **5. Conclusions**

This article solves the problem of determining the operator's spatial position from keypoints using a multi-camera sensor system. The applicability of the proposed approach to problems of collaborative robotics has been demonstrated. The advantage of the algorithm is the use of a multi-camera system, which makes it possible to map the detected points to the local coordinate system of an industrial robotic complex. The obtained data will further be used for planning the robot's trajectory so that joint operations can be performed safely in a shared workspace. As an additional modification, installing a lidar to refine the coordinates of the skeleton edges is being considered.

A promising research direction is the development of an algorithm that predicts the operator's actions in the workspace and detects hazardous situations and possible intersections between the trajectories of the operator and the collaborative robot.

**Author Contributions:** Conceptualization, Y.I., S.Z. and M.G.; methodology, Y.I. and S.Z.; software, S.Z. and D.G.; validation, S.Z., M.G. and S.S.; writing—original draft preparation, Y.I. and S.Z.; writing, review and editing, Y.I. and D.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work was supported by the Russian Science Foundation (project № 22-71-10093).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
