**1. Introduction**

A high demand for assisted-living and rehabilitation services is expected worldwide as a consequence of the COVID-19 pandemic. According to the World Health Organization (WHO), existing rehabilitation services have been disrupted in 60–70% of countries during the pandemic in order to avoid human contact. Therefore, countries must face major challenges to ensure the health of their populations. Robotic platforms are a promising solution for providing assistance and rehabilitation to disabled people through human–robot interaction (HRI) capabilities. HRI is currently an active research topic that contributes, by means of several research approaches, to the physical and/or social interaction of humans and robotic systems [1] in order to achieve a goal together.

The number of people with congenital and/or acquired disabilities is quickly increasing, and therefore, there are many dependent people who lack the necessary autonomy for a fully independent life. Stroke is one of the main causes of these acquired disabilities throughout the world. Acquired brain injury (ABI) is a clinical-functional situation triggered by an injury of any origin that acutely affects the brain, causing neurological deterioration, functional loss, and poor quality of life as a result. It can be due to various causes, with stroke and head trauma being the most frequent in our environment. Patients with ABI suffer cognitive and motor sequelae. In stroke patients, motor sequelae are usually more severe in the upper limb. Published studies have reported that 30–60% of patients with hemiplegia due to a stroke still have a severely affected upper limb 6 months after the event, and only 5–20% achieve a complete functional recovery.


Physical medicine and rehabilitation are the most important treatment methods in ABI because they help patients regain the maximum possible use of their limbs. Intensive therapies and repetitive task-based exercises are very effective treatments for motor skills recovery [2]. One of the most important processes of physical therapy requires manual exercises, in which the physiotherapist and the patient must have one-to-one interaction. The goal of the physiotherapist in this process is to help the patients achieve a normal range of motion in their limbs and to strengthen their muscles. Rehabilitation robotic platforms pursue the recovery of impaired motor function. The majority of rehabilitation robotics research to date has focused on passive post-stroke exercises (e.g., [3,4]). The use of assistive robotics in rehabilitation allows the physiotherapist to be assisted in certain exercises that require repeated movements with high precision. The robot can perform the cyclic movements required in rehabilitation. Additionally, robots can precisely control the applied forces and can monitor the therapy results objectively by using their sensors.

Robots intended for upper limb rehabilitation can accomplish active and passive bilateral and unilateral motor skills training for the wrist, forearm, and shoulder. MIT-Manus is one of the most well-known upper limb rehabilitation robots [5]. It was developed for unilateral shoulder or elbow rehabilitation. MIME is another well-known upper limb rehabilitation robot, developed for elbow rehabilitation using the master–slave concept [6]. The movement of the master side of the robot is reproduced on the slave side. The 2-DOF robot can perform flexion–extension and pronation–supination movements. The Assistive Rehabilitation and Measurement (ARM) Guide is a bilateral rehabilitation system for upper limb rehabilitation using an industrial robot [7]. It assists the patient in following a trajectory. It also serves as a basis for the evaluation of several key motor impairments, including abnormal tone, incoordination, and weakness. The GENTLE/s system uses a haptic interface and virtual reality techniques for rehabilitation. The patients can move their limbs in a three-dimensional space with the aid of the robot [8]. The authors of [9] presented a rehabilitation robot with minimum degrees of freedom to train the arm in reaching and manipulation, called reachMAN2. All these previous robotic devices provide the potential for patients to carry out more exercise with limited assistance, and dedicated robotic devices can progressively adapt to the patients' abilities and quantify the improvement of the subject.

Robotic platforms for assistance and rehabilitation must have precise sensory systems for HRI. Therefore, they must recognize human poses or human gestures to improve the performance and safety of human–robot collaboration in these environments [10,11]. Our study sought to obtain an accurate marker-less pose estimation using a low-cost RGB camera for upper limb robotic rehabilitation environments. We set up a calibrated system of multiple RGB-D cameras to measure the accuracy of the available methods.

#### **2. Human Pose Detection and Body Feature Extraction: A State of the Art**

The human body is a very complex system composed of many limbs and joints, and the exact detection of the position of the joints in 2D or 3D is a challenging task [12], as it requires specific assumptions within biomechanics research in robotic rehabilitation environments [13]. In addition, HRI environments are complex and nondeterministic, and it is not easy to ensure the user's safety during interaction with the robot. Currently, this issue is a research topic in other areas, such as Industry 4.0 [14,15]. Its resolution involves constant position tracking, intention estimation, and action prediction of the user, and it can be addressed with a proper sensory system. On the one hand, some contributions employ inertial measurement units (IMUs) for motion capture, especially in medical applications and motor rehabilitation analysis [16,17]. However, this type of sensor requires the correct placement of the units on the body before each capture session, and it is insufficient for HRI environments.

On the other hand, this issue can also be faced as a computer vision problem, basically using two approaches: marker-based and marker-less. Marker-based approaches, such as motion capture systems (MoCap), have significant environmental constraints (markers on the human body) and are relatively complex, expensive, and difficult to maintain. Marker-less approaches have fewer environmental constraints, and they can give a new understanding about human movements [18]. This issue requires the processing of complex information to develop an algorithm that recognizes human poses or skeletons from images. Therefore, an easy-to-use marker-less motion capture method is desirable for these robotic rehabilitation environments. In this paper, we analyzed the performance of the estimation of shoulder and elbow angles for the development of rehabilitation exercises using CNN (convolutional neural network)-based human pose estimation methods.

There is extensive research on marker-less approaches for human motion tracking. In these approaches, depth cameras such as the Kinect (RGB-D) provide additional information about the 3D shape of the scene. The Kinect has become an important 3D sensor, and it has received a lot of attention thanks to its rapid human pose recognition system. Its low cost, reliability, and rapid measurement have made the Kinect the primary 3D measuring device for indoor robotics, 3D scene reconstruction, and object recognition [19]. Several approaches perform real-time human pose recognition using only a single sensor [20–22], but they can suffer substantial errors under partial occlusions.

In recent years, the use of deep learning techniques for 3D human pose estimation has become a common approach in HRI systems. These computer vision techniques usually train a neural network from labeled images in order to estimate human pose. As a reference, some research works obtain the 3D pose estimation using a single-view RGB camera image or multi-view camera images [23,24]. These accurate methods encounter fewer problems regarding the cameras' position and calibration issues in comparison to RGB-D approaches. However, 3D human pose detection for assisted and rehabilitation robotic environments needs further improvements to achieve real-time tracking for human motion analysis with enough accuracy [25].

In our comparison, we decided to use the Kinect v2 sensor, which has a high accuracy in joint estimation while providing skeletal tracking, as described in [26]. Additionally, research supporting this has been presented in [27], where the validity of Kinect v2 for clinical motion analysis was compared with a MoCap system, and the error was reported to be less than 5%. We employed the skeleton obtained by two Kinect sensors as our ground truth to measure the performance of the estimation of shoulder and elbow angles using two CNN (convolutional neural network)-based human pose estimation methods in rehabilitation exercises. The selected CNN-based methods were OpenPose [28] and Detectron2 [29]. OpenPose is a multi-person pose detection system that can detect a total of 135 body points from a digital image [28,30]. OpenPose has been trained to produce three distinct pose models, which differ from one another in the number of estimated key points: (a) MPII is the most basic model and estimates a total of 15 key points: ankles, knees, hips, shoulders, elbows, wrists, neck, torso, and head top. (b) The COCO model is a collection of 18 points, including some facial key points. (c) The BODY_25 model provides 25 points, consisting of the COCO points plus foot key points [30,31]. Detectron2 was built by Facebook AI Research (FAIR) to support the rapid implementation and evaluation of novel computer vision research. Detectron2 is a ground-up rewrite of the previous Detectron version, and it originates from the maskrcnn-benchmark project. Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose [29].
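As an illustration of how one of these CNN-based detectors can be queried in practice, the following minimal Python sketch runs the COCO keypoint R-CNN model from the Detectron2 model zoo on a single RGB frame. The image file name is a placeholder, and the confidence threshold is an illustrative choice rather than the value used in this work; the 2D keypoints returned this way are the kind of input consumed by the joint-angle computation described later.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Configure a COCO-pretrained keypoint R-CNN from the Detectron2 model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # illustrative threshold: keep confident detections only

predictor = DefaultPredictor(cfg)

frame = cv2.imread("frame.jpg")  # placeholder webcam frame (BGR)
outputs = predictor(frame)

# pred_keypoints has shape (num_people, 17, 3): (x, y, score) per COCO keypoint,
# including the shoulder, elbow, and wrist joints used for angle estimation.
keypoints = outputs["instances"].pred_keypoints.to("cpu").numpy()
print(keypoints.shape)
```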

#### **3. Materials and Methods**

The architecture of the vision system is composed of two RGB-D cameras (Microsoft Kinect Xbox One, also known as Kinect v2, Microsoft, Redmond, WA, USA) and a webcam connected to a computer network. Each Kinect is connected to a client computer, which estimates the user skeleton joint tracking through the Microsoft Kinect Software Development Kit (SDK). Microsoft released the Kinect sensor v2 in 2013; it incorporates an RGB camera with a resolution of 1920 × 1080 pixels and a depth sensor with a resolution of 512 × 424 pixels, a working range of 50–450 cm, a 70° × 60° field of view, and a frame rate of 15–30 fps. Data from the sensor can be accessed using the Kinect for Windows SDK 2.0, which allows tracking up to 6 users simultaneously, each with 25 joints. For each joint, the three-dimensional position is provided, as well as the orientation as a quaternion. The center of the IR camera lens represents the origin of the 3D coordinate system [32,33].

The Microsoft SDK was designed for Windows platforms; therefore, rosserial is used to communicate between Windows platforms and Linux [34]. Three PCs are used for the system architecture. One of them works both as a client and server. The detailed hardware description is shown in Table 1. The RGB webcam is connected to a client PC equipped with a graphics card (GPU). Both the webcam and GPU are used by the OpenPose and Detectron2 methods for human pose estimation. The overall system topology is shown in Figure 1.
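As a rough sketch of the Linux side of this bridge, the snippet below subscribes to skeleton messages republished from the Windows clients. The topic names and the use of geometry_msgs/PoseArray (one pose per joint, carrying the position and quaternion orientation reported by the Kinect SDK) are illustrative assumptions, not the exact interface of the system.

```python
#!/usr/bin/env python
# Minimal ROS (rospy) sketch of the Linux-side node that receives the skeletons
# streamed from the Windows clients over rosserial. Topic names and the message
# type are hypothetical placeholders.
import rospy
from geometry_msgs.msg import PoseArray

def skeleton_cb(msg, kinect_id):
    # msg.poses would hold up to 25 joints; each pose carries a position (m)
    # and an orientation (quaternion) in that Kinect's IR-camera frame.
    rospy.loginfo("Kinect %d: received %d joints", kinect_id, len(msg.poses))

if __name__ == "__main__":
    rospy.init_node("skeleton_receiver")
    rospy.Subscriber("/kinect1/skeleton", PoseArray, skeleton_cb, callback_args=1)
    rospy.Subscriber("/kinect2/skeleton", PoseArray, skeleton_cb, callback_args=2)
    rospy.spin()
```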



**Figure 1.** An overview of the proposed system.

### *Camera Calibration*

In order to compare the different pose estimation methods, all three cameras need to be calibrated in a common coordinate system. Calibration of the cameras was performed using the OpenCV multiple camera calibration package [35]. A checkerboard pattern with known dimensions is shown so that at least two cameras can identify it at the same time. To obtain the ground truth, the extrinsic parameters of the cameras (rotation matrix and translation vector) are required; then, a 3D-to-2D projection onto the image plane must be made so that the result can be compared with the information provided by OpenPose or Detectron2 (see Equations (1) and (2)). The acquisition and processing scheme of the data is shown in Figure 2.
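Before the projection in Equations (1) and (2) can be applied, the extrinsic pair (R, T) between two cameras must be estimated from the shared checkerboard views. The sketch below shows one possible way to do this with OpenCV's stereo calibration; it assumes the intrinsics of both cameras are already known, and the board dimensions and function name are illustrative placeholders rather than the exact procedure of the calibration package used here.

```python
import cv2
import numpy as np

def calibrate_extrinsics(image_pairs, K1, D1, K2, D2, image_size,
                         pattern_size=(9, 6), square_size=0.025):
    """Estimate (R, T) between two cameras from a checkerboard seen by both.

    image_pairs: iterable of synchronized grayscale frame pairs.
    K1, D1, K2, D2: known camera matrices and distortion coefficients.
    pattern_size, square_size: illustrative board dimensions (inner corners, meters).
    """
    # 3D model of the inner board corners on the Z = 0 plane.
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

    obj_points, img_points1, img_points2 = [], [], []
    for img1, img2 in image_pairs:
        ok1, c1 = cv2.findChessboardCorners(img1, pattern_size)
        ok2, c2 = cv2.findChessboardCorners(img2, pattern_size)
        if ok1 and ok2:  # keep only frames where both cameras see the board
            obj_points.append(objp)
            img_points1.append(c1)
            img_points2.append(c2)

    # Fix the known intrinsics and solve only for the relative rotation/translation.
    _, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
        obj_points, img_points1, img_points2, K1, D1, K2, D2,
        image_size, flags=cv2.CALIB_FIX_INTRINSIC)
    return R, T
```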

$$
\begin{bmatrix} X_{CK1} \\ Y_{CK1} \\ Z_{CK1} \end{bmatrix} = R \begin{bmatrix} X_{K1} \\ Y_{K1} \\ Z_{K1} \end{bmatrix} + T \tag{1}
$$

$$
u = \left(\frac{x}{z}\right) f_{x} + c_{x}, \qquad v = \left(\frac{y}{z}\right) f_{y} + c_{y} \tag{2}
$$

where (*XK1*, *YK1*, *ZK1*) are the coordinates of a 3D point in the coordinate system of Kinect 1 (IR camera), (*XCK1*, *YCK1*, *ZCK1*) are the coordinates of that point after applying the calibration transform, *T* is the translation vector, *R* is the rotation matrix, (*u*, *v*) are the pixel coordinates of the projected point, (*cx*, *cy*) is the image center (IR camera), and (*fx*, *fy*) are the focal lengths expressed in pixel units (IR camera).
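The following numpy sketch applies Equations (1) and (2) to a single joint. It assumes the calibration has already produced R, T, and the intrinsics of the target camera; the function and variable names are illustrative.

```python
import numpy as np

def project_kinect1_joint(p_k1, R, T, fx, fy, cx, cy):
    """Transform a 3D joint from the Kinect 1 frame and project it to pixels.

    Rigid transform with the extrinsic parameters (R, T) from the calibration
    phase (Equation (1)), followed by a pinhole projection with the intrinsics
    (fx, fy, cx, cy) of the target camera (Equation (2)).
    """
    x, y, z = R @ np.asarray(p_k1, dtype=float) + T   # Equation (1)
    u = (x / z) * fx + cx                              # Equation (2)
    v = (y / z) * fy + cy
    return np.array([x, y, z]), np.array([u, v])

# Example with illustrative intrinsics for a 512 x 424 depth image:
xyz, uv = project_kinect1_joint([0.10, -0.30, 2.00], np.eye(3), np.zeros(3),
                                365.0, 365.0, 256.0, 212.0)
```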

**Figure 2.** Data acquisition and processing scheme.

### **4. Experimental Setup**

#### *4.1. Cameras Position*

The two Kinect sensors were located orthogonally to each other, as described in Figure 3a. This distribution mitigates the data loss caused by self-occlusion [36,37]. The laboratory hardware setup is shown in Figure 3b. With this configuration, more precise data can be obtained for rehabilitation exercises that focus on the limbs. Finally, the webcam was located just above Kinect 2 to reduce the errors in the extrinsic parameters and to obtain a view similar to that of Kinect 2.

**Figure 3.** Our system: (**a**) Cameras distribution, (**b**) Working zone.

#### *4.2. Joint Angle Measurement*

The joint angle was measured as the relative angle between the longitudinal axes of two adjacent segments. These segments were defined by three points in 2D space: a starting point, a middle point, and an end point. For the elbow joint angle, the adjacent segments were the upper arm and the forearm. Figure 4 shows the elbow and shoulder joint angles measured in this study. Let *u* and *v* be vectors representing two adjacent segments; the angle between *u* and *v* is equal to:

$$\theta = \cos^{-1}\left(\frac{u \cdot v}{|u||v|}\right) \tag{3}$$
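A minimal numpy sketch of Equation (3) is given below. It assumes the three keypoints are available as 2D pixel coordinates (e.g., shoulder, elbow, and wrist for the elbow angle); the coordinate values in the example are illustrative, and the clipping only guards against round-off outside [−1, 1].

```python
import numpy as np

def joint_angle(p_start, p_mid, p_end):
    """Relative angle (Equation (3)) between two adjacent segments, in degrees."""
    u = np.asarray(p_start, dtype=float) - np.asarray(p_mid, dtype=float)
    v = np.asarray(p_end, dtype=float) - np.asarray(p_mid, dtype=float)
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Example: elbow angle from illustrative shoulder, elbow, and wrist pixel coordinates.
print(joint_angle((310, 180), (330, 260), (400, 300)))
```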

**Figure 4.** Rehabilitation exercises: (**a**) elbow side flexion, (**b**) elbow flexion, (**c**) shoulder extension, and (**d**) shoulder abduction.

#### *4.3. Rehabilitation Exercises*

Four upper limb rehabilitation exercises were proposed: elbow side flexion, elbow flexion, shoulder extension, and shoulder abduction (Figure 4). During the execution of the exercises, the cameras capture the positions of the relevant joints of the patient's pose, and this information is used to calculate the angles θ and β obtained by the different systems.

#### *4.4. Ground Truth*

The ground truth of the pose estimation was calculated using the skeletons provided by the two Kinect cameras. As previously mentioned, these cameras were located at approximately 90° from each other to obtain accurate data on rehabilitation exercises that focus on the upper limb, where one camera alone does not give fully reliable estimations. Figure 5 shows four examples of 3D human pose estimations obtained by Kinect 2 (blue) and the skeleton fusions (red) during an elbow side flexion exercise. The skeleton fusion was calculated as a simple average of the Kinect 2 skeleton and the skeleton projected from Kinect 1, which was computed using the extrinsic parameters obtained in the calibration phase and Equation (1). When performing this projection, a difference is expected between the coordinates of Kinect 2 (main camera, front view) and Kinect 1 (auxiliary camera, side view). This difference is due to the viewing angle of each Kinect and the volume of the human joints.
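A possible implementation of this simple-average fusion is sketched below. The array shapes (25 joints with 3D positions) follow the Kinect SDK skeleton, while the function name and the assumption that (R, T) map Kinect 1 coordinates into the Kinect 2 frame are illustrative.

```python
import numpy as np

def fuse_skeletons(skel_k2, skel_k1, R, T):
    """Average the Kinect 2 skeleton with the Kinect 1 skeleton projected into its frame.

    skel_k2, skel_k1: (25, 3) arrays of joint positions in meters.
    R, T: extrinsics applying Equation (1) to every Kinect 1 joint.
    """
    projected_k1 = skel_k1 @ R.T + T        # Kinect 1 joints in the Kinect 2 frame
    return 0.5 * (skel_k2 + projected_k1)   # simple per-joint average
```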

**Figure 5.** 3D skeleton poses obtained by Kinect 2 and skeleton fusion.

Figure 6 shows the X, Y, and Z positions of the left and right wrists from the two Kinects (Kinect 1 and Kinect 2) during a shoulder extension exercise, together with the projected points (Projected 1). As stated before, an error was expected in the calculation of the fused skeleton, and the results show that, even with the fully calibrated system, some errors were obtained. For the left wrist, we measured MAEs (mean absolute errors) of 7.35, 2.16, and 3.71 cm for the X, Y, and Z coordinates, respectively. For the right wrist, we measured MAEs of 7.70, 2.90, and 3.25 cm for the X, Y, and Z coordinates, respectively.
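For reference, these per-axis errors can be obtained with a short computation such as the sketch below, assuming both wrist trajectories are stored as frame-by-frame arrays of 3D positions in meters; the function name is illustrative.

```python
import numpy as np

def mae_per_axis_cm(traj_a, traj_b):
    """Mean absolute error between two joint trajectories, per coordinate, in cm.

    traj_a, traj_b: (num_frames, 3) arrays of X, Y, Z positions in meters.
    """
    return 100.0 * np.mean(np.abs(np.asarray(traj_a) - np.asarray(traj_b)), axis=0)
```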

**Figure 6.** X, Y, and Z positions of left and right wrist by Kinect 1, Kinect 2, and projected position.

#### **5. Experiments**

The following experiments show the precision of the angles calculated using the OpenPose and Detectron 2 approaches. Only the necessary angles are shown in each experiment. The movements of exercises 1 and 2 involved only one angle, while exercises 3 and 4 involved the two aforementioned elbow and shoulder angles. A video of the experiments is available online [38]. A summary of the results of all experiments is shown in Section 6 (Results section).
