**1. Introduction**

Collaborative robotic applications are nowadays well-established and widely used. The main goal of such applications is to combine the strengths of humans and robotic systems to achieve maximum effectiveness in completing a specified task while minimizing the risks imposed on the human worker. In manufacturing, these applications are crucial in enabling the concept referred to as "Industry 4.0". Industry 4.0 focuses on developing so-called cyber-physical systems, or CPS for short, aiming to create highly flexible production systems capable of fast and easy changes and addressing the current markets' need for individualized mass production [1].

**Citation:** Čorňák, M.; Tölgyessy, M.; Hubinský, P. Innovative Collaborative Method for Interaction between a Human Operator and Robotic Manipulator Using Pointing Gestures. *Appl. Sci.* **2022**, *12*, 258. https://doi.org/10.3390/app12010258

Academic Editors: Giuseppe Carbone and Med Amine Laribi

Received: 10 November 2021 Accepted: 20 December 2021 Published: 28 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Collaborative applications are becoming increasingly popular on factory floors due to their various benefits, such as lower deployment costs, more compact sizes, and easier repurposing compared with standard robotic systems [2]. As a result, more factory workers will be required to work in close contact with robotic systems. This, however, introduces new and demanding challenges, primarily how to secure the safety of the human worker while ensuring high work effectiveness. The former problem focuses mainly on minimizing potential risks and avoiding accidents arising from collaborative work between the human worker and a robot, without considering work efficiency. The latter aims to find methods capable of maximizing the overall productivity of such collaboration. Although the safety of the human worker is an absolute priority, and currently the most intensively researched topic [3], economic benchmarks are vital in determining further deployment of these applications. Thus, it is necessary to pay more attention to the interaction between humans and robotic systems, also referred to as HRI, researching the impact and influence of the whole human–robot relationship and developing new approaches capable of meeting today's production demands.

In today's conventional robotic applications, usually involving robust industrial manipulators, the robot works in a cell that must be wholly separated from the workers' environment by a physical barrier [4]. The human only comes in contact with the robot during maintenance and programming, carried out by highly qualified personnel. All of this is not only expensive but also time-consuming. In collaborative applications, by contrast, factory workers may often get into contact with parts of the robot and are even encouraged to influence the robot's actions based on the current circumstances to secure maximum work flexibility. This could pose a severe problem, as technically unskilled workers may have great difficulties communicating with the robotic system or exercising some form of control. A solution to this problem lies in developing HRI methods that enable natural and ergonomic human–robot communication without the need for prior technical knowledge of robotic systems. Hence, the workers could focus solely on the task at hand and not waste time figuring out the details of the application's interface.

For humans, the most natural way of communication is through speech, body, or facial gestures. Implementing these types of interfaces into collaborative applications could yield many benefits, which lead to increased productivity and overall worker contentment [5].

This paper presents a gesture-based collaborative framework for controlling a robotic manipulator, with flexible potential deployment in the manufacturing process, the academic field, or even the commercial sector. The proposed framework builds on the previous work of Tölgyessy et al. [6], which introduced the concept of "linear HRI", initially designed for mobile robotic applications. This paper further develops this concept, widening its usability from mobile robotic applications to robotic arm manipulation while utilizing a new state-of-the-art visual-based sensory system. The main goal of the proposed framework is to provide systematic foundations for gesture-based robotic control that support a wide variety of potential use-cases, can be applied universally regardless of the specific software or robotic platform, and provide a basis on which other, more complex applications can be built.

#### **2. Related Work**

Gestural HRI is a widely researched topic across the whole spectrum of robotics. Allowing the robot to detect and recognize human movements and act accordingly is a powerful ability, enabling the robot and the human worker to combine their strengths and accomplish various challenging tasks. In mobile robotics, gestures can be used to direct and influence the robot's movement. Cho and Chung [7] used a mobile robotic platform, Pioneer 3-DX, and a Kinect sensor to recognize a human body and follow the operator's movement. Tölgyessy et al. [6] proposed a method for controlling a mobile robotic platform, iRobot Create, equipped with a Kinect sensor via pointing gestures. Chen et al. [8] used a Leap Motion controller to control the movement of a mobile robotic platform with a robotic arm; both the chassis and the arm could be controlled via hand gestures. Gesture-based control was also used in the projects RECO [9] and TeMoto [10], which focused on intuitive multimodal control of mobile robotic platforms.

Besides mobile robotic platforms, force-compliant robotic manipulators are ideal for implementing various forms of gestural interaction, potentially deployable in multiple scenarios, such as object handling, assembly, assistance, etc. A significant part of the research is focused on teleoperation, in which the robot mirrors the human operator's movements. Hernoux et al. [11] used a Leap Motion sensor and a collaborative manipulator UR10 to reproduce the movement of the operator's hands. G. Du et al. [12] proposed a similar approach using Leap Motion to control a dual-arm robot with both hands; they used the interval Kalman filter and improved particle filter methods to improve hand tracking. Kruse et al. [13] proposed a gesture-controlled dual-arm telerobotic system, in which the operator controlled the position of an object held by the robotic system; Microsoft Kinect was used to track the human body. Many other researchers used Microsoft Kinect or a Leap Motion controller to control robotic manipulators [14–20]. Tang and Webb [5] studied the feasibility of gesture control to replace conventional means of direct control through a teach pendant. Zhang et al. [21] proposed gesture control for a delta architecture robot. Cipolla and Hollinghurst [22] investigated a gestural interface for teleoperating a robotic manipulator based on pointing gestures. Using stereo cameras, collineations, and an active contour model, they were able to pinpoint a precise location in the 40 cm workspace of the robot at which the user pointed with a finger.

Object picking is another promising application area. A human hand can pick up different objects of various sizes and shapes; therefore, several works were conducted to mimic the human hand in grasping objects [23–25]. An interesting concept was proposed by Razjigaev et al. [26], who designed a gestural control for a concentric tube robot; the work focused on the potential use of such an interface in noninvasive surgical procedures. A gestural interface may also be a promising concept for the control of UAV drones; some works [27,28] show potential for future development. Social robotics is another vast research area where gestural interfaces could yield many benefits. Natural interfaces could be especially advantageous in interaction with humanoid robots. Yu et al. [29] proposed a gestural control of the NAO humanoid robot. Cheng et al. [30] used this robot and a Kinect sensor to facilitate gestural interaction between humans and the humanoid robot.

Table 1 compares key works related to the interaction method designed by us. The vast majority focus mainly on teleoperation of the robotic manipulator, with the user having visual feedback of the resulting robot motion. Only three approaches present some form of direct interaction of the operator with the manipulator's workspace, and two of these allow the user to point at and select objects present in the workspace. The major novelty and contribution of our design is that the operator can select objects on the planar surface; furthermore, she or he can point to any spot of the workspace and subsequently navigate the end effector to the desired destination.


**Table 1.** Comparison of key related works.

#### **3. Our Approach**

In a conventional manufacturing process, a worker usually performs repetitive manual and often unergonomic tasks. Workers do most of these tasks mainly using their hands, such as assembly, machine tending, object handling, material processing, etc. Manual work, however, has many limitations which consequently influence the whole process efficiency. Integrating a collaborative robotic solution could significantly improve efficiency while, at the same time, it can alleviate a human worker's physical and mental workload.

Our proposed concept for the gestural HRI framework for the collaborative robotic arm aims to create such interaction, where the human worker could interact with the robot as naturally and conveniently as possible. Secondly, besides ergonomics, the framework focuses on flexibility, providing the user with functionality that could be used in numerous application scenarios. Lastly, the framework lays basic foundations for further application development, pushing the flexibility even further.

The fundamental principles of the proposed framework are based on the so-called "linear HRI" concept introduced by Tölgyessy et al. [6], which formulates three simple laws for HRI that state:


The proposed framework was developed under the laws of linear HRI, where the core functionality lies in the ability to send the end effector of the robotic arm to a specific location on a horizontal/vertical plane using the operator's hand pointing gestures.

Humans naturally use pointing gestures to direct someone's attention to the precise location of something in space. Combined with speech, it can be a powerful communication tool. Pointing gestures have many great use-cases among people to signify the importance of a particular object ("That's the pen I was looking for" (pointing to the specific one)), express intention ("I'm going there!" (pointing to the place)), or specify the exact location of something ("Can you hand me that screwdriver, please?" (pointing to the specific location in space)). Pointing gestures are incredibly efficient when traditional speech is insufficient or impossible due to circumstances such as loud and noisy environments.

Let us imagine a collaborative application scenario where the worker performs a delicate assembly task that requires specific knowledge. A collaborative robot would fulfill the role of an intelligent assistant/co-worker. The worker could point to necessary tools and parts out of reach, which the robot would then bring over; thus, the worker could focus specifically on the task while keeping the workspace clutter-free. Due to the natural character of the whole interaction, the worker could control the robot's behavior conveniently without any prior technical skills. Such an application would be accessible and easy to use across the entire worker spectrum, with various technical backgrounds. Our gestural framework aims to enable the described interaction in real-world conditions using the principles of linear HRI and state-of-the-art hardware. However, first, the following main challenges need to be addressed:


Solving these challenges is crucial for ensuring efficient, reliable, and natural interaction. Additionally, we aimed to make the framework universal, not dependent on a specific robotic platform. For that reason, we chose the Robot Operating System (ROS) as our software environment. ROS supports a wide variety of robotic platforms and is considered a standard for programming robots in general. For the robot's control and trajectory planning, the ROS package MoveIt was used.

In summary, the primary objectives of our approach lie in natural interaction, application flexibility, and scalability, while following the underlying concept of so-called linear HRI.

#### **4. The Human Body Tracking and Joint Recognition**

Precise recognition and localization of joints are vital in gestural interaction. Several technologies and approaches exist, providing the user with tracking capabilities. According to Zhou and Hu [34], human motion tracking technologies can be divided into non-visual, visual, and robot-aided tracking.

In non-visual tracking, the human motion and joints are mapped via various sensors placed onto the human body; typically, MEMS IMU or IMMU devices are used. These sensors are usually part of a suit or a so-called "data glove", which must be worn by the user [35–39]. Although these systems are precise and relatively cheap, they suffer from many shortcomings, such as the need for the user to wear the sensory suit, which may be uncomfortable and often must be calibrated for the specific user. The sensors on the suit may require extensive cabling, which can limit the user's range of motion. The sensors themselves may suffer from several issues, such as null bias error, scale factor error, noise, or interference. Due to this, the general application focus shifted towards visual-based human body tracking.

This approach uses optical sensors, such as RGB, IR, or depth cameras, to extract the user's spatial position. Complex image processing algorithms are then applied to pinpoint the precise position and orientation of human joints. These sensors do not require any equipment that the user needs to wear, thus not limiting the user in any way. They can be easily calibrated to the specific user, which greatly improves their flexibility. Today's camera technologies and image processing algorithms make them fast, accurate, and relatively cheap. These attributes made such sensors popular and widely used in various commercial applications, mainly in HCI and the video-gaming industry. However, visual-based sensors still have multiple drawbacks, such as sensitivity to lighting conditions and worsened tracking capabilities when the human body is occluded. The most widely recognizable sensor for visual body tracking is the Microsoft Kinect, released in 2010 and initially intended for video gaming. However, due to its capabilities, the sensor became widely used in other applications. The Kinect's principle of body tracking relies on capturing depth data from the depth sensors and then applying its image processing methods to produce a so-called "skeletal stream" representing the human figure. The first iteration of the sensor used an IR projection pattern to acquire the depth data; later versions used so-called "time-of-flight" (TOF) technology. According to Shotton et al. [40], body part classification (BPC) and offset joint regression (OJR) algorithms, specially developed for Microsoft Kinect, are used to determine the parts of the human body. Following the success of Microsoft Kinect, other similar sensors were released, such as the Intel RealSense or ASUS Xtion.

Another widely popular vision-based motion tracking sensor is the Leap Motion controller. This compact device was specially designed to track the human hands and arms at very high precision and accuracy. Alongside gaming and VR applications, the controller's design was initially meant to replace conventional peripheral devices, such as a mouse and keyboard, and provide a more sophisticated and natural HCI. The Leap Motion controller uses two IR cameras, capturing emitted light from three LEDs with a wavelength of 850 nm. The depth data are acquired by applying unique algorithms on the raw data, consisting of infrared brightness values and the calibration data [41]. According to the manufacturer, the accuracy of position estimation is around ±0.01 mm. However, several studies show [42,43] that this is highly dependent on the conditions.

#### *Sensor Choice for the Proposed Concept*

Due to the advantages of vision-based body tracking, we decided to use this technology in our concept. We believe that proper HRI should not rely on "human-dependent devices" such as data gloves or body-mounted sensors, but on the perceptive ability of the robot itself, as this, in our opinion, most accurately represents natural and ergonomic interaction.

As for the specific sensor, the Leap Motion controller was picked as the most suitable option. The main reason is the sensor's application focus. The core of the proposed concept centers around the hand gesture interaction between the robot and the human worker, as most manufacturing tasks are done by hand. Furthermore, the whole gestural framework is built on pointing gestures, performed by the arrangement of individual fingers. Leap Motion provides accurate and precise human hand tracking, explicitly focusing on the position and orientation of fingers. Other commercially available sensors on the market are not yet capable of such precision and generally focus on tracking the whole human body. The controller depicted in Figure 1 can track hands within a 3D interactive zone that extends up to 60 cm (24 in) or more from the device, with a typical field of view of 140 × 120°, which is illustrated in Figure 2. The controller produces output in the form of gray-scale images captured by the IR cameras and a skeletal representation of the human hand. The data are shown in Figure 3. Leap Motion's software can differentiate between 27 distinct hand elements, such as bones and joints, and track them even when other hand parts or objects obscure them. The positions of joints are represented in the controller's coordinate system depicted in Figure 4. The connection with the PC is facilitated via USB 2.0 or USB 3.0. The manufacturer also provides a software development kit (SDK) for the Leap Motion controller, allowing developers to create custom applications.

**Figure 1.** The Leap Motion controller (LMC).

**Figure 2.** Approximate workspace of the LMC.

**Figure 3.** Visualization of detected hand joints.

**Figure 4.** Coordinate system of the LMC.

#### **5. Method Design**

The core functionality of the proposed framework follows specific consecutive steps. The operator first performs a pointing gesture, pointing to a particular location in the (planar) workspace of the robot. The robot then computes the specified location, defined by the planar surface and the half-line formed by the operator's joints. When the operator is satisfied with the pointed place, he or she performs a command gesture, triggering the signal to move the robot's end effector to the desired location. The whole process is illustrated in Figure 5. The half-line is defined by the direction of the index finger, as it is the most common way to represent a pointing gesture. The command gesture was specifically designed to be performed naturally, fluently connecting with the previous gesture while not significantly influencing the pointing precision. The command gesture is achieved by extending the thumb, forming a so-called "pistol" gesture. Both gestures are depicted in Figure 6. Custom ROS packages and nodes were created for gesture recognition and intersection computation. The ROS-enabled machine connected to the robotic arm manages these nodes, ensuring proper communication between the individual parts of the whole application. The method's architecture in the ROS framework is depicted in Figure 7. The flowchart of the method's process is in Figure 8. The most prominent geometrical features for the mathematical description of the concept are illustrated in Figure 9. The global coordinate system *G* defines the planar ground surface *π*.
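The pointing–command cycle just described can be sketched as a small state machine. The sketch below is illustrative only: the class name and the boolean inputs `is_pointing` and `is_pistol` are hypothetical stand-ins for the output of a hand-pose classifier fed by the Leap Motion data.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()       # no gesture detected
    POINTING = auto()   # operator is pointing; intersection updated continuously
    COMMAND = auto()    # "pistol" gesture seen; send the TCP to the target

class GestureStateMachine:
    """Minimal sketch of the pointing/command gesture cycle."""

    def __init__(self):
        self.state = State.IDLE
        self.target = None  # last computed intersection I(X, Y)

    def step(self, is_pointing, is_pistol, intersection=None):
        if self.state == State.IDLE and is_pointing:
            self.state = State.POINTING
        if self.state == State.POINTING:
            if intersection is not None:
                self.target = intersection      # keep tracking the pointed spot
            if is_pistol:
                self.state = State.COMMAND      # trigger the robot motion
            elif not is_pointing:
                self.state = State.IDLE         # gesture lost; reset
        return self.state
```

In the real system, the COMMAND state would hand `self.target` to the motion-planning node and return to IDLE once the end effector reaches the goal.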

**Figure 5.** Process of controlling the robot via pointing gestures. 1—The operator points to a location. 2—The operator performs "pistol" gesture. 3—The robot's TCP moves to the appointed location.

**Figure 6.** The gestures used in the proposed method: (**a**) pointing gesture; (**b**) pistol gesture.

**Figure 7.** The package architecture of the method in ROS.

The Leap Motion sensor determines the positions of the joints of the human hand in its coordinate system *L*; furthermore, the robotic manipulator operates in its own coordinate system *R*. The unification of these coordinate systems is vital to obtain the coordinates of points *A* and *B* defining the half-line *p*, which ultimately determines the coordinates of the intersection *I*. All of the calculations are done in the global coordinate system *G*. Hence, the following homogeneous transformations between *L* and *G*, and between *R* and *G*, are defined:

$$
\begin{pmatrix} X_G \\ Y_G \\ Z_G \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & d_x \\ 0 & \cos\alpha & -\sin\alpha & d_y \\ 0 & \sin\alpha & \cos\alpha & d_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} X_L \\ Y_L \\ Z_L \\ 1 \end{pmatrix} \tag{1}
$$

$$
\begin{pmatrix} X_G \\ Y_G \\ Z_G \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 & d_x \\ \sin\alpha & \cos\alpha & 0 & d_y \\ 0 & 0 & 1 & d_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} X_R \\ Y_R \\ Z_R \\ 1 \end{pmatrix}
$$

Analytical geometry and vector algebra are used to calculate the desired coordinates. The data acquired from the Leap Motion sensor are transformed based on Equation (1). The points and vectors used in the calculation are illustrated in Figure 10.
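As a concrete illustration of Equation (1), the sketch below applies the L-to-G homogeneous transformation (a rotation by α about the x-axis followed by a translation) to a point. The function name and argument layout are our own; `numpy` is used for the matrix product.

```python
import numpy as np

def leap_to_global(p_L, alpha, d):
    """Map a point from the Leap Motion frame L to the global frame G
    per Equation (1): rotate by alpha about the x-axis, then translate by d."""
    c, s = np.cos(alpha), np.sin(alpha)
    T_LG = np.array([
        [1.0, 0.0, 0.0, d[0]],
        [0.0,   c,  -s, d[1]],
        [0.0,   s,   c, d[2]],
        [0.0, 0.0, 0.0, 1.0],
    ])
    p = np.array([p_L[0], p_L[1], p_L[2], 1.0])  # homogeneous coordinates
    return (T_LG @ p)[:3]
```

With `alpha = 0` the mapping reduces to a pure translation, which makes the calibration easy to sanity-check.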

**Figure 8.** State machine flowchart of the method.

**Figure 10.** Extraction of fundamental geometrical features.

First, the direction of the vector *v* defined by points A and B needs to be found. Vector *v* is calculated by subtracting coordinates of point *A* from point *B* (positions of index finger joints).

$$
\vec{v} = B - A \tag{2}
$$

The unit vector giving the direction of *v* is calculated by dividing *v* by its magnitude.

$$
\hat{v} = \frac{\vec{v}}{\|\vec{v}\|}\tag{3}
$$

Now, it is possible to define the half-line *p* by parametric equations.

$$\begin{aligned} x &= x_0 + v_1 t \\ y &= y_0 + v_2 t \\ z &= z_0 + v_3 t \end{aligned} \tag{4}$$

where *t* is the parameter of *p* and (*x*<sub>0</sub>, *y*<sub>0</sub>, *z*<sub>0</sub>) are the coordinates of point *A*. The plane *π* is spanned by the *x* and *y* axes of *G*, with normal vector *n* parallel to the *z*-axis; it is mathematically described by Equation (5).

$$z = 0 \tag{5}$$

By substituting Equation (5) into Equation (4), the parameter *t* can be calculated as follows.

$$t = -\frac{z_0}{v_3} \tag{6}$$

Now, using substitution, the *X* and *Y* coordinates of the intersection *I*(*X*, *Y*) can be calculated by the following equations.

$$\begin{aligned} I_X &= A_X + v_1 \left(-\frac{z_0}{v_3}\right) \\ I_Y &= A_Y + v_2 \left(-\frac{z_0}{v_3}\right) \end{aligned} \tag{7}$$
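Equations (2)–(7) can be condensed into a few lines of code. The sketch below is our own summary of the derivation, with a guard (not stated explicitly in the text) for the case where the half-line does not descend toward the plane.

```python
import numpy as np

def intersect_ground(A, B):
    """Intersection I(X, Y) of the pointing half-line with the plane z = 0.

    A and B are the index-finger joint positions in the global frame G;
    returns None when the half-line never reaches the ground plane."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    v = B - A                        # Eq. (2)
    v = v / np.linalg.norm(v)        # Eq. (3): unit direction
    if v[2] >= 0.0:
        return None                  # pointing level or upward: no intersection
    t = -A[2] / v[2]                 # Eq. (6), with (x0, y0, z0) = A
    return (A[0] + v[0] * t,         # Eq. (7)
            A[1] + v[1] * t)
```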

Analogously, it is possible to calculate the intersection on a plane *ω* perpendicular to *π*, mathematically defined by Equation (8).

$$y = d \tag{8}$$

where *d* is the distance of the plane from the origin along the *y*-axis. The scenario is illustrated in Figure 11.
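The vertical-plane case of Equation (8) differs only in which coordinate is fixed: the parameter now solves y = d instead of z = 0. A hedged sketch follows; the function name and the degenerate-case guards are ours.

```python
import numpy as np

def intersect_vertical(A, B, d):
    """Intersection I(X, Z) of the pointing half-line with the plane y = d."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    v = (B - A) / np.linalg.norm(B - A)   # unit direction, Eqs. (2)-(3)
    if abs(v[1]) < 1e-12:
        return None                       # half-line parallel to the plane
    t = (d - A[1]) / v[1]                 # solve y(t) = d
    if t < 0.0:
        return None                       # plane lies behind the hand
    return (A[0] + v[0] * t,
            A[2] + v[2] * t)
```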

**Figure 11.** Abstraction of pointing to the perpendicular plane.

#### **6. Method Implementation**

The main application focus of the proposed concept is to create a natural HRI workplace where humans and robots can work together efficiently. For this reason, a specialized robotic workplace was built around the core concept's functionality, supporting the ergonomics of the whole interaction between the human and the robot and trying to maximize the efficiency and convenience for the worker. Furthermore, it also acts as a modular foundation for implementing, testing, and evaluating other HRI concepts.

The whole workplace, called COCOHRIP, an abbreviation for Complex Collaborative HRI WorkPlace, is depicted in Figure 12. The COCOHRIP consists of three main parts, the sensors, the visual feedback, and the robotic manipulator. The sensory part contains the various sensors that gather data about the human worker and the environment. The visual feedback part consists of two LED monitors, providing the user with visual feedback from the sensor. The robotic part comprises the force-compliant robotic manipulator. All these parts are connected through the ROS machine, which manages all the communication and logic of the application. Transformation matrices between coordinate systems were:

$$\begin{aligned} T_{LG} &= \begin{pmatrix} 1 & 0 & 0 & 670 \\ 0 & \cos(0.707) & -\sin(0.707) & 15 \\ 0 & \sin(0.707) & \cos(0.707) & 11 \\ 0 & 0 & 0 & 1 \end{pmatrix} \\ T_{RG} &= \begin{pmatrix} \cos(3.14) & -\sin(3.14) & 0 & 1050 \\ \sin(3.14) & \cos(3.14) & 0 & 693.3 \\ 0 & 0 & 1 & 103 \\ 0 & 0 & 0 & 1 \end{pmatrix} \end{aligned} \tag{9}$$
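Assuming T<sub>LG</sub> encodes a rotation about the x-axis and T<sub>RG</sub> a rotation about the z-axis (with the numeric values of Equation (9), in millimeters and radians), a Leap Motion measurement can be mapped into the robot frame by chaining the two transforms: p<sub>R</sub> = T<sub>RG</sub><sup>−1</sup> · T<sub>LG</sub> · p<sub>L</sub>. The helper names below are ours.

```python
import numpy as np

def hom_rot_x(a, d):
    """Homogeneous transform: rotation by a about x, then translation d."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0,  0, d[0]],
                     [0, c, -s, d[1]],
                     [0, s,  c, d[2]],
                     [0, 0,  0, 1.0]])

def hom_rot_z(a, d):
    """Homogeneous transform: rotation by a about z, then translation d."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0, d[0]],
                     [s,  c, 0, d[1]],
                     [0,  0, 1, d[2]],
                     [0,  0, 0, 1.0]])

# Calibrated transforms of Equation (9)
T_LG = hom_rot_x(0.707, (670.0, 15.0, 11.0))
T_RG = hom_rot_z(3.14, (1050.0, 693.3, 103.0))

def leap_point_in_robot_frame(p_L):
    """Map a Leap Motion point into the robot frame: L -> G -> R."""
    p = np.array([p_L[0], p_L[1], p_L[2], 1.0])
    return (np.linalg.inv(T_RG) @ T_LG @ p)[:3]
```

Mapping the Leap Motion origin into the robot frame and transforming it back with T<sub>RG</sub> should reproduce the translation column of T<sub>LG</sub>, which gives a quick consistency check of the calibration.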

**Figure 12.** The COCOHRIP workplace.

#### **7. Experimental Evaluation**

In this section, we present two experimental scenarios that evaluate the basic concept principles, quantify the overall usability of the concept, and serve as a baseline for further development. The scenarios of the two experiments are derived from the proposed gesture pointing method, in which the operator points to a specific location on the workspace plane and performs a command gesture, sending the robot to the desired location. Each experiment scenario has two variants. In the first scenario, the operator points to designated horizontally placed markers; in the second scenario, the operator points at vertically placed markers. In the first variant, the operator has no visual feedback about where she or he is pointing and must rely solely on a best guess. In the second variant, the user receives visual feedback from the interactive monitors showing the exact place he or she is pointing to. The experimental setup and the precise positions of the markers in the global coordinate system are depicted in Figures 13 and 14. The position values are in millimeters. The process of obtaining one measurement is as follows:


In total, the operator points ten times to each of the target markers for the current experiment. During the pointing, the only requirement on the user is to keep the hand approximately 200–250 mm above the Leap Motion sensor, as this is believed to be the optimal distance according to [42] and the empirical data acquired during method pretesting. The user is encouraged to perform the pointing gestures as naturally as possible, positioning the body and arm as he or she sees fit. Seven male participants of different height and physical constitution executed the experiment scenarios.

**Figure 13.** Scenario of experiment no. 1.

**Figure 14.** Scenario of experiment no. 2.
