*Article* **Teleoperation of High-Speed Robot Hand with High-Speed Finger Position Recognition and High-Accuracy Grasp Type Estimation**

**Yuji Yamakawa 1,\*,†, and Koki Yoshida 2,†**


**Abstract:** This paper focuses on the teleoperation of a robot hand on the basis of finger position recognition and grasp type estimation. For the finger position recognition, we propose a new method that fuses machine learning and high-speed image-processing techniques. Furthermore, we propose a grasp type estimation method according to the results of the finger position recognition by using decision tree. We developed a teleoperation system with high speed and high responsiveness according to the results of the finger position recognition and grasp type estimation. By using the proposed method and system, we achieved teleoperation of a high-speed robot hand. In particular, we achieved teleoperated robot hand control beyond the speed of human hand motion.

**Keywords:** teleoperation; high-speed image processing; machine learning; finger position recognition; grasp type estimation

### **1. Introduction**

Technology for realizing remote systems such as teleoperation, telerobotics, telexistence, etc., has been an important issue [1–3], and much research has been actively carried out. Recently, in particular owing to the effects of COVID-19, remote work (telework) by office workers has become commonplace. In the future, teleoperation using robot technology will be applied to industrial fields, and object handling and manipulation using remote systems are considered to be essential and critical tasks. In order to achieve this, we consider that telerobotics technology based on sensing human hand motion and controlling a robot hand will be essential. Thus, this research focuses on the teleoperation of a robot hand on the basis of visual information about human hand motion. The reason why we use visual information is that it is troublesome for users to have to put on contact devices [4–7] before operating the system, and non-contact-type systems are considered to be more suitable for users.

Here, we describe related work in the fields of teleoperation and telerobotics based on visual information. Interfaces based on non-contact sensing generally recognize human hand gestures and control a slave robot based on these gestures [8,9]. In related work in the field of humanoid robotics, a low-cost teleoperated control system for a humanoid robot has been developed [10]. In wearable robotics, semantic segmentation has been performed by using Convolutional Neural Networks (CNNs) [11]. Such interfaces are intuitive for users and do not involve the restrictions involved with contact-type input devices. Some devices for recognizing human hand gestures have been developed, and some systems have also been constructed [12,13]. Lien et al. proposed a high-speed (10,000 Hz) gesture recognition method based on the position change of the hand and fingers by radar [14]. This method can recognize rough hand motion, but not its detail. Zhang et al. performed human hand and finger tracking using a machine learning technique based on RGB images, but the operating speed was limited to 30 fps [15]. Tamaki et al. created a database consisting of finger joint angles obtained by using a data glove, hand contour information, and nail positions obtained from images, and they also proposed a method of estimating hand and finger positions by searching the database at 100 fps [16].

Premaratne discussed some techniques for hand gesture recognition for Human–Computer Interaction (HCI) [17]. Furthermore, Ankit described recent activities on hand gesture recognition for robot hand control [18]. Hoshino et al. [19] and Griffin et al. [20] proposed methods of mapping between human hand and robot hand motions. On the other hand, Meeker et al. [21] created a mapping algorithm experimentally. Sean et al. developed a system that can operate a robot arm according to human intention [22]. Niwa et al. proposed "Tsumori" control, which can achieve a unique robot operation for an operator based on learning the correspondence between a human operation and the robot motion [23]. Fallahinia and Mascaro proposed a method of estimating hand grasping power based on the nail color [24].

Summarizing the above, we can conclude that the disadvantages of the previous approaches are as follows:


Regarding the low speed and low responsiveness, Anvari et al. [25] and Lum et al. [26] discussed the system latency in surgical robotics, and they claimed that the latency affects the task completion and performance. Thus, it is strongly desirable for teleoperation systems to have as low a system latency as possible.

To overcome these disadvantages, we previously developed a high-speed telemanipulation robot hand system consisting of a stereo high-speed vision system, a high-speed robot hand, and a real-time controller [27,28]. In the stereo high-speed vision system, which is composed of two high-speed cameras and an image-processing PC, the 3D positions of the fingertips of a human subject were calculated by a triangulation method. Then, mapping between the human hand and the robot hand was performed. Finally, robot hand motion was generated to duplicate the human hand motion. With this high-speed system, we achieved a system latency so low that a human being cannot recognize the latency from the human hand motion to the robot hand motion [29,30].

In the present research, we aim to achieve even lower latency so that an intelligent system with vision cannot recognize the latency. Realizing such an extremely low-latency teleoperated system will contribute to solutions for overcoming latency issues in cases where the latency of telemanipulated systems may occur over longer distances. In addition, this technology will enable high-level image processing using the remaining processing time. In this paper, we propose a new method that fuses machine learning and high-speed image-processing techniques to obtain visual information about human hand motion. In general, the speed of machine learning methods is considered to be very low, and therefore, machine learning alone is not well suited to real-time, real-world interaction between a human and a robot. By using our proposed method, we can overcome the issue of the low speed of machine learning processing. Concretely speaking, the low-speed characteristics of machine learning can be improved by using high-speed image processing and interpolating the results of the machine learning with the results of the high-speed image processing. Although the finger position is estimated by machine learning using a CNN and high-speed image-processing technologies in this research, the integration of machine learning and high-speed image-processing technologies can be considered to be applicable to other target-tracking tasks. Thus, our proposed method, with high speed and high intelligence, possesses generality as a target-tracking method.

In addition, since our proposed method does not require three-dimensional measurement and camera calibration is also not needed, it is easy to set up the system. Moreover, motion mapping from the human hand motion to the robot hand motion is not performed in our proposed method. Therefore, kinematic models of the human hand and robot hand are not needed either. As a result, it is considered to be easy to implement our developed teleoperation system in actual situations.

The contributions of this paper are the following:


Furthermore, Table 1 shows the positioning of this research. The characteristics of our proposed method are "non-contact", "intention extraction", and "high-speed".

**Table 1.** Positioning of this research.


The rest of this paper is organized as follows: Section 2 describes an experimental system for teleoperation. Section 3 explains a new method for achieving grasp type estimation based on high-speed finger position recognition. Section 4 shows evaluations of the proposed method and the experimental results of teleoperation. Section 5 concludes with a summary of this research and future work.

### **2. Experimental System**

This section explains our experimental system for the teleoperation of a high-speed robot hand based on finger position recognition and grasp type estimation. The experimental system, as shown in Figure 1, consists of a high-speed vision system (Section 2.1), a high-speed robot hand (Section 2.2), and a real-time controller (Section 2.3). All of the components were placed in the same experimental environment.

### *2.1. High-Speed Vision System*

This subsection explains the high-speed vision system, consisting of a high-speed camera and an image-processing PC. As the high-speed camera, we used a commercial product (MQ013MG-ON) manufactured by Ximea. The full image size was 1280 pixels (width) × 1024 pixels (height), and the frame rate at the full image size was 210 frames per second (fps). In this research, we reduced the image size, which allowed us to increase the frame rate from 210 fps to 1000 fps. We set the frame rate to 1000 fps because the servo control systems for the robot and the machine system were both operated at 1000 Hz. In general, the raw image acquired by the high-speed camera is dark because of the significantly short exposure time. Therefore, we used an LED light to obtain brighter raw images from the high-speed camera.

The raw image data acquired by this high-speed camera were transferred to the imageprocessing PC. The image-processing PC ran high-speed image processing to track the finger position and to estimate the grasp type. The details of the image processing are explained in Section 3. The results of the image processing were sent to a real-time controller, described in Section 2.3. By performing real-time, high-speed (1000 Hz) image processing, we could control the high-speed robot hand described in Section 2.2 at 1 kHz. The sampling frequency of 1 kHz was the same as the sampling frequency of the servo-motor control.

**Figure 1.** Structure of the experimental system.

The specifications of the image-processing PC are as follows: Dell XPS 13 9360, CPU: Intel® Core (™) i7-8550U @1.80 GHz, RAM: 16.0 GB, OS: Windows 10 Pro, 64 bit.

### *2.2. High-Speed Robot Hand*

This subsection describes the high-speed robot hand, which was composed of three fingers [31]. A photograph of the high-speed robot hand is shown in the center of Figure 1. The number of degrees of freedom (DoF) of the robot hand was 10; the middle finger had 2 DoF, the left and right fingers 3 DoF, and the wrist 2 DoF. The joints of the robot hand could be closed by 180 degrees in 0.1 s, which is fast motion performance beyond that possible by a human. Each joint angle of the robot hand was controlled using a Proportional and Derivative (PD) control law, given by

$$
\tau = k_p(\theta_d - \theta) + k_d(\dot{\theta}_d - \dot{\theta}),
\tag{1}
$$

where *τ* is the torque input as the control input for the high-speed robot hand control, *θ<sup>d</sup>* and *θ* are the reference and actual joint angles of the finger of the robot hand, and *kp* and *kd* are the proportional and derivative gains of the PD controller.
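For illustration, the following is a minimal sketch of how the joint-level PD law in Equation (1) could be evaluated once per 1 ms control cycle; the gain values are placeholders, not the gains used for the actual robot hand.

```python
def pd_torque(theta_d, theta, dtheta_d, dtheta, kp=2.0, kd=0.05):
    """PD control law of Equation (1): tau = kp*(theta_d - theta) + kd*(dtheta_d - dtheta).

    theta_d, theta   : reference and measured joint angles [rad]
    dtheta_d, dtheta : reference and measured joint velocities [rad/s]
    kp, kd           : illustrative proportional/derivative gains (not from the paper)
    """
    return kp * (theta_d - theta) + kd * (dtheta_d - dtheta)

# Example: one 1 kHz control step for a single joint
tau = pd_torque(theta_d=0.8, theta=0.55, dtheta_d=0.0, dtheta=1.2)
```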

### *2.3. Real-Time Controller*

As the real-time controller, we used a commercial product manufactured by dSPACE. The real-time controller had a counter board (reading encoder attached to the motors of the robot hand), digital-to-analog (DA) output, and two Ethernet connections (one was connected to the host PC and the other to the image-processing PC). We operated the real-time controller through the host PC, and we also implemented the program of the proposed method in the host PC.

The real-time controller received the results of image processing via Ethernet communication. Then, the real-time controller generated a control signal for the robot hand to appropriately control the robot hand according to the results of the image processing and output the control signal to the robot hand.

### **3. Grasp Type Estimation Based on High-Speed Finger Position Recognition**

This section explains a new method for estimating the grasp type, such as a power grasp or a precision grasp, based on high-speed finger position recognition using machine learning and high-speed image processing. Our proposed method can be divided mainly into two components: high-speed finger position recognition, described in Section 3.1, and grasp type estimation, described in Section 3.2.

Figure 2 shows the overall flow of the proposed teleoperation method, detailed below:


The CNN and finger tracking are executed on the images. The calculation process of the CNN is run at 100 Hz, and finger tracking is run at 1000 Hz; the results of the CNN are interpolated by using the results of finger tracking. As a result, the finger positions are recognized at 1000 Hz.


**Figure 2.** Overall flow of the proposed teleoperation method.

### *3.1. High-Speed Finger Position Recognition with CNN*

This subsection explains the method for high-speed finger position recognition. By recognizing the finger positions at high speed (for instance, 1000 fps), we can reduce the latency from the human hand motion to the robot hand motion and estimate the grasp type with high accuracy. Conventional image-processing methods are too slow (around 30 fps) for this application. This research solves the speed issue of the conventionally used image processing.

The proposed method was implemented by using machine learning and high-speed image-processing technologies. As the machine learning method, we used a Convolutional Neural Network (CNN). As the high-speed image-processing method, we used tracking of a Region Of Interest (ROI) and image processing within the ROI. Here, the ROI was extremely small relative to the full image size and was set at the position estimated by the CNN, namely roughly at the position of the fingers.

### 3.1.1. Estimation of Finger Position by CNN

By using the CNN, we estimated the positions of six points (five fingertips and the center position of the palm) in the 2D image captured by the high-speed camera. The advantages of using the CNN to recognize hand positions include robustness against finger-to-finger occlusion and robustness against background effects.

The model architecture of the CNN is as follows:


This architecture was created by referring to a model [32] used for image classification and modifying it to handle multiple-output regression problems.

In addition, the Dropout rate was set to 0.1, and the activation function and parameter optimizer were ReLU and RMSprop [33], respectively. The loss was calculated using the Mean-Squared Error (MSE). Adding the Dropout layer was expected to suppress overfitting, and reducing the number of layers was expected to be effective in suppressing overfitting and reducing the inference and training times [34].
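The exact layer configuration of the model is not reproduced here; the following is only a schematic sketch of a multiple-output regression CNN with the stated settings (Dropout of 0.1, ReLU activation, RMSprop optimizer, MSE loss, and six 2D keypoints as outputs). The layer sizes, input resolution, and channel count are assumptions for illustration.

```python
import tensorflow as tf

# Schematic only: the layer sizes and input resolution are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(300, 400, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.1),               # Dropout rate of 0.1, as stated above
    tf.keras.layers.Dense(12),                  # 6 points (5 fingertips + palm) x (x, y)
])
model.compile(optimizer="rmsprop", loss="mse")  # RMSprop optimizer and MSE loss, as stated above
```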

Figure 3 shows an example of annotation on a hand image, where the annotated positions, indicating the center points of the five fingertips and the center of the palm, are shown as blue dots.

**Figure 3.** Annotation result of fingertip positions.

3.1.2. Finger Tracking by High-Speed Image Processing

Since the estimation by the CNN is much slower than the imaging by the high-speed camera, there are many frames in which the hand position cannot be recognized while the CNN estimation is running, and the CNN processing becomes the rate-limiting step of the system. By compensating for the frames in which the CNN estimation is not performed, we achieved real-time acquisition of the hand position. Figure 4 [35] shows a schematic of the method that combines CNN estimation performed at low frequency with high-frequency hand tracking to obtain the hand position. In Figure 4, the orange dots indicate the execution of the CNN, and the gray dots indicate the execution of hand tracking. By performing hand tracking in the frames in which the CNN is not executed, the hand position can be obtained at high frequency.

**Figure 4.** Concept of fusing the CNN and finger tracking.

Next, we explain the requirements that the hand-tracking method should satisfy. First, in order to perform real-time hand position recognition, information obtained in frames after the frame to be tracked cannot be used. For example, if we perform linear interpolation of two CNN results to interpolate the hand position in the frame between the CNN steps, we need to wait for the second CNN to be executed, which impairs the real-time performance of the system. Therefore, it is necessary to track the hand based on the information obtained from the frame to be processed and the earlier frames.

In addition, it is desirable to obtain the data by measurement rather than by prediction. This is because it is not always possible to accurately recognize high-speed and minute hand movements if the current hand position is inferred from the trend of past hand positions. By calculating the current hand position from the information from the current frame, instead of predicting based on the information from the past frames, we can achieve more accurate recognition of sudden hand movements.

In this study, we propose a real-time, measurement-based recognition method for hand positions in frames where the CNN is not performed. This method involves hand tracking using frame-to-frame differences of fingertip center-of-gravity positions. The proposed hand-tracking method calculates the hand position in the corresponding frame by using three data sets: the hand position from the past CNN estimation results, the image of the frame in which the CNN was executed, and the image of the corresponding frame. As the amount of hand movement between frames, we calculated the difference in hand positions from the two images and added it to the hand position estimated by the CNN to treat it as the hand position in the corresponding frame.

The following is the specific method of calculating the hand position when the most recent frame in which the CNN is executed is *n*, the frame in which the tracking process is performed is *n* + *k*, and the frame in which the CNN is executed again is *n* + *T*:

1. *n*-th frame: CNN

In the *n*-th frame, let an estimated fingertip position obtained by the CNN be *Pn*. Using Equation (2) below, the image is binarized, the ROI with the center position *Pn* is extracted, and the center of the fingertip in the ROI is taken to be *Cn* (Figure 5a). In the image binarization, the original image and the binarized image are src(*i*, *j*) and *f*(*i*, *j*), respectively. Furthermore, the threshold of the image binarization is set at *thre*.

$$f(i,j) = \begin{cases} 0 & \text{if } \mathrm{src}(i,j) < thre \\ 1 & \text{otherwise} \end{cases} \tag{2}$$

The image moment is represented by *mpq* (Equation (3)), and the center position (*Cn*) of the fingertip in the ROI is (*m*10/*m*00, *m*01/*m*00):

$$m\_{pq} = \sum\_{i} \sum\_{j} i^p j^q f(i, j) \tag{3}$$

The value of *Pn* is substituted for the fingertip position *Qn* in the *n*-th frame.

$$Q\_n = P\_n \tag{4}$$

2. (*n* + *k*)-th frame: Finger tracking

After binarizing the image in the (*n* + *k*)-th frame (0 < *k* < *T*), the ROI with the center position *Pn* is extracted, and the center of the fingertip in the ROI is taken to be *Cn+k* (Figure 5b). At that time, let the finger position in the (*n* + *k*)-th frame be *Qn+k*, calculated by the following equation:

$$Q_{n+k} = P_n + C_{n+k} - C_n \tag{5}$$

**Figure 5.** An example of finger tracking: (**a**,**b**) show images at the *n*-th and (*n* + *k*)-th frames, respectively.
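As a concrete illustration of the per-frame tracking step in Equations (2)–(5), the following NumPy sketch binarizes the ROI around *Pn*, computes the fingertip centroid from the image moments, and applies the frame-to-frame difference. The coordinate convention and helper names are our own, and the default ROI size and threshold follow the experimental settings in Section 4.1.2 (40 × 40 pixels, *thre* = 20).

```python
import numpy as np

def roi_centroid(image, center, roi_size=40, thre=20):
    """Binarize a square ROI around `center` (Eq. (2)) and return the centroid
    of the bright pixels computed from the image moments (Eq. (3))."""
    cx, cy = int(center[0]), int(center[1])
    h = roi_size // 2
    roi = image[cy - h:cy + h, cx - h:cx + h]
    f = (roi >= thre).astype(np.float64)            # f(i, j) of Eq. (2)
    m00 = f.sum()
    if m00 == 0:                                    # no fingertip pixels in the ROI
        return np.array(center, dtype=float)
    js, is_ = np.meshgrid(np.arange(f.shape[1]), np.arange(f.shape[0]))
    m10, m01 = (is_ * f).sum(), (js * f).sum()      # first-order moments over rows/columns
    # Centroid in full-image coordinates, with x = column and y = row.
    return np.array([cx - h + m01 / m00, cy - h + m10 / m00])

def track_finger(P_n, C_n, image_nk, roi_size=40, thre=20):
    """Eq. (5): Q_{n+k} = P_n + C_{n+k} - C_n, with the ROI kept at P_n."""
    C_nk = roi_centroid(image_nk, P_n, roi_size, thre)
    return np.asarray(P_n, dtype=float) + (C_nk - np.asarray(C_n, dtype=float))
```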

If the hand-tracking module receives a new CNN estimation result every *T* frames, where *T* is a predefined constant, the tracking process must wait whenever the CNN inference takes longer than *T* frames. Because the inference time of the CNN varies in actual execution, the result may not arrive even after *T* frames have passed, which increases the latency of hand tracking. On the other hand, if the CNN result is updated whenever it is received from the CNN module, rather than at regular intervals, the latency is reduced because the last received CNN result is simply used even when the transmission of a new result is delayed.

Based on the above, we devised two different methods for hand tracking with and without a waiting time for the CNN estimation results: a low-latency mode with a variable *T* value and a high-accuracy mode with a constant *T* value. The low-latency mode is effective for applications where low latency is more important than accuracy, such as anticipating human actions. On the other hand, the high-accuracy mode is suitable for applications where accurate acquisition of hand positions is more important than low latency, such as precise mechanical operations. In this research, we adopted the low-latency mode to track the human hand motion, because we aimed at the development of a teleoperation system with high speed and low latency. An overview of the algorithm for the low-latency mode is shown in Algorithm 1. The algorithm for the high-accuracy mode is shown in Algorithm A1 in Appendix A.

The characteristics of the two modes are summarized below:

Algorithm 1. Low-latency mode:

As the result of the CNN used for finger tracking, the latest received result is utilized. The advantage is that the latency is reduced because no time is spent waiting for the CNN results. The disadvantage is that if the CNN processing is delayed, the tracking process is based on the CNN result of a distant frame, which reduces the accuracy.

Algorithm A1. High-accuracy mode:

By fixing the interval *T* (in frames) at which the CNN is executed, the estimate by the CNN is updated at fixed intervals. The advantage is that hand tracking is based on regularly acquired CNN results, which improves the accuracy. The disadvantage is that when the CNN processing is delayed, the latency increases because the tracking process must wait for the CNN result to be updated before it starts.
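The difference between the two update policies can be summarized in the following sketch. It is only an illustration, not the actual Algorithms 1 and A1, and it assumes hypothetical non-blocking `poll_cnn_result()` and blocking `wait_cnn_result()` interfaces to the CNN module.

```python
def low_latency_update(state, poll_cnn_result):
    """Low-latency mode: use the most recent CNN result if one has arrived,
    otherwise keep tracking from the last one (no waiting)."""
    result = poll_cnn_result()          # returns (P_n, C_n) or None; non-blocking
    if result is not None:
        state.P_n, state.C_n = result   # update the reference frame immediately
    return state

def high_accuracy_update(state, wait_cnn_result, frame_index, T):
    """High-accuracy mode: every T frames, block until the CNN result for the
    scheduled frame is available before tracking continues."""
    if frame_index % T == 0:
        state.P_n, state.C_n = wait_cnn_result()   # blocking wait for the CNN result
    return state
```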

### **Algorithm 1** Finger tracking with low latency.


### *3.2. Grasp Type Estimation*

This subsection explains the method for estimating the grasp type on the basis of the results of the high-speed finger tracking. Furthermore, we explain the robot hand motion according to the estimated grasp type.

### 3.2.1. Estimation of Grasp Type by Decision Tree

We also used machine learning technology to estimate the grasp type on the basis of the finger positions and the center position of the palm, which are estimated by the CNN and hand tracking at high speed. As representative grasp types to be estimated, we considered two grasp types: (1) a power grasp using the palm of the hand and (2) a precision grasp using only the fingertips. As a result, we categorized the states to be estimated into three types: "power grasp", "precision grasp", and "non-grasp", as shown in Figure 6. In the "power grasp", the four fingers other than the thumb move in the same way and tend to face the thumb, whereas in the "precision grasp", the positions of the thumb and index finger tend to separate from those of the little finger and ring finger. The "non-grasp" state corresponds to the extended state of the fingers.

**Figure 6.** Differences among power grasp (**left**), precision grasp (**middle**), and non-grasp (**right**).

To accurately estimate the grasp type, we used a decision tree (decision tree classifier) as the machine learning method. Decision trees feature fast classification and readable estimation criteria. In particular, the ease of interpreting the estimation criteria and the possibility of adjusting the resulting classification rules are the reasons for using a decision tree as the classification method.

### Preprocessing of hand position data:

In order to improve the accuracy of the decision tree, we preprocessed the input data, namely the hand positions (Figure 7). In the preprocessing, we first calculate the distance between the middle finger and the palm of the hand in a frame with the fingers extended, and use it as a hand size reference to calibrate the hand size. When the coordinates of the middle finger and the palm of the hand in the image of the frame to be calibrated are (*xm*0, *ym*0) and (*xp*0, *yp*0), respectively, the distance between the two points is given by

$$r_0 = \sqrt{(x_{m0} - x_{p0})^2 + (y_{m0} - y_{p0})^2}. \tag{6}$$

**Figure 7.** Direction of middle finger and angle between each finger and middle finger.

Next, we calculate the position of the hand, (*r*, *θ*), in the polar coordinate system with the palm of the hand serving as the origin and the direction of extension of the middle finger serving as the *x*-axis, based on the position of the hand represented by the coordinates in the image. When the coordinates of the middle finger and the palm in the image are (*xm*, *ym*) and (*xp*, *yp*), respectively, the declination angle of the polar coordinate of the middle finger, *θm*, is expressed by the following formula:

$$\theta\_m = \arctan\left(\frac{y\_m - y\_p}{x\_m - x\_p}\right). \tag{7}$$

Furthermore, the polar coordinate (*ri*, *θi*) corresponding to the coordinate (*xi*, *yi*) of finger *i* in the image is expressed by the following equations, with the direction of the middle fingertip serving as the positive direction of the *x*-axis. To calibrate for the size of the hand, we divide *ri* by the hand size reference *r*0.

$$r_i = \frac{1}{r_0} \sqrt{(x_i - x_p)^2 + (y_i - y_p)^2} \tag{8}$$

$$\theta\_i = -\arctan\left(\frac{y\_i - y\_p}{x\_i - x\_p}\right) - \theta\_m \tag{9}$$

The vector **r** containing the distance *ri* of each finger and the vector Θ containing the declination angle *θi*, obtained in this way for each frame, are used as inputs to the decision tree. By using the relative positions of the fingers with respect to the palm in polar coordinates as inputs, the data to be focused on are clarified. Furthermore, by dividing the distances by the size of the hand with the fingers extended, the characteristics of the hand shape can be extracted while suppressing the effects of differences in hand size between individuals and of differences in the distance between the fingers and the camera lens in each execution.
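A minimal sketch of this preprocessing, following Equations (7)–(9), is shown below; we use `arctan2` instead of `arctan` for numerical robustness, and the variable names are our own.

```python
import numpy as np

def to_polar_features(fingers_xy, palm_xy, middle_xy, r0):
    """Convert fingertip positions to the normalized polar features of
    Eqs. (7)-(9): r_i scaled by the hand-size reference r0, and theta_i
    measured from the middle-finger direction."""
    xp, yp = palm_xy
    xm, ym = middle_xy
    theta_m = np.arctan2(ym - yp, xm - xp)                     # Eq. (7)
    r, theta = [], []
    for xi, yi in fingers_xy:                                  # 5 fingertips
        r.append(np.hypot(xi - xp, yi - yp) / r0)              # Eq. (8)
        theta.append(-np.arctan2(yi - yp, xi - xp) - theta_m)  # Eq. (9)
    return np.array(r), np.array(theta)
```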

### 3.2.2. Grasping Motion of High-Speed Robot Hand

The object is grasped by the high-speed robot hand according to the result of the high-speed grasp type estimation described above.

The middle finger of the robot hand has two joints, namely the root and tip links, and the left and right fingers each have three joints: the root, the tip, and rotation around the palm. The root and tip links operate in the vertical direction to bend and stretch the fingers, allowing them to wrap around objects. The rotation joint around the palm moves horizontally and can change its angle to face the middle finger, which enables stable grasping. The wrist of the robot hand has two joints, namely flexion/extension and rotation, which enable the fingers to move closer to the object to be grasped.

Since the time required to rotate a finger joint by 180 deg is 0.1 s, it takes approximately 50 ms to close a finger from the open position to 90 deg for grasping. From our research using a high-speed control system, the latency from the image input to the torque input of the robot hand is about 3 ms [36]. If the estimated grasp type oscillates, the target angle changes frequently and the robot hand becomes unstable; therefore, the grasping operation is started only when the same grasp type is received continuously for a certain number of frames.
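The following is a minimal sketch of this stabilization step; the number of consecutive frames (`hold_frames`) is an illustrative value, since the paper only states that the same grasp type must be received continuously for a certain number of frames.

```python
def debounce_grasp_type(history, new_label, hold_frames=30):
    """Start the grasp motion only after the same estimated grasp type has been
    received for `hold_frames` consecutive frames (value is illustrative)."""
    history.append(new_label)
    if len(history) > hold_frames:
        history.pop(0)
    if len(history) == hold_frames and len(set(history)) == 1:
        return history[0]      # stable estimate: command this grasp type
    return None                # not yet stable: keep the current hand posture
```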

### **4. Experiments and Evaluations**

This section explains the experiments, experimental results, and evaluations for finger position recognition (Section 4.1—Exp. 1-A), grasp type estimation (Section 4.2—Exp. 1-B), and teleoperated grasping by a robot hand (Section 4.3—Exp. 2), respectively. Figure 8 shows an overview of the experiments and evaluations for each part.

**Figure 8.** Overview of experiments and evaluations.

### *4.1. Finger Position Recognition*

This subsection explains the experiment and evaluation for finger position recognition based on the proposed method with high-speed image processing and the CNN.

### 4.1.1. Preparation for Experiment

We trained the CNN model described in Section 3, which estimates hand positions from images. To increase the amount of data for training, we performed data augmentation by random scaling (0.7∼1.0) and rotation (−60∼60 deg) operations. As a result, we obtained 9000 images from 1000 images by data augmentation. The training process was performed for 200 epochs, with 70% of the prepared data used as training data and 30% used as validation data. The slope of the loss curve, calculated from the Mean-Squared Error (MSE), flattened out around epoch 30. When we calculated the Mean Absolute Error (MAE) of the estimation results for the validation data, the MAE was less than 10 pixels. The width of the fingers in the image was about 15 pixels, which means that the hand position estimation was accurate enough for the hand-tracking process. As a result of 5 trials of 100 consecutive inferences, the mean and standard deviation of the inference times were 7.03 ms and 1.46 ms, respectively. Furthermore, the longest was 24.6 ms, and the shortest was 3.75 ms. Thus, at the 1000 fps image acquisition rate of the high-speed camera, hand tracking was executed an average of seven times for each update of the CNN result.
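A minimal sketch of such an augmentation step is shown below, assuming OpenCV is used to apply a random scaling and rotation about the image center and that the annotated keypoints are transformed with the same affine matrix; the implementation details are our own assumptions.

```python
import numpy as np
import cv2

def augment(image, points, rng):
    """Random scaling (0.7-1.0) and rotation (-60 to 60 deg) about the image
    center, applied to both the image and its annotated keypoints."""
    h, w = image.shape[:2]
    scale = rng.uniform(0.7, 1.0)
    angle = rng.uniform(-60.0, 60.0)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)   # 2x3 affine matrix
    aug_image = cv2.warpAffine(image, M, (w, h))
    ones = np.ones((len(points), 1))
    aug_points = np.hstack([points, ones]) @ M.T                # transform keypoints
    return aug_image, aug_points

# e.g., generate 8 extra variants per original image (9000 images from 1000)
rng = np.random.default_rng(0)
```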

### 4.1.2. Experiment—1-A

The exposure time of the high-speed camera was set to 0.5 ms, the image size to 400 pixels wide and 300 pixels high, the square ROI for the hand-tracking process to 40 pixels by 40 pixels, and the threshold for the binarization process to *thre* = 20. In such a situation, we captured the hand opening and closing in 1000 frames during 1 s and applied the proposed hand tracking for hand position recognition. The images captured in the experiment were stored, the CNN was run offline on all images, and the results were used as reference data for the comparison method.

4.1.3. Results

The output speed of the hand position was 1000 Hz, which was the same speed as the imaging. The latency between the end of imaging and the output of the hand position was also 1 ms.

Hand images are shown in Figure 9i: starting with the fingers extended (Figure 9(i-a)), bending the fingers (Figure 9(i-b)), folding them back around Frame 400 (Figure 9(i-c)), and stretching them again (Figure 9(i-d)–(i-f)). Figure 9ii is a graph of the hand positions estimated over 1000 frames by the proposed method. The image coordinates of the five fingers and the palm center point are represented as light blue for the index finger, orange for the middle finger, gray for the ring finger, yellow for the pinky finger, dark blue for the thumb, and green for the palm.

The errors in the proposed method and the errors in the comparison method are shown in Table 2. The average error of the five fingers in the proposed method was 1.06 pixels, whereas the error of the comparison method was 1.27 pixels; that is, the error of the proposed method was about 17% smaller. For all fingers, the error of the proposed method was smaller than that of the comparison method. Note that 1 pixel in the hand image corresponds to approximately 0.7 mm in the real world.

(ii) (*x*, *y*) coordinates of each finger and palm

**Figure 9.** Result of finger tracking: (**i**) images obtained from the high-speed camera and (**ii**) (*x*, *y*) coordinates of each finger and the palm.


**Table 2.** Mean-Squared Error (MSE) of finger positions estimated by CNN w/ and w/o finger tracking.

### 4.1.4. Discussion

First, we consider the execution speed of hand position recognition. From the experimental results, the output speed of the hand position was 1000 Hz, which is the same as that of the image capturing, indicating that the hand position recognition is fast enough. In addition, the latency from the end of imaging to the output of the hand position was 1 ms, indicating that the total execution time of the inter-process data sharing and handtracking process itself was 1 ms. Thus, the effectiveness of the proposed method described in Section 3 was shown.

Next, we discuss the reason why the error in the proposed method is smaller than that in the comparison method. In the proposed method, even for the frames where the CNN is not executed, the hand position recognition at 1000 Hz by the tracking process can output values close to the reference data. On the other hand, the comparison method outputs the last CNN estimation result without updating it for the frames where the CNN is not executed, and thus, the updating in the hand position recognition is limited to 100 Hz. The effectiveness of the proposed method for fast hand tracking is the reason why the proposed method has superior accuracy.

### *4.2. Grasp Type Estimation*

This subsection explains the experiment and evaluation for grasp type estimation based on the finger position recognition described above.

### 4.2.1. Preparation for the Experiment

We trained a decision tree that outputs a grasp type label using the hand position as input. First, we captured 49 images of a power grasp, 63 images of a precision grasp, and 52 images of a non-grasp and annotated them with grasp type labels. Next, we annotated the hand positions in the images using the CNN trained in the previous subsection. In the preprocessing, we transformed the hand positions from the Cartesian image coordinate system into a polar coordinate system centered on the palm of the hand and normalized them by the hand size, obtaining 10 input variables. Two variables were used per finger: the length *r* and the angle *θ* for each of the five fingers (index, middle, ring, pinky, and thumb). The length *r* is the ratio of the distance from the center of the palm to each fingertip to the distance from the palm to the tip of the middle finger with the fingers extended. The angle *θ* is the angle between the finger and the middle finger, with the thumb direction being positive and the little finger direction being negative.

Based on the above variables as inputs, a decision tree was trained using the leave-one-out method [37]. The model was trained with the depth of the decision tree varied from 1 to 10. When the depth was three, the accuracy was about 0.94, which we considered sufficient. Thus, we set the depth of the tree structure to three. Furthermore, the parameters of the decision tree, such as the classification conditions, Gini coefficient, and depth, were obtained.
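For reference, the depth selection by leave-one-out cross-validation could be sketched as follows with scikit-learn; the feature matrix construction and the random seed are assumptions, and only the reported setting (depths 1–10, accuracy of about 0.94 at depth 3) comes from the paper.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

def loo_accuracy(X, y, depth):
    """Leave-one-out accuracy of a depth-limited decision tree, as used to
    select the tree depth (the paper reports about 0.94 at depth 3)."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        hits += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)

# X: (164, 10) polar features (r_i, theta_i for 5 fingers); y: labels 0/1/2
# best_depth = max(range(1, 11), key=lambda d: loo_accuracy(X, y, d))
```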

### 4.2.2. Experiment—1-B

We used a series of images of the grasping motion to evaluate the learned decision tree. We took a series of 500 frames of hand images of each type of grasping motion, starting from the open fingers, performing a power or precision grasp, and then, opening the fingers again. The exposure time of the high-speed camera was 0.5 ms, the image size 400 pixels (width) × 300 pixels (height), and the frame rate 1000 fps. Based on the hand positions in the images recognized by the CNN and hand tracking (accurate mode), we performed the grasp type estimation using the decision tree and calculated the correct answer rate.

### 4.2.3. Result

Figures 10 and 11 show the results of grasp type estimation for a series of images, together with hand images at representative frames. The horizontal axes in Figures 10 and 11 represent the frame number, and the vertical axes represent the label of the grasp type, where 0 indicates non-grasp, 1 indicates a power grasp, and 2 indicates a precision grasp. Figure 10 shows the result for the power grasp motion: non-grasp at Frame 0 (a), judged as a power grasp at Frame 82 (b), a continued power grasp (c), and non-grasp again at Frame 440 (d). Figure 11 shows the result for the precision grasp motion: non-grasp at Frame 0 (a), judged as a precision grasp at Frame 72 (b), a continued precision grasp (c), and non-grasp again at Frame 440 (d). No misjudgments occurred in either the power grasp or the precision grasp experiment.

### 4.2.4. Discussion

First, we discuss the stability of the grasp type estimation by the proposed method. From the results shown in Figures 10 and 11, there was no misjudgment of the grasp type estimation. In addition, high-accuracy grasp type estimation was achieved successfully.

Next, we describe the processing speed of the grasp type estimation. The mean and standard deviation of the inference speed for three trials of 1000 consecutive inferences were 0.07 ms and 0.02 ms, respectively. This is much shorter than 1 ms and is fast enough for a system operating at 1000 Hz.

Finally, the hand position recognition and grasp type estimation for the hand images evaluated in Experiments 1-A and 1-B are shown in Figure 12. From this result, we can conclude that the effectiveness of the proposed method for hand position recognition and grasp type estimation is confirmed.

**Figure 10.** Estimation result in the case of power grasp.

**Figure 11.** Estimation result in the case of precision grasp.

**Figure 12.** Difference among non-grasp, power grasp, and precision grasp and fingertip and palm positions (white circles).

### *4.3. Teleoperated Grasp by Robot Hand*

This subsection explains the experiment and evaluation for teleoperated grasping by a robot hand on the basis of human hand motion sensing.

### 4.3.1. Experiment—2

In this experiment, the robot hand grasped a Styrofoam stick-shaped object with a diameter of 0.05 m and a length of 0.3 m, which was suspended by a thread in the robot's range of motion. The human operator performed a "power grasp" or a "precision grasp" with his/her hand in the field of view of the high-speed camera. The real-time controller calculated the reference joint angles of the robot hand for the grasping operation according to the received grasp type and provided the reference joint angles as step inputs to the robot hand through the real-time control system.

### 4.3.2. Results

A video of the grasping process of the robot hand can be seen at our web site [38], and the hand fingers of the robot hand and the operator are shown in Figure 13. From this result, we confirmed the effectiveness of the proposed teleoperation of the robot hand based on the hand tracking and grasp type estimation.

### 4.3.3. Discussion

Since the operating frequency of both image processing and robot control was 1000 Hz in the experiment, the operating frequency of the entire system was 1000 Hz. Therefore, the robot hand manipulation with high-speed image processing proposed in this study achieved the target operating frequency of 1000 Hz.

Next, we consider the latency of the entire system. We define the latency as the time from the imaging to the completion of the robot hand motion. That is to say, the latency can be evaluated from the total time for the image acquisition, the image processing (including hand tracking and grasp type estimation), the transmission of the results from the image-processing PC to the real-time controller, and the robot hand motion.

**Figure 13.** Experimental result of teleoperation; left, middle, and right show non-grasp, precision grasp, and power grasp, respectively.

First, the latency from the end of imaging to the end of transmission of the image processing result was 1 ms. Second, the latency from the transmission of the result by the image-processing PC to the output of the control input by the real-time controller was 2∼3 ms, with a worst case of 3 ms. Next, we need to consider the latency from the output of the control input to the completion of the robot hand motion. The time required to converge to ±10% of the reference joint angle is shown in Table 3. From the top in Table 3, "Joint" refers to the root and tip links of the middle finger and the root, tip, and rotation around the palm of the left and right fingers, respectively. Furthermore, Figure 14 shows the step response of the tip link of the left and right fingers from an initial value of 0.0 rad to a reference joint angle of 0.8 rad. The dashed lines depict the range of ±10% of the reference joint angle of 0.8 rad, and the slowest convergence time was 36 ms. Therefore, it took 36 ms to complete the grasp after the real-time controller received the grasp type. As described above, the latency of the entire system was 1 ms for the imaging and the grasp type estimation, 3 ms for the communication between the image-processing PC and the real-time controller, and 36 ms for the robot operation, totaling 40 ms for the teleoperation of the robot hand. Since this value (40 ms) is close to the sampling time of the human eye (around 33 ms), our developed system is fast enough for robot teleoperation.


**Table 3.** Time for convergence of robot hand joint angle to ±10% of the reference angle.

Figure 15 shows the timeline of the human and robot grasping motions. From the experimental results, it took about 80 ms from the time the human hand started the grasping motion until a hand form that could be estimated as a specific grasp type was captured, and about 150 ms until the human grasping motion was completed. On the other hand, it took 40 ms for the robot hand to complete the grasp after the hand configuration estimated as a specific grasp type was captured. In other words, the grasp by the robot hand was completed 40 ms after the grasp type became recognizable at 80 ms, i.e., at about 120 ms, before the human grasping motion itself was completed at 150 ms; the remote grasping operation using the robot hand was realized by anticipating the motion from the human hand morphology during the grasping motion, that is, by exploiting pre-shaping. Consequently, we achieved teleoperated robot hand control beyond the speed of human hand motion by using the proposed method and system. This result may contribute to compensating for the latency due to the network in teleoperation.

**Figure 14.** Joint response of high-speed robot hand.

**Figure 15.** Time series of human hand motion and robot hand motion.

### **5. Conclusions**

The purpose of the work described in this paper was to develop an intuitive and fast telerobotic grasping and manipulation system, which requires fast recognition of the intended grasp type from the operator's gestures. In this paper, we proposed a method for fast recognition of grasping intention by obtaining hand positions from gesture images and estimating the grasp type from the hand positions by machine learning. In particular, we combined machine learning and tracking to achieve both high speed and high accuracy in hand position acquisition. In the evaluation experiments of the hand position recognition method, we achieved a mean-squared error of 1.06 pixels, an operating frequency of 1000 Hz, and a latency of 1 ms. In the evaluation experiment of the grasp type estimation, we achieved an accuracy of 94%, and the inference time was 0.07 ms. These results show that the operating frequency of the system from gesture capturing to grasp type estimation was 1000 Hz and the latency was 1 ms, which confirms the effectiveness of the proposed method. In the remote grasping experiment with the high-speed robot hand using this high-speed grasp type estimation system, the grasping operation was completed 40 ms after the hand image indicating the grasp type was captured, which corresponds to the point at which the grasping motion of the human fingers was about 80% complete, and the high-speed tele-grasping operation was realized successfully.

The first application of the system proposed in this study is a Human–Machine Interface (HMI) using high-speed gesture recognition. In this study, the grasp type was obtained from the hand positions. However, the proposed method can be applied to HMIs that support various hand gestures because it can easily be extended to other types of gestures. Next, human–machine coordination using fast and accurate hand position recognition can be considered as an application. Since the hand positions are acquired with high speed and high accuracy, the system can be applied to human–machine coordination using not only gestures, but also hand positions, and to remote master–slave operation of robots by mapping.

There are still some issues to be solved in this research. One of them is the 3D measurement of the hand position. Currently, the hand position is only measured in two dimensions, which restricts the orientation of the operator's hand, but we believe that three-dimensional measurement will become possible by using multiple cameras, color information, and depth information. The other is hand position recognition against a miscellaneous background. In this study, the background of the hand image was black, but by training the machine learning model using images with various backgrounds, the hand position can be recognized without being affected by the background, and the applicability of the system will be enhanced. These issues described above will be solved in the future.

**Author Contributions:** Conceptualization, Y.Y.; methodology, K.Y. and Y.Y.; software, K.Y.; validation, K.Y. and Y.Y.; formal analysis, K.Y.; investigation, K.Y. and Y.Y.; writing—original draft preparation, Y.Y., K.Y.; writing—review and editing, Y.Y., K.Y.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported in part by the JST, PRESTO Grant Number JPMJPR17J9, Japan.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflicts of interest.

### **Appendix A**

In this Appendix, we show the algorithm for the other mode, namely the high-accuracy mode for finger tracking, in Algorithm A1.


### **References**


## *Article* **A Human-Following Motion Planning and Control Scheme for Collaborative Robots Based on Human Motion Prediction**

**Fahad Iqbal Khawaja 1,2,\*, Akira Kanazawa 1, Jun Kinugawa <sup>1</sup> and Kazuhiro Kosuge 1,3**


**Abstract:** Human–Robot Interaction (HRI) for collaborative robots has become an active research topic recently. Collaborative robots assist human workers in their tasks and improve their efficiency. However, the worker should also feel safe and comfortable while interacting with the robot. In this paper, we propose a human-following motion planning and control scheme for a collaborative robot which supplies the necessary parts and tools to a worker in an assembly process in a factory. In our proposed scheme, a 3-D sensing system is employed to measure the skeletal data of the worker. At each sampling time of the sensing system, an optimal delivery position is estimated using the real-time worker data. At the same time, the future positions of the worker are predicted as probabilistic distributions. A Model Predictive Control (MPC)-based trajectory planner is used to calculate a robot trajectory that supplies the required parts and tools to the worker and follows the predicted future positions of the worker. We have installed our proposed scheme in a collaborative robot system with a 2-DOF planar manipulator. Experimental results show that the proposed scheme enables the robot to provide anytime assistance to a worker who is moving around in the workspace while ensuring the safety and comfort of the worker.

**Keywords:** human–robot interaction; human–robot collaboration; collaborative robots; motion planning; robot control; human motion prediction; human-following robots

### **1. Introduction**

The concept of collaborative robots was introduced in the early 1990s. The first collaborative system was proposed by Troccaz et al. in 1993 [1]. This system used a passive robot arm to ensure safe operation during medical procedures. In 1996, Colgate et al. developed a passive collaborative robot system and applied it to a vehicle door assembly process carried out by a human worker [2]. In 1999, Yamada et al. proposed a skill-assist system to help a human worker carry a heavy load [3].

Collaborative robot systems are being actively introduced in the manufacturing industry. The International Organization for Standardization (ISO) amended its robot safety standards ISO 10218-1 [4] and 10218-2 [5] in 2011 to include safety guidelines for human–robot collaboration. This led to an exponential rise in collaborative robot research and development. Today, many companies are manufacturing their own versions of collaborative robots, and these robots are being used in industries all over the world. Collaborative robots are expected to play a major role in Industry 5.0 environments, where people will work together with robots and smart machines [6].

In 2010, a 2-DOF co-worker robot "PaDY" (in-time Parts and tools Delivery to You robot) was developed in our lab to assist a factory worker in an automobile assembly process [7]. This process comprises a set of assembly tasks that are carried out by a worker while moving around the car body. PaDY assists the worker by delivering the necessary parts and tools to him/her for each task. The original control system of PaDY was developed based on a statistical analysis of the worker's movements [7].

Many studies have been carried out on human–robot collaborative systems. Hawkins et al. proposed an inference mechanism of human action based on a probabilistic model to achieve wait-sensitive robot motion planning in 2013 [8]. D'Arpino et al. proposed fast target prediction of human reaching motion for human–robot collaborative tasks in 2015 [9]. Unhelkar et al. designed a human-aware robotic system in 2018, in which human motion prediction is used to achieve a safe and efficient part-delivery task between the robot and the stationary human worker [10]. A recent survey on the sensors and techniques used for human detection and action recognition in industrial environments can be seen in [11]. The studies cited above [8–10] have improved the efficiency of collaborative tasks by incorporating human motion prediction into robot motion planning.

Human motion prediction was also introduced to PaDY. In 2012, the delivery operation delay of the robot–human handover tasks was reduced by utilizing prediction of the worker's arrival time at the predetermined working position [12]. In 2019, a motion planning system was developed which optimized a robot trajectory by taking the prediction uncertainty of the worker's movement into account [13]. In those studies [12,13], the robot repeats the delivery motion from its home position to each predetermined assembly position. If the robot can follow the worker during the assembly process, the worker can pick up necessary parts and tools from the robot at any time. Thus, more efficient collaborative work could be expected by introducing human-following motion.

In this paper, human-following motion of the collaborative robot is proposed for delivery of parts and tools to the worker. The human-following collaborative robot system needs to stay close to the human worker, while avoiding collision with the worker under the velocity and acceleration constraints. The contribution of this paper is summarized as follows:


This is a quantitative study in which we conducted the experiments ourselves and analyzed the data collected from these experiments to deduce the results. The proposed scheme predicts the motion of the worker and calculates an optimal delivery position for the handover of parts and tools from the robot to the worker for each task of the assembly process. This scheme has been designed for a single worker operating within his/her workspace. It is not designed for cases in which multiple workers are operating in the same workspace, or in which the worker moves beyond the workspace.

The rest of the paper is organized as follows. Section 2 describes the related works. Section 3 gives an overview of the proposed scheme, including the delivery position determination, the worker's motion prediction, and trajectory planning and control scheme. The experimental results are discussed in Section 4. Section 5 concludes this paper.

### **2. Related Works**

In this section, we present a review of the existing research on human–robot handover tasks, human-following robots, and motion/task planning based on human motion prediction.

### *2.1. Human–Robot Handover*

Some studies have considered the problem of psychological comfort of the human receiver during the handover task. Baraglia et al. addressed the issue of whether and when a robot should take the initiative [14]. Cakmak et al. advocated the inclusion of user preferences while calculating handover position [15]. They also identified that a major cause of delay in the handover action is the failure to convey the intention and timing of handing over the object from the robot to the human [16]. Although these studies deal with important issues for improving the human–robot collaboration, it is still difficult to apply them in actual applications because psychological factors cannot be directly observed.

Some other studies used observable physical characteristics of the human worker for planning a robot motion that is safe and comfortable for the worker. Mainprice et al. proposed a motion planning scheme for human–robot collaboration considering HRI constraints such as constraints of distance, visibility and arm comfort of the worker [17]. Aleotti et al. devised a scheme in which the object is delivered in such a way that its most easily graspable part is directed towards the worker [18]. Sisbot et al. proposed a human-aware motion planner that is safe, comfortable and socially acceptable for the human worker [19].

The techniques and algorithms mentioned above operate under the assumption that the worker remains stationary in the environment. To solve the problem of providing assistance to a worker who moves around in the environment, we propose a human-following approach with HRI constraints in this paper.

### *2.2. Human-Following Robots*

Several techniques have been proposed to carry out human-following motion in various robot applications. One of the first human-following approaches was proposed by Nagumo et al., in which an LED device carried by the human was detected and tracked by the robot using a camera [20].

Hirai et al. performed visual tracking of the human back and shoulder in order to follow a person [21]. Yoshimi et al. used several parameters, including the distance, speed, color, and texture of human clothes, to achieve stable tracking in complex situations [22]. Morioka et al. calculated the reference velocities for human-following control from the estimated human position under recognition uncertainty [23]. Suda et al. proposed a human–robot cooperative handling control using force and moment information [24].

The techniques cited in this section focus on performing human-following motion of the robot to achieve safe and continuous tracking. However, these schemes use the feedback of the observed/estimated current position of the worker. This makes it difficult for the robot to keep up with the worker who is continuously moving around in the workspace. In this paper, we solve this problem by applying human motion prediction and MPC.

### *2.3. Motion/Task Planning Based on Human Motion Prediction*

In recent years, many studies have proposed motion planning using human motion prediction. The predicted human motion is used to generate a safe robot trajectory. Mainprice et al. proposed to plan a motion that avoids the predicted occupancy of the 3D human body [25]. Fridovich-Keil et al. proposed to plan a motion that avoids the risky region calculated by the confidence-aware human motion prediction [26]. Park et al. proposed a collision-free motion planner using the probabilistic collision checker [27].

Several studies have proposed robot task planning to achieve collaborative work based on human motion prediction. Maeda et al. achieved a fluid human–robot handover by estimating the phase of human motion [28]. Liu et al. presented a probabilistic model for human motion prediction for task-level human–robot collaborative assembly [29]. Cheng et al. proposed an integrated framework for human–robot collaboration in which the robot perceives and adapts to human actions [30].

Human motion prediction has been effectively used in various problems of human–robot interaction. In this paper, we apply human motion prediction to the human-following motion of the collaborative robot for the delivery of parts/tools to a worker.

### **3. Proposed Motion Planning and Control Scheme**

*3.1. System Architecture*

Figure 1 shows the system architecture of our proposed scheme. This scheme consists of three major parts: delivery position determination (Section 3.2), the worker's motion prediction (Section 3.3), and trajectory planning and control (Section 3.4).


**Figure 1.** System architecture.

In the delivery position determination step, an optimal delivery position is estimated using an HRI-based cost function. This cost function is calculated using the skeletal data of the worker measured by the 3-D vision sensor. In the worker's motion prediction step, the position data obtained from the vision sensor are used to predict the motion of the worker. Moreover, after the completion of each work cycle, the worker's model is updated using the stored position data. In the trajectory planning step, an optimal trajectory from the robot's current position to the goal position is calculated using the receding horizon principle of Model Predictive Control (MPC). The robot motion controller ensures that the robot follows the calculated trajectory. A detailed description of these three parts of our scheme is given in the subsequent subsections.

### *3.2. Delivery Position Determination*

In our proposed scheme, the delivery position is determined by optimizing an HRI-based cost function. This cost function includes terms related to the safety, visibility and arm comfort of the worker. These terms are calculated from the worker's skeletal data observed by the 3-D vision sensor in real-time. This concept was first introduced by Sisbot et al. for motion planning of mobile robots [31]. The analytical form of the HRI-based cost function was proposed in our previous study [32]. Here, we provide a brief description of the cost function and the solver for determining the optimal delivery position.

Let $p_{\mathrm{del}} \in \mathbb{R}^{n}$ be the $n$-dimensional delivery position; then the total cost $Cost(p_{\mathrm{del}}, s_{\mathrm{w}})$ is expressed as:

$$Cost(p_{\mathrm{del}}, s_{\mathrm{w}}) = C_V(p_{\mathrm{del}}, s_{\mathrm{w}}) + C_S(p_{\mathrm{del}}, s_{\mathrm{w}}) + C_A(p_{\mathrm{del}}, s_{\mathrm{w}})$$

where $s_{\mathrm{w}}$ is the latest sample of the worker's skeletal data obtained from the sensor. $C_V(p_{\mathrm{del}}, s_{\mathrm{w}})$ is the visibility cost that keeps the delivery position within the visual range of the worker. This cost is expressed as a function of the difference between the worker's body orientation and the direction of the delivery position with respect to the worker's body center. $C_S(p_{\mathrm{del}}, s_{\mathrm{w}})$ is the safety cost that prevents the robot from colliding with the worker. This cost is expressed as a function of the distance between the worker's body center and the delivery position. $C_A(p_{\mathrm{del}}, s_{\mathrm{w}})$ is the arm comfort cost that keeps the delivery position within a suitable distance and orientation for the worker. This cost is a function of the joint angles of the worker's arm. In addition, this cost penalizes delivery positions at which the worker needs to use his/her non-dominant hand.

The optimal delivery position is calculated by minimizing the cost function $Cost(p_{\mathrm{del}}, s_{\mathrm{w}})$. Since $Cost(p_{\mathrm{del}}, s_{\mathrm{w}})$ is a non-convex function, we use the Transition-based Rapidly-exploring Random Tree (T-RRT) method [33] to find the globally optimal solution. We apply T-RRT only in the vicinity of the worker so that the optimal solution can be calculated in real-time. The process of determining the optimal delivery position is summarized in Algorithm 1.
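To make the search concrete, the following is a minimal Python sketch of the transition-test idea behind T-RRT applied to a 2-D delivery position. It is an illustrative sketch under simplifying assumptions, not a reproduction of Algorithm 1: the HRI-based cost is replaced by a placeholder (`example_cost`), the search region around the worker is a fixed box, and the temperature is kept constant.

```python
import math
import random

def example_cost(p, worker=(0.0, 0.0)):
    """Placeholder for Cost(p_del, s_w): prefer positions roughly 0.6 m from the
    worker's body center and slightly penalize lateral offsets (a stand-in for the
    visibility/safety/arm-comfort terms)."""
    d = math.dist(p, worker)
    return (d - 0.6) ** 2 + 0.1 * abs(p[1])

def transition_test(c_parent, c_child, temperature):
    """T-RRT acceptance rule: always accept downhill moves, accept uphill moves
    with a probability that decays with the cost increase."""
    if c_child <= c_parent:
        return True
    return random.random() < math.exp(-(c_child - c_parent) / temperature)

def trrt_delivery_position(cost, bounds, iters=2000, step=0.05, temperature=0.1):
    """Grow a tree inside `bounds` (a box around the worker) and return the
    lowest-cost node found as the delivery position."""
    start = tuple((lo + hi) / 2 for lo, hi in bounds)
    nodes, best = [start], start
    for _ in range(iters):
        sample = tuple(random.uniform(lo, hi) for lo, hi in bounds)
        nearest = min(nodes, key=lambda n: math.dist(n, sample))
        d = math.dist(nearest, sample)
        if d == 0.0:
            continue
        new = tuple(n + step * (s - n) / d for n, s in zip(nearest, sample))
        if transition_test(cost(nearest), cost(new), temperature):
            nodes.append(new)
            if cost(new) < cost(best):
                best = new
    return best

# Usage: search a 2 m x 2 m box centred on the worker.
print(trrt_delivery_position(example_cost, bounds=[(-1.0, 1.0), (-1.0, 1.0)]))
```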


Figure 2 shows an example of the cost map in the workspace around the worker calculated from the HRI constraints. The worker's shoulder positions (red squares) and the calculated delivery position (green circle) are shown in the figure. We can see that the proposed solver can calculate the delivery position that has the minimum cost in the cost map.

**Figure 2.** Example of the cost map calculated from the HRI constraints and its optimal delivery position.

### *3.3. Worker's Motion Prediction*

The worker's motion is predicted by using Gaussian Mixture Regression (GMR) proposed in our previous work [34]. GMR models the worker's past movements and predicts his/her future movements in the workspace. Here, we provide a brief description of the motion prediction using GMR.

Suppose that $p_{\mathrm{c}} = p_{\mathrm{w}}^{(t)} \in \mathbb{R}^{n}$ is the worker's current position at time step $t$, $p_{\mathrm{h}} = \left[ p_{\mathrm{w}}^{(t-1)}\; p_{\mathrm{w}}^{(t-2)}\; \cdots\; p_{\mathrm{w}}^{(t-d)} \right]^{\mathrm{T}} \in \mathbb{R}^{n \times (d-1)}$ is the position history, and $d$ is the length of the position history. GMR models the conditional probability density function $pr(p_{\mathrm{c}} \,|\, p_{\mathrm{h}})$, whose expectation $E[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]$ is the worker's future position and whose variance $V[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]$ is the uncertainty of the prediction. The details of the GMR calculation are shown in Appendix A.

The procedure for the long-term motion prediction using GMR is summarized in Algorithm 2. The calculation that predicts the worker's position at the next time step is repeated until the length of the predicted trajectory becomes equal to the maximum prediction length $T_{\mathrm{p}}$. The worker's predicted motion is expressed as the sequence of Gaussian distributions $\mathcal{N}_{\mathrm{w}}^{(t_{\mathrm{c}})}, \mathcal{N}_{\mathrm{w}}^{(t_{\mathrm{c}}+1)}, \cdots, \mathcal{N}_{\mathrm{w}}^{(t_{\mathrm{c}}+T_{\mathrm{p}})}$ starting from the current time $t_{\mathrm{c}}$. $\mathcal{N}_{\mathrm{w}}^{(t)}$ is the worker's predicted position distribution at step $t$, expressed as:

$$\mathcal{N}\_{\mathbf{w}}^{(t)} = \mathcal{N}(\boldsymbol{\mu}\_{\mathbf{w}}^{(t)}, \boldsymbol{\Sigma}\_{\mathbf{w}}^{(t)}) \tag{1}$$

where $\mu_{\mathrm{w}}^{(t)}$ is the mean vector and $\Sigma_{\mathrm{w}}^{(t)}$ is the covariance matrix of the worker's predicted position at step $t$.

If the worker repeats his/her normal movement, which is indicated in the process chart of the assembly process, our prediction system can predict the worker's movement with sufficient accuracy. According to our previous research, the RMSE (Root Mean Square Error) of the predicted worker's movement was about 0.3 m [34]. The RMSE was calculated by comparing the initially predicted movement of the worker with the observed movement when the worker started to move to the next working position.

When the worker moves differently from his/her normal movement, it is not easy to ensure the accuracy of the prediction. However, the proposed system operates safely even in this case, since the variance of the predicted position of the worker is included in the cost function used in the motion planning as shown in our previous study [13].

### **Algorithm 2** Worker's motion prediction using GMR

**Input:** Current time $t_{\mathrm{c}}$, current position $p_{\mathrm{c}}^{(t_{\mathrm{c}})}$, position history $p_{\mathrm{h}}^{(t_{\mathrm{c}})}$, maximum prediction length $T_{\mathrm{p}}$
**Output:** Predicted trajectory $\mathcal{N}^{(t_{\mathrm{c}})}, \mathcal{N}^{(t_{\mathrm{c}}+1)}, \cdots, \mathcal{N}^{(t_{\mathrm{c}}+T_{\mathrm{p}})}$
1: **for** $k = 1$ *to* $T_{\mathrm{p}}$ **do**
2: $p_{\mathrm{h}}^{(t_{\mathrm{c}}+k)} = \left[ p_{\mathrm{c}}^{(t_{\mathrm{c}}+k-1)}\; p_{\mathrm{c}}^{(t_{\mathrm{c}}+k-2)}\; \cdots\; p_{\mathrm{c}}^{(t_{\mathrm{c}}+k-d)} \right]^{\mathrm{T}}$
3: $\mu_{\mathrm{w}}^{(t_{\mathrm{c}}+k)} = E[p_{\mathrm{c}}^{(t_{\mathrm{c}}+k)} \,|\, p_{\mathrm{h}}^{(t_{\mathrm{c}}+k)}]$
4: $\Sigma_{\mathrm{w}}^{(t_{\mathrm{c}}+k)} = V[p_{\mathrm{c}}^{(t_{\mathrm{c}}+k)} \,|\, p_{\mathrm{h}}^{(t_{\mathrm{c}}+k)}]$
5: $p_{\mathrm{c}}^{(t_{\mathrm{c}}+k)} = E[p_{\mathrm{c}}^{(t_{\mathrm{c}}+k)} \,|\, p_{\mathrm{h}}^{(t_{\mathrm{c}}+k)}]$
6: **end for**
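As a concrete illustration of the structure of Algorithm 2, the sketch below repeatedly conditions a one-step predictor on the sliding position history and feeds the predicted mean back in until the horizon $T_{\mathrm{p}}$ is reached. The `condition` callable stands in for the GMR conditional moments $E[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]$ and $V[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]$ of Appendix A; here it is replaced by a trivial constant-velocity stand-in so that the snippet runs on its own.

```python
import numpy as np

def predict_motion(condition, p_history, T_p):
    """Algorithm 2 sketch: roll the predictor forward for T_p steps, collecting the
    predicted (mean, covariance) pairs and shifting the history window each step."""
    history = list(p_history)          # most recent position first, length d
    trajectory = []
    for _ in range(T_p):
        mu, Sigma = condition(history)
        trajectory.append((mu, Sigma))
        history = [mu] + history[:-1]  # feed the predicted mean back into the history
    return trajectory

def constant_velocity_stand_in(history):
    """Stand-in for the GMR conditional moments: extrapolate with the last observed
    velocity and report a fixed covariance (the real system uses Equations (A4)-(A8))."""
    v = history[0] - history[1]
    return history[0] + v, 0.01 * np.eye(history[0].shape[0])

# Usage: 2-D positions, history length d = 3, prediction horizon T_p = 5.
p_hist = [np.array([0.2, 0.0]), np.array([0.1, 0.0]), np.array([0.0, 0.0])]
for mu, Sigma in predict_motion(constant_velocity_stand_in, p_hist, T_p=5):
    print(mu)
```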

### *3.4. Trajectory Planning and Control*

Figure 3 shows the concept of human-following motion planning using the worker's predicted motion. The sequence of the worker's predicted position distributions $\left( \mathcal{N}_{\mathrm{w}}^{(t_{\mathrm{c}})}, \mathcal{N}_{\mathrm{w}}^{(t_{\mathrm{c}}+1)}, \cdots, \mathcal{N}_{\mathrm{w}}^{(t_{\mathrm{c}}+T_{\mathrm{p}})} \right)$ is given to the trajectory planner. The sequence of the robot states, that is, the robot trajectory $q(t_{\mathrm{c}}), q(t_{\mathrm{c}}+1), \cdots, q(t_{\mathrm{c}}+T_{\mathrm{o}})$, is calculated so that the robot's state at each time step follows the corresponding predicted position of the worker.

**Figure 3.** Concept of human-following motion planning based on the predicted trajectory of the worker.

To achieve the prediction-based human-following robot motion, we use an MPC-based planner that considers an evaluation function over the robot states within a finite future time window. This is a well-known strategy and is often used in real-time robot applications such as task-parametrized motion planning [35] and multi-agent motion planning [36].

The cost function used in MPC consists of terminal cost and stage cost. The terminal cost deals with the cost at the terminal state of the robot, which is the delivery position in our case. Stage cost considers the state of the robot during the whole trajectory from the current configuration to the goal configuration. A distinct feature of our scheme is that the optimal delivery position, found by optimizing the HRI-based cost function, is used to calculate the terminal cost. In addition, the predicted trajectory of the worker is used to calculate the stage cost. This scheme plans the collision-free robot trajectory that follows the moving worker efficiently under the safety cost constraint and the robot's velocity and acceleration constraints.

The cost function *J* used for the optimization of the proposed trajectory planner is expressed as:

$$J = \phi(q(t_{\mathrm{c}} + T_{\mathrm{o}})) + \int_{t_{\mathrm{c}}}^{t_{\mathrm{c}} + T_{\mathrm{o}}} \left( L_1(\dot{q}(k)) + L_2(q(k)) + L_3(q(k)) \right) dk \tag{2}$$

where $q = (\theta, \dot{\theta})^{\mathrm{T}} \in \mathbb{R}^{2N_j}$ is the state vector of the manipulator, $\theta = (\theta_1, \theta_2, \cdots, \theta_{N_j})^{\mathrm{T}} \in \mathbb{R}^{N_j}$ is the vector composed of the joint angles of the manipulator, $N_j$ is the number of degrees of freedom of the manipulator, $T_{\mathrm{o}}$ ($T_{\mathrm{o}} \le T_{\mathrm{p}}$) is the length of the trajectory (in our experiments, we used $T_{\mathrm{o}} = T_{\mathrm{p}}$ as a rule of thumb), and $\phi(q(t_{\mathrm{c}} + T_{\mathrm{o}}))$ is the terminal cost, which prevents the calculated trajectory of the robot from diverging. It is expressed as:

$$\phi(q(t_{\mathrm{c}} + T_{\mathrm{o}})) = \frac{1}{2} \left( FK_{N_j}(q(t_{\mathrm{c}} + T_{\mathrm{o}})) - x_{\mathrm{del}} \right)^{\mathrm{T}} R \left( FK_{N_j}(q(t_{\mathrm{c}} + T_{\mathrm{o}})) - x_{\mathrm{del}} \right) \tag{3}$$

where $FK_{N_j}$ is the forward kinematics of the robot, which transforms the robot state $q$ from joint coordinates to position $p_j$ and velocity $v_j$ in the workspace coordinates. $R$ is a diagonal positive definite weighting matrix. $x_{\mathrm{del}}$ is the terminal state of the robot, which is calculated based on the optimal delivery position and the predicted mean position of the worker. In this study, $x_{\mathrm{del}}$ becomes $x_{\mathrm{del}} = \left[ \mu_{\mathrm{w}}^{(t_{\mathrm{c}}+T_{\mathrm{o}})} + p_{\mathrm{del}},\ \mathbf{0} \right]^{\mathrm{T}}$, where $\mu_{\mathrm{w}}^{(t_{\mathrm{c}}+T_{\mathrm{o}})}$ is the worker's predicted position at the end of the trajectory ($t_{\mathrm{c}} + T_{\mathrm{o}}$), and $p_{\mathrm{del}}$ is the calculated delivery position for the worker's observed position. We calculate $p_{\mathrm{del}}$ after each sampling interval and assume that the variation in $p_{\mathrm{del}}$ is negligibly small during the sampling interval of the sensing system, which is 30 ms.
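For illustration, the terminal cost of Equation (3) translates directly into the short NumPy function below, assuming the forward kinematics has already been evaluated to a workspace state (position and velocity) and using placeholder numbers for $R$ and the states; the actual kinematics of PaDY is not reproduced here.

```python
import numpy as np

def terminal_cost(fk_state, x_del, R):
    """Equation (3): quadratic penalty between the robot's terminal workspace state
    FK(q(t_c + T_o)) and the desired terminal state x_del."""
    e = fk_state - x_del
    return 0.5 * e @ R @ e

# Usage (planar case): state = (x, y, vx, vy); the desired terminal state is the
# predicted worker position offset by the delivery position, with zero velocity.
fk_state = np.array([0.50, 0.30, 0.05, 0.00])
x_del    = np.array([0.55, 0.35, 0.00, 0.00])
R        = np.diag([10.0, 10.0, 1.0, 1.0])
print(terminal_cost(fk_state, x_del, R))
```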

$L_1(\dot{q}(k))$, $L_2(q(k))$ and $L_3(q(k))$ are the stage costs, which are expressed as:

$$L_1(\dot{q}(k)) = \frac{1}{2} \sum_{j=1}^{N_j} r_j \left( B_{\mathrm{vel},j}(\dot{\theta}_j(k)) + B_{\mathrm{acc},j}(\ddot{\theta}_j(k)) \right) \tag{4}$$

$$L_2(q(k)) = w \sum_{j=1}^{N_j} \frac{1}{D_{\mathrm{M}}\left( FK_{p,j}(q(k)),\ \mu_{\mathrm{w}}^{(k)},\ \Sigma_{\mathrm{w}}^{(k)} \right)} \tag{5}$$

$$L_3(q(k)) = \frac{1}{2} \sum_{j=1}^{N_j} \left( FK_{p,j}(q(k)) - (\mu_{\mathrm{w}}^{(k)} + p_{\mathrm{del}}) \right)^{\mathrm{T}} Q \left( FK_{p,j}(q(k)) - (\mu_{\mathrm{w}}^{(k)} + p_{\mathrm{del}}) \right) \tag{6}$$

$L_1(\dot{q}(k))$ is the stage cost that keeps the robot velocity and acceleration within their maximum limits. $B_{\mathrm{vel},j}(\dot{\theta}_j(k))$ and $B_{\mathrm{acc},j}(\ddot{\theta}_j(k))$ are defined as:

$$B_{\mathrm{vel},j}(\dot{\theta}_j(k)) = \begin{cases} 0 & (\|\dot{\theta}_j\| \le \dot{\theta}_{\mathrm{max},j}) \\ \left( \|\dot{\theta}_j\| - \dot{\theta}_{\mathrm{max},j} \right)^2 & (\|\dot{\theta}_j\| > \dot{\theta}_{\mathrm{max},j}) \end{cases} \tag{7}$$

$$B_{\mathrm{acc},j}(\ddot{\theta}_j(k)) = \begin{cases} 0 & (\|\ddot{\theta}_j\| \le \ddot{\theta}_{\mathrm{max},j}) \\ \left( \|\ddot{\theta}_j\| - \ddot{\theta}_{\mathrm{max},j} \right)^2 & (\|\ddot{\theta}_j\| > \ddot{\theta}_{\mathrm{max},j}) \end{cases} \tag{8}$$

where $\dot{\theta}_{\mathrm{max},j}$ and $\ddot{\theta}_{\mathrm{max},j}$ are the maximum velocity and maximum acceleration of the $j$th joint, respectively.
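As a sketch, Equations (7) and (8) correspond to the one-sided quadratic penalties below (shown for a single joint; the numerical limits in the usage lines are placeholders, not PaDY's actual limits).

```python
def velocity_barrier(theta_dot, theta_dot_max):
    """Equation (7): zero inside the velocity limit, quadratic penalty beyond it."""
    excess = abs(theta_dot) - theta_dot_max
    return excess ** 2 if excess > 0.0 else 0.0

def acceleration_barrier(theta_ddot, theta_ddot_max):
    """Equation (8): the same one-sided quadratic penalty applied to the acceleration."""
    excess = abs(theta_ddot) - theta_ddot_max
    return excess ** 2 if excess > 0.0 else 0.0

# Usage: a joint at 1.8 rad/s with a (placeholder) 1.5 rad/s limit is penalized.
print(velocity_barrier(1.8, 1.5))       # approximately 0.09
print(acceleration_barrier(0.5, 2.0))   # 0.0
```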

$L_2(q(k))$ is the stage cost that prevents the robot from hitting the worker. $w$ is a weighting coefficient of this cost function. $D_{\mathrm{M}}(x, \mu, \Sigma) = (x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu)$ is the Mahalanobis distance, which accounts for the variance of the probability density distribution. Using the Mahalanobis distance between the predicted worker's position distribution $\mathcal{N}(\mu_{\mathrm{w}}^{(k)}, \Sigma_{\mathrm{w}}^{(k)})$ and the end-effector position $FK_{p,j}(q(k))$ at step $k$, an artificial potential field is constructed according to the predicted variance. The artificial potential becomes wider in the direction of larger variance in the predicted position.

$L_3(q(k))$ is the stage cost that ensures that the robot follows the worker's motion. $Q$ is a diagonal positive definite weighting matrix. This cost function is responsible for the human-following motion of the robot based on the worker's predicted trajectory, as shown in Figure 3. For each time step of the predicted position distribution of the worker $\mathcal{N}(\mu_{\mathrm{w}}^{(k)}, \Sigma_{\mathrm{w}}^{(k)})$, the desirable state of the robot is calculated so that the robot's endpoint follows the predicted mean position of the worker $\mu_{\mathrm{w}}^{(k)}$ offset by the calculated delivery position $p_{\mathrm{del}}$.
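The two remaining stage costs can be sketched in NumPy as follows, assuming the forward kinematics has already produced the workspace positions of the robot points being evaluated at step $k$; all numerical values in the usage lines are placeholders.

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """D_M(x, mu, Sigma) = (x - mu)^T Sigma^{-1} (x - mu), as used in Equation (5)."""
    e = x - mu
    return e @ np.linalg.solve(Sigma, e)

def collision_avoidance_cost(robot_points, mu_w, Sigma_w, w=1.0):
    """Equation (5): inverse Mahalanobis distance to the predicted worker
    distribution; a larger predicted variance widens the artificial potential."""
    return w * sum(1.0 / mahalanobis(p, mu_w, Sigma_w) for p in robot_points)

def following_cost(robot_points, mu_w, p_del, Q):
    """Equation (6): quadratic penalty pulling the robot towards the predicted
    worker position offset by the calculated delivery position."""
    target = mu_w + p_del
    return 0.5 * sum((p - target) @ Q @ (p - target) for p in robot_points)

# Usage (planar, placeholder numbers): a single end-effector point at step k.
mu_w    = np.array([1.0, 0.5])    # predicted mean position of the worker
Sigma_w = np.diag([0.04, 0.09])   # predicted covariance (larger along y)
p_del   = np.array([-0.3, 0.2])   # delivery position relative to the worker
Q       = np.diag([5.0, 5.0])
ee      = [np.array([0.8, 0.6])]
print(collision_avoidance_cost(ee, mu_w, Sigma_w))
print(following_cost(ee, mu_w, p_del, Q))
```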

Now we can define the optimization problem that will be solved by our proposed system.

$$\begin{array}{ll} \text{minimize} & J \\ \text{subject to} & \dot{q} = f(q, u) \\ & q(t) = q_{\mathrm{cur}} \end{array}$$

where *f* denotes the nonlinear term of the robot's dynamics, *u* is the input vector, and *q*(*t*) is the initial state of the trajectory which corresponds to the current state *q*cur of the robot. To solve this optimization problem with the equality constraints described above, we use the calculus of variations. The discretized Euler–Lagrange equations that the optimal solution should satisfy are expressed as:

$$q^{(k+1)} = q^{(k)} + f(q^{(k)}, u^{(k)}) \Delta t_{\mathrm{s}}, \tag{9}$$

$$q^{(t)} = q_{\mathrm{cur}}, \tag{10}$$

$$\lambda^{(k)} = \lambda^{(k+1)} - \left( \frac{\partial H}{\partial q} \right)^{\mathrm{T}} (q^{(k+1)}, u^{(k)}, \lambda^{(k+1)}), \tag{11}$$

$$\lambda^{(t+T_{\mathrm{o}})} = \left( \frac{\partial \phi}{\partial q} \right)^{\mathrm{T}} (q^{(t+T_{\mathrm{o}})}), \tag{12}$$

$$\frac{\partial H}{\partial \boldsymbol{u}}(\boldsymbol{q}^{(k)}, \boldsymbol{u}^{(k)}, \boldsymbol{\lambda}^{(k)}) = \boldsymbol{0},\tag{13}$$

where *H* is the Hamiltonian and is defined as:

$$H(q, u, \lambda) = L_1(\dot{q}(k)) + L_2(q(k)) + L_3(q(k)) + \lambda^{\mathrm{T}} f(q, u). \tag{14}$$

The procedure for calculating the online trajectory is shown in Algorithm 3. After the sequential optimization based on gradient descent, we obtain the optimal trajectory $q^{(t)}, q^{(t+1)}, \cdots, q^{(t+T_{\mathrm{o}})}$. For the detailed calculations, please refer to our previous study [13].

### **Algorithm 3** Robot Trajectory Generator

**Input:** Target delivery position $p_{\mathrm{del}}$, predicted worker's trajectory $\mathcal{N}^{(t)}, \mathcal{N}^{(t+1)}, \cdots, \mathcal{N}^{(t+T_{\mathrm{p}})}$, current state of the robot $q_{\mathrm{cur}}$, maximum length of the robot trajectory $T_{\mathrm{o}}$
**Output:** Optimal trajectory $q^{(t)}, q^{(t+1)}, \cdots, q^{(t+T_{\mathrm{o}})}$
1: Initialize the set of input vectors $u$
2: $q^{(t)} \Leftarrow q_{\mathrm{cur}}$
3: **while** $\sum_{k=t}^{t+T_{\mathrm{o}}} \left| \frac{\partial H}{\partial u}(q^{(k+1)}, u^{(k)}, \lambda^{(k+1)}) \right| > \epsilon$ **do**
4: **for** $k = 1$ *to* $T_{\mathrm{o}}$ **do**
5: $q^{(t+k)} \Leftarrow q^{(t+k-1)} + f(q^{(t+k-1)}, u^{(t+k-1)}) \Delta t_{\mathrm{s}}$
6: **end for**
7: **for** $k = T_{\mathrm{o}}$ *to* $1$ **do**
8: $\lambda^{(t+k-1)} \Leftarrow \lambda^{(t+k)} - \left( \frac{\partial H}{\partial q} \right)^{\mathrm{T}} (q^{(t+k)}, u^{(t+k)}, \lambda^{(t+k+1)})$
9: **end for**
10: **for** $i = 1$ *to* $T_{\mathrm{o}}$ **do**
11: $s_i \Leftarrow \left( \frac{\partial H}{\partial u} \right)^{\mathrm{T}} (q^{(t+i-1)}, u^{(t+i-1)}, \lambda^{(t+i)})$
12: **end for**
13: $u \Leftarrow u + c\,s$
14: **end while**
15: **for** $k = 1$ *to* $T_{\mathrm{o}}$ **do**
16: $q^{(t+k)} \Leftarrow q^{(t+k-1)} + f(q^{(t+k-1)}, u^{(t+k-1)}) \Delta t_{\mathrm{s}}$
17: **end for**
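The forward-rollout / backward-costate / input-update structure of Algorithm 3 is illustrated below on a deliberately simple stand-in problem: a 1-DOF double integrator (state $q$ = position and velocity, input $u$ = acceleration) with quadratic placeholder costs instead of Equations (2)–(6). The gradient step is written in descent form ($u \leftarrow u - c\,s$); this is a sketch of the optimization pattern, not the PaDY implementation.

```python
import numpy as np

dt, T_o = 0.05, 40                  # integration step and horizon length
target  = np.array([1.0, 0.0])      # stand-in for mu_w + p_del with zero final velocity
Q  = np.diag([1.0, 0.1])            # placeholder stage weight
Rf = np.diag([20.0, 2.0])           # placeholder terminal weight
r  = 0.01                           # placeholder input weight

def f(q, u):                        # continuous dynamics q_dot = f(q, u)
    return np.array([q[1], u])

def rollout(q0, u_seq):             # lines 4-6 and 15-17: forward integration
    qs = [q0]
    for u in u_seq:
        qs.append(qs[-1] + f(qs[-1], u) * dt)
    return qs

def gradient(qs, u_seq):            # lines 7-12: backward costate pass, then dH/du
    lam = Rf @ (qs[-1] - target)                    # terminal condition (cf. Equation (12))
    g = np.zeros(len(u_seq))
    dfdq_T = np.array([[0.0, 0.0], [1.0, 0.0]])     # (df/dq)^T for the double integrator
    for k in reversed(range(len(u_seq))):
        g[k] = (lam[1] + r * u_seq[k]) * dt         # dH/du = (df/du)^T lam + r u
        lam = lam + (Q @ (qs[k] - target) + dfdq_T @ lam) * dt   # cf. Equation (11)
    return g

q0, u = np.array([0.0, 0.0]), np.zeros(T_o)
for _ in range(500):                                # outer loop (line 3)
    g = gradient(rollout(q0, u), u)
    if np.sum(np.abs(g)) < 1e-4:                    # convergence test
        break
    u -= 0.1 * g                                    # input update (line 13, descent form)
print("terminal state:", rollout(q0, u)[-1])
```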

### **4. Experiment**

### *4.1. Experimental Setup*

To evaluate the performance of the proposed scheme in a real-world environment, we used the planar manipulator PaDY proposed in our previous study [7]. PaDY was designed to assist the workers of an automobile factory. A parts tray and a tool holder were attached to the end-effector of PaDY to store the parts and tools required for car assembly tasks. The robot delivers the parts and tools to the worker during the assembly process. For the details of the hardware design of PaDY, please refer to [7].

The proposed scheme was installed on a computer with an Intel Core i7-3740QM (quad-core, 2.7 GHz) processor and 16 GB of memory. All calculations were completed within 30 ms, the sampling interval of the sensing system that tracks the position of the human worker.

We designed an experiment to demonstrate the effectiveness of the worker's motion prediction in the human-following behavior of our proposed scheme. Figure 4 shows the experimental workspace and Figure 5 shows the top view of the setup for this experiment. In this experiment, the worker needs to perform the following six tasks:


Each task is performed at a separate working position in the workspace. The experiment is carried out as shown in Figure 6.


We performed this experiment with four different participants (A, B, C and D) to evaluate the robustness of the system for different workers. Each participant is asked to perform the complete work cycle ten times. The first trial is performed without using the predicted motion of the worker, whereas in all other trials the predicted motion of the worker is used and the worker model is sequentially updated after the completion of each trial.

For more details about the experiment, please see the Supplementary Materials.

**Figure 4.** Experimental workspace.

**Figure 5.** Top view of the experimental setup.


**Figure 6.** Experiment showing a complete work cycle where six tasks are performed. (**a**) A bolt and the tool are picked up; (**b**) Task 1 is performed; (**c**) The tool is returned and 3 grommets are picked up; (**d**) Task 2 is performed; (**e**) A grommet is picked up; (**f**) Task 3 is performed; (**g**) A grommet is picked up; (**h**) Task 4 is performed; (**i**) A grommet is picked up; (**j**) Task 5 is performed; (**k**) A grommet is picked up; (**l**) Task 6 is performed.

### *4.2. Tracking Performance*

Figure 7a shows the estimated delivery position and the robot's end-effector position for trial 1 of a participant when the robot's motion is calculated based on the observed position of the worker without using the motion prediction. The black vertical lines show the time when the worker performs each assembly task. Figure 7b shows the estimated delivery position and the robot's end-effector position for trial 10 of the same participant when the robot's motion is calculated based on the predicted position of the worker using the proposed scheme.

At the beginning of the experiment, the robot is at its home position and the participant is at the working position for Task 1. The robot starts its human-following motion after arriving at the delivery position for Task 1 (at around 6 s in Figure 7a,b). We can see that the robot keeps following the participant during the whole experiment in both schemes (with and without the use of motion prediction).

The green line in Figure 7a,b shows the tracking error which is the difference between the delivery position and the end-effector position. We can see that the maximum tracking error is reduced from about 0.5 m to 0.3 m by using the motion prediction.

**Figure 7.** Tracking performance. (**a**) When motion prediction is not used; (**b**) When motion prediction is used.

It is not possible to completely eliminate the tracking error, since the manipulator used in the experiments has a mechanical torque limiter at each joint and the maximum angular acceleration that does not activate the torque limiter is 90 deg/s². In both Figure 7a,b, a large tracking error can be observed around 30 s. This is because the participant makes a large movement around 30 s and the robot cannot follow the participant because of its acceleration limit.

### *4.3. Cycle Time*

Figure 8 shows the comparison of the cycle time of the four participants in each trial. We define cycle time as the time required for a participant to complete all six tasks of the assembly process. In Figure 8, the cycle time of each trial is normalized by the time of trial 1. Remember that motion prediction was not used in trial 1.

**Figure 8.** Comparison of Cycle Time.

We can see that the cycle time for each participant decreases as the number of trials increases. The cycle time of trial 10 is reduced to 65.6–74.8% of the cycle time of trial 1. This shows that motion prediction can improve the performance of participants and help them complete the assembly process faster.

Note that the proposed system ignores the dynamics of the interaction between the robot and the worker, assuming that the worker is well trained and that the behavioral dynamics of the worker with respect to the robot's movements can be ignored. If the effects of the robot's motion on the worker can be modeled, the system can better account for the interaction between the worker and the robot, and a further improvement in the worker's time efficiency could be expected.

### *4.4. HRI-Based Cost*

Table 1 shows the average and maximum HRI-based costs for each participant during the human-following motion of the robot. Since the HRI-based cost increases as the safety and comfort of the worker decreases, it is desirable to have a low HRI-based cost in human–robot collaboration.


**Table 1.** Summary of HRI-based costs during the human-following motion for each worker.

In Table 1, we see that there are no significant differences in the average and maximum HRI-based costs between trial 1 (when motion prediction is not used) and trial 10 (when motion prediction is used) for all four participants. Therefore, we conclude that the proposed prediction-based human-following control reduces the work cycle time without adversely affecting the safety and comfort of the workers.

### **5. Conclusions**

We proposed a human-following motion planning and control scheme for a collaborative robot which supplies the necessary parts and tools to a worker in an automobile assembly process. The human-following motion of the collaborative robot makes it possible to provide anytime assistance to the worker who is moving around in the workspace.

The proposed scheme calculates an optimal delivery position for the current position of the worker by performing non-convex optimization of an HRI-based cost function. Whenever the worker's position changes, the new optimal delivery position is calculated. Based on the observed movement of the worker, the motion of the worker is predicted and the robot's trajectory is updated in real-time using model predictive control to ensure a smooth transition between the previous and new trajectories.

The proposed scheme was applied to a planar collaborative robot called PaDY. Experiments were conducted in a real environment where a worker performed a car assembly process with the assistance of the robot. The results of the experiments confirmed that our proposed scheme provides better assistance to the worker, improves the work efficiency, and ensures the safety and comfort of the worker.

This scheme has been designed for a single worker operating within his/her workspace. It is not designed for cases in which multiple workers operate in the same workspace, or in which the worker moves beyond the workspace. Moreover, we did not consider the dynamics of the interaction between the robot and the human, assuming that the human worker in the factory is well trained and that his/her behavioral dynamics with respect to the robot's motion are negligible. If the effects of the robot's motion on the human can be modeled, the system can better account for the interaction, and a further improvement in time efficiency could be expected.

We believe that the human-following approach has tremendous potential in the field of collaborative robotics. The ability to provide anytime assistance is a key feature of our proposed method, and we believe it will be very useful in many other collaborative robot applications.

**Supplementary Materials:** The supplementary material is available online at https://youtu.be/ -jkPoK5URdw (accessed on 4 December 2021), Video: Human-Following Motion Planning and Control Scheme for Collaborative Robots.

**Author Contributions:** Conceptualization, J.K. and K.K.; methodology, A.K. and K.K.; software, F.I.K. and A.K.; validation, F.I.K. and A.K.; formal analysis, F.I.K. and A.K.; investigation, A.K. and K.K.; resources, J.K. and K.K.; data curation, F.I.K. and A.K.; writing—original draft preparation, F.I.K. and A.K.; writing—review and editing, F.I.K., A.K. and K.K.; visualization, F.I.K.; supervision, K.K.; project administration, K.K.; funding acquisition, J.K. and K.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The dataset generated from the experiments in this study can be found at https://github.com/kf-iqbal-29/Dataset-HumanFollowingCollaborativeRobot.git.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Abbreviations**

The following abbreviations are used in this manuscript:

HRI Human–Robot Interaction
MPC Model Predictive Control
GMR Gaussian Mixture Regression
T-RRT Transition-based Rapidly-exploring Random Tree
RMSE Root Mean Square Error

### **Appendix A. Detailed Calculation of Gaussian Mixture Regression**

Suppose that $p_{\mathrm{c}} = p_{\mathrm{w}}^{(t)} \in \mathbb{R}^{n}$ is the worker's current position at time step $t$, $p_{\mathrm{h}} = \left[ p_{\mathrm{w}}^{(t-1)}\; p_{\mathrm{w}}^{(t-2)}\; \cdots\; p_{\mathrm{w}}^{(t-d)} \right]^{\mathrm{T}} \in \mathbb{R}^{n \times (d-1)}$ is the position history, and $d$ is the order of the autoregressive model. Then the joint distribution $pr$ of $p_{\mathrm{c}}$ and $p_{\mathrm{h}}$ can be expressed as

$$pr(p_{\mathrm{h}}, p_{\mathrm{c}}) = \sum_{m=1}^{M} \pi_m \mathcal{N}(p_{\mathrm{h}}, p_{\mathrm{c}} \,|\, \mu_m, \Sigma_m), \tag{A1}$$

where

$$\mu_m = \begin{bmatrix} \mu_m^{p_{\mathrm{h}}} \\ \mu_m^{p_{\mathrm{c}}} \end{bmatrix}, \tag{A2}$$

$$\Sigma_m = \begin{bmatrix} \Sigma_m^{p_{\mathrm{h}} p_{\mathrm{h}}} & \Sigma_m^{p_{\mathrm{h}} p_{\mathrm{c}}} \\ \Sigma_m^{p_{\mathrm{c}} p_{\mathrm{h}}} & \Sigma_m^{p_{\mathrm{c}} p_{\mathrm{c}}} \end{bmatrix}. \tag{A3}$$

The expectation $E[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]$ and the variance $V[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]$ of the conditional probability density function $pr(p_{\mathrm{c}} \,|\, p_{\mathrm{h}})$ are expressed as

$$E[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}] = \sum_{m=1}^{M} h^{m}(p_{\mathrm{h}})\, \mu', \tag{A4}$$

$$V[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}] = \sum_{m=1}^{M} h^{m}(p_{\mathrm{h}}) \left( \Sigma' + \mu' \mu'^{\mathrm{T}} \right) - E[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]\, E[p_{\mathrm{c}} \,|\, p_{\mathrm{h}}]^{\mathrm{T}}, \tag{A5}$$

where

$$h^{m}(p_{\mathrm{h}}) = \frac{\pi_m \mathcal{N}(p_{\mathrm{h}} \,|\, \mu_m^{p_{\mathrm{h}}},\ \Sigma_m^{p_{\mathrm{h}} p_{\mathrm{h}}})}{\sum_{k=1}^{M} \pi_k \mathcal{N}(p_{\mathrm{h}} \,|\, \mu_k^{p_{\mathrm{h}}},\ \Sigma_k^{p_{\mathrm{h}} p_{\mathrm{h}}})}, \tag{A6}$$

$$
\mu' = \mu\_m^{p\_c} + \Sigma\_m^{p\_c p\_h} (\Sigma\_m^{p\_h p\_h})^{-1} (p\_h - \mu\_m^{p\_h}),
\tag{A7}
$$

$$
\Sigma' = \Sigma\_m^{p\_c p\_c} - \Sigma\_m^{p\_c p\_h} (\Sigma\_m^{p\_h p\_h})^{-1} \Sigma\_m^{p\_h p\_c}.\tag{A8}
$$

When making the prediction, the position of the worker $p_{\mathrm{c}}^{(t+1)}$ at step $t+1$ is calculated as

$$\begin{aligned} p_{\mathrm{c}}^{(t+1)} &= E[p_{\mathrm{c}}^{(t+1)} \,|\, p_{\mathrm{h}}^{(t+1)}], \\ p_{\mathrm{h}}^{(t+1)} &= \left( p_{\mathrm{c}}^{(t)}\; p_{\mathrm{c}}^{(t-1)}\; \cdots\; p_{\mathrm{c}}^{(t+1-d)} \right)^{\mathrm{T}}. \end{aligned} \tag{A9}$$

This calculation to predict the worker's position, shown in Equation (A9), is repeated until the length of the predicted trajectory becomes equal to the maximum prediction length $T_{\mathrm{p}}$. The process of predicting the worker's motion is summarized in Algorithm 2. For the details of the derivation, please see [37].
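For reference, the conditional moments of Equations (A4)–(A8) can be computed with a few lines of NumPy/SciPy, assuming the GMM over the joint vector $(p_{\mathrm{h}}, p_{\mathrm{c}})$ has already been fitted (e.g., by EM). The partitioning indices and the toy numbers in the usage lines are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_condition(p_h, weights, means, covs, n_h):
    """Return E[p_c | p_h] and V[p_c | p_h] for a GMM over (p_h, p_c), following
    Equations (A4)-(A8). The first n_h entries of each mean/covariance block
    correspond to p_h, the remaining entries to p_c."""
    resp, mu_cond, Sigma_cond = [], [], []
    for pi_m, mu, S in zip(weights, means, covs):
        mu_h, mu_c = mu[:n_h], mu[n_h:]
        S_hh, S_hc = S[:n_h, :n_h], S[:n_h, n_h:]
        S_ch, S_cc = S[n_h:, :n_h], S[n_h:, n_h:]
        resp.append(pi_m * multivariate_normal.pdf(p_h, mean=mu_h, cov=S_hh))
        gain = S_ch @ np.linalg.inv(S_hh)
        mu_cond.append(mu_c + gain @ (p_h - mu_h))        # Equation (A7)
        Sigma_cond.append(S_cc - gain @ S_hc)             # Equation (A8)
    resp = np.array(resp) / np.sum(resp)                  # Equation (A6)
    E = sum(h * m for h, m in zip(resp, mu_cond))         # Equation (A4)
    V = sum(h * (S + np.outer(m, m))                      # Equation (A5)
            for h, S, m in zip(resp, Sigma_cond, mu_cond)) - np.outer(E, E)
    return E, V

# Usage: toy 1-D example with a history of two past positions (n_h = 2, n_c = 1).
weights = np.array([0.5, 0.5])
means   = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
covs    = [0.1 * np.eye(3), 0.1 * np.eye(3)]
E, V = gmr_condition(np.array([0.9, 0.8]), weights, means, covs, n_h=2)
print(E, V)
```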

### **References**

