*Article* **Development and Application of a Tandem Force Sensor**

### **Zhijian Zhang, Youping Chen and Dailin Zhang \***

State Key Laboratory of Digital Manufacturing Equipment & Technology, School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; zhijian516@hust.edu.cn (Z.Z.); ypchen@hust.edu.cn (Y.C.)

**\*** Correspondence: mnizhang@hust.edu.cn; Tel.: +86-2787543555

Received: 2 September 2020; Accepted: 20 October 2020; Published: 23 October 2020

**Abstract:** In robot teaching for contact tasks, it is necessary not only to accurately perceive the traction force exerted by hands, but also to perceive the contact force at the robot end. This paper develops a tandem force sensor to detect traction and contact forces. As a component of the tandem force sensor, a cylindrical traction force sensor is developed to detect the traction force applied by hands. Its structure is designed to be convenient for humans to operate, and the mechanical model of its cylinder-shaped elastic structural body is analyzed. After calibration, the cylindrical traction force sensor is shown to detect forces/moments with small errors. Then, a tandem force sensor is developed based on the cylindrical traction force sensor and a wrist force sensor. Robot teaching experiments on drawer opening and closing were conducted, and the results confirm that the developed traction force sensor is simple to operate and that the tandem force sensor can perceive both the traction and contact forces.

**Keywords:** tandem force sensor; traction force sensor; human–robot interaction; contact task; imitation learning

### **1. Introduction**

Imitation learning or learning by demonstration is one of the promising approaches for non-experts to develop a task control method or a policy in a straightforward and feasible manner [1,2]. Within imitation learning, a task control model or policy is learned from the task demonstrations, one of which is a sequence of state-action pairs recorded during the teacher's demonstration. After the teacher demonstrates how to complete the task several times, learning algorithms utilize the state-action pairs in these demonstrations to derive a mapping model of the state and action, namely the policy.

To obtain the state-action pairs in demonstrations, the robot needs to sense the environment information and the actions taken by the teacher simultaneously during the task demonstration. The environment information depends on the task to be learned. In non-contact tasks of industrial robots, such as spraying and welding, the state only contains the robot motion parameters, target position, posture, etc. [3,4]. In contact tasks of industrial robots, the contact force needs to be included [5–9]. The actions taken by a teacher can be perceived by sensors, such as visual sensors that capture the teacher's body movements [10,11] or recognize the teacher's gestures [12], wearable sensors, and force sensors that perceive the teacher's behavioral intentions [13–15]. Compared with visual sensors, wearable sensors, etc., force sensor-based kinesthetic teaching is suitable for non-professionals to tell the robot the action to be taken in the current state in a simple and intuitive way [5–7,13–16].

In robot teaching for contact tasks, force sensors need to detect not only the traction force but also the contact force. However, there is only one perceptual unit in a wrist force sensor, which makes it impossible to detect the traction and contact forces synchronously. In the imitation learning of peg-in-hole tasks, references [17–19] adopted kinesthetic teaching to guide the robot through assembly tasks, in which a wrist force sensor was used to measure both the traction force exerted by human hands and the contact status between peg and hole. However, this force sensor installed at the end flange of the robot cannot distinguish between the contact force and the traction force, which makes the force data used for policy learning inaccurate. To avoid this problem, Abu-Dakka [18] repeated the demonstration trajectory to collect the net contact force, which is complicated. Different from reference [18], in reference [13], Zeng grasped the end-point of a Baxter robot to guide the robot motion, and the force sensor installed at the end flange of the robot only detected the contact status. However, this method is only suitable for collaborative robots equipped with joint torque sensors rather than common ones. One method to obtain traction and contact forces is to adopt two wrist force sensors mounted in parallel, which complicates the robot's end structure [20,21]. For example, the last two joints of the robot in reference [20] cannot move freely within their motion range, which limits the adjustable range of the robot's attitude. Therefore, for the kinesthetic teaching of robot contact tasks, simultaneous detection of traction and contact forces is still an important issue to be solved.

The main contribution of this paper is a tandem force sensor that helps robots learn the human skill of opening and closing a drawer. A cylindrical traction force sensor that can be connected in series with a contact force sensor is developed, which is different from common wrist force sensors [22–26]. Compared with these common wrist force sensors, the main novelty of the cylindrical traction force sensor is that its side surface, rather than its end surface, is sensitive to external forces. Besides, in the cylindrical traction force sensor, there is a central column coaxial with and inside the elastic structural body (ESB), which allows other devices to be connected to this sensor without influencing the measurement of the traction force. Compared with the force sensor in reference [27], the developed traction force sensor is easier to operate by hand and suitable for drawer switch teaching.

### **2. Introduction to the Tandem Force Sensor**

### *2.1. The Ideal Tandem Force Sensor*

To realize the perception of traction and contact forces, a tandem force sensor consisting of two perceptual units connected in series is designed, as shown in Figure 1a. Figure 1a shows an ideal tandem force sensor, which helps to understand the basic perception principle of the tandem force sensor. In the ideal tandem force sensor, one perceptual unit is connected with its side surface, and the other is connected with its end surface. In the kinesthetic teaching of robot contact tasks, the end-effector is connected to the end surface of the tandem force sensor, and the human hand guides the robot's motion by grasping the side surface of the tandem force sensor. The traction force applied to the side surface is detected by the perceptual unit (i.e., the traction force sensor) connected with it, and the other perceptual unit (i.e., the contact force sensor), connected with the end surface, measures the contact force between the end-effector and the external environment. Therefore, the side surface and end surface are sensitive to the traction and contact forces, respectively.

**Figure 1.** Schematic diagram of the ideal tandem force sensor: (**a**) structure of the ideal tandem force sensor; (**b**) the inner structure of the ideal tandem force sensor.

Each perceptual unit in the tandem force sensor is composed of an elastic structural body, strain-type sensors pasted on the ESB, etc., and the two ESBs are shown in Figure 1b. The two ESBs in the tandem force sensor are connected in series, and the serial connection mode can be explained by Figure 2. The free end of the ESB for detecting the traction force is connected to the side surface, and the end surface is fixed to the free end of the ESB for detecting the contact force. The fixed end of the former is directly connected to the connecting flange, while the fixed end of the latter is indirectly fixed to the connecting flange through the central column. In addition, all the connections are made by screw fastening. In application, the traction force applied to the side surface is transmitted to the ESB for detecting the traction force and ultimately to the connecting flange, as shown in Figure 2. The contact force exerted on the end surface flows to the ESB for detecting the contact force and then to the connecting flange through the central column. By adopting the connection mode shown in Figure 2, the traction and contact forces are detected by the corresponding ESBs and do not interfere with each other. Finally, the tandem force sensor achieves the perception of the traction and contact forces in a decoupled manner.

**Figure 2.** Simplified schematic diagram of series connection mode of the tandem force sensor.

The two perceptual units shown in Figure 2 are connected in a serial structure. In principle, the two perceptual units are independent of each other, which is similar to the measurement principle of the two wrist force sensors in Figure 3. The two wrist force sensors shown in Figure 3 are connected in a parallel structure, which is a currently adopted method to measure the traction and contact forces. Different from this method, the two perceptual units in the tandem force sensor are connected in series, so we have named the sensor shown in Figure 1 the tandem force sensor. Compared with the perception method of the traction and contact forces shown in Figure 3, the tandem force sensor is compact in structure and does not require a handle to be fixed to the sensor. Therefore, the effect of the handle's gravity on the measurement accuracy of the traction force is eliminated. Moreover, the tandem force sensor does not increase the transverse structural complexity of the robot end and does not limit the motion range of the last two joints of a six degree of freedom (6-DOF) industrial robot.

**Figure 3.** Two wrist force sensors installed in parallel for detecting the traction and contact forces.

### *2.2. The Developed Tandem Force Sensor*

To reduce the difficulty of realizing the tandem force sensor, this paper proposes and designs a tandem force sensor, as shown in Figure 4a. Both the wrist force sensor and the contact force sensor in Figure 1a use the end surface to sense external forces, so a wrist force sensor is used as the contact force sensor. Based on this idea, the developed tandem force sensor differs from the ideal tandem force sensor in appearance. However, its perception principle is the same as that of the ideal tandem force sensor; that is, the traction and contact forces are perceived by the perceptual units connected to the side surface and end surface of the developed tandem force sensor, respectively. Moreover, the series connection mode of the two perceptual units in the developed tandem force sensor is consistent with that of the ideal tandem force sensor, as shown in Figure 4b. The traction force sensor in the developed tandem force sensor differs from that in the ideal tandem force sensor in that its central column is longer. Unlike in the ideal tandem force sensor, limited by the size of the contact force sensor, the contact force sensor is not surrounded by the side surface of the traction force sensor. Similarly, the connections between the different components of the developed tandem force sensor are made by screw fastening. Besides, to achieve the series connection of the contact force sensor and the traction force sensor, an intermediate connecting flange is added.

**Figure 4.** Schematic diagram of the developed tandem force sensor: (**a**) structure of the developed tandem force sensor; (**b**) the inner structure of the developed tandem force sensor.

To realize the tandem force sensor, we first design and develop a cylindrical traction force sensor. Compared with common wrist force sensors, the unique features of the cylindrical traction force sensor are that it senses the external force applied to its side surface and that its internal space provides adequate room for the central column. The basic structure of the ESB of the cylindrical traction force sensor is a thin-walled cylinder. The free end of the thin-walled cylinder-shaped ESB is connected with the side surface of the traction force sensor, and its fixed end is fixed to the connecting flange, as shown in Figures 2 and 4b. The internal space of the ESB is not needed for the detection of the traction force; however, it is essential for the realization of the tandem force sensor. In the tandem force sensor, the central column not only connects the contact force sensor but also provides rigid support for the contact force sensor and the end-effector mounted on it. Hence, the diameter of the central column should not be too small; it is 32 mm in this paper. By selecting reasonable structural parameters of the cylinder-shaped ESB, enough space can be provided for the central column, which is one of the main advantages of the cylinder-shaped ESB. In addition, the internal space is also important for arranging the contact force sensor and its signal lines.

### **3. Development of the Cylindrical Traction Force Sensor**

### *3.1. Architecture of the Cylindrical Traction Force Sensor*

Referring to the force sensor in [28], the cylindrical traction force sensor is designed as shown in Figure 5a. The cylindrical traction force sensor consists of a cylinder-shaped elastic structural body, a connecting fitting, and a shell. The cylinder-shaped ESB shown in Figure 5b is the core of the traction force sensor, and it has layer A (black area), layer B (red area), and layer C (blue area). Compared with the diaphragm-type ESB [29], cross-beam-type ESB [30–32], parallel-type ESB [22,33], etc., the cylinder-shaped ESB is hollow, and the free space inside can be used as the connection channel between the contact force sensor and the traction force sensor. Layer A consists of *A1*, *A2*, *A3*, and *A4*, and layer C is composed of *C1*, *C2*, *C3*, and *C4* (Figure 6). *A1*, *A2*, *A3*, and *A4* are uniformly distributed along the circumference, as are *C1*, *C2*, *C3*, and *C4*. In addition, the angle between *A1* and *C1* is 45 degrees, and the angles between the slots in layer A and the slots in layer C are 45 degrees or multiples of 45 degrees. The fixed end of the cylinder-shaped ESB is fixed to the connecting fitting shown in Figure 5c by screw fastening, and the contact force sensor is fixed to its central column by screw fastening. The connecting fitting can then be fixed to the end flange of a robot and provide rigid support for the ESB and the contact force sensor. The shell shown in Figure 5d is secured to the free end of the ESB by screw fastening, and it transfers the traction force exerted by human hands to the free end of the ESB, as shown in Figure 2.

**Figure 5.** Schematic diagram of the composition of the cylindrical traction force sensor: (**a**) the basic architecture of the cylindrical traction force sensor; (**b**) cylinder-shaped elastic structural body; (**c**) connecting fitting; (**d**) shell.

**Figure 6.** Basic structure of the cylinder-shaped elastic structural body.

### *3.2. Basic Force Measurement Principle of the Cylindrical Traction Force Sensor*

The basic structure of the cylinder-shaped ESB is illustrated in Figure 6. Under the traction force, the ESB produces bending deformation and shear deformation, which lead to normal stress and shear stress in the ESB. The normal stress, which is relatively small, mainly exists in layer A and layer C. Therefore, the traction force sensor uses the shear stress to measure the traction force.

Layer A of the ESB, which is used to measure the force *FX* along the X-axis and the force *FY* along the Y-axis, consists of *A1*, *A2*, *A3*, and *A4*. When the force *FX* is applied to the ESB, *A2* and *A4* produce shear stress. The strain values of the two points on the same diameter in the outside surface of *A2* and *A4* have the same sign, as shown in Figure 7a. Under the moment *MZ*, *A1*, *A2*, *A3*, and *A4* produce shear deformation, and the strain values of the two points on the same diameter in the outside surface of *A2* and *A4*, respectively, have opposite signs, as do those of *A1* and *A3*, as shown in Figure 7b. Therefore, when the moment *MZ* acts on the cylinder-shaped ESB, the sum of the strain values of the two points on the same diameter in the outside surface of *A2* and *A4* is zero. Using this characteristic, the force *FX* can be obtained by measuring the sum of the strain values of the points in the outside surface of *A2* and *A4*. Similarly, the force *FY* can be measured from the sum of the strain values of the points in the outside surface of *A1* and *A3*.

**Figure 7.** The shear stress' direction of the points in the outside surface of layer A: (**a**) under the force *FX*; (**b**) under the torque *MZ*.

Layer C of the ESB, used for the measurement of the moment *MZ*, is composed of *C1*, *C2*, *C3*, and *C4*. Under the moment *MZ*, the strain values of the two points on the same diameter in the outside surface of *C1* and *C3* have opposite signs, and the sign of the strain values of the points in the outside surface of *C2* is opposite to that of the points on the same diameter in *C4*, as shown in Figure 8c. When the force *FX*, the force *FY*, or a combination of both acts on the ESB, the sign of the strain values of the points in the outside surface of *C1* is the same as that of the points on the same diameter in *C3*, and likewise for *C2* and *C4* (Figure 8a,b). Therefore, under *FX*, *FY*, or their combination, the difference between the strain values of the two points on the same diameter in the outside surface of *C1* and *C3* (or *C2* and *C4*) is zero. However, when *FX*, *FY*, and *MZ* act on the cylinder-shaped ESB, this difference is no longer zero. Using this property, the moment *MZ* can be detected by measuring the difference between the strain values of the points in the outside surface of *C1* and *C3* and the difference between those of *C2* and *C4*.

**Figure 8.** The shear stress' direction of the points in the outside surface of layer C: (**a**) under the force *FX*; (**b**) under the force *FY*; (**c**) under the torque *MZ*.
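To make the sum/difference decoupling described above concrete, the following Python sketch simulates idealized gauge strains on layers A and C under a combined *FX* and *MZ* load and recovers each component. The proportionality constants and load values are illustrative assumptions, not the sensor's calibrated sensitivities.

```python
# Illustrative sensitivities (strain per unit load); assumed values, not calibrated ones.
k_fx = 1.0e-6    # strain per N on A2/A4 caused by FX
k_c_fx = 0.3e-6  # strain per N on C1/C3 caused by FX
k_mz = 0.5e-6    # strain per N*cm caused by MZ

F_X, M_Z = 30.0, 100.0   # assumed combined load: 30 N and 100 N*cm

# Points on A2 and A4: FX gives the same sign at both, MZ gives opposite signs.
eps_A2 = k_fx * F_X + k_mz * M_Z
eps_A4 = k_fx * F_X - k_mz * M_Z

# Points on C1 and C3: FX gives the same sign at both, MZ gives opposite signs.
eps_C1 = k_c_fx * F_X + k_mz * M_Z
eps_C3 = k_c_fx * F_X - k_mz * M_Z

# Summing the A-layer strains cancels the MZ contribution and recovers FX.
F_X_est = (eps_A2 + eps_A4) / (2 * k_fx)

# Differencing the C-layer strains cancels the FX contribution and recovers MZ.
M_Z_est = (eps_C1 - eps_C3) / (2 * k_mz)

print(F_X_est, M_Z_est)  # -> 30.0 100.0
```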

Layer B of the ESB is a ring connected to layer A and layer C, and it can measure the force *FZ*, the moment *MX*, and the moment *MY*. Under the force *FZ*, layer A bears axial pressure (Figure 9a). When this axial pressure is transmitted to layer B, axial shear stress appears in layer B, as shown in Figure 10. Figure 10 illustrates the basic constitutional unit of the ESB, and the unfolded view of the ESB is shown in Figure 11. The axial pressure induced by the force *FZ* causes shear deformation of *B1*, *B2*, *B3*, *B4*, *B5*, *B6*, *B7*, and *B8* (*B1*−*B8*) and generates axial shear stress in the axial cross sections of *B1*−*B8*. Moreover, the signs of the strain values of the points in the outside surface of *B1*−*B8* are the same. Besides, under the moment *MX*, *A2* and *A4* are subjected to axial pressures in opposite directions (Figure 9b), which causes shear deformation in *B3*, *B4*, *B7*, and *B8*. The sign of the strain values of the points in the outside surface of *B3* and *B4* is opposite to that in *B7* and *B8*. Using these properties, the force *FZ* can be measured from the sum of the strain values of the points in the outside surface of *B1*−*B8*, and the moment *MX* can be measured from the difference between the strain values of the points in the outside surface of *B3* and *B4* and those in *B7* and *B8*. Similar to the moment *MX*, the moment *MY* can also be measured.

**Figure 9.** Force diagram of the cylindrical traction force sensor: (**a**) under the force *FZ*; (**b**) under the torque *MX*.

**Figure 10.** Basic constitutional unit of cylinder-shaped elastic structural body.

**Figure 11.** Unfolded view of the cylinder-shaped elastic structural body.

### *3.3. Mechanical Model of the Cylindrical Traction Force Sensor*

To meet the design requirement of traction force sensor, the selection of ESB structural sizes should be carried out on the basis of theoretical analysis. Therefore, based on theory of mechanics, we analyze the mechanical properties of the ESB and establish the mechanical model of the ESB, which is of great significance for the determination of structural sizes of ESB and for the understanding of the mechanism of force perception and the mechanical properties of the ESB.

### 3.3.1. The Mechanics Analysis under the FX

When the traction force *FX* acts on the ESB, the circular ring between layer A and the free end of the ESB produces shear deformation along the force direction. According to the mechanics of materials, the direction of the shear stress at a point on the excircle of the circular ring coincides with the tangential direction of the excircle, and the angle between its direction vector and the direction of the force *FX* is acute, as shown in Figure 12. According to the calculation method of shear stress, the shear stress at point *e* can be calculated using the following equation.

$$\tau\_{F\_X} = \frac{F\_X \cdot S\_z}{I\_z \cdot (D - d)/2} = \frac{4F\_X \cdot \sin \alpha}{\pi D (D - d)}\tag{1}$$

where *Sz* = *D*<sup>2</sup>(*D* − *d*) sin α/8 is the static moment of the arc segment *ce* of the circular ring with respect to the Z-axis, *Iz* = π*D*<sup>3</sup>(*D* − *d*)/16 is the moment of inertia with respect to the Z-axis, *D* is the diameter of the excircle of the ESB, *d* is the diameter of the inner circle of the ESB, and α is the acute angle between point *a* and point *e* about the Z-axis.

**Figure 12.** The shear stress analysis of the circular ring.

According to Equation (1), the shear stresses of point *a* and point *c* are zero, and the shear stresses of point *b* and point *f* are the largest. Then, the distribution of shear stress values of points in the outer surface of the circular ring is shown in Figure 13.

**Figure 13.** The distribution of shear stress values of the points in the outside surface of the circular ring.
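As a quick check of Equation (1), the short script below evaluates the shear stress around the ring for an assumed load (the value of *FX* is an illustrative assumption; *D* and *d* are the dimensions later selected in Section 3.4.1) and confirms that the stress vanishes at α = 0 and α = π and peaks at α = π/2, matching Figure 13.

```python
import numpy as np

F_X = 30.0          # assumed traction force, N
D, d = 50.0, 48.0   # outer/inner diameter of the ring, mm (values from Section 3.4.1)

alpha = np.linspace(0.0, np.pi, 181)                       # angular position on the ring
tau = 4.0 * F_X * np.sin(alpha) / (np.pi * D * (D - d))    # Equation (1), N/mm^2

print(f"tau at alpha = 0    : {tau[0]:.4f} N/mm^2")        # zero (points a and c)
print(f"tau at alpha = pi/2 : {tau[90]:.4f} N/mm^2")       # maximum (points b and f)
print(f"max per Equation (2): {4 * F_X / (np.pi * D * (D - d)):.4f} N/mm^2")
```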

Based on Equation (1) and Figure 13, without considering stress concentration, the distribution of shear stress values of the points in the outside surface of layer A is shown in Figure 14a. According to Figure 14a, the shear stress in *A1* and *A3* is small, while that of *A2* and *A4* is large. Therefore, the shear stress of the points in *A2* and *A4* can be utilized to measure *FX*. In addition, the largest shear stress value in *A2* and *A4* caused by force *FX* is as follows.

$$\tau\_{F\_X} = \frac{4F\_X \cdot \sin(\pi/2)}{\pi D(D-d)} = \frac{4F\_X}{\pi D(D-d)}\tag{2}$$

**Figure 14.** The distribution of shear stress values of the points in the outside surface of layer A and layer C: (**a**) layer A; (**b**) layer C.

Similar to layer A, the distribution of the shear stress values of the points in the outside surface of layer C can be obtained, as shown in Figure 14b. According to Figure 14b, the shear stress values of the points in *C1*, *C2*, *C3*, and *C4* are neither too large nor too small. Besides, the direction of the shear stress at the points in *C2* is the same as that at the points in *C4*, and the direction of the shear stress at the points in *C1* is the same as that at the points in *C3* (Figure 8a).

Unlike layer A and layer C, layer B bears no shear stress under *FX*. For layer B, the shear stress in *A2* and *A4* is transmitted to *B34* and *B78*, which connect with layer A, and this induces normal stress in layer B, as shown in Figure 15. This paper utilizes the shear stress in the ESB to measure the traction force; therefore, the normal stress in layer B does not affect the measurement of *MX*, *MY*, and *FZ*.

**Figure 15.** The stress in layer B induced by *FX*: (**a**) the stress transmitted to B34 by A2; (**b**) the stress transmitted to B78 by A4.

### 3.3.2. The Mechanics Analysis under the FY

According to the basic structure of the ESB, the deformation of the ESB under *FY* is similar to that under *FX*. Similar to Equation (2), the following formula is important for the measurement of the force *FY*.

$$\tau\_{F\_Y} = \frac{4F\_Y}{\pi D(D-d)}\tag{3}$$

However, unlike under the force *FX*, the points with the largest shear stress are in *A1* and *A3*, and the points with zero shear stress are in *A2* and *A4*. Hence, the indirect measurement of the force *FY* can be achieved by using the shear stress values of the points in *A1* and *A3*.

### 3.3.3. The Mechanics Analysis under the Force FZ

Under the force *FZ*, *A1*, *A2*, *A3*, and *A4* bear axial pressure, and *C1*, *C2*, *C3*, and *C4* are also under axial pressure. Therefore, the shear stress of the points in the outside surface of layer A and layer C is zero. According to Figures 10 and 11, under *FZ*, the cross sections along the Z-axis of *B1*−*B8* bear shear force, and the shear stress of the points in *B1*−*B8* can be calculated using the following equation.

$$\tau\_{F\_Z} = \frac{F\_Z}{A} = \frac{2F\_Z}{L\_b(D-d)}\tag{4}$$

where *A* = *Lb*(*D* − *d*)/2 is the area of the cross section of layer B along the Z-axis (Figure 10), (*D* − *d*)/2 is the wall thickness of layer B, and *Lb* is the height of layer B.

Then, based on Equation (4), the force *FZ* can be measured by detecting the shear stress values of the points in *B1*−*B8*.

### 3.3.4. The Mechanics Analysis under the Moment MX

When the moment *MX* acts on the cylinder-shaped ESB, the force/moment applied to *A1*, *A2*, *A3*, and *A4* can be simplified as shown in Figure 9b, which leads to normal stress in *A1*, *A2*, *A3*, and *A4*. In addition, under *MX*, *C1*, *C2*, *C3*, and *C4* also produce normal stress, but no shear stress. When *MX* is positive, *A2* bears the largest tension and *A4* bears the largest compression. However, the normal stress in *A1* and *A3* is close to zero, because the neutral axis passes through *A1* and *A3*. For layer B, the tension applied to *A2* is transmitted to *B34*, and the tension in *B34* causes shear stress in the outside surface of *B3* and *B4*. Similarly, the outside surface of *B7* and *B8* also produces shear stress. Because the normal stress in *A1* and *A3* is approximately zero, the tensions/pressures applied to *B3* and *B4* (or *B7* and *B8*) induced by the moment *MX* are approximately *FMX* = *MX*/*D*. Then, the largest shear stress value of the points in the outside surface of *B3*, *B4*, *B7*, and *B8* can be obtained.

$$
\tau\_{M\_X} = \frac{F\_{M\_X}}{A} = \frac{M\_X/D}{L\_b(D-d)} = \frac{M\_X}{L\_b(D-d)D} \tag{5}
$$

Although both *FZ* and *MX* cause shear stress in *B3*, *B4*, *B7*, and *B8*, the sign of the shear stress incurred by *MX* in *B3* and *B4* is opposite to that in *B7* and *B8*, whereas the sign of the shear stress caused by *FZ* in *B3* and *B4* is the same as that in *B7* and *B8*. Therefore, subtracting the shear stress of the points in the outside surface of *B7* and *B8* from that of the points in *B3* and *B4* yields the shear stress caused by *MX*, whereas adding them yields the shear stress caused by *FZ*. Using this property, *FZ* and *MX* can be measured separately.

In addition, the force *FY* applied to the ESB generates a moment around the X-axis at layer B, as shown in Figure 16. Therefore, the moment measured using the shear stress in *B3*, *B4*, *B7*, and *B8* is the superposition of the true moment *MX* and the moment caused by *FY*. However, the true moment *MX* applied to the ESB is the value we need to measure. The force *FY* can be measured using the shear stress in *A1* and *A3*, and the moment arm of the moment caused by *FY* is known. The moment caused by *FY* can therefore be calculated, after which the true moment *MX* is obtained.

**Figure 16.** The moment applied on layer B caused by *FY*.

### 3.3.5. The Mechanics Analysis under the Moment MY

Under *MY*, the deformation of the ESB is similar to that under *MX*. Therefore, similar to Equation (5), the following equation can be obtained.

$$\tau\_{M\_Y} = \frac{M\_Y}{L\_b(D-d)D}\tag{6}$$

Unlike under the moment *MX*, under *MY*, *A1* and *A3* bear the largest tension or compression, and the normal stress in *A2* and *A4* is close to zero. Then, the points in the outside surface of *B1*, *B2*, *B5*, and *B6* produce relatively large shear stress. Under the combined action of *MY* and *FZ*, both cause shear stress in *B1*, *B2*, *B5*, and *B6*. The sign of the shear stress incurred by *MY* in *B1* and *B2* is opposite to that in *B5* and *B6*, whereas the sign of the shear stress caused by *FZ* in *B1* and *B2* is the same as that in *B5* and *B6*. Using this property, *FZ* and *MY* can be measured separately. In addition, the force *FX* applied to the shell also produces a moment around the Y-axis. Therefore, the measurement of the true moment *MY* applied to the ESB also requires removing the moment around the Y-axis caused by *FX*.

### 3.3.6. The Mechanics Analysis under the Moment MZ

When the moment *MZ* acts on the cylinder-shaped ESB, *A1*, *A2*, *A3*, and *A4* all produce shear stress, and the value of the shear stress can be calculated using the following equation.

$$\tau\_{M\_Z} = \frac{M\_Z}{R \cdot A} = \frac{M\_Z}{(D/2) \cdot \pi D(D-d)/[2(r+1)]} = \frac{4M\_Z(r+1)}{\pi D^2(D-d)}\tag{7}$$

where *A* = π*D*(*D* − *d*)/[2(*r* + 1)] is the area of the cross section of layer A perpendicular to the Z-axis, *R* = *D*/2 is the radius of the excircle of the ESB, and *r* is the ratio between the arc length of the four grooves and the arc length of *A1*, *A2*, *A3*, and *A4*.

Under *MZ*, the direction of the shear stress of the points in the outer surface of *A1* and *A2* is opposite to that of *A3* and *A4*, respectively, as shown in Figure 7b. However, under *FX*, the direction of the shear stress of the points in *A2* and *A4* is the same (Figure 7a). Similarly, under *FY*, the direction of the shear stress of the points in *A1* and *A3* is the same. Therefore, the measurement of *FX* or *FY* using the shear stress of the outside surface of *A2* and *A4* or *A1* and *A3*, respectively, is not affected by *MZ*.

For layer B, the shear stress in *A1*, *A2*, *A3*, and *A4* is transmitted to *B12*, *B34*, *B56*, and *B78*, which connect with layer A. Then, *B1*−*B8* are under normal stress, which does not affect the measurement of *MX*, *MY*, and *FZ*. The normal stress in *B1*−*B8* causes stress in *B23*, *B45*, *B67*, and *B81*, which induces shear stress in *C1*, *C2*, *C3*, and *C4*. The values of the shear stress in the outside surface of *C1*, *C2*, *C3*, and *C4* are the same as those of layer A, which can be calculated using Equation (7). Similarly, the direction of the shear stress of the points in the outside surface of *C1* is opposite to that of *C3*, and the direction of the shear stress of the points in the outside surface of *C2* is opposite to that of *C4* (Figure 8c). Besides, *FX* and *FY* also affect the shear stress of the points in the outside surface of *C1*, *C2*, *C3*, and *C4*. However, according to Figure 8a,b, under *FX* and *FY*, the direction of the shear stress of the points in the outside surface of *C1* is the same as that of *C3*, and that of *C2* is the same as that of *C4*. Hence, the shear stress in the outside surface of *C1*, *C2*, *C3*, and *C4* can be used to detect *MZ* without being affected by *FX* and *FY*.

### *3.4. Parameter Selection and Strength Check of Cylinder-Shaped Elastic Structural Body*

### 3.4.1. Sensitivity and Parameter Selection of the Elastic Structural Body

Strain values under unit forces and torques reflect the sensitivities of a force sensor. The microstrain measured by the strain gauge is ε = τ/*E*, where *E* is the elastic modulus and τ is the shear stress caused by the unit force/moment. Aluminum alloy 7075 is selected to machine the cylinder-shaped ESB, and its elastic modulus is *E* = 71.7 GPa. Under unit traction forces, the microstrains measured by the strain gauges pasted on the outside surface of the ESB are as follows.

$$\begin{cases} S\_{F\_X} = S\_{F\_Y} = 4/[\pi D(D-d)E] \\ S\_{F\_Z} = 2/[l\_b(D-d)E] \\ S\_{M\_X} = S\_{M\_Y} = 1/[l\_b(D-d)DE] \\ S\_{M\_Z} = 4(r+1)/[\pi D^2(D-d)E] \end{cases}\tag{8}$$

where *SFX* , *SFY* , *SFZ*, *SMX* , *SMY* , *SMZ* are the sensitivities of ESB with respect to traction forces/torques *FX*, *FY*, *FZ*, *MX*, *MY*, *MZ* respectively.

According to Equation (8), the smaller *D*, (*D* − *d*), and *lb* are and the larger *r* is, the higher the sensitivities. However, based on the design criteria, the parameters of the ESB must not be too small. Considering the convenience of machining the ESB and pasting the strain gauges, the selected parameters are *D* = 50 mm, *d* = 48 mm, *lb* = 9 mm, and *r* = 3, and the heights of layer A and layer C are *la* = 10 mm and *lc* = 9 mm, respectively. Substituting these parameters into Equation (8), the theoretical sensitivities of the ESB can be obtained, as shown in Table 1.


**Table 1.** Theoretical sensitivity of the elastic structural body.
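The entries of Table 1 follow directly from Equation (8) with the parameters chosen above. The sketch below reproduces that substitution; the conversion to microstrain per N (or per N·cm) is our choice for readability, and the figures in Table 1 may use different unit conventions.

```python
import math

# Structural parameters selected in Section 3.4.1
D, d = 0.050, 0.048   # outer / inner diameter, m
l_b = 0.009           # height of layer B, m
r = 3                 # groove-to-block arc-length ratio
E = 71.7e9            # elastic modulus of aluminum alloy 7075, Pa

# Equation (8): strain per unit load
S_Fx = 4 / (math.pi * D * (D - d) * E)                 # per N (equal to S_Fy)
S_Fz = 2 / (l_b * (D - d) * E)                         # per N
S_Mx = 1 / (l_b * (D - d) * D * E)                     # per N*m (equal to S_My)
S_Mz = 4 * (r + 1) / (math.pi * D**2 * (D - d) * E)    # per N*m

print(f"S_FX = S_FY = {S_Fx * 1e6:.3f} microstrain/N")
print(f"S_FZ        = {S_Fz * 1e6:.3f} microstrain/N")
print(f"S_MX = S_MY = {S_Mx * 1e6 / 100:.3f} microstrain/(N*cm)")
print(f"S_MZ        = {S_Mz * 1e6 / 100:.3f} microstrain/(N*cm)")
```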

### 3.4.2. Strength Check of the Cylinder-Shaped ESB

In order to prevent overload damage of the traction force sensor, it is necessary to obtain the maximum forces that the cylinder-shaped ESB can withstand, which can be calculated by the following equation.

$$\begin{cases} F\_{X-\text{max}} = F\_{Y-\text{max}} = (\pi D(D-d)/4)[\tau] \\ F\_{Z-\text{max}} = (L\_b(D-d)/2)[\tau] \\ M\_{X-\text{max}} = M\_{Y-\text{max}} = L\_b(D-d)D[\tau] \\ M\_{Z-\text{max}} = (\pi D^2(D-d)/[4(r+1)])[\tau] \end{cases}\tag{9}$$

where [τ] is the permissible shear stress of the 7075 aluminum alloy used to machine the cylinder-shaped ESB, [τ] = 0.5[σ], [σ] = σ*<sub>s</sub>*/2.5 = 182 N/mm<sup>2</sup>, σ*<sub>s</sub>* = 455 N/mm<sup>2</sup> is the yield stress of 7075 aluminum alloy, and *FX*−*max*, *FY*−*max*, *FZ*−*max*, *MX*−*max*, *MY*−*max*, and *MZ*−*max* are the largest *FX*, *FY*, *FZ*, *MX*, *MY*, and *MZ* that the ESB can bear, respectively.

Substituting the parameters into Equation (9), the maximum forces/moments that the ESB can withstand can be obtained: *FX*−*max* = 2857.4 N, *FY*−*max* = 2857.4 N, *FZ*−*max* = 819 N, *MX*−*max* = 8190 N·cm, *MY*−*max* = 8190 N·cm, and *MZ*−*max* = 8929.38 N·cm, respectively. In the kinesthetic teaching of a robot, humans do not use large forces to guide its movement. Therefore, for human operation, the theoretical maximum forces/moments that the ESB can bear are very large, which is enough to prevent the traction force sensor from being damaged.

### *3.5. Measurement of the Traction Force*

Given the true sensitivities of the traction force sensor, the traction force can be calculated from the measured strain values. By combining Equations (2)−(8), the traction force can be obtained as follows.

$$
\begin{bmatrix} F\_X \\ F\_Y \\ F\_Z \\ M'\_X \\ M'\_Y \\ M\_Z \end{bmatrix} = \begin{bmatrix} 1/S\_{F\_X} & 0 & 0 & 0 & 0 & 0 \\ 0 & 1/S\_{F\_Y} & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/S\_{F\_Z} & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/S\_{M\_X} & 0 & 0 \\ 0 & 0 & 0 & 0 & 1/S\_{M\_Y} & 0 \\ 0 & 0 & 0 & 0 & 0 & 1/S\_{M\_Z} \end{bmatrix} \begin{bmatrix} \varepsilon\_{F\_X} \\ \varepsilon\_{F\_Y} \\ \varepsilon\_{F\_Z} \\ \varepsilon\_{M\_X} \\ \varepsilon\_{M\_Y} \\ \varepsilon\_{M\_Z} \end{bmatrix} \tag{10}
$$

where ε*FX*, ε*FY*, ε*FZ*, ε*MX*, ε*MY*, and ε*MZ* are the shear strain values caused by *FX*, *FY*, *FZ*, *MX*, *MY*, and *MZ*, respectively.

In order to measure the shear strains caused by the external forces/moments, the strain gauges need to be pasted on the ESB at ±45° to the axis of the ESB, and the strain gauges pasted in different regions are connected to form six electric bridges. The output of an electric bridge is a voltage, not a strain value. Then, Equation (10) can be rewritten to express the mapping between the voltage changes of the electric bridges and the external forces.

$$[F]\_{6\times 1} = \ [S]\_{6\times 6} \cdot [K]\_{6\times 6} \cdot [\Delta v]\_{6\times 1} = \ [P]\_{6\times 6} [\Delta v]\_{6\times 1} \tag{11}$$

where [*F*]6×1 = [*FX*, *FY*, *FZ*, *M*′*X*, *M*′*Y*, *MZ*]<sup>T</sup>; [*S*]6×6 is the diagonal matrix in Equation (10); [*K*]6×6 is the coefficient matrix of strain transfer of the electric bridges, whose elements are the strain values corresponding to unit voltage; the elements in [Δ*v*]6×1 are the changes in the output voltages of the electric bridges; and [*P*]6×6 is equal to [*S*]6×6[*K*]6×6.

Because the moment *M*′*X* in [*F*]6×1 includes the moment caused by *FY* and the moment *M*′*Y* contains the moment induced by *FX*, a correction is necessary to obtain the real moments *MX* and *MY* applied on the traction force sensor. The following equation eliminates the errors in *M*′*X* and *M*′*Y*.

$$\begin{cases} M\_X = M'\_X - F\_Y \cdot d\_{F\_Y} \\ M\_Y = M'\_Y - F\_X \cdot d\_{F\_X} \end{cases} \tag{12}$$

where *dFX* is the moment arm from the application point of the force *FX* to the moment measuring point, *dFY* is the moment arm from the application point of the force *FY* to the moment measuring point, and in ideal circumstances, *dFX* is equal to *dFY* .
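A minimal sketch of the force reconstruction described by Equations (11) and (12) is given below: given a calibration matrix [*P*] and the six bridge-voltage changes, the wrench is computed, and the moments about the X- and Y-axes are then corrected for the contributions of *FY* and *FX*. The matrix and moment arms used in the example call are placeholders, not the calibrated values (the calibrated matrix appears later in Equation (13)).

```python
import numpy as np

def traction_wrench(P, dv, d_fx, d_fy):
    """Compute [Fx, Fy, Fz, Mx, My, Mz] from bridge-voltage changes.

    P    : 6x6 calibration matrix (Equation (11))
    dv   : 6-vector of bridge output-voltage changes
    d_fx : moment arm of Fx used in the My correction (Equation (12))
    d_fy : moment arm of Fy used in the Mx correction
    """
    F = P @ dv                    # Equation (11): [Fx, Fy, Fz, Mx', My', Mz]
    Fx, Fy, Fz, Mx_p, My_p, Mz = F
    Mx = Mx_p - Fy * d_fy         # Equation (12): remove the moment caused by Fy
    My = My_p - Fx * d_fx         # Equation (12): remove the moment caused by Fx
    return np.array([Fx, Fy, Fz, Mx, My, Mz])

# Example with placeholder values (not the calibrated sensor data):
P_demo = np.eye(6)
dv_demo = np.array([0.10, 0.02, 0.30, 0.05, 0.04, 0.01])
print(traction_wrench(P_demo, dv_demo, d_fx=3.35, d_fy=3.35))
```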

### *3.6. The Realization of Traction Force Sensor*

After the traction force sensor is machined, strain gauges for shear stress measurement need to be pasted on the surface of the cylinder-shaped ESB. To measure the traction force, we pasted 48 miniature strain gauges on the outer surface of the cylinder-shaped ESB; the distribution of these strain gauges is shown in Figure 17. The blue rectangles in Figure 17 represent strain gauges, and the red squares represent the connecting terminals of the strain gauges. In order to measure the shear stress at one point, two strain gauges are pasted on the same area at an angle of 90° to each other, with the angles between the two strain gauges and the direction of the shear stress being 45° and −45°, respectively. Therefore, one strain gauge detects the tensile stress caused by the shear stress, and the other measures the compressive stress induced by it. Moreover, the strain gauges pasted in *A1*, *A2*, *A3*, *A4*, *C1*, *C2*, *C3*, and *C4* should be placed in the areas that bear the largest shear stress under *FX*, *FY*, and *MZ*, that is, the middle of these areas. However, based on the analysis in Sections 3.3.3–3.3.5, the strain gauges pasted in *B1*−*B8* can be arranged as Figure 17 shows. After the strain gauges were pasted, the cylinder-shaped ESB is as shown in Figure 19a.

**Figure 17.** Distribution of strain gauges pasted on the cylinder-shaped elastic structural body.

After the pasting of the strain gauges, the strain gauges pasted in different areas are connected to form six electric bridges. The four strain gauges pasted in *A2* and *A4* are connected to form the first electric bridge to measure the strain caused by the force *FX*. The strain gauges pasted in *A1* and *A3* are connected to form the second electric bridge to measure the strain caused by the force *FY*. The strain gauges pasted in *B12*, *B21*, *B32*, *B41*, *B52*, *B61*, *B72*, and *B81* are connected to form the third electric bridge to measure the strain induced by the force *FZ*. Similarly, the strain caused by the moment *M*′*X* can be measured by the fourth electric bridge made up of the strain gauges stuck in *B31*, *B42*, *B71*, and *B82*; the indirect measurement of the moment *M*′*Y* is obtained by the fifth electric bridge made up of the strain gauges pasted in *B11*, *B22*, *B51*, and *B62*; and the strain caused by the moment *MZ* can be measured by the sixth electric bridge made up of the strain gauges pasted in *C1*, *C2*, *C3*, and *C4*.

As presented in Sections 3.2 and 3.3, this paper utilizes the sum or the difference of the strain values of the points in the outside surface of the ESB to measure the traction force. The sum of the strain values of the points in the outside surface of *A2* and *A4* is used to represent the force *FX*. Therefore, the connection mode of the four strain gauges pasted in *A2* and *A4* is shown in Figure 18a. Δ*RFX* and Δ*RMZ* represent the changes in the resistance values of the strain gauges caused by the force *FX* and the moment *MZ*, respectively. In addition, the minus and plus signs of Δ*RFX* and Δ*RMZ* indicate that the strain gauge is compressed and stretched, respectively. According to the measurement principle of electric bridges, when Δ*RFX* is zero, the output voltage is zero even if Δ*RMZ* is not zero. However, the output voltage is not zero when Δ*RMZ* is zero and Δ*RFX* is not zero. Therefore, the first electric bridge can measure the force *FX*. To measure the forces *FY* and *FZ*, the connection modes of the second and the third electric bridges are basically the same as that of the first electric bridge.

**Figure 18.** The connection mode of strain gauges in electric bridge: (**a**) the first electric bridge; (**b**) the fourth electric bridge.

According to Sections 3.2 and 3.3.4, the difference between the strain values of the points in the outside surface of *B3* and *B4* and that in *B7* and *B8* is used to represent the moment *MX*. Therefore, the connection mode of the eight strain gauges pasted in *B31*, *B42*, *B71*, and *B82* is shown in Figure 18b. Δ*RFZ* and Δ*RMX* represent the changes in the resistance values of the strain gauges caused by the force *FZ* and the moment *MX* respectively. According to the measurement principle of electric bridges, when Δ*RMX* is zero, the output voltage is zero even if Δ*RFZ* is not equal to zero. However, the output voltage is not zero when Δ*RFZ* is equal to zero and Δ*RMX* is not zero. Therefore, the fourth electric bridge can measure the moment *MX*. In order to measure the moments *MY* and *MZ*, the connection mode of the fifth and the sixth electric bridges is basically the same as that of the fourth electric bridge.
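The cancellation argument can also be checked numerically. The sketch below uses the standard linearized full-bridge output V_out ≈ (V_ex/4)·(ΔR1 − ΔR2 + ΔR3 − ΔR4)/R0 and an arm assignment consistent with the first bridge of Figure 18a, showing that resistance changes caused by *MZ* cancel while those caused by *FX* add; the resistance values are illustrative assumptions.

```python
def bridge_output(dR, R0=350.0, V_ex=5.0):
    """Linearized output of a full Wheatstone bridge.
    dR: resistance changes [dR1, dR2, dR3, dR4] of the four arms."""
    dR1, dR2, dR3, dR4 = dR
    return V_ex / (4 * R0) * (dR1 - dR2 + dR3 - dR4)

dR_fx = 0.20   # |resistance change| per gauge caused by FX (assumed, ohm)
dR_mz = 0.35   # |resistance change| per gauge caused by MZ (assumed, ohm)

# Arms 1 and 2 are the +/-45 degree gauges on A2; arms 3 and 4 are those on A4.
# FX shears A2 and A4 with the same sign; MZ shears them with opposite signs.
dR_from_fx = [+dR_fx, -dR_fx, +dR_fx, -dR_fx]
dR_from_mz = [+dR_mz, -dR_mz, -dR_mz, +dR_mz]
dR_total = [a + b for a, b in zip(dR_from_fx, dR_from_mz)]

print(bridge_output(dR_from_mz))  # 0.0     -> MZ alone produces no output
print(bridge_output(dR_from_fx))  # nonzero -> proportional to FX
print(bridge_output(dR_total))    # equals the FX-only output
```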

According to the structure of the traction force sensor shown in Figure 5, the cylinder-shaped ESB, connecting fitting, and shell were assembled into a traction force sensor by screw fastening, as shown in Figure 19b. The central column in the connecting fitting attaches the contact force sensor to the end of the traction force sensor to form a tandem force sensor. The output signal of the traction force sensor is voltage, and we developed a 12-channel signal acquisition instrument for signal acquisition (Figure 19c). Six channels of the signal acquisition instrument are used for the traction force sensor, and the other six channels are used for the contact force sensor.

**Figure 19.** Cylinder-shaped elastic structural body and traction force sensor: (**a**) cylinder-shaped elastic structural body; (**b**) traction force sensor; (**c**) signal acquisition instrument.

### *3.7. Calibration Experiment of Cylindrical Traction Force Sensor*

Equations (11) and (12) show that the traction force can be detected by measuring the variations in the voltages of the six electric bridges. To obtain the real matrix [*P*]6×6 in Equation (11) and the moment arms in Equation (12), a calibration experiment is necessary. We use a 6-DOF industrial robot to perform the calibration experiment, as shown in Figure 20. The robot remains stationary during the calibration process to provide rigid support for the sensor, and forces and torques are applied to the sensor by mounting weights on the loading structure. Moreover, the attitude of the traction force sensor can be changed by adjusting the posture of the robot, so that forces/moments in different directions can be applied to the sensor. After forces/torques are applied to the sensor, the self-developed signal acquisition instrument collects the output voltages of the sensor.

**Figure 20.** Industrial robot used in calibration experiments.

In the calibration experiment, small force/moment ranges are adopted because humans prefer to guide the robot with small forces/torques. During the calibration process, *FX* and *FY* were loaded in 10 N steps over ±60 N, *FZ* was loaded in 10 N steps up to 60 N, *MX* and *MY* were loaded in steps of 33.5 N·cm (10 N × 3.35 cm) over ±201 N·cm, and *MZ* was loaded in steps of 32 N·cm (10 N × 3.2 cm) over ±192 N·cm. The moments applied to the sensor were achieved by mounting weights on the loading structure; therefore, when moments were applied to the sensor, the weights also exerted forces on it. After each loading, the output values of the electric bridges were recorded. Each calibration experiment was repeated three times to ensure the availability and repeatability of the experimental data. The changes in the output voltages under external forces are shown in Figure 21, where CH1, CH2, CH3, CH4, CH5, and CH6 represent the output voltage values of the first, second, third, fourth, fifth, and sixth electric bridges, respectively. Under *FX* and *MZ*, CH1, CH5, and CH6 have significant outputs, which confirms that *FX* induces a moment around the Y-axis; under *FY* and *MZ*, CH2, CH4, and CH6 have significant outputs, which confirms that *FY* induces a moment around the X-axis. In addition, Figure 21 shows that CH1 is mainly sensitive to *FX*, CH2 to *FY*, CH3 to *FZ*, CH4 to *MX*, CH5 to *MY*, and CH6 to *MZ*. All of this confirms the theoretical analysis in Section 3.2.

**Figure 21.** The output voltage changes of electric bridges: (**a**) under the force *FX*; (**b**) under the force *FY*; (**c**) under the force *FX* and moment *MZ*; (**d**) under the force *FY* and moment *MZ*; (**e**) under the force *FZ* and moment *MX*; (**f**) under the force *FZ* and moment *MY*.

After the calibration experiment, the least-squares method was used to calculate the calibration matrix [*P*]6×6, as follows.

$$[P]\_{6\times 6} = \begin{bmatrix} -1.07\times 10^{-1} & 5.21\times 10^{-3} & -6.22\times 10^{-3} & -2.26\times 10^{-2} & 5.23\times 10^{-1} & 1.37\times 10^{-1} \\ -2.45\times 10^{-3} & 5.44\times 10^{-2} & -1.80\times 10^{-3} & 2.20\times 10^{-1} & -1.74\times 10^{-1} & -4.14\times 10^{-2} \\ -6.06\times 10^{-4} & 3.13\times 10^{-4} & -9.07\times 10^{-2} & -3.50\times 10^{-2} & 5.61\times 10^{-2} & 1.61\times 10^{-2} \\ -3.67\times 10^{-3} & 5.91\times 10^{-3} & -1.03\times 10^{-2} & 2.62 & 1.43\times 10^{-1} & -4.05\times 10^{-2} \\ -5.64\times 10^{-3} & 3.65\times 10^{-3} & -2.88\times 10^{-4} & -5.48\times 10^{-2} & 2.59 & 8.91\times 10^{-3} \\ -6.65\times 10^{-4} & -7.34\times 10^{-4} & -2.89\times 10^{-3} & 6.99\times 10^{-2} & -9.18\times 10^{-2} & -2.12 \end{bmatrix}\tag{13}$$

Plugging the calibration matrix into Equation (11) and using Equation (12), the calculated forces/torques can be obtained, as presented in Figure 22. The interference errors of the cylindrical traction force sensor are shown in Table 2, which shows that most of the errors are not larger than 1.0%, and the measurement ranges are −60 ≤ *FX* ≤ 60 N, −60 ≤ *FY* ≤ 60 N, 0 ≤ *FZ* ≤ 60 N, −201 ≤ *MX* ≤ 201 N·cm, −201 ≤ *MY* ≤ 201 N·cm, and −192 ≤ *MZ* ≤ 192 N·cm, respectively.
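For reference, the least-squares step can be sketched as follows: each calibration loading provides a known applied wrench and the corresponding vector of bridge-voltage changes, and [*P*] is the matrix that best maps the voltage changes to the wrench in the least-squares sense. The function and the synthetic data below are illustrative; they are not the actual calibration records of the sensor.

```python
import numpy as np

def calibrate(F_applied, dV_measured):
    """Least-squares estimate of the 6x6 calibration matrix P in Equation (11).

    F_applied   : (N, 6) array of applied [Fx, Fy, Fz, Mx, My, Mz] per loading
    dV_measured : (N, 6) array of the corresponding bridge-voltage changes
    Solves dV @ P.T ~= F, i.e. F = P @ dv for each sample.
    """
    P_T, *_ = np.linalg.lstsq(dV_measured, F_applied, rcond=None)
    return P_T.T

# Synthetic self-check: build a known matrix, simulate noisy readings, recover it.
rng = np.random.default_rng(0)
P_true = np.diag([10.0, 10.0, 12.0, 3.0, 3.0, 2.5]) + 0.1 * rng.standard_normal((6, 6))
F = rng.uniform(-60.0, 60.0, size=(90, 6))       # applied loads
dV = F @ np.linalg.inv(P_true).T                 # ideal bridge voltages
dV += 1e-4 * rng.standard_normal(dV.shape)       # small measurement noise
P_est = calibrate(F, dV)
print(np.allclose(P_est, P_true, atol=0.05))     # True: matrix is recovered
```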

**Figure 22.** Force/Torque obtained by the cylindrical traction force sensor: (**a**) the force *FX* acts on the sensor; (**b**) the force *FY* acts on the sensor; (**c**) under the force *FX* and torque *MZ*; (**d**) under the force *FY* and torque *MZ*; (**e**) under the force *FZ* and torque *MX*; (**f**) under the force *FZ* and torque *MY*.


**Table 2.** Interference error of the cylindrical traction force sensor.

Non-linear errors (NLES), hysteresis errors (HES), and repeatability errors (RES) are important indexes of the static performance of a sensor. Five of the six NLES of the cylindrical traction force sensor are not larger than 0.70%, five of the six HES are not larger than 0.85%, and four of the six RES are not larger than 0.80%, as Table 3 shows. To visually demonstrate the measurement error of the sensor, several loading and measurement experiments of forces/moments were conducted, and Table 4 compares the calculated values with the actual values. The measurement errors in Table 4 verify that the cylindrical traction force sensor can detect the external forces/torques applied to it, and the measurement errors are small.


**Table 3.** Static performance indices of the cylindrical traction force sensor.

**Table 4.** Calculated and real values when forces/torques are applied on traction force sensor.


### **4. The Realization and Application of the Tandem Force Sensor**

### *4.1. The Tandem Force Sensor Based on the Developed Cylindrical Traction Force Sensor*

According to the schematic diagram of the structure of the tandem force sensor shown in Figure 4a and the series connection mode shown in Figure 4b, a tandem force sensor is developed, as shown in Figure 23. The tandem force sensor is composed of a developed cylindrical traction force sensor and a contact force sensor connected in series, and the contact force sensor is connected with the cylindrical traction force sensor by an intermediate connecting flange. In addition, all connections are made by screw fastening. In the application, the tandem force sensor is connected to the robot end through the connection flange, and the end-effector can be fixed to the end of the tandem force sensor. In the kinesthetic teaching of robot contact tasks, the human hand exerts the traction force by grasping the shell of the traction sensor to guide the robot's motion, while the contact force sensor can accurately perceive the contact force between the robot's end-effector and the environment. Then, the traction and contact forces can be simultaneously perceived by the developed tandem force sensor in a decoupled manner.

**Figure 23.** The developed tandem force sensor.

### *4.2. Application of the Developed Tandem Force Sensor*

To further test the feasibility of the developed tandem force sensor, this paper designs a drawer switch experiment based on human–robot interaction. In daily work and life, people can easily open a variety of drawers. However, it is not easy for a robot to open and close diverse drawers as a human does. Human–robot interaction helps to transmit experience to the robot and inform it of the method of opening and closing drawers, and the robot can then learn this method.

With the developed tandem force sensor, the drawer switch experiment can be completed through human–robot interaction without damaging the drawer, and the robot can obtain several effective demonstrations. In the human–robot interaction for the drawer switch experiment, the tandem force sensor is mounted at the end of the robot, and a vacuum chuck, which allows the robot to open and close drawers, is attached to the contact force sensor. In the human–robot interaction, the teacher chooses a drawer in the locker and selects the adsorption area of the drawer. The human guides the robot to move from the initial point to the selected drawer and controls the suction cup to hold the drawer. Then, the human guides the robot to open the drawer to the maximum and finally to close the drawer, as shown in Figure 24. In this process, the tandem force sensor detects the traction force and the contact force between the vacuum chuck and the drawer, which allows the robot to act according to both the human intentions and its contact state with the object being operated on, not just the human intentions. During the experiment, the data sampled by the tandem force sensor and the actions taken by the teacher are saved as state-action pairs. Then, the robot can learn the policy of the drawer switch task and open and close the drawer by itself (Figure 25), which confirms the feasibility and effectiveness of the tandem force sensor.

**Figure 24.** Human and robot cooperate to finish the drawer switch experiment: (**a**) approach the target; (**b**) grab the target; (**c**) open switch; (**d**) close switch.

**Figure 25.** The robot finishes the drawer switch with the method the human teaches: (**a**) approach the target; (**b**) grab the target; (**c**) open switch; (**d**) close switch.

In the human–robot interaction to complete the drawer switch experiment, the change curves of *F<sup>T</sup><sub>Z</sub>* and *F<sup>C</sup><sub>Z</sub>* (superscripts *T* and *C* denote the traction force and contact force sensors, respectively) are shown in Figure 26. Owing to the inaccuracy of manual operation, the fluctuation of the traction force *F<sup>T</sup><sub>Z</sub>* is high, while the fluctuation of the contact force *F<sup>C</sup><sub>Z</sub>* is lower. In order to simulate the force curve obtained in the kinesthetic teaching of the drawer switch experiment with a single wrist force sensor, the resultant of the traction force *F<sup>T</sup><sub>Z</sub>* and the contact force *F<sup>C</sup><sub>Z</sub>* was calculated, as shown in Figure 27. Comparing Figures 26b and 27 shows that the resultant force cannot accurately represent the contact state between the robot and the drawer. If the task policy is learned from the resultant force, it cannot make the right decision. For example, when the drawer switch task policy is learned from the data shown in Figure 27, the learned policy only outputs effective action instructions when the absolute value of the contact force is about 20 N, instead of 0 N. Therefore, the net contact force obtained by the tandem force sensor is necessary for learning an effective contact task policy.

**Figure 26.** The change curves of the traction and contact forces in the drawer switch experiment: (**a**) traction force *F<sup>T</sup><sub>Z</sub>*; (**b**) contact force *F<sup>C</sup><sub>Z</sub>*.

**Figure 27.** The change curve of the sum of the traction force *F<sup>T</sup><sub>Z</sub>* and the contact force *F<sup>C</sup><sub>Z</sub>*.
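The benefit of keeping the two channels separate can be illustrated with a toy example: with the tandem sensor, a policy can threshold the net contact force directly, whereas a single wrist sensor only sees the superposition of the traction and contact forces. The force samples below are made up for illustration and are not the experimental data of Figures 26 and 27.

```python
# Made-up samples along Z (N); not the experimental data.
F_T_z = [18.0, 21.5, 19.0, 22.0, 20.5]   # traction force applied by the hand
F_C_z = [0.0, 0.0, -1.0, -6.0, -9.5]     # true contact force at the end-effector

# What a single wrist force sensor would report: the superposition of both.
resultant = [t + c for t, c in zip(F_T_z, F_C_z)]

THRESHOLD = 2.0  # contact is declared when the absolute force exceeds this value

contact_from_tandem = [abs(c) > THRESHOLD for c in F_C_z]
contact_from_resultant = [abs(r) > THRESHOLD for r in resultant]

print(contact_from_tandem)     # [False, False, False, True, True] -> correct detection
print(contact_from_resultant)  # [True, True, True, True, True]   -> traction masks the contact state
```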

### **5. Conclusions**

A tandem force sensor for measuring the traction and contact forces is introduced in this paper. With a wrist force sensor used as the contact force sensor, a cylindrical traction force sensor that is easy to handle by hand has been designed. As the core of the cylindrical traction force sensor, the cylinder-shaped elastic structural body is designed, and its force measurement theory is analyzed in detail. Calibration experiments verify the theoretical analysis of the cylinder-shaped elastic structural body and the good static characteristics of the traction force sensor. Then, a wrist force sensor is mounted on the developed cylindrical traction force sensor to realize the tandem force sensor.

To verify whether the tandem force sensor meets its original design intention, the drawer switch experiment based on the tandem force sensor has been carried out. The traction force sensor in the drawer switch experiment transmits the human intention to the robot, and the contact force sensor detects the contact status between the robot and the drawer. The human–robot interaction experiment shows that the tandem force sensor can sense the teacher's manner and skill as well as the contact force between the robot and the environment, so that the human and robot can cooperate to complete the task, which is the basis for robots to learn to accomplish contact tasks.

The traction force sensor can be combined with the contact force sensor as a tandem force sensor, or it can be used alone. Although we have only applied the tandem force sensor to the drawer switch experiment, it can also be applied to a wide range of contact tasks that need human–robot collaboration, such as assembly, grinding, polishing, and deburring. Moreover, the traction force sensor can be used alone for non-contact tasks that need human–robot collaboration, such as paint spraying and trajectory teaching. In these tasks, the most important advantage of the traction force sensor over a common wrist force sensor is that the weight of the end-effector does not affect its measurement, which simplifies gravity compensation.

Because the contact force sensor adopted in the tandem force sensor is a commercial wrist force sensor whose structure is not optimized for this combination, the developed tandem force sensor is bulkier and less compact than it could be. In the future, the structure of the tandem force sensor will be optimized to bring it closer to the ideal tandem force sensor, so that it can be applied to robotic contact tasks in a more compact and convenient form.

**Author Contributions:** Conceptualization, Z.Z., Y.C., and D.Z.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z., and D.Z.; formal analysis, Z.Z.; investigation, Z.Z.; resources, Y.C.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z., Y.C., and D.Z.; visualization, Z.Z.; supervision, D.Z.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the Science and Technology Support Project of the National Science Foundation of China, grant number 51775215.

**Acknowledgments:** We thank Jiming Sa (Wuhan University of Technology) and Jiangyu Hu (Wuhan University of Technology) for their assistance in the data collection. We also thank the anonymous reviewers for taking time out of their busy schedules to review this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Review* **Reinforcement Learning Approaches in Social Robotics**

**Neziha Akalin \* and Amy Loutfi**

School of Science and Technology, Örebro University, SE-701 82 Örebro, Sweden; amy.loutfi@oru.se **\*** Correspondence: neziha.akalin@oru.se; Tel.: +46-1930-3415

**Abstract:** This article surveys reinforcement learning approaches in social robotics. Reinforcement learning is a framework for decision-making problems in which an agent interacts through trial-and-error with its environment to discover an optimal behavior. Since interaction is a key component in both reinforcement learning and social robotics, it can be a well-suited approach for real-world interactions with physically embodied social robots. The scope of the paper is focused particularly on studies that include social physical robots and real-world human-robot interactions with users. We present a thorough analysis of reinforcement learning approaches in social robotics. In addition to a survey, we categorize existent reinforcement learning approaches based on the used method and the design of the reward mechanisms. Moreover, since communication capability is a prominent feature of social robots, we discuss and group the papers based on the communication medium used for reward formulation. Considering the importance of designing the reward function, we also provide a categorization of the papers based on the nature of the reward. This categorization includes three major themes: interactive reinforcement learning, intrinsically motivated methods, and task performance-driven methods. The paper also presents the benefits and challenges of reinforcement learning in social robotics, the evaluation methods of the papers regarding whether or not they use subjective and algorithmic measures, a discussion in view of real-world reinforcement learning challenges and proposed solutions, and the points that remain to be explored, including the approaches that have thus far received less attention. Thus, this paper aims to become a starting point for researchers interested in using and applying reinforcement learning methods in this particular research field.

**Keywords:** reinforcement learning; social robotics; human-robot interaction; reward design; physical embodiment

### **1. Introduction**

With the proliferation of social robots in society, these systems will impact users in several facets of life from providing assistance, performing cooperation, and taking part in collaboration tasks. In order to facilitate natural interaction, researchers in social robotics have focused on robots that can adapt to diverse conditions and to different user needs. Recently, there has been great interest in the use of machine learning methods for adaptive social robots [1–4]. Machine Learning (ML) algorithms can be categorized into three sub fields: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, correct input/output pairs are available and the goal is to find a correct mapping from input to output space. In unsupervised learning, output data is not available and the goal is to find patterns in the input data. Reinforcement Learning (RL), on the other hand, is a framework for decision-making problems in which an agent interacts through trial-and-error with its environment to discover an optimal behavior [5]. The RL agent receives scarce feedback about the actions it has taken in the past. The agent then tunes its behavior over time via this feedback signal, i.e., reward or penalty. The agent's goal is therefore learning to take actions that maximize the reward.

RL approaches are gaining increasing attention in the robotics community. As interaction is a key component in both RL and social robotics, RL could provide a suitable approach for social human-robot interaction. Worth noting is that humans perform sequential decision-making in daily life, where sequential decision-making describes problems that require successive observations, i.e., cannot be solved with a single action [6]. Consequently, much of social human-robot interactions can be formulated as sequential decision-making tasks, i.e., RL problems. The goal of the robot in these types of interactions would be to learn an action-selection strategy in order to optimize some performance metric, such as user satisfaction.

**Citation:** Akalin, N.; Loutfi, A. Reinforcement Learning Approaches in Social Robotics. *Sensors* **2021**, *21*, 1292. https://doi.org/10.3390/s21041292

Academic Editors: Anne Schmitz and Cosimo Distante

Received: 17 December 2020; Accepted: 4 February 2021; Published: 11 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Before outlining the research related to reinforcement learning in social robots, first it is important to establish the definition of a social robot in the context of this article. A variety of definitions for a social robot have been proposed in the literature [7–12]. Within each of these definitions, there is a wide spectrum of characteristics. However, two important aspects become prominent in these definitions that are considered in this paper, namely, embodiment and interaction/communication capability. One example can be found in Bartneck and Forlizzi [10] where they define a social robot as an "... autonomous or semi-autonomous robot that interacts and communicates with humans by following the behavioral norms expected by the people with whom the robot is intended to interact." Following this definition, the authors stress that a social robot must have a physical embodiment. Based on the presented definitions in [7–12], we consider social robots as embodied agents that can interact and communicate with humans. Figure 1 shows some of the social robots that are used in the reviewed papers.

**Figure 1.** Some of the social robot platforms referenced within the reviewed papers. (The pictures of (**a**) the Pepper robot and (**b**) the Nao robot were taken by the authors. (**c**) Mini robot, figure adapted from [13]—licensed under the Creative Commons Attribution; (**d**) Maggie robot, figure from https://robots.ros.org/maggie/, accessed on 20 March 2020—licensed under the Creative Commons Attribution; (**e**) iCat robot, figure from https://www.bartneck.de/wp-content/uploads/2009/08/iCat02.jpg, accessed on 22 March 2020—used with permission, photo credit to Christoph Bartneck.)

This article presents a survey on RL approaches in social robotics. As such, it is important to emphasize that the scope of this survey is focused on studies that include physically embodied robots and real-world interactions. Considering the definition of [10] given above, this paper excludes studies with simulations and virtual agents where no physical embodiment is present. The presented review also excludes studies with industrial robots and studies that do not include any interaction with humans. Rather, this review exclusively focuses on papers that comprise both a social robot(s) and human input/user studies. It is worth noting that studies which use simulations for training and test on physical robot deployment with user studies fall within the selection criteria. Likewise, studies that use explicit or implicit human input in the learning process are also included.

Due to the complexity of the social interactions and the real-world, most of the studies applying RL are trained and tested in simulation environments. However, real-world interactions are extremely important not only for social robots but also for understanding the full potential of reinforcement learning. It is mentioned in [14] (p. 391), that "the full potential of reinforcement learning requires reinforcement learning agents to be embedded into the flow of real-world experience, where they act, explore, and learn in our world, and not just in their worlds." Generally speaking, the overall goal of an RL agent is to maximize the expected cumulative reward over time, as stated in the "reward hypothesis" [14] (p. 42). The reward in RL is used as a basis for discovering an optimal behavior. Hence, reward design is extremely important to elicit desired behaviors in RL-based systems. The choice of reward function is crucial in robotics, where the problem is also referred to as the "curse of goal specification" [15]. Therefore, in this paper, we provide a categorization based on reward design which is crucial for RL to be successful. Moreover, since communication capability is a distinctive feature of social robots, we discuss communication mediums utilized for reward design together with RL algorithms.

Finally, it is also worth noting that in the general field of robotics there is a plethora of research in RL. There also exist review papers on the topic of RL in robotics such as applications of RL in robotics in general [15,16], policy search in robot learning [17], safe RL [18], and Deep Reinforcement Learning (DRL) in soft robotics [19]. Indeed, RL has been applied to a variety of scenarios and domains within social robotics, with growing popularity. While the field of social robotics deserves a survey on its own, to the best of our knowledge, there exists no such survey on this particular research field. Thus, the main purpose of this work is to serve as a reference guide that provides a quick overview of the literature for social robotics researchers who aim to use RL in their research. Depending on the target user group, the application domain or the experimental scenario, different types of rewards, problem formulations or algorithms can be more suitable. In that sense, we believe that this survey paper will be beneficial for social robotics researchers.

### *Overview of the Survey*

After surveying research on RL and social robotics, we analyze and categorize the studies based on four different criteria: (1) RL type, (2) the utilized communication mediums for reward function formulation, (3) the nature of the reward function, (4) the evaluation methodologies of the algorithms. These categorizations aim to facilitate and guide the choice of a suitable algorithm by social robotics researchers in their application domain. For that purpose, we elaborate on the different methods that are tested in real-world scenarios with a physical robot.

Categorization based on RL type includes bandit-based methods, value-based methods, policy-based methods, and deep RL (see Section 4). The utilized communication mediums are verbal communication, nonverbal communication, affective communication, tactile communication, and additional communication medium between the robot and the human. Moreover, there are studies in which higher interaction dynamics are used for reward formulation such as engagement, comfort, and attention. There are also other studies that do not use any communication medium at all for reward formulation. In the categorization based on the design of the reward mechanisms, three major themes emerged: interactive reinforcement learning, intrinsically motivated methods, and task performance-driven methods (see Section 5).

The evaluation methodologies include (1) the algorithm point of view, (2) the user experience point of view, and (3) evaluation of both learning algorithm-related factors and user experience-related factors.

To formulate social interactions as a reinforcement learning problem, researchers need to consider some key concepts such as input data, state representation, robot actions, and the reward function. Moreover, after the implementation of RL, it should be decided how the evaluation will be performed. Therefore, we extract from each of the cited works the following key points: (1) the input data, state space, and action space; (2) the reward function; (3) the communication medium in the HRI scenario; (4) the main experimental results; and (5) the experimental scenario and its validation. The contributions of this paper include: (i) analysing and categorising the relevant literature in terms of the type of RL used; (ii) analysing and categorising the relevant literature based on the reward function; and (iii) analysing the relevant literature in terms of evaluation methodologies.

The paper is organized as follows: In Section 2, we discuss the benefits and challenges of applying RL in the social robotics domain. In Section 3, we present a background on reinforcement learning. Following the formal presentation of the methods, in Section 4, we present the applications of these methods in social robotics. Later, we present the categorization based on reward functions in Section 5. Evaluation methods are discussed in Section 6. In Section 7, we discuss the current approaches in the view of real-world RL challenges and proposed solutions. The section further includes the points that remain to be explored, and the approaches that have thus far received less attention. Finally, in Section 8, we conclude the paper.

### **2. RL in Social Robotics—Benefits and Challenges**

Applications of social robots are numerous and range from entertainment to eldercare. The robot tasks in such cases involve interactive elements such as human-robot cooperation, collaboration, and assistance. To achieve longitudinal interaction with social robots, it is important for such robots to learn incrementally from interactions, often with non-expert end-users. In consideration of continuously evolving interactions where user needs and preferences change over time, hand-coded rules are labor-intensive. Even though rule-based systems are deterministic, it can be difficult to create rules for complex interaction patterns. Machine learning is bound to play an important role in a wide range of domains and applications including robotics. However, the social robot learning problem differs from the traditional ML setting in which there is a need for collected datasets or assumptions about the distribution of input data [21]. Often, social robots should be able to learn new tasks and task refinements in domestic (unstructured) environments. Furthermore, social robotics researchers need to deal with the particular challenge of learning in real-time from human-robot interactions. ML paradigms such as supervised learning and unsupervised learning are not designed for learning from real-time social interactions. On the contrary, RL represents an active process. Unlike other ML methods, it does not need to be provided with desired outputs; instead, it trains interactively based on reward signals and refines its behavior throughout the interaction. Moreover, interaction is a key component for social robots, which makes RL a suitable approach. RL also provides a possibility to learn from natural interaction patterns by utilizing the various social elements in the learning process. Consideration of all these points suggests that socially guided machine learning [22] could be a more suitable approach than traditional ML approaches for social HRI.

In general, combining human and machine intelligence may be effective for solving computationally hard problems [23]. The term "socially guided machine learning" was first used by Thomaz et al. [22] and refers to approaches that include social interaction between a user and a machine in the learning process. Studies using interactive reinforcement learning (IRL) in social robotics can be considered as socially guided machine learning since they make use of human feedback in different forms in the learning process. The feedback provided by the human can be used for shaping the action policy (the human is involved in the action selection mechanism) or for shaping the reward function [24]. It can be treated either as reward, in that the feedback is given based on the agent's past actions indicating "how good the taken action was", or as policy feedback, in which human feedback affects action selection or modification, thereby indicating "what to do".

The majority of studies included in this review paper use IRL which may suggest that IRL could be the best suited approach in social robotics. However, IRL has its own challenges. Human teachers tend to give less frequent feedback (due to boredom and/or fatigue) as learning progresses, resulting in diminished cumulative reward [25]. Likewise, human teachers tend to provide more positive reward than punishment [26,27]. Yet another problem in IRL is the transparency issues that might arise during the training of a physical robot via human reward [28,29]. Reference [29] used an audible alarm to alert the trainer about the robot's loss of sense. Suay et al. [30] observed that experts could teach the defined task in a predefined time frame, whereas the same amount of time was not enough for inexperienced users. One solution suggested for this was algorithmic transparency during training, which shows the internal policy to the human teacher. However, the presentation of the model of the agent's internal policy might be obscure for naive human teachers. Therefore, this information should be presented in a straight-forward way that is easy to understand to avoid causing confusion. To exemplify, in [28] human trainers waited for the Leonardo robot to establish eye contact with them before they continued teaching. The eye contact was considered as the robot being ready for the next action. These kinds of transparent behaviors in which the robot communicates the internal state of the learning process should be taken into account for guiding human trainers in IRL. As noted in several studies, in IRL, the human teacher's positive and negative reward can be much more deliberate than a simple 'good' or 'bad' feedback [28,31]. The learning agent should be aware of the subtle meanings of these feedback signals. As an example, human trainers tend to have a positive bias [28,31].

In addition, there are a variety of technical challenges to address when implementing RL in social robotics and social HRI. One of the drawbacks of online learning through interaction with a human is the requirement of long interaction time, which can be tedious and impractical for the users, resulting in fatigue and a loss of interest. A considerable amount of interaction time can wear out the robot's hardware. An alternative is using a simulated world to train the algorithm and subsequently deploying it on the real robot. Using a simulated setting has several advantages. It allows the agent to carry out learning repeatedly, which would otherwise be very expensive in the real-world. Simulated environments can also run much faster than the real-world, thus permitting the learning agent to make proportionately more learning experiences. Bridging the gap between the simulated and the real-world is not a simple task. It may be achieved by randomizing the simulator and learning a policy that shows success across many simulators and can ultimately be robust enough to work in the real world. However, simulating the real-world can be very difficult, especially with regards to modeling relevant human behaviors. Simulating the human requires a predictive model of human interactive behaviors and social norms as well as modeling the uncertainty of the real-world. Furthermore, the use of RL in social robotics poses other challenges such as devising proper reward functions and policies, as well as dealing with the sparseness of the reward signals.

The exploration-exploitation dilemma is a well-known problem in RL and refers to the choice between taking actions to discover the environment and taking actions that have already proven to be effective in producing reward [14]. RL practitioners use different approaches to deal with the trade-off between exploration and exploitation, such as the epsilon-greedy policy [32], the epsilon-decreasing policy [33], and the Boltzmann distribution [34]. The epsilon-greedy strategy greedily chooses the current best option to exploit knowledge for maximizing rewards, and otherwise selects a random action with probability *ε* ∈ [0, 1] [14]. The epsilon-decreasing strategy decreases *ε* over time, thereby progressing towards exploitative behavior [14]. Boltzmann exploration uses the Boltzmann distribution to select the action to execute. A temperature parameter balances between exploration and exploitation (a high temperature selects actions nearly at random, a low temperature selects actions greedily) [14].
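As an illustration of these three strategies, the following minimal sketch (plain Python, with an assumed table of action-value estimates `q_values`; it is not taken from any of the surveyed systems) shows how each one selects an action.

```python
import math
import random

q_values = [0.2, 0.5, 0.1]  # assumed action-value estimates for three actions

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon explore; otherwise pick the current best action."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def epsilon_decreasing(q, step, epsilon0=1.0, decay=0.01):
    """Epsilon shrinks over time, shifting from exploration towards exploitation."""
    epsilon = epsilon0 / (1.0 + decay * step)
    return epsilon_greedy(q, epsilon)

def boltzmann(q, temperature=0.5):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = [math.exp(v / temperature) for v in q]
    total = sum(prefs)
    return random.choices(range(len(q)), weights=[p / total for p in prefs])[0]

print(epsilon_greedy(q_values), epsilon_decreasing(q_values, step=100), boltzmann(q_values))
```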

Despite the mentioned challenges, there are also advantages of using RL in social robotics. One of the main advantages is that the robot can learn a personalized adaptation for different interactants, i.e., a different policy for each user. Social robots can learn social skills from their own actions without demonstrations through uncontrolled interaction experiences. This is especially true given that interaction dynamics are difficult to model and sometimes even humans cannot explain why they behave in a certain way. Therefore, RL may enable social robots to adapt their behaviors according to their human partners for natural human-robot interaction. In IRL, the immediate reward provided by the human teacher has the potential to improve the training by reducing the number of required interactions. Human teachers' guidance significantly reduces the number of states explored, and the impact of teacher guidance is proportional to the size of the state space, i.e., it increases as the size of the state space grows [26]. In RL, how to achieve a goal is not specified, instead the goal is encoded and the agent can devise its own strategy for achieving that goal. Intrinsically motivated reward signals might be useful in many real-world scenarios, where sparse rewards make the goal-directed behavior challenging. Approaches using human social signals have the advantage of utilizing signals that the user exhibits naturally during the interaction. It does not require an extra effort to collect the reward. However, the change in social signals would not be so sudden, which would very much affect the time for convergence. The role of human social factors deserves extra attention in online learning methods. Combination of RL with deep neural networks has shown success in many application areas. DRL is also a trending technique in social robotics as we see increasing work in recent years. It has the advantage of not needing manual feature engineering [35] and resulting in human-like behavior for social robots [36].

### **3. Reinforcement Learning**

Reinforcement learning [5] is a framework for decision-making problems. Markov Decision Processes (MDPs) are mathematical models for describing the interaction between an agent and its environment. Formally, an MDP is denoted as a tuple of five elements ⟨S, A, P, R, *γ*⟩, where S represents the state space (i.e., the set of possible states), A represents the action space (i.e., the set of possible actions), P : S × A × S → [0, 1] represents the probability of transitioning from one state to another state given a particular action, R : S × A × S → ℝ represents the reward function, and *γ* is the discount factor that determines the importance of future rewards, *γ* ∈ [0, 1]. The agent interacts with its environment in discrete time steps, *t* = 0, 1, 2, ...; at each time step *t*, the agent gets a representation of the environmental state *S<sub>t</sub>* ∈ S, takes an action *A<sub>t</sub>* ∈ A, moves to the next state *S<sub>t+1</sub>*, and receives a scalar reward *R<sub>t+1</sub>* ∈ ℝ. Figure 2 depicts the standard RL framework.

**Figure 2.** A standard reinforcement learning framework (reproduced from [14] (p. 38)).

The agent's behavior that maps states to actions is described as a policy, *π* : S × A → [0, 1], where *π*(*a*|*s*) = *Pr*(*A<sub>t</sub>* = *a*|*S<sub>t</sub>* = *s*) is the probability of taking action *a* ∈ A given state *s*. The agent's goal is to maximize the expected cumulative discounted reward, in other words the *return*, which is denoted as *G<sub>t</sub>*:

$$G\_t = \sum\_{k=0}^{\infty} \gamma^k R\_{t+k+1} \tag{1}$$

where *γ* is the discount factor and usually *γ* ∈ [0, 1]. The optimal behavior, that is, taking the best action at each state to maximize the reward over time, is called the optimal policy, *π*\*.
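As a small worked example of Equation (1) (illustrative reward values only, not taken from the paper), the return is computed by discounting and summing the rewards observed from time step *t* onward:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards received after time step t in a short, made-up episode.
rewards_after_t = [1.0, 0.0, 0.0, 5.0]
print(discounted_return(rewards_after_t, gamma=0.9))  # 1.0 + 0.9**3 * 5.0 = 4.645
```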

There exists a large variety of approaches in RL. They can be most broadly distinguished as model-based and model-free. Model-free approaches can be further subdivided into value-based and policy-based approaches. A shortened version of an RL taxonomy can be seen in Figure 3.

**Figure 3.** Taxonomy of Reinforcement Learning algorithms (reproduced and shortened from [37]).

### *3.1. Model-Based and Model-Free Reinforcement Learning*

RL algorithms can be divided into two main categories, model-free RL and model-based RL, depending on whether the agent does or does not use a model of the environment dynamics, which can be either provided or learned. The model describes the transition function, P, and the reward function, R. The model-based methods can be divided into two categories: those that use a given model, i.e., the models of the transition and the reward function can be accessed by the agent, and the methods in which the agent learns the model of the environment [37]. In the latter approach, the agent learns a model, which it subsequently uses during policy improvement. The agent can collect samples from the environment by taking actions. From those samples, state transitions and rewards can be predicted through supervised learning. Planning methods can be used directly on the environment model. In the model-free approach, there is no effort to build a model of the environment; instead, the agent searches for the optimal policy through trial and error interactions with the environment. Model-free methods are easier to implement in comparison with model-based methods. These methods can be advantageous over more complex methods when building a sufficiently accurate model is difficult [14] (p. 10).
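As a minimal, self-contained sketch of the model-learning step described above (not tied to any surveyed study; the sample tuples are made up), transition probabilities and expected rewards can be estimated from counts over observed (state, action, next state, reward) samples:

```python
from collections import defaultdict

# Made-up experience tuples (s, a, s_next, r) gathered by interacting with the environment.
samples = [(0, "go", 1, 0.0), (0, "go", 1, 0.0), (0, "go", 0, 0.0), (1, "go", 2, 1.0)]

counts = defaultdict(int)         # (s, a, s_next) -> visit count
totals = defaultdict(int)         # (s, a) -> total visits
reward_sums = defaultdict(float)  # (s, a) -> accumulated reward

for s, a, s_next, r in samples:
    counts[(s, a, s_next)] += 1
    totals[(s, a)] += 1
    reward_sums[(s, a)] += r

def p_hat(s, a, s_next):
    """Estimated transition probability P(s' | s, a)."""
    return counts[(s, a, s_next)] / totals[(s, a)] if totals[(s, a)] else 0.0

def r_hat(s, a):
    """Estimated expected immediate reward R(s, a)."""
    return reward_sums[(s, a)] / totals[(s, a)] if totals[(s, a)] else 0.0

print(p_hat(0, "go", 1), r_hat(1, "go"))  # 0.666... and 1.0
```

A planning method (e.g., value iteration) could then be run directly on `p_hat` and `r_hat` instead of on the real environment.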

### *3.2. Value-Based Methods*

The value of policy *π*, namely the value function, is used to evaluate the states based on the total reward the agent receives over time. RL methods that approximate the value function through temporal difference (TD) learning instead of directly learning the policy *π* are called value-based methods. For each learned policy *π*, there are two related value functions: the state-value function, *v<sub>π</sub>*(*s*), and the state-action value function (quality function), *q<sub>π</sub>*(*s*, *a*). The equations for *q<sub>π</sub>*(*s*, *a*) and *v<sub>π</sub>*(*s*) are given in Equations (2) and (3), respectively. *E<sub>π</sub>* in Equations (2) and (3) means the agent follows policy *π* in each step.

$$q\_{\pi}(s, a) = E\_{\pi} \left[ R\_{t+1} + \gamma \, R\_{t+2} + \gamma^2 \, R\_{t+3} + \dots \middle| S\_t = s, A\_t = a \right] = E\_{\pi} \left[ \sum\_{k=0}^{\infty} \gamma^k R\_{t+k+1} \middle| S\_t = s, A\_t = a \right] \tag{2}$$

$$v\_{\pi}(s) = E\_{\pi} \left[ R\_{t+1} + \gamma \, R\_{t+2} + \gamma^2 \, R\_{t+3} + \dots \middle| S\_t = s \right] = E\_{\pi} \left[ \sum\_{k=0}^{\infty} \gamma^k R\_{t+k+1} \middle| S\_t = s \right]. \tag{3}$$

The value functions are expressed via the Bellman equation [38]. The Bellman equations for *v<sub>π</sub>* and *q<sub>π</sub>* are given in Equations (4) and (5), where *s*′ indicates the next states from the set S.

$$\upsilon\_{\pi}(s) = \sum\_{a} \pi(a|s) \sum\_{s',r} p(s',r|s,a) \left[r + \gamma \upsilon\_{\pi}(s')\right] \tag{4}$$

$$q\_{\pi}(s, a) = \sum\_{s'} p(s'|s, a) \left[ r(s, a, s') + \gamma \sum\_{a'} \pi(a'|s') q\_{\pi}(s', a') \right]. \tag{5}$$
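To make Equation (4) concrete, the following minimal sketch (a tiny, made-up two-state MDP with a single action, not from the paper) performs iterative policy evaluation, repeatedly applying the Bellman expectation equation until the state values stop changing:

```python
# States 0 and 1; one action "a"; a trivial policy that always takes "a".
# P[(s, a)] lists (next_state, reward, probability) triples -- made-up numbers.
P = {
    (0, "a"): [(1, 1.0, 0.8), (0, 0.0, 0.2)],
    (1, "a"): [(0, 0.0, 1.0)],
}
policy = {0: {"a": 1.0}, 1: {"a": 1.0}}
gamma = 0.9

v = {0: 0.0, 1: 0.0}
for _ in range(1000):
    delta = 0.0
    for s in v:
        # v_pi(s) = sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]
        new_v = sum(
            pi_a * sum(p * (r + gamma * v[s_next]) for s_next, r, p in P[(s, a)])
            for a, pi_a in policy[s].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:
        break

print(v)  # converged state values under the fixed policy
```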

Comparing policies, a policy *π* is better than or equal to a policy *π*′ if:

$$
\pi \geq \pi' \text{ if } \forall s \in \mathcal{S} : v\_{\pi}(s) \geq v\_{\pi'}(s).\tag{6}
$$

There always exists at least one optimal policy *π*\* whose expected return is greater than or equal to that of all other policies for all states. Optimal policies share the same state-value function, defined as *v*\*(*s*) = max<sub>*π*</sub> *v<sub>π</sub>*(*s*) for all *s* ∈ S, and action-value function, defined as *q*\*(*s*, *a*) = max<sub>*π*</sub> *q<sub>π</sub>*(*s*, *a*) for all *s* ∈ S and *a* ∈ A(*s*). The Bellman optimality equation for *q*\*(*s*, *a*) is given in Equation (7).

$$q^\*(s, a) = \sum\_{s', r} p(s', r | s, a) \Big[ r + \gamma \max\_{a'} q^\*(s', a') \Big]. \tag{7}$$

Another distinction in RL methods comes from the perspective of policy: on-policy vs. off-policy learning. On-policy methods learn the value of the policy that is used to make decisions. In the on-policy setting, the target policy and the behavior policy are the same. The target policy is the policy that is learned about, and the behavior policy is the policy that is used to generate behavior. The state-action-reward-state-action (SARSA) algorithm [39] is one of the on-policy methods in which the agent interacts with the environment, selects an action based on the current policy, then updates the current policy. The *Q* function update in SARSA is done using Equation (8). A transition from one state-action pair to the next is expressed as (*S<sub>t</sub>*, *A<sub>t</sub>*, *R<sub>t+1</sub>*, *S<sub>t+1</sub>*, *A<sub>t+1</sub>*), which gives rise to the name SARSA. The update given in Equation (8) is done after every transition from a non-terminal state *S<sub>t</sub>*.

$$Q(S\_t, A\_t) \leftarrow Q(S\_t, A\_t) + \alpha \Big[ R\_{t+1} + \gamma \, Q(S\_{t+1}, A\_{t+1}) - Q(S\_t, A\_t) \Big]. \tag{8}$$

In the off-policy methods, the target policy is different from the behavior policy. In these methods, the policy that is evaluated and improved does not match the policy that is used to generate data. Off-policy methods can re-use the experience from old policies or other agents' interaction experience to improve the policy. One example of an off-policy algorithm is Q-learning [40]. It is one of the most popular RL algorithms using discounted reward [41]. The Q-learning rule is defined by:

$$Q(S\_t, A\_t) \leftarrow Q(S\_t, A\_t) + \alpha \left[ R\_{t+1} + \gamma \max\_{a} Q(S\_{t+1}, a) - Q(S\_t, A\_t) \right]. \tag{9}$$

The Q-learning algorithm iteratively applies the Bellman optimality equation (given in Equation (7)). As shown in Equation (9), the main difference between Q-learning and SARSA (see Equation (8)) is that in the former the target value is not dependent on the policy being used and only depends on the state-action function.
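The following minimal tabular sketch (a dictionary-backed Q-table with made-up transition values, not drawn from any surveyed system) contrasts the two update rules of Equations (8) and (9); the only difference is whether the bootstrap term uses the action actually taken next (SARSA) or the greedy maximum (Q-learning):

```python
from collections import defaultdict

Q = defaultdict(float)        # (state, action) -> value, defaults to 0.0
ACTIONS = ["left", "right"]
alpha, gamma = 0.1, 0.9

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstrap with the action the behavior policy actually chose."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    """Off-policy: bootstrap with the greedy action, regardless of what is taken next."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# One made-up transition: state 0, action "right", reward 1.0, next state 1.
sarsa_update(0, "right", 1.0, 1, "left")
q_learning_update(0, "right", 1.0, 1)
print(Q[(0, "right")])
```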

### *3.3. Policy-Based Methods*

Policy-based methods, also known as direct policy search methods, do not use value function models. In these methods, the policy is parameterized with *θ* and written as *π<sub>θ</sub>*. They operate in the space of policy parameters Θ with *θ* ∈ Θ [17]. The goal is still to maximize the accumulative return. The agent updates its policy by exploring various behaviors and exploiting the ones that perform well in regard to some predefined utility function *J*(*θ*). In many robot control tasks the state space, which includes both internal states and external states, is high-dimensional. The policy of the robot *π<sub>θ</sub>* can be defined as a controller. For any state of the robot, this controller decides which actions to take or which signals to send to the actuators [42]. The robot takes its actions *u* according to the controller (please note, actions in the policy search context are represented with *u* instead of *a*). The robot controller can be stochastic, i.e., *π*(*u*|*s*), or deterministic, i.e., *π*(*s*). After the action execution, the robot transitions to another state according to the probabilistic transition function *p*(*s<sub>t+1</sub>*|*s<sub>t</sub>*, *u<sub>t</sub>*). These states and actions of the robot form a trajectory *τ* = (*s*<sub>0</sub>, *u*<sub>0</sub>, *s*<sub>1</sub>, *u*<sub>1</sub>, ...). The corresponding return for the trajectory *τ* is represented as *R*(*τ*). The global utility of the robot is denoted as:

$$J(\theta) = \mathbb{E}\_{\tau \sim \pi\_{\theta}}[R(\tau)].\tag{10}$$

Computing the expectation in Equation (10) exactly would require running an infinite number of trajectories with the current controller. The way around this difficulty is to sample the expectation: after performing a finite set of trajectories, the return is computed over these trajectories. Thus, the goal is:

$$\theta^\* = \underset{\theta}{\text{argmax}} \, J(\theta) = \underset{\theta}{\text{argmax}} \sum\_{\tau} P(\tau, \theta) R(\tau) \tag{11}$$

where *θ*\* is the estimate of global performance and *P*(*τ*, *θ*) is the probability of *τ* under policy *π<sub>θ</sub>*.

Here RL addresses a black-box optimization problem in that the function which relates the performance to the policy parameters is unknown. There are two families of methods: direct policy search and gradient descent [42]. In direct policy search algorithms, approximate gradient descent is performed by "random trial then selection" methods, like genetic algorithms, evolution strategies, finite differences, cross entropy, etc. These algorithms need many samples and can escape from local minima if large enough variations are used. In gradient descent methods, a mathematical transformation is used so that policy gradient methods can be applied. In these methods, the policy gradient update is given by:

$$
\theta\_{k+1} = \theta\_k + \alpha \nabla\_{\theta} J(\theta) \tag{12}
$$

where *α* is a learning rate, and the policy gradient is given by [17]:

$$\nabla\_{\theta} J(\theta) = \sum\_{\tau} \nabla\_{\theta} P(\tau, \theta) R(\tau). \tag{13}$$

There are different methods to estimate the gradient ∇<sub>*θ*</sub> *J*(*θ*); interested readers may refer to [17]. Policy-based methods have the advantage of being effective in high dimensional or continuous action spaces and having better convergence properties.
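A minimal Monte Carlo sketch of this idea (a made-up one-parameter Bernoulli policy on a toy two-action task, using a likelihood-ratio/score-function gradient estimator rather than any method from the surveyed papers) samples a batch of trajectories, estimates the gradient of the utility, and applies the update of Equation (12):

```python
import math
import random

# Toy bandit-like task (made-up): action 1 yields return 1.0, action 0 yields 0.2.
def rollout(theta):
    """Sample one 'trajectory' (a single action) from the Bernoulli policy pi_theta."""
    p1 = 1.0 / (1.0 + math.exp(-theta))      # probability of choosing action 1
    action = 1 if random.random() < p1 else 0
    ret = 1.0 if action == 1 else 0.2        # R(tau)
    # d/d theta of log pi_theta(action): the likelihood-ratio (score) term
    grad_log = (1 - p1) if action == 1 else -p1
    return ret, grad_log

theta, alpha = 0.0, 0.1
for _ in range(2000):
    # Monte Carlo estimate of grad J(theta) over a small batch of sampled trajectories
    batch = [rollout(theta) for _ in range(10)]
    grad_estimate = sum(r * g for r, g in batch) / len(batch)
    theta += alpha * grad_estimate           # gradient ascent step, Equation (12)

print(theta)  # grows positive: the policy learns to prefer the better action
```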

Some methods learn both the policy and the value function. These methods are called actor-critic methods, where the 'actor' is the learned policy that is trained using policy gradients with estimations from the critic, and the 'critic' refers to the learned value function that evaluates the policy.

### *3.4. Deep Reinforcement Learning*

Learning in RL progresses over discrete time steps by the agent interacting with the environment. Obtaining an optimal policy requires a considerable amount of interaction with the environment, which results in high memory and computational complexity. Therefore, the tabular approaches that represent state-value functions, *v<sub>π</sub>*(*s*), or state-action value functions, *q<sub>π</sub>*(*s*, *a*), as explicit tables are limited to low-dimensional problems, and they become unsuitable for large state spaces. A common way to overcome this limitation is to find a generalization for estimating state values by using a set of features in each state. In other words, the idea is to use a parameterized functional form with weight vector *w* ∈ ℝ<sup>*d*</sup> for representing *v<sub>π</sub>*(*s*) or *q<sub>π</sub>*(*s*, *a*), which are then written as *v̂*(*s*; *θ*) or *q̂*(*s*, *a*; *θ*) instead of tables [14] (p. 161). Such approximate solution methods are called function approximators. The reduction of the state space by using the generalization capabilities of neural networks, especially deep neural networks, is becoming increasingly popular. Deep Learning (DL) has the ability to perform automatic feature extraction from raw data. DRL introduces DL to approximate the optimal policy and/or optimal value functions [14] (p. 192). Recently, there has been an increasing interest in using DL for scaling RL problems with high-dimensional state spaces.

The DQN method, first presented by Mnih et al. [43], combines Q-learning with Convolutional Neural Networks (CNN) for learning to play a wide variety of Atari games better than humans. In DQN, the agent's experiences *e<sub>t</sub>* = (*s<sub>t</sub>*, *a<sub>t</sub>*, *r<sub>t</sub>*, *s<sub>t+1</sub>*) are stored at each time step *t* in a data set *D<sub>t</sub>* = {*e*<sub>1</sub>, ..., *e<sub>t</sub>*}, the so-called experience replay memory. Q-learning updates are applied on a mini-batch uniformly sampled from the experience replay memory. The Q-learning update is done using Equation (14):

$$L\_i(\theta\_i) = \mathbb{E}\_{s, a, r, s' \sim \mathcal{U}(D)} \left[ \left( r + \gamma \max\_{a'} Q(s', a'; \hat{\theta}\_i) - Q(s, a; \theta\_i) \right)^2 \right] \tag{14}$$

where *θ<sub>i</sub>* represents the parameters (weights) of the Q-network at iteration *i* and *θ̂<sub>i</sub>* represents the parameters used to compute the target network at iteration *i*. The target network parameters *θ̂<sub>i</sub>* are updated to the parameters *θ<sub>i</sub>* after every *C* iterations.
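The following schematic sketch (plain Python, with a trivial linear stand-in where a real system would use a deep network; none of it is taken from [43] or the surveyed papers) shows the two ingredients described above: uniform sampling from a replay memory and the target inside Equation (14) computed with a periodically synchronized target network.

```python
import random
from collections import deque

ACTIONS = [0, 1]
replay_memory = deque(maxlen=10_000)   # stores experiences e_t = (s, a, r, s_next)

# Stand-ins for the online network Q(s, a; theta) and target network Q(s, a; theta_hat).
theta = {"w": 0.0}
theta_hat = dict(theta)                # synchronized copy, refreshed every C updates

def q(params, s, a):
    """Toy linear stand-in for a Q-network; a real DQN would use a deep network."""
    return params["w"] * (s + a)

def td_targets(batch, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta_hat), the target of Equation (14)."""
    return [r + gamma * max(q(theta_hat, s_next, a) for a in ACTIONS)
            for (_, _, r, s_next) in batch]

# Fill the memory with made-up experiences and sample a uniform mini-batch.
for t in range(100):
    replay_memory.append((t % 5, random.choice(ACTIONS), random.random(), (t + 1) % 5))

batch = random.sample(list(replay_memory), k=8)
targets = td_targets(batch)
# The squared error between these targets and q(theta, s, a) would be minimized by
# gradient descent; every C updates, theta_hat is set to a copy of theta.
print(targets[:3])
```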

### **4. Categorization of RL Approaches in Social Robotics Based on RL Type**

In human-human communication, a communication medium is a means of conveying information to other people. It can be in different forms such as verbal, nonverbal, affective, and tactile. Human-robot interaction overlaps with human-human interaction to a certain extent. Furthermore, there can be an additional physical interface (i.e., a computer, a tablet, a smart game board, etc.) shared between the robot and the human. In the interaction between the robot and the human, information transmission is bidirectional, the robot and the human can be sender, receiver, or both. In the surveyed papers, we see all these communication channels being utilized, especially for the RL problem formulation. As it has already been stated in the introduction, one of the prominent characteristics of social robots is the ability to interact and communicate. Therefore, we provide two categorizations in this section: first we categorize the papers based on RL types, after which we provide a further discussion and categorization with respect to the utilized communication channels and interaction dynamics for the reward functions.

### *4.1. Bandit-Based Methods*

Bandit-based methods can be considered as a simplified case of RL in which the next state does not depend on the action taken by the agent. Different bandit-based methods have been explored in social robotics [4,44–47], such as dueling bandit learning [44], the k-armed bandit method (multi-armed bandit) [4,45,46], and the Exponential-weight Algorithm for Exploration and Exploitation (Exp3) [47].

### 4.1.1. Additional Physical Communication Medium between the Robot and the Human

Learning user preferences to personalize the user experience is used in customizing advertisements and search results. A similar approach was applied in HRI studies [4,44]. Whereas the customization is done in the background for personalized experiences on websites using users' clicks, it is adapted for social interactions by asking the user to select their preferences using buttons. In other words, these studies use a physical communication medium between the robot and the human. Schneider and Kummert [44] investigated a dueling bandit learning approach for preference learning. The algorithm draws two or more actions, and the relative preference is used as reward. It is defined as follows: in each time step *t* > 0, a pair of arms (*k<sub>t</sub>*<sup>(1)</sup>, *k<sub>t</sub>*<sup>(2)</sup>) is selected and presented to the user; if the user prefers *k<sub>t</sub>*<sup>(1)</sup> over *k<sub>t</sub>*<sup>(2)</sup>, then *w<sub>t</sub>* = 1, and *w<sub>t</sub>* = 2 otherwise, where *w<sub>t</sub>* is a noisy comparison result. The distribution of outcomes is represented by a preference matrix *P* = [*p<sub>ij</sub>*]<sub>*K*×*K*</sub>, where *p<sub>ij</sub>* is the probability that the user prefers arm *i* over arm *j*. The participant provided pairwise comparisons via a button. In the work by Ritschel et al. [4], the robot adapted its linguistic style to the user's preferences. They defined the learning tasks as k-armed bandit problems. The adaptation was done based on explicit human feedback given via buttons in the form of a numeric reward (−1, +1). The actions of the robot were a set of scripted utterances. Similarly, Ritschel et al. [46] used an additional medium between the robot and the user. They employed the social robot Reeti as a nutrition adviser, where custom hardware was utilized to obtain information about the selected drink [46]. Their custom hardware included an electronic vessel holder and a smart scale that could communicate with the robot. The problem was formalized as a k-armed bandit problem where the actions of the robot were scripted spoken advice. The reward was calculated from the amount of calories and the quantity of the selected drink.
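As a rough illustration of the quantities described above (not the actual algorithm of [44]; the comparison log is made up), the preference matrix can be estimated from logged pairwise button choices, and an arm that tends to win its comparisons can then be favored:

```python
K = 3  # number of arms (e.g., candidate robot behaviors)

wins = [[0] * K for _ in range(K)]    # wins[i][j]: times arm i was preferred over arm j
trials = [[0] * K for _ in range(K)]  # trials[i][j]: times the pair (i, j) was shown

# Made-up log of pairwise comparisons: (arm_1, arm_2, w_t) with w_t = 1 or 2.
log = [(0, 1, 1), (0, 1, 1), (1, 2, 2), (0, 2, 1), (1, 2, 2), (0, 1, 2)]
for i, j, w in log:
    winner, loser = (i, j) if w == 1 else (j, i)
    wins[winner][loser] += 1
    trials[i][j] += 1
    trials[j][i] += 1

def p_hat(i, j):
    """Estimated entry p_ij of the preference matrix P = [p_ij]_{KxK}."""
    return wins[i][j] / trials[i][j] if trials[i][j] else 0.5

# Score each arm by how often it is estimated to beat the others.
scores = [sum(p_hat(i, j) for j in range(K) if j != i) for i in range(K)]
print(scores, "-> preferred arm:", scores.index(max(scores)))
```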

### 4.1.2. Verbal and Nonverbal Communication Plus an Interface

Social robots can use any natural communication channel, and benefit from different user interfaces. The studies [45–47] take advantage of a physical medium shared across the robot and the human to simplify the state space representations. Leite et al. [45] used a multi-armed bandit for empathetic supportive strategies in the context of a chess companion robot for children. The difference in the probabilities of the user being in a positive mood before and after employing supportive strategies was used as a reward. The child's affective state was calculated by using visual facial features (smile and gaze) and contextual features of the game (game evolution i.e winning/losing, chessboard configuration). Similarly, in the work by Gao et al. [47] the user's task-related parameters were monitored through the puzzle interface. The robot's behaviors were adapted by combining a decision tree model with the Exp3 [48]. The Exp3 algorithm maintains a list of weights for each of the actions, which are used for selecting the next action. The reward was the user's task performance in combination with the user's verbal feedback. The set of robot actions included four supportive behaviors to help the user to solve the puzzle game.
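A compact sketch of the Exp3 weight-update scheme mentioned above (the generic textbook form with made-up rewards, not the exact variant combined with a decision tree in [47]) looks as follows:

```python
import math
import random

K = 4                 # number of supportive behaviors (actions)
gamma = 0.1           # exploration rate
weights = [1.0] * K   # Exp3 keeps one weight per action

def probabilities():
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / K for w in weights]

def select_action():
    return random.choices(range(K), weights=probabilities())[0]

def update(action, reward):
    """Reward is assumed to lie in [0, 1]; only the chosen action's weight changes."""
    p = probabilities()[action]
    estimated = reward / p                   # importance-weighted reward estimate
    weights[action] *= math.exp(gamma * estimated / K)

for _ in range(50):                          # toy interaction loop with fake rewards
    a = select_action()
    update(a, reward=random.random())
print(weights)
```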

### *4.2. Model-Based and Model-Free Reinforcement Learning*

Verbal Communication

Considering the challenge of modeling real-world human-robot interactions, the majority of papers included in this survey use model-free RL. Nevertheless, several recent works started to investigate model-based RL for HRI [49,50]. One of the challenges of real-world robot learning is the delayed reward. There is an assumption that the result of an agent's observations of its environment is available instantly. However, there can be a lag in human reaction to robot actions in HRI. When the reward of the robot depends on human responses, reward shaping can be useful for the robot to get more frequent feedback. Reward shaping is a technique that consists of augmenting the natural reward signal so that additional rewards are provided to make the learning process easier [51]. Studies in [49,50] presented methods including model-based RL and reward shaping for HRI. Tseng et al. [49] proposed a model-based RL strategy for a service robot learning the varying user needs and preferences, and adjusting its behaviors. The proposed reward model was used to shape the reward through human feedback by calculating temporal correlations of robot actions and human feedback. Concretely, they modeled human response time using a gamma distribution. This formulation was found to be effective (more cumulative reward collected) in dealing with delayed human feedback. The work by Martins et al. [50] presented a user-adaptive decision-making technique based on a simplified version of model-based RL and a POMDP formulation. Three different reward functions were formulated and compared in the experiments. Their entropy-based reward shaping mechanism was devised using an information-based term. The purpose of using the information term was to increase the reward given for an action leading to unknown transitions, thereby encouraging the robot to investigate the impact of new actions on the user.
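The delayed-feedback idea described for [49] can be sketched roughly as follows (an illustrative reconstruction with an assumed gamma density and made-up timestamps, not the authors' implementation): when a human reward arrives, it is credited to recent robot actions in proportion to how plausible their delay is under a gamma-distributed human response time.

```python
import math

def gamma_pdf(x, k=2.0, theta=1.0):
    """Gamma density; shape k and scale theta are assumed values for illustration."""
    if x <= 0:
        return 0.0
    return (x ** (k - 1)) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)

# Timestamps (seconds) of recent robot actions and of one human feedback event.
action_times = {"greet": 10.0, "offer_help": 11.5, "play_music": 13.0}
feedback_time, feedback_value = 14.0, +1.0

# Weight each action by the gamma likelihood of the observed response delay.
weights = {a: gamma_pdf(feedback_time - t) for a, t in action_times.items()}
total = sum(weights.values())
shaped_reward = {a: feedback_value * w / total for a, w in weights.items()}
print(shaped_reward)  # most credit goes to the action whose delay best fits the model
```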

### *4.3. Value-Based Methods*

In recent years, there has been an increasing interest in applying RL methods to social robotics, with a growing trend towards value-based methods. Q-learning, along with its different variations, is the most commonly used RL method in social robotics. The studies using Q-learning are [3,13,34,52–61]. These comprise studies using standard Q-learning [3,54,55,58,60,62], studies modifying Q-learning to deal with delayed reward [52], tuning the Q-learning parameters such as *α* [13,34,52], dealing with decreasing human feedback over time [52], comparing their proposed algorithm with Q-learning [33,49,61,63,64], using a variation of Q-learning called Object Q-learning [64–66], and combining Q-learning with fuzzy inference [67]. Other value-based studies use SARSA [68,69], TD(*λ*) [70], MAXQ [33,71,72], R-learning [32], and Deep Q-learning [35,36,73,74].

### 4.3.1. Tactile Communication

When the user is involved in the learning process by providing feedback in the form of reward or guidance, the general approach is either using an additional interface or utilizing sensory information such as internal (the robot's onboard) or external cameras and microphones. Nowadays, many social robots are equipped with tactile capabilities. However, the usage of the robots' touch sensors as a feedback mechanism has received relatively little attention in the context of RL in social robotics. Yet, the studies [52,53] benefited from the robot's tactile sensors instead of an additional interface between the user and the robot. Barraquand and Crowley [52] conducted five experiments with different modifications of the classical Q-learning algorithm. The human teacher provided feedback through the tactile sensors of the Sony AIBO robot, caressing the robot for positive feedback and tapping the robot for negative feedback. The action space comprised two actions: bark and play. The first experiment was standard Q-learning with human reward. Since the human ceased giving feedback over time, they concluded that the learning rate *α* should be adapted. In the second experiment, they used the asynchronous Q-learning algorithm. In asynchronous Q-learning, the learning rate *α* may be different for different state-action pairs. The learning rate is decreased when the system encounters the same situations and actions. In relation to standard Q-learning, this modification increased the effectiveness of the algorithm, i.e., it learned faster and forgot more slowly, because the learning rate was much smaller when there was no feedback. To overcome the delayed reward, they considered increasing the effect of human-delivered positive reward over larger time frames and decreasing the effect of negative reward over a shorter time frame. The use of an eligibility trace with a heuristic for delayed reward was found to be more efficient than classical Q-learning (generalizing experience to cover similar situations). The authors noted that the learning rate, reward propagation, and analogy (i.e., propagating information to similar states) can improve the effectiveness of learning from social interaction. Yang et al. [53] proposed a Q-learning based approach that combines homeostasis and IRL. The internal factors, i.e., drives and motivations, worked as a triggering mechanism to initiate the robot's services. However, the reward in the real-world experiments was given by the user touching the robot's head, left hand, and right hand to give positive, negative, and dispensable feedback, respectively [53]. The authors trained their model in a simulator and deployed it on the Pepper robot.
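The per-state-action learning-rate idea attributed to [52] can be sketched roughly like this (an illustrative reconstruction, not the authors' code): each (state, action) pair keeps its own visit count, and its learning rate shrinks as that pair is revisited, so well-practiced behaviors are updated more cautiously and forgotten more slowly.

```python
from collections import defaultdict

Q = defaultdict(float)      # (state, action) -> value
visits = defaultdict(int)   # (state, action) -> number of updates so far
gamma, alpha0 = 0.9, 0.5
ACTIONS = ["bark", "play"]  # action set reported for the AIBO experiments

def async_q_update(s, a, human_reward, s_next):
    """Q-learning update with a learning rate that decays per (state, action) pair."""
    visits[(s, a)] += 1
    alpha = alpha0 / visits[(s, a)]   # assumed decay schedule for illustration
    td_target = human_reward + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Made-up interaction: the user caresses the robot (+1) after "play" in state "user_near".
for _ in range(3):
    async_q_update("user_near", "play", +1.0, "user_near")
print(Q[("user_near", "play")], visits[("user_near", "play")])
```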

### 4.3.2. Additional Physical Communication Medium between the Robot and the Human

Since we identify social robots with interaction, robot learning within a social scenario stands out in the surveyed papers. Alternatively, there are studies where social interaction is not the main concern; rather, the main purpose is training a social robot to do a task. As an example, a human teacher trains the agent through a GUI [26,30] or through speech and gestures [28,31]. In Suay and Chernova [26], a human teacher trained a social
robot. They performed experiments similar to those presented in [75] in a real-world scenario with the Nao robot [26]. The human trainer observed the robot in its environment via a webcam and provided reward based on the robot's past actions or anticipatory guidance for selecting future actions through a GUI. They conducted four sets of experiments (small state space and only reward, large state space and only reward, small state space and reward plus guidance, large state space and reward plus guidance) to investigate the effect of teacher guidance and state space size on learning performance in IRL. The task was object sorting and the size of state space depended on the object descriptor features. Their results showed that the guidance accelerated the learning by significantly decreasing the learning time and the number of states explored. They observed that human guidance helped the robot to reduce the action space and its positive effect was more visible in large state-space. In a similar vein, Suay et al. [30] conducted a user study in which 31 participants taught a Nao robot to catch the robotics toys by using one of three algorithms: Behavior Networks, IRL, and Confidence-Based Autonomy. The study compared the results of these algorithms in terms of algorithm usability and teaching performance by non-expert users. In IRL, the participants provided positive or negative feedback in the form of reward through an on-screen interface. In terms of teaching performance, users achieved better performance using Confidence-Based Autonomy, however, IRL was better of modelling user behavior. It has been noted in much of the literature that teaching with IRL requires more time than with other methods because users had the tendency to stop rewarding or to vary their reward strategy. This affected the training time, which is a drawback to this approach.

### 4.3.3. Verbal and Nonverbal Communication

We discuss different human feedback types in IRL in Section 5.1. When a human teacher trains an agent, the positive or negative feedback might convey several meanings, even lack of feedback can give information to the agent depending on the teacher's training strategy [76]. For example, Thomaz and Breazeal [31] realized that human trainers might have multiple intentions with the negative reward they are giving, such as the last taken action was bad and future actions should correct the current state. They performed experiments with two different platforms: the Leonardo robot learned pressing buttons and a virtual agent learned baking a cake (Sophie's kitchen). The virtual agent responded to the negative reward by taking an UNDO action, i.e., the opposite action. In the examples with the Leonardo robot, the human teacher provided verbal feedback. After negative feedback, the robot expected the human teacher to guide it through refining the example by using speech and gestures (collaborative dialog). Although the interactive Q-learning with the addition of UNDO behavior was tested only on the virtual agent, it is worth mentioning that the proposed algorithm was more efficient compared to standard IRL. It had several advantages such as robust exploration strategy, fewer states visited, fewer failures occurred and fewer action trials done for learning the task. Continuing along these lines, Thomaz and Breazeal [28] explored how self-exploration and human social guidance can be coupled for leveraging intrinsically motivated active learning. They called the presented approach socially guided exploration, in which the robot could learn by intrinsic motivations, however, it could also take advantage of a human teacher's guidance when available. The robot learner with human guidance generalized better to new starting states and reached the desired goal states faster than the self-exploration.

### 4.3.4. Higher Level Interaction Dynamics: Engagement

Social robots are expected to exhibit flexible and fluent face-to-face social conversation. The natural conversational abilities of social robots should not be limited to short, basic task-related sentences; they should also be able to engage users in the interaction with chat and entertainment, varying from storytelling to jokes, together with human-like vocalizations and sounds. As an example, Papaioannou et al. [60] reported that users spent more time with a robot that could carry out small talk together with task-based dialogue than with a robot that produced task-based dialogue only. In their system,
the agent was trained using the standard Q-learning algorithm with simulated users and tested with the Pepper robot, where the robot assisted visitors of a shopping mall by providing information about and directions to the shops, current discounts in the shops, among other things. In the problem definition, states were represented with 12 features such as user engaged, task completed, distance, turn taking, etc. The action space consisted of 8 actions, *A* = [PerformTask, Greet, Goodbye, Chat, GiveDirections, Wait, RequestTask, RequestShop]. The reward was encoded as predefined numerical values based on task completion by the agent, including the engagement of the user. Another study considering user engagement is Keizer et al. [1], who applied a range of ML techniques in the presented system that included a modified iCat robot (with additional manipulator arms with grippers) and multimodal input sensors for tracking facial expressions, gaze behavior, body language and location of the users in the environment. The reward function was a weighted sum of task-related parameters. For each individual user *i* the reward function *R<sub>i</sub>* was defined as *R<sub>i</sub>* = 350 × *TC<sub>i</sub>* − 2 × *W<sub>i</sub>* − *TO<sub>i</sub>* − *SP<sub>i</sub>*. *TC<sub>i</sub>* is short for Task Complete, and is a binary variable. *W<sub>i</sub>* (Waiting) is a binary variable showing whether the user *i* is ready to order but not engaged with the system. *TO<sub>i</sub>* stands for Task Ongoing and is a binary variable describing whether the user is interacting with the robot but has not been served. *SP<sub>i</sub>* is short for Social Penalties and corresponds to several social penalties (e.g., while the user *i* is still talking to the system, it turns its attention to another user). An experimental evaluation compared a hand-coded and a trained system. The authors reported that the trained system performed better and was found to be faster at detecting user engagement than the hand-coded one, while the latter was more stable. In [55,57,59], the authors investigated the entertainment capabilities of social robots using RL. Ritschel et al. [57] presented a social-cues-driven Q-learning approach for adapting the Reeti robot to keep the user engaged during the interaction. The engagement of the user was estimated from the user's movement through the Kinect 2 sensor by using a Dynamic Bayesian Network. They used the change in the engagement as a reward in the storytelling scenario to adapt the robot's utterance based on the personality of the user. In similar fashion, the work by Weber et al. [59] incorporated social signals in the learning process, namely the participants' vocal laughs and visual smiles as reward. In the problem formulation, they used a two-dimensional vector containing probabilities of laughs and smiles for state representation, and the action space consisted of sounds, grimaces and three types of jokes. They used an average reward based on all samples from the punchline to the end with a predefined punchline for every joke. The human social signals were captured and processed by using the Social Signals Interpretation (SSI) framework [77]. Their purpose was to understand the user's humor preferences in an unobtrusive manner in order to improve the engagement skills of the robot. In a joke-telling scenario, the Reeti robot adapted its sense of humor (grimaces, sounds, three kinds of jokes and their combination) by using Q-learning with a linear function approximator. 
Likewise, Addo and Ahamed [55] presented a joke-telling scenario with a torso Nao robot for entertaining a human audience. They used Q-learning in which the actions of the robot were pre-classified jokes, and the numerical reward corresponded to the affective states of the user. However, the affective states of the participants were captured by a self-reported feedback signal: after each joke, the human participant provided verbal feedback (i.e., a reward) such as "very funny", "funny", "indifferent" and "not funny".
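
The reward designs above are typically simple arithmetic over features reported by the perception and dialogue modules. As a minimal sketch (not the authors' code), the per-user reward of Keizer et al. [1] described above could be computed as follows, assuming the four inputs are provided by the interaction manager:

```python
# Hedged sketch of the per-user reward R_i = 350*TC_i - 2*W_i - TO_i - SP_i
# from Keizer et al. [1]; the inputs are assumed to come from the system's
# perception and dialogue-state modules.

def user_reward(task_complete: bool, waiting: bool, task_ongoing: bool,
                social_penalties: int) -> float:
    """Reward for a single user i, following R_i = 350*TC_i - 2*W_i - TO_i - SP_i."""
    return (350 * int(task_complete)
            - 2 * int(waiting)
            - int(task_ongoing)
            - social_penalties)

# Example: the user has been served (TC=1), is not waiting, the task is no
# longer ongoing, and one social penalty was incurred during the episode.
print(user_reward(True, False, False, 1))  # 349
```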

### 4.3.5. Affective Communication: Facial Expressions

Human facial expressions are perhaps one of the richest and most powerful tools in social communication. Facial expression analysis is commonly used in HRI for understanding users and enhancing their experience. Affective facial expressions can also facilitate robot learning in RL. Recently, it has become more popular to use off-the-shelf applications in social robotics for different perception and recognition modules. The Affectiva software [78] analyzes facial expressions from video or in real time. The studies [58,68,69] used this software for affective child-robot interaction. In the work by Gordon et al. [68], a tutoring system for children was presented. The system included an Android tablet and the Tega robot setup integrated with the Affectiva software for facial emotion recognition. They used the SARSA algorithm, where the reward was a weighted sum of valence and engagement. Both valence and engagement values were obtained from the Affectiva software. Similar to [68], Park et al. [58] used the Tega robot as a language learning companion for young children. A personalized policy was trained through 6–8 sessions of interaction by using a tabular Q-learning algorithm. The reward function was a weighted sum of the engagement and learning gains of the child. The engagement was obtained from the Affectiva software. The learning gains in the reward function were represented as numerical values ([−100, 0, +50, +100]) depending on the lexical and syntactic complexity of the phrase relative to the child's level. Gamborino and Fu [69] presented a SARSA-based approach for socially assistive robots for children to support them in emotionally difficult situations. In the proposed method, a human trainer selects the actions for the social robot RoBoHoN (a small humanoid smartphone robot) through an interface, with the purpose of improving the mood of the child depending on her/his current affective state. The affective state of the child was based on seven basic facial emotions and engagement obtained by the Affectiva software and stored in an input feature vector to classify the mood of the child as good or bad. The emotions were binarized as 1 or 0 depending on whether the value was greater or less than the average, respectively. The robot suggested a set of actions to the trainer, with the aim of suggesting actions matching the trainer's preferences so that, eventually, the agent could act independently, without feedback from the trainer. Another study using facial expressions is Zarinbal et al. [54], in which Q-learning was used for query-based scientific document summarization with a social robot. The problem formulation was as follows: in each state *St* := ⟨*xi*, *scoret*(*xi*)⟩, a summary consisting of *M* sentences was generated, where *xi* is a sentence and *i* = 1, 2, ..., *M*. The scoring scheme was updated based on the human-delivered reward. The reward *rt* ∈ {−1, 0, 1} depended on the classified facial expressions: dislike, neutral and like. In state *St*, the robot presented the sentence *x*<sup>∗</sup> to the user and received the corresponding reward *rt*. The authors concluded that user feedback may improve query-based text summarization.
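
To illustrate how such affective rewards plug into a temporal-difference update, the following is a hedged sketch of a SARSA update with a reward formed as a weighted sum of valence and engagement, as in Gordon et al. [68]. The weights, the action names, and the state discretization are illustrative assumptions, not the authors' values.

```python
import random
from collections import defaultdict

ACTIONS = ["encourage", "hint", "celebrate"]   # hypothetical tutoring actions
Q = defaultdict(float)                         # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.1
w_valence, w_engagement = 0.5, 0.5             # assumed weights

def affective_reward(valence: float, engagement: float) -> float:
    """Reward as a weighted sum of valence and engagement (e.g., from Affectiva)."""
    return w_valence * valence + w_engagement * engagement

def choose_action(state):
    """Epsilon-greedy action selection over the tabular Q-values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, valence, engagement, s_next, a_next):
    """Standard SARSA update with the affective reward."""
    r = affective_reward(valence, engagement)
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Illustrative usage with a hypothetical discretized state.
s = ("low_valence", "engaged")
a = choose_action(s)
sarsa_update(s, a, valence=0.2, engagement=0.8, s_next=s, a_next=choose_action(s))
```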

### 4.3.6. Verbal Communication

The curse of dimensionality is a phenomenon that refers to problems with high-dimensional data. Representing state and action spaces as explicit tables becomes impractical for large spaces. To overcome the problem of a large state space, approximate solutions are used, one of them being fuzzy techniques. This approach has also been explored for HRI, e.g., Chen et al. [67] and Patompak et al. [32] used fuzzification and fuzzy inference together with Q-learning. These works employed verbal communication in their user studies. Chen et al. [67] proposed a multi-robot system for providing services in a drinking-at-a-bar scenario. The authors used a modified Q-learning algorithm combined with fuzzy inference, called information-driven fuzzy friend-Q (IDFFQ) learning, for understanding and adapting the behaviors of the mentioned multi-robot system based on the emotion and intention of the user. The reward function was defined as *r* = (*rt* + *rh*)/2, where the task completion *rt* (i.e., the robots selected the drink the user preferred) and the human's satisfaction with the robots' task performance *rh* were predefined numerical values. Fuzzification of emotions was done using triangular and trapezoidal membership functions in the pleasure-arousal plane. They compared the proposed algorithm with their previous algorithm, Fuzzy Production Rule-based Friend-Q learning (FPRFQ) [79]. The authors noted that the current algorithm was superior in that it resulted in a higher collected reward and faster response times of the robots. Patompak et al. [32] proposed a dynamic social force model for social HRI. The authors considered two interaction areas: a quality interaction area and a private area. The quality interaction area was defined as the distance from which users can be engaged in high-quality interactions with robots. The proposed model was designed by a fuzzy inference system, and the membership parameters were optimized by using the R-learning algorithm [80]. R-learning is an average-reward RL approach; it does not discount future rewards [81]. They argued that R-learning was suitable for the scenario since they intended to take every interaction experience into account equally. In the real robot experiments, positive or negative verbal rewards were provided by the participants.
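
The two ingredients described for Chen et al. [67], a fuzzy membership function for the emotion input and the averaged reward *r* = (*rt* + *rh*)/2, can be sketched as follows. This is an illustrative sketch only; the breakpoints of the membership function are placeholder values, not taken from the paper.

```python
def triangular_membership(x: float, a: float, b: float, c: float) -> float:
    """Degree of membership of x in a triangular fuzzy set rising at a, peaking at b, falling at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def combined_reward(r_task: float, r_human: float) -> float:
    """Reward combining task completion and the human's satisfaction, r = (r_t + r_h) / 2."""
    return (r_task + r_human) / 2.0

print(triangular_membership(0.3, 0.0, 0.5, 1.0))  # 0.6
print(combined_reward(1.0, 0.5))                  # 0.75
```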

Another study that used verbal communication for the reward is [62]. In this study, a gesture recognition system categorized the body trunk patterns as towards (the person is facing the robot), neutral (the trunk is turned between 30° and 150° away from the robot), and away (the orientation of the trunk is more than 150° away). The recognized gestures were interpreted as a person's accessibility level, which was used to determine the person's affective state. In the Q-learning-based decision-making system, the robot had drives and emotional states which were utilized for action selection. In particular, a state is represented as *s*(*yH*, *yR*, *d*), where *yH* is the accessibility level of the human, *yR* is the emotional state of the robot and *d* is the dominant drive. State transition probabilities, Q-values for each state, and the reward for each transition were predetermined numerical values. The satisfaction of the robot's drives depended on the robot completing the task. In the experimental scenario, the Brian robot reminded the user about daily activities (eat, use the bathroom, go for a walk and take medication) and the user verbally stated 'yes' or 'no' after the robot's action, with 'no' meaning that the robot's drive is not satisfied and it will continue to try to satisfy the drive. The authors mentioned that the robot could satisfy its drives within one or two iterations for the reminders, except for the drive related to using the bathroom. This was attributed to people potentially being uncomfortable with this reminder.

### 4.3.7. Higher Level Interaction Dynamics: Attention

Social robots have the potential for information acquisition from both verbal and nonverbal communication. Not only can they gesture, maintain eye contact, and share attention with their users, but they can also estimate the users' non-verbal cues and behave accordingly. In this interaction, both actors can interpret verbal and nonverbal social cues to communicate effectively. For natural, fluid HRI, robot non-verbal behaviors together with verbal communication are thoroughly discussed in [82]. These social cues do not only convey a basic message but also carry higher-level interaction dynamics such as attention, engagement, comfort, and so on. The following works highlight these in the context of RL in social robotics. Chiang et al. [56] proposed a Q-learning-based approach for personalizing the human-like robot ARIO's interruption strategies based on the user's attention and the robot's belief in the person's awareness of itself. The authors called it the "robot's theory of awareness". They formulated the problem based on the user attention, which was referred to as a Human-Aware Markov Decision Process. The human attention was estimated with a trained Hidden Markov Model (HMM) from human social cues (face direction, body direction, and voice detection). The reward consisted of predefined numerical values based on the robot's theory of awareness of the user. The robot had six actions (gestures: head shake and arm wave; navigation: approach the user and move around; audio: make sound and call name) to draw the user's attention while the user was reading. The optimal policy converged after two hours of interaction. The robot developed personalized policies for each user depending on their interruption preferences. Another study considering human attention in their problem formulation is Hemminghaus and Kopp [3]. They used Q-learning to adapt the robot head Furhat's behavior in a memory game scenario. In the game, the robot assisted the participant by guiding their attention towards target objects in a shared spatial environment. In the proposed hierarchical approach, high-level behavior was mapped to low-level behaviors, which could then be directly executed by the robot. The purpose of using Q-learning was to learn the execution of high-level behaviors through low-level behaviors. In the problem formulation, states were represented in terms of the user's gaze, the user's speech, and the game state. The game state represented the number of remaining card pairs in the game. The action space included actions such as speaking, gazing, etc., or a combination of those actions. The reward was designed as *r* = *rpos* − *c* in the case of success and *r* = *c* · *rneg* if the action had no effect; a code sketch follows below. The robot received a positive reward *rpos* if the robot's action helped the user to find the correct pair, and a negative reward *rneg* if the action had no effect on helping the user. *c* represents the cost of the chosen action, where the costs were determined manually. Moro et al. [61] is another study that considered the attention of the user. Their scenario was an assistive tea-making activity for older people with dementia. The authors proposed an algorithm involving Learning from Demonstration (LfD) and Q-learning for personalized robot behavior according to a user's cognitive abilities [61]. The Casper robot learned to imitate the combination of speech and gestures from a collected data set. The robot learns to select the suitable labeled behavior (i.e., speech and gestures initially learned from demonstrations) that is most likely to transition the user into the desired state, i.e., focused on the activity and completing the correct step. The reward function, *R*(*s*, *b*<sub>*l*</sub><sup>*i*</sup>), depended on *b*<sub>*l*</sub><sup>*i*</sup>, the labeled behavior displayed by the robot, and the state *s*, where *s* = {*sr*, *su*}. Here, *sr* represents a set of robot activity states, and *su* is the user state such that *su* = {*sfnc*, *sac*}. In the user state, *sfnc* represents the user functioning state, which is one of five mental functioning states: focused, distracted, having a memory lapse, showing misjudgment, or being apathetic. The user activity state, *sac*, represents possible actions that can be performed by seniors with cognitive impairment: successfully completing a step, being idle, repeating a step, performing a step incorrectly, or declining to continue the activity. The robot was rewarded according to the state the user transitioned into: a positive reward if the user was focused and completed the activity, and a negative reward if the user transitioned to an undesirable state. The authors compared the proposed approach with Q-learning, and reported that the proposed approach required fewer interactions for convergence and fewer steps to complete the tea-making activity. In all the papers explained above, the robot takes the user's attention into account when deciding its actions. Shared attention refers to situations involving mutual gaze, gaze following, imperative pointing and declarative pointing. Da Silva and Francelin Romero [63] presented a robotic architecture for shared attention which included an artificial motivational system driving the robot's behaviors to satisfy its intrinsic needs, so-called necessities. The motivational system comprised necessity units that were implemented as a simple perceptron with recurrent connections. The input to the artificial motivational system was provided by a perception module used to detect the environmental state and to encode the state in first-order logic with predicates. This module included face recognition with head pose estimation and a visual attention mechanism. The necessities of the robot were associated with a state-action pair in the training phase of the learning algorithm. The activation of a necessity unit was dependent on the input signal representing a stimulus detected from the environment (i.e., the perception module) and empirically defined parameters. They compared the performance of three different RL algorithms, namely contingency learning, Q-learning and Economic TG (ETG), for shared attention in social robotics. ETG is a relational RL algorithm that incorporates a tree-based method for storing examples [83]. Because ETG performed better in the simulation experiments, they decided to employ it in the real-world experiments, which entailed one of the authors interacting with the robotic head. The authors reported that the robot's corrected gaze index, defined as the frequency of gaze shifts from the human to the location that the human is looking at, increased over time during learning.
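
The cost-weighted reward of Hemminghaus and Kopp [3] referenced above can be sketched as follows. This is a small illustrative sketch under assumptions: the per-action cost table and the reward magnitudes are placeholders, not the authors' values.

```python
# r = r_pos - c when the robot's action helped the user, and r = c * r_neg when
# the action had no effect; the costs c are set manually per action.

ACTION_COSTS = {"speak_hint": 0.3, "gaze_at_card": 0.1, "speak_and_gaze": 0.4}
R_POS, R_NEG = 1.0, -1.0

def behavior_reward(action: str, helped_user: bool) -> float:
    c = ACTION_COSTS[action]
    return R_POS - c if helped_user else c * R_NEG

print(behavior_reward("gaze_at_card", True))     # 0.9
print(behavior_reward("speak_and_gaze", False))  # -0.4
```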

### 4.3.8. Affective Communication

Humans use affective communication consciously or unconsciously in their daily conversations by expressing feelings, opinions, or judgments. Social robots can facilitate their learning process through sensing and building representations of affective responses. This idea was used in [33,71,72]. In these studies, the socially assistive robot Brian 2.0 was employed as a social motivator by giving assistance, encouragement, and celebration in a memory game scenario. In the scenario, the participants interacted with the robot one-on-one with the objective of finding the matching pictures in the memory card game (4 × 4 grid, 16 picture cards). The robot's behaviors were adapted using a MAXQ method to reduce the activity-induced stress in the user. The MAXQ approach is a hierarchical formulation, which accommodates a hierarchical decomposition of the target problem into smaller subproblems by decomposing the value function of an MDP into combinations of value functions of smaller integral MDPs [84]. The authors argued that the MAXQ algorithm was suitable for memory game scenarios due to its temporal abstraction, state abstraction, and sub-task abstraction. These abstractions also helped to reduce the number of Q-values that needed to be stored. The detailed system was presented in [33]. In their system, they used three different types of sensory information: a noise-canceling microphone for recognizing human verbal actions, an emWave ear-clip heart rate sensor for the affective arousal level, and a webcam for monitoring the activity state (depending upon whether matching card pairs were found or not). They used a two-stage training process involving offline training followed by online training. The purpose of the first stage was to determine the optimal behaviors for the robot with respect to the card game. The offline training was carried out on a human user simulation model created with the interaction data of ten participants. In the second stage, they aimed to personalize the robot according to the user's state (affective arousal and game state) for different participants in online interactions. The affective arousal and user activity state formed the user state (e.g., stressed: high arousal and not matching a card; pleased: low arousal and matching a card). The success of the robot's actions was judged by whether a person's user state improved from a stressed state to a stress-free state.
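
As a hedged illustration of the user-state construction described above, the sketch below combines an arousal level with the game state to label the user as stressed, pleased, or neutral. The threshold and labels are assumptions for illustration, not the authors' implementation.

```python
def user_state(arousal: float, cards_matched: bool,
               high_arousal_threshold: float = 0.6) -> str:
    """Form the user state from affective arousal and the activity (game) state."""
    high_arousal = arousal >= high_arousal_threshold
    if high_arousal and not cards_matched:
        return "stressed"
    if not high_arousal and cards_matched:
        return "pleased"
    return "neutral"

# A robot action counts as successful if it moves the user out of the stressed state,
# e.g., from user_state(0.8, False) to user_state(0.4, True).
print(user_state(0.8, False), "->", user_state(0.4, True))
```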

### *4.4. Deep Reinforcement Learning*

For natural interaction, it is important that social robots possess human-like social interaction skills, which requires extracting features from high-dimensional signals. In such cases, DRL can be useful. In fact, several researchers have begun to examine the applicability of DRL in social robotics [35,36,73,74,85–87].

### 4.4.1. Tactile Communication

One of the pioneering works using DRL in social robotics was presented by [36]. Here, a Pepper robot learned to choose among predefined actions for greeting people, based on visual input. In their work, the authors succeeded in mapping two different visual input sources, the Pepper robot's RGBD camera and a webcam, to discrete actions (waiting, looking towards the human, hand waving and handshaking) of the robot. The reward was provided by a touch sensor located on the robot's right hand to detect handshaking. The robot received a predefined numerical reward (1 or −0.1) based on a successful or unsuccessful handshake. A successful handshake was detected through the external touch sensor. The proposed multimodal DQN consists of two identical streams of CNNs for action-value function estimation: one for grayscale frames and another for depth frames. The grayscale and depth images were processed independently, and the Q-values from both streams were fused for selecting the best possible action. This method comprised two phases: the data generation phase and the training phase. In the data generation phase, the Pepper robot interacted with the environment and collected data. After this phase, the training phase began. This two-stage algorithm was useful in that it did not pause the interaction for training. Qureshi et al. [36] used 14 days of interaction data, where each day of the experiment corresponded to one episode. The same authors applied a variation of DQN, the Multimodal Deep Attention Recurrent Q-Network (MDARQN) [73], to the same handshaking scenario as in [36]. In their previous study, the robot was unable to indicate its attention. For adding perceptibility to the robot's actions, a recurrent attention model was used, which enabled the Q-network to focus on certain parts of the input image. Similar to their previous work [36], two identical Q-networks were used (one for grayscale frames and one for depth frames). Each Q-network consisted of convnets, a Long Short-Term Memory (LSTM) network, and an attention network [88]. The convnets were used to transform visual frames into feature vectors. The network transforms an input image into *L* feature vectors of dimension *D*, each of them representing a part of the image: *at* = {*a*<sup>1</sup><sub>*t*</sub>, ..., *a*<sup>*L*</sup><sub>*t*</sub>}, *a*<sup>*l*</sup><sub>*t*</sub> ∈ ℝ<sup>*D*</sup>. This feature vector was provided as input to the attention network for generating the annotation vector *z* ∈ ℝ<sup>*D*</sup>. The annotation vector *zt* is the dynamic representation of a part of an input image at time *t* and is computed as *zt* = ∑<sup>*L*</sup><sub>*l*=1</sub> *β*<sup>*l*</sup><sub>*t*</sub>*a*<sup>*l*</sup><sub>*t*</sub>. The LSTM network used the annotation vector *zt* for computing the next hidden state. Each of the streams of the MDARQN model was trained using backpropagation. The outputs from the two streams were normalized separately and averaged to create the output Q-values of the MDARQN. As in their previous work, handshake detection was used for the reward function (−0.1 for unsuccessful handshakes and 1 for successful handshakes). The horizontal and vertical axes of the input image were divided into five subregions, which allowed the Q-network to focus on certain parts of the input image. The attention mechanism of the robot used the annotation vector *zt* to determine the pixel location of the input image to which maximal attention should be directed. This region selection provided computational benefits by reducing the number of training parameters. Another work from the same authors, Qureshi et al. [74], proposed an intrinsically motivated DRL approach for the same handshaking scenario. The proposed method utilized three basic events to represent the current state of the interaction, i.e., eye contact, smile, and handshake. These event occurrences were predicted at the next time step according to the state-action pair by a neural network called Pnet. Another neural network called Qnet was employed for the action selection policy guided by the intrinsic reward. The reward was determined based on the prediction error of Pnet, i.e., the error between the actual occurrences of events *e*(*t* + 1) and Pnet's prediction *ê*(*t* + 1). An OpenCV-based event detector module provided the labels for the three events (i.e., the actual event occurrences). Qnet was a dual-stream deep convolutional neural network mapping pixels to Q-values of the actions (wait, look towards human, wave hand, and shake hand). Pnet was a multi-label classifier which was trained to minimize the prediction error between *ê* and *e* by using the Binary Cross Entropy (BCE) loss function. The reward consisted of predetermined numerical values depending on the prediction error between *e* and *ê*. They investigated the impact of three different reward functions named strict, neutral and kind. In all reward functions, if all three events are predicted successfully by Pnet, Qnet receives a reward of 1; if all events are predicted wrong, Qnet gets a reward of −0.1. If one or two events are predicted correctly, the different reward functions penalize differently, with the strict reward having the highest penalties. The authors reported that the reward functions with more positive reward on incorrect predictions yielded more socially acceptable behavior. They compared the total reward collected over 3 days of experiments in a public place, each day following a different policy (random policy, Qnet policy, and the previously employed method [36]). The proposed model led to more human-like behaviors, according to the human evaluators.
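
The late fusion of the two streams described above can be sketched as follows: each stream produces Q-values for the same discrete actions, and these are normalized separately and averaged before the greedy action is chosen. The normalization used here (standardization) and the placeholder stream outputs are assumptions; the papers only state that the outputs were normalized and averaged.

```python
import numpy as np

ACTIONS = ["wait", "look", "wave", "shake"]

def fuse_q_values(q_gray: np.ndarray, q_depth: np.ndarray) -> np.ndarray:
    """Normalize each stream's Q-values separately and average them."""
    q_gray = (q_gray - q_gray.mean()) / (q_gray.std() + 1e-8)
    q_depth = (q_depth - q_depth.mean()) / (q_depth.std() + 1e-8)
    return (q_gray + q_depth) / 2.0

q_gray = np.array([0.1, 0.4, 0.2, 0.9])   # stand-in output of the grayscale stream
q_depth = np.array([0.0, 0.5, 0.1, 0.7])  # stand-in output of the depth stream
fused = fuse_q_values(q_gray, q_depth)
print(ACTIONS[int(np.argmax(fused))])     # "shake"
```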

### 4.4.2. No Communication Medium

Another study using the Pepper robot and DQN was presented by Cuayáhuitl [35]. In their scenario, human participants played a 'Noughts and Crosses' game with two different grids (small and big) against the Pepper robot. A CNN was used for recognizing game moves, i.e., the handwriting on the grid. These visual perceptions and the verbal conversations of the participant were given as input to their modified DQN. The author modified Deep Q-Learning with Experience Replay [43] by adding the identification of the worst action set *Â*. *Â* included actions with min(*r*(*s*, *a*)) < 0 ∀*a* ∈ *A*, where *A* is the set of actions leading to winning the game. The action selection was done with max<sub>*a* ∈ *A*∖*Â*</sub> *Q*(*s*, *a*; *θ*). In other words, the proposed DQN algorithm refines the action set at each step to make the agent learn to infer the effects of its actions (such as selecting the actions that lead to winning or avoid losing). The reward consisted of predefined numerical values based on the performance of the robot in the game; thus, this study does not use any communication medium for reward formulation. The robot received the highest reward in the cases 'about to win' or 'winning', whereas the robot received the lowest reward in the cases 'about to lose' or 'losing'.
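
The action refinement just described can be illustrated with a hedged sketch: actions belonging to the worst action set are excluded before the greedy choice over Q-values. The `q_values` array and the `reward_lookup` callable below stand in for the trained network and the game logic; they are assumptions for illustration.

```python
import numpy as np

def select_action(q_values: np.ndarray, reward_lookup) -> int:
    """Greedy action over the action set with the worst actions (negative reward) removed."""
    valid = [a for a in range(len(q_values)) if reward_lookup(a) >= 0]
    if not valid:                      # fall back to the full set if every action looks bad
        valid = list(range(len(q_values)))
    return max(valid, key=lambda a: q_values[a])

q = np.array([0.8, 0.5, 0.3])
print(select_action(q, lambda a: -1.0 if a == 0 else 0.5))  # 1 (action 0 is eliminated)
```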

### 4.4.3. Nonverbal Communication

Expressive robot behaviors including facial expressions, gestures, and posture have been found to be useful for expressing robots' internal states, goals, and desires [89]. To date, several studies have investigated the production of expressive robot behaviors using DRL, including gaze [85,86] and facial expressions [87]. Lathuilière et al. [85] combined Q-learning with a Long Short-Term Memory (LSTM) network to fuse audio and visual data for controlling the gaze of a robotic head, directing it towards the targets of interest. The reward function was defined as *Rt* = *Ft*+1 + *α* *St*+1, where *F* is the face reward, *S* is the speaker reward, and *α* ≥ 0 serves as an adjustment parameter. If the speech sources lie within the camera's field of view, large *α* values return large rewards, i.e., *α* permits giving importance to speaking persons. The reward function thus comprises the face reward *Ft* (*α* = 0) and the speaker reward (*α* > 0). The number of visible people (face reward) and the presence of speech sources in the camera's field of view (speaker reward) were observed from the temporal sequence of camera and microphone observations. The proposed DRL model was trained on a simulated environment with simulated people moving and speaking, and on the publicly available AVDIAR dataset. In this offline training, they compared the reward obtained with four different networks: early fusion and late fusion of audio and video data, as well as only audio data and only video data. The authors emphasized the importance of audio-visual fusion in the context of gaze control for HRI and reported that the proposed method outperformed handcrafted strategies. Lathuilière et al. [86] extended the study presented in [85] by investigating the impact of the discount factor, the window size (the number of past observations affecting the decision), and the LSTM network size. They reported that in the experiments with the AVDIAR dataset, high discount factors were prone to overfitting, whereas in the simulated environment low discount factors resulted in worse performance. Using smaller window sizes accelerated the training; however, larger window sizes performed better in the simulated environment. Changing the LSTM size did not make a substantial difference in the results. In a similar vein, Churamani et al. [87] utilized visual and audio data to enable the Nico robot to express empathy towards its users. They focused on both recognizing the emotions of the user and generating emotions for the robot to display. The presented model consisted of three modules: an emotion perception module, an intrinsic emotion module, and an emotion expression module. For the perception module, both the visual and audio channels were used to train a Growing-When-Required (GWR) network. For the emotion expression module, they used a Deep Deterministic Policy Gradient (DDPG)-based actor-critic architecture. The reward was the symmetry of the eyebrows and mouth in offline pre-training, whereas in online training the reward was provided by the participant deciding whether the expressed facial expression was appropriate. The Nico robot expressed its emotions through programmable LED displays in the eyebrow and mouth areas.
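
The gaze-control reward discussed above for Lathuilière et al. [85] can be sketched as follows; the face reward counts visible faces, the speaker reward counts speech sources in the field of view, and *α* ≥ 0 weights the importance of speaking persons. Variable names and values are illustrative assumptions.

```python
def gaze_reward(num_visible_faces: int, num_speakers_in_view: int, alpha: float = 1.0) -> float:
    """Reward R_t = F_{t+1} + alpha * S_{t+1} for gaze control (sketch)."""
    face_reward = num_visible_faces
    speaker_reward = num_speakers_in_view
    return face_reward + alpha * speaker_reward

print(gaze_reward(2, 1, alpha=0.0))  # 2.0 -> face reward only
print(gaze_reward(2, 1, alpha=2.0))  # 4.0 -> speakers weighted more heavily
```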

### *4.5. Policy-Based Methods*

### Higher Level Interaction Dynamics: Comfort

In the domain of socially assistive robotics, robots are expected to adapt to their users to some extent by using social interaction parameters (for example, the interaction distance, the speed of motion, and utterances) with respect to the task and to the users' comfort and personality. Several studies [90–92] examined Policy Gradient Reinforcement Learning (PGRL) for adapting robot behaviors using social interaction parameters. Mitsunaga et al. [90,91] presented a study where the Robovie II robot adjusted its behaviors (i.e., proxemics zones, eye contact ratio, waiting time between utterance and gesture, motion speed) according to the comfort and discomfort signals of humans (i.e., the amount of body repositioning and the time spent gazing at the robot). These signals were used as the reward, and the goal of the robot was to minimize them, thereby reducing the discomfort experienced by the human interactant. In [92], an ActivMedia Pioneer 2-DX mobile robot adapted its personality by changing the interaction distance, the speed and frequency of motions, and the vocal content (what the robot says and how). The purpose of this adaptation was to improve the user's task performance. Their reward function was based on user performance, defined as the number of performed exercises. Specifically, the number of exercises performed over the previous 15 s was computed every second, and the results were averaged over a 60 s period to produce the final evaluation for each policy. They used a threshold for the reward function (7 exercises in the first 10 min) and a time range to adjust for the fatigue incurred by the participant. The participant's performance was tracked by the robot through a light-weight motion capture system worn by the participant.
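
The performance-based policy evaluation described above for [92] can be sketched as a sliding-window count: every second, the exercises performed in the previous 15 s are counted, and the counts are averaged over a 60 s window. The detection timestamps below are hypothetical motion-capture outputs, not data from the study.

```python
def exercises_in_window(exercise_times, now, window=15.0):
    """Number of exercises detected in the last `window` seconds before `now`."""
    return sum(1 for t in exercise_times if now - window < t <= now)

def policy_score(exercise_times, start, duration=60.0):
    """Average of the per-second 15 s counts over a 60 s evaluation period."""
    counts = [exercises_in_window(exercise_times, start + s) for s in range(1, int(duration) + 1)]
    return sum(counts) / len(counts)

detections = [2.0, 9.5, 17.0, 26.0, 34.5, 45.0, 58.0]  # hypothetical exercise timestamps (s)
print(round(policy_score(detections, start=0.0), 2))
```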

### **5. Categorization of RL Approaches in Social Robotics Based on Reward**

We now present a review of the literature with a focus on the reward function. Designing the reward function is perhaps the most crucial step in the implementation of an RL framework. One of the main contributions of this paper is a categorization of the different types of reward functions used in RL for social robotics. The categorization is given in Figure 4.

**Figure 4.** Reinforcement Learning approaches in social robotics.

As the RL methods used have already been discussed in Section 4, they are not included here. Moreover, the evaluation methodologies are discussed in a separate section (see Section 6).

### *5.1. Interactive Reinforcement Learning*

Different approaches have been proposed for incorporating human assistance into the learning process of artificial agents, including learning from human feedback [24,76] and learning from demonstration. Learning from demonstration is beyond the scope of this paper; we focus on learning from human feedback. In traditional RL, the agent receives an environmental reward from a predefined reward function. Interactive RL makes use of human feedback in the learning process, in combination with or without an environmental reward. The interactive RL framework is shown in Figure 5. Integrating human feedback with RL can be accomplished in different ways, such as via evaluative feedback [93], corrective feedback [94] or guidance [95].

**Figure 5.** Interaction in Interactive Reinforcement Learning (reproduced from [96]).

Li et al. [96] discuss different interpretations of human evaluative feedback in interactive reinforcement learning (referred to as human-centered RL in their paper). They distinguish between three types of human evaluative feedback: interactive shaping, learning from categorical feedback and learning from policy feedback. In interactive shaping, human feedback is interpreted as a numeric reward, and this reward can be myopic, i.e., *γ* = 0 [93], or non-myopic, i.e., *γ* different from 0 [97]. Human feedback might be erroneous when the task is repetitive. Moreover, human teachers tend to give less frequent feedback (e.g., due to boredom and fatigue) as the learning progresses. Modeling human feedback has been found to be an efficient strategy when the meaning of human-delivered feedback is ambiguous [76]. Loftin et al. [76] developed a probabilistic model of a human teacher's feedback. They interpret human feedback as categorical feedback, considering that human teachers may have different feedback strategies. In their work, depending on the human teacher's training strategy, a lack of feedback can convey information about the agent's behavior. Human training strategies are categorized into four groups: reward-focused strategy (positive reward for correct actions and no feedback for incorrect actions), punishment-focused strategy (no feedback for correct actions and punishment for incorrect actions), balanced strategy (positive reward for correct actions and punishment for incorrect actions) and inactive strategy (the human teacher rarely provides feedback). Corrective feedback can be categorized under policy feedback. As an example, Celemin and Ruiz-del-Solar [94] presented a framework named COACH (COrrective Advice Communicated by Humans) which uses human corrective feedback in the action domain as binary signals (i.e., increase or decrease the magnitude of the current action). In their comparison with classical reinforcement learning approaches, they showed that RL agents can benefit from human feedback, i.e., learning progresses faster [94]. When the agent learns from both human feedback and the environmental reward, the human feedback can be used to guide the agent's exploration [95]. Guidance includes both providing feedback on past actions and guiding the agent in the learning process through future-directed rewards. Human guidance can reduce the action space by narrowing down the action choices [98], which speeds up the training process by accelerating the convergence towards an optimal policy.
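
As a hedged sketch of the interactive shaping interpretation just described, the human's evaluative feedback is treated directly as a numeric reward; with a myopic setting (*γ* = 0), the update reduces to moving the Q-value towards the latest feedback. The learning rate and the feedback encoding below are illustrative assumptions.

```python
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.2, 0.0   # gamma = 0: myopic interpretation of human feedback

def shaping_update(state, action, human_feedback: float, next_state_value: float = 0.0):
    """Treat the human's evaluative feedback as the reward in a Q-style update."""
    target = human_feedback + gamma * next_state_value
    Q[(state, action)] += alpha * (target - Q[(state, action)])

shaping_update("greeting", "wave", +1.0)   # positive feedback
shaping_update("greeting", "wave", -1.0)   # negative feedback later on
print(Q[("greeting", "wave")])
```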

In the context of HRI, the human can be in the learning loop via varying types of inputs, such as providing feedback through a GUI (e.g., by button or mouse clicks). Alternatively, the feedback can be delivered more naturally, via emotions, gestures and speech. Therefore, this category comprises two subcategories: (1) explicit feedback, when the feedback is direct and provided through an interface, such as ratings and labels; (2) implicit feedback, when the human feedback consists of spontaneous behavior or reactions such as non-verbal cues and social signals. The terms "explicit feedback" and "implicit feedback" are adopted from Schmidt's [99] "implicit interaction" study in human-computer interaction. For a quick summary of the studies, see Table 1.


**Table 1.** Summary of Interactive Reinforcement Learning approaches in social robotics.



### 5.1.1. Explicit Feedback

In the explicit feedback approach, the feedback of the human teacher is given by direct manipulation, generally through an artificial interface. The human teacher observes the agent's actions and the environment states and subsequently provides feedback to the agent, for instance via a button, a graphical user interface (GUI), or the robot's touch sensors. In this approach, the feedback from the human teacher is noiseless and direct, in the form of numerical values. In this category, the main purpose of the interaction is generally to teach the robot to do something. In contrast, in the implicit feedback category the majority of studies involve a social scenario such as robot tutoring or robots supporting the human in a game. The studies in this category are [4,26,29,30,44,52,53].

### 5.1.2. Implicit Feedback

Human social signals are widely used as rewards in social human-robot interaction. The most commonly used signals are human emotions, as these have a great influence on decision-making [103]. Computational models of emotions have been studied by many researchers as part of the agent's decision-making architecture, either by modelling the RL agents with emotions or by incorporating human emotions as an input to the learning process. As an example, Moerland et al. [104] surveyed RL studies focusing on agent/robot emotions. Since emotions also play an important role in communication and social robotics [7], various studies have considered these aspects for RL in social robotics. In the implicit feedback approach, the agent learns from the spontaneous natural behavior and reactions of the interactant, i.e., emotions, speech, gestures, etc. This type of feedback is noisy and indirect. In other words, in this approach, the human feedback requires pre-processing, and the quality of the feedback depends on the perception and recognition algorithms being used. Unlike explicit feedback, implicit feedback is not provided directly through an interface. Instead, the human's emotions or verbal instructions serve as reward or guidance signals. The studies in this category are [28,31,32,45,50,55–57,59,68,90,91,100–102].

### *5.2. Methods Using Intrinsic Motivation*

It is a common approach to examine biological and psychological decision-making mechanisms and to apply similar methods to autonomous systems. One such approach consists of combining intrinsic motivation with reinforcement learning. Intrinsic motivation is a concept in psychology which denotes the internal natural drive to explore the environment and to gain new knowledge and skills; the activities are performed for inherent satisfaction rather than external rewards [105]. Researchers have proposed computational approaches that use intrinsic motivation [106]. In intrinsically motivated RL, the main idea is to use intrinsic motivations as a form of reward [107]. There are different intrinsic motivation models within the RL framework [20]. However, in social robotics, the idea of maintaining the internal needs of the robot (detailed below) has received much attention [13,34,63–66,108]. One exception is [74], in which the prediction error of social event occurrences was used as intrinsic motivation. For a quick summary, see Table 2.

### Homeostasis-Based Methods

Homeostasis, as defined by Cannon [109], refers to a continuous process of maintaining an optimal internal state in the physiological condition of the body for survival. Berridge [110] explains homeostatic motivation with a thermostat example: a regulatory system that continuously measures the actual room temperature, compares it with a predefined set point, and activates the air conditioning system if the measured temperature deviates from that set point. In the same manner, the body maintains its internal equilibrium through a variety of voluntary and involuntary processes and behaviors. Homeostasis-based RL in social robotics is presented in [13,34,64–66,108]. These studies introduced a biologically inspired approach that depends on homeostasis. The robot's goal was to keep its well-being as high as possible while considering both internal and external circumstances. The common theme in these studies is that the robot has motivations and drives (needs), where each drive is connected with a motivation as in Equation (15).

$$\begin{aligned} \text{if } & D\_i < L\_d \text{ then } & M\_i = 0\\ \text{if } & D\_i \geq L\_d \text{ then } & M\_i = D\_i + w\_i \end{aligned} \tag{15}$$

Motivations whose drives are below the activation levels do not initiate a robot behavior; this is formulated as: if *Di* < *Ld* then *Mi* = 0, where *Di* is a drive, *Ld* the activation level, and *Mi* the related motivation. A motivation depends on two factors, the associated drive and the presence of an external stimulus; this is formulated as: if *Di* ≥ *Ld* then *Mi* = *Di* + *wi*, where *wi* is the related external stimulus. These motivations serve as action stimulation to satiate the drives. A drive can be seen as a deficit that leads the agent to take action in order to alleviate this deficit and maintain an internal equilibrium. The ideal value for a drive is zero, corresponding to the absence of need. The robot learns how to act in order to maintain its drives within an acceptable range, i.e., to maintain its well-being. The well-being of the robot was defined as:

$$Wb = Wb\_{\text{ideal}} - \sum\_{i} \alpha\_{i} D\_{i} \tag{16}$$

where *Wbideal* is the value of the well-being when all drives are satiated, and the *αi* are personality factors that weight the importance of each drive. The variation of the robot's well-being is used as the reward signal and is calculated with Equation (17):

$$\Delta Wb = Wb\_t - Wb\_{t-1} \tag{17}$$

i.e., the difference between the current well-being *Wbt* and the well-being at the previous step, *Wbt−1*.
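
A minimal sketch of this homeostatic reward, under the definitions of Equations (15)–(17), is given below. The drive names, personality factors and activation level are placeholders chosen for illustration.

```python
WB_IDEAL = 100.0
PERSONALITY = {"social": 1.0, "energy": 0.5}   # alpha_i per drive (assumed values)
ACTIVATION_LEVEL = 10.0                        # L_d (assumed value)

def motivation(drive: float, stimulus: float) -> float:
    """Equation (15): M_i = 0 if D_i < L_d, otherwise M_i = D_i + w_i."""
    return 0.0 if drive < ACTIVATION_LEVEL else drive + stimulus

def well_being(drives: dict) -> float:
    """Equation (16): Wb = Wb_ideal - sum_i alpha_i * D_i."""
    return WB_IDEAL - sum(PERSONALITY[name] * value for name, value in drives.items())

print(motivation(5.0, 2.0), motivation(15.0, 2.0))        # 0.0 and 17.0

# Equation (17): the reward is the variation of well-being between two steps.
wb_prev = well_being({"social": 30.0, "energy": 20.0})
wb_curr = well_being({"social": 10.0, "energy": 20.0})    # the robot satiated "social"
reward = wb_curr - wb_prev
print(reward)  # 20.0 -> positive reward for reducing the social drive
```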

In several works [64–66], a variation of the traditional Q-learning algorithm, which the authors referred to as Object Q-learning, was used in addition to the homeostasis-based approach. In this approach, there are actions associated with each object in the environment, and the robot considers its state in relation to every object independently. Thus, there is an assumption that executing an action in relation to a certain object does not influence the state of the robot in relation to other objects. In reality, however, an action execution may create collateral effects: an action associated with a particular object, e.g., approaching it, may affect the robot's state in relation to other objects, e.g., moving away from them. The update of the Q-values accounted for these collateral effects. The purpose of this simplification was to reduce the number of states during the learning process: the robot learned what to do with each object without considering its relation to other objects. The proposed algorithm was implemented on the social robot Maggie, which lived in a laboratory and interacted with several objects in the environment (e.g., a music player, a docking station, or humans). Castro-González et al. [65] is closely linked to the other papers discussed here, with one difference being that a discrete emotion, fear, was used as one of the motivations. Unlike the other motivation-drive pairs, no drive was associated with the fear motivation (i.e., fear is not a deficiency of any need). The fear motivation was linked to dangerous situations (that could damage the robot) and directed the robot to a secure state. As an example, the motivation 'social' was not updated if the user who occasionally hit the robot was around. For a quick summary, refer to Table 2.
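
As a hedged sketch of the per-object simplification behind Object Q-learning, a separate Q-table can be kept for each object and updated only with respect to that object's state, keeping the state space small; the collateral-effect correction used in [64–66] is omitted here for brevity.

```python
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))   # Q[object][(state, action)] -> value
alpha, gamma = 0.3, 0.8

def object_q_update(obj, state, action, reward, next_state, actions_for_obj):
    """Q-learning update restricted to the robot's state in relation to one object."""
    best_next = max(Q[obj][(next_state, a)] for a in actions_for_obj)
    Q[obj][(state, action)] += alpha * (reward + gamma * best_next - Q[obj][(state, action)])

object_q_update("music_player", "near", "play_music", 1.0, "near", ["play_music", "leave"])
print(Q["music_player"][("near", "play_music")])  # 0.3
```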

**Table 2.** Summary of Intrinsically Motivated Methods in social robotics.


### *5.3. Methods Driven by Task Performance*

Task performance denotes the effectiveness with which an agent performs a given task, and the performance metrics can vary between tasks. In these methods, the design of the reward function is based on task-driven measures, which often include some problem-specific information, especially the task performance of the robot, the task performance of the human, or both. For a quick summary, see Table 3.

### 5.3.1. Human Task Performance Driven Methods

In these human task performance driven methods, the reward function is based on the user's success in performing a task related to the interaction with the robot. The studies in this category are [47,92].

### 5.3.2. Robot Task Performance Driven Methods

In these methods, the reward design depends on the robot's task performance. Robot behaviors that satisfy the user's preferences, accurate completion of the task, finishing the task within a desired amount of time, visiting certain states, and robot actions that benefit or satisfy the user are examples of task performance measures. The studies in this category are [1,3,35,36,46,60,62,67,70,85,86].


**Table 3.** Summary of Task performance driven methods in social robotics.

### 5.3.3. Human and Robot Task Performance Driven Methods

In the previous two sections, we listed the studies using task performance of the robot and human as reward signal. There are also studies that use a combination of the human's and the robot's task performance as reward signal. As an example, in [33,72] the robot received the highest reward if the user completed the task successfully. The robot also received reward for its actions that were suitable for the current situation. Likewise, in [61], the robot was rewarded based on actions that transitioned the user into a desirable state (e.g., completing the activity). Other papers in this category are [33,71,72].

### **6. Evaluation Methodologies**

The past decade has seen a rapid growth of social robotics in diverse uncontrolled environments such as homes, schools, hospitals, shopping centers, and museums. In this review, we have seen various application domains in a range of fields including therapy [3], eldercare [62], entertainment [59], navigation [32], healthcare [44], education [58], personal robots [13], and rehabilitation [92]. Research in the field of social robotics and human-robot interaction becomes crucial as more and more robots enter our lives. This brings many challenges, as social robots are required to deal with dynamic and stochastic elements in social interaction in addition to the usual challenges in robotics. Besides these challenges, the validation of social robotics systems with users necessitates efficient evaluation methodologies. Recent studies underline the importance of evaluation and assessment methodologies in HRI [111]. However, developing a standardized evaluation procedure remains a difficult task. Furthermore, in RL-based robotic systems, there is a need to explore various human-level factors (personal preferences, attitudes, emotions, etc.) to ensure that the learned policy leads to better HRI. Additionally, how can we evaluate whether the learned policy conveys the intended social skill(s)? As an example, in the studies by Qureshi et al. [36,73,74], the model performance on a test dataset was evaluated by three volunteers who judged whether the robot's action was appropriate for the current scenario. In [87], both annotators and participants rated whether the robot was able to associate its facial expressions with the conversation context. The independent annotators' ratings were higher than the participants', which, as the authors argued, might be explained by discrepancies between the participants' actual expressed emotion and the intended emotion. In such cases, additional sensory information could be useful for validating that the adaptive robot behaviors lead to better HRI. For example, Park et al. [58] analyzed the body poses and electrodermal activity (EDA) of the participants to check their correlation with the participants' engagement. This kind of approach could be used to support subjective evaluations. A comparative evaluation methodology considering both the learned policy and the user's experience of the interaction is another option. As an example, [32,33,56,90,91] presented the policy for each participant as well as a discussion of the effectiveness of the robot behavior on the user, based on their comments and subjective evaluations.

The papers in the scope of this manuscript used different evaluation and assessment methodologies for their algorithms and for their systems with users. We identify three types of evaluation methodologies: (1) evaluation from the algorithmic point of view, (2) evaluation and assessment of user experience-related subjective measures, and (3) evaluation of both learning algorithm-related factors and user experience-related factors. Several studies only reported self-rated questionnaire results [45] or user opinions [55]. There are also studies which do not include any evaluation beyond a short discussion of the learned policy [53,57,100,101].

The cumulative reward collected over time is the most commonly used evaluation method. As learning progresses, the frequency of negative rewards is expected to decrease and that of positive rewards to increase. Thus, the cumulative reward, and comparisons of the reward across different settings and variations of algorithms, are common measures for evaluating the efficiency of learning [49,50,52,85,86]. The evolution of the learning algorithm over time (e.g., the evolution of *Q* values) is another evaluation method. Several studies presented only the learning evolution of their system without mentioning how a participant would perceive the learned robot behaviors [13,34,61,63–66,108]. Comparison of user experiences (e.g., learning gains of children) for adaptive and non-adaptive robots is another way of evaluation [68,102]. We also see evaluation using only interaction-related objective measures, such as the frequency of turn-taking and the dialogue duration with the robot [60]. Task-related evaluation measures (e.g., the number of moves needed to solve a game with an adaptive versus a random robot) together with the Q-matrix [3], or the average task success and average reward [35], are also used. In some IRL studies, the purpose is only to teach the robot; in these studies, the evaluation metrics are the training time [26,29] or training-related parameters (e.g., the amount of positive and negative feedback) [28].

Studies reporting both subjective user opinions and algorithm-related measures are [30,44,46,59,92]. Interaction-related objective measures such as interaction duration, distance to the robot, and the preferred motion speed of the robot, in combination with questionnaires, are other measures for evaluating the efficiency of the learned policy. Studies also use comparisons of different algorithms in terms of average steps, average reward, and average execution time together with questionnaires [67]; the number of times the preferences of the trainer match the agent's action [69]; the reward together with a discussion of observations from the experiments [46]; and questionnaires together with task-related parameters (e.g., time to complete the task) [47].

### **7. Discussion**

In this paper, we have presented the RL approaches used in social robotics. In virtual game environments (e.g., Atari, Go, etc.), which are commonly used testbeds for RL implementations, the goal is well defined (e.g., getting higher scores, accomplishing a game level, or winning the game). In social robotics, the goal is not that clear. Still, we argue that social robots could provide a unique testbed for RL implementations in real-world scenarios, in the sense that they can deal with transparency issues by showing their internal states through social cues (e.g., facial expressions, gaze, speech, LEDs on their body, a tablet). In Section 5, we presented RL approaches based on reward types. IRL with implicit reward is the most widely used approach in social robotics since human social cues occur naturally during the interaction. However, the change in social cues can be slow, which leads to sparse rewards. A combination of the reward approaches presented in Section 5, namely intrinsically motivated methods, IRL with implicit feedback, and task performance-driven methods, could be an approach to deal with the sparse reward problem. This way the robot could receive a reward even when there is no dramatic change in social cues or the task is not completed in one step. Similar to the homeostasis-based approaches, combining emotional models for robots' decision-making mechanisms could be helpful. The interested reader may refer to [104], which presents a thorough analysis and discussion of computational emotional models incorporated within RL. The sparse reward problem is not the only problem in real-world social HRI. We continue the discussion with proposed solutions to real-world RL problems in Section 7.1. Later on, we present possible future directions in Section 7.2.

### *7.1. Proposed Solutions to Real-World RL Problems*

RL is a powerful and versatile algorithmic tool and has been shown to perform better than humans in simulated environments [43]. However, progress on applying RL methods to real-world systems is not as advanced yet, owing to the complexity of the real world. Dulac-Arnold et al. [112] discuss nine challenges of realizing RL on real-world systems. Here, we discuss these challenges and how some papers tackled them in real-world HRI with social robots.

The first mentioned challenge is "training off-line from the fixed logs of an external behavior policy". This challenge applies to HRI since users would not tolerate the long pauses and action delays of the social robot. As an example, Qureshi et al. [36] suggested an approach where they divided training into two stages. In the first stage, the robot interacts with the environment and gathers data, whereas in the second stage the robot rests and trains.

The second challenge is "learning on the real system from limited samples". This challenge is especially valid for HRI since the interaction time with users is limited in controlled lab experiments. Moreover, users get bored and tired with longer interaction durations. As mentioned in [112], exploration must be limited. As an example, in [13,34], the exploration and exploitation phases are separated. A predefined duration is set for the exploration phase, in which the robot runs through all possible states and actions. Moreover, they also decreased the learning rate *α* throughout the exploration phase to increase the importance of previously learned information as the learning progresses. In the exploitation phase, they set *α* to 0. As mentioned in [112], expert demonstrations can improve sample efficiency by avoiding learning from scratch. For example, Moro et al. [61] combined LfD with Q-learning for a Casper robot helping older people in a tea-making scenario. Another mentioned solution is model-based RL, of which we see two examples in social robotics [49,50]. In addition, long-term interactions (several sessions [58,68,102]) are important for HRI and could be beneficial for RL in terms of collecting samples.

The third challenge is "high-dimensional continuous state and action spaces". In the context of social robotics, the problem also needs to be simplified due to the low onboard computational power of most platforms, which might be another reason for the small action sets in the reviewed papers. To overcome this challenge, we see several approaches: for example, human guidance was found to be effective [26], as were Object Q-learning [64–66] and action elimination [35].

The fourth challenge is "safety constraints that should never or at least rarely be violated". The approaches mentioned in [112] for this challenge include imposing safety constraints during training. In the current literature, social robot interactions are generally conducted in a controlled laboratory environment, and the researchers are available to intervene if any problems occur. Therefore, this challenge seems to receive little attention.

The fifth challenge is "tasks that may be partially observable, alternatively viewed as non-stationary or stochastic". We see several attempts in social robotics to deal with this challenge, such as POMDP-based approaches [50,102], and DRL approaches in which several frames are stacked together to incorporate the history of the agent's observations. Another approach to this challenge is the use of recurrent networks, which were applied in [63].

The sixth challenge is "reward functions that are unspecified, multi-objective, or risk-sensitive". Some papers use simulated environments for training and test on real-world interactions. In these papers, there are different reward functions for the simulated world and the real world. Generally, the real-world reward functions are simplified to one parameter, such as the feedback of the user or predefined numerical values, whereas the simulated-world reward functions are more complex and include several parameters.

The seventh challenge is "system operators who desire explainable policies and actions". This is particularly valid for social robotics, since ambiguous robot behaviors might affect the user's willingness to interact again. Moreover, if the human trains the robot, the intention and internal state of the robot become crucial for the success of the training. As an example, Knox et al. [29] discussed the transparency challenges and their effect on the training time. Thomaz and Breazeal [28] observed that participants had a tendency to wait for eye contact with the robot before saying the next utterance while training the robot. These kinds of social cues on the robot could be used for explaining its actions and internal states.

The eighth challenge is "inference that must happen in real-time at the control frequency of the system". The real world is slower than the simulated world, both in reaction and in data generation. To deal with this challenge, several researchers used an additional interface between the robot and the human, so that the inference is received from the interface rather than from robot control.

The ninth challenge is "large and/or unknown delays in the system actuators, sensors, or rewards". We see several approaches to deal with this challenge. For example, [52] increased the effect of human-delivered positive reward over larger time frames and decreased the effect of negative reward over a shorter time frame. Another approach estimated the reward from natural human feedback using the gamma distribution [49].
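A minimal sketch of the asymmetric time-window idea is given below: a human-delivered reward is distributed over recently executed actions, with positive feedback credited over a longer window than negative feedback. The window lengths and the linear decay of credit are assumptions for illustration, not the exact scheme of [52].

```python
def assign_human_feedback(action_times, feedback_time, feedback_value,
                          pos_window=4.0, neg_window=1.5):
    """Distribute a human-delivered reward over recently executed actions.
    Positive feedback is spread over a longer time window than negative
    feedback (window lengths are illustrative assumptions)."""
    window = pos_window if feedback_value > 0 else neg_window
    credited = {}
    for action_id, t in action_times.items():
        delay = feedback_time - t
        if 0 <= delay <= window:
            # Credit decays linearly with the delay inside the window.
            credited[action_id] = feedback_value * (1 - delay / window)
    return credited
```

For instance, `assign_human_feedback({"wave": 10.0, "greet": 11.2}, feedback_time=12.0, feedback_value=1.0)` credits both recent actions, with the more recent one receiving a larger share of the positive reward.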

### *7.2. Future Outlook*

There are still many interesting potential problems and open questions to be solved in RL for social robotics. Applications on physically embodied robots are limited due to the enormous complexity and uncertainty of real-world social interactions; the increased prevalence of RL in physical social robots will shed further light on this topic. Other open questions are how RL-based social robotics may generate reward signals from ambiguous or conflicting sources of implicit feedback, and how learned skills can be transferred to different robots. Further work could also investigate larger state-action spaces, as current studies are mostly limited to small sets.

Despite the fact that there are goal-oriented approaches for social robot learning [113,114], in the current literature the social robot that learns through RL has only one goal, such as performing a single task and optimizing a single reward function. However, in many real-world scenarios, a robot may need to perform a diverse set of tasks. For example, socially assistive robots designed to assist older people in their homes may need to accomplish several tasks, such as medication reminders, detecting issues, informing caregivers, and managing plans. Multi-goal RL enables an agent to learn multiple goals, so that it can generalize the desired behavior and transfer skills to unseen goals and tasks [115]. This has been applied to robotic manipulation tasks in a simulated environment [115]; applying the multi-goal RL framework to social robots would be a fruitful area for future work.
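The core mechanism of multi-goal RL can be sketched as a goal-conditioned value function: the value table (or network) is indexed by the goal as well as the state, so a single agent can learn several tasks. The tabular form below is only an illustration of this indexing, not a full multi-goal algorithm such as the one in [115]; the state, goal, and action names are hypothetical.

```python
from collections import defaultdict

def goal_conditioned_q_update(Q, state, goal, action, reward, next_state,
                              actions, alpha=0.1, gamma=0.95):
    """One goal-conditioned Q-learning update: the table is indexed by
    (state, goal, action), so one agent can learn several goals
    (minimal sketch; a tabular representation is assumed)."""
    best_next = max(Q[(next_state, goal, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, goal, action)] += alpha * (td_target - Q[(state, goal, action)])
    return Q

# Usage: one table serves several goals, e.g. a medication reminder and
# a caregiver notification, distinguished only by the goal index.
Q = defaultdict(float)
Q = goal_conditioned_q_update(Q, state="idle", goal="remind_medication",
                              action="speak", reward=1.0, next_state="idle",
                              actions=["speak", "wait"])
```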

Another interesting future direction is the application of multi-objective RL in social robotics. Task efficiency and user satisfaction, for instance, can be two objectives that the robot tries to maximize simultaneously by formalizing the problem as a multi-objective MDP. As an example, Hao et al. [116] present a multi-objective weighted RL approach in which the agent has two objectives: minimizing the cost of service execution and eliminating the user's negative emotions. We refer the interested reader to the survey on multi-objective decision making for a more detailed explanation of the topic [117].
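A common way to act in a multi-objective MDP is linear scalarization of per-objective value estimates, e.g., one Q-value for task efficiency and one for user satisfaction per action. The sketch below assumes such per-objective Q-vectors are already available and only shows the weighted action selection; the action names and weights are hypothetical and encode the designer's preference between objectives.

```python
import numpy as np

def scalarized_action(q_vectors, weights):
    """Select an action by linearly scalarizing per-objective Q-values
    (e.g., [task efficiency, user satisfaction]); illustrative sketch."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize the preference weights
    return max(q_vectors, key=lambda a: float(w @ q_vectors[a]))

# Example: two actions scored on two objectives.
q_vectors = {"offer_help": np.array([0.8, 0.2]), "wait": np.array([0.1, 0.9])}
print(scalarized_action(q_vectors, weights=[0.7, 0.3]))  # -> "offer_help"
```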

Recent developments in the field of deep neural networks have led to an increasing interest in DRL. Applying DRL in social robotics has also received recent attention; however, existing studies have focused on small action sets and single-task scenarios. In this regard, social robots with larger sets of actions would be a promising area for further work. Another future direction is a further investigation of the hyper-parameters of RL in social robotics. This was briefly discussed in [1]: in turn-based interactions, relatively small discount factors (i.e., 0.7 ≤ *γ* ≤ 0.95) are more common, whereas for frame-based interactions with rather long trajectories, higher discount factors (i.e., *γ* ≥ 0.99) seem more suitable. In deep networks, the selection of hyper-parameters affects the accuracy of the algorithm [118]. This also applies to DRL; Lathuilière et al. [86] presented several experiments evaluating the impact of some of the principal parameters of their deep network structure.

Thus far, model-free RL, which learns a value function or a policy through trial and error, is the most commonly used approach in social robotics. However, model-based RL, which focuses on learning a transition model of the environment that can serve as a simulation, remains to be further explored. In particular, having a user model can ease the learning process. Although it is difficult to model human reactions, having a model can play a crucial role in reducing the number of required interactions in the real world. The model-based approach can also help with the hardware depreciation that may arise in model-free robot RL because of the considerable amount of interaction time. Simulating the interaction environment can ease training without manual intervention or a need for maintenance. Nonetheless, transferring policies learned in simulation directly to the physical robot may not be trivial due to undermodeling and uncertainty about system dynamics [15]. A common limitation is that most of the works are not generalizable, i.e., the knowledge learned by one robot cannot be reused by another, nor can the task knowledge be reused for other tasks. The Google AI team trained a model-based Deep Planning Network (PlaNet) agent that achieved six different tasks (e.g., cartpole swing-up, cartpole balance, finger spin, and cheetah run) [119]. A similar approach for a physical social robot would be an interesting future direction.

RL problems are formalized as MDPs in fully observable environments. However, in HRI not all of the required observations are available, due to the underlying effect of psychological states on human behavior. It has been demonstrated that POMDPs are able to model the uncertainties and inherent interaction ambiguities in real-world HRI scenarios [120]. Hausknecht and Stone [121] proposed a method that couples a Long Short-Term Memory (LSTM) with a Deep Q-Network to handle the noisy observations characteristic of POMDPs. A similar approach would be useful in social robotics problems to better capture the dynamics of the environment. We included two examples of POMDP approaches in social robotics [50,102]; further investigation would constitute an interesting line of research.
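A minimal recurrent Q-network in the spirit of Hausknecht and Stone's DRQN is sketched below: an LSTM summarizes the observation history and a linear head outputs one Q-value per action. The layer sizes and single-layer architecture are illustrative assumptions, not the reference implementation of [121].

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Minimal recurrent Q-network: an LSTM over the observation sequence
    followed by a linear head with one Q-value per action (sizes are
    illustrative assumptions)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim); hidden_state carries memory
        # across partially observable steps.
        x = torch.relu(self.encoder(obs_seq))
        x, hidden_state = self.lstm(x, hidden_state)
        return self.q_head(x), hidden_state
```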

### **8. Conclusions**

In this work, we have given an overview of RL in social robotics, surveying the literature and presenting a thorough analysis of the RL approaches used. Social robots have two important characteristics, physical embodiment and interaction/communication capabilities; therefore, we included studies with physically embodied robots. We categorized the papers based on the RL type used and, within this categorization, discussed and grouped them based on the communication medium used for reward formulation. Considering the importance of designing the reward function, we also categorized the papers based on the nature of the reward, and grouped their evaluation methods by whether they use subjective and algorithmic metrics. We then provided a discussion in view of real-world RL challenges and proposed solutions, together with the points that remain to be explored, including approaches that have thus far received less attention. To conclude, despite tremendous leaps in computing power and advances in learning methods, we are still a long way from general-purpose, robust, and versatile social robots that can learn several skills from naive users through real-world interactions. In spite of the immediate challenges, we see steady progress of RL applications in social robotics, with increasing interest in recent years.

**Author Contributions:** N.A. was the main author responsible for conducting literature research, methodology definition, and paper writing. A.L. supervised the study, and has been involved in structuring and writing the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 721619 for the SOCRATES project.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

