*Article* **Human–Machine Differentiation in Speed and Separation Monitoring for Improved Efficiency in Human–Robot Collaboration**

**Urban B. Himmelsbach \*, Thomas M. Wendt, Nikolai Hangst, Philipp Gawron and Lukas Stiglmeier**

Work-Life Robotics Laboratory, Department of Business and Industrial Engineering, Offenburg University of Applied Sciences, 77723 Gengenbach, Germany; thomas.wendt@hs-offenburg.de (T.M.W.); nikolai.hangst@hs-offenburg.de (N.H.); philipp.gawron@hs-offenburg.de (P.G.); lukas.stiglmeier@hs-offenburg.de (L.S.) **\*** Correspondence: urban.himmelsbach@hs-offenburg.de; Tel.: +49-7803-9698-4488

**Abstract:** Human–robot collaborative applications have been receiving increasing attention in industry. Their efficiency is often quite low compared to traditional robotic applications without human interaction. Especially for applications that use speed and separation monitoring, there is potential to increase the efficiency with a cost-effective and easy-to-implement method. In this paper, we propose adding human–machine differentiation to speed and separation monitoring in human–robot collaborative applications. The formula for the protective separation distance is extended with a variable for the kind of object that approaches the robot. Different sensors for the differentiation of human and non-human objects are presented. Thermal cameras are used to take measurements in a proof of concept. Through differentiation of human and non-human objects, it is possible to decrease the protective separation distance between the robot and the object and therefore increase the overall efficiency of the collaborative application.

**Keywords:** human–robot collaboration; speed and separation monitoring; human–machine differentiation; thermal cameras; protective separation distance

### **1. Introduction**

Human–Robot Collaboration (HRC) is seeing enormous growth in research interest as well as in industrial applications. The highest priority in HRC applications is given to the safety of the human within the system; a human within a robotic system is called an operator. Different approaches to protecting the operator from harm are the subject of ongoing research, and good progress has been made. However, the efficiency of the systems has suffered from most of these safety improvements. Reduced efficiency leads to reduced acceptance of HRC. In order to increase acceptance, it is important to examine how these methods for operator safety can become more efficient.

Operator safety does not only mean preventing physical contact with the operator. It can also mean preventing psychological harm caused by dangerous and threatening movements of the manipulator. An overview of different methods of safe human–robot interaction can be found in [1]. Lasota et al. divided their work into four major categories of safe HRC: safety through control, through motion planning, through prediction, and through psychological consideration. The category of safety through control is subdivided into pre- and post-collision methods [1].

Speed and separation monitoring (SSM), which is the subject of this work, belongs to the subcategory of pre-collision methods. Other methods in this subcategory are quantitative limits and the potential field method [2].

Established methods for HRC have already been integrated into standards like the ISO/TS 15066. The Technical Specification 15066 differentiates between four modes of collaborative operation [3]: safety-rated monitored stop, hand guiding, speed and separation monitoring, and power and force limiting.



This paper focuses on Speed and Separation Monitoring (SSM). There are different sensor systems that can measure the separation distance and the speed between the robot and the operator.

There are already quite a few well-working sensor systems for speed and separation monitoring on the market. These systems can be divided into external and internal ones. Internal means that the sensors are part of the robot itself, e.g., mounted somewhere on the manipulator surface. External means that the sensor is placed, for example, on the edge of the table that the robot is mounted on or on the ceiling above the robot's workspace. Examples of external sensor systems are laser scanners [4], camera systems like the SafetyEYE [5], or pressure-sensitive floors [6]. There are only a few examples of sensor systems that are mounted on the manipulator itself. A good example is the Bosch APAS system [7]. It consists of a safety skin that measures separation distances capacitively. Its main disadvantage is that it can only detect obstacles at a distance of two to five centimeters.

All of these systems perform reasonably well at detecting obstacles within the workspace. It is difficult for them, if not impossible, to classify the obstacles as human or non-human objects. Non-human objects, like an automated guided vehicle (AGV), are therefore treated like an operator, and safety measures are applied accordingly when they enter or pass through the workspace and its surroundings. These AGVs have fixed and well-known dimensions. They can be programmed for certain behaviour and usually have a navigation system. Therefore, it should be possible to integrate an AGV with high precision into the robotic system, for example, in order to deliver and pick up workpieces. If the AGV is part of the entire system, it should not be treated as an operator. Instead, it should be possible to continue the robot's movement at high velocities and consequently increase the overall efficiency of the system.

A good overview of research on the concepts and performance of SSM can be found in [8,9]. Lucci et al. proposed in [10] to combine speed and separation monitoring with power and force limiting. This way, it is possible to continue the movement of the robot when the operator is very close to the robot. A complete halt of the robot's motion is only necessary when contact between the operator and the robot occurs. They showed that with their approach it is possible to increase the overall production efficiency. Kumar et al. researched how to calculate the number of sensors needed for a specific area as well as how SSM can be achieved from the surface of the robot [11,12]. Grushko et al. proposed the approach of giving haptic feedback to the operator through vibration on the operator's work gloves [13]. The system monitors the workspace with three RGB-D cameras. The controller calculates whether the operator's hand intersects with the planned path of the robot and gives appropriate feedback to the operator. They were able to prove in user studies that the participants could finish their task more efficiently compared to the original baseline. A trajectory planning approach was taken by Palleschi et al. in [14]. Using a visual perception system to gather position data of the operator, they also used an interaction/collision model from Haddadin et al. [15] to permanently check the safety situation according to the ISO/TS 15066 standard. If the safety evaluation showed that the robot needed to slow down, their algorithm searched for an alternative path with a lower risk of injury and velocities acceptable according to the safety limits. In an experimental validation, the group was able to prove the effectiveness of planning safe trajectories for the task of unwrapping an object. Another approach, based on dynamically scaled safety zones, was proposed by Scalera et al. in [16]. Bounding volumes around the robot links and the operator's body and extremities represent the safety zones. These safety zones vary in size according to the velocities of the robot and the operator. Information about the operator's position is gathered through a Microsoft Kinect camera. The paper proved in a collaborative sorting operation that it was possible to shorten the task completion time by 10%.

As there is no system available that can distinguish between human and non-human objects, we see demand for such a system and the need to research methods on how to integrate the differentiation of human and non-human objects in existing safety methods.

In this paper, we propose a method for how such a differentiation is possible. Different sensor principles are presented that are capable of differentiating between human and non-human objects. Thermal cameras with two different fields of view (FoV) are used for the measurements. The results show that it is possible to detect an operator at ranges of up to 4 m from the sensor.

We have shown in previous publications that it is possible to perform speed and separation monitoring directly from the robot arm and that time-of-flight (ToF) sensors are suitable for this task. The first approach used a camera mounted on the flange of the robot [17]. Further research investigated the use of single-pixel ToF sensors distributed over the links of the manipulator [18]. In [19], we presented a sensor solution in the form of an adapter plate that is mounted between the flange of the robot and the gripper. The previous work also showed that there is still potential for further growth of HRC applications in industrial settings, that efficiency will play a key role in the success of HRC, and that there is a need to improve the efficiency of these systems.

This paper proposes to differentiate obstacles in the vicinity of robotic systems into human and non-human objects. With this classification, it is possible to calculate object-specific distance and velocity limits for the robotic system. The limits for non-human objects can be lower because only a financial risk is associated with them instead of possible injuries to a human. As a result, it is possible to increase the efficiency of human–robot collaborative applications. The paper shows that a differentiation in certain applications is possible with thermal cameras that can be attached to the manipulator or the gripper. There is no need for an additional camera system surrounding the robot's workspace and no need for any equipment to be attached to the objects that shall be differentiated. The main contributions of this work can be summarized as shown in Table 1.

**Table 1.** List of main contributions of this paper.


- An algorithm for detecting an operator at distances of up to 4 m with thermal cameras directly from the manipulator.
- Experimental verification of the algorithm with two thermal cameras with different fields of view.

The paper is structured as follows. Section 1 introduces the topic and the state of the art. Section 2 gives an overview of sensor systems that can be used for human–machine differentiation in the context of speed and separation monitoring in human–robot collaboration. In Section 3, the protective separation distance is explained and the new object-specific protective separation distance is proposed. Furthermore, this section shows the potential efficiency improvement that can be achieved with this method. Section 4 explains what kind of measurements were executed and how. Section 5 shows and discusses the results of the measurements before Section 6 concludes the paper and gives an outlook on future research.

### **2. Methods for Human–Machine Differentiation**

There are active and passive methods to differentiate between human and non-human objects. Active methods are used, for example, when camera systems are employed and the AGV is marked with a sticker or QR code that identifies the object. Another active method is the AGV sending its coordinates via wireless communication to the robotic system, so that the robotic system knows exactly where the AGV is located and can therefore differentiate it from other objects in the surroundings. The same is true for humans; they could wear a kind of tracker that monitors their position and sends it to the robotic system. Depending on the overall situation on the industrial shop floor, a navigation system that keeps track of all machines, AGVs, and operators might already exist. A list of examples of active and passive methods is given in Table 2.

**Table 2.** Overview of active and passive methods for human–machine differentiation.


There are many small- and medium-sized enterprises (SMEs) that are new to automation with collaborative robots. They usually do not have any existing navigation or monitoring systems. Moreover, they require flexible solutions. Passive differentiation methods provide the most flexibility. They do not require any additional installations in the surroundings, on the operator or the AGV. This paper focuses on passive differentiation methods that will be presented in the following subsections.

These passive methods use sensors that measure properties that are characteristic of humans or machines. Human-specific properties that can be used for differentiation include [20]:

- body mass and its distribution while standing and walking;
- the electrical properties of the human body, e.g., its capacitance;
- body temperature and the associated infrared radiation.

Depending on the specific property that needs to be measured, the sensors can be placed at different locations. Three locations where it makes sense to place the sensors are proposed: the base of the robot, the robot links themselves, and the flange or gripper. For the integration of sensors into the gripper, a very flexible method is to use 3D-printed grippers. Using 3D printing technology, it is possible to arrange and lay out the sensors as needed. A good overview of this topic can be found in [21]. The following sections describe some possible sensor principles that can be used for passive human–machine differentiation.

### *2.1. Pressure-Sensitive Floor*

If the mass of the automated guided vehicle (AGV) is known, and if this mass differs from the mass of the operators working around the robotic system, then it should be possible to differentiate between human and non-human objects by the difference in their mass. An improved system might be able to detect whether the object has two feet on the ground or whether four wheels are touching the ground. AGVs might have a different and more consistent footprint on the pressure-sensitive floor, whereas a human being produces a varying pressure pattern: while walking, the human lifts one foot and only the other foot touches the ground with the full mass.

The average weight of an adult human being is assumed to be 75 kg. The total weight of clothes, including shoes, is assumed to add another 3 kg. The total mass of an operator in an industrial setting is then assumed to be 78 kg. In general, an operator should be able to carry a payload of 20 kg. Considering a minimum weight of 50 kg and a maximum weight of 100 kg per operator, this results in a range from 50 kg for a light worker without payload up to 120 kg for a heavy worker with payload. Distributed over two feet, this gives a range between 25 kg and 60 kg per foot.

AGVs are available in different sizes and weights. Assuming a standard AGV, the total mass lies between 200 kg and 1000 kg. Usually, the weight is distributed over four wheels, which means a weight per wheel of 50 kg up to 200 kg. As the ranges overlap at about 50 kg to 60 kg for both a human being on one foot and an AGV on one wheel, it is not possible to differentiate by weight alone. A good overview of the research on pressure-sensitive floors can be found in the work of Andries et al. [6]. Other state-of-the-art methods that use pressure-sensitive floors can be found in [22,23].

### *2.2. Capacitive Sensors*

Another possible way to differentiate between human and non-human objects is to measure the change in capacitance when an object approaches a capacitive sensor. There are already sensor systems available that use capacitive measurements to detect objects in the surroundings of the robot, such as [7]. The capacitance measured for an object depends on different properties, among them the object's size and geometry, its material, and its distance to the sensor.
Depending on the kind of non-human objects that are present in the application, it could be possible to differentiate between human and non-human objects. AGVs are commonly built with materials like aluminum or steel and have motors and other metallic equipment. For such objects, a capacitive sensor system should be capable of differentiating between human and non-human objects.

Lumelsky et al. were pioneers on the topic of sensitive skin and its use on robot manipulators [24,25]. Other early work, like that of Karlsson and Järrhed [26], proposed one single huge capacitor with one plate on the floor and the second plate on the ceiling above the robot's workspace. More recent work was done by Lam et al. [27], who managed to integrate the sensors into the housing of the robot manipulator, thus reaching a solution in which no part of the sensor is exposed on the outside of the manipulator where it could be damaged.

### *2.3. Thermal Cameras*

Body temperature is a property of humans that is already used in other sensor applications. The human body temperature is usually between 36 °C and 37.8 °C, and only a small window of variation is allowed. From 37.8 °C to 41 °C, the human has a physical condition called fever; above 41 °C, the fever can be life-threatening, and everything below 36 °C is too cold [28]. Every object with a temperature above absolute zero emits radiation in the infrared spectrum, which can be detected with bolometers or thermopiles. In a first measurement, images were taken with a FLIR camera. Note that the human body temperature is only visible on parts of the human that are not covered with clothes or other means of protection like helmets, masks, or safety goggles. For the covered parts of the body, the temperature is attenuated, as can be seen in Figure 1. Even though the AGV is turned on in the picture, there is no significant heat radiation coming from the AGV next to the human.

### *2.4. Conclusions*

There are different sensors that allow a differentiation between human and non-human objects. The feasibility of such a differentiation in an industrial setting depends greatly on the conditions in the hall in which the system is used. The decision for a specific sensor needs to be made for each individual case. In our work, we continue to focus on differentiation with thermal cameras.


**Figure 1.** Comparison of visual and thermal image of a human next to an AGV. (**a**) Visual image; (**b**) Thermal image.

### **3. Potential Efficiency Improvement**

The Cambridge Dictionary defines efficiency as follows: "the good use of time and energy in a way that does not waste any" [29]. For a standard, non-collaborative robotic application, a common way to measure efficiency is to measure how long the robotic system needs to fulfill a sequence of tasks. Finding ways to shorten this time increases the efficiency of the system.

When it comes to human–robot collaborative applications, it gets a bit more complicated. Interactions happen not only with other well-defined objects, but also with a human, and no human is like another. A human in industrial applications is called an operator. This operator might be talking to other operators, might take a break, switch with another operator, or simply have to blow their nose.

All of these interruptions are not foreseeable for the robotic system and involve leaving and re-entering the robot's workspace. The more often this happens, the less efficient the overall robotic system becomes. An efficient speed and separation monitoring system is essential for these occasions and influences the overall system efficiency.

On industrial shop floors, there is usually no hard border for a transition from the walkway or driveway into the operator's or robot's workspace as shown in Figure 2. The monitored space can often reach into the walkway and driveways.

**Figure 2.** Different workspaces around a robotic application. Note how the monitored space ranges into the walkway and driveway area.

In human–robot collaborative applications, there is a special focus on the operator. The safety of the operator has priority over the speed and movement of the robot. This is why the robot has to slow down or come to a complete stop when an operator enters the monitored space. There are different sensor systems that can measure the operator's location and speed. So far, these systems do not differentiate between an operator and another machine like an AGV. The AGV is handled like an operator, and the robot has to slow down or stop when the AGV gets closer than the protective separation distance.

This is where our work proposes to differentiate between an operator and other machines. This differentiation shall then be taken into account when calculating the protective separation distance. With smaller protective separation distances for non-human objects, we increase the time that the robot can work with higher velocities and thus increase the overall efficiency of the system.

### *3.1. Protective Separation Distance*

The point in time at which the operator enters the workspace can vary, as can the speed of the operator while entering the workspace. Depending on the tasks, there are different types and amounts of interaction with other objects. Other objects in this context can be other robots, automated guided vehicles, or human beings. These objects can either provide workpieces or tools, or actively support the robot's task.

No matter what kind of interaction happens, the object needs to enter and exit the robot's workspace at a certain point in time. When it enters the workspace, the robot has to slow down in order to prevent harm to the object. The moment when to slow down or stop depends on the speed of the robot, the robot's reaction and stopping times, as well as on the speed of the operator.

The distance at which this must happen is defined as the protective separation distance. The ISO/TS 15066 provides equations to calculate the protective separation distance. This distance depends to a large extent on the operator's location and speed. Different values are needed to calculate the protective separation distance. The protective separation distance is calculated as shown in Equation (1) [3]:

$$S\_{\mathbf{p}}(t\_0) = S\_{\mathbf{h}} + S\_{\mathbf{r}} + S\_{\mathbf{s}} + C + Z\_{\mathbf{d}} + Z\_{\mathbf{r}}.\tag{1}$$

The different values are defined in the ISO/TS 15066 as follows [3]:

- *S*_p(*t*_0): the protective separation distance at time *t*_0;
- *S*_h: the contribution to the protective separation distance attributable to the operator's change in location;
- *S*_r: the contribution attributable to the robot system's reaction time;
- *S*_s: the contribution due to the robot system's stopping distance;
- *C*: the intrusion distance, i.e., the distance a part of the body can intrude into the sensing field before it is detected;
- *Z*_d: the position uncertainty of the operator;
- *Z*_r: the position uncertainty of the robot system.

The protective separation distance can be a fixed number if worst case values are used to calculate it. Especially the contribution by the human operator plays an important role in the equation.

The ISO/TS 15066 allows the protective separation distance to be calculated dynamically according to the robot's and operator's speeds [3]. The operator's contribution to the overall protective separation distance can be calculated as shown in Equation (2):

$$S\_{\mathbf{h}} = \int\_{t\_0}^{t\_0 + T\_{\mathbf{r}} + T\_{\mathbf{s}}} \upsilon\_{\mathbf{h}}(t) \, dt. \tag{2}$$

A constant value for *S*_h can be calculated with Equation (3):

$$S\_{\rm h} = 1.6\,\mathrm{m/s} \cdot (T\_{\rm r} + T\_{\rm s}).\tag{3}$$

Equation (4) shows how to calculate the distance that the robot moves during the reaction time of the controller of the robot:

$$S\_{\mathbf{r}} = \int\_{t\_0}^{t\_0 + T\_{\mathbf{r}}} v\_{\mathbf{r}}(t) \, dt. \tag{4}$$

A constant value for *S*_r can be calculated with Equation (5):

$$S\_{\mathbf{r}} = v\_{\mathbf{r}}(t\_0) \cdot T\_{\mathbf{r}}.\tag{5}$$

The contribution of the stopping time can be calculated with Equation (6):

$$S\_{\sf s} = \int\_{t\_0 + T\_{\sf r}}^{t\_0 + T\_{\sf r} + T\_{\sf s}} v\_{\sf s}(t) \, dt. \tag{6}$$
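As an illustration of how the constant-value forms fit together, the following Matlab sketch evaluates Equations (3), (5) and (1) for assumed timing values; the reaction time, stopping time and stopping distance used here are placeholders, not values from a specific robot data sheet.

```matlab
% Worst-case protective separation distance from the constant-value forms.
% All timing values below are assumed placeholders, not specification data.
v_h = 1.6;     % m/s, directed speed of the operator (ISO/TS 15066 assumption)
v_r = 2.5;     % m/s, speed of the robot at time t0
T_r = 0.10;    % s,   assumed reaction time of the robot system
T_s = 0.30;    % s,   assumed stopping time of the robot
S_s = 0.072;   % m,   assumed stopping distance of the robot
C = 0; Z_d = 0; Z_r = 0;                 % intrusion distance and uncertainties neglected

S_h = v_h * (T_r + T_s);                 % Equation (3): operator contribution
S_r = v_r * T_r;                         % Equation (5): robot motion during reaction time
S_p = S_h + S_r + S_s + C + Z_d + Z_r;   % Equation (1)
fprintf('Protective separation distance S_p = %.3f m\n', S_p);
```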

### *3.2. Object-Specific Protective Separation Distance*

Our proposal in this paper is to introduce an additional variable into the formula for the protective separation distance that represents the kind of object. There are two different approaches to handling this additional variable.

One is to treat the variable as a binary digit: the value is either 0 or 1. If the object is a human, the contribution of the operator's change in location to the protective separation distance needs to be fully accounted for and the value is set to 1. If the object is a non-human object, the variable is set to 0 and the contribution of the object to the protective separation distance is neglected.

The second approach is to treat the value as the probability that the object is a human, with 0 being a non-human object and 1 being a human. Equation (7) shows the formula for the extended protective separation distance:

$$S\_{\mathbf{p}}(t\_0) = (S\_{\mathbf{h}} \cdot T) + S\_{\mathbf{r}} + S\_{\mathbf{s}} + C + Z\_{\mathbf{d}} + Z\_{\mathbf{r}}.\tag{7}$$
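A minimal Matlab sketch of the two proposed ways to use the object factor *T* in Equation (7); the contributions *S*_h, *S*_r and *S*_s are example values carried over from the sketch above, not measured data.

```matlab
% Object-specific protective separation distance, Equation (7).
% T = 1: object treated as human, T = 0: non-human object,
% 0 < T < 1: probability that the object is a human.
Sp = @(T, S_h, S_r, S_s, C, Z_d, Z_r) S_h .* T + S_r + S_s + C + Z_d + Z_r;

S_h = 0.64; S_r = 0.25; S_s = 0.072;                 % example contributions in m
fprintf('Human     (T = 1.0): S_p = %.3f m\n', Sp(1.0, S_h, S_r, S_s, 0, 0, 0));
fprintf('Machine   (T = 0.0): S_p = %.3f m\n', Sp(0.0, S_h, S_r, S_s, 0, 0, 0));
fprintf('Uncertain (T = 0.5): S_p = %.3f m\n', Sp(0.5, S_h, S_r, S_s, 0, 0, 0));
```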

In order to get a rough estimate of typical values, we calculate an example of the protective separation distance with a robot velocity of *v*_r = 2.5 m/s and an operator velocity of *v*_h = 1.6 m/s.

The specification sheet for the KUKA LBR iiwa 7 R800 specifies a stopping angle of 5.193° for a Category 0 stop on axis 1 at 100% radius and 100% program override. With a specified radius of 800 mm for the KUKA robot, the distance traveled during stopping is 72.47 mm according to Equation (8):

$$S\_{\mathrm{s}} = 2 \cdot \pi \cdot 800 \text{ mm} \cdot \frac{5.193^{\circ}}{360^{\circ}} = 72.47 \text{ mm}.\tag{8}$$

Neglecting the values for the position uncertainties of the robot and the operator, and neglecting the intrusion distance, we can plot the protective separation distance for robot speeds of 0 to 2.5 m/s with operator speeds of 0.25 m/s, which is the maximum allowed speed close to the robot, 1.6 m/s as an average operator speed, and 2.5 m/s as the maximum speed. Figure 3 shows the calculated protective separation distances. The protective separation distance depends linearly on the robot and the human velocity. If the robot moves at its full speed of 2.5 m/s and the operator approaches the system with a speed of 1.6 m/s, the protective separation distance is 2.922 m.

**Figure 3.** Protective Separation Distances for robot speeds between 0 m/s and 2.5 m/s and operator speeds of 0.25 m/s, 1.6 m/s, and 2.5 m/s.

Figure 3 shows the dependency of the protective separation distance on the robot's and the operator's speed. If it is possible to differentiate between an operator and an AGV, there is no need to account for the approach distance of the operator, and *S*_h can be neglected. This reduces the protective separation distance for a robot speed of *v*_r = 2.5 m/s from 3.5 m down to 1.5 m. This opens a range of 2 m where the AGV can drive by the robotic system without interfering with the robot's speed.

### **4. Measurements**

### *4.1. Monitored Space*

A difficult question is always what needs to be monitored by the sensor system. Typically, a robot's workspace is divided into two main sections, as shown in Figure 2, namely the operating space and the collaborative workspace. The collaborative workspace is the part where the operator can work collaboratively with the robot. The operating space is the part where no human being is allowed and where the robot can work faster than in the collaborative workspace.

Considering a robot that is capable of moving 360° around its base, the collaborative workspace can be as small as a few degrees or as big as the full 360° around the base. Thus, the size of the collaborative workspace is calculated as follows:

$$\text{Size of collaborative workspace} = 360^\circ - \text{size of operating space}.\tag{9}$$

The operating space is protected by design against any access of the operator. The collaborative workspace needs to be monitored with a sensor system that is capable of measuring the separation distance to an intruding obstacle like the operator or an AGV.

A sensor for monitoring the collaborative workspace has a defined field of view (FoV). The number of sensors needed to monitor the entire collaborative workspace is then calculated as shown in Equation (10):

$$\text{Number of sensors needed} = \frac{\text{Size of collaborative workspace}}{\text{FoV}}.\tag{10}$$
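For example, assuming a collaborative workspace that spans 270° around the base (an assumed value, chosen only for illustration), Equation (10) gives the following sensor counts for the two FoVs used later in this paper:

```matlab
% Number of sensors needed to cover an assumed collaborative workspace.
workspaceDeg = 270;                            % assumed size of the collaborative workspace
fovDeg       = [33 90];                        % FoV of the Evo Thermal 33 and 90
nSensors     = ceil(workspaceDeg ./ fovDeg);   % Equation (10), rounded up
fprintf('33 deg FoV: %d sensors, 90 deg FoV: %d sensors\n', nSensors);
```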

The collaborative workspace ends with the maximum reach of the manipulator. In order to calculate the protective separation distance we need to be able to detect obstacles before they enter the collaborative workspace.

Therefore, monitoring is necessary for the collaborative workspace and an additional extended monitoring space. This extended monitoring space usually includes walkways for other workers and AGVs. The required size of the extended monitoring space must be at least the maximum possible protective separation distance as calculated in Section 3. The sensor for differentiating between human and non-human objects must have the same range.

As seen in Figure 3, the maximum possible protective separation distance for a robot speed of *v*_r = 2.5 m/s and an operator speed of *v*_h = 2.5 m/s is *S*_p,2.5 = 3.822 m. We round up and set the maximum separation distance to *S*_p,max = 4 m.

The goal is to be able to detect the temperature of a human being in an industrial environment at a distance of *S*_p,max = 4 m.

As described in Section 2, the human body temperature can usually only be measured somewhere in the head area of the operator because clothing covers the skin on the rest of the body. Let us assume an average human head size of 20 cm. We want to have a minimum pixel size of 10 cm at a distance of *S*_p,max = 4 m. The size of the viewing window at different distances from the sensor is calculated as shown in Equation (11):

$$x = 2 \tan\left(\frac{\alpha}{2}\right) d. \tag{11}$$

With *d* being the distance from the sensor to the object, *α* the field of view of the sensor, and *x* the size of the viewing window in a distance *d* from the sensor as shown in Figure 4.

The commercially available TeraRanger Evo Thermal 33 and Evo Thermal 90 are used to make measurements. The properties of the sensors are listed in Table 3. The sensor is connected via USB to a laptop running Windows 10. Matlab is used to read the data from the sensor via a serial connection with parameters set to: Baud Rate of 115,200, 8 Data Bits, 1 Stop Bit, Parity None, and no flow control. Matlab was chosen due to its great ability to work with matrices as the data read from the sensor with its resolution of 32 × 32 pixels is best represented in a 32 × 32 matrix. Furthermore, Matlab provides a well-established set of functions for postprocessing the data. With the KUKA Sunrise Toolbox it is possible to control the KUKA LBR iiwa 7 R800 robot directly with Matlab via an Ethernet connection [30]. This allows the control of the entire measurement setup with only one laptop running Matlab.
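The following Matlab fragment sketches the serial setup described above. The port name is a placeholder, and the assumption that one frame can be read as 1024 unsigned 16-bit pixel values is ours; the actual TeraRanger frame format (header, scaling, CRC) is not reproduced here.

```matlab
% Open the serial connection with the parameters stated in the text.
% "COM4" is a placeholder port name; adjust it to the actual device.
s = serialport("COM4", 115200, "DataBits", 8, "StopBits", 1, ...
    "Parity", "none", "FlowControl", "none");

% Assumption: one frame contains 32 x 32 = 1024 pixel values readable as
% 16-bit integers; the real frame layout of the sensor may differ.
raw   = read(s, 1024, "uint16");
frame = reshape(double(raw), 32, 32);   % represent the frame as a 32 x 32 matrix
```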

**Figure 4.** Schematic for the field of view of the sensor attached to the flange of the robot.

**Table 3.** Teraranger Evo Thermal Specifications [31].


The two sensors from Terabee have fields of view of 90° and 33°. The sensors are shown in Figure 5. The resolution is 32 × 32 pixels. The size of the area covered by a single pixel at a distance *d* is calculated by dividing Equation (11) by 32 pixels, as shown in Equation (12):

$$x\_{\text{pixel}} = \frac{2\tan\left(\frac{\alpha}{2}\right)d}{32}.\tag{12}$$
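The following Matlab sketch evaluates Equations (11) and (12) for both sensors at distances of 1 m to 5 m; it reproduces the pixel sizes discussed below, e.g., roughly 9 cm at 5 m for the 33° FoV and 12.5 cm at 2 m for the 90° FoV.

```matlab
% Viewing-window size (Equation (11)) and per-pixel size (Equation (12))
% for the 33 deg and 90 deg FoV sensors at distances of 1 m to 5 m.
d     = 1:5;                         % distance from the sensor in m
alpha = deg2rad([33; 90]);           % fields of view of both sensors in rad

x      = 2 * tan(alpha / 2) * d;     % 2 x 5 matrix: viewing-window size in m
xPixel = x / 32;                     % per-pixel size for the 32 x 32 resolution

disp('Pixel size in cm (rows: 33 deg and 90 deg FoV; columns: 1 m to 5 m):');
disp(round(100 * xPixel, 1));
```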

**Figure 5.** Teraranger Thermal 33 and 90.

The pixel sizes for both sensors for distances from 1 m up to 5 m are shown in Figure 6.

**Figure 6.** Size of pixel in different distances from the sensor.

The average size of a human head is assumed to be 20 cm. The pixel size of the 33° FoV sensor at a distance of 5 m is ~10 cm according to Equation (12). For the 90° FoV sensor, the pixel size would already be 30 cm at a distance of 5 m, which would not lead to good results. A pixel size of 10 cm is reached by the 90° FoV sensor at a distance of about 2 m.

A first measurement was performed to see whether it is possible to measure the human temperature at distances of 1 m to 4 m in 1 m steps. With Matlab, the average temperature of 10 subsequent measurements was calculated and plotted as a thermal image. The room temperature during the measurement was 22.2 °C and the humidity was 56%.

In order to find out whether it is possible to detect an operator within 4 m of the robot, the following measurement was performed. The sensor was placed at a height of 120 cm and connected via USB to a laptop running Windows 10 and Matlab. Matlab opens a serial connection to the sensor, and the Matlab script reads the temperature values from the sensor 100 times. In the first measurement, there is no operator or other human being in the field of view of the sensor. In the next eight measurements, an operator with a height of 183 cm stands at distances of 0.5 m to 4 m in 0.5 m steps. At each distance, 100 measurements are taken. Matlab then calculates the mean value for each measurement series as well as the standard deviation. This measurement shows whether it is possible to see the difference between human beings and the surroundings.
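A short Matlab sketch of the evaluation step described above; readThermalFrame() stands for the frame acquisition shown earlier and is a placeholder.

```matlab
% Mean and standard deviation of the per-frame maximum temperature over 100 frames.
% readThermalFrame(s) is a placeholder that returns one 32 x 32 frame in deg C.
nFrames = 100;
maxTemp = zeros(nFrames, 1);
for k = 1:nFrames
    frame      = readThermalFrame(s);   % one thermal frame from the sensor
    maxTemp(k) = max(frame(:));         % hottest pixel of this frame
end
fprintf('Mean of maxima: %.2f degC, std: %.2f degC\n', mean(maxTemp), std(maxTemp));
```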

### *4.2. Differentiation Algorithm*

In order to save computing time, the first approach is to measure the temperature and compare it to a threshold as shown in the flow chart in Figure 7.

First, the thermal data from the camera are read via a serial connection. Second, the Matlab function *max*() is used to find the maximum measured value. Third, the measured maximum temperature is compared with a threshold. If the maximum measured temperature exceeds the threshold, the variable T is set to 1, meaning that the object is treated like a human. If the measured temperature stays below the threshold, the variable T is set to 0, meaning that there is no human in the field of view of the sensor and that the object must be a machine. Fourth, the extended protective separation distance as introduced in Section 3 is calculated. In the last step, the robot's speed is adjusted according to the calculated extended protective separation distance.

The temperature threshold needs to be set depending on the application. The best results will be achieved in settings where the temperature of the surrounding equipment is significantly lower than the temperature of a human being. With typical room temperatures of less than 23 °C, a threshold of 24 °C is chosen for the measurements.
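A minimal Matlab sketch of the decision flow in Figure 7, combining the 24 °C threshold with the object-specific distance from Equation (7). readThermalFrame() and setRobotOverride() are placeholders for the sensor read and for the robot speed interface (e.g., via the KUKA Sunrise Toolbox), and the distance contributions are the example values used earlier.

```matlab
% Human-machine differentiation by temperature threshold (Figure 7).
threshold = 24;                        % deg C, for room temperatures below 23 deg C
S_h = 0.64; S_r = 0.25; S_s = 0.072;   % example contributions in m (Section 3)
C = 0; Z_d = 0; Z_r = 0;               % intrusion distance and uncertainties neglected

frame   = readThermalFrame(s);         % placeholder: one 32 x 32 thermal frame in deg C
maxTemp = max(frame(:));               % maximum measured temperature

T = double(maxTemp > threshold);       % 1 = treat object as human, 0 = machine

S_p = S_h * T + S_r + S_s + C + Z_d + Z_r;   % Equation (7)

currentSeparation = 1.2;               % m, placeholder for the measured distance
if currentSeparation < S_p
    setRobotOverride(0.1);             % reduce the robot speed, object too close
else
    setRobotOverride(1.0);             % continue at the full programmed speed
end
```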

**Figure 7.** Flow chart of temperature decision for speed adaption.

### **5. Results and Discussion**

Figure 8 shows the eight results for the thermal measurements of both sensors. Figure 8a,c,e,g show the results with the TeraRanger Evo Thermal 33. As seen in Figure 8a, the human temperature is measured quite well, with a mean temperature over 10 measurements of 34.92 °C. In Figure 8a, one can also see that the human is wearing glasses. Glasses have poor transmission of long-wave infrared radiation, and therefore a lower temperature is measured on the glasses. This could be a possible solution for AGVs that show a certain heat radiation from their motors or electronics: those parts could be covered by glass or another material that does not transmit heat radiation. In Figure 8c,e,g, one can see that the forearms of the human being were not covered and were therefore also measured at temperatures in the range of 30 °C.

Figure 8b,d,f,h show the four results for the thermal measurement of a human being at distances of 1 m to 4 m in 1 m steps with the Terabee Evo Thermal 90 sensor. Figure 8b shows that the bigger FoV of 90° makes it possible to measure almost the complete standing operator at a short distance of only 1 m, compared to only half the operator in Figure 8a. As calculated in Section 4, Figure 8f,h show that the operator, and especially the head, are so far away that one pixel measures more than just the temperature of the head. This leads to a significantly reduced average temperature, which makes it harder to differentiate the operator from the surroundings.

Figure 8 shows the main advantages and drawbacks of the two sensors. For the Evo Thermal 33 sensor, the main drawback is the small field of view. Depending on the application, multiple sensors might be needed to cover the entire area that needs to be monitored. The advantage is that the measured temperature is close to the actual temperature over the entire distance range from 1 m up to 4 m. This is the drawback of the Evo Thermal 90 sensor, which only measures temperatures over 30 °C for distances up to 2 m. For distances above 2 m, the single pixels of the sensor cover areas of 12.5 cm by 12.5 cm and more, resulting in lower temperature measurements if a body part only covers a part of the pixel. Depending on the room temperature, it becomes more and more difficult to detect a human being at distances of more than 2 m with the Evo Thermal 90 sensor. The advantage of the Evo Thermal 90 sensor is the field of view, which allows covering an area about three times bigger than that of the Evo Thermal 33.

**Figure 8.** Thermal images of human in distances of 1 m, 2 m, 3 m, and 4 m of the two sensors TeraRanger Evo Thermal 33 and 90. (**a**) Human in 1 m distance of Evo Thermal 33; (**b**) Human in 1 m distance of Evo Thermal 90; (**c**) Human in 2 m distance of Evo Thermal 33; (**d**) Human in 2 m distance of Evo Thermal 90; (**e**) Human in 3 m distance of Evo Thermal 33; (**f**) Human in 3 m distance of Evo Thermal 90; (**g**) Human in 4 m distance of Evo Thermal 33; (**h**) Human in 4 m distance of Evo Thermal 90.

Figure 9 shows the results of the measurement in which the highest temperature was recorded while an operator stood at distances of 0.5 m to 4 m in 0.5 m steps from the sensor. The measurement was executed once with the Evo Thermal 33 and once with the Evo Thermal 90. For each distance of the operator, 100 measurements were taken. The mean value was calculated and plotted in Figure 9 with error bars for the standard deviation. The record for a distance of 0 m represents the measurement without an operator in the field of view of the sensors.

**Figure 9.** 100 measurements with a human being at distances from 0.5 m to 4 m for both sensors, the TeraRanger Evo Thermal 33 and 90. (**a**) Evo Thermal 33: 100 measurements with a human at distances from 0.5 m to 4 m; (**b**) Evo Thermal 90: 100 measurements with a human at distances from 0.5 m to 4 m.

Figure 9a shows that for the Evo Thermal 33 sensor, there is a difference of more than 5 °C between the temperatures measured at all distances and the temperatures measured without an operator present.

The lower mean values for distances of 0.5 m and 1 m with the Evo Thermal 33, as shown in Figure 9a, can be explained by the narrow field of view of the sensor. Because the sensor is placed at a height of 120 cm and the FoV is 33°, the sensor cannot measure the temperature of the head of a 183 cm operator at these distances. Because the operator is wearing a long-sleeved shirt, the mean values are a bit lower, as the sensor does not see any exposed skin that would radiate more heat. Starting at a distance of 1.5 m, the head of the operator, with a lot of exposed skin, lies within the field of view of the sensor and is therefore detected with a higher mean temperature than in the measurements at 0.5 m and 1.0 m.

Figure 9b shows that the measurement for the scene without an operator yields a temperature range similar to the temperatures measured at distances of 2.5 m and more. Therefore, it will not be possible to differentiate between human and non-human objects with the Evo Thermal 90 sensor at distances above 2 m. This confirms the result of Figure 8 and is one of the main drawbacks of the Evo Thermal 90 sensor.

Regarding the proposed algorithm, these results show that for normal room temperatures below 24 °C, it is possible to differentiate between human beings and other machines like AGVs. One drawback is that if an AGV exposes a heat source, like a motor or an electric device that radiates heat to the same extent as a human being, the AGV could mistakenly be treated like an operator. This might lead to reduced efficiency, but it would not be a safety issue for the operator. A possible solution would be to cover the heat source with a material that does not transmit infrared radiation. The main advantage of this algorithm is its simple structure and therefore its short computing time.

An interesting question arises when looking at the corona pandemic, where one main indicator of human health is body temperature. Pictures on TV showed that people had their temperature measured on their forehead, a region that is also part of the measurement in our setup. Considering the entire possible temperature range of a human being between 36 °C and 41 °C, this should not affect the system performance. For setups where the human is the warmest object, it is no problem at all. The threshold will be set depending on the room temperature and the given temperatures of the surroundings; everything above that temperature will be treated as a human being. The differentiation becomes more important in setups where the system should be able to differentiate a human from objects that are warmer than the human. If the object's mean temperature is close to 41 °C, then it will be difficult to make a correct differentiation. The differentiation will be easier when the object's temperature is substantially higher than the human's core temperature.

### **6. Conclusions and Outlook**

This paper introduced an object-specific protective separation distance for speed and separation monitoring in human–robot collaborative applications. The use case is that in small- and medium-sized enterprises, shop floor space is limited. The space that needs to be monitored for speed and separation monitoring in HRC applications overlaps with the walkways and driveways for other operators and AGVs. AGVs that pass through this monitored space slow down the robotic application because they are treated like an operator. Differentiating between operators and AGVs allows the protective separation distance to be adjusted and therefore lets the robot move at higher speed.

The main feature that differentiates an operator from an AGV is the temperature. Using a thermal camera, it is possible to differentiate between a human and an AGV at distances of up to four meters, depending on the resolution and on the field of view of the sensor. The measurements showed that the smaller-FoV sensor has an advantage in measuring the temperature of objects at distances of 2 m and more. The 90° FoV sensor has the advantage of being able to measure the entire height of an operator at distances as close as 1 m. A combination of both sensors will be the subject of further research. A disadvantage of this method is that if the AGV exposes a heat source, like a motor or another electric device, it can mistakenly be treated as an operator. In these cases, the heat sources on the AGV must be covered.

The paper showed that there is potential to decrease the protective separation distance by more than 50% and therefore to increase the efficiency of the overall collaborative robotic system. The object-specific protective separation distance differentiates between human and non-human objects in the vicinity of the robot's workspace through the use of thermal cameras.

The proposed differentiation between human and non-human objects might not only be beneficial for Speed and Separation Monitoring, but also for power- and force-limiting operations. The power- and force-limiting operation is based on maximum values for quasi-static and transient contacts [3]. The values are determined in a risk assessment for the specific application.

Similar to the situation in speed and separation monitoring, there is no need to treat non-human objects like a human. For non-human objects, the maximum values for quasi-static and transient contacts can be higher. How much higher these values can be set depends on the materials that the non-human objects are made of. With a sensor system that can differentiate between human and non-human objects, it is possible to adjust the maximum values for the power and force limiting operation. The robot will be able to move at higher speed when a non-human object is close by, and therefore the overall efficiency will be increased. This topic will be the subject of further investigation.

Furthermore, future research will investigate the different sensor systems presented here and how suitable they are for human–machine differentiation. Fusing the data of different sensors might lead to even better results. A first step will be to combine an infrared ToF sensor with the thermal camera in order to obtain a single sensor system. Another important task is to look at how the different sensor systems can be compared and how the overall system efficiency can be described in a way that suits a broader spectrum of applications.

**Author Contributions:** Conceptualization, U.B.H.; Methodology, U.B.H. and T.M.W.; Software, U.B.H.; Validation, N.H., P.G. and L.S.; Formal Analysis, U.B.H.; Investigation, U.B.H., N.H. and L.S.; Resources, U.B.H. and P.G.; Data Curation, U.B.H. and L.S.; Writing—Original Draft Preparation, U.B.H.; Writing—Review and Editing, N.H. and P.G.; Visualization, U.B.H. and L.S.; Supervision, T.M.W.; Project Administration, T.M.W. and U.B.H.; Funding Acquisition, U.B.H. and T.M.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Federal Ministry for Economic Affairs and Energy in the Central Innovation Program for small and medium-sized enterprises (SMEs). The article processing charge was funded by the Baden-Württemberg Ministry of Science, Research and Culture and the Offenburg University of Applied Sciences in the funding program Open Access Publishing.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Measurement data is available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


## *Article* **Uncertainty-Aware Knowledge Distillation for Collision Identification of Collaborative Robots**

**Wookyong Kwon 1, Yongsik Jin 1 and Sang Jun Lee 2,\***


**Abstract:** Human-robot interaction has received a lot of attention as collaborative robots have become widely utilized in many industrial fields. Among the techniques for human-robot interaction, collision identification is an indispensable element in collaborative robots to prevent fatal accidents. This paper proposes a deep learning method for identifying external collisions in 6-DoF articulated robots. The proposed method expands the idea of CollisionNet, which was previously proposed for collision detection, to identify the locations of external forces. The key contribution of this paper is uncertainty-aware knowledge distillation for improving the accuracy of a deep neural network. Sample-level uncertainties are estimated from a teacher network, and larger penalties are imposed for uncertain samples during the training of a student network. Experiments demonstrate that the proposed method is effective for improving the performance of collision identification.

**Keywords:** collision identification; collaborative robot; deep learning; uncertainty estimation; knowledge distillation

### **1. Introduction**

With the increasing demand for collaborative tasks between humans and robots, research on human–robot interaction has received great attention from researchers and engineers in the field of robotics [1]. Robots that can collaborate with humans are called collaborative robots (cobots), and cobots differ from conventional industrial robots in that they do not require a fence to prevent access. Previously, the application of robots was limited to performing simple and repetitive tasks in well-structured and standardized environments such as factories and warehouses. However, the development of sensing and control technologies has significantly expanded the area of application of cobots [2], and they are beginning to be applied to several tasks around us. More specifically, their applications have diversified from traditional automated manufacturing and logistics industries to more general tasks in the medical [3], service [4,5], and food and beverage industries [6], and these tasks require more elaborate sensing and complicated control techniques. Furthermore, with the development of intelligent algorithms, including intention estimation [7] and gesture recognition [8], cobots can be utilized in wider application areas.

In general, robots have advantages over humans in repetitive tasks, and humans are better at making comprehensive decisions and judgments. Therefore, human–robot collaboration possibly increases the efficiency of intelligent systems through complementary synergies. As the scope of robotics applications gradually expands through collaborative work, interaction with humans or unstructured environments has become an important technical issue, which requires the implementation of advanced perception and control algorithms. Especially, collision detection and identification techniques are indispensable elements to improve the safety and reliability of collaborative robots [9,10].


To perform cooperative tasks with the aid of human–robot interaction, several studies have been carried out to detect and identify robot collisions for the safety of workers [11]. Previous work can be categorized into two approaches: the first studies the control of collaborative robots by predicting possible collisions, and the second studies the responses after impacts. While collision avoidance is more advantageous in terms of safety [12], this approach inevitably requires additional camera sensors for action recognition of coworkers or 3D reconstruction of the surrounding environment [13]. Furthermore, it is difficult to completely avoid abrupt and unpredictable collisions. Therefore, techniques for collision identification are essential to improve the safety and reliability of collaborative robots.

Collision detection algorithms investigate external forces [14] or currents [15] to determine whether a true collision has occurred on an articulated robot. A key element in the procedure of collision detection is the estimation of external torques. A major approach to estimating external torques is to utilize torque sensor signals and compute internal joint torques based on the physical dynamics of the robot, while several other methods construct momentum observers to estimate external torques without the use of torque sensors. The method that does not use torque sensors is called sensorless external force estimation, and an elaborate modeling of the observer and filter is essential for the precise estimation of external forces [16–19]. External forces are further processed by a thresholding method [20] or a classification algorithm [21] to determine whether a collision has occurred. Recently, deep-learning-based methods have outperformed traditional model-based methods in detecting collisions [22]. Beyond collision detection, the identification of collision locations is beneficial for the construction of more reliable collaborative robots by enabling them to react appropriately in collision situations.

To ensure the proper responses of collaborative robots in cases of collisions, it is necessary to identify collision locations. The collision identification technique can be defined as a multiclass classification of time series sensor data according to collision locations. In early studies, collision identification was mainly based on the elaborate modeling of filters [23] and observers [24], and a frequency domain analysis was conducted to improve the accuracy of collision identification [25]. To address the classification problem, machine learning techniques, which were employed to analyze time series data, have also been applied to collision identification [26]. Recently, support vector machines [27] and probabilistic methods [28] were applied to improve the reliability of collision identification systems. In [29], the collision identification performance was improved by utilizing additional sensors, such as inertial measurement units, and analyzing their vibration features.

In this paper, we propose a method that can identify collisions on articulated robots by applying deep neural networks to joint sensor signals. Collision identification refers to a technique that not only detects the occurrence of a collision, but also determines its location. Recently, a collision detection method was proposed by Heo et al. [22]; we extend this existing method to collision identification and improve the robustness of the deep neural network. To improve the performance of the collision identification system, we construct a deeper network, which is called a teacher network, to distill its probabilistic knowledge into a student network. In the process of distilling knowledge, we employ the uncertainties of the teacher network to focus on learning difficult examples, mostly collision samples. This paper is organized as follows. Section 2 presents related work, Section 3 explains collision modeling and data collection, and Section 4 presents the proposed method. Sections 5 and 6 present the experimental results and conclusion, respectively.

### **2. Related Work**

### *2.1. Deep Learning Methods for Collision Identification of Collaborative Robots*

Collision detection is a key technology to ensure the safety and reliability of collaborative robots. Although most previous methods were based on the mathematical modeling of robots [30–32], recently, deep learning methods have shown promising results for this goal. Min et al. [33] estimated vibration features based on the physical modeling of robots and utilized neural networks for collision identification. Xu et al. [34] combined neural networks and a nonlinear disturbance observer for collision detection. Park et al. [35] combined a convolutional neural network and a support vector machine to detect collisions, and Heo et al. [22] employed causal convolutions, which were previously utilized for auto-regressive models in WaveNet [36], to detect collisions based on joint sensor signals including torque, position, and velocity. Maceira et al. [37] employed recurrent neural networks to infer the intentions of external forces in collaborative tasks, and Czubenko et al. [38] proposed an MC-LSTM, which combines convolutional and recurrent layers for collision detection. Mohammadi et al. [13] utilized external vision sensors to further recognize human actions and collisions.

### *2.2. Knowledge Distillation*

Knowledge distillation was proposed by Hinton et al. [39] to train a student network with the aid of a deeper network, which is called a teacher network. Probabilistic responses of the teacher network are beneficial to improve the accuracy of the student network because the probabilities of false categories were also utilized during knowledge distillation. Although most early methods directly distill the logits of a teacher network, Park et al. [40] utilized the logits' relations, and Meng et al. [41] proposed a conditional teacher–student learning framework. Furthermore, knowledge from intermediate feature maps was distilled for network minimization [42] and performance improvement [43,44]. Knowledge distillation has been employed in various applications such as object detection [45], semantic segmentation [46], domain adaptation [47], and defense for adversarial examples [48]. Recently, the teacher–student learning framework has been applied with other advanced learning methodologies such as adversarial learning [49] and semi-supervised learning [50].

### *2.3. Uncertainty Estimation*

Uncertainty plays an important role in interpreting the reliability of machine learning models and their predictions. Probabilistic approaches and Bayesian methods have been regarded as useful mathematical tools to quantify predictive uncertainties [51]. Recently, Gal and Ghahramani proposed Monte Carlo dropout (MC-dropout) [52], which can be interpreted as an approximate Bayesian inference of deep Gaussian processes, by utilizing dropout [53] at test time. Lakshminarayanan et al. [54] proposed deep ensembles for the better quantification of uncertainties, and Amersfoort et al. [55] proposed deterministic uncertainty quantification, which is based on a single model, to address the computational cost of MC-dropout and deep ensembles. Uncertainties have been utilized to quantify network confidence [56], select out-of-distribution samples [57], and improve the performance of deep neural networks [58,59] in application areas such as medical image analysis [60] and autonomous driving [61].

### **3. Collision Modeling and Data Collection**

### *3.1. Mathematical Modelling of Collisions*

This section explains the mathematical modeling of dynamic equations for 6 Degrees of Freedom (DoF) articulated robots. In order to operate a robot through a desired trajectory and move it to a target position, precise control torque is required for each joint motor, and the control torque can be represented as the following dynamic equation:

$$
\tau = M(q)\ddot{q} + C(q,\dot{q})\dot{q} + g(q), \tag{1}
$$

where $\tau \in \mathbb{R}^n$ is the control torque, $M(q) \in \mathbb{R}^{n \times n}$ is the inertia matrix of the articulated robot, $C(q,\dot{q}) \in \mathbb{R}^{n \times n}$ is the matrix of Coriolis and centrifugal torques, $g(q) \in \mathbb{R}^n$ is the vector of gravitational torques, and $q$, $\dot{q}$, $\ddot{q}$ are the angular position, velocity, and acceleration of each joint, respectively. The dynamic equation can be obtained through the Newton–Euler method or the Euler–Lagrange equation using the mechanical and physical information of the robot. Since the dynamic equation of the robot is given as (1), in the absence of external force, external torques can be computed by subtracting the control torques from the measured torques.

When a joint torque sensor is installed on each joint, the torque generated at each joint due to external force is given as follows:

$$
\tau_{\mathrm{ext}} = \tau_s - \tau, \tag{2}
$$

where $\tau_{\mathrm{ext}}$ is the vector of external torques generated at each joint due to a collision, and $\tau_s$ contains the torque values measured by the joint torque sensors. The external torque can be precisely estimated given an accurate estimate of the robot dynamics and of the physical parameters of the articulated robot, such as the mass and center of mass of each link.
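As a concrete illustration of Equations (1) and (2), the following minimal Python sketch computes the external joint torques from measured torques and a hypothetical dynamics model; the callables `M`, `C`, and `g` stand in for the robot's inertia, Coriolis/centrifugal, and gravity terms and are assumptions, not the authors' implementation.

```python
import numpy as np

def external_torque(tau_measured, M, C, g, q, qd, qdd):
    """Estimate external joint torques by subtracting the model-based
    control torque (Equation (1)) from the measured joint torques
    (Equation (2)). M, C, g are hypothetical model callables."""
    tau_model = M(q) @ qdd + C(q, qd) @ qd + g(q)  # Equation (1)
    return tau_measured - tau_model                # Equation (2)
```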

In robots that are not equipped with joint torque sensors, sensorless methods are utilized to estimate external torques. Sensorless methods are based on the current signal of each joint motor, and an additional state variable $p = M(q)\dot{q}$ is defined to reformulate the dynamic equation as follows:

$$
\dot{p} = C^{\top}(q,\dot{q})\dot{q} - g(q) - f(q,\dot{q}) + \tau_m, \tag{3}
$$

where $f(q,\dot{q})$ is the friction torque, and $\tau_m$ is the motor torque. In the case of the sensorless method, it is necessary to obtain the torque transmitted from the motor to the link in order to estimate the collision torque. Therefore, friction must additionally be considered in the robot dynamics equation. A main issue in sensorless external torque estimation is the elaborate design of an observer and filter based on the dynamic Equation (3), and the effect of disturbances can be reduced using the momentum state variable. Due to the effect of noise and nonlinear frictional forces, sensorless methods are generally less precise in the estimation of external torques than methods that utilize joint torque sensors. Through the methods mentioned above, it is possible to obtain the torques generated at each joint due to a collision of the robot. The collision identification algorithm can then determine collision locations from the joint torques obtained through sensor-based or sensorless methods.
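For the sensorless case, a common realization of the idea behind Equation (3) is a first-order momentum observer; the discrete sketch below is only illustrative, and the gain `K_O`, the Euler integration, and all names are assumptions rather than the method used in this paper.

```python
import numpy as np

def momentum_observer_step(r, p, integral, tau_m, C, g, f, q, qd, K_O, dt):
    """One Euler step of a generic momentum-based residual observer.
    `integral` should be initialized with the initial momentum p(0);
    the residual r approximates the external joint torques."""
    # Integrate the nominal momentum dynamics of Equation (3) plus the residual
    integral = integral + (tau_m + C(q, qd).T @ qd - g(q) - f(q, qd) + r) * dt
    # For a suitable gain K_O, the residual converges towards the external torque
    r = K_O @ (p - integral)
    return r, integral
```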

### *3.2. Data Collection and Labeling*

Figure 1a presents the 6-DoF articulated robot used to collect sensor data, which include the joint torque, current, angular position, and angular velocity. The Denavit–Hartenberg parameters of the articulated robot are presented in [62]. From the 6-DoF articulated robot, joint sensor signals were obtained with a sampling rate of 1 kHz, and a data sample collected at time $t$ can be expressed as

$$\mathbf{x}_t = [\boldsymbol{\tau}_t^\top, \mathbf{i}_t^\top, \boldsymbol{\theta}_t^\top, \mathbf{w}_t^\top]^\top \in \mathbb{R}^{24},\tag{4}$$

where $\boldsymbol{\tau}_t$, $\mathbf{i}_t$, $\boldsymbol{\theta}_t$, and $\mathbf{w}_t$ are six-dimensional vectors corresponding to torque, current, angular position, and angular velocity, respectively; the $i$-th components of these vectors indicate the sensor signals obtained at the $i$-th joint. Figure 1b shows the definition of collision categories according to collision locations. Collisions were generated at six locations, and in the case of no collision, which refers to the normal state, a label of 0 was assigned. In the case of a collision, a categorical label corresponding to the location was assigned to generate ground truth data.

Joint sensor data were collected, along with the collision time and category, by applying intentional collisions at different locations. The collision time and category were converted into ground truth data of identical length to the corresponding sensor signals, as shown in Figure 2. For each collision occurrence, the corresponding category was assigned to 0.2 s of data samples starting from the collision time; each collision is therefore represented by 200 collision samples in the ground truth data. We collected joint sensor signals for 5586 intentional collisions along with their ground truth data; the number of collisions applied to each location is equal. This dataset was divided into training, validation, and test sets with a ratio of 70%, 10%, and 20%, as presented in Table 1.
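The labeling scheme described above can be summarized by the following sketch, which assigns 200 samples (0.2 s at 1 kHz) per collision; the argument names and the `(sample index, location)` format of the collision list are illustrative assumptions.

```python
import numpy as np

def build_ground_truth(num_samples, collisions, fs=1000, duration_s=0.2):
    """Label 0 marks the normal state; each collision assigns its location
    label (1-6) to 0.2 s of samples starting at the collision time."""
    labels = np.zeros(num_samples, dtype=np.int64)
    window = int(fs * duration_s)  # 200 samples per collision
    for start_idx, location in collisions:
        labels[start_idx:start_idx + window] = location
    return labels
```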

**Figure 1.** The definition of labels. (**a**) presents the 6-DoF articulated robot, and (**b**) presents the definition of categories; yellow arrows in (**b**) indicate categorical labels according to collision locations.

**Figure 2.** Examples of sensor signals and ground truth data. (**a**) shows a part of the acquired sensor signals, and (**b**) presents examples of generated ground truth data around collision occurrences. Green lines with numbers in (**b**) indicate labeled categories in the ground truth data.

**Table 1.** The number of collisions and data samples. *Total* indicates the number of data samples, which were collected with a sampling rate of 1 kHz, and *Collision* indicates the number of collision samples.


### **4. Proposed Method**

This section presents the proposed method for the collision identification of articulated robots. Firstly, two neural network architectures are presented; one is the student network, and the other is the teacher network for knowledge distillation. The second part explains the proposed knowledge distillation method, which considers the predictive uncertainties of the teacher network. Lastly, a post-processing step is utilized to improve the robustness of the proposed algorithm by reducing noise in the network predictions.

### *4.1. Network Architectures*

This paper employs the network architecture presented by Heo et al. [22] as the base network model. Heo et al. [22] proposed a deep neural network, called CollisionNet, to detect collisions in articulated robots. Its architecture is composed of causal convolutions to reduce detection delay and dilated convolutions to achieve large receptive fields. We modeled the base network by modifying the last fully connected layer of CollisionNet to conduct multiclass classification and identify collision locations. The base network is composed of seven convolution layers and three fully connected layers, and its details are identical to CollisionNet except for the last layer; convolution filters of size 3 are utilized for both regular and dilated convolutions, the depth of the intermediate features is increased from 128 to 512, and the dilation ratio is increased by a factor of two. The architecture of the base network is used without modification as the student network in the process of knowledge distillation.
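As a rough sketch of this kind of architecture, the following PyTorch snippet stacks causal, dilated 1-D convolutions and a small fully connected head with seven outputs; the layer widths, depths, and class names are illustrative assumptions and do not reproduce the exact base network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so the output at time t
    depends exclusively on inputs up to time t."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                           # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class BaseNetSketch(nn.Module):
    """Illustrative causal/dilated convolution stack followed by fully
    connected layers that output 7 class scores (normal state + 6 locations)."""
    def __init__(self, in_ch=24, num_classes=7):
        super().__init__()
        layers, ch = [], in_ch
        for i, width in enumerate([128, 128, 256, 256, 512, 512, 512]):
            layers += [CausalConv1d(ch, width, dilation=2 ** i), nn.ReLU()]
            ch = width
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Linear(ch, 256), nn.ReLU(),
                                  nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))

    def forward(self, x):                           # x: (batch, 24, time)
        return self.head(self.features(x)[:, :, -1])  # classify the last time step
```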

Figure 3 shows the architecture of the teacher network. To construct the teacher network, three regular convolutions in the base network are replaced with convolution blocks. A convolution block contains four convolution layers with a skip connection, and therefore, the number of parametric layers in the teacher network increases to 19. The number of channels in the second and third convolution layers of a convolution block is identical to the number of output channels of the corresponding regular convolution layer. The number of trainable parameters in the teacher network is 6.63 million; therefore, it has more capacity to learn the training data compared to the base network, which has 2.79 million parameters. Dropout layers with a dropout ratio of 0.5 are added to the fully connected layers in the teacher network, and Monte Carlo samples from the teacher network are acquired by applying dropout at test time.
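A minimal sketch of such a convolution block is given below; the hidden width, the 1×1 projection on the skip path, and the placement of the activation are assumptions made for illustration only.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvBlockSketch(nn.Module):
    """Four convolution layers with a skip connection, used in place of a
    single regular convolution when building the teacher network."""
    def __init__(self, in_ch, out_ch, hidden_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, hidden_ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden_ch, hidden_ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden_ch, hidden_ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv1d(in_ch, out_ch, 1)  # match channels on the skip path

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))
```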

**Figure 3.** The architecture of the teacher network.

### *4.2. Uncertainty-Aware Knowledge Distillation*

The teacher network is trained with the cross-entropy loss between the softmax prediction $\hat{\mathbf{y}}_T$ and its one-hot encoded label $\mathbf{y}$. The $i$-th component of $\hat{\mathbf{y}}_T$ indicates the predicted probability that the input sample belongs to the $i$-th category. In our case, seven categories exist, which comprise the normal state and six possible collision locations. The loss function for the training of the teacher network is defined as

$$l_{ce}(\mathbf{y}, \hat{\mathbf{y}}_T) = -\sum_i y_i \log(\hat{y}_{T,i}), \tag{5}$$

where $y_i$ and $\hat{y}_{T,i}$ are the $i$-th components of $\mathbf{y}$ and $\hat{\mathbf{y}}_T$, respectively.

After training the teacher network, $K$ logits $\hat{\mathbf{z}}_T^1, \cdots, \hat{\mathbf{z}}_T^K$ are obtained from an input sample by utilizing MC-dropout [52]. These logits are computed by randomly ignoring 50% of the neurons in the fully connected layers of the teacher network. Based on the $K$ logits of the teacher network, the $i$-th component of the uncertainty vector is computed by

$$u_i = \frac{1}{K} \sum_k (\hat{z}_{T,i}^k - \bar{z}_{T,i})^2,\tag{6}$$

where $\bar{z}_{T,i}$ is the $i$-th component of the averaged logit $\bar{\mathbf{z}}_T$, which is computed by

$$\bar{\mathbf{z}}_T = \frac{1}{K} \sum_k \hat{\mathbf{z}}_T^k. \tag{7}$$

The uncertainty $u_i$ is the variance of the logits; therefore, the value of the uncertainty increases as the distances between the logits increase.
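The following sketch shows how the $K$ stochastic logits and the uncertainty vector of Equations (6) and (7) could be obtained in PyTorch; calling `.train()` to keep dropout active at test time follows the MC-dropout recipe [52], while the function and argument names are assumptions.

```python
import torch

def mc_dropout_uncertainty(teacher, x, K=4):
    """Draw K stochastic logit vectors from the teacher with dropout active,
    then return their mean (Equation (7)) and per-class variance (Equation (6))."""
    teacher.train()                                  # keep dropout layers active
    with torch.no_grad():
        logits = torch.stack([teacher(x) for _ in range(K)])  # (K, batch, classes)
    z_bar = logits.mean(dim=0)                       # averaged logit, Equation (7)
    u = logits.var(dim=0, unbiased=False)            # uncertainty vector, Equation (6)
    return z_bar, u
```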

The total loss $\mathcal{L}$ for the training of the student network is composed of two loss functions, as follows:

$$\mathcal{L} = l_{ce}(\mathbf{y}, \hat{\mathbf{y}}_S) + l_{kd}(\bar{\mathbf{z}}_T, \hat{\mathbf{z}}_S, \mathbf{u}), \tag{8}$$

where $l_{ce}(\mathbf{y}, \hat{\mathbf{y}}_S)$ is the cross-entropy loss between the softmax prediction of the student network and its corresponding label, $\mathbf{u}$ is the uncertainty vector whose $i$-th component is $u_i$, and $l_{kd}(\bar{\mathbf{z}}_T, \hat{\mathbf{z}}_S, \mathbf{u})$ is the uncertainty-aware knowledge distillation loss. The knowledge distillation loss is obtained by computing the uncertainty-weighted Kullback–Leibler (KL) divergence between $\sigma(\hat{\mathbf{z}}_S, T)$ and $\sigma(\bar{\mathbf{z}}_T, T)$, as follows:

$$l_{kd}(\bar{\mathbf{z}}_T, \hat{\mathbf{z}}_S, \mathbf{u}) = -\sum_i u_i\, \sigma(\bar{\mathbf{z}}_T, T)_i \left\{ \log(\sigma(\hat{\mathbf{z}}_S, T)_i) - \log(\sigma(\bar{\mathbf{z}}_T, T)_i) \right\},\tag{9}$$

where $\sigma(\mathbf{z}, T)$ is the softmax function with the temperature $T$, and $\sigma(\mathbf{z}, T)_i$ is the $i$-th component of $\sigma(\mathbf{z}, T)$. In (9), $\sigma(\mathbf{z}, T)_i$ can be computed as

$$
\sigma(\mathbf{z}, T)_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}.\tag{10}
$$

The overall procedure for the training of the student network is presented in Figure 4.
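Under the definitions of Equations (8)–(10), the total training loss of the student network can be sketched as follows; the plain sum of the two terms and the default reduction over the batch are assumptions, not a claim about additional weighting used by the authors.

```python
import torch.nn.functional as F

def uncertainty_aware_kd_loss(student_logits, z_bar, u, y, T=5.0):
    """Cross-entropy on the hard labels (Equation (8)) plus the
    uncertainty-weighted KL divergence between the temperature-softened
    teacher and student distributions (Equations (9) and (10))."""
    p_teacher = F.softmax(z_bar / T, dim=-1)               # sigma(z_bar_T, T)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    log_p_teacher = F.log_softmax(z_bar / T, dim=-1)
    # Equation (9): per-class KL terms weighted by the uncertainty vector u
    l_kd = -(u * p_teacher * (log_p_student - log_p_teacher)).sum(dim=-1).mean()
    l_ce = F.cross_entropy(student_logits, y)              # hard-label term
    return l_ce + l_kd                                      # Equation (8)
```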

**Figure 4.** The procedure of uncertainty-aware knowledge distillation for the training of the student network; SN and TN indicate the student and teacher networks, respectively, and $\sigma(\mathbf{z}, T)$ is the softmax function with the temperature $T$.

### *4.3. Post-Processing*

The post-processing to reduce errors in network predictions is inspired by a connected component analysis in image-processing techniques. In the labeled data, a collision is represented by connected samples, with a non-zero number corresponding to its location. However, a few predictions may differ from their adjacent predictions, because a neural network independently infers predictions for different data samples. Based on the collision properties in the labeled data, incorrect predictions are reduced by the post-processing presented in Figure 5.

**Figure 5.** The procedure for the post-processing. (**a**) presents the predictions from the student network, and (**b**) presents the result of grouping non-zero connected samples and assigning the most frequent category to each group. (**c**) presents the result of the thresholding step.

The post-processing is composed of two steps; in Figure 5, (a) shows predictions from the student network, and (b) and (c) present the results after the first and second post-processing steps, respectively. In the first step, non-zero connected samples are grouped, and the number of samples belonging to each category is counted. The predictions in a group are replaced with the category of maximum frequency, as presented in Figure 5b. In the second step, if the number of non-zero connected samples is less than a threshold value, these samples are regarded as the normal state. A threshold value of 10 samples is utilized in the experiments, and Figure 6 presents examples of the results of the post-processing.
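A compact sketch of the two post-processing steps is shown below; it operates on a 1-D array of per-sample class predictions, and the threshold of 10 samples matches the value reported above, while the function name and interface are illustrative.

```python
import numpy as np

def post_process(preds, min_len=10):
    """Group non-zero connected predictions, replace each group with its most
    frequent category, and revert groups shorter than min_len to the normal state."""
    preds = preds.copy()
    n, i = len(preds), 0
    while i < n:
        if preds[i] == 0:
            i += 1
            continue
        j = i
        while j < n and preds[j] != 0:               # find the end of the non-zero group
            j += 1
        majority = np.bincount(preds[i:j]).argmax()  # step 1: majority vote in the group
        preds[i:j] = 0 if j - i < min_len else majority  # step 2: thresholding
        i = j
    return preds
```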

**Figure 6.** Examples of predictions before and after the post-processing. (**a**) presents predictions for the collision categories of 4 and 5, and (**b**) presents predictions for the collision categories of 2 and 3.

### **5. Experiments**

### *5.1. Experimental Environment and Evaluation Measures*

The proposed algorithm is developed in a hardware environment comprising an Intel Core i7-10700 CPU, 32 GB of DDR4 RAM, and an RTX 3080 GPU. In the experiments, Python and PyTorch are mainly utilized to implement the proposed algorithm and to conduct an ablation study. To demonstrate the proposed method, the dataset is gathered from a collaborative robot that consists of six rotating joints. The cobot weighs 47 kg, has a maximum payload of 10 kg, and has a reach of up to 1300 mm. The actuators consist of motors manufactured by Parker, motor drivers from Welcon, and joint torque sensors embedded in each joint. The hardware of the cobot contains a custom embedded controller based on a real-time Linux kernel, and it communicates with the drivers through EtherCAT with a cycle time of 1 ms.

To demonstrate the effectiveness of the proposed method, we evaluate the algorithm in three ways: (1) sample-level accuracy, (2) collision-level accuracy, and (3) time delay. In the process of collision identification, deep neural networks perform sample-level multiclass classification, which classifies each sample, composed of 24-dimensional sensor data, into the normal state or one of the six collision locations. To evaluate the sample-level accuracy of the deep neural networks, we measure *Recall*, *Precision*, and *F1-score* for each sample, which are defined as follows:

$$\begin{aligned} \mathit{Recall} &= \mathit{TP}/(\mathit{TP} + \mathit{FN}), \\ \mathit{Precision} &= \mathit{TP}/(\mathit{TP} + \mathit{FP}), \\ \mathit{F1\text{-}score} &= 2 \times \frac{\mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}, \end{aligned} \tag{11}$$

where $\mathit{TP}$, $\mathit{FP}$, and $\mathit{FN}$ are the numbers of true positives, false positives, and false negatives, respectively. A true positive is a correctly identified collision sample, a false positive is an incorrect prediction that is classified as a collision, and a false negative is an incorrect prediction that is classified as the normal state.
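The sample-level scores of Equation (11) can then be computed as in the following sketch; the handling of edge cases such as empty denominators is an assumption for illustration.

```python
import numpy as np

def sample_level_scores(preds, labels):
    """TP: collision samples predicted with the correct location;
    FP: collision predictions that are wrong (wrong location or normal sample);
    FN: collision samples predicted as the normal state."""
    tp = np.sum((preds > 0) & (preds == labels))
    fp = np.sum((preds > 0) & (preds != labels))
    fn = np.sum((preds == 0) & (labels > 0))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```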

Collision-level accuracy is another important measure for evaluating a collision identification system. Because collaborative robots respond to each collision, reducing the number of false-positive collisions is an important issue. *Recall*, *Precision*, and *F1-score* are computed as in (11), with different definitions of $\mathit{TP}$, $\mathit{FP}$, and $\mathit{FN}$, to measure the collision-level accuracy. A group of connected samples that are classified as a collision is regarded as a true positive if the intersection over union (IoU) between the connected predictions and the corresponding true collision samples is greater than 0.5. A group of predictions that is classified into an incorrect collision category is regarded as a false positive, and a false negative is a missed collision. Figure 7 shows several cases of $\mathit{TP}$, $\mathit{FP}$, and $\mathit{FN}$ for measuring the collision-level accuracy.
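For the collision-level evaluation, the IoU test between a group of connected collision predictions and the corresponding ground-truth interval can be sketched as below; the interval representation by start and end sample indices is an assumption, and the category-match check is omitted for brevity.

```python
def interval_iou(pred_start, pred_end, true_start, true_end):
    """IoU between a predicted collision interval and a ground-truth interval;
    a prediction counts as a true positive when the IoU exceeds 0.5 (and the
    predicted category matches the ground truth)."""
    intersection = max(0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0
```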

**Figure 7.** Examples of true-positive, false-positive, and false-negative collisions for computing collision-level accuracies. (**a**) presents a *TP* collision, (**b**,**d**) present *FP* and *FN* cases, (**c**) presents *TP* and *FP* cases, and (**e**) presents a *FP* collision.

Finally, the time delay is measured to evaluate the processing time of the collision identification system. For safe and reliable collaboration between humans and robots, the processing time should be as short as possible. The total processing time is composed of the inference time of the neural network, the detection delay for collisions, and the post-processing time. Based on these three types of evaluation measures, the effectiveness of the proposed method is demonstrated in the experiments.

### *5.2. Training of Neural Networks*

To train the neural networks, the Adam optimizer [63] is utilized with a learning rate of $10^{-4}$. The learning rate is decreased to $10^{-5}$ after training for 200 epochs. Figure 8 presents the F1-scores for the training and validation datasets during training for 500 epochs. As shown in Figure 8, after training for a sufficient number of epochs, the validation accuracy did not improve further. Therefore, in the following experiments, the accuracies of the deep neural networks are evaluated on the test set after training for 300 epochs.

**Figure 8.** F1-scores for the training and validation datasets.

To train the student network, the temperature of the softmax function is set to 5 during the process of knowledge distillation. The temperature value has to be greater than 1 to soften the probabilistic predictions of the neural network, and temperature values between 2 and 5 are usually used for knowledge distillation in the previous literature [39]. In our experiments, modifications to the temperature value lead to insignificant changes in the experimental results. In Figure 9, (a) shows the first dimension of the 24-dimensional sensor data, which corresponds to the torque signal at the first joint, and (b) presents uncertainties measured by MC-dropout with $K = 4$. As shown in Figure 9, the uncertainties of collision samples are high compared to those of normal state samples. By weighting the KL divergence between the probabilistic predictions of the student and teacher networks with the uncertainties, the student network is able to focus on learning difficult data samples.

**Figure 9.** Uncertainties measured by MC-dropout of the teacher network. (**a**) shows the first dimension of 24-dimensional sensor data, and (**b**) presents uncertainties measured by MC-dropout. In (**a**), red × marks indicate collision moments, and green lines represent labels for the normal state and locations of collisions.

### *5.3. Sample-Level Accuracy*

The first measure to evaluate the performance of the deep neural networks is the sample-level accuracy. As explained in Section 4.1, the architecture of the deep neural network proposed in [22] is employed to construct the base model. To demonstrate the effectiveness of uncertainty-aware knowledge distillation for the problem of collision identification, we compare the accuracies of the proposed method with those of the base model and a student network. The student network has an architecture identical to that of the base model and is trained by distilling knowledge from the teacher network without employing uncertainty information. Table 2 presents the sample-level recall, precision, and F1-score of the four neural network models; the proposed method denotes another student network, which is trained by uncertainty-aware knowledge distillation. The last row of Table 2 presents the sample-level accuracies of the teacher network. As presented in Table 2, the F1-scores of the proposed method are comparable to those of the teacher network; it is worth noting that the proposed method employs a lightweight network compared to the teacher network.


**Table 2.** Sample-level accuracies of the four different neural network models before and after the post-processing.

### *5.4. Collision-Level Accuracy*

This section presents the collision-level accuracies. As collaborative robots react to each collision, reducing the number of false-positive collisions is a very important issue in reliable collision identification systems. In the labeled data, a collision is represented by 200 non-zero samples; therefore, false-positive collisions that are composed of only a few false-positive samples are not effectively reflected in the sample-level accuracies. Although the sample-level accuracies of the four neural network models are above 98%, there is a considerable number of false-positive collisions. To compute the collision-level accuracies, a group of non-zero predictions is regarded as a collision, and Table 3 presents the numbers of true-positive, false-positive, and false-negative collisions of the four neural network models. In Table 3, the base model, student network, and proposed method have an identical network architecture to CollisionNet [22]; the student network is trained by regular knowledge distillation, and the proposed method employs uncertainties during knowledge distillation. As shown in Table 3, the number of false positives is significantly reduced after the post-processing. Table 4 presents the collision-level recall, precision, and F1-score of the four neural networks. By utilizing probabilistic labels and uncertainties from the teacher network, the proposed method produces better accuracies, despite its lightweight network architecture compared to the teacher network.


**Table 3.** The numbers of true-positive (*TP*), false-positive (*FP*), and false-negative (*FN*) collisions of the four neural network models before and after post-processing.

**Table 4.** Collision-level accuracies of the four different neural network models before and after the post-processing.


### *5.5. Analysis for the Processing Time*

The processing time is another important factor for responding to external forces within an acceptable timeframe. In the collision identification system, the total processing time is composed of the inference time of a neural network, time delay for detecting a collision, and post-processing time. Table 5 presents the averaged processing time for each step. The teacher network requires an 83% longer inference time compared to the base model, student network, and proposed method. The detection delay is measured by averaging the time intervals between collision occurrences and their corresponding first true-positive samples. As presented in Table 5, the proposed method requires 2.6350 ms to identify a collision occurrence, and this satisfies the requirement for the safety of collaborative robots.

**Table 5.** The averaged processing time in ms for the collision identification.


### **6. Conclusions**

This paper proposes a collision identification method for collaborative robots. To identify the locations of external forces, the proposed method employs a deep neural network composed of causal convolutions and dilated convolutions. The key contribution is the method of capturing sample-level uncertainties and distilling the knowledge of a teacher network into a student network with consideration of the predictive uncertainties. In the knowledge distillation, the KL divergence between the predictions of the student and teacher networks is weighted by the predictive uncertainties to focus on data samples that are difficult to learn. Furthermore, we also propose a post-processing step to reduce the number of false-positive collisions. Experiments were conducted with a 6-DoF articulated robot, and we demonstrated that the uncertainty information is beneficial for improving the accuracy of the collision identification method.

**Author Contributions:** Conceptualization, W.K. and S.J.L.; methodology, S.J.L.; software, S.J.L.; validation, S.J.L.; formal analysis, S.J.L.; data curation, W.K.; writing—original draft preparation, W.K. and S.J.L.; writing—review and editing, W.K., Y.J. and S.J.L.; visualization, S.J.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1G1A1009792). This research was supported by the "Research Base Construction Fund Support Program" funded by Jeonbuk National University in 2021, and partially supported by an Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government (21ZD1130, Development of ICT Convergence Technology for Daegu-Gyeongbuk Regional Industry).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

