Review

Perception and Computation for Speed and Separation Monitoring Architectures

Department of Electrical and Microelectronic Engineering, Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623, USA
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Robotics 2025, 14(4), 41; https://doi.org/10.3390/robotics14040041
Submission received: 11 February 2025 / Revised: 27 March 2025 / Accepted: 28 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue Embodied Intelligence: Physical Human–Robot Interaction)

Abstract

Human–Robot Collaboration (HRC) has been a significant research topic within the Industry 4.0 movement over the past decade. Interest in HRC research has continued with the dawn of Industry 5.0 and its focus on worker experience. Within the study of HRC, the collaboration approach of Speed and Separation Monitoring (SSM) has been implemented through various architectures. The different configuration strategies involve different perception-sensing modalities, mounting strategies, data filtration, computational platforms, and calibration methods. This paper explores the evolution of the perception architectures used to perform SSM and highlights innovations in sensing and processing technologies that can open the door to significant advancements in this sector of HRC research.

1. Introduction

Human–Robot Collaboration (HRC) started to gain traction in the 2010s with the desire for more flexibility and digitalization in industries like automotive, commercial product manufacturing, and supply chain management. HRC presented an opportunity to reduce the footprint size of robots on a production floor while combining the precision capabilities of robots with human ingenuity and creativity. The key was to create this combination while still maintaining a safe hybrid workspace for the robot and human to work together with minimal downtime. The robot provided the ability to perform repetitive tasks consistently with high accuracy and precision. The human provided creativity and the ability to adapt to tasks that may need adjustment and flexibility. In other words, the goal of HRC is to combine the benefits of low-mix high-volume robots with the flexibility of high-mix low-volume human capabilities.
The earlier development of HRC research was highly motivated by the “Industry 4.0” movement. The term “Industry 4.0” was first used in 2011 [1] to illustrate the occurrence of a fourth industrial revolution. This revolution focused on digitalization and process automation. There was a great increase in research and development in smart devices, the Internet of Things (IoT), smart factories, and electrification [1]. All of these efforts contributed to the use of technology to automate manufacturing, testing, inspection, and all other related processes to the production of products along with their transport, storage, and tracking.
This rise in digitalization drove the need to observe more aspects of these processes. In order to observe the states, behaviors, and values of these systems, there was a significant increase in the integration of sensors into manufacturing and production processes. Sensors were integrated to increase traceability, worker safety, and quality of work. With respect to worker safety, torque sensors, 2D scanning LiDARs, and light curtains were introduced to create safe stopping states for a manufacturing line without shutting down the whole system. Limiting system shutdowns, in turn, increased the overall productivity of the manufacturing line.
In HRC, this concept of a safe system pause without an emergency stop is called a “safety-rated monitored stop”. The safety-rated monitored stop is one of the four collaboration approaches defined by the International Organization for Standardization (ISO). Over time, collaborative robots, or cobots, were developed and sold as off-the-shelf industrial robots. This made HRC easier to integrate into manufacturing workspaces and processes. Following this wave of integration in industry, the research fields that generated this new collaborative digital infrastructure began to investigate new principles and values for these architectures.
The idea of a fifth industrial revolution or “Industry 5.0” started appearing in the literature and research around 2018 [1]. This revolution shifted the focus from digitization and automation to worker collaboration and well-being. Research began to expand on the safety benefits that HRC techniques provide, while also utilizing more complex collaborative tasks between the human and the robot. With collaborative architectures in place, HRC research has added a focus on monitoring and increasing the comfort of humans interacting with robots [2,3]. Though the principles of HRC have evolved, the four ISO-defined collaboration approaches have remained the foundational pillars of current HRC research.
The four foundational collaboration methods are defined in the ISO/TS 15066:2016 standard [4]. The approaches consist of safety-rated monitored stop, power and force limiting, hand guiding, and speed and separation monitoring.

1.1. Safety-Rated Monitored Stop

As previously mentioned, a safety-rated monitored stop defines a safe stop state in which the robot’s actions are suspended, but the robot does not need to completely restart to reenter an active state. The safety-rated stop may be triggered by high torque feedback, a light curtain, or the violation of a boundary monitoring sensor (e.g., a human entering an area they are not supposed to occupy). The robot cannot leave the suspended state until the sensors have been cleared by an operator and conditions have returned to normal. The three remaining collaborative approaches build off this stop state once their thresholds of safe interaction are exceeded. These thresholds can include force, torque, distance, or velocity.

1.2. Power and Force Limiting

Power and force limiting (PFL) is the most common collaboration method for off-the-shelf industrial cobots. This collaboration approach requires torque feedback at each robot joint. In the event that the robot collides with the human, or begins to exert a joint force greater than the ISO-defined thresholds, the robot enters a safety-rated monitored stop [5].

1.3. Hand Guiding

Hand guiding is an approach in which a human user can freely move a manipulator robot by its tool or Tool Center Point (TCP). This approach greatly simplifies robot training and lowers the technical entry barrier for human–robot interaction compared to the general level of coding required to program robot tasks. Hand guiding is generally a built-in software feature that relies on the same torque feedback used by the power and force limiting systems already present in cobots. This operational mode has safety limits to maintain safe human–robot interaction (HRI). As with power and force limiting, if the limits of the motors or the torque threshold are exceeded, the robot enters a safety-rated monitored stop.

1.4. Speed and Separation Monitoring

Speed and separation monitoring (SSM) utilizes perception systems mounted on or off the robot to monitor the minimum distance between the robot and the human, in addition to the velocity vector between them. ISO defines the algorithm thresholds such that particular combinations of speed and separation conditions will either reduce/adjust the robot velocity or force it to enter a safety-rated monitored stop to avoid collision.
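To make this thresholding concrete, the following is a minimal sketch of a tri-modal SSM rule of the kind discussed later in this review (full speed, reduced speed, safety-rated monitored stop). The threshold values and function names are illustrative assumptions only; they are not values prescribed by ISO/TS 15066, which additionally accounts for stopping distances, system latencies, and measurement uncertainty.

```python
# Minimal sketch of a tri-modal SSM velocity rule.
# Thresholds are illustrative, NOT values taken from ISO/TS 15066.

def ssm_speed_scale(min_distance_m: float,
                    stop_threshold_m: float = 0.5,
                    reduced_threshold_m: float = 1.5) -> float:
    """Return a velocity scale factor for the robot controller.

    min_distance_m: current minimum human-robot separation reported by
    the perception system.
    """
    if min_distance_m <= stop_threshold_m:
        return 0.0          # safety-rated monitored stop
    if min_distance_m <= reduced_threshold_m:
        return 0.25         # reduced-speed collaboration zone
    return 1.0              # full programmed speed

# Example: a human measured 1.2 m from the robot triggers the reduced zone.
print(ssm_speed_scale(1.2))  # -> 0.25
```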

1.5. HRC Collaboration Method Trade-Offs

In contrast to the first three methods, SSM takes a proactive approach and aims to avoid collisions altogether. Power and force limiting only enters a safety stop following a collision. Hand guiding requires constant contact between the robot and human; therefore, a higher level of safety must be maintained in order to lower the risk of injury to the operator handling the robot end effector. Hand guiding maintains safe collaboration by significantly reducing the maximum joint speed and movement range, in an effort to lower the chances of human collisions, self-collision, or exceeding the mechanical limits of the joint motors. As a result of SSM’s proactive collision avoidance measures, it requires significantly more computational performance. The computational challenge comes from the requirement to continuously compute the minimum distance between the human and the robot in an SSM workspace. Whether on-robot or off-robot, SSM perception systems must meet performance requirements including sample rate, resolution, coverage, calibration, and physical placement/mounting. These requirements vary based on the limitations of the perception and computation technologies selected. Perception and computation limitations drive challenges in the implementation of SSM architectures, which is why PFL is the more prevalent approach when building off-the-shelf collaborative robots [6,7,8]. Furthermore, the torque and force sensing requirements of PFL and hand guiding are fairly similar. With the addition of capacitive sensing for detecting contact between the human and robot, most collaborative robots include hand guiding as a “free drive mode” to simplify robot task planning [9].
This paper seeks to detail the history, function, and trade-offs of different computational platforms and perception modalities used in HRC SSM architectures. Perception sensors have seen large changes in their performance and application focus, while computational platforms available for SSM architecture integration have increased in performance and variety. Following an overview of sensor history and performance, this paper explores the chronology of these modalities across SSM applications to illustrate the trends in SSM architectures over time. These trends help predict how perception and computational systems can be applied to SSM research in the future.

2. Perception Technologies

An SSM architecture uses its perception modules to observe the minimum human–robot distance and input it to the SSM algorithm such that a safe velocity output command is sent to the robot operating in the collaborative workspace. This section covers the modalities most commonly used in the surveyed works. The modalities include IR sensors, LiDARs, radars, stereo cameras, and thermal sensors.

2.1. IR Sensors

Infrared (IR) sensors, as seen in Figure 1, are one of the lowest cost light-based sensing modalities. IR sensors consist of an IR light emitting diode (IR-LED) and a photodiode to detect IR light reflected back from the environment. The measurement modality is continuous and must be sampled by an analog-to-digital converter (ADC) to interpret the signal return strength [10]. The output of the return signal is a voltage which must then be converted to a distance based on a voltage–distance relationship provided in the datasheet [10]. The refresh rate of the distance data is limited by the sample speed of the ADC and microcontroller. The resolution of the distance data is dependent on the bit resolution of the ADC, along with the signal-to-noise ratio (SNR) of the IR sensor output [10].
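As a concrete illustration of this conversion chain, the sketch below maps a raw ADC sample to a distance by interpolating a voltage–distance calibration table. The table values, reference voltage, and bit depth are illustrative placeholders rather than figures from any specific datasheet; a real implementation would use the manufacturer’s curve or a per-sensor calibration.

```python
import numpy as np

# Illustrative voltage-to-distance calibration table (NOT datasheet values).
# Voltages must be increasing for np.interp, so the table runs from far
# (low voltage) to near (high voltage).
CAL_VOLTS = np.array([0.4, 0.6, 0.9, 1.3, 2.0, 3.0])        # sensor output [V]
CAL_DIST = np.array([0.80, 0.60, 0.40, 0.25, 0.15, 0.10])   # distance [m]

def ir_distance(adc_counts: int, vref: float = 3.3, bits: int = 12) -> float:
    """Convert a raw ADC sample into an approximate distance in metres."""
    volts = adc_counts / (2**bits - 1) * vref
    # Clamp to the calibrated range; outside it the non-linear curve is unreliable.
    volts = min(max(volts, CAL_VOLTS[0]), CAL_VOLTS[-1])
    return float(np.interp(volts, CAL_VOLTS, CAL_DIST))

print(ir_distance(1600))  # ~1.29 V -> roughly 0.25 m with this example table
```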
In [11], a mesh of SHARP GP2Y0A21YK0F IR sensors was affixed to an ABB robot to perform SSM [10]. The distance value mesh was monitored for violations of pre-defined safety zones, which would trigger the robot to slow down or stop. Ref. [10] demonstrated a high perception coverage approach at a much lower cost compared to the conventional 2D laser scanner methods of the time. A key issue with IR sensors is their non-linear nature, as seen in Figure 2. Though the voltage-to-distance curve is provided in the datasheet, there is an accuracy limitation based on the ADC or microprocessor being used to convert IR sensor voltage readings into distance. Additionally, since this is a voltage-based measurement, the sensing accuracy is impacted by the “cleanliness” of the input voltage [10]. The more ripple on the sensor bus voltage, the less accurate the distance reading will be. Voltage ripple noise was mitigated in [11] by adding decoupling capacitors near the IR sensors throughout the sensor skin. In [12], a lower-power, shorter-range IR sensor was used with pressure sensors to create a touch and proximity sensing skin directly on-robot. The patch of skin could classify a number of contact approaches. The VCNL4010 sensors used in this skin are much lower power but also shorter ranging than the Sharp IR sensor mentioned above.

2.2. LiDAR Sensors

A fundamental perception platform commonly used for speed and separation monitoring is the light detection and ranging sensor, also known as LiDAR. LiDAR has been used to perform speed and separation monitoring for over a decade [13]. It is one of the earliest sensing approaches used in the field of SSM research. Prior to its applications in SSM, LiDAR had a rich development and application history. One of the earliest use cases for LiDARs and manipulator robots was in a Ford manufacturing plant in the 1980s, where a LiDAR was used to help a manipulator robot pick and place an exhaust manifold out of a bin [14]. As laser and receiver technology advanced, LiDAR packaging size and power consumption decreased. Additionally, the computational power needed to process the sensing modality became more accessible. LiDARs became commercially available in the early 2000s [14]. LiDAR then began appearing in manufacturing plants to act as safety sensor guides for automated guided vehicles and robots (AGVs and AGRs), as seen in Figure 3. LiDARs were also paired with manipulator robots to perform collision avoidance. As the human–robot collaboration field began to grow in the mid-to-late 2000s, LiDAR was quickly adopted as a baseline perception modality [13,15]. Not only was LiDAR used as the experimental sensor in SSM; it also became the control perception modality against which prototype distance sensors were compared [16].
Through this rich development, LiDAR operating principles have diversified greatly. The initial laser ranging modality worked solely on the Time-of-Flight of a laser pulse into the environment and back to the sensor. The light bounces off the environment and then returns to the aperture. As illustrated in Equation (1), the distance d to an object is half the speed of light c times the time difference between signal transmission and reception, known as the time of flight τ [15].
$d = \frac{1}{2} c \tau$ (1)
The multidimensional 2D and 3D LiDARs take this distance measurement principle and apply it across a field of view (FOV) via mechanical or electrical means [14]. The laser transmits a pulse of light that reflects off a mirror. Then, through optomechanical means, the light is spread around the environment. Return light is received monostatically or bistatically, and then digital signal processing is used to calculate point cloud data and send that information to a PC or other external processing unit [14]. When transmission and reception are performed by a single aperture, this is considered a monostatic LiDAR. When the LiDAR has a dedicated transmission and receiving aperture, this is considered a bistatic LiDAR.
In general, LiDARs provide detailed 3D distance information about the environment. However, this detailed perception comes at the cost of processing power, size, and power consumption. Different LiDAR topologies can balance some of these constraints. Monostatic LiDARs are generally lighter than bistatic ones, as there is one shared aperture for both transmitting and receiving.
LiDAR has maintained a consistent presence in SSM Architectures throughout the history of SSM research [17,18]. Ref. [13] is an example of early research into SSM. In this work, a single 2D scanning LiDAR was fixed off-robot in the workspace. The LiDAR tracked human movement within the workspace. These data drove the basic and Tri-Modal SSM experiments. Basic SSM commanded the robot velocity to zero if the minimum distance threshold was tripped, and Tri-Modal SSM commanded a slower velocity state prior to a full safety-rated monitored stop. Ref. [13] discussed the implications of the newly proposed ISO/TS 15066 standard at the time and indicated that these initial findings showed promise for SSM in future HRC applications. Ref. [19] ran physical experiments with a LiDAR and an industrial robot to perform the complete SSM experiment according to the ISO/TS 15066: 2016 SSM methodology. Other SSM researchers continued to use this static off-robot LiDAR Tri-Modal SSM approach. Ref. [20] used this LiDAR configuration in conjunction with the spherical estimation of the robot shape to perform a dynamic SSM (dSSM). In [21], 2D LiDARs were used with a Kinect V2 to monitor a collaborative workspace that was shared with a human, cobot, and industrial robot. The SSM Architecture for these experiments also fed sensor data to a digital twin which mirrored the physical actions of robots, sensors, and humans of the physical world into a virtual space. This digital twin was then used to generate a speed factor to feed into the SSM algorithm. As opposed to dSSM in [20], the reduced speed zone in [21] could dynamically change between 0% and 100% robot speed.
The 1D and 2D dSSM approaches have generated observable decreases in cycle time compared to fenced robots and static SSM approaches. Other works have used LiDAR as a control against which to compare their experimental sensors [22], or have combined their sensors with LiDAR into a fusion-based sensor platform [23]. Ref. [24] explored integrating 3D LiDAR with stereo vision to move toward a potential SSM architecture that could work with heavy-duty or industrial robots. In [25], fusion was taken a step further by fusing multiple off-robot LiDARs with an on-robot Time-of-Flight (ToF) camera to maximize the coverage of the sensor perception system. Their algorithm aimed to have the local (on-robot) ToF camera account for periods when the global (off-robot) LiDARs were occluded. In [26], a LiDAR was used as the baseline 2D perception modality to compare with a radar-based 3D sensing approach. A single LiDAR configuration was tested against a three-radar setup in which radars positioned at the corners of the table detected human presence, while a third, robot-mounted radar was then enabled to monitor human hand intrusions into the collaborative workspace. This work determined that the radar approach produced a shorter cycle time than the traditional LiDAR approach. In [27], a prototype 3D LiDAR was mounted off-robot to monitor human interactions with a FANUC M-20iD industrial robot. The LiDAR data were passed to an embedded NanoPi M4V2, in which the point cloud was assigned to the robot via the forward kinematics of the robot links and joints. This prototype LiDAR and the intelligent control system (ICS) matched the perceived workspace state and intruders with the robot pose from the robot controller. The system was successfully validated against the COVR ROB-MSD-3 safety protocol to demonstrate that the LiDAR and ICS exhibited satisfactory performance for real-time HRC applications. Another SSM experiment that explored the use of a 3D LiDAR was carried out in [28]. In this work, an Ouster OS0 LiDAR was used to observe the scene and generate a point cloud for a CNN to extract the human from the scene, which was then used to calculate the minimum distance for a dynamic SSM algorithm. The novelty of the method tested was that the CNN was not fed the generic LiDAR point cloud, but instead an OS0-generated IR frame, such that a traditional YOLOv9 model could be used directly on the LiDAR data.
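A minimal sketch of the minimum-distance computation that such LiDAR-based architectures repeat every cycle is shown below, assuming the robot is approximated by spheres centered on its links (in the spirit of the spherical robot estimation in [20]) and that human points have already been segmented from the point cloud. All names and values are illustrative.

```python
import numpy as np

def min_separation(human_points: np.ndarray,
                   link_centers: np.ndarray,
                   link_radii: np.ndarray) -> float:
    """Minimum distance between human points (N, 3) and a robot modelled as
    spheres centred on its links (M, 3) with radii (M,).
    Illustrative sketch; a deployed system would also subtract robot
    self-detections and filter sensor noise before this step."""
    # Pairwise differences between every human point and every link centre.
    diff = human_points[:, None, :] - link_centers[None, :, :]   # (N, M, 3)
    dists = np.linalg.norm(diff, axis=2) - link_radii[None, :]   # (N, M)
    return float(np.clip(dists, 0.0, None).min())

# Example: three human points near a two-sphere robot model.
human = np.array([[1.0, 0.2, 1.1], [0.9, 0.0, 1.0], [1.4, 0.3, 1.2]])
links = np.array([[0.0, 0.0, 0.8], [0.3, 0.0, 1.2]])
radii = np.array([0.15, 0.10])
print(min_separation(human, links, radii))
```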

2.3. Time-of-Flight Sensors

The use of Time-of-Flight to determine distance can be achieved through other methods besides LiDAR. Time-of-Flight (ToF) cameras evolved as technological innovations from sensing modalities like LiDAR. The distinction with ToF cameras is that, unlike traditional LiDARs, ToF cameras do not require mechanical components to rotate laser detectors [15]. Instead, this technology relies on the flood illumination of a laser or LED illuminator and some form of light sensor array. The sensor array can range from traditional complementary metal-oxide-semiconductor (CMOS) imagers to charge-coupled devices (CCDs), or single-photon avalanche diode (SPAD) arrays [15]. The Time-of-Flight calculation for these devices is the same as Equation (1), but at the per-pixel level. Sensor characterization tests have shown that ToF cameras exhibit reflectivity, precision, and accuracy performance similar to those of their LiDAR counterparts [29]. This makes them great alternative sensing candidates for LiDARs in existing SSM Architectures.
There are two main types of ToF cameras. Each calculates the distance in a different way. One camera type is the pulsed-light camera which measures the Time-of-Flight directly. The structured light emitted by the illuminator goes out into the environment and then returns to the sensor. The light hits the sensor in the same pulsed pattern in which it was transmitted. In addition, the start time of the light transmission is known. Therefore, the transmission time to received time is digitally processed to determine a Time-of-Flight and, in turn, a distance to the target. The other major ToF camera type used is called continuous wave or CW [15,30]. This approach indirectly measures the time of flight. Instead of directly measuring transmission time to receive time, this method measures the phase shift of the modulated light pattern between transmission and reception [15]. Phase measurement is performed by integrating small groups of pulses over time. The larger the integration time, the more detailed the return scene. However, longer integration times mean larger amounts of motion blur in a scene [15]. This motion blur can directly impact an SSM algorithm and/or an object classifier.
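For the continuous-wave case, the measured phase shift maps to distance through the modulation frequency. The sketch below shows this standard relationship under the simplifying assumptions of a single modulation frequency and no handling of the phase-wrapping ambiguity that real CW ToF cameras must resolve.

```python
import math

C = 299_792_458.0  # speed of light [m/s]

def cw_tof_distance(phase_shift_rad: float, mod_freq_hz: float) -> float:
    """Distance from the phase shift measured by a continuous-wave ToF pixel.
    Assumes a single modulation frequency; real sensors resolve the
    phase-wrapping ambiguity, e.g. by using multiple modulation frequencies."""
    return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

# A 20 MHz modulation gives an unambiguous range of c / (2 * f) = ~7.5 m.
print(cw_tof_distance(phase_shift_rad=1.0, mod_freq_hz=20e6))  # ~1.19 m
```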
Both types of cameras have their benefits and hindrances. For instance, the pulsed technique can be used to measure distances much farther than the continuous-wave approach, which is limited to a maximum range of roughly 5 to 10 m depending on the technology. However, in the context of SSM, the desired perception range is generally within a few meters. It is important to note that the maximum range is a relative term, as the optical power of these cameras differs based on their system architecture. For example, the behavior of light follows the inverse square law [15], which rapidly diminishes the optical power of the camera’s illuminator as distance increases. Furthermore, when this phenomenon is extrapolated to a 2D field of view, the optical power in the center of the field will be significantly greater than the optical power at the edges. This phenomenon can be accommodated by diffusing the LED or laser illuminator in a structured manner.
Normally, illumination products have built-in diffusers in front of their lasers to provide a structured field of illumination (FOI) [15]. One must balance the desired imager FOV with the FOI scheme. For example, capturing the entire scene with a single wide-FOV camera may be weighed against using multiple smaller-FOV ToF cameras that can capture narrower point clouds at a further distance [31].
As mentioned above, the coverage of the workspace is important to keep in mind, as blind spots in the perception system can pose risks to the human operating in the collaborative workspace [32]. Depending on the camera architecture, ToF cameras can generate 15–450 frames per second (FPS). The frame rate is a crucial aspect of SSM, as the monitoring system chosen for the workspace should aim for ≥33 Hz for robot speeds of 1.0 m/s. At this refresh rate, a robot traveling at 1.0 m/s moves roughly 3 cm between samples [19]. Too slow an FPS and the SSM algorithm will not be able to adjust the robot trajectory or velocity in time to avoid a collision with a human or object in the workspace. Therefore, it is imperative that the robot operating speed matches the FPS capability of the sensors being used. ToF cameras come in a wide variety of FOVs, resolutions, frame rates, and depth ranges. It is also important to note that these properties impact the power consumption of these units as well.
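The arithmetic behind this sample-rate requirement is simple: the distance the robot (or human) can cover between two perception samples is the speed divided by the frame rate, and this travel must fit within the protective separation margin. A minimal sketch:

```python
def travel_per_cycle(robot_speed_m_s: float, sensor_fps: float) -> float:
    """Distance the robot (or human) can cover between perception samples."""
    return robot_speed_m_s / sensor_fps

# At 33 FPS, a robot moving at 1.0 m/s travels about 3 cm per sample,
# which must fit inside the protective separation margin.
print(travel_per_cycle(1.0, 33.0))   # ~0.030 m
print(travel_per_cycle(1.0, 10.0))   # ~0.100 m for a slower ToF mode
```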
Like LiDAR, the use of ToF cameras in SSM research was seen very early on. In [33], researchers investigated the use of a Mesa Swiss Ranger SR-3100 ToF camera mounted on a robot to avoid collisions with a human in 2008; 2D laser scanners were used as safety guards to prevent the human from getting too close to the robot. Ref. [34] investigated the use of “unsafe devices” for safety critical systems. Swiss Ranger 4000 ToF cameras were used to monitor a workspace and provide data to a dynamic SSM algorithm. This was one of the first papers to demonstrate an SSM approach with dynamic safety zones that followed the robot in real time. ToF was also used in work that combined the PFL features of the cobot in the workspace with the integration of a ToF camera to perform SSM [35]. In this work, constraint compromises were made between the PFL and SSM approaches such that higher robot velocities could be achieved without violating the safety requirements of the ISO/TS 15066 specification. These trades, in turn, allowed for higher productivity of the worker in the experiments [35].
At the higher resolution end of the ToF technology is the Microsoft Kinect family of products. The Xbox Kinect was introduced in 2012 [36], and the Kinect V2 shortly after. These sensors have been extensively used in SSM research. Ref. [37] used a Kinect to feed a kinematic safety control strategy in conjunction with an ABB cobot. The Kinect was also used to observe the skeletal structure of the human and robot in the workspace for danger field research [38]. These experiments explored using a kinetostatic safety field approach. The novelty of this implementation was that the robot was considered the danger source instead of objects in the environment. Kinect V2 devices were used in [39] to implement SSM in full compliance with ISO/TS 15066. This implementation took advantage of on-robot mounting and the built-in human tracking feature of the Kinect V2 to lower the computational complexity of the SSM algorithm, since this method only used spatial tracking of the human as opposed to an off-robot approach, which would require human and robot pose tracking. An off-robot configuration of a Kinect was used to feed a static-dynamic SSM model in [40]. This research found that the computational increase in dynamic SSM reduced the overall safety performance of the algorithm compared to static SSM. Therefore, the researchers investigated a static-dynamic model that traded algorithm complexity for safety performance based on the operational velocity of the robot. In [41], an off-robot Kinect was used for skeletal tracking in a human collision avoidance approach that corrected the robot path to avoid collisions. To generate better skeletal tracking, an improved particle filter (IPF) was applied to the Kinect data to combat positional drift over time. In [42], a Kinect V2 was used to capture the shoulders of the human to define a safety capsule for a danger zone. That safety capsule was then used in the SSM algorithm for the velocity control of an industrial robot. The capsule method generalized the human shape to use a conservative minimum distance instead of an exact distance to lower computational complexity. In [43], a Kinect V2 was fused with thermal camera data. These fused data were fed into an SSM model, which classified different workspace risks including human operators and non-operators and key danger points for the operator. The combination of danger points and the separation distance generated a slightly increased cycle time, but also significantly increased safety.
This Microsoft Kinect ToF technology was also integrated into the Microsoft HoloLens [44]. The HoloLens is a head-mounted AR device which uses an embedded processor for computation, effectively acting as a smaller form factor variant of the Kinect ToF technology. In [45], a Microsoft HoloLens 2 was used to provide a heads-up display to workers in an SSM workspace while also running the SSM algorithm. In this work, the digital twin was mapped to the world frame via physical fiducial anchors placed on objects throughout the workspace. The HoloLens would home in on the fiducial anchors and then use the built-in ToF camera and IMU to perform SLAM and maintain the orientation of the digital twin and the location of the human in the scene, who was estimated as a geometric primitive for minimum distance computation.
The Azure Kinect Development Kit (AKDK), seen on the left in Figure 4 [31], was the next iteration of the RGB ToF module, released in 2020. This product captured 3D point clouds in both narrow and wide FOV modes. RGB depth overlays were still supported, as they were in the previous Kinect products. The sensor exhibited a refresh rate that ranged from 10 to 30 FPS depending on whether the camera was operating in megapixel or quarter-megapixel mode. Although Microsoft stopped producing AKDKs, in 2023 Microsoft announced a collaboration with Orbbec, who makes the Femto Bolt seen on the right in Figure 4 [46]. The Femto Bolt offers the same performance as the AKDK modules, and Orbbec provides a software wrapper to maintain compatibility with the Microsoft SDK originally made for the AKDK module [47].
At the lower cost, lower performance end of the spectrum is the 1D ToF sensor. An example of this sensor type is the STMicroelectronics VL53L1X Time-of-Flight sensor seen in Figure 5. Although this device only has a 16 × 16 SPAD array resolution, it has a sample rate of 50 FPS at 1 m and 30 FPS at 4 m. These devices are low-cost and easy to integrate into an on-robot SSM application. Several works rely on this sensor type as the perception component of their SSM architecture.
A sensor skin was made of VL6180X 1D ToF sensors in [48] to provide collision avoidance capabilities within a 300 mm range on an industrial robot. The method did not claim to comply with the formal SSM specification; however, static safety zones were used to define full-speed, reduced-speed, and fully stopped robot states. The spatial resolution of the skin in this application was 30 mm, set by the physical spacing between sensors on the robot. Further investigation was conducted in [49] to integrate these sensors with capacitive sensing modules. The purpose of this configuration was to generate a skin that provided both proximity and tactile sensing. The mixed capacitive and proximity skin minimized the blind spots of the originally presented proximity-only skin [49]. These authors continued to investigate 1D ToF sensing implementations in [50] through the construction of a sensor ring made up of 24 VL53L0X sensors. These devices can detect targets up to 1000 mm away with a 20 Hz refresh rate. Compared to other ring approaches like [32], this ring did not point the sensors directly normal to the robot link as in [51]. Instead, the 24 sensors were oriented 45° off the end effector to generate a cone-shaped downward light curtain focused around the tool rather than generally out into the workspace. Moving away from the cone TCP approach, Ref. [52] researched the configuration of the VL53L0X sensor in a string topology. Their experimental setup used 48 sensors in a rigid-flex PCB assembly. In [53], the proximity and capacitive sensor skin was improved by moving from the VL6180X sensors to the VL53L0X sensors. The updated skin still benefited from capacitance features, while also improving range and sampling frequency.
There has also been research on lower-cost 3D ToF cameras that outperform 1D ToF sensors but do not provide the same level of resolution as the 3D ToF cameras mentioned above [46]. One example of these devices is the Arducam ToF camera [54], which is capable of streaming 240 × 180 pixels at 30 FPS. This device was used in [55] to investigate interface latency between the perception system and the safety response of the robot in SSM architectures. The 3D ToF camera streamed depth data to an NVIDIA Jetson Nano, which ran a YOLOv8n model for human detection in the SSM architecture.
In the SSM setting, the key benefit of ToF technology is that it provides the same direct distance measurement as LiDAR, but at a significantly lower price point. Furthermore, the size and power constraints of ToF cameras and sensors are much lower than their LiDAR counterparts. This provides an opportunity for ToF sensors and cameras to be integrated into both on-robot and off-robot reference frames to generate holistic point clouds of the entire collaborative workspace.

2.4. Radar Sensors

In addition to light-based perception, electromagnetic waves, specifically in the radio frequency (RF) range (greater than 24 GHz), can be used to detect objects and targets in a scene [56]. These radar sensors, which operate in a mode called frequency modulated continuous wave (FMCW), can be used to determine a number of spatial elements in one reading. These sensors transmit RF waves across a linear spectrum of frequencies in what is called a chirp configuration [56]. This chirp bounces off the environment and is received by the receiver antenna on the radar. The raw received signal is then passed through a fast Fourier transform (FFT) to decode the spatial data of the signal [56]. Depending on the number of transmitters and receivers, these radar sensors have the ability to determine the distance, velocity, and angle between the object and the radar [56]. The distance d from a single object only requires a single transmitter and receiver. The relationship between distance d, phase Φ 0 , and wavelength λ is demonstrated in Equation (2) [56].
$\Phi_0 = \frac{4 \pi d}{\lambda}$ (2)
When the chirp is transmitted and then returned with an object in the radar field of view, there will be a frequency difference $f_0$ between the transmitted and returned chirp. After the received signal passes through an FFT, the single object will be expressed as a constant frequency component. This frequency is then extracted and passed through Equation (3), where S is the chirp slope, to determine the distance to the target.
$f_0 = \frac{2 S d}{c}$ (3)
In the event that multiple objects are detected, the FFT will express multiple constant frequency components [56]. This return behavior can be thought of as the “point cloud” perceived by the radar. Multiple chirps are needed to measure the velocity of objects in an environment. By sending two chirps that are distinctly spaced in time, the returned signals will have the same frequency components but different phases. The velocity of the object can be determined by passing the phase difference through Equation (4), where $T_c$ is the time between chirps.
$v = \frac{\lambda \Delta\Phi}{4 \pi T_c}$ (4)
Multiple chirps and a second FFT are required in order to measure multiple object velocities in the field of view. The returned signals will have different frequency components and phases. The signals must be passed through an FFT a second time, also known as a Doppler FFT, to distinguish the phase differences for each frequency component or object [56]. Angle measurement requires the inverse architecture: whereas velocity estimation uses multiple chirps on a single antenna pair, angle estimation requires a single transmitter and at least two receivers [56]. The distance between the two receiver antennas, l, is a known separation. Angle computation is performed by sending a chirp transmission, waiting for the signal to be received by both receive antennas, and then taking the phase difference between the two returned signals. The known antenna separation and the phase difference are then used to calculate the angle of arrival (AoA) in Equation (5). The estimated AoA using Equation (5) is most accurate near 0 degrees and least accurate near the maximum angle.
$\theta = \sin^{-1}\left(\frac{\lambda \Delta\Phi}{2 \pi l}\right)$ (5)
Different design parameters impact different performance metrics of these radar sensors. The separation distance between the receive antennas determines the maximum angular field of view, defined by Equation (6). A total FOV of ±90° (180°) is achieved when the receive antenna separation distance is half the signal wavelength [56].
$\theta_{max} = \sin^{-1}\left(\frac{\lambda}{2 l}\right)$ (6)
Additionally, the number of frequencies within a chirp pattern affects the frequency resolution, which is a parameter similar to the density of the point cloud. Furthermore, the computational front-end device affects the quality of chirp transmission, return, and computation.
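A small sketch of how these relationships are applied in practice is given below, using Equations (3)–(5) to recover range, radial velocity, and angle of arrival from a measured beat frequency and phase differences. The 60 GHz carrier, chirp slope, chirp period, and antenna spacing are illustrative assumptions, not parameters of any particular mmWave device.

```python
import math

C = 299_792_458.0  # speed of light [m/s]

def range_from_beat(beat_freq_hz: float, chirp_slope_hz_per_s: float) -> float:
    """Invert Eq. (3): f0 = 2*S*d/c  ->  d = c*f0 / (2*S)."""
    return C * beat_freq_hz / (2.0 * chirp_slope_hz_per_s)

def velocity_from_phase(delta_phi_rad: float, wavelength_m: float,
                        chirp_period_s: float) -> float:
    """Eq. (4): v = lambda * delta_phi / (4*pi*Tc)."""
    return wavelength_m * delta_phi_rad / (4.0 * math.pi * chirp_period_s)

def aoa_from_phase(delta_phi_rad: float, wavelength_m: float,
                   antenna_spacing_m: float) -> float:
    """Eq. (5): theta = asin(lambda * delta_phi / (2*pi*l)), in radians."""
    return math.asin(wavelength_m * delta_phi_rad /
                     (2.0 * math.pi * antenna_spacing_m))

# Illustrative 60 GHz parameters (wavelength ~5 mm, lambda/2 antenna spacing).
lam = C / 60e9
print(range_from_beat(beat_freq_hz=200e3, chirp_slope_hz_per_s=30e12))  # ~1.0 m
print(velocity_from_phase(0.5, lam, 50e-6))                             # ~4.0 m/s
print(math.degrees(aoa_from_phase(0.8, lam, lam / 2)))                  # ~14.8 deg
```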
Several silicon manufacturers have made Application-Specific Integrated Circuits (ASICs) for radar sensor signal processing. Texas Instruments has a mature product line of mmWave radar sensors [57]. The name of the product line denotes the wavelength of the transmitted and returned signals produced by these sensors. The mmWave ICs seen on the products in Figure 6 combine chirp generation, return signal filtering, sampling, and frequency domain processing. These radar ICs are built into multiple transmit and receive antenna structures. The antennas can be routed onto the printed circuit board near the chip or even onto the chip package itself, which is called an antenna-on-package (AoP) mmWave sensor [58]. With the multi-transmitter and multi-receiver antenna structure, these sensors can determine the position, velocity, and AoA of multiple targets in a single frame capture. The frame rate is configurable from roughly 1 to 30 FPS.
To further enhance the resolution performance of the mmWave sensors, TI has integrated Capon beamforming radar data filtering into its mmWave SDK [60]. This method uses a steering vector to focus the direction of the receive antennas toward particular regions in the environment. During steering, the algorithm also dynamically changes weights to select the best angular resolution based on what is observed in the environment. Using this approach over the traditional FFT method described above greatly boosts the angular resolution of the sensors, but at the cost of added computation. The intensity of the filtering can be adjusted to trade resolution against computational load [60].
Although the “point cloud” returned from these devices will not be as rich as those of their LiDAR counterparts, they have some robustness and other performance qualities that LiDARs can struggle to overcome or achieve. For example, LiDARs are a light-based sensing modality, which means that environmental obstructions, such as fog, smoke, or mist, will affect the accuracy of the point-cloud distance because these mediums will refract some of the LiDAR transmission signal. Radar innately will not be impacted by these types of mediums due to the nature of the sensing modality. In addition, radar has the ability to transmit through several solid materials, including plastics, glass, and soil. Radar sensors will also require less computational power and energy to make spatial measurements compared to a LiDAR sensor [16]. Lastly, a key performance feature is that mmWave sensors have been tuned to detect living objects within the FOV of a sensor [61].
Living object detection has a number of key applications in the automotive and industrial spaces for safety. In-cabin radar on cars and industrial vehicles can determine the presence of a human and detect their heart rate and breathing rate. As a result, these sensors can alert a driver of occupancy in the vehicle and help prevent a parent from unintentionally leaving their children in a car during extreme heat or cold, leading to accidental death [61]. In the world of HRC, this key feature could be used to locate a worker and measure their biometrics while operating in the hybrid workspace. With these given characteristics, researchers have been applying radar in automotive and industrial applications for well over a decade. In robotics, linear frequency modulated continuous wave (LFMCW) radars have started to be used in SSM and general collision avoidance applications [62,63].
Specific to HRC research, radar has been used for fenceless machine guarding and target tracking for speed and separation monitoring. As mentioned above, in [26], radar sensors were set up as the experimental approach to detect human presence in the workspace. A TI mmWave IWR6843AOPEVM was evaluated against a motion capture (MOCAP) setup for SSM applications in [64]. In the experiment, the radar was mounted at the foot of the robot looking out into the workspace. The radar was found to struggle to detect motion across the workspace (moving left and right), but demonstrated promising tracking for approaching and moving away from the robot. Therefore, although the radar demonstrated lower performance than the motion capture system, the accuracy results for the approach axis generated some promise for base-mounted radars in SSM architectures. A custom radar module solution was presented in [16] and compared to a LiDAR in an SSM application. The radar performance was lower than that of the LiDAR, but it generated enough tracking to motivate future work on optimizing the custom antenna and the FPGA module. The published mean distance error between the LiDAR-measured and radar-measured separation distances was 22.1 cm. Further configuration of the antennas and the DSP processing blocks in the FPGA could provide more performance. Although the error presented is significant, radar-based approaches in SSM are still in the early stages of research. The authors of [16] looked to fuse radar with other modalities in [23]. This work fused radar with ToF and LiDAR to create a more dynamic perception node. Radar was added to the ToF and LiDAR so that micro-Doppler signatures could be used to distinguish humans from other objects in the scene; the wider velocity signature in the micro-Doppler distribution allowed the radar to discern humans from other objects. In [65], two radars were mounted on the base of a table with a UR-10 robot. One radar was dedicated to feeding an LSTM object classifier to determine whether a human or a mobile robot was moving in the scene, while the other radar focused on tracking the moving objects in the workspace. The algorithm in this work then considered the type of objects moving, the velocity of the objects, and the current velocity of the robot before sending an updated velocity command to the robot. The mixed separation distance and classification of objects in the scene generated greater productivity and safety compared to the traditional static and tri-modal SSM approaches.

2.5. Vision Sensors

Vision-based perception modules are based on a fundamental light-based sensing element. The sensors may be configured as stereoscopic cameras in order to compute depth, or operate as independent visible imagers to classify and segment scene data for fusion with LiDAR, ToF, or radar.

2.5.1. Stereo Vision

Stereo vision has been used in computer vision and robotics since the late 1990s [66]. The fundamental principle of stereo vision is to merge two visible imager-based frames with known separation and orientation to compute 3D depth from the camera pair. Stereo cameras, such as the Intel RealSense in Figure 7, can operate on passive light in the environment or use active illumination incorporated into the sensor to provide more controlled lighting of a scene.
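The core relationship behind this depth computation is the pinhole stereo equation, in which depth is the focal length times the baseline divided by the pixel disparity between the two views. The sketch below shows this relation with hypothetical focal length and baseline values; real pipelines first rectify the images and compute a dense disparity map before applying it per pixel.

```python
def stereo_depth(disparity_px: float, focal_length_px: float,
                 baseline_m: float) -> float:
    """Depth from the standard rectified stereo relation Z = f * B / d.
    Illustrative sketch; real pipelines rectify images and compute a dense
    disparity map first."""
    if disparity_px <= 0:
        return float("inf")   # no valid match (e.g. textureless region)
    return focal_length_px * baseline_m / disparity_px

# Example with a hypothetical 50 mm baseline and 600 px focal length.
print(stereo_depth(disparity_px=20.0, focal_length_px=600.0, baseline_m=0.05))
# -> 1.5 m
```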
Stereo cameras have a number of benefits for performing 3D depth computations. For instance, this perception modality is based on traditional CMOS or CCD imagers which can operate at significantly higher frame rates than ToF Cameras or LiDARs. These sensors can easily achieve 60, 90, or even 120 FPS. Furthermore, the imagers selected for these perception modules tend to be global shutter imagers as opposed to rolling shutter. This design decision makes these systems less prone to motion blur in a scene. The depth range of these sensors can extend up to several meters, but can also detect down to 0.3 m [67]. This range is suitable for SSM and other HRC applications.
While stereo vision has many beneficial traits, there are key factors that can hinder its performance in HRC applications. One key issue with stereo vision is its inability to compute depth on textureless surfaces and objects. Stereo vision struggles to compute depth on walls, ceilings, or floors, which could be used as computation references in some applications. However, in HRC applications, the main focus is on computing the distance from the robot to a human. In most cases, humans are highly textured targets due to their shape and the texture of their clothing or uniform, making them easier to see in the stereo point cloud. Due to the high-FPS, multi-megapixel image sensors used in stereo cameras, a significant amount of data must be processed in order to generate a point cloud. However, this hindrance has become less of an obstacle as processing platforms now have increased computational power and dedicated cores for image processing and hardware acceleration.
Within the field of SSM, the first applications of stereo vision were seen in 2011 [68]. In [68], data from three cameras were fused to determine the location of a human in a collaborative workspace. The worker in this experiment wore special clothing with colorized markers to aid in computation of body position. Stereo has also been used many times in conjunction with other perception modalities like ToF [69]. In this work, the ToF point cloud and stereo frame were fused into a single data frame. Over time, the FPS of the sensors increased, and the capabilities of the processors and computational platforms increased as well. This allowed for point cloud computation in conjunction with human pose estimation. This work applied the stereo perception method in both on- and off-robot mounting strategies. In [70], a ZED stereo camera was used to generate a point cloud of the scene while also classifying the human, robot, and objects within the collaborative workspace that both the robot and human could access. The point cloud was binned into a labeled occupancy voxel-grid (LOG). The voxels in the workspace were then associated with the human, robot, or object detected within that portion of the voxel grid. In another work, human keypoints were obtained through an Intel RealSense camera and OpenPose [71]. The novelty of the approach in this work was that the human location was not estimated with a minimum distance but instead calculated to each keypoint. Furthermore, the PFL pain levels of each keypoint were worked into their SSM algorithm, in turn conforming to both the PFL and SSM approaches of ISO/TS 15066. In [72], stereo data were fused with ToF and thermal camera data to generate a fused and human-segmented scene. The spatial sensors subtracted non-dynamic objects from the scene. The Intel RealSense stereo camera was challenged by low-light, smooth surfaces, whereas the Kinect ToF camera provided a precise but much lower resolution point cloud. The fused scene was fed into a retrained YOLOv3 network to account for fused thermal and spatial data to classify the humans in the scene. The minimum distance between the human and robot was then computed within the digital twin to feed the SSM control algorithm for robot velocity control. A full industrial application was tested in [73] using a safety-rated stereo camera called SafetyEye and a safety-rated PSS4000 PLC. The focus of this research was using this safety-rated SSM architecture to perform dynamic changes in the safety zones based on the current state of the robot task. In [74], a RealSense stereo camera was used to feed a predictive human pose model. The aim of this work was to predict the next move of the human instead of just reacting to the current state. The data collected pointed towards the “proactive-n-reactive” approach, generating lower cycle times than the purely reactive methods. The authors then continued this work by developing a planning framework around their predictive method [75]. The continued work maintained the use of the same SSM architecture but added timed path planning based on the predicted human motions within the collaborative workspace. Another stereo application looked to use a ZED2i binocular camera to extract both the human and robot pose within the collaborative workspace [76]. The approach for minimum distance calculation was conducted purely with the stereo camera stream.
The traditional flow of using robot joint states as part of the minimum distance calculation was taken out of the loop to lower computational complexity. For the experiment, the authors trained and optimized a YOLOv5s network to detect the human approaching the robot, making it slow down. The human then stopped near the robot and reached further into the working zone of the robot, making the robot come to a stop before a collision occurred. In [77], a ZED stereo camera fed a human pose estimation model to predict the potential future movements of the worker. This pose prediction was fed into the SSM algorithm to generate the variable speed commands for the robot.

2.5.2. Mono Vision

Traditional RGB cameras have mainly been used as a segmentation or classification tool in conjunction with a depth sensor such as a stereo camera, LiDAR, or ToF camera. Devices like the Kinect are commonly described as “RGB-D” cameras, when in reality the sensor module registers a ToF camera and an RGB camera via the transform between the two camera frames. The significance of this feature is that it enables the RGB frame to be overlaid on top of the 3D point cloud. Furthermore, particular objects, including humans, can be classified and segmented such that algorithms can focus on tracking that particular region of interest within the sensor point cloud [78]. The human classification performed in [78] was used to generate an SSM safety cylinder at the human position. This cylinder was mapped in the digital twin of the experiment to perform the SSM minimum distance calculation.
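A minimal sketch of this RGB-to-depth registration is shown below: points in the depth (ToF) camera frame are transformed by the calibrated extrinsic transform into the RGB camera frame and then projected through the RGB intrinsics, so that color or class labels can be assigned to 3D points. The transform, intrinsic matrix, and points are hypothetical, and lens distortion is ignored.

```python
import numpy as np

def project_to_rgb(points_tof: np.ndarray, T_rgb_tof: np.ndarray,
                   K_rgb: np.ndarray) -> np.ndarray:
    """Project 3D points from the ToF frame into RGB pixel coordinates.

    points_tof: (N, 3) points in the ToF camera frame.
    T_rgb_tof:  4x4 extrinsic transform (ToF frame -> RGB frame).
    K_rgb:      3x3 RGB camera intrinsic matrix.
    Returns (N, 2) pixel coordinates; sketch only, no distortion model.
    """
    homo = np.hstack([points_tof, np.ones((points_tof.shape[0], 1))])
    in_rgb = (T_rgb_tof @ homo.T).T[:, :3]          # transform into RGB frame
    pix = (K_rgb @ in_rgb.T).T
    return pix[:, :2] / pix[:, 2:3]                 # perspective divide

# Hypothetical calibration: 2 cm lateral offset, simple intrinsics.
T = np.eye(4); T[0, 3] = 0.02
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.1, 0.0, 1.0], [-0.2, 0.1, 2.0]])
print(project_to_rgb(pts, T, K))
```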
With the introduction of embedded system platforms that contain hardware acceleration, machine vision and machine learning-based depth processing have significantly improved. Instead of dual cameras for stereo-based depth, monocular-based depth estimation has become accurate enough to estimate the point cloud of a human in a scene with just one image. In [79], a single RGB camera stream was fed into a Jetson Nano that ran a monocular depth estimation model that output segmented human point clouds at 114.6 FPS. The model in this work was called HDES-Net [80]. The model structure was a CNN encoder-decoder, where the decoder stage contained a segmentation branch and a depth branch. At the output of the decoder, the depth and segmentation branches were merged and up-sampled back to the original input resolution. This output point cloud would contain only points on humans segmented within the frame. Initially, the model only ran at 17.23 FPS. The model was then optimized using TensorRT to decrease the inference time. The monocular depth estimation approach is a newer method that requires lower sensor power but more computational power. The rise in performance of monocular-based depth estimation algorithms presents a strong opportunity for these models to start being applied in SSM architectures. The approach in [79] was only a single example and does not solve many of the challenges the current models are aiming to solve. In [79], the model generated relative depths by using discrete binary classification and was specifically trained on the optics of the Microsoft Kinect. Although not optimized for embedded applications, models like Metric3Dv2 tackle the metric depth and zero-shot aspects of monocular depth estimation [81]. Metric3Dv2 can output absolute depth (metric depth) and is not dependent on particular camera optics or focal length from its training dataset (zero-shot). These powerful features come at the cost of longer inference times; however, more research on these model structures will eventually lead to hardware-optimized solutions. One of the most recent monocular depth estimation models, Depth Pro, was published by Apple [82]. This model is novel because it is one of the first multiscale Vision Transformer (ViT)-based monocular depth models that outputs zero-shot metric depth with unprecedented accuracy. Although monocular depth has a range of approaches, there are models and architectures that have the potential to fit SSM architectures today.

2.6. Thermal Vision

One modality that has a key characteristic for identifying humans within a scene is thermal imagery. Thermal perception modules utilize a fundamental temperature-based sensing element. Thermal imager development dates back to the 1960s for a number of scientific and military applications. These sensors could monitor the thermal profile of the surface of the earth or monitor geological zones with higher thermal activity. Over time, thermal imaging innovations scaled down the size of the sensors, and the technology found its way into the commercial sector through companies such as FLIR. The two major thermal imaging modalities commonly used are microbolometers and thermopiles. Thermopiles measure the thermal response using the same mechanics as thermocouples: the dissimilar metals in the thermocouple elements generate a voltage based on the temperature detected by the device. These elements are grouped into arrays to create sensors that generate a thermal image, such as the TeraRanger Evo in Figure 8 [83]. Microbolometers, on the other hand, are sensors made up of hundreds of silicon-constructed thermistors. The thermal sensing mechanism of a thermistor is its resistance: as the detected temperature changes, the resistance of the sensing element changes [84].
In most industrial applications, humans are warm and mobile compared to the average features in an industrial environment. This generally makes humans appear as a clear feature in a thermally sensed frame of an industrial environment. Several researchers have taken advantage of this modality by combining it with other perception methods such as ToF and stereo vision to map the human thermal signature to the depth point cloud [43,72,85,86]. In [85], data from an IR stereo camera were fused with 360° data from an IR imager to track humans in a collaborative workspace. In [87], Terabee Evo thermal sensors were used to directly relate human temperature to distance; the human temperature reading would decrease as the separation distance increased. Another thermal fusion application was evaluated in [72], where the authors fused stereo, ToF, and thermal data to train their depth estimation and human classification model. The thermal imager used in [72] was the Optris PI 450i thermal camera, which has a resolution of 382 × 288 pixels and a refresh rate of 80 Hz [88]. It is important to consider performance and cost metrics for these different thermal modalities and devices. The sensor used in [86] is a thermopile-based sensor; therefore, its refresh rate, or sensor frames per second, is much lower than in other papers that use thermographic cameras or microbolometers. The trade-off with these higher performance imagers is that they cost significantly more than the thermopile devices. The Optris PI 450i in Figure 9 retails at $6300 USD [88], whereas the Terabee TeraRanger Evo Thermal sensor retails for under $100 USD [83]. However, it is important to note that Terabee no longer actively produces the TeraRanger product line [89].

3. Speed and Separation Monitoring Architecture

Each HRC method has implementation and integration trade-offs in research and industry. However, advances in perception and computation technologies have started to soften the trade-offs required in SSM. These new technologies open the possibility of lowering the cost, power, and complexity of SSM architectures, while simultaneously increasing computational performance, sensor coverage, and data throughput. Figure 10 illustrates the high-level structure of an SSM architecture. The perception sensing, computational units, and system requirements of this architecture are the main focus points of this work. This section details the trade-offs and considerations one must make when constructing an SSM architecture. The perception system mounting, perception sensor performance, system calibration requirements, and architecture computation approach must be well matched to the particular use case or experiment.

3.1. Perception System Mounting

As defined in the previous section, SSM requires on-robot or off-robot perception systems to monitor the speed and separation between the robot and the human. There are significant benefits and considerations when choosing a robot mounting strategy.

3.1.1. Off-Robot

Off-robot systems do not need to consider complex mounting techniques to keep sensors in place on a moving manipulator arm. Off-robot sensor systems usually consist of point-rich sensors such as LiDARs, stereo cameras, ToF cameras, or radar [16,23,63,73]. The off-robot approach also generally provides better sensor coverage and minimizes the chance for occlusions or spatial regions where the perception system is blind. These blind spots can be caused by a particular position of the robot, human, or other environmental obstructions. However, off-robot sensing has very low flexibility from a calibration perspective. Off-robot perception systems must be calibrated so that the robot and perception reference frames are matched to the world frame. This is key for computing the actual minimum distance and velocity between the human and the robot. Once the robot is moved within the environment, or one of the sensors is adjusted, the entire system needs to be recalibrated.

3.1.2. On-Robot

On-robot sensing, however, has significantly more flexibility when it comes to robot placement in a collaborative environment. There are a number of different on-robot configurations in the literature. These include sensor skin approaches as in [12,48,49,52], sensor rings as seen in [22,32,50,51], mid-link mounted sensors demonstrated in [25], and sensors mounted at the wrist or TCP, as performed in [26]. Sensor skins are constructed with lower-cost and lower-resolution sensors. These approaches must account for self-detections and subtract them to prevent false positives from trapping the robot in a protective stop. On-robot perception systems are much more prone to spatial occlusions and blind spots: the robot arm can easily move in front of the perception system and cast a shadow or block the line of sight to the human, making minimum distance and velocity calculations impossible. The sensor skin approach mitigates this issue more than single mounted link or joint sensors. In the case of [25], the mid-link ToF camera served to fill in the occlusions and blind spots of the off-robot LiDARs. In a number of cases, SSM architectures elected to use both on- and off-robot perception. Compared to off-robot perception structures, once the on-robot sensor frame transformation is calibrated or found, it never needs to be recalculated unless the on-robot sensor is moved. For off-robot sensors, if the robot is moved within the workspace, a calibration step is necessary to reorient the perception and robot kinematics within the global reference frame.

3.2. Perception Sensor Performance

The perception technology selected for an SSM architecture drives the observability of the workspace while the SSM algorithm is executing. There are many dimensions to sensor performance when building the perception system. As previously mentioned, sample rate, resolution, coverage, calibration, and physical mounting all play a role in architecture performance. Table 1 provides a comparison of different sensor products seen in SSM research and the parameters one must take into consideration when specifying a sensor for an SSM architecture. The details of how specifications were found or calculated are discussed in Section 3.2.1, Section 3.2.2 and Section 3.2.3.

3.2.1. Sample Rate

The sample rate, or frame rate, of a particular sensor limits the refresh rate of the minimum distance calculation. A 60 FPS depth sensor provides twice as much data as a depth sensor that provides only 30 FPS. Without considering other architecture requirements or computation loop speed, the sensor sample rate is the fastest rate at which the SSM algorithm can recalculate the minimum distance. As mentioned above, frame rate dictates the maximum velocity at which the robot can operate. At 33 FPS, a robot traveling at 1 m/s can move roughly 3 cm between successive frames. Increasing robot velocity without a matching increase in frame rate increases the amount of motion that goes unobserved between samples. In addition to influencing the algorithm loop frequency, the sensor sample rate determines the bandwidth available for filtering the distance data. An example of this is the use of a windowed averaging filter to soften the noise, or standard deviation, of the distance data. At a minimum, a window filter needs two received data points to generate an averaged output. Therefore, the larger the filter window, the more input data points are required to generate a single filtered data point at the output. Low frame rate sensors (10–15 FPS) significantly limit filtering options, while higher frame rate sensors (60–100 FPS) increase the available filtering bandwidth without significantly jeopardizing the SSM algorithm loop time.
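To make these two effects concrete, the following minimal Python sketch computes the per-frame displacement for an assumed robot speed and sensor frame rate, and applies a simple windowed (moving-average) filter to a stream of distance readings. The function names, window size, and sample values are illustrative and are not taken from any cited architecture.

```python
from collections import deque

def displacement_per_frame(robot_speed_mps: float, frame_rate_fps: float) -> float:
    """Distance the robot can travel between two successive sensor frames."""
    return robot_speed_mps / frame_rate_fps

class WindowedAverage:
    """Simple moving-average filter over the last `window` distance samples."""
    def __init__(self, window: int = 5):
        self.samples = deque(maxlen=window)

    def update(self, distance_m: float) -> float:
        self.samples.append(distance_m)
        return sum(self.samples) / len(self.samples)

if __name__ == "__main__":
    # At 33 FPS and 1 m/s, roughly 3 cm can elapse between frames.
    print(f"{displacement_per_frame(1.0, 33):.3f} m per frame")

    # Filtering a noisy stream of minimum-distance readings (meters).
    filt = WindowedAverage(window=4)
    for raw in [1.52, 1.49, 1.55, 1.47, 1.50]:
        print(f"raw={raw:.2f}  filtered={filt.update(raw):.3f}")
```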

3.2.2. Coverage

The aim of sensor coverage in an SSM architecture is to minimize system blind spots. This work calculates a coverage metric for a common example scenario in order to provide a more uniform comparison across sensor modalities. The coverage values provided in Table 1 compute the area of coverage based on the FOV of each sensor at a range of 1.2 m. The only exception is the SHARP IR sensor, whose metric was calculated at its maximum range of 0.8 m. Equation (7) was used to calculate the length $l_{FOV}$ for the diagonal, horizontal, or vertical FOV $\theta$ of the optical sensors. The IR coverage slice was modeled as a circle according to Equation (8). The frustum slices for the stereo and ToF cameras were modeled as rectangles according to Equation (9). The LiDAR coverage was also modeled as a rectangle, where the length was found using Equation (7) and the width was the unwrapped circumference of a circle with ranging radius r of 1.2 m, per Equation (10).
l_{FOV} = 2 r \tan\left(\frac{\theta}{2}\right)  (7)
\mathrm{Coverage}_{circular} = \pi \left(\frac{l_{DFOV}}{2}\right)^2  (8)
\mathrm{Coverage}_{rectangular} = l_{HFOV} \cdot l_{VFOV}  (9)
\mathrm{Coverage}_{unwrapped} = l_{VFOV} \cdot (2 \pi r)  (10)
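As a concrete reference for how the Table 1 coverage values can be reproduced, the short Python sketch below implements Equations (7)–(10) as reconstructed above. The FOV angles and ranges used in the usage example are hypothetical placeholders rather than specifications of any particular product.

```python
import math

def fov_length(theta_deg: float, r: float) -> float:
    """Equation (7): length spanned by a field of view theta at range r."""
    return 2.0 * r * math.tan(math.radians(theta_deg) / 2.0)

def coverage_circular(dfov_deg: float, r: float) -> float:
    """Equation (8): circular slice, e.g., a single-zone IR sensor."""
    return math.pi * (fov_length(dfov_deg, r) / 2.0) ** 2

def coverage_rectangular(hfov_deg: float, vfov_deg: float, r: float) -> float:
    """Equation (9): rectangular frustum slice for stereo/ToF cameras."""
    return fov_length(hfov_deg, r) * fov_length(vfov_deg, r)

def coverage_unwrapped(vfov_deg: float, r: float) -> float:
    """Equation (10): 360-degree LiDAR slice unwrapped into a rectangle."""
    return fov_length(vfov_deg, r) * (2.0 * math.pi * r)

if __name__ == "__main__":
    r = 1.2  # evaluation range used in Table 1 (m); the IR sensor uses 0.8 m
    print(f"Hypothetical stereo camera, 87 x 58 deg FOV: {coverage_rectangular(87, 58, r):.2f} m^2")
    print(f"Hypothetical LiDAR, 30 deg vertical FOV:     {coverage_unwrapped(30, r):.2f} m^2")
    print(f"Hypothetical IR sensor, 10 deg DFOV at 0.8 m: {coverage_circular(10, 0.8):.4f} m^2")
```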
The azimuth area was selected for radar coverage to provide some relative comparison to the optical modalities. It is important to note that this metric only provides a relative comparison. Due to the difference in operating principles, the radar coverage must be tuned to the particular use case in which it is applied. Key parameters used to tune the radar coverage area include the chirp configuration, range bin size, and angle bin size. Furthermore, the FFT signal transmission method versus the Capon beamforming approach can significantly change sensor performance [60]. The use case presented in Table 1 considers an RS-6843A [91] operating in Capon beamforming mode, as previously discussed in Section 2.4. The azimuth area, $A_{azimuth}$, was calculated in Equation (11) as an arc centered at 1.2 m, with an outer radius R of 1.5 m and an arc thickness of 0.6 m. A range bin $d_{rbin}$ of 0.075 m was derived from the maximum radar range of 9.62 m over 128 ADC samples in the mmWave Sensing Estimator [92]. The FOV $\theta$ of ±60° was based on the FOV used in an indoor human detection application [93]. From this coverage calculation, the azimuth area was found to be 1.508 m². Figure 11 provides a visual representation of the computation.
A_{azimuth} = \left[ (R + d_{rbin})^2 - R^2 \right] \frac{\theta}{2}  (11)
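The reported 1.508 m² value can be reproduced by treating the coverage region as an annular sector, assuming an inner radius of 0.9 m and an outer radius of 1.5 m (an arc centered at 1.2 m with 0.6 m thickness) swept over ±60°. The sketch below makes that assumption explicit; it is an interpretation of the stated geometry, not code from the cited works.

```python
import math

def annular_sector_area(r_outer: float, r_inner: float, theta_deg: float) -> float:
    """Area of an annular sector spanning theta_deg between r_inner and r_outer."""
    theta = math.radians(theta_deg)
    return 0.5 * (r_outer**2 - r_inner**2) * theta

if __name__ == "__main__":
    # Arc centered at 1.2 m with 0.6 m thickness -> inner 0.9 m, outer 1.5 m,
    # swept over +/- 60 degrees (120 degrees total).
    area = annular_sector_area(1.5, 0.9, 120.0)
    print(f"A_azimuth = {area:.3f} m^2")  # ~1.508 m^2, matching the value in the text
```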
Any region in which the perception system cannot see the human or calculate the minimum distance poses a risk of collision. Coverage is heavily influenced by the FOV achievable by the sensors in the architecture. In [51], the sensors selected for the on-robot sensing had a 25° FOV. Therefore, it was understood that the eight-ring sensor approach contained blind spots within a meter of the sensor ring. However, during experimentation, it was found that these blind spots were acceptable for the particular experiments conducted in [51]. In general, larger-FOV sensors make high workspace coverage less challenging: fewer individual sensors are needed compared to lower-FOV solutions, in turn lowering the number of sensors a computational platform must communicate with. However, larger coverage generally means more data collected for computation. Therefore, balancing coverage requirements with computational bandwidth must be done on a case-by-case basis.

3.2.3. Point Density

Sensor point density (PD) is a metric defined in this paper as the number of points over the coverage area calculated in the previous section (pts/m²). The PD for the optical modalities was computed by dividing the sensor resolution by the coverage metric directly. In the case of radar, the point density $PD_{azimuth}$ was found using Equation (12). The equation multiplies the number of angle bins, $r_{arc}/\theta_{bin}$, by the number of range bins, $r_{arc}/d_{rbin}$, and then divides by the azimuth coverage area $A_{azimuth}$.
PD_{azimuth} = \frac{\left(\dfrac{r_{arc}}{\theta_{bin}}\right)\left(\dfrac{r_{arc}}{d_{rbin}}\right)}{A_{azimuth}}  (12)
The sensor PD metric can express key information about how a sensor will perform in an SSM architecture. The PD of the sensor directly impacts the granularity of the SSM algorithm; in other words, the range and transitions between danger states become much coarser when the selected sensor modality has limited PD. From a safety perspective, a lower-PD sensor provides fewer changing depth points when the robot and human are on a path to collide. Depending on the human and robot velocities, a higher-PD sensor can provide more transition depth points to the SSM algorithm, increasing the likelihood that the algorithm can lower the robot velocity before a collision occurs. However, PD is only one metric and does not tell the whole story. The radar PD listed in Table 1 is only 1108.779 pts/m²; however, the radar modality can directly track target velocities in a less computationally intensive manner than the point-rich optical modalities in Table 1.
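For illustration, the point density calculations can be sketched as follows: optical PD as resolution over coverage area, and radar PD as the product of angle-bin and range-bin counts over the azimuth area. The resolution, coverage, and bin counts in the usage example are hypothetical and do not correspond to any specific entry in Table 1.

```python
def optical_point_density(h_res: int, v_res: int, coverage_m2: float) -> float:
    """Optical PD: sensor resolution divided by its coverage area (pts/m^2)."""
    return (h_res * v_res) / coverage_m2

def radar_point_density(n_angle_bins: int, n_range_bins: int, azimuth_area_m2: float) -> float:
    """Radar PD: number of angle bins times number of range bins over the azimuth area."""
    return (n_angle_bins * n_range_bins) / azimuth_area_m2

if __name__ == "__main__":
    # Hypothetical 640 x 480 ToF camera covering ~2.0 m^2 at 1.2 m.
    print(f"ToF PD:   {optical_point_density(640, 480, 2.0):,.0f} pts/m^2")
    # Hypothetical radar slice: 16 angle bins x 8 range bins over 1.508 m^2.
    print(f"Radar PD: {radar_point_density(16, 8, 1.508):,.1f} pts/m^2")
```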

3.2.4. Calibration

The accuracy and precision of the computed minimum distance are significantly influenced by how the SSM architecture is calibrated. Even if high-performing sensors are used in an SSM architecture (high resolution, FOV, FPS, etc.), the system will perform poorly without proper calibration. On- or off-robot, it is crucial that the physical location of a sensor in an SSM architecture is mapped accurately to the virtual world frame. The calibration process varies depending on the sensor modality used. Vision-based sensors benefit from hand-eye calibration toolkits [94]. The approach in [94] pairs images of a checkerboard captured by the sensor with the corresponding robot pose at the time of capture. The data are then passed to a solver to generate the frame transformation matrix between the sensor frame and the world frame. Other non-traditional imager-based systems such as LiDARs and radars must use different techniques for sensor-to-world-frame calibration. One technique used for LiDARs is the physical mapping of the LiDAR position in a motion capture system [29]. In [29], retroreflective markers were placed on the base of the LiDAR module to form a rigid body in the motion capture software. The coordinates and orientation of the rigid body center were published from the motion capture software, and the rigid body center was then used to construct a transform from the LiDAR to the robot end effector.
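Once an extrinsic calibration has been solved, whether from a hand-eye toolkit or a motion-capture rigid body, applying it amounts to multiplying sensor-frame points by a 4 × 4 homogeneous transformation matrix. The following minimal sketch illustrates that step; the matrix and points are made-up values for demonstration only.

```python
import numpy as np

def to_world_frame(T_world_sensor: np.ndarray, points_sensor: np.ndarray) -> np.ndarray:
    """Apply a 4x4 homogeneous transform to an (N, 3) array of sensor-frame points."""
    n = points_sensor.shape[0]
    homogeneous = np.hstack([points_sensor, np.ones((n, 1))])   # (N, 4)
    return (T_world_sensor @ homogeneous.T).T[:, :3]            # back to (N, 3)

if __name__ == "__main__":
    # Illustrative extrinsic: sensor rotated 90 degrees about Z and offset 0.5 m in X.
    T = np.array([[0.0, -1.0, 0.0, 0.5],
                  [1.0,  0.0, 0.0, 0.0],
                  [0.0,  0.0, 1.0, 0.0],
                  [0.0,  0.0, 0.0, 1.0]])
    pts = np.array([[1.0, 0.0, 0.2],    # e.g., a detected human point
                    [0.3, 0.4, 1.1]])
    print(to_world_frame(T, pts))
```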
Many of the 1D ToF approaches use the sensors out of the box with factory calibration. In [95], an extrinsic calibration approach is explored that solves for a best-fit plane by matching robot poses with measured distances of VL53L3CX and VL6180X ToF sensors mounted on the robot end effector. The calibration method is taken a step further in [96] by directly using the histogram logged by TMF882X and VL6180X sensors and passing it into a differentiable rendering pipeline. The pipeline proposed in that work generated significantly better angular, linear, and point error performance than the proprietary methods the sensors use by default to calculate distance. The end application used a calibrated TMF882X sensor to determine the orientation of a cup so that the robot could pick it up properly. Though not a direct SSM application, it is notable that this calibration technique could be used to increase the performance of 1D ToF-based perception nodes in SSM architectures.
The calibration of radar can be performed using motion capture or another target-based approach. In [62], the radar sensors in the SSM workspace were calibrated using a corner reflector to generate a high signal-to-noise-ratio (SNR) target for the sensors. In that work, two calibration techniques were explored. The first took advantage of the forward kinematics of the robot in the workspace and uniform fixturing to estimate the transform from the robot TCP to the sensing point on the radar. Ref. [62] also explored the use of singular value decomposition (SVD) to align the range, azimuth, and elevation of three static radars in an SSM workspace. This approach treated the radar transform alignment problem as a least-squares problem and focused on aligning radar pairs: one radar was defined as the source sensor and used to generate the transform for alignment to each of the other radars in the workspace. It is important to consider the calibration complexity of a particular perception modality before integrating it into an SSM architecture. As mentioned previously, if a complex calibration technique is required for an off-robot sensor scheme, then any time the robot or a sensor is moved within the workspace, the calibration process must be repeated to guarantee accurate operation of the SSM architecture.
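A generic sketch of SVD-based least-squares rigid alignment between corresponding target detections (for example, corner-reflector positions seen by a source radar and by a second radar) is shown below. This is the standard Kabsch-style formulation and is provided only as an illustration of the general technique, not as a reproduction of the specific pipeline in [62].

```python
import numpy as np

def svd_rigid_alignment(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t such that R @ src_i + t ~= dst_i."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative corner-reflector positions observed in the source radar frame.
    src = rng.uniform(0.5, 2.5, size=(6, 3))
    true_R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    dst = src @ true_R.T + np.array([0.2, -0.1, 0.05])
    R, t = svd_rigid_alignment(src, dst)
    print(np.allclose(R, true_R), np.round(t, 3))
```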
The selection, mounting, and calibration of perception sensors are key factors in the construction of an SSM architecture. However, the computational platform also plays a key role as it will be responsible for processing the perception sensor data and running the SSM algorithm itself. There are a range of options that can perform this role and choosing the best-fitting one involves balancing processing capabilities, power consumption, physical footprint, and interface compatibility.

3.3. Computation

As depicted in Figure 10, data processing and the execution of the SSM algorithm are the key computational loads in an SSM system. Sensor data must be captured, filtered, put through feature extraction, and then fed into the SSM algorithm. In turn, the algorithm outputs a command or coefficient that updates the robot behavior via a change in speed or trajectory, or by halting the robot in a safety-rated monitored stop. This section outlines how computational platforms of various processing capabilities have been used in SSM architectures. Table 2 provides examples of computational devices used in SSM research and the considerations needed before selecting a particular computational platform for an SSM architecture.
Many early SSM architectures relied solely on PCs for data processing and algorithm execution. In some cases, one PC was dedicated to sensor data processing, while another ran the algorithm in conjunction with a digital twin [21,63,103]. The introduction of digital twins into SSM architectures became more prevalent as the processing power of GPUs and PCs increased over the years. These virtual models encompass the robot and all sensors in a single calibrated world frame. The real sensor data interact with the virtual robot and are used to help the real robot distinguish between self-colliding sensor data and true on-human data points [51]. As architecture complexity advanced in the field of SSM, more sensors were combined into single digital twin ecosystems to significantly reduce workspace occlusions, which present one of the largest challenges for SSM architectures.
Over the years, data capture and processing have migrated from purely PC-based computational platforms to ones that include microprocessors and microcontrollers. Microcontrollers were used for front-end processing in the 1D ToF SSM sensing approaches of [48,49,52]. In [104], a microcontroller handled the air pressure fan control that provided feedback to the user when they were getting too close to the robot arm. The proximity and tactile skin discussed previously used an Arduino Mega2560 board to handle all sensor processing [12]. Microcontrollers such as the STM32 in Figure 12 can run a simple baremetal loop or a real-time operating system (RTOS). The primary objective of microcontrollers in an SSM architecture is the capture, filtering, and transmission of sensor data to a downstream PC, which then uses the transmitted distance data as input to the SSM algorithm. This approach is widely used with the more cost-effective 1D ToF sensors, IR proximity sensors, and thermopiles.
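A minimal sketch of the downstream-PC side of this pattern is shown below, assuming the microcontroller streams newline-delimited distance readings in millimeters over a USB serial link. The port name, baud rate, and message format are assumptions made for illustration; the sketch uses the pyserial library.

```python
import serial  # pyserial

def stream_min_distance(port: str = "/dev/ttyACM0", baud: int = 115200):
    """Read newline-delimited distance readings (mm) from a microcontroller and
    yield them in meters for the downstream SSM algorithm."""
    with serial.Serial(port, baud, timeout=0.1) as link:
        while True:
            line = link.readline().decode(errors="ignore").strip()
            if not line:
                continue
            try:
                yield int(line) / 1000.0   # mm -> m
            except ValueError:
                continue                   # skip malformed frames

if __name__ == "__main__":
    for distance_m in stream_min_distance():
        print(f"minimum distance: {distance_m:.3f} m")
```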
In recent years, a new type of processing platform has entered the field of SSM research: embedded system modules. These modules include devices such as the Raspberry Pi, the NVIDIA Jetson Orin in Figure 13, and the Intel NUC, and are commonly referred to as system on module (SOM), system on chip (SOC), or single-board computer (SBC) products. The strength of these devices is their ability to provide baremetal peripheral interfaces while offering significantly more processing power than a traditional microcontroller or microprocessor. They contain processing cores, hardware accelerators, onboard memory, and gigabit connectivity interfaces such as PCIe and Ethernet. In SSM architectures, these edge units have typically been used to offload point-rich sensor processing from the main algorithm PC or to reduce the number of PCs in complex multi-point-rich sensor systems [16,23,42,79,104]. With AI gaining momentum across industry and research, these embedded platforms have started to not only record and process data, but also use AI to extract key features. They can even provide refined speed and separation data directly to the PC running the SSM algorithms.

4. Materials and Methods

The collection of peer-reviewed papers surveyed in this work spans nearly three decades of general robotics perception research (1996–2024) and includes papers that specifically focus on SSM published between 2008 and 2024. Although sensor and computational citations are noted in the reference section, only SSM conference papers and journal articles that present physical sensor data are included in the results section dataset. The articles were found in peer-reviewed research databases including IEEE Xplore [105], ASME [106], Elsevier [107], Wiley Online Library [108], ProQuest [109], SpringerLink [110], Frontiers [111], MDPI [112], and the Journal of Open Source Software (JOSS) [113]. The citations were imported into a Zotero collection and tagged according to the sensors and computational platforms presented in the work. Figure 14 illustrates the workflow for collecting citations and tagging data. Each paper was analyzed to determine the sensor and computation platforms presented in the design of experiments and results sections of the work. The exported and tagged citation set, along with the processing scripts and figures, is publicly available at [114].
The initial collection of 81 works uploaded to Inciteful [115] consisted of general robotics research papers, sensor characterization papers, and SSM-specific papers that presented physical experiments. Only 40 works in the initial collection were SSM-specific papers. Inciteful examined the connections and relevance among all 81 sources based on cross-citations found within the uploaded group, and 50 of the works in the collection were found to be connected citations. From this network of connections, the tool then suggested additional works of relevance related to the uploaded group. This approach surfaced 21 additional SSM sources that were added to the pool of works in this survey dataset. All sources came from the databases mentioned above and listed in Figure 14.
The works were analyzed to determine the perception methods and computational devices used in the proposed SSM experiments. Each paper was manually tagged with one or several of the labels defined in Figure 14. A single work received multiple sensor and computation tags if the design of experiments in that work presented an SSM architecture with multiple sensors and processors. If an architecture integrated stereoscopic cameras and LiDAR in the SSM experiments, the LiDAR tag and the stereo vision tag were assigned to the work. The same logic was used for the computational platforms observed in a particular work. If an experiment required a microcontroller for data processing and the actual SSM algorithm was executed on a PC, then the work was given both the Baremetal tag and the PC tag. Therefore, although there are only 61 SSM articles in the tagged dataset, there are 79 total sensor tags and 68 computational tags.
Once all peer-reviewed works were tagged within the Zotero citation group, the citation dataset was exported as an Excel file for processing. The data columns of interest consisted of Publication Year, Author, Title, and Manual Tags. Each citation (row) was then parsed to count every tag present in its manual tag cell. The per-citation tag counts were then summed across all citations in the dataset for each publication year. This general categorization is the foundation for all the figures presented in the results section of this work.
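A minimal sketch of this per-year tag counting is shown below, assuming an Excel export with "Publication Year" and "Manual Tags" columns and tag strings mirroring those in Figure 14; the file name is a placeholder, and the actual processing scripts are available at [114].

```python
import pandas as pd

TAGS = ["1DTOF", "3DTOF", "LiDAR", "Radar", "Stereovision", "Monovision",
        "Thermal", "PC", "Embedded", "Baremetal"]

def count_tags_per_year(xlsx_path: str) -> pd.DataFrame:
    """Count how many citations carry each tag, grouped by publication year."""
    df = pd.read_excel(xlsx_path)
    counts = []
    for _, row in df.iterrows():
        tags = str(row.get("Manual Tags", ""))
        counts.append({"Publication Year": row["Publication Year"],
                       **{tag: int(tag in tags) for tag in TAGS}})
    return pd.DataFrame(counts).groupby("Publication Year").sum()

if __name__ == "__main__":
    summary = count_tags_per_year("ssm_citations.xlsx")  # hypothetical export name
    print(summary)
```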

5. Current Trends and Limitations of SSM

Figure 15 shows the general spread of perception modalities and computational platforms found within the citation data set. These graphs demonstrate the general spread seen throughout the papers without temporal context. In the remaining sections, the data will be correlated with the publication dates of these works to show trends with respect to the utilization of these different sensors and computational platforms in SSM experiments.

5.1. Perception Trends

Sensor modalities were tagged and divided into six categories. Per Figure 16, the categories were 1DTOF, 3DTOF, LiDAR, radar, stereo vision, and mono vision. Of these, 3D ToF and LiDAR-based systems were more prevalent across the years than the other categories. Radar and 1D ToF modalities only started to show a greater prevalence in 2018. Additionally, while LiDAR has maintained some integration into SSM architectures, there has been a substantial rise in mono vision, stereo vision, and ToF-based approaches in the past four years.
These sensor trends correlate strongly with the popularity of the sensors available in the field over a given period of time. The Kinect and Kinect V2 could be easily integrated into ROS and were relatively cost-effective compared to far more expensive LiDARs and other advanced vision systems. This likely contributes to the higher counts of ToF and vision-based approaches illustrated in Figure 17. However, 2018 marked an increase in vision-based methods. Both stereo vision and single-image-sensor-based systems have been aided by increased computational capability and the development of easily integrable AI estimation algorithms on embedded platforms.

5.2. Computational Trends

In addition to tagging sensing approaches, each work in this survey collection was tagged for the computational products used in the SSM architectures presented. In Figure 18, the categories of computational platforms were generalized to PC, Baremetal, and Embedded platforms. If a work used multiple computational devices, then that work received multiple computational tags. In general, PCs were the foundational platform for SSM architectures. Then, as seen in Figure 19, there was a significant increase in the number of microcontroller-based architectures from 2018 to 2019. Many of these tags were associated with work researching cost-effective SSM architectures that used lower-resolution depth sensor technology, including 1D ToF and thermopiles. Lastly, over the past several years (2021–2024), there has been an increase in the use of embedded platforms in SSM architectures. Many of these embedded platforms were integrated with point-rich sensing modalities such as stereo vision, ToF, and radar.
Table A1 highlights key works in the SSM architecture citation dataset. These works present foundational experiments or exhibit new innovative approaches for structuring an SSM architecture. Foundational experiments include work that helped define SSM approaches or was one of the first instances to use a particular sensing modality. Innovative approaches include experiments that fused multiple sensing modalities or computational platforms in a unique or innovative manner.

5.3. Scope Limitations

The collection of works in the Current Trends section and the assignment of sensing or computational tags to each work in this dataset were performed manually. There are likely many SSM architecture configurations in the literature that were not captured in this smaller sample size. However, due to the manual approach, the small dataset contains only clear examples of well-constructed SSM architectures.

5.4. Technical Limitations

Key technical advances play an important role in shaping technology trends within SSM research. The rise of a particular technology in SSM architectures is gated by the development of that technology. It is important not to overinterpret the prevalence of I2C ToF distance sensors right after they were brought to market in the mid-2010s; the key to identifying their impact on the field is their continued presence in SSM research five years after their introduction. Embedded system modules have seen a large increase in usage over the past few years; however, this does not guarantee that they are a perfect fit for SSM architectures.

6. Discussion

Overall, research in SSM and the evolution of SSM architectures have been highly active over the past decade. The perception and computational systems in these architectures have increased significantly in complexity, performance, and flexibility. LiDAR-based perception and PC-based computation laid the groundwork for SSM architectures. With the rise of the Kinect and ToF technology as a whole, an observable shift towards off-robot ToF sensing occurred in the mid-2010s. With continued silicon manufacturing improvements and AI development, the late 2010s and early 2020s have demonstrated a continued shift towards using point-rich sensors (3D ToF, stereo/mono vision, radar) in conjunction with embedded system platforms.
From the computational and perception trends, it is clear that SSM architectures must be tailored to their particular use cases. Point-rich sensors such as ToF and stereo vision are well suited to use cases where computational constraints are less restrictive and robot velocities are lower (near 1 m/s). Currently, most image sensors and light-based sensor options can easily achieve frame rates near 30 FPS; if the selected perception method requires filtering, a higher FPS is recommended. Radars within SSM architectures show promise when mounted statically off-robot or on the robot base. They will not provide the same coverage or accuracy as LiDAR, but they cost less and do not depend on the same light-based mechanics as LiDAR, stereo vision, or ToF. Hence, radar pairs well with light-based perception nodes in an SSM architecture to form a fused sensor frame. In terms of computation, if the research focus is on processing power and higher-level collision avoidance control, PCs still provide the highest ceiling of processing power. For SSM use cases and research focused on deployment flexibility, cost reduction, or sensor calibration simplification, baremetal processors work well to serialize data for a PC running the SSM algorithm. Alternatively, for tri-modal or dynamic SSM algorithms that do not intend to perform predictive robot path planning, an embedded system module can be used for sensing and control.
The movement of SSM algorithms, digital twins, and ROS master nodes directly onto embedded system modules remains a significant gap in this field. Each year, embedded platforms see large performance boosts due to the demand for AI at the edge in the automotive and industrial sectors. These platforms can be used not only to process depth and extract features, but also to run the full SSM algorithm and provide updated velocity commands to the robot [45]. As previously discussed in the vision perception section, the monocular depth estimation experiments in [79] were shown to operate effectively on an NVIDIA Jetson Nano. Though those experiments focused on general segmentation and depth estimation of humans, they illustrate the potential for these lower-power SOCs to provide purely mono vision-based point clouds to an SSM architecture. Beyond this work, new models are adding the ability to generate zero-shot metric depth outputs, achieving true depth point clouds regardless of the lens configuration of the RGB camera [81,82].

7. Conclusions

This work has provided an analysis of the perception modalities and computation platforms used to construct SSM architectures over time and has identified several key trends in SSM research. The first trend is increased research into fusing point-rich perception data from radar, ToF, and stereo cameras to generate multi-modal perception frames. The second key observation is the increased use of AI and machine learning to perform feature extraction and human pose tracking, providing a more refined separation distance and velocity heading to the SSM algorithm. On the computational side, once embedded system modules were introduced to the market, researchers began investigating how they could be integrated into SSM architectures. The use of embedded system modules will keep growing as SSM continues to be treated as an AI-at-the-edge use case. Lastly, the rise in monocular depth estimation models presents an opportunity to meet the increasing demand for dense point cloud data, AI/ML-based feature extraction, and edge-based computing. A well-optimized monocular depth model for a purely RGB vision-based perception system could feasibly run the dense point cloud generation, digital twin environment representation, and SSM algorithm all on an embedded system module. Such a purely embedded vision approach to an SSM architecture seeks to meet the SSM performance requirements at lower cost and with a smaller footprint and power budget than the mixed-modal, PC-based SSM architectures used today.

Author Contributions

Conceptualization was performed by O.A.; Formal analysis was done by O.A.; Funding acquisition was performed by F.S.; Investigation was done by O.A., S.A. and K.S.; Methodology was constructed by O.A. and K.S.; Project administration was managed by F.S.; Resources were provided by F.S.; Software was implemented by K.S.; Supervision was done by F.S.; Validation was performed by K.S. and S.A.; Visualization was prepared by K.S. and O.A.; Writing—original draft was done by O.A.; Writing—review and editing was done by S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the National Science Foundation under Award No. DGE-2125362. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Institutional Review Board Statement

Not applicable to this study, as neither humans nor animals were involved.

Data Availability Statement

The original data presented in the study are openly available in survey_SSM_robotics, GitHub at https://github.com/kxs8997/survey_SSM_robotics/tree/main (accessed on 12 March 2025).

Acknowledgments

The authors would like to acknowledge Andrew Redman from D3 Embedded for providing his knowledge and expertise in the field of TI mmWave radar sensors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
MDPI  Multidisciplinary Digital Publishing Institute
SSM  Speed and Separation Monitoring
PFL  Power and Force Limiting
ISO  International Organization for Standardization
ToF  Time-of-Flight
IR  Infrared
HRC  Human Robot Collaboration
HRI  Human Robot Interaction
TCP  Tool Center Point
SBC  Single Board Computer
SOM  System on Module
SOC  System on Chip
IoT  Internet of Things
ADC  Analog to Digital Converter
SVD  Singular Value Decomposition
AGV  Automated Guided Vehicle
AGR  Autonomous Guided Robot
AoA  Angle of Arrival

Appendix A

Table A1. Key Works in SSM Architecture Research.
Reference | Category | Hardware
Zhang, Chenyang; et al. [76] | Monovision, Stereovision | PC
Rakhmatulin, Viktor; et al. [104] | Monovision, MOCAP | Embedded, Baremetal
Ubezio, Barnaba; et al. [62] | Radar | PC
Podgorelec, David; et al. [27] |  | Embedded
Flowers, Jared; et al. [75] | Stereovision | PC
Rashid, Aquib; et al. [25] | LiDAR, Stereovision | PC
Tsuji, Satoshi; et al. [53] | 1DTOF | Baremetal
Tsuji, Satoshi [52] | 1DTOF | Baremetal
Amaya-Mejía, Lina María; et al. [78] | 3DTOF | PC
Yang, Botao; et al. [43] | Thermal, Stereovision | PC
Sifferman, Carter; et al. [95] | 1DTOF | Baremetal
Karagiannis, Panagiotis; et al. [73] | Stereovision | PC, PLC
Lacevic, Bakir; et al. [42] | 3DTOF | PC, Embedded
Park, Jinha; et al. [21] | 3DTOF, LiDAR | PC
Ubezio, Barnaba; et al. [65] | Radar | PC
Costanzo, Marco; et al. [72] | Thermal, Monovision, Stereovision | PC
Scibilia, Adriano; et al. [5] |  | 
Lucci, Niccolo; et al. [35] | 3DTOF | PC
Rashid, Aquib; et al. [17] | LiDAR, Monovision | PC
Du, Guanglong; et al. [41] | 3DTOF, Monovision | PC
Tsuji, Satoshi; et al. [50] | 1DTOF | Baremetal
Glogowski, Paul; et al. [116] | 3DTOF | PC
Svarny, Petr; et al. [71] | Monovision, Stereovision | 
Antão, Liliana; et al. [70] | Stereovision | PC
Kumar, Shitij; et al. [22] | 1DTOF | PC, Baremetal
Benli, Emrah; et al. [85] | Thermal, Stereovision | PC
Lemmerz, Kai; et al. [117] | 3DTOF, Monovision | PC
Kumar, Shitij; et al. [51] | 1DTOF | PC, Baremetal
Hughes, Dana; et al. [12] | 1DTOF | PC, Baremetal
Marvel, Jeremy A.; et al. [18] | LiDAR | PC
Zanchettin, Andrea Maria; et al. [37] | 3DTOF | PC
Marvel, Jeremy A. [13] | LiDAR, Stereovision, MOCAP | PC
Tan, Jeffrey Too Chuan; et al. [68] | Stereovision | PC
Lacevic, Bakir; et al. [118] |  | 
Abbreviations: Monovision—Single image sensor used for human classification, Stereovision—Stereo camera used for depth, MOCAP—Motion capture system used for depth or calibration, Radar—LFMCW radar used for depth, LiDAR—LiDAR used for depth, 1DTOF—1D Time-of-Flight sensors used for depth, 3DTOF—3D Time-of-Flight camera used for depth, Thermal—Thermal imager or sensor used for depth or human classification, PC—PC used for perception processing and/or the SSM algorithm, Embedded—Embedded platform such as an NVIDIA Jetson or Intel NUC used for perception processing, Baremetal—Microcontroller or microprocessor used for perception processing, PLC—Programmable logic controller used for perception processing and/or the SSM algorithm.

References

  1. Barata, J.; Kayser, I. Industry 5.0—Past, Present, and Near Future. Procedia Comput. Sci. 2023, 219, 778–788. [Google Scholar] [CrossRef]
  2. Subramanian, K.; Singh, S.; Namba, J.; Heard, J.; Kanan, C.; Sahin, F. Spatial and Temporal Attention-Based Emotion Estimation on HRI-AVC Dataset. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 4895–4900. [Google Scholar] [CrossRef]
  3. Namba, J.R.; Subramanian, K.; Savur, C.; Sahin, F. Database for Human Emotion Estimation Through Physiological Data in Industrial Human-Robot Collaboration. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 4901–4907. [Google Scholar] [CrossRef]
  4. ISO/TS 15066:2016(en); Robots and Robotic Devices—Collaborative Robots. ISO-International Organization for Standardization: Geneva, Switzerland, 2016.
  5. Scibilia, A.; Valori, M.; Pedrocchi, N.; Fassi, I.; Herbster, S.; Behrens, R.; Saenz, J.; Magisson, A.; Bidard, C.; Kuhnrich, M.; et al. Analysis of Interlaboratory Safety Related Tests in Power and Force Limited Collaborative Robots. IEEE Access 2021, 9, 80873–80882. [Google Scholar] [CrossRef]
  6. Kuka. LBR iiwa. 2023. Available online: https://www.kuka.com/en-us/products/robotics-systems/industrial-robots/lbr-iiwa (accessed on 29 June 2024).
  7. ABB. Product Specification-IRB 14000. 2015. Available online: https://library.e.abb.com/public/5f8bca51d2b541709ea5d4ef165e46ab/3HAC052982%20PS%20IRB%2014000-en.pdf (accessed on 29 June 2024).
  8. UR10e Medium-Sized, Versatile Cobot. Available online: https://www.universal-robots.com/products/ur10-robot/ (accessed on 29 June 2024).
  9. myUR. 2019. Available online: https://myur.universal-robots.com/manuals/content/SW_5_14/Documentation%20Menu/Software/Introduction/Freedrive (accessed on 29 June 2024).
  10. Sharp. GP2Y0A21YK0F. Available online: https://global.sharp/products/device/lineup/data/pdf/datasheet/gp2y0a21yk_e.pdf (accessed on 29 June 2024).
  11. Buizza Avanzini, G.; Ceriani, N.M.; Zanchettin, A.M.; Rocco, P.; Bascetta, L. Safety Control of Industrial Robots Based on a Distributed Distance Sensor. IEEE Trans. Control Syst. Technol. 2014, 22, 2127–2140. [Google Scholar] [CrossRef]
  12. Hughes, D.; Lammie, J.; Correll, N. A Robotic Skin for Collision Avoidance and Affective Touch Recognition. IEEE Robot. Autom. Lett. 2018, 3, 1386–1393. [Google Scholar] [CrossRef]
  13. Marvel, J.A. Performance metrics of speed and separation monitoring in shared workspaces. IEEE Trans. Autom. Sci. Eng. 2013, 10, 405–414. [Google Scholar] [CrossRef]
  14. McManamon, P. LiDAR Technologies and Systems; SPIE Press: Bellingham, UK, 2019. [Google Scholar]
  15. Horaud, R.; Hansard, M.; Evangelidis, G.; Ménier, C. An overview of depth cameras and range scanners based on time-of-flight technologies. Mach. Vis. Appl. 2016, 27, 1005–1020. [Google Scholar] [CrossRef]
  16. Zlatanski, M.; Sommer, P.; Zurfluh, F.; Madonna, G.L. Radar Sensor for Fenceless Machine Guarding and Collaborative Robotics. In Proceedings of the 2018 International Conference on Intelligence and Safety for Robotics (ISR 2018), Shenyang, China, 24–27 August 2018; pp. 19–25. [Google Scholar] [CrossRef]
  17. Rashid, A.; Peesapati, K.; Bdiwi, M.; Krusche, S.; Hardt, W.; Putz, M. Local and Global Sensors for Collision Avoidance. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Virtual, 14–16 September 2020; pp. 354–359. [Google Scholar] [CrossRef]
  18. Marvel, J.A.; Roger, B. Test Methods for the Evaluation of Manufacturing Mobile Manipulator Safety. J. Robot. Mechatron. 2016, 28, 199–214. [Google Scholar]
  19. Marvel, J.A.; Norcross, R. Implementing speed and separation monitoring in collaborative robot workcells. Robot. Comput. Integr. Manuf. 2017, 44, 144–155. [Google Scholar] [CrossRef]
  20. Byner, C.; Matthias, B.; Ding, H. Dynamic speed and separation monitoring for collaborative robot applications–Concepts and performance. Robot. Comput. Integr. Manuf. 2019, 58, 239–252. [Google Scholar] [CrossRef]
  21. Park, J.; Sorensen, L.C.; Mathiesen, S.F.; Schlette, C. A Digital Twin-based Workspace Monitoring System for Safe Human-Robot Collaboration. In Proceedings of the 2022 10th International Conference on Control, Mechatronics and Automation (ICCMA 2022), Luxembourg, 9–12 November 2022; pp. 24–30. [Google Scholar] [CrossRef]
  22. Kumar, S.; Arora, S.; Sahin, F. Speed and separation monitoring using on-robot time-of-flight laser-ranging sensor arrays. In Proceedings of the IEEE International Conference on Automation Science and Engineering, Vancouver, BC, Canada, 22–26 August 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 1684–1691. [Google Scholar] [CrossRef]
  23. Zlatanski, M.; Sommer, P.; Zurfluh, F.; Zadeh, S.G.; Faraone, A.; Perera, N. Machine Perception Platform for Safe Human-Robot Collaboration. In Proceedings of the 2019 IEEE SENSORS, Montreal, QC, Canada, 27–30 October 2019; pp. 1–4. [Google Scholar] [CrossRef]
  24. Rashid, A.; Bdiwi, M.; Hardt, W.; Putz, M.; Ihlenfeldt, S. Efficient Local and Global Sensing for Human Robot Collaboration with Heavy-duty Robots. In Proceedings of the 2021 IEEE International Symposium on Robotic and Sensors Environments (ROSE), Virtually, 28–29 October 2021; pp. 1–7. [Google Scholar] [CrossRef]
  25. Rashid, A.; Alnaser, I.; Bdiwi, M.; Ihlenfeldt, S. Flexible sensor concept and an integrated collision sensing for efficient human-robot collaboration using 3D local global sensors. Front. Robot. AI 2023, 10, 1028411. [Google Scholar] [CrossRef]
  26. Kim, E.; Yamada, Y.; Okamoto, S.; Sennin, M.; Kito, H. Considerations of potential runaway motion and physical interaction for speed and separation monitoring. Robot. Comput. Integr. Manuf. 2021, 67, 102034. [Google Scholar] [CrossRef]
  27. Podgorelec, D.; Uran, S.; Nerat, A.; Bratina, B.; Pečnik, S.; Dimec, M.; žaberl, F.; žalik, B.; šafarič, R. LiDAR-Based Maintenance of a Safe Distance between a Human and a Robot Arm. Sensors 2023, 23, 4305. [Google Scholar] [CrossRef] [PubMed]
  28. Arora, S.; Subramanian, K.; Adamides, O.; Sahin, F. Using Multi-channel 3D Lidar for Safe Human-Robot Interaction. In Proceedings of the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE), Bari, Italy, 28 August–1 September 2024; pp. 1823–1830. [Google Scholar] [CrossRef]
  29. Adamides, O.A.; Avery, A.; Subramanian, K.; Sahin, F. Evaluation of On-Robot Depth Sensors for Industrial Robotics. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 1014–1021. [Google Scholar] [CrossRef]
  30. Li, L. Time-of-Flight Camera—An Introduction; Texas Instruments: Dallas, TX, USA, 2014. [Google Scholar]
  31. Microsoft. Azure Kinect DK Hardware Specifications|Microsoft Learn. 2022. Available online: https://learn.microsoft.com/en-us/previous-versions/azure/kinect-dk/hardware-specification (accessed on 29 June 2024).
  32. Adamides, O.A.; Modur, A.S.; Kumar, S.; Sahin, F. A time-of-flight on-robot proximity sensing system to achieve human detection for collaborative robots. In Proceedings of the IEEE International Conference on Automation Science and Engineering, Vancouver, BC, Canada, 22–26 August 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 1230–1236. [Google Scholar] [CrossRef]
  33. Bonn-Rhein-Sieg, H. Biomechanical Requirements for Collaborative Robots in the Medical Field. Master’s Thesis, RWTH Aachen University, Aachen, Germany, 2009. Available online: https://www.dguv.de/medien/ifa/de/fac/kollaborierende_roboter/medizin_biomech_anforderungen/master_thesis_bjoern_ostermann.pdf (accessed on 29 June 2024).
  34. Vicentini, F.; Pedrocchi, N.; Giussani, M.; Molinari Tosatti, L. Dynamic safety in collaborative robot workspaces through a network of devices fulfilling functional safety requirements. In Proceedings of the ISR/Robotik 2014: 41st International Symposium on Robotics, Munich, Germany, 2–3 June 2014; pp. 1–7. [Google Scholar]
  35. Lucci, N.; Lacevic, B.; Zanchettin, A.M.; Rocco, P. Combining speed and separation monitoring with power and force limiting for safe collaborative robotics applications. IEEE Robot. Autom. Lett. 2020, 5, 6121–6128. [Google Scholar] [CrossRef]
  36. Andersen, M.R.; Jensen, T.; Lisouski, P.; Mortensen, A.K.; Hansen, M.K.; Gregersen, T.; Ahrendt, P. Kinect Depth Sensor Evaluation for Computer Vision Applications; Aarhus University: Copenhagen, Denmark, 2012; pp. 1–37. [Google Scholar]
  37. Zanchettin, A.M.; Ceriani, N.M.; Rocco, P.; Ding, H.; Matthias, B. Safety in human-robot collaborative manufacturing environments: Metrics and control. IEEE Trans. Autom. Sci. Eng. 2016, 13, 882–893. [Google Scholar] [CrossRef]
  38. Parigi Polverini, M.; Zanchettin, A.M.; Rocco, P. A computationally efficient safety assessment for collaborative robotics applications. Robot. Comput. Integr. Manuf. 2017, 46, 25–37. [Google Scholar] [CrossRef]
  39. Rosenstrauch, M.J.; Pannen, T.J.; Krüger, J. Human robot collaboration-using kinect v2 for ISO/TS 15066 speed and separation monitoring. Procedia Cirp 2018, 76, 183–186. [Google Scholar] [CrossRef]
  40. Andres, C.P.C.; Hernandez, J.P.L.; Baldelomar, L.T.; Martin, C.D.F.; Cantor, J.P.S.; Poblete, J.P.; Raca, J.D.; Vicerra, R.R.P. Tri-modal speed and separation monitoring technique using static-dynamic danger field implementation. In Proceedings of the 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM 2018), Baguio City, Philippines, 29 November–2 December 2018. [Google Scholar] [CrossRef]
  41. Du, G.; Liang, Y.; Yao, G.; Li, C.; Murat, R.J.; Yuan, H. Active Collision Avoidance for Human-Manipulator Safety. IEEE Access 2020, 10, 16518–16529. [Google Scholar] [CrossRef]
  42. Lacevic, B.; Zanchettin, A.M.; Rocco, P. Safe Human-Robot Collaboration via Collision Checking and Explicit Representation of Danger Zones. IEEE Trans. Autom. Sci. Eng. 2022, 20, 846–861. [Google Scholar] [CrossRef]
  43. Yang, B.; Xie, S.; Chen, G.; Ding, Z.; Wang, Z. Dynamic Speed and Separation Monitoring Based on Scene Semantic Information. J. Intell. Robot. Syst. 2022, 106, 35. [Google Scholar] [CrossRef]
  44. lolambean. HoloLens 2 Hardware. 2023. Available online: https://learn.microsoft.com/en-us/hololens/hololens2-hardware (accessed on 12 March 2025).
  45. Subramanian, K.; Arora, S.; Adamides, O.; Sahin, F. Using Mixed Reality for Safe Physical Human-Robot Interaction. In Proceedings of the 2024 IEEE Conference on Telepresence, Pasadena, CA, USA, 16–17 November 2024; pp. 225–229. [Google Scholar] [CrossRef]
  46. ORBBEC. Femto Bolt. 2023. Available online: https://www.orbbec.com/products/tof-camera/femto-bolt/ (accessed on 8 March 2025).
  47. ORBBEC. Broadening the Application and Accessibility of 3D Vision. 2023. Available online: https://www.orbbec.com/microsoft-collaboration/ (accessed on 8 March 2025).
  48. Tsuji, S.; Kohama, T. Proximity Skin Sensor Using Time-of-Flight Sensor for Human Collaborative Robot. IEEE Sens. J. 2019, 19, 5859–5864. [Google Scholar] [CrossRef]
  49. Tsuji, S.; Kohama, T. Sensor Module Combining Time-of-Flight with Self-Capacitance Proximity and Tactile Sensors for Robot. IEEE Sens. J. 2022, 22, 858–866. [Google Scholar] [CrossRef]
  50. Tsuji, S.; Kohama, T. A General-Purpose Safety Light Curtain Using ToF Sensor for End Effector on Human Collaborative Robot. IEEJ Trans. Electr. Electron. Eng. 2020, 15, 1868–1874. [Google Scholar] [CrossRef]
  51. Kumar, S.; Savur, C.; Sahin, F. Dynamic Awareness of an Industrial Robotic Arm Using Time-of-Flight Laser-Ranging Sensors. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2018), Miyazaki, Japan, 7–10 October 2018; pp. 2850–2857. [Google Scholar] [CrossRef]
  52. Tsuji, S. String-Like Time of Flight Sensor Module for a Collaborative Robot. IEEJ Trans. Electr. Electron. Eng. 2023, 18, 1576–1582. [Google Scholar] [CrossRef]
  53. Tsuji, S.; Kohama, T. Proximity and Tactile Sensor Combining Multiple ToF Sensors and a Self-Capacitance Proximity and Tactile Sensor. IEEJ Trans. Electr. Electron. Eng. 2023, 18, 797–805. [Google Scholar] [CrossRef]
  54. Arducam. Time of Flight (ToF) Camera for Raspberry Pi. Available online: https://www.arducam.com/time-of-flight-camera-raspberry-pi/ (accessed on 12 March 2025).
  55. Rinaldi, A.; Menolotto, M.; Kelly, D.; Torres-Sanchez, J.; O’Flynn, B.; Chiaberge, M. Assessing Latency Cascades: Quantify Time-to-Respond Dynamics in Human-Robot Collaboration for Speed and Separation Monitoring. In Proceedings of the 2024 Smart Systems Integration Conference and Exhibition (SSI), Hamburg, Germany, 16–18 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  56. Iovescu, C.; Rao, S. The Fundamentals of Millimeter Wave Radar Sensors. 2020. Available online: https://www.ti.com/lit/wp/spyy005a/spyy005a.pdf?ts=1737121458941&ref_url=https%253A%252F%252Fwww.google.com%252F (accessed on 17 January 2025).
  57. mmWave Radar Sensors|TI.com. Available online: https://www.ti.com/sensors/mmwave-radar/overview.html (accessed on 12 March 2025).
  58. IWR6843AOP Data Sheet, Product Information and Support|TI.com. Available online: https://www.ti.com/product/IWR6843AOP (accessed on 12 March 2025).
  59. Radar Sensors. Available online: https://www.d3embedded.com/product-category/radar-sensors/ (accessed on 12 March 2025).
  60. TI. xWRL6432 MMWAVE-L-SDK: 2D Capon Beamforming. Available online: https://software-dl.ti.com/ra-processors/esd/MMWAVE-L-SDK/05_04_00_01/exports/api_guide_xwrL64xx/CAPON_BEAMFORMING_2D.html (accessed on 12 March 2025).
  61. Wang, G.; Munoz-Ferreras, J.M.; Gu, C.; Li, C.; Gomez-Garcia, R. Application of linear-frequency-modulated continuous-wave (LFMCW) radars for tracking of vital signs. IEEE Trans. Microw. Theory Tech. 2014, 62, 1387–1399. [Google Scholar] [CrossRef]
  62. Ubezio, B.; Zangl, H.; Hofbaur, M. Extrinsic Calibration of a Multiple Radar System for Proximity Perception in Robotics. In Proceedings of the 2023 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Kuala Lumpur, Malaysia, 22–25 May 2023; pp. 1–6. [Google Scholar] [CrossRef]
  63. Gietler, H.; Ubezio, B.; Zangl, H. Simultaneous AMCW ToF Camera and FMCW Radar Simulation. In Proceedings of the 2023 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Kuala Lumpur, Malaysia, 22–25 May 2023; pp. 1–6. [Google Scholar] [CrossRef]
  64. Nimac, P.; Petrič, T.; Krpič, A.; Gams, A. Evaluation of FMCW Radar for Potential Use in SSM. In Proceedings of the Advances in Service and Industrial Robotics; Müller, A., Brandstötter, M., Eds.; Springer: Cham, Switzerland, 2022; pp. 580–588. [Google Scholar] [CrossRef]
  65. Ubezio, B.; Schoffmann, C.; Wohlhart, L.; Mulbacher-Karrer, S.; Zangl, H.; Hofbaur, M. Radar Based Target Tracking and Classification for Efficient Robot Speed Control in Fenceless Environments. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021; pp. 799–806. [Google Scholar] [CrossRef]
  66. Moravec, H. Robot spatial perception by stereoscopic vision and 3d evidence grids. Perception 1996, 483, 484. [Google Scholar]
  67. Intel. Intel® RealSenseTM Product Family D400 Series. 2023. Available online: https://www.intelrealsense.com/wp-content/uploads/2024/10/Intel-RealSense-D400-Series-Datasheet-October-2024.pdf?_ga=2.253170190.609063794.1743342439-1801352430.1743342439 (accessed on 29 June 2024).
  68. Tan, J.T.C.; Arai, T. Triple stereo vision system for safety monitoring of human-robot collaboration in cellular manufacturing. In Proceedings of the 2011 IEEE International Symposium on Assembly and Manufacturing (ISAM), Tampere, Finland, 25–27 May 2011; pp. 1–6. [Google Scholar] [CrossRef]
  69. Rybski, P.; Anderson-Sprecher, P.; Huber, D.; Niessl, C.; Simmons, R. Sensor fusion for human safety in industrial workcells. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 3612–3619, ISSN 2153-0866. [Google Scholar] [CrossRef]
  70. Antão, L.; Reis, J.; Gonçalves, G. Voxel-based Space Monitoring in Human-Robot Collaboration Environments. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Zaragoza, Spain, 10–13 September 2019; pp. 552–559. [Google Scholar] [CrossRef]
  71. Svarny, P.; Tesar, M.; Behrens, J.K.; Hoffmann, M. Safe physical HRI: Toward a unified treatment of speed and separation monitoring together with power and force limiting. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Macau, China, 3–8 November 2019; pp. 7580–7587. [Google Scholar] [CrossRef]
  72. Costanzo, M.; Maria, G.D.; Lettera, G.; Natale, C. A Multimodal Approach to Human Safety in Collaborative Robotic Workcells. IEEE Trans. Autom. Sci. Eng. 2021, 19, 1202–1216. [Google Scholar] [CrossRef]
  73. Karagiannis, P.; Kousi, N.; Michalos, G.; Dimoulas, K.; Mparis, K.; Dimosthenopoulos, D.; Tokçalar, Ö.; Guasch, T.; Gerio, G.P.; Makris, S. Adaptive speed and separation monitoring based on switching of safety zones for effective human robot collaboration. Robot. Comput. Integr. Manuf. 2022, 77, 102361. [Google Scholar] [CrossRef]
  74. Flowers, J.; Faroni, M.; Wiens, G.; Pedrocchi, N. Spatio-Temporal Avoidance of Predicted Occupancy in Human-Robot Collaboration. In Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Busan, Republic of Korea, 28–31 August 2023; pp. 2162–2168. [Google Scholar] [CrossRef]
  75. Flowers, J.; Wiens, G. A Spatio-Temporal Prediction and Planning Framework for Proactive Human–Robot Collaboration. J. Manuf. Sci. Eng. 2023, 145, 121011. [Google Scholar] [CrossRef]
  76. Zhang, C.; Peng, J.; Ding, S.; Zhao, N. Binocular Vision-based Speed and Separation Monitoring of Perceive Scene Semantic Information. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; pp. 3200–3205. [Google Scholar] [CrossRef]
  77. Lu, Y.F.; Shivam, K.; Hsiao, J.C.; Chen, C.C.; Chen, W.M. Enhancing Human-Machine Collaboration Safety Through Personnel Behavior Detection and Separate Speed Monitoring. In Proceedings of the 2024 International Conference on Advanced Robotics and Intelligent Systems (ARIS), Taipei, Taiwan, 22–24 August 2024; pp. 1–6. [Google Scholar] [CrossRef]
  78. Amaya-Mejía, L.M.; Duque-Suárez, N.; Jaramillo-Ramírez, D.; Martinez, C. Vision-Based Safety System for Barrierless Human-Robot Collaboration. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 7331–7336. [Google Scholar] [CrossRef]
  79. An, S.; Zhou, F.; Yang, M.; Zhu, H.; Fu, C.; Tsintotas, K.A. Real-Time Monocular Human Depth Estimation and Segmentation on Embedded Systems. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 55–62. [Google Scholar] [CrossRef]
  80. anshan XR-ROB. HDES-Net. 2021. Available online: https://github.com/anshan-XR-ROB/HDES-Net?tab=readme-ov-file (accessed on 9 March 2025).
  81. Hu, M.; Yin, W.; Zhang, C.; Cai, Z.; Long, X.; Wang, K.; Chen, H.; Yu, G.; Shen, C.; Shen, S. Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10579–10596. [Google Scholar] [CrossRef]
  82. Bochkovskii, A.; Delaunoy, A.; Germain, H.; Santos, M.; Zhou, Y.; Richter, S.R.; Koltun, V. Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. arXiv 2024, arXiv:2410.02073. [Google Scholar] [CrossRef]
  83. Terabee. TeraRanger Evo Thermal User Manual. 2023. Available online: https://acroname.com/sites/default/files/assets/teraranger-evo-thermal-user-manual.pdf?srsltid=AfmBOorcFGGPBEiNlHVTcAy7o8mv8zG20rtjJ1hR2HQ0ZlgVd8K-yAqd (accessed on 30 March 2025).
  84. Voynick, S. What is a Microbolometer? 2023. Available online: https://sierraolympia.com/what-is-a-microbolometer/ (accessed on 8 November 2024).
  85. Benli, E.; Spidalieri, R.L.; Motai, Y. Thermal Multisensor Fusion for Collaborative Robotics. IEEE Trans. Ind. Inform. 2019, 15, 3784–3795. [Google Scholar] [CrossRef]
  86. Himmelsbach, U.B.; Wendt, T.M.; Hangst, N.; Gawron, P.; Stiglmeier, L. Human–Machine Differentiation in Speed and Separation Monitoring for Improved Efficiency in Human–Robot Collaboration. Sensors 2021, 21, 7144. [Google Scholar] [CrossRef] [PubMed]
  87. Himmelsbach, U.B.; Wendt, T.M.; Lai, M. Towards safe speed and separation monitoring in human-robot collaboration with 3D-time-of-flight cameras. In Proceedings of the 2nd IEEE International Conference on Robotic Computing (IRC 2018), Laguna Hills, CA, USA, 31 January–2 February 2018; pp. 197–200. [Google Scholar] [CrossRef]
  88. Optris. PI 450i. Available online: https://optris.com/us/products/infrared-cameras/precision-line/pi-450i/ (accessed on 24 January 2025).
  89. Mouser. TR-EVO-T33-USB Terabee|Mouser. Available online: https://www.mouser.com/ProductDetail/Terabee/TR-EVO-T33-USB?qs=OTrKUuiFdkYKUuhq9B0%252BOA%3D%3D (accessed on 9 March 2025).
  90. STMicroelectronics. VL53L1X-Time-of-Flight (ToF) Ranging Sensor Based on ST’s FlightSense Technology-STMicroelectronics. Available online: https://www.st.com/en/imaging-and-photonics-solutions/vl53l1x.html (accessed on 12 March 2025).
  91. D3 Embedded. RS-1843A mmWave Radar Sensor Evaluation Kit. Available online: https://www.d3embedded.com/wp-content/uploads/2020/02/D3Eng-DesignCore-RS-1843AandRS-6843-DataSheet.pdf (accessed on 12 March 2025).
  92. TI. mmWaveSensingEstimator. Available online: https://dev.ti.com/gallery/view/mmwave/mmWaveSensingEstimator/ver/2.4.0/ (accessed on 12 March 2025).
  93. D3. Social Distance Tracking Using mmWave Radar. Available online: https://www.d3embedded.com/solutions/tracking-social-distancing/ (accessed on 12 March 2025).
  94. Esposito, M.; O’Flaherty, R.; Li, Y.; Virga, S.; Joshi, R.; Haschke, R. IFL-CAMP/Easy_Handeye. Original-Date: 2017-06-25T20:22:05Z. 2024. Available online: https://github.com/IFL-CAMP/easy_handeye (accessed on 13 October 2024).
  95. Sifferman, C.; Mehrotra, D.; Gupta, M.; Gleicher, M. Geometric Calibration of Single-Pixel Distance Sensors. IEEE Robot. Autom. Lett. 2022, 7, 6598–6605. [Google Scholar] [CrossRef]
  96. Sifferman, C.; Wang, Y.; Gupta, M.; Gleicher, M. Unlocking the Performance of Proximity Sensors by Utilizing Transient Histograms. IEEE Robot. Autom. Lett. 2023, 8, 6843–6850. [Google Scholar] [CrossRef]
  97. Intel. Intel® Core™ i9-12900K Processor (30M Cache, up to 5.20 GHz)-Product Specifications. Available online: https://www.intel.com/content/www/us/en/products/sku/134599/intel-core-i912900k-processor-30m-cache-up-to-5-20-ghz/specifications.html (accessed on 12 March 2025).
  98. NVIDIA. NVIDIA RTX A5000 Datasheet. Available online: https://resources.nvidia.com/en-us-briefcase-for-datasheets/nvidia-rtx-a5000-dat-1 (accessed on 12 March 2025).
  99. Newegg. NeweggBusiness-PNY VCNRTXA5000-PB RTX A5000 24GB 384-bit GDDR6 PCI Express 4.0 Workstation Video Card. Available online: https://www.neweggbusiness.com/product/product.aspx?item=9siv7kvjy39435&bri=9b-14-133-832 (accessed on 12 March 2025).
  100. Intel. Intel® Core™ i7-920 Processor (8M Cache, 2.66 GHz, 4.80 GT/s Intel® QPI)-Product Specifications. Available online: https://www.intel.com/content/www/us/en/products/sku/37147/intel-core-i7920-processor-8m-cache-2-66-ghz-4-80-gts-intel-qpi/specifications.html (accessed on 12 March 2025).
  101. Amazon. Amazon.com: Intel Core i7 Processor i7-920 2.66GHz 8 MB LGA1366 CPU BX80601920: Electronics. Available online: https://www.amazon.com/Intel-Processor-2-66GHz-LGA1366-BX80601920/dp/B001H5T7LK/ref=asc_df_B001H5T7LK?mcid=cf2f78a548833789b337453383ab2693&tag=hyprod-20&linkCode=df0&hvadid=693562313188&hvpos=&hvnetw=g&hvrand=15348408733688645708&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9005654&hvtargid=pla-2007964176847&psc=1 (accessed on 12 March 2025).
  102. PassMark. AMD EPYC 9655P Benchmark. Available online: https://www.cpubenchmark.net/cpu.php?cpu=AMD+EPYC+9655P&id=6354 (accessed on 12 March 2025).
  103. Kumar, S.; Savur, C.; Sahin, F. Survey of Human-Robot Collaboration in Industrial Settings: Awareness, Intelligence, and Compliance. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 280–297. [Google Scholar] [CrossRef]
  104. Rakhmatulin, V.; Grankin, D.; Konenkov, M.; Davidenko, S.; Trinitatova, D.; Sautenkov, O.; Tsetserukou, D. AirTouch: Towards Safe Human-Robot Interaction Using Air Pressure Feedback and IR Mocap System. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 2034–2039. [Google Scholar] [CrossRef]
  105. IEEE. IEEE Xplore. 2025. Available online: https://ieeexplore.ieee.org/ (accessed on 11 March 2025).
  106. ASME. ASME Digital Collection. 2025. Available online: https://asmedigitalcollection.asme.org/ (accessed on 11 March 2025).
  107. Elsevier. ScienceDirect. 2025. Available online: https://www.sciencedirect.com/ (accessed on 11 March 2025).
  108. Wiley. Wiley Online Library. 2025. Available online: https://onlinelibrary.wiley.com/ (accessed on 11 March 2025).
  109. ProQuest. ProQuest. 2025. Available online: https://www.proquest.com/ (accessed on 11 March 2025).
  110. Springer. SpringerLink. 2025. Available online: https://link.springer.com/ (accessed on 11 March 2025).
  111. Frontiers Media. Frontiers. 2025. Available online: https://www.frontiersin.org/ (accessed on 11 March 2025).
  112. MDPI. 2025. Available online: https://www.mdpi.com/ (accessed on 11 March 2025).
  113. Journal of Open Source Software. 2025. Available online: https://joss.theoj.org/ (accessed on 11 March 2025).
  114. Subramanian, K. Survey_SSM_Robotics. 2024. Available online: https://github.com/kxs8997/survey_SSM_robotics (accessed on 12 March 2025).
  115. inciteful. Available online: https://inciteful.xyz/ (accessed on 9 October 2024).
  116. Glogowski, P.; Lemmerz, K.; Hypki, A.; Kuhlenkotter, B. Extended calculation of the dynamic separation distance for robot speed adaption in the human-robot interaction. In Proceedings of the 2019 19th International Conference on Advanced Robotics (ICAR 2019), Belo Horizonte, Brazil, 2–6 December 2019; pp. 205–212. [Google Scholar] [CrossRef]
  117. Lemmerz, K.; Glogowski, P.; Kleineberg, P.; Hypki, A.; Kuhlenkötter, B. A Hybrid Collaborative Operation for Human-Robot Interaction Supported by Machine Learning. In Proceedings of the International Conference on Human System Interaction (HSI), Richmond, VA, USA, 25–27 June 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 69–75. [Google Scholar] [CrossRef]
  118. Lacevic, B.; Rocco, P. Kinetostatic danger field - A novel safety assessment for human-robot interaction. In Proceedings of the IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems (IROS 2010), Taipei, Taiwan, 18–22 October 2010; pp. 2169–2174. [Google Scholar] [CrossRef]
Figure 1. GP2Y0A21YK0F SHARP IR sensor, which detects distance based on infrared light returned to the sensor.
Figure 2. Voltage-to-distance relationship per [10].
Figure 3. (a) Legacy AGV for manufacturing applications. (b) More modern AGV product commonly seen in warehouse automation applications.
Figure 4. On the left is the AKDK; on the right is the Femto Bolt. Both modules contain a 1 MP, 120° FOV 3D ToF camera and a 12 MP, 120° FOV RGB imager.
Figure 5. VL53L1X ToF sensor: 4 m maximum range, 50 Hz maximum sampling rate, 27° FOV.
Figure 6. (a) External-antenna mmWave-based radar product [59]. (b) Antenna-on-package mmWave-based radar product [59].
Figure 7. Intel RealSense D435i stereoscopic camera.
Figure 8. TeraRanger Evo Thermal 33 sensor.
Figure 9. Optris PI 450i thermal camera.
Figure 10. SSM system architecture. * The process interface connection in green may not be present if the Perception Processing Unit and SSM Processing Unit are combined into a single processing platform.
Figure 11. Visualization for the radar azimuth area calculation.
Figure 12. The NUCLEO-F030R8 is an STM32 microcontroller evaluation board and a common example of a platform used for sensor processing.
Figure 13. NVIDIA Jetson Orin starter kit for platform evaluation.
Figure 14. Peer-review citation generation and tagging workflow.
Figure 15. (a) Histogram of sensor types in the dataset. (b) Histogram of computational platform types in the dataset.
Figure 16. Timeline of usage of different sensor modalities across SSM research. Total usage is based on tag instances per year within the works collected for this survey.
Figure 17. Migration of different sensor modalities used in SSM research over time. Total usage is based on tag instances per year within the works collected for this survey.
Figure 18. Timeline of usage of different computation units across SSM research. Total usage is based on tag instances per year within the works collected for this survey.
Figure 19. Migration of different computation units used in SSM research over time. Total usage is based on tag instances per year within the works collected for this survey.
Table 1. Perception sensor performance comparison.

| Sensor Product | GP2Y0A21YK0F | Femto Bolt | VL53L1X | RS-6843A | Realsense D435i | TeraBee Evo Thermal 33 | Optris PI 450i | Ouster OS0 |
|---|---|---|---|---|---|---|---|---|
| Sensor Type | IR | ToF | ToF | Radar | Stereo | Thermopile | Bolometer | LiDAR |
| Sensor Range | 0.1–0.8 m | 0.25–5.46 m | 0.04–4 m | 0.3–9.62 m | 0.02–10 m | 30–45 °C | −20–900 °C | −40–60 °C |
| Unit Cost | $12.95 | $415.00 | $5.77 | $599.00 | $343.75 | $95.70 | $6300.00 | $6000.00 |
| FPS | 26 ¹ | 30 ² | 50 ³ | 50 | 90 ⁴ | 7 | 80 | 20 |
| Coverage ⁵ (m²) | 0.27 | 17.28 | 0.26 | 1.5 | 3.03 | 0.51 | 2.46 | 9.05 |
| PD ⁶ (pts/m²) | N/A | 60,681.48 | 981.791 | 108.779 | 304,171.15 | 2026.13 | 44,673.91 | 3621.66 |
| Typical Power | 0.165 W | 4.35 W | 0.066 W | 0.75 W | 3.5 W | 0.225 W | 2.5 W | 20 W |
| Full Coverage Min. Sensor Count | 10 | 3 | 14 | 3 | 4 | 10 | 5 | 1 |
| Full Coverage Power | 1.65 W | 13.05 W | 0.924 W | 2.25 W | 14 W | 2.25 W | 12.5 W | 20 W |
| Full Coverage Cost | $129.50 | $1245.00 | $80.78 | $1797.00 | $1375.00 | $957.00 | $31,500.00 | $6000.00 |

¹ FPS based on a 38 ms integration time [10]. ² 30 FPS in NFOV and 15 FPS in WFOV [46]. ³ 50 FPS in short-distance mode and 30 FPS in long-distance mode [90]. ⁴ 30/60/90 FPS depending on the selected operating resolution: 848 × 480 at 90 FPS and 1280 × 720 at 30 FPS [67]. ⁵ Sensor coverage area calculated for a sensor range of 1.2 m. ⁶ Point density (PD) is the point distribution across the sensor coverage area at 1.2 m.
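The Coverage and PD columns above follow from each sensor's field of view and resolution at the 1.2 m evaluation range. The Python sketch below is illustrative rather than the authors' code: it assumes a rectangular footprint for 2D imagers, a conical footprint for single-zone sensors such as the VL53L1X, and a planar azimuth sector for the radar (consistent with the Figure 11 calculation). The FOV and resolution figures used here come from the table footnotes and the cited datasheets, so small differences from the tabulated values can arise from rounded FOV specifications.

```python
import math

# Illustrative sketch (not the authors' code) of the Table 1 coverage and
# point-density (PD) calculations at the 1.2 m evaluation range.
RANGE_M = 1.2

def rect_coverage(h_fov_deg, v_fov_deg, r=RANGE_M):
    """Rectangular footprint area (m^2) of a 2D imager at range r."""
    w = 2 * r * math.tan(math.radians(h_fov_deg) / 2)
    h = 2 * r * math.tan(math.radians(v_fov_deg) / 2)
    return w * h

def cone_coverage(fov_deg, r=RANGE_M):
    """Circular footprint area (m^2) of a conical single-zone sensor at range r."""
    return math.pi * (r * math.tan(math.radians(fov_deg) / 2)) ** 2

def sector_coverage(azimuth_fov_deg, r=RANGE_M):
    """Planar azimuth sector area (m^2) swept by a radar out to range r."""
    return 0.5 * r ** 2 * math.radians(azimuth_fov_deg)

def point_density(num_points, coverage_m2):
    """Points per square metre across the sensor footprint."""
    return num_points / coverage_m2

# Femto Bolt: 120 x 120 deg FOV, ~1 MP depth map -> ~17.28 m^2, ~60,681 pts/m^2.
bolt_area = rect_coverage(120, 120)
print(f"Femto Bolt: {bolt_area:.2f} m^2, "
      f"{point_density(1024 * 1024, bolt_area):,.0f} pts/m^2")

# Realsense D435i: 87 x 58 deg depth FOV, 1280 x 720 depth map -> ~3.03 m^2.
d435_area = rect_coverage(87, 58)
print(f"D435i: {d435_area:.2f} m^2, "
      f"{point_density(1280 * 720, d435_area):,.0f} pts/m^2")

# VL53L1X: 27 deg cone -> ~0.26 m^2 footprint.
print(f"VL53L1X: {cone_coverage(27):.2f} m^2")

# RS-6843A radar: 120 deg azimuth sector -> ~1.5 m^2 footprint.
print(f"RS-6843A: {sector_coverage(120):.2f} m^2")
```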
Table 2. Computational platform comparison.

| Computational Platform | Known Components and Software | Cost ($) | Platform Power (Watts) | Processing Power (TOPS) |
|---|---|---|---|---|
| PC specs from [74] | Intel i9 with 16 CPU cores | >648.00 ¹ | >250 ¹ | <1.2 ² |
| PC specs from [76] | 3.33 GHz CPU and RTX A5000 GPU running Ubuntu 18.04.2 LTS | >2099.00 ¹ | >230 ¹ | 222.2 |
| PC specs from [69] | 2.67 GHz Intel i7 920 quad-core running Ubuntu 10.04 LTS | >75.00 ¹ | >130 ¹ | <1.2 ² |
| Jetson Orin NX | 8-core ARM A78, 16 GB LPDDR5, Ampere GPU | 699.00 | 25 | 70 |
| Jetson Nano NX | 6-core ARM A57, 4 GB LPDDR4, Maxwell GPU | 200.00 | 15 | 0.5 |
| PILZ PSS4000 PLC [73] | Safety-critical microprocessors | 20,000.00 | 50 | ARM A7 and high-end microprocessor capabilities |
| STM32 Nucleo Board | 48 MHz ARM M0, 64 KB flash, 8 KB RAM | 11.04 | <1 | Serial and low-speed data processing only |

¹ The authors of [69,74,76] disclosed only the key components of the PCs used in their experiments and did not provide full specifications. The power, cost, and performance of the disclosed CPU or GPU are therefore treated as the minimum specifications for each PC as a whole. The performance specifications and cost of the Intel i9 were found in [97]. The performance specifications for the RTX A5000 were found in [98] and its cost in [99]. The performance specifications for the Intel i7 920 quad-core were found in [100] and its used cost in [101]. ² The AMD EPYC 9655P [102] is a server-grade CPU with significantly higher performance specifications than the CPUs used in [69,74,76]; it is therefore assumed that those CPUs deliver less than 1.2 TOPS.
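To put the Table 2 entries on a common footing, the throughput figures can be normalized by power and price. The snippet below is illustrative only: the derived TOPS-per-watt and TOPS-per-dollar metrics are not part of the original table, and the workstation's cost and power are lower bounds, so its ratios should be read as optimistic.

```python
# Illustrative only: normalizing selected Table 2 platforms by power and cost.
# TOPS, power (W), and price ($) are copied from the table; the ">" lower
# bounds for the workstation PC are used as-is, making its ratios optimistic.
platforms = {
    "RTX A5000 PC [76]": (222.2, 230.0, 2099.00),
    "Jetson Orin NX":    (70.0,   25.0,  699.00),
    "Jetson Nano NX":    (0.5,    15.0,  200.00),
}

for name, (tops, watts, dollars) in platforms.items():
    print(f"{name:18s} {tops / watts:7.3f} TOPS/W   {tops / dollars:7.4f} TOPS/$")
```

By this rough measure, the Jetson Orin NX delivers roughly three times the throughput per watt of the A5000 workstation at a similar throughput per dollar, while the older embedded and microcontroller platforms trail by orders of magnitude.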