Article

Towards a Safe Human–Robot Collaboration Using Information on Human Worker Activity

Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb, Ivana Lucica 5, 10000 Zagreb, Croatia
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(3), 1283; https://doi.org/10.3390/s23031283
Submission received: 23 November 2022 / Revised: 4 January 2023 / Accepted: 20 January 2023 / Published: 22 January 2023
(This article belongs to the Section Sensors and Robotics)

Abstract

Most industrial workplaces involving robots and other apparatus operate behind fences to prevent defects, hazards, or casualties. Recent advancements in machine learning can enable robots to cooperate with human co-workers while retaining safety, flexibility, and robustness. This article focuses on a computation model that provides a collaborative environment through intuitive and adaptive human–robot interaction (HRI). In essence, one layer of the model can be expressed as a set of useful information utilized by an intelligent agent. Within this construction, a vision-sensing modality can be broken down into multiple layers. The authors propose a human-skeleton-based trainable model for the recognition of spatiotemporal human worker activity using LSTM networks, which achieves an accuracy of 91.365% on the InHARD dataset. Together with the training results, aspects of the simulation environment and future improvements of the system are discussed. By combining human worker upper body positions with actions, the perceptual potential of the system is increased, and human–robot collaboration becomes context-aware. Based on the acquired information, the intelligent agent gains the ability to adapt its behavior according to its dynamic and stochastic surroundings.

1. Introduction

When exploring flexible manufacturing, cyber-physical systems, or Industry 4.0 (I4.0), the term “human–robot collaboration” rarely escapes attention; rather, it encapsulates every aspect of the idea in one self-sustaining concept [1]. Advances in technology increasingly affect our daily activities. We can observe notable progress in AI, leading to the creation of new fields of research, methods, technologies, and their applications, including applications in medicine [2], decision making [3], algorithm development [4], and intelligent agent architecture for security purposes [5].
Recent research and emerging trends can serve as windows that allow us to envisage the future and adopt the concept of collaborating with robotic systems. The concept of industrial manipulators working alongside human co-workers carries the burden of the need to remove barriers and obstacles, allowing us to fully exploit the machines’ strength and dexterity. The main issue and concern with barrierless spaces is the safety of humans and machine operators. Traditionally, industrial robots ignore the occupants of the space and cannot ensure hazardless operation. While methods such as torque sensors or safety zones can ensure the workers’ safety, they reduce the overall flexibility and efficiency of the system. With recent advancements in machine learning (ML) and computer vision (CV), the field of human–robot collaboration (HRC) is also developing [6]. HRC teams offer a way to use the problem-solving skills of human operators together with robots’ strength and agility. For example, by using a vision-based modality, an intelligent agent can acquire additional high-level information about its surroundings and, in the context of safety, information about the human co-workers. Such information can be presented as motion information (poses, joint velocities, etc.) and even human emotional states, which can be represented through facial expressions or body language [7]. With careful interpretation, this new information can form contexts and minimize the stochastic factors present in collaborative industrial environments.
HRC is a relatively young scientific field, in which researchers are developing control models and other means to render this collaboration safe and efficient [8,9,10]. The methods that enable us to achieve these goals can be designed from both physical and cognitive perspectives [11]. At the physical level, researchers are designing and building collaborative robots (cobots) that can be used to share the same physical interactive space with humans [12]. Such robots are usually in close proximity with humans while performing joint work tasks. The collaborative robot or cobot can engage in safe interaction, based on special software and smart sensors that are ubiquitously placed within the interaction space. The physical components of such robots usually have rounded edges and are constructed from lightweight construction materials. During the interaction, the robot parts are moved carefully, with limitations on their speed and force [13].
The design of HRC at the cognitive level presumes the use of input sensor information about the environment in which the user is positioned [14]. This information is then elaborated and translated to the robot, in which a computation model designed to drive the robot’s behavior shapes the robot’s responses. A thorough overview of the design of learning strategies for HRC is provided in [15]. These strategies are based on ML, a promising field, especially with the rise of deep learning (DL), which is often used to train computation models on data about different domains of interest in the form of case knowledge.
Based on their applications, different ML techniques can be employed to build HRC models. For action (activity) and intention recognition applications, which are the focus of this work, several interesting approaches rely on different sensing modalities. It is, therefore, important to exploit the value of each modality for the purpose of better action recognition [16]. For example, a vision-based action analysis can focus on different characteristics of the input data, such as RGB, depth, skeleton, and infrared (IR) data [17]. Signal segmentation is also an important stage in the action recognition process [18]. In [19], the authors proposed a method that can be employed to achieve fast and fluid human–robot interaction by estimating the progress of the human co-workers’ movements.
In recent years, researchers have begun to collect data that could be used to build DL models for action recognition in HRC applications. These data are often difficult to analyze and annotate [20,21,22]. For example, in [23], the authors distilled spatial and temporal data representing human actions using convolutional neural networks (CNN) and long short-term memory (LSTM). In [24], an approach based on transfer learning is proposed, thus avoiding the need for a large amount of annotated data, whose collection is often a tedious job to perform. In [25], activity recognition based on built-in sensors in smart and wearable devices and a custom-made HHAR-net is investigated.
Intention recognition is a more complex task, compared to the recognition of actions. It represents the task of recognizing the intentions of a person by analyzing his/her actions and/or changes in the environment caused by these actions. Intention-based approaches to human–robot interaction are discussed in [26]. The social component is also an important aspect of worker–robot interaction [27]. The way in which a person behaves during the interaction can be analyzed and employed for the purpose of intention recognition. For example, the movements of the worker’s body can be associated with a possible collision with the robot [28]. In [29], the authors proposed an addition to a safety framework using a worker intention recognition method, based on head pose information. In [30], the authors described a behavior recognition system based on real-time stereo face tracking and gaze detection, designed to measure the head pose and gaze direction simultaneously. Additionally, a fusion technique for the improvement of intention recognition performance was proposed in [31], and different information sources were explored in [32].
In [33], an intention-guided control strategy was proposed and applied to an upper-limb power-assisted exoskeleton. In [34], the authors proposed a systematic user experience (UX) evaluation framework of action and intention recognition for interactions between humans and robots, from a UX perspective. In [35], the authors presented a method based on the radial basis function neural network (RBFNN) to identify the motion intention of the collaborator.
This work aims to develop efficient HRC in industrial settings. Sensors have a significant impact on the modeling of context-aware industrial applications, in which vision plays a special role, as emphasized in [36,37]. Context awareness synchronized with significant environmental occurrences is key to improving operational efficiency and safety in HRC for the purpose of intelligent manufacturing [38,39]. Context-aware systems enable a transition towards more flexible production systems that are rapidly adjustable in response to changing production requirements [40]. In [41], the authors proposed a hybrid assembly cell for HRC that supervises human behavior to predict the demand for collaborative tasks. There are other applications based on similar principles and methodologies. For example, in [42], the authors quantified the uncertainties involved in the assembly process by constructing a model of mutual trust between humans and robots.
The approach presented in this work relies on the analysis of micro- and macro-movements between interacting units within an industrial environment. Joints connected with links are the natural parts of a unit’s skeleton, representing a moving environmental structure. The skeleton can represent a person, robot, or any other natural or artificial mechanism. The micro-movements of the skeleton usually have a contextual meaning and different consequences within the environment. An analysis of this cause–effect relationship within a temporal and spatial continuum is provided here to empower the intention and activity recognition model. In the case of the model presented herein, generic movements (e.g., picking left, picking in front, etc.), as opposed to the specific actions that the worker can perform (e.g., using a screwdriver, applying grease, etc.), are evaluated and used. Consequently, the model is expected to support the development of skills for system adaptation in response to constant environmental changes. We propose a development framework for the planning of safe robot actions using human worker poses, estimated by a neural network (NN) module dedicated to the prediction of human joint locations. In this way, the robotic manipulator is provided with the information required to ensure human safety and maintain efficiency. The activity recognition module provides the contextual information required for pose extrapolation, thus decreasing the system’s reaction time, similar to the work described in [43,44]. The difference between these two approaches is the information input to the AR module and the selection of the output classes. In [43], the authors proposed a task-oriented action recognition system which relies on the topology of the worker station, as opposed to generic human worker actions. We believe that from the perspectives of safety and system flexibility, activities should be generalized and inherited by different robotic cell topologies.
The activity recognition (AR) module shown in Figure 1 (highlighted with a green background) is modelled using long short-term memory (LSTM) networks and consists of a classification module for AR.
The remainder of the paper is organized as follows:
  • Section 2: Through an overview, the model is introduced, and certain pose estimation models are described and explained. We present an overview of the dataset used for the activity recognition. Furthermore, the algorithms and explanations that summarize the work performed for the purpose of action recognition are discussed, and finally, the overall framework of the system’s simulation and visualization is explained;
  • Section 3: In this section, the training and overall results of the system are discussed;
  • Section 4: In the final section of this paper, we summarize the work conducted and propose future directions for the field.

2. Materials and Methods

The proposed model is based on the action recognition module (ARM). It employs deep learning techniques rather than traditional ones, since they have a proven capacity to perform well without feature selection and engineering. Beyond these obvious advantages, there is also a lack of understanding about how basic actions are performed in stochastic environments. For example, “Picking Left” is an action that can be performed with either the left or right hand, which renders the features difficult to analyze when expressed as multivariate time series.

2.1. Pose Estimation

Human pose estimation (HPE) is a way of identifying and classifying the joints in the human body. In essence, it is a method used to capture a set of coordinates for each joint, known as a key point, that can describe a pose. Not every pair of key points forms a valid connection, so association techniques are employed to form a skeleton-like representation of the human body.
In this section, we briefly reflect on a possible human pose estimation method for obtaining the skeleton view for the purpose of this work. It is possible to identify three main types of human pose estimation models that can be used to represent the human body in 2D and 3D spaces (skeleton-based, planar-based, and volumetric). For this work, a skeleton-based model is preferred, since the relationships between the joints are used to represent the activities performed by the human worker in a collaborative environment.
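As an illustration only, and not the estimation pipeline used in this work (which relies on the MOCAP skeletons of the InHARD dataset described below), a skeleton-like set of key points can be extracted from RGB frames with an off-the-shelf estimator such as MediaPipe Pose; the file name and parameters below are assumptions made for this sketch.

```python
# Illustrative only: extracting key points with MediaPipe Pose.
# This is not the estimator used in this work; "frame.png" is a placeholder.
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(static_image_mode=True)
frame = cv2.imread("frame.png")
results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # Each landmark is a normalized (x, y) image coordinate plus a relative depth z.
    for idx, lm in enumerate(results.pose_landmarks.landmark):
        print(idx, lm.x, lm.y, lm.z, lm.visibility)
```

The resulting key points can then be associated into a skeleton hierarchy analogous to the joint sets used in the following sections.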

2.2. Dataset

The InHARD dataset is a large-scale RGB + skeleton action recognition dataset named the “Industrial Human Action Recognition Dataset” [21]. It includes 4804 action samples spread over 38 videos, covering 14 industrial action classes. In comparison with other existing action recognition datasets, which comprise daily activities, the authors of [21] proposed a dataset based on actual industrial actions performed in real use-case scenarios in an industrial environment. Together with the dataset, usage metrics are proposed for the purpose of algorithm evaluation. The RGB data are recorded from three different angles (top, left side, and right side) to capture the complete action and help improve the ML algorithm’s performance in cases where occlusion occurs.
For the skeleton modality, a Combination Perception Neuron 32 Edition v2 motion sensor was used to capture MOCAP data at a frequency of 120 Hz [21]. The skeleton data comprise the 3D locations (Tx, Ty, and Tz) of 17 major body joints, together with their rotations (Rx, Ry, and Rz). The data are saved in the standard BVH file format and can be examined using various software packages, e.g., Blender, as shown in Figure 2.
The authors of [21] identified 14 different low-level classes, as presented in Table 1, and 72 high-level classes, in which the actions are described much more precisely. For the purpose of this experiment, a subset of 4 low-level classes was used. A visual representation of the time distribution of these four classes during one recording is presented in Figure 3.

2.3. Dimensionality Reduction

Upon analysis of the dataset, it was determined that not all of the feature points in the skeleton hierarchy are necessary, and some can be excluded from the training of the model. The new skeleton hierarchy comprises only the points visible above the assembly surface, since the features below it are not visible to the camera and are deemed unnecessary for the purpose of activity recognition. Figure 4 presents the new hierarchy and an example of a relevant extremity trajectory.
A subset of joints J is then defined as an array with a length of k (17, in this case), resulting in $J = \{j_i\}_{i=1}^{k}$.
Subset J can be reduced further by applying principal component analysis (PCA) to the remaining dataset. The results are presented in Figure 5, and the dataset is further reduced (to 9 joints, in this case, resulting in 27 input features).
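A minimal sketch of how the cumulative explained variance ratio shown in Figure 5 can be computed and inspected with scikit-learn; the input file, array shape, and the 99% variance threshold are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

# X: flattened joint features per frame, shape (n_frames, 51); placeholder source.
X = np.load("skeleton_features.npy")

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Number of components needed to explain an (assumed) 99% of the variance.
n_components = int(np.searchsorted(cumulative, 0.99)) + 1
print(n_components, "components retain 99% of the variance")
```

In this work, the PCA result is used to identify whole joints whose components contribute little explained variance, rather than to project the data onto principal components, so the retained features remain interpretable as joint coordinates.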

2.4. Human Activity Recognition

HAR aims to understand human behaviors, enabling computing systems to proactively assist users based on their requirements [45]. From a formal perspective, suppose that a user is performing activities belonging to a predefined activity set A:
$A = \{a_i\}_{i=1}^{m},$ (1)
where m denotes the number of activity classes. There is a sequence of sensor readings that capture the activity information:
$s = \{d_1, d_2, \dots, d_t, \dots, d_n\},$ (2)
where $d_t$ denotes the sensor reading at time t. We must build a model F to predict the activity sequence, based on sensor reading s:
$\hat{A} = \{\hat{a}_j\}_{j=1}^{n} = F(s), \quad \hat{a}_j \in A,$ (3)
while the true activity sequence (ground truth) is denoted as:
$A^* = \{a_j^*\}_{j=1}^{n}, \quad a_j^* \in A,$ (4)
given that $n \geq m$. We then select a positive loss function $\mathcal{L}(F(s), A^*)$ to minimize the discrepancy between $\hat{A}$ and $A^*$. In this work, a multi-class categorical cross-entropy loss function is used:
$\mathcal{L}(F(s), A^*) = -\sum_{c=1}^{n} a_c^* \log\!\left(P(\hat{a}_c)\right)$ (5)
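For clarity, a small numerical sketch of the categorical cross-entropy loss above, written in NumPy for the four classes used in this work; the probability values are invented for illustration.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """y_true: one-hot ground truth; y_prob: softmax output of the network."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_prob), axis=-1).mean())

# One window, four classes (No Action, Pick In Front, Pick Left, Assemble System).
y_true = np.array([[0.0, 0.0, 1.0, 0.0]])
y_prob = np.array([[0.10, 0.20, 0.60, 0.10]])
print(categorical_cross_entropy(y_true, y_prob))  # ~0.51 = -log(0.6)
```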
The list of output classes comprises four actions acquired from the InHARD dataset, forming an output vector A. The actions include those highlighted in Table 1.
Every joint can be expressed as a point $P_{J_k} = (X_{J_k}, Y_{J_k}, Z_{J_k})$ in a Cartesian coordinate system, indexed in chronological order. A set of such joints can be used to describe the actions and can be analyzed by a neural network in fixed timeframes. In this case, one sensor reading $d_t$ appears as:
$d_t = \left[P_{j_1}, P_{j_2}, \dots, P_{j_k}\right],$ (6)
Vector s in (2) is described as a sequence of sensor readings. It is defined as a sliding window and can be used for human activity analysis using complete sets of data, where $d_n$ is a moment surpassing $d_t$, representing a future joint movement. In HRC environments, sensors can obtain only present or past sets of events. A sequence of sensor readings is then expressed as:
$s = \left[d_{t-n}, \dots, d_{t-2}, d_{t-1}, d_t\right],$ (7)
and transposing (6) and placing it into (7) gives:
$s = \left[ \begin{bmatrix} P_{j_1} \\ P_{j_2} \\ \vdots \\ P_{j_k} \end{bmatrix}_{t-n}, \dots, \begin{bmatrix} P_{j_1} \\ P_{j_2} \\ \vdots \\ P_{j_k} \end{bmatrix}_{t-2}, \begin{bmatrix} P_{j_1} \\ P_{j_2} \\ \vdots \\ P_{j_k} \end{bmatrix}_{t-1}, \begin{bmatrix} P_{j_1} \\ P_{j_2} \\ \vdots \\ P_{j_k} \end{bmatrix}_{t} \right],$ (8)
and it is visualized in Figure 6.
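A possible NumPy sketch of the sliding-window construction in (7) and (8); the window length, array shapes, and function name are assumptions for illustration.

```python
import numpy as np

def sliding_windows(joint_positions, window):
    """joint_positions: array of shape (T, k, 3) holding the 3D position of each
    of the k joints per frame. Returns windows of shape (T - window + 1, window, 3k),
    i.e., one flattened sensor reading per time step, stacked as in (8)."""
    T, k, _ = joint_positions.shape
    flat = joint_positions.reshape(T, k * 3)
    return np.stack([flat[t - window + 1:t + 1] for t in range(window - 1, T)])

# Example: 9 joints, 1000 frames, assumed window of 30 frames.
windows = sliding_windows(np.random.rand(1000, 9, 3), window=30)
print(windows.shape)  # (971, 30, 27)
```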
The selected model implements a DNN with hidden LSTM layers (Figure 7). We used the rectified linear activation function (ReLU), since it overcomes the vanishing gradient problems present in RNNs [46,47]. It also allows models to learn faster and perform better. The Softmax function is used as the output layer of the NN, since the desired output is a vector of probabilities. The probabilities of each value are proportional to the relative scale of each value in the vector and are interpreted as probabilities of the membership of each class. The predicted class $\hat{a}_{class}$ takes the form of:
$\hat{a}_{class} = P(a \mid s)$ (9)
and the final output is calculated as follows:
$\hat{y} = \operatorname{argmax}(\hat{A})$ (10)
The NN module is implemented with the help of the Keras API, an open-source NN library written in Python [48]. Training is performed over 500 epochs with a batch size of 64.
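A minimal Keras sketch consistent with the description above (five hidden LSTM layers with ReLU activations, a Softmax output, 27 input features, 500 epochs, and a batch size of 64); the window length, the number of units per layer, and the optimizer are not stated in the paper and are assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

WINDOW, FEATURES, CLASSES = 30, 27, 4  # window length is an assumed value

model = Sequential([
    LSTM(64, activation="relu", return_sequences=True, input_shape=(WINDOW, FEATURES)),
    LSTM(64, activation="relu", return_sequences=True),
    LSTM(64, activation="relu", return_sequences=True),
    LSTM(64, activation="relu", return_sequences=True),
    LSTM(64, activation="relu"),              # fifth hidden LSTM layer
    Dense(CLASSES, activation="softmax"),     # class-membership probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=500, batch_size=64)
```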

2.5. Simulation and Visualization Environment

In this study, the HRC environment was modeled and implemented using the CoppeliaSimEdu software package [49]. CoppeliaSim is a robotic environment simulator with an integrated development environment. It is based on a distributed control architecture, meaning that each object can be individually controlled via an embedded script, a plugin, a ROS node, a remote API client, or a custom solution. In this work, the HRC environment was modeled as a workspace shared by human and robot partners forming a manufacturing team. The workspace consists of two tables representing work surfaces with a robot on a mount and a space for the human worker, as presented in Figure 8. The human worker is represented as a set of joints acquired from the InHARD dataset. Green boxes are placed on top of the work surfaces to represent the event-trigger-activated safety zones.
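As one of the control options listed above, a remote API client can be used to stream data into the scene. The sketch below uses CoppeliaSim's legacy Python remote API; the connection port and object name are assumptions for illustration and do not reflect the actual scene configuration.

```python
# Minimal sketch using CoppeliaSim's legacy Python remote API bindings ("sim" module).
# The connection port (19997) and the object name are assumptions for this example.
import sim

client_id = sim.simxStart("127.0.0.1", 19997, True, True, 5000, 5)
if client_id != -1:
    ret, handle = sim.simxGetObjectHandle(client_id, "worker_joint_0",
                                          sim.simx_opmode_blocking)
    if ret == sim.simx_return_ok:
        # Read the object's position in the world frame (-1).
        ret, position = sim.simxGetObjectPosition(client_id, handle, -1,
                                                  sim.simx_opmode_blocking)
        print(position)
    sim.simxFinish(client_id)
```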
In addition to a visual worker representation, CoppeliaSimEdu’s Graph element is deployed to represent the actions detected by the neural network. An example of an implemented graph element is presented in Figure 9.
The example presented in Figure 9 demonstrates the distinction between True Positive and False Positive detections (marked as TP and FP in the image). Here, we used the graph for the visual inspection and informal analysis of the proposed model’s performance. We found this method useful, since the detections have a temporal component: a TP detection can be described as a set of predictions generated by the model that sufficiently overlaps with the ground truth (annotations) on the temporal axis of the graph, while an FP detection can be described in the same manner as a model-generated prediction without sufficient overlap with the ground truth on the temporal axis. For the purpose of this visual inspection, the threshold for sufficient overlap was chosen arbitrarily.

3. Results and Discussions

In this section, we present the training and simulation results. Following the training of the AR network, the accuracy and confusion matrices were generated. We also discuss the results of the PCA and dimensionality reduction, together with the overall results, which were evaluated through a visual inspection in the CoppeliaSimEdu environment.

3.1. Dimensionality Reduction Results

The results of the dimensionality reduction by PCA are displayed in Table 2. They are expressed as the number of feature points, i.e., the X, Y, and Z components of the human skeleton joints. The total number of joints can therefore be calculated as the resulting number of feature points divided by three.

3.2. Action Recognition Model Training Results

Two networks with five hidden layers were trained: (1) a network with 51 input features, and (2) a network with 27 input features. The networks were trained over 500 epochs with a batch size of 64. The neural network with 51 input features achieved ~93% accuracy on the validation data, while the network with 27 input features achieved 91.365% accuracy. Figure 10 provides insight into the training results, from which we can draw several conclusions. The accuracy plots in Figure 10a,c provide information about the convergence, the accuracy discrepancy, and the usability of the dimensionality reduction methods employed in this case. Both networks achieve a good accuracy with a small discrepancy of ~1.7%, which is acceptable in this case. The training and validation losses for both networks, displayed in Figure 10b,d, converge well before the 500th epoch, suggesting neither overfitting nor underfitting.
Since the accuracy alone is not enough to evaluate the models’ performance, the precision, recall, and F1-score were also calculated (Table 3). The confusion matrix for one of the models is presented in Figure 11.
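A short scikit-learn sketch of how these metrics can be computed from validation predictions; the label arrays are placeholders and the weighted averaging scheme is an assumption, since the paper does not state which averaging was used.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# y_true: validation labels; y_val_prob: softmax outputs (both placeholders).
y_true = np.array([2, 0, 3, 1, 2, 2])
y_val_prob = np.random.rand(6, 4)
y_pred = y_val_prob.argmax(axis=1)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(precision, recall, f1)
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3]))
```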
The values presented above and the F1-score analysis show that the two models have a similar performance. To explain the lower F1-score values, the confusion matrix can be consulted, as shown in Figure 11, from which a few conclusions can be drawn:
  • The dataset appears to be unbalanced. The Assemble System action comprises most of the dataset;
  • Actions, such as Assemble System and No Action, are very similar in terms of motion and show little variance;
  • A large portion of every action class is predicted as No Action. The explanation for this trend lies in the fact that the beginning and end of each action starts with the same motion properties.
The results of the confusion matrix can be compared to the results of similar work described in [43]. The authors presented their results using four confusion matrices for each model, based on the motion completion percentage (25, 50, 75, and 100% completion). We compared the results regarding the 25% and 50% completed motions. Since the output classes of the networks differ, we selected the Picking In Front and Picking Left classes as references, as these are most similar to the motions discussed in [43]. The authors of [43] faced similar issues when evaluating closely related classes at 25% motion completion. We encountered such issues in the case of the No Action and Assemble System classes, where 42% of the No Action class is recognized as Assemble System.

3.3. Online Performance Results

Together with the training results, we explored the online performance of the AR model via visual inspection using CoppeliaSimEdu graphs and validated the conclusions based on the confusion matrix. As presented in [21], online recognition is defined as on-the-fly recognition within a long video sequence, performed as early as possible without using any further information. The authors of [21] (see the section on online metrics) also explored the online performance accuracy, calculated for each class that we defined in the NN output. Table 4 depicts the results regarding the performance of the online model.
Figure 12 provides examples of event plots that explain this behavior of the model. The graph plots visualize the model predictions and the ground truth in the time intervals in which they occur.
In Figure 12a–c, the classes of Picking Left and Picking In Front are evaluated. A visual inspection shows that the classifier performs well while working online (the entire sequence is processed), regardless of the lower accuracy. In this example, the ground truth interval starts at the ~169th second and ends at the ~170.7th second, while the predictions are shifted by ~0.2 s. Some false positives can be observed, but they are short-lasting and can be mitigated by further processing. A similar performance is observed for the Picking In Front action. As the acceptance criterion is set to 60% of the ground truth coverage for each detection interval, the reason for the lower performance compared to that of the confusion matrix in Figure 11 becomes clear.
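A minimal sketch of the acceptance criterion described above, where a detection counts as a true positive when it covers at least 60% of the ground-truth interval; the prediction boundaries reuse the approximate values from this example and are otherwise assumptions.

```python
def is_true_positive(pred, gt, min_coverage=0.60):
    """pred, gt: (start, end) intervals in seconds on the temporal axis."""
    overlap = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    return overlap / (gt[1] - gt[0]) >= min_coverage

# Ground truth ~169.0-170.7 s; prediction shifted by ~0.2 s as in Figure 12a.
print(is_true_positive((169.2, 170.9), (169.0, 170.7)))  # True (coverage ~0.88)
```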
When evaluating classes such as Assemble System and No Action (Figure 12d,e), noticeable model confusion can be observed. In this example, the ground truth of the No Action class in Figure 12e starts at the ~187.8th second and ends at the ~189.2nd second. The Assemble System class overlaps with this time interval instead of the No Action class. Similar to the example shown in Figure 12a, false positives can be observed, together with a time shift of the detections. In this case, the time shift is more noticeable, as the detections do not yield a true positive detection.
Together with the notable false positives for the No Action and Assemble System classes, intermittent detections can also be observed for both classes.

4. Conclusions and Future Work

In this work, a skeleton-based trainable model was developed to classify human worker actions in an industrial environment. The model uses LSTM hidden layers for the purpose of spatiotemporal activity classification, based on an approach that combines the human worker positions and actions to increase the perceptual potential of the system. In this way, human–robot collaboration becomes context-aware, as the model provides information to the intelligent agent, enabling it to adapt its behavior, based on the changes in its dynamic and stochastic surroundings.
Machine-vision-based skeleton pose estimation provides a useful set of information for the purpose of efficient human worker activity recognition. By breaking the pose down into its components (joints), spatiotemporal information can be extracted, providing the input for the AR model.
The AR model comprises five hidden LSTM layers used for activity classification. Here, four classes were evaluated and represented as the output of the AR model. The model evaluation techniques show that the system successfully recognizes the activities of the human worker. While the accuracy indicates an acceptable performance of the model, the confusion matrix reveals that there is still room for improvement, as some of the classes are recognized as false positives, especially in the case of the Assemble System and No Action classes. False positives themselves do not represent a critical problem; rather, their intermittent behavior during short time intervals is problematic, since they can be mistaken for true positive detections. In the future, these problems will be mitigated by post-processing and filtering, together with a multimodal information fusion based on the recognition of human worker intention, as shown in [29].
The work presented in this paper, together with the work discussed in [29], represents the building blocks for efficient HRC, the final goal of this research, in which the robot and the worker can closely cooperate.

Author Contributions

Conceptualization, L.O. and T.S.; methodology, L.O. and T.S.; software, L.O.; validation, L.O., T.S. and L.K.; formal analysis, L.O.; investigation, L.O.; resources, T.S.; data curation, L.O.; writing—original draft preparation, L.O. and T.S.; writing—review and editing, L.O. and T.S.; visualization, L.O.; supervision, T.S.; project administration, T.S.; funding acquisition, T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the principles of the Declaration of Helsinki, and it was approved by the Institutional Review Board. The research presented in this paper was not registered in The Clinical Trial Registration because it is purely observational and does not require registration.

Informed Consent Statement

No study involving humans was performed in this research project.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

This work was supported in part by the Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition—AMICORC (UIP-2020-02-7184)” and Visage Technologies AB.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mincă, E.; Filipescu, A.; Cernega, D.; Șolea, R.; Filipescu, A.; Ionescu, D.; Simion, G. Digital Twin for a Multifunctional Technology of Flexible Assembly on a Mechatronics Line with Integrated Robotic Systems and Mobile Visual Sensor—Challenges towards Industry 5.0. Sensors 2022, 22, 8153. [Google Scholar] [CrossRef] [PubMed]
  2. Abdulrahman, A.; Richards, D.; Bilgin, A.A. Exploring the influence of a user-specific explainable virtual advisor on health behaviour change intentions. Auton. Agents Multi-Agent Syst. 2022, 36, 25. [Google Scholar] [CrossRef] [PubMed]
  3. Castro-Rivera, J.; Morales-Rodríguez, M.L.; Rangel-Valdez, N.; Gómez-Santillán, C.; Aguilera-Vázquez, L. Modeling Preferences through Personality and Satisfaction to Guide the Decision Making of a Virtual Agent. Axioms 2022, 11, 232. [Google Scholar] [CrossRef]
  4. Dhou, K.; Cruzen, C. An innovative chain coding mechanism for information processing and compression using a virtual bat-bug agent-based modeling simulation. Eng. Appl. Artif. Intell. 2022, 113, 104888. [Google Scholar] [CrossRef]
  5. Saeed, I.A.; Selamat, A.; Rohani, M.F.; Krejcar, O.; Chaudhry, J.A. A Systematic State-of-the-Art Analysis of Multi-Agent Intrusion Detection. IEEE Access 2020, 8, 180184–180209. [Google Scholar] [CrossRef]
  6. Schmitz, A. Human–Robot Collaboration in Industrial Automation: Sensors and Algorithms. Sensors 2022, 22, 5848. [Google Scholar] [CrossRef]
  7. Stipancic, T.; Koren, L.; Korade, D.; Rosenberg, D. PLEA: A social robot with teaching and interacting capabilities. J. Pac. Rim Psychol. 2021, 15, 18344909211037019. [Google Scholar] [CrossRef]
  8. Wang, L.; Majstorovic, V.D.; Mourtzis, D.; Carpanzano, E.; Moroni, G.; Galantucci, L.M. Proceedings of the 5th International Conference on the Industry 4.0 Model for Advanced Manufacturing, Belgrade, Serbia, 1–4 June 2020; Lecture Notes in Mechanical Engineering; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar] [CrossRef]
  9. Lasota, P.A.; Fong, T.; Shah, J.A. A Survey of Methods for Safe Human-Robot Interaction. Found. Trends Robot. 2017, 5, 261–349. [Google Scholar] [CrossRef]
  10. Ajoudani, A.; Zanchettin, A.M.; Ivaldi, S.; Albu-Schäffer, A.; Kosuge, K.; Khatib, O. Progress and prospects of the human–robot collaboration. Auton. Robot. 2018, 42, 957–975. [Google Scholar] [CrossRef] [Green Version]
  11. Semeraro, F.; Griffiths, A.; Cangelosi, A. Human–robot collaboration and machine learning: A systematic review of recent research. Robot. Comput.-Integr. Manuf. 2023, 79, 102432. [Google Scholar] [CrossRef]
  12. Ogenyi, U.E.; Liu, J.; Yang, C.; Ju, Z.; Liu, H. Physical Human–Robot Collaboration: Robotic Systems, Learning Methods, Collaborative Strategies, Sensors, and Actuators. IEEE Trans. Cybern. 2019, 51, 1888–1901. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Bi, Z.; Luo, M.; Miao, Z.; Zhang, B.; Zhang, W.; Wang, L. Safety assurance mechanisms of collaborative robotic systems in manufacturing. Robot. Comput.-Integr. Manuf. 2021, 67, 102022. [Google Scholar] [CrossRef]
  14. Chandrasekaran, B.; Conrad, J.M. Human-robot collaboration: A survey. In Proceedings of the SoutheastCon 2015, Fort Lauderdale, FL, USA, 9–12 April 2015; pp. 1–8. [Google Scholar] [CrossRef]
  15. Mukherjee, D.; Gupta, K.; Chang, L.H.; Najjaran, H. A Survey of Robot Learning Strategies for Human-Robot Collaboration in Industrial Settings. Robot. Comput.-Integr. Manuf. 2022, 73, 102231. [Google Scholar] [CrossRef]
  16. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11. [Google Scholar] [CrossRef] [Green Version]
  17. Shaikh, M.; Chai, D. RGB-D Data-Based Action Recognition: A Review. Sensors 2021, 21, 4246. [Google Scholar] [CrossRef] [PubMed]
  18. Banos, O.; Galvez, J.-M.; Damas, M.; Pomares, H.; Rojas, I. Window Size Impact in Human Activity Recognition. Sensors 2014, 14, 6474–6499. [Google Scholar] [CrossRef] [Green Version]
  19. Maeda, G.; Ewerton, M.; Neumann, G.; Lioutikov, R.; Peters, J. Phase estimation for fast action recognition and trajectory generation in human–robot collaboration. Int. J. Robot. Res. 2017, 36, 1579–1594. [Google Scholar] [CrossRef]
  20. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar] [CrossRef]
  21. Dallel, M.; Havard, V.; Baudry, D.; Savatier, X. InHARD—Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics. In Proceedings of the IEEE International Conference on Human-Machine Systems (ICHMS), Rome, Italy, 7–9 September 2020; pp. 1–6. [Google Scholar] [CrossRef]
  22. Carreira, J.; Noland, E.; Hillier, C.; Zisserman, A. A short note on the kinetics-700 human action dataset. arXiv 2019, arXiv:1907.06987. [Google Scholar]
  23. Ullah, A.; Muhammad, K.; Del Ser, J.; Baik, S.W.; de Albuquerque, V.H.C. Activity Recognition Using Temporal Optical Flow Convolutional Features and Multilayer LSTM. IEEE Trans. Ind. Electron. 2018, 66, 9692–9702. [Google Scholar] [CrossRef]
  24. Li, S.; Fan, J.; Zheng, P.; Wang, L. Transfer Learning-enabled Action Recognition for Human-robot Collaborative Assembly. Procedia CIRP 2021, 104, 1795–1800. [Google Scholar] [CrossRef]
  25. Fazli, M.; Kowsari, K.; Gharavi, E.; Barnes, L.; Doryab, A. HHAR-net: Hierarchical Human Activity Recognition using Neural Networks. In Intelligent Human Computer Interaction—IHCI 2020; Springer: Cham, Switzerland, 2021; pp. 48–58. [Google Scholar] [CrossRef]
  26. Moniz, A.B. Intuitive Interaction Between Humans and Robots in Work Functions at Industrial Environments: The Role of Social Robotics. In Social Robots from a Human Perspective; Springer: Cham, Switzerland, 2015; pp. 67–76. [Google Scholar] [CrossRef]
  27. Jerbic, B.; Stipancic, T.; Tomasic, T. Robotic bodily aware interaction within human environments. In Proceedings of the SAI Intelligent Systems Conference (IntelliSys), London, UK, 10–11 November 2015; pp. 305–314. [Google Scholar] [CrossRef]
  28. Huang, J.; Huo, W.; Xu, W.; Mohammed, S.; Amirat, Y. Control of Upper-Limb Power-Assist Exoskeleton Using a Human-Robot Interface Based on Motion Intention Recognition. IEEE Trans. Autom. Sci. Eng. 2015, 12, 1257–1270. [Google Scholar] [CrossRef]
  29. Orsag, L.; Stipancic, T.; Koren, L.; Posavec, K. Human Intention Recognition for Safe Robot Action Planning Using Head Pose. In HCI International 2022—Late Breaking Papers. Multimodality in Advanced Interaction Environments: HCII 2022; Springer: Cham, Switzerland, 2022; pp. 313–327. [Google Scholar] [CrossRef]
  30. Matsumoto, Y.; Ogasawara, T.; Zelinsky, A. Behavior recognition based on head pose and gaze direction measurement. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000) (Cat. No.00CH37113), Takamatsu, Japan, 31 October–5 November 2002; pp. 2127–2132. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Wang, H.; Geng, J.; Jiang, W.; Deng, X.; Miao, W. An information fusion method based on deep learning and fuzzy discount-weighting for target intention recognition. Eng. Appl. Artif. Intell. 2022, 109, 104610. [Google Scholar] [CrossRef]
  32. Cubero, C.G.; Rehm, M. Intention Recognition in Human Robot Interaction Based on Eye Tracking. In Human-Computer Interaction—INTERACT 2021: INTERACT 2021; Springer: Cham, Switzerland, 2021; pp. 428–437. [Google Scholar] [CrossRef]
  33. Lindblom, J.; Alenljung, B. The ANEMONE: Theoretical Foundations for UX Evaluation of Action and Intention Recognition in Human-Robot Interaction. Sensors 2020, 20, 4284. [Google Scholar] [CrossRef] [PubMed]
  34. Liu, Z.; Hao, J. Intention Recognition in Physical Human-Robot Interaction Based on Radial Basis Function Neural Network. J. Robot. 2019, 2019, 4141269. [Google Scholar] [CrossRef]
  35. Awais, M.; Saeed, M.Y.; Malik, M.S.A.; Younas, M.; Asif, S.R.I. Intention Based Comparative Analysis of Human-Robot Interaction. IEEE Access 2020, 8, 205821–205835. [Google Scholar] [CrossRef]
  36. Fan, J.; Zheng, P.; Li, S. Vision-based holistic scene understanding towards proactive human–robot collaboration. Robot. Comput.-Integr. Manuf. 2022, 75, 102304. [Google Scholar] [CrossRef]
  37. Stipancic, T.; Jerbic, B. Self-adaptive Vision System. In Emerging Trends in Technological Innovation—DoCEIS 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 195–202. [Google Scholar] [CrossRef] [Green Version]
  38. Wang, P.; Liu, H.; Wang, L.; Gao, R.X. Deep learning-based human motion recognition for predictive context-aware human-robot collaboration. CIRP Ann. 2018, 67, 17–20. [Google Scholar] [CrossRef]
  39. Zhang, R.; Lv, Q.; Li, J.; Bao, J.; Liu, T.; Liu, S. A reinforcement learning method for human-robot collaboration in assembly tasks. Robot. Comput.-Integr. Manuf. 2022, 73, 102227. [Google Scholar] [CrossRef]
  40. Sadrfaridpour, B.; Wang, Y. Collaborative Assembly in Hybrid Manufacturing Cells: An Integrated Framework for Human–Robot Interaction. IEEE Trans. Autom. Sci. Eng. 2017, 15, 1178–1192. [Google Scholar] [CrossRef]
  41. Moutinho, D.; Rocha, L.F.; Costa, C.M.; Teixeira, L.F.; Veiga, G. Deep learning-based human action recognition to leverage context awareness in collaborative assembly. Robot. Comput.-Integr. Manuf. 2023, 80, 102449. [Google Scholar] [CrossRef]
  42. Rahman, S.M.; Wang, Y. Mutual trust-based subtask allocation for human–robot collaboration in flexible lightweight assembly in manufacturing. Mechatronics 2018, 54, 94–109. [Google Scholar] [CrossRef]
  43. Mavsar, M.; Denisa, M.; Nemec, B.; Ude, A. Intention Recognition with Recurrent Neural Networks for Dynamic Human-Robot Collaboration. In Proceedings of the 20th International Conference on Advanced Robotics (ICAR), Ljubljana, Slovenia, 6–10 December 2021; pp. 208–215. [Google Scholar] [CrossRef]
  44. Nemec, B.; Mavsar, M.; Simonic, M.; Hrovat, M.M.; Skrabar, J.; Ude, A. Integration of a reconfigurable robotic workcell for assembly operations in automotive industry. In Proceedings of the IEEE/SICE International Symposium on System Integration (SII), Narvik, Norway, 9–12 January 2022; pp. 778–783. [Google Scholar] [CrossRef]
  45. Bulling, A.; Blanke, U.; Schiele, B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. 2014, 46, 33. [Google Scholar] [CrossRef]
  46. Tan, H.H.; Lim, K.H. Vanishing Gradient Mitigation with Deep Learning Neural Network Optimization. In Proceedings of the 7th International Conference on Smart Computing & Communications (ICSCC), Sarawak, Malaysia, 28–30 June 2019; pp. 1–4. [Google Scholar] [CrossRef]
  47. Hu, Z.; Zhang, J.; Ge, Y. Handling Vanishing Gradient Problem Using Artificial Derivative. IEEE Access 2021, 9, 22371–22377. [Google Scholar] [CrossRef]
  48. Kim, S.; Wimmer, H.; Kim, J. Analysis of Deep Learning Libraries: Keras, PyTorch, and MXnet. In Proceedings of the IEEE/ACIS 20th International Conference on Software Engineering Research, Management and Applications (SERA), Las Vegas, NV, USA, 25–27 May 2022; pp. 54–62. [Google Scholar] [CrossRef]
  49. Pyvovar, M.; Pohudina, O.; Pohudin, A.; Kritskaya, O. Simulation of Flight Control of Two UAVs Based on the “Master-Slave” Model. In Integrated Computer Technologies in Mechanical Engineering—2021: ICTM 2021; Springer: Cham, Switzerland, 2022; pp. 902–907. [Google Scholar] [CrossRef]
Figure 1. Safe HRC framework, based on human activity recognition and pose prediction.
Figure 2. InHARD skeleton preview in Blender: (a) picking left; (b) picking in front.
Figure 3. Relevant activity event distribution over one recording in the InHARD dataset.
Figure 4. Skeleton hierarchy considered as the input to LSTM.
Figure 5. Cumulative explained variance ratio as a result of the PCA.
Figure 6. Image representing the skeleton data used for training the NN model and the sliding window approach: (a) view of the skeleton data with the points and motion data for one point (a hand motion is presented in this example); (b) hand motion presented as timeseries.
Figure 7. Overview of the complete LSTM NN for activity recognition.
Figure 8. HRC team modelled using the CoppeliaSimEdu software package.
Figure 9. An example of an implemented CoppeliaSimEdu Graph element.
Figure 10. Accuracy and loss plots for training and validation: (a) accuracy plot for the network with 51 input features; (b) training and validation loss for the network with 51 input features; (c) accuracy plot for the network with 27 input features; (d) training and validation loss for the network with 27 input features.
Figure 11. Confusion matrix for the model with 27 input features (nine joints).
Figure 12. Online model performance results: (a) event plot for the Picking Left class; (b) event plot for the Picking Left class (second example); (c) event plot for the Picking In Front class; (d) event plot for the Assemble System class; (e) event plot for the No Action class.
Table 1. InHARD dataset low-level action classes. The highlighted classes are chosen as the action classes of interest.

Action ID | Meta-Action Label
0 | No action 1
1 | Consult sheets
2 | Turn sheets
3 | Take screwdriver
4 | Put down screwdriver
5 | Pick in front 1
6 | Pick left 1
7 | Take measuring rod
8 | Put down measuring rod
9 | Take component
10 | Put down component
11 | Assemble system 1
12 | Take subsystem
13 | Put down subsystem

1 Actions considered for the purposes of this article.
Table 2. Overview of the dimensionality reduction results.

Dimensionality Reduction Method | Resulting Number of Feature Points | Comment
None 1 | 63 | N/A
Visual inspection 2 | 51 | N/A
PCA | 27 | Human worker symmetry is compromised 3

1 No dimensionality reduction method is performed, and the dataset remains the same. 2 An engineering assumption is made, according to which the lower part of the body (up to the worker’s hips) will be omitted in the setup, as presented in Figure 8. 3 One of the joints is deemed unnecessary for the dataset’s description due to the low explained variance ratio value. Following the PCA, the dataset is analyzed to confirm the validity of the results. The LeftArm joint is in a position where it does not need to be moved to a large extent.
Table 3. Precision, recall, and F1-score for the trained models.

Metric | Model with 51 Input Features | Model with 27 Input Features
Precision | 0.709 | 0.709
Recall | 0.699 | 0.694
F1-Score | 0.686 | 0.680
Table 4. Online accuracy results for each subclass.

Subclass | Accuracy (%)
Picking In Front | ~63
Picking Left | ~67
No Action | ~39
Assemble System | ~38
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

