1. Introduction
With the rapid development of technologies such as the industrial Internet, artificial intelligence, big data, and cloud computing, the manufacturing industry faces a significant opportunity to move from the digital and networked stage toward intelligence. In smart manufacturing, assembly is a key process whose quality directly determines the quality of the product [1]. However, traditional manufacturing usually adopts a work mode in which humans and machines are completely separated. Although robots can perform relatively simple or highly repetitive assembly tasks, they still cannot handle complex or highly flexible ones. In addition, the interaction methods of traditional industrial robots are mainly limited to touch-based hardware devices such as keyboards and mice, which restrict the areas in which humans can move and confine robot task execution to human-defined settings. Collaborative assembly between humans and robots can leverage the strengths of both parties: robots provide great power, high precision, and tirelessness, while humans contribute strong cognitive abilities. Therefore, in order to improve the speed, accuracy, and safety of assembly, human–robot collaboration has gradually become a development trend, replacing some traditional manual assembly workstations [2,3].
Although collaborative robots have been applied in various manufacturing settings, problems such as low robot motion flexibility and low human–robot collaboration efficiency still exist because human intentions are difficult to recognize [4]. In the human–robot collaborative assembly process, humans and robots usually work in the same workspace to perform assembly tasks, so robot malfunctions or mistakes by assembly personnel may affect the assembly progress and even the operation of the entire production line. Therefore, the interaction behavior of robots with humans and the perception of actions during collaboration are crucial. Liau and Ryu [5] developed a Human–Robot Collaboration (HRC) mold assembly status recognition system based on object and action recognition using a pre-trained YOLOv5 model; the system enhanced the sustainability of mold assembly and reduced labor and assembly time. Zhang et al. [6] proposed a deep learning-based observation method to analyze human actions during the assembly process. Built on a Recurrent Neural Network (RNN), it predicts future human motion trajectories to guide robot action planning, and its effectiveness was validated on an engine assembly example. Lv et al. [7] proposed a human–robot cooperative assembly framework based on digital-twin technology. The framework adopted the Double Deep Deterministic Policy Gradient (DDPG) optimization model to reduce time losses in part selection and repair, demonstrating the enhanced efficiency and safety achieved in human–robot cooperative assembly using digital-twin technology. Berg et al. [8] proposed a task action-recognition method for human–robot cooperative assembly based on Hidden Markov Models; by recognizing actions, the robot becomes aware of the task the human is currently executing, enabling flexible adjustments to adapt to human working styles. Lin et al. [9] used deep learning algorithms to classify muscle signals of human body movements and determine human motion intention, which not only enhanced communication and efficiency in human–robot collaboration but also enabled the detection of human fatigue, thus achieving sustainable human–machine collaborative workspaces.
Currently, human motion recognition mainly captures the actions of personnel through devices such as sensors and cameras and transmits them in real time to the robot control system for processing [10]. Among the sensors used as data sources for human behavior analysis, cameras are particularly popular because computer vision technology allows accurate perception and recognition of human poses and actions, making interaction between humans and robots more natural and flexible [11,12]. Camera-based visual recognition methods also reduce the complexity and cost of collecting assembly motion data while lightening the burden on humans. These methods can be further divided into posture recognition based on RGB video data and posture recognition based on skeleton data obtained from human depth images. The latter represents a person’s posture and behavior through the position variations of a few simple joints, which is robust to complex backgrounds, lighting changes, and viewpoint variations and therefore yields better recognition performance [13]. However, due to camera sensor accuracy and noise, robots often fail to perceive actions or suffer from poor recognition accuracy and real-time performance when recognizing human motions, which not only greatly affects the smoothness and efficiency of assembly but may also pose threats to humans themselves. To overcome these problems, motion-recognition methods based on deep learning have developed rapidly in recent years. By learning from large amounts of data, these methods achieve higher recognition accuracy and better real-time performance, thereby improving the efficiency and safety of human–robot collaboration. For example, action recognition based on Convolutional Neural Networks (CNNs) using skeletal features has achieved good results [14,15]. Zhu et al. [16] proposed a deep neural network architecture based on a bidirectional Long Short-Term Memory Convolutional Neural Network (LSTM-CNN) for recognizing human postures from skeletal data. Although such methods can effectively learn temporal features, most of them do not make good use of the spatial structural information present in human actions. Orazio et al. [17] implemented real-time tracking of human skeletons using the Open AI framework, extracting key features and feeding them into a neural network for posture recognition. Cherubini et al. [18] used the OpenPose 3D skeletal extraction library to obtain human skeletal joint coordinates and applied convolutional neural networks for gesture detection, which was then used in human–robot interaction. Yan et al. [19] first proposed applying graph neural networks to skeleton-based action-recognition tasks, namely the Spatial-Temporal Graph Convolutional Network (ST-GCN); the incorporation of spatial and temporal convolutions into skeleton-based behavior recognition showed excellent robustness and novelty. Dallel et al. [20] used digital-twin and VR technology to simulate industrial workstation assembly tasks performed in cooperation with a robotic arm and trained a human action-recognition model using ST-GCN. However, ST-GCN has a relatively weak ability to express temporal information and needs improvement in capturing the behavioral characteristics of individuals in different application scenarios. Liu et al. [21] improved the model’s ability to extract temporal features by incorporating residual networks into the temporal processing of the nodes. Cao et al. [22] proposed a new partition self-attention spatial-temporal graph convolutional network (NP-AGCN).
The use of human motion-recognition technology in assembly can overcome the limitations of traditional assembly methods and improve the efficiency of collaborative tasks and the assembly process. It can also accurately monitor, quickly correct, and adaptively adjust human operations, ensuring assembly stability and quality. This paper addresses these issues in human–robot collaborative assembly with a human action-recognition method based on visual cameras and a hybrid convolutional neural network. Specifically, human actions are captured by a camera, and an assembly action-recognition model based on a hybrid convolutional neural network combining ST-GCN and 1DCNN is studied to further improve the performance and application potential of assembly action recognition. The ST-GCN is used to extract spatial and temporal features from the human skeletal topology, enhancing the generalization performance of the model, while the 1DCNN further enhances the capability of extracting temporal features.
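For illustration only, the following is a minimal conceptual sketch of how an ST-GCN-style backbone could be combined with a 1D convolutional temporal head in PyTorch. The layer sizes, adjacency handling, kernel sizes, and dropout rate are illustrative assumptions and do not reproduce the exact architecture detailed in Section 2.

# Minimal conceptual sketch: an ST-GCN-style spatio-temporal block followed by a
# 1DCNN temporal head for classifying skeleton sequences. Illustrative only.
import torch
import torch.nn as nn


class STGCNBlock(nn.Module):
    """One simplified spatial-temporal graph convolution block."""

    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)  # fixed adjacency over skeletal joints
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()
        # Residual connection to stabilize training.
        self.residual = (
            nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, kernel_size=1)
        )

    def forward(self, x):                               # x: (batch, channels, frames, joints)
        res = self.residual(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)    # aggregate neighboring joints
        x = self.temporal(self.spatial(x))
        return self.relu(self.bn(x) + res)


class HybridActionNet(nn.Module):
    """ST-GCN backbone + 1DCNN temporal head (illustrative only)."""

    def __init__(self, num_joints=24, num_classes=7):
        super().__init__()
        adjacency = torch.eye(num_joints)               # placeholder: real model uses the skeleton graph
        self.block1 = STGCNBlock(3, 64, adjacency)
        self.block2 = STGCNBlock(64, 128, adjacency)
        self.head = nn.Sequential(                      # 1DCNN over the temporal axis
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):                               # x: (batch, 3, frames, joints)
        x = self.block2(self.block1(x))
        x = x.mean(dim=3)                               # pool over joints -> (batch, C, frames)
        x = self.head(x).squeeze(-1)
        return self.fc(self.dropout(x))


# Example: a batch of 4 clips, 3D coordinates, 100 frames, 24 joints.
logits = HybridActionNet()(torch.randn(4, 3, 100, 24))
print(logits.shape)                                     # torch.Size([4, 7])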
The organization of this paper is outlined as follows: related research is introduced in Section 1; the assembly action-recognition method based on an ST-GCN and 1DCNN hybrid neural network is elaborated in Section 2; the validation experiments and results of the algorithm model are presented in Section 3; and the effectiveness of the application case is showcased in Section 4. Finally, the research is summarized in Section 5, its contributions are discussed, and future work is outlined.
3. Experimental Results and Performance Evaluation
The hybrid convolutional neural network action-recognition algorithm is implemented in the PyTorch framework. The hardware environment is a Lenovo Legion computer (manufactured in China) running Windows 10 (64-bit), equipped with an Intel i7-10875H processor, an NVIDIA RTX 2060 graphics card, and 16 GB of memory. In an Anaconda environment, Python 3.6 is configured, and torch 1.9.1 (CPU version) and the corresponding torchvision are installed to complete the setup.
3.1. Data Set Construction
This study uses the Azure Kinect depth camera and the SDKs provided by the developer (Body Tracking SDK v1.0.1 and Sensor SDK v1.4.0) to collect assembly motion data. All actions were performed at the assembly workstation; some were completed by humans alone, while others were performed collaboratively by humans and robots. A total of 12 participants took part in the data collection, with a male-to-female ratio of 7:5. The average age was 22 years, heights ranged from 155 to 193 cm, and weights ranged from 44 to 102 kg.
During the data collection process, participants sat diagonally in front of the Azure Kinect depth camera and were instructed to perform ten sets of actions that may occur during the assembly process, with each set repeated ten times. Collecting each set took approximately 20–25 min. To enhance the versatility of the dataset for action recognition, the experimenter explained the assembly process and the required actions to the participants before collection, without restricting specific action behaviors.
To visualize the data collection process, a Unity visualization scene was constructed in which the body joints of the participants were overlaid with virtual objects and lines. The studied human–robot collaborative assembly application mainly collects the skeletal point motion sequences of the upper body, with 24 skeletal joint nodes recorded per frame. The data collected for each frame therefore form a matrix of 24 (skeletal key points) × 3 (coordinate dimensions), so the total data size of a sequence is the number of frames × 24 × 3. If skeletal joints are missing within a single frame, linear interpolation is used to supplement the data. Each record is stored in the format “timestamp (running time) + joint name (Joint Type) + pos: + three-dimensional coordinates of the skeletal joint”, and each set of actions is output as a txt file.
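As a hedged illustration of this storage format, the following Python sketch parses one such txt file into a (frames × 24 × 3) array and fills missing joints by linear interpolation over time. The exact line layout, delimiters, and joint-name mapping are assumptions, not the recorded format itself.

import numpy as np

NUM_JOINTS = 24

def parse_frames(lines, joint_index):
    """Parse txt lines into a (frames, 24, 3) array.

    joint_index maps a joint name to its row index (assumed layout).
    Assumed line format: "<timestamp> <JointType> pos: <x> <y> <z>".
    """
    frames, current, last_t = [], np.full((NUM_JOINTS, 3), np.nan), None
    for line in lines:
        t, joint, _, x, y, z = line.split()
        if last_t is not None and t != last_t:      # a new timestamp starts a new frame
            frames.append(current)
            current = np.full((NUM_JOINTS, 3), np.nan)
        current[joint_index[joint]] = [float(x), float(y), float(z)]
        last_t = t
    frames.append(current)
    return np.stack(frames)

def interpolate_missing(data):
    """Fill missing (NaN) joint coordinates by linear interpolation over time."""
    t = np.arange(data.shape[0])
    for j in range(data.shape[1]):                  # joints
        for d in range(data.shape[2]):              # x, y, z
            col = data[:, j, d]
            gaps = np.isnan(col)
            if gaps.any() and not gaps.all():
                col[gaps] = np.interp(t[gaps], t[~gaps], col[~gaps])
    return data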
The data collection of actions is shown in
Table 1. There are seven categories of action types, with 120 collections for each category, so a dataset of 120 samples is obtained per action type. Each action dataset is labeled as “Action + ‘Num’ + ‘-x’”, where ‘Num’ represents the action type label and ‘x’ represents the dataset index.
3.2. Loss Function
The cross-entropy loss function is widely used in classification tasks. It avoids problems such as gradient vanishing and gradient explosion, and it updates the weights faster for larger errors and more slowly for smaller errors, which enables the model to learn and adjust its parameters more efficiently and improve classification accuracy. Therefore, this paper uses the cross-entropy loss function to reflect the difference between the model’s predicted results and the true labels. The basic formula for calculating the cross-entropy loss function is

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\log p_{ic}$$

where $C$ is the number of classes, $y_{ic}$ is the one-hot encoding of the true label, and $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$.

Based on the above model structure, this paper adopts the binary cross-entropy loss function for calculation, which can be expressed as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ represents the total number of samples in the dataset, $y_i$ represents the true label of the $i$-th sample (with a value of 0 or 1), and $p_i$ represents the predicted output probability of the $i$-th sample. The term $-\log p_i$ measures the loss when $y_i = 1$: as the predicted probability increases, the loss decreases, and as $p_i$ approaches 0 the loss approaches positive infinity, which means a higher cost for misclassifying positive samples. The term $-\log(1 - p_i)$ measures the loss when $y_i = 0$: the lower the predicted probability of belonging to the positive class, the lower the loss, and as $p_i$ approaches 1 the loss approaches positive infinity, indicating a higher cost for misclassifying negative samples.
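As a quick numeric check of this expression, the following sketch evaluates the binary cross-entropy by hand on a few illustrative labels and probabilities and compares it with PyTorch’s built-in torch.nn.BCELoss; the values are made up for demonstration.

import torch
import torch.nn as nn

y = torch.tensor([1.0, 0.0, 1.0])       # true labels y_i
p = torch.tensor([0.9, 0.2, 0.6])       # predicted probabilities p_i

manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = nn.BCELoss()(p, y)            # BCELoss(input=p, target=y)
print(manual.item(), builtin.item())    # both ≈ 0.2798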
The model is trained with the Stochastic Gradient Descent (SGD) algorithm. The initial learning rate is set to 0.1; every 10 epochs form a stage, after which the learning rate is multiplied by a factor of 0.1. The total training lasts for 50 epochs.
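A minimal sketch of this optimization schedule in PyTorch is shown below, assuming a placeholder model and dummy batches; the momentum value and batch contents are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.Linear(72, 7)                               # placeholder for the hybrid network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):                                # 50 training epochs in total
    # One pass over the real training loader would go here; a dummy batch is shown.
    inputs, labels = torch.randn(8, 72), torch.randint(0, 7, (8,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                                   # lr: 0.1 -> 0.01 -> 0.001 -> ...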
The dataset is divided into a training set and a testing set at a ratio of 8:2. During the testing phase, the recognition performance is evaluated using Top-1 and Top-5 accuracy. The Top-1 criterion selects the action category with the highest predicted probability as the result, while the Top-5 criterion considers the five predictions with the highest probabilities; as long as the correct category appears among them, the prediction is considered correct. The hybrid convolutional neural network model stacks six spatial-temporal graph convolution layers, and the hyperparameters of each layer are shown in
Table 2.
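The Top-1/Top-5 criteria described above can be computed with a small helper such as the following sketch; the class count and random scores are placeholders.

import torch

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices                 # (batch, k) predicted classes
    hits = (topk == labels.unsqueeze(1)).any(dim=1)      # true class among the top k?
    return hits.float().mean().item()

logits = torch.randn(100, 7)                             # e.g., 7 action classes
labels = torch.randint(0, 7, (100,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))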
The variation of the loss of the hybrid convolutional neural network model on the training and testing sets with the number of iterations is shown in Figure 5. The loss change is recorded for each epoch during the training process. After 20 iterations, the training loss decreased from 0.23 to 0.02; correspondingly, the Top-1 accuracy reached 90% and the Top-5 accuracy reached 98%.
3.3. Performance Evaluation Metrics
Action recognition is essentially a multi-class classification problem [
26]. The research on action-recognition algorithms mainly focuses on the accuracy of the model. The calculation of this metric is based on a confusion matrix, as shown in
Table 3.
Accuracy refers to the proportion of correctly classified samples among all samples and is used to measure the overall classification accuracy of a model. It reflects the classification performance and generalization ability of the model and is the most commonly used evaluation metric. This index applies to the model as a whole. In this study, it is the proportion of all correctly identified assembly actions among all samples during human–robot collaborative assembly, representing the accuracy of the model in predicting all assembly actions. The formula can be expressed as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision refers to the proportion of true-positive samples among all samples predicted as positive, which is used to measure the reliability of the positive samples in the model’s classification results. It can be used to reflect the model’s error rate on positive predictions. This index targets the prediction accuracy of a specific category. In this study, it refers to the proportion of real samples of each assembly action among all samples predicted as that action during human–robot collaborative assembly. It represents the accuracy of the model in predicting a specific assembly action. The formula can be expressed as

$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall refers to the proportion of true-positive samples among all actual positive samples. It is used to measure the probability of correctly identifying positive samples and can reflect the model’s false-negative rate. The formula can be expressed as

$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall and is used to evaluate the overall accuracy of a classification model. The formula can be expressed as

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
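For concreteness, the four metrics can be computed directly from confusion-matrix counts as in the short sketch below; the counts are illustrative values only.

# Illustrative confusion-matrix counts (not measured results).
TP, FP, FN, TN = 42, 5, 4, 49

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")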
3.4. Model Validity Verification
To verify the effectiveness of the proposed hybrid convolutional neural network model, a control experiment was conducted as shown in
Table 4. The experiment focused on achieving high accuracy in recognizing human skeletal sequences to improve assembly efficiency. The hybrid convolutional neural network model proposed in this paper was compared with other methods, and the Top-1 and Top-5 accuracies of each model are listed. The experiments first verify the effectiveness of the basic static graph convolutional network; incorporating the 1DCNN improves the accuracy to 83.4%, and the complete hybrid convolutional model further increases the accuracy to 91.7%. As the table shows, the hybrid convolutional neural network model effectively utilizes the prior knowledge of human skeletal sequences and considers both temporal and spatial features, resulting in significantly higher recognition accuracy than the existing methods.
Specifically, fewer than 10 incorrect judgments were made out of 100 test samples, showing that the proposed hybrid convolutional neural network model has good generalization performance and can make reliable predictions for unseen samples. However, the dataset used in this paper is relatively small, which may have limited the overall accuracy; it can be inferred that the accuracy of 91.7% is far from the upper limit of the model. While this study aimed to improve the method itself rather than expand the dataset, it is reasonable to believe that with a larger annotated dataset the model could achieve a classification accuracy above 99%.
In conclusion, the proposed hybrid convolutional neural network model surpassed existing methods in recognizing human skeletal sequences. Its ability to leverage prior knowledge and consider both temporal and spatial features resulted in significantly higher accuracy rates. Although there are limitations associated with the dataset size, our results suggest great potential for future improvements and applications in the field.
To further evaluate the comprehensive performance of the proposed ST-GCN + 1DCNN action-recognition method, it is compared with the GCN + 1DCNN action-recognition method; the comparative results are shown in Table 5.
In terms of accuracy, recall, and the F1 score, the ST-GCN + 1DCNN recognition method is significantly superior to the GCN + 1DCNN method. Its consistently higher accuracy and recall indicate that it not only classifies actions more accurately but also separates positive and negative samples more reliably, and the F1 score further shows that the proposed ST-GCN + 1DCNN hybrid convolutional neural network model identifies target actions more accurately and has better overall performance in action-recognition tasks. This indicates that, compared with the GCN + 1DCNN model, the hybrid model proposed in this paper, which models the temporal and spatial features of action sequences through graph structures for feature propagation and information aggregation, can better capture the spatio-temporal characteristics of action sequences.
5. Conclusions
In this study, an assembly action-recognition model based on a hybrid convolutional neural network combining ST-GCN and 1DCNN is proposed and implemented to recognize the assembly behavior of operators during the assembly process. The ST-GCN extracts the spatial relationships between joints and the temporal information of actions, capturing interactions between key points and temporal features through spatio-temporal convolutions. To improve generalization, a residual structure is used to enhance the feature representation, a BN layer is used for data normalization, and a dropout layer is used to prevent overfitting. The 1DCNN enhances the temporal feature extraction ability and improves the classification performance. The model’s usability is validated on an assembly action dataset, achieving a recognition accuracy of 91.7%, and a comparison with other neural networks demonstrates the superior performance of this hybrid CNN in assembly action recognition.
Moreover, a human–robot collaborative assembly integration system based on digital-twin technology is developed in Unity, realizing data transmission and interaction between software and hardware devices and verifying the feasibility of the human–robot collaborative assembly action-recognition method based on the hybrid convolutional neural network. The experimental results show that the proposed method is significantly better than both purely manual assembly and fixed-process human–robot collaboration in terms of assembly efficiency.
This study provides new technical support for human–robot collaborative assembly in the field of intelligent manufacturing, with significant implications for improving assembly efficiency and accuracy. However, this study also has some limitations, such as a small dataset size and room for improvement in model generalization capability. Future studies can focus on optimizing the model design, expanding the dataset size, and integrating other technologies and algorithms to further enhance the efficiency and accuracy of the assembly process.