Article

A Multi-Scale and Multi-Stage Human Pose Recognition Method Based on Convolutional Neural Networks for Non-Wearable Ergonomic Evaluation

1 Department of Information Security Technology, Jilin Police College, Changchun 130117, China
2 School of Mechanical and Electrical Engineering, Changchun University of Science and Technology, Changchun 130013, China
3 People’s Liberation Army (PLA) Unit 32184, Beijing 100072, China
4 Automotive Parts Intelligent Manufacturing Assembly Inspection Technology and Equipment University-Enterprise Joint Innovation Laboratory, Changchun 130013, China
* Authors to whom correspondence should be addressed.
Processes 2024, 12(11), 2419; https://doi.org/10.3390/pr12112419
Submission received: 2 August 2024 / Revised: 9 October 2024 / Accepted: 28 October 2024 / Published: 2 November 2024
(This article belongs to the Section Energy Systems)

Abstract

In the context of industrial robot maintenance and assembly, workers often suffer from work-related musculoskeletal disorders (WRMSDs). This paper proposes a multi-scale, multi-stage pose recognition method (MMARM-CNN) based on convolutional neural networks to provide ergonomic intervention. The method leverages computer vision technology to enable non-contact data acquisition, reducing the interference of physiological and psychological factors on assessment results. Built upon the baseline YOLOv8-pose framework, the method addresses complex maintenance environments, which are prone to occlusion, by introducing the Lightweight Shared Convolutional Detection Head-pose (LSCD-pose) module, the Multi-Scale Channel Attention (MSCA) mechanism, and the Efficient Multi-Scale Patch Convolution (EMSPC) module, enhancing the model’s feature extraction capabilities. The MMARM-CNN model was validated using the MS COCO 2017 dataset and robot assembly data collected under laboratory conditions. The experimental results show that MMARM-CNN achieved an accuracy improvement, reaching 0.875 in the mAP@0.5 evaluation. Overall, this method demonstrates significant potential in advancing the automation and intelligence of ergonomic interventions.

1. Introduction

The level of automation in industrial production has increased significantly, leading to a rise in the number of industrial robots used in manufacturing processes. Consequently, the workload for the maintenance and repair of these robots has also grown. Maintenance workers are frequently required to perform tasks such as lifting and carrying, which can predispose them to work-related musculoskeletal disorders (WRMSDs). These disorders are caused by prolonged, repetitive, or extreme postures and pose a significant threat to the health and well-being of workers across various industries [1,2]. They not only cause health issues for the affected individuals but also result in substantial economic losses for both individuals and society [3]. According to data from the World Health Organization (WHO), approximately 1.71 billion people worldwide suffer from these disorders; WRMSDs are a major contributor to global disability and the leading cause of rehabilitation needs worldwide [4]. Research published in The Lancet indicates that China has the highest rehabilitation demand globally, with total rehabilitation needs reaching 460 million people in 2021. The rehabilitation market size has surpassed 100 billion yuan, with a compound annual growth rate of 27.1%. Among these, WRMSD patients constitute the largest group, at 220 million people. The annual incidence rate of WRMSDs is 69.56%, with lumbar diseases showing the highest average incidence rate at 41.15% [5]. Therefore, it is crucial to provide timely ergonomic interventions for maintenance workers to prevent and reduce the harm caused by WRMSDs.
The premise of reducing WRMSDs is to assess the ergonomic level of work tasks, and researchers in different fields have adopted various assessment methods. Commonly used methods include the Rapid Upper Limb Assessment (RULA) [6], the Rapid Entire Body Assessment (REBA) [7], and the Ovako Working Posture Analysis System (OWAS) [8]. Traditional assessments are completed by ergonomics experts through observational methods: they observe videos or images of workers during their tasks; determine the joint angles of the upper and lower limbs, neck, and trunk, as well as the force exerted by the workers and their standing postures, based on experience; and finally complete the risk assessment of the work tasks [9,10,11]. This process is cumbersome and time-consuming, and the assessment results are strongly influenced by the experts’ subjective judgment. In recent years, researchers have used motion capture devices such as micro-inertial measurement units (IMUs) to obtain the joint angles of workers. Xinming Li and SangHyeok Han et al. used inertial devices to provide real-time posture data in ergonomics assessments and established a real-time human–machine ergonomics research platform, achieving automation of ergonomics assessments [12]. Chunxi Huang et al. studied an automated ergonomics assessment system based on wearable inertial sensors to evaluate work-related musculoskeletal disorders in the workplace and validated it under laboratory conditions [13]. Battini Daria et al. integrated motion capture systems (MoCap) and immersive reality for rapid and effective ergonomics assessments of workstations [14]. The use of IMUs has improved the convenience and accuracy of assessments, but IMUs are susceptible to interference from complex environments and magnetic fields, which can significantly affect the results [15,16]. In addition, workers may experience physiological and psychological discomfort when wearing sensors during work, which can affect the accuracy of their operations [17]. Therefore, researchers have begun to focus on non-wearable data collection methods that use Kinect depth cameras and machine learning algorithms to identify key parts of the human body. Cai et al. and Clark et al. found that the capture accuracy of Kinect is slightly lower than that of optical MoCap but sufficient for general ergonomics assessments [18,19]. Diego-Mas et al. and Manghisi et al. used joint angles obtained from Kinect to complete RULA and OWAS assessments [20,21,22]. However, Plantard et al. and Wei et al. found that the tracking quality of Kinect is not satisfactory when the body is occluded or viewed from non-frontal angles [23,24]. Despite these advancements, posture recognition remains challenged by accuracy, efficiency, and adaptability, particularly in complex environments. In response, the research community has introduced several enhancement strategies. The Hourglass network facilitates the efficient acquisition of multi-scale information through a progressive feature fusion architecture [25]. HRNet-W32 preserves the completeness of posture features by employing parallel high-resolution subnetworks, thus improving recognition accuracy [26]. DEKR enhances posture estimation flexibility and precision through a dynamic convolution mechanism [27].
Additionally, YoloV7, an efficient object detection framework, has showcased exceptional real-time performance and accuracy in posture recognition tasks [28]. While these approaches exhibit promising results within their domains, opportunities for improvement persist, especially regarding complex backgrounds, occlusion scenarios, and dynamic posture recognition. Collectively, these methods advance multi-scale feature extraction for posture recognition, augmenting both accuracy and efficiency.
In recent years, researchers have focused on computer vision, aiming to integrate deep learning technology to achieve human pose recognition. They have applied object detection technology to human pose recognition, calculating the required human joint angles by identifying key points in images. Bogo et al. and Mehta et al. utilized monocular images to identify key points of the human body and were able to reconstruct 3D human poses [29,30]. Mehta et al. used multi-view images to capture human poses, demonstrating higher accuracy than monocular methods through the coordination of multiple external cameras [31]. Vision-based methods require no markers and cause no interference to the subjects, making them an important direction for the future development of ergonomics assessment. OpenPose holds a significant position in the field of human pose recognition. Woohoo Kim et al. reported that OpenPose can effectively complete tasks and performs better than Kinect in situations with body occlusions or non-frontal tracking [32]. However, that study lacks the acquisition of some key points, such as upper arm rotation, wrist twist, and neck twist, which still need to be calculated manually. In addition, the accuracy of OpenPose is affected by the number and position of cameras. Emmanuele Barberi et al. used stereo vision and OpenPose to establish a model for posture recognition and 3D reconstruction of motorcyclists, providing a new posture assessment method for the motorcycle field [33]. Nobuyasu Nakano et al. proposed a 3D markerless motion capture technology using OpenPose with multiple synchronized video cameras to analyze tasks such as walking, backward jumping, and ball throwing [34]. However, Clark, R.A. et al. found that the accuracy of OpenPose in human kinematics assessment cannot be fully guaranteed [19]. In recent years, a related object detection technology, YOLOv8, has developed rapidly. YOLOv8 performs well against complex backgrounds and poor lighting conditions, maintaining real-time capability with high detection accuracy. Chengang Dong et al. proposed a real-time Human Pose Estimation (HPE) model called CCAM-Person based on the YOLOv8 framework, which reduces feature loss and receptive field constraints, improving detection accuracy [35]. Hicham Boudlal et al. used a skeleton-based method with Wi-Fi channel state information, integrating the YOLOv8 and Mediapipe frameworks to accurately recognize human skeletal structure and posture [36]. Shuxian Wang et al. proposed a single-stage pose recognition algorithm called yolov8-sp, which improves model performance and accuracy by introducing multi-dimensional feature fusion and an attention mechanism that automatically captures the importance of features [37].
Thus, this paper introduces and validates a multi-scale, multi-stage posture recognition method based on convolutional neural networks (MMARM-CNN), which is an enhancement of the YOLOv8 baseline model, tailored to better suit the complex environment of industrial robot maintenance. This approach offers a novel technique for ergonomic posture recognition and assessment. The specific improvements are as follows:
The MMARM-CNN incorporates the Lightweight Shared Convolutional Detection Head-pose (LSCD-pose) module, the Multi-Scale Channel Attention (MSCA) attention mechanism, and the Efficient Multi-Scale Patch Convolution (EMSPC) module, thereby enhancing the capability of posture feature extraction under the interference of complex backgrounds, varying lighting conditions, and occlusions.
The LSCD-pose module integrates a Group Normalization (GN) unit to enhance the detection and capture of local spatial contextual information. By employing shared convolutions, it reduces the number of parameters and computational load, thereby improving real-time detection and localization capabilities as well as the ability to detect key points in complex scenes.
The MMARM-CNN leverages the EMSPC module to implement multi-stage posture recognition correction, thereby enhancing the robustness of posture recognition against interference. Particularly when key points are occluded or the posture is complex, the EMSPC module provides higher accuracy in key-point localization.
Furthermore, the MMARM-CNN addresses the limitations of traditional ergonomics assessments, which are complex, time-consuming, and strongly subjective. The posture recognition process does not require workers to wear any additional devices or markers, effectively reducing operational complexity and the physiological and psychological discomfort of workers and ensuring that maintenance tasks are conducted under normal conditions. Ergonomics assessment methods are also integrated into the MMARM-CNN, enabling the real-time generation of risk scores, which enhances the objectivity and efficiency of ergonomics assessments.
The rest of this article is organized as follows. In Section 2, we introduce the methodology framework of this paper and detail each module of the framework. In Section 3, we describe the experimental conditions and the verification of MMARM-CNN. In Section 4, we discuss the main results of this study. Section 5 concludes the article.

2. Materials and Methods

2.1. Overview of the Proposed Framework

The objective of this study was to estimate the movement postures of maintenance workers and conduct ergonomic assessments through visual means. Consequently, we propose a multi-scale, multi-stage posture recognition method based on convolutional neural networks (MMARM-CNN) to adaptively acquire human posture information from continuous operation images of workers. Furthermore, to ensure the effective integration of posture recognition with ergonomic assessment, this paper divides the overall framework into two tasks: posture recognition and ergonomic evaluation. The ergonomic evaluation module uses the human joint angles obtained from the posture recognition module as input to complete the risk assessment; the accuracy of posture recognition is therefore a prerequisite for accurate and efficient ergonomic evaluation. This study selected YOLOv8-pose as the primary framework for posture recognition owing to its performance and precision: it can accurately predict and localize key points in images, thereby facilitating the calculation of the data required for assessment. Given the high utilization of the upper limbs by robot-maintenance workers during their tasks, the ergonomic evaluation module adopts RULA as the assessment method.
The working environment of industrial robots is often complex, which implies that posture collection from maintenance workers may encounter interference from complex environments, lighting variations, and occlusions. Therefore, this study conducted corresponding research on these issues to enhance the robustness and accuracy of posture recognition algorithms in complex work environments. As depicted in Figure 1, the posture recognition and ergonomic assessment framework follows the sequence of “Input image data—Human posture recognition—Ergonomic risk assessment”. The “Input image data” section inputs the captured images of workers at work into the system, enabling the recognition and capture of workers’ postures. The “Human posture recognition” section is primarily composed of three parts: “BackBone”, “Neck”, and “Prediction”. To improve the precision and efficiency of posture recognition, this section incorporates the LSCD-pose module, MSCA attention mechanism, and EMSPC module. The LSCD-pose module introduces a GN unit, which can more accurately capture local spatial contextual information, especially conferring a significant advantage in the model’s key-point detection capabilities under complex scenarios. Concurrently, shared convolutional networks reduce the number of parameters and computational load, achieving real-time detection and localization objectives. The MSCA attention mechanism, with its multi-scale channel attention, establishes a weight model between features, enhancing the model’s adaptability at different scales and improving the expressive power of key-point features. The introduction of the EMSPC module enhances the precision of key-point localization; through multi-stage posture correction, it addresses accuracy issues when key points are occluded or postures are complex. This module enhances the key-point detection capability, feature expression capability, and localization precision and represents key points in the form of spatial coordinates. The “Ergonomic risk assessment” section utilizes the key-point data obtained from the “Human posture recognition” section to calculate the human joint angles required for the ergonomic assessment model, complete the ergonomic evaluation, and present the results.
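To make the framework concrete, the snippet below sketches the “input image data—human posture recognition—ergonomic risk assessment” flow in Python. It uses the Ultralytics YOLOv8-pose API as a stand-in for the trained MMARM-CNN; the weight-file and image names are hypothetical, and the angle and RULA helpers referred to in the comments are sketched in Section 2.5.

```python
# Minimal sketch of the framework in Figure 1, assuming an Ultralytics-style
# pose model; "mmarm_cnn_pose.pt" is a hypothetical trained weight file.
from ultralytics import YOLO

model = YOLO("mmarm_cnn_pose.pt")                # posture recognition model
results = model("worker_frame.jpg")              # input image data
kpts = results[0].keypoints.xy[0].cpu().numpy()  # (17, 2) COCO key points
# kpts then feeds the ergonomic risk assessment stage: joint angles are
# computed with the vector method and scored with RULA (Section 2.5).
```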

2.2. High-Precision Lightweight LSCD-Pose Module Based on Shared Convolution

The MMARM-CNN framework is based on YOLOv8-pose and incorporates a lightweight, high-precision LSCD-pose module. To enhance stability and accuracy in small-batch training, we introduced GN [38,39] within the convolutional modules. GN is an effective regularization method particularly suitable for small-batch training environments. Unlike Batch Normalization (BN), which relies on statistical information based on batch size for normalization, GN reduces dependence on batch size by grouping channels and normalizing within these groups, thus performing well in small-batch and distributed training scenarios. Furthermore, GN can also reduce fluctuations during the training process, enhancing the model’s training stability. Therefore, we leveraged the advantages of GN to optimize the LSCD-pose module, significantly improving the model’s performance in small-batch training and distributed environments.
Additionally, we introduced shared convolution to reduce the number of parameters. Shared convolution is a technique used in Convolutional Neural Networks (CNNs), the essence of which involves applying the same convolutional kernels (filters) to different input regions or multiple input channels [40]. This technique plays a significant role in computer vision and image-processing tasks. By employing shared convolution, the number of parameters can be substantially decreased, resulting in a more streamlined model. As shared convolution reduces the number of parameters, the computational load is also diminished, thereby enhancing computational efficiency. Moreover, parameter sharing ensures the translational invariance of feature detection, meaning that a feature, regardless of its position in the image, will be detected by the same convolutional kernel. This mechanism not only enables the model to efficiently process high-resolution images and a greater number of input channels but also better captures local image features such as edges, textures, and shapes. Especially on resource-constrained devices, such as mobile devices and embedded systems, the model’s compactness and computational efficiency are particularly crucial. Shared convolution achieves the efficiency and practicality of the model by significantly reducing the number of parameters, allowing deep learning models to operate effectively across various computational environments.
While utilizing shared convolution, to address the issue of inconsistent target scales detected by each detection head, we introduced a Scale layer to scale the features. This enables more accurate detection of targets at different scales, thereby enhancing the model’s detection performance. The combination of shared convolution and the Scale layer not only significantly reduces the number of parameters and the computational load of the detection heads but also minimizes the loss of accuracy. This optimization strategy is particularly suitable for applications that require efficient processing under limited computational resources, allowing the model to remain high-performing while becoming lighter and more efficient. In this way, we can optimize the model’s computational efficiency and resource usage while ensuring detection accuracy, meeting the needs of various practical applications. The structure of the LSCD-pose module is illustrated in Figure 2; a minimal code sketch of these ideas follows.
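The PyTorch sketch below illustrates how a detection head can share one convolution stack (with GN) across pyramid levels and compensate for per-level scale differences with a learnable Scale layer. It is a simplified illustration of the LSCD-pose idea, not the paper’s exact configuration; the channel width, GN group count, and key-point output format are assumptions.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scalar so one shared head can serve all scales."""
    def __init__(self, init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale

class SharedConvHead(nn.Module):
    """Shared-convolution key-point head with GroupNorm (LSCD-pose-style)."""
    def __init__(self, ch=256, n_kpt=17, n_levels=3):
        super().__init__()
        # One conv stack shared across all pyramid levels: its parameters
        # are counted once rather than once per level.
        self.shared = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.GroupNorm(16, ch),   # GN: stable statistics at small batch sizes
            nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.GroupNorm(16, ch),
            nn.SiLU(),
        )
        self.kpt_out = nn.Conv2d(ch, n_kpt * 3, 1)  # per key point: (x, y, visibility)
        self.scales = nn.ModuleList(Scale() for _ in range(n_levels))

    def forward(self, feats):
        # feats: list of pyramid maps (e.g., P3, P4, P5), each (B, ch, H, W)
        return [self.scales[i](self.kpt_out(self.shared(f)))
                for i, f in enumerate(feats)]
```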

2.3. The Spatial Pyramid Pooling-Fast (SPPF) Module Based on the MSCA Attention Mechanism

In posture detection, detection is often susceptible to interference from background clutter and noise due to low image resolution. Therefore, this paper introduces the MSCA attention mechanism into the feature pyramid Spatial Pyramid Pooling-Fast (SPPF) module. This mechanism enables the network to disregard irrelevant background information and focus more on valid posture feature information. Previous attention mechanisms had several shortcomings; for instance, although Dilated Transformer attention integrated the advantages of channel and spatial attention, it overlooked the role of multi-scale feature aggregation in network design, which is highly detrimental in posture detection [41]. In contrast, the MSCA attention mechanism introduced in this paper effectively utilizes the characteristics of multi-scale feature aggregation, enhancing the model’s feature extraction capability.
MSCA is an innovative multi-scale convolutional attention module that aggregates local information through depthwise convolution and employs multi-branch depthwise strip convolutions to capture multi-scale contextual information [24]. Each branch utilizes asymmetric convolution kernels with different configurations to capture features in various directions (for example, 1 × 7, 7 × 1, 1 × 11, 11 × 1, 1 × 21, and 21 × 1). This approach effectively captures and integrates feature information at different scales while reducing the number of parameters and the computational cost. The structure of the MSCA attention mechanism is illustrated in Figure 3. This design not only optimizes the computational efficiency of the model but also enhances its sensitivity to horizontal and vertical features by processing them separately, ensuring that key information is captured while processing power is used efficiently.
The mechanism proceeds in three steps. First, the multi-branch depthwise strip convolutions, with each branch equipped with asymmetric kernels of different orientations (1 × 7, 7 × 1, 1 × 11, 11 × 1, 1 × 21, and 21 × 1), capture contextual information at different scales while keeping the computational cost low, enhancing the model’s spatial flexibility and information-processing efficiency. Second, 1 × 1 convolutions model the relationships between channels; this step integrates the features from each branch and generates the attention weights used to re-weight the input features, strengthening the model’s response to salient information. Finally, the features weighted by the attention weights form the final output. The mathematical expressions for this process are given in Equations (1) and (2) below. Through this structure, MSCA achieves fine-tuning and optimization of image features, improving overall model performance and application flexibility.
$$Att = \mathrm{Conv}_{1\times 1}\left( \sum_{i=0}^{3} \mathrm{Scale}_i\big(\mathrm{DWConv}(F)\big) \right) \qquad (1)$$

$$Out = Att \otimes F \qquad (2)$$
Here, F represents the input features, Att denotes the attention weights, Out is the output of the MSCA attention mechanism, and ⊗ denotes element-wise multiplication. DWConv signifies depthwise convolution, and Scale_i denotes the i-th branch. Through this approach, the MSCA module not only captures multi-scale contextual information but also effectively models the relationships between channels, enhancing the precision and efficiency of feature extraction. A code rendering of Equations (1) and (2) is sketched below.
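The following minimal PyTorch module is one way to realize Equations (1) and (2); it assumes, as in common MSCA implementations, that Scale_0 is the identity branch and that all strip convolutions are depthwise. The channel count and the 5 × 5 aggregation kernel are assumptions rather than the paper’s stated configuration.

```python
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-Scale Channel Attention: Eq. (1) builds Att, Eq. (2) re-weights F."""
    def __init__(self, ch):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 5, padding=2, groups=ch)   # DWConv in Eq. (1)
        # Scale_1..3: depthwise strip-convolution branches (1xk then kx1),
        # capturing horizontal and vertical context at three scales.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(ch, ch, (1, k), padding=(0, k // 2), groups=ch),
                nn.Conv2d(ch, ch, (k, 1), padding=(k // 2, 0), groups=ch),
            )
            for k in (7, 11, 21))
        self.proj = nn.Conv2d(ch, ch, 1)   # Conv_1x1: models channel relations

    def forward(self, f):
        base = self.dw(f)
        att = base + sum(b(base) for b in self.branches)  # Scale_0 is identity
        att = self.proj(att)
        return att * f                                    # Out = Att (x) F
```

Because every kernel here is depthwise and asymmetric, the parameter count grows linearly with kernel length rather than quadratically with kernel area, which is what keeps the multi-scale aggregation cheap.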
The SPPF module performs feature extraction and encoding of images at various scales, enabling the resizing of any input image to a fixed size and the generation of a fixed-length feature vector. However, due to the presence of complex background interference, lighting variations, and occlusions in posture recognition, the SPPF module is not effective in handling these conditions. The introduction of the MSCA mechanism not only allows the network to disregard irrelevant background information but also effectively focuses on more valid posture feature information, thereby enhancing the improved model’s feature extraction capabilities. A comparison between the SPPF module and the SPPF_MSCA module is illustrated in Figure 4. By incorporating the MSCA attention mechanism, the model simultaneously extracts features at multiple scales, significantly enhancing its ability to capture details and global information. Particularly under complex background and dynamic lighting conditions, MSCA can adaptively adjust the attention weights to focus on key areas of posture features. Furthermore, the multi-branch design of the MSCA module allows it to process features under different convolutional kernel sizes, thereby enhancing the model’s robustness in handling diverse postures and various interference conditions while maintaining efficient computation.

2.4. EMSPC Module Based on Grouped Convolution and Point-by-Point Convolution

In posture detection, due to the diversity of posture variations, traditional convolutions do not adequately address the complex features present. Therefore, this paper introduces the Efficient Multi-Scale Patch Convolution (EMSPC) module [42]. The EMSPC module integrates multi-scale information by combining different convolutional kernel sizes, such as 1 × 1, 3 × 3, 5 × 5, and 7 × 7. This multi-scale approach enables the module to capture various spatial features, thereby more effectively handling different target sizes and shapes within images. The design of EMSPC is inspired by MobileNet [43] and GhostNet [15], utilizing simplified designs to reduce redundant information in feature maps. The principle of GhostNet emphasizes the presence of redundant features in intermediate feature maps, and EMSPC effectively reduces this redundancy.
Following the initial convolutional operation, the EMSPC employs pointwise convolution for channel-level feature fusion, integrating features from different channels to generate a comprehensive feature map and thereby enhancing the richness of the extracted information. The key to strengthening feature extraction lies in recovering the significant feature information contained in this redundancy. Since the feature information is extracted from channels of varying sizes, and the information within each channel is independent, we adopted the pointwise-convolution channel-level feature fusion technique from MobileNet. As illustrated in Figure 5, the process begins with a standard convolution operation on the feature maps, then divides these feature maps into k groups and performs a linear operation Φ on each group to generate a complete feature map (Φ involves convolution kernels of 1 × 1, 3 × 3, 5 × 5, or 7 × 7). Finally, pointwise convolution is used for channel-level feature fusion to produce the output.
Compared to conventional convolutional modules, the EMSPC has fewer parameters and a lower computational load. EMSPC introduces the concept of group convolution, which segments the input feature maps into multiple groups, applies linear transformations to each group, and finally fuses the features through pointwise convolution. This design significantly reduces the number of floating-point operations (FLOPs), thereby diminishing the model’s complexity. Moreover, EMSPC preserves a wealth of feature information: by leveraging the advantages of GhostNet in handling redundant features and incorporating the channel-fusion capabilities of MobileNet, EMSPC can effectively extract feature information from different channels. This approach not only reduces computational complexity but also maintains the richness of the features.
In the backbone network of YOLOv8, some standard convolutions are replaced with EMSPC modules, thereby enhancing the capability for feature extraction. The EMSPC demonstrates significant advantages over conventional convolutions in posture detection. By integrating multi-scale information, EMSPC is capable of capturing features at different spatial scales, which is crucial for the diverse joint locations and posture variations present in posture detection. Furthermore, the design of EMSPC is inspired by group convolution and GhostNet, significantly reducing computational complexity and the number of parameters by diminishing redundant information in feature maps. This streamlined design allows EMSPC to maintain high precision while more efficiently processing large-scale data and real-time detection tasks. EMSPC employs channel-level feature fusion, inspired by the pointwise convolution of MobileNet, ensuring that features from different channels are fully utilized to generate more comprehensive feature maps. This fusion method strengthens the capability for feature extraction, improving the accuracy and robustness of posture detection. In contrast, conventional convolutions tend to produce redundancy when processing high-dimensional features and have high computational costs, making it challenging to achieve efficient posture detection in resource-constrained environments.
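As a concrete illustration, the PyTorch sketch below follows the structure described above: a standard convolution, a split into k groups, a cheap per-group operation Φ with kernels of 1 × 1, 3 × 3, 5 × 5, or 7 × 7, and a pointwise fusion. The use of depthwise per-group convolutions and the channel widths are assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class EMSPC(nn.Module):
    """Efficient Multi-Scale Patch Convolution (simplified sketch)."""
    def __init__(self, c_in, c_out, kernels=(1, 3, 5, 7)):
        super().__init__()
        assert c_out % len(kernels) == 0
        g = c_out // len(kernels)                      # channels per group
        self.primary = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        # Phi: one cheap depthwise convolution per group, each with a
        # different kernel size to capture a different spatial scale.
        self.phi = nn.ModuleList(
            nn.Conv2d(g, g, k, padding=k // 2, groups=g, bias=False)
            for k in kernels)
        self.fuse = nn.Conv2d(c_out, c_out, 1)         # pointwise channel fusion
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.primary(x)                            # standard convolution
        groups = torch.chunk(y, len(self.phi), dim=1)  # split into k groups
        y = torch.cat([phi(g) for phi, g in zip(self.phi, groups)], dim=1)
        return self.act(self.fuse(y))

# e.g., EMSPC(128, 256) can stand in for a standard 3x3 backbone convolution.
```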

2.5. RULA-Based Ergonomics Assessment Module

The ultimate objective of this paper is to implement ergonomic assessments and determine the risk levels of maintenance tasks. Since the research in this paper primarily pertains to manual operations across a variety of industrial tasks, the RULA method, which gives particular consideration to the upper limbs, was selected from the numerous ergonomic assessment methods as the model’s evaluation approach. Before this module operates, the joint angles must be calculated from the key-point information produced by the posture estimation module. In this paper, key-point coordinates obtained through posture recognition are used to calculate joint angles with the vector method. Vectors are directional and translatable; hence, any two non-coincident points $A(x_a, y_a, z_a)$ and $B(x_b, y_b, z_b)$ in the human body coordinate system can be transformed into vectors within the reference coordinate system through coordinate transformations, represented as:

$$\overrightarrow{AB} = (x_b - x_a,\; y_b - y_a,\; z_b - z_a)$$
Based on the above properties, human joint angles are viewed as angles between vectors in space. For example, any three points $A(x_a, y_a, z_a)$, $B(x_b, y_b, z_b)$, and $C(x_c, y_c, z_c)$ in space define two vectors, and the angle between them is solved using the vector method:

$$\overrightarrow{BA} = (x_a - x_b,\; y_a - y_b,\; z_a - z_b)$$

$$\overrightarrow{BC} = (x_c - x_b,\; y_c - y_b,\; z_c - z_b)$$

$$\cos\omega = \frac{\overrightarrow{BA}\cdot\overrightarrow{BC}}{\lvert\overrightarrow{BA}\rvert\,\lvert\overrightarrow{BC}\rvert} = \frac{(x_a-x_b)(x_c-x_b)+(y_a-y_b)(y_c-y_b)+(z_a-z_b)(z_c-z_b)}{\sqrt{(x_a-x_b)^2+(y_a-y_b)^2+(z_a-z_b)^2}\,\sqrt{(x_c-x_b)^2+(y_c-y_b)^2+(z_c-z_b)^2}}$$

Let $\tau = \cos\omega$; then the joint angle is $\omega = \cos^{-1}\tau$.
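As a worked example, the function below implements the vector-method angle computation directly from three key points; it assumes coordinates are given as (x, y, z) tuples (for 2D key points, the same code works with z = 0).

```python
# Vector-method joint angle: the angle omega at joint B formed by points A
# and C, following the cosine formula above.
import numpy as np

def joint_angle(a, b, c):
    ba = np.asarray(a, float) - np.asarray(b, float)   # vector BA
    bc = np.asarray(c, float) - np.asarray(b, float)   # vector BC
    tau = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(tau, -1.0, 1.0)))  # omega = arccos(tau)

# e.g., shoulder-elbow-wrist key points give the elbow flexion angle:
print(joint_angle((0.10, 0.90, 0.0), (0.10, 0.60, 0.0), (0.30, 0.40, 0.0)))
```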
Once the required joint angle information is obtained, a detailed ergonomic analysis is conducted. The RULA method requires input data including human joint angles, load, muscle usage, and whether the legs are properly supported. Rather than establishing a fully closed-loop assessment model, we therefore input the information that cannot be measured automatically, such as load and muscle usage, at the beginning of the assessment task. The support condition of the subject’s legs is determined from spatial position information: the legs are judged not to have proper support when the vertical distance difference between the two ankles and the ground exceeds 30 mm. Compared to traditional assessment methods, this level of automation is already sufficient for most assessment scenarios. When a high risk value is present, a red font is used for warning, corresponding to RULA level 7; medium risk is indicated with an orange font, corresponding to RULA levels 5–6; low risk is represented with a yellow font, corresponding to RULA levels 3–4; and when the risk level is negligible, a white font is used, corresponding to RULA levels 1–2. This module has file-writing capabilities, and the recorded data can be used for further analysis and for users to review and improve. Compared to other related studies, this paper offers a higher degree of automation and more detailed objective assessments, reducing the time-consuming and cumbersome disadvantages of assessment methods and increasing users’ sensitivity to ergonomic risks and the practicality of the assessment. In contrast to traditional assessments that focus only on the final RULA score, this paper summarizes the scores for each limb joint angle and uses the ISO 11226:2000 standard as a benchmark to determine whether specific joint angles pose a high risk [44]. The decision rules described here are sketched in code below.
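A minimal sketch of the two decision rules just described, under one reading of the 30 mm rule (comparing the two ankle-to-ground distances); ankle heights are assumed to be available in millimetres from the key-point data, and the RULA score itself is assumed to come from the standard scoring tables, which are omitted here.

```python
# Risk banding and leg-support rules as described in the text.
def legs_properly_supported(left_ankle_height_mm, right_ankle_height_mm):
    # Legs are judged unsupported when the vertical distance difference
    # between the two ankles and the ground exceeds 30 mm.
    return abs(left_ankle_height_mm - right_ankle_height_mm) <= 30.0

def risk_band(rula_score):
    # Map the final RULA score (1-7) to the on-screen warning colours.
    if rula_score >= 7:
        return "high risk", "red"
    if rula_score >= 5:
        return "medium risk", "orange"
    if rula_score >= 3:
        return "low risk", "yellow"
    return "negligible risk", "white"
```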

3. Experiments Details

This study designed two experiments to validate the enhancement of model performance and the reliability of posture recognition. Initially, the MMARM-CNN model and the YOLOv8-pose model were trained using a public dataset and image data from robotic assembly tasks to improve the models’ generalization capabilities. The performance improvement of the MMARM-CNN model was verified through a comparative analysis of specific effects and parameters. Subsequently, the motion capture system Perception Neuron Studio was utilized to capture the subjects’ movement postures as a benchmark, which were then compared with the postures captured by the MMARM-CNN model to verify its accuracy and reliability. Additionally, the experiments analyzed the impact of special circumstances such as background interference, lighting variations, and occlusion of body parts on posture recognition.

3.1. Comparative Analysis of Models Based on Publicly Available Datasets

3.1.1. Datasets

We selected 20,000 images from the MS COCO 2017 dataset for model training, with 15,000 images chosen for the training set and 5000 for the validation set, while the test set was composed of 5000 images. The model training was conducted over 300 epochs with a batch size of 64, employing the Adam optimizer with an initial learning rate of 0.001, which was dynamically adjusted after every 100 epochs. The loss function utilized was the Mean Squared Error (MSE) Loss based on key-point regression, and data augmentation strategies such as random flipping, scaling, cropping, and color perturbation were applied during the training process to enhance the model’s generalization capabilities. The entire experiment was conducted on a high-performance server equipped with an Intel Xeon Silver 4210R CPU (@2.40 GHz, 40 cores), 24 GB of memory, and dual Nvidia GeForce RTX 4090 GPUs. It is noteworthy that in addition to training the model with the COCO dataset, we also tested it with actual work images from robotic assembly and maintenance to evaluate the model’s performance in real scenarios, ensuring its practical application capability in ergonomic assessment tasks.
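Under the stated setup, the training run could be expressed with an Ultralytics-style call such as the one below. The model YAML name is hypothetical (it would register the LSCD-pose, MSCA, and EMSPC modules), and the augmentation flags shown are the library’s built-in approximations of the strategies listed above, not the paper’s exact pipeline.

```python
# Hedged sketch of the training configuration described in the text.
from ultralytics import YOLO

model = YOLO("mmarm_cnn_pose.yaml")  # hypothetical custom model definition
model.train(
    data="coco-pose.yaml",   # 15,000/5000 train/val split drawn from MS COCO 2017
    epochs=300,
    batch=64,
    optimizer="Adam",
    lr0=0.001,               # stepped down every 100 epochs in the paper
    fliplr=0.5,              # random horizontal flipping
    scale=0.5,               # random scaling
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # colour perturbation
)
```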

3.1.2. Criteria for Evaluation

To comprehensively assess the performance of deep models on posture detection images, this study employed the following key metrics: Mean Average Precision at 0.5 IoU (mAP@0.5), Mean Average Precision at 0.95 IoU (mAP@0.95), FLOPs, and Params [45]. These metrics evaluate the model in terms of detection accuracy, model complexity, and computational efficiency.

Intersection over Union (IoU) is a commonly used metric for measuring the degree of overlap between two bounding boxes. mAP@0.5 is the mean of the Average Precision (AP) over all categories at an IoU threshold of 0.5; it reflects the overall performance of the model in detection tasks and is an important indicator of object detection accuracy. The higher the mAP@0.5, the better the model’s detection performance across categories; it is a key metric for evaluating basic model performance under relatively lenient conditions. mAP@0.95 is the mean of the Average Precision over all categories at an IoU threshold of 0.95. Compared to mAP@0.5, it is a more stringent criterion, reflecting the model’s performance in high-precision detection tasks; the higher the mAP@0.95, the more accurately the model can identify targets under strict localization requirements. This metric is particularly important for applications that demand precise localization and detail-rich images.

FLOPs measure the computational complexity of a model, indicating the number of floating-point operations required for a single forward pass; the lower the FLOPs, the less computationally complex the model and the faster it runs, which matters especially in resource-constrained embedded systems and real-time applications. Params denotes the total number of learnable parameters and reflects the model’s size and memory footprint.
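For reference, the IoU computation underlying both mAP metrics can be written in a few lines; the axis-aligned (x1, y1, x2, y2) box format is an assumption.

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A detection matching its ground truth with IoU >= 0.5 counts as a true
# positive for mAP@0.5; mAP@0.95 demands IoU >= 0.95.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```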

3.2. Accuracy and Reliability Analysis of Posture Recognition Based on Inertial Capture

To substantiate the accuracy and reliability of the MMARM-CNN, motion data were captured using the commercial motion tracking system Perception Neuron Studio (PNS). Three male participants, aged 24 ± 3 years with heights of 175 ± 5 cm, were invited to engage in a robotic assembly task, while the entire process was video-recorded using a Gemini 336L camera (Orbbec Inc., Shenzhen, China). Before posture recognition, the video was segmented and exported as frames at a rate of 30 frames per second, resulting in a total of 1000 images; of these, 800 were used for model training and 200 for analysis and validation. Figure 6 illustrates the experimental setup, which includes an upper computer, tool table, camera, robot, motion capture equipment, and the participant. The participant grasped tools and robot housings from the table, assembled them at the corresponding positions on the mechanical arm, and tightened them. The participant wore the PNS throughout the experiment, the camera was active at all times, and the upper computer acquired and saved the data. The data obtained from the PNS were used as a benchmark against the MMARM-CNN recognition data, thereby verifying the accuracy of MMARM-CNN posture recognition. Furthermore, the data collected by the PNS were submitted to three invited ergonomics experts for ergonomic assessment, and the experts’ assessment results were compared with the MMARM-CNN assessment outcomes to ascertain the reliability of the ergonomic evaluation.
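The frame-export step can be reproduced with OpenCV as below; file paths are illustrative, and the recording is assumed to already be at 30 frames per second so that every decoded frame is kept.

```python
# Export a recorded task video to numbered frame images with OpenCV.
import cv2

cap = cv2.VideoCapture("assembly_task.mp4")
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:               # end of video
        break
    cv2.imwrite(f"frames/frame_{idx:05d}.jpg", frame)
    idx += 1
cap.release()
print(f"exported {idx} frames")
```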

4. Results

4.1. Comparative Analysis of MMARM-CNN with Other Models

Figure 7 presents a comparison of the training results for MMARM-CNN (left) and YOLOv8-pose (right) under normal and special circumstances. The first and second images depict scenarios where the human posture is unobstructed, and the posture recognition of the two models is essentially similar. The third and fourth images illustrate instances where limbs are obstructed and postures overlap, as shown at points A and B; point B notably lacks recognition of the lower arm, while point A exhibits detailed recognition of the lower arm. The fifth and sixth images represent cases where the upper limbs are entirely obscured, as indicated at points C and D; point D has lost all data for both the upper and lower arms, whereas point C has fully recognized the posture of the upper arm. The seventh and eighth images portray situations where the head and neck are completely blocked, as demonstrated at points E and F; point F failed to recognize the posture of the neck, while point E successfully depicted it. Under these special conditions of obstruction and overlap, MMARM-CNN clearly outperforms YOLOv8-pose in recognition effectiveness.
We compared the parameters of MMARM-CNN with those of other models, with the specific data presented in Table 1. The comparison demonstrates that MMARM-CNN exhibits significant advantages across various metrics. First, in the mAP@0.5 evaluation, Hourglass [25] achieved an accuracy of 0.8165, HRNet-W32 [26] reached 0.8432, and DEKR [27] obtained a higher result of 0.8693. YOLOv7’s [28] accuracy was 0.84, YOLOv8’s improved to 0.853, and MMARM-CNN achieved the highest at 0.875, indicating superior detection accuracy in human posture recognition. Second, in the stricter mAP@0.95 metric, Hourglass scored 0.5686, HRNet-W32 0.6348, and DEKR the highest value of 0.6532; YOLOv7 and YOLOv8 scored 0.57 and 0.586, respectively, whereas MMARM-CNN reached 0.614, exceeding both YOLO baselines in high-precision posture detection. In terms of computational complexity, Hourglass required 170 G FLOPs, HRNet-W32 32.8 G, and DEKR 40.9 G, while MMARM-CNN required 26.5 G, lower than YOLOv8’s 30.4 G and YOLOv7’s 28 G, indicating that MMARM-CNN has a lower computational complexity while maintaining performance. Additionally, Hourglass’s parameter count was as high as 277.8 M, while HRNet-W32 and DEKR had 28.5 M and 29.6 M parameters, respectively. In comparison, MMARM-CNN’s parameter count was 9.4 M, significantly lower than YOLOv8’s 11.6 M and YOLOv7’s 10.5 M, highlighting MMARM-CNN’s advantage in controlling model complexity.
In terms of network structure design, the MSCA introduced in MMARM-CNN significantly enhances feature expression capabilities, especially when processing high-resolution images, where it demonstrates greater stability and efficiency. Moreover, MMARM-CNN’s multi-stage posture correction module, EMSPC, improves the accuracy of key-point localization; particularly, when key points are occluded, or postures are complex, it still performs excellently. In summary, MMARM-CNN exhibits higher detection accuracy in human posture recognition tasks, with lower computational complexity and lighter model parameters, and it maintains high-precision detection effects in scenarios with complex backgrounds and occlusions, significantly enhancing the robustness and practicality of posture recognition tasks.
Through the aforementioned enhancements, MMARM-CNN demonstrates superior performance over the baseline YOLOv8 in multiple aspects. These improvements have increased the model’s detection accuracy and efficiency by reducing computational load and the number of parameters, particularly excelling in complex scenarios. This renders MMARM-CNN more advantageous in practical applications, ensuring high-precision real-time detection.

4.2. Accuracy Analysis of Posture Recognition

Figure 8 illustrates the posture recognition results of the MMARM-CNN on the robot installation experiment. The effectiveness depicted in the figure indicates that MMARM-CNN is capable of accurately capturing the key-point locations of the human body and representing the angles of the human joints, regardless of whether the limbs are in normal conditions or under special circumstances such as occlusions.
We conducted a statistical analysis of the data from the robot assembly experiment. Figure 9 presents the curve of the right elbow joint angle, linear regression analysis, and residual analysis results. In the joint angle curve diagram (a), the trends of the data obtained from MMARM-CNN and PNS are essentially identical, both commencing from 180° with only occasional significant fluctuations at specific points. In the linear regression analysis diagram (b), the horizontal axis represents data from MMARM-CNN, while the vertical axis represents data from PNS. The two datasets are normally distributed and highly correlated. Diagram (c) in Figure 9 displays the normalized residual normal P-P plot for the right elbow joint angle. Finally, Figure 9d,e illustrate the scatter plot distribution of MMARM-CNN under different references.
The results indicate excellent consistency between the MMARM-CNN and PNS data. From a statistical perspective, as shown in Table 2, the coefficient of determination (R² = 0.831) and Pearson’s product–moment correlation coefficient (r = 0.865, p < 0.01) demonstrate a strong correlation, and the Spearman correlation coefficient (0.847) indicates a high degree of correlation. Furthermore, we analyzed the average, maximum, and minimum errors, using ±3° as the error range. The average error is 2.53°, the maximum error is 7.29°, and the minimum error is 0.03°. There is therefore a high correlation between the MMARM-CNN and PNS data, although errors arise in some cases. On reviewing the experimental results, we found that accuracy degraded slightly under occlusion or limb overlap; this did not affect the final assessment outcomes in this study, but the issue will be considered in future research. The snippet below shows how these agreement statistics can be computed.
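A sketch, assuming two aligned 1-D arrays of right-elbow angles (in degrees) from MMARM-CNN and the PNS reference, of how the agreement statistics in Table 2 can be computed with SciPy.

```python
# Agreement statistics between predicted and reference joint angles.
import numpy as np
from scipy import stats

def agreement_stats(pred_deg, ref_deg):
    pred, ref = np.asarray(pred_deg, float), np.asarray(ref_deg, float)
    r, p = stats.pearsonr(pred, ref)        # Pearson r and its p-value
    rho, _ = stats.spearmanr(pred, ref)     # Spearman rank correlation
    err = np.abs(pred - ref)
    return {
        "R2": r ** 2,                       # R^2 of the simple linear fit
        "pearson_r": r, "p_value": p,
        "spearman": rho,
        "mean_err": err.mean(), "max_err": err.max(), "min_err": err.min(),
    }
```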

4.3. Reliability Analysis of Postural Assessment

We conducted a summary analysis of the experimental data assessment results, comparing the outcomes obtained from three experts with those derived from MMARM-CNN. As depicted in Figure 10a, which illustrates the overall trend, the risk levels from both sources are strikingly similar. Given that the task involves prolonged standing, the majority of the assessment results fall within the low to moderate risk range. However, there are instances where the posture results exhibit sudden changes, which, upon analysis, are associated with issues such as occlusions. This indicates that MMARM-CNN may exhibit slight variations from expert assessments in special circumstances, but these minor fluctuations do not impact the overall assessment outcome.
As shown in Figure 10b, accuracy was statistically assessed across the different risk categories, with the MMARM-CNN results demonstrating over 88% concordance with the standard values. To further substantiate the system’s accuracy, we also calculated Kendall’s tau correlation coefficient (τ = 0.867), Pearson’s correlation coefficient (r = 0.858), the p-value (p < 0.01), and the accuracy rate (88.50%) between the two sets of results, indicating a very high degree of correlation. The specific values are presented in Table 3.
These results indicate that the accuracy of MMARM-CNN is essentially consistent with the accuracy of expert RULA analysis. It is noteworthy that this study placed greater emphasis on special postures such as occlusions and limb folding, thereby enhancing the accuracy and reliability of posture recognition in complex environments. In addition to the overall RULA assessment, the reliability of each limb was also compared, with specific comparative values presented in Table 4. The data show that MMARM-CNN has a high degree of accuracy in recognizing the upper arms, trunk, neck, and legs, while the recognition accuracy for parts with smaller ranges of motion, such as the wrists, still requires further improvement.

4.4. Accuracy Analysis under Occlusion (“Blocked”)

This paper validates the recognition accuracy of MMARM-CNN when body parts are occluded. As illustrated in Figure 11, special circumstances such as limb occlusion and arm folding that commonly occur during assembly workers’ tasks can affect the accuracy of posture recognition. Figure 11 also presents the specific recognition outcomes, where it can be observed that MMARM-CNN successfully captures key points of the human body even when the subject is in profile, experiencing varying degrees of body occlusion, or when body overlap occurs.
We conducted a summary analysis of the occluded and non-occluded image information from the 200 validation images. These were categorized into two groups, “Blocked” and “Not Blocked”, and compared accordingly. The accuracy rates for posture recognition were 70.37% and 89.47%, respectively, indicating that occlusion reduces the accuracy of posture recognition; the error rate increases particularly when multiple special posture conditions occur simultaneously. In this study, however, this did not significantly impact the final assessment outcomes. Developing more precise recognition methods will therefore be a focus of future work. Concurrently, this study conducted Pearson’s chi-square test and Fisher’s exact test, both with p-values less than 0.001, indicating that the difference in accuracy between the two groups is statistically significant and confirming the reliability of the comparative results. Table 5 presents the outcomes of this experiment, and the snippet below shows how the two tests can be run.
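A sketch of the two significance tests on a 2 × 2 contingency table of (blocked / not blocked) × (correct / incorrect) recognitions; the counts below are illustrative values consistent with the reported accuracy rates, not the paper’s raw data.

```python
# Pearson's chi-square test and Fisher's exact test on a 2x2 table.
import numpy as np
from scipy import stats

table = np.array([
    [57, 24],    # blocked:     correct, incorrect  (57/81   ≈ 70.37%)
    [102, 12],   # not blocked: correct, incorrect  (102/114 ≈ 89.47%)
])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)
print(f"chi-square p = {p_chi2:.4g}, Fisher exact p = {p_fisher:.4g}")
```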

5. Conclusions

In this work, we propose a multi-scale, multi-stage posture recognition method based on shared convolutional neural networks for human posture recognition and ergonomic risk assessment in human–machine systems. This method offers a non-wearable ergonomic assessment solution for evaluating the ergonomic risk levels of workers, providing suggestions to those at high risk and guiding them to adjust improper postures or seek timely treatment. Unlike motion capture devices that collect human posture data, MMARM-CNN does not suffer from issues such as environmental magnetic interference or error accumulation, maintaining the advantages of automated assessment while reducing error impacts.

The sub-modules of MMARM-CNN ensure accurate recognition of human posture and optimize the model’s key-point detection, feature expression, and localization precision, enhancing recognition accuracy in complex environments or in the presence of occlusions. Firstly, we utilized the concept of shared convolution, introducing a GN module within the LSCD-pose module, which more accurately captures local spatial contextual information and improves key-point detection in complex scenes; the shared convolutional networks reduce the number of parameters and the computational load, achieving real-time detection and localization. Secondly, leveraging the idea of multi-scale parallelism, the MSCA attention mechanism, with its multi-scale channel attention, establishes a weight model between features, enhancing the model’s adaptability at different scales and improving the expressive power of key-point features. Lastly, through multi-stage posture correction, the EMSPC module enhances the precision of key-point localization, improving accuracy when key points are occluded or postures are complex.

This paper ultimately validates the advantages of the MMARM-CNN model as well as the accuracy of its posture recognition and the reliability of its risk assessment. Although this study addresses the issue of partial occlusions affecting collection accuracy, recognition levels are still low when the extent of limb occlusion is too large. Additionally, the current study only uses images as the objects of recognition, and there is room for improvement in assessing continuity. Future research will expand in these two directions, aiming to enhance the recognition accuracy of deep learning models and to utilize videos for continuous ergonomic assessment.

Author Contributions

Conceptualization, W.Z., L.W. and X.L.; methodology, L.W. and Y.Z.; software, L.W. and X.L.; validation, W.Z., L.W. and B.Y.; formal analysis, Y.Z. and W.Z.; investigation, X.L. and L.W.; resources, Y.L., W.Z., Y.Z. and H.L.; data curation, W.Z., L.W., H.L. and Y.L.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z., X.L., Y.L. and L.W.; visualization, L.W. and B.Y.; supervision, W.Z., L.W. and Y.Z.; project administration, W.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by “the Funds of Education Department of Jilin Province, Research on Campus Security Behavior Identification and Early Warning Based on Deep Learning grant number JJKH20231083KJ” and “the Funds of Education Department of Jilin Province, Analysis of Online Learning Behavior and Psychology of Students Based on Deep Multi-perspective grant number JLJY202301810566”.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Acknowledgments

Thanks to Zhang and his students from Changchun University of Science and Technology for their experimental support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kadikon, Y.; Rahman, M.N.A. Manual material handling risk assessment tool for assessing exposure to risk factor of work-related musculoskeletal disorders: A review. J. Eng. Appl. Sci. 2016, 100, 2226–2232.
  2. Bevan, S. Economic impact of musculoskeletal disorders (MSDs) on work in Europe. Best Pract. Res. Clin. Rheumatol. 2015, 29, 356–373.
  3. Mody, G.M.; Brooks, P.M. Improving musculoskeletal health: Global issues. Best Pract. Res. Clin. Rheumatol. 2012, 26, 237–249.
  4. Mekonnen, T.H. The magnitude and factors associated with work-related back and lower extremity musculoskeletal disorders among barbers in Gondar town, Northwest Ethiopia, 2017: A cross-sectional study. PLoS ONE 2019, 14, e0220035.
  5. World Health Organization. Musculoskeletal Conditions. 2022. Available online: https://www.who.int/news-room/fact-sheets/detail/musculoskeletal-conditions (accessed on 18 July 2022).
  6. McAtamney, L.; Corlett, E.N. RULA: A survey method for the investigation of work-related upper limb disorders. Appl. Ergon. 1993, 24, 91–99.
  7. Hignett, S.; McAtamney, L. Rapid Entire Body Assessment (REBA). Appl. Ergon. 2000, 31, 201–205.
  8. Karhu, O.; Kansi, P.; Kuorinka, I. Correcting working postures in industry: A practical method for analysis. Appl. Ergon. 1977, 8, 199–201.
  9. Berti, N.; Finco, S.; Battaïa, O.; Delorme, X. Aging workforce effects in Dual Resource Constrained job shop scheduling. Int. J. Prod. Econ. 2021, 237, 108151.
  10. Mangesh, J.; Vishwas, D. Study of association between OWAS, REBA and RULA with perceived exertion rating for establishing applicability. Theor. Issues Ergon. Sci. 2022, 23, 313–332.
  11. Finco, S.; Calzavara, M.; Sgarbossa, F.; Zennaro, I. Including rest allowance in mixed-model assembly lines. Int. J. Prod. Res. 2021, 59, 7468–7490.
  12. Li, X.; Han, S.; Gül, M.; Al-Hussein, M. Automated post-3D visualization ergonomic analysis system for rapid workplace design in modular construction. Autom. Constr. 2019, 98, 160–174.
  13. Huang, C.; Kim, W.; Zhang, Y.; Xiong, S. Development and Validation of a Wearable Inertial Sensors-Based Automated System for Assessing Work-Related Musculoskeletal Disorders in the Workspace. Int. J. Environ. Res. Public Health 2020, 17, 6050.
  14. Daria, B.; Martina, C.; Alessandro, P.; Fabio, S.; Valentina, V.; Zennaro, I. Integrating mocap system and immersive reality for efficient human-centered workstation design. IFAC-PapersOnLine 2018, 51, 188–193.
  15. Murugan, A.S.; Noh, G.; Jung, H.; Kim, E.; Kim, K.; You, H.; Boufama, B. Optimizing computer vision-based ergonomic assessments: Sensitivity to camera position and monocular 3D pose model. Ergonomics 2024, 11–18.
  16. Zhou, D.; Chen, C.; Guo, Z.; Zhou, Q.; Song, D.; Hao, A. A real-time posture assessment system based on motion capture data for manual maintenance and assembly processes. Int. J. Adv. Manuf. Technol. 2024, 131, 1397–1411.
  17. Simon, S.; Dully, J.; Dindorf, C.; Bartaguiz, E.; Walle, O.; Roschlock-Sachs, I.; Fröhlich, M. Inertial Motion Capturing in Ergonomic Workplace Analysis: Assessing the Correlation between RULA, Upper-Body Posture Deviations and Musculoskeletal Discomfort. Safety 2024, 10, 16.
  18. Cai, L.; Ma, Y.; Xiong, S.; Zhang, Y. Validity and reliability of upper limb functional assessment using the Microsoft Kinect V2 sensor. Appl. Bionics Biomech. 2019, 2019, 7175240.
  19. Clark, R.A.; Mentiplay, B.F.; Hough, E.; Pua, Y.H. Three-dimensional cameras and skeleton pose tracking for physical function assessment: A review of uses, validity, current developments and Kinect alternatives. Gait Posture 2019, 68, 193–200.
  20. Diego-Mas, J.-A.; Poveda-Bautista, R.; Garzon-Leal, D.-C. Influences on the use of observational methods by practitioners when identifying risk factors in physical work. Ergonomics 2015, 58, 1660–1670.
  21. Diego-Mas, J.A.; Alcaide-Marzal, J. Using Kinect™ sensor in observational methods for assessing postures at work. Appl. Ergon. 2014, 45, 976–985.
  22. Manghisi, V.M.; Uva, A.E.; Fiorentino, M.; Bevilacqua, V.; Trotta, G.F.; Monno, G. Real-time RULA assessment using Kinect v2 sensor. Appl. Ergon. 2017, 65, 481–491.
  23. Plantard, P.; Shum, H.P.H.; Multon, F. Usability of corrected Kinect measurement for ergonomic evaluation in a constrained environment. Int. J. Hum. Factors Model. Simul. 2017, 5, 338.
  24. Wei, T.; Lee, B.; Qiao, Y.; Kitsikidis, A.; Dimitropoulos, K.; Grammalidis, N. Experimental Study of Skeleton Tracking Abilities from Microsoft Kinect Non-Frontal Views. In Proceedings of the 2015 3DTV-Conference: The True Vision—Capture, Transmission and Display of 3D Video (3DTV-CON), Lisbon, Portugal, 8–10 July 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–4.
  25. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016.
  26. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
  27. Li, J.; Xu, C.; Li, M.; He, C.; Lu, C. DEKR: End-to-End Decoupled Keypoint Regression for Multi-Person Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021.
  28. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696.
  29. Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 561–578.
  30. Mehta, D.; Sridhar, S.; Sotnychenko, O.; Rhodin, H.; Shafiei, M.; Seidel, H.-P.; Xu, W.; Casas, D.; Theobalt, C. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. 2017, 36, 1–14.
  31. Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Elgharib, M.; Fua, P.; Seidel, H.-P.; Rhodin, H.; Pons-Moll, G.; Theobalt, C. XNect: Real-time multi-person 3D human pose estimation with a single RGB camera. ACM Trans. Graph. 2020, 39, 82:1–82:17.
  32. Kim, W.; Sung, J.; Saakes, D. Ergonomic postural assessment using a new open-source human pose estimation technology (OpenPose). Int. J. Ind. Ergon. 2021, 84, 103163.
  33. Barberi, E.; Chillemi, M.; Cucinotta, F.; Sfravara, F. Fast Three-Dimensional Posture Reconstruction of Motorcyclists Using OpenPose and a Custom MATLAB Script. Sensors 2023, 23, 7415.
  34. Nakano, N.; Sakura, T.; Ueda, K.; Omura, L.; Kimura, A.; Iino, Y.; Fukashiro, S.; Yoshioka, S. Evaluation of 3D markerless motion capture accuracy using OpenPose with multiple video cameras. Front. Sports Act. Living 2020, 2, 50.
  35. Dong, C.; Du, G. An enhanced real-time human pose estimation method based on a modified YOLOv8 framework. Sci. Rep. 2024, 14, 8012.
  36. Boudlal, H.; Serrhini, M.; Tahiri, A. A novel approach for simultaneous human activity recognition and pose estimation via skeleton-based leveraging WiFi CSI with YOLOv8 and MediaPipe frameworks. Signal Image Video Process. 2024, 18, 3673–3689.
  37. Wang, S.; Zhang, X.; Ma, F.; Li, J.; Huang, Y. Single-Stage Pose Estimation and Joint Angle Extraction Method for Moving Human Body. Electronics 2023, 12, 4644.
  38. Wu, Y.; He, K. Group Normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  39. García-Luna, M.A.; Ruiz-Fernández, D.; Tortosa-Martínez, J.; Manchado, C.; García-Jaén, M.; Cortell-Tormo, J.M. Transparency as a Means to Analyse the Impact of Inertial Sensors on Users during the Occupational Ergonomic Assessment: A Systematic Review. Sensors 2024, 24, 298.
  40. Jiao, J.; Tang, Y.M.; Lin, K.Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.S. DilateFormer: Multi-scale dilated transformer for visual recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919.
  41. Wen, G.; Li, M.; Luo, Y.; Shi, C.; Tan, Y. The improved YOLOv8 algorithm based on EMSPConv and SPE-head modules. Multimed. Tools Appl. 2024, 83, 61007–61023.
  42. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  43. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
  44. ISO 11226:2000; Ergonomics—Evaluation of Static Working Postures. ISO: Geneva, Switzerland, 2000.
  45. Jamshaid, H.; Mishra, R.K.; Ahamad, N.; Chandan, V.; Nadeem, M.; Kolář, V.; Jirků, P.; Müller, M.; Akshat, T.; Nazari, S.; et al. Impact of construction parameters on ergonomic and thermo-physiological comfort performance of knitted occupational compression stocking materials. Heliyon 2024, 10, e26704.
Figure 1. The framework for a multi-scale, multi-stage pose recognition method based on shared convolutional neural networks.
Figure 2. The structure diagram of LSCD-pose.
Figure 3. The MSCA attention mechanism.
Figure 4. Comparison between the SPPF module and the SPPF_MSCA module.
Figure 5. The structure of EMSPC.
Figure 6. Description of the experimental platform.
Figure 7. Comparison of training results between MMARM-CNN and YOLOv8-pose. (1,2) show results for the normal posture; (3,4) for upper-arm occlusion; (5,6) for lower-arm occlusion; (7,8) for head occlusion.
Figure 8. Posture recognition in the robot installation experiment. (1)–(10) show examples of workers' movement postures and occlusions in the robot experiments.
Figure 9. Summary of right-elbow data measured by MMARM-CNN and by motion capture.
Figure 10. Assessment scores and accuracy of different collection methods.
Figure 11. Assessment of situations in which the experimenter is obstructed by the robot during the work process. (1)–(12) show posture recognition when different body parts are occluded in the robot experiment.
Table 1. Parameter comparison between MMARM-CNN and other models.

| Model | mAP@0.5 | mAP@0.5:0.95 | FLOPs/G | Params/M |
|---|---|---|---|---|
| Hourglass | 81.65 | 56.86 | 170.2 | 77.8 |
| HRNet-W32 | 84.32 | 63.48 | 32.8 | 28.5 |
| DEKR | 86.93 | 65.32 | 40.9 | 29.6 |
| YOLOv7 | 84 | 57 | 28 | 10.5 |
| YOLOv8 | 85.3 | 58.6 | 30.4 | 11.6 |
| MMARM-CNN | 87.5 | 61.4 | 26.5 | 9.4 |
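Keypoint mAP figures of this kind are conventionally computed with OKS-based COCO keypoint evaluation. The snippet below is a minimal sketch using pycocotools on the MS COCO 2017 validation annotations; the detections file name is a hypothetical placeholder, not an artifact released by the authors.

```python
# Minimal OKS-based keypoint mAP evaluation with pycocotools.
# "mmarm_cnn_keypoints.json" is a hypothetical detections file in the
# standard COCO results format; only the ground-truth file is standard.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/person_keypoints_val2017.json")  # MS COCO 2017 GT
coco_dt = coco_gt.loadRes("mmarm_cnn_keypoints.json")        # model predictions

ev = COCOeval(coco_gt, coco_dt, iouType="keypoints")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints AP at OKS=0.50:0.95 and OKS=0.50, among others
```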
Table 2. Comparison of MMARM-CNN and motion-capture capturing results.

| Item | Result |
|---|---|
| R² | 0.831 |
| Pearson's product-moment correlation coefficient | 0.865 |
| Spearman correlation coefficient | 0.847 |
| Average error | 2.53° |
| Maximum error | 7.29° |
| Minimum error | 0.03° |
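For reference, statistics of this kind can be reproduced from paired angle samples with NumPy and SciPy. This is a minimal sketch, assuming `cnn_deg` and `mocap_deg` are synchronized arrays of right-elbow angles in degrees; all names are illustrative rather than taken from the authors' code, and R² is computed here with the motion-capture angles treated as ground truth, which is one plausible reading of the table.

```python
# Agreement statistics between vision-based and motion-capture joint angles.
import numpy as np
from scipy import stats

def agreement_report(cnn_deg: np.ndarray, mocap_deg: np.ndarray) -> dict:
    err = np.abs(cnn_deg - mocap_deg)                     # per-sample error (deg)
    ss_res = np.sum((mocap_deg - cnn_deg) ** 2)           # residual sum of squares
    ss_tot = np.sum((mocap_deg - mocap_deg.mean()) ** 2)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "pearson_r": stats.pearsonr(cnn_deg, mocap_deg)[0],
        "spearman_rho": stats.spearmanr(cnn_deg, mocap_deg)[0],
        "average_error_deg": float(err.mean()),
        "maximum_error_deg": float(err.max()),
        "minimum_error_deg": float(err.min()),
    }
```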
Table 3. Comparison of evaluation results between MMARM-CNN and motion capture.

| Item | Pearson's Correlation Coefficient | Kendall's Tau Correlation Coefficient | p-Value | Accuracy |
|---|---|---|---|---|
| Result | 0.867 | 0.858 | <0.01 | 88.50% |
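The Table 3 comparison of discrete assessment scores can be computed analogously. A minimal sketch, again with illustrative names; "accuracy" is read here as the fraction of frames with identical scores, which is one plausible interpretation of the reported 88.50%.

```python
# Agreement between two sequences of discrete ergonomic assessment scores.
import numpy as np
from scipy import stats

def score_agreement(cnn_scores, mocap_scores):
    cnn = np.asarray(cnn_scores, dtype=float)
    mocap = np.asarray(mocap_scores, dtype=float)
    pearson_r, _ = stats.pearsonr(cnn, mocap)
    kendall_tau, p_value = stats.kendalltau(cnn, mocap)
    accuracy = float(np.mean(cnn == mocap))  # share of exactly matching scores
    return pearson_r, kendall_tau, p_value, accuracy
```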
Table 4. Accuracy statistics of different recognition modalities for the assessment of key body parts.

| Body Part | Abducted (Y/N) | Raised (Y/N) | Angle (Y/N) | Twisted (Y/N) | Total Accuracy |
|---|---|---|---|---|---|
| Upper arm | 172/28 | 178/22 | 183/17 | / | 91.50% |
| Lower arm | / | / | 179/21 | / | 89.50% |
| Wrist | / | / | 153/47 | 149/51 | 78.75% |
| Neck | / | / | 182/18 | / | 91.00% |
| Trunk | / | / | 186/14 | / | 93.00% |
| Leg | / | / | 194/6 | / | 97.00% |
Table 5. Frequency, accuracy, and confidence of pose recognition when subjected to occlusion.

| Item | Correct/Incorrect | Accuracy | Pearson's Chi-Square | Fisher's Exact Probability |
|---|---|---|---|---|
| Blocked | 76/32 | 70.37% | 0.001 | 0.001 |
| Not blocked | 102/12 | 89.47% | | |
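The significance tests in Table 5 follow directly from the 2×2 contingency table of correct/incorrect counts. A minimal sketch with SciPy, using only the counts reported above:

```python
# Chi-square and Fisher's exact tests on the occlusion contingency table.
import numpy as np
from scipy import stats

table = np.array([[76, 32],     # blocked:     correct, incorrect
                  [102, 12]])   # not blocked: correct, incorrect

chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
odds_ratio, fisher_p = stats.fisher_exact(table)

print(f"accuracy (blocked):     {76 / 108:.2%}")   # 70.37%
print(f"accuracy (not blocked): {102 / 114:.2%}")  # 89.47%
print(f"chi-square p = {chi2_p:.3f}; Fisher exact p = {fisher_p:.3f}")
```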
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
