1. Introduction
Work-related musculoskeletal disorders (WMSDs) are injuries or pain involving the joints, ligaments, muscles, nerves, tendons, and/or structures that support the limbs, neck, and back [
1,
2]. WMSDs are a common concern for modern industrialized nations [
3] due to their high incidence and high costs which reached over USD 20 billion in the United States in 2018 [
4]. WMSDs can be caused by a complex interaction of physical, psychosocial, biological, and individual factors, among which physical demand is an important one. One of the most prevalent WMSDs is low back pain [
5], which is associated with physically demanding tasks that include handling heavy loads repeatedly and in awkward postures [
6,
7].
Since physical demands vary widely across jobs, a physical demand analysis (PDA) outlines the physical and environmental requirements of a job and is used for pre/post-offer employment screening, return-to-work planning, and identification of personal protective equipment needs. In the US, it is a required part of a job description and must be written in compliance with the Americans with Disabilities Act. The PDA includes information on the approximate duration (% of day) and magnitude (load) of different occupational physical activities (OPAs), such as lifting/lowering, carrying, kneeling, reaching, walking, and standing. The PDA informs workers about the physical demands of a job before they are hired, and it provides occupational health practitioners critical information for facilitating effective return-to-work programs should an injury occur.
Various methods for quantifying OPAs include self-report, observational, and direct measurement [
8,
9]. Though both self-report and observational approaches are low-cost and convenient, their results can be inaccurate and unreliable, and the analyses can be time consuming [
5]. Although observational methods are the most common approach for estimating OPAs [
8], some have used more detailed video-based techniques to quantify the duration of OPAs with higher accuracy. Essentially, workers are recorded in real-time and the video is analyzed using computer software that aggregates the amount of time spent performing different OPAs. Direct measurements of force supplement the video analysis to provide a more valid and reliable PDA; however, this approach is extremely time consuming and costly [
8].
Accurate and reliable quantification of OPAs required for a job description is critical for physically demanding jobs. Prospective workers rely on them to determine if they would like to pursue a job and clinicians rely on them to facilitate appropriate return to work plans. Further, accurate and reliable quantification of the duration, frequency and magnitude of physical demands can be useful when assessing interventions designed to reduce physical exposures associated with WMSDs. To overcome the limitations of self-report, observational, and direct measurement methods, kinematic data could be an effective approach to quantifying OPAs in the workplace. Kinematic data yield the position, velocity, and acceleration of body segments and have been used to predict the patterns and quality of movement [
10]. Conventional methods of capturing kinematic data rely principally on video analysis or an optoelectronic system to distinguish movement patterns of body segments [
11,
12,
13,
14]. However, these laboratory methods are limited in real work scenarios by both cost and feasibility.
Wearable technology, such as inertial measurement units (IMUs), has been used to capture human body motion for animation, optimizing athletic performance, and even optimizing patient treatment [
15]. For example, Daponte et al. [
16] developed a wireless and IMU-based system for monitoring patient motion with real-time 3D reconstruction. IMU systems have also been used by practitioners for gait and lower limb rehabilitation [
17,
18]. IMUs perform well when tracking the orientation of a moving object, thus, coaches and athletes use them to assess athletic performance [
19,
20]. More recently, IMUs have been used in the workplace to quantify specific exposures that may increase risk of injury. A smart garment using two IMU sensors was introduced by Wang et al. [
21] to monitor shoulder posture for treating WMSDs. Time series data and kinematic data from wearable sensors (IMUs) have also been used to summarize the percent time spent in different physical activities and the probability of being at high risk for WMSDs [
22,
23]. The IMU system (17 IMUs) with classification models introduced by Kim and Nussbaum [
24] and Bastani, Kim, Kong, Nussbaum, and Huang [
25], classified manual material handling (MMH) tasks for real-time applications with higher accuracy than observational methods. These studies support the use of wearable devices for predicting OPAs actively and continuously in diverse work environments, even with fewer sensors. However, the performance of wearable devices in classifying a broader range of OPAs, such as crouching, kneeling, and overhead work, has not been evaluated. Further, there is little, if any, published work on predicting OPAs when they are combined in typical simulated work tasks.
Therefore, the objective of this study was to apply deep learning models to data from eight IMUs to predict physical activities performed during simulated occupational tasks. If wearable devices can be used to predict OPAs with higher accuracy, reliability and efficiency than self-report, observational or direct methods, job descriptions and return to work programs can be standardized more effectively.
2. Materials and Methods
2.1. Study Procedure
This laboratory study collected kinematic data from 8 IMUs worn on 8 different body segments by participants who performed OPAs common in MMH jobs [
9,
13,
22,
23,
25,
26,
27,
28,
29,
30,
31,
32,
33,
34,
35,
36]. The kinematic data were used for training a deep learning model for pattern recognition (
Figure 1). The trained model was used to predict OPAs performed in 3 simulated work tasks. In this pilot validation study, model predictions of the OPA were validated using a frame-by-frame analysis of video collected during simulated tasks.
2.2. Participants
Subjects (n = 15) were recruited by email, on campus, and through social networks. To be included in the study, subjects needed to be between 18 and 65 years of age and willing to perform the simulated work tasks described. Subjects with neck, back, arm, or shoulder pain or with vision problems were excluded. Written informed consent was obtained from all subjects before their participation. This study was approved by the Institutional Review Board of the University of California, San Francisco (IRB# 10-04700).
2.3. Occupational Physical Activities and Manual Material Handling Tasks
In this study, 15 categories of occupational physical activities were selected with some OPAs performed in multiple ways to capture the variation of physical activities and prevent overfitting of the model (
Table 1). To train the models, subjects performed activities with fixed parameters of load, duration, and repetition. Most OPAs were performed for at least 60 s each (
Table 1).
A subset of participants (
n = 9) completed up to three simulated work tasks including bottle packing, carpet laying, and drilling to test the deep learning model’s ability to predict OPAs while performing simulated work tasks (
Figure 2). The bottle packing, carpet laying, and drilling tasks together included all OPAs except sitting, which exhibited obvious features for prediction (
Table 2). Verbal explanation and visual demonstration were provided for subjects prior to performing the various OPAs and simulated work tasks. The whole procedure took approximately 2 h per subject.
Subjects performed carpet laying and bottle packing tasks until the task was complete; all tasks were completed within 5 min. Drilling was performed for 15 s. Details for each task include:
Bottle packing started with opening the box and putting 12 bottles into the box, which contained three rows (close, intermediate, and extended distances) and four bottles in each row. After placing the bottles in the box, the box was closed. The horizontal distances between the bottles and the body were <30 cm, 30–40 cm, and >40 cm. Next, the box was carried about two meters and placed on shelves of fixed heights ranging from floor height, waist height and shoulder height.
The carpet laying task was performed by lifting carpet from a shelf (floor, waist and shoulder height) onto a cart. The cart was then pushed or pulled to a distance of approximately two meters. After placing the carpet on the floor, subjects were asked to lay the carpet in a pre-defined rectangle.
The drilling task involved picking up a drill or paint roller with one hand, walking to the designated area about two meters away, and drilling overhead or on the ground. Afterwards, the tool was returned to its original spot.
2.4. Wearable Track Device with Inertial Measuring Units
A lightweight (<0.7 kg) prototype consisting of a wearable vest and arm cuffs that housed 8 IMUs (SwiftMotion, Berkeley, CA, USA) was used to quantify kinematics while performing OPAs (
Figure 3). The vest was designed with a shoulder harness, belt, upper arm straps, and upper leg straps made of nylon mesh fabric, which fixed the positions of the IMUs. The vest was available in three sizes (small, medium, and large), and the straps were length-adjustable to allow exact positioning of the sensors, independent of the body type of the subject.
The specific positions of IMUs were as follows: (1) two were placed on the spine facing posterior, one between the 3rd and 4th thoracic spinous process (T3-T4) and the other between the 5th lumbar and 1st sacral spinous process (L5-S1); (2) two were placed on the medial segment of each upper arm facing lateral; (3) two were placed on the distal segment of each forearm just proximal to the hand facing posterior; (4) the last two were placed on the medial segment of each thigh facing posterior. With these 8 sensors and their corresponding anatomical locations it was possible to track the orientation of the trunk, upper arms, forearms, and thighs. The sensors were not placed directly on muscle bellies, as their orientation could then change during the activation of those muscles.
The small IMUs (50 × 50 × 20 mm) developed by the researchers included three-axis accelerometers, three-axis gyroscopes, and three-axis magnetometers, which together measured acceleration, angular velocity, and orientation. Data from the IMUs were recorded at a sample frequency of 10 Hz.
2.5. Data Collection
Time series data (1 column) and kinematic data (18 columns) were synchronized from the 8 IMUs and transmitted to a laptop by Wi-Fi dongles. The IMU output included both quaternion and Euler data. Kinematic data were recorded by IMU number and position (2 columns), quaternion (4 columns), Euler angle (°, 3 columns), raw acceleration (m/s², 3 columns), linear acceleration (m/s², 3 columns), and angular velocity (rad/s, 3 columns). The current orientation of each IMU was determined using Euler angles following the Z-Y-Z sequence: the first rotation occurs around the Z-axis, followed by a rotation around the Y-axis and a rotation around the rotated new Z-axis. Quaternions were transformed to Euler angles using Equation (1) [37].
where q0 denotes the scalar part and q1, q2, and q3 denote the vector part of the quaternion.
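As an illustration (not the authors' code), the Z-Y-Z conversion described above can be sketched in plain Python; the function name and the scalar-first quaternion ordering (q0, q1, q2, q3) are assumptions based on the notation in the text:

```python
import math

def quat_to_euler_zyz(q0, q1, q2, q3):
    """Convert a unit quaternion (scalar-first: q0 + q1*i + q2*j + q3*k)
    to intrinsic Z-Y-Z Euler angles (alpha, beta, gamma) in degrees."""
    # Rotation-matrix entries needed for the Z-Y-Z extraction.
    r02 = 2.0 * (q1 * q3 + q0 * q2)        # cos(alpha) * sin(beta)
    r12 = 2.0 * (q2 * q3 - q0 * q1)        # sin(alpha) * sin(beta)
    r20 = 2.0 * (q1 * q3 - q0 * q2)        # -sin(beta) * cos(gamma)
    r21 = 2.0 * (q2 * q3 + q0 * q1)        # sin(beta) * sin(gamma)
    r22 = 1.0 - 2.0 * (q1 * q1 + q2 * q2)  # cos(beta)

    alpha = math.atan2(r12, r02)
    beta = math.acos(max(-1.0, min(1.0, r22)))  # clamp for numerical safety
    gamma = math.atan2(r21, -r20)
    return (math.degrees(alpha), math.degrees(beta), math.degrees(gamma))
```

For example, a 90° rotation about the Y-axis (q = (√0.5, 0, √0.5, 0)) yields Z-Y-Z angles of (0°, 90°, 0°). Note that the Z-Y-Z sequence is degenerate when beta = 0, as for any Euler convention.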
2.6. Model Training
A convolutional neural network (CNN), ResNet-18 [39], was used for categorical prediction. Its 18-layer implementation was robust enough to train the model efficiently while retaining high accuracy. The ResNet-18 structure was pre-defined; only the weights were learned during training. Error was defined as the cross-entropy loss averaged over all samples [38] in one batch (Equation (2)), and training continued until the model converged (Figure 4a). The time series data and kinematic data of activities based on the 15 OPA categories (Table 1) were converted to tensors (1 × 60 × 19) and divided into 60%, 20%, and 20% for training, validation, and testing, respectively.
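The body of Equation (2) did not survive extraction; a batch-averaged softmax cross-entropy consistent with the symbol definitions that follow would read as below (a reconstruction, not the authors' typesetting; the logit symbol z is introduced here for clarity):

```latex
\mathrm{Error} \;=\; -\frac{1}{N}\sum_{n=1}^{N}
\log\frac{e^{\,z^{(n)}_{i}}}{\sum_{j} e^{\,z^{(n)}_{j}}}
\qquad\text{(2)}
```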
where, in each batch, i represented the correct OPA category, j indexed all of the OPA categories, and N was the batch size. Because the lowest validation error during training did not necessarily identify the best model, the current best model was saved whenever the validation error decreased. After training was complete, the best model was selected based on testing data accuracy (minimizing empirical error). Model training and prediction were performed using Python 2.7.
The OPA categories were added to the data of the 15 activities as labels, and the raw datasets were saved to 15 csv files. Each csv file represented one activity and was constructed such that columns represented coordinates and time-derivative information from the sensors and rows represented timestamps across the 8 sensors. Therefore, each row had 18 columns of sensor data, 1 column for the timestamp, and 1 column for the label.
The final model input was generated by combining multiple rows to form an image-like window. The number of rows combined (window size) was a hyperparameter of the model, and the window size yielding the highest accuracy was selected. Each window had a size of 60 rows × 20 columns: 60 rows of sensor data, with 19 columns of sensor and timestamp data plus 1 activity-label column (
Figure 4b). Finally, windows that overlapped an activity boundary were removed. Such a window contained multiple activities and was therefore ambiguous; for example, when a single window (1 s) contained both activity 1 and activity 2, that window was dropped (0.7% of windows).
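The windowing step can be sketched as follows (an illustrative example, not the authors' code; the function and argument names are assumptions):

```python
def make_windows(rows, labels, window=60):
    """Combine consecutive sensor rows into fixed-size, non-overlapping
    windows; drop any window that spans an activity boundary (i.e.,
    contains more than one activity label)."""
    windows, window_labels = [], []
    for start in range(0, len(rows) - window + 1, window):
        chunk = labels[start:start + window]
        if len(set(chunk)) > 1:  # window mixes two activities: drop it
            continue
        windows.append(rows[start:start + window])
        window_labels.append(chunk[0])
    return windows, window_labels
```

For instance, 180 rows labeled with 90 "lift" rows followed by 90 "carry" rows yield two clean windows; the middle window straddles the boundary and is discarded.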
2.7. Tasks Prediction and Validation
Simulated work tasks were video-recorded at 30 frames per second then analyzed using Multimedia Video Tasks Analysis™ (MVTA™, NexGen Ergonomics Inc., Pointe-Claire, QC, Canada). Each frame was categorized and labeled into one of the 15 OPAs (
Table 1). Two researchers performed the task analysis using MVTA. Both researchers were trained by a senior engineer who has used MVTA for more than 13 years. Random frames were selected by the PI and the senior engineer to confirm task analysis reliability and accuracy. Any uncertainties were resolved by discussion with the PI and senior engineer. Transitions (i.e., frames between standing and lifting) were allocated to the preceding OPA.
For the occupational activity classification problem, after removing the activity-label column, each task was split into multiple 60 × 19 windows as input to the CNN. Sixty rows of data corresponded to roughly 1 s of “video”, which was enough for a single activity to be repetitive and recognizable. The CNN analyzed these windows and mapped each to a predicted OPA. The trained CNN model was applied to the IMU data for the simulated work tasks to predict an OPA for each 1 s interval. The OPA prediction for each second was compared to the results from the MVTA [
40] to calculate model accuracy of OPA prediction (
Figure 5). The best three OPA predictions generated by the CNN model were retained for a post hoc analysis.
2.8. Post Hoc Analysis
For the simulated work tasks, the best three CNN OPA predictions for each second were compared with the actual OPA identified by MVTA. If any of the best three predictions for a given second matched the actual OPA, that second was counted as correct. If all three were incorrect, the first (top) prediction was counted as the incorrect prediction.
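The post hoc scoring rule can be sketched as follows (an illustration; the function and variable names are not from the study):

```python
def top3_accuracy(per_second_top3, mvta_labels):
    """Score each second as correct if any of the best three CNN
    predictions matches the MVTA-identified OPA; otherwise the first
    (top) prediction stands as the incorrect one."""
    correct = sum(
        1 for preds, actual in zip(per_second_top3, mvta_labels)
        if actual in preds[:3]
    )
    return correct / len(mvta_labels)
```

For example, if second one's top three predictions contain the true label and second two's do not, the post hoc accuracy is 0.5 for those two seconds.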
4. Discussion
The present study used a deep learning method to predict 15 OPAs common in MMH jobs. A convolutional neural network model was applied to data from 8 IMUs to predict which of the 15 OPAs was being performed during each second of analysis. Overall, the model had an average accuracy of 95% when each OPA was performed in isolation. However, model accuracy decreased when applied to simulated work tasks that contained multiple OPAs (e.g., bottle packing, carpet laying, and drilling).
The CNN model was first applied to predict occupational physical activities performed in isolation. The study indicated that the CNN model provided a reliable prediction of OPAs performed in isolation with the highest accuracy of prediction reaching 100%, and the lowest accuracy being 83%. However, even when performed in isolation, some activities had better accuracy than others; overhead work, sitting, standing, carrying, reaching and static stoop had 100% accuracy while one-handed lifting, pulling and crouching were lower (83%, 85% and 88%, respectively). One possible reason for the lower accuracy in one-handed lifting (83%) was the asymmetrical movement that varied with each lift (
Figure 7a). Pulling and one-handed pulling also had lower prediction accuracies (85% and 93%, respectively), which may likewise have been due to these activities being asymmetrical and variable. The model accuracy for crouching may have been lower (88%) because the trunk angle is similar to that of other activities, such as lifting and kneeling. Future studies should include IMU data from the lower legs to evaluate whether additional IMU data improves model prediction for all OPAs by differentiating activities with similar upper body postures (
Figure 7b).
To test the robustness of the trained CNN model, it was applied to data collected during three MMH tasks. The overall model accuracy fell dramatically to 22%. Each task contained multiple OPAs, and the model was unable to differentiate the OPAs with high accuracy when they were performed as part of a simulated task. To understand the inaccurate predictions, a post hoc analysis examined the best three CNN predictions for each second of simulated work. The overall accuracy of OPA prediction increased from 22% to 45%. The results for some activities improved greatly; predictions for lifting and pushing increased from 13% to 51% and from 39% to 88%, respectively. This indicates that the model may have had an incorrect best prediction while the 2nd or 3rd prediction was correct, at least for some OPAs. However, some activities had minimal changes in accuracy. The incorrect predictions for each OPA were graphed to visually depict erroneous predictions, primarily to inform improved models for future research, particularly for the OPAs with low prediction accuracy. Upon further analysis, the erroneous predictions appeared to be primarily due to four circumstances: (1) variation in how the OPA was performed; (2) OPAs being performed concurrently; (3) posture similarities between OPAs; and (4) an OPA being embedded in another OPA, thus confusing the model.
The amount of variation in how activities were performed impacted the model prediction accuracy. For example, the CNN model was trained with people lifting using a squat technique, yet during the simulated work tasks, some people used a stoop lifting posture, which was mistakenly predicted as crouching. Kneeling had only a 7% accuracy despite using the best three predictions. Kneeling can also be performed with much variation; there was kneeling while sitting on the heels, kneeling upright (no hip flexion), and single-knee kneeling. Since the model was not trained for these variations in posture, it predicted other similar activities such as crouching or crawling. Future research should include more variations of how each OPA is performed in the training-test dataset before the model is applied to simulated work tasks. Fifteen subjects participated in the current study; including more people in the training-test dataset would also help capture variations in how OPAs are performed.
Another reason for poor prediction accuracy was that subjects often performed multiple OPAs at the same moment while performing a task which the single model prediction approach could not resolve. For example, subjects may reach while kneeling, sitting, or standing. Standing was misclassified as lifting since subjects usually lift items when standing. This presented challenges for the CNN model in predicting the predominant activity. Despite the extensive training, evaluation, and discussion about how to classify each frame in MVTA, human judgement was used to identify the predominant activity.
As described above when OPAs were performed in isolation, similar postures across OPAs were a reason for poor accuracy during the MMH tasks, especially since there was no information on loads being handled. Reaching was frequently predicted as lifting, overhead work, and carrying likely because those activities include shoulder flexion with an increased horizontal distance between the body and wrists. Kneeling was commonly misclassified as crawling or crouching, also likely due to the similarities in hip flexion and trunk angles, particularly when kneeling while sitting on one’s heels. Crouching was misclassified as lifting or sitting, again due to similar hip flexion angles. Surprisingly, walking had low prediction accuracies and was most commonly misclassified as one-handed pulling, carrying, or pushing, all of which include walking while handling a load. Having additional information from lower extremity IMUs and/or about loads being handled may improve the OPA classification across these otherwise similar activities that all included some amount of reach.
Another common reason for misclassification was when one OPA was part of or embedded within another OPA. For example, when categorizing OPAs in MVTA, the visual cues made it obvious when someone was reaching forward to lift something. In this case, the lift started at the beginning of the reach to lift an item and ended when the item was brought back to their body. In other words, if someone was reaching forward to lift something, the frames including the reach forward required to make contact with the object, were classified as part of the lift. Despite this consistency in how the models were trained, the CNN model could not differentiate these activities during the simulated activities, based on intent of the movement. This may explain why reaching was incorrectly predicted as a carry and why standing was frequently misclassified as lifting. Thus, a different approach to OPA classification will be needed in the future to help differentiate OPAs. It may be beneficial to start the classification of a lift when the load is actually being lifted versus the moment someone reaches forward to initiate a lift.
To mitigate the misclassifications discussed above and improve prediction accuracy, additional IMUs and/or pressure insoles could be added to the system in future studies. For example, pressure insole information may help differentiate similar activities that primarily differ by the loads being handled, such as walking versus carrying, pushing, or pulling, and crouching versus lifting. Information from pressure insoles has been shown to distinguish such activities [36] and may help distinguish kneeling, crouching, and stooping from lifting. Pressure insole information may also help distinguish crawling from kneeling and crouching, since the total force through the feet would be significantly lower when the weight is supported through the knees during crawling.
To address OPAs performed concurrently, sequential modeling could be used to make multiple predictions. A prior study used sequential artificial neural network models to estimate hand posture before estimating hand exertion force [
41]. Perhaps a similar approach that predicts whole body posture before predicting upper extremity movement could be used to improve OPA prediction accuracy.
Continual challenges to estimating prediction accuracies remain because of the transitional time between OPAs when performing simulated work. This can be minimized by training models to make accurate predictions from the beginning of an activity to its very end. For example, lifting would need to be predicted from the first moment there is hip and knee flexion to the moment the person returns to standing. This poses obvious challenges for a model that makes a prediction every second and will require preceding information for accurate classification. Exploring additional types of deep learning models may help address this issue and improve prediction results. In this study, a recurrent neural network (RNN) model was first applied to predict OPAs performed in isolation. The prediction results of the RNN model did not meet expectations. Upon reflection, RNNs are designed to predict an outcome for each timestamp in a time series, which was not the best fit for OPA classification. Further, CNN ResNet models have been shown to significantly outperform other deep learning approaches for time series classification [42]. Thus, the model chosen for this study was ResNet, an ImageNet contest winner. Its 18-layer implementation is robust enough to train efficiently while retaining high accuracy for single-OPA prediction. The window size of the current CNN model was 60 rows of sensor data, roughly corresponding to 1 s of video data, which is enough for a single activity to be repetitive and recognizable. However, in MMH tasks, activities change at a very fast pace. Thus, further investigation of window size selection is needed.