4.1. ML Regression Model Results for Prediction of the Next Medical Appointment
As a first objective, we predict the time of the next medical appointment based on the worker's FWA. To create the target label from the date of the next medical appointment, we considered the time interval from a fixed reference date, expressed in number of weeks. With a criterion such as RMSE, the existence of outliers might cause the error term to expand to a very high value [50]. In RMSLE, on the other hand, outliers are dramatically scaled down, thereby nullifying their influence. To put it another way, RMSLE is resistant to the impact of outliers: its error is not inflated by their inclusion. RMSLE measures the relative error between the predicted and actual values. In addition, RMSLE penalizes underestimating the actual value more heavily than overestimating it. Since our data are such that a significant number of workers have only a single medical record, while fewer workers have multiple records of medical visits, selecting a criterion such as RMSLE, which better shows the ability of the model, seems logical. As the results of the
Table 5 show, the CatBoost regressor, which is based on GBDT, is able to produce impressive outcomes. The primary principle behind boosting is to progressively combine multiple weak models (models that perform marginally better than chance) into a strong predictive model using greedy search. This characteristic of the regressor explains why it performed so well on our data. This method obtained the best RMSLE, indicating an excellent fit on the training data that allows it to learn from them and produce good estimates on the test data. Additionally, the R2 score of this method over different cross-validation folds shows its superiority over the other methods.
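As a concrete illustration of these properties, the following minimal sketch (plain Python, with hypothetical week counts rather than our actual data) implements RMSLE, contrasts it with RMSE on an outlier, and shows its heavier penalty for underestimation:

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sensitive to large outliers."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: a relative error that
    damps outliers via the log1p transform."""
    return math.sqrt(sum(
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ) / len(y_true))

# A single outlying prediction inflates RMSE far more than RMSLE.
y_true = [10, 12, 11, 10]
y_pred = [11, 11, 12, 100]          # last prediction is an outlier
print(rmse(y_true, y_pred), rmsle(y_true, y_pred))

# RMSLE penalizes underestimating the actual value (predicting the
# appointment too early) more than overestimating it by the same amount.
assert rmsle([30], [20]) > rmsle([30], [40])
```

Because the target is a non-negative number of weeks, the log1p transform is always well defined here.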
4.1.1. Visual Performance Analysis of ML Regression Models
Learning curves show how a learning algorithm's models perform in terms of generalization as a function of the size of the training set. To analyze the learning process of the different models more visually, we used 80% of the data for training and the remaining 20% as test data for comparing their learning trends. As Figure 5a illustrates, the CatBoost regression model starts out at its most overfitted and, at this early stage, does not provide good results on the validation OHPP data. As the data size increases, the model largely retains its sensitivity to noisy or outlier data, which causes its relative overfitting. However, as the process continues, the largest improvement in the R2 score on the validation set occurs in the range of 0 to 150 training records and then slows down as the training set approaches 400 records. The CatBoost regressor (Figure 5b), which had the best performance in terms of regression results, is also one of the best examples in terms of the predicted-vs-observed plot.
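The learning-curve procedure itself can be sketched without any ML library: train on progressively larger subsets and record the error on the held-out portion. The snippet below uses synthetic week counts and a trivial mean predictor as a stand-in for the actual regressors, purely to illustrate how such curves are produced:

```python
import random

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

random.seed(0)
# Synthetic target: weeks until the next appointment, with noise.
data = [60 + random.gauss(0, 5) for _ in range(500)]
train, valid = data[:400], data[400:]        # 80/20 split

curve = []
for size in (50, 150, 250, 400):             # growing training-set sizes
    subset = train[:size]
    prediction = sum(subset) / len(subset)   # stand-in "model": the mean
    curve.append((size, mae(valid, [prediction] * len(valid))))

for size, err in curve:
    print(f"n={size:3d}  validation MAE={err:.2f}")
```

Plotting one such curve per model, with a real fitted regressor in place of the mean predictor, yields figures like Figure 5a.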
4.1.2. Model Analysis Using SHAP Value
As ML models become more widely employed, it is becoming increasingly critical to comprehend their performance. Traditional ML measures such as MAE and the R2 score, among others, do not provide deep insight into a model's performance. We can have regression ML models with a high R2 score that in fact rely on features that should not be utilized for prediction.
SHAP employs a game theory method to explain model predictions. We concentrate here on how to utilize SHAP to examine the performance of the regression model in this domain. It begins with a base value for the prediction based on prior knowledge and then tests the other features of the data one by one to see how the addition of that information affects the base value before making the final prediction. In other words, the ML model's prediction for each instance can be reproduced as the sum of these SHAP values plus a fixed base value, resulting in: prediction = base value + Σᵢ φᵢ, where φᵢ denotes the SHAP value of feature i.
The mean of the target variable is used as the base value in regression models. It also considers the sequence in which features are introduced as well as their interactions, allowing us to better understand model performance. It captures SHAP values throughout this procedure, which will be used to visualize and explain predictions afterward. SHAP values give important insights into how the input factors are affecting the predictions of the ML model, both at the level of individual instances and throughout the population as a whole.
SHAP is a model-agnostic approach that can be applied to any ML algorithm, so the details of the modeling process are unimportant for this discussion. It can be a useful approach when models have suspiciously high prediction results arising from information leaks from the target variable to the feature set. SHAP features a set of classes known as explainers that help comprehend a variety of ML models. The following are some helpful and frequently used types of explainers [
22]:
- AdditiveExplainer is used to explain Generalized Additive Models.
- LinearExplainer is used for linear models available from sklearn (https://scikit-learn.org/, accessed on 31 May 2022). It may also take the relationship between features into consideration.
- TreeExplainer is for models built on trees, such as a decision tree, a random forest, or gradient boosting.

The Python SHAP package (https://github.com/slundberg/shap, accessed on 31 May 2022) is used for plotting the common useful SHAP plots, which are presented for further visualized performance analysis in the following.
We chose TreeExplainer for our analytic objective based on the descriptions of the explainers above, and soft hyperparameter tuning was applied to the CatBoost regressor as our selected model.
Additionally, to evaluate the aforementioned model, plots were made comparing the model's predictions to the real body part values, both in the training set and in the test set. In this way, the model achieves better efficiency in estimating the predicted values in the test routine, where it is validated on a portion of the dataset with which it had no contact during training. We also observed that the distance between the predicted and actual values decreases considerably, yielding an acceptable metric.
Figure 6 illustrates the summary plot on the test data, in which the contributions of the different body parts (Table 4) are shown on the y-axis. The higher a variable is placed on the y-axis, the more it influences the final model prediction. The feature values are represented by the color scale on the right, from high (red) to low (blue). The x-axis represents the SHAP values of each variable, by which the model predicts the number of weeks until the next medical appointment for workers with physical problems.
The interpretation of the summary plot shows that low values of the variable shoulder right (visible in blue on the horizontal bar) imply a decrease in the predicted time until the medical appointment (expressed on the scale at the bottom of the figure); conversely, high FWA in shoulder right (visible in red) is associated with an increase in the predicted number of weeks until the next medical appointment. This means that if the value of shoulder right is high, the worker has greater functional ability, and more time passes before the next medical appointment. The same analysis can be applied to the rest of the variables. SHAP sets a model's mean prediction as the base value and determines the relative contribution of each feature to the target's divergence from the base. It has the ability to provide both local and global explanations.
Local explainability helps to understand the model's decision-making process for a single sample, while global explainability focuses on all of the records; in other words, it operates across all predictions. We used both types of explainability for a comprehensive analysis of our model, and their details are presented as follows.
Local explanation. The force plot of a particular random sample is presented in
Figure 7, on which the method is explained. The force plot illustrates the output prediction and the contribution of each predictor, retrieved from the SHAP value calculation on this sample. A pattern similar to the training data is shown by the force plot in Figure 7a. The force plot indicates the decreases and increases in the effect of the most influential body part factors in altering the base value (indicated on the horizontal bar with a value close to 64) to reach the final predicted value (59.73).
Regarding the condition of this worker, the decrease in the function of the shoulders and left elbow has the greatest effect in reducing the time until the next medical appointment. It seems that because the working conditions of these workers require intensive use of the hands and put pressure on them, these body parts are more involved and exposed to injury. Meanwhile, although the increased function of the neck and the fingers of the left hand lengthens the predicted time until the next medical appointment, these body parts are not significantly involved in the work and therefore do not make a significant impact.
Global explanation.
Figure 7b is a global explanation of the model prediction, in which 63.55 is the base value obtained using the SHAP values. We can observe that the condition of the neck and right shoulder functionality has resulted in a drastic decrease in the model output, as the first falling trend of the plot shows. However, the status of the trunk had a negligible effect. As the second changing trend shows, for another group of workers, the condition of their fingers, right shoulder, and left wrist has delayed their need for medical attention according to the model output. The last significant alteration to the model output happens at the end of the plot for another group of workers: the problem with their right shoulder's functionality has resulted in a decrease in the time until the next medical appointment, while the functional status of their feet and left knee could not improve their health condition considerably.
Figure 8 also indicates that the health state of upper body parts such as the shoulders, neck, left fingers, and left elbow contributes more to the model output, as these parts are typically engaged in working by hand.
The decision plot depicts the decision process by applying the SHAP values of individual features one by one to the expected value, building a line chart that ends at the predicted value.
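A minimal sketch of that construction, using the base value reported for our model (63.55) and hypothetical SHAP values for a few body parts, is:

```python
# Sketch of how a decision plot is built: starting from the base value,
# add each feature's SHAP value in turn; the running totals form the
# points of the line chart, ending at the model's prediction.
base_value = 63.55                  # base value from the global explanation
shap_values = {                     # hypothetical per-feature contributions
    "shoulder_right": -2.1,
    "neck": -1.4,
    "fingers_left": 0.6,
    "trunk": -0.3,
}

path = [("base", base_value)]
running = base_value
for feature, phi in shap_values.items():
    running += phi
    path.append((feature, running))

for feature, value in path:
    print(f"{feature:>15}: {value:.2f}")
# The last value equals base + sum of SHAP values: the final prediction.
```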
The similar plot in Figure 9 also shows the decreasing trend of the effect of features on the final estimate of the model: in the highlighted records, the performance status of the left and right shoulders is the most influential factor, and its functionality remarkably affects the period until the next visit to the OP. At a lower level of feature importance, factors such as the knees, elbows, wrists, and fingers of the right hand appear as less influential.
SHAP analysis gives companies the possibility to investigate the conditions of the working environment by checking the amount of pressure on the workers' body parts in different sections. As a result, they will be able to reduce the work pressure and fatigue of the workers by making adjustments to production and tools, and in this way they will preserve more human resources, which are themselves precious capital.
4.2. Regression Model Results for Next FWA Body Parts
For our second research objective, we predict the severity for each part of the body at the last medical appointment, based on the OP's FWA statuses in OHPPs. For this purpose, we considered the physical constraints of the last medical report as the measure of the estimation. In this section, the results of body part severity for different areas of the body, based on the different models, are shown. The first evaluation criterion was RMSLE: the lower its value, the better the model fits the data and the smaller the error. The second evaluation criterion (R2) is a statistical measure of how close the data are to the fitted regression line; therefore, the higher it is, the better the model performs and the better the regression line fits the data. In the tables, the value of R2 is also reported for the validation data. We chose the R2 measure as a 10-fold cross-validation statistic for performance comparisons, since it provides a thorough perspective on the model's performance and capabilities.
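For reference, R2 follows directly from its definition as 1 − SS_res/SS_tot; the toy sketch below (plain Python, illustrative values) confirms the two boundary cases: a perfect fit gives R2 = 1, and always predicting the mean gives R2 = 0:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    1.0 means a perfect fit; 0.0 means no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 4.0, 7.0]
assert r2_score(y_true, y_true) == 1.0           # perfect predictions
mean = sum(y_true) / len(y_true)
assert r2_score(y_true, [mean] * 4) == 0.0       # mean-only baseline
```

Negative values are also possible when a model predicts worse than the mean baseline, which is why R2 on held-out validation folds is a useful generalization check.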
Table 6 shows the results of the workers in health protection in this period for the left shoulder. According to the RMSLE criterion, the CatBoost performed better than the other five models. It also has the highest value of R2 on the validation data, as its boosting schemes help to reduce overfitting and improve the quality of the model.
Some of the main features of the CatBoost model are that it can achieve great results without hyperparameter tuning, no preprocessing needs to be performed on categorical features, and it is computationally fast. Moreover, because it is less prone to overfitting, it can achieve decent accuracy.
According to the R2 criterion, the Light Gradient Boosting Machine (LightGBM) model performed better than the other five models. LightGBM often achieves better accuracy than other boosting algorithms; it produces complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.
Considering the MAE metric, simpler tree-based models, such as the DT and RF, could achieve better results, but all models’ MAE values are acceptable.
Table 7 shows the results of the workers that had health protection in this period for the right shoulder. According to the RMSLE criterion, the CatBoost performed better than the other five models. It also has the highest value of R2.
According to the R2 criterion, the LightGBM model performed better than the other five models on the validation data. Since it is based on DT algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. Therefore, when growing from the same leaf, the leaf-wise algorithm can reduce the loss more than the level-wise algorithm, resulting in better accuracy that is rarely achieved by the existing boosting algorithms. LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter the data instances when finding a split value, while XGBoost uses pre-sorted and histogram-based algorithms for computing the best split. All regression models achieved considerably low MAE values on this dataset, which shows their acceptable capability.
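The GOSS idea can be sketched in a few lines (the function, parameter names, and rates here are illustrative, not LightGBM's actual implementation): keep the instances with the largest gradients, randomly subsample the rest, and re-weight the subsample so the gradient statistics remain approximately unbiased:

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based One-Side Sampling sketch: keep the top_rate fraction
    with the largest |gradient|, sample other_rate of the rest, and up-weight
    the sampled small-gradient instances by (1 - top_rate) / other_rate.
    Returns (instance index, weight) pairs."""
    rng = random.Random(seed)
    order = sorted(range(len(gradients)),
                   key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(len(gradients) * top_rate)
    top, rest = order[:n_top], order[n_top:]
    n_other = int(len(gradients) * other_rate)
    sampled = rng.sample(rest, n_other)
    weight = (1 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, weight) for i in sampled]

grads = [0.05, -2.0, 0.1, 1.5, -0.02, 0.3, -0.8, 0.01, 0.6, -0.04]
print(goss_sample(grads))
```

Large-gradient instances are the under-trained ones, so keeping all of them while downsampling the well-fitted rest preserves split quality at a fraction of the data cost.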
Table 8 shows the results of the workers having health protection for the left elbow. The number of workers with body injuries in the left elbow area is small in the dataset. Because data for this category are scarce, simpler models tend to work better on these data.
According to the RMSLE criterion, the XGBRegressor model performed better than the other five models. XGBRegressor can work well on small to medium datasets and handles missing data with its built-in features [51]. XGBoost uses DTs as base learners, combining many weak learners to make a strong learner. As a result, it is referred to as an ensemble learning method, since it uses the output of many models in the final prediction.
According to the R2 criterion, the DT model performed better than the other five models. It is one of the quickest ways to identify relationships between variables and the most significant variable. DTs are not largely influenced by outliers or missing values, and they can handle both numerical and categorical variables. Since the DT is a non-parametric method, it makes no assumptions about the space distribution or classifier structure.
CatBoost has the highest value of R2 on the validation data because its structure reduces overfitting and improves the generalization of the model.
Although the DT and RF models achieved the best MAE error, they could not become the superior models overall.
Similarly,
Table 9 shows the results of the workers having health protection for the right elbow. According to the RMSLE criterion, the CatBoost performed better than the other five models. It also has the highest value of R2 on the validation data.
The simpler learning approach of the first two models appears to give them lower MAE results, although it was not enough for them to achieve the best overall results among all models.
According to the R2 criterion, the RF performed better than the other five models. The random forest algorithm provides a higher level of accuracy in predicting outcomes than the decision tree algorithm because, in RF regression, each tree produces its own prediction, and the mean prediction of the individual trees is the output of the regression.
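A minimal sketch of that averaging step, with hypothetical stand-in trees rather than fitted models:

```python
# In random-forest regression each tree makes its own prediction, and the
# forest's output is their mean. The "trees" here are illustrative
# stand-in functions, not trained decision trees.
trees = [
    lambda x: 58.0 + 0.5 * x,   # tree 1
    lambda x: 61.0 - 0.2 * x,   # tree 2
    lambda x: 60.0 + 0.1 * x,   # tree 3
]

def forest_predict(x):
    preds = [tree(x) for tree in trees]
    return sum(preds) / len(preds)

print(forest_predict(5.0))      # mean of 60.5, 60.0, and 60.5
```

Averaging many decorrelated trees reduces the variance of any single tree's prediction, which is the source of the RF's accuracy advantage noted above.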
The CatBoost performed better than the other five models in terms of RMSLE and CV-R2 on the trunk FWA dataset (Table 10). The LightGBM performed better than the other five models in terms of R2 and MAE. The high R2 scores indicate that, from a generalization point of view, the models are performing well.
4.2.1. First Scenario Learning Curves
Figure 10a shows the learning curves of the models for right elbow injury prediction. It indicates that once the data size reaches a level sufficient for the model learning process, preventing the occurrence of overfitting, the error rate of most models, such as CatBoost, XGBoost, RF, and GB, decreases with increasing data size. However, while the fitting process is expected to become more suitable, the DT models fitted on larger training sets do not seem to follow the same pattern. The LightGBM model showed the worst fitting pattern across all training sizes, with a considerable difference in MAE compared to the other models.
The estimation of possible left-hand finger injuries for the models is shown in Figure 10b. As the figure illustrates, although the models finally achieve good performance at the maximum size, increasing the training set size does not show a clear correlation with the final model error; in both cases, the CatBoost model's performance is remarkably good. The other models show a similar downward error trend, except for LightGBM, which suggests that this model prioritizes computational speed over comprehensiveness.
4.2.2. CatBoost Algorithm in Predicting Next Body Part Severity Based on Gender, Seniority, and Area
A pie chart shows the relationships of parts to the whole for a variable. In this section, we examined the percentage of FWA in different areas of the body for different categories using pie charts. The pie charts show the predicted body part severity of workers who have worked in the automotive industry for a given category, and the OP and ergonomist can assign the exposure to these workers based on that category, such as FWA body parts in men/women. For example, the trunk and the right shoulder are the factors that the OP and ergonomist should consider for the gender category. Quality assurance is the area with the greatest impact on the trunk, and it is important for the OP and ergonomists to be aware that workers of either gender may have more issues in this area.
Figure 11b shows that men have the greatest job demand at the first level from the trunk, with 19%, and at the second level from the right shoulder, with 15%. The highest percentage of FWA in the test data is from the trunk; the shoulders are at the second level, and the rest of the body has a small percentage of the FWA. In general, it can be said that both men and women have body part FWA in the area of the right shoulder, with similar values; however, men have the most FWA from the trunk, with 19%.
Considering the same plot for women, they have the highest FWA at the first level from the right shoulder, with 14%, and at the second level from the left wrist, with 13%. After the left shoulder and right wrist, with 12% at the third level, the rest of the body has a small percentage of severity. There is evidence that exposure to occupational risk factors is likely to affect FWA in these body parts.
If the results are also analyzed based on the medical history of the worker, a very good view of the FWA loss can be obtained in relation to the worker's work experience.
Figure 12a shows which area has suffered the most in workers. According to the analysis, the highest percentage of FWA loss in workers at the first level was the trunk, with 16%; at the second level, the right shoulder, with 14%; and at the third level, the wrists, with 12%. The rest of the body has a small percentage of job demand.
Considering workers with 10 to 20 years of work experience (Figure 12b), the highest percentage of FWA loss relates to their right shoulder, with 18%, followed by their left shoulder, with 14%, and their trunk, with 14%, at the third level. The rest of the body has a small percentage of FWA loss.
According to the analysis of workers with more than 20 years of work experience in Figure 13, the highest percentage of FWA loss in these workers relates to the trunk, with 19%, followed by the right shoulder, with 14%, and the left shoulder, with 13%, at the third level, while the rest of the body parts have a small percentage of FWA loss.
FWA loss in areas such as the right shoulder, which can be related to the fact that most of the workforce is right-handed, increases with increasing work experience. Trunk FWA loss is another problem that arises with experienced workers.
FWA assessments based on a worker's occupation can give a beneficial view of their job demand and workload. In this section, the extent of FWA loss is examined based on the occupation of workers in categories such as Paint, Body Construction, Assembly, Special Projects, and Metal Stamping.
Figure 14 shows the pie chart for the Special Projects category. In the special projects, most of the body part FWA related to the right elbow and the left wrist at the first level, both with 13%. After the right shoulder, left foot, and right foot, with 10% at the third level, the rest of the body has a small percentage of FWA. The cause is that exposure to occupational risk factors had severe consequences on these workers' work ability, as they were targeted with OHPPs at the highest protection level. Typically, the OHPP of workers belonging to special projects is the most challenging in terms of the match between remaining work ability and job demands. Whenever a match is not verified, the worker is not fit to work at the automotive factory at all and will therefore be absent from work.
In the Assembly category, most of the data related to the right shoulder at the first level, with 17%, the left wrist and trunk both at the second level, with 13%, and the right wrist, with 12%. The data suggest that an assembly area can contain, at the same time, the cause of the problem and the solution to it. There is evidence that exposure to occupational risk factors is likely to affect FWA in these body parts. It also means that the assembly area has jobs that are compliant with OHPPs assigning the same body parts.
Figure 15 shows the pie chart for the body construction category.
For the body construction category of the test data, most of the workers' health protection related to the trunk at the first level (20%) and the left shoulder at the second level (13%). The highest rate of FWA, with a different percentage, was allocated to the trunk. As the protection given is addressed to the highest level possible, it has, somehow, to be associated with exposure to working conditions.
The Metal Stamping category is shown on the pie chart in
Figure 15b. Most of the health protection of workers related to the left wrist at the first level (29%) and the left shoulder at the second level (25%). At the first level, the highest percentage is concentrated in the left wrist. This analysis is practical for related decision-making because the workers from whom these results were reported were still working in this same production area. This means that jobs matching the OHPPs were found in Metal Stamping. The "solution" also applies because, under these circumstances, the workers are not contributing to work-related absenteeism.
Figure 16 shows the pie chart for the Paint category. Workers' health protection related to the trunk at the first level (22%) and the right wrist at the second level (14%). Considering the Quality Assurance data analysis using the plot, most of the health protection related to the left elbow, right elbow, right shoulder, and left shoulder at the first level, with 17%. The analysis is especially useful for supporting a body part FWA loss model by determining how varied job demands influence changes in workers' work ability and for regulating which working conditions should be improved. Across the different categorizations of the data, the highest percentage is concentrated in the trunk, while the second level is concentrated in the right shoulder.
XAI methods are broadly used in many applications associated with healthcare, especially pain modeling [52]. The utilization of AI in healthcare has considerably reduced the burden on the biomedical system. For example, Hines et al. explored the issue of persistent pain experienced after receiving a total knee replacement; ML methods such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and discriminant analysis were combined with the LIME technique to classify the varying levels of damage present via signal processing [53]. Recent research also shows that investment in AI in healthcare has increased tremendously in recent years, with the prediction of readmission risk among the frail being assisted by AI systems that detect and assess the health of patients instantly through feature analysis with SHAP [54].
Musculoskeletal symptoms vary with time and cannot be evaluated using a single snapshot of a worker. Different research has been conducted with HCXAI in prevention and has shown that musculoskeletal symptoms are among the most difficult to predict [36]. Little research has discussed them through the history of medical appointments, especially in the automotive industry. However, as discussed above, we achieved certain progress in prognosis, not only of the next medical appointment but also of the next body part severity. Incorporating XAI models makes it possible to interpret the features that explain work-related absenteeism and to highlight their importance in prognosis.