Article

Multiclass Classification of Sarcopenia Severity in Korean Adults Using Machine Learning and Model Fusion Approaches

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea
2 Department of Exercise Rehabilitation, Gachon University, Incheon 21936, Republic of Korea
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(18), 2907; https://doi.org/10.3390/math13182907
Submission received: 8 May 2025 / Revised: 18 August 2025 / Accepted: 25 August 2025 / Published: 9 September 2025

Abstract

This study presents a unified machine learning strategy for identifying various degrees of sarcopenia severity in older adults. The approach combines three optimized algorithms (Random Forest, Gradient Boosting, and Multilayer Perceptron) into a stacked ensemble model, which is assessed with clinical data. A thorough data preparation process involved synthetic minority oversampling to ensure class balance and a dual approach to feature selection using Least Absolute Shrinkage and Selection Operator regression and Random Forest importance. The integrated model achieved remarkable performance with an accuracy of 96.99%, an F1 score of 0.9449, and a Cohen’s Kappa coefficient of 0.9738 while also demonstrating excellent calibration (Brier Score: 0.0125). Interpretability analysis through SHapley Additive exPlanations values identified appendicular skeletal muscle mass, body weight, and functional performance metrics as the most significant predictors, enhancing clinical relevance. The ensemble approach showed superior generalization across all sarcopenia classes compared to individual models. Although limited by dataset representativeness and the use of conventional multiclass classification techniques, the framework shows considerable promise for non-invasive sarcopenia risk assessments and exemplifies the value of interpretable artificial intelligence in geriatric healthcare.

1. Introduction

Sarcopenia, defined by the gradual decline in skeletal muscle mass and strength, presents a considerable risk to the well-being and autonomy of elderly populations globally. Its early detection and severity assessment are critical for implementing timely interventions to prevent adverse outcomes such as frailty, falls, and hospitalization [1,2]. Traditionally, sarcopenia diagnosis has relied on manual clinical assessments and imaging techniques, which, while effective, are often resource-intensive and limited by inter-observer variability [3]. Contemporary advancements in computational learning methods have enabled innovative approaches for automated sarcopenia screening and prognostic analysis, offering scalable, data-driven solutions to support clinical decision-making [3,4].
Despite these promising developments, most existing studies have focused narrowly on binary classification—distinguishing sarcopenic from non-sarcopenic individuals—without adequately addressing the full spectrum of sarcopenia severity [5,6]. Furthermore, while various machine learning models have been proposed, few have employed model fusion strategies or integrated explainable AI (XAI) techniques, such as SHapley Additive exPlanations (SHAP) analysis, to enhance model transparency and clinical interpretability [7,8]. Consequently, critical gaps remain in developing robust, interpretable, and generalizable models capable of stratifying sarcopenia into multiple severity levels based on real-world clinical features.
To address these limitations, the present study proposes an ensemble-based model fusion framework that integrates the Random Forest, Gradient Boosting, and Multilayer Perceptron (MLP) classifiers, which are optimized through hyperparameter tuning. The model is designed to predict sarcopenia severity across four ordinal classes using a rich set of clinical, anthropometric, and functional assessment features. Additionally, SHAP analysis is incorporated to identify and visualize the most influential predictors, thereby enhancing model explainability and clinical relevance. Through comprehensive performance evaluation—including accuracy, Kappa statistics, and the Brier Score—this study aims to provide a robust, interpretable, and clinically useful tool for sarcopenia risk stratification. Contributions of this study are as follows:
  • Development of a multiclass classification model to predict sarcopenia severity levels (normal, possible, sarcopenia, and severe) rather than just a binary diagnosis, addressing a critical gap in existing machine learning studies.
  • Design and evaluation of a stacked ensemble classifier that integrates the individually optimized Random Forest, Gradient Boosting, and Multilayer Perceptron models, achieving superior performance across all key metrics.
  • Application of dual-path feature selection combining the Least Absolute Shrinkage and Selection Operator (LASSO) for sparse linear selection with Random Forest importance for nonlinear relationships, enhancing both prediction accuracy and model robustness.
  • Integration of SHAP explainability analysis, providing transparent interpretation of feature contributions and ensuring clinical interpretability of the model’s predictions.
  • Comprehensive performance evaluation using accuracy, the macro F1 Score, Cohen’s Kappa, AUROC, and the Brier Score across seven models, demonstrating the proposed model’s superiority in both generalization and calibration.
  • Construction of a reproducible pipeline that combines data balancing with the Synthetic Minority Over-sampling Technique (SMOTE), feature engineering, hyperparameter optimization, and post hoc explanation, facilitating practical application in clinical decision support systems.

2. Related Works

The growing prevalence of sarcopenia among aging populations has driven significant research into early detection and severity assessments using machine learning (ML) techniques. This section integrates recent findings, focusing on sarcopenia severity classification, ensemble learning, feature engineering, data balancing, and model interpretability.

2.1. Sarcopenia Assessment and Severity Prediction

Recent studies have transitioned from binary classification to severity-aware models using ML. For instance, IMU-based sit-to-stand motion metrics were used to develop models, achieving up to 90.4% multiclass accuracy using SVMs and KNNs [9]. Similarly, EHR-based models achieved AUCs above 91% using logistic regression and MLP for multi-level sarcopenia detection [10]. Feature-rich demographic and functional datasets have been used to build web-based risk calculators with robust performance [11]. Deep learning methods using physical fitness scores also yielded over 87% accuracy [12].

2.2. Feature Selection and Interpretability in Clinical ML

Explainability is central to clinical trust in ML. SHAP analysis identified socioeconomic and lifestyle factors as key predictors in KNHANES datasets [13]. Radiomics-based CT image studies employed automated segmentation and radiomic feature extraction for robust sarcopenia classification [14]. Feature selection techniques such as LASSO and ensemble-based methods remain essential [15,16,17].

2.3. Ordinal and Multiclass Classification Approaches

The need to respect the ordered nature of sarcopenia stages has driven the adoption of ordinal classification methods. Severity classification frameworks with risk staging (0–3) showed high accuracy and improved clinical utility [18]. Techniques such as ordinal regression, ensemble ordinal classifiers, and LSTM architectures have demonstrated strong performance [19,20].

2.4. Ensemble and Fusion Learning for Robust Prediction

Combining diverse learners through stacking or voting has proven effective in medical prediction tasks. A study using the RF, SVM, and NN in ensemble fashion achieved 88.5% accuracy, addressing class imbalance using adaptive synthetic sampling [21]. Stacked architectures consistently outperform standalone models across various domains [22,23].

2.5. Data Balancing in Imbalanced Medical Data

Imbalanced class distributions, particularly in severity stages, challenge classification reliability. SMOTE and its variants continue to be effective, particularly when paired with ensemble models [24,25]. These approaches enhance recall for rare classes without significantly sacrificing specificity [26].

2.6. Emerging Modalities and Future Directions

Novel modalities such as oculomics and wearable EMG-based sensors have shown promise for passive, real-time sarcopenia monitoring [27]. Quantum machine learning models have also emerged with slight accuracy gains and computational benefits over classical counterparts [28]. Cross-regional studies emphasize the importance of clinical biomarkers like vitamin D, C-reactive protein (CRP), and folate [29].
Table 1 summarizes recent studies by input modality, model type, prediction scope, and performance metrics.

2.7. Advancing Healthcare Diagnostics with Interpretable Ensemble Deep Learning Models

Recent advancements in healthcare classification tasks have increasingly leveraged ensemble deep learning models to enhance both predictive accuracy and interpretability. For example, ensemble frameworks combining multiple convolutional neural networks (CNNs) such as VGG16, InceptionV3, and EfficientNet have significantly outperformed single-model approaches in diagnosing conditions like breast cancer and lung diseases [33,34]. In the classification of chronic kidney disease, hybrid feature selection techniques integrated with ensemble learning have reduced noise and improved model precision [35]. Moreover, explainability methods such as LIME and Grad-CAM have been effectively incorporated into ensemble systems for leukemia and white blood cell classification, aiding clinical decision-making by visualizing model rationale [36,37]. Similar improvements have been demonstrated in dermatology, where ensemble models achieved high accuracy and AUC scores in classifying skin lesions [38]. These findings collectively highlight the growing value of ensemble deep learning approaches, especially when augmented with interpretability tools, for advancing reliable and explainable healthcare diagnostics.

2.8. Description of Interpretability and Evaluation Metrics

SHAP is a model-agnostic interpretability method derived from cooperative game theory, which attributes to each feature a fair contribution to the prediction based on Shapley values. It has been recently extended to address limitations in complex and high-dimensional domains, enabling both local and global interpretability in diverse applications [39,40]. Cohen’s Kappa coefficient is a statistical measure of inter-rater agreement that corrects for agreement expected by chance, with weighted variants enhancing performance for ordinal or imbalanced data; recent studies have demonstrated its continued relevance in medical and diagnostic reliability contexts [41,42]. The Brier score is a strictly proper scoring rule for evaluating the accuracy and calibration of probabilistic predictions by measuring the mean squared difference between predicted probabilities and actual binary outcomes. Recent developments, such as the Penalized Brier Score, address the metric’s sensitivity to overconfident but incorrect predictions by incorporating additional penalties, thereby improving model comparisons and calibration assessments [43].

3. Materials and Methods

The overall architecture of the proposed model fusion classifier is illustrated in Figure 1. The workflow begins with data collection, followed by a comprehensive preprocessing stage involving robust scaling to standardize feature distributions and SMOTE to address class imbalance issues. Subsequently, feature selection is conducted using LASSO and Random Forest algorithms to extract the most informative attributes. Selected features are then subjected to model fusion, where multiple base learners—Random Forests, Gradient Boosting, and MLPs—are individually optimized through hyperparameter optimization (HPO). Model evaluation ensures performance assessment at each step, while SHAP analysis is integrated to enhance the interpretability of the selected features and model predictions. This structured and iterative process ultimately leads to the development of a robust and interpretable fusion-based classification system (Algorithm 1).
Algorithm 1: Model fusion algorithm (presented as a pseudocode figure in the original article).

3.1. Study Design and Data Description

The dataset utilized in this study was sourced from the Institute of Human Convergence Health Science at Gachon University. Data collection took place over nine months, from 1 September 2022 to 31 May 2023, across multiple community settings in Incheon, including social welfare centers, daycare facilities, and senior welfare centers. This extended timeframe facilitated the collection of a comprehensive and diverse sample from various locations, thereby enhancing the representativeness of the target population. The collected data was subsequently divided into training and testing sets using an 80/20 split.
The present study aimed to predict the severity of sarcopenia using a comprehensive clinical dataset comprising 664 individuals and 97 recorded characteristics. These features included demographic variables, anthropometric measurements, physical performance assessments, and biochemical markers. The target variable, sarcopenia severity, was classified on an ordinal scale (0–3) representing the normal, possible sarcopenia, sarcopenia, and severe sarcopenia categories (Table 2).
Sarcopenia is classified into four levels based on the European Working Group on Sarcopenia in Older People, version 2 guidelines. Normal indicates adequate muscle strength, mass, and performance. Possible sarcopenia is identified by either low muscle strength or poor physical performance. A sarcopenia diagnosis requires reduced muscle mass coupled with either diminished strength or functional performance. The classification of severe sarcopenia occurs when all three criteria—decreased muscle mass, impaired strength, and compromised performance—are simultaneously present [1].
Recent diagnostic guidelines for sarcopenia stratification have established a four-tier classification system, as delineated in Table 2. This hierarchical framework progresses from normal muscular health (Level 0) through possible sarcopenia (Level 1) and confirmed sarcopenia (Level 2) to severe sarcopenia (Level 3), with each successive stage characterized by increasingly stringent diagnostic criteria incorporating appendicular skeletal muscle mass measurements, muscular force capacity assessments, and functional performance metrics. The multiparametric approach reflected in Table 2 facilitates precise clinical categorization and enables targeted intervention strategies appropriate to disease severity.
$$\text{Sarcopenia} = \begin{cases} 0, & \text{Normal: all assessments within the normal range} \\ 1, & \text{Possible: low strength or poor performance (no ASM* yet)} \\ 2, & \text{Sarcopenia: low ASM with low strength or performance} \\ 3, & \text{Severe: low ASM, low strength, and poor performance} \end{cases}$$
ASM*—appendicular skeletal muscle mass.
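For illustration, this staging rule can be written as a small decision function. The sketch below is a minimal rendering of the EWGSOP2-style logic described above, not the authors' implementation; the function name and boolean inputs are hypothetical.

```python
def sarcopenia_stage(low_asm: bool, low_strength: bool, poor_performance: bool) -> int:
    """Map the three EWGSOP2 criteria to the ordinal severity label used in this study.

    Returns 0 (normal), 1 (possible), 2 (sarcopenia), or 3 (severe).
    """
    if low_asm and low_strength and poor_performance:
        return 3  # severe: all three criteria present
    if low_asm and (low_strength or poor_performance):
        return 2  # confirmed sarcopenia: low ASM plus one functional deficit
    if low_strength or poor_performance:
        return 1  # possible sarcopenia: functional deficit without confirmed low ASM
    return 0      # normal: all assessments within the normal range

# Example: low ASM with low grip strength but preserved gait speed -> class 2
print(sarcopenia_stage(low_asm=True, low_strength=True, poor_performance=False))
```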

3.2. Data Preprocessing

Data preprocessing was systematically carried out in several stages. Initially, missing data were addressed using Multiple Imputation by Chained Equations (MICE):
$$X_{\text{miss}}^{(t+1)} = f\left(X_{\text{obs}},\, X_{\text{miss}}^{(t)},\, \theta\right) + \varepsilon$$
where $X_{\text{miss}}$ and $X_{\text{obs}}$ denote the missing and observed data, respectively; $\theta$ represents the model parameters; and $\varepsilon$ denotes the random error term.
The continuous variables were then normalized using robust scaling.
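As an illustration of this preprocessing stage, the sketch below uses scikit-learn's IterativeImputer (a chained-equations imputer in the spirit of MICE) together with RobustScaler. The variable names and synthetic data are placeholders, and the exact MICE implementation used by the authors is not specified.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler

# X_train / X_test: numeric feature matrices with missing values (illustrative stand-ins)
rng = np.random.RandomState(42)
X_train = rng.normal(size=(100, 5))
X_train[rng.rand(100, 5) < 0.1] = np.nan
X_test = rng.normal(size=(20, 5))

# Chained-equations imputation: each feature is iteratively regressed on the others
imputer = IterativeImputer(max_iter=10, random_state=42)
X_train_imp = imputer.fit_transform(X_train)   # fit on training data only to avoid leakage
X_test_imp = imputer.transform(X_test)

# Robust scaling: centre on the median and scale by the IQR, limiting outlier influence
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)
```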

3.3. Feature Engineering

The feature selection strategy integrated clinical expertise and statistical methods. Specifically, LASSO logistic regression was employed.
$$\min_{\beta} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
where $N$ is the number of observations, $y_i$ is the observed class label, $p_i$ is the predicted probability, $\beta$ denotes the regression coefficients, and $\lambda$ is the regularization parameter.
Additionally, feature importance was assessed using Random Forest importance scores, which were calculated as
$$VI(X_j) = \frac{1}{n_{\text{tree}}} \sum_{t=1}^{n_{\text{tree}}} VI_t(X_j)$$
where $VI_t(X_j)$ is the importance of feature $X_j$ in tree $t$.
The features selected from both methods were combined, creating a comprehensive and robust feature set for further modeling.
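The dual-path selection can be sketched as follows, assuming a preprocessed feature matrix. The selection thresholds (non-zero L1 coefficients, above-median impurity importance) and the synthetic data are illustrative choices, not the authors' exact settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled clinical feature matrix (illustrative only)
X, y = make_classification(n_samples=500, n_features=40, n_informative=10, random_state=0)

# Path 1: L1-penalised (LASSO) logistic regression keeps features with non-zero coefficients
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000), threshold=1e-5
).fit(X, y)
lasso_idx = set(np.where(lasso.get_support())[0])

# Path 2: Random Forest mean-decrease-in-impurity importance, keeping above-median features
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_idx = set(np.where(rf.feature_importances_ > np.median(rf.feature_importances_))[0])

# Union of both paths, combining linear (LASSO) and nonlinear (RF) evidence
selected = sorted(lasso_idx | rf_idx)
print(f"{len(selected)} features selected:", selected)
```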

3.4. Model Development

An advanced ordinal classification approach utilizing a stacking ensemble framework was developed, integrating a Random Forest (RF), a Gradient Boosting classifier, and a Multilayer Perceptron (MLP). These models were selected for their complementary strengths.
The stacking ensemble model followed a structured methodology:
  • Data was partitioned into training and testing subsets, which were stratified to preserve class distributions.
  • SMOTE was employed, with
    $$X_{\text{new}} = X_i + \lambda \times (X_{zi} - X_i)$$
    where $X_{zi}$ is randomly chosen from the k-nearest neighbors of $X_i$ and $\lambda \in [0, 1]$.
  • Robust scaling was applied as described earlier.
  • Hyperparameter optimization for each model was meticulously conducted using the GridSearchCV and RandomizedSearchCV methods.
  • Stratified K-Fold cross-validation was applied, with
    $$CV_{\text{error}} = \frac{1}{K} \sum_{k=1}^{K} \text{Error}\left(f^{(-k)}, D_k\right)$$
    where $f^{(-k)}$ denotes the model trained excluding fold $k$ and $D_k$ is the validation fold.
  • Predictions from the base learners were combined using a logistic regression meta-learner (a minimal sketch of this pipeline follows the list), where
    $$p(y = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j} \beta_j x_j\right)}}$$
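The sketch below assembles this pipeline with scikit-learn and imbalanced-learn. The synthetic data, hyperparameter grid, and base-learner settings are illustrative (they loosely echo the settings reported in Section 4) and are not a verbatim reproduction of the authors' code.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import RobustScaler

# Synthetic four-class data as a stand-in for the clinical dataset (illustrative only)
X, y = make_classification(n_samples=600, n_features=31, n_informative=12, n_classes=4,
                           n_clusters_per_class=1, weights=[0.5, 0.25, 0.15, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stacking ensemble: base learners feed class probabilities to a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("gb", GradientBoostingClassifier(learning_rate=0.05, max_depth=3, n_estimators=200, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(100,), activation="tanh", max_iter=1000, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=cv,
)

# SMOTE and robust scaling run inside the pipeline, so resampling only ever touches training folds
pipe = Pipeline([("smote", SMOTE(random_state=0)), ("scale", RobustScaler()), ("stack", stack)])

# Example hyperparameter search over one base learner with stratified cross-validation
grid = GridSearchCV(pipe, {"stack__gb__n_estimators": [100, 200]}, cv=cv, scoring="f1_macro")
grid.fit(X_tr, y_tr)
print("Held-out macro F1:", grid.score(X_te, y_te))
```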

3.5. Model Interpretability

Interpretability was enhanced using SHAP, which was calculated as
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]$$
where $\phi_i$ is the SHAP value for feature $i$, $N$ is the set of all features, and $f_x(S)$ is the model's prediction using only the features in subset $S$.
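As a sketch of how such values can be obtained for a model-agnostic pipeline, the snippet below uses the shap library's KernelExplainer on a fitted stacking model. Here `fitted_stack`, `X_test_scaled`, and `feature_names` are assumed placeholder names, and the authors' exact SHAP configuration is not specified.

```python
import shap

# `fitted_stack`, `X_test_scaled`, and `feature_names` are assumed placeholders for a trained
# stacking pipeline, the preprocessed test matrix, and the selected feature labels.
background = shap.sample(X_test_scaled, 50, random_state=0)  # small background set keeps the kernel method tractable
explainer = shap.KernelExplainer(fitted_stack.predict_proba, background)

sample = X_test_scaled[:100]
# Depending on the shap version, this returns a list of per-class arrays or a 3-D array
# of per-feature attributions for each class.
shap_values = explainer.shap_values(sample)

# Global view: mean |SHAP| per feature, mirroring the kind of summary shown in Figure 6
shap.summary_plot(shap_values, sample, feature_names=feature_names, plot_type="bar")
```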

3.6. Validation and Performance Metrics

Validation employed nested cross-validation, utilizing accuracy, Cohen’s Weighted Kappa, AUROC, and the Brier score.
Accuracy for binary classification was calculated as $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. Accuracy for multiclass classification was calculated as $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{\sum_{i=1}^{C} TP_i}{N}$, where $C$ is the number of classes and $N$ is the total number of predictions.
Cohen’s Weighted Kappa was calculated as $\kappa = \frac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed agreement and $P_e$ is the expected agreement by chance.
AUROC measures the model’s ability to distinguish between different classes by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate at various classification thresholds. The AUROC score ranges from 0 to 1, where a value of 1.0 indicates perfect class separation and 0.5 suggests no better than random guessing. This metric is particularly valuable for imbalanced classification tasks, as it reflects the model’s ranking quality rather than its accuracy.
The Brier Score quantifies the accuracy of probabilistic predictions by measuring the mean squared difference between the predicted probabilities and the actual binary outcomes. It is computed as
$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{p}_i - y_i\right)^2$$
where $\hat{p}_i$ denotes the predicted probability of the positive class for the $i$-th instance and $y_i \in \{0, 1\}$ is the corresponding true label. The score ranges from 0 to 1, with lower values indicating better-calibrated and more accurate probability estimates.

Precision, Recall, and F1 Scores

To evaluate the classification performance beyond accuracy, especially in imbalanced datasets, we also report Precision, Recall, and F1 Scores.
  • Precision (also called the positive predictive value) measures the proportion of correctly predicted positive instances among all instances predicted as positive, with
    $$\text{Precision} = \frac{TP}{TP + FP}$$
    where $TP$ is the number of true positives and $FP$ is the number of false positives.
  • Recall (also known as sensitivity or the true positive rate) measures the proportion of correctly predicted positive instances out of all actual positive instances, with
    $$\text{Recall} = \frac{TP}{TP + FN}$$
    where $FN$ is the number of false negatives.
  • The F1 Score is the harmonic mean of Precision and Recall, providing a balance between the two:
    $$F_1\ \text{Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
    The F1 Score is particularly useful when the dataset is imbalanced, as it considers both false positives and false negatives. A brief computation sketch for these metrics follows below.
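The helper below computes these metrics with scikit-learn; it is an illustrative sketch. The multiclass Brier score uses a common one-hot generalization of the binary formula above, since the exact aggregation used in the study is not specified.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.preprocessing import label_binarize

def evaluate_multiclass(y_true, y_pred, y_proba, classes=(0, 1, 2, 3)):
    """Illustrative helper computing the evaluation metrics reported in this study."""
    y_onehot = label_binarize(y_true, classes=list(classes))
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        "weighted_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "auroc_ovr_macro": roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"),
        # One-hot generalization of the Brier score: mean squared difference between the
        # predicted class probabilities and the one-hot encoded true labels.
        "brier": float(np.mean(np.sum((y_proba - y_onehot) ** 2, axis=1))),
    }
```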

4. Results

4.1. Feature Selection

To enhance model performance and mitigate overfitting, a rigorous feature selection strategy was implemented. Two complementary techniques were utilized: the LASSO and Random Forest feature importance analysis. The LASSO, a regularization-based method, was employed to select features with non-zero coefficients, promoting sparsity and addressing potential multicollinearity among predictors. In parallel, the Random Forest was applied to rank features based on their mean decrease in impurity, effectively capturing complex nonlinear relationships.
Following independent application, the features identified by the LASSO and Random Forest were combined to construct a consolidated subset, thereby leveraging both linear and nonlinear dependencies present in the data. This integrative approach enhanced the robustness and interpretability of the model.
As a result, the feature space was substantially reduced from the 96 candidate predictors (the 97 recorded variables minus the target label) to 31 carefully selected features (Table 3). This reduction in dimensionality improved the model's generalizability, computational efficiency, and predictive accuracy.

4.2. Binary Classification

The best hyperparameters for each model were determined through systematic optimization procedures and are summarized in Table 4. Gradient Boosting achieved optimal performance with a learning rate of 0.05, a maximum depth of three, and 200 estimators, while the Random Forest required deeper trees and more estimators. Simpler models such as the Decision Tree and SVM benefited from minimal parameter tuning. For the MLP, the use of a tanh activation function and a relatively high learning rate contributed to improved convergence. The stacked model combined these individually optimized base learners to enhance generalization performance.
This binary classification task involves two classes, 0 for no sarcopenia and 1 for sarcopenia, to accurately identify individuals at risk of sarcopenia.
The performance comparison of the evaluated models is presented in Table 5. Among all classifiers, Gradient Boosting achieved the highest accuracy (97.74%) and F1 Score (0.97), indicating superior predictive performance. The Random Forest also exhibited strong results with an accuracy of 93.98% and an F1 Score of 0.91. The stacked ensemble model demonstrated robust generalization with an accuracy of 96.24%, a Precision of 0.98, and an F1 Score of 0.94 while maintaining a high Kappa (0.8790) and near-perfect AUROC (0.9969). Although models like the SVM and MLP delivered acceptable performance, they exhibited lower Recall and F1 Scores compared to ensemble approaches. These results confirm that model fusion substantially enhances classification robustness and generalization over individual models (Figure 2).
Figure 3 presents the confusion matrices of all evaluated models, providing a detailed view of classification outcomes for both classes. The stacked model achieved a perfect classification of the negative class (105 true negatives) and a relatively high true positive count (23), misclassifying only 5 positive instances. Gradient Boosting exhibited the most balanced confusion matrix with minimal misclassifications, reflecting its superior overall performance. The Random Forest and AdaBoost also maintained strong classification ability, though they showed similar false positive and false negative patterns. In contrast, the SVM and MLP models had slightly higher misclassification rates, especially among positive instances, indicating their limited ability to capture the minority class distribution. The Decision Tree model, while simple, delivered competitive performance, correctly identifying 23 positive cases but misclassifying 6 negatives. These visual insights reinforce the effectiveness of ensemble methods, particularly the Gradient Boosting and stacked models, in achieving accurate and stable binary classification performance.
To better evaluate model performance beyond standard classification metrics, Cohen’s Kappa and the Brier Score were analyzed for each model, as shown in Table 6. Cohen’s Kappa measures the agreement between predicted and true class labels, adjusted for chance agreement, and is particularly informative in imbalanced classification scenarios. A Kappa value closer to one indicates strong agreement. In this study, Gradient Boosting achieved the highest Kappa (0.9330), reflecting its excellent reliability. The Brier Score, on the other hand, quantifies the accuracy of probabilistic predictions by computing the mean squared difference between predicted probabilities and actual outcomes. A lower Brier Score indicates better-calibrated predictions. Again, Gradient Boosting and the stacked model exhibited the lowest Brier Scores (0.0104 and 0.0243, respectively), confirming their superiority in both classification confidence and probabilistic reliability.

4.3. Multiclass Classification

All components—including the three base learners (the Random Forest, Gradient Boosting, and the Multilayer Perceptron) and the meta-learner (logistic regression)—are explicitly trained as native four-class classifiers, rather than through iterative one-vs-rest binary classification strategies.
Base Learners: Each base model leverages its inherent multiclass capabilities. For instance, RandomForestClassifier, GradientBoostingClassifier, and MLPClassifier natively support multiclass classification. The complete four-dimensional probability distribution for each sample was extracted from each model.
Meta-Learner: The probability vectors from the base learners (3 models × 4 class probabilities = 12-dimensional feature vector) are concatenated to construct the input features for the meta-learner. A LogisticRegression model configured with multi_class = ‘auto’ (corresponding to a multinomial setting) is then trained on the concatenated probability features. At inference time, the meta-learner outputs a final 4-class probability distribution.
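A minimal sketch of this meta-feature construction is given below. Here `rf`, `gb`, `mlp`, and the data splits are assumed to be fitted objects and arrays from earlier steps (hypothetical names); in practice the meta-learner should be trained on out-of-fold base-model probabilities to avoid leakage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `rf`, `gb`, and `mlp` are assumed fitted base classifiers, and X_meta_train / y_meta_train /
# X_test the corresponding data splits (hypothetical names).
def meta_features(models, X):
    """Concatenate per-class probabilities from each base learner (3 models x 4 classes = 12 columns)."""
    return np.hstack([m.predict_proba(X) for m in models])

base_models = [rf, gb, mlp]
Z_train = meta_features(base_models, X_meta_train)  # ideally out-of-fold probabilities to avoid leakage
Z_test = meta_features(base_models, X_test)

# Multinomial logistic-regression meta-learner over the stacked 12-dimensional probability vectors
meta = LogisticRegression(max_iter=1000).fit(Z_train, y_meta_train)
final_proba = meta.predict_proba(Z_test)   # final 4-class probability distribution
final_pred = final_proba.argmax(axis=1)
```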
The best hyperparameters for each model in the multiclass classification task were identified through extensive grid and randomized search strategies, as summarized in Table 7. The stacked model was constructed by combining the individually optimized Random Forest, Gradient Boosting, and MLP classifiers. The Random Forest achieved optimal performance with 300 estimators and minimal constraints on depth and splitting. Gradient Boosting benefited from a moderate learning rate (0.05), log2 feature selection, and a subsample ratio of 0.8, while the MLP model performed best with a single hidden layer and an initial learning rate of 0.01. Other models, including the SVM, Decision Tree, and AdaBoost, were fine-tuned using their respective hyperparameters to improve generalization.
Table 8 summarizes the performance of all models across four key evaluation metrics. The stacked model achieved the highest performance overall, with an accuracy of 96.99%, an F1 Score of 0.9449, and a strong balance in both Precision and Recall. Gradient Boosting closely followed with a well-balanced profile and an F1 Score of 0.9320. The Random Forest also showed strong predictive power (F1 Score: 0.8977), while AdaBoost and the Decision Tree achieved comparable results, around 87.9% accuracy. On the other hand, the SVM and MLP performed less consistently, particularly in terms of Recall and the F1 Score, suggesting limited robustness in handling class diversity. These results underscore the effectiveness of ensemble learning, particularly the fusion of optimized base models in the stacked configuration (Figure 4).
Figure 5 presents the confusion matrices of all evaluated models in the multiclass classification task. The stacked model demonstrated near-perfect classification, particularly for classes 0 and 1, with only minimal misclassifications observed in classes 2 and 3. Gradient Boosting and the Random Forest also exhibited strong performance, though occasional confusion was noted between closely related classes. The AdaBoost and Decision Tree models showed more frequent misclassifications, especially between classes 0 and 1, as well as classes 2 and 3. The SVM and MLP produced comparatively higher error rates, with more dispersed misclassifications across all classes, indicating challenges in maintaining class separation. These visual results further emphasize the superior generalization ability of ensemble-based models, particularly the stacked and Gradient Boosting configurations.
Table 9 presents the Cohen’s Kappa and Brier Score values for all models evaluated in the multiclass classification task. The stacked model achieved the highest Kappa value (0.9738), indicating near-perfect agreement between the predicted and true classes, while also maintaining the lowest Brier Score (0.0125), reflecting excellent probability calibration. Gradient Boosting followed closely with a Kappa of 0.9380 and a low Brier Score of 0.0256. The Random Forest also demonstrated strong performance, though with slightly higher calibration error. In contrast, the SVM and MLP exhibited lower Kappa values and higher Brier Scores, indicating less stable classification and less reliable probabilistic predictions. These results confirm that ensemble models, particularly the stacked model, deliver superior consistency and calibration in multiclass classification.
To evaluate the contribution of each base learner in the stacking ensemble, we conducted an ablation study by removing one base model at a time and measuring the resulting performance. As shown in Table 10, the complete model (RF + GB + MLP) achieved the highest performance across all metrics, with an accuracy of 0.9699, a macro F1 Score of 0.9449, and a Cohen’s kappa of 0.9738.
Among the pairwise combinations, the GB + MLP configuration yielded the best results, indicating that the Gradient Boosting and MLP models are individually strong and complementary. The performance drop observed in the RF + MLP (macro F1: 0.9240) and RF + GB (macro F1: 0.9320) settings suggests that while each model contributes positively, the synergy of all three base learners is essential for achieving optimal classification performance. This validates the effectiveness of our stacking ensemble strategy and justifies the inclusion of all three models.

4.4. Overfitting

The model exhibited no evidence of overfitting, as demonstrated by the close alignment between 5-fold cross-validation accuracy (0.9905 ± 0.0062) and test accuracy (0.9699). The minimal performance gap (approximately 2%) indicates strong generalization capability to unseen data. Furthermore, the high test macro F1 score (0.9337) and quadratic weighted kappa (0.9589) support the model’s balanced performance across classes and its reliability in predicting the ordinal severity of sarcopenia. These results collectively confirm that the model maintains high predictive accuracy without overfitting.

4.5. Model Interpretability

The SHAP summary plot for the stacked model, as presented in Figure 6, illustrates the impact of each feature on model predictions for multiclass sarcopenia classification. The top three contributors—ASM, SS_SPPB, and weight_kg—showed the highest average influence, with a wide distribution of SHAP values indicating varied effects across patients. Features such as HG_R_M, BMI_kgm2, and D_HG also played significant roles. The color gradient represents feature values, showing how high and low values push predictions toward different sarcopenia classes. These insights underscore the clinical relevance of muscle mass, body composition, and performance measures in model decision-making.

5. Discussion

This study developed a robust, interpretable model fusion approach to predict sarcopenia severity using multiclass classification. Among the evaluated models, the stacked ensemble model integrating a Random Forest, Gradient Boosting, and an MLP outperformed all others with an accuracy of 96.99%, a macro F1 score of 0.9449, and the highest Cohen’s Kappa (0.9738), demonstrating strong agreement with clinical labels. SHAP analysis highlighted ASM, weight_kg, and SS_SPPB as consistently influential features, supporting the model’s clinical validity. The proposed model generalized well across all four severity classes without overfitting and maintained excellent probabilistic calibration, as evidenced by its lowest Brier Score (0.0125).
Despite these promising results, this study has several limitations. First, the dataset was derived from a single population—older Korean adults—which may limit generalizability to other ethnic or age groups. Future validation using multi-center, multi-ethnic cohorts is warranted. Second, while ordinal classification was appropriate for sarcopenia staging, this work employed standard multiclass classifiers rather than dedicated ordinal algorithms. Although the model performed well, specialized ordinal approaches could further enhance performance. Third, although SHAP was used for interpretability, it does not capture temporal dependencies or causal relationships, which could be critical in longitudinal sarcopenia progression analysis.
Despite these limitations, this study presents several novel contributions. To our knowledge, this is the first study to integrate a stacking ensemble with comprehensive SHAP-based interpretability for sarcopenia severity prediction using a clinically rich feature set. The dual-path feature selection (LASSO + Random Forest) provided a balanced blend of linear and nonlinear insights. The approach demonstrated resilience to imbalanced data through SMOTE, while careful hyperparameter tuning and nested cross-validation ensured optimal generalization. Together, these methodological innovations fill key gaps in the existing literature, particularly in moving beyond binary sarcopenia detection toward ordinal severity classification.
While individual components of our methodology—ensemble learning and SHAP-based explanations—have appeared separately in prior classification studies, our work innovatively synthesizes these approaches within the novel context of multiclass sarcopenia severity staging. Specifically, our methodological advancements include applying an ensemble model to a clinically-defined four-tier severity scale for the first time, integrating dual-path feature selection methods (LASSO regression and Random Forest importance) to capture diverse feature relationships, and rigorously validating model calibration and reliability through metrics like Cohen’s Kappa and Brier Score. Together, these contributions significantly enhance both the clinical relevance and methodological rigor of the predictive framework, providing a robust and interpretable tool tailored explicitly for geriatric healthcare applications.
We recognize that our initial analysis did not include direct comparisons with certain advanced state-of-the-art models, such as the widely used ensemble method XGBoost (version 3.0.0). To address this limitation and validate our approach comprehensively, we incorporated XGBoost as an additional benchmark. XGBoost achieved an accuracy of 96.24%, a macro F1 score of 0.9545, and a Cohen’s Kappa of 0.9697, demonstrating competitive performance. Nonetheless, our proposed stacked ensemble model outperformed XGBoost in calibration and agreement, maintaining its position as the most robust and interpretable tool for multiclass sarcopenia severity classification in clinical settings.
These findings have important implications. Clinically, the proposed model offers a non-invasive, data-driven tool to assist in early risk stratification and personalized intervention planning for older adults at various stages of sarcopenia. For researchers, the work highlights the importance of ensemble strategies, robust feature selection, and interpretability in medical AI. Future research should explore real-time implementation, longitudinal validation, and the integration of temporal data (e.g., from wearable sensors) to support dynamic sarcopenia monitoring. Additionally, extending the framework to other progressive musculoskeletal conditions could broaden its impact in geriatric care.

6. Conclusions

This study proposed an effective ensemble learning approach for multiclass classification of sarcopenia severity in older adults, leveraging stacked model fusion with the individually optimized Random Forest, Gradient Boosting, and MLP classifiers. The stacked model demonstrated superior performance across all evaluated metrics, achieving an accuracy of 96.99%, a macro F1 score of 0.9449, and a Cohen’s Kappa of 0.9738, while maintaining strong calibration, as indicated by a low Brier Score. SHAP analysis further confirmed the clinical relevance of key features such as ASM, body weight, and physical performance measures, enhancing model transparency and interpretability.
Importantly, this work moves beyond traditional binary sarcopenia classification by addressing the full spectrum of severity, offering a more nuanced and clinically useful prediction framework. Although limitations related to data generalizability and the absence of ordinal-specific modeling were acknowledged, this study’s strengths—including robust feature selection, model optimization, and explainable AI integration—underscore its contribution to advancing personalized sarcopenia management.
Future work should aim to validate the proposed model across diverse populations, incorporate longitudinal data to enable prediction of disease progression, and develop dynamic updating mechanisms to facilitate real-time sarcopenia monitoring. Furthermore, as AI technologies continue to evolve, integrating the predictive framework with large language models—for example, for clinical note interpretation [44] and agentic artificial intelligence systems [45]—presents a promising direction. Such integration could enhance real-time clinical decision-making, improve patient stratification, and support adaptive health monitoring, particularly within geriatric and musculoskeletal care domains.

Author Contributions

Conceptualization, D.T. and W.K.; methodology, J.K.; software, A.R.; validation, A.R., D.T. and J.K.; formal analysis, W.K.; investigation, D.T.; resources, J.K.; data curation, A.R.; writing—original draft preparation, D.T.; writing—review and editing, W.K.; visualization, A.R.; supervision, W.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Gachon University Research Fund (GCU-202307760001), the Ministry of Education of the Republic of Korea, and the National Research Foundation of Korea (NRF-2022S1A5C2A07090938).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to an ongoing research project.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. List of Features with Descriptions for Sarcopenia Dataset.
ID | Feature | Description
1 | Sarcopenia | Sarcopenia status (0 = Normal, 1 = Possible, 2 = Sarcopenia, 3 = Severe)
2 | Place | Location where data was collected
3 | sex | Gender of the subject
4 | Age | Age of the subject (years)
5 | height_cm | Height of the subject (cm)
6 | weight_kg | Weight of the subject (kg)
7 | SMM_kg | Skeletal Muscle Mass (kg)
8 | BFM_kg | Body Fat Mass (kg)
9 | BMI_kgm2 | Body Mass Index (kg/m2)
10 | Percent_BF | Body fat percentage (%)
11 | BMR_kcal | Basal Metabolic Rate (kcal)
12 | SBP_mmHg | Systolic Blood Pressure (mmHg)
13 | DBP_mmHg | Diastolic Blood Pressure (mmHg)
14 | BP_Stage | Blood Pressure classification stage
15 | Pulse | Pulse rate (beats per minute)
16 | CC_cm | Calf circumference (cm)
17 | Dominant | Dominant side of the body (hand/leg)
18 | ASM | Appendicular Skeletal Muscle mass (kg)
19 | HG_R_1 | Hand grip strength right (first trial) (kg)
20 | HG_L_1 | Hand grip strength left (first trial) (kg)
21 | HG_R_2 | Hand grip strength right (second trial) (kg)
22 | HG_L_2 | Hand grip strength left (second trial) (kg)
23 | HG_R_M | Maximum hand grip strength right (kg)
24 | HG_L_M | Maximum hand grip strength left (kg)
25 | D_HG | Dominant hand grip strength (kg)
26 | ND_HG | Non-dominant hand grip strength (kg)
27 | Plartar_R_1 | Plantar flexion strength right foot (kg)
28 | Plartar_L_1 | Plantar flexion strength left foot (kg)
29 | Dorsal_R_1 | Dorsal flexion strength right foot (kg)
30 | Dorsal_L_1 | Dorsal flexion strength left foot (kg)
31 | D_Plantar | Dominant plantar flexion strength (kg)
32 | D_Dorsal | Dominant dorsal flexion strength (kg)
33 | ND_Plantar | Non-dominant plantar flexion strength (kg)
34 | ND_Dorsal | Non-dominant dorsal flexion strength (kg)
35 | SLS_R | Single leg stance right (seconds)
36 | SLS_L | Single leg stance left (seconds)
37 | D_SLS | Dominant side single leg stance (seconds)
38 | ND_SLS | Non-dominant side single leg stance (seconds)
39 | SLS_MAX | Maximum single leg stance time (seconds)
40 | SS | Sit-to-stand repetitions (30 s)
41 | SS_SPPB | Sit-to-stand time for 5 repetitions (SPPB protocol)
42 | CSR | Chair sit and reach distance (cm)
43 | MWT2 | 2-Minute Walk Test distance (meters)
44 | TUG | Timed Up and Go test (seconds)
45 | Gaitspeed_SPPB | Gait speed from SPPB (m/s)
46 | SPPB | Short Physical Performance Battery total score
47 | G_HG_R | Grade-adjusted hand grip strength right
48 | G_HG_L | Grade-adjusted hand grip strength left
49 | G_BMI | Grade-adjusted Body Mass Index
50 | G_SS | Grade-adjusted sit-to-stand repetitions
51 | G_2MWT | Grade-adjusted 2-min walk test
52 | G_TUG | Grade-adjusted Timed Up and Go test
53 | G_CSR | Grade-adjusted Chair Sit and Reach test
54 | G_D_SLS | Grade-adjusted dominant single-leg stance test
55 | D1_s | Physical and Mental Health domain (adjusted)
56 | D2_s | Locomotion domain (adjusted)
57 | D3_s | Body Composition domain (adjusted)
58 | D4_s | Functionality domain (adjusted)
59 | D5_s | Activities of Daily Living domain (adjusted)
60 | D6_s | Leisure Activities domain (adjusted)
61 | D7_s | Fears domain (adjusted)
62 | SarQoL_Total_s | Total Sarcopenia Quality of Life score (adjusted)
63 | FES | Fall Efficacy Scale score
64 | SARC_F | SARC-F questionnaire score
65 | SARC_CalF | SARC-F questionnaire with calf circumference
66 | Area_city | Residential city or region
67 | Area_Dong | Residential neighborhood/town
68 | Educationlevel | Education level of the participant
69 | Religion | Religion of the participant
70 | Smoking_ | Smoking status
71 | Smoking_d_ | Smoking duration (years)
72 | Smoking_a_ | Smoking amount (packs per day)
73 | Drinking_f_ | Drinking frequency (times per week)
74 | Drinking_d_ | Drinking duration (years)
75 | Drinking_a_ | Drinking amount (glasses per session)
76 | Family | Family type
77 | House | Housing type
78 | Income | Monthly income
79 | Educationlevel_p | Parents' education level
80 | RegularPA_ | Regular physical activity status
81 | TypeofPA_1_ | Type of physical activity 1
82 | TypeofPA_2_ | Type of physical activity 2
83 | TypeofPA_3_ | Type of physical activity 3
84 | TypeofPA_4_ | Type of physical activity 4
85 | TypeofPA_5_ | Type of physical activity 5
86 | TypeofPA_6_ | Type of physical activity 6
87 | FVC | Forced Vital Capacity (liters)
88 | PreFVC | Predicted Forced Vital Capacity (L)
89 | FEV1 | Forced Expiratory Volume in 1 s (L)
90 | PEF | Peak Expiratory Flow (L/min)
91 | MIP_Ave | Average Maximum Inspiratory Pressure (cmH2O)
92 | SAF | Skin autofluorescence (AGEs measurement)
93 | HbA1c | Glycated hemoglobin (HbA1c %)
94 | DM | Diabetes Mellitus status
95 | Hypertension | Hypertension diagnosis
96 | Hyperlipidemia | Hyperlipidemia diagnosis
97 | Sleepdisorder | Sleep disorder status

References

  1. Cruz-Jentoft, A.J.; Sayer, A.A. Sarcopenia. Lancet 2019, 393, 2636–2646. [Google Scholar] [CrossRef]
  2. Petermann-Rocha, F.; Balntzi, V.; Gray, S.R.; Lara, J.; Ho, F.K.; Pell, J.P.; Celis-Morales, C. Global prevalence of sarcopenia and severe sarcopenia: A systematic review and meta-analysis. J. Cachexia Sarcopenia Muscle 2022, 13, 86–99. [Google Scholar] [CrossRef]
  3. Turimov Mustapoevich, D.; Kim, W. Machine learning applications in sarcopenia detection and management: A comprehensive survey. Healthcare 2023, 11, 2483. [Google Scholar] [CrossRef]
  4. Ozgur, S.; Altinok, Y.A.; Bozkurt, D.; Saraç, Z.F.; Akçiçek, S.F. Performance evaluation of machine learning algorithms for sarcopenia diagnosis in older adults. Healthcare 2023, 11, 2699. [Google Scholar] [CrossRef]
  5. Lynch, D.H.; Spangler, H.B.; Franz, J.R.; Krupenevich, R.L.; Kim, H.; Nissman, D.; Zhang, J.; Li, Y.Y.; Sumner, S.; Batsis, J.A. Multimodal diagnostic approaches to advance precision medicine in sarcopenia and frailty. Nutrients 2022, 14, 1384. [Google Scholar] [CrossRef]
  6. Roberts, S.; Collins, P.; Rattray, M. Identifying and managing malnutrition, frailty and sarcopenia in the community: A narrative review. Nutrients 2021, 13, 2316. [Google Scholar] [CrossRef]
  7. Band, S.S.; Yarahmadi, A.; Hsu, C.C.; Biyari, M.; Sookhak, M.; Ameri, R.; Dehzangi, I.; Chronopoulos, A.T.; Liang, H.W. Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Inform. Med. Unlocked 2023, 40, 101286. [Google Scholar] [CrossRef]
  8. Rane, N.; Choudhary, S.; Rane, J. Explainable artificial intelligence (XAI) in healthcare: Interpretable models for clinical decision support. SSRN 2023. [Google Scholar] [CrossRef]
  9. Wang, K.; Zhang, H.; Cheng, C.Y.M.; Chen, M.; Lai, K.W.C.; Or, C.K.; Hu, Y.; Vellaisamy, A.L.R.; Lam, C.L.K.; Xi, N.; et al. Multi-Risk-Level Sarcopenia-Prone Screening via Machine Learning Classification of Sit-to-Stand Motion Metrics from Wearable Sensors. Adv. Intell. Syst. 2025, 2025, 2401120. [Google Scholar] [CrossRef]
  10. Luo, X.; Ding, H.; Broyles, A.; Warden, S.J.; Moorthi, R.N.; Imel, E.A. Using machine learning to detect sarcopenia from electronic health records. Digit. Health 2023, 9, 20552076231197098. [Google Scholar] [CrossRef] [PubMed]
  11. Guo, J.; He, Q.; She, C.; Liu, H.; Li, Y. A machine learning–based online web calculator to aid in the diagnosis of sarcopenia in the US community. Digit. Health 2024, 10, 20552076241283247. [Google Scholar] [CrossRef]
  12. Bae, J.H.; Seo, J.w.; Kim, D.Y. Deep-learning model for predicting physical fitness in possible sarcopenia: Analysis of the Korean physical fitness award from 2010 to 2023. Front. Public Health 2023, 11, 1241388. [Google Scholar] [CrossRef] [PubMed]
  13. Seok, M.; Kim, W.; Kim, J. Machine learning for sarcopenia prediction in the elderly using socioeconomic, infrastructure, and quality-of-life data. Healthcare 2023, 11, 2881. [Google Scholar] [CrossRef]
  14. Chen, X.D.; Chen, W.J.; Huang, Z.X.; Xu, L.B.; Zhang, H.H.; Shi, M.M.; Cai, Y.Q.; Zhang, W.T.; Li, Z.S.; Shen, X. Establish a new diagnosis of sarcopenia based on extracted radiomic features to predict prognosis of patients with gastric cancer. Front. Nutr. 2022, 9, 850929. [Google Scholar] [CrossRef]
  15. Tukhtaev, A.; Turimov, D.; Kim, J.; Kim, W. Feature Selection and Machine Learning Approaches for Detecting Sarcopenia Through Predictive Modeling. Mathematics 2025, 13, 98. [Google Scholar] [CrossRef]
  16. Ghosh, P.; Azam, S.; Jonkman, M.; Karim, A.; Shamrat, F.J.M.; Ignatious, E.; Shultana, S.; Beeravolu, A.R.; De Boer, F. Efficient prediction of cardiovascular disease using machine learning algorithms with relief and LASSO feature selection techniques. IEEE Access 2021, 9, 19304–19326. [Google Scholar] [CrossRef]
  17. Wang, J.; Xu, Y.; Liu, L.; Wu, W.; Shen, C.; Huang, H.; Zhen, Z.; Meng, J.; Li, C.; Qu, Z.; et al. Comparison of LASSO and random forest models for predicting the risk of premature coronary artery disease. BMC Med. Inform. Decis. Mak. 2023, 23, 297. [Google Scholar] [CrossRef] [PubMed]
  18. Wang, K.; Zhang, H.; Cheng, C.Y.M.; Chen, M.; Lai, K.W.C.; Or, C.K.; Chen, Y.; Hu, Y.; Vellaisamy, A.L.R.; Lam, C.L.K.; et al. High Accuracy Machine Learning Model for Sarcopenia Severity Diagnosis based on Sit-to-stand Motion Measured by Two Micro Motion Sensors. medRxiv 2023. [Google Scholar] [CrossRef]
  19. Ayllón-Gavilán, R.; Guijo-Rubio, D.; Gutiérrez, P.A.; Bagnall, A.; Hervás-Martínez, C. Convolutional-and Deep Learning-Based Techniques for Time Series Ordinal Classification. IEEE Trans. Cybern. 2024, 55, 537–549. [Google Scholar] [CrossRef]
  20. Xu, W.; Gao, Y.; Wang, Y.; Guan, J. Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks. BMC Bioinform. 2021, 22, 485. [Google Scholar] [CrossRef]
  21. Zakariah, M.; AlQahtani, S.A.; Al-Rakhami, M.S. Machine learning-based adaptive synthetic sampling technique for intrusion detection. Appl. Sci. 2023, 13, 6504. [Google Scholar] [CrossRef]
  22. Hittawe, M.M.; Harrou, F.; Togou, M.A.; Sun, Y.; Knio, O. Time-series weather prediction in the Red sea using ensemble transformers. Appl. Soft Comput. 2024, 164, 111926. [Google Scholar] [CrossRef]
  23. Bin Tareaf, R.; Korga, A.M.; Wefers, S.; Hanken, K. Direct vs. Cross-Validated Stacking in Ensemble Learning: Evaluating the Trade-Off between Inference Time and Generalizability on Fashion-MNIST. In Proceedings of the 2024 16th International Conference on Machine Learning and Computing, Shenzhen, China, 2–5 February 2024; pp. 453–461. [Google Scholar]
  24. Imani, M.; Beikmohammadi, A.; Arabnia, H.R. Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels. Technologies 2025, 13, 88. [Google Scholar] [CrossRef]
  25. Salehi, A.R.; Khedmati, M. A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data. Sci. Rep. 2024, 14, 5152. [Google Scholar] [CrossRef]
  26. Cullerne Bown, W. Sensitivity and specificity versus precision and recall, and related dilemmas. J. Classif. 2024, 41, 402–426. [Google Scholar] [CrossRef]
  27. Jung, D.; Lee, D.; Nguyen, Q.H.N.; Kim, J.; Mun, K.R. A Machine Learning Approach to Predict the Risk of Sarcopenia. In Proceedings of the 2023 IEEE EMBS Special Topic Conference on Data Science and Engineering in Healthcare, Medicine and Biology, St. Julians, Malta, 7–9 December 2023; pp. 47–48. [Google Scholar]
  28. Taghandiki, K. Quantum Machine Learning Unveiled: A Comprehensive Review. J. Eng. Appl. Res. 2024, 1, 29–48. [Google Scholar]
  29. Zupo, R.; Moroni, A.; Castellana, F.; Gasparri, C.; Catino, F.; Lampignano, L.; Perna, S.; Clodoveo, M.L.; Sardone, R.; Rondanelli, M. A machine-learning approach to target clinical and biological features associated with sarcopenia: Findings from northern and southern Italian aging populations. Metabolites 2023, 13, 565. [Google Scholar] [CrossRef]
  30. Turimov, D.; Kim, W. Enhancing Sarcopenia Prediction Through an Ensemble Learning Approach: Addressing Class Imbalance for Improved Clinical Diagnosis. Mathematics 2025, 13, 26. [Google Scholar] [CrossRef]
  31. Kim, B.R.; Yoo, T.K.; Kim, H.K.; Ryu, I.H.; Kim, J.K.; Lee, I.S.; Kim, J.S.; Shin, D.H.; Kim, Y.S.; Kim, B.T. Oculomics for sarcopenia prediction: A machine learning approach toward predictive, preventive, and personalized medicine. EPMA J. 2022, 13, 367–382. [Google Scholar] [CrossRef]
  32. Ullah, U.; Maheshwari, D.; Castillo Olea, C.; Garcia Zapirain, B. Sarcopenia risk prediction and feature selection by using quantum machine learning algorithms. Quantum Mach. Intell. 2024, 6, 80. [Google Scholar] [CrossRef]
  33. Nemade, V.; Pathak, S.; Dubey, A.K. Deep learning-based ensemble model for classification of breast cancer. Microsyst. Technol. 2024, 30, 513–527. [Google Scholar] [CrossRef]
  34. Morsy, S.; Abd-Elsalam, N.; Kandil, A.; Elbialy, A.; Youssef, A.B. A deep learning ensemble framework for robust classification of lung ultrasound patterns: Covid-19, pneumonia, and normal. Int. J. Adv. Intell. Inform. 2025, 11, 1966. [Google Scholar] [CrossRef]
  35. Yogesh, N.; Shrinivasacharya, P.; Naik, N. Novel statistically equivalent signature-based hybrid feature selection and ensemble deep learning LSTM and GRU for chronic kidney disease classification. PeerJ Comput. Sci. 2024, 10, e2467. [Google Scholar] [CrossRef]
  36. Deshpande, N.M.; Gite, S.; Pradhan, B. Explainable AI for binary and multi-class classification of leukemia using a modified transfer learning ensemble model. Int. J. Smart Sens. Intell. Syst. 2024, 17, 1–20. [Google Scholar] [CrossRef]
  37. Panthakkan, A.; Anzar, S.; Mansoor, W.; Al Ahmad, H. A new frontier in hematology: Robust deep learning ensembles for white blood cell classification. Biomed. Signal Process. Control 2025, 100, 106995. [Google Scholar] [CrossRef]
  38. Selvaraj, K.M.; Gnanagurusubbiah, S.; Roy, R.R.R.; Balu, S. Enhancing skin lesion classification with advanced deep learning ensemble models: A path towards accurate medical diagnostics. Curr. Probl. Cancer 2024, 49, 101077. [Google Scholar] [CrossRef]
  39. Li, M.; Sun, H.; Huang, Y.; Chen, H. Shapley value: From cooperative game to explainable artificial intelligence. Auton. Intell. Syst. 2024, 4, 2. [Google Scholar] [CrossRef]
  40. Franceschi, L.; Donini, M.; Archambeau, C.; Seeger, M. Explaining probabilistic models with distributional values. arXiv 2024, arXiv:2402.09947. [Google Scholar] [CrossRef]
  41. Madadizadeh, F.; Ghafari, H.; Bahariniya, S. Kappa Statistics: A Method of Measuring Agreement in Dental Examinations. Open Public Health J. 2023, 16, e18749445259818. [Google Scholar]
  42. Li, M.; Gao, Q.; Yu, T. Kappa statistic considerations in evaluating inter-rater reliability between two raters: Which, when and context matters. BMC Cancer 2023, 23, 799. [Google Scholar] [CrossRef]
  43. Yang, W.; Jiang, J.; Schnellinger, E.M.; Kimmel, S.E.; Guo, W. Modified Brier score for evaluating prediction accuracy for binary outcomes. Stat. Methods Med Res. 2022, 31, 2287–2296. [Google Scholar] [CrossRef] [PubMed]
  44. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  45. Zou, J.; Topol, E.J. The rise of agentic AI teammates in medicine. Lancet 2025, 405, 457. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of the model fusion classifier, involving data preprocessing, feature selection, model fusion with optimized base learners, and SHAP-based interpretability.
Figure 2. Comparison of classification performance metrics across different models. The figure illustrates Accuracy (orange), Precision (blue), Recall (red), and the F1 Score (green) for the stacked model, Gradient Boosting, the Random Forest, AdaBoost, the Decision Tree, the MLP, and SVM classifiers.
Figure 3. Confusion matrices of all evaluated models. Each confusion matrix presents the number of true positives, false positives, true negatives, and false negatives for the models.
Figure 4. Comparison of accuracy, Precision, Recall, and the F1 Score across different models for the multiclass classification task.
Figure 5. Confusion matrices of all models for the multiclass classification task, illustrating classification performance across four classes.
Figure 6. SHAP summary plot showing the feature impact distribution for the stacked model in multiclass sarcopenia prediction.
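A summary plot of this kind can be produced with the shap library. The snippet below is a hedged sketch that applies model-agnostic Kernel SHAP to the stacked model's predict_proba; the fitted `stack`, the standardized arrays, and the background sample size are assumed to come from the pipeline sketch above and are not the authors' exact settings.
```python
import shap

# Model-agnostic Kernel SHAP on the stacked model's class probabilities.
background = shap.sample(X_train_bal, 100, random_state=42)   # small background set keeps runtime low
explainer = shap.KernelExplainer(stack.predict_proba, background)
shap_values = explainer.shap_values(X_test_std[:200])          # explain a subset of test samples

# Aggregated feature impact across the four severity classes; the exact plot style
# depends on the shap version and the shape of shap_values.
shap.summary_plot(shap_values, X_test_std[:200], feature_names=list(X.columns))
```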
Table 1. Taxonomy of sarcopenia ML studies.
Study | Input Modality | Model(s) | Task Type | Severity Considered | Accuracy (%)
[10] | Electronic Health Records (EHR) | LR, MLP, and SVM | Multiclass | Yes | 91.4
[11] | Demographics | XGBoost | Binary | No | 85.2
[12] | Fitness Scores | DNN | Binary | Yes | 87.5
[27] | Electromyography (EMG) | LSTM | Multiclass | Yes | 94.4
[13] | Socioeconomic/Quality Measures | RF and LightGBM | Binary | No | ∼80
[30] | Clinical Vitals | RF, SVM, and NN (Ensemble) | Binary | No | 88.5
[31] | Oculomics | XGBoost | Binary | No | 75.1
[29] | Laboratory Biomarkers | RF and LR | Binary | Yes | -
[32] | Clinical + Categorical Features | Quantum SVM and RF | Binary | Yes | 76.7
[18] | IMU (Sit-to-Stand) | Ensemble Stack | Multiclass | Yes | 90.4
Proposed Model | Clinical Vitals | RF, GB, and MLP | Multiclass | Yes | 96.9
Table 2. Sarcopenia classification criteria based on clinical assessments.
Sarcopenia Level | Description | Criteria Based on Assessment
0—Normal | No evidence of sarcopenia | All parameters remain within normal ranges: muscular force capacity, functional mobility metrics, and quantitative appendicular muscle mass measurements exceed threshold values.
1—Possible Sarcopenia | Early signs detected mainly in primary care settings | Diminished muscular force (hand dynamometry: males < 28 kg and females < 18 kg) or compromised functional capacity (chair-rise pentad ≥ 12 s). Quantitative muscle mass evaluation not mandatory at this diagnostic stage.
2—Sarcopenia | Confirmed sarcopenia diagnosis | Reduced appendicular skeletal musculature combined with either insufficient grip force (males < 28 kg, females < 18 kg) or suboptimal mobility parameters (6-meter ambulatory velocity < 1.0 m/s, chair-rise pentad ≥ 12 s, or abbreviated physical function index ≤ 9).
3—Severe Sarcopenia | Advanced sarcopenia condition | Concurrent manifestation of all diagnostic indicators: insufficient appendicular muscle volume with compromised muscular force and deteriorated functional performance metrics (triple-domain deficiency syndrome) [1].
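For illustration only, the Table 2 criteria can be expressed as a simple labeling rule. The function below is not the authors' code; the argument names are hypothetical, while the thresholds follow the table.
```python
def sarcopenia_level(sex, grip_kg, chair_rise_s, gait_speed_ms, sppb, low_asm):
    """Return 0 = Normal, 1 = Possible, 2 = Sarcopenia, 3 = Severe (illustrative rule)."""
    low_grip = grip_kg < (28 if sex == "male" else 18)
    low_performance = (gait_speed_ms < 1.0) or (chair_rise_s >= 12) or (sppb <= 9)

    if low_asm and low_grip and low_performance:
        return 3  # all three domains affected
    if low_asm and (low_grip or low_performance):
        return 2  # confirmed sarcopenia
    if low_grip or chair_rise_s >= 12:
        return 1  # possible sarcopenia; muscle mass not required at this stage
    return 0      # normal


print(sarcopenia_level("female", grip_kg=16, chair_rise_s=13,
                       gait_speed_ms=0.9, sppb=8, low_asm=True))   # -> 3
```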
Table 3. Descriptions of selected features.
Feature | Meaning
HG_R_M | Maximum handgrip strength of the right hand.
HG_L_2 | Second handgrip strength trial for the left hand.
TUG | Timed up-and-go test (mobility and balance assessment).
D_HG | Dominant handgrip strength.
Age | Age of the subject in years.
HG_R_1 | First handgrip strength trial for the right hand.
SS_SPPB | Sit-to-stand and Short Physical Performance Battery total score.
BFM_kg | Body fat mass in kilograms.
ND_HG | Non-dominant handgrip strength.
G_BMI | Grade of the body mass index category.
SPPB | Short Physical Performance Battery score.
BMR_kcal | Basal metabolic rate in kilocalories.
G_SS | Grade of the sit-to-stand score category.
ASM | Appendicular skeletal muscle mass.
Smoking_a_ | Smoking status.
HG_R_2 | Second handgrip strength trial for the right hand.
SMM_kg | Skeletal muscle mass in kilograms.
CC_cm | Calf circumference in centimeters.
sex | Biological sex of the participant.
G_HG_L | Grade of handgrip strength for the left hand.
BMI_kgm2 | Body mass index in kg/m2.
Educationlevel | Highest level of education attained.
D3_s | Body composition domain.
G_HG_R | Grade of handgrip strength for the right hand.
G_TUG | Grade of the timed up-and-go category.
HG_L_M | Maximum handgrip strength of the left hand.
SS | Sit-to-stand repetitions (30 s).
D_Plantar | Plantar flexion strength or delay.
Sleepdisorder | Presence of a diagnosed sleep disorder.
Dorsal_R_1 | First dorsal flexion strength measurement on the right.
weight_kg | Body weight in kilograms.
Refer to Table A1.
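As a hedged illustration of how a reduced feature list such as this one might be obtained, the sketch below combines non-zero LASSO coefficients with Random Forest importances as two complementary rankings; the exact selection procedure, the cut-off, and the treatment of the ordinal severity label as a numeric target for the LASSO step are assumptions, not the authors' method.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# X_train / y_train come from the pipeline sketch; the 0-3 label is treated as numeric here.
X_std = StandardScaler().fit_transform(X_train)
lasso = LassoCV(cv=5, random_state=42).fit(X_std, y_train)
lasso_keep = set(X.columns[np.abs(lasso.coef_) > 1e-6])

rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
top_k = 30                                               # illustrative cut-off
rf_keep = set(X.columns[np.argsort(rf.feature_importances_)[::-1][:top_k]])

selected_features = sorted(lasso_keep | rf_keep)
print(len(selected_features), "features retained")
```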
Table 4. Best hyperparameters obtained during model optimization for binary classification.
Model | Best Parameters
Stacked Model | Ensemble of optimized base models
Gradient Boosting | {‘learning_rate’: 0.05, ‘max_depth’: 3, ‘n_estimators’: 200}
Random Forest | {‘max_depth’: None, ‘min_samples_split’: 2, ‘n_estimators’: 300}
AdaBoost | {‘n_estimators’: 200, ‘learning_rate’: 1.0}
Decision Tree | {‘max_depth’: None, ‘min_samples_split’: 2}
MLP | {‘activation’: tanh, ‘alpha’: 0.0001, ‘hidden_layer_sizes’: (50), ‘learning_rate_init’: 0.01}
SVM | {‘kernel’: rbf, ‘gamma’: 0.001, ‘C’: 100.0}
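These settings map directly onto scikit-learn constructors, as in the sketch below; the max_iter and probability arguments are added for convergence and probability output and are assumptions outside the reported search.
```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

binary_models = {
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.05, max_depth=3, n_estimators=200),
    "Random Forest": RandomForestClassifier(max_depth=None, min_samples_split=2, n_estimators=300),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, learning_rate=1.0),
    "Decision Tree": DecisionTreeClassifier(max_depth=None, min_samples_split=2),
    "MLP": MLPClassifier(activation="tanh", alpha=0.0001, hidden_layer_sizes=(50,),
                         learning_rate_init=0.01, max_iter=1000),
    "SVM": SVC(kernel="rbf", gamma=0.001, C=100.0, probability=True),
}
```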
Table 5. Performance comparison of different models for binary classification.
# | Model | Accuracy | Precision | F1 Score | Recall
1 | Proposed Model | 0.9624 | 0.9800 | 0.9400 | 0.9100
2 | Gradient Boosting | 0.9774 | 0.9600 | 0.9700 | 0.9700
3 | Random Forest | 0.9398 | 0.9200 | 0.9100 | 0.9000
4 | AdaBoost | 0.9398 | 0.8846 | 0.8519 | 0.8214
5 | Decision Tree | 0.9173 | 0.8700 | 0.8800 | 0.8800
6 | MLP | 0.9098 | 0.8800 | 0.8600 | 0.8400
7 | SVM | 0.9023 | 0.8600 | 0.8500 | 0.8500
Bold values indicate best-performing models in key metrics.
Table 6. Comparison of Cohen’s Kappa and Brier Scores across models.
Model | Kappa | Brier Score
Stacked Model | 0.8790 | 0.0243
Gradient Boosting | 0.9330 | 0.0104
Random Forest | 0.8142 | 0.0484
AdaBoost | 0.8142 | 0.1575
Decision Tree | 0.7544 | 0.0827
MLP | 0.7136 | 0.0747
SVM | 0.7021 | 0.0768
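Both metrics are available in scikit-learn. The sketch below shows one way to reproduce this kind of comparison for a binary task; synthetic data stand in for the clinical features so the snippet runs on its own, and `binary_models` refers to the dictionary from the Table 4 sketch.
```python
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data as a stand-in for the clinical features.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

for name, model in binary_models.items():       # binary_models from the Table 4 sketch
    model.fit(X_tr, y_tr)
    p_pos = model.predict_proba(X_te)[:, 1]     # predicted probability of the positive class
    print(f"{name:>17s}  Kappa={cohen_kappa_score(y_te, model.predict(X_te)):.4f}  "
          f"Brier={brier_score_loss(y_te, p_pos):.4f}")
```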
Table 7. Best hyperparameters for each model used in the multiclass classification task.
# | Model | Best Parameters
1 | Stacked (RF + GB + MLP) | Optimized RF, GB, and MLP base models
2 | Random Forest | {‘max_depth’: None, ‘min_samples_leaf’: 1, ‘min_samples_split’: 5, ‘n_estimators’: 300}
3 | Gradient Boosting | {‘subsample’: 0.8, ‘n_estimators’: 300, ‘max_features’: ‘log2’, ‘max_depth’: 7, ‘learning_rate’: 0.05}
4 | MLP | {‘alpha’: 0.0001, ‘hidden_layer_sizes’: (50), ‘learning_rate_init’: 0.01}
5 | SVM | {‘C’: 10, ‘gamma’: ‘scale’, ‘kernel’: ‘rbf’}
6 | Decision Tree | {‘max_depth’: None, ‘min_samples_leaf’: 1, ‘min_samples_split’: 2}
7 | AdaBoost | {‘learning_rate’: 0.01, ‘n_estimators’: 50}
All parameters were selected based on the best performance through a cross-validated grid/random search.
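The sketch below illustrates such a cross-validated search for the Gradient Boosting learner; the grid mirrors the reported values plus nearby alternatives and is illustrative rather than the authors' exact search space, and the macro-F1 scoring choice is an assumption.
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200, 300],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_macro",          # macro-averaged F1 for the imbalanced multiclass task
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_bal, y_train_bal)    # balanced training data from the pipeline sketch
print("Best parameters:", search.best_params_)
```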
Table 8. Comparison of model performance across key metrics in the multiclass classification task.
Model | Accuracy | Precision | Recall | F1 Score
Proposed Model | 0.9699 | 0.9600 | 0.9400 | 0.9449
Gradient Boosting | 0.9474 | 0.9367 | 0.9313 | 0.9320
Random Forest | 0.9173 | 0.9034 | 0.8965 | 0.8977
AdaBoost | 0.8797 | 0.8840 | 0.8675 | 0.8646
Decision Tree | 0.8797 | 0.8800 | 0.8890 | 0.8837
SVM | 0.8195 | 0.7981 | 0.7510 | 0.7704
MLP | 0.8045 | 0.7568 | 0.7427 | 0.7485
Table 9. Comparison of Cohen’s Kappa and Brier Scores across models for the multiclass classification task.
Model | Kappa | Brier Score
Stacked Model | 0.9738 | 0.0125
Gradient Boosting | 0.9380 | 0.0256
Random Forest | 0.9106 | 0.0407
AdaBoost | 0.8680 | 0.0602
Decision Tree | 0.8740 | 0.0602
SVM | 0.7972 | 0.0671
MLP | 0.7749 | 0.0901
Higher Kappa values indicate stronger classification agreement, while lower Brier Scores reflect better-calibrated probability estimates.
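Cohen's Kappa extends to the multiclass setting directly, while the Brier Score requires a one-hot generalization. The sketch below shows one common form of that generalization; normalization conventions differ, so the result may deviate from the reported values by a constant factor, and `stack`, X_test_std, and y_test are assumed from the pipeline sketch.
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.preprocessing import label_binarize

def multiclass_brier(y_true, proba, classes):
    """Mean squared difference between predicted probabilities and one-hot labels."""
    onehot = label_binarize(y_true, classes=classes)
    return float(np.mean(np.sum((proba - onehot) ** 2, axis=1)))

proba = stack.predict_proba(X_test_std)         # stacked model from the pipeline sketch
kappa = cohen_kappa_score(y_test, stack.predict(X_test_std))
brier = multiclass_brier(y_test, proba, classes=[0, 1, 2, 3])
print(f"Kappa={kappa:.4f}  multiclass Brier={brier:.4f}")
```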
Table 10. Ablation study of base learners in the stacking ensemble.
Base Learner Combination | Accuracy | Macro F1 | Cohen’s Kappa
RF + GB + MLP (Full Model) | 0.9699 | 0.9449 | 0.9738
RF + GB | 0.9474 | 0.9320 | 0.9384
RF + MLP | 0.9474 | 0.9240 | 0.9285
GB + MLP | 0.9624 | 0.9337 | 0.9589
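The ablation can be reproduced by re-fitting the stacking ensemble on each subset of base learners, as sketched below; the logistic-regression meta-learner and the hyperparameter choices remain assumptions carried over from the earlier sketches.
```python
from itertools import combinations
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from sklearn.neural_network import MLPClassifier

base = {
    "RF": RandomForestClassifier(n_estimators=300, random_state=42),
    "GB": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), learning_rate_init=0.01,
                         max_iter=1000, random_state=42),
}

for r in (3, 2):                                 # full ensemble first, then every pair
    for combo in combinations(base, r):
        model = StackingClassifier(
            estimators=[(k, base[k]) for k in combo],
            final_estimator=LogisticRegression(max_iter=1000), cv=5)
        model.fit(X_train_bal, y_train_bal)      # data from the pipeline sketch
        y_hat = model.predict(X_test_std)
        print(" + ".join(combo),
              f"Acc={accuracy_score(y_test, y_hat):.4f}",
              f"MacroF1={f1_score(y_test, y_hat, average='macro'):.4f}",
              f"Kappa={cohen_kappa_score(y_test, y_hat):.4f}")
```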
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
