Article

Multiclass Classification of Sarcopenia Severity in Korean Adults Using Machine Learning and Model Fusion Approaches

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea
2 Department of Exercise Rehabilitation, Gachon University, Incheon 21936, Republic of Korea
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(18), 2907; https://doi.org/10.3390/math13182907
Submission received: 8 May 2025 / Revised: 18 August 2025 / Accepted: 25 August 2025 / Published: 9 September 2025

Abstract

This study presents a unified machine learning strategy for identifying various degrees of sarcopenia severity in older adults. The approach combines three optimized algorithms (Random Forest, Gradient Boosting, and Multilayer Perceptron) into a stacked ensemble model, which is assessed with clinical data. A thorough data preparation process involved synthetic minority oversampling to ensure class balance and a dual approach to feature selection using Least Absolute Shrinkage and Selection Operator regression and Random Forest importance. The integrated model achieved remarkable performance with an accuracy of 96.99%, an F1 score of 0.9449, and a Cohen’s Kappa coefficient of 0.9738 while also demonstrating excellent calibration (Brier Score: 0.0125). Interpretability analysis through SHapley Additive exPlanations values identified appendicular skeletal muscle mass, body weight, and functional performance metrics as the most significant predictors, enhancing clinical relevance. The ensemble approach showed superior generalization across all sarcopenia classes compared to individual models. Although limited by dataset representativeness and the use of conventional multiclass classification techniques, the framework shows considerable promise for non-invasive sarcopenia risk assessments and exemplifies the value of interpretable artificial intelligence in geriatric healthcare.

1. Introduction

Sarcopenia, defined by the gradual decline in skeletal muscle mass and strength, presents a considerable risk to the well-being and autonomy of elderly populations globally. Its early detection and severity assessment are critical for implementing timely interventions to prevent adverse outcomes such as frailty, falls, and hospitalization [1,2]. Traditionally, sarcopenia diagnosis has relied on manual clinical assessments and imaging techniques, which, while effective, are often resource-intensive and limited by inter-observer variability [3]. Contemporary advancements in computational learning methods have enabled innovative approaches for automated sarcopenia screening and prognostic analysis, offering scalable, data-driven solutions to support clinical decision-making [3,4].
Despite these promising developments, most existing studies have focused narrowly on binary classification—distinguishing sarcopenic from non-sarcopenic individuals—without adequately addressing the full spectrum of sarcopenia severity [5,6]. Furthermore, while various machine learning models have been proposed, few have employed model fusion strategies or integrated explainable AI (XAI) techniques, such as SHapley Additive exPlanations (SHAP) analysis, to enhance model transparency and clinical interpretability [7,8]. Consequently, critical gaps remain in developing robust, interpretable, and generalizable models capable of stratifying sarcopenia into multiple severity levels based on real-world clinical features.
To address these limitations, the present study proposes an ensemble-based model fusion framework that integrates the Random Forest, Gradient Boosting, and Multilayer Perceptron (MLP) classifiers, which are optimized through hyperparameter tuning. The model is designed to predict sarcopenia severity across four ordinal classes using a rich set of clinical, anthropometric, and functional assessment features. Additionally, SHAP analysis is incorporated to identify and visualize the most influential predictors, thereby enhancing model explainability and clinical relevance. Through comprehensive performance evaluation—including accuracy, Kappa statistics, and the Brier Score—this study aims to provide a robust, interpretable, and clinically useful tool for sarcopenia risk stratification. Contributions of this study are as follows:
  • Development of a multiclass classification model to predict sarcopenia severity levels (normal, possible, sarcopenia, and severe) rather than just a binary diagnosis, addressing a critical gap in existing machine learning studies.
  • Design and evaluation of a stacked ensemble classifier that integrates the individually optimized Random Forest, Gradient Boosting, and Multilayer Perceptron models, achieving superior performance across all key metrics.
  • Application of dual-path feature selection combining the Least Absolute Shrinkage and Selection Operator (LASSO) for sparse linear selection with Random Forest importance for nonlinear relationships, enhancing both prediction accuracy and model robustness.
  • Integration of SHAP explainability analysis, providing transparent interpretation of feature contributions and ensuring clinical interpretability of the model’s predictions.
  • Comprehensive performance evaluation using accuracy, the macro F1 Score, Cohen’s Kappa, AUROC, and the Brier Score across seven models, demonstrating the proposed model’s superiority in both generalization and calibration.
  • Construction of a reproducible pipeline that combines data balancing with the Synthetic Minority Over-sampling Technique (SMOTE), feature engineering, hyperparameter optimization, and post hoc explanation, facilitating practical application in clinical decision support systems.

2. Related Works

The growing prevalence of sarcopenia among aging populations has driven significant research into early detection and severity assessments using machine learning (ML) techniques. This section integrates recent findings, focusing on sarcopenia severity classification, ensemble learning, feature engineering, data balancing, and model interpretability.

2.1. Sarcopenia Assessment and Severity Prediction

Recent studies have transitioned from binary classification to severity-aware models using ML. For instance, IMU-based sit-to-stand motion metrics were used to develop models, achieving up to 90.4% multiclass accuracy using SVMs and KNNs [9]. Similarly, EHR-based models achieved AUCs above 91% using logistic regression and MLP for multi-level sarcopenia detection [10]. Feature-rich demographic and functional datasets have been used to build web-based risk calculators with robust performance [11]. Deep learning methods using physical fitness scores also yielded over 87% accuracy [12].

2.2. Feature Selection and Interpretability in Clinical ML

Explainability is central to clinical trust in ML. SHAP analysis identified socioeconomic and lifestyle factors as key predictors in KNHANES datasets [13]. Radiomics-based CT image studies employed automated segmentation and radiomic feature extraction for robust sarcopenia classification [14]. Feature selection techniques such as LASSO and ensemble-based methods remain essential [15,16,17].

2.3. Ordinal and Multiclass Classification Approaches

The need to respect the ordered nature of sarcopenia stages has driven the adoption of ordinal classification methods. Severity classification frameworks with risk staging (0–3) showed high accuracy and improved clinical utility [18]. Techniques such as ordinal regression, ensemble ordinal classifiers, and LSTM architectures have demonstrated strong performance [19,20].

2.4. Ensemble and Fusion Learning for Robust Prediction

Combining diverse learners through stacking or voting has proven effective in medical prediction tasks. A study using the RF, SVM, and NN in ensemble fashion achieved 88.5% accuracy, addressing class imbalance using adaptive synthetic sampling [21]. Stacked architectures consistently outperform standalone models across various domains [22,23].

2.5. Data Balancing in Imbalanced Medical Data

Imbalanced class distributions, particularly in severity stages, challenge classification reliability. SMOTE and its variants continue to be effective, particularly when paired with ensemble models [24,25]. These approaches enhance recall for rare classes without significantly sacrificing specificity [26].

2.6. Emerging Modalities and Future Directions

Novel modalities such as oculomics and wearable EMG-based sensors have shown promise for passive, real-time sarcopenia monitoring [27]. Quantum machine learning models have also emerged with slight accuracy gains and computational benefits over classical counterparts [28]. Cross-regional studies emphasize the importance of clinical biomarkers like vitamin D, C-reactive protein (CRP), and folate [29].
Table 1 summarizes recent studies by input modality, model type, prediction scope, and performance metrics.

2.7. Advancing Healthcare Diagnostics with Interpretable Ensemble Deep Learning Models

Recent advancements in healthcare classification tasks have increasingly leveraged ensemble deep learning models to enhance both predictive accuracy and interpretability. For example, ensemble frameworks combining multiple convolutional neural networks (CNNs) such as VGG16, InceptionV3, and EfficientNet have significantly outperformed single-model approaches in diagnosing conditions like breast cancer and lung diseases [33,34]. In the classification of chronic kidney disease, hybrid feature selection techniques integrated with ensemble learning have reduced noise and improved model precision [35]. Moreover, explainability methods such as LIME and Grad-CAM have been effectively incorporated into ensemble systems for leukemia and white blood cell classification, aiding clinical decision-making by visualizing model rationale [36,37]. Similar improvements have been demonstrated in dermatology, where ensemble models achieved high accuracy and AUC scores in classifying skin lesions [38]. These findings collectively highlight the growing value of ensemble deep learning approaches, especially when augmented with interpretability tools, for advancing reliable and explainable healthcare diagnostics.

2.8. Description of Interpretability and Evaluation Metrics

SHAP is a model-agnostic interpretability method derived from cooperative game theory, which attributes to each feature a fair contribution to the prediction based on Shapley values. It has been recently extended to address limitations in complex and high-dimensional domains, enabling both local and global interpretability in diverse applications [39,40]. Cohen’s Kappa coefficient is a statistical measure of inter-rater agreement that corrects for agreement expected by chance, with weighted variants enhancing performance for ordinal or imbalanced data; recent studies have demonstrated its continued relevance in medical and diagnostic reliability contexts [41,42]. The Brier score is a strictly proper scoring rule for evaluating the accuracy and calibration of probabilistic predictions by measuring the mean squared difference between predicted probabilities and actual binary outcomes. Recent developments, such as the Penalized Brier Score, address the metric’s sensitivity to overconfident but incorrect predictions by incorporating additional penalties, thereby improving model comparisons and calibration assessments [43].

3. Materials and Methods

The overall architecture of the proposed model fusion classifier is illustrated in Figure 1. The workflow begins with data collection, followed by a comprehensive preprocessing stage involving robust scaling to standardize feature distributions and SMOTE to address class imbalance issues. Subsequently, feature selection is conducted using LASSO and Random Forest algorithms to extract the most informative attributes. Selected features are then subjected to model fusion, where multiple base learners—Random Forests, Gradient Boosting, and MLPs—are individually optimized through hyperparameter optimization (HPO). Model evaluation ensures performance assessment at each step, while SHAP analysis is integrated to enhance the interpretability of the selected features and model predictions. This structured and iterative process ultimately leads to the development of a robust and interpretable fusion-based classification system (Algorithm 1).
Algorithm 1: Model fusion algorithm (presented as a pseudocode figure in the original article).

3.1. Study Design and Data Description

The dataset utilized in this study was sourced from the Institute of Human Convergence Health Science at Gachon University. Data collection took place over nine months, from 1 September 2022 to 31 May 2023, across multiple community settings in Incheon, including social welfare centers, daycare facilities, and senior welfare centers. This extended timeframe facilitated the collection of a comprehensive and diverse sample from various locations, thereby enhancing the representativeness of the target population. The collected data was subsequently divided into training and testing sets using an 80/20 split.
The present study aimed to predict the severity of sarcopenia using a comprehensive clinical dataset comprising 664 individuals and 97 recorded characteristics. These features included demographic variables, anthropometric measurements, physical performance assessments, and biochemical markers. The target variable, sarcopenia severity, was classified on an ordinal scale (0–3) representing the normal, possible sarcopenia, sarcopenia, and severe sarcopenia categories (Table 2).
Sarcopenia is classified into four levels based on the European Working Group on Sarcopenia in Older People, version 2 guidelines. Normal indicates adequate muscle strength, mass, and performance. Possible sarcopenia is identified by either low muscle strength or poor physical performance. A sarcopenia diagnosis requires reduced muscle mass coupled with either diminished strength or functional performance. The classification of severe sarcopenia occurs when all three criteria—decreased muscle mass, impaired strength, and compromised performance—are simultaneously present [1].
Recent diagnostic guidelines for sarcopenia stratification have established a four-tier classification system, as delineated in Table 2. This hierarchical framework progresses from normal muscular health (Level 0) through possible sarcopenia (Level 1) and confirmed sarcopenia (Level 2) to severe sarcopenia (Level 3), with each successive stage characterized by increasingly stringent diagnostic criteria incorporating appendicular skeletal muscle mass measurements, muscular force capacity assessments, and functional performance metrics. The multiparametric approach reflected in Table 2 facilitates precise clinical categorization and enables targeted intervention strategies appropriate to disease severity.
$$\text{Sarcopenia} = \begin{cases} 0, & \text{Normal: all assessments within the normal range} \\ 1, & \text{Possible: low strength or poor performance (no ASM* yet)} \\ 2, & \text{Sarcopenia: low ASM with low strength or performance} \\ 3, & \text{Severe: low ASM, low strength, and poor performance} \end{cases}$$
ASM*—appendicular skeletal muscle mass.
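For illustration, this staging rule can be written as a small decision function. The sketch below is a minimal rendering of the EWGSOP2-style logic described above, not the authors' implementation; the function name and boolean inputs are hypothetical.

```python
def sarcopenia_stage(low_asm: bool, low_strength: bool, poor_performance: bool) -> int:
    """Map the three EWGSOP2 criteria to the ordinal severity label used in this study.

    Returns 0 (normal), 1 (possible), 2 (sarcopenia), or 3 (severe).
    """
    if low_asm and low_strength and poor_performance:
        return 3  # severe: all three criteria present
    if low_asm and (low_strength or poor_performance):
        return 2  # confirmed sarcopenia: low ASM plus one functional deficit
    if low_strength or poor_performance:
        return 1  # possible sarcopenia: functional deficit without confirmed low ASM
    return 0      # normal: all assessments within the normal range

# Example: low ASM with low grip strength but preserved gait speed -> class 2
print(sarcopenia_stage(low_asm=True, low_strength=True, poor_performance=False))
```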

3.2. Data Preprocessing

Data preprocessing was systematically carried out in several stages. Initially, missing data were addressed using Multiple Imputation by Chained Equations (MICE):
$$X_{\text{miss}}^{(t+1)} = f\left(X_{\text{obs}},\, X_{\text{miss}}^{(t)},\, \theta\right) + \varepsilon$$
where $X_{\text{miss}}$ and $X_{\text{obs}}$ denote the missing and observed data, respectively; $\theta$ represents the model parameters; and $\varepsilon$ denotes the random error term.
The continuous variables were then normalized using robust scaling.
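As an illustration of this preprocessing stage, the sketch below uses scikit-learn's IterativeImputer (a chained-equations imputer in the spirit of MICE) together with RobustScaler. The variable names and synthetic data are placeholders, and the exact MICE implementation used by the authors is not specified.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler

# X_train / X_test: numeric feature matrices with missing values (illustrative stand-ins)
rng = np.random.RandomState(42)
X_train = rng.normal(size=(100, 5))
X_train[rng.rand(100, 5) < 0.1] = np.nan
X_test = rng.normal(size=(20, 5))

# Chained-equations imputation: each feature is iteratively regressed on the others
imputer = IterativeImputer(max_iter=10, random_state=42)
X_train_imp = imputer.fit_transform(X_train)   # fit on training data only to avoid leakage
X_test_imp = imputer.transform(X_test)

# Robust scaling: centre on the median and scale by the IQR, limiting outlier influence
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)
```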

3.3. Feature Engineering

The feature selection strategy integrated clinical expertise and statistical methods. Specifically, LASSO logistic regression was employed.
$$\min_{\beta} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
where $N$ is the number of observations, $y_i$ is the observed class label, $p_i$ is the predicted probability, $\beta$ denotes the regression coefficients, and $\lambda$ is the regularization parameter.
Additionally, feature importance was assessed using Random Forest importance scores, which were calculated as
$$VI(X_j) = \frac{1}{n_{\text{tree}}} \sum_{t=1}^{n_{\text{tree}}} VI_t(X_j)$$
where $VI_t(X_j)$ is the importance of feature $X_j$ in tree $t$.
The features selected from both methods were combined, creating a comprehensive and robust feature set for further modeling.
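The dual-path selection can be sketched as follows, assuming a preprocessed feature matrix. The selection thresholds (non-zero L1 coefficients, above-median impurity importance) and the synthetic data are illustrative choices, not the authors' exact settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled clinical feature matrix (illustrative only)
X, y = make_classification(n_samples=500, n_features=40, n_informative=10, random_state=0)

# Path 1: L1-penalised (LASSO) logistic regression keeps features with non-zero coefficients
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000), threshold=1e-5
).fit(X, y)
lasso_idx = set(np.where(lasso.get_support())[0])

# Path 2: Random Forest mean-decrease-in-impurity importance, keeping above-median features
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
rf_idx = set(np.where(rf.feature_importances_ > np.median(rf.feature_importances_))[0])

# Union of both paths, combining linear (LASSO) and nonlinear (RF) evidence
selected = sorted(lasso_idx | rf_idx)
print(f"{len(selected)} features selected:", selected)
```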

3.4. Model Development

An advanced ordinal classification approach utilizing a stacking ensemble framework was developed, integrating a Random Forest (RF), a Gradient Boosting classifier, and a Multilayer Perceptron (MLP). These models were selected for their complementary strengths.
The stacking ensemble model followed a structured methodology:
  • Data was partitioned into training and testing subsets, which were stratified to preserve class distributions.
  • SMOTE was employed, with
    $$X_{\text{new}} = X_i + \lambda \times (X_{zi} - X_i)$$
    where $X_{zi}$ is randomly chosen from the k-nearest neighbors of $X_i$ and $\lambda \in [0, 1]$.
  • Robust scaling was applied as described earlier.
  • Hyperparameter optimization for each model was meticulously conducted using the GridSearchCV and RandomizedSearchCV methods.
  • Stratified K-Fold cross-validation was applied, with
    $$CV_{\text{error}} = \frac{1}{K} \sum_{k=1}^{K} \text{Error}\left(f^{(-k)}, D_k\right)$$
    where $f^{(-k)}$ denotes the model trained excluding fold $k$ and $D_k$ is the validation fold.
  • Predictions from the base learners were combined using a logistic regression meta-learner (a minimal sketch of this pipeline follows the list), where
    $$p(y = 1 \mid x) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j} \beta_j x_j\right)}}$$
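The sketch below assembles this pipeline with scikit-learn and imbalanced-learn. The synthetic data, hyperparameter grid, and base-learner settings are illustrative (they loosely echo the settings reported in Section 4) and are not a verbatim reproduction of the authors' code.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import RobustScaler

# Synthetic four-class data as a stand-in for the clinical dataset (illustrative only)
X, y = make_classification(n_samples=600, n_features=31, n_informative=12, n_classes=4,
                           n_clusters_per_class=1, weights=[0.5, 0.25, 0.15, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stacking ensemble: base learners feed class probabilities to a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("gb", GradientBoostingClassifier(learning_rate=0.05, max_depth=3, n_estimators=200, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(100,), activation="tanh", max_iter=1000, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=cv,
)

# SMOTE and robust scaling run inside the pipeline, so resampling only ever touches training folds
pipe = Pipeline([("smote", SMOTE(random_state=0)), ("scale", RobustScaler()), ("stack", stack)])

# Example hyperparameter search over one base learner with stratified cross-validation
grid = GridSearchCV(pipe, {"stack__gb__n_estimators": [100, 200]}, cv=cv, scoring="f1_macro")
grid.fit(X_tr, y_tr)
print("Held-out macro F1:", grid.score(X_te, y_te))
```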

3.5. Model Interpretability

Interpretability was enhanced using SHAP, which was calculated as
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right]$$
where $\phi_i$ is the SHAP value for feature $i$, $N$ is the set of all features, and $f_x(S)$ is the model's prediction using only the features in subset $S$.
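As a sketch of how such values can be obtained for a model-agnostic pipeline, the snippet below uses the shap library's KernelExplainer on a fitted stacking model. Here `fitted_stack`, `X_test_scaled`, and `feature_names` are assumed placeholder names, and the authors' exact SHAP configuration is not specified.

```python
import shap

# `fitted_stack`, `X_test_scaled`, and `feature_names` are assumed placeholders for a trained
# stacking pipeline, the preprocessed test matrix, and the selected feature labels.
background = shap.sample(X_test_scaled, 50, random_state=0)  # small background set keeps the kernel method tractable
explainer = shap.KernelExplainer(fitted_stack.predict_proba, background)

sample = X_test_scaled[:100]
# Depending on the shap version, this returns a list of per-class arrays or a 3-D array
# of per-feature attributions for each class.
shap_values = explainer.shap_values(sample)

# Global view: mean |SHAP| per feature, mirroring the kind of summary shown in Figure 6
shap.summary_plot(shap_values, sample, feature_names=feature_names, plot_type="bar")
```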

3.6. Validation and Performance Metrics

Validation employed nested cross-validation, utilizing accuracy, Cohen’s Weighted Kappa, AUROC, and the Brier score.
Accuracy for binary classification was calculated as $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. Accuracy for multiclass classification was calculated as $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{\sum_{i=1}^{C} TP_i}{N}$, where $C$ is the number of classes and $N$ is the total number of predictions.
Cohen’s Weighted Kappa was calculated as $\kappa = \frac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed agreement and $P_e$ is the expected agreement by chance.
AUROC measures the model’s ability to distinguish between different classes by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate at various classification thresholds. The AUROC score ranges from 0 to 1, where a value of 1.0 indicates perfect class separation and 0.5 suggests no better than random guessing. This metric is particularly valuable for imbalanced classification tasks, as it reflects the model’s ranking quality rather than its accuracy.
The Brier Score quantifies the accuracy of probabilistic predictions by measuring the mean squared difference between the predicted probabilities and the actual binary outcomes. It is computed as
$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{p}_i - y_i\right)^2$$
where $\hat{p}_i$ denotes the predicted probability of the positive class for the $i$-th instance and $y_i \in \{0, 1\}$ is the corresponding true label. The score ranges from 0 to 1, with lower values indicating better-calibrated and more accurate probability estimates.

Precision, Recall, and F1 Scores

To evaluate the classification performance beyond accuracy, especially in imbalanced datasets, we also report Precision, Recall, and F1 Scores.
  • Precision (also called the positive predictive value) measures the proportion of correctly predicted positive instances among all instances predicted as positive, with
    $$\text{Precision} = \frac{TP}{TP + FP}$$
    where $TP$ is the number of true positives and $FP$ is the number of false positives.
  • Recall (also known as sensitivity or the true positive rate) measures the proportion of correctly predicted positive instances out of all actual positive instances, with
    $$\text{Recall} = \frac{TP}{TP + FN}$$
    where $FN$ is the number of false negatives.
  • The F1 Score is the harmonic mean of Precision and Recall, providing a balance between the two:
    $$F_1\ \text{Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
    The F1 Score is particularly useful when the dataset is imbalanced, as it considers both false positives and false negatives. A brief computation sketch for these metrics follows below.
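The helper below computes these metrics with scikit-learn; it is an illustrative sketch. The multiclass Brier score uses a common one-hot generalization of the binary formula above, since the exact aggregation used in the study is not specified.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.preprocessing import label_binarize

def evaluate_multiclass(y_true, y_pred, y_proba, classes=(0, 1, 2, 3)):
    """Illustrative helper computing the evaluation metrics reported in this study."""
    y_onehot = label_binarize(y_true, classes=list(classes))
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        "weighted_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "auroc_ovr_macro": roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"),
        # One-hot generalization of the Brier score: mean squared difference between the
        # predicted class probabilities and the one-hot encoded true labels.
        "brier": float(np.mean(np.sum((y_proba - y_onehot) ** 2, axis=1))),
    }
```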

4. Results

4.1. Feature Selection

To enhance model performance and mitigate overfitting, a rigorous feature selection strategy was implemented. Two complementary techniques were utilized: the LASSO and Random Forest feature importance analysis. The LASSO, a regularization-based method, was employed to select features with non-zero coefficients, promoting sparsity and addressing potential multicollinearity among predictors. In parallel, the Random Forest was applied to rank features based on their mean decrease in impurity, effectively capturing complex nonlinear relationships.
Following independent application, the features identified by the LASSO and Random Forest were combined to construct a consolidated subset, thereby leveraging both linear and nonlinear dependencies present in the data. This integrative approach enhanced the robustness and interpretability of the model.
As a result, the feature space was substantially reduced from the 96 candidate predictors (the 97 recorded variables minus the target label) to 31 carefully selected features (Table 3). This reduction in dimensionality improved the model's generalizability, computational efficiency, and predictive accuracy.

4.2. Binary Classification

The best hyperparameters for each model were determined through systematic optimization procedures and are summarized in Table 4. Gradient Boosting achieved optimal performance with a learning rate of 0.05, a maximum depth of three, and 200 estimators, while the Random Forest required deeper trees and more estimators. Simpler models such as the Decision Tree and SVM benefited from minimal parameter tuning. For the MLP, the use of a tanh activation function and a relatively high learning rate contributed to improved convergence. The stacked model combined these individually optimized base learners to enhance generalization performance.
This binary classification task involves two classes, 0 for no sarcopenia and 1 for sarcopenia, to accurately identify individuals at risk of sarcopenia.
The performance comparison of the evaluated models is presented in Table 5. Among all classifiers, Gradient Boosting achieved the highest accuracy (97.74%) and F1 Score (0.97), indicating superior predictive performance. The Random Forest also exhibited strong results with an accuracy of 93.98% and an F1 Score of 0.91. The stacked ensemble model demonstrated robust generalization with an accuracy of 96.24%, a Precision of 0.98, and an F1 Score of 0.94 while maintaining a high Kappa (0.8790) and near-perfect AUROC (0.9969). Although models like the SVM and MLP delivered acceptable performance, they exhibited lower Recall and F1 Scores compared to ensemble approaches. These results confirm that model fusion substantially enhances classification robustness and generalization over individual models (Figure 2).
Figure 3 presents the confusion matrices of all evaluated models, providing a detailed view of classification outcomes for both classes. The stacked model achieved a perfect classification of the negative class (105 true negatives) and a relatively high true positive count (23), misclassifying only 5 positive instances. Gradient Boosting exhibited the most balanced confusion matrix with minimal misclassifications, reflecting its superior overall performance. The Random Forest and AdaBoost also maintained strong classification ability, though they showed similar false positive and false negative patterns. In contrast, the SVM and MLP models had slightly higher misclassification rates, especially among positive instances, indicating their limited ability to capture the minority class distribution. The Decision Tree model, while simple, delivered competitive performance, correctly identifying 23 positive cases but misclassifying 6 negatives. These visual insights reinforce the effectiveness of ensemble methods, particularly the Gradient Boosting and stacked models, in achieving accurate and stable binary classification performance.
To better evaluate model performance beyond standard classification metrics, Cohen’s Kappa and the Brier Score were analyzed for each model, as shown in Table 6. Cohen’s Kappa measures the agreement between predicted and true class labels, adjusted for chance agreement, and is particularly informative in imbalanced classification scenarios. A Kappa value closer to one indicates strong agreement. In this study, Gradient Boosting achieved the highest Kappa (0.9330), reflecting its excellent reliability. The Brier Score, on the other hand, quantifies the accuracy of probabilistic predictions by computing the mean squared difference between predicted probabilities and actual outcomes. A lower Brier Score indicates better-calibrated predictions. Again, Gradient Boosting and the stacked model exhibited the lowest Brier Scores (0.0104 and 0.0243, respectively), confirming their superiority in both classification confidence and probabilistic reliability.

4.3. Multiclass Classification

All components—including the three base learners (the Random Forest, Gradient Boosting, and the Multilayer Perceptron) and the meta-learner (logistic regression)—are explicitly trained as native four-class classifiers, rather than through iterative one-vs-rest binary classification strategies.
Base Learners: Each base model leverages its inherent multiclass capabilities. For instance, RandomForestClassifier, GradientBoostingClassifier, and MLPClassifier natively support multiclass classification. The complete four-dimensional probability distribution for each sample was extracted from each model.
Meta-Learner: The probability vectors from the base learners (3 models × 4 class probabilities = 12-dimensional feature vector) are concatenated to construct the input features for the meta-learner. A LogisticRegression model configured with multi_class = ‘auto’ (corresponding to a multinomial setting) is then trained on the concatenated probability features. At inference time, the meta-learner outputs a final 4-class probability distribution.
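A minimal sketch of this meta-feature construction is given below. Here `rf`, `gb`, `mlp`, and the data splits are assumed to be fitted objects and arrays from earlier steps (hypothetical names); in practice the meta-learner should be trained on out-of-fold base-model probabilities to avoid leakage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# `rf`, `gb`, and `mlp` are assumed fitted base classifiers, and X_meta_train / y_meta_train /
# X_test the corresponding data splits (hypothetical names).
def meta_features(models, X):
    """Concatenate per-class probabilities from each base learner (3 models x 4 classes = 12 columns)."""
    return np.hstack([m.predict_proba(X) for m in models])

base_models = [rf, gb, mlp]
Z_train = meta_features(base_models, X_meta_train)  # ideally out-of-fold probabilities to avoid leakage
Z_test = meta_features(base_models, X_test)

# Multinomial logistic-regression meta-learner over the stacked 12-dimensional probability vectors
meta = LogisticRegression(max_iter=1000).fit(Z_train, y_meta_train)
final_proba = meta.predict_proba(Z_test)   # final 4-class probability distribution
final_pred = final_proba.argmax(axis=1)
```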
The best hyperparameters for each model in the multiclass classification task were identified through extensive grid and randomized search strategies, as summarized in Table 7. The stacked model was constructed by combining the individually optimized Random Forest, Gradient Boosting, and MLP classifiers. The Random Forest achieved optimal performance with 300 estimators and minimal constraints on depth and splitting. Gradient Boosting benefited from a moderate learning rate (0.05), log2 feature selection, and a subsample ratio of 0.8, while the MLP model performed best with a single hidden layer and an initial learning rate of 0.01. Other models, including the SVM, Decision Tree, and AdaBoost, were fine-tuned using their respective hyperparameters to improve generalization.
Table 8 summarizes the performance of all models across four key evaluation metrics. The stacked model achieved the highest performance overall, with an accuracy of 96.99%, an F1 Score of 0.9449, and a strong balance in both Precision and Recall. Gradient Boosting closely followed with a well-balanced profile and an F1 Score of 0.9320. The Random Forest also showed strong predictive power (F1 Score: 0.8977), while AdaBoost and the Decision Tree achieved comparable results, around 87.9% accuracy. On the other hand, the SVM and MLP performed less consistently, particularly in terms of Recall and the F1 Score, suggesting limited robustness in handling class diversity. These results underscore the effectiveness of ensemble learning, particularly the fusion of optimized base models in the stacked configuration (Figure 4).
Figure 5 presents the confusion matrices of all evaluated models in the multiclass classification task. The stacked model demonstrated near-perfect classification, particularly for classes 0 and 1, with only minimal misclassifications observed in classes 2 and 3. Gradient Boosting and the Random Forest also exhibited strong performance, though occasional confusion was noted between closely related classes. The AdaBoost and Decision Tree models showed more frequent misclassifications, especially between classes 0 and 1, as well as classes 2 and 3. The SVM and MLP produced comparatively higher error rates, with more dispersed misclassifications across all classes, indicating challenges in maintaining class separation. These visual results further emphasize the superior generalization ability of ensemble-based models, particularly the stacked and Gradient Boosting configurations.
Table 9 presents the Cohen’s Kappa and Brier Score values for all models evaluated in the multiclass classification task. The stacked model achieved the highest Kappa value (0.9738), indicating near-perfect agreement between the predicted and true classes, while also maintaining the lowest Brier Score (0.0125), reflecting excellent probability calibration. Gradient Boosting followed closely with a Kappa of 0.9380 and a low Brier Score of 0.0256. The Random Forest also demonstrated strong performance, though with slightly higher calibration error. In contrast, the SVM and MLP exhibited lower Kappa values and higher Brier Scores, indicating less stable classification and less reliable probabilistic predictions. These results confirm that ensemble models, particularly the stacked model, deliver superior consistency and calibration in multiclass classification.
To evaluate the contribution of each base learner in the stacking ensemble, we conducted an ablation study by removing one base model at a time and measuring the resulting performance. As shown in Table 10, the complete model (RF + GB + MLP) achieved the highest performance across all metrics, with an accuracy of 0.9699, a macro F1 Score of 0.9449, and a Cohen’s kappa of 0.9738.
Among the pairwise combinations, the GB + MLP configuration yielded the best results, indicating that the Gradient Boosting and MLP models are individually strong and complementary. The performance drop observed in the RF + MLP (macro F1: 0.9240) and RF + GB (macro F1: 0.9320) settings suggests that while each model contributes positively, the synergy of all three base learners is essential for achieving optimal classification performance. This validates the effectiveness of our stacking ensemble strategy and justifies the inclusion of all three models.

4.4. Overfitting

The model exhibited no evidence of overfitting, as demonstrated by the close alignment between 5-fold cross-validation accuracy (0.9905 ± 0.0062) and test accuracy (0.9699). The minimal performance gap (approximately 2%) indicates strong generalization capability to unseen data. Furthermore, the high test macro F1 score (0.9337) and quadratic weighted kappa (0.9589) support the model’s balanced performance across classes and its reliability in predicting the ordinal severity of sarcopenia. These results collectively confirm that the model maintains high predictive accuracy without overfitting.

4.5. Model Interpretability

The SHAP summary plot for the stacked model, as presented in Figure 6, illustrates the impact of each feature on model predictions for multiclass sarcopenia classification. The top three contributors—ASM, SS_SPPB, and weight_kg—showed the highest average influence, with a wide distribution of SHAP values indicating varied effects across patients. Features such as HG_R_M, BMI_kgm2, and D_HG also played significant roles. The color gradient represents feature values, showing how high and low values push predictions toward different sarcopenia classes. These insights underscore the clinical relevance of muscle mass, body composition, and performance measures in model decision-making.

5. Discussion

This study developed a robust, interpretable model fusion approach to predict sarcopenia severity using multiclass classification. Among the evaluated models, the stacked ensemble model integrating a Random Forest, Gradient Boosting, and an MLP outperformed all others with an accuracy of 96.99%, a macro F1 score of 0.9449, and the highest Cohen’s Kappa (0.9738), demonstrating strong agreement with clinical labels. SHAP analysis highlighted ASM, weight_kg, and SS_SPPB as consistently influential features, supporting the model’s clinical validity. The proposed model generalized well across all four severity classes without overfitting and maintained excellent probabilistic calibration, as evidenced by its lowest Brier Score (0.0125).
Despite these promising results, this study has several limitations. First, the dataset was derived from a single population—older Korean adults—which may limit generalizability to other ethnic or age groups. Future validation using multi-center, multi-ethnic cohorts is warranted. Second, while ordinal classification was appropriate for sarcopenia staging, this work employed standard multiclass classifiers rather than dedicated ordinal algorithms. Although the model performed well, specialized ordinal approaches could further enhance performance. Third, although SHAP was used for interpretability, it does not capture temporal dependencies or causal relationships, which could be critical in longitudinal sarcopenia progression analysis.
Despite these limitations, this study presents several novel contributions. To our knowledge, this is the first study to integrate a stacking ensemble with comprehensive SHAP-based interpretability for sarcopenia severity prediction using a clinically rich feature set. The dual-path feature selection (LASSO + Random Forest) provided a balanced blend of linear and nonlinear insights. The approach demonstrated resilience to imbalanced data through SMOTE, while careful hyperparameter tuning and nested cross-validation ensured optimal generalization. Together, these methodological innovations fill key gaps in the existing literature, particularly in moving beyond binary sarcopenia detection toward ordinal severity classification.
While individual components of our methodology—ensemble learning and SHAP-based explanations—have appeared separately in prior classification studies, our work innovatively synthesizes these approaches within the novel context of multiclass sarcopenia severity staging. Specifically, our methodological advancements include applying an ensemble model to a clinically-defined four-tier severity scale for the first time, integrating dual-path feature selection methods (LASSO regression and Random Forest importance) to capture diverse feature relationships, and rigorously validating model calibration and reliability through metrics like Cohen’s Kappa and Brier Score. Together, these contributions significantly enhance both the clinical relevance and methodological rigor of the predictive framework, providing a robust and interpretable tool tailored explicitly for geriatric healthcare applications.
We recognize that our initial analysis did not include direct comparisons with certain advanced state-of-the-art models, such as the widely used ensemble method XGBoost (version 3.0.0). To address this limitation and validate our approach comprehensively, we incorporated XGBoost as an additional benchmark. XGBoost achieved an accuracy of 96.24%, a macro F1 score of 0.9545, and a Cohen’s Kappa of 0.9697, demonstrating competitive performance. Nonetheless, our proposed stacked ensemble model outperformed XGBoost in calibration and agreement, maintaining its position as the most robust and interpretable tool for multiclass sarcopenia severity classification in clinical settings.
These findings have important implications. Clinically, the proposed model offers a non-invasive, data-driven tool to assist in early risk stratification and personalized intervention planning for older adults at various stages of sarcopenia. For researchers, the work highlights the importance of ensemble strategies, robust feature selection, and interpretability in medical AI. Future research should explore real-time implementation, longitudinal validation, and the integration of temporal data (e.g., from wearable sensors) to support dynamic sarcopenia monitoring. Additionally, extending the framework to other progressive musculoskeletal conditions could broaden its impact in geriatric care.

6. Conclusions

This study proposed an effective ensemble learning approach for multiclass classification of sarcopenia severity in older adults, leveraging stacked model fusion with the individually optimized Random Forest, Gradient Boosting, and MLP classifiers. The stacked model demonstrated superior performance across all evaluated metrics, achieving an accuracy of 96.99%, a macro F1 score of 0.9449, and a Cohen’s Kappa of 0.9738, while maintaining strong calibration, as indicated by a low Brier Score. SHAP analysis further confirmed the clinical relevance of key features such as ASM, body weight, and physical performance measures, enhancing model transparency and interpretability.
Importantly, this work moves beyond traditional binary sarcopenia classification by addressing the full spectrum of severity, offering a more nuanced and clinically useful prediction framework. Although limitations related to data generalizability and the absence of ordinal-specific modeling were acknowledged, this study’s strengths—including robust feature selection, model optimization, and explainable AI integration—underscore its contribution to advancing personalized sarcopenia management.
Future work should aim to validate the proposed model across diverse populations, incorporate longitudinal data to enable prediction of disease progression, and develop dynamic updating mechanisms to facilitate real-time sarcopenia monitoring. Furthermore, as AI technologies continue to evolve, integrating the predictive framework with large language models—for example, for clinical note interpretation [44] and agentic artificial intelligence systems [45]—presents a promising direction. Such integration could enhance real-time clinical decision-making, improve patient stratification, and support adaptive health monitoring, particularly within geriatric and musculoskeletal care domains.

Author Contributions

Conceptualization, D.T. and W.K.; methodology, J.K.; software, A.R.; validation, A.R., D.T. and J.K.; formal analysis, W.K.; investigation, D.T.; resources, J.K.; data curation, A.R.; writing—original draft preparation, D.T.; writing—review and editing, W.K.; visualization, A.R.; supervision, W.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Gachon University Research Fund (GCU-202307760001), the Ministry of Education of the Republic of Korea, and the National Research Foundation of Korea (NRF-2022S1A5C2A07090938).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to an ongoing research project.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. List of Features with Descriptions for Sarcopenia Dataset.
ID | Feature | Description
1 | Sarcopenia | Sarcopenia status (0 = Normal, 1 = Possible, 2 = Sarcopenia, 3 = Severe)
2 | Place | Location where data was collected
3 | sex | Gender of the subject
4 | Age | Age of the subject (years)
5 | height_cm | Height of the subject (cm)
6 | weight_kg | Weight of the subject (kg)
7 | SMM_kg | Skeletal Muscle Mass (kg)
8 | BFM_kg | Body Fat Mass (kg)
9 | BMI_kgm2 | Body Mass Index (kg/m2)
10 | Percent_BF | Body fat percentage (%)
11 | BMR_kcal | Basal Metabolic Rate (kcal)
12 | SBP_mmHg | Systolic Blood Pressure (mmHg)
13 | DBP_mmHg | Diastolic Blood Pressure (mmHg)
14 | BP_Stage | Blood Pressure classification stage
15 | Pulse | Pulse rate (beats per minute)
16 | CC_cm | Calf circumference (cm)
17 | Dominant | Dominant side of the body (hand/leg)
18 | ASM | Appendicular Skeletal Muscle mass (kg)
19 | HG_R_1 | Hand grip strength right (first trial) (kg)
20 | HG_L_1 | Hand grip strength left (first trial) (kg)
21 | HG_R_2 | Hand grip strength right (second trial) (kg)
22 | HG_L_2 | Hand grip strength left (second trial) (kg)
23 | HG_R_M | Maximum hand grip strength right (kg)
24 | HG_L_M | Maximum hand grip strength left (kg)
25 | D_HG | Dominant hand grip strength (kg)
26 | ND_HG | Non-dominant hand grip strength (kg)
27 | Plartar_R_1 | Plantar flexion strength right foot (kg)
28 | Plartar_L_1 | Plantar flexion strength left foot (kg)
29 | Dorsal_R_1 | Dorsal flexion strength right foot (kg)
30 | Dorsal_L_1 | Dorsal flexion strength left foot (kg)
31 | D_Plantar | Dominant plantar flexion strength (kg)
32 | D_Dorsal | Dominant dorsal flexion strength (kg)
33 | ND_Plantar | Non-dominant plantar flexion strength (kg)
34 | ND_Dorsal | Non-dominant dorsal flexion strength (kg)
35 | SLS_R | Single leg stance right (seconds)
36 | SLS_L | Single leg stance left (seconds)
37 | D_SLS | Dominant side single leg stance (seconds)
38 | ND_SLS | Non-dominant side single leg stance (seconds)
39 | SLS_MAX | Maximum single leg stance time (seconds)
40 | SS | Sit-to-stand repetitions (30 s)
41 | SS_SPPB | Sit-to-stand time for 5 repetitions (SPPB protocol)
42 | CSR | Chair sit and reach distance (cm)
43 | MWT2 | 2-Minute Walk Test distance (meters)
44 | TUG | Timed Up and Go test (seconds)
45 | Gaitspeed_SPPB | Gait speed from SPPB (m/s)
46 | SPPB | Short Physical Performance Battery total score
47 | G_HG_R | Grade-adjusted hand grip strength right
48 | G_HG_L | Grade-adjusted hand grip strength left
49 | G_BMI | Grade-adjusted Body Mass Index
50 | G_SS | Grade-adjusted sit-to-stand repetitions
51 | G_2MWT | Grade-adjusted 2-min walk test
52 | G_TUG | Grade-adjusted Timed Up and Go test
53 | G_CSR | Grade-adjusted Chair Sit and Reach test
54 | G_D_SLS | Grade-adjusted dominant single-leg stance test
55 | D1_s | Physical and Mental Health domain (adjusted)
56 | D2_s | Locomotion domain (adjusted)
57 | D3_s | Body Composition domain (adjusted)
58 | D4_s | Functionality domain (adjusted)
59 | D5_s | Activities of Daily Living domain (adjusted)
60 | D6_s | Leisure Activities domain (adjusted)
61 | D7_s | Fears domain (adjusted)
62 | SarQoL_Total_s | Total Sarcopenia Quality of Life score (adjusted)
63 | FES | Fall Efficacy Scale score
64 | SARC_F | SARC-F questionnaire score
65 | SARC_CalF | SARC-F questionnaire with calf circumference
66 | Area_city | Residential city or region
67 | Area_Dong | Residential neighborhood/town
68 | Educationlevel | Education level of the participant
69 | Religion | Religion of the participant
70 | Smoking_ | Smoking status
71 | Smoking_d_ | Smoking duration (years)
72 | Smoking_a_ | Smoking amount (packs per day)
73 | Drinking_f_ | Drinking frequency (times per week)
74 | Drinking_d_ | Drinking duration (years)
75 | Drinking_a_ | Drinking amount (glasses per session)
76 | Family | Family type
77 | House | Housing type
78 | Income | Monthly income
79 | Educationlevel_p | Parents' education level
80 | RegularPA_ | Regular physical activity status
81 | TypeofPA_1_ | Type of physical activity 1
82 | TypeofPA_2_ | Type of physical activity 2
83 | TypeofPA_3_ | Type of physical activity 3
84 | TypeofPA_4_ | Type of physical activity 4
85 | TypeofPA_5_ | Type of physical activity 5
86 | TypeofPA_6_ | Type of physical activity 6
87 | FVC | Forced Vital Capacity (liters)
88 | PreFVC | Predicted Forced Vital Capacity (L)
89 | FEV1 | Forced Expiratory Volume in 1 s (L)
90 | PEF | Peak Expiratory Flow (L/min)
91 | MIP_Ave | Average Maximum Inspiratory Pressure (cmH2O)
92 | SAF | Skin autofluorescence (AGEs measurement)
93 | HbA1c | Glycated hemoglobin (HbA1c %)
94 | DM | Diabetes Mellitus status
95 | Hypertension | Hypertension diagnosis
96 | Hyperlipidemia | Hyperlipidemia diagnosis
97 | Sleepdisorder | Sleep disorder status

References

  1. Cruz-Jentoft, A.J.; Sayer, A.A. Sarcopenia. Lancet 2019, 393, 2636–2646. [Google Scholar] [CrossRef]
  2. Petermann-Rocha, F.; Balntzi, V.; Gray, S.R.; Lara, J.; Ho, F.K.; Pell, J.P.; Celis-Morales, C. Global prevalence of sarcopenia and severe sarcopenia: A systematic review and meta-analysis. J. Cachexia Sarcopenia Muscle 2022, 13, 86–99. [Google Scholar] [CrossRef]
  3. Turimov Mustapoevich, D.; Kim, W. Machine learning applications in sarcopenia detection and management: A comprehensive survey. Healthcare 2023, 11, 2483. [Google Scholar] [CrossRef]
  4. Ozgur, S.; Altinok, Y.A.; Bozkurt, D.; Saraç, Z.F.; Akçiçek, S.F. Performance evaluation of machine learning algorithms for sarcopenia diagnosis in older adults. Healthcare 2023, 11, 2699. [Google Scholar] [CrossRef]
  5. Lynch, D.H.; Spangler, H.B.; Franz, J.R.; Krupenevich, R.L.; Kim, H.; Nissman, D.; Zhang, J.; Li, Y.Y.; Sumner, S.; Batsis, J.A. Multimodal diagnostic approaches to advance precision medicine in sarcopenia and frailty. Nutrients 2022, 14, 1384. [Google Scholar] [CrossRef]
  6. Roberts, S.; Collins, P.; Rattray, M. Identifying and managing malnutrition, frailty and sarcopenia in the community: A narrative review. Nutrients 2021, 13, 2316. [Google Scholar] [CrossRef]
  7. Band, S.S.; Yarahmadi, A.; Hsu, C.C.; Biyari, M.; Sookhak, M.; Ameri, R.; Dehzangi, I.; Chronopoulos, A.T.; Liang, H.W. Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Inform. Med. Unlocked 2023, 40, 101286. [Google Scholar] [CrossRef]
  8. Rane, N.; Choudhary, S.; Rane, J. Explainable artificial intelligence (XAI) in healthcare: Interpretable models for clinical decision support. SSRN 2023. [Google Scholar] [CrossRef]
  9. Wang, K.; Zhang, H.; Cheng, C.Y.M.; Chen, M.; Lai, K.W.C.; Or, C.K.; Hu, Y.; Vellaisamy, A.L.R.; Lam, C.L.K.; Xi, N.; et al. Multi-Risk-Level Sarcopenia-Prone Screening via Machine Learning Classification of Sit-to-Stand Motion Metrics from Wearable Sensors. Adv. Intell. Syst. 2025, 2025, 2401120. [Google Scholar] [CrossRef]
  10. Luo, X.; Ding, H.; Broyles, A.; Warden, S.J.; Moorthi, R.N.; Imel, E.A. Using machine learning to detect sarcopenia from electronic health records. Digit. Health 2023, 9, 20552076231197098. [Google Scholar] [CrossRef] [PubMed]
  11. Guo, J.; He, Q.; She, C.; Liu, H.; Li, Y. A machine learning–based online web calculator to aid in the diagnosis of sarcopenia in the US community. Digit. Health 2024, 10, 20552076241283247. [Google Scholar] [CrossRef]
  12. Bae, J.H.; Seo, J.w.; Kim, D.Y. Deep-learning model for predicting physical fitness in possible sarcopenia: Analysis of the Korean physical fitness award from 2010 to 2023. Front. Public Health 2023, 11, 1241388. [Google Scholar] [CrossRef] [PubMed]
  13. Seok, M.; Kim, W.; Kim, J. Machine learning for sarcopenia prediction in the elderly using socioeconomic, infrastructure, and quality-of-life data. Healthcare 2023, 11, 2881. [Google Scholar] [CrossRef]
  14. Chen, X.D.; Chen, W.J.; Huang, Z.X.; Xu, L.B.; Zhang, H.H.; Shi, M.M.; Cai, Y.Q.; Zhang, W.T.; Li, Z.S.; Shen, X. Establish a new diagnosis of sarcopenia based on extracted radiomic features to predict prognosis of patients with gastric cancer. Front. Nutr. 2022, 9, 850929. [Google Scholar] [CrossRef]
  15. Tukhtaev, A.; Turimov, D.; Kim, J.; Kim, W. Feature Selection and Machine Learning Approaches for Detecting Sarcopenia Through Predictive Modeling. Mathematics 2025, 13, 98. [Google Scholar] [CrossRef]
  16. Ghosh, P.; Azam, S.; Jonkman, M.; Karim, A.; Shamrat, F.J.M.; Ignatious, E.; Shultana, S.; Beeravolu, A.R.; De Boer, F. Efficient prediction of cardiovascular disease using machine learning algorithms with relief and LASSO feature selection techniques. IEEE Access 2021, 9, 19304–19326. [Google Scholar] [CrossRef]
  17. Wang, J.; Xu, Y.; Liu, L.; Wu, W.; Shen, C.; Huang, H.; Zhen, Z.; Meng, J.; Li, C.; Qu, Z.; et al. Comparison of LASSO and random forest models for predicting the risk of premature coronary artery disease. BMC Med. Inform. Decis. Mak. 2023, 23, 297. [Google Scholar] [CrossRef] [PubMed]
  18. Wang, K.; Zhang, H.; Cheng, C.Y.M.; Chen, M.; Lai, K.W.C.; Or, C.K.; Chen, Y.; Hu, Y.; Vellaisamy, A.L.R.; Lam, C.L.K.; et al. High Accuracy Machine Learning Model for Sarcopenia Severity Diagnosis based on Sit-to-stand Motion Measured by Two Micro Motion Sensors. medRxiv 2023. [Google Scholar] [CrossRef]
  19. Ayllón-Gavilán, R.; Guijo-Rubio, D.; Gutiérrez, P.A.; Bagnall, A.; Hervás-Martínez, C. Convolutional-and Deep Learning-Based Techniques for Time Series Ordinal Classification. IEEE Trans. Cybern. 2024, 55, 537–549. [Google Scholar] [CrossRef]
  20. Xu, W.; Gao, Y.; Wang, Y.; Guan, J. Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks. BMC Bioinform. 2021, 22, 485. [Google Scholar] [CrossRef]
  21. Zakariah, M.; AlQahtani, S.A.; Al-Rakhami, M.S. Machine learning-based adaptive synthetic sampling technique for intrusion detection. Appl. Sci. 2023, 13, 6504. [Google Scholar] [CrossRef]
  22. Hittawe, M.M.; Harrou, F.; Togou, M.A.; Sun, Y.; Knio, O. Time-series weather prediction in the Red sea using ensemble transformers. Appl. Soft Comput. 2024, 164, 111926. [Google Scholar] [CrossRef]
  23. Bin Tareaf, R.; Korga, A.M.; Wefers, S.; Hanken, K. Direct vs. Cross-Validated Stacking in Ensemble Learning: Evaluating the Trade-Off between Inference Time and Generalizability on Fashion-MNIST. In Proceedings of the 2024 16th International Conference on Machine Learning and Computing, Shenzhen, China, 2–5 February 2024; pp. 453–461. [Google Scholar]
  24. Imani, M.; Beikmohammadi, A.; Arabnia, H.R. Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels. Technologies 2025, 13, 88. [Google Scholar] [CrossRef]
  25. Salehi, A.R.; Khedmati, M. A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data. Sci. Rep. 2024, 14, 5152. [Google Scholar] [CrossRef]
  26. Cullerne Bown, W. Sensitivity and specificity versus precision and recall, and related dilemmas. J. Classif. 2024, 41, 402–426. [Google Scholar] [CrossRef]
  27. Jung, D.; Lee, D.; Nguyen, Q.H.N.; Kim, J.; Mun, K.R. A Machine Learning Approach to Predict the Risk of Sarcopenia. In Proceedings of the 2023 IEEE EMBS Special Topic Conference on Data Science and Engineering in Healthcare, Medicine and Biology, St. Julians, Malta, 7–9 December 2023; pp. 47–48. [Google Scholar]
  28. Taghandiki, K. Quantum Machine Learning Unveiled: A Comprehensive Review. J. Eng. Appl. Res. 2024, 1, 29–48. [Google Scholar]
  29. Zupo, R.; Moroni, A.; Castellana, F.; Gasparri, C.; Catino, F.; Lampignano, L.; Perna, S.; Clodoveo, M.L.; Sardone, R.; Rondanelli, M. A machine-learning approach to target clinical and biological features associated with sarcopenia: Findings from northern and southern Italian aging populations. Metabolites 2023, 13, 565. [Google Scholar] [CrossRef]
  30. Turimov, D.; Kim, W. Enhancing Sarcopenia Prediction Through an Ensemble Learning Approach: Addressing Class Imbalance for Improved Clinical Diagnosis. Mathematics 2025, 13, 26. [Google Scholar] [CrossRef]
  31. Kim, B.R.; Yoo, T.K.; Kim, H.K.; Ryu, I.H.; Kim, J.K.; Lee, I.S.; Kim, J.S.; Shin, D.H.; Kim, Y.S.; Kim, B.T. Oculomics for sarcopenia prediction: A machine learning approach toward predictive, preventive, and personalized medicine. EPMA J. 2022, 13, 367–382. [Google Scholar] [CrossRef]
  32. Ullah, U.; Maheshwari, D.; Castillo Olea, C.; Garcia Zapirain, B. Sarcopenia risk prediction and feature selection by using quantum machine learning algorithms. Quantum Mach. Intell. 2024, 6, 80. [Google Scholar] [CrossRef]
  33. Nemade, V.; Pathak, S.; Dubey, A.K. Deep learning-based ensemble model for classification of breast cancer. Microsyst. Technol. 2024, 30, 513–527. [Google Scholar] [CrossRef]
  34. Morsy, S.; Abd-Elsalam, N.; Kandil, A.; Elbialy, A.; Youssef, A.B. A deep learning ensemble framework for robust classification of lung ultrasound patterns: Covid-19, pneumonia, and normal. Int. J. Adv. Intell. Inform. 2025, 11, 1966. [Google Scholar] [CrossRef]
  35. Yogesh, N.; Shrinivasacharya, P.; Naik, N. Novel statistically equivalent signature-based hybrid feature selection and ensemble deep learning LSTM and GRU for chronic kidney disease classification. PeerJ Comput. Sci. 2024, 10, e2467. [Google Scholar] [CrossRef]
  36. Deshpande, N.M.; Gite, S.; Pradhan, B. Explainable AI for binary and multi-class classification of leukemia using a modified transfer learning ensemble model. Int. J. Smart Sens. Intell. Syst. 2024, 17, 1–20. [Google Scholar] [CrossRef]
  37. Panthakkan, A.; Anzar, S.; Mansoor, W.; Al Ahmad, H. A new frontier in hematology: Robust deep learning ensembles for white blood cell classification. Biomed. Signal Process. Control 2025, 100, 106995. [Google Scholar] [CrossRef]
  38. Selvaraj, K.M.; Gnanagurusubbiah, S.; Roy, R.R.R.; Balu, S. Enhancing skin lesion classification with advanced deep learning ensemble models: A path towards accurate medical diagnostics. Curr. Probl. Cancer 2024, 49, 101077. [Google Scholar] [CrossRef]
  39. Li, M.; Sun, H.; Huang, Y.; Chen, H. Shapley value: From cooperative game to explainable artificial intelligence. Auton. Intell. Syst. 2024, 4, 2. [Google Scholar] [CrossRef]
  40. Franceschi, L.; Donini, M.; Archambeau, C.; Seeger, M. Explaining probabilistic models with distributional values. arXiv 2024, arXiv:2402.09947. [Google Scholar] [CrossRef]
  41. Madadizadeh, F.; Ghafari, H.; Bahariniya, S. Kappa Statistics: A Method of Measuring Agreement in Dental Examinations. Open Public Health J. 2023, 16, e18749445259818. [Google Scholar]
  42. Li, M.; Gao, Q.; Yu, T. Kappa statistic considerations in evaluating inter-rater reliability between two raters: Which, when and context matters. BMC Cancer 2023, 23, 799. [Google Scholar] [CrossRef]
  43. Yang, W.; Jiang, J.; Schnellinger, E.M.; Kimmel, S.E.; Guo, W. Modified Brier score for evaluating prediction accuracy for binary outcomes. Stat. Methods Med Res. 2022, 31, 2287–2296. [Google Scholar] [CrossRef] [PubMed]
  44. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  45. Zou, J.; Topol, E.J. The rise of agentic AI teammates in medicine. Lancet 2025, 405, 457. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of the model fusion classifier, involving data preprocessing, feature selection, model fusion with optimized base learners, and SHAP-based interpretability.
Figure 2. Comparison of classification performance metrics across different models. The figure illustrates Accuracy (orange), Precision (blue), Recall (red), and the F1 Score (green) for the stacked model, Gradient Boosting, the Random Forest, AdaBoost, the Decision Tree, the MLP, and SVM classifiers.
Figure 3. Confusion matrices of all evaluated models. Each confusion matrix presents the number of true positives, false positives, true negatives, and false negatives for the models.
Figure 4. Comparison of accuracy, Precision, Recall, and the F1 Score across different models for the multiclass classification task.
Figure 5. Confusion matrices of all models for the multiclass classification task, illustrating classification performance across four classes.
Figure 6. SHAP summary plot showing the feature impact distribution for the stacked model in multiclass sarcopenia prediction.
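A summary plot of this kind can be produced with the shap library. The snippet below is a hedged sketch that applies model-agnostic Kernel SHAP to the stacked model's predict_proba; the fitted `stack`, the standardized arrays, and the background sample size are assumed to come from the pipeline sketch above and are not the authors' exact settings.
```python
import shap

# Model-agnostic Kernel SHAP on the stacked model's class probabilities.
background = shap.sample(X_train_bal, 100, random_state=42)   # small background set keeps runtime low
explainer = shap.KernelExplainer(stack.predict_proba, background)
shap_values = explainer.shap_values(X_test_std[:200])          # explain a subset of test samples

# Aggregated feature impact across the four severity classes; the exact plot style
# depends on the shap version and the shape of shap_values.
shap.summary_plot(shap_values, X_test_std[:200], feature_names=list(X.columns))
```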
Table 1. Taxonomy of sarcopenia ML studies.
Study | Input Modality | Model(s) | Task Type | Severity Considered | Accuracy (%)
[10] | Electronic Health Records (EHR) | LR, MLP, and SVM | Multiclass | Yes | 91.4
[11] | Demographics | XGBoost | Binary | No | 85.2
[12] | Fitness Scores | DNN | Binary | Yes | 87.5
[27] | Electromyography (EMG) | LSTM | Multiclass | Yes | 94.4
[13] | Socioeconomic/Quality Measures | RF and LightGBM | Binary | No | ∼80
[30] | Clinical Vitals | RF, SVM, and NN (Ensemble) | Binary | No | 88.5
[31] | Oculomics | XGBoost | Binary | No | 75.1
[29] | Laboratory Biomarkers | RF and LR | Binary | Yes | -
[32] | Clinical + Categorical Features | Quantum SVM and RF | Binary | Yes | 76.7
[18] | IMU (Sit-to-Stand) | Ensemble Stack | Multiclass | Yes | 90.4
Proposed Model | Clinical Vitals | RF, GB, and MLP | Multiclass | Yes | 96.9
Table 2. Sarcopenia classification criteria based on clinical assessments.
Sarcopenia Level | Description | Criteria Based on Assessment
0—Normal | No evidence of sarcopenia | All parameters remain within normal ranges: muscular force capacity, functional mobility metrics, and quantitative appendicular muscle mass measurements exceed threshold values.
1—Possible Sarcopenia | Early signs detected mainly in primary care settings | Diminished muscular force (hand dynamometry: males < 28 kg and females < 18 kg) or compromised functional capacity (chair-rise pentad ≥ 12 s). Quantitative muscle mass evaluation not mandatory at this diagnostic stage.
2—Sarcopenia | Confirmed sarcopenia diagnosis | Reduced appendicular skeletal musculature combined with either insufficient grip force (males < 28 kg, females < 18 kg) or suboptimal mobility parameters (6-meter ambulatory velocity < 1.0 m/s, chair-rise pentad ≥ 12 s, or abbreviated physical function index ≤ 9).
3—Severe Sarcopenia | Advanced sarcopenia condition | Concurrent manifestation of all diagnostic indicators: insufficient appendicular muscle volume with compromised muscular force and deteriorated functional performance metrics (triple-domain deficiency syndrome) [1].
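For illustration only, the Table 2 criteria can be expressed as a simple labeling rule. The function below is not the authors' code; the argument names are hypothetical, while the thresholds follow the table.
```python
def sarcopenia_level(sex, grip_kg, chair_rise_s, gait_speed_ms, sppb, low_asm):
    """Return 0 = Normal, 1 = Possible, 2 = Sarcopenia, 3 = Severe (illustrative rule)."""
    low_grip = grip_kg < (28 if sex == "male" else 18)
    low_performance = (gait_speed_ms < 1.0) or (chair_rise_s >= 12) or (sppb <= 9)

    if low_asm and low_grip and low_performance:
        return 3  # all three domains affected
    if low_asm and (low_grip or low_performance):
        return 2  # confirmed sarcopenia
    if low_grip or chair_rise_s >= 12:
        return 1  # possible sarcopenia; muscle mass not required at this stage
    return 0      # normal


print(sarcopenia_level("female", grip_kg=16, chair_rise_s=13,
                       gait_speed_ms=0.9, sppb=8, low_asm=True))   # -> 3
```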
Table 3. Descriptions of selected features.
Feature | Meaning
HG_R_M | Maximum handgrip strength of the right hand.
HG_L_2 | Second handgrip strength trial for the left hand.
TUG | Timed up-and-go test (mobility and balance assessment).
D_HG | Dominant handgrip strength.
Age | Age of the subject in years.
HG_R_1 | First handgrip strength trial for the right hand.
SS_SPPB | Sit-to-stand and Short Physical Performance Battery total score.
BFM_kg | Body fat mass in kilograms.
ND_HG | Non-dominant handgrip strength.
G_BMI | Grade of the body mass index category.
SPPB | Short Physical Performance Battery score.
BMR_kcal | Basal metabolic rate in kilocalories.
G_SS | Grade of the sit-to-stand score category.
ASM | Appendicular skeletal muscle mass.
Smoking_a_ | Smoking status.
HG_R_2 | Second handgrip strength trial for the right hand.
SMM_kg | Skeletal muscle mass in kilograms.
CC_cm | Calf circumference in centimeters.
sex | Biological sex of the participant.
G_HG_L | Grade of handgrip strength for the left hand.
BMI_kgm2 | Body mass index in kg/m2.
Educationlevel | Highest level of education attained.
D3_s | Body composition domain.
G_HG_R | Grade of handgrip strength for the right hand.
G_TUG | Grade of the timed up-and-go category.
HG_L_M | Maximum handgrip strength of the left hand.
SS | Sit-to-stand repetitions (30 s).
D_Plantar | Plantar flexion strength or delay.
Sleepdisorder | Presence of a diagnosed sleep disorder.
Dorsal_R_1 | First dorsal flexion strength measurement on the right.
weight_kg | Body weight in kilograms.
Refer to Table A1.
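As a hedged illustration of how a reduced feature list such as this one might be obtained, the sketch below combines non-zero LASSO coefficients with Random Forest importances as two complementary rankings; the exact selection procedure, the cut-off, and the treatment of the ordinal severity label as a numeric target for the LASSO step are assumptions, not the authors' method.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# X_train / y_train come from the pipeline sketch; the 0-3 label is treated as numeric here.
X_std = StandardScaler().fit_transform(X_train)
lasso = LassoCV(cv=5, random_state=42).fit(X_std, y_train)
lasso_keep = set(X.columns[np.abs(lasso.coef_) > 1e-6])

rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
top_k = 30                                               # illustrative cut-off
rf_keep = set(X.columns[np.argsort(rf.feature_importances_)[::-1][:top_k]])

selected_features = sorted(lasso_keep | rf_keep)
print(len(selected_features), "features retained")
```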
Table 4. Best hyperparameters obtained during model optimization for binary classification.
Model | Best Parameters
Stacked Model | Ensemble of optimized base models
Gradient Boosting | {‘learning_rate’: 0.05, ‘max_depth’: 3, ‘n_estimators’: 200}
Random Forest | {‘max_depth’: None, ‘min_samples_split’: 2, ‘n_estimators’: 300}
AdaBoost | {‘n_estimators’: 200, ‘learning_rate’: 1.0}
Decision Tree | {‘max_depth’: None, ‘min_samples_split’: 2}
MLP | {‘activation’: tanh, ‘alpha’: 0.0001, ‘hidden_layer_sizes’: (50), ‘learning_rate_init’: 0.01}
SVM | {‘kernel’: rbf, ‘gamma’: 0.001, ‘C’: 100.0}
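These settings map directly onto scikit-learn constructors, as in the sketch below; the max_iter and probability arguments are added for convergence and probability output and are assumptions outside the reported search.
```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

binary_models = {
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.05, max_depth=3, n_estimators=200),
    "Random Forest": RandomForestClassifier(max_depth=None, min_samples_split=2, n_estimators=300),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, learning_rate=1.0),
    "Decision Tree": DecisionTreeClassifier(max_depth=None, min_samples_split=2),
    "MLP": MLPClassifier(activation="tanh", alpha=0.0001, hidden_layer_sizes=(50,),
                         learning_rate_init=0.01, max_iter=1000),
    "SVM": SVC(kernel="rbf", gamma=0.001, C=100.0, probability=True),
}
```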
Table 5. Performance comparison of different models for binary classification.
# | Model | Accuracy | Precision | F1 Score | Recall
1 | Proposed Model | 0.9624 | 0.9800 | 0.9400 | 0.9100
2 | Gradient Boosting | 0.9774 | 0.9600 | 0.9700 | 0.9700
3 | Random Forest | 0.9398 | 0.9200 | 0.9100 | 0.9000
4 | AdaBoost | 0.9398 | 0.8846 | 0.8519 | 0.8214
5 | Decision Tree | 0.9173 | 0.8700 | 0.8800 | 0.8800
6 | MLP | 0.9098 | 0.8800 | 0.8600 | 0.8400
7 | SVM | 0.9023 | 0.8600 | 0.8500 | 0.8500
Bold values indicate best-performing models in key metrics.
Table 6. Comparison of Cohen’s Kappa and Brier Scores across models.
Model | Kappa | Brier Score
Stacked Model | 0.8790 | 0.0243
Gradient Boosting | 0.9330 | 0.0104
Random Forest | 0.8142 | 0.0484
AdaBoost | 0.8142 | 0.1575
Decision Tree | 0.7544 | 0.0827
MLP | 0.7136 | 0.0747
SVM | 0.7021 | 0.0768
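Both metrics are available in scikit-learn. The sketch below shows one way to reproduce this kind of comparison for a binary task; synthetic data stand in for the clinical features so the snippet runs on its own, and `binary_models` refers to the dictionary from the Table 4 sketch.
```python
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data as a stand-in for the clinical features.
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

for name, model in binary_models.items():       # binary_models from the Table 4 sketch
    model.fit(X_tr, y_tr)
    p_pos = model.predict_proba(X_te)[:, 1]     # predicted probability of the positive class
    print(f"{name:>17s}  Kappa={cohen_kappa_score(y_te, model.predict(X_te)):.4f}  "
          f"Brier={brier_score_loss(y_te, p_pos):.4f}")
```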
Table 7. Best hyperparameters for each model used in the multiclass classification task.
# | Model | Best Parameters
1 | Stacked (RF + GB + MLP) | Optimized RF, GB, and MLP base models
2 | Random Forest | {‘max_depth’: None, ‘min_samples_leaf’: 1, ‘min_samples_split’: 5, ‘n_estimators’: 300}
3 | Gradient Boosting | {‘subsample’: 0.8, ‘n_estimators’: 300, ‘max_features’: ‘log2’, ‘max_depth’: 7, ‘learning_rate’: 0.05}
4 | MLP | {‘alpha’: 0.0001, ‘hidden_layer_sizes’: (50), ‘learning_rate_init’: 0.01}
5 | SVM | {‘C’: 10, ‘gamma’: ‘scale’, ‘kernel’: ‘rbf’}
6 | Decision Tree | {‘max_depth’: None, ‘min_samples_leaf’: 1, ‘min_samples_split’: 2}
7 | AdaBoost | {‘learning_rate’: 0.01, ‘n_estimators’: 50}
All parameters were selected based on the best performance through a cross-validated grid/random search.
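The sketch below illustrates such a cross-validated search for the Gradient Boosting learner; the grid mirrors the reported values plus nearby alternatives and is illustrative rather than the authors' exact search space, and the macro-F1 scoring choice is an assumption.
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200, 300],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_macro",          # macro-averaged F1 for the imbalanced multiclass task
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_bal, y_train_bal)    # balanced training data from the pipeline sketch
print("Best parameters:", search.best_params_)
```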
Table 8. Comparison of model performance across key metrics in the multiclass classification task.
Model | Accuracy | Precision | Recall | F1 Score
Proposed Model | 0.9699 | 0.9600 | 0.9400 | 0.9449
Gradient Boosting | 0.9474 | 0.9367 | 0.9313 | 0.9320
Random Forest | 0.9173 | 0.9034 | 0.8965 | 0.8977
AdaBoost | 0.8797 | 0.8840 | 0.8675 | 0.8646
Decision Tree | 0.8797 | 0.8800 | 0.8890 | 0.8837
SVM | 0.8195 | 0.7981 | 0.7510 | 0.7704
MLP | 0.8045 | 0.7568 | 0.7427 | 0.7485
Table 9. Comparison of Cohen’s Kappa and Brier Scores across models for the multiclass classification task.
Model | Kappa | Brier Score
Stacked Model | 0.9738 | 0.0125
Gradient Boosting | 0.9380 | 0.0256
Random Forest | 0.9106 | 0.0407
AdaBoost | 0.8680 | 0.0602
Decision Tree | 0.8740 | 0.0602
SVM | 0.7972 | 0.0671
MLP | 0.7749 | 0.0901
Higher Kappa values indicate stronger classification agreement, while lower Brier Scores reflect better-calibrated probability estimates.
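Cohen's Kappa extends to the multiclass setting directly, while the Brier Score requires a one-hot generalization. The sketch below shows one common form of that generalization; normalization conventions differ, so the result may deviate from the reported values by a constant factor, and `stack`, X_test_std, and y_test are assumed from the pipeline sketch.
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.preprocessing import label_binarize

def multiclass_brier(y_true, proba, classes):
    """Mean squared difference between predicted probabilities and one-hot labels."""
    onehot = label_binarize(y_true, classes=classes)
    return float(np.mean(np.sum((proba - onehot) ** 2, axis=1)))

proba = stack.predict_proba(X_test_std)         # stacked model from the pipeline sketch
kappa = cohen_kappa_score(y_test, stack.predict(X_test_std))
brier = multiclass_brier(y_test, proba, classes=[0, 1, 2, 3])
print(f"Kappa={kappa:.4f}  multiclass Brier={brier:.4f}")
```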
Table 10. Ablation study of base learners in the stacking ensemble.
Base Learner Combination | Accuracy | Macro F1 | Cohen’s Kappa
RF + GB + MLP (Full Model) | 0.9699 | 0.9449 | 0.9738
RF + GB | 0.9474 | 0.9320 | 0.9384
RF + MLP | 0.9474 | 0.9240 | 0.9285
GB + MLP | 0.9624 | 0.9337 | 0.9589
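The ablation can be reproduced by re-fitting the stacking ensemble on each subset of base learners, as sketched below; the logistic-regression meta-learner and the hyperparameter choices remain assumptions carried over from the earlier sketches.
```python
from itertools import combinations
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from sklearn.neural_network import MLPClassifier

base = {
    "RF": RandomForestClassifier(n_estimators=300, random_state=42),
    "GB": GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), learning_rate_init=0.01,
                         max_iter=1000, random_state=42),
}

for r in (3, 2):                                 # full ensemble first, then every pair
    for combo in combinations(base, r):
        model = StackingClassifier(
            estimators=[(k, base[k]) for k in combo],
            final_estimator=LogisticRegression(max_iter=1000), cv=5)
        model.fit(X_train_bal, y_train_bal)      # data from the pipeline sketch
        y_hat = model.predict(X_test_std)
        print(" + ".join(combo),
              f"Acc={accuracy_score(y_test, y_hat):.4f}",
              f"MacroF1={f1_score(y_test, y_hat, average='macro'):.4f}",
              f"Kappa={cohen_kappa_score(y_test, y_hat):.4f}")
```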
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
