Article

DeepSarc-US: A Deep Learning Framework for Assessing Sarcopenia Using Ultrasound Images

1 Department of Electrical and Computer Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
2 Division of Experimental Medicine, McGill University, Montreal, QC H3A 0C2, Canada
3 Division of Cardiology and Centre for Clinical Epidemiology, Jewish General Hospital, McGill University, Montreal, QC H3A 0C2, Canada
4 School of Health, Concordia University, Montreal, QC H3G 1M8, Canada
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(15), 6726; https://doi.org/10.3390/app14156726
Submission received: 5 June 2024 / Revised: 23 July 2024 / Accepted: 24 July 2024 / Published: 1 August 2024
(This article belongs to the Special Issue Current Updates on Ultrasound for Biomedical Applications)

Abstract

Sarcopenia, the age-related loss of skeletal muscle mass, is a core component of frailty that is associated with functional decline and adverse health events in older adults. Unfortunately, the available tools to diagnose sarcopenia are often inaccessible or not user-friendly for clinicians. Point-of-care ultrasound (US) is a promising tool that has been used to image the quadriceps muscle and measure its thickness (QMT) as a diagnostic criterion for sarcopenia. This measurement can be challenging for clinicians, especially when performed at the bedside using handheld systems or phased-array probes not designed for this use case. In this paper, we sought to automate this measurement using deep learning methods to improve its accuracy, reliability, and speed in the hands of untrained clinicians. In the proposed framework, which aids in better training, particularly when limited data are available, convolutional and transformer-based deep learning models with generic or data-driven pre-trained weights were compared. We evaluated regression (QMT as a continuous output in cm) and classification (QMT as an ordinal output in 0.5 cm bins) approaches, and in the latter, activation maps were generated to interpret the anatomical landmarks driving the model predictions. Finally, we evaluated a segmentation approach to derive QMT. The results showed that both transformer-based models and convolutional neural networks benefit from the proposed framework in estimating QMT. Additionally, the activation maps highlighted the interface between the femur bone and the quadriceps muscle as a key anatomical landmark for accurate predictions. The proposed framework is a pivotal step to enable the application of US-based measurement of QMT in large-scale clinical studies seeking to validate its diagnostic performance for sarcopenia, alone or with ancillary criteria assessing muscle quality or strength. We believe that implementing the proposed framework will empower clinicians to conveniently diagnose sarcopenia in clinical settings and accordingly personalize the care of older patients, leading to improved patient outcomes and a more efficient allocation of healthcare resources.

1. Introduction

Frailty syndrome is a growing public health concern, especially among older adults with multiple medical conditions, in whom it is associated with high rates of disability, morbidity, mortality, and healthcare resource use [1]. Frailty is highly prevalent in cardiac patients, affecting 25–50% of those above age 70 and being associated with adverse outcomes in the settings of coronary disease, valvular disease, heart failure, and arrhythmia, to name a few [2]. A large body of evidence has shown the negative impact of frailty following invasive procedures like cardiac surgery [2,3]. As such, cardiologists and cardiac surgeons have embraced the concept of frailty to better characterize their older patients, predict risk, and guide treatment decisions. This is crucial to ensuring benefits for patients and avoiding costly yet futile procedures. In addition, it is helpful to prepare patients before and after cardiac surgery through cardiac rehabilitation, exercise programs, nutritional supplementation, and comprehensive geriatric interventions. Consequently, proactive detection of frailty in cardiac patients may allow for the deployment of cost-effective, easy-to-implement preventive health measures that have been shown to improve clinical and patient-reported outcomes [4,5]. There are multiple ways to operationalize frailty and measure it in the clinical setting [6,7,8]. One of the biggest challenges is the lack of clinician-friendly tools to measure frailty in acutely ill patients who are otherwise unable to perform the usual physical performance tests and questionnaires, especially those seen in the emergency department and cardiac intensive care unit. A proposed solution has been to measure muscle mass and quality as an objective biomarker of frailty, which does not require any patient effort to acquire and can be used in all patients, regardless of acuity [9].
The age-related loss of muscle mass and quality is known as sarcopenia, and it is one of the cornerstones of frailty syndrome [10,11]. Sarcopenia has been defined as a systemic condition that reflects the functional and physical aspects of aging; thus, its assessment in a clinical environment can provide useful information about the patient’s underlying frailty [12]. When evaluating sarcopenia, various physical tests and questionnaires are available, including assessments of hand grip strength, muscle mass, clinical frailty scale (CFS), frailty index (FI), and others [13,14]. Measurement of muscle mass can be achieved using a variety of imaging modalities, including computed tomography (CT), magnetic resonance imaging (MRI), dual X-ray absorptiometry, or ultrasound (US) [15,16,17]. Additionally, measurement of intramuscular fat and inflammation (indicative of muscle quality) can be achieved using many of these modalities. US, compared to other modalities, has the major advantage of being portable and feasible at the patient’s bedside, which is particularly attractive for acutely ill patients that cannot otherwise be electively transported for such tests. In addition, US is a non-invasive, portable, cost-effective, and safe imaging modality that is widely used in medicine [18]. The evidence for using US to measure the quality and quantity of a variety of muscle groups is compelling [17,19,20]. Unfortunately, US has noteworthy drawbacks such as a low inherent signal-to-noise ratio, low contrast to differentiate muscle from adjacent soft tissues, operator-dependent acquisition of images, and the need for significant time and training to quantitatively measure muscle thickness and intramuscular fat. Progress is needed to overcome these drawbacks before US can become a mainstream tool for the assessment of sarcopenia by clinicians.
Recent developments in image processing techniques such as deep learning (DL), particularly convolutional neural networks (CNNs) and transformers, have been instrumental in automating and standardizing the analysis of medical images and deriving informative features not otherwise apparent to the human eye. Blanc-Durand et al. developed a DL method that allowed for the automated and reliable quantification of skeletal muscle mass for sarcopenia assessment from CT images, which may be integrated into a clinical workflow [15]. Bian et al. used DL methods to optimize and improve cardiac US images in hospitalized patients with chronic heart failure (CHF). Their findings further revealed a correlation between CHF and sarcopenia [21]. Pintelas et al. used an autoencoder DL method to condense valuable information from multi-frame US muscle images, followed by a classification strategy for the diagnosis of sarcopenia [22]. In a similar study, Sobral et al. compared several DL methods to diagnose sarcopenia and differentiate it from normal muscle tissue [23]. Yet another study used DL methods to segment muscle tissue from US images [24].
ConvNext is an advanced CNN model showcasing notable performance in various computer vision applications [25]. It leverages a sophisticated architecture, incorporating deep convolutional layers that enable it to capture intricate patterns and features within images. The model benefits from its ability to automatically learn hierarchical representations, making it well-suited for complex image recognition tasks. ConvNext has demonstrated competitive accuracy and efficiency, making it a valuable asset in the realm of DL-based image analysis, specifically in US image analysis [26,27,28].
Current vision-transformer (ViT) models, inspired by natural language processing (NLP) studies, have shown promising performances compared to CNNs [29]. ViT-based models do not have the inductive locality bias of CNNs, which makes CNNs less effective at modeling long-range dependencies. Instead, ViT-based models take advantage of their data-driven self-attention mechanism, which helps them to better understand the contextual information derived from not only the region of interest but also its surroundings [29,30,31]. Despite all the aforementioned benefits compared to CNNs, these models are data-hungry, and their performance can be limited by the size of the training dataset. To tackle this limitation of ViT-based models, He et al. proposed the masked auto-encoder (MAE), a self-supervised learning approach that includes image inpainting [30]. The MAE paradigm is a self-supervised technique for ViT-based models that enables the network to learn useful information by predicting masked targets. MAE has shown potential for faster training and better generalization of ViT-based models. In ViT models, feature maps are created from a single low-resolution image by adopting a fixed-scale windowing step. Liu et al. [32,33] proposed Swin Transformer V2 (SwinT), which builds hierarchical feature maps by adopting a non-overlapped shifted windowing step in the encoder of the vanilla ViT (i.e., the ViT model proposed in [29]) and is capable of training with images of up to 1536 × 1536 pixels. SwinT has shown capabilities for learning functional dependencies between features.
ViT-based models have demonstrated superior performance over simple CNN-based models in various computer vision tasks as well as US image analysis [34,35,36,37,38,39,40,41,42]. However, ConvNext, employing a convolutional architecture, has surpassed ViT-based models in certain contexts of natural image analysis [25]. Notably, neither ViT nor ConvNext has been specifically applied to the measurement of muscle quality to date. Furthermore, despite the promising performance of ConvNext and ViT-based models compared to simple CNN-based models, the training step is challenging, particularly when there is a lack of sufficient data. In the clinical domain, due to security and privacy policies, publicly available US data are very limited. Labeled data are especially scarce, since manually annotating medical data is expensive. Therefore, utilizing complex models in clinical settings is quite challenging. To this end, a strategy is proposed for training the complex models on a small set of US images. We aimed to explore the performance of three recent CNN-based and three ViT-based DL models to estimate quadriceps muscle thickness (QMT) based on a limited number of US images using two main strategies that provide a fair comparison of ViT- and CNN-based models. To better comprehend the decisions made by the models, examples of the visualization maps produced by the ViT- and CNN-based models are further examined. These visualization maps can also offer supplementary information that can be used as a guide for practitioners at the time of data acquisition. The contributions of this work are summarized as follows:
  • Three CNN- and three ViT-based models are proposed to estimate QMT using ultrasound images acquired in a clinical setting.
  • A strategy is proposed for optimizing the training of DL models to estimate QMT more accurately, especially when limited data are available.
  • The activation maps are explored to provide clinicians with real-time feedback. This feedback can potentially be used to help clinicians collect better US images to help DL models estimate QMT more accurately.
  • To the best of our knowledge, it is shown for the first time that DL can be used to automatically estimate QMT from US images acquired with phased-array probes.
This paper is organized as follows. Section 2 describes the data collection and experimental setup. The results of the proposed QMT estimation strategies and the derived visualization maps are presented in Section 3. Finally, the findings are discussed and conclusions are drawn in Section 4.

2. Methods

2.1. Dataset

The dataset is based on a prospective cohort of 486 adult patients undergoing a clinically indicated cardiac US examination at the Jewish General Hospital (JGH), Montreal, Canada, between 1 October 2018, and 30 June 2019. These patients provided verbal informed consent to participate in this study. At the end of the cardiac US examination, with the patient remaining in a supine position, the sonographer acquired a static image of the left quadriceps muscles (rectus femoris and vastus intermedius) at the level of the anterior thigh, midway (approximated visually) between the anterior superior iliac spine and patella. A GE E95 machine (GE HealthCare, Chicago, IL, USA) with a sector 4Vc-D phased-array probe was used for data acquisition, with the acquisition parameters set to standard adult transthoracic settings and a center frequency of 1.4 MHz. The phased-array probe used to image the heart was also used to image the quadriceps muscles, even though the latter are usually imaged using linear probes that are better suited for superficial structures. This was done for convenience of clinical implementation, as cardiac US systems do not typically include linear probes.
The quadriceps US images were extracted in DICOM format and manually annotated by a trained observer (J. O.) to define the region of interest corresponding to the skeletal muscle tissue between the superior margin of the femur and the inferior margin of the subcutaneous fat, which is defined as QMT. Specifically, for each image, the upper border of the femur was delineated by marking nine equally spaced points (with the fifth point centered on the femur bone), and the upper border of the muscle was delineated by marking five equally spaced points (with the third point centered on the quadriceps muscle, aligning roughly with the central point of the femur). QMT was then determined by measuring the vertical distance between the central points of the femur and quadriceps. To validate the QMTs measured by J.O., referred to as the ground-truth QMTs in this study, and to prevent training the models with potentially inaccurate QMTs, the muscle thickness in each US image was additionally measured by two independent medical researchers using the same subset of images. Table 1 presents the inter-rater variabilities for redundant QMT measures. Inter-rater reliabilities were evaluated using the intraclass correlation coefficient (ICC) with a 95% confidence interval (CI) [43,44]. All ICC values exceeded 0.80, indicating very good to excellent reliability.
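For reference, inter-rater reliability statistics of this kind can be computed as sketched below. This is a generic example using the pingouin package with a long-format table of hypothetical values; it is an assumed tooling choice, not the analysis code used to produce Table 1.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: one row per (image, rater) QMT measurement.
ratings = pd.DataFrame({
    "image":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":  ["JO", "R1", "R2"] * 4,
    "qmt_cm": [2.1, 2.0, 2.2, 3.4, 3.5, 3.3, 1.8, 1.9, 1.8, 4.0, 4.1, 3.9],
})

# ICC estimates (single- and average-measure variants) with 95% confidence intervals.
icc = pg.intraclass_corr(data=ratings, targets="image", raters="rater", ratings="qmt_cm")
print(icc[["Type", "ICC", "CI95%"]])
```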
The original image size was 708 × 1080 pixels, and the images were resized to 224 × 224. It is worth noting that the ground truth QMT was computed before resizing the images to 224 × 224. Given the variability in image acquisitions across subjects, pixel spacing was duly considered in this calculation. Five-fold cross-validation was used in all the experiments; in each fold, 67% of the images (330 images) were used for training, 17% (83 images) for validation, and the remaining 15% (73 images) as the test set. Figure 1a–c present three examples of US images with annotations, and Figure 1d presents the distribution of QMT in centimeters (cm). Please note that pixel spacing varies across patients, so one pixel does not correspond to the same length in centimeters for different patients. Therefore, in Figure 1a,b, the QMT is 1.57 cm and 2.17 cm, respectively, even though the number of QMT pixels in (a) is higher than in (b).
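As a concrete illustration of the ground-truth computation described above, the sketch below measures the vertical distance between the central femur and muscle annotation points and converts it to centimeters using the image's pixel spacing. The point ordering and function name are assumptions for illustration only.

```python
import numpy as np

def qmt_from_annotations(femur_points, muscle_points, pixel_spacing_mm):
    """Estimate QMT (cm) from manual annotation points (hypothetical format).

    femur_points:  (9, 2) array of (row, col) pixel coordinates along the femur
                   surface, with index 4 centered on the femur bone.
    muscle_points: (5, 2) array of (row, col) pixel coordinates along the muscle
                   surface, with index 2 centered on the quadriceps.
    pixel_spacing_mm: vertical pixel spacing (mm/pixel) for this acquisition.
    """
    femur_center = np.asarray(femur_points)[4]
    muscle_center = np.asarray(muscle_points)[2]
    thickness_px = abs(femur_center[0] - muscle_center[0])   # vertical distance in pixels
    return thickness_px * pixel_spacing_mm / 10.0             # convert mm to cm
```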

2.2. Experimental Setup

The overarching goal was to automate the evaluation of US-based QMT as a clinically useful and accessible indicator of sarcopenia. Figure 2 summarizes the proposed framework.

2.2.1. Regression and Classification for QMT Measurements

Both regression and classification approaches were employed to achieve this goal, and within each approach, a total of six DL models were compared using either their ImageNet weights [45] or our experimentally derived weights. The first approach consisted of training a regression model to predict the QMT value (in cm) as a continuous output. The second approach consisted of training a classification model to predict the QMT class (10 classes binned in 0.5 cm increments starting at 1.0 cm) as an ordinal output and also to generate activation maps for visualization. Using this experimental setup, the following hypotheses were tested:
  • Models with transformer and CNN architecture would achieve good results in predicting QMT.
  • Regression models with pre-trained weights experimentally derived from classification training runs would outperform the same models with pre-trained weights from ImageNet.
  • Classification models with pre-trained weights experimentally derived from regression training runs would outperform the same models with pre-trained weights from ImageNet.
  • Activation maps that correctly highlighted the anatomical structures of interest would be more likely to correspond to accurate predictions of QMT.
The six DL models investigated for the regression and classification tasks were ResNet101 [46], DenseNet121 [47], ConvNext [25], ViT [29], MAE [30], and SwinT [32]. For the transformer-based models (i.e., ViT, MAE, and SwinT), the base (i.e., ViT-B, MAE-B, and SwinT-B) and large (i.e., ViT-L, MAE-L, and SwinT-L) architecture designs of the models were used in our experiments. For ConvNext, the base architecture design was utilized.
The initialization weights investigated for the regression models were either experimentally derived from classification training runs (CW-Regression) or derived a priori from ImageNet weights (IW-Regression). The rationale is that pre-training a model on an easier task (i.e., classification) can help it learn basic features, and then training it on a more sophisticated task (i.e., regression) can fine-tune it to learn complex features [48,49,50].
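A minimal sketch of this two-stage initialization is shown below for ResNet101 using torchvision; it only illustrates the weight-transfer idea (the fine-tuning loops are omitted), and the variable names are not taken from the authors' code.

```python
import torch.nn as nn
from torchvision import models

# Stage 1: IW-Classification -- start from ImageNet weights and attach a 10-class head.
clf = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
clf.fc = nn.Linear(clf.fc.in_features, 10)       # 10 QMT classes (0.5 cm bins)
# ... fine-tune `clf` on the QMT classification task here ...

# Stage 2: CW-Regression -- initialize a regression model with the classification backbone.
reg = models.resnet101(weights=None)
backbone = {k: v for k, v in clf.state_dict().items() if not k.startswith("fc.")}
reg.load_state_dict(backbone, strict=False)      # copy shared convolutional weights
reg.fc = nn.Linear(reg.fc.in_features, 1)        # single continuous QMT output (cm)
# ... fine-tune `reg` on the QMT regression task here ...
```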
The initialization weights investigated for the classification models were either experimentally derived from regression training runs (RW-Classification) or derived a priori from ImageNet weights (IW-Classification). Since the ImageNet weights were designed for 1000 classes, the last layer was modified accordingly for 10 classes (1–1.5 cm, 1.5–2 cm, 2–2.5 cm, 2.5–3 cm, 3–3.5 cm, 3.5–4 cm, 4–4.5 cm, 4.5–5 cm, 5–5.5 cm, above 5.5 cm).
For all training, the Adam [51] optimizer with a learning rate of 0.0001 was used. Five-fold cross-validation was used, and the results were presented as the average of each fold for the patients comprising the test set. Furthermore, multiple augmentations, including horizontal flipping, cropping, Gaussian noise addition, and blurring, were randomly applied to the training set.
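A minimal sketch of this training configuration is given below. Only the optimizer, the learning rate, and the types of augmentation come from the text; the specific augmentation parameters and the noise amplitude are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Augmentations named in the text: horizontal flip, crop, Gaussian noise, blur.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # random cropping
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),     # blurring
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # additive Gaussian noise
])

model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)                     # regression head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)         # Adam, learning rate 0.0001
```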
  • Training Regression Models for QMT Estimation:
The regression models in the proposed strategies were trained for up to several hundred epochs, until the loss did not improve for 50 consecutive epochs, using the mean squared error (MSE) loss function defined as:
$\mathrm{MSE}_{\mathrm{loss}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$
where $y_i$ and $\hat{y}_i$ represent the ground truth and predicted QMT, respectively, in a batch of $N$ (32) US images. As previously noted in Section 2.1, the ground truth values of QMT were derived from manual annotations of skeletal muscle.
  • Training Classification Models for QMT Estimation and Activation Map Visualization:
The classification models were trained using the focal loss function [52]. Focal loss, as defined in Equation (2), tackles the class imbalance problem by reducing the loss contribution from common samples (which are usually correctly classified) and increasing the importance of rare cases (which are often misclassified).
$\mathrm{Focal}_{\mathrm{loss}} = -\alpha\left(1 - p_t\right)^{\gamma}\log\left(p_t\right),$
where $p_t$ represents the estimated probability for the corresponding class, and $\alpha$ and $\gamma$ are defined as the weighting and modulating factors, respectively. The recommendations in [52] were followed for the initialization of these factors; therefore, $\alpha$ and $\gamma$ were set to 0.25 and 2, respectively. Predefined class intervals set by our clinician were used. The interval edges, defined as [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 7.5] cm, led to a total of 10 classes (0 to 9); for example, the class for a QMT of 1.3 cm was 0. For visualizing activation maps, Grad-CAM [53] was utilized for the CNN-based models. For the transformer-based models, only the activation maps of the ViT and MAE models were investigated, by adopting the transformer interpretability method of [54].
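The sketch below illustrates the class binning and the focal loss as described above; it is a re-implementation under the stated bin edges and α = 0.25, γ = 2, not the authors' code.

```python
import torch
import torch.nn.functional as F

BIN_EDGES = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 7.5]  # cm

def qmt_to_class(qmt_cm: float) -> int:
    """Map a continuous QMT (cm) to one of the 10 classes; e.g., 1.3 cm -> class 0."""
    for c in range(len(BIN_EDGES) - 1):
        if qmt_cm < BIN_EDGES[c + 1]:
            return c
    return len(BIN_EDGES) - 2        # anything above the last edge falls in the top class

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Multi-class focal loss of Equation (2); logits: (B, 10), targets: (B,) class indices."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)     # log p_t per sample
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

# Example: a batch of two images with QMTs of 1.3 cm and 3.8 cm (classes 0 and 5).
targets = torch.tensor([qmt_to_class(1.3), qmt_to_class(3.8)])
loss = focal_loss(torch.randn(2, 10), targets)
```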

2.2.2. Segmentation for QMT Measurement

An alternative method to attain the objective of automating the assessment of QMT measurements in US images involved generating segmentation masks using annotations of quadriceps muscle and femur surfaces. Subsequently, the length between these surfaces was measured. Figure 3 illustrates examples of US images, where the first column on the left displays annotation points on the US images, and the second column on the left showcases generated segmentation masks derived from these annotation points for three patients (Figure 3a–c). The generated segmentations can be employed to calculate muscle thickness in US images. The muscle thickness can be calculated by identifying the lowest pixel of the muscle surface and the uppermost pixel of the femur surface, as indicated by the dashed lines in Figure 3. It is noteworthy that while segmentations can be used to measure muscle thickness, their application may not extend to the assessment of other biomarkers of muscle quality that are commonly utilized in sarcopenia detection.
For the training phase, binary masks were initially generated based on annotations of the surface areas of the muscle and femur. Following a standard approach for training the proposed regression and classification models, a 5-fold cross-validation technique was employed. The model architecture utilized for segmentation was a modified U-Net [55] with ResNet50 as its backbone. The chosen configuration included Dice loss as the loss function and a learning rate of 0.001. Dice loss can be defined as follows:
$\mathrm{Dice}_{\mathrm{loss}} = 1 - \frac{2\times \mathrm{Intersection}}{\mathrm{Union} + \epsilon},$
where $\epsilon$ was set to 0.00001.
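A compact implementation of this loss is sketched below, interpreting "Intersection" and "Union" as the soft overlap and the sum of the predicted and ground-truth areas, as is standard for soft Dice; the U-Net with a ResNet50 encoder could, for example, be instantiated with the segmentation_models_pytorch package, although the authors' exact implementation is not specified.

```python
import torch

def dice_loss(pred, target, eps=1e-5):
    """Soft Dice loss of Equation (3); pred holds sigmoid probabilities and target
    the binary ground-truth mask, both of shape (B, 1, H, W)."""
    intersection = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - 2.0 * intersection / (union + eps)).mean()

# Illustrative usage with random tensors standing in for a prediction and a mask.
pred = torch.rand(4, 1, 224, 224)
mask = (torch.rand(4, 1, 224, 224) > 0.5).float()
loss = dice_loss(pred, mask)
```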
Subsequently, majority voting was applied to consolidate the binary masks for the 73 subjects in the test set. To be more specific, each test image had five masks generated by the five models trained on the five train-validation sets (i.e., 5-fold cross-validation). The final mask for each test image was determined via majority voting among these five generated masks. After this, muscle thickness was computed in a post-processing step involving the measurement of the distance between the surfaces of the muscle and the femur. In the post-processing phase, the horizontal edges of the binary mask were identified using the Canny edge detection method. This facilitated the determination of surface curves for both the muscle and femur. The QMT was subsequently determined by identifying the bottom-most pixel on the muscle surface and the highest pixel on the femur surface. An example of a QMT measurement from predicted segmentation masks is shown in Figure 3. For simplicity, the QMT measured based on predicted segmentation masks has been denoted as Seg-Regression throughout the remainder of this manuscript.
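The post-processing chain can be sketched as follows. This is a simplified, assumption-laden version: the two surface regions are separated here by connected-component labeling rather than Canny edge detection, and the pixel-spacing handling is hypothetical.

```python
import numpy as np
from scipy import ndimage

def majority_vote(masks):
    """Combine the five fold-specific binary masks by per-pixel majority voting."""
    stacked = np.stack(masks).astype(np.uint8)           # (5, H, W)
    return (stacked.sum(axis=0) >= 3).astype(np.uint8)

def qmt_from_mask(mask, pixel_spacing_mm):
    """Measure QMT (cm) from a predicted mask assumed to contain two regions:
    the muscle surface (upper) and the femur surface (lower)."""
    labels, n_regions = ndimage.label(mask)
    if n_regions < 2:
        return None                                       # segmentation failure
    rows = [np.where(labels == i)[0] for i in (1, 2)]     # row indices of the two regions
    upper, lower = sorted(rows, key=lambda r: r.mean())   # muscle above, femur below
    thickness_px = lower.min() - upper.max()              # femur top minus muscle bottom
    return thickness_px * pixel_spacing_mm / 10.0         # convert mm to cm
```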
An overview of the proposed framework is summarized in Figure 2. The IW-Regression model is initialized with ImageNet weights, the CW-Regression model is initialized with IW-Classification weights, and the Seg-Regression model is initialized with ImageNet weights.

3. Results

As previously mentioned, the results presented below are based on an average of 5-fold cross-validation for the 73 subjects in the test set. Here, we present a summary of the terminologies utilized throughout this manuscript for reference.
  • IW-Regression: This denotes the regression model utilizing ImageNet pre-trained weights (IW referring to ImageNet weights). For instance, IW-Regression ResNet101 signifies the fine-tuned ResNet101 with ImageNet weights specifically tailored for the regression task of QMT.
  • IW-Classification: This represents the classification model leveraging ImageNet pre-trained weights. Similarly, IW-Classification ResNet101 signifies the ResNet101 with ImageNet weights fine-tuned for the classification of QMT.
  • CW-Regression: This designates the regression model fine-tuned using IW-Classification model weights. For example, CW-Regression ResNet101 is the ResNet101 initially initialized with ImageNet then fine-tuned for the classification of QMT (as denoted by IW-Classification), and subsequently fine-tuned once more for the regression task of QMT.
  • RW-Classification: This signifies the classification model fine-tuned using IW-Regression model weights. For example, RW-Classification ResNet101 is the ResNet101 initially initialized with ImageNet and fine-tuned for the regression of QMT (as denoted by IW-Regression) and then further fine-tuned for the classification task of QMT.
  • Seg-Regression: This denotes the measurement of QMT through a post-processing step applied to the predicted segmentation masks.

3.1. Regression of QMT

The median absolute error over the test set is summarized in Table 2. In Table 2, the asterisk (p-value < 0.05) and double asterisks (p-value < 0.01) show a significant difference between the error distributions of the IW-Regression model and its CW-Regression counterpart. For the significance test, the Wilcoxon signed-rank test (histograms of errors did not show Gaussian distributions) was used. It is also shown that the average median absolute error across all CW-Regression models is significantly less than that of the IW-Regression models.
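For reference, this paired comparison can be reproduced with SciPy as sketched below; the error values shown are illustrative placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-subject absolute QMT errors (cm) for one model under both initializations.
iw_errors = np.array([0.31, 0.22, 0.45, 0.18, 0.29, 0.40, 0.27, 0.33])
cw_errors = np.array([0.26, 0.20, 0.38, 0.19, 0.24, 0.35, 0.23, 0.30])

stat, p_value = wilcoxon(iw_errors, cw_errors)   # paired, non-parametric test
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.3f}")
```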
As shown in Table 2, most of the CW-Regression models showed improvements compared to their IW-Regression counterparts, as demonstrated in the Delta column. Therefore, it can be concluded that a prior classification training step can help boost QMT estimation in US images. Additionally, the results demonstrate that SwinT-L, a transformer-based model, with median absolute errors of 0.30 and 0.25 cm for its IW-Regression and CW-Regression variants, respectively, benefits significantly more from our proposed training strategy than the other transformer-based models. Among the CNN-based models, ConvNext showcased superior performance by achieving a median absolute error of 0.23 cm in CW-Regression. Furthermore, it demonstrated a notable enhancement of 0.04 cm in QMT estimation when comparing CW-Regression with IW-Regression.
Given the limited number of US images available, pretraining a model for a classification task first, which is arguably an easier task, can better prepare the model for the regression task in both transformer- and CNN-based models. However, the impact of the proposed strategy depends on the model architecture: significant improvements were observed for DenseNet121, ViT-B, SwinT-B, and SwinT-L, whereas there were no significant improvements for ResNet101, ConvNext-B, ViT-L, and MAE-B. The complete statistical analysis is presented in Figure 4.

3.2. Classification of QMT

This section summarizes the results of the classification tasks. As previously explained, the RW-Classification models were initialized with the weights of our regression training runs, whereas the IW-Classification models were initialized with publicly available weights pretrained on the ImageNet dataset. The classification accuracy results are shown in Table 3. Among the IW-Classification models, SwinT-L showed the best accuracy of 43.84%, and among the RW-Classification models, ViT-L showed the best accuracy of 43.84%. Overall, the IW-Classification models showed better performance than their RW-Classification counterparts; SwinT-B and SwinT-L were the only models whose RW-Classification variants improved over their IW-Classification counterparts. Additionally, the transformer-based models outperformed the CNN-based models in both the IW-Classification and RW-Classification settings, demonstrating their superior performance in the classification task.
The activation maps from the classification models for ResNet101, DenseNet121, ViT-B, and MAE-B are shown in Figure 5 and Figure 6. Based on the obtained activation maps, it was found that in both the IW-Classification and RW-Classification models, the activation maps of the ViT-B and MAE-B models mainly focused on the edge of the femur bone or the skeletal muscle tissue itself in most of the correctly classified US images, as shown in Figure 5c,d. There were no distinguishing patterns in the activation maps between the IW-Classification and RW-Classification models. Furthermore, no discernible differences were found between the activation maps of ViT-B and ViT-L or of MAE-B and MAE-L; thus, only the results for ViT-B and MAE-B are provided. The activation maps for two sample test subjects are shown in the first row for correctly classified cases and in the second row for misclassified cases. It is important to note that the activation maps were not relevant in some misclassified cases of the CNN-based models, especially when the classification errors were greater than two classes, as shown in Figure 5e,f.
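For readers who wish to reproduce such maps, a minimal Grad-CAM sketch for a CNN classifier over the 10 QMT classes is given below; it is a generic re-implementation using forward/backward hooks on the final convolutional block, not the authors' visualization code.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet101(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # 10 QMT classes
model.eval()

feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                  # placeholder for a preprocessed US image
model(x)[0].max().backward()                     # backpropagate the top class logit

weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # pooled gradients per channel
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalize to [0, 1] heat map
```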

3.3. Segmentation of QMT

This section presents the results of QMT measurements derived from segmentation masks. An evaluation of the predicted masks was performed using the Dice score, defined as follows:
$\mathrm{Dice\;score} = \frac{2\times \mathrm{Intersection}}{\mathrm{Union} + 0.0001}.$
The aggregate Dice score across the 73 test subjects was 0.90 ± 0.06, indicating highly accurate predictions. Figure 3 showcases examples of predicted masks for three patients. The horizontal edges were identified using the Canny edge detection method applied to the predicted masks. Following this, the QMT was computed by measuring the distance between the lowest point of the muscle surface and the uppermost part of the femur surface, as represented by the dashed lines in Figure 3. The median absolute error of the QMT measures based on Seg-Regression was found to be 0.13 cm, which was lower than that of the regression-based approaches. Table 4 presents a comparison of the median absolute error derived from Seg-Regression with that of IW-Regression and CW-Regression; Table 4 displays the averages of all the models for IW-Regression and CW-Regression. As the error distributions did not adhere to normality, the medians of the absolute errors are reported in this table, and a Wilcoxon signed-rank test was conducted to assess statistical significance.

4. Discussion and Conclusions

In this study, a method was developed for the measurement of QMT using US images acquired with a phased-array probe in a clinical setting. QMT has been put forth as an objective biomarker for frailty and a diagnostic criterion for sarcopenia. A number of DL models, including ResNet101, DenseNet121, ConvNext-B, ViT-B, ViT-L, MAE-B, MAE-L, SwinT-B, and SwinT-L, were compared to predict this measurement.
First, there was no significant and consistent difference when comparing all the CNN-based models with the transformer-based models in predicting QMT. In the IW-Regression models, as summarized in Table 2, ConvNext-B, a CNN-based model, achieved a median absolute error of 0.27, while of the transformer-based models, SwinT-B also achieved the same median absolute error of 0.27. Furthermore, upon conducting a significance test, as illustrated in Figure 4a, the observed difference between the two models was not statistically significant. In the CW-Regression models, as depicted in the results presented in Table 2, ConvNext-B once again demonstrated a superior performance among the CNN-based models, achieving the lowest median absolute error of 0.23. Similarly, within the transformer-based models, SwinT-B and SwinT-L closely followed, with median absolute errors of 0.24 and 0.25, respectively, showing competitive results comparable to ConvNext-B. Therefore, it can be concluded that both CNN- and transformer-based models exhibit a satisfactory performance when adequately trained.
Second, performance improved for the CW-Regression models compared to the IW-Regression models for the task of QMT prediction. Among the CNN-based models, all the models experienced improvements through the utilization of the CW-Regression strategy; specifically, ResNet101, DenseNet121, and ConvNext-B demonstrated improvements of 0.1 cm, 0.12 cm, and 0.04 cm, respectively. Similarly, within the transformer-based models, the implementation of the proposed CW-Regression strategy generally resulted in enhanced performance. Excluding MAE-L, which showed no improvement, ViT-B, ViT-L, MAE-B, SwinT-B, and SwinT-L achieved improvements of 0.03 cm, 0.03 cm, 0.03 cm, 0.03 cm, and 0.05 cm, respectively. These findings highlight the fact that for models that are harder to train using small datasets, simplifying the task (i.e., pretraining on the classification task first) is highly beneficial to their performance. This is in line with a large body of evidence in curriculum learning [56] showing that training deep models on easier tasks first leads to better results, often because of the non-convex nature of the loss landscape, where training on an easy task helps the network avoid poor local minima. Given that transformer-based models have shown large potential in the natural language processing field, the proposed CW-Regression transformer models will be advantageous for US applications where only limited images are available.
Third, our experiments on the classification of QMT (i.e., the IW-Classification and RW-Classification models) showed that prior pretraining on a regression task is not beneficial for the classification models, which emphasizes the fact that regression is a harder task than classification. When compared to the IW-Classification models, the RW-Classification models generally displayed decreased performance, with only SwinT-B and SwinT-L showing improvements. Furthermore, there was no significant difference between the activation maps of the IW-Classification and RW-Classification models. Although the current work concentrates on non-ordinal classification, investigating ordinal classification in further studies may provide insightful information. Assessing inter-observer variability, improving annotation techniques, and creating robust algorithms that can process ordinal data effectively could all help improve the accuracy of measuring QMT in US images. Such initiatives would lay the groundwork for enhanced clinical applications and a more thorough understanding of muscle health.
Fourth, the activation maps derived from the classification of QMT can provide complementary information that can be beneficial for sonographers and clinicians when collecting and interpreting the images. It was found that either the femur bone or its surroundings were the main focus of the activation maps for correctly classified cases. When the surface of the femur bone was not clearly visible in the US image, the activation maps of the CNN-based models did not display any useful visualizations, and the model usually misclassified that case. As a result, since these visualizations can be obtained online during data collection, they can help ensure that the US image is acquired in such a way that the surface of the femur bone is clearly visible, thereby improving the QMT predictions. It is worth noting that variations in data acquisition settings can impact the model’s activation maps; in this context, the utilization of transfer learning techniques becomes crucial for improving the model’s adaptability to diverse acquisition settings. In future work, it could also be highly beneficial to visualize activation maps during the training phases. Saving checkpoints of training weights and tracking the progression of activation maps throughout the training process can provide valuable insights for further optimization.
The present study encountered a notable limitation regarding the absence of complete annotations for the entire muscle body in our dataset. While we successfully trained a classification model to estimate muscle thickness based on annotations of the surface of the femur bone and the surface of the muscle, the lack of ground truth annotations for the entire muscle limits our ability to conduct quantitative experiments on the generated Grad-CAM images. This constraint highlights the challenge of comprehensively assessing muscle features without complete annotations and prompts consideration for future investigations to incorporate more extensive annotation datasets.
Moreover, the adoption of segmentation masks to automate QMT measurements in US images presents a compelling avenue for streamlining the evaluation of muscle thickness. Specifically, the QMT measures derived from the segmentation-based approach (Seg-Regression) exhibited noteworthy distinctions when compared to those obtained through IW-Regression and CW-Regression. Despite Seg-Regression showcasing superior performance over its counterparts, it is imperative to acknowledge certain limitations. There were instances where the prediction of the segmentation masks encountered failures. As illustrated in Figure 3a, for instance, the generated mask is inaccurately produced, leading to an erroneous measurement of QMT (error of 1.38 cm). Similarly, in Figure 3c, while the QMT is accurately measured, the predicted mask itself is inaccurate. Therefore, the accuracy of the post-processing step relies heavily on the segmentation model’s performance. Furthermore, the application of Seg-Regression is primarily tailored for the precise measurement of muscle thickness and may not be seamlessly extended to assess other essential biomarkers such as hand grip strength, FI, CFS, etc., required for sarcopenia assessments. This constraint underscores the need for a nuanced understanding of the method’s scope, emphasizing its suitability for specific QMT-related assessments while recognizing its limitations in a broader context.
A novel aspect of this study is the demonstrated ability to generate accurate predictions of QMT despite the use of suboptimal phased-array US source images. Such measurements would otherwise require training, practice, and added time for clinicians to perform manually—all of which are major barriers to clinical translation. Phased-array images were used because of their convenience for clinical acquisition without the need to switch probes in the context of a cardiac US exam. Further research is underway using linear probes in the context of a point-of-care US exam using a handheld system. The images provided by these probes are ideally suited to imaging superficial skeletal muscles and should, in theory, generate even more accurate predictions of QMT.
It is important to note that predicting sarcopenia involves more than just measuring muscle thickness. Accurate assessment requires integrating various patient measurements to provide a comprehensive evaluation. The proposed technique is designed to automate the entire sarcopenia assessment process by relying solely on patient data, minimizing the need for practitioner input. This approach is particularly useful in scenarios where immediate access to equipment for measuring muscle thickness is not available, such as bedside data collection. By incorporating our method, we can facilitate the development of fully automated, online sarcopenia assessment systems, enhancing the efficiency and accuracy of patient evaluations.
One limitation of this study is that only one leg was imaged. A more accurate approach would involve imaging both legs and averaging the measurements, although this would significantly increase the scan time. Future research could explore the costs and benefits of imaging both legs. Another constraint of the current study is the phased-array transducer that was used for data acquisition. Limb muscles, as mentioned earlier, are typically evaluated using linear probes that are better adapted for superficial structures. However, in the current study, phased-array images were used because of their convenience for clinical acquisition without the need to switch probes in the context of a cardiac US exam.
The contribution of this study is a necessary step preceding the clinical application of QMT at scale. The objective was to provide detailed benchmarking and a comparative analysis of the performance of various models, helping us to understand the strengths and weaknesses of different architectures in the context of muscle thickness measurements. By providing reliable and precise muscle thickness measurements, this study contributes essential data that, when combined with other diagnostic criteria, will help categorize patients more specifically within the spectrum of sarcopenia. Ultimately, this will enhance the ability of healthcare providers to diagnose and manage sarcopenia more effectively. Having derived DeepSarc-US, the stage is now set to apply this method to a large-scale clinical cohort to validate its diagnostic performance against a gold standard determination of sarcopenia. It is important to note that muscle size in isolation may not be sufficient to diagnose sarcopenia, and ancillary criteria assessing muscle quality and strength may very well be needed. Further research involving radiomic features of muscle quality is currently an active area of ongoing investigation that has the potential to complement the proposed measures of muscle size.

Author Contributions

Conceptualization, B.B., J.A. and H.R.; methodology, B.B.; validation, B.B., J.A. and H.R.; formal analysis, B.B.; investigation, B.B.; resources, B.B.; data curation, J.O.; writing—original draft preparation, B.B.; writing—review and editing, B.B., J.A. and H.R.; visualization, B.B.; supervision, J.A. and H.R.; project administration, J.A. and H.R.; funding acquisition, J.A. and H.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Mitacs through the Mitacs Accelerate program and The Natural Sciences and Engineering Research Council of Canada (NSERC). Afilalo, J. is supported by the Canadian Institutes of Health Research and the Fonds de Recherche du Québec en Santé. We would like to thank Marie-Josée Blais, Nancy Murray, and the entire team of cardiac sonographers at the JGH for diligently acquiring the US images used for this study. We would also like to thank Igal Sebag (Director, Echocardiography Lab) and Lawrence Rudski (Chief, Division of Cardiology) for supporting this research project.

Institutional Review Board Statement

The study protocol was approved by the CIUSSS West-Central Montreal Research Ethics Board (protocol code 15-136, date of approval: 28 March 2018).

Informed Consent Statement

Written informed consent was obtained from all the subjects involved in this study.

Data Availability Statement

The original data used in this study cannot be shared due to the Jewish General Hospital’s patient privacy policies.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Morley, J.E.; Vellas, B.; Van Kan, G.A.; Anker, S.D.; Bauer, J.M.; Bernabei, R.; Cesari, M.; Chumlea, W.; Doehner, W.; Evans, J.; et al. Frailty consensus: A call to action. J. Am. Med. Dir. Assoc. 2013, 14, 392–397. [Google Scholar] [CrossRef]
  2. Afilalo, J.; Alexander, K.P.; Mack, M.J.; Maurer, M.S.; Green, P.; Allen, L.A.; Popma, J.J.; Ferrucci, L.; Forman, D.E. Frailty assessment in the cardiovascular care of older adults. J. Am. Coll. Cardiol. 2014, 63, 747–762. [Google Scholar] [CrossRef] [PubMed]
  3. Hoogendijk, E.O.; Afilalo, J.; Ensrud, K.E.; Kowal, P.; Onder, G.; Fried, L.P. Frailty: Implications for clinical practice and public health. Lancet 2019, 394, 1365–1375. [Google Scholar] [CrossRef] [PubMed]
  4. Gwyther, H.; Shaw, R.; Dauden, E.A.J.; D’Avanzo, B.; Kurpas, D.; Bujnowska-Fedak, M.; Kujawa, T.; Marcucci, M.; Cano, A.; Holland, C. Understanding frailty: A qualitative study of European healthcare policy-makers’ approaches to frailty screening and management. BMJ Open 2018, 8, e018653. [Google Scholar] [CrossRef]
  5. Damluji, A.A.; Forman, D.E.; Van Diepen, S.; Alexander, K.P.; Page, R.L.; Hummel, S.L.; Menon, V.; Katz, J.N.; Albert, N.M.; Afilalo, J.; et al. Older adults in the cardiac intensive care unit: Factoring geriatric syndromes in the management, prognosis, and process of care: A scientific statement from the American Heart Association. Circulation 2020, 141, e6–e32. [Google Scholar] [CrossRef] [PubMed]
  6. Afilalo, J.; Joshi, A.; Mancini, R. If you cannot measure frailty, you cannot improve it. JACC Heart Fail. 2019, 7, 303–305. [Google Scholar] [CrossRef]
  7. Lee, S.H.; Gong, H.S. Measurement and interpretation of handgrip strength for research on sarcopenia and osteoporosis. J. Bone Metab. 2020, 27, 85. [Google Scholar] [CrossRef]
  8. Fountotos, R.; Munir, H.; Goldfarb, M.; Lauck, S.; Kim, D.; Perrault, L.; Arora, R.; Moss, E.; Rudski, L.G.; Bendayan, M.; et al. Prognostic value of handgrip strength in older adults undergoing cardiac surgery. Can. J. Cardiol. 2021, 37, 1760–1766. [Google Scholar] [CrossRef]
  9. Bibas, L.; Saleh, E.; Al-Kharji, S.; Chetrit, J.; Mullie, L.; Cantarovich, M.; Cecere, R.; Giannetti, N.; Afilalo, J. Muscle mass and mortality after cardiac transplantation. Transplantation 2018, 102, 2101–2107. [Google Scholar] [CrossRef]
  10. Cruz-Jentoft, A.; Michel, J.P. Sarcopenia: A useful paradigm for physical frailty. Eur. Geriatr. Med. 2013, 4, 102–105. [Google Scholar] [CrossRef]
  11. Zuckerman, J.; Ades, M.; Mullie, L.; Trnkus, A.; Morin, J.F.; Langlois, Y.; Ma, F.; Levental, M.; Morais, J.A.; Afilalo, J. Psoas muscle area and length of stay in older adults undergoing cardiac operations. Ann. Thorac. Surg. 2017, 103, 1498–1504. [Google Scholar] [CrossRef] [PubMed]
  12. Dodds, R.; Sayer, A.A. Sarcopenia and frailty: New challenges for clinical practice. Clin. Med. 2016, 16, 455. [Google Scholar] [CrossRef] [PubMed]
  13. Kojima, G.; Iliffe, S.; Walters, K. Frailty index as a predictor of mortality: A systematic review and meta-analysis. Age Ageing 2018, 47, 193–200. [Google Scholar] [CrossRef] [PubMed]
  14. Church, S.; Rogers, E.; Rockwood, K.; Theou, O. A scoping review of the Clinical Frailty Scale. BMC Geriatr. 2020, 20, 393. [Google Scholar] [CrossRef]
  15. Blanc-Durand, P.; Schiratti, J.B.; Schutte, K.; Jehanno, P.; Herent, P.; Pigneur, F.; Lucidarme, O.; Benaceur, Y.; Sadate, A.; Luciani, A.; et al. Abdominal musculature segmentation and surface prediction from CT using deep learning for sarcopenia assessment. Diagn. Interv. Imaging 2020, 101, 789–794. [Google Scholar] [CrossRef] [PubMed]
  16. Joshi, A.; Mancini, R.; Probst, S.; Abikhzer, G.; Langlois, Y.; Morin, J.F.; Rudski, L.G.; Afilalo, J. Sarcopenia in Cardiac Surgery: Dual X-ray Absorptiometry Study from the McGill Frailty Registry. Am. Heart J. 2021, 239, 52–58. [Google Scholar] [CrossRef] [PubMed]
  17. Cruz-Jentoft, A.J.; Bahat, G.; Bauer, J.; Boirie, Y.; Bruyère, O.; Cederholm, T.; Cooper, C.; Landi, F.; Rolland, Y.; Sayer, A.A.; et al. Sarcopenia: Revised European consensus on definition and diagnosis. Age Ageing 2019, 48, 16–31. [Google Scholar] [CrossRef] [PubMed]
  18. Wijntjes, J.; van Alfen, N. Muscle ultrasound: Present state and future opportunities. Muscle Nerve 2021, 63, 455–466. [Google Scholar] [CrossRef] [PubMed]
  19. Stock, M.S.; Thompson, B.J. Echo intensity as an indicator of skeletal muscle quality: Applications, methodology, and future directions. Eur. J. Appl. Physiol. 2021, 121, 369–380. [Google Scholar] [CrossRef]
  20. Stringer, H.J.; Wilson, D. The role of ultrasound as a diagnostic tool for sarcopenia. J. Frailty Aging 2018, 7, 258–261. [Google Scholar] [CrossRef]
  21. Bian, P.; Zhang, X.; Liu, R.; Li, H.; Zhang, Q.; Dai, B. Deep-Learning-Based Color Doppler Ultrasound Image Feature in the Diagnosis of Elderly Patients with Chronic Heart Failure Complicated with Sarcopenia. J. Healthc. Eng. 2021, 2021, 2603842. [Google Scholar] [CrossRef] [PubMed]
  22. Pintelas, E.; Livieris, I.E.; Barotsis, N.; Panayiotakis, G.; Pintelas, P. An Autoencoder Convolutional Neural Network Framework for Sarcopenia Detection Based on Multi-frame Ultrasound Image Slices. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Crete, Greece, 25–27 June 2021; pp. 209–219. [Google Scholar]
  23. Sobral, C.; Silva, J.S.; André, A.; Santos, J.B. Sarcopenia Diagnosis: Deep Transfer Learning versus Traditional Machine Learning. 2010. Available online: https://recpad2021.uevora.pt/wp-content/uploads/2020/10/RECPAD_2020_paper_2.pdf (accessed on 23 July 2024).
  24. Marzola, F.; van Alfen, N.; Doorduin, J.; Meiburger, K.M. Deep learning segmentation of transverse musculoskeletal ultrasound images for neuromuscular disease assessment. Comput. Biol. Med. 2021, 135, 104623. [Google Scholar] [CrossRef] [PubMed]
  25. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  26. Hassanien, M.A.; Singh, V.K.; Puig, D.; Abdel-Nasser, M. Predicting breast tumor malignancy using deep ConvNeXt radiomics and quality-based score pooling in ultrasound sequences. Diagnostics 2022, 12, 1053. [Google Scholar] [CrossRef] [PubMed]
  27. Kim, K.; Macruz, F.; Wu, D.; Bridge, C.; McKinney, S.; Al Saud, A.A.; Sharaf, E.; Pely, A.; Danset, P.; Duffy, T.; et al. Point-of-care AI-assisted stepwise ultrasound pneumothorax diagnosis. Phys. Med. Biol. 2023, 68, 205013. [Google Scholar] [CrossRef] [PubMed]
  28. Ding, X.; Liu, Y.; Zhao, J.; Wang, R.; Li, C.; Luo, Q.; Shen, C. A novel wavelet-transform-based convolution classification network for cervical lymph node metastasis of papillary thyroid carcinoma in ultrasound images. Comput. Med Imaging Graph. 2023, 109, 102298. [Google Scholar] [CrossRef] [PubMed]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  31. Li, J.; Chen, J.; Tang, Y.; Wang, C.; Landman, B.A.; Zhou, S.K. Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives. Med. Image Anal. 2023, 85, 102762. [Google Scholar] [CrossRef] [PubMed]
  32. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
Figure 1. Examples from the dataset with QMT values of (a) 1.57 cm, (b) 2.17 cm, and (c) 6.75 cm. Note that the pixel spacing differs between images (a–c), so one pixel does not correspond to the same physical length in cm across these images. The colored dots mark the annotated surfaces of the quadriceps muscle and the femur bone (best viewed in color). (d) The distribution of QMT across all 486 subjects.
Figure 2. Summary of the proposed QMT measurement framework: IW-Regression, CW-Regression, and Seg-Regression. The IW-Regression model is initialized with ImageNet weights, the CW-Regression model with IW-Classification weights, and the Seg-Regression model with ImageNet weights.
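To make the initialization scheme concrete, the following minimal PyTorch sketch shows how a CW-Regression model could be built by reusing the weights of a trained IW-Classification backbone and replacing its head with a single-output regression layer. The backbone choice, checkpoint path, and number of classes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 14  # hypothetical number of ordinal QMT classes

# 1) Build the classification model and load its trained (IW-Classification) weights.
classifier = models.resnet101(weights=None)
classifier.fc = nn.Linear(classifier.fc.in_features, NUM_CLASSES)
classifier.load_state_dict(torch.load("iw_classification_resnet101.pt"))  # hypothetical checkpoint

# 2) Build the regression model with the same backbone, copy every layer except
#    the classification head, and attach a single-output regression head.
regressor = models.resnet101(weights=None)
backbone_state = {k: v for k, v in classifier.state_dict().items() if not k.startswith("fc.")}
regressor.load_state_dict(backbone_state, strict=False)  # strict=False skips the missing head
regressor.fc = nn.Linear(regressor.fc.in_features, 1)    # QMT in cm as a continuous output
```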
Figure 3. Examples of segmentation masks generated from the manual annotations (ground truth) for three patients (a–c), together with the masks predicted by the Seg-Regression model (best viewed in color). The Dice scores of the predicted masks for patients (a), (b), and (c) were 0.63, 0.89, and 0.76, respectively. QMT was then derived in a post-processing step that measured the distance between the horizontal edges of the muscle surface and the femur surface, obtained with Canny edge detection.
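The post-processing step described in Figure 3 can be prototyped in a few lines. The sketch below assumes a binary predicted mask and a known pixel spacing; the function name and the median-over-columns aggregation are illustrative choices rather than the authors' exact procedure. It extracts the mask boundary with Canny edge detection and converts the pixel distance between the top (muscle) and bottom (femur) edges into centimeters.

```python
import numpy as np
import cv2

def qmt_from_mask(mask, pixel_spacing_cm):
    """Estimate QMT (cm) from a binary segmentation mask.

    mask: 2D uint8 array (0 = background, 255 = segmented region between
          the quadriceps muscle surface and the femur surface)
    pixel_spacing_cm: physical height of one pixel in cm
    """
    # On a binary mask, Canny returns the boundary pixels of the region.
    edges = cv2.Canny(mask, 100, 200)

    ys, xs = np.nonzero(edges)
    if ys.size == 0:
        return None  # no region was predicted

    # For each image column, the topmost edge pixel approximates the muscle
    # surface and the bottommost edge pixel approximates the femur surface.
    thicknesses = []
    for col in np.unique(xs):
        col_ys = ys[xs == col]
        thicknesses.append(col_ys.max() - col_ys.min())

    # The median across columns gives a robust thickness estimate in pixels.
    return float(np.median(thicknesses)) * pixel_spacing_cm
```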
Figure 4. Statistical analysis for the (a) IW-Regression and (b) CW-Regression models.
Figure 5. Activation maps of the classification models: ResNet101 (a,e), DenseNet (b,f), ViT-B (c,g), and MAE-B (d,h). The first and second rows show activation maps for two different subjects; the subject in the first row was correctly classified, whereas the subject in the second row was misclassified. (GT: ground-truth class label, Pred: predicted class label).
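Activation maps such as those in Figure 5 are commonly obtained with Grad-CAM for CNN classifiers, while transformer models require a different relevance-propagation scheme. The sketch below is a minimal, self-contained Grad-CAM implementation in PyTorch; the randomly initialized ResNet101 and the chosen target layer are placeholders, and this is not necessarily the visualization code used in this study.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heat map for one image (1 x 3 x H x W tensor)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    h1.remove()
    h2.remove()

    # Weight each feature map by its average gradient, then keep positive evidence.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()

# Example usage with a ResNet101 classifier (last residual block as the target layer).
model = models.resnet101(weights=None).eval()
heatmap = grad_cam(model, torch.rand(1, 3, 224, 224), model.layer4[-1])
```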
Figure 6. Sample activation maps of (a) ViT-B and (b) MAE-B, detecting the body of the muscle.
Table 1. Inter-rater variability of QMT measurements.
| Rater 1 (cm) | Rater 2 (cm) | Rater 3 (cm) | Rater 4 (cm) | ICC(2,1) | 95% CI | ICC(2,3) | 95% CI |
|---|---|---|---|---|---|---|---|
| 3.29 ± 1.07 | 2.94 ± 0.86 | 2.99 ± 0.87 | 3.24 ± 1.12 | 0.83 | 0.77–0.87 | 0.95 | 0.93–0.96 |
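The ICC(2,1) and ICC(2,k) values in Table 1 can be reproduced from per-subject ratings with standard statistical tooling. A minimal sketch using the pingouin package is given below; the data-frame columns and file name are hypothetical placeholders for the study data.

```python
import pandas as pd
import pingouin as pg

# Long-format table with one row per (subject, rater) measurement; the column
# names and CSV file are hypothetical placeholders.
ratings = pd.read_csv("qmt_ratings.csv")  # columns: subject, rater, qmt_cm

icc = pg.intraclass_corr(
    data=ratings, targets="subject", raters="rater", ratings="qmt_cm"
)
# ICC2 corresponds to ICC(2,1) (single rater) and ICC2k to ICC(2,k) (average of k raters).
print(icc.set_index("Type").loc[["ICC2", "ICC2k"], ["ICC", "CI95%"]])
```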
Table 2. Median absolute error of QMT estimations.
| Model | IW-Regression (cm) | CW-Regression (cm) | Delta (cm) |
|---|---|---|---|
| ResNet101 | 0.40 | 0.30 | +0.10 ↑ |
| DenseNet121 | 0.50 | 0.38 | +0.12 ↑ * |
| ConvNext-B | 0.27 | 0.23 | +0.04 ↑ |
| ViT-B | 0.31 | 0.27 | +0.03 ↑ * |
| ViT-L | 0.32 | 0.29 | +0.03 ↑ |
| MAE-B | 0.34 | 0.31 | +0.03 ↑ |
| MAE-L | 0.28 | 0.28 | 0.00 |
| SwinT-B | 0.27 | 0.24 | +0.03 ↑ * |
| SwinT-L | 0.30 | 0.25 | +0.05 ↑ * |
| All | 0.28 | 0.25 | +0.03 ↑ ** |
* and ** denote statistically significant differences (p < 0.05 and p < 0.01, respectively) between IW-Regression (models initialized with ImageNet weights) and CW-Regression (models initialized with the corresponding classification weights). ↑: CW-Regression outperformed IW-Regression. ConvNext-B achieved the lowest errors among the CNN-based models, and SwinT-B and SwinT-L the lowest errors among the ViT-based models.
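The asterisks in Table 2 (and Table 3 below) reflect pairwise tests between two training strategies evaluated on the same test subjects. The exact test is not restated here, so the sketch below uses a Wilcoxon signed-rank test on paired per-subject absolute errors as one common, assumption-light choice; the error arrays are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-subject absolute errors (cm) from two training strategies,
# evaluated on the same test subjects so the samples are paired.
err_iw = np.array([0.41, 0.22, 0.35, 0.18, 0.52, 0.30, 0.27, 0.44])
err_cw = np.array([0.30, 0.20, 0.28, 0.15, 0.46, 0.29, 0.21, 0.40])

stat, p_value = wilcoxon(err_iw, err_cw)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.4f}")
# p < 0.05 would mark a row with *, and p < 0.01 with **, in the tables.
```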
Table 3. Accuracy of QMT classification in classification models.
| Model | IW-Classification (%) | RW-Classification (%) | Delta (%) |
|---|---|---|---|
| ResNet101 | 36.99 | 30.14 | −6.85 ↓ * |
| DenseNet121 | 39.73 | 38.36 | −1.37 ↓ |
| ConvNext-B | 31.51 | 32.88 | +1.37 ↑ |
| ViT-B | 42.47 | 38.36 | −4.11 ↓ ** |
| ViT-L | 41.10 | 43.84 | +2.74 ↑ |
| MAE-B | 38.36 | 36.99 | −1.37 ↓ |
| MAE-L | 41.10 | 41.10 | 0.00 |
| SwinT-B | 41.10 | 36.99 | −4.11 ↓ * |
| SwinT-L | 43.84 | 42.47 | −1.37 ↓ |
| All | 43.84 | 41.10 | −2.74 ↓ |
* and ** denote statistically significant differences (p < 0.05 and p < 0.01, respectively) between IW-Classification (models initialized with ImageNet weights) and RW-Classification (models initialized with the corresponding regression weights). ↑: RW-Classification outperformed IW-Classification; ↓: IW-Classification outperformed RW-Classification. The highest accuracies were 43.84% for IW-Classification (SwinT-L and All) and 43.84% for RW-Classification (ViT-L).
Table 4. Median of absolute errors in QMT estimations.
| Seg-Regression (cm) | IW-Regression (cm) | CW-Regression (cm) |
|---|---|---|
| 0.13 | 0.28 ** | 0.25 ** |
** denotes statistically significant differences (p < 0.01) between Seg-Regression and each of the other two methods.
