Article

Sample Size Effect on Musculoskeletal Segmentation: How Low Can We Go?

by Roel Huysentruyt 1,*, Ide Van den Borre 1,2, Srđan Lazendić 3, Kate Duquesne 1, Aline Van Oevelen 1, Jing Li 1, Arne Burssens 1, Aleksandra Pižurica 2,* and Emmanuel Audenaert 1

1 Group of Orthopedics and Traumatology, Department of Human Structure and Repair, Ghent University Hospital, 9000 Ghent, Belgium
2 Group for Artificial Intelligence and Sparse Modelling (GAIM), Department of Telecommunications and Information Processing, Ghent University, 9000 Ghent, Belgium
3 Clifford Research Group, Foundations Lab, Department of Electronics and Information Systems, Ghent University, 9000 Ghent, Belgium
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(10), 1870; https://doi.org/10.3390/electronics13101870
Submission received: 17 April 2024 / Revised: 7 May 2024 / Accepted: 8 May 2024 / Published: 10 May 2024
(This article belongs to the Special Issue Revolutionizing Medical Image Analysis with Deep Learning)

Abstract

Convolutional Neural Networks have emerged as a predominant tool in musculoskeletal medical image segmentation, enabling precise delineation of bone and cartilage in medical images. Recent developments in image processing and network architecture call for a reevaluation of the relationship between segmentation accuracy and the amount of training data. This study investigates the minimum sample size required to achieve clinically relevant accuracy in bone and cartilage segmentation using the nnU-Net methodology. In addition, the potential benefit of integrating available medical knowledge for data augmentation, a largely unexplored opportunity for data preprocessing, is investigated. The impact of sample size on the segmentation accuracy of the nnU-Net is studied using three distinct musculoskeletal datasets, including both MRI and CT, to segment bone and cartilage. Furthermore, model-informed augmentation is explored on two of these datasets by generating new training samples through a shape model-informed approach. Results indicate that the nnU-Net can achieve remarkable segmentation accuracy with as few as 10–15 training samples for bones and 25–30 training samples for cartilage. Model-informed augmentation did not yield relevant improvements in segmentation results. The sample size findings challenge the common notion that large datasets are necessary to obtain clinically relevant segmentation outcomes in musculoskeletal applications.

1. Introduction

Convolutional Neural Networks (CNNs) are widely used for image segmentation of musculoskeletal structures, facilitating the precise delineation of bones and cartilage in medical images [1,2,3]. Segmenting these structures is a critical step in patient-specific research and clinical applications, including 3D shape analysis, joint biomechanics and implant design. A common belief is that training these networks to generalize over unseen cases necessitates extensive amounts of labeled data, and networks are often constructed using tens to over a hundred training cases [4,5,6]. This poses a significant challenge, as the manual delineation of anatomical structures by experts is both labor-intensive and time-consuming. General-purpose segmentation tools have recently become available and could largely alleviate this labor issue [7]. Nevertheless, many applications require deep learning (DL) models trained on specific in-house datasets to tackle a unique clinical challenge. Such challenges are not always covered by the available tools, leaving experts with a substantial annotation workload. The current situation creates a critical bottleneck, potentially hindering scientific breakthroughs in the field.
Several strategies are being adopted to mitigate this extensive workload, including advancements in image preprocessing, postprocessing and network architecture. A typical preprocessing step is the use of data augmentation to create more diverse training samples. Conventional augmentation techniques include geometric transformations, such as translations, rotations and flipping, and intensity transformations through image filtering and noise addition [8,9]. For skeletal segmentation, conventional techniques, more specifically roto-translational augmentation, have been shown to improve segmentation results for the hip joint [10].
The nnU-Net currently stands as the state-of-the-art method in the domain. It is a self-configuring segmentation approach that integrates these advancements in data augmentation as a standard preprocessing step [11]. Beyond these conventional augmentation techniques, musculoskeletal segmentation can also benefit from more sophisticated augmentation approaches [12]. A particularly attractive option is to incorporate medical shape knowledge into the augmentation. Since their introduction by Cootes and Taylor [13], statistical shape models (SSMs) have been widely applied for capturing shape variations in diverse anatomical structures. By generating new shapes from these SSMs and applying deformations to the original images, new training samples comprising shape and image pairs can be created. The growing availability of large public medical shape datasets (e.g., MedShapeNet [14]) further facilitates the construction of robust SSMs. Furthermore, with advancements in modeling non-linear shape variation and/or eliminating the need for shape correspondence, current model-informed augmentation can be improved even further [15,16,17]. Nonetheless, only limited research has explored the potential of shape model-informed methods for data augmentation [18,19,20,21,22].
Current advancements present an opportunity to reduce the workload of experts while still producing robust segmentation tools. To achieve this, it is important to first understand the relationship between the number of annotated images and segmentation accuracy [23,24]. Recent work has investigated this relationship for nnU-Net in tumor segmentation [25], but for bone and cartilage structures it has not been thoroughly evaluated. Hence, a key question remains: What is the minimal number of training samples required to achieve clinically meaningful segmentation results?
Expressing clinically relevant differences in medical image segmentations through conventional metrics, such as the Dice similarity coefficient (DSC) and Hausdorff distance (HD), is a complex task. These metrics often do not correlate directly with trustworthiness [26]. Depending on the chosen standard, incremental improvements in metrics may not translate to a meaningful clinical impact [27]. Clinical relevance in segmentation is more accurately established when the model demonstrates the capability to effectively perform specific clinical tasks. This is indicated by a plateauing or minimal improvement in segmentation metrics as more training data are added, suggesting that the model has reached a sufficient level of performance. Additionally, clinical relevance is supported when there are no visual discrepancies in segmentation outcomes that might affect clinical decision-making. This nuanced approach to evaluating clinical relevance reflects the intricate relationship between machine learning performance and actual clinical utility.
The objectives of this study are therefore two-fold:
(1) Examine the minimal number of training samples needed to obtain clinically relevant segmentations of both bone and cartilage using nnU-Net, thereby gaining a better understanding of the relationship between segmentation accuracy and the amount of training data.
(2) Explore the relevance of shape model-informed augmentation when combined with current state-of-the-art techniques.

2. Materials and Methods

2.1. Datasets and Partitioning

Three distinct multilabeled musculoskeletal datasets, encompassing both asymptomatic and pathological cases and different image modalities, are investigated. This section provides an overview of each dataset. How the inherent characteristics of each dataset, such as demographics, affect the segmentation results is not further investigated in this work.
The first dataset contains 96 high-resolution images, a subset of a larger previously published collection of Angio CT scans, visualizing the entire skeletal system of the lower limb [28,29]. Inter-voxel spacing ranges from 0.578 mm to 0.977 mm, with an inter-slice thickness of either 0.625 mm or 1.250 mm. The cohort includes 54 males and 42 females, averaging 67 years, with a standard deviation of 13 years. Out of the 96 samples, 11 are designated for evaluation. The remaining 85 are used for training.
The second dataset is selected to represent clinical pathologies and to allow comparison with prior research. The open-source Osteoarthritis Initiative (OAI) ZIB (https://nda.nih.gov/oai, accessed on 16 April 2024) dataset is used, which includes MRI images of the knee joint, covering the full spectrum of OA grades. The dataset features both bone and cartilage segmentations. The original 253/254 split of the train and test set is maintained.
The third and final dataset in this study is selected to consolidate our findings and reflect the typical challenges in musculoskeletal research: structures of high complexity and scenarios of limited data availability. The dataset comprises 13 MRI images of the right foot, characterized by an inter-voxel spacing of 0.904 mm and a slice thickness of 0.900 mm [30]. The demographic composition is 5 males and 8 females, with an average age of 28 years and a standard deviation of 12 years. Two images are allocated for evaluation, while the remaining 11 are used for training. Considering the low sample size for evaluation, the experiments are repeated 5-fold by random sampling. The MRI images present notable challenges, particularly due to low contrast towards the toe regions, complicating manual segmentation. This makes the dataset an interesting case for evaluation within the scope of this study.

2.2. Model-Based Data Augmentation

Data augmentation is applied to the lower-limb CT dataset and the foot MRI dataset to evaluate its added value in musculoskeletal image segmentation. First, shape augmentation is performed using SSMs built upon principal component analysis (PCA). Before applying PCA, preprocessing is required to obtain dense shape correspondences and shape alignment. Correspondence is established following previously published work [31]: each target shape is prealigned to a template shape according to its principal axes of inertia, followed by a rigid iterative closest point (ICP) registration; a non-rigid ICP then transfers the vertices of the template to the target shape, yielding correspondence. After this procedure, shapes are set back to their original position and orientation. Finally, all shapes are rigidly aligned using a generalized Procrustes alignment (GPA) on the corresponding data, and the variance in shape is captured through PCA. For the PCA, each shape is represented by a point cloud $X_i = \{x_{i,1}, \ldots, x_{i,n}\}$, $x_{i,j} \in \mathbb{R}^3$, with $i = 1, \ldots, M$ indexing the shapes and $j = 1, \ldots, n$ the corresponding vertices; 175,520 vertices are included for the Angio CT data and 28,222 for the foot MRI. The mean shape $\bar{X}$ is calculated, followed by the covariance matrix $C$:

$$\bar{X} = \frac{1}{M} \sum_{i=1}^{M} X_i \qquad (1)$$

$$C = \frac{1}{M-1} \sum_{i=1}^{M} (X_i - \bar{X})^T (X_i - \bar{X}) \qquad (2)$$

From the covariance matrix, the modes of variation $v = \{v_1, \ldots, v_{M-1}\}$ and the corresponding variances $\lambda = \{\lambda_1, \ldots, \lambda_{M-1}\}$ are found by solving the eigenvalue problem:

$$C v_t = \lambda_t v_t, \quad t = 1, \ldots, M-1 \qquad (3)$$

These are sorted such that $\lambda_1 > \lambda_2 > \cdots > \lambda_{M-1}$. The number of retained modes of variation $c$ is chosen such that 95% of the variance is explained. Each shape can then be represented by its shape descriptor $b_i = \{b_{i,1}, \ldots, b_{i,c}\}$ and approximated by:

$$X_i \approx \bar{X} + \sum_{t=1}^{c} b_{i,t} v_t \qquad (4)$$
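To make the construction concrete, the following minimal Python sketch builds such a PCA-based SSM. It is an illustration under our own conventions, not the authors' implementation: shapes are assumed to be rigidly aligned and in dense correspondence, each flattened to a 3n-dimensional row vector.

```python
import numpy as np

def build_ssm(shapes, variance_kept=0.95):
    """Build a PCA-based statistical shape model.

    shapes: (M, 3n) array; each row is an aligned shape in dense
    correspondence, flattened from n vertices in 3D.
    Returns the mean shape, the retained modes and their variances.
    """
    mean_shape = shapes.mean(axis=0)                    # Equation (1)
    centered = shapes - mean_shape
    # SVD of the centered data yields the eigenvectors/eigenvalues of
    # the covariance matrix (Equations (2)-(3)) without forming it.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (shapes.shape[0] - 1)          # sorted eigenvalues
    explained = np.cumsum(variances) / variances.sum()
    c = int(np.searchsorted(explained, variance_kept)) + 1  # modes for 95%
    return mean_shape, vt[:c], variances[:c]

def reconstruct(mean_shape, modes, b):
    """Approximate a shape from its shape descriptor b (Equation (4))."""
    return mean_shape + b @ modes

def descriptor(mean_shape, modes, shape):
    """Project a shape onto the retained modes to obtain b."""
    return (shape - mean_shape) @ modes.T
```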
New shape entities are generated following the approach proposed by the authors of DeepSSM: new shape descriptors $b$ are sampled from a kernel density estimate (KDE) distribution fitted to the PCA space of the training data [21]. This allows the reconstruction of new shapes following Equation (4). The generated shapes are augmented further by a similar sampling procedure applied to the transformations obtained from the GPA, allowing plausible variation in both shape and orientation. Finally, a thin plate spline (TPS) warp is derived by mapping a shape from the original data to the newly generated shape. This warp is then used to generate a new image by deforming the original image.
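The sampling and warping steps can be sketched as follows. SciPy's gaussian_kde and thin-plate-spline RBFInterpolator are used here as stand-ins for the authors' implementation; the landmark subsets and voxel-coordinate conventions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def sample_descriptors(train_b, n_samples, seed=None):
    """Sample new shape descriptors from a KDE fitted to the PCA
    coefficients of the training shapes, as proposed in DeepSSM [21].

    train_b: (M, c) matrix of descriptors of the training shapes.
    """
    kde = gaussian_kde(train_b.T)        # gaussian_kde expects (c, M)
    return kde.resample(n_samples, seed=seed).T

def tps_warp_image(image, orig_pts, new_pts):
    """Deform `image` so that vertices of the original shape (orig_pts)
    move onto the generated shape (new_pts); both are (k, 3) arrays in
    voxel coordinates. The inverse map (new -> orig) is interpolated so
    that every output voxel samples the original image."""
    inverse = RBFInterpolator(new_pts, orig_pts, kernel='thin_plate_spline')
    grid = np.indices(image.shape).reshape(3, -1).T.astype(float)
    src = inverse(grid).T                # (3, n_voxels) source coordinates
    warped = map_coordinates(image, src, order=1)
    return warped.reshape(image.shape)
```

Evaluating the TPS at every voxel of a full volume is computationally expensive, which is consistent with the per-sample generation times reported in Section 3.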

2.3. Network Training and Evaluation

Each training run uses the self-configuring nnU-Net methodology with the full-resolution 3D U-Net architecture and a kernel size of 3 × 3 × 3. Images are converted to the required input format; no further preprocessing is needed under the nnU-Net procedure. Training is performed on each of the datasets with a range of training sample sizes, varying from 1 to 85 for the lower-limb CT data, 1 to 253 for the OAI ZIB data and 1 to 11 for the foot MRI data. Each network is trained for 1000 epochs using the cross-entropy Dice loss on all data in the training set and evaluated on a dedicated test set, instead of using the 5-fold cross-validation implemented in nnU-Net. To assess the generalizability of the results, all trainings with a reduced number of training samples are repeated with a different random training set. Computations are performed on NVIDIA A100 GPUs within the high-performance computing infrastructure at Ghent University. Further hyperparameters are automatically extracted during image preprocessing. For the OAI ZIB and foot MRI datasets, patch sizes of 80 × 190 × 160 and 192 × 128 × 96, respectively, are used. The patch size varied more between trainings on the CT dataset; examples are 96 × 320 × 80 and 224 × 112 × 96. All trainings use a batch size of 2.
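In practice, this workflow maps onto the standard nnU-Net command-line entry points. The sketch below is illustrative rather than the exact commands used: it assumes nnU-Net v2 (the version is not stated above) and a hypothetical dataset ID 101; the fold argument "all" trains on all training data instead of running the default 5-fold cross-validation.

```python
import subprocess

dataset = "101"  # hypothetical dataset ID, e.g. Dataset101_LowerLimbCT

# Fingerprint extraction and self-configuration: patch size, spacing,
# normalization and network topology are derived from the data.
subprocess.run(["nnUNetv2_plan_and_preprocess", "-d", dataset,
                "--verify_dataset_integrity"], check=True)

# Train the full-resolution 3D U-Net on all training data (fold "all"),
# instead of the default 5-fold cross-validation.
subprocess.run(["nnUNetv2_train", dataset, "3d_fullres", "all"], check=True)

# Predict the dedicated test set with the fully trained model.
subprocess.run(["nnUNetv2_predict", "-i", "imagesTs", "-o", "predictions",
                "-d", dataset, "-c", "3d_fullres", "-f", "all"], check=True)
```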
The nnU-Net framework incorporates a variety of conventional geometric augmentations, including rotation, scaling and mirroring. Due to the symmetry within the images and the choice of labels, mirroring is disabled for the CT and OAI ZIB datasets. The included intensity augmentations are Gaussian noise addition, gamma augmentation, brightness and contrast adjustments, Gaussian blurring and low-resolution simulation.
All predictions are evaluated against the ground truths through visual inspection by an orthopedic expert specialized in bone and cartilage segmentation, and any visual mistake is noted. Extra attention is paid to segmentation leakage outside the bone region and to misclassification at the joints, where the different segmentation classes are in close spatial relation. Both would constitute clinically relevant misclassifications, as they can lead to an incorrect diagnosis or surgical plan. Subsequently, the DSC and HD are computed from the prediction (Pred) and the ground truth (GT), providing insight into both the overall performance and the magnitude of the largest error. These are given by:
$$\mathrm{DSC} = \frac{2\,|Pred \cap GT|}{|Pred| + |GT|} \qquad (5)$$

$$\mathrm{HD}(Pred, GT) = \max\big(h(Pred, GT),\, h(GT, Pred)\big), \qquad (6)$$

with

$$h(Pred, GT) = \max_{a \in Pred} \min_{b \in GT} \lVert a - b \rVert_2 \qquad (7)$$
The HD is calculated on the generated 3D meshes, with $a$ and $b$ representing the vertices of the meshes. The 3D meshes are obtained using the marching cubes algorithm and post-processed by keeping only the largest connected structure of each bone prediction. This excludes small misclassifications in the background region, which would otherwise inflate the HD without affecting the segmentation accuracy of the region of interest.
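A sketch of the metric computation and mesh post-processing, assuming binary label volumes per class, is given below; it mirrors the procedure described above but is not the authors' exact evaluation code.

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import cKDTree
from skimage.measure import marching_cubes

def dice(pred, gt):
    """Dice similarity coefficient of two binary volumes (Equation (5))."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def largest_component(mask):
    """Keep only the largest connected structure of a binary mask,
    discarding small islands that would otherwise inflate the HD."""
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return labels == (np.argmax(sizes) + 1)

def hausdorff(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance (Equations (6)-(7)) between the
    vertices of meshes extracted with marching cubes."""
    verts_p, _, _, _ = marching_cubes(
        largest_component(pred).astype(np.float32), level=0.5, spacing=spacing)
    verts_g, _, _, _ = marching_cubes(
        gt.astype(np.float32), level=0.5, spacing=spacing)
    h_pg = cKDTree(verts_g).query(verts_p)[0].max()  # h(Pred, GT)
    h_gp = cKDTree(verts_p).query(verts_g)[0].max()  # h(GT, Pred)
    return max(h_pg, h_gp)
```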
Finally, the impact of model-informed data augmentation is evaluated by including the generated shapes and images in the training data. The new shapes are generated in 3D and are therefore converted to label maps for training. Based on the findings of Tang et al., the number of added training samples with model-based data augmentation is fixed at 20 [18]. Experiments with other amounts were performed but yielded no significant differences and are not reported further.

3. Results

A multiclass segmentation is obtained for each image. For the Angio CT dataset of the lower limb, segmentation includes five bone classes: right pelvis, left pelvis, femur, tibia and fibula. Observations indicate an enhancement in performance within joint areas when the bones are individually labeled; the right and left pelvis are therefore labeled separately. The OAI ZIB dataset comprises two bone and two cartilage classes: the femur, tibia, tibial cartilage, and femoral cartilage. The MRI dataset of the foot delineates 17 bone classes: 5 proximal phalanges (Prox1-5), 5 metatarsal bones (Meta1-5), and 7 tarsal bones (MedCun, InterCun, LatCun, Nav, Cal, Tal, Cub). Training each network on the CT data takes approximately 15 h, while training on the MRI datasets takes around 24 h. Each training run shows a stable reduction in loss and increase in DSC on the training set. Note that no validation loss is measured, as all data are used for training. Predictions on the test set are generated only with fully trained models.
Figure 1 displays examples of model-informed augmented training samples for the lower-limb CT and foot MRI data. The numbers of vertices used to calculate the TPS warp are 10,000 and 5000, respectively, allowing precise deformations while balancing computation time. The vertices are chosen through an isotropic resampling procedure that preserves correspondence. Generating an augmented sample takes roughly 1 h for the lower-limb CT samples and around 30 min for the foot MRI samples.
Table 1 presents the results obtained from the lower-limb CT dataset. For 10 training samples, the mean DSC ranges between 0.951 and 0.982, and the mean HD varies between 2.36 and 4.15 mm. The obtained mean DSC on the full training dataset of 85 samples ranges between 0.952 and 0.983, with the mean HD varying between 2.30 and 3.88 mm. Shape model-informed augmentation is applied starting from 1, 3 and 5 original training samples. Looking at 5 training samples combined with augmentation, a reduction in HD is observed, varying between 2.30 and 4.21 mm compared to 2.38 and 4.80 mm without augmentation. However, the DSCs are lower for each class than those achieved without the augmentation. Upon visual inspection, no improvements are found when implementing more than 10 training cases. Moreover, this model demonstrates no significant misclassifications.
Numerical results for varying numbers of training samples on the OAI ZIB dataset are detailed in Table 2. At 15 samples, the bone segmentation metrics converge, with a mean DSC and HD of 0.983 and 2.55 mm for the femur and 0.988 and 2.65 mm for the tibia. Beyond 15 samples, no further visually relevant improvement is noted. For cartilage segmentation, small enhancements in DSC and HD are observed with additional data: the optimal mean DSCs, 0.868 and 0.900, are found when implementing 50 training samples, and the lowest HD values, 4.14 and 4.46 mm, with 253 training cases. Nevertheless, these enhancements do not constitute visually relevant improvements beyond 25–30 samples.
The findings from the small-sized MRI dataset of the foot are summarized in Table 3. While the dataset is small, including the results of five cross-validation folds allows for numerical evaluation. For 7 training samples, the mean DSC ranges from 0.851 to 0.960, with the maximum HD being 2.11 mm. Results show limited improvement beyond seven training samples. Model-informed augmentation is employed from 1, 2, 3 and 11 original training samples. No notable additional improvement in segmentation metrics is observed when including model-informed data augmentation.
Examples of the evolution of the predictions for different numbers of training samples are presented in Figure 2. A visualization of the metrics for different sample sizes can be found in Figure 3.

4. Discussion

This study addresses the important question of identifying the minimal number of training samples required to achieve clinically relevant semantic segmentation outcomes when employing nnU-Net on bone and cartilage datasets. The potential advantage of shape model-informed augmentation is further explored, employing PCA for shape augmentation and TPS warping for image deformation.
Our findings indicate that robust bone segmentation performance is achievable with as few as 10 training samples for the lower-limb CT dataset and 7 for the MRI dataset of the foot. For the OAI ZIB dataset, DSCs comparable to previous studies are obtained with 25–30 training samples [32,33]. Although quantifying the clinical relevance of segmentations remains challenging, the obtained DSCs vary by no more than 0.5% when training samples are added beyond these numbers, suggesting convergence. Additionally, the HDs do not decrease below the voxel spacing, indicating that the maximum error remains within a confined range. These observations suggest that adding further training data does not significantly enhance bone and cartilage segmentation performance, making such models acceptable for further application in research and clinical studies.
These results challenge the prevailing dogma that DL-based medical image segmentation of bone and cartilage necessitates unreasonably large datasets. It is demonstrated on multiple datasets that 10–15 training samples suffice to yield high-quality multiclass predictions for bones, and 25–30 for cartilage, when using the nnU-Net methodology. While more varied structures (e.g., menisci and ligaments), specific pathologies, or multiple scanner types may require larger datasets, the notion that hundreds of training cases are essential does not hold. Note, further, that clean annotations are utilized in this study; coarse annotations might limit the network's ability to learn from a limited sample size.
Furthermore, the obtained findings have significant implications for research, as segmentation should no longer be a limiting factor, thereby freeing resources for other tasks. It is crucial to note that these findings serve as a general guideline and do not negate the need for researchers to validate their networks for specific research or clinical applications. In clinical settings, employing a larger dataset remains the safer practice. When more extensive datasets are needed to generalize over pathologies, multiple scanner types or more varied structures, active learning is recommended.
Although shape model-informed augmentation has been suggested to hold the potential to further reduce the burden of expert segmentation, our findings suggest it does not substantially benefit semantic bone and cartilage segmentation. Several limitations may explain this. First, this work only investigated shape model-informed augmentation; no model-informed intensity operations were performed. A generated image may therefore resemble the original image too closely, as only shape differences are introduced. Second, PCA cannot capture non-linear deformations, while certain anatomical variation is inherently non-linear. This is also the case for our MRI dataset: PCA is likely unable to capture the differences in toe orientation in the scanner between subjects, as this is a non-linear movement, which may lead to unrealistic shape augmentations and thus unrealistic images. Finally, the limited amount of shape data for the MRI dataset constrains the SSM, although this does not apply to the CT dataset of the lower limb. Even if some of these issues were solved, current segmentation methods already operate effectively with limited training cases on these kinds of musculoskeletal data, and below that amount shape model-informed augmentation offers no advantage. Furthermore, the substantial computational cost combined with the limited gains makes the approach impractical compared to the currently implemented geometric and intensity augmentation techniques.
Model-informed augmentation can still be beneficial for segmenting other kinds of anatomical structures and for other applications, such as directly predicting shape scores akin to DeepSSM. As an example, model-informed augmentation has been shown to improve the automatic segmentation of muscles, which can be a more complex segmentation task than the structures investigated in this work, although only marginal gains were reported [22].
This study provides an initial exploration of the relationship between nnU-Net segmentation accuracy and the amount of musculoskeletal training data. Despite including a variety of datasets, covering different image modalities and anatomical structures, the research remains constrained by several limitations. Except for knee osteoarthritis, the study does not encompass other pathologies, and it focuses solely on bone and cartilage as anatomical structures. Different scanning procedures, pathologies or anatomical structures might lead to different minimal sample sizes. Moreover, even though our MRI dataset of the foot is interesting due to its low contrast and the 5-fold cross-validation implemented, it remains small for evaluation. Nonetheless, the findings offer a promising indication of nnU-Net's efficacy in segmenting musculoskeletal data. Despite this efficacy and its self-configuring nature, which makes it very easy to use, some limitations were noticed. Training on the CT dataset took 15 h and on the MRI dataset 24 h; when cross-validation is included, training a network on a single dataset can easily take up to 5 days. In addition, training and inference on new cases require sufficient GPU memory over such extended periods, which might not be available to everyone.

5. Conclusions

This study investigates the relationship between nnU-Net segmentation accuracy and training sample size for musculoskeletal data using both CT and MRI datasets. We demonstrate that the nnU-Net methodology can achieve notable accuracy on bone and cartilage with, respectively, 10–15 and 25–30 training samples. This contests the prevailing belief that DL for semantic medical image segmentation of musculoskeletal structures requires extensive datasets. This efficiency can substantially reduce manual labor, freeing resources for other tasks and thus speeding up the research process. Furthermore, the use of model-informed augmentation is investigated: new training samples of skeletal structures are generated by sampling new shapes from the created SSMs and combining them with a TPS warp of the original image. While this is an interesting approach, it does not further improve the semantic segmentation of bone structures over the current state-of-the-art.

Author Contributions

Conceptualization and study development, R.H., I.V.d.B., S.L. and E.A.; data analysis, R.H.; writing, R.H. and E.A.; data curation, J.L., A.V.O., K.D. and A.B.; project administration, funding acquisition and supervision, E.A. and A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by Het Fonds voor Wetenschappelijk Onderzoek (FWO), G004224N. Researchers were also supported by the Flanders AI research program, Het Fonds voor Wetenschappelijk Onderzoek (FWO) senior clinical investigator and PhD fellowships and the Chinese Scholarship Council (CSC).

Data Availability Statement

The OAI dataset is openly available. The Angio CT and MRI datasets are part of previously published works and are available on request.

Acknowledgments

This manuscript was prepared using the Osteoarthritis Initiative (OAI) public use dataset and does not necessarily reflect the opinions or views of the OAI investigators. The OAI is a public–private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by the OAI Study Investigators. Private funding partners include Merck Research Laboratories; Novartis Pharmaceuticals Corporation, GlaxoSmithKline; and Pfizer, Inc. Private sector funding for the OAI is managed by the Foundation for the National Institutes of Health. The authors of this study want to thank everyone involved in creating these datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, X.; Song, L.; Liu, S.; Zhang, Y. A review of deep-learning-based medical image segmentation methods. Sustainability 2021, 13, 1224. [Google Scholar] [CrossRef]
  2. Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Merhof, D. Medical Image Segmentation Review: The Success of U-Net. arXiv 2022, arXiv:2211.14830. [Google Scholar] [CrossRef]
  3. Liu, F.; Kijowski, R. Deep Learning in Musculoskeletal Imaging. Adv. Clin. Radiol. 2019, 1, 83–94. [Google Scholar] [CrossRef]
  4. Keles, E.; Irmakci, I.; Bagci, U. Musculoskeletal MR Image Segmentation with Artificial Intelligence. Adv. Clin. Radiol. 2022, 4, 179–188. [Google Scholar] [CrossRef] [PubMed]
  5. Lee, Y.S.; Hong, N.; Witanto, J.N.; Choi, Y.R.; Park, J.; Decazes, P.; Yoon, S.H. Deep neural network for automatic volumetric segmentation of whole-body CT images for body composition assessment. Clin. Nutr. 2021, 40, 5038–5046. [Google Scholar] [CrossRef] [PubMed]
  6. Neves, C.A.; Tran, E.D.; Blevins, N.H.; Hwang, P.H. Deep learning automated segmentation of middle skull-base structures for enhanced navigation. Int. Forum. Allergy Rhinol. 2021, 11, 1694–1697. [Google Scholar] [CrossRef] [PubMed]
  7. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment Anything in Medical Images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
  8. Chlap, P.; Min, H.; Vandenberg, N.; Dowling, J.; Holloway, L.; Haworth, A. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 2021, 65, 545–563. [Google Scholar] [CrossRef]
  9. Alomar, K.; Aysel, H.I.; Cai, X. Data Augmentation in Classification and Segmentation: A Survey and New Strategies. J. Imaging 2023, 9, 46. [Google Scholar] [CrossRef]
  10. Montin, E.; Deniz, C.M.; Kijowski, R.; Youm, T.; Lattanzi, R. The impact of data augmentation and transfer learning on the performance of deep learning models for the segmentation of the hip on 3D magnetic resonance images. Inform. Med. Unlocked 2024, 45, 101444. [Google Scholar] [CrossRef]
  11. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  12. Noguchi, S.; Nishio, M.; Yakami, M.; Nakagomi, K.; Togashi, K. Bone segmentation on whole-body CT using convolutional neural network with novel data augmentation techniques. Comput. Biol. Med. 2020, 121, 103767. [Google Scholar] [CrossRef] [PubMed]
  13. Cootes, T.F.; Taylor, C.J.; Cooper, D.H.; Graham, J. Active Shape Models-Their Training and Application. Comput. Vis. Image Underst. 1995, 61, 38–59. [Google Scholar] [CrossRef]
  14. Li, J.; Pepe, A.; Gsaxner, C.; Luijten, G.; Jin, Y.; Ambigapathy, N.; Egger, J. MedShapeNet—A Large-Scale Dataset of 3D Medical Shapes for Computer Vision. arXiv 2023, arXiv:2308.16139. [Google Scholar]
  15. Duquesne, K.; Nauwelaers, N.; Claes, P.; Audenaert, E.A. Principal polynomial shape analysis: A non-linear tool for statistical shape modeling. Comput. Methods Programs Biomed. 2022, 220, 106812. [Google Scholar] [CrossRef] [PubMed]
  16. Lüdke, D.; Amiranashvili, T.; Ambellan, F.; Ezhov, I.; Menze, B.H.; Zachow, S. Landmark-Free Statistical Shape Modeling Via Neural Flow Deformations. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2022; pp. 453–463. [Google Scholar] [CrossRef]
  17. Iyer, K.; Elhabian, S.Y. Mesh2SSM: From Surface Meshes to Statistical Shape Models of Anatomy. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2023; Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 615–625. [Google Scholar] [CrossRef]
  18. Tang, Z.; Chen, K.; Pan, M.; Wang, M.; Song, Z. An Augmentation Strategy for Medical Image Processing Based on Statistical Shape Model and 3D Thin Plate Spline for Deep Learning. IEEE Access 2019, 7, 133111–133121. [Google Scholar] [CrossRef]
  19. Karimi, D.; Samei, G.; Kesch, C.; Nir, G.; Salcudean, S.E. Prostate segmentation in MRI using a convolutional neural network architecture and training strategy based on statistical shape models. Int. J. Comput. Assist. Radiol. Surg. 2018, 13, 1211–1219. [Google Scholar] [CrossRef] [PubMed]
  20. Schmid, J.; Assassi, L.; Chênes, C. A novel image augmentation based on statistical shape and intensity models: Application to the segmentation of hip bones from CT images. Eur. Radiol. Exp. 2023, 7, 39. [Google Scholar] [CrossRef] [PubMed]
  21. Bhalodia, R.; Elhabian, S.; Adams, J.; Tao, W.; Kavan, L.; Whitaker, R. DeepSSM: A blueprint for image-to-shape deep learning models. Med. Image Anal. 2024, 91, 103034. [Google Scholar] [CrossRef]
  22. Lin, Z.; Henson, W.H.; Dowling, L.; Walsh, J.; Dall’Ara, E.; Guo, L. Automatic segmentation of skeletal muscles from MR images using modified U-Net and a novel data augmentation approach. Front. Bioeng. Biotechnol. 2024, 12, 1355735. [Google Scholar] [CrossRef]
  23. Wakamatsu, Y.; Kamiya, N.; Zhou, X.; Hara, T.; Fujita, H. Relationship between number of annotations and accuracy in segmentation of the erector spinae muscle using Bayesian U-Net in torso CT images. In International Forum on Medical Imaging in Asia 2021; Chang, R.-F., Ed.; SPIE: Bellingham, WA, USA, 2021; p. 29. [Google Scholar] [CrossRef]
  24. Nemoto, T.; Futakami, N.; Kunieda, E.; Yagi, M.; Takeda, A.; Akiba, T.; Mutu, E.; Shigematsu, N. Effects of sample size and data augmentation on U-Net-based automatic segmentation of various organs. Radiol. Phys. Technol. 2021, 14, 318–327. [Google Scholar] [CrossRef] [PubMed]
  25. Gottlich, H.C.; Gregory, A.V.; Sharma, V.; Khanna, A.; Moustafa, A.U.; Lohse, C.M.; Potretzke, T.A.; Korfiatis, P.; Potretzke, A.M.; Denic, A.; et al. Effect of Dataset Size and Medical Image Modality on Convolutional Neural Network Model Performance for Automated Segmentation: A CT and MR Renal Tumor Imaging Study. J. Digit. Imaging 2023, 36, 1770–1781. [Google Scholar] [CrossRef] [PubMed]
  26. Kofler, F.; Ezhov, I.; Isensee, F.; Balsiger, F.; Berger, C.; Koerner, M.; Demiray, B.; Rackerseder, J.; Paetzold, J.; Li, H.; et al. Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient. Mach. Learn. Biomed. Imaging 2023, 2, 27–71. [Google Scholar] [CrossRef]
  27. Gibson, E.; Hu, Y.; Huisman, H.J.; Barratt, D.C. Designing image segmentation studies: Statistical power, sample size and reference standard quality. Med. Image Anal. 2017, 42, 44–59. [Google Scholar] [CrossRef]
  28. Audenaert, E.A.; Pattyn, C.; Steenackers, G.; De Roeck, J.; Vandermeulen, D.; Claes, P. Statistical Shape Modeling of Skeletal Anatomy for Sex Discrimination: Their Training Size, Sexual Dimorphism, and Asymmetry. Front. Bioeng. Biotechnol. 2019, 7, 302. [Google Scholar] [CrossRef]
  29. Van Houcke, J.; Audenaert, E.A.; Atkins, P.R.; Anderson, A.E. A Combined Geometric Morphometric and Discrete Element Modeling Approach for Hip Cartilage Contact Mechanics. Front. Bioeng. Biotechnol. 2020, 8, 318. [Google Scholar] [CrossRef]
  30. Peiffer, M.; Burssens, A.; Duquesne, K.; Last, M.; De Mits, S.; Victor, J.; Audenaert, E.A. Personalised statistical modelling of soft tissue structures in the ankle. Comput. Methods Programs Biomed. 2022, 218, 106701. [Google Scholar] [CrossRef]
  31. Audenaert, E.A.; Van Houcke, J.; Almeida, D.F.; Paelinck, L.; Peiffer, M.; Steenackers, G.; Vandermeulen, D. Cascaded statistical shape model based segmentation of the full lower limb in CT. Comput. Methods Biomech. Biomed. Eng. 2019, 22, 644–657. [Google Scholar] [CrossRef] [PubMed]
  32. Ambellan, F.; Tack, A.; Ehlke, M.; Zachow, S. Automated segmentation of knee bone and cartilage combining statistical shape knowledge and convolutional neural networks: Data from the Osteoarthritis Initiative. Med. Image Anal. 2019, 52, 109–118. [Google Scholar] [CrossRef]
  33. Liang, D.; Liu, J.; Wang, K.; Luo, G.; Wang, W.; Li, S. Position-Prior Clustering-Based Self-attention Module for Knee Cartilage Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2022; pp. 193–202. [Google Scholar] [CrossRef]
Figure 1. Examples of two augmented samples with the corresponding original data samples for both the lower-limb CT and foot MRI datasets. The augmented 3D models are obtained by sampling a shape descriptor and applying the SSM reconstruction. The augmented image is then obtained by deforming the original image with the TPS warp from the original shape to the augmented shape.
Figure 2. Evolution of segmentation predictions with different training sample sizes. Both 3D reconstructions and the original images overlaid with predictions are shown. For the lower-limb CT and the foot MRI dataset, the worst prediction when trained on 1 sample is shown to clearly demonstrate segmentation improvement. For the OAI dataset, a typical prediction evolution over different sample sizes is given. (a) Lower-limb CT. (b) OAI ZIB. (c) Foot MRI. (GT = ground truth).
Figure 3. Evolution of segmentation predictions with different training sample sizes. Only results without additional shape model-informed data augmentation are shown. The mean Dice Similarity Coefficient (DSC) and the Hausdorff distance (HD) of each segmented structure are given. The bars indicate a single standard deviation in each direction. It is seen that the metrics converge after including only a small subset of the complete dataset. (a) Lower-limb CT. (b) OAI ZIB. (c) Foot MRI.
Table 1. Effect of sample size on the CT dataset of the lower limb on 5 different skeletal structures employing nnU-Net, with and without shape model-based augmentation. The obtained mean Dice Similarity Coefficient (DSC) and the mean Hausdorff distance (HD), in mm, with the standard deviations are depicted.
| Training samples | Metric | Femur | Tibia | Fibula | Pelvis R | Pelvis L |
|---|---|---|---|---|---|---|
| 1 | DSC | 0.978 ± 0.004 | 0.977 ± 0.003 | 0.935 ± 0.017 | 0.960 ± 0.009 | 0.958 ± 0.009 |
| 1 | HD (mm) | 10.7 ± 30.96 | 2.87 ± 0.87 | 4.87 ± 9.53 | 4.33 ± 1.67 | 4.78 ± 1.97 |
| 1, with aug | DSC | 0.976 ± 0.003 | 0.976 ± 0.002 | 0.935 ± 0.012 | 0.957 ± 0.007 | 0.957 ± 0.008 |
| 1, with aug | HD (mm) | 3.06 ± 0.78 | 2.77 ± 0.65 | 2.44 ± 1.14 | 4.79 ± 2.47 | 4.57 ± 2.10 |
| 3 | DSC | 0.980 ± 0.003 | 0.978 ± 0.003 | 0.946 ± 0.011 | 0.962 ± 0.009 | 0.961 ± 0.009 |
| 3 | HD (mm) | 9.88 ± 31.06 | 2.85 ± 1.00 | 2.87 ± 1.40 | 4.59 ± 1.44 | 4.34 ± 1.50 |
| 3, with aug | DSC | 0.979 ± 0.003 | 0.977 ± 0.001 | 0.940 ± 0.010 | 0.960 ± 0.009 | 0.959 ± 0.008 |
| 3, with aug | HD (mm) | 2.31 ± 0.63 | 2.56 ± 0.75 | 2.56 ± 1.13 | 3.74 ± 0.97 | 3.72 ± 1.07 |
| 5 | DSC | 0.981 ± 0.003 | 0.979 ± 0.003 | 0.948 ± 0.011 | 0.964 ± 0.009 | 0.962 ± 0.010 |
| 5 | HD (mm) | 2.38 ± 0.71 | 2.82 ± 0.88 | 2.62 ± 1.29 | 4.80 ± 1.59 | 4.37 ± 1.88 |
| 5, with aug | DSC | 0.980 ± 0.003 | 0.977 ± 0.002 | 0.943 ± 0.009 | 0.960 ± 0.009 | 0.960 ± 0.009 |
| 5, with aug | HD (mm) | 2.30 ± 0.67 | 2.42 ± 0.70 | 2.58 ± 1.12 | 4.21 ± 2.08 | 4.03 ± 1.38 |
| 10 | DSC | 0.982 ± 0.003 | 0.980 ± 0.003 | 0.951 ± 0.010 | 0.966 ± 0.009 | 0.964 ± 0.010 |
| 10 | HD (mm) | 2.36 ± 0.68 | 2.53 ± 0.75 | 2.78 ± 1.59 | 3.60 ± 0.69 | 4.15 ± 1.61 |
| 15 | DSC | 0.982 ± 0.003 | 0.980 ± 0.003 | 0.951 ± 0.011 | 0.966 ± 0.009 | 0.964 ± 0.010 |
| 15 | HD (mm) | 2.51 ± 0.71 | 2.58 ± 0.88 | 2.63 ± 1.31 | 3.88 ± 0.91 | 3.82 ± 1.43 |
| 85 | DSC | 0.983 ± 0.002 | 0.980 ± 0.002 | 0.952 ± 0.010 | 0.970 ± 0.005 | 0.965 ± 0.007 |
| 85 | HD (mm) | 2.50 ± 0.80 | 2.30 ± 0.80 | 2.57 ± 1.23 | 3.88 ± 0.64 | 3.67 ± 1.54 |
Table 2. Effect of sample size, employing nnU-Net, on the OAI ZIB data of the knee. Results are found for the femur and tibia, and their cartilage layers. The obtained mean Dice Similarity Coefficient (DSC) and the mean Hausdorff distance (HD), in mm, with the standard deviations are depicted.
| Training samples | Metric | Femur | Femoral Cartilage | Tibia | Tibial Cartilage |
|---|---|---|---|---|---|
| 1 | DSC | 0.959 ± 0.059 | 0.844 ± 0.034 | 0.960 ± 0.060 | 0.783 ± 0.053 |
| 1 | HD (mm) | 25.70 ± 13.14 | 26.37 ± 13.24 | 25.85 ± 12.83 | 24.98 ± 13.19 |
| 5 | DSC | 0.982 ± 0.020 | 0.882 ± 0.030 | 0.987 ± 0.002 | 0.840 ± 0.054 |
| 5 | HD (mm) | 3.34 ± 1.93 | 5.51 ± 2.08 | 2.94 ± 1.14 | 5.32 ± 2.64 |
| 10 | DSC | 0.983 ± 0.018 | 0.888 ± 0.025 | 0.987 ± 0.002 | 0.845 ± 0.040 |
| 10 | HD (mm) | 2.90 ± 1.06 | 5.23 ± 1.77 | 2.85 ± 1.03 | 5.53 ± 2.48 |
| 15 | DSC | 0.983 ± 0.040 | 0.890 ± 0.025 | 0.988 ± 0.002 | 0.856 ± 0.038 |
| 15 | HD (mm) | 2.55 ± 1.09 | 4.94 ± 1.62 | 2.65 ± 1.25 | 5.36 ± 2.23 |
| 20 | DSC | 0.983 ± 0.019 | 0.893 ± 0.025 | 0.988 ± 0.002 | 0.857 ± 0.036 |
| 20 | HD (mm) | 2.71 ± 1.12 | 4.96 ± 2.29 | 2.83 ± 1.17 | 5.14 ± 2.00 |
| 25 | DSC | 0.984 ± 0.017 | 0.895 ± 0.021 | 0.988 ± 0.002 | 0.860 ± 0.035 |
| 25 | HD (mm) | 2.64 ± 1.09 | 4.68 ± 1.69 | 2.79 ± 1.15 | 5.05 ± 2.14 |
| 30 | DSC | 0.986 ± 0.002 | 0.896 ± 0.021 | 0.988 ± 0.002 | 0.861 ± 0.035 |
| 30 | HD (mm) | 2.58 ± 0.81 | 4.68 ± 1.49 | 2.91 ± 1.25 | 4.98 ± 2.03 |
| 50 | DSC | 0.983 ± 0.019 | 0.900 ± 0.021 | 0.989 ± 0.002 | 0.868 ± 0.030 |
| 50 | HD (mm) | 2.55 ± 0.93 | 4.46 ± 1.45 | 2.47 ± 0.82 | 4.74 ± 1.73 |
| 100 | DSC | 0.983 ± 0.024 | 0.900 ± 0.020 | 0.989 ± 0.002 | 0.871 ± 0.031 |
| 100 | HD (mm) | 2.56 ± 1.04 | 4.15 ± 1.23 | 2.90 ± 1.06 | 4.55 ± 1.41 |
| 253 | DSC | 0.983 ± 0.020 | 0.900 ± 0.021 | 0.988 ± 0.005 | 0.865 ± 0.050 |
| 253 | HD (mm) | 2.65 ± 1.00 | 4.14 ± 1.24 | 2.65 ± 0.63 | 4.46 ± 1.77 |
Table 3. Effect of sample size, employing nnU-Net on the MRI foot dataset. Results are found with and without shape model-informed augmentation. The obtained mean Dice Similarity Coefficient (DSC) and the mean Hausdorff distance (HD), in mm, with standard deviations, are depicted. A total of 17 foot bones are segmented. The cuneiforms, metatarsal bones and proximal phalanges are combined, respectively, to Cun, Meta and Proxy.
| Training samples | Metric | Calc | Cub | Cun | Nav | Talus | Meta | Proxy |
|---|---|---|---|---|---|---|---|---|
| 1 | DSC | 0.940 ± 0.007 | 0.887 ± 0.011 | 0.895 ± 0.010 | 0.908 ± 0.009 | 0.934 ± 0.011 | 0.780 ± 0.047 | 0.704 ± 0.131 |
| 1 | HD (mm) | 3.62 ± 0.93 | 3.10 ± 0.63 | 3.32 ± 1.29 | 2.23 ± 0.17 | 2.36 ± 0.42 | 6.86 ± 4.04 | 6.82 ± 3.97 |
| 1, with aug | DSC | 0.946 ± 0.001 | 0.878 ± 0.003 | 0.883 ± 0.007 | 0.919 ± 0.005 | 0.941 ± 0.001 | 0.775 ± 0.015 | 0.714 ± 0.03 |
| 1, with aug | HD (mm) | 2.67 ± 0.04 | 3.56 ± 0.59 | 2.62 ± 0.48 | 1.79 ± 0.26 | 2.49 ± 0.21 | 8.15 ± 1.25 | 7.83 ± 1.70 |
| 2 | DSC | 0.949 ± 0.003 | 0.920 ± 0.012 | 0.908 ± 0.012 | 0.910 ± 0.020 | 0.942 ± 0.006 | 0.854 ± 0.025 | 0.788 ± 0.045 |
| 2 | HD (mm) | 2.32 ± 0.23 | 2.45 ± 0.43 | 3.44 ± 2.77 | 2.34 ± 1.29 | 2.42 ± 0.39 | 3.02 ± 1.10 | 2.71 ± 0.82 |
| 2, with aug | DSC | 0.945 ± 0.001 | 0.901 ± 0.002 | 0.908 ± 0.015 | 0.922 ± 0.01 | 0.942 ± 0.009 | 0.858 ± 0.013 | 0.760 ± 0.035 |
| 2, with aug | HD (mm) | 2.49 ± 0.21 | 2.52 ± 0.09 | 1.78 ± 0.23 | 1.63 ± 0.01 | 2.34 ± 0.29 | 2.96 ± 0.43 | 4.14 ± 0.75 |
| 3 | DSC | 0.952 ± 0.007 | 0.917 ± 0.005 | 0.914 ± 0.011 | 0.920 ± 0.007 | 0.950 ± 0.008 | 0.872 ± 0.022 | 0.825 ± 0.048 |
| 3 | HD (mm) | 2.22 ± 0.38 | 2.45 ± 0.49 | 2.29 ± 1.12 | 1.90 ± 0.30 | 1.94 ± 0.28 | 2.25 ± 0.72 | 2.35 ± 0.83 |
| 3, with aug | DSC | 0.946 ± 0.006 | 0.909 ± 0.006 | 0.887 ± 0.010 | 0.927 ± 0.007 | 0.942 ± 0.010 | 0.872 ± 0.012 | 0.822 ± 0.025 |
| 3, with aug | HD (mm) | 4.86 ± 2.67 | 3.48 ± 1.29 | 2.52 ± 0.33 | 1.51 ± 0.04 | 2.75 ± 0.70 | 3.17 ± 0.052 | 3.14 ± 0.54 |
| 5 | DSC | 0.954 ± 0.006 | 0.927 ± 0.003 | 0.926 ± 0.011 | 0.931 ± 0.009 | 0.952 ± 0.010 | 0.894 ± 0.012 | 0.847 ± 0.028 |
| 5 | HD (mm) | 2.12 ± 0.29 | 2.17 ± 0.36 | 1.44 ± 0.17 | 1.65 ± 0.25 | 1.82 ± 0.28 | 1.67 ± 0.15 | 2.01 ± 0.53 |
| 7 | DSC | 0.960 ± 0.004 | 0.933 ± 0.004 | 0.931 ± 0.006 | 0.931 ± 0.008 | 0.957 ± 0.005 | 0.891 ± 0.008 | 0.851 ± 0.028 |
| 7 | HD (mm) | 2.11 ± 0.34 | 1.96 ± 0.40 | 1.48 ± 0.10 | 1.65 ± 0.33 | 1.66 ± 0.06 | 1.66 ± 0.19 | 2.01 ± 0.49 |
| 11 | DSC | 0.961 ± 0.004 | 0.934 ± 0.002 | 0.931 ± 0.008 | 0.928 ± 0.004 | 0.959 ± 0.006 | 0.894 ± 0.006 | 0.857 ± 0.028 |
| 11 | HD (mm) | 1.86 ± 0.29 | 2.12 ± 0.40 | 1.39 ± 0.02 | 1.67 ± 0.39 | 1.67 ± 0.05 | 1.65 ± 0.14 | 1.77 ± 0.41 |
| 11, with aug | DSC | 0.961 ± 0.001 | 0.935 ± 0.001 | 0.929 ± 0.005 | 0.925 ± 0.002 | 0.962 ± 0.005 | 0.896 ± 0.009 | 0.862 ± 0.025 |
| 11, with aug | HD (mm) | 2.09 ± 0.47 | 2.10 ± 0.55 | 1.45 ± 0.08 | 1.61 ± 0.17 | 1.54 ± 0.07 | 1.63 ± 0.14 | 1.85 ± 0.41 |