1. Introduction
The meniscus is a soft tissue structure located between the femur and tibia in the knee joint. It consists of two C-shaped fibrocartilaginous structures, the lateral and medial menisci. These structures have essential functions: providing joint stability, shock absorption, joint fluid distribution, and load transfer [1,2]. Meniscal injuries usually lead to changes in joint biomechanics that affect load distribution and contact stresses [3]. Meniscal injuries are common knee injuries seen in all age groups and usually occur as a result of sporting activities, age-related degeneration, or trauma.
Accurate diagnosis of meniscal injuries is essential to determine the type and extent of damage. Magnetic resonance imaging (MRI) is the gold standard for non-invasive diagnosis. MRI is widely used in the preliminary diagnosis of knee injuries because it is non-invasive and provides clear visualization of soft tissue with high-contrast resolution [4]. Although MRI assesses the structural integrity of meniscal tissue with high precision, its interpretation is often time-consuming and requires expert readers. In recent years, several studies have used deep learning (DL) algorithms to detect and evaluate meniscal injuries on MRI images [5,6].
Meniscal segmentation determines, automatically or manually, the boundaries of meniscal tissue and separates it from the surrounding tissues [7,8]. Segmentation is usually performed using pixel-based or region-based methods. With the rise of deep learning, image segmentation has been divided into three main categories: (i) semantic segmentation, (ii) instance segmentation, and (iii) panoptic segmentation. Semantic segmentation assigns each pixel in an image to a specific class, while instance segmentation separately identifies different objects belonging to the same class. Panoptic segmentation combines these approaches by labeling each pixel as an object instance or background [9,10].
Some DL-based studies for detecting and classifying meniscal injuries support this process by segmenting meniscal tissue [4,11]. Segmentation is used not only in injury detection but also in treatment planning and surgical interventions. In particular, meniscal segmentation from knee MRI images plays an important role in analyzing the length, width, height, cross-sectional area, and surface area of the meniscus for meniscal allograft transplantation using a 3D reconstruction model based on the patient’s normal meniscus [12].
The segmentation process can be complicated because MRI images have multiple sequences showing different tissues at various intensity and contrast levels. This causes structures with no clear boundaries, such as the meniscus, to appear in different shapes and intensities in each sequence. However, DL-based segmentation methods offer a more efficient approach to overcoming these challenges than traditional methods by providing high accuracy, speed, and automation advantages. As a result, these innovative methods provide a versatile contribution to the detection and treatment of meniscal injuries.
In recent years, studies on DL-based meniscus segmentation have increased and differ in imaging methods, segmentation architectures, and targeted tasks. Although most studies in the literature are based on MRI images, alternative modalities such as ultrasonography are also used. Especially for knee osteoarthritis (KOA), where early diagnosis is difficult and evaluation methods are limited, ultrasound stands out as a simple, economical, and non-invasive tool. Ultrasonography offers a powerful alternative to traditional methods by enabling fast and effective automated segmentation in meniscal assessment. In this context, segmentation models developed with deep learning techniques achieve successful results in tasks such as the analysis of ultrasound images and the automatic measurement of the meniscal area [13].
In terms of segmentation algorithms, U-Net-based architectures stand out with their simple structure and high performance, providing precise segmentation at the pixel level [12,13,14,15,16]. In addition, Mask R-CNN-based architectures are effective in discriminating different regions of meniscal tissue (e.g., medial and lateral) and injury types (e.g., tear and degeneration), and excel in instance segmentation tasks [11,17,18].
The objectives of the studies also determine the choice of these methods. For example, some studies in the literature focus only on segmenting meniscal tissue [12,15,18,19], while others use segmentation as a preprocessing step to detect meniscal injuries [11,14,16,17].
This study used the YOLOv8, YOLOv9, and YOLOv11 series of YOLO algorithms for meniscus segmentation. Segmentation masks were created with five models (YOLOv8x-seg, YOLOv8x-seg-p6, YOLOv9c-seg, YOLOv9e-seg, YOLOv11-seg) that can perform the segmentation task. By exploiting the different structural properties of these models, the strengths of each model are combined through ensemble methods. The masks obtained using the YOLO series are combined with pixel-based multiple voting and dynamic weight optimization to improve the performance of the YOLO series on meniscus segmentation (Figure 1). Despite significant advances in the literature, studies using ensemble methods for meniscus segmentation and injury classification remain limited; the proposed ensemble methods therefore aim to fill an important gap in this field. Furthermore, this is, to our knowledge, the first study using the YOLO series for meniscus segmentation. The proposed method is validated on an internal test set and an external dataset. The contributions of this study to the literature are as follows:
Using YOLO series models, the proposed ensemble methods (pixel-based voting, weighted multiple voting, and dynamic weighted multiple voting optimized by grid search) improved the meniscus segmentation performance.
The proposed ensemble-based approach sheds light on the development of studies that use ensemble methods as a preprocessing step in the detection or classification of meniscal tears.
By improving the accuracy and reliability of the results of meniscus segmentation, the proposed method has a significant impact on the planning of surgical interventions and the improvement of the quality of life of those affected.
The experiments on the internal and external sets demonstrated the generalization ability of the proposed method and its applicability in different clinical settings.
In the following sections of the study, the datasets, segmentation models, and ensemble methods are described in the materials and methods section. In the results section, the performance of the proposed method is evaluated using several metrics and is compared with other methods. In the discussion section, the strengths and weaknesses of the proposed method are discussed and compared with the literature. In the conclusion section, the general findings are summarized, and the potential impacts of the method in the health field are emphasized.
3. Results
Five models (YOLOv8x-seg, YOLOv8x-seg-p6, YOLOv9c-seg, YOLOv9e-seg, YOLOv11-seg) were trained for the segmentation of ROIs in the meniscus region using the YOLOv8, YOLOv9, and YOLOv11 series of the existing YOLO algorithms. In the rest of the paper, these models are referred to as Model 1, Model 2, Model 3, Model 4, and Model 5, respectively. To improve the prediction results obtained from these five models, ensemble-based approaches were used: pixel-based voting, weighted multiple voting, and dynamic weighted multiple voting optimized by grid search. In particular, dynamic weighted multiple voting with grid search significantly improved the model performance.
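As a concrete illustration, pixel-based voting over binary masks can be sketched as follows (a minimal sketch with hypothetical toy masks; the actual pipeline applies the same per-pixel rule to the ROI masks predicted by the five models):

```python
import numpy as np

def majority_vote(masks):
    """Per-pixel majority voting: a pixel is foreground when more than
    half of the models mark it as foreground."""
    stacked = np.stack(masks, axis=0)        # (n_models, H, W)
    votes = stacked.sum(axis=0)              # per-pixel foreground votes
    return (votes > len(masks) / 2).astype(np.uint8)

# Toy 2x2 masks standing in for three model predictions (hypothetical).
m1 = np.array([[1, 0], [1, 1]], dtype=np.uint8)
m2 = np.array([[1, 0], [0, 1]], dtype=np.uint8)
m3 = np.array([[0, 1], [1, 1]], dtype=np.uint8)
fused = majority_vote([m1, m2, m3])          # -> [[1, 0], [1, 1]]
```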
R, P, and F1 score metrics were used to measure the performance of the models at the end of training. The best metric values (best) and the last epoch values (last) obtained by the models over 100 epochs in the mask generation task are given in Table 3 and Figure 4. When Table 3 and Figure 4 are analyzed, Model 1 showed the highest performance in terms of best metric values, with P (0.9379), R (0.9277), and F1 score (0.9328). Considering the last epoch values, Model 1 maintained its superiority over the other models, although it showed a slight decrease in the P (0.9312), R (0.9084), and F1 score (0.9196) metrics.
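For reference, the F1 score is the harmonic mean of P and R; plugging in Model 1's reported best values reproduces the reported F1:

```python
def f1_score(p, r):
    """F1 = harmonic mean of precision (P) and recall (R)."""
    return 2 * p * r / (p + r)

# Model 1's best P and R from Table 3 give the reported F1 of ~0.9328.
f1 = f1_score(0.9379, 0.9277)
```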
Figure 4 shows that during the training process, the metric values of the models initially increase rapidly and then reach a stable point. In particular, Figure 4 shows that Model 1 reaches stable performance at early epochs, while Model 5 fluctuates considerably during the training process.
The models were trained on the internal dataset and evaluated on the test and external sets. Figure 5 shows the qualitative results obtained.
Table 4 shows the number of images from the test set and the external set for which the models failed to generate ROI areas during the evaluation. Model 1 successfully predicted ROI areas for all test and external set images. Model 2, on the other hand, lagged behind the other models, with the highest number of images in which no ROI area was segmented.
The trained models were evaluated on the test set and external set with a confidence threshold of 0.5, and ROI masks of the meniscus region predicted by the models were created from these images. The confidence threshold was set to 0.5 to ensure a balanced performance between the P and R values. The resulting masks were used for qualitative visual analysis and evaluated in the proposed method’s pipeline.
Figure 6 shows examples of the masks produced by this process. The original images, ground truth, and the prediction masks of Model 1 on the test set are also shown in Figure 6.
The masks created using the YOLO models were refined with ensemble voting methods and evaluated on both the test and external sets. Several methods were tested in the study: pixel-based voting (Method 1), weighted ensemble voting (Method 2), and dynamic weighted ensemble voting optimized by grid search (Method 3). The masks for these methods are generated by the multiple voting procedures described in the Materials and Methods section. The quantitative results of the models and the proposed method on the test set and the external set are presented in Table 5, Table 6, and Table 7. Performance metrics are reported with 95% confidence intervals, denoted by the ± symbol, which represent the variance around the mean values of the metrics.
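A minimal sketch of Methods 2 and 3 is given below, assuming binary masks and mean Dice as the optimization target; the weight grid, threshold, and validation protocol here are illustrative, not the exact settings used in the study:

```python
import itertools
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def weighted_vote(masks, weights, thr=0.5):
    """Weighted per-pixel vote (Method 2): foreground where the
    normalized weighted sum of the model masks exceeds thr."""
    w = np.asarray(weights, dtype=float)
    stacked = np.stack(masks, axis=0).astype(float)
    score = np.tensordot(w / w.sum(), stacked, axes=1)
    return (score > thr).astype(np.uint8)

def grid_search_weights(masks_per_image, gts, grid=(0.0, 0.5, 1.0)):
    """Method 3 sketch: exhaustively try weight combinations and keep
    the one with the highest mean Dice on the validation images."""
    n_models = len(masks_per_image[0])
    best_w, best_d = None, -1.0
    for w in itertools.product(grid, repeat=n_models):
        if sum(w) == 0:
            continue  # all-zero weights are undefined
        d = np.mean([dice(weighted_vote(m, w), g)
                     for m, g in zip(masks_per_image, gts)])
        if d > best_d:
            best_w, best_d = w, d
    return best_w, best_d
```

The grid search is exhaustive, so its cost grows as |grid|^n_models; with five models and a coarse grid this remains tractable.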
Table 5 summarizes the performance evaluations of the models applied to the test and external set and the masks obtained from multiple voting methods.
As seen in Table 5, Method 3 achieves the best results on all metrics in both the test set and the external set. These results show that Method 3 succeeded both in segmentation accuracy and in correctly detecting positive samples. For the DSC metric, Method 3 achieved the highest value of 0.9004 (±0.0064) in the external set, followed by 0.8976 (±0.0071) in the test set. Regarding PPV, Method 3 also stood out, with values of 0.8561 (±0.0121) and 0.8876 (±0.0134) in the test and external sets, respectively, showing that it maintains the accuracy of its positive predictions. Similar success was observed for the Specificity and Accuracy values, which were very high, 0.9995 (±0.0000), in both datasets; these results indicate that the false positive rate is extremely low and the overall accuracy of the method is high. For the Sensitivity metric, Method 3 outperformed the other methods, with values of 0.9467 (±0.0077) on the test set and 0.9200 (±0.0119) on the external set, showing that it identifies positive examples with high accuracy. The methods built with multiple voting generally give better results than the individually trained YOLO models. Among the individual models, Model 1, which had the best training performance, also performs better than the other models.
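The metrics reported above follow directly from the pixel-level confusion matrix of a predicted mask against its ground truth; a minimal sketch (toy masks, not study data):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """DSC, PPV, Sensitivity, and Specificity from the pixel-level
    confusion matrix of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "PPV": tp / (tp + fp),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
    }

# Toy example: one false positive pixel.
pred = np.array([[1, 1], [0, 0]], dtype=np.uint8)
gt = np.array([[1, 0], [0, 0]], dtype=np.uint8)
m = segmentation_metrics(pred, gt)   # DSC = 2/3, PPV = 0.5, Sens = 1.0
```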
The performance analyses of the methods and models were not limited to overall evaluations on the test set and external set; detailed analyses specific to the lateral and medial meniscus regions and to healthy and torn meniscus conditions were also performed. In these analyses, the DSC and Sensitivity metrics were used, and the regional and case-based success of the methods was compared based on these two metrics. The performance results for the LM and MM ROI areas are presented in Table 6, while the results for the healthy and torn meniscus cases are presented in Table 7.
Table 6 shows that Method 3 is the most successful MM and LM segmentation method. For MM, 0.9006 (±0.0084) DSC and 0.9435 (±0.0104) Sensitivity values were obtained in the test set, and 0.9055 (±0.0069) DSC and 0.9008 (±0.0151) Sensitivity values were obtained in the external set. For LM, it showed superior performance with 0.8923 (±0.0134) DSC and 0.9522 (±0.0111) Sensitivity in the test set and 0.8916 (±0.0128) DSC and 0.9534 (±0.0129) Sensitivity in the external set.
Table 7 shows the performance of the masks for healthy and torn meniscus segmentation. In particular, Method 3 achieves the best healthy and torn meniscus segmentation results in both datasets. Model 1 stands out as the second most successful method for healthy meniscus after Method 3. In the external set, Model 1 stood out with DSC values of 0.8705 (±0.0202) and Sensitivity values of 0.9591 (±0.0120). However, Method 3 gave more consistent results with lower uncertainty than Model 1. Method 2 ranked second in the segmentation of torn meniscus. It performed close to Method 3 with DSC values of 0.8882 (±0.0093) and Sensitivity values of 0.8826 (±0.0205) in the external set.
Considering the quantitative results obtained by the proposed methods on the test and external sets, Method 3 performed best. Cohen’s Kappa, Jaccard Index, and Matthews Correlation Coefficient (MCC) metrics were used to evaluate the overall consistency of the method and its balance between classes. These metrics demonstrate the method’s success beyond chance-level agreement and its robustness to class imbalance. In the test set, these metrics were calculated as 0.8972 (±0.0071), 0.8158 (±0.0117), and 0.8991 (±0.0068), respectively. In the external set, values of 0.8999 (±0.0065), 0.8999 (±0.0065), and 0.9016 (±0.0061) were obtained, respectively.
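These three agreement metrics can likewise be computed from the same confusion-matrix counts; a minimal sketch on a toy mask pair (perfect agreement yields 1.0 for all three):

```python
import math
import numpy as np

def agreement_metrics(pred, gt):
    """Cohen's Kappa, Jaccard index, and MCC from the pixel-level
    confusion matrix of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))
    n = tp + fp + fn + tn
    po = (tp + tn) / n                     # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    jaccard = tp / (tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return kappa, jaccard, mcc

pred = np.array([[1, 0], [0, 0]], dtype=np.uint8)
kappa, jacc, mcc = agreement_metrics(pred, pred)   # all 1.0
```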
The DSC and Sensitivity differences between the test set and the external set were compared for Method 3 to evaluate the model’s overall segmentation success and generalization ability. According to the t-test analysis, the p-value for DSC was 0.364. Since this value is greater than 0.05, there is no statistically significant difference between the DSC metrics on the test set and the external set. On the other hand, the p-value for Sensitivity was 0.0000087, which is much smaller than 0.05, so there is a statistically significant difference in the Sensitivity metric between the two datasets.
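This comparison corresponds to a two-sample t-test over per-image metric values; a sketch with synthetic per-image DSC scores (hypothetical data; the study compares the actual per-image scores, which are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic per-image DSC values standing in for the two sets.
dsc_test = rng.normal(loc=0.898, scale=0.03, size=90)
dsc_external = rng.normal(loc=0.900, scale=0.03, size=100)

t_stat, p_value = stats.ttest_ind(dsc_test, dsc_external)
# A p-value above 0.05 would indicate no statistically significant
# difference between the two sets.
```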
4. Discussion
This study aims to extract meniscal ROIs from knee MRI images and improve segmentation results using ensemble-based approaches. We trained segmentation models based on state-of-the-art YOLO versions and refined their results using ensemble methods: pixel-based voting, weighted multiple voting, and dynamic weighted multiple voting with grid search. Meniscal volume is known to be less than 0.1% of the entire knee joint MRI scan [35], and in the 2D images used in this study, meniscal ROI areas represent less than 1.5% of the image (Figure 7). Manual segmentation of such small structures has limited applicability due to its high risk of error and time-consuming nature. Therefore, deep-learning-based approaches offer practical solutions to overcome these challenges. Furthermore, this is, to our knowledge, the first study in which segmentation models based on the YOLO series are integrated with ensemble methods (e.g., grid-search-based dynamic weighting) to improve performance.
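The area fraction quoted above is simply foreground pixels over total pixels; a toy sketch with a hypothetical 256 × 256 slice:

```python
import numpy as np

def roi_fraction(mask):
    """Fraction of image pixels covered by a binary ROI mask."""
    return float(mask.sum()) / mask.size

# Hypothetical slice: an 800-pixel ROI in a 256 x 256 image (~1.2%).
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:120, 80:120] = 1            # 20 x 40 = 800 foreground pixels
frac = roi_fraction(mask)            # 800 / 65536, about 0.0122
```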
The YOLO series has an architecture capable of fast and highly accurate object detection and stands out for its speed and accuracy advantages in biomedical imaging and segmentation tasks [36]. Ensemble methods combine the strengths of such models to improve their results further: pixel-based voting, weighted multiple voting, and dynamic weighted voting with grid search merge the outputs of different models to produce more accurate and consistent results. These methods are powerful solutions for challenging segmentation tasks, especially in complex biomedical images. In the detection of meniscus ROI areas on MRI images, incomplete segmentation results arise from inhomogeneous intensity levels. The ground truth comparison of the proposed method (Method 3) and the mask obtained using Model 1, which achieved the best training results, is given in Figure 8. As can be seen in Figure 8, the main reasons for incomplete segmentation are the small size of the meniscus ROI areas and the inhomogeneous intensity levels in the MRI images. Meniscal tissue shows intensities similar to other soft tissues on MRI, resulting in limited contrast, especially in the posterior horn regions of the lateral and medial meniscus. This may lead to incomplete or incorrect boundaries in the segmentation masks. In addition, artifacts and low signal intensity in some regions of the images also contribute to incomplete segmentation. In Figure 8, when the masks produced by Model 1 are compared with the ground truth, such missing boundaries are evident. The proposed Method 3 overcomes some of these shortcomings through dynamic weighting and produces results more consistent with the ground truth. This demonstrates the success of Method 3, especially in balancing intensity differences and reducing incomplete segmentation.
Table 3 and Table 4 show that Model 1 performs best among the trained models. This can be explained by the contribution of the larger number of parameters in the YOLO series to performance improvement [37]. The parameter counts (Parameters/M) of the models are ~71.70 M (Model 1), ~5.09 M (Model 2), ~27.62 M (Model 3), ~59.68 M (Model 4), and ~2.83 M (Model 5). Although Model 1 outperformed the other models with its high number of parameters, qualitative evaluations also revealed cases in which the other models incorrectly estimated ROI areas. The main goal of the ensemble methods is to improve segmentation accuracy by combining the strengths of each YOLO model and minimizing the inconsistencies between the outputs of different models. In this study, the combination of masks from five different YOLO models demonstrates the contribution of this diversity to segmentation performance.
The quantitative results of the models and the proposed method on the test set and the external set are presented in Table 5, Table 6, and Table 7. In general, the findings in Table 5 show that Method 3 performs better and more consistently than the other methods in terms of all metrics. In particular, the high performance of Method 3 on the external set highlights the model’s ability to cope with data diversity and its generalization success. The consistency provided by the multiple voting methods offers a generalizable and reliable solution for clinical applications. For example, Method 3 showed consistent results in metrics such as DSC (0.8976 (±0.0071), 0.9004 (±0.0064)) and Sensitivity (0.9467 (±0.0077), 0.9200 (±0.0119)) in the test set and external set, indicating that this method can work with similar accuracy on different patient groups.
Table 6 gives the quantitative evaluation results according to meniscus type. Although the datasets contain fewer images of the LM than of the MM, the DSC and Sensitivity results in the test set and external set were very close. For example, the DSC for the MM in the test set was 0.9006 (±0.0084), while in the external set it was 0.9055 (±0.0069). This shows that the generalization ability is strong across different meniscus types.
Table 7 compares segmentation performance on healthy and torn meniscal tissue. The detection of ROI areas of torn menisci was more successful than of healthy menisci in both the test and external sets. For example, the DSC and Sensitivity for a torn meniscus in the test set were 0.9008 (±0.0077) and 0.9443 (±0.0090), respectively, while these values were 0.9051 (±0.0061) and 0.9102 (±0.0140) in the external set. One of the main reasons for this difference is that the datasets contain four times more torn meniscus images than healthy meniscus images. This data imbalance is considered an essential factor affecting segmentation accuracy.
Although Method 3 achieved the best results using ensemble methods, the high parameter count of Model 1 provided an advantage in segmentation accuracy and offers the potential to significantly reduce the need for manual segmentation. In addition, the DSC and Sensitivity differences between the test set and the external set for Method 3 were compared to evaluate the model’s overall segmentation success and generalization ability. According to the t-test analysis, the p-value for DSC is 0.364, which is greater than 0.05, indicating that the DSC differences between the test set and the external set are not statistically significant. On the other hand, the p-value for Sensitivity was 0.0000087; since this value is less than 0.05, the difference between the two datasets is statistically significant. These results demonstrate the generalization success of Method 3 and its potential to work reliably on different datasets, while suggesting that, although the method maintains overall segmentation accuracy, its ability to recognize all positive regions may vary with differences between datasets.
DL-based methods commonly used in the literature for meniscus segmentation include U-Net [12,13,14,15,16] and Mask R-CNN [11,17,18]. Although the first version of YOLO was released in 2016 [38], segmentation capability was added only in later versions of the series. After U-Net was introduced as a convolutional network for biomedical image segmentation [39], U-Net-based architectures were developed for meniscus segmentation. Some of these studies have focused on segmentation as a preprocessing step for meniscal tear detection [11,14,16,17], while others have focused only on segmentation performance [12,15,18,19]. Although sagittal slices [11,15,16,17,18,40] have generally been used for meniscal segmentation, there have also been studies on coronal slices [12] or both [35]. More research has been conducted on the sagittal section because it contains more information about the meniscus and better reflects the clinical pattern [16,25]. Jeon et al. [12] proposed a U-Net-based method for meniscal segmentation and calculated the DSC for the MM and LM as 85.18% and 84.33%, respectively. In another U-Net-based study on the automatic segmentation of knee cartilage and meniscal structures, the DSC for the MM, the LM, and overall were calculated as 0.87, 0.89, and 0.88, respectively [15]. In a further U-Net-based study, the DSC was reported in the range of 0.93–0.94 [16]. Li et al. [11] proposed a fully automatic 3D DCNN for the detection and segmentation of meniscal tears in knee MR images, achieving a DSC of 0.924 and a Sensitivity of 0.95. The overall performance of the proposed method (Method 3) was 0.8976 (±0.0071) and 0.9004 (±0.0064) DSC in the test and external sets, respectively, indicating that our study contributes to the literature.
While related studies rely on a single model for meniscus segmentation, this study uses ensemble methods that operate on the masks obtained from five different models, thereby improving the performance of the YOLO models. Rather than proposing a new convolutional neural network architecture, this study presents a new ensemble-based approach. YOLO models offer real-time performance and provide a significant advantage in overcoming the difficulties of manual segmentation. Furthermore, the accuracy of YOLO is improved by using a dynamic weighted multiple voting method with grid search, which is unique in the literature.
The internal dataset was divided into training, validation, and test sets using stratified random sampling to maintain class balance and to make the model performances and the quantitative and qualitative observations more consistent. One limitation of this study is the imbalance between the total numbers of healthy and torn meniscus images in the dataset. Model 1 showed superior performance due to its high number of parameters and complexity; however, this also means the model requires more computational power. The other models are not as strong as Model 1, and the availability of another model with performance close to Model 1 would improve the ensemble methods and contribute to the performance of the proposed method (Method 3).
The ability of the proposed model to automate the segmentation of the meniscus could significantly reduce the time required for manual labeling by radiologists and orthopedic surgeons. This efficiency gain could improve workflows in the clinical setting. In addition, the dynamically weighted ensemble approach increases the reliability of the model by reducing inter-observer variability, making it a promising tool for both diagnostic and surgical planning applications.
4.1. Strengths of the Study and Future Work
This study will be extended with a new YOLO-based architecture that provides a preprocessing step for meniscal degeneration detection. Furthermore, the representativeness of the method will increase if larger and more balanced datasets are used. In addition, the integration of alternative imaging modalities, such as ultrasound, may extend this study’s applicability to different imaging protocols and open a new research area in this field.
4.2. Limitations of the Study
This study used only the YOLO series; no comparison was made with other segmentation models (e.g., U-Net, Mask R-CNN), which limits the comparative power of the study across methods. Although performed by experts in the field, the manual selection of meniscal ROIs may introduce some labeling errors. Although the results were tested by external validation, the diversity of these datasets across different clinical protocols and devices is limited, and the model’s performance on datasets from different MRI devices or imaging protocols has not yet been evaluated. In addition, the evaluation on a single slice orientation (sagittal) is another limitation of the study.
Although the proposed method shows strong segmentation performance, some limitations should be noted. The model performs exceptionally well in cases where meniscal tears have clear boundaries. However, for meniscal tears with irregular shapes or low contrast in MRI scans, the segmentation accuracy decreases slightly, likely because meniscal tissue is difficult to distinguish from the surrounding structures in these cases.