3.3. Comparative Performances of AI Systems in Scoliosis Severity and Treatment Modality Classifications
Table 4 serves as a critical tool in evaluating the precision and reliability of AI systems for clinical decision support in the assessment of scoliosis severity. The dataset encompasses evaluations from a sample group of 72 subjects, and the significance level for all the kappa scores is p < 0.001, underscoring the reliability of the observed agreement levels. Each AI system’s agreement is reported as an overall kappa score, as well as by individual severity category: mild, moderate, and severe. It is important to note that some AI systems did not undertake the classification task, and, thus, their kappa values are not available.
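The agreement statistic used throughout these comparisons, Cohen’s kappa, corrects the observed agreement for the agreement expected by chance. The sketch below is a minimal pure-Python illustration; the `reference` and `ai_labels` lists are hypothetical labels invented for demonstration, not the study’s data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of identical ratings.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from the marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: AI severity labels vs. a reference standard.
reference = ["mild", "mild", "moderate", "moderate", "severe", "severe"]
ai_labels = ["mild", "mild", "moderate", "severe", "severe", "severe"]
print(round(cohens_kappa(reference, ai_labels), 2))  # → 0.75
```

A kappa of 1.00, as reported for the top systems, arises only when every AI label matches the reference exactly.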
Based on the results in Table 4, it is evident that ChatGPT 4, Copilot, and PopAi achieved perfect agreement with the reference standards, as indicated by kappa values of 1.00 across all severity categories. Such unanimity suggests that these systems can replicate the reference classification exactly, positioning them as highly reliable tools for scoliosis severity assessment in clinical settings.
In contrast, PMC-LLaMA 13B shows the lowest kappa score, with an overall agreement of 0.55 and particularly low agreement in the moderate category at 0.33. These figures indicate a substantial divergence from the reference evaluation, suggesting that PMC-LLaMA 13B’s classification may not be dependable for clinical decision-making, especially in distinguishing moderate scoliosis severity.
Notably, Gemini, You Research, and You Genius did not undertake the classification task. The remaining systems display varying degrees of agreement: Gemini Advanced and You GPT 4 show moderate overall kappa values of 0.77 and 0.67, respectively; Claude 3 Opus shows a high level of agreement at 0.96; and Claude 3 Sonnet presents a somewhat lower, though still substantial, kappa value of 0.85. Although not perfectly aligned with the reference evaluation, these systems maintain a reasonable level of concordance that could be enhanced with further calibration.
Table 4 also provides an evaluative comparison of the AI systems in terms of their agreement with the established reference standards (reference level in Table 2) for recommending treatment modalities for scoliosis.
Table 4 is crucial for understanding how AI systems compare with the gold standard of treatment modality recommendations in the field of scoliosis management. The insights gained from this table will be instrumental for healthcare professionals, researchers, and AI developers in assessing the current capabilities of AI systems and in identifying areas where these systems can be improved to better support clinical decision-making.
ChatGPT 4 and Copilot stand out with kappa values of 1.00 across all the severity categories, indicating perfect agreement with the reference evaluation for the treatment modality. This level of concordance suggests that these AI systems are highly reliable and can be considered benchmark performers in the context of this study.
Conversely, PMC-LLaMA 13B has the lowest kappa score overall at 0.37, with notably poor agreement in the mild (0.13) and moderate (0.18) categories, neither of which reaches statistical significance (p > 0.05). Such findings highlight significant inconsistencies with the reference standards and imply that this system requires substantial improvement before it can be considered a feasible tool for clinical decision-making in its current state.
The AI systems Gemini, Gemini Advanced, You Research, and You GPT 4 did not perform treatment modality classification; therefore, their performance could not be assessed against the reference evaluation within this analysis.
PopAi and You Genius presented with moderate kappa scores of 0.83 and 0.78, respectively, which, although not perfect, still reflect a reasonable level of agreement with the reference evaluation. Claude 3 Opus also demonstrated a moderate level of agreement, with an overall kappa value of 0.77.
Claude 3 Sonnet exhibited a kappa value of 0.50, which signifies a moderate-to-low agreement overall, and the scores across the categories suggest inconsistency, with particularly low agreement in the severe category (0.38).
Table 4 additionally presents a comparative analysis quantifying the level of agreement between the severity intervals estimated by the various AI systems and the actual severity estimates derived from the individual Cobb angles.
This comparison is essential for assessing the performance of AI technologies in orthopedics, specifically the accurate categorization of scoliosis severity. The findings outlined here offer valuable insights for clinicians seeking to integrate AI into their diagnostic process, for researchers aiming to benchmark and improve AI diagnostic algorithms, and for developers working to enhance the precision of AI systems in medical imaging and diagnosis.
ChatGPT 4, Copilot, and PopAi exhibit a kappa value of 1.00, signaling perfect agreement across all scoliosis severity categories. This indicates exceptional performance and suggests that these systems achieved the highest level of accuracy in mirroring the traditional Cobb angle measurements, making them the most reliable among the evaluated AI systems for this specific task.
At the other end of the spectrum, PMC-LLaMA 13B has the lowest overall kappa score of 0.49, with notably weaker agreement in the moderate severity category at 0.30. This demonstrates a substantial discrepancy between the AI-estimated intervals and the actual Cobb angle estimates, indicating a less-reliable performance.
In the middle ground, Claude 3 Opus displays strong agreement, with a kappa score of 0.96, closely approximating the high standard set by the best-performing systems. Similarly, Claude 3 Sonnet shows a reasonably high level of agreement at 0.81, despite some variation across different severity categories.
Gemini Advanced and You GPT 4 present with moderate kappa values of 0.68 and 0.67, respectively, which suggest that although they may have potential utility, there is appreciable room for improvement in their estimation processes to reach the levels of the top-performing AI systems.
It is noteworthy that Gemini, You Research, and You Genius did not undertake the classification task; therefore, their ability to estimate severity intervals relative to actual Cobb angle measurements remains unquantified within this analysis.
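The mapping from a Cobb angle to a severity interval and corresponding treatment modality can be sketched as a simple threshold function. The cut-offs below (25° and 40°) are assumed conventional values used only for illustration, not necessarily the intervals defined in the study’s Table 2:

```python
def classify_cobb(angle_deg):
    """Map a Cobb angle (degrees) to (severity interval, treatment modality).

    The 25-degree and 40-degree thresholds are illustrative assumptions;
    the study's actual intervals are defined in its Table 2.
    """
    if angle_deg < 25:
        return "mild", "monitoring"
    elif angle_deg <= 40:
        return "moderate", "bracing"
    else:
        return "severe", "surgery"

print(classify_cobb(18))  # → ('mild', 'monitoring')
print(classify_cobb(52))  # → ('severe', 'surgery')
```

A kappa comparison such as the one in Table 4 would then pair each AI-estimated interval against the interval this function assigns to the measured angle.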
Table 4 also offers a critical evaluation of the correspondence between the treatment recommendations made by the AI systems and the Cobb angle ranges defined in Table 2 for scoliosis patients. This comparison is essential for assessing how accurately AI systems can match established treatment guidelines rooted in objective clinical assessments. Such an analysis is vital for clinicians considering the incorporation of AI into treatment decision-making, for healthcare institutions contemplating the adoption of AI in diagnostic processes, and for developers aiming to improve the precision of AI algorithms for better clinical outcomes.
The data reveal that both ChatGPT 4 and Copilot achieve perfect precision, with a kappa value of 1.00 across all the treatment modalities, including monitoring, bracing, and surgery. This denotes a flawless alignment with the actual estimates, positioning these systems as the standard-bearers for AI-assisted treatment modality decisions in scoliosis management.
Conversely, PMC-LLaMA 13B demonstrates a markedly low overall agreement, with a kappa value of 0.24, and particularly poor performance in the bracing and surgery categories, with kappa values of 0.13 and 0.02, respectively, neither of which is statistically significant. This performance indicates a pronounced divergence from the actual treatment modality estimates and highlights PMC-LLaMA 13B as the least reliable system among those evaluated.
Claude 3 Opus ranks as a highly precise system, with an overall kappa value of 0.98, nearly perfect scores in the monitoring and bracing categories, and a score of 1.00 in the surgery category. Although not achieving the absolute perfection of ChatGPT 4 and Copilot, Claude 3 Opus still stands as a highly dependable system for treatment modality estimation.
PopAi, although not achieving the pinnacle of agreement like the top performers, still presents a robust kappa score of 0.83 overall, with its performance in the surgery category reaching the maximum kappa value of 1.00. However, its lower kappa values in the monitoring and bracing categories suggest that there is some discrepancy between the AI estimates and actual modality decisions that could be refined.
Claude 3 Sonnet, with an overall kappa value of 0.58, shows a moderate level of agreement, indicating a level of inconsistency in its estimations compared to the actual standards, particularly noticeable in the surgery category, with a kappa value of 0.49.
It is important to note that Gemini, You Research, You GPT 4, and You Genius did not participate in the estimation of treatment modalities, and, thus, no data are available to assess their performances in this context.
3.4. Performance Metrics and Confusion Matrix Analysis for Scoliosis Severity Classifications
In this section, we offer an in-depth analysis of the performance of several AI systems tasked with categorizing scoliosis severity. This comparison aligns the predicted classifications against the reference standards across three defined levels of severity: mild, moderate, and severe.
Table 5 encapsulates the results of this analysis, providing a comprehensive overview of the performance metrics for each AI system under evaluation. These metrics encompass overall accuracy, sensitivity (true positive rate), specificity (true negative rate), positive predictive value (PPV), negative predictive value (NPV), precision, recall, F1 score, prevalence, detection rate, detection prevalence, and balanced accuracy. These metrics have been calculated from the confusion matrix for each severity class, offering insights into the strengths and limitations of each system’s predictive capabilities.
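All of the listed metrics can be derived from the four one-vs-rest confusion-matrix counts for a given class. The following is a minimal sketch; the counts passed in at the end are hypothetical, chosen only for demonstration:

```python
def class_metrics(tp, fp, fn, tn):
    """Per-class metrics from one-vs-rest confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value / precision
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical counts for one severity class out of 72 cases.
m = class_metrics(tp=20, fp=2, fn=4, tn=46)
print(round(m["sensitivity"], 3), round(m["balanced_accuracy"], 3))  # 0.833 0.896
```

Balanced accuracy, the average of sensitivity and specificity, is the metric least affected by the unequal class sizes in the 72-subject sample.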
This table serves not only as a tool for cross-comparison but also as an indicator of the precision and reliability of each AI system in the context of scoliosis severity classification. It is intended to show which systems excel in certain metrics and to shed light on potential areas for improvement.
According to the results in Table 5, AI systems such as ChatGPT 4, Copilot, and PopAi display impeccable performance, with perfect scores (1.00) across all metrics and severity classes, indicating models that predict with an exceptional level of correctness.
Gemini Advanced presents as a robust system with relatively high scores, particularly notable in the Balanced Accuracy metric, which corrects for any bias in the dataset toward a particular class. For the mild and severe classes, the metrics are commendably high; however, there is a noticeable dip in the performance metrics for the Moderate class, particularly in Overall Accuracy and Sensitivity. This suggests that Gemini Advanced may struggle with borderline cases or has difficulty when the features are not distinctly polarized.
You GPT 4 demonstrates a moderate level of accuracy, with its lowest scores observed in the Moderate class for Sensitivity and Positive Predictive Value, which indicates a potential area for model improvement. High specificity across all the severity classes suggests that the system is adept at identifying negatives, but there may be an issue with false negatives, particularly in the Mild class, as indicated by a lower Sensitivity score.
Claude 3 Opus and Claude 3 Sonnet both exhibit high-performance levels, with Claude 3 Opus showing a slight edge, particularly in the Moderate class, where it outperforms Claude 3 Sonnet in terms of Sensitivity and PPV. Both systems achieve perfect or near-perfect scores in most metrics for the Mild and Severe classes, suggesting that they have a well-calibrated understanding of the extremes of the severity spectrum.
PMC-LLaMA 13B shows the most room for improvement, particularly in the Moderate class, which has the lowest Overall Accuracy and Sensitivity scores. This could be indicative of a model that is less adept at managing less-defined or nuanced classifications. The relatively low scores across the board also suggest that PMC-LLaMA 13B may benefit from further training or a more sophisticated feature extraction process.
Figure 3 visualizes the predictive outcomes for scoliosis severity across the various AI systems, each contributing 72 results. The x-axis categorizes the actual Cobb angle measurements into three distinct segments, indicating mild, moderate, and severe scoliosis. These segments are clearly delineated by dashed lines for straightforward identification and interpretation.
Each data point on the x-axis represents an actual measurement of the Cobb angle, categorized within these predefined segments. To prevent data points from overlapping, they have been randomly jittered along the y-axis. This method enhances the clarity of the display and aids in the visual assessment of the predictive accuracy.
These predictions are color coded to reflect the severity categories: green for mild, blue for moderate, and red for severe scoliosis. Points colored in gray indicate measurements for which the AI systems did not provide a classification.
Misclassifications are easily identifiable through color discrepancies: points that are not green within the mild range, not blue within the moderate range, or not red within the severe range represent incorrect classifications. This color coding allows quick and effective evaluation of the accuracy and errors in each AI system’s predictions.
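The segment assignment, jitter, and color coding described above can be sketched as follows. This is a simplified stand-in for the actual plotting code, and the interval boundaries (25° and 40°) are illustrative assumptions rather than the study’s Table 2 definitions:

```python
import random

SEVERITY_COLORS = {"mild": "green", "moderate": "blue", "severe": "red"}

def plot_point(cobb_angle, predicted_label):
    """Return (x, y, color, correct) for one jittered scatter point.

    The 25/40-degree boundaries are illustrative assumptions; unclassified
    predictions (labels outside the three categories) are drawn in gray.
    """
    # Actual severity segment from the measured Cobb angle (x-axis position).
    if cobb_angle < 25:
        actual = "mild"
    elif cobb_angle <= 40:
        actual = "moderate"
    else:
        actual = "severe"
    color = SEVERITY_COLORS.get(predicted_label, "gray")  # gray = no classification
    y = random.uniform(0, 1)  # random jitter so overlapping points stay visible
    correct = predicted_label == actual
    return cobb_angle, y, color, correct
```

A point is plotted per prediction; a mismatch between its color and its x-axis segment marks a misclassification.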
3.5. Performance Metrics and Confusion Matrix Analysis for Treatment Modality Classifications
The present analysis undertakes a comprehensive review of the performance of the various artificial intelligence systems in classifying treatment modalities for scoliosis. This evaluation is critical, as accurately identifying the correct treatment approach, whether monitoring, bracing, or surgical intervention, is essential for optimizing patient health outcomes and the efficacy of medical care.
In a manner comparable to Table 5, Table 6 lays out a detailed comparative assessment of performance metrics compiled from the confusion matrices reflecting the classification results of each AI system. The purpose of this assessment is to provide an in-depth perspective on the capabilities of, and the areas requiring enhancement for, each AI system in the realm of treatment modality classification.
The ChatGPT 4 and Copilot systems stand out, with perfect scores across all the metrics for each severity class. This suggests that these systems have a high degree of accuracy, reliability, and consistency in their predictions, leaving little to no room for improvement within the context of the provided data. The uniformity of their performance across all classes indicates well-balanced systems that capture the defining features of each severity level.
Contrastingly, PopAi demonstrates a notable variance in performance across the severity classes. Although it achieves perfect Sensitivity and Specificity for the Surgery class, its performance in the Bracing class is markedly lower, with an Overall Accuracy of 0.58. The Positive Predictive Value (PPV) for the Bracing class is also lower, which highlights potential challenges in the system’s ability to correctly identify cases requiring bracing treatment. The relatively high Detection Rate in the Monitoring and Surgery classes suggests that PopAi is effective in detecting these conditions when present, yet the lower Detection Rate for Bracing indicates room for improvement in the system’s ability to accurately detect moderate cases.
The You Genius system exhibits high levels of performance in the Monitoring and Surgery severity classes, with Sensitivity and Specificity scores comparable to those of PopAi. However, it has a lower Overall Accuracy in the Bracing class, similar to PopAi’s performance. The F1 Score in the Bracing class for You Genius is also indicative of a compromise between Precision and Recall, suggesting that the system may benefit from further refinement to better balance these metrics.
Moving to Claude 3 Opus, this system shows a strong performance in the Monitoring class, with high Sensitivity and a Balanced Accuracy of 0.92, indicating effective performance across both positive and negative cases. However, performance drops in the Bracing class, with an Overall Accuracy of 0.68 and a Balanced Accuracy of 0.79, suggesting difficulties in distinguishing cases requiring bracing from other interventions. The Surgery class shows a rebound in performance, though not to the level of perfection seen in ChatGPT 4 and Copilot.
Claude 3 Sonnet presents as the least consistent performer among the evaluated systems. The Overall Accuracy is lower across all the classes, with the Surgery class demonstrating a significant drop to 0.50. This inconsistency is further evidenced by the lower Balanced Accuracy scores, indicating that Claude 3 Sonnet struggles with both false positives and false negatives across the severity classes.
Lastly, PMC-LLaMA 13B shows significant variability across the different classes. The Sensitivity for the Monitoring class is particularly low at 0.24, which is concerning for a medical diagnostic tool, as it implies a high rate of missed cases that require monitoring. However, PMC-LLaMA 13B performs better in the Surgery class, with a Sensitivity of 0.89 and a Balanced Accuracy of 0.89, showing a stronger capability in identifying severe cases of scoliosis that may necessitate surgical intervention.
The description provided for Figure 3 also applies to Figure 4, with the notable distinction that the predictive focus shifts from scoliosis severity to treatment modalities based on Cobb angle measurements. In Figure 4, the individual AI systems forecast the necessity of monitoring, bracing, or surgery as the potential treatment.