## **3. Results**

*3.1. Classifier Performance*

The trained classifiers were evaluated on the test set, with the resulting accuracy, sensitivity, specificity, F1, and AUC reported in Table 2. The RF model achieved the best classification performance, with an AUC of 0.91 and an F1 of 0.80. Overall, test-set performance was consistent with the 5-fold cross-validation results on the training set, demonstrating that the proposed method generalizes to unseen data; no over-fitting was observed in any of the reported models.
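The five metrics above can be computed from predicted class probabilities as in the following sketch. This is an illustrative implementation using scikit-learn, not the study's actual evaluation code; the 0.5 decision threshold and the label convention (1 = lacerated, 0 = healthy) are assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             recall_score, roc_auc_score)

def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute the metrics reported in Table 2 for a binary classifier.

    y_true  : ground-truth labels (assumed 1 = lacerated, 0 = healthy)
    y_score : predicted probability of the positive (lacerated) class
    """
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    # Confusion matrix with a fixed label order so the unpacking is stable.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),  # TP / (TP + FN)
        "specificity": tn / (tn + fp),                # TN / (TN + FP)
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),        # threshold-free
    }
```

Note that AUC is computed from the raw scores rather than the thresholded predictions, which is why a model can share the same AUC across test sets while its F1 and accuracy differ.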


**Table 2.** Performance metrics for spleen injury classification on the test set. The highest value for each performance metric is **bolded**.

*3.2. Comparison against Deep Learning*

A comparison between the RF classifier and the deep learning method is shown in Table 3. The RF classifier trained with hand-crafted features outperformed the deep learning method, achieving an AUC of 0.91 versus 0.72. The lower deep learning performance is likely due to the small sample size available in this study, as deep learning methods require large datasets to minimize over-fitting and achieve good performance [25,32]. These results demonstrate that hand-crafted features informed by domain knowledge can help overcome sample size limitations.

**Table 3.** Performance metrics for the RF classifier trained using hand-crafted features and for the deep learning method. The highest value for each performance metric is **bolded**.


*3.3. Leave-One-Site-Out Analysis*

This study utilizes two different datasets—the internal Michigan Medicine dataset and the public CIREN dataset. A leave-one-site-out analysis was performed to evaluate the cross-site generalizability of the proposed method.

To achieve an approximately 80%/20% training/test split, the Michigan Medicine dataset, containing a total of 54 samples, was used as the training set, while 14 CIREN samples were used as the test set. The 14 CIREN test samples were randomly selected with stratification by injury grade. The best-performing classifier from Section 2.4.2, RF, was trained on the Michigan Medicine samples and tested on the CIREN samples. Performance metrics of the classifier are reported in Table 4.
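The leave-one-site-out procedure can be sketched as follows. This is a simplified illustration, not the study's code: the feature matrices are random placeholders standing in for the hand-crafted histogram, Gabor, fractal-dimension, and shape features, and stratification is shown on the binary label rather than the multi-level injury grade used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_features = 30  # placeholder feature dimension, not the study's value

# Site 1 (training): Michigan Medicine, 54 samples in total.
X_mm = rng.normal(size=(54, n_features))
y_mm = rng.integers(0, 2, size=54)       # 1 = lacerated, 0 = healthy

# Site 2 (testing): a pool of CIREN samples (size here is arbitrary).
X_ciren = rng.normal(size=(70, n_features))
y_ciren = rng.integers(0, 2, size=70)

# Draw 14 CIREN test samples, stratified by label.
_, X_test, _, y_test = train_test_split(
    X_ciren, y_ciren, test_size=14, stratify=y_ciren, random_state=0)

# Train on one site, evaluate on the other: no test-site data leaks
# into training, so the score reflects cross-site generalizability.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_mm, y_mm)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```

With the random placeholder data the AUC is uninformative; the point of the sketch is the site-disjoint split, which is what distinguishes this analysis from the mixed-site evaluation in Section 3.1.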

**Table 4.** Performance metrics for the RF classifier trained on Michigan Medicine samples and tested on CIREN samples.


The RF classifier achieved good performance on the cross-site generalizability assessment, with an AUC of 0.91 and an F1 of 0.71. Compared to its performance on the mixed-site test set, the classifier achieved the same AUC but lower F1, accuracy, and sensitivity. This performance difference likely reflects the smaller training set, as only one dataset was used to train the classifier. Overall, the classifier's performance demonstrates that the proposed method is relatively robust to variability stemming from differences in the data from two different sites.

## **4. Discussion**

RF outperformed the other classifiers on both the training and test sets, consistent with its widespread use in previous medical image analysis studies [33–36]. Several properties of RF may contribute to its higher performance on medical images—RFs are suited to high predictor dimension relative to sample size, they inherently perform feature selection, and they generalize well to regions of the feature space with sparse data [34,35].

Table 5 reports the classification accuracy of RF by injury grade on both the training and test sets. Although the RF classifier correctly classified the majority of samples across all injury grades, most incorrect classifications occurred among mildly or moderately injured samples (AIS = 2, 3). High classification accuracy was seen among healthy samples and more severely injured samples (AIS = 4, 5). Lower accuracy and higher variance were observed for all injury grades compared with healthy samples, likely because each individual injury grade contains fewer samples than the healthy dataset. Despite the lower performance on less severe cases, the proposed method performs well on severe cases, demonstrating the potential to increase injury triage efficiency in real-world applications.

**Table 5.** RF classification accuracy by injury grades. The mean accuracy and standard deviation (SD) across 5-fold cross validation on the training set, as well as the mean accuracy on the test set are reported.


Common misclassifications included classification of a mildly or moderately injured sample as healthy and classification of a healthy sample as lacerated, as illustrated in Figure 3. Misclassification of a lacerated sample with lower injury severity as healthy was likely caused by a relatively smooth segmentation contour (Figure 3c), which may result from imperfect segmentation of the lacerated region and/or a lower degree of laceration. Healthy samples misclassified as lacerated were often affected by noise in the original image, which produces misleading segmentations or irregular contour shapes (Figure 3d,e).

A small portion of samples with localization errors likely lowered model performance through imperfect or erroneous segmentations. Image resolution and noise are likely contributing factors to imperfect localization and segmentation results. Previous studies have shown that increasing slice thickness reduces image noise but also reduces spatial resolution [37]. In this study, 5 mm CT slices were utilized, which worked relatively well owing to their lower image noise compared with thinner slices. However, the lower resolution of 5 mm slices decreases diagnostic content and the proposed method's ability to detect small lesions. Although not available in the datasets utilized in this study, 3 mm slices may strike a better balance between minimizing image noise and maximizing image resolution and can be explored in the future [37].

**Figure 3.** Classification results. (**a**,**b**) lacerated (AIS = 2) samples correctly classified as lacerated; (**c**) lacerated (AIS = 2) sample incorrectly classified as healthy; (**d**,**e**) healthy samples incorrectly classified as lacerated.

Future work will focus on refinement of the segmentation method to improve classification accuracy in lower severity cases. Additional pre- and post-processing within the segmentation pipeline can be introduced to reduce noise and increase discrimination between healthy and mildly lacerated spleens. Incorporation of more samples in each injury grade may increase classifier performance and support extension of the current binary classification to multi-class classification across injury grades, providing additional clinical use cases. Finally, although this study focuses on spleen lacerations, future work should generalize to other blunt spleen injuries, including hematomas and hemorrhages.

## **5. Conclusions**

In this study, an automated method for detecting spleen lacerations in CT scans was proposed. The classification scheme was built upon a previously developed localization and segmentation process [6], and used histogram, Gabor filter, fractal dimension, and shape features to distinguish lacerated spleens from healthy controls. The classifiers examined were RF, naive Bayes, SVM, *k*-NN ensemble, subspace discriminant ensemble, and a CNN-based architecture. The RF method outperformed the other models in discriminating between lacerated and healthy spleens, achieving an AUC of 0.91 and an F1 of 0.80. Additionally, a leave-one-site-out analysis was performed that demonstrated the method's robustness to variability stemming from differences in the data from two different sites. Results from this study demonstrate the potential for automated, quantitative assessment of traumatic spleen injury to increase triage efficiency and improve patient outcomes. Future work will focus on improving classifier accuracy in less severe cases, extension of the method to support multi-class classification based on injury grade, and generalization to other types of blunt spleen injuries.

**Author Contributions:** Conceptualization, K.N. and J.G.; Methodology, J.W., A.W., C.G., K.N., and J.G.; Validation, J.W. and J.G.; Formal Analysis, J.W.; Data Curation, J.W., A.W., and J.G.; Writing—Original Draft Preparation, J.W.; Writing—Review and Editing, J.W. and J.G.; Supervision, J.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of the University of Michigan (protocol code HUM00098656, approved 13 December 2020).

**Informed Consent Statement:** Patient consent was waived as the research involved no more than minimal risk to the subjects.

**Data Availability Statement:** Two datasets were employed in this study—the Crash Injury Research Engineering Network (CIREN) dataset and an internal dataset collected from Michigan Medicine. CIREN is a public dataset that is available for download at https://www.nhtsa.gov/research-data/crash-injury-research (accessed on 1 February 2021). Data collected from Michigan Medicine can be made available to external entities under a data use agreement with the University of Michigan.

**Conflicts of Interest:** The authors declare no conflict of interest.
