1. Introduction
Salivary gland malignancies can be categorized into primary tumors, which include epithelial malignancies such as mucoepidermoid carcinoma and lymphomas represented by extranodal marginal zone B-cell lymphoma of mucosa-associated lymphoid tissue (MALT lymphoma), and secondary tumors originating from metastases [1]. The distribution of secondary salivary gland malignancies varies regionally, with metastases from head and neck squamous cell carcinoma accounting for approximately 73%‒100% of cases [2,3]. Treatment and prognoses differ significantly between primary and secondary malignancies. While most patients undergo parotidectomy, neck lymph node dissection, or radiotherapy, the 5-year survival rate for secondary salivary gland squamous cell carcinoma remains substantially lower than that for primary malignancies (32.6% vs. 77.2%) [4]. Therefore, early identification of secondary salivary gland malignancies is crucial.
Although clinical data, imaging examinations, and fine needle aspiration (FNA) [5] can provide initial insights into whether a lesion is benign or malignant, accurate tumor classification still relies heavily on core needle biopsy or postoperative pathology. This reliance can lead to delayed diagnoses or incorrect assessments, increasing surgical risks or resulting in missed treatment opportunities [6].
Ultrasound remains one of the preferred imaging modalities for salivary gland diseases. However, its value in the differential diagnosis of salivary gland tumors is limited by overlapping diagnostic features in the sonographic images [7]. While elastography and contrast-enhanced ultrasound techniques show promise, they have yet to be widely adopted [8]. Recently, radiomics and deep learning have demonstrated potential in the non-invasive differentiation of benign and malignant salivary gland tumors [9], as well as in distinguishing between different pathological subtypes of benign tumors [10,11,12].
However, no ultrasound-related studies have focused on the differentiation of secondary malignant salivary gland tumors, despite their substantial share of cases and treatment regimens that differ clearly from those of primary tumors. We aimed to develop a model integrating radiomics and deep learning through a retrospective analysis of ultrasound images from four major medical centers, providing a noninvasive approach to characterizing salivary gland malignancies.
2. Materials and Methods
2.1. Patients
This retrospective study analyzed patients diagnosed with salivary gland malignancies across four centers in two regions. The inclusion criteria were as follows: patients with histopathologically confirmed salivary gland malignancies through surgical resection or biopsy, those who underwent preoperative ultrasound examination, and those with complete clinical data. The exclusion criteria were patients with poor-quality ultrasound images impeding accurate diagnosis, salivary gland tumors resulting from the invasion of adjacent malignant tumors, recurrent salivary gland malignancies following total resection, and cases where pathology could not definitively confirm whether malignancies were primary or secondary.
A total of 140 patients were ultimately enrolled, including 111 from Jiangsu Cancer Hospital, who were split in a 7:3 ratio into training and internal validation sets. An additional 29 patients from the First Affiliated Hospital of Nanjing Medical University, the Affiliated Hospital of Nantong University, and the Affiliated Jiangning Hospital of Nanjing Medical University were used as the external test set. A flow diagram is shown in Figure 1. This study complied with the Declaration of Helsinki and was approved by the Ethics Committee of Jiangsu Cancer Hospital (No. KY-2024-057; Date: 1 July 2024). The requirement for individual informed consent was waived.
2.2. Histopathological Outcomes
Pathological diagnoses were based on the 5th Edition of the World Health Organization Classification of Head and Neck Tumors [13]. At Jiangsu Cancer Hospital, all pathological diagnoses were re-evaluated by a pathologist (X.-C.H.) with over 6 years of experience. This process involved reviewing original diagnoses and assessing patients’ clinical data to confirm whether the malignancies were primary or secondary. For the external test set, pathological and clinical data from three regional medical centers were collected by M.-J.W., H.Z., and Q.J. The final diagnoses were confirmed by X.-C.H. using consistent pathological and clinical criteria (Tables S1 and S2 in Supplementary Materials).
2.3. Ultrasound Imaging
Due to the retrospective nature of the study, all ultrasound images were exported in PNG format from the ultrasound report systems and stored on a computer, which preserved image quality. The ultrasound devices used in the study included the MyLab Twice and MyLab 90 (Esaote, Genoa, Italy), HI VISION Preirus (HITACHI, Tokyo, Japan), S2000 (Siemens Healthineers, Erlangen, Germany), LOGIQ E20 (GE Healthcare, Chicago, IL, USA), and ALOKA ARIETTA 850 (FUJI, Tokyo, Japan), all equipped with linear high-frequency probes (frequency range of approximately 5 to 12 MHz). The ultrasound images were reviewed by M.W., an ultrasound physician with over 15 years of experience at a hospital comparable in clinical and diagnostic capabilities to the four participating centers, who was blinded to the pathological diagnoses. Because no established diagnostic standards exist for salivary gland malignancies, lesion features were evaluated with reference to the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) [14]. The ultrasound features evaluated included composition, echogenicity, shape, aspect ratio, margin, calcification, and posterior acoustic characteristics.
2.4. Labeling
The ultrasound physician (Z.X.), with 6 years of experience, adhered to standardized procedures and was blinded to lesion pathology during annotation. Regions of interest (ROIs) were delineated using ITK-SNAP (version 4.0.2, www.itksnap.org, accessed on 20 December 2024), encompassing the entire tumor mass while excluding non-tumorous surrounding tissues. The delineated ROIs were then exported to the Neuroimaging Informatics Technology Initiative (NIfTI) format for subsequent model training.
2.5. Radiomics Features Extraction
The PyRadiomics library was employed to extract radiomics features using a multistep approach. The analyzed image types included the original image along with several transformed versions: Wavelet, Square, SquareRoot, Logarithm, Exponential, and Gradient. The extracted features were categorized into three groups: geometry, intensity, and texture. Geometry features describe the two-dimensional shape characteristics of the tumor. Intensity features capture the first-order statistical distribution of voxel intensities within the ROI. Texture features capture patterns and higher-order spatial distributions of intensities and were extracted using multiple methods, including the gray level co-occurrence matrix (GLCM), gray level dependence matrix (GLDM), gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), and neighboring gray tone difference matrix (NGTDM). Additionally, for three-dimensional feature computation, the third dimension was set to 1 so that the two-dimensional ultrasound images could be processed.
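As an illustration of how a GLCM-based texture feature such as sum entropy is computed, the following is a minimal NumPy sketch for a single pixel offset (not the PyRadiomics implementation, which additionally handles gray-level discretization, multiple offsets, and averaging over directions):

```python
import numpy as np

def glcm(img, levels, dx=0, dy=1):
    """Symmetric, normalized gray-level co-occurrence matrix for one offset."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            i, j = img[y, x], img[y + dy, x + dx]
            m[i, j] += 1
            m[j, i] += 1  # symmetrize by also counting the reversed pair
    return m / m.sum()

def sum_entropy(p):
    """GLCM Sum Entropy: Shannon entropy of the distribution of i + j."""
    levels = p.shape[0]
    ps = np.zeros(2 * levels - 1)
    for i in range(levels):
        for j in range(levels):
            ps[i + j] += p[i, j]
    ps = ps[ps > 0]
    return float(-(ps * np.log2(ps)).sum())
```

In the study’s pipeline, analogous statistics are computed by PyRadiomics on the original and filtered (e.g., wavelet-transformed) images.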
2.6. Radiomics Features Selection
Feature selection was conducted using the following methods: (1) Z-score standardization was applied to remove scale effects across all features; (2) independent sample t-tests or Mann–Whitney U tests were used to calculate the p-values for all features between the primary and secondary tumor groups, and features with p-values less than 0.05 were retained for further analysis; (3) Spearman correlation analysis was used to remove redundant features: features with a correlation coefficient greater than 0.9 were considered highly correlated, and only one feature from each pair was retained; and (4) the Least Absolute Shrinkage and Selection Operator (LASSO) regression algorithm, combined with five-fold cross-validation, was employed to further eliminate irrelevant features [15]. The final selected features were used for modeling.
2.7. Deep Learning Training
Several deep learning models with ImageNet pre-trained weights were used for training and validation [16]. The training, internal validation, and external test datasets were loaded according to their class labels and normalized using the ImageNet standard. The models were trained with a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, a batch size of 32, and 50 epochs. During training, model performance was evaluated using metrics including accuracy, precision, recall, and F1 score. Additionally, confusion matrices and receiver operating characteristic (ROC) curves were generated to further assess classification performance.
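The ImageNet normalization step can be illustrated as follows (a minimal sketch using the channel statistics conventionally published for ImageNet-pretrained backbones; the exact data-loading code is not specified in the text):

```python
import numpy as np

# Standard ImageNet RGB channel statistics used with pretrained backbones.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(img):
    """Normalize an H x W x 3 image (pixel values scaled to [0, 1])
    channel-wise to match the distribution the pretrained weights expect."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD
```

Matching the pretrained network’s input distribution in this way is what allows ImageNet weights to transfer usefully to ultrasound images.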
2.8. Radiomics-Deep Learning (RadiomicsDL) and Combined Models
Deep learning features were extracted from the global average pooling (avgpool) layer of the trained model. The classification layer was removed, and the avgpool output, which captures high-level image semantics, was used as the feature vector. The input data underwent forward propagation, and the extracted features were organized into a matrix. Principal component analysis (PCA) was applied to reduce dimensionality while retaining key information. The compressed feature vectors were used for modeling, thereby improving efficiency and performance.
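The PCA compression step can be sketched with a plain SVD-based implementation (a minimal illustration; the number of retained components is a hypothetical parameter here, as the text does not specify it):

```python
import numpy as np

def pca_compress(F, n_components):
    """Project deep-feature vectors (rows of F) onto the top principal axes.
    Equivalent to PCA scores: center the columns, then use the leading
    right-singular vectors of the centered matrix as the projection basis."""
    Fc = F - F.mean(axis=0)                # center each feature
    U, S, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:n_components].T        # scores in the reduced space
```

For feature matrices whose columns are highly collinear, a small number of components captures nearly all the variance, which is what makes the compressed vectors efficient for downstream modeling.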
The compressed deep learning features were combined with the selected radiomics features, and key features were retained through dimensionality reduction, similar to the radiomics feature selection process. These features were used to develop the RadiomicsDL model, which was then combined with the ultrasound features to create the combined model.
2.9. Machine Learning Modeling
Six classical machine learning models (Logistic Regression [LR], Support Vector Machine [SVM], Random Forest, eXtreme Gradient Boosting [XGBoost], Light Gradient Boosting Machine [LightGBM], and Multi-Layer Perceptron [MLP]) were used for modeling. After training, ROC curves were plotted to compare the area under the ROC curve (AUC) across the training, internal validation, and external test datasets. The model with the best performance on the external test dataset was selected as the final model.
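The AUC used for model comparison can be computed directly from its probabilistic definition (a minimal NumPy sketch, equivalent to the normalized Mann–Whitney U statistic; library implementations instead integrate the empirical ROC curve):

```python
import numpy as np

def auc(y_true, scores):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Count concordant positive/negative pairs; ties contribute 0.5 each.
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level ranking, which is why values near or below 0.5 on the external test set (as seen later for the US model) indicate a failure to generalize.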
2.10. Model Interpretability
Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP) techniques were utilized to improve the interpretability of the DL and RadiomicsDL models.
Grad-CAM highlights the key regions that influence classification by generating heat maps from the gradients of the target class with respect to the convolutional feature maps. Overlaying these heat maps on the ultrasound images reveals the areas critical to the model’s decision and identifies clinically relevant features [17].
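The core of Grad-CAM can be sketched in a few lines, assuming the activations and gradients of the last convolutional layer have already been captured from the network (a simplified NumPy illustration of the method, not the study’s exact code):

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (C, H, W) from the last
    convolutional layer. Each channel is weighted by its spatially averaged
    gradient, the weighted maps are summed, and ReLU keeps positive evidence."""
    weights = gradients.mean(axis=(1, 2))             # alpha_c per channel
    cam = np.tensordot(weights, activations, axes=1)  # sum_c alpha_c * A_c
    cam = np.maximum(cam, 0)                          # ReLU
    return cam / cam.max() if cam.max() > 0 else cam  # scale to [0, 1] for overlay
```

The resulting low-resolution map is upsampled to the input size and overlaid on the ultrasound image to produce the heat maps described above.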
SHAP quantifies the contribution of individual features to the predictions. Globally, it identifies the dominant features; locally, it explains individual predictions by visualizing the direction and magnitude of each feature’s contribution, further improving interpretability [18].
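The Shapley values underlying SHAP can be computed exactly by brute force when the feature set is small (an illustrative sketch; the SHAP library uses efficient approximations, and replacing absent features with a fixed baseline is a simplification):

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all coalitions. Features absent from a coalition take baseline values."""
    n = len(x)
    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi
```

The attributions sum exactly to the difference between the model’s prediction and its baseline prediction, which is the property that makes SHAP summary plots interpretable as a decomposition of each prediction.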
2.11. Software and Statistical Analysis
Statistical analyses were performed using Python (version 3.7) and R (version 3.6.1). Categorical variables were presented as frequency (n) and percentage (%), while continuous variables were presented as mean ± standard deviation (SD). Group differences were assessed with the chi-square or Fisher’s exact test for categorical variables, and independent samples t-tests or Mann–Whitney U tests for continuous variables. Correlation analyses were performed using Pearson’s or Spearman’s coefficients, as appropriate. The diagnostic performance of the models was evaluated using the AUC, sensitivity, specificity, and accuracy. Statistical differences between model performances were tested using DeLong’s test. Statistical significance was set at p < 0.05.
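The diagnostic metrics reported alongside the AUC follow directly from the 2 × 2 confusion matrix (a minimal sketch):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy from a 2x2 confusion matrix
    (tp/fp/fn/tn = true/false positives and negatives)."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```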
4. Discussion
In this study, radiomics and deep learning features were extracted from ultrasound images to develop predictive models, including US, Radiomics, DL, RadiomicsDL, and combined models. The models were externally validated across three independent central hospitals, demonstrating the robust diagnostic performance of the RadiomicsDL model. Notably, the RadiomicsDL model achieved an AUC of 0.807 in the external test dataset, effectively distinguishing between primary and secondary salivary gland malignancies. This result highlights the potential of RadiomicsDL for non-invasive clinical applications.
To the best of our knowledge, this is the first study to investigate the pathological classification of salivary gland malignancies using ultrasound imaging. In the training dataset, the aspect ratio and posterior echo were identified as statistically significant features distinguishing primary from secondary salivary gland malignancies. Specifically, an aspect ratio of <1 and posterior echo enhancement were independent indicators of secondary tumors. The US model, developed based on these features, achieved an AUC of 0.726 in the training set. However, its performance declined significantly in the external test dataset, with an AUC of only 0.421. This finding underscores the limitations of conventional ultrasound in accurately classifying malignant tumor subtypes. Similar challenges are reflected in its inconsistent sensitivity for distinguishing benign from malignant salivary gland tumors, previously reported to range from 38.9% to 88% [19]. This limitation also hindered the diagnostic performance of the combined model in the validation and test sets, further reflecting the constraints of conventional ultrasound when applied to salivary gland tumors. The overlapping ultrasound characteristics among tumor types likely represent a key barrier to achieving higher predictive accuracy, highlighting the need for advanced diagnostic tools, such as radiomics or DL approaches, to improve the precision of pathological classification.
The Radiomics model achieved promising results in the training and internal validation datasets but showed signs of overfitting in the external test dataset. This finding suggests that, while radiomics models can achieve high accuracy within specific datasets, their generalizability and robustness across diverse datasets remain challenges. By contrast, the DL model demonstrated stable performance across all datasets, highlighting its ability to capture data complexity and adapt to heterogeneous data distributions. Previous studies have emphasized that integrating radiomics and DL features can enhance tumor differentiation, staging, and prognosis prediction compared with using either method alone [20,21]. This improvement was attributed to the multi-omics model incorporating additional critical parameters. The integration of radiomics and deep learning in the RadiomicsDL framework improved the AUC and resulted in optimal performance on both the internal validation set and the external test set. Compared with the standalone DL model, the RadiomicsDL model corrected several misclassifications, reducing the occurrence of false positives and false negatives (Figure 4). This further emphasizes the advantages of combining radiomics and deep learning, particularly in enhancing diagnostic accuracy.
In this study, we leveraged SHAP to analyze the interpretability of our proposed RadiomicsDL model, effectively visualizing the model’s evaluation process and prediction outcomes. The RadiomicsDL model was developed by integrating key deep learning features with selected radiomics features, combining two radiomics features and one deep learning-derived feature. The SHAP summary plot identified the radiomics feature Wavelet_LHH_glcm_SumEntropy, derived from wavelet transform analysis of the GLCM, as the most influential. Our findings indicate that higher values of this feature correspond to an increased likelihood of a primary tumor. This observation aligns with previous studies, which have demonstrated a correlation between this feature and favorable prognosis as well as reduced tumor invasiveness [22,23]. The integration of these two radiomics features with the deep learning-derived feature DL_0 significantly enhanced the model’s discriminative capability, and SHAP-based local analysis further clarified how each feature contributed to individual predictions.
Although this study provides encouraging preliminary results, it has several limitations. First, the retrospective design prevented the standardization of ultrasound image acquisition, and the analysis was limited to conventional ultrasound images, which may have constrained the model’s generalizability. Second, the relatively low incidence of salivary gland malignancies limited the sample size, despite cases being collected from multiple central hospitals. Variations in the regional distribution of pathological subtypes, potentially reflecting differences in population genetics or healthcare practices, may have introduced instability and reduced the reliability of the results. Moreover, since this study focused solely on binary classification of salivary gland malignancies, its applicability is limited. The failure to identify lymphomas separately is another limitation of this research.
To address these limitations, future studies should explore the integration of multimodal imaging data, such as adding Color Doppler Flow Imaging (CDFI) and elastography, to complement ultrasound findings and enhance diagnostic accuracy, particularly in further subtyping of tumors. Additionally, developing more generalized and versatile multilayer diagnostic models that can provide initial benign/malignant classification as well as further subtype classification for salivary gland tumors would be beneficial. Furthermore, collaborative efforts across multiple centers, along with the accumulation of large-scale datasets integrating clinical and genomic information, offer hope for building more comprehensive and robust diagnostic models.