*2.3. Statistical Analysis*

Statistical analyses were performed in SPSS version 25 (IBM Statistics, Chicago, IL, USA). Per cohort and for both cohorts together, differences in radiomic features extracted from VOIvital-tumour and VOIgross-tumour were assessed using a Wilcoxon signed-rank test or a paired *t*-test after testing for (log-)normality. Since over one hundred features were tested simultaneously, some features may show a significant difference between the two delineation methods by chance, which increases the false discovery rate [24]. Therefore, the Benjamini–Hochberg multiple testing correction was performed [25]. The Benjamini–Hochberg correction determines the significance level for a specific feature ($p_i$) using Equation (2):

$$p_i < \left(\frac{i}{n}\right) a \tag{2}$$

where *i* is the rank of a feature when all features are ordered by the significance level of the paired *t*-test from smallest to largest, *n* is the total number of features and *a* is the original significance level (*a* = 0.05). Additional subset analyses of all patients based on the NTF and SUVmax were performed by creating three equally sized groups for low, medium and high values: NTF: ≤0.12, 0.12 < NTF ≤ 0.36, >0.36; SUVmax: ≤4.61, 4.61 < SUVmax ≤ 12.09, >12.09 g/mL. Differences in the numbers of affected features per cohort and subgroup were assessed using Fisher's exact test. Overlaps in the affected features per cohort and subgroup were visualised using Venn diagrams.
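
The correction in Equation (2) can be illustrated with a short Python sketch. This is not the SPSS routine used for the analyses; the function name `benjamini_hochberg` and the example p-values are hypothetical. The sketch assumes the usual step-up reading of the procedure: the largest rank *i* that satisfies Equation (2) is located, and every feature with a rank up to and including that *i* is declared significant, even if its own p-value misses its own threshold.

```python
def benjamini_hochberg(p_values, a=0.05):
    """Return one boolean per feature: True if significant after correction.

    Follows Equation (2) with its strict inequality, p_i < (i / n) * a,
    applied as a step-up procedure.
    """
    n = len(p_values)
    # Indices of the features, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda k: p_values[k])

    # Largest rank i whose p-value satisfies Equation (2).
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] < (rank / n) * a:
            largest_rank = rank

    # All features ranked at or below that rank are declared significant.
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for five features (not taken from the study).
example_p = [0.001, 0.025, 0.028, 0.2, 0.03]
print(benjamini_hochberg(example_p))
```

In this toy example the feature with p = 0.025 misses its own threshold (rank 2, threshold 0.02) but is still retained, because the features at ranks 3 and 4 satisfy Equation (2).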

For the PPGL cohort, the performance of the radiomic models in predicting the underlying tumour biology, based on features derived from the different delineation methods, was assessed by binary logistic regression in R version 3.6.0 (R Foundation for Statistical Computing, Vienna, Austria). Moreover, a radiomic model was created from features of both delineation methods, assuming that the two feature sets contain different information. The response variable in the regression was the noradrenergic biochemical profile of the PPGLs. Unsupervised feature selection or dimension reduction was performed to deal with multicollinearity and high dimensionality, which occurs when the number of features largely exceeds the number of patients. As a rule of thumb, one feature can be tested for every 10 subjects [26]; therefore, three features were selected for testing (PPGL dataset: *n* = 31 patients). The predictive performance of the radiomic models was not assessed for the NSCLC cohort, since this cohort consisted of only 12 patients, which corresponds to only one feature to be tested and is inadequate to explain sufficient variance in the dataset. Dimension reduction in the PPGL dataset using redundancy filtering and factor analysis was performed with the FMradio (Factor Modeling for Radiomics Data) R package version 1.1.1 [27]. Features were scaled (centred around 0, variance of 1) to prevent features with the largest scale from dominating the analysis. Redundancy filtering of the Pearson correlation matrix of the features was performed with a threshold of τ = 0.95 and, from each group of redundant features, one feature was retained. Factor analysis of the redundancy-filtered correlation matrix with an orthogonal rotation was executed so that the first factor explained the largest possible variance in the dataset; the succeeding factors explained the largest variance in orthogonal directions. The sampling adequacy of the model, which is quantified by the Kaiser–Meyer–Olkin (KMO) statistic, was predefined to be ≥0.9.
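
The scaling and redundancy-filtering steps can be sketched in Python as follows. This is a simplified greedy variant written for illustration only; it is not the FMradio implementation, and the function names and toy feature values are hypothetical. A feature is retained only if its absolute Pearson correlation with every feature already retained is below τ = 0.95.

```python
import math


def standardise(column):
    """Centre a feature on 0 and scale it to unit (sample) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def pearson(x, y):
    """Pearson correlation of two standardised features of equal length."""
    return sum(a * b for a, b in zip(x, y)) / (len(x) - 1)


def redundancy_filter(features, tau=0.95):
    """Greedy filter: keep a feature if |r| < tau with all kept features."""
    scaled = {name: standardise(values) for name, values in features.items()}
    kept = []
    for name in scaled:
        if all(abs(pearson(scaled[name], scaled[k])) < tau for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: 'f2' is almost a rescaled copy of 'f1',
# whereas 'f3' is only weakly correlated with 'f1'.
toy_features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 4.0, 6.2, 7.9, 10.0],
    "f3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(redundancy_filter(toy_features))
```

Which member of a correlated group is retained depends on the order in which features are visited in this sketch; a constant feature (zero variance) would need to be removed beforehand, since it cannot be standardised.
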
The feature with the highest loading on a single factor was selected for regression analysis. The three selected features were associated with the noradrenergic biochemical profile using multiple binary logistic regression. Areas under the curve (AUC) of the receiver operating characteristic (ROC) of the radiomic models based on the three selected features for VOIvital-tumour, VOIgross-tumour and the combination of both were computed and compared using DeLong's test for paired ROC curves. A sham experiment was conducted to validate the findings by randomisation of the outcome labels (noradrenergic biochemical profile) [28]. This takes into account the prevalence of the outcome and the distributions and multicollinearity of the radiomic features but uncouples their hypothesised relation. Binary logistic regression was performed and the sham experiment was repeated 100 times to calculate the mean AUCs.
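
The logic of the sham experiment can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code used in the study: the logistic regression is fitted with a plain gradient-ascent loop, the toy features are random numbers, and the outcome prevalence (12 of 31) is hypothetical. Shuffling the labels leaves the feature matrix and the prevalence untouched while removing any real relation between features and outcome.

```python
import math
import random


def auc(labels, scores):
    """ROC AUC: probability that a positive outscores a negative (ties 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(X, y, steps=200, lr=0.1):
    """Gradient-ascent logistic regression; returns [intercept, coefs...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            z = w[0] + sum(c * v for c, v in zip(w[1:], row))
            err = target - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        w = [c + lr * g / len(X) for c, g in zip(w, grad)]
    return w


def linear_predictor(w, X):
    """Model scores; the AUC depends only on their ranking."""
    return [w[0] + sum(c * v for c, v in zip(w[1:], row)) for row in X]


def sham_mean_auc(X, y, repeats=100, seed=0):
    """Mean AUC over repeated refits on randomly permuted outcome labels."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        shuffled = y[:]
        rng.shuffle(shuffled)  # uncouples features and outcome
        w = fit_logistic(X, shuffled)
        aucs.append(auc(shuffled, linear_predictor(w, X)))
    return sum(aucs) / repeats


# Hypothetical toy data: 31 subjects, 3 standardised features, 12 positives.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(31)]
y = [1] * 12 + [0] * 19
mean_auc = sham_mean_auc(X, y)
print(round(mean_auc, 3))
```

Because each sham model is evaluated on the same data it was fitted to, the mean sham AUC is generally expected to lie somewhat above 0.5; it serves as the reference level against which the AUCs of the real models can be judged.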
