1. Introduction
Excluding skin cancer, lung cancer is the second most common type of cancer in both sexes, after prostate cancer in men and breast cancer in women [1,2]. Unfortunately, the overall five-year survival rate of patients with lung cancer is still dismally low (≈18.6%) and far below that of other oncological disorders such as colorectal (≈64.5%), breast (≈89.6%) and prostate (≈98.2%) cancer [2]. The survival rate, however, depends a great deal on the stage of the disease at first diagnosis, ranging from a grim ≈5% for distant tumours to ≈56% when the disease is still localised [2].
The effectiveness of lung cancer therapy strongly relies on early diagnosis. Chest radiography (CXR), computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), sputum cytology and breath analysis represent the currently available detection techniques for lung cancer [3]. Computed tomography, in particular, plays a pivotal role in differentiating benign vs. malignant lung nodules in the early screening phase. However, although CT scans provide valuable information about suspicious lung nodules, their correct interpretation can be a challenging task for the radiologist. In this context, computer-assisted diagnosis may provide valuable support to the radiologist and contribute to the diagnostic process of lung cancer.
In recent years computerised analysis of imaging data (particularly from CT and PET/CT) has shown great promise for improving the management of patients with lung cancer [4,5,6,7,8,9,10]. The rationale behind this paradigm is that the quantitative extraction of imaging parameters from suspicious lesions—particularly shape and texture features—may reveal hidden patterns that would otherwise go unnoticed by the naked eye [11,12]. Furthermore, the extraction of objective, reproducible and standardised imaging parameters helps reduce intra-observer and inter-observer bias and facilitates tracking changes over time. Radiomics leverages artificial intelligence techniques and the increasing availability of large, open-access and multicentric datasets of pre-classified cases to infer clinical information about unknown ones (the 'population imaging' approach [13]). Several studies have underlined the potential benefit of radiomics for clinical problem-solving in lung cancer, such as prediction of malignancy [14,15,16], histological subtype [17,18,19], prognosis [20,21,22] and response to treatment [23,24,25] (see also Figure 1 for an overview of potential applications).
The radiomics pipeline consists of six steps [26]: (1) acquisition, (2) pre-processing, (3) segmentation (also referred to as delineation), (4) feature extraction, (5) post-processing and (6) data analysis/model building. The fourth step, which aims at extracting a set of quantitative parameters from the region of interest, is central to the whole process, and various studies have shown that steps 1–3 can have a significant impact on feature extraction [27,28,29,30,31,32]. A major current research focus is therefore the assessment of the stability of radiomics features to changes in image acquisition settings, signal pre-processing and lesion delineation (see Traverso et al. [33] for a general review on the subject).
In particular, the repeatability and reproducibility of radiomics features from lung lesions on CT have been investigated in a number of recent works. In [34] Balagurunathan et al. evaluated the test-retest reproducibility of texture and non-texture features from chest CT scans across two consecutive acquisitions (on the same patient) taken within 15 min of one another. Of the 329 features included in their study, they found that 29 (i.e., approximately one in eleven) had a concordance correlation coefficient (CCC) above their reproducibility threshold. The study by Lee et al. [35] addressed the impact of voxel geometry and intensity quantisation on 260 lung nodules at CT; in this case the results indicated that nine of the 252 features investigated had high reproducibility among the different experimental settings. As for stability to lesion segmentation, Kalpathy-Cramer et al. [36] investigated 830 radiomics features from CT scans of pulmonary nodules and determined that 68% of them met their reproducibility criterion. Parmar et al. [37] compared the variability of radiomics features extracted from automatically segmented lesions (3D-Slicer) with that of features from manually segmented ones and found higher reproducibility in the first case. Owens et al. [38] examined the repeatability of 40 radiomics features from ten CT scans of non-small cell lung cancer under manual and semi-automatic lesion delineation subject to intra-observer, inter-observer and inter-software variability. Similarly to [37], they concluded that semi-automatic lesion delineation can provide more reproducible radiomics features than manual segmentation. Tunali et al. [39] assessed the test-retest repeatability of 264 radiomics features from the peritumoural area of lung lesions and their stability to nine semi-automated lesion segmentation algorithms. They observed unequal behaviour across the different classes of texture features investigated, with first-order features generally showing better stability than the other groups. More recently, Haarburger et al. [40] evaluated the stability of 89 shape and texture features to manual and automatic lesion delineation, finding that 84% of the features investigated had an intra-class correlation coefficient (ICC) above their stability threshold.
One common shortcoming in the available studies, however, is that most of them are based on proprietary datasets (with the notable exception of [40]) and custom feature extraction routines, all of which makes the results difficult to reproduce. In this work we investigated the stability of 88 textural features from CT scans of lung lesions to delineation and intensity quantisation. To guarantee reproducibility we based our study on a public dataset (LIDC-IDRI [41]) and on feature extraction routines from an open-access package (PyRadiomics [42]). Furthermore, we made all the code used for the experiments freely available to the public for future comparisons and evaluations. Our results identified 30 features with good or excellent stability to lesion delineation, 28 to intensity quantisation and 18 to both.
3. Results
Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10 summarise the results of the experiments. As can be observed, of the 88 features considered in the study, 18 showed good or excellent stability (defined as a mean relative difference ≤ 10%) relative to both lesion delineation and intensity quantisation. Broken down by class, the numbers (percentages) of features with at least good stability relative to both delineation and intensity quantisation were: 4/13 (≈31%) for first-order features, 6/24 (25%) for GLCM features, 1/14 (≈7%) for GLDM features, 5/16 (≈31%) for GLRLM features and 2/16 (≈13%) for GLSZM features, whereas none of the five NGTDM features achieved at least good stability relative to both conditions.
If we examine the results by class of features, we observe that the first-order features (except Uniformity) all had at least good repeatability relative to intensity quantisation (this is also evident from
Figure 5). This is, of course, what we expected, as these features (excluding Entropy and Uniformity) are by definition independent of signal quantisation—apart from numerical round-off errors. It is also no surprise that Entropy and Uniformity (respectively defined as 'Discretised intensity entropy' and 'Discretised intensity uniformity' in the Image Biomarker Standardisation Initiative [55]) exhibited the highest relative errors (9.40% and 23.05%, respectively), for they depend—by definition—on the number of quantisation levels used. Under the currently accepted formulations [42,55], Entropy and Uniformity have values in [0, log2(Ng)] and [1/Ng, 1], respectively, which makes the dependency on Ng evident. As for stability to lesion delineation, Max emerged as the most stable feature. This is consistent with tissue density usually being highest in the central area of the lesion, which is also the part of the tissue that most observers would include in the delineation. The other parameters with good to excellent stability were Entropy, Range and Min.
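The dependence of Entropy and Uniformity on the number of quantisation levels can be checked with a toy computation (a self-contained sketch on synthetic intensities, not the PyRadiomics implementation): for any discretised signal, Entropy stays within [0, log2(Ng)] and Uniformity within [1/Ng, 1].

```python
import math
from collections import Counter

def discretise(values, n_levels, lo, hi):
    """Map continuous intensities to n_levels integer bins over [lo, hi)."""
    width = (hi - lo) / n_levels
    return [min(int((v - lo) / width), n_levels - 1) for v in values]

def entropy_uniformity(levels):
    """First-order Entropy and Uniformity of a discretised ROI."""
    counts = Counter(levels)
    n = len(levels)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    uniformity = sum(p * p for p in probs)
    return entropy, uniformity

roi = [i % 97 for i in range(500)]  # synthetic ROI intensities in [0, 97)
for ng in (8, 64):
    e, u = entropy_uniformity(discretise(roi, ng, 0, 97))
    assert 0.0 <= e <= math.log2(ng)  # Entropy is bounded above by log2(Ng)
    assert 1.0 / ng <= u <= 1.0       # Uniformity is bounded below by 1/Ng
```

Changing Ng moves both bounds, so the raw feature values are not comparable across quantisation settings.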
For the other classes,
Figure 5 indicates that the data for intensity quantisation were, on the whole, more dispersed than those for lesion delineation. Features from the GLCM generally proved more resilient to changes in lesion delineation (half of them had at least good stability) than to intensity resampling (only seven features out of 24 reached at least good stability). This is, again, consistent with the GLCM definition depending heavily on the number of quantisation levels.
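This dependence is easy to illustrate with a toy co-occurrence count (a minimal sketch, not the full normalised GLCM of [42]): requantising the same patch changes the matrix entries, and hence every feature derived from them.

```python
from collections import Counter

def glcm_pairs(levels_2d, dx=1, dy=0):
    """Count co-occurring grey-level pairs at the given offset (a minimal GLCM)."""
    pairs = Counter()
    rows, cols = len(levels_2d), len(levels_2d[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                pairs[(levels_2d[r][c], levels_2d[r2][c2])] += 1
    return pairs

# The same 4x4 patch quantised to 4 vs. 2 grey levels yields different
# co-occurrence matrices, hence different GLCM features.
patch = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2]]
coarse = [[v // 2 for v in row] for row in patch]  # requantised to 2 levels
assert glcm_pairs(patch) != glcm_pairs(coarse)
```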
Similar arguments hold for the other classes of texture features. In particular, GLDM produced very few stable features: only three of them showed at least good stability to lesion delineation and only one to intensity resampling. It is worth recalling that GLDM is based on the concept of 'dependent' voxels [42,56]; that is, a neighbouring voxel is considered dependent on the central voxel if the absolute difference between the intensity values of the two is below a user-defined threshold. For the threshold we used the default PyRadiomics setting, and this may have had an effect—possibly negative—on the stability of this group of features. Likewise, GLSZM features were highly sensitive to signal quantisation too, which is again logical given the definition of the GLSZM. Recall that this is based on sets of connected voxels (grey zones) sharing the same grey-level intensity; consequently, changes in signal quantisation are likely to produce different grey zones, with fewer quantisation levels resulting in larger grey zones and vice versa. This inevitably reflects on the feature values.
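The merging of grey zones under coarser quantisation can be sketched with a small flood-fill count (an illustrative toy example, not the PyRadiomics GLSZM code):

```python
def grey_zones(levels_2d):
    """Count connected zones (4-connectivity) of equal grey level (GLSZM zones)."""
    rows, cols = len(levels_2d), len(levels_2d[0])
    seen = set()
    zones = 0
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen:
                continue
            zones += 1
            stack, g = [(r, c)], levels_2d[r][c]
            while stack:  # flood-fill the zone containing (r, c)
                y, x = stack.pop()
                if (y, x) in seen:
                    continue
                seen.add((y, x))
                for y2, x2 in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= y2 < rows and 0 <= x2 < cols and levels_2d[y2][x2] == g:
                        stack.append((y2, x2))
    return zones

fine = [[0, 1], [2, 3]]                           # 4 levels -> 4 separate zones
coarse = [[v // 2 for v in row] for row in fine]  # 2 levels -> zones merge
assert grey_zones(fine) == 4
assert grey_zones(coarse) == 2
```

Halving the number of grey levels merges neighbouring zones, which directly changes the zone-size distribution the GLSZM features are computed from.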
Notably, none of the NGTDM features proved resilient enough to both lesion delineation and intensity resampling (Table 10). As for lesion delineation, only Busyness and Strength attained excellent and good stability, respectively, whereas Coarseness was the only feature with good stability to intensity resampling. Consider that the NGTDM [42,57] estimates the joint probability between the intensity level at one voxel and the average intensity difference among its neighbouring voxels; we speculate that changing the number of resampling levels (Ng) may alter the joint distribution, and this could explain the poor stability to signal quantisation.
4. Discussion
Radiomics has attracted increasing research interest in recent years as a possible means to assist physicians in clinical decision making. Potential applications in pulmonary imaging include, in particular, detection and assessment of suspicious lung nodules; prediction of histological subtype, prognosis and response to treatment. The radiomics workflow involves six steps, each of which is sensitive to a number of settings and parameters. Stability of radiomics features to these settings is therefore critical for guaranteeing reproducibility and consistency across multiple institutions.
Regarding stability to lesion delineation, a comparison with previously published works indicates that our results are by and large in agreement with what was reported by Haarburger et al. [40] concerning first-order, GLDM, GLSZM and GLRLM features. However, our study indicated lower stability of GLCM and NGTDM features than reported in [40]. One possible explanation for this discrepancy is that the bin width used here (≈3 HU) differed from that adopted in [40] (25 HU). In [34] Balagurunathan et al. found 29 features stable to lesion delineation, of which five were also investigated in the present work. Our findings show partial overlap with [34]: first-order Entropy and GLRLM RLN achieved good stability in both studies; on the other hand, GLCM Contrast, GLRLM GLN and RLN were stable in [34] but not here. As for intensity quantisation, Lee et al. [35] reported three highly stable (in terms of ICC) first-order features (Max, Min and Entropy), which confirmed their performance in our experiments, and two GLCM features (DiffEnt and Homogeneity—equivalent to ID); however, the reproducibility of the latter two was only moderate (DiffEnt) and poor (ID) in our study. In Shafiq-Ul-Hassan et al.'s phantom study [58], 11 texture features proved highly stable, with stability defined as a percent coefficient of variation (%COV) below 30%. Of these, the ones directly comparable with the present work are first-order Uniformity (indicated as Energy in [58]); GLCM InvVar and JointAvg; GLRLM GLN and RLN; and NGTDM Coarseness and Strength. Among these, only the GLCM ones attained good stability in our experiments, albeit the threshold for 'goodness' adopted in [58] (%COV < 30%) was far more generous than the 10% criterion used here.
One much-debated question in radiomics is whether intensity resampling should be absolute or relative [59]. In the first case, the window bounds are determined a priori and are invariable across different scans and ROIs, whereas in the second case they are relative to the region of interest. When intensity values represent quantities with the same physical meaning across different scans (such as Hounsfield units—assuming there are no calibration errors), the use of absolute resampling seems logical [42,60], and this was the decision made here. As detailed in
Section 2.2, we determined the window bounds based on the actual distribution of the intensity values in the whole dataset, but other choices such as a mediastinal (W:350, L:50) or lung (W:1500, L:−600) window would be reasonable options as well. Notably, our results indicate that changes in intensity quantisation had little effect on most of the first-order features, whereas the effect on the other classes was generally stronger and showed much larger intra-class variability (see
Figure 5). This suggests that particular care should be taken in the selection of texture features other than first-order ones when changes in signal quantisation are involved.
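The absolute-vs-relative distinction can be sketched as follows (the fixed window bounds in the absolute variant are illustrative, not the ones used in our experiments):

```python
def quantise_absolute(hu, n_levels, lo=-1000.0, hi=400.0):
    """A priori window bounds (illustrative): identical bin edges for every scan."""
    width = (hi - lo) / n_levels
    return [min(int((max(min(v, hi - 1e-9), lo) - lo) / width), n_levels - 1)
            for v in hu]

def quantise_relative(hu, n_levels):
    """Bounds taken from the ROI itself: bin edges vary from lesion to lesion."""
    lo, hi = min(hu), max(hu)
    width = (hi - lo) / n_levels or 1.0
    return [min(int((v - lo) / width), n_levels - 1) for v in hu]

roi_a = [-100.0, 0.0, 100.0, 200.0]
roi_b = [v + 50.0 for v in roi_a]  # same tissue pattern, shifted by 50 HU
# Absolute resampling reflects the real HU shift between the two scans,
# whereas relative resampling maps both to identical grey levels.
assert quantise_absolute(roi_a, 8) != quantise_absolute(roi_b, 8)
assert quantise_relative(roi_a, 8) == quantise_relative(roi_b, 8)
```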
Another methodological point that requires further attention is which figure of merit should be used for assessing feature stability. Although the intra-class correlation coefficient (ICC) is common practice in the literature [35,37,38,40], we did not think this was the correct choice here, for two reasons. First, ICC assumes a statistical model in which the true scores are normally distributed among the study population [61], but of course this is not guaranteed. Second, in a multicentric study much of the inter-subject variance may come from differences in parameters that are hard to control, such as voxel size, slice thickness, tube voltage, etc., all of which may have unknown and unpredictable effects on the estimated ICC. To avoid these potential problems we based our evaluation on a direct measurement of the intra-rater difference at the nodule level, as used, for instance, in Varghese et al.'s phantom study [29], and averaged the result over the whole population. The resulting stability score S (Equations (1)–(2)) avoids the unpredictable effects of between-nodule differences in the acquisition settings; furthermore, it has a straightforward interpretation (values are bound between 0% and 100%) and does not rely on any assumptions about the distribution of the underlying data.
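A minimal sketch of a per-nodule percentage-difference score of this kind follows (the normalisation shown here is an assumption for illustration; the exact formulation is given in Equations (1)–(2)):

```python
def stability_score(a, b, eps=1e-12):
    """Per-nodule percentage difference between two measurements of one feature.
    Normalising by |a| + |b| bounds the score between 0% and 100%."""
    return 100.0 * abs(a - b) / (abs(a) + abs(b) + eps)

def mean_stability(pairs):
    """Average the per-nodule score over the population; 0% = perfectly stable."""
    return sum(stability_score(a, b) for a, b in pairs) / len(pairs)

assert mean_stability([(1.0, 1.0), (2.5, 2.5)]) == 0.0  # identical -> 0%
assert stability_score(1.0, -1.0) <= 100.0              # worst case -> 100%
assert 0.0 < mean_stability([(1.0, 1.2), (3.0, 2.4)]) < 100.0
```

Unlike the ICC, a score of this form makes no normality assumption and is unaffected by how widely the acquisition settings vary between nodules.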
Focussing on the potential implications of radiomics for clinical decision making, one major problem related to the lack of feature stability is that results are difficult to reuse across multiple centres. If one centre determines that having a certain feature value above a given threshold is predictive of malignancy in lung nodule screening, a second one can reuse that result only if (a) the features are computed using the same settings or (b) the features are stable enough. Concerning intensity quantisation, one sensible approach would of course be to stick to one agreed number of grey levels (a common choice in the literature [16,62,63,64]) in order to have comparable data. However, for some features simple mathematical transformations could be applied to make them independent of the number of quantisation levels (see for instance
Appendix A). To avoid or reduce the inter-observer bias related to manual lesion delineation, automated and semi-automated methods offer great promise in terms of speed, accuracy and repeatability [65]. Previous studies have shown that semi-automated segmenters can improve on manual delineation and generate more reproducible radiomics features [37,38].
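As an example of such a transformation (a sketch only; the actual proposals are in Appendix A), dividing Entropy by its log2(Ng) upper bound largely removes the dependence on the number of grey levels:

```python
import math
from collections import Counter

def entropy(levels):
    """First-order Entropy of a discretised signal."""
    n = len(levels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(levels).values())

def normalised_entropy(levels, n_levels):
    """Dividing by the log2(Ng) upper bound damps the dependence on Ng."""
    return entropy(levels) / math.log2(n_levels)

sig = [i / 1000.0 for i in range(1000)]    # near-uniform signal in [0, 1)
q16 = [min(int(v * 16), 15) for v in sig]  # quantised to 16 levels
q64 = [min(int(v * 64), 63) for v in sig]  # quantised to 64 levels
# Raw entropies differ by ~log2(64/16) = 2 bits; normalised values stay close.
assert abs(entropy(q64) - entropy(q16) - 2.0) < 0.05
assert abs(normalised_entropy(q64, 64) - normalised_entropy(q16, 16)) < 0.01
```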
5. Conclusions and Future Work
In recent years, the extraction of quantitative imaging features from lung lesions on CT has attracted increasing research interest as a potential tool to improve diagnosis, risk stratification and the follow-up of lung cancer. Still, the applicability of radiomics across multiple institutions and on large populations of patients depends a great deal on the robustness of the image features to changes in the acquisition settings, preprocessing procedures and lesion delineation methods. In this context the objective of this work was to evaluate the impact of lesion delineation and intensity quantisation on the stability of texture features extracted from suspicious lung nodules on CT scans. Specifically, we assessed the robustness of 88 texture features from six classes: first-order, GLCM, GLDM, GLRLM, GLSZM and NGTDM. For reproducible research purposes, we carried out the experiments on a public dataset of lung nodules (LIDC-IDRI) and employed open-source tools (Python and PyRadiomics) for feature extraction. Implementation settings and code are also available to the public for future comparisons and evaluation.
The results indicate that the impact of changes in lesion delineation and intensity quantisation was substantial: of the 88 texture features included in the study, only 18 showed good stability (mean relative difference ≤ 10%) relative to both types of change. These findings suggest caution when building predictive models involving CT features obtained with different quantisation schemes and/or affected by contour variability. From a clinical standpoint, our results are useful in that they identify a set of stable CT texture features that can contribute to the diagnosis of lung cancer. This is very important for the discovery of robust imaging biomarkers that may help characterise lung lesions, particularly in those cases where the anatomical site or the clinical presentation of the patient rules out invasive methods (e.g., biopsy).
The present investigation indicates several directions for future research. Confirming previous studies [58], we found that most texture features were sensitive to the intensity quantisation of the CT signal. This suggests (a) that the mathematical formulations of these features may need to be revised in order to remove such dependency (as proposed, for instance, in
Appendix A) and/or (b) that the number of quantisation levels should be defined/recommended in internationally accepted guidelines (standardisation). Similarly, the effects of intra-observer and inter-observer variability in lesion delineation could be reduced by resorting to automated and semi-automated segmentation procedures. As pointed out in [35], this is particularly critical in lung cancer, where tumour progression is associated with density changes in the core and peri-tumoural region. Hence the need for radiomics to take both areas into account.