1. Introduction
Automated planning (AP) is fast becoming the state of the art in radiotherapy planning for intensity-modulated radiotherapy (IMRT) and volumetric-modulated arc therapy (VMAT) [1,2,3] and can be classified into one of two categories: knowledge-based planning (KBP) or rules-based planning (RBP). KBP uses statistical techniques [2,4,5,6,7] trained on historical clinical datasets to inform planning for novel cases through prediction of optimisation objectives [8], dose–volume histograms [9,10,11] or voxel-level dose [2]. RBP employs logic to converge on a solution. Examples include lexicographic ordering, which optimises planning goals (PGs) in strict sequential order [12,13,14], and protocol-based automatic iterative optimisation (PBAIO), which uses algorithms to automatically adapt planning parameters during optimisation. Various PBAIO approaches have been developed, including scripts that move dose–volume objectives by a specified increment at the start of every new pass [15] or modify weighting factors so objective values meet specified targets [16]. Other PBAIO scripts record the iterative process during manual planning and use it to generate an AP algorithm [17], and commercially available Auto-Planning software automatically generates new contours during optimisation to help meet clinical goals [18]. The majority of these AP techniques have been shown to produce plans non-inferior to manual planning and are used in clinical practice. Comprehensive reviews of all techniques are found in the literature [1,2,4].
The most clinically desirable plans are ‘Pareto optimal’: that is, no dosimetric improvement can be made to one PG except to the detriment of another. The various AP methods therefore aim to converge upon this set. However, planning can be complex because PGs may conflict with one another, and clinical desirability depends upon appropriate management of these trade-offs. Therefore, although the most clinically desirable plans are Pareto optimal, achieving Pareto optimality does not guarantee clinical desirability.
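The dominance relation underlying Pareto optimality can be made concrete with a short, self-contained sketch. The plan score vectors below are hypothetical, and a lower-is-better convention is assumed for every goal (target coverage is negated for this purpose):

```python
import numpy as np

def dominates(a, b):
    """True if plan `a` Pareto-dominates plan `b`: no worse on every
    planning goal (lower is better) and strictly better on at least one."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_front(plans):
    """Return the plans not dominated by any other plan in the set."""
    return [p for p in plans
            if not any(dominates(q, p) for q in plans if q is not p)]

# Hypothetical plans scored on (rectum dose, bladder dose, -target coverage):
plans = [(60.0, 55.0, -0.98), (58.0, 57.0, -0.98), (61.0, 56.0, -0.97)]
front = pareto_front(plans)   # the third plan is dominated by the first
```

Here the first two plans trade rectum dose against bladder dose and both survive, while the third is worse than the first on every goal; Pareto navigation presents exactly this surviving set for trade-off exploration.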
For KBP, trade-off balancing is automatically determined by the underlying clinical plans in the knowledge base. For RBP, balancing must be explicitly defined in a process known as ‘calibration’: balancing the relative priority of PGs such that they align with oncologists’ preferences. The dominant approach to RBP calibration is trial-and-error (TAE) [19,20,21], where AP parameters are iteratively updated until an acceptable solution for a given clinical site is obtained. The approach is time-consuming, with improvements made only with respect to previously tried examples. It does not allow for intuitive exploration of competing PGs and, as with manual planning, may yield solutions that are not fully congruent with oncologists’ clinical preferences [22]. One way to manage the limitations of TAE is a KBP calibration approach, where AP calibrations are derived from machine learning (ML) on historical clinical datasets [23,24]. This approach may be more efficient than TAE but depends strongly on the composition of the knowledge base. A third approach is to utilise Pareto navigation techniques during the calibration process (‘Pareto guided automated planning’, or PGAP). This involves exploring a set of unique and systematically produced Pareto optimal solutions, each representing a differently balanced AP solution. Because of the number of solutions required for this to be effective, it can be resource intensive. Nevertheless, it is an a posteriori multicriteria optimisation (MCO) method that allows exploration of the trade-off relationships between PGs [22,25,26]. Recent work has demonstrated the utility of PGAP in yielding plans consistent with oncologists’ preferences for prostate patients with and without elective nodal irradiation under conventional and extreme hypofractionation regimes [16,22,27].
Despite advances in available calibration methods, RBP calibration takes a ‘one size fits all’ approach, with a single AP protocol (or wishlist) used for all patients of a given clinical site. This assumes that an AP calibration achieving a clinically optimum dose distribution for one patient is optimal for all patients within that clinical site. The validity of the ‘one size fits all’ approach has not been explicitly explored in the literature, and there is evidence that site-specific RBP leads to sub-optimal or clinically unacceptable plans for a reasonably large proportion of cases. For lung stereotactic body radiotherapy, Vanderstraeten et al. observed that up to 24% of automated plans were considered clinically unacceptable without further tweaking [28]. For locally advanced nasopharyngeal carcinoma, Zhang et al. concluded that “automatic VMAT is not good enough to completely replace manual VMAT” [29]. Finally, through independent quality assurance of 229 prostate cancer patients planned using AP, Janssen et al. demonstrated that 17% of plans were suboptimal and could be improved [30]. This evidence highlights deficiencies in the ‘one size fits all’ approach and indicates that personalisation of AP protocols to individual patients may be required to ensure optimality.
In contrast, KBP utilises a fully individualised approach, with ML models using anatomy-based predictive factors to generate patient-specific optimisation objectives or dose distribution parameters. The predicted parameters form static objective function inputs to a standard gradient descent optimisation. Whilst optimisations using this approach are inherently patient-tailored, the relationship between anatomy and objectives/dose parameters is complex, with wide variance across a patient cohort. Accurate modelling is therefore challenging, generally requires large training datasets and can yield models with clinically relevant prediction errors [31]. Furthermore, the quality of the model is highly dependent on the optimality of the underlying training dataset [32], which is not guaranteed.
In summary, modelling uncertainties for KBP and the ‘one size fits all’ approach for RBP mean current AP solutions may not yield optimal, patient-tailored plans. To address this problem we propose a hybrid AP solution in which KBP is used to predict patient-specific AP protocol parameters that act as inputs to an already validated RBP solution. RBP is then no longer reliant on a ‘one size fits all’ set of protocol parameters, but can instead utilise a protocol fully personalised to the individual patient. Applying KBP in this manner has the advantage that a validated RBP approach has, by its nature, already suppressed much of the dependence of plan quality on anatomy, such that a single parameter set can yield acceptable plans across a treatment site. The purpose of KBP is therefore not to ensure RBP yields acceptable plans, but to further refine and individualise AP protocol parameters with the aim of fully personalising treatment plans. Importantly, with much of the variance already reduced through RBP, it is theorised that, unlike in standalone KBP approaches, uncertainties in the KBP models of a hybrid solution will be of low clinical significance.
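The proposed hybrid flow can be sketched in a few lines. Everything here is illustrative: the feature names, the synthetic training data and the per-PG linear models are assumptions, and the RBP/PBAIO optimiser appears only as a stub. The point is that the KBP step emits a personalised protocol that replaces the fixed site-wide parameter set as the RBP input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training set: 20 patients, 3 geometric features
# (e.g. rectum/PTV overlap, bladder/PTV overlap, PTV volume), with
# synthetic 'calibrated' protocol weights that depend linearly on anatomy.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(20, 3))
w_rectum = 0.5 + 0.4 * X_train[:, 0]
w_bladder = 0.3 + 0.5 * X_train[:, 1]

# KBP step: one model per planning goal, mapping anatomy to a protocol weight.
models = {
    "rectum": LinearRegression().fit(X_train, w_rectum),
    "bladder": LinearRegression().fit(X_train, w_bladder),
}

def personalised_protocol(features):
    """Predict a patient-specific AP protocol (PG weights) from anatomy."""
    f = np.asarray(features).reshape(1, -1)
    return {pg: float(m.predict(f)[0]) for pg, m in models.items()}

def run_rbp(protocol):
    """Stub for a validated RBP/PBAIO optimiser, which would consume the
    personalised protocol in place of a single site-wide parameter set."""
    ...

protocol = personalised_protocol([0.6, 0.2, 0.4])  # novel patient's features
```

The design choice here mirrors the text: the KBP model does not replace the RBP optimiser, it only parameterises it, so a poor prediction degrades personalisation rather than plan acceptability.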
The purpose of this work was to develop and evaluate a novel KBP-RBP hybrid planning solution for prostate cancer using PGAP. The new methodology utilised ML to identify the relationships between anatomy and optimum patient-specific calibration parameters (determined via Pareto navigation) such that individualised AP protocols could be generated for novel patients. Recent studies illustrate the clinical relevance of incorporating geometric features in the AP process for robust optimisation [33] and the development of a hybrid approach in which geometric features are used as KBP inputs for calibration of an RBP system [34]. The KBP-RBP hybrid solution developed in this work considered advanced KBP techniques based on geometric features. It was trained on a representative dataset and validated on an independent set of novel patients. For validation, the solution was compared against patient-specific expert-driven Pareto navigation (MCOgs), which is considered the gold standard, and a standard PGAP approach using a ‘one size fits all’ site-specific protocol (PGAPstd). The evaluation aimed to answer: (i) does personalising protocols via ML improve plan quality compared to PGAPstd? and (ii) is there a significant difference between the PGAP approaches and MCOgs?
4. Discussion
In our previous work we developed a PGAP solution (built on a PBAIO framework) that utilised a single ‘one size fits all’ AP protocol for all patients in a given treatment site. The approach was evaluated against traditional TAE manual planning and considered non-inferior. This study builds upon that work in two key ways. Firstly, we introduced ML upstream of the PBAIO AP algorithm to develop a novel hybrid KBP-RBP planning approach, where ML is utilised to generate fully bespoke AP protocols for individual patients. Secondly, PGAPstd, PGAP-MLclus and PGAP-MLreg were evaluated against a Pareto-navigated gold standard, rather than traditional TAE manual planning, which is prone to sub-optimality [53]. In this regard the efficacy of each automated approach could be comprehensively assessed.
Plans generated from this novel approach and plans generated via PGAPstd were compared to a Pareto navigation gold standard (MCOgs). All approaches yielded plans acceptable for clinical use and at a population level demonstrated excellent congruence with MCOgs. At an individual patient level, PGAP-MLreg was considered the weakest solution, due to algorithms being influenced by anatomical outliers. Both PGAPstd and PGAP-MLclus yielded very good agreement with MCOgs across all patients, with PGAP-MLclus considered marginally superior due to fewer extreme outliers.
The ML techniques used in this work are not new to radiotherapy planning. PCA [5], regression [5,50] and clustering [54] have all been used in KBP to make predictions based on anatomical features, with notable success. This work builds upon that knowledge in two ways. Firstly, previous ML implementations typically sought to generate a patient-specific input to a native treatment planning optimiser; in contrast, this novel approach aimed to generate patient-specific AP protocols to further personalise an already validated RBP solution. Secondly, we present a methodology to evaluate the performance of different model formations using a LOOCV decision framework, such that the optimal model for a given site can be selected. This allowed an automatic and unbiased choice among models comprising various feature sets, types of feature and types of model. The approach helps to resolve the challenge of defining an ML formation prior to training and allows bespoke architectures to be utilised for individual PGs, removing the requirement for a homogeneous ML approach, which may not be appropriate. Results of this study support this assertion, with different model formations selected during the LOOCV model selection process.
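A minimal sketch of such a LOOCV decision framework follows; the cohort data and the two candidate formations are hypothetical stand-ins, not the formations evaluated in this work. Each candidate is scored by leave-one-out mean squared error for a single PG and the lowest-error formation is selected.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical cohort: 20 patients, 4 geometric features, one PG weight each.
rng = np.random.default_rng(42)
X = rng.uniform(size=(20, 4))
y = 0.2 + 0.6 * X[:, 0] - 0.3 * X[:, 2] + rng.normal(0.0, 0.02, 20)

# Candidate model formations; in practice these could also differ in the
# feature subsets they consume.
candidates = {
    "linear": LinearRegression(),
    "knn-3": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3)),
}

# Leave-one-out MSE per formation; the lowest-error formation for this PG wins.
loo = LeaveOneOut()
scores = {
    name: -cross_val_score(model, X, y, cv=loo,
                           scoring="neg_mean_squared_error").mean()
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)
```

Repeating this selection independently for each PG is what permits heterogeneous architectures across goals.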
ML in this work relied on a dataset of numerical geometric information derived from delineated patient anatomy. Whilst this methodology is based on previous KBP work, inclusion of other features may improve the versatility and modelling accuracy of the developed approach. A promising method would be utilisation of neural network generated features, which has been implemented successfully for dose prediction [
55,
56]. Neural networks could be utilised to directly generate patient-specific AP protocols or used in a two step approach to generate dosimetric features (rather than anatomical features) from which PG weights are derived [
57]. However, as plan generation is a geometry-based optimisation problem, modelling wholly on anatomy based features may hold intrinsic value as they can be interpreted and therefore reduce the risk of developing an automated planning ‘black box’.
The largest variances in difference from MCOgs for both input parameters (weights) and output metrics (dose distribution) were observed for PGAP-MLreg. This is thought to be related to the size and composition of the training dataset not adequately representing the patient population. PGAPstd and PGAP-MLclus were more robust to the limited dataset size, with small deviations from MCOgs observed for outlier patients. Given that regression allows predictions to be extrapolated beyond the bounds defined by the training dataset, the increased robustness of PGAPstd and PGAP-MLclus compared to PGAP-MLreg is thought to be due to their prediction weights being bounded by the training data. For outlier patients, PGAP-MLreg could therefore lead to inconsistent or spurious predictions. As generating the ground-truth training data is time consuming, curation of a suitably large dataset for accurate regression modelling may be challenging, especially for busy radiotherapy clinics. These results therefore indicate PGAP-MLreg may not be the best-suited ML approach for routine clinical application. Across the three methods, PGAP-MLclus was considered the most comparable to MCOgs based on the number of significant differences observed following Wilcoxon testing, the magnitude of dose differences and the fact that fewer outliers were observed. However, the superiority of PGAP-MLclus over PGAPstd was considered marginal. As PGAPstd is equivalent to PGAP-MLclus when K = 1, these results indicate that for the majority of patients individualisation via clustering may not be necessary if a simple site-specific protocol based on an average weight is implemented. However, marginal improvements may be gained when using PGAP-MLclus for patients who are anatomical outliers, most likely for ROIs where large anatomical variances are common, such as the bladder and patient outline ROIs.
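As an illustration of the clustering variant (the two-group synthetic data and feature/weight values are assumptions), patients are clustered on geometric features, each cluster carries the mean calibrated weight of its members, and a novel patient inherits the weight of its nearest cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical cohort: two anatomical subgroups in a 2D feature space,
# each associated with a different calibrated PG weight.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.3, 0.05, (10, 2)),
               rng.normal(0.7, 0.05, (10, 2))])
W = np.where(X[:, 0] < 0.5, 0.4, 0.8)

K = 2  # K = 1 would reproduce the single 'one size fits all' average weight
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# Each cluster's protocol weight is the mean calibrated weight of its members.
cluster_weights = np.array([W[km.labels_ == k].mean() for k in range(K)])

def protocol_weight(features):
    """Assign a novel patient to the nearest cluster; return its weight."""
    k = km.predict(np.asarray(features, dtype=float).reshape(1, -1))[0]
    return float(cluster_weights[k])
```

Because predictions are cluster means, they remain bounded by the training data, consistent with the robustness of bounded approaches relative to extrapolating regression.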
A key strength of this study was that training and evaluation were performed with plans generated using an a posteriori multicriteria optimisation methodology (MCOgs), which we consider to be a gold standard in patient-specific plan generation [22]. This contrasts with the majority of KBP training approaches and AP comparative studies in the literature, which use manual plans generated with TAE [29,58,59]. Our ML models and study results are therefore not confounded by unwarranted variation or sub-optimality of plans within the training and validation datasets, which are known issues associated with TAE manual planning [53]. Across all three methodologies at a population level there was excellent correspondence with MCOgs, with all volume and dose metrics within ±0.66% and ±0.34 Gy, respectively. In terms of trade-off balancing, PGAP-MLreg and PGAPstd led to a marginal reduction in PTV48 D98% (0.17 and 0.28 Gy, respectively), resulting in a corresponding minor reduction in rectum V40.5Gy and V48.6Gy (0.3–0.4%). This was considered a clinically insignificant difference. No other trade-off differences were observed. In terms of individual patients, PGAP-MLclus and PGAPstd yielded plans with high correlation to the gold standard MCO-generated comparator (MCOgs). The correlation was weaker for PGAP-MLreg, which, as discussed, was attributed to the small training dataset size. Results provide strong evidence that PGAPstd (built on a PBAIO AP framework) generates individualised plans, even when a site-specific protocol is utilised. This is an important finding, not only validating the use of PGAPstd for prostate cancer, but also providing evidence that a posteriori multicriteria optimisation yields minimal benefits over AP in terms of the individualisation of patient plans. In terms of the utility of patient-specific protocols, whilst PGAP-MLclus and PGAP-MLreg did not yield marked improvements, anatomical variances were shown to be an important factor in the prediction of weights during training. For example, regression models yielded R2 values > 0.83, with reasonable MSE during LOOCV. This suggests ML may yield improvements over PGAPstd where larger anatomical variations cause the optimality of the PBAIO framework to break down, as has been demonstrated in the application of Pinnacle3 Auto-Planning for lung [28] and nasopharynx [29], where poor-quality planning was associated with anatomical outliers.
Whilst training and validating using MCOgs was a major strength of this work, the resource-intensive nature of generating these ground-truth plans constrained the size of the training dataset to 20 patients. This represents a key weakness in the approach, resulting in weak associations between training and validation MSE and, as discussed, the poor performance of PGAP-MLreg for outlier patients, where weights were generated via extrapolation. However, despite this weakness, agreement with MCOgs was very good across all methods. It was therefore considered that training and validating on small, high-quality datasets was preferable to using large, low-quality manually generated datasets, where variation in plan quality could lead to poor models and/or spurious validation results. To improve the efficacy of training on small datasets, a potential solution is to actively select a cohort of patients that suitably samples the extent of variation in the population (including outliers). This contrasts with the random selection approach taken in this work, which does not explicitly screen for outlier geometries to model on.
In terms of similar studies, the most relevant are those assessing the modelling performance of KBP solutions for prostate cancer. For DVH prediction using the commercial KBP system RapidPlan (Varian, Palo Alto), Cagni et al. [31] demonstrated that even when trained using a set of Pareto optimal plans, clinically relevant prediction errors were observed. Specifically, errors in mean dose of up to 6 Gy (7.7% of the prescribed dose of 78 Gy) and 5 Gy (6.4% of 78 Gy) were observed for rectum and bladder, respectively. In our study, rectum and bladder mean dose errors were <2.0 Gy (3.3% of 60 Gy) across all three methods. In terms of KBP via objective weight prediction, Boutilier et al. [8] presented a dosimetric assessment of logistic regression and k-nearest-neighbour models. Performance of the models was similar, with 95th-percentile errors in volume dose metrics of 1.5% and 3.5% for bladder V88% and V68%, respectively, and 2% and 4.5% for rectum V88% and V68%, respectively. In our study, the equivalent metrics were all ≤1.5% for both rectum and bladder. The performance of all three of our approaches is therefore considered very good in the context of previous work and highlights the effectiveness of the PBAIO framework in yielding bespoke plans, even without utilising ML for personalised protocols.
In this study, the absolute weights generated during MCOgs calibration were modelled, with each PG considered individually and its own optimal model defined. This made regression and clustering straightforward and helped to identify anatomical features that are important considerations when optimising a given trade-off. An intuitive alternative would be a multi-output ML technique, such as multi-output regression or deep learning, to predict all PG weights jointly and thereby capture their relative values. Such an approach could be more generalisable, as weights act relative to one another in plan optimisation. A further improvement would be to replicate these results with larger patient datasets, giving greater statistical power and minimising the discrepancies in model performance between the calibration and validation cohorts that were observed for PGAP-MLreg. Inclusion of more expert observers could lead to a definition of MCOgs with even better congruence with clinical preferences. Finally, repeating the study on a more heterogeneous patient dataset (e.g., head and neck cancer) may yield substantially different results. In this study, MCOgs and PGAPstd were unexpectedly highly aligned, meaning any potential benefit of ML was minimal. This may not be the case for clinical sites of increased complexity and heterogeneity.
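The multi-output alternative can be sketched directly, since scikit-learn's LinearRegression accepts a matrix of targets; the synthetic features and weights below are purely illustrative, not data from this study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical cohort: 3 geometric features, 2 PG weights per patient.
rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 3))
Y = np.column_stack([0.5 + 0.4 * X[:, 0],    # synthetic weight for PG 1
                     0.3 + 0.5 * X[:, 1]])   # synthetic weight for PG 2

# One model predicts all PG weights jointly rather than one model per PG.
model = LinearRegression().fit(X, Y)

w = model.predict(np.array([[0.6, 0.2, 0.4]]))[0]  # both weights at once
relative = w / w.sum()  # normalised, emphasising relative PG weighting
```

Predicting the weight vector jointly lets the model share information across PGs, at the cost of losing the per-PG model selection used in this work.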