*Article* **Framework for Evaluating Potential Causes of Health Risk Factors Using Average Treatment Effect and Uplift Modelling**

**Daniela Galatro 1, Rosario Trigo-Ferre 2, Allana Nakashook-Zettler 1, Vincenzo Costanzo-Alvarez 3,\*, Melanie Jeffrey 4, Maria Jacome 5, Jason Bazylak 3 and Cristina H. Amon 1,3**


**Abstract:** Acute myeloid leukemia (AML) is a type of blood cancer that affects both adults and children. Benzene exposure has been reported to increase the risk of developing AML in children. The assessment of the potential relationship between environmental benzene exposure and childhood AML has been documented in the literature using odds ratios and/or risk ratios, with data fitted to unconditional logistic regression. A common feature of studies relating environmental risk factors to health outcomes is the lack of proper analysis to evidence causation. Although statistical causal analysis is commonly used to determine causation by evaluating a distribution's parameters, it is challenging to infer causation in complex systems from single correlation coefficients. Machine learning (ML) approaches, based on causal pattern recognition, can provide an accurate alternative for modeling counterfactual scenarios. In this work, we propose a framework using the average treatment effect (ATE) and Uplift modeling to evidence causation when relating indoor and outdoor benzene exposure to childhood AML, effectively predicting causation for indoor exposure to this contaminant. An analysis of the assumptions, cross-validation, sample size, and interaction between predictors is also provided, guiding future work toward generalizing this approach for predicting health outcomes.

**Keywords:** acute myeloid leukemia; risk factors; average treatment effect; uplift modelling; machine learning; benzene

#### **1. Introduction**

Acute myeloid leukemia (AML) is a cancer of the myeloid line of blood cells; it starts in the blood stem cells and is characterized by rapid growth [1]. While AML is the most common type of leukemia in adults, it also affects children, and about 500 children are diagnosed with AML in the U.S. annually [2]. Childhood AML is most prevalent during the first two years of life and during adolescence. Epidemiological and genetic studies have confirmed that most infant leukemias develop in utero [3,4].

Chemical exposure to significant benzene concentrations has been reported as a possible cause of AML in occupationally exposed workers [5]. Benzene exposure has also been reported to increase the risk of developing AML in children. Most existing reports are retrospective case-control studies [6], which are inherently limited since benzene exposure is typically measured indirectly, introducing recall bias: parents of sick versus healthy children may recall exposures differently [5].

Moreover, some positive findings may be due to confounding factors instead, as other biases are added when specific segments of the population are under-represented in the study or control cohorts [6]. It has also been noted that exposure to various solvents and hydrocarbons increases the chance of developing childhood AML [7,8]. Both groups of chemicals fall into a vast range of toxicological profiles, benzene being the one of greatest concern.

**Citation:** Galatro, D.; Trigo-Ferre, R.; Nakashook-Zettler, A.; Costanzo-Alvarez, V.; Jeffrey, M.; Jacome, M.; Bazylak, J.; Amon, C.H. Framework for Evaluating Potential Causes of Health Risk Factors Using Average Treatment Effect and Uplift Modelling. *Algorithms* **2023**, *16*, 166. https://doi.org/10.3390/a16030166

Academic Editors: Xiang Zhang and Xiaoxiao Li

Received: 1 March 2023 Revised: 15 March 2023 Accepted: 18 March 2023 Published: 19 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In the following subsections, we briefly review traditional approaches to establishing the relationship between benzene exposure and AML, as well as current trends in estimating causation for health outcomes using machine learning methods.

#### *1.1. Traditional Approaches to Establishing the Relationship between Benzene Exposure and AML*

The Bradford Hill criteria [9,10] have been extensively used to evaluate causation when human epidemiologic relationships are found between exposure to a contaminant such as benzene and a disease outcome, such as AML or other hematopoietic and pulmonary diseases [10]. Despite its extensive use, this method is still debated among epidemiologists, who question, among other arguments, its scope of application and the possibility of ruling out causality in some specific scenarios.

The assessment of the potential relationship between environmental benzene exposure and childhood AML includes studying exposures prior to conception, during pregnancy, or while breastfeeding [5]. Parental exposure to benzene, for instance, has been studied as a potential risk factor for infant AML, with conflicting results among researchers. Kaatsch et al. [11] and Shu et al. [12] did not find any association between parental exposure and childhood AML. On the other hand, Buckley et al. [13] and Magnani et al. [14] reported elevated rates of childhood AML associated with benzene, solvents, and paternal petroleum occupational exposure (prenatal).

Household exposures to benzene have also been investigated, showing no increases in childhood AML related to home use of solvents [15,16]. Nevertheless, cigarette smoke, the main non-occupational source of benzene, is associated with an increased risk of developing this disease through parental smoking during pregnancy, even at low-level exposure to benzene [17,18].

Other exposure surrogates for benzene as an air pollutant are traffic density and proximity to chemical plants, refineries, and gas stations. While several air pollutants are typically present in different concentrations, their risks are ranked by odds ratios. Nordlinder and Järvholm [19] observed an increase in childhood AML related to car densities greater than 20 cars/km², possibly attributable to benzene in the gasoline. Reynolds [20] reported correlations between traffic density and benzene and butadiene air concentrations; the corresponding odds ratios for children reveal a relationship with the incidence rate of childhood AML. Similarly, Crosignani et al. [21] reported an association between elevated rates of childhood AML and traffic density, attributable to high benzene exposure. Steffen et al. [22] reported an increased risk of this disease in proximity to gas stations, while Harrison et al. [23] found no association. Similarly opposing results have been reported when analyzing the risks of nearby petrochemical plants [5].

The discrepancies found in the literature may be due to the method of estimating exposures, potential confounding from correlated pollutants, and the intrinsically anisotropic nature of the control volumes defined for the studies. For instance, air measurements in outdoor volumes show considerable variability in space and time compared to values reported for indoor control volumes. Benzene has undoubtedly been classified as a human carcinogen causing AML in adults exposed to relatively high concentrations at work. Nevertheless, it is not clear whether low concentrations of this compound in outdoor air cause childhood AML, although some studies reveal some association [24].

Previous studies involving benzene exposure and childhood AML statistically reported their results as odds ratios (ORs) and/or risk ratios (RRs). The OR is the ratio of the odds of an event in one group (for example, the exposed group) to its odds in the other group (for example, the nonexposed group); the RR is the ratio of the risk of an event in one group to the risk of the event in the other group. An OR or RR greater than 1.0 indicates an increase in odds or risk among the exposed group compared to the unexposed one. Conditional and unconditional logistic regressions are typically used to fit the data of these studies, hence estimating the ORs and RRs [25–27]. Matching is used in case-control studies to adjust for confounding, ensuring that regression is possible when there is not enough overlap in confounding variables between the cases and a set of controls. Age, sex, and race are typical confounders suggested by descriptive epidemiology [28]. Since the distribution of these variables may differ between cases and controls, matching is commonly used to select cases and controls with similar distributions of these variables. Unmatched case-control studies, on the other hand, are typically analyzed using unconditional logistic regression (ULR), which yields robust estimates and is effective when there are few confounders to adjust for [23,27]. ULR involves producing exposure-disease strata for each level of the confounder and then producing an average effect across the data. This method is also effective when there are no problems with sparse data, no loss of validity, and a potential increase in precision [29].
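To make these definitions concrete, the short R sketch below computes an OR and an RR from a hypothetical 2 × 2 exposure-disease table; the counts are invented for illustration and are not taken from any of the cited studies.

```r
# Hypothetical 2x2 exposure-disease table (counts invented for illustration)
cases_exp <- 30; controls_exp <- 970      # exposed group
cases_unexp <- 16; controls_unexp <- 984  # unexposed group

# Odds ratio: odds of disease in the exposed vs. unexposed group
OR <- (cases_exp / controls_exp) / (cases_unexp / controls_unexp)

# Risk ratio: risk of disease in the exposed vs. unexposed group
RR <- (cases_exp / (cases_exp + controls_exp)) /
      (cases_unexp / (cases_unexp + controls_unexp))

c(OR = OR, RR = RR)  # values > 1 indicate higher odds/risk among the exposed
```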

ULR models the event's log-odds as a linear combination of one or more independent factors, predicting the likelihood that the event will occur [30]. Similar to other regressions, this method shows whether there is a relationship between these variables. Nevertheless, such a relationship does not always imply causation; in other words, regression does not determine causation. In statistics, causal analysis goes one step further than standard statistical analysis, which assesses the parameters of a distribution: it aims to infer probabilities under changing conditions and to understand the actual effect of a specific phenomenon happening in a system.

#### *1.2. Current Trends in Estimating Causation for Health Outcomes Using Machine Learning Methods*

Causal analysis can sometimes be inaccurate because, in complex systems, it is difficult to make causal arguments based on single correlation coefficients. To overcome this challenge, recent machine learning (ML) approaches based on causal pattern recognition have been developed, providing robustness in accurately modeling counterfactual scenarios [31].

ML methods have been applied in the health sciences for causal inference [32]. For counterfactual prediction, ML has been used to address causal questions with methods such as Random Forests [33] and Bayesian additive regression trees (BART) [34]. Common research goals that can be tackled with these techniques include [30] (i) the evaluation of potential causes of health outcomes, (ii) the assessment of treatment options, and (iii) the assessment of bias in the statistical analysis. Some ML predictors have proved effective at causally inferring influences on a health outcome while controlling for confounders. One study, for instance, showed that the targeted maximum likelihood estimation (TMLE) technique outperformed traditional models when analyzing the effect of the nutrition of pregnant women (fruit density) on birth outcomes [35]. Although ML is a promising alternative for causal analysis, identifying high-level causal variables from low-level data with these techniques remains a goal for the foreseeable future, as causal modeling approaches such as meta-learning and meta-modeling mature to the point of finding causal relationships accurately. Moreover, exposure to complex mixtures of chemical contaminants causing adverse health outcomes can have additive or synergistic effects, posing a challenge where the strength of the ML approach could be combined with existing human data to infer causality [36].

Uplift modeling has been used by several companies to estimate the effect of an action on customer outcomes [37]. Estimating customer uplift is a causal inference problem, since it requires determining the difference between two outcomes that are mutually exclusive for an individual (counterfactual nature); this is carried out using randomized experiments with a treatment group and a control group. At the same time, uplift estimation is also a machine learning problem, because different models must be trained to finally select the one that yields the most reliable prediction, requiring a sensible cross-validation process [38]. The combination of these features from both approaches, causal inference and machine learning, makes Uplift modeling a suitable alternative for determining causation. To our knowledge, Uplift modeling has not been applied to the health sciences. Uplift modeling might become unstable when predicting causation, as with any other ML technique, as the sample size decreases. Moreover, its convergence might be affected, since it can depend on variables not typically used in response models.

In this work, we propose a novel framework for evaluating potential causes of health outcomes using the average treatment effect (ATE) to compare the effects in our computational experiments, and Uplift modeling as a machine learning technique employed for cross-validation of the ATEs. This simple approach effectively estimates the causation of health outcomes using two levels of confirmation, integrating causal inference and machine learning features. Our case study concerns the causation of benzene exposure (as an air pollutant) to childhood AML, analyzing the counterfactual nature of the OR-based relationships found indoors and outdoors. The case study selected for our framework was presented by Heck et al. [25] and includes childhood AML data collected from California birth records over 17 years. Health outcomes based solely on relationships between variables can be challenged when the outcome is not proven to result from the occurrence of the other event(s). This paper aims to describe the development and use of this framework and to discuss its significance as a reliable causation framework that can potentially be used to prevent, monitor, and treat diseases.

#### **2. Methods**

The methodological workflow (Figure 1) can be applied as a framework for any study involving the causation of a pollutant exposure to the risk of developing a disease through machine learning techniques and the analysis of the counterfactual nature of the OR-based relationships.

**Figure 1.** Framework for causation analysis of a pollutant exposure to the risk of developing a disease using machine learning techniques.

The proposed ATE-Uplift framework is depicted in Figure 1, where the data is collected either from an existing database or by obtaining simulated data from a regression model. Typically, health outcome studies in the literature comprise strata of data that are fitted using logistic regressions, as discussed in Section 2.1. Once the data is collected, our causation analysis includes two stages: (i) estimating the ATEs and (ii) cross-validating the previous estimation using Uplift modeling. This two-level strategy, based on cross-validation and integrating causal inference and machine learning, is crucial to support the reliability of our framework, since the consequences of determining causation can have several implications for the prevention, monitoring, and treatment of diseases. Ultimately, our framework provides two integrated causation indicators, the ATE values and the Qini coefficients. The results from these stages allow researchers to conclude on the causation of pollutant exposure to the risk of developing a disease.

#### *2.1. Case Study*

The data for the case study selected in this work was presented by Heck et al. [25]. In their work, they ascertained 46 cases of AML from the California Cancer Registry records of children younger than six years, and 19,209 controls from California birth records, between 1990 and 2007 and within 6 km of air monitoring stations. Risks of developing AML were reported as ORs, including those associated with exposure to benzene, butadiene, toluene, and other air pollutants, concluding that there is an increased risk of AML when exposed to these air pollutants. A 1:M matched case-control study was performed using unconditional logistic regression (ULR), adjusted for the year of birth as a matching variable, with 1:M = 1:20, to compare several characteristics of cases and controls, primarily demographics (socioeconomic status, race, birthplace, and parity). Then, ULR was employed to estimate the risk of AML from each interquartile range increase in air toxic exposure for the different pollutants separately. The data might require evaluating the correlations between pollutants. Heck et al. [25] recommended performing a factor analysis with varimax rotation to group highly correlated pollutants based on eigenvalues greater than 1. We have selected the OR related to exposure to benzene, which reveals a relationship between this exposure and the risk of developing childhood AML.

#### *2.2. Data Generation and Preparation for Uplift*

This section summarizes the process used to generate data from the 1:M matched case-control study fitted with ULR [25], as the original numerical and categorical data of the study was unavailable, and its consequent preparation for ATE-Uplift, including factuals and counterfactuals. The data generation involves drawing samples from the existing ULR model by Heck et al. [25].

Let *y* be the case-control status, where *y* = 1 for a case and *y* = 0 for a control. Let **x**<sub>m</sub> = {*x*<sub>m1</sub>, *x*<sub>m2</sub>} be a vector of matching variables, *x*<sub>e</sub> an exposure associated with the case-control status, and **x**<sub>0</sub> a vector of unmatched dummy variables corresponding to the strata of the matched variables. The ULR model, assuming no interaction between the predictors, is given by,

$$\operatorname{logit}(\pi) = \beta_e x_e + \boldsymbol{\beta}_m^T \mathbf{x}_m + \boldsymbol{\beta}_0^T \mathbf{x}_0 \tag{1}$$

where *π* is the probability of developing the disease, the *β*'s are the regression coefficients (in Equation (1), **β**<sub>0</sub> is a vector of stratum intercepts), and the superscript *T* denotes transpose.
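As a minimal illustration of Equation (1), the R sketch below simulates case-control data from a logistic model and recovers the exposure OR with a ULR fit; all coefficients are assumed values for demonstration, not the estimates of Heck et al. [25].

```r
set.seed(1)
n  <- 5000
xe <- rnorm(n)                            # exposure (standardized, hypothetical)
xm <- matrix(rnorm(2 * n), ncol = 2)      # two matching variables
be <- 0.4; bm <- c(0.2, -0.1); b0 <- -3   # assumed coefficients
p  <- plogis(b0 + be * xe + xm %*% bm)    # invert logit(pi)
y  <- rbinom(n, 1, p)                     # case (1) / control (0) status

fit <- glm(y ~ xe + xm, family = binomial)  # unconditional logistic regression
exp(coef(fit)["xe"])                        # estimated OR per unit of exposure
```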

The input data for the ATE model must be generated as a table of factual and counterfactual cases. 'Factuals' are persons in the factual universe (where a 'zero' value is assigned to benzene, or benzene = 0), and 'counterfactuals' are persons in a counterfactual universe (where a 'one' value is assigned to benzene, or benzene = 1). The table is generated from the expected propensity of AML conditional on the inputs, which is predicted and converted into a binary variable, as shown in Table 1.


**Table 1.** Factuals and counterfactuals input for the ATE model.

In Table 1, person 1 and person 61 are theoretically identical persons, except for having different binary exposures to benzene and potentially different binary AML outcomes. Similarly, and with the same proviso, person 2 and person 62 are identical, person 3 and person 63 are identical, and so on.
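A hedged R sketch of how such a table can be assembled is shown below; the covariates, coefficients, and sample size are placeholders, and the BINaml column is drawn from the projected logit as described in the text.

```r
set.seed(2)
n <- 60                                    # number of factual persons (placeholder)
covars <- data.frame(ses = sample(1:5, n, replace = TRUE),  # hypothetical
                     parity = rpois(n, 1))                  # predictors

tab <- rbind(
  data.frame(person = 1:n,           benzene = 0, covars),  # factual universe
  data.frame(person = (n + 1):(2*n), benzene = 1, covars)   # counterfactuals
)

b0 <- -6; be <- 0.5; b_ses <- 0.2          # assumed coefficients, not from [25]
p  <- plogis(b0 + be * tab$benzene + b_ses * tab$ses)
tab$BINaml <- rbinom(nrow(tab), 1, p)      # projected binary AML outcome
```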

We used simulated data from the ULR of the case study; the simulation assumes no interaction between benzene exposure and the rest of the predictors (socioeconomic status, race, birthplace, and parity); hence, Simpson's paradox is ruled out. This paradox emerges when groups of data show one trend that reverses when the groups are combined. The BINaml column of our factual and counterfactuals table (the presence or absence of AML) is generated by projecting the target variable of Equation (1), or of its substitute, Equation (2), that is, logit(*π*(**x**)) of AML, and converting it into the associated binary AML variable.

$$\operatorname{logit}(\pi(\mathbf{x})) = \beta_0 + \beta_e x_e + \boldsymbol{\beta}_m^T \mathbf{x}_m \tag{2}$$

This projection requires Equation (1) to be completely specified, including the intercepts, which were omitted by Heck et al. [25], as is usual in matched case-control studies. Therefore, we substitute Equation (1) with Equation (2) to estimate the intercept, where *β*<sub>0</sub> is a single number rather than a vector, while *β*<sub>e</sub> and **β**<sub>m</sub><sup>T</sup> remain the same [25]. We estimate *β*<sub>0</sub> using Equation (2) following Haneuse et al.'s [38] optimization approach, using an estimated overall prevalence of AML for the years 1990–2007 and the simulated benzene exposure data. As input to this projection, we use two levels of benzene exposure: the minimum and maximum values of the benzene distribution. Having projected the binary AML variable, we inserted it into the BINaml column of the factual and counterfactuals table, along with the corresponding benzene exposures and other predictor values. Finally, we generated the factual and counterfactuals table by converting the benzene exposures from numerical to categorical (i.e., from a maximum value to 1 and from a minimum value to 0). We carry out this last conversion following the approach by Lebel [39] and the California Office of Environmental Health Hazard Assessment's reference exposure level for benzene as 0.940 ppm (0.903 ppb) [40].
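In the spirit of the optimization step borrowed from Haneuse et al. [38], the sketch below finds the intercept that matches an assumed overall prevalence; the target prevalence and the exposure log-OR are illustrative values, not those of the original study.

```r
set.seed(3)
P_target <- 46 / 19255                    # assumed overall AML prevalence
benzene  <- rnorm(10000, 1.268, 0.830)    # simulated exposure (Table 2 scale)
be       <- log(1.5)                      # assumed exposure log-OR

# beta0 solves: mean predicted probability == target prevalence
f     <- function(b0) mean(plogis(b0 + be * benzene)) - P_target
beta0 <- uniroot(f, interval = c(-20, 0))$root
```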

#### *2.3. Distribution of Benzene Exposure*

Heck et al. [25] provide a description of the distribution of outdoor benzene exposure, reproduced in Table 2.

**Table 2.** Outdoor benzene distribution [25].


\* ppbv is parts per billion by volume.

For our benzene exposure data simulation, we use a normal distribution centered at 1.268 with a standard deviation of 0.830, a truncated minimum of 0.151, and a truncated maximum of 4.600. Heck et al. [25] calculate the risk of AML associated with one interquartile range increase in benzene exposure; we normalize the benzene distribution parameters (mean, standard deviation, minimum, and maximum) by the IQR = 1.197 as well.
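A minimal base-R sketch of this truncated-normal simulation (by rejection sampling; dedicated packages such as truncnorm offer the same functionality directly) is:

```r
# Draw benzene exposures from a normal distribution truncated to [lo, hi]
rtnorm <- function(n, mean, sd, lo, hi) {
  out <- numeric(0)
  while (length(out) < n) {                 # rejection sampling
    x <- rnorm(n, mean, sd)
    out <- c(out, x[x >= lo & x <= hi])
  }
  out[1:n]
}
benzene <- rtnorm(10000, 1.268, 0.830, 0.151, 4.600) / 1.197  # IQR-normalized
```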

As for the observed prevalence of pediatric AML in California, we use an incidence of 46/19,255 (the observed ratio of cases to the total number of subjects) [25], reflecting their 1:20 matching.

In addition to the outdoor distribution, we also provide a causation analysis using an indoor benzene exposure distribution from locations near garages (Table 3) [41].

**Table 3.** Indoor benzene distribution [41].


\* Assumed the same as the outdoor distribution.

We normalized the parameters by the IQR of the outdoor benzene exposure to be able to compare the results using two different distributions.

#### *2.4. Causation between Exposure to Benzene and Risk of Developing AML*

We estimated the causal effects of the benzene exposure to AML using the average treatment effect and typical validation indicators obtained from Uplift modeling.

The average treatment effect (*ATE*) is a special case of an average partial effect for a binary explanatory variable. It is used to compare the effects in our randomized computational experiments. The *ATE* measures the difference in mean outcomes between units assigned to the study and units assigned to the control and is given by,

$$ATE = \frac{1}{N} \sum_{i} \left( y_{1(i)} - y_{0(i)} \right) \tag{3}$$

where the treatment effect for individual *i* is given by *y*1(*i*) − *y*0(*i*) = *β*(*i*); the summation occurs over all N individuals in the population.
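Continuing the table sketch of Section 2.2, the ATE reduces to one line of R, since factuals and counterfactuals are paired by construction (the column names follow our earlier sketch and are illustrative):

```r
y1  <- tab$BINaml[tab$benzene == 1]   # outcomes in the counterfactual universe
y0  <- tab$BINaml[tab$benzene == 0]   # outcomes in the factual universe
ATE <- mean(y1 - y0)                  # Equation (3); persons are paired in order
```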

The value of the potential outcome *y*(*i*) must not be affected by how the treatment and exposure are assigned among all other individuals, according to the stable unit treatment value assumption required for the estimation of the ATE. Ignorability based on strata must be assumed, or monotonicity instead, which denotes the absence of defiers in the population. If the experiment experiences non-compliance, the ATE can no longer be recovered; instead, a local ATE can be obtained as the average treatment effect for a particular subpopulation, limiting its extrapolation and, hence, the generalization of the causal inference. Ignorability and monotonicity are difficult assumptions to verify; once the data is available, the foundations of the design of the experiment might reveal signals of homogeneity across groups that support or contradict them, for which further data analysis is required. A stronger argument can be made for Uplift modeling, as it provides a solution for isolating effects. Uplift models the difference between conditional response probabilities in the treatment and control groups, clearly identifying groups of individuals on which an action or intervention will have the most 'positive effect'. A binary outcome is assumed for Uplift modeling, aligned with the odds ratio, as per the nature of the problem described in this work.

There are various Uplift modeling approaches. The Two-Model approach models the uplift from the response probabilities that differ between the study and control groups; these probabilities are calculated separately for each group, leading to a methodology based on two models. In Lo's approach (Figure 2), a single logistic regression is learned that includes a dummy treatment variable among the independent variables; the predicted probabilities are nevertheless calculated for both groups. For this calculation, the dummy treatment variable in the test dataset is set to 0 for the control group and 1 for the treatment group [42].

**Figure 2.** Uplift Two-model Approach.
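A self-contained R sketch of both estimators on simulated data is given below; the variable names and coefficients are illustrative.

```r
set.seed(4)
n    <- 2000
df   <- data.frame(x1 = rnorm(n), treat = rbinom(n, 1, 0.5))
df$y <- rbinom(n, 1, plogis(-1 + 0.5 * df$x1 + 0.7 * df$treat))
test <- df[1:500, ]

# Two-Model approach: one logistic model per group, uplift = difference
fit_t <- glm(y ~ x1, family = binomial, data = subset(df, treat == 1))
fit_c <- glm(y ~ x1, family = binomial, data = subset(df, treat == 0))
uplift_tm <- predict(fit_t, test, type = "response") -
             predict(fit_c, test, type = "response")

# Lo's approach: one model with a treatment dummy and its interactions,
# scored twice with the dummy set to 1 and then to 0
fit_lo <- glm(y ~ x1 * treat, family = binomial, data = df)
uplift_lo <- predict(fit_lo, transform(test, treat = 1), type = "response") -
             predict(fit_lo, transform(test, treat = 0), type = "response")
```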

The Uplift model validation is performed by selecting an appropriate cost function measuring the difference between the actual and predicted values of the response variable. In economics, the Gini coefficient is used to measure the model's goodness-of-fit. It is typically plotted to show the Lorenz curve where the predicted scores of the targeted observations are sorted in decreasing order. The extension of this curve and the Gini coefficient for Uplift modeling is called the Qini curve [43]. Figure 3 shows a typical Qini curve for an application in Econometrics.

**Figure 3.** Typical Qini curve for Uplift model evaluation, used in Econometrics.

The *y*-axis shows the cumulative incremental gains, and the *x*-axis the proportion of the population targeted. The plot contains an uplift curve and a random-targeting curve, each computed segment by segment. The Qini coefficient is the difference between the areas under these curves. A positive Qini coefficient represents good model performance, while a value approaching zero represents poorer performance.
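The sketch below computes a Qini coefficient directly from this definition, as the (discretized) area between the incremental-gains curve and the random-targeting line; it is our own minimal implementation for illustration, not the package routine used later.

```r
qini_coef <- function(y, t, score) {
  o  <- order(score, decreasing = TRUE)    # sort by predicted uplift
  y  <- y[o]; t <- t[o]
  Rt <- cumsum(y * t);  Rc <- cumsum(y * (1 - t))     # responders by group
  Nt <- cumsum(t);      Nc <- pmax(cumsum(1 - t), 1)  # group sizes, guarded
  gain   <- Rt - Rc * Nt / Nc                         # incremental gains curve
  random <- seq_along(y) / length(y) * tail(gain, 1)  # random-targeting line
  mean(gain - random)                                 # area between the curves
}
```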

#### *2.5. Causation Codes*

The Uplift data generation code (*datagen.rmd*) uses the *β*'s provided by Heck et al. [25] and the indoor and outdoor benzene distributions to find the corrected intercept *β*<sub>0</sub> via optimization. This code, developed in R, is a modification of the program presented by Haneuse et al. [38]. The main results obtained include the ATE and the data table prepared as input for Uplift modeling.

We also developed an R code (*uplift.R*) for the Uplift modeling, using the library *tools4uplift*. The data is split into training and validation sets for further fitting with the baseline two-model estimator through the function *DualUplift*. The results are interpreted from the Qini area, computed by applying *QiniArea* to the output of *PerformanceUplift*.
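A hedged sketch of this workflow is shown below; the *tools4uplift* calls follow our reading of the package documentation and their argument order should be verified, while the column names are illustrative.

```r
library(tools4uplift)

# 70/30 split stratified on treatment and outcome (assumed SplitUplift call)
split <- SplitUplift(tab, 0.7, c("benzene", "BINaml"))
train <- split[[1]]; valid <- split[[2]]

# Two-model estimator: treatment, outcome, and predictor columns (illustrative)
fit <- DualUplift(train, "benzene", "BINaml", predictors = c("ses", "parity"))
valid$pred <- predict(fit, valid)               # inferred uplift

perf <- PerformanceUplift(valid, "benzene", "BINaml", "pred")
QiniArea(perf)                                  # positive area supports causation
```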

Both codes, *datagen.rmd* and *uplift.R*, are available online at https://github.com/CHE408UofT/AML_Uplift (accessed on 18 March 2023) [44].

#### *2.6. Results Interpretation*

ATE values equal to zero reveal no causation, while values different from zero reveal causation between exposure to benzene and the risk of developing childhood AML. We examined two scenarios, including indoor and outdoor benzene distributions, looking to find causation indoors and/or outdoors; therefore, we report ATE values for both scenarios.

Regarding the Uplift modeling, the goodness-of-fit of the model is evaluated using the *QiniArea* function in R, which computes the area under the Qini curve. A positive value attests to a good performance of the model, while a value near 0 shows a worse performance.

#### **3. Results**

This section presents the results of using our ATE-Uplift framework to estimate the causation of benzene exposure (as an air pollutant) to childhood AML, through the analysis of the counterfactual nature of the OR-based relationships found indoors and outdoors.

#### *3.1. Indoor Benzene Distribution*

The results of generating a set of factual/counterfactual tables for the indoor benzene distribution to estimate their ATEs show an average ATE of 0.203. This positive value reveals causation between indoor exposure to benzene and the risk of developing AML in early childhood.

When analyzing the Uplift modeling results, we look at a variant of the Qini curve, representing the incremental uplift as a function of the proportion of the population targeted. Incremental uplift measures whether an event would not have occurred without a specific interaction; hence, positive Qini values attest to the goodness-of-fit of the Uplift model. Figure 4 shows a typical Qini curve for the indoor benzene distribution scenario.

The data is first partitioned into subsets that keep the same distribution of treated versus non-treated and responders versus non-responders. The training uses the function *DualUplift*, which fits the data with the two-model estimator (logistic regression), splitting the data into 70% for training and 30% for validation. The first element of the *DualUplift* output is the baseline model fitted for non-treated individuals, and the second is the model fitted for treated individuals. Using the two-model estimator, a baseline model is fitted for comparison purposes. Using the validation set, the function *predict* infers the uplift. Finally, to evaluate the quality of the baseline model, we plot the Qini curve, as shown in Figure 4. The Qini coefficient is a single index of the Uplift model. The *x*-axis represents the fraction of targeted individuals, and the *y*-axis represents the incremental number of positive responses relative to the total number of targeted individuals. The straight line between the origin and (100, *y*-max) in Figure 4 represents a benchmark comparing the model's performance to a strategy that randomly targets subjects, as explained before. In our case, the Qini coefficient is positive (with a value of 0.14) and outperforms random targeting. The uplift percentages are detailed in Figure 5; the observed uplifts are ordered from highest to lowest, reinforcing the goodness-of-fit of the Uplift modeling approach.

**Figure 4.** Qini curve for Uplift modeling for the validation data—indoor benzene distribution.

**Figure 5.** Uplift versus proportion of population targeted.

Our combined ATE/Uplift modeling framework allows us to conclude that there is causation of benzene exposure (as an air pollutant) to AML in early childhood, analyzing the counterfactual nature of the OR-based relationship found indoors for the data simulated from Heck et al. [25] and the benzene exposure distribution considered [41].

#### *3.2. Outdoor Benzene Distribution*

The results of generating a set of factual/counterfactual tables for the outdoor benzene distribution to estimate their ATEs show an average value of zero, which indicates no causation between outdoor exposure to benzene and the risk of developing AML in early childhood. The incremental uplift, in this case, is also zero at any proportion of the targeted population.

#### **4. Discussion**

The ATE-Uplift framework effectively predicts causation in health outcomes when the data is fitted using ULR. Uplift modeling acts as a cross-validation technique for the ATE estimation, integrating its causal inference and machine learning features. Moreover, Uplift modeling ensures the goodness-of-fit of the regression. Thus, categorical factual/counterfactual tables generated from the original data are input to obtain ATEs, with positive values revealing causation. Then, Uplift modeling is employed, where positive Qini coefficients confirm causation.

For the data considered by Heck et al. [25], their corresponding ULR fitting, and the distributions of benzene exposure [41], our framework was useful in revealing causation between indoor exposure to benzene and the risk of developing childhood AML. In contrast, no causation was revealed when considering outdoor exposure. Nevertheless, we observed that, for this scenario, low ATE values (<0.05) near zero are somewhat related to inconsistent Uplift modeling results, as the corresponding Qini coefficients also show low values (<0.03) different from zero. In these instances, we recommend using a third cross-validation method when reporting low ATE values.

While our preference is to have the original data, to perform exploratory data analysis (EDA), evaluate the interaction between predictors, and consequently preselect (and later compare) applicable ML methods for estimating causation, in this work we cautiously tested ATE-Uplift on generated data with one predictor, benzene, since ATE and Uplift modeling are complementary methods for binary outcomes, working with factuals and counterfactuals to estimate causality. ML methods, although promising, must be carefully supported by a detailed EDA and a sample size evaluation. We noticed, for instance, that inconsistent Uplift modeling results are obtained when decreasing the sample size, an effect that must be minimized, as with most ML predictors. Our future work will add to our framework an EDA that will allow us to observe trends among and within data groups, potentially fit and compare different surrogate models to establish the 'best' relationship between variables, and, ultimately, perform a meta-analysis and comparison between causation-based ML methods. Techniques to be explored in future work include meta-analysis and machine learning tools such as meta-learning causal structures and causal Bayesian networks.

It is important to note that the reliability of the results obtained through our ATE-Uplift framework depends largely on the completeness of the data used and on how well-defined the assumptions are. Thus, once the numerical and categorical data is available, it is recommended to perform data analysis to verify whether the main assumptions hold, such as the absence of defiers in the population required by the ATE model, and, therefore, to frame the limitations of the model's extrapolation.

Finally, we have assumed that there is no interaction between benzene exposure and the rest of the predictors. Such interactions will be considered in future work, adapting our framework once the data is available. ORs might be readjusted, and correlations between pollutants might be revisited to more accurately represent the interaction between the strata predictors and benzene exposure.

Our research's scientific and practical novelty lies in simply and effectively estimating the causation of health outcomes using two levels of confirmation, or cross-validation: ATE values and Uplift modeling, integrating causal inference and machine learning features. Causation is, per se, the main goal when evaluating health outcomes, since confirmed relationships between variables might not indicate that the outcome is indeed the result of the occurrence of the other event(s). A reliable causation approach may lead to effective disease prevention, monitoring, and treatment.

#### **5. Conclusions**

In this work, we have presented the ATE-Uplift framework to predict causation in health outcomes. Estimating ATE values and Uplift modeling together effectively integrate causal inference and machine learning capabilities. We tested our framework by estimating the causation between benzene exposure and AML, which was verified when considering indoor exposure to this air pollutant. Causation is confirmed with two indicators: an ATE value different from zero and positive Qini coefficients. Further considerations to validate and universalize this approach include an exhaustive exploratory data analysis (EDA) to observe trends that might confirm the assumptions for its applicability, the analysis of the interaction between predictors, and the comparison with emerging ML methods being evaluated for causation.

**Author Contributions:** Conceptualization and visualization: D.G., R.T.-F., V.C.-A. and C.H.A. Data collection and methodology: D.G., R.T.-F. and A.N.-Z. Software: D.G. and R.T.-F. Formal analysis: D.G., R.T.-F. and V.C.-A. Validation: A.N.-Z., M.J. (Melanie Jeffrey), M.J. (Maria Jacome), J.B. and C.H.A. Computing resources: D.G., R.T.-F. and C.H.A. Writing—original draft: D.G. and R.T.-F. Preparation: D.G., R.T.-F. and A.N.-Z. Writing—review and editing: D.G., R.T.-F., V.C.-A., M.J. (Melanie Jeffrey), M.J. (Maria Jacome), J.B. and C.H.A. Supervision: D.G., R.T.-F., V.C.-A. and C.H.A.; funding acquisition: C.H.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by a Healthy Cities Implementation Science Team Grant (LOI) 202110LT5 from the Canadian Institutes of Health Research.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We are grateful to Julia E. Heck from the UCLA Fielding School of Public Health, Department of Epidemiology, for the information and support provided throughout this research.

**Conflicts of Interest:** The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **Nearest Neighbours Graph Variational AutoEncoder**

**Lorenzo Arsini 1,2, Barbara Caccia 3, Andrea Ciardiello 2,\*, Stefano Giagu 1,2,\* and Carlo Mancini Terracciano 1,2**


**Abstract:** Graphs are versatile structures for the representation of many real-world data. Deep Learning on graphs is currently able to solve a wide range of problems with excellent results. However, both the generation of graphs and the handling of large graphs remain open challenges. This work aims to introduce techniques for generating large graphs and to test the approach on a complex problem, the calculation of dose distributions in oncological radiotherapy applications. To this end, we introduce a pooling technique (ReNN-Pool) capable of sampling nodes in a spatially uniform way with no additional computational requirements in either model training or inference. By construction, the ReNN-Pool also allows the definition of a symmetric un-pooling operation to recover the original dimensionality of the graphs. We also present a Variational AutoEncoder (VAE) for generating graphs, based on the defined pooling and un-pooling operations, which employs convolutional graph layers in both the encoding and decoding phases. The performance of the model was tested on both a realistic use case, a cylindrical graph dataset for a radiotherapy application, and the standard benchmark sprite dataset. Compared to other graph pooling techniques, ReNN-Pool proved to improve both performance and computational requirements.

**Keywords:** graph neural network; variational autoencoder; pooling; nearest neighbours

#### **1. Introduction**

Deep Generative Modeling involves training a deep neural network to approximate the high-dimensional probability distribution of the training data, enabling the generation of new examples from that distribution. There are various approaches to deep generative modeling, such as Generative Adversarial Networks (GANs) [1], Variational AutoEncoders (VAEs) [2], and Normalizing Flow models [3]. A comprehensive review of deep generative modeling can be found in [4].

In some cases, the architecture includes both an encoding and a decoding scheme, as is the case with models such as VAEs. The encoding process is typically used to obtain compact representations of the data distribution, often achieved through pooling operations that reduce the dimensionality of the data. The creation of a "bottleneck" by embedding the input samples in lower and lower dimensional spaces enables the extraction of essential features of the data. The decoding scheme, on the other hand, often employs reverse techniques used in the encoding, such as un-pooling operations, to restore the original dimensionality of the data.

In recent years, there has been a growing interest in utilizing Deep Learning in the field of Medical Physics. Specifically, in Radiotherapy (RT), Deep Generative modeling presents a valuable opportunity to streamline the calculation of dose distributions deposited by radiation in a given medium. These data, which are currently computed using more resource-intensive methods, can be utilized to optimize and validate RT treatment plans. Little effort has been made so far in this area, except for the works of [5,6]. Models for this application should possess two key properties. Firstly, a high resolution in dose prediction is crucial and requires the ability to process large data efficiently. Secondly, models should be lightweight to enable their use in resource-constrained medical devices and for online training. In the future, real-time imaging may enable real-time treatment planning optimization, making it imperative to use fast Deep Learning models during both training and inference.

**Citation:** Arsini, L.; Caccia, B.; Ciardiello, A.; Giagu, S.; Mancini Terracciano, C. Nearest Neighbours Graph Variational AutoEncoder. *Algorithms* **2023**, *16*, 143. https://doi.org/10.3390/a16030143

Academic Editors: Xiang Zhang and Xiaoxiao Li

Received: 20 December 2022 Revised: 7 February 2023 Accepted: 7 February 2023 Published: 6 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Deep Learning applications can find a role in specialised hardware for both resource-efficient deployment and fast inference. These models, referred to as lightweight models, are designed to be small and efficient, making them well-suited for use on resource-constrained devices. Notable applications include embedded software in Internet of Things (IoT) devices [7], wearable medical devices [8], and real-time applications such as online video analysis, e.g., online crowd control [9]. A similar need is present for models developed for fast inference on accelerated hardware, a key example being trigger analysis in high-energy physics experiments [10]. These models are typically smaller and less complex than traditional deep learning models, allowing them to run on devices with limited computational power and memory. Moreover, designing them often involves trade-offs between computational resources and performance.

Deep Generative Modeling is commonly based on Convolutional Neural Networks (CNNs) for many applications, as they are one of the most powerful tools for processing grid-like data in Deep Learning frameworks. However, a significant amount of real-world data is better described by more flexible data structures, such as graphs.

A graph is defined by a pair *G* = (*V*, *E*), where *V* = {*v<sub>i</sub>*}<sup>*N*</sup><sub>*i*=1</sub> is a set of *N* vertices, or nodes, and *E* = {*e<sub>ij</sub>*}<sup>*N*</sup><sub>*i*,*j*=1</sub> is the set of edges, which carry the relational information between nodes. The edge set *E* can be organised into the adjacency matrix *A*, an *N* × *N* binary matrix whose element *A<sub>ij</sub>* is equal to 1 if a link between the *i*-th and *j*-th nodes exists and 0 otherwise.

To address learning tasks on graphs, there has been an increasing interest in Graph Neural Networks (GNNs). These architectures typically use Graph Convolutional layers (GCNs), which allow for the processing of data on graphs, generalizing the concept of convolutions in CNNs. There are currently several types of GCNs available, ranging from the simplest models [11,12], to those based on graph spectral properties [13], and those that include attention mechanisms [14]. Although it is currently possible to achieve excellent results in solving various problems of classification, regression, or link prediction on graphs, graph generation remains an open challenge [15].

Unlike images, graphs often have complex geometric structures that are difficult to reproduce, particularly in an encoder–decoder framework. Despite various approaches existing, there is currently no standard method for addressing this class of problems. Standard classes of models for graph generation are Graph AutoEncoders (GAEs) and Variational Graph AutoEncoders (VGAEs) [16], which apply the concepts of AutoEncoders and Variational AutoEncoders (VAEs) to graphs. However, these architectures can only reconstruct or generate the adjacency matrix of the graphs, not the features of their nodes. Additionally, while these models can learn meaningful embeddings of node features, the graph structure and number of nodes remain fixed during the encoding process, resulting in no compression of input data through pooling operations and no bottleneck.

Different strategies have been developed for pooling operations on graphs. Early works used the eigen-decomposition for graph coarsening operations based on the graph topological structure, but these methods are often computationally expensive. An alternative algorithm is the Graclus algorithm [17], used in [13] and later adopted in other works on GNNs. Approaches like this aim to define a clustering scheme on graphs, on top of which it is possible to apply a standard max or mean pooling. Other approaches, such as SortPooling [18], select nodes to pool based on their importance in the network. There is also a stream of literature that bases pooling operations on spectral theory [19,20]. Finally, state-of-the-art approaches rely on learnable operators that, like message-passing layers, can adapt to a specific task to compute the optimal pooling, such as DiffPool [21], Top-K pooling [22], and ASAPooling [23]. These pooling methods have been demonstrated to perform well when integrated into GNN models for graph classification, but all have limitations. For example, DiffPool learns a dense matrix to assign nodes to clusters, so it does not scale to large graphs. Top-K pooling samples the top-k nodes aligned with a learnable vector, without considering graph connectivity; after pooling, good connectivity between the surviving nodes is therefore not guaranteed. ASAPooling uses a self-attention mechanism for cluster formation and a top-k scoring algorithm for cluster selection, also taking graph connectivity into account. While overcoming some limitations of previous methods, this pooling requires more computation, which can lead to high computing needs for large graphs.

In contrast to pooling procedures, there is currently a lack of solutions for un-pooling operations on graphs that can be considered the inverse of pooling. The only works that attempt to define such operations are described in [22,24]. Other decoding schemes for graph generation are designed in different, mostly task-specific ways. For example, in many algorithms for protein or drug structure generation, decoding is conducted by adding nodes and edges sequentially [25,26]. On the other hand, there are also works on "one-shot" graph generation, with decoding architectures that output node and edge features in a single step [27,28]. However, in various works that use this approach, the decoding of nodes and edges is considered separately and does not take into account the structure of the graphs. For example, in [29], a 1D-CNN is used to decode node features and a 2D-CNN is used for edge features.

In summary, we found that the current literature lacks:


In this study, we propose a model for graph generation based on Variational AutoEncoders. We focus on the case of graph data with a fixed, regular, and known geometric structure. To this end, we have developed:


The remainder of the study is organized as follows. In Section 2, we describe our proposed deep learning architecture and the datasets used. First, in Section 2.1, we introduce our Nearest Neighbours Graph VAE, describing how the developed pooling and un-pooling techniques work and how they are integrated into the generative model. Then, in Section 2.2, we describe the benchmark sprite dataset and present our own set of cylindrical-shaped graph data for a Medical Physics application. In Section 3, we present the results of applying our Graph VAE to the described datasets. We also conduct an ablation study to compare the performance of our model using different pooling and un-pooling techniques. The paper concludes with a final discussion in Section 4.

#### **2. Materials and Methods**

#### *2.1. Nearest Neighbour Graph VAE*

In this section, we introduce our Nearest Neighbour Graph VAE model, which uses graph convolutions in both the encoding and decoding operations. To create a bottleneck in the architecture, we propose new symmetrical graph pooling and un-pooling techniques, which we refer to as Recursive Nearest Neighbour Pooling and Un-Pooling (ReNN-Pool and Un-Pool). Such operations enable the construction of a VAE decoder based on graph convolutions.

#### 2.1.1. ReNN-Pool and Un-Pool

In CNNs, pooling operations are widely used to decrease the dimensionality of feature maps while increasing the receptive fields of neurons when processing grid-like data. This is easily possible on images, where a standard pooling layer typically involves a downsampling operation such as the max or mean function applied to groups of nearby pixels. Conversely, applying this idea to graph structures is generally a challenging task.

However, there are cases where, although data does not have a grid-like shape and can be better processed with GNNs, the graph structures are fixed and have a regular and known geometry. For example, some datasets may contain examples whose data are arranged in a cylindrical or spherical structure. For these cases, we developed a simple pooling operation (ReNN-Pool) that can sub-sample graph nodes in a regular way.

The ReNN-Pool consists in a masking operation and a subsequent update of the adjacency matrix of the graph. First of all, nodes are sorted on the basis of their position in the graph. For example, if samples have a cylindrical structure, nodes can be hierarchically ordered on the basis of their positions along the *z*, *θ* and *r* axes. Then, the masking operation is carried out. Masking consists in dropping, i.e., removing, certain nodes from the graph. The process is performed in a recursive manner on the sorted node list. It begins with the first node, which is preserved, while all its nearest neighbours are dropped and removed from the node list. The process then continues with the next node and repeats, until all nodes in the list have been processed.

After the masking, we rewire links between the "survived" nodes, connecting the ones that were 2nd-order neighbours before the masking. This is conducted by substituting the adjacency matrix *A* with *A*<sup>2</sup>. If we call *M* the vector that contains the indices of the "survived" nodes, and *X* = {*x<sub>i</sub>*}<sup>*N*</sup><sub>*i*=1</sub> and *A*, respectively, the nodes' feature matrix and the adjacency matrix before the pooling, the application of ReNN-Pool gives as output:

$$X' = \{\mathbf{x}'_i\}_{i=1}^{N}, \text{ where } \mathbf{x}'_i = \begin{cases} \mathbf{x}_i, & \text{if } i \in M \\ 0, & \text{otherwise} \end{cases} \qquad A'_{ij} = \begin{cases} A^2_{ij}, & \text{if } i,j \in M \\ 0, & \text{otherwise} \end{cases}$$

The process is illustrated in Figure 1.

**Figure 1. Pooling.** The ReNN-Pool operation consists of two steps: masking and rewiring. In the first step, recursively on the whole graph, a node is selected and all its nearest neighbours are dropped. In the second step, nodes are linked to the ones that were their 2nd order neighbours.
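A minimal R sketch of the masking and rewiring on a generic symmetric 0/1 adjacency matrix (with nodes assumed pre-sorted by position, as described above) could look as follows:

```r
renn_pool <- function(A) {
  n <- nrow(A)
  keep <- logical(n); dropped <- logical(n)
  for (i in seq_len(n)) {                 # recursive pass over sorted nodes
    if (dropped[i]) next
    keep[i] <- TRUE                       # preserve node i ...
    dropped[A[i, ] == 1] <- TRUE          # ... and drop all its neighbours
  }
  A2 <- (A %*% A > 0) * 1; diag(A2) <- 0  # 2nd-order connectivity
  Anew <- matrix(0, n, n)
  Anew[keep, keep] <- A2[keep, keep]      # rewire only the surviving nodes
  list(mask = keep, A = Anew)
}
```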

In the case of regular geometrical graph structures, by construction, pooled and not-pooled nodes are evenly spread across the graph; thus, the choice of the starting node does not affect the performance of the model in which the ReNN-Pool is used. For irregular graph structures, the performance can depend on the choice of the first node; in principle, however, this choice can become a hyperparameter to optimize using a validation set. Such a possibility will be explored in future works.

On regular graphs, the masking operation allows one to define evenly spread clusters of nodes. In fact, the surviving nodes can be thought of as centroids of the clusters made up of their nearest neighbours. These clusters can be used to perform a mean or max pooling on graphs in addition to the simple masking. A comparison of these methods is presented in Section 3.

Because the creation of the masks and the augmented adjacency matrices in our pooling depends only on the graph structure of the data, it is possible to compute and store them before training. This has two main advantages. First, the pooling operation adds no computational cost in either the training or the inference phase; thus, it can be used inside lightweight models for resource-constrained devices and is scalable to graphs of any size. Second, the masks and adjacencies can also be used to define an un-pooling operation symmetrical to the pooling one. This possibility is particularly relevant for the construction of generative encoder–decoder models, but also crucial for symmetrical architectures such as U-nets [30], where skip connections link graphs with the same geometrical structure. Starting from a lower dimensional (pooled) representation of the graph with feature matrix *X* and adjacency *A*<sup>2</sup>, the Un-Pool layer embeds the nodes back in their original positions in the higher dimensional representation of the graph, which has adjacency matrix *A*. All other restored nodes are initialized to 0. A similar idea for the un-pooling was already explored in [22]. An illustration of the un-pooling operation is shown in Figure 2.

**Figure 2. Un-Pooling.** The Un-Pool operation consists in embedding the nodes of a pooled graph in their initial position in the bigger original graph structure. All other restored nodes are initialized to 0.

#### 2.1.2. ReNN Graph VAE Architecture

A Variational AutoEncoder is a generative model with an encoder–decoder structure. First, the encoder learns to represent high dimensional input data as a distribution in a low dimensional space, called the "latent space". The decoder is then trained to recover the original data by sampling a point from that distribution. We refer the reader to Appendix B and the original paper [2] for a detailed description of the model.

In our architecture, the encoding consists of three subsequent repetitions of two operations: a graph convolution and a ReNN-Pool operation. For the graph convolution, we chose the GraphConv layer [11]. With this convolution, the new features of the nodes are linear combinations of their old features and the mean of their neighbourhood's features, followed by a ReLU activation:

$$\mathbf{x}'\_{i} = \text{ReLU}\left(\mathcal{W}\_{1}\mathbf{x}\_{i} + \mathcal{W}\_{2}\frac{1}{|\mathcal{N}(i)|}\sum\_{j \in \mathcal{N}(i)} \mathbf{x}\_{j}\right). \tag{1}$$

Eventually, to increase the expressive power of the network, one can also include edge features *eij* in the computation, considering instead:

$$\mathbf{x}'\_{i} = \operatorname{ReLU}\left(\mathcal{W}\_{1}\mathbf{x}\_{i} + \mathcal{W}\_{2}\frac{1}{|\mathcal{N}(i)|}\sum\_{j \in \mathcal{N}(i)} \mathbf{e}\_{ij}\mathbf{x}\_{j}\right). \tag{2}$$

In particular, we consider edge features *eij* to be learned parameters of the Neural Network. The number of output channels of the GraphConv in the three subsequent repetitions is set to 16, 32, and 64.
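For concreteness, a dense R sketch of the mean-aggregation convolution of Equation (1) (without the learned edge features) is:

```r
# X: N x F node features, A: N x N adjacency, W1/W2: F x F' weight matrices
graph_conv <- function(X, A, W1, W2) {
  deg <- pmax(rowSums(A), 1)                   # |N(i)|, guarding isolated nodes
  Z   <- X %*% W1 + ((A %*% X) / deg) %*% W2   # self term + mean of neighbours
  pmax(Z, 0)                                   # ReLU
}
```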

After the three graph encoding steps, the node features are put together and flattened. Then, the dimensionality of data is further reduced through a linear layer.

The encoded data are then processed as in a standard VAE architecture with a Gaussian prior.

Data points are mapped to Gaussian distributions in the latent space and, using the reparameterisation trick, a variable *Z* is sampled from those distributions.
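For reference, the reparameterisation trick amounts to the following standard snippet (a generic sketch, not specific to this architecture):

```python
import torch

def reparameterise(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I); sampling stays
    # differentiable with respect to mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```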

The decoding uses the same graph representations employed for the encoding, but in reverse order. After an initial decoding linear layer, we repeat three times the sequence of an un-pooling layer and a graph convolution. For the convolutions, we again employ GraphConv layers, with the number of output channels set to 32, 16 and 1. In this way, the original dimensionality of the data is recovered. The activation function of the last convolutional layer is a sigmoid. The model is trained by minimizing the standard *β*-VAE loss function, defined in Appendix B, with binary cross entropy as the reconstruction term.

The concatenation of a graph convolution and a pooling operation can be thought of as an "encoding block", while the union of an un-pooling operation and a convolution can be thought of as a "decoding block". Various architectures can be built using different numbers of blocks. In Figure 3, for example, our VAE architecture is illustrated, but with only two blocks in the encoder and decoder.

**Figure 3. Full scheme.** Schematic representation of our Re-NN Graph VAE with two encoding blocks and two decoding blocks. Each block is made up of a graph convolution and a pooling (un-pooling) operation. In the lower part of the picture, the VAE sampling and the encoding (decoding) linear layers are represented.
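Putting the previous sketches together, the block structure could be composed as follows. This is a hypothetical composition using the `pool`, `unpool` and `SimpleGraphConv` helpers defined above; the real model also includes the flattening and linear layers described earlier.

```python
def encode(x, convs, masks, adjs):
    # "Encoding blocks": graph convolution followed by ReNN-Pool masking.
    # adjs[k] is the adjacency at level k; masks[k] selects the survivors.
    for conv, mask, adj in zip(convs, masks, adjs):
        x = pool(conv(x, adj), mask)
    return x

def decode(x, convs, masks, adjs, sizes):
    # "Decoding blocks": un-pooling followed by a graph convolution, with
    # the encoder's masks and adjacencies reused in reverse order.
    for conv, mask, adj, n in zip(convs, masks, adjs, sizes):
        x = conv(unpool(x, mask, n), adj)
    return x
```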

#### *2.2. Datasets*

#### 2.2.1. Energy Deposition Datasets

The architecture presented in this work was developed for a specific application in Medical Physics: the generation of the distribution of the dose absorbed by a medium interacting with a given particle beam, conditioned on the beam parameters and medium density. Specifically, the approach was developed for a frontier application in radiation oncology, which makes use of high-energy electron beams (FLASH therapy [31]).

The datasets for this task are built by simulating an electron beam interacting with a material using Geant4 [32], a C++ toolkit for Monte Carlo simulations in particle physics.

The two datasets differ in the material in which electrons deposit their energy. In the first case, this material is a cubic volume filled with water. We will refer to this dataset as the "Water Dataset". In the second case, to increase the complexity of the task, we inserted a slab of variable density in the water volume. This slab is orthogonal to the electron beam and has a fixed position and thickness. The density of the slab is uniformly sampled at each run of the simulation between 0 and 5 g/cm<sup>3</sup> (for reference, the density of water is 1 g/cm<sup>3</sup>). We will refer to this dataset as the "Water + Slab Dataset". In both cases, the particles' energies are sampled uniformly between 50 and 100 MeV, which is the typical range of energies for FLASH radiotherapy with high-energy electrons.

Energy deposition data are collected in a cylindrical scorer, aligned with the electron beam and divided into 28 × 28 × 28 voxels along the *z*, *θ* and *r* axes. The cylindrical shape is particularly useful in our application because it allows for higher precision near the beamline.

Each example in the dataset is therefore a set of 28<sup>3</sup> voxels arranged in a cylindrical shape. Each voxel has a single feature, corresponding to the amount of energy deposited in it by the simulated particle beam. Each example of the Water Dataset is labelled by the initial energy of the electron beam, while examples of the other dataset are labelled by both the particles' initial energy and the slab's density. An illustration of a typical example from these datasets is shown in Figure 4. Besides the representation of the original data in the left panel, we also show how the ReNN-Pool operates on nodes. As can be seen, nodes are pooled in a spatially uniform way.

**Figure 4. Dataset and ReNN-Pool**. From the left, the panels show the representation of a typical example from the energy deposition datasets and two pooled representations of the same example. The nodes in light grey have null features, while all others show an energy distribution within the considered range.

The datasets consist of 4662 and 6239 examples, respectively, and are divided into training, validation and test sets on the basis of particle energy and slab density. In particular, in the Water Dataset, the test set is made up of examples with particle energies ranging between 70 and 80 MeV. In the Water + Slab Dataset, the test set examples have the same range of initial energies and slab density values ranging between 2 and 3 g/cm<sup>3</sup>. In both cases, the rest of the dataset is split between training and validation with a 1/10 validation ratio.

The test sets were chosen in this way in order to test the network's ability to interpolate between samples and generalise to unseen configurations.

For both datasets, we imposed a graph structure on the data. Each voxel was associated with a node, and nodes were linked to each other with nearest-neighbour connectivity. In this way, the nodes in the center and on the outer surface of the cylinder have five neighbours, while all others have six neighbours.

#### 2.2.2. Sprite Dataset

The sprite dataset is a benchmark dataset for latent space feature disentanglement in VAEs and consists of a set of 2D images representing pixelized video game "sprites". The aim of testing our Graph VAE on such a dataset is to show that our model can also work as a standard Variational AutoEncoder on tasks different from the one for which it was developed. Although a CNN would reasonably be the best choice for this dataset, images can also be thought of as graphs with a regular structure; therefore, they should also be processed effectively by our model.

We used part of the dataset employed in [33], available online (https://github.com/YingzhenLi/Sprites, accessed on 17 November 2022). Such a dataset consists of 9000 examples for training and 2664 for testing. Each example is a sequence of 8 frames of a sprite performing an action. We kept only the first frame of each example, so we ended up with 9000 images, divided into training and validation sets with a 1/8 ratio, and 2664 images for testing. Each image is 64 × 64 pixels and represents a standing sprite whose attributes are organized in 4 categories (skin color, tops, pants and hairstyle) with 6 variations each, and 3 orientations (left, center and right).

To process such a dataset with our architecture, we had to impose a graph structure on the data. Therefore, we associated a node with each pixel and connected the nodes with a grid-like connectivity, as sketched below. In this way, internal nodes have 4 edges, border nodes have 3 edges and corner nodes have 2 edges.
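A grid connectivity of this kind can be built in a few lines. The following sketch (with an illustrative function name) returns a PyG-style edge index for an *h* × *w* image, giving internal nodes 4 edges, border nodes 3 and corner nodes 2.

```python
import torch

def grid_edges(h, w):
    # Undirected 4-neighbour connectivity for an h x w pixel grid,
    # stored with both edge directions in a [2, num_edges] tensor.
    edges = []
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                      # horizontal neighbour
                edges += [(i, i + 1), (i + 1, i)]
            if r + 1 < h:                      # vertical neighbour
                edges += [(i, i + w), (i + w, i)]
    return torch.tensor(edges).t()
```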

#### **3. Results**

#### *3.1. Results on Energy Deposition Datasets*

We trained our VAE on the two energy deposition datasets described in the previous section. Here, we present the results regarding the reconstruction of the energy deposition distributions from the test sets. The DL model was trained for 200 epochs, and the best set of learnable parameters was chosen as the one that minimizes the validation loss. We set the latent space dimensionality to 1 for the Water Dataset and to 2 for the Water + Slab Dataset. For the weight update, we used the Adam optimiser with an initial learning rate of 0.003 and an exponential scheduler with *λ* = 0.9. The hyperparameter *β* of the VAE, defined in Appendix B, was set to 1.

To evaluate the performance of our model, we considered both local node-per-node and global reconstruction metrics. As a node-per-node reconstruction metric, we use the *δ*-index developed in [5]. This metric is inspired by the standard *γ*-index [34], used for the clinical validation of treatment plans, and is currently used in the field to evaluate DL models for energy deposition data generation. The reconstruction error on each node is defined as:

$$\delta = \frac{X_{\mathrm{reco}} - X_{\mathrm{GT}}}{\max(X_{\mathrm{GT}})} \tag{3}$$

where *Xreco* is the node feature predicted by the VAE, while *XGT* is the ground truth node feature in the corresponding example. Then, as a reconstruction performance measure, we consider the 3% passing rate, i.e., the percentage of nodes with a *δ*-index smaller than 3%. On the Water Dataset, our network reaches a 3% passing rate of 99.4%, while on the Water + Slab Dataset it reaches 98.4%, as reported in Table 1.
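In code, the *δ*-index and the 3% passing rate reduce to a couple of tensor operations. This is a minimal sketch, under the assumption that the comparison is made on the magnitude of *δ*:

```python
import torch

def delta_passing_rate(x_reco, x_gt, threshold=0.03):
    # Per-node delta index (Equation (3)), normalised by the maximum
    # ground truth value, and the fraction of nodes below the threshold.
    delta = (x_reco - x_gt) / x_gt.max()
    return (delta.abs() < threshold).float().mean().item()
```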


**Table 1. Results on energy deposition reconstruction**. We report mean relative errors on energy profiles and total energy along with the mean 3% *δ*-index passing rate. Uncertainties are computed as standard deviations. Values are computed on test sets.

As global evaluation metrics, we consider the error on relevant physical quantities that the generative model should reconstruct well. To this end, we compute the relative error on three quantities: the energy deposition profile along the *z* axis, the energy deposition profile along the *r* axis, and the total deposited energy.


In Figure 5, we show the energy profiles along the *z* and *r* axes. The upper panels (Figure 5a) regard the Water Dataset, while the lower panels (Figure 5b) refer to the Water + Slab one. In each panel, the blue line corresponds to the ground truth, i.e., the Monte Carlo simulated data, while the orange line refers to the data reconstructed by our network. In both cases, the network reconstructs the profiles well. The mean relative errors on the profiles and the total energy deposition are reported in Table 1, along with their standard deviations on the test set.


**Figure 5. Energy profiles reconstruction**. Distribution of energy deposition along *z* and *r* axes from the test sets of the Water dataset (**a**) and the Water + Slab dataset (**b**). The blue lines correspond to the Monte Carlo simulated data, while the orange lines refer to the reconstructed data from our Network.

While the errors on the *r* profile, total energy and overall reconstruction are quite low, at around 1–3%, the errors on the *z* profile are around 6%. To understand this result, we included in the analysis the variance of the Monte Carlo simulations. We fixed the particle energy and slab density to those of the example in Figure 5b and ran 100 Monte Carlo simulations, computing the mean and standard deviation of the energy deposition profile along the *z* axis. We also generated 100 energy deposition distributions by feeding our VAE the test set example corresponding to the chosen particle energy and slab density, and computed the mean and standard deviation of its *z* profile.

In Figure 6, we compare the standard deviation over the mean of the energy profile along the *z* axis between the Monte Carlo simulations (left) and the VAE reconstruction (right). The red dashed lines mark the slab with different (in this case higher) density, where most of the energy is released. Note that most of the reconstruction error occurs in regions where the energy deposition fluctuates, and these fluctuations, inherited from our Monte Carlo generated training set, are not negligible.

**Figure 6. Standard deviations in MC and VAE**. Comparison of the standard deviation over the mean of the energy profile along the *z* axis between Monte Carlo simulations and the VAE reconstruction in the Water + Slab setting. Values are estimated from 100 MC runs with fixed parameters and 100 VAE executions with the same example as input. The results demonstrate how the VAE's largest errors occur in regions where the energy deposition fluctuations, and thus the Monte Carlo ones, are not negligible.

#### *3.2. Results on Sprite Dataset*

For this task, to increase the expressive power of the architecture, we used both variants of the GraphConv layers. In particular, in the first layer of the encoder (and in a symmetrical way in the last layer of the decoder), we used the GraphConv without edge weights described in Equation (1). In the other layers, we used the GraphConv with edge weights (Equation (2)). Such weights are learned using Linear layers. A full description of the model is given in Appendix A.

We trained our Graph VAE for 50 epochs with a batch size of 50 and set the latent space to have 5 dimensions.

For the weight update, we used the Adam optimiser with an initial learning rate of 0.005 and an exponential scheduler with *λ* = 0.95. In this case, the hyperparameter *β* of the VAE, defined in Appendix B, was set to 2.

As shown in Figure 7, our model can also work like a standard CNN VAE for image generation. In the upper panel, we show a comparison between the input images and the reconstructed ones. In the lower panel, we show how the VAE can learn a disentangled representation of features and can also interpolate through samples. In particular, we show how, when fixing all dimensions of the latent space but one, it is possible to generate samples with a continuously changing hair style. A direct comparison with a standard 2D CNN VAE is presented in Appendix C.


**Figure 7. Results on sprite dataset**. (**a**) Comparison between input sprite images (first row) and reconstructed ones (second row). (**b**) Images generated fixing all the latent variables' dimensions but one, which is varied. Generated sprites have fixed attributes but a continuously varying hair style.

#### *3.3. Ablation Study on Pooling*

We performed an ablation study to assess the impact of the pooling technique on the model's performance. The model was evaluated on the Water + Slab dataset using the reconstruction metrics discussed in Section 3.1. Five pooling techniques were considered: the plain ReNN-Pool, the ReNN-Pool followed by a mean or a max aggregation over the clusters (ReNN Mean Pool and ReNN Max Pool), a Random Pool baseline, and the Top-k Pool [22].


For the Random Pool and Top-k Pool, we fixed the ratios of the pooled nodes to be the same as those of the ReNN-Pool; these ratios are, sequentially, 50%, 87% and 84%. After the node dropping, the adjacency matrix is updated from *A* to *A*2, connecting second-order neighbours, as performed in this work and recommended in [22], where Top-k pooling was introduced. The un-pooling operations employ the same node masks and adjacency matrices as the pooling operations and are thus always symmetrical to them. The results of the ablation study are presented in Table 2, which shows that the architectures with the ReNN-Pool and Un-Pool operations outperform the other models. The results relative to the addition of the mean and max operations are very likely related to the kind of data processed, which are smooth energy distributions in a cylinder. With such data, it is reasonable that the ReNN Mean Pool performs similarly to the simple ReNN-Pool and that adding the max operation leads to worse results.
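As an illustration of the Random Pool baseline, a mask with the same surviving-node ratio as the ReNN-Pool can be drawn as follows. This is a sketch; the function name and the interpretation of the ratio as the surviving fraction are our assumptions.

```python
import torch

def random_mask(num_nodes, keep_ratio):
    # Random Pool baseline: keep a random subset of nodes so that the
    # pooling ratio matches the ReNN-Pool one at the same level.
    k = int(round(keep_ratio * num_nodes))
    mask = torch.zeros(num_nodes, dtype=torch.bool)
    mask[torch.randperm(num_nodes)[:k]] = True
    return mask
```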

**Table 2. Results of ablation study on pooling**. Comparison of the results on the test set of the Water + Slab dataset using different pooling and un-pooling techniques. Mean relative errors on energy profiles and total energy along with the mean 3% *δ*-index passing rate on test sets are reported.


A plausible explanation for the better performance of the ReNN-Pool-based techniques with respect to the Random and Top-k Pool relates to the connectivity of the graphs. ReNN-Pool is designed so that the data are always represented by a single connected graph. Indeed, after each pooling operation, the receptive field of the remaining nodes is enlarged, and there is no loss of information due to disconnected clusters of nodes. Conversely, with Random pooling or Top-k pooling, there is no such guarantee. In fact, in most cases, after such pooling operations, the graph structure breaks up into disconnected clusters. This is particularly true when the graph exhibits only local connectivity.

#### **4. Discussion**

In this work, we presented our Nearest Neighbour Graph VAE, a Variational AutoEncoder that can generate graph data with a regular geometry. Such a model takes full advantage of graph convolutional layers in both the encoding and decoding phases. For the encoding, we introduced a pooling technique (ReNN-Pool), based on graph connectivity, that allows us to sub-sample graph nodes in a spatially uniform way and to alter the graph adjacency matrix accordingly. The decoding is carried out using a symmetrical un-pooling operation to retrieve the original graphs. We demonstrated that our model can accurately reconstruct the cylindrical graph data of energy deposition distributions of a particle beam in a medium.

We also evaluated the performance of the model on the sprite benchmark dataset, after transforming the image data into graphs. Although it cannot be directly compared with more sophisticated and task-specific algorithms for image synthesis, our model is able to generate good quality images, create disentangled representations of features, and interpolate through samples as well as a standard CNN VAE. Finally, we performed an ablation study on pooling. The results show that, on our task on large regular graphs, using the ReNN-Pool is more efficient and leads to better performance than a state-of-the-art technique such as Top-k Pool.

Overall, we believe that ReNN-Pool represents a simple, lightweight and efficient solution for pooling regular graphs. It requires no computation during either the training or inference of models, because node masks and adjacency matrices can be computed and stored in advance. Thus, it is directly scalable to graphs of any size, unlike state-of-the-art pooling techniques. Moreover, the definition of a symmetrical un-pooling technique enables the construction of graph decoding modules that can take advantage of graph convolutional layers. The current limitation of our pooling is that it has only been tested on regular graphs; a test on irregular graphs is among our future research directions. Although ReNN-Pool is not directly usable on all types of graphs, such as fully or highly connected ones, we believe that it could also be an efficient solution for irregular graphs with small to medium-sized node neighbourhoods. We also plan to test our method on graph U-Net architectures, where the symmetry between encoding and decoding is needed.

**Author Contributions:** Conceptualization, L.A., B.C., A.C., S.G. and C.M.T.; methodology, L.A., B.C., A.C., S.G. and C.M.T.; software, L.A.; validation, L.A. and A.C.; formal analysis, L.A.; investigation, L.A. and A.C.; resources, B.C., S.G. and C.M.T.; data curation, L.A. and C.M.T.; writing—original draft preparation, L.A.; writing—review and editing, L.A., A.C. and S.G.; visualization, L.A.; supervision, S.G. and C.M.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The Sprite dataset is publicly available at https://lpc.opengameart.org/ (accessed on 17 November 2022); the energy deposition datasets and the code used for this study are available on request from the authors.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **Abbreviations and Symbols**

The following abbreviations and symbols are used in this manuscript:


#### **Appendix A. Full Model Description**

In the following tables, we report a detailed list of the layers that compose the models used to run the experiments described in Section 3. Next to the name of each layer, we report its number of parameters and the number of nodes and edges in the graph after the layer's execution. In particular, in Table A1, we describe the model used on the Water + Slab dataset. For the Water dataset, we used the same architecture except for the last two linear layers of the encoder and the first one of the decoder, whose output (input) number of channels was set to 1, instead of 2, in accordance with the latent space dimensionality.

In Table A2, we describe the version of the Re-NN Graph VAE used for the Sprite dataset. The linear layers between the pooling (un-pooling) operations and the graph convolutions are responsible for learning the edge features that enter the computation of the GraphConv layers marked with *eij*.


**Table A1.** ReNN Graph VAE used for the Water + Slab dataset. For the Water dataset, the same architecture was used but the parameter marked with an asterisk was changed to 1, in accordance with the latent space dimensionality.

**Table A2.** ReNN Graph VAE for sprite dataset.


#### **Appendix B. Variational AutoEncoder**

A Variational AutoEncoder is a generative Deep Learning model first proposed by Kingma and Welling [2]. It is a special AutoEncoder based on variational Bayes inference, whose goal is to learn the distribution of the training data so as to sample new datapoints from it. The underlying hypothesis is that the datapoints {*x*} are the result of a generative process controlled by a variable *z* that lives in a low dimensional space, called the latent space, so that their distribution is:

$$p(\mathbf{x}) = \int p(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z},$$

where the prior *p*(*z*) is often taken to be Gaussian. The model is made up of two networks: the encoder and the decoder. The encoder *qψ*(*z*|*x*) maps the input data to a distribution in the latent space. Thanks to the reparameterisation trick, a point from such a distribution is sampled in a fully differentiable way and processed by the decoder *pφ*(*x*|*z*) to retrieve the original data. The model is trained by maximising the evidence lower bound (ELBO) of the data likelihood:

$$\{\psi^{*},\phi^{*}\} = \arg\max_{\psi,\phi}\; \mathbb{E}_{\mathbf{z}\sim q_{\psi}(\cdot\mid\mathbf{x})}\big[\log p_{\phi}(\mathbf{x}\mid\mathbf{z})\big] - D_{\mathrm{KL}}\big(q_{\psi}(\mathbf{z}\mid\mathbf{x})\,\big\|\,p(\mathbf{z})\big).$$

After training, new datapoints *x* can be generated by sampling *z* in the latent space and passing it to the decoder. Starting from a standard VAE, it is also possible to slightly modify the loss function by adding a scalar hyperparameter *β* > 1:

$$\{\psi^{*},\phi^{*}\} = \arg\max_{\psi,\phi}\; \mathbb{E}_{\mathbf{z}\sim q_{\psi}(\cdot\mid\mathbf{x})}\big[\log p_{\phi}(\mathbf{x}\mid\mathbf{z})\big] - \beta\, D_{\mathrm{KL}}\big(q_{\psi}(\mathbf{z}\mid\mathbf{x})\,\big\|\,p(\mathbf{z})\big).$$

The model with this modification is known as the *β*-VAE [35] and is recognised to promote better disentanglement of the feature embeddings in the latent space.
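In practice, the *β*-VAE objective with a binary cross entropy reconstruction term, as used in this work, reduces to a short loss function. The following is a generic sketch using the closed-form Gaussian KL term, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x_reco, x, mu, logvar, beta=1.0):
    # Reconstruction term (binary cross entropy) plus beta times the
    # closed-form KL divergence from N(mu, sigma^2) to N(0, I).
    bce = F.binary_cross_entropy(x_reco, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + beta * kl
```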

#### **Appendix C. ReNN Graph VAE vs. CNN VAE**

We performed a quantitative comparison between our ReNN Graph VAE and a standard 2D CNN VAE on the Sprite benchmark dataset. The CNN comprises 2D convolutional layers and mean pooling in the encoder, and 2D transpose convolutions and upsampling in the decoder. A full description of the model is given in Table A3. The model was trained for 50 epochs with a batch size of 50, setting the latent space to have 5 dimensions. For the weight update, we used the Adam optimiser with an initial learning rate of 0.005 and an exponential scheduler with *λ* = 0.95. In order to obtain a disentangled representation of the hair style in the latent space, we had to set the *β* parameter to 4.

To quantitatively evaluate the performance of our model on this dataset, we considered the Structural Similarity Index Measure (SSIM). It is a perception-based measure that considers image degradation as the perceived change in structural information. While pixel-per-pixel reconstruction metrics, such as the MSE or the previously used *δ*-index, estimate absolute errors, structural information captures the strong inter-dependencies between spatially close pixels, which carry important information about the image as a whole. Both the CNN VAE and the ReNN Graph VAE reached an average SSIM of 0.90 on the test set.
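The average SSIM can be computed, for instance, with scikit-image. A minimal sketch, assuming images stored as 2D arrays with values in [0, 1]:

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(gt_images, reco_images):
    # Average SSIM over pairs of ground truth / reconstructed images.
    scores = [structural_similarity(g, r, data_range=1.0)
              for g, r in zip(gt_images, reco_images)]
    return float(np.mean(scores))
```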

In Figure A1, we also report a comparison between some ground truth sprites from the test set and their reconstructions by the ReNN Graph VAE and the CNN VAE. Both VAEs work well, but some differences can be spotted: the CNN VAE reconstructions are slightly blurrier than the originals, while the ReNN Graph VAE reconstructions have slightly less bright colours.

**Figure A1. Ground truth vs. reconstructed sprites**. Comparison between sprites from the test set with their reconstructed counterpart by ReNN Graph VAE and a standard CNN VAE.


**Table A3.** Two-dimensional CNN VAE for sprite dataset.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
