*2.3. Modeling of the Determinants of the Food Consumption Water Footprint of Tunisian Households*

Multiple linear regression (MLR) is used to quantify the relationship between several independent variables and a dependent variable. We also created a multinomial logit model by converting the dependent variable Y into three food water footprint classes; however, to keep Y as a continuous variable, we finally opted for a semi-log multiple regression model. This method has been used successfully by different authors to establish statistical models [40–42]. In this study, the MLR method provides an equation linking the dependent variable *Yi* (food consumption water footprint) to the independent variables *Xi* in the following form:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_n X_{in} + \varepsilon_i \tag{1}$$

The intercept (*β0*) and the regression coefficients (*β1*, ..., *βn*) are determined by the least squares method [41]. The *Xi* variables are used to explain the water footprint of food consumption, the subscript *i* indexes the households in the sample, *n* in Equation (1) is the number of independent variables, and *ε* is the estimation error of the regression model. The best equation is selected on the basis of the highest coefficient of determination (*R*²), the lowest standard deviation (SD), and the F-ratio value. The MLR modeling was performed using STATA software.
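For reference, the least squares criterion and the coefficient of determination mentioned above can be written out as follows. This is the standard textbook formulation, given here as a reading aid and using the notation of Equation (1); the hat and bar symbols are introduced only for this illustration.

$$(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_n) = \arg\min_{\beta_0, \dots, \beta_n} \sum_{i} \left( Y_i - \beta_0 - \beta_1 X_{i1} - \dots - \beta_n X_{in} \right)^2$$

$$R^2 = 1 - \frac{\sum_{i} \left( Y_i - \hat{Y}_i \right)^2}{\sum_{i} \left( Y_i - \bar{Y} \right)^2}, \qquad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \dots + \hat{\beta}_n X_{in}$$

Here the sums run over the households in the sample and $\bar{Y}$ is the sample mean of the dependent variable.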

The original dependent variable was Yi = "food consumption water footprint". To address the large values and the strong skewness of the dependent variable, we log-transformed it. Specifically, we used a semi-log model with the natural log of Y (ln Y) as the dependent variable. Logarithmically transforming variables in a regression model is useful where a non-linear relationship exists between the independent and dependent variables [43]. Using the logarithm of one or more variables makes the effective relationship non-linear while still preserving the linear model. Such a transformation is also a convenient means of bringing a highly skewed variable distribution closer to a normal distribution.
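The two properties of the log transformation described above can be illustrated with a minimal Python sketch. It is not the authors' STATA code: the function names, the coefficient value, and the footprint values are made up for illustration. It relies on the standard result that, when ln Y is the response, a one-unit increase in a regressor with coefficient β changes Y by 100·(e^β − 1) percent.

```python
import math


def percent_effect(beta):
    """Percent change in Y for a one-unit increase in X when ln(Y) is the response."""
    return 100.0 * (math.exp(beta) - 1.0)


def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed values (illustrative only, not taken from the survey).
y = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
ln_y = [math.log(v) for v in y]

skew_raw = skewness(y)      # clearly positive: long right tail
skew_log = skewness(ln_y)   # essentially zero: symmetric after the log

# A hypothetical coefficient of 0.10 corresponds to roughly a 10.5% increase in Y.
effect = percent_effect(0.10)
```

For small coefficients the percent effect is close to 100·β, which is why semi-log coefficients are often read directly as approximate percentage changes.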


**Table 1.** Characteristics of the household sample (n = 4854).

Source: Own calculations from [30].

In a first step, all the variables correlated with the dependent variable were introduced into the model. In subsequent iterations, the non-significant variables with the highest *p*-values were eliminated one by one until the best model was obtained. To choose the optimal set of independent variables, we used backward selection based on Akaike's Information Criterion (AIC) [44] and the Bayesian Information Criterion (BIC) (Appendix A). The Breusch–Pagan/Cook–Weisberg test indicated a problem of heteroscedasticity. We therefore specified the robust variance-covariance estimator (VCE robust) option, which is equivalent to requesting White-corrected standard errors in the presence of heteroscedasticity (Appendix B). Using the variance inflation factor (VIF) test, we concluded that the independent variables selected in the final model do not present a problem of multicollinearity (Appendix C).
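The backward selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STATA procedure used in the study: it drops one variable at a time as long as doing so lowers the AIC, uses the Gaussian form AIC = N·ln(RSS/N) + 2k (valid up to an additive constant, with N observations and k coefficients), and runs on made-up toy data. The function and variable names (`solve`, `rss`, `aic`, `backward_aic`, `x1` to `x3`) are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    Z = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(Z[0])
    XtX = [[sum(z[a] * z[c] for z in Z) for c in range(k)] for a in range(k)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    beta = solve(XtX, Xty)  # normal equations
    return sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))


def aic(columns, y):
    """Gaussian AIC up to an additive constant."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)


def backward_aic(X, y):
    """Drop one variable at a time while the AIC keeps decreasing."""
    kept = list(X)
    current = aic([X[name] for name in kept], y)
    while kept:
        trials = [(aic([X[m] for m in kept if m != name], y), name)
                  for name in kept]
        best, drop = min(trials)
        if best >= current:
            break
        kept.remove(drop)
        current = best
    return kept, current


# Toy data (made up): y depends on x1 and x2 only; x3 is irrelevant.
idx = range(12)
X = {
    "x1": [float(i) for i in idx],
    "x2": [float(i * i % 7) for i in idx],
    "x3": [float(3 * i % 5) for i in idx],
}
y = [1.0 + 2.0 * a - 3.0 * b + 0.1 * (-1) ** i
     for i, (a, b) in enumerate(zip(X["x1"], X["x2"]))]

kept, final_aic = backward_aic(X, y)
```

Replacing the penalty `2 * k` with `ln(N) * k` would give the corresponding BIC-based selection; the heteroscedasticity-robust standard errors and the VIF check described above are separate steps that this sketch does not cover.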

The independent variables correspond to the geographic, socio-economic, and demographic characteristics of the households. The variables used in the final model are summarized in Table 2.


**Table 2.** Variables used in the multiple linear regression model.

\* The reference level for categorical variables is selected according to the modality with the greatest number of observations.
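The reference-level rule stated in the note to Table 2 can be sketched as follows. This is a minimal illustration, assuming the categorical variable is available as a plain list of labels; the function name `dummy_encode` and the example labels are hypothetical and not taken from the survey data.

```python
from collections import Counter


def dummy_encode(values):
    """Return (reference_level, rows), using the most frequent level as reference.

    Each row holds one 0/1 indicator per non-reference level.
    """
    counts = Counter(values)
    reference = counts.most_common(1)[0][0]
    levels = sorted(level for level in counts if level != reference)
    rows = [{level: int(v == level) for level in levels} for v in values]
    return reference, rows


# Hypothetical labels for illustration only.
area = ["urban", "rural", "urban", "urban", "rural"]
reference, rows = dummy_encode(area)
```

With this coding, each estimated coefficient on an indicator is read relative to the reference level, which is omitted from the regression.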

To identify diets that are both healthy and sustainable, several studies have begun to quantify the dietary water footprint [18,45,46]. However, only a few recent studies, in China and Spain, have incorporated regional, income, and food wastage effects into the household consumption water footprint [33,34,47].
