#### 3.2.3. Analysis of Survey Data

To avoid overfitting and reduce dimensionality, a principal component analysis (PCA) was applied to all indicators under each factor. PCA reduces the size of the independent variable set by retaining as much of the variance as possible in fewer dimensions than the original number of indicators. The number of principal components (PCs) retained in this study followed Formann [36], such that the number of observations is greater than 5 × 2<sup>*v*</sup>, where *v* is the number of variables. For this study, three PCs were retained under each of the seven factors. All data were confirmed to follow a normal distribution.
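The sample-size rule can be checked with a few lines of arithmetic. The sketch below assumes the threshold reads as 5 × 2<sup>*v*</sup> observations for *v* retained variables; the function names and the respondent count of 60 are hypothetical and are not taken from the survey.

```python
def formann_min_sample(n_variables: int) -> int:
    """Threshold given by the 5 * 2**v rule of thumb (assumed reading)."""
    if n_variables < 1:
        raise ValueError("n_variables must be a positive integer")
    return 5 * 2 ** n_variables


def max_components(n_observations: int) -> int:
    """Largest v for which n_observations is still greater than 5 * 2**v."""
    v = 0
    while n_observations > formann_min_sample(v + 1):
        v += 1
    return v


# Three retained PCs call for more than 5 * 2**3 = 40 observations.
print(formann_min_sample(3))  # 40
# A hypothetical panel of 60 respondents supports at most three PCs (40 < 60 <= 80).
print(max_components(60))     # 3
```

Under this reading, each additional retained PC doubles the required number of observations, which is why only a small number of PCs can be kept for an expert survey of modest size.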

After completing the PCA, a k-means clustering algorithm was used to identify groupings of expert responses. K-means clustering is a partitional clustering approach that identifies a user-defined number of clusters (*K*), which are represented by their means, or centroids. To group *n* observations into *K* clusters, this technique uses either the Euclidean or the rectilinear distance of the scaled points from the centroids as a measure of similarity. This is performed in an iterative process. First, the number of hypothetical clusters (*K*) is decided based on characteristics such as geography, profession, and the hydrologic position of watercourse states. Second, initial centroids are randomly selected from among the data points. The number of these randomly selected observations (each an associated starting centroid) is, by default, equal to the number of assumed clusters. Third, the Euclidean distance between these initial centroids and each data point is calculated using Equation (1). Then, individual observations are assigned to one of the *K* clusters according to their minimum distance from the randomly selected centroids. The smaller the Euclidean distance between a given centroid and a data point, the more likely the point is to be grouped in that centroid's cluster. However, each time the centroids change, the cluster membership of the data points may also change. Thus, at each step, centroids are updated by taking the average of the data points that were assigned to the same cluster in the preceding iteration. This continues until a consistent cluster assignment is obtained.

$$d_{u,v} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \dots + (u_q - v_q)^2} \tag{1}$$

where *d<sub>u,v</sub>* is the Euclidean distance between two points *u* and *v* (here, a given centroid and an observation), and 1, 2, ..., *q* index the variables on which each point is measured.
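The iterative procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `k_means`, its parameters (`init`, `seed`, `max_iter`), and the two-dimensional scores are all hypothetical. The optional `init` argument fixes the starting centroids so that the run is reproducible; without it, *K* observations are drawn at random, as in the description.

```python
import math
import random


def k_means(points, k, init=None, seed=0, max_iter=100):
    """Lloyd-style k-means on a list of equal-length numeric tuples.

    Returns (labels, centroids). `init` optionally fixes the starting
    centroids; otherwise k observations are drawn at random.
    """
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centroids = [tuple(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance, Equation (1).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical 2-D scores (e.g. two PC scores per respondent), not survey data.
scores = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(scores, k=2, init=scores[:2])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Both starting centroids in this example lie in the same group, so the first pass misassigns one observation. The centroid update then pulls the second centroid toward the distant points, and the assignment stabilizes on the next pass.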

Accordingly, we first set the number of hypothesized clusters at *K* = 2 for groupings between basin and non-basin or upstream and downstream, *K* = 4 for grouping among basin states, and *K* = 5 for grouping by profession. Given these specified numbers of clusters, the algorithm iterated to identify patterns of differences or similarities that existed in the survey data. However, because the ideal number of clusters for a given dataset can differ from what is anticipated, an optimal number of clusters was determined using the elbow method to check our assumptions. Repeating the steps above, the within-cluster sum of squared errors (SSE) was calculated for each *K* value from 2 to (*n* − 1), where *n* is the number of observations. Then, by plotting *K* against SSE, we noted the approximate point at which the rapid decline in the SSE ends and the curve begins to flatten (forming an elbow shape). At this point, because the change in the SSE was quite small for additional clusters, the corresponding number of clusters (*K* value) was selected as appropriate. The k-means clustering obtained with this optimal *K* value also served to verify the initially selected number of clusters. Further, the number of respondents from each country categorized in each group was used as a check on whether the expected groupings existed in the survey responses.
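The elbow search can be sketched as follows. Two choices here are assumptions made for illustration and are not taken from the study: the clustering starts from a deterministic farthest-point seeding rather than a random one (purely so the example is reproducible), and the elbow is located numerically as the *K* with the largest second difference of the SSE curve, whereas the text describes reading it from a plot. The names `k_means_sse` and `elbow`, the range *K* = 1 to 5, and the data are hypothetical.

```python
import math


def k_means_sse(points, k, max_iter=100):
    """Cluster with k-means and return the within-cluster SSE."""
    # Assumption: farthest-point seeding replaces random seeding here.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def elbow(sse_by_k):
    """Pick the K with the largest second difference of the SSE curve.

    Expects consecutive integer keys; the first and last K cannot be chosen.
    """
    ks = sorted(sse_by_k)
    bends = {
        k: sse_by_k[k - 1] - 2 * sse_by_k[k] + sse_by_k[k + 1] for k in ks[1:-1]
    }
    return max(bends, key=bends.get)


# Hypothetical 2-D scores forming three tight groups (not survey data).
scores = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1), (0, 10), (1, 10), (0, 11)]
sse_by_k = {k: k_means_sse(scores, k) for k in range(1, 6)}
print(elbow(sse_by_k))  # 3
```

The SSE drops sharply up to *K* = 3 and barely changes afterwards, so the second difference peaks at 3. On real survey data the bend is usually less pronounced, which is one reason to confirm a numerical pick against the plot.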

Although k-means clustering identifies patterns and reasons for grouping in the data, it cannot determine the degree of difference between groupings; thus, analysis of variance (ANOVA) was included to evaluate potentially statistically significant differences among basin countries, and a t-test was used to identify potentially significant differences between, within, and outside of the basin countries for individual indicators. To be statistically significant, results must surpass the 95% confidence level (*p*-value < 0.05).
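The two test statistics can be illustrated with a short sketch. It assumes a one-way ANOVA and a two-sample t-test with pooled variance; the text does not state which t-test variant was used, so the pooled form is an assumption. The function names and the Likert scores are hypothetical. Converting *F* or *t* into a *p*-value requires the tail probability of the corresponding distribution, which a statistics library would supply, and that step is omitted here.

```python
import math
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def pooled_t(a, b):
    """Return (t, df) for a two-sample t-test with pooled variance (assumed variant)."""
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    df = len(a) + len(b) - 2
    se = math.sqrt(ss / df * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se, df


# Hypothetical Likert scores (1-5) for one indicator from three groups of experts.
country_a, country_b, country_c = [4, 5, 4, 5], [2, 3, 2, 3], [3, 4, 3, 4]
f_stat, df_b, df_w = one_way_anova_f([country_a, country_b, country_c])
t_stat, df_t = pooled_t(country_a, country_b)
print(round(f_stat, 3), df_b, df_w)
print(round(t_stat, 3), df_t)
```

With only two groups the two tests coincide, since the ANOVA *F* statistic equals the square of the pooled *t* statistic. This is a convenient sanity check on any implementation.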

#### 3.2.4. Final Indicator Selection

The selection of the final set of indicators was based on both statistical significance and the percentage of responses with a Likert score of 5 (very important) (Figure 4). All indicators were classified into levels of consensus according to the following:


**Figure 4.** Flowchart for identifying the consensus level for each indicator.

#### **4. Results**

In the following section, we present results from all respondents, analyses comparing basin countries with one another and basin with non-basin groupings, and the key indicators identified for the Nile Basin that may be used for equitable and reasonable water allocation among the states.

#### *4.1. Responses of All Experts*

According to the summary of data (Figures 5 and 6), the responses from basin country experts covered all classes from not important to very important (1–5) for most of the indicators. For non-basin countries, this was not the case. This suggests that experts from basin countries were more divided on most of the indicators than non-basin experts. Yet, basin experts also expressed a common positive inclination for some indicators, including water-food-energy risk index, population without electricity, the relative significance of hydropower, access to drinking water, access to clean cooking, multidimensional poverty, hunger index, existing irrigation demand, and future domestic water demand. In contrast, although approximately 80% of non-basin experts considered the majority of indicators to be important, there were some exceptions on which they were divided, including ICT index, life expectancy index, cereal yield, and industry (%GDP).

#### *4.2. Comparison among Basin States*

The sum of the variance explained by the first three PCs for each factor ranged from 60% to 95%, expressing the scope of agreement and disagreement between basin experts. The clusters based on these PCs for each factor were quite mixed and did not show a clear distinction (Figures 7, A1 and A2). Overall, the influence of the expert's profession (Figure A1) and home country (Figure A2) appears negligible, whereas grouping countries by hydrologic position (upstream vs. downstream) did indeed produce clearer clusters (Figure 7). This was particularly clear for factors 1 to 5, but less so for factors 6 and 7 (costs of conservation and protection, and the availability of alternative uses; Figure 7f,g), where a stronger similarity was observed. Thus, not all indicators necessarily imply a difference in opinion between upstream and downstream states.
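For reference, the share of variance explained by the first *m* PCs is conventionally computed from the eigenvalues of the covariance (or correlation) matrix of the *p* indicators under a factor. The symbols λ<sub>*i*</sub>, *m*, and *p* are notation introduced here for illustration and do not appear in the original analysis:

$$\mathrm{EV}(m) = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{j=1}^{p} \lambda_j}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$$

With *m* = 3, the reported range corresponds to EV(3) between 0.60 and 0.95 across the seven factors.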

**Figure 5.** A percentage summary of survey responses for indicators from basin state experts.

**Figure 6.** A percentage summary of survey responses for indicators from non-basin state experts.

**Figure 7.** Clusters based on the hydrologic position of basin states (i.e., cluster 1 and cluster 2 represent upstream and downstream countries, respectively), where (**a**) shows the division of experts' responses on indicators under factor-#1: geography, hydrology, ecology, and natural features; (**b**) factor-#2: socio-economic needs of basin states; (**c**) factor-#3: the population dependent on the watercourse; (**d**) factor-#4: the effects of water use; (**e**) factor-#5: existing and potential uses; (**f**) factor-#6: costs of conservation and protection; and (**g**) factor-#7: availability of alternative and comparable values and uses.

Beyond confirming that the optimal number of clusters is *K* = 2, the number of experts falling into each distinct cluster based on hydrologic position is also insightful (Figure 8). Sudan and Egypt were visibly similar for indicators under factor 6 (costs of conservation and protection) and factor 7 (availability of alternative uses). However, for factors 1–5, most Sudanese experts' responses more closely matched upstream expert opinions, leaving a relatively clear separation between Egypt and the other basin states. Although these conclusions can be drawn at the factor scale, broad dissimilarities across all indicators listed under each factor are not necessarily evident. Rather, it is typically only a few indicators under each factor that strongly influence the division by hydrologic position. These specific indicators can be identified through the proposed statistical tests.

**Figure 8.** The number of experts from upstream and downstream states in each cluster, organized by factor, where factor-#1 is geography, hydrology, ecology, and natural features, factor-#2 is socio-economic needs of basin states, factor-#3 is the population dependent on the watercourse, factor-#4 is the effects of water use, factor-#5 is existing and potential uses, factor-#6 is costs of conservation and protection, and factor-#7 is availability of alternative and comparable values and uses.
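Counting how many experts from each group fall into each cluster is a simple cross-tabulation. The sketch below shows one way to produce such counts; the function name, the group labels, and the cluster assignments are hypothetical and do not reproduce the values in Figure 8.

```python
from collections import Counter


def cluster_counts(groups, labels):
    """Count respondents per (group, cluster) pair."""
    return Counter(zip(groups, labels))


# Hypothetical respondents and cluster labels, for illustration only.
position = ["upstream", "upstream", "upstream", "downstream", "downstream"]
cluster = [1, 1, 2, 2, 2]
counts = cluster_counts(position, cluster)
for (group, label), n in sorted(counts.items()):
    print(f"{group:<10} cluster {label}: {n}")
```

A group whose members fall mostly into a single cluster supports the hypothesized grouping, while an even spread across clusters suggests that the grouping variable does not explain the division in responses.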

Whereas general agreement exists on 67 indicators among basin countries, a significant difference exists for eight indicators among basin states, as well as for two indicators among Egyptian experts (Table S2). Apart from Egypt's experts, no significant differences were detected within other basin states. The eight indicators that resulted in a significant difference based on hydrologic position were the average drought-affected people per year in each country (I-6), population living below the income poverty line (I-30), population growth rate (I-38), wetland area (I-56), estimated cost to conserve erosion hot spot areas (I-58), virtual water (I-61), revenue and job opportunity from ports (I-74), and water conservation by crop pattern modification (I-75). The two indicators on which Egyptian experts also significantly disagreed among themselves were the average drought-affected people per year in each country (I-6) and estimated cost to conserve erosion hot spot areas (I-58).
