*3.1. Ground Truth 1*

#### 3.1.1. Connectivity Detection

Sensitivity and Precision per method are shown in Figures 7 and 8, respectively. Some remarks, specifically regarding Conv2D, need to be made before interpreting the results.

**Figure 7.** Mean Sensitivity for all methods (L = 1500). Sensitivity Ranking Far–Superficial: TRGC > Conv2D (TS = 2) > LSTM-NUE > TCDF > Conv2D (TS = 3). Sensitivity Ranking Far–Deep: TRGC > LSTM-NUE > Conv2D (TS = 2) > Conv2D (TS = 3) = TCDF. Abbreviation: TS = number of time series included in the predictions.

**Figure 8.** Mean Precision for all methods (L = 1500). Sensitivity Ranking Far–Superficial: TRGC > Conv2D (TS = 2) > LSTM-NUE > TCDF > Conv2D (TS = 3). Sensitivity Ranking Far–Deep: TRGC > LSTM-NUE > Conv2D (TS = 2) > Conv2D (TS = 3) = TCDF. Abbreviation: TS = number of time series included in the predictions.

Given that the Conv2D model based on three predicting time series yielded a very low Sensitivity (0.13 ± 0.18) and a low Precision (0.40 ± 0.55) (see Figures 7 and 8), it was not considered relevant to explore the model with three predictors further in terms of connectivity strength (for connectivity strength per ANN, see Sections 3.1.2–3.1.4). This decision was supported by the results of a Scheirer–Ray–Hare test with model and dipole condition as factors, followed by Bonferroni-corrected Mann–Whitney U tests: Conv2D models with two predicting time series gave superior results to those with three predicting time series. These results can be consulted in Appendix A (Table A1).

Hence, strength rankings are explored only for the Conv2D model with two predictors (Section 3.1.4). For this model, examining each predictor pair (consisting of two time series) separately revealed large differences between pairs in terms of Sensitivity and Precision. Two scoring options were therefore available: (1) taking all detected connections into account when calculating the scores, or (2) averaging over all predictor pairs. We chose option 1 to put the focus on detection ability and exploration, although it can bias the results. Suppose that X2 is found to predict X1 when X2 is used as a predictor together with X1, but not when X2 and X3 are the predictors. Because the discovered connection from X2 to X1 is still included in the performance scores, Sensitivity increases. Precision, in contrast, decreases, because a false positive found by only one of the two predictor pairs is counted as well. Thus, it must be kept in mind that a positive detection bias exists in all our overall two-to-one performance scores of Conv2D.
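The effect of the two scoring options can be illustrated with a minimal sketch. The ground-truth edges, the predictor pairs, and the detections below are hypothetical values chosen for illustration only; they are not taken from the study's data, and the helper name `sensitivity_precision` is our own.

```python
from statistics import mean

def sensitivity_precision(true_edges, detected_edges):
    """Return (sensitivity, precision) for a set of detected directed edges."""
    true_pos = len(true_edges & detected_edges)
    sensitivity = true_pos / len(true_edges) if true_edges else 0.0
    precision = true_pos / len(detected_edges) if detected_edges else 0.0
    return sensitivity, precision

# Hypothetical ground truth: X2 -> X1 and X3 -> X1.
true_edges = {("X2", "X1"), ("X3", "X1")}

# Hypothetical detections, grouped by the predictor pair that produced them.
detections_per_pair = {
    ("X1", "X2"): {("X2", "X1")},                # one true positive
    ("X2", "X3"): {("X3", "X1"), ("X3", "X2")},  # one true, one false positive
}

# Option 1: pool every connection detected by any predictor pair.
pooled = set().union(*detections_per_pair.values())
union_sens, union_prec = sensitivity_precision(true_edges, pooled)

# Option 2: score each predictor pair separately, then average.
per_pair = [sensitivity_precision(true_edges, d)
            for d in detections_per_pair.values()]
mean_sens = mean(s for s, _ in per_pair)
mean_prec = mean(p for _, p in per_pair)

print(f"pooled:   sensitivity={union_sens:.2f}, precision={union_prec:.2f}")
print(f"averaged: sensitivity={mean_sens:.2f}, precision={mean_prec:.2f}")
```

In this toy case, pooling raises Sensitivity relative to averaging while lowering Precision, which is the direction of the bias described above.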

Focusing on differences in Sensitivity, the following results were obtained for the ANNs and TRGC. A Scheirer–Ray–Hare test with model and dipole condition as factors revealed neither a statistically significant interaction (alpha = 0.05) between type of connectivity method and dipole condition (*p* = 0.63) nor a significant main effect of dipole condition itself (*p* = 0.63). However, a simple main effects analysis showed that the type of connectivity method does have a statistically significant effect on Sensitivity (*H* (4,40) = 38.159, *p* < 0.001). Follow-up two-sided Mann–Whitney U tests (Bonferroni-corrected: alpha = 0.05, alpha adjusted = 0.005), carried out across dipole conditions, showed significant and marginally significant differences between the following methods. Median scores (denoted as *Mdn*) are reported. In contrast to TCDF (*Mdn* = 0.33), smaller contributions of one time series to another could be detected with LSTM-NUE (*Mdn* = 0.67), *p* < 0.001. The difference between TCDF and Conv2D (*Mdn* = 0.67) with two time series as predictors was only marginally significant after correcting for multiple comparisons, *p* = 0.006. TRGC (*Mdn* = 1), however, outperformed all ANN models in terms of Sensitivity (*p*-values denoting differences with all other methods < 0.001). Finally, while no significant difference was found between LSTM-NUE (*Mdn* = 0.67) and Conv2D (*Mdn* = 0.67) with two time series as predictors (*p* = 0.239), LSTM-NUE performed significantly better than Conv2D with three time series as predictors (*Mdn* = 0), *p* < 0.001. Rankings are given in the caption of Figure 7 to provide qualitative comparisons. Note that in Figure 7, mean scores *M* for each dipole condition are still reported, because the current results were obtained with small sample sizes: differences between dipole conditions may still appear once statistical power is increased (i.e., by using more data sets), and such differences were, to some extent, expected.
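The adjusted alpha of 0.005 is consistent with a Bonferroni correction over all pairwise comparisons between the five methods. The sketch below reproduces that arithmetic; the assumption that exactly ten pairwise tests were corrected for is our inference from the reported threshold, and *p*-values reported as "< 0.001" are entered as the upper bound 0.001.

```python
from itertools import combinations

methods = ["TRGC", "LSTM-NUE", "TCDF", "Conv2D (TS = 2)", "Conv2D (TS = 3)"]
alpha = 0.05

# Assumption: every unordered pair of methods is tested once.
pairs = list(combinations(methods, 2))
alpha_adjusted = alpha / len(pairs)  # Bonferroni: alpha / number of tests

# p-values as reported in the text (0.001 stands in for "< 0.001").
reported_p = {
    ("TCDF", "LSTM-NUE"): 0.001,
    ("TCDF", "Conv2D (TS = 2)"): 0.006,
    ("LSTM-NUE", "Conv2D (TS = 2)"): 0.239,
    ("LSTM-NUE", "Conv2D (TS = 3)"): 0.001,
}

# A comparison counts as significant only below the adjusted threshold.
significant = {pair: p < alpha_adjusted for pair, p in reported_p.items()}

print(f"{len(pairs)} comparisons, adjusted alpha = {alpha_adjusted:.3f}")
for pair, is_sig in significant.items():
    print(pair, "significant" if is_sig else "not significant")
```

Under this threshold, *p* = 0.006 for TCDF versus Conv2D (TS = 2) falls just above the adjusted alpha, which matches its description as marginally significant.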

With regard to Precision, a Scheirer–Ray–Hare test with method and dipole condition as factors yielded no significant results, although the effect of method was marginally significant (*H* (4,40) = 8.79, *p* = 0.07). Hence, no follow-up tests were carried out.

Thus, we rely upon rankings only for our qualitative description of the data (in terms of mean scores *M*, taking dipole condition into account). In the Far–Superficial dipole condition, Conv2D with TS = 2 and LSTM-NUE both obtain a Precision of *M* = 0.75 (±0.15 and ±0.17, respectively), followed by TRGC (*M* = 0.58 ± 0.05) and TCDF (*M* = 0.50 ± 0). Precision is lowest for Conv2D with TS = 3 (*M* = 0.40 ± 0.55). However, in the Far–Deep dipole condition, TRGC obtains perfect Precision (*M* = 1 ± 0), followed by LSTM-NUE (*M* = 0.88 ± 0.14) and Conv2D with TS = 2 (*M* = 0.63 ± 0.41). The qualitatively lower Precision score of TRGC in the Far–Superficial condition turned out to be mainly due to two consistently observed false-positive connections that were not detected in the Far–Deep dipole condition.

When summarizing the results in terms of F1-scores, the following ranking was obtained for the Far–Superficial condition: TRGC (*M =* 0.73 ± 0.04) = Conv2D with TS = 2 (*M =* 0.73 ± 0.09) > LSTM-NUE (*M =* 0.70 ± 0.07) > TCDF (*M =* 0.40 ± 0.0) > Conv2D with TS = 3 (*M =* 0.20 ± 0.27).

For the Far–Deep condition, the F1-score ranking was as follows: 1 ± 0 (TRGC) > 0.75 ± 0.17 (LSTM-NUE) > 0.47 ± 0.31 (Conv2D with TS = 2) > 0.20 ± 0.27 (Conv2D with TS = 3) = 0.20 ± 0.27 (TCDF).
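For reference, the F1-score is the harmonic mean of Precision and Sensitivity. The sketch below uses hypothetical per-data-set values, not the study's data, and assumes that F1 is computed per data set and then averaged; under that assumption, the mean F1 generally differs from the F1 computed from the mean Precision and mean Sensitivity.

```python
from statistics import mean

def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (0 if both are 0)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)

# Hypothetical (precision, sensitivity) pairs, one per data set.
per_dataset = [(1.0, 0.5), (0.5, 1.0), (0.0, 0.0)]

# Assumed procedure: compute F1 per data set, then average.
f1_values = [f1_score(p, s) for p, s in per_dataset]
mean_f1 = mean(f1_values)

# Alternative: F1 of the averaged precision and sensitivity.
f1_of_means = f1_score(mean(p for p, _ in per_dataset),
                       mean(s for _, s in per_dataset))

print(f"mean of per-data-set F1: {mean_f1:.3f}")
print(f"F1 of mean scores:       {f1_of_means:.3f}")
```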
