2.2.4. TRGC—Time-Reversed Granger Causality

As our baseline method, we used Time-Reversed Granger Causality (TRGC), as implemented in the Matlab function "tr\_gc\_test" (embedded in "simulation\_source\_connectivity") and evaluated by [1]. As stated before, TRGC differs from "traditional" GC in its significance procedure. The classical procedure, a likelihood ratio test, cannot distinguish actual from spurious correlations due to source mixing. TRGC instead determines whether the "standard" GC scores for non-reversed and reversed data have opposing directions and are both significant. In other words, the direction must flip when the data are time-reversed. This is referred to as conjunction-based TRGC [7]. A drawback of GC (and hence of TRGC) is that the model order must be defined, which is feasible when the ground truth is known, as in simulations, but quickly becomes difficult for "real" EEG data. An advantage, on the other hand, is that TRGC constructs one model for all sources, after which a single threshold is applied to all obtained GC scores.
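The conjunction rule can be illustrated with a minimal Python sketch. It is not the tr\_gc\_test implementation: the function name `conjunction_trgc`, the matrix layout (entry [i, j] refers to the connection from source i to source j), and the use of net GC scores to check for direction flipping are assumptions made for illustration only.

```python
import numpy as np

def conjunction_trgc(gc_fwd, gc_rev, p_fwd, p_rev, alpha=0.05):
    """Hypothetical conjunction test; entry [i, j] describes the connection i -> j."""
    gc_fwd, gc_rev = np.asarray(gc_fwd, float), np.asarray(gc_rev, float)
    p_fwd, p_rev = np.asarray(p_fwd, float), np.asarray(p_rev, float)
    # Net flow i -> j: positive when i drives j more than j drives i.
    net_fwd = gc_fwd - gc_fwd.T
    net_rev = gc_rev - gc_rev.T
    # Direction flipping: i -> j dominates on the original data, j -> i after reversal.
    flipped = (net_fwd > 0) & (net_rev < 0)
    # Both scores must be significant: i -> j on original data, j -> i on reversed data.
    significant = (p_fwd < alpha) & (p_rev.T < alpha)
    return flipped & significant

# Toy example with two sources (all numbers are made up).
gc_fwd = [[0.0, 0.5], [0.1, 0.0]]    # GC scores on the original data
gc_rev = [[0.0, 0.1], [0.4, 0.0]]    # GC scores on the time-reversed data
p_fwd = [[1.0, 0.01], [0.5, 1.0]]
p_rev = [[1.0, 0.6], [0.02, 1.0]]
accepted = conjunction_trgc(gc_fwd, gc_rev, p_fwd, p_rev)
```

In this toy example only the connection from source 0 to source 1 is accepted. If the reversed-data scores equalled the original ones, no direction flip would occur and no connection would be retained.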

Configuration. The function tr\_gc\_test takes as input an NxL matrix H', the model order, the number of time steps in the time series, alpha, the type of significance test, and finally the regression mode used to estimate the Vector-Autoregression (VAR) model from which pairwise-conditional time-domain Granger Causality scores are calculated. The significance test is either "conservative", requiring significant GC scores with the original as well as the reversed data, or based on difference scores between the GC scores in normal and reversed order. In this work, the model order of TRGC was set to two, we opted for "conservative" significance testing, and ordinary least squares (OLS) was used as the VAR estimator. We used an alpha level of 0.05, FDR corrected [31]. The corresponding p-value was taken as the threshold to binarize the connectivity scores.
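The thresholding step can be sketched as follows. The sketch assumes that the FDR correction is the Benjamini-Hochberg step-up procedure and that only off-diagonal entries enter the correction; neither detail is confirmed here, and the helper names `bh_threshold` and `binarize` are hypothetical.

```python
import numpy as np

def bh_threshold(p_values, alpha=0.05):
    """Largest p-value passing the Benjamini-Hochberg criterion, or None if none passes."""
    p = np.sort(np.asarray(p_values, dtype=float).ravel())
    m = p.size
    critical = alpha * np.arange(1, m + 1) / m   # k * alpha / m for k = 1..m
    passed = np.nonzero(p <= critical)[0]
    return float(p[passed[-1]]) if passed.size else None

def binarize(p_matrix, alpha=0.05):
    """Binary connectivity matrix from a square p-value matrix (diagonal ignored)."""
    p_matrix = np.asarray(p_matrix, dtype=float)
    off_diag = ~np.eye(p_matrix.shape[0], dtype=bool)   # assumption: skip self-connections
    thr = bh_threshold(p_matrix[off_diag], alpha)
    if thr is None:
        return np.zeros_like(off_diag), thr
    return (p_matrix <= thr) & off_diag, thr

# Made-up p-values for three sources.
p_demo = [[1.0, 0.001, 0.2],
          [0.04, 1.0, 0.008],
          [0.6, 0.03, 1.0]]
mask, thr = binarize(p_demo)
```

Note that p-values of 0.03 and 0.04 lie below the uncorrected alpha of 0.05 but are not retained after correction in this example, which is the intended effect of a single FDR-derived threshold.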

## *2.3. Performance Evaluation*

The main question is whether the connections in the ground truths could be detected by the evaluated networks and by TRGC ("True Positives", TP) without detecting too many false connections ("False Positives", FP), i.e., connections that are not present in the ground truths. Measures based on these counts are Precision, Sensitivity/Recall, and F1-score (Figure 6), which we used to compare TCDF, LSTM-NUE, Conv2D and TRGC.


**Figure 6.** Main evaluation measures.
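For reference, these measures can be computed from binary connectivity matrices as in the following sketch. It uses the standard definitions Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall). The function name `edge_metrics`, the matrix layout, and the convention of returning 0 when a denominator is zero are assumptions for illustration.

```python
import numpy as np

def edge_metrics(predicted, truth, ignore_self=True):
    """Precision, recall and F1 for binary connectivity matrices of equal shape."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    # Optionally drop the diagonal so self-connections do not count as hits.
    keep = ~np.eye(truth.shape[0], dtype=bool) if ignore_self else np.ones_like(truth)
    tp = int(np.sum(predicted & truth & keep))    # detected and present
    fp = int(np.sum(predicted & ~truth & keep))   # detected but absent
    fn = int(np.sum(~predicted & truth & keep))   # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up ground truth and prediction for three sources.
truth_demo = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
pred_demo = [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
without_self = edge_metrics(pred_demo, truth_demo)
with_self = edge_metrics(pred_demo, truth_demo, ignore_self=False)
```

In this toy case, precision rises from 0.5 to 0.8 when the three trivially correct self-connections are counted, which illustrates why including them would flatter the results.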

The results on connectivity strength are not directly compared between models as they differ substantially. These strength estimates, based on the mean over five runs on the same data set, are calculated and ranked. It must be emphasized that these strength estimates are relative per model and per target, since the network is trained differently for each target time series. Consequently, connection strengths obtained in the prediction of a particular target time series X1 cannot be readily compared with connection strengths obtained in the prediction of another target time series X2. If F1 < 50%, only rankings are presented. Self-connectivity is not taken into account to avoid an overly positive perception of the results.
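A minimal sketch of such a per-target ranking is given below. The array layout (runs × sources × targets), the function name `rank_per_target`, and the convention that rank 1 denotes the strongest incoming connection are assumptions; the sketch only illustrates that ranks are assigned within each target column and never across targets.

```python
import numpy as np

def rank_per_target(strengths):
    """Average strengths over runs, then rank the sources separately for each target."""
    mean = np.asarray(strengths, dtype=float).mean(axis=0)   # shape: (sources, targets)
    n = mean.shape[0]
    ranks = np.zeros((n, n), dtype=int)                      # 0 marks ignored self-connections
    for target in range(n):
        sources = [s for s in range(n) if s != target]       # self-connectivity is skipped
        ordered = sorted(sources, key=lambda s: -mean[s, target])
        for rank, source in enumerate(ordered, start=1):
            ranks[source, target] = rank                     # rank 1 = strongest for this target
    return mean, ranks

# Two made-up runs for three time series; entry [s, t] is the strength of s -> t.
run_a = [[9.0, 0.2, 0.5], [0.6, 9.0, 0.1], [0.3, 0.4, 9.0]]
run_b = [[9.0, 0.4, 0.7], [0.8, 9.0, 0.3], [0.1, 0.6, 9.0]]
mean_strength, ranks = rank_per_target([run_a, run_b])
```

Because each column is ranked on its own, a rank of 1 for target X1 and a rank of 1 for target X2 say nothing about which of the two connections is stronger in absolute terms.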
