**1. Introduction**

In the past few years, the use of bipartite networks for the representation of realworld complex systems has become widespread in a variety of fields and applications. These networks are usually constructed using multi-sample, multi-variate structured data used to model complex systems such as biological networks (enzymes and reactions [1], genes and diseases [2], plants and pollinators [3]), movies and actors [4,5], authors and papers [5,6], board of directors members and companies [7,8], companies and technologies they patent [9], members of peer-to-peer networks and data provided [10], international NGO branches and cities hosting them [11], supreme court judges and their votes [12], and legislators and bills they sponsor [13].

A prominent example is the bipartite network formed by countries and the products they export. This type of data has been used extensively in the field of economic complexity (EC) [14,15] to assess various quantities of interest for the modelling of the economic development of countries. The first one is the competitiveness of countries and the sophistication of products [16–20], and the relatedness between products, countries, or between countries and products [21,22]. With respect to the datasets implemented in the literature up to now, the dataset we use in this paper adds the inclusion of services to the set of tangible products traditionally considered in the EC literature [23,24].

**Citation:** Saenz de Pipaon Perez, C.; Zaccaria, A.; Di Matteo, T. Asymmetric Relatedness from Partial Correlation. *Entropy* **2022**, *24*, 365. https://doi.org/10.3390/e24030365

Academic Editor: Stanisław Drozd˙ z˙

Received: 10 January 2022 Accepted: 24 February 2022 Published: 3 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

An agreed definition of relatedness still does not exist, despite the vast number of applications of this concept, that ranges from forecasting industrial upgrading [25] to its use as an explanatory variable in a number of different contexts (see [26] and references therein). In most cases, one computes a projection of the bipartite network (e.g., country-product) onto one of the two sets of nodes to obtain a monopartite network (e.g., product-product) [21,22,27]; the relatedness between the nodes of the target layer is given by the weights of the corresponding links. Since the information content of the projected network is always smaller than that in the bipartite network, the choice of the method employed to achieve this is highly non-trivial. The resulting network should be a meaningful representation of the bipartite network for the specific problem being tackled while minimising the information loss due to the projection. There are several methods available in the literature to carry out this task (see [26,28]); however, to the best of our knowledge, no one takes explicitly into account the temporal structure, with the possible exception of the time-delayed cooccurrences approach described in [23,29] which, however, does not take into account the correlation between the different time series involved. This is a key element, since a comprehensive unveiling of the complex interactions between industrial sectors clearly requires a dynamical perspective.

In this paper, we tackle this issue by quantifying the average influence between industrial sectors in terms of partial correlation. To do so we introduce a framework that generalises a network generation method based on correlation-filtering called the partial correlation planar graph (PCPG) algorithm [30] in order to allow for its use with multisample multi-variate datasets. Since this methodology is particularly suitable for bipartite networks such as the ones usually studied in EC, we have called our framework biPCPG. The PCPG is an adaptation of the Planar Maximally Filtered Graph (PMFG) [31] which is in turn a further step from the Minimum Spanning Tree (MST) [32]. Fruitfully applied to financial market dynamics [33], these methods are able to capture the heterogeneity of similarities usually found at different scales of correlation in complex systems thanks to them employing a hierarchical clustering approach rather than a thresholding approach. The advantage of the PMFG over the MST is that, due to its relaxed constraints, its output network contains loops and a larger amount of information than the MST by preserving all the hierarchical properties of the MST [31].

The PCPG [30] adapts the PMFG in order to capture asymmetric interactions among variables in the system, thus producing a directed network. The PCPG achieves this by employing an edge-weighting scheme based on partial correlations, which are a measure of how the correlation of two variables is affected by a third variable. More specifically, the socalled *influence* (the difference between correlation and partial correlation) is employed to measure the similarities in the system and is used as a metric to select the edges included in the network. In our case, this formulation of relatedness allows asymmetries to be detected in the system.

As a result, the PCPG network is a weighted, connected, directed network that includes the MST as a subgraph as well as allowing for other substructures such as loops and cliques of three and four elements which add to the information content of the graph [31]. The fact that the links present in the PCPG are mostly those which correspond to the largest correlations in the system ensures the statistical robustness of the network to a high extent [34].

The PCPG was originally developed for its use on multi-variate datasets of only one sample: the time series of different stocks. In our case we have the export time series, so not only many variables (the different products) but also many samples, one of each country. In this paper we propose an extension of the PCPG, that we call biPCPG, to allow its application on multi-sample and multi-variate datasets, e.g., the export time series, by product, of many countries.

Our proposed extension to the PCPG method involves the preparation of the multisample dataset in order to apply the PCPG algorithm. This is achieved by structuring the dataset into a set of correlation matrices among the time series of products exported

by countries, averaging these, and applying the existing PCPG procedure. Following similar principles, we also adapt an existing bootstrapping procedure (see [34]) in order to determine the statistical reliability of the links present in the resulting network.

The contribution of this paper is many fold. Firstly, the biPCPG framework opens the possibility of the application of the PCPG algorithm to a wide variety of datasets with a multi-sample and multi-variate structure, including, but not limiting to, the ones usually analysed using the EC framework. Furthermore, the data-processing methodology introduced here could be utilised to apply other correlation-filtering algorithms for network generation (e.g., [31,33]).

Secondly, this paper introduces a network which describes the asymmetric relatedness among physical products (manufacturing) and services. This is an addition with respect to the networks usually present in the literature, such as the product space [21] and product taxonomy network [22], which are constituted only by products.

Thirdly, this paper introduces an adapted bootstrapping procedure to asses the reliability of the edges present in a network generated from multi-sample multi-variate datasets. Similarly to the network-generating framework, this bootstrapping procedure can be utilised to asses the reliability of edges in networks generated using alternative correlation-filtering methods with datasets with this structure.

Fourthly, in order to assess the information content of the biPCPG network we calculate two assortativity measures and run a community detection procedure, finding that meaningful clusters and connections emerge, as well as a relevant complexity-related assortativity. In summary, the biPCPG analysis unveils the average influence between industrial and service sectors, efficiently encapsulating the information about the correlation structure of the system.

Finally, we provide a Python package named "biPCPG" [35] with its documentation hosted in [36]. The 0.1.0 version of this package was used to perform all the calculations done in this paper, including the data-handling, biPCPG network generation, bootstrapping procedure and calculations done on the biPCPG network. It is worth noting that the package has a modular structure such that the data-handling and the generation of the biPCPG network are computed independently of each other. This allows the user to, for example, utilise the data-handling module to prepare a multi-sample multi-variate dataset for an alternative correlation-filtering method, or to implement the PCPG algorithm on a dataset of her choice, without the need for the dataset to have a multi-sample multi-variate structure. To the best of our knowledge, the PCPG module in the biPCPG package is the first publicly available Python implementation of the PCPG algorithm.

The rest of this paper is organised as follows. In Section 2, we describe the dataset used in this investigation and the cleaning procedure performed on it. In Section 3, we describe the set of methods to generate the biPCPG network and comment on the resulting network. In the result sections we describe the assortativity calculations and community detection procedure done on the biPCPG network and show the results obtained. Section 5 concludes.

#### **2. Data Description and Preprocessing**

The dataset used in this research project is an integration of the United Nations Commodity Trade Statistics Database (UN-COMTRADE—https://comtrade.un.org, accessed on 13 February 2019) and the International Monetary Fund's Balance of Payments data (BPM6) [37], relative to physical goods and service exports respectively. This integrated dataset was introduced in a World Bank working paper [23]. The UN-COMTRADE data consists of the amount of exports from each country per category of products (in USD). The categorisation of products is given by the World Customs Organization's (WCO) Harmonized System 2007 edition (HS2007) [38], which classifies products by using a hierarchical six-digit code depending on the category of the product. The IMF BPM6 dataset consists of the amount (in USD) of services provided abroad by each country and is collected according to the 6th edition of its manual, provided by the International Monetary

Fund (IMF). Henceforth, we will globally refer to the collection of products in COMTRADE and services in BPM6 as *sectors*.

The hierarchical structure of the HS classification allows for an aggregation from the most granular six-digit level, consisting in about 5000 different products, into a coarser twodigit level. A further aggregation of a few small (in terms of export quantities) two-digit sectors into a single two-digit sector was also performed in this dataset, leaving a total of 78, roughly homogeneous aggregated product sectors at the two-digit level. From the BPM6 part of the dataset, there are a further 22 service sectors at a comparable level of aggregation.

The aggregated dataset used in our study is therefore comprised of 78 + 22 = 100 sectors of products and services, these are listed in Table A1 in Appendix D. The data span a total of 22 years, from 1995 to 2016. As there are missing data points in some years for several countries, we apply a sanitation procedure where only countries with complete data for all sectors throughout the 22 years are kept. This reduces the dataset to from 129 countries to 99 countries. The analysed dataset has a total 99 × 100 = 9900 time series of length 22, with no missing values, representing the amount of product exports or service provisions in USD for each country.

In order to perform specific calculations (see Section 4.2), the 100 sectors in the dataset must be aggregated one level further. The product sectors can be further aggregated using what the WCO refers to as *sections*. The WCO provides a total of 21 sections which are available at [38]. In this case, services sectors can be aggregated into a single "section". Thus, in our aggregated dataset we have a total of 22 sections of sectors—21 product sections arising from the HS2007 classification, and one additional section containing the service sectors from the BPM6 dataset.

#### *Revealed Comparative Advantage Matrices*

The raw data used to construct in this paper are the amount of exports *E<sup>y</sup> <sup>c</sup>*,*p* (in USD) of a sector *p* (product or service) by a country *c* in year *y*. We compute the Revealed Comparative Advantage (RCA) [39] as

#### ratio of *c*'s exports of *p* to the total exports of *c* in year *y*

*RCA<sup>y</sup><sup>c</sup>*,*p* ratio of the world's exports of *p* to the total world's exports of all sectors in year *y*

$$=\frac{E\_{c,p}^{y}\Big/\sum\_{p'\in P}E\_{c,p'}^{y}}{\sum\_{c'\in C}E\_{c',p}^{y}\Big/\sum\_{c'\in C,p'\in P}E\_{c',p'}^{y}}$$

=

(1)

where *P* and *C* are the sets of unique sectors and unique countries in the dataset discussed above.

The use RCA is ubiquitous in the EC literature, because removes trivial dependencies from the sectors' and countries' size. When the *RCA<sup>y</sup> <sup>c</sup>*,*p* is above 1, the country is said to have a revealed comparative advantage in exporting a given sector in that year. Conversely, when *RCA<sup>y</sup> <sup>c</sup>*,*p* is below 1 the country can be thought of as not being very competitive in that particular sector. Finally, when *RCA<sup>y</sup> <sup>c</sup>*,*p* is equal to 1 the country has the expected (average) share of the world's exports in the given sector and year.

Therefore, the dataset on which we perform the following calculations consists of time series *RCAc*,*<sup>p</sup>* = (*RCA<sup>y</sup> <sup>c</sup>*,*p* : *y* ∈ *Y*) for 99 countries and 100 sectors, where *Y* is the index set of years [1995, <sup>2016</sup>]. The data is then shaped into a set of 22 matrices **RCA***y*, one for each year, where each row represents a country, each column represents a sector and each entry is the corresponding *RCA<sup>y</sup> <sup>c</sup>*,*p* value.

#### **3. Methods: The biPCPG Framework**

*3.1. Methodology Description*

Before discussing the detailed implementation of the biPCPG methodology, here we provide a summarised description of our procedure; a visual representation can be found in Figure 1.

Given the multi-sample nature of the dataset analysed, a series of data-preprocessing steps are needed before the application of PCPG. The PCPG algorithm takes a single correlation matrix as an input and outputs a network (see Section 3.5). In order to obtain our biPCPG network, along with reliability values for its edges from a multi-sample dataset, we need two main procedures, a "Network generating procedure" and a "Bootstrapping procedure".

The "Network generating procedure" is shown in the black box in Figure 1 and deals with the data handling necessary to obtain a PCPG network from a dataset with a multisample structure. In our case, we are interested in obtaining a biPCPG network where nodes are sectors, therefore the input matrix should describe the correlations between sectors.

To find this input correlation matrix, the initial step is to shape the dataset such that, for each country, we have a matrix where the columns are the relevant time series of each sector. We then compute a correlation matrix for each of these time series matrices. Finally, we average these correlation matrices over countries to obtain an average correlation matrix which serves as the input to the PCPG algorithm, i.e., the last step in the biPCPG framework. The output of the biPCPG algorithm is the network we refer to as *G*, as well as the weights of the edges in contains, i.e., the average influence between sectors.

**Figure 1.** Flowchart of procedures and methods involved in obtaining the final biPCPG network.

The "Bootstrapping procedure" of our framework, shown in the grey box in Figure 1, deals with the bootstrapping procedure necessary to asses the reliability of the edges in the biPCPG network obtained. This starts from the country time series matrices, which are bootstrapped *R* times, obtaining a "batch" of replicates each time. Each of these batches contains *C* matrices, one for each country, where the rows have been drawn coherently from their corresponding original country matrices. This is done in order to randomise the time dimension while preserving the correlation structure across countries (see Section 3.6). We then replicate the "Network generating procedure" described above by treating each batch of replicates as a new dataset of country time series matrices and follow the steps to obtain a replicate biPCPG network. This means that, for each batch, we calculate a correlation matrix for every time series matrix, we then average across these correlation matrices and use the average correlation matrix as an input to the PCPG algorithm. Repeating this procedure for all *R* batches we obtain *R* replicate networks. We find the fraction of times

each edge in *G* appears in the replicate networks, which is a measure of the reliability of the edge.

#### *3.2. Partial Correlations and Average Influence: Definitions*

As described in the original PCPG paper (see [30]), the starting point of our analysis is the *partial correlation*, which measures the effect that a random variable *Z* has on the correlation between two other random variables, *X* and *Y*. The partial correlation *ρ*(*<sup>X</sup>*,*<sup>Y</sup>* : *Z*) is defined in terms of the Pearson correlations *ρ*(·, ·) between the three variables, formally

$$\rho(X,Y:Z) = \frac{\rho(X,Y) - \rho(X,Z)\rho(Y,Z)}{\sqrt{[1-\rho^2(X,Z)][1-\rho^2(Y,Z)]}}.\tag{2}$$

A small value of *ρ*(*<sup>X</sup>*,*<sup>Y</sup>* : *Z*) may be ambiguous, as this could be due to the correlations among the three variables being small; or due to variable *Z* having a strong effect on the correlation between *X* and *Y*, which is generally the interesting case. In order to discriminate between these two cases the *correlation influence* or *influence* of variable *Z* on the pair of elements *X* and *Y* is used. This is defined as

$$d(X, \mathcal{Y}: Z) \equiv \rho(X, \mathcal{Y}) - \rho(X, \mathcal{Y}: Z). \tag{3}$$

We define the *average influence* of variable *Z* on the correlations between *X* and all other variables in the system as follows:

$$d(X:Z) = \langle d(X, \Upsilon:Z) \rangle\_{\mathcal{Y} \neq X}. \tag{4}$$

We anticipate that the average influence will be the input of the network building algorithm also described in [30].

Note that, potentially, there could be certain values of *measured* correlations *ρ*(*<sup>X</sup>*,*<sup>Y</sup>*), *ρ*(*<sup>X</sup>*, *Z*) and *ρ*(*<sup>Y</sup>*, *Z*) that lead to a *measured* partial correlation *ρ*(*<sup>X</sup>*,*<sup>Y</sup>* : *<sup>Z</sup>*), to be out of its defined range [−1, 1]. In our analysis, this occurred in 0.02% of the partial correlations computed. In these cases, partial correlations were set to be undefined (*NaN* in programming terms) which in turn makes the influence values based on these partial correlations also undefined. Similarly to the undefined correlation values described above, these undefined influences are not included in calculation of average influence *d*(*X* : *<sup>Z</sup>*).

Some of the values obtained for *ρ*(*<sup>X</sup>*,*<sup>Y</sup>*), *ρ*(*<sup>X</sup>*,*<sup>Y</sup>* : *<sup>Z</sup>*), *d*(*<sup>X</sup>*,*<sup>Y</sup>* : *Z*) and *d*(*X* : *Z*) in our dataset and their interpretation are discussed in Section 3.4. An important point is that, in general, *d*(*X* : *Z*) = *d*(*Z* : *<sup>X</sup>*): the influence is asymmetric, and the largest among these two quantities indicates the main direction of influence between *X* and *Z*. For example, in our dataset when *X* = Glass and *Z* = Furniture, the average influence of Furniture on Glass *d*(*X* : *Z*) = 0.03 while the corresponding reverse average influence of Glass on Furniture *d*(*Z* : *X*) = 0.29, suggesting that the direction of influence is from Glass to Furniture and not vice-versa. This, however, is an example of a clear-cut case, where difference between the two average influence values is not small. In general, these differences tend to be much smaller. This can be an effect of the complex relationship and mutual interaction between the economic sectors, or a consequence of the noise present in the data. This makes a bootstrapping procedure necessary in order to asses the statistical confidence in the overall direction of influence, as well as the average influence values themselves. We will discuss the bootstrapping procedure in Section 3.6.

#### *3.3. Average Correlation Matrix*

The input to the PCPG algorithm is a correlation matrix [30]. In our procedure, to allow its use on our multi-sample dataset, this correlation matrix is replaced by an average correlation matrix over countries. In order to obtain this average correlation matrix, we reshape the 22 **RCA***y* matrices into a total of *C* = 99 matrices, one for each country, each consisting of *T* = 22 rows and *P* = 100 columns. We denote these **TS***<sup>c</sup>*, *c* ∈ 1, ... , *C*. In this way, the columns of each matrix **TS***c* are the *RCAc*,*<sup>p</sup>* time series of the given country *c*, where each column represents a sector *p* in the dataset.

In order to obtain the input matrix to the PCPG algorithm, we first find *C* correlation matrices denoted **K***<sup>c</sup>*, *c* ∈ 1, ... , *C* from the pair correlations between the columns of each matrix **TS***<sup>c</sup>*. Thus the entries of the country correlation matrix **K***c* are given by

$$\rho \begin{pmatrix} \mathcal{K}\_{\mathfrak{c}} \end{pmatrix}\_{p, p'} = \rho \left( \left( T \mathbb{S}\_{\mathfrak{c}} \right)\_{\ast, p'} \left( T \mathbb{S}\_{\mathfrak{c}} \right)\_{\ast, p'} \right) = \rho \left( R \mathcal{C} A\_{\mathfrak{c}, p'}, R \mathcal{C} A\_{\mathfrak{c}, p'} \right) \tag{5}$$

where *ρ* is the Pearson correlation, the subscript ∗, *p* denotes the column *p* of the matrix and *RCAc*,*<sup>p</sup>* is the RCA time series for country *c* and sector *p*.

For each correlation value we obtain p-value via a two-sided T-test procedure [40]. Given we are performing multiple tests, we apply a False Discovery Rate (FDR) correction to obtain *adjusted* p-values via the Benjamini–Hochberg (BH) procedure [41]. We choose the BH procedure since it ultimately allows the inclusion of more information in the biPCPG network than a more restrictive correction procedure such as the Bonferroni correction [42]. Note that the FDR correction has been extensively used in the literature for the statistical validation of networks and, in particular, it has been previously used to validate networks representing bipartite complex systems [43].

We reject non-statistically significant correlation samples when the adjusted p-value is above a critical value of 0.01. In these cases, the corresponding entries to the **K***c* matrix are marked as undefined. The same procedure for obtaining country correlation matrices was also performed without the FDR correction for the 0.01 and alternative critical values. This produced networks which have the same main features as the network presented below, including the main hub nodes, clusters of sectors and communities detected.

Once the country correlation matrices **K***c* are found, we then compute the element-wise mean of these matrices, obtaining the average correlation matrix **K¯** with entries

$$\mathcal{K}\_{p,p'} = \frac{1}{\mathbb{C}} \sum\_{c=1}^{\mathbb{C}} \left( \mathbb{K}\_c \right)\_{p,p'} \tag{6}$$

where row and column indices *p* and *p* denote economic sectors. Any undefined correlation is discarded during the averaging process.

Note that, using this notation, the correlations *ρ*(·, ·) mentioned in Section 3.2, are replaced by the average correlations *K*¯ *p*,*p* described here. This leads to an equivalent expression for the partial correlation

$$\rho(p, p': p'') = \frac{\mathbb{K}\_{p, p'} - \mathbb{K}\_{p, p''} \mathbb{K}\_{p', p''}}{\sqrt{\left[1 - \left(\mathbb{K}\_{p, p''}\right)^2\right] \left[1 - \left(\mathbb{K}\_{p', p''}\right)^2\right]}} \cdot \tag{7}$$

#### *3.4. Partial Correlation and Average Influence: Empirical Analysis*

In order to clarify the meaning of the intermediate quantities that are used to build the biPCPG network, we devote this subsection to the discussion of some empirical features.

Bearing in mind how the influence of a variable on the correlation of two other variables is defined (see Equation (3)), we explore four examples of the results obtained from these computations. Note that, in the description below, the variables *X*, *Y* and *Z* used in the definition of Equation (3), are replaced by sectors of our system. Thus, the partial correlation column in Table 1 describes the average correlation, *K*¯ *p*,*p* , between sectors *p* and *p* accounting for the effect of a third sector *p*, and similarly for the influence column. We therefore denote these quantities *ρ*(*p*, *p* : *p*) and *d*(*p*, *p* : *p*), respectively.

Example 1 shown in Table 1 is an example of the case described in Section 3.2, which shows a very small partial correlation due to all correlations among the three variables being small. By definition, this makes the resultant influence value is small, which reduces the average influence of the sector "Other textile" on the sector "Cereals", making the appearance of this edge in the network less probable.

Example 2 also shows a case where the partial correlation between *p* and *p*, accounting for the effect of *p*, is small. However, contrary to the case in Example 1, this is due to *p* strongly affecting the correlation between *p* and *p*, i.e., *ρ*(*p*, *p* ) ∼ *ρ*(*p*, *p*)*ρ*(*p* , *p*). Therefore, the resulting influence is relatively high, which increases the probability of an edge from "Cultural" to "Audiovisual" being present in the biPCPG network. In addition, note that the probability of an edge from "Cultural" to "Audiovisual" also increases under these results, due to the symmetry between the *p* and *p* variables.

In Example 3, we have a case where the correlation between *p* and *p* is relatively strong and variable *p* has a small effect on it. This is due to the similar values of the correlation *ρ*(*p*, *p* ) and the partial correlation *ρ*(*p*, *p* : *p*). Therefore, the resulting influence of "Knitted clothing" on the correlation between the "Pigments" and "Aluminium" sectors is close to zero.

Finally, Example 4 shows a seemingly counter-intuitive case where the correlation between *p* and *p* is small while their partial correlation given *p* is negative, yielding a high influence. A negative partial correlation occurs when the correlation between *p* and *p* is small but both *p* and *p* have a high correlation with *p*. In this case, the influence of "Plastics" can be interpreted as preventing the correlation *ρ*(*p*, *p* ) between "Vehicles" and "Earths and stone" from being lower, or being negative.


**Table 1.** Examples of values used in the computations of influence *d*(*p*, *p* : *<sup>p</sup>*).

It is important to note that the average influence values among sector pairs determine the structure of any PCPG network (see Section 3.5). Figure 2 displays a scatter plot that shows the correlation *ρ*(·, ·) and average influence *d*(·, ·) among all *N*(*N* − 1) = 9900 pairs of sectors in our biPCPG network. Note that this includes data points for both *d*(*p* : *p*) and *<sup>d</sup>*(*p* : *p*) influences at the same horizontal coordinate as the correlation between *p* and *p* is symmetric.

This plot shows that the average influence between a pair of sectors is highly correlated with the correlation between the same pair of sectors, showing a very narrow 95% confidence interval (barely visible as it is only slightly wider than the fit line). See Appendix B for details on the calculation of the confidence and prediction intervals shown in Figure 2.

This is not surprising given how the average influence is calculated; however, the relatively high coefficient of determination *R*<sup>2</sup> = 0.58 indicates that, generally, the partial correlation values obtained are relatively small. This may be due to there actually not being

large influences between the sectors, or due to limitations of the dataset. For example, hidden influences between the sectors could potentially be detected in datasets with longer time series.

In Figure 2, we can observe that most of the correlations (around 80%) are positive. Around 10.7% of the pairs of sectors with positive correlations have an average influence below zero. This quantity is over an order of magnitude larger than its counterpart, the percentage of pairs of sectors with negative correlation but a positive average influence, which is around 0.47%.

**Figure 2.** Plot showing correlation and average influence values among all 9900 pairs of sectors in the system. A line of best fit among the points is shown in red along with the coefficient of determination *R*<sup>2</sup> = 0.58, with the 95% confidence interval limits in light blue and the 95% prediction interval limits in dashed grey lines. Note the confidence interval is so narrow it is only visible at the edges of the red best fit line upon close inspection.

#### *3.5. Network Construction*

The construction algorithm of a PCPG network starts with a list of the *N*(*N* − 1) average influence values in decreasing order and an empty graph of *N* nodes and no edges, where *N* is the number of variables in the system. In our case, we have *N* = 100 economic sectors. We then cycle through the sorted list, starting with the largest average influence value found, e.g., *d*(*p* : *p*), where *p* and *p* are a given pair of products. The edge *p* → *p* is included in the network if and only if the resulting network is still planar and the edge *p* → *p* has not been included already. We stop adding edges if adding the next edge in the list would break the planarity of the graph. This procedure ensures two things: (i) only the largest among *d*(*p* : *p*) and *<sup>d</sup>*(*p* : *p*) will be included in the network, and (ii) the final network has 3(*N* − 2) edges. It is important to note that for a given input correlation matrix of size *N* × *N* the PCPG network will always have 3(*N* − 2) edges and that the identity of these edges solely depends on the correlation values in the input matrix.

The final result of this procedure is what we refer to as the biPCPG network, *G*. Naturally, we also obtain the average influence *d* associated to each edge in *G*, as well as the network's adjacency matrix **A** defined as

$$A\_{p,p^{\prime\prime}} = \begin{cases} 1 & \text{if edge } p \to p^{\prime\prime} \in \mathcal{G}\_{\prime} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

#### *3.6. biPCPG Bootstrapping*

To assess the reliability of the links in the biPCPG network, we adapt a bootstrapping procedure originally introduced in [34]. The aim is to obtain a bootstrap value for each link which is proportional to the reliability of the link.

We build *R* batches, where the matrices to be bootstrapped in each batch are the time series matrices of all countries **TS***c* ∀ *c* ∈ 1, ... , *C*. From each matrix **TS***<sup>c</sup>*, a replicate time series matrix **TS***rc* ∀ *r* ∈ 1, ... , *R* is obtained, where *R* = 1000 is the total number of batches. An important feature of our procedure is how the null model, i.e., the replicate time series matrices, is generated. For each batch, the bootstrapping of the time series matrices is done coherently across countries. This means rows are drawn with repetition from each of the country matrices *jointly*—the same row indices are selected across the matrices. In addition, the new locations of the selected rows in their corresponding replicate matrices are exactly the same. This way, in the replicate time series matrices, **TS***rc*, the time structure of the time series is destroyed while preserving the country-level correlations.

Take, for example, the first batch, *r* = 1. In order to obtain the first batch of replicate matrices **TS**1*c* ∀ *c* ∈ 1, ... , *C*, we randomly select a sequence of *T* = 22 row indices, allowing repetitions. These row indices denote which rows from the original matrices **TS***c* are included in the corresponding replicates **TS**1*c* in this batch, as well as their order. This way, any row of a replicate matrix in this first batch will contain data points corresponding to the same year as rows of the same index in all the other replicate matrices in the batch.

After all the replicate matrices are obtained for all countries and batches, we calculate a replicate correlation matrix **K***rc* for each of them, rejecting non-statistically significant samples as described in Section 3.3. We then find the element-wise mean of the replicate correlation matrices in each batch *r*, obtaining *R* replicate average correlation matrices **K ¯** *r* where

$$
\bar{\mathcal{K}}\_{p,p'}^r = \frac{1}{\mathbb{C}} \sum\_{c=1}^{\mathbb{C}} \left( \mathbb{K}\_c^r \right)\_{p,p'}.\tag{9}
$$

Note that, similarly to the replicate time series matrices, in these replicate correlations matrices the time structure of the time series is destroyed while preserving the countrylevel correlations due to the way the bootstrapping has been performed.

We then apply the PCPG algorithm described in Section 3.5 to each matrix **K ¯** *r*, obtaining *R* replicate adjacency matrices, **A***r* ∀ *r* ∈ 1, . . . , *R*.

To compute the bootstrap value, *bp*,*p* , for each link *p* → *p*, we evaluate the number of time the link appears in the replicate adjacency matrices **A***<sup>r</sup>*, and normalise by the number of replicates *R*, formally

$$b\_{p,p''} = \frac{\sum\_{r=1}^{K} A\_{p,p''}^{r}}{R} \tag{10}$$

Each bootstrap value is therefore some number in the interval [0-1] and is proportional to the reliability of the link.
