**4. Results**

#### *4.1. Descriptive Analysis of the biPCPG Network*

The network *G* resulting from the application of the biPCPG method to our dataset is shown in Figure 3. The network has a few distinct hub nodes, the most noticeable being the "Plastics", "Pigments" and "Vegetables" nodes. Hub nodes also tend to have a high average influence on other nodes in the network, as shown by the width of the edges stemming from them. The colour of an edge represents its bootstrap value. The hub nodes are also the source of most of the darker edges in the network, i.e., the most reliable edges; this is especially true of the "Plastics" node, whose edges have very high bootstrap values.

**Figure 3.** The biPCPG network. The widths of the edges are proportional to the average influence value, $d(p, p')$, they represent. The colours of the edges are proportional to their bootstrap value, $b_{p,p'}$. The darker the edge, the more reliable it is. Node colours represent the sector section each product or service belongs to. Node sizes are proportional to out-degree. The node layout was found using the ForceAtlas2 algorithm [44].

The resulting network also displays distinct clusters of intuitively related economic sectors. For example, the most recognisable "food and plant" cluster can be found at the bottom-right of the network, surrounding the "Vegetables" hub node. At the top-left of the network, we can observe another distinct cluster containing several sectors related to chemicals or raw materials. Finally, at the top-right of the network, surrounding the "Plastics" and "Pigments" nodes, one can find a "macro-cluster" formed mostly by industrial and manufacturing sectors.

It is worth noting that, while most edges connect intuitively related sectors, there are several cases of less intuitive connections spread around the network. As a result, some seemingly unrelated sectors are included in the clusters mentioned above. This is partially due to the original construction of the PCPG algorithm, which fixes the number of edges included in the network. Edges representing small influences among sectors can therefore be forced into the network. In our case, around 5% of the edges in the biPCPG network represent average influence values of 0.05 or smaller.

#### *4.2. Assortativity Analysis*

As described in Section 2, the 100 sectors in our dataset can be grouped into 22 groups of sectors called *sections*. Furthermore, a key metric within the field of economic complexity is the *complexity* of a product or service, which measures the capabilities needed by a country to produce it (see Appendix A). In order to better understand the structure of this network, and by extension the information contained in it, one can then investigate its *homophily* or *assortativity* according to these characteristics. Roughly speaking, this is the tendency for nodes belonging to the same group to be connected to each other. In this paper, we make use of two different assortativity metrics which we describe below. The motivation behind this analysis is to assess if our framework generates a meaningful network which is able to synthesise information about the system.

#### 4.2.1. Assortativity by Unordered Characteristics

This quantity is used to measure the assortativity between, for example, nodes with an associated qualitative characteristic such as, in our case, sector sections, *s* (see Section 2). The *assortativity coefficient* is defined as [45]

$$s_s = \frac{\text{Tr}\,\mathbf{F} - ||\mathbf{F}^2||}{1 - ||\mathbf{F}^2||} \tag{11}$$

where the entries $F_{s,s'}$ of the matrix **F** are the fractions of edges in the network that connect a vertex of section $s$ to one of section $s'$, and ||**X**|| is the sum of all elements of a matrix **X** [45]. The numerator therefore measures the fraction of the edges in the network that connect vertices of the same section (i.e., within-section edges) minus the expected value of the same quantity in a network with the same community divisions but random connections between the vertices. The denominator is one minus the same expected value.

This formula gives $s_s = 0$ when there is no assortative mixing and $s_s = 1$ when there is perfect assortative mixing. For a perfectly disassortative network, the value lies in the range $-1 \le s_s < 0$ (see [45] for its interpretation). We evaluate this metric for the sections of sectors described in Section 2, denoting this by the subscript *s*.
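As an illustration of Equation (11), the following minimal Python sketch computes $s_s$ from a list of directed edges and a node-to-section mapping. The function name `section_assortativity`, the toy sector names and the section labels are hypothetical and are not taken from our dataset or code.

```python
def section_assortativity(edges, section):
    """Assortativity coefficient of Eq. (11) for directed edges (source, target)."""
    labels = sorted(set(section.values()))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    m = len(edges)

    # F[i][j]: fraction of edges that run from section i to section j.
    F = [[0.0] * n for _ in range(n)]
    for source, target in edges:
        F[index[section[source]]][index[section[target]]] += 1.0 / m

    trace = sum(F[i][i] for i in range(n))
    # ||F^2|| (sum of all entries of the product F F) equals
    # sum_i a_i * b_i, with a the row sums and b the column sums of F.
    a = [sum(row) for row in F]
    b = [sum(F[i][j] for i in range(n)) for j in range(n)]
    expected = sum(a[i] * b[i] for i in range(n))
    return (trace - expected) / (1.0 - expected)


# Toy data (hypothetical): two sections with two sectors each.
toy_section = {"wheat": "food", "rice": "food", "steel": "metals", "copper": "metals"}
within = [("wheat", "rice"), ("rice", "wheat"), ("steel", "copper"), ("copper", "steel")]
between = [("wheat", "steel"), ("steel", "wheat"), ("rice", "copper"), ("copper", "rice")]

s_within = section_assortativity(within, toy_section)    # all edges within sections
s_between = section_assortativity(between, toy_section)  # all edges between sections
print(s_within, s_between)
```

On this toy input, the edge list that stays within sections gives the upper extreme $s_s = 1$, while the list that only links different sections gives $s_s = -1$. The sketch does not guard against the degenerate case in which the denominator of Equation (11) vanishes.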

#### 4.2.2. Assortativity by Scalar Characteristics

A measure of assortativity for numeric quantities associated with nodes can also be defined [45]. In this case, the entries $F_{q,q'}$ of the matrix **F** are the fractions of all edges in the network that connect nodes with associated scalar values $q$ and $q'$. The values $q$ and $q'$ are discrete: in our case they are the *Complexity rank* [17] of sectors, computed by taking the average complexity *value* of each product (across the available years in our dataset) and ranking these averages from highest to lowest. The complexity of a product or service is a well-known quantity in the economic complexity literature that describes the capabilities needed by a country to produce it; see Appendix A for its definition. The *numeric assortativity coefficient* is defined as

$$s_q = \frac{\sum_{q,q'} q q' \left( F_{q,q'} - a_q b_{q'} \right)}{\sigma_a \sigma_b} \tag{12}$$

where $a_q = \sum_{q'} F_{q,q'}$, $b_{q'} = \sum_{q} F_{q,q'}$, and $\sigma_a$ and $\sigma_b$ are the standard deviations of the distributions of $a_q$ and $b_{q'}$, respectively. The value of $s_q$ lies in the range $-1 \le s_q \le 1$, with $s_q = 1$ indicating perfect assortativity and $s_q = -1$ indicating perfect disassortativity. Typically, assortativity values in the range 0.3–0.7 are considered to indicate a significant community structure in social networks (higher values are rare) [46,47].
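Equation (12) is the Pearson correlation between the scalar values found at the two ends of each edge, so it can be evaluated by summing over edges instead of building **F** explicitly. The sketch below does this in plain Python; the function name `rank_assortativity` and the toy ranks are hypothetical and only serve as an illustration.

```python
import math


def rank_assortativity(edges, rank):
    """Numeric assortativity of Eq. (12) for directed edges (source, target).

    Summing over edges is equivalent to summing over the entries of F:
    sum_{q,q'} q q' F_{q,q'} is the mean of q*q' over edges, and a_q, b_{q'}
    are the distributions of the source and target values.
    """
    m = len(edges)
    x = [rank[source] for source, _ in edges]  # value at the source end
    y = [rank[target] for _, target in edges]  # value at the target end
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / m
    sigma_a = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / m)
    sigma_b = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / m)
    return covariance / (sigma_a * sigma_b)


# Toy data (hypothetical ranks, not taken from the dataset).
toy_rank = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3, "f": 3}
similar = [("a", "b"), ("c", "d"), ("e", "f")]     # equal ranks are linked
dissimilar = [("a", "e"), ("c", "d"), ("f", "b")]  # high ranks link to low ranks

s_similar = rank_assortativity(similar, toy_rank)
s_dissimilar = rank_assortativity(dissimilar, toy_rank)
print(s_similar, s_dissimilar)
```

The first toy edge list links only nodes of equal rank and sits at the assortative extreme, while the second links the highest ranks to the lowest and sits at the disassortative extreme. The sketch assumes that the values at each end of the edges are not all identical, since otherwise $\sigma_a$ or $\sigma_b$ is zero.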

#### 4.2.3. Assortativity Results

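The results for the two assortativity metrics defined above are $s_s = 0.15$ for the *Assortativity by sector section* and $s_q = 0.19$ for the *Assortativity by sector mean complexity rank*.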


These results indicate that the structure of the resulting biPCPG network encodes information efficiently. Firstly, the *Assortativity by sector section*, $s_s = 0.15$, is positive. This means that sectors belonging to the same *section* (see Section 2) tend to be connected in the network, i.e., they influence each other. The section of each sector is reflected in Figure 3 by the colour of the node. The most evident clustering of sectors within the same section is at the top of the plot, where there is a highly connected cluster of service sectors.

Furthermore, the moderately high *Assortativity by sector mean complexity rank*, $s_q = 0.19$, indicates that sectors with similar levels of complexity tend to influence each other. This makes intuitive sense since, according to the economic complexity literature, such sectors tend to be connected in other networks that describe the relationships among products (e.g., the product space network and the product taxonomy network [21,22]).

#### *4.3. Community Detection on the biPCPG Network*

We apply a well-known community detection algorithm for directed networks based on spectral optimisation [48]. The modularity, or quality function, to be maximised is

$$Q^{\rm dir} = \frac{1}{m} \sum_{p, p''} \left( A_{p, p''} - \frac{k_p^{\rm out} \, k_{p''}^{\rm in}}{m} \right) \delta \left( \nu_{p}, \nu_{p''} \right) \tag{13}$$

where **A** is the adjacency matrix, $k_p^{\rm in}$ and $k_p^{\rm out}$ are the weighted in-degree and out-degree of node $p$, $m$ is the total edge weight in the network, $\nu_p$ is the community of node $p$, and $\delta(\nu_p, \nu_{p''}) = 1$ if $\nu_p = \nu_{p''}$ and 0 otherwise. This method does not require any parameter choices relating to community size or number of communities; however, adaptations of this method that allow for these choices are available in the literature. It is worth pointing out that, for the analysis carried out in this paper, all edge weights are set to 1. In Equation (13), this reduces the weighted in-degree and out-degree to the plain in- and out-degree and fixes $m = 294$, the total number of edges in the network.
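To make Equation (13) concrete for the unweighted case used here, the following sketch evaluates $Q^{\rm dir}$ for a given partition by direct summation. It only evaluates the quality function and does not perform the optimisation; the function name `directed_modularity` and the toy graph are hypothetical.

```python
def directed_modularity(edges, community):
    """Q^dir of Eq. (13) for unweighted directed edges (source, target)."""
    edge_set = set(edges)
    m = len(edge_set)
    k_out = {node: 0 for node in community}
    k_in = {node: 0 for node in community}
    for source, target in edge_set:
        k_out[source] += 1
        k_in[target] += 1

    total = 0.0
    for p in community:
        for p2 in community:
            if community[p] == community[p2]:  # Kronecker delta
                a = 1 if (p, p2) in edge_set else 0
                total += a - k_out[p] * k_in[p2] / m
    return total / m


# Toy graph (hypothetical): two directed 3-cycles joined by the edge 2 -> 3.
toy_edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_two = directed_modularity(toy_edges, two_groups)
q_one = directed_modularity(toy_edges, one_group)
print(q_two, q_one)
```

For this toy graph, splitting the two cycles gives $Q^{\rm dir} = 18/49 \approx 0.37$, whereas placing all nodes in a single community gives $Q^{\rm dir} = 0$, as expected from Equation (13).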

Since there is no universal definition of communities in directed networks, we also apply the same community detection algorithm to the undirected version of the biPCPG network, $G^{\rm und}$. In this case, the modularity to be maximised is given by

$$Q^{\rm und} = \frac{1}{2m} \sum_{p, p''} \left( A^{\rm und}_{p, p''} - \frac{k_p k_{p''}}{2m} \right) \delta \left( \nu_{p}, \nu_{p''} \right) \tag{14}$$

where $\mathbf{A}^{\rm und}$ is the undirected adjacency matrix which defines the undirected network $G^{\rm und}$. It can be obtained from the adjacency matrix **A** of the directed biPCPG network *G* as follows:

$$A_{p,p''}^{\text{und}} = \begin{cases} 1 & \text{if } A_{p,p''} = 1 \text{ or } A_{p'',p} = 1, \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$

This allows us to qualitatively assess whether the structure of the biPCPG network alone is sufficient for reasonable communities to be detected, without the bias of the information contained in the average influence or bootstrap values associated with the edges. We implement this algorithm via the *leidenalg* Python package (version 0.8.4) [49], an implementation of the *Leiden* algorithm for modularity optimisation.
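The construction in Equation (15) and the quality function in Equation (14) can be illustrated on a toy graph as follows. This is a minimal sketch that assumes $m$ in Equation (14) is the number of edges of the undirected network; the helper names `symmetrise` and `undirected_modularity` and the toy edge list are hypothetical, and the sketch does not use the *leidenalg* package.

```python
def symmetrise(edges):
    """Eq. (15): one undirected edge for every pair linked in either direction."""
    return {frozenset((u, v)) for u, v in edges if u != v}


def undirected_modularity(und_edges, community):
    """Q^und of Eq. (14); m is taken as the number of undirected edges."""
    m = len(und_edges)
    degree = {node: 0 for node in community}
    for edge in und_edges:
        for node in edge:
            degree[node] += 1

    total = 0.0
    for p in community:
        for p2 in community:
            if community[p] == community[p2]:  # Kronecker delta
                a = 1 if frozenset((p, p2)) in und_edges else 0
                total += a - degree[p] * degree[p2] / (2 * m)
    return total / (2 * m)


# Toy directed graph (hypothetical); 0 -> 1 and 1 -> 0 collapse into one edge.
toy_directed = [(0, 1), (1, 0), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
toy_undirected = symmetrise(toy_directed)
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

q_und = undirected_modularity(toy_undirected, two_groups)
print(len(toy_undirected), q_und)
```

The eight directed edges of the toy graph reduce to seven undirected ones because the reciprocal pair is counted once, and the two-triangle partition gives $Q^{\rm und} = 5/14 \approx 0.36$.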

Note that optimising modularity is an NP-hard problem [50], so heuristics are needed for algorithms to be efficient. One of the steps in the *Leiden* algorithm used here involves selecting a random community for a node to be added to. This randomness can be controlled by passing a *seed* to the random number generator, which makes the process deterministic: the same communities are selected every time the algorithm is run on a given network with the same seed value. In our analysis, we tested several seed values and found that the detected communities varied only for a few nodes, with many seed values returning exactly the same partition. The results shown in Section 4.3 were obtained using 1 as the seed; many of the other seed values tested gave the same results.

Furthermore, we compare the communities obtained for the directed and undirected versions of the network for seed values 1, ..., 1000 via the *Adjusted Mutual Information* (*AMI*) [51]. Consider our set $P$ of $N$ sectors and two partitions of $P$, namely $U = \{U_1, U_2, \ldots, U_J\}$ with $J$ pairwise-disjoint clusters found by maximising $Q^{\rm und}$ for the undirected version of the network, and $V = \{V_1, V_2, \ldots, V_D\}$ with $D$ pairwise-disjoint clusters found by maximising $Q^{\rm dir}$ for the directed version of the network. The *AMI* between the two partitions is then defined as

$$AMI(U, V) = \frac{MI(U, V) - E\{MI(U, V)\}}{\max\{H(U), H(V)\} - E\{MI(U, V)\}} \tag{16}$$

where $MI(U, V)$ is the mutual information between the two partitions, $E\{MI(U, V)\}$ is the expected mutual information, and $H(U)$ and $H(V)$ are the entropies associated with partitions $U$ and $V$, respectively. The *AMI* equals 1 when the two partitions are identical and 0 when the *MI* between them equals its expected value; it therefore serves as a similarity measure for the two partitions. For further details on its calculation, see [51]. In Section 4.3, we give the *average AMI* obtained for the 1000 seed values tested, computed using the *scikit-learn* Python package (version 0.23).
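For readers who want to see how the terms of Equation (16) fit together, the sketch below evaluates them in plain Python for two label lists. It is an illustrative re-implementation and not the code used for our results: the function names are hypothetical, the expected mutual information uses the hypergeometric model of random partitions with fixed cluster sizes (the model underlying the AMI), and the normalisation is the `max` of Equation (16). In scikit-learn, the corresponding function is `sklearn.metrics.adjusted_mutual_info_score`, whose `average_method` argument selects the normalisation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a partition given as a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI(U, V) computed from the contingency counts of the two label lists."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((nij / n) * math.log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(u, v):
    """E{MI(U, V)} for random partitions with the same cluster sizes."""
    n = len(u)
    a, b = Counter(u), Counter(v)

    def log_factorial(k):
        return math.lgamma(k + 1)

    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log-probability of an overlap of size nij (hypergeometric).
                log_prob = (log_factorial(ai) + log_factorial(bj)
                            + log_factorial(n - ai) + log_factorial(n - bj)
                            - log_factorial(n) - log_factorial(nij)
                            - log_factorial(ai - nij) - log_factorial(bj - nij)
                            - log_factorial(n - ai - bj + nij))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_prob)
    return emi


def adjusted_mutual_information(u, v):
    """AMI of Eq. (16) with max-entropy normalisation."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    return (mi - emi) / (max(entropy(u), entropy(v)) - emi)


# Toy partitions (hypothetical labels for eight nodes).
undirected_labels = [0, 0, 0, 1, 1, 1, 2, 2]
directed_labels = [2, 2, 2, 0, 0, 0, 1, 1]  # same grouping, different label names
ami_same = adjusted_mutual_information(undirected_labels, directed_labels)
ami_crossed = adjusted_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(ami_same, ami_crossed)
```

The first pair of label lists describes the same grouping under different label names, so its AMI is 1 up to rounding. The second pair shares no information, so its MI falls below the expected value and the AMI is negative. The sketch does not handle the degenerate case in which the denominator of Equation (16) is zero, for example when both partitions consist of a single cluster.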

#### Community Detection Results

The community detection procedure described above yielded 5 distinct communities when applied to the undirected biPCPG network, $G^{\rm und}$, which we denote as communities $\nu = 1, \ldots, 5$. These communities contain 31, 22, 21, 13 and 13 sectors, respectively.

The detected communities are highlighted in Figure 4. Comparing this with Figure 3, which highlights the section of each sector, one can see that the detected communities partition the network into groups of intuitively related sectors. For example, communities 2, 3 and 5 contain mostly nodes related to industrial and chemical sectors, while community 1 captures the "food and plant" cluster described above as well as some service sectors. For community 4, it is slightly more difficult to find a common theme; however, over half of the sectors it contains are service sectors.

**Figure 4.** biPCPG network, *G*, resulting from the application of the PCPG algorithm to the mean correlation matrix $\bar{\mathbf{K}}$ between sectors' RCA time series. Nodes are grouped by their community, $\nu$, found by maximising modularity in the network. The node layout was found using the ForceAtlas2 algorithm [44].

The information structure these communities contain can be seen by sorting the rows and columns of the average correlation matrix $\bar{\mathbf{K}}$ and of the average influence matrix by community index, as shown in Figures A2 and A3 in Appendix C. We can observe, for example, that brighter colours, meaning higher values, are generally found close to the diagonal of the matrices (i.e., among sectors within the same community). This is especially noticeable for communities 1 and 2. We can also identify which rows and columns represent service sectors, as these tend to have lower correlation and average influence values with non-service sectors (depicted in dark blue) and higher values among themselves.

The average *adjusted mutual information* obtained for the 1000 seed values tested is 0.90. This very high value tells us that, on average, the partitions obtained for the directed and undirected versions of the network were very similar. This suggests that the community detection procedure depends only weakly on the version of the network (directed vs. undirected) and on the seed value used.
