**3. Discussion**

Due to increased substitution rates [37], IDP homologs are difficult to identify and correctly align. Also, their extended, floppy conformational states manifest in less contacting residue pairs per unit length compared to folded proteins/domains, also leaving less room for residue co-evolution [39]. Furthermore, IDPs excel in moonlighting [57], meaning that they can interact with multiple partners (often through the very same sequence region) and thus their evolution might well be simultaneously constrained by many different partners. However, these multiple different interactions are only poorly represented in structural databases, and residue co-variation analysis can also only be performed for one pair of sequences at a time. Another important factor is that the ensembles and interactions of IDRs are frequently modulated by post-translational modifications (although these are largely limited to eukaryotic proteins), which currently cannot be taken into account by the available residue co-variation analysis methods. Therefore, we need to acknowledge that, at the moment, it is not feasible to comprehensively cover the full range of structural and interaction plasticity of disordered proteins by such analyses. Due to these reasons, IDPs and their interactions have hardly been investigated for the occurrence of residue co-variation [6,7,9,11,12,14,15]. Although a recent analysis suggested

structured states of IDPs based on observable intrachain ECs [40], the co-evolution of IDPs with folded partners remained to be elucidated. To possibly demonstrate and understand IDP–partner co-evolution, we performed a comprehensive analysis of IDP–partner interprotein evolutionary couplings on the available bacterial complexes of the DIBS database using dedicated methods [14,15]. To ensure sufficient diversity and depth of sequence coverage by homologs, we restricted our analysis to bacterial IDPs and their cognate partners. Furthermore, to improve internal consistency of the predictions, we applied two different algorithms, Gremlin and EVcomplex. We could identify high-scoring interprotein ECs for seven interacting protein pairs, typically the ones with the best alignment coverage. Alignment coverage is, however, not strictly correlated with phylogenetic spread; it also depends on the length of interacting protein regions. For complexes where the IDP contacts the partner over a relatively long sequence bit, like, for instance, for the FlgM-FliA complex (Figure 1E), we had a good chance to identify some ECs. However, although SLiM-mediated interactions are highly preferred by IDPs and some of them are exceptionally widely conserved, they did not show high-scoring interprotein ECs, probably due to their short length, wherein the evolutionary information on a relatively small fraction of specificity-determining residues is diluted by many more surrounding other residues undergoing fast evolutionary turnover [31]. The absence of detected ECs for SLiM-mediated interactions could be a major limitation of performing residue co-variation analyses on IDPs, as SLiMs are crucial and widely applied interaction units of IDPs that mediate contacts with many of their partners.

Supporting the validity of the predicted ECs, i.e., that they represent true physical contacts between the IDP and its partner, or at least could be spatially restrained by each other, the majority of the visible ECs fall within 8 Å. Residue contact networks from PCA also showed that several EC pairs had multiple atomic contacts; moreover, ECs had more atomic contacts than other contacting IDP–partner residue pairs. Regarding their structural preferences, EC residues tend to be located in helices in both IDPs and folded partners. Although EC-carrying interfaces did not significantly differ in size or in the surface-normalized number of H-bonds and salt bridges, it is important to note that no ECs could be identified in interfaces that are small (<600 Å2) or have no hydrogen bonds or salt bridges, no matter how widely conserved the interaction is. Since for the partners we did not find enrichment in any residue type, we do not claim that the enrichment of EC-carrying IDP interfaces in negative residues would imply the employment of increased electrostatic attraction compared to other IDP–partner interfaces. Therefore, we cannot ascertain the biological relevance of the noted preference for negatively charged residues.

Importantly, ECs could be identified for both permanent and transient IDP–partner interactions. This shows that not only obligatory, permanent associations, such as the subunits of the ribosome [15] or heteromeric enzymes, but also well-established signaling relationships, such as regulatory antisigma factors, have a trace of detectable co-evolution between the IDP and its interaction partner. Regarding the evolutionary history and genomic proximity of the genes encoding the protein pairs with high-scoring ECs, the CcdA-CcdB (PDB: 3HPW) and MazE-MazF (5CQX) toxin–antitoxin pairs, the alpha and delta subunits of the ATP synthase (2A7U) and the Sigma-E/Anti-Sigma-E regulatory factors RseB and RseA (3M4W), are each encoded on common operons. This suggests that these protein pairs are co-expressed and their co-occurrence is strongly preferred in evolution. To the contrary, the other two pairs, FliA with its anti-sigma factor FlgM (1SC5) and proteasome-associated ATPase (product of *mpa* gene) with prokaryotic ubiquitin-like protein Pup (3M91) are located in different operons and are thus substantially less strictly associated both in gene-regulatory and evolutionary terms. Therefore, based on our findings, a phylogenetically widely preserved protein–protein interaction that buries an interface larger than SLiM-mediated interactions typically do, might be enough for the reliable identification of interprotein evolutionary couplings. No other assumption regarding permanence of the interaction or co-regulation of the corresponding genes needs to be made prior to analysis, although such factors could increase the chance of finding true ECs.

In this first dedicated study of IDP–partner co-evolution, we also show that IDPs are difficult to investigate by methods addressing residue co-variation, due to their fast evolutionary changes, limited sequence representation, increased propensity for moonlighting functions, frequent use of short linear motifs for partner binding and relatively few co-evolving residue pairs both within their chains [39,40] and with their partners. Nevertheless, by demonstrating detectable footprints of IDP–partner co-evolution for interactions with largely different functional readouts, our results are also promising. They imply that the explosion in the number of sequenced genomes, the continuous improvement of techniques of sequence homology detection [58], and advances in sequence alignment approaches optimized for IDPs [40] could soon empower residue covariation analyses of IDPs to provide predictions and new insights into the structures and interactions of IDPs whose experimental investigation proved to be challenging.

#### **4. Materials and Methods**

#### *4.1. Dataset Preparation*

Bacterial protein complexes were obtained from the DIBS database [45] (Supplementary Table S1). We identified 4 of the 42 as redundant: DIBS: DI2210001–PDB: 3TCJ, DI1200012–3UF7, DI1210005–3C94, and DI1210010–5CW7 were excluded due to redundancy to DI2200001–3HPW, DI1200011–3UF7, DI1210004–3C94, and DI1210007–5CZF, respectively. The DI1200014–5F56 complex was also excluded due to the IDP peptide being <5 residues long. For the remaining 37 complexes, the constituent chains were trimmed or extended to make them optimal for co-evolution analysis. IDP chains shorter than 30 residues were extended based on UniProt [59] to reach the minimum of 30 residues length required for Gremlin analysis. Terminal segments could obviously only be extended to one direction. For the two short IDP chains representing internal segments, more extension was added to the end with residues forming an interface with the partner chain. Partner chains/very long IDP chains were trimmed to interacting domains/subdomains to ensure that the best possible coverage values are obtained (sequence coverage values are highly dependent on the total length of the analyzed sequences). The resulting trimmed UniProt regions used for further analysis are indicated in Table 1. Then for each complex, the IDP counterpart was checked in PFAM 31 [46] to get an idea of their phylogenetic spread. If, for the IDP region in the complex or at least for a neighboring protein domain/region of the protein there were no PFAM families available, or the number of sequences in the full alignments of the relevant PFAM families were <130, then the complex was excluded from correlated mutation analysis (for PFAM families see Supplementary Table S2).

Information on the complexes representing permanent or transient interactions was taken from the literature and UniProt subunit structure annotations. To assign if the genes are encoded within a single operon/gene neighborhood, we used information in the STRING database [60] and Ensembl Bacteria [61]. Assignments on the phylogenetic spread of the interactions were based on literature mining.

## *4.2. Co-Evolution Analysis*

The trimmed regions of the 19 remaining complexes were run with the Gremlin [15] and EVcomplex [14] webservers for interprotein co-variation analysis. When using Gremlin, for regions >60 residues the e-value threshold was set to E-06 and the number of iterations with Jackhmmer to 4, while for regions ≤60 residues, we applied a less stringent e-value threshold of E-04 with 8 iterations. The reason for using a less stringent e-value threshold for shorter sequences is explained and supported by the Introduction/Updates section of the Gremlin webserver. Δgene was set to 1–∞ to identify the closest homologs in the analyzed proteomes regardless of their genomic location. In 6 cases, Gremlin stopped due to insufficient alignments for further analysis. It finished the analysis in 13 cases. We accepted interprotein residue pairs with a scaled score ≥1.30 and a probability >0.88 as co-varying pairs; evolutionary couplings (ECs). These thresholds were selected based on our observation that among the predicted possible EC pairs with a scaled score >1.3 the probability values are the main determinants of the residue pairs representing true positive predictions with residue–residue distances

reflecting direct contacts or false positive predictions with large interresidue distances. Therefore, we have selected a strict probability value threshold (top 12%) to avoid false positive ECs compromising the dataset used for analysis. The 7 complexes with such high-scoring interface ECs identified by Gremlin were analyzed further and compared to the remaining 30 complexes with no high-scoring ECs.

For the EVcomplex analysis, we applied the default e-value (10−5) in searching for homologues and chose the option of selecting the closest homologs of the query sequences from the analyzed proteomes. Here, interprotein ECs with an EVcomplex score >0.9 have been accepted as ECs. EVcomplex identified ECs for 7/19 complexes.
