**2. Results**

#### *2.1. Sequence-Based Properties Define Four Clusters of Complexes*

Complexes formed by mutual synergistic folding were taken from the MFIB (Mutual Folding Induced by Binding) database [26], and each complex has been assigned a feature vector describing the sequence composition of its constituent protein chains. To represent the sequence composition, we use the amino acid grouping previously used for investigating protein–protein complexes involving IDPs [28] (see Data and Methods and Figure 1 for definitions, and Supplementary Table S1 for exact values for all complexes). These vectors were used as input for hierarchical clustering (Supplementary Figure S1) to quantify the sequence-based relationship between various complexes. k-means clustering (Supplementary Figure S2) indicates four as a suitable number of clusters, and, therefore, we use four sequence-based clusters in all subsequent analyses. While this choice is not the only acceptable one based on the k-means results, we aim to have a restricted set of clusters to describe the major types of sequential classes. The main features of the four clusters are shown in Figure 1, while cluster numbers for each complex are shown in Supplementary Table S1.
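The within-cluster sum of squares (WCSS) criterion used to choose the number of clusters can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used for the analysis: the toy two-dimensional points, the deterministic farthest-point seeding, and all function names are assumptions made for the example. The real input would be the per-complex composition vectors listed in Supplementary Table S1.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # WCSS: squared distance of every point to its nearest final centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Toy data: four tight groups standing in for four compositional classes.
toy_points = [
    (0, 0), (0, 1), (1, 0),
    (10, 0), (10, 1), (11, 0),
    (0, 10), (0, 11), (1, 10),
    (10, 10), (10, 11), (11, 10),
]

wcss_by_k = {k: kmeans_wcss(toy_points, k) for k in range(1, 7)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On such data the WCSS drops steeply up to k = 4 and only marginally afterwards. That "elbow" is the kind of signal that suggests a suitable number of clusters, although, as noted above, it rarely singles out one value as the only acceptable choice.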

Figure 1 shows the average sequence compositions of each of the four sequence-based clusters. While clusters were defined based on sequence compositions only, Figure 1 also shows the average heterogeneity of the four clusters, meaning the average normalized difference in sequence composition between the interacting proteins of the complexes (see Data and Methods). Complexes in clusters 1 and 2 are both largely devoid of special residues, including Gly (flexible), Pro (rigid), and Cys (cysteine). Members of these two clusters contain an average fraction of hydrophobic residues; however, they are slightly depleted in aromatic residues, indicating that π–π interactions are not the dominant source of stability. The most characteristic difference between clusters 1 and 2 is that members of cluster 1 typically contain a high fraction of polar residues, while members of cluster 2 are enriched in charged residues. In addition, cluster 1 members are typically formed by proteins with highly different compositions (high heterogeneity values), while cluster 2 members are formed by proteins of very similar compositions.
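To make the notions of group-wise composition and heterogeneity concrete, the sketch below computes both for a pair of chains. It rests on two loudly stated assumptions: the residue grouping in `GROUPS` is illustrative only and is not the grouping of ref. [28], and `heterogeneity` uses a plain mean absolute difference as a stand-in for the normalized difference defined in Data and Methods.

```python
# Hypothetical residue grouping, for illustration only (not taken from ref. [28]).
GROUPS = {
    "hydrophobic": "AILMV",
    "aromatic": "FWY",
    "polar": "NQSTH",
    "charged": "DEKR",
    "Gly": "G",
    "Pro": "P",
    "Cys": "C",
}


def composition(sequence):
    """Fraction of residues falling into each group of GROUPS."""
    seq = sequence.upper()
    return {
        name: sum(seq.count(aa) for aa in residues) / len(seq)
        for name, residues in GROUPS.items()
    }


def heterogeneity(seq_a, seq_b):
    """Mean absolute difference of group fractions between two chains.

    Assumed stand-in for the paper's normalized difference measure.
    """
    comp_a, comp_b = composition(seq_a), composition(seq_b)
    return sum(abs(comp_a[g] - comp_b[g]) for g in GROUPS) / len(GROUPS)


# Identical chains give 0; chains with disjoint compositions give a high value.
print(heterogeneity("AAGG", "AAGG"))
print(heterogeneity("AAAA", "KKKK"))
```

Under this simplified definition, a homodimer always has zero heterogeneity, which matches the intuition behind the low heterogeneity values reported for clusters formed by proteins of very similar composition.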

In contrast, members of clusters 3 and 4 are typically enriched in Gly and Pro and contain a higher-than-average fraction of aromatic residues. Again, polar/charged residue balance is a distinguishing feature, with clusters 3 and 4 showing preferences for polar and charged residues, respectively. Also, similarly to clusters 1 and 2, there is a notable difference in heterogeneity values between clusters 3 and 4: members of clusters 3 and 4 are typically composed of proteins with very similar and different residue compositions, respectively.


**Figure 1.** Average values of sequence features for the four sequence-based clusters. Blue and orange shadings mark values that are over- or under-represented compared with the average of all MSF complexes. Heterogeneity values were not used for cluster definitions.

#### *2.2. Structure-Based Properties Offer a Different Means of Defining Complex Types*

The structural properties of the studied complexes were quantified using various features describing secondary structure compositions, various molecular surfaces, and incorporating hydrophobicity measures and atomic contacts (see Supplementary Table S1 and Data and Methods). These structural features were used to describe each complex in the form of a feature vector, and similarly to the analysis of sequence properties, these vectors were input to hierarchical clustering; however, structural features were filtered, and only those that share a modest degree of correlation were kept (see Supplementary Table S2 and Data and Methods for specifics) to avoid bias. The resulting tree is shown in Supplementary Figure S3. In contrast to the sequence-based clustering, k-means within-cluster sum of squares analysis does not single out any small number of clusters as clearly preferable (Supplementary Figure S4). In order to have a medium number of clusters, we cut the hierarchical tree at a linkage distance that defines five clusters (Supplementary Figure S3), again reflecting our preference to arrive at a moderate number of complex types, to provide a high-level classification scheme. The average values of structural parameters for all five structure classes are shown in Figure 2.
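The correlation-based filtering of structural features can be illustrated with a simple greedy scheme. This is only a sketch under stated assumptions: the feature names, the toy values, and the cutoff `max_abs_r=0.7` are hypothetical, and the actual selection procedure is the one described in Data and Methods and Supplementary Table S2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined for constant features


def filter_correlated(features, max_abs_r=0.7):
    """Keep a feature only if |r| with every already kept feature is <= max_abs_r.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= max_abs_r for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-complex feature values (one entry per complex).
toy_features = {
    "helix_fraction": [0.1, 0.4, 0.6, 0.9],
    "coil_fraction": [0.9, 0.6, 0.4, 0.1],      # perfectly anti-correlated with helix
    "relative_interface": [0.3, 0.1, 0.4, 0.2],
}

print(filter_correlated(toy_features))
```

Dropping strongly correlated features before clustering prevents a single underlying property, here the helical content, from being counted several times in the distance between complexes.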

The obtained clusters show distinguishing structural features. Members of cluster 1 incorporate the highest amount of nonhelical secondary structure elements. These complexes heavily rely on a large number of buried hydrophobic residues for stability, and most stabilizing atomic contacts are formed between residues of the same protein, relying less on intermolecular interactions, which tend to be mostly polar in nature.

In contrast, members of cluster 2 adopt mainly helical structures. The stability of these complexes seems to rely more on the interactions formed between the subunits, mostly formed between side chains. The importance of interchain interactions is also reflected in the large relative interface and small relative buried surface areas.

Cluster 3 and 4 complexes exhibit similar features, including a balanced ratio of various secondary structure elements and polar/hydrophobic balance of various molecular surfaces and contacts. For both clusters, interchain contacts rely mostly on side chain–side chain and backbone–backbone contacts. The main difference between the two clusters is the relative role of the interface between the participating proteins. Cluster 3 members have a larger-than-average interface, in terms of both molecular surface and number of contacts, whereas cluster 4 complexes have a very restricted interface, incorporating only a few atomic contacts.

Members of cluster 5 are the most similar to the average in most structural features. There are only weak distinguishing features, including a slightly increased helical content at the expense of extended structural elements, a moderate increase in the role of backbone–side chain interactions in interchain contacts, and the increased ratio of interchain contacts. However, these deviations in average parameter values are modest and—with the exception of the decreased extended structure content—none of them reaches 20% compared to the average values calculated for all complexes.


**Figure 2.** Average values for structure features for the five structure-based clusters. Blue and orange shadings mark values that are over- or under-represented compared to the average of all MSF complexes. SASA—solvent accessible surface area, hydro:hydro—fraction of contacts that are formed between two hydrophobic atoms. Asterisks mark features that were included in the clustering.

#### *2.3. Defining Interaction Types Based on Sequence and Structure Clusters*

Considering together the previously established sequence- and structure-based clusters, in total 20 types of complexes can be defined (Figure 3). The number of known complexes in possible types shows large variations, with some highly favored ones (e.g., type 2[sequence]/2[structure]) and ones with a single known example (e.g., type 2/1), showing that not all sequence compositions are compatible with all types of adopted structures. In order to arrive at a reasonable number of basic complex types, types with 10 or fewer complexes were either merged with the adjacent sequence clusters or were omitted. As structural differences in general are larger between clusters, types corresponding to different structure clusters were never merged. For structure clusters 1 and 2, only two adjacent sequence clusters were merged, as these contain over 95% and 85% of the complexes, respectively. In contrast, for structure classes 3 and 4, all four sequence clusters were merged, as the distribution of complexes is more even across the sequence space. For structure cluster 5, even a single sequence cluster is enough to capture over 85% of complexes, and thus no merging was employed. This approach yielded five main interaction types, each of which has over 20 complexes. In order to include all known MSF complexes, a sixth pseudo-type was introduced, which contains all structures not compatible with any of the previously described five types (see Supplementary Table S1 for an exhaustive list).
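The bookkeeping behind this step amounts to cross-tabulating the two cluster labels of each complex and flagging sparsely populated cells. The sketch below uses hypothetical toy assignments; only the 4 × 5 grid and the cutoff of 10 complexes come from the text, and the merging itself (which depends on cluster adjacency in the hierarchical trees) is not reproduced.

```python
from collections import Counter

# All possible (sequence cluster, structure cluster) combinations: 4 x 5 = 20.
all_types = [(seq, struct) for seq in range(1, 5) for struct in range(1, 6)]


def count_types(assignments):
    """Count complexes per (sequence cluster, structure cluster) pair."""
    return Counter(assignments)


def sparse_types(counts, max_size=10):
    """Observed types with max_size or fewer members: candidates for merging or omission."""
    return sorted(t for t, n in counts.items() if n <= max_size)


# Hypothetical assignments, one tuple per complex (not the real MFIB data).
toy_assignments = [(2, 2)] * 25 + [(1, 1)] * 12 + [(2, 1)] * 1

counts = count_types(toy_assignments)
print(len(all_types))        # size of the full grid
print(sparse_types(counts))  # cells that would need merging or omission
```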

**Figure 3.** MSF complex types. Colored regions mark separate interaction types considering sequence- and structure-based clusters (vertical and horizontal axes, respectively). The relationship of each sequence- and structure-based cluster taken from the hierarchical clustering (Supplementary Figures S1 and S3) is shown on the corresponding side of the table. Each of the six defined types is assigned a randomly selected color (that is of high contrast), and these are used in later figures to denote the corresponding complex types.

The complex types defined so far are based on structure and sequence features. However, if these types represent biologically meaningful classes, there should be other relevant differences between them in terms of the energetics of the interaction, binding strength, subcellular localization, or the biological regulation of the interaction. In the following sections, we describe each complex type with biologically important characteristics and assess the potential differences between the members of each class.

#### *2.4. Complex Types Show Characteristic Energetic Properties*

From a biological perspective, the strength of association between interacting protein chains and the stability of the resulting complex are of utmost importance. Unfortunately, complexes formed exclusively by IDPs via MSF generally lack targeted measurements concerning thermodynamic and stability parameters. However, low-resolution energy calculations and prediction algorithms can give an indication of the characteristic energetic properties of the uncovered complex types in general. While these methods might have fairly large errors in individual cases, they are well suited to comparative studies between groups of complexes.

In order to assess the energetic properties of complexes, we employed an energy calculation scheme using low-resolution force fields based on statistical potentials (see Data and Methods). As a reference, energetic properties were calculated for complexes formed exclusively by ordered proteins and complexes formed by an IDP binding to an ordered partner via coupled folding and binding (CFB) (see Data and Methods and Supplementary Tables S3 and S4). Figure 4 shows two types of calculated energies for each complex. On one hand, we calculated the total energy per residue in the whole complex, which reflects the overall stability. On the other hand, we also calculated the fraction of this stabilizing energy coming from intermolecular interactions (i.e., how important the interaction is for stability). In accordance with our expectations, complexes formed by ordered proteins feature strongly bound overall structures, with fairly large negative stabilizing energy/residue. In contrast, CFB complexes in general have less favorable per-residue energies, hinting at their comparatively weakly bound overall structures. However, the energetic feature providing the most recognizable difference between ordered and CFB complexes is the energy contribution of interchain contacts to the overall stability. In the case of ordered complexes, this contribution is fairly limited, as individual subunits have a stable structure on their own. In contrast, if the complex features an IDP, the interaction energy becomes a major contributor to stability (Figure 4a).
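The two quantities plotted in Figure 4 can be expressed compactly. In the sketch below, each contact is represented as a tuple of two chain identifiers and an energy in arbitrary units. This data layout, the function name, and the toy numbers are assumptions made for illustration; the statistical potential itself is described in Data and Methods and is not reproduced here.

```python
def energy_summary(contacts, n_residues):
    """Return (total energy per residue, fraction of energy from interchain contacts).

    contacts: iterable of (chain_i, chain_j, energy) tuples; negative energy is stabilizing.
    """
    total = sum(energy for _, _, energy in contacts)
    if total == 0:
        raise ValueError("total energy is zero; the interchain fraction is undefined")
    interchain = sum(energy for a, b, energy in contacts if a != b)
    return total / n_residues, interchain / total


# Hypothetical dimer of 20 residues with three aggregated contact terms.
toy_contacts = [
    ("A", "A", -3.0),  # contacts within chain A
    ("B", "B", -2.0),  # contacts within chain B
    ("A", "B", -5.0),  # contacts across the interface
]

per_residue, interchain_fraction = energy_summary(toy_contacts, n_residues=20)
print(per_residue, interchain_fraction)
```

In this toy case, half of the stabilizing energy comes from the interface, a pattern expected when the chains are not stable on their own. For a complex of ordered proteins, the interchain fraction would be expected to be considerably smaller.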

While ordered and CFB complexes tend to segregate in this energy space, complexes formed by MSF seem to be more heterogeneous, covering the whole available range of energetic values (Figure 4b). In the case of near-ordered proteins (Type 1), the energies resemble those of ordered complexes, hinting at the borderline ordered nature of the constituent IDPs, with the interaction between subunits playing a minor role. In contrast, coiled-coil-like structures (Type 2) on average have a much less stable complex structure, with interaction playing a substantial role in stability. These complexes resemble IDPs bound to ordered domains, and are expected to include several transient interactions. Other types fall largely between these two extreme cases. Energetic properties of the two types of oligomerization modules (Types 3 and 4) reflect the differences in interface surface area and contact numbers shown in Figure 2. While the overall stability for both types varies in a very wide range, on average, the contribution of the interaction is higher for interface-heavy complexes (Type 3) than for interface-light ones (Type 4). Handshake-like folds (Type 5) show interesting properties: these complexes are quite stable with only limited variation in the per-residue energies. Yet, they achieve this high stability by relying heavily on the interaction between subunits of the dimer. As opposed to the complexes in Figure 4a, MSF complexes show high overlap in the energy space. This shows that very different structures, with potentially very different sequence compositions, can have similar energetic properties. Also, the high variability of energetic properties within complex types (the main reason for high overlap between different groups) shows that depending on the biological function, similar complexes can be required to have very different stabilities. For example, while several dimeric transcription factors can have similar structures that accommodate DNA-binding, the association and dissociation rates of the dimers (regulating their transcriptional activity) have to adapt to the required expression profiles of the genes they regulate.

**Figure 4.** Energetic parameters of various interaction classes. The relative energetic weight of intersubunit interactions in the overall stability (*y*-axis) as a function of the overall energy per residue (*x*-axis, measured in arbitrary units, AU) for ordered complexes and complexes formed by coupled folding and binding (**a**), and the five well-defined types of MSF complexes (**b**).

The transient or obligate nature of interactions provides clues about their roles in biological systems. This can be described, at least in part, by dissociation constants (Kd). While there is ample data about Kd values of IDPs binding via CFB to ordered domains [23], these values are largely missing for MSF complexes. Accordingly, we calculated estimated Kd values for MSF complexes (Supplementary Table S1), with Figure 5 showing the Kd distributions for the six previously defined complex types. In a biological context, actual Kd values can be a nonlinear function of environmental parameters. Unfortunately, this information is largely unknown for most MSF complexes, and such predicted Kd values should be treated with caution and should only be used for comparing group averages, where individual errors can even out. The lowest average Kd values were calculated for complexes with a handshake-like fold (Type 5). The next two types with low Kd values are the near-ordered complexes (Type 1) and interface-heavy oligomerization modules (Type 3). These three types together possibly cover most cases of the interactions where the complex needs to stay stable for an extended period of time, such as histone dimers (Type 5), complexes with enzymatic activity (Type 1), and several transcription factors (Type 3). Coiled-coil-like structures and oligomerization modules with small interfaces in general have a higher Kd, indicating that several transiently bound complexes belong to these types.
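For orientation, the standard thermodynamic relation between a binding free energy and a dissociation constant, ΔG° = RT ln(Kd / 1 M), can be evaluated as below. This is not the prediction method used for Supplementary Table S1, which is described in Data and Methods; the function names and the default temperature of 298.15 K are assumptions of the sketch.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_delta_g(delta_g_kcal, temperature_k=298.15):
    """Dissociation constant (molar, 1 M standard state) from binding free energy."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in molar units."""
    return R_KCAL * temperature_k * math.log(kd_molar)


# A binding free energy of -10 kcal/mol corresponds to a Kd in the tens of nanomolar.
kd_example = kd_from_delta_g(-10.0)
print(f"{kd_example:.2e} M")
```

Because Kd depends exponentially on the free energy, an error of about 1.4 kcal/mol in an estimated ΔG shifts Kd by roughly one order of magnitude at room temperature. This is one reason why predicted values are better compared as group averages than taken individually.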

**Figure 5.** Predicted Kd value distributions for the six types of MSF complexes.

#### *2.5. Interactions Are Heavily Regulated by Several Mechanisms*

While the energetics of various interactions can provide clues about their transient/obligatory nature, the regulatory mechanisms can give more direct evidence. For example, most IDP enzymes (belonging to Type 1) form particularly stable oligomers, indicating an obligate interaction; however, the oligomeric state of superoxide dismutase (SOD1) is known to be controlled by post-translational modification (PTM) serving as an on/off switch [35], meaning that despite being strong, the interaction is reversible, and the disordered state of the monomers is biologically relevant (Figure 6a). Figure 6a shows additional examples of various regulatory mechanisms of MSF interactions via PTMs. These regulatory steps have already been described in the case of IDPs that bind to ordered domains [36], but have not been studied in the context of IDPs participating in MSF interactions. Apart from the on/off switch exemplified by SOD1, PTMs can control the partner selection of synergistically folding IDPs, such as in the case of another tightly bound complex, formed by H3/H4 histones (Type 5) [37]. PTMs can also tune the affinity of certain interactions, as is the case for the activating p53/CBP interaction (Type 4) [38]. Apart from these mechanisms that directly control the interaction between IDPs, PTMs can have a more indirect effect, modulating the activity of the dimer itself. In the case of the Max dimeric transcription factor, phosphorylation at the N-terminus of the binding region controls the dimer's (Type 4) interaction capacity towards DNA [39]. An even more indirect modulation of function is displayed for the retinoblastoma protein Rb, which in complex with E2F1/DP1 (Type 3) has a strong transcriptional repression activity. Upon methylation, Rb recruits L3MBTL1 [40], which is a direct repressor of transcription via chromatin compaction, augmenting the effect of Rb through a related but separate mechanism extrinsic to the Rb/E2F1/DP1 complex. In this way, the strength of repression depends on the PTM of the MSF complex, but the effect is exerted indirectly, through an additional protein that is not part of the complex and acts through a parallel mechanism.

To have a more systematic picture of the extent of regulatory mechanisms in MSF interactions, Figure 6b shows the fraction of known MSF complexes with experimentally verified PTM sites (Supplementary Table S5). In total, nearly 30% of studied complexes feature at least one PTM that was experimentally verified in a low-throughput experiment, presenting a regulatory mechanism that is able to directly or indirectly modulate either the interaction itself, or the activity of the resulting complex. The most prevalent PTM is phosphorylation, affecting 22% of complexes, but 10%, 15%, and 5% of MSF complexes contain methylation, acetylation, and ubiquitination sites as well (Figure 6b).

In addition, complex formation can also be regulated through the availability of the subunits participating in the interaction. This availability can depend on the alternative mRNA splicing of the corresponding genes, where certain isoforms lack the binding site (Supplementary Table S6). Also, even if the translated isoform has the binding site, the protein itself can be sequestered by competing interactions with other protein partners (Supplementary Table S7). These mechanisms are present for 11% (alternative splicing) and 16% (competing interactions) of complexes, and together with PTMs, in total 36% of MSF complexes have at least one known regulatory mechanism for modulating the interaction. Furthermore, these regulatory mechanisms often act in cooperation, with seven interactions known to employ PTMs, alternative splicing, and competing interactions as well (Figure 6c).
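The overlap statistics summarized in Figure 6c are plain set operations over complex identifiers. The sketch below uses hypothetical identifiers and set memberships; the real assignments are listed in Supplementary Tables S5–S7.

```python
def overlap_summary(ptm, splicing, competition, n_complexes):
    """Summarize how many complexes are covered by each regulatory mechanism."""
    any_mechanism = ptm | splicing | competition
    return {
        "fraction_ptm": len(ptm) / n_complexes,
        "fraction_splicing": len(splicing) / n_complexes,
        "fraction_competition": len(competition) / n_complexes,
        "fraction_any": len(any_mechanism) / n_complexes,
        "n_all_three": len(ptm & splicing & competition),
    }


# Hypothetical identifiers for a toy set of 10 complexes.
toy_ptm = {"c1", "c2", "c3"}
toy_splicing = {"c3", "c4"}
toy_competition = {"c3", "c5"}

summary = overlap_summary(toy_ptm, toy_splicing, toy_competition, n_complexes=10)
print(summary)
```

Note that the fraction of complexes with at least one mechanism is computed from the union of the three sets, so it is smaller than the sum of the individual fractions whenever the mechanisms co-occur, as they do for the 36% total reported above.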

**Figure 6.** Regulatory mechanisms of MSF complexes. (**a**) Examples of regulation and modulation of function through post-translational modifications. p—phosphorylation, g—glutathionylation, me—methylation, SOD1—superoxide dismutase, CBP—CREB-binding protein, Rb—retinoblastoma-associated protein. Colored boxes represent interacting chains forming the MSF complexes. (**b**) The fraction of complexes with verified PTM sites, and the fraction of complexes where at least one interactor is regulated via alternative splicing or by competing interactions. (**c**) Number and overlap of MSF complexes affected by the three types of regulatory mechanisms.
