### *2.6. Various Complex Types Show Differential Subcellular Localization*

In addition to regulatory mechanisms detailed in the previous chapter, a crucial element in the spatio-temporal control of protein function is subcellular localization [41]. In order to assess this aspect of MSF complexes, and to understand if the defined interaction types have different properties in terms of cellular localization, we used "cellular component" terms from GeneOntology (GO) [42] (see Data and Methods). Various GO terms were condensed into five categories including "Extracellular", "Intracellular", "Membrane", "Nucleus", and "Other" to enable an overview of the differences in localization between the six complex types (Figure 7) (for exact GO terms for each complex see Supplementary Table S8).

The least amount of information is available for Type 1, near-ordered complexes. Although GO terms are lacking for most complexes, even the limited annotations highlight that these complexes are able to efficiently function in the extracellular space, which in general is fairly uncommon for IDPs. Coil- and zipper-type helical complexes (Type 2) are somewhat more often attached to the membrane or function in the intracellular space, or in non-nuclear environments, such as the lysosome. In contrast, oligomerization modules (Types 3 and 4) are most prevalent in the nucleus and the intracellular space, which is in line with the function of the high number of transcription factors in these groups. However, modules with a large interface (Type 3) are relatively often found in other compartments, while modules with smaller interfaces (Type 4) also function in the extracellular space. Complexes adopting a handshake-like fold (Type 5) are enriched in histones, which is reflected in their enrichment in the nucleus and the chromatin (classified as "Other" in Figure 7). Type 6 complexes are heterogeneous in terms of localization as well, and hence members can be found in all studied localizations to a comparable degree. These preferences in subcellular localization for different complex types reinforce our notion that even though our classification scheme relies on sequence and structure properties alone, the obtained interaction types also have biological meaning.

**Figure 7.** Subcellular localization of MSF complexes belonging to the six types. "Other" contains the "non-membrane-bounded organelle", "secretory granule", "lysosome", "cytoplasmic vesicle lumen", and "transport vesicle" GeneOntology terms.
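The condensation of GO "cellular component" terms into the five categories can be illustrated with a minimal sketch. Only the list of terms grouped under "Other" is taken from the Figure 7 caption; the keyword rules for the remaining categories, the function names, and the `None` fallback are assumptions made for illustration and do not reproduce the exact mapping used in the analysis (see Data and Methods and Supplementary Table S8).

```python
from collections import Counter, defaultdict

# Terms grouped under "Other", as listed in the Figure 7 caption.
OTHER_TERMS = {
    "non-membrane-bounded organelle",
    "secretory granule",
    "lysosome",
    "cytoplasmic vesicle lumen",
    "transport vesicle",
}

# Hypothetical keyword rules for the remaining four categories.
# Order matters: the first matching keyword decides the category.
KEYWORD_RULES = [
    ("extracellular", "Extracellular"),
    ("membrane", "Membrane"),
    ("nucle", "Nucleus"),          # nucleus, nuclear, nucleoplasm
    ("cytoplasm", "Intracellular"),
    ("cytosol", "Intracellular"),
    ("intracellular", "Intracellular"),
]


def condense_go_term(term):
    """Map one GO cellular-component term name to a coarse category.

    Returns None when no rule applies, so unmapped terms stay visible
    instead of being silently assigned to a category.
    """
    name = term.strip().lower()
    if name in OTHER_TERMS:  # checked first: one "Other" term contains "membrane"
        return "Other"
    for keyword, category in KEYWORD_RULES:
        if keyword in name:
            return category
    return None


def localization_profile(annotations):
    """Count categories per complex type from (complex_type, go_term) pairs."""
    profile = defaultdict(Counter)
    for complex_type, term in annotations:
        category = condense_go_term(term)
        if category is not None:
            profile[complex_type][category] += 1
    return dict(profile)
```

Calling `localization_profile` on a list of (type, term) pairs yields one counter per complex type, which is the kind of table a figure such as Figure 7 summarizes.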

### *2.7. The Annotated Catalogue of Complexes Formed via Mutual Synergistic Folding*

Considering the previously analyzed features of complexes, averaging the calculated features for the six established interaction types provides the annotated catalogue of MSF interactions (Figure 8). Apart from the main sequential and structural features, Figure 8 also shows example structures, energetic properties, subcellular localization, and the main regulatory mechanisms for each complex type.

The first type of complex bears a high similarity to ordered protein complexes, and hence is named near-ordered. The constituent chains are usually similar, in many cases corresponding to homooligomers, with a high Pro/Gly content and typically only a few charges. The main difference compared to protein complexes formed by ordered proteins is that near-ordered subunits are depleted in α-helices [28]. For reaching a stable structure through the interaction, they utilize a large number of intrachain contacts, with inter-subunit interactions through a small polar interface playing only a secondary role in the stability of the complex. This group contains a large number of enzymes, transport proteins, and nerve growth factors, where the exact structure is of utmost importance; however, in contrast to monomeric proteins, the presence of this structure relies on the interaction. This interaction type is mostly regulated through phosphorylation and acetylation of binding site residues. These proteins resemble ordered proteins in their localization as well, with extracellular regions being highly represented.

The second type of complexes contains structures with a high overall similarity, mostly consisting of coiled-coils and zippers, structures composed of parallel interacting helical structures, often stabilized by a restricted set of residues, such as leucines, alanines, or tryptophans. In general, constituent proteins are depleted in residues incompatible with α-helix formation, such as Pro and Gly, and also in aromatic residues. In turn, they are abundant in hydrophobic residues and show an enrichment for either polar or charged residues. The constituent helices usually form a fairly weakly bound system, where the interchain interactions via the relatively large interfaces play a major role. Constituent proteins are able to bury only a small fraction of their polar surfaces. Coiled-coil interactions are often regulated, typically via various types of PTMs, most often through phosphorylation or, to a lesser degree, acetylation. Despite their highly similar structures, complexes in this group convey a large variety of functions, mainly pertaining to regulating transcription and performing membrane-associated biological roles, such as organelle and membrane organization.

The third and the fourth type of complexes are both generic oligomerization modules that can be split according to the importance of the interchain interactions, grouping them as either interface-heavy (Type 3) or interface-light (Type 4) complexes. In both cases, the sequences can be highly variable, and the unifying features are mostly structural. Both types typically have an average-sized relative buried area with balanced hydrophobic/polar composition. However, interface-heavy complexes have a large, slightly polar interface that plays a major role in achieving the tightly bound structures. In contrast, interface-light complexes form a more helical structure and have smaller hydrophobic interfaces that play a more diminished role in achieving the stability of a less tightly bound system. This hints at interface-light complexes being more transient, which is also supported by the fact that these complexes have a higher number of known regulatory PTMs and are also modulated by alternative splicing. Both type 3 and type 4 complexes preferentially occur in nuclear and intracellular processes, as several of them are ribbon–helix–helix (interface-heavy) or basic helix–loop–helix (interface-light) transcription factors, able to shuttle between the nuclear and the intracellular spaces. Beyond these similarities in subcellular localization, type 4 complexes also preferentially occur in the extracellular space, and type 3 complexes in other cell compartments.

The fifth type of complexes typically adopts a handshake-like fold, characteristic of histones and homologous proteins. While these structures are usually largely helical, the interacting proteins often contain a relatively high ratio of prolines and glycines, in addition to the enrichment of aromatic residues. While they are depleted in polar residues, both the interface and the buried surface have a fairly balanced hydrophobic/polar makeup. The complexes are relatively tightly bound, and interchain interactions play a fairly large role in stabilizing the interaction. This type of complex has the highest ratio of both PTMs and competitive interactions, providing a large amount of regulation. In addition, PTMs are highly heterogeneous, containing phosphorylations, acetylations, methylations, and ubiquitinations as well. Members of this cluster primarily serve DNA/chromosome-related functions, and hence are usually located in the nucleus.

While types 1–5 represent well-defined groups with members of clear unifying similarities, the final group serves as an umbrella term for complexes that are not members of any previous structural/sequential class. In accord, these complexes cannot be described by simple characteristic features and are the most sequentially and structurally heterogeneous group. This group contains highly specialized interactions that present unique protein complexes, which are regulated through all three control mechanisms and occur in all studied subcellular localizations.

### *2.8. Interaction Types Present a Novel Classification of Protein Complexes*

The described MSF classification method bears similarity to the approach employed in CATH, as both approaches use a hierarchical classification of PDB structures. However, CATH does not consider interactions and simply relies on the secondary structure elements and their connectivity and arrangement, in contrast to the presented analysis taking into account protein chain interactions too, together with sequence composition features.

Figure 9 shows the studied MSF complexes in both our MSF classification system and in CATH, considering the top two levels ("Class" and "Architecture"). The highest-level CATH definitions, corresponding to "Class", reflect the overall secondary structure element distribution of the structures. In this framework, Type 1 near-ordered complexes mostly occupy the "Mainly Beta" CATH class, while complexes from the other five types mostly fall into the "Mainly Alpha" class or the "Other" class. At the next CATH level, "Architecture", certain MSF type complexes (such as type 2 coils and zippers) are segregated into further subclasses.

Considering "Class" and "Architecture" definitions, there is very little correspondence between the CATH and the new MSF classification. If the two schemes showed a high degree of similarity, the matrix in Figure 9 should be close to a diagonal matrix. In reality, however, off-diagonal elements are large, confirming the novelty of the presented MSF classification scheme.
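The diagonal-matrix argument can be made concrete with a small sketch that cross-tabulates two label sets and reports how concentrated each group of one scheme is in a single group of the other. The function names and the toy labels are hypothetical and are not the data behind Figure 9; the purity score is a simple stand-in chosen for illustration, not a measure used in the analysis.

```python
from collections import Counter


def contingency(labels_a, labels_b):
    """Cross-tabulate two classifications of the same items.

    Returns a Counter keyed by (label_a, label_b) pairs.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must describe the same items")
    return Counter(zip(labels_a, labels_b))


def purity(table, axis):
    """Fraction of items lying in the dominant cell of each group.

    axis=0 groups by the first scheme, axis=1 by the second.
    A value of 1.0 in both directions means the table is diagonal
    up to a relabeling; lower values indicate off-diagonal mass.
    """
    dominant = {}
    for key, count in table.items():
        group = key[axis]
        dominant[group] = max(dominant.get(group, 0), count)
    return sum(dominant.values()) / sum(table.values())


# Toy labels for six imaginary complexes (illustration only).
msf = ["T1", "T1", "T2", "T2", "T3", "T3"]
cath = ["Mainly Beta", "Mainly Beta", "Mainly Alpha",
        "Mainly Alpha", "Mainly Alpha", "Other"]

table = contingency(msf, cath)
print(purity(table, axis=0), purity(table, axis=1))
```

Under this reading, two schemes that encode the same information give purity close to 1 in both directions, whereas large off-diagonal elements lower the score.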

**Figure 9.** Overlap between CATH and MSF classification.

#### **3. Discussion**

Here, we present the first approach aiming at the classification of complex structures formed exclusively by disordered proteins via mutual synergistic folding. We developed and applied a method that can classify these complexes into various types based on sequence- and structure-based properties. The classification scheme takes into account on the one hand, the overall sequence and structure properties of the complex, and on the other hand, the interaction itself, quantifying the role of intra- and intermolecular interactions in relation to the overall contact/surface properties of the structure. As the classification protocol is based on hierarchical clustering, it is freely scalable. Tuning the resolution via changing the number of sequence-based or structure-based clusters, the method can be used to yield any number of types and subtypes. The presented classification is a top-level one highlighting the major types of MSF classes, and this six-way classification scheme will be used to better define MSF complex types in the MFIB [26] database.
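The scalability of a hierarchical scheme can be sketched as follows: the same merge order is simply stopped at a different number of clusters to obtain coarser or finer types. This is a minimal pure-Python illustration that assumes average linkage and Euclidean distance on made-up two-dimensional feature vectors; the linkage criterion, distance measure, and features of the actual protocol are not specified here, and the function name `agglomerate` is hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two
    clusters with the smallest mean pairwise distance until only
    n_clusters remain. Returns one integer label per point.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")

    clusters = [[i] for i in range(len(points))]

    def average_link(a, b):
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations() yields index pairs with a < b, so deleting b is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Made-up feature vectors for five imaginary complexes.
features = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(features, 3))  # finer resolution
print(agglomerate(features, 2))  # coarser resolution
```

Because merging is greedy and deterministic, the coarser partition is obtained by merging groups of the finer one, which is what allows types to be subdivided into subtypes without redefining the top level.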

While both sequence- and structure-based parameters are taken into account when defining the final complex types, the two sets of descriptors have different roles in the scalability of the method. In our presented approach to defining complex types, the main features are structural properties, while sequence parameters are more descriptive in the sense that they highlight the sequential features needed to be able to fold into a complex of given structural properties (Figure 3). However, sequence features can be used to distinguish subtypes of structure-defined complex types. For example, type 1 near-ordered complexes come in two flavors according to the two sequence clusters they cover (Figures 1 and 3): polar-driven interactions between mostly homodimers, and charge/hydrophobic driven interactions between mostly heterodimers. Also, type 2 complexes (coils and zippers) come in two varieties: relying on polar-driven interactions for heterodimers and charge-driven for homodimers.

In addition to providing a scalable classification scheme, the described method and the defined complex types have biological relevance. The presented complex types have different biological properties; although only information describing the sequence and structure properties was used as input, the resulting types show different properties in terms of the energetics and strength of the interactions (Figures 4 and 5), the relevant regulatory processes (Figure 6), and subcellular localization (Figure 7).

The analysis of the energetic properties of the interactions can provide a glimpse into the biophysical details of the binding and folding. The use of low-resolution statistical force fields proved to be a suitable approach to discriminate complexes based on the structural features of constituent chains [28] and to describe the binding of IDPs [43,44]. While complexes of ordered proteins and domain-recognition IDP binding sites have a fairly narrow range in energetic parameters (Figure 4a), complexes formed exclusively by IDPs are more heterogeneous, basically covering the whole range of the energy spectrum (Figure 4b). Furthermore, based on predictions, MSF complexes cover at least 10 orders of magnitude in Kd values (Figure 5). Hence, in terms of binding strength and stability, these complexes have the potential to cover a very wide range of biological functions, overlapping with those of ordered complexes and domain-binding IDPs as well, in agreement with the previous comparative functional analysis of a wide range of interactions [28].
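To put the span of Kd values into energetic terms, the standard thermodynamic relation between a dissociation constant and the binding free energy, ΔG° = RT ln(Kd) with a 1 M standard state, can be evaluated directly. The sketch below uses only this textbook relation; it is not the structure-based Kd predictor used in the analysis, and the function names and the default temperature of 298.15 K are assumptions made for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from Kd in mol/L.

    Uses dG = R * T * ln(Kd) with a 1 M standard state, so tighter
    binding (smaller Kd) gives a more negative value.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def free_energy_span(orders_of_magnitude, temperature_k=298.15):
    """Free energy range in kcal/mol covered by a given span of Kd values."""
    return R_KCAL * temperature_k * math.log(10) * orders_of_magnitude


print(round(binding_free_energy(1e-9), 2))  # a nanomolar binder
print(round(free_energy_span(10), 2))       # ten orders of magnitude in Kd
```

At room temperature each order of magnitude in Kd corresponds to roughly 1.36 kcal/mol, so ten orders of magnitude translate into a range of about 13.6 kcal/mol in binding free energy.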

For most known MSF complexes, the resulting structure is instrumental for proper function, such as the coiled-coil structure for the SNAP receptor (SNARE) complex in mediating membrane fusion [45], the dimeric structure for a wide range of transcription factors in precise DNA-binding [46–48], and the proper coordination of catalytic residues for oligomeric enzymes [49,50]. Therefore, for MSF complexes, the interaction de facto switches on the protein function, and hence the precise regulation of the interaction strength is vital in the biological context of these complexes. While structure-based Kd value predictions are informative, in some cases they do not fully describe the interactions. Many MSF complexes are tightly bound, yet they are not necessarily obligate complexes, and their association/dissociation can be under heavy regulation. For example, solely based on Kd values and energetics, type 5 (handshake-like fold) interactions seem to form obligate complexes. However, there are several cases where these interactions do break up in a biological setting, most notably for histones. Histone H4 is able to form dimers with at least eight different H3 variants [51], and it was described that in the case of H3.1 and H3.3, the preference of H4 for these two partners is governed by H4 phosphorylation [37]. The post-translational modifications can enhance complex formation or dissociation in many other cases as well [35]. In addition, competition for the same binding partner and binding site availability as a function of alternative splicing is an additional mechanism for the regulation of the formation of MSF complexes (Figure 6).

Exploring the precise regulatory mechanisms for MSF complexes would be highly informative. Unfortunately, experimental Kd measurements are lacking for the majority of these interactions, and interactions have usually been analyzed in structural detail in only a single PTM state. Therefore, the molecular details and biologically relevant steps of the regulation of these interactions are difficult to assess; however, from a biological standpoint, it is probable that even several low-Kd complexes can dissociate rapidly in certain cases. At least some regulatory mechanisms are currently known for about 36% of studied MSF complexes, but the real numbers are bound to be higher. This means that most probably the majority of MSF complexes are not obligate complexes, where the disordered state is physiologically irrelevant, but can exist in both the stable bound state and the disordered unbound state under native conditions. Thus, MSF complexes are integral parts or direct targets of regulatory networks, although the extent of regulation varies with the interaction type considered.

Apart from the studied regulatory mechanisms, additional layers of spatio-temporal regulation can play crucial roles for MSF complexes, similarly to other IDP interactions [41]. One such emerging regulatory mechanism is liquid–liquid phase separation (LLPS). A prime example is the Nck/neuronal Wiskott–Aldrich syndrome protein (N-WASP) system. N-WASP is known to undergo LLPS when interacting with Nck and nephrin [52], via linear motif-mediated coupled folding and binding. Mutual synergistic folding between the secreted EspFU pathogen protein from enterohaemorrhagic *Escherichia coli* and the autoinhibitory GTPase-binding domain (GBD) in host WASP proteins (MFIB ID:MF2202002, type 5 complex) hijacks the native LLPS-mediated cellular processes [53], showing that competing interactions are not always stoichiometric in nature, and the true extent of MSF regulation is likely to be even more complex than highlighted here.

The differences between complex types in various biological and biophysical properties show that these type definitions reflect true biological differences. Apart from being useful for complex classification, the presented method also shows that differences in binding strength, subcellular localization, and regulation are encoded in the sequence and structural properties of proteins. This can be the basis for developing future prediction methods, where these sequence- and structure-based parameters can be used as input for the prediction of biological features of complexes. In addition, the establishment of MSF complex types has direct implications, as knowledge available for a specific complex might be transferable to other complexes of the same type. For example, certain pathological conditions arise through the aggregation of IDPs. A well-known example is transthyretin (TTR) aggregation that can lead to various amyloid diseases, such as senile systemic amyloidosis [54]. Another example from the same near-ordered complex type is the superoxide dismutase SOD1, which is able to form aggregates in amyotrophic lateral sclerosis [55]. While the localization and the biological function of TTR and SOD1 (hormone transport and enzymatic catalysis) are radically different, their propensities to malfunction (often connected to various mutations) show a high degree of resemblance. On the one hand, this marks other type 1 complexes as candidates for toxic aggregation; on the other hand, it indicates that the potential therapeutic techniques for one complex (e.g., CLR01 for TTR) can give clues about the potential targeting of other interactions.

Such structural classification approaches can have a high impact on structure research, most importantly in the study of protein structure or evolution, in training and/or benchmarking algorithms, augmenting existing datasets with annotations, and examining the classification of a specific protein or a small set of proteins [56]. To date, several structure-based classification approaches have been developed, such as SCOP [32] and CATH [33], which are extended to protein complexes as well. In this sense, previously existing methods are able to classify MSF complexes too. However, the approaches used do not take into account that these structures are only stable in the context of the interaction, and that a certain protein region can adopt fundamentally different structures depending on the interacting partner. The lack of the explicit encoding of parameters describing the properties and importance of the interaction into the classification scheme makes current methods unable to accurately describe the spectrum of MSF complexes, and to date, no such dedicated classification scheme has been proposed. In contrast to previously existing methods that largely encode the same information [57], the presented MSF classification scheme is highly independent (Figure 9), and thus serves as an orthogonal approach capable of properly handling the specific properties of IDP-driven complex formation through mutual synergistic folding.
