**1. Introduction**

Since the turn of the millennium it has been clear that Anfinsen's long-standing paradigm, once assumed to hold for all proteins, that "protein structure is uniquely determined by its amino acid sequence" [1,2], is valid only for a specific subclass of proteins. The remaining proteins, termed intrinsically disordered proteins (IDPs), have no permanent 3D structure [3–6]. In our earlier effort to identify the physical background of protein disorder, we pinpointed the lack of sufficient pairwise interaction energy between residues to ensure a stable 3D structure. When this energy is not enough to compensate for the entropy-related free energy loss that accompanies the formation of a unique structure, the protein is intrinsically disordered [7]. It has been shown that this pairwise energy can be estimated from the amino acid sequence without any structural information. On this basis we developed a widely used method, IUPred, to predict disordered proteins or protein segments from local composition data [8]. Another application of the estimated pairwise interaction energies led us to recognize the physical properties of the binding regions of disordered proteins that can bind to ordered proteins [9]. When certain segments of a disordered protein interact with an ordered protein, part of their interactions are formed with residues of this stable globular protein, which can provide enough pairwise energy for the segments to stabilize their structure, i.e., to fold, on the surface of the ordered protein. The contribution of a single residue depends only on the composition of the surrounding residues. Since ordered proteins differ from disordered proteins in amino acid composition, the resulting interaction energies of the residues at the contact surface can stabilize the structure (coupled folding and binding).
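The composition argument can be written compactly. The notation below is our own shorthand, introduced here for illustration only; it is an assumption and not the published IUPred or ANCHOR scoring function. $P$ stands for a 20 × 20 pairwise energy matrix, $a_i$ for the amino acid type at position $i$, and $c$ for a vector of amino acid fractions.

```latex
e_i(c) = \sum_{j=1}^{20} P_{a_i j}\, c_j ,
\qquad
\Delta e_i = e_i\!\left(c^{\mathrm{ordered}}\right) - e_i\!\left(c^{\mathrm{local}}\right)
```

Here $c^{\mathrm{local}}$ is the composition of the residue's own sequence neighborhood and $c^{\mathrm{ordered}}$ is a typical composition of ordered proteins. Under the assumed convention that more negative energies are more stabilizing, $\Delta e_i < 0$ means that residue $i$ gains interaction energy when its partners are drawn from an ordered protein rather than from its own disordered environment.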

On the basis of this phenomenon, a binding site prediction method, termed ANCHOR, was developed [10]. These interacting segments generally appear as short motifs of the polypeptide chain (ELMs) [9,10]. More recently, upgraded versions of IUPred and ANCHOR were combined into a new server called IUPred2A [11].

While this phenomenon appeared to be general, over the years the number of "exceptions" increased, suggesting that the insufficient pairwise energy calculated by the IUPred algorithm explains disorder only for certain intrinsically disordered proteins and intrinsically disordered protein segments (IDSs), and that another kind of IDP and IDS also exists. Even in the early days of IDP research, there were sporadic reports that some IDPs undergo mutual folding and binding together with other IDPs, without the help of already stable proteins or other stable macromolecules [12,13]. For example, the NCBD segment of CBP forms a complex with the ACTR domain of p160 (Protein Data Bank (PDB) entry 1KBH [14]), and region C of WASP is in complex with the GBD segment of WASP [15]. In these examples, the interacting parts of the disordered proteins are not ELM-sized but rather the size of structural domains [16]. In many cases the interacting disordered protein segments are alike, forming homodimers or homo-oligomers. Here coupled folding and binding cannot arise from a difference in residue composition, as it does for ELMs stabilized on the surface of an ordered protein. Therefore there must be another mechanism for coupled folding and binding besides the one that ANCHOR can recognize. Since macromolecular interactions are part of almost all activities of disordered proteins, a new mechanism of coupled folding and binding, in which no stable template is available, defines a new subset of IDPs. Despite the sporadic reports of these interactions, few complexes of this kind had been described in the literature [17,18]. Therefore we performed a detailed analysis of several databases and of the scientific literature, collected information on these complexes, and created the Mutual Folding Induced by Binding (MFIB) database [19]. These complexes exhibit large structural variation (see Figure 1).

Almost half of the MSF-complexes are homodimers, but there is a significant number of heterodimers and other oligomeric states, including homo- and heterotetramers, as well as trimers, pentamers, and hexamers. To explore the unique features of the entries in the MFIB database, and to pinpoint the characteristics that distinguish them from disordered segments that undergo coupled folding and binding with already stable proteins, we created the Disordered Binding Site (DIBS) database of the latter complexes [20]. A publication comparing the structural differences between the proteins of the MFIB and DIBS databases is currently in progress [21].

The elements of the pairwise interaction matrices used in the IUPred and ANCHOR algorithms were derived from the structural data of folded globular proteins; therefore these data include the free energy from the average hydration of the residues in these proteins. We showed that this is similar for most globular proteins; therefore a fair estimate of the free energy contribution of a particular residue can be calculated from the composition of the rather large polypeptide segment centered on that residue, using the pairwise interaction energy matrix [7]. In the IUPred algorithm, when a particular residue is processed, whether it belongs to an ordered segment or a disordered one, the interactions between this residue and all other residues in a large surrounding region are considered. Therefore the calculated energy value is the same for all permutations of the residues in the segments on either side of the central residue, as long as the compositions of the segments remain unchanged. The amino acid sequences of proteins with stable folded structures evolved in such a way that the side chains together shield the backbone from water, which minimizes the energetically unfavorable water-accessible area of the polypeptide backbone. In this work we show that this statement is not valid for the disordered proteins listed in MFIB.
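The permutation invariance described above follows directly from using composition alone, and a few lines of Python make this concrete. This is a minimal sketch under stated assumptions, not the IUPred implementation: `TOY_MATRIX` holds arbitrary placeholder values rather than the published energy matrix, and the function names and the `half_window` parameter are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical placeholder values, NOT the IUPred energy matrix.
# Any fixed 20 x 20 table illustrates the same point.
TOY_MATRIX = {
    a: {b: (((i + 1) * (j + 1)) % 11 - 5) / 10.0
        for j, b in enumerate(AMINO_ACIDS)}
    for i, a in enumerate(AMINO_ACIDS)
}


def window_composition(sequence, center, half_window):
    """Amino acid fractions around `center`, excluding the center residue."""
    start = max(0, center - half_window)
    end = min(len(sequence), center + half_window + 1)
    neighbors = sequence[start:center] + sequence[center + 1:end]
    if not neighbors:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    counts = Counter(neighbors)
    return {aa: counts.get(aa, 0) / len(neighbors) for aa in AMINO_ACIDS}


def estimated_energy(sequence, center, half_window=50, matrix=TOY_MATRIX):
    """Composition-weighted sum of pairwise terms for the center residue."""
    composition = window_composition(sequence, center, half_window)
    row = matrix[sequence[center]]
    return sum(row[aa] * composition[aa] for aa in AMINO_ACIDS)


# Reversing each flank keeps its composition but changes the residue order.
left, right = "MKVLAAGIVG", "DEERKKPSTQ"
original = left + "W" + right
permuted = left[::-1] + "W" + right[::-1]
same_energy = estimated_energy(original, 10) == estimated_energy(permuted, 10)
```

Because only the counts of each residue type in the window enter the sum, `same_energy` is true for any rearrangement of the flanking segments. Any information carried by the order of the residues is invisible to such an estimate.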

**Figure 1.** Oligomeric states in the MFIB database with example complexes. (**A**: 1BET, nerve growth factor (*Mus musculus*); **B**: 1AQ5, assembly domain of cartilage oligomeric matrix protein (*Gallus gallus*); **C**: 1GKE, Transthyretin (*Rattus norvegicus*); **D**: 1MZ9, assembly domain of cartilage oligomeric matrix protein (*Mus musculus*); **E**: 1NPK, nucleoside diphosphate kinase (*Dictyostelium discoideum*); **F**: 5GT0, H2A-H2B histone dimer, containing histone variants H2A type 1-A and H2B type 1-J (*Homo sapiens*); **G**: 2AZE, Rb C-terminal domain bound to an E2F1-DP1 heterodimer (*Homo sapiens*); **H**: 2NB1, p63/p73 hetero-tetramerization domain (*Homo sapiens*); **I**: 1VZJ, The synaptic acetylcholinesterase tetramer assembled around a polyproline-II helix (*Homo sapiens*); **J**: 1G2C, respiratory syncytial virus fusion protein core (*Homo sapiens*)).

We investigated whether the interacting regions of these proteins can be identified based on their location in the whole polypeptide chain, on their biased amino acid composition, or on specific physical properties. We discovered that their most distinctive characteristic is the high water accessibility of their peptide backbone, compared with that of folded proteins that have similar amino acid compositions.
