#### **2. Results**

#### *2.1. Sequence-Based Analysis*

In this study, homodimeric protein complexes from MFIB were analyzed with regard to their sequence and structural properties. First, we checked where the PDB segments of the MFIB homodimeric dataset (MFHD), i.e., the segments with a known 3D structure, are located in the full UniProt protein sequences. Depending on the protein, the MFHD PDB segment was located near the N-terminus, near the C-terminus, or in the middle of the sequence, or it was identical to the full sequence (Figure 2).
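The four location classes can be illustrated with a minimal sketch. The function name `classify_segment` and the tolerance parameter `flank` are hypothetical and not taken from this work; in particular, the text does not state how close to a terminus a segment must be to count as "near" it, so `flank` is purely an assumption.

```python
def classify_segment(full_length: int, start: int, end: int, flank: int = 0) -> str:
    """Classify a PDB segment (1-based, inclusive) within a UniProt sequence.

    Returns "I" (identical), "N" (N-terminal), "C" (C-terminal) or "M" (middle).
    `flank` is an assumed tolerance: a segment whose gap to a terminus is at
    most `flank` residues is treated as lying at that terminus.
    """
    if not (1 <= start <= end <= full_length):
        raise ValueError("segment must lie within the sequence")
    n_gap = start - 1           # residues before the segment
    c_gap = full_length - end   # residues after the segment
    if n_gap == 0 and c_gap == 0:
        return "I"
    near_n = n_gap <= flank
    near_c = c_gap <= flank
    if near_n and near_c:
        # Both termini are within tolerance: pick the closer one (ties go to N).
        return "N" if n_gap <= c_gap else "C"
    if near_n:
        return "N"
    if near_c:
        return "C"
    return "M"
```

For example, `classify_segment(100, 1, 40)` yields `"N"`, while `classify_segment(100, 30, 70)` yields `"M"`.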

We examined the residue composition of the MFHD proteins (Figure 3, Table S2) and compared it with two reference datasets, the globular homodimeric dataset (GLHD) and the globular monomeric dataset (GLMD, see Section 4). To compare the amino acid compositions of the sequences, we projected them by principal component analysis (PCA) (Figure 4, Table S3). The PCA showed that the amino acid composition of the MFHD proteins did not differ significantly from that of the globular proteins (GLHD, GLMD). The PCA also demonstrated that the MFHD proteins formed a diverse group based on their amino acid composition.
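As an illustration of this kind of analysis, the following sketch computes a 20-dimensional composition vector per sequence and projects the vectors onto their leading principal components. It assumes NumPy is available, and the helper names `composition` and `pca_scores` are hypothetical. It is not the pipeline used to produce Figure 4, whose preprocessing (e.g., any scaling of the variables) is not described here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> np.ndarray:
    """Fraction of each standard amino acid in `seq` (other symbols are ignored)."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return counts / total


def pca_scores(X: np.ndarray, n_components: int = 2):
    """Project the rows of X onto the first principal components (SVD-based PCA).

    Columns are mean-centered but not rescaled; that choice is an assumption.
    Returns the scores and the fraction of variance explained per component.
    """
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]
```

A composition matrix for a dataset would then be built as `np.vstack([composition(s) for s in sequences])` and passed to `pca_scores`.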

**Figure 2.** Distribution of MFHD PDB segments in the full UniProt sequences. (I: the MFHD PDB segment is identical to the full amino acid sequence from UniProt; N: the MFHD PDB segment is located at the N-terminus of the UniProt sequence; C: the MFHD PDB segment is located at the C-terminus of the UniProt sequence; M: the MFHD PDB segment is located in the middle of the UniProt sequence).

**Figure 3.** Sequence properties of MFHD, GLHD, and GLMD proteins (For values, see Table S2).

We investigated the MFHD with several protein disorder predictors (IUPred, ESpritz, GlobPlot, VSL2b, MobiDB Lite, MetaDisorder) [8,22–26], which worked well on the IDPs listed in DIBS, but did not recognize the polypeptide chains of MFHD complexes and other members of the MFIB database as disordered proteins. All methods predicted less than 30% of the protein residues as disordered, while the IUPred long/short methods, relying on a physical basis, predicted only 8 and 10% of the protein residues as disordered, respectively (for values, see Table S4). Other prediction methods based on amino acid composition bias also failed to detect MFHD PDB segments. Methods developed from the DAS and DAS-TMfilter [27,28] algorithms were tested on the dataset.

**Figure 4.** PCA ordination of the proteins from MFHD, GLHD, and GLMD based on their amino acid compositions (for values, see Table S3).

#### *2.2. Structure-Based Analysis*

We will use the term "interface" for the contact surface area of the two identical subunits in the dimeric structures. Where the term "monomeric structure" is used, calculations were carried out on structures from which the second chain was deleted, since the PDB files contained the dimeric forms of the complexes. Residues belonging to the interface region were identified based on solvent-accessible surface area (SASA) calculations. All-atom SASA values were calculated for the residues. Residues whose SASA value calculated from the dimeric form was less than or equal to 20% of its counterpart from the monomeric structure were defined as interface residues. We found that on average there were 26.4 interface residues per polypeptide chain in the MFIB homodimeric dataset and 21.0 interface residues per polypeptide chain in the reference globular homodimeric dataset. Considering the average size of the proteins, this means that 27.13% of all residues in the MFHD and 22.34% of all residues in the GLHD belonged to the interface region. The higher value obtained for the MFIB homodimeric structures indicates that inter-subunit interactions may play an essential role in the stabilization of MFHD proteins.
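The interface criterion can be written down compactly. The sketch below assumes that per-residue all-atom SASA values for the monomeric and dimeric forms have already been computed with some external tool and are supplied as dictionaries keyed by residue identifier. The function name `interface_residues` is hypothetical, and skipping residues with zero SASA in the monomer is an assumption, since the text does not say how such residues are handled.

```python
def interface_residues(sasa_monomer: dict, sasa_dimer: dict, ratio: float = 0.20) -> list:
    """Return residue ids whose dimer SASA is <= `ratio` times their monomer SASA.

    Both dictionaries map residue id -> all-atom SASA (in square angstroms).
    """
    interface = []
    for res_id, mono in sasa_monomer.items():
        if mono <= 0.0:
            # Assumption: a residue already fully buried in the monomer cannot
            # lose accessibility on dimerization, so it is not counted.
            continue
        if sasa_dimer[res_id] <= ratio * mono:
            interface.append(res_id)
    return interface
```

Dividing the length of the returned list by the chain length gives the per-chain interface fraction of the kind reported above.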

We looked for interface residues that have solvent-accessible spots in their main chain in the monomeric structure, which become buried in the dimeric structure. We identified residues where the main-chain SASA in the dimeric form was less than 20% of the value in the monomeric form. Only residues with exposed main chains, i.e., with a relative main-chain SASA larger than 0.2 in the monomeric structure, were taken into account. These residues with solvent-accessible main-chain patches (RSAMPs) are believed to be the main driving force of the dimerization of the disordered polypeptide chains collected in the MFIB database. We found a total of 183 such residues in the MFHD proteins; all structures contained at least one such residue. This was 3.14% of all residues. Considering that 27.13% of the residues formed the interface, this means that 11.57% of the MFHD interface residues were RSAMPs. In the GLHD, 40.83% of the proteins did not contain such residues, and on average 1.56% of all residues were RSAMPs. Since 22.34% of the residues formed the interface, only 6.98% of the interface residues were RSAMPs. We also calculated the average solvent-accessible surface area of the main chains. In the MFHD, the average solvent-accessible main-chain area belonging to the interface region was 1154.56 Å<sup>2</sup> per polypeptide chain, while in the GLHD this value was 790.54 Å<sup>2</sup>. Thus, in MFIB proteins a larger main-chain surface area is solvent accessible, which is energetically unfavorable. The amino acid composition of the interface region and of the RSAMPs of the MFHD and GLHD complexes can be seen in Figure 5 and Tables S5 and S6. Alanine and glycine were the most abundant residues among the RSAMPs, which might be responsible for the higher solvent accessibility of the main chain in the MFHD. In the interface region, aliphatic residues were predominant: 50.6% of the interface residues were aliphatic in the MFHD and 45.4% in the GLHD, making inter-subunit hydrophobic interactions even more prominent in MFIB proteins.
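The two RSAMP conditions, and the conversion from "percent of all residues" to "percent of interface residues", can be sketched as follows. The class `MainChainSasa`, the field `reference` (the value used to normalize the main-chain SASA into a relative value) and both function names are hypothetical; the text does not specify which reference values were used for the normalization.

```python
from dataclasses import dataclass


@dataclass
class MainChainSasa:
    res_id: int
    monomer: float    # main-chain SASA in the monomeric structure
    dimer: float      # main-chain SASA in the dimeric structure
    reference: float  # assumed normalization value for the relative SASA


def is_rsamp(r: MainChainSasa, burial_ratio: float = 0.20, min_relative: float = 0.20) -> bool:
    """True if the main chain is exposed in the monomer and buried in the dimer."""
    if r.reference <= 0.0:
        raise ValueError("reference main-chain SASA must be positive")
    exposed_in_monomer = r.monomer / r.reference > min_relative
    buried_in_dimer = r.dimer < burial_ratio * r.monomer
    return exposed_in_monomer and buried_in_dimer


def percent_of_interface(pct_of_all: float, pct_interface: float) -> float:
    """Convert a percentage of all residues into a percentage of interface residues."""
    return 100.0 * pct_of_all / pct_interface
```

With the rounded values quoted above, `percent_of_interface(3.14, 27.13)` is approximately 11.57 and `percent_of_interface(1.56, 22.34)` is approximately 6.98, in line with the reported RSAMP shares of the interface.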

**Figure 5.** Amino acid composition of interface region (**A**) and RSAMPs (**B**) of the MFIB and globular homodimeric datasets (For values, see Tables S5 and S6).

We determined the secondary structural propensities in the MFHD, GLHD, and GLMD. We found that in the MFHD a significantly higher percentage of residues (39.4%) belonged to α-helices when compared to GLHD and GLMD (39.4% and 27.9%). In the MFHD, 21.2% of the residues belong to β-sheets, while in the GLHD and GLMD this value was 27.1% and 28.3%, respectively. The MFIB proteins show higher helical propensities than globular proteins.

We identified the hydrogen bonds formed between the two subunits. In the MFHD, 6.97 inter-subunit H-bonds per structure were found, while in the GLHD this was only 4.58. Furthermore, we identified underwrapped hydrogen bonds that are not sufficiently shielded from the solvent, called dehydrons, in all structures [29]. In the MFHD we found 3.11 dehydrons per polypeptide chain among the inter-subunit H-bonds, while only 2.18 were found in the GLHD. In contrast to these results, the average wrapping of inter-subunit H-bonds was 16.0 for the MFHD and 13.6 for the GLHD. Although there were more dehydrons (i.e., underwrapped H-bonds) in the MFHD, the average wrapping value was still higher.

Because of the large difference found in the inter-subunit H-bonds, other inter-subunit interactions were also investigated. First, we identified inter-subunit ion pairs. We found that in the MFHD there were 1.17 inter-subunit ion pairs on average, compared with only 0.66 in the GLHD. Charged residues tend to occur at the surface because the desolvation of buried charges is energetically unfavorable. Charged residues buried either in the interior of a protein or in the interface region of the dimeric structure should form ion pairs in order to compensate for the desolvation penalty through favorable electrostatic interactions. Since the occurrence of charged residues is slightly higher in the interface region of the GLHD (16.2% vs. 14.7%), the lower number of inter-subunit ion pairs was unexpected. We already noted in an earlier publication that inter-subunit ion pairs might contribute to the stabilization of proteins [30].

Stabilization centers (SCs) are pairs of residues involved in more than average long-range interactions [31]. These residue clusters are believed to contribute to the stabilization of protein structures through the cooperativity of the individual interactions [32,33]. The stabilization centers formed between different polypeptide chains can contribute to the stabilization of a protein complex [34]. We identified inter-subunit SCs in both the MFHD and GLHD. The two residues that form a stabilization center are called stabilization center elements (SCEs). We identified the SCEs belonging to the interface. In the MFHD, 3.86% of all residues formed inter-subunit SCs; that is, on average 14.22% of the interface residues formed inter-subunit SCs. In the GLHD, only 1.83% of the residues belonged to inter-subunit SCs, which means that only 8.19% of the interface residues formed inter-subunit SCs. Inter-subunit SCs were thus much more frequent in MFIB dimers than in the GLHD. We investigated whether SCEs overlap with RSAMPs or whether they are segregated. We found that there was a significant overlap, as 29.51% of the RSAMPs were SCEs in the MFHD. In the GLHD, we obtained a similar value of 29.19% for the overlap.
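The overlap between RSAMPs and SCEs is a simple set calculation, sketched below. The function name `percent_overlap` is hypothetical, and both inputs are assumed to be sets of residue identifiers obtained from the preceding analyses.

```python
def percent_overlap(subset: set, other: set) -> float:
    """Percentage of the elements of `subset` that also occur in `other`."""
    if not subset:
        raise ValueError("subset must not be empty")
    return 100.0 * len(subset & other) / len(subset)
```

With the 183 RSAMPs reported for the MFHD, an overlap of 29.51% would correspond to 54 residues that are also SCEs; this count is a back-calculation from the rounded percentages and is not stated in the text.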

#### **3. Discussion**

In a recent study, we compared the residue composition of IDPs from the MFIB with complexes from the DIBS and other human protein databases [21], and we found that the composition of MFIB complexes was significantly different from that of the DIBS and only slightly different from that of human proteins. IDPs from the DIBS database are capable of coupled binding and folding on the surface of ordered proteins and can be predicted through bioinformatics methods like the ANCHOR algorithm, which is based on the different residue composition of the disordered monomer and the disordered–ordered protein complex. Therefore, in this work we studied mutual synergistic folding (MSF) homodimers to exclude this explanation for the case of MSF. We observed that in some cases the interacting segment of MFIB homodimers was the full polypeptide chain, while in other cases only a part of the chain was involved in the dimerization (Figure 2). We showed that these segments could be an order of magnitude longer than ELMs, which can be recognized by ANCHOR in other proteins.

In our current study, the residue composition of the homodimeric complexes from the MFIB was determined and compared with that of homodimeric and monomeric globular proteins of similar sequence length (Figures 3 and 4). Our results showed that the IDPs listed in the MFIB had an amino acid composition similar to that of globular proteins. The PCA showed that the globular (GLHD, GLMD) and the MFHD proteins were not distinguishable. However, the points belonging to the complexes in the PCA figure were not clearly clustered, suggesting that MFHD is a distinct subgroup of IDPs. This was confirmed by the comparison of MFHD with the UniRef50 database, which showed that the main part of MFHD belongs to a distinct cluster and there is no significant similarity between their Pfam domains.

We investigated the MFHD with several protein disorder predictors (IUPred, ESpritz, GlobPlot, VSL2b, MobiDB Lite, MetaDisorder), which work well on the IDPs listed in DIBS. These methods did not recognize the full-length polypeptide chains of the MFHD complexes and other members of the MFIB database as disordered proteins. Since the disorder predictors IUPred and ANCHOR rely exclusively on solid physical principles, these methods were used to discover the physical principles behind the disordered character of the proteins and the origin of the coupled folding and binding of the homodimers in the MFIB database. Our current study indicated that in the case of MFHD, the IUPred algorithm using its standard 20 × 20 pairwise free energy matrix overestimated the stabilizing energy, because the energetically unfavorable large solvent-accessible surface area of the peptide backbone in single protein chains resulted in less stabilizing energy. This can explain why these proteins were disordered in monomeric form. On the one hand, members of the MFIB dataset can be disordered for a reason similar to that of other disordered proteins: the sum of their pairwise interaction enthalpies did not compensate for the free energy contribution of the entropy loss during folding. On the other hand, this is not a consequence of the amino acid composition of these polypeptides. Pairwise interactions of residue pairs that have backbone parts not sufficiently shielded from the solvent contribute less enthalpy to the stabilization than those found in globular proteins, from which the standard 20 × 20 pairwise free energy matrix was derived. Therefore, by using the free energy matrix in IUPred, we overestimated the stabilizing free energy of the proteins listed in the MFIB. This is why the IUPred algorithm predicted these monomers as structured proteins, while the experiments showed that they are disordered in their monomeric form [16].

Since the residue composition of MFHD is rather similar to that of the globular proteins (GLHD and GLMD), we looked for structural differences between them. We found that the interface region had more residues in the MFHD than in the GLHD. MFIB homodimeric proteins had a larger solvent-accessible main-chain surface area in the interface when compared to globular homodimeric proteins. The polypeptide backbone of MFHD proteins was more accessible to water than that of globular proteins. During dimerization, the solvent-accessible surface area of the backbone decreased and a high number of inter-subunit interactions (H-bonds, ion pairs and stabilization centers) formed, leading to the stabilization of the disordered polypeptide chains and enabling an ordered structure of MFIB proteins in the dimeric form. The driving force of the dimerization was the mutual shielding of the water-accessible backbones and the formation of extra intermolecular interactions.
