**4. Materials and Methods**

Filters were applied to the homodimeric structures of the MFIB database. A reference dataset was created from homodimeric globular proteins whose monomeric form is itself globular. Another reference dataset was created from monomeric globular proteins.

All homodimeric structures were collected from the MFIB database, and the modified PDB files were used. Entries belonging to the "coils and zippers" structure class were discarded because structures in this class differ both sequentially and structurally from other homodimers. A structure such as a leucine zipper cannot exist in monomeric form, so no reference dataset in which the monomer is not itself disordered can be created from "coils and zippers". A contact map matrix was created for each remaining structure, and entries with unusual contact maps were manually inspected. After inspection, the following entries were discarded: 2adl, 1r05, 4ath, 1aa0, 4w4k, 1ejp, resulting in a dataset of 60 homodimeric structures (Table S1). Heteroatoms were deleted from the structures. This dataset is referred to as the MFIB homodimeric dataset (MFHD). We checked the secondary structure of the dataset using the DSSP 2.0.4 program [35] and found that 39.4% of the residues belonged to α-helices and 21.2% to β-sheets. To investigate the size distribution of the dataset, we binned the structures by their number of residues into intervals running from N to N + 20. The 140–240 interval was predominant, so the reference datasets were created according to this size distribution.
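The binning step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes half-open intervals [N, N + 20) that start at multiples of 20, and both the function name `size_histogram` and the chain lengths in the example are hypothetical.

```python
from collections import Counter


def size_histogram(lengths, bin_width=20):
    """Count sequences per [N, N + bin_width) length interval.

    The returned dict maps the lower bound N of each interval to the
    number of sequences whose length falls inside it.
    """
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


# Hypothetical chain lengths, for illustration only.
example_lengths = [145, 150, 163, 201, 239, 95, 310]
histogram = size_histogram(example_lengths)

for lower, count in histogram.items():
    print(f"{lower}-{lower + 20}: {count}")
```

Inspecting which bins carry most of the counts is what would motivate a size window such as the 140–240 range used for the reference datasets.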

A non-redundant reference dataset was created from homodimeric globular proteins. All homodimeric structures within the 140–240 amino acid size range were collected from the non-homologous PDB\_Select database as of November 2017 [36]. Structures containing coiled-coil structural elements, as identified with the Socket 3.0.3 program, were excluded from the dataset [37]. The proper quaternary structure of the homodimers was created according to the BIOMT records of the PDB files. Entries containing large ligands or cofactors with the following PDB ligand summary "ids" were discarded from the dataset because such molecules could significantly alter the results of the solvent-accessible surface area calculations: 017, 1BG, 1PE, 1PG, 5GP, C2E, FAD, HEC, KI1, MYA, MYR, MYS, NER, O8N, OLC, P33, P6G, PE5, UNL. Heteroatoms were deleted from the remaining structures. This procedure resulted in a list of 218 protein structures. This dataset is referred to as the globular homodimeric dataset (GLHD). For the PDB codes, see Table S1. According to DSSP, 27.9% of the residues belonged to α-helices and 27.1% to β-sheets.
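The ligand filter and the heteroatom removal can be illustrated with a short Python sketch. It is an assumption-laden example, not the authors' pipeline: it reads ligand ids from the residue-name field (columns 18–20) of HETATM records rather than from the PDB ligand summary, and the function names are hypothetical.

```python
# Ligand ids listed in the text as grounds for discarding an entry.
EXCLUDED_LIGANDS = frozenset(
    "017 1BG 1PE 1PG 5GP C2E FAD HEC KI1 MYA "
    "MYR MYS NER O8N OLC P33 P6G PE5 UNL".split()
)


def het_ids(pdb_text):
    """Return the residue names found in HETATM records (columns 18-20)."""
    return {
        line[17:20].strip()
        for line in pdb_text.splitlines()
        if line.startswith("HETATM")
    }


def has_excluded_ligand(pdb_text):
    """True if any HETATM residue name is on the exclusion list."""
    return not het_ids(pdb_text).isdisjoint(EXCLUDED_LIGANDS)


def strip_heteroatoms(pdb_text):
    """Drop all HETATM records, keeping every other line unchanged."""
    kept = [
        line for line in pdb_text.splitlines()
        if not line.startswith("HETATM")
    ]
    return "\n".join(kept)
```

An entry would first be tested with `has_excluded_ligand` and, if it passes, cleaned with `strip_heteroatoms`. Note that this simple version also removes modified residues stored as HETATM records, which may or may not match the original procedure.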

An additional non-redundant reference dataset of monomeric structures in the 140–240 amino acid size range containing only one structural domain was created from the PDB\_Select database. The initial database was filtered by size and monomeric state criteria. All entries proved to be single-domain according to the DDomain program using authors-trained parameters [38]. This dataset is referred to as the globular monomeric dataset (GLMD) and contained 191 entries (Table S1). According to DSSP, 24.9% of the residues belonged to α-helices and 28.3% to β-sheets.

Differences in the amino acid composition of the protein sequences from the MFHD, GLHD, and GLMD datasets were revealed by principal component analysis (PCA) ordination using the plotly software according to Raska [39].
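The input to such a PCA is one composition vector per sequence. The sketch below shows how those 20-dimensional vectors might be computed. It covers only this preprocessing step, not the PCA or the plotting, and the function name, the alphabetical residue ordering, and the decision to ignore non-standard residue codes are all assumptions.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code, alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fractional amino acid composition of a sequence.

    Characters outside the 20 standard residues (e.g. X) are ignored,
    so the returned fractions sum to 1 over the standard residues.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, for illustration only.
vector = composition("AAGGKX")
print(dict(zip(AMINO_ACIDS, vector)))
```

Stacking these vectors row by row gives a samples-by-20 matrix, which is the usual form expected by PCA implementations.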

Hydrogen bonds were identified using the find\_pairs command of PyMOL with a 3.5 Å distance criterion and a 45 degree angle criterion between the donor and acceptor groups [40]. The calculation of the wrapping of hydrogen bonds and the identification of dehydrons were performed with the dehydron\_ter.py program [41].

Stabilization centers (SCs) are pairs of residues, called stabilization center elements (SCEs), which are involved in several long-range interactions. These residues can be identified with our publicly available web server at http://scide.enzim.hu [42].

The solvent-accessible surface area (SASA) was calculated using the FreeSASA 2.03 program [43]. A residue was classified as buried when its relative SASA was less than or equal to 0.2. Residues with a relative SASA above 0.2 were considered exposed. A residue was classified as part of the interface region when its all-atom SASA calculated from the dimeric structure was less than 20% of the value calculated from the monomeric structure (created by deleting the second chain from the PDB file).
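The classification rules above can be written down directly. The following Python sketch takes precomputed SASA values as input (it does not call FreeSASA). The function names are hypothetical, and treating a residue with zero monomeric SASA as non-interface is an assumption not stated in the text.

```python
def burial_class(relative_sasa, threshold=0.2):
    """Classify a residue as 'buried' (relative SASA <= threshold)
    or 'exposed' (relative SASA > threshold)."""
    return "buried" if relative_sasa <= threshold else "exposed"


def is_interface(sasa_dimer, sasa_monomer, fraction=0.2):
    """Apply the interface rule as stated in the text: the all-atom SASA
    in the dimer is less than `fraction` of the SASA in the monomer.

    Assumption: residues with no accessible surface in the monomer
    cannot lose surface on dimerization and are reported as False.
    """
    if sasa_monomer <= 0.0:
        return False
    return sasa_dimer < fraction * sasa_monomer


# Hypothetical values in square angstroms, for illustration only.
print(burial_class(0.15))          # relative SASA of 0.15
print(is_interface(10.0, 100.0))   # dimer SASA is 10% of monomer SASA
```

Keeping the two thresholds as parameters makes it easy to check how sensitive the buried, exposed, and interface counts are to the chosen cutoffs.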

Ion pairs were defined as pairs of negatively and positively charged residues in which the distance between the charged groups was equal to or less than 4 Å [44]. Ion pairs were identified using our own C++ program.
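As an illustration of the distance criterion, the sketch below finds ion pairs in a list of atoms. It is written in Python and is not the authors' C++ program. The choice of charged atoms (Asp OD1/OD2, Glu OE1/OE2, Lys NZ, Arg NE/NH1/NH2), the exclusion of histidine, and the tuple layout of the input are all assumptions.

```python
import math

# Assumed charged-group atoms; histidine is deliberately left out here.
NEGATIVE = {"ASP": {"OD1", "OD2"}, "GLU": {"OE1", "OE2"}}
POSITIVE = {"LYS": {"NZ"}, "ARG": {"NE", "NH1", "NH2"}}


def find_ion_pairs(atoms, cutoff=4.0):
    """Return residue pairs whose charged atoms lie within `cutoff` angstroms.

    `atoms` is an iterable of (chain, residue_number, residue_name,
    atom_name, (x, y, z)) tuples. Each residue pair is reported once,
    as ((chain, number, name), (chain, number, name)) with the
    negatively charged residue first.
    """
    atoms = list(atoms)
    negative = [a for a in atoms if a[3] in NEGATIVE.get(a[2], ())]
    positive = [a for a in atoms if a[3] in POSITIVE.get(a[2], ())]
    pairs = set()
    for neg in negative:
        for pos in positive:
            if math.dist(neg[4], pos[4]) <= cutoff:
                pairs.add((neg[:3], pos[:3]))
    return sorted(pairs)


# Hypothetical coordinates, for illustration only.
example_atoms = [
    ("A", 10, "ASP", "OD1", (0.0, 0.0, 0.0)),
    ("A", 10, "ASP", "OD2", (1.0, 0.0, 0.0)),
    ("B", 25, "LYS", "NZ", (3.0, 0.0, 0.0)),
    ("B", 40, "ARG", "NH1", (10.0, 0.0, 0.0)),
]
ion_pairs = find_ion_pairs(example_atoms)
print(ion_pairs)
```

The all-against-all loop is quadratic in the number of charged atoms, which is acceptable for proteins of a few hundred residues. A spatial grid would be the natural refinement for larger structures.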

**Supplementary Materials:** Supplementary materials can be found at http://www.mdpi.com/1422-0067/19/11/3340/s1. Table S1. List of PDB entries in the MFHD, GLHD, and GLMD datasets. Table S2. Average amino acid sequence composition of proteins from MFHD, GLMD, and GLHD. Table S3. Amino acid sequence composition of proteins from MFHD, GLMD, and GLHD for PCA. Table S4. Disorder content by various predictors. Table S5. Average amino acid composition of interface region of the proteins from MFHD and GLMD. Table S6. Average amino acid composition of RSAMPs of the proteins from MFHD and GLMD.

**Author Contributions:** Conceptualization, I.S., C.M., M.C.; methodology, A.M., E.F., C.M.; software, A.M., E.F., M.C., C.M.; validation, A.M., C.M.; formal analysis, C.M.; investigation, A.M., E.F.; resources, A.M., E.F.; data curation, A.M., E.F., C.M.; writing—original draft preparation, A.M., C.M., I.S.; writing—review and editing, E.F., A.M., C.M.; visualization, A.M.; supervision, I.S., M.C.; project administration, I.S.; funding acquisition, I.S.

**Funding:** This work was financially supported by the National Research, Development and Innovation Office (grant no. K115698). IS was supported by project no. FIEK\_16-1-2016-0005 financed under the FIEK\_16 funding scheme (National Research, Development and Innovation Fund of Hungary). The work of AM was supported by the ÚNKP-18-3 New National Excellence Program of the Ministry of Human Capacities (Hungary).

*Int. J. Mol. Sci.* **2018**, *19*, 3340

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
