*4.3. Calculating Sequence Features*

Following the approach described in [28], the sequence composition of proteins was quantified using the following amino acid groups: hydrophobic (A, I, L, M, V), aromatic (F, W, Y), polar (N, Q, S, T), charged (H, K, R, D, E), rigid (P only), flexible (G only), and covalently interacting (C only). This low-resolution sequence composition at least partially compensates for commonly occurring amino acid substitutions that in most cases do not affect protein structure or function. In all cases, compositions were calculated for the entire complex, including all interacting protein chains. An eighth sequence parameter was used to quantify the compositional difference between subunits. This dissimilarity measure was defined as:

$$\Delta_{\mathrm{total}} = \sum_{i=1}^{7} \Delta_i,$$

where $\Delta_i$ is the largest composition difference of residue group $i$ between any pair of constituent chains. The average dissimilarities for various sequence-based clusters are shown in Figure 1. For exact sequence composition values for all MSF entries, see Supplementary Table S1.
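The eight sequence parameters described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: compositions are expressed as fractions of sequence length (the text does not state fractions versus percentages), the composition of a complex is obtained by pooling the residues of all chains, and residues outside the seven groups (for example X) still count toward the length. The function names `group_composition`, `complex_composition`, and `total_dissimilarity` are hypothetical. Because the largest difference of a group between any pair of chains equals the maximum minus the minimum over all chains, no explicit pairwise loop is needed.

```python
# Residue groups as listed in the text; the dictionary keys are illustrative labels.
GROUPS = {
    "hydrophobic": "AILMV",
    "aromatic": "FWY",
    "polar": "NQST",
    "charged": "HKRDE",
    "rigid": "P",
    "flexible": "G",
    "covalent": "C",
}


def group_composition(sequence):
    """Return the fraction of residues in each of the seven groups.

    Assumption: composition is a fraction of the full sequence length,
    so residues outside the seven groups still count in the denominator.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        name: sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in GROUPS.items()
    }


def complex_composition(chains):
    """Composition of the whole complex.

    Assumption: residues of all chains are pooled before counting.
    """
    return group_composition("".join(chains))


def total_dissimilarity(chains):
    """Sum over the seven groups of the largest pairwise composition difference.

    For each group, the largest difference between any pair of chains
    equals max minus min of that group's composition across the chains.
    """
    per_chain = [group_composition(chain) for chain in chains]
    total = 0.0
    for name in GROUPS:
        values = [comp[name] for comp in per_chain]
        total += max(values) - min(values)
    return total


if __name__ == "__main__":
    # Toy two-chain example, not real protein data.
    toy_chains = ["AAAA", "FFGG"]
    print(complex_composition(toy_chains))
    print(total_dissimilarity(toy_chains))
```

Under these assumptions, a complex of identical chains has a dissimilarity of zero, and the value grows as the subunits diverge in group composition, up to a maximum of 2 when fractions are used and the chains share no residue groups.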
