**3. Protein Coil Library Model of the DSE**

The PDB [73] provides an ever-increasing number of high-resolution protein structures, which include both regularly ordered secondary structures (helices, sheets, and turns) and irregularly ordered structures (coils and loops). While any individual coil or loop was sufficiently ordered for structural determination, the assumption is that in aggregate, a large set of irregularly ordered structures would provide information on the conformational tendencies and properties of the polypeptide chain in the denatured state. Collectively, these models of the denatured state are constructed by examining the regions of resolved protein structures that are outside the α-helix and β-strand domains. Indeed, analyses of "protein coil libraries" generally support the structural preferences that have been observed in peptide-based models. As these libraries of coil structures have evolved, the field has gained valuable insights into the roles of sequence context, intramolecular interactions, and protein hydration in determining the intrinsic structural tendencies of the amino acids.

In 1995, Swindells and Thornton generated one of the first iterations of a protein coil library based on high-resolution protein structures [27]. Four basins were defined on the Ramachandran plot, corresponding to a (α-helix), b (β-sheet), p (PPII), and L (left-handed helix). Using 85 structures obtained from the PDB, they removed residues that were assigned helix or sheet conformation, retaining all coils, loops, and turns in the analyzed set. Within this set, residues Glu, Gln, Ser, Asp, and Thr demonstrated strong propensities for the "a" region, as their side chains have both the hydrogen bonding capacity and rotational flexibility to form hydrogen bonds to backbone groups. The "b" propensities appeared to be less sensitive to the chemistry and rotamer of the side chain, consistent with the location of the side chain relative to the backbone when in the β-sheet conformation. While the authors did not explicitly discuss the "p" region (PPII), their data show a significant redistribution of the population among the four basins when the "whole" and "coil library" sets are compared. When the entire polypeptide chain was considered, the a and b basins were the two most highly populated. In the coil library, with helices and sheets removed, the a and p basins exhibited the highest populations. This demonstrated that in the structures of intact proteins, PPII conformations are well represented in the non-α and non-β regions.

This work was followed by an analysis of the PPII content in 274 high-resolution structures conducted by Stapley and Creamer [74]. In their analysis, they found the PPII conformation was common, with more than half of the proteins containing at least one PPII helix longer than three residues, despite PPII residues comprising just 2% of all residues in the dataset. This study was the first to detail the PPII propensities of each side chain. Predictably, Gly was disfavored, while Pro had a strong PPII propensity. Additionally, they observed that Gln, Arg, Lys, and Thr had generally strong propensities for adopting PPII conformations. Moreover, a positional dependence of PPII propensity within the PPII helix was also found. The ability of polar side chains, such as Gln, Lys, and Arg, to form hydrogen bonds with the backbone between *i* and *i* + 1 positions stabilizes the PPII helix. This is consistent with the overrepresentation of Gln, Arg, Lys, and Thr in the first PPII helix position. These data also supported the idea that PPII helices have extensive solvent exposure, as there was a significant negative correlation between nonpolar solvent accessibility and PPII propensity. Taken together, their work demonstrated that both solvent accessibility and the ability to form hydrogen bonds with the backbone were important elements of PPII propensity, consistent with prior work in peptides.

In 2005, Rose and coworkers developed a protein coil library (PCL) that is web-accessible [28]. The PCL is updated whenever the PDB is updated. This repository of structure elements uses regular expressions for α-helices and β-sheets and then extracts all non-helix and non-sheet residues from deposited structures that share <90% identity. Note that, as a result, the PCL contains both turns and homologous sequences. Additionally, for structure classification purposes, the PCL divides the Ramachandran plot into 30° × 30° bins, whereby each bin refers to one of 144 different "mesostates".
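The 30° × 30° binning described above can be illustrated with a minimal sketch. It assumes angles in degrees, a grid that starts at (−180°, −180°), and a simple row-major integer index from 0 to 143. The function name `mesostate_index`, the grid origin, and the numbering are illustrative assumptions, not the PCL's own labeling scheme.

```python
BIN_WIDTH = 30                      # degrees per bin edge, as stated in the text
BINS_PER_AXIS = 360 // BIN_WIDTH    # 12 bins along phi and along psi -> 144 bins


def mesostate_index(phi, psi):
    """Map a (phi, psi) pair in degrees to an integer bin index in 0..143.

    Assumption: bins start at -180 degrees on both axes and are numbered
    row by row (psi selects the row, phi selects the column).
    """
    # Shift angles so that -180 maps to 0, wrapping +180 back onto -180.
    phi_shifted = (phi + 180.0) % 360.0
    psi_shifted = (psi + 180.0) % 360.0
    col = int(phi_shifted // BIN_WIDTH)
    row = int(psi_shifted // BIN_WIDTH)
    return row * BINS_PER_AXIS + col


if __name__ == "__main__":
    # Two example backbone conformations: one in the alpha region and one in
    # the PPII region of the Ramachandran plot.
    print(mesostate_index(-65.0, -45.0))
    print(mesostate_index(-65.0, 135.0))
```

Under these assumptions the two example conformations fall into different bins (indices 51 and 123), so a coarse 144-state alphabet is already sufficient to separate the α and PPII regions.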

An analysis in 2008 by Perskie et al. identified seven naturally clustering basins in a Ramachandran plot of PCL structures [30]. These basins represent the familiar α, β, PPII, αL, and τ (type II' β-turn) structural motifs, and also a γ basin, for inverse γ turns, and a δ basin that captures residues preceding a proline in proline-terminated helices. This allowed amino acid preferences for the different basins (see Table 2 in ref. [30]) to be determined and studied. For example, solvent–backbone hydrogen bonding, which can favor PPII [14], and side chain–side chain sterics, which for branched amino acids adjacent to proline can favor δ at the expense of β, were found to be crucial determinants of the basin preferences.

To better understand how the conformational preferences of a residue in the denatured state depend on the identity and state of its adjacent (nearest) neighbor, Freed and coworkers constructed an increasingly stringent set of coil libraries [29]. Using 2020 nonhomologous polypeptide chains, the "full" set was defined as the entire polypeptide chain, sans the terminal residues. The first cull of the full set (Cαβ) removed the residues identified as α-helix or β-sheet, similar to the original coil libraries and the PCL described above. This had the effect of reducing the number of residues to 40% of the original. The next restriction additionally removed hydrogen-bonded turns from the set (Cαβ<sup>t</sup>), slimming the library to 28% of the original. Finally, to produce the most restricted coil library, the authors retained only those residue positions located within contiguous stretches four residues or longer, and which were "internal" to coils. This had the effect of reducing "end bias" from structured regions, which is known to favor PPII at the expense of α and β.
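The successive culls described above can be sketched as a small filtering routine. This is a sketch under stated assumptions, not the procedure used in ref. [29]: each residue is assumed to carry a one-letter label (`H` helix, `E` sheet, `T` hydrogen-bonded turn, `C` coil), and "internal" is interpreted here as dropping `flank` residues from each end of every qualifying stretch. The label alphabet, the function names, and the `flank` parameter are all hypothetical.

```python
def contiguous_runs(indices):
    """Split a sorted list of residue indices into runs of consecutive values."""
    runs, run = [], []
    for i in indices:
        if run and i != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(i)
    if run:
        runs.append(run)
    return runs


def cull_coil_library(labels, min_run=4, flank=1):
    """Return residue indices kept at each level of stringency.

    labels : string of per-residue labels (assumed alphabet: H, E, T, C)
    min_run: minimum length of a contiguous coil stretch (four in the text)
    flank  : residues trimmed from each end of a stretch (an assumption)
    """
    # "Full" set: the entire chain without its two terminal residues.
    full = list(range(1, len(labels) - 1))
    # First cull: remove helix and sheet residues.
    c_ab = [i for i in full if labels[i] not in "HE"]
    # Second cull: additionally remove hydrogen-bonded turns.
    c_abt = [i for i in c_ab if labels[i] != "T"]
    # Most restricted set: interior positions of sufficiently long stretches.
    restricted = []
    for run in contiguous_runs(c_abt):
        if len(run) >= min_run:
            restricted.extend(run[flank:len(run) - flank])
    return {"full": full, "c_ab": c_ab, "c_abt": c_abt, "restricted": restricted}


if __name__ == "__main__":
    example = "CHHHHCCCCCCTTEEEECCC"  # made-up labels for illustration only
    for name, kept in cull_coil_library(example).items():
        print(name, kept)
```

For the made-up label string in the example, only positions 6 to 9 survive into the most restricted set: the six-residue coil stretch loses one residue at each end, and the short C-terminal stretch is discarded entirely.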

The sequential removal of ordered residues had the overall effect of increasing PPII content and decreasing α populations in the coil library. Specifically, when all structured positions were included, α-helical conformations were the predominant state. Upon removing the α-helix and β-sheet residues (as Swindells and Thornton did a decade prior), the PPII conformation emerged as a major subpopulation. With turns also removed (Cαβ<sup>t</sup>), the most populated conformation was PPII, and there was a significant reduction in the α population. The dominance of the PPII conformation is not restricted to a particular subset of amino acids, as all 20 amino acids show a considerable propensity to adopt the PPII configuration (Table 3). The most restricted set (with only residues that are well within coil regions) showed little change in the population distribution, with the PPII population continuing to be dominant.

Using the most restricted set, the authors also found that the size of the PPII subpopulation is constant regardless of solvent accessibility [29]. Moreover, PPII is the dominant conformation in all but the 10% most surface-exposed residues. The α-helix dominates in the surface residues, due to the propensity of the polypeptide backbone at the surface to preferentially turn back toward the folded core of the protein. The independence of PPII content and solvent accessibility initially appears to contrast with earlier work with both peptides and earlier versions of PCLs. However, these results can be reconciled by understanding the PPII conformation as a mechanism for maximizing backbone hydrogen bonding. In the PPII conformation, the backbone amides and carbonyls are in positions that both minimize steric hindrance and enable both functional groups to form hydrogen bonds, either with solvent molecules or within the protein [29]. Therefore, the PPII propensity likely reflects the intrinsic hydrogen bond capacity of a polypeptide, not merely solvation.


**Table 3.** Amino acid specific propensity for the PPII backbone conformation in the protein coil library.

<sup>a</sup> Calculated by Freed and coworkers using a restricted coil library that removed α-helices, β-sheets, turns, and residues flanking secondary structures from a set of protein structures [29].

These general results can be replicated using almost any set of nonhomologous protein structures. Figure 2 shows results from a curated set of 122 human protein structures, sharing less than 50% sequence identity and with structural resolution < 2.0 Å [75]. In the full set, containing 15,958 residue positions, the α conformation is the most populated (Figure 2A). When α-helices and β-strands are removed, PPII is the most favored conformation for the remaining 6418 residue positions (Figure 2B).

**Figure 2.** Protein coil libraries are dominated by PPII conformations. In total, 122 non-homologous human structures were analyzed for individual residue conformations (including Gly). (**A**) Ramachandran plot for every residue in the set (15,598 residues). The major population is in the α region, centered at (−65°, −45°). (**B**) Ramachandran plot of the same set after removing all identified α-helix and β-sheet residues (identified using the information provided in the PDB structure file header), yielding 6418 remaining. The major population has shifted to the PPII region, and peaks at (−65°, 135°). For both plots, color represents the count in 10° × 10° bins.
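The binned counts underlying a plot of this kind can be sketched in a few lines. The example below assumes the (φ, ψ) angles have already been extracted from the structures as pairs in degrees. It counts residues in 10° × 10° bins and reports the most populated bin by its center. The function names and the toy angle list are illustrative assumptions and are not taken from the analysis behind Figure 2.

```python
from collections import Counter


def bin_center(angle, width=10):
    """Return the center (in degrees) of the bin that contains `angle`."""
    shifted = (angle + 180.0) % 360.0          # wrap into [0, 360)
    return int(shifted // width) * width - 180 + width / 2


def ramachandran_counts(angles, width=10):
    """Count (phi, psi) pairs per bin, keyed by bin-center coordinates."""
    return Counter(
        (bin_center(phi, width), bin_center(psi, width)) for phi, psi in angles
    )


def peak_bin(counts):
    """Return ((phi_center, psi_center), count) for the most populated bin."""
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Toy data only: three PPII-like, two alpha-like, one left-handed residue.
    toy_angles = [
        (-65.0, 135.0), (-63.0, 138.0), (-68.0, 131.0),
        (-65.0, -45.0), (-62.0, -41.0),
        (60.0, 45.0),
    ]
    counts = ramachandran_counts(toy_angles)
    print(peak_bin(counts))
```

With the toy data, the most populated bin is centered at (−65°, 135°) with three residues. Applying the same counting to a full set and to its coil-only subset is one way to see how the peak moves between regions.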

The consistency of PPII propensity in protein coil libraries, especially when viewed in light of hydrogen bonding capacity, therefore predicts that a bias toward PPII conformations is an inherent characteristic of the polypeptide backbone.
