2. Results and Discussion
The occupancy-weighted average α-carbon B-factor was calculated for each of the resulting chains and plotted as the ordinate of a graph using structure resolution as the abscissa, in the hope of observing an association, even if possibly a weak one. However, figure 3 shows at best a resolution-dependent upper bound for the mean Cα B-factor. The lack of a stronger association can at least partially be attributed to the surprisingly low B-factors reported by some structures. Many files contained atomic coordinates with B-factors of 2 Å² and below, and several included negative B-factors. It was therefore necessary to remove from the study-set any structures containing B-factors lower than some reasonable value. This cut-off value came from a high resolution structure with reliable low B-factors, with no negative values. The highest resolution structure of lysozyme in the PDB at the time of this writing, PDB ID 2vb1 [14], lists no B-factors lower than 2.15 Å², so this was selected as the cut-off. 597 HIV protease chains passed, represented in figure 3 as blue points. This filtering step noticeably improved the linearity of the relationship between resolution and Cα B-factors.
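The two quantities used in this step can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the `Atom` record and both function names are hypothetical, and it assumes that the cut-off is checked against every atom of a chain (the text does not say whether the test was applied per atom, per Cα, or per whole structure).

```python
from typing import List, NamedTuple

B_FACTOR_CUTOFF = 2.15  # Å^2, the lowest B-factor in PDB ID 2vb1 (from the text)


class Atom(NamedTuple):
    """Hypothetical minimal atom record; not a real parser type."""
    name: str         # atom name, e.g. "CA" for the alpha carbon
    b_factor: float   # Å^2
    occupancy: float  # 0.0 to 1.0


def weighted_mean_ca_b(atoms: List[Atom]) -> float:
    """Occupancy-weighted mean B-factor over the Cα atoms of one chain."""
    ca_atoms = [a for a in atoms if a.name == "CA"]
    total_occupancy = sum(a.occupancy for a in ca_atoms)
    if total_occupancy == 0:
        raise ValueError("chain has no Cα atoms with non-zero occupancy")
    weighted_sum = sum(a.b_factor * a.occupancy for a in ca_atoms)
    return weighted_sum / total_occupancy


def passes_cutoff(atoms: List[Atom], cutoff: float = B_FACTOR_CUTOFF) -> bool:
    """True if no atom has a B-factor below the cut-off (assumed per-atom test)."""
    return all(a.b_factor >= cutoff for a in atoms)
```

Alternate conformers of the same residue enter the average as separate `CA` records, each weighted by its own occupancy, which is one plausible reading of "occupancy-weighted" here.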
Figure 3.
Quality of deposited PR structures. Monomers that passed the B-factor cut-off of 2.15 Å² are marked blue, whereas those that failed are red. NMR structures, to which the concept of resolution does not apply, were not used for this plot.
The PR monomers composing the study-set were superposed by the least squares method. Shown in figure 4A, the ribbon diagram representation of this superposition resembles an ensemble of NMR structures, even though no NMR structures were present in the data set. Though motion cannot be inferred directly from crystallographic data, it is worth noting that the greatest variation is observed in the flap and elbow, supporting the findings of NMR and molecular dynamics studies that have described these regions as the most dynamic [18]. Interestingly, the flap region showed a greater relative thermal stability. Figure 4B, a putty cartoon based on mean Cα B-factors, shows much greater values in the elbow and 60’s loop than in the flap. However, when considering spatial displacement from the mean monomer (as in figure 4C), the tip of the flap joins the elbow and the 10’s and 60’s loops (defined as in figure 1) as one of the most variable regions, even though some of the range suggested by figure 4A has been averaged out.
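As a rough illustration of "spatial displacement from the mean monomer", the sketch below averages Cα positions over monomers and then measures how far each monomer lies from that average at every residue. It rests on several assumptions that are not taken from the text: the monomers are already superposed, each is a list of Cα coordinates in the same residue order, displacement is defined as the mean Euclidean distance to the mean position, and occupancy weighting of alternate conformers is left out. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def mean_structure(monomers: Sequence[Sequence[Coord]]) -> List[Coord]:
    """Average each residue's Cα position over all (pre-superposed) monomers."""
    if not monomers:
        raise ValueError("no monomers given")
    n_res = len(monomers[0])
    if any(len(m) != n_res for m in monomers):
        raise ValueError("all monomers must have the same number of residues")
    n = len(monomers)
    return [
        (
            sum(m[i][0] for m in monomers) / n,
            sum(m[i][1] for m in monomers) / n,
            sum(m[i][2] for m in monomers) / n,
        )
        for i in range(n_res)
    ]


def mean_displacement(monomers: Sequence[Sequence[Coord]]) -> List[float]:
    """Per-residue mean distance (Å) of the monomers from the mean structure."""
    mean = mean_structure(monomers)
    n = len(monomers)
    return [
        sum(math.dist(m[i], mean[i]) for m in monomers) / n
        for i in range(len(mean))
    ]
```

Residues where the returned value is large are the ones a putty cartoon would draw thickest.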
Figure 4.
PR monomers. (A) Ribbon diagram of the final data set superposed by least squares. (B) Putty cartoon of B-factor variation on the mean structure, colored from low to high (yellow to red). (C) Putty cartoon of spatial variation on the mean structure, colored from low to high (blue to green). Refer to figure 1 for definitions of PR regions. Rendered using PyMOL (DeLano, 2002).
A possible explanation would be the existence of two distinct conformations of the enzyme: open and closed. In the latter, the presence of a ligand would enable interactions that hold the flap closed, ensuring its stability. In the former, steric clashes with symmetry-related molecules may limit flap opening and movement, or alternate conformers may be induced by amino acid variation. An analysis of crystal contacts across the various space groups mentioned in Table 1 affirms that there are several crystal contacts on the flap and elbow regions. Residues that formed crystal contacts in all the structures within each space group were used to calculate consensus contact regions within the space group. Though there are regions of contact that are specific to some space groups, the elbow and flap stretch and a number of other key contact points were common for all the space groups.
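The consensus step described above amounts to a set intersection, first within a space group and then across groups. The following is a minimal sketch under the assumption that the contacting residue numbers of each structure have already been determined by some other tool; the function name and the input layout are hypothetical.

```python
from typing import Dict, Set


def consensus(residue_sets: Dict[str, Set[int]]) -> Set[int]:
    """Residues present in every set of the mapping (empty if the mapping is empty)."""
    sets = list(residue_sets.values())
    if not sets:
        return set()
    return set.intersection(*sets)


def consensus_by_space_group(
    contacts: Dict[str, Dict[str, Set[int]]]
) -> Dict[str, Set[int]]:
    """Map each space group to the residues in contact in all of its structures.

    `contacts` is keyed by space group, then by structure identifier.
    """
    return {group: consensus(structures) for group, structures in contacts.items()}
```

Calling `consensus` a second time on the output of `consensus_by_space_group` gives the contact residues shared by all space groups.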
Table 1.
Distribution of crystal contacts by residue in representative structures from each of the space groups reported for PR.
Figure 5A, a plot of the range of temperature factors and spatial displacement by residue, confirms the elbow, the tip of the flap, and the other loops as maxima. Overall, the B-factor seems to adequately fulfill its role as a de facto measure of spatial variation because there is high agreement in the location of the extrema of the two data series. However, the B-factor is not as reliable in predicting the magnitude of these extrema. Crystal contacts deduced from structures in different space groups seemingly coincide with regions of higher B-factors, which may support an effect of crystal artifacts on the actual dynamics and B-factor values. The unreliability is especially apparent when separately treating ligand-bound (figure 5B) and ligand-free (figure 5C) monomers. The majority of PR structures in the PDB are bound to a ligand, so figures 5A and 5B do not differ significantly.
Figure 5.
PR B-factor (orange) and spatial displacement (blue) variation with residue number. (A) Final data set, (B) bound monomers, (C) ligand-free monomers. Values were normalized for comparison purposes. Mean and standard deviation values are given in Table 2. Secondary structure elements are identified.
In figure 5C, on the other hand, the tip of the flap exhibits by far the greatest displacement from the mean ligand-free structure, and this value is much larger than the corresponding B-factor might indicate. Furthermore, the distribution of ligand-free structures is as a whole more variable spatially than that of bound ones, as described in Table 2. Spatial displacement over ligand-free structures has both a greater mean, 0.577 Å, and a greater standard deviation, 0.465 Å, than over bound structures (mean = 0.343 Å, standard deviation = 0.160 Å). The difference may be partially due to the discrepancy in sample sizes, but it nevertheless suggests the possibility of multiple PR conformations in the absence of a ligand.
Table 2.
PR B-factor and spatial displacement distribution. This table shows the mean spatial displacement observed in the ligand-bound vs. ligand-free PR structures.
From this analysis of an entire array of structures, it is also possible to obtain an estimate of what Å value represents a significant conformational change. Referring to the statistics in Table 2, a change of 0.5 Å or below is within error range. A spatial displacement of 1.0 Å, or approximately four standard deviations from the mean of the entire study-set, is more convincing. In the distance matrix of pairwise RMSDs, of which a small segment is given as Table 3, several structures, namely PDB IDs 1xl2, 2fns, 2fnt, 2hs1, 2hs2, and 5upj, have RMSDs of 0.95 Å and above separating the monomers that compose them, supporting the finding that the two monomers adopt different conformational states when PR binds an asymmetric ligand [19]. The initial work of Prabu-Jeyabalan et al. [19] was on an inactivated HIV-1 PR-substrate complex, and the two monomers in the reported structure (PDB ID 1f7a) have an RMSD of only 0.34 Å. However, of the aforementioned structures, only PDB IDs 2fns and 2fnt have peptide ligands whereas the rest are bound to non-peptides, and PDB ID 5upj is an HIV-2 PR. The observation may therefore be conjectured to hold generally for HIV proteases and asymmetric ligands.
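The pairwise RMSD matrix and the standard-deviation argument can both be written down compactly. The sketch below is illustrative only: it assumes each monomer is a list of Cα coordinates that has already been superposed and shares the residue order of the others, and the function names and the `threshold` parameter are hypothetical.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation (Å) between two equally long coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def rmsd_matrix(monomers: Dict[str, Sequence[Coord]]) -> Dict[Tuple[str, str], float]:
    """RMSD for every unordered pair of monomers, keyed by the two sorted labels."""
    return {
        (i, j): rmsd(monomers[i], monomers[j])
        for i, j in itertools.combinations(sorted(monomers), 2)
    }


def pairs_at_or_above(
    matrix: Dict[Tuple[str, str], float], threshold: float
) -> List[Tuple[str, str]]:
    """Pairs whose RMSD is at or above the threshold, e.g. 0.95 Å as in the text."""
    return sorted(pair for pair, value in matrix.items() if value >= threshold)


def z_score(value: float, mean: float, sd: float) -> float:
    """Number of standard deviations separating a value from the mean."""
    return (value - mean) / sd
```

As a worked check of the scale involved, feeding in the bound-subset statistics quoted earlier (mean 0.343 Å, standard deviation 0.160 Å) as stand-ins for the full study-set values of Table 2, `z_score(1.0, 0.343, 0.160)` is about 4.1, whereas `z_score(0.5, 0.343, 0.160)` is about 1.0.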
Table 3.
Representative table of the pairwise RMSD distance (Å) matrix of the 587 monomers in the study set. Rows and columns are labeled with the PDB ID and chain identifier.
To further understand the effects of ligand binding on PR structure, monomers were superposed within the bound and ligand-free subsets. Figure 6A shows the result for the ligand-free monomers. Surprisingly, not all structures had flaps in the “semi-open” or open conformations. Several exhibited the closed flap conformation, though closer examination revealed these to belong to covalently bonded PR dimers (PDB IDs 1g6l and 1lv1) that were split into monomers by removing the bridge of connecting residues. This also explains why these structures differ noticeably at the C-terminus from the other ligand-free structures. The superposition of the mean bound and mean ligand-free monomers, rendered as a ribbon diagram in figure 6B, shows a prominent difference in the tip of the flap but little variation elsewhere. Figure 6C gives the same information as a plot of spatial difference by residue, and the spike corresponding to the tip of the flap is unmistakable. However, the maximal distance, 2.75 Å, is smaller than the actual deviation between the open and closed conformations, because the mean ligand-free structure is closer to the semi-open state due to averaging.
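The per-residue comparison of two mean structures reduces to a distance per residue followed by a search for the maximum. The sketch below assumes both mean structures are lists of Cα coordinates that are already superposed and cover the same residues in the same order; the function names and the `first_residue` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def per_residue_difference(
    mean_a: Sequence[Coord], mean_b: Sequence[Coord]
) -> List[float]:
    """Distance (Å) between corresponding Cα positions of two mean structures."""
    if len(mean_a) != len(mean_b):
        raise ValueError("mean structures must cover the same residues")
    return [math.dist(p, q) for p, q in zip(mean_a, mean_b)]


def largest_difference(
    mean_a: Sequence[Coord], mean_b: Sequence[Coord], first_residue: int = 1
) -> Tuple[int, float]:
    """Residue number and distance of the largest per-residue difference."""
    diffs = per_residue_difference(mean_a, mean_b)
    if not diffs:
        raise ValueError("mean structures are empty")
    index = max(range(len(diffs)), key=diffs.__getitem__)
    return first_residue + index, diffs[index]
```

Because both inputs are averages, the largest value found this way understates the separation between individual open and closed structures, as noted in the text.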
Figure 6.
Ligand effects on PR monomer structure. (A) Superposition of ligand-free monomers; (B) superposition of mean ligand-free (orange) and bound (blue) monomers; (C) plot of spatial difference vs. residue number for mean ligand-free and bound monomers. Ribbon diagrams rendered using PyMOL.
Monomers were also organized on the basis of crystallographic space group and superposed to obtain mean monomers. Most of the representative monomers were in the closed conformation, as shown in figure 7. Exceptions were the monomers corresponding to the C2, P4₁2₁2, and P4₁ space groups. Interestingly, all structures that crystallized in the C2 space group were bound to a ligand, as listed in Table 4. This surprising observation may be due to the fact that all C2 structures except PDB ID 1ztz were of HIV-2 PR and solved during the same study. The deviation of the P4₁2₁2 space group is accounted for by noting that all its monomers are ligand-free, except one (PDB ID 3bc4 [20]) whose flaps are prevented from closing by two non-peptide inhibitors that pack the active site, acting as a wedge. Finally, the P4₁ space group has an almost equal distribution of bound and ligand-free monomers but differs the most from the closed conformation, which would be expected of a predominantly ligand-free space group. This may be because most of the P4₁ structures have mutations at residues 82 and 84, which are essential to ligand binding and structural stability in the active site [21].
Figure 7.
Superposition of mean PR monomer structures for the space groups: P2₁ (orange), C2 (red), P2₁2₁2 (chartreuse), P2₁2₁2₁ (yellow), I222 (purple), P4₁ (cyan), P4₃ (lime green), P4₁2₁2 (blue), P4₃2₁2 (magenta), I4₁22 (salmon), P6₁ (olive), P6₅ (brown), P6₁22 (pink) and I2₁3 (green). Ribbon diagram rendered using PyMOL. Table 4 describes the distribution of space groups in the final data set.
Table 4.
Distribution of PR structures by space group. In the strictest sense, the distribution should be further subdivided because not all structures belonging to the same space group have isomorphous unit cells.
4. Conclusions
Analysis of a static array of PDB structures to gain further insight into a protein has great potential as a method to deduce a statistical bar for structural variation, as demonstrated by this PR case study. While there are other algorithms to data-mine the PDB, it is clear from this study that quality control is required before using a data-set for analysis. There also exist algorithms to superpose structures, but to our knowledge, this is the first method that also occupancy-weights available conformations for the superposition. The algorithms described here were used to data-mine the PDB, filter search results, perform quality control, and superpose structures. This made possible a comparison of B-factors and spatial variation over the entire study-set of PR monomers, the bound and ligand-free subsets, and the different represented space groups. Examination of the resulting distributions is an alternative way of identifying a protein’s most variable regions and qualifying spatial displacement as significant or within the range of error.
However, such an approach to protein study is made more difficult by the many different practices of PDB depositors even within the limits of a file format with a detailed specification. Choice of title, choice of keywords, numbering of residues, and organization into models and chains are often overlooked. This is unnoticeable to a human user, but it makes selection of the study-set the most complex and error-prone step of a data-mining endeavor. Additionally, many structures abuse B-factors, occupancies, and other parameters, or assign them special meaningless values not specified by the PDB file format. Therefore, quality controls must be implemented to exclude from such investigations any structures with statistics that might bias results. For data-mining investigations to be successful, a paradigm shift will be required of depositors to the PDB: to stop treating the painstaking process of preparing a structure for submission as an unnecessary complication and to see the PDB itself not just as a collection of coordinates, but as a tool that could shed light on many of the questions of structural biology.