1. Introduction
Atomistic molecular dynamics (MD) simulations can reliably assess dynamical properties in equilibrium structures of molecular systems of interest, given an ergodic sampling and an accurate force field. The force field parameters are calibrated to reproduce properties measured by experiments or simulations. Considering the immense complexity of macromolecular systems, and the sensitivity of weak (hydrogen-bonding and dispersion) non-covalent interactions in a liquid phase, contributing to intra-solute, solute-solvent and solvent-solvent interactions, even modest inaccuracies in models and their parameters can adversely impact the results of atomistic molecular simulations, especially of challenging systems such as intrinsically disordered proteins (IDPs). IDPs are elusive to experimental studies, thus atomistic simulations are a crucial tool to provide detailed insight into their complex structure, dynamics, and function. Unfortunately, computational studies of IDPs are often found to disagree with experimental data. The free energy landscape of IDPs is inverted compared to the structured proteins [
1], which makes computational studies focusing on IDPs very challenging. Discrepancies between theory and experiments are commonly attributed to either force field biases [
2,
3] or insufficient sampling. This motivated the development of molecular force fields designed to handle IDPs [
4,
5,
6] and to apply enhanced sampling techniques [
7,
8,
9,
10,
11] or restraints derived from experimental data (e.g., solution NMR) in simulations of IDPs [
10,
12,
13]. The outcomes of those efforts were successful to various extents; however, IDP simulations still require parameter improvements [
10,
14,
15,
16,
17].
IDPs have unordered structures in aqueous solution, and while either dehydrated or interacting with lipid membranes, they exhibit increased amounts of ordered secondary structures [
18]. This clearly shows that IDPs are highly sensitive to solvation effects [
19,
20] and suggests that focusing on the improvement of the water models used in the simulations may offer a more accurate yet computationally feasible framework for reliable simulations of this class of proteins.
The complexity of the water properties, combined with multiple possible levels of approximation, has led to the proposal of dozens of water models. Simplified classical water models, such as widely popular three-point SPC [
21] and TIP3P [
22] models, are currently indispensable components of atomistic MD toolkits. Yet, despite several decades of extensive research, these models are still far from perfect. To start, none of them accurately reproduces the key properties of bulk water [
23]. Alternative approaches, most notably the “optimal” three-charge, four-point rigid water model (OPC) [
24], have been developed and tested recently. OPC uses an optimised distribution of point charges to best describe the electrostatics of the water molecule, in contrast to the ‘conventional’ approach to constructing classical solvation models, which often imposes geometry constraints [
25]. However, simplified classical water models, particularly the simplest, non-polarisable three-point models, are still the most commonly used in the biomolecular simulations community, due to their computational efficiency and simplicity.
In simulations of IDPs, the best-performing water models have a charge distribution with a large dipole moment, a large quadrupole moment, and negative charge out of the molecular plane, to give symmetrically ordered tetrahedral hydration [
26]. We have observed that the dipole calculated for the very popular TIP3P model is too low, resembling the dipole of an isolated water molecule in a vacuum (2.36 D) rather than that of a molecule in the bulk liquid state (3 D). The exact value of the liquid water dipole is still debated; however, in this study, we rely on the results of the most recent first-principles simulations of liquid-state water. To improve the properties of the TIP3P water model, it therefore seemed crucial to adjust the dipole: we have done so by augmenting the partial atomic charges of the water molecule. The performance of the resulting model, denoted as the Charge-Augmented Three-Point Water Model for Intrinsically Disordered Proteins (CAIPi3P), was subsequently tested on model IDPs: histatin 5, R/S-peptide, the partially disordered
At2g23090 protein from
A. thaliana, and two domains of the La-related protein: RNA Recognition Motif 1 (RRM1) and La-Motif (LaM). We observed that the dipole adjustment dramatically improved the performance of the model, in terms of the reproducibility of experimental data for IDPs, without negatively affecting the performance, speed, and data reproducibility for the folded regions/domains of partially disordered systems or the ‘structured’ proteins.
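The relation between the partial charges and the dipole of a rigid three-point model can be made explicit with a short calculation. The sketch below assumes the standard TIP3P parameters (hydrogen charge of 0.417 e, O-H bond length of 0.9572 Å, H-O-H angle of 104.52°) and a fixed geometry; the function names and the 3.0 D target are hypothetical and serve only as an illustration, not as the published CAIPi3P parametrisation.

```python
import math

# Conversion factor: 1 e*Angstrom expressed in debye.
E_ANGSTROM_TO_DEBYE = 4.80320

# Standard TIP3P parameters (assumed here; they are not listed in the text above).
TIP3P_Q_H = 0.417      # partial charge on each hydrogen, in e
TIP3P_R_OH = 0.9572    # O-H bond length, in Angstrom
TIP3P_HOH = 104.52     # H-O-H angle, in degrees


def three_point_dipole(q_h, r_oh, hoh_deg):
    """Dipole (debye) of a rigid three-site water with +q_h on each H and -2*q_h on O."""
    half_angle = math.radians(hoh_deg) / 2.0
    # Components perpendicular to the H-O-H bisector cancel;
    # only the projections of the two O-H vectors on the bisector add up.
    return 2.0 * q_h * r_oh * math.cos(half_angle) * E_ANGSTROM_TO_DEBYE


def hydrogen_charge_for_dipole(target_debye, r_oh, hoh_deg):
    """Hydrogen charge (e) that yields target_debye when the geometry is kept fixed."""
    half_angle = math.radians(hoh_deg) / 2.0
    return target_debye / (2.0 * r_oh * math.cos(half_angle) * E_ANGSTROM_TO_DEBYE)


mu_tip3p = three_point_dipole(TIP3P_Q_H, TIP3P_R_OH, TIP3P_HOH)

# Hypothetical target dipole, used only to illustrate the charge augmentation.
q_h_augmented = hydrogen_charge_for_dipole(3.0, TIP3P_R_OH, TIP3P_HOH)

print(f"TIP3P dipole: {mu_tip3p:.2f} D")
print(f"q_H for a 3.0 D dipole at fixed geometry: {q_h_augmented:.3f} e")
```

Because the dipole is linear in the hydrogen charge when the geometry is fixed, augmenting the charges scales the dipole by the same factor, which is why a charge adjustment alone is sufficient to change it.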
3. Discussion
This work focused on the development of the novel three-point solvation model, denoted as CAIPi3P. Compared to the established and popular TIP3P model, CAIPi3P, which is based on the same framework, considerably improved the sampling of intrinsically disordered model peptides. All-atom MD simulations using CAIPi3P improved the SAXS scattering profile for two model IDPs: R/S peptide and histatin 5, and partially disordered At2g23090 from A. thaliana with the central IDR. The improvement was evident for all force fields used for the protein, although the selection of the most appropriate force field plays a vital role in the sampling improvement.
For the R/S peptide, the improvement was evident in simulations with the AMBER03ws force field. Application of the CAIPi3P model resulted in a better agreement for the radius of gyration since the framework prevented the artificial collapse of the polypeptide chain, which is a common pitfall of atomistic simulations of IDPs. CAIPi3P, due to its modified electrostatics, maintained the generated stretched conformation, which resulted in better agreement with the experimental data. In the interpretation of the results, it is essential to focus on the differences in primary sequence between the two model IDPs. Histatin 5 has several polar residues dispersed throughout the length of the peptide, resulting in an overall uniform polar distribution. This homogeneous distribution helps the polypeptide chain to maintain favourable interactions with the solvent, resulting in the overall expanded structure. The R/S peptide is polar and charged, with the charged residues located within the eight C-terminal arginine–serine (R/S) repeats (
Figure 5, highlighted regions in the right panel). The obtained ensemble was affected by the C-terminal charge distribution, which facilitated the collapse of the polypeptide chain. Such a collapse was reduced when the AMBER03ws force field was applied. The sampling was further improved when CAIPi3P water was used, since it favoured the solute–solvent electrostatic interactions due to the increased dipole moment of the water molecule. Solvent–solute interactions thus competed with the excessive intramolecular solute–solute interactions that would otherwise lead to the collapse.
In this work, only AMBER force fields were tested. Rauscher and co-workers [
32] used the R/S peptide to assess the accuracy of the CHARMM36m force field, obtaining accurate results for the SAXS scattering profile.
For
At2g23090, MD simulations showed a good agreement with experimental data when using the CAIPi3P water model in combination with the AMBER99SB-ILDN protein force field. The differences between AMBER99SB-ILDN and AMBER03ws lie in the side chain charge distribution and in how certain residues interact with the solvent [
28,
39]. Consequently, in the
At2g23090 simulations with AMBER03ws, the compact globular C-terminal domain unfolded, increasing the interactions with the solvent molecules and the internal structural energy. In contrast, the AMBER99SB-ILDN force field kept the globular domains folded. This resulted in a similar SAXS pair distance distribution function (PDDF) between the resulting ensemble and the experimental data when using the CAIPi3P model. CAIPi3P water molecules interacted with the polar regions of the protein, improving the local sampling within the intrinsically disordered region and shielding the long-range interactions, which avoided the artificial collapse of the polypeptide chain. It is important to remark that
At2g23090 was solved by NMR using ARIA [
40] with explicit water refinement [
41]. This method of refinement, albeit reliable, may bias the final ensemble towards states sampled with TIP3P. However, this structure was based not only on ARIA, but also on NMR restraints, which should reduce the bias in the final NMR ensemble and in our simulations. One indication that the ARIA bias is not substantial is the fact that the
At2g23090 TIP3P simulations self-collapsed, resulting in an ensemble that differed significantly from the NMR conformations.
The average radius of gyration was also closer to the experimental value when CAIPi3P was used.
Table 2 shows the calculated and experimental values for all tested systems. Given the high structural fluctuations in IDPs, the error bars overlap significantly. Hence, there is no statistically significant difference in this respect when TIP4P/2005 and CAIPi3P are compared for both histatin 5 and the R/S peptide.
For LaRP6-LaM and LaRP6-RRM1, we decided to focus on the best-known combinations of force field and water model used in this work, AMBER99SB-ILDN+TIP3P and AMBER03ws+TIP4P/2005, and on how these force fields would be affected by CAIPi3P. Given the low extent of the disordered loops within LaRP6-RRM1, the usage of CAIPi3P did not cause a substantial improvement in comparison to TIP3P. However, simulations employing CAIPi3P sampled Rg values more accurately when combined with the AMBER03ws force field. LaRP6-LaM has a long IDR in its C-terminus, and because of this, CAIPi3P significantly improved the accuracy of the obtained conformations for both force fields, since it avoided the self-collapse of the C-terminal regions when used with AMBER99SB-ILDN and stabilised the structured scaffold of the core region when combined with AMBER03ws.
Nonetheless, there is a considerable improvement in the accuracy of the sampled conformations when simulations were carried out with the CAIPi3P solvation model.
Table 4 shows that systems simulated with CAIPi3P resulted in the lowest difference between the calculated and experimental SAXS scattering profile, with the root mean square difference between the calculated and experimental PDDF shown in
Table 4.
The bulk water parameters calculated for CAIPi3P are summarised in
Table 5. By changing the dipole moment of the TIP3P water model, most of the bulk water parameters were improved for CAIPi3P in comparison to the standard TIP3P model. However, several significant changes need to be addressed, such as the average oxygen-oxygen radial density distance (R_O-O) and the density. The R_O-O distance for CAIPi3P was lower than the experimental distance, resulting in a higher density of 1.06 g/cm³. This results in a more compact water configuration, increasing the water-water correlation and decreasing the overall potential energy of the bulk water, and deeply affecting the density temperature dependence (
Figure S6). Therefore, the usage of a higher dipole yields higher barriers to reorganise the solvent surrounding the solute, which contributes to the better sampling of the protein observed in CAIPi3P simulations.
The differences between the experimental and CAIPi3P bulk water parameters show that the latter requires improvements. These modifications may come from tuning the vibrational frequency of the H-O-H angle to modify water-water interactions and decrease the strength of hydrogen bonds, which should yield better agreement with experimental data. CAIPi3P improved several parameters when compared to the standard TIP3P model; however, it is still less accurate than OPC and TIP4P/2005. Future work needs to address the bulk water issues of the CAIPi3P model, which should further increase its applicability to different systems.
4. Materials and Methods
To assess the role solvation effects have in reproducing the experimental parameters of IDPs, and to evaluate the applicability of the CAIPi3P model to studies of “mixed” ordered–disordered systems, we selected model IDPs (histatin 5 [
30] and R/S peptide [
32]) and At2g23090, which is partially disordered. To determine the performance of the model, simulations were run to compare CAIPi3P with established water models.
Fully extended conformations of histatin 5 (sequence: DSHAKRHHGYKRKFHEKHHSHRGY) and R/S-peptide (sequence: GAMGPSYGRSRSRSRSRSRSRSRS) were built using the UCSF Chimera [
42] package since their experimental atomistic structures were not available. The conformational ensemble of
A. thaliana At2g23090 (PDB code: 1WVK), obtained by solution NMR, was used to calculate the small-angle X-ray scattering (SAXS) distribution and radius of gyration. The lowest-energy conformer was selected as a starting point for all-atom molecular dynamics (MD) simulations.
For all systems investigated, missing hydrogen atoms were added, and several combinations of protein and water parametrisations were chosen, as summarised in
Table 6. All simulations were performed using the Gromacs 5.3 suite [
43]. The combinations of the force field and water models used are summarised in
Table 7. For each combination, a 1 nm cubic box was centred on the structure.
The system was solvated with the necessary number of water molecules to fill the protein simulation box. Next, sodium and chloride ions were added to the system at a concentration of 0.1 M to neutralise the simulation unit and to mimic the “physiological” salt concentration. The bonds were constrained using the LINCS algorithm [
44], setting a 2 fs time step. The electrostatic interactions were calculated using the particle-mesh Ewald method [
45], with a non-bonded cut-off set at 1 nm. All structures were energy minimised using the steepest descent algorithm for 20,000 steps. The minimisation was stopped when the maximum force fell below 1000 kJ/mol/nm, using the Verlet cut-off scheme. This was followed by an equilibration run (NVT ensemble) of 20 ps with a time step of 2 fs and position restraints applied to the backbone, during which the system was heated from 0 to 300 K, and by a second equilibration (NPT ensemble; 20 ps, 2 fs step) at constant temperature (300 K) and constant pressure (1 bar), again with backbone position restraints applied. The temperature was kept constant at 300 K by using an alternative Berendsen [
46] thermostat (τ = 0.1 ps). The pressure was kept constant at 1 bar by using a Parrinello–Rahman barostat with isotropic coupling (τ = 2.0 ps) to a pressure bath [
47]. Finally, three production runs (NPT ensemble) of 100 ns were run for each system, using every force field–solvation model combination.
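The production-run settings described above can be collected in a small script that derives the number of integration steps and writes them out as GROMACS-style run parameters. This is a minimal sketch rather than the input files used in this work: the option names follow the GROMACS .mdp conventions, whereas the constrained bond set (h-bonds), the V-rescale thermostat (chosen as one reading of the "alternative Berendsen" thermostat), and the compressibility value are assumptions not stated in the text.

```python
# Values taken from the protocol described in the text.
TIME_STEP_PS = 0.002     # 2 fs time step
PRODUCTION_NS = 100      # length of each production run


def production_mdp():
    """Return the production-run options as an ordered dictionary-like mapping."""
    # 100 ns at 2 fs per step corresponds to 50,000,000 integration steps.
    nsteps = round(PRODUCTION_NS * 1000 / TIME_STEP_PS)
    return {
        "integrator": "md",
        "dt": TIME_STEP_PS,
        "nsteps": nsteps,
        "cutoff-scheme": "Verlet",
        "coulombtype": "PME",
        "rcoulomb": 1.0,                  # nm
        "rvdw": 1.0,                      # nm
        "constraints": "h-bonds",         # assumption: the text only says bonds were constrained
        "constraint-algorithm": "lincs",
        "tcoupl": "V-rescale",            # assumption: one reading of "alternative Berendsen"
        "tc-grps": "System",
        "tau-t": 0.1,                     # ps
        "ref-t": 300,                     # K
        "pcoupl": "Parrinello-Rahman",
        "pcoupltype": "isotropic",
        "tau-p": 2.0,                     # ps
        "ref-p": 1.0,                     # bar
        "compressibility": 4.5e-5,        # bar^-1, assumption: typical value for water
    }


def render_mdp(options):
    """Format the options as 'key = value' lines, one per option."""
    return "\n".join(f"{key:<22} = {value}" for key, value in options.items()) + "\n"


if __name__ == "__main__":
    print(render_mdp(production_mdp()))
```

Keeping the run length and the time step as named constants makes the step count reproducible for the three independent 100 ns runs performed for each force field–solvation model combination.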
Ubiquitin (PDB code: 1UBQ) and lysozyme (PDB code: 253L) were selected for comparative runs to assess the effect of CAIPi3P water model on globular proteins with no IDRs. The simulation methodology was the same as the one described for the IDPs, with the exception that only the AMBER99SB-ILDN force field was used in combination with either the TIP3P or CAIPi3P solvation model. To check the convergence of simulation, the average radius of gyration and the radius of gyration standard deviation through time were plotted and analysed for histatin 5 (
Figures S7 and S8), R/S Pep (
Figures S9 and S10),
At2g23090 (
Figures S11 and S12), LaRP6-RRM1 (
Figures S13 and S14) and LaRP6-LaM (
Figures S15 and S16). Since most of the simulations achieved a plateau within 100 ns, the sampled ensembles were used for the SAXS calculations.
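The convergence check described above relies on the running average and the running standard deviation of the radius of gyration. A minimal sketch of such an analysis is given below; the function names, the tail fraction, and the tolerance are hypothetical choices for illustration and do not correspond to the criteria applied in this work.

```python
import math


def cumulative_mean_std(values):
    """Running mean and running (population) standard deviation after each frame."""
    means, stds = [], []
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for n, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        means.append(mean)
        stds.append(math.sqrt(m2 / n))
    return means, stds


def has_plateaued(running, tail_fraction=0.2, tolerance=0.05):
    """True if the last tail_fraction of the running values stays within a
    relative tolerance of the final value (hypothetical criterion)."""
    start = int(len(running) * (1.0 - tail_fraction))
    final = running[-1]
    return all(abs(v - final) <= tolerance * abs(final) for v in running[start:])


if __name__ == "__main__":
    # Synthetic Rg series (nm) that relaxes from 1.0 towards 1.5; illustrative data only.
    rg = [1.5 - 0.5 * math.exp(-frame / 50.0) for frame in range(1000)]
    means, stds = cumulative_mean_std(rg)
    print("running mean plateaued:", has_plateaued(means))
    print("running std plateaued:", has_plateaued(stds))
```

The input series would be the per-frame radii of gyration extracted from each trajectory; the plateau test is then applied to both running curves.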
CRYSOL [
48,
49] software was used to calculate the SAXS scattering patterns, along with the GNOM [
48] software to calculate radial density distributions. The root mean square difference (RMSDexp-calc) between the experimental and the calculated SAXS density distributions was computed using an in-house script. The
gmx gyrate module from the Gromacs suite was used to calculate the radii of gyration from the obtained trajectories. RMSD and RMSF values were calculated using the Gromacs suite (
gmx rms and
gmx rmsf, respectively). To evaluate the similarity between the distribution curves, their root mean square deviation (RMSDexp-calc) was calculated. To evaluate the accuracy of the sampled conformations in comparison to the experimental SAXS scattering, the reduced χ² values were calculated between the interpolated experimental dataset and the calculated scattering profiles for each simulation. The errors used to calculate the χ² values were based on the normalised average experimental error of 0.02.
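The two agreement measures described above can be illustrated with a short script. This is a sketch under stated assumptions, not the in-house script used in this work: both curves are assumed to be normalised to the same scale, the experimental error is taken as a constant σ = 0.02 for every point, and the reduced χ² is divided by N − 1; the function names are hypothetical.

```python
import math


def interpolate(x_src, y_src, x_new):
    """Linearly interpolate (x_src, y_src) onto x_new.

    Both x_src and x_new must be ascending, and x_new must lie within
    the range covered by x_src.
    """
    out = []
    j = 0
    for x in x_new:
        # Advance to the source interval [x_src[j], x_src[j + 1]] that contains x.
        while j < len(x_src) - 2 and x_src[j + 1] < x:
            j += 1
        x0, x1 = x_src[j], x_src[j + 1]
        t = (x - x0) / (x1 - x0)
        out.append(y_src[j] + t * (y_src[j + 1] - y_src[j]))
    return out


def rmsd_exp_calc(y_exp, y_calc):
    """Root mean square difference between two curves sampled on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_exp, y_calc)) / len(y_exp))


def reduced_chi2(i_exp, i_calc, sigma=0.02):
    """Reduced chi-square with a constant error sigma (assumed) and N - 1 degrees of freedom."""
    total = sum(((a - b) / sigma) ** 2 for a, b in zip(i_exp, i_calc))
    return total / (len(i_exp) - 1)


if __name__ == "__main__":
    # Illustrative numbers only, not data from this work.
    q_exp = [0.0, 0.1, 0.2, 0.3, 0.4]
    i_exp = [1.00, 0.80, 0.50, 0.30, 0.20]
    q_calc = [0.05, 0.15, 0.25, 0.35]
    i_calc = [0.91, 0.66, 0.41, 0.24]
    # Bring the experimental curve onto the grid of the calculated profile.
    i_exp_on_calc = interpolate(q_exp, i_exp, q_calc)
    print("RMSD:", rmsd_exp_calc(i_exp_on_calc, i_calc))
    print("reduced chi2:", reduced_chi2(i_exp_on_calc, i_calc))
```

The same RMSD function applies to the PDDF curves once both distributions are expressed on a common distance grid.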
The internal energy was calculated using GROMACS gmx energy, by evaluating all bonded potentials (bonds, angles, dihedrals and improper dihedrals) and non-bonded potentials (Coulombic potential and Lennard-Jones potential) for the intramolecular interactions of each protein.
The bulk water properties were calculated using the protocols used by Izadi and co-workers [
25].