A Benchmark Test of High-Throughput Atomistic Modeling for Octa-Acid Host–Guest Complexes
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors compared a series of existing academic docking and post-docking protocols applied to predict the binding strength (affinity) of a series of host-guest (h-g) complexes taken from SAMPL challenges, the hosts being two closely related molecules, namely octa-acid (OA) and methylated octa-acid (TEMOA).
After a statistical analysis of the results, the authors conclude that a conclusive picture is obtained on the "practical value" of these protocols for predicting affinities on these compounds. The data are clearly presented and the methods, being academic tools, are sufficiently described, but there are doubts about the conclusions:
1. The RMSE comparison between docking and post-docking protocols seems pointless, as it is clear that the score provided by many protocols is not a valid estimate of affinity. Only the statistical correlation provided by Pearson (scoring power) or Kendall (ranking power) seems significant in their study. Perhaps it would be more convenient to perform a differential RMSE analysis comparing the Δ(score) with affinity changes between related guests on the same host?
2. As stated by the authors, the validity of their study is limited to the OA and TEMOA h-g complexes, but some system dependency is still observed when switching from OA to TEMOA. Furthermore, the TEMOA correlation values tend towards 0 (no correlation). Therefore, even validation on such a limited class of complexes is questionable.
3. The authors limit their study to some standard methods and two particular OA/TEMOA h-g systems, still finding a system dependence: many previous related studies on the same topic are cited by the authors (e.g., refs. 16-21), but not commented on. For example, in ref. 17 AutoDock4 is applied to OA/TEMOA and other h-g complexes, with high performance claims: a comparison with such results, as well as with others in the reference list, would be necessary whenever possible. Otherwise it seems difficult to understand what new knowledge these results have actually provided and what their usefulness is, even in the field of h-g chemistry.
Author Response
Reviewer #1:
The authors compared a series of existing academic docking and post-docking protocols applied to predict the binding strength (affinity) of a series of host-guest (h-g) complexes taken from SAMPL challenges, the hosts being two closely related molecules, namely octa-acid (OA) and methylated octa-acid (TEMOA).
After a statistical analysis of the results, the authors conclude that a conclusive picture is obtained on the "practical value" of these protocols for predicting affinities on these compounds. The data are clearly presented and the methods, being academic tools, are sufficiently described, but there are doubts about the conclusions:
- The RMSE comparison between docking and post-docking protocols seems pointless, as it is clear that the score provided by many protocols is not a valid estimate of affinity. Only the statistical correlation provided by Pearson (scoring power) or Kendall (ranking power) seems significant in their study. Perhaps it would be more convenient to perform a differential RMSE analysis comparing the Δ(score) with affinity changes between related guests on the same host?
Response: The docking scores are highly related to the parametrization procedure of the scoring function. For AutoDock variants, the scoring function is parametrized to reproduce experimental affinities, which is why a large body of literature uses AutoDock variants (especially the Vina and AutoDock4 scoring functions) to estimate the binding affinity of host-guest complexes. By contrast, some scoring functions are parametrized in a different spirit (e.g., PLANTS, which incorporates force-field energetics from Tripos and hydrogen-bond terms from ChemScore). For these scoring functions, the values are clearly far from experimental affinities, and we should indeed focus on ranking metrics. Although we agree with the reviewer's comment, we still find the presentation of a direct comparison between the various docking scores and the experimental affinities valuable, as such a comparison is absent from the current literature and provides the community with a practical example of the magnitudes of the scores produced by different scoring functions.
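For concreteness, the differential analysis suggested by the reviewer could be implemented as sketched below (a minimal Python illustration; function names and the example numbers are ours, not taken from the manuscript):

```python
import numpy as np
from itertools import combinations

def differential_rmse(pred, expt):
    """RMSE over pairwise differences: Delta(score) between two guests on the
    same host vs. the corresponding experimental affinity difference."""
    pairs = list(combinations(range(len(pred)), 2))
    d_pred = np.array([pred[j] - pred[i] for i, j in pairs])
    d_expt = np.array([expt[j] - expt[i] for i, j in pairs])
    return float(np.sqrt(np.mean((d_pred - d_expt) ** 2)))

# Hypothetical scores and affinities (kcal/mol) for guests sharing one host.
pred = np.array([-6.3, -5.1, -7.4, -4.9])
expt = np.array([-6.0, -5.6, -7.9, -4.4])
print(differential_rmse(pred, expt))
```

Such a pairwise construction removes any constant offset in the scoring function, which is why it complements the absolute RMSE reported in the manuscript.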
- As stated by the authors, the validity of their study is limited to the OA and TEMOA h-g complexes, but some system dependency is still observed when switching from OA to TEMOA. Furthermore, the TEMOA correlation values tend towards 0 (no correlation). Therefore, even validation on such a limited class of complexes is questionable.
Response: Although we have performed an extensive literature review, only limited data on the OA derivatives could be obtained. The OA family does not have many available experimental references, and the current batch containing 31 OA-guest and 17 TEMOA-guest complexes is already the largest and most diverse dataset reported in computational investigations to date. We hope to further improve the quality of the dataset, but it seems that further efforts from experimentalists are necessary, which is beyond our capabilities. For other host-guest systems such as cucurbiturils, we have indeed secured much larger datasets, e.g., https://doi.org/10.1016/j.molliq.2024.125245.
- The authors limit their study to some standard methods and two particular OA/TEMOA h-g systems, still finding a system dependence: many previous related studies on the same topic are cited by the authors (e.g., refs. 16-21), but not commented on. For example, in ref. 17 AutoDock4 is applied to OA/TEMOA and other h-g complexes, with high performance claims: a comparison with such results, as well as with others in the reference list, would be necessary whenever possible. Otherwise it seems difficult to understand what new knowledge these results have actually provided and what their usefulness is, even in the field of h-g chemistry.
Response: Compared with the docking (AD4) paper reported by Lorenzo Casbarra and Piero Procacci, our work considers larger datasets and many more docking protocols. The cited paper includes SAMPL6-8 systems, which contain 16 OA-guest and 8 TEMOA-guest pairs. By contrast, our dataset contains 31 OA-guest and 17 TEMOA-guest pairs, which is the largest test bed for these OA derivatives. As for docking protocols, we consider 7 of them (AutoDock Vina, AutoDock Vinardo, PLANTS-plp, PLANTS-chemplp, DOCK6-Contact, DOCK6-Grid-energy and rDOCK), while the cited paper employs only the AutoDock family. As for the comparison with other OA-related papers, our work again offers broader coverage of datasets and docking protocols. Further, we consider an extensive set of end-point protocols that covers not only mainstream implicit-solvent models but, more importantly, both the single- and three-trajectory sampling realizations, as sketched below. The three-trajectory realization, a commonly overlooked sampling protocol, has shown very promising performance on the WP6 SAMPL9 dataset (https://doi.org/10.3390/molecules28062767), and thus benchmarking it on other host-guest complexes including the OA derivatives is crucial. Clarifications and discussions of these points have been added to the introduction section in the revision.
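The distinction between the two sampling realizations can be illustrated as follows (a schematic Python sketch; in practice the per-frame energies would come from an MM/PB(GB)SA engine, and the function names are ours):

```python
import numpy as np

# E_* are arrays of per-frame single-point energies (gas-phase MM + solvation).

def dH_single_trajectory(E_cpx, E_host_from_cpx, E_guest_from_cpx):
    """Single-trajectory: host and guest frames are extracted from the complex
    trajectory, so intramolecular terms largely cancel (lower statistical noise)."""
    return np.mean(E_cpx) - np.mean(E_host_from_cpx) - np.mean(E_guest_from_cpx)

def dH_three_trajectory(E_cpx, E_host, E_guest):
    """Three-trajectory: host and guest are simulated independently, so
    conformational reorganization upon binding is included (higher noise)."""
    return np.mean(E_cpx) - np.mean(E_host) - np.mean(E_guest)
```

The two estimators share the same end-point formula ΔH ≈ ⟨E_complex⟩ − ⟨E_host⟩ − ⟨E_guest⟩ and differ only in which trajectories supply the host and guest ensembles.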
Reviewer 2 Report
Comments and Suggestions for Authors
This work conducted by Wang et al. attempted to benchmark molecular docking tools and MM-PBSA/GBSA free energy methods on the octa-acid host-guest systems. The topic of this manuscript is important, but the presentation of this work is not in good shape. There are several severe issues.
1) The title 'current status' is very misleading and sounds like a review article. Maybe change to something related to 'benchmark'.
2) The abstract mentions 20 protocols for post-docking end-point rescoring methods. But I cannot find these 20 methods.
In addition, please make a table including the 7 different docking tools and the ? different end-point methods.
3) The meanings of Pearson r, RMSE, PI and Kendall τ are not explained clearly. Readers cannot understand how to interpret these values.
4) The colorbar scheme in Figure 3 is very strange and ugly. Please improve it and use a common color scale.
In general, this manuscript reads like a lab report from an undergraduate and needs much effort to improve.
Author Response
Reviewer #2:
This work conducted by Wang et al. attempted to benchmark molecular docking tools and MM-PBSA/GBSA free energy methods on the octa-acid host-guest systems. The topic of this manuscript is important, but the presentation of this work is not in good shape. There are several severe issues.
1) The title 'current status' is very misleading and sounds like a review article. Maybe change to something related to 'benchmark'.
Response: We changed the 'current status' wording in the title to 'a benchmark test'.
2) The abstract mentions 20 protocols for post-docking end-point rescoring methods. But I cannot find these 20 methods.
Response: In the original manuscript, the 20 end-point protocols are discussed in section 2.2 around Eq. (2), and the corresponding numerical results are shown in Fig. 5. They arise from the combinations of 5 regimes for energetic evaluation (MM/PBSA and the four MM/GBSA variants based on the HCT, OBC-I, OBC-II and GBneck2 models), 2 regimes for sampling (single- and three-trajectory realizations) and 2 regimes for ranking (enthalpy-only ΔH or entropy-included ΔG), i.e., 5 × 2 × 2 = 20 protocols.
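The 5 × 2 × 2 grid can be enumerated explicitly as in the short sketch below (the labels are ours and merely illustrative):

```python
from itertools import product

# Five energetic regimes, two sampling regimes, two ranking regimes.
energetics = ["MM/PBSA", "MM/GBSA(HCT)", "MM/GBSA(OBC-I)",
              "MM/GBSA(OBC-II)", "MM/GBSA(neck2)"]
sampling = ["single-trajectory", "three-trajectory"]
ranking = ["dH (enthalpy only)", "dG (entropy included)"]

protocols = [" | ".join(combo) for combo in product(energetics, sampling, ranking)]
assert len(protocols) == 20  # the 20 end-point protocols
```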
In addition, please make a table including the 7 different docking tools and the ? different end-point methods.
Response: In the revision, we added two tables summarizing the docking and end-point protocols benchmarked in this work (Table 1 for docking protocols and Table 2 for end-point protocols).
3) The meanings of Pearson r, RMSE, PI and Kendall τ are not explained clearly. Readers cannot understand how to interpret these values.
Response: We added clarifications about the four quality metrics to the first paragraph of section 3.1.1 in the revision.
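For reference, the four metrics can be computed as sketched below (Python with NumPy/SciPy; the PI follows the Pearlman-Charifson pairwise definition, and the example arrays are hypothetical):

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr, kendalltau

def rmse(pred, expt):
    """Root-mean-square error between predictions and experiment."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(expt)) ** 2)))

def predictive_index(pred, expt):
    """PI: pairwise ranking correctness weighted by |experimental difference|."""
    num = den = 0.0
    for i, j in combinations(range(len(expt)), 2):
        w = abs(expt[j] - expt[i])
        dp, de = pred[j] - pred[i], expt[j] - expt[i]
        c = 0.0 if dp == 0 else (1.0 if dp * de > 0 else -1.0)
        num, den = num + w * c, den + w
    return num / den if den else 0.0

pred = np.array([-5.2, -6.1, -4.8, -7.0])  # hypothetical predicted dG (kcal/mol)
expt = np.array([-5.0, -6.5, -4.2, -6.8])  # hypothetical experimental dG
print(rmse(pred, expt), pearsonr(pred, expt)[0],
      kendalltau(pred, expt)[0], predictive_index(pred, expt))
```

Pearson r measures linear agreement (scoring power), Kendall τ and PI measure ranking quality (with PI emphasizing pairs that differ strongly in experiment), and RMSE measures absolute deviation from experiment.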
4) The colorbar scheme in Figure 3 is very strange and ugly. Please improve it and use a common color scale.
In general, this manuscript reads like a lab report from an undergraduate and needs much effort to improve.
Response: We changed the coloring scheme used in Fig. 3 and Figs. S2-S3 in the revision.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed my concerns.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed the comments well.