Technical Note
Peer-Review Record

bioGWAS: A Simple and Flexible Tool for Simulating GWAS Datasets

by Anton I. Changalidis 1,2,3, Dmitry A. Alexeev 2, Yulia A. Nasykhova 1, Andrey S. Glotov 1,* and Yury A. Barbitoff 1,2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 25 October 2023 / Revised: 3 December 2023 / Accepted: 12 December 2023 / Published: 23 December 2023
(This article belongs to the Section Bioinformatics)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The study entitled “bioGWAS: a simple and flexible tool for simulating GWAS datasets” presents a powerful tool, bioGWAS, which provides an important set of functionalities that would aid the development of new methods for downstream processing of GWAS results. However, as a reviewer, I find some shortcomings that need to be addressed. Below I have provided numerous remarks. Given these shortcomings, the manuscript requires further revision.

1. Documentation comments: Although the script provides some helpful information, adding more comments and docstrings throughout the code would improve its readability. In particular, comments on functions and parameters would help users understand the purpose and working principle of the code.

2. Error handling: The script provides no error-handling mechanism. In practical applications, appropriate error and exception handling should be added so that users are told where an error occurred and so that potential problems and abnormal situations are handled.

3. Parameter settings: If there are too many necessary input parameters, you can set some parameters as optional inputs. If only a portion of the software's functionality is needed, it should also be able to run.

4. Operation status: At the beginning and end of each stage of the software's operation, prompts or log output could be added to help users identify the current status of the run.

5. Code format: Code that spans multiple lines in Python files should also be indented consistently to improve readability. For example:

    s = f'''
    biogwas_path: {os.path.abspath(path)}
    vcf_in_dir: {os.path.abspath(args.input_dir)}
    data_dir: {os.path.abspath(args.data_dir)}
    images_dir: {os.path.abspath(args.img_dir)}

instead of

s = f'''
biogwas_path: {os.path.abspath(path)}
vcf_in_dir: {os.path.abspath(args.input_dir)}
data_dir: {os.path.abspath(args.data_dir)}
images_dir: {os.path.abspath(args.img_dir)}

6. Magic number: The code contains the following fragment:

seed_settings.add_argument('-S',
                           '--seed',
                           required=False,
                           default=566,
                           type=int,
                           help='Random seed of this phenotype (!!!) simulation.')

Here, the number 566 is a magic number. It could be replaced with a meaningful constant or variable, such as:

RANDOM_SEED_DEFAULT = 566

seed_settings.add_argument('-S',
                           '--seed',
                           required=False,
                           default=RANDOM_SEED_DEFAULT,
                           type=int,
                           help='Random seed of this phenotype (!!!) simulation.')

This makes the code easier to read, helps others understand the meaning of the number, and facilitates subsequent modifications.

7. subprocess.call: Consider using subprocess.run instead of subprocess.call to better control subprocesses. This allows you to capture output, check for errors, and handle exceptions.

8. The visualization function needs further improvement.

 

Comments on the Quality of English Language

The English language needs further improvement

Author Response

The study entitled “bioGWAS: a simple and flexible tool for simulating GWAS datasets” presents a powerful tool, bioGWAS, which provides an important set of functionalities that would aid the development of new methods for downstream processing of GWAS results. However, as a reviewer, I find some shortcomings that need to be addressed. Below I have provided numerous remarks. Given these shortcomings, the manuscript requires further revision.

Authors: We thank the Reviewer for a thorough assessment of our work and useful suggestions.

  1. Documentation comments: Although the script provides some helpful information, adding more comments and docstrings throughout the code would improve its readability. In particular, comments on functions and parameters would help users understand the purpose and working principle of the code.

Authors: We added docstrings and comments to the main parts of the code to make it easier to understand for users who would like to inspect it.

 

  2. Error handling: The script provides no error-handling mechanism. In practical applications, appropriate error and exception handling should be added so that users are told where an error occurred and so that potential problems and abnormal situations are handled.

Authors: Snakemake provides extensive functionality for printing informational messages as tasks are completed. We added custom messages to every Snakemake rule to clearly inform the user about the start of a job and its successful execution or failure, making the code and logs more complete, readable, and understandable.

 

  3. Parameter settings: If there are too many necessary input parameters, you can set some parameters as optional inputs. If only a portion of the software's functionality is needed, it should also be able to run.

Authors: We would like to point out that the majority of the tool's parameters have default values and are thus not required. The only required parameters are directories and files. All other parameters can be left at their defaults during simulation. As we understand that the help message of the tool contains many details, we now include the list of required parameters and examples of simple usage in the readme.md file in the repository.

 

  4. Operation status: At the beginning and end of each stage of the software's operation, prompts or log output could be added to help users identify the current status of the run.

Authors: As mentioned in our response to comment #2, we have added messages that are displayed at the beginning and end of each stage of the pipeline, including custom error messages.

 

  5. Code format: Code that spans multiple lines in Python files should also be indented consistently to improve readability. For example:

    s = f'''
    biogwas_path: {os.path.abspath(path)}
    vcf_in_dir: {os.path.abspath(args.input_dir)}
    data_dir: {os.path.abspath(args.data_dir)}
    images_dir: {os.path.abspath(args.img_dir)}

instead of

s = f'''
biogwas_path: {os.path.abspath(path)}
vcf_in_dir: {os.path.abspath(args.input_dir)}
data_dir: {os.path.abspath(args.data_dir)}
images_dir: {os.path.abspath(args.img_dir)}

Authors: The initial formatting of this code section was done in order to avoid the indents in the resulting .yml file as these are not allowed. To make the code easier to read, we have moved this variable from functions into the constants section.
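One common way to reconcile the two concerns (indented source code, unindented .yml output) is the standard-library function textwrap.dedent. The sketch below is an illustration only and is not taken from the bioGWAS code; the function name build_config and its arguments are hypothetical, while the four keys mirror the fragment quoted above.

```python
import os
import textwrap


def build_config(path: str, input_dir: str, data_dir: str, img_dir: str) -> str:
    """Build YAML text from an indented template (hypothetical helper)."""
    # The template is indented to match the surrounding code.
    template = f"""\
        biogwas_path: {os.path.abspath(path)}
        vcf_in_dir: {os.path.abspath(input_dir)}
        data_dir: {os.path.abspath(data_dir)}
        images_dir: {os.path.abspath(img_dir)}
    """
    # dedent() removes the whitespace prefix common to all lines,
    # so the resulting text has no leading indentation.
    return textwrap.dedent(template)
```

Calling build_config(".", "in", "data", "img") returns four lines that start at column zero, so the source can stay indented without affecting the generated file.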

 

  6. Magic number: The code contains the following fragment:

seed_settings.add_argument('-S',
                           '--seed',
                           required=False,
                           default=566,
                           type=int,
                           help='Random seed of this phenotype (!!!) simulation.')

Here, the number 566 is a magic number. It could be replaced with a meaningful constant or variable, such as:

RANDOM_SEED_DEFAULT = 566

seed_settings.add_argument('-S',
                           '--seed',
                           required=False,
                           default=RANDOM_SEED_DEFAULT,
                           type=int,
                           help='Random seed of this phenotype (!!!) simulation.')

This makes the code easier to read, helps others understand the meaning of the number, and facilitates subsequent modifications.

Authors: This default value and some other “magic” constants were moved to the constants section at the beginning of the file.

 

  7. subprocess.call: Consider using subprocess.run instead of subprocess.call to better control subprocesses. This allows you to capture output, check for errors, and handle exceptions.

Authors: To be consistent throughout the code, we changed subprocess.call() to subprocess.Popen() (which the suggested subprocess.run() also uses internally).
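For readers comparing the two options, the following minimal sketch shows what the Reviewer's suggestion looks like in practice. It is not code from bioGWAS: run_step is a hypothetical helper, and the choice to convert failures into RuntimeError is an assumption made for illustration.

```python
import subprocess
import sys


def run_step(cmd: list[str]) -> str:
    """Run one external command and return its stdout (hypothetical helper)."""
    try:
        # check=True raises CalledProcessError on a non-zero exit code;
        # capture_output=True collects stdout and stderr for reporting.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError as exc:
        raise RuntimeError(f"Executable not found: {cmd[0]}") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"Command {cmd} failed with exit code {exc.returncode}:\n{exc.stderr}"
        ) from exc
    return result.stdout
```

For example, run_step([sys.executable, "-c", "print('ok')"]) returns the child's output, while a command that exits with a non-zero status raises an error that includes the exit code and the captured stderr.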

 

  8. The visualization function needs further improvement.

Authors: The visualization function was improved by adding user-controlled parameters for the plots: in particular, the DPI, height, and width of the plotted images.

We would also like to point out that we plan to keep developing the tool and adding new features to it to further improve the user experience.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper introduces a new tool to simulate genotype and phenotype data and provides GWAS summary statistics to facilitate the evaluation of the generated data. One distinctive advantage of this tool is that it allows users to specify the association between variants and traits. Overall, the paper is well presented. My comments are listed below.  

1. In line 118, the authors mention that "Additionally, SNPs are filtered by minimum and maximum MAF." It would be beneficial for the readers if the authors could provide more clarity on this aspect. Specifically, it would be helpful to know what the default values for the minimum and maximum MAF are and whether these thresholds are customizable by the users.

2. There appears to be a contradiction between the sentences in lines 124 and 125. My understanding is that if there are 'k' causal variants, then 'K-k' SNPs would be non-causal. However, the manuscript states that “K-k causal SNPs should be drawn…” which seems inconsistent with the previous statement. Could the authors please clarify this point for better understanding? Also, in line 137, the authors state, “this procedure yields K causal variants.”

3. Is it possible that we cannot reach F1 to 0.9? In this case, how should we select the parameters?

4. I doubt the authors’ claim in line 286 that the Q-Q plots do not show signs of over-inflation for both original and simulated data. All the Q-Q plots in Figure 2 deviate from the diagonal line, indicating that all the data are over-inflated. I would recommend that the authors revisit and verify this section to ensure accuracy in their findings and presentation.

5. In line 366, should “MMAF” be changed to “MAF”?

Author Response

The paper introduces a new tool to simulate genotype and phenotype data and provides GWAS summary statistics to facilitate the evaluation of the generated data. One distinctive advantage of this tool is that it allows users to specify the association between variants and traits. Overall, the paper is well presented. My comments are listed below.  

Authors: We thank the Reviewer for a positive assessment of our work.

  1. In line 118, the authors mention that "Additionally, SNPs are filtered by minimum and maximum MAF." It would be beneficial for the readers if the authors could provide more clarity on this aspect. Specifically, it would be helpful to know what the default values for the minimum and maximum MAF are and whether these thresholds are customizable by the users.

Authors: The corresponding paragraph was expanded to include the details requested by the Reviewer.
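To make the filtering step concrete, here is a minimal sketch of MAF-based filtering for biallelic SNPs in diploid samples. It is not the bioGWAS implementation: the function names and the input format (a mapping from SNP ID to alternate allele count) are hypothetical, and the thresholds are left as arguments because this record does not state the tool's default values.

```python
def minor_allele_frequency(alt_allele_count: int, n_samples: int) -> float:
    """MAF of a biallelic SNP, given the alternate allele count in diploid samples."""
    alt_freq = alt_allele_count / (2 * n_samples)
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(alt_counts: dict, n_samples: int,
                  min_maf: float, max_maf: float) -> list:
    """Return IDs of SNPs whose MAF lies within [min_maf, max_maf]."""
    return [
        snp_id
        for snp_id, alt_count in alt_counts.items()
        if min_maf <= minor_allele_frequency(alt_count, n_samples) <= max_maf
    ]
```

With 100 samples, a SNP carrying 2 alternate alleles has a MAF of 0.01 and would be dropped by a lower threshold of 0.05, whereas a SNP carrying 40 alternate alleles (MAF 0.2) would be kept.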

  2. There appears to be a contradiction between the sentences in lines 124 and 125. My understanding is that if there are 'k' causal variants, then 'K-k' SNPs would be non-causal. However, the manuscript states that “K-k causal SNPs should be drawn…” which seems inconsistent with the previous statement. Could the authors please clarify this point for better understanding? Also, in line 137, the authors state, “this procedure yields K causal variants.”

Authors: We use K to denote the total number of causal variants, and k to denote the number of causal variants drawn from a specific pathway (when pathway-based selection of variants is used). Thus, K - k is the number of variants that are causal but are located in genes outside the user-defined gene set (in other words, are randomly scattered across the genome). The number of ‘non-causal’ variants is the total number of SNPs in the dataset minus K. This notation is motivated by the parameters of the hypergeometric distribution, which is commonly used to test for gene set enrichment. We have made corrections to the corresponding section of the manuscript to make the notation clearer.
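The notation can be illustrated with a short sketch of the sampling scheme described above: k causal SNPs are drawn from the gene set and K - k from the remaining SNPs. This is an assumption-laden illustration rather than the bioGWAS code; pick_causal_snps and its arguments are hypothetical names.

```python
import random


def pick_causal_snps(pathway_snps, other_snps, K, k, seed=None):
    """Draw K causal SNPs: k inside the gene set, K - k outside it."""
    if not 0 <= k <= K:
        raise ValueError("k must satisfy 0 <= k <= K")
    if k > len(pathway_snps) or K - k > len(other_snps):
        raise ValueError("not enough SNPs to sample from")
    rng = random.Random(seed)
    # Sampling is without replacement within each group.
    in_pathway = rng.sample(list(pathway_snps), k)
    outside = rng.sample(list(other_snps), K - k)
    return in_pathway, outside
```

For a dataset of 1,000 SNPs with K = 10 and k = 4, this yields 4 causal SNPs in the gene set, 6 causal SNPs elsewhere, and 1,000 - 10 = 990 non-causal SNPs.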

  3. Is it possible that we cannot reach F1 to 0.9? In this case, how should we select the parameters?

Authors: We thank the Reviewer for this important question. Indeed, there are certain scenarios in which the results will have an F1 score lower than 0.9. If the accuracy of simulation does not reach the desired level, further tuning of simulation parameters is required. The following steps may be performed:

  • Sample size (N) can be increased to achieve higher levels of recall (i.e., identification of all desired causal variants);
  • Heritability (genetic variance) can be adjusted to ensure optimal precision and recall values. Our analysis shows that, under N = 10,000, the best performance of simulation is achieved when the ratio of heritability to the number of causal variants (i.e., the proportion of variance explained by each causal variant) is around 0.01 (see Supplementary Figure X);
  • If non-default parameters of the effect size distribution are used, we advise decreasing the standard deviation of the effect (sd_beta) so that all causal variants have non-zero genetic effects.

We expanded the corresponding section of the Materials and Methods and Results to include these details and assist users in selecting parameters for simulation (the end of section 2.6 and p. 8, lines 290-304). We have also revised Supplementary Tables showing the results of parameter optimization to provide a better understanding of the performance of the same parameter sets under different numbers of causal SNPs (K).
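As a reminder of how the quantities discussed in this response relate to each other, the sketch below computes precision, recall, and F1 from a set of causal SNPs and a set of SNPs reported as significant, along with the heritability per causal variant. It assumes that a detection counts only when SNP IDs match exactly, which may differ from the matching rule used in the manuscript; the function names are hypothetical.

```python
def precision_recall_f1(causal, detected):
    """Precision, recall, and F1 for detected SNPs versus true causal SNPs."""
    causal, detected = set(causal), set(detected)
    true_positives = len(causal & detected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(causal) if causal else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def heritability_per_variant(h2: float, K: int) -> float:
    """Proportion of variance explained by each of K causal variants."""
    return h2 / K
```

For instance, with four causal SNPs of which three are detected alongside one false positive, precision, recall, and F1 are all 0.75; a heritability of 0.1 spread over K = 10 causal variants gives the ratio of about 0.01 mentioned above.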

  4. I doubt the authors’ claim in line 286 that the Q-Q plots do not show signs of over-inflation for both original and simulated data. All the Q-Q plots in Figure 2 deviate from the diagonal line, indicating that all the data are over-inflated. I would recommend that the authors revisit and verify this section to ensure accuracy in their findings and presentation.

Authors: We respectfully disagree with the Reviewer on this matter. While the Q-Q plot deviates from the expectation, this deviation indicates the presence of a genome-wide association signal rather than inflation due to uncorrected biases (as evidenced by the fact that the deviation from expectation is observed only at p < 0.001). We have amended the wording to make this clearer.
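A common numerical companion to the Q-Q plot is the genomic inflation factor, the ratio of the median observed test statistic to the median expected under the null. It is not mentioned in this record and is included here only as an illustrative sketch of the distinction drawn above; it assumes 1-df chi-square statistics derived from two-sided p-values, and the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor from p-values in the interval (0, 1]."""
    normal = NormalDist()
    # A two-sided p-value corresponds to a squared z-score (chi-square, 1 df).
    chi2 = [normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Because the factor depends on the median, a small number of very strong associations in the tail of the distribution leaves it close to 1, whereas a systematic bias that shifts the bulk of the p-values raises it.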

  5. In line 366, should “MMAF” be changed to “MAF”?

Authors: The issue was corrected.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

No more comments

Comments on the Quality of English Language

Okay
