Article
Peer-Review Record

A High Efficient Biological Language Model for Predicting Protein–Protein Interactions

by Yanbin Wang 1,2,†, Zhu-Hong You 1,*,†, Shan Yang 1,2,†, Xiao Li 1, Tong-Hai Jiang 1 and Xi Zhou 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 27 December 2018 / Revised: 26 January 2019 / Accepted: 2 February 2019 / Published: 3 February 2019
(This article belongs to the Special Issue Bioinformatics and Computational Biology 2019)

Round 1

Reviewer 1 Report

Wang et al. introduce the development of a model that can be used to accurately predict protein-protein interactions from protein sequences alone. Though interesting, it is not suitable for publication in its present form. I would like to see both testing of the model with additional datasets and identification/validation of novel PPIs; both would give further confidence in the model and directly demonstrate its utility to the life science community. It also requires extensive editing of English language and style; I have given examples below in point 4.

 

Specific comments

1. Further testing of the model with datasets other than those used for training would give further confidence in the approach. There are many different and often non-overlapping datasets that could be used; e.g., for human these include BioGRID or DIP in addition to the HPRD dataset used in the manuscript (they have all now been combined in the PICKLE meta-database; Gioutlakis et al. 2017, PLOS ONE) and are readily available.

2. Further testing and experimental validation of the model using protein sequences not found in the training datasets would demonstrate the utility of the model in identifying PPIs.

3. Figure 1 - maybe link the analogies directly with lines rather than all going to the similarity box, to better illustrate the manuscript.

4. Suggested examples of changes to English language style are given, but there are numerous others (e.g. many long sentences) and the manuscript should be carefully checked.

-line 13 ‘efficient methods for covering implicit pattern in protein sequence…’ - change to ‘efficient methods for predicting PPIs from protein sequence information…’

-line 47 ‘these methods are cannot be implemented until the pre-information of proteins is obtained’-can be changed to ‘these methods cannot be implemented without pre-existing information.’

-line 55 ‘have denoted that sequence order effects to affect…’-change to ‘have noted that sequence order effects affect…’

-line 69 ‘The major reason for limiting the prediction capability…’-change to ‘The major limitation of the prediction capabilities of these methods…’

-line 75-remove ‘at the structural’

-line 121 replace ‘interaction protein pairs’ with ‘interacting protein pairs’

-line 122 replace ‘containing’ with ‘contain’

-line 128 and 129 ‘golden standard’ should be ‘gold standard’

 

 


Author Response

Authors’ Response to Reviewers’ Comments

Paper title:

A High Efficient Biological Language Model for Predicting Protein–Protein Interactions

Manuscript ID: 425516

Authors:

Yanbin Wang, Zhu-Hong You *, Shan Yang, Xiao Li, Tong-Hai Jiang, Xi Zhou

We are grateful to the editor and reviewers for putting in efforts to review the paper with the aim of improving the quality of our paper. We have addressed the concerns of the editor and reviewers in the revised manuscript. In particular, the following revisions have been made.

Reviewer 1

Wang et al. introduce the development of a model that can be used to accurately predict protein-protein interactions just from protein sequences alone. Though interesting, it is not suitable for publication in the present manner. I would like to see both testing of the model with additional datasets and also identification/validation of novel PPIs - both would give further confidence in the model and directly demonstrate its utility to the life science community. It also requires extensive editing of English language and style - I have given examples below in point 4.


Response: We thank the reviewer for the interest in our work and for the valuable suggestions.

Comment 1:

Further testing of the model with different datasets other than those used for training would give further confidence in the approach. There are many different and often non-overlapping datasets that could be used, e.g. for human these also include BioGRID or DIP in addition to the HPRD used in the manuscript (they have all now been combined in the PICKLE meta-database; Gioutlakis et al. 2017, PLOS ONE) and are readily available.

Response: We thank the reviewer for this valuable suggestion and for the pointer to additional PPI data sources. We have read the paper describing the PICKLE database and cited it in the revised manuscript.

1. Gioutlakis, A., Klapa, M. I., & Moschonas, N. K. (2017). PICKLE 2.0: a human protein-protein interaction meta-database employing data integration via genetic information ontology. PLOS ONE, 12(10), e0186039.

We have established an extended Human PPIs dataset based on the PICKLE database. We assembled 36,630 interacting protein pairs from the information in PICKLE, and 36,480 non-interacting protein pairs based on the assumption that proteins occupying different subcellular localizations do not interact. The experimental results are shown in Table 1.

Table 1. Prediction results on the extended Human dataset.

Testing Set       Accu (%)   Sens (%)   Prec (%)   MCC (%)   AUC
Extended-Human    99.58      99.64      99.50      99.16     0.9995

Once again, we thank the reviewer for pointing us to a high-quality, large-scale source of human protein interaction data. This database will also provide reliable data support for our future work.
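For readers who want to check figures like those in Table 1, the reported metrics follow directly from confusion-matrix counts. The sketch below is generic and is not the authors' code; the names mirror the table headers, and values are returned as fractions (multiply by 100 for the percentages in the table).

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, precision, and MCC from raw confusion counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)                     # a.k.a. recall
    prec = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    )
    return acc, sens, prec, mcc
```

AUC, by contrast, cannot be recovered from a single confusion matrix; it requires the ranked prediction scores.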

Comment 2:

Further testing and experimental validation of the model using protein sequences not found in training datasets would demonstrate the utility of the model in identifying PPIs.

Response: We thank the reviewer for this essential comment. We further validated the proposed method by applying it to an independent dataset. Specifically, the model trained on the S. cerevisiae dataset was applied directly to the E. coli dataset, which contains 6954 interacting protein pairs. To ensure that no protein sequence in the test set (E. coli) also appears in the training set (S. cerevisiae), we ran a Python program to detect duplicate proteins and took a screenshot of the results of the duplicate check. The screenshot is shown in Figure 1; it confirms that no protein sequence is shared between the training data and the test data.
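A duplicate check of this kind is straightforward to reproduce. The following is a minimal sketch, not the authors' actual script (which, along with the data it operates on, is in their figshare archive); it takes two collections of sequence strings and reports any overlap.

```python
def shared_sequences(train_seqs, test_seqs):
    """Return the set of sequences present in both collections.
    An empty result means no leakage from training into testing."""
    return set(train_seqs) & set(test_seqs)

def check_no_overlap(train_seqs, test_seqs):
    """Raise if any sequence appears in both sets; otherwise return True."""
    dup = shared_sequences(train_seqs, test_seqs)
    if dup:
        raise ValueError(f"{len(dup)} sequences appear in both sets")
    return True
```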

As Figure 2 shows, the prediction probabilities obtained by applying the model to E. coli PPIs are close to 1 for most samples. These high prediction probabilities show that our method is effective and practical and can detect PPIs across species.

We have uploaded the data, the trained prediction model, the code, the word vectors, and documentation to https://figshare.com/s/b35a2d3bf442a6f15b6e. Reviewers and readers can use the pre-trained prediction model to quickly reproduce our experiments.

Comment 3:

Figure 1 - maybe link the analogies directly with lines rather than all going to the similarity box, to better illustrate the manuscript.

Response: Thank you for the helpful suggestion. In the revised manuscript, we modified Figure 1 according to the reviewer's suggestion.

Comment 4:

Suggested examples of changes to English language style are given, but there are numerous others (e.g. many long sentences) and the manuscript should be carefully checked.

-line 13 ‘efficient methods for covering implicit pattern in protein sequence…’ - change to ‘efficient methods for predicting PPIs from protein sequence information…’

-line 47 ‘these methods are cannot be implemented until the pre-information of proteins is obtained’ - can be changed to ‘these methods cannot be implemented without pre-existing information.’

-line 55 ‘have denoted that sequence order effects to affect…’ - change to ‘have noted that sequence order effects affect…’

-line 69 ‘The major reason for limiting the prediction capability…’ - change to ‘The major limitation of the prediction capabilities of these methods…’

-line 75 - remove ‘at the structural’

-line 121 - replace ‘interaction protein pairs’ with ‘interacting protein pairs’

-line 122 - replace ‘containing’ with ‘contain’

-line 128 and 129 - ‘golden standard’ should be ‘gold standard’

Response: Thank you for your helpful suggestions. In the revised manuscript, all known errors have been corrected. The manuscript has been proofread and carefully edited to ensure easy readability.


Author Response File: Author Response.pdf

Reviewer 2 Report

The authors of this paper introduce an approach to predicting protein-protein interactions (PPIs) that frames the task in a natural language processing (NLP) context. In this way they seek to make their approach free from requiring any prior knowledge of the biology behind these interactions, as well as readily integrable with existing NLP methods. The main novelty of the paper stems from the protein sequence encoding method, in which the sequence is segmented and then embedded as a vector using a skip-gram model. The segmentation results from alternating byte-pair-encoding and expectation-maximization steps until a desired vocabulary size is reached. The resulting lexicon is then input into a skip-gram model so that each segment is encoded as an n-length vector. The vectors of all the segments are then summed and averaged to finally arrive at an n-length encoding for the original sequence. A convolutional neural network takes a concatenated vector pair as input for the binary prediction of the PPI.

This is a very interesting method. The application of natural language processing technology to solve problems in bioinformatics is very promising, and this idea has attracted much attention in recent years. However, current research on bioinformatics based on natural language processing almost ignores a key question: how should protein sequences be properly segmented? This paper answers this question well and successfully resolves the dilemma faced by current NLP-based research. It allows some biological problems to be reasonably embedded in the framework of language understanding and builds a bridge between biological information and natural language processing technology. In addition, the results in the paper confirm that the manner of word segmentation affects prediction performance, which has the potential to become a new research hotspot. However, the authors should address a few minor problems before publication.

1. The grammatical errors present in the introduction make some statements in the paper unclear.

2. Materials and methodology: the S* should be s* to be more consistent with the notation of the proposed segmentation. Equation (4) would be clearer if the index of the summation were represented by a symbol other than x, since the equation also contains X, which represents a set. The algorithm used for segmentation would be more clearly represented in pseudo-code form.

3. The paper's significance could be improved by more motivation on the importance of the word segmentation.

4. Detail on the convolutional neural network's role in this work is minimal beyond the general description of the operations a CNN performs. The architecture used in this paper is unique in that it uses three subnetworks. A more detailed description of the CNN will help to improve the readability of the paper.


Author Response

Authors’ Response to Reviewers’ Comments

We are grateful to the editor and reviewers for putting in efforts to review the paper with the aim of improving the quality of our paper. We have addressed the concerns of the editor and reviewers in the revised manuscript. In particular, the following revisions have been made.

Reviewer 2

The authors of this paper introduce an approach to predicting protein-protein interactions (PPIs) that frames the task in a natural language processing (NLP) context. In this way they seek to make their approach free from requiring any prior knowledge of the biology behind these interactions, as well as readily integrable with existing NLP methods. The main novelty of the paper stems from the protein sequence encoding method, in which the sequence is segmented and then embedded as a vector using a skip-gram model. The segmentation results from alternating byte-pair-encoding and expectation-maximization steps until a desired vocabulary size is reached. The resulting lexicon is then input into a skip-gram model so that each segment is encoded as an n-length vector. The vectors of all the segments are then summed and averaged to finally arrive at an n-length encoding for the original sequence. A convolutional neural network takes a concatenated vector pair as input for the binary prediction of the PPI.

This is a very interesting method. The application of natural language processing technology to solve problems in bioinformatics is very promising, and this idea has attracted much attention in recent years. However, current research on bioinformatics based on natural language processing almost ignores a key question: how should protein sequences be properly segmented? This paper answers this question well and successfully resolves the dilemma faced by current NLP-based research. It allows some biological problems to be reasonably embedded in the framework of language understanding and builds a bridge between biological information and natural language processing technology. In addition, the results in the paper confirm that the manner of word segmentation affects prediction performance, which has the potential to become a new research hotspot. However, the authors should address a few minor problems before publication.


Response: We appreciate the reviewer's positive assessment.

Comment 1:

The grammatical errors present in the introduction can make some statements in the paper unclear.

Response: Thank you for your helpful suggestions. In the revised manuscript, all known errors have been corrected. The manuscript has been proofread and carefully edited to ensure easy readability.

Comment 2:

Materials and methodology. The S* should be s* to be more consistent with the notation of the proposed segmentation. Equation (4) would be clearer if the index of the summation were represented by a symbol other than x, since the equation also contains X, which represents a set. The algorithm used for segmentation would be more clearly represented in pseudo-code form.

Response: Following the reviewer's comments, we revised Equation (4). To help readers understand our algorithm, the code and the trained model are available at https://figshare.com/s/b35a2d3bf442a6f15b6e.

Comment 3:

The paper's significance can be improved by more motivation on the importance of the word segmentation.

Response: We thank the reviewer for this suggestion. In the revised manuscript, we have added more description of, and motivation for, the word segmentation.

Comment 4:

Details of the convolutional neural network's role in this work are minimal beyond the general description of the operations a CNN performs. The architecture used in this paper is unique in that it uses three subnetworks. Increasing the detailed description of the CNN will help to improve the readability of the paper.

Response: Thank you for the helpful suggestion. We have added more detail on the construction and role of the CNN in the revised manuscript.


Reviewer 3 Report

The paper by Wang and colleagues presents an interesting approach: parsing protein sequences into words, then using these words to construct feature vectors, and then training a CNN-based model to predict PPIs.

Major concerns:
- The paper, in general, would benefit from a more detailed description of how various parameters were set - alpha and the number of features in particular. How sensitive is the approach to how these were set?
- The comparison to other approaches needs to be explained in more detail. Are you using the original tools or reimplementation? Why can't the MCC be calculated for some cases?
- There are several models presented in this paper, for the different organisms for example. How similar are these? Would a model for one organism perform well when applied to another? Or would one need to build a species-specific model? The authors do recognize that more data is better; would it make sense to create one model with the data from all organisms combined? Why not?
- How does the proposed lexicon relate to motif databases?
- Please make the code available; please specify the license. Provide sufficient amounts of data so that the reviewers and readers can download, test and use the software. Provide documentation on how to use the tool.
- Provide supplemental data/tables with details on the input data, which data was filtered out, and which of the data was used for training and testing. Please also provide the results of the algorithm, explicitly detailing the false negatives and false positives.
- Please explicitly report the obtained result for the cross-validation and the testing datasets.


Minor concerns:
- Please provide a better introduction to the tools that the algorithm was compared to, the ones listed in tables 2-4.
- The language has some minor issues that are negatively impacting the perception of the paper. I recommend having a native speaker read the manuscript or use a proof-reading service.

Author Response

Authors’ Response to Reviewers’ Comments

Paper title:

A High Efficient Biological Language Model for Predicting Protein–Protein Interactions

Manuscript ID: 425516

Authors:

Yanbin Wang, Zhu-Hong You *, Shan Yang, Xiao Li, Tong-Hai Jiang, Xi Zhou

We are grateful to the editor and reviewers for putting in efforts to review the paper with the aim of improving the quality of our paper. We have addressed the concerns of the editor and reviewers in the revised manuscript. In particular, the following revisions have been made.

Reviewer 3

The paper by Wang and colleagues presents an interesting approach: parsing protein sequences into words, then using these words to construct feature vectors, and then training a CNN-based model to predict PPIs.

Response: We thank the reviewer for the interest in our work and for the valuable suggestions.

Comment 1:

The paper, in general, would benefit from a more detailed description of how various parameters were set - alpha and the number of features in particular. How sensitive is the approach to how these were set?

Response: The authors would like to thank the reviewer for this comment. In the revised manuscript, to increase the interpretability and readability of the paper, we describe the parameters and the number of features in more detail.

In this article, we first segment protein sequences into words using a segmentation method based on a unigram language model. We hope that reasonable bio-word segmentation will improve the interpretability of natural language processing techniques in the biological domain, and we adopt several important NLP techniques and ideas to accomplish this task. A key problem here is that the vocabulary of biological sequences is unknown. Because jointly optimizing the vocabulary set and the occurrence probabilities of its entries is intractable, we seek them with an iterative algorithm: first, we heuristically build a reasonably large seed vocabulary from the datasets; then an EM algorithm that maximizes the marginal likelihood is used to refine the vocabulary.

In the first step, we employ Byte-Pair Encoding (BPE), a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte; it can also be used for word segmentation. But we do not stop there: we obtain a more accurate vocabulary through iterative optimization.

Our iterative algorithm involves the parameter alpha, which can be regarded as a threshold that filters out segments with a low probability of being words. Through this iterative filtering we obtain a higher-quality vocabulary. As the reviewer notes, how to select this parameter is worth discussing. The larger the parameter, the fewer words are discarded and the more iterations are required; if the parameter is too small, a rough result may be obtained. To balance time and computational cost against vocabulary quality, we recommend setting it between 70 and 80. We have also tested the quality of the BPE algorithm alone; the results show that the vocabulary generated by BPE can also produce competitive prediction results. Therefore, we believe that the parameter alpha is important, but that the prediction results are not sensitive to it.
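As an illustration of the BPE seeding step described above, here is a minimal, self-contained sketch of greedy byte-pair encoding over amino-acid strings. It is not the authors' implementation (which also includes the unigram-LM/EM refinement and the alpha filtering); it only shows how the most frequent adjacent symbol pair is merged at each step.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Greedy BPE sketch: repeatedly merge the most frequent adjacent pair.
    `corpus` is a list of sequences; each becomes a list of symbols."""
    corpus = [list(seq) for seq in corpus]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))       # count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merged = a + b
        for seq in corpus:                        # replace the pair in place
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return corpus
```

For example, one merge over the sequence "AABAB" fuses the most frequent pair "AB" into a single symbol.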

In addition, based on practical experience and published results, the larger the word-vector dimension, the better the model performs; however, as the dimension increases, the performance curve rises at first and then flattens. Figure 1 shows the trend of accuracy as the dimension increases. Considering efficiency, in this work we extract 1024 features per word.
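The sequence-level encoding the paper builds from segment vectors (summing and averaging the skip-gram embeddings) can be sketched as follows; `word_vectors` is a hypothetical segment-to-vector mapping, and the default of 1024 mirrors the dimension chosen above.

```python
def sequence_embedding(segments, word_vectors, dim=1024):
    """Encode a segmented protein sequence as the element-wise mean of its
    segment vectors. Segments missing from the vocabulary are skipped
    in this sketch; an all-unknown sequence maps to the zero vector."""
    vecs = [word_vectors[s] for s in segments if s in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```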


Comment 2:

The comparison to other approaches needs to be explained in more detail. Are you using the original tools or a reimplementation? Why can't the MCC be calculated for some cases?

Response: We wish to thank the reviewer for this comment. In the revised manuscript, we have added more detail on the comparison to other approaches. To ensure the authenticity of the comparison, we directly use the results reported in the original papers [1-3]. Since some of them do not use MCC as an evaluation metric, the corresponding entries in the table are left empty.

1. You, Z.-H., Zhou, M.-C., Luo, X., & Li, S. (2016). Highly efficient framework for predicting interactions between proteins. IEEE Transactions on Cybernetics, 1-13.

2. Li, J.-Q., You, Z.-H., Li, X., Ming, Z., & Chen, X. (2017). PSPEL: in silico prediction of self-interacting proteins from amino acid sequences using ensemble learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(5), 1165-1172.

3. Liu, B., Liu, F., Fang, L., Wang, X., & Chou, K.-C. (2015). repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics, 31(8), 1307-1309.

Comment 3:

There are several models presented in this paper, for the different organisms for example. How similar are these? Would a model for one organism perform well when applied to another? Or would one need to build a species-specific model? The authors do recognize that more data is better; would it make sense to create one model with the data from all organisms combined? Why not?

Response: We would like to thank the reviewer for this comment. We share the reviewer's considerations; in fact, this is something we wanted, and are eager, to do. Frankly speaking, due to the limitations of our knowledge background, we do not know the essential differences between these organisms at the sequence level; from the perspective of the sequence alone, they are very similar. We consulted a considerable number of references and found no discussion of this point. To test the reviewer's idea, we designed a new experiment: the model trained on the S. cerevisiae dataset was applied directly to an independent E. coli dataset containing 6954 interacting protein pairs. The distribution of prediction probabilities on the E. coli dataset is shown in Figure 2.

As Figure 2 shows, the prediction probabilities obtained by applying the model to E. coli PPIs are close to 1 for most samples. These high prediction probabilities show that our method is effective and practical and can detect PPIs across species.

In addition, to validate the idea that more data is better, we established an extended Human PPIs dataset comprising 36,630 interacting protein pairs and 36,480 non-interacting protein pairs. The experimental results are shown in Table 1.

Table 1. Prediction results on the extended Human dataset.

Testing Set       Accu (%)   Sens (%)   Prec (%)   MCC (%)   AUC
Extended-Human    99.58      99.64      99.50      99.16     0.9995

In future work, we intend to pursue exactly what the reviewer suggests: we will collect more PPI data sources, use large multi-species datasets to build a universal PPI predictor, and test its effectiveness. We share the reviewer's expectation that such an attempt can lead to exciting results.

Comment 4:

How does the proposed lexicon relate to motif databases?

Response: The authors would like to thank the reviewer for this question. The relationship between the generated lexicon and motif databases is an interesting issue. To be honest, we do not know enough about motifs to explain this relationship in depth, but we believe the lexicon produced by the proposed method has great potential to accelerate the discovery of new motifs: words in the lexicon are likely to be motifs, and the method could help identify candidate motifs. We really appreciate this interesting question. Our method assigns a probability to every word in the lexicon; based on this ranking, we may begin a new collaboration with biologists to validate the relationship between the lexicon and motifs.

Comment 5:

Please make the code available; please specify the license. Provide sufficient amounts of data so that the reviewers and readers can download, test and use the software. Provide documentation on how to use the tool.

Response: We thank the reviewer for the suggestions. The code, data, word vectors, trained predictor, and documentation are available at https://figshare.com/s/b35a2d3bf442a6f15b6e. Reviewers and readers can use the pre-trained prediction model to quickly reproduce our experiments.

Comment 6:

Provide supplemental data/tables with details on the input data, which data was filtered out, and which of the data was used for training and testing. Please also provide the results of the algorithm, explicitly detailing the false negatives and false positives.

Response: We thank the reviewer for the suggestions. In this paper, we use random sampling to extract 80% of the entire dataset as training and validation data, where the validation set accounts for 10% of the extracted samples; the remaining 20% is used as the test set. To clarify the input and output of the model, we provide a supplementary table describing the input data and the false negatives and false positives.

Table 1. Details of the training set (TraD), validation set (ValD), and test set (TestD), with false negatives (FN) and false positives (FP), on the four PPI datasets.

Testing Set       TraD    ValD    TestD   FN   FP
Human             5615    1402    780     15   6
S. cerevisiae     8055    895     1119    40   35
H. pylori         2099    525     292     16   19
Extended-Human    52627   13157   7326    13   18
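The split described above (80% extracted for training plus validation, with the validation set taking 10% of the extracted portion, and the remaining 20% held out for testing) can be sketched as follows; this is a generic illustration, not the authors' code, and the fixed seed is only for reproducibility.

```python
import random

def split_dataset(pairs, seed=0):
    """Shuffle and split protein pairs into ~72% train, ~8% validation,
    and 20% test: 80% is extracted first, and 10% of that 80% becomes
    the validation set."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_extract = int(0.8 * len(pairs))
    extracted, test = pairs[:n_extract], pairs[n_extract:]
    n_val = int(0.1 * len(extracted))
    val, train = extracted[:n_val], extracted[n_val:]
    return train, val, test
```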

 

Comment 7:

Please explicitly report the obtained results for the cross-validation and the testing datasets.

Response: We thank the reviewer for the useful suggestions and have complied with them. In the revised manuscript, we explicitly report the results obtained on the testing datasets and add an experiment validating the proposed method. In the new experiment, a larger Human PPIs dataset was collected, containing 73,110 protein pairs.

In this work, cross-validation was not adopted. We established a prediction model based on a deep-learning algorithm and followed standard deep-learning training practice [1-4] in dividing the dataset into three groups with no duplicate samples: a training set, a validation set, and a test set.

1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems.

2. Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Darrell, T., & Saenko, K. (2015). Long-term recurrent convolutional networks for visual recognition and description. IEEE Conference on Computer Vision and Pattern Recognition, 2625-2634.

3. Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2016). Inception-v4, Inception-ResNet and the impact of residual connections on learning. AAAI Conference on Artificial Intelligence, 4278-4284.

4. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

Cross-validation is a method for estimating the generalization accuracy of a supervised learning algorithm. It is often used to reduce over-fitting and increase generalization ability in traditional machine-learning models. However, in the training of neural networks, the role of cross-validation can be replaced by several other strategies.

Using a validation set is one way to improve model quality. Russell and Norvig note that the data used for model fitting can be further divided into a training set and a validation set; the validation set is used to tune the hyper-parameters of the model and to provide an unbiased evaluation of the fitted model. Specifically, a model's parameters can be divided into ordinary parameters and hyper-parameters. Without introducing reinforcement learning, the ordinary parameters are updated by gradient descent on the training set. The hyper-parameters include the number of network layers, the number of nodes, the number of iterations, the learning rate, and so on; these are not updated by the gradient-descent algorithm. The validation set participates in this tuning process, making the model fit better.

In addition, dropout and early stopping are also used in training to ensure a reliable and generalizable model.

In machine learning, if the training sample is too small, the trained model is prone to over-fitting: the model achieves a small loss and high prediction accuracy on the training data, but a larger loss and lower accuracy on the test data. In the training of deep neural networks, dropout is commonly used to overcome this problem.

5. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint.

6. Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.

7. Konda, K. R., Bouthillier, X., Memisevic, R., & Vincent, P. (2015). Dropout as data augmentation. arXiv preprint.

Early stopping is another method for avoiding over-fitting: the model's performance on the validation set is computed during training, and training stops when that performance begins to decline, thereby avoiding the over-fitting caused by continued training.

8.      Wu, X., & Liu, J. (2009). A New Early Stopping Algorithm for Improving Neural Network Generalization. International Conference on Intelligent Computation Technology and Automation.
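A common patience-based variant of this stopping rule can be sketched as follows (an illustrative implementation; the loss values and patience setting are hypothetical, not those used in our experiments):

```python
def early_stopping(val_losses, patience=2):
    """Return the epoch at which training should stop: the first epoch
    after which the validation loss has failed to improve for `patience`
    consecutive epochs. If it never triggers, train to the last epoch."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch  # stop here; the best weights are from best_epoch
    return len(val_losses) - 1

# Validation loss improves for three epochs, then starts to rise:
# training stops after two consecutive epochs without improvement.
losses = [0.9, 0.6, 0.4, 0.45, 0.5, 0.55]
stop = early_stopping(losses, patience=2)  # stops at epoch 4
```

In practice the weights saved at `best_epoch` are restored, so the delivered model is the one with the lowest validation loss.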

In our work, we use the three strategies mentioned above in place of cross-validation. To demonstrate the validity of this approach, the fitting curves on the training and test sets of the extended Human dataset are depicted in Figure 2.

The figure shows that in the second epoch the losses on both sets decrease rapidly, the accuracies increase rapidly, and the two curves begin to approach each other. By the fifth epoch, both losses have fallen below 0.05 and both accuracies have risen above 0.975. Figure 2 shows that our method is robust and efficient and can effectively prevent over-fitting.

Comment 8:

Please provide a better introduction to the tools that the algorithm was compared to, the ones listed in Tables 2-4.

Response: We thank the reviewer for the useful suggestion. More description of the methods listed in Tables 2-4 has been added to the revised manuscript.

Comment 9:

The language has some minor issues that are negatively impacting the perception of the paper. I recommend having a native speaker read the manuscript or use a proof-reading service.

Response: Thanks for the reviewer's advice. In the revised manuscript, all known errors have been corrected. The manuscript has been proofread and carefully edited to ensure easy readability.

 


Author Response File: Author Response.pdf

Round  2

Reviewer 1 Report

The authors have included many suggested changes and analyses.

However, some grammatical/stylistic issues remain-please check carefully and/or have a native English speaker read. Some suggested changes are underlined below (the list is not exhaustive, only illustrative):

Line 73 '...their own unique advantages.' (i.e. add 'their' and delete 'in different angles')

Lines 73-75 change to 'Rescanning fundamental theory problems with a biological language viewpoint that is inspired by natural language processing can help us find new solutions to bio information problems'.

Line 265-269 rewrite paragraph to be 'These statistics indicate that our approaches yield encouraging results. Prediction quality increases with the amount of data used for training. The performance suffers when the Skip-Gram word representation model and the designed CNN are applied to small data sets.Thus, our model has good scalability and can be further improved by increasing the size of training data sets. '

Line 312 3.3. 'Comparison with Previous Studies'

Lines 325-326 'The accuracy of the proposed method clearly stands out in comparison with several other methods'.

Author Response

We are grateful to the editor and reviewers for their efforts in reviewing the paper with the aim of improving its quality. We have addressed the concerns of the editor and reviewers in the revised manuscript. In particular, the following revisions have been made.

The authors have included many suggested changes and analyses. 

Response: We would like to thank the reviewer for the valuable advice.

However, some grammatical/stylistic issues remain-please check carefully and/or have a native English speaker read. Some suggested changes are underlined below (the list is not exhaustive, only illustrative):

Line 73 '...their own unique advantages.' (i.e. add 'their' and delete 'in different angles')

Lines 73-75 change to 'Rescanning fundamental theory problems with a biological language viewpoint that is inspired by natural language processing can help us find new solutions to bio information problems'.

Line 265-269 rewrite paragraph to be 'These statistics indicate that our approaches yield encouraging results. Prediction quality increases with the amount of data used for training. The performance suffers when the Skip-Gram word representation model and the designed CNN are applied to small data sets. Thus, our model has good scalability and can be further improved by increasing the size of training data sets. '

Line 312 3.3. 'Comparison with Previous Studies'

Lines 325-326 'The accuracy of the proposed method clearly stands out in comparison with several other methods'.

Response: Special thanks to the reviewer for contributing to the readability of this paper. We have checked the manuscript carefully, and all known grammatical errors have been corrected.


Author Response File: Author Response.docx

Reviewer 3 Report

Wang and co-authors present an improved manuscript and have addressed my main concerns. Hence, I will recommend Cells to consider this manuscript for publication. However, I would strongly recommend the authors to add a license to their code so that others can use it.

Author Response

Wang and co-authors present an improved manuscript and have addressed my main concerns. Hence, I will recommend Cells to consider this manuscript for publication. However, I would strongly recommend the authors to add a license to their code so that others can use it.


Response: We thank the reviewer for the approval, and special thanks for the valuable advice. A license has now been added to the code so that the public is free to use it; the license type is MIT. Figure 1 shows that the license for the code has been included.


Author Response File: Author Response.docx
