Figure 1.
Model architecture of PreLnc. Panel A shows the animal species (humans, mice, and cows) and panel B shows the plant species (A. thaliana, O. sativa, and Z. mays). The FDR (false discovery rate) adjusted P-value and the Z-value are used to select tri-nucleotides as candidate features. Pearson correlation coefficients (PCCs) are used to rank the features, and the incremental feature selection (IFS) method is applied to the ranked list to obtain the feature selection results. The model for each species is built from the feature selection results and machine learning classifiers.
Figure 2.
Stacked histograms of Z-values for animals. The red-labeled bars mark the tri-nucleotides that significantly distinguish lncRNAs (long non-coding RNAs) in humans, mice, and cows (adjusted p-value < 0.01).
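The adjusted p-value threshold used in this caption can be illustrated with a short sketch. The caption does not name the adjustment procedure, so the Benjamini-Hochberg method below is an assumption, and the raw p-values and the helper names `benjamini_hochberg` and `select_trinucleotides` are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the source only says "FDR adjusted"; BH is one common choice.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_trinucleotides(p_by_kmer, alpha=0.01):
    """Keep the tri-nucleotides whose adjusted p-value is below alpha."""
    kmers = list(p_by_kmer)
    adjusted = benjamini_hochberg([p_by_kmer[k] for k in kmers])
    return sorted(k for k, q in zip(kmers, adjusted) if q < alpha)


# Hypothetical raw p-values for four tri-nucleotides (illustration only).
raw = {"TAA": 0.0001, "CGA": 0.002, "GGG": 0.03, "ATG": 0.5}
selected = select_trinucleotides(raw)
print(selected)
```

With these made-up inputs, only the two tri-nucleotides with the smallest p-values pass the 0.01 cut-off after adjustment.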
Figure 3.
Stacked histograms of Z-values for plants. The red-labeled bars mark the tri-nucleotides that significantly distinguish lncRNAs in A. thaliana, O. sativa, and Z. mays (adjusted p-value < 0.01).
Figure 4.
Pearson correlation coefficients of features for animals. The heat map shows the correlation coefficients among the features and the transcript category. Blue represents a positive correlation and red represents a negative correlation. The darker the color, the stronger the correlation. Feature abbreviations: Length, sequence length; GC_content, GC content; Stop_condon_std, standard deviation of stop codon counts (TAA, TAG, TGA); CDS_len, CDS length; CDS_percent, CDS percentage; CDS_score, CDS score predicted by txCdsPredict; Pep_len, peptide length; Fickett, Fickett score; PI, isoelectric point; ORF_in, open reading frame integrity; Hexamer, hexamer score; the remaining features are tri-nucleotides. The number in each cell is the correlation between the two corresponding features. The dark blue area in the upper left corner of the figure shows correlations above 0.9 among stop codon counts, CDS length, CDS score, and peptide length.
Figure 5.
Pearson correlation coefficients of features for plants. The heat map shows the correlation coefficients among the features and the transcript category. Blue represents a positive correlation and red represents a negative correlation. The darker the color, the stronger the correlation. Feature abbreviations: Length, sequence length; GC_content, GC content; Stop_condon_std, standard deviation of stop codon counts (TAA, TAG, TGA); CDS_len, CDS length; CDS_percent, CDS percentage; CDS_score, CDS score predicted by txCdsPredict; Pep_len, peptide length; Fickett, Fickett score; PI, isoelectric point; ORF_in, open reading frame integrity; Hexamer, hexamer score; the remaining features are tri-nucleotides. The number in each cell is the correlation between the two corresponding features. The dark blue area in the upper left corner of the figure shows correlations above 0.88 among stop codon counts, CDS length, CDS score, and peptide length.
Figure 6.
Features for animals ranked by absolute correlation coefficient (|r|). The horizontal axis lists the features and the vertical axis gives the correlation coefficient between each feature and the transcript category. Feature abbreviations: CDS_percent, CDS percentage; Fickett, Fickett score; Hexamer, hexamer score; CDS_score, CDS score predicted by txCdsPredict; GC_content, GC content; Stop_condon_std, standard deviation of stop codon counts (TAA, TAG, TGA); PI, isoelectric point; Length, sequence length; ORF_in, open reading frame integrity; the remaining features are tri-nucleotides.
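As a minimal sketch of the ranking described in this caption, the code below computes the Pearson correlation between each feature column and a 0/1 transcript label and sorts the features by |r|. The feature values, the label encoding, and the function names are hypothetical and only illustrate the idea.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_features(columns, labels):
    """Sort feature names by |r| against the labels, strongest first."""
    scores = {name: abs(pearson(values, labels))
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: four transcripts, label 1 = lncRNA, 0 = coding.
columns = {
    "CDS_percent": [0.1, 0.2, 0.8, 0.9],
    "GC_content": [0.5, 0.4, 0.5, 0.4],
    "Length": [900, 1000, 1100, 1300],
}
labels = [1, 1, 0, 0]
ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value means that strongly negative correlations rank as highly as strongly positive ones, which matches the |r| shown on the vertical axis.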
Figure 7.
Features for plants ranked by absolute correlation coefficient (|r|). The horizontal axis lists the features and the vertical axis gives the correlation coefficient between each feature and the transcript category. Feature abbreviations: Length, sequence length; Hexamer, hexamer score; CDS_score, CDS score predicted by txCdsPredict; Stop_condon_std, standard deviation of stop codon counts (TAA, TAG, TGA); Fickett, Fickett score; PI, isoelectric point; CDS_percent, CDS percentage; GC_content, GC content; ORF_in, open reading frame integrity; the remaining features are tri-nucleotides.
Figure 8.
Results of IFS with multiple classifiers. The horizontal and vertical axes represent the number of features and the F-measure, respectively. (A) F-measures on humans. The best F-measure is 0.91783, obtained with 19 features and the random forest classifier; (B) F-measures on mice. The best F-measure is 0.92555, obtained with 20 features and the random forest classifier; (C) F-measures on cows. The best F-measure is 0.94723, obtained with 20 features and the random forest classifier; (D) F-measures on A. thaliana. F-measures were essentially stable at 0.99 across the ranked features; (E) F-measures on O. sativa. F-measures were essentially stable at 0.99 across the ranked features; (F) F-measures on Z. mays. F-measures were essentially stable at 0.99 across the ranked features.
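The IFS procedure behind these curves can be sketched generically: evaluate the top-k ranked features for k = 1, 2, ... and keep the k with the best score. In the sketch below, `evaluate` stands in for training a classifier and returning its F-measure; the toy scoring function and its per-feature gains are hypothetical and are not taken from the paper.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score every prefix of the ranked list and return the best one.

    `evaluate` is any callable mapping a feature subset to a score, for
    example a cross-validated F-measure of a classifier (assumption).
    """
    best_k, best_score = 0, float("-inf")
    history = []  # (number of features, score) pairs, one per prefix
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        history.append((k, score))
        if score > best_score:  # strict: ties keep the smaller subset
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, history


def toy_evaluate(subset):
    """Hypothetical stand-in for a classifier's F-measure."""
    gains = {"CDS_percent": 0.80, "Fickett": 0.08, "Hexamer": 0.04}
    # Unlisted features slightly hurt the score in this toy model.
    return sum(gains.get(feature, -0.01) for feature in subset)


ranked = ["CDS_percent", "Fickett", "Hexamer", "GC_content", "Length"]
chosen, score, history = incremental_feature_selection(ranked, toy_evaluate)
print(chosen, round(score, 2))
```

Plotting `history` would give a curve of the kind shown in each panel, with one line per classifier if `evaluate` is swapped out.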
Figure 9.
F-measures of balanced random forests for positive-to-negative sample ratios of 1:5, 1:4, 1:3, 1:2, and 1:1. The horizontal and vertical axes represent the ratio and the F-measure, respectively. (A) Balanced random forests with 20 features on animals. Balanced random forests perform poorly on imbalanced animal datasets, with F-measure gaps of 15.93% for humans, 13.18% for mice, and 14.52% for cows; (B) Balanced random forests with 33 features on plants. The F-measure gaps for plants are smaller (6.77% for A. thaliana, 6.40% for O. sativa, and 3.53% for Z. mays).
Figure 10.
Receiver operating characteristic (ROC) curves for the six species. Different colored curves correspond to different tools (red: PreLnc, blue: CPC2, green: CPAT, black: PLEK). The horizontal and vertical axes represent the false positive rate and the true positive rate, respectively. (A) ROC curves for humans; (B) ROC curves for mice; (C) ROC curves for cows; (D) ROC curves for A. thaliana; (E) ROC curves for O. sativa; (F) ROC curves for Z. mays.
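A ROC curve traces the true positive rate against the false positive rate as the score threshold varies, and the area under it (AUC) summarizes the curve in one number. The sketch below computes AUC through its rank interpretation: the probability that a randomly chosen positive scores higher than a randomly chosen negative. The scores and labels are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Labels are 1 for the positive class and 0 for the negative class.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for six transcripts.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

For datasets of the size used here, a sort-based implementation would be preferable to the nested loop.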
Figure 11.
Prediction performance on Aedes aegypti, Rhesus, Opossum, Platypus, and Pig. The horizontal and vertical axes represent the species and the accuracy, respectively. The 4274 lncRNAs of A. aegypti were derived from the systematic research project on the Nucleotide Sequence Database (NT) of NCBI. The lncRNAs of Rhesus (9128 lncRNAs), Opossum (27,167 lncRNAs), Platypus (11,210 lncRNAs), and Pig (29,585 lncRNAs) were derived from NONCODE v5.
Table 1.
Datasets of animals and plants. Animal transcripts were obtained from Ensembl (v97). Long non-coding and protein-coding transcripts of plants were obtained from GreeNC and EnsemblPlants (v44), respectively. All data were filtered with CD-HIT, and the datasets are independent of each other.
Dataset | Species | Sample | Training | Testing |
---|---|---|---|---|
1 | Humans | positive | 12,000 | 24,323 |
| | negative | 12,000 | 80,324 |
2 | Mice | positive | 6500 | 6534 |
| | negative | 6500 | 53,239 |
3 | Cows | positive | 1000 | 976 |
| | negative | 1000 | 36,346 |
4 | Arabidopsis thaliana | positive | 1400 | 1339 |
| | negative | 1400 | 33,345 |
5 | Oryza sativa | positive | 2500 | 2487 |
| | negative | 2500 | 33,970 |
6 | Zea mays | positive | 7500 | 7843 |
| | negative | 7500 | 68,460 |
Table 2.
The first subset of 11 features. The features fall into four categories: (1) basic features: No. 1–3; (2) ORF (open reading frame) and CDS (coding sequence) related features: No. 4–7; (3) peptide-related features: No. 8, 9; (4) functional definition features: No. 10, 11.
No. | Feature | Introduction |
---|---|---|
1 | Length | sequence length |
2 | GC_content | GC content |
3 | Stop_condon_std | standard deviation of stop codon counts (TAA, TAG, TGA) |
4 | ORF_in | open reading frame integrity |
5 | CDS_len | CDS length |
6 | CDS_score | CDS score of txCdsPredict prediction |
7 | CDS_percent | CDS percentage |
8 | Pep_len | peptide length |
9 | PI | isoelectric point |
10 | Hexamer | hexamer score |
11 | Fickett | Fickett score |
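The three basic features (No. 1–3) are simple enough to sketch directly from a sequence. The table gives only one-line descriptions, so the details below are assumptions: stop codons are counted at every position (overlapping, all frames), and the standard deviation is the population standard deviation of the three counts. The function names and the example sequence are hypothetical.

```python
from statistics import pstdev

STOP_CODONS = ("TAA", "TAG", "TGA")


def count_overlapping(seq, kmer):
    """Count occurrences of kmer at every position of seq."""
    width = len(kmer)
    return sum(1 for i in range(len(seq) - width + 1)
               if seq[i:i + width] == kmer)


def basic_features(seq):
    """Compute Length, GC_content and Stop_condon_std for one transcript.

    Assumption: Stop_condon_std is the population standard deviation of
    the TAA, TAG and TGA counts over all positions; the source does not
    spell out the exact definition.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    length = len(seq)
    gc = (seq.count("G") + seq.count("C")) / length if length else 0.0
    stop_counts = [count_overlapping(seq, codon) for codon in STOP_CODONS]
    return {
        "Length": length,
        "GC_content": gc,
        "Stop_condon_std": pstdev(stop_counts),
    }


# Hypothetical 18-nt sequence used only to exercise the function.
features = basic_features("ATGGCCTAATAGTGATAA")
print(features)
```

The remaining features in the table (ORF, CDS, peptide, hexamer, and Fickett scores) depend on external tools or trained tables and are not reproduced here.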
Table 3.
Methods comparison.
Species | Methods | SEN% | SPE% | ACC% | MCC% | AUC% |
---|---|---|---|---|---|---|
Humans | CPAT | 95.321 | 81.759 | 84.911 | 67.763 | 94.857 |
| CPC2 | 94.742 | 69.398 | 75.288 | 54.402 | 90.122 |
| PreLnc | 96.633 | 85.000 | 87.703 | 72.801 | 96.199 |
| PLEK | 87.761 | 81.219 | 82.739 | 61.160 | 86.971 |
| LncFinder | 95.663 | 84.005 | 86.714 | 70.781 | / |
Mice | CPAT | 95.883 | 86.842 | 87.831 | 62.111 | 96.414 |
| CPC2 | 94.781 | 76.235 | 78.263 | 47.693 | 92.155 |
| PreLnc | 94.154 | 89.927 | 90.389 | 66.525 | 97.087 |
| PLEK | 91.185 | 74.534 | 76.354 | 43.730 | 85.574 |
| LncFinder | 95.638 | 89.448 | 90.219 | 66.557 | / |
Cows | CPAT | 94.570 | 91.391 | 91.474 | 44.095 | 97.657 |
| CPC2 | 87.602 | 94.759 | 94.572 | 50.225 | 97.172 |
| PreLnc | 95.389 | 93.743 | 93.787 | 50.768 | 97.881 |
| PLEK | 90.164 | 83.228 | 83.409 | 30.043 | 86.190 |
| LncFinder | 93.955 | 95.251 | 95.217 | 55.497 | / |
A. thaliana | CPAT | 99.851 | 90.097 | 90.474 | 50.910 | 96.663 |
| CPC2 | 82.450 | 92.473 | 92.086 | 47.245 | 95.976 |
| PreLnc | 99.925 | 93.138 | 93.400 | 58.598 | 97.483 |
| PLEK | 88.200 | 82.876 | 83.082 | 34.318 | 85.566 |
| LncFinder | 99.477 | 90.610 | 90.953 | 51.833 | / |
| PLncPRO | 73.786 | 88.925 | 88.340 | 35.359 | / |
| RNAplonc | 99.776 | 90.814 | 91.160 | 52.444 | / |
O. sativa | CPAT | 94.692 | 82.437 | 83.273 | 46.333 | 92.944 |
| CPC2 | 78.166 | 77.265 | 77.327 | 31.660 | 83.629 |
| PreLnc | 96.220 | 82.058 | 83.024 | 46.697 | 96.085 |
| PLEK | 89.867 | 85.514 | 85.811 | 47.849 | 89.763 |
| LncFinder | 94.094 | 87.861 | 88.241 | 53.995 | / |
| PLncPRO | 30.197 | 76.334 | 73.186 | 3.849 | / |
| RNAplonc | 99.517 | 77.489 | 78.991 | 43.352 | / |
Z. mays | CPAT | 98.126 | 91.788 | 92.439 | 71.936 | 97.826 |
| CPC2 | 88.920 | 89.239 | 89.206 | 60.756 | 95.351 |
| PreLnc | 99.796 | 97.547 | 97.779 | 89.514 | 99.892 |
| PLEK | 90.960 | 90.879 | 90.888 | 65.360 | 95.757 |
| LncFinder | 98.546 | 92.331 | 92.970 | 73.453 | / |
| PLncPRO | 66.888 | 77.615 | 76.512 | 30.455 | / |
| RNAplonc | 99.308 | 85.465 | 86.883 | 60.881 | / |
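The columns SEN, SPE, ACC, and MCC follow from a confusion matrix by the standard formulas, sketched below. The counts in the example are not reported in the source: they are back-calculated, as an approximation, from the Humans/PreLnc row and the human test-set sizes in Table 1 (24,323 positive and 80,324 negative), and they reproduce that row's ACC and MCC to within rounding.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews correlation."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPE": spe, "ACC": acc, "MCC": mcc}


# Counts reconstructed from rounded percentages (assumption, not source data):
# TP ~ 96.633% of 24,323 positives, TN ~ 85.000% of 80,324 negatives.
metrics = classification_metrics(tp=23504, fn=819, tn=68275, fp=12049)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.3f}%")
```

The example also shows why MCC is much lower than ACC on these test sets: with roughly three negatives per positive, even a 15% false positive rate produces about half as many false positives as true positives, which MCC penalizes.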
Table 4.
Prediction on CPC2 dataset.
Species | Methods | SEN% | SPE% | ACC% | MCC% |
---|---|---|---|---|---|
Humans | CPAT | 90.247 | 95.409 | 92.574 | 85.285 |
| CPC2 | 92.692 | 95.995 | 94.181 | 88.386 |
| PreLnc | 96.072 | 98.893 | 97.344 | 94.705 |
| PLEK | 92.866 | 89.086 | 91.163 | 82.132 |
| LncFinder | 93.106 | 96.239 | 94.518 | 89.054 |
Mice | CPAT | 97.879 | 91.003 | 93.601 | 87.154 |
| CPC2 | 94.892 | 93.871 | 94.257 | 87.972 |
| PreLnc | 92.167 | 92.254 | 92.221 | 83.677 |
| PLEK | 96.409 | 85.888 | 89.986 | 80.498 |
| LncFinder | 96.703 | 95.412 | 95.912 | 91.428 |
A. thaliana | CPAT | 99.961 | 93.371 | 94.391 | 82.777 |
| CPC2 | 99.649 | 95.310 | 95.981 | 95.981 |
| PreLnc | 98.205 | 100.000 | 99.722 | 98.936 |
| PLEK | 98.712 | 86.004 | 87.972 | 68.947 |
| LncFinder | 96.721 | 93.443 | 93.951 | 80.768 |
| RNAplonc | 98.205 | 94.101 | 94.737 | 83.181 |
| PLncPRO | 99.141 | 94.881 | 95.540 | 85.552 |
Table 5.
Prediction on NONCODEv5.
Methods | NONCODEv5_Humans (172,216 Total) | NONCODEv5_Mice (131,697 Total) |
---|---|---|
PreLnc | 95.319% | 96.315% |
CPAT | 93.188% | 97.594% |
CPC2 | 93.946% | 95.928% |
PLEK | 89.321% | 95.195% |
LncFinder | 94.256% | 96.913% |
Table 6.
Comparison of computing time across methods.
Methods | Humans (13,627 Total) | Mice (17,098 Total) | A. thaliana (16,542 Total) |
---|---|---|---|
CPAT | 37 s | 46 s | 39 s |
CPC2 | 26 s | 34 s | 25 s |
PreLnc | 269 s | 346 s | 279 s |
PLEK | 400 s | 307 s | 291 s |
LncFinder | 198 s | 212 s | 164 s |
RNAplonc | / | / | 34 s |
PLncPRO | / | / | 56 s |