Article

CircPCBL: Identification of Plant CircRNAs with a CNN-BiGRU-GLT Model

1 Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
2 School of Life Science, Anhui Agricultural University, Hefei 230036, China
3 School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
* Author to whom correspondence should be addressed.
Plants 2023, 12(8), 1652; https://doi.org/10.3390/plants12081652
Submission received: 20 February 2023 / Revised: 10 April 2023 / Accepted: 13 April 2023 / Published: 14 April 2023
(This article belongs to the Special Issue Emerging Topics in Plant Bioinformatics and Omics Data Analysis)

Abstract

Circular RNAs (circRNAs), which are produced by the back-splicing of pre-mRNAs, are strongly linked to the emergence of several tumor types. Identifying circRNAs is the initial stage of any follow-up study. Currently, most established circRNA recognition tools target animals. However, the sequence features of plant circRNAs differ from those of animal circRNAs, making such tools poorly suited to detecting plant circRNAs. For example, plant circRNAs have non-GT/AG splicing signals at their junction sites and few reverse complementary sequences or repetitive elements in their flanking intron sequences. In addition, circRNAs have been studied far less in plants, so a plant-specific method for identifying circRNAs is urgently needed. In this study, we propose CircPCBL, a deep learning approach that uses only raw sequences to distinguish plant circRNAs from other lncRNAs. CircPCBL comprises two separate detectors: a CNN-BiGRU detector and a GLT detector. The CNN-BiGRU detector takes the one-hot encoding of the RNA sequence as its input, while the GLT detector uses k-mer (k = 1–4) features. The output matrices of the two submodels are concatenated and finally passed through a fully connected layer to produce the output. To verify the generalization performance of the model, we evaluated CircPCBL on several datasets. It achieved an F1 of 85.40% on the validation dataset composed of six different plant species, and F1 scores of 85.88%, 75.87%, and 86.83% on the three cross-species independent test sets composed of Cucumis sativus, Populus trichocarpa, and Gossypium raimondii, respectively. On the real set, CircPCBL correctly predicted ten of the eleven experimentally reported circRNAs of Poncirus trifoliata and nine of the ten experimentally validated lncRNAs of rice, with accuracies of 90.9% and 90%, respectively. CircPCBL could thus contribute to the identification of circRNAs in plants. Remarkably, CircPCBL also achieved an average accuracy of 94.08% on human datasets, an excellent result that implies potential applicability to animal datasets. Finally, CircPCBL is available as a web server, from which the data and source code can also be downloaded free of charge.

1. Introduction

Circular RNA (circRNA) is a novel class of non-coding RNA. It is formed by the reverse connection of a downstream 5′ splice site to an upstream 3′ splice site, with a 3′,5′-phosphodiester bond at the junction; this process is also known as back-splicing [1]. Since it lacks free ends, circRNA was initially dismissed as a by-product of incorrect splicing or operational errors and received little attention [2]. In 1976, Sanger et al. showed for the first time that viroids infecting higher plants are covalently closed circRNAs [3]. From that point, circRNAs began to attract widespread attention. CircRNAs have since been discovered in yeast mitochondria [4], the hepatitis delta virus (HDV) [5], humans [6], mice [7], and rats [8], among other organisms. Many databases are now available for collecting and storing circRNAs from various species, including circBase [9], circRNADb [10], and PlantcircBase [11]. Compared with linear RNA, circRNA has a more stable and conserved closed-loop structure and is not degraded by RNA exonucleases.
As experimental techniques expand, more and more circRNA functions are being annotated in the transcriptome. For example, ciRS-7, which is abundantly expressed in human and mouse brains, acts as an miR-7 sponge and influences miRNA activity [12]. Li et al. found that intron-retaining circRNAs can regulate the expression of RNA polymerase II genes [13]. According to recent research, circRNA is critical for the emergence and growth of different cancer cells [14,15,16]. Although several functional circRNAs have been discovered, their formation mechanism is still not entirely understood. Intron-pairing-driven circularization, RNA-binding-protein (RBP)-mediated circularization, and lariat-driven circularization are the only patterns observed thus far [2]. Figure 1 depicts these three mechanisms. In intron-pairing-driven circularization, the intron sequences on both sides of the circularized exons pair complementarily, permitting the direct combination of the 5′ splice site with the 3′ splice site and resulting in the formation of circRNAs. In RBP-mediated circularization, RBPs bind to specific motifs in the flanking intron sequences and facilitate tissue-specific circRNA formation. In lariat-driven circularization, exon skipping occurs while pre-mRNAs undergo GU/AG splicing, leading to the formation of a lariat, which can then undergo reverse splicing to form circRNAs. Clearly, circRNA research is still in its infancy, and much remains to be learned.
LncRNAs are transcripts that are more than 200 nt in length and encode little or no protein. They can be categorized into different types based on the positioning of their coding sequences relative to protein-coding genes, such as sense, antisense, bidirectional, intronic, and intergenic lncRNAs. LncRNAs, which are transcribed by RNA polymerase II, were initially considered transcriptional noise with no biological function [17]. Recent studies have shown a close association between lncRNAs and various diseases, and a large number of computational methods have been developed as a result [18,19,20,21]. LncRNAs are equally vital in plants, playing significant roles in various biological processes. For example, lncRNAs are essential in responding to abiotic stresses in plants, as detailed by Waititu et al. [22], while Meng et al. identified 63 plant-growth-hormone-responsive lncRNAs [23]. In addition, silencing lncRNA1459 and lncRNA1840 has been shown to delay the ripening of tomato fruit [24]. It has been shown that lncRNA recognition models constructed on human datasets can be used for closely related vertebrates but perform poorly on plant datasets [25], suggesting that there may be differences in lncRNA formation mechanisms and biological characteristics between plants and animals [26]. While various experiments have demonstrated that lncRNAs have essential functional properties, distinguishing them from mRNAs is challenging due to shared features such as the 5′ cap structure. Moreover, lncRNAs typically lack conserved sequences that could be used for detection, which drastically reduces the number of features available for bioinformatics [27]. Our task is to distinguish circRNAs from other lncRNAs, which is more demanding than separating circRNAs from mRNAs, as both are non-coding RNAs with greater structural and functional similarity.
The identification of circRNAs is the initial stage of any follow-up study. Traditional experimental methods are inefficient and require a great deal of time and effort. Due to their comparable length distribution and low expression, circRNAs, as a subclass of lncRNAs, remain difficult to differentiate from other lncRNAs [28]. Several computational methods have been developed to identify circRNAs. For example, the CirRNAPL [29] classifier adopts the extreme learning machine (ELM) method, optimized by particle swarm optimization. DeepCirCode [30] used a CNN to detect circRNA back-splicing sequences and outperformed conventional machine learning methods (SVM and RF). JEDI [31] introduces a cross-attention layer to capture deep interactions between splice sites and outperforms existing tools. Although these methods have made some progress, their application is mainly limited to human and mouse datasets, and plant datasets have not been considered.
At present, JEDI, built for animal circRNAs in 2021, is an outstanding predictor [31]: it achieved over 98% accuracy on human datasets and over 86% accuracy in cross-species testing on mouse datasets. These results demonstrate the tool's excellent performance in identifying animal circRNAs. However, plant circRNAs differ from animal circRNAs in the following respects: (1) repeating elements and reverse complementary sequences are less common in their flanking introns [32]; (2) in rice, there are non-GT/AG splicing signals on both sides of circRNA junction sites, unlike in human circRNAs [33]. These variations indicate that the occurrence and roles of circRNAs may differ between plants and animals. In our experiments, CircPCBL achieved approximately 94% accuracy on the validation set when trained on the human dataset alone, which was comparable to JEDI; however, its accuracy on the plant dataset was just above 85%, further illustrating the above point. Research on circRNAs in plants is still in its early stages. Therefore, it is urgent to develop a plant-specific circRNA identification method to accelerate the research progress of plant circRNAs.
Consequently, in 2021, Yin et al. developed the plant-specific circRNA prediction software PCirc [34], which computes k-mer, ORF, and splicing junction sequence coding (SJSC) features and feeds them to a trained RF model. Specifically, the k-mer features represent the frequency of occurrence of adjacent k nucleotides, and a k-value range of 1–4 was chosen in that study. ORFs, which denote the protein-coding segment of a sequence, were used in PCirc as both ORF coverage and ORF length, referring to the proportion and length of the protein-coding region within the entire sequence. Additionally, SJSC is a vector comprising the 50 bp sequences upstream and downstream of the splicing junction site, with each base represented by a corresponding numerical code. PCirc used rice as the training species and successfully predicted rice circRNAs with the above three sets of features; the average accuracy of ten-times-ten-fold cross-validation exceeded 99%. The software also demonstrated brilliant performance in cross-species tests, with accuracies of 89.80% and 81.30% on Arabidopsis and maize datasets, respectively. Although this method achieved excellent results, it used only rice as the training species, so its universality is limited. Machine learning and deep learning methods require a large amount of data to better express the semantic information of circRNAs and lncRNAs and make robust predictions. Moreover, machine learning methods require manual feature extraction, which consumes plenty of time and effort [35]. Consequently, this study proposes a deep-learning-based model for plant circRNA identification that learns features directly from the original sequence through the end-to-end nature of deep learning, thus avoiding the manual feature extraction required by machine learning methods. Filling the gap in plant circRNA identification would otherwise require uncovering high-quality features based on the unique structure of plant circRNAs, and such specific features demand a large amount of prior biological knowledge. By training the model on plant circRNAs, we anticipated that deep learning would automatically extract plant-specific features from the raw sequence data, distinguishing them from animal circRNAs. As a result, we approached our study from the following perspectives: (1) expanding the number of species, (2) using deep learning to automatically extract features, and (3) discarding complex feature engineering, with the model's input based only on the original sequences.
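To make the k-mer representation concrete, the following minimal sketch computes the 1- to 4-mer frequency vector (4 + 16 + 64 + 256 = 340 dimensions) described above; the function and variable names are illustrative, not taken from PCirc or CircPCBL:

```python
from itertools import product

def kmer_features(seq, k_max=4):
    """Frequencies of all k-mers (k = 1..k_max) over the alphabet ACGT.

    For k_max = 4 this yields 4 + 16 + 64 + 256 = 340 features.
    """
    seq = seq.upper().replace("U", "T")
    features = []
    for k in range(1, k_max + 1):
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        total = max(len(seq) - k + 1, 1)          # number of windows of width k
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:                    # skip k-mers containing N, etc.
                counts[kmer] += 1
        features.extend(counts[key] / total for key in sorted(counts))
    return features

print(len(kmer_features("AUGGCCAUUGGC")))  # -> 340
```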
In this study, we developed a depth recognition framework named CircPCBL based on the aforementioned considerations. For the construction of the datasets, we selected six training species: Arabidopsis thaliana, Brassica rapa, Zea mays, Oryza sativa Japonica, Solanum lycopersicum, and Solanum tuberosum. After preprocessing, the datasets included 17,600 distinct circRNAs and lncRNAs, where circRNAs were coded as 1 and lncRNAs as 0 in our task. These sequences came from different databases, namely PlantcircBase, CANTATAdb 2.0, and GreeNC v1.12, and were either detected with high confidence by bioinformatics tools or experimentally validated, with specific sequence information available from the databases. These data were divided according to a ratio of 7:3, with 70% used for model training and 30% for model validation and hyperparameter tuning. Furthermore, we constructed three independent test sets for three different plants (Cucumis sativus, Populus trichocarpa, and Gossypium raimondii), all from families different from those of the training set, to validate the cross-species prediction ability of the model. The three test sets contained 8739, 6611, and 4501 sequences, respectively, and the source of the test data was identical to that of the training and validation data. Similarly, we validated the utility of the model on experimentally validated circRNA and lncRNA datasets as well, comprising 11 circRNAs of Poncirus trifoliata and 10 lncRNAs of rice. In terms of model architecture, CircPCBL consists of two parts: the CNN-BiGRU detector based on one-hot encoding and the GLT detector based on k-mer features. One-hot encodes the original sequence; if the sequence length is less than a fixed value m, the encoding is padded with zero vectors at the end, and otherwise the sequence is truncated to m. The value of k in k-mer was set to 1–4 in our task and was calculated from the original RNA sequences. One-hot and k-mer can represent sequences effectively and produce excellent outcomes in the majority of biological sequence recognition tasks without any prior biological knowledge [36,37,38,39]. K-mer reflects differences in sequence composition but not the order of the bases; one-hot compensates for this. Thus, the one-hot and k-mer characteristics were selected to complement each other's sequence information. Both features are based only on the original sequence: a one-hot encoding of the pure sequence still represents it base by base, and a k-mer approach simply adds oligonucleotides as potential sequence motifs. So, at this stage, no functional features are provided to identify circRNAs. Our ongoing research and development led to the creation of the CNN-BiGRU detector, which is discussed in Section 2.1.1 and Section 2.1.2. To extract local sequence information and reduce the model parameters and feature dimensions, GLT was introduced into the k-mer processing. This architecture was inspired by DeLighT, an improved transformer model that reduces parameter redundancy by introducing GLT, making the transformer deeper, faster, and stronger [40].
This paper makes the following contributions:
Only the original sequences were used for feature extraction, avoiding complex feature engineering and excessive attention to local regions of the sequences.
A depth recognition framework named CircPCBL is proposed, which uses a CNN-BiGRU detector and a GLT detector to process different features, rather than a single model.
As far as we know, this is the first study to use a deep learning method, rather than conventional machine learning techniques, to distinguish plant circRNAs from other lncRNAs.
CircPCBL also showed brilliant generalization performance on plants from families not represented in the training and validation sets.
We provide an online web server for easy use: www.circpcbl.cn (accessed on 27 December 2022). The data and source code can also be downloaded for free through the web server.

2. Results

We assessed the robustness of CircPCBL in identifying plant lncRNAs and circRNAs on the following datasets: (1) the validation set of CircPCBL; (2) three independent test sets, one for each of three plants (Cucumis sativus, Populus trichocarpa, and Gossypium raimondii); and (3) an independent case study of Poncirus trifoliata and rice (Real Set). In this section, we describe the evaluation strategies in order.

2.1. Performance of CircPCBL for Validation Sets

2.1.1. Comparison of Traditional Deep Learning Methods and Coding Methods

In this section, we selected six traditional deep learning algorithms for comparison, namely RNN, BiRNN, GRU, BiGRU, LSTM, and BiLSTM, which are commonly used in the NLP field. We experimented with both word embedding and one-hot encoding in order to select the best approach. By using a sparse representation of the four bases, one-hot reflects the identity of each position in a sequence. Word embedding differs from one-hot in that it can capture relationships among different bases and encodes a sequence as a dense matrix. In our task, word embedding encoded a single base as a fifty-dimensional dense vector, whereas one-hot encoded it as only a four-dimensional binary vector. The comparison results (Table 1) showed that the BiGRU model with one-hot encoding performed best, with four of the five metrics (accuracy: 0.8216, recall: 0.7992, F1: 0.8172, and MCC: 0.6438) significantly higher than those of the other models. In addition, its precision (0.8360) was almost equal to that of the second-ranked word-embedding BiGRU model (0.8370). Given that there was no significant difference between the two encoding methods, and that one-hot encodes a sequence as only a four-dimensional sparse matrix while still delivering robust performance, it greatly reduces the computational cost. Therefore, we decided to begin the model improvement with one-hot BiGRU. To guarantee a fair comparison, we tuned the number of hidden units among 20, 30, and 40 for all the models to select the best parameter. The tuned number of hidden units for each model is shown in Table 1. Each model was trained for 200 epochs.

2.1.2. The Effect of Hyperparameters on CNN-BiGRU’s Performance

To further improve the performance of the one-hot BiGRU model, we inserted CNNs before BiGRU to initially extract the local contextual information [41] and spatial information [42,43] of the sequences. CNN-BiGRU also received one-hot encoding features as its input. In our experiments, CNN-BiGRU showed a distinct performance improvement. We also examined the hyperparameters of CNN-BiGRU, namely the convolutional kernel size (Kernel_size), the number of hidden units (Hidden_size), and the sequence length (Seq_len), in order to improve the model's performance. The results are shown in Figure 2.
The first hyperparameter was Kernel_size. We compared six combinations, namely [1,3,5], [3,5,7], [5,7,9], [1,3,5,7], [3,5,7,9], and [1,3,5,7,9]. For each combination, we used 32 convolutional kernels at each scale to extract different features. The combination of convolution kernels with the best overall model performance was [3,5,7] (Figure 2a), with an accuracy of 0.8371, precision of 0.8314, recall of 0.8465, F1 of 0.8389, and MCC of 0.6743. The overall model performance decreased when the combination was [1,3,5]: a smaller convolutional kernel narrows the model's receptive field, making it impossible to capture the sequence's overall contextual relationships. However, model performance also diminished to varying degrees as the kernel size or the number of kernels increased. We speculate that, on the one hand, additional convolutional kernels enlarged the receptive field and improved the ability to capture the global features of the sequence, but on the other hand, the growth in the number of model parameters made the model more prone to overfitting and introduced more invalid information, reducing its effectiveness.
The second hyperparameter was Hidden_size. The representation of the semantic information contained in the sequences depended on the size of the BiGRU hidden layer. Underfitting was more likely with too few hidden units, while gradient vanishing was more likely with too many. In this regard, we experimented with hidden layer sizes from 20 to 40 in steps of 5. As shown in Figure 2b, when the number of hidden units was set to 30, the accuracy, recall, MCC, and F1 of the model reached their peaks. Further increasing the hidden layer size did not improve the model's performance but did increase the training time. Thus, Hidden_size was set to 30 for the experiments.
The third hyperparameter was Seq_len. The amount of sequence information retained depended on the fixed sequence length. It is clear from Figure 2c that the model's performance was generally positively correlated with sequence length. Accuracy was poorer when the sequence length was short (500, 800) because the sequence lost too much information. When the sequence length was 1500, the metrics were as follows: accuracy, 0.8422; precision, 0.8320; recall, 0.8576; MCC, 0.6848; and F1, 0.8446. Recall and F1 were greater by 0.0322 and 0.0036, respectively, compared to a length of 1800. Although accuracy, precision, and MCC were all greater at the fixed length of 1800, the overall difference was not significant. Taking the computational cost into account as well, we ultimately chose a value of 1500 for Seq_len.
Finally, we compared the overall model performance before and after the insertion of CNN (Table 2). The overall performance of CNN-BiGRU was better than that of BiGRU. Although the precision of CNN-BiGRU was lower, the gap was only 0.0039. CNN-BiGRU was trained for 100 epochs, fewer than BiGRU, because our experiments showed that it converged faster.

2.1.3. Performance after Fusion of the GLT Model

Finally, we improved CNN-BiGRU by fusing GLT to add additional sequence information. In keeping with the rule of using only raw sequences, we used k-mer features as the GLT model's input. In theory, the deep neural network could also learn directly from other sequence-based parameters such as GC content and purine and pyrimidine content. The experimental comparison (Figure 3) showed that the model with GLT performed better than all the previous models, and, for the first time, accuracy exceeded 85%. Accuracy, recall, MCC, and F1 improved by 0.0117, 0.0282, 0.0232, and 0.01, respectively, compared to CNN-BiGRU (Table 3). Therefore, we finally selected the CNN-BiGRU-GLT model as the plant circRNA and lncRNA recognition method.
We visualized the training process of the three models, BiGRU, CNN-BiGRU, and CNN-BiGRU-GLT, by plotting the changes in the loss and accuracy of the training and validation sets for the first 100 epochs, as shown in Figure 4. From the figure, it can be seen that the enhanced models converged faster and achieved higher accuracy rates.
In addition, we observed that the CNN-BiGRU-GLT model improved accuracy by only around 1.2% compared to CNN-BiGRU, which might be attributable to model stability. To further validate the refinement, we retrained both models five times, and the results (Table 4 and Table 5) showed that the CNN-BiGRU-GLT model consistently exceeded 85% accuracy across all five experiments, while CNN-BiGRU remained below 85%. In particular, the CNN-BiGRU-GLT model exhibited exceptional consistency, with all of its metrics maintaining a standard deviation of less than 0.007, a testament to the model's overall stability and reliability.

2.1.4. Comparison of Traditional Machine Learning Methods

We also compared the performance of CircPCBL with four well-known machine learning algorithms (GBDT, RF, SVM, and KNN) to assess it more thoroughly. These machine learning methods used k-mer features (k = 1–4) as their inputs. We tuned their hyperparameters by grid search (details are shown in Table 6), and the adjusted parameters of each model were set as follows: GBDT {'learning rate': 0.1, 'number of base classifiers': 200}; RF {'number of base classifiers': 200}; SVM {'kernel function': Gaussian kernel function, 'C': 1.0}; KNN {'number of neighboring points': 5, 'p-value of Minkowski distance': 3}; the remaining parameters were kept at their defaults. The performances of the different machine learning models on the validation set are shown in Figure 5 and Table 7. The results show that CircPCBL outperformed the established machine learning techniques. Its MCC value was 0.7080, which was 0.1311, 0.1386, 0.3239, and 0.3964 higher than that of GBDT, RF, SVM, and KNN, respectively. For the other metrics, accuracy was 0.0655, 0.0693, 0.1636, and 0.1983 higher than GBDT, RF, SVM, and KNN, respectively; precision was 0.0685, 0.0798, 0.1931, and 0.2111 higher, respectively; recall was 0.0647, 0.0553, 0.0985, and 0.1675 higher, respectively; and F1 was 0.0666, 0.0676, 0.1482, and 0.1897 higher, respectively. In addition, because the MCC values were generally low, the predictions may have been more accurate for a single class. For a more robust assessment, we therefore output the single-class prediction accuracy of each model. For circRNAs, CNN-BiGRU-GLT reached an accuracy of 0.8490, which was 0.0647, 0.0553, 0.0985, and 0.1675 higher than that of GBDT, RF, SVM, and KNN, respectively; for lncRNAs, CNN-BiGRU-GLT achieved an accuracy of 0.8590, which was 0.1312, 0.1386, 0.3274, and 0.3966 higher, respectively. It is thus evident that our model not only performed best on all the evaluation metrics but also yielded robust predictions for each category. At the same time, we noted that GBDT, the best-performing traditional machine learning algorithm, still had a somewhat lower accuracy than the unimproved BiGRU, which indicates the effectiveness of the automatic feature extraction carried out by deep learning.
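As an illustration of this comparison, the sketch below tunes the four baselines with scikit-learn's GridSearchCV on placeholder k-mer features; since Table 6 is not reproduced here, the grids shown (which include the finally selected values quoted above) are our assumption:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# placeholder 340-dimensional k-mer features and 0/1 labels
rng = np.random.default_rng(0)
X_train = rng.random((200, 340))
y_train = rng.integers(0, 2, 200)

models = {
    "GBDT": (GradientBoostingClassifier(),
             {"learning_rate": [0.01, 0.1], "n_estimators": [100, 200]}),
    "RF":   (RandomForestClassifier(),
             {"n_estimators": [100, 200]}),
    "SVM":  (SVC(),
             {"kernel": ["rbf"], "C": [0.1, 1.0, 10.0]}),  # rbf = Gaussian kernel
    "KNN":  (KNeighborsClassifier(),
             {"n_neighbors": [3, 5, 7], "p": [1, 2, 3]}),  # p: Minkowski power
}

for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.best_score_, 4))
```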

2.2. Performance of CircPCBL for Test Sets

Similarly, we verified CircPCBL's capacity for cross-species prediction and compared how well each model performed on the independent test sets (Table 8). Here, RNN, BiRNN, GRU, BiGRU, LSTM, BiLSTM, and CNN-BiGRU took one-hot encoding features as their inputs; the machine learning methods took k-mer features; and CircPCBL took both one-hot and k-mer features. The results showed prediction accuracies of 0.8588, 0.7587, and 0.8660 for Cucumis sativus, Populus trichocarpa, and Gossypium raimondii, respectively. CircPCBL had the best performance across all measures, and its generalization ability was much higher than that of the other models on Cucumis sativus and Populus trichocarpa. On Gossypium raimondii, its performance was generally not much lower than that of the top-ranked CNN-BiGRU. It is worth mentioning that the model's generalization performance was significantly boosted by the addition of GLT, particularly on the Populus trichocarpa independent test set, where its prediction accuracy improved by almost 7% compared to its performance before incorporating GLT. Finally, to further illustrate the stability of the models and the effectiveness of the improvement strategy, we contrasted the models (BiGRU, CNN-BiGRU, and CNN-BiGRU-GLT) before and after improvement by outputting their prediction accuracy for individual categories (Figure 6). The results showed that the CNN-BiGRU-GLT model accurately identified both circRNAs and lncRNAs on all the independent test sets, without favoring a specific class. Specifically, its gap between circRNA and lncRNA prediction accuracy was the smallest relative to the pre-improvement BiGRU and CNN-BiGRU models. On the Cucumis sativus, Populus trichocarpa, and Gossypium raimondii tests, its single-class prediction accuracy differences were 0.0616, 0.0117, and 0.0091, respectively, compared with 0.2606, 0.1559, and 0.0509 for BiGRU and 0.0754, 0.1712, and 0.0101 for CNN-BiGRU. Meanwhile, Figure 6 shows that the BiGRU model exhibited clearly biased single-class predictions, a situation that was gradually corrected over the course of model refinement.

2.3. Prediction of Experimentally Validated circRNAs and lncRNAs

Zeng et al. identified 558 potential circRNAs in Poncirus trifoliata by high-throughput sequencing and bioinformatics analysis, and 11 circRNAs resistant to RNase R were verified by real-time PCR [44]. These 11 circRNAs were subjected to CircPCBL, which correctly predicted 10 of the 11 as circRNAs, a prediction accuracy of 90.9%. In addition, Li et al. used rapid amplification of cDNA ends (RACE) to obtain ten lncRNA sequences present in rice [45]. These sequences were also analyzed with the CircPCBL network, which successfully identified nine of them, an accuracy of 90%. These results indicate the usefulness of CircPCBL in identifying functional circRNAs and lncRNAs.

2.4. SMOTE Sampling for Different Species' Sequences

The number of sequences differed between species, with Arabidopsis thaliana having the largest numbers of positive and negative samples, 3000 of each. To balance the datasets across all the species, we used SMOTE sampling to increase the samples of the remaining five plants to 6000. Table 9 displays the results obtained after retraining CircPCBL. The results showed that, after SMOTE sampling, the performance of CircPCBL decreased slightly on the validation set. Its biased predictions became more obvious on the three independent test sets, especially the Populus trichocarpa test set, where the overall prediction accuracy was 0.6719 but the prediction accuracy for circRNAs was only 0.5589. SMOTE sampling is, in theory, a data augmentation method, but it did not show the desired effect in our task. We analyzed the possible reasons as follows.
Firstly, different species have vastly different sequence numbers. For example, Brassica rapa has merely 800 sequences, and expanding it to 6000 will not reveal much new information but could increase the risk of overfitting. Secondly, RNA sequences have structural and functional specificities; circRNAs, as a subclass of lncRNAs, are highly similar to other lncRNAs, and generating samples from a small volume of data based on feature distances may introduce noise that adversely affects the learning process. Finally, the generated samples exhibited a high degree of similarity, which could cause the model to focus excessively on them and overfit, ultimately reducing the model's generalization.
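For reference, the following minimal sketch shows how such oversampling could be performed with the SMOTE implementation in imbalanced-learn; the species, counts, and variable names here are placeholders rather than our actual pipeline:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# placeholder k-mer features for an under-represented species
rng = np.random.default_rng(0)
X = rng.random((800, 340))               # e.g., Brassica rapa: only 800 sequences
y = rng.integers(0, 2, 800)              # 0 = lncRNA, 1 = circRNA

# oversample both classes to 3000 each (6000 sequences in total)
smote = SMOTE(sampling_strategy={0: 3000, 1: 3000}, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))                    # Counter({0: 3000, 1: 3000})
```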

2.5. Testing on Species of Wide Interest to the Field of Plant Genomics

The species Arabidopsis thaliana [46], Oryza sativa [47], and Solanum lycopersicum [48] have been extensively studied in the field of plant genomics. To evaluate our model's performance, we tested it specifically on these three species. To enhance the reliability of the prediction outcomes, we randomly selected circRNA sequences from each species for analysis: 3000 for Arabidopsis thaliana, 3000 for Oryza sativa, and 2000 for Solanum lycopersicum. The process was repeated 30 times, and the final accuracy was the average over these experiments. The results showed that the average prediction accuracy for Arabidopsis thaliana reached 0.8366 ± 0.0058, while that for Oryza sativa and Solanum lycopersicum was 0.8628 ± 0.0050 and 0.8982 ± 0.0038, respectively. The small variance across the 30 replicate experiments (Figure 7) not only demonstrates the robustness of the model but also shows that the randomly sampled data were representative of the overall data. In sum, our software showed equally strong predictive power for the species of greatest interest in plant genomics.

2.6. Trying CircPCBL on Human Datasets

CircPCBL is a plant-specific tool, but we still tested its performance on a human dataset, hoping that the model could contribute to more species and not be limited to plants. Analogous to JEDI, our positive samples were from circRNADb (http://reprod.njmu.edu.cn/cgi-bin/circrnadb/circRNADb.php, accessed on 30 March 2023) [10], while the negative samples were from GENCODE v19 (https://www.gencodegenes.org/human/release_19.html, accessed on 30 March 2023) [49]. The numbers of positive and negative samples were 32,914 and 23,898, respectively. We randomly selected 8000 positive and 8000 negative samples and split them 7:3 for training and validation. To guarantee the representativeness of the dataset, we repeated this process five times. For the model setup, training ran for 30 epochs with early stopping to prevent overfitting, and the training time was approximately 24 min. The results (Table 10) showed that CircPCBL achieved an average accuracy of 94.08% on the human dataset. For the positive-class circRNAs, its average accuracy reached 94.31%, showing no prediction bias. Although the accuracy of CircPCBL was slightly lower than that of JEDI, it achieved satisfactory results using only raw sequences to classify circRNAs and lncRNAs. Even though we did not carry out plant-specific feature engineering, CircPCBL's performance differed considerably between the plant and animal datasets, which may imply that variations exist between plant and animal lncRNAs and circRNAs. In addition, when the CircPCBL model trained on the plant dataset was transferred directly to the human dataset using the above five random samples (Table 10), its average accuracy was only 69.02%, and its average F1 and MCC values were only 38.30% and 67.10%, respectively. This, too, illustrates the necessity of developing plant-specific circRNA identification tools.

2.7. Impact of Training Species Diversity on Model Generalization Performance

In contrast to PCirc, we brought more species into the training of the model. We argued that increasing the number of species would improve the generalization performance of the model, since training on various kinds of plants allows it to learn not only their shared aspects but also their distinct features. To verify this opinion, we trained CircPCBL on only rice circRNAs and lncRNAs and tested it on the three independent test sets. The results (Table 11) showed that, when trained only on rice, CircPCBL achieved accuracy, precision, recall, and MCC values of 0.8292, 0.8425, 0.8192, and 0.6586, respectively, on the validation set. However, the performance on the three independent test sets decreased markedly compared to before, especially for Cucumis sativus and Gossypium raimondii, whose accuracies decreased by 0.1612 and 0.1568, respectively. These results demonstrate that the number of species has an essential impact on the model's generalization performance, and thus it is necessary to maintain the diversity of training species.

3. Discussion

The database PlantcircBase, created in 2017, is devoted to cataloging plant circRNAs; before then, almost all circRNA databases were related to humans and animals [11]. Numerous studies on circRNAs in animals have been reported, but progress on circRNAs in plants has been sluggish. PlantcircBase has recently been updated to its seventh edition, which now includes a total of 171,118 circRNAs from 21 plant species. Even though many plant circRNAs have been discovered, there are still few effective tools for identifying them beyond traditional experimental techniques. We believe CircPCBL to be the first deep-learning-based framework for circRNA identification in plants.
We used two different models and inputs in CircPCBL, and the outputs of the two models were linked for prediction through a fully connected layer. In particular, CNN-BiGRU was used to process the sparse matrix encoded by one-hot, and GLT was used to extract deep-level information from the k-mer features. Different information about the sequences was processed independently by each of the two models. Through testing and analysis on a variety of datasets, CircPCBL showed excellent stability and generalization. In addition, the improvement method was also shown to be effective for the CNN-BiGRU model. Our input did not rely on any biological knowledge. We argue that biologically based characteristics (such as ORFs and CDSs) focus excessively on the coding regions of RNA sequences and disregard the UTRs, which leads to biased predictions for sequences with insufficient CDS coverage [50]. By using one-hot and k-mer to reflect the sequence order and composition, respectively, CircPCBL lowers the attention paid to any single region and fully displays the composition information of the sequences. It is worth mentioning that CircPCBL, based only on the original sequence features, also showed brilliant performance across the different datasets.
However, CircPCBL still has considerable room for development. We plan to continue our research in the following areas: First, we will keep refining the model's structure to enhance its prediction performance. Second, since the "tree" models in machine learning have been found to have a better fitting ability, we will consider integrating deep learning with "tree" models [51]. Third, we will explore high-quality features to facilitate our classification task.

4. Materials and Methods

The process of developing CircPCBL is shown in Figure 8. CircPCBL consists of two independent models (CNN-BiGRU and GLT) whose inputs are only based on the original sequences.

4.1. Dataset Construction

In this study, two classes were defined for the training of CircPCBL: lncRNAs from CANTATAdb 2.0 (http://yeti.amu.edu.pl/CANTATA/download.php, accessed on 27 December 2022) [52] and GreeNC v1.12 (http://greenc.sequentiabiotech.com/wiki/Main_Page, accessed on 27 December 2022) [53,54] were considered as the negative dataset, while circRNAs, regarded as the positive dataset, were obtained from PlantcircBase (Release v7 Data) (http://ibi.zju.edu.cn/plantcircbase/, accessed on 27 December 2022) [11]. From the above databases, we collected lncRNAs and circRNAs from nine different plants altogether. Among them, six plants (Arabidopsis thaliana, Brassica rapa, Zea mays, Oryza sativa Japonica, Solanum lycopersicum, and Solanum tuberosum), belonging to the families Cruciferae, Gramineae, and Solanaceae, were used to construct the training and validation sets, which were divided according to a ratio of 7:3. The remaining three plants (Cucumis sativus, Populus trichocarpa, and Gossypium raimondii), which do not belong to the same families as any species in the training and validation sets, were used to construct three independent test sets, employed to confirm CircPCBL's cross-species prediction ability. Considering the problems of redundancy and imbalance in the datasets, we performed the following processing on the raw data: Firstly, we removed sequences with excessively long or excessively short lengths from each fasta file using the box-whisker plot method. Next, we used cd-hit and cd-hit-est-2d with a threshold of 80% [55,56,57,58] to eliminate redundant sequences within the individual datasets and between the different classes of datasets. Finally, we balanced the positive and negative samples for each species via random sampling. The specifics of the data used in our work are shown in Table 12. In addition to the above datasets, we constructed a new test set, named the Real Set, containing 11 circRNAs of Poncirus trifoliata reported in [44] and 10 lncRNAs of rice reported in [45], to further evaluate the generalization ability of CircPCBL on a real dataset [59,60]. Furthermore, we observed an imbalance in the datasets among the different species. To address this issue, after the final deployment of CircPCBL, we performed SMOTE sampling of sequences from the various species and retrained the model to compare its performance before and after.
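The sketch below illustrates this preprocessing under stated assumptions: the input fasta file name is hypothetical, the 1.5 × IQR fence is one common reading of the "box-whisker plot method", and the cd-hit command lines in the comments are indicative invocations rather than the exact ones we ran:

```python
import numpy as np
from Bio import SeqIO
from sklearn.model_selection import train_test_split

def length_filter(records):
    """Drop sequences whose length lies outside the 1.5 x IQR
    box-whisker fences."""
    lengths = np.array([len(r.seq) for r in records])
    q1, q3 = np.percentile(lengths, [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [r for r in records if lo <= len(r.seq) <= hi]

# "circrna.fasta" is a hypothetical input file of raw circRNA sequences
records = length_filter(list(SeqIO.parse("circrna.fasta", "fasta")))

# Redundancy removal is run externally with cd-hit at an 80% threshold, e.g.:
#   cd-hit-est -i circrna_filtered.fa -o circrna_nr.fa -c 0.8
#   cd-hit-est-2d -i circrna_nr.fa -i2 lncrna_nr.fa -o nonredundant.fa -c 0.8

# 7:3 split into training and validation sets
ids = [r.id for r in records]
train_ids, val_ids = train_test_split(ids, test_size=0.3, random_state=42)
print(len(train_ids), len(val_ids))
```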

4.2. CircPCBL Architecture

Deep learning has recently been used extensively in the field of biology [61,62,63,64,65]. End-to-end learning, which is possible with deep learning but not with traditional machine learning, reduces the amount of information that needs to be understood about circRNAs and lncRNAs and does away with the need for complex feature engineering [66]. Deep learning is a powerful tool for addressing high-dimensional datasets because it can automatically extract features from unprocessed sequences. Therefore, we used deep learning methods to dig into the deep-level differences between circRNAs and lncRNAs and realize their classification.
After experimenting with different conventional deep learning models, we found that the BiGRU model was best suited to our classification task. By examining the performance of Word2Vec and one-hot, we finally decided on one-hot as the input of the BiGRU model. The model's performance was then further enhanced by adding a CNN in front of the BiGRU. In addition to CNN-BiGRU, we also designed a GLT model, which received k-mer (k = 1, 2, 3, 4) features as its input. The k-mer features were subjected to grouped linear transformations via GLT to obtain local information, which was subsequently distributed among the groups via shuffling to acquire global representations. The outputs of the two models were concatenated, and the ultimate prediction results were output through a fully connected layer. Because gradient vanishing and overfitting are more likely to occur as a model's depth increases, we used tactics such as early stopping, layer normalization, and learning rate scheduling. The details of the model are as follows.

4.2.1. One-Hot CNN-BiGRU

One-hot encoding encodes the four nucleotides as binary vectors: A = (1,0,0,0), G = (0,1,0,0), C = (0,0,1,0), and T = (0,0,0,1). Consequently, an RNA sequence of length L is represented by a 4 × L sparse matrix. The lengths of the RNA sequences passed into the CNN-BiGRU model needed to be consistent, so we fixed the sequence length to m. Sequences longer than m were truncated directly, and those shorter than m were padded with (0,0,0,0) vectors. For a single RNA sequence, the CNN-BiGRU model produced a 32-dimensional output vector. We used ReLU for all activation functions. The sequence length was set to 1500, we employed the [3,5,7] combination of convolution kernel sizes, and the number of BiGRU hidden units was set to 30.
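A minimal PyTorch sketch of this detector is given below, assuming the hyperparameters stated above (m = 1500, kernel sizes [3,5,7] with 32 filters each, 30 hidden units, 32-dimensional output); the exact layer arrangement (padding and how the BiGRU states are read out) is our interpretation, not the authors' released code:

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "G": 1, "C": 2, "T": 3}

def one_hot(seq, m=1500):
    """Encode a sequence as a 4 x m matrix; pad with zero vectors or truncate."""
    x = torch.zeros(4, m)
    for i, b in enumerate(seq[:m].upper().replace("U", "T")):
        if b in BASES:
            x[BASES[b], i] = 1.0
    return x

class CNNBiGRU(nn.Module):
    def __init__(self, hidden=30, out_dim=32):
        super().__init__()
        # three parallel convolutions with kernel sizes 3, 5, 7 (32 filters each)
        self.convs = nn.ModuleList(
            [nn.Conv1d(4, 32, k, padding=k // 2) for k in (3, 5, 7)]
        )
        self.relu = nn.ReLU()
        self.gru = nn.GRU(96, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                 # x: (batch, 4, m)
        feats = torch.cat([self.relu(c(x)) for c in self.convs], dim=1)
        feats = feats.transpose(1, 2)     # (batch, m, 96) for the GRU
        _, h = self.gru(feats)            # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)
        return self.fc(h)                 # 32-dimensional sequence representation

x = one_hot("AUGGCCAUU").unsqueeze(0)     # (1, 4, 1500)
print(CNNBiGRU()(x).shape)                # torch.Size([1, 32])
```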

4.2.2. K-mer GLT

The k-mer frequency is the frequency of occurrence of adjacent k bases. The k-mer frequency is thought to be species- and sequence-specific, and this specificity becomes more pronounced as the k-value rises. Nevertheless, heedlessly pursuing these distributional disparities results in a curse of dimensionality [50]. We used 340 features, ranging from 1-mer to 4-mer, as the input to the GLT model. First, a fully connected layer reduced the features to 256 dimensions. Next, we divided the 256-dimensional vector into 2 groups for linear transformation, with each group producing a 64-dimensional output vector. The resulting 128-dimensional vector was then divided into 4 groups for linear transformation, with each group producing an 8-dimensional output, so that this last step yielded a 32-dimensional output vector. Layer normalization was applied after each linear transformation to keep the gradients stable.
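The following sketch illustrates this grouped linear transformation in PyTorch under the dimensions stated above (340 → 256 → 2 × 64 → 4 × 8); the placement of the feature shuffle between the two grouped layers follows the DeLighT-style description in Section 4.2 but is our assumption:

```python
import torch
import torch.nn as nn

class GroupedLinear(nn.Module):
    """Split the input into g groups, apply an independent linear map to each,
    then concatenate; LayerNorm stabilizes each transformation."""
    def __init__(self, in_dim, out_dim, groups):
        super().__init__()
        self.groups = groups
        self.linears = nn.ModuleList(
            [nn.Linear(in_dim // groups, out_dim // groups) for _ in range(groups)]
        )
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        chunks = x.chunk(self.groups, dim=-1)
        out = torch.cat([f(c) for f, c in zip(self.linears, chunks)], dim=-1)
        return self.norm(out)

class GLT(nn.Module):
    def __init__(self):
        super().__init__()
        self.reduce = nn.Linear(340, 256)               # 1- to 4-mer features -> 256
        self.glt1 = GroupedLinear(256, 128, groups=2)   # 2 x (128 -> 64)
        self.glt2 = GroupedLinear(128, 32, groups=4)    # 4 x (32 -> 8)

    def forward(self, x):
        h = self.reduce(x)
        h = self.glt1(h)
        # feature shuffle: mix information across groups (DeLighT-style GLT)
        h = h.view(-1, 4, 32).transpose(1, 2).reshape(-1, 128)
        return self.glt2(h)                             # 32-dimensional output

print(GLT()(torch.rand(1, 340)).shape)                  # torch.Size([1, 32])
```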

4.2.3. Model Fusion

The 32-dimensional vectors produced by CNN-BiGRU and GLT were concatenated, and a fully connected layer then output the final prediction result. Throughout training, the learning rate was set to 0.001, and batch_size was set to 16. All the models were trained on an NVIDIA GeForce RTX 2060 GPU.
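A minimal sketch of the fusion step, reusing the CNNBiGRU and GLT sketches above, is shown below; the choice of the Adam optimizer and the two-unit classification head are our assumptions, while the learning rate of 0.001 and batch size of 16 follow the text:

```python
import torch
import torch.nn as nn

class CircPCBL(nn.Module):
    """Concatenate the 32-dim outputs of the two detectors and classify."""
    def __init__(self, cnn_bigru, glt):
        super().__init__()
        self.cnn_bigru, self.glt = cnn_bigru, glt
        self.classifier = nn.Linear(64, 2)        # circRNA (1) vs. lncRNA (0)

    def forward(self, onehot_x, kmer_x):
        fused = torch.cat([self.cnn_bigru(onehot_x), self.glt(kmer_x)], dim=1)
        return self.classifier(fused)

model = CircPCBL(CNNBiGRU(), GLT())               # sketches defined above
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# one training step on a placeholder batch of size 16
onehot_x = torch.rand(16, 4, 1500)                # one-hot detector input
kmer_x = torch.rand(16, 340)                      # k-mer detector input
labels = torch.randint(0, 2, (16,))
loss = criterion(model(onehot_x, kmer_x), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```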

4.3. Performance Evaluation

To evaluate CircPCBL, we opted for some commonly used evaluation metrics, namely accuracy, precision, recall, F1-score, and MCC, which were calculated as shown below:
Accuracy = (TP + TN)/(TP + FP + TN + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-score = 2TP/(2TP + FP + FN)
MCC = (TP × TN − FP × FN)/Sqrt((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))
where TP and TN indicate the numbers of correctly predicted circRNAs and lncRNAs, respectively, and FP and FN indicate the numbers of lncRNAs incorrectly predicted as circRNAs and circRNAs incorrectly predicted as lncRNAs, respectively. Precision indicates how many of the predicted circRNA samples were correct. Recall indicates how many circRNA samples were correctly predicted, i.e., the single-class prediction accuracy of circRNAs. The F1-score takes both precision and recall into account; its value is their harmonic mean. MCC, the Matthews correlation coefficient, integrates TP, TN, FP, and FN and describes the correlation between the predicted and actual results. Its value ranges from −1 to 1, with higher values indicating better model performance.
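These five metrics can be computed directly from the confusion-matrix counts, as in the short sketch below:

```python
import math

def metrics(tp, tn, fp, fn):
    """Five evaluation metrics from confusion-matrix counts
    (TP/TN: correctly predicted circRNAs/lncRNAs; FP/FN: the converse errors)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return accuracy, precision, recall, f1, mcc

# example: 85 circRNAs and 86 lncRNAs correct out of 100 of each
print(metrics(tp=85, tn=86, fp=14, fn=15))
```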

5. Conclusions

The abundance of plant circRNAs in PlantcircBase provides data support for deep learning techniques. The majority of circRNA recognition tools available today are geared toward animals, and identifying circRNAs in plants remains challenging. In this paper, we presented the CircPCBL model, which combines CNN, BiGRU, and GLT and processes one-hot and k-mer features with separate submodels to identify plant circRNAs. The model is based solely on raw sequences. CircPCBL covers a wide range of species, is trained on various kinds of plants, and exhibits outstanding cross-species prediction performance. In addition, we offer a free-to-use web server, where users can obtain predictions by simply entering sequences in the specified format or directly uploading a fasta file. In a nutshell, CircPCBL is a user-friendly method for identifying plant circRNAs that aims to increase the field's understanding of these molecules. Despite the progress made with CircPCBL, there is still considerable scope for enhancing its accuracy, indicating the need for further development of similar tools. To improve the model's performance, it would be useful to conduct error analyses to identify specific lncRNA subclasses or circRNA patterns that the model struggles to capture. This targeted information can inform future efforts to optimize the model, an area where our work can be further refined. We will explore how to further enhance the accuracy of plant circRNA prediction and conduct additional research on the functional prediction of plant circRNAs in a future study [67].
Lastly, since our tool classifies circRNAs against other lncRNAs, it cannot recognize other molecules such as mRNAs, tRNAs, etc. Here, we offer some suggestions for using our model more effectively. First, if you want to check the coding ability of sequences, we recommend OrfPredictor (http://proteomics.ysu.edu/tools/OrfPredictor.html, accessed on 6 April 2023) [68] or NCBI ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/, accessed on 6 April 2023) [69], as well as LGC for long non-coding RNAs (https://ngdc.cncb.ac.cn/lgc/, accessed on 6 April 2023) [70]. In addition, if you are wondering whether a sequence is a tRNA, please refer to tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/, accessed on 6 April 2023) [71,72]. Each of these tools has proven to be powerful and user-friendly. Once you have verified with the above tools that your test sequences are lncRNAs rather than mRNAs or sncRNAs, our model can assist in further determining whether they are circRNAs.

Author Contributions

Conceptualization, X.Z., P.W. and Z.N.; methodology, P.W.; software, Z.N.; validation, X.Z., P.W. and Z.H.; resources, P.W.; data curation, X.Z.; writing—original draft preparation, P.W. and X.Z.; writing—review and editing, X.Z.; visualization, Z.N. and Z.H.; supervision, X.Z.; project administration, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Research Project of the Education Department of Anhui Province, grant number KJ2020A0108.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and source code presented in this study are available at https://github.com/Peg-Wu/CircPCBL, accessed on 6 April 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhang, Y.; Xue, W.; Li, X.; Zhang, J.; Chen, S.; Zhang, J.-L.; Yang, L.; Chen, L.-L. The Biogenesis of Nascent Circular RNAs. Cell Rep. 2016, 15, 611–624.
2. Zhao, X.; Zhong, Y.; Wang, X.; Shen, J.; An, W. Advances in Circular RNA and Its Applications. Int. J. Med. Sci. 2022, 19, 975–985.
3. Sanger, H.L.; Klotz, G.; Riesner, D.; Gross, H.J.; Kleinschmidt, A.K. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc. Natl. Acad. Sci. USA 1976, 73, 3852–3856.
4. Arnberg, A.C.; Van Ommen, G.J.; Grivell, L.A.; Van Bruggen, E.F.; Borst, P. Some yeast mitochondrial RNAs are circular. Cell 1980, 19, 313–319.
5. Kos, A.; Dijkema, R.; Arnberg, A.C.; van der Meide, P.H.; Schellekens, H. The hepatitis delta (delta) virus possesses a circular RNA. Nature 1986, 323, 558–560.
6. Cocquerelle, C.; Mascrez, B.; Hétuin, D.; Bailleul, B. Mis-splicing yields circular RNA molecules. FASEB J. 1993, 7, 155–160.
7. Capel, B.; Swain, A.; Nicolis, S.; Hacker, A.; Walter, M.; Koopman, P.; Goodfellow, P.; Lovell-Badge, R. Circular transcripts of the testis-determining gene Sry in adult mouse testis. Cell 1993, 73, 1019–1030.
8. Zaphiropoulos, P.G. Circular RNAs from transcripts of the rat cytochrome P450 2C24 gene: Correlation with exon skipping. Proc. Natl. Acad. Sci. USA 1996, 93, 6536–6541.
9. Glazar, P.; Papavasileiou, P.; Rajewsky, N. circBase: A database for circular RNAs. RNA 2014, 20, 1666–1670.
10. Chen, X.; Han, P.; Zhou, T.; Guo, X.; Song, X.; Li, Y. circRNADb: A comprehensive database for human circular RNAs with protein-coding annotations. Sci. Rep. 2016, 6, 34985.
11. Chu, Q.J.; Zhang, X.C.; Zhu, X.T.; Liu, C.; Mao, L.F.; Ye, C.Y.; Zhu, Q.H.; Fan, L.J. PlantcircBase: A Database for Plant Circular RNAs. Mol. Plant 2017, 10, 1126–1128.
12. Hansen, T.B.; Jensen, T.I.; Clausen, B.H.; Bramsen, J.B.; Finsen, B.; Damgaard, C.K.; Kjems, J. Natural RNA circles function as efficient microRNA sponges. Nature 2013, 495, 384–388.
13. Li, Z.Y.; Huang, C.; Bao, C.; Chen, L.; Lin, M.; Wang, X.L.; Zhong, G.L.; Yu, B.; Hu, W.C.; Dai, L.M.; et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nat. Struct. Mol. Biol. 2015, 22, 256–264.
14. Cedric, B.C.; Souraka, T.D.M.; Feng, Y.L.; Kisembo, P.; Tu, J.C. CircRNA ZFR stimulates the proliferation of hepatocellular carcinoma through upregulating MAP2K1. Eur. Rev. Med. Pharmacol. Sci. 2020, 24, 9924–9931.
15. Wang, H.; Niu, X.; Mao, F.; Liu, X.; Zhong, B.; Jiang, H.; Fu, G. Hsa_circRNA_100146 Acts as a Sponge of miR-149-5p in Promoting Bladder Cancer Progression via Regulating RNF2. OncoTargets Ther. 2020, 13, 11007–11017.
16. Yi, F.; Xin, L.; Feng, L. Potential mechanism of circRNA_000585 in cholangiocarcinoma. J. Int. Med. Res. 2021, 49, 3000605211024501.
17. Rinn, J.L.; Chang, H.Y. Genome Regulation by Long Noncoding RNAs. Annu. Rev. Biochem. 2012, 81, 145–166.
18. Lan, W.; Lai, D.; Chen, Q.; Wu, X.; Chen, B.; Liu, J.; Wang, J.; Chen, Y.-P.P. LDICDL: LncRNA-Disease Association Identification Based on Collaborative Deep Learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 1715–1723.
19. Liu, Y.; Yu, Y.; Zhao, S. Dual Attention Mechanisms and Feature Fusion Networks Based Method for Predicting LncRNA-Disease Associations. Interdiscip. Sci. 2022, 14, 358–371.
20. Wang, B.; Zhang, C.; Du, X.-X.; Zheng, X.-D.; Li, J.-Y. lncRNA-disease association prediction based on the weight matrix and projection score. PLoS ONE 2023, 18, e0278817.
21. Zhao, H.; Shi, J.; Zhang, Y.; Xie, A.; Yu, L.; Zhang, C.; Lei, J.; Xu, H.; Leng, Z.; Li, T.; et al. LncTarD: A manually-curated database of experimentally-supported functional lncRNA-target regulations in human diseases. Nucleic Acids Res. 2020, 48, D118–D126.
22. Waititu, J.K.; Zhang, C.; Liu, J.; Wang, H. Plant Non-Coding RNAs: Origin, Biogenesis, Mode of Action and Their Roles in Abiotic Stress. Int. J. Mol. Sci. 2020, 21, 8401.
23. Meng, Y.; Xing, L.; Li, K.; Wei, Y.; Wang, H.; Mao, J.; Dong, F.; Ma, D.; Zhang, Z.; Han, M.; et al. Genome-wide identification, characterization and expression analysis of novel long non-coding RNAs that mediate IBA-induced adventitious root formation in apple rootstocks. Plant Growth Regul. 2019, 87, 287–302.
24. Zhu, B.; Yang, Y.; Li, R.; Fu, D.; Wen, L.; Luo, Y.; Zhu, H. RNA sequencing and functional analysis implicate the regulatory role of long non-coding RNAs in tomato fruit ripening. J. Exp. Bot. 2015, 66, 4483–4495.
25. Vieira, L.M.; Grativol, C.; Thiebaut, F.; Carvalho, T.G.; Hardoim, P.R.; Hemerly, A.; Lifschitz, S.; Ferreira, P.C.G.; Walter, M.E.M.T. PlantRNA_Sniffer: A SVM-Based Workflow to Predict Long Intergenic Non-Coding RNAs in Plants. Non-Coding RNA 2017, 3, 11.
26. Negri, T.d.C.; Alves, W.A.L.; Bugatti, P.H.; Saito, P.T.M.; Domingues, D.S.; Paschoal, A.R. Pattern recognition analysis on long noncoding RNAs: A tool for prediction in plants. Brief. Bioinform. 2018, 20, 682–689.
27. Yotsukura, S.; du Verle, D.; Hancock, T.; Natsume-Kitatani, Y.; Mamitsuka, H. Computational recognition for long non-coding RNA (lncRNA): Software and databases. Brief. Bioinform. 2017, 18, 9–27.
28. Derrien, T.; Johnson, R.; Bussotti, G.; Tanzer, A.; Djebali, S.; Tilgner, H.; Guernec, G.; Martin, D.; Merkel, A.; Knowles, D.G.; et al. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res. 2012, 22, 1775–1789.
29. Niu, M.; Zhang, J.; Li, Y.; Wang, C.; Liu, Z.; Ding, H.; Zou, Q.; Ma, Q. CirRNAPL: A web server for the identification of circRNA based on extreme learning machine. Comput. Struct. Biotechnol. J. 2020, 18, 834–842.
30. Wang, J.; Wang, L. Deep learning of the back-splicing code for circular RNA formation. Bioinformatics 2019, 35, 5235–5242.
31. Jiang, J.Y.; Ju, C.J.; Hao, J.; Chen, M.; Wang, W. JEDI: Circular RNA prediction based on junction encoders and deep interaction among splice sites. Bioinformatics 2021, 37, i289–i298.
32. Ye, C.Y.; Chen, L.; Liu, C.; Zhu, Q.H.; Fan, L. Widespread noncoding circular RNAs in plants. New Phytol. 2015, 208, 88–95.
33. Ye, C.-Y.; Zhang, X.; Chu, Q.; Liu, C.; Yu, Y.; Jiang, W.; Zhu, Q.-H.; Fan, L.; Guo, L. Full-length sequence assembly reveals circular RNAs with diverse non-GT/AG splicing signals in rice. RNA Biol. 2017, 14, 1055–1063.
34. Yin, S.; Tian, X.; Zhang, J.; Sun, P.; Li, G. PCirc: Random forest-based plant circRNA identification software. BMC Bioinform. 2021, 22, 10.
35. Min, S.; Lee, B.; Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 2017, 18, 851–869.
36. Wei, P.-J.; Pang, Z.-Z.; Jiang, L.-J.; Tan, D.-Y.; Su, Y.-S.; Zheng, C.-H. Promoter prediction in nannochloropsis based on densely connected convolutional neural networks. Methods 2022, 204, 38–46.
37. Kaur, A.; Chauhan, A.S.; Aggarwal, A.K. Prediction of Enhancers in DNA Sequence Data Using a Hybrid CNN-DLSTM Model. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 20, 1327–1336.
38. Min, X.; Zeng, W.; Chen, N.; Chen, T.; Jiang, R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics 2017, 33, i92–i101.
39. Hashim, E.K.M.; Abdullah, R. Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter. J. Theor. Biol. 2015, 387, 88–100.
40. Mehta, S.; Ghazvininejad, M.; Iyer, S.; Zettlemoyer, L.; Hajishirzi, H. DeLighT: Deep and Light-weight Transformer. arXiv 2020, arXiv:2008.00623.
41. Gao, Y.J.; Chen, Y.Q.; Feng, H.S.; Zhang, Y.H.; Yue, Z.Y. RicENN: Prediction of Rice Enhancers with Neural Network Based on DNA Sequences. Interdiscip. Sci. 2022, 14, 555–565.
42. Luo, Z.T.; Su, W.; Lou, L.L.; Qiu, W.R.; Xiao, X.; Xu, Z.C. DLm6Am: A Deep-Learning-Based Tool for Identifying N6,2′-O-Dimethyladenosine Sites in RNA Sequences. Int. J. Mol. Sci. 2022, 23, 11026.
43. Chen, G.; Zhang, X.; Zhang, J.; Li, F.; Duan, S. A novel brain-computer interface based on audio-assisted visual evoked EEG and spatial-temporal attention CNN. Front. Neurorobot. 2022, 16, 159.
  43. Chen, G.; Zhang, X.; Zhang, J.; Li, F.; Duan, S. A novel brain-computer interface based on audio-assisted visual evoked EEG and spatial-temporal attention CNN. Front. Neurorobot. 2022, 16, 159. [Google Scholar] [CrossRef] [PubMed]
  44. Zeng, R.F.; Zhou, J.J.; Hu, C.G.; Zhang, J.Z. Transcriptome-wide identification and functional prediction of novel and flowering-related circular RNAs from trifoliate orange (Poncirus trifoliata L. Raf.). Planta 2018, 247, 1191–1202. [Google Scholar] [CrossRef] [PubMed]
  45. Li, X.; Shahid, M.Q.; Wen, M.; Chen, S.; Yu, H.; Jiao, Y.; Lu, Z.; Li, Y.; Liu, X. Global identification and analysis revealed differentially expressed lncRNAs associated with meiosis and low fertility in autotetraploid rice. BMC Plant Biol. 2020, 20, 82. [Google Scholar] [CrossRef] [Green Version]
  46. Chen, G.; Cui, J.; Wang, L.; Zhu, Y.; Lu, Z.; Jin, B. Genome-Wide Identification of Circular RNAs in Arabidopsis thaliana. Front. Plant Sci. 2017, 8, 1678. [Google Scholar] [CrossRef] [Green Version]
  47. Wang, Y.; Xiong, Z.; Li, Q.; Sun, Y.; Jin, J.; Chen, H.; Zou, Y.; Huang, X.; Ding, Y. Circular RNA profiling of the rice photo-thermosensitive genic male sterile line Wuxiang S reveals circRNA involved in the fertility transition. BMC Plant Biol. 2019, 19, 340. [Google Scholar] [CrossRef] [Green Version]
  48. Hong, Y.-H.; Meng, J.; Zhang, M.; Luan, Y.-S. Identification of tomato circular RNAs responsive to Phytophthora infestans. Gene 2020, 746, 144652. [Google Scholar] [CrossRef]
  49. Frankish, A.; Carbonell-Sala, S.; Diekhans, M.; Jungreis, I.; Loveland, J.E.; Mudge, J.M.; Sisu, C.; Wright, J.C.; Arnan, C.; Barnes, I.; et al. GENCODE: Reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2022, 51, D942–D949. [Google Scholar] [CrossRef]
  50. Wang, Y.; Zhao, P.; Du, H.; Cao, Y.; Peng, Q.; Fu, L. LncDLSM: Identification of Long Non-coding RNAs with Deep Learning-based Sequence Model. bioRxiv 2022. [Google Scholar] [CrossRef]
  51. Kang, Q.; Meng, J.; Cui, J.; Luan, Y.; Chen, M. PmliPred: A method based on hybrid model and fuzzy decision for plant miRNA-lncRNA interaction prediction. Bioinformatics 2020, 36, 2986–2992. [Google Scholar] [CrossRef] [PubMed]
  52. Szczesniak, M.W.; Bryzghalov, O.; Ciomborowska-Basheer, J.; Makalowska, I. CANTATAdb 2.0: Expanding the Collection of Plant Long Noncoding RNAs. Methods Mol. Biol. 2019, 1933, 415–429. [Google Scholar] [CrossRef] [PubMed]
  53. Di Marsico, M.; Paytuvi Gallart, A.; Sanseverino, W.; Aiese Cigliano, R. GreeNC 2.0: A comprehensive database of plant long non-coding RNAs. Nucleic Acids Res. 2022, 50, D1442–D1447. [Google Scholar] [CrossRef] [PubMed]
  54. Paytuvi Gallart, A.; Hermoso Pulido, A.; Martinez de Lagran, I.A.; Sanseverino, W.; Aiese Cigliano, R. GREENC: A Wiki-based database of plant lncRNAs. Nucleic Acids Res. 2016, 44, D1161–D1166. [Google Scholar] [CrossRef]
  55. Tong, X.; Liu, S. CPPred: Coding potential prediction based on the global description of RNA sequence. Nucleic Acids Res. 2019, 47, e43. [Google Scholar] [CrossRef] [Green Version]
  56. Li, A.; Zhang, J.; Zhou, Z. PLEK: A tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinform. 2014, 15, 311. [Google Scholar] [CrossRef] [Green Version]
  57. Sun, L.; Liu, H.; Zhang, L.; Meng, J. lncRScan-SVM: A tool for predicting long non-coding RNAs using support vector machine. PLoS ONE 2015, 10, e0139654. [Google Scholar] [CrossRef]
  58. Lertampaiporn, S.; Thammarongtham, C.; Nukoolkit, C.; Kaewkamnerdpong, B.; Ruengjitchatchawalya, M. Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm. Nucleic Acids Res. 2014, 42, e93. [Google Scholar] [CrossRef]
  59. Zhou, B.; Ding, M.; Feng, J.; Ji, B.; Huang, P.; Zhang, J.; Yu, X.; Cao, Z.; Yang, Y.; Zhou, Y.; et al. EVlncRNA-Dpred: Improved prediction of experimentally validated lncRNAs by deep learning. Brief. Bioinform. 2023, 24, bbac583. [Google Scholar] [CrossRef]
  60. Zhang, Y.; Jia, C.; Kwoh, C.K. Predicting the interaction biomolecule types for lncRNA: An ensemble deep learning approach. Brief. Bioinform. 2021, 22, bbaa228. [Google Scholar] [CrossRef]
  61. Dai, Q.; Cheng, X.; Qiao, Y.; Zhang, Y. Crop Leaf Disease Image Super-Resolution and Identification With Dual Attention and Topology Fusion Generative Adversarial Network. IEEE Access 2020, 8, 55724–55735. [Google Scholar] [CrossRef]
  62. Chen, Y.; Wang, J.; Wang, C.; Liu, M.; Zou, Q. Deep learning models for disease-associated circRNA prediction: A review. Brief. Bioinform. 2022, 23, bbac364. [Google Scholar] [CrossRef] [PubMed]
  63. Xu, Z.; Luo, M.; Lin, W.; Xue, G.; Wang, P.; Jin, X.; Xu, C.; Zhou, W.; Cai, Y.; Yang, W. DLpTCR: An ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Brief. Bioinform. 2021, 22, bbab335. [Google Scholar] [CrossRef] [PubMed]
  64. Liu, Z.; Ji, C.; Ni, J.-C.; Wang, Y.-T.; Qiao, L.; Zheng, C.-H. Convolution Neural Networks Using Deep Matrix Factorization for Predicting circRNA-Disease Association. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 20, 277–284. [Google Scholar] [CrossRef]
  65. Wang, L.; Yan, X.; You, Z.-H.; Zhou, X.; Li, H.-Y.; Huang, Y.-A. SGANRDA: Semi-supervised generative adversarial networks for predicting circRNA-disease associations. Brief. Bioinform. 2021, 22, bbab028. [Google Scholar] [CrossRef]
  66. Zhang, X.; Xuan, J.; Yao, C.; Gao, Q.; Wang, L.; Jin, X.; Li, S. A deep learning approach for orphan gene identification in moso bamboo (Phyllostachys edulis) based on the CNN + Transformer model. BMC Bioinform. 2022, 23, 162. [Google Scholar] [CrossRef]
  67. Zhang, P.; Liu, Y.; Chen, H.; Meng, X.; Xue, J.; Chen, K.; Chen, M. CircPlant: An Integrated Tool for circRNA Detection and Functional Prediction in Plants. Genom. Proteom. Bioinform. 2020, 18, 352–358. [Google Scholar] [CrossRef]
  68. Min, X.J.; Butler, G.; Storms, R.; Tsang, A. OrfPredictor: Predicting protein-coding regions in EST-derived sequences. Nucleic Acids Res. 2005, 33, W677–W680. [Google Scholar] [CrossRef] [Green Version]
  69. Sayers, E.W.; Barrett, T.; Benson, D.A.; Bolton, E.; Bryant, S.H.; Canese, K.; Chetvernin, V.; Church, D.M.; DiCuccio, M.; Federhen, S.; et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38, D5–D16. [Google Scholar] [CrossRef] [Green Version]
  70. Wang, G.; Yin, H.; Li, B.; Yu, C.; Wang, F.; Xu, X.; Cao, J.; Bao, Y.; Wang, L.; Abbasi, A.A.; et al. Characterization and identification of long non-coding RNAs based on feature relationship. Bioinformatics 2019, 35, 2949–2956. [Google Scholar] [CrossRef]
  71. Lowe, T.M.; Chan, P.P. tRNAscan-SE On-line: Integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res. 2016, 44, W54–W57. [Google Scholar] [CrossRef] [PubMed]
  72. Chan, P.P.; Lowe, T.M. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences. In Gene Prediction: Methods and Protocols; Kollmar, M., Ed.; Springer: New York, NY, USA, 2019; pp. 1–14. [Google Scholar] [CrossRef]
Figure 1. Biogenesis and structures of circRNAs: (A) intron-pairing-driven circularization; (B) RBP-mediated circularization; (C) lariat-driven circularization; (D) different circRNA structures. (EIciRNA: exon–intron circRNA; EcircRNA: exonic circRNA; ciRNA: intronic circRNA).
Figure 2. The effect of different hyperparameters on CNN-BiGRU: (a) performance of CNN-BiGRU models with different kernel_size combinations; (b) effect of different hidden layer sizes on CNN-BiGRU models; (c) performance comparison of CNN-BiGRU models with different fixed sequence lengths.
Figure 3. Comparison of the CNN-BiGRU-GLT model's performance with traditional deep learning methods on the validation set: (A) CNN-BiGRU-GLT and RNN; (B) CNN-BiGRU-GLT and BiRNN; (C) CNN-BiGRU-GLT and GRU; (D) CNN-BiGRU-GLT and BiGRU; (E) CNN-BiGRU-GLT and LSTM; (F) CNN-BiGRU-GLT and BiLSTM; (G) CNN-BiGRU-GLT and CNN-BiGRU.
Figure 4. Loss and accuracy variation in each epoch for training and validation sets on models with various degrees of improvement (BiGRU, CNN-BiGRU, and CNN-BiGRU-GLT).
Figure 5. Comparison of the CNN-BiGRU-GLT model with traditional machine learning methods on the validation set: (a) CNN-BiGRU-GLT and KNN; (b) CNN-BiGRU-GLT and SVM; (c) CNN-BiGRU-GLT and RF; (d) CNN-BiGRU-GLT and GBDT.
Figure 6. Single-class prediction accuracy with BiGRU, CNN-BiGRU, and CircPCBL models: (A) Cucumis sativus; (B) Populus trichocarpa; (C) Gossypium raimondii. The number on the bar chart indicates the prediction accuracy margin.
Figure 7. Thirty random-sampling prediction experiments on three plants of particular interest: (A) accuracy of each experiment on the Arabidopsis thaliana dataset; (B) accuracy of each experiment on the Oryza sativa dataset; (C) accuracy of each experiment on the Solanum lycopersicum dataset.
Figure 8. The flowchart for developing CircPCBL: (A) main datasets used in our work; (B) one-hot encoding process; (C) k-mer feature calculation process; (D) CNN-BiGRU model architecture; (E) GLT model architecture; (F) result output process.
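Panels (B) and (C) of Figure 8 correspond to the two input branches of CircPCBL: one-hot matrices feed the CNN-BiGRU detector, and k-mer (k = 1–4) frequency vectors feed the GLT detector. The snippet below is a minimal sketch of these two encodings rather than the authors' released code; the padding length of 1000 nt and the all-zero treatment of non-ACGT characters are illustrative assumptions.

```python
import itertools
import numpy as np

BASES = "ACGT"

def one_hot(seq, max_len=1000):
    """Encode a sequence as a (max_len, 4) binary matrix, truncating or zero-padding to max_len."""
    mat = np.zeros((max_len, 4), dtype=np.float32)
    for i, base in enumerate(seq[:max_len].upper()):
        if base in BASES:  # assumption: non-ACGT characters are left as all-zero rows
            mat[i, BASES.index(base)] = 1.0
    return mat

def kmer_freqs(seq, k_max=4):
    """Concatenate normalized k-mer frequencies for k = 1..k_max (4 + 16 + 64 + 256 = 340 values)."""
    seq = seq.upper()
    feats = []
    for k in range(1, k_max + 1):
        kmers = ["".join(p) for p in itertools.product(BASES, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        n = len(seq) - k + 1                  # number of k-length windows
        for i in range(max(n, 0)):
            window = seq[i:i + k]
            if window in counts:              # windows containing non-ACGT characters are skipped
                counts[window] += 1
        feats.extend(counts[km] / max(n, 1) for km in kmers)
    return np.array(feats, dtype=np.float32)

x_cnn = one_hot("ATGCGT")      # input to the CNN-BiGRU branch
x_glt = kmer_freqs("ATGCGT")   # input to the GLT branch
```

Note that for k = 1–4 the k-mer branch always yields a fixed 340-dimensional vector regardless of sequence length, whereas the one-hot branch preserves positional information for the convolutional and recurrent layers.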
Table 1. Performance of different deep learning models with different coding methods (the Time column gives the model training time; the Hidden column gives the optimal number of neurons in the hidden layer for each model, selected from {20, 30, 40}; all metrics are reported on the validation set).
| Code | Model | Hidden | Epochs | Time (min) | Accuracy | Precision | Recall | F1 | MCC |
|------|-------|--------|--------|------------|----------|-----------|--------|------|------|
| One-hot | RNN | 40 | 200 | 37 | 0.7729 | 0.7817 | 0.7590 | 0.7702 | 0.5461 |
| One-hot | BiRNN | 40 | 200 | 56 | 0.7771 | 0.7840 | 0.7665 | 0.7752 | 0.5543 |
| One-hot | GRU | 40 | 200 | 38 | 0.7784 | 0.8053 | 0.7359 | 0.7690 | 0.5590 |
| One-hot | BiGRU | 30 | 200 | 57 | 0.8216 | 0.8360 | 0.7992 | 0.8172 | 0.6438 |
| One-hot | LSTM | 20 | 200 | 38 | 0.6555 | 0.6769 | 0.5984 | 0.6353 | 0.3133 |
| One-hot | BiLSTM | 20 | 200 | 57 | 0.7763 | 0.7711 | 0.7877 | 0.7793 | 0.5528 |
| Word embedding | RNN | 30 | 200 | 38 | 0.7748 | 0.7771 | 0.7637 | 0.7703 | 0.5496 |
| Word embedding | BiRNN | 20 | 200 | 57 | 0.7608 | 0.7848 | 0.7204 | 0.7512 | 0.5235 |
| Word embedding | GRU | 30 | 200 | 39 | 0.7824 | 0.7833 | 0.7776 | 0.7804 | 0.5648 |
| Word embedding | BiGRU | 30 | 200 | 59 | 0.8140 | 0.8370 | 0.7719 | 0.8031 | 0.6293 |
| Word embedding | LSTM | 30 | 200 | 41 | 0.7752 | 0.7896 | 0.7630 | 0.7761 | 0.5508 |
| Word embedding | BiLSTM | 30 | 200 | 61 | 0.7996 | 0.8187 | 0.7677 | 0.7924 | 0.6003 |
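For concreteness, the strongest recurrent baseline in Table 1 (a one-hot BiGRU with 30 hidden units) can be sketched in a few lines of PyTorch. The mean-pooling readout and the two-unit linear head below are illustrative choices on our part; Table 1 only fixes the input encoding and hidden size.

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Bidirectional GRU over one-hot encoded sequences (hidden size 30, as in Table 1)."""
    def __init__(self, hidden=30):
        super().__init__()
        self.bigru = nn.GRU(input_size=4, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)   # two classes: circRNA (1) vs. lncRNA (0)

    def forward(self, x):                    # x: (batch, seq_len, 4)
        out, _ = self.bigru(x)               # out: (batch, seq_len, 2 * hidden)
        return self.fc(out.mean(dim=1))      # mean-pool over time steps, then classify

logits = BiGRUClassifier()(torch.zeros(8, 1000, 4))   # dummy batch of padded sequences
```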
Table 2. Comparison of BiGRU and CNN-BiGRU performance on the validation set.
| Model | Epochs | Time (min) | Accuracy | Precision | Recall | MCC | F1 |
|-------|--------|------------|----------|-----------|--------|------|------|
| BiGRU | 200 | 57 | 0.8216 | 0.8360 | 0.7992 | 0.6438 | 0.8172 |
| CNN-BiGRU | 100 | 46 | 0.8422 | 0.8320 | 0.8576 | 0.6848 | 0.8446 |
Table 3. Comparison of CNN-BiGRU and CNN-BiGRU-GLT performance on the validation set.
| Model | Epochs | Time (min) | Accuracy | Precision | Recall | MCC | F1 |
|-------|--------|------------|----------|-----------|--------|------|------|
| CNN-BiGRU | 100 | 46 | 0.8422 | 0.8320 | 0.8576 | 0.6848 | 0.8446 |
| CNN-BiGRU-GLT | 100 | 88 | 0.8540 | 0.8603 | 0.8490 | 0.7080 | 0.8546 |
Table 4. Results of repeating the experiment five times for the CNN-BiGRU model.
| CNN-BiGRU | Accuracy | Precision | Recall | MCC | F1 |
|-----------|----------|-----------|--------|------|------|
| 1 | 0.8422 | 0.8320 | 0.8576 | 0.6848 | 0.8446 |
| 2 | 0.8394 | 0.8704 | 0.7964 | 0.6812 | 0.8317 |
| 3 | 0.8422 | 0.8518 | 0.8296 | 0.6847 | 0.8406 |
| 4 | 0.8479 | 0.8439 | 0.8525 | 0.6959 | 0.8482 |
| 5 | 0.8445 | 0.8366 | 0.8590 | 0.6892 | 0.8477 |
| Average | 0.8433 | 0.8470 | 0.8390 | 0.6871 | 0.8426 |
| Std | 0.0032 | 0.0151 | 0.0266 | 0.0056 | 0.0068 |
Table 5. Results of repeating the experiment five times for the CNN-BiGRU-GLT model.
| CNN-BiGRU-GLT | Accuracy | Precision | Recall | MCC | F1 |
|---------------|----------|-----------|--------|------|------|
| 1 | 0.8540 | 0.8603 | 0.8490 | 0.7080 | 0.8546 |
| 2 | 0.8570 | 0.8602 | 0.8534 | 0.7140 | 0.8568 |
| 3 | 0.8530 | 0.8574 | 0.8503 | 0.7061 | 0.8538 |
| 4 | 0.8557 | 0.8503 | 0.8620 | 0.7114 | 0.8561 |
| 5 | 0.8525 | 0.8534 | 0.8431 | 0.7048 | 0.8482 |
| Average | 0.8544 | 0.8563 | 0.8516 | 0.7089 | 0.8539 |
| Std | 0.0019 | 0.0044 | 0.0069 | 0.0038 | 0.0034 |
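The Average and Std rows in Tables 4 and 5 are straightforward to reproduce; for instance, the Accuracy column of Table 5 matches the sample standard deviation (ddof = 1), as this quick check shows.

```python
import numpy as np

acc = np.array([0.8540, 0.8570, 0.8530, 0.8557, 0.8525])  # Table 5, Accuracy column
print(round(acc.mean(), 4))        # 0.8544, matching the Average row
print(round(acc.std(ddof=1), 4))   # 0.0019, matching the Std row (sample std)
```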
Table 6. Parameter tuning details of the machine learning models (GBDT, RF, SVM, and KNN).
| Model | Parameter | Description | Search Scope | Best |
|-------|-----------|-------------|--------------|------|
| GBDT | learning_rate | Learning rate | [0.1, 0.01, 0.001] | 0.1 |
| GBDT | n_estimators | Number of base classifiers | [50, 100, 150, 200] | 200 |
| RF | n_estimators | Number of base classifiers | [50, 100, 150, 200] | 200 |
| SVM | kernel | Kernel function | [rbf, linear] | rbf |
| SVM | C | Penalty factor | [0, 0.2, 0.4, 0.6, 0.8, 1.0] | 1 |
| KNN | n_neighbors | Number of neighboring points | [5, 10, 15] | 5 |
| KNN | p | Order p of the Minkowski distance | [1, 2, 3] | 3 |
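The search scopes in Table 6 map directly onto a scikit-learn grid search. The sketch below assumes 5-fold cross-validation and accuracy as the selection criterion (details not stated in the table), and it omits the 0 from the SVM C grid because scikit-learn's SVC requires C > 0.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

searches = {
    "GBDT": (GradientBoostingClassifier(),
             {"learning_rate": [0.1, 0.01, 0.001], "n_estimators": [50, 100, 150, 200]}),
    "RF":   (RandomForestClassifier(),
             {"n_estimators": [50, 100, 150, 200]}),
    # scikit-learn requires C > 0, so the 0 listed in the table's scope is dropped here
    "SVM":  (SVC(),
             {"kernel": ["rbf", "linear"], "C": [0.2, 0.4, 0.6, 0.8, 1.0]}),
    "KNN":  (KNeighborsClassifier(),
             {"n_neighbors": [5, 10, 15], "p": [1, 2, 3]}),
}

# X_train, y_train: k-mer feature matrix and class labels (not shown here)
for name, (estimator, grid) in searches.items():
    gs = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    # gs.fit(X_train, y_train)
    # print(name, gs.best_params_)
```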
Table 7. Specific metric values of the CNN-BiGRU-GLT model compared to traditional machine learning methods on the validation set.
| Model | Time (min) | Accuracy | Precision | Recall | F1 | MCC |
|-------|------------|----------|-----------|--------|------|------|
| CNN-BiGRU-GLT | 88 | 0.8540 | 0.8603 | 0.8490 | 0.8546 | 0.7080 |
| GBDT | 3 | 0.7884 | 0.7918 | 0.7843 | 0.7880 | 0.5769 |
| RF | <1 | 0.7847 | 0.7805 | 0.7937 | 0.7870 | 0.5694 |
| SVM | <1 | 0.6903 | 0.6672 | 0.7505 | 0.7064 | 0.3842 |
| KNN | <1 | 0.6557 | 0.6492 | 0.6815 | 0.6649 | 0.3117 |
Table 8. Performance of the CNN-BiGRU-GLT model on the three independent test sets, compared with the aforementioned traditional machine learning and deep learning methods.
| Species | Model | Accuracy | Precision | Recall | MCC | F1 |
|---------|-------|----------|-----------|--------|------|------|
| C. sativus | RNN | 0.7508 | 0.6481 | 0.7500 | 0.4900 | 0.6952 |
| C. sativus | BiRNN | 0.7659 | 0.6673 | 0.7628 | 0.5194 | 0.7118 |
| C. sativus | GRU | 0.7574 | 0.6795 | 0.6816 | 0.4850 | 0.6805 |
| C. sativus | BiGRU | 0.8119 | 0.7588 | 0.7386 | 0.5985 | 0.7485 |
| C. sativus | LSTM | 0.7746 | 0.6967 | 0.7178 | 0.5241 | 0.7071 |
| C. sativus | BiLSTM | 0.7687 | 0.6671 | 0.7784 | 0.5287 | 0.7185 |
| C. sativus | CNN-BiGRU | 0.8457 | 0.7899 | 0.8080 | 0.6739 | 0.7989 |
| C. sativus | CNN-BiGRU-GLT | 0.8588 | 0.8051 | 0.8280 | 0.7019 | 0.8164 |
| C. sativus | GBDT | 0.7913 | 0.7183 | 0.7395 | 0.5593 | 0.7287 |
| C. sativus | RF | 0.7911 | 0.7122 | 0.7531 | 0.5616 | 0.7321 |
| C. sativus | SVM | 0.7120 | 0.6057 | 0.6882 | 0.4063 | 0.6443 |
| C. sativus | KNN | 0.6295 | 0.5075 | 0.7652 | 0.3057 | 0.6103 |
| P. trichocarpa | RNN | 0.6600 | 0.6595 | 0.6495 | 0.3198 | 0.6545 |
| P. trichocarpa | BiRNN | 0.6218 | 0.6135 | 0.6415 | 0.2442 | 0.6272 |
| P. trichocarpa | GRU | 0.6618 | 0.6476 | 0.6974 | 0.3249 | 0.6716 |
| P. trichocarpa | BiGRU | 0.6356 | 0.6559 | 0.5577 | 0.2733 | 0.6028 |
| P. trichocarpa | LSTM | 0.6314 | 0.6188 | 0.6684 | 0.2640 | 0.6426 |
| P. trichocarpa | BiLSTM | 0.6507 | 0.6290 | 0.7206 | 0.3054 | 0.3054 |
| P. trichocarpa | CNN-BiGRU | 0.6854 | 0.7191 | 0.5998 | 0.3750 | 0.6540 |
| P. trichocarpa | CNN-BiGRU-GLT | 0.7587 | 0.7587 | 0.7529 | 0.5174 | 0.7558 |
| P. trichocarpa | GBDT | 0.6317 | 0.6115 | 0.7050 | 0.2673 | 0.6550 |
| P. trichocarpa | RF | 0.6226 | 0.5938 | 0.7563 | 0.2564 | 0.6652 |
| P. trichocarpa | SVM | 0.6303 | 0.5863 | 0.8639 | 0.2981 | 0.6986 |
| P. trichocarpa | KNN | 0.5624 | 0.5469 | 0.6852 | 0.1307 | 0.6083 |
| G. raimondii | RNN | 0.6874 | 0.4385 | 0.7775 | 0.3805 | 0.5607 |
| G. raimondii | BiRNN | 0.6636 | 0.4180 | 0.7922 | 0.3595 | 0.5472 |
| G. raimondii | GRU | 0.7345 | 0.4886 | 0.7455 | 0.4261 | 0.5903 |
| G. raimondii | BiGRU | 0.8125 | 0.6032 | 0.7870 | 0.5630 | 0.6829 |
| G. raimondii | LSTM | 0.7294 | 0.4825 | 0.7498 | 0.4211 | 0.5871 |
| G. raimondii | BiLSTM | 0.6992 | 0.4515 | 0.8026 | 0.4087 | 0.5779 |
| G. raimondii | CNN-BiGRU | 0.8683 | 0.6962 | 0.8632 | 0.6876 | 0.7708 |
| G. raimondii | CNN-BiGRU-GLT | 0.8660 | 0.6919 | 0.8615 | 0.6829 | 0.7675 |
| G. raimondii | GBDT | 0.7278 | 0.4804 | 0.7429 | 0.4156 | 0.5835 |
| G. raimondii | RF | 0.6943 | 0.4424 | 0.7351 | 0.3668 | 0.5524 |
| G. raimondii | SVM | 0.6485 | 0.4065 | 0.8035 | 0.3481 | 0.5398 |
| G. raimondii | KNN | 0.5279 | 0.3234 | 0.7688 | 0.1912 | 0.4553 |
Table 9. Performance comparison of CircPCBL on the validation and independent test sets before and after SMOTE sampling.
| Dataset | Model | Accuracy | Precision | Recall | MCC | F1 |
|---------|-------|----------|-----------|--------|------|------|
| Validation set | CircPCBL | 0.8540 | 0.8603 | 0.8490 | 0.7080 | 0.8546 |
| Validation set | SMOTE + CircPCBL | 0.8418 | 0.8434 | 0.8413 | 0.6835 | 0.8424 |
| Test set (C. sativus) | CircPCBL | 0.8588 | 0.8051 | 0.8280 | 0.7019 | 0.8164 |
| Test set (C. sativus) | SMOTE + CircPCBL | 0.8479 | 0.8079 | 0.7857 | 0.6754 | 0.7966 |
| Test set (P. trichocarpa) | CircPCBL | 0.7587 | 0.7587 | 0.7529 | 0.5174 | 0.7558 |
| Test set (P. trichocarpa) | SMOTE + CircPCBL | 0.6719 | 0.7170 | 0.5589 | 0.3511 | 0.6282 |
| Test set (G. raimondii) | CircPCBL | 0.8660 | 0.6919 | 0.8615 | 0.6829 | 0.7675 |
| Test set (G. raimondii) | SMOTE + CircPCBL | 0.8660 | 0.7069 | 0.8165 | 0.6691 | 0.7577 |
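Table 9 contrasts CircPCBL trained on the original data with a variant whose training set was first rebalanced by SMOTE. A minimal sketch with the imbalanced-learn package is shown below; the feature matrix, class ratio, and random seed are illustrative, as the exact SMOTE configuration is not given in the table.

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.random((1000, 340))            # e.g., 340-dimensional k-mer feature vectors
y = np.array([1] * 800 + [0] * 200)    # an illustrative 4:1 class imbalance

# SMOTE synthesizes new minority-class samples until the classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # Counter({1: 800, 0: 200}) -> Counter({1: 800, 0: 800})
```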
Table 10. Performance of CircPCBL on the human datasets.
| Human Datasets | Accuracy | Precision | Recall | MCC | F1 |
|----------------|----------|-----------|--------|------|------|
| After retraining | 0.9408 ± 0.0025 | 0.9373 ± 0.0052 | 0.9431 ± 0.0106 | 0.8818 ± 0.0051 | 0.9401 ± 0.0029 |
| Directly transferred | 0.6902 ± 0.0012 | 0.7153 ± 0.0023 | 0.6320 ± 0.0055 | 0.3830 ± 0.0023 | 0.6710 ± 0.0025 |
Table 11. Performance of CircPCBL on the three independent test sets when training with rice only (values in parentheses give the decrease relative to the results obtained when training on all six species).
| Species | Accuracy | Precision | Recall | MCC | F1 |
|---------|----------|-----------|--------|------|------|
| C. sativus | 0.6976 (−0.1612) | 0.5895 (−0.2156) | 0.6659 (−0.1621) | 0.3753 (−0.3266) | 0.6254 (−0.1910) |
| P. trichocarpa | 0.6630 (−0.0957) | 0.6556 (−0.1031) | 0.6748 (−0.0781) | 0.3262 (−0.1912) | 0.6651 (−0.0907) |
| G. raimondii | 0.7092 (−0.1568) | 0.4554 (−0.2365) | 0.6805 (−0.1810) | 0.3589 (−0.3240) | 0.5456 (−0.2219) |
Table 12. Details of the datasets used in our work (circRNAs are coded as 1 and lncRNAs as 0).
| Dataset | Species | circRNA Raw * | circRNA After ** | circRNA Used *** | lncRNA Raw * | lncRNA After ** | lncRNA Used *** |
|---------|---------|---------------|------------------|------------------|--------------|-----------------|-----------------|
| Training and validation | A. thaliana | 52,393 | 21,827 | 3000 | 4373 | 3324 | 3000 |
| Training and validation | B. rapa | 591 | 480 | 400 | 8501 | 6473 | 400 |
| Training and validation | Z. mays | 10,381 | 6262 | 600 | 10,761 | 682 | 600 |
| Training and validation | O. sativa | 43,883 | 25,230 | 2000 | 2788 | 2088 | 2000 |
| Training and validation | S. lycopersicum | 3796 | 2457 | 2000 | 4716 | 3124 | 2000 |
| Training and validation | S. tuberosum | 1728 | 805 | 800 | 5790 | 3488 | 800 |
| Training and validation | Total | 112,772 | 57,061 | 8800 | 36,929 | 19,179 | 8800 |
| Test | C. sativus | 4832 | 3313 | 3313 | 7348 | 5426 | 5426 |
| Test | P. trichocarpa | 4408 | 3278 | 3278 | 4322 | 3333 | 3333 |
| Test | G. raimondii | 1478 | 1155 | 1155 | 4216 | 3346 | 3346 |

* Raw data; ** data after redundant sequences were removed; *** data balanced between positive and negative samples (this balancing was not applied to the test sets).
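The *** balancing step reduces each species' "After" pool to the fixed "Used" quota for training and validation. One simple way to realize it, assuming random downsampling without replacement (the table does not specify the authors' sampling strategy), is:

```python
import random

def balance_classes(circ_seqs, lnc_seqs, n_per_class, seed=42):
    """Randomly downsample each class to n_per_class sequences (training/validation only)."""
    random.seed(seed)
    return (random.sample(circ_seqs, n_per_class),
            random.sample(lnc_seqs, n_per_class))

# e.g., A. thaliana: 21,827 circRNAs and 3324 lncRNAs are reduced to 3000 each
# circ_used, lnc_used = balance_classes(circ_seqs, lnc_seqs, n_per_class=3000)
```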