7.2. Data Preprocessing
In this study, psRNATarget was used as the miRNA–lncRNA interaction prediction tool. Target gene sequences that can interact with miRNAs were identified by examining the degree of complementarity between miRNA and target sequences in plants. The filtered lncRNAs and miRNAs were then entered into psRNATarget for prediction, yielding a positive dataset of 18,468 miRNA–lncRNA interaction pairs. A negative dataset with strong interference capability is required to further validate the model's performance. Because miRNAs are few in number and short in sequence length, only a small percentage of them participate in interaction pairs; consequently, the experiment mostly analyzed lncRNA sequences. First, all lncRNAs were sorted into two groups: those that participated in an interaction and those that did not. Then, using the Needleman–Wunsch algorithm [48], a similarity comparison between the two groups of lncRNAs was performed, and lncRNA samples with more than 80% similarity were eliminated [49].
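The similarity filter described above can be sketched with a plain Needleman–Wunsch global alignment. The +1/−1/−1 match/mismatch/gap scores below are assumptions, since the scoring scheme is not stated; sequences whose identity to the other group exceeds 80% would be discarded.

```python
def needleman_wunsch_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Global-alignment identity of two sequences: matched positions
    divided by alignment length. Scoring parameters are illustrative."""
    n, m = len(a), len(b)
    # Fill the dynamic-programming score matrix
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back, counting matched positions along the optimal alignment
    i, j, matches, aln_len = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        aln_len += 1
    return matches / aln_len

print(needleman_wunsch_identity("ACGUACGU", "ACGUACGU"))  # identical -> 1.0
```

A pair of lncRNAs would then be filtered when `needleman_wunsch_identity(a, b) > 0.8`.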
After similarity elimination, lncRNAs not engaged in any lncRNA–miRNA interaction were randomly paired with all miRNAs to produce the negative sample dataset. Random sampling was applied to obtain the same number of negative samples as positive samples, guaranteeing class balance. The positive and negative datasets were shuffled randomly to create the 39,593 data points needed for the experiment. To address data insufficiency and the small sample size, we employed the SMOTE method [50] to enlarge the sample set by producing feature vectors that resemble existing samples. Taking positive samples as an example, we randomly selected the feature vector of a positive sample, found the feature vector of the nearest positive sample, and then created new positive samples by interpolating between the two. We repeated these steps until the sample set was large enough. Because the dataset's maximum sequence length exceeds 8000 nt, the training phase takes a long time; at the same time, there were only 315 sequences longer than 4000 nt. We therefore discarded sequences longer than 4000 nt. The findings show that after deleting sequences longer than 4000 nt, CNN-RNN accuracy did not change much, but the training time was considerably reduced. Dataset 1 is the original dataset, and dataset 2 is the dataset obtained after deleting sequences longer than 4000 nt. Three experiments were carried out, which are presented in Table 2. Although the accuracy of CNN-RNN changed only slightly, the training time per batch was reduced by more than half.
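The interpolation step described above can be sketched in NumPy. This is a minimal illustration of the nearest-neighbour variant the text describes, not an off-the-shelf SMOTE implementation; the function name and the choice of a single nearest neighbour are assumptions.

```python
import numpy as np

def smote_oversample(X_pos, n_new, rng=np.random.default_rng(0)):
    """Generate n_new synthetic positive samples: pick a random positive
    sample, find its nearest positive neighbour, and interpolate between
    the two feature vectors."""
    X_pos = np.asarray(X_pos, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        # Distance to every other positive sample; exclude the sample itself
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_pos[i] + lam * (X_pos[j] - X_pos[i]))
    return np.vstack(synthetic)
```

Each synthetic point lies on the segment between an existing sample and its nearest neighbour, so the new data stay inside the positive class's feature region.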
7.3. k-mer Features of miRNA Sequence
Triticum aestivum features were extracted using a hybrid CNN-RNN model. The experiments used tenfold cross-validation to ensure the accuracy and dependability of the results. The experimental dataset was divided into ten groups: nine for training and one for verification. The mean values over the 10 runs, each using a different group for verification, are reported as the final results. The primary and secondary features of the sequence are the key features retrieved in this experiment. The most prevalent primary feature is the k-mer. Each k-mer contains k nucleotides, each of which can be A, T, C, or G. The experiments extract sequence features for 3-mers (64 dimensions), 2-mers (16 dimensions), and 1-mers (4 dimensions). To match each k-mer, a sliding window of length k with a sliding step size of one is employed. The experiment also retrieved the sequence's gap features, namely the first gap feature (A*A, 64 dimensions) and the second gap feature (A**A, 256 dimensions).
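The sliding-window k-mer counting can be sketched as follows. Note that plain gapped-pair counting yields 16 dimensions for an A*A pattern, whereas the paper reports 64 and 256 dimensions, so the authors' exact gap encoding evidently differs; this sketch only illustrates the windowing idea.

```python
from itertools import product

def kmer_features(seq, ks=(1, 2, 3), alphabet="ATCG"):
    """Normalised k-mer frequencies (4 + 16 + 64 = 84 dimensions for
    k = 1, 2, 3), using a sliding window of length k and step size 1."""
    feats = []
    for k in ks:
        counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
        windows = max(len(seq) - k + 1, 1)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        feats.extend(counts[m] / windows for m in sorted(counts))
    return feats

def gap_features(seq, gap=1, alphabet="ATCG"):
    """Frequencies of gapped pairs such as A*A (gap=1) or A**A (gap=2):
    two fixed nucleotides separated by `gap` wildcard positions."""
    counts = {a + b: 0 for a in alphabet for b in alphabet}
    windows = max(len(seq) - gap - 1, 1)
    for i in range(len(seq) - gap - 1):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / windows for p in sorted(counts)]
```

Concatenating `kmer_features` with the two gap-feature vectors gives the primary part of the final feature vector.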
Secondary structural features determine the primary functions of RNA molecules. According to studies, the more stable an RNA sequence's structure, the more free energy is released during folding into a secondary structure; and the more stable the secondary structure, the more complementary base pairs it forms, with higher C and G content. This experiment extracted the sequences' base complementary pairing rate (E1), C and G content (E2), and normalized minimum free energy (DM). The ViennaRNA [51] toolbox was used to obtain the dot-bracket form of the secondary structure, as well as the minimum free energy produced by forming this secondary structure. These features are characterized as follows:

E1 = Np / L,  E2 = fC + fG,  DM = MFE / L,

where Np is the maximum number of base pairs that can be paired in the sequence, L is the length of the sequence, fC and fG are the occurrence frequencies of C and G, and MFE is the minimal free energy of the sequence.
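Given a dot-bracket structure string and an MFE value (e.g. as returned by ViennaRNA's `RNA.fold`), the three structural features can be computed as below. This is a sketch of the definitions above, not the authors' code; the inputs are passed in directly so the example does not require ViennaRNA.

```python
def structure_features(seq, dot_bracket, mfe):
    """Compute E1 (base-pairing rate), E2 (C+G content) and DM
    (length-normalised minimum free energy) from a sequence, its
    dot-bracket secondary structure, and its MFE in kcal/mol."""
    L = len(seq)
    # Every '(' and ')' marks one paired base in dot-bracket notation
    n_paired = dot_bracket.count("(") + dot_bracket.count(")")
    e1 = n_paired / L                           # base complementary pairing rate
    e2 = (seq.count("C") + seq.count("G")) / L  # C and G frequency
    dm = mfe / L                                # normalised minimum free energy
    return e1, e2, dm
```

With the ViennaRNA Python bindings, the inputs would come from `structure, mfe = RNA.fold(seq)`.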
In total, 485 dimensions were derived, covering both primary and secondary structural features. The 485-dimensional feature vectors were created by fusing these features, and the feature vectors were concatenated into vector sets for model training and testing.
Table 3 shows the complete feature information.
The experiments also used tenfold cross-validation, with 90% of the data used for training and 10% for testing. On the Triticum aestivum dataset, CNN-RNN is first compared with shallow machine learning approaches, including traditional algorithms such as random forest, k-nearest neighbor (k-NN), and support vector machine (SVM). Although deep learning extracts features automatically, important features may be lost in the process, yielding a generic rather than optimal representation. As a result, deep learning approaches may not always outperform shallow machine learning models.
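The tenfold comparison against the shallow baselines can be reproduced with scikit-learn's `cross_val_score`. The synthetic data below is only a stand-in for the 485-dimensional Triticum aestivum feature vectors, and the hyperparameters are illustrative defaults, not those of the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in for the real feature vectors (485 dims in the paper)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
}
for name, model in models.items():
    # 10-fold cross-validation: 9 folds train, 1 fold verifies, averaged
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping in the real feature matrix and adding the CNN-RNN predictions would reproduce the comparison in Table 4.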
The proposed model was compared to shallow machine learning models and other deep learning models to verify its performance.
Table 4 and
Figure 5 demonstrate the experimental results of our suggested model and the shallow machine learning models.
Table 4 shows that our proposed model achieves greater than 96% on all four evaluation metrics, clearly higher than the other models, demonstrating that it outperforms the shallow machine learning approaches in the classification of miRNA–lncRNA interactions.
The proposed model was compared to various deep learning models such as LSTM, IndRNN, CNN, and CNN+LSTM, and shallow machine learning methods. Each model was trained and tested using six sets of data and tenfold cross-validation; accuracy was utilized as the assessment criterion. The Triticum aestivum dataset is divided into six groups, with maximum sequence lengths of 3000 nt, 2500 nt, 2000 nt, 1500 nt, 1000 nt, and 500 nt for each group.
Figure 6 depicts the data distribution.
Table 5 shows the categorization findings.
Table 5 shows that LSTM accuracy decreases dramatically as sequence length increases, whereas CNN+LSTM accuracy decreases only marginally. Only the accuracy of the proposed model and CNN remained essentially unchanged, and our proposed method's accuracy is substantially greater than CNN's. The findings suggest that our proposed method outperforms previous deep learning models in miRNA–lncRNA interaction prediction accuracy, particularly when the sequences are long. To test our model's performance further, we examined the loss convergence rates of the models at a sequence length of 3000 nt. The loss convergence over 20 iterations is compared in
Figure 7. In terms of both convergence rate and degree of convergence, our suggested strategy outperforms existing deep learning models.
In recent years, much work has been devoted to developing computational approaches for finding associations in diverse biological datasets. Many researchers have used shallow machine learning methods to construct prediction models through feature selection for miRNA–lncRNA interaction prediction, but problems such as limited training data, large noise, and human factors result in low reliability of the prediction results. A comparative analysis with state-of-the-art models showed that the proposed model has better performance, with an accuracy of 97.7%, greater than the models described in [5,10,17,19,21,22], as shown in Table 6.
As shown in Table 7, the proposed model was also compared with another advanced model, XGBoost, to prove its effectiveness. We applied the same dataset used in this work to the XGBoost model for the comparison. The proposed model was slightly better than XGBoost in terms of accuracy, F1-score, recall, specificity, and precision.