Article

Mus4mCPred: Accurate Identification of DNA N4-Methylcytosine Sites in Mouse Genome Using Multi-View Feature Learning and Deep Hybrid Network

by Xiao Wang 1,2,*, Qian Du 1 and Rong Wang 3
1 School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
2 Henan Provincial Key Laboratory of Data Intelligence for Food Safety, Zhengzhou University of Light Industry, Zhengzhou 450002, China
3 School of Electronic Information, Zhengzhou University of Light Industry, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Processes 2024, 12(6), 1129; https://doi.org/10.3390/pr12061129
Submission received: 24 April 2024 / Revised: 21 May 2024 / Accepted: 28 May 2024 / Published: 30 May 2024

Abstract

N4-methylcytosine (4mC) is a critical epigenetic modification that plays a pivotal role in the regulation of a multitude of biological processes, including gene expression, DNA replication, and cellular differentiation. Traditional experimental methods for detecting DNA N4-methylcytosine sites are time-consuming, labor-intensive, and costly, making them unsuitable for large-scale or high-throughput research. Computational methods for identifying DNA N4-methylcytosine sites enable the rapid and cost-effective analysis of DNA 4mC sites across entire genomes. In this study, we focus on the identification of DNA 4mC sites in the mouse genome. Although several computational methods can already predict DNA 4mC sites in the mouse genome, there remains significant room for improvement because these methods cannot fully capture the multifaceted characteristics of DNA sequences. To address this issue, we propose a new deep learning predictor called Mus4mCPred, which utilizes multi-view feature learning and deep hybrid networks to accurately predict DNA 4mC sites in the mouse genome. Mus4mCPred first employs different encoding methods to extract feature vectors from DNA sequences, then inputs the features generated by each encoding method into a dedicated hybrid deep learning model to learn more sophisticated representations, and finally fuses the extracted multi-view features to serve as the final features for DNA 4mC site prediction in the mouse genome. Multi-view features enable a more comprehensive capture of data characteristics, enhancing the feature representation of DNA sequences. On the independent test set, the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews’ correlation coefficient (MCC) were 0.7688, 0.9375, 0.8531, and 0.7165, respectively. Mus4mCPred outperformed other state-of-the-art methods, achieving the accurate identification of 4mC sites in the mouse genome.

1. Introduction

DNA methylation is a common epigenetic modification in which methyl groups (-CH3) are added to bases within the DNA molecule. This modification is typically catalyzed by DNA methyltransferases (DNMTs) [1]. Methylated regions of DNA are usually associated with gene silencing or the suppression of gene expression [2] and can affect the transcriptional activity of genes [3], thereby playing a key role in the processes of cellular differentiation and development [4,5]. In prokaryotic and eukaryotic genomes, three common methylation types have been identified: N4-methylcytosine (4mC) [6], 5-methylcytosine (5mC) [7], and N6-methyladenine (6mA) [8]. 5mC refers to cytosine methylated at the fifth carbon atom of the cytosine ring. It is the most common DNA methylation modification and is widely present in the genomes of eukaryotes [9], and its location in the genome can affect the structure of chromosomes. 6mA refers to the modification of adenine by a methyl group at the N6 position (the exocyclic nitrogen attached to the sixth carbon of the adenine ring). Similar to cytosine methylation, 6mA is an important form of DNA methylation, but it is less common in eukaryotic genomes and mainly exists in prokaryotes [10]; its presence is closely associated with biological processes such as DNA repair and tolerance to environmental stress. 4mC refers to cytosine modified by a methyl group (-CH3) at the N4 position (the exocyclic nitrogen attached to the fourth carbon of the cytosine ring). Studies have shown that N4-methylcytosine can affect the physical and chemical properties of DNA, thereby influencing DNA replication and repair processes, which are crucial for maintaining genome stability. This epigenetic modification can also affect gene expression levels by influencing transcription factor binding [11] and chromatin structure adjustments [12,13]. Relative to the other two modification types, 4mC has been studied far less. Therefore, the study of N4-methylcytosine provides new perspectives for scientific research and may offer targets for new therapeutic approaches.
Currently, several experimental methods are available for identifying 4mC sites in DNA. Methylation-specific polymerase chain reaction (PCR) [14] exploits differences in DNA methylation to detect methylation sites through PCR amplification. Mass spectrometry [15] detects methylation by analyzing precise mass changes in DNA fragments. Whole-genome bisulfite sequencing [16] uses bisulfite to convert unmethylated cytosine while methylated cytosine remains unaffected; the methylation sites in the DNA are then identified by sequencing. Single-molecule real-time (SMRT) sequencing [17] detects methylation sites by observing the activity of DNA polymerase during DNA synthesis. However, these experimental methods for detecting 4mC sites have drawbacks, such as being time-consuming and costly [18]. With the advancement of machine learning and deep learning, several computational methods have been developed for predicting 4mC sites. Deep learning methods can handle large-scale genomic data and support end-to-end learning, directly extracting features from raw DNA sequences and classifying them without complex manual feature engineering. This simplifies the data processing workflow and improves efficiency. Through deep learning, researchers can gain a deeper understanding of the complexity of epigenetic modifications, providing a powerful tool for genomics and biomedical research. 4mCCNN [19] utilizes one-hot encoding and two one-dimensional convolutions for classifying 4mC sites. One-hot encoding represents each base in the DNA sequence as an independent feature vector, and one-dimensional convolution learns local feature representations of these vectors, thereby better identifying modifications within the sequence. 4mCPred-SVM [20] integrates four sequence features and combines them with an SVM classifier to train an optimal prediction model. DNA sequences are integrated by these four coding methods to obtain vector features; the SVM maps the vector features into a high-dimensional space to find an optimal hyperplane that separates data points of different categories. Deep4mC [21] evaluates 12 feature encodings with eight different classifiers: binary, ENAC, EIIP, and NCP encodings are used as inputs; two one-dimensional convolution layers are used for feature extraction; and an attention layer captures key features. These key features are finally input into a logistic regression (LR) classifier to obtain an output score representing the probability of a 4mC site. All of the above methods have been studied in six species: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Geobacter pickeringii, and Geoalkalibacter subterraneus; comparatively few studies have been conducted in mice [22,23,24]. Research on mice has emerged only slowly in recent years. Mice are commonly used to model human diseases and to study disease mechanisms and drug screening; investigating 4mC sites in mouse DNA may therefore help discover and understand epigenetic changes related to human diseases, providing new targets and strategies for the treatment and prevention of related diseases.
4mCpred-EL [22] was the first method developed for identifying 4mC sites in the mouse genome; it utilizes four machine learning algorithms and seven feature encodings to generate probability features, which are then used for prediction through ensemble classifiers. i4mC-Mouse [23] transforms sequences into feature vectors using six different encodings and classifies them with an RF classifier. These two methods are based on traditional machine learning, which tends to have weaker learning capabilities and complex feature extraction processes. In contrast, 4mCPred-CNN [24] and Mouse4mC-BGRU [25] are based on deep learning. 4mCPred-CNN utilizes one-hot encoding and nucleotide composition profiles for feature extraction, employing convolutional neural networks (CNNs) to learn more abstract features. Mouse4mC-BGRU employs k-mer tokenization for encoding and inputs the features into a bidirectional gated recurrent unit (GRU) to automatically extract both long-term and short-term dependencies within DNA sequences, thereby learning contextual information. Recently, a new method called MultiScale-CNN-4mCPred [26] has emerged, which combines convolutional neural networks with different kernel sizes and long short-term memory (LSTM) to capture features of different scales and contextual information for predicting 4mC sites in the mouse genome, thus improving prediction accuracy.
However, most of the above methods perform early feature fusion during the encoding stage, and integrating all features into the same encoding space may overlook the differences between features, leading to feature conflicts or information loss. To address this problem, we propose Mus4mCPred, which employs multi-view [27] feature learning. It inputs different encoded features into separate neural networks to extract multi-view features and integrates these features to better represent DNA sequences. Each neural network can be optimized for a specific type of feature, improving the effectiveness of feature extraction. Mus4mCPred comprises adaptive embedding, residual convolutional neural networks, and bidirectional LSTM networks. The embedding layer effectively maps discrete features to dense vector representations, allowing the neural networks to better learn semantic information between features. CNNs can efficiently extract local features to capture spatial or temporal local structures of the data, with translational invariance and local connectivity. Bidirectional LSTM can effectively capture long-term dependencies in sequential data through its gating mechanisms, thereby providing a better understanding of the contextual information in the sequence data. The incorporation of residual structures enables the model to capture the complex features within DNA sequences more effectively, enhancing the network’s representational ability.

2. Materials and Methods

2.1. Datasets

For the fairness and credibility of the experiment, the dataset used in this study is the same as that used by 4mCpred-EL [22], i4mC-Mouse [23], Mouse4mC-BGRU [25], and MultiScale-CNN-4mCPred [26]. These datasets were constructed from the MethSMRT database [28], a database dedicated to methylation data that contains genomic methylation information from a variety of biological samples. CD-HIT [29] is a tool for removing sequence redundancy; in this study, the CD-HIT threshold was set to 0.7, filtering out sequences with similarity exceeding 70%. This approach reduces computational complexity and balances the classification results [30]. The final training set consists of 746 positive samples and 746 negative samples, while the test set comprises 160 positive samples and 160 negative samples, with all sequences being 41 bp in length. The data distribution is shown in Figure 1.

2.2. The Architecture of Mus4mCPred

In this study, we propose a new deep learning prediction method called Mus4mCPred, as illustrated in Figure 2. Mus4mCPred consists of three parts. (1) Feature encoding. This study utilizes several feature encodings, covering genomic sequence information and physicochemical property information. The Word2vec model was pre-trained on DNA sequences; during training, the model learned a vector representation of each nucleotide, and after training, each DNA sequence was transformed into a fixed-length feature vector of length 164. Token encoding converts each DNA sequence into a discrete feature vector of length 40. Character encoding and EIIP each transform a DNA sequence into a feature vector of length 41, and fusing these two types of features yields a feature vector of length 82. (2) Multi-view learning. To better represent DNA sequences, we input the multiple feature vectors into a hybrid neural network to extract multi-view features. The features extracted by Word2vec are input into a two-layer bidirectional long short-term memory (BiLSTM) network. BiLSTM processes both the forward and backward information of the sequence, obtaining more comprehensive contextual information. The discrete features produced by token encoding are input into an embedding layer and mapped into a continuous vector space. Local features are then extracted using two layers of one-dimensional convolution, which are subsequently fed into two layers of BiLSTM to further extract contextual semantic information, thereby achieving a better representation of the sequence. To avoid the issue of vanishing gradients, residual connections are incorporated between the two one-dimensional convolutional layers. The fused features from character encoding and EIIP are input into an embedding layer and two convolutional layers to extract dense and local features. To prevent overfitting, dropout [31] and batch normalization are flexibly added to each module; randomly discarding some features also helps to accelerate the convergence of the network. Because different encoding methods are paired with distinct neural networks for multi-view learning, for convenience in the subsequent description we name these three combinations the Word2vec-BiLSTM module, the token encoding-BiLSTM-CNN module, and the character encoding-EIIP-CNN module. (3) Prediction module. We fuse the extracted multi-view features and input them into a fully connected layer for classification.
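To make the three-branch design concrete, the following PyTorch sketch wires the branches together using the sizes stated in the text (41 bp inputs, Word2vec dimension 4, a 16-dinucleotide vocabulary, BiLSTM hidden size 12, kernel size 3, and a fused feature vector of 3816 values). Channel counts, the indexing of the fused character/EIIP branch, and the classifier head are our assumptions rather than the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class Mus4mCPredSketch(nn.Module):
    """Illustrative three-branch, multi-view network (not the authors' exact code)."""

    def __init__(self):
        super().__init__()
        # View 1: Word2vec vectors (41 x 4) -> two-layer BiLSTM (hidden size 12).
        self.bilstm_w2v = nn.LSTM(input_size=4, hidden_size=12, num_layers=2,
                                  batch_first=True, bidirectional=True)
        # View 2: 40 dinucleotide token indices -> embedding -> residual 1D CNN -> BiLSTM.
        self.tok_embed = nn.Embedding(num_embeddings=16, embedding_dim=16)
        self.tok_conv1 = nn.Conv1d(16, 16, kernel_size=3, padding=1)
        self.tok_conv2 = nn.Conv1d(16, 16, kernel_size=3, padding=1)
        self.bilstm_tok = nn.LSTM(input_size=16, hidden_size=12, num_layers=2,
                                  batch_first=True, bidirectional=True)
        # View 3: fused character + EIIP sequence of length 82 -> embedding -> 2 x Conv1d.
        # Assumption: the four EIIP values are mapped to indices 4-7 so that the fused
        # vector is integer-valued and can be looked up in an 8-entry embedding table.
        self.chr_embed = nn.Embedding(num_embeddings=8, embedding_dim=24)
        self.chr_conv1 = nn.Conv1d(24, 24, kernel_size=3)   # no padding, as in the text
        self.chr_conv2 = nn.Conv1d(24, 24, kernel_size=3)
        # Fused features (984 + 960 + 1872 = 3816 values) -> 160 -> 1 with sigmoid.
        self.classifier = nn.Sequential(nn.LazyLinear(160), nn.ReLU(), nn.Dropout(0.5),
                                        nn.Linear(160, 1), nn.Sigmoid())

    def forward(self, x_w2v, x_tok, x_chr):
        v1, _ = self.bilstm_w2v(x_w2v)                     # (B, 41, 24)
        t = self.tok_embed(x_tok).transpose(1, 2)          # (B, 16, 40)
        t = torch.relu(self.tok_conv1(t))
        t = torch.relu(self.tok_conv2(t)) + t              # residual connection
        v2, _ = self.bilstm_tok(t.transpose(1, 2))         # (B, 40, 24)
        c = self.chr_embed(x_chr).transpose(1, 2)          # (B, 24, 82)
        v3 = torch.relu(self.chr_conv2(torch.relu(self.chr_conv1(c))))  # (B, 24, 78)
        fused = torch.cat([v1.flatten(1), v2.flatten(1), v3.flatten(1)], dim=1)
        return self.classifier(fused)                      # (B, 1) probability of a 4mC site
```

Note that flattening the three views gives 984 + 960 + 1872 = 3816 values, which matches the size of the first fully connected layer described in Section 2.8.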

2.3. Feature Encoding

Character Encoding: In neural network classification tasks, converting DNA sequences into a numerical format that neural networks can understand and process is crucial. Character encoding [26] maps the four DNA bases ‘A’, ‘G’, ‘C’, and ‘T’ to 0, 1, 2, and 3, respectively. For example, the sequence ‘ACGTACCT’ is encoded as (0, 2, 1, 3, 0, 2, 2, 3). This encoding method facilitates the mapping of vector features in the embedding layer while also saving storage resources. The formula is expressed as follows:
$$C(m) = \begin{cases} 0, & \text{if } m \text{ is A} \\ 1, & \text{if } m \text{ is G} \\ 2, & \text{if } m \text{ is C} \\ 3, & \text{if } m \text{ is T} \end{cases} \tag{1}$$
where m represents the type of nucleotide.
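For illustration, a small Python helper implementing this mapping might look as follows (the function name and structure are ours, not part of the published tool):

```python
# A hypothetical helper implementing the character-encoding mapping above.
CHAR_MAP = {'A': 0, 'G': 1, 'C': 2, 'T': 3}

def char_encode(seq: str) -> list[int]:
    """Map each nucleotide of a DNA sequence to its integer code."""
    return [CHAR_MAP[base] for base in seq]

print(char_encode("ACGTACCT"))  # [0, 2, 1, 3, 0, 2, 2, 3]
```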
Token Encoding: First, the entire sequence undergoes word segmentation, where a sliding window moves sequentially along the sequence, with every two nucleotides forming a group. Unlike k-mer encoding [32], we did not encode based on the frequency or proportion of nucleotide occurrence [33,34], but instead used the index corresponding to each group of nucleotides, as shown in Table 1. In this way, a sequence of 41 base pairs is transformed into a vector of dimension 40 [35]. For example, the sequence ‘ACGTACCT’ is tokenized into seven 2-mer sequences: ‘AC’, ‘CG’, ‘GT’, ‘TA’, ‘AC’, ‘CC’, and ‘CT’, so the sequence is encoded as (2, 9, 7, 12, 2, 10, 11). A minimal sketch of this tokenization, assuming the index order of Table 1, is shown below.
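```python
from itertools import product

# Dinucleotide index table in the order of Table 1 (AA=0, AG=1, ..., TT=15);
# the helper name is ours.
TOKEN_INDEX = {a + b: i for i, (a, b) in enumerate(product("AGCT", repeat=2))}

def token_encode(seq: str) -> list[int]:
    """Slide a window of width 2 (stride 1) and map each dinucleotide to its index."""
    return [TOKEN_INDEX[seq[i:i + 2]] for i in range(len(seq) - 1)]

print(token_encode("ACGTACCT"))  # [2, 9, 7, 12, 2, 10, 11]
```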
EIIP: EIIP (electron–ion interaction potential) [36] encodes the bases according to the ionization potential of each atom in the nucleotides, with each nucleotide being assigned a value corresponding to its electron or ion interaction potential [37]. This representation captures the physical and chemical properties of the sequence and can effectively express the sequence information. The EIIP values for the nucleotides ‘A’, ‘G’, ‘C’, and ‘T’ are 0.1260, 0.0806, 0.1340, and 0.1335, respectively. For example, the sequence ‘ACGTACCT’ is encoded as (0.1260, 0.1340, 0.0806, 0.1335, 0.1260, 0.1340, 0.1340, 0.1335). The formula is expressed as follows:
$$E(n) = \begin{cases} 0.1260, & \text{if } n \text{ is A} \\ 0.0806, & \text{if } n \text{ is G} \\ 0.1340, & \text{if } n \text{ is C} \\ 0.1335, & \text{if } n \text{ is T} \end{cases} \tag{2}$$
where n represents the type of nucleotide.
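A minimal sketch of EIIP encoding with the values given above (the helper name is ours):

```python
# EIIP lookup table with the nucleotide values stated in the text.
EIIP = {'A': 0.1260, 'G': 0.0806, 'C': 0.1340, 'T': 0.1335}

def eiip_encode(seq: str) -> list[float]:
    """Replace each nucleotide with its electron-ion interaction potential."""
    return [EIIP[base] for base in seq]

print(eiip_encode("ACGTACCT"))
# [0.126, 0.134, 0.0806, 0.1335, 0.126, 0.134, 0.134, 0.1335]
```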
Word2vec: Word2vec [38] is a method for converting words into vector representations and is widely applied in natural language processing [39]. It generates word vectors by learning the distribution of words in context, so that words with similar meanings are closer in the vector space. Since DNA sequences are composed of different nucleotides, the CBOW (continuous bag of words) [40] method in Word2vec is adopted to focus on the contextual information of nucleotides in DNA sequences, thereby enhancing the feature representation capability of biological sequences. The CBOW model encodes DNA sequences into one-hot vectors, and training samples are generated from the context window of each sequence. The model continuously adjusts the nucleotide vector representations during training, so that each nucleotide is represented as a fixed-length vector of dimension D. Finally, each 41 bp sequence is transformed into a fixed-length vector of size 41 × D. In this study, the Word2vec vector dimension D is set to 4, the sliding window to 1, and the number of training iterations to 256.
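As a rough illustration, this Word2vec step could be reproduced with gensim as sketched below. Treating each 41 bp sequence as a sentence of single-nucleotide tokens is our assumption, and the two toy sequences are placeholders for the real training data:

```python
import numpy as np
from gensim.models import Word2Vec

# Settings follow the text: CBOW (sg=0), vector dimension 4, window 1, 256 epochs.
toy_sequences = ["ACGTACCTGATTACAGCGTACGTTAGCATCGATCGGTACAA",
                 "TTGACGTACCATGCATGCGTACGATCAGTACGTTAGCATGC"]
sentences = [list(seq) for seq in toy_sequences]   # one single-nucleotide token per base

model = Word2Vec(sentences, vector_size=4, window=1, min_count=1, sg=0, epochs=256)

# Each 41 bp sequence becomes a 41 x 4 matrix, flattened into a length-164 vector.
vec = np.concatenate([model.wv[base] for base in toy_sequences[0]])
print(vec.shape)  # (164,)
```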

2.4. Embedding Layer

The embedding layer is typically used to store word embeddings and retrieve them using indices; it takes a list of indices as input and outputs the corresponding word embeddings. These embeddings can capture semantic relationships between sequences and are commonly used as inputs for natural language processing tasks. The role of the embedding layer is to map discrete inputs to dense, low-dimensional vector representations [41]; these vectors serve as inputs to the model and are learned during training, enabling the model to better understand the semantic information of the input data. Unlike a fully connected layer, the embedding layer accepts discrete indices rather than vectors as input, avoiding matrix multiplication and thereby achieving higher efficiency.
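For example, in PyTorch an embedding lookup over the token indices from Section 2.3 might look like this (the 16-entry table and dimension 16 follow Table 2; the short input is illustrative):

```python
import torch
import torch.nn as nn

# Discrete dinucleotide indices are mapped to dense 16-dimensional vectors.
embed = nn.Embedding(num_embeddings=16, embedding_dim=16)
tokens = torch.tensor([[2, 9, 7, 12, 2, 10, 11]])   # the 'ACGTACCT' example tokens
dense = embed(tokens)
print(dense.shape)  # torch.Size([1, 7, 16])
```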

2.5. One-Dimensional Convolutional Neural Network

A convolutional neural network (CNN) is a type of multilayer neural network that was initially applied in image processing and later widely used in natural language processing [42,43]. A typical CNN comprises convolutional layers, pooling layers, and fully connected layers [44]. The convolutional layers are responsible for extracting features from the input data, while the pooling layers perform dimensionality reduction and downsampling of features, thereby enhancing the efficiency and generalization ability of the model. One-dimensional convolutional neural networks (1D-CNNs) are CNNs designed for processing sequential data such as text. In bioinformatics, 1D-CNNs are commonly employed for tasks such as processing DNA sequence data, predicting protein structures, and genome analysis. The one-dimensional residual convolutional neural network [45] used in this study combines residual connections with a one-dimensional convolutional neural network to prevent the gradient from vanishing during training.
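A minimal residual 1D convolution block of the kind described here might be sketched as follows; the channel count and the exact placement of batch normalization are assumptions:

```python
import torch
import torch.nn as nn

class ResidualConv1d(nn.Module):
    """Minimal residual 1D convolution block (channel count and BN placement assumed)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2   # keep the sequence length so the skip can be added
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):
        out = torch.relu(self.conv1(x))
        out = self.bn(self.conv2(out))
        return torch.relu(out + x)   # the skip connection keeps the gradient path short

x = torch.randn(8, 16, 40)              # (batch, channels, sequence length)
print(ResidualConv1d(16)(x).shape)      # torch.Size([8, 16, 40])
```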

2.6. Bidirectional Long Short-Term Memory

Recurrent neural networks (RNNs) [46] are simple recurrent structures: they take the current input vector and the hidden state from the previous time step as input and output the current time step’s hidden state and prediction. This enables an RNN to capture the temporal dependencies within sequences [47]. However, due to vanishing or exploding gradients, RNNs struggle to handle long-term dependencies effectively [48]. Long short-term memory (LSTM) networks [49] possess a more complex internal architecture that includes an input gate, a forget gate, and an output gate. These components enable LSTM networks to capture and maintain long-term dependencies more effectively, improving their performance on long sequence data. Bidirectional long short-term memory (BiLSTM) is an extension of LSTM that gathers information from both directions (forward and backward) of a sequence, thus comprehensively capturing the contextual information within the sequence and helping the model better understand the feature information.
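A short PyTorch illustration of a two-layer BiLSTM over a feature sequence (the hidden size of 12 follows the text; the input dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Two stacked BiLSTM layers; forward and backward states are concatenated per time step.
bilstm = nn.LSTM(input_size=16, hidden_size=12, num_layers=2,
                 batch_first=True, bidirectional=True)
x = torch.randn(8, 40, 16)      # (batch, time steps, features)
out, (h_n, c_n) = bilstm(x)
print(out.shape)                # torch.Size([8, 40, 24])
```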

2.7. Performance Evaluation Metrics

To evaluate the performance of the model, we adopted four statistical metrics: sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews’ correlation coefficient (MCC), as defined in Equations (3)–(6):
$$\mathrm{Sn} = \frac{TP}{TP+FN} \tag{3}$$
$$\mathrm{Sp} = \frac{TN}{TN+FP} \tag{4}$$
$$\mathrm{Acc} = \frac{TP+TN}{TP+TN+FP+FN} \tag{5}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{6}$$
where TP and TN represent the number of correctly predicted 4mC and non-4mC samples, respectively. FP and FN represent the number of incorrectly predicted 4mC and non-4mC samples, respectively. Acc and MCC are both metrics used to evaluate the performance of a classification model, but they focus on slightly different aspects. Acc concerns the overall correctness of the classifier’s predictions, while MCC takes into account the classifier’s performance across different classes and the imbalance between the classes.
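These four metrics can be computed directly from the confusion-matrix counts. The sketch below uses counts consistent with the independent-test results reported in Section 3.2 (160 positive and 160 negative test samples):

```python
import math

def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Sn, Sp, Acc, and MCC from confusion-matrix counts (Equations (3)-(6))."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

print(evaluate(tp=123, tn=150, fp=10, fn=37))
# {'Sn': 0.76875, 'Sp': 0.9375, 'Acc': 0.853125, 'MCC': 0.7165...}
```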

2.8. Hyperparameter Optimization

The selection of hyperparameters is crucial for the performance and generalization ability of the model; appropriate hyperparameters can enhance the model’s performance and accelerate its convergence. In this study, grid search was employed to optimize hyperparameters, including the embedding layer dimensionality, the number of hidden units in the BiLSTM layers, and the kernel sizes of the convolutional layers. After tuning, for the Word2vec-BiLSTM module, both BiLSTM layers had 12 hidden units. For the token encoding-BiLSTM-CNN module, the kernel sizes of both convolutional layers were set to 3 with a padding of 1, and both BiLSTM layers had 12 hidden units. For the character encoding-EIIP-CNN module, the kernel sizes of both convolutional layers were 3 without padding. The first fully connected layer consists of 3816 nodes and uses the ReLU activation function for non-linear transformation. The second layer comprises 160 nodes, and finally, the sigmoid activation function maps the output to probabilities.
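A grid search of this kind can be sketched as an exhaustive loop over candidate values such as those in Table 2; `train_and_validate` below is a hypothetical stand-in for one training-and-validation run:

```python
import random
from itertools import product

def train_and_validate(cfg: dict) -> float:
    """Hypothetical helper: train with the given configuration, return validation MCC."""
    return random.random()  # placeholder score for illustration only

grid = {
    "embedding_dim": [16, 20, 24],
    "hidden_units":  [6, 12, 24],
    "kernel_size":   [1, 3, 5, 7],
}

best_score, best_cfg = float("-inf"), None
for values in product(*grid.values()):
    cfg = dict(zip(grid, values))
    score = train_and_validate(cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg
print(best_cfg, best_score)
```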
To assess the stability of the model, ten-fold cross-validation was used for model training. The batch size was set to 128, the learning rate to 0.0001, the dropout rate to 0.5, and the weight decay to 5 × 10⁻⁵. Except for the sigmoid activation used for classification, ReLU was used for all other activations. The loss function was binary cross-entropy, and the Adam [50] optimization algorithm was used to optimize the model parameters. Table 2 lists the hyperparameter settings of the model.
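The training configuration described above might be set up in PyTorch roughly as follows; `Mus4mCPredSketch` refers to the illustrative model sketch in Section 2.2, and data loading is omitted:

```python
import torch
import torch.nn as nn

# Settings follow the text: learning rate 1e-4, weight decay 5e-5, BCE loss, Adam.
model = Mus4mCPredSketch()
criterion = nn.BCELoss()                 # binary cross-entropy on sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)

def train_step(x_w2v, x_tok, x_chr, labels):
    """One optimization step on a mini-batch of the three encoded views."""
    optimizer.zero_grad()
    probs = model(x_w2v, x_tok, x_chr).squeeze(1)   # probabilities in (0, 1)
    loss = criterion(probs, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```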

3. Results and Discussion

3.1. Ablation Experiment

Mus4mCPred consists of three modules, each of which adopts a different encoding method and neural network architecture for feature extraction. In this section, we conduct ablation experiments on each of the three modules, comparing the results obtained from networks with and without certain structures. For the Word2vec-BiLSTM module, we compare the performance of BiLSTM with different numbers of layers. As shown in Table 3, the performance is best with two BiLSTM layers, with an MCC improvement of 2.16% and 3.80% compared to one and three BiLSTM layers, respectively. For the token encoding-BiLSTM-CNN module, we compare the performance of models with and without residual connections. As shown in Table 4, the MCC with residual connections is 6.57% higher than without, indicating that residual connections enhance the network’s expressive power in deep neural networks. For the character encoding-EIIP-CNN module, we examined the impact of different numbers of CNN layers on model performance. As shown in Table 5, two CNN layers yield the best performance.
Furthermore, to demonstrate the importance of each module, we conducted ablation experiments on the entire model. We tested single modules, pairwise combinations of modules, and the final model, as depicted in Figure 3; in total, seven different scenarios were evaluated. In the single-module tests, the token encoding-BiLSTM-CNN module performed best; the average performance of the pairwise combinations exceeded that of the single modules; and the fusion of all three modules performed best among the seven scenarios, indicating the effectiveness of our module fusion.

3.2. Comparison with Other Predictors

To evaluate the performance of the Mus4mCPred method, we compared it with 4mCpred-EL [22], i4mC-Mouse [23], Mouse4mC-BGRU [25], and MultiScale-CNN-4mCPred [26]. To ensure fairness, we used the same training and testing sets as these four methods and employed ten-fold cross-validation, dividing the dataset into ten equally sized subsets, with nine subsets used for training and one for testing in each iteration, repeated ten times. This approach allows for a more accurate evaluation of the model’s performance when sample sizes are limited. As shown in Table 6, Mus4mCPred achieved promising results in cross-validation. The results on the independent test set are presented in Table 7, where the four evaluation metrics, Sn, Sp, Acc, and MCC, were 0.7688, 0.9375, 0.8531, and 0.7165, respectively. Compared to the state-of-the-art method MultiScale-CNN-4mCPred, Mus4mCPred showed improvements in all metrics except Sn; specifically, Sp increased by 10%, Acc by 0.62%, and MCC by 2.26%. The experimental analysis demonstrates that Mus4mCPred outperforms existing methods and can effectively predict 4mC sites in DNA. The confusion matrix illustrates the performance of a classification model during prediction, providing a visual representation of how well the model performs on each class, including the counts of correct and incorrect predictions. The confusion matrix of Mus4mCPred is shown in Figure 4.

3.3. Generalization Ability

To comprehensively evaluate the performance of Mus4mCPred and understand the advantages and limitations of the model in terms of generalization ability, we conducted tests on datasets from other species. The dataset includes six species: A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus, with the sample counts for the training and independent test sets shown in Table 8. We compared Mus4mCPred with the state-of-the-art predictors 4mCPred [37], DeepTorrent [51], and MultiScale-CNN-4mCPred [26] on these six species. As shown in Figure 5, Mus4mCPred achieved better results on the first three species, with both Acc and MCC higher than those of the comparison methods. The weaker results on the latter three species may be due to their smaller sample sizes: when the sample size is limited, complex deep learning models may memorize the details and noise of the training data, failing to generalize to unseen data. In addition, the predictive capability of computational models is limited by the available data. If changes in environmental conditions cause significant alterations in the properties of biomolecules, such as the emergence of new sites that have not appeared in the training set, the model may lack the necessary information to identify and predict these sites.

4. Conclusions

In this study, we proposed Mus4mCPred, a method that can effectively identify 4mC sites in mouse DNA. Mus4mCPred employs various feature encoding methods for DNA sequences to obtain different vector features. These vector features are then input into a hybrid neural network model for multi-view feature extraction, and the multi-view features are fused to obtain the final feature representation. Mus4mCPred comprises adaptive embedding, convolutional neural networks, and bidirectional long short-term memory, which fully consider the local features and contextual information of the data. The extracted multi-view features encompass various aspects of the data, enhancing the robustness of the model. Mus4mCPred outperforms existing methods on the benchmark and independent datasets in most metrics. To demonstrate the generalization ability of Mus4mCPred, we also conducted tests on six other species (A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus). Computational testing has shown that Mus4mCPred is not only effective at predicting 4mC sites in the mouse genome but is also capable of predicting 4mC sites in the genomes of other species, suggesting that Mus4mCPred may have cross-species applicability. Mice share a high degree of homology with humans at the genetic level and are commonly used in research on human diseases. The methods used to study modified sites in the mouse genome may therefore also be applicable to the study of the human genome, such as identifying transcription start sites [52], to provide strategies for treating and preventing human diseases. Because the performance of Mus4mCPred on small sample datasets is less than ideal, we plan to address this issue with transfer learning in future research: large models pre-trained on extensive datasets can be fine-tuned so that their weights better suit our specific data and task requirements.

Author Contributions

X.W.: validation, writing—review and editing, supervision. Q.D.: conceptualization, methodology, writing—original draft. R.W.: writing—review and editing, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by funds from Key Research Project of Colleges and Universities of Henan Province (No. 22A520013, No. 23B520004), the Key Science and Technology Development Program of Henan Province (No. 232102210020, No. 202102210144), and the Training Program of Young Backbone Teachers in Colleges and Universities of Henan Province (No. 2019GGJS132).

Data Availability Statement

The source codes and data for Mus4mCPred are available at https://github.com/meloaedy/Mus4mCPred (accessed on 23 April 2024).

Conflicts of Interest

The authors declare there are no conflicts of interest.

References

  1. Wang, Y.; Sheng, Y.; Liu, Y.; Pan, B.; Huang, J.; Warren, A.; Gao, S. N6-methyladenine DNA modification in the unicellular eukaryotic organism Tetrahymena thermophila. Eur. J. Protistol. 2017, 58, 94–102. [Google Scholar] [CrossRef] [PubMed]
  2. Bestor, T.H. The DNA methyltransferases of mammals. Hum. Mol. Genet. 2000, 9, 2395–2402. [Google Scholar] [CrossRef] [PubMed]
  3. He, X.J.; Chen, T.; Zhu, J.K. Regulation and function of DNA methylation in plants and animals. Cell Res. 2011, 21, 442–465. [Google Scholar] [CrossRef]
  4. Moore, L.D.; Le, T.; Fan, G. DNA methylation and its basic function. Neuropsychopharmacology 2013, 38, 23–38. [Google Scholar] [CrossRef]
  5. Schübeler, D. Function and information content of DNA methylation. Nature 2015, 517, 321–326. [Google Scholar] [CrossRef] [PubMed]
  6. Chen, W.; Yang, H.; Feng, P.; Ding, H.; Lin, H. iDNA4mC: Identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 2017, 33, 3518–3523. [Google Scholar] [CrossRef]
  7. Ehrlich, M.; Wang, R.Y. 5-Methylcytosine in eukaryotic DNA. Science 1981, 212, 1350–1357. [Google Scholar] [CrossRef] [PubMed]
  8. Ye, P.; Luan, Y.; Chen, K.; Liu, Y.; Xiao, C.; Xie, Z. MethSMRT: An integrative database for DNA N6-methyladenine and N4-methylcytosine generated by single-molecular real-time sequencing. Nucleic Acids Res. 2017, 45, D85–D89. [Google Scholar] [CrossRef] [PubMed]
  9. Lyko, F. The DNA methyltransferase family: A versatile toolkit for epigenetic regulation. Nat. Rev. Genet. 2018, 19, 81–92. [Google Scholar] [CrossRef]
  10. Liu, J.; Zhu, Y.; Luo, G.Z.; Wang, X.; Yue, Y.; Wang, X.; Zong, X.; Chen, K.; Yin, H.; Fu, Y.; et al. Abundant DNA 6mA methylation during early embryogenesis of zebrafish and pig. Nat. Commun. 2016, 7, 13052. [Google Scholar] [CrossRef]
  11. Glickman, B.W.; Radman, M. Escherichia coli mutator mutants deficient in methylation-instructed DNA mismatch correction. Proc. Natl. Acad. Sci. USA 1980, 77, 1063–1067. [Google Scholar] [CrossRef]
  12. Sánchez-Romero, M.A.; Cota, I.; Casadesús, J. DNA methylation in bacteria: From the methyl group to the methylome. Curr. Opin. Microbiol. 2015, 25, 9–16. [Google Scholar] [CrossRef] [PubMed]
  13. Kumar, S.; Karmakar, B.C.; Nagarajan, D.; Mukhopadhyay, A.K.; Morgan, R.D.; Rao, D.N. N4-cytosine DNA methylation regulates transcription and pathogenesis in Helicobacter pylori. Nucleic Acids Res. 2018, 46, 3429–3445. [Google Scholar] [CrossRef] [PubMed]
  14. Feng, H.; Shao, W.; Du, L.; Qing, X.; Zhang, Z.; Liang, C.; Liu, D. Detection of SHOX2 DNA methylation by methylation-specific PCR in non-small cell lung cancer. Transl. Cancer Res. 2020, 9, 6070–6077. [Google Scholar] [CrossRef] [PubMed]
  15. Domon, B.; Aebersold, R. Mass spectrometry and protein analysis. Science 2006, 312, 212–217. [Google Scholar] [CrossRef]
  16. Doherty, R.; Couldrey, C. Exploring genome wide bisulfite sequencing for DNA methylation analysis in livestock: A technical assessment. Front. Genet. 2014, 5, 126. [Google Scholar] [CrossRef]
  17. Ardui, S.; Ameur, A.; Vermeesch, J.R.; Hestand, M.S. Single molecule real-time (SMRT) sequencing comes of age: Applications and utilities for medical diagnostics. Nucleic Acids Res. 2018, 46, 2159–2168. [Google Scholar] [CrossRef]
  18. Małysiak-Mrozek, B.; Baron, T.; Mrozek, D. Spark-IDPP: High-throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud. Clust. Comput. 2019, 22, 487–508. [Google Scholar] [CrossRef]
  19. Manavalan, B.; Hasan, M.M.; Basith, S.; Gosu, V.; Shin, T.H.; Lee, G. Empirical Comparison and Analysis of Web-Based DNA N4-Methylcytosine Site Prediction Tools. Mol. Ther. Nucleic Acids 2020, 22, 406–420. [Google Scholar] [CrossRef]
  20. Wei, L.; Luan, S.; Nagai, L.A.E.; Su, R.; Zou, Q. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics 2019, 35, 1326–1333. [Google Scholar] [CrossRef]
  21. Xu, H.; Jia, P.; Zhao, Z. Deep4mC: Systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning. Brief. Bioinform. 2021, 22, bbaa099. [Google Scholar] [CrossRef]
  22. Manavalan, B.; Basith, S.; Shin, T.H.; Lee, D.Y.; Wei, L.; Lee, G. 4mCpred-EL: An Ensemble Learning Framework for Identification of DNA N4-methylcytosine Sites in the Mouse Genome. Cells 2019, 8, 1332. [Google Scholar] [CrossRef] [PubMed]
  23. Hasan, M.M.; Manavalan, B.; Shoombuatong, W.; Khatun, M.S.; Kurata, H. i4mC-Mouse: Improved identification of DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes. Comput. Struct. Biotechnol. J. 2020, 18, 906–912. [Google Scholar] [CrossRef] [PubMed]
  24. Abbas, Z.; Tayara, H.; Chong, K.T. 4mCPred-CNN-Prediction of DNA N4-Methylcytosine in the Mouse Genome Using a Convolutional Neural Network. Genes 2021, 12, 296. [Google Scholar] [CrossRef] [PubMed]
  25. Jin, J.; Yu, Y.; Wei, L. Mouse4mC-BGRU: Deep learning for predicting DNA N4-methylcytosine sites in mouse genome. Methods 2022, 204, 258–262. [Google Scholar] [CrossRef]
  26. Zheng, P.; Zhang, G.; Liu, Y.; Huang, G. MultiScale-CNN-4mCPred: A multi-scale CNN and adaptive embedding-based method for mouse genome DNA N4-methylcytosine prediction. BMC Bioinform. 2023, 24, 21. [Google Scholar] [CrossRef] [PubMed]
  27. Li, Y.; Wu, F.X.; Ngom, A. A review on machine learning principles for multi-view biological data integration. Brief. Bioinform. 2018, 19, 325–340. [Google Scholar] [CrossRef] [PubMed]
  28. Leinonen, R.; Sugawara, H.; Shumway, M. The sequence read archive. Nucleic Acids Res. 2011, 39, D19–D21. [Google Scholar] [CrossRef] [PubMed]
  29. Fu, L.; Niu, B.; Zhu, Z.; Wu, S.; Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 2012, 28, 3150–3152. [Google Scholar] [CrossRef]
  30. Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659. [Google Scholar] [CrossRef]
  31. Baldi, P.; Sadowski, P.J. Understanding dropout. In Proceedings of the 26th International Conference on Neural Information Processing Systems—Volume 2, Lake Tahoe, NV, USA, 5–10 December 2013. [Google Scholar]
  32. Hasan, M.M.; Manavalan, B.; Khatun, M.S.; Kurata, H. i4mC-ROSE, a bioinformatics tool for the identification of DNA N4-methylcytosine sites in the Rosaceae genome. Int. J. Biol. Macromol. 2020, 157, 752–758. [Google Scholar] [CrossRef]
  33. Yang, H.; Yang, W.; Dao, F.Y.; Lv, H.; Ding, H.; Chen, W.; Lin, H. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief. Bioinform. 2020, 21, 1568–1580. [Google Scholar] [CrossRef] [PubMed]
  34. Zhou, Y.; Zeng, P.; Li, Y.H.; Zhang, Z.; Cui, Q. SRAMP: Prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res. 2016, 44, e91. [Google Scholar] [CrossRef] [PubMed]
  35. Nguyen-Vo, T.H.; Trinh, Q.H.; Nguyen, L.; Nguyen-Hoang, P.U.; Rahardja, S.; Nguyen, B.P. i4mC-GRU: Identifying DNA N4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput. Struct. Biotechnol. J. 2023, 21, 3045–3053. [Google Scholar] [CrossRef] [PubMed]
  36. Nair, A.S.; Sreenadhan, S.P. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 2006, 1, 197–202. [Google Scholar] [PubMed]
  37. He, W.; Jia, C.; Zou, Q. 4mCPred: Machine learning methods for DNA N4-methylcytosine sites prediction. Bioinformatics 2019, 35, 593–601. [Google Scholar] [CrossRef] [PubMed]
  38. Yilmaz, S.; Toklu, S. A deep learning analysis on question classification task using Word2vec representations. Neural Comput. Appl. 2020, 32, 2909–2928. [Google Scholar] [CrossRef]
  39. Bartusiak, R.; Augustyniak, Ł.; Kajdanowicz, T.; Kazienko, P.; Piasecki, M. WordNet2Vec: Corpora agnostic word vectorization method. Neurocomputing 2019, 326, 141–150. [Google Scholar] [CrossRef]
  40. Yang, S.; Yang, Z.; Yang, J. 4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies. Int. J. Biol. Macromol. 2023, 231, 123180. [Google Scholar] [CrossRef]
  41. Inglesfield, J.E. A method of embedding. J. Phys. C Solid. State Phys. 1981, 14, 3795. [Google Scholar] [CrossRef]
  42. Kusumoto, D.; Yuasa, S. The application of convolutional neural network to stem cell biology. Inflamm. Regen. 2019, 39, 14. [Google Scholar] [CrossRef] [PubMed]
  43. Tran, H.V.; Nguyen, Q.H. iAnt: Combination of convolutional neural network and random Forest models using PSSM and BERT features to identify antioxidant proteins. Curr. Bioinform. 2022, 17, 184–195. [Google Scholar] [CrossRef]
  44. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
  45. Liu, Y.; Hu, Y.; Cai, W.; Zhou, G.; Zhan, J.; Li, L. DCCAM-MRNet: Mixed Residual Connection Network with Dilated Convolution and Coordinate Attention Mechanism for Tomato Disease Identification. Comput. Intell. Neurosci. 2022, 2022, 4848425. [Google Scholar] [CrossRef] [PubMed]
  46. Liu, Q.; Fang, L.; Yu, G.; Wang, D.; Xiao, C.L.; Wang, K. Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nat. Commun. 2019, 10, 2449. [Google Scholar] [CrossRef]
  47. Lu, W.; Tang, Y.; Wu, H.; Huang, H.; Fu, Q.; Qiu, J.; Li, H. Predicting RNA secondary structure via adaptive deep recurrent neural networks with energy-based filter. BMC Bioinform. 2019, 20, 684. [Google Scholar] [CrossRef] [PubMed]
  48. Li, Z.; Yu, Y. Protein secondary structure prediction using cascaded convolutional and recurrent neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016. [Google Scholar]
  49. Huang, G.; Shen, Q.; Zhang, G.; Wang, P.; Yu, Z.G. LSTMCNNsucc: A Bidirectional LSTM and CNN-Based Deep Learning Method for Predicting Lysine Succinylation Sites. Biomed. Res. Int. 2021, 2021, 9923112. [Google Scholar] [CrossRef]
  50. Reyad, M.; Sarhan, A.M.; Arafa, M. A modified Adam algorithm for deep neural network optimization. Neural Comput. Appl. 2023, 35, 17095–17112. [Google Scholar] [CrossRef]
  51. Liu, Q.; Chen, J.; Wang, Y.; Li, S.; Jia, C.; Song, J.; Li, F. DeepTorrent: A deep learning-based approach for predicting DNA N4-methylcytosine sites. Brief. Bioinform. 2021, 22, bbaa124. [Google Scholar] [CrossRef]
  52. Cherstvy, A.G.; Teif, V.B. Electrostatic effect of H1-histone protein binding on nucleosome repeat length. Phys. Biol. 2014, 11, 044001. [Google Scholar] [CrossRef]
Figure 1. The sample distribution of the mouse dataset used in this study.
Figure 2. The framework of Mus4mCPred consists of three parts: feature encoding, multi-view learning, and prediction module. Three different encoding schemes convert DNA sequences into vector representations, which are then fed into corresponding neural networks to learn multi-view features. After fusing multi-view features, classification is performed in the fully connected layer. BN stands for batch-normalization. DP stands for dropout.
Figure 3. Performance of different module combinations on the independent test set.
Figure 4. Confusion matrix of Mus4mCPred on the independent test set.
Figure 5. Comparison with other predictors on the independent test set across the six species.
Table 1. Indexes corresponding to different base pairs.

Nucleotide    Token Coding
AA            0
AG            1
AC            2
AT            3
GA            4
GG            5
GC            6
GT            7
CA            8
CG            9
CC            10
CT            11
TA            12
TG            13
TC            14
TT            15
Table 2. Detailed information of hyperparameters.

Module                         Parameter                             Search Range          Optimal Value
–                              Learning rate                         [0.00005–0.001]       0.0001
–                              Dropout rate                          [0.2–0.5]             0.5
–                              Batch size                            [32, 64, 128, 256]    128
Word2vec-BiLSTM                BiLSTM1 (number of hidden units)      [6–24]                12
Word2vec-BiLSTM                BiLSTM2 (number of hidden units)      [6–24]                12
Token encoding-BiLSTM-CNN      Embedding dim                         [16–24]               16
Token encoding-BiLSTM-CNN      Conv1d_1 (kernel size)                [1, 3, 5, 7]          3
Token encoding-BiLSTM-CNN      Conv1d_2 (kernel size)                [1, 3, 5, 7]          3
Token encoding-BiLSTM-CNN      BiLSTM1 (number of hidden units)      [6–24]                12
Token encoding-BiLSTM-CNN      BiLSTM2 (number of hidden units)      [6–24]                12
Character encoding-EIIP-CNN    Embedding dim                         [16–24]               24
Character encoding-EIIP-CNN    Conv1d_1 (kernel size)                [1, 3, 5, 7]          3
Character encoding-EIIP-CNN    Conv1d_2 (kernel size)                [1, 3, 5, 7]          3
Table 3. The impact of different numbers of BiLSTM layers in the Word2vec-BiLSTM module on the model performance on the independent test set. The features extracted by Word2vec are input into a two-layer bidirectional long short-term memory (BiLSTM). Therefore, this module is abbreviated as the “Word2vec-BiLSTM module”.

Model       Sn        Sp        Acc       MCC
1 layer     0.8188    0.8750    0.8469    0.6949
2 layers    0.7688    0.9375    0.8531    0.7165
3 layers    0.7500    0.9188    0.8344    0.6785
Table 4. The impact of residual connections in the token encoding-BiLSTM-CNN module on the model performance on the independent test set. The features extracted by token encoding are sequentially input into the BiLSTM and CNN, so this module is abbreviated as the “token encoding-BiLSTM-CNN module”.

Model                     Sn        Sp        Acc       MCC
Residual connection       0.7688    0.9375    0.8531    0.7165
No residual connection    0.8000    0.8500    0.8250    0.6508
Table 5. The impact of different numbers of convolutional layers in the character encoding-EIIP-CNN module on model performance on the independent test set. The features extracted by character encoding and EIIP encoding are input into the CNN, so this module is abbreviated as the “character encoding-EIIP-CNN module”.

Model       Sn        Sp        Acc       MCC
1 layer     0.7625    0.8938    0.8281    0.6620
2 layers    0.7688    0.9375    0.8531    0.7165
3 layers    0.7625    0.9063    0.8344    0.6758
Table 6. Performance comparison with other predictors using 10-fold cross-validation on the mouse dataset used in this study.

Prediction                     Sn        Sp        Acc       MCC
4mCpred-EL [22]                0.8040    0.7870    0.7950    0.5910
i4mC-Mouse [23]                0.6831    0.9020    0.7930    0.6510
Mouse4mC-BGRU [25]             0.7940    0.8400    0.8100    0.6200
MultiScale-CNN-4mCPred [26]    0.8008    0.8294    0.8166    0.6335
Mus4mCPred                     0.8469    0.7835    0.8149    0.6345
Table 7. Performance comparison with other predictors on the independent test set.

Prediction                     Sn        Sp        Acc       MCC
4mCpred-EL [22]                0.7572    0.8251    0.7910    0.5840
i4mC-Mouse [23]                0.8071    0.8252    0.8161    0.6330
Mouse4mC-BGRU [25]             0.8000    0.8500    0.8250    0.6510
MultiScale-CNN-4mCPred [26]    0.8563    0.8375    0.8469    0.6939
Mus4mCPred                     0.7688    0.9375    0.8531    0.7165
Table 8. Statistical summary of the dataset for the six different species.

Species            Training Positive    Training Negative    Test Positive    Test Negative
C. elegans         1554                 1554                 750              750
D. melanogaster    1769                 1769                 1000             1000
A. thaliana        1978                 1978                 1250             1250
E. coli            388                  388                  134              134
G. subterraneus    905                  905                  350              350
G. pickeringii     569                  569                  200              200
