1. Introduction
The CRISPR-Cas9 (Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-CRISPR-associated protein 9 (Cas9)) system is a robust genome-editing tool with a broad range of applications in numerous research fields [1, 2, 3]. After recognizing the 3-nucleotide protospacer adjacent motif (PAM), the endonuclease Cas9 uses a single guide RNA (gRNA) to base-pair with a DNA target sequence of interest and introduce a site-specific double-strand break [1, 4, 5]. The high efficiency and simplicity of CRISPR-Cas9-enabled genome engineering give it great potential for improving agricultural productivity and for clinical applications [6, 7].
The CRISPR-Cas9 system is widely used for highly efficient genome editing in various species and cell types, but it may bind to unintended regions and cause off-target activity. These off-target activities can confound research experiments and limit the practical application of the technique [8]. Cas9 can be programmed by altering the gRNA sequence to target a large number of sites in the genome, and the off-target effects of different gRNAs may vary greatly [9]. Therefore, it is crucial to design off-target prediction models that evaluate the on- and off-target activities of a gRNA, so that gRNAs with a high on-target rate and a low off-target effect can be chosen [10].
From the perspective of gRNA binding to non-target regions, the off-target activities induced by the CRISPR-Cas9 mechanism can be divided into three categories: (a) nucleic acid base mismatch with on-target sites; (b) nucleic acid base insertion from the gRNA sequence; (c) nucleic acid base deletion from the gRNA sequence [11]. Off-target cleavage may occur anywhere in the genome that contains a PAM and a protospacer sequence with mismatches, insertions, or deletions. Therefore, accurate evaluation and prediction of the off-target behavior of various gRNAs are required for selecting gRNAs with high specificity and targeting accuracy.
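As a rough illustration of the three categories above, the following sketch counts matches, mismatches, and gap events in an already aligned gRNA-target pair. The gap symbol `_`, the function name `classify_positions`, the toy sequences, and the convention that a gap on the target side counts as an insertion while a gap on the gRNA side counts as a deletion are all assumptions made for this example; they are not taken from the cited work.

```python
from collections import Counter

GAP = "_"  # assumed gap symbol for an aligned pair; not specified in the text


def classify_positions(grna: str, target: str) -> Counter:
    """Count match / mismatch / gap events in an aligned gRNA-target pair.

    Both strings must already be aligned to the same length. A gap on the
    target side is counted as 'insertion' (the gRNA carries an extra base)
    and a gap on the gRNA side as 'deletion' (the gRNA lacks a base). This
    naming is an assumption made for illustration.
    """
    if len(grna) != len(target):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for g, t in zip(grna.upper(), target.upper()):
        if g == GAP and t == GAP:
            raise ValueError("both sides cannot be gaps at the same position")
        if t == GAP:
            counts["insertion"] += 1
        elif g == GAP:
            counts["deletion"] += 1
        elif g == t:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# Toy sequences, made up for illustration only.
print(classify_positions("ACGT_ACGT", "ACCTGAC_T"))
```

In this toy pair there is one event of each off-target category alongside six matching positions; a real pipeline would obtain the alignment from an off-target search tool rather than from hand-written strings.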
Research on off-target prediction models has attracted substantial attention in recent years. Existing methods fall mainly into two categories: experimental techniques and in silico methods. Many experimental techniques have been developed, such as GUIDE-seq [12], DISCOVER-Seq [13], SMRT-OTS and Nano-OTS [14], Digenome-seq [15, 16], CIRCLE-seq [17], CHANGE-seq [18], and target-specific DNA enrichment [19]. Compared with these experimental techniques, which are highly accurate but costly, in silico methods are more convenient and cheaper, since they predict the off-target activities of a particular gRNA without assays.
The early prediction method MIT-score [20, 21] established that base mismatches between gRNA and target DNA follow sequence-based rules and that off-target activity is highly related to the number and location of the mismatched bases. MIT-score adjusts the corresponding weights based on experimentally validated off-target data, which allows the discovery of off-target sites in the early stages of gene editing without PAM. Another prediction method based on hand-crafted rules is CCTop [22], which considers the distance between off-target sites and the PAM, since experiments showed that the distance to the PAM affects off-target activities. However, methods using hand-crafted rules require the rules to be designed manually, which takes considerable effort to adjust the structure and hyperparameters of the network and depends on the analysis of the datasets. Furthermore, where the biological structure of sequences remains unclear, hand-crafted rules may miss extra information.
The first machine learning prediction method, CFD [9], was proposed by Doench et al., who used a lentiviral library infecting MOLM13 cells to obtain a dataset with off-target activities; experiments showed that CFD outperformed MIT-score and CCTop. Based on CFD, Listgarten et al. proposed a two-layer regression model, Elevation-score [23], which achieved better performance. Abadi et al. also presented a regression model, CRISTA [24], based on random forest, which took the secondary structure of RNA and epigenetic factors into account in its design. Considering the specificity of nucleotide composition and mismatch position in the gRNA-target pair, Peng et al. proposed Ensemble SVM [25], which trains an ensemble support vector machine classifier. Recently, Wang et al. presented a generalized prediction method, GNL-Scorer [26], to predict off-target activities across species. Most of these machine-learning-based in silico models considered only base mismatches and lacked further research on RNA insertion and RNA deletion. Moreover, these methods do not fully exploit the features in the data and remain limited in prediction accuracy.
The recent application of deep learning to sequence-based problems suggests its applicability to off-target prediction. Chuai et al. implemented DeepCRISPR [27], which combined epigenetic features with a neural network; an autoencoder and Recurrent Neural Networks (RNN) were used to design optimal gRNAs and to predict on-target and off-target sites simultaneously. Based on a Deep Convolutional Neural Network (DCNN) and a feedforward neural network, Lin et al. proposed CNN_std [28]. Similarly, Liu et al. adopted a Convolutional Neural Network (CNN) architecture and further introduced an attention mechanism in AttnToMismatch_CNN [29]. Another attention-based convolutional neural network is CRISPR-ONT [30], which paid more attention to the PAM-proximal region, where cleavage-related information may be located. This method also included a replacement-based sensitivity analysis to illustrate the relative importance of each site. Unlike the methods that improved the model architecture, DL-CRISPR [31] focused on dataset optimization. Its authors extended the existing positive dataset to improve the competitiveness of the model and investigated dataset design to address the data imbalance issue; four CNN layers were then used to learn data features, and the final score was obtained as the average score of 10 models. Recently, Lin et al. proposed CRISPR-Net [32], in which an Inception module combining several kernels of different sizes was used as the feature extractor in the convolutional layer, and Long Short-Term Memory (LSTM) units were used to form a recurrent neural network because of their selective memory. Although this method uses a feature extractor to prevent information loss, it still needs to be improved to preserve the original information. Meanwhile, since existing prediction methods are not precise enough for applying CRISPR/Cas9 gene-editing techniques at the clinical level, a new method is urgently needed to address the problem.
In this work, we propose R-CRISPR, an off-target prediction model based on a recurrent convolutional network that predicts off-target activities of gRNA-target sequence pairs with mismatches, insertions, and deletions. We first encode the gRNA-target sequence pair into a binary matrix as the input of the prediction model and then use a preprocessing module based on RepVGG to extract data features. Finally, a bi-directional recurrent network built from Long Short-Term Memory units is used for further training on the data to improve learning efficiency.
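The encoding step described above can be sketched as follows. The Discussion gives the matrix size as 7 × 24; everything else here is an assumption for illustration: the channel order `ACGT_`, the use of two direction bits, the left padding with all-zero columns, the function name `encode_pair`, and the toy sequences. The exact layout used by R-CRISPR may differ.

```python
CHANNELS = "ACGT_"  # assumed channel order; "_" marks an alignment gap
SEQ_LEN = 24        # number of columns, from the 7 x 24 size stated in the paper


def encode_pair(grna: str, target: str, seq_len: int = SEQ_LEN) -> list[list[int]]:
    """Encode an aligned gRNA-target pair as a 7 x seq_len binary matrix.

    Rows 0-4: OR of the one-hot codes of the two symbols (A, C, G, T, gap).
    Rows 5-6: direction bits recording which strand holds which symbol when
              the two differ (both 0 on a match).
    This layout is an assumption, not the confirmed R-CRISPR scheme.
    """
    if len(grna) != len(target):
        raise ValueError("aligned sequences must have equal length")
    if len(grna) > seq_len:
        raise ValueError("pair is longer than the matrix width")
    matrix = [[0] * seq_len for _ in range(7)]
    offset = seq_len - len(grna)  # left-pad short pairs with all-zero columns
    for i, (g, t) in enumerate(zip(grna.upper(), target.upper())):
        # str.index raises ValueError on symbols outside CHANNELS (e.g. "N")
        gi, ti = CHANNELS.index(g), CHANNELS.index(t)
        col = offset + i
        matrix[gi][col] = 1
        matrix[ti][col] = 1
        if gi < ti:
            matrix[5][col] = 1
        elif gi > ti:
            matrix[6][col] = 1
    return matrix


# Toy 23-nt pair with one mismatch, made up for illustration only.
toy_grna = "ACGT" * 5 + "AGG"
toy_target = "ACGT" * 4 + "ACCT" + "AGG"
toy_matrix = encode_pair(toy_grna, toy_target)
print(len(toy_matrix), len(toy_matrix[0]))  # prints: 7 24
```

The direction bits matter because OR-ing the two one-hot codes alone cannot tell which strand carries which base at a mismatch; a width of 24 leaves room for one gap position in a 23-nt site.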
This work provides the following contributions:
1. We developed R-CRISPR, a recurrent convolutional network to evaluate and predict off-target effects of gRNA-target sequence pairs with mismatches, insertions, and deletions.
2. We compare R-CRISPR with five mainstream prediction methods on datasets obtained from experimental methods to evaluate model performance. Using the area under the Receiver Operating Characteristic (ROC) curve and the Precision-Recall Curve (PRC) as the measurement standard, R-CRISPR surpasses existing mainstream prediction models.
3. We compare R-CRISPR with the state-of-the-art prediction model CRISPR-Net; R-CRISPR improves AUROC and AUPRC by 0.2% and 1.9%, respectively.
4. We conduct further experiments to explore the performance differences across various combinations of training datasets, and improve prediction accuracy by designing an ideal dataset combination.
4. Discussion
The accurate evaluation of off-target activities in the CRISPR-Cas9 system is a critical issue when applying machine learning, since early prediction models relied on hand-crafted rules and had limited predictive accuracy. In this study, we first used an encoding scheme to encode each gRNA-target sequence pair into a 7 × 24 matrix as the input of an improved convolutional neural network for data feature extraction. Building on this, we proposed R-CRISPR, an off-target prediction model based on a recurrent convolutional network with a cross-entropy loss function. Since mainstream in silico off-target prediction methods lacked further research on insertions and deletions in gRNA-target pairs, we optimized R-CRISPR to handle insertion and deletion detection. We first explored prediction accuracy on mismatch problems, because nucleic acid base mismatches account for the main proportion of off-target sites and most existing predictive methods were designed for mismatch-only problems. On the mismatch-only off-target dataset GUIDE_II verified by GUIDE-seq, experiments show that R-CRISPR outperformed six existing mainstream predictive methods in both ROC and PR analysis, with an average AUROC of 0.991 and an average AUPRC of 0.319. In addition, we set up a 5-fold cross-validation test on the off-target dataset confirmed by CIRCLE-seq (with nucleic acid base insertions and deletions) to investigate how insertions and deletions affect off-target prediction. We trained R-CRISPR and the state-of-the-art prediction method CRISPR-Net, which can also measure off-target sites with insertions and deletions, on different combinations of datasets and compared them. R-CRISPR achieved a higher AUROC of 0.976 and a higher AUPRC of 0.460, an improvement of 0.1% and 4.1%, respectively, over CRISPR-Net.
Furthermore, we explored how the quality of training data is influenced by data concatenation and designed seven combinations of datasets to test the performance of R-CRISPR. The seven R-CRISPR models showed competitive performance in ROC analysis, with an average AUROC of 0.992, while the results varied widely in PR analysis, with AUPRC ranging from 0.173 to 0.319. The experiments indicated that the design of training datasets can affect predictive results significantly, and that R-CRISPR trained on combined datasets surpassed R-CRISPR trained on a single dataset. We believe that combining multiple datasets captures more diverse information about off-target activities and produces a more comprehensive dataset, hence improving model performance. Meanwhile, we speculate that the sample imbalance caused by the small number of positive samples is also a crucial factor for model performance. Since off-target activities account for only a small minority of events in the whole biological process, the datasets obtained from most experiments are unbalanced, which requires further optimization.
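To make the gap between the ROC and PR results more concrete, the sketch below computes AUROC and AUPRC (as average precision) on a small synthetic, heavily imbalanced dataset. The data are made up for illustration and are not from this study; the function names `auroc` and `average_precision` are likewise hypothetical, and the average-precision helper assumes there are no tied scores.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count as 0.5), which equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.
    Assumes no tied scores, since ties make the ranking order ambiguous."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / n_pos


# Synthetic data: 98 negatives and only 2 positives (about 2% positive rate).
labels = [0] * 98 + [1, 1]
scores = [i / 100 for i in range(98)] + [0.955, 0.935]

print(round(auroc(labels, scores), 3))              # prints 0.969
print(round(average_precision(labels, scores), 3))  # prints 0.333
```

Here only six negative-positive pairs are ranked in the wrong order, so AUROC stays near 0.97, yet the same handful of high-scoring negatives cuts precision to one third. This is consistent with the speculation above that class imbalance contributes to the comparatively low AUPRC values.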