Article

A Novel Probability Model for LncRNA–Disease Association Prediction Based on the Naïve Bayesian Classifier

1 Key Laboratory of Intelligent Computing & Information Processing, Xiangtan University, Xiangtan 411105, China
2 College of Computer Engineering & Applied Mathematics, Changsha University, Changsha 410001, China
3 Department of Computer Science, Princeton University, Princeton, NJ 08544, USA
* Author to whom correspondence should be addressed.
Genes 2018, 9(7), 345; https://doi.org/10.3390/genes9070345
Submission received: 26 May 2018 / Revised: 24 June 2018 / Accepted: 3 July 2018 / Published: 8 July 2018

Abstract

An increasing number of studies have indicated that long non-coding RNAs (lncRNAs) play crucial roles in biological processes and in the diagnosis, prognosis, and treatment of complex diseases. However, experimentally validated associations between lncRNAs and diseases are still very limited. Recently, computational models have been developed to discover potential associations between lncRNAs and diseases by integrating multiple heterogeneous biological data; this has become a hot topic in biological research. In this article, we constructed a global tripartite network by integrating a variety of biological information including miRNA–disease, miRNA–lncRNA, and lncRNA–disease associations and interactions. Then, we constructed a global quadruple network by appending gene–lncRNA interaction, gene–disease association, and gene–miRNA interaction networks to the global tripartite network. Subsequently, based on these two global networks, we proposed a novel approach based on the naïve Bayesian classifier to predict potential lncRNA–disease associations (NBCLDA). Compared with state-of-the-art methods, our new method does not rely entirely on known lncRNA–disease associations, and it achieves reliable performance in terms of the area under the ROC curve (AUC) in leave-one-out cross validation. Moreover, to further assess the performance of NBCLDA, case studies of colorectal cancer, prostate cancer, and glioma were carried out, and the results demonstrated that NBCLDA can be an excellent tool for biomedical research in the future.

1. Introduction

Long non-coding RNAs (lncRNAs), transcripts of more than 200 nucleotides in length [1,2,3], are considered a new class of non-protein-coding transcripts. Much research evidence has shown that lncRNAs participate in almost the entire cell life cycle through various mechanisms and play significant roles in multiple biological processes including transcription, translation, epigenetic regulation, splicing, differentiation, immune response, and cell cycle control [4,5,6,7,8]. In particular, mutations and dysregulations of lncRNAs have been shown to be closely related to various complex human diseases [9,10,11], including AIDS [12], diabetes [13], Alzheimer’s Disease (AD) [14], and many types of cancer such as breast [15], prostate [16], hepatocellular [17], and bladder cancer [18]. For instance, the expression of the lncRNA HOTAIR was shown to be higher in primary breast tumors and metastases, and the HOTAIR expression level was found to be a powerful predictor of eventual metastasis and death [19,20]. Additionally, the lncRNA MALAT1 was demonstrated to be a prognostic indicator as well as a potential therapeutic target for preventing lung cancer metastasis, and it can be targeted by antisense oligonucleotides (ASOs) [21]. Moreover, recent studies have shown that the human H19 gene is frequently overexpressed in the myometrium and stroma during pathological endometrial proliferative events [22].
Predicting potential associations between lncRNAs and diseases would contribute to a systematic understanding of the pathogenesis of complex diseases at the molecular level and facilitate the identification of biomarkers for disease diagnosis, treatment, and prediction of response to therapy. However, relatively few lncRNA–disease associations have been supported by experiments so far. Hence, developing effective computational methods to uncover potential associations between lncRNAs and diseases has become a hot topic in recent years. In general, existing models for predicting potential associations between lncRNAs and diseases can be divided into three categories. The first category of methods is based on known disease-related lncRNAs. For example, Sun et al. proposed a model named RWRlncD [23], which carried out a random walk with restart on an lncRNA functional similarity network. This method uncovered potential associations between lncRNAs and diseases by integrating the disease similarity network, the lncRNA functional similarity network, and known lncRNA–disease associations. Ping et al. developed a method based on a newly constructed bipartite network, which relies on the known associations between lncRNAs and diseases [24]. Yang et al. constructed a coding-non-coding gene–disease bipartite network based on known associations between diseases and disease-causing genes (including lncRNAs). Then, they developed an iterative algorithm to uncover the possible links in the newly constructed bipartite network [25]. Ding et al. proposed a new model named TPGLDA to predict potential lncRNA–disease associations by integrating gene–disease associations with lncRNA–disease associations [26].
Different from the first kind of methods based on known lncRNA–disease associations, the second category of prediction models does not rely on known disease-related lncRNAs. For example, Chen et al. proposed a new method called HGLDA by integrating micro-RNA (miRNA)–disease associations and lncRNA–miRNA interactions. A hypergeometric distribution test is then applied to identify potential lncRNA–disease associations [27]. Liu et al. developed a computational framework by integrating human lncRNA expression profiles, gene expression profiles, and human disease-associated gene data to predict potential human lncRNA–disease associations [28]. Li et al. put forward a prediction method on account of the information of genome location to globally discover potential human lncRNAs related to vascular disease [29]. Gu et al. proposed a random walk-based model to identify potential associations between lncRNAs and diseases, which can be applied for predicting a disease without known associated lncRNAs and for inferring an lncRNA without known associated diseases [30].
In recent years, an increasing number of studies have been developed for understanding the cellular process, molecular interactions, and the pathogenesis of complex diseases at the molecular level by integrating different types of data and molecular interaction networks [31]. Such research includes the prediction of gene–disease associations [32], and the prediction of potential disease-associated miRNAs [33,34]. An increasing number of researchers have also adopted various data frameworks to increase the reliability of association prediction between diseases and lncRNAs. Hence, a third kind of prediction models has been proposed, in which multiple data sources are integrated to identify disease-related lncRNAs. For example, Lu et al. proposed a new prediction of lncRNA–disease associations via inductive matrix completion (named SIMCLDA), by integrating known lncRNA–disease interactions, disease–gene, gene–gene ontology associations [35]. Zhang et al. developed a novel model named LncRDNetFlow, which utilized a flow propagation algorithm to integrate a variety of information including the similarity of lncRNAs, the protein–protein interactions, and the similarity of diseases to infer lncRNA–disease associations [36]. Fu et al. proposed a model called MFLDA to predict potential lncRNA–disease associations by considering the quality and relevance of different heterogeneous data sources, which can select and integrate the data sources by assigning different weights to them [37]. Chen developed a path-based approach named KATZLDA for discovering potential lncRNA–disease associations by integrating information including known lncRNA–disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity [38]. All of these above data fusion-based methods can achieve effective results.
In this paper, to effectively predict potential lncRNA–disease associations, we first constructed a global tripartite network by integrating three kinds of heterogeneous networks: an lncRNA–disease association network, an miRNA–disease association network, and an miRNA–lncRNA interaction network. Then, considering that more heterogeneous networks can boost the prediction performance, we constructed a global quadruple network by appending a gene–lncRNA interaction network, a gene–disease association network, and a gene–miRNA interaction network to the tripartite network. Thereafter, based on these two newly constructed global networks, we propose a novel probabilistic model, the Naïve Bayesian Classifier for predicting potential LncRNA–Disease Associations (NBCLDA), to uncover potential lncRNA–disease associations. Moreover, to evaluate the prediction performance of NBCLDA, the leave-one-out cross-validation (LOOCV) framework was implemented, and the experimental results demonstrated that NBCLDA performs effectively and achieves better predictive performance than state-of-the-art methods under LOOCV.

2. Data Collection and Preprocessing

Considering that more heterogeneous data sources can boost the performance of prediction models, we combined seven heterogeneous data sets to construct our novel prediction model NBCLDA, whose ultimate goal is to infer potential associations between lncRNAs and diseases. These include the sets of miRNA–disease, miRNA–lncRNA, lncRNA–disease, gene–disease, and gene–lncRNA associations, as well as the set of gene–miRNA interactions and the set of diseases with disease tree numbers. The sets were collected from various databases.

2.1. Construction of miRNA–Disease and miRNA–lncRNA Association Sets

In this article, the miRNA–disease and miRNA–lncRNA association sets were downloaded from the HMDD [39] and the starBase v2.0 [40] databases in January 2015. Once these two data sets were collected, we removed any duplicate associations with conflicting evidence. Then, we further unified the names of miRNAs, and, thereafter, manually selected the common miRNAs in both sets. Finally, we retained only the associations related to those selected miRNAs in these two data sets. As a result, we obtained a data set $DS_1$ consisting of 4704 miRNA–disease interactions between 246 miRNAs and 373 diseases, and a data set $DS_2$ consisting of 9086 miRNA–lncRNA interactions between 246 miRNAs and 1089 lncRNAs (see Supplementary Materials Tables S1 and S2).
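The filtering described above (drop duplicates, find the miRNAs common to both sets, keep only associations involving them) can be sketched as follows. This is a minimal illustration on toy identifiers and not the authors' actual script; the function name `restrict_to_common_mirnas` and the identifiers `m1`, `d1`, `l1` are hypothetical, and name unification is assumed to have been done already.

```python
def restrict_to_common_mirnas(mirna_disease, mirna_lncrna):
    """Deduplicate two (miRNA, partner) lists and keep only shared miRNAs."""
    md = set(mirna_disease)  # set() drops exact duplicate pairs
    ml = set(mirna_lncrna)
    # miRNAs that appear in both association sets
    common = {m for m, _ in md} & {m for m, _ in ml}
    ds1 = sorted(pair for pair in md if pair[0] in common)
    ds2 = sorted(pair for pair in ml if pair[0] in common)
    return ds1, ds2, common


# Toy input (hypothetical identifiers, not real associations)
toy_md = [("m1", "d1"), ("m1", "d1"), ("m2", "d2")]
toy_ml = [("m1", "l1"), ("m3", "l2")]
ds1, ds2, common = restrict_to_common_mirnas(toy_md, toy_ml)
```

Only `m1` occurs in both toy sets, so both filtered sets keep only the pairs involving `m1`.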

2.2. Construction of the lncRNA–Disease Association Set

In this paper, the set of lncRNA–disease associations was collected from the MNDR v2.0 database [41] in 2017. In a similar way, once the data set was collected, we removed the duplicate associations with conflicting evidence. Then, we selected the lncRNA–disease associations whose diseases belong to $DS_1$ and whose lncRNAs belong to $DS_2$. As a result, we obtained a data set $DS_3$ consisting of 407 lncRNA–disease associations between 77 lncRNAs and 95 diseases (see Supplementary Materials Table S3). The data set $DS_3$ is used as the test sample in the simulation experiments that follow.

2.3. Construction of the Gene–Disease and Gene–lncRNA Association Sets

In this article, the set of gene–disease associations was gathered from the DisGeNET v5.0 database [42] in May 2017, and the set of gene–lncRNA associations was downloaded from the LncACTdb v1.0 database [43]. Again, we removed the duplicate associations with conflicting evidence. Then, we further unified the names of genes, and thereafter manually selected the common genes in both sets. Finally, we retained only the associations related to those selected genes in these two data sets. Additionally, we transformed some disease names included in the newly constructed set of gene–disease associations into their aliases in $DS_1$, in order to keep the disease names uniform. For example, the disease names “pulmonary Emphysema” and “Bladder Neoplasm” in the newly collected set of gene–disease associations were converted into “pulmonary Embolism” and “Bladder Neoplasms” in $DS_1$, respectively. Hence, we obtained a data set $DS_4$ consisting of 3702 gene–disease associations between 171 genes and 227 diseases, and a data set $DS_5$ consisting of 411 gene–lncRNA interactions between 171 genes and 66 lncRNAs (see Supplementary Materials Tables S4 and S5).

2.4. Construction of the Gene–miRNA Association Set

In this paper, the set of gene–miRNA interactions was obtained from the miRecords [44] database, which was last updated in April 2013. Once the data set was collected, we removed the duplicate associations with conflicting evidence. Then, we selected the gene–miRNA interactions whose genes belong to $DS_4$ or $DS_5$ and whose miRNAs belong to $DS_1$ or $DS_2$. As a result, we obtained a data set $DS_6$ consisting of 565 gene–miRNA associations between 109 genes and 174 miRNAs (see Supplementary Materials Table S6).

2.5. Construction of the Set of Diseases with Disease Tree Numbers

In this article, the set of diseases with disease tree numbers was gathered from the MeSH database [45]. In the MeSH database, the disease terms, described as directed acyclic graphs (DAGs), are classified and identified by disease tree numbers. We browsed the MeSH database and collected the disease tree numbers of the diseases in $DS_1$. As a result, we obtained a data set $DS_7$ consisting of 373 diseases with their disease tree numbers (see Supplementary Materials Table S7).

2.6. Analysis of Multi-Relational Data Sources

In our model, four object types are considered: lncRNAs, diseases, miRNAs, and genes. Based on these four object types, we collected six relational data sources from different databases. Figure 1 illustrates the relationships between these data sources more directly. In Figure 1, $R_{\#1_{\Omega}\#2_{\Omega}}$ denotes the associations between two of these four object types, where $\#1$ represents one object type, $\#2$ represents another object type, and $\Omega$ denotes the data set $DS_{\Omega}$ that the two objects belong to. For example, $R_{m_1 d_1}$ denotes the associations between miRNAs and diseases, where $m$ represents miRNAs, $d$ represents diseases, and ‘1’ indicates that all these miRNAs and diseases belong to the data set $DS_1$. In addition, the numbers of the same objects in the different data sets and the relationships among them are shown in Figure 1. For instance, the number of diseases is 373 in $R_{m_1 d_1}$, 95 (= 29 + 66) in $R_{l_3 d_3}$, and 227 (= 66 + 161) in $R_{g_4 d_4}$. Both the 95 diseases in $R_{l_3 d_3}$ and the 227 diseases in $R_{g_4 d_4}$ are part of the 373 diseases in $R_{m_1 d_1}$; moreover, the intersection of the diseases in $R_{l_3 d_3}$ and $R_{g_4 d_4}$ includes 66 different diseases.

3. Method

As illustrated in Figure 2, our newly proposed model NBCLDA for predicting potential associations between lncRNAs and diseases can be mainly divided into the following steps:
Step 1: As illustrated in Figure 2a, on the basis of data sets $DS_1$, $DS_2$, and $DS_3$ we can construct an miRNA–disease association network labeled MDN, an miRNA–lncRNA association network labeled MLN, and an lncRNA–disease association network labeled LDN.
Step 2: As illustrated in Figure 2b, by integrating the three association networks constructed in Step 1, we can easily obtain a global tripartite network $GN_1$ of lncRNA–miRNA–disease relationships.
Step 3: As illustrated in Figure 2c, in order to utilize multiple data sources to improve the prediction performance, on the basis of data sets $DS_4$, $DS_5$, and $DS_6$ obtained above, we can also construct a gene–disease association network labeled GDN, a gene–lncRNA association network labeled GLN, and a gene–miRNA association network labeled GMN.
Step 4: As illustrated in Figure 2d, by appending the three association networks constructed in Step 3 to $GN_1$ constructed in Step 2, we can easily obtain a global quadruple network $GN_2$ of lncRNA–miRNA–gene–disease relations.
Step 5: As illustrated in Figure 2e,f, after applying the naïve Bayesian classifier theory to $GN_1$ and $GN_2$, we can obtain two kinds of prediction models: NBCLDA-$GN_1$ and NBCLDA-$GN_2$.
Step 6: As illustrated in Figure 2g,h, in order to further improve the prediction performance of NBCLDA, we incorporated disease semantic similarity into NBCLDA-$GN_1$ and NBCLDA-$GN_2$. Thus, we can obtain two new prediction models, NBCLDA-$GN_1$-SD and NBCLDA-$GN_2$-SD, to infer potential lncRNA–disease associations.

3.1. Construction of the MDN, MLN, LDN, and $GN_1$

Let $L$ be the set of $n$ lncRNAs in $DS_2$, $L'$ be the set of $n'$ lncRNAs in $DS_3$, $D$ be the set of $r$ diseases in $DS_1$, and $D'$ be the set of $r'$ diseases in $DS_3$. Additionally, let $M = \{m_1, m_2, \ldots, m_t\}$ be the set of $t$ miRNAs in $DS_1$ or $DS_2$. From Section 2.1 and Section 2.2, it is clear that $L' \subseteq L$ and $D' \subseteq D$; hence, we can let $L' = \{l_1, l_2, \ldots, l_{n'}\}$, $L = \{l_1, l_2, \ldots, l_{n'}, l_{n'+1}, \ldots, l_n\}$, $D' = \{d_1, d_2, \ldots, d_{r'}\}$, and $D = \{d_1, d_2, \ldots, d_{r'}, d_{r'+1}, \ldots, d_r\}$. Thus, we can represent the miRNA–disease association network, MDN, as $MDN = (M, D, E_1)$, where $E_1 = \{e_{m_k d_j} \mid m_k \in M, d_j \in D\}$ denotes the set of known interactions between the miRNAs in $M$ and the diseases in $D$. That is, the edge $e_{m_k d_j} \in E_1$ if and only if $m_k$ is associated with $d_j$.
In the same way, we can further represent the miRNA–lncRNA interaction network, MLN, and the lncRNA–disease association network, LDN, as $MLN = (M, L, E_2)$ and $LDN = (L', D', E_3)$, where $E_2 = \{e_{m_k l_i} \mid m_k \in M, l_i \in L\}$ denotes the set of known interactions between the miRNAs in $M$ and the lncRNAs in $L$, and $E_3 = \{e_{l_i d_j} \mid l_i \in L', d_j \in D'\}$ represents the set of known associations between the lncRNAs in $L'$ and the diseases in $D'$. Thus, the edge $e_{m_k l_i} \in E_2$ if and only if $m_k$ is associated with $l_i$, and the edge $e_{l_i d_j} \in E_3$ if and only if $l_i$ is associated with $d_j$. Finally, the global tripartite network, $GN_1$, is expressed as $GN_1 = (L, D, M, E)$, where $E = E_1 \cup E_2 \cup E_3$.
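As an illustration of how the three edge sets combine into one tripartite network, the following minimal sketch stores the union of the edge sets as an undirected adjacency map and reads off the common neighboring miRNAs of an lncRNA–disease pair. The helper names `build_adjacency` and `common_mirnas` and the toy edges are hypothetical and only mirror the structure described above; they are not taken from the authors' implementation.

```python
from collections import defaultdict


def build_adjacency(*edge_sets):
    """Merge several edge sets into one undirected adjacency map."""
    adj = defaultdict(set)
    for edges in edge_sets:
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def common_mirnas(adj, lncrna, disease, mirnas):
    """miRNAs directly linked to both the lncRNA and the disease."""
    return adj[lncrna] & adj[disease] & mirnas


# Toy edge sets standing in for the miRNA-disease, miRNA-lncRNA,
# and lncRNA-disease edges (hypothetical identifiers)
M = {"m1", "m2", "m3"}
E1 = {("m1", "d3"), ("m3", "d3"), ("m2", "d1")}
E2 = {("m1", "l2"), ("m3", "l2"), ("m2", "l1")}
E3 = {("l1", "d1")}
gn1 = build_adjacency(E1, E2, E3)
cn = common_mirnas(gn1, "l2", "d3", M)
```

In this toy network the pair (`l2`, `d3`) has the common neighbors `m1` and `m3`, while (`l1`, `d3`) has none.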

3.2. Construction of GDN, GLN, GMN, and $GN_2$

Let $D''$ be the set of $r''$ diseases in $DS_4$, $L''$ be the set of $n''$ lncRNAs in $DS_5$, $G$ be the set of $p$ genes in $DS_4$ or $DS_5$, $G'$ be the set of $p'$ genes in $DS_6$, and $M'$ be the set of $t'$ miRNAs in $DS_6$. Additionally, from Section 2.3 and Section 2.4, it is clear that $D'' \subseteq D$, $L'' \subseteq L$, and $G' \subseteq G$; hence, we can let $D'' = \{d_1, d_2, \ldots, d_{r''}\}$, $L'' = \{l_1, l_2, \ldots, l_{n''}\}$, $G' = \{g_1, g_2, \ldots, g_{p'}\}$, $G = \{g_1, g_2, \ldots, g_{p'}, g_{p'+1}, \ldots, g_p\}$, and $M' = \{m_1, m_2, \ldots, m_{t'}\}$. We can thus represent the gene–disease association network, GDN, as $GDN = (G, D'', E_4)$, where $E_4 = \{e_{g_f d_j} \mid g_f \in G, d_j \in D''\}$ denotes the set of known associations between the genes in $G$ and the diseases in $D''$. That is, the edge $e_{g_f d_j} \in E_4$ if and only if $g_f$ is associated with $d_j$.
In the same way, we can further represent the gene–lncRNA interaction network, GLN, and the gene–miRNA interaction network, GMN, as $GLN = (G, L'', E_5)$ and $GMN = (G', M', E_6)$, where $E_5 = \{e_{g_f l_i} \mid g_f \in G, l_i \in L''\}$ and $E_6 = \{e_{g_f m_k} \mid g_f \in G', m_k \in M'\}$ denote the set of known gene–lncRNA interactions and the set of known gene–miRNA interactions, respectively. In other words, the edge $e_{g_f l_i} \in E_5$ if and only if $g_f$ is associated with $l_i$, and the edge $e_{g_f m_k} \in E_6$ if and only if $g_f$ is associated with $m_k$. Finally, the global quadruple network $GN_2$ can be expressed as $GN_2 = (L, D, M, G, E')$, where $E' = E \cup E_4 \cup E_5 \cup E_6$.

3.3. Construction of NBCLDA

The naïve Bayesian classifier is a simple probabilistic classifier with a naïve independence assumption: any feature of a class is independent of the other features of the class. Abstractly, based on the Bayesian classifier probability model $p(C \mid F_1, F_2, \ldots, F_n)$, where $C$ is a dependent class variable and $F_1, F_2, \ldots, F_n$ are the feature variables of class $C$, the posterior probability can be described as follows:
$$p(C \mid F_1, F_2, \ldots, F_n) = \frac{p(F_1, F_2, \ldots, F_n \mid C)\, p(C)}{p(F_1, F_2, \ldots, F_n)}. \tag{1}$$
Furthermore, according to the above assumption, since each feature $F_i$ is conditionally independent of every other feature $F_j$ $(i \neq j)$, Equation (1) can be expressed as:
$$p(C \mid F_1, F_2, \ldots, F_n) = \frac{p(C) \prod_{i=1}^{n} p(F_i \mid C)}{p(F_1, F_2, \ldots, F_n)}. \tag{2}$$
Inspired by existing probabilistic models based on Bayesian theory to predict missing links in complex networks [46], we designed a prediction model, NBCLDA, to infer potential disease-related lncRNAs by applying the naïve Bayesian theory to $GN_1$ and $GN_2$, constructed in Section 3.1 and Section 3.2, respectively. In the context of Equation (1), in NBCLDA, the associations between lncRNAs and diseases in $GN_1$ and $GN_2$ are considered as the class variable, while the common neighboring nodes of every lncRNA–disease pair in $GN_1$ and $GN_2$ are considered as the feature variables. In particular, when applying the naïve Bayesian theory to $GN_1$, for any given pair of lncRNA and disease nodes in $GN_1$, we consider their common neighboring miRNA nodes to be conditionally independent of each other: since all of the miRNAs are different, we assume that each miRNA does not affect the others. To illustrate this assumption more intuitively, we provide an example in Figure 3a, in which the common neighboring nodes $m_1$ and $m_3$ between $l_2$ and $d_3$ are assumed to be conditionally independent.
However, when applying the naïve Bayesian theory to $GN_2$, there are two types of common neighboring nodes, miRNAs and genes, between a pair of lncRNA and disease nodes. In this case, it is unreasonable to consider that all of these common neighbors are conditionally independent of each other, since there may exist interactions between genes and miRNAs. Therefore, for any given pair of lncRNA and disease nodes in $GN_2$, let $\phi$ be the set that consists of all their common neighboring nodes. Then, for any miRNA node $m$, if there is a gene node $g$ that is associated with $m$, we consider the miRNA $m$ and its related gene $g$ as a whole, denote them as $m$-$g$, and label this an miRNA–gene pair. In this way, there are three kinds of features in $\phi$: miRNAs, genes, and miRNA–gene pairs. We assume that these three kinds of elements in $\phi$ are conditionally independent of each other. To illustrate this assumption more intuitively, we present an example in Figure 3b, in which $m_1$, $m_3$, $g_1$, and $g_4$ are the common neighboring nodes between $l_2$ and $d_3$, and we assume that $m_3$-$g_4$, $m_1$, and $g_1$ are conditionally independent of each other.
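The grouping of common neighbors into miRNAs, genes, and miRNA–gene pairs can be sketched as below, using the Figure 3b example as toy input. This is a hedged illustration rather than the authors' code: the function name `group_features` is hypothetical, and how a node with several interacting partners should be handled is not specified in the text, so the sketch simply records every interacting pair and removes paired nodes from the single-node features.

```python
def group_features(common_mirnas, common_genes, gene_mirna_edges):
    """Split common neighbours into single miRNAs, single genes, and pairs."""
    # A pair is formed only when both the miRNA and the gene are
    # common neighbours and a gene-miRNA interaction links them.
    pairs = sorted(
        (m, g)
        for g, m in gene_mirna_edges
        if m in common_mirnas and g in common_genes
    )
    paired_mirnas = {m for m, _ in pairs}
    paired_genes = {g for _, g in pairs}
    return {
        "mirnas": sorted(common_mirnas - paired_mirnas),
        "genes": sorted(common_genes - paired_genes),
        "pairs": pairs,
    }


# Figure 3b style toy input: m3 interacts with g4; g2 is not a
# common neighbour, so its interaction with m1 forms no pair.
features = group_features(
    {"m1", "m3"}, {"g1", "g4"}, {("g4", "m3"), ("g2", "m1")}
)
```

For this input the three feature kinds are the pair `m3`-`g4`, the single miRNA `m1`, and the single gene `g1`, matching the example in the text.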

3.3.1. Method for Applying the Naïve Bayesian Theory to $GN_1$

For any given lncRNA node $l_i$ and disease node $d_j$ in $GN_1$, let $N(l_i)$ and $N(d_j)$ be the sets of neighboring nodes that are directly connected to $l_i$ and $d_j$, respectively. From this, we construct $CN(l_i, d_j) = \{m_1, m_2, \ldots, m_h\}$, which denotes the set consisting of all common neighboring nodes between $l_i$ and $d_j$ in $GN_1$. Then, the prior probabilities for the existence of an edge $e_{l_i d_j}$ are calculated via:
$$p(e_{l_i d_j} = 1) = \frac{|M^c|}{|M|}, \tag{3}$$
$$p(e_{l_i d_j} = 0) = 1 - p(e_{l_i d_j} = 1), \tag{4}$$
where $|M^c|$ denotes the number of known associations between lncRNAs and diseases in LDN, and $|M| = n \times r$, where $n$ denotes the number of lncRNAs in $L$ and $r$ denotes the number of diseases in $D$.
Based on the naïve Bayesian classifier, the posterior probabilities for an edge $e_{l_i d_j}$, representing whether the node $l_i$ is connected to $d_j$ in $GN_1$, are defined as follows:
$$p(e_{l_i d_j} = 1 \mid CN(l_i, d_j)) = \frac{p(e_{l_i d_j} = 1)}{p(CN(l_i, d_j))} \prod_{m_\delta \in CN(l_i, d_j)} p(m_\delta \mid e_{l_i d_j} = 1), \tag{5}$$
$$p(e_{l_i d_j} = 0 \mid CN(l_i, d_j)) = \frac{p(e_{l_i d_j} = 0)}{p(CN(l_i, d_j))} \prod_{m_\delta \in CN(l_i, d_j)} p(m_\delta \mid e_{l_i d_j} = 0). \tag{6}$$
From Equations (5) and (6), we can directly identify whether an lncRNA node is connected with a disease node or not in $GN_1$. However, since it is often too complicated to calculate the value of $p(CN(l_i, d_j))$, we take the ratio of Equation (5) to Equation (6), in which this term cancels, and first define the likelihood of a potential association existing between $l_i$ and $d_j$ in $GN_1$ as follows:
$$S_1(l_i, d_j) = \frac{p(e_{l_i d_j} = 1)}{p(e_{l_i d_j} = 0)} \prod_{m_\delta \in CN(l_i, d_j)} \frac{p(m_\delta \mid e_{l_i d_j} = 1)}{p(m_\delta \mid e_{l_i d_j} = 0)}, \tag{7}$$
where $p(m_\delta \mid e_{l_i d_j} = 1)$ and $p(m_\delta \mid e_{l_i d_j} = 0)$ are the conditional probabilities that a node $m_\delta$ is a common neighboring node of $l_i$ and $d_j$ in $GN_1$, given that $l_i$ and $d_j$ are connected or not connected, respectively. Moreover, according to Bayesian theory, these two conditional probabilities can be expressed as:
$$p(m_\delta \mid e_{l_i d_j} = 1) = \frac{p(e_{l_i d_j} = 1 \mid m_\delta)\, p(m_\delta)}{p(e_{l_i d_j} = 1)}, \tag{8}$$
$$p(m_\delta \mid e_{l_i d_j} = 0) = \frac{p(e_{l_i d_j} = 0 \mid m_\delta)\, p(m_\delta)}{p(e_{l_i d_j} = 0)}, \tag{9}$$
where $p(e_{l_i d_j} = 1 \mid m_\delta)$ and $p(e_{l_i d_j} = 0 \mid m_\delta)$ represent the conditional probabilities that the lncRNA node $l_i$ is connected or not connected to the disease node $d_j$, respectively, given that $m_\delta$ is one of the common neighboring nodes between $l_i$ and $d_j$ in $GN_1$. Thus, $p(e_{l_i d_j} = 1 \mid m_\delta)$ and $p(e_{l_i d_j} = 0 \mid m_\delta)$ are calculated via the following formulas:
$$p(e_{l_i d_j} = 1 \mid m_\delta) = \frac{N^{+}_{m_\delta}}{N^{+}_{m_\delta} + N^{-}_{m_\delta}}, \tag{10}$$
$$p(e_{l_i d_j} = 0 \mid m_\delta) = \frac{N^{-}_{m_\delta}}{N^{+}_{m_\delta} + N^{-}_{m_\delta}}, \tag{11}$$
where $N^{+}_{m_\delta}$ and $N^{-}_{m_\delta}$ denote the numbers of known and unknown associations, respectively, between lncRNAs and diseases whose common neighbors include $m_\delta$.
Hence, from Equations (8) and (9), Equation (7) can be modified as follows:
$$S_1(l_i, d_j) = \frac{p(e_{l_i d_j} = 1)}{p(e_{l_i d_j} = 0)} \prod_{m_\delta \in CN(l_i, d_j)} \frac{p(e_{l_i d_j} = 0)\, p(e_{l_i d_j} = 1 \mid m_\delta)}{p(e_{l_i d_j} = 1)\, p(e_{l_i d_j} = 0 \mid m_\delta)}. \tag{12}$$
Moreover, given any two nodes $l_i$ and $d_j$ in $GN_1$, the value of $\frac{p(e_{l_i d_j} = 1)}{p(e_{l_i d_j} = 0)}$ is a constant, which we denote as $\phi_m$ for convenience. Additionally, for each common neighboring node $m_\delta$ between $l_i$ and $d_j$ in $GN_1$, let $N_l$ denote the number of lncRNAs directly related to $m_\delta$, and $N_d$ denote the number of diseases directly related to $m_\delta$. Then, $N^{+}_{m_\delta} + N^{-}_{m_\delta} = N_l \times N_d$, and hence, Equation (7) can further be modified as follows:
$$S_1(l_i, d_j) = \phi_m \prod_{m_\delta \in CN(l_i, d_j)} \phi_m^{-1} \frac{N^{+}_{m_\delta}}{N^{-}_{m_\delta}}. \tag{13}$$
Considering that $N^{+}_{m_\delta}$ may equal zero, we introduce the Laplace calibration to guarantee that the value of $S_1(l_i, d_j)$ will not be zero:
$$S_1(l_i, d_j) = \phi_m \prod_{m_\delta \in CN(l_i, d_j)} \phi_m^{-1} \frac{N^{+}_{m_\delta} + 1}{N^{-}_{m_\delta} + 1}. \tag{14}$$
Furthermore, by introducing the logarithmic function for standardization, for any given lncRNA node $l_i$ and disease node $d_j$ in $GN_1$, we can finally define the probability of a potential association existing between them as:
$$\tilde{S}_1(l_i, d_j) = \frac{\log(S_1(l_i, d_j))}{\lambda}, \tag{15}$$
where $\lambda$ is a constant utilized for normalization.
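To make the scoring on the tripartite network concrete, the sketch below evaluates the Laplace-calibrated score for one lncRNA–disease pair on a toy network. It is an illustrative reading of the formulas in this section, not the authors' implementation: the function name `s1_score`, the dictionary-based inputs, and the toy data are all hypothetical. The final logarithm and the normalization by the constant λ are left out, since a monotone transformation of the score does not change how candidate pairs are ranked.

```python
def s1_score(lnc, dis, mir_lnc, mir_dis, known, n_lnc, n_dis):
    """Laplace-calibrated score for one (lncRNA, disease) pair.

    mir_lnc / mir_dis map each miRNA to the lncRNAs / diseases linked to it;
    known is the set of known (lncRNA, disease) associations.
    """
    p1 = len(known) / (n_lnc * n_dis)  # prior p(e = 1)
    phi = p1 / (1.0 - p1)              # phi_m = p(e = 1) / p(e = 0)
    score = phi
    for m, lncs in mir_lnc.items():
        diseases = mir_dis.get(m, set())
        if lnc in lncs and dis in diseases:  # m is a common neighbour
            # known pairs among all N_l x N_d pairs that share m
            n_pos = sum((l, d) in known for l in lncs for d in diseases)
            n_neg = len(lncs) * len(diseases) - n_pos
            score *= (1.0 / phi) * (n_pos + 1) / (n_neg + 1)
    return score


# Toy network (hypothetical identifiers)
mir_lnc = {"m1": {"l1", "l2"}, "m2": {"l2"}}
mir_dis = {"m1": {"d1", "d2"}, "m2": {"d2"}}
known = {("l1", "d1")}
score_l2_d2 = s1_score("l2", "d2", mir_lnc, mir_dis, known, 2, 2)
score_l1_d1 = s1_score("l1", "d1", mir_lnc, mir_dis, known, 2, 2)
```

Here the prior is 1/4, so φ_m = 1/3. The pair (`l2`, `d2`) shares `m1` (one known pair out of four) and `m2` (no known pair out of one), giving 1/3 × 1.5 × 1.5 = 0.75; the pair (`l1`, `d1`) shares only `m1`, giving 0.5.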

3.3.2. Method for Applying the Naïve Bayesian Theory to $GN_2$

In the same manner as described in Section 3.3.1, for any given lncRNA node $l_i$ and disease node $d_j$ in $GN_2$, we construct the set consisting of all common neighboring nodes, $CN(l_i, d_j) = \{m_1, m_2, \ldots, m_h, g_1, g_2, \ldots, g_u\}$. The posterior probabilities $p(e_{l_i d_j} = 1 \mid CN(l_i, d_j))$ and $p(e_{l_i d_j} = 0 \mid CN(l_i, d_j))$, representing whether the node $l_i$ is connected to $d_j$ in $GN_2$ or not, respectively, are defined analogously. Then, as described in Section 3.3.1, we can define the likelihood of a potential association existing between $l_i$ and $d_j$ in $GN_2$ as follows (the detailed derivation is described in the Supplementary Materials):
$$S_2(l_i, d_j) = \phi_m \prod_{\substack{m_\alpha \in CN(l_i, d_j) \\ g_\beta \in CN(l_i, d_j) \\ \overline{m_\alpha}, \overline{g_\beta} \in CN(l_i, d_j)}} \phi_m^{-3} \frac{(N^{+}_{m_\alpha} + 1)(N^{+}_{g_\beta} + 1)(N^{+}_{\overline{m_\alpha}, \overline{g_\beta}} + 1)}{(N^{-}_{m_\alpha} + 1)(N^{-}_{g_\beta} + 1)(N^{-}_{\overline{m_\alpha}, \overline{g_\beta}} + 1)}, \tag{16}$$
where $N^{+}_{\overline{m_\alpha}, \overline{g_\beta}}$ and $N^{-}_{\overline{m_\alpha}, \overline{g_\beta}}$ denote the numbers of known and unknown associations between $l_i$ and $d_j$ in $GN_2$, respectively, conditional on $\overline{m_\alpha}$ and $\overline{g_\beta}$ being common neighboring nodes between $l_i$ and $d_j$ in $GN_2$ and $\overline{m_\alpha}$-$\overline{g_\beta}$ being an miRNA–gene pair. In addition, $N^{+}_{m_\alpha}$ and $N^{-}_{m_\alpha}$ denote the numbers of known and unknown associations between $l_i$ and $d_j$ in $GN_2$, respectively, conditional on $m_\alpha$ being a common neighboring node between $l_i$ and $d_j$. Similarly, $N^{+}_{g_\beta}$ and $N^{-}_{g_\beta}$ represent the numbers of known and unknown associations between $l_i$ and $d_j$ in $GN_2$, respectively, conditional on $g_\beta$ being a common neighboring node between $l_i$ and $d_j$. Finally, following the example of Equation (15), we can define the probability of a potential association existing between $l_i$ and $d_j$ in $GN_2$ as follows:
$$\tilde{S}_2(l_i, d_j) = \frac{\log(S_2(l_i, d_j))}{\lambda}. \tag{17}$$

3.3.3. Method of Appending the Disease Semantic Similarity into NBCLDA

The disease semantic similarity has been widely utilized as a valuable data source for discovering potential disease-related lncRNAs in many previous studies [30,38]. In this paper, we append the disease semantic similarity into our newly constructed prediction model NBCLDA to further uncover the potential relationships between lncRNAs and diseases.
From the description given in Section 2.5, we know that each disease term in the MeSH database can be described as a directed acyclic graph (DAG), in which the nodes represent the disease MeSH descriptors and all MeSH descriptors in the DAG are linked from more general terms (parent nodes) to more specific terms (child nodes) by a directed edge. Hence, in this paper, we first obtain the disease tree numbers according to the disease terms collected from the MeSH database. Thereafter, we adopt the method proposed by Wang et al. [47]. Suppose that disease $d_j$ is represented as the graph $DAG_{d_j} = (d_j, T_{d_j}, E_{d_j})$, where $T_{d_j}$ is the set of all ancestor nodes of $d_j$ including node $d_j$ itself, and $E_{d_j}$ is the set of corresponding links. Then, the contribution of a disease $t$ in $DAG_{d_j}$ to the semantics of disease $d_j$ can be calculated as follows:
$$D_{d_j}(t) = \begin{cases} 1, & \text{if } t = d_j, \\ \max\{\Delta \times D_{d_j}(c_t) \mid c_t \in \text{children of } t\}, & \text{if } t \neq d_j, \end{cases} \tag{18}$$
where $\Delta$ is the semantic contribution factor for the edges in $E_{d_j}$ linking a disease $t$ with its child disease $c_t$. The disease $d_j$ is the most specific disease in its own DAG, and its own semantic score is defined as 1. Nodes located farther from $d_j$ are more general diseases that contribute less to $d_j$. Based on Equation (18), we can define the semantic value of the disease $d_j$ as follows:
$$DV(d_j) = \sum_{t \in T_{d_j}} D_{d_j}(t). \tag{19}$$
Therefore, based on the assumption that diseases sharing a larger part of their DAGs are more similar, the semantic similarity between diseases $d_j$ and $d_i$ can be defined as:
$$SD(d_j, d_i) = \frac{\sum_{t \in T_{d_j} \cap T_{d_i}} \left( D_{d_j}(t) + D_{d_i}(t) \right)}{DV(d_j) + DV(d_i)}. \tag{20}$$
Finally, based on the disease semantic similarity and the similarities between lncRNAs and diseases, we can reconstruct a new recommended measurement for inferring potential associations between lncRNAs and diseases as follows:
$$S = S \times SD,$$
where $S$ denotes either $S_1(l_i, d_j)$ or $S_2(l_i, d_j)$, and $SD$, which is computed via Equation (20), denotes the disease semantic similarity.
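The semantic contribution, semantic value, and semantic similarity defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy hierarchy `PARENTS`, the function names, and the value `DELTA = 0.5` are assumptions made for illustration (this section does not state the value of the semantic contribution factor), and real input would come from the MeSH tree numbers.

```python
from typing import Dict, List

DELTA = 0.5  # assumed value of the semantic contribution factor

# Hypothetical toy hierarchy: each term maps to its parent (more general) terms.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "d1": ["A"],
    "d2": ["A", "B"],
}


def contributions(disease: str, parents: Dict[str, List[str]],
                  delta: float = DELTA) -> Dict[str, float]:
    """Return D_disease(t) for the disease itself and every ancestor t."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            value = delta * contrib[node]
            # Keep the maximum over all children of `parent` inside this DAG.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def semantic_value(disease: str, parents: Dict[str, List[str]]) -> float:
    """DV(d): sum of the contributions of all terms in the disease's DAG."""
    return sum(contributions(disease, parents).values())


def semantic_similarity(d_j: str, d_i: str,
                        parents: Dict[str, List[str]]) -> float:
    """SD(d_j, d_i): shared contributions divided by the two semantic values."""
    c_j = contributions(d_j, parents)
    c_i = contributions(d_i, parents)
    shared = c_j.keys() & c_i.keys()
    numerator = sum(c_j[t] + c_i[t] for t in shared)
    return numerator / (sum(c_j.values()) + sum(c_i.values()))


sd = semantic_similarity("d1", "d2", PARENTS)
print(sd)        # 0.375 on this toy hierarchy
print(0.8 * sd)  # rescaled score for a hypothetical association score S = 0.8
```

In this toy hierarchy, `d1` and `d2` share the ancestors `A` and `root`, so the numerator is 1.5 and the denominator is 1.75 + 2.25 = 4, giving a similarity of 0.375. A disease compared with itself yields 1.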

4. Results

4.1. Performance Evaluation

The performance of NBCLDA in inferring potential associations between lncRNAs and diseases is evaluated by implementing LOOCV on experimentally verified lncRNA–disease associations. In each round, one known lncRNA–disease association is used as the test sample, whereas all the remaining associations are taken as training cases for model learning. This step is repeated until every known association has served as the test sample once. The area under the receiver operating characteristic (ROC) curve (AUC) is used to measure the overall performance of the method: the closer the AUC value is to 1, the better the performance, and an AUC value of 0.5 corresponds to a random guess. We calculate a series of true positive rates (TPR, or sensitivity) and false positive rates (FPR, or 1 − specificity) by setting different classification thresholds, and the ROC curve plots TPR against FPR. Specifically, TPR is the ratio of the successfully predicted lncRNA–disease associations to the total experimentally verified lncRNA–disease associations, and FPR is the percentage of candidate (unverified) lncRNAs ranked above the threshold, that is, one minus the percentage of candidates ranked below the threshold.
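As a rough illustration of how an ROC curve and its AUC can be obtained from ranked predictions, the following Python sketch sweeps a threshold over a list of scores. The function names and the toy scores are hypothetical; in an LOOCV setting the scores would come from the individual rounds, which are not reproduced here.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the threshold step by step.

    labels: 1 for a verified association (positive), 0 for a candidate.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Samples with identical scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores: two verified associations and two candidates.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(auc(roc_points(toy_scores, toy_labels)))  # 0.75
```

The value 0.75 matches the pairwise interpretation of the AUC: in three of the four positive/negative pairs, the verified association is scored higher than the candidate.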
First, to estimate the influence of adding new types of nodes and introducing the disease semantic similarity on the prediction of potential associations between lncRNAs and diseases, we implemented NBCLDA on the two constructed global networks $GN_1$ and $GN_2$ in the LOOCV framework. The simulation results are shown in Figure 4 and Figure 5. As shown in Figure 4, NBCLDA achieved an AUC of 0.8240 on $GN_1$ and an AUC of 0.8604 on $GN_2$ when the disease semantic similarity was not utilized. As shown in Figure 5, an AUC of 0.8519 on $GN_1$ and an AUC of 0.8819 on $GN_2$ were achieved when the disease semantic similarity was included. This demonstrates that the prediction performance of our method benefits both from the addition of the new types of nodes and from the introduction of the disease semantic similarity.
To further assess the performance of NBCLDA, we compared it with other state-of-the-art models, namely HGLDA [27], SIMCLDA [35], MFLDA [37], Yang et al.’s method [25], KATZLDA [38], and TPGLDA [26], in the LOOCV framework. For HGLDA, a data set consisting of 183 experimentally validated lncRNA–disease associations had previously been constructed and taken as the test set to evaluate its performance. Hence, for convenience, we compared NBCLDA with HGLDA on that data set in the LOOCV framework. The simulation results are shown in Table 1 and Figure 6, from which it is evident that our approach outperformed HGLDA. For the comparison with SIMCLDA, a data set consisting of 101 known lncRNA–disease associations between 30 lncRNAs and 79 diseases was extracted from the data set of 293 experimentally validated lncRNA–disease associations used by SIMCLDA. The selected lncRNAs and diseases all belong to $DS_3$ in our paper. The simulation results are shown in Table 1, from which it is evident that our approach outperformed SIMCLDA. For the comparison with MFLDA, the six relational data sources used by MFLDA (lncRNA–miRNA associations, lncRNA–gene function associations, lncRNA–disease associations, miRNA–gene interactions, miRNA–disease associations, and gene–disease associations) were collected to implement NBCLDA. The data set of experimentally validated lncRNA–disease associations was taken as the test set to evaluate performance. The simulation results are shown in Table 1, from which it is evident that our approach outperformed MFLDA.
Furthermore, we compared NBCLDA with Yang et al.’s method on the data set $DS_3$, which consists of 407 lncRNA–disease associations between 77 lncRNAs and 95 diseases. Following their description, we first deleted the nodes with a degree equal to 1, which yielded a data set consisting of 319 lncRNA–disease associations between 37 lncRNAs and 52 diseases. We then took this data set as the test set to compare the two methods in the LOOCV framework. The simulation results are shown in Figure 7, from which it can be seen that NBCLDA achieved an AUC of 0.9169 when implemented on $GN_2$, clearly higher than the AUC of 0.8568 achieved by Yang et al.’s method. We also compared NBCLDA with KATZLDA, a path-based method designed to predict potential lncRNA–disease associations by integrating multiple pieces of information, including known lncRNA–disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and the Gaussian interaction profile kernel similarity. Because we could not obtain the expression profiles of the corresponding lncRNAs when running the simulation, we compared the two methods without this information. The simulation results are shown in Figure 8, which indicates that NBCLDA achieves higher AUCs (0.8519 and 0.8829) than KATZLDA, whose corresponding AUC is 0.8323. This also demonstrates the advantage of our newly constructed prediction model, NBCLDA. Finally, for the comparison with TPGLDA, a data set consisting of 312 experimentally validated lncRNA–disease associations between 68 lncRNAs and 67 diseases and a data set consisting of 1941 gene–disease associations between 165 genes and 67 diseases were constructed. The data set of known lncRNA–disease associations was taken as the test set to evaluate performance.
The simulation results are shown in Table 1, from which it can be seen that TPGLDA achieves a better performance, with an AUC of 0.92 compared with an AUC of 0.8982 for NBCLDA. The main reason is probably that the consistence-based resource allocation algorithm of TPGLDA takes into account the contribution of resources moved in both directions. However, NBCLDA does not entirely rely on known lncRNA–disease associations and can integrate multiple data sources to predict potential associations.
To further evaluate the performance of NBCLDA, 20 percent of the known lncRNA–disease associations are randomly chosen as the training set, while the remaining known associations and all the unknown associations are taken as the testing set. We then compare NBCLDA with the six methods on the predicted top-k associations using the F1-score, which is a measure of a test’s accuracy [48]. Because the known lncRNA–disease associations are sparse, we set a different threshold k for each set of known associations when comparing with the other methods; the comparison results are shown in Table 2. Table 2 shows that NBCLDA outperforms several of the other methods in terms of F1-score. TPGLDA achieves higher values than our approach, which is likely because its consistence-based resource allocation algorithm takes into account resources moved in both directions. However, compared with TPGLDA, our new method does not entirely rely on known lncRNA–disease associations and can integrate multiple data sources to predict potential associations. These advantages may make it a useful addition to future biomedical research.
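The top-k F1-score can be sketched as follows, assuming that precision is the fraction of the top-k predictions that are true associations and recall is the fraction of true associations recovered in the top k. The function name and the toy pairs are hypothetical and are not taken from the paper's data.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (lncRNA, disease)


def f1_at_k(ranked_pairs: List[Pair], true_pairs: Set[Pair], k: int) -> float:
    """F1-score of the k highest-ranked predicted associations."""
    top_k = ranked_pairs[:k]
    hits = sum(1 for pair in top_k if pair in true_pairs)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions, already sorted by decreasing score.
ranked = [("lnc_a", "dis_1"), ("lnc_b", "dis_1"),
          ("lnc_c", "dis_2"), ("lnc_d", "dis_2")]
truth = {("lnc_a", "dis_1"), ("lnc_c", "dis_2"), ("lnc_e", "dis_3")}

for k in (2, 4):
    print(k, f1_at_k(ranked, truth, k))
```

With k = 2, one of the two predictions is correct (precision 1/2) and one of the three true associations is recovered (recall 1/3), so the F1-score is 0.4. Varying k, as done for Table 2, trades precision against recall.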

4.2. Case Studies

To further estimate the performance of NBCLDA, case studies of three lncRNA-related diseases (colorectal cancer, prostate cancer, and glioma) are analyzed in this section. In the simulation experiment, the known lncRNA–disease associations in the data set $DS_3$ were used as the training samples, while experimentally validated lncRNA–disease associations outside $DS_3$ were used for testing. The top 20 disease-related lncRNAs predicted by NBCLDA were verified against the relevant literature, and the corresponding evidence is listed in Table 3. In addition, the predicted results for the top 20 disease-related lncRNAs are presented in Supplementary Table S8.
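The selection of the top-ranked candidates for one disease can be sketched as below. This is only an illustration of the ranking step under stated assumptions: the function name, the placeholder identifiers, and the scores are hypothetical, and ties are broken alphabetically for reproducibility, which the paper does not specify.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (lncRNA, disease)


def top_k_candidates(scores: Dict[Pair, float], known: Set[Pair],
                     disease: str, k: int = 20) -> List[str]:
    """Rank lncRNAs for one disease by score, skipping known training pairs."""
    candidates = [
        (lnc, score)
        for (lnc, dis), score in scores.items()
        if dis == disease and (lnc, dis) not in known
    ]
    # Highest score first; alphabetical order breaks ties (an assumption).
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [lnc for lnc, _ in candidates[:k]]


# Hypothetical predicted scores and training associations.
toy_scores = {
    ("lnc_a", "dis_1"): 0.9,
    ("lnc_b", "dis_1"): 0.7,
    ("lnc_c", "dis_1"): 0.8,
    ("lnc_d", "dis_2"): 0.6,
}
toy_known = {("lnc_b", "dis_1")}

print(top_k_candidates(toy_scores, toy_known, "dis_1", k=2))  # ['lnc_a', 'lnc_c']
```

The returned list corresponds to the kind of ranked list that is then checked against the literature, as in Table 3.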
Colorectal cancer (CRC) is one of the most common cancer types in western countries and its morbidity increases with age [49]. Accumulating studies have shown that lncRNAs play important roles in several steps of carcinogenesis and cancer metastasis and are involved in various cancers including CRC [50,51]. Therefore, we applied NBCLDA to discover possible CRC-associated lncRNAs. As shown in Table 3, seven of the top 20 lncRNAs have been validated as related to colorectal cancer by recent biological literature, and five of them are ranked in the top 10 of the prioritized prediction results. The other two are the lncRNAs SNHG16 (ranked 12th) and TUG1 (ranked 18th). For example, Chen et al. indicated that the lncRNA XIST can regulate CRC development by competing for miR-200b-3p and may thus be considered a biomarker for prognosis [52]. Additionally, it has been demonstrated that the lncRNA MALAT1 may be considered a potential prognostic and therapeutic target for colorectal cancer patients, as it can fulfill a chemoresistant function in colorectal cancer [53]. Nakano et al. found that the epigenetic disruption and loss of imprinting of the lncRNA KCNQ1OT1 play a significant role in the occurrence of colorectal cancer [54]. Han et al. suggested that H19, which acts as a growth regulator, can be considered a candidate therapeutic biomarker and a new target for human CRC therapy [55].
Prostate cancer is the second most common cause of cancer-related mortality in males worldwide [56]. A growing number of studies show that lncRNAs have become promising targets for the treatment of cancers including prostate cancer [57,58]. Hence, we applied NBCLDA to uncover possible prostate cancer-associated lncRNAs; five of the top 20 predicted lncRNAs were verified according to the relevant literature and are listed in Table 3. For example, Ren et al. evaluated the expression of MALAT1 in prostate cancer and showed that it may be considered a prospective therapeutic target for refractory prostate cancer [59]. Zhu et al. found that the lncRNA H19 and its derived miRNA H19-miR-675 were significantly downregulated in advanced prostate cancer and may be used in the diagnosis and treatment of advanced prostate cancer, because H19-miR-675 could act as a suppressor of prostate cancer metastasis [60]. Additionally, Tian et al. showed that targeting the lncRNA NEAT1 axis could potentially be used to improve chemotherapy of prostate cancer [61].
Glioma is one of the most common malignant forms of brain tumors, and 6 out of 100,000 people may have gliomas [62]. Accumulating research has shown that lncRNAs play a significant role in the process of glioma development [63]. Therefore, we applied the NBCLDA to predict potential lncRNAs associated with glioma. Four of the top 20 glioma-related lncRNAs were validated by recent literature on biological experiments, and the results are illustrated in Table 3. For example, the lncRNA MALAT1 plays an important role in the progression and therapy of glioma and it may be considered an effective prognostic biomarker for the treatment of glioma [64]. Zhang et al. demonstrated that the lncRNA H19 was overexpressed in glioma tissue and cell lines, and also promotes cell proliferation of glioma [65]. Furthermore, Li et al. suggested that the lncRNA TUG1 can promote cell apoptosis of glioma cells and may act as a tumor suppressor in human glioma [66].

5. Discussion

Accumulating studies have indicated that lncRNAs play crucial roles in biological processes and in the diagnosis, prognosis, and treatment of complex diseases. Computational models that predict novel lncRNA–disease associations by integrating various kinds of biological data have therefore become an active research topic, as they help in understanding disease mechanisms at the lncRNA level. In this paper, we construct a global tripartite network and a global quadruple network by integrating various biological information and propose a novel approach, NBCLDA, to predict potential lncRNA–disease associations by applying the naïve Bayesian classifier to the two constructed networks. Compared with current models, NBCLDA does not entirely rely on known lncRNA–disease associations and can achieve a reliable performance with effective AUCs in the LOOCV framework. This means that our method can not only predict possible associations between lncRNAs and diseases that are included in the set of known associations, but can also predict potential associations whose elements are not in the known data set.
To evaluate the predictive performance of our method, LOOCV is implemented on the experimentally verified lncRNA–disease associations obtained from the MNDR database. The simulation results show that NBCLDA performs well and that its predictive accuracy is improved by the addition of new types of nodes and by the disease semantic similarity. They also show that NBCLDA achieves higher AUCs in the LOOCV framework than most of the state-of-the-art models it was compared with, the exception being TPGLDA. Moreover, to further estimate the performance of NBCLDA, case studies of colorectal cancer, prostate cancer, and glioma were carried out in this paper. These simulation results demonstrate that NBCLDA can be a useful tool for future biomedical research.
Despite the reliable experimental results of NBCLDA, our method also has some limitations. For example, the number of known experimentally validated lncRNA–disease associations is still limited, so the prediction performance of NBCLDA would be improved by a more comprehensive data set. Furthermore, the data sources in this paper need to be strictly preprocessed according to the proposed method, which restricts the richness of the data sources to a certain extent.

6. Conclusions

The main contributions of this paper can be summarized as follows: (1) we constructed a global tripartite network by integrating a variety of biological information including miRNA–disease, miRNA–lncRNA, and lncRNA–disease associations and interactions; (2) we constructed a global quadruple network by appending gene–lncRNA interaction, gene–disease association, and gene–miRNA interaction networks to the global tripartite network; (3) we developed a novel approach, NBCLDA, based on the naïve Bayesian classifier and applied it to the two global networks to predict potential lncRNA–disease associations; (4) we incorporated the disease semantic similarity into our newly constructed prediction model NBCLDA to further uncover potential relationships between lncRNAs and diseases; (5) NBCLDA can not only predict possible associations between lncRNAs and diseases that are included in the set of known associations, but can also predict potential associations whose elements are not in the known data set; (6) NBCLDA can integrate multiple heterogeneous biological data for discovering potential relationships between lncRNAs and diseases. In future work, more biological data can be collected and pre-processed for use in the newly proposed method for predicting potential lncRNA–disease associations.

Supplementary Materials

The following are available at https://www.mdpi.com/2073-4425/9/7/345/s1, Supplementary Table S1: The known miRNA–disease associations of the data set $DS_1$, consisting of 4704 miRNA–disease interactions which were collected from the HMDD database; Supplementary Table S2: The known miRNA–lncRNA associations of the data set $DS_2$, consisting of 9086 miRNA–lncRNA interactions which were collected from the starBase v2.0 database; Supplementary Table S3: The known lncRNA–disease associations of the data set $DS_3$, consisting of 407 lncRNA–disease associations which were downloaded from the MNDR v2.0 database; Supplementary Table S4: The known gene–disease associations of the data set $DS_4$, consisting of 3702 gene–disease associations which were gathered from the DisGeNET v5.0 database; Supplementary Table S5: The known gene–lncRNA associations of the data set $DS_5$, consisting of 411 gene–lncRNA interactions which were downloaded from the LncACTdb database; Supplementary Table S6: The known gene–miRNA associations of the data set $DS_6$, consisting of 565 gene–miRNA associations which were obtained from the miRecords database; Supplementary Table S7: The disease tree numbers of the data set $DS_7$, consisting of 373 diseases with their disease tree numbers which were gathered from the MeSH database; Supplementary Table S8: The results of the top 20 lncRNAs related to the three diseases. Supplementary Materials: Deep representation of the probabilistic scheme in our method.

Author Contributions

Conceptualization, J.Y. and L.W.; Methodology, J.Y., P.P. and L.W.; Validation, L.K., X.L. and Z.W.; Formal Analysis, J.Y. and L.W.; Investigation, L.K. and Z.W.; Resources, P.P. and Z.W.; Data Curation, J.Y. and P.P.; Writing—Original Draft Preparation, J.Y. and P.P.; Writing—Review and Editing, L.W. and X.L.; Supervision, L.W.; Project Administration, L.K. and X.L.; Funding Acquisition, L.W.

Funding

This research is partly sponsored by the Natural Science Foundation of Hunan Province (No. 2018JJ4058, No. 2017JJ5036), the National Natural Science Foundation of China (No. 61640210, No. 61672447), the project of “12th Five-Year” planning of Education Science in Hunan Province (No. XJK015BZY031), and the CERNET Next Generation Internet Technology Innovation Project (No. NGII20160305, No. NGII20170109).

Acknowledgments

The authors thank the anonymous referees for suggestions that helped improve the paper substantially.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Li, Y.; Zhang, J.; Pan, J.; Feng, X.; Duan, P.; Yin, X.; Xu, Y.; Wang, X.; Zou, S. Insights into the roles of lncRNAs in skeletal and dental diseases. Cell Biosci. 2018, 8, 8.
  2. Garitano-Trojaola, A.; Agirre, X.; Prósper, F.; Fortes, P. Long non-coding RNAs in haematological malignancies. Int. J. Mol. Sci. 2013, 14, 15386.
  3. Guttman, M.; Russell, P.; Ingolia, N.T.; Weissman, J.S.; Lander, E.S.R. Ribosome Profiling Provides Evidence that Large Noncoding RNAs Do Not Encode Proteins. Cell 2013, 154, 240–251.
  4. Guttman, M.; Rinn, J.L. Modular regulatory principles of large non-coding RNAs. Nature 2012, 482, 339–346.
  5. Wang, K.C.; Chang, H.Y. Molecular mechanisms of long noncoding RNAs. Mol. Cell 2011, 43, 904–914.
  6. Wapinski, O.; Chang, H.Y. Long noncoding RNAs and human disease. Trends Cell Biol. 2011, 21, 354–361.
  7. Derrien, T.; Johnson, R.; Bussotti, G.; Tanzer, A.; Djebali, A.; Tilgner, H.; Guernec, G.; Martin, D.; Merkel, A.; Knowles, D. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res. 2012, 22, 1775–1789.
  8. Zhao, W.; Luo, J.; Jiao, S. Comprehensive characterization of cancer subtype associated long non-coding RNAs and their clinical implications. Sci. Rep. 2014, 4, 6591.
  9. Cheetham, S.W.; Gruhl, F.; Mattick, J.S.; Dinger, M.E. Long noncoding RNAs and the genetics of cancer. Br. J. Cancer 2013, 108, 2419–2425.
  10. Mercer, T.R.; Dinger, M.E.; Mattick, J.S. Long non-coding RNAs: Insights into functions. Nat. Rev. Genet. 2009, 10, 155–159.
  11. Taft, R.J.; Pang, K.C.; Mercer, T.R.; Dinger, M.; Mattick, J.S. Non-coding RNAs: Regulators of disease. J. Pathol. 2010, 220, 126–139.
  12. Zhang, Q.; Chen, C.Y.; Yedavalli, V.S.; Jeang, K.T. NEAT1 long noncoding RNA and paraspeckle bodies modulate HIV-1 posttranscriptional expression. MBio 2013, 4, e00596-12.
  13. Pasmant, E.; Sabbagh, A.; Vidaud, M.; Biéche, I. ANRIL, a long, noncoding RNA, is an unexpected major hotspot in GWAS. FASEB J. 2011, 25, 444–448.
  14. Faghihi, M.A.; Modarresi, F.; Khalil, A.M.; Wood, D.E.; Sahagan, B.G.; Morgan, T.E.; Finch, C.E.; Laurent, G.S.; Kenny, P.J.; Wahlestedt, C. Expression of a noncoding RNA is elevated in Alzheimer’s disease and drives rapid feed-forward regulation of b-secretase. Nat. Med. 2008, 14, 723–730.
  15. Malih, S.; Saidijam, M.; Malih, N. A brief review on long noncoding RNAs: a new paradigm in breast cancer pathogenesis, diagnosis and therapy. Tumor Biol. 2016, 37, 1479–1485.
  16. Cui, Z.; Ren, S.; Lu, J.; Wang, F.; Xu, W.; Sun, Y.; Wei, M.; Chen, J.; Gao, X.; Xu, C.; Mao, J.H.; Sun, Y. The prostate cancer-up-regulated long noncoding RNA PlncRNA-1 modulates apoptosis and proliferation through reciprocal regulation of androgen receptor. Urol. Oncol. 2013, 31, 1117–1123.
  17. Wang, J.; Liu, X.; Wu, H.; Ni, P.; Gu, Z.; Qiao, Y.; Chen, N.; Sun, F.; Fan, Q. CREB up-regulates long noncoding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Res. 2010, 38, 5366–5383.
  18. Ma, Z.; Xue, S.; Zeng, B.; Qiu, D. lncRNA SNHG5 is associated with poor prognosis of bladder cancer and promotes bladder cancer cell proliferation through targeting p27. Oncol. Lett. 2018, 15, 1924–1930.
  19. Spizzo, R.; Almeida, M.I.; Colombatti, A.; Calin, G.A. Long non-coding RNAs and cancer: A new frontier of translational research? Oncogene 2012, 31, 4577–4587.
  20. Gupta, R.A.; Shah, N.; Wang, K.C.; Kim, J.; Horlings, H.M.; Wong, D.J.; Tsai, M.C.; Hung, T.; Argani, P.; Rinn, J.L. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 2010, 464, 1071–1076.
  21. Gutschner, T.; Heammerle, M.; Eissmann, M.; Hsu, J.; Kim, Y.; Hung, G.; Revenko, A.; Arun, G.; Stentrup, M.; Gross, M. The noncoding RNA MALAT1 is a critical regulator of the metastasis phenotype of lung cancer cells. Cancer Res. 2013, 73, 1180–1189.
  22. Lottin, S.; Adriaenssens, E.; Berteaux, N.; Leprêtre, A.; Vilain, M.O.; Denhez, E.; Coll, J.; Dugimont, T.; Curgy, J.J. The human H19 gene is frequently overexpressed in myometrium and stroma during pathological endometrial proliferative events. Eur. J. Cancer 2005, 41, 168–177.
  23. Sun, J.; Shi, H.; Wang, Z.; Zhang, C.; Liu, L.; Wang, L.; He, W.; Hao, D.; Liu, S.; Zhou, M. Inferring novel lncRNA–disease associations based on a random walk model of a lncRNA functional similarity network. Mol. Biosyst. 2014, 10, 2074–2081.
  24. Ping, P.; Wang, L.; Kuang, L.; Ye, S.; Iqbal, M.F.B. A Novel Method based on lncRNA–disease association Network for LncRNA–Disease Association Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018.
  25. Yang, X.; Gao, L.; Guo, X.; Shi, X.; Wu, H.; Song, F.; Wang, B. A network based method for analysis of lncRNA-disease associations and prediction of lncRNAs implicated in diseases. PLoS ONE 2014, 9, e87797.
  26. Ding, L.; Wang, M.; Sun, D.; Li, A. TPGLDA: Novel prediction of associations between lncRNAs and diseases via lncRNA–disease-gene tripartite graph. Sci. Rep. 2018, 8, 1065.
  27. Chen, X. Predicting lncRNA–disease associations and constructing lncRNA functional similarity network based on the information of miRNA. Sci. Rep. 2015, 5, 13186.
  28. Liu, M.X.; Chen, X.; Chen, G.; Cui, Q.H.; Yan, G.Y. A computational framework to infer human disease-associated long noncoding RNAs. PLoS ONE 2014, 9, e84408.
  29. Li, J.W.; Cheng, G.; Wang, Y.C.; Gao, C.; Wang, Y.; Ma, W.; Tu, J.; Wang, J.; Chen, Z.; Kong, W.; Cui, Q. A bioinformatics method for predicting long noncoding RNAs associated with vascular disease. Sci. China Life Sci. 2014, 57, 852–857.
  30. Gu, C.; Liao, B.; Li, X.; Cai, L.; Li, Z.; Li, K.; Yang, J. Global network random walk for predicting potential human lncRNA–disease associations. Sci. Rep. 2017, 7, 12442.
  31. Gligorijević, V.; Pržulj, N. Methods for biological data integration: perspectives and challenges. J. R. Soc. Interface 2015, 12, 20150571.
  32. Zeng, X.; Ding, N.; Rodríguez-Patón, A.; Zou, Q. Probability-based collaborative filtering model for predicting gene–disease associations. BMC Med. Genom. 2017, 10 (Suppl. 5), 76.
  33. Zhao, H.; Kuang, L.; Wang, L.; Ping, P.; Xuan, Z.; Pei, T.; Wu, Z. Prediction of microRNA-disease associations based on distance correlation set. BMC Bioinform. 2018, 19, 141.
  34. Zeng, X.; Zhang, X.; Zou, Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Brief. Bioinform. 2016, 17, 193.
  35. Lu, C.; Yang, M.; Luo, F.; Wu, F.X.; Li, M.; Pan, Y.; Li, Y.; Wang, J. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics 2018.
  36. Zhang, J.; Zhang, Z.; Chen, Z.; Deng, L. Integrating multiple heterogeneous networks for novel lncRNA-disease association inference. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017.
  37. Fu, G.; Wang, J.; Domeniconi, C.; Yu, G. Matrix factorization based data fusion for the prediction of lncRNA–disease associations. Bioinformatics 2017.
  38. Chen, X. KATZLDA: KATZ measure for the lncRNA–disease association prediction. Sci. Rep. 2015, 5, 16840.
  39. Li, Y.; Qiu, C.; Tu, J.; Geng, B.; Yang, J.; Jiang, T.; Cui, Q. HMDD v2.0: A database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014, 42, D1070.
  40. Li, J.H.; Liu, S.; Zhou, H.; Qu, L.H.; Yang, J.H. starBase v2.0: Decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014, 42, D92.
  41. Cui, T.; Lin, Z.; Yan, H.; Yi, Y.; Tan, P.; Zhao, Y.; Hu, Y.; Xu, L.; Li, E.; Wang, D. MNDR v2.0: An updated resource of ncRNA–disease associations in mammals. Nucleic Acids Res. 2018, 46, D371–D374.
  42. Piñero, J.; Bravo, À.; Queralt-Rosinach, N.; Gutierrez-Sacristan, A.; Deu-Pons, J.; Centeno, E.; Garcia-Garcia, J.; Sanz, F.; Furlong, L.I. DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2017, 45, D833–D839.
  43. Wang, P.; Ning, S.; Zhang, Y.; Li, R.; Ye, J.; Zhao, Z.; Zhi, H.; Wang, T.; Guo, Z.; Li, X. Identification of lncRNA-associated competing triplets reveals global patterns and prognostic markers for cancer. Nucleic Acids Res. 2015, 43, 3478.
  44. Xiao, F.; Zuo, Z.; Cai, G.; Kang, S.; Gao, X.; Li, T. miRecords: An integrated resource for microRNA-target interactions. Nucleic Acids Res. 2008, 37, D105–D110.
  45. U.S. National Library of Medicine. Medical Subject Headings 2018 [Internet]. Available online: https://meshb.nlm.nih.gov/search (accessed on 6 July 2018).
  46. Liu, Z.; Zhang, Q.; Lu, L.; Zhou, T. Link prediction in complex networks: A local naïve Bayes model. Europhys. Lett. 2011, 96, 48007.
  47. Wang, D.; Wang, J.; Lu, M.; Song, F.; Cui, Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010, 26, 1644–1650.
  48. Luo, J.; Ding, P.; Liang, C.; Cao, B.; Chen, X. Collective prediction of disease-associated miRNAs based on transduction learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 14, 1468–1475.
  49. Berger, F.G. Interview: Screening and treatment for colorectal cancer. Colorectal Cancer 2013, 2, 117–120.
  50. Prensner, J.R.; Chinnaiyan, A.M. The emergence of lncRNAs in cancer biology. Cancer Discov. 2011, 1, 391–407.
  51. Gutschner, T.; Diederichs, S. The hallmarks of cancer: A long non-coding RNA point of view. RNA Biol. 2012, 9, 703–719.
  52. Chen, D.L.; Chen, L.Z.; Lu, Y.X.; Zhang, D.S.; Zeng, Z.L.; Pan, Z.Z.; Huang, P.; Wang, F.H.; Li, Y.H.; Ju, H.Q. Long noncoding RNA XIST expedites metastasis and modulates epithelial-mesenchymal transition in colorectal cancer. Cell Death Dis. 2017, 8, e3011.
  53. Li, P.; Zhang, X.; Wang, H.; Wang, L.; Liu, T.; Du, L.; Yang, Y.; Wang, C. MALAT1 is associated with poor response to oxaliplatin-based chemotherapy in colorectal cancer patients and promotes chemoresistance through EZH2. Mol. Cancer Ther. 2017, 16, 739–751.
  54. Nakano, S.; Murakami, K.; Meguro, M.; Soejima, H.; Higashimoto, K.; Urano, T.; Kugoh, H.; Mukai, T.; Ikeguchi, M.; Oshimura, M. Expression profile of LIT1/KCNQ1OT1, and epigenetic status at the KvDMR1 in colorectal cancers. Cancer Sci. 2006, 97, 1147–1154.
  55. Han, D.; Gao, X.; Wang, M.; Qiao, Y.; Xu, Y.; Yang, J.; Dong, N.; He, J.; Sun, Q.; Lv, G. Long noncoding RNA H19 indicates a poor prognosis of colorectal cancer and promotes tumor growth by recruiting and binding to eIF4A3. Oncotarget 2016, 7, 22159–22173.
  56. Weiss, M.; Plass, C.; Gerhauser, C. Role of lncRNAs in prostate cancer development and progression. Biol. Chem. 2014, 395, 1275–1290.
  57. Yang, G.; Lu, X.; Yuan, L. LncRNA: A link between RNA and cancer. Biochim. Biophys. Acta 2014, 1839, 1097–1109.
  58. Chakravarty, D.; Sboner, A.; Nair, S.S.; Giannopoulou, E.; Li, R.; Hennig, S.; Mosquera, J.M.; Pauwels, J.; Park, K.; Kossai, M. The oestrogen receptor alpha-regulated lncRNA NEAT1 is a critical modulator of prostate cancer. Nat. Commun. 2014, 5, 5383.
  59. Ren, S.; Liu, Y.; Xu, W.; Sun, Y.; Lu, J.; Wang, F.; Wei, M.; Shen, J.; Hou, J.; Gao, X.; Xu, C. Long noncoding RNA MALAT-1 is a new potential therapeutic target for castration resistant prostate cancer. J. Urol. 2013, 190, 2278–2287.
  60. Zhu, M.; Chen, Q.; Liu, X.; Sun, Q.; Zhao, X.; Deng, R.; Wang, Y.; Huang, J.; Xu, M.; Yan, J.; Yu, J. lncRNA H19/miR-675 axis represses prostate cancer metastasis by targeting TGFBI. FEBS J. 2014, 281, 3766–3775.
  61. Tian, X.; Zhang, G.; Zhao, H.; Li, Y.; Zhu, C. Long non-coding RNA NEAT1 contributes to docetaxel resistance of prostate cancer through inducing RET expression by sponging miR-34a. RSC Adv. 2017, 7, 42986–42996.
  62. Boele, F.W.; Rooney, A.G.; Grant, R.; Klein, M. Psychiatric symptoms in glioma patients: from diagnosis to management. Neuropsychiatr. Dis. Treat. 2015, 11, 1413–1420.
  63. Zhou, Q.; Liu, J.; Quan, J.; Liu, W.; Tan, H.; Li, W. lncRNAs as potential molecular biomarkers for the clinicopathology and prognosis of glioma: A systematic review and meta-analysis. Gene 2018.
  64. Ma, K.X.; Wang, H.J.; Li, X.R.; Li, T.; Su, G.; Yang, P.; Wu, J.W. Long noncoding RNA MALAT1 associates with the malignant status and poor prognosis in glioma. Tumour Biol. J. Int. Soc. Oncodev. Biol. Med. 2015, 36, 3355–3359.
  65. Zhang, T.; Wang, Y.R.; Zeng, F.; Cao, H.Y.; Zhou, H.D.; Wang, Y.J. LncRNA H19 is overexpressed in glioma tissue, is negatively associated with patient survival, and promotes tumor growth through its derivative miR-675. Eur. Rev. Med. Pharmacol. Sci. 2016, 20, 4891–4897.
  66. Li, J.; Zhang, M.; An, G.; Ma, Q. LncRNA TUG1 acts as a tumor suppressor in human glioma by promoting cell apoptosis. Exp. Biol. Med. 2016, 241, 644.
Figure 1. The relationship between the different data sources and number of data points.
Figure 2. The flowchart of NBCLDA. In the diagram, the green circles, blue squares, orange triangles, and purple diamonds represent lncRNAs, diseases, miRNAs, and genes, respectively. (a) construction of the MDN, MLN, and LDN; (b) construction of the global tripartite network $GN_1$ by integrating the MDN, MLN, and LDN; (c) construction of the GDN, GLN, and GMN; (d) construction of the global quadruple network $GN_2$ by appending the GDN, GLN, and GMN to $GN_1$; (e,f) construction of the potential lncRNA–disease association network by using NBCLDA-$GN_1$ and NBCLDA-$GN_2$; (g,h) inference of potential lncRNA–disease associations by using disease semantic similarity. Here, in (e–h), the known lncRNA–disease associations are represented as solid edges, and the candidate lncRNA–disease associations are represented as dashed edges.
Figure 3. (a) a subnetwork of Figure 2b, in which the common neighboring nodes $m_1$ and $m_3$ between $l_2$ and $d_3$ are assumed to be conditionally independent; (b) a subnetwork of Figure 2d, in which $m_1$, $m_3$, $g_1$, and $g_4$ are the common neighboring nodes between $l_2$ and $d_3$. Here, $m_3$–$g_4$, $m_1$, and $g_1$ are assumed to be conditionally independent.
Figure 4. Performance evaluation for the NBCLDA in terms of ROC curves and AUCs based on the experimentally known associations (data set $DS_3$), in the framework of LOOCV. Here, NBCLDA-$GN_1$ and NBCLDA-$GN_2$ represent the simulation results while implementing our algorithm on the global networks $GN_1$ and $GN_2$, respectively.
Figure 5. Same as Figure 4, but additionally including disease semantic similarity. Here, NBCLDA-$GN_1$-SD and NBCLDA-$GN_2$-SD represent the simulation results when appending the disease semantic similarity to the NBCLDA on networks $GN_1$ and $GN_2$, respectively.
Figure 6. The performance of the NBCLDA in terms of ROC curves and AUCs based on 183 known lncRNA–disease associations, in the framework of the LOOCV.
Figure 7. Comparison of the performance of the NBCLDA and Yang et al.’s method in terms of ROC curves and AUCs based on a data set of 319 lncRNA–disease associations between 37 lncRNAs and 52 diseases in the framework of the LOOCV.
Figure 8. Comparison of the performance of the NBCLDA and KATZLDA approaches in terms of ROC curves and AUCs based on data set $DS_3$, in the framework of the LOOCV.
Table 1. Performance comparisons between the NBCLDA and other state-of-the-art models in terms of AUCs based on the different data sets of known lncRNA–disease associations in the framework of the LOOCV.

| Methods | AUCs | Methods | AUCs |
|---|---|---|---|
| NBCLDA-$GN_2$-SD | 0.8982 | NBCLDA-$GN_2$-SD | 0.9169 |
| HGLDA | 0.7621 | Yang et al. method | 0.8568 |
| NBCLDA-$GN_2$-SD | 0.8897 | NBCLDA-$GN_2$-SD | 0.8829 |
| SIMCLDA | 0.8526 | KATZLDA | 0.8283 |
| NBCLDA-$GN_2$-SD | 0.8704 | NBCLDA-$GN_2$-SD | 0.8897 |
| MFLDA | 0.7945 | TPGLDA | 0.92 |
Table 2. F1-scores of NBCLDA, SIMCLDA, MFLDA, Yang et al.’s method, KATZLDA, and TPGLDA at different top-k cutoffs. Each NBCLDA row is paired with the competing method listed directly below it.

| Methods | F1-score (first cutoff) | F1-score (second cutoff) | F1-score (third cutoff) |
|---|---|---|---|
| NBCLDA | 0.1536 (k = 20) | 0.1582 (k = 40) | null (k = 60) |
| SIMCLDA | 0.0635 (k = 20) | 0.0482 (k = 40) | null (k = 60) |
| NBCLDA | 0.1773 (k = 20) | 0.2415 (k = 40) | null (k = 60) |
| MFLDA | 0.2012 (k = 20) | 0.1139 (k = 40) | null (k = 60) |
| NBCLDA | 0.2575 (k = 20) | 0.2855 (k = 34) | null (k = 60) |
| Yang et al.’s method | 0.2707 (k = 20) | 0.2769 (k = 34) | null (k = 60) |
| NBCLDA | 0.1183 (k = 20) | 0.1088 (k = 40) | 0.1139 (k = 60) |
| KATZLDA | 0.1274 (k = 20) | 0.0869 (k = 40) | 0.0779 (k = 60) |
| NBCLDA | 0.1295 (k = 20) | 0.1510 (k = 40) | 0.1320 (k = 60) |
| TPGLDA | 0.2070 (k = 20) | 0.1644 (k = 40) | 0.1301 (k = 60) |
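As a rough illustration of what an F1-score at a top-k cutoff measures, the sketch below uses the standard definition: precision and recall are computed over the k highest-ranked candidates, then combined as their harmonic mean. This is not taken from the paper; the function name `f1_at_k`, the placeholder lncRNA identifiers, and the choice to score a single ranked list against one set of known positives are all assumptions made for the example, and the exact evaluation protocol behind Table 2 may differ.

```python
def f1_at_k(ranked_candidates, known_positives, k):
    """F1-score of the top-k ranked candidates against a set of known positives.

    Hypothetical helper: standard precision/recall at k, not the paper's code.
    """
    if k <= 0 or not known_positives:
        return 0.0
    top_k = ranked_candidates[:k]
    # Number of top-k candidates that are confirmed associations.
    hits = sum(1 for candidate in top_k if candidate in known_positives)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(known_positives)
    return 2 * precision * recall / (precision + recall)


# Placeholder identifiers, ordered by decreasing predicted score.
ranked = ["lnc_a", "lnc_b", "lnc_c", "lnc_d", "lnc_e"]
known = {"lnc_a", "lnc_c", "lnc_e"}

print(f1_at_k(ranked, known, 2))  # precision 1/2, recall 1/3
print(f1_at_k(ranked, known, 4))  # precision 2/4, recall 2/3
```

Because precision usually falls and recall rises as k grows, the F1-score can move in either direction between cutoffs, which matches the non-monotonic rows in the table.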
Table 3. The lncRNAs in the top 20 for the three case studies.

| Disease | lncRNA | Evidence (PMID) | Rank |
|---|---|---|---|
| Colorectal cancer | XIST | 17143621 | 1 |
| Colorectal cancer | MALAT1 | 25446987, 25031737, 21503572, 25025966, 24244343, 26887056 | 3 |
| Colorectal cancer | KCNQ1OT1 | 16965397 | 6 |
| Colorectal cancer | H19 | 11120891, 19926638, 22427002, 26068968, 26989025 | 8 |
| Colorectal cancer | NEAT1 | 26314847 | 9 |
| Colorectal cancer | SNHG16 | 24519959 | 12 |
| Colorectal cancer | TUG1 | 26856330 | 18 |
| Prostate cancer | MALAT1 | 23845456, 23726266, 26516927, 22349460 | 3 |
| Prostate cancer | KCNQ1OT1 | 23728290 | 6 |
| Prostate cancer | H19 | 24063685, 24988946 | 8 |
| Prostate cancer | NEAT1 | 23728290, 25415230 | 10 |
| Prostate cancer | TUG1 | 26975529 | 19 |
| Glioma | MALAT1 | 26649278, 25613066, 26619802, 27134488, 26938295 | 4 |
| Glioma | H19 | 24466011, 26983719 | 6 |
| Glioma | TUG1 | 25645334, 27363339 | 10 |
| Glioma | NEAT1 | 26582084 | 12 |

Yu, J.; Ping, P.; Wang, L.; Kuang, L.; Li, X.; Wu, Z. A Novel Probability Model for LncRNA–Disease Association Prediction Based on the Naïve Bayesian Classifier. Genes 2018, 9, 345. https://doi.org/10.3390/genes9070345