1. Introduction
The protein signaling network describes the interactions between different types of proteins. It is critical for discovering unknown effects or even unobserved components in molecular biology [1]. However, due to the complexity of the signaling pathways, it is difficult to achieve this goal via biological experiments alone [2]. Fortunately, the network can be obtained in another way, by analyzing variables such as protein or phospholipid expression levels in cells [2]. Therefore, the construction of protein signaling networks has become an important issue in current bioinformatics research. Since the Bayesian network is an effective tool for modeling such multivariate relationships [3], the construction of protein signaling networks is equivalent to the structure learning problem of Bayesian networks.
Most existing Bayesian network structure learning methods fall into two categories: constraint-based methods and score-based methods [3]. Constraint-based methods examine the conditional independence relations among the observed variables [4] and select the graph structures that best express the conditional independence found in the data [5]. In contrast, score-based approaches treat structure learning as a model selection problem [6], and the results are obtained by solving optimization problems with loss functions and constraints.
However, there are two major problems in building protein signaling networks with these classic Bayesian network methods. First, a network structure that represents the correct joint probability does not guarantee correct causal directions: Bayesian networks with different causal structures may represent identical joint and conditional probabilities [7]. Second, the structure learning of Bayesian networks is a combinatorial optimization problem with high computational complexity. The search space of graph structures grows super-exponentially with d, where d is the number of observed variables, i.e., the number of nodes in the graph, so the problem is NP-hard [8].
Many studies in recent years address both of the above issues. On the topic of Bayesian networks with causality, the structure learning methods developed in [9,10] assume that users can obtain additional data by intervening on some of the variables during structure learning, which is unfortunately infeasible when building protein signaling networks. On the other hand, methods that use purely observational data may place extra assumptions on the data distribution. For example, ref. [11] assumes that the observations are produced by a linear structural equation model with non-Gaussian additive noise. Such assumptions are not consistent with protein signaling data. Score-based methods with black-box neural network models have also been developed in this field [12,13]. The results of these neural network-based methods depend heavily on the initial positions in the search space as well as on many other hyper-parameters, and they require a huge amount of computation. On the topic of reducing the complexity of structure learning, improved methods with greedy search policies [14,15] and heuristic policies [16] can be found in the literature to accelerate the search. A very different technique is used in [17], where the combinatorial optimization problem of network structure learning is converted into a continuous optimization problem to reduce the computational complexity. This technique is leveraged in the method proposed in this paper.
Thus, a new Bayesian network structure learning algorithm is proposed in this paper to construct protein signaling networks. First, a causal direction matrix is calculated by the algorithm in [18], since the relationships between any two variables in protein signaling networks are very likely to be non-linear. Second, the causal direction matrix is used as one of the constraint functions in a continuous optimization problem that minimizes the fitting loss of a structural equation model [17]. Finally, a pruning method is developed to keep the structure matrix sparse. In the experiment section, we show that the proposed method outperforms existing methods in both accuracy and computational efficiency.
In the following part of this paper, preliminaries will be given in Section 2, the details of the proposed method will be given in Section 3, and experimental results will be demonstrated in Section 4.
3. Methods
As described in Section 1, a continuous optimization with causal direction matrix constraints is proposed in this work to construct the structure of the Bayesian network. In this section, the computation of the causal direction matrix is described first; the optimization problem, as well as its numerical solving algorithm, is developed next; and the pruning method is shown at the end of this section.
3.1. Causal Direction Matrix
The causal direction constraint matrix G is calculated in this section. If variable X_i is a cause of variable X_j, the (i, j)-th element of the matrix is 1; otherwise, it is set to zero. Note that this causal relationship can be either direct or indirect. This matrix is able to exclude edges with incorrect causal directions if the following equation holds.
In (9), Ḡ is the complementary matrix of G, in which every 1 turns into 0 and every 0 turns into 1. When (9) holds, all the causal directions in the matrix W are identical to those in the matrix G. The elements of the causal direction matrix G can be calculated by (3) and (4). However, since the denominators of (3) and (4) may be zero if duplicated elements occur [19], the actual calculation is performed as in Equations (10) and (11).
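For illustration, the constraint in (9) can be sketched numerically. A plausible form of the constraint, consistent with the description above, is that the sum of |W| over the positions forbidden by G must be zero; the function names and the exact penalty form below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def complement(G):
    # Complementary matrix: every 1 of G becomes 0 and every 0 becomes 1.
    # The diagonal is zeroed here, since self-loops are excluded by acyclicity
    # anyway (an assumption of this sketch).
    Gc = 1 - G
    np.fill_diagonal(Gc, 0)
    return Gc

def direction_penalty(W, G):
    # Zero exactly when every nonzero weight of W sits at a position
    # whose causal direction is allowed by G.
    return float(np.sum(np.abs(W) * complement(G)))

G = np.array([[0, 1], [0, 0]])               # direction X1 -> X2 allowed
W_ok  = np.array([[0.0, 0.8], [0.0, 0.0]])   # agrees with G
W_bad = np.array([[0.0, 0.0], [0.8, 0.0]])   # edge against G's direction
print(direction_penalty(W_ok, G))   # 0.0 -> constraint satisfied
print(direction_penalty(W_bad, G))  # 0.8 -> constraint violated
```

A penalty of this shape can be driven to zero inside a continuous optimizer, which is how the causal direction matrix enters the problem of Section 3.2.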
Let v_k (k = 1, …, m) be the kth largest value after removing all duplicate elements, where m is the number of distinct values in the vector of the variable under consideration, and let n_k denote the number of occurrences of v_k in the original dataset. Then (3) can be replaced by (10), and similarly, (4) can be replaced by (11).
The matrix G is initialized as a d × d zero matrix. Once the results of (10) and (11) are calculated, the (i, j)-th element in the corresponding position of the causal direction matrix G can be determined by Equation (12), where 1(·) is the indicator function. G_ij = 1 means that the causal direction is from X_i to X_j; otherwise, G_ji = 1 means that the causal direction is from X_j to X_i. Note again that this causal direction could be either direct or indirect. The pseudo-code for the computation of the causal direction matrix is given in Algorithm 1.
Algorithm 1: Causal Direction Matrix
The matrix Ḡ is the complementary boolean matrix of G, with every 1 turned into 0 and every 0 turned into 1.
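A skeleton of this computation might look as follows. The pairwise score of Equations (10) and (11) is left as a user-supplied function, since the exact scores come from the method of [18]; both `pairwise_score` and the convention that the smaller score indicates the likelier cause are placeholder assumptions of this sketch.

```python
import numpy as np

def build_direction_matrix(X, pairwise_score):
    """Build the causal direction matrix G for the d columns of X.

    pairwise_score(xi, xj) is a stand-in for Eqs. (10) and (11): it returns
    a pair (s_ij, s_ji) of direction scores; the indicator of Eq. (12) then
    keeps the direction with the stronger evidence.
    """
    d = X.shape[1]
    G = np.zeros((d, d), dtype=int)
    for i in range(d):
        for j in range(i + 1, d):
            s_ij, s_ji = pairwise_score(X[:, i], X[:, j])
            if s_ij < s_ji:
                G[i, j] = 1   # direction X_i -> X_j
            else:
                G[j, i] = 1   # direction X_j -> X_i
    return G
```

Any pairwise causal direction test can be plugged in; the loop itself only enforces that exactly one direction is recorded per pair.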
3.2. Optimization Problem for Structure Learning
A least-squares fitting loss of an SEM (structural equation model) is used in this section to construct an optimization problem for structure learning. At the same time, an ℓ1-regularization term is added to restrict the absolute values of the weights. The optimization problem can be written as follows.
There are two constraints in this optimization problem. The first is the acyclicity constraint, which ensures that the directed graph constructed from the weighted adjacency matrix W is acyclic; its details have been given in Section 2.2. The second is the causal direction matrix constraint, which ensures that the predicted graph has causal directions identical to those in the causal direction matrix G; the computation of G has been described in detail in Section 3.1.
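The acyclicity constraint referenced in Section 2.2 originates from [17], where h(W) = tr(e^{W∘W}) − d is zero if and only if the weighted graph W contains no directed cycle. A small sketch of this value and its gradient (the gradient formula is the one given in [17]):

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """NOTEARS-style acyclicity measure h(W) = tr(e^{W*W}) - d and its gradient,
    where W*W is the elementwise (Hadamard) square of W."""
    d = W.shape[0]
    E = expm(W * W)                       # matrix exponential of the Hadamard square
    return np.trace(E) - d, E.T * 2 * W   # value, gradient

W_dag   = np.array([[0.0, 1.0], [0.0, 0.0]])   # a DAG: 1 -> 2
W_cycle = np.array([[0.0, 1.0], [1.0, 0.0]])   # a 2-cycle: 1 <-> 2
print(acyclicity(W_dag)[0])    # ~0: acyclic
print(acyclicity(W_cycle)[0])  # > 0: contains a cycle
```

Because h and its gradient are smooth in W, this constraint can be handled by the Lagrangian and gradient descent machinery described next.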
To solve the above optimization problem, i.e., to minimize the objective function subject to the specified equality constraints, the Lagrange multiplier method and the gradient descent method are used.
where
All the terms in the above objective function are differentiable, and their gradients can be written as follows.
Then the gradient with respect to the above optimization problem can be written as follows.
This optimization problem can thus be solved by many mathematical tools; an algorithm based on gradient descent is given in Algorithm 2.
Algorithm 2: Gradient Descent Algorithm for Optimization of Problem (14)
In the experiment section, the values of the hyper-parameters in Algorithm 2 are set as follows. The step sizes , and are all set to be 1, the tolerance threshold is set to be , and the threshold is set to be 0.2. and , where the initial values of and are set to 0, and is an all-zero matrix.
3.3. Pruning Algorithm
A pruning method is also proposed to make the final result graph sparser. It is inspired by the concept of Granger causality [22], which uses the absolute values of the coefficients of a regression model to determine whether one variable is a cause of another. A second-order polynomial regression model is constructed for each node that has parent nodes. If, for a parent node in one of these models, the regression coefficients of its linear term, second-order term, and cross terms are all sufficiently small, the node can be safely removed from this regression model.
The parent nodes of a given node are indexed by k, where k = 1, …, l: since a node may have multiple parents, k denotes the kth parent and l the total number of parents. We perform a second-order polynomial expansion on these parent nodes and fit the result with a linear regression model, which leads to the following Equation (21).
In practice, we first expand the data matrix to contain the quadratic and cross terms, after which any linear regression algorithm can be applied. The pruning procedure is given in Algorithm 3.
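One possible realization of this pruning step, using scikit-learn's polynomial expansion, is sketched below. The default threshold value and the keep/drop convention are illustrative assumptions; the paper's actual threshold is set in the experiment section.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def prune_parents(X_parents, y, threshold=0.1):
    """Drop a parent when all of its regression coefficients (linear,
    quadratic, and cross terms) fall below `threshold` in absolute value.
    Returns a boolean keep-mask over the parent columns."""
    poly = PolynomialFeatures(degree=2, include_bias=False)
    Z = poly.fit_transform(X_parents)          # expand to second-order terms
    reg = LinearRegression().fit(Z, y)
    keep = np.zeros(X_parents.shape[1], dtype=bool)
    # poly.powers_[t, k] > 0 means expanded feature t involves parent k
    for t, c in enumerate(reg.coef_):
        if abs(c) >= threshold:
            keep |= poly.powers_[t] > 0
    return keep
```

A parent survives as long as at least one expanded term involving it carries a coefficient above the threshold, which mirrors the "all sufficiently small" removal criterion described above.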
Algorithm 3: Pruning Algorithm
The threshold in Algorithm 3 is set to be in the experiment section, and represents the edge from the kth parent node of a node to the node itself.
Finally, the Bayesian structure learning procedure with causal direction graph and continuous optimization is summarized in Algorithm 4; it is abbreviated as CO-CDG in the following discussion. Note that all the hyper-parameters are omitted for clarity and simplicity.
Algorithm 4: The CO-CDG Algorithm
Input: dataset X
Output: W
1. Causal Direction Matrix G = Algorithm1(X);
2. Prediction Graph W = Algorithm2(X, G);
3. Final Prediction Graph W = Algorithm3(X, W);
4. Experiment Results
In the experiment section, we focus on the correctness of the learned structures and the running time of the proposed algorithm. The performance of the proposed method is compared with the FGES (fast greedy equivalence search) algorithm [15], the neural network-based DAG-GNN algorithm [12], the NoTears algorithm [17], and the reinforcement learning-based RL-BIC and RL-BIC2 algorithms [13].
The links to the code packages of the above comparison algorithms are listed as follows:
The structural Hamming distance (SHD) [23] is used to measure the correctness of the structures. SHD can be calculated as
In (20), the three terms represent, respectively, the number of redundant edges that need to be removed, the number of missing edges that need to be added, and the number of edges with the opposite direction that need to be reversed. Thus, the structural Hamming distance is the total number of operations (edge additions, deletions, and direction reversals) required to convert a predicted graph into the true graph. A lower SHD indicates a better structure when the true structure is available.
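The SHD described above can be computed directly from binary adjacency matrices. In the sketch below, a reversed edge counts as a single operation rather than one deletion plus one addition, which is the usual convention.

```python
import numpy as np

def shd(W_pred, W_true):
    """Structural Hamming distance between two binary adjacency matrices:
    extra edges to delete + missing edges to add + edges to reverse."""
    P, T = W_pred != 0, W_true != 0
    # an edge i->j predicted while the true graph has j->i (and not i->j)
    reversed_ = P & ~T & T.T & ~P.T
    n_rev = int(np.triu(reversed_ | reversed_.T).sum())
    diff = int(np.sum(P != T))    # raw count of mismatched entries
    return diff - n_rev           # each reversal was counted twice in diff
```

For example, predicting the single edge 2 -> 1 when the true graph has 1 -> 2 yields an SHD of 1 (one reversal), not 2.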
In the following sections, both artificial data and real data are used to demonstrate the performance of the proposed method.
4.1. Experiments on Artificial Data
The generation of the artificial data is briefly described here; more details can be found in [13]. Given the number of nodes d, a d × d upper triangular matrix is generated randomly as the binary adjacency matrix of the graph, with its elements sampled from a Bernoulli distribution. If a node has multiple parent nodes, a second-order polynomial expansion identical to Equation (21) in Section 3.3 is used. The coefficients in (21) are either set to 0, with a fixed probability, or randomly sampled from a uniform distribution; thus, if a parent node makes no contribution to its child node because all the corresponding generated coefficients are 0, the value 1 in the corresponding position of the true binary adjacency matrix is changed to 0. Next, noise samples are generated from a standard Gaussian distribution. Finally, a graph with d nodes is generated, and the data set has 5000 samples for each node. To reduce the influence of outliers, the samples are sorted in ascending order by their absolute sums, and the first 3000 samples are used in the following experiments.
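The generation scheme above can be sketched as follows. The edge probability, the coefficient range, and the interpretation of "absolute sums" are illustrative placeholders (the actual values are those of [13]), and the cross terms of Equation (21) are omitted for brevity.

```python
import numpy as np

def generate_data(d, n=5000, n_keep=3000, edge_prob=0.5, seed=0):
    """Sketch of the synthetic-data scheme: random upper-triangular DAG,
    polynomial SEM with Gaussian noise, then outlier removal."""
    rng = np.random.default_rng(seed)
    # random upper-triangular binary adjacency matrix (guaranteed acyclic)
    A = np.triu(rng.binomial(1, edge_prob, size=(d, d)), k=1)
    X = np.zeros((n, d))
    for j in range(d):
        x = rng.standard_normal(n)            # standard Gaussian noise term
        for k in np.flatnonzero(A[:, j]):
            # linear + quadratic contribution of parent k (cross terms omitted)
            x += rng.uniform(0.5, 2.0) * X[:, k] + rng.uniform(0.5, 2.0) * X[:, k] ** 2
        X[:, j] = x
    # keep the n_keep samples whose row sums have the smallest absolute value
    order = np.argsort(np.abs(X.sum(axis=1)))
    return A, X[order[:n_keep]]
```

Because the adjacency matrix is upper triangular, sampling each column in index order automatically respects the topological order of the graph.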
In the following experiments, the SHDs of the proposed method are evaluated under different data lengths and different numbers of nodes. In the experiments in Figure 1, artificial data sets are generated in which the number of nodes ranges from 5 to 14. The results show that, as the number of nodes increases, the SHDs of the proposed algorithm are significantly smaller than those of the other algorithms.
In Table 1, the running times of the above algorithms are also evaluated under the same experiment settings. The results show that the running time of the proposed algorithm is significantly smaller than those of the other algorithms as the number of nodes increases. The running time of the FGES algorithm is smaller than that of the proposed algorithm in some cases; however, the corresponding SHD values are much larger than those of the proposed algorithm.
In the experiments in Figure 2, the number of nodes is set to 11, equal to that of the real data set. A total of 100 data points are used at first, and 10 data points are added each time to increase the experiment data length; thus, there are 291 recorded results in a single curve in the figure. Note that the order of the data points is random. As can be seen from Figure 2, as the data length increases, the SHDs of the proposed algorithm are again significantly smaller than those of the other algorithms.
Next, in Table 2, the running times of the methods are listed. The number of nodes is set to 11 and the data length to 3000; these settings are identical to the last group of experiments in Figure 2. Not surprisingly, the proposed method has a high computational speed.
4.2. Experiments on Protein Signaling Network Data
Experiments in this section are performed on a real protein signaling network data set from the work of [2]. There are 11 signaling nodes in this data set, each of which represents a phosphorylated protein molecule studied in the human primary T cell signaling pathway. A causal graph, which can be taken as the ground truth, has been constructed by classical biochemistry and genetic analysis over the past two decades, as shown in Figure 3. There are 20 edges in this graph, 17 of which are high-confidence causal edges; the rest are low-confidence causal edges. This study relies solely on the data to obtain its results, without reference to the biological knowledge behind the data.
There are 14 sub-datasets corresponding to 14 different biochemical experiments. The details of data collection can be found in [2]. Note that the number of data points differs for each sub-dataset, ranging from 723 to 917.
4.2.1. Comparison of SHDs of Different Algorithms
The results of the algorithms on all samples of the 14 sub-datasets are listed in Table 3. The proposed method gives the best results on more than half of the sub-datasets, and its average SHD is also smaller than those of all its counterparts.
Note that the results of all the algorithms, including the proposed one, are sensitive to the order of the variables in the data set. This is unexpected, since it cannot be well explained from the perspective of the mathematical formulations, at least for the method proposed here. Furthermore, the results offer no clear clues as to which variable orders yield better results, or why. Investigating this will be part of our future work.
A group of experiments on different data lengths is also demonstrated in Figure 4. The experiment settings are similar to those of the experiments in Figure 2 in the artificial-data section. The results on the third sub-dataset, which contains 911 data points, are drawn in Figure 4. A total of 100 data points are used at first, and 10 data points are added each time to enlarge the experiment data set, so there are 82 recorded results in a single curve. The proposed method outperforms the existing methods on most occasions.
In Table 4, the binary values of the elements of the adjacency matrices are treated as the results of a binary classification problem, so metrics such as accuracy, precision, recall, and F1 score can be evaluated to provide more detail. The experiment settings for Table 4 are the same as those of the points in the last column of Figure 4. From Table 4, it is also clear that the algorithm proposed in this paper achieves better results under the binary classification metrics.
4.2.2. Comparison of Running Time
Similarly, the average running times of the algorithms on the 14 sub-datasets are listed in Table 5. The average running time of the proposed method is significantly shorter than those of the existing methods.
4.2.3. Analysis of the Result Structures
The proposed algorithm gives satisfactory results on a large part of the graph. As shown in Figure 5, there are seven nodes and nine edges in the ground-truth subgraph (left subfigure). The result graph (right subfigure) recovers seven correct edges; only one edge is missing and one edge has an incorrect causal direction.
On the other hand, in the example in Figure 6, two out of the three edges given by the proposed method have incorrect causal directions. This result indicates that the SHDs could be further improved if Algorithm 1 gave better estimates of the causal directions.
5. Conclusions
In this paper, a continuous optimization algorithm with causal direction matrix constraints is proposed to learn the structure of the Bayesian network over protein signaling variables. The proposed algorithm outperforms the existing methods, and research in protein signaling networks and related fields can be effectively accelerated with the help of structures produced automatically by it. Admittedly, from the perspective of a pure biologist, nothing new has been discovered in this paper; however, the proposed method is shown to be capable of recovering relationships already known to be true, which indicates that it can be applied to further applications with similar data distributions.
There are two possible directions for future work under the current framework, corresponding to the two constraints of the proposed optimization problem. First, the accuracy of the causal direction matrix can be further improved: the assumptions of the causal direction inference method used in this paper may not be consistent with some of the edges. Second, the acyclicity assumption of Bayesian networks may also be incorrect in some cases; topologies with cycles cannot be excluded in protein signaling research, so a dynamic Bayesian network would be a more general choice. In addition, from the perspective of the data, multi-modal data are also a potential challenge in the related field [24].