Reconstruction of Single-Cell Trajectories Using Stochastic Tree Search

Jingyi Zhai; Hongkai Ji; Hui Jiang

doi:10.3390/genes14020318

Abstract

The recent advancement in single-cell RNA sequencing technologies enables the understanding of dynamic cellular processes at the single-cell level. Using trajectory inference methods, pseudotimes can be estimated based on reconstructed single-cell trajectories which can be further used to gain biological knowledge. Existing methods for modeling cell trajectories, such as minimal spanning tree or k-nearest neighbor graph, often lead to locally optimal solutions. In this paper, we propose a penalized likelihood-based framework and introduce a stochastic tree search (STS) algorithm aiming at the global solution in a large and non-convex tree space. Both simulated and real data experiments show that our approach is more accurate and robust than other existing methods in terms of cell ordering and pseudotime estimation.

Keywords: trajectory reconstruction; single-cell RNA sequencing data analysis; embedding location tree

1. Introduction

Advances in single-cell RNA sequencing make it possible to measure gene expression in individual cells, which promotes an understanding of dynamic cellular processes, including cell state transitions such as cell differentiation. Reconstructing a cell trajectory from the gene expression of a sample of cells is a new research area made possible by this technology. However, the high dimensionality of gene expression data and the associated high noise level make it difficult to model the trajectory from the original expression data [1]. One way to reconstruct the cell trajectory is to calculate pseudotime, a measure of the distance of a particular cell from the origin in a dynamic process. This type of computational approach is called trajectory reconstruction (TR) [2]. To overcome the challenges in single-cell trajectory analysis, TR methods generally have two main steps. First, to handle the high dimensionality and high noise level in the expression data, a dimensionality reduction method is applied to convert the original high-dimensional data space into a low-dimensional space. Both linear and nonlinear dimensionality reduction methods can be considered in this step to address different types of data. The second step is to model the trajectory in the dimension-reduced space. For instance, a minimal spanning tree (MST) is fitted or a k-nearest neighbor (KNN) graph is applied to model the cell trajectory [3]. This paper focuses on the second step, trajectory modeling, to further improve the performance of TR methods in the reconstruction of single-cell trajectories.

The existing TR methods can be classified into three main categories based on the trajectory-modeling step. First, Wanderlust, Wishbone and SLICER are all based on the KNN graph. Wanderlust was not originally developed for single-cell transcriptomics data but for cytometry data, so no dimensionality reduction is implemented and the trajectory modeling is applied directly to the high-dimensional data space [4]. Building on Wanderlust, Wishbone adds a dimensionality reduction step before trajectory modeling to address the high-dimensionality challenge in scRNA-seq data. Unlike Wanderlust, Wishbone can also detect a bifurcating event with two branches [5]. Similarly to Wishbone, SLICER fits the KNN graph in a lower dimension but with a different dimension reduction approach, locally linear embedding (LLE). SLICER requires less prior information than the previous two approaches and is also able to detect “bubbles”, a special type of cell trajectory [6]. Second, several TR methods, such as Monocle, TSCAN and Slingshot, use an MST to model a cell trajectory. Monocle is a pioneering method which uses independent component analysis (ICA) to reduce dimensionality [7]. TSCAN further reduces the complexity in the data structure by a model-based clustering algorithm, where an MST is obtained by connecting the cluster centers [8]. More recently, Slingshot also fits an MST on clusters of cells, similar to TSCAN, but the MST is treated as the initial guess for the simultaneous principal curve algorithm [9]. Third, in addition to the KNN graph and the MST, other trajectory-modeling methods can also be employed. For instance, diffusion pseudotime (DPT) uses diffusion-like random walks to model transitions between cells. DPT is more computationally efficient and suffers less from overfitting issues caused by the dimensionality reduction step. However, DPT only models a bifurcating trajectory and does not provide an explicit tree structure [10]. As an advanced version of Monocle, Monocle2 first reduces dimensionality by the “dpFeature” method. Then, a group of centroids is obtained as the latent representation of the dimension-reduced data by soft k-means clustering, and a spanning tree is fitted to the latent data. Finally, Monocle2 uses reversed graph embedding (RGE) to obtain a principal graph [11].

However, the majority of these TR methods use gradient approaches in the trajectory-modeling step; thus, a non-convex optimization problem is solved with a local search algorithm [2]. With a local search algorithm, there is a risk that the algorithm becomes stuck in a local optimum and outputs it as the final solution. Unlike these TR methods, our proposed stochastic tree search (STS) algorithm searches for the optimal solution globally. We first construct a penalized likelihood and then apply a stochastic optimization algorithm to search through the tree space and obtain the final optimal solution. Thus, we can identify an optimal tree directly in the lower-dimensional space produced by a dimensionality reduction method. Moreover, as the objective function in the optimization is flexible, STS is applicable to more types of data than MST-based approaches.

The remaining parts of the paper are organized as follows: In Section 2, we introduce our trajectory-modeling method with the dynamic optimization in the tree searching space. Section 3 presents a simulation study to compare our algorithm with four other methods in different settings. In Section 4, the implementation of our method is shown through two single-cell RNA sequencing datasets as well as in a comparison with four existing methods. Section 5 concludes the paper with a discussion.

2. Methods

Similar to most existing TR methods, our single-cell trajectory reconstruction approach also consists of two parts. The first part is flexible with any dimensionality reduction method, while we develop a novel stochastic tree searching process to estimate the cell trajectory for the second part. Under the likelihood framework, we define an optimization function used to find the optimal embedding tree as the estimated cell trajectory. During the stochastic tree searching process, we search through the whole embedding tree space with a pool of candidate trees at each time and start from the simplest one-edge tree structure to a more complicated tree structure with more edges. The details of our stochastic tree searching algorithm are shown in the following subsections.

2.1. Preprocessing

Before fitting the tree structure to the data, the raw gene read counts are normalized by a log2 transformation. Then, the dimension of the normalized data is reduced by any linear or nonlinear method. Principal component analysis (PCA) applies a linear projection of the data, which preserves the variance in the new lower-dimensional space. Locally linear embedding (LLE), diffusion maps, and t-SNE are more general approaches that do not assume linear relationships, so they are able to find nonlinear relationships between cells. The choice of dimensionality reduction approach is driven by the data. If the linear assumption holds, PCA is applied since it is relatively computationally efficient [1]. Otherwise, nonlinear methods are applied, especially for more recent data sets with more complex cell trajectories.

2.2. Penalized Likelihood

We use an embedding tree in the lower-dimensional space to estimate the cell trajectory. An embedding tree $T = (V, E, y)$ is defined by three main components $V$, $E$ and $y$. $V$ is a vertex set of size $|V|$, and $E$ is an edge set with $E \subset V \times V$. $y = (y_{1}, \ldots, y_{|V|})$ is the associated vertex embedding, where $y_{j} \in \mathbb{R}^{p}$. When we fit the embedding tree $T$ with $n$ data points $X = \{X_{1}, \ldots, X_{n}\}$, we assume that

$$X_{i} \mid T \sim N(\phi_{T}(X_{i}), \sigma^{2} I),$$

where $X_{i} \in \mathbb{R}^{p}$ and $\phi_{T}(X_{i})$ is the projection of $X_{i}$ to the embedding tree $T$. Now we have

$$p(X_{i} \mid T) = (2 \pi \sigma^{2})^{-\frac{p}{2}} \exp\left\{-\frac{1}{2 \sigma^{2}} (X_{i} - \phi_{T}(X_{i}))^{T} (X_{i} - \phi_{T}(X_{i}))\right\},$$

which leads to the likelihood function

$$\begin{aligned} L(T; X) &= (2 \pi \sigma^{2})^{-\frac{n p}{2}} \exp\left\{-\frac{1}{2 \sigma^{2}} \sum_{i = 1}^{n} \| X_{i} - \phi_{T}(X_{i}) \|^{2}\right\} \\ &\propto \exp\left\{-\frac{1}{2 \sigma^{2}} \sum_{i = 1}^{n} \| X_{i} - \phi_{T}(X_{i}) \|^{2}\right\} \end{aligned} \tag{1}$$

where we assume that $\sigma^{2}$ is known and can be estimated from the data. The negative log-likelihood function is

$$\begin{aligned} -l &= -\log(L) = \frac{1}{2 \sigma^{2}} \sum_{i = 1}^{n} \| X_{i} - \phi_{T}(X_{i}) \|^{2} \\ &\propto \sum_{i = 1}^{n} \| X_{i} - \phi_{T}(X_{i}) \|^{2} \end{aligned} \tag{2}$$

Finally, we can find the optimal embedding tree by minimizing the following penalized negative log-likelihood function,

$$\min_{T} f(T) + p(T).$$

$f(T)$ is the loss function derived from the negative log-likelihood function as follows,

$$f(T) = \sum_{i = 1}^{n} \| x_{i} - \phi_{T}(x_{i}) \|^{2},$$

where $\phi_{T}(x_{i})$ is the projection of $x_{i}$ to the embedding tree $T$. The penalty term $p(T)$ has two components as follows,

$$p(T) = \lambda_{1} |V|^{\alpha} + \lambda_{2} \sum_{e = (j, k) \in E} \| y_{j} - y_{k} \|^{2}, \quad \lambda_{1}, \lambda_{2}, \alpha > 0.$$

The first part of the penalty term controls the complexity of the tree structure: the larger the tree, the larger the penalty added to the objective function. The choice of $\lambda_{1}$ and $\alpha$ is based on the BIC,

$$BIC = K \log(n) - 2 \log(\hat{L}) \approx \frac{1}{\hat{\sigma}^{2}} \sum_{i = 1}^{n} \| x_{i} - \phi_{T}(x_{i}) \|^{2} + p |V| \log(n),$$

where $\hat{\sigma}^{2}$ is the sample variance and $K$ is the number of free parameters of the tree model, so that $K \approx p |V|$. Therefore, we set $\lambda_{1}$ and $\alpha$ as follows,

$$\lambda_{1} = \hat{\sigma}^{2} \log(n)\, p, \quad \alpha = 1.$$

The second part shrinks the lengths of the tree edges. We use a small penalty with $\lambda_{2} = 0.01$ to avoid unnecessarily long edges.
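As a hedged worked step (our reading of the argument, not a derivation spelled out in the text), multiplying the BIC approximation by $\hat{\sigma}^{2}$ shows where this choice of $\lambda_{1}$ and $\alpha$ comes from, under the assumption that $\hat{\sigma}^{2}$ is treated as fixed and does not depend on $T$:

```latex
\begin{aligned}
\hat{\sigma}^{2}\,\mathrm{BIC}
  &\approx \sum_{i=1}^{n} \| x_{i} - \phi_{T}(x_{i}) \|^{2}
     + \hat{\sigma}^{2}\, p \log(n)\, |V| \\
  &= f(T) + \lambda_{1} |V|^{\alpha},
  \qquad \lambda_{1} = \hat{\sigma}^{2} \log(n)\, p, \quad \alpha = 1 .
\end{aligned}
```

Under that assumption, minimizing $f(T) + \lambda_{1} |V|$ over trees ranks candidate trees in the same order as the BIC approximation, and the $\lambda_{2}$ term acts as an additional small shrinkage on edge lengths.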

2.3. Projection of the Data to the Tree

In order to obtain the penalized negative log-likelihood function for the data, we first describe how to calculate the projection of each data point to a given tree. For an embedding tree $T = (V, E, y)$, the projection of a data point $x_{i}$ to the tree $T$ is defined as $\phi_{T}(x_{i})$, a point on the tree $T$ such that $\| x_{i} - \phi_{T}(x_{i}) \|$ is minimized. We further denote by $\phi_{e}(x_{i})$ the projection of $x_{i}$ to an edge $e = (V_{j}, V_{k}) \in E$. Here, $e$ is an edge connecting two vertices $V_{j}, V_{k} \in V$ with the associated embeddings $y_{j}, y_{k} \in \mathbb{R}^{p}$, where $j, k \in \{1, \ldots, |V|\}$. We can further define $\phi_{e}(x_{i})$ in the following way,

$$\phi_{e}(x_{i}) = y_{j} + l_{e}(x_{i}) (y_{k} - y_{j}),$$

and

$$l_{e}(x_{i}) = \underset{l \in [0, 1]}{\operatorname{argmin}} \| x_{i} - [y_{j} + l (y_{k} - y_{j})] \|,$$

where $l_{e}(x_{i})$ is the position of $\phi_{e}(x_{i})$ when the line segment $[y_{j}, y_{k}]$ is mapped to $[0, 1]$, which can be calculated using the following closed-form formula

$$l_{e}(x_{i}) = \max\left\{0, \min\left(1, \frac{\langle x_{i} - y_{j}, y_{k} - y_{j} \rangle}{\| y_{k} - y_{j} \|^{2}}\right)\right\}.$$

The projection to the tree is the closest of the edge projections,

$$\phi_{T}(x_{i}) = \underset{\phi_{e}(x_{i}),\, e \in E}{\operatorname{argmin}} \| x_{i} - \phi_{e}(x_{i}) \|$$

and

$$\| x_{i} - \phi_{T}(x_{i}) \| = \min_{e \in E} \| x_{i} - \phi_{e}(x_{i}) \|.$$
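The projection rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `project_to_edge` and `project_to_tree`, the list-based representation of the embedding, and the handling of zero-length edges are all assumptions made here.

```python
import math

def project_to_edge(x, yj, yk):
    """Project point x onto the segment [yj, yk].

    Returns (l, phi, dist): the clamped position l in [0, 1], the
    projected point phi = yj + l * (yk - yj), and the distance |x - phi|.
    """
    d = [b - a for a, b in zip(yj, yk)]
    denom = sum(c * c for c in d)
    if denom == 0.0:
        l = 0.0  # assumption: a zero-length edge projects to its single point
    else:
        num = sum((xi - a) * c for xi, a, c in zip(x, yj, d))
        l = max(0.0, min(1.0, num / denom))  # closed-form clamp to [0, 1]
    phi = [a + l * c for a, c in zip(yj, d)]
    return l, phi, math.dist(x, phi)

def project_to_tree(x, y, edges):
    """Return (edge, l, phi, dist) for the closest edge projection of x.

    y is a list of vertex embeddings; edges is a list of index pairs (j, k).
    """
    best = None
    for (j, k) in edges:
        l, phi, dist = project_to_edge(x, y[j], y[k])
        if best is None or dist < best[3]:
            best = ((j, k), l, phi, dist)
    return best
```

For example, with vertices `[[0, 0], [2, 0], [2, 2]]` and edges `[(0, 1), (1, 2)]`, the point `[1.0, 0.5]` projects to `[1.0, 0.0]` on the first edge, while `[3.0, 3.0]` is clamped to the end vertex `[2.0, 2.0]` of the second edge.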

2.4. Updating Vertex Embedding Location

In the optimization process, we start with a random tree and then iteratively project the data to the tree (detailed in the section above) and update the embedding locations of its vertices. Without loss of generality, suppose there is a vertex $v \in V$ and $\{v_{1}, v_{2}, \ldots, v_{m}\}$ is the set of all the vertices that $v$ connects to. Let $x_{k_{1}}, \ldots, x_{k_{n_{k}}} \in \mathbb{R}^{p}$ be the data points projected to edge $e_{k} = (v_{k}, v)$ for $k = 1, \ldots, m$. The new embedding location $y^{*}$ for $v$ can be calculated by minimizing the objective function

$$S(y) = \sum_{i = 1}^{n} \| x_{i} - \phi_{T}(x_{i}) \|^{2} + \lambda_{2} \sum_{k = 1}^{m} \| y - y_{k} \|^{2}.$$

Let $l_{k_{t}} = l_{e_{k}}(x_{k_{t}})$ be the position of $\phi_{e_{k}}(x_{k_{t}})$ when $[y_{k}, y]$ is mapped to $[0, 1]$; $y^{*}$ can then be computed by the following formula,

$$y^{*} = \underset{y}{\operatorname{argmin}} \sum_{k} \sum_{t} \| x_{k_{t}} - [y_{k} + l_{k_{t}} (y - y_{k})] \|^{2} + \lambda_{2} \sum_{k} \| y - y_{k} \|^{2}.$$

If we assume that the mapping $l_{k_{t}}$ is fixed, there is a closed-form solution for $y^{*}$ (see Appendix A for more details). However, the mapping $l_{k_{t}}$ will change as the embedding location is updated. Hence, there is no simple closed-form solution for $y^{*}$, and we apply a backtracking line search algorithm to obtain $y^{*}$ as follows. We first set the initial values $y^{0}$ and $\tau^{0} = \tau_{max} = 1$. Then, for the $b$-th iteration, $b = 1, \ldots, B$, we take the following steps:

1.: $τ^{b} = p τ^{b - 1}$
2.: $y^{b + 1} = y^{b} - τ^{b} \nabla S_{b}$
3.: Continue the iteration until we have $S (y^{b} - τ^{b} \nabla S_{b}) - S (y^{b}) \leq - c τ^{b} \nabla S_{b}^{T} \nabla S_{b}$ .

Here there is no simple closed-form solution for the gradient $\nabla S_{b}$, as the mapping $l_{k_{t}}$ also depends on $y^{*}$. Therefore, we calculate the gradient $\nabla S_{b}$ numerically, where $\nabla S_{b} \approx \frac{S(y^{b} + \epsilon I) - S(y^{b})}{\epsilon}$. We use a small $\epsilon = 1 \times 10^{-6}$ in this numerical approximation. In our simulation studies, we use $p = 0.8$ for the step-size shrinkage factor in step 1 (not to be confused with the data dimension $p$) and $c = 0.5$. Finally, we repeat the above backtracking line search process until the change in the objective function falls below the tolerance.
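The following sketch shows a textbook backtracking line search with a forward-difference gradient, using the same shrinkage factor 0.8, constant $c = 0.5$ and $\epsilon = 10^{-6}$. It is an illustration under stated assumptions, not the authors' code: the objective `S` is any user-supplied function of a coordinate list, the step size is reset to `tau_max` at every outer iteration, and the names `numerical_gradient`, `backtracking_descent`, `rho`, `tol`, `max_outer` and `max_shrink` are hypothetical.

```python
def numerical_gradient(S, y, eps=1e-6):
    """Forward-difference approximation of the gradient of S at y."""
    base = S(y)
    grad = []
    for j in range(len(y)):
        shifted = list(y)
        shifted[j] += eps  # perturb one coordinate at a time
        grad.append((S(shifted) - base) / eps)
    return grad

def backtracking_descent(S, y0, rho=0.8, c=0.5, tau_max=1.0,
                         tol=1e-10, max_outer=500, max_shrink=100):
    """Gradient descent where each step size is found by backtracking."""
    y = list(y0)
    for _ in range(max_outer):
        g = numerical_gradient(S, y)
        g2 = sum(v * v for v in g)
        s_old = S(y)
        tau = tau_max
        for _ in range(max_shrink):
            cand = [a - tau * b for a, b in zip(y, g)]
            # Sufficient-decrease (Armijo) condition.
            if S(cand) - s_old <= -c * tau * g2:
                break
            tau *= rho  # shrink the step and try again
        else:
            return y  # no acceptable step found; keep the current point
        y = cand
        if s_old - S(y) < tol:
            break  # objective no longer changes beyond the tolerance
    return y
```

On a simple quadratic such as `S(y) = (y[0] - 1)**2 + (y[1] + 2)**2`, starting from `[0.0, 0.0]`, the routine converges to a point close to `[1, -2]`. In the tree setting, `S` would re-project the data for every trial location of the vertex, which is why the gradient is taken numerically.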

2.5. Tree Similarity Score

To facilitate the stochastic tree search, we maintain a pool of candidate trees. During the stochastic tree searching process, we remove similar trees and thereby maintain the diversity of the candidate trees in the pool, which is achieved using a tree similarity score. The tree similarity score is defined to estimate the similarity between two trees of the same size. Considering two embedding trees $T_{1} = (V^{(1)}, E^{(1)}, y^{(1)})$ and $T_{2} = (V^{(2)}, E^{(2)}, y^{(2)})$ with the same size $|V^{(1)}| = |V^{(2)}| = s$, the tree similarity score can be computed as follows,

$$S(T_{1}, T_{2}) = D(V^{(1)}(y^{(1)}), V^{(2)}(y^{(2)})) + D(E^{(1)}, E^{(2)}).$$

The first part $D(V^{(1)}(y^{(1)}), V^{(2)}(y^{(2)}))$ denotes the distance between the two vertex sets $V^{(1)}$ and $V^{(2)}$ through the embedding locations $y^{(1)}$ and $y^{(2)}$ as follows,

$$D(V^{(1)}(y^{(1)}), V^{(2)}(y^{(2)})) = \sum_{i}^{s} \min_{j} \| y_{i}^{(1)} - y_{j}^{(2)} \|.$$

The second part $D(E^{(1)}, E^{(2)})$ calculates the distance between the two edge sets $E^{(1)}$ and $E^{(2)}$ as

$$D(E^{(1)}, E^{(2)}) = \sum_{i}^{s - 1} \sum_{j}^{s - 1} D(E_{i}^{(1)}, E_{j}^{(2)}) = \sum_{i}^{s - 1} \sum_{j}^{s - 1} \sqrt{\left(D_{i j}^{(1)}\right)^{2} + \left(D_{i j}^{(2)}\right)^{2}},$$

where $E_{i}^{(1)} = (V_{i 1}^{(1)}, V_{i 2}^{(1)})$, $E_{j}^{(2)} = (V_{j 1}^{(2)}, V_{j 2}^{(2)})$ and

$$\| E_{i}^{(1)} - E_{j}^{(2)} \| = \begin{bmatrix} D_{i j}^{(1)} \\ D_{i j}^{(2)} \end{bmatrix} = \begin{bmatrix} \min\{\| y_{i 1}^{(1)} - y_{j 1}^{(2)} \|, \| y_{i 1}^{(1)} - y_{j 2}^{(2)} \|\} \\ \min\{\| y_{i 2}^{(1)} - y_{j 1}^{(2)} \|, \| y_{i 2}^{(1)} - y_{j 2}^{(2)} \|\} \end{bmatrix}.$$
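A direct transcription of the score into Python may help make the indices concrete. This is a sketch that follows the formulas as written (including the double sum over all edge pairs); the function names and the representation of a tree as a list of embeddings plus a list of index pairs are assumptions for illustration.

```python
import math

def vertex_distance(y1, y2):
    """Sum over vertices of tree 1 of the distance to the nearest vertex of tree 2."""
    return sum(min(math.dist(a, b) for b in y2) for a in y1)

def edge_pair_distance(e1, e2, y1, y2):
    """Distance between edge e1 of tree 1 and edge e2 of tree 2."""
    (i1, i2), (j1, j2) = e1, e2
    # Each endpoint of e1 is matched to its nearer endpoint of e2.
    d1 = min(math.dist(y1[i1], y2[j1]), math.dist(y1[i1], y2[j2]))
    d2 = min(math.dist(y1[i2], y2[j1]), math.dist(y1[i2], y2[j2]))
    return math.hypot(d1, d2)

def tree_similarity(y1, edges1, y2, edges2):
    """Vertex-set distance plus edge-set distance for two same-size trees."""
    vertex_part = vertex_distance(y1, y2)
    edge_part = sum(edge_pair_distance(a, b, y1, y2)
                    for a in edges1 for b in edges2)
    return vertex_part + edge_part
```

For two identical one-edge trees the score is 0, and shifting one of them by one unit perpendicular to the edge gives $2 + \sqrt{2}$, so larger values indicate trees that are further apart.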

2.6. Stochastic Optimization

2.6.1. Initial Tree Generation

Starting from the simplest tree structure, a one-edge tree, the associated embedding locations are generated from a multivariate normal distribution $MVN(\mu, \Sigma)$. The mean vector $\mu$ and diagonal variance matrix $\Sigma$ can be estimated from the data. To better search through the large tree space, multiple initial trees $\{T_{2}^{1}, \ldots, T_{2}^{L_{1}}\}$ are generated to form the initial tree pool $\{T_{2}^{l}\}$. We then update the embedding locations of each tree in the initial tree pool to better fit the data. For each updated tree $T_{2}^{*}$, we calculate the tree similarity score $S_{1}$ and the optimization score $S_{2}$ as follows,

$$S_{1}(T_{2}^{*}) = \min_{T_{2} \in \{T_{2}^{l}\}} S(T_{2}, T_{2}^{*}),$$

$$S_{2}(T_{2}^{*}) = \sum_{i = 1}^{n} \| x_{i} - \phi_{T_{2}^{*}}(x_{i}) \|^{2} + \lambda_{1} |V| + \lambda_{2} \sum_{e = (j, k) \in E} \| y_{j} - y_{k} \|^{2}.$$

Let $T_{2}^{min}$ denote the tree in the pool for which $S(T_{2}^{min}, T_{2}^{*}) = \min_{T_{2} \in \{T_{2}^{l}\}} S(T_{2}, T_{2}^{*})$, and let $a$ be a pre-defined cutoff value. If $S_{1}(T_{2}^{*}) > a$ and $S_{2}(T_{2}^{*}) < \min_{T_{2} \in \{T_{2}^{l}\}} S_{2}(T_{2})$, then we update the initial tree pool by replacing $T_{2}^{min}$ with $T_{2}^{*}$.

2.6.2. Grow Trees by Adding Nodes

Based on the old tree pool $\{T_{k}^{l}\}$, which contains all candidate trees of size $k$, we grow the trees by adding a new node to each tree. There are two ways to add a new node. The first is to add a new node $v_{new}$ connected to any existing node $v_{j}$, $j \in \{1, \ldots, k\}$, of $T_{k}$, which yields the new edge $e_{new} = (v_{new}, v_{j})$. The probability of selecting any node among all nodes of $T_{k}$ is set as follows,

$$P(e_{new} = (v_{new}, v_{j})) = \frac{1}{k}.$$

Here, we assume there is an equal probability of selecting each existing node to connect with the new node. In order to make the stochastic optimization more efficient, we generate the new embedding location $y_{new}$ of $v_{new}$ based on guidance from the data. In detail, we first compute the residuals based on $T_{k}$,

$$r_{i} = \| x_{i} - \phi_{T_{k}}(x_{i}) \|^{2}.$$

Then, the residuals are standardized as

$$r_{i}^{*} = \frac{r_{i}}{\max_{i} \{r_{i}\}}.$$

Finally, we sample the embedding location of the new node $y_{new}$ from $x_{1}, \ldots, x_{n}$ with the sampling weights $w_{i} = e^{\lambda_{3} r_{i}^{*}}$. The new node is more likely to be located in an area with larger residuals, since points there have a higher sampling probability. The parameter $\lambda_{3}$ controls how strongly the new embedding location is driven by the residuals.

The other way to generate a new node is to add a middle point to any edge $e = (v_{j}, v_{g})$ of $T_{k}$. Each edge of $T_{k}$ has the same probability of being chosen, where

$$P(e = (v_{j}, v_{g})) = \frac{1}{k - 1}, \quad j \neq g \in \{1, \ldots, k\}.$$

We replace the previous edge $e$ with two new edges $e_{1}^{new} = (v_{j}, v_{new})$ and $e_{2}^{new} = (v_{g}, v_{new})$. The new embedding location $y_{new}$ can be calculated by the following formula,

$$y_{new} = \frac{1}{2} (y_{j} + y_{g}),$$

where $y_{j}$ and $y_{g}$ are the embedding locations of $v_{j}$ and $v_{g}$.
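The two growth moves can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the helper names `add_leaf` and `add_midpoint` are hypothetical, the squared residuals are assumed to be computed beforehand by a projection routine, and the fallback to uniform weights when all residuals are zero is an assumption added here to avoid division by zero.

```python
import math
import random

def add_leaf(y, edges, residuals, data, lam3, rng):
    """Attach a new node to a uniformly chosen existing node.

    The new location is sampled from the data points with weights
    exp(lam3 * r_i / max_i r_i), so poorly fitted regions are favored.
    """
    k = len(y)
    j = rng.randrange(k)  # each existing node has probability 1 / k
    r_max = max(residuals)
    weights = [math.exp(lam3 * r / r_max) if r_max > 0 else 1.0
               for r in residuals]
    new_loc = list(rng.choices(data, weights=weights, k=1)[0])
    return y + [new_loc], edges + [(k, j)]

def add_midpoint(y, edges, rng):
    """Split a uniformly chosen edge at its midpoint."""
    idx = rng.randrange(len(edges))  # each edge has probability 1 / (k - 1)
    j, g = edges[idx]
    mid = [(a + b) / 2.0 for a, b in zip(y[j], y[g])]
    new = len(y)
    new_edges = edges[:idx] + edges[idx + 1:] + [(j, new), (g, new)]
    return y + [mid], new_edges
```

Both functions return new lists and leave their inputs unchanged, so several candidate trees can be grown from the same parent, for example with `rng = random.Random(0)` for reproducibility. Either move increases the number of nodes and the number of edges by one, so the result is still a tree.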

2.6.3. Optimizing Tree with Data

From the old tree pool $\{T_{k}^{l}\}$, we can obtain a new tree pool $\{T_{k + 1}^{l}\}$ by adding a new node as described in the previous section. For each new tree, we further update the embedding locations of all nodes to better fit the data. Similar to the initial tree updating procedure, for each updated tree $T_{k + 1}^{*}$ in the optimization process, we compute the tree similarity score $S_{1}$ and the optimization score $S_{2}$. If $S_{1}(T_{k + 1}^{*}) > a$ and $S_{2}(T_{k + 1}^{*}) < \min_{T_{k + 1} \in \{T_{k + 1}^{l}\}} S_{2}(T_{k + 1})$, then we update the new tree pool by replacing $T_{k + 1}^{min}$ with $T_{k + 1}^{*}$, where $T_{k + 1}^{min}$ is the tree in the pool with the smallest tree similarity score $S(T_{k + 1}^{*}, T_{k + 1})$. We repeat the node-adding step on the sequence of old trees $L_{2}$ times and update the new tree pool $\{T_{k + 1}^{l}\}$ according to the above two criteria.
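The replacement rule for the pool can be written compactly. The sketch below is generic and hedged: `update_pool` is a hypothetical name, and `similarity` and `score` are caller-supplied functions standing in for the tree similarity score $S$ and the optimization score $S_{2}$.

```python
def update_pool(pool, candidate, similarity, score, a):
    """Return a new pool after applying the two replacement criteria.

    The candidate replaces its most similar pool member only if it is
    sufficiently different from every pool member (S1 > a) and has a
    lower optimization score than every pool member.
    """
    sims = [similarity(tree, candidate) for tree in pool]
    s1 = min(sims)  # S1: score against the most similar tree in the pool
    s2 = score(candidate)  # S2: penalized objective, smaller is better
    best_in_pool = min(score(tree) for tree in pool)
    if s1 > a and s2 < best_in_pool:
        closest = sims.index(s1)
        return pool[:closest] + [candidate] + pool[closest + 1:]
    return list(pool)
```

As a toy check with numbers standing in for trees, `update_pool([5, 8], 3, lambda s, t: abs(s - t), lambda t: t, 1)` replaces 5 (the closest member) with 3, whereas a candidate of 6 is rejected because its score is not below the current best.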

2.6.4. Final Optimal Tree

According to the optimization score $S_{2}(T_{k})$, we order the trees in each tree pool $\{T_{k}^{l}\}$, for $k = 2, \ldots$, and then select the top-ranked tree with the smallest optimization score, $S_{2}(T_{k}^{opt}) = \min_{T_{k} \in \{T_{k}^{l}\}} S_{2}(T_{k})$. When $S_{2}(T_{k + 1}^{opt}) > S_{2}(T_{k}^{opt})$, we stop growing trees and obtain the final optimal tree $T^{opt} = T_{k}^{opt}$. The details of the process of finding the final optimal tree are shown in Figure 1.

Figure 1. The flow chart for the stochastic tree searching algorithm.

2.7. Pseudotime Calculation

Once the optimal tree is found through the above stochastic tree searching method, we compute pseudotime using the same shortest path algorithm that Bendall (2014) applied in the pseudotime calculation for Wanderlust [12]. We first define the length of an edge as the Euclidean distance between its two vertices, so that a corresponding adjacency matrix for the optimal tree can be calculated. Then, the distance from each cell to the origin is computed by the shortest path algorithm. Finally, we define this distance as the pseudotime of the cell and order cells by their pseudotimes.
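One way to realize this computation is sketched below, under assumptions that the text does not spell out: each cell is represented by its projection onto the tree, given as an edge `(j, k)` and a position `l` in $[0, 1]$ along that edge, and the origin is a tree vertex. The function names are hypothetical, and Dijkstra's algorithm is used as a standard shortest path method.

```python
import heapq
import math

def vertex_distances(y, edges, origin):
    """Shortest path length from the origin vertex to every vertex."""
    adj = {v: [] for v in range(len(y))}
    for j, k in edges:
        w = math.dist(y[j], y[k])  # edge length = Euclidean distance
        adj[j].append((k, w))
        adj[k].append((j, w))
    dist = {v: math.inf for v in adj}
    dist[origin] = 0.0
    heap = [(0.0, origin)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u, w in adj[v]:
            if d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(heap, (dist[u], u))
    return dist

def pseudotime(y, edges, origin, projections):
    """Pseudotime of each cell from its projection ((j, k), l) on the tree."""
    dist = vertex_distances(y, edges, origin)
    times = []
    for (j, k), l in projections:
        w = math.dist(y[j], y[k])
        # Reach the projected point through either end of its edge.
        times.append(min(dist[j] + l * w, dist[k] + (1.0 - l) * w))
    return times
```

Sorting the cells by the returned values gives the cell ordering. For a path with vertices `[[0, 0], [2, 0], [2, 2]]` and origin 0, a cell halfway along the first edge has pseudotime 1.0 and a cell a quarter of the way along the second edge has pseudotime 2.5.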

2.8. Extension from Linear Trees to Nonlinear Trees

The assumption of a linear tree-shaped cell trajectory is not always valid for scRNA-seq data sets. Allowing nonlinear cell trajectories can improve the accuracy of the reconstruction. Hence, we also propose an extension of the global searching algorithm, a stochastic tree searching algorithm that includes nonlinear cell trajectories. Instead of the linear embedding tree, the new algorithm uses a curved embedding tree with curved edges modeled by bounded principal curves. The details of the curved tree method can be found in Appendix B.

3. Simulation

3.1. Design

To check the accuracy of our algorithm and its robustness to data noise, a $k$-node embedding tree in $p$ dimensions is randomly generated, with the embedding location of each node simulated from a multivariate normal distribution $N(0, I_{p})$. Considering the complexity of real data sets, additional noise in the data is generated from a known distribution. In the first part of the simulation study, the data noise is simulated from a standard normal distribution $N(0, 1)$. In the second part, the noise is generated from a Student's t distribution $t_{3}$. In this case, potential extreme outliers will occur, driving the data far away from the true tree structure. A parameter $\sigma$ denotes the scale of the noise in the simulated data and controls the noise level: a larger $\sigma$ means more noise in the simulated data. In the study, we gradually increase the noise level $\sigma$ to examine the robustness of our algorithm. Because of the significance of pseudotime in real data applications, the accuracy is calculated based on the ordering of the pseudotime assignments. The Kendall rank correlation coefficient is computed between the estimated pseudotime and the simulated pseudotime based on the true structure.
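For reference, the Kendall rank correlation between two pseudotime vectors can be computed as in the sketch below. It implements the simple tau-a variant, in which tied pairs contribute zero; whether the study used this variant or a tie-corrected one is not stated, so this choice is an assumption, as is the function name.

```python
def kendall_tau(a, b):
    """Kendall rank correlation (tau-a) between two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            if prod > 0:
                s += 1  # concordant pair
            elif prod < 0:
                s -= 1  # discordant pair
    return s / (n * (n - 1) / 2)
```

A value of 1 means the estimated ordering matches the true ordering exactly, -1 means it is fully reversed, and values near 0 correspond to an essentially random ordering. The quadratic loop is adequate for a few hundred cells.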

3.2. Comparison with Other Methods

Four existing single-cell trajectory-inferring methods are also applied to the simulated data as comparisons. These four approaches use an MST, the KNN graph or more complicated methods such as the principal curve in the trajectory-modeling part, which covers a majority of models for reconstructing a cell trajectory. Monocle models the trajectory by an MST in the lower dimension converted by ICA. More complicated than Monocle, Monocle2 uses an additional clustering algorithm to obtain the latent data representation in the lower dimension and then uses RGE to learn a principal curve. Slingshot also models the trajectory based on an MST. However, the MST is fitted on the clusters of cells and finally is treated as an initial guess for a simultaneous principal curve algorithm, which enables modeling of a nonlinear trajectory. SLICER first reduces dimensionality by LLE and then applies KNN graph for the trajectory-modeling step.

As shown in Figure 2, for the normal-distributed noise, all methods have the same trend that both the Kendall correlation and the stability will decrease as the noise level increases. Hence, the accuracy of the cell ordering based on the pseudotime is associated with the noise level. However, STS has the highest correlation and is the most stable one among all methods with any noise level. When the noise level is low, the correlation of our method is close to 1, so our method produces almost the same cell ordering result as the truth. SLICER has the second highest correlation, and Slingshot has poorer performance than SLICER. The two versions of Monocle methods have a similar performance with the lowest correlation among all methods. When the noise level is extremely high, all methods have a Kendall correlation lower than 0.5, and the estimated cell orderings are close to random assignments.

Figure 2. Kendall correlations at different noise levels for the two noise distributions. (a) Normal. (b) t distribution.

When the noise is generated from a Student's t distribution (Figure 2b), the Kendall correlations of all methods drop faster with increasing noise level than in the normally distributed noise case. The performance of all approaches is also more unstable than in the normal case. With a low or moderate noise level, our method outperforms the other approaches with the highest Kendall correlation. When the noise level reaches 1.5, the correlations of all methods drop below 0.5; thus, all approaches fail to accurately identify the true cell orderings. Monocle with DDRTree (Monocle2) is the most stable and the least affected by the extreme outliers among all methods. STS has a slightly lower correlation than Monocle2 but is still comparable with the other methods.

We also check the computational efficiency through simulations with different sample sizes and tree sizes. As Figure 3 shows, the computational cost increases nonlinearly with the growth in sample size when the tree size is fixed. However, there is a linear increasing trend indicated by Figure 3 for the running time when the tree size increases and the sample size remains the same. When the tree size is more than eight, the tree structure is complicated, making it difficult to estimate it through the data. In this case, STS will stop with a more straightforward estimated optimal tree with fewer edges.

Figure 3. Running time for different sample sizes and tree sizes.

3.3. Comparison between Linear and Curved Tree Methods

In order to check the accuracy of our curved tree algorithm and its robustness to data noise, a $k$-node curved tree with non-zero curvature parameters in $p$ dimensions is randomly generated, with the embedding location of each node simulated from a multivariate normal distribution $N(0, I_{p})$. Considering the complexity of real data sets, additional noise in the data is generated from a known distribution. Since the data is noisier in the curved tree framework, we only simulate the data noise from a standard normal distribution $N(0, 1)$. A parameter $\sigma$ denotes the scale of the noise in the simulated data and controls the noise level: a larger $\sigma$ means more noise in the simulated data. In the study, we first gradually increase the noise level $\sigma$ to examine the robustness of our algorithm when the curvature level is fixed. Then, we simulate the curvature parameters of the bounded principal curves from the normal distribution $N(0, c I_{p})$. The scale parameter $c$ of this normal distribution controls the curvature of the simulated curved trees. In detail, a larger $c$ means a higher curvature on average, and thus a more complicated tree is simulated. We also simulate data with different curvature levels when the noise level is fixed. We assess the performance of our method in two ways. To check the accuracy of the cell orderings, the Kendall rank correlation coefficient is computed between the estimated pseudotime and the simulated pseudotime based on the true structure. In addition, we check the accuracy of the estimated cell trajectory by the residual standard error, calculated as the square root of the mean squared error between the estimated data projections and the true projections. We also compare our curved tree algorithm with the linear tree algorithm.

As shown in Table 1, all approaches have the same nonlinear trend: both the Kendall correlation and the residual standard error, with the associated standard deviation, will decrease as the noise level increases. Hence, the accuracy of the pseudotime-based cell ordering is related to the noise level, as we found for the linear tree method. Moreover, the accuracy of the trajectory estimation is also related to the noise level. When we compare the two algorithms at the same noise level, there is no significant difference in Kendall correlation between the two methods. Regarding the residual standard error, the curved tree algorithm outperforms the linear tree algorithm with much lower values. In general, the curved tree algorithm does not significantly improve the accuracy of the cell orderings, but its estimation of the cell trajectory is more accurate.

Table 1. Mean Kendall correlations and mean residual standard error for both curved tree algorithm and linear tree algorithm with different noise level.

From Table 2, both the Kendall correlation and the residual standard error are positively correlated with the curvature for the two methods. At lower curvature, there is no significant difference in Kendall correlation between the curved and linear tree methods; at larger curvature, the difference between the two methods becomes more pronounced. In terms of residual standard error, the curved tree method has a smaller value than the linear tree approach at the same curvature. Moreover, as the curvature increases, both approaches become more unstable, with higher variation in Kendall correlation and residual standard error.

Table 2. Mean Kendall correlations and mean residual standard error for both curved tree algorithm and linear tree algorithm with different curvature.

4. Application

4.1. Induction of Mouse Embryonic Stem (ES) Cell Differentiation

Time series data of 421 RamDA-seq samples of mouse ES cells were collected at five different time points, with 157,717 gene features, during the cell differentiation process. These time points are 0, 12, 24, 48 and 72 h after the induction of cell differentiation into primitive endoderm (PrE) cells [13]. We pre-processed the data in the same way as Cannoodt (2016) did. The data are first filtered by keeping only cells of good quality, which reduces the number of cells to 414. Then, the count data are normalized by a log2 transformation. Finally, 23,658 gene features are selected based on feature variability. With the final gene expression matrix, a diffusion map is used to further reduce the data dimensionality. With the normalized data transformed into a lower dimension, we apply STS to estimate the cell trajectory and calculate pseudotime to order cells. Further, we also fit the same four existing TR methods used in the simulation studies: Slingshot, SLICER, Monocle and Monocle2.

Hayashi (2018) identified a cell differentiation trajectory in their paper, where cells move from the initial state to the final state through an intermediate transition state. From Figure 4, both of our approaches obtain an estimated cell trajectory close to the truth. Among the four existing TR methods, Slingshot, Monocle and Monocle2 cluster the cells but identify different numbers of clusters, ranging from 5 to 16. Both SLICER and Monocle2 estimate the cell trajectory slightly differently from the other methods, as they also identify a small branch at the transition state. In terms of pseudotime estimation, not all of these methods order the cells accurately. Figure 5 shows the relationship between the estimated pseudotime and the true time for each method. For all the methods except SLICER, there is a positive correlation between the estimate and the truth, which indicates that the pseudotime calculation is consistent with the true time points. SLICER is unable to estimate a pseudotime that tracks the true time. Table 3 shows the same result. The Kendall correlation for SLICER is the lowest among all TR methods, at 0.39. All other methods have a Kendall correlation greater than 0.8, and our two methods attain the highest value (0.87, shared with Slingshot and Monocle DDRTree). All the methods except SLICER are able to estimate pseudotime accurately and recover the true cell order.
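As an illustration of how a Kendall correlation between true collection times and estimated pseudotimes can be computed, the sketch below implements the tau-b variant, which adjusts for ties. Ties matter here because many cells share the same true time point. The paper does not state which variant was used, so the choice of tau-b, the function name `kendall_tau_b` and the toy values are assumptions.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + ties_y_only  # pairs not tied in x
    untied_y = concordant + discordant + ties_x_only  # pairs not tied in y
    denom = math.sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Toy example: true times in hours (with ties) and hypothetical pseudotimes.
true_time = [0, 0, 12, 12, 24]
pseudotime = [0.1, 0.2, 0.5, 0.4, 0.9]
tau = kendall_tau_b(true_time, pseudotime)
```

In the toy data every pair of cells from different time points is ordered correctly, and the value falls below 1 only because of the tie correction.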

Figure 4. Estimated cell trajectories by our linear STS and curved tree methods on the mouse ES cell dataset. (a) Linear tree method. (b) Curved tree method.

Figure 5. Scatter plots between the estimated pseudotimes and the true times for our approach and four existing TR methods on the mouse ES cell dataset. (a) SLICER. (b) Slingshot. (c) Monocle ICA. (d) Monocle DDRTree. (e) Linear Tree. (f) Curved Tree.

Table 3. Kendall correlation for all methods on the mouse ES cell dataset.

Comparing our linear and curved tree methods, the curved tree method (0.0045) has a slightly lower residual standard error than the linear tree method (0.0055). Therefore, the curved tree method estimates the cell trajectory more accurately than the linear tree approach. In terms of pseudotime estimation, consistent with the findings from the previous data set, our curved tree method does not significantly improve the cell ordering compared with the linear tree method. Both the curved and the linear tree methods have the same Kendall correlation.

4.2. Resolution of Cell Fate Decisions from Zygote to Blastocyst

We also apply our STS methods to a single-cell expression data set with Ct values of 48 genes from 442 cells harvested over the first four days of mouse development. We try to recover the dynamic cell development process from zygote through blastocyst (from the 1-cell stage to the 64-cell stage). We follow the same data cleaning and normalization process as Guo et al. (2010) [14]. First, duplicate cells from two or more cell embryos are removed, and the cells with Ct values less than 28 are also removed. Then, we normalize the cell Ct values using the endogenous controls Actb and Gapdh by subtracting their average Ct value for each cell. A diffusion map is then applied to the final gene expression matrix to further reduce the data dimensionality. On the resulting low-dimensional representation, we apply our linear and curved tree methods to estimate the cell trajectory and calculate pseudotime to order the cells. We also fit the same four existing TR methods used in the simulation studies: Slingshot, SLICER, Monocle and Monocle2.
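The control-gene normalization described above can be sketched as follows. The dictionary layout (gene name to Ct value for one cell) and the function name `normalize_ct` are hypothetical; the filtering steps are not shown.

```python
def normalize_ct(cell, controls=("Actb", "Gapdh")):
    """Subtract the mean Ct of the endogenous controls from every gene's Ct.

    `cell` maps gene names to Ct values for a single cell (assumed layout).
    """
    baseline = sum(cell[g] for g in controls) / len(controls)
    return {gene: ct - baseline for gene, ct in cell.items()}


# Toy cell with the two control genes and one gene of interest.
toy_cell = {"Actb": 16.0, "Gapdh": 18.0, "Gata6": 22.0}
normalized = normalize_ct(toy_cell)
```

Here the control average is 17, so the normalized value for `Gata6` is 5.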

The previous study shows that the cells from the 64-cell stage are subdivided into the trophectoderm (TE), the epiblast (EPI) and the primitive endoderm (PE). In addition, the cells from the 32-cell stage can be classified as the inner cell mass (ICM) or TE in the middle of the cell development [14]. From Figure 6, both of our approaches obtain an estimated cell trajectory close to the truth. Among the four existing TR methods, Monocle2 also recovers a three-branch structure similar to that found by our method. However, Slingshot, SLICER and Monocle identify only two cell types at the 64-cell stage. In terms of pseudotime estimation, not all of these methods order the cells accurately. Figure 7 shows the relationship between the estimated pseudotime and the true time for each method. Both Slingshot and our approach show a positive correlation between the estimate and the truth, indicating that the pseudotime calculation is consistent with the true time points. SLICER, Monocle and Monocle2 fail to distinguish between the first three stages as well as the last two stages. Table 4 shows the same result. The Kendall correlation for Monocle is the lowest among all TR methods, at 0.51. SLICER and Monocle2 both have a Kendall correlation of 0.69, and our algorithm has the second-highest correlation, 0.76. Slingshot has a slightly higher correlation (0.78) than our methods. This data set shows a more complicated dynamic process than the previous data set, so none of these methods orders the cells very close to the truth. Nevertheless, STS and Slingshot order the majority of the cells accurately.

Figure 6. Cell trajectories estimated by our linear STS and curved tree methods on the Zygote–Blastocyst dataset. (a) Linear tree method. (b) Curved tree method.

Figure 7. Scatter plots between estimated pseudotimes and true times for our approach and four existing TR methods on the Zygote–Blastocyst dataset. (a) SLICER. (b) Slingshot. (c) Monocle ICA. (d) Monocle DDRTree. (e) Linear Tree. (f) Curved Tree.

Table 4. Kendall correlation for all methods on the Zygote–Blastocyst dataset.

Comparing our linear and curved tree methods, the curved tree method (0.0220) has a slightly lower residual standard error than the linear tree method (0.0229). Therefore, the curved tree method estimates the cell trajectory more accurately than the linear tree approach. In terms of pseudotime estimation, consistent with the findings from the previous data set, our curved tree method does not significantly improve the cell ordering compared with the linear tree method. Both the curved and the linear tree methods have the same Kendall correlation.

5. Discussion

Many existing TR methods treat the reconstruction of a cell trajectory from scRNA-seq data as a nonlinear optimization problem. These TR methods are mainly based on an MST or a KNN graph and search for an optimal solution locally. Instead of searching locally, our algorithm provides a novel approach that searches directly for a globally optimal tree. We apply a stochastic tree optimization algorithm after pre-processing the data with normalization and dimension reduction. Based on a penalized likelihood, we start the search with a pool of one-edge trees and then gradually move to pools of more complex trees by adding a new node. Although we currently use the squared Euclidean distance between each data point and its projection on the tree, the objective function is flexible and can be extended to other forms. As the simulation study and the two real data examples show, STS is more accurate in cell-ordering estimation and less sensitive to outliers and skewed expression distributions. Specifically, the global tree search improves the estimation performance compared with local search methods. In general, STS is applicable to the reconstruction of both simple bifurcation trajectories and more complicated multifurcation trajectories, with high accuracy in both trajectory estimation and cell ordering.

Some challenges introduced by the complex data structure remain, which suggests several future extensions of our STS method. First, the very large tree search space limits the computational efficiency of the dynamic optimization process. Second, the current optimization step is sensitive to extreme outliers, although extreme outliers do not occur frequently in real datasets. To reduce this sensitivity, the current sum-of-squares objective function could be replaced by other functions. Moreover, based on the likelihood function, we can also calculate the probability of each tree in the tree pool by assigning a prior probability to the tree size, and then extend the framework to inference, such as hypothesis testing [15]. Finally, instead of the backtracking line search algorithm, other optimization algorithms could be used to reduce the sensitivity to the initial tree guess. Together, these extensions would allow our approach to be adapted to more complex and noisy real data sets.

Author Contributions

Conceptualization, H.J. (Hongkai Ji) and H.J. (Hui Jiang); Formal analysis, J.Z.; Methodology, J.Z., H.J. (Hongkai Ji) and H.J. (Hui Jiang); Software, J.Z.; Supervision, H.J. (Hui Jiang); Visualization, J.Z.; Writing—original draft, J.Z.; Writing—review & editing, H.J. (Hongkai Ji) and H.J. (Hui Jiang). All authors have read and agreed to the published version of the manuscript.

Funding

Hongkai Ji is supported by the National Institutes of Health (NIH) grant R01HG009518.

Data Availability Statement

The data and computer codes that support the findings in this paper are available at https://github.com/kkttzjy/STS. These data were derived from the following resources available in the public domain: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98664 accessed on 31 March 2016, and https://doi.org/10.1016/j.devcel.2010.02.012 accessed on 11 February 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TR	Trajectory Reconstruction
PCA	Principal Component Analysis
ICA	Independent Component Analysis
LLE	Locally Linear Embedding
MST	Minimal Spanning Tree
KNN	K-nearest Neighbor
DPT	Diffusion Pseudotime
STS	Stochastic Tree Search
RGE	Reversed Graph Embedding

Appendix A. Updating Vertex Embedding Location with a Closed-Form Solution

Without loss of generality, suppose there is a vertex $v \in V$ and $\{v_{1}, v_{2}, \dots, v_{m}\}$ is the set of all the vertices that $v$ connects to. Let $x_{k_{1}}, \dots, x_{k_{n_{k}}} \in \mathbb{R}^{p}$ be the data points projected to edge $e_{k} = (v_{k}, v)$ for $k = 1, \dots, m$. The new embedding location $y^{*}$ for $v$ can be calculated by minimizing the objective function

$$\sum_{i = 1}^{n} \| x_{i} - \phi_{T}(x_{i}) \|^{2} + \lambda_{1} |V|^{\alpha} + \lambda_{2} \sum_{e = (j, k) \in E} \| y_{j} - y_{k} \|^{2} .$$

Now, let $l_{k_{t}} = l_{e_{k}}(x_{k_{t}})$ be the position of the projection $\phi_{e_{k}}(x_{k_{t}})$ when the segment $[y_{k}, y]$ is mapped to $[0, 1]$. If we assume that $l_{k_{t}}$ stays fixed when the embedding location $y$ changes, then $y^{*}$ can be computed by the following formula,

$$y^{*} = \underset{y}{\arg\min} \sum_{k} \sum_{t} \left\| x_{k_{t}} - \left[ y_{k} + l_{k_{t}} (y - y_{k}) \right] \right\|^{2} + \lambda_{2} \sum_{k} \| y - y_{k} \|^{2} .$$

There is a closed-form solution for $y^{*}$ as follows,

$$y^{*} = \frac{\sum_{k} \left\{ \lambda_{2} y_{k} + \sum_{t} l_{k_{t}} \left[ x_{k_{t}} - (1 - l_{k_{t}}) y_{k} \right] \right\}}{\lambda_{2} m + \sum_{k} \sum_{t} l_{k_{t}}^{2}} .$$
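The closed-form update can be checked numerically. The sketch below evaluates the formula for $y^{*}$ on a small made-up configuration and compares the objective at $y^{*}$ with the objective at nearby points; the function names, the toy coordinates and the value $\lambda_{2} = 0.01$ are assumptions chosen for illustration.

```python
def closed_form_update(neighbors, points, ls, lam2):
    """Closed-form y* for one vertex.

    neighbors[k] is y_k, points[k] lists the x_{k_t} projected to edge e_k,
    and ls[k] lists the matching fixed positions l_{k_t} in [0, 1].
    """
    p = len(neighbors[0])
    num = [0.0] * p
    den = lam2 * len(neighbors)  # lambda_2 * m
    for y_k, xs, lk in zip(neighbors, points, ls):
        for s in range(p):
            num[s] += lam2 * y_k[s]
        for x, l in zip(xs, lk):
            den += l * l
            for s in range(p):
                num[s] += l * (x[s] - (1.0 - l) * y_k[s])
    return [v / den for v in num]


def objective(y, neighbors, points, ls, lam2):
    """Local objective: squared residuals plus the edge-length penalty."""
    total = 0.0
    for y_k, xs, lk in zip(neighbors, points, ls):
        total += lam2 * sum((a - b) ** 2 for a, b in zip(y, y_k))
        for x, l in zip(xs, lk):
            total += sum(
                (xi - (yk + l * (yi - yk))) ** 2
                for xi, yi, yk in zip(x, y, y_k)
            )
    return total


# Toy configuration: vertex v has two neighbors in the plane.
neighbors = [[0.0, 0.0], [2.0, 0.0]]
points = [[[0.5, 0.4], [0.9, 1.1]], [[1.6, 0.5]]]
ls = [[0.4, 0.9], [0.5]]
lam2 = 0.01
y_star = closed_form_update(neighbors, points, ls, lam2)
best = objective(y_star, neighbors, points, ls, lam2)
```

Because the objective is a strictly convex quadratic in $y$ once the $l_{k_{t}}$ are held fixed, any perturbation of `y_star` should increase it.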

Appendix B. Extension to Curved Tree Method

Instead of the linear embedding tree assumption, we assume that a cell trajectory is a curved embedding tree with the nonlinear tree edges modeled as the bounded principal curves. The details of our stochastic curved tree searching algorithm are shown in the following subsections.

Appendix B.1. Curved Embedding Tree

We first define a new class of embedding trees with nonlinear edges as $G = (V, E, y, F)$. $V$ is a vertex set of size $|V|$, and $E$ is an edge set with $E \subset V \times V$. $y = (y_{1}, \dots, y_{|V|})$ is the associated vertex embedding, where $y_{j} \in \mathbb{R}^{p}$. $F = \{f_{j}\}_{j = 1}^{|V| - 1}$ is the set of curved edges, which are modeled as principal curves bounded by their vertices. Specifically, for the $j$-th edge $E_{j} = (V_{j1}, V_{j2})$, with the associated embedding locations $y_{j1}, y_{j2}$ for its two vertices, we have data points $X_{k}$, $k = 1, \dots, n_{j}$, projected to the $j$-th edge, where $n_{j}$ is the total number of data points projected to the $j$-th edge and $X_{k} \in \mathbb{R}^{p}$. For $s = 1, \dots, p$, the curved $j$-th edge is modeled as

$$X_{ks} = f_{js}(\lambda_{k}) + \epsilon_{ks},$$

where $\lambda_{k}$ is the projection index for $X_{k}$. In this appendix, we model $\lambda_{k}$ as the arc length from the origin to the projection of $X_{k}$ on the curve. Based on this assumption, we define $\lambda_{k}$ as the pseudotime of $X_{k}$. In addition, we assume that the residuals satisfy $\epsilon_{ks} \overset{i.i.d.}{\sim} N(0, \sigma^{2})$.

Inspired by the standard principal curve algorithm introduced by Hastie and Stuetzle [16], we also define the projection index $\lambda_{k}$ as follows,

$$\lambda_{k} = \lambda_{f_{j}}(X_{k}) = \sup \left\{ \lambda : \| X_{k} - f_{j}(\lambda) \| = \inf_{\tau} \| X_{k} - f_{j}(\tau) \| \right\} .$$

Here the projection index $\lambda_{f_{j}}(X_{k})$ of the data point $X_{k}$ is the value of $\lambda$ for which the point $f_{j}(\lambda)$ on the curved edge $E_{j}$ is closest to $X_{k}$. When several values of $\lambda$ attain the minimum distance, the supremum selects the largest one.

Then, we consider a principal curve $f_{j}(\lambda)$ bounded by two vertices $V_{j1}$ and $V_{j2}$ with the associated pseudotimes $\lambda_{j1}$ and $\lambda_{j2}$. Hence, for any data point projected on this curve, its projection index is restricted to the interval $(\lambda_{j1}, \lambda_{j2})$. We then rescale $\lambda$ to the interval $(0, 1)$.

We can model $f_{js}(\lambda)$ with any form of nonlinear function. In this paper, we assume that $f_{js}(\lambda)$ is a mixture of a linear function, a quadratic function and a cubic function as follows,

$$f_{js}(\lambda) = g_{js}(\lambda) + \alpha_{js} h_{js}(\lambda) + \beta_{js} l_{js}(\lambda),$$

where we have

$$g_{js}(\lambda) = y_{j1s} + \lambda (y_{j2s} - y_{j1s}),$$

$$h_{js}(\lambda) = \lambda (\lambda - 1),$$

$$l_{js}(\lambda) = \lambda (\lambda - 1) \left( \lambda - \frac{1}{2} \right) .$$

Here, $\alpha_{js}$ and $\beta_{js}$ are the curvature parameters, which control the complexity of the curved tree structure. We define the augmented vector of curvature parameters as $\gamma_{j} = \begin{pmatrix} \alpha_{j} \\ \beta_{j} \end{pmatrix}$, where $\alpha_{j} = (\alpha_{j1}, \dots, \alpha_{jp})$ and $\beta_{j} = (\beta_{j1}, \dots, \beta_{jp})$. The mixture of the above three functions can cover a wide range of curves.
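A minimal sketch of the edge parametrization is given below. It evaluates $f_{j}(\lambda)$ coordinate by coordinate and illustrates that the quadratic and cubic terms vanish at $\lambda = 0$ and $\lambda = 1$, so the curve always passes through its two vertices whatever the curvature parameters are. The function name `curved_edge` and the toy values are assumptions.

```python
def curved_edge(lam, y1, y2, alpha, beta):
    """Evaluate f_j(lam) for one edge with endpoints y1, y2 (lists of length p)."""
    h = lam * (lam - 1.0)                   # quadratic basis, zero at 0 and 1
    cubic = lam * (lam - 1.0) * (lam - 0.5)  # cubic basis, zero at 0, 0.5 and 1
    return [
        a + lam * (b - a) + al * h + be * cubic
        for a, b, al, be in zip(y1, y2, alpha, beta)
    ]


# Toy edge in the plane with some curvature in each coordinate.
y1, y2 = [0.0, 0.0], [1.0, 2.0]
alpha, beta = [0.0, 1.0], [0.5, 0.0]
start = curved_edge(0.0, y1, y2, alpha, beta)
middle = curved_edge(0.5, y1, y2, alpha, beta)
end = curved_edge(1.0, y1, y2, alpha, beta)
```

At $\lambda = 0.5$ the cubic term is zero, so only $\alpha_{js}$ moves the curve away from the midpoint of the straight segment, by $-\alpha_{js}/4$ in coordinate $s$.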

Appendix B.2. Penalized Likelihood

Suppose that we have $n$ data points $X_{1}, \dots, X_{n}$ in $p$ dimensions, where $X_{i} \in \mathbb{R}^{p}$ for $i = 1, \dots, n$. For a curved tree $G = (V, E, y, F)$ fitted to the data, the projection of a data point $X_{i}$ on the $j$-th edge of the curved tree is denoted as $f_{j}(\lambda_{i})$. Then we assume that

$$X_{i} \mid G \sim N(f_{j}(\lambda_{i}), \sigma^{2} I),$$

thus, we have

$$p(X_{i} \mid G) = (2 \pi \sigma^{2})^{-\frac{p}{2}} \exp \left\{ -\frac{1}{2 \sigma^{2}} (X_{i} - f_{j}(\lambda_{i}))^{T} (X_{i} - f_{j}(\lambda_{i})) \right\} .$$

Now, we can calculate the likelihood function as follows,

$$L(G; X) = (2 \pi \sigma^{2})^{-\frac{np}{2}} \exp \left\{ -\frac{1}{2 \sigma^{2}} \sum_{j = 1}^{|V| - 1} \sum_{k = 1}^{n_{j}} \| X_{k} - f_{j}(\lambda_{k}) \|^{2} \right\} \propto \exp \left\{ -\frac{1}{2 \sigma^{2}} \sum_{j = 1}^{|V| - 1} \sum_{k = 1}^{n_{j}} \| X_{k} - f_{j}(\lambda_{k}) \|^{2} \right\}, \quad \text{(A1)}$$

where the outer sum runs over the $|V| - 1$ edges of the tree, and $\sigma^{2}$ is treated as known (it can be estimated from the data). The negative log-likelihood function is

$$-l = -\log(L) = \frac{np}{2} \log(2 \pi \sigma^{2}) + \frac{1}{2 \sigma^{2}} \sum_{j = 1}^{|V| - 1} \sum_{k = 1}^{n_{j}} \| X_{k} - f_{j}(\lambda_{k}) \|^{2}, \quad \text{(A2)}$$

so, for fixed $\sigma^{2}$, minimizing $-l$ is equivalent to minimizing the residual sum of squares $\sum_{j = 1}^{|V| - 1} \sum_{k = 1}^{n_{j}} \| X_{k} - f_{j}(\lambda_{k}) \|^{2}$.

Based on the previous assumption, we propose a penalized negative log-likelihood with a lasso penalty and edge shrinkage as

$$\sum_{j = 1}^{|V| - 1} \sum_{k = 1}^{n_{j}} \| X_{k} - f_{j}(\lambda_{k}) \|^{2} + \phi_{1} \sum_{j = 1}^{|V| - 1} \sum_{s = 1}^{p} \left( |\alpha_{js}| + |\beta_{js}| \right) + \phi_{2} \sum_{e = (j, k) \in E} \| y_{j} - y_{k} \|^{2} . \quad \text{(A3)}$$

The above penalized likelihood function is the sum of squared Euclidean distances between the data points and their projections on the tree, plus two penalties. The first penalty shrinks the curvature of the curved tree to control the complexity of the tree structure. The second one shrinks the edge lengths to avoid unnecessarily long edges. We use small penalty weights: $\phi_{1} = 0.01$ to control the curvature and $\phi_{2} = 0.01$ to avoid unnecessarily long edges.
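As a worked instance of the penalized objective, the sketch below adds the two penalties to a given residual sum of squares. The function name `penalized_objective`, the argument layout and the toy numbers are assumptions; the residual sum of squares is taken as an input rather than computed from projections.

```python
def penalized_objective(residual_ss, alphas, betas, y, edges, phi1=0.01, phi2=0.01):
    """Residual sum of squares plus lasso curvature and edge-length penalties.

    alphas[j] and betas[j] hold the curvature parameters of edge j,
    y[v] is the embedding of vertex v, and edges lists vertex index pairs.
    """
    curvature = sum(abs(a) for row in alphas for a in row)
    curvature += sum(abs(b) for row in betas for b in row)
    edge_length = sum(
        sum((u - v) ** 2 for u, v in zip(y[j], y[k])) for j, k in edges
    )
    return residual_ss + phi1 * curvature + phi2 * edge_length


# One-edge toy tree: curvature penalty 0.01 * 2.0, edge penalty 0.01 * 25.0.
value = penalized_objective(
    residual_ss=2.0,
    alphas=[[0.5, -0.5]],
    betas=[[0.0, 1.0]],
    y=[[0.0, 0.0], [3.0, 4.0]],
    edges=[(0, 1)],
)
```

With these numbers the objective is $2.0 + 0.02 + 0.25 = 2.27$.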

Appendix B.3. Projection of the Data to the Tree

In order to evaluate the penalized negative log-likelihood function for the data, we first describe how to calculate the projection of each data point onto a given curved tree. For an embedding curved tree $G = (V, E, y, F)$, the projection of a data point $X_{i}$ onto the curved tree $G$ is defined as $\phi_{G}(X_{i})$, where $\phi_{G}(X_{i}) = f(\lambda_{i})$. We use a piecewise linear approximation to calculate $\phi_{G}(X_{i})$. In particular, we divide the curved edge $e$ into $L$ small line segments. Since we have $\lambda_{i} \in [0, 1]$, we obtain a grid of values $\lambda_{l} = \frac{l}{L}$ with $l = 0, \dots, L$. For each line segment $e_{l} = (f(\lambda_{l - 1}), f(\lambda_{l}))$, $l = 1, \dots, L$, we compute the projection $\phi_{e_{l}}(X_{i})$ of $X_{i}$ onto the segment by the equation for the linear tree and write $\lambda_{il}$ for its projection index, so that $f(\lambda_{il}) \approx \phi_{e_{l}}(X_{i})$. We further define $d_{il} = \| X_{i} - f(\lambda_{il}) \|^{2}$ to measure the distance between the data point $X_{i}$ and its projection on the $l$-th segment. Then, the projection index of the data point $X_{i}$ on the curved tree is $\lambda_{i} = \lambda_{il^{*}}$, where $d_{il^{*}} = \min_{l} d_{il}$. The corresponding approximate projection of $X_{i}$ on $G$ is computed as $f(\lambda_{i}) = f(\lambda_{il^{*}})$.
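The piecewise linear projection can be sketched as follows for a single curved edge. The linear-tree projection equation is not reproduced in this appendix, so the clamped orthogonal projection in `project_to_segment` is an assumed stand-in, and the function names and the default $L = 100$ are likewise hypothetical.

```python
def project_to_segment(x, a, b):
    """Orthogonal projection of x on segment a-b, clamped to the segment.

    Returns (t, point) with t in [0, 1] the relative position on the segment.
    """
    d = [bi - ai for ai, bi in zip(a, b)]
    dd = sum(v * v for v in d)
    if dd == 0.0:
        return 0.0, list(a)
    t = sum((xi - ai) * di for xi, ai, di in zip(x, a, d)) / dd
    t = min(1.0, max(0.0, t))
    return t, [ai + t * di for ai, di in zip(a, d)]


def project_to_curve(x, f, L=100):
    """Approximate projection of x on the curve f: [0, 1] -> R^p using L chords.

    Returns (squared distance, projection index, projected point).
    """
    best = None
    for l in range(1, L + 1):
        lam0, lam1 = (l - 1) / L, l / L
        t, q = project_to_segment(x, f(lam0), f(lam1))
        dist2 = sum((xi - qi) ** 2 for xi, qi in zip(x, q))
        # "<=" keeps the largest index among ties, mirroring the sup in the definition.
        if best is None or dist2 <= best[0]:
            best = (dist2, lam0 + t * (lam1 - lam0), q)
    return best


# Sanity check on a straight edge from (0, 0) to (1, 0).
straight = lambda lam: [lam, 0.0]
d_in, lam_in, _ = project_to_curve([0.3, 0.5], straight)
d_out, lam_out, _ = project_to_curve([2.0, 0.0], straight)
```

For a full tree the same routine would be run on every edge and the edge with the smallest distance kept.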

Appendix B.4. Updating Vertex Embedding Location

In the optimization process, we start with the optimal linear tree found by the STS algorithm. We assign $\gamma^{0}$ as the initial guess for the curvature parameters of the principal curves. Then, we obtain the initial curved tree $G^{0}$ based on the optimal linear tree. We update the embedding location $y$ and the curvature parameters $\alpha$ and $\beta$ with the line search algorithm.

Without loss of generality, suppose there is a vertex $v \in V$ and $\{v_{1}, v_{2}, \dots, v_{m}\}$ is the set of all the vertices that $v$ connects to. Let $x_{k_{1}}, \dots, x_{k_{n_{k}}} \in \mathbb{R}^{p}$ be the data points projected to edge $e_{k} = (v_{k}, v)$ for $k = 1, \dots, m$. The new embedding location $y^{*}$ for $v$ and the new curvature parameters $\gamma_{k}^{*}$ for the curved edge $e_{k}$ can be calculated by minimizing the objective function

$$S(y, \gamma_{k}) = \sum_{k = 1}^{m} \left[ \sum_{i = 1}^{n_{k}} \| x_{k_{i}} - f_{k}(\lambda_{k_{i}}) \|^{2} + \phi_{1} \sum_{s = 1}^{p} \left( |\alpha_{ks}| + |\beta_{ks}| \right) + \phi_{2} \| y - y_{k} \|^{2} \right] .$$

When we update the curved tree $G$ with $y^{*}$ and $\gamma_{k}^{*}$, the projections of the data points on the new curved tree also change. Hence, there are no closed-form solutions for $y^{*}$ and $\gamma_{k}^{*}$. We apply the backtracking line search algorithm to compute $y^{*}$ and $\gamma_{k}^{*}$. Since we need to update both the embedding location $y$ and the curvature parameters $\gamma_{k}$, the backtracking line search algorithm in Section 2.4 is modified to update the two parts in turn. We first update $\gamma_{k}$ as follows. We set the initial values $\gamma_{k}^{0}$, and we have $\tau_{\gamma}^{0} = \tau_{max} = 1$. Then, for the $b$-th iteration, $b = 1, \dots, B$, we take the following steps:

1. $\tau_{\gamma}^{b} = p \tau_{\gamma}^{b - 1}$,
2. $\gamma_{k}^{b + 1} = \gamma_{k}^{b} - \tau_{\gamma}^{b} \frac{\partial S_{b}}{\partial \gamma_{k}^{b}}$,
3. Continue the iteration until $S \left( y, \gamma_{k}^{b} - \tau_{\gamma}^{b} \frac{\partial S_{b}}{\partial \gamma_{k}^{b}} \right) - S(y, \gamma_{k}^{b}) \leq c \tau_{\gamma}^{b} \left( \frac{\partial S_{b}}{\partial \gamma_{k}^{b}} \right)^{T} \frac{\partial S_{b}}{\partial \gamma_{k}^{b}}$.

Here, we calculate the gradient $\frac{\partial S_{b}}{\partial \gamma_{k}^{b}}$ numerically, where

$$\frac{\partial S_{b}}{\partial \gamma_{k}^{b}} \approx \frac{S(y, \gamma_{k}^{b} + \epsilon \mathbf{1}) - S(y, \gamma_{k}^{b})}{\epsilon} .$$

We use a small $\epsilon = 1 \times 10^{-6}$ in this numerical approximation.

Once we have the final updated curvature parameters $\gamma_{k}^{*}$, we further update $y$ with a similar backtracking line search process as follows. We set the initial values $y^{0}$ and $\tau_{y}^{0} = \tau_{max} = 1$. Then, for the $b$-th iteration, $b = 1, \dots, B$, we take the following steps:

1. $\tau_{y}^{b} = p \tau_{y}^{b - 1}$,
2. $y^{b + 1} = y^{b} - \tau_{y}^{b} \frac{\partial S_{b}}{\partial y^{b}}$,
3. Continue the iteration until $S \left( y^{b} - \tau_{y}^{b} \frac{\partial S_{b}}{\partial y^{b}}, \gamma_{k}^{*} \right) - S(y^{b}, \gamma_{k}^{*}) \leq c \tau_{y}^{b} \left( \frac{\partial S_{b}}{\partial y^{b}} \right)^{T} \frac{\partial S_{b}}{\partial y^{b}}$.

Here, we calculate the gradient $\frac{\partial S_{b}}{\partial y^{b}}$ numerically, in the same way as for $\gamma_{k}$, where

$$\frac{\partial S_{b}}{\partial y^{b}} \approx \frac{S(y^{b} + \epsilon \mathbf{1}, \gamma_{k}^{*}) - S(y^{b}, \gamma_{k}^{*})}{\epsilon},$$

and $\epsilon$ is the same small perturbation used for $\gamma_{k}$. We use $p = 0.8$ and $c = 0.5$ in our simulation studies. Finally, we repeat the above two backtracking line search processes until the change in the objective function falls below the tolerance.
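The two line searches share the same structure, which can be sketched generically for a parameter vector `z` standing for either $\gamma_{k}$ or $y$. This is an illustration under stated assumptions, not the authors' implementation: the gradient is approximated by perturbing one coordinate at a time, the acceptance test is the standard Armijo sufficient-decrease condition with a negative right-hand side (which differs in sign from the stopping rule as printed above), and the function names, `max_shrinks`, `tol` and the quadratic test function are hypothetical.

```python
def numerical_gradient(S, z, eps=1e-6):
    """Forward-difference gradient of S at z, one coordinate at a time."""
    base = S(z)
    grad = []
    for i in range(len(z)):
        shifted = list(z)
        shifted[i] += eps
        grad.append((S(shifted) - base) / eps)
    return grad


def backtracking_step(S, z, p=0.8, c=0.5, tau_max=1.0, max_shrinks=50):
    """One gradient step whose size is shrunk by p until Armijo's rule holds."""
    g = numerical_gradient(S, z)
    gg = sum(v * v for v in g)
    s0 = S(z)
    tau = tau_max
    for _ in range(max_shrinks):
        tau *= p  # step 1: shrink the step size
        candidate = [zi - tau * gi for zi, gi in zip(z, g)]  # step 2
        if S(candidate) - s0 <= -c * tau * gg:  # step 3 (assumed Armijo form)
            return candidate
    return list(z)  # no acceptable step found; keep the current point


def minimize(S, z0, tol=1e-10, max_outer=500):
    """Repeat backtracking steps until the objective changes by less than tol."""
    z = list(z0)
    for _ in range(max_outer):
        z_new = backtracking_step(S, z)
        if abs(S(z) - S(z_new)) < tol:
            return z_new
        z = z_new
    return z


# Toy objective with known minimizer (1, -2).
quadratic = lambda z: (z[0] - 1.0) ** 2 + 2.0 * (z[1] + 2.0) ** 2
z_hat = minimize(quadratic, [0.0, 0.0])
```

In the curved tree setting, `S` would re-project the data onto the updated tree at every evaluation, and the routine would be applied first to $\gamma_{k}$ with $y$ fixed and then to $y$ with $\gamma_{k}^{*}$ fixed.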

Appendix B.5. Pseudotime Calculation

Once the optimal curved tree is found through the above stochastic curved tree searching method, we compute pseudotime using the same shortest path algorithm [12] as applied by Bendall et al. (2014) in the pseudotime calculation for Wanderlust. However, unlike the shortest path algorithm applied in the linear tree method, we use the arc length of a nonlinear edge in the calculation of pseudotime. We first calculate the arc lengths of the nonlinear edges with the piecewise linear approximation described in Appendix B.3. Then, the arc length from the origin to each cell is computed by the shortest path algorithm. Finally, we define this distance as the pseudotime of the cell and order the cells by their pseudotimes.
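The pseudotime calculation can be sketched as follows: edge weights are chord-sum approximations of arc length, vertex distances come from Dijkstra's algorithm, and a cell's pseudotime is the distance to the start vertex of its edge plus the arc length up to its projection index. The toy tree, the function names, and the convention that each edge is parametrized away from the origin are assumptions made for this example.

```python
import heapq
import math


def arc_length(f, lam_end=1.0, L=100):
    """Approximate arc length of f from lambda = 0 to lam_end using L chords."""
    total, prev = 0.0, f(0.0)
    for l in range(1, L + 1):
        cur = f(lam_end * l / L)
        total += math.dist(prev, cur)
        prev = cur
    return total


def shortest_paths(adjacency, source):
    """Dijkstra's algorithm; adjacency maps a vertex to (neighbor, weight) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy tree 0 - 1 - 2: a straight unit edge, then a quarter circle of radius 1.
edge_curves = {
    (0, 1): lambda lam: [lam, 0.0],
    (1, 2): lambda lam: [1.0 + math.sin(math.pi * lam / 2),
                         1.0 - math.cos(math.pi * lam / 2)],
}
adjacency = {}
for (u, v), f in edge_curves.items():
    w = arc_length(f)
    adjacency.setdefault(u, []).append((v, w))
    adjacency.setdefault(v, []).append((u, w))

vertex_time = shortest_paths(adjacency, 0)  # vertex 0 is the origin
# A cell projected on edge (1, 2) at projection index 0.5.
cell_time = vertex_time[1] + arc_length(edge_curves[(1, 2)], lam_end=0.5)
```

Cells would then be ordered by sorting on these pseudotime values.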

References

1. Cannoodt, R.; Saelens, W.; Saeys, Y. Computational methods for trajectory inference from single-cell transcriptomics. Eur. J. Immunol. 2016, 46, 2496–2506.
2. Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019, 37, 547–554.
3. Moon, K.R.; Stanley, J.S.; Burkhardt, D.; van Dijk, D.; Wolf, G.; Krishnaswamy, S. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr. Opin. Syst. Biol. 2018, 7, 36–46.
4. Bendall, S.; Davis, K.; Amir, E.a.; Tadmor, M.; Simonds, E.; Chen, T.; Shenfeld, D.; Nolan, G.; Pe’er, D. Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development. Cell 2014, 157, 714–725.
5. Setty, M.; Tadmor, M.D.; Reich-Zeliger, S.; Angel, O.; Salame, T.M.; Kathail, P.; Choi, K.; Bendall, S.; Friedman, N.; Pe’er, D. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 2016, 34, 637–645.
6. Welch, J.D.; Hartemink, A.J.; Prins, J.F. SLICER: Inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 2016, 17, 106.
7. Trapnell, C.; Cacchiarelli, D.; Grimsby, J.; Pokharel, P.; Li, S.; Morse, M.; Lennon, N.J.; Livak, K.J.; Mikkelsen, T.S.; Rinn, J.L. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 2014, 32, 381–386.
8. Ji, Z.; Ji, H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016, 44, e117.
9. Street, K.; Risso, D.; Fletcher, R.B.; Das, D.; Ngai, J.; Yosef, N.; Purdom, E.; Dudoit, S. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 2018, 19, 477.
10. Haghverdi, L.; Büttner, M.; Wolf, F.A.; Buettner, F.; Theis, F.J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 2016, 13, 845–848.
11. Qiu, X.; Mao, Q.; Tang, Y.; Wang, L.; Chawla, R.; Pliner, H.A.; Trapnell, C. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 2017, 14, 979–982.
12. Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271.
13. Hayashi, T.; Ozaki, H.; Sasagawa, Y.; Umeda, M.; Danno, H.; Nikaido, I. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat. Commun. 2018, 9, 619.
14. Guo, G.; Huss, M.; Tong, G.Q.; Wang, C.; Li Sun, L.; Clarke, N.D.; Robson, P. Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst. Dev. Cell 2010, 18, 675–685.
15. Lu, L.; Jiang, H.; Wong, W.H. Multivariate Density Estimation by Bayesian Sequential Partitioning. J. Am. Stat. Assoc. 2013, 108, 1402–1410.
16. Hastie, T.; Stuetzle, W. Principal Curves. J. Am. Stat. Assoc. 1989, 84, 502–516.

Figure 1. The flow chart for the stochastic tree searching algorithm.

Figure 2. Kendall correlations with different noise level for two different noise distribution cases. (a) Normal. (b) T distribution.

Figure 3. Running time with different sample sizes for tree sizes.

Table 1. Mean Kendall correlations and mean residual standard error for both curved tree algorithm and linear tree algorithm with different noise level.

Noise Level	Kendall Correlation, Curved Tree, Mean (SD)	Kendall Correlation, Linear Tree, Mean (SD)	Residual Standard Error, Curved Tree, Mean (SD)	Residual Standard Error, Linear Tree, Mean (SD)
0.01	0.87 (0.23)	0.84 (0.26)	0.0356 (0.0325)	0.1812 (0.0365)
0.05	0.83 (0.29)	0.82 (0.29)	0.0529 (0.0293)	0.1839 (0.0399)
0.10	0.83 (0.25)	0.81 (0.29)	0.0889 (0.0285)	0.1988 (0.0410)
0.15	0.81 (0.27)	0.79 (0.30)	0.0889 (0.0283)	0.1978 (0.0403)

Table 2. Mean Kendall correlations and mean residual standard error for both curved tree algorithm and linear tree algorithm with different curvature.

Curvature	Kendall Correlation, Curved Tree, Mean (SD)	Kendall Correlation, Linear Tree, Mean (SD)	Residual Standard Error, Curved Tree, Mean (SD)	Residual Standard Error, Linear Tree, Mean (SD)
0	0.85 (0.24)	0.84 (0.24)	0.0823 (0.0262)	0.1753 (0.0452)
0.5	0.81 (0.29)	0.81 (0.29)	0.0889 (0.0285)	0.1988 (0.0410)
1.0	0.83 (0.24)	0.78 (0.26)	0.1027 (0.0420)	0.2621 (0.0366)
1.5	0.60 (0.34)	0.57 (0.34)	0.1325 (0.0466)	0.3039 (0.0376)
2.0	0.48 (0.43)	0.43 (0.40)	0.1405 (0.0478)	0.3248 (0.0409)

Table 3. Kendall correlation for all methods on the mouse ES cell dataset.

Method	Linear Tree	Curved Tree	SLICER	Slingshot	Monocle ICA	Monocle DDRTree
Kendall Correlation	0.87	0.87	0.39	0.87	0.86	0.87

Table 4. Kendall correlation for all methods on the Zygote–Blastocyst dataset.

Method	Linear Tree	Curved Tree	SLICER	Slingshot	Monocle ICA	Monocle DDRTree
Kendall Correlation	0.76	0.76	0.69	0.78	0.51	0.69

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
