Abstract
The recent advancement in single-cell RNA sequencing technologies enables the understanding of dynamic cellular processes at the single-cell level. Using trajectory inference methods, pseudotimes can be estimated based on reconstructed single-cell trajectories which can be further used to gain biological knowledge. Existing methods for modeling cell trajectories, such as minimal spanning tree or k-nearest neighbor graph, often lead to locally optimal solutions. In this paper, we propose a penalized likelihood-based framework and introduce a stochastic tree search (STS) algorithm aiming at the global solution in a large and non-convex tree space. Both simulated and real data experiments show that our approach is more accurate and robust than other existing methods in terms of cell ordering and pseudotime estimation.
1. Introduction
The advancement of single-cell RNA sequencing enables measuring of gene expression for individual cells to prompt an understanding of dynamic cellular processes, including cell state transitions such as cell differentiation. Reconstructing a cell trajectory from the gene expression for a sample of cells is one new research area made possible by this technology. However, the high-dimensional gene expression data space and the associated high-level noise pose difficulties in modeling the trajectory from the original expression data [1]. One way to reconstruct the cell trajectory is by the calculation of pseudotime, where pseudotime is a measure of the distance of a particular cell from the origin in a dynamic process. This type of computational approach is called trajectory reconstruction (TR) [2]. To overcome the challenges in single-cell trajectory analysis, TR methods generally have two main steps: First, to handle the high dimensionality and high noise level in the expression data, a dimensionality reduction method is applied to convert the original high-dimensional data space into a low-dimensional space. Both linear and nonlinear dimensionality methods can be considered in this step to address different types of data. The second step is to model the trajectory in the dimension-reduced space. For instance, a minimal spanning tree (MST) is fitted or the k-nearest neighbor (KNN) graph is applied to model the cell trajectory [3]. This paper focuses on the second trajectory-modeling approach, to further improve the performance of the TR method in the reconstruction of single-cell trajectories.
The existing TR methods can be classified into three main categories based on the trajectory-modeling steps. First, Wanderlust, Wishbone and SLICER are all designed based on the KNN graph. Wanderlust was not originally developed for single-cell transcriptomics data but for cytometry data, so a dimensionality reduction is implemented and the trajectory-modeling is applied directly to the high-dimensional data space [4]. Based on Wanderlust, Wishbone adds a dimensionality reduction step before trajectory modeling to address the high-dimensionality challenge in the scRNA-seq data. Unlike Wanderlust, Wishbone can also detect a bifurcating event with two branches [5]. Similarly to Wishbone, SLICER fits the KNN graph on a lower dimension but with a different dimension reduction approach—LLE. SLICER requires less prior information than the previous two approaches and is also able to detect “bubbles”—a special type of cell trajectory [6]. Secondly, several TR methods, such as Monocle, TSCAN and Slingshot, use an MST to model a cell trajectory. Monocle is a pioneering method which uses ICA to reduce dimensionality [7]. TSCAN further reduces the complexity in the data structure by a model-based clustering algorithm, where an MST is obtained by connecting the cluster centers [8]. More recently, Slingshot also fits an MST on clusters of cells similar to TSCAN, but the MST is treated as the initial guess for the simultaneous principal curve algorithm [9]. Third, in addition to the KNN graph and an MST, other trajectory-modeling methods can also be employed. For instance, diffusion pseudotime (DPT) uses diffusion-like random walks to model transitions between cells. DPT is more computationally efficient and suffers less from overfitting issues caused by the dimensional reduction step. However, DPT only models a bifurcating trajectory and does not provide an explicit tree structure [10]. As an advanced version of Monocle, Monocle2 first reduces dimensionality by the “dpFeature” method. Then, a group of centroids are obtained as the latent representation of the dimension-reduced data by the soft k-means clustering, and a spanning tree is fitted to the latent data. Finally, Monocle2 uses reversed graph embedding (RGE) to obtain a principal graph [11].
However, the majority of these TR methods use gradient approaches in the trajectory-modeling step; thus, non-convex optimization is involved with a locally searching algorithm applied [2]. With a locally searching algorithm, there is a risk that the algorithm becomes stuck in the local solution and outputs it as the final solution. Unlike these TR methods, our proposed stochastic tree search (STS) algorithm searches for the optimal solution globally. We first construct a penalized likelihood and then apply a stochastic optimization algorithm to search through the tree space and obtain the final optimal solution. Thus, we can identify an optimal tree directly in the lower dimension converted by a dimensionality reduction method. Moreover, as the objective function in the optimization is flexible, STS is applicable to more types of data than those MST-based approaches.
The remaining parts of the paper are organized as follows: In Section 2, we introduce our trajectory-modeling method with the dynamic optimization in the tree searching space. Section 3 presents a simulation study to compare our algorithm with four other methods in different settings. In Section 4, the implementation of our method is shown through two single-cell RNA sequencing datasets as well as in a comparison with four existing methods. Section 5 concludes the paper with a discussion.
2. Methods
Similar to most existing TR methods, our single-cell trajectory reconstruction approach also consists of two parts. The first part is flexible with any dimensionality reduction method, while we develop a novel stochastic tree searching process to estimate the cell trajectory for the second part. Under the likelihood framework, we define an optimization function used to find the optimal embedding tree as the estimated cell trajectory. During the stochastic tree searching process, we search through the whole embedding tree space with a pool of candidate trees at each time and start from the simplest one-edge tree structure to a more complicated tree structure with more edges. The details of our stochastic tree searching algorithm are shown in the following subsections.
2.1. Preprocessing
Before fitting the tree structure on the data, the raw gene read counts are normalized by log2 transformation. Then, the dimension of normalized data is reduced by any linear or nonlinear method. Principal component analysis (PCA) applies a linear projection of the data, which preserves the variance in the new lower dimension space. Locally linear embedding (LLE), diffusion maps, and t-SNE are more general approaches without the linear relationship assumption, so these methods are able to find nonlinear relationships between cells. The data drive the choice of dimensionality reduction approach. If the linear assumption holds in the data, PCA will be applied since it is relatively computational efficient [1]. Otherwise, nonlinear methods will be applied, especially for the more recent data set with a more complex cell trajectory.
2.2. Penalized Likelihood
We use an embedding tree on the lower-dimension space to estimate the cell trajectory. An embedding tree can be defined with three main components , and . is a vertex set of size , and is an edge set with . is the associated vertex embedding, where . When we fit the embedding tree with n data points , we assume that
where and is the projection of to the embedding tree . Now we have
which leads to the likelihood function
where we assume that is known and can be estimated from the data. The negative log-likelihood function is
Finally, we can find the optimal embedding tree by minimizing the following penalized negative log-likelihood function,
is the loss function derived from the negative log-likelihood function as follows,
where is the projection of to the embedding tree . The penalty term has two components as follows,
The first part in the penalty term controls the complexity of the tree structure. With the larger tree size, more penalties will be added to the optimization function. Based on the BIC
where is the sample variance. K is the number of free parameters for the tree model, so we have . Therefore, we set and as follows,
The second part shrinks the length of tree edges. We use a small penalty term with to avoid any unnecessary long edges.
2.3. Projection of the Data to the Tree
In order to obtain the penalized negative log-likelihood function for the data, we first introduce the way to calculate the projection of each data point to a given tree. For an embedding tree , the projection of a data point to the tree is defined as —a point on the tree such that is minimized. We further denote as the project of to an edge . Here, e is an edge connecting two vertices with the associated embedding , where . We can further define in the following way,
and
where is the mapping of when mapping the line segment to which can be calculated using the following closed-form formula
We have
and
2.4. Updating Vertex Embedding Location
In the optimization process, we start with a random tree and then iteratively project the data to the tree (detailed in the section above) and update the embedding locations of its vertices. Without loss of generality, suppose there is a vertex and is the set of all the vertices that v connects to. Let be the data points projected to edge for . The new embedding location for v can be calculated through minimizing the objective function
Let be the mapping of when mapping to , and can then be computed by the following formula,
If we assume that the mapping is fixed for , there is a closed form solution for updating . See Appendix A for more details. However, the mapping will change as the embedding location is updated. Hence, there is no simple closed-form solution for and we apply a backtracking line search algorithm to obtain as follows. We first set the initial values and . Then, for the b-th iteration, , we take the following steps:
- 1.
- 2.
- 3.
- Continue the iteration until we have .
Here there is no simple closed-form solution for the gradient as the mapping is also related to . Therefore, we calculate the gradient by the numerical method where . We use a small = 1 × 10 in the previous numerical approximation formula. We use and in our simulation studies. Finally we repeat the above backtracking line search process until the difference in the objective function reaches the tolerance.
2.5. Tree Similarity Score
To facilitate the stochastic tree search, we maintain a pool of candidate trees. During the stochastic tree searching process, we remove similar trees and therefore maintain the diversity of all candidate trees in the tree pool, which is achieved using a tree similarity score. The tree similarity score is defined to estimate the similarity between two trees of the same size. Considering two embedding trees and with the same size , the tree similarity score can be computed as follows,
The first part denote the similarity between the two vertex sets and through the embedding location and as follows,
The second part calculates the distance between two edge sets and as
where , and
2.6. Stochastic Optimization
2.6.1. Initial Tree Generation
Starting from the most simple tree structure—one-edge tree, the associated embedding locations are generated from a multivariate normal distribution . The mean vector and diagonal variance matrix can be estimated from the data. To better search through the large tree space, multiple initial trees are generated to form the initial tree pool . We then update the embedding location for each tree in the initial tree pool to better fit the data. For each updated tree , we calculate the tree similarity score and optimization score as follows,
Let denote the tree in the pool where , and a is a pre-defined cutoff value. If and , then we update the current initial tree by replacing with .
2.6.2. Grow Trees by Adding Nodes
Based on the old tree pool where we have all candidate trees with the tree size equal to k, we grow the trees by adding a new node to each tree. There are two ways to add a new node. The first one is adding a new node connected to any existing node of and obtaining the new edge . The probability of selecting any node among all nodes of is set as follows,
Here, we assume there is an equal probability of selecting each existing node to connect with the new node. In order to make the stochastic optimization more efficient, we generate the new embedding location of based on the guidance from the data. In detail, we first compute the residuals based on ,
Then, the residuals are standardized as
Finally, we sample the embedding location of the new node from with the sampling weights . The new node is more likely to locate in the area with larger residuals since they have a higher sampling probability. The parameter controls the level of new embedding location driven by residuals.
The other way to generate a new node is by adding a middle point to any edge of . Each edge of has the same probability to be chosen to add a middle point, where
We replace the previous edge e with two new edges and . The new embedding location can be calculated by the following formula,
where and are the embedding locations for and .
2.6.3. Optimizing Tree with Data
From the old tree pool , we can obtain a new tree pool by adding a new node as described in the previous section. For each new tree, we further update the embedding location for all nodes to better fit the data. Similar to the initial tree updating procedure, for each updated tree in the optimization process, we compute the tree similarity score and optimization score . If and , then we update the current initial tree by replacing with , where is the tree in the pool with the smallest tree similarity score . We repeat adding a new node step on the sequence of old trees for times and update the existing new tree pool according to the above two criteria.
2.6.4. Final Optimal Tree
According to the optimization score , we order trees in each tree pool , for , and then we select the top ranked tree with smallest optimization score . When , we stop developing trees and obtain the final optimal tree . The details of the process of finding the final optimal tree are shown in Figure 1.
Figure 1.
The flow chart for the stochastic tree searching algorithm.
2.7. Pseudotime Calculation
Once the optimal tree is found through the above stochastic tree searching method, we can then compute pseudotime through the same shortest path algorithm as applied by Bendall (2014) in the pseudotime calculation for Wanderlust [12]. We first define the length of an edge as the Euclidean distance between the two vertices on this edge, so that a corresponding adjacency matrix for the optimal tree can be calculated. Then, the distance between each cell to the origin is computed by the shortest path algorithm. Finally, we define the distance as the pseudotime for cells and order cells by their pseudotimes.
2.8. Extension from Linear Trees to Nonlinear Trees
The assumption of a linear tree-shaped cell trajectory is not always valid for most scRNA-seq data sets. The allowance of nonlinear cell trajectory estimation can improve the accuracy in the reconstruction of a cell trajectory. Hence, we also propose an improved the global searching algorithm—a stochastic tree searching algorithm to include nonlinear cell trajectories. Instead of the linear embedding tree, the new algorithm plans to apply a curved embedding tree with curved edges modeled by bounded principal curves. The details of the curved tree method can be found in Appendix B.
3. Simulation
3.1. Design
To check the accuracy and the robustness to the data noise of our algorithm, a k-nodes embedding tree in p dimensions is randomly generated with the embedding location of each node simulated from a multivariate normal distribution . Considering the complexity in the real data set, additional noise in the data is generated from a known distribution. In the first part of the simulation study, the data noise is simulated from a standard normal distribution . In the second part, the noise is generated from a student t distribution . In this case, potential extreme outliers will occur, driving the data far away from the true tree structure. A parameter denotes the scale of noise in the simulated data and controls the noise level. The larger means more noise in the simulated data. In the study, we gradually increase the noise level to examine the robustness of our algorithm. Because of the significance of pseudotime in the real data application, the accuracy is calculated based on the ordering of pseduotime assignments. Kendall rank correlation coefficient is computed between the estimated pseudotime and the simulated pseudotime based on the true structure.
3.2. Comparison with Other Methods
Four existing single-cell trajectory-inferring methods are also applied to the simulated data as comparisons. These four approaches use an MST, the KNN graph or more complicated methods such as the principal curve in the trajectory-modeling part, which covers a majority of models for reconstructing a cell trajectory. Monocle models the trajectory by an MST in the lower dimension converted by ICA. More complicated than Monocle, Monocle2 uses an additional clustering algorithm to obtain the latent data representation in the lower dimension and then uses RGE to learn a principal curve. Slingshot also models the trajectory based on an MST. However, the MST is fitted on the clusters of cells and finally is treated as an initial guess for a simultaneous principal curve algorithm, which enables modeling of a nonlinear trajectory. SLICER first reduces dimensionality by LLE and then applies KNN graph for the trajectory-modeling step.
As shown in Figure 2, for the normal-distributed noise, all methods have the same trend that both the Kendall correlation and the stability will decrease as the noise level increases. Hence, the accuracy of the cell ordering based on the pseudotime is associated with the noise level. However, STS has the highest correlation and is the most stable one among all methods with any noise level. When the noise level is low, the correlation of our method is close to 1, so our method produces almost the same cell ordering result as the truth. SLICER has the second highest correlation, and Slingshot has poorer performance than SLICER. The two versions of Monocle methods have a similar performance with the lowest correlation among all methods. When the noise level is extremely high, all methods have a Kendall correlation lower than 0.5, and the estimated cell orderings are close to random assignments.
Figure 2.
Kendall correlations with different noise level for two different noise distribution cases. (a) Normal. (b) T distribution.
When the noise is generated from student t distribution in Figure 2, the Kendall correlations of all methods will drop faster than those in the normally distributed noise case when the noise level increases. The performance of all approaches is also more unstable than the normal case. With a low or moderate noise level, our method outperforms other approaches with the highest Kendall correlation. When the noise level reaches 1.5, the correlations of all methods drop below 0.5; thus, all approaches fail to accurately identify the true cell orderings. Monocle with DDRTree (Monocle2) is the most stable and least affected by the extreme outliers among all methods. STS has a slightly lower correlation than Monocle2 but is still compatible with other methods.
We also check the computational efficiency through simulations with different sample sizes and tree sizes. As Figure 3 shows, the computational cost increases nonlinearly with the growth in sample size when the tree size is fixed. However, there is a linear increasing trend indicated by Figure 3 for the running time when the tree size increases and the sample size remains the same. When the tree size is more than eight, the tree structure is complicated, making it difficult to estimate it through the data. In this case, STS will stop with a more straightforward estimated optimal tree with fewer edges.
Figure 3.
Running time with different sample sizes for tree sizes.
3.3. Comparison between Linear and Curved Tree Methods
In order to check the accuracy and the robustness to data noise of our curved tree algorithm, a k-nodes curved tree with non-zero curvature parameters in p dimensions is randomly generated with the embedding location of each node simulated from a multivariate normal distribution . Considering the complexity in the real data set, additional noise in the data is generated from a known distribution. Since the data is noisier in the curve tree framework, we only simulate the data noise from a standard normal distribution . A parameter denotes the scale of noise in the simulated data and controls the noise level. The larger means more noise in the simulated data. In the study, we first gradually increase the noise level to examine the robustness of our algorithm when the curvature level is fixed. Then, we simulate the curvature parameters of the bounded principal curves from the normal distribution . The standard deviation c of this normal distribution controls the curvature of simulated curved trees. In detail, a larger c means that there is a higher curvature on average, and thus a more complicated tree is simulated. We also simulate data with different curvature levels when the noise level is fixed. We assess the performance of our method in two parts. To check the accuracy of cell orderings, the Kendall rank correlation coefficient is computed between the estimated pseudotime and the simulated pseudotime based on the true structure. In addition, we also check the accuracy of the estimated cell trajectory by the residual standard error. The residual standard error is calculated as the square root of the mean square error between the estimated data projections and the true projections. We also compare our curved tree algorithm with the linear tree algorithm.
As shown in Table 1, all approaches have the same nonlinear trend: both the Kendall correlation and the residual standard error with the associated standard deviation will decrease as the noise level increases. Hence, the accuracy of the cell ordering is based on the pseudotime and is related to the noise level, as we found for the linear tree method. Moreover, the accuracy of trajectory estimation is also related to the noise level. When we compare the two algorithms with the same noise level, there is no significant difference in Kendall correlation between the two methods. Regarding residual standard error, the curved tree algorithm outperforms the linear tree algorithm with much lower values. In general, the curved tree algorithm does not significantly improve the accuracy of cell orderings, but the estimation of cell trajectory is more accurate.
Table 1.
Mean Kendall correlations and mean residual standard error for both curved tree algorithm and linear tree algorithm with different noise level.
From Table 2, both the Kendall correlation and residual standard error are positively correlated with the curvature for the two methods. With the increasing curvature, there is no significant difference in Kendall correlations between the curved and linear tree methods. However, with the larger curvature, the Kendall correlations between the two methods have a more significant difference. In terms of residual standard error, the curved tree method has a smaller value than the linear tree approach with the same curvature. Moreover, with the increase in curvature, both two approaches become more unstable with the higher variation in Kendall correlation and residual standard error.
Table 2.
Mean Kendall correlations and mean residual standard error for both curved tree algorithm and linear tree algorithm with different curvature.
4. Application
4.1. Induction of Mouse Embryonic Stem (ES) Cell Differentiation
Time series data of 421 RamDA-seq samples with mouse ES cells were collected at five different time points with 157,717 gene features during the cell differentiation process. These time points are 0, 12, 24, 48 and 72 h after the induction of cell differentiation into primitive endoderm (PrE) cells [13]. We pre-processed the data in the same way as Cannoodt (2016) did. The data is first filtered by only keeping cells with good quality, so the number of cells reduces to 414. Then, the count data is normalized by log2 transformation. Finally, 23,658 gene features are selected based on the feature variability. With the final gene expression matrix, diffusion map is used to further reduce the data dimensionality. With the normalized data transformed into a lower dimension, we apply STS to estimate the cell trajectory and calculate pseudotime to order cells. Further, we also fit the same four existing TR methods in the simulation studies—Slingshot, SLICER, Monolce and Monocle2.
Hayashi (2018) identified a cell differentiation trajectory in their paper, where cells move from the initial state to the final state through an intermediate transition state. From Figure 4, our two approaches can obtain an estimated cell trajectory close to the truth. For the four existing TR methods, Slingshot, Monocle and Monocle2 cluster on cells but classify different numbers of clusters ranging from 5 to 16. Both SLICE and Monocle2 estimate the cell trajectory slightly differently from the other three methods, as they also identify a small branch at the transition state. In terms of pseudotime estimation, not all of these methods accurately order cells. Figure 5 shows the relationship between the estimated pseudotime and the true time for each method. For all the methods except SLICER, there is a positive correlation between the estimation and truth, which indicates the pseudotime calculation consistent with the true time points. SLICER is unable to estimate a pseudotime accurately related to the truth. The same result is indicated in Table 3. The Kendall correlation for SLICER is the lowest, at , among all TR methods. Other methods all have a Kendall correlation greater than and our two methods have the highest correlation. All the methods except SLICER are able to estimate pseudotime accurately and recover the true cell orders.
Figure 4.
Estimated cell trajectories by our linear STS and curved tree methods on the mouse ES cell dataset. (a) Linear tree method. (b) Curved tree method.
Figure 5.
Scatter plots between the estimated pseudotimes and the true times for our approach and four existing TR methods on the mouse ES cell dataset. (a) SLICER. (b) Slingshot. (c) Monocle ICA. (d) Monocle DDRTree. (e) Linear Tree. (f) Curved Tree.
Table 3.
Kendall correlation for all methods on the mouse ES cell dataset.
Comparing our linear and curved tree methods, the curved tree method () has a slightly lower residual standard error than the linear tree method (). Therefore, the curved tree method estimates the cell trajectory more accurately than the linear tree approach. In terms of pseudotime estimation, consistent with the findings from the previous data set, our curve tree method does not have any significant improvement on the cell orderings when compared with the linear tree method. Both the curved and the linear tree methods have the same Kendall correlation.
4.2. Resolution of Cell Fate Decisions from Zygote to Blastocyst
We also apply our STS methods on a single-cell expression data set with Ct values of 48 genes from 442 cells harvested over the first four days of mouse development. We try to recover the dynamic cell development process from zygote through blastocyst (from 1-cell stage to 64-cell stage). We follow the same data cleaning and normalization process as Guo et al. (2010) in [14]. Firstly, duplicate cells from two or more cell embryos are removed, and the cells with Ct values less than 28 are also removed. Then, we normalize the cell Ct values using the endogenous controls Actb and Gapdh by subtracting their average Ct values for each cell. With the final gene expression matrix, diffusion map is used to further reduce the data dimensionality. With the normalized data transformed into a lower dimension, we apply our linear and curved tree methods to estimate the cell trajectory and calculate pseudotime to order cells. Further, we also fit the same four existing TR methods in the simulation studies—Slingshot, SLICER, Monolce and Monocle2.
The previous study shows that the cells from the 64-cell stage are subdivided into the trophectoderm (TE), the epiblast (EPI) and the primitive endoderm (PE). In addition, the cells from the 32-cell stage can be classified as the inner cell mass (ICM) or TE in the middle of the cell development [14]. From Figure 6, our two approaches can obtain an estimated cell trajectory close to the truth. For the four existing TR methods, Monocle2 can also recover a similar three-branch structure to that in our method. However, Slingshot, SLICER and Monocle only identify two cell types at 64-cell stage. In terms of pseudotime estimation, not all of these methods accurately order cells. Figure 7 shows the relationship between the estimated pseudotime and the true time for each method. Both Slingshot and our approach show a positive correlation between the estimation and truth, indicating the pseudotime calculation consistent with the true time points. SLICER, Monocle and Monocle2 fail to distinguish between the first three stages as well as the last two stages. The same result is indicated in Table 4. The Kendall correlation for Monocle is the lowest, at , among all TR methods. SLICER and Monocle2 have a Kendall correlation equal to and our algorithm has the second highest correlation of . Slingshot has a slightly higher correlation than our methods. This data set shows a more complicated dynamic process than the previous data set, so none of these methods can order the cells highly close to the truth. Nevertheless, STS and Slingshot can order the majority of the cells accurately.
Figure 6.
Cell trajectories estimated by our linear STS and curved tree methods on the Zygote–Blastocyst dataset. (a) Linear tree method. (b) Curved tree method.

Figure 7.
Scatter plots between estimated pseudotimes and true times for our approach and four existing TR methods on the Zygote–Blastocyst dataset. (a) SLICER. (b) Slingshot. (c) Monocle ICA. (d) Monocle DDRTree. (e) Linear Tree. (f) Curved Tree.
Table 4.
Kendall correlation for all methods on the Zygote–Blastocyst dataset.
Comparing between our linear and curved tree methods, the curved tree method () has a slightly lower residual standard error than the linear tree method (). Therefore, the curved tree method estimates the cell trajectory more accurately than the linear tree approach. In terms of pseudotime estimation, consistent with the findings from the previous data set, our curved tree method does not have any significant improvement on the cell orderings when compared with the linear tree method. Both the curved and the linear tree methods have the same Kendall correlation.
5. Discussion
The reconstruction of a cell trajectory from the scRNA-seq data is considered as a nonlinear optimization problem for many existing TR methods. These TR methods are mainly based on an MST or KNN graph to search for an optimal solution locally. Instead of searching locally, our algorithm provides a novel approach to directly search a global optimal fitting tree. We apply a stochastic tree optimization algorithm after pre-processing the data with normalization and dimension reduction. Based on a penalized likelihood, we start searching with a one-edge-tree pool and then gradually move to a more sophisticated tree pool by adding a new node. Although we currently use squared Euclidean distance between the data point and its projection on the tree, the optimization function is flexible and can be extended to different forms. As the simulation study and two real data examples show, STS is more accurate in cell-ordering estimation and less sensitive to outliers and skewed expression distribution. Specifically, the global optimal tree search improves the estimation performance compared with local search methods. In general, STS can be applicable to reconstruct both a simple bifurcation trajectory and more complicated multifurcation trajectories, with both high accuracy in trajectory estimation and cell ordering.
As some other challenges introduced by the complex data structure remain, there are still several potential future extensions to our STS method. Firstly, the giant tree searching space encumbers the computational efficiency of the dynamic optimization process. Secondly, the current optimization step is sensitive to extreme outliers, though the extreme outliers do not frequently occur in the real dataset. To reduce the sensitivity to the outliers, the current sum square objective function could be replaced by other functions. Moreover, based on the likelihood function, we can also calculate the probability for each tree in the tree pool by assigning prior probability for the tree size. Then we can further extend to the inference part, such as hypothesis testing [15]. Finally, instead of the backward line search algorithm we can use other optimization algorithms to reduce the sensitivity to the initial tree guess. Nevertheless, all these extensions can enable our approach to be adapted to more complex and noisy real data sets.
Author Contributions
Conceptualization, H.J. (Hongkai Ji) and H.J. (Hui Jiang); Formal analysis, J.Z.; Methodology, J.Z., H.J. (Hongkai Ji) and H.J. (Hui Jiang); Software, J.Z.; Supervision, H.J. (Hui Jiang); Visualization, J.Z.; Writing—original draft, J.Z.; Writing—review & editing, H.J. (Hongkai Ji) and H.J. (Hui Jiang). All authors have read and agreed to the published version of the manuscript.
Funding
Hongkai Ji is supported by the National Institutes of Health (NIH) grant R01HG009518.
Data Availability Statement
The data and computer codes that support the findings in this paper are available at https://github.com/kkttzjy/STS. These data were derived from the following resources available in the public domain: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98664 accessed on 31 March 2016, and https://doi.org/10.1016/j.devcel.2010.02.012 accessed on 11 February 2022.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| TR | Trajectory Reconstruction |
| PCA | Principal Component Analysis |
| ICA | Independent Component Analysis |
| LLE | Locally Linear Embedding |
| MST | Minimal Spanning Tree |
| KNN | K-nearest Neighbor |
| DPT | Diffusion Pseudotime |
| STS | Stochastic Tree Search |
| RGE | Reversed Graph Embedding |
Appendix A. Updating Vertex Embedding Location with a Closed-Form Solution
Without loss of generality, suppose there is a vertex and is the set of all the vertices that v connects to. Let be the data points projected to edge for . The new embedding location for v can be calculated through minimizing the objective function
Now, let be the mapping of when mapping to . If we assume that is fixed even when the embedding location y changes, then can then be computed by the following formula,
There is a closed-form solution for as follows,
Appendix B. Extension to Curved Tree Method
Instead of the linear embedding tree assumption, we assume that a cell trajectory is a curved embedding tree with the nonlinear tree edges modeled as the bounded principal curves. The details of our stochastic curved tree searching algorithm are shown in the following subsections.
Appendix B.1. Curved Embedding Tree
We first start with defining a new class of embedding trees with nonlinear edges as . V is a vertex set of size , and E is an edge set with . is the associated vertex embedding where . is the set of curved edges which are modeled as principal curves bounded by their vertices. Specifically, for the j-th edge with the associated embedding locations for the two vertices we have data points projected to the j-th edge, where is the total number of data points projected to the j-th edge and . For , the curved j-th edge is modeled as
where is the projection index for . In this chapter, we consider to model as the arc length of the projection on the curve to the origin. Based on this assumption, we plan to define as the pseudotime of . In addition, we assume that the residual .
Inspired by the standard principal curve algorithm introduced by Hastie [16], we also define the projection index as follows,
Here the projection index of the data point is the value of for which its projection to the curved edge is closest to .
Then, we consider a principal curve bounded by two vertices and with the associated pseudotimes and . Hence, for any data point projected on this curve, their projection index will be restricted to the interval . We then scale into the new range interval .
We can model with any format of nonlinear functions. In this paper, we assume that is a mixture of a linear function, a quadratic function and a cubic function as follows,
where we have
Here, and are the curvature parameters which control the complexity in the structure of a curved tree. We define augmented vectors of curvature parameters as , where and . The mixture of the above three functions can cover a wide range of curves.
Appendix B.2. Penalized Likelihood
Suppose that we have n data points in p dimensions, where , for . For a curved tree fitted to the data, the projection of each data point on j-th edge of the curved tree is denoted as . Then we assume that
thus, we have
Now, we can calculate the likelihood function as follows,
where we assume that is known and can be estimated from the data. The negative log-likelihood function is
Based on the previous assumption, we propose a penalized log likelihood with the lasso penalty and edge shrinkage as
The above penalized likelihood function is the sum of squares of the Euclidean distances between the data points and their projections on the tree with two penalties. The first penalty shrinks the curvature of the curved tree to control the complexity of the tree structure. The second one shrinks the edge lengths to avoid unnecessarily long edges. We use a small penalty term with to avoid any unnecessary long edges and a small penalty term to control the curvature.
Appendix B.3. Projection of the Data to the Tree
In order to obtain the penalized negative log-likelihood function for the data, we first introduce the way to calculate the projection of each data point to a given curved tree. For an embedding curved tree , the projection of a data point to the curved tree is defined as , where . We use a piecewise linear approximation approach to calculate . In particular, we divide the projected curved edge e into L small line segments. Since we have , we obtain a series of where with . For each line segment , we compute by the equation for the linear tree. We further define to measure the difference between the data point and . Then, we obtain the projection index of the data point on the curved tree if . The corresponding approximated projection of on G is computed as .
Appendix B.4. Updating Vertex Embedding Location
In the optimization process, we start with the optimal linear tree found by STS algorithm. We assign as the initial guess for the curvature parameters of principal curves. Then, we otain the initial curved tree based on the optimal linear tree. We update the embedding location and the curvature parameters and with the line search algorithm.
Without loss of generality, suppose there is a vertex and is the set of all the vertices that v connects to. Let be the data points projected to edge for . The new embedding location for v and the new curvature parameters for the curve edge can be calculated by minimizing the objective function
When we update the curved tree with , , the projection of data points on the new curved tree will also change. Hence, there are no closed-form solutions for and . We apply the backtracking line search algorithm to compute and . Since we need to update both the embedding location y and the curvature parameters , the backtracking line search algorithm in Section 2.4 is modified to update the two parts. We first update as follows. We set the initial values , and we have . Then, for the b-th iteration, , we take the following steps:
- 1.
- ,
- 2.
- 3.
- Continue the iteration until .
Here, we calculate the gradient by the numerical method, where
We use a small in the previous numerical approximation formula.
Once we have the final updated curvature parameters , we further update y with the similar backtracking line search process as follows. We set the initial values and . Then, for the b-th iteration, , we take the following steps:
- 1.
- ,
- 2.
- ,
- 3.
- Continue the iteration until .
Here, we calculate the gradient by the numerical method, where
We use and in our simulation studies. Finally we repeat the above two backtracking line search processes until the difference in the objective function reaches the tolerance.
Appendix B.5. Pseudotime Calculation
Once the optimal curved tree is found through the above stochastic curved tree searching method, we can then compute pseudotime using the same shortest path algorithm [12] as applied by Bendall (2014) in the pseudotime calculation for Wanderlust. However, unlike the shortest path algorithm applied for the linear tree method, we use the arc length of a nonlinear edge in the calculation of pseudotime. We first calculate the arc lengths of the nonlinear edges with the piecewise approximation as in Appendix B.3. Then, the arc length between each cell to the origin is computed by the shortest path algorithm. Finally, we define the distance as the pseudotime for cells and order cells by their pseudotimes.
References
- Cannoodt, R.; Saelens, W.; Saeys, Y. Computational methods for trajectory inference from single-cell transcriptomics. Eur. J. Immunol. 2016, 46, 2496–2506. [Google Scholar] [CrossRef] [PubMed]
- Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019, 37, 547–554. [Google Scholar] [CrossRef] [PubMed]
- Moon, K.R.; Stanley, J.S.; Burkhardt, D.; van Dijk, D.; Wolf, G.; Krishnaswamy, S. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr. Opin. Syst. Biol. 2018, 7, 36–46. [Google Scholar] [CrossRef]
- Bendall, S.; Davis, K.; Amir, E.a.; Tadmor, M.; Simonds, E.; Chen, T.; Shenfeld, D.; Nolan, G.; Pe’er, D. Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development. Cell 2014, 157, 714–725. [Google Scholar] [CrossRef] [PubMed]
- Setty, M.; Tadmor, M.D.; Reich-Zeliger, S.; Angel, O.; Salame, T.M.; Kathail, P.; Choi, K.; Bendall, S.; Friedman, N.; Pe’er, D. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 2016, 34, 637–645. [Google Scholar] [CrossRef] [PubMed]
- Welch, J.D.; Hartemink, A.J.; Prins, J.F. SLICER: Inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 2016, 17, 106. [Google Scholar] [CrossRef] [PubMed]
- Trapnell, C.; Cacchiarelli, D.; Grimsby, J.; Pokharel, P.; Li, S.; Morse, M.; Lennon, N.J.; Livak, K.J.; Mikkelsen, T.S.; Rinn, J.L. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 2014, 32, 381–386. [Google Scholar] [CrossRef] [PubMed]
- Ji, Z.; Ji, H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016, 44, e117. [Google Scholar] [CrossRef] [PubMed]
- Street, K.; Risso, D.; Fletcher, R.B.; Das, D.; Ngai, J.; Yosef, N.; Purdom, E.; Dudoit, S. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 2018, 19, 477. [Google Scholar] [CrossRef] [PubMed]
- Haghverdi, L.; Büttner, M.; Wolf, F.A.; Buettner, F.; Theis, F.J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 2016, 13, 845–848. [Google Scholar] [CrossRef] [PubMed]
- Qiu, X.; Mao, Q.; Tang, Y.; Wang, L.; Chawla, R.; Pliner, H.A.; Trapnell, C. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 2017, 14, 979–982. [Google Scholar] [CrossRef] [PubMed]
- Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271. [Google Scholar] [CrossRef]
- Hayashi, T.; Ozaki, H.; Sasagawa, Y.; Umeda, M.; Danno, H.; Nikaido, I. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat. Commun. 2018, 9, 619. [Google Scholar] [CrossRef] [PubMed]
- Guo, G.; Huss, M.; Tong, G.Q.; Wang, C.; Li Sun, L.; Clarke, N.D.; Robson, P. Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst. Dev. Cell 2010, 18, 675–685. [Google Scholar] [CrossRef] [PubMed]
- Lu, L.; Jiang, H.; Wong, W.H. Multivariate Density Estimation by Bayesian Sequential Partitioning. J. Am. Stat. Assoc. 2013, 108, 1402–1410. [Google Scholar] [CrossRef]
- Hastie, T.; Stuetzle, W. Principal Curves. J. Am. Stat. Assoc. 1989, 84, 502–516. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).







