1. Introduction
Starting with the pioneering experiment performed by Perrin [1], the quantitative analysis of microscopy images has become an important technique for various disciplines ranging from physics to biology. Over the last century, it has evolved into what is now known as single-particle tracking (SPT) [2,3,4]. In recent years, SPT has gained popularity in the biophysical community. The method serves as a powerful tool to study the dynamics of a wide range of particles, including small fluorophores, single molecules, macromolecular complexes, viruses, organelles and microspheres [5,6]. Processes such as microtubule assembly and disassembly [7], cell migration [8], intracellular transport [9,10] and virus trafficking [11] have already been successfully studied with this technique.
A typical SPT experiment results in a series of coordinates over time (also known as a “trajectory”) for every single particle, but by itself it does not provide any direct insight into the dynamics of the investigated process. Mobility patterns of particles encoded in their trajectories have to be extracted in order to relate individual trajectories to the behavior of the system at hand and the associated biological process [12]. The analysis of SPT trajectories usually starts with the detection of the corresponding motion type of a particle, because this information may already provide insights into the mechanical properties of the particle’s surroundings [13]. However, this initial task usually constitutes a challenge due to the stochastic nature of the particles’ movement.
There are already several approaches to analyse the mobility patterns of particles. The most commonly used one is based on the mean square displacement (MSD) of particles [10,14,15,16,17]. The idea behind this method is quite simple: an MSD curve (i.e., the average square displacement as a function of the time lag) is quantified from a single experimental trajectory and then fitted with a theoretical expression [18]. A linear best fit indicates normal diffusion (Brownian motion) [19], which corresponds to a particle moving freely in its environment. Such a particle neither interacts with other distant particles nor is hindered by any obstacles. If the fit is sublinear, the particle’s movement is referred to as subdiffusion. It is appropriate to represent particles moderated by viscoelastic properties of the environment [20], particles which hit upon obstacles [21,22] or trapped particles [9,23]. Finally, a superlinear MSD curve means superdiffusion, which relates to the motion of particles driven by molecular motors. This type of motion is faster than the linear case and usually proceeds in a specific direction [24].
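As a rough illustration of the MSD analysis described above, the time-averaged MSD of a trajectory can be computed and fitted with a power law on a log-log scale; the fitted exponent then distinguishes sub-, normal and superdiffusion. This is only a sketch in Python (the estimator and fitting choices here are ours, not prescribed by the text):

```python
import numpy as np

def msd(traj, max_lag):
    """Time-averaged mean square displacement of a 2D trajectory for lags 1..max_lag."""
    return np.array([
        np.mean(np.sum((traj[lag:] - traj[:-lag]) ** 2, axis=1))
        for lag in range(1, max_lag + 1)
    ])

def anomalous_exponent(traj, max_lag):
    """Fit MSD(t) ~ t^alpha on a log-log scale.
    alpha < 1: subdiffusion, alpha ~ 1: normal diffusion, alpha > 1: superdiffusion."""
    lags = np.arange(1, max_lag + 1)
    alpha, _ = np.polyfit(np.log(lags), np.log(msd(traj, max_lag)), 1)
    return alpha

rng = np.random.default_rng(0)
brownian = np.cumsum(rng.normal(size=(500, 2)), axis=0)  # free 2D diffusion
alpha = anomalous_exponent(brownian, 50)  # typically close to 1 for Brownian motion
```

In practice, as the text notes below, noise and short trajectories make this fit far less clean than in this idealised example.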
Although popular in the SPT community, the MSD approach has several drawbacks. First of all, experimental uncertainties introduce a great amount of noise into the data, making the fitting of mathematical models challenging [10,14,25,26]. Moreover, the observed trajectories are often short, limiting the MSD curves to just the first few time lags. In this case, distinguishing between different theoretical models may not be feasible. To overcome these problems, several analytical methods that improve or go beyond MSD have already been proposed. The optimal least-square fit method [10], the trajectory spread in space measured with the radius of gyration [27], the van Hove displacement distributions [28], the self-similarity of a trajectory using different powers of the displacement [29] and the time-dependent directional persistence of trajectories [30] are examples of methods belonging to the first category. They may be combined with the results of the pure MSD analysis to improve the outcome of classification. The distribution of directional changes [31], the mean maximum excursion method [32] and the fractionally integrated moving average (FIMA) framework [33] belong to the other class. They allow an efficient replacement of the MSD estimator for classification purposes. Hidden Markov models (HMM) turned out to be quite useful in checking for heterogeneity within single trajectories [34,35] and in the detection of confinement [36]. Classification based on hypothesis testing, both relying on MSD and going beyond this statistic, has been shown to be quite successful as well [26,37].
In the last few years, machine learning (ML) has started to be employed for the analysis of single-particle tracking data. In contrast to standard algorithms, where the user is required to explicitly define the rules of data processing, ML algorithms can learn those rules directly from series of data. Thus, the principle of ML-based classification of trajectories is simple: an algorithm learns by adjusting its behavior to a set of input data (trajectories) and corresponding desired outputs (real motion types, called the ground truth). These input–output pairs constitute the training set. A classifier is nothing but a mapping between the inputs and the outputs. Once trained, it may be used to predict the motion type of a previously unseen sample.
The main factor limiting the deployment of ML to trajectory analysis is the availability of high-quality training data. Since the ground truth for data collected in experiments is not available (otherwise, we would not need any new classification method), synthetic sets generated with computer simulations of different diffusion models are usually used for training.
Despite the data-related limitations, several attempts at ML-based analysis of SPT experiments have already been carried out. The applicability of the Bayesian approach [18,38,39], random forests [40,41,42,43], neural networks [44] and deep neural networks [41,45,46] was extensively studied. The ultimate goal of those works was the determination of the diffusion modes. However, some of them went beyond pure classification and focused on the extraction of quantitative information about the trajectories (e.g., the anomalous exponent [42,45]).
In one of our previous papers, we compared two different ML approaches to classification [41]. Feature-based methods do not use raw trajectories as input for the classifiers. Instead, they require a set of human-engineered features, which are then used to feed the algorithms. In contrast, deep learning (DL) methods extract features directly from raw data without any effort from human experts. In this case, the representation of data is constructed automatically and there is no need for complex data preprocessing. Deep learning is currently treated as the state-of-the-art technology for automatic data classification and slightly overshadows the feature-based methods. However, it follows from our results that the latter are still worth considering. Compared to DL, they may arrive at similar accuracies in much shorter training times, are usually easier to interpret, make it possible to work with trajectories of different lengths in a natural way and often do not require any normalisation of the data. The only drawback of those methods is that there is no universal set of features that works well for trajectories of any type. Choosing the features is challenging and may have an impact on the classification results.
In this paper, we would like to elaborate on the choice of proper features to represent trajectories. Comparing classifiers trained on the same set of trajectories, but with slightly different features, we will address some of the challenges of feature-based classification.
The paper is structured as follows. In Section 2, we briefly introduce the concept of anomalous diffusion and present the stochastic models that we chose to model it. In Section 3, the methods and data sets used in this work are discussed. The results of classification are extensively analysed in Section 4. In the last section, we summarise our findings.
2. Anomalous Diffusion and Its Stochastic Models
Non-Brownian movements that exhibit a non-linear mean squared displacement can be described by multiple models, depending on some specific properties of the corresponding trajectories. The most popular models are the continuous-time random walk (CTRW) [9], random walks on percolating clusters (RWPC) [47,48], fractional Brownian motion (FBM) [49,50,51], fractional Lévy α-stable motion (FLSM) [52], the fractional Langevin equation (FLE) [53] and the autoregressive fractionally integrated moving average (ARFIMA) [54].
In this paper, we follow the model choice described in [26,37,43]—namely, we use FBM, the directed Brownian motion (DBM) [55] and the Ornstein–Uhlenbeck (OU) process [56]. With a particular choice of the parameters, all these models simplify to the classical Brownian motion (i.e., normal diffusion).
The FBM is the solution of the stochastic differential equation
$$dX(t) = \sigma \, dB_H(t),$$
where $\sigma$ is the scale coefficient, which relates to the diffusion coefficient D via $D = \sigma^2/2$, $H \in (0, 1)$ is the Hurst parameter and $B_H(t)$ is a continuous-time, zero-mean Gaussian process starting at zero, with the following covariance function:
$$\mathrm{E}\left[B_H(t)\,B_H(s)\right] = \frac{1}{2}\left(|t|^{2H} + |s|^{2H} - |t - s|^{2H}\right).$$
The value of H determines the type of diffusion in the process. For $H < 1/2$, FBM produces subdiffusion. It corresponds to the movement of a particle hindered by mobile or immobile obstacles [57]. For $H > 1/2$, FBM generates superdiffusive motion. It reduces to free diffusion at $H = 1/2$.
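For illustration, an FBM path with a given Hurst parameter can be sampled by a Cholesky factorisation of the covariance function above. This is a minimal sketch in Python (the simulation scheme is our choice; the paper does not specify one here):

```python
import numpy as np

def fbm(n_steps, hurst, sigma=1.0, rng=None):
    """Sample an FBM path at integer times t = 1..n_steps via Cholesky
    factorisation of the covariance 0.5 * (t^2H + s^2H - |t - s|^2H)."""
    rng = rng or np.random.default_rng()
    t = np.arange(1, n_steps + 1, dtype=float)
    cov = 0.5 * (t[:, None] ** (2 * hurst) + t[None, :] ** (2 * hurst)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * hurst))
    # multiply the Cholesky factor by i.i.d. standard normals to get correlated increments
    return sigma * np.linalg.cholesky(cov) @ rng.standard_normal(n_steps)

sub = fbm(200, hurst=0.3, rng=np.random.default_rng(1))  # H < 1/2: subdiffusive path
sup = fbm(200, hurst=0.7, rng=np.random.default_rng(2))  # H > 1/2: superdiffusive path
```

The Cholesky approach is exact but scales as O(n³); faster methods (e.g., circulant embedding) exist for long trajectories.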
The directed Brownian motion, also known as the diffusion with drift, is the solution to
$$dX(t) = v\, dt + \sigma\, dB(t),$$
where $v$ is the drift parameter and $\sigma$ is again the scale parameter. For $v = 0$, it reduces to normal diffusion. For other choices of v, it generates superdiffusion related to an active transport of particles driven by molecular motors.
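A DBM trajectory follows directly from an Euler discretisation of the equation above; a minimal sketch (the step size and drift value are arbitrary choices for illustration):

```python
import numpy as np

def directed_bm(n_steps, v, sigma=1.0, dt=1.0, rng=None):
    """2D diffusion with drift: X_{k+1} = X_k + v*dt + sigma*sqrt(dt)*xi_k."""
    rng = rng or np.random.default_rng()
    steps = np.asarray(v) * dt + sigma * np.sqrt(dt) * rng.standard_normal((n_steps, 2))
    return np.cumsum(steps, axis=0)

traj = directed_bm(300, v=(0.5, 0.0), rng=np.random.default_rng(2))
# with a non-zero drift, the mean displacement grows linearly in time,
# which produces the superlinear (superdiffusive) MSD discussed in the text
```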
The Ornstein–Uhlenbeck process is often used as a model of confined diffusion (a subclass of subdiffusion). It describes the movement of a particle inside a potential well and can be determined as the solution to the following stochastic differential equation:
$$dX(t) = -\lambda \left(X(t) - \theta\right) dt + \sigma\, dB(t).$$
The parameter $\theta$ is the long-term mean of the process (i.e., the equilibrium position of a particle), $\lambda$ is the mean-reverting speed and $\sigma$ is again the scale parameter. If there is no mean reversion effect, i.e., $\lambda = 0$, OU reduces to normal diffusion.
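Confined OU trajectories can likewise be generated with an Euler–Maruyama scheme; a minimal sketch (all parameter values are illustrative only):

```python
import numpy as np

def ornstein_uhlenbeck(n_steps, lam, theta=0.0, sigma=1.0, dt=0.01, rng=None):
    """Euler-Maruyama scheme for dX = -lam*(X - theta)*dt + sigma*dB."""
    rng = rng or np.random.default_rng()
    x = np.empty(n_steps + 1)
    x[0] = theta
    noise = rng.standard_normal(n_steps)
    for k in range(n_steps):
        x[k + 1] = x[k] - lam * (x[k] - theta) * dt + sigma * np.sqrt(dt) * noise[k]
    return x

confined = ornstein_uhlenbeck(10_000, lam=5.0, rng=np.random.default_rng(3))
# the stationary variance sigma^2 / (2*lam) = 0.1 keeps the path near theta,
# i.e., the particle stays confined around its equilibrium position
```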
3. Methods and Used Data Sets
In this paper, we discuss two feature-based classifiers: random forest (RF) and gradient boosting (GB) [58]. The term feature-based relates to the fact that the corresponding algorithms do not operate on raw trajectories of a process. Instead, for each trajectory a vector of human-engineered features is calculated and then used as input for the classifier. This approach to the diffusion mode classification has already been used in [41,42,43,45], but here, we propose a new set of features, which gives better results on synthetic data sets.
Both RF and GB are examples of ensemble methods, which combine multiple classifiers to obtain better predictive performance. They use decision trees [59] as base classifiers. A single decision tree is fairly simple to build. The original data set is split into smaller subsets based on the values of a given feature. The process is recursively repeated until the resulting subsets are homogeneous (all samples from the same class) or further splitting does not improve the classification performance. A splitting feature for each step is chosen according to the Gini impurity or information gain measures [58].
A single decision tree is popular among ML methods due to the ease of its interpretation. However, it has several drawbacks that disqualify it as a reliable classifier: it is sensitive to even small variations of the data and prone to overfitting. Ensemble methods combining many decision trees help to overcome those drawbacks while maintaining most of the advantages of the trees. A multitude of independent decision trees is constructed by making use of the bagging idea together with the random subspace method [60,61,62] to form a random forest. Their predictions are aggregated and the mode of the classes of the individual trees is taken as the final output. In contrast, the trees in gradient boosting are built in a stage-wise fashion. At every step, a new tree learns from the mistakes committed by the ensemble. GB is usually expected to perform better than RF, but the latter may be a better choice in the case of noisy data.
In this work, we used the implementations of RF and GB provided by the scikit-learn Python library [63]. The performance of the classifiers was evaluated with common measures including accuracy, precision, recall, F1 score and confusion matrices (although the information given by those measures is to some extent redundant, we decided to use all of them due to their popularity). The accuracy is the percentage of correct predictions among all predictions, i.e., general information about the performance of a classifier (reliable in the case of a balanced data set). The precision and recall give us a bit more detailed information for each class. The precision is the ratio of the correct predictions of a class to all predictions of that class (including the cases falsely assigned to it). On the other hand, the recall (also called sensitivity or true positive rate) is the ratio of correct predictions of that class to all members of that class (including the ones that were falsely assigned to another class). The F1 score is the harmonic mean of precision and recall, resulting in a high value only if both precision and recall are high. Finally, the confusion matrices show the detailed results of classification: the element $C_{ij}$ of matrix C is the percentage of the observations from class i assigned to class j (a row represents the actual class, while a column represents the predicted class).
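A minimal sketch of how such an evaluation might look with scikit-learn; the two-class Gaussian-blob features below are toy stand-ins for the actual trajectory features, and all hyperparameters are library defaults:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split

# toy feature vectors: two well-separated Gaussian blobs standing in for
# the human-engineered trajectory features (hypothetical data, not Set A)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (200, 3))])
y = np.repeat([0, 1], 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (RandomForestClassifier(random_state=0),
            GradientBoostingClassifier(random_state=0)):
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    prec, rec, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="macro")
    print(type(clf).__name__, accuracy_score(y_te, y_pred),
          round(prec, 2), round(rec, 2), round(f1, 2))
    print(confusion_matrix(y_te, y_pred))  # rows: actual class, columns: predicted
```

The confusion matrix printed here follows the same row/column convention as in the text.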
The Python codes for the data simulation, feature calculation, model preparation and performance evaluation are available at Zenodo (see Supplementary Materials).
3.1. Features Used for Classification
As already mentioned above, both ensemble methods require vectors of human-engineered features representing the trajectories as input. In some sense, those methods may be treated as a kind of extension of the statistical methods usually used for classification purposes. Instead of conducting a statistical testing procedure of diffusion based on one statistic, as is often the case, we can combine several statistics with each other by turning them into features, which are then used to train a classifier. This could be of particular importance in situations when single statistics yield results differing from each other (cf. [43]). It should be mentioned, however, that choosing the right features is a challenging task. For instance, we have already shown in [41] that classifiers trained with a popular set of features do not generalise well beyond the situations encountered in the training set. Thus, great attention needs to be paid to the choice of the input features for machine learning classifiers as well. They ought to cover all the important characteristics of the process, but at the same time, they should contain a minimal amount of unnecessary information, as each redundant piece of data introduces noise into the classification or may lead to overfitting (for a general discussion concerning the choice of features, see, for instance, [64]).
Based on the results in [41,43], we decided to use the following features in our analysis, hereinafter referred to as Set A:
The first five features were already used in [41]. It should also be mentioned here that three of them are based on MSD curves. There is one important point to consider while calculating the curves, namely the maximum time lag. If not specified otherwise, we will use a lag equal to 10% of each trajectory’s length. Since this choice is not obvious and may impact the classification performance, we will discuss the sensitivity of the classifiers’ accuracies to different choices of the lag in Section 4.5.
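To make the idea concrete, a feature vector built around an MSD curve restricted to 10% of the trajectory length might look as follows. The three features below (an MSD exponent, a crude diffusion coefficient estimate and a normalised maximum distance) are hypothetical illustrations of the approach, not the exact definitions of Set A (those are given in the references):

```python
import numpy as np

def features(traj, lag_fraction=0.1):
    """Hypothetical feature vector for one 2D trajectory (illustration only)."""
    n = len(traj)
    max_lag = max(2, int(lag_fraction * n))  # MSD restricted to 10% of the length
    lags = np.arange(1, max_lag + 1)
    msd = np.array([np.mean(np.sum((traj[k:] - traj[:-k]) ** 2, axis=1))
                    for k in lags])
    # log-log fit MSD(t) ~ 4*D*t^alpha: slope is the exponent, intercept gives D
    alpha, log_scale = np.polyfit(np.log(lags), np.log(msd), 1)
    d_est = np.exp(log_scale) / 4.0  # crude 2D diffusion coefficient estimate
    # maximum distance from the start, normalised by the root of the summed
    # squared increments (a hypothetical stand-in for a standardised max distance)
    disp = np.linalg.norm(traj - traj[0], axis=1)
    max_dist = disp.max() / (np.sqrt(np.sum(np.diff(traj, axis=0) ** 2)) + 1e-12)
    return np.array([alpha, d_est, max_dist])

rng = np.random.default_rng(5)
vec = features(np.cumsum(rng.normal(size=(400, 2)), axis=0))
```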
Apart from the set of features presented above, denoted Set A, we are going to analyse two other sets: the one used in [40,41], referred to as Set B, and the one proposed in [43] (Set C). The lists of features used in each set are given in Table 1 (for their exact definitions, please see the mentioned references). Sets A and B have several features in common. The link between Sets A and C is not so apparent, but the maximal excursion and p-variation-based statistics play a role in the description of trajectories similar to that of the standardised maximum distance and the exponent of the power function fitted to the p-variation, respectively.
Following [41], we consider four classifiers for each set of features: RF and GB classifiers built with the full set (labelled as “with D”) and with a reduced one after the removal of the diffusion constant D (“no D”).
3.2. Synthetic Data
Unlike explicitly programmed methods, machine learning algorithms are not ready-made solutions for arbitrary data. Instead, an algorithm first needs to be fed with a reasonable amount of data (the so-called training data) that contains the main characteristics of the process under investigation in order to find and learn hidden patterns. As the classifier is not able to extract any additional patterns from previously unseen samples after this stage, its performance is highly dependent on the quality of the training data. Hence, the training set needs to be complete in some sense.
First, we created our main data set, which will be referred to as the base data set for the remainder of this paper. It is analogous to the one used in [43]. We generated a number of 2D trajectories according to the three diffusion models described in Section 2, with no correlations between the coordinates. A single trajectory can be denoted as $\{X(t_1), X(t_2), \ldots, X(t_N)\}$, where $X(t_i)$ is the position of the particle at time $t_i$, $i = 1, \ldots, N$. We kept the lag $\Delta t = t_{i+1} - t_i$ between two consecutive observations constant.
The details of our simulations are summarised in Table 2. In total, 120,000 trajectories have been produced, 40,000 for each diffusion mode, in order to balance the data set. The length of the trajectories was randomly chosen from the range between 50 and 500 steps to mimic typical observations in experiments. We set the lag $\Delta t$ and the scale parameter $\sigma$ to the constant values listed in Table 2.
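The bookkeeping of such a balanced set with random trajectory lengths can be sketched as follows; the three generators are deliberately simplified stand-ins (a plain random walk, a drifted walk and a discrete mean-reverting walk) rather than the exact FBM, DBM and OU samplers, and the per-class count is scaled down:

```python
import numpy as np

rng = np.random.default_rng(6)

# simplified stand-ins for the three diffusion models of Section 2
def free(n):                    # plain random walk (normal diffusion)
    return np.cumsum(rng.normal(size=(n, 2)), axis=0)

def directed(n, v=0.2):         # drifted walk (DBM-like, superdiffusive)
    return np.cumsum(v + rng.normal(size=(n, 2)), axis=0)

def confined(n, lam=0.2):       # discrete mean-reverting walk (OU-like, subdiffusive)
    x = np.zeros((n, 2))
    for k in range(1, n):
        x[k] = (1 - lam) * x[k - 1] + rng.normal(size=2)
    return x

generators = {"normal": free, "super": directed, "sub": confined}
trajectories, labels = [], []
per_class = 100                 # the actual base set uses 40,000 per class
for label, gen in generators.items():
    for _ in range(per_class):
        n = int(rng.integers(50, 501))  # lengths drawn between 50 and 500 steps
        trajectories.append(gen(n))
        labels.append(label)
```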
Since normal diffusion can be generated by a particular choice of the models’ parameters ($H = 1/2$ for FBM, $v = 0$ for DBM and $\lambda = 0$, i.e., no mean reversion, for OU), it is almost indistinguishable from the anomalous diffusion generated with parameters in the vicinity of those special values. The addition of noise complicates the problem even more. Thus, following [43], we introduced a parameter c that defines a range in which a weak sub- or superdiffusion should be treated as normal diffusion. Although introduced here at a different level, it bears resemblance to the cutoff c used in [37].
Apart from the base data set, we are going to use several auxiliary ones to elaborate on different aspects of the feature choice. In Section 4.3, we will work with a training set in which the trajectories from the base one are disturbed with Gaussian noise to resemble experimental uncertainties. In Section 4.4, we will analyse the performance of classifiers trained on synthetic data generated with a scale parameter corresponding to a diffusion coefficient adequate for the analysis of the real data samples. To study the sensitivity of the classifiers to the value of the cutoff c in Section 4.6, we will use three further sets with different values of c. In Section 4.7, a synthetic set in which D is drawn from a uniform distribution will be used to check how the classifiers cope with trajectories characterised by heterogeneous mobilities.
For all data sets, the training and testing subsets were randomly selected with a 70%/30% ratio.
3.3. Empirical Data
To check how our classifiers work on unseen data, we will apply them to some real data. We decided to use the trajectories of G proteins and G-protein-coupled receptors already analysed in [37,43,66]. To avoid issues related to short time series, we limited ourselves to trajectories with at least 50 steps, obtaining 1037 G protein and 1218 receptor trajectories. They are visualised in Figure 1.
5. Conclusions
In this paper, we presented a new set of features (referred to as Set A, see Table 1) for two types of machine learning classifiers, random forest and gradient boosting, which gives good results on the synthetic data set, better than the set used previously in [43]. We have analysed the performance of our classifiers trained and tested on multiple versions of the synthetic data set, allowing us to assess their usefulness, flexibility and robustness. Moreover, we compared the proposed set with the ones already used for this problem in [40,41,43]. Our set gives the best results in terms of the most common metrics.
Although the results on the synthetic data set are promising, we acknowledge the challenge of applying the classifiers to real data. As discussed in [41], classifiers trained on particular models for given diffusion modes do not generalise well. In Section 4.4, we show that even classifiers with good accuracy return unclear results when used with data of potentially different characteristics. To some extent, this can be improved by including more models in the training data set.
Thus, we would like to underline the importance of feature selection for a given problem: even for the same task (e.g., diffusion mode classification), both the models chosen for the training data generation and the features chosen for their characterisation have a great influence on the performance of classifiers. Moreover, the assumptions made in the construction of the classifiers, such as the hyperparameters’ values or simply the choice of the classifier type, are also highly important.