1. Introduction
For problems with limited task-specific data, supervised machine learning models often fail to generalize well. Classically, practitioners operating in these settings choose a model that is appropriately expressive given the amount of data available; that is, they use a model that effectively exploits the “bias–variance” trade-off [1]. Modern machine learning approaches such as transfer learning [2,3], domain adaptation [4], meta-learning [5,6,7,8], and continual learning [9,10,11,12] attempt to mitigate the lack of task-specific data by leveraging information from a collection of available source tasks. These approaches are ineffective, however, when the task of interest is sufficiently different from the source tasks.
In this paper we study a data-adaptive method that can interpolate between the classical and modern approaches for a specific set of classifiers: when the amount of available task-specific data is large and the available source tasks are sufficiently different, the method is equivalent to the classical single-task approach; conversely, when the amount of available task-specific data is small and the available source tasks are similar to the task of interest, the method is equivalent to the modern approach.
At a high level, our proposed method is designed in the context of a class of classifiers based on Fisher’s Linear Discriminant (“FLD”) [13,14]. Each element of the class is a convex combination of (i) an average of linear classifiers trained on source tasks and (ii) a classifier trained only on data from a new target task. Given this class of classifiers, we derive the expected risk (under 0–1 loss) of an element of the class under particular generative assumptions, approximate the risk using the appropriate limit theory, and select the classifier that minimizes this approximated expected risk. By approximating the expected risk, we are able to simultaneously take advantage of the relationship between the source tasks and the target task and of the new information available from the target task itself.
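Concretely, writing $\bar{w}_S$ for the average of the normalized source projection vectors and $\hat{w}_T$ for the projection vector estimated from target data (symbols chosen here for illustration; the paper's formal notation is developed in Section 2), each element of the class takes the form

```latex
h_{\alpha}(x) \;=\; \mathbb{1}\!\left\{ \left\langle (1-\alpha)\,\bar{w}_{S} + \alpha\,\hat{w}_{T},\; x \right\rangle > 0 \right\},
\qquad \alpha \in [0, 1],
```

and the selected classifier is $h_{\alpha^{*}}$, where $\alpha^{*}$ minimizes the approximated expected 0–1 risk over $\alpha$.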
We focus on FLD, as opposed to more complicated classification techniques, due to its popularity in low-resource settings. For example, our setting of interest is the physiological prediction problem—broadly defined as any setting that uses biometric or physiological data (e.g., EEG, ECG, breathing rate, etc.) or any derivative thereof to make predictions related to the state of a person—where polynomial classifiers and regressors with expert-crafted features are still the preferred performance baselines [15].
The rest of the paper is organized as follows: We first review relevant aspects of the domain adaptation, physiological prediction, and task similarity literature. We then describe our problem setting formally, introduce notation, and review the distributional assumptions for which FLD is optimal under 0–1 classification loss. We subsequently make the relationship between the source distributions and target distribution explicit by leveraging the sufficiency of the FLD projection vector. We define the set of classifiers based on this relationship, derive an expression for the expected risk of a general element in this set, and propose a computable approximation to it that can be used to find the optimal classifier in the set. Finally, we study the effect of different hyperparameters of the data generation process on the performance of the approximated optimal classifier relative to model (i) and model (ii) before applying it to three physiological prediction settings.
1.1. Related Works
1.1.1. Connection to Domain Adaptation Theory
The problem we address in this work can be framed as a domain adaptation problem with multiple sources. While a rich body of literature [4,16,17,18,19,20,21,22] has studied this setting, our work most closely resembles the theoretical analysis discussed in [17]. They study combinations of the source classifiers and derive a hypothesis that achieves a small error with respect to the target task, under the assumption that the target distribution is a mixture of the source distributions. Our work, on the other hand, combines the average source classifier with the target classifier under the assumption that the classifiers originate from the same distribution at the task level. Indeed, the explicit relationship that we place on the source and target projection vectors allows us to derive an analytical expression for the target risk that does not rely on the target distribution being a mixture of the source distributions.
1.1.2. Domain Adaptation for Physiological Prediction Problems
Domain adaptation and transfer learning are ubiquitous in the physiological prediction literature due to large context variability and small in-context sample sizes. See, for example, a review of EEG-based methods [15] and a review of ECG-based methods [23]. Most similar to our work are methods that combine general-context data and personalized data [24], or that weight individual classifiers or samples from the source tasks based on their similarity to the target distribution [25,26]. Our work differs from [24], for example, by explicitly modeling the relationship between the source and target tasks. This allows us to derive an optimal combination of the models as opposed to relying strictly on empirical measures.
1.1.3. Measures of Task Similarity
Capturing the difference between the target task and the source tasks is imperative for data-driven methods that attempt to interpolate between different representations or decision rules. We refer to attempts to capture these differences as measures of task similarity. Generally, measures of task similarity can be used to determine how relevant a pre-trained model is for a particular target task [22,27,28,29] or to define a taxonomy of tasks [30].
In our work, the convex coefficient parameterizing the proposed class of models can be thought of as a measure of model-based task dissimilarity between the target task and the average-source task—the farther the distribution of the target projection vector is from the distribution of the source projection vector, the larger the convex coefficient. Popular task similarity measures utilize information-theoretic quantities to evaluate the effectiveness of a pre-trained source model for a particular target task, such as H-score [27], NCE [28], or LEEP [29]. This collection of work is mainly empirical and does not place explicit generative relationships on the source and target tasks. Other statistically inspired task similarity measures, like ours, rely on the representations induced by the source and target classifiers, such as partitions [31] and other model artifacts [32,33,34]. Similar ideas have been used to leverage the presence of multiple tasks for ranking [35].
1.2. Problem Setting
The classification problem discussed herein is an instance of a more general statistical pattern recognition problem ([14], Chapter 1): given training data $\{(x_i, y_i)\}_{i=1}^{n}$ assumed to be i.i.d. samples from a classification distribution $F$, construct a function $h$ that takes as input an element of the feature space $\mathcal{X}$ and outputs an element of the label space $\mathcal{Y}$ such that the expected loss of $h$ with respect to $F$ is small. With a sufficient amount of data and a suitably defined loss, there exists a classifier that has statistically minimal expected loss for any given $F$. In prediction problems like the physiological prediction problem, however, there is often not enough data from the target task to adequately train classifiers, and we assume, instead, that there are auxiliary data (or derivatives thereof) from different contexts available that can be used to improve the expected loss [2].
In particular, given data $\{(x_i^{(s)}, y_i^{(s)})\}_{i=1}^{n_s}$ assumed to be i.i.d. samples from the classification distribution $F^{(s)}$ for each task $s \in \{1, \ldots, S\}$, we want to construct a classifier $h$ that minimizes the expected loss with respect to the target distribution $F^{(0)}$. We refer to the classification distribution $F^{(s)}$, $s \geq 1$, as a source distribution for $F^{(0)}$. Note that in other modern machine learning settings the classifier $h$ is constructed to optimize joint loss functions with respect to several of the $F^{(s)}$ simultaneously [36].

Generally, for the classifier $h$ to improve upon the task-specific classifier trained only on target data, the source distributions need to be related to the target distribution such that the information learned when constructing the mappings from the input space to the label space in the context of the source distributions can be “transferred” or “adapted” to the context of the target distribution [32].
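In symbols, and under the notation just introduced (with $\ell$ the 0–1 loss), the goal is

```latex
h^{*} \;=\; \operatorname*{arg\,min}_{h \in \mathcal{H}}\;
\mathbb{E}_{(X,Y) \sim F^{(0)}}\!\left[ \ell\big(h(X), Y\big) \right],
\qquad \ell(y, y') \;=\; \mathbb{1}\{\, y \neq y' \,\},
```

where $\mathcal{H}$ is constructed using both the target samples and the source data (or derivatives thereof), while the risk itself is measured with respect to the target distribution only.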
4. Applications to Physiological Prediction Problems
We next study the proposed class of classifiers in the context of three physiological prediction problems: EEG-based cognitive load classification, EEG-based stress classification, and ECG-based social stress classification. Each of these problems has large distributional variability across persons, devices, sessions, and tasks. Moreover, labeled data in these tasks are expensive—non-overlapping feature vectors can require up to 45 s of recording to obtain. Consequently, large improvements in classification metrics near the beginning of the in-task data regime are important for reducing the amount of time required for a Human–Computer Interface to produce relevant predictions and are thus necessary for making these types of devices usable.
The dataset related to the EEG-based cognitive load classification task is proprietary. We include the results because there is a (relatively) large number of participants with multiple sessions per participant, and the cognitive load task is a representative high-level cognitive state classification problem. Both the EEG-based stress [40] and ECG-based stress [41] classification datasets are publicly available. Given the complicated nature of physiological prediction problems, previous works that use these datasets typically choose an arbitrary amount of training data for each session, train a model, and report classification metrics on a held-out test set (e.g., [42] (EEG) and [43] (ECG)) or held-out participants (e.g., [44] (EEG) and [45,46] (ECG)). Our focus, while similar, is fundamentally different: we are interested in classification metrics as a function of the amount of training data seen.
In each setting we have access to a small amount of data from a target study participant and the projection vectors from other participants. The data for each subject are processed such that the assumptions of Equation (1) are matched as closely as possible. For example, we use the available training data from the target participant to force the class-conditional means to lie on the unit sphere and their midpoint to pass through the origin. Further, we normalize the learned projection vectors so that the assumption that the vectors come from a von Mises–Fisher distribution is sensible.
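A minimal sketch of this preprocessing step, assuming two-class data and using hypothetical helper names (the exact transformation used in the paper may differ):

```python
import numpy as np

def normalize_to_model_assumptions(X, y):
    """Translate and scale two-class data so the class-conditional means
    lie on the unit sphere and their midpoint sits at the origin."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    midpoint = (mu0 + mu1) / 2.0
    X_centered = X - midpoint                  # midpoint -> origin
    scale = np.linalg.norm(mu1 - midpoint)     # distance from each mean to the origin
    X_scaled = X_centered / scale              # class means now lie on the unit sphere
    return X_scaled, midpoint, scale

def normalize_projection_vector(w):
    """Project a learned FLD vector onto the unit sphere so that a
    von Mises-Fisher model for the collection of vectors is sensible."""
    return w / np.linalg.norm(w)
```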
The descriptions of the cognitive load and stress datasets are altered versions of the descriptions found in Chen et al. [47]. Unless otherwise stated, the balanced accuracy and the convex coefficient corresponding to each method are calculated using 100 different train–test splits for each participant. Conditioned on the class type, the windowed data used for training are consecutive windows. A grid search over the unit interval was used when calculating convex coefficients.
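A sketch of this selection procedure; `estimate_risk` stands in for the risk approximation derived earlier in the paper and is hypothetical here:

```python
import numpy as np

def select_convex_coefficient(w_source_avg, w_target, estimate_risk, grid_size=101):
    """Grid search over the unit interval for the convex coefficient that
    minimizes the approximated expected risk of the combined classifier."""
    alphas = np.linspace(0.0, 1.0, grid_size)
    risks = []
    for alpha in alphas:
        w = (1.0 - alpha) * w_source_avg + alpha * w_target
        w /= np.linalg.norm(w)          # keep the combined vector on the sphere
        risks.append(estimate_risk(w))
    return alphas[int(np.argmin(risks))]
```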
4.1. Cognitive Load (EEG)
The first dataset we consider was collected under NASA’s Multi-Attribute Task Battery II (MATB-II) protocol. MATB-II is used to understand a pilot’s ability to perform under various cognitive load requirements [48] by attempting to induce four different levels of cognitive load—no (passive), low, medium, and high—that are a function of how many tasks the participant must actively attend to.
The data include 50 healthy subjects with normal or corrected-to-normal vision. There were 29 female and 21 male participants, and each participant was between the ages of 18 and 39 (mean 25.9, std 5.4 years). Each participant was familiarized with MATB-II and then participated in two sessions containing three segments each. The segments were further divided into blocks with the four different levels of cognitive requirements. The sessions lasted around 50 min and were separated by a 10 min break. We focus our analysis on a per-subject basis, meaning there are two sessions per subject for a total of 100 different sessions.
The EEG data were recorded using a 24-channel Smarting MOBI device, processed using high-pass (0.5 Hz) and low-pass (30 Hz) filters, and segmented into ten-second, non-overlapping windows. Once the EEG data were windowed, we calculated the mass in the frequency domain for the theta (4–8 Hz), alpha (8–12 Hz), and lower beta (12–20 Hz) bands. We then normalized the mass of each band on a per-channel basis. In our analysis we consider only the frontal channels {Fp1, Fp2, F3, F4, F7, F8, Fz, aFz}. Our choice of channels and bands is an attempt to reduce the number of features while maintaining the presence of known cognitive load indicators [49]. The results reported in Figure 4 are for this 24-dimensional (8 channels × 3 bands) two-class problem {no and low cognitive load, medium and high cognitive load}.
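A sketch of this featurization, assuming a `window` array of shape (n_channels, n_samples) and using Welch's method as one reasonable spectral estimator (the paper does not specify which estimator was used):

```python
import numpy as np
from scipy.signal import welch

BANDS = {"theta": (4, 8), "alpha": (8, 12), "lower_beta": (12, 20)}

def band_mass_features(window, fs):
    """Per-channel spectral mass in each band, normalized across bands."""
    freqs, psd = welch(window, fs=fs, axis=-1)     # psd: (n_channels, n_freqs)
    masses = np.stack(
        [psd[:, (freqs >= lo) & (freqs < hi)].sum(axis=-1) for lo, hi in BANDS.values()],
        axis=-1,
    )                                              # (n_channels, n_bands)
    masses /= masses.sum(axis=-1, keepdims=True)   # normalize per channel
    return masses.ravel()                          # flatten to a feature vector
```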
For a fixed session we randomly sample a contiguous proportion of the participant’s windowed data and also have access to the projection vectors corresponding to all sessions except the target session and the target participant’s other session (i.e., we have 100 − 1 − 1 = 98 source projection vectors). As mentioned above, we use the training data to learn a translation and scaling to best match the model assumptions of Section 2.
The top left panel of Figure 4 shows the mean balanced accuracy on the non-sampled windows for four different classifiers: the average-source classifier, the target classifier, the optimal classifier, and the oracle classifier. The average-source, target, and optimal classifiers are as described in Section 3. The oracle classifier is the convex combination of the average-source and target projection vectors that performs best on the held-out test set. The median balanced accuracy of each classifier is the median (across sessions) of the mean balanced accuracy over 100 different train–test samplings for each session.
The relative behaviors of the average-source, target, and optimal classifiers in this experiment are similar to what we observe when varying the amount of target data in the simulations with a large von Mises–Fisher concentration parameter κ—the average-source classifier outperforms the target classifier in small data regimes, the target classifier outperforms the average-source classifier in large data regimes, and the optimal classifier is able to outperform or match both classifiers throughout the regime. Indeed, in this experiment the empirical value of κ when estimating the projection vectors using all of each session’s data is approximately 17.2.
The top right panel of Figure 4 shows scatter plots of the convex coefficients for the optimal and oracle methods. Each dot represents the average of 100 coefficients for a particular session at a given proportion of training data from the target task (i.e., one dot per session). The median coefficient is represented by a short line segment. The median coefficients for both the oracle and the optimal classifiers get closer to 1 as more target data becomes available. This behavior is intuitive, as we would expect the optimal algorithm to favor the in-distribution data when the estimated variance of the target classifier is “small”.
The bottom row of Figure 4 shows histograms of the difference between the optimal classifier’s balanced accuracy and the target classifier’s balanced accuracy, where each count represents a single session. These histograms give us a better sense of the relative performance of the two classifiers—a distribution centered around 0 would mean that we have no reason to prefer the optimal classifier over the target classifier, while a distribution shifted to the right of 0 would mean that we prefer the optimal classifier to the target classifier.
For the smallest proportion of target training data, the optimal classifier outperforms the target classifier for 92 of the 100 sessions, with differences as large as 19.2% and a median absolute accuracy improvement of about 9.3%. The story is similarly dramatic for the second smallest proportion, with the optimal classifier again outperforming the target classifier for 92 of the 100 sessions, with a maximum difference of about 19.2% and a median difference of 7.8%. For the third proportion, the distribution of the differences is still shifted to the right of 0, with a non-trivial median absolute improvement of about 3.7%, a maximum improvement of 12%, and an improvement for 81 of the sessions. For the largest proportion, the optimal classifier outperforms the target classifier for 76 of the 100 sessions, though the distribution is only slightly shifted to the right of 0. The p-values, up to three decimal places, from the one-sided Wilcoxon signed-rank test for the hypothesis that the distribution of the paired differences is symmetric and centered around 0 are less than 0.001 for each proportion of available target data that we considered.
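A sketch of this significance test, assuming paired per-session balanced accuracies for the two classifiers have been collected (scipy's `wilcoxon` implements the paired signed-rank test; the accuracies below are synthetic placeholders):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
acc_target = rng.uniform(0.6, 0.8, size=100)              # hypothetical per-session accuracies
acc_optimal = acc_target + rng.uniform(0.0, 0.1, size=100)  # hypothetical paired improvements

# One-sided test: null = paired differences symmetric about 0,
# alternative = differences shifted to the right (optimal is better).
stat, p_value = wilcoxon(acc_optimal - acc_target, alternative="greater")
print(f"p = {p_value:.3f}")
```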
4.2. Stress from Mental Math (EEG)
In the next study we consider, there are two recordings for each session—one corresponding to a resting state and one corresponding to a stressed state. For the resting state, participants counted mentally (i.e., without speaking or moving their fingers) with their eyes closed for three minutes. For the stressed state, participants were given a four-digit number (e.g., 1253) and a two-digit number (e.g., 43) and asked to recursively subtract the two-digit number from the four-digit number for four minutes. This type of mental arithmetic is known to induce stress [50].
There were initially 66 participants (47 women and 19 men) of matched age in the study. Thirty of the participants were excluded from the released data due to poor EEG quality. Thus, we consider the provided set of 36 participants first analyzed by the study’s authors [40]. The released EEG data were preprocessed via a high-pass filter and a power line notch filter (50 Hz). Artifacts such as eye movements and muscle tension were removed via ICA. We windowed the data into two-and-a-half-second chunks with no overlap and consider the two-class classification task {stressed, not stressed} with access only to the channels along the centerline {Fz, Cz, Pz} and the theta, alpha, and lower beta bands described above. The results of this experiment are displayed in Figure 5 and are structured in the same way as the cognitive load results.
For this study, we see relative parity between the target and average-source classifiers when the smallest proportion of target data is available. In this case, the optimal classifier is able to leverage the discriminative information in both sources and improve the balanced accuracy. This advantage is maintained until the target classifier’s performance matches the optimal classifier’s performance at the larger proportions. The poor performance of the average-source classifier is likely due to the empirical value of κ being less than 3.
Interestingly, we do not see as clear a trend for the median convex coefficients in the top right panel. They are relatively stagnant for the smaller proportions of target data before jumping considerably closer to 1 for the largest proportion.
When comparing the optimal classifier to the target classifier on a per-participant basis (bottom row), it is clear that the optimal classifier is favorable: for the three smallest proportions of target data, the optimal classifier outperforms the target classifier for 25, 24, and 24 of the 36 participants, respectively; the median absolute difference of these wins is in the 1.8–2.6% range for all three settings, with maximum improvements of 19.2%, 19.2%, and 12.1%. As with the cognitive load task, this narrative shifts for the largest proportion, where the distribution of the differences is approximately centered around 0. The p-values from the one-sided signed-rank test reflect these observations: 0.001, 0.01, 0.007, and 0.896 for the four proportions, respectively.
4.3. Stress in Social Settings (ECG)
The last dataset we consider is the WEarable Stress and Affect Detection (WESAD) dataset [41]. For WESAD, the researchers collected multi-modal data while participants underwent a neutral baseline condition, an amusement condition, and a stress condition. The participants meditated between conditions. For our purposes, we only consider the baseline condition, where participants passively read a neutral magazine for approximately 20 min, and the stress condition, where participants went through a combination of the Trier Social Stress Test and a mental arithmetic task for a total of 10 min.

For our analysis, we consider 14 of the 15 participants and only work with their corresponding ECG data recorded at 700 Hz. Before featurizing the data, we first downsampled to 100 Hz and split the time series into 15 s, non-overlapping windows. We used Hamilton’s peak detection algorithm [51] to find the times between heartbeats for a given window. We then calculated the proportion of successive inter-beat interval differences larger than 20 ms, the normalized standard deviation of the interval lengths, and the ratio of the high-frequency (between 0.15 and 0.4 Hz) and low-frequency (between 0.04 and 0.15 Hz) mass of the interval waveform after applying a Lomb–Scargle correction for unevenly sampled waves. These three features are known to have discriminative power in the context of stress prediction [52], though typically for larger time windows.
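A sketch of these heart-rate-variability features, assuming `rr_ms` holds the inter-beat intervals (in milliseconds) detected within one window; the band edges follow the standard LF/HF convention:

```python
import numpy as np
from scipy.signal import lombscargle

def hrv_features(rr_ms):
    """pNN20, normalized SDNN, and HF/LF spectral ratio for one window."""
    diffs = np.abs(np.diff(rr_ms))
    pnn20 = np.mean(diffs > 20.0)                # proportion of successive diffs > 20 ms
    sdnn_norm = np.std(rr_ms) / np.mean(rr_ms)   # normalized interval variability

    # Lomb-Scargle handles the uneven sampling of the interval waveform.
    t = np.cumsum(rr_ms) / 1000.0                # beat times in seconds
    y = rr_ms - np.mean(rr_ms)
    lf_freqs = np.linspace(0.04, 0.15, 50)       # low-frequency band (Hz)
    hf_freqs = np.linspace(0.15, 0.40, 50)       # high-frequency band (Hz)
    lf = lombscargle(t, y, 2 * np.pi * lf_freqs).sum()   # lombscargle expects rad/s
    hf = lombscargle(t, y, 2 * np.pi * hf_freqs).sum()
    return np.array([pnn20, sdnn_norm, hf / lf])
```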
We report the same metrics for this dataset in Figure 6 as for the two EEG studies above: the mean balanced accuracies are given in the top left panel, the convex coefficients for the optimal and oracle classifiers in the top right, and the paired difference histograms between the optimal classifier’s balanced accuracy and the target classifier’s balanced accuracy in the bottom row.
The relative behaviors of the classifiers in this study are similar to the behaviors in the EEG-based stress study above. The optimal classifier is able to outperform the other two classifiers for small proportions of target data and is matched by the target classifier for the rest of the regime. The average-source classifier is never preferred, and the empirical value of κ is approximately 1.5. The distributions of the optimal coefficients get closer to 1 as p increases but are considerably higher compared to the MATB study for each value of p—likely due to the large difference between the empirical values of κ across the two problems.
Lastly, the paired difference histograms for the smallest proportions of target data favor the optimal classifier, while the histograms for the larger proportions are inconclusive. The p-values from the one-sided signed-rank test reflect these observations across the proportions we considered.
4.4. Visualizing the Projection Vectors
The classification results above provide evidence that our proposed approximation to the optimal combination of the average-source and target projection vectors is useful from the perspective of improving the balanced accuracy. There is, however, a consistent gap that remains between the performance of the optimal classifier and the performance of the oracle classifier. To begin to diagnose potential issues with our model, we visualize the projection vectors from each of the tasks.
The three subfigures of Figure 7 show representations of the projection vectors for each task. The dots in the top row correspond to projection vectors from sessions of the MATB dataset (left) and the Mental Math dataset (right). The arrows with endpoints on the sphere in the bottom row correspond to projection vectors from sessions of WESAD. For these visualizations, the entire dataset was used to estimate the projection vectors. The two-dimensional representations for MATB and Mental Math are the first two components of the spectral embedding [53] of an affinity matrix A whose entries encode the pairwise similarities between the normalized projection vectors. The projection vectors for the WESAD task are three-dimensional and are thus directly amenable to visualization.
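A sketch of this visualization step; since the exact entries of A were not recoverable here, an exponentiated cosine-similarity affinity is assumed for illustration:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def embed_projection_vectors(W):
    """2-D spectral embedding of unit-norm projection vectors (rows of W).

    The affinity below (exponentiated cosine similarity) is an assumption;
    the paper's exact entries for A are not specified in this excerpt."""
    A = np.exp(W @ W.T)                 # symmetric, positive affinities
    embedder = SpectralEmbedding(n_components=2, affinity="precomputed")
    return embedder.fit_transform(A)    # (n_sessions, 2)
```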
For each task we clustered the representations of the projection vectors using a Gaussian mixture model, where the number of components was automatically selected by minimizing the Bayesian Information Criterion (BIC). The colors of the dots and arrows reflect this cluster membership. The BIC objective prefers a model with at least two components to a model with a single component for all of the classification problems—meaning that modeling the distribution of the source vectors as a unimodal von Mises–Fisher distribution is likely wrong and that a multi-modal von Mises–Fisher distribution may be more appropriate. We do not pursue this idea further but think it could be a fruitful future research direction for mitigating the gap between the performances of the optimal and oracle classifiers.
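A sketch of this model selection loop, using scikit-learn's GaussianMixture (the range of candidate component counts is an assumption):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_by_bic(points, max_components=5, seed=0):
    """Fit GMMs with 1..max_components components and keep the one
    minimizing the Bayesian Information Criterion."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(points)
        bic = gmm.bic(points)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, best_model.predict(points)   # model and cluster labels
```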
4.5. The Effect of the Number of Samples Used to Calculate the Convex Coefficient
In the simulation experiments described in Section 3 and the applications to different physiological prediction problems in Section 4.1, Section 4.2 and Section 4.3, we used 100 samples from the estimated distribution of the target projection vector to estimate the risk for a given convex coefficient. There is no way to know a priori how many samples are sufficient for estimating the optimal coefficient. We can, however, study how different numbers of samples affect the absolute error of the optimal coefficient relative to a coefficient calculated using an unrealistically large number of samples. For this analysis we focus on a single session from the Mental Math dataset described in Section 4.2. The choice of dataset was somewhat arbitrary; the session was chosen because it is the session where the optimal classifier’s balanced accuracy is closest to the median balanced accuracy.
Figure 8 shows the effect of B, the number of samples from the estimated distribution of the target projection vector used to calculate the risk for a given convex coefficient, on the mean absolute error when compared to a convex coefficient calculated using 10,000 samples. The mean absolute errors shown are calculated, for each proportion p, by first sampling that proportion of data from the target task, training the target classifier using the sampled data, and then estimating the optimal coefficient using 10,000 samples from the estimated distribution of the target projection vector. We then compare this optimal coefficient to coefficients found using B samples, repeated 30 different times, calculate the absolute differences, and record the mean. The lines shown in Figure 8 are the average over 100 different training sets. In this experiment the same grid of convex coefficients described above was evaluated.
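A sketch of this convergence experiment; `sample_projection_vectors` and `estimate_risk_from_samples` are hypothetical stand-ins for the estimated distribution of the target projection vector and the risk approximation:

```python
import numpy as np

def coefficient_abs_error(sample_projection_vectors, estimate_risk_from_samples,
                          B, B_ref=10_000, repeats=30, grid_size=101):
    """Mean absolute error between the coefficient chosen with B samples
    and a reference coefficient chosen with B_ref samples."""
    alphas = np.linspace(0.0, 1.0, grid_size)

    def best_alpha(n_samples):
        samples = sample_projection_vectors(n_samples)
        risks = [estimate_risk_from_samples(a, samples) for a in alphas]
        return alphas[int(np.argmin(risks))]

    alpha_ref = best_alpha(B_ref)
    errors = [abs(best_alpha(B) - alpha_ref) for _ in range(repeats)]
    return float(np.mean(errors))
```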
There are a few things of note. First, when more target data are available, fewer samples from the estimated distribution of the target projection vector are needed to obtain a given value of the mean absolute error. Second, the mean absolute error curves appear to be a negative exponential function of B and, for this subject, the benefit of additional samples decays quite quickly. Lastly, although convex coefficients closer to the coefficient calculated with 10,000 samples yield performance closer to that of the analytically derived optimal classifier, the gap between the performance of the oracle classifier and the optimal classifier in the real-data sections above indicates that there may be some benefit to a non-zero mean absolute error.
4.6. Computational Complexity
Assuming that we have access to the source projection vectors and the target data, and letting B be the number of bootstrap samples and h the number of evaluated classifiers in the grid, the computational cost of obtaining the projection vector for the average-source classifier reduces to averaging the source vectors, and the cost for the target classifier is that of fitting an FLD to the target data. Using the approximately optimal classifier incurs an additional computational cost that is quadratic in B and linear in h, assuming that sampling from a multivariate Gaussian and evaluating the error for each random sample are both constant-time operations.
4.7. Privacy Considerations
As presented in Algorithm 1, the process for calculating the optimal convex coefficient requires access to the normalized source projection vectors. This requirement can be prohibitive in applications where the data (or derivatives thereof) from a source task are required to stay local to a single device or are otherwise unable to be shared. For example, it is common for researchers to collect data in a lab setting, deploy a similar data collection protocol in a more realistic setting, and use the in-lab data as a source task and the real-world data as a target task. Depending on the privacy agreements between the researchers and the subjects, it may be impossible to use the source data directly.
The requirements for Algorithm 1 can be changed to address these privacy concerns by calculating the average source vector and its corresponding standard error in the lab setting and only sharing these two parameters. Indeed, given the average source vector and its standard error, the algorithm is independent of the individual normalized source vectors, and these two quantities can be the only things stored and shared with devices and systems collecting data from the target task.
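A sketch of this privacy-preserving variant, assuming unit-norm source projection vectors as rows of `W_source` (the exact standard-error estimator is an assumption):

```python
import numpy as np

def summarize_sources(W_source):
    """Compute the only two quantities that need to leave the lab:
    the average source vector and its elementwise standard error."""
    w_bar = W_source.mean(axis=0)
    std_err = W_source.std(axis=0, ddof=1) / np.sqrt(W_source.shape[0])
    return w_bar, std_err   # share these; the raw vectors never leave the lab

# On the target device, Algorithm 1 can then run using only
# (w_bar, std_err) together with the locally collected target data.
```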
5. Discussion
The approximation to the optimal convex combination of the target and average-source projection vectors proposed in Section 2 is effective in improving classification performance in simulation and, more importantly, across different physiological prediction settings. The improvement is both operationally and statistically significant in settings where very little training data from the target distribution is available. In most Human–Computer Interface (HCI) systems, an improvement in this part of the regime is the most critical, as manufacturers want to minimize the amount of configuration time (i.e., the time spent collecting labeled data) users endure and, more generally, make the systems easier to use. We think that our proposed method, along with its privacy-preserving properties inherent to sharing only estimated parameters, is helpful towards that goal.
With that said, there are limitations to our work. For example, the derivation of the optimal convex coefficient and, subsequently, our proposed approximation are only valid for the two-class problem. We do not think that an extension to the multi-class problem is trivial, though treating a multi-class problem as multiple instances of the two-class problem is a potential way forward [54,55].
Similarly, our choice to use a single coefficient on the average-source projection vector, as opposed to one coefficient per source task, may be limiting in situations where the source vectors are not well concentrated. In the WESAD analysis, where the empirical value of κ is approximately 1.5, for example, it may be possible to maintain an advantage over the target classifier for a larger section of the regime with a more flexible class of hypotheses. The flexibility, however, comes at the cost of privacy and computational resources. A potential middle ground between maximal flexibility and the combination of privacy preservation and computational cost is modeling the distribution of the source projection vectors as a multi-modal vMF distribution, where the algorithm would only need access to the mean direction vector and standard errors associated with each constituent distribution. The visualizations in Section 4.4 provide evidence that this model may be more appropriate than the one studied here.