1. Introduction
To understand data, it is necessary to identify important features and, from these, construct categorizations that represent dimensionality reductions. For electroencephalography (EEG), categorization has historically depended mainly on human experts. There are several problems with this: (1) it is not known whether the categories are optimal; (2) it is difficult to provide objective definitions of the categories; and (3) expert assessments may suffer from intra- and inter-rater variability. When developing algorithms or machine learning methods to analyze EEG, the data often need to be transformed into a set of features, e.g., a limited set of numbers related to the frequency content. However, important information may be lost in this transformation. In supervised learning, using a ground truth created by experts transfers the inherent problems of the categorization to the method. Hence, methods that discover relevant EEG features, and how these are related, through a self-supervised process may provide a new perspective on traditional EEG categories and possibly new insights. To this end, in the work presented here, a deep learning approach for improving parametric t-distributed stochastic neighbor embedding (t-SNE) for visual EEG analysis was applied. This article has been written with both clinical neurophysiologists and data scientists in mind; many of the technical details have therefore been placed in the appendices.
t-SNE is a method primarily developed for visualizing high-dimensional data by mapping them to a low-dimensional space [1]. The result is usually a clustering of similar data in the low-dimensional representation, and relations in the data can then be identified through visual inspection and comparison with the original data (Figure 1).
t-SNE is based on pairwise matching of the probabilities that data examples are neighbors in both the high- and the low-dimensional space. The high-dimensional probabilities are based on a Gaussian kernel, and the low-dimensional probabilities on a Student t-distribution. Optimization is accomplished via gradient descent. The original implementation does not create a model for the mapping from the high- to the low-dimensional representation (the method is described in more detail in Appendix B.1). The original t-SNE has been used to visualize results in several EEG studies [2,3,4,5,6,7,8,9,10,11]. In a couple of other studies, the method was used for feature extraction from EEG [12,13].
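For reference, the standard formulation from [1] is summarized below: the high-dimensional neighbor probabilities use Gaussian kernels (with per-example bandwidths σ_i set via a perplexity search), the low-dimensional probabilities use a Student t-distribution with one degree of freedom, and the two are matched by minimizing a Kullback–Leibler divergence.

```latex
% Standard t-SNE (notation of [1]): x_i are high-dimensional inputs,
% y_i their low-dimensional map points, N the number of examples.
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^{2} / 2\sigma_i^{2}\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^{2} / 2\sigma_i^{2}\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^{2}\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^{2}\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```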
Parametric t-SNE is a variation in which the dimensionality reduction is learned by a neural network [14], thus creating a model for the mapping (Figure 2A). In the original implementation, restricted Boltzmann machines are pretrained using greedy layer-wise training and then fine-tuned using the principle of t-SNE. Li et al. [15] and Xu et al. [16] used parametric t-SNE as a step to extract features from motor imagery EEG data; in evaluations with support vector machine classifiers, it compared favorably to other feature extraction methods.
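As an illustration of the parametric variant, the following is a minimal PyTorch sketch (an assumption for illustration, not the implementation of [14], which pretrains restricted Boltzmann machines): a feed-forward network maps high-dimensional inputs to map coordinates, and the t-SNE loss is computed per batch against a precomputed, perplexity-calibrated neighbor-probability matrix P (here a uniform placeholder).

```python
import torch
import torch.nn as nn

def tsne_loss(P, Y, eps=1e-12):
    """KL(P || Q), with Student t similarities Q in the output space."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    Q = 1.0 / (1.0 + d2)                                  # t kernel, one degree of freedom
    Q = Q.masked_fill(torch.eye(len(Y), dtype=torch.bool), 0.0)  # exclude self-pairs
    Q = Q / Q.sum()
    return torch.sum(P * torch.log((P + eps) / (Q + eps)))

# Hypothetical encoder: 1000 input features -> 2-D map coordinates.
net = nn.Sequential(nn.Linear(1000, 500), nn.ReLU(),
                    nn.Linear(500, 100), nn.ReLU(),
                    nn.Linear(100, 2))
opt = torch.optim.Adam(net.parameters())

X = torch.randn(128, 1000)                     # one batch of high-dimensional examples
P = torch.full((128, 128), 1.0 / (128 * 127))  # placeholder for a calibrated P matrix
P = P.masked_fill(torch.eye(128, dtype=torch.bool), 0.0)

opt.zero_grad()
loss = tsne_loss(P, net(X))
loss.backward()
opt.step()
```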
Substantial preprocessing is usually required to extract features from EEG. Jing et al. used frequency features in combination with several other statistical or nonlinear measures [2]. Suetani and Kitajo employed frequency features in a modified version of t-SNE using a beta-divergence as a distance measure [6]. Kottlarz et al. used t-SNE to compare frequency features with features created using ordinal pattern analysis [9]. Li et al. used wavelet-generated features for parametric t-SNE [15], and Xu et al. extended this by combining wavelet features with the original EEG and frequency band features, all processed further through common spatial pattern filtering [16]. Other researchers created features using tensor decomposition [3] and connectivity measures [4], and several extracted features from the latent spaces of neural networks [5,7,8,10]. Exceptions to the above are Ma et al. [12], Georg et al. [11], and Yu et al. [13], who applied t-SNE to raw EEG to create features. It is difficult to assess how the different approaches compare in terms of performance, and none of them can be regarded as a standard approach.
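As a concrete example of this kind of preprocessing, the sketch below computes classical band-power features from a single EEG channel (Python with scipy; the bands, sampling rate, and epoch length are illustrative assumptions, not parameters from the cited studies):

```python
import numpy as np
from scipy.signal import welch

def band_powers(x, fs, bands=((0.5, 4), (4, 8), (8, 13), (13, 30))):
    """Mean spectral power in the delta, theta, alpha, and beta bands."""
    f, pxx = welch(x, fs=fs, nperseg=2 * fs)   # Welch periodogram, 0.5 Hz resolution
    return np.array([pxx[(f >= lo) & (f < hi)].mean() for lo, hi in bands])

fs = 256                                  # hypothetical sampling rate (Hz)
x = np.random.randn(10 * fs)              # hypothetical 10 s single-channel epoch
features = band_powers(x, fs)             # 4-dimensional feature vector
```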
A high-dimensional representation similar to how an expert views EEG might be preferable. Here, the identification of high-level features, such as interictal epileptiform discharges (IEDs), is regarded as important, whereas their time and location may be less important. Using the Euclidean distance between raw signals to compare visually similar examples could result in them being assessed as different if important waveforms are located at different positions. Furthermore, if the waveforms constitute only a small part of the signals, they may not have a large enough impact on the measure. A property often attributed to convolutional neural networks (CNNs) is the location-invariant detection of features [17]. Therefore, using a CNN with appropriate fields of view to convert the EEG into a set of high-level features appears to be a possible solution (a sketch of such an encoder is given after this paragraph), but the question of how to train the network needs to be addressed. Training the networks on classification tasks may be an alternative [5,7,10], but this may put too much emphasis on features specific to the classification task and dataset, and it involves supervised learning. Deep clustering is a set of methods in which clustering is performed on latent representations in neural networks. Most of the work has been conducted in image analysis, and the methods have been thoroughly reviewed [18,19,20,21]. Many of the methods have similarities to t-SNE, mainly regarding the use of the Kullback–Leibler divergence to match distributions. However, in comparison to t-SNE, there are some notable differences: (1) the number of clusters must often be predefined, sometimes with cluster centers initialized by, e.g., k-means clustering; (2) there is variation in which distributions are used; (3) the loss is often a combination of the Kullback–Leibler divergence and some other form of loss, e.g., the reconstruction loss of an autoencoder; and (4) the network may be pretrained, e.g., as an autoencoder or a generative adversarial network.
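A minimal sketch of such a CNN encoder follows (PyTorch; the channel counts, kernel sizes, and pooling factors are illustrative assumptions, not the architecture evaluated in this work). The convolutions detect waveforms wherever they occur, and a final global max-pooling keeps only whether, not where, each feature occurred:

```python
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    """Maps (batch, channels, samples) EEG to a high-level feature vector."""
    def __init__(self, n_channels=21, n_features=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(128, n_features, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool1d(1)   # global max-pool: position-invariant

    def forward(self, x):
        return self.pool(self.conv(x)).squeeze(-1)

z = EEGEncoder()(torch.randn(8, 21, 2560))   # -> (8, 128) feature vectors
```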
Currently used EEG categories, e.g., epileptiform discharges and seizure activity, are clinically useful concepts, but the categories are difficult to define objectively, and in practice, the assessment may differ from expert to expert. Motivated by this problem inherent in EEG categorization, the objective of this work was to further improve parametric t-SNE as a tool for the visual study of categories in EEG data. Previous implementations of t-SNE use the raw or, most often, preprocessed EEG as the high-dimensional representation (Figure 2A). Here, CNNs were instead trained, using the principle of t-SNE, to match a neighbor structure in their latent space to one in their output space (Figure 2B); i.e., the CNNs learned a new, relevant high-dimensional representation. In addition, the computations were simplified by using a simple distribution based on ranked distances, instead of a normal distribution, for the high-dimensional representation, which reduced training times. The main advantage of the suggested method is that no feature engineering is necessary. This may avoid the loss of relevant information due to the choice of an inferior preprocessing technique. The method is fully self-supervised and thus avoids the potential biases regarding the definitions of categories that are introduced in supervised alternatives.
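The exact rank-based distribution used in this work is not reproduced in this section; purely as an illustration of the idea, the sketch below replaces the perplexity-calibrated Gaussian kernel with uniform probability mass over the k nearest neighbors by distance rank (numpy; the functional form and the value of k are assumptions):

```python
import numpy as np

def rank_based_p(x, k=10):
    """Illustrative rank-based neighbor probabilities (hypothetical form).

    Each example assigns equal mass to its k nearest neighbors by rank,
    avoiding the per-example bandwidth search of the Gaussian kernel.
    """
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-pairs
    nearest = d.argsort(axis=1)[:, :k]           # k nearest neighbors by distance rank
    p = np.zeros_like(d)
    rows = np.repeat(np.arange(len(x)), k)
    p[rows, nearest.ravel()] = 1.0 / k           # uniform mass over the k nearest
    return (p + p.T) / (2 * len(x))              # symmetrize as in t-SNE

P = rank_based_p(np.random.randn(200, 50))       # hypothetical 200 examples, 50-D
```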
4. Discussion
In this work on EEG analysis, a novel deep learning approach to t-SNE was compared to conventional parametric t-SNE. Ordinarily, the high-dimensional representation is a preprocessed version of the EEG data. In the presented approach, CNN encoders instead learned a new high-dimensional representation using the principle of t-SNE. Two different time–frequency methods were used for the conventional implementations.
All methods showed promising results. The kappa values calculated using the SVMs were generally comparable, with no consistent pattern in favor of any method. In contrast, the number of clusters, as assessed via k-means clustering, was consistently higher for the CNN encoders. Distinct clustering is very important in visual analysis. In this study, the data were annotated, and there were thus color codes to guide the eye. This may not always be the case; when annotations are absent, distinct clusters that indicate a closer similarity in the data, i.e., potential categories, are easily identifiable.
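For reference, one common way to assess the number of clusters in a low-dimensional embedding is k-means combined with a silhouette criterion, sketched below with scikit-learn (an illustrative protocol, not necessarily the exact one used in this study):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_n_clusters(emb, k_range=range(2, 15)):
    """Pick the k whose k-means partition maximizes the silhouette score."""
    scores = {k: silhouette_score(emb, KMeans(n_clusters=k, n_init=10).fit_predict(emb))
              for k in k_range}
    return max(scores, key=scores.get)

emb = np.random.randn(500, 2)        # hypothetical 2-D t-SNE embedding
k_best = estimate_n_clusters(emb)
```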
All methods can probably be improved further, for example, by optimizing the preprocessing of the data and the hyperparameters of the algorithms. Improvements can also likely be made by tailoring the methods to each specific dataset and categorization problem, e.g., through the choice of EEG channels and frequency bands to analyze. There are different ways of implementing time–frequency methods. The STFT is a simple linear time–frequency method that provides a cross-terms-free representation, but it has low signal concentration, resolution, and selectivity [31]. The cross-terms-free Wigner distribution, also known as the S-method, may, for example, be a better option, as it also provides a cross-terms-free representation, remedies the listed disadvantages of the STFT, reduces the influence of noise [32], and shows the best performance in the estimation of instantaneous frequency [33]. Of course, there are many other methods to consider for producing features when using the original parametric t-SNE. However, given the extent of possible variation in deep learning, it is speculated that this approach has the largest potential to improve performance.
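For orientation, a minimal STFT-based time–frequency representation of an EEG epoch can be computed as follows (scipy; the sampling rate and window length are illustrative assumptions):

```python
import numpy as np
from scipy.signal import stft

fs = 256                                   # hypothetical sampling rate (Hz)
x = np.random.randn(10 * fs)               # hypothetical 10 s single-channel epoch
f, t, Zxx = stft(x, fs=fs, nperseg=fs)     # 1 s windows -> 1 Hz frequency resolution
spectrogram = np.abs(Zxx) ** 2             # time-frequency power representation
```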
There was a possible tendency for the suggested approach to perform better for IEDs. As IEDs are short transients, it is possible that the convolutional and max-pooling operations detected them more efficiently than the CWT and STFT representations did. The CWT encoder seemed to perform better than the STFT encoder for IEDs. The CWT representation had more dimensions than the STFT representation, but it also consisted of statistical values from the wavelet processing, where the max value was equivalent to the max-pooling operation used in the level analysis of the CNN encoders; this may have contributed to its better performance for IEDs compared with the STFT representation.
As evaluated by the SVMs and kappa scores, the methods showed similar performances in the sleep–wake and seizure datasets. These categories are, to a large extent, defined by their frequency content, which is why time–frequency methods can perform well for these data types.
In contrast to the IED and seizure datasets, the number of clusters decreased for the test data compared with the training data of the sleep–wake dataset. The main difference between the datasets, apart from the difference in EEG patterns, was that the test data of the sleep–wake dataset consisted of new subjects, whereas for the other two datasets, the test data consisted of new data from the same subjects that were used for training. This implies overfitting to the training subjects; however, the global structure generalized to new subjects. It is speculated that larger datasets, containing data from a larger number of subjects, may be necessary for better generalization to new subjects. Since the CNN encoders learn a new high-dimensional representation, it is reasonable to assume that there is more potential for overfitting in general. This risk is probably higher for smaller datasets, as indicated by the demonstration presented in Appendix E, where a smaller dataset was used and overfitting was more prominent. In the results presented in Section 3, the training data examples are non-overlapping. Using overlapping examples can provide a larger variation in the training data; this is discussed in Appendix A.3, where it is also demonstrated that it can affect the clustering. The effect on overfitting is also demonstrated in Appendix E, where overfitting decreases when overlapping examples are used (a sketch of such windowing follows this paragraph). Other strategies for data augmentation could also be tested. Since the method is self-supervised, pretraining the network on a large unassorted dataset could be a relatively affordable way to decrease overfitting. Dropout layers could be added to the model. The model uses L2 regularization for the convolutional and fully connected layers, and other types of regularization could also be tested to mitigate overfitting.
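The windowing referred to above can be sketched as follows (numpy; the window length, stride, and channel count are illustrative assumptions): a stride equal to the window length yields non-overlapping examples, and a smaller stride yields overlapping examples for augmentation.

```python
import numpy as np

def make_examples(eeg, win, stride):
    """Slice a (channels, samples) EEG array into fixed-length examples."""
    n = (eeg.shape[1] - win) // stride + 1
    return np.stack([eeg[:, i * stride : i * stride + win] for i in range(n)])

fs = 256                                        # hypothetical sampling rate (Hz)
x = np.random.randn(21, 60 * fs)                # hypothetical 21-channel, 60 s recording
examples = make_examples(x, win=10 * fs, stride=5 * fs)   # 10 s windows, 50% overlap
```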
If there is overfitting, the learned mapping is probably still based on the distances between the signals; this is suggested by all of the low-dimensional representations produced using the method in this article, since the resulting clusters conform with the annotations. Hence, the low-dimensional representations of the training data can still be useful and offer insight into the relations between those specific signals, even when the encoders do not generalize well to new data. Whether it is worth the time and resources to train unique encoders on every dataset as a method of analysis is another question.
The suggested approach can be time-consuming to develop. If a high-dimensional representation that produces good results for t-SNE can be identified, then there are faster implementations of the algorithm, e.g., Barnes–Hut t-SNE [34] and FIt-SNE [35], and the deep learning approach may be unnecessarily complicated.
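For instance, the Barnes–Hut approximation is available off the shelf in scikit-learn; in the sketch below, X stands for a precomputed (n_examples × n_features) high-dimensional representation (hypothetical data for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(500, 64)     # hypothetical precomputed EEG features
emb = TSNE(n_components=2, method="barnes_hut", perplexity=30).fit_transform(X)
```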
Since the method is self-supervised, anything in the data may induce clustering. This may be beneficial, e.g., the method will, in a sense, be less biased. On the other hand, it may also make the method too unselective, e.g., uninteresting artifacts may induce clusters that are suboptimal for the task at hand. The method could also be too sensitive: theoretically, small variations in electrode placement could affect generalization if training and test data are recorded on separate occasions, and noise could induce spurious clustering.
The datasets used in this study were relatively small, both with respect to the number of data examples and the number of subjects. Additional material is needed for further evaluation. Another limitation is that the experiments were only performed using the same, relatively short, example duration; time–frequency methods may show inferior performance for shorter durations. A demonstration using a longer duration is presented in Appendix E, but further testing using different durations is required. Moreover, the annotation and data selection were performed by one of the authors, and the method was developed by the authors; the method must therefore, of course, be evaluated further, using other data and by external researchers.
As described in the Introduction, t-SNE has previously been used mainly to visualize EEG data. There are several promising applications. The method could be further developed to produce quick overviews of whole recordings, providing a rapid first assessment of whether an EEG is normal or pathological. This could be useful, e.g., to prioritize the order of assessment of EEGs in clinical routine work when the workload is high. The method might also be used to construct trends for long-term monitoring; for example, seizure activity and status epilepticus could then easily be detected in intensive care units lacking personnel trained in EEG interpretation. Jing et al. [2] integrated visualization using t-SNE into an annotation tool; rapid annotation of data could be performed by simply marking clusters manually and assigning categories in a graphical user interface, which then automatically annotates the corresponding time intervals in the EEG.