1. Introduction
Epilepsy is a neurological disease that affects about of the world’s population [1]. A key feature of epilepsy is seizures—rare episodes of abnormal neural activity [2]. On the behavioural level, seizures may induce either uncontrollable convulsions (e.g., tonic–clonic seizures) or loss of consciousness (e.g., absence seizures) [3]. On the neural activity level, seizures are accompanied by high-amplitude, rhythmic low-frequency neural oscillations reflecting synchronous activity in large neuronal populations [4].
In general, seizures are accompanied by a state of incapacity that creates dangerous situations for patients and the people around them. Epilepsy can also cause cognitive and behavioural deficits [5]. Thus, antiepileptic treatment is highly important, and it starts with diagnostics [6]. Epileptic seizures occur unpredictably and alternate with long periods of normal activity. The rarity of these events complicates their diagnostics in hospital settings. To collect a sufficient amount of epileptiform activity, patients undergo prolonged electroencephalogram (EEG) monitoring in hospital [7]. As a result, this procedure generates large datasets comprising hours of normal activity and several seizures with a total duration of only a few minutes. Usually, these EEG data are processed manually by epileptologists, who spend many hours inspecting the signals to label the fragments containing seizures [8]. Machine learning (ML) algorithms are often applied to EEG data processing, including seizure detection [9,10,11,12]. These algorithms can reduce the routine work of experts, helping them label seizures faster and decreasing the number of errors caused by the human factor.
The application of ML to seizure labelling commonly results in classifiers capable of distinguishing two classes: “seizures” and “normal activity” [13]. Each classifier falls into one of two broad categories: supervised and unsupervised [14]. A supervised ML algorithm is trained on the labelled data of some patients before labelling data from a new patient. Using EEG recordings with manually labelled seizures, these algorithms can learn the distinctive features of epileptiform patterns and detect them in unseen EEG signals [15]. The supervised approach to epilepsy detection has shown promising results and includes state-of-the-art techniques such as deep learning (DL) [16], recurrent neural networks (RNNs) [17], and their modified version—long short-term memory (LSTM) [18]. The majority of ML methods for seizure labelling are supervised [19]. However, some limitations of supervised ML algorithms are often overlooked. Firstly, supervised methods require a large amount of pre-labelled data, and acquiring such a dataset can be a challenging task in itself. Epileptic patterns are known to be highly variable, owing to the physiological features of different types of epilepsy and further exacerbated by differences in experimental conditions and in the hardware and software used for acquisition. Secondly, one of the most prominent problems in epileptic data is class imbalance: recordings lasting hours or days usually contain only several epileptic episodes with a total length of a few minutes. In many cases, this class imbalance leads to overfitting.
Unsupervised ML algorithms are commonly less effective in classification but can provide other benefits, for instance, when working with imbalanced data [20]. Additionally, unsupervised methods do not require data pre-labelling and tend to be more explainable [21]. The ultimate goal of our work was to propose a decision support system (DSS) for epilepsy diagnostics, as explainability is highly important in medicine. Thus, we decided to use unsupervised ML methods specifically designed to work with imbalanced data and perform anomaly detection. A literature search showed that some unsupervised methods rely on complex pre-processing [22] or invasive techniques [23], neither of which is acceptable for rapid clinical diagnostics in human patients. In this research, we used EEG data with minimal pre-processing to devise a system capable of operating in a clinical setting. The main goal of our work was to estimate the performance of unsupervised anomaly detection models as a system for the automatic pre-labelling of epileptic seizures on human EEGs.
Recently, we showed that EEG data with epileptic seizures are not only imbalanced but also demonstrate features of extreme event behaviour [24,25]. We used this knowledge to apply anomaly detection techniques to label epileptic EEG data. These techniques belong to the unsupervised ML algorithms that operate without training labels. We tested a one-class support-vector machine (SVM), a popular outlier detection algorithm, on the data of 83 patients and reported sensitivity and precision [26].
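To make the outlier detection setting concrete, the following is a minimal sketch of one-class SVM anomaly labelling on synthetic data standing in for per-epoch EEG features; the feature dimensions, cluster parameters, and the `nu` value are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for per-epoch EEG features: many "normal" epochs
# plus a few well-separated "seizure-like" outliers (hypothetical data).
normal = rng.normal(0.0, 1.0, size=(500, 8))
seizure = rng.normal(6.0, 1.0, size=(10, 8))
X = np.vstack([normal, seizure])

# nu approximates the expected fraction of outliers in the data;
# no labels are used, which is the point of the unsupervised setting.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels = ocsvm.fit_predict(X)  # +1 = inlier, -1 = outlier

detected = np.where(labels == -1)[0]
print(f"{len(detected)} epochs flagged as outliers")
```

Because `nu` upper-bounds the fraction of flagged points, the model marks roughly 5% of epochs as anomalous here, which in real EEG would include both seizures and artefacts.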
Here, we further investigate the ability of outlier detection algorithms to detect seizures. We start by introducing popular algorithms for outlier detection, including the SVM, an ensemble method, and distance-based methods, and adjust their hyperparameters to ensure the best score. We then train these algorithms using various features of the EEG signals, including two frequency ranges and three types of spectral power: full spectrum, mean, and principal components (PCs). With the obtained results, we test two hypotheses: (i) the algorithm type and features impact performance; (ii) performance can be predicted based on prior knowledge about the algorithm type and features.
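The three feature variants can be sketched as follows on synthetic signals; the sampling rate, epoch length, Welch parameters, and number of principal components are illustrative assumptions rather than the paper's exact feature engineering pipeline.

```python
import numpy as np
from scipy.signal import welch
from sklearn.decomposition import PCA

fs = 128  # sampling rate in Hz; an illustrative value, not from the paper
rng = np.random.default_rng(1)

# 100 one-second synthetic EEG epochs.
epochs = rng.normal(size=(100, fs))

# Power spectral density per epoch via Welch's method.
freqs, psd = welch(epochs, fs=fs, nperseg=fs, axis=-1)

# Three feature variants mirroring the paper's setup:
band = freqs <= 30                                      # 0-30 Hz range
full_spectrum = psd[:, band]                            # "full spectrum"
mean_power = psd[:, band].mean(axis=1, keepdims=True)   # "mean"
pcs = PCA(n_components=5).fit_transform(full_spectrum)  # "PCs"

print(full_spectrum.shape, mean_power.shape, pcs.shape)
```

Restricting `band` to `freqs <= 3` would give the corresponding 3 Hz variants of the same three features.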
The remainder of this paper is organized as follows.
Section 2 presents the materials and methods, describing the used dataset, details on the following data processing, and feature engineering procedures.
Section 2 also includes information on the proposed methods: a block diagram, details on the different ML algorithms tested, and a description of the evaluation and hyperparameter optimization processes.
Section 3 presents the results and discussion, including the results of hyperparameter optimization and an evaluation of the models with optimal hyperparameters.
Section 3 also includes the results of statistical comparisons of different algorithms as well as different features.
Section 4 presents the conclusion.
3. Results and Discussion
We started with hyperparameter optimization for all algorithms and all types of input data (Table 2). We tested the models with different distance estimation metrics, and the GridSearch method chose the Euclidean distance as the best metric. Therefore, we proceeded with the Euclidean distance in all calculations. The OCSVM performed better with the rbf kernel when gamma was set to scale. The obtained results provided the following insights. First, the LNND and LOF required more neighbours than the kNN. Second, the parameters of the OCSVM did not depend on the type of input data. Third, in the distance-based methods, the threshold had the smallest value for PCA and the highest value for the 30 Hz and 3 Hz spectral power.
After finding the optimal combination of hyperparameters, an individual model was trained for each patient, i.e., 80 models were obtained for each ML method. Then, the recall, precision, and F1-score were calculated, and these values were averaged over the group of patients (see Table 3). These results show that the OCSVM, kNN, and IF reach their highest performance on the raw data, while the LNND and LOF perform better on PCA features. The best results were demonstrated by the LNND trained using the 30 Hz PCA.
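The per-patient evaluation and averaging step can be sketched as follows; the per-patient predictions here are simulated, so the seizure rate and error rate are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(3)

# Hypothetical per-patient results: one (y_true, y_pred) pair per patient.
patients = []
for _ in range(5):
    y_true = (rng.random(200) < 0.05).astype(int)  # rare seizure epochs
    y_pred = y_true.copy()
    flip = rng.random(200) < 0.03                  # simulated model errors
    y_pred[flip] = 1 - y_pred[flip]
    patients.append((y_true, y_pred))

# Metrics are computed per patient, then averaged over the group.
scores = {
    "recall": np.mean([recall_score(t, p) for t, p in patients]),
    "precision": np.mean([precision_score(t, p) for t, p in patients]),
    "F1": np.mean([f1_score(t, p) for t, p in patients]),
}
print(scores)
```

Averaging per-patient scores, rather than pooling all epochs, keeps each patient's contribution equal despite differing recording lengths.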
To test if the F1-score changed between algorithms, we used the non-parametric Friedman test. For the mean value in the 3 Hz band, there was a statistically significant difference in the F1-score depending on which algorithm was used, . Post hoc analysis with Conover’s tests was conducted with a Bonferroni correction applied, resulting in a significance level set at . There was a significant difference in the F1-score between the LNND and the other algorithms (Table 4).
For the mean value in the 30 Hz band, there was a statistically significant difference in the F1-score depending on which algorithm was used, . Post hoc analysis with Conover’s tests was conducted with a Bonferroni correction applied, resulting in a significance level set at . There was a significant difference in the F1-score between the LNND and the other algorithms (Table 5).
For the PCA features and the RAW data, in both the 3 Hz and 30 Hz bands, there was no statistically significant difference in the F1-score between the algorithms.
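The statistical procedure above can be sketched with SciPy on hypothetical per-patient F1-scores; the data are simulated, and the Conover post hoc step is only indicated via its Bonferroni-corrected level (in practice it is available in third-party packages such as scikit-posthocs).

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(4)

# Hypothetical F1-scores of three algorithms over the same 20 patients
# (matched samples, as the Friedman test requires).
base = rng.random(20) * 0.4
f1_lnnd = base + 0.25                     # one algorithm consistently higher
f1_knn = base + rng.normal(0, 0.02, 20)
f1_lof = base + rng.normal(0, 0.02, 20)

stat, p = friedmanchisquare(f1_lnnd, f1_knn, f1_lof)
print(f"Friedman chi2={stat:.2f}, p={p:.4g}")

# With k = 3 algorithms there are k*(k-1)/2 = 3 pairwise post hoc
# comparisons, so the Bonferroni-corrected significance level is:
alpha_corrected = 0.05 / 3
```

A significant Friedman result only says that at least one algorithm differs; the pairwise post hoc tests identify which one, as done for the LNND in the paper.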
The F1-scores appeared to be low; however, there are certain reasons for this. Firstly, in unsupervised ML, we cannot compare predicted values to true values and tune the parameters of the method to improve its performance. Since unsupervised methods aim to separate anomalies from the rest of the data, the main way to affect performance is to estimate the percentage of outliers in the data. Secondly, EEG data contain many artefacts that can also be marked as anomalies, and since the data are not pre-labelled, we have limited options to avoid them. The presence of artefacts affects the threshold value of the method, so in highly contaminated data, the F1-score can be lower. Nonetheless, even these F1-scores can be beneficial for a DSS, where the final decision is made by a human. The proposed system could be used for the rapid analysis of the EEG to save the time and effort of experts.
In the next step, we considered the precision–recall curves for all algorithms trained using the optimal hyperparameters from Table 2 on the different types of input data (Figure 3). Each sub-figure shows the precision–recall curves for all algorithms (different colours).
The obtained results provided the following insights. First, the lowest areas under the precision–recall curves were achieved for the 30 Hz mean (Figure 3B). This indicates that the overall power may not be a reliable marker of a seizure. In particular, the precision is low, indicating a large number of false detections. A possible explanation is that many events are characterized by an increase in overall power, such as artefacts or sleep-related patterns [42]. Second, using the 3 Hz mean improves the precision–recall curves (Figure 3A). This illustrates that seizures produce an outlier in a specific frequency band rather than increasing the overall power. In line with our previous results, this confirms that extreme behaviour occurs in a certain frequency range [24,33]. Third, the mean spectral power provides the worst results for the LOF and LNND algorithms (red and green curves). Moreover, the transition to the specific frequency band of 3 Hz barely improves the performance of these methods. We see that the green and red curves practically do not change between Figure 3A,B, while the others show higher precision. Note that the LOF and LNND are distance-based methods, working similarly to the kNN but using a different distance measure. Knowing that the kNN improves its performance when switching from the 30 Hz mean to the 3 Hz mean, we hypothesize that the distance measure affects the separability between the classes. Fourth, using the entire 30 Hz range provides the widest feature space, in which we see that the LOF and LNND perform better. A possible reason for this is that the boundary between the classes changes its configuration in the different feature spaces. Fifth, the best precision–recall curve is achieved using the LNND trained on the 30 Hz PCA (Figure 3F). Interestingly, the LNND improves its performance when using PCs in the 30 Hz range rather than 3 Hz. An intuitively clear explanation is that extending the frequency range provides more distinguishing features together with an increasing amount of unnecessary information and correlated features, thereby limiting the ability to improve. Using PCs reduces the amount of unnecessary information and rejects correlations between features while retaining the advantage of a wide frequency range.
Finally, we analysed the effect of the distance measure in the distance-based algorithms on the separability between seizures and normal activity in the feature space. We introduced a distance between the classes, a normalized difference between the median distances to the seizure and normal states, and tested whether it depended on the algorithm, frequency band, or feature (see Figure 4). The median distances did not follow a normal distribution in the group of patients (according to the Shapiro–Wilk test), having long tails over the high values. Therefore, we removed of patients with the highest values and transformed the remaining data using the root-mean-square. The resulting values underwent a repeated-measures analysis of variance (rm ANOVA) with two within-subject factors: frequency band (3 Hz and 30 Hz) and feature (raw, mean, PCA), separately for three ML algorithms (LNND, kNN, and LOF).
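One plausible reading of this class-distance measure is sketched below; the exact formula used in the paper is not fully specified here, so the reference point (the normal-class centre) and the normalization are labelled assumptions.

```python
import numpy as np

def class_separability(X, labels):
    """Normalized difference between the median distances of seizure and
    normal epochs to the normal-class centre. This is one plausible
    interpretation of the paper's metric, not its exact formula."""
    centre = np.median(X[labels == 0], axis=0)
    d_norm = np.median(np.linalg.norm(X[labels == 0] - centre, axis=1))
    d_seiz = np.median(np.linalg.norm(X[labels == 1] - centre, axis=1))
    return (d_seiz - d_norm) / (d_seiz + d_norm)

rng = np.random.default_rng(6)

# Hypothetical feature vectors: 200 normal epochs, 10 seizure epochs.
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (10, 5))])
labels = np.r_[np.zeros(200), np.ones(10)]
sep = class_separability(X, labels)
print(f"separability: {sep:.2f}")
```

Values near 0 indicate that seizures sit no farther from the normal cluster than normal epochs themselves, while values near 1 indicate well-separated classes.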
For the LNND, the rm ANOVA revealed a significant main effect of the feature: F(2,106) = 11.468, p < 0.001. The main effect of the frequency band was insignificant: F(1,53) = 1.472, p = 0.23. Finally, the feature × frequency band interaction effect was also insignificant: F(2,106) = 1.365, p = 0.26. These results indicate that the distance between the classes depends on the feature in a similar way for both frequency bands. The direction of change is shown in Figure 4A. Note that since the effect of the frequency band and the feature × frequency band interaction were both insignificant, the distance between the classes in Figure 4A is presented as an average over the 30 Hz and 3 Hz bands. Using the mean value as a feature provides the smallest distance between the seizure and normal states. Raw data and PCA ensure a greater distance, but it barely differs between them.
For the LOF, the rm ANOVA revealed a significant main effect of the feature: F(2,78) = 53.738, p < 0.001, and an insignificant main effect of the frequency band: F(1,39) = 3.757, p = 0.06. In contrast, there was a significant feature × frequency band interaction effect: F(2,78) = 23.109, p < 0.001. These results indicate that the distance between the classes depends on the feature, and the way it changes depends on the frequency band. The direction of change is shown in Figure 4B for 30 Hz and 3 Hz. For the 3 Hz band, the distance between the classes hardly changes between the different features. For the 30 Hz band, the mean value provides the highest distance, and the raw data provide the lowest distance.
For the kNN, the rm ANOVA revealed a significant main effect of the feature: F(2,114) = 35.854, p < 0.001, and an insignificant main effect of the frequency band: F(1,57) = 1.312, p = 0.257. There was a significant feature × frequency band interaction effect: F(2,114) = 19.894, p < 0.001. These results indicate that the distance between the classes depends on the feature, and the way it changes depends on the frequency band. The direction of change is shown in Figure 4C for 30 Hz and 3 Hz. When using the mean as a feature, the distance differs between 3 Hz and 30 Hz, while for the raw data and PCA this difference disappears.
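The two-factor within-subject design described above can be sketched with statsmodels' `AnovaRM` on simulated class-distance values; the number of patients, the effect sizes, and the noise level are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(7)

# Hypothetical within-subject design: each patient contributes one
# class-distance value for every feature x frequency-band combination.
rows = []
for patient in range(30):
    for feature in ["raw", "mean", "pca"]:
        for band in ["3Hz", "30Hz"]:
            # Simulated effect: raw and PCA yield larger distances than mean.
            offset = {"raw": 0.3, "mean": 0.0, "pca": 0.3}[feature]
            rows.append({"patient": patient, "feature": feature,
                         "band": band,
                         "distance": offset + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

res = AnovaRM(df, depvar="distance", subject="patient",
              within=["feature", "band"]).fit()
print(res.anova_table)
```

The resulting table reports F values, degrees of freedom, and p-values for the two main effects and their interaction, matching the F(df1, df2) notation used in the text.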
Combining the results of outlier detection (Figure 3) and the analysis of distances (Figure 4), we find similar tendencies in how the precision–recall curve and the distance in the LNND depend on the frequency band and feature. First, the lowest precision–recall curve is observed for the mean value and grows when we use the raw data and PCA. Second, this tendency remains similar for 3 Hz and 30 Hz. Therefore, we conclude that in the LNND, the structure of the feature space affects the distance between the classes (i.e., their separability). Estimating the separability may provide prior knowledge about the performance of the LNND, thereby enabling the selection of features that ensure the best performance.
In terms of achievable performance rates, the obtained results are consistent with modern studies, which also propose novel methods for automated seizure identification in EEG signals using different supervised ML-based approaches. Amiri et al. [9] proposed a method utilizing sparse common spatial patterns (sCSPs) and adaptive short-time Fourier transform-based synchrosqueezing transforms (adaptive FSSTs) to enhance the time–frequency representation of multi-component EEG signals and reduce noise and interference. Their method achieved outstanding performance with high sensitivity, specificity, and accuracy in detecting seizures, demonstrating its potential for epilepsy diagnosis. Malekzadeh et al. [10] utilized a computer-aided diagnosis system (CADS) for the automatic diagnosis of epileptic seizures in EEG signals, involving pre-processing, feature extraction, and classification steps. They used tunable-Q wavelet transforms (TQWTs) for EEG signal decomposition and extracted various linear and non-linear features from the TQWT sub-bands. They employed different approaches based on conventional ML and DL for classification and achieved high accuracy (up to ) in detecting seizures using their proposed DL method based on a convolutional neural network (CNN) and an RNN. Jiwani et al. [11] proposed an LSTM-CNN model for epileptic seizure detection using EEG signals, highlighting the need for automated techniques due to the lengthy process and shortage of specialists for the visual inspection of EEG reports. Their model combined an LSTM and a CNN for feature extraction and classification, and they achieved promising results in detecting seizures.
Thus, our findings provide valuable insights into the performance of outlier detection algorithms for epileptic seizure detection in EEG signals and highlight the potential of distance-based methods, particularly when combined with PCA, for achieving high performance in this task. These findings are consistent with current studies [9,10,11,12]. Further research in this area could potentially lead to the development of more effective and efficient computer-aided tools for epilepsy diagnosis, reducing the interpretation load on specialists and improving patient care.