4.1. Experimental Settings
The proposed models were experimentally validated on two datasets [
31,
32]. The datasets were recorded using the 10-10 EEG electrode positions with modified combinatorial nomenclature [
36]. The datasets used a unipolar reference [
37]: the signal was referenced to the earlobe and grounded to the left mastoid. In order to have the same number of electrodes, two electrodes (C3, C4) were removed from the Akimpech data. As a result, eight-channel (Fz, Cz, Pz, P3, P4, PO7, PO8, Oz) data were extracted from both datasets. The provided EEG value vectors X were marked with a label y: y = 1 for EEG vectors containing the target P300 peak, and y = 0 for nontarget flashings.
The raw EEG data of healthy subjects were filtered by the authors of the dataset using a Chebyshev 4th-order notch filter for the 58–62 Hz frequency range and a Chebyshev 8th-order band-pass filter for the 0.1–60 Hz range. Frequencies higher than 30 Hz, i.e., the γ-band of the EEG signal, did not need to be considered in the oddball paradigm. Thus, the EEG signal was again band-passed using the 0.1–30 Hz frequency range. The frequencies below 30 Hz, namely the α-band (8–13 Hz), β-band (13–30 Hz), δ-band (0.1–4 Hz), and θ-band (4–8 Hz) brainwave frequencies, were mainly considered for P300 component extraction. The data of ALS patients were already band-passed for the 0.1–30 Hz range by the dataset providers.
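The 0.1–30 Hz re-filtering step might be sketched as follows, assuming SciPy's Chebyshev type I design (the 0.5 dB passband ripple and the zero-phase filtering are assumptions not stated in the text, and the data are synthetic stand-ins):

```python
import numpy as np
from scipy import signal

FS = 256  # sampling rate (Hz) of both datasets

def bandpass_p300(eeg, low=0.1, high=30.0, order=8, ripple=0.5, fs=FS):
    """Band-pass an EEG array of shape (..., n_samples) to the 0.1-30 Hz
    range with a Chebyshev filter of the given design order.

    The 0.5 dB passband ripple is an assumed parameter; the text only
    specifies the filter family, order, and band edges.
    """
    sos = signal.cheby1(order, ripple, [low, high], btype="bandpass",
                        output="sos", fs=fs)
    # Zero-phase filtering (an assumption) avoids shifting the P300 latency.
    return signal.sosfiltfilt(sos, eeg, axis=-1)

# Example: filter one second of synthetic 8-channel EEG.
rng = np.random.default_rng(0)
raw = rng.standard_normal((8, FS))
filtered = bandpass_p300(raw)
```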
Some researchers prefer using a 1000 ms window of EEG vectors for classifying the P300 component; for instance, the time period from −200 ms to 800 ms can be considered, as in [
20]. The period from 0 ms to 700 ms is also a popular choice for P300 detection [
38]. To reduce the redundancy of the dataset, it was decided not to consider the whole 1000 ms time period for each flashing but only the period up to 700 ms after the stimulus. However, since the dataset considered not only healthy subjects but also ALS patients, it was decided to extend the period by 100 ms before the stimulus. This can improve the classification, as a sharper difference can be detected between the voltage 100 ms before the stimulus and 300 ms after the stimulus than between the stimulus onset (0 ms) and the P300 component. Thus, the region starting 100 ms before the flashing and ending 700 ms after the flashing was considered. As the sampling rate was 256 Hz, the −100 ms to 700 ms latency period yielded 204 data points for each flashing trial.
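The epoch extraction described above amounts to a simple slice around each stimulus onset; a minimal sketch (the function and variable names are illustrative, not the authors' code):

```python
import numpy as np

FS = 256             # Hz, sampling rate of both datasets
PRE, POST = 0.1, 0.7 # seconds before / after the stimulus onset

def extract_epoch(eeg, onset_idx, fs=FS):
    """Cut one flashing epoch (-100 ms .. +700 ms) out of a continuous
    (n_channels, n_samples) recording: 204 data points per channel."""
    start = onset_idx - int(PRE * fs)  # 25 samples before the flash
    stop = onset_idx + int(POST * fs)  # 179 samples after it
    return eeg[:, start:stop]

recording = np.zeros((8, 5 * FS))  # 5 s of dummy 8-channel EEG
epoch = extract_epoch(recording, onset_idx=2 * FS)
```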
The removal of unnecessary EEG data can reduce the computational cost of the ensemble models, which require more computational resources than classical standalone classifiers. In addition, the dataset was balanced by removing redundant nontarget EEG vectors. Initially, the dataset provided 25 letters of input for each subject, giving 300 data samples per subject, of which 250 were nontarget. In order to balance the data, only 75 of the nontarget samples were randomly selected for further training. After the balancing step, the dataset comprised 60% nontarget class and 40% target class data, giving 125 data samples for each subject. Training data were collected from 8 healthy subjects, resulting in 1000 data samples. Test data consisted of 500 data samples from healthy subjects and 625 data samples from ALS patients. Instead of complex dimensionality reduction techniques, such as principal component analysis (PCA), the EEG signal was averaged over the channels before its classification by LDA, kNN, and SVM.
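The balancing and channel-averaging steps can be sketched as follows (synthetic stand-in data for one subject; the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Dummy stand-ins for one subject: 300 epochs of shape (8 ch, 204 pts),
# 50 target (y = 1) and 250 nontarget (y = 0) flashings.
X = rng.standard_normal((300, 8, 204))
y = np.array([1] * 50 + [0] * 250)

# Keep every target epoch and a random 75 of the 250 nontarget epochs.
target_idx = np.flatnonzero(y == 1)
nontarget_idx = rng.choice(np.flatnonzero(y == 0), size=75, replace=False)
keep = np.concatenate([target_idx, nontarget_idx])
X_bal, y_bal = X[keep], y[keep]

# Average over channels instead of PCA-style dimensionality reduction,
# giving one 204-point feature vector per flashing for LDA / kNN / SVM.
X_feat = X_bal.mean(axis=1)
```

After balancing, each subject contributes 125 samples at the stated 60/40 class ratio.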
The proposed models were trained on 1000 data samples and tested on 500 data samples of healthy subjects. To evaluate the models, 3-fold validation was performed: each model was trained and validated three times, and the average metrics were calculated for the healthy subjects' training. The trained models were then tested on 625 data samples of ALS patients.
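The 3-fold evaluation could be sketched with scikit-learn, which the paper already uses for its baselines (the synthetic data, the LDA stand-in model, and the default accuracy scoring are assumptions for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 1000 channel-averaged training vectors (204 points each) with labels.
X_train = rng.standard_normal((1000, 204))
y_train = rng.integers(0, 2, size=1000)

# Train and validate three times, then average the metric, as in the text;
# the F-score could be used instead via scoring="f1".
scores = cross_val_score(LinearDiscriminantAnalysis(), X_train, y_train, cv=3)
mean_score = scores.mean()
```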
The computations were performed using Python 3.7.3. The hardware used during the simulations was NVIDIA GeForce GT 650M together with the 2.6 GHz Quad-Core Intel Core i7 processor. The simulations were carried out for experimental EEG data (as described earlier) and in various settings of number of channels, viz., 8-channel EEG, 4-channel EEG, and single-channel EEG.
In order to evaluate the performance of each classifier, the numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions were calculated. The most commonly used metric for performance evaluation is the classifier's accuracy, which is calculated as

Accuracy = (TP + TN)/(TP + TN + FP + FN).
However, when working with unbalanced datasets, accuracy does not reveal such class-wise failures. An unbalanced dataset would consist of 10 nontarget flashings and only 2 target flashings (as only one row and one column out of the 12 rows and columns contain the chosen character). If the classifier identifies only nontarget EEG signals but fails to classify the target flashings, there would be 10 correctly recognized nontarget components, yet zero true positively recognized target-class components. For this example, the accuracy would still be 83.33%, which seems quite good, even though the classifier failed to identify all of the target peaks. In order to examine whether the target class was correctly recognized and the number of FN predictions was low, the recall metric is calculated as

Recall = TP/(TP + FN).
The precision value indicates the fraction of EEG signals labeled as positive (target response) that are actually positive, and is computed as

Precision = TP/(TP + FP).
In our case, the data were not perfectly balanced: the number of nontarget samples exceeded the number of target samples, as the dataset comprised 60% nontarget class and 40% target class after balancing. That is why the recall value was still considered. In order to capture the characteristics of both the recall and precision metrics, the F-score was calculated as their harmonic mean

F-score = 2 × Precision × Recall/(Precision + Recall).
Thus, recall and F-score were mainly used for the performance evaluation.
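The four metrics can be collected in one helper; running it on the unbalanced 12-flash example above shows how a classifier that misses every target can still score a seemingly good accuracy while recall and F-score drop to zero (a plain-Python sketch):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, recall, precision, f_score

# The unbalanced example: all 10 nontargets recognized, both targets
# missed, i.e., TP = 0, TN = 10, FP = 0, FN = 2.
acc, rec, prec, f1 = metrics(tp=0, tn=10, fp=0, fn=2)
```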
4.2. Intra-Subject Experiments
The main objective of the present contribution was to develop and test a robust subject-independent classification approach for a P300 speller. The results of the proposed approach are presented later; in this subsection, however, the results of the intra-subject experiments are presented, where the models were trained and tested on data from the same subject. In other words, SST training was applied using five ALS patients and five healthy subjects, with 80% of the data from one subject used for training and 20% for testing.
The averaged metrics were obtained by summing the results from each subject and dividing by the number of subjects. The experiments were done for eight-channel data, four-channel data, and single-channel data. The obtained averaged F-score is presented for each model in
Table 1.
It can be observed from
Table 1 that there was no significant difference in performance between using eight data channels and four data channels. Single-channel data provided noticeably worse results, achieving only about an 83% average F-score for all subjects. Thus, it can be concluded that single-channel data is a poor choice for intra-subject classification. It is further seen in
Section 4.5 that single-channel usage did not provide high performance in generic training either.
The usage of CNN in LDA-SVM-kNN-CNN did not significantly decrease the performance in the eight-channel data experiment, reaching a 98.75% F-score. However, it dropped to 93.56% when using four-channel data. All of the other ensemble voting models provided quite stable results during the experiments on eight-channel and four-channel data.
When trained and tested for each subject separately, the models achieved higher performance, compared to the proposed subject-independent training results, presented in
Section 4.3–
Section 4.5. However, it should be noted that this approach is not a good option for online training and practical usage. As stated earlier, the aim of this research is to develop a subject-independent classifier that can be used by ALS patients without the need for training. So, despite the fact that the models were able to reach a 99% F-score using SST training, inter-subject results are more important for a user's comfort and are detailed in the following subsections.
4.3. Eight-Channel Data Simulations
The classification models were trained on the eight-channel data of eight healthy subjects and tested on four healthy subjects. The channels used are represented in
Figure 4a. Two baseline classifiers were also trained and tested on the same data. The first classifier was a classical gradient boosting. Gradient boosting has shown high performance for EEG classification in different applications, such as rehabilitation systems [
39] and the P300 speller [
40]. In this work, the gradient boosting classifier was modeled by using the sklearn python library [
41]. The second classifier was extreme gradient boosting or XGBoost [
42]. Due to its high performance and time efficiency over recent years, XGBoost has become a popular option for different applications, and the P300 speller is not an exception [
Both the XGBoost and gradient boosting classifiers were configured with the maximum number of trees limited to 100, where each tree can have a maximum depth of three nodes. A default learning rate of 0.1 was used for the experiments [
41]. Neither pruning nor parallel threads were used for the baseline classifiers. XGBoost used the tree booster, which is preferable to the linear booster, as the linear booster may fail to fit complex time-series EEG data.
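Under the stated settings, the gradient boosting baseline could be configured as follows (a sketch with scikit-learn, which the text names; the synthetic data are placeholders, and the XGBoost analogue is only noted in a comment rather than imported):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 204))  # dummy channel-averaged epochs
y = rng.integers(0, 2, size=200)

# Baseline settings from the text: at most 100 trees, each of maximum
# depth 3, default learning rate 0.1, no pruning, single-threaded.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                learning_rate=0.1).fit(X, y)
pred = gb.predict(X)

# The XGBoost baseline would use the analogous xgboost.XGBClassifier with
# booster="gbtree" (the tree booster rather than the linear one).
```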
While training on eight-channel data, the weights of the W-LDA-SVM-kNN model were found using RS. RS performed nested 5-fold cross-validation on the data of eight healthy subjects to find the optimal weights, with 800 data samples used for training and 200 for testing. Searching for the weights took 41.58 s for the data from eight subjects. The obtained weights were as follows:
LDA weight:
SVM weight:
kNN weight:
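A hedged sketch of the weighted-voting idea: soft voting over LDA, SVM, and kNN with per-model weights tuned by random search over cross-validation scores. This is an illustrative reconstruction, not the authors' exact RS procedure (the nested outer loop is omitted for brevity, and the data, iteration count, and names are assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 204))  # dummy channel-averaged epochs
y = rng.integers(0, 2, size=300)

def make_voter(w):
    """Soft-voting LDA / SVM / kNN ensemble with per-model weights w."""
    return VotingClassifier(
        estimators=[("lda", LinearDiscriminantAnalysis()),
                    ("svm", SVC(probability=True)),
                    ("knn", KNeighborsClassifier())],
        voting="soft", weights=list(w))

# Random search over the three voter weights, scored by 5-fold CV
# (the nested outer loop from the text is omitted here).
best_w, best_score = None, -np.inf
for _ in range(5):
    w = rng.uniform(0, 1, size=3)
    score = cross_val_score(make_voter(w), X, y, cv=5).mean()
    if score > best_score:
        best_w, best_score = w, score
```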
The obtained weights can be used for further experiments without renewal. The average time elapsed for testing was 3.91 s, as seen from the results presented in
Table 2. The last column of
Table 2 represents the computational time spent for various models while testing the same amount of data. The proposed classifiers provided good results, except for the model that used CNN. The LDA-SVM-
kNN-CNN ensemble voting model turned out to be computationally ineffective due to the complex structure of the neural network. Moreover, the model suffered from overfitting, as the value of the F-score was more than 7% lower than the accuracy value.
The fastest model proposed was the LDA-kNN fusion, which took only 0.72 s to train on eight subjects. This can be explained by the fact that LDA is an efficient choice for EEG classification with low computational complexity, and kNN is an instance-based algorithm that only computes distances to neighboring samples. For the same experiment, standalone LDA required 0.61 s for training, while kNN took only 0.16 s to train on the same amount of data. The weighted ensemble model did not show any performance improvement over the simple averaged LDA-SVM-kNN model. However, both models provided the best F-score, achieving more than 99.12%.
Obviously, the proposed ensemble classifiers require more time to process the data than the classical standalone models. However, it is seen from
Table 2 that the difference in elapsed time is not substantial. Thus, ensemble learning does not require many more computational resources when trained on eight subjects. Moreover, the proposed classifiers provided better results than gradient boosting in terms of computational complexity. This is explained by the fact that gradient boosting builds decision trees sequentially, one after another, to achieve the necessary performance. XGBoost works much faster than classical gradient boosting; however, it was still slightly outperformed by the proposed ensemble voting classifiers, except for the LDA-SVM-
kNN-CNN.
Table 3 represents the simulation results obtained from testing on five ALS patients’ data. The overall performance of the classifiers decreased compared to the results of testing on the healthy subjects’ data. Still, the proposed methods did work with ALS patients. This means that the classifiers are subject-independent even in terms of comparing healthy subjects with patients with a brain disorder. The baseline classifiers performed slightly better, reaching more than 85% F-score. The weighted voter classifier W-LDA-SVM-
kNN outperformed gradient boosting and achieved the best performance metrics among the proposed classifiers in this case. It can thus be assumed that the SVM classifier, which had the largest weight in the weighted voter, performed better on ALS eight-channel data than LDA and
kNN.
The simple ensemble averaging models LDA-SVM-kNN and LDA-kNN achieved about 84% accuracy, which is also a meaningful result, despite the fact that these models were slightly outperformed by the boosting algorithms. Again, LDA-SVM-kNN-CNN showed the worst result among the proposed models, meaning that the convolution of the eight-channel data was not a good choice. The proposed CNN architecture failed to extract the most essential features out of the EEG input data. Thus, it can be summarized that the CNN model is a poor choice for EEG time-series data classification in a subject-independent P300 speller.
4.4. Four-Channel Data Simulations
Multichannel EEG classification allows covering different regions of the human brain; however, it makes the classification much more complex. A comparison of 14-channel and 4-channel data classification has shown that increasing the number of EEG electrodes does not increase the accuracy of the P300 speller [
44]. For instance, decreasing the number of channels from 64 to 20 using channel selection and nontarget data reduction provided better results in terms of computational complexity and did not affect the accuracy negatively [
45]. The EEG channels can be efficiently selected using different methods, such as a channel-aware dictionary with sparse representation for the P300 speller, as in [
45]. Group sparse Bayesian linear discriminant analysis (BLDA) can also be applied for channel selection. As reported in [
46], by applying group sparse BLDA to the data collected from 16 different subjects, it was found that the optimal channels selected by the algorithm were located close to visual ERP areas. Despite the fact that optimal EEG channel selection is subject-dependent, the abovementioned results lead to the common conclusion that the selected channels should be located in the parietal and occipital zones of the human brain. The presented results show that the CPz, P4, P3, Pz, O1, Oz, and O2 channels were the most efficient electrodes for the majority of the subjects using most of the channel selection methods. The authors recommend the combination of the Pz, Oz, O1, and O2 channels as the most efficient [
46].
In order to check whether the number of EEG channels can be decreased without affecting the accuracy negatively, experiments have been conducted using four EEG channels. The four channels used for the experiment were chosen from the visual ERP area of a human brain, according to the results that were presented by other researchers. The channels selected were P3, P4, Pz, and Oz. The placement of the electrodes is represented in
Figure 4b. The combination of PO7, PO8, Pz, and Oz was also tried during the experiments; however, its accuracy was on average about 3.5% lower than the combination of P3, P4, Pz, and Oz.
Table 4 presents the obtained results for four-channel data features classification. Testing on ALS patients using four channels generally improved the classification performance among the proposed ensemble models. Comparing these results with the eight-channel experiments (see
Table 3), the F-score increased on average by more than 5% for the proposed ensemble models, which did not use CNN. In contrast, the performance of the LDA-SVM-
kNN-CNN model decreased by more than 10%. This is explained by the structure of the CNN classifier, which is very dependent on the input shape. That is another reason why the CNN is inefficient for EEG time-series classification in the P300 speller. Every time the number of channels is changed, the architecture of the CNN classifier should be changed too, which is a very complex procedure.
It is observed from
Table 4 that the weighted ensemble voting classifier achieved a 90.74% F-score, which was higher than the 88.59% of gradient boosting. The proposed simple averaging voters achieved 89.17% using the LDA-
kNN architecture and 88.88% using LDA-SVM-
kNN fusion, which was also better than the gradient boosting algorithm. XGBoost appeared to be the most accurate model in this case; however, it suffers from a long processing time (see
Table 2). The proposed models achieved somewhat similar performance at a reduced computational cost.
4.5. Single-Channel Data Simulations
Multichannel EEG processing is a time-consuming and complex process. Some researchers prefer using single-channel EEG data for the P300 speller [
47]. Single-channel classification should be performed using one of the midline electrodes (such as Fz, Cz, Pz, or Oz), as it is inappropriate to consider only one hemisphere of the human brain for data acquisition in a BCI speller. The Fz and Cz channels are not located over the visual cortex; thus, for single-channel experiments, either the Pz or Oz electrode should be chosen. To examine whether single-channel data usage was efficient in our case, the Pz electrode was chosen for further simulations.
The parietal region showed the maximum activity during the oddball paradigm, thus, the Pz electrode presented in
Figure 4c was chosen among other active options. During the simulations, the LDA-SVM-
kNN-CNN voter was excluded, as there is no point in applying a CNN to a single-channel EEG vector.
The results obtained during testing for four healthy subjects and five ALS patients are shown in
Table 5. Single-channel classification was not as efficient as multichannel usage. The average accuracy of the proposed voters was 91.28% for healthy subjects' data, but only 78.19% for the EEG classification of ALS patients. The weighted voter was slightly more accurate for ALS data, while there was no significant change in the results for healthy subjects. The weighted fusion of three classifiers again slightly outperformed the LDA-
kNN voter in terms of performance metrics for ALS data, reaching 78.74% accuracy, while LDA-
kNN reached only 77.86%. Still, the proposed ensemble models provided better results than the standalone classifiers.
4.6. Discussion
When classifying the data obtained from healthy subjects, the LDA-kNN voter achieved better results, outperforming the SVM-based voters by about 0.33%. This difference may not seem significant; however, considering the low computational complexity of the LDA-kNN fusion, this classifier is preferable when training on larger datasets. When using smaller datasets, it is preferable to add SVM to the ensemble model, as it provides more accurate results. The weighted ensemble voting model with the SVM classifier provided the best performance on ALS patients' data, achieving more than 90% accuracy when using four-channel classification. There was thus a tradeoff between accuracy and computational complexity: for large datasets, the LDA-kNN voter is the better option, while W-LDA-SVM-kNN provides more accurate results but requires much more time for training, making it better suited to smaller datasets.
Apart from the tradeoff between the computational complexity and the accuracy, there was one general weakness, related to memory requirements. Due to the fact that kNN is an instance-based algorithm, it must store the training data. This may cause problems when using more training data for online spellers. Nevertheless, 1000 data samples, which were vectors containing 204 data points, were enough for an efficient result; thus, data storage should not cause significant limitations.
By comparing different numbers of channels, it turned out that using only four channels of the parietal zone was more efficient than using a wider range of brain activity with eight channels. The summary of the results for ALS patients’ data with different number of EEG channels is presented in
Table 6. The accuracy improved by about 5% (on average) when using the four-channel EEG data. Single-channel EEG classification provided less than 80% accuracy, which was another limitation found during the experiments. However, if the single-channel EEG time series are converted to the frequency domain, the accuracy may increase, as in [
47]. Thus, to decrease the number of electrodes used in the future, it is planned to use frequency-domain spectrograms instead of EEG time series.
The proposed methodology allows training a universal P300 speller that does not need to be retrained for each subject. Although the classification was performed offline, it is assumed that the same tendency will hold for an online P300 speller as well. Therefore, ALS patients will not face the necessity of sitting for an hour in front of the flashing speller GUI to collect a training set. The proposed feature classification methodology makes the P300 speller ready for use from the very first trial.