1. Introduction
Epilepsy is a chronic non-communicable brain disease that can occur in people of any age. It is one of the most common neurological diseases, affecting about 50 million patients worldwide, with about 5 million new diagnoses each year [1]. Electroencephalography (EEG) is a commonly used auxiliary method in the diagnosis and treatment of brain diseases [2], but traditional EEG practice has certain limitations. On the one hand, traditional EEG diagnosis relies on visual assessment, which requires experienced specialists for correct judgments and is therefore subject to their professional experience. Furthermore, as EEG equipment is used more frequently for outpatients and inpatients, recording, analyzing, and diagnosing the EEG data usually takes several hours or days. On the other hand, a patient with epilepsy can appear completely normal during an outpatient EEG examination, because the brain of an epilepsy patient does not trigger seizures continuously. Recording EEG over longer periods can capture abnormal EEG signals, but this is expensive and time-consuming for both patients and doctors.
EEG signals are nonlinear and nonstationary [3]. Figure 1 displays a group of normal, inter-ictal, and seizure EEG signals collected by Bonn University that are difficult to interpret visually. Moreover, the inter-ictal signals, which are important for the diagnosis and treatment of patients, cannot be easily distinguished from the normal signals.
With the development of computer technologies, a variety of algorithms have been designed that achieve excellent results in the automatic classification of EEG signals [4]. EEG signal classification consists of two parts: feature extraction, which is the most important part, and classification. Time–frequency-domain methods, such as the AR model [5], the fast Fourier transform [6], and the Hilbert–Huang transform [7], have been applied to extract features of EEG signals with little information loss. However, these methods are generally based on linear models, so they can ignore the nonlinear characteristics that are essential in EEG signal processing.
In recent years, machine learning and deep learning models have been applied to capture the features of EEG signals. Zhou et al. [8] used a CNN to extract features from EEG signals and applied the model to a seizure detection task. Although a CNN can effectively focus on local features, it cannot directly capture long-term relationships in EEG signals. To address this problem, Mishra et al. [9] combined CNN and RNN models to solve the sleep stage classification problem based on EEG signals. An RNN contains connections between nodes that form a directed graph along a sequence, which allows it to naturally model the temporal dynamics of time series. However, this structure also causes the vanishing gradient problem [10], which makes the training of an RNN difficult and time-consuming. Sun et al. [11] proposed an ESN feature extractor, an unsupervised self-encoding model based on an echo state network (ESN). The model can extract EEG signal features and achieves good performance in the classification of epileptic EEG signals. The ESN is a special network model that provides a new recurrent neural network structure and a new criterion for supervised training [12]. Compared to an RNN, most parameters of an ESN are randomly generated, and only the readout layer needs training, so the model is very fast and does not suffer from vanishing gradients. ESNs achieve excellent performance on a variety of chaotic time series prediction tasks and complex industrial time series problems. However, the performance of an ESN largely depends on the choice of hyperparameters, and finding a good configuration requires massive experimental cost [13].
Evolutionary algorithms, which are inspired by biological evolution, are effective in optimizing the hyperparameters of a model. Wang et al. [14] used the genetic algorithm (GA) to optimize the hyperparameters of an ESN and applied it to ECG signal prediction tasks, achieving better results than the original ESN model. Moth–flame optimization (MFO) [15] is a swarm intelligence algorithm that simulates the special navigation method of moths flying around flames and provides a new heuristic search paradigm in the optimization field, called spiral search. This paradigm enables the MFO algorithm to search near the candidate optimal solutions. Mei et al. [16] applied MFO to ORPD problems to obtain the best combination of control variables. The MFO algorithm performs well in various fields and has the advantages of simplicity and rapid searching over other optimization algorithms [17]. In this paper, we propose a new model, named MFO-ESN, that applies the MFO algorithm to automatically search for better hyperparameters of an ESN feature extractor.
An essential problem in combining MFO and ESN is defining the fitness function, which evaluates the performance of the feature extractors and determines the optimization direction. For classification tasks, the most intuitive choice is the accuracy of a classifier. However, this relies heavily on the choice of classifier, which is not objective and may lead to overfitting. The basic idea behind good feature extraction for classification is to map the raw data onto a feature space in which feature vectors from the same class are closer together than those from different classes. We therefore propose a new feature distribution evaluation function, named FDEF, to serve as the fitness of MFO-ESN. FDEF is based on the idea of triplet loss [18] and can evaluate the performance of a feature extractor without using a specific classifier.
Triplet loss is a metric defined over triplets. For classification tasks, a triplet consists of an observed sample, a corresponding positive sample whose class is the same as the observed sample, and a corresponding negative sample whose class is different from the observed sample. Triplet loss judges how easily a sample can be classified in a detailed and objective way: it not only focuses on the degree of dispersion but also considers the relations within each triplet.
The main contributions of this work can be summarized as follows:
- (1)
A novel feature extraction called moth–flame optimized echo state network (MFO-ESN) is developed, which uses MFO to optimize the hyperparameters of ESN for fitting the specific tasks.
- (2)
A new function based on the triplet is introduced to evaluate the distribution of features extracted by MFO-ESN without relying on specific classifiers.
- (3)
MFO-ESN is verified on a real-world single-channel EEG signal classification task, achieving an accuracy of 98.16%. The results also show that MFO-ESN with FDEF can improve the performance of many classifiers.
- (4)
We also conduct experiments on a multi-channel EEG signal classification task, achieving the highest sensitivity on both the patient-specific and the cross-patient task. The cross-patient task simulates the real diagnosis situation, and the high sensitivity achieved there demonstrates the strong generalization ability of MFO-ESN.
The remainder of this paper is organized as follows. In Section 2, a review of related work is given, covering the ESN and MFO algorithms. In Section 3, the new feature evaluation function (FDEF) fitting MFO-ESN is presented, together with a detailed description of the feature extractor combining ESN and MFO, named MFO-ESN. The experimental process and results for the single-channel and multi-channel epilepsy EEG signal classification tasks are described in Section 4 and Section 5, respectively. In Section 6, we conclude the paper and outline future research directions.
3. Methodology
The preceding introduction shows that the reservoir layer, constructed by random initialization, is the most important part of an ESN, and many researchers have proposed methods to initialize it. For example, Strauss et al. [31] presented design strategies for ESNs, such as ensuring the echo state property (ESP) and reducing the influence of noise during training. Bianchi et al. [32] studied the dynamic characteristics of ESNs through recurrence analysis, which contributes to further understanding and constructing an optimal reservoir layer. However, it is difficult to tune multiple reservoir parameters simultaneously, such as M, SR, CD, and IS. Therefore, tuning the hyperparameters is usually time-consuming and difficult, particularly for complex tasks.
To address these shortcomings, moth–flame optimization is used to select the hyperparameters of the ESN; the resulting model is named MFO-ESN. Meanwhile, a novel evaluation function (FDEF), which evaluates the performance of the feature extractor in a detailed and more objective way, is proposed as the fitness function of MFO-ESN.
3.1. Feature Distribution Evaluation Function (FDEF)
Features are the most important factor in machine learning projects: learning is easy if many independent features can be acquired that each correlate well with the class [33]. The quality of the features extracted from the original EEG signal has a great impact on the subsequent classification. An excellent feature extractor should produce features whose distributions differ clearly between classes of EEG signals; for features of the same class, the similarities should be retained, while differences we are not concerned with should be ignored. Therefore, we propose a novel feature distribution evaluation function (FDEF), calculated as Equation (10).
where F is the feature set containing all features extracted from the raw data, and x is an observed feature point. For every observed feature point x, we construct a triplet (x, c+, c−): the positive point c+ is the center of the feature points whose class is the same as that of x, and the negative point c− is the center of the feature points whose class differs from that of x. The positive distance d+ and the negative distance d− denote the distances from x to c+ and c−, respectively, measured by the L2-norm. The positive distance d+ should be as small as possible, while the negative distance d− should be larger than d+. FDEF evaluates the ratio between the positive and negative distances, unlike direct metrics based on subtraction, such as Equation (11):
where m is a margin enforced between the positive pair and the negative pair. Minimizing Equation (11) makes the distance of the positive pair smaller than that of the negative pair by more than m.
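As an illustration of this subtraction form, a minimal sketch is given below; the notation (d_pos, d_neg, class centers) is ours and not necessarily the paper's exact formulation of Equation (11).

```python
import numpy as np

def margin_triplet_loss(x, pos_center, neg_center, m=1.0):
    """Subtraction-style triplet metric (illustrative): the positive
    distance should be smaller than the negative distance by at least
    the margin m; otherwise a positive loss is incurred."""
    d_pos = np.linalg.norm(x - pos_center)  # distance to same-class center
    d_neg = np.linalg.norm(x - neg_center)  # distance to other-class center
    return max(d_pos - d_neg + m, 0.0)
```

For example, a point much closer to its own class center than to the other class's center yields zero loss, while a point violating the margin yields a positive penalty.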
Equation (11) describes the difference between the positive and negative distances directly. It was applied as a loss function in FaceNet [18], which achieved state-of-the-art performance in face recognition tasks. However, compared to Equation (11), Equation (10) has two advantages:
- (1)
FDEF is robust to the mean value of the features. The mean values of features extracted by different ESN feature extractors differ, and under Equation (11), features with a larger mean value always obtain a better result, regardless of the distribution of the features.
- (2)
FDEF is not sensitive to the dimension of the features. The number of neurons in the reservoir layer (P) is a key parameter that needs to be adjusted, and the feature dimension equals P, so the feature dimension changes during the training of MFO-ESN. Equation (11) is sensitive to this varying dimension, which would force the model to reduce P.
Therefore, compared to Equation (11), the ratio-based FDEF pays more attention to the aggregation and dispersion of the samples in the feature space, rather than to specific distance values or differences in feature dimension.
For Equation (10), the positive distance should be smaller than the negative distance, which guarantees that the observed feature point is closer to the center of its own class, that is, more centralized. Therefore, penalty terms are added for those points that do not meet this constraint. The new form of FDEF is shown in the following:
where the punishment coefficient weights the penalty term and ReLU(·) denotes the rectified linear unit. The punishment coefficient plays a role similar to the parameter m in Equation (11) and needs to be set carefully to ensure the convergence of the model.
Regarding multiple classes, we follow the strictest approach: the smallest negative distance is chosen as the negative distance in the FDEF formula. This choice forces the spacing between classes to be more pronounced. Of course, since all calculations are based on the L2-norm, convergence should be considered when the number of classes or points becomes large.
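Putting the pieces together, the ratio-based score with a ReLU penalty and the smallest-negative-distance rule for multiple classes can be sketched as follows. This is our own illustrative implementation under stated assumptions (averaging over points, a penalty coefficient `lam`); the paper's exact Equation (12) may differ in detail.

```python
import numpy as np

def fdef_score(features, labels, lam=1.0):
    """Illustrative FDEF-style score: for each feature point, divide its
    distance to the same-class center (d_pos) by the smallest distance to
    any other-class center (d_neg), and add a ReLU penalty whenever d_pos
    exceeds d_neg. Lower values indicate better-separated features."""
    classes = np.unique(labels)
    centers = {c: features[labels == c].mean(axis=0) for c in classes}
    total = 0.0
    for x, y in zip(features, labels):
        d_pos = np.linalg.norm(x - centers[y])
        # Strictest rule for multiple classes: take the smallest negative distance.
        d_neg = min(np.linalg.norm(x - centers[c]) for c in classes if c != y)
        total += d_pos / d_neg + lam * max(d_pos - d_neg, 0.0)
    return total / len(features)
```

On two well-separated clusters this score is close to zero, while shuffled labels (overlapping classes) push it toward one, matching the intuition that FDEF rewards compact, well-spaced classes.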
3.2. Moth–Flame Optimized ESN
As mentioned above, most parameters of an ESN are randomly initialized and then fixed, so the initialization of the ESN largely depends on the selection of hyperparameters, whose selection and adjustment are crucial to the model. Unfortunately, because the relations amongst these hyperparameters are unclear, choosing appropriate values for a specific task is difficult and usually insufficient. In response, the MFO-ESN model is proposed; its structure is shown in Figure 3. In MFO-ESN, MFO is used to optimize the hyperparameters of the ESN, so that the ESN feature extractor can better extract the features of the input EEG signals for EEG classification tasks.
The number of neurons in the reservoir layer (P), the spectral radius (SR), the connection density (CD), and the input scaling coefficient (IS) are selected as the hyperparameters to be adjusted. Therefore, the dimension of each moth and flame is set to 4, one per hyperparameter.
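For concreteness, the way these four hyperparameters shape a reservoir can be sketched as below. This is a simplified, illustrative initialization (uniform weights, naive dense eigenvalue scaling); the paper's exact ESN construction is not specified here.

```python
import numpy as np

def init_reservoir(P, SR, CD, IS, n_inputs=1, seed=0):
    """Randomly initialize an ESN reservoir from the four hyperparameters:
    P reservoir neurons, spectral radius SR, connection density CD, and
    input scaling IS. Returns recurrent weights W and input weights W_in."""
    rng = np.random.default_rng(seed)
    # Sparse random recurrent weights: keep roughly a fraction CD of connections.
    W = rng.uniform(-0.5, 0.5, (P, P))
    W *= rng.random((P, P)) < CD
    # Rescale so the largest absolute eigenvalue equals SR,
    # a common way to aim for the echo state property.
    radius = max(abs(np.linalg.eigvals(W)))
    if radius > 0:
        W *= SR / radius
    # Input weights scaled by IS.
    W_in = rng.uniform(-IS, IS, (P, n_inputs))
    return W, W_in
```

A moth's 4-dimensional position would simply be unpacked into `(P, SR, CD, IS)` before calling such an initializer.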
The features of the EEG signals are extracted using an ESN feature extractor [9] whose architecture is the same as that of an ESN. The ESN feature extractor is an unsupervised model that applies the idea of the autoencoder to extract features from EEG signals; it utilizes the readout matrix as both the hidden layer and the extracted feature. MFO-ESN uses the label information to optimize the effectiveness of the ESN by selecting appropriate hyperparameters. Unlike the usual optimization approaches, MFO-ESN does not use the classification accuracy, which may be influenced by the classifier, as the fitness function; rather, it uses FDEF to evaluate the distribution of the features extracted by the ESN.
To evaluate the fitness of a moth, the corresponding hyperparameters are used to initialize an ESN, which extracts features from the raw EEG signals; the fitness is then calculated according to Equation (12). As the positions of the moths are updated according to Equation (7), their fitness changes. Flames denote the historical optimal solutions reached by the moths, and each moth spirals around its closest flame.
Considering the calculation cost and the suggestions for initializing the ESN model in [31], P is set within the range [5, 100], SR within [0.1, 1], CD within [0.1, 1], and IS within [0.1, 5].
MFO requires multiple moths to search around multiple flames. Since the search path of a moth is a spiral, convergence is slow in the later iterations while the local search capability decreases; therefore, the number of moths cannot be too small. We set the population of moths (and flames) to 20 and the maximum number of iterations to 100. The running process of the MFO-ESN algorithm is shown in Algorithm 1.
Algorithm 1 MFO-ESN
Input: the population number (N), the maximum number of iterations (T)
Output: the best hyperparameters of the ESN
Steps:
(a) Set the population number and the maximum number of iterations.
(b) Initialize the moth population.
(c) Initialize the ESN using the hyperparameters represented by each moth.
(d) Extract features using the initialized ESN.
(e) Calculate the fitness values of the moths according to Equation (12) and sort them.
(f) Update the moth positions based on Equation (7).
(g) Update the flame positions to determine the current optimal solution.
(h) Determine the numbers of moths and flames based on Equation (9).
(i) Repeat steps (c) to (h) until the stopping criterion is met.
(j) The process ends.
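The steps above can be sketched as a search loop. This is a simplified illustration with a placeholder fitness: a real MFO-ESN would plug in the FDEF fitness of an ESN initialized from each candidate, and the exact spiral update of Equation (7) and flame-count schedule of Equation (9) may differ from the simplified forms used here.

```python
import numpy as np

# Search ranges for the four ESN hyperparameters: P, SR, CD, IS.
LOW = np.array([5.0, 0.1, 0.1, 0.1])
HIGH = np.array([100.0, 1.0, 1.0, 5.0])

def mfo_search(fitness, n_moths=20, max_iter=100, seed=0):
    """Minimal MFO sketch: moths spiral around sorted flames (the best
    positions found so far) while the flame count shrinks over iterations."""
    rng = np.random.default_rng(seed)
    moths = rng.uniform(LOW, HIGH, (n_moths, 4))
    flames = moths.copy()
    flame_fit = np.array([fitness(m) for m in flames])
    for t in range(max_iter):
        # Shrinking number of flames (an Eq. (9)-style schedule).
        n_flames = round(n_moths - t * (n_moths - 1) / max_iter)
        order = np.argsort(flame_fit)
        flames, flame_fit = flames[order], flame_fit[order]
        for i in range(n_moths):
            j = min(i, n_flames - 1)
            f = flames[j]
            # Logarithmic-spiral move around the assigned flame (Eq. (7)-style).
            dist = np.abs(f - moths[i])
            tt = rng.uniform(-1, 1, 4)
            moths[i] = np.clip(dist * np.exp(tt) * np.cos(2 * np.pi * tt) + f,
                               LOW, HIGH)
            fit_i = fitness(moths[i])
            if fit_i < flame_fit[j]:  # keep the better position as the flame
                flames[j] = moths[i].copy()
                flame_fit[j] = fit_i
    best = np.argmin(flame_fit)
    return flames[best], flame_fit[best]

# Placeholder fitness: in MFO-ESN this would initialize an ESN with the
# candidate hyperparameters, extract features, and return the FDEF value.
best_hp, best_val = mfo_search(lambda h: float(np.sum((h - LOW) ** 2)),
                               max_iter=30)
```

The returned vector is then unpacked into (P, SR, CD, IS) to build the final feature extractor.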
5. Experiments on the CHB-MIT EEG Data Set
An ESN can effectively process multi-channel time series data, while all the EEG signals in the Bonn University data set are single-channel. To demonstrate that MFO-ESN is also suitable for multi-channel EEG tasks and to further evaluate the performance of MFO-ESN with SVM, we conducted epilepsy classification experiments on the CHB-MIT EEG dataset.
5.1. A Brief Description of the Data Set
This dataset, collected by Boston Children's Hospital (CHB) and the Massachusetts Institute of Technology (MIT), consists of 916 h of continuous scalp EEG recordings grouped into 24 cases. It is available for download online [43] and is described in detail in [44]. The EEG signals of each case were recorded with the international 10–20 electrode system at a sampling rate of 256 Hz and are saved in several EDF files, most of which contain 23 channels (24 or 26 channels in a few cases). Most files contain exactly one hour of EEG signals, while some files of case 10 are two hours long, and those of case 4, case 6, case 7, case 9, and case 23 are four hours long.
A total of 198 seizures within the 916 h were manually annotated by medical specialists (with start and end times). A seizure lasts 45 s on average, which is very little compared to the amount of normal EEG signal. To increase the number of seizure samples and balance the data set, the data used in the experiments are converted into a series of segments using a sliding window of 5 s, and every segment contains 23 channels, as in case 1. Segments of the normal class contain no expert-annotated seizure EEG signals; likewise, segments of the seizure class contain no normal EEG signals.
Figure 6 shows a period of the EEG signals of case 5; according to the doctor's annotations, the seizure begins at 2348 s and ends at 2465 s. The five-second window slides over the EEG signals without overlapping, so a 117-second-long seizure yields 23 segments of 5 s each.
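The segmentation described above can be sketched as follows. This is illustrative code under our own simplified assumptions: annotations are given as (start, end) intervals in seconds, windows fully inside a seizure interval are labeled seizure, windows with no seizure overlap are labeled normal, and boundary-straddling windows are discarded.

```python
import numpy as np

FS = 256       # CHB-MIT sampling rate (Hz)
WIN_SEC = 5    # window length in seconds

def segment_recording(eeg, seizure_intervals):
    """Slide a non-overlapping 5 s window over a (channels, samples) EEG
    array and label each segment: 1 if it lies entirely inside an
    annotated seizure interval, 0 if it overlaps no seizure at all;
    windows straddling a seizure boundary are dropped."""
    win = FS * WIN_SEC
    segments, labels = [], []
    for start in range(0, eeg.shape[1] - win + 1, win):
        t0, t1 = start / FS, (start + win) / FS
        in_seizure = any(s <= t0 and t1 <= e for s, e in seizure_intervals)
        overlaps = any(t0 < e and s < t1 for s, e in seizure_intervals)
        if in_seizure:
            segments.append(eeg[:, start:start + win]); labels.append(1)
        elif not overlaps:
            segments.append(eeg[:, start:start + win]); labels.append(0)
    return np.stack(segments), np.array(labels)
```

With windows aligned to the recording start, the annotated seizure of case 5 (2348 s to 2465 s) indeed yields 23 fully contained 5 s seizure segments, consistent with the count in the text.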
This experiment consists of two parts: patient-specific and cross-patient. Patient-specific means that the training set and the testing set are both from the same case: we randomly chose 20% of the segments of the seizure class, together with the same number of segments from the normal class, for training, and used the rest of the segments for testing. Cross-patient means that data from different cases are used for training and testing. In detail, the data of case 1, whose EEG signals are collected from an 11-year-old female, and case 2, from an 11-year-old male, are not involved in training and are only used for testing; the data of the other 22 cases are divided into the training set and the testing set. This setting, in which case 1 and case 2 do not participate in training, better simulates real epilepsy diagnosis, where data from new patients are not available for training.
5.2. Results and Discussion of the Multi-Channel Classification
To better evaluate the performance of the model and compare it with other works, three statistical indicators are used: accuracy = (TP + TN)/(TP + TN + FP + FN), sensitivity = TP/(TP + FN), and specificity = TN/(TN + FP), where TP, TN, FP, and FN are defined in Table 3.
The number of seizure segments is far lower than that of normal segments, which heavily affects the accuracy. Therefore, of greater concern to us are the sensitivity, which evaluates how many seizure segments are classified correctly, and the specificity, which represents the ability of the model to correctly classify normal segments.
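Concretely, the three indicators follow directly from the confusion-matrix counts (these are the standard definitions):

```python
def evaluate(tp, tn, fp, fn):
    """Standard seizure-detection metrics from confusion-matrix counts:
    tp/tn = correctly classified seizure/normal segments,
    fp/fn = normal segments flagged as seizure / missed seizures."""
    sensitivity = tp / (tp + fn)              # fraction of seizures caught
    specificity = tn / (tn + fp)              # fraction of normals kept
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```

Because normal segments dominate the test set, accuracy is pulled toward specificity, which is why sensitivity is reported separately.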
Table 4 shows the results of the patient-specific and cross-patient experiments. The patient-specific experiment can be considered a private customized system, where the training set and testing set come from the same case. The overall performance reaches a specificity of 92.56%, a sensitivity of 96.79%, and an accuracy of 92.75%. Among all cases, case 7 obtains the highest accuracy of 99.97%, with a specificity of 100% and a sensitivity of 99.96%. The accuracy is much closer to the specificity because the number of normal segments in the testing set is far greater than that of seizure segments. For epilepsy diagnosis, sensitivity is more important than either accuracy or specificity, as high sensitivity ensures that any suspicious patient will be further examined by a doctor in time.
Compared to the patient-specific experiment, the cross-patient experiment achieves an average specificity of 91.02%, a sensitivity of 96.14%, and an accuracy of 91.92%. Despite the increase in training and testing segments, the performance is not significantly different from that of the patient-specific experiment. More importantly, the performance on case 1 and case 2, which were not seen during training, remains good. These results show that our model performs well not only on data from existing patients but also on unseen patients.
As shown in Table 5, we compare our results with studies that use the CHB-MIT data set. Most of these studies use complex deep models such as CNNs, RNNs, and their variants. Li et al. [45] obtained the highest accuracy of 95.96% and the highest specificity of 96.05% in the patient-specific experiment, and Chen et al. [46] obtained the highest accuracy of 92.30% and the highest specificity of 92.89% in the cross-patient experiment. This is probably because the ESN is a simple, fast-training model whose ability to handle complex tasks is inferior to that of complex deep learning models. However, we achieved the highest sensitivity, which is more significant for diagnosis: 96.79% in the patient-specific and 96.14% in the cross-patient experiment. A possible reason is that minimizing FDEF directly maximizes the difference between classes, so our model can detect seizure segments more effectively.
6. Conclusions
In this paper, the proposed MFO-ESN model uses MFO to optimize the hyperparameters of an ESN. Furthermore, a new feature distribution evaluation function, FDEF, is proposed as the fitness function of MFO-ESN by using the label information of the EEG signals. Without using specific classifiers, MFO-ESN can extract features better suited to classification tasks. Combined with the SVM model, the effectiveness of the extracted features is verified on the Bonn EEG multi-classification data set, obtaining an average accuracy of 98.16%. It is also found that the features extracted by the MFO-ESN model improve the performance of multiple classifiers, which means that MFO-ESN is an effective preprocessing method for EEG signal classification tasks. Furthermore, we apply MFO-ESN with SVM to the multi-channel EEG signal classification task on CHB-MIT and obtain the highest sensitivity of 96.79% in the patient-specific and 96.14% in the cross-patient experiments, with good specificity and accuracy.
Apart from its higher performance, MFO-ESN with FDEF has two additional advantages over previous EEG feature extraction methods: (1) it automatically optimizes the hyperparameters of the ESN, adjusting the architecture and initial parameters for the specific classification task; (2) using FDEF as the fitness of MFO decouples the optimization of the feature extractor from the classifier, so the optimization direction no longer relies on the performance of a classifier but on the relative separability amongst classes. These advantages mean that the proposed MFO-ESN model achieves efficient feature extraction and demonstrates better generalization ability.
In the future, this method can be studied further in the following directions. From the perspective of practical applications, the efficiency of MFO-ESN can be further evaluated on other EEG signal classification tasks. Given that the hyperparameters of the ESN are searched within a fixed parameter space, improving the search efficiency and stopping the search at the proper time are also key problems. In the theoretical direction, the current ESN method can be further improved. For example, the relationships amongst channels can be described by the positions of the EEG electrodes, since epileptic seizures in different brain regions have different manifestations and treatments. By designing an ESN structure that combines the spatial information represented by the channels with the temporal information in the EEG time series, better feature extraction is promising.