Article

Classification of Sleep Stage with Biosignal Images Using Convolutional Neural Networks

Department of Convergence Electronic Engineering, Gyeongsang National University, Jinju-si 52725, Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(6), 3028; https://doi.org/10.3390/app12063028
Submission received: 8 January 2022 / Revised: 7 March 2022 / Accepted: 14 March 2022 / Published: 16 March 2022

Abstract

Clinicians and researchers divide sleep periods into different sleep stages to analyze the quality of sleep. Despite advances in machine learning, sleep-stage classification is still performed manually. The classification process is tedious and time-consuming, but its automation has not yet been achieved. Another problem is low accuracy due to inconsistencies between somnologists. In this paper, we propose a method to classify sleep stages using a convolutional neural network. The network is trained with EEG and EOG images of time and frequency domains. The images of the biosignal are appropriate as inputs to the network, as these are natural inputs provided to somnologists in polysomnography. To validate the network, the sleep-stage classifier was trained and tested using the public Sleep-EDFx dataset. The results show that the proposed method achieves state-of-the-art performance on the Sleep-EDFx (accuracy 94%, F1 94%). The results demonstrate that the classifier is able to learn features described in the sleep scoring manual from the sleep data.

1. Introduction

Sleep is a period of rest for the mind and body. During sleep, volition and consciousness are in partial or complete abeyance and bodily functions are partially suspended. Sleep is a behavioral state marked by a characteristic immobile posture and a diminished response to external stimuli [1]. While the mechanism of sleep has not yet been fully elucidated, sleep helps not only to refresh the neurobehavioral domain but also to form immunological long-term memories [2]. Sleep periods of mammals vary from 2 to 20 h, and humans need 7.5 h of sleep on average per night [3]. Sleep quality is closely associated with many health-related issues, such as obesity, hypertension, coronary heart disease, diabetes, stroke, cancer, and mortality [4].
Polysomnography (PSG) is prescribed to measure sleep quality in clinics. The sleep stage, which indicates the depth of sleep, is determined from the PSG. The sleep stages are defined by associated patterns in the electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG). In the common process for sleep staging, a somnologist labels the sleep stage for each 30 s epoch of the biosignals recorded during sleep, following the classification guideline. The sleep-stage scoring process is tedious and time-consuming because trained somnologists perform it manually [5].
The sleep-stage scoring process has the following problems: (1) the process is time-consuming and expensive because it takes a trained somnologist around 30–90 min to score an 8-h sleep recording; and (2) there is little consensus in scoring between somnologists for the same datasets. Even the intra-rater reliability is only around 0.76 (Cohen’s kappa) [6]. This is a significant source of errors in sleep-stage classification. The reliability of manual sleep-stage classification is acceptable, but its accuracy is poor, and the reliability still needs to be improved. Despite advances in machine learning in recent decades, a gap remains between approaches to automating sleep-stage classification and clinical and commercial procedures. The main reason cited for this gap is the high cost of the classifiers [7].
In this paper, we propose a sleep-stage classification method using an artificial neural network. The method uses a deep learning algorithm to analyze sleep quality from biosignal data without the intervention of experts. To extract features automatically, the biosignal images of EEG and EOG are used as input to the convolutional neural network (CNN). Additionally, we examine whether EEG and EOG alone are sufficient for accurate sleep-stage classification. The proposed method is verified using the Sleep-EDFx dataset, which is an open sleep dataset [8].
The main contributions of this paper are as follows:
(i)
We propose a sleep-stage classification method that achieves good accuracy using only EEG and EOG signals.
(ii)
A CNN framework is proposed for the classification. The EEG and EOG signals are converted into images of the time domain and frequency domain, and the images are fed to the CNN as an input.
(iii)
We demonstrate experimentally that the proposed method shows good performance on publicly available datasets.

2. Background

During sleep, different stages can be distinguished according to brain activation. The manual for classifying sleep stages is published by the American Academy of Sleep Medicine (AASM). The manual defines four different sleep stages: Drowsiness (stage S1), Light Sleep (stage S2), Slow-Wave Sleep (stage SWS), and Rapid-Eye-Movement sleep (stage REM or R) [5]. One sleep cycle takes about 90 min and consists of S2, SWS, and R. S1 is usually found while falling asleep or after waking up. Usually, 3–5 cycles are repeated during the night. Figure 1 shows an example of scoring throughout a night, a so-called hypnogram.
PSG recordings include combinations of EEG, EOG, respiratory signals, and EMG signals. The sleep stage is determined for every 30 s of the signals generated during sleep. The unit of sleep scoring is a 30 s epoch, a convention that comes from recording the biosignals on paper in 30 s pages.
PSG signals are very complex, but they contain certain patterns that are important for the classification of sleep stages by a somnologist. For example, characteristic waveforms and frequency bands are used to distinguish the different sleep stages: sleep spindles (12–14 Hz), slow waves (0.5–4 Hz), alpha waves (8–12 Hz), and theta oscillations (4–8 Hz) [5]. However, there is ambiguity in the sleep scoring manual, so automatic sleep scoring is very difficult.
The manual classification of sleep stage has the following problems and limitations:
  • The ambiguity of sleep scoring: There are no clear criteria to determine each stage. The stages are defined empirically by somnologists, and some sleep stages (SWS, REM) have identifiable characteristics, but the distinction between Wake, S1, and S2 is less clear. Therefore, many scoring disagreements exist in S1 (23–74% agreement for S1) [9].
  • The cost of sleep scoring: It takes a long time to analyze the biosignals recorded during polysomnography, since experts examine the results manually. Polysomnography is therefore an expensive medical procedure, and it is difficult to prescribe it freely to patients.
Therefore, automatic sleep-stage classification would have the following benefits:
  • Sleep-stage scoring can be reliable. The system will produce the same and stable results every time, while humans are error-prone.
  • Somnologists can save the considerable amount of time needed to label the signals recorded in polysomnography.

3. Related Works

Conventional automatic sleep scoring systems are implemented using features in biosignals empirically defined by experts. The features are fed into the system as an input. Typical features consist of the characteristics of the signal or the change in the signal as brain activity. The sleep-stage classifier tries to find unique features from the input. Generally, the accuracy of classifiers relies heavily on the quality of the dominant features.
Most features used for sleep-stage classification can be categorized as one of the following [10]:
  • Temporal features: Temporal features explain the changes in brain signals over time. Each sleep stage is related to the brain signal and state. For example, an indicator of SWS is a change in peak amplitude or a change in frequencies.
  • Spectral features: Frequency features are extracted from the signal using Fourier or wavelet transformation. Sleep stages are characterized by unique frequencies. Most often the brain wave is divided into delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), sleep spindles (12–16 Hz), beta (12–40 Hz), and gamma (40–100 Hz) according to the frequency [5].
  • Statistical features: Statistical features can describe the biosignals. The minimum, maximum, and average of the signal or the number of zero-crossing can be general features in the signals. Other statistical features are median, standard deviation, and skewness of the signals.
Although many researchers have utilized various combinations of features, the results are similar, and it seems that an optimal combination of features does not exist. The most recent trend in sleep-stage classification is machine learning. Machine learning algorithms can retrieve features of a given dataset automatically. Various machine learning algorithms exist; the convolutional neural network (CNN) [11] is used in this paper.
The CNN consists of convolutional kernels whose entries are neural weights that are applied to an input signal. Through backpropagation, these weights learn and adapt to features present in the input. Each kernel scans the input and activates in the regions exhibiting the features. The layers can be stacked, with each layer building a more abstract representation from features found in previous layers. This process allows abstract feature representations to be extracted from raw data, such as images. Since the CNN shows excellent performance in image feature extraction, biosignals are converted into images and used as input to the CNN in this research.
Previous methods that classify sleep stages using deep learning algorithms include Deep Neural Networks (DNN) [12,13], Convolutional Neural Networks (CNN) [14,15,16,17,18,19,20,21,22,23], and Recurrent Neural Networks (RNN) [24,25]. In most of the studies, CNNs and RNNs have been used to handle raw PSG recordings. Other methods are based on the combination of a CNN and an RNN [26,27].
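As a minimal sketch of how a kernel scans an input and activates where a feature is present, the NumPy example below slides a 3 × 3 kernel over a small image and applies ReLU. The hand-written kernel, the 5 × 5 image, and the function name `conv2d_valid` are illustrative assumptions; in a trained CNN the kernel weights are learned by backpropagation rather than chosen by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image containing a vertical line in column 2.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hand-written kernel (an assumption for illustration) that responds to vertical lines.
kernel = np.array([[-1.0, 2.0, -1.0]] * 3)

# ReLU keeps only positive responses, i.e. positions where the feature is present.
response = np.maximum(conv2d_valid(image, kernel), 0.0)
print(response)
```

The response is non-zero only in the output column where the kernel is centered on the line, which is the "activation" described above.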
Tsinalis et al. [14] presented a CNN with numeric EEG data (1-dimension) as input. Their method successfully extracted features in the biosignal and classified sleep stages. Zhang and Wu [26] also proposed a complex-valued CNN. Complex values allow researchers to add more dimensions to the decision boundary and can speed up the learning process. Supratak et al. [27] used a combination of two CNNs to extract features. In the research, a CNN with larger filters could detect features with low frequency and the other with smaller filters could detect features with high frequency. In this architecture, two bidirectional LSTM layers were stacked. Chambon et al. [17] proposed a deep learning method to classify sleep stages using several biosignals, such as EEG, EOG, and EMG. This algorithm consisted of three steps. First, linear spatial filtering was applied to enhance the information contained in the input signals. Second, features were extracted using the rectified linear operator and max-pooling was applied. Finally, the features were fed to the SoftMax layer. Vilamala et al. [22] proposed a method of using the spectrum images from the EEG signal in each epoch as an input. These images were fed to a CNN to be classified as sleep stages.

4. Sleep-Stage Classification Using CNN

In our research, biosignals are converted into graph images, and the images are fed into the CNN as inputs. There are two reasons for this method: (1) When a somnologist determines the sleep stage in the PSG, he or she looks at the graph image of the biosignal to determine whether there are forms or characteristics of brain waves at each stage of sleep according to the classification manual. Since somnologists make judgments based on the images, it is also effective and natural to train the deep learning model with the images. Therefore, the images, not numbers, are used as an input of the CNN model; (2) the CNN model, a deep learning model used in our research, has excellent characteristics in image classification. In previous research, the CNN model extracted correct features from an image and showed high recognition accuracy using the features [28], so the CNN model is used as the sleep-stage classifier in our research. In this research, the CNN is trained and evaluated using images of the biosignals in the time domain and frequency domain.
To classify sleep stages using the CNN model, the biosignals are extracted from the Sleep-EDFx dataset, preprocessed, and converted into images. Then, the CNN is trained using the images as input and determines the sleep stages for the testing data. The dataset used for training, validating, and testing is Sleep-EDFx, which is a source of open sleep data for researchers [8].

4.1. Preprocessing and Input Data

The Sleep-EDFx includes EEG, EOG, EMG, snoring, and other physiological signals generated by body and brain activities. EEG, which records neuronal activity during sleep, is widely adopted for sleep-stage classification. However, the biosignals are distorted by several artifacts such as electrode movements and power line noise. The attenuation or removal of unwanted and noisy signals is a prerequisite for most signal processing applications. The artifacts can cause misinterpretation, reduce accuracy, and distort the experimental results. Therefore, preprocessing is necessary to remove artifacts and normalize the information contained in the biosignals. In the preprocessing step, data pruning and normalization are executed after extracting the EEG and EOG signals from the Sleep-EDFx dataset.
In the Sleep-EDFx, artifacts or missing data were found at the beginning, at the end, and in the middle of the recordings. In these cases, data loss could affect sleep-stage classification for the specific epochs. An epoch is a time unit for classifying sleep stages and has a duration of 30 s, so the EEG and EOG biosignals are divided equally into 30-s units. After removing the corrupted epochs in the recorded EEG and EOG signals, the corresponding epochs are removed from the hypnogram to keep the signals and labels synchronized.
After that, the signals are normalized based on the maximum and minimum values of the signals over time. Since the biosignals are very weak, low-voltage signals, they are greatly affected by body contact or the surrounding environment, so normalization is required to improve the performance of the sleep-stage classification. The signals are scaled to a fixed range of [0, 1]. The normalized signals are then divided into epochs.
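The normalization and epoch division can be sketched as follows. This is a minimal illustration, assuming min-max scaling over the whole recording and dropping any trailing partial epoch; the function name `normalize_and_split` and the synthetic recording are hypothetical and not taken from the authors' code.

```python
import numpy as np

FS = 100            # sampling rate in Hz (from the text)
EPOCH_SECONDS = 30  # scoring unit in seconds (from the text)

def normalize_and_split(signal, fs=FS, epoch_seconds=EPOCH_SECONDS):
    """Min-max scale a 1-D recording to [0, 1], then cut it into 30 s epochs."""
    signal = np.asarray(signal, dtype=float)
    lo, hi = signal.min(), signal.max()
    if hi == lo:
        scaled = np.zeros_like(signal)  # flat signal: avoid division by zero
    else:
        scaled = (signal - lo) / (hi - lo)
    samples_per_epoch = fs * epoch_seconds
    n_epochs = len(scaled) // samples_per_epoch
    # Assumption: a trailing partial epoch is dropped so every row has equal length.
    return scaled[: n_epochs * samples_per_epoch].reshape(n_epochs, samples_per_epoch)

# Synthetic stand-in for a recording: 95 s of noise (not real EEG data).
rng = np.random.default_rng(0)
recording = rng.normal(0.0, 50.0, size=95 * FS)
epochs = normalize_and_split(recording)
print(epochs.shape)
```

Each row of `epochs` then holds the 3000 samples of one 30 s epoch, which is the unit that is later converted into an image.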
In the proposed method, 2D images instead of conventional 1D signals are used as an input to the CNN for sleep-stage classification. Many researchers have tried to find dominant features to improve their accuracy. Therefore, several feature-extraction and selection methods have been proposed to improve the accuracy of classification. However, the proposed method eliminates the need for manual feature-extraction and selection, while taking advantage of the advancements in image classification using deep learning. The signal data in the time and frequency domain in each epoch were converted to 2D binary images. The converted images were intuitive and interpretable for somnologists. These images were fed to a CNN for training and testing.
The preprocessed EEG and EOG data were converted into graph images in the time domain and the frequency domain. The sampling frequency of each signal was 100 Hz. The images in the time domain were made with EEG or EOG values according to the sampling order.
To extract the frequency information for the EEG and EOG signal in each epoch, the signal was transformed to the frequency domain using the Fast Fourier Transform (FFT). Since the signal was sampled at 100 Hz, one 30-s epoch had 3000 data points. After the FFT, only the 1000 low-frequency coefficients of the 3000 coefficients were used, to exclude the effects of high frequencies. The resulting amplitude spectrum was plotted as the image in the frequency domain.
The images in the time domain and the frequency domain were converted to 2D binary images whose resolutions were 432 × 288. To minimize the computation cost, the images were then resized to a final resolution of 216 × 144, which is the input size of the CNN. Figure 2 shows typical sample images of the time domain and the frequency domain for each stage. The images of the EEG are in Figure 2a and the images of the EOG are in Figure 2b.
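The frequency-domain step can be sketched with NumPy as below. The sketch assumes that the retained coefficients are the first 1000 FFT bins; with 3000 samples at 100 Hz the bin spacing is 1/30 Hz, so under that assumption the retained bins span 0 Hz to about 33.3 Hz. The function name `amplitude_spectrum` and the 2 Hz test tone are hypothetical.

```python
import numpy as np

FS = 100       # sampling rate in Hz (from the text)
N_KEEP = 1000  # number of low-frequency coefficients kept (from the text)

def amplitude_spectrum(epoch, n_keep=N_KEEP):
    """Return the magnitudes of the first n_keep FFT coefficients of one epoch."""
    epoch = np.asarray(epoch, dtype=float)
    coeffs = np.fft.fft(epoch)       # 3000 complex coefficients for a 3000-sample epoch
    return np.abs(coeffs[:n_keep])   # assumption: "low frequencies" means the first bins

# Synthetic 30 s epoch containing a 2 Hz tone (delta-band frequency, not real data).
t = np.arange(30 * FS) / FS
epoch = np.sin(2 * np.pi * 2.0 * t)

spectrum = amplitude_spectrum(epoch)
freqs = np.fft.fftfreq(len(epoch), d=1.0 / FS)[:N_KEEP]
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz, freqs[-1])
```

For the test tone the largest coefficient falls in the bin at 2 Hz, and the last retained bin lies just below 33.4 Hz.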
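One way to picture the conversion of a trace into a binary image and the subsequent size reduction is the NumPy sketch below. It is only an illustration under stated assumptions: the text does not say how the images were rendered (a plotting library is likely), so the point-per-column rasterizer `rasterize` and the 2 × 2 block reduction `halve` are hypothetical stand-ins, and the trace is drawn as unconnected points rather than a continuous line.

```python
import numpy as np

def rasterize(values, width=432, height=288):
    """Draw a 1-D trace as a binary image (1 = trace pixel, 0 = background)."""
    values = np.asarray(values, dtype=float)
    cols = np.arange(width)
    # Resample the trace to one value per pixel column by linear interpolation.
    src = np.linspace(0, len(values) - 1, width)
    trace = np.interp(src, np.arange(len(values)), values)
    lo, hi = trace.min(), trace.max()
    span = hi - lo if hi > lo else 1.0
    # Map the largest value to the top row and the smallest to the bottom row.
    rows = np.round((hi - trace) / span * (height - 1)).astype(int)
    image = np.zeros((height, width), dtype=np.uint8)
    image[rows, cols] = 1
    return image

def halve(image):
    """Downsample by 2 in each direction; a block is 1 if any of its pixels is 1."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

t = np.linspace(0, 30, 3000)
full = rasterize(np.sin(2 * np.pi * 0.5 * t))  # array of 288 rows x 432 columns
small = halve(full)                            # array of 144 rows x 216 columns
print(full.shape, small.shape)
```

Taking the maximum over each 2 × 2 block, instead of averaging, keeps the image binary and avoids losing thin trace pixels during the reduction.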
For the EEG images in the time domain, the image of S1 has a smaller amplitude than that of S2, and SWS shows a waveform with a larger amplitude. In the frequency domain, the EEG image of S2 shows various frequency waveforms with irregular periods. On the other hand, the EEG image of SWS is mainly composed of low-frequency component waveforms. However, the EOG images show different characteristics.

4.2. Structure of CNN Model

Deep learning is a kind of machine learning algorithm. In recent years, classification techniques using deep learning have shown excellent performance in several fields of application such as text analysis and image classification. The ability to extract information from large amounts of data is one of the key reasons to adopt deep learning techniques in sleep-stage classification. Another advantage of deep learning is its ability to deal with very large datasets with high performance. Deep learning can learn features directly from the raw input data with little or no prior knowledge. In sleep-stage classification, the number of epochs in a single dataset is huge. For this reason, we adopted a deep learning algorithm to handle PSG signals.
The deep learning model used in our study is a CNN suitable for classifying large images. The CNN is usually composed of several convolutional layers, followed by several fully connected layers and a Softmax regression layer that outputs class probabilities. The CNN is a regularized multi-layer perceptron composed of a part that effectively extracts features contained in an image and a part that classifies images from those features. The convolutional layer filters the image with a two-dimensional array of weights and extracts features. The following layers include more convolutions that can describe higher-order features. The MaxPool layer reduces the size of the feature map by converting the local features of the image into one representative scalar value. The MaxPool layers are used to reduce computational cost and to regularize. The dropout layer reduces overfitting by randomly deactivating a fraction of the units during training. The flatten layer transforms the output into one dimension so that the dense layers can determine the result.
The network description of the CNN used in this study can be found in Table 1. The input to the CNN consists of the EEG and EOG images of the time domain and the frequency domain for each epoch. The output of the last fully connected layer is fed to a four-way Softmax which produces a probability over the four class labels. The CNN has convolutional layers of 3 × 3 filter and a MaxPool layer of a 2 × 2 filter. The convolutional and MaxPool layers are consistently arranged throughout the network architecture. As the activation function to each layer, Rectified Linear Units (ReLU) are applied to prevent negative values from passing through to the next layer. The batch normalization is followed by a dropout with a ratio of 0.25 or 0.5.
The first convolutional layer filters the 216 × 144 input image with 32 kernels. The second convolutional layer takes as input the output of the first convolutional layer and filters the 108 × 72 input with 64 kernels. The second convolutional layer connects to the flatten layer. The flatten layer converts the 54 × 36 output of the convolutional layer into a one-dimensional vector of size 124,416. The two dense layers at the end of the network predict the four classes. The first dense layer converts the 124,416-element input to a 256-element vector, and the second dense layer converts it to four values. The Softmax activation function in the last dense layer outputs a value between 0 and 1 according to the confidence of the class to which the image belongs. Therefore, the final output values of the model are the probability values of the sleep stages S1, S2, SWS, and R. The final sleep stage is determined as the sleep stage with the highest probability value.
The Adam optimizer [29] was used for optimization in this study. The following parameters were selected for mini-batches of size 32: learning rate = 0.001, beta1 = 0.9, beta2 = 0.999, and epsilon = 10^−7. The CNN was trained for each dataset with a maximum of 20 epochs, and the training process stopped when no improvements were found. Here, an epoch is defined as one pass over the whole training set. Weights were initialized with a normal distribution with mean = 0 and standard deviation = 0.1. Those values were obtained empirically by monitoring the loss during training. The implementation was written in Python using Keras [30] with a TensorFlow backend [31].
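The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes a single-channel input, convolutions that preserve the spatial size ("same" padding), and a 2 × 2 MaxPool after each convolution; these assumptions are not stated explicitly in the text but reproduce its numbers. The helper names are hypothetical.

```python
def conv_same(shape, filters):
    """3x3 convolution with 'same' padding: spatial size kept, channel count replaced."""
    h, w, _ = shape
    return (h, w, filters)

def max_pool_2x2(shape):
    """2x2 max pooling: halves each spatial dimension, keeps the channel count."""
    h, w, c = shape
    return (h // 2, w // 2, c)

shape = (216, 144, 1)                       # assumed single-channel input image
shape = max_pool_2x2(conv_same(shape, 32))  # first block:  108 x 72 with 32 channels
shape = max_pool_2x2(conv_same(shape, 64))  # second block: 54 x 36 with 64 channels
flat = shape[0] * shape[1] * shape[2]       # length of the flattened vector
print(shape, flat)
```

Under these assumptions the flattened length is 54 × 36 × 64 = 124,416, which matches the size given for the flatten layer.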
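To make the role of the four optimizer parameters concrete, the following sketch implements one Adam update for a single scalar weight in the textbook form (Algorithm 1 of the Adam paper). It is an illustration only; the function name `adam_step` is hypothetical, and the library implementation used in the study may place epsilon slightly differently.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step from w = 0 with gradient -6 (e.g. the loss (w - 3)^2 evaluated at w = 0).
w1, m1, v1 = adam_step(0.0, -6.0, 0.0, 0.0, t=1)
print(w1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly the learning rate (about 0.001 here) regardless of the gradient's magnitude.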

5. Experimental Results

5.1. Sleep Dataset

We used the Sleep-EDFx dataset in Physionet [32] to evaluate the performance of the proposed method. The dataset includes PSG recordings at a sampling rate of 100 Hz, and each record consists of EEG (from the Fpz-Cz and Pz-Oz electrode locations), EOG (horizontal and vertical), chin electromyography (EMG), and event markers. We only used the Fpz-Cz EEG and the horizontal EOG channels in our research. The hypnograms (sleep stages for each 30 s epoch) were manually labeled by somnologists following the Rechtschaffen and Kales rules [33]. Each label was mapped to a sleep stage. The labels included Wake, R (REM), S1, S2, S3, S4, MOVEMENT (movement time), and UNKNOWN (not scored). According to the American Academy of Sleep Medicine (AASM) standard, the S3 and S4 stages were merged into a single stage: SWS (Slow-Wave Sleep) [5]. Wake, MOVEMENT, and UNKNOWN were excluded.
The distribution of sleep stages in the Sleep-EDF database is imbalanced: the numbers of W and S2 epochs are much larger than those of the other stages. Deep learning algorithms handle class imbalance poorly. To address this problem, the dataset was sampled to bring the number of epochs in each class closer to balance. For example, a total of 16,000 experimental data were randomly sampled from the dataset, consisting of 2400 (15%) for S1, 7200 (45%) for S2, 3520 (22%) for SWS, and 2880 (18%) for R. The experimental dataset was split randomly into a training, a validation, and a testing set with respective proportions (0.6, 0.2, 0.2). The sizes of the training set, validation set, and testing set were 9600, 3200, and 3200, respectively. Therefore, the training, validation, and testing sets did not overlap.
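The label mapping described above can be sketched as a small lookup table. The label strings below follow the names used in this section; the actual annotation strings stored in the dataset files may differ, and the function name `relabel` is hypothetical.

```python
# Mapping from the scored labels to the four classes used in this study.
# Assumption: label strings follow the names in the text, not the raw file annotations.
STAGE_MAP = {
    "S1": "S1",
    "S2": "S2",
    "S3": "SWS",  # S3 and S4 are merged into slow-wave sleep
    "S4": "SWS",
    "R": "R",
}
EXCLUDED = {"Wake", "MOVEMENT", "UNKNOWN"}

def relabel(hypnogram):
    """Return (epoch_index, class) pairs for the epochs that are kept."""
    kept = []
    for index, label in enumerate(hypnogram):
        if label in EXCLUDED:
            continue  # the matching signal epoch must be dropped as well
        kept.append((index, STAGE_MAP[label]))
    return kept

hypnogram = ["Wake", "S1", "S2", "S3", "S4", "MOVEMENT", "R", "UNKNOWN"]
kept = relabel(hypnogram)
print(kept)
```

Returning the epoch indices alongside the classes makes it straightforward to select the corresponding EEG and EOG epochs so that signals and labels stay aligned.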
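The sampling and splitting step can be sketched in plain Python as follows. The per-class counts and the split proportions are taken from the text; the synthetic pool of epoch identifiers, the seed, and the function name `sample_and_split` are assumptions for illustration.

```python
import random

# Per-class sample sizes quoted in the text (16,000 epochs in total).
CLASS_COUNTS = {"S1": 2400, "S2": 7200, "SWS": 3520, "R": 2880}

def sample_and_split(pool, counts=CLASS_COUNTS, fractions=(0.6, 0.2, 0.2), seed=0):
    """Draw a fixed number of epochs per class, shuffle, and split into three sets."""
    rng = random.Random(seed)
    sampled = []
    for stage, n in counts.items():
        sampled.extend(rng.sample(pool[stage], n))  # sampling without replacement
    rng.shuffle(sampled)
    n_train = round(fractions[0] * len(sampled))
    n_val = round(fractions[1] * len(sampled))
    train = sampled[:n_train]
    val = sampled[n_train:n_train + n_val]
    test = sampled[n_train + n_val:]
    return train, val, test

# Synthetic pool of (stage, epoch_id) pairs standing in for the real epochs.
pool = {stage: [(stage, i) for i in range(n + 500)] for stage, n in CLASS_COUNTS.items()}
train_set, val_set, test_set = sample_and_split(pool)
print(len(train_set), len(val_set), len(test_set))
```

Because the three sets are slices of one shuffled list drawn without replacement, no epoch can appear in more than one of them.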

5.2. Performance Measures

Various metrics can be used to measure the performance of classifiers. In a binary case, important metrics to measure the performance of the classifier are precision and recall. Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. Frequently, accuracy is used to measure the percentage of overall agreements between predictions and correct answers.
In multi-class problems, the accuracy can give a skewed picture of performance because of class imbalance. For example, if the percentage of S2 in the testing set is 45%, then just classifying all data as S2 achieves an accuracy of 45% owing to the bias in the data. In the multi-class case, precision and recall can be measured for each class. An F1 score can be computed from these two measures. The F1 score is the harmonic mean of the precision and recall per class. The F1 scores are not biased by class imbalance. To measure the overall performance, the accuracy and weighted F1 score were used in our research. The weighted F1 score was calculated as the weighted average of the F1 scores of the classes. The closer the accuracy, precision, recall, F1 score, and weighted F1 score values were to 1, the more reliable the classifier was. Using the EarlyStopping callback of TensorFlow [31], the training was stopped when the accuracy on the validation data did not improve for more than six epochs. Then, we measured the performance values for the testing data using the Scikit-learn package [34].
All the experiments were conducted on a server with an Intel i9-9920X CPU @3.50 GHz, 128 GB RAM, four NVIDIA GeForce RTX 2080 GPUs, and CUDA 10.1. It took approximately 20 s to perform one epoch in training.
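The metrics defined in this section can be computed directly from a confusion matrix, as in the sketch below. The study itself used the Scikit-learn package for this; the plain-Python function `per_class_metrics` is a hypothetical stand-in that shows the arithmetic, the weights in the weighted F1 score are assumed to be the class supports, and the counts in the example matrix are made up and are not results from the paper.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of epochs of true class i predicted as class j."""
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    report = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(confusion[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
        report[name] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": sum(confusion[k])}
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    # Weighted F1: per-class F1 averaged with weights equal to the class support.
    weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total
    return report, accuracy, weighted_f1

labels = ["S1", "S2", "SWS", "R"]
# Made-up counts for illustration only; these are not results from the paper.
confusion = [
    [40, 30, 0, 10],
    [5, 190, 3, 2],
    [0, 2, 98, 0],
    [5, 3, 0, 62],
]
report, accuracy, weighted_f1 = per_class_metrics(confusion, labels)
print(report["S1"], accuracy, weighted_f1)
```

In this made-up example, S1 has a precision of 0.8 but a recall of only 0.5, which the per-class F1 score exposes even though the overall accuracy is high.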

5.3. Experiments

We performed a set of experiments for the signal channels and domains. Experimentally, only the EEG and EOG signals were sufficient, and we used the two signals. The EEG signal should be included because it is vital under the AASM sleep scoring guidelines [5].
The experiments consisted of the following:
  • Exp1: Classification using EEG and EOG images in the time domain
  • Exp2: Classification using EEG and EOG images in the frequency domain
  • Exp3: Classification using EEG and EOG images in the time domain and frequency domain
In Exp1 and Exp2, the two images (EEG and EOG) were concatenated vertically. The image was resized to 432 by 288 and used as an input for the CNN. In Exp3, the two images that were used in Exp1 and Exp2 were also concatenated horizontally. The image was also resized to 432 by 288 as an input.
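The image composition for the three experiments can be sketched with NumPy as below. The sketch assumes that each source image is a 288-row by 432-column binary array, that Exp3 joins the already resized Exp1 and Exp2 images, and that resizing uses nearest-neighbor sampling; none of these details is specified in the text, and the random arrays and the function name `resize_nearest` are placeholders.

```python
import numpy as np

def resize_nearest(image, height, width):
    """Nearest-neighbor resize by picking source rows and columns at regular steps."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]

H, W = 288, 432  # rows x columns of one image ("432 by 288" in the text)
rng = np.random.default_rng(0)
# Random binary arrays standing in for the four biosignal images of one epoch.
eeg_time, eog_time, eeg_freq, eog_freq = (
    rng.integers(0, 2, size=(H, W), dtype=np.uint8) for _ in range(4)
)

exp1 = resize_nearest(np.vstack([eeg_time, eog_time]), H, W)  # EEG above EOG, time domain
exp2 = resize_nearest(np.vstack([eeg_freq, eog_freq]), H, W)  # EEG above EOG, frequency domain
exp3 = resize_nearest(np.hstack([exp1, exp2]), H, W)          # time and frequency side by side
print(exp1.shape, exp2.shape, exp3.shape)
```

Nearest-neighbor sampling can drop thin trace pixels when an image is shrunk, so a real pipeline would more likely use an interpolating or area-based resize from an imaging library.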
Exp1 verified the accuracy of sleep-stage classification using EEG and EOG images in the time domain as input to the CNN. After training the CNN, the accuracy of the training and validation set could be visualized. Figure 3 shows the accuracy of the training process for the numbers of epochs. After 10 epochs in the training, the CNN achieved 96% as the training set accuracy and 82% as the validation set accuracy.
Table 2 shows the performance evaluation results of Exp1 for the testing data after training. The F1 score values were 59% for S1, 90% for S2, 100% for SWS, and 74% for R. As for the overall performance of Exp1, the weighted F1 score was 84% and the accuracy was 85%.
Figure 4 is the confusion matrix of the cross-validation for Exp1. The row labels are the given classes of the testing data, and the column labels are the ones predicted by the CNN. The results indicate that the CNN was weak in detecting S1. S1 was correctly classified in only 47.9% of the cases; 36.7% of the actual S1 epochs were misclassified as S2 and 15.3% as R.
Exp2 verified the classification accuracy using EEG and EOG images in the frequency domain. Figure 5 shows the accuracy of the training process for the numbers of epochs. After 14 epochs in the training, the CNN achieved 90% as the training set accuracy and 89% as the validation set accuracy.
Table 3 shows the performance evaluation results of Exp2. The F1 score values were 60% for S1, 90% for S2, 97% for SWS, and 84% for R, respectively. For the overall performance of Exp2, the weighted F1 score was 86% and accuracy was 86%. The results show that the performance of Exp2 is similar to that of Exp1.
Figure 6 is the confusion matrix of Exp2. The accuracy of classification for S1 is still low. A total of 48.3% of the given S1 set was correctly classified, but 34.7% of the S1 set was misclassified as S2 and 17.0% as R.
Exp3 verified the classification accuracy using EEG and EOG images in the time and frequency domains. The input for the CNN in Exp3 was formed by concatenating and resizing the images that were used in Exp1 and Exp2. Figure 7 shows the accuracy of the experiment for numbers of epochs. After 14 epochs, the CNN achieved a 98% training set accuracy and a 94% validation set accuracy.
Table 4 shows the performance evaluation results of Exp3. The F1 score values were 84% for S1, 100% for S2, 100% for SWS, and 84% for R. As for the overall performance of Exp3, the weighted F1 score was 94% and the accuracy was also 94%.
Figure 8 is the confusion matrix of Exp3. The CNN classified the stages correctly except that S1 and R were partially confused with each other. It is difficult for the CNN to distinguish between S1 and R because they have similar characteristics.
Table 5 shows a comparison of the results of the three experiments. It shows the average, the range of accuracy, and the F1 score for each experiment. The highest accuracy was obtained when EEG and EOG images in both the time domain and the frequency domain were used. In this research, only EEG and EOG channels were used, although in general better results can be obtained by adding more channels. It was experimentally confirmed that using only EEG and EOG channels can achieve results similar to those obtained using more channels.
We showed that it is possible to classify sleep stages using the EEG and EOG channel and a CNN working on biosignals, without any feature extraction processes. Training after simple preprocessing was conducted without requiring any expert knowledge of features. This is an advantage, because the CNN can learn the features in the EEG and EOG images, wherein the images are a kind of dataset that somnologists are familiar with when classifying sleep stages.
When analyzing the kinds of errors in the classification, the errors mostly happened in stages that are contiguous in the sleep cycle. For example, S1 is most often confused with R or vice versa. Similarly, S1 is known as the stage for which inter-somnologist agreement is the lowest. It can be misclassified as R, which can contain patterns similar to S1.

6. Discussion

We compare the proposed method to the previous studies on sleep-stage classification. Table 6 shows the characteristics and accuracies of recent studies. The studies on sleep classification are difficult to compare directly because they do not use the same database, testing methods, or scoring rules. Even when the studies use the same dataset, the following limitations exist: (1) configuration of experimental data in the dataset (whether Wake and Movement stages are included or not); (2) sampling methods of training, validation, and testing of the experimental data; and (3) distribution between classes (class imbalance problem). For example, the epochs of the Wake stage occupy a higher percentage than any other stage in the Sleep-EDFx database, because wake epochs before and after sleep are recorded. Some studies try to rebalance the number of epochs for each stage, but others do not, which leads to biased results.
Therefore, a performance comparison is appropriate only when the experiments use the same experimental environment. With this caveat, the proposed method outperforms those of the other studies that used the same dataset. We verified the classification performance in line with the previous studies, obtaining 94.0% accuracy.

7. Conclusions

In this paper, a new sleep-stage classification method based on image classification is proposed. The idea is to use EEG and EOG images instead of conventional hand-crafted features in sleep-stage classification. This method is not necessary for feature extraction and selection, but provides a natural graphic interface to somnologists using CNN, which is an excellent algorithm for image recognition.
The EEG and EOG signals of the sleep epochs, in both the time and frequency domains, were converted to binary images, and these images were fed to the pre-trained CNN. The proposed method achieved state-of-the-art performance (accuracy: 94%, weighted F1 score: 94%) on the public Sleep-EDFx dataset. This demonstrates that a CNN can extract features from biosignals and learn patterns similar to those described by the AASM. The classifier reached outstanding results, as in other studies.
For future work, we will verify the proposed method on other sleep datasets, further optimize the parameters, and utilize pre-trained CNNs. Sleep-stage classifiers are still not used commercially despite their high accuracy. For sleep-stage classifiers to be commercialized, they should be validated with more datasets so that they can achieve a level of confidence similar to that of somnologists. More research is needed before CNNs can be used in clinical practice.
The code, trained models, and results are publicly available at https://github.com/psch125/sleep_classification (accessed on 3 January 2022).

Author Contributions

Conceptualization, M.-J.J.; methodology, M.-J.J.; software, S.-C.P.; validation, S.-C.P.; formal analysis, M.-J.J. and S.-C.P.; investigation, M.-J.J.; resources, M.-J.J.; data curation, M.-J.J.; writing—original draft preparation, M.-J.J. and S.-C.P.; writing—review and editing, M.-J.J.; supervision, M.-J.J.; project administration, M.-J.J.; funding acquisition, M.-J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

We thank the anonymous reviewers for their careful reading and insightful suggestions to help improve the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Definition of Sleep. Available online: https://medical-dictionary.thefreedictionary.com/sleep (accessed on 3 January 2022).
  2. Rasch, B.; Born, J. About Sleep’s Role in Memory. Physiol. Rev. 2013, 93, 681–766.
  3. Carskadon, M.A.; Dement, W.C. Normal human sleep: An overview. In Principles and Practice of Sleep Medicine, 5th ed.; Kryger, M.H., Roth, T., Dement, W.C., Eds.; Elsevier Saunders: St. Louis, MO, USA, 2011; pp. 16–26.
  4. Jackson, C.L.; Redline, S.; Kawachi, I.; Hu, F.B. Association between sleep duration and diabetes in black and white adults. Diabetes Care 2013, 36, 3557–3565.
  5. The American Academy of Sleep Medicine. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications Version 2.6. Published Online 1 January 2021. Available online: https://aasm.org/clinical-resources/scoring-manual/ (accessed on 3 January 2022).
  6. Lee, Y.J.; Lee, J.Y.; Cho, J.H.; Choi, J.H. Inter-rater reliability of sleep stage scoring: A meta-analysis. J. Clin. Sleep Med. 2022, 18, 193–202.
  7. Chriskos, P.; Frantzidis, C.A.; Nday, C.M.; Gkivogkli, P.T.; Bamidis, P.D.; Kourtidou-Papadeli, C. A review on current trends in automatic sleep staging through bio-signal recordings and future challenges. Sleep Med. Rev. 2021, 55, 101377.
  8. Kemp, B.; Zwinderman, A.H.; Tuk, B.; Kamphuisen, H.A.C.; Oberye, J.J.L. Analysis of a sleep-dependent neuronal feedback loop: The slow-wave microcontinuity of the EEG. IEEE Trans. Biomed. Eng. 2000, 47, 1185–1194.
  9. Rosenberg, R.S.; Hout, S.V. The American Academy of Sleep Medicine inter-scorer reliability program: Respiratory events. J. Clin. Sleep Med. 2014, 10, 447–454.
  10. Sen, B.; Peker, M.; Cavusoglu, A.; Celebi, F.V. A comparative study on classification of sleep stage based on EEG signals using feature selection and classification algorithms. J. Med. Syst. 2014, 38, 1–21.
  11. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
  12. Dong, H.; Supratak, A.; Pan, W.; Wu, C.; Matthews, P.; Guo, Y. Mixed neural network approach for temporal sleep stage classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2018, 26, 324–333.
  13. Wei, R.; Zhang, X.; Wang, J.; Dang, X. The research of sleep staging based on single-lead electrocardiogram and deep neural network. Biomed. Eng. Lett. 2018, 8, 87–93.
  14. Tsinalis, O.; Matthews, P.M.; Guo, Y.; Zafeiriou, S. Automatic Sleep Stage Scoring with Single-Channel EEG Using Convolutional Neural Networks. arXiv 2016, arXiv:1610.01683.
  15. Mikkelsen, K.; de Vos, M. Personalizing deep learning models for automatic sleep staging. arXiv 2018, arXiv:1801.02645.
  16. Yildirim, O.; Baloglu, U.; Acharya, U. A Deep Learning Model for Automated Sleep Stages Classification Using PSG Signals. Int. J. Environ. Res. Public Health 2019, 16, 599.
  17. Chambon, S.; Galtier, M.; Arnal, P.; Wainrib, G.; Gramfort, A. A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series. IEEE Trans. Neural Syst. Rehabil. Eng. 2018, 26, 758–769.
  18. Andreotti, F.; Phan, H.; Cooray, N.; Lo, C.; Hu, M.; De Vos, M. Multichannel Sleep Stage Classification and Transfer Learning Using Convolutional Neural Networks. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 171–174.
  19. Phan, H.; Andreotti, F.; Cooray, N.; Chen, O.; De Vos, M. Joint classification and prediction CNN framework for automatic sleep stage classification. Proc. IEEE Trans. Biomed. Eng. 2019, 66, 1285–1296.
  20. Sors, A.; Bonnet, S.; Mirekc, S.; Vercueil, L.; Payen, J.-F. A convolutional neural network for sleep stage scoring from raw single-channel eeg. Biomed. Signal Process Control 2018, 42, 107–114.
  21. Andreotti, F.; Phan, H.; De Vos, M. Visualising Convolutional Neural Network Decisions in Automatic Sleep Scoring. In Proceedings Joint Workshop on Artificial Intelligence in Health (AIH); CEUR: Stockholm, Sweden, 2018; pp. 70–81.
  22. Vilamala, A.; Madsen, K.; Hansen, L. Deep convolutional neural networks for interpretable analysis of EEG sleep stage scoring. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Tokyo, Japan, 25–28 September 2017; pp. 1–6.
  23. Malafeev, A.; Laptev, D.; Bauer, S.; Omlin, X.; Wierzbicka, A.; Wichniak, A.; Jernajczyk, W.; Riener, R.; Buhmann, J.; Achermann, P. Automatic human sleep stage scoring using deep neural networks. Front Neurosci. 2018, 781.
  24. Phan, H.; Andreotti, F.; Cooray, N.; Chen, O.; De Vos, M. Automatic sleep stage classification using single-channel EEG: Learning sequential features with attention-based recurrent neural networks. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 1452–1455.
  25. Patanaik, A.; Ong, J.; Gooley, J.; Ancoli-Israel, S.; Chee, M. An end-to-end framework for real-time automatic sleep stage classification. Sleep 2018, 41, zsy041.
  26. Zhang, J.; Wu, Y. Automatic sleep stage classification of single-channel EEG by using complex-valued convolutional neural network. Biomed. Eng./Biomed. Tech. 2017, 63, 177–190.
  27. Supratak, A.; Dong, H.; Wu, C.; Guo, Y. DeepSleepNet: A Model for Automatic Sleep Stage Scoring based on Raw Single-Channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 1998–2008.
  28. Al-Saffar, A.A.M.; Tao, H.; Talab, M.A. Review of deep convolution neural network in image classification. In Proceedings of the International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications, Jakarta, Indonesia, 23–24 October 2017; pp. 26–31.
  29. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
  30. Gulli, A.; Pal, S. Deep Learning with Keras; Packt Publishing Ltd.: Birmingham, UK, 2017.
  31. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. Available online: tensorflow.org (accessed on 3 March 2021).
  32. Goldberger, A.; Amaral, L.; Glass, L.; Hausdorff, J.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.-K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220.
  33. Rechtschaffen, A.; Kales, A. A Manual of Standardized Terminology Techniques and Scoring System for Sleep Stages of Human Subjects; Public Health Service, US Government Printing Office: Washington, DC, USA, 1968.
  34. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  35. O’Reilly, C.; Gosselin, N.; Carrier, J.; Nielsen, T. Montreal Archive of Sleep Studies: An open-access resource for instrument benchmarking and exploratory research. J. Sleep Res. 2014, 23, 628–635.
  36. Zhang, G.Q.; Cui, L.; Mueller, R.; Tao, S.; Kim, M.; Rueschman, M.; Mariani, S.; Mobley, D.; Redline, S. The National Sleep Research Resource: Towards a sleep data commons. J. Am. Med. Inform. Assoc. 2018, 25, 1351–1358.
  37. Quan, S.; Howard, B.; Iber, C.; Kiley, J.; Nieto, F.; O’Connor, G.; Rapoport, D.; Redline, S.; Robbins, J.; Samet, J.; et al. The Sleep Heart Health Study: Design, rationale, and methods. Sleep 1997, 20, 1077–1085.
Figure 1. An example of a hypnogram with 6.5 h of sleep. Adapted from https://commons.wikimedia.org/wiki/File:Sleep_Hypnogram.svg (accessed on 3 January 2022).
Figure 2. Images of time domain and frequency domain for each sleep stage: (a) EEG; (b) EOG.
Figure 3. Training and validation set accuracy over epochs (Exp1).
Figure 4. Confusion matrix of the classification results (Exp1).
Figure 5. Training and validation set accuracy over epochs (Exp2).
Figure 6. Confusion matrix of the classification results (Exp2).
Figure 7. Training and validation set accuracy over epochs (Exp3).
Figure 8. Confusion matrix of the classification results (Exp3).
Table 1. CNN architecture for sleep-stage classification.
| Layer | Filter | Kernel | Output | Dropout | Activation |
|---|---|---|---|---|---|
| Input | | | (216,144) | | |
| Conv2D | 32 | (3,3) | | 0.25 | ReLU |
| MaxPool | | (2,2) | (108,72) | | |
| Conv2D | 64 | (3,3) | | 0.25 | ReLU |
| MaxPool | | (2,2) | (54,36) | | |
| Flatten | | | (124,416) | | |
| Dense | | | (256) | 0.5 | ReLU |
| Dense | | | (4) | | Softmax |
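As a quick consistency check on Table 1, the output sizes can be traced layer by layer. The sketch below is plain Python arithmetic and not the authors' code. It assumes a single-channel input, stride-1 convolutions with "same" padding (inferred from the first pooling output being (108,72) rather than (107,71)), and non-overlapping 2×2 pooling. All helper names are hypothetical.

```python
def conv2d_same(shape, filters):
    """Stride-1 convolution with 'same' padding: height and width are unchanged."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool=(2, 2)):
    """Non-overlapping max pooling: height and width are divided by the pool size."""
    height, width, channels = shape
    return (height // pool[0], width // pool[1], channels)


def conv2d_params(in_channels, filters, kernel=(3, 3)):
    """Trainable parameters of a conv layer: kernel weights plus one bias per filter."""
    return (kernel[0] * kernel[1] * in_channels + 1) * filters


def dense_params(n_in, n_out):
    """Trainable parameters of a dense layer: weights plus one bias per unit."""
    return (n_in + 1) * n_out


# Assumption: the (216,144) input image has one channel.
shape = (216, 144, 1)
shape = conv2d_same(shape, 32)   # Conv2D, 32 filters
pool1 = max_pool(shape)          # MaxPool (2,2)
shape = conv2d_same(pool1, 64)   # Conv2D, 64 filters
pool2 = max_pool(shape)          # MaxPool (2,2)
flat = pool2[0] * pool2[1] * pool2[2]  # Flatten

print("after first pool: ", pool1[:2])
print("after second pool:", pool2[:2])
print("flattened size:   ", flat)

# Parameter counts under the same assumptions.
params = {
    "conv1": conv2d_params(1, 32),
    "conv2": conv2d_params(32, 64),
    "dense1": dense_params(flat, 256),
    "dense2": dense_params(256, 4),
}
print("total parameters: ", sum(params.values()))
```

Under these assumptions the flattened size equals 54 × 36 × 64 = 124,416, which matches the Flatten row of Table 1, and almost all trainable parameters sit in the first dense layer.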
Table 2. Experiment results for test data (Exp1).
| Stage | Precision | Recall | F1 Score |
|---|---|---|---|
| S1 | 0.77 | 0.48 | 0.59 |
| S2 | 0.84 | 0.96 | 0.90 |
| SWS | 0.99 | 1.00 | 1.00 |
| R | 0.78 | 0.70 | 0.74 |
Overall: weighted F1 score 0.84; accuracy 0.85.
Table 3. Experiment results for test data (Exp2).
| Stage | Precision | Recall | F1 Score |
|---|---|---|---|
| S1 | 0.78 | 0.48 | 0.60 |
| S2 | 0.85 | 0.95 | 0.90 |
| SWS | 0.96 | 0.97 | 0.97 |
| R | 0.83 | 0.84 | 0.84 |
Overall: weighted F1 score 0.86; accuracy 0.86.
Table 4. Experiment results for test data (Exp3).
| Stage | Precision | Recall | F1 Score |
|---|---|---|---|
| S1 | 0.78 | 0.91 | 0.84 |
| S2 | 1.00 | 1.00 | 1.00 |
| SWS | 1.00 | 1.00 | 1.00 |
| R | 0.91 | 0.78 | 0.84 |
Overall: weighted F1 score 0.94; accuracy 0.94.
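The per-stage F1 scores in Tables 2–4 follow from precision and recall as their harmonic mean, and the weighted F1 score averages the per-stage values by the number of test epochs in each stage. The sketch below reproduces the F1 column of Table 4 from its rounded precision and recall values. The function names are hypothetical. Because the per-stage epoch counts are not listed in the tables, the weighted score is shown only as a function and is not recomputed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(f1_by_class, support_by_class):
    """Average of per-class F1 scores, weighted by the number of true samples per class."""
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total


# (precision, recall) per stage, copied from Table 4 (Exp3).
table4 = {
    "S1": (0.78, 0.91),
    "S2": (1.00, 1.00),
    "SWS": (1.00, 1.00),
    "R": (0.91, 0.78),
}

f1 = {stage: round(f1_score(p, r), 2) for stage, (p, r) in table4.items()}
for stage, value in f1.items():
    print(f"{stage}: F1 = {value:.2f}")
```

Recomputing from rounded inputs can differ from a published value in the last digit. For example, the SWS row of Table 2 (precision 0.99, recall 1.00) gives 0.99 from the rounded inputs, while the table reports 1.00, presumably computed from unrounded values.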
Table 5. Summary of experiments: mean and range of the testing set.
| Metric | Time Domain | Freq. Domain | Time and Freq. Domain |
|---|---|---|---|
| Accuracy | 86% (71.4–100) | 86% (72.1–95.2) | 94% (78.4–100) |
| Weighted F1 score | 85% (59–100) | 86% (60–97) | 94% (84–100) |
Table 6. A comparison of sleep-stage classification methods.
| Study | Dataset | Input | Method | Accuracy (%) |
|---|---|---|---|---|
| Tsinalis [14] | Sleep-EDF *, MASS ** | EEG | CNN | 74.8, 77.9 |
| Dong [12] | MASS | EOG | DNN | 81.4 |
| Supratak [27] | Sleep-EDFx, MASS | EEG | CNN + RNN | 82.0, 81.5 |
| Andreotti [18] | Sleep-EDF, MASS | EEG, EOG | CNN | 76.8, 79.4 |
| Mikkelsen [15] | Sleep-EDF | EEG, EOG | CNN | 84.0 |
| Chambon [17] | MASS | EEG, EOG, EMG | CNN | 79.9 |
| Sors [20] | SHHS1 *** | EEG | CNN | 87.0 |
| Phan [19] | Sleep-EDF | EEG, EOG | CNN | 82.3 |
| Yildirim [16] | Sleep-EDFx | EEG, EOG | CNN | 92.3 |
| This study | Sleep-EDFx | EEG, EOG | CNN | 94.0 |
* Sleep-EDF—subset version of Sleep-EDFx. ** MASS—Montreal Archive of Sleep Studies [35]. *** SHHS1—Sleep Heart Health Study [36,37].
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Joe, M.-J.; Pyo, S.-C. Classification of Sleep Stage with Biosignal Images Using Convolutional Neural Networks. Appl. Sci. 2022, 12, 3028. https://doi.org/10.3390/app12063028
