Article

MultiScaleSleepNet: A Hybrid CNN–BiLSTM–Transformer Architecture with Multi-Scale Feature Representation for Single-Channel EEG Sleep Stage Classification

1 China-UK Low Carbon College, Shanghai Jiaotong University, Shanghai 200240, China
2 Shanghai Changhai Hospital, Shanghai 200433, China
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(20), 6328; https://doi.org/10.3390/s25206328
Submission received: 1 September 2025 / Revised: 20 September 2025 / Accepted: 28 September 2025 / Published: 13 October 2025
(This article belongs to the Special Issue AI on Biomedical Signal Sensing and Processing for Health Monitoring)

Abstract

Accurate automatic sleep stage classification from single-channel EEG remains challenging due to the need for effective extraction of multiscale neurophysiological features and modeling of long-range temporal dependencies. This study aims to address these limitations by developing an efficient and compact deep learning architecture tailored for wearable and edge device applications. We propose MultiScaleSleepNet, a hybrid convolutional neural network–bidirectional long short-term memory–transformer architecture that extracts multiscale temporal and spectral features through parallel convolutional branches, followed by sequential modeling using a BiLSTM memory network and transformer-based attention mechanisms. The model obtained an accuracy, macro-averaged F1 score, and kappa coefficient of 88.6%, 0.833, and 0.84 on the Sleep-EDF dataset; 85.6%, 0.811, and 0.80 on the Sleep-EDF Expanded dataset; and 84.6%, 0.745, and 0.79 on the SHHS dataset. Ablation studies indicate that attention mechanisms and spectral fusion consistently improve performance, with the most notable gains observed for stages N1, N3, and rapid eye movement. MultiScaleSleepNet demonstrates competitive performance across multiple benchmark datasets while maintaining a compact size of 1.9 million parameters, suggesting robustness to variations in dataset size and class distribution. The study supports the feasibility of real-time, accurate sleep staging from single-channel EEG using parameter-efficient deep models suitable for portable systems.

1. Introduction

1.1. Introduction to Sleep Stage Classification

Sleep occupies approximately one-third of the human lifespan and is indispensable to both physical and mental health [1]. Adequate sleep improves cognitive performance and overall well-being [2,3], and critical biological processes such as memory consolidation, cellular repair, and cerebral development occur predominantly during sleep [4,5,6,7]. Conversely, sleep deprivation is associated with impaired cognition, immune suppression, and dysregulation of memory, emotion, and metabolism [8,9,10,11,12]. Although the health consequences of sleep issues are well-documented, their wider socioeconomic implications are only now receiving sustained attention. A multinational study across 16 countries in Europe, North America, and Australia found that chronic insomnia affects over 8% of adults, resulting in annual GDP losses of 0.64–1.31% [13]. Further evidence confirms that insufficient or fragmented sleep not only compromises daily emotional stability but also exacerbates productivity declines and societal economic burdens [14]. Given these health and economic repercussions, accurate sleep stage classification is essential for diagnosing sleep disorders and optimizing their management.
Polysomnography (PSG) serves as the clinical gold standard for sleep stage classification. It records electroencephalography (EEG), electrooculography (EOG), and electromyography (EMG), providing the basis for separating sleep into non-rapid eye movement (NREM) and rapid eye movement (REM) periods [15]. Following standardized guidelines, including the Rechtschaffen and Kales (R&K) rules [16] and the American Academy of Sleep Medicine (AASM) guidelines [17], specialists manually annotate each 30 s PSG epoch into specific stages. The R&K rules define six sleep stages: wakefulness (Wake), four NREM stages (S1–S4), and REM. Stage S1 represents light sleep, S2 marks a transitional phase, and S3–S4 correspond to deep sleep, also termed slow-wave sleep (SWS). The REM stage, characterized by rapid ocular movements, muscle atonia, and low-voltage mixed-frequency EEG that resembles wakefulness, accounts for 20–25% of adult sleep. In 2007, the AASM merged stages S3 and S4 into a single deep sleep stage (N3), as illustrated in Figure 1.
Distinct EEG waveforms characterize each sleep stage, as summarized in Table 1. Wakefulness is dominated by β waves, whereas α waves appear during eye closure or meditative states. The transition from the wakefulness stage to light sleep (N1) involves attenuation of α activity and emergence of low-amplitude θ waves. The N1 stage is brief and transitions to N2, which is identified by sleep spindles and K-complexes. Subsequent reductions in cortical activity lead to deep sleep (N3), characterized by high-amplitude δ waves. Finally, REM sleep is distinguished by sawtooth waves in the EEG and rapid ocular oscillations detectable via EOG. Representative time-domain traces for each stage are shown in Figure 2.

1.2. Related Work

Traditional sleep stage classification depends on expert visual interpretation of PSG data, which is labor-intensive, time-consuming, costly, and vulnerable to inter- and intra-rater variability that often produces inconsistent results [18]. This manual approach has proven inefficient for large-scale or real-time applications [19].
To address these challenges, researchers have developed automated sleep stage classification methods that leverage advanced EEG signal processing and machine learning techniques. These methods reduce dependence on manual feature engineering and mitigate the logistical constraints associated with PSG setups [20]. Recent studies have shown particular interest in single-channel EEG augmented with adaptive noise-reduction techniques, a strategy that supports scalable deployment while preserving clinical accuracy [21].
Automated approaches fall into two broad categories. Traditional approaches depend on manually engineered features paired with supervised classifiers such as support vector machine (SVM) and random forest classifiers, whereas deep learning employs neural networks to autonomously extract PSG-derived features through end-to-end learning frameworks.
Early studies adhering to the traditional paradigm emphasized manual feature engineering combined with classical algorithms. For example, Al-Salman et al. [22] applied wavelet transforms to extract time-frequency features from EEG signals, while Abdulla et al. [23] modeled EEG epochs as correlation networks integrated with community detection algorithms. Satapathy et al. [24] enhanced feature optimization by coupling the ReliefF algorithm with an AdaBoost-augmented random forest classifier. Subsequent efforts explored feature refinement strategies, including sliding window statistical mapping [25], principal component analysis (PCA)-based dimensionality reduction [26], and frequency-domain feature extraction using band-pass filters [27]. Despite their interpretability, these methods suffered from limited generalizability due to their reliance on domain-specific expertise, ultimately hindering scalability in large-scale applications.
Deep learning has advanced sleep stage classification by automating feature extraction. Early innovations, such as SleepEEGNet [28], leveraged convolutional neural networks (CNNs) to capture time-invariant spatial patterns in EEG data, while Humayun et al. [29] demonstrated the effectiveness of residual CNNs for raw EEG signal processing. To address temporal dependencies, SeqSleepNet [30] proposed hierarchical recurrent architectures, while CSCNN-HMM [31] incorporated hidden Markov models (HMMs) to model state transitions. Heng et al. [32] further advanced temporal modeling with bidirectional gated recurrent units (GRUs) enhanced by attention mechanisms. Subsequent studies refined feature aggregation: AttnSleep [33] fused multi-resolution CNNs with transformer encoders, and XSleepNet [34] dynamically integrated time-frequency representations. Recent developments include transformer-in-transformer architectures [35] for joint local–global dependency learning and multi-task frameworks incorporating channel attention modules [36]. While CNNs and RNNs excel at spatial and temporal feature extraction, hybrid models often face trade-offs between computational efficiency and temporal granularity. Attention mechanisms enhance feature prioritization but necessitate large datasets to prevent overfitting.
Recent research has increasingly focused on multichannel data fusion and transformer-based architectures. MultiChannelSleepNet [37] employed transformer encoders with layer normalization to integrate cross-channel features. Kumar et al. [38] addressed class imbalance through supervised contrastive learning paired with self-attention mechanisms, while Zhao et al. [36] improved cross-dataset generalization using multi-task learning. Discrete Cosine Transform (DCT)-augmented EEGNet [39] optimized multichannel analysis by enhancing spectral domain features. Despite their advantages, multichannel methods escalate computational costs and necessitate strategies to mitigate channel redundancy, highlighting the need for efficient feature selection mechanisms.
Earlier approaches, in contrast to modern hybrid systems, often employed architectures specialized for either spatial or temporal feature extraction. Models such as TimeNet and Temporal Convolutional Networks (TCNs) prioritized convolutional operations for local pattern recognition but lacked mechanisms to capture long-range dependencies. Conversely, BiLSTM-based frameworks, exemplified by SleepEEGNet [28], excelled at sequence modeling but had limited ability to integrate frequency-domain features due to their sequential nature. Even hybrid systems like CSCNN-HMM [31], which integrated HMMs to model sleep stage transitions, omitted dynamic feature weighting via attention mechanisms. Similarly, AttnSleep [33] introduced attention-based aggregation but relied on a dual-branch architecture, potentially limiting its ability to holistically model multi-resolution time-frequency interactions.
To overcome these limitations, this study introduces novel advancements in feature extraction, temporal dependency modeling, and transformer-based architecture design to improve representation learning for sleep stage classification. Departing from existing approaches that treat time- and frequency-domain features in isolation or employ restrictive architectures, our model employs parallel convolutional branches to holistically capture diverse temporal and spectral patterns. Specifically, a Fast Fourier Transform (FFT)-based spectral analysis branch and multi-scale convolutional branches are designed to extract complementary time-frequency features from EEG epochs, enhancing detection of fine-grained signal characteristics. For temporal modeling, the framework integrates a lightweight transformer module with a bidirectional long short-term memory (BiLSTM) network, forming a hybrid encoder. This architecture synergistically combines convolutional operations for local feature extraction, sequential modeling for long-range dependencies, and attention mechanisms for dynamic feature weighting. The BiLSTM component further stabilizes gradient propagation during training while bolstering classification robustness.
The main contributions of this work are as follows:
1. Multi-scale temporal–spectral fusion: a hybrid architecture integrating FFT-based spectral analysis with multi-resolution convolutional branches was developed, achieving comprehensive temporal–spectral feature integration through physiologically aligned kernel configurations.
2. Hybrid long-range dependency modeling: bidirectional LSTM networks were synergistically combined with transformer self-attention mechanisms, establishing hierarchical temporal context modeling from local sleep transitions to global stage progression patterns.
3. Unified dynamic feature weighting: cross-domain attention mechanisms were implemented to dynamically recalibrate feature significance across concatenated temporal–spectral representations, enabling adaptive fusion of multi-modal sleep characteristics.
The remainder of this paper is organized as follows: Section 2 details the methodology, including the proposed architecture and its operational principles. Section 3 describes experimental protocols, datasets, and results. Section 4 provides a comparative analysis against state-of-the-art methods and discusses performance outcomes. Finally, Section 5 synthesizes key conclusions, evaluates clinical and technical implications, and outlines future research directions.

2. Materials and Methods

2.1. Methodology

2.1.1. Model Framework

Figure 3 illustrates the proposed architecture consisting of three sequential modules: (i) multi-scale feature extraction, (ii) sequence context encoding, and (iii) classification. The network accepts a 30 s EEG epoch sampled at 100 Hz (input dimension of 3000 × 1) and outputs one-hot encoded labels for five sleep stages: Wake (W), N1, N2, N3, and REM.
The feature extractor employs a multi-scale convolutional neural network (MSCNN) to capture both time-domain patterns and frequency-domain characteristics. To address the limitations of CNNs in modeling temporal relationships, the sequence context encoder (SCE) integrates a bidirectional LSTM (BiLSTM) layer with a multi-head attention (MHA) mechanism. This combination facilitates the learning of hierarchical, long-range dependencies. The final classifier consists of a fully connected layer followed by temperature-scaled softmax activation. Training uses focal loss with label smoothing to counter class imbalance. Implementation details for each block follow.

2.1.2. Feature Extraction

Multiscale Convolutional Neural Network
The proposed MSCNN architecture integrates both time-domain and frequency-domain representations to comprehensively model sleep EEG dynamics. As shown in Figure 4a, the network consists of five parallel branches: four time-domain convolutional branches targeting specific neurophysiological frequency bands, and one frequency-domain branch for spectral analysis using the discrete Fourier transform (DFT). Raw EEG signals x(t) are transformed into magnitude spectra using DFT:
$$X_k = \sum_{n=0}^{N-1} x_n \exp\!\left(-\frac{j 2\pi k n}{N}\right), \quad k = 0, 1, \ldots, N-1$$

where $X_k$ represents the frequency component at index $k$, and $N$ is the number of samples in a 30 s epoch sampled at 100 Hz (input dimension $\mathbb{R}^{3000 \times 1}$, so $N = 3000$). These magnitude spectra are then processed by 1D convolutional layers to extract stage-related frequency features.
The time-domain branches employ kernel sizes of 200, 25, 13, and 8 samples to align with the characteristic oscillatory periods of the δ (0.5–4 Hz), θ (4–8 Hz), α (8–13 Hz), and β (13–30 Hz) bands, respectively. Given a sampling frequency $f_s$ (Hz), a kernel of length $K$ spans a temporal window $T = K/f_s$; since $f = 1/T$, the lowest frequency component that can be stably characterized by that kernel is approximated by $f_{\min} \approx f_s/K$. For $f_s = 100$ Hz, the four kernels yield $f_{\min} \approx 0.50, 4.00, 7.69, 12.5$ Hz, targeting frequency components corresponding to the δ, θ, α, and β bands. Each branch thus acts as a learnable band-pass filter tuned to physiologically meaningful oscillations. Each branch further comprises two convolutional blocks (1D convolution, batch normalization, ReLU, spatial dropout, and max pooling), which enhance local temporal discrimination while suppressing noise and redundancy. Formally, the convolution operation in branch $i$ can be expressed as
$$y_i(t) = \sigma\big((x * w_i)(t) + b_i\big), \quad i \in \{\delta, \theta, \alpha, \beta, \mathrm{DFT}\}$$

where $x$ is the input EEG sequence, $w_i$ is the branch-specific kernel, $b_i$ is the bias term, and $\sigma(\cdot)$ denotes the activation function. The fused multiscale representation is obtained by channel-wise concatenation,

$$Y_{\mathrm{MSCNN}} = \mathrm{Concat}\big(y_\delta, y_\theta, y_\alpha, y_\beta, y_{\mathrm{DFT}}\big)$$

which preserves complementary sub-band information for downstream sequence modeling. This architecture optimizes localized temporal feature extraction while suppressing high-frequency noise. In practice, Fast Fourier Transform (FFT) acceleration lowers complexity from $O(N^2)$ to $O(N \log N)$ [40]. Max pooling reduces spatial dimensions for efficiency [41], while spatial dropout mitigates overfitting [42]. Finally, multi-scale fusion concatenates all branches along the channel axis to produce the feature map $Y_{\mathrm{MSCNN}} \in \mathbb{R}^{750 \times 640}$.
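To make the branch layout concrete, the following is a minimal TensorFlow/Keras sketch of the five-branch extractor. The filter count of 128 per branch, the pooling sizes, and the dropout rate are illustrative assumptions chosen so that the fused map matches the reported 750 × 640 shape; they are not the authors' published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_branch(x, kernel_size, filters=128):
    # Two blocks of Conv1D -> BatchNorm -> ReLU -> spatial dropout -> 2x max pooling.
    # With kernel length K at fs = 100 Hz, a branch resolves components down to
    # roughly f_min ~ fs/K (K = 200, 25, 13, 8 -> 0.50, 4.00, 7.69, 12.5 Hz).
    for _ in range(2):
        x = layers.Conv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.SpatialDropout1D(0.1)(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    return x                                   # (batch, 750, filters) from 3000 samples

inp = layers.Input(shape=(3000, 1))            # one 30 s epoch at 100 Hz

# Four time-domain branches with band-aligned kernel sizes (delta, theta, alpha, beta).
time_branches = [conv_branch(inp, k) for k in (200, 25, 13, 8)]

# Frequency-domain branch: magnitude spectrum via the real FFT, then 1D convolution.
mag = layers.Lambda(
    lambda t: tf.abs(tf.signal.rfft(tf.squeeze(t, axis=-1)))[..., tf.newaxis]
)(inp)                                         # (batch, 1501, 1) magnitude spectrum
freq = layers.Conv1D(128, 8, padding="same", activation="relu")(mag)
freq = layers.MaxPooling1D(pool_size=2)(freq)  # (batch, 750, 128): aligns with time branches

fused = layers.Concatenate(axis=-1)(time_branches + [freq])   # (batch, 750, 640)
```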
Squeeze-and-Excitation (SE) Mechanism
To address channel redundancy in the fused multi-scale representation, a squeeze-and-excitation (SE) block is applied after concatenating the branches. Given an input feature map $U \in \mathbb{R}^{T \times C}$, where $T = 750$ is the number of time steps, $C = 640$ is the number of feature channels, and $U$ is $Y_{\mathrm{MSCNN}}$, the SE module comprises three stages: squeeze, excitation, and recalibration. In the squeeze step, temporal information is aggregated into a channel descriptor $Z$ through global average pooling along the temporal axis:

$$z_c = \frac{1}{T} \sum_{t=1}^{T} u_c(t), \quad Z \in \mathbb{R}^{C}$$

where $z_c$ is the descriptor of channel $c$. During excitation, the network learns adaptive channel weights via a two-layer fully connected gating mechanism with a reduction ratio $r = 16$:

$$S = \sigma\big(W_2 \cdot \delta(W_1 \cdot Z)\big), \quad W_1 \in \mathbb{R}^{C/r \times C}, \; W_2 \in \mathbb{R}^{C \times C/r}$$

where $\delta(\cdot)$ and $\sigma(\cdot)$ represent ReLU and sigmoid activations, respectively. In the recalibration stage, the weights $S$ are applied elementwise to the original feature map to produce the rescaled output $\tilde{U}$:

$$\tilde{u}_c(t) = s_c \cdot u_c(t), \quad \tilde{U} \in \mathbb{R}^{375 \times 128}$$
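A minimal Keras sketch of the SE block follows, assuming the reduction ratio $r = 16$ given above. Note that recalibration itself preserves the input shape; the 375 × 128 size reported above would result from additional downsampling outside this sketch.

```python
from tensorflow.keras import layers

def squeeze_excite(u, reduction=16):
    # Squeeze: global average pooling over time yields one descriptor per channel.
    c = u.shape[-1]
    z = layers.GlobalAveragePooling1D()(u)                   # (batch, C)
    # Excitation: two-layer gating (W1 + ReLU, then W2 + sigmoid) learns channel weights.
    s = layers.Dense(c // reduction, activation="relu")(z)   # (batch, C/r)
    s = layers.Dense(c, activation="sigmoid")(s)             # (batch, C)
    # Recalibration: broadcast the weights over time and rescale each channel.
    s = layers.Reshape((1, c))(s)
    return layers.Multiply()([u, s])

u = layers.Input(shape=(750, 640))   # fused multi-scale feature map Y_MSCNN
u_tilde = squeeze_excite(u)          # same shape, channel-reweighted
```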

2.1.3. Sequence Context Encoder

The encoder combines BiLSTM units with an MHA block. BiLSTM propagates hidden states in both forward and backward directions, yielding context-rich representations of the input sequence. MHA then assigns adaptive importance weights by means of scaled dot product operations. Residual links and subsequent layer normalization help to maintain stable gradients. Because this design integrates multi-scale features and attention-guided temporal context within a single module, no separate decoder is required and end-to-end sequence learning is achieved.
Bidirectional LSTM Unit
To model long-range sleep stage transitions, we implement a BiLSTM network, which contains four functional components: input layer, forward and backward hidden layers, cell state, and output layer, as shown in Figure 4c. This bidirectional design processes input sequences through complementary temporal passes, allowing simultaneous learning of forward and reverse contextual relationships. The hidden states are computed as follows:
$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}\big(x_t, \overrightarrow{h_{t-1}}\big)$$

$$\overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}\big(x_t, \overleftarrow{h_{t+1}}\big)$$

$$h_t = \mathrm{Concat}\big(W_{\overrightarrow{h}}\,\overrightarrow{h_t}, \; W_{\overleftarrow{h}}\,\overleftarrow{h_t}\big) + b_t$$

where $x_t$ is the input vector at timestep $t$ (the input sequence lies in $\mathbb{R}^{375 \times 64}$), and $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ denote forward and backward hidden states, respectively. The sequence length equals 375 samples and the hidden dimension equals 64. Compared with bidirectional gated recurrent units, BiLSTM can model longer dependencies because its three-gate structure (input, forget, and output gates) exercises finer control over information flow, which is suitable for capturing polysomnographic patterns [43].
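As a sketch, the bidirectional layer below assumes 32 LSTM units per direction so that the concatenated hidden state is 64-dimensional, matching the $\mathbb{R}^{375 \times 64}$ sequence described above; the per-direction unit count is an assumption, as the text only states the combined dimension.

```python
from tensorflow.keras import layers

x = layers.Input(shape=(375, 64))    # encoded feature sequence
# return_sequences=True keeps per-timestep hidden states for the attention
# block that follows; merge_mode="concat" joins forward and backward states.
seq = layers.Bidirectional(
    layers.LSTM(32, return_sequences=True), merge_mode="concat"
)(x)                                 # (batch, 375, 64)
```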
Attention Mechanism
To refine feature representations after multi-scale convolutional extraction and BiLSTM-based temporal modeling, we implement a multi-head attention module. The BiLSTM output matrix $X \in \mathbb{R}^{375 \times 64}$ is mapped into query ($Q$), key ($K$), and value ($V$) matrices through learnable linear transformations. For each of the $h_a$ parallel attention heads (8 heads in the proposed model), scaled dot-product attention calculates alignment scores:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
The outputs of all attention heads are concatenated and linearly projected to synthesize multi-head representations:

$$\mathrm{MHA}(X) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_{h_a}\big)\, W^{O}, \quad W^{O} \in \mathbb{R}^{h_a d_v \times d}$$

This design enables attention across multiple subspaces, allowing dynamic prioritization of salient temporal features and modeling of complex dependencies between time steps [44]. To stabilize training, residual connections combine the original BiLSTM output $X$ with the MHA output $\mathrm{MHA}(X)$, followed by layer normalization:

$$X_{\mathrm{norm}} = \mathrm{LayerNorm}\big(X + \mathrm{MHA}(X)\big)$$

Finally, global average pooling aggregates temporal features from $X_{\mathrm{norm}}$ into sequence-level representations, which are processed by a dropout-regularized fully connected layer for classification.
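The attention stage can be sketched in Keras as follows, assuming key_dim = 8 so that the 8 heads jointly span the 64-dimensional state (the per-head dimension is not specified in the text).

```python
from tensorflow.keras import layers

seq = layers.Input(shape=(375, 64))   # BiLSTM output X
# Self-attention: queries, keys, and values are all projections of seq.
attn = layers.MultiHeadAttention(num_heads=8, key_dim=8)(seq, seq)
# Residual connection plus layer normalization, as in the equation for X_norm.
x_norm = layers.LayerNormalization()(layers.Add()([seq, attn]))
# Global average pooling collapses the temporal axis to a sequence-level vector.
pooled = layers.GlobalAveragePooling1D()(x_norm)   # (batch, 64)
```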

2.1.4. Classification

Following sequence encoding, the normalized feature matrix $X_{\mathrm{norm}} \in \mathbb{R}^{375 \times 64}$ is produced by multi-head attention with residual fusion and then forwarded to the classification module. A GlobalAveragePooling1D layer collapses the temporal dimension, producing a fixed-length vector that condenses the entire sequence context. This vector enters a multilayer perceptron (MLP) with one or more fully connected layers, each followed by dropout to mitigate overfitting. Temperature scaling is applied to the resulting logits to improve probability reliability without altering class rankings, and a subsequent softmax activation converts the calibrated logits into class probabilities. The network finally outputs a one-hot vector that indicates one of the five sleep stages.
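A minimal sketch of this classification head follows; the hidden width, dropout rate, and temperature value are illustrative assumptions rather than the authors' tuned settings.

```python
from tensorflow.keras import layers

TAU = 1.5                                    # assumed temperature; tuned in practice

pooled = layers.Input(shape=(64,))           # pooled sequence representation
h = layers.Dense(64, activation="relu")(pooled)
h = layers.Dropout(0.3)(h)                   # dropout-regularized MLP layer
logits = layers.Dense(5)(h)                  # one logit per sleep stage
scaled = layers.Lambda(lambda z: z / TAU)(logits)   # temperature scaling
probs = layers.Softmax()(scaled)             # calibrated class probabilities
```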

2.1.5. Optimization Strategies

To improve robustness on an imbalanced dataset and to stabilize training, two complementary techniques are incorporated: (1) focal loss with label smoothing, and (2) a cosine annealed learning rate schedule. Together, these measures enhance convergence speed, confidence calibration, and overall classification accuracy.
Learning Rate Schedule
Training uses Adam [45] with a cosine annealing strategy [46]. Adam's adaptive moment estimates reduce hyper-parameter sensitivity, while cosine annealing periodically resets the step size, helping the optimizer escape suboptimal minima. The learning rate at epoch $e$ is

$$LR(e) = \frac{\eta_0}{2}\left(1 + \cos\!\left(\pi \cdot \frac{e}{N_{\mathrm{epochs}}}\right)\right)$$

where $\eta_0$ is the initial learning rate and $N_{\mathrm{epochs}}$ denotes the total number of training epochs.
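The schedule can be implemented as a Keras callback, as in the sketch below; η₀ = 1e-3 matches the stated initial learning rate, while n_epochs = 200 is borrowed from the convergence figure reported in the ablation study and is otherwise an assumption.

```python
import math
import tensorflow as tf

def make_cosine_schedule(eta0=1e-3, n_epochs=200):
    # LR(e) = eta0/2 * (1 + cos(pi * e / N_epochs))
    def schedule(epoch, lr):
        return eta0 / 2.0 * (1.0 + math.cos(math.pi * epoch / n_epochs))
    return schedule

scheduler = tf.keras.callbacks.LearningRateScheduler(make_cosine_schedule())
# Passed to model.fit(..., callbacks=[scheduler]) to update the rate each epoch.
```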
Loss Function
To counteract class imbalance, we adopt a focal loss function with label smoothing, as introduced by Lin et al. [47]:
$$\mathrm{Loss}\big(y_{\mathrm{true}}, y_{\mathrm{pred}}\big) = -\sum_{i=1}^{N} y_{\mathrm{true},i}^{\mathrm{smoothed}} \cdot \log\!\big(y_{\mathrm{pred},i}\big) \cdot \alpha \big(1 - y_{\mathrm{pred},i}\big)^{\gamma}$$

where $y_{\mathrm{true}}^{\mathrm{smoothed}}$ represents the smoothed ground-truth labels, and $\alpha$ and $\gamma$ are hyperparameters controlling the contribution of hard-to-classify examples. Setting $\gamma > 0$ reduces the loss weight for well-classified samples, directing focus toward misclassified instances. By dynamically downweighting easy samples, this formulation mitigates bias toward majority classes while improving generalization.
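A sketch of this loss for one-hot targets is shown below; the values of α, γ, and the smoothing factor are assumed defaults, not the paper's tuned settings.

```python
import tensorflow as tf

def focal_loss_with_smoothing(alpha=0.25, gamma=2.0, smoothing=0.1, n_classes=5):
    # alpha, gamma, and smoothing are assumed values for illustration.
    def loss(y_true, y_pred):
        y_true = y_true * (1.0 - smoothing) + smoothing / n_classes  # smooth labels
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)                 # avoid log(0)
        focal = -y_true * alpha * tf.pow(1.0 - y_pred, gamma) * tf.math.log(y_pred)
        return tf.reduce_sum(focal, axis=-1)    # sum over classes, per sample
    return loss

# model.compile(optimizer="adam", loss=focal_loss_with_smoothing(), metrics=["accuracy"])
```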
Softmax Activation
Output probabilities are calibrated using a temperature parameter τ in the softmax activation:
$$y_{\mathrm{pred},i} = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{N} \exp(z_j / \tau)}$$

where $z_i$ denotes the logit for class $i$. Temperature scaling preserves class rankings while smoothing overconfident predictions, enhancing reliability in imbalanced scenarios.

2.2. Experiments

2.2.1. Datasets

The proposed model was evaluated on three benchmark datasets: the original PhysioNet Sleep-EDF [48], its expanded version, Sleep-EDF Expanded (Sleep-EDFx) [49], and the Sleep Heart Health Study (SHHS) [50,51]. The Sleep-EDF and Sleep-EDFx datasets comprise 61 and 197 full-night PSG recordings, respectively, with the expanded version organized into two cohorts:
  • Sleep Cassette (SC): 153 recordings from medication-free Caucasian adults (age range 25 to 101 years) collected during a 1987–1991 observational study investigating age-related sleep physiology;
  • Sleep Telemetry (ST): 44 recordings from 22 healthy Caucasian participants (gender-balanced) acquired during a 1994 pharmacological study analyzing the impact of temazepam on sleep macroarchitecture.
Each recording includes two complementary data modalities:
  • PSG Files: These files contain full-night neurophysiological signals, including bipolar EEG channels (Fpz–Cz and Pz–Oz), horizontal electrooculography (EOG), submental electromyography (EMG), and event markers;
  • Hypnogram Files: These files include expert-curated sleep stage annotations aligned with the PSG data, covering Wake (W), REM, NREM stages 1–4 (S1–S4), Movement Time (M), and unscored segments.
The SHHS dataset includes 6441 subjects with various conditions, including pulmonary, cardiovascular, and coronary diseases.
All hypnograms were manually scored by certified sleep experts following R&K rules [16]. In line with AASM guidelines [17] and prior research [28], we merged S3 and S4 into a single SWS class and excluded M and unscored segments from analysis.

2.2.2. Data Preprocessing

Single-channel EEG from the frontopolar–central (Fpz–Cz; Sleep-EDF/EDFx) or central–mastoid (C4–A1; SHHS) derivation was segmented into 30 s epochs, as shown in Figure 5; at sampling rates of 100 Hz and 125 Hz, each epoch contained 3000 and 3750 samples, respectively. The Fpz–Cz channel was selected primarily for two reasons. First, this choice ensures consistency with widely adopted baseline studies in sleep staging (e.g., [28,33]), which supports fair and reproducible comparisons. Second, as a bipolar derivation, Fpz–Cz captures broad cortical activity with high signal quality and minimal muscle artifact interference [15,49], making it a representative and practical choice for single-channel EEG analysis. Epoch labels were mapped to five classes: W: 0, N1: 1, N2: 2, N3: 3, and REM: 4. Beyond segmentation, no additional preprocessing (e.g., filtering, artifact removal, or normalization) was applied; the raw segmented signals were used directly to preserve the original physiological characteristics and to align with the data format of the referenced public datasets. The class distribution for all three datasets is summarized in Table 2.
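The segmentation and label mapping amount to the following NumPy sketch; the hypnogram-parsing step that produces the per-epoch stage strings is assumed and not shown.

```python
import numpy as np

STAGE_MAP = {"W": 0, "N1": 1, "N2": 2, "N3": 3, "REM": 4}

def segment_epochs(eeg, stage_labels, fs=100, epoch_sec=30):
    """Slice one continuous single-channel recording into labeled 30 s epochs.

    eeg          : 1-D array of raw samples (no filtering or normalization)
    stage_labels : one stage string per 30 s epoch, e.g. ["W", "N1", ...]
    """
    n = fs * epoch_sec                               # 3000 samples at 100 Hz
    n_epochs = min(len(eeg) // n, len(stage_labels))
    x = eeg[: n_epochs * n].reshape(n_epochs, n, 1)  # (epochs, 3000, 1)
    y = np.array([STAGE_MAP[s] for s in stage_labels[:n_epochs]])
    return x, y
```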

2.2.3. Experimental Setup

The framework was implemented in Python 3.8.0 using TensorFlow. To address inherent class imbalance (particularly the underrepresented N1, REM, and N3 stages), we integrated focal loss with label smoothing and class weighting during training. All splits were performed at the subject level to prevent any overlap of subjects across training, validation, and test sets. For Sleep-EDFx, both nights from the same subject were kept together in the same split. We first grouped all 30 s epochs by subject (and by recording/night when applicable), then sampled 80%/10%/10% of subjects for the training/validation/test sets using a fixed random seed (42). Training used the Adam optimizer with an initial learning rate of 0.001 and a batch size of 128. Each input epoch had dimensions of 3000 × 1, giving batch input dimensions of 128 × 3000 × 1.
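A sketch of the subject-wise split described above is given below; the exact shuffling procedure is an assumption, though the 80/10/10 ratios and seed 42 follow the text.

```python
import numpy as np

def subject_wise_split(subject_ids, train=0.8, val=0.1, seed=42):
    """Return boolean epoch masks such that no subject spans two splits."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)     # both nights of a subject share one ID
    rng.shuffle(subjects)
    n = len(subjects)
    n_tr, n_va = int(train * n), int(val * n)
    parts = {"train": subjects[:n_tr],
             "val":   subjects[n_tr:n_tr + n_va],
             "test":  subjects[n_tr + n_va:]}
    return {k: np.isin(subject_ids, v) for k, v in parts.items()}
```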
To ensure a fair and meaningful comparison with existing state-of-the-art models, we adhered to the following consistent experimental protocols throughout our study:
  • Datasets and Channels: All experiments were conducted on the standard versions of the Sleep-EDF, Sleep-EDFx, and SHHS datasets, employing the single-channel Fpz–Cz or C4–A1 derivation. This choice aligns with the most commonly adopted settings in the literature against which we compare.
  • Data Preprocessing: We applied identical preprocessing steps: segmentation into 30 s epochs, no additional filtering or artifact removal, and mapping of sleep stages to five classes (W, N1, N2, N3, REM) in accordance with AASM guidelines.
  • Data Splitting: A subject-wise split strategy was strictly followed to prevent data leakage, ensuring that all epochs from the same subject were contained within either the training, validation, or test set.
While minor, uncontrolled variations in evaluation protocols (e.g., specific random seeds for splitting) may exist across different studies, the performance improvements reported in this work are primarily attributed to the proposed architectural advances under this aligned experimental framework.

2.2.4. Evaluation Metrics

To comprehensively evaluate the performance of the proposed method, we used multiple evaluation metrics, including overall accuracy (ACC), precision (PR), recall (RE), Cohen’s kappa coefficient (κ), and the F1 score, defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%$$

$$PR = \frac{TP}{TP + FP}$$

$$RE = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times PR \times RE}{PR + RE}$$

$$\kappa = \frac{P_0 - P_e}{1 - P_e}$$
True Positive (TP) refers to the sleep stage instances correctly classified by the model. True Negative (TN) refers to instances correctly identified as not belonging to a specific sleep stage. False Positive (FP) refers to instances incorrectly predicted as belonging to a sleep stage, while False Negative (FN) refers to instances misclassified as other sleep stages.
ACC provides global classification performance, while PR and RE quantify prediction reliability and stage-specific detection capability, respectively. The F1-score balances these metrics, particularly critical for imbalanced data. κ evaluates agreement between model predictions and expert annotations, with κ > 0.6 indicating strong concordance based on the Landis–Koch criteria [52]. These metrics objectively assess alignment with ground-truth hypnograms by treating the model as an automated rater.
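For reference, these metrics can be computed with scikit-learn as in the sketch below, assuming integer-encoded labels as defined in Section 2.2.2.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, f1_score)

def evaluate(y_true, y_pred):
    """Headline metrics from integer stage labels (0 = W ... 4 = REM)."""
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.3f}")
    print(f"Macro-F1 : {f1_score(y_true, y_pred, average='macro'):.3f}")
    print(f"Kappa    : {cohen_kappa_score(y_true, y_pred):.3f}")
    # Per-stage precision (PR), recall (RE), and F1, as in the equations above.
    print(classification_report(y_true, y_pred,
                                target_names=["W", "N1", "N2", "N3", "REM"]))
```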

3. Results

3.1. Classification Performance of the Proposed Model

The performance of the proposed model was evaluated using class-wise evaluation metrics and confusion matrices, as presented in Table 3, Table 4 and Table 5. The model achieved an overall accuracy of 85.6% on the Sleep-EDFx dataset (Fpz–Cz EEG channel), 88.6% on the Sleep-EDF dataset, and 84.6% on the SHHS dataset.
To provide a more comprehensive view, the confusion matrix for the Fpz–Cz EEG channel on the Sleep-EDFx dataset, along with probability matrices for both the training and testing sets, is shown in Figure 6. The diagonal elements represent True Positive (TP) counts, and the matrices illustrate consistency in stage-wise classification across datasets.
The model achieves high recall rates of over 85% for the Wake, N2, N3, and REM stages, confirming its effectiveness in capturing the major electrophysiological signatures of these sleep states. Notably, Wake and N2 stages maintain high generalization performance, with training recall values of 96.7% and 91.4%, and testing recall values of 95.1% and 88.7%, respectively. The recall for N3 declines slightly from 87.6% in training to 82.7% in testing, but remains sufficient for clinically relevant detection of slow-wave sleep. The REM stage shows a more noticeable decrease, from 92.4% to 85.3%, possibly due to inter-individual variability or the inherent limitations of single-channel EEG in capturing phasic REM features.
The N1 stage continues to pose the greatest classification difficulty, as evidenced by a recall drop from 61.8% in the training set to 51.0% in the testing set. Misclassifications predominantly occur with adjacent stages, with N2 and Wake accounting for 20% and 12% of errors, respectively, and REM contributing another 7%. These results underscore the inherent ambiguity of N1, whose mixed-frequency EEG features overlap with both light sleep and REM. This overlap complicates the classification and is consistent with findings in the polysomnographic literature [18].
Table 6 presents a comprehensive performance comparison between the proposed model MultiScaleSleepNet and established baselines across experimental datasets. Three key observations can be drawn from this comparison.
First, MultiScaleSleepNet achieves consistently higher accuracy while using only single-channel EEG input. It outperforms conventional sequence-to-sequence architectures such as SleepEEGNet [28], XSleepNet2 [34], and AttnSleep [33] across both large-scale (Sleep-EDFx) and small-scale (Sleep-EDF) datasets. Notably, it improves overall accuracy over SleepEEGNet by 5.6% on Sleep-EDFx and 4.4% on Sleep-EDF, confirming its robustness across varying dataset sizes.
Second, the model demonstrates consistent superiority across all evaluation metrics. On Sleep-EDFx, MultiScaleSleepNet achieves relative improvements of 1.6%, 4.3%, 2.9%, 1.0%, 0.6%, and 2.2% in accuracy over XSleepNet2, AttnSleep, SleepContextNet, CSCNN-HMM, MultiChannelSleepNet, and Multi-Task Learning, respectively. On Sleep-EDF, the advantages become more pronounced, with gains of 2.3%, 4.2%, 3.8%, 1.4%, and 3.0% over the same models (excluding CSCNN-HMM, which lacks reported results for this dataset). Moreover, despite using only one EEG channel, MultiScaleSleepNet surpasses MultiChannelSleepNet, which relies on three-channel input, and achieves the highest macro-F1 scores and kappa coefficients on both datasets.
Third, the class-wise analysis reveals exceptional capability in addressing data imbalance. MultiScaleSleepNet achieves optimal class-wise F1-scores across all sleep stages, particularly excelling in underrepresented stage N1. These results highlight the dual effectiveness of the proposed architecture in mitigating data imbalance while sustaining high overall classification performance.

3.2. Hyperparameter Tuning

Hyperparameter optimization played a crucial role in performance enhancement. Sensitivity analysis was performed on dropout rate, learning rate, and batch size to optimize performance. The best configuration (dropout = 0.3, learning rate = 1 × 10−3, batch size = 128) achieved the highest accuracy, macro-F1, and kappa across datasets.
We first examined dropout rates (0.2, 0.3, 0.5) and initial learning rates (1 × 10−3, 1 × 10−4) through repeated trials. As shown in Figure 7, the optimal configuration was a dropout rate of 0.3 and a learning rate of 0.001, achieving peak values for accuracy (85.5%), macro-F1 score (0.82), and kappa coefficient (0.80). A higher dropout of 0.5 led to underfitting and longer training times (an increase of approximately 15%), due to excessive suppression of temporal features.
The impact of batch size on training dynamics was examined next. As illustrated in Figure 8, larger batch sizes, particularly 128, led to faster convergence, lower training loss, and improved accuracy. The superior performance of this configuration is likely due to smoother gradient updates that reduce noise during parameter optimization, as well as a better ability to capture long-term temporal dependencies in the time-series data. These advantages contribute to more stable and efficient training, indicating that a batch size of 128 offers an optimal balance between computational efficiency and generalization capability for this task.

3.3. Sleep Stage Prediction Analysis

To assess temporal consistency and classification stability across varying time scales, the model was tested on continuous sequences of 100, 200, and 300 epochs. These sequences were randomly extracted from the continuous recordings of the Sleep-EDFx test set to simulate realistic, arbitrary-length data segments and to evaluate the robustness of our approach without bias towards a specific sequence length. Our model architecture is inherently flexible to variable-length inputs, as it processes each epoch independently through the same feature extraction pipeline without relying on fixed-length temporal context. The final sequence-level accuracy for a given window was then calculated as the proportion of epochs whose predicted labels exactly matched the ground truth across the entire continuous segment.
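This computation reduces to a simple mean over matched labels, as in the sketch below; the window start and length are chosen by the caller.

```python
import numpy as np

def sequence_accuracy(y_true, y_pred, start, length):
    """Proportion of epochs whose predicted label exactly matches ground
    truth over one continuous window of `length` epochs starting at `start`."""
    t = np.asarray(y_true[start:start + length])
    p = np.asarray(y_pred[start:start + length])
    return float(np.mean(t == p))
```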
As illustrated in Figure 9, the predicted labels closely aligned with expert annotations for all window lengths. Accuracy remained high, decreasing slightly from 94% for 100-epoch windows to 91% for 300-epoch windows. Notably, the model achieved 88% accuracy in identifying the N1 stage in 100-epoch segments, demonstrating resilience in detecting this challenging stage.
Error analysis revealed two principal trends. First, 65% of misclassifications occurred at stage boundaries (e.g., N1/N2 and N3/REM), reflecting the ambiguity of transitional epochs and inter-rater variability. Second, frequent confusion between N1 and REM was linked to overlapping alpha wave patterns, common to both stages. These observations suggest that model limitations align with recognized clinical challenges and may be mitigated through improved modeling of transition dynamics.

3.4. Ablation Experiments

To evaluate the impact of each architectural component, ablation studies were conducted on model variants, as shown in Table 7. The baseline BiLSTM configuration achieved an accuracy of 73.2%, a macro-F1 score of 0.600, and kappa coefficient of 0.619, requiring 258 epochs to converge. This performance highlights the limitations of using temporal modeling alone.
Adding a convolutional backbone raised accuracy to 83.0%, a 13.4% relative improvement, while reducing training time to 210 epochs. The combined CNN-BiLSTM model further improved performance to 84.9% accuracy and a macro-F1 score of 0.795, albeit with a slight increase in convergence time to 261 epochs.
Introducing the squeeze-and-excitation (SE) mechanism improved discriminative capability, resulting in 85.5% accuracy and a macro-F1 of 0.800. However, this enhancement extended training to 356 epochs. The final architecture, which incorporates transformer-based attention with an optimized BiLSTM, achieved peak performance (85.6% accuracy, 0.811 macro-F1, and 0.804 kappa coefficient) and converged in only 200 epochs, reducing training time by 43.8% relative to the SE-enhanced model.
These results confirm three key design principles: (i) multi-scale CNNs enable effective hierarchical spectral decomposition, (ii) BiLSTMs capture bidirectional temporal dependencies, and (iii) attention mechanisms dynamically prioritize important features. The final model offers an effective trade-off between representational power and training efficiency.

4. Discussion

MultiScaleSleepNet is a hybrid model comprising 1.9 million parameters that integrates multi-scale convolutional branches, bidirectional LSTM, and lightweight attention for single-channel EEG classification. It achieves high accuracy across the Sleep-EDFx (85.6%), Sleep-EDF (88.6%), and SHHS (84.6%) datasets, with kappa coefficients of 0.80, 0.84, and 0.79, respectively. The model also maintains robust F1-scores across the Wake, N2, N3, and REM stages and significantly improves the classification of the diagnostically challenging N1 stage. The consistent performance on the SHHS dataset, which encompasses a broader and more clinically diverse population, further demonstrates the model’s generalizability and potential utility beyond healthy adult cohorts.
When comparing our results to those of earlier studies, it is important to note that although all leading models use CNN backbones for feature extraction from time-series data, their architectures differ significantly. MultiChannelSleepNet and recent transformer-only models rely on multi-channel recordings or deep attention stacks, which increase hardware costs or degrade sharply when training data fall below 100 nights. In contrast, MultiScaleSleepNet achieves comparable or superior accuracy using only single-channel EEG, suggesting that architectural balance, rather than model size, governs generalization. The 10% and 12% relative F1-score improvements for N1 classification over SleepEEGNet and AttnSleep highlight that its gains arise from architectural design rather than parameter tuning.
Class-wise analysis demonstrates the strong capability of the model in handling data imbalance. MultiScaleSleepNet achieves competitive F1-scores across all sleep stages, with particularly outstanding performance in identifying the underrepresented yet clinically critical N1 stage, and exhibits high precision in distinguishing REM sleep from wakefulness. The improved capability in N1 sleep detection holds significant clinical relevance. Accurate identification of N1 sleep is crucial for diagnosing insomnia, as patients often present with prolonged N1 duration and frequent stage transitions. By utilizing specially optimized theta-band convolutional kernels, our model effectively captures the low-amplitude, mixed-frequency EEG characteristics of N1 sleep, thereby providing a more reliable basis for clinical assessment. Meanwhile, the performance of the model in discriminating REM sleep from wakefulness contributes to the screening of REM sleep behavior disorder (RBD). By integrating both tonic and phasic features of REM sleep, the model significantly reduces misclassification between REM and wakefulness, enabling more accurate association of abnormal motor behaviors with REM sleep rather than wakefulness. These results fully demonstrate the dual strengths of the proposed architecture: overcoming the technical challenges of data imbalance while delivering performance improvements with tangible clinical value.
It is worth emphasizing that although earlier studies often attribute performance degradation on small datasets to model over-parameterization, the proposed model maintains robust accuracy under data-constrained conditions. Moreover, contrary to the assumption that multi-lead inputs are necessary for reliable N1 detection, the proposed multi-scale convolutional design demonstrates effective recovery of inter-electrode information using only a single channel. Although MultiScaleSleepNet incorporates five convolutional branches, the total number of parameters remains limited to 1.9 million, which corresponds to approximately one-third of the parameter count of XSleepNet2 (5.8 million) [34]. This compact architecture enhances its suitability for large-scale and cost-effective deployment in both clinical and home-based sleep monitoring scenarios. In addition, the integration of the discrete Fourier transform within the model framework offers a generalizable preprocessing approach that can be readily extended to other types of biosignals, including audio, EOG, and EMG.

5. Conclusions

In this study, we introduced MultiScaleSleepNet, a hybrid architecture for automatic sleep staging that learns simultaneously from raw EEG signals and their multiscale spectral representations. The network integrates parallel convolutional branches, bidirectional long short-term memory layers, and multi-head attention to capture both local rhythmic patterns and long-range temporal context from a single EEG channel. Designed for robustness across varying dataset sizes, MultiScaleSleepNet effectively balances its architectural components and mitigates overfitting through shared feature normalization and scheduled dropout.
Ablation studies validate the functional contributions of each module: multi-scale convolutional networks enhance spectral discrimination, BiLSTM layers capture bidirectional temporal dependencies, and attention mechanisms enable adaptive feature weighting. Furthermore, the incorporation of focal loss with label smoothing significantly improves classification of underrepresented stages, particularly the challenging N1 stage.
Experimental results on the Sleep-EDF and Sleep-EDFx datasets demonstrate that MultiScaleSleepNet outperforms existing single-channel baseline models. Its efficient inference enables practical deployment on resource-constrained edge devices and wearable sensors. These findings underscore the potential of our framework to facilitate large-scale, cost-effective sleep monitoring and support both clinical and at-home population health applications. Moreover, future work will focus on validating the generalizability of the model on clinical datasets containing patients with sleep disorders to further assess its utility in real-world diagnostic settings.

Author Contributions

Conceptualization, C.L., S.X. and X.D.; methodology, C.L., W.Z. and X.D.; software, W.Z.; validation, C.L., L.S. and Q.G.; formal analysis, L.S.; investigation, C.L., M.W. and Q.G.; resources, X.D.; data curation, C.L. and Q.G.; writing—original draft preparation, C.L.; writing—review and editing, X.D. and S.X.; visualization, C.L. and M.W.; supervision, X.D. and S.X.; project administration, X.D.; funding acquisition, X.D. and S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, the National Natural Science Foundation of China under Grant No. 52376160, the National Natural Science Foundation of China under Grant No. 52006137, and the National Postdoctoral Researchers Program of China under Grant No. GZC20231559.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Code/weights will be made available on request.

Acknowledgments

The authors would like to express their gratitude to the creators and maintainers of the Sleep-EDF [48] and Sleep-EDF Expanded [49] datasets for making their data publicly available, which was essential for the training and evaluation of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cassidy, S.; Chau, J.Y.; Catt, M.; Bauman, A.; Trenell, M.I. Cross-Sectional Study of Diet, Physical Activity, Television Viewing and Sleep Duration in 233 110 Adults from the UK Biobank; the Behavioural Phenotype of Cardiovascular Disease and Type 2 Diabetes. BMJ Open 2016, 6, e010038. [Google Scholar] [CrossRef]
  2. Ramar, K.; Malhotra, R.K.; Carden, K.A.; Martin, J.L.; Abbasi-Feinberg, F.; Aurora, R.N.; Kapur, V.K.; Olson, E.J.; Rosen, C.L.; Rowley, J.A.; et al. Sleep Is Essential to Health: An American Academy of Sleep Medicine Position Statement. J. Clin. Sleep Med. 2021, 17, 2115–2119. [Google Scholar] [CrossRef]
  3. Hale, L.; Troxel, W.; Buysse, D.J. Sleep Health: An Opportunity for Public Health to Address Health Equity. Annu. Rev. Public Health 2020, 41, 81–99. [Google Scholar] [CrossRef]
  4. Tononi, G.; Cirelli, C. Sleep Function and Synaptic Homeostasis. Sleep Med. Rev. 2006, 10, 49–62. [Google Scholar] [CrossRef] [PubMed]
  5. Rasch, B.; Born, J. About Sleep’s Role in Memory. Physiol. Rev. 2013, 93, 681–766. [Google Scholar] [CrossRef]
  6. Stickgold, R.; Walker, M.P. Sleep-Dependent Memory Consolidation and Reconsolidation. Sleep Med. 2007, 8, 331–343. [Google Scholar] [CrossRef]
  7. Matricciani, L.; Paquet, C.; Galland, B.; Short, M.; Olds, T. Children’s Sleep and Health: A Meta-Review. Sleep Med. Rev. 2019, 46, 136–150. [Google Scholar] [CrossRef]
  8. Mayer, G.; Frohnhofen, H.; Jokisch, M.; Hermann, D.M.; Gronewold, J. Associations of Sleep Disorders with All-Cause MCI/Dementia and Different Types of Dementia—Clinical Evidence, Potential Pathomechanisms and Treatment Options: A Narrative Review. Front. Neurosci. 2024, 18, 1372326. [Google Scholar] [CrossRef]
  9. Garbarino, S.; Lanteri, P.; Bragazzi, N.L.; Magnavita, N.; Scoditti, E. Role of Sleep Deprivation in Immune-Related Disease Risk and Outcomes. Commun. Biol. 2021, 4, 1304. [Google Scholar] [CrossRef]
  10. Finan, P.H.; Quartana, P.J.; Remeniuk, B.; Garland, E.L.; Rhudy, J.L.; Hand, M.; Irwin, M.R.; Smith, M.T. Partial Sleep Deprivation Attenuates the Positive Affective System: Effects Across Multiple Measurement Modalities. Sleep 2017, 40, zsw017. [Google Scholar] [CrossRef]
  11. Bishir, M.; Bhat, A.; Essa, M.M.; Ekpo, O.; Ihunwo, A.O.; Veeraraghavan, V.P.; Mohan, S.K.; Mahalakshmi, A.M.; Ray, B.; Tuladhar, S.; et al. Sleep Deprivation and Neurological Disorders. BioMed Res. Int. 2020, 2020, 5764017. [Google Scholar] [CrossRef]
  12. Zhu, B.; Shi, C.; Park, C.G.; Zhao, X.; Reutrakul, S. Effects of Sleep Restriction on Metabolism-Related Parameters in Healthy Adults: A Comprehensive Review and Meta-Analysis of Randomized Controlled Trials. Sleep Med. Rev. 2019, 45, 18–30. [Google Scholar] [CrossRef]
  13. RAND Corporation. The Societal and Economic Burden of Insomnia in Adults: An International Study; RAND Corporation: Santa Monica, CA, USA, 2023. [Google Scholar]
  14. Palmer, C.A.; Bower, J.L.; Cho, K.W.; Clementi, M.A.; Lau, S.; Oosterhoff, B.; Alfano, C.A. Sleep Loss and Emotion: A Systematic Review and Meta-Analysis of over 50 Years of Experimental Research. Psychol. Bull. 2024, 150, 440–463. [Google Scholar] [CrossRef] [PubMed]
  15. Berry, R.B.; Budhiraja, R.; Gottlieb, D.J.; Gozal, D.; Iber, C.; Kapur, V.K.; Marcus, C.L.; Mehra, R.; Parthasarathy, S.; Quan, S.F.; et al. Rules for Scoring Respiratory Events in Sleep: Update of the 2007 AASM Manual for the Scoring of Sleep and Associated Events: Deliberations of the Sleep Apnea Definitions Task Force of the American Academy of Sleep Medicine. J. Clin. Sleep Med. 2012, 8, 597–619. [Google Scholar] [CrossRef]
  16. Rechtschaffen, A.; Kales, A. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects; National Institute of Health: Bethesda, MD, USA, 1968. [Google Scholar]
  17. Berry, R.B.; Brooks, R.; Gamaldo, C.; Harding, S.M.; Lloyd, R.M.; Quan, S.F.; Troester, M.T.; Vaughn, B.V. AASM Scoring Manual Updates for 2017 (Version 2.4). J. Clin. Sleep Med. 2017, 13, 665–666. [Google Scholar] [CrossRef]
  18. Phan, H.; Mikkelsen, K. Automatic Sleep Staging of EEG Signals: Recent Development, Challenges, and Future Directions. Physiol. Meas. 2022, 43, 04TR01. [Google Scholar] [CrossRef] [PubMed]
  19. Chattu, V.K.; Manzar, M.D.; Kumary, S.; Burman, D.; Spence, D.W.; Pandi-Perumal, S.R. The Global Problem of Insufficient Sleep and Its Serious Public Health Implications. Healthcare 2018, 7, 1. [Google Scholar] [CrossRef] [PubMed]
  20. Sun, J.; Cao, R.; Zhou, M.; Hussain, W.; Wang, B.; Xue, J.; Xiang, J. A Hybrid Deep Neural Network for Classification of Schizophrenia Using EEG Data. Sci. Rep. 2021, 11, 4706. [Google Scholar] [CrossRef]
  21. Imtiaz, S.A. A Systematic Review of Sensing Technologies for Wearable Sleep Staging. Sensors 2021, 21, 1562. [Google Scholar] [CrossRef]
  22. Al-Salman, W.; Li, Y.; Oudah, A.Y.; Almaged, S. Sleep Stage Classification in EEG Signals Using the Clustering Approach Based Probability Distribution Features Coupled with Classification Algorithms. Neurosci. Res. 2023, 188, 51–67. [Google Scholar] [CrossRef]
  23. Abdulla, S.; Diykh, M.; Laft, R.L.; Saleh, K.; Deo, R.C. Sleep EEG Signal Analysis Based on Correlation Graph Similarity Coupled with an Ensemble Extreme Machine Learning Algorithm. Expert. Syst. Appl. 2019, 138, 112790. [Google Scholar] [CrossRef]
  24. Satapathy, S.K.; Brahma, B.; Panda, B.; Barsocchi, P.; Bhoi, A.K. Machine Learning-Empowered Sleep Staging Classification Using Multi-Modality Signals. BMC Med. Inf. Decis. Mak. 2024, 24, 119. [Google Scholar] [CrossRef]
  25. Diykh, M.; Li, Y.; Abdulla, S. EEG Sleep Stages Identification Based on Weighted Undirected Complex Networks. Comput. Methods Programs Biomed. 2020, 184, 105116. [Google Scholar] [CrossRef]
  26. Tabar, Y.R.; Mikkelsen, K.B.; Rank, M.L.; Hemmsen, M.C.; Kidmose, P. Investigation of Low Dimensional Feature Spaces for Automatic Sleep Staging. Comput. Methods Programs Biomed. 2021, 205, 106091. [Google Scholar] [CrossRef]
  27. Santaji, S.; Desai, V. Analysis of EEG Signal to Classify Sleep Stages Using Machine Learning. Sleep Vigil. 2020, 4, 145–152. [Google Scholar] [CrossRef]
  28. Mousavi, S.; Afghah, F.; Acharya, U.R. SleepEEGNet: Automated Sleep Stage Scoring with Sequence to Sequence Deep Learning Approach. PLoS ONE 2019, 14, e0216456. [Google Scholar] [CrossRef]
  29. Humayun, A.I.; Sushmit, A.S.; Hasan, T.; Bhuiyan, M.I.H. End-to-End Sleep Staging with Raw Single Channel EEG Using Deep Residual ConvNets. In Proceedings of the 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Chicago, IL, USA, 19–22 May 2019; pp. 1–5. [Google Scholar]
  30. Phan, H.; Andreotti, F.; Cooray, N.; Chen, O.Y.; De Vos, M. SeqSleepNet: End-to-End Hierarchical Recurrent Neural Network for Sequence-to-Sequence Automatic Sleep Staging. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 400–410. [Google Scholar] [CrossRef] [PubMed]
  31. Huang, J.; Ren, L.; Zhou, X.; Yan, K. An Improved Neural Network Based on SENet for Sleep Stage Classification. IEEE J. Biomed. Health Inform. 2022, 26, 4948–4956. [Google Scholar] [CrossRef]
  32. Heng, X.; Wang, M.; Wang, Z.; Zhang, J.; He, L.; Fan, L. Leveraging Discriminative Features for Automatic Sleep Stage Classification Based on Raw Single-Channel EEG. Biomed. Signal Process. Control 2024, 88, 105631. [Google Scholar] [CrossRef]
  33. Eldele, E.; Chen, Z.; Wu, M.; Guan, C. An Attention-Based Deep Learning Approach for Sleep Stage Classification with Single-Channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 809–818. [Google Scholar] [CrossRef] [PubMed]
34. Phan, H.; Chen, O.Y.; Tran, M.C.; Koch, P.; Mertins, A.; De Vos, M. XSleepNet: Multi-View Sequential Model for Automatic Sleep Staging. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5903–5915.
35. Kim, M.; Chung, W. Convolutional Transformer-in-Transformer for Automatic Sleep Stage Classification. In Proceedings of the 2024 12th International Winter Conference on Brain-Computer Interface (BCI), Gangwon, Republic of Korea, 26–28 February 2024; pp. 1–5.
36. Zhao, C.; Li, J.; Guo, Y. Sequence Signal Reconstruction Based Multi-Task Deep Learning for Sleep Staging on Single-Channel EEG. Biomed. Signal Process. Control 2024, 88, 105615.
37. Dai, Y.; Li, X.; Liang, S.; Wang, L.; Duan, Q.; Yang, H.; Zhang, C.; Chen, X.; Li, L.; Li, X.; et al. MultiChannelSleepNet: A Transformer-Based Model for Automatic Sleep Stage Classification with PSG. IEEE J. Biomed. Health Inform. 2023, 27, 4204–4215.
38. Kumar, C.B.; Mondal, A.K.; Bhatia, M.; Panigrahi, B.K.; Gandhi, T.K. Unravelling Sleep Patterns: Supervised Contrastive Learning with Self-Attention for Sleep Stage Classification. Appl. Soft Comput. 2024, 167, 112298.
39. Xia, M.; Zhao, X.; Deng, R.; Lu, Z.; Cao, J. EEGNet Classification of Sleep EEG for Individual Specialization Based on Data Augmentation. Cogn. Neurodyn. 2024, 18, 1539–1547.
40. Cooley, J.W.; Tukey, J.W. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comput. 1965, 19, 297–301.
41. Riesenhuber, M.; Poggio, T. Hierarchical Models of Object Recognition in Cortex. Nat. Neurosci. 1999, 2, 1019–1025.
42. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors. arXiv 2012, arXiv:1207.0580.
43. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991.
44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
45. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
46. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2017, arXiv:1608.03983.
47. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2018, arXiv:1708.02002.
48. Kemp, B.; Zwinderman, A.H.; Tuk, B.; Kamphuisen, H.A.C.; Oberye, J.J.L. Analysis of a Sleep-Dependent Neuronal Feedback Loop: The Slow-Wave Microcontinuity of the EEG. IEEE Trans. Biomed. Eng. 2000, 47, 1185–1194.
49. Goldberger, A.L.; Amaral, L.A.N.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.-K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 2000, 101, E215–E220.
50. Zhang, G.-Q.; Cui, L.; Mueller, R.; Tao, S.; Kim, M.; Rueschman, M.; Mariani, S.; Mobley, D.; Redline, S. The National Sleep Research Resource: Towards a Sleep Data Commons. J. Am. Med. Inform. Assoc. 2018, 25, 1351–1358.
51. Quan, S.F.; Howard, B.V.; Iber, C.; Kiley, J.P.; Nieto, F.J.; O’Connor, G.T.; Rapoport, D.M.; Redline, S.; Robbins, J.; Samet, J.M.; et al. The Sleep Heart Health Study: Design, Rationale, and Methods. Sleep 1997, 20, 1077–1085.
52. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
53. Zhao, C.; Li, J.; Guo, Y. SleepContextNet: A Temporal Context Network for Automatic Sleep Staging Based Single-Channel EEG. Comput. Methods Programs Biomed. 2022, 220, 106806.
Figure 1. R&K rules and AASM standard guideline on sleep staging.
Figure 2. Examples of characteristic EEG signal features across sleep stages in the time domain.
Figure 3. Framework of the proposed model.
Figure 4. Detailed architecture of MultiScaleSleepNet: (a) multiscale convolutional neural network for EEG feature extraction; (b) feature map reweighting using the squeeze-and-excitation mechanism; (c) bidirectional long short-term memory (BiLSTM) processing block; (d) multi-head attention mechanism; (e) sleep stage classification with temperature-scaled softmax activation.
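To make panels (b) and (e) of Figure 4 concrete, the sketch below shows squeeze-and-excitation reweighting over a 1-D feature map and a temperature-scaled softmax head in PyTorch. The reduction ratio and temperature value are illustrative assumptions, not the exact hyperparameters used in the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation reweighting, as sketched in Figure 4b."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)  # squeeze: global average over time
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        weights = self.fc(self.pool(x).squeeze(-1))  # (batch, channels)
        return x * weights.unsqueeze(-1)             # reweight each feature map

def temperature_softmax(logits: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled softmax (Figure 4e); T > 1 softens the distribution."""
    return torch.softmax(logits / temperature, dim=-1)
```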
Figure 5. Data segmentation and labeling.
Figure 6. Confusion matrix of training and test sets on the Fpz–Cz channel of the Sleep-EDFx dataset.
Figure 7. Impact of dropout and learning rate on model performance across evaluation metrics.
Figure 8. Accuracy and loss curves for different batch sizes on Sleep-EDFx.
Figure 9. Ground-truth and predicted labels for three random sequences of 100, 200, and 300 contiguous epochs from the Sleep-EDFx test set. Acc indicates sequence-level accuracy. Red dots mark epochs that were misclassified.
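A Figure 9-style overlay can be reproduced from any pair of label sequences. The following is a minimal matplotlib sketch, assuming NumPy arrays with stages integer-encoded 0–4 in the order W, N1, N2, N3, REM (an assumed encoding); it draws both hypnograms as step plots and marks misclassified epochs with red dots.

```python
import numpy as np
import matplotlib.pyplot as plt

STAGES = ["W", "N1", "N2", "N3", "REM"]

def plot_hypnogram(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """Overlay ground-truth and predicted stage sequences; mark errors in red."""
    t = np.arange(y_true.size)  # epoch index (one epoch = 30 s)
    plt.step(t, y_true, where="post", label="Ground truth")
    plt.step(t, y_pred, where="post", alpha=0.6, label="Predicted")
    errors = np.flatnonzero(y_true != y_pred)
    plt.plot(errors, y_pred[errors], "r.", label="Misclassified")
    acc = float((y_true == y_pred).mean())
    plt.yticks(range(len(STAGES)), STAGES)
    plt.xlabel("Epoch index")
    plt.title(f"Sequence-level accuracy: {acc:.2%}")
    plt.legend()
    plt.show()
```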
Table 1. EEG frequency ranges of different bands.

Band            Frequency Range (Hz)
Delta (δ)       0.5–4
K-Complex       0.5–1.5
Saw-Tooth       2.0–6.0
Theta (θ)       4–8
Alpha (α)       8–13
Sleep Spindles  12–14
Beta (β)        13–30
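The rhythm bands in Table 1 underlie the spectral features used throughout the paper; K-complexes, saw-tooth waves, and sleep spindles are transient waveform events rather than sustained rhythms. As a minimal NumPy sketch, assuming a 30 s epoch sampled at 100 Hz (the Sleep-EDF EEG rate), relative band power can be computed from a simple periodogram:

```python
import numpy as np

# Rhythm band edges (Hz) taken from Table 1.
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(epoch: np.ndarray, fs: float = 100.0) -> dict:
    """Relative spectral power per band for one 30 s EEG epoch (1-D array)."""
    freqs = np.fft.rfftfreq(epoch.size, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(epoch)) ** 2  # simple periodogram
    total = psd[(freqs >= 0.5) & (freqs <= 30)].sum()  # power in 0.5-30 Hz
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in BANDS.items()}
```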
Table 2. Number of 30 s epochs for each sleep stage in the processed datasets.

Dataset      W       N1      N2       N3      REM     Total
Sleep-EDF    8285    2804    17,799   5703    7717    42,308
Sleep-EDFx   69,546  24,650  86,884   18,703  33,573  233,356
SHHS         46,319  10,304  142,125  60,153  65,953  324,854
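Table 2 makes the class imbalance explicit: N1 is the rarest stage in all three datasets, consistent with the lower N1 scores reported in Tables 3–5. One standard remedy is the focal loss of [47], which down-weights well-classified epochs; the PyTorch sketch below shows its multi-class form, with γ = 2.0 taken from [47] as a placeholder rather than as the setting used in this work.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss [47].

    logits: (batch, 5) raw scores; targets: (batch,) stage indices 0-4.
    """
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, reduction="none")  # per-sample cross-entropy
    p_t = torch.exp(-ce)                               # probability of true class
    return ((1.0 - p_t) ** gamma * ce).mean()          # down-weight easy samples
```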
Table 3. Class-wise evaluation metrics on the Sleep-EDFx test set (Fpz–Cz channel), which accounts for 10% of the total Sleep-EDFx dataset.

Label   PR      RE      F1      Support
W       0.94    0.95    0.94    6954
N1      0.58    0.51    0.54    2465
N2      0.87    0.89    0.88    8689
N3      0.88    0.83    0.85    1870
REM     0.83    0.85    0.84    3358
ACC     0.86                    23,336
κ       0.80                    23,336
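In Tables 3–5, PR and RE denote per-class precision and recall, ACC is overall accuracy, and κ is Cohen's kappa, conventionally interpreted on the Landis–Koch scale [52]. A table of this form can be generated directly from predictions; the sketch below assumes scikit-learn and labels integer-encoded 0–4 in the order W, N1, N2, N3, REM (an assumed encoding).

```python
from sklearn.metrics import classification_report, cohen_kappa_score

STAGES = ["W", "N1", "N2", "N3", "REM"]

def staging_report(y_true, y_pred) -> str:
    """Per-class precision/recall/F1 plus accuracy and kappa, as in Tables 3-5."""
    report = classification_report(y_true, y_pred,
                                   labels=list(range(5)),
                                   target_names=STAGES, digits=2)
    kappa = cohen_kappa_score(y_true, y_pred)
    return f"{report}\nkappa = {kappa:.2f}"
```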
Table 4. Class-wise evaluation metrics on the Sleep-EDF test set (Fpz–Cz channel), which accounts for 10% of the total Sleep-EDF dataset.

Label   PR      RE      F1      Support
W       0.94    0.94    0.94    829
N1      0.58    0.49    0.53    280
N2      0.90    0.92    0.91    1780
N3      0.90    0.93    0.91    571
REM     0.87    0.86    0.87    771
ACC     0.89                    4231
κ       0.80                    4231
Table 5. Class-wise evaluation metrics on the SHHS test set (C4–A1 channel), which accounts for 10% of the total SHHS dataset.

Label   PR      RE      F1      Support
W       0.84    0.87    0.85    4675
N1      0.48    0.19    0.28    1102
N2      0.89    0.85    0.87    14,283
N3      0.87    0.90    0.89    6101
REM     0.77    0.89    0.82    6325
ACC     0.85                    32,486
κ       0.79                    32,486
Table 6. Performance comparison of the proposed model with state-of-the-art models. ACC, MF1 (macro-averaged F1), and κ summarize overall performance; the W–REM columns give per-class F1.

Methods                    Dataset      ACC    MF1    κ      W      N1     N2     N3     REM
SleepEEGNet [28]           Sleep-EDFx   80.03  73.55  0.73   91.72  44.05  82.49  73.45  76.06
XSleepNet2 [34]            Sleep-EDFx   84.00  77.90  0.78   -      -      -      -      -
AttnSleep [33]             Sleep-EDFx   81.3   75.1   0.74   92.0   42.0   85.0   82.1   74.2
SleepContextNet [53]       Sleep-EDFx   82.7   77.2   0.76   92.8   49.0   84.8   80.6   76.4
CSCNN-HMM [31]             Sleep-EDFx   84.6   78.0   0.79   93.0   41.0   88.0   85.0   83.0
MultiChannelSleepNet [37]  Sleep-EDFx   85.0   79.6   0.79   94.0   53.0   86.9   81.8   82.6
Multi-Task Learning [36]   Sleep-EDFx   83.4   78.6   0.77   92.9   52.6   84.9   78.9   83.7
Proposed Model             Sleep-EDFx   85.6   81.1   0.80   94.0   54.0   88.0   85.0   84.0
SleepEEGNet [28]           Sleep-EDF    84.26  79.66  0.79   89.19  52.19  86.77  85.13  85.02
XSleepNet2 [34]            Sleep-EDF    86.3   80.6   0.81   -      -      -      -      -
AttnSleep [33]             Sleep-EDF    84.4   78.1   0.79   89.7   42.6   88.8   90.2   79.0
SleepContextNet [53]       Sleep-EDF    84.8   79.8   0.79   89.6   50.5   88.4   88.5   82.0
MultiChannelSleepNet [37]  Sleep-EDF    87.2   82.0   0.81   92.8   49.1   90.0   89.3   84.8
Multi-Task Learning [36]   Sleep-EDF    85.6   81.1   0.80   90.4   53.7   88.3   88.3   85.1
Proposed Model             Sleep-EDF    88.6   83.3   0.84   94.0   53.0   91.0   91.0   87.0
SleepEEGNet [28]           SHHS         73.9   68.4   0.65   81.3   34.4   73.4   75.9   77.0
AttnSleep [33]             SHHS         84.2   75.3   0.78   86.7   33.2   87.1   87.1   82.1
Proposed Model             SHHS         84.6   74.5   0.79   85.0   28.0   87.2   89.0   82.2
Table 7. Comparative performance evaluation of the proposed model and its ablation variants on the Sleep-EDFx dataset.

Components          ACC (%)  Macro-F1  Kappa  Epochs to Converge
BiLSTM (baseline)   73.2     0.600     0.619  258
CNN                 83.0     0.767     0.767  210
CNN + BiLSTM        84.9     0.795     0.792  261
+SE Module          85.5     0.800     0.800  356
Proposed Model      85.6     0.811     0.804  200