1.1. Introduction to Sleep Stage Classification
Sleep occupies approximately one-third of the human lifespan and is indispensable to both physical and mental health [1]. Adequate sleep improves cognitive performance and overall well-being [2,3], and critical biological processes such as memory consolidation, cellular repair, and cerebral development occur predominantly during sleep [4,5,6,7]. Conversely, sleep deprivation is associated with impaired cognition, immune suppression, and dysregulation of memory, emotion, and metabolism [8,9,10,11,12]. Although the health consequences of sleep problems are well documented, their wider socioeconomic implications are only now receiving sustained attention. A multinational study across 16 countries in Europe, North America, and Australia found that chronic insomnia affects over 8% of adults, resulting in annual GDP losses of 0.64–1.31% [13]. Further evidence confirms that insufficient or fragmented sleep not only compromises daily emotional stability but also exacerbates productivity declines and societal economic burdens [14]. Given these health and economic repercussions, accurate sleep stage classification is essential for diagnosing sleep disorders and optimizing their management.
Polysomnography (PSG) serves as the clinical gold standard for sleep stage classification. It records electroencephalography (EEG), electrooculography (EOG), and electromyography (EMG), providing the basis for separating sleep into non-rapid eye movement (NREM) and rapid eye movement (REM) periods [15]. Following standardized guidelines, namely the Rechtschaffen and Kales (R&K) rules [16] and the American Academy of Sleep Medicine (AASM) guidelines [17], specialists manually assign each 30 s PSG epoch to a specific stage. The R&K rules define six sleep stages: wakefulness (Wake), four NREM stages (S1–S4), and REM. Stage S1 represents light sleep, S2 marks a transitional phase, and S3–S4 correspond to deep sleep, also termed slow-wave sleep (SWS). The REM stage, characterized by rapid ocular movements, muscle atonia, and low-voltage mixed-frequency EEG resembling wakefulness, accounts for 20–25% of adult sleep. In 2007, the AASM merged stages S3 and S4 into a single deep-sleep stage (N3), as illustrated in Figure 1.
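To make the scoring conventions concrete, the minimal Python sketch below segments a continuous single-channel recording into non-overlapping 30 s epochs and maps R&K labels onto the AASM scheme, merging S3 and S4 into N3. The sampling rate (100 Hz), label strings, and function names are illustrative assumptions rather than values from any specific dataset or toolbox.

```python
import numpy as np

FS = 100                           # sampling rate in Hz (assumed)
SAMPLES_PER_EPOCH = FS * 30        # one 30 s scoring epoch

# R&K -> AASM: S3 and S4 collapse into the single deep-sleep stage N3.
RK_TO_AASM = {"Wake": "W", "S1": "N1", "S2": "N2",
              "S3": "N3", "S4": "N3", "REM": "REM"}

def segment_epochs(signal: np.ndarray) -> np.ndarray:
    """Split a 1-D recording into 30 s epochs, discarding any trailing remainder."""
    n_epochs = len(signal) // SAMPLES_PER_EPOCH
    return signal[:n_epochs * SAMPLES_PER_EPOCH].reshape(n_epochs, SAMPLES_PER_EPOCH)

eeg = np.random.randn(FS * 600)               # 10 min of synthetic EEG
print(segment_epochs(eeg).shape)              # (20, 3000)
print([RK_TO_AASM[s] for s in ("S3", "S4")])  # ['N3', 'N3'] after the 2007 merge
```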
Distinct EEG waveforms characterize each sleep stage, as summarized in Table 1. Wakefulness is dominated by β waves, whereas α waves appear during eye closure or meditative states. The transition from wakefulness to light sleep (N1) involves attenuation of α activity and the emergence of low-amplitude θ waves. The N1 stage is brief and transitions to N2, which is identified by sleep spindles and K-complexes. Subsequent reductions in cortical activity lead to deep sleep (N3), characterized by high-amplitude δ waves. Finally, REM sleep is distinguished by sawtooth waves in the EEG and rapid ocular oscillations detectable via EOG. Representative time-domain traces for each stage are shown in Figure 2.
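As a hedged illustration of how these waveform differences become quantifiable features, the sketch below estimates relative power in the canonical δ, θ, α, and β bands with Welch's method. The band edges (δ 0.5–4 Hz, θ 4–8 Hz, α 8–13 Hz, β 13–30 Hz) follow common conventions and are assumptions, not values specified in this paper.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def relative_band_powers(epoch: np.ndarray, fs: int = 100) -> dict:
    # Welch PSD with 4 s segments gives 0.25 Hz frequency resolution.
    freqs, psd = welch(epoch, fs=fs, nperseg=fs * 4)
    total = psd.sum()
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in BANDS.items()}

# A dominant delta fraction points toward N3, whereas strong beta suggests wakefulness.
print(relative_band_powers(np.random.randn(3000)))
```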
1.2. Related Work
Traditional sleep stage classification depends on expert visual interpretation of PSG data, which is labor-intensive, time-consuming, costly, and vulnerable to inter- and intra-rater variability that often produces inconsistent results [18]. This manual approach has proven inefficient for large-scale or real-time applications [19].
To address these challenges, researchers have developed automated sleep stage classification methods that leverage advanced EEG signal processing and machine learning techniques. These methods reduce dependence on manual feature engineering and mitigate the logistical constraints associated with PSG setups [20]. Recent studies have shown particular interest in single-channel EEG augmented with adaptive noise-reduction techniques, a strategy that supports scalable deployment while preserving clinical accuracy [21].
Automated approaches fall into two broad categories. Traditional methods pair manually engineered features with supervised classifiers such as support vector machines (SVMs) and random forests, whereas deep learning methods employ neural networks to extract features from PSG data autonomously within end-to-end learning frameworks.
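The traditional paradigm can be summarized with a short sketch: hand-crafted per-epoch features feeding a classical supervised classifier. The feature set, classifier choice, and hyperparameters below are illustrative assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def handcrafted_features(epoch: np.ndarray) -> np.ndarray:
    # Simple time-domain statistics; real systems add spectral and wavelet features.
    return np.array([epoch.mean(), epoch.std(), np.abs(np.diff(epoch)).mean()])

rng = np.random.default_rng(0)
X = np.stack([handcrafted_features(rng.standard_normal(3000)) for _ in range(200)])
y = rng.integers(0, 5, size=200)   # placeholder labels for the 5 AASM stages

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # chance-level on synthetic data
```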
Early studies adhering to the traditional paradigm emphasized manual feature engineering combined with classical algorithms. For example, Al-Salman et al. [22] applied wavelet transforms to extract time-frequency features from EEG signals, while Abdulla et al. [23] modeled EEG epochs as correlation networks integrated with community detection algorithms. Satapathy et al. [24] enhanced feature optimization by coupling the ReliefF algorithm with an AdaBoost-augmented random forest classifier. Subsequent efforts explored feature refinement strategies, including sliding-window statistical mapping [25], principal component analysis (PCA)-based dimensionality reduction [26], and frequency-domain feature extraction using band-pass filters [27]. Despite their interpretability, these methods suffered from limited generalizability due to their reliance on domain-specific expertise, ultimately hindering scalability in large-scale applications.
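In the spirit of the wavelet-based features of [22], the following hedged sketch decomposes an epoch with a discrete wavelet transform and summarizes each sub-band by its energy; the wavelet family and decomposition depth are assumptions, not the cited authors' settings.

```python
import numpy as np
import pywt

def wavelet_energies(epoch: np.ndarray, wavelet: str = "db4", level: int = 5) -> np.ndarray:
    # wavedec returns [cA5, cD5, ..., cD1]; each coefficient set covers one sub-band.
    coeffs = pywt.wavedec(epoch, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

print(wavelet_energies(np.random.randn(3000)))   # 6 sub-band energies
```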
Deep learning has advanced sleep stage classification by automating feature extraction. Early innovations, such as SleepEEGNet [28], leveraged convolutional neural networks (CNNs) to capture time-invariant spatial patterns in EEG data, while Humayun et al. [29] demonstrated the effectiveness of residual CNNs for raw EEG signal processing. To address temporal dependencies, SeqSleepNet [30] proposed a hierarchical architecture built on recurrent neural networks (RNNs), while CSCNN-HMM [31] incorporated hidden Markov models (HMMs) to model state transitions. Heng et al. [32] further advanced temporal modeling with bidirectional gated recurrent units (GRUs) enhanced by attention mechanisms. Subsequent studies refined feature aggregation: AttnSleep [33] fused multi-resolution CNNs with transformer encoders, and XSleepNet [34] dynamically integrated time-frequency representations. Recent developments include transformer-in-transformer architectures [35] for joint local–global dependency learning and multi-task frameworks incorporating channel attention modules [36]. While CNNs and RNNs excel at spatial and temporal feature extraction, respectively, hybrid models often face trade-offs between computational efficiency and temporal granularity. Attention mechanisms enhance feature prioritization but require large datasets to prevent overfitting.
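Because several of the models above rely on attention, a minimal sketch of scaled dot-product self-attention is shown below; shapes and sizes are illustrative, and this is the generic transformer formulation rather than any cited model's exact variant.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarity (..., T, T)
    weights = F.softmax(scores, dim=-1)             # per-query feature weighting
    return weights @ v                              # attention-weighted values

x = torch.randn(1, 30, 64)                          # e.g., 30 epoch embeddings of width 64
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 30, 64])
```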
Recent research has increasingly focused on multichannel data fusion and transformer-based architectures. MultiChannelSleepNet [37] employed transformer encoders with layer normalization to integrate cross-channel features. Kumar et al. [38] addressed class imbalance through supervised contrastive learning paired with self-attention mechanisms, while Zhao et al. [36] improved cross-dataset generalization using multi-task learning. A Discrete Cosine Transform (DCT)-augmented EEGNet [39] optimized multichannel analysis by enhancing spectral-domain features. Despite their advantages, multichannel methods escalate computational costs and necessitate strategies to mitigate channel redundancy, highlighting the need for efficient feature selection mechanisms.
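As a hedged sketch of DCT-style spectral augmentation in the spirit of [39], the snippet below keeps the leading DCT coefficients of each channel as compact spectral-domain features; the coefficient count and layout are assumptions.

```python
import numpy as np
from scipy.fft import dct

def dct_features(epoch: np.ndarray, n_coeffs: int = 32) -> np.ndarray:
    # epoch: (channels, samples); low-order coefficients capture coarse spectral shape.
    return dct(epoch, norm="ortho", axis=-1)[:, :n_coeffs]

print(dct_features(np.random.randn(3, 3000)).shape)   # (3, 32)
```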
In contrast to modern hybrid systems, earlier approaches often employed architectures specialized for either spatial or temporal feature extraction. Models such as TimeNet and Temporal Convolutional Networks (TCNs) prioritized convolutional operations for local pattern recognition but lacked mechanisms to capture long-range dependencies. Conversely, BiLSTM-based frameworks, exemplified by SleepEEGNet [28], excelled at sequence modeling but had limited ability to integrate frequency-domain features due to their sequential nature. Even hybrid systems such as CSCNN-HMM [31], which integrated HMMs to model sleep stage transitions, omitted dynamic feature weighting via attention mechanisms. Similarly, AttnSleep [33] introduced attention-based aggregation but relied on a dual-branch architecture, potentially limiting its ability to holistically model multi-resolution time-frequency interactions.
To overcome these limitations, this study introduces novel advances in feature extraction, temporal dependency modeling, and transformer-based architecture design to improve representation learning for sleep stage classification. Departing from existing approaches that treat time- and frequency-domain features in isolation or rely on restrictive architectures, our model uses parallel convolutional branches to holistically capture diverse temporal and spectral patterns. Specifically, a Fast Fourier Transform (FFT)-based spectral-analysis branch and multi-scale convolutional branches extract complementary time-frequency features from EEG epochs, enhancing the detection of fine-grained signal characteristics. For temporal modeling, the framework integrates a lightweight transformer module with a bidirectional long short-term memory (BiLSTM) network, forming a hybrid encoder. This architecture synergistically combines convolutional operations for local feature extraction, sequential modeling for long-range dependencies, and attention mechanisms for dynamic feature weighting. The BiLSTM component further stabilizes gradient propagation during training while bolstering classification robustness.
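The sketch below illustrates this design in PyTorch: parallel multi-scale convolutional branches, an FFT-based spectral branch, and a BiLSTM + lightweight transformer hybrid encoder. All layer widths, kernel sizes, and head counts are illustrative assumptions, not the configuration reported in this paper.

```python
import torch
import torch.nn as nn

class HybridSleepNet(nn.Module):
    """Hedged sketch: multi-scale temporal + FFT spectral branches, BiLSTM + transformer."""
    def __init__(self, n_classes: int = 5, d_model: int = 128):
        super().__init__()
        # Multi-scale temporal branches: short/long kernels for fast/slow rhythms.
        self.branch_small = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64))
        self.branch_large = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=51, stride=2, padding=25), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64))
        # Spectral branch: convolve over the FFT magnitude spectrum.
        self.branch_spec = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64))
        self.proj = nn.Linear(96, d_model)            # fuse the three branches
        # Hybrid encoder: BiLSTM for sequential context, transformer for attention.
        self.bilstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                              bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                             # x: (batch, 1, samples)
        spec = torch.fft.rfft(x, dim=-1).abs()        # FFT magnitude spectrum
        feats = torch.cat([self.branch_small(x),
                           self.branch_large(x),
                           self.branch_spec(spec)], dim=1)   # (batch, 96, 64)
        seq = self.proj(feats.transpose(1, 2))        # (batch, 64, d_model)
        seq, _ = self.bilstm(seq)                     # long-range sequential context
        seq = self.transformer(seq)                   # dynamic feature weighting
        return self.head(seq.mean(dim=1))             # epoch-level class logits

print(HybridSleepNet()(torch.randn(2, 1, 3000)).shape)   # torch.Size([2, 5])
```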
The main contributions of this work are as follows:
1. Multi-scale temporal–spectral fusion: a hybrid architecture integrating FFT-based spectral analysis with multi-resolution convolutional branches was developed, achieving comprehensive temporal–spectral feature integration through physiologically aligned kernel configurations.
2. Hybrid long-range dependency modeling: bidirectional LSTM networks were synergistically combined with transformer self-attention mechanisms, establishing hierarchical temporal context modeling from local sleep transitions to global stage progression patterns.
3. Unified dynamic feature weighting: cross-domain attention mechanisms were implemented to dynamically recalibrate feature significance across concatenated temporal–spectral representations, enabling adaptive fusion of multi-modal sleep characteristics (a minimal sketch of such a weighting module appears after this list).
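As a hedged sketch of such dynamic feature weighting, the module below learns per-feature importance over a concatenated temporal–spectral vector; it is a generic gating scheme, not the paper's exact attention formulation.

```python
import torch
import torch.nn as nn

class FeatureRecalibration(nn.Module):
    """Learn multiplicative weights in [0, 1] for each fused feature dimension."""
    def __init__(self, d_features: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(d_features, d_features // reduction), nn.ReLU(),
            nn.Linear(d_features // reduction, d_features), nn.Sigmoid())

    def forward(self, fused):                    # fused: (batch, d_features)
        return fused * self.gate(fused)          # recalibrated features

fused = torch.randn(8, 128)                      # concatenated time + spectral features
print(FeatureRecalibration(128)(fused).shape)    # torch.Size([8, 128])
```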
The remainder of this paper is organized as follows: Section 2 details the methodology, including the proposed architecture and its operational principles. Section 3 describes the experimental protocols, datasets, and results. Section 4 provides a comparative analysis against state-of-the-art methods and discusses performance outcomes. Finally, Section 5 synthesizes the key conclusions, evaluates clinical and technical implications, and outlines future research directions.