1. Introduction
Depression, a prevalent mental disorder, is characterized by a persistent low mood, anhedonia, or diminished interest in activities, and in severe cases, may lead to suicide [
1]. According to the World Health Organization (WHO), an estimated 280 million people worldwide are affected by depression [
2,
3,
4]. Consequently, depression is prioritized in the World Health Organization’s Mental Health Gap Action Programme [
5]. Clinically, the diagnosis of a depressive episode typically requires a duration of at least two weeks, accompanied by significant distress or impairment in social functioning. The primary diagnostic criteria for depression are detailed in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) or Fourth Edition (DSM-IV) [
6,
7]. Diagnoses typically rely on psychiatric evaluations and psychometric questionnaires, including the Hamilton Depression Scale (HAM-D) and the Self-Rating Depression Scale [
8,
9,
10,
11,
12,
13]. However, this approach relies heavily on patient cooperation and clinician expertise, which complicates the diagnostic process and may delay early intervention for patients [
14,
15]. Depression disrupts neural communication and neurogenesis, impairing the function of specific brain regions and leading to fluctuating patterns of brain activity [
16,
17]. Research shows that individuals with depression exhibit dynamic disturbances in the frontal limbic network and abnormal inter-cerebral functional connectivity [
18,
19], which is indicative of altered complex neuronal interactions. Consequently, numerous clinicians and researchers recognize that identifying objective physiological indicators for the direct diagnosis of depression could substantially enhance both diagnostic accuracy and treatment outcomes.
Electroencephalography (EEG) is a non-invasive neuroimaging technique offering a high temporal resolution and easy accessibility [
20,
21]. EEG accurately reflects brain network activity, offers objective neural markers, and remains free from clinical bias and subjective self-assessment. It has also proved valuable for capturing information about physiological changes in the brain associated with depression [
22,
23,
24]. Compared to magnetic resonance imaging (MRI), EEG is more cost-effective and suitable for frequent testing, thereby offering significant advantages in diagnosing depression [
25,
26]. Compared to EEG during task execution, resting-state EEG (rsEEG) not only minimizes interference from visual scenes, instructions, and task performance variations influenced by subjects’ cognitive levels, gender, age, and interest [
27], but it also effectively captures the intrinsic brain activity. The resting state allows subjects to focus their attention inward, thus facilitating self-related brain activity [
28,
29]. Furthermore, a study comparing resting-state and task-state EEG demonstrates that classification accuracy is higher in the resting state than in the task state [
30]. With advances in computational psychiatry [
31,
32,
33], the integration of rsEEG with artificial intelligence produces remarkable results in diagnosing depression [
34]. In contemporary research on depression diagnosis, computer-aided diagnostic models primarily fall into two categories: traditional machine learning (TML) models and deep learning (DL) models. Traditional machine learning models rely on feature engineering to process and analyze training data, using algorithms to make decisions. Conversely, deep learning models can handle large datasets and autonomously learn complex feature representations.
Currently, there are only a few review articles that address the application of EEG in diagnosing depression. Numerous studies highlight the impact of feature extraction and the selection of classification models on diagnostic outcomes; however, few directly compare traditional machine learning methods with deep learning approaches. We compare and analyze common elements such as sample size, data acquisition, preprocessing, feature extraction, feature selection, classification methods, and validation techniques across these two types of studies. Furthermore, we examine the primary challenges in the current research and propose desirable practices. In all tables presented in this review, a dash (“-”) indicates that relevant data are unavailable in the reference.
3. Results
Forty-nine studies employing rsEEG for depression diagnosis, published between 2018 and 2024, are systematically reviewed. This includes 22 studies employing TML methods and 27 studies utilizing DL approaches. Two of these studies employ both methods [
35,
36]; they are classified based on the method that achieves the highest accuracy rate in the analysis.
Section 3.1,
Section 3.2,
Section 3.3,
Section 3.4,
Section 3.5,
Section 3.6 and
Section 3.7 provide a detailed account of the sample size, data acquisition, and preprocessing approaches used in these studies, along with the methods applied for feature extraction and selection, classification, and validation.
Figure 1 illustrates the general framework for diagnosing depression using TML and DL methods. The figure clearly shows that the raw EEG obtained from the EEG cap must first undergo preprocessing, as indicated by the elliptical area in the preprocessing box grid, which highlights the four most commonly utilized methods. Subsequently, two approaches are presented: on the left, the TML method and on the right, the DL method. In TML methods, feature extraction and selection are two distinct steps, with the most frequently used methods detailed in these modules. If a deep learning approach is employed, the feature extraction step includes two strategies: manual feature extraction integrated with a DL model, as depicted in the ellipse area, and automatic feature extraction, with schematic diagrams of the two predominant structures, convolutional neural network and long short-term memory, systematically arranged from top to bottom within the box frame. In the classification and validation module, each segment’s size in the sector diagram corresponds to the frequency of use of each method.
3.1. Sample Size
3.1.1. Studies Based on TML Methods
In 2012, Ahmadlou and his team conducted the first study on diagnosing depression using traditional machine learning with rsEEG, involving a limited sample size of 24 subjects [
37]. In 2018, Cai et al. [
38] expanded their study’s sample size to include 213 participants, comprising 92 individuals with depression and 121 controls without depression. Subsequently, Mumtaz et al. [
39] published a study including 34 with MDD and 30 age-matched HC. Wan et al. [
40] used two separate samples: one consists of 35 participants (23 with MDD and 12 HC), and the other comprises 30 participants (15 with MDD and 15 HC). In 2021, Wu et al. [
41] significantly increased their sample size by collecting rsEEG data from 400 participants, evenly divided into 200 with MDD and 200 HC, across four different hospitals. In contrast, in 2022, Avots et al. [
42] conducted a study featuring a notably limited sample size of 20 participants, consisting of 10 individuals with MDD and 10 HC. In the same year, Li et al. [
43] analyzed data from 92 cases. Two studies by Soni et al. [
44,
45] used multiple datasets, each with sample sizes ranging from 30 to 55 and incorporating data from the study by Seal et al. [
46], which includes 15 patients with MDD and 18 HC. According to the WHO, depression proves more prevalent in women than in men [
3]. In 2023, Shim et al. [
47] analyzed EEG data exclusively from female participants, consisting of 49 patients with MDD and 49 HC. This study is unique, as it exclusively includes female subjects. In the same year, two other studies included samples of 32 participants (19 with MDD and 13 HC) and 80 participants (40 with MDD and 40 HC), respectively [
48,
49].
Table 1 outlines the fundamental experimental setup for EEG data acquisition in studies employing both TML and DL methodologies. It features four columns, detailing from left to right: sample size, frequency (in Hz), number of electrodes, and the corresponding study.
Table 2 lists the web addresses of the publicly available datasets used in depression diagnostic studies; each dataset is accessible via the corresponding link. Additionally, numerous researchers have utilized these publicly available sample data in their studies, as detailed in
Table 2. The dataset from Mumtaz et al. [
39,
50,
51], which includes 34 patients with MDD and 30 HC, is frequently used by researchers for analysis [
52,
53,
54,
55,
56,
57,
58]. Nassib et al. [
59] used a public dataset (42 with MDD + 42 HC) published by Cavanagh’s team [
60]. The Multi-modal Open Dataset for Mental-disorder Analysis (MODMA), provided by Cai’s team [
61] and comprising two subsets—one with 24 patients with MDD and 29 HC, and another with 26 with MDD and 29 HC—significantly contributes to depression detection research conducted by various researchers [
44,
45,
62,
63].
3.1.2. Studies Based on DL Methods
In 2018, Acharya et al. [
64] published a study on depression detection, collecting EEG signals from 30 participants, comprising 15 with MDD and 15 HC. Subsequently, Mao et al. [
81] and Ay et al. [
66] published studies with sample sizes of 34 participants (17 with MDD and 17 HC) and 30 participants (15 with MDD and 15 HC), respectively. In 2020, Wan’s team [
72] and Duan et al. [
35] chose the same data source for their experiments, though their sample sizes differ: Wan et al. included 35 participants (12 with MDD and 23 HC), while Duan et al. analyzed data from 32 participants (16 with MDD and 16 HC). In 2021, Seal et al. [
65] analyzed a sample of 33 participants, including 15 with MDD and 18 HC. The samples employed by the researchers mentioned above are relatively small, with none exceeding 35 participants. Subsequently, Khan et al. [
79] increased the sample size to 60 participants for depression diagnosis research, dividing them equally between patients with MDD and HC. In 2022, Yan et al. analyzed a dataset of 80 participants, evenly split between 40 with MDD and 40 HC [
69,
70]. In 2023, Zhang et al. [
75] published a study including 53 participants, consisting of 24 with MDD and 29 HC. Subsequently, Xu et al. [
73] further increased the sample size to 75 participants, including 41 with MDD and 34 HC, in their analysis. To better validate the proposed methods, an increasing number of researchers are opting to use multiple datasets. In these studies, the sample sizes generally hover around 105 participants, with one study exceeding 200 participants (refer to
Table 2). The use of an entirely different set of EEG data renders the results significantly more compelling. Researchers extensively use publicly available datasets in studies employing deep learning frameworks. As indicated in
Table 2, the publicly available dataset from Mumtaz’s team [
39,
50,
51,
89] is utilized by more researchers than any other, followed by the MODMA dataset from Cai’s team [
61]. Furthermore, the datasets from Cavanagh’s team [
60], the EDRA dataset shared by Yang et al. [
90], and the PRED + CT dataset are also popular among various researchers.
3.1.3. Analysis
As indicated in
Table 1, only four studies feature sample sizes exceeding 200, with one study comprising a sample of 400. Nine studies include an equal number of participants in the depression and control groups. Publicly available datasets are used more frequently in research utilizing deep learning frameworks, rather than in research employing traditional machine learning methods (refer to
Table 2). Additionally, variations in the samples used across different studies were observed.
Table 3 offers a detailed summary of the unpublished datasets used in depression diagnostic studies, including the male-to-female ratio, subjects’ mean age or age range, diagnostic criteria for MDD, data collection sources, and associated studies. As shown in
Table 3, the diagnostic methods used in various studies to ascertain the status of MDD patients vary. These methods include the DSM-IV, HAM-D, International Classification of Diseases (ICD), Emotional State Questionnaire (EST-Q), Mini-international Neuropsychiatric Interview (MINI), and Patient Health Questionnaire-9 (PHQ-9), among others [
11,
91,
92,
93,
94,
95,
96,
97]. Similarly, the state of the participant during the experiment (whether their eyes are open, closed, or both) can significantly influence the EEG signals captured. Additional parameters influencing brain activity include age, sex, the time of day the experiment is conducted, and prior physical activity. Painkillers, antidepressants, or any other medications that significantly alter brain activity can influence the outcomes of the experiments. Likewise, studies vary in terms of participants’ medication use: some require all participants to have ceased any medication use at least six weeks before the experiment; others enroll participants who are receiving medication for the first time; and some include participants who are already on medication. Some studies fail to report participants’ medication use at all. When selecting subjects, it is advisable to include individuals who are at least 18 years old and have a minimum education level of middle school. Subjects with MDD must meet established diagnostic criteria such as the DSM-IV, whereas healthy controls must have no history of mental disorders. Common exclusion criteria for participation include the following: (1) severe physical disabilities that preclude completion of the experiment; (2) diagnoses of other mental or neurological disorders; (3) histories of brain injuries that have resulted in coma; and (4) consumption of alcohol, nicotine, or caffeine prior to the experiment.
3.2. Data Acquisition
The process of data acquisition represents the initial stage of EEG research, and data collection methods vary across studies. Next, the included studies are compared based on parameters such as the number of electrodes, sampling rate, and other relevant factors. Initially, Ahmadlou’s team [
37] use 19 electrodes to record three minutes of eyes-closed rsEEG from both depressed patients and healthy controls at a sampling frequency of 256 Hz. Numerous subsequent investigations adopt identical electrode counts and sampling frequencies, with only slight deviations in recording duration. As EEG acquisition techniques evolve, a growing number of researchers increase both channel counts and sampling rates to acquire more granular rsEEG data [
41,
47,
49,
59]. Yan et al. [
69,
70] utilize sampling rates of up to 1000 Hz to record EEG signals in their experiments.
Research has shown that effective classification tasks can be accomplished using fewer electrodes [
68]. This reduction not only conserves time and computational resources but also decreases susceptibility to noise. Fewer electrodes also simplify the acquisition process, which facilitates the expansion of EEG datasets in this field. Given the strong correlation between the frontal lobe and emotional processes [
99], Cai et al. [
38] employ a three-electrode frontal system to collect 90 s of data at a sampling frequency of 250 Hz. Sakib et al. [
48] record data from only 14 channels using a wireless EEG headset. Some research teams employ multiple sampling rates for various signal processing methods during experiments to enhance data reliability [
40,
42].
Additionally, the placement of the electrodes is of equal importance. The 10–20 system for electrode placement remains the standard method prescribed by the International Society for Electroencephalography, and it is widely utilized in EEG studies [
100,
101,
102]. Most researchers adhere to the 10–20 international standard system for electrode placement during data acquisition, as illustrated in
Figure 2.
Table 4 summarizes the electrodes placed in different regions of the brain according to the 10–20 international standard system. The various brain regions utilized for EEG acquisition are depicted in
Figure 3 [
103]. An increasing number of researchers are focusing on EEG signals from specific brain regions. Acharya et al. [
64] and Ay et al. [
66] analyze EEG recordings from four electrodes, located in the left and right hemispheres of the brain. During data collection, Wan et al. [
72] select several representative brain regions in the prefrontal cortex (PFC), frontal cortex, and parietal cortex—regions known for their strong association with depression—for EEG collection. To investigate inter-hemispheric asymmetry in patients with MDD, Duan et al. [
35] utilize 28 pairs of electrodes to conduct experiments across five brain regions. The EEG signals used in the studies conducted by Wu’s team [
74] and Xu’s team [
73] are primarily collected from the frontal regions of the brain.
Analysis
The frequency of EEG signals typically lies below 100 Hz. According to Nyquist’s theorem, when the sampling frequency is more than twice the highest frequency of the signal, the sampled signal can retain the complete information of the raw signal. An examination of the depression diagnostic studies listed in
Table 1 reveals that nearly all of them maintain a sampling rate exceeding 200 Hz. However, the indiscriminate use of a high sampling rate to garner more information may not be advisable. Processing data at high sampling rates demands increased computational resources, thereby substantially extending training times, particularly for deep learning models. Increased data volumes significantly raise storage needs and preprocessing times, potentially causing bottlenecks in both feature extraction and training phases. Additionally, higher-dimensional data from increased channel counts may increase the risk of model overfitting. Given that multi-channel data often contain substantial redundant information, models are likely to capture noise and irrelevant features rather than the intended signal patterns. Handling high-dimensional data often requires more complex model architectures, thereby increasing both the complexity of model training and the difficulty of tuning. In practice, researchers must balance sampling rates, channel counts, and model complexity to optimize the performance within the constraints of the available resources.
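The trade-off above can be made concrete: when a recording is oversampled relative to EEG's sub-100 Hz content, it can be decimated with an anti-aliasing filter before further processing. A minimal sketch, assuming scipy is available (the function name and rates are illustrative, not taken from any reviewed study):

```python
import numpy as np
from scipy import signal

def downsample_eeg(eeg, fs, target_fs):
    """Decimate an EEG channel to a lower sampling rate.

    scipy's decimate applies an anti-aliasing low-pass filter before
    downsampling, so content above target_fs / 2 is suppressed rather
    than aliased back into the band of interest.
    """
    factor = int(fs // target_fs)
    return signal.decimate(eeg, factor)

# Example: a 10 Hz alpha-band sine sampled at 1000 Hz, reduced to 250 Hz.
# Since EEG content lies below ~100 Hz, 250 Hz still satisfies Nyquist.
fs, target_fs = 1000, 250
t = np.arange(0, 2.0, 1.0 / fs)
alpha = np.sin(2 * np.pi * 10 * t)
reduced = downsample_eeg(alpha, fs, target_fs)
print(len(alpha), len(reduced))  # 2000 500
```

Such a reduction cuts storage and training cost by the decimation factor while retaining all spectral content below the new Nyquist limit.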
3.3. Preprocessing
Noise and artifacts must be filtered from EEG data prior to analysis [
104]. EEG data inherently exhibit instability, weakness, and high susceptibility to external interference. Errors in experimental setups, environmental noise, and artifacts from other biological signals adversely impact the overall signal. Preprocessing constitutes a key component of EEG signal processing, significantly enhancing the signal-to-noise ratio and establishing a reliable foundation for further analyses and interpretation [
105]. This article reviews a range of preprocessing methods employed in the included studies, including filtering, independent component analysis (ICA), data segmentation, Z-score and min–max normalization, principal component analysis (PCA), re-referencing, soft-thresholding algorithms, spherical spline interpolation, FastICA, adaptive autoregression (AAR) models, adaptive noise cancellation (ANC), artifact subspace reconstruction (ASR), discrete and multilevel discrete wavelet transforms (DWT and MDWT), and the fast Fourier transform (FFT), among others.
Table 5 summarizes the preprocessing methods used in studies that employ both TML and DL approaches. The left column enumerates the various preprocessing methods, while the right column provides details of the studies that implemented these methods.
Figure 4 graphically presents the frequency distributions of various preprocessing methods, with the numbers after the colon (“:”) denoting the usage frequency of each method. Furthermore, the thickness of each ribbon strip corresponds proportionally to its frequency.
As indicated in
Table 5 and
Figure 4, filtering is the most commonly employed preprocessing method. In all included studies on depression diagnosis, filtering was employed 71 times: 36 times in studies employing traditional machine learning and 35 times in those utilizing deep learning approaches. Various filter types employ specific window sizes and cutoff frequencies. Considering that the power line frequency in China is 50 Hz, researchers frequently use 50 Hz notch filters to reduce power line interference in recorded EEG signals. Bandpass, high-pass, and low-pass filters, each with varying cutoff frequencies, are also frequently employed. Kalman filters, adaptive filters, smoothing filters, and Hanning filters remain in the exploratory phase of application in this field. ICA is also a frequently employed method, utilized eight times in studies based on TML and nine times in those based on DL approaches. In contrast, FastICA was employed only twice in studies using deep learning approaches, as shown in
Figure 4. FastICA converges faster than traditional ICA, which makes it well suited to the larger datasets typical of deep learning tasks. ICA and FastICA are widely employed to isolate and remove disturbances attributable to eye and muscle movements [
106,
107,
108,
109]. Decomposing the EEG signal into its constituent components allows for the isolation and removal of artifacts by identifying and excluding these components [
110]. A notable drawback of this artifact removal technique is its potential to alter the inherent signal dynamics. Research indicates that minimal preprocessing of physiological signals retains more of their original information [
111]. To minimize signal alteration, some researchers engage EEG specialists who visually identify artifact-free segments for subsequent analysis [
42].
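The most common filtering pipeline described above, a band-pass combined with a 50 Hz notch, can be sketched as follows; the cutoffs and filter orders here are illustrative assumptions, not values prescribed by the reviewed studies:

```python
import numpy as np
from scipy import signal

fs = 250  # Hz, a sampling rate used by several of the reviewed studies

def preprocess_channel(x, fs, band=(0.5, 45.0), notch_hz=50.0):
    """Band-pass filter one EEG channel and notch out power-line noise.

    A zero-phase Butterworth band-pass keeps the 0.5-45 Hz range covering
    the delta-beta bands, and an IIR notch at 50 Hz attenuates mains
    interference (50 Hz in China and Europe; 60 Hz elsewhere).
    """
    b, a = signal.butter(4, band, btype="bandpass", fs=fs)
    x = signal.filtfilt(b, a, x)  # filtfilt: forward-backward, zero phase
    bn, an = signal.iirnotch(notch_hz, Q=30.0, fs=fs)
    return signal.filtfilt(bn, an, x)

# Synthetic channel: 10 Hz "alpha" plus strong 50 Hz mains contamination.
t = np.arange(0, 4.0, 1.0 / fs)
raw = np.sin(2 * np.pi * 10 * t) + 2.0 * np.sin(2 * np.pi * 50 * t)
clean = preprocess_channel(raw, fs)
```

Zero-phase filtering (`filtfilt`) avoids introducing phase distortion, which matters when connectivity or latency features are extracted later.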
Typically, the acquisition of EEG data requires special equipment and professional operation. Consequently, EEG data samples from depressed patients are often limited, especially in medical research, where acquiring large volumes of labeled data for depression diagnosis is challenging. Additionally, EEG data are characterized by nonlinearity, time variability, and sensitivity to noise, and must be processed prior to training deep learning models to avoid overfitting. Given the high temporal resolution of EEG signals, many researchers have employed data augmentation methods to improve the utility of limited data [
112,
113,
114]. As illustrated in
Figure 4, the method has been utilized in both TML and DL research contexts. Data segmentation, a technique frequently used in data augmentation, involves dividing a long EEG signal into multiple short time windows, thereby significantly increasing the amount of data available [
115,
116]. Segmentation improves the model’s focus on local features of the time series, thereby enhancing its performance. Researchers must consider the length of the segmentation window; a window that is too short risks losing global information, while one that is too long may not significantly increase the sample size. Some researchers have segmented EEG into specific time intervals using data cropping [
81,
117], a technique that enables researchers to concentrate on the electrical activity of the brain during certain key time windows. However, it is essential that data cropping ensures the selected time segments are sufficiently informative; otherwise, non-informative or noisy segments may be inadvertently included.
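Sliding-window segmentation of the kind described above can be sketched in a few lines; the window and step lengths here are illustrative choices:

```python
import numpy as np

def segment_eeg(eeg, fs, win_sec=2.0, step_sec=1.0):
    """Split a (channels, samples) recording into overlapping windows.

    Overlapping windows are a common EEG augmentation: a one-second hop
    over two-second windows roughly doubles the number of training
    examples relative to non-overlapping segmentation.
    """
    win = int(win_sec * fs)
    step = int(step_sec * fs)
    n = eeg.shape[1]
    return np.stack([eeg[:, s:s + win] for s in range(0, n - win + 1, step)])

# A 60 s, 19-channel recording at 250 Hz yields 59 two-second windows
# with one-second hops.
fs = 250
rec = np.random.randn(19, 60 * fs)
windows = segment_eeg(rec, fs)
print(windows.shape)  # (59, 19, 500)
```

When windows from the same subject are split across training and test sets, accuracy estimates become optimistic, so segmentation is usually combined with subject-wise splitting.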
Z-score normalization is widely used to standardize EEG data across various channels or time periods to a consistent scale, thus eliminating discrepancies arising from varying amplitudes or baseline deviations [
118,
119]. Normalization improves the visibility of abnormal values within EEG signals, thereby facilitating the identification of atypical brain activity features. However, this method exhibits high sensitivity to outliers, such as noise from motion artifacts or electrode detachment. These outliers can substantially affect the calculation of means and standard deviations, consequently impacting the standardized results. Additionally, several studies have utilized specialized software and libraries to mitigate various types of noise, including Brain Electric Source Analysis (BESA), EEGLAB, Curry 7, Open MEEG toolbox, Brainstorm, ICLabel and MARA plugins, and MNE-Python.
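A minimal sketch of per-channel z-scoring follows, with a robust median/IQR variant included as one possible way to blunt the outlier sensitivity noted above; the robust option is an illustrative addition, not a method reported by the reviewed studies:

```python
import numpy as np

def zscore_channels(eeg, robust=False):
    """Per-channel standardization of a (channels, samples) array.

    Plain z-scoring uses the mean and standard deviation, which are
    sensitive to outliers such as motion artifacts; the robust variant
    substitutes the median and interquartile range.
    """
    if robust:
        center = np.median(eeg, axis=1, keepdims=True)
        q75, q25 = np.percentile(eeg, [75, 25], axis=1, keepdims=True)
        scale = q75 - q25
    else:
        center = eeg.mean(axis=1, keepdims=True)
        scale = eeg.std(axis=1, keepdims=True)
    return (eeg - center) / scale

eeg = np.random.randn(4, 1000) * 50 + 10  # arbitrary amplitudes and baselines
z = zscore_channels(eeg)
zr = zscore_channels(eeg, robust=True)
```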
Analysis
The preprocessing methods outlined above can be divided into two primary categories: automatic and semi-automatic or manual. In automatic preprocessing methods, tools such as EEGLAB and BESA rapidly process large volumes of EEG data, minimizing human intervention with techniques such as filtering, ICA, and normalization. This approach is especially well-suited for large-scale data analysis. Automated methods ensure consistency in preprocessing steps. For example, the use of uniform filters and standardization algorithms produces reproducible results across experiments, thereby enhancing the study’s reliability. FastICA and ANC effectively extract and remove artifacts and enhance signal-to-noise ratios, thereby improving signal quality. Typically, these methods can rapidly identify and eliminate artifacts caused by eye movements or other biological signals. However, automatic preprocessing can alter the intrinsic dynamics of the signal, especially when techniques such as ICA are employed [
106]. Incorrect component identification may lead to the loss of crucial signal features. Among semi-automatic or manual preprocessing methods, the manual approach allows researchers to tailor the preprocessing strategy to specific experimental requirements. Expert visual inspection ensures the accuracy and integrity of the signal and minimizes potential distortions. Expert intervention facilitates more efficient identification and handling of outliers, thereby preventing the omission of important signals that may be overlooked by automated processing. This approach generally preserves more information and minimizes alterations to the signal. Although semi-automatic or manual methods can enhance data quality, insufficient sample sizes may lead to both the training and test sets containing the same individuals, thus impairing the model’s generalization capabilities. In conclusion, automatic preprocessing methods provide significant advantages in terms of efficiency and consistency, making them ideal for large-scale data analyses. However, they may also alter the signal characteristics. Conversely, semi-automatic or manual methods offer enhanced flexibility and improved control over signal quality, thereby allowing for the retention of more useful information.
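ICA-based artifact removal can be sketched with scikit-learn's FastICA; the kurtosis threshold used to flag artifact components below is a crude illustrative heuristic, standing in for the expert inspection or ICLabel/MARA classification that the reviewed studies actually use:

```python
import numpy as np
from sklearn.decomposition import FastICA

def remove_artifact_components(eeg, n_components=None, kurt_thresh=5.0):
    """Illustrative ICA-based artifact removal for (channels, samples) EEG.

    Decompose the recording into independent components, zero out
    components with extreme excess kurtosis (blink-like transients are
    strongly super-Gaussian), and reconstruct from the remainder.
    """
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(eeg.T)          # (samples, components)
    k = ((sources - sources.mean(0)) ** 4).mean(0) / sources.var(0) ** 2 - 3
    sources[:, np.abs(k) > kurt_thresh] = 0.0   # drop spiky components
    return ica.inverse_transform(sources).T

rng = np.random.default_rng(0)
eeg = rng.standard_normal((8, 2000))
blink = np.zeros(2000)
blink[500:520] = 40.0                            # a blink-like transient
eeg += np.outer(rng.random(8), blink)            # mixed into all channels
cleaned = remove_artifact_components(eeg)
```

As the Analysis notes, zeroing a mis-identified component deletes genuine signal, which is why manual review of the flagged components is often retained.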
Although there are many similarities in EEG signal preprocessing between TML and DL approaches, notable differences in method selection and data processing strategies remain. In terms of method selection, deep learning predominantly employs sophisticated, automated techniques like ASR and ANC to effectively manage high-dimensional and complex data [
120]. In contrast, traditional machine learning methods primarily employ classical signal processing techniques like PCA and Kalman filters, which concentrate on the statistical properties of signals and linear transformations. In terms of data processing methods, the transformation of EEG signals into images for convolutional neural network (CNN) analysis represents a novel approach, which is seldom employed in TML. This method underscores the capabilities of DL to effectively manage unstructured data, such as images and videos. In contrast, traditional machine learning prioritizes specific filtering techniques and models, such as Kalman filters and adaptive prediction filters, which are tailored for precise signal modeling and prediction. These differences highlight the unique needs and strengths of each approach in addressing data complexity and ensuring adaptability.
In summary, deep learning methods increasingly depend on automation and advanced techniques during preprocessing to enhance data quality and improve model performance. Traditional machine learning methods, known for their conciseness and efficiency, are particularly well-suited for environments with limited computing resources. These methods concentrate on fundamental signal characteristics and employ straightforward transformations during the preprocessing phase.
3.4. Feature Extraction
3.4.1. Studies Based on TML Methods
Although EEG signals encompass a broad spectrum of information, the specific aspects related to depression remain unclear. Therefore, the signal must be optimized through the extraction of features that accurately represent the relevant information [
121,
122]. Feature extraction entails generating features by discerning hidden patterns within the input signal. In the reviewed literature, a range of methods have been employed to extract and select diverse types of features for diagnosing depression.
Table 6 summarizes the various categories of features extracted from studies that employ traditional machine learning methods. As shown in
Table 6, the investigated features are classified into five categories: time-domain, frequency-domain, nonlinear, connectivity, and generated features. Various methods have been utilized to extract and select diverse features for the diagnosis of depression, including FFT, DWT, MDWT, Neighborhood Component Analysis (NCA), Fourier–Bessel Series Expansion (FBSE), histogram-based feature generation, parametric modeling with autoregressive models, and detrended fluctuation analysis (DFA), among others.
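As one concrete example of a frequency-domain feature, band power can be estimated from a Welch periodogram; the band boundaries below are conventional but vary slightly across studies:

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(x, fs):
    """Absolute power in each classical EEG band from a Welch PSD estimate."""
    freqs, psd = welch(x, fs=fs, nperseg=2 * fs)  # 0.5 Hz resolution
    out = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = trapezoid(psd[mask], freqs[mask])
    return out

# A dominant 10 Hz rhythm concentrates power in the alpha band.
fs = 250
t = np.arange(0, 10.0, 1.0 / fs)
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(len(t))
powers = band_powers(x, fs)
```

One such vector per channel (or per channel pair, for connectivity features) is then fed to the classifier.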
Several researchers have introduced methods for automatic feature extraction. For instance, some have constructed a graph in which the nodes represent subjects within the dataset, employing the Node2vec algorithm to compute feature representations as node embeddings [
44,
45]. These node embeddings serve as valuable features readily applicable to classification algorithms. Aydemir’s team [
54] pioneers the development of a molecular shape-based feature generator, termed the “melamine pattern”. In this approach, the melamine pattern is innovatively integrated with the DWT to generate texture features, also known as spatial domain features, significantly enhancing the classification accuracy. Tasci et al. [
62] decompose the original signal using the Daubechies Four (DB4) parent wavelet function. They introduce an innovative Twin Pascal Triangle Lattice Pattern (TPTLP) approach to extract texture features from EEG signals.
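The multilevel wavelet decomposition underlying such approaches can be illustrated with the simplest (Haar) wavelet; published studies typically use richer mother wavelets such as Daubechies-4, so this is a didactic sketch only:

```python
import numpy as np

def haar_dwt(x):
    """One level of a Haar discrete wavelet transform.

    Returns approximation (low-pass) and detail (high-pass) coefficients,
    each half the input length. Repeating this step on the approximation
    yields the multilevel decomposition (MDWT) used for sub-band features.
    """
    x = x[: len(x) // 2 * 2]                  # ensure even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def wavedec(x, levels):
    """Multilevel decomposition: [cA_n, cD_n, ..., cD_1]."""
    details = []
    for _ in range(levels):
        x, d = haar_dwt(x)
        details.append(d)
    return [x] + details[::-1]

x0 = np.random.default_rng(0).standard_normal(1024)
coeffs = wavedec(x0, levels=4)
print([len(c) for c in coeffs])  # [64, 64, 128, 256, 512]
```

Because the transform is orthonormal, signal energy is preserved across the coefficient sets, so per-level energies and entropies make well-behaved features.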
3.4.2. Studies Based on DL Methods
TML methods depend on the manual design and selection of features, requiring the expertise and experience of domain specialists. Numerous researchers have attempted to remove this limitation by applying DL models to the field. During data processing, features are progressively extracted and integrated across multiple layers of the model, ultimately leading to a final decision or prediction based on these automatically extracted features [
105]. However, deep learning models are often described as “black boxes”, making their internal feature representations difficult to interpret. Furthermore, raw EEG signals may include information extraneous to depression, potentially impairing model performance if used directly as input. As a result, some researchers strive to integrate manually extracted features into deep learning models. In EEG signal processing, brain waves are typically categorized into four primary frequency bands: δ, θ, α, and β [
123,
124,
125]. Each frequency band is associated with specific brain states and activities. Several researchers have focused their research on these frequency bands to enhance model performance through the extraction of structural features and power spectral density. Inspired by advances in deep learning for computer vision, several researchers have transformed EEG into image form, capitalizing on the strengths of DL in image processing to enhance EEG signal analysis. However, this approach could result in potential information loss, heightened processing complexity, and an increased risk of overfitting. Furthermore, given that the brain functional connectivity analysis reveals the interactions and synergistic activities among various brain regions [
126,
127,
128], several researchers have combined brain functional connectivity matrices with feature fusion techniques. This approach increases the comprehensiveness of raw EEG signal information and provides substantial noise immunity. Furthermore, several researchers have utilized a range of techniques for EEG signal analysis, including Partial Directed Coherence (PDC), the integration of temporal and spatial features, and both distance-based and non-distance-based projection methods.
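As a concrete illustration of the band-level analysis described above, the sketch below sums spectral power within the standard δ, θ, α, and β bands using a naive one-sided DFT in pure Python. The sampling rate, band edges, and synthetic test signal are illustrative assumptions, not taken from any reviewed study.

```python
import math

FS = 128  # sampling rate in Hz (hypothetical)
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(signal, fs=FS):
    """Sum spectral power per EEG band using a naive one-sided DFT."""
    n = len(signal)
    powers = {name: 0.0 for name in BANDS}
    for k in range(1, n // 2):                      # skip the DC component
        freq = k * fs / n
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        p = (re * re + im * im) / n
        for name, (lo, hi) in BANDS.items():
            if lo <= freq < hi:
                powers[name] += p
    return powers

# Synthetic 1 s "EEG": dominant 10 Hz (alpha) plus a weaker 6 Hz (theta) component
sig = [math.sin(2 * math.pi * 10 * t / FS) + 0.4 * math.sin(2 * math.pi * 6 * t / FS)
       for t in range(FS)]
bp = band_powers(sig)
print(max(bp, key=bp.get))  # alpha dominates this synthetic signal
```

In practice, researchers use FFT-based estimators such as Welch's method rather than a naive DFT; the sketch only makes the band-summation step explicit.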
Table 7 summarizes the diverse feature extraction methods employed in combination with deep learning techniques.
3.4.3. Analysis
An analysis of
Table 6 shows that the features extracted in studies employing traditional machine learning methods are primarily time-domain features, with skewness and kurtosis being the most frequently utilized, followed by the mean. Among the nonlinear features, entropy calculation is a widely used technique, with approximate entropy and Shannon entropy also being frequently employed by researchers. Notably, entropy features have not yet been explored in EEG-based depression diagnosis using DL methods. Conversely, band power is the most frequently utilized feature in the frequency domain. Both approaches emphasize the critical role of functional connectivity in elucidating the brain’s functional networks. The application of molecular structure maps and lattices in EEG-based depression diagnosis represents an exceptionally innovative approach, providing significant potential for future exploration. As detailed in
Table 7, studies employing deep learning methods frequently combine diverse feature extraction techniques with various deep learning models, including CNN, DAN, LSTM, GNN, and MLP.
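Several of the hand-crafted features discussed above (mean, skewness, kurtosis, Shannon entropy) require only a few lines of code. The sketch below is a minimal pure-Python illustration over a synthetic window; the window length and the entropy histogram's bin count are arbitrary choices for demonstration.

```python
import math
from collections import Counter

def moments(x):
    """Mean, skewness, and excess kurtosis of a signal window."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    sd = math.sqrt(var)                              # assumes a non-constant window
    skew = sum((v - mean) ** 3 for v in x) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in x) / (n * var ** 2) - 3.0
    return mean, skew, kurt

def shannon_entropy(x, bins=8):
    """Shannon entropy (bits) of the amplitude histogram of the window."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0                  # guard against a flat window
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in x)
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

window = [math.sin(0.1 * t) for t in range(256)]     # synthetic stand-in for one epoch
mean, skew, kurt = moments(window)
H = shannon_entropy(window)
```

A near-sinusoidal window like this one is roughly symmetric (skewness near zero) and platykurtic (negative excess kurtosis), which is a quick sanity check on the formulas.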
Features extracted through traditional machine learning methods are typically low-level and directly derived from the original signal. Deep learning methods can extract high-level features across multiple layers, producing more abstract and complex representations that significantly improve model performance. Traditional machine learning features are crafted for straightforward interpretation, while deep learning models exhibit greater complexity and reduced interpretability. In their experiments, researchers often employ visualization techniques, including activation maps and feature maps, to aid interpretation. In summary, while deep learning offers robust capabilities for automatic feature extraction and learning in EEG signal processing, its complexity and the significant computational resources it demands must also be considered.
3.5. Feature Selection
Feature selection reduces the feature dimensionality by removing redundant or irrelevant features. Because the extracted features may be redundant or irrelevant, using all of them as classifier inputs can decrease accuracy. To improve classifier accuracy, key features must be selectively identified and utilized in the analysis [
129].
Table 8 presents a summary of the feature selection methods employed in EEG-based depression diagnosis studies. According to
Table 8, Neighborhood Component Analysis (NCA) emerges as the most frequently employed method. The primary objective of NCA is to enhance classification performance by learning a feature transformation matrix that brings samples from the same category closer together and separates samples from different categories in the transformed feature space. However, NCA exhibits significant computational complexity, demonstrates sensitivity to noise and outliers, and is best suited to linear feature relationships. Additionally, methods such as the genetic algorithm (GA), ReliefF, Node2vec, and others are commonly utilized.
Certain DL models, including CNN, LSTM, and Recurrent Neural Networks (RNNs), automatically filter and prioritize input features through their inherent structure and training process. This method effectively highlights useful features while minimizing the impact of redundant or irrelevant ones. Additionally, certain deep learning models employ regularization techniques, such as L1 and L2 regularization, Dropout, and Batch Normalization, to mitigate overfitting and indirectly facilitate feature selection. Visualization tools, such as Grad-CAM and t-SNE, facilitate researchers’ interpretation of model features and regions of interest, thereby enhancing feature selection and analysis.
In summary, deep learning-based methods eliminate the necessity for manual feature selection and excel at processing large-scale, high-dimensional, and complex data. However, the feature selection process in deep learning models is frequently viewed as a black-box, complicating efforts to interpret the decision-making logic within the model. Feature selection continues to be a critical area of research, substantially affecting classification performance and playing a crucial role in subsequent data analyses.
3.6. Classification Methods
3.6.1. Studies Based on TML Methods
Table 9 summarizes the performance of each algorithm in studies of depression diagnosis that use traditional machine learning methods. From left to right, the table lists the algorithm used, the validation strategy (which details the proportion of the dataset allocated to training, testing, and validation), the top-performing classifier, the accuracy given as a percentage, and the associated study. The choice of classifiers significantly influences the performance of the models, with various studies employing distinct classification methods. Currently, a diverse array of traditional machine learning algorithms is systematically applied to the diagnosis of depression and categorized into seven primary groups:
Support Vector Machine-Based Classifiers: includes the Support Vector Machine (SVM) with various kernels and Least Squares SVM (LS-SVM).
Tree-Based Structures: includes methods such as Decision Tree (DT), Best-First Tree (BF-Tree), and Coarse Tree.
K-Nearest Neighbor (KNN) and Variants: encompasses KNN-GA and E-KNN.
Ensemble Learning Methods: comprises Bagging, Extreme Gradient Boosting (XGBoost), GentleBoost (GB), RusBoost (RB), Random Forest (RF), and Adaptive Boosting (AdaBoost).
Probabilistic Models: includes well-known methods such as Naïve Bayes (NB) and Logistic Regression (LR).
Discriminant Analysis-Based Classifiers: features Linear Discriminant Analysis (LDA).
Other Methods: includes approaches such as K-means Singular Value Decomposition (K-SVD) and Linear Regression.
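As a minimal illustration of the k-nearest-neighbor family listed above, the sketch below classifies a hypothetical subject from two toy feature values by majority vote among the k closest training samples; the feature vectors and group labels are invented for demonstration only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy 2-D feature vectors (e.g., two hypothetical per-subject EEG band powers)
train_X = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
train_y = ["HC", "HC", "HC", "MDD", "MDD", "MDD"]
label = knn_predict(train_X, train_y, [0.75, 0.85])  # falls in the "MDD" cluster
```

KNN's simplicity is part of why it appears so often in the reviewed studies: its only hyperparameters are k and the distance metric, and variants such as KNN-GA and E-KNN tune exactly these choices.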
Figure 5 illustrates the frequency distribution of depression diagnosis algorithms that utilize traditional machine learning, with the numbers following the colon (“:”) indicating the frequency of usage for each method.
Analysis
This review mainly analyzes the classification accuracy of various studies. Classification accuracy represents the percentage of all participants who were correctly classified as either depressed or in the control group. As illustrated in
Figure 5, KNN and SVM were the classifiers most frequently used in the included studies on depression diagnosis, with KNN appearing 15 times and SVM 18 times. As shown in
Table 9, KNN and SVM emerged as the top-performing classifiers in several studies. On non-public datasets, KNN demonstrated strong performance, with accuracies ranging from 76.83% to 98.43%. On public datasets, KNN yielded superior results, surpassing other classifiers such as BF-Tree and K-SVD in studies using the MODMA dataset [
61] and the Mumtaz team’s dataset [
39,
50,
51], achieving peak classification accuracies of 100% and 99.11%, respectively. This reflects exceptionally high performance under specific conditions; however, it does not rule out the possibility of sample leakage, which could artificially inflate the classification accuracy. SVM also exhibits outstanding performance, with accuracy rates ranging from 83.67% to 98%. Probabilistic models such as LR, as well as LDA, have also received some attention in this area; however, none have emerged as the top-performing classifiers in any of the studies reviewed. Furthermore, ensemble methods have exhibited high accuracies, achieving a maximum accuracy rate of 99.11% and underscoring the advantages of ensemble learning approaches for complex tasks. As shown in
Figure 5 and
Table 9, among tree-based classifiers, Decision Tree is not only the most frequently used algorithm, but it also ranks as the highest-performing classifier in several studies, with accuracies reaching up to 95%.
In summary, the accuracies reported in the referenced studies varied from 76.83% to 100%. For traditional machine learning methods, the classification accuracy depends more on the selected features than on the choice of classifier [
130]. Many studies focus on examining the impact of diverse feature extraction methods on model performance. Some studies aim to reduce system complexity by minimizing the number of channels used, while still preserving accuracy. Although many studies report respectable classification accuracies, they often involve a small number of participants, which may limit the model’s generalizability and its clinical applicability.
3.6.2. Studies Based on DL Methods
Table 10 summarizes the performance of various algorithms used in DL-based depression diagnosis studies. From left to right, the table lists the algorithms employed, the validation strategy (which details the percentage of the dataset allocated to training, testing, and validation), the top-performing classifier, the accuracy presented as a percentage, and the associated study. Deep learning algorithms for diagnosing depression in the reviewed studies are categorized into four primary groups:
CNN and variants, including CNN, 2D-CNN, CNN-LSTM, Lightweight-CNN, AchCNN, and EEGNet;
Graph Convolutional Network (GCN) and variants, such as GCN, MGGCN, GICN, and AMGCN;
Recurrent Neural Network (RNN) and variants, including RNN, 4D-CRNN, GRU, and GC-GRU;
Inception-based models and variants, notably Inception, InceptionTime, InceptionNet, and Tsception.
Among these, numerous hybrid models are identified, including DeprNet [
65], AchLSTM [
66], AchCNN [
64], T-LSTM [
131], H-KNN1 [
38], H-KNN2 [
132], S-EMD [
133], S-SVM [
134], H-DBN [
135], EEGNet [
136], DeepConvNet [
137], ShallowConvNet [
137], LSDD-EEGNet [
70], HybridEEGNet [
72], SynEEGNet [
72], RegEEGNet [
72], MRCNN-RSE [
73], MRCNN-LSTM [
73], DeepCoral [
138], VGG16 [
139], AlexNet [
139], ResNet50 [
139], MV-SDGC-RAFFNet [
84], TS-SEFFNet [
140], 4D-CRNN [
141], DBGC-ATFFNet [
142], ASTGCN [
143,
144], DepHNN [
145], TSUnet-CC [
87], DiffMDD [
88], 1D-CNN-Transformer [
66,
145], CWT-1D-CNN [
146], CWP-2D-CNN [
146], GRU-Conv [
147], and GC-GRU [
148].
Figure 6 illustrates the frequency distribution of DL-based algorithms used in diagnosing depression, with the numbers following the colon (“:”) indicating the usage frequency of each method.
Analysis
As shown in
Figure 6, CNN and its variants constitute the most commonly employed deep learning method in the research. An analysis of
Table 10 shows that most studies utilizing CNN and its variants (e.g., 2D-CNN, CNN-LSTM) exhibit an excellent performance. In studies using publicly available datasets from Mumtaz’s team [
39,
50,
51,
89] and Cavanagh’s team [
60], CNN variants (2D-CNN and Lightweight-CNN) surpassed other classifiers in attaining the highest classification accuracy. Among these, the 2D-CNN is particularly notable, with Khan’s team [
80] achieving a 100% classification accuracy in the study, which highlights its exceptional capability in EEG signal processing. The Lightweight-CNN reached an accuracy of 99.87% while reducing computational resource consumption, rendering it adaptable for a wide range of applications. The CNN-LSTM model merges the feature extraction capabilities of CNN with the time series processing advantages of LSTM, attaining an accuracy of up to 99.9%. LSTM variants are well suited to processing sequential EEG signals but prove relatively less effective when used independently. Hybrid models, including MV-SDGC-RAFFNet and TSUnet-CC, which integrate the strengths of multiple models, achieve accuracies of 99.19% and 99.22%, respectively, highlighting their potential for future hybrid model development. The application of GCN and its variants in EEG signal processing has shown excellent accuracy (see
Table 10). In studies utilizing the public dataset MODMA [
61], MGGCN delivered the highest performance, achieving an accuracy of 99.69% [
76]. The application of GCN and its variants in EEG signal processing merits additional exploration. MLP delivered the highest performance in the study utilizing the public dataset EDRA.
In summary, deep learning methods consistently outperform traditional machine learning techniques in classification accuracy, demonstrating substantial potential for EEG-based automatic depression diagnosis. However, the black-box-like properties of DL models complicate the interpretation of their decision-making processes, potentially reducing their trustworthiness in clinical applications. Studies utilizing various publicly available datasets have yielded diverse optimal classifiers, indicating that the existing findings lack sufficient generalizability. Gathering more heterogeneous data and performing additional experimental comparisons will help address this issue.
3.7. Validation Methods
To detect overfitting and ensure model generalization, it is essential to evaluate both traditional and deep learning models using suitable validation techniques.
Figure 7 depicts the frequencies of various validation methods used in depression diagnostic studies, with the numbers above the bars showing the frequency of each method. As depicted in
Figure 7, among the included depression diagnostic studies, the n-fold cross-validation (CV) is the most frequently employed method by researchers, succeeded by the LOOCV. The n-fold CV [
149,
150,
151], typically configured for 5 or 10 folds, is the approach most commonly employed by researchers (refer to
Table 9 and
Table 10). Numerous models in the reviewed studies exhibit strong performance when assessed using this approach. Specifically, across n iterations, each sample is included n − 1 times in the training set and once in the validation set. This method effectively utilizes the entire dataset and mitigates the issue of small sample sizes prevalent in many studies. However, n-fold CV presents certain drawbacks, including significant computational costs attributed to multiple training and evaluation cycles. Additionally, if the dataset size is not divisible by n, the folds can vary in size, potentially impacting the accuracy of the evaluation results. To address this issue, several studies have chosen to partition the dataset on a per-subject basis. However, this validation method entails the risk that certain sample types may exclusively appear in either the training or the testing set, especially in cases of imbalanced datasets. Wan’s team [
40] implemented a sampling-without-replacement strategy in their study, ensuring consistent sample sizes for both the depressed and normal control groups, thereby addressing the issue of model performance degradation due to sample imbalance.
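The fold-size and class-imbalance issues discussed above motivate stratified splitting, which keeps class proportions constant across folds. The sketch below shows one simple pure-Python way to build stratified folds (round-robin dealing within each class); the imbalanced label list is an invented toy cohort.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions preserved per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % n_folds].append(i)       # deal each class round-robin
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test

labels = ["MDD"] * 20 + ["HC"] * 30              # imbalanced toy cohort
splits = list(stratified_kfold(labels, n_folds=5))
# every fold's test set holds 4 MDD and 6 HC subjects
```

Note that stratification alone does not prevent subject-level leakage: when a subject contributes multiple epochs, the split must additionally be made per subject.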
To optimize information utilization from each sample, several studies have implemented LOOCV [
152] and its variants, including LOSOCV and LOPOCV [
47,
58,
62]. LOPOCV and LOSOCV are well-suited for datasets with substantial individual differences, while LOOCV is more appropriate for datasets with small sample sizes. Additionally, one study [
54] implements the hold-out validation strategy [
153], which entails randomly partitioning the dataset into two mutually exclusive subsets: one designated for training and the other for testing. If the test set is too small, the evaluation results might be unstable and inaccurate; conversely, an overly large test set leaves too little data for training the model. Furthermore, hold-out validation can introduce bias from the data partitioning process, particularly when significant differences exist between the sample proportions in the training and testing sets. Maintaining data independence is crucial when employing these methods to preserve the accuracy and reliability of model assessments. However, owing to limited sample sizes in many studies, the test and training sets frequently lack independence, potentially resulting in data leakage. Wu’s team [
41] mitigates this issue by implementing 5-fold CV and validating the model with an independent test set, thus minimizing the risk of data leakage. A research team [
63] has used the nested CV strategy [
154], effectively reducing data leakage by separating parameter selection from performance evaluation, thus providing a more robust assessment of the model’s generalization capability. However, this strategy incorporates two layers of cross-validation, which may complicate its implementation.
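The nested CV strategy can be sketched compactly: an inner loop selects a hyperparameter using only the training portion, while the outer loop estimates generalization on data untouched by that selection. The toy one-dimensional threshold classifier below stands in for a real model; the data, fold scheme, and candidate thresholds are all invented for illustration.

```python
import statistics

# Toy 1-D dataset: one hypothetical EEG feature per subject (0 = HC, 1 = MDD)
X = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.85, 0.9]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

def accuracy(thr, idx):
    """Accuracy of the rule 'predict 1 when X > thr' on the given indices."""
    return sum((X[i] > thr) == y[i] for i in idx) / len(idx)

def folds(idx, k):
    return [idx[f::k] for f in range(k)]         # simple interleaved folds

outer_scores = []
for test in folds(list(range(len(X))), 3):       # outer loop: generalization estimate
    train = [i for i in range(len(X)) if i not in test]
    # inner loop: pick the threshold using training data only
    best_thr = max((0.3, 0.5, 0.7),
                   key=lambda t: statistics.mean(accuracy(t, inner)
                                                 for inner in folds(train, 3)))
    outer_scores.append(accuracy(best_thr, test))
score = statistics.mean(outer_scores)
```

Because the outer test indices never influence the threshold choice, the averaged outer score is a leakage-free estimate of performance, which is precisely the property the cited study exploits.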
When selecting a validation strategy, researchers should base their decisions on factors including dataset size, computational resources, model complexity, and evaluation requirements. For large-scale datasets constrained by limited computational resources, hold-out CV provides a rapid and straightforward method. For medium-sized datasets, k-fold CV, such as 5-fold or 10-fold, ensures a more stable evaluation. When dealing with small datasets and ample computational resources, LOOCV optimizes data utilization. For complex models and hyperparameter tuning, nested CV is the preferred method, providing low-bias model evaluations and optimized generalization error estimations. In scenarios involving unbalanced datasets, stratified k-fold or stratified nested CVs may be utilized to ensure consistent sample proportions across categories. Future research should emphasize the utilization of large datasets and independent samples wherever feasible.
4. Discussion
Researchers in existing studies commonly utilize two types of methods: TML and DL. Numerous studies report substantial accuracy in diagnosing depression using rsEEG. Given the distinctions between the two research methods, we opted to investigate the commonalities among these studies.
To begin with, accurately identifying depression is the first challenge that researchers must address. Episodes of MDD often coexist with other disorders, such as bipolar disorder and anxiety disorders [
155]. The clinical manifestations of major depression and bipolar disorder are challenging to distinguish [
156,
157,
158]; moreover, most individuals with bipolar disorder also experience phases of major depression [
159,
160]. Recent research has revealed that resting-state connectivity biomarkers are capable of defining neurophysiological subtypes of depression [
161]. Future research may explore solutions to this issue through EEG-based, data-driven approaches [
162].
Various studies may utilize differing diagnostic tools and criteria, leading to heterogeneity in EEG parameters across subjects. Consequently, identifying reliable biomarkers for MDD based on EEG signals is crucial [
163,
164,
165]. Several researchers have investigated the feasibility of employing wavelet coherence between regions of DMN as a biomarker [
80]. Moreover, research has shown that both effective connectivity and functional connectivity among various brain regions are essential for accurately distinguishing between depressed patients and healthy participants [
79,
86,
166,
167].
Currently, numerous studies employing TML and DL methods with EEG data face the pervasive issue of insufficient sample sizes. Among the studies included in the comparative analysis, only four featured a sample size exceeding 200 cases. Most studies exhibited small to moderate sample sizes and relied on limited sample sources. A large sample size is typically not derived from a single location; rather, data are gathered from multiple sources, including hospitals, clinics, neuroscience research institutes, clinical trials, research programs, and universities. Furthermore, the range of subjects’ occupations could be broadened to include IT workers, doctors, video bloggers, business managers, laborers, and sales personnel. Recently, advancements in portable EEG recording devices have streamlined the collection of EEG data [
168,
169]. For instance, wireless EEG caps such as the Epoch, ENOBIO Neuroelectrics, and iMotions not only offer portability and comfort, but they also facilitate real-time data transmission. The enhancement of public databases and the development of standardized datasets have offered essential support for research in depression diagnostics. Mumtaz’s team [
39,
50,
51,
89], Cavanagh’s team [
60], Cai’s team [
61], Yang’s team [
90], and researchers including Wang et al. [
98], and Kang et al. [
77] have publicly shared their datasets or code, thereby facilitating knowledge sharing. Many researchers have expanded upon these contributions. We encourage researchers in this field to share their data and code. Doing so not only promotes scientific transparency and reproducibility, enhances research credibility, and facilitates the verification and reproduction of results by others, but it also fosters cooperation and communication, thereby advancing technological innovation in this area [
170].
The limited availability of shared datasets in this field can be partly attributed to the complexity of EEG signal acquisition. In clinical practice, EEG signals are typically acquired using traditional wet electrodes [
171]. However, the requirement to apply a conductive medium, such as conductive paste, between the electrodes and the scalp complicates the procedure and creates discomfort for the subject. Dry and semi-dry electrodes offer a more efficient alternative for EEG signal acquisition, especially in situations where professionals are unavailable [
172]. Currently, newly developed semi-dry gel electrodes offer greater comfort compared to wet electrodes, and they address the issue of rigid dry electrodes not adhering stably to the skin [
173,
174]. This advancement simplifies the EEG collection process and enhances the multifunctionality of the technology across a range of scenarios. The widespread adoption of gel electrodes could significantly enhance EEG signal acquisition.
KNN and SVM are the classifiers most frequently utilized in studies employing traditional machine learning methods. Meanwhile, studies that utilize deep learning methods frequently employ various types of CNNs and their variants. These models have demonstrated a robust performance in diagnosing depression, achieving exceptionally high accuracy rates. To enhance the generalization capabilities of the models, researchers frequently employ n-fold CV and LOOCV methods for evaluation. However, most studies were not tested with independent samples, and some lacked the explicit reporting of validation information, casting doubt on the reliability of the model results. Future studies should ensure the complete independence of training and test sets and should transparently report their distributions to affirm the reliability and clinical applicability of the model results. Generalization ability denotes a model’s capacity to predict and process new data beyond the training dataset. Consequently, external validation is essential. Among the included studies on depression diagnosis, none were found to have developed a model on one dataset (its own dataset) and validated it on another (a public dataset). Implementing such a process would offer a robust test of the model’s generalizability. Only a few existing studies have considered using multiple datasets, and they mostly rely on commonly used public datasets (e.g., MODMA [
61]), potentially affecting model robustness. The subsequent phase entails the development of models on proprietary datasets, followed by their validation on datasets that are publicly accessible. Furthermore, the model must be interpretable [
175]. With highly interpretable models, researchers can determine which EEG features, including activity in specific frequency bands, most significantly influence diagnostic outcomes. This understanding facilitates the more precise identification of depression biomarkers, offering essential insights for advancing neuroscience research. Enhancing model interpretability presents a substantial challenge for deep learning-based approaches. Researchers must meticulously select algorithms based on their intended applications. The development of visualization tools and interpretative algorithms can aid clinicians in comprehending and applying the results derived from studies employing CNN architectures. Moving forward, it is imperative to intensify comparative studies among various models to identify optimal solutions that improve current clinical practices and provide more effective treatment strategies for patients with depression.
Presently, methods that diagnose depression via a single EEG signal still encounter significant limitations regarding accuracy and clinical applicability. Multimodal approaches are generally more effective than EEG-only methods, as features can be extracted more effectively by leveraging multiple complementary modalities rather than relying on a single one [
132,
176,
177,
178,
179]. Future research should investigate the integration of multimodal data, incorporating a variety of sources including speech, facial expressions, text, images, eye tracking, RNA-Seq, functional magnetic resonance imaging (fMRI), and other modalities to enhance the diagnostic accuracy and generalizability [
18,
164,
180,
181,
182,
183,
184,
185,
186,
187,
188,
189,
190,
191].
Classifying the subjects as those with MDD or HC represents only the preliminary step in depression research employing EEG signals. The development of predictive models for depression treatment outcomes and relapse is crucial for preventing ineffective treatment strategies [
192,
193,
194,
195,
196,
197,
198]. Furthermore, the use of IoT technology in developing personalized medical models, which take into account individual differences and clinical characteristics, offers considerable potential for providing precise diagnoses and tailored treatment plans for patients with depression [
199,
200].
As this research advances, computational intelligence-based diagnostic systems for depression are anticipated to exert a growing influence on clinical applications, offering more reliable and intelligent diagnostic and treatment solutions. Simultaneously, these technological advances are also expected to yield valuable insights into the diagnosis and research of other mental disorders.
5. Conclusions
The diagnosis of depression remains a significant and challenging issue in contemporary medical research. Diagnostic methods traditionally depend on clinical assessments and self-reports; however, these approaches often face challenges stemming from subjectivity and clinical bias. In recent years, the advent of computational psychiatry has catalyzed a growing number of studies that integrate electroencephalography with artificial intelligence, substantially enhancing the diagnosis of depression. In this review, we conduct a comparative analysis of 49 studies that utilize rsEEG in conjunction with both TML and DL techniques to diagnose depression.
EEG is capable of detecting specific brain physiological changes related to depression, offering a crucial objective foundation for depression diagnosis. TML methods are extensively employed in the diagnosis of depression, notably demonstrating significant advancements in feature extraction. Although feature extraction is complex and time-intensive, traditional machine learning offers superior interpretability in analyzing diagnostic results compared to deep learning methods. Commonly used traditional machine learning algorithms, including KNN and SVM, have demonstrated an excellent classification performance in numerous studies. Recently, research on the application of DL methods for diagnosing depression has expanded significantly. Deep learning demonstrates significant potential in processing raw EEG signals due to its capability for automatic feature extraction. CNN and its variants represent the most commonly utilized deep learning architectures. By integrating manually extracted features with these models, researchers further enhance the accuracy of depression diagnosis. However, deep learning-based studies still face limitations due to the scarcity of large-scale EEG datasets, and the challenge of model interpretability persists. Furthermore, validating these models using independent datasets is crucial to ensure their generalization capabilities. Going forward, researchers will focus more on multimodal data fusion methods. Currently, the integration of artificial intelligence and EEG in depression research offers new opportunities for precision medicine and preventive strategies. We anticipate a growing number of individuals benefiting from this advancement.