Article

CALSczNet: Convolution Neural Network with Attention and LSTM for the Detection of Schizophrenia Using EEG Signals

by Norah Almaghrabi *, Muhammad Hussain and Ashwaq Alotaibi
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 1989; https://doi.org/10.3390/math12131989
Submission received: 22 April 2024 / Revised: 20 June 2024 / Accepted: 24 June 2024 / Published: 27 June 2024

Abstract:
Schizophrenia (SZ) is a serious psychological disorder that affects nearly 1% of the global population. The progression of SZ causes severe brain damage, so its early diagnosis is essential to limit adverse effects. Electroencephalography (EEG) is commonly used for SZ detection, but its manual screening is laborious, time-consuming, and subjective. Automatic methods based on machine learning have been introduced to overcome these issues, but their performance is not satisfactory due to the non-stationary nature of EEG signals. To enhance the detection performance, a novel deep learning-based method, namely, CALSczNet, is introduced. It uses temporal and spatial convolutions to learn temporal and spatial patterns from EEG trials, uses Temporal Attention (TA) and Local Attention (LA) to adaptively and dynamically attend to salient features to tackle the non-stationarity of EEG signals, and finally employs Long Short-Term Memory (LSTM) to capture the long-range dependencies of temporal features and learn discriminative features. The method was evaluated on the benchmark public-domain Kaggle dataset of basic sensory tasks using 10-fold cross-validation. It outperforms the state-of-the-art methods on all conditions with 98.6% accuracy, 98.65% sensitivity, 98.72% specificity, 98.72% precision, and an F1-score of 98.65%. Furthermore, this study suggests that the EEG signals of subjects performing either simultaneous motor and auditory tasks or only auditory tasks provide more discriminative features for detecting SZ. Finally, CALSczNet is a robust, effective, and reliable method that will assist psychiatrists in detecting SZ at an early stage and providing suitable and timely treatment.

1. Introduction

Schizophrenia (SZ) is a chronic and severe psychological disorder that severely affects the social life of patients. It is characterized by disruptions in a patient’s behavior, emotions, language, sleep, and thinking. SZ patients may experience auditory hallucinations and lose touch with their surroundings. Individuals with SZ have a mortality rate that is two to three times higher than that of the healthy population due to physical diseases like metabolic and cardiovascular conditions [1,2]. The onset of SZ symptoms typically occurs between late adolescence (around 16 years) and early adulthood (around 30 years). The World Health Organization (WHO) declared that SZ can be treated using a long-term medication regimen [1,3,4]. However, an estimated 69% of individuals with SZ are not effectively treated, mostly in low- and middle-income countries with inadequate mental healthcare [2,5].
The causes of SZ may include genetic disorders, psychosocial factors, environmental factors, and some complications related to brain chemicals (neurotransmitters) such as glutamate and dopamine [2]. SZ is characterized by three basic types of symptoms: positive, negative, and cognitive. Each type has specific indicators. Positive symptoms include delusions, paranoia, hallucinations, exaggerative or inaccurate perceptions, disorganized speech, and thought disorders. Negative symptoms include reduced emotions, cognitive deficiency, and muddled thinking. Cognitive symptoms include impaired concentration, memory, and attention [2,6]. SZ is associated with cognitive, emotional, and addiction problems, as well as conditions such as homelessness that exacerbate the harm that patients experience socially, physiologically, and psychologically [6,7].
The detection of SZ is complex, and psychiatrists follow a traditional approach based on signs and symptoms observed over a long period, combined with self-report data obtained from the patient or their observers [8]. The diagnostic decisions are made based on the Diagnostic and Statistical Manual of Mental Disorders (5th edition (DSM-V)) and the International Statistical Classification of Diseases and Related Health Problems (10th revision (ICD-10)). This approach is unreliable, time-consuming, and requires a psychiatrist’s expertise [2,9].
Alternatively, the application of Computer-Aided Diagnosis (CAD) systems has been advocated to differentiate SZ patients from healthy controls (HCs) precisely [2,5,10]. Several neuroimaging techniques have emerged to detect SZ, including Magnetic Resonance Imaging (MRI), functional MRI (fMRI), Positron Emission Tomography (PET), and Computed Tomography (CT) [1,2,3,4]. These techniques are costly, have poor temporal resolution, are not portable, and require extensive training. Also, they may generate low-quality images due to artifacts resulting from patient motion [2].
As an alternative, electroencephalography (EEG) signals are used as a sensitive biomarker to detect SZ. Several reasons drive the utilization of EEG signals. First, EEG is a non-invasive, safe, and comfortable option for patients. Second, it is more cost-effective than MRI or PET scans, enabling its use in a large number of research studies and clinical settings [1,2,11]. Moreover, EEG is a practical tool for monitoring brain activity in real-time, gives high temporal resolution, and is available at any healthcare center [9]. It is rich in brain information, which deep learning models can exploit to identify patterns associated with several mental diseases such as SZ. Finally, it is widely used to detect various mental disorders, including seizures, epilepsy, Alzheimer’s, depression, Parkinson’s disease, attention deficit hyperactivity disorder, autism, and schizophrenia [2,12].
Although EEG is a powerful tool for detecting SZ, it requires experts such as neurologists to interpret and locate abnormalities in EEG signals, which is a lengthy task and error-prone [2]. Developing an automated and reliable method that can differentiate between SZ patients and healthy controls (HCs) is essential.
An EEG signal is non-stationary, non-linear, and continuously changing, so a robust feature extraction method is needed to tackle these issues. Several conventional machine learning (ML) algorithms have been used for SZ detection, including Support Vector Machine (SVM) [8,9,10,12,13,14,15], Random Forest (RF) [12,16], and K-Nearest Neighbors (KNN) [17]. These algorithms use hand-crafted features, whose extraction is a long, complex, and tedious process [5] that requires expert knowledge [18]. In contrast, deep learning (DL) methods automatically learn discriminative features from data and overcome the problems associated with the traditional ML approach [13,18]. Recently, CNN models have been adopted for SZ detection [10,11,19,20,21]; they can be broadly categorized into 1D-CNNs [19,20,21] and 2D-CNNs [11,12,16]. In the case of a 1D-CNN, raw EEG trials are fed into the network, whereas a 2D-CNN requires that the raw EEG trials be converted into 2D images using visualization methods such as wavelet transforms [22]. 1D-CNNs perform better than 2D-CNNs because they do not involve a conversion that may cause information loss [17]; they are also less prone to overfitting because they have fewer learnable parameters.
EEG signals are time series, and the existing deep learning-based methods do not make the best use of this nature of EEG signals. On the other hand, the Long Short-Term Memory (LSTM) network analyzes time-series signals better than a CNN because it learns long-term dependencies by memorizing information over long periods [2]. In addition, the attention mechanism in DL dynamically adjusts the focus onto the essential features of the signal and ignores the less important ones [23]. Using the design concepts of the 1D-CNN, LSTM, and attention mechanism, a deep network, namely, CALSczNet, is proposed. It takes raw EEG trials as the input and extracts spatiotemporal features, analyzing the spatial and temporal patterns of brain activations. It employs two attention mechanisms, i.e., temporal attention and local attention, to selectively focus on features according to their significance. Finally, it leverages an LSTM to encode long-term dependencies in EEG trials.
The CALSczNet model was evaluated on a benchmark public-domain dataset named “The Basic Sensory Task in Schizophrenia” from Kaggle [24]. Schizophrenic patients find it difficult to differentiate between internally and externally generated stimuli due to problems with the corollary discharge process in the nervous system [25,26,27]. For this reason, three experimental conditions were used in this dataset to collect brain activations in the form of EEG signals to identify schizophrenic patients. The first condition is Button Tone, where subjects press a button every 1–2 s to deliver a 1000 Hz, 80 dB sound-pressure-level tone with zero delay between the button press and the tone onset. The second condition is Play Tone, where the temporal sequence of the tones is preserved and played back after 100 tones have been delivered. The third condition is Button Alone, where subjects press a button at approximately the same pace and no sound occurs [24,25].
The main contributions of the proposed research are as follows:
  • It proposes the deep model CALSczNet, which is lightweight because it employs depth-wise convolution and separable convolution layers to extract spatiotemporal features, and which prevents overfitting when trained on the small number of available EEG trials.
  • It uses two effective attention mechanisms, i.e., Temporal Attention (TA) and Local Attention (LA), to concentrate on important features. The TA is used to refine low-level temporal features, while the LA is employed to refine intermediate-level spatial features. These mechanisms enable the model to select, process, and forward the most discriminative features, which assist the model in achieving faster computational performance.
  • Further, it uses an LSTM to capture long-term dependencies over time and accelerate the classification process.
  • It introduces a data augmentation technique to increase the number of EEG trials available to train the models, avoiding overfitting. Furthermore, it determines the appropriate temporal length of EEG trials that contains the most discriminative information.
  • The Kaggle dataset includes three conditions for the button press task. The effect of each condition on its potential use in the automated detection of SZ was analyzed and studied.
  • It found that a small number of channels from the left frontal lobe are effective in distinguishing between SZ and HC subjects, which improved the SZ detection performance compared to existing methods.
The remainder of the paper is organized as follows. Section 2 reviews recently conducted studies on SZ detection, Section 3 presents the details of the proposed method and its main phases, and Section 4 describes the evaluation protocol and model training. Section 5 discusses the extensive experiments conducted to demonstrate the effectiveness of the CALSczNet model. Finally, the discussion and comparison to state-of-the-art studies are given in Section 6, and the paper is concluded in Section 7.

2. Related Work

In recent years, several researchers have developed methods to automatically detect schizophrenia (SZ) using EEG signals. These methods are either ML-based, which involves using hand-engineered features, or DL-based. This section briefly reviews notable studies undertaken to diagnose SZ in patients from both categories by adopting the Kaggle dataset.

2.1. Hand-Engineered Feature-Based Method

Most ML methods perform three essential phases: (i) signal processing, such as the separation of rhythms; (ii) linear or non-linear feature extraction, which involves analyzing EEG signals in the time domain and/or frequency domain; and (iii) classification [11,28]. Many researchers decompose EEG signals into frequency bands to extract features that are fed to classifiers to detect SZ in patients.
Siuly et al. [1] decomposed 3-s EEG signals into a series of Intrinsic Mode Functions (IMFs) using Empirical Mode Decomposition (EMD). The authors then extracted IMF features using 22 linear and non-linear statistical methods, which were reduced using the Kruskal–Wallis test. Following this, the features were passed to an Ensemble Bagged Tree (EBT). The method achieved an accuracy of 89.59% for IMF 2 using 10-fold cross-validation on the Kaggle dataset, using 10 channels located on the left frontal region of the brain and considering condition one (pressing a button to deliver a tone). The authors applied a manual process to select a suitable number of IMFs, which is a crucial phase since it has a determinative impact on model performance. The proposed method is complex, and the accuracy is relatively low.
Baygin et al. [3] combined raw EEG signals (ESs) with the decomposition of EEG signals (DSs) obtained using Maximum Absolute Pooling (MAP). The authors extracted 512 features from ES and DS for each EEG signal using the Collatz conjecture. In turn, it was reduced using Iterative Neighborhood Component Analysis (INCA) to feed the KNN classifier. The authors evaluated the proposed method using two public datasets, one of which was the Kaggle dataset. The reported accuracy was 93.58% at the F3 channel using 10-fold cross-validation. The proposed method is a complex process that must be performed accurately because the performance of the final classification outcome depends on it.
Khare et al. [29] decomposed EEG signals using the empirical wavelet transform (EWT) to produce 28 AM-FM components (features) in a Fourier spectrum. The Kruskal–Wallis test was used to select the most discriminative features. Later, the selected five features were fed to different classifiers, including KNN, DA, the ensemble method, SVM, and the decision tree. The authors performed 10-fold cross-validation and used the Kaggle dataset for model evaluation. The SVM classifier outperformed the other classifiers with 88.7% accuracy.
Khare et al. [30] developed a flexible tunable Q wavelet transform (F-TQWT) to decompose an EEG signal into different features, which was then reduced by the Kruskal–Wallis test to five features. Following this, the most prominent channel was selected using the Fisher score method. The authors found that the AF7 channel was the most discriminant channel adopted to classify EEG signals. The flexible least square support vector machine (F-LSSVM) classifier was used to classify the five features of the AF7 channel, which achieved 91.39% accuracy.

2.2. Deep Learning-Based Method

Recently, researchers have used DL techniques to detect schizophrenia through EEG signals. Prior studies have used either 1D raw EEG signals or EEG signals converted into 2D images for analysis. This section briefly presents some of the studies that have adopted one of these two categories for SZ detection.

2.2.1. 1D Methods: Using Raw EEG Signals

Guo et al. [20] discussed the influence of tuning different hyperparameters of deep learning models to improve model performance. The authors performed three steps: (i) preprocessing the EEG signals using FFT, (ii) extracting 21-dimensional features, and (iii) classifying the features using a CNN model. They achieved 92% accuracy. Notably, this study did not mention the feature extraction or feature selection methods. In addition, the authors declared that the total number of samples was 18,560 for training and 4641 for testing, but they did not state the augmentation technique used.
Smith et al. [21] decomposed EEG signals into various features using robust variational mode decomposition (RVMD). The Kruskal–Wallis test was used to select only six features among different features. These features were fed into an optimized extreme learning machine (OELM) classifier with only a single layer. The OELM model achieved 92.93% accuracy. Both [20,21] used 64 channels of the dataset and all conditions. Srinivasan et al. [22] introduced a method that first processes the raw EEG signals and then uses convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) networks to extract spatial and temporal features. This method achieved 98.2% accuracy.

2.2.2. 2D Methods: Converting EEG Signals into Images

Khare et al. [11] converted EEG signals into images using the short-time FT (STFT), CWT, and smoothed pseudo-Wigner-Ville distribution (SPWVD). These were fed independently into the following CNN classifiers: AlexNet, VGG-16, ResNet50, and the proposed CNN. The best accuracy was 93.36%, achieved by SPWVD-CNN with 10-fold cross-validation. The temporal length was 3 s, 64 channels were used, and condition one (pressing a button to deliver a tone) was considered. A drawback is that the authors used a large number of channels (64).
Siuly et al. [31] utilized a GoogLeNet-based architecture called “SchizoGoogLeNet” to extract features from denoised and transformed EEG signals to differentiate between HC and SZ patients. In the proposed method, various preprocessing steps were carried out, including (i) reshaping the original 2D EEG matrix into a 3D matrix (shape = channels × sample points × epochs), (ii) using a channel-wise average filter with a kernel size of 12 to suppress noise and reduce the temporal length to half, (iii) transforming the signal matrix into a four-dimensional image to match the input size required by the GoogLeNet model (shape = 224 × 224 × 3 × image samples), and (iv) extracting features using the GoogLeNet model to produce two deep feature vectors. The Pat-Ind evaluation protocol with 10-fold cross-validation was used to assess the models. Five well-known models were adopted: GoogLeNet, SVM, KNN, DT, and LDA. The SVM classifier outperformed the other models on an unbalanced dataset with 98.84% accuracy and 99.02% sensitivity. This research observed that the average filter’s kernel size is large, which decreases the temporal length from 3072 to 256 sample points; this reduction resulted in the loss of important details, including the task onset, since the average filter treats all sample points equally. Additionally, to balance the dataset, the authors removed some trials of SZ subjects, which degraded the performance on the balanced dataset; instead, augmentation methods could have been employed to overcome this problem.
Sahu et al. [23] proposed a multi-scale feature extraction model named SCZ-SCAN to classify subjects as either HC or SZ patients. The authors substituted a conventional convolution layer with depth-wise separable convolution for the following reasons: (i) to extract features in different scales and (ii) to reduce the number of parameters and thus, the computation time. Moreover, spatial attention (SP) and channel-wise attention (CWA) blocks were attached to the proposed model to capture significant features and suppress irrelevant features. In the preprocessing stage, EEG signals were converted into 2D images using the continuous-time wavelet transform (CWT) method (128 × 128). Two public datasets were adopted for evaluation, including the Kaggle dataset, and a 10-fold cross-validation method was performed. The proposed model achieved 96% accuracy for the Kaggle dataset.
It is noteworthy that these studies have not paid sufficient attention to the DL architecture or the selection of essential brain regions and their channels. In contrast, this research aims to develop a robust DL model that is reliable and effective for diagnosing SZ in patients using raw EEG signals, as explained in the upcoming sections.

3. Proposed Method

This section presents the details of the proposed method called CALSczNet, which takes a raw EEG trial and detects schizophrenia in patients efficiently. First, the problem is formulated, then, the description of the dataset used to develop the method is given, and finally, the architecture of CALSczNet is described in detail.

3.1. Problem Formulation

The problem is to detect schizophrenia using an EEG signal, which characterizes the electrical brain activities. An EEG trial $x$ is used to detect whether a subject is a healthy control (HC) or has schizophrenia (SZ). The EEG trial $x$ captures electrical brain activities in temporal and spatial dimensions. The temporal dimension specifies the change in EEG signals over time, whereas the spatial dimension identifies the various channels' locations on the scalp. Therefore, the trial $x$ is represented as a matrix, i.e., $x \in \mathbb{R}^{C \times T}$, where $C$ is the number of channels and $T$ is the temporal length (i.e., the number of samples):
$$x = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1T} \\ x_{21} & x_{22} & \cdots & x_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ x_{C1} & x_{C2} & \cdots & x_{CT} \end{pmatrix}.$$
The problem of detecting schizophrenia using $x$ is a classification problem: given the EEG trial $x$ of a subject, he/she has to be classified as an SZ patient or an HC. Let $Y = \{0, 1\}$ be the set of labels such that the state (label) of $x$ is 0 for HC and 1 for SZ.
In the schizophrenia detection problem, $x$ is taken as the input, and its label $l \in Y$ is predicted. We need to design a mapping $f(\cdot, \theta): \mathbb{R}^{C \times T} \rightarrow Y$ such that $f(x, \theta) = l$, where $x \in \mathbb{R}^{C \times T}$ and $l \in \{0, 1\}$. Specifically, when any $x$ is passed to $f$, it yields $l$ as the output (predicted label). In the mapping $f$, the parameter $\theta$ represents the learnable parameters of $f$. We design $f$ as a lightweight deep network; the following subsections discuss the design of $f$.

3.2. Data Description and Preparation

We used the benchmark public-domain Kaggle dataset [24] to design and validate the proposed method. It consisted of 81 adult participants, including 32 healthy controls (26 males and 6 females) with an average age of 38.38 years (max = 63 years and min = 22 years) and 49 schizophrenic patients (41 males and 8 females) with an average age of 40.02 years (max = 63 years and min = 19 years). The SZ patients were diagnosed using clinical interviews based on DSM-IV (SCID), and patients under medication were excluded.
While capturing the EEG signals, three conditions were performed for each subject: (i) pressing the button to generate the tone (button tone), (ii) listening to the recorded tone passively (tone only), and (iii) pressing the button without generating any tone (button only). The EEG signals were collected from 64 scalp sites and 8 external sites using a BioSemi ActiveTwo system. From each subject, 86 to 100 EEG trials were recorded. The EEG signal was preprocessed using standard techniques, including (i) continuous digitization at 1024 Hz for 3000 ms (3 s) per condition, (ii) off-line re-referencing to averaged earlobe electrodes, (iii) baseline correction from −100 to 0 ms, and (iv) digital bandpass filtering between 0.5 Hz and 15 Hz. Moreover, the data were free from artifacts [24,25]. The task onset was located in the center of each trial, which split the signal into 1.5 s before and after the task onset.
According to Ford et al. [25] and Zhang [26], the most discriminative features exist between 100 ms before and 300 ms after the task onset during conditions I and II. In addition, Zhang et al. [32] and Thilakavathi et al. [8] stated that, to improve EEG analysis, it is preferable to concentrate on the temporal period where the subject is focused during the experiment and discard the others. Moreover, Sabeti et al. [33] reported that a short temporal length enhances the model's performance. Considering these studies, the temporal length of each EEG trial is fixed to 2 s, captured in the middle of each epoch. In addition, to augment the training dataset, a segment of each trial is extracted with a length of 2.5 s; see Section 4.2 for details.
Following the state-of-the-art studies introduced in Section 2.1 [3,4,29], only the left frontal lobe containing 10 channels was considered: Fp1, AF7, AF3, F1, F3, F5, F7, FT7, FC5, and FC3. For each condition, trials were extracted and arranged in a matrix of the form $N_{Tr} \times C \times T$, where $N_{Tr}$ represents the number of trials, $C$ is the number of channels, and $T$ is the temporal length of each trial in samples. The numbers of HC trials were 3108, 3007, and 3111 for condition-I, condition-II, and condition-III, respectively, whereas for SZ, the numbers were 4728, 4624, and 4623, respectively.
The Fp1 channels from two EEG trials corresponding to HC and SZ are shown in Figure 1. There is a significant difference between the Fp1 signals of HC and SZ. The HC signal shows more fluctuations, which indicates that a healthy subject is more responsive to different activities than a schizophrenic patient; a subject with SZ cannot discern tiny variations in ongoing sounds and other activities [23]. Consequently, the EEG signal plays an important role in observing and detecting the abnormal electrical activities of the SZ brain.
The first row of Figure 1 shows that the amplitude of the EEG signal varies across different brain states ((a) healthy control and (b) schizophrenic). To mitigate this issue, it is essential to reduce the variability of each EEG signal to identify patterns associated with SZ. For this purpose, Z-scoring is used across time points per channel to make these deviations more discernible, as follows:
$$\hat{x}_t = Z(x_t, \mu, \sigma) = \frac{x_t - \mu}{\sigma},$$
where $x_t$ and $\hat{x}_t$ denote the time sample before and after normalization, respectively, $\mu$ is the mean, and $\sigma$ is the standard deviation.
The Z-scoring ensures all EEG signals have the same distribution by setting the mean to zero and the standard deviation to 1. Further, it prevents the model from being biased towards features with larger scales, eliminates noise and artifacts, allows the model to converge faster, and enhances model generalization [7,19,31,34]. It is noteworthy that to prevent the attenuation of important information in raw EEG signals, the baseline correction step is performed before Z-scoring. This baseline correction preserves significant patterns in EEG signals, ensuring they are more discernible after normalization [25,35,36].
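To make the normalization step concrete, the following is a minimal NumPy sketch of per-channel Z-scoring of EEG trials; the array layout (trials × channels × samples) and names are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def zscore_trials(trials: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each channel of each trial to zero mean and unit variance
    across time points (the last axis)."""
    mu = trials.mean(axis=-1, keepdims=True)      # per-trial, per-channel mean
    sigma = trials.std(axis=-1, keepdims=True)    # per-trial, per-channel std
    return (trials - mu) / (sigma + eps)          # eps guards against flat channels

# Example: 100 trials, 10 left-frontal channels, 2 s at 1024 Hz = 2048 samples
x = np.random.randn(100, 10, 2048)
x_norm = zscore_trials(x)
assert np.allclose(x_norm.mean(axis=-1), 0, atol=1e-6)
```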

3.3. Proposed CALSczNet

Deep Convolution Neural Networks (CNNs) have recently achieved promising performance in computer vision tasks [37]. The standard deep CNN model suffers from two weaknesses. First, during training, features are learned automatically by densely and sequentially stacking convolutional layers [38,39]. Stacking convolutional layers primarily enhances the performance of the CNN [20,38], but it also leads to a significant increase in the number of learnable parameters, requiring more computational resources [23,38,39]. As a result, the efficiency of the CNN model is reduced, as it necessitates a considerable number of training examples during the training process [37]. Unfortunately, medical datasets such as EEG signal collections lack enough training examples to train a deep model effectively; consequently, overfitting occurs [23]. Another weakness of deep CNNs is that convolution operations treat all EEG signal features equally without prioritizing the most discriminative ones, which can lead to overlooking important features and slowing down the training process.
Inspired by the EEGNet framework [37], a lightweight deep network, namely, CALSczNet, is introduced, which overcomes the above issues and improves the performance and efficiency of conventional CNN models for EEG data classification. Depth-wise convolution and separable convolution layers are employed to reduce the parameter complexity and extract discriminative spatial features. Further, an LSTM is employed to determine the long-range dependencies between features, and attention mechanisms are used to attend to the salient features. CALSczNet consists of five blocks, as shown in Figure 2; the specification of its architecture is described in Table 1. The five main blocks of CALSczNet are (i) the temporal convolution block ($F_t$), (ii) the spatial depth-wise convolution block ($F_s$), (iii) the separable convolution block ($F_{sep}$), (iv) the LSTM block ($F_{LSTM}$), and (v) the classification block ($F_C$); the details of the blocks are given in the following subsections.

3.3.1. Temporal Convolution Block ($F_t$)

This block defines the mapping $F_t: \mathbb{R}^{C \times T \times 1} \rightarrow \mathbb{R}^{C \times T \times F_0}$ that takes a reshaped EEG trial $x \in \mathbb{R}^{C \times T \times 1}$ as the input and yields $z^{(1)} \in \mathbb{R}^{C \times T \times F_0}$, where each channel is decomposed into $F_0$ bands, which are enhanced to emphasize the salient bands. This mapping is composed of three functions:
$$z^{(1)} = F_t(x) = B(A_t(C_t(x))),$$
where $C_t$ is the temporal convolution layer, $A_t$ is temporal attention, and $B$ is batch normalization.
The temporal convolution $C_t$ is implemented as a temporal convolution layer with $F_0$ temporal kernels of size $1 \times T_s$. It convolves $x$ with each of the $F_0$ temporal kernels in a channel-wise manner and generates $z^{(0)} \in \mathbb{R}^{C \times T \times F_0}$, as follows:
$$z_k^{(0)}(c, t, 1) = \sum_{i=-T_s/2}^{T_s/2} s_k(1, i)\, x(c, t-i, 1), \quad 1 \le c \le C \ \text{and} \ 1 \le t \le T,$$
where $s_k$ is the $k$th kernel ($1 \le k \le F_0$) and $z_k^{(0)} \in \mathbb{R}^{C \times T \times 1}$ is the $k$th band of the output feature map $z^{(0)} \in \mathbb{R}^{C \times T \times F_0}$. The number of kernels and their sizes were determined empirically; different choices were examined, e.g., $F_0 = 8, 16, 32$, etc., as well as $1 \times T_s = 1 \times 8, 1 \times 16, 1 \times 32$, etc. Two aspects were considered when selecting the best choice: (i) the ability of the TConv layer to effectively capture sufficient and meaningful low-level temporal features and (ii) managing model complexity and computational resources to prevent overfitting. The best $F_0$ and $1 \times T_s$ were found to be 16 and $1 \times 32$, respectively. The TConv layer converts $x$ into multiple temporal bands that extract time-domain information from each channel [37,39,40]. In the TConv layer, the activation function is omitted, following the idea of common spatial patterns (CSP) [41] and the convention of EEGNet [37]; empirically, it was found that a non-linear activation does not improve the performance (see Section 5.1.3).
The temporal attention $A_t$ receives $z^{(0)} \in \mathbb{R}^{C \times T \times F_0}$ as the input and creates the temporally saliency-enhanced $\acute{z}^{(1)} \in \mathbb{R}^{C \times T \times F_0}$, as follows:
$$\acute{z}^{(1)} = A_t(z^{(0)}) = P(z^{(0)}) \otimes z^{(0)} + z^{(0)},$$
where $P$ is the projection map that projects $z^{(0)} \in \mathbb{R}^{C \times T \times F_0}$ into the saliency map $\acute{z}^{(0)} \in \mathbb{R}^{C \times T \times 1}$, i.e.,
$$P(z^{(0)}) = S(C_t(z^{(0)})).$$
Here, $C_t$ projects the 3D tensor $z^{(0)}$ to a 2D map, which is further passed through the softmax ($S$) mapping that determines the relative importance of the features at each spatiotemporal location $z^{(0)}(c, t, :)$ by transforming the 2D map into a probability distribution. The mapping $C_t$ is implemented as a temporal convolution layer (TConv) with one kernel of size $1 \times 9 \times 16$. The size of the kernel is specified based on prior findings on the same dataset, which showed that a small kernel size enhances the performance of the model [23,34]. In addition, EEG signals involve a variety of useful patterns that occur within a small temporal window; thus, a small kernel size allows the model to effectively capture local features along the temporal dimension, which accelerates model training. The mapping $P$ emphasizes each feature $z^{(0)}(c, t, :)$ according to its importance.
The $P(z^{(0)})$ is multiplied by the original tensor $z^{(0)}$ using scalar multiplication $\otimes$, which multiplies the scalar $P(z^{(0)})(c, t)$ with the vector $z^{(0)}(c, t, :)$ to pay attention to each spatio-temporal location according to its saliency. Further, $P(z^{(0)}) \otimes z^{(0)}$ is added to the original tensor $z^{(0)}$ to attend to the salient features; this addition is implemented as a shortcut connection. This connection simplifies network learning, enhances model performance, and tackles the model degradation problem [42]. Temporal attention is used to refine low-level features, activate important features over time, determine when to pay attention, and accelerate computation by propagating only the relevant temporal features.
Finally, batch normalization $B$ is implemented as a BN layer. The BN layer assists in accelerating the training process, improving training stability, and mitigating overfitting during training. During training, if the output of one layer changes, then the distribution of the input values of the next layer also changes; this internal covariate shift can be problematic for the subsequent layers. The BN layer mitigates the internal covariate shift by normalizing the output of a layer to have a mean of 0 and a standard deviation of 1 [20,43].
This module extracts the temporal features from the input EEG trial using temporal convolution (TConv) and emphasizes the most discriminative temporal features using Temporal Attention (TA).
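The following is a minimal Keras sketch of this block (TConv, TA, and BN) under the shapes stated above ($C = 10$, $T = 2048$, $F_0 = 16$, kernel sizes $1 \times 32$ and $1 \times 9$); it is a plausible reconstruction from the text, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

C, T, F0 = 10, 2048, 16   # channels, samples (2 s at 1024 Hz), temporal kernels

def temporal_block(x):
    # TConv: F0 linear temporal kernels of size 1 x 32 (no activation) -> (C, T, F0)
    z0 = layers.Conv2D(F0, (1, 32), padding="same", use_bias=False)(x)
    # TA: project to a (C, T, 1) saliency map with one 1 x 9 temporal kernel,
    # then softmax over all spatiotemporal positions -> probability map P(z0)
    p = layers.Conv2D(1, (1, 9), padding="same")(z0)
    p = layers.Reshape((C, T, 1))(layers.Softmax()(layers.Reshape((C * T,))(p)))
    # scalar multiplication with the saliency map plus a shortcut connection
    z1 = p * z0 + z0
    return layers.BatchNormalization()(z1)

inp = layers.Input(shape=(C, T, 1))
z1 = temporal_block(inp)   # -> (None, C, T, F0)
```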

3.3.2. Spatial Depth-Wise Convolution Block ($F_s$)

This block builds the mapping $F_s: \mathbb{R}^{C \times T \times F_0} \rightarrow \mathbb{R}^{1 \times (T/2) \times F_1}$, where $F_1 = F_0 \times D$ and $D = 4$ (see Section 5.1.4 for more details). It receives the output $z^{(1)} \in \mathbb{R}^{C \times T \times F_0}$ of the $F_t$ block as the input and yields $z^{(2)} \in \mathbb{R}^{1 \times (T/2) \times F_1}$. It is composed of the following functions:
$$z^{(2)} = F_s(z^{(1)}) = Drop(AP(g(B(C_s(z^{(1)}))))),$$
where $C_s$ is spatial depth-wise convolution, $B$ is batch normalization, $g$ is the non-linear activation function GeLU, $AP$ is average pooling, and $Drop$ is dropout.
The spatial depth-wise convolution $C_s$ is implemented as a spatial depth-wise convolution layer (SDWConv) with $F_0 \times D$ kernels, where $D$ is the depth multiplier that expands the number of output feature maps. It convolves each of the $F_0$ feature maps of $z^{(1)}$ with $D$ spatial kernels of size $C \times 1$ independently to yield $\acute{z}^{(2)} \in \mathbb{R}^{1 \times T \times (F_0 \times D)}$ [37,40], as follows:
$$\acute{z}_d^{(2)}(1, t, f_0) = \sum_{c=1}^{C} s_d^{f_0}(c, 1)\, z^{(1)}(c, t, f_0), \quad 1 \le t \le T, \ 1 \le d \le D, \ \text{and} \ 1 \le f_0 \le F_0,$$
where $s_d^{f_0}$ is the $d$th spatial kernel corresponding to feature map $f_0$. After examining different values of $D$, the best value was found to be 4, which generates $F_0 \times D = 16 \times 4 = 64$ feature maps; see Section 5.1.4. By fusing channel information, this layer reduces the complexity of the model and extracts discriminative information across channels, which improves performance and helps avoid overfitting [37].
After that, batch normalization $B$ is used to overcome the internal covariate shift problem, and the activation function GeLU ($g$) is used to introduce non-linearity into the model, which enhances performance by capturing complexity. Then, average pooling ($AP$) with a kernel size of $1 \times 2$ is applied to reduce the temporal length, the number of learnable parameters, and the computational complexity [20]. Dropout ($Drop$) with a probability of 0.01 is applied to discourage the co-adaptation of neurons and avoid overfitting.
The output $z^{(2)} \in \mathbb{R}^{1 \times (T/2) \times F_1}$ is passed to two blocks: (i) the Local Attention (LA) block (Section Local Attention (LA) Block) and (ii) the $F_{sep}$ block (Section 3.3.3).
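Before turning to those two blocks, the following minimal Keras sketch illustrates the $F_s$ block under the stated settings (kernels of size $C \times 1$, depth multiplier $D = 4$, GeLU, $1 \times 2$ average pooling, and dropout with $p = 0.01$); names and defaults are illustrative assumptions.

```python
from tensorflow.keras import layers

def spatial_block(z1, C=10, D=4):
    # SDWConv: D spatial kernels of size C x 1 per input feature map,
    # collapsing the channel axis -> (1, T, F0 * D)
    z = layers.DepthwiseConv2D((C, 1), depth_multiplier=D,
                               use_bias=False, padding="valid")(z1)
    z = layers.BatchNormalization()(z)
    z = layers.Activation("gelu")(z)           # non-linearity
    z = layers.AveragePooling2D((1, 2))(z)     # halves the temporal length
    return layers.Dropout(0.01)(z)             # -> (1, T/2, F1), F1 = F0 * D
```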

Local Attention (LA) Block

As mentioned above, the SDWConv layer learns each feature map independently, and each feature map represents a variety of features. Some of these features could be crucial for detecting SZ, while others carry redundant information. Therefore, it is essential to eliminate the redundant features to prevent model degradation and emphasize the salient ones [23]. As a result, the Local Attention (LA) block is placed as a branch of the $F_s$ block to refine locally prominent features at the intermediate level, as shown in Figure 2. The LA block consists of an attention mapping ($\alpha$) that learns attention weights for the corresponding feature maps. The $\alpha$ includes two components, namely, squeeze and excitation. The output of $\alpha$ is then used for scaling multiplication, which is enhanced by adding a shortcut connection, as follows:
$$z^{(3)} = \alpha(z^{(2)}) \otimes z^{(2)} + z^{(2)}, \quad z^{(3)} \in \mathbb{R}^{1 \times (T/2) \times F_1},$$
where
$$\alpha = S(Ƒ_2(R(Ƒ_1(G(z^{(2)}))))), \quad \alpha \in \mathbb{R}^{1 \times F_1}.$$
In the attention mapping ($\alpha$), the squeeze component utilizes global average pooling ($G$) to collect global information per feature map: for each feature map, it computes the average across the spatial dimensions ($h \times w$) and produces a single scalar value that represents the overall essence of that feature map. After that, the excitation component processes the output of $G$ through two fully connected layers. The first fully connected layer ($Ƒ_1$) consists of 64 neurons divided by the reduction ratio (ř), which is set to 12; this reduction aids in reducing model complexity. $Ƒ_1$ is followed by the ReLU ($R$) function, whereas the second fully connected layer ($Ƒ_2$) consists of exactly 64 neurons, followed by Softmax ($S$) (Equation (10)). This work replaces the Sigmoid function of the original SE module [44] with Softmax. The $S$ function recalibrates the feature weights to numbers between 0 and 1, where each number represents the probability (importance) of the corresponding feature, which is gradually activated. By adopting the $S$ function, the performance of the proposed model is improved.
After that, scalar multiplication is used to multiply each scaled value $\alpha_i$ in $\alpha$ with the corresponding feature map of the original input tensor $z^{(2)}$ along the temporal dimension, $\alpha(z^{(2)}) \otimes z^{(2)}$. Because this scalar multiplication is local along the feature map and temporal dimensions, the block is named the LA block. The scaling emphasizes the important $i$th feature maps of $z^{(2)}$ and suppresses the less important ones. Then, the scaled feature maps $\alpha(z^{(2)}) \otimes z^{(2)}$ are added to the original input tensor $z^{(2)}$ using a shortcut connection, which helps to pay more attention to the most salient features (Equation (9)). Finally, the output $z^{(3)} \in \mathbb{R}^{1 \times (T/2) \times F_1}$ of the LA block is exploited to give more weight to the feature maps produced by the next block, $F_{sep}$, using element-wise multiplication $\odot$; for more details, see Section 3.3.3.
This module extracts the spatial features from EEG trials using spatial depth-wise convolution, controls the number of output feature maps, and adjusts the model's capacity and complexity. In addition, it emphasizes the most discriminative spatial features using Local Attention (LA).
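As an illustration, a minimal Keras sketch of the LA block (squeeze, excitation with reduction ratio ř = 12, softmax recalibration, scaling, and shortcut) is given below; it is a plausible reconstruction under the stated settings, not the authors' code.

```python
from tensorflow.keras import layers

def local_attention(z2, F1=64, r=12):
    a = layers.GlobalAveragePooling2D()(z2)          # squeeze -> (F1,)
    a = layers.Dense(F1 // r, activation="relu")(a)  # excitation: reduce by r = 12
    a = layers.Dense(F1)(a)                          # restore to F1 weights
    a = layers.Softmax()(a)                          # softmax instead of sigmoid
    a = layers.Reshape((1, 1, F1))(a)                # broadcastable scale factors
    return z2 * a + z2                               # scaling + shortcut -> z(3)
```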

3.3.3. Separable Convolution Block ($F_{sep}$)

The separable convolution layer (SepConv) receives the direct output $z^{(2)} \in \mathbb{R}^{1 \times (T/2) \times F_1}$ of the $F_s$ block and carries out two steps: (i) depth-wise convolution and (ii) point-wise convolution. Here, the depth-wise convolution follows the same process as in Section 3.3.2: each kernel convolves only one feature map along the temporal dimension in a depth-wise manner. Thus, point-wise convolution is required to combine features across different feature maps and expand them; it uses the same operation as conventional convolution [37].
Adopting the SepConv layer reduces the learnable parameters and computation costs while preserving good accuracy [37,45]. Moreover, it efficiently decouples the relationships within and across the feature maps of the input layer by learning each feature map individually in the time direction (depth-wise convolution) and then optimally merging the feature maps in the depth direction, generating new feature maps (point-wise convolution) [37,39].
In this block, the kernel size of the depth-wise convolution is set to $1 \times 16$, and the number of kernels is $F_2 = F_1 \times D$, where $F_1 = 64$ and $D = 1$. Then, BN is applied, followed by the activation function GeLU. After that, element-wise multiplication $\odot$ is used to multiply the output $\acute{z}^{(4)} \in \mathbb{R}^{1 \times (T/2) \times F_2}$ of the $F_{sep}$ block with the output of the LA block $z^{(3)} \in \mathbb{R}^{1 \times (T/2) \times F_1}$, where $F_1 = F_2 = 64$, as in Equation (11). Here, the output $z^{(3)}$ of the LA block is utilized to pay more attention to the most salient features of $\acute{z}^{(4)}$. The operation is implemented locally along the feature maps and temporal points of both $\acute{z}^{(4)}$ and $z^{(3)}$ to amplify the important features to the highest values; thus, each attended feature map of $z^{(3)}$ multiplies the corresponding feature map of $\acute{z}^{(4)}$:
$$\acute{\acute{z}}^{(4)} = \acute{z}^{(4)} \odot z^{(3)}, \quad \acute{\acute{z}}^{(4)} \in \mathbb{R}^{1 \times (T/2) \times F_2}.$$
Finally, the output $\acute{\acute{z}}^{(4)} \in \mathbb{R}^{1 \times (T/2) \times F_2}$ of the $F_{sep}$ block is reduced using average pooling with a kernel size of $1 \times 4$, followed by dropout with a probability of 0.10. The final output is $z^{(4)} \in \mathbb{R}^{1 \times ((T/2)/4) \times F_2}$, which is fed into the LSTM block $F_{LSTM}$ to capture important temporal features of the sequential data.
This module extracts features in two phases: depth-wise convolution and point-wise convolution. The depth-wise convolution captures local spatial features per feature map, whereas the point-wise convolution aggregates global features across all channels. Then, LA is used to pay attention to important features. This attention mechanism prioritizes essential features and excludes the less significant ones.
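The following minimal Keras sketch illustrates the $F_{sep}$ block and its fusion with the LA output under the stated settings ($1 \times 16$ depth-wise kernels with $D = 1$, BN, GeLU, $1 \times 4$ average pooling, dropout $p = 0.10$); shapes and names are illustrative assumptions.

```python
from tensorflow.keras import layers

def separable_block(z2, z3, F2=64):
    # SepConv: 1 x 16 depth-wise kernels (D = 1) followed by point-wise convolution
    z = layers.SeparableConv2D(F2, (1, 16), padding="same", use_bias=False)(z2)
    z = layers.BatchNormalization()(z)
    z = layers.Activation("gelu")(z)           # -> z'(4), shape (1, T/2, F2)
    z = z * z3                                 # element-wise multiplication with LA output
    z = layers.AveragePooling2D((1, 4))(z)     # -> (1, (T/2)/4, F2)
    return layers.Dropout(0.10)(z)             # -> z(4)
```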

3.3.4. LSTM Block ($F_{LSTM}$)

Learning long-term dependencies among temporal features is important for learning discriminative features. For this purpose, the $F_{LSTM}$ block is used. The LSTM layer can learn long-term dependencies of sequential data by introducing three gates, which decide which important information should be transmitted to the next hidden state [5,43,46]. The $F_{LSTM}$ module is a type of Recurrent Neural Network (RNN); it maintains long-term memory and overcomes the vanishing gradient problem that occurs during the training of typical RNNs.
The basic building block of the $F_{LSTM}$ is depicted in Figure 3 [43,46]. It consists of a cell state and three gates (forget, input, and output). The cell state (or memory state) represents the long-term memory, allowing the LSTM to store EEG signal information over time. In contrast, the three gates represent the short-term memory that controls the input/output data flow of the cell state by adding or forgetting information. These gates process the input value using a sigmoid activation function ($\sigma$) that converts the input to a value between 0 and 1; the hyperbolic tangent activation function (tanh) converts the data to a value between −1 and 1.
The forget gate $f_t$ controls the information of the previous cell memory state $c_{t-1}$, either retaining or discarding it. On the other hand, the input gate $i_t$ is responsible for adding new information to the current memory cell, which is then processed by the output gate. Following that, the output gate $o_t$ decides which important information from the current memory will be passed to the next hidden state $h_t$ of the LSTM unit. In the end, the LSTM unit produces two outputs: the current memory state $c_t$ and the current hidden state $h_t$. It is noteworthy that the LSTM unit requires three inputs: the previous memory state $c_{t-1}$, the previous hidden state $h_{t-1}$, and the current input $\hat{z}_t^{(5)}$ [46].
The mathematical representation of the single LSTM unit (Figure 3), which consists of the forget, input, and output gates, is expressed as follows:
$$f_t = \sigma(w_f \cdot [h_{t-1}, \hat{z}_t^{(5)}] + b_f)$$
$$i_t = \sigma(w_i \cdot [h_{t-1}, \hat{z}_t^{(5)}] + b_i)$$
$$\tilde{c}_t = \tanh(w_{\tilde{c}} \cdot [h_{t-1}, \hat{z}_t^{(5)}] + b_{\tilde{c}})$$
$$o_t = \sigma(w_o \cdot [h_{t-1}, \hat{z}_t^{(5)}] + b_o)$$
$$c_t = f_t \times c_{t-1} + i_t \times \tilde{c}_t$$
$$h_t = o_t \times \tanh(c_t)$$
where $w_f$, $w_i$, $w_{\tilde{c}}$, and $w_o$ represent the LSTM weights for the forget, input, candidate, and output gates, respectively, and $b_f$, $b_i$, $b_{\tilde{c}}$, and $b_o$ represent the corresponding bias vectors. Also, $[h_{t-1}, \hat{z}_t^{(5)}]$ represents the concatenation of the previous hidden state $h_{t-1}$ and the current input $\hat{z}_t^{(5)}$. Finally, the operations $\times$ and $+$ represent element-wise multiplication and element-wise addition, respectively.
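As a worked example, the following NumPy sketch implements one step of the LSTM unit exactly as in the gate equations above; the weight layout (each $w$ of shape $n_h \times (n_h + n_{in})$) and the variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(z_t, h_prev, c_prev, w, b):
    """One LSTM unit step; w[k] has shape (n_h, n_h + n_in), b[k] shape (n_h,)."""
    concat = np.concatenate([h_prev, z_t])          # [h_{t-1}, z_t^(5)]
    f = sigmoid(w["f"] @ concat + b["f"])           # forget gate
    i = sigmoid(w["i"] @ concat + b["i"])           # input gate
    c_tilde = np.tanh(w["c"] @ concat + b["c"])     # candidate cell state
    o = sigmoid(w["o"] @ concat + b["o"])           # output gate
    c = f * c_prev + i * c_tilde                    # new cell (memory) state
    h = o * np.tanh(c)                              # new hidden state
    return h, c
```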
The output $z^{(4)} \in \mathbb{R}^{1 \times ((T/2)/4) \times F_2}$ of the previous block is reshaped to $\hat{z}^{(5)} \in \mathbb{R}^{((T/2)/4) \times F_2}$ to pass it to the LSTM layer. The number of neurons in the LSTM unit is set to 25. The LSTM layer is followed by a BN layer and dropout with a probability of 0.10, and it yields the output $z^{(5)} \in \mathbb{R}^{((T/2)/4) \times 25}$. Overall, these layers define the mapping $F_{LSTM}$ as follows:
$$z^{(5)} = F_{LSTM}(\hat{z}^{(5)}) = Drop(B(LSTM(\hat{z}^{(5)}))).$$
The $F_{LSTM}$ module helps CALSczNet to efficiently and effectively learn the long-term dependencies of the EEG signal and capture significant temporal features of EEG trials [43].
CALSczNet is developed to prioritize the most discriminative features across EEG signals. It integrates three advanced techniques beyond standard CNNs. First, it employs attention mechanisms to emphasize the most relevant and informative spatiotemporal features dynamically and suppresses less significant ones. Second, it incorporates a separable convolution layer, which assists in reducing the number of learnable parameters and computational costs, thereby helping to prevent overfitting. It prioritizes learning the most essential and relevant spatial features by utilizing separable convolution. Third, it encapsulates the LSTM layer to capture temporal dependencies of the EEG signal because the LSTM layer uses memory cells to retain essential features and gates to decide which features to keep, forget, or use to update the cell state.

3.3.5. Classification Block

This block takes the learned features from the previous blocks and predicts the state of the input EEG trial, which is either HC or SZ. The output $z^{(5)} \in \mathbb{R}^{((T/2)/4) \times 25}$ is flattened and passed to this block. It is composed of a flatten layer, a fully connected layer (Ƒ), and a softmax layer. The flatten layer simplifies the 2D tensor to a feature vector that is passed to Ƒ, which consists of two neurons corresponding to HC and SZ; it determines which features are strongly correlated with a particular class. The softmax layer converts the activations of these neurons to the posterior probabilities $p(C_i \mid x)$, $i = 1, 2$, of class $C_1$ (HC) and class $C_2$ (SZ) given the input EEG trial:
$$p(C_i \mid x) = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}, \quad i = 1, 2,$$
where $a_i$, $i = 1, 2$, is the activation of the $i$th neuron.
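Putting the pieces together, the following Keras sketch assembles the five blocks into a CALSczNet-like model, reusing the block functions sketched in the previous subsections; it is a schematic reconstruction under the stated hyperparameters ($C = 10$, $T = 2048$, 25 LSTM units), not the authors' released code.

```python
from tensorflow.keras import layers, models

def build_calscznet(C=10, T=2048, n_classes=2):
    inp = layers.Input(shape=(C, T, 1))
    z1 = temporal_block(inp)               # F_t   : (C, T, 16)
    z2 = spatial_block(z1, C=C, D=4)       # F_s   : (1, T/2, 64)
    z3 = local_attention(z2)               # LA branch : (1, T/2, 64)
    z4 = separable_block(z2, z3)           # F_sep (x) LA : (1, (T/2)/4, 64)
    z5 = layers.Reshape((-1, 64))(z4)      # drop the singleton axis for the LSTM
    z5 = layers.LSTM(25, return_sequences=True)(z5)
    z5 = layers.BatchNormalization()(z5)
    z5 = layers.Dropout(0.10)(z5)          # F_LSTM : ((T/2)/4, 25)
    out = layers.Dense(n_classes, activation="softmax")(layers.Flatten()(z5))
    return models.Model(inp, out)          # F_C : posterior probabilities

model = build_calscznet()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```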

4. Evaluation Protocol and Model Training

In this section, the procedure used to implement and evaluate the performance of the proposed method is first described, as well as the metrics used for evaluation. Then, details of the data augmentation and model training are provided.

4.1. Evaluation Procedure

To evaluate the CALSczNet model, it was implemented on a computer equipped with an Intel® Core™ CPU (i9-12900H, ~2.5 GHz) processor and an NVIDIA GPU GeForce RTX 3080Ti with 16 GB of memory. The Keras framework with TensorFlow backend version 2.10.0 was used to implement, train, and test the model.
In this study, all experiments were carried out using stratified 10-fold cross-validation based on the patient-independent evaluation protocol. For each fold, the training, validation, and testing sets consisted of 80%, 10%, and 10% of the total data, respectively; a one-fold split of data is shown in Table 2.
Well-known performance metrics were used to evaluate the model's performance; they allow a fair comparison with state-of-the-art techniques. These metrics are accuracy, sensitivity (recall), specificity, and precision, calculated via the confusion matrix. The F1-score and AUC are also computed, where the AUC is the area under the receiver operating characteristic (ROC) curve. The formula for each metric is given in Table 3.
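For reference, the following sketch computes these metrics from a binary confusion matrix, assuming label 1 denotes SZ (the positive class); scikit-learn is used here for convenience and is an assumption, not part of the original pipeline.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def report(y_true, y_pred, y_score):
    # unpack the binary confusion matrix: rows = true class, columns = predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)               # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    auc = roc_auc_score(y_true, y_score)       # area under the ROC curve
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision, f1=f1, auc=auc)
```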

4.2. Model Training

The number of EEG trials available to train the model is insufficient (see Table 2), which leads to overfitting [23]. To overcome this issue, a sliding window technique with overlapping is used [47]. First, for each trial, a 2.5-s segment is extracted from the middle of the raw EEG signal, considering 1.25 s on each side of the task onset location. Then, from this segment, $N$ 2-s EEG trials (instances) are created using a stride of $s_t = \frac{L - l}{N - 1}$ seconds, where $L = 2.5$ s is the temporal length of the input segment and $l = 2$ s is the temporal length of the output EEG trial (instance); see Figure 4 for details.
The trials generated in this way are label-preserving because each trial contains at least half of the task period, and they introduce variation into the data. This approach increases the number of training trials by a factor of $N$, which can be adjusted to generate enough training trials to train the model without overfitting. The values of $N$ are set to 7 and 4 for HC and SZ, respectively. Using this procedure, the training trials are augmented for both HC and SZ.
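A minimal NumPy sketch of this sliding-window augmentation is shown below; the 1024 Hz sampling rate follows the dataset description, while the array layout and function name are illustrative assumptions.

```python
import numpy as np

def augment_trial(segment: np.ndarray, fs: int = 1024, L: float = 2.5,
                  l: float = 2.0, N: int = 7) -> np.ndarray:
    """segment: (C, L*fs) array; returns (N, C, l*fs) label-preserving windows."""
    win = int(l * fs)
    stride = (L - l) * fs / (N - 1)               # s_t = (L - l)/(N - 1), in samples
    starts = [round(k * stride) for k in range(N)]
    return np.stack([segment[:, s:s + win] for s in starts])

seg = np.random.randn(10, int(2.5 * 1024))        # one 2.5 s, 10-channel segment
windows = augment_trial(seg, N=7)                 # N = 7 for HC, N = 4 for SZ
print(windows.shape)                              # (7, 10, 2048)
```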
The model was trained using the Adam optimizer and sparse categorical cross-entropy loss. The maximum number of epochs was set to 100 with a batch size of 32. A learning rate schedule (LRS) was used with an initial learning rate ($lr$) of $1 \times 10^{-4}$; the $lr$ was decreased after every 4th epoch using $lr = lr \times e^{-0.1}$. In addition, the $L_1$ and $L_2$ regularization techniques were used to control overfitting and improve the generalization of the model; the values of $L_1$ and $L_2$ were set to $1 \times 10^{-4}$.
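The training configuration can be sketched in Keras as follows, reusing the model sketch above; the scheduler is one plausible reading of "decreased after every 4th epoch", and the variable names are illustrative.

```python
import math
import tensorflow as tf

def schedule(epoch, lr):
    # decay: lr <- lr * e^{-0.1} after every 4th epoch, else keep the current lr
    return lr * math.exp(-0.1) if epoch > 0 and epoch % 4 == 0 else lr

lrs = tf.keras.callbacks.LearningRateScheduler(schedule)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=100, batch_size=32,
#           validation_data=(x_val, y_val), callbacks=[lrs])
```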

5. Experiments and Results

This section presents the results of the experiments to evaluate and validate the suitability and effectiveness of CALSczNet for the classification of SZ and HC. First, the results of the ablation study using the condition-I dataset are presented. Based on the ablation experiments’ results, CALSczNet’s architecture was finalized and tested on the datasets for the other two conditions (condition-II and condition-III).

5.1. Ablation Study

To show the effectiveness of the different blocks of CALSczNet and its hyperparameters, ablation experiments were conducted using the condition-I (button tone) dataset. Ten main experiments were considered, covering the following factors: the number of neurons in the LSTM layer, data augmentation, the activation function in the $F_t$ block, the depth multiplier in the $F_s$ block, the attention blocks, the shortcut connection in the attention blocks, the reduction factor in the LA block, element-wise multiplication vs. element-wise addition, the learning rate, and the effectiveness of the CALSczNet model compared to the EEGNet model.

5.1.1. Effect of The Number of Neurons in the LSTM Layer

An LSTM network can effectively learn and process long-term dependencies within sequential data, either remembering or forgetting information over time, such as EEG signals. An EEG signal records changes in brain activity over time. Therefore, in this experiment, the effect of the LSTM layer was examined with different numbers of neurons. Both attention blocks (TA and LA) were removed, and only the LSTM layer with a different number of neurons was kept.
The results shown in Figure 5 reveal that the highest performance was achieved with the architecture containing an LSTM layer with 25 neurons. On the other hand, the LSTMs with 30, 35, and 40 neurons are less effective. It indicates that increasing the number of LSTM neurons leads to increased model complexity, which causes overfitting.
These results indicate that an LSTM layer can effectively model the temporal dependencies within EEG trial samples, thereby gradually improving model performance. Furthermore, LSTM assists in capturing only the important temporal features of EEG trial samples and accelerates the classification process.

5.1.2. Effectiveness of Data Augmentation

This experiment explored the influence of applying data augmentation techniques on the performance of the CALSczNet model. As shown in Figure 6, the highest performance was achieved when training the model with an augmented training dataset with 32,000 training trials. In contrast, the performance of the model decreased when using the original training dataset with 6272 training trials. By increasing the size of the training dataset to 32,000, it is possible to mitigate the risk of overfitting and enhance the model’s generalization capabilities.

5.1.3. Effect of Activation Function in Temporal Convolution Block

Following the idea of the common spatial pattern (CSP) [41], the non-linearity in the temporal convolution block is omitted. An experiment was performed to show the effectiveness of omitting the non-linear activation function in the first TConv layer of the first block (see Table 1). The results shown in Figure 7 indicate that the TConv layer without a non-linear activation function increased the model’s performance.
It is probably because the EEG signals are rich in important temporal features that may need linear extraction to preserve them. By omitting a non-linear function, the first TConv layer can effectively capture low-level features to be effectively processed in later layers. Conversely, using non-linear functions might distort these features, losing some critical information and making it difficult for the subsequent layers to extract useful information.

5.1.4. Effect of Depth Multiplier in Spatial Depth-Wise Convolution Block

The depth multiplier ( D ) in the F s block is critical as it influences both the model’s complexity and its performance. Different values of D , ranging from 2 to 5, were examined. As shown in Figure 8, the performance of the model increases proportionally with an increase in the value of D . This observation suggests that the model’s ability to capture more relevant features from the EEG signal improves as its complexity increases. The best performance is achieved with D values of 4 or 5. The primary role of D is to increase the number of output feature maps, enhancing the model’s capability to capture more detailed and varied features. Given that the performances of D = 4 and D = 5 are relatively similar, a value of D = 4 is selected due to its lower model complexity compared to D = 5.

5.1.5. Effect of Attention Blocks

This experiment empirically shows the effectiveness and contribution of each attention block in CALSczNet. To accomplish this purpose, the attention blocks are removed one at a time. Figure 9 summarizes the average performance results and the impact of removing one of the attention blocks.
As illustrated in Figure 9, removing either LA or TA from the CALSczNet model results in relatively the same performance in terms of accuracy and F1-score.
First, the result of removing LA (blue bar in Figure 9) indicates that, when a TA block is placed after the temporal convolution layer (TConv), the TA accurately refines the low-level temporal features. By contrast, removing TA (orange bar in Figure 9) leaves the LA block to activate the feature maps generated by the $F_{sep}$ block; it refines and emphasizes the intermediate-level spatial features. At this point, the LA block plays a significant role in activating and preparing the prominent features fed into the next $F_{LSTM}$ block (see Figure 2). Then, the $F_{LSTM}$ learns temporal dependencies effectively, accelerating the training process and classification.
Eventually, both the TA and LA blocks are added (green bar in Figure 9) to finalize the proposed model. As shown in Figure 9, this resulted in a substantial performance improvement for the CALSczNet model. The results indicate that the two attention blocks, TA and LA, assist in efficiently learning the complex and relevant temporal and spatial features of the EEG trial, accelerate model training, and enhance model performance. Furthermore, the results indicate that CALSczNet is well suited to the one-dimensional EEG representation and robust enough to generalize effectively to unseen data. Moreover, the attention blocks help reduce the chance of overfitting due to the small number of learnable parameters.

5.1.6. Effect of Shortcut Connection in Attention Block

In this experiment, the impact of shortcut connections was investigated by adding them to or removing them from the attention blocks. As shown in Figure 10, the model with shortcut connections achieved better results than the model without them, reaching 98.47% accuracy and an F1-score of 98.5%. This reflects the fact that attention blocks strongly emphasize only the most relevant features while suppressing the rest, which can discard informative features present in the original input. To overcome this problem, a shortcut connection is added to each attention block so that discriminative features of the original EEG signal that might be lost during the attention stage are retained; these features are then recombined with the attention-weighted ones. Using a shortcut connection thus helps preserve the informative content of the original input, since the EEG signal is rich in useful features, and mitigates information loss during attention processing, thereby enhancing model performance.
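A minimal sketch of this shortcut pattern is shown below: the original features bypass the attention stage and are recombined with the attention-weighted ones (shown here as element-wise addition), so information suppressed by the attention weights is not lost. `attention_fn` is a placeholder for a TA or LA block.

```python
from tensorflow.keras import layers

def with_shortcut(x, attention_fn):
    attended = attention_fn(x)      # attention-weighted features
    return x + attended             # shortcut recombination, shape-preserving
```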

5.1.7. Effect of Reduction Factor in LA Block

The reduction ratio ř in the LA block is crucial, since it affects both the complexity and the performance of the model. Setting a high reduction ratio significantly reduces the number of features across all channels in the excitation stage of the LA block; as a result, model complexity is reduced and the training process is accelerated. However, too aggressive a reduction can cause information loss and undermine performance. Therefore, various experiments were performed to find a reduction ratio that balances complexity and performance. Figure 11 compares the effect of different reduction ratio values of the LA block on the proposed model. Setting ř = 12 resulted in a favorable tradeoff between complexity and performance; consequently, ř = 12 was used in all the following experiments.
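The sketch below assumes a squeeze-and-excitation-style design for the LA block, which the reduction/excitation terminology suggests; it is illustrative rather than the authors' exact implementation. With the 64 feature maps of Table 1, ř = 12 gives 64 // 12 = 5 bottleneck units, i.e., 64·5 + 5·64 = 640 parameters when biases are omitted, matching the LA entry in Table 1.

```python
from tensorflow.keras import layers

def local_attention(x, reduction=12):
    c = x.shape[-1]                                   # number of feature maps
    s = layers.GlobalAveragePooling2D()(x)            # squeeze: (batch, c)
    s = layers.Dense(c // reduction, activation="relu",
                     use_bias=False)(s)               # excitation bottleneck
    s = layers.Dense(c, activation="sigmoid",
                     use_bias=False)(s)               # per-map weights in [0, 1]
    s = layers.Reshape((1, 1, c))(s)
    return x * s                                      # re-weight feature maps

inp = layers.Input(shape=(1, 1024, 64))
out = local_attention(inp)                            # shape is preserved
```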

5.1.8. Effect of Element-Wise Multiplication vs. Element-Wise Addition

This section examines the effect of element-wise multiplication versus element-wise addition between the LA and separable convolution blocks. As shown in Figure 12, the proposed model with element-wise multiplication significantly outperformed the model with element-wise addition, achieving 98.47% accuracy and an F1-score of 98.5%. In practice, element-wise multiplication amplifies or suppresses features according to their importance, assigning high weights to informative features and low weights to less important ones. Consequently, element-wise multiplication is employed to combine the attentive features produced by the LA block with the features generated by the separable convolution block. This operation emphasizes the useful features and suppresses the less useful ones, accelerating the training process.
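The two fusion operators compared here are sketched below with placeholder inputs: `la_out` stands for the LA block's attentive features and `sep_out` for the separable-convolution features (hypothetical names).

```python
from tensorflow.keras import layers

la_out = layers.Input(shape=(1, 1024, 64))   # attentive features from LA
sep_out = layers.Input(shape=(1, 1024, 64))  # separable-convolution features

fused_mul = sep_out * la_out   # scales each feature by its salience weight
fused_add = sep_out + la_out   # merely offsets the feature values
```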

5.1.9. Effect of Learning Rate

In this experiment, the impact of the learning rate (LR) was evaluated on the training of CALSczNet with the Adam optimizer. Figure 13 presents the effect of different LRs using condition-I and Fold 7. It illustrates that CALSczNet's performance dropped significantly with a very small LR of 1 × 10−5 (0.00001). Such a low LR causes an extremely slow training process and prevents the weights from being updated effectively due to the small step size; the model may also become stuck in local minima or require extensive training time to converge.
In contrast, the performance of CALSczNet improved as the LR was increased gradually. The LR of 1 × 10−4 (0.0001) achieved the highest performance, with 98.94% accuracy and an F1-score of 98.94%. Although CALSczNet also performed well with LRs of 1 × 10−3 (0.001) and 1 × 10−2 (0.01), the LR of 1 × 10−4 (0.0001) is the best; it was therefore selected to train the model in all the experiments.
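A minimal sketch of the selected training configuration is shown below: Adam with the best-performing LR of 1 × 10−4. The one-layer model and the loss function are placeholders standing in for CALSczNet's actual setup.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, activation="softmax")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # selected LR
    loss="sparse_categorical_crossentropy",                  # assumed loss
    metrics=["accuracy"],
)
```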

5.1.10. Effectiveness of the CALSczNet Model Compared to the EEGNet Model

This experiment explores how the proposed CALSczNet model differs from the EEGNet model [37]. As shown in Figure 14, the proposed model outperforms the EEGNet model.
The key difference between the CALSczNet and EEGNet models is that CALSczNet integrates two attention mechanisms, i.e., TA and LA. These mechanisms enhance the spatiotemporal features (via TA) and emphasize important feature maps based on global context (via LA), whereas EEGNet does not incorporate any attention mechanism. Moreover, both TA and LA incorporate scaling multiplication and element-wise addition operations to combine the input with the attention-enhanced features, and these operations contribute to improving the model's performance. Conversely, the EEGNet model passes features directly from one convolutional layer to another without additional attention-based operations.
Another distinctive feature is incorporating an LSTM layer in the CALSczNet to work out the temporal dependencies in the EEG signal.
In addition, the CALSczNet exploits the intermediate saliency-enhanced features of LA to scale the output features of F s e p via element-wise multiplication; see Figure 2 and Section 5.1.8. This operation is used to amplify important features and suppress less important ones. In contrast, the EEGNet model treats all features equally without emphasizing the most discriminative ones.

5.2. Performance on All Conditions

After constructing CALSczNet with appropriate hyperparameters based on condition-I, it was used to train and evaluate the remaining conditions of the dataset. Four experiments were conducted to compare the effectiveness of the different conditions for detecting SZ in patients: each condition was investigated individually, and all conditions were combined. The experiments that trained each condition independently involved the following: (i) condition-I: press the button to generate a tone, representing the auditory and motor actions simultaneously (button tone); (ii) condition-II: listen to a tone passively, representing the auditory action only (tone only); and (iii) condition-III: press the button without generating a tone, representing the motor action only (button only).
For fairness, all experiments were conducted with the same activation function, optimizer, learning rate, and number of epochs. Each experiment was run approximately five times per fold, and the best result was reported. These repeated runs allowed us to verify that similar results were obtained across runs and helped mitigate the sensitivity of iterative training algorithms to parameter initialization, on which model training relies.
In the first experiment, when the model was trained on each condition independently, its performance using conditions I and II outperformed condition-III; see Figure 15. It achieved accuracies of 98.47% and 98.35% for conditions I and II, respectively. These results indicate that the detection of SZ in patients improves when a subject performs auditory and motor actions concurrently or the auditory action alone. It can be concluded that these two actions correlate strongly with the left frontal lobe (motor cortex) and exhibit more abnormalities in SZ. Furthermore, these results reveal that these two conditions carry the most discriminative features for accurate SZ detection. On the other hand, condition-III yielded the lowest results, since the subject only pressed the button without receiving any auditory stimulus. Thus, relying solely on condition-III as a biomarker for detecting SZ is ineffective due to its lack of distinctive features. As a result, the state of the subject while recording EEG signals is crucial, as the most powerful features for SZ detection emerge during richer task engagement, as in conditions I and II, rather than in a condition with minimal stimulation like condition-III.
In the second experiment, all conditions were combined to train the model, which improved the performance compared to the use of each condition individually. This performance indicates that all conditions contribute effectively to SZ detection. It obtained the highest accuracy, sensitivity, specificity, and F1-score with values of 98.6%, 98.65%, 98.72%, and 98.65%, respectively. Therefore, the results in Figure 15 demonstrate that the state of the subject and the left frontal lobe contribute to SZ detection and deliver a reasonable improvement in model performance.

5.3. Robustness of the CALSczNet Model

This section presents the performance of each condition and their combination to demonstrate the robustness of the CALSczNet model.
Figure 16 illustrates the mean training and validation accuracy curve over 100 epochs for 10-fold cross-validation. In Figure 16, the x-axis represents the number of epochs, and the y-axis represents the accuracy achieved by the CALSczNet model on the training and validation datasets.
From all the curves, it can be observed that the mean training (red) and validation (blue) accuracies increase with the number of training epochs, and the two lines remain closely aligned. The model converged quickly, at approximately 20 epochs, and the accuracy tended to stabilize at approximately 98%. This indicates that the model generalizes well to unseen data and does not overfit the training data. It also suggests that all three conditions contribute strongly to SZ detection, whether used independently or in combination.
To show how trials are distributed between correct and incorrect classifications, the confusion matrices are reported. Figure 17 presents the confusion matrix for fold 10 (as an example) for each condition as well as for all conditions combined. These figures display each class of EEG trials in the testing set, where 0 and 1 represent HC and SZ, respectively. From the confusion matrix of condition-I shown in Figure 17a, it can be seen that 1888 (100%) of the SZ trials are correctly classified as SZ, while 1836 (99%) of the HC trials are correctly classified as HC. The confusion matrix of condition-II in Figure 17b likewise shows that 1843 (100%) of the SZ trials are correctly classified as SZ, while 1771 (99%) of the HC trials are correctly classified as HC. On the other hand, the numbers of misclassified SZ and HC trials increase slightly in the confusion matrix of condition-III (see Figure 17c), to approximately 57 (2%) and 11 (1%), respectively. Figure 17d shows that 1% of both SZ and HC trials are incorrectly identified when all conditions are combined to train the proposed model; these misclassified trials are attributable to condition-III. Therefore, it is strongly recommended that condition-III be scrutinized before training DL models, since it increases the number of misclassified trials.
To validate the efficiency and stability of the proposed model, the ROC curves and AUC values were plotted for each condition and for the combination of conditions. Overall model performance was measured using the area under the ROC curve (AUC); an AUC closer to 1 indicates better performance. As seen in Figure 18, the highest AUC is obtained by condition-I and condition-II, at almost 1 (0.99). In contrast, the AUC for all conditions combined is slightly lower (0.987), which is attributable to the inclusion of condition-III, which yields a slightly lower AUC than the other conditions. To conclude, Figure 18 shows that all ROC curves lie close to the top-left corner, maximizing the true positive rate while minimizing the false positive rate. This demonstrates that the proposed model distinguishes between SZ and HC with outstanding accuracy.
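For reference, the reported ROC/AUC values can be computed from model outputs with scikit-learn as sketched below; `y_true` and `y_score` are placeholders for the test labels (0 = HC, 1 = SZ) and the model's predicted SZ probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.90])
fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve; closer to 1 is better
```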
Table 4 demonstrates the performance of CALSczNet across 10 folds when combining all conditions. The columns of the table represent the 10 folds (F1 to F10), the rows report Acc, Sen, Sep, and F1-score, and the last column gives each metric's average and standard deviation across the folds.
The accuracy over all the folds ranges between 98.27% and 98.85%, with an average of 98.6% and a relatively small standard deviation of ±0.27. A similar trend is observed for the other metrics. There is no significant variation in performance across the folds, which indicates the robustness of the model when trained on different training and validation sets and tested on the corresponding test sets. It further indicates that the model does not overfit and generalizes well to unseen data. Moreover, the sensitivity and specificity results corroborate CALSczNet's strength in classifying both classes without bias towards either one. The high F1-scores across folds reflect a good balance between precision and recall, meaning that the model both identifies SZ effectively (recall) and accurately classifies SZ as positive cases (precision). In short, CALSczNet classifies subjects as HC or SZ accurately, robustly, and effectively.

6. Discussion and Comparison

In this study, the CALSczNet model was proposed to automatically detect schizophrenia using EEG signals. It was shown that CALSczNet can efficiently distinguish healthy subjects from schizophrenic patients. Based on the findings detailed in Section 5, the proposed CALSczNet model is a robust and reliable approach, with an exceptional accuracy of approximately 98.6 ± 0.27% when all conditions are combined to train the model. To the best of our knowledge, the proposed model is the first work to utilize attention mechanisms together with the LSTM technique to detect SZ patients based on EEG signals.
The proposed model's two core attention blocks are designed to learn features from EEG trials accurately. The TA block is placed in the temporal block, where it selectively captures time-relevant features within an EEG trial. The LA block, in turn, is placed between two important blocks: it is used immediately after the F s block to emphasize the spatial features of the EEG trial, and its attentive spatial features are then used to enhance the features produced by the F s e p block via an element-wise multiplication operation. This element-wise multiplication proved to be a valuable addition, amplifying the feature interaction between these two blocks. In addition, selecting an appropriate reduction factor for the first fully connected layer in LA leads to (i) compression of the feature space, (ii) reduced model complexity, and (iii) effective capture of the essential features. Hence, the LA block plays a crucial role in preparing informative features for the subsequent LSTM layer, which encodes the time series and learns the long-term dependencies of the sequential EEG data. Notably, the number of neurons in the LSTM layer significantly impacts model performance; an LSTM with 25 neurons best met our objectives. The combined use of the TA, LA, and LSTM techniques enables CALSczNet to extract features successfully and accelerates training and classification, thereby significantly enhancing the overall robustness and performance of the proposed model.
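To make this data flow concrete, the following is a high-level Keras sketch assuming the shapes of Table 1. The TA block is simplified to batch normalization and LA to a squeeze-and-excitation-style stand-in, so the sketch conveys only the ordering and shapes of the blocks, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def local_attention(x, r=12):
    # SE-style re-weighting (see the earlier LA sketch); shape-preserving.
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(c // r, activation="relu", use_bias=False)(s)
    s = layers.Dense(c, activation="sigmoid", use_bias=False)(s)
    return x * layers.Reshape((1, 1, c))(s)

inp = layers.Input(shape=(10, 2048, 1))
x = layers.Conv2D(16, (1, 32), padding="same", use_bias=False)(inp)  # TConv
x = layers.BatchNormalization()(x)           # TA stand-in (simplified)
x = layers.DepthwiseConv2D((10, 1), depth_multiplier=4,
                           use_bias=False)(x)                         # F_s
x = layers.BatchNormalization()(x)
x = layers.Activation("gelu")(x)
x = layers.AveragePooling2D((1, 2))(x)
x = layers.Dropout(0.10)(x)
la = local_attention(x)                      # LA on the F_s features
sep = layers.SeparableConv2D(64, (1, 16), padding="same",
                             use_bias=False)(x)                       # F_sep
sep = layers.BatchNormalization()(sep)
sep = layers.Activation("gelu")(sep)
x = sep * la                                 # element-wise scaling by LA
x = layers.AveragePooling2D((1, 4))(x)
x = layers.Dropout(0.10)(x)
x = layers.Reshape((256, 64))(x)
x = layers.LSTM(25, return_sequences=True)(x)                         # F_LSTM
x = layers.Dropout(0.10)(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.25)(x)
out = layers.Dense(2, activation="softmax")(x)                        # F_C
model = tf.keras.Model(inp, out)
```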
Furthermore, an augmentation technique based on an overlapping sliding window of 2 s is adopted to avoid overfitting and improve model performance. This technique increases the number of training trials by extracting short segments around the task onset. A positive relationship was observed between model performance and both the augmentation and the short temporal length: the 2-s period contains the task occurrence within the EEG trial, which carries strongly discriminative temporal features, thereby enhancing model performance. In addition, the short 2-s duration is easy to process and reduces computation time and storage requirements.
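A hypothetical sketch of this overlapping sliding-window augmentation is given below; the sampling rate and stride are assumptions (2048 samples per 2-s window implies 1024 Hz), not values taken verbatim from the paper.

```python
import numpy as np

def sliding_windows(trial, fs=1024, win_s=2.0, stride_s=0.5):
    """Cut a (channels x samples) EEG trial into overlapping 2-s windows."""
    win, stride = int(win_s * fs), int(stride_s * fs)
    return np.stack([trial[:, s:s + win]
                     for s in range(0, trial.shape[1] - win + 1, stride)])

trial = np.random.randn(10, 3072)    # e.g., a 3-s trial at 1024 Hz
windows = sliding_windows(trial)     # -> (n_windows, 10, 2048)
print(windows.shape)
```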
As mentioned in Section 3.2, conditions I and II contain important features as they represent mental activities. Based on this observation, the model was trained on each condition independently and on their combination. The best results were obtained when the conditions were combined, although 1% of the SZ and HC trials were misclassified. These misclassifications primarily resulted from condition-III, in which the subject pressed the button without receiving any auditory feedback. It is therefore strongly recommended that the subject's state be considered in future work, favoring tasks that reflect brain activity, such as simultaneous auditory and motor activities or auditory activity alone.
This study considers ten channels over the left frontal lobe, which demonstrated superior performance. These results indicate that the left frontal region and the selected channels can reveal abnormalities in EEG signals and contain the most useful features for detecting SZ in patients.
The ROC curves exhibited high AUC values (close to 1) across all conditions, indicating the proposed model's excellent ability to distinguish between classes. The confusion matrices for conditions I and II show that the model's capabilities are significantly high, as evidenced by the high rates of true positives and true negatives and the very low rates of false positives and false negatives. In contrast, condition-III slightly increases the false positives and false negatives due to its less distinct EEG patterns. Moreover, when all conditions are combined, the model's performance improves, indicating its ability to generalize across various types of EEG signal conditions.
CALSczNet is robust and effectively utilizes EEG signals to diagnose SZ early. It demonstrates superior performance with fewer learnable parameters. It is expected to aid psychologists in accurately diagnosing SZ in patients at an early stage and providing timely treatment, thereby enhancing their quality of life.

6.1. Comparison to State-of-the-Art Techniques

The effectiveness and superiority of the proposed method are demonstrated by comparing it with recent state-of-the-art methods that used the same dataset, as shown in Table 5, which presents a detailed overview including the overall performance metrics.
In ref. [3], hand-engineered feature-based methods were adopted to classify EEG signals. This study achieved the lowest performance compared to the CALSczNet model, with 93.58% accuracy when using only the F3 channel. Hand-engineered feature-based methods require several phases: signal decomposition, feature extraction and/or selection, and classification [3]. These phases are complex and time-consuming, involving difficult and laborious parameter tuning. In addition, some feature extraction methods suffer from problems such as those reported in [11], and these methods cannot handle large datasets of EEG signals for SZ detection [13]. Therefore, conventional machine learning techniques are insufficiently robust for schizophrenia detection [2,12]. Deep learning techniques, on the other hand, learn feature representations directly from data and can deal with large EEG datasets [13].
In deep learning, EEG signals have been used either in 1-dimensional (1D) raw form or as 2-dimensional (2D) images. As depicted in Table 5, some recent studies exploited 1D raw EEG signals [20,21,22], obtaining accuracies of about 92%, 92.93%, and 98.2%, respectively. EEG signals converted to 2D images were used to detect SZ in patients in [23,31]; to obtain the 2D form, time–frequency visualization techniques such as the spectrogram, STFT, and CWT were applied to the raw 1D EEG signal. The main challenge associated with the 2D form is that some information is lost during the transformation from 1D [15]. Therefore, raw 1D EEG was adopted in this study to avoid these problems.
Moreover, Siuly et al. [31] adopted highly complex pre-trained CNN architectures such as GoogLeNet to classify 2D EEG. This model has approximately 5 million learnable parameters against a comparatively small number of training images (23,201 unbalanced images); although it achieved a good accuracy of 95.09% [31], such a parameter-to-data ratio makes the model prone to overfitting. In comparison, the CALSczNet model adopts a depth-wise convolution layer, a separable convolution layer, attention mechanisms, and an LSTM layer. These layers reduce computational complexity, accelerate learning, and help avoid overfitting; they also keep the number of learnable parameters (28,216, as shown in Table 1) small relative to the number of training EEG trials. CALSczNet achieved 98.6% accuracy when trained on 1D EEG data using all conditions and only 10 channels from the left frontal lobe, outperforming the model in [31], which was based on 2D EEG data using all conditions and channels covering the entire brain. It is worth highlighting that the CALSczNet model builds on the EEGNet model [37]; it is therefore a compact model that achieves good performance given the available training data.
Furthermore, the studies that adopted deep learning in Table 5 did not use end-to-end learning approaches [20,21,22,31]. Unlike these studies, the proposed method is based on an end-to-end 1D CNN model.
The state of the subject during EEG recording, with respect to the task performed, has not been discussed in the studies presented in Table 5; these studies examined their models by considering only the combined effect of all conditions. In contrast, this study explored the effect of each condition independently as well as in combination.
The appropriate temporal length for the Kaggle dataset was also not discussed or analyzed by the studies presented in Table 5, which primarily used a temporal length of 3 s, as in [20,23,31], or 1.5 s [3]. In ref. [3], 1.5 s was used without specifying why, or from where in the EEG signal, the 1.5 s was extracted. By contrast, this study investigated the task onset in depth and found it located in the middle of the EEG signal; based on this, the period around the onset was identified as the region where the most discriminant features reside. As a result, a 2-s temporal length was specified, since it includes the task occurrence, and it exhibited superior results compared to previous studies.
In addition, the studies listed in Table 5 did not discuss the brain regions and the number of channels used, except for those employing hand-engineered feature-based methods [3]. The studies [20,21,31] utilized all channels, covering all brain regions. In contrast, this study investigated the effectiveness of the left frontal lobe. As discussed in Section 5.2, the left frontal lobe contains discriminant features that can distinguish SZ from HC; moreover, the analysis of EEG rhythms using Lempel–Ziv complexity (LZC) showed that the frontal lobe contains abnormal fluctuations [30]. As a result, the 10 channels of the left frontal lobe were adopted, and the model outperformed existing state-of-the-art works. Additionally, employing few EEG channels helps minimize patient discomfort and makes the EEG recording session easier to manage.
To summarize, a recent study [22] utilized the Kaggle dataset and achieved a notable accuracy of 98.2%. However, that study left several aspects unaddressed: it did not specify the number of EEG channels used, suggesting that all available channels, potentially up to 70, were utilized; it did not detail the temporal length of the EEG signals; its approach is complex and requires significant computational resources; and it did not consider the condition of the subjects, even though the dataset offers three tasks per subject. In contrast, the proposed method considers various aspects of the dataset, including a suitable temporal length, a brain region with a smaller number of channels, and the state of the subjects during EEG recording. Finally, the method is designed as an end-to-end model. Hence, the proposed CALSczNet method is a promising choice for automatically detecting SZ in patients.

6.2. Applicability and Limitations of the CALSczNet Model

In this study, we emphasized the importance of generalizability and implemented methods to reduce the risk of overfitting by intelligently integrating several advanced techniques. First, depth-wise convolution layers are used to reduce computational complexity and enhance efficiency by processing each input channel independently; see Section 3.3.2. A separable convolution layer further improves computational speed and decreases the number of parameters by decomposing a standard convolution into depth-wise and point-wise convolutions: the depth-wise stage extracts spatial features locally, while the point-wise stage applies a 1 × 1 filter to extract global features; see Section 3.3.3. In addition, the normalization layer improves training stability and speed by standardizing inputs, reducing internal covariate shift; see Section 3.3.1. Finally, augmentation and regularization techniques enhance generalization to unseen data by artificially increasing the diversity of the training dataset; see Section 4.2. Together, these methods improve generalization, prevent overfitting, and allow the model to cope with small datasets. In future studies, we aim to validate CALSczNet on larger and more diverse datasets.
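The parameter savings of the separable convolution can be verified with the sketch below, assuming the Table 1 configuration (a 1 × 16 depth-wise stage plus a 1 × 1 point-wise stage over 64 maps, no biases).

```python
import tensorflow as tf
from tensorflow.keras import layers

sep = layers.SeparableConv2D(64, (1, 16), padding="same", use_bias=False)
_ = sep(tf.zeros((1, 1, 1024, 64)))   # build the layer on a dummy input
# depth-wise: 1*16*64 = 1024; point-wise: 64*64 = 4096; total = 5120,
# versus 1*16*64*64 = 65,536 for a standard convolution of the same size.
print(sep.count_params())             # 5120
```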
From an AI perspective, the experimental findings and analyses show that the lightweight and efficient nature of CALSczNet makes it suitable for real-time applications and easy integration into clinical workflows, ensuring practical usability. However, full clinical applicability falls outside the scope of this paper; in the future, we will engage and collaborate with the relevant medical authorities on this issue. Moreover, CALSczNet is cost-effective: it is lightweight and can be implemented on any platform with minimal computing and storage resources.

7. Conclusions

This paper proposed an efficient, lightweight, end-to-end deep learning approach for the automatic detection of schizophrenia (SZ) using raw EEG signals in their original 1D form. CALSczNet uses depth-wise and separable convolution layers to reduce the model's complexity, extract discriminative temporal and spatial features, and mitigate overfitting. Moreover, three components are incorporated into CALSczNet: temporal attention, local attention, and an LSTM layer, which capture relevant temporal features, spatial features, and long-term dependencies, respectively. They aid in extracting salient features and accelerate the model's training and convergence. The experimental results demonstrate that the state of a subject when either pressing the button to generate a tone or listening to the tone passively provides a better classification of SZ and HC than pressing the button alone. Using 10-fold cross-validation, the method was evaluated on the benchmark public-domain SZ EEG Kaggle database, where it achieved superior performance compared to state-of-the-art techniques. The highest performance was obtained when combining all conditions (Acc: 98.6%, Sen: 98.65%, Sep: 98.72%, Pre: 98.72%, and F1-score: 98.65%). The lightweight model is designed to assist neurologists in detecting SZ in patients reliably and without overfitting, thereby supporting clinical decision-making and the prompt administration of appropriate treatment. In future work, we will explore additional brain regions, work with other available datasets, and examine the model on EEG datasets for depression, emotion recognition, and Alzheimer's disease.

Author Contributions

Conceptualization, N.A. and M.H.; methodology, N.A. and M.H.; software, N.A.; validation, N.A. and A.A.; formal analysis, N.A., M.H., and A.A.; investigation, N.A.; resources, M.H.; data curation, N.A.; writing—original draft preparation, N.A. and A.A.; writing—review and editing, M.H.; visualization, N.A. and A.A.; supervision, M.H.; project administration, M.H.; and funding acquisition, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported under Researchers Supporting Project number RSP2024R109, King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

The data are available online at https://www.kaggle.com/broach/button-tone-sz (accessed on 30 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Siuly, S.; Khare, S.K.; Bajaj, V.; Wang, H.; Zhang, Y. A computerized method for automatic detection of schizophrenia using EEG signals. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 2390–2400.
2. Singh, K.; Singh, S.; Malhotra, J. Spectral features based convolutional neural network for accurate and prompt identification of schizophrenic patients. Proc. Inst. Mech. Eng. Part H J. Eng. Med. 2021, 235, 167–184.
3. Baygin, M.; Yaman, O.; Tuncer, T.; Dogan, S.; Barua, P.D.; Acharya, U.R. Automated accurate schizophrenia detection system using Collatz pattern technique with EEG signals. Biomed. Signal Process Control 2021, 70, 102936.
4. Jahmunah, V.; Oh, S.L.; Rajinikanth, V.; Ciaccio, E.J.; Cheong, K.H.; Arunkumar, N.; Acharya, U.R. Automated detection of schizophrenia using nonlinear signal processing methods. Artif. Intell. Med. 2019, 100, 101698.
5. Baygin, M. An accurate automated schizophrenia detection using TQWT and statistical moment based feature extraction. Biomed. Signal Process Control 2021, 68, 102777.
6. Goshvarpour, A.; Goshvarpour, A. Schizophrenia diagnosis using innovative EEG feature-level fusion schemes. Phys. Eng. Sci. Med. 2020, 43, 227–238.
7. Buettner, R.; Beil, D.; Scholtz, S.; Djemai, A. Development of a machine learning based algorithm to accurately detect schizophrenia based on one-minute EEG recordings. In Proceedings of the 53rd Hawaii International Conference on System Sciences, Maui, HI, USA, 7–10 January 2020.
8. Shenbaga, B.T.S.; Malaiappan, D.M. EEG power spectrum analysis for schizophrenia during mental activity. Australas. Phys. Eng. Sci. Med. 2019, 42, 887–897.
9. Chu, L.; Qiu, R.; Liu, H.; Ling, Z.; Zhang, T.; Wang, J. Individual recognition in schizophrenia using deep learning methods with random forest and voting classifiers: Insights from resting state EEG streams. arXiv 2017, arXiv:1707.03467.
10. Aslan, Z.; Akin, M. Automatic detection of schizophrenia by applying deep learning over spectrogram images of EEG signals. Trait. Signal 2020, 37, 235–244.
11. Khare, S.K.; Bajaj, V.; Acharya, U.R. SPWVD-CNN for Automated Detection of Schizophrenia Patients Using EEG Signals. IEEE Trans. Instrum. Meas. 2021, 70, 2507409.
12. Shalbaf, A.; Bagherzadeh, S.; Maghsoudi, A. Transfer learning with deep convolutional neural network for automated detection of schizophrenia from EEG signals. Phys. Eng. Sci. Med. 2020, 43, 1229–1239.
13. Oh, S.L.; Vicnesh, J.; Ciaccio, E.J.; Yuvaraj, R.; Acharya, U.R. Deep convolutional neural network model for automated diagnosis of Schizophrenia using EEG signals. Appl. Sci. 2019, 9, 2870.
14. Shrestha, A. Review of Deep Learning Algorithms and Architectures. IEEE Access 2019, 7, 53040–53065.
15. Phang, C.-R.; Ting, C.-M.; Samdin, S.B.; Ombao, H. Classification of EEG-based Effective Brain Connectivity in Schizophrenia using Deep Neural Networks. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), San Francisco, CA, USA, 20–23 March 2019; pp. 401–406.
16. Thilakavathi, B.; Devi, S.S.; Bhanu, K. Peak frequency analysis for schizophrenia using electroencephalogram power spectrum during mental activity. Int. J. Biomed. Eng. Technol. 2018, 28, 18–37.
17. Shoeibi, A.; Khodatars, M.; Ghassemi, N.; Jafari, M.; Moridian, P.; Alizadehsani, R.; Panahiazar, M.; Khozeimeh, F.; Zare, A.; Hosseini-Nejad, H.; et al. Epileptic seizures detection using deep learning techniques: A review. Int. J. Environ. Res. Public Health 2021, 18, 5780.
18. Shoeibi, A.; Sadeghi, D.; Moridian, P.; Ghassemi, N.; Heras, J.; Alizadehsani, R.; Khadem, A.; Kong, Y.; Nahavandi, S.; Zhang, Y.D.; et al. Automatic Diagnosis of Schizophrenia using EEG Signals and CNN-LSTM Models. Front. Neuroinform. 2021, 15, 777977.
19. Guo, Z.; Wu, L.; Li, Y.; Li, B. Deep neural network classification of EEG data in schizophrenia. In Proceedings of the 2021 IEEE 10th Data Driven Control and Learning Systems Conference (DDCLS), Suzhou, China, 14–16 May 2021; pp. 1322–1327.
20. Khare, S.K.; Bajaj, V. A hybrid decision support system for automatic detection of Schizophrenia using EEG signals. Comput. Biol. Med. 2022, 141, 105028.
21. Srinivasan, S.; Johnson, S.D. A novel approach to schizophrenia detection: Optimized preprocessing and deep learning analysis of multichannel EEG data. Expert Syst. Appl. 2024, 246, 122937.
22. Praveena, D.M.; Sarah, D.A.; George, S.T. Deep Learning Techniques for EEG Signal Applications—A Review. IETE J. Res. 2020, 68, 3030–3037.
23. Sahu, G.; Karnati, M.; Gupta, A.; Seal, A. SCZ-SCAN: An automated Schizophrenia detection system from electroencephalogram signals. Biomed. Signal Process Control 2023, 86, 105206.
24. Roach, B. EEG Data from Basic Sensory Task in Schizophrenia: Button Press and Auditory Tone Event Related Potentials from 81 Human Subjects. Kaggle. Available online: https://www.kaggle.com/broach/button-tone-sz (accessed on 12 February 2022).
25. Ford, J.M.; Palzes, V.A.; Roach, B.J.; Mathalon, D.H. Did I do that? Abnormal predictive processes in schizophrenia when button pressing to deliver a tone. Schizophr. Bull. 2014, 40, 804–812.
26. Zhang, L. EEG Signals Classification Using Machine Learning for the Identification and Diagnosis of Schizophrenia. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), Berlin, Germany, 23–27 July 2019; pp. 4521–4524.
27. Wurtz, R.H. Corollary Discharge in Primate Vision. Scholarpedia. Available online: http://www.scholarpedia.org/article/Corollary_discharge_in_primate_vision (accessed on 3 June 2024).
28. Akbari, H.; Ghofrani, S.; Zakalvand, P.; Sadiq, M.T. Schizophrenia recognition based on the phase space dynamic of EEG signals and graphical features. Biomed. Signal Process Control 2021, 69, 102917.
29. Khare, S.K.; Bajaj, V.; Siuly, S.; Sinha, G.R. Classification of schizophrenia patients through empirical wavelet transformation using electroencephalogram signals. In Modelling and Analysis of Active Biopotential Signals in Healthcare; IOP Publishing: Bristol, UK, 2020; Volume 1.
30. Khare, S.K.; Bajaj, V. A self-learned decomposition and classification model for schizophrenia diagnosis. Comput. Methods Programs Biomed. 2021, 211, 106450.
31. Siuly, S.; Li, Y.; Wen, P.; Alcin, O.F. SchizoGoogLeNet: The GoogLeNet-Based Deep Feature Extraction Design for Automatic Detection of Schizophrenia. Comput. Intell. Neurosci. 2022, 2022, 1992596.
32. Zhang, D.; Yao, L.; Chen, K.; Monaghan, J. A Convolutional Recurrent Attention Model for Subject-Independent EEG Signal Analysis. IEEE Signal Process Lett. 2019, 26, 715–719.
33. Sabeti, M.; Behroozi, R.; Moradi, E. Analysing complexity, variability and spectral measures of schizophrenic EEG signal. Int. J. Biomed. Eng. Technol. 2016, 21, 109–127.
34. Barros, C.; Roach, B.; Ford, J.M.; Pinheiro, A.P.; Silva, C.A. From sound perception to automatic detection of schizophrenia: An EEG-based deep learning approach. Front. Psychiatry 2022, 12, 813460.
35. Yang, Y.; Wu, Q.; Qiu, M.; Wang, Y.; Chen, X. Emotion Recognition from Multi-Channel EEG through Parallel Convolutional Recurrent Neural Network. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018.
36. Ahmed, M.Z.I.; Sinha, N.; Ghaderpour, E.; Phadikar, S.; Ghosh, R. A Novel Baseline Removal Paradigm for Subject-Independent Features in Emotion Classification Using EEG. Bioengineering 2023, 10, 54.
37. Lawhern, V.J.; Solon, A.J.; Waytowich, N.R.; Gordon, S.M.; Hung, C.P.; Lance, B.J. EEGNet: A Compact Convolutional Network for EEG-based Brain-Computer Interfaces. J. Neural Eng. 2018, 15, 056013.
38. Zhou, Y.; Chen, S.; Wang, Y.; Huan, W. Review of research on lightweight convolutional neural networks. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1713–1720.
39. Ko, W.; Jeon, E.; Jeong, S.; Suk, H.I. Multi-Scale Neural Network for EEG Representation Learning in BCI. IEEE Comput. Intell. Mag. 2021, 16, 31–45.
40. Deng, Y.; Yu, H.; Peng, F.; Yan, F.; Wu, Y.; Yan, L. EEG Signal Classification Based on Neural Network with Depthwise Convolution. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2022; p. 012056.
41. Dong, J.; Komosar, M.; Vorwerk, J.; Baumgarten, D.; Haueisen, J. Scatter-based common spatial patterns—A unified spatial filtering framework. arXiv 2023, arXiv:2303.06019.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
43. Alshaya, H.; Hussain, M. EEG-Based Classification of Epileptic Seizure Types Using Deep Network Model. Mathematics 2023, 11, 2286.
44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
45. Huang, W.; Xue, Y.; Hu, L.; Liuli, H. S-EEGNet: Electroencephalogram signal classification based on a separable convolution neural network with bilinear interpolation. IEEE Access 2020, 8, 131636–131646.
46. Xu, G.; Ren, T.; Chen, Y.; Che, W. A one-dimensional CNN-LSTM model for epileptic seizure recognition using EEG signal analysis. Front. Neurosci. 2020, 14, 578126.
47. Lashgari, E.; Liang, D.; Maoz, U. Data Augmentation for Deep-Learning-Based Electroencephalography. J. Neurosci. Methods 2020, 346, 108885.
Figure 1. The Fp1 channels of two EEG trials for condition-I before and after normalization corresponding to (a) healthy control and (b) schizophrenia.
Figure 2. The architecture of the CALSczNet model.
Figure 3. The typical structure of a single LSTM unit.
Figure 4. The process of creating training trials from one EEG trial, where st represents the stride and s represents time in seconds.
Figure 5. The effect of the number of neurons in the LSTM layer on CALSczNet.
Figure 6. The effect of augmentation on the CALSczNet model.
Figure 7. The effect of activation function in F t block of the CALSczNet model.
Figure 8. The effect of depth multiplier in F s block on the CALSczNet model.
Figure 9. The effect of attention block on the CALSczNet model.
Figure 10. The effect of shortcut connection on CALSczNet model.
Figure 11. The effect of the reduction ratio ř of the first fully connected layer in the LA block on the CALSczNet model.
Figure 12. The effect of the element-wise addition vs. element-wise multiplication.
Figure 13. The effect of using various learning rates on the CALSczNet model.
Figure 14. The effectiveness of the CALSczNet model compared to the EEGNet model.
Figure 15. The performance of the CALSczNet model for each condition independently and the combination of all conditions.
Figure 16. Training and validation curves of CALSczNet corresponding to (a) condition-I; (b) condition-II; (c) condition-III; and (d) all conditions. The solid red lines represent the mean accuracy of training, and the solid blue lines represent the mean accuracy of validation over 10-fold cross-validation. The shadowed areas around these lines indicate the ranges of values.
Figure 17. Confusion matrix of the proposed CALSczNet model corresponding to (a) condition-I, Fold10; (b) condition-II, Fold10; (c) condition-III, Fold10; and (d) all conditions, Fold10, where 0 and 1 represent HC and SZ, respectively.
Figure 18. ROC curves of the proposed CALSczNet model on the testing set corresponding to each condition and the combination of all conditions: (a) condition-I, Fold 9; (b) condition-II, Fold 9; (c) condition-III, Fold 9; and (d) all conditions, Fold 9.
Table 1. The specification of CALSczNet. KS: Kernel Size, KN: Number of Kernels, BN: Batch Normalization, ACT: Activation Function, #LPs: number of Learnable Parameters, TConv: Temporal Convolution Layer, TA: Temporal Attention, SDWConv: Spatial Depth-Wise Convolution layer, D: Depth Multiplier, LA: Local Attention, SepConv: Separable Convolution layer, Drop: Dropout, AP: Average Pooling, and FC: Fully Connected layer.

| Block | Layer | KS | KN | Input Size | Output Size | BN/ACT | #LPs |
|---|---|---|---|---|---|---|---|
| F t | Input | - | - | 10 × 2048 × 1 | 10 × 2048 × 1 | - | 0 |
| | TConv | 1 × 32 | F0 = 16 | 10 × 2048 × 1 | 10 × 2048 × 16 | - | 512 |
| | TA | - | - | 10 × 2048 × 16 | 10 × 2048 × 16 | BN | 144 |
| F s | SDWConv | 10 × 1 | F1 = F0 × D; D = 4 | 10 × 2048 × 16 | 1 × 2048 × 64 | BN/GeLU | 640 |
| | AP | 1 × 2 | - | 1 × 2048 × 64 | 1 × 1024 × 64 | - | 0 |
| | Drop (0.10) | - | - | 1 × 1024 × 64 | 1 × 1024 × 64 | - | 0 |
| | LA | - | - | 1 × 1024 × 64 | 1 × 1024 × 64 | - | 640 |
| F sep | SepConv | 1 × 16 | F2 = F1 × D; D = 1 | 1 × 1024 × 64 | 1 × 1024 × 64 | BN/GeLU | 5120 |
| | AP | 1 × 4 | - | 1 × 1024 × 64 | 1 × 256 × 64 | - | 0 |
| | Drop (0.10) | - | - | 1 × 256 × 64 | 1 × 256 × 64 | - | 0 |
| | Reshape | - | - | 1 × 256 × 64 | 256 × 64 | - | 0 |
| F LSTM | LSTM (25) | - | - | 256 × 64 | 256 × 25 | BN | 9000 |
| | Drop (0.10) | - | - | 256 × 25 | 256 × 25 | - | 0 |
| F C | Flatten | - | - | 256 × 25 | 6400 | - | 0 |
| | Drop (0.25) | - | - | 6400 | 6400 | - | 0 |
| | FC | - | - | 6400 | 2 | SoftMax | 12,800 |
| | Total Number of Learnable Parameters | | | | | | 28,216 |
Table 2. The number of EEG epochs for each condition in case of a one-fold split.

| Name of Set | Condition-I HC | Condition-I SZ | Condition-II HC | Condition-II SZ | Condition-III HC | Condition-III SZ |
|---|---|---|---|---|---|---|
| Testing set | 311 | 473 | 301 | 463 | 312 | 463 |
| Validation set | 311 | 473 | 301 | 463 | 312 | 463 |
| Training set | 2486 | 3782 | 2405 | 3698 | 2487 | 3697 |
Table 3. The definition of model performance metrics.

| Metric | Formula | Equation No. |
|---|---|---|
| Accuracy (Acc) | (TP + TN) / (TP + TN + FP + FN) | (20) |
| Sensitivity/Recall (Sen) | TP / (TP + FN) | (21) |
| Specificity (Sep) | TP / (TP + FP) | (22) |
| F1-score | 2 × (Precision × Recall) / (Precision + Recall) | (23) |
Table 4. The performance of the CALSczNet model when combining all conditions using a 10-fold cross-validation evaluation protocol on the testing dataset, where AVG: average and STD: standard deviation. The best result is displayed in bold.

| Metric | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | AVG ± STD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Acc | 98.65 | 98.73 | 98.71 | 98.38 | 98.27 | 98.44 | 98.53 | 98.72 | 98.85 | 98.72 | 98.6 ± 0.27 |
| Sen | 98.71 | 98.55 | 98.80 | 98.14 | 98.25 | 98.43 | 98.30 | 98.64 | 99.04 | 98.77 | 98.56 ± 0.37 |
| Sep | 98.68 | 98.99 | 98.61 | 98.63 | 98.38 | 98.45 | 98.85 | 98.97 | 98.77 | 98.86 | 98.72 ± 0.21 |
| F1-Score | 98.75 | 98.73 | 98.71 | 98.48 | 98.37 | 98.45 | 98.53 | 98.72 | 98.95 | 98.82 | 98.65 ± 0.28 |
Table 5. Comparison to state-of-the-art methods and the CALSczNet model using the same dataset, where #Ch/BR: number of channels and brain region, TL: temporal length in seconds, s: seconds, 1D: using raw EEG signal as it is, 2D: converting EEG signal into an image, HE: hand-engineered method, and DL: deep learning method. The best result is displayed in bold.

| Study and Year | #Ch/BR | TL (s) | Task | Method and EEG Type (1D/2D) | Classifier | Acc | Sen | Sep | F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| Baygin et al. [3], 2020 | 10 frontal | 1.5 | - | HE | Weighted KNN | 93.58 (with F3) | 95.79 | 90.24 | - |
| Siuly et al. [31], 2022 | 70/all BR | 3 | All Conds | DL: 2D-EEG | GoogLeNet | 95.09 | 93.81 | 97.02 | 95.83 |
| Sahu et al. [23], 2023 | - | - | - | DL: 2D-EEG | SCZ-SCAN | 96 | 96 | 95 | - |
| Guo et al. [20], 2021 | 64/all BR | 3 | All Conds | DL: 1D-EEG | CNN | 92 | - | - | - |
| Smith et al. [21], 2022 | 64/all BR | 3 | All Conds | DL: 1D-EEG | OELM | 92.93 | 97.15 | 91.06 | 94.07 |
| Srinivasan et al. [22], 2024 | - | - | - | HE and DL: 1D-EEG | Optimized preprocessing and hybrid CNN-LSTM | 98.2 | 98.9 | 97.5 | 97.2 |
| Proposed method, 2024 | 10 frontal | 2 | Cond-I | DL: 1D-EEG | CALSczNet | 98.47 | 98.76 | 98.34 | 98.65 |
| | | | Cond-II | | | 98.35 | 98.54 | 98.23 | 98.46 |
| | | | Cond-III | | | 97.90 | 97.9 | 98.02 | 98.01 |
| | | | All Conds | | | 98.6 | 98.65 | 98.72 | 98.72 |