1. Introduction
Schizophrenia is a severe neuropsychiatric disorder that affects nearly 20 million people worldwide [1]. Because treatment is more effective when an appropriate antipsychotic is prescribed at an early stage, fast diagnosis is crucial [2]. However, there is no known objective marker, such as a blood test finding or a feature of brain anatomy, that conclusively diagnoses schizophrenia, so current diagnosis relies on an interview with a psychiatrist. During the interview, the patient describes the symptoms they suffer from, such as hallucinations or delusions. Still, no objective diagnostic result is available to confirm the diagnosis or to be presented to the patient and their family.
In past decades, there have been efforts to take advantage of computational methods as tools for neuroimaging data analysis to determine the neurobiological nature of psychiatric diseases. In the early stages, statistical methods were used to detect differences between the brain of a healthy person and a patient diagnosed with, for example, schizophrenia [
3,
4,
5,
6,
7], Alzheimer’s disease [
8], or borderline personality disorder [
9,
10,
11] at the group level. Manual region-of-interest (ROI) methods were used to measure the size of a specific brain structure and compare it between patients and healthy subjects. However, this approach hampers comparability between laboratories because the borders between anatomical structures are vaguely defined [8]. More recently, fully automated brain morphometry methods have emerged that focus not on a single structure but on the whole brain or its tissues. One such method, voxel-based morphometry (VBM), has become a standard pipeline that processes image data from an anatomical MRI scan into a statistically comparable image [12,13]. The pipeline starts with registration and resampling into a standardized space, continues with segmentation into brain tissues, i.e., white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), and ends with Gaussian kernel smoothing, resulting in gray matter density (GMD) features. A statistical map of p-values can then be computed from these features, describing the statistical significance of the group differences in each voxel. VBM-based approaches have revealed structural changes in schizophrenia, particularly in the left superior temporal gyrus region [
3], decrease in GM volume in the thalamus, middle orbital gyrus, inferior frontal gyrus, and other structures [
6], and decline in GM bilateral occipital lobe, left orbital frontal cortex and other structures [
7], as well as GM reduction in the ventral cingulate gyrus and several regions of the medial temporal lobe in patients with Borderline Personality Disorder [
9]. These findings suggest that schizophrenia manifests as morphological abnormalities in many brain structures. Another brain morphometry method that uses anatomical MRI scans is deformation-based morphometry (DBM). DBM tracks the local displacements of brain tissue that occur when an image is registered to a standard brain template. The resulting features represent local volume changes (LVC), which can also be statistically evaluated at the voxel level.
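To make the smoothing step of the VBM pipeline concrete: the Gaussian kernel is specified by its full width at half maximum (FWHM) rather than its standard deviation. A minimal sketch of the conversion, assuming an 8 mm FWHM (a common VBM setting) and an illustrative 1.5 mm isotropic voxel size, neither of which is claimed to be this paper's exact configuration:

```python
import math

def fwhm_to_sigma(fwhm_mm: float) -> float:
    """Convert FWHM to Gaussian sigma: sigma = FWHM / (2 * sqrt(2 * ln 2))."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def sigma_in_voxels(fwhm_mm: float, voxel_size_mm: float) -> float:
    """Express the smoothing kernel sigma in voxel units for a given voxel size."""
    return fwhm_to_sigma(fwhm_mm) / voxel_size_mm

# An 8 mm FWHM kernel on 1.5 mm isotropic voxels (illustrative values):
sigma_vox = sigma_in_voxels(8.0, 1.5)
```

The sigma in voxel units is what a discrete Gaussian filter would actually use when smoothing the segmented tissue maps.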
Findings obtained using various automatic brain morphometry methods suggest that there are differences between healthy and diseased groups that can be revealed by statistical analysis of various features extracted from brain image data. Despite these successful group-level results, there is still no supportive diagnostic tool that would indicate, with a specific probability, whether a patient is affected by a brain disorder. In contrast to statistical methods, classifiers work at the subject level. Many classification methods have been used to solve this task. Statistics-based methods such as linear discriminant analysis and its modifications have been applied to imaging data in schizophrenia research [14], whereas machine learning methods, which, unlike statistical methods, learn hidden patterns in the images, often iteratively from a training dataset, have proved well suited to classifying neuroimaging data. Many studies have used machine learning to classify neuropsychiatric diseases. Some authors used support vector machines (SVM) to classify schizophrenia with features extracted from MRI data using wavelet transform [
15] or from fMRI data using independent component analysis [
16]. Other features were extracted by calculating the Pearson correlation of WM integrity in diffusion-weighted images [
17] or features extracted and selected with SVM alone [
18]. Other authors took advantage of ensemble learning, which employs multiple classifiers trained on different data or features, to provide a decision by voting [
19,
20,
21]. Although the classification of schizophrenia is still challenging, some machine learning methods have been very successful in diagnosing some neurodegenerative diseases, such as Alzheimer’s disease. Savio et al. used backpropagation, radial basis function, and probabilistic neural network [
22], while Huang et al. [
23] used VBM and artificial neural networks (NNs).
The current success of deep learning builds on the historical development of NNs, which began with the perceptron [
24] and the multi-layer perceptron (MLP) [
25]. The MLP is a commonly used classifier; however, it tends to overfit as the number of neurons and layers grows. A very deep architecture can easily memorize information such as subject class and classify the training set well, yet similarly good results may not be achieved on test data. A better way to train such networks is layer by layer while controlling the correctness of the learning; models trained in this way are called stacked autoencoders.
An autoencoder [
26] is a single-layer neural network in which the input information is the same as the expected target; thus, the hidden layer of neurons tries to reconstruct the input [
27]. The individual layers are then cascaded to create a deep stacked autoencoder network, which is finally fine-tuned by a backpropagation algorithm. Autoencoders have been used to classify many brain diseases. Employing leave-site-out validation, Zeng et al. [
28] diagnosed schizophrenia with a deep discriminant autoencoder network on functional connectivity MRI in a multi-centric study, reaching accuracies of 85% and 81%. Moussavi-Khalkhali and Jamshidi [29] used a sparse AE to extract features for classifying Alzheimer's disease vs. late and mild cognitive impairments and controls using multinomial logistic regression. In addition to the standard AE (82.88% accuracy), they used a partially cascaded AE that feeds information from each layer as input to the classifier (92.44% accuracy). They also used denoising autoencoders. Nayak et al. [
30] employed a modified autoencoder (SRVFL-AE) to classify MRI with multiclass brain abnormalities such as degenerative disease, stroke, tumor, infectious disease, and normal brain, with high accuracy exceeding 95%. Mendoza-Léon et al. [
31] developed a supervised switching autoencoder based on the principle of reconstructing two output channels; the better-reconstructed channel defined the input class. They used single slices taken from structural MRIs of patients with Alzheimer's disease and achieved an accuracy higher than 87.5%. Deep wavelet autoencoders have also helped to detect brain cancer [
32]. Pinaya et al. [
33] used deep autoencoders to classify schizophrenia, achieving an AUC-ROC of 0.611–0.751.
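To illustrate the autoencoder principle described above, here is a minimal single-layer sketch in NumPy. The layer sizes, sigmoid activation, random weights, and toy input are all illustrative assumptions, not the architectures used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One autoencoder layer: the target equals the input, so the hidden
# layer is forced to learn a compressed representation.
n_in, n_hidden = 20, 5                   # illustrative sizes
W_enc = rng.normal(0.0, 0.1, (n_hidden, n_in))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(0.0, 0.1, (n_in, n_hidden))
b_dec = np.zeros(n_in)

def encode(x):
    return sigmoid(W_enc @ x + b_enc)

def decode(h):
    return sigmoid(W_dec @ h + b_dec)

x = rng.random(n_in)                     # a toy feature vector
h = encode(x)                            # compressed code
x_hat = decode(h)                        # reconstruction of the input
mse = float(np.mean((x - x_hat) ** 2))   # reconstruction error to minimize

# Stacking: the code h becomes the input of the next autoencoder layer,
# and a supervised output layer on the last code performs the classification.
```

Training minimizes the reconstruction error layer by layer before the whole stack is fine-tuned with backpropagation.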
Another effective method for training deep architectures is the convolutional neural network (CNN) [
34,
35]. For imaging data of a 3D nature, such as MRI, the CNN has a 3D variant. The 3D CNN consists of several types of layers. The main type is the convolutional layer, which computes the convolution between a kernel and the input image or the output of the previous layer. The convolution operation brings two advantages. First, the CNN works with far fewer weights than a traditional MLP, because the weights of each convolution kernel are shared across all kernel positions during convolution; this strongly reduces the number of network parameters, which helps to tackle overfitting. Second, compared to MLP, SVM, Naïve Bayes, and other classifiers, the convolution exploits information about the positions of neighboring voxels. The result of each convolution serves as the input to an activation function, which is necessary to make the network nonlinear; the standard choice for CNNs is the ReLU (Rectified Linear Unit). Two other types of layers follow: the batch normalization (BN) layer and the max-pooling (MP) layer. Several fully connected layers then follow, and typically the output layer for the classification task is SOFTMAX.
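The parameter-sharing advantage can be made concrete with a back-of-the-envelope count. The channel counts below are illustrative assumptions; the 121 × 145 × 121 image dimension is the input size used later in this paper:

```python
# Parameter counts: shared convolution kernels vs. a fully connected layer.
def conv3d_params(in_channels, out_channels, kernel=3):
    """Weights of a 3D conv layer: one kernel per (in, out) channel pair, plus biases."""
    return out_channels * (in_channels * kernel ** 3 + 1)

def dense_params(n_in, n_out):
    """Weights of a fully connected layer, plus biases."""
    return n_out * (n_in + 1)

# A single-channel 3x3x3 convolution with 50 output channels:
conv_p = conv3d_params(1, 50)            # 50 * (1 * 27 + 1) = 1400 parameters
# A dense layer from a flattened 121*145*121 image to just 50 neurons:
dense_p = dense_params(121 * 145 * 121, 50)
```

The convolutional layer needs 1400 parameters regardless of image size, whereas the dense layer needs over a hundred million, which is why weight sharing matters so much for overfitting.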
CNNs have found application in many medical image classification tasks, including schizophrenia, and have delivered good results. Wang et al. [
36] worked with functional MRI and took advantage of dilated 3D convolution, achieving an accuracy of 80%. They suggest preprocessing the data by slice-timing correction, realignment, normalization, and smoothing. A three-dimensional CNN was used in [37] on structural MRI preprocessed by VBM, reaching 79.27% accuracy and 70.98% accuracy on the validation set. Hu et al. [
38] trained 3D CNN on structural and diffusion MRI, preprocessed by skull stripping, registration, segmentation, and modulation, with an accuracy of 79.27% and pre-trained 2D CNN with an accuracy of 72.41%. Oh et al. [
39] classified schizophrenia patients on a multi-centric structural MRI dataset (n = 873) with an accuracy of 97%, but the accuracy achieved on the validation dataset was lower than 70%. Campese et al. [
40] compared a 3D CNN, a 2D CNN, and an SVM on structural MRI data (preprocessed by VBM), with the 3D CNN outperforming the other two; the authors therefore suggest that 3D convolution can capture spatial information correctly. They also applied data augmentation and noted a positive effect on accuracy.
The two previously mentioned deep architectures (CNN and AE) were combined by Oh et al. [
41]. They used a 3D convolutional autoencoder to diagnose schizophrenia spectrum disorders based on task-based fMRI (the task was to evaluate positive, negative, and neutral images with an unpleasant or neutral reaction). In doing so, they achieved an accuracy of around 84%. The autoencoders served as a tool for initializing the weights before the final tuning of the convolutional network.
Since the quality of the classifiers is directly affected by the dataset’s quality, one of the main challenges is using an appropriate image preprocessing pipeline. There are various feature extraction and selection methods that can successfully extract only important and well-discriminative information from the image. Voxel and deformation-based morphometries can serve as feature extraction methods, as can wavelet transform [
15] or independent component analysis [
16] in the case of fMRI data. In this paper, the advantage of morphometry methods is exploited. Voxel-based morphometry [
12] prepares segmented and smoothed tissue images, whereas deformation-based morphometry preprocesses the whole brain and captures the local shifts necessary for fitting the brain to a template. Since CNNs are known to classify the original image without feature selection, the third dataset used in this paper consists of T1-weighted images preprocessed using only the steps necessary to remove the skull and non-brain parts. As previously described, a whole-brain approach with minimal preprocessing could be advantageous because schizophrenia manifests inhomogeneously in the brain.
Deep learning methods are state-of-the-art classifiers known for their ability to extract the information relevant for classification on their own. However, morphometric feature extraction has traditionally played a crucial role in brain image preprocessing for shallow classification methods, preparing informative features for the classifiers and thus improving the classification pipeline. Even nowadays, authors tend to use such preprocessing [
36,
37,
38,
40]. To the best of our knowledge, various morphometric methods have not yet been investigated and compared in combination with deep learning classifiers such as autoencoders and 3D CNN-based models, nor has the ability of these classifiers to extract features on their own been compared with morphometric preprocessing.
In this paper, several research questions are asked and answered experimentally. First, stacked autoencoder (SAE) and 3D CNN deep learning methods were employed to classify schizophrenia patients vs. controls from our dataset. Second, CNNs in particular are not expected to need much preprocessing; therefore, two morphometric methods, VBM and DBM, were applied and assessed to see whether this feature extraction helps deep learning-based models classify better than minimally preprocessed MRI does. The third goal was to explore how architectural changes, such as the number of layers and neurons, affect the outcome for each feature extraction method. Fourth, features extracted by the three methods were combined in all possible ways, and the impact on classification results was observed.
2. Materials and Methods
2.1. Schizophrenia Patients and Healthy Controls
All the subjects were patients at the University Hospital Brno. The dataset consisted of 52 patients in the first episode of schizophrenia (SZ) and 52 healthy control subjects (HC). The entire dataset comprised only males, with median (min–max) ages of 22.9 (17–40) years for SZ and 23.0 (18.2–37.8) years for HC. The groups were age-matched to limit the effect of gray matter volume reduction with aging [
42]. All patients were diagnosed according to ICD-10 (International Classification of Diseases). Their blood and urine samples were collected and tested for toxicology, hematology, and biochemistry to exclude patients with abnormal findings. None of the subjects had a family or personal history of Axis I psychiatric disorder. All subjects signed informed consent, and the study was approved by the ethics committee [
43].
A previous study [
44] involved a subset of 39 patients and an equal number of controls. Another of our previous studies [
14] comprised 49 patients and 49 healthy controls recruited from the same clinical workplace. The numbers of subjects differed due to the prospective study design. The data collected from this study were also used in our previous paper focused on training multi-layer perceptrons with random subspace ensembles [
21].
All subjects were scanned using a 1.5 T Siemens Symphony MRI machine with the following parameters: the sagittal tomographic plane thickness was 1.17 mm, the in-plane resolution was 0.48 mm × 0.48 mm, the 3-D field of view contained 160 × 512 × 512 voxels, IR/GR sequence, TR was 1700 ms, TE was 3.93 ms, TI was 1100 ms, the flip angle was 15°, and FOV was 246 × 246 mm [
43].
2.2. Feature Extraction
There are many brain imaging modalities, each carrying a different type of information, which together provide comprehensive data for modeling based on multidimensional statistical methods and machine learning techniques. In this experiment, structural MRI data provided the information for deep neural network models. The data were preprocessed using three different standardized pipelines supplying different image features: (i) image registration with skull stripping, and two automated morphometry methods, namely (ii) voxel-based morphometry and (iii) deformation-based morphometry.
The first dataset, later referred to as the T1 dataset, consisted of features corresponding to T1-weighted intensities in structural MRI images. The images were spatially normalized, the skull was stripped, and the CSF was removed using SPM8 software (Statistical Parametric Mapping package for MATLAB:
http://www.fil.ion.ucl.ac.uk/spm/, accessed 11 March 2022) to preserve only those brain parts with white and gray matter. The second dataset was preprocessed using the VBM pipeline [
45], later referred to as the GMD dataset, implemented in the SPM8 toolbox (
http://dbm.neuro.uni-jena.de/vbm/, accessed 11 March 2022). The pipeline, described in detail in [
21], included intensity inhomogeneity correction, spatial normalization, GM segment extraction and its modulation with the determinants of the Jacobian matrices computed during the nonlinear registration, and final smoothing using an 8 mm FWHM Gaussian kernel. The third dataset was preprocessed using DBM, which involved identical correction and spatial normalization steps as for VBM. Instead of working with only the GM tissue segment, the DBM pipeline transforms the brain images into local volume changes resulting from the determinants of the Jacobian matrices extracted from vector displacement fields obtained by the high-dimensional registration algorithm [
46]. This preprocessed dataset of local volume change features is later referred to as the LVC dataset.
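As an illustration of the DBM features, the following sketch computes local volume change as the Jacobian determinant of the mapping x → x + u(x): a determinant above 1 means local expansion, below 1 means shrinkage. The displacement fields below are synthetic toys, not the output of the high-dimensional registration algorithm:

```python
import numpy as np

def jacobian_determinant(u):
    """u: displacement field of shape (3, X, Y, Z) in voxel units.
    Returns the determinant of d(x + u)/dx at each voxel."""
    # grads[i][j] = du_i/dx_j, estimated by finite differences
    grads = [np.gradient(u[i]) for i in range(3)]
    J = np.empty((3, 3) + u.shape[1:])
    for i in range(3):
        for j in range(3):
            J[i, j] = grads[i][j] + (1.0 if i == j else 0.0)
    # Expand the 3x3 determinant explicitly, voxel-wise.
    return (J[0, 0] * (J[1, 1] * J[2, 2] - J[1, 2] * J[2, 1])
          - J[0, 1] * (J[1, 0] * J[2, 2] - J[1, 2] * J[2, 0])
          + J[0, 2] * (J[1, 0] * J[2, 1] - J[1, 1] * J[2, 0]))

# Zero displacement: the identity mapping preserves volume (determinant 1).
lvc_identity = jacobian_determinant(np.zeros((3, 4, 4, 4)))
# Uniform 10% expansion along each axis: determinant 1.1**3 = 1.331 everywhere.
u_expand = 0.1 * np.indices((4, 4, 4)).astype(float)
lvc_expand = jacobian_determinant(u_expand)
```

In the real pipeline, one such determinant map per subject constitutes the LVC feature image.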
2.3. Experiment 1: Autoencoders
The first experiment focused on the use of stacked autoencoders. The pipeline of the classification process, consisting of simple feature selection, SAE with specific configurations, and validation methods, is shown in
Figure 1. The first part of the algorithm is the selection of voxels, which could fundamentally improve the success rate of the networks and tackle the time-consuming learning process. The selection metric was the Mann-Whitney test, a non-parametric method, so no assumption about the data distribution had to be met. After putting aside the testing samples, the Mann-Whitney test was applied to each voxel of the brain images: voxels at a specific location in the patients' brains formed one group, and voxels at the same location in the healthy brains formed the second group. Applying the tests over the whole brain resulted in a map of p-values. These values were ranked from lowest to highest so that only the most discriminating voxels could be selected. The results are not interpreted as statistically significant differences between the two groups (healthy controls and patients) but serve as a tool to select only those voxels that could lead to good classification results. The selection was performed using a threshold related to a predefined p-value, i.e., an uncorrected significance level. By thresholding, the voxels were divided into two groups: the first served as the set of features for the classifiers and is referred to as the feature pool (FP); the second was omitted from further analysis. The parameter examined was the size of the feature pool, which was set to 1000 and 5000 voxels. Since the feature pool of 5000 voxels did not bring significant classification improvement and was computationally very expensive, the feature pool was not enlarged further.
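The selection step can be sketched as follows. This is a simplified stand-in, not the implementation used in the study: the U statistic is hand-rolled without tie handling, and voxels are ranked by the distance of U from its null mean rather than by exact p-values, which yields the same ordering of the most discriminative voxels:

```python
import numpy as np

def mann_whitney_u(x, y):
    """U statistic of sample x vs. y for 1-D arrays (ties ignored: sketch only)."""
    combined = np.concatenate([x, y])
    ranks = np.argsort(np.argsort(combined)) + 1.0   # ranks 1..n
    r1 = ranks[:len(x)].sum()                        # rank sum of the first sample
    return r1 - len(x) * (len(x) + 1) / 2.0

def select_feature_pool(patients, controls, pool_size):
    """patients, controls: (n_subjects, n_voxels) feature matrices.
    Returns indices of the pool_size most discriminative voxels, i.e.,
    those whose U statistic lies farthest from the null mean n1*n2/2."""
    n1, n2 = len(patients), len(controls)
    null_mean = n1 * n2 / 2.0
    scores = np.array([abs(mann_whitney_u(patients[:, v], controls[:, v]) - null_mean)
                       for v in range(patients.shape[1])])
    return np.argsort(-scores)[:pool_size]

# Toy data: voxel 0 is strongly discriminative, the rest are noise.
rng = np.random.default_rng(1)
patients = rng.random((10, 5))
controls = rng.random((10, 5))
patients[:, 0] += 10.0
pool = select_feature_pool(patients, controls, 2)
```

On the toy data, voxel 0 is ranked first because every patient value exceeds every control value, driving its U statistic to the extreme.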
Once the feature pool was defined, stacked autoencoders could be trained on the selected voxels to classify healthy controls and patients. SAE-based classifiers have many parameters that need to be set, so besides the size of the feature pool, two network settings were explored: the number of layers and the number of neurons. The latter always decreased from the input to the output layer, so the information passing through the network was continually compressed to eventually provide a small feature space for classification by the SOFTMAX layer. The number of neurons was set equal to or smaller than the number of inputs. Although it would have been very informative to compute all combinations of layers and neurons within layers, the experiment would have been very demanding on the limited computational resources available. Therefore, trends were observed rather than searching for the best results across all possible network architectures.
Stacked autoencoders with the following settings were used for classification on all three datasets: each layer was trained for 100 epochs, the supervised SOFTMAX layer was trained for 1000 epochs, the regularization parameters were left at their defaults (L2 regularization of 0.004, sparsity regularization of 4, and sparsity proportion of 0.15), and the whole network was finally fine-tuned. The training process was validated using 10-fold cross-validation, where 10% of the dataset was put aside in each fold before the features were selected and the network trained. After the last fold, the complete testing set consisted of all 104 subjects, each classified by a network that had not been trained on that subject, to achieve unbiased results. The entire training process was repeated ten times to tackle the variability caused by random weight initialization. Finally, average accuracy metrics were assessed.
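The validation scheme, with feature selection performed strictly inside each training fold so that test subjects never influence the selected voxels, can be sketched as follows. The selector and classifier passed in here are trivial stand-ins (hypothetical), not the Mann-Whitney + SAE pipeline itself:

```python
import numpy as np

def cross_validate(X, y, select, fit_predict, n_folds=10, seed=0):
    """Unbiased k-fold CV: the test fold is put aside BEFORE feature selection."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        feats = select(X[train], y[train])           # selection on training data only
        preds = fit_predict(X[train][:, feats], y[train], X[test][:, feats])
        correct += int(np.sum(preds == y[test]))
    return correct / len(y)                          # each subject is tested exactly once

def nearest_mean(Xtr, ytr, Xte):
    """Toy stand-in classifier: assign each test point to the nearer class mean."""
    m0 = Xtr[ytr == 0].mean(axis=0)
    m1 = Xtr[ytr == 1].mean(axis=0)
    d0 = np.linalg.norm(Xte - m0, axis=1)
    d1 = np.linalg.norm(Xte - m1, axis=1)
    return (d1 < d0).astype(int)

# Perfectly separable toy data: 20 "controls" at 0, 20 "patients" at 1.
X = np.vstack([np.zeros((20, 4)), np.ones((20, 4))])
y = np.array([0] * 20 + [1] * 20)
acc = cross_validate(X, y, lambda X_, y_: np.arange(X_.shape[1]), nearest_mean)
```

The key point mirrored from the text is the ordering inside the loop: `select` sees only the training fold, so the accuracy estimate is not optimistically biased.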
2.4. Experiment 2: 3D Convolutional Neural Network
Like autoencoders, 3D convolutional neural networks were used to solve the problem of classifying all three types of features, i.e., intensity with minimal preprocessing, gray matter density, and local volume changes (see the classification process in
Figure 1). Since the convolutional neural network can extract important features by itself, no other feature selection method, such as the Mann-Whitney test, was applied in this experiment. Furthermore, by omitting this selection step, no information about the position of neighboring voxels was lost, so the images were treated as 3D structures, which is crucial for convolution operations in CNNs.
The parameters examined were the number of convolutional layers and convolutional kernels. Depth allows the neural network to learn and compose complex features across layers; the more layers, the more capacity to work with the features. However, the size of the network also affects the risk of overfitting, and a very deep CNN may not provide benefits, especially if the dataset is not large enough. The convolution kernel size was set to 3 × 3 × 3 voxels by default, and the number of convolutional kernels increased from the input to the output layer. Each convolutional layer was always followed by batch normalization, ReLU, and max-pooling layers. The particular 3D CNN architectures examined can be seen in
Table 1. The other parameter settings were as follows: the input image dimension was 121 × 145 × 121 × 1, where the last dimension is the number of channels (grayscale only); the stride was set to 3; the dilation factor was 1; the max-pooling dimension was the same as the filter dimension (3 × 3 × 3 voxels); the fully connected part had 10 hidden and 2 output neurons; the dropout rate was 0.5; the initial learning rate was 0.0001 for the Adam algorithm; the data were shuffled in each epoch; and the mini-batch size was 10. Training was performed on batches (subsets) of data to estimate the gradient over the training set. The more data in a batch, the better the gradient estimate, and batching is more efficient than calculating the gradient for each subject separately [
47]. The training was controlled by the validation set, which consisted of 10% of the training data, and was stopped after 1000 epochs or when all of the following criteria were met: the minimal number of epochs was 100, accuracy on the validation set was decreasing, training accuracy was greater than 90%, and validation accuracy exceeded 75%. These criteria enabled the 3D CNN to learn with high accuracy in a reasonable amount of time.
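The stopping rule above can be written as a small predicate. The thresholds are those stated in the text, while the exact bookkeeping of the training state (the previous validation accuracy, the epoch counter) is an assumption of this sketch:

```python
def should_stop(epoch, train_acc, val_acc, val_acc_prev, max_epochs=1000):
    """Return True when 3D CNN training should stop, per the criteria above."""
    if epoch >= max_epochs:               # hard cap at 1000 epochs
        return True
    return (epoch >= 100                  # minimal number of epochs reached
            and val_acc < val_acc_prev    # validation accuracy is decreasing
            and train_acc > 0.90          # training accuracy above 90%
            and val_acc > 0.75)           # validation accuracy above 75%
```

Requiring all four conditions at once prevents the training from stopping on an early, noisy dip in validation accuracy.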
The training process was validated using 10-fold cross-validation in the same way as for SAEs, and since the weights were initialized randomly and the learning process was random due to shuffling in both network types, the entire learning process of each architecture was performed ten times. Therefore, all assessment metrics (accuracy, sensitivity, and specificity) were averaged, and their mean values are presented here.
2.5. Experiment 3: Combined Features
The final experiment focused on combining the input data to investigate whether this step would improve accuracy. T1, GMD, and LVC features were combined in all possible ways, i.e., T1 + GMD, T1 + LVC, GMD + LVC, and T1 + GMD + LVC were put together as input data. In the case of autoencoders, feature selection based on the Mann-Whitney test was applied to each dataset separately, and the particular feature pools were concatenated to serve as input data for autoencoders with various architectures. Regarding 3D CNNs, features were combined using a directed acyclic graph (DAG) architecture, which allows multiple inputs to the next layer. The T1, GMD, and LVC features were processed separately by 3D convolutional, batch normalization, ReLU, and max-pooling layers in their own channels; the outputs were then combined and processed by fully connected layers. In this way, the DAG architecture enabled assessing whether any feature combination could improve accuracy. Since training the 3D CNN classifier on combinations of two or three feature types was computationally very intensive, we decided to explore only CNNs with five and seven convolutional layers, which gave good results when a single modality was used. Adding more layers was beyond our computational possibilities.
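A minimal sketch of the multi-input fusion idea: each modality is reduced in its own channel and the resulting vectors are concatenated before the dense layers. The per-channel "reduction" here is plain downsampling standing in for the convolutional stacks, and the pooled size is an illustrative assumption; the 121 × 145 × 121 dimension matches the CNN input described above:

```python
import numpy as np

def channel_features(volume, pooled_size=4):
    """Crude stand-in for one convolutional channel: downsample and flatten."""
    steps = [s // pooled_size for s in volume.shape]
    pooled = volume[::steps[0], ::steps[1], ::steps[2]]
    return pooled[:pooled_size, :pooled_size, :pooled_size].ravel()

def fuse(t1, gmd, lvc):
    """Concatenate per-modality features into one input for the dense layers."""
    return np.concatenate([channel_features(t1),
                           channel_features(gmd),
                           channel_features(lvc)])

# Three same-sized modality volumes (zeros here, purely for shape illustration):
t1 = np.zeros((121, 145, 121))
gmd = np.zeros((121, 145, 121))
lvc = np.zeros((121, 145, 121))
merged = fuse(t1, gmd, lvc)              # 3 * 4**3 = 192 fused features
```

The concatenation point corresponds to the merge node of the DAG, after which the fully connected layers see all modalities at once.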
4. Discussion
This paper contributes to the research and explores the possibility of computer-aided classification of schizophrenia using deep learning algorithms. Its main contribution is the investigation of the influence of input data on classification results and thorough testing and comparison of two modern deep learning classifiers. Stacked autoencoders and 3D CNNs were applied to classify schizophrenia patients and healthy controls. The classification was based on MRI data preprocessed using three different pipelines to extract the three following types of features: T1 with simple registration on a template and skull stripping, GMD with the use of voxel-based morphometry, and LVC using deformation-based morphometry. The research focused on three experiments: (1) two deep learning classifiers were trained to classify schizophrenia based on various data preprocessing pipelines, (2) two important network architecture parameters were explored: depth and number of neurons in each layer, and (3) feature extraction methods were finally combined in all possible ways to train the deep learning classifiers.
The main limitations of the study are the small sample size (such datasets are difficult to collect, especially in neuroscience), the unavailability of validation data from another center, and the necessarily incomplete yet very time-demanding investigation of possible parameter and model architecture combinations: SAEs and 3D CNNs have many parameters to be set, such as the number of layers, the size and number of filters, the size and frequency of pooling layers, and the number of fully connected layers and their neurons, which makes exploring all possible settings almost impossible. Instead of a whole grid of parameter combinations, only the trends of some main parameters were investigated. Finally, a method that could show the impact of particular brain regions on classification is missing; however, this was not the aim of this study.
The stacked autoencoders achieved different accuracies for each type of feature. The best result obtained using T1 features was an accuracy of 56.35%, whereas LVC features allowed an accuracy of up to 63.37% (in both cases with a shallow network of two layers, 500 hidden and two output neurons, and FP = 5000). Deeper architectures did not improve accuracy for T1 input data. Such low accuracy rates suggest that the simplest image preprocessing (T1) is unsuitable for stacked autoencoders. In the case of LVC features, the bigger feature pool (FP = 5000) allowed the network to classify better than the smaller one (FP = 1000). The accuracies obtained using LVC features ranged from 56.63% (FP = 1000 with the 100-50-10 architecture) to 63.37% (FP = 5000 with the 500 architecture). A similar trend was observed for T1 features, where FP = 1000 with the 50 architecture led to an accuracy of 54.13%, while FP = 5000 with the 500 architecture led to 56.35%. However, neither feature extraction method (simple registration with skull stripping, or DBM) seemed to provide sufficient information for the given classification task, so the stacked autoencoders performed poorly. Better results were achieved with features extracted using the VBM pipeline: GMD enabled SAE-based classifiers to achieve accuracies of up to 70%. The best architecture (FP = 1000 with 500-50) achieved an average accuracy of 69.62% (sensitivity 68.85%, specificity 70.38%). This time, the smaller feature pool (FP = 1000) led to better results, with only the deepest architecture (1000-500-100-50-10) failing: the networks classified almost all subjects as patients, resulting in an average sensitivity higher than 90% and an average specificity lower than 13%. Depending on the specific architecture, the larger feature pool (FP = 5000) led to accuracies between 50.77% and 66.25%. However, the 100-50-10 and 1000-500-100-50-10 architectures again failed to classify well, as in the case of FP = 1000. In conclusion, when only one data source was used for SAE classification, GMD provided the most informative features, as its results surpassed those obtained with T1 and LVC. The smaller feature pool was better when using GMD, whereas with T1 and LVC, more features helped increase accuracy. This suggests that GMD may carry better discriminative information in fewer features while carrying less noise. The main finding of these SAE experiments is that both the VBM and DBM pipelines brought a noticeable advantage for SAE classification over the minimally preprocessed dataset, suggesting that the SAE cannot extract image characteristics by itself and that preprocessing steps are needed.
Other authors have used autoencoders to classify patients with schizophrenia. Zeng et al. [
28] achieved an accuracy of over 80% with a deep autoencoder discriminant network. They reached better accuracy, but worked with fMRI data and a larger sample (n = 734), which might have affected the results. Classification results comparable to ours (AUC-ROC 0.611–0.751) were achieved by Pinaya et al. [
33] with a deep autoencoder and a normative method.
The combination of feature types entering the SAEs did not bring any improvement. The most successful combination was GMD and LVC on the larger pool (FP = 5000) with the 500 architecture, reaching an average accuracy of 61.54% (sensitivity = 69.62%, specificity = 53.46%). However, these results were worse than those of SAEs trained on GMD or LVC alone. Trends related to the size of the feature pool or the SAE architecture were hardly observable. The combination of all three feature types brought an improvement only in comparison with the results of SAEs trained on T1 data, which we had already found unsuitable.
According to the results obtained, 3D CNNs trained on T1 features needed at least five convolutional layers to train successfully. Architectures with only three convolutional layers achieved accuracies of around 50% or less, whereas 3D CNNs with five or more convolutional layers reached up to 60%. Unfortunately, adding further layers did not improve the results. In any case, the 3D CNNs seem to have outperformed the SAEs trained on the same data (T1): architectures with five or more convolutional layers led to an accuracy almost 4% higher than the best SAE architecture (FP = 5000 with the 500 architecture). The reason may lie in the 3D structure of the filters and the whole-brain approach. Regarding the 3D CNNs trained on GMD, the results were slightly better than for the T1 dataset. In terms of accuracy, the best architecture (50-100-150-200-250-300-350-10) was identical to that for the T1 data but with up to almost 3% better results. The improvement may suggest that schizophrenia manifests more strongly in the gray matter, which VBM extracts from the brain, in contrast with T1 or DBM, which are whole-brain approaches. However, the trends were similar to the previous dataset, and the three-layer networks did not perform well. Unlike with the SAEs, the different datasets did little to improve the results, suggesting that the 3D CNN can learn to extract features from the images by itself, without preprocessing. On the other hand, the results were not as good as those of the SAEs, suggesting that the SAE works more effectively with a smaller sample size, or that the feature selection step helped remove noisy voxels from the images, or both. The 3D CNNs trained on the DBM (LVC) preprocessed data largely failed at classification, with accuracies below 54%. We assume that the DBM pipeline deformed the relations between brain structures that are useful for CNN classification.
In summary, based on our exploratory work and experimental results, 3D CNNs are not as powerful a tool for classifying schizophrenia from MRI data as SAEs, at least for such a small dataset. A small dataset may not provide enough training data for a model as complex as a 3D CNN; however, collecting large enough datasets in neuroscience is highly challenging. VBM and DBM preprocessing did not affect the results as much as in the case of the SAEs.
DAG 3D CNN networks did not outperform any model based on a single feature dataset. Input information combined from two or three data sources, with the same sample size of patients and controls, probably contributed to overfitting in the training phase, because it exacerbated the curse of dimensionality–too many features for too few subjects. Thus, feature combination is of little use with such a small sample size.
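The dimensionality argument can be made concrete with a small back-of-the-envelope calculation. The feature-pool size (FP = 5000) and sample size (n = 104) match the values discussed in the text; treating each modality as an equally sized pool is a simplifying assumption for the sketch:

```python
# Features-to-subjects ratio when concatenating feature sets: a toy
# illustration of why combining inputs worsens the curse of dimensionality.
n_subjects = 104
feature_pools = {"GMD": 5000, "LVC": 5000, "T1": 5000}

def features_per_subject(pools, selected):
    """Total concatenated feature count divided by the (fixed) sample size."""
    total = sum(pools[name] for name in selected)
    return total / n_subjects

print(features_per_subject(feature_pools, ["GMD"]))               # ~48 features per subject
print(features_per_subject(feature_pools, ["GMD", "LVC", "T1"]))  # ~144: three times worse
```

The sample size stays fixed while the input dimensionality multiplies, so every added modality makes the estimation problem harder rather than easier.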
In the context of schizophrenia classification, 3D CNNs have been employed by other researchers. Campese et al. [40] used pre-trained 3D CNNs (LeNet, VNet, and UNet combined with an SVM) to classify VBM-preprocessed data, achieving accuracies of 86.3% (n = 101) and 71.63% (n = 176) on two independent datasets. A dilated 3D CNN achieved over 80% accuracy in classifying 2300 fMRI EPI images [36], but the dataset consisted of only 46 subjects; moreover, the modality differed from that used in this paper. Oh et al. [41] used a 3D CNN combined with task-based fMRI data (n = 82), reaching ~84% accuracy. The same dataset as in this paper was used in our previous work. The first paper [21] explored random subspace ensembles of multi-layer perceptrons (MLPs) and SVMs trained on the VBM and DBM data and their combination, achieving accuracies of 73.12% (MLP) and 73.51% (SVM). The second paper [20] presented results of an SVM trained on voxels selected using two-sample t-tests, with an accuracy not exceeding 64%. These comparisons show that the accuracies achieved in this paper fall short of those of shallow methods trained on the same dataset, suggesting that shallow methods can achieve better results with smaller sample sizes.
There are also studies with well-performing classifiers whose sample sizes do not exceed the n = 130 recommended by [18], which may have affected their results: (i) [17] used neural-network-based classifiers (ANN, LVQ, RBF) with 100% accuracy (n = 20); (ii) [16] applied NNs to fMRI data and reached an accuracy of 75.6% (n = 69); and (iii) [15] used an SVM on structural MRI, achieving an accuracy of over 71% (n = 104). All of the above models outperformed ours; our drawback was the small sample size (n = 104).
A final statistical comparison of the results shows that the SAE-based classifier trained on GMD features outperformed all the other investigated classifiers and input feature datasets. On the other hand, the SAE-based classifier trained on T1 data and the 3D CNN-based classifier trained on LVC data performed statistically significantly worse than most of the other classifiers. The results suggest that it is worthwhile to combine GMD features resulting from the widely accepted VBM pipeline with SAE-based classification.