*Article* **Rethinking the Methods and Algorithms for Inner Speech Decoding and Making Them Reproducible**

**Foteini Simistira Liwicki \*, Vibha Gupta, Rajkumar Saini, Kanjar De and Marcus Liwicki**

Embedded Intelligent Systems LAB, Machine Learning, Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, 97187 Luleå, Sweden; vibha.gupta@ltu.se (V.G.); rajkumar.saini@ltu.se (R.S.); kanjar.de@ltu.se (K.D.); marcus.liwicki@ltu.se (M.L.) **\*** Correspondence: foteini.liwicki@ltu.se

**Abstract:** This study focuses on the automatic decoding of inner speech using noninvasive methods, such as Electroencephalography (EEG). While inner speech has been a research topic in philosophy and psychology for half a century, recent attempts have been made to decode nonvoiced spoken words by using various brain–computer interfaces. The main shortcomings of existing work are reproducibility and the availability of data and code. In this work, we investigate various methods (using Convolutional Neural Network (CNN), Gated Recurrent Unit (GRU), Long Short-Term Memory Networks (LSTM)) for the detection task of five vowels and six words on a publicly available EEG dataset. The main contributions of this work are (1) subject dependent vs. subjectindependent approaches, (2) the effect of different preprocessing steps (Independent Component Analysis (ICA), down-sampling and filtering), and (3) word classification (where we achieve state-ofthe-art performance on a publicly available dataset). Overall we achieve a performance accuracy of 35.20% and 29.21% when classifying five vowels and six words, respectively, in a publicly available dataset, using our tuned iSpeech-CNN architecture. All of our code and processed data are publicly available to ensure reproducibility. As such, this work contributes to a deeper understanding and reproducibility of experiments in the area of inner speech detection.

**Keywords:** brain–computer interface (BCI); inner speech; electroencephalography (EEG); deep learning; Convolutional Neural Network (CNN); independent component analysis; supervised learning

## **1. Introduction**

Thought is strongly related to inner speech [1,2], through a voice being inside the brain that does not actually speak. Inner speech, although not audible, occurs when reading, writing, and even when idle (i.e., "mind-wandering" [3]). Moreover, inner speech follows the same pattern, e.g., regional accents, as if the person is actually speaking aloud, for example [4]. This work focuses on inner speech decoding.

While inner speech has been a research topic in the philosophy of psychology since the second half of the 20th century [5], with results showing that the part of the brain responsible for the generation of inner speech is the frontal gyri, including Broca's area, the supplementary motor area, and the precentral gyrus, the automatic detection of inner speech has very recently become a popular research topic [6,7]; however, a core challenge of this research is to go beyond the closed vocabulary decoding of words and integrate other language domains (e.g., phonology and syntax) to reconstruct the entire speech stream.

In this work, we conducted extensive experiments using deep learning methods to decode five vowels and six words on a publicly available electroencephalography (EEG) dataset [8]. The backbone CNN architecture used in this work is based on the work of Cooney et al. [7].

The main contributions of this work are as follows: (i) providing code for reproducing the reported results, (ii) subject dependent vs. subject-independent approaches, (iii)

**Citation:** Simistira Liwicki, F.; Gupta, V.; Saini, R.; De, K.; Liwicki, M. Rethinking the Methods and Algorithms for Inner Speech Decoding and Making Them Reproducible. *NeuroSci* **2022**, *3*, 226–244. https://doi.org/10.3390/ neurosci3020017

Academic Editor: Szczepan Paszkiel

Received: 1 March 2022 Accepted: 12 April 2022 Published: 19 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

the effect of different preprocessing steps (ICA, down-sampling, and filtering), and (iv) achieving state-of-the-art performance on the six word classification task reporting a mean accuracy of 29.21% for all subjects on a publicly available dataset [8].

#### *State-of-the-Art Literature*

Research studies in inner speech decoding use data of invasive (e.g., Electrocorticography (ECOG) [9,10]) and non-invasive methods (e.g., Magnetoencephalography (MEG) [11,12], functional Magnetic Resonance Imaging (fMRI) [13], Functional Near-Infrared Spectroscopy (FNIRS) [14,15]) with EEG being the most dominate modality used so far [16]. Martin et al. [10] attempted to detect single words from inner speech using ECOG recordings from inner and outer speech. This study included six word pairs and achieved a binary classification accuracy of 58% using a Support Vector Machine (SVM). ECOG is not scalable as it is invasive but it advances our understanding and limit of decoding inner speech research. Recent methods used a CNN with the "MEG-as-an-image" [12] and "EEG-asraw-data" [7,17] inputs.

The focus of this paper is on inner speech decoding in terms of the classification of the words and vowels. Classified words can be useful in many scenarios of human–computer communication, e.g., in smart homes or health-care devices, were the human wants to give simple commands via brain signals in a natural way. For human-to-human communication, the ultimate goal of inner speech decoding (in terms of representation learning) is often to synthesize speech [18,19]. In this related area, [18] uses a minimal invasive method called stereotactic EEG (sEEG) with one subject and 100 Dutch words in an open-loop stage for training the decoding models and close-loop stage to evaluate in real time the imagined and whispered speech. The attempt, although not yet intelligible, provides a proof of concept for tackling the close-loop synthesis of imagined speech in real time. Ref. [19] uses MEG data from seven subjects, using, as stimuli, five phrases (1. Do you understand me, 2. That's perfect, 3. How are you, 4. Good-bye, and 5. I need help.), and two words (yes/no). They follow a subject-dependent approach, where they train and tune a different model per subject. Using a bidirectional long short-term memory recurrent neural network, they achieve a correlation score of the reconstructed speech envelope of 0.41 for phrases and 0.77 for words.

Ref. [15] reported an average classification accuracy of 70.45 ± 19.19% for a binary word classification task using Regularized Linear Discriminant Analysis (RLDA) using FNIRS data. The EEGNet [20] is a CNN-based deep learning architecture for EEG signal analysis that includes a series of 2D convolutional layers, average pooling layers, and batch normalization layers with activations. Finally, there is a fully connected layer at the end of the network to classify the learned representations from the preceding layers. The EEGNet serves as the backbone network in our model; however, the proposed model extends the EEGNet in similar manner to [7].

There are two main approaches when it comes to brain data analysis: subject dependent and subject independent (see Table 1). In the subject-dependent approach, the analysis is taken for each subject individually and performance is reported per subject. Representative studies in the subject-dependent approach are detailed in the following paragraph. Ref. [8] reported a mean recognition rate of 22.32% in classifying five vowels and 18.58% in classifying six words using a Random Forest (RF) algorithm with a subject-dependent approach. Using the data from six subjects, Ref. [21] reported an average accuracy of 50.1% ± 3.5% for the three-word classification problem and 66.2% ± 4.8% for a binary classification problem (long vs. short words), following a subject-dependent approach using a Multi-Class Relevance Vector Machine (MRVM). In [12], MEG data from inner and outer speech was used; an average accuracy of 93% for the inner speech and 96% for the outer speech decoding of five phrases in a subject-dependent approach using a CNN was reported. Recently, Ref. [22] reported an average accuracy of 29.7% for a four-word classification task on a publicly available dataset of inner speech [23]. In the subjectindependent approach, all subjects are taken into account and the performance is reported

using the data of all subjects; therefore the generated decoding model can generalize the new subjects' data. The following studies use a subject-independent approach. In [6], the authors reported an overall accuracy of 90% on the binary classification of vowels compared with consonants using Deep-Belief Networks (DBN) and the combination of all modalities (inner and outer speech), in a subject-independent approach. In [7], the authors used a CNN with transfer learning to analyze inner speech on the EEG dataset of [8]. In these experiments, the CNN was trained on the raw EEG data of all subjects but one. A subset of the remaining subject's data was used to finely tune the CNN and the rest of the data were used to test the CNN model. They authors reported an overall accuracy of 35.68% (five-fold cross-validation) for the five-vowel classification task.


**Table 1.** Overview of inner speech studies (2015–2021). TL: Transfer learning.

#### **2. Materials and Methods**

*2.1. Dataset and Experimental Protocol*

The current work used a publicly available EEG dataset as described in [8]. This dataset includes recordings from 15 subjects using their inner and outer speech to pronounce 5 vowels *(/a/, /e/, /i /, /o/, /u/)* and 6 words *(arriba/up, abajo/down, derecha/right, izquierda/left, adelante/forward, and atr a*´*s/backwards)*. A total of 3316 and 4025 imagined speech sample EEG recordings for vowels and words, respectively, are available in the dataset. An EEG with 6 electrodes was used in these recordings.

Figure 1 shows the experimental design followed in [8]. The experimental protocol consisted of a ready interval that was presented for 2 s, followed by the stimulus (vowel or word) presented for 2 s. The subjects were asked to use their inner or outer speech during the imagine interval to pronounce the stimulus. Finally, a rest interval of 4 s was presented, indicating that the subjects could move or blink their eyes before proceeding with the next stimulus. It is important to note that for the purpose of our study, only the inner speech part of the experiment was used.

**Figure 1.** Experimental protocol used in [8]: Ready interval followed by a textual representation of the stimulus (vowel or word). The inner speech production took place during the stimulus interval for 4 s.

#### *2.2. Methods*

The proposed framework uses a deep CNN to extract representations from the input EEG signals. Before applying the proposed CNN, the signals are preprocessed and then the CNN network is trained on the preprocessed signals.

Figure 2 depicts the flow of the proposed work. Separate networks are trained for vowels and words following the architecture depicted in Figure 2. The proposed network is inspired by Cooney et al. [7]; they performed filtering, downsampling, and artifact removal before applying the CNN; however, we have noticed that downsampling degrades the recognition performance, see Section 4. As a result, we did not downsample the signals in our experiments. The downsampling block is represented by a cross in Figure 2 to indicate that this task is not included in our proposed system in comparison with [7]. The current work reports results on 3 different experimental approaches using preprocessed data and raw data. The 3 different approaches are discussed in detail in Sections 3.1.1 and 3.1.2. More information about the preproccessing techniques can be found in Section 2.3.

**Figure 2.** The figure illustrates the proposed workflow. The preprocessed EEG signals with or without downsampling are used to train a CNN model for inner speech decoding.

#### *2.3. Preprocessing*

In the current work, we apply the following preprocessing steps:

**Filtering:** A frequency between 2 Hz and 40 Hz is used for filtering [8].

**Down-sampling:** The filtered data are down-sampled to 128 HZ. The original frequency of the data is 1024 Hz.

**Artifact removal:** Independent component analysis (ICA) is known as a blind-source separation technique. When recording a multi-channel signal, the advantages of employing ICA become most obvious. ICA facilitates the extraction of independent components from mixed signals by transforming a multivariate random signal. Here, ICA applied to identify components in EEG signal that include artifacts such as eye blinks or eye movements. These are components then filtered out before the data are translated back from the source space to the sensor space. ICA effectively removes noise from the EEG data and is, therefore, an aid to classification. Given the small number of channels, we intact all the channels and instead use ICA [25] for artifact removal (https://github.com/pierreablin/picard/blob/ master/matlab\_octave/picard.m, accessed on 27 February 2022).

Figure A3 (see Appendix C) depicts the preprocessed signal after applying ICA. This figure shows the vowel *a* for two subjects. From this figure, it can be noted that the subject's model is not discriminative enough as overlapping is observed. The response from all electrodes' behavior for all vowels for Subject-02 can be seen in Figure A4 (see Appendix C). From this figure, it can be seen that all electrodes are adding information as they all differ in their characteristics.

#### *2.4. iSpeech-CNN Architecture*

In this section, we introduce the proposed CNN-based iSpeech architecture. After extensive experiments on the existing CNN architecture for inner speech classification tasks, we determined that downsampling the signal has an effect on the accuracy of the classification and thus removed it from the proposed architecture. The iSpeech-CNN architecture for imagined vowel and word recognition is shown in Figure 3. The same architecture is used in training for imagined vowels and words separately. The only difference is that the network for vowels has five classes; therefore, the softmax layer outputs five probability scores; one for each vowel. In the same manner, the network for words has six classes; therefore, the softmax layer outputs six probability scores; one for each word. Unlike [7], after extensive experimentation, we observed that the number of filters has an effect on the overall performance of the system; 40 filters are used in the first four layers of both networks. The next three layers have 100, 250, and 500 filters, respectively; however, the filter sizes are different. Filters of sizes (1 × 5), (6 × 1), (1 × 5), (1 × 3), (1 × 3), (1 × 3), and (1 × 3) are used in the first, second, third, fourth, fifth, sixth, and seventh layers, respectively.

**Figure 3.** Proposed iSpeech-CNN architecture for imagined speech recognition based on the architecture described in [7]. This network is trained separately for vowels and words. Therefore, the difference lies in the last layer (softmax). The softmax layer for vowels have five outputs while for words has six outputs.

We used an Adam optimizer with a dropout of 0.0002 for the vowel classification and 0.0001 for the word classification. As the network is very small, dropping out more features will adversely affect the performance. The initial learning rate was fixed to 0.001 with a *piecewise* learning rate scheduler. Our network was trained for 60 epochs, and the best validation loss was chosen for the final network. The regularization was also fixed to a value of 0.001. Our proposed iSpeech-CNN architecture follows the same structure as [7] but with a different numbers of filters and training parameters and preprocessing.

#### **3. Experimental Approaches and Performance Measures**

This section describes the experimental approaches that have been utilized for the analysis of EEG data and the performance measures that quantify the obtained analysis.

#### *3.1. Experimental Approaches*

Three experimental approaches were used for analysis, and they are discussed in detail in the following subsections.

#### 3.1.1. Subject-Dependent/Within-Subject Approach

Subject-dependent/within-subject classification is a baseline approach that is commonly used for the analysis of inner speech signals. In this approach, individual models are trained corresponding to each subject and for each subject, a separate model is created. The training, validation, and testing sets all have data from the same subject. This approach essentially measures how much an individual subject's data changes (or varies) over time.

To divide the subject data into training, testing, and validation datasets, a ratio of 80-10-10 is used. The training, validation, and testing datasets contain all vowel/words category samples (five/six, respectively) in the mentioned ratio. To remove the bias towards the samples, five different trials are utilized. Furthermore, the mean accuracy and standard deviation are reported for all experimental approaches.

#### 3.1.2. Subject Independent: Leave-One-Out Approach

The subject-dependent approach does not show generalization capability as it models one subject at a time (Testing data contain only samples of the subject that is being modeled). The leave-one-out approach is an independent approach where data of each subject are tested using models that are trained using the data of all other subjects but one, i.e., *n* − 1 subjects out of total *n* will be used for the training model, and the rest will be used for testing. For example, Model-01 will be trained with data from subjects except Subject01, and will be tested with Subject01 (see Tables A3–A6).

This approach helps to obtain a deeper analysis when there are fewer subjects or entities and shows how each individual subject affects the overall estimate of the rest of the subjects. Hence, this approach may provide more generalizable remarks than subjectspecific models that depend on individual models.

#### 3.1.3. Mixed Approach

The mixed approach is a variation of subject-independent approach. Although leaveone-out is truly independent, we can see the mixed approach as less independent in nature as it includes data from all subjects in training, validation, and testing. As it contains the data of all subjects, we called it the mixed approach. This approach differs from the withinsubject and leave-one-out approaches, where *n* models correspond to the total number of subjects in the data, are trained. In this approach, only one model will be trained for all subjects. Testing contains samples of all the subjects under all categories (vowels/words).

To run this experiment, 80% of the samples of all the subjects are included in the training set, 10% in the validation set, and the remaining in the test set. We also ensure class balancing, i.e., each class will have approximately the same number of samples of all vowel/word categories. The same experiment is repeated for five random trials, and the mean accuracy along with the standard deviation is reported.

#### *3.2. Performance Measures*

The mean and standard deviation are used to report the performance of all the approaches. For the final results, the F-scores are also given.

**Mean:** The mean is the average of a group of scores. The scores are totaled and then divided by the number of scores. The mean is sensitive to extreme scores when the population samples are small.

**Standard deviation:** In statistics, the standard deviation (SD) is a widely used measure of variability. It depicts the degree of deviation from the average (mean). A low SD implies that the data points are close to the mean, whereas a high SD suggests that the data span a wide range of values.

**F-score:** The F-score is a measure of a model's accuracy that is calculated by combining the precision and recall of the model. It is calculated by the following formula:

$$\text{F-score} = \frac{2 \ast \text{Precision} \ast \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1}$$

where precision is the percentage of true positive examples among the positive examples classified by the model, and recall is the fraction of examples classified as positive, among the total number of positive examples.

#### **4. Results and Discussion: Vowels (Five Classes)**

The results estimated with the subject-specific approach are discussed first as this approach is common in most of the EEG-related papers. All code, raw data, and preprocessed data are provided on Github (https://github.com/LTU-Machine-Learning/Rethinking-Methods-Inner-Speech, accessed on 27 February 2022). Related approaches are discussed in later subsections.

#### *4.1. Subject-Dependent/Within-Subject Classification*

In this section, we report the results when applying the subject-dependent approach. Figures 4, 5 and A1 and Table A5 show the results of our proposed iSpeech-CNN architecture. Tables A1 and A2 show the results of the reference CNN architecture.

**Figure 4.** The impact of different sampling rates on vowel recognition performance of (iSpeech-CNN Architecture) with different filters in first three CNN layers. The bars indicate the standard error, sample size = 5. Theoretical chance accuracy = 20% (red-dotted line).

**Figure 5.** Subject dependent results for vowels without downsampling on preprocessed signals (iSpeech-CNN Architecture). Theoretical chance accuracy = 20% (red-dotted line).

#### 4.1.1. Ablation Study—Influence of Downsampling

Table A1 shows the results with raw and downsampled data when used within the referenced CNN architecture framework.

It is clearly observed from Table A1 that downsampling the signals results in a loss of information. Figure 4 shows that there is a significant performance increase between 32 and 1024; however, some other differences (e.g., for 40 filters between 128 and 1024) are not significant. For clarity, the bars for standard error for each data point are added. The highest vowel recognition performance (35.20%) is observed at the highest sampling rate (1024), i.e., without downsampling.

In other words, the chosen sampling rate was not sufficient to retain the original information; therefore, further results will be reported for both raw data and downsampled data, in order to obtain a better insight into the preprocessing (i.e., filtering and ICA) stage.

#### 4.1.2. Ablation Study—Influence of Preprocessing

Filtering and artifact removal plays an important role while analyzing the EEG signals. We applied both bandpass filtering (see Section 2), and picard (preconditioned ICA for real data) for artifact removal to obtain more informative signals. Table A2 shows the results of preprocessing when applied on the raw and downsampled data within the reference CNN architecture framework. The performance, i.e., the overall mean accuracy, decreased from 32.51% to 30.90%. The following points can be noted from Table A2: (1) Filtering and artifact removal highly influence the performance irrespective of raw and downsampled data. (2) The improved performance can also be observed with respect to each subject. A smaller standard deviation can also be seen. (3) The CNN framework generated higher performance than the handcrafted features and the GRU (see Table 2). We also performed experiments with the LSTM classifier and noticed the random behavior (theoretical chance accuracies); no significant difference as compared to GRU; therefore, iSpeech-CNN performs best among all classifiers.

**Table 2.** Average subject-dependent classification results on the [8] dataset.


#### 4.1.3. Ablation Study—Influence of Architecture

Based on the CNN literature in the EEG paradigm [7,26], adding more layers to the reference CNN architecture does not help to obtain an improved performance; however, by changing the number of filters in the initial layers, some improvements can be observed. Based on the CNN literature for EEG signals, having a sufficient number of filters in the initial layers helps to obtain some improvement [7,27]. Here, we choose three initial layers, unlike in natural images, in speech, initial layers are more specific to the task rather than the last few layers. The results with a changing number of filters in the initial layer within the iSpeech-CNN architecture are shown in Table A5. In the reference CNN architecture, this filter number was 20 for the initial three layers; however, we have changed this number to 40 (decided based on experimentation) in the iSpeech-CNN architecture. Table A5 clearly shows that changing the filter parameter yields higher performance than with the number of filters (compare to the reference architecture results in Tables A1 and A2 in the Appendix A). This improvement is observed with and without downsampled data and with respect to the subject (see Figures A1 and 5). The standard deviation also decreases with these modifications (see Table A5).

#### *4.2. Mixed Approach Results*

This section discusses the results of the mixed approach. In this approach, data from all subjects are included in training, validation, and testing. Table A4 shows the results for the mixed approach with and without downsampling. These results were compiled with filtering and ICA in both reference and modified CNN architectures.

From these results, it is noted that the obtained accuracies are random in nature. The modified CNN architecture parameters do not help to obtain any improvements and show random accuracy behavior. In other words, it is difficult to achieve generalized performance with EEG signals. Based on the EEG literature, it has also been justified that models trained on data from one subject cannot be generalized to other subjects even though have been recorded using the same setup conditions.

Determining the optimal frequency sub-bands corresponding to each subject could be one possible direction that may be successful in such a scenario. We intent to explore such a direction in our future work.

#### *4.3. Subject-Independent: Leave-One-Out Results*

Having discussed the subject-specific and mixed results, in this section, the subjectindependent results are discussed. The leave-one-out approach is a variation of the mixed approach; however, unlike the mixed approach, here, the data of the testing subject are not included in the training. For example, in Figure 6, except *Subject*01, all other subjects were used in the training of *Model*-01. Figure 6 and Table A6 show the results using the iSpeech-CNN architecture, while Table A3 shows the results using the reference architecture.

**Figure 6.** Leave-one-out results for vowels with and without downsampling on preprocessed signals (iSpeech-CNN Architecture). Theoretical chance accuracy = 20% (red-dotted line).

It can be noted that having fewer subjects in training (one less as compared to the mixed approach), shows slightly better behavior than the mixed approach, where all subjects were included in the training. Moreover, changing the reference CNN parameters to our proposed iSpeech-CNN architecture also shows improved performance (see Figure 6 and Table A6).

The mixed and leave-one-out approaches both showed that generalizing the performance over all subjects is difficult in the EEG scenario. Hence, there is a need for the preprocessing stage, which can make the data more discriminative.

#### **5. Results and Discussion: Words (Six Classes)**

Having discussed all the approaches for the category of vowels, we noticed that only the subject-specific approach showed performance that was not random in nature and hence makes sense; therefore, in this section, we only report results corresponding to the subject-specific approach for the word category.

This category contains six different classes (see Section 2.1). Table A8 and Figures A2 and 7 show the performance results for the classification of the six words, using the proposed iSpeech-CNN architecture. The performance results when using the reference architecture can be found in Appendix B. From these tables and figures, the same kind of behavior as vowels is observed. The change in the number of filters in the initial layers affected the performance as shown in Table A8. The downsampling of data also affects the overall performance. Figure 8 shows that the highest word recognition performance (29.12%) is observed at highest sampling rate (1024), i.e., without downsampling. For clarity, we added the bars for standard error for each data point. As opposed to vowel recognition, there is a steady increase in the performance when increasing the sampling rate (though again, not always significant among two neighboring values).

The iSpeech-CNN architecture shows better performance than handcrafted features such as real-time wavelet energy [8] and reference architecture (Appendix B).

Overall, we achieve a state-of-the-art performance of 29.21% when classifying the six words using our proposed iSpeech-CNN architecture and preprocessing methodology without downsampling.

The performance reported in this work is based on the CNN architecture of the reference network [7]. No other architecture was investigated. This is due to the reason that the goal of the proposed work is to reproduce the Cooney's results and making the network and codes available to the research community.

**Figure 7.** Subject-dependent results for words without downsampling on preprocessed signals (iSpeech-CNN Architecture). Theoretical chance accuracy = 16.66% (red-dotted line).

**Figure 8.** The impact of different sampling rates on word recognition performance of (iSpeech-CNN Architecture) with different filters in first three CNN layers. Performance increases with higher sampling rates. The bars indicate the standard error, sample size = 5. Theoretical chance accuracy = 16.66% (red-dotted line).

#### **6. Performance Comparison and Related Discussion**

In this section, we compare our results on the vowels and words dataset with existing work and discuss related findings. Based on the reported performances in the Table 2, it is clearly noted that the CNN performs better than the handcrafted features for both datasets.

The precision, weighted F-score, and F-score for our proposed iSpeech-CNN in comparison with the reported results of Cooney et al. [7] are shown in Table 3. From this table, we can note that our proposed system results in a higher precision; however, a lower F-score compared to the model in [7]. Hence, the reproducibility of the results reported in [7] is difficult.


**Table 3.** Precision and F-score (with respect to Tables A5–A8) for vowel and word classification (iSpeech-CNN Architecture).

Our proposed CNN architecture and preprocessing methodology outperform the existing work in word and vowel category when following subject-dependent approach, as shown in Table 2; however, it is worth to mention that for the vowel classification, unlike in [7], we do not downsampling the data. Furthermore, [7] when using transfer learning approach for the vowel classification task, they report an overall accuracy of 35.68%, which is slightly higher than our reported accuracy in the subject-dependent approach.

Based on the 1-tail paired *t*-test results, we found that there is statistical significant difference between iSpeech-CNN and the reference paper [7] for word classification and for vowel classification, if we compare to the work without transfer learning (which is the fair comparison, as transfer learning adds a new dimension). We also found that there is no significant difference between the best reported results with transfer learning [7,24] and iSpeech-CNN. Furthermore, when we run the 1-tail paired *t*-test results for iSpeech-CNN between downsampling and without downsampling, we found that these difference are significantly different for the words task (*p* = 0.0005), but not statistically significant for the vowels task. We are following 1-tail paired *t*-test and used 10% of the overall samples, i.e., 332 for vowels and 403 for words.

Hence, it is observed that the correct selection of preprocessing methods and the number of filters in the CNN, greatly add to the performance. The elaborated results for each category and with each approach have been added to Appendices A and B.

#### **7. Conclusions**

This study explores the effectiveness of preprocessing steps and the correct selection of filters in the initial layers of the CNN in the context of both vowel and word classification. The classification results are reported on a publicly available inner speech dataset of five vowels and six words [8]. Based on the obtained accuracies, it is found that such a direction of exploration truly adds to the performance. We report state-of-the-art classification performance for vowels and words with mean accuracies of 35.20% and 29.21%, respectively, without downsampling the original data. Mean accuracies of 34.88% and 27.38% have been reported for vowels and words, respectively, with downsampling. Furthermore, the proposed CNN code in this study is available to the public to ensure reproducibility of the research results and to promote open research. Our proposed iSpeech-CNN architecture and preprocessing methodology are the same for both datasets (vowels and words).

Evaluating our system in other publicly available datasets is part of our future work. Furthermore, we will address the issues related to the selection of the downsampling rate and the selection of the optimal frequency sub-band with respect to subjects.

**Author Contributions:** Conceptualization, F.S.L.; methodology, F.S.L. and V.G.; software, F.S.L. and V.G.; validation, F.S.L., V.G., R.S. and K.D.; formal analysis, F.S.L., V.G., R.S. and K.D.; investigation, F.S.L., V.G., R.S. and K.D.; writing—original draft preparation, F.S.L., V.G., R.S., K.D. and M.L.; writing—review and editing, F.S.L., V.G., R.S., K.D. and M.L.; visualization, F.S.L., V.G., R.S., and K.D.; supervision, F.S.L. and M.L.; project administration, F.S.L. and M.L.; funding acquisition, F.S.L. and M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received funding by the Grants for excellent research projects proposals of SRT.ai 2022.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The code used in this paper is publicly accessible on Github (https: //github.com/LTU-Machine-Learning/Rethinking-Methods-Inner-Speech, accessed on 27 February 2022).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. The Results on Vowels**

**Table A1.** Subject-dependent results for the vowels using raw and downsampled data (Reference Architecture).



**Table A2.** Subject-dependent results for vowels with and without downsampling on preprocessed data (Reference Architecture).

**Table A3.** Leave-one-out results for vowels with and without downsampling on preprocessed data (Reference Architecture). Subject in test set.



**Table A4.** Mixed-approach results for vowels with and without downsampling on preprocessed data (Filtering and Artifact Removal) (Reference Architecture; iSpeech-CNN Architecture).


**Table A5.** Subject-dependent results for vowels with and without downsampling on preprocessed signals (iSpeech-CNN Architecture).



**Table A6.** Leave-one-out results for vowels with and without downsampling on preprocessed signals (iSpeech-CNN Architecture).

**Figure A1.** Subject-dependent results for vowels with downsampling on preprocessed signals (iSpeech-CNN Architecture). Chance accuracy 20%.

#### **Appendix B. The Results on Words**

**Table A7.** Subject-dependent results for words with and without downsampling on preprocessed data (Reference Architecture).


**Table A8.** Subject-dependent results for words with and without downsampling on preprocessed signals (iSpeech-CNN Architecture).


**Figure A2.** Subject-dependent results for words with downsampling on preprocessed signals (iSpeech-CNN Architecture). Chance accuracy 16.66%.

**Figure A3.** Example of preprocessed signals for all electrodes (after ICA) for the vowel /a/ for Subject01 and Subject02.

**Figure A4.** Example of preprocessed signals (after ICA) for all vowels and all electrodes for Subject02.

#### **References**

