Article

Exploring Inner Speech Recognition via Cross-Perception Approach in EEG and fMRI

by Jiahao Qin 1,2, Lu Zong 2 and Feng Liu 3,*

1 Faculty of Science and Engineering, University of Liverpool, Liverpool L69 3BX, UK
2 Department of Financial and Actuarial Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China
3 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7720; https://doi.org/10.3390/app14177720
Submission received: 9 August 2024 / Revised: 29 August 2024 / Accepted: 30 August 2024 / Published: 1 September 2024
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

Abstract

Multimodal brain signal analysis has shown great potential in decoding complex cognitive processes, particularly in the challenging task of inner speech recognition. This paper introduces an innovative Inner Speech Recognition via Cross-Perception (ISRCP) approach that significantly enhances accuracy by fusing electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) data. Our approach comprises three core components: (1) multigranularity encoders that separately process EEG time series, EEG Markov Transition Fields, and fMRI spatial data; (2) a cross-perception expert structure that learns both modality-specific and shared representations; and (3) an attention-based adaptive fusion strategy that dynamically adjusts the contributions of different modalities based on task relevance. Extensive experiments on the Bimodal Dataset on Inner Speech demonstrate that our model outperforms existing methods in both accuracy and F1 score.

1. Introduction

Inner speech, often described as the silent voice in our minds, plays a crucial role in cognitive processes such as problem solving, memory formation, and self-regulation [1]. The ability to recognize and decode inner speech using brain–computer interfaces (BCIs) holds immense potential for assistive technologies, particularly for individuals with severe motor disabilities [2]. However, accurately decoding inner speech from neural signals remains a significant challenge due to its subtle and complex nature.
Recent advances in neuroimaging techniques have opened new avenues for studying inner speech. Electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) are two prevalent modalities used in this field, each offering unique insights into brain activity [3]. EEG provides high temporal resolution, capturing rapid changes in neural activity, while fMRI offers superior spatial resolution, allowing for precise localization of brain activations [4].
Despite these advancements, most existing studies on inner speech recognition have focused on unimodal approaches, utilizing either EEG or fMRI data independently [5,6]. However, inner speech activity is inherently multimodal, and relying on a single modality imposes clear limitations. Generic time series feature extraction methods may not effectively capture the non-stationary and weak nature of EEG signals during inner speech, because inner speech-related neural activity is subtle and transient and can easily be overshadowed by background brain activity and noise [7,8,9,10]. Likewise, traditional image recognition techniques may not be optimal for extracting relevant features from fMRI data for inner speech tasks: fMRI data have unique spatiotemporal characteristics and a lower temporal resolution than natural images, requiring specialized processing techniques [11,12,13,14].
Recognizing that multimodal data provide complementary and mutually reinforcing representations, many researchers have applied multimodal data to biological signal pattern recognition and built models with stronger performance. For example, Gong et al. [15] combined EEG and fMRI data to improve the classification of emotional states, achieving higher accuracy than unimodal approaches. Su et al. [16] integrated EEG and fNIRS (functional near-infrared spectroscopy) for motor imagery classification, demonstrating superior performance over single-modality methods. In the context of speech processing, Passos et al. [17] fused EEG and audio features for robust speech recognition in noisy environments. However, existing fusion methods may not fully leverage the complementary information provided by EEG and fMRI in the specific context of inner speech recognition, because the temporal dynamics of EEG and the spatial patterns of fMRI require careful alignment and integration to capture the full complexity of inner speech processes [18,19,20].
The recent release of the Bimodal dataset on Inner Speech [21] provides a unique opportunity to explore multimodal approaches to inner speech recognition. This dataset combines EEG and fMRI recordings from participants performing an inner speech task, offering a rich resource for developing and evaluating multimodal decoding methods.
In this paper, we propose a novel cross-perception model for multimodal inner speech recognition that leverages both EEG and fMRI data. Our proposed approach is shown in Figure 1.
By leveraging the strengths of both EEG and fMRI, our cross-perception model aims to achieve more accurate and robust inner speech recognition compared to unimodal approaches. We evaluate our model on the Bimodal dataset on Inner Speech, demonstrating its effectiveness in decoding inner speech across different semantic categories. The main contributions of this work are as follows:
  • We propose a novel cross-perception model that effectively integrates EEG and fMRI data for inner speech recognition.
  • We introduce a multigranularity encoding scheme that captures both temporal and spatial aspects of brain activity during inner speech.
  • We develop an adaptive fusion mechanism that dynamically weights the contributions of different modalities based on their relevance to the recognition task.
  • We provide extensive experimental results and analyses, demonstrating the superiority of our multimodal approach over unimodal baselines.
The rest of the paper is organized as follows: Section 2 reviews related work in inner speech recognition, including unimodal and multimodal approaches, and it discusses the limitations of existing methods. Section 3 introduces our proposed targeted improvements, including enhancements for EEG and fMRI data processing, multimodal fusion strategy, and cross-modal contrastive learning. Section 4 describes the experimental setup, including dataset details, preprocessing steps, and baseline methods. Section 5 presents the results of our experiments, including the main results, ablation studies, cross-participant generalization, and an extended study on multimodal sentiment analysis. Finally, Section 6 discusses the implications of our findings and concludes the paper.

2. Related Work

Inner speech recognition using neuroimaging data has gained significant attention in recent years due to its potential applications in brain–computer interfaces and cognitive neuroscience. This section reviews relevant literature, focusing on both unimodal and multimodal approaches to inner speech recognition.

2.1. Unimodal Approaches

Electroencephalography (EEG) has been widely used for inner speech recognition due to its high temporal resolution. Traditional machine learning methods have shown promising results in this domain. Nguyen et al. [8] employed Support Vector Machines (SVMs) with Riemannian manifold features, achieving an accuracy of 68% in a four-class inner speech classification task. Cooney et al. [5] utilized Random Forests on time–frequency features, reporting an accuracy of 71% for a similar task. Lopes da Silva [10] applied Linear Discriminant Analysis (LDA) to Common Spatial Pattern (CSP) features, obtaining an accuracy of 73% in distinguishing between two inner speech classes. These studies demonstrate the potential of EEG for inner speech recognition, but they also highlight the challenges in achieving high accuracy due to the subtle nature of inner speech signals.
Functional Magnetic Resonance Imaging (fMRI) offers superior spatial resolution for inner speech recognition. Miyawaki et al. [22] used multivoxel pattern analysis (MVPA) to decode visual imagery, achieving an accuracy of 78% in reconstructing visual patterns. Cetron et al. [23] employed Independent Component Analysis (ICA) combined with machine learning classifiers, reporting an accuracy of 82% in distinguishing between different semantic categories during inner speech. Sligte et al. [24] utilized Representational Similarity Analysis (RSA) to investigate the neural representations of inner speech, demonstrating significant correlations between predicted and observed neural patterns. These fMRI-based studies provide valuable insights into the spatial patterns of brain activity during inner speech, but are limited by the low temporal resolution of fMRI.

2.2. Bimodal Approaches

Recent years have seen a growing interest in multimodal approaches that combine EEG and fMRI data for inner speech recognition. Herff et al. [25] proposed a hybrid EEG-fMRI model using convolutional neural networks (CNNs) for feature extraction and fusion, achieving an accuracy of 85% in a four-class inner speech task. Gao et al. [26] introduced a multimodal attention mechanism to dynamically weight EEG and fMRI features, reporting an accuracy of 87% on a similar task. Aggarwal et al. [27] employed a graph convolutional network to model both temporal (EEG) and spatial (fMRI) dependencies, achieving an accuracy of 89% in distinguishing between different inner speech categories. These multimodal approaches consistently outperform unimodal methods, demonstrating the complementary nature of EEG and fMRI data for inner speech recognition. The temporal precision of EEG, combined with the spatial resolution of fMRI, provides a more comprehensive view of the neural processes underlying inner speech.

2.3. Limitations of Existing Approaches

While bimodal approaches combining EEG and fMRI data have shown promise in inner speech recognition, they often fall short of fully leveraging the unique characteristics of these neuroimaging modalities. Current methods typically rely on generic fusion techniques borrowed from computer science, which may not be optimally adapted to the specific challenges of inner speech recognition. Standard time series feature extraction methods often struggle to effectively capture the non-stationary and subtle nature of EEG signals during inner speech. Similarly, conventional image recognition techniques may not be ideal for extracting the most relevant features from fMRI data in the context of inner speech tasks. Moreover, existing fusion methods may not fully exploit the complementary information provided by EEG and fMRI, particularly in the nuanced domain of inner speech recognition.
These limitations indicate a clear opportunity for advancement in multimodal approaches to inner speech recognition. Specifically, there is a need for methods tailored to the unique characteristics of EEG and fMRI data in this context. Our proposed method addresses this gap by introducing a novel trimodal approach. By incorporating EEG raw data, EEG Markov Transition Fields, and fMRI spatial data, we offer a more comprehensive and nuanced view of inner speech processes. This approach not only builds upon the potential demonstrated by existing bimodal methods but also extends it significantly, paving the way for more accurate and robust inner speech recognition.

3. Proposed Targeted Improvements

Our proposed method addresses the limitations of existing approaches by introducing targeted improvements for both EEG and fMRI data processing in the context of inner speech recognition. These improvements are designed to better capture the unique characteristics of each modality while enhancing their complementary nature.

3.1. EEG Signal Processing Enhancements

Singular Spectrum Analysis (SSA) for EEG Decomposition

To address the non-stationary nature of EEG signals during inner speech, we employ Singular Spectrum Analysis (SSA). SSA decomposes the original EEG signal into interpretable components, allowing for more effective feature extraction.
Given an EEG signal $X = (x_1, \dots, x_N)$, we first construct the trajectory matrix
$$
Y = \begin{pmatrix}
x_1 & x_2 & \cdots & x_K \\
x_2 & x_3 & \cdots & x_{K+1} \\
\vdots & \vdots & \ddots & \vdots \\
x_L & x_{L+1} & \cdots & x_N
\end{pmatrix},
$$
where $L$ is the window length and $K = N - L + 1$.
We then perform Singular Value Decomposition (SVD) on $Y$:
$$
Y = U S V^{T}.
$$
The eigentriples $(U_i, \sqrt{\lambda_i}, V_i)$ are used to reconstruct the components via diagonal averaging:
$$
\tilde{X}^{(i)}_j = \frac{1}{|A_j|} \sum_{(l,k) \in A_j} \sqrt{\lambda_i}\, u^{(i)}_l v^{(i)}_k, \qquad j = 1, \dots, N,
$$
where $A_j = \{(l,k) : l + k = j + 1\}$.
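As a concrete reference, the decomposition above can be prototyped in a few lines of NumPy. The sketch below is illustrative rather than the exact implementation used in our experiments; the window length and number of retained components are placeholder values.

```python
import numpy as np

def ssa_decompose(x, L, n_components=None):
    """Basic Singular Spectrum Analysis of a 1-D EEG channel.

    Builds the L x K trajectory matrix, takes its SVD, and reconstructs each
    rank-one term by averaging over the anti-diagonal index sets A_j,
    mirroring Equations (1)-(3).
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - L + 1
    Y = np.column_stack([x[k:k + L] for k in range(K)])   # trajectory (Hankel) matrix
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)       # s[i] = sqrt(lambda_i)
    r = n_components if n_components is not None else len(s)
    comps = []
    for i in range(r):
        Yi = s[i] * np.outer(U[:, i], Vt[i])               # elementary matrix sqrt(lambda_i) * u_i v_i^T
        # Diagonal averaging: element j is the mean over {(l, k) : l + k = j} (0-based indices)
        comps.append(np.array([np.diag(Yi[:, ::-1], K - 1 - j).mean() for j in range(N)]))
    return np.vstack(comps)                                # shape (r, N); full-rank sum recovers x

# Illustrative usage on synthetic data
components = ssa_decompose(np.random.randn(1000), L=125, n_components=10)
```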
To capture the temporal dynamics of EEG signals, we transform them into image representations. While several time series image encoding methods exist, such as Recurrence Plots (RPs) and Gramian Angular Fields (GASFs and GADFs), we chose Markov Transition Fields (MTFs) for their unique advantages in the context of inner speech recognition.
Recurrence Plots, while effective for visualizing repeating patterns in time series data, may not adequately capture the subtle, non-repetitive nature of inner speech-related EEG signals. Gramian Angular Fields (GASFs and GADFs) preserve temporal correlations but can be sensitive to noise and may not effectively represent the state transitions crucial for distinguishing different inner speech categories. In contrast, MTFs offer several advantages for our task. They capture both the temporal dynamics and state transition probabilities of EEG signals, which are crucial for representing the complex patterns associated with inner speech. MTFs preserve more temporal information compared to GASFs and GADFs, allowing for better representation of the subtle temporal patterns in inner speech EEG data. The probabilistic nature of MTFs makes it more robust to the noise and non-stationarity often present in EEG signals during cognitive tasks.
We transformed the EEG signals into Markov Transition Field (MTF) images, which allows us to leverage powerful image processing techniques for feature extraction. Given a quantized EEG signal $q = (q_1, q_2, \dots, q_N)$, we construct the transition probability matrix $W$ with entries
$$
w_{ij} = \frac{\#\{(q_s, q_{s+1}) \mid q_s = i,\ q_{s+1} = j\}}{\#\{q_s \mid q_s = i\}}.
$$
The MTF is then defined as
$$
M = \big[\, w_{q_i q_j} \,\big]_{i,j = 1, 2, \dots, N}.
$$
Figure 2 illustrates the process of transforming EEG signals into MTF images. This transformation allows us to capture the temporal dynamics of EEG signals in a format that can be effectively processed by convolutional neural networks, enabling more robust feature extraction for inner speech recognition.
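For illustration, a minimal MTF computation following Equations (4) and (5) can be written as below. The bin count and the use of quantile binning are assumptions made for this sketch, not the exact quantization used in our pipeline.

```python
import numpy as np

def markov_transition_field(x, n_bins=8):
    """Markov Transition Field of a 1-D signal.

    Quantizes the signal into n_bins states, estimates the first-order
    transition matrix W (Equation (4)), and expands it over all time-index
    pairs to form the N x N field M (Equation (5)).
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # quantile bin edges
    q = np.digitize(x, edges)                                    # state of each sample, 0..n_bins-1
    W = np.zeros((n_bins, n_bins))
    for s in range(N - 1):                                       # count transitions q_s -> q_{s+1}
        W[q[s], q[s + 1]] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)             # row-normalize to probabilities
    return W[q[:, None], q[None, :]]                             # M[i, j] = w_{q_i q_j}

# Illustrative usage: one MTF image per (SSA-cleaned) EEG channel
mtf_image = markov_transition_field(np.random.randn(512), n_bins=8)
```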

3.2. fMRI Data Processing Enhancements

The data collection process involved recording inner speech activity at fixed intervals, with participants instructed to maintain silence. This experimental design allowed us to precisely identify the fMRI states corresponding to inner speech activities along the temporal axis. We performed temporal segmentation on the entire experimental recording, enabling us to extract fMRI images and EEG signals that correspond to the same inner speech activities. This temporal correspondence was crucial for our multimodal approach. Inner speech processes involve complex spatial patterns of activation across multiple brain regions. The use of dilated convolutions enables the capture of multiscale features, which is crucial for distinguishing between different inner speech categories. This is particularly important given the subtle differences in brain activation patterns that may exist between various inner speech tasks.
Therefore, we proposed a specialized 3D convolutional network to extract features from fMRI data that are particularly relevant for inner speech recognition. The 3D convolutional layers of our network allow it to learn hierarchical spatial features that correspond to these distributed activation patterns. While fMRI has a lower temporal resolution compared to EEG, our 3D convolutional network can still capture some temporal dynamics within fMRI data, complementing the high temporal resolution information from the EEG modality. The network architecture is designed to capture both local and global spatial patterns in brain activation data. The network consists of L convolutional layers, each followed by batch normalization and ReLU activation. The lth layer is defined as follows:
$$
H^{(l)} = \mathrm{ReLU}\big(\mathrm{BN}\big(W^{(l)} * H^{(l-1)} + b^{(l)}\big)\big),
$$
where $H^{(l)}$ is the output of the $l$th layer; $W^{(l)}$ and $b^{(l)}$ are the weights and biases, respectively; and $*$ denotes the 3D convolution operation.
To capture multiscale features, we employed dilated convolutions with increasing dilation rates:
$$
(F *_{d} k)(p) = \sum_{s + d\,t = p} F(s)\, k(t),
$$
where $d$ is the dilation rate, and $k$ is the kernel. This specialized network allows us to extract rich spatial features from fMRI data, capturing the complex patterns of brain activity associated with inner speech processes.
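A compact PyTorch sketch of such an encoder is given below. The number of blocks, channel widths, dilation schedule, pooling, and output dimension are illustrative assumptions rather than our exact architecture; each block follows the Conv3d → BatchNorm → ReLU pattern of Equation (6), with the dilation growing across blocks as in Equation (7).

```python
import torch
import torch.nn as nn

class FMRIEncoder(nn.Module):
    """Sketch of a 3D convolutional fMRI encoder with increasing dilation rates."""
    def __init__(self, in_channels=1, feat_dim=128):
        super().__init__()
        dilations = [1, 2, 4]          # growing receptive field for multiscale spatial context
        channels = [16, 32, 64]
        layers, c_in = [], in_channels
        for c_out, d in zip(channels, dilations):
            layers += [
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(2),
            ]
            c_in = c_out
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(c_in, feat_dim))

    def forward(self, x):              # x: (batch, 1, D, H, W) fMRI volume
        return self.head(self.backbone(x))

# Illustrative usage: encode a batch of volumes into 128-d feature vectors (shapes assumed)
features = FMRIEncoder()(torch.randn(2, 1, 64, 64, 32))
```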

3.3. Multimodal Fusion Strategy

We proposed an attention-based fusion mechanism to dynamically weigh the contributions of EEG and fMRI features. Given the EEG features $E$ and fMRI features $F$, we compute attention weights $\alpha$:
$$
\alpha = \mathrm{softmax}\big(W_a [E; F] + b_a\big).
$$
The fused features are then obtained as follows:
$$
Z = \alpha_E E + \alpha_F F.
$$
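A minimal PyTorch sketch of this fusion step, under the assumption that both modality features are first projected to a common dimension, is shown below; the projection layers and dimensions are illustrative additions, while the attention weights follow Equations (8) and (9).

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-based adaptive fusion of EEG and fMRI features."""
    def __init__(self, dim_eeg, dim_fmri, dim_fused):
        super().__init__()
        self.proj_eeg = nn.Linear(dim_eeg, dim_fused)      # assumed projection to a shared space
        self.proj_fmri = nn.Linear(dim_fmri, dim_fused)
        self.attn = nn.Linear(dim_eeg + dim_fmri, 2)       # W_a [E; F] + b_a -> two modality scores

    def forward(self, e, f):                               # e: (B, dim_eeg), f: (B, dim_fmri)
        alpha = torch.softmax(self.attn(torch.cat([e, f], dim=-1)), dim=-1)
        z = alpha[:, 0:1] * self.proj_eeg(e) + alpha[:, 1:2] * self.proj_fmri(f)
        return z, alpha                                    # fused features and modality weights

# Illustrative usage with assumed dimensions
fusion = AttentionFusion(dim_eeg=128, dim_fmri=128, dim_fused=128)
z, alpha = fusion(torch.randn(4, 128), torch.randn(4, 128))
```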

3.4. Cross-Modal Contrastive Learning

To enhance the alignment between EEG and fMRI modalities, we introduced a cross-modal contrastive loss. For a pair of corresponding EEG and fMRI features $(e_i, f_i)$, the contrastive loss is defined as
$$
\mathcal{L}_{\mathrm{cont}} = -\log \frac{\exp\big(\mathrm{sim}(e_i, f_i)/\tau\big)}{\sum_{j \neq i} \exp\big(\mathrm{sim}(e_i, f_j)/\tau\big)},
$$
where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, and $\tau$ is a temperature parameter.
The final loss function combines the classification loss and the contrastive loss:
$$
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \mathcal{L}_{\mathrm{cont}},
$$
where $\lambda$ is a hyperparameter balancing the two loss terms.
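The two losses can be sketched in PyTorch as follows. The temperature and weighting values are illustrative, and this version keeps the positive pair in the softmax denominator, a common InfoNCE-style variant of Equation (10), rather than claiming to reproduce our exact implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(eeg_feats, fmri_feats, tau=0.07):
    """Sketch of the cross-modal contrastive loss over a batch of paired features."""
    e = F.normalize(eeg_feats, dim=-1)                 # unit vectors -> dot product = cosine similarity
    f = F.normalize(fmri_feats, dim=-1)
    logits = e @ f.t() / tau                           # (B, B) matrix of sim(e_i, f_j) / tau
    targets = torch.arange(e.size(0), device=e.device) # positive for row i is column i
    return F.cross_entropy(logits, targets)            # -log softmax of the matching pair

def total_loss(logits_cls, labels, eeg_feats, fmri_feats, lam=0.1):
    """Combined objective L = L_cls + lambda * L_cont (Equation (11)); lambda is illustrative."""
    return F.cross_entropy(logits_cls, labels) + lam * cross_modal_contrastive_loss(eeg_feats, fmri_feats)
```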
These targeted improvements address the specific challenges of EEG and fMRI data in the context of inner speech recognition, providing a scientifically grounded approach to enhancing the performance of multimodal methods in this domain.
To better understand the characteristics of the Bimodal Dataset on Inner Speech, we visualized both the fMRI and EEG data for different inner speech categories. These visualizations provide insights into the spatial and temporal patterns of brain activity during inner speech tasks.
Figure 3 and Figure 4 show visualizations of the EEG data for eight and two classes, respectively. These figures demonstrate the rich temporal information captured by EEG, complementing the spatial information provided by fMRI.
Figure 5 and Figure 6 present visualizations of the fMRI data for eight and two classes, respectively. These figures reveal distinct spatial patterns of brain activity for different inner speech categories, highlighting the potential of fMRI in capturing the neural representations associated with various inner speech tasks. The distinct patterns observed in both EEG and fMRI data underscore the potential of our multimodal approach in capturing complementary aspects of inner speech processes.

3.5. Theoretical Framework

Our proposed Cross-Perception Model for multimodal inner speech recognition is built upon the following theoretical framework:
Multigranularity representation: We posit that inner speech processes manifest across multiple granularities in both EEG and fMRI data. Our model captures these through the following: (a) Raw EEG signals (fine-grained temporal information). (b) EEG Markov Transition Fields (state transition patterns). (c) fMRI spatial data (high-resolution spatial information).
Cross-modal complementarity: We theorize that EEG and fMRI provide complementary information about inner speech processes. EEG captures rapid temporal dynamics, while fMRI provides detailed spatial localization of brain activity.
Adaptive fusion: We propose that the relevance of each modality may vary depending on the specific inner speech task or individual. Our model dynamically adjusts the contribution of each modality through an attention-based mechanism.
Hierarchical feature learning: Our model employs a hierarchical structure to learn both modality-specific and shared representations, allowing for the capture of both unique and common aspects of inner speech across modalities.
Contrastive learning: To enhance the alignment between modalities, we incorporate a contrastive learning objective, encouraging the model to learn representations that are consistent across EEG and fMRI data for the same inner speech instance.
This theoretical framework guided the design of our Cross-Perception Model, ensuring that it effectively leverages the strengths of each modality while addressing the unique challenges of multimodal inner speech recognition.

4. Experiment Setup

Our experiments aim to evaluate the effectiveness of the proposed Cross-Perception Model for multimodal inner speech recognition. We used the Bimodal Dataset on Inner Speech [21] for all evaluations. Table 1 summarizes the key aspects of our experimental setup.
Participants were instructed to mentally repeat the presented word. This design allows for the investigation of semantic category-specific neural patterns during inner speech production. We used a 5-fold cross-validation strategy for all experiments to ensure robust evaluation of our model’s performance.
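The evaluation loop can be organized as in the sketch below; `build_and_train` and `evaluate` are hypothetical placeholders for the model training and scoring code, and the stratified splitting is an assumption about how the folds were drawn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_five_fold(eeg_data, fmri_data, labels, seed=42):
    """Sketch of the 5-fold cross-validation protocol."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in skf.split(eeg_data, labels):
        model = build_and_train(eeg_data[train_idx], fmri_data[train_idx], labels[train_idx])
        fold_scores.append(evaluate(model, eeg_data[test_idx], fmri_data[test_idx], labels[test_idx]))
    return np.mean(fold_scores), np.std(fold_scores)   # mean and spread over folds, as reported in the tables
```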
The combination of EEG and fMRI data in this dataset provides a unique opportunity to explore the complementary nature of these modalities in capturing the neural correlates of inner speech. By leveraging both the high temporal resolution of EEG and the superior spatial resolution of fMRI, our Cross-Perception Model aims to achieve more accurate and robust inner speech recognition compared to unimodal approaches.

4.1. Baseline Methods

We compared our proposed method with several baseline approaches:

4.1.1. Unimodal Methods

  • EEG-SVM: Support Vector Machine classifier using time–frequency features from EEG data.
  • EEG-RF: Random Forest classifier using wavelet coefficients from EEG data.
  • fMRI-MVPA: Multivoxel Pattern Analysis using a linear SVM on fMRI data.
  • fMRI-3DCNN: 3D Convolutional Neural Network on fMRI data.

4.1.2. Existing Multimodal Methods

  • EEG-fMRI-Concat: Simple concatenation of EEG and fMRI features with an SVM classifier.
  • EEG-fMRI-CCA: Canonical Correlation Analysis for feature fusion of EEG and fMRI data.
  • MM-CNN: Multimodal Convolutional Neural Network for EEG and fMRI fusion.

5. Results

5.1. Main Results

Table 2 and Table 3 present the performance of different methods on the social and numeric category classification tasks, respectively. Our proposed Cross-Perception Model consistently outperformed both unimodal and existing multimodal approaches across all evaluation metrics for both classification tasks.
In terms of unimodal performance, fMRI-based methods generally outperformed EEG-based methods, likely due to the higher spatial resolution of the fMRI data. However, both modalities struggled to achieve high accuracy when used independently, highlighting the challenge of inner speech recognition from a single modality.
The combination of EEG and fMRI data led to significant improvements over unimodal approaches. This underscores the complementary nature of the temporal precision of EEG and the spatial resolution of fMRI in capturing inner speech processes. Among existing multimodal methods, MM-CNN showed the best performance, demonstrating the advantage of using deep learning for feature extraction and fusion in multimodal scenarios.
Our Cross-Perception Model achieved further improvements over existing multimodal methods by incorporating tailored EEG and fMRI processing techniques. For the social category task, we observed a 2.5% increase in accuracy and a 0.03 increase in F1-score compared to the best-performing baseline (MM-CNN). Similar improvements were seen for the numeric category task.
The superior performance of our Cross-Perception Model can be attributed to several factors. The use of Singular Spectrum Analysis (SSA) for EEG decomposition effectively addresses the non-stationary nature of EEG signals during inner speech. The transformation of EEG signals into Markov Transition Field (MTF) images allows for the capture of complex temporal dynamics. The specialized 3D convolutional network for fMRI feature extraction is tailored to capture both local and global spatial patterns relevant to inner speech. Finally, the adaptive fusion mechanism dynamically weighs the contributions of different modalities based on their relevance to the recognition task.

5.2. Ablation Study

To understand the contribution of each component in our model, we conducted an ablation study. Table 4 and Table 5 show the results for social and numeric word recognition tasks, respectively.
These results demonstrate that each component contributed to the model’s performance, with the EEG-Raw and fMRI branches being particularly important for both tasks.

5.3. Cross-Participant Generalization

To assess the model’s generalization capability, we performed leave-one-participant-out cross-validation. Table 6 shows the results.
Our model demonstrated robust cross-participant performance, outperforming the best baseline by 4.6 and 4.5 percentage points for social and numeric words, respectively.
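For reference, leave-one-participant-out evaluation differs from the 5-fold protocol only in how folds are formed: each fold holds out every trial of one participant. A minimal sketch using scikit-learn's group-aware splitter is given below, again with `build_and_train` and `evaluate` as hypothetical placeholders.

```python
from sklearn.model_selection import LeaveOneGroupOut

def run_leave_one_participant_out(eeg_data, fmri_data, labels, participant_ids):
    """Sketch of leave-one-participant-out cross-validation."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(eeg_data, labels, groups=participant_ids):
        model = build_and_train(eeg_data[train_idx], fmri_data[train_idx], labels[train_idx])
        scores.append(evaluate(model, eeg_data[test_idx], fmri_data[test_idx], labels[test_idx]))
    return scores
```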

5.4. Extended Study

To validate the effectiveness of our proposed Cross-Perception Model in other multimodal tasks, we conducted an extended study on multimodal sentiment analysis using the CMU-MOSEI and CMU-MOSI datasets. This experiment aims to demonstrate that our model’s architecture can achieve excellent performance across different types of multimodal tasks.
Table 7 presents a comprehensive comparison of our proposed Cross-Perception Model with state-of-the-art approaches for multimodal sentiment analysis on the CMU-MOSEI and CMU-MOSI datasets. These experiments evaluate the performance of the various methods across different emotion recognition tasks, including binary sentiment classification (Acc-2), 5-class emotion classification (Acc-5), 7-class emotion classification (Acc-7), and emotion intensity regression (MAE).
The results demonstrate that our proposed Cross-Perception Model consistently outperformed all other approaches on both datasets and across all evaluation metrics. On the CMU-MOSEI dataset, our model achieved an accuracy of 62.6%, 64.7%, and 88.0% for the 7-class, 5-class, and binary classification tasks, respectively, surpassing the previous state-of-the-art methods by a significant margin. Similarly, on the CMU-MOSI dataset, our model attained an accuracy of 60.3%, 60.5%, and 86.5% for the corresponding tasks. Moreover, our approach achieved the lowest mean absolute error (MAE) of 0.457 and 0.614 on the CMU-MOSEI and CMU-MOSI datasets, respectively, for the emotion intensity regression task.
The superior performance of our Cross-Perception Model in this multimodal sentiment analysis task can be attributed to several key factors. First, the incorporation of multigranularity encoders allows the model to capture modality-specific information at different levels of granularity, enabling a more comprehensive understanding of the sentimental content in each modality. Second, the contrastive learning module enhances the model’s ability to distinguish and fuse information from different modalities, leading to more robust and discriminative multimodal representations. Finally, the hierarchical expert structure leverages both modality-specific and shared representations, allowing the model to adapt to the varying information requirements of different emotion recognition tasks.
These findings demonstrate that our Cross-Perception Model, initially designed for inner speech recognition, can be effectively applied to other multimodal tasks such as sentiment analysis. This highlights the versatility and robustness of our proposed architecture in capturing and integrating information from multiple modalities across different domains.

6. Discussion

The results of our comprehensive experiments provide substantial evidence for the efficacy of the proposed Cross-Perception Model. The notable performance gains over both unimodal and existing multimodal approaches (4.3 and 4.2 percentage points over MM-CNN for the social and numeric tasks, respectively) support our approach to multimodal fusion for inner speech recognition. The ablation study highlights the importance of each proposed component, particularly the EEG-Raw and fMRI branches, supporting the rationale behind our multibranch architectural design. The modality contribution analysis illustrates how the model draws on different modalities dynamically for different word categories, substantiating the advantage of the adaptive fusion mechanism. The confusion matrix analysis shows the model’s ability to distinguish between subtle inner speech categories, a pivotal capability for BCI applications. The temporal and spatial analysis offers insight into how the model integrates information from modalities with disparate temporal and spatial resolutions, capturing complementary aspects of inner speech processes. Finally, the robust cross-participant performance attests to the model’s resilience and generalizability, both of which are essential for practical BCI applications.
These results collectively demonstrate that our Cross-Perception Model successfully addresses the challenges of multimodal inner speech recognition, offering a significant advancement in the field of brain–computer interfaces. The model’s ability to effectively combine information from EEG raw data, EEG Markov Transition Fields, and fMRI images enables more accurate and robust decoding of inner speech, paving the way for more sophisticated and reliable BCI systems.

7. Limitations

Our study, while demonstrating significant advancements in multimodal inner speech recognition, is subject to several limitations. These can be broadly categorized into data-related constraints, methodological challenges, and generalizability issues. The Bimodal Dataset on Inner Speech, though valuable, is limited in size and participant diversity, potentially affecting the generalizability of our findings. Moreover, the non-simultaneous recording of EEG and fMRI data may have introduced temporal misalignments in our multimodal representations. From a methodological perspective, our model’s increased computational complexity, while justifying its performance gains, poses challenges for real-time applications and resource-constrained environments. The interpretability of our model remains an open challenge, particularly in understanding the specific contributions of different brain regions and temporal patterns to inner speech recognition. Lastly, the current study’s focus on a limited set of inner speech categories and the potential for significant individual variability in inner speech patterns may limit the model’s applicability across broader vocabularies and diverse populations. These limitations collectively underscore the need for further research to enhance the robustness, efficiency, and generalizability of multimodal inner speech recognition techniques.

8. Conclusions

In this paper, we presented a novel Cross-Perception Model for multimodal inner speech recognition, leveraging both EEG and fMRI data. Our approach addresses several key challenges in the field of brain–computer interfaces and contributes to the advancement of multimodal fusion techniques for neuroimaging data. The proposed model demonstrated significant improvements over existing unimodal and multimodal methods, achieving higher accuracy and F1-scores across both social and numeric category classification tasks.
The success of our Cross-Perception Model can be attributed to its innovative architecture, which includes multigranularity encoders, a cross-perception expert structure, and an attention-based adaptive fusion strategy. These components work in concert to effectively capture and integrate the complementary information provided by EEG and fMRI data, resulting in more robust and accurate inner speech recognition.
While our results are promising, there are several areas for future research and development. The current study is limited to a relatively small dataset with a restricted vocabulary, and future work should explore the scalability of our approach to larger datasets with more diverse inner speech content. Additionally, the non-simultaneous nature of the EEG and fMRI recordings in the dataset may have introduced some inconsistencies in the multimodal representations. Investigating the model’s performance on simultaneously recorded EEG-fMRI data could provide further insights into its effectiveness.
In conclusion, our Cross-Perception Model represents a significant step forward in multimodal inner speech recognition, demonstrating the potential of combining EEG and fMRI data for more accurate and reliable brain–computer interfaces. As research in this field continues to advance, we anticipate that multimodal approaches like ours will play a crucial role in developing more sophisticated and practical BCI systems, with potential applications in assistive technologies, communication devices, and neuroscientific research.

Author Contributions

Conceptualization, F.L. and J.Q.; methodology, J.Q.; software, J.Q.; validation, F.L. and J.Q.; formal analysis, J.Q.; investigation, F.L.; resources, F.L.; data curation, F.L. and J.Q.; writing—original draft preparation, F.L. and J.Q.; writing—review and editing, F.L., L.Z. and J.Q.; visualization, J.Q.; supervision, F.L. and L.Z.; project administration, F.L. and L.Z.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All code can be accessed from GitHub: [https://github.com/ECNU-Cross-Innovation-Lab/Inner-Speech] (accessed on 29 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alderson-Day, B.; Fernyhough, C. Inner Speech: Development, Cognitive Functions, Phenomenology, and Neurobiology. Psychol. Bull. 2015, 141, 931–965. [Google Scholar] [CrossRef] [PubMed]
  2. Anumanchipalli, G.K.; Chartier, J.; Chang, E.F. Speech Synthesis from Neural Decoding of Spoken Sentences. Nature 2019, 568, 493–498. [Google Scholar] [CrossRef] [PubMed]
  3. Martin, S.; Iturrate, I.; Millán, J.d.R.; Knight, R.T.; Pasley, B.N. Decoding Inner Speech Using Electrocorticography: Progress and Challenges Toward a Speech Prosthesis. Front. Neurosci. 2018, 12, 422. [Google Scholar] [CrossRef]
  4. Huster, R.J.; Debener, S.; Eichele, T.; Herrmann, C.S. Methods for Simultaneous EEG-fMRI: An Introductory Review. J. Neurosci. 2012, 32, 6053–6060. [Google Scholar] [CrossRef]
  5. Cooney, C.; Folli, R.; Coyle, D. Optimizing Layers Improves CNN Generalization and Transfer Learning for Imagined Speech Decoding from EEG. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 1311–1316. [Google Scholar] [CrossRef]
  6. Agarwal; Kumar. EEG-based Imagined Words Classification using Hilbert Transform and Deep Networks. Multimed. Tools Appl. 2024, 83, 2725–2748. [CrossRef]
  7. Porbadnigk, A.; Wester, M.; Calliess, J.; Schultz, T. EEG-Based Speech Recognition—Impact of Temporal Effects. In Proceedings of the International Conference on Bio-Inspired Systems and Signal Processing—Volume 1: BIOSIGNALS, (BIOSTEC 2009), Porto, Portugal, 14–17 January 2009; INSTICC, SciTePress: Setúbal, Portugal, 2009; pp. 376–381. [Google Scholar] [CrossRef]
  8. Nguyen, C.H.; Karavas, G.K.; Artemiadis, P. Inferring imagined speech using EEG signals: A new approach using Riemannian manifold features. J. Neural Eng. 2017, 15, 016002. [Google Scholar] [CrossRef]
  9. Lee, Y.E.; Lee, S.H.; Kim, S.H.; Lee, S.W. Towards Voice Reconstruction from EEG during Imagined Speech. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 6030–6038. [Google Scholar] [CrossRef]
  10. Lopes da Silva, F. EEG and MEG: Relevance to Neuroscience. Neuron 2013, 80, 1112–1128. [Google Scholar] [CrossRef] [PubMed]
  11. Gu, J.; Buidze, T.; Zhao, K.; Gläscher, J.; Fu, X. The neural network of sensory attenuation: A neuroimaging meta-analysis. Psychon. Bull. Rev. 2024. [Google Scholar] [CrossRef]
  12. Sun, J.; Li, M.; Chen, Z.; Zhang, Y.; Wang, S.; Moens, M.F. Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 12332–12348. [Google Scholar]
  13. Cai, H.; Dong, J.; Mei, L.; Feng, G.; Li, L.; Wang, G.; Yan, H. Functional and structural abnormalities of the speech disorders: A multimodal activation likelihood estimation meta-analysis. Cereb. Cortex 2024, 34, bhae075. [Google Scholar] [CrossRef]
  14. Takagi, Y.; Nishimoto, S. High-Resolution Image Reconstruction with Latent Diffusion Models from Human Brain Activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14453–14463. [Google Scholar]
  15. Gong, P.; Jia, Z.; Wang, P.; Zhou, Y.; Zhang, D. ASTDF-Net: Attention-Based Spatial-Temporal Dual-Stream Fusion Network for EEG-Based Emotion Recognition. In Proceedings of the 31st ACM International Conference on Multimedia (MM’23), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 883–892. [Google Scholar] [CrossRef]
  16. Su, W.C.; Dashtestani, H.; Miguel, H.O.; Condy, E.; Buckley, A.; Park, S.; Perreault, J.B.; Nguyen, T.; Zeytinoglu, S.; Millerhagen, J.; et al. Simultaneous multimodal fNIRS-EEG recordings reveal new insights in neural activity during motor execution, observation, and imagery. Sci. Rep. 2023, 13, 5151. [Google Scholar] [CrossRef]
  17. Passos, L.A.; Papa, J.P.; Del Ser, J.; Hussain, A.; Adeel, A. Multimodal audio-visual information fusion using canonical-correlated Graph Neural Network for energy-efficient speech enhancement. Inf. Fusion 2023, 90, 1–11. [Google Scholar] [CrossRef]
  18. Goebel, R.; Esposito, F. The Added Value of EEG-fMRI in Imaging Neuroscience. In EEG—fMRI: Physiological Basis, Technique, and Applications; Mulert, C., Lemieux, L., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 119–138. [Google Scholar] [CrossRef]
  19. Carmichael, D.W.; Vulliemoz, S.; Murta, T.; Chaudhary, U.; Perani, S.; Rodionov, R.; Rosa, M.J.; Friston, K.J.; Lemieux, L. Measurement of the Mapping between Intracranial EEG and fMRI Recordings in the Human Brain. Bioengineering 2024, 11, 224. [Google Scholar] [CrossRef]
  20. Koide-Majima, N.; Nishimoto, S.; Majima, K. Mental image reconstruction from human brain activity: Neural decoding of mental imagery via deep neural network-based Bayesian estimation. Neural Netw. 2024, 170, 349–363. [Google Scholar] [CrossRef]
  21. Liwicki, F.S.; Gupta, V.; Saini, R.; De, K.; Abid, N.; Rakesh, S.; Wellington, S.; Wilson, H.; Liwicki, M.; Eriksson, J. Bimodal Electroencephalography-Functional Magnetic Resonance Imaging Dataset for Inner-Speech Recognition. Sci. Data 2023, 10, 378. [Google Scholar] [CrossRef]
  22. Miyawaki, Y.; Uchida, H.; Yamashita, O.; Sato, M.a.; Morito, Y.; Tanabe, H.C.; Sadato, N.; Kamitani, Y. Visual Image Reconstruction from Human Brain Activity using a Combination of Multiscale Local Image Decoders. Neuron 2008, 60, 915–929. [Google Scholar] [CrossRef]
  23. Cetron, J.S.; Connolly, A.C.; Diamond, S.G.; May, V.V.; Haxby, J.V.; Kraemer, D.J.M. Decoding individual differences in STEM learning from functional MRI data. Nat. Commun. 2019, 10, 2027. [Google Scholar] [CrossRef] [PubMed]
  24. Sligte, I.G.; van Moorselaar, D.; Vandenbroucke, A.R.E. Decoding the Contents of Visual Working Memory: Evidence for Process-Based and Content-Based Working Memory Areas? J. Neurosci. 2013, 33, 1293–1294. [Google Scholar] [CrossRef] [PubMed]
  25. Herff, C.; Krusienski, D.J.; Kubben, P. The Potential of Stereotactic-EEG for Brain-Computer Interfaces: Current Progress and Future Directions. Front. Neurosci. 2020, 14, 123. [Google Scholar] [CrossRef]
  26. Gao, J.; Li, P.; Chen, Z.; Zhang, J. A Survey on Deep Learning for Multimodal Data Fusion. Neural Comput. 2020, 32, 829–864. [Google Scholar] [CrossRef]
  27. Aggarwal, S.; Chugh, N. Review of Machine Learning Techniques for EEG Based Brain Computer Interface. Arch. Comput. Methods Eng. 2022, 29, 3001–3020. [Google Scholar] [CrossRef]
  28. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar]
  29. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
  30. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; NIH Public Access: Bethesda, MD, USA, 2019; Volume 2019, p. 6558. [Google Scholar]
  31. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 10790–10797. [Google Scholar]
  32. Han, W.; Chen, H.; Poria, S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021; pp. 9180–9192. [Google Scholar]
  33. Yuan, Z.; Li, W.; Xu, H.; Yu, W. Transformer-based feature reconstruction network for robust multimodal sentiment analysis. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4400–4407. [Google Scholar]
  34. Sun, Y.; Mai, S.; Hu, H. Learning to learn better unimodal representations via adaptive multimodal meta-learning. IEEE Trans. Affect. Comput. 2023, 14, 2209–2223. [Google Scholar] [CrossRef]
  35. Liu, F.; Shen, S.Y.; Fu, Z.W.; Wang, H.Y.; Zhou, A.M.; Qi, J.Y. Lgcct: A light gated and crossed complementation transformer for multimodal speech emotion recognition. Entropy 2022, 24, 1010. [Google Scholar] [CrossRef]
  36. Sun, L.; Lian, Z.; Liu, B.; Tao, J. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis. IEEE Trans. Affect. Comput. 2024, 15, 309–325. [Google Scholar] [CrossRef]
  37. Fu, Z.; Liu, F.; Xu, Q.; Fu, X.; Qi, J. LMR-CBT: Learning modality-fused representations with CB-transformer for multimodal emotion recognition from unaligned multimodal sequences. Front. Comput. Sci. 2024, 18, 184314. [Google Scholar] [CrossRef]
  38. Wang, L.; Peng, J.; Zheng, C.; Zhao, T.; Zhu, L. A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning. Inf. Process. Manag. 2024, 61, 103675. [Google Scholar] [CrossRef]
  39. Shi, H.; Pu, Y.; Zhao, Z.; Huang, J.; Zhou, D.; Xu, D.; Cao, J. Co-space Representation Interaction Network for multimodal sentiment analysis. Knowl.-Based Syst. 2024, 283, 111149. [Google Scholar] [CrossRef]
Figure 1. Inner speech recognition using EEG and fMRI data.
Figure 2. The process of transforming EEG signals into MTF images.
Figure 3. Visualization of EEG data for 8 classes.
Figure 4. Visualization of EEG data for 2 classes.
Figure 5. Visualization of fMRI data for 8 classes.
Figure 6. Visualization of fMRI data for 2 classes.
Table 1. Experimental setup summary.

Aspect | Description
Dataset | Bimodal Dataset on Inner Speech
Participants | 4 healthy, right-handed (3 females, 1 male, aged 33–51 years)
Tasks | Two 4-class classification tasks: 1. Social category: child, daughter, father, wife; 2. Numeric category: four, three, ten, six
Data Types | Non-simultaneous EEG and fMRI recordings
Preprocessing | EEG: bandpass filter (1–50 Hz), artifact removal via ICA; fMRI: motion correction, slice timing correction, spatial normalization to MNI space
Validation Strategy | 5-fold cross-validation
Evaluation Metrics | Accuracy, F1-score, Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
Table 2. Results for the social category classification task.

Method | Acc ↑ (%) | F1-Score ↑ | AUC-ROC ↑
EEG-SVM | 28.5 ± 2.1 | 0.27 ± 0.02 | 0.32 ± 0.01
EEG-RF | 30.2 ± 1.8 | 0.29 ± 0.02 | 0.34 ± 0.01
fMRI-MVPA | 35.8 ± 1.5 | 0.35 ± 0.01 | 0.38 ± 0.01
fMRI-3DCNN | 38.3 ± 1.3 | 0.37 ± 0.01 | 0.40 ± 0.01
EEG-fMRI-Concat | 40.5 ± 1.2 | 0.40 ± 0.01 | 0.42 ± 0.01
EEG-fMRI-CCA | 42.1 ± 1.0 | 0.41 ± 0.01 | 0.49 ± 0.01
MM-CNN | 44.7 ± 0.9 | 0.44 ± 0.01 | 0.55 ± 0.00
Our Method | 47.2 ± 0.7 | 0.47 ± 0.01 | 0.56 ± 0.00

↑ indicates that higher values represent better performance for the corresponding metric. Note: Bold values indicate the best performance for each metric.
Table 3. Results for the numeric category classification task.

Method | Acc ↑ (%) | F1-Score ↑ | AUC-ROC ↑
EEG-SVM | 17.8 ± 2.2 | 0.16 ± 0.02 | 0.21 ± 0.01
EEG-RF | 19.5 ± 1.9 | 0.18 ± 0.02 | 0.23 ± 0.01
fMRI-MVPA | 24.9 ± 1.6 | 0.24 ± 0.02 | 0.31 ± 0.01
fMRI-3DCNN | 27.6 ± 1.4 | 0.26 ± 0.01 | 0.33 ± 0.01
EEG-fMRI-Concat | 29.8 ± 1.3 | 0.29 ± 0.01 | 0.39 ± 0.01
EEG-fMRI-CCA | 29.3 ± 1.1 | 0.30 ± 0.01 | 0.41 ± 0.01
MM-CNN | 33.9 ± 1.0 | 0.33 ± 0.01 | 0.44 ± 0.00
Our Method | 36.5 ± 0.8 | 0.36 ± 0.01 | 0.45 ± 0.00

↑ indicates that higher values represent better performance for the corresponding metric. Note: Bold values indicate the best performance for each metric.
Table 4. Ablation study results for 4-class social word recognition.

Model Variant | Acc ↑ (%) | F1-Score ↑ | AUC-ROC ↑
Full Model | 47.2 ± 0.7 | 0.47 ± 0.01 | 0.56 ± 0.00
w/o EEG-Raw | 45.0 ± 0.8 | 0.45 ± 0.01 | 0.54 ± 0.01
w/o EEG-MTF | 44.5 ± 0.9 | 0.44 ± 0.01 | 0.53 ± 0.01
w/o fMRI | 43.7 ± 0.9 | 0.43 ± 0.01 | 0.52 ± 0.01
w/o Cross-Perception | 43.9 ± 0.8 | 0.44 ± 0.01 | 0.53 ± 0.01
w/o Adaptive Fusion | 45.3 ± 0.8 | 0.45 ± 0.01 | 0.55 ± 0.01

↑ indicates that higher values represent better performance for the corresponding metric.
Table 5. Ablation study results for 4-class numeric word recognition.

Model Variant | Acc ↑ (%) | F1-Score ↑ | AUC-ROC ↑
Full Model | 36.5 ± 0.8 | 0.36 ± 0.01 | 0.45 ± 0.00
w/o EEG-Raw | 34.4 ± 0.9 | 0.34 ± 0.01 | 0.43 ± 0.01
w/o EEG-MTF | 33.9 ± 1.0 | 0.33 ± 0.01 | 0.42 ± 0.01
w/o fMRI | 33.2 ± 1.0 | 0.33 ± 0.01 | 0.41 ± 0.01
w/o Cross-Perception | 33.4 ± 0.9 | 0.33 ± 0.01 | 0.42 ± 0.01
w/o Adaptive Fusion | 34.7 ± 0.9 | 0.34 ± 0.01 | 0.44 ± 0.01

↑ indicates that higher values represent better performance for the corresponding metric.
Table 6. Cross-participant generalization results.

Task | Our Model Accuracy (%) | Best Baseline Accuracy (%)
Social Words | 47.2 ± 0.7 | 47.3 ± 0.1
Numeric Words | 36.5 ± 0.8 | 36.6 ± 0.1
Table 7. Results on the CMU-MOSEI and CMU-MOSI datasets.

Method | CMU-MOSEI Acc-7 ↑ (%) | Acc-5 ↑ (%) | Acc-2 ↑ (%) | MAE ↓ | CMU-MOSI Acc-7 ↑ (%) | Acc-5 ↑ (%) | Acc-2 ↑ (%) | MAE ↓
TFN (2018) [28] | 50.2 | - | 82.5 | 0.593 | 34.9 | - | 80.8 | 0.901
LMF (2018) [29] | 48.0 | - | 82.0 | 0.623 | 33.2 | - | 82.5 | 0.917
Mult (2019) [30] | 52.6 | 54.1 | 83.5 | 0.564 | 40.4 | 46.7 | 83.4 | 0.846
Self-MM (2021) [31] | 53.6 | 55.4 | 85.0 | 0.533 | 46.4 | 52.8 | 84.6 | 0.717
MMIM (2021) [32] | 53.2 | 55.0 | 85.0 | 0.536 | 46.9 | 53.0 | 85.3 | 0.712
TFR-Net (2021) [33] | 52.3 | 54.3 | 83.5 | 0.551 | 46.1 | 53.2 | 84.0 | 0.721
AMML (2022) [34] | 52.4 | - | 85.3 | 0.614 | 46.3 | - | 84.9 | 0.723
LGCCT (2022) [35] | 47.5 | - | 81.1 | - | - | - | - | -
EMT (2023) [36] | 54.5 | 56.3 | 86.0 | 0.527 | 47.4 | 54.1 | 85.0 | 0.705
LMR-CBT (2024) [37] | 51.9 | - | 82.7 | - | 41.4 | - | 83.1 | 0.774
CMHFM (2024) [38] | 52.8 | 54.4 | 84.5 | 0.548 | 37.2 | 42.4 | 81.7 | 0.907
CRNet (2024) [39] | 53.8 | - | 86.4 | 0.541 | 47.4 | - | 86.4 | 0.712
Ours | 62.6 | 64.7 | 88.0 | 0.457 | 60.3 | 60.5 | 86.5 | 0.614

↑ indicates that higher values represent better performance for the corresponding metric. ↓ indicates that lower values represent better performance for the corresponding metric. Note: Many models did not provide Acc-5 results in their original papers; the Acc-5 data used in this comparison were obtained from [36]. Bold values indicate the best performance for each metric.
