1. Introduction
Inner speech, often described as the silent voice in our minds, plays a crucial role in cognitive processes such as problem solving, memory formation, and self-regulation [1]. The ability to recognize and decode inner speech using brain–computer interfaces (BCIs) holds immense potential for assistive technologies, particularly for individuals with severe motor disabilities [2]. However, accurately decoding inner speech from neural signals remains a significant challenge due to its subtle and complex nature.
Recent advances in neuroimaging techniques have opened new avenues for studying inner speech. Electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) are two prevalent modalities used in this field, each offering unique insights into brain activity [3]. EEG provides high temporal resolution, capturing rapid changes in neural activity, while fMRI offers superior spatial resolution, allowing for precise localization of brain activations [4].
Despite these advancements, most existing studies on inner speech recognition have focused on unimodal approaches, utilizing either EEG or fMRI data independently [5,6]. However, inner speech activity is inherently multimodal, and relying on a single modality imposes clear limitations. Generic time series feature extraction methods may not effectively capture the non-stationary and weak nature of EEG signals during inner speech, because inner speech-related neural activity is subtle and transient and can easily be overshadowed by background brain activity and noise [7,8,9,10]. Likewise, traditional image recognition techniques may not be optimal for extracting relevant features from fMRI data for inner speech tasks: fMRI data have unique spatiotemporal characteristics and lower temporal resolution compared to natural images, requiring specialized processing techniques [11,12,13,14].
Recognizing that multimodal data provide complementary and mutually reinforcing representations, many researchers have applied multimodal fusion to the pattern recognition of biological signals and built models with stronger performance. For example, Gong et al. [15] combined EEG and fMRI data to improve the classification of emotional states, achieving higher accuracy than unimodal approaches. Su et al. [16] integrated EEG and fNIRS (functional near-infrared spectroscopy) for motor imagery classification, demonstrating superior performance over single-modality methods. In the context of speech processing, Passos et al. [17] fused EEG and audio features for robust speech recognition in noisy environments. However, existing fusion methods may not fully leverage the complementary information provided by EEG and fMRI in the specific context of inner speech recognition, because the temporal dynamics of EEG and the spatial patterns of fMRI require careful alignment and integration to capture the full complexity of inner speech processes [18,19,20].
The recent release of the Bimodal Dataset on Inner Speech [21] provides a unique opportunity to explore multimodal approaches to inner speech recognition. This dataset combines EEG and fMRI recordings from participants performing an inner speech task, offering a rich resource for developing and evaluating multimodal decoding methods.
In this paper, we propose a novel cross-perception model for multimodal inner speech recognition that leverages both EEG and fMRI data. Our proposed approach is shown in Figure 1.
By leveraging the strengths of both EEG and fMRI, our cross-perception model aims to achieve more accurate and robust inner speech recognition compared to unimodal approaches. We evaluate our model on the Bimodal dataset on Inner Speech, demonstrating its effectiveness in decoding inner speech across different semantic categories. The main contributions of this work are as follows:
We propose a novel cross-perception model that effectively integrates EEG and fMRI data for inner speech recognition.
We introduce a multigranularity encoding scheme that captures both temporal and spatial aspects of brain activity during inner speech.
We develop an adaptive fusion mechanism that dynamically weights the contributions of different modalities based on their relevance to the recognition task.
We provide extensive experimental results and analyses, demonstrating the superiority of our multimodal approach over unimodal baselines.
The rest of the paper is organized as follows: Section 2 reviews related work in inner speech recognition, including unimodal and multimodal approaches, and discusses the limitations of existing methods. Section 3 introduces our proposed targeted improvements, including enhancements for EEG and fMRI data processing, the multimodal fusion strategy, and cross-modal contrastive learning. Section 4 describes the experimental setup, including dataset details, preprocessing steps, and baseline methods. Section 5 presents the results of our experiments, including the main results, ablation studies, cross-participant generalization, and an extended study on multimodal sentiment analysis. Finally, Section 6 discusses the implications of our findings, Section 7 outlines the limitations of this study, and Section 8 concludes the paper.
2. Related Work
Inner speech recognition using neuroimaging data has gained significant attention in recent years due to its potential applications in brain–computer interfaces and cognitive neuroscience. This section reviews relevant literature, focusing on both unimodal and multimodal approaches to inner speech recognition.
2.1. Unimodal Approaches
Electroencephalography (EEG) has been widely used for inner speech recognition due to its high temporal resolution. Traditional machine learning methods have shown promising results in this domain. Nguyen et al. [8] employed Support Vector Machines (SVMs) with Riemannian manifold features, achieving an accuracy of 68% in a four-class inner speech classification task. Cooney et al. [5] utilized Random Forests on time–frequency features, reporting an accuracy of 71% for a similar task. Da Silva and Fernando [10] applied Linear Discriminant Analysis (LDA) to Common Spatial Pattern (CSP) features, obtaining an accuracy of 73% in distinguishing between two inner speech classes. These studies demonstrate the potential of EEG for inner speech recognition, but they also highlight the challenges in achieving high accuracy due to the subtle nature of inner speech signals.
Functional Magnetic Resonance Imaging (fMRI) offers superior spatial resolution for inner speech recognition. Miyawaki et al. [22] used multivoxel pattern analysis (MVPA) to decode visual imagery, achieving an accuracy of 78% in reconstructing visual patterns. Cetron et al. [23] employed Independent Component Analysis (ICA) combined with machine learning classifiers, reporting an accuracy of 82% in distinguishing between different semantic categories during inner speech. Sligte et al. [24] utilized Representational Similarity Analysis (RSA) to investigate the neural representations of inner speech, demonstrating significant correlations between predicted and observed neural patterns. These fMRI-based studies provide valuable insights into the spatial patterns of brain activity during inner speech, but are limited by the low temporal resolution of fMRI.
2.2. Bimodal Approaches
Recent years have seen a growing interest in multimodal approaches that combine EEG and fMRI data for inner speech recognition. Herff et al. [25] proposed a hybrid EEG-fMRI model using convolutional neural networks (CNNs) for feature extraction and fusion, achieving an accuracy of 85% in a four-class inner speech task. Gao et al. [26] introduced a multimodal attention mechanism to dynamically weight EEG and fMRI features, reporting an accuracy of 87% on a similar task. Aggarwal et al. [27] employed a graph convolutional network to model both temporal (EEG) and spatial (fMRI) dependencies, achieving an accuracy of 89% in distinguishing between different inner speech categories. These multimodal approaches consistently outperform unimodal methods, demonstrating the complementary nature of EEG and fMRI data for inner speech recognition. The temporal precision of EEG, combined with the spatial resolution of fMRI, provides a more comprehensive view of the neural processes underlying inner speech.
2.3. Limitations of Existing Approaches
While bimodal approaches combining EEG and fMRI data have shown promise in inner speech recognition, they often fall short of fully leveraging the unique characteristics of these neuroimaging modalities. Current methods typically rely on generic fusion techniques borrowed from computer science, which may not be optimally adapted to the specific challenges of inner speech recognition. Standard time series feature extraction methods often struggle to effectively capture the non-stationary and subtle nature of EEG signals during inner speech. Similarly, conventional image recognition techniques may not be ideal for extracting the most relevant features from fMRI data in the context of inner speech tasks. Moreover, existing fusion methods may not fully exploit the complementary information provided by EEG and fMRI, particularly in the nuanced domain of inner speech recognition.
These limitations indicate a clear opportunity for advancement in multimodal approaches to inner speech recognition. Specifically, there is a need for methods tailored to the unique characteristics of EEG and fMRI data in this context. Our proposed method addresses this gap by introducing a novel trimodal approach. By incorporating EEG raw data, EEG Markov Transition Fields, and fMRI spatial data, we offer a more comprehensive and nuanced view of inner speech processes. This approach not only builds upon the potential demonstrated by existing bimodal methods but also extends it significantly, paving the way for more accurate and robust inner speech recognition.
3. Proposed Targeted Improvements
Our proposed method addresses the limitations of existing approaches by introducing targeted improvements for both EEG and fMRI data processing in the context of inner speech recognition. These improvements are designed to better capture the unique characteristics of each modality while enhancing their complementary nature.
3.1. EEG Signal Processing Enhancements
Singular Spectrum Analysis (SSA) for EEG Decomposition
To address the non-stationary nature of EEG signals during inner speech, we employ Singular Spectrum Analysis (SSA). SSA decomposes the original EEG signal into interpretable components, allowing for more effective feature extraction.
Given an EEG signal $x = (x_1, x_2, \ldots, x_N)$, we first construct the trajectory matrix:

$$\mathbf{X} = \begin{bmatrix} x_1 & x_2 & \cdots & x_K \\ x_2 & x_3 & \cdots & x_{K+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_L & x_{L+1} & \cdots & x_N \end{bmatrix},$$

where $L$ is the window length, and $K = N - L + 1$.

We then perform Singular Value Decomposition (SVD) on $\mathbf{X}$:

$$\mathbf{X} = \sum_{i=1}^{d} \sqrt{\lambda_i}\, U_i V_i^{\top}.$$

The eigentriples $(\sqrt{\lambda_i}, U_i, V_i)$ are used to reconstruct the components:

$$\mathbf{X} = \mathbf{X}_1 + \mathbf{X}_2 + \cdots + \mathbf{X}_d,$$

where $\mathbf{X}_i = \sqrt{\lambda_i}\, U_i V_i^{\top}$.
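For illustration, the sketch below shows a basic SSA decomposition of a single EEG channel in Python/NumPy. It is a minimal sketch of the standard algorithm rather than our exact implementation; the window length and the grouping of leading components into a denoised signal are placeholder choices.

```python
import numpy as np

def ssa_decompose(x, L=50):
    """Basic Singular Spectrum Analysis of a 1-D signal.

    x : 1-D array (a single EEG channel) of length N
    L : window length (embedding dimension); placeholder value
    Returns an array of shape (n_components, N) of reconstructed components.
    """
    N = len(x)
    K = N - L + 1
    # Trajectory (Hankel) matrix: column k holds the lagged window x[k : k + L]
    X = np.column_stack([x[k:k + L] for k in range(K)])
    # SVD of the trajectory matrix yields the eigentriples
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = np.zeros((len(s), N))
    for i in range(len(s)):
        # Rank-one elementary matrix for eigentriple i
        Xi = s[i] * np.outer(U[:, i], Vt[i, :])
        # Diagonal averaging (Hankelization) back to a 1-D series
        comp, counts = np.zeros(N), np.zeros(N)
        for col in range(K):
            comp[col:col + L] += Xi[:, col]
            counts[col:col + L] += 1
        components[i] = comp / counts
    return components

# Example: keep the leading components as a denoised inner speech signal.
# x_denoised = ssa_decompose(eeg_channel, L=50)[:5].sum(axis=0)  # grouping is illustrative
```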
To capture the temporal dynamics of EEG signals, we transform them into image representations. While several time series image encoding methods exist, such as Recurrence Plots (RPs) and Gramian Angular Fields (GASFs and GADFs), we chose Markov Transition Fields (MTFs) for their unique advantages in the context of inner speech recognition.
Recurrence Plots, while effective for visualizing repeating patterns in time series data, may not adequately capture the subtle, non-repetitive nature of inner speech-related EEG signals. Gramian Angular Fields (GASFs and GADFs) preserve temporal correlations but can be sensitive to noise and may not effectively represent the state transitions crucial for distinguishing different inner speech categories. In contrast, MTFs offer several advantages for our task. They capture both the temporal dynamics and state transition probabilities of EEG signals, which are crucial for representing the complex patterns associated with inner speech. MTFs preserve more temporal information compared to GASFs and GADFs, allowing for better representation of the subtle temporal patterns in inner speech EEG data. The probabilistic nature of MTFs makes them more robust to the noise and non-stationarity often present in EEG signals during cognitive tasks.
We transform the EEG signals into Markov Transition Field (MTF) images, which allows us to leverage powerful image processing techniques for feature extraction. Given a quantized EEG signal $q = (q_1, q_2, \ldots, q_N)$ with states $q_t \in \{1, \ldots, Q\}$, we construct the transition probability matrix $\mathbf{W} \in \mathbb{R}^{Q \times Q}$:

$$w_{ij} = P(q_t = j \mid q_{t-1} = i), \qquad \sum_{j=1}^{Q} w_{ij} = 1.$$

The MTF is then defined as

$$\mathrm{MTF}_{mn} = w_{q_m, q_n}, \qquad m, n = 1, \ldots, N.$$
Figure 2 illustrates the process of transforming EEG signals into MTF images. This transformation allows us to capture the temporal dynamics of EEG signals in a format that can be effectively processed by convolutional neural networks, enabling more robust feature extraction for inner speech recognition.
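As an illustration of this construction, the following sketch quantizes a signal into Q bins, estimates the first-order transition matrix W, and expands it into the N x N field. The bin count and quantile-based binning are illustrative assumptions, not the exact settings used in our pipeline.

```python
import numpy as np

def markov_transition_field(x, n_bins=8):
    """Compute a Markov Transition Field image from a 1-D signal.

    x      : 1-D array (e.g., a denoised EEG channel)
    n_bins : number of quantile bins Q used to discretize the signal
    Returns an (N, N) array where entry (m, n) equals W[q_m, q_n].
    """
    N = len(x)
    # Quantile-based discretization into states 0 .. Q-1
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    q = np.digitize(x, edges)
    # First-order transition counts: W[i, j] counts q_{t-1} = i -> q_t = j
    W = np.zeros((n_bins, n_bins))
    for t in range(1, N):
        W[q[t - 1], q[t]] += 1
    # Row-normalize to transition probabilities (guarding against empty rows)
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    # Markov Transition Field: MTF[m, n] = W[q_m, q_n]
    return W[q[:, None], q[None, :]]

# mtf_image = markov_transition_field(x_denoised, n_bins=8)  # hypothetical usage
```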
3.2. fMRI Data Processing Enhancements
The data collection process involved recording inner speech activity at fixed intervals, with participants instructed to maintain silence. This experimental design allowed us to precisely identify the fMRI states corresponding to inner speech activities along the temporal axis. We performed temporal segmentation on the entire experimental recording, enabling us to extract fMRI images and EEG signals that correspond to the same inner speech activities. This temporal correspondence was crucial for our multimodal approach. Inner speech processes involve complex spatial patterns of activation across multiple brain regions. The use of dilated convolutions enables the capture of multiscale features, which is crucial for distinguishing between different inner speech categories. This is particularly important given the subtle differences in brain activation patterns that may exist between various inner speech tasks.
Therefore, we proposed a specialized 3D convolutional network to extract features from fMRI data that are particularly relevant for inner speech recognition. The 3D convolutional layers of our network allow it to learn hierarchical spatial features that correspond to these distributed activation patterns. While fMRI has a lower temporal resolution compared to EEG, our 3D convolutional network can still capture some temporal dynamics within fMRI data, complementing the high temporal resolution information from the EEG modality. The network architecture is designed to capture both local and global spatial patterns in brain activation data. The network consists of $L$ convolutional layers, each followed by batch normalization and ReLU activation. The $l$-th layer is defined as follows:

$$h^{(l)} = \mathrm{ReLU}\big(\mathrm{BN}\big(W^{(l)} * h^{(l-1)} + b^{(l)}\big)\big),$$

where $h^{(l)}$ is the output of the $l$-th layer; $W^{(l)}$ and $b^{(l)}$ are the weights and biases, respectively; and $*$ denotes the 3D convolution operation.

To capture multiscale features, we employed dilated convolutions with increasing dilation rates:

$$(x *_{d} k)(p) = \sum_{s} x(p + d \cdot s)\, k(s),$$

where $d$ is the dilation rate, and $k$ is the kernel. This specialized network allows us to extract rich spatial features from fMRI data, capturing the complex patterns of brain activity associated with inner speech processes.
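A minimal PyTorch sketch of such a dilated 3D convolutional feature extractor is given below. The number of layers, channel widths, dilation schedule (1, 2, 4), and output feature dimension are illustrative assumptions rather than the exact architecture reported here.

```python
import torch
import torch.nn as nn

class Dilated3DEncoder(nn.Module):
    """Stack of 3D convolutions with increasing dilation rates for fMRI volumes."""

    def __init__(self, in_channels=1, widths=(16, 32, 64), dilations=(1, 2, 4), feat_dim=128):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out, d in zip(widths, dilations):
            layers += [
                # padding = dilation keeps the spatial size with a 3x3x3 kernel
                nn.Conv3d(c_in, c_out, kernel_size=3, dilation=d, padding=d),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
            ]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d(1)          # global spatial pooling
        self.proj = nn.Linear(c_in, feat_dim)        # fMRI feature vector

    def forward(self, x):                            # x: (B, 1, D, H, W)
        h = self.pool(self.features(x)).flatten(1)   # (B, C)
        return self.proj(h)                          # (B, feat_dim)

# f_fmri = Dilated3DEncoder()(torch.randn(2, 1, 32, 64, 64))  # toy volume shape
```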
3.3. Multimodal Fusion Strategy
We proposed an attention-based fusion mechanism to dynamically weigh the contributions of EEG and fMRI features. Given the EEG features $f_{\mathrm{EEG}}$ and fMRI features $f_{\mathrm{fMRI}}$, we compute attention weights $\alpha = (\alpha_{\mathrm{EEG}}, \alpha_{\mathrm{fMRI}})$:

$$\alpha = \mathrm{softmax}\big(W_a [f_{\mathrm{EEG}}; f_{\mathrm{fMRI}}] + b_a\big).$$

The fused features are then obtained as follows:

$$f_{\mathrm{fused}} = \alpha_{\mathrm{EEG}}\, f_{\mathrm{EEG}} + \alpha_{\mathrm{fMRI}}\, f_{\mathrm{fMRI}}.$$
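One plausible instantiation of this weighting, sketched below in PyTorch, scores the concatenated features with a single linear layer followed by a softmax over the two modalities; the feature dimension is an assumption.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Softmax-weighted fusion of EEG and fMRI feature vectors."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Scores the concatenated features and yields one weight per modality
        self.score = nn.Linear(2 * feat_dim, 2)

    def forward(self, f_eeg, f_fmri):                 # each: (B, feat_dim)
        alpha = torch.softmax(self.score(torch.cat([f_eeg, f_fmri], dim=-1)), dim=-1)
        # Weighted sum of the two modality features
        return alpha[:, 0:1] * f_eeg + alpha[:, 1:2] * f_fmri

# fused = AttentionFusion()(f_eeg, f_fmri)  # hypothetical feature tensors
```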
3.4. Cross-Modal Contrastive Learning
To enhance the alignment between EEG and fMRI modalities, we introduced a cross-modal contrastive loss. For a pair of corresponding EEG and fMRI features $(f_{\mathrm{EEG}}^{i}, f_{\mathrm{fMRI}}^{i})$, the contrastive loss is defined as

$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\big(\mathrm{sim}(f_{\mathrm{EEG}}^{i}, f_{\mathrm{fMRI}}^{i}) / \tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(f_{\mathrm{EEG}}^{i}, f_{\mathrm{fMRI}}^{j}) / \tau\big)},$$

where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity, and $\tau$ is a temperature parameter.

The final loss function combines the classification loss and the contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\, \mathcal{L}_{\mathrm{con}},$$

where $\lambda$ is a hyperparameter balancing the two loss terms.
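A minimal sketch of this objective, assuming an InfoNCE-style formulation with in-batch negatives, is shown below; the temperature and the balancing weight are placeholder values.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(f_eeg, f_fmri, tau=0.07):
    """InfoNCE-style loss aligning matched EEG/fMRI pairs within a batch."""
    f_eeg = F.normalize(f_eeg, dim=-1)               # normalized features so the
    f_fmri = F.normalize(f_fmri, dim=-1)             # dot product is cosine similarity
    logits = f_eeg @ f_fmri.t() / tau                # (B, B) similarity matrix
    targets = torch.arange(f_eeg.size(0), device=f_eeg.device)
    # Matched pair (i, i) is the positive; other fMRI samples act as negatives
    return F.cross_entropy(logits, targets)

def total_loss(class_logits, labels, f_eeg, f_fmri, lam=0.5):
    """Classification loss plus the weighted contrastive alignment term."""
    return F.cross_entropy(class_logits, labels) + lam * cross_modal_contrastive_loss(f_eeg, f_fmri)
```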
These targeted improvements address the specific challenges of EEG and fMRI data in the context of inner speech recognition, providing a scientifically grounded approach to enhancing the performance of multimodal methods in this domain.
To better understand the characteristics of the Bimodal Dataset on Inner Speech, we visualized both the fMRI and EEG data for different inner speech categories. These visualizations provide insights into the spatial and temporal patterns of brain activity during inner speech tasks.
Figure 3 and Figure 4 show visualizations of the EEG data for eight and two classes, respectively. These figures demonstrate the rich temporal information captured by EEG, complementing the spatial information provided by fMRI. Figure 5 and Figure 6 present visualizations of the fMRI data for eight and two classes, respectively. These figures reveal distinct spatial patterns of brain activity for different inner speech categories, highlighting the potential of fMRI in capturing the neural representations associated with various inner speech tasks. The distinct patterns observed in both EEG and fMRI data underscore the potential of our multimodal approach in capturing complementary aspects of inner speech processes.
3.5. Theoretical Framework
Our proposed Cross-Perception Model for multimodal inner speech recognition is built upon the following theoretical framework:
Multigranularity representation: We posit that inner speech processes manifest across multiple granularities in both EEG and fMRI data. Our model captures these through the following: (a) Raw EEG signals (fine-grained temporal information). (b) EEG Markov Transition Fields (state transition patterns). (c) fMRI spatial data (high-resolution spatial information).
Cross-modal complementarity: We theorize that EEG and fMRI provide complementary information about inner speech processes. EEG captures rapid temporal dynamics, while fMRI provides detailed spatial localization of brain activity.
Adaptive fusion: We propose that the relevance of each modality may vary depending on the specific inner speech task or individual. Our model dynamically adjusts the contribution of each modality through an attention-based mechanism.
Hierarchical feature learning: Our model employs a hierarchical structure to learn both modality-specific and shared representations, allowing for the capture of both unique and common aspects of inner speech across modalities.
Contrastive learning: To enhance the alignment between modalities, we incorporate a contrastive learning objective, encouraging the model to learn representations that are consistent across EEG and fMRI data for the same inner speech instance.
This theoretical framework guided the design of our Cross-Perception Model, ensuring that it effectively leverages the strengths of each modality while addressing the unique challenges of multimodal inner speech recognition.
4. Experiment Setup
Our experiments aim to evaluate the effectiveness of the proposed Cross-Perception Model for multimodal inner speech recognition. We used the Bimodal Dataset on Inner Speech [21] for all evaluations. Table 1 summarizes the key aspects of our experimental setup.
Participants were instructed to mentally repeat the presented word. This design allows for the investigation of semantic category-specific neural patterns during inner speech production. We used a 5-fold cross-validation strategy for all experiments to ensure robust evaluation of our model’s performance.
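For reference, the evaluation protocol can be expressed with a standard stratified 5-fold split; the training routine and variable names below are hypothetical placeholders for our pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_cross_validation(trials, labels, train_and_eval, n_splits=5, seed=42):
    """Stratified 5-fold CV; `train_and_eval` is a hypothetical training routine."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(np.zeros((len(labels), 1)), labels):
        # Train on the training folds, evaluate on the held-out fold
        scores.append(train_and_eval(trials, labels, train_idx, test_idx))
    return float(np.mean(scores)), float(np.std(scores))
```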
The combination of EEG and fMRI data in this dataset provides a unique opportunity to explore the complementary nature of these modalities in capturing the neural correlates of inner speech. By leveraging both the high temporal resolution of EEG and the superior spatial resolution of fMRI, our Cross-Perception Model aims to achieve more accurate and robust inner speech recognition compared to unimodal approaches.
4.1. Baseline Methods
We compared our proposed method with several baseline approaches:
4.1.1. Unimodal Methods
EEG-SVM: Support Vector Machine classifier using time–frequency features from EEG data.
EEG-RF: Random Forest classifier using wavelet coefficients from EEG data.
fMRI-MVPA: Multivoxel Pattern Analysis using a linear SVM on fMRI data.
fMRI-3DCNN: 3D Convolutional Neural Network on fMRI data.
4.1.2. Existing Multimodal Methods
EEG-fMRI-Concat: Simple concatenation of EEG and fMRI features with an SVM classifier.
EEG-fMRI-CCA: Canonical Correlation Analysis for feature fusion of EEG and fMRI data.
MM-CNN: Multimodal Convolutional Neural Network for EEG and fMRI fusion.
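As an illustration of the simplest multimodal baseline (EEG-fMRI-Concat), the sketch below concatenates precomputed EEG and fMRI feature vectors and fits a linear SVM on the training folds; the upstream feature extraction and the SVM hyperparameters are assumptions, not the exact baseline configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def concat_svm_baseline(eeg_feats, fmri_feats, labels, train_idx, test_idx):
    """EEG-fMRI-Concat baseline: feature concatenation followed by a linear SVM."""
    X = np.hstack([eeg_feats, fmri_feats])           # (n_trials, d_eeg + d_fmri)
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    clf.fit(X[train_idx], labels[train_idx])
    return clf.score(X[test_idx], labels[test_idx])  # accuracy on the held-out fold
```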
5. Results
5.1. Main Results
Table 2 and Table 3 present the performance of different methods on the social and numeric category classification tasks, respectively. Our proposed Cross-Perception Model consistently outperformed both unimodal and existing multimodal approaches across all evaluation metrics for both classification tasks.
In terms of unimodal performance, fMRI-based methods generally outperformed EEG-based methods, likely due to the higher spatial resolution of the fMRI data. However, both modalities struggled to achieve high accuracy when used independently, highlighting the challenge of inner speech recognition from a single modality.
The combination of EEG and fMRI data led to significant improvements over unimodal approaches. This underscores the complementary nature of the temporal precision of EEG and the spatial resolution of fMRI in capturing inner speech processes. Among existing multimodal methods, MM-CNN showed the best performance, demonstrating the advantage of using deep learning for feature extraction and fusion in multimodal scenarios.
Our Cross-Perception Model achieved further improvements over existing multimodal methods by incorporating tailored EEG and fMRI processing techniques. For the social category task, we observed a 2.5% increase in accuracy and a 0.03 increase in F1-score compared to the best-performing baseline (MM-CNN). Similar improvements were seen for the numeric category task.
The superior performance of our Cross-Perception Model can be attributed to several factors. The use of Singular Spectrum Analysis (SSA) for EEG decomposition effectively addresses the non-stationary nature of EEG signals during inner speech. The transformation of EEG signals into Markov Transition Field (MTF) images allows for the capture of complex temporal dynamics. The specialized 3D convolutional network for fMRI feature extraction is tailored to capture both local and global spatial patterns relevant to inner speech. Finally, the adaptive fusion mechanism dynamically weighs the contributions of different modalities based on their relevance to the recognition task.
5.2. Ablation Study
To understand the contribution of each component in our model, we conducted an ablation study.
Table 4 and Table 5 show the results for social and numeric word recognition tasks, respectively.
These results demonstrate that each component contributed to the model’s performance, with the EEG-Raw and fMRI branches being particularly important for both tasks.
5.3. Cross-Participant Generalization
To assess the model’s generalization capability, we performed leave-one-participant-out cross-validation. Table 6 shows the results.
Our model demonstrated robust cross-participant performance, outperforming the best baseline by 4.6 and 4.5 percentage points for social and numeric words, respectively.
5.4. Extended Study
To validate the effectiveness of our proposed Cross-Perception Model in other multimodal tasks, we conducted an extended study on multimodal sentiment analysis using the CMU-MOSEI and CMU-MOSI datasets. This experiment aims to demonstrate that our model’s architecture can achieve excellent performance across different types of multimodal tasks.
Table 7 presents a comprehensive comparison of our proposed Cross-Perception Model with state-of-the-art approaches for multimodal sentiment analysis on the CMU-MOSEI and CMU-MOSI datasets. The experiments conducted aimed to evaluate the performance of various methods across different emotion recognition tasks, including binary sentiment classification (Acc-2), 5-class emotion classification (Acc-5), 7-class emotion classification (Acc-7), and emotion intensity regression (MAE).
The results demonstrate that our proposed Cross-Perception Model consistently outperformed all other approaches on both datasets and across all evaluation metrics. On the CMU-MOSEI dataset, our model achieved an accuracy of 62.6%, 64.7%, and 88.0% for the 7-class, 5-class, and binary classification tasks, respectively, surpassing the previous state-of-the-art methods by a significant margin. Similarly, on the CMU-MOSI dataset, our model attained an accuracy of 60.3%, 60.5%, and 86.5% for the corresponding tasks. Moreover, our approach achieved the lowest mean absolute error (MAE) of 0.457 and 0.614 on the CMU-MOSEI and CMU-MOSI datasets, respectively, for the emotion intensity regression task.
The superior performance of our Cross-Perception Model in this multimodal sentiment analysis task can be attributed to several key factors. First, the incorporation of multigranularity encoders allows the model to capture modality-specific information at different levels of granularity, enabling a more comprehensive understanding of the sentimental content in each modality. Second, the contrastive learning module enhances the model’s ability to distinguish and fuse information from different modalities, leading to more robust and discriminative multimodal representations. Finally, the hierarchical expert structure leverages both modality-specific and shared representations, allowing the model to adapt to the varying information requirements of different emotion recognition tasks.
These findings demonstrate that our Cross-Perception Model, initially designed for inner speech recognition, can be effectively applied to other multimodal tasks such as sentiment analysis. This highlights the versatility and robustness of our proposed architecture in capturing and integrating information from multiple modalities across different domains.
6. Discussion
The results of our comprehensive experiments provide substantial evidence for the efficacy of the proposed Cross-Perception Model. The notable performance gains over both unimodal and existing multimodal approaches (4.3 and 4.2 percentage points over MM-CNN for the social and numeric tasks, respectively) support our approach to multimodal fusion for inner speech recognition. The ablation study highlights the importance of each proposed component, particularly the EEG-Raw and fMRI branches, supporting the rationale behind our multibranch architectural design. The modality contribution analysis illustrates how the model dynamically draws on different modalities for different word categories, substantiating the advantage of the adaptive fusion mechanism. The confusion matrix analysis shows the model’s ability to distinguish between subtle inner speech categories, a pivotal requirement for BCI applications. The temporal and spatial analyses provide insight into how the model integrates information from modalities with disparate temporal and spatial resolutions, capturing complementary aspects of inner speech processes. Finally, the robust cross-participant performance attests to the model’s resilience and generalizability, which are essential for practical BCI applications.
These results collectively demonstrate that our Cross-Perception Model successfully addresses the challenges of multimodal inner speech recognition, offering a significant advancement in the field of brain–computer interfaces. The model’s ability to effectively combine information from EEG raw data, EEG Markov Transition Fields, and fMRI images enables more accurate and robust decoding of inner speech, paving the way for more sophisticated and reliable BCI systems.
7. Limitations
Our study, while demonstrating significant advancements in multimodal inner speech recognition, is subject to several limitations. These can be broadly categorized into data-related constraints, methodological challenges, and generalizability issues. The Bimodal Dataset on Inner Speech, though valuable, is limited in size and participant diversity, potentially affecting the generalizability of our findings. Moreover, the non-simultaneous recording of EEG and fMRI data may have introduced temporal misalignments in our multimodal representations. From a methodological perspective, our model’s increased computational complexity, while justifying its performance gains, poses challenges for real-time applications and resource-constrained environments. The interpretability of our model remains an open challenge, particularly in understanding the specific contributions of different brain regions and temporal patterns to inner speech recognition. Lastly, the current study’s focus on a limited set of inner speech categories and the potential for significant individual variability in inner speech patterns may limit the model’s applicability across broader vocabularies and diverse populations. These limitations collectively underscore the need for further research to enhance the robustness, efficiency, and generalizability of multimodal inner speech recognition techniques.
8. Conclusions
In this paper, we presented a novel Cross-Perception Model for multimodal inner speech recognition, leveraging both EEG and fMRI data. Our approach addresses several key challenges in the field of brain–computer interfaces and contributes to the advancement of multimodal fusion techniques for neuroimaging data. The proposed model demonstrated significant improvements over existing unimodal and multimodal methods, achieving higher accuracy and F1-scores across both social and numeric category classification tasks.
The success of our Cross-Perception Model can be attributed to its innovative architecture, which includes multigranularity encoders, a cross-perception expert structure, and an attention-based adaptive fusion strategy. These components work in concert to effectively capture and integrate the complementary information provided by EEG and fMRI data, resulting in more robust and accurate inner speech recognition.
While our results are promising, there are several areas for future research and development. The current study is limited to a relatively small dataset with a restricted vocabulary, and future work should explore the scalability of our approach to larger datasets with more diverse inner speech content. Additionally, the non-simultaneous nature of the EEG and fMRI recordings in the dataset may have introduced some inconsistencies in the multimodal representations. Investigating the model’s performance on simultaneously recorded EEG-fMRI data could provide further insights into its effectiveness.
In conclusion, our Cross-Perception Model represents a significant step forward in multimodal inner speech recognition, demonstrating the potential of combining EEG and fMRI data for more accurate and reliable brain–computer interfaces. As research in this field continues to advance, we anticipate that multimodal approaches like ours will play a crucial role in developing more sophisticated and practical BCI systems, with potential applications in assistive technologies, communication devices, and neuroscientific research.