1. Introduction
Keyword spotting (KWS) is a technique for detecting keywords in audio streams and is commonly used in voice-activated devices to capture user commands or requests. By contrast, sound source localization (SSL) estimates the position of one or multiple sound sources based on the recorded acoustic signals, which is critical in identifying the source of sound.
Traditional signal processing techniques have been used in the past for both KWS and SSL; however, they do not perform well in noisy and reverberant environments because their environment models are imperfect. With recent advances in deep neural networks (DNNs), several DNN-based approaches have demonstrated significant improvements in accuracy and robustness [6]. In addition, because KWS and SSL applications are mainly required in mobile or low-power Internet of Things (IoT) devices, many studies have been conducted to reduce the amount of computation and memory required for model operation [
Both KWS and SSL play key roles in interactive voice assistant systems. However, relying on only one of the two functions is insufficient for end users who need a more active interaction with the system. For example, in a video conference, an IoT device with a display and camera is required to face the speaker who utters a keyword. For efficient interactive communication between the screen and camera, the speaker should face the display and be within the camera-viewing angle for other viewers. To address this challenge, KWS and SSL can be used together to mimic the human sound-recognition process. Among the many approaches, embedding KWS and SSL functionalities in either a sequential or parallel manner can be considered the most intuitive way to mimic this process.
The sequential approach executes each function individually, whereas the parallel approach executes both functions concurrently and outputs the result. Compared with the sequential approach, the parallel approach provides a faster inference time to obtain the output; however, it requires a higher computational load within the same timeframe. By contrast, although the sequential approach requires longer inference times, it can run on devices with limited memory and computational resources. Because of these differences in execution policies, distinct variations exist in the inference time, computational workload, and memory usage between the two approaches.
However, despite the advantages of these approaches, limitations arise because KWS and SSL systems are typically executed on mobile and low-power IoT devices. For example, a parallel approach may impose an excessive computational burden beyond the capabilities of such devices. The sequential approach can result in long latency, which lowers the quality of the experience for end users. Consequently, to overcome the limitations in each approach, reduce the latency in the systems, and enable their execution on low-power devices, innovative techniques are required to mitigate the computational and memory requirements.
In this context, multi-task learning (MTL) is an effective approach for achieving our goal. MTL involves training a model on multiple tasks simultaneously, reducing the number of parameters and improving computational efficiency. For example, in self-driving cars, tasks such as object classification, lane detection, and sign recognition must be performed concurrently using real-time images. Instead of creating separate neural networks for each task, which would lead to long inference times and high memory consumption, MTL leverages shared components in input data recognition and feature extraction. Thus, resource usage can be reduced while maintaining high accuracy. In addition, the interdependencies between tasks enable the model to learn from multiple sources of information, leading to improved task performance.
Furthermore, several recent studies have reported an improved performance of each task when MTL was performed with KWS and other tasks (domain classification, voice activity detection, speaker verification, etc.) [13]. In [10], the authors considered spectral mapping as an auxiliary task for their KWS model to improve KWS performance. In [11], the authors proposed a multi-task network that performs KWS and speaker verification (SV) simultaneously to fully utilize interrelated domain information. In [12], the authors optimized a DNN to predict both domain types and keywords simultaneously. The study in [13] applied multi-task learning to KWS and SV to leverage user information in the KWS system.
Motivated by the aforementioned research, we propose a single-neural-network model for KWS and SSL using multi-task learning. The model was constructed with 1D mobile inverted bottleneck convolution (MBConv) blocks using self-attention and depthwise convolution and was trained based on MTL with a hard parameter-sharing method. Through an analysis that calculated the cosine similarity at each 1D MBConv block, we found that the features required for the KWS and SSL tasks differed significantly in the deep layers while being highly similar in the early layers. By considering both tasks simultaneously in the early layers, the learned feature maps contribute to the generalization and effectiveness of each task. To facilitate model training, we created a synthetic multichannel audio dataset by simulating room impulse responses (RIRs) at various directions, heights, and distances using an RIR generator [14] and the TensorFlow speech command dataset [15]. Although existing research has leveraged multi-task learning for KWS or SSL separately, to the best of our knowledge, our approach is the first successful integration of KWS and SSL based on multi-task learning. Our contributions are summarized as follows:
Integration of KWS and SSL: The proposed approach achieves a significant milestone by successfully integrating Keyword Spotting (KWS) and Sound Source Localization (SSL) based on Multi-Task Learning (MTL). This breakthrough addresses the limitations associated with sequential KWS and SSL methods, providing a more efficient solution in terms of both memory usage and computation.
Validation Procedure for Shared Encoder Effectiveness: We introduce a validation procedure designed to assess the efficacy of a shared encoder within the framework of Multi-Task Learning (MTL). This method includes the utilization of layer-wise feature map cosine similarity, offering an optimized guideline for the design of MTL models. This distinctive contribution enhances the precision of MTL model design, thereby simplifying the overall design process.
2. Methods
As shown in Figure 1, the proposed model is designed to perform both SSL and KWS as a single multi-task network. The model consists of a shared encoder and task-specific layers. It takes a multichannel mixture as the input and produces two outputs: the probability of a keyword's presence in the mixture and the prediction of the sound source location for the keyword utterance. The multichannel mixture represents audio data in the form of raw waveforms that combine the multichannel target signal data and multichannel interference signals. The multichannel target signal data contain utterances of specific keywords, whereas the multichannel interference signals include real-world noise or non-keyword utterances.
The shared encoder begins with a standard 1D convolutional layer, followed by several 1D MBConv blocks. It is designed to extract common high-level features used by both KWS and SSL tasks from a multichannel mixture. This design choice stems from the prediction that the high-level features that need to be extracted will be similar, in that the two tasks process the same sound data and focus on keyword utterance regions within the data. This architecture enables both latency reduction and parameter-size optimization.
The task-specific layers are divided into two branches: KWS and SSL. Each branch consists of several 1D MBConv blocks, standard 1D convolutional layers at the end, global max pooling, dropout, and fully connected layers. These layers extract task-specific low-level features from the high-level features obtained from the shared encoder, enabling each task to perform well (see Figure 2).
2.1. One-Dimensional Mobile Inverted Bottleneck Conv (MBConv) Block
The original MBConv block, first introduced in MNASNet [16], is a module that integrates depth-wise convolution, inverted residual, and squeeze-and-excitation techniques. Because the MBConv block was originally designed for processing 2D data, such as images, in this study, the 2D convolution of the MBConv block was replaced with a 1D convolution to create a 1D MBConv block, as shown in Figure 3. In [17], the authors took the building blocks of ResNet and SENet, the state-of-the-art image classification models at the time, transformed them into 1D CNN architectures, and showed improved results in music auto-tagging, KWS, acoustic scene tagging, and SSL. The 1D MBConv block is morphologically similar to the ReSe-n block; however, because it uses depth-wise convolution rather than the standard 1D convolution, a lighter model can be created with fewer parameters and lower computational complexity.
Depthwise separable convolution involves the decomposition of a standard convolution operation into depthwise and pointwise convolutions. Depthwise convolution applies a separate filter to each input channel, whereas pointwise convolution uses a 1 × 1 convolution to combine the outputs of the depthwise convolution across all channels. This approach significantly reduces the computational cost and memory requirements of the convolution operations. With an input of size $L \times C_{in}$ (length by channels), a kernel of size $K$, and $C_{out}$ output channels, the computation of the standard convolution is $L \cdot K \cdot C_{in} \cdot C_{out}$, and that of the depth-wise separable convolution is $L \cdot C_{in} \cdot (K + C_{out})$, assuming unit stride and an output of the same length as the input.
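The cost comparison can be checked with a few lines of arithmetic. The following is a minimal sketch that counts multiply-accumulate operations for a 1D layer, assuming unit stride and an output of the same length as the input; the function names and the layer sizes in the usage lines are illustrative and are not taken from the proposed model.

```python
def standard_conv1d_macs(length, kernel, c_in, c_out):
    """Multiply-accumulates of a standard 1D convolution (stride 1, same-length output)."""
    return length * kernel * c_in * c_out


def depthwise_separable_conv1d_macs(length, kernel, c_in, c_out):
    """Multiply-accumulates of a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = length * kernel * c_in   # one filter per input channel
    pointwise = length * c_in * c_out    # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Illustrative sizes only (hypothetical, not from the paper).
length, kernel, c_in, c_out = 16000, 9, 64, 64
std = standard_conv1d_macs(length, kernel, c_in, c_out)
sep = depthwise_separable_conv1d_macs(length, kernel, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.4f}")
```

Under these assumptions, the ratio of the two costs equals $1/C_{out} + 1/K$, so the saving grows with both the kernel size and the number of output channels.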
An inverted residual is a design pattern that directly connects the bottlenecks of a block using a shortcut and differs from traditional residual connections in terms of the starting layer. The inverted residual begins with a bottleneck layer that reduces the spatial dimensions and channel numbers of the input feature map, whereas the input and output feature maps have the same spatial dimensions and channel numbers as those in the traditional residual connection. This enables a network to reduce the number of computations and model parameters, while enabling it to learn efficiently.
Finally, the squeeze-and-excitation mechanism proposed in [19] enhances the performance of the network by learning to selectively emphasize essential features and suppress less-relevant ones. It scales the feature maps by computing the relationship between each channel and applying a weight factor. The squeeze-and-excitation mechanism involves two steps. First, global average pooling is applied to the input feature maps to output a vector containing information regarding the importance of each feature map. Second, two fully connected (FC) layers are applied to the vector to output a set of attention weights. These weights are then applied to the input feature maps, scaling the activation of each feature map using their corresponding attention weights.
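The two steps of squeeze-and-excitation can be sketched for a 1D feature map as follows. This is a minimal illustration in plain Python, assuming a ReLU after the first FC layer and a sigmoid after the second (the common design), with biases omitted; the function name and the weights `w1` and `w2` are hypothetical and do not come from the proposed model.

```python
import math


def squeeze_excite_1d(x, w1, w2):
    """Apply squeeze-and-excitation to x of shape [C][T].

    w1 has shape [C_reduced][C] and w2 has shape [C][C_reduced] (biases omitted).
    """
    # Squeeze: global average pooling over time, one value per channel.
    squeezed = [sum(channel) / len(channel) for channel in x]
    # Excitation, first FC layer with ReLU (assumed activation).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, second FC layer with sigmoid, giving one gate per channel.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale every channel of the input by its attention weight.
    return [[g * value for value in channel] for g, channel in zip(gates, x)]


features = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two time steps
w1 = [[1.0, 1.0]]                     # reduce 2 channels to 1 hidden unit
w2 = [[0.0], [0.0]]                   # zero weights give a gate of 0.5 per channel
print(squeeze_excite_1d(features, w1, w2))
```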
2.2. Multi-Task Learning
The proposed network structure and learning approach are based on MTL to optimize memory usage and reduce the inference time for both KWS and SSL tasks. MTL is a machine-learning paradigm that aims to train a model to minimize the prediction errors for multiple related tasks simultaneously. Typically, models are trained to perform only one task; however, MTL can improve the generalization performance of a model by leveraging the commonalities across different tasks. For example, in computer vision, image classification and image segmentation models involve different tasks and output forms. However, common features learned from image classification can assist in image segmentation, and object area segmentation information can improve the classification task. By jointly learning these tasks, MTL can result in a more accurate and efficient model than training each task independently.
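Hard parameter sharing in multi-task learning reduces computational complexity by enabling the sharing of model parameters. By sharing parameters among tasks, the overall number of trainable parameters is reduced, resulting in lower memory requirements and lower computational overhead during both training and inference. Let $\theta_{s}$ represent the shared parameters, and $\theta_{KWS}$ and $\theta_{SSL}$ be the task-specific parameters for KWS and SSL, respectively. The total number of parameters in the shared model is as follows:
$$N_{shared} = |\theta_{s}| + |\theta_{KWS}| + |\theta_{SSL}|$$
In contrast, independent models for each task would require the following:
$$N_{indep} = |\theta_{s}^{KWS}| + |\theta_{KWS}| + |\theta_{s}^{SSL}| + |\theta_{SSL}|$$
where $\theta_{s}^{KWS}$ and $\theta_{s}^{SSL}$ represent redundant parameters learned separately for each task. By sharing parameters between tasks, the reduction in parameters is as follows:
$$\Delta N = N_{indep} - N_{shared} = |\theta_{s}^{KWS}| + |\theta_{s}^{SSL}| - |\theta_{s}|$$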
This reduction lowers memory usage and computational overhead, which is especially critical in resource-constrained environments.
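As a small worked example of this bookkeeping, the sketch below compares the parameter count of a hard-sharing model with that of two independent models. It assumes that each independent model would carry its own copy of an encoder of the same size as the shared one; the function name and the counts used are hypothetical.

```python
def hard_sharing_param_counts(n_shared, n_kws, n_ssl):
    """Return (shared-model total, independent-models total, reduction)."""
    shared_model = n_shared + n_kws + n_ssl
    # Assumption: each single-task model duplicates an encoder of size n_shared.
    independent = (n_shared + n_kws) + (n_shared + n_ssl)
    return shared_model, independent, independent - shared_model


# Hypothetical counts for illustration only.
shared_total, indep_total, saved = hard_sharing_param_counts(100_000, 20_000, 30_000)
print(shared_total, indep_total, saved)
```

Under this assumption, the saving equals the size of the shared encoder, so sharing more modules removes more duplicated parameters.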
The shared layers in hard parameter sharing capture common features across tasks, allowing the model to leverage shared knowledge and extract relevant information more effectively. This not only improves the model’s overall performance, but also enhances generalization across different tasks. By jointly learning from multiple tasks, the model can benefit from the complementary information contained in each task, thereby improving its accuracy and robustness. The overall network comprises shared layers, which are directly shared among tasks to learn the common features of different tasks, and task-specific layers, which are unique to each task and learn specific features for individual tasks.
During the training phase, the loss functions of the KWS and SSL tasks are combined to optimize the network parameters. The KWS task is formulated as a binary classification task, and its loss function can be expressed as follows:
$$\mathcal{L}_{KWS} = -\left[ y \log f_{KWS}(f_{enc}(x)) + (1 - y) \log\left(1 - f_{KWS}(f_{enc}(x))\right) \right]$$
where $y$ and $x$ represent the ground truth label (one for positive instances and zero for negative instances) and input, respectively; $f_{enc}$ and $f_{KWS}$ represent the shared encoder and task-specific layers with a sigmoid function, respectively; and $\log$ denotes the logarithm function. Minimizing this loss encourages the network to accurately classify the input into the corresponding keyword or non-keyword categories.
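The binary classification loss for KWS can be sketched as a standard binary cross-entropy on the sigmoid output. The snippet below is a minimal illustration for a single sample; the function name and the clipping constant `eps` are assumptions added for numerical safety and are not part of the original formulation.

```python
import math


def kws_bce_loss(y_true, p_keyword, eps=1e-7):
    """Binary cross-entropy for one sample.

    y_true is 1 for a keyword and 0 otherwise; p_keyword is the sigmoid output.
    """
    # Clip the probability so that log() never receives exactly 0.
    p = min(max(p_keyword, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


print(kws_bce_loss(1, 0.9))   # confident and correct: small loss
print(kws_bce_loss(1, 0.1))   # confident and wrong: large loss
```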
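By contrast, the SSL task focuses on estimating the direction of the sound source that utters the keyword. This can be achieved through regression (directly estimating the angle) or classification (dividing the space into regions). In this research, a combination of classification and regression is used, similar to the strategy implemented in [4], where the coarse location is first classified, and then the fine location is estimated within that region. The SSL loss function can be represented as follows:
$$\mathcal{L}_{SSL} = \mathcal{L}_{cls}\left(y_{c},\, g_{c}(f_{SSL}(f_{enc}(x)))\right) + \mathcal{L}_{reg}\left(y_{f},\, g_{f}(f_{SSL}(f_{enc}(x)))\right)$$
where $y_{c}$ and $y_{f}$ represent the ground truth labels of coarse and fine locations, respectively; $f_{SSL}$, $g_{c}$, and $g_{f}$ represent the task-specific layers, FC layer with SoftMax function for coarse location classification, and FC layer with sigmoid function for fine location regression, respectively; and $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ denote the classification and regression loss terms. Finally, the total loss function can be formulated as follows:
$$\mathcal{L}_{total} = \lambda_{KWS}\,\mathcal{L}_{KWS} + \lambda_{SSL}\,\mathcal{L}_{SSL}$$
where $\lambda_{KWS}$ and $\lambda_{SSL}$ are the scaling factors of each loss function.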
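The coarse-to-fine SSL loss and its combination with the KWS loss can be sketched as below. This is only an illustration under stated assumptions: the coarse term is a cross-entropy over SoftMax outputs, the fine term is a squared error on a sigmoid output in [0, 1], the space is split into equal angular sectors, and the default weights are placeholders. None of these specifics (including all function and parameter names) are confirmed by the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ssl_loss(coarse_logits, coarse_label, fine_pred, fine_label):
    """Return (classification term, regression term) for one sample."""
    probs = softmax(coarse_logits)
    cls_term = -math.log(probs[coarse_label])   # cross-entropy (assumed)
    reg_term = (fine_pred - fine_label) ** 2    # squared error (assumed)
    return cls_term, reg_term


def total_loss(kws_term, ssl_term, w_kws=1.0, w_ssl=1.0):
    """Weighted sum of the two task losses; the weights are placeholders."""
    return w_kws * kws_term + w_ssl * ssl_term


def decode_angle(coarse_logits, fine_pred, n_regions):
    """Map a coarse region and a fine offset in [0, 1] to degrees (assumed scheme)."""
    width = 360.0 / n_regions
    region = max(range(n_regions), key=lambda i: coarse_logits[i])
    return (region + fine_pred) * width


cls_term, reg_term = ssl_loss([0.0, 5.0, 0.0, 0.0], 1, 0.5, 0.4)
print(total_loss(kws_term=0.1, ssl_term=cls_term + reg_term))
print(decode_angle([0.0, 5.0, 0.0, 0.0], 0.5, n_regions=4))
```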
3. Results
This section details the simulation of a synthetic multichannel audio dataset, experimental setting, and evaluation of the model. The cosine similarity between the KWS and SSL models was used to validate the design. Then, the model was evaluated in terms of KWS accuracy, DOA accuracy, DOA error, and latency.
3.1. Simulation of Synthetic Multichannel Audio Dataset
In this study, we attempted to create a multichannel audio dataset that reflects difficult situations in which not only the sound of uttering keywords, but also the speech sounds of multiple speakers and various noises exist simultaneously. To achieve this, the “TensorFlow speech command dataset” and the RIR generator were used for data generation (see Table 1).
An RIR generator was used to simulate sound reflection, refraction, and attenuation. Given information such as the room size, microphone and speaker positions, and reverberation time, the generator produces data reflecting the corresponding RIR. The room sizes were (3, 3, 2.5), (8, 10, 6), (6, 8, 3), (3, 7, 3), (8, 5, 4), and (10, 7.5, 3.5). The sampling rate used to generate sounds was 16 kHz. The reverberation time of the simulated signals ranged from 0.3 s to 0.6 s. Speech signals were assumed to be acquired using a circular array of six microphones arranged at equal angles. The microphone array positions were randomly set within each room. The distance between the sound source and microphone array ranged from 0.2 to 4.24 m, and the height ranged from 0 to 2 m.
Single-channel audio data were convolved with various RIRs to produce multichannel audio data. The generated multichannel target signal, noise, and up to two interference sources were combined at random amplitude ratios. The amplitude ratio between the target and interference signals is referred to as the signal-to-interference ratio (SIR), which is expressed by the following:
$$\mathrm{SIR} = 20 \log_{10}\left(\frac{A_{t}}{A_{i}}\right)\ \mathrm{dB}$$
where $A_{t}$ is the amplitude of the target signal and $A_{i}$ is the amplitude of the interference signal. For each generated sample, SIR values were randomly assigned within the range of −15 dB to 20 dB. An SIR of less than zero indicates that the amplitude of the interference signal is greater than that of the target signal, which reflects the phenomenon in which the target signal is difficult to hear owing to ambient noise.
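Mixing at a prescribed SIR can be sketched as follows. The snippet assumes that "amplitude" is measured as the root-mean-square value of each signal and that the interference is rescaled while the target is left unchanged; both choices, as well as the function names, are assumptions for illustration rather than the procedure used to build the dataset.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def sir_db(target, interference):
    """Signal-to-interference ratio in dB, computed from RMS amplitudes."""
    return 20.0 * math.log10(rms(target) / rms(interference))


def mix_at_sir(target, interference, target_sir_db):
    """Scale the interference so that the mixture has the requested SIR."""
    gain = rms(target) / (rms(interference) * 10.0 ** (target_sir_db / 20.0))
    scaled = [gain * s for s in interference]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, scaled


target = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture, scaled = mix_at_sir(target, interference, target_sir_db=-15.0)
print(sir_db(target, scaled))   # close to -15 dB: interference louder than target
```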
3.2. Experimental Settings
All the experiments were performed using synthetic multichannel audio data. The dataset was split at an 8:1:1 ratio for training, validation, and testing. Each data sample had a duration of 1 s and a sampling rate of 16 kHz. The dataset was designed assuming that the signal was recorded using a circular 6-microphone array with a diameter of 9.2 cm.
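For reference, the geometry of such an array can be written down directly. The sketch below places six microphones at equal angles on a circle with a diameter of 9.2 cm; the orientation (first microphone on the positive x-axis), the planar coordinates, and the function name are assumptions made for illustration.

```python
import math


def circular_array_positions(n_mics=6, diameter_m=0.092, center=(0.0, 0.0)):
    """Return (x, y) positions in meters of microphones equally spaced on a circle."""
    radius = diameter_m / 2.0
    positions = []
    for k in range(n_mics):
        # Assumed orientation: microphone 0 lies on the +x axis.
        angle = 2.0 * math.pi * k / n_mics
        positions.append((center[0] + radius * math.cos(angle),
                          center[1] + radius * math.sin(angle)))
    return positions


for index, (x, y) in enumerate(circular_array_positions()):
    print(f"mic {index}: x={x:+.4f} m, y={y:+.4f} m")
```

With six microphones, adjacent elements form a regular hexagon, so their spacing equals the radius (4.6 cm).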
To establish baselines, separate models were trained for the KWS and SSL tasks as well as for the MTL. Considering each of the eight 1D MBConv blocks, along with the first and last convolutional layers that constitute the model, the baseline model comprises 10 modules. The multi-task models varied in the number of modules of the shared encoder, ranging from two to eight, while maintaining a total of 10 modules. For example, if the shared encoder has three modules, then each task-specific layer has seven. We used a batch size of 32 and an initial learning rate of $5 \times 10^{-4}$, which decayed by a factor of 0.1 at the 23rd and 30th epochs. These hyperparameters (batch size, learning rate, and the weights for each task's loss function) were tuned to achieve the best performance with the 8-module shared encoder.
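The models were optimized using the ADAM optimizer [20], which combines the strengths of momentum-based optimization and adaptive learning rates. In ADAM, parameters $\theta$ are updated using estimates of the first moment (mean) and second moment (variance) of the gradients. The update equations are as follows:
$$m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1})\, g_{t}, \qquad v_{t} = \beta_{2} v_{t-1} + (1 - \beta_{2})\, g_{t}^{2}$$
$$\hat{m}_{t} = \frac{m_{t}}{1 - \beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1 - \beta_{2}^{t}}, \qquad \theta_{t} = \theta_{t-1} - \eta\, \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$$
Here, $g_{t}$ is the gradient at step $t$, and $\beta_{1}$ and $\beta_{2}$ control the decay rates of the moment estimates. $\eta$ is the learning rate, and $\epsilon$ prevents division by zero. This algorithm allows ADAM to adapt the learning rate for each parameter, resulting in faster convergence and more robust optimization across diverse tasks.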
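A single ADAM update and the step-wise learning-rate decay described in the experimental settings can be sketched in plain Python. The decay rates `beta1=0.9` and `beta2=0.999` and `eps=1e-8` are the commonly used defaults and are assumed here, as are the function names, the 1-based epoch counting, and the convention that the decay takes effect from the milestone epoch onward.

```python
import math


def adam_step(theta, grad, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update on lists of parameters; t is the 1-based step index."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


def lr_at_epoch(epoch, base_lr=5e-4, milestones=(23, 30), gamma=0.1):
    """Step decay: multiply the learning rate by gamma at each milestone epoch."""
    n_decays = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** n_decays


theta, m, v = [1.0], [0.0], [0.0]
theta, m, v = adam_step(theta, [2.0], m, v, t=1)
print(theta, lr_at_epoch(1), lr_at_epoch(23), lr_at_epoch(30))
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly one learning rate regardless of the gradient scale.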
We used various baseline models to conduct a comparative experiment with our optimized multi-task model. The baseline models included BC-ResNet-8 [21], TC-ResNet-8 [8], Res-8 [22], and Simple 2D CNN. It is worth noting that BC-ResNet-8, TC-ResNet-8, and Res-8 are renowned for their exceptional performance in KWS. All the models were transformed using the hard parameter-sharing approach. While the proposed model takes raw waveforms in the time domain as input, the remaining baseline models utilize mel spectrograms in the time-frequency domain as input. The Hann window size is set at 480, the fast Fourier transform size is 480, the length of the hop is 160, and the number of mel filter banks is 40.
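To make these spectrogram settings concrete, the sketch below converts them into durations and an input shape for a 1 s clip at 16 kHz. The frame-count formulas follow the usual short-time Fourier transform conventions (with and without centered padding); which convention the baselines used is not stated, so both are shown as assumptions, and the function name is illustrative.

```python
def stft_frame_count(n_samples, win_length, hop_length, center=False):
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        # Centered framing pads half a window on each side (common library default).
        return 1 + n_samples // hop_length
    # No padding: only windows that fit entirely inside the signal are counted.
    return 1 + (n_samples - win_length) // hop_length


SAMPLE_RATE = 16000
WIN_LENGTH, HOP_LENGTH, N_MELS = 480, 160, 40

n_samples = SAMPLE_RATE * 1  # 1 s clip
print("window (ms):", 1000.0 * WIN_LENGTH / SAMPLE_RATE)
print("hop (ms):", 1000.0 * HOP_LENGTH / SAMPLE_RATE)
print("frames, no padding:", stft_frame_count(n_samples, WIN_LENGTH, HOP_LENGTH))
print("frames, centered:", stft_frame_count(n_samples, WIN_LENGTH, HOP_LENGTH, center=True))
print("mel bands per frame:", N_MELS)
```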
To demonstrate the task similarity between the KWS and SSL tasks, we computed the cosine similarity of the feature maps for each module of the two models trained on KWS and SSL using single-task learning. Cosine similarity is a widely used metric for measuring the similarity between vectors. A cosine similarity value close to one indicates vectors with similar directions, whereas a value close to −1 indicates vectors pointing in opposite directions. A value of zero signifies orthogonal vectors. Cosine similarity was calculated using the following formula:
$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}, \qquad S = \frac{1}{N}\sum_{i=1}^{N} \cos\left(F_{i}^{KWS}, F_{i}^{SSL}\right)$$
where $A$ and $B$ represent vectors, $F_{i}^{KWS}$ and $F_{i}^{SSL}$ are feature maps of channel $i$ of a specific block of each KWS and SSL model, respectively, and $N$ is the number of feature maps of the block.
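The layer-wise comparison can be sketched as follows: the cosine similarity is computed per channel between the feature maps of the two single-task models and then averaged over the channels of the block. Averaging over channels and returning zero for an all-zero feature map are assumptions of this sketch, as are the function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero feature map
    return dot / (norm_a * norm_b)


def block_similarity(feats_kws, feats_ssl):
    """Average per-channel cosine similarity for one block.

    feats_kws and feats_ssl have shape [N channels][T], taken from the same
    block of the KWS and SSL models for the same input.
    """
    sims = [cosine_similarity(fk, fs) for fk, fs in zip(feats_kws, feats_ssl)]
    return sum(sims) / len(sims)


kws_block = [[1.0, 0.0], [1.0, 0.0]]
ssl_block = [[1.0, 0.0], [0.0, 1.0]]
print(block_similarity(kws_block, ssl_block))
```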
To evaluate the performance of each baseline model, the KWS accuracy, DOA error, DOA accuracy, and inference time per sample were used. The KWS accuracy is the ratio of correctly predicted answers to the total number of samples, whereas the DOA error quantifies the deviation between the predicted and ground truth sound source directions. The KWS accuracy is calculated using the following formula:
$$\mathrm{Acc}_{KWS} = \frac{1}{M}\sum_{j=1}^{M} \mathbb{1}\left[\hat{y}_{j} = y_{j}\right]$$
where $M$ denotes the total number of samples, $\hat{y}_{j}$ denotes the KWS prediction, and $y_{j}$ denotes the KWS label.
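The DOA error represents the average difference between the actual and predicted angles. The formula for calculating the DOA error is as follows:
$$\mathrm{Err}_{DOA} = \frac{1}{M}\sum_{j=1}^{M} \left| \phi_{j} - \hat{\phi}_{j} \right|$$
where $M$ is the total number of samples, $\phi_{j}$ is the ground truth angle, and $\hat{\phi}_{j}$ is the predicted angle. The DOA accuracy refers to the accuracy of predicting the location of the speaker who uttered the keyword. If the difference between the predicted and actual angles is smaller than the threshold $\tau$, the position prediction is considered successful. The DOA accuracy calculation formula is as follows:
$$\mathrm{Acc}_{DOA} = \frac{1}{M}\sum_{j=1}^{M} \mathbb{1}\left[ \left| \phi_{j} - \hat{\phi}_{j} \right| < \tau \right]$$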
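The three evaluation metrics can be sketched together as below. The wrap-around handling of angles (so that 350° and 10° differ by 20°) is an assumption of this sketch and is not stated in the text; the function names are illustrative, and the default threshold of 20° follows the value used in this study.

```python
def kws_accuracy(predictions, labels):
    """Fraction of samples whose KWS prediction matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def angular_difference(angle_a, angle_b):
    """Absolute difference between two angles in degrees, wrapped to [0, 180]."""
    diff = abs(angle_a - angle_b) % 360.0
    return min(diff, 360.0 - diff)   # wrap-around is an assumption


def doa_error(true_angles, pred_angles):
    """Mean absolute angular difference in degrees."""
    diffs = [angular_difference(t, p) for t, p in zip(true_angles, pred_angles)]
    return sum(diffs) / len(diffs)


def doa_accuracy(true_angles, pred_angles, threshold=20.0):
    """Fraction of samples whose angular error is below the threshold."""
    hits = sum(1 for t, p in zip(true_angles, pred_angles)
               if angular_difference(t, p) < threshold)
    return hits / len(true_angles)


print(kws_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(doa_error([10.0, 350.0], [20.0, 10.0]))
print(doa_accuracy([10.0, 350.0], [20.0, 10.0]))
```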
In this study, the threshold was set at 20° to calculate the DOA accuracy for each baseline model. We evaluated the DOA error and accuracy using baseline models and traditional SSL algorithms based on signal processing, such as SRP-PHAT [23], TOPS [24], and MUSIC [
The inference time per sample represents the time required for the model to process a single sample. We measured and compared the inference time per sample of the baseline models to demonstrate the impact of reducing the computation and execution times through hard parameter sharing of the MTL.
The experiments were implemented using the PyTorch framework (version 1.7.1), and the training and validation were performed on an NVIDIA RTX 3090 Ti GPU. The inference times of the models were measured using an Intel Core i7-7700K CPU (8 logical cores).
3.3. Cosine Similarity of KWS and SSL Model
To analyze the relationship between the feature maps of the KWS and SSL models, we calculated the cosine similarity values for each layer of both models. Cosine similarity values were calculated using the previously mentioned formula.
Table 2 lists the cosine similarity values between the KWS and SSL models for different layers. As shown in Table 2, the initial layers exhibited relatively high cosine similarity values, indicating a high degree of similarity between the feature maps. However, as the layer depth increased, the cosine similarity values decreased, suggesting a decrease in the level of similarity between the feature representations of the two models. This result supports our assertion that the KWS and SSL tasks are related: their features are highly similar in the early layers but differ considerably in the deep layers.
3.4. Comparison of Accuracy, DOA Error, and DOA Accuracy of the Single-Task and Multi-Task Models
The performance of the baseline models was evaluated using the test dataset. In Table 3, “single-task (KWS)” and “single-task (SSL)” are the models trained on KWS and SSL, respectively, and “multi-task” is the model trained on KWS and SSL simultaneously using the MTL. The number of shared modules indicates the number of modules used by the shared encoder. The results show that multi-task models with two and three modules in the shared encoder improve the KWS accuracy by 0.32 and 0.12, respectively, compared to single-task models. For the SSL task, multi-task models with four, five, six, seven, and eight modules in the shared encoder exhibited a reduction in DOA error by 0.62°, 1.22°, 1.33°, 2.03°, and 2.06°, respectively, compared to single-task SSL-trained models. Additionally, the DOA accuracy increased by 0.73%, 1.64%, 1.77%, 2.39%, and 2.07% in the respective multi-task models. Overall, the findings suggest that the optimal choice of the training approach and number of shared layers depends on the specific tasks and desired performance metrics. Notably, the multi-task model with a 7-module shared encoder exhibited the best DOA accuracy, closely approaching the best DOA error, while maintaining a KWS accuracy comparable to that of the single-task model. Therefore, considering all metrics together, the 7-module shared encoder model was considered optimal.
3.5. Comparison of Accuracy, DOA Error, and DOA Accuracy of the Baseline Models
We comprehensively evaluated the optimal multi-task model with a 7-module shared encoder, comparing it with baseline models, including BC-ResNet-8, TC-ResNet-8, Res-8, and Simple 2D CNN, utilizing the test dataset. In Table 4, the results reveal that the optimal multi-task model exhibits a slightly lower KWS accuracy compared to the baseline models. However, a noteworthy observation is the substantial improvement in both DOA error and DOA accuracy in the optimal multi-task model. This suggests the model's proficiency in extracting shared features for both KWS and SSL. Considering its overall performance in both KWS and SSL tasks, the optimal multi-task model offers the best balance among the evaluated models.
3.6. Comparison of DOA Error and DOA Accuracy
In Table 5, we present a comparison of the DOA error and DOA accuracy for “single-task (SSL)”, “multi-task (best)”, and traditional signal processing-based SSL algorithms such as SRP-PHAT, TOPS, and MUSIC. The “multi-task (best)” model, utilizing eight modules in its shared encoder, achieved a DOA error of 12.373° and a DOA accuracy of 89.54%, outperforming the second-best “single-task (SSL)” model by 2.06° and 2.07% in these respective metrics. However, the SRP-PHAT, TOPS, and MUSIC algorithms, noted for their high accuracy in ideal environments, exhibited poor performance on our dataset, which featured various environments and noise. The aforementioned results demonstrate the effectiveness of MTL in improving SSL performance and highlight the challenging nature of the generated dataset.
3.7. Comparison of Inference Time per Sample and the Number of Parameters of the Baseline Models
Table 6 indicates that the multi-task models demonstrated a faster inference time per sample than the single-task models, which sequentially processed each task. This improvement is attributed to the single-task models requiring separate encoders for feature extraction in each task, whereas the proposed system utilizes a shared encoder. Consequently, as the number of shared layers in the encoder increased, the inference time per sample decreased. In summary, the proposed multi-task models with shared layers not only achieved comparable or improved performance, but also provided a notable reduction in inference time per sample compared to the single-task models. Notably, the multi-task model with a 7-module shared encoder, which has been identified as optimal in terms of various metrics, also exhibits a marked improvement in inference time per sample. Specifically, it was 20.55 ms faster than single-task models, highlighting its efficiency and suitability for deployment in hardware with limited computational power and memory resources.
4. Discussion
This study proposed an approach to perform KWS and SSL with a single neural network by extracting common features of input data with a common encoder based on hard parameter sharing of the MTL. The core limitation of sequential KWS and SSL is that they require increased memory and inference times. To overcome this limitation, the proposed approach utilizes MTL to improve both efficiency and various metrics, including KWS accuracy, DOA accuracy, and DOA error.
The experimental results demonstrate the potential benefits of the proposed model, which offers faster and more accurate sound signal data recognition capabilities that can significantly benefit various industries. By accurately detecting the spatial localization of an event and recognizing specific keywords, this model can improve efficiency, accuracy, and automation in many real-world applications. For instance, accurate keyword spotting and SSL are crucial for various voice-assistive technologies, such as voice-controlled virtual assistants, smart speakers, and smart displays. The improved efficiency and accuracy of our model contribute to enhancing user quality of experience and expanding the capabilities of these applications.
However, several areas require further improvement and exploration. First, it would be valuable to evaluate the performance of the proposed model using real-world datasets. Despite our efforts to create a challenging synthetic multichannel audio dataset that closely resembles real-world conditions using an RIR generator and the TensorFlow speech command dataset, audio data collected in the real world may exhibit different characteristics, such as reverberations and reflections. Therefore, experiments conducted with real-world datasets would validate the effectiveness of the proposed approach.
Additionally, testing the model under more complex acoustic environments with higher reverberation times (e.g., 0.7–1.5 s) would offer deeper insights into its robustness. Reverberation times of this magnitude, typically found in large public spaces or industrial settings, can significantly distort both KWS and SSL performance. While many studies, including the current one, have focused on environments with shorter reverberation times of under 1 s, which are common in homes, offices, and conference rooms due to their controlled acoustic conditions [28], it has been observed that as reverberation time and distance from the microphone increase, performance in tasks such as speaker recognition and sound localization tends to degrade [29]. Future research should expand the scope to include environments with longer reverberation times, as these present more realistic challenges that can push the model's boundaries. Evaluating its performance in these more acoustically challenging spaces will ensure the model's applicability in a wider array of real-world scenarios, where precise sound localization and keyword detection are essential for successful real-time applications.
Another aspect to consider is the comparison of the results of the proposed model with those of networks of different structures. Although the proposed network model demonstrated improved results across all metrics, the way it would perform in networks with different total numbers of modules, types of modules, and numbers of dimensions remains unknown. Utilizing a single shared encoder ensures reduced latency. However, if the shared features between tasks are not adequately learned, this could potentially lead to decreased performance in other metrics. Hence, by applying MTL for KWS and SSL to networks with various structures, we can examine the differences in the metrics and further validate the effectiveness of our proposed approach.
Finally, improving the performance of both KWS and SSL tasks remains crucial. The multi-task model outperformed the single-task model when up to three modules were shared. However, increasing the number of shared modules led to a decline in KWS performance, while SSL performance improved. This phenomenon can be attributed to the decreased similarity between the features extracted for each task in deeper layers. As deeper layers handle more specialized information, the similarity between the tasks’ feature requirements decreases, leading to this performance gap.
To mitigate this issue, future research will explore dynamic information-sharing techniques, such as attention mechanisms, to enable selective feature sharing based on each task's needs, as seen in [30], where attention mechanisms were used to enhance focus on important regions, improving task performance. Additionally, incorporating task-specific layers in deeper parts of the network could further optimize performance by providing tailored processing for KWS and SSL. These strategies aim to balance the performance between the two tasks, ensuring robust performance under various configurations and shared module settings.
5. Conclusions
In this study, we proposed a novel approach for KWS and SSL using the hard parameter sharing of the MTL. The proposed approach utilizes a shared encoder to extract common features from the input data, which reduces computation and memory requirements compared with sequential KWS and SSL. To enhance the speed and robustness of the network, we employed a modified version of MBConv, which demonstrated promising performance in the image domain for processing 1D data, instead of using the basic standard 1D convolution. Assuming a highly challenging environment, a synthetic multichannel audio dataset was generated and used for network training, validation, and testing. Experimental results on a synthetic multichannel audio dataset show that the proposed approach achieves improvements in latency, KWS accuracy, DOA accuracy, and DOA error.
Certain issues need to be addressed in future research. First, it is necessary to evaluate the effectiveness of the proposed model on real-world datasets to validate its performance in practical scenarios. This will provide valuable insights into the applicability and robustness of the model under real data conditions. Second, it is necessary to investigate the generalizability of the proposed approach to different network structures. The versatility and adaptability of this approach can be assessed by examining its performance in various architectural configurations. Finally, addressing the issue of KWS performance degradation with an increasing number of shared modules is important. Investigating techniques to reduce this degradation and maintain performance will contribute to the continued improvement of the proposed approach.