1. Introduction
In recent years, the rapid advancement of artificial intelligence technology has spurred the further development of intelligent coal mine construction [1]. The introduction of policies, such as the “Intelligent Coal Mine Guide (2021 Edition)” and the “Trial Measures for the Acceptance Management of Intelligent Demonstration Coal Mines”, has underscored the growing necessity of establishing intelligent coal mines that leverage related artificial intelligence technologies [2,3]. Given the unique characteristics of the coal industry, developing an intelligent coal mine that includes a voice interaction system tailored to the sector is crucial for ensuring the safety of coal mine production.
In the coal mine scenario, automatic speech recognition (ASR) systems aimed at handling dialect accents may face the following specific requirements.
High Accuracy. The ASR system needs to be able to accurately recognize speech with local characteristics, even under the influence of background noise and the miners’ dialect accents.
Strong Robustness. The system should be able to work stably in a coal mine environment, with interference from machine noise, blasting sounds, and other work-related noises.
Dialect Adaptability. The ASR system needs to be able to adapt to and recognize specific regional dialects, which may require the collection of dialect datasets and training on them.
Real-time Processing. In emergency situations, the ASR system needs to be capable of processing and responding to voice commands in real-time to enable quick decision-making.
Noise Suppression. The system should have effective noise suppression technology to reduce the interference of environmental noise in speech recognition.
Far-field Recognition Capability. Due to the special nature of the coal mine environment, the ASR system needs to have far-field speech recognition capability, allowing it to accurately capture speech from a greater distance.
Multi-speaker Recognition. The system should be able to handle situations with multiple speakers, distinguishing between the voices of different speakers and accurately recognizing each one.
Contextual Understanding. The ASR system should not only recognize speech but also understand the specific context of the miner’s instructions or reports to execute tasks more accurately.
Ease of Use and Wearability. Considering the working conditions of miners, the ASR system should be user friendly and wearable, without hindering the normal work of the miners.
Safety and Privacy Protection. The system needs to ensure the security of collecting and processing voice data and the privacy protection of the miners.
Diversity of Training Data. To enhance the system’s generalization capability, the training data should include a variety of dialects and accents, as well as different speaking rates and styles.
User Customization. The system may need to offer customization options to allow miners to adjust the system settings according to their own accents and speaking habits.
Northern Shaanxi, one of the most coal-rich regions in China, is at the forefront of coal production and of managing the production environment, conferences, dispatching, and command operations within the coal industry. The integration of a voice interaction system that accommodates the local dialect is vital for enhancing communication efficiency and safety in these contexts.
Management personnel in the coal mining industry predominantly communicate using the North Shaanxi dialect, commonly referred to as “Shaanxi Pu,” which is characterized by a distinct Northern Shaanxi accent. The dialect serves not only as a cultural emblem but also as a carrier of traditional cultural heritage. Consequently, it is imperative to compile a corpus of the North Shaanxi dialect and to develop speech recognition capabilities tailored to the coal mining industry. This initiative is vital for both preserving the regional culture and enhancing operational efficiency within the sector.
To address the challenge of dialect recognition within the field of speech recognition, initial research efforts by scholars involved adapting traditional speech recognition models for dialect recognition purposes. This included the application of linear predictive coding (LPC), dynamic time warping (DTW), and hidden Markov models (HMMs) as technical frameworks for dialect identification. Furthermore, the integration of Gaussian Mixture Model (GMM) technology in speech modeling has notably enhanced recognition rates. For instance, studies [4,5,6] have utilized the GMM to develop dialect recognition systems for the Mongolian dialect, the Chongqing dialect, and the Shuozhou dialect of Shanxi Province.
Due to the reliance of traditional dialect recognition methods on extensive corpora and manual annotation, these approaches incur high costs and often yield only moderate recognition performance. Moreover, the advent of deep learning has transformed the landscape of speech recognition technology.
Deep learning not only streamlines the process of speech recognition but also markedly enhances recognition accuracy. For instance, the study in [7] developed an end-to-end Listen, Attend, and Spell (LAS) model for Tujia speech recognition, incorporating a multi-head attention mechanism to boost the accuracy of Tujia dialect recognition. In [8], an end-to-end dialect speech recognition method based on transfer learning is introduced, which leverages shared feature extraction to enhance the recognition performance of low-resource dialects. The work in [9] presents an end-to-end speech recognition system that integrates a multi-head self-attention mechanism with a residual network (ResNet) and a bidirectional long short-term memory network (Bi-LSTM), significantly improving the recognition of Jiangxi and Hakka dialects. Nonetheless, these models could benefit from further enhancements in incorporating contextual semantic information and capturing positional details.
Currently, end-to-end (E2E) speech recognition technology has yielded substantial research outcomes [10]. Architectures such as Recurrent Neural Networks (RNNs) [11], Convolutional Neural Networks (CNNs) [12,13], self-attention-based Transformer networks [14], and the Conformer [15] have emerged as prominent backbone structures for automatic speech recognition (ASR) models, garnering significant attention. However, the Conformer decoder exhibits limited text generation capabilities, and the Transformer model is computationally intensive, incurs substantial memory costs, and has a weaker ability to capture local features.
Addressing these limitations, this study introduces a Conformer–Transformer–CTC (Connectionist Temporal Classification) fusion approach for dialectal speech recognition systems. The proposed method leverages the audio modeling prowess of the Conformer as the encoder, utilizes the Transformer for text generation as the decoder, and harnesses the flexible alignment capabilities of CTC to construct an end-to-end dialect speech recognition model, thereby enhancing speech recognition accuracy.
The end-to-end speech recognition model proposed in this paper for the North Shaanxi dialect demonstrates innovation in terms of customization for specific dialects, end-to-end architecture, adaptive feature extraction, and robustness against environmental noise.
3. Method
3.1. Corpora Establishment
Currently, there is a significant lack of openly accessible corpora concerning the dialect used within the coal mining sector in Northern Shaanxi. To facilitate research in the realm of speech recognition for the North Shaanxi dialect, this study has successfully compiled a specialized dialect corpus for the Northern Shaanxi coal mining industry. The methodology employed in constructing this corpus is detailed in Figure 1.
The initial phase involves the careful selection of speech materials from the region. The corpus was compiled by analyzing the distinctive features of the North Shaanxi dialect, which are summarized in Table 1. The selection process was guided by the textual content of coal mine dispatch logs, industry-specific terminology from the coal mining sector, and relevant industrial texts. Subsequently, recording scripts were created based on this selected material.
The recording protocol adopted in this study involves the collection of dialect data by 20 volunteers. The recordings are primarily based on the text from the Northern Shaanxi coal mine scene-specific dialect dataset. Among the 20 volunteers, there are 13 males and 7 females, with ages ranging from 18 to 40 years. All participants are native to Northern Shaanxi, and the dialects recorded are exclusively of the Northern Shaanxi variety. The resultant dataset is detailed in Table 2.
To guarantee the quality and integrity of the data, this study employs professional recording equipment to capture the audio. The recordings are saved in WAV format with a 16 kHz sampling rate. Subsequent to recording, the audio files are meticulously checked and annotated using Adobe Audition.
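As a small illustration of this step, the following sketch (not the authors' actual tooling; the directory name and the single-channel assumption are placeholders) checks that each recording matches the 16 kHz WAV protocol using only the Python standard library.

```python
# Minimal sketch: sanity-check that every recording in the corpus directory
# is a 16 kHz WAV file. "corpus_wav/" and the mono-channel check are assumptions.
import wave
from pathlib import Path

EXPECTED_RATE = 16000  # sampling rate stated in the recording protocol

def check_recording(path: Path) -> bool:
    """Return True if the WAV file matches the assumed recording protocol."""
    with wave.open(str(path), "rb") as wf:
        ok = wf.getframerate() == EXPECTED_RATE and wf.getnchannels() == 1
        if not ok:
            print(f"{path.name}: rate={wf.getframerate()} Hz, channels={wf.getnchannels()}")
        return ok

if __name__ == "__main__":
    bad = [p for p in Path("corpus_wav").glob("*.wav") if not check_recording(p)]
    print(f"{len(bad)} file(s) deviate from the recording protocol")
```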
In the final phase of data preparation, all recorded data, including participant information, the original phonetic corpus, and the annotated and processed corpus along with the corresponding text annotations, are systematically organized and stored within a unified corpus repository. This structured approach ensures that the dataset is both comprehensive and accessible for further analysis and use in speech recognition model training.
3.2. Conformer–Transformer–CTC Model Structure
To achieve a better dialect recognition rate in dialect speech recognition models, this paper establishes a Conformer–Transformer–CTC speech recognition system, which includes preprocessing modules (speech preprocessing module, text preprocessing module) and an encoder–decoder (the encoder uses a Conformer, and the decoder uses a combination of a Transformer and CTC for joint decoding).
Figure 2 shows the end-to-end dialect Conformer–Transformer–CTC speech recognition system.
3.2.1. Preprocessing Module
The preprocessing module includes a speech preprocessing module and a text preprocessing module, as shown in Figure 3.
The speech preprocessing module developed in this study comprises a downsampling module (sub-sampling embedding), a convolution module, and positional encoding.
Firstly, the speech features are downsampled. Downsampling is a commonly used technique in data processing that reduces the complexity of data by decreasing its temporal resolution. In the field of speech processing, downsampling can help models better handle issues such as significant variations in speech duration or rapid speaking rates caused by dialects.
When processing dialectal speech, due to the characteristics of dialects, there may be significant variations in the duration of speech, or the speaker may talk at a very fast pace. These characteristics can make it difficult for the model to accurately recognize and process the speech signal. By downsampling, we can reduce the temporal resolution of the speech signal, thereby enabling the model to more effectively handle rapidly changing speech features.
Acoustic features in dialectal speech may exhibit spatiotemporal correlations. This means that in dialectal speech recognition, acoustic features not only have continuity over time but also spatial correlations. Using downsampling techniques, these features can be effectively captured, thereby improving the accuracy and efficiency of speech recognition.
Next, a convolutional module, which includes a 2D convolution layer and a ReLU activation layer, is used to capture the patterns of acoustic features in both the time and frequency dimensions, learning the local features within dialectal speech. At the same time, since dialectal speech may exhibit acoustic features that differ from those of the standard language, the ReLU activation function enhances the model’s expressive power for these features.
Following this, a linear layer is employed to extract dialect-specific acoustic features and patterns, yielding a more tailored feature representation. Additionally, fixed position coding, utilizing sine and cosine functions, is applied to better comprehend the sequential distribution and structure of dialectal speech features, as detailed in Formulas (1) and (2).
Finally, Dropout is applied to randomly suppress certain features, which decreases the model’s reliance on individual features and, in turn, strengthens its robustness and generalization capabilities.
$$\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \qquad (1)$$

$$\mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \qquad (2)$$

PE refers to the position encoding matrix, where pos denotes the specific position of the current character in the sequence, i represents the i-th dimension of the character vector, and $d_{\mathrm{model}}$ indicates the dimension size of the character vector.
Through the aforementioned processes, the speech preprocessing module facilitates dimensional reduction, feature transformation, and position modeling for dialectal speech. This results in a feature representation with enhanced discriminative power, thereby improving the performance and accuracy of dialect recognition. Utilizing this speech preprocessing module enables the model to more effectively adapt to the characteristics of dialectal speech and to extract dialect-specific acoustic features and patterns.
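To make the pipeline concrete, the following is a minimal PyTorch sketch of such a speech preprocessing module, assuming 80-dimensional filterbank inputs, a model dimension of 256, and a dropout rate of 0.1 (illustrative values, not the paper’s reported configuration).

```python
# Illustrative sketch of the speech preprocessing described above: 2D convolutional
# sub-sampling (time resolution reduced by 4x), a linear projection, fixed sinusoidal
# positional encoding, and Dropout. Layer sizes are assumptions.
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)   # Formula (1)
        pe[:, 1::2] = torch.cos(pos * div)   # Formula (2)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):                    # x: (batch, time, d_model)
        return x + self.pe[:, : x.size(1)]

class SpeechPreprocessing(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256, dropout: float = 0.1):
        super().__init__()
        self.subsample = nn.Sequential(      # two stride-2 convs -> 4x temporal downsampling
            nn.Conv2d(1, d_model, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2), nn.ReLU(),
        )
        freq = ((n_mels - 1) // 2 - 1) // 2  # frequency bins remaining after the convs
        self.proj = nn.Linear(d_model * freq, d_model)
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, feats):                # feats: (batch, time, n_mels)
        x = self.subsample(feats.unsqueeze(1))          # (batch, d_model, time', freq')
        b, c, t, f = x.size()
        x = x.transpose(1, 2).contiguous().view(b, t, c * f)
        return self.dropout(self.pos_enc(self.proj(x)))
```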
The text preprocessing module developed in this paper includes an embedding layer, positional encoding, and a convolution module.
Initially, the embedding layer is employed to convert text labels into dense vector representations, capturing the semantic relationships between words. Subsequently, the same position coding technique used in the speech preprocessing module is applied to encode the positional information of the words. Following this, a 1D convolutional layer is utilized to extract implicit positional information and to capture more nuanced local semantic details. Layer normalization is applied to normalize the model’s output at this stage. Finally, the ReLU activation function is introduced to incorporate non-linearities, thereby enhancing the model’s representational capacity.
Through this text preprocessing pipeline, the speech recognition system enhances its comprehension and representation of textual information, thereby improving the overall performance of speech recognition.
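A corresponding sketch of the text preprocessing path, under the same illustrative assumptions (the vocabulary size and model dimension are placeholders), might look as follows; it reuses the positional encoding class from the previous sketch.

```python
# Illustrative sketch of the text preprocessing path: token embedding, the same
# sinusoidal positional encoding as above, a 1D convolution over the token
# dimension, layer normalization, and ReLU.
import torch.nn as nn

class TextPreprocessing(nn.Module):
    def __init__(self, vocab_size: int = 4000, d_model: int = 256, kernel: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_enc = SinusoidalPositionalEncoding(d_model)  # class sketched above
        self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2)
        self.norm = nn.LayerNorm(d_model)
        self.act = nn.ReLU()

    def forward(self, tokens):                # tokens: (batch, length) of label ids
        x = self.pos_enc(self.embed(tokens))  # (batch, length, d_model)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.act(self.norm(x))
```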
3.2.2. Encoder–Decoder
The advantage of the Conformer architecture as an encoder lies in its ability to process both time-domain and frequency-domain features of the audio signal. This dual processing capability allows the Conformer to yield a rich audio representation, enhancing its understanding of the input audio signal and providing more informative features for the subsequent decoder stage. Nevertheless, the Conformer structure exhibits limited text generation capabilities within its decoder. To address this, the present study employs a Transformer-based decoder. The self-attention mechanism inherent in the Transformer is adept at managing long-range dependencies, which allows the decoder to take into account the global context when generating text. This results in the production of accurate and coherent textual outputs. The proposed architecture is depicted in Figure 4.
The Conformer model primarily consists of four key modules: the first feedforward module (feedforward), the multi-head attention module (multi-head self-attention), the convolutional module (convolution module), and the second feedforward module. The Conformer calculates the output $y_i$ for the input vector $x_i$ of the $i$-th encoder block, and the equations are as follows:

$$\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}_{\mathrm{First}}(x_i)$$

$$x'_i = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)$$

$$x''_i = x'_i + \mathrm{Conv}(x'_i)$$

$$y_i = \mathrm{Layernorm}\!\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}_{\mathrm{Second}}(x''_i)\right)$$
Among the components, FFN denotes the feedforward module. “First” signifies the initial feedforward module, succeeded by the “Second” feedforward module. MHSA stands for the multi-head self-attention module, while “Conv” is an abbreviation for the convolution module. “Layernorm” represents layer normalization. Each of these modules incorporates a residual connection to enhance the flow of gradients and stabilize training.
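The composition described by these equations can be sketched as follows; the internal structure of the feedforward, self-attention, and convolution sub-modules is assumed to follow standard Conformer implementations and is not reproduced here.

```python
# Sketch of one Conformer encoder block, composed as in the equations above:
# half-step first FFN, multi-head self-attention, convolution module, half-step
# second FFN, and a final layer normalization, each wrapped in a residual connection.
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model: int, ffn1: nn.Module, mhsa: nn.Module,
                 conv: nn.Module, ffn2: nn.Module):
        super().__init__()
        self.ffn1, self.mhsa, self.conv, self.ffn2 = ffn1, mhsa, conv, ffn2
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)            # first feedforward module (half-step residual)
        x = x + self.mhsa(x)                  # multi-head self-attention module
        x = x + self.conv(x)                  # convolution module
        return self.norm(x + 0.5 * self.ffn2(x))  # second feedforward + Layernorm
```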
$x = (x_1, \ldots, x_T)$ represents the input sequence, $h = (h_1, \ldots, h_T)$ represents the high-level feature sequence produced by the encoder, and $y = (y_1, \ldots, y_U)$ is the output sequence. The probability modeled by the encoder–decoder is as follows:

$$P(y \mid x) = \prod_{u=1}^{U} P(y_u \mid y_{<u}, h)$$
At each time step $u$, the conditional dependency of the output on the encoder features $h$ is calculated through the attention mechanism. The attention mechanism is a function of the current decoder hidden state $s_{u-1}$ and the encoder output features $h_t$, which are compressed into a context vector:

$$e_{u,t} = w^{\top} \tanh\!\left(W s_{u-1} + V h_t + b\right)$$

where $w$, $W$, $V$, and $b$ are the learning parameters. The attention distribution is obtained by normalizing $e_{u,t}$ with the softmax function as follows:

$$a_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'=1}^{T} \exp(e_{u,t'})}$$
Using the attention weights $a_{u,t}$ and the hidden states $h_t$, the corresponding context vector $c_u$ is obtained as a weighted sum as follows:

$$c_u = \sum_{t=1}^{T} a_{u,t}\, h_t$$
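A toy numerical sketch of this attention step, with illustrative dimensions, is shown below.

```python
# Minimal numerical sketch of the additive attention step above: scores e_{u,t}
# from the decoder state and each encoder feature, softmax-normalized weights
# a_{u,t}, and the weighted-sum context vector c_u. Dimensions are toy values.
import torch

T, d_enc, d_dec, d_att = 50, 256, 256, 128
w = torch.randn(d_att)                 # learned vector w
W = torch.randn(d_att, d_dec)          # learned matrix W (decoder state projection)
V = torch.randn(d_att, d_enc)          # learned matrix V (encoder feature projection)
b = torch.randn(d_att)                 # learned bias b

h = torch.randn(T, d_enc)              # encoder output features h_1..h_T
s_prev = torch.randn(d_dec)            # decoder hidden state s_{u-1}

e = torch.tanh(h @ V.T + s_prev @ W.T + b) @ w   # scores e_{u,t}, shape (T,)
a = torch.softmax(e, dim=0)                      # attention distribution a_{u,t}
c = a @ h                                        # context vector c_u, shape (d_enc,)
```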
Finally, a Transformer is used as the decoder, and the training loss function is defined as follows:

$$L_{\mathrm{att}} = -\log P(y \mid x) = -\sum_{u=1}^{U} \log P(y_u \mid y_{<u}, x)$$
3.2.3. CTC Auxiliary Training
Prior work has investigated auxiliary tasks for improving speech recognition by assuming that different layers learn representations at different levels of abstraction [29]. In this paper, the CTC objective function is integrated into the Conformer–Transformer fusion model. Because CTC learns the sequence-to-sequence mapping end-to-end without explicit alignment, it mitigates the irregular alignments that attention-based encoder–decoder (AED) models can produce, leading to better performance.
During training, the model consists of three parts: the Conformer encoder, the Transformer decoder, and the CTC decoder. The end-to-end dialect speech recognition training proceeds as follows.
The output of the encoder is used to calculate the CTC loss. Let the training set be $S$; the CTC loss function is then as follows:

$$L_{\mathrm{CTC}} = -\sum_{(x, y) \in S} \ln P(y \mid x)$$
Combining the CTC loss and the decoder loss facilitates the convergence of the decoder while enabling the hybrid model to exploit the label dependence. Since CTC is used to assist the decoder alignment, CTC is less weighted in the fusion. The total loss function is defined as the weighted sum of CTC and the decoder loss as follows:
$$L = \lambda L_{\mathrm{CTC}}(x, y) + (1 - \lambda) L_{\mathrm{att}}(x, y)$$

Among them, $\lambda \in [0, 1]$ is used to measure the importance of the CTC loss and the decoder loss, $x$ represents the speech features, and $y$ represents the text annotations.
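A hedged PyTorch sketch of this weighted combination is given below; the CTC weight of 0.3 and the tensor shapes are assumptions for illustration rather than the paper’s reported settings.

```python
# Sketch of the weighted CTC + decoder (cross-entropy) loss in the formula above.
import torch.nn as nn

ctc_weight = 0.3                                     # lambda: smaller weight for CTC
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)   # -1 marks padded label positions

def joint_loss(encoder_log_probs, decoder_logits, targets, feat_lens, target_lens):
    """encoder_log_probs: (T, B, vocab) log-probabilities from the CTC branch.
    decoder_logits: (B, U, vocab) from the Transformer decoder.
    targets: (B, U) label ids, padded with -1."""
    ctc_targets = targets.clamp(min=0)               # CTCLoss expects non-negative ids
    l_ctc = ctc_loss_fn(encoder_log_probs, ctc_targets, feat_lens, target_lens)
    l_att = att_loss_fn(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                        targets.reshape(-1))
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att
```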
3.3. Analysis of the Conformer–Transformer–CTC Model
In the research on end-to-end speech recognition models for the North Shaanxi dialect, the Conformer–Transformer–CTC model includes the following special designs for the unique coal mine scenario.
Dialect Feature Embedding. The model will include an acoustic feature embedding layer specific to the North Shaanxi dialect to better capture the tonal and pronunciation characteristics of the dialect.
Multilingual Pre-training. The model may use a dataset that includes standard Mandarin and the North Shaanxi dialect during the pre-training phase so that the model can learn the differences between the two languages.
Dialect Adaptation Layer. In the decoder part of the model, a dialect adaptation layer may be added to adjust the probability distribution of the model’s output in order to adapt to the grammatical and lexical characteristics of the North Shaanxi dialect.
Attention Mechanism Optimization. The attention mechanism should be enhanced to better handle long-term dependencies and specific phonetic liaisons commonly found in dialects.
Dialect Data Augmentation. The diversity of dialect training data should be increased using methods such as phoneme substitution, time stretching, and noise addition to improve the model’s generalization capabilities (an illustrative sketch of these augmentations follows this list).
Custom Loss Function. A loss function that better reflects the difficulty of dialect recognition should be designed, such as assigning higher weights to phonemes that are unique to the dialect.
Joint Optimization of Acoustic and Language Models. During training, the acoustic model and the language model should be optimized simultaneously to ensure they work in concert, thereby improving the accuracy of speech recognition.
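As referenced in the data augmentation item above, the following dependency-free sketch illustrates two of the mentioned augmentations, noise addition and time stretching; phoneme substitution operates at the text/lexicon level and is not shown, and the SNR and stretch factors are illustrative choices.

```python
# Simple sketch of waveform-level augmentations: white-noise mixing at a target
# SNR and crude time stretching via linear-interpolation resampling.
import numpy as np

def add_noise(speech: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Mix white noise into the waveform at the given signal-to-noise ratio."""
    speech_power = np.mean(speech ** 2) + 1e-10
    noise = np.random.randn(len(speech))
    noise_power = speech_power / (10 ** (snr_db / 10))
    noise *= np.sqrt(noise_power / (np.mean(noise ** 2) + 1e-10))
    return speech + noise

def time_stretch(speech: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Change duration by resampling (this also shifts pitch; a phase-vocoder
    based stretch would preserve it)."""
    old_idx = np.arange(len(speech))
    new_idx = np.linspace(0, len(speech) - 1, int(len(speech) / rate))
    return np.interp(new_idx, old_idx, speech)

# Usage: augmented = add_noise(time_stretch(waveform, rate=0.9), snr_db=10.0)
```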
In these special designs, the use of downsampling reduces data dimensionality and computational complexity while retaining, as far as possible, the main energy and key spectral information of the speech signal. The audible frequency range typically spans from 20 Hz to 20 kHz. If the sampling rate is halved, the representable bandwidth is also halved, so the upper frequency limit drops from 20 kHz to 10 kHz. In the context of speech recognition for the North Shaanxi dialect, the frequency components of the speech signal are very rich, necessitating the use of an anti-aliasing filter before downsampling to prevent aliasing effects.
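For illustration, the sketch below performs anti-aliased downsampling by a factor of two with SciPy's decimate, which applies a low-pass (anti-aliasing) filter before discarding samples; the 16 kHz to 8 kHz example is an assumed instance of halving the sampling rate, not a setting taken from the paper.

```python
# Anti-aliased 2x downsampling: decimate low-pass filters the signal before
# discarding every second sample, preventing aliasing of high-frequency content.
import numpy as np
from scipy.signal import decimate

def halve_sampling_rate(waveform: np.ndarray) -> np.ndarray:
    """Downsample by a factor of 2 with a built-in anti-aliasing filter."""
    return decimate(waveform, q=2, zero_phase=True)

# Example: a 1-second 16 kHz signal becomes 8000 samples; content above the new
# Nyquist frequency (4 kHz) is attenuated by the anti-aliasing filter.
x = np.random.randn(16000)
y = halve_sampling_rate(x)
print(len(y))   # -> 8000
```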