1. Introduction
With the development of wireless networks, Internet of Things applications continue to emerge [1,2,3,4,5,6]. Among them, speech recognition is one of the most widely used technologies. However, its accuracy still needs to be improved, and the complexity of Chinese makes this problem more prominent. Although speech recognition has received increasing attention with the development of Internet technology and artificial intelligence, the limited accuracy of recognition systems and the grammatical habits of the language itself mean that the converted text often contains errors. These errors not only affect subsequent processing but also make it harder for the recipient to understand the semantics of the speech text. Therefore, applying natural language processing techniques to the text produced by speech recognition is an important research direction for improving the overall performance of speech recognition technology.
At present, common text processing methods after speech recognition fall into three categories: dictionary-based methods, traditional machine learning, and deep learning. Dictionary-based methods [7] build a large dictionary, find the corresponding words in the dictionary through string matching, and then check and correct possible errors [8]. Chiu et al. [9] proposed a Chinese text error correction model combining rules and statistical methods: a rule-based detection model identifies errors in Chinese text, and a statistical machine translation model then provides the most appropriate corrections for the erroneous text. Yeh et al. [10] proposed a Chinese spelling check method based on a string matching algorithm and a Chinese language model: the probability of the input sentence is first estimated with an n-gram model, and the Knuth–Morris–Pratt (KMP) string matching algorithm is then used to detect and correct errors. Hasan et al. [11] used a character-based statistical machine translation model to correct user search query errors in e-commerce; instead of the common approach of first segmenting Chinese sentences, they processed sentences character by character and obtained better performance. Hsieh et al. [12] proposed a word lattice decoding model for the Chinese spelling check problem; they avoid word segmentation errors and misspelling errors by including all confusable characters in the word lattice, improving the accuracy of rule matching. Siklósi et al. [13] proposed a method for automatically correcting spelling errors in Hungarian clinical records: they modeled spelling correction as a translation task, with the erroneous text as the source language and the corrected text as the target, used a statistical machine translation model to perform the correction, and modeled the lexical context with a 3-gram language model. Although the accuracy of dictionary-based error detection and correction is high, its robustness is poor, and the dictionary requires extensive manual maintenance. With the rapid development of Internet technology, network vocabulary and new words emerge constantly, which makes dictionary maintenance very difficult.
Given the weaknesses of dictionary-based methods, many studies have adopted traditional machine learning methods. These methods mainly extract correlations and features between words from large corpora, train a language model, and finally obtain the optimal corrections through the language model [14]. Han et al. [15] proposed a Chinese spelling check model based on maximum entropy (ME), training the ME model on a large raw corpus. Zhao et al. [16] used a hybrid model combining a graph-based generic error model with two independently trained specific error models to detect Chinese spelling errors; in the graph-based generic error model, a directed acyclic graph is generated for each sentence, and a single-source shortest path algorithm is then used to detect and correct common spelling errors. Chen et al. [17] proposed a new Chinese Spelling Check (CSC) probabilistic framework that draws on the advantages of substitution models and language models; a topic language model is integrated into the CSC system in an unsupervised way so that long-span semantic information can be obtained from a string, and the framework is further combined with web resources to improve overall performance. Liu et al. [18] proposed a Chinese spelling check framework that automatically detects and corrects Chinese spelling errors; machine translation models and language models generate correction candidates, and the candidates produced by statistical machine translation (SMT) and the language model (LM) are then ranked by a support vector machine (SVM) classifier. Yeh et al. [19] proposed a method for detecting and correcting Chinese spelling errors using an inverted index table with a reordering mechanism; a context-dependent, confidence-based pruning method reduces the search space and computational complexity, and a class-based language model together with a maximum entropy correction model captures the relationship between the original input and the expected input. However, the functions used by traditional machine learning models in the modeling process are relatively simple, which limits their feature extraction for complex problems.
In recent years, many studies have applied deep learning methods to natural language processing. Deep learning methods mainly extract features from large-scale corpora through neural network models and then use the trained models to determine the optimal corrections [20]. Traditional machine learning models usually extract only shallow numerical sample features, whereas deep neural networks can better fit the data by combining low-level features into more abstract high-level representations [21,22]. Therefore, deep neural networks usually perform better than traditional machine learning models [23]. Wang et al. [24] proposed a method for automatically generating a CSC corpus based on optical character recognition (OCR) and automatic speech recognition (ASR), and implemented a supervised bidirectional long short-term memory (LSTM) model that casts Chinese spelling check as a sequence labeling problem. Li et al. [25] proposed a nested recurrent neural network (RNN) model for misspelling correction and trained it on pseudo data generated from phonetic similarity. Liao et al. [26] proposed a Bi-LSTM-CRF sequence generation model that casts Chinese grammatical error diagnosis (CGED) as a sequence labeling problem. For the CGED task, Zhou et al. [27] proposed a traditional linear CRF model and an LSTM-CRF model that add combined features such as positional and syntactic features. Li et al. [28] proposed a sequence labeling method based on a policy-gradient LSTM model to address the imbalance of positive and negative samples and the vanishing gradients faced by traditional models, using policy-based deep reinforcement learning during training. Ren et al. [29] proposed a convolution-based sequence-to-sequence model in the Chinese grammar correction task of the Seventh CCF International Conference on Natural Language Processing and Chinese Computing; they combine the advantages of convolutional neural networks and sequence-to-sequence models, treating the GEC task as translating erroneous Chinese into correct Chinese and using a CNN to capture local word sequences.
In the actual application of speech-to-text, we find that the text produced by speech recognition contains many errors, which greatly complicates subsequent text processing. This paper combines the advantages of traditional machine learning and deep learning methods to correct the errors in text converted from speech, so as to facilitate further processing. Common post-recognition text processing approaches usually put error detection and error correction into the same model. In this paper, we divide the task into two sub-tasks, speech text error detection and speech text error correction, and train a separate model for each. In the text error detection subtask, we label each character to be detected as 0 or 1, where 0 means the character is correct and 1 means it is wrong, and we combine LSTM and CRF [30] techniques to train a Bi-LSTM model to handle the detection task. In the text error correction subtask, we first construct the replacement character set of each wrong character, then build a Bi-LSTM model, input the replacement characters of the wrong characters detected in the detection subtask into the Bi-LSTM model, and output the optimal corrected character.
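To make the detection pipeline concrete, the following is a minimal sketch (not the authors' released code) of a character-level Bi-LSTM tagger in Keras, the framework used in Section 3.4. The vocabulary size, sequence length, and layer widths are assumed values, and the CRF output layer described above is omitted here for brevity (it can be added, for example, with the CRF utilities in the tensorflow_addons package).

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000  # assumed character vocabulary size
MAX_LEN = 100      # assumed fixed sentence length (see Section 3.3)

def build_detector():
    # Character ids in, one 0/1 error probability per character out.
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    # mask_zero=True lets padded positions be ignored downstream.
    x = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(inp)
    # The bidirectional LSTM reads the sentence forwards and backwards,
    # so each character sees both its left and right context.
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    # Per-character probability that the character is wrong (label 1).
    out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```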
The rest of the paper is organized as follows: Section 2 will elaborate on the text processing technology after speech recognition we use from two aspects: text error detection and text error correction. Section 3 will present experiments on the text error detection and text error correction sub-tasks and describe the experimental results in detail. We will discuss and summarize the experimental conclusions in Section 4 and Section 5.
3. Results
In this section, we evaluate the performance of our proposed model on text error detection and correction. Section 3.1 introduces the datasets we used in our experiments, Section 3.2 describes the performance metrics we used, Section 3.3 describes the data preprocessing we performed on the dataset, and Section 3.4 describes the results of our model and compares it with other models.
3.1. Datasets
Our experiments use the SIGHAN 2013 CSC datasets. This dataset provides a sample set and similar character sets. The sample set consists of 700 samples from student essays, half of which contain at least one error while the remaining samples contain none. The similar character sets contain characters that are similar in shape or similar in pronunciation to a given Chinese character. The test sentences were written by students aged 13 to 14. In the text error detection task, the test set contains 1000 test sentences, of which 300 contain errors. In the text error correction task, the test set contains 1000 test sentences, each of which contains one or more errors.
3.2. Performance Metrics
3.2.1. Metrics of Error Detection
We use the following indicators to evaluate the performance of the text detection model (a computational sketch follows the list):
- (1)
Detection Accuracy (DA): The number of sentences with the correct detection result / the total number of sentences.
- (2)
Detection Precision (DP): The number of sentences in which errors are correctly detected / the number of sentences detected as containing errors.
- (3)
Detection Recall (DR): The number of sentences in which errors are correctly detected / the number of sentences that actually contain errors.
- (4)
Detection F1 (DF1): 2 * DP * DR / (DP + DR).
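As an illustration only, these sentence-level metrics can be computed from pre-tallied counts as follows; the helper and its argument names are ours, not from the original evaluation script:

```python
def detection_metrics(correct, flagged_right, flagged, with_errors, total):
    """Sentence-level detection metrics as defined above.

    correct:       sentences whose detection result is correct (DA numerator)
    flagged_right: error sentences whose errors were actually detected
    flagged:       sentences the system flagged as containing errors
    with_errors:   sentences that truly contain errors
    total:         all sentences in the test set
    """
    da = correct / total
    dp = flagged_right / flagged
    dr = flagged_right / with_errors
    df1 = 2 * dp * dr / (dp + dr)
    return da, dp, dr, df1
```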
3.2.2. Metrics of Error Correction
We use the following indicators to evaluate the performance of the text correction model (a computational sketch follows the list):
- (1)
Location Accuracy (LA): The number of sentences in which the error locations are correctly detected / the total number of sentences.
- (2)
Correction Accuracy (CA): The number of sentences in which the errors are correctly corrected / the total number of sentences.
- (3)
Correction Precision (CP): The number of sentences in which the errors are correctly corrected / the number of error sentences detected by the system.
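Analogously, a short sketch for the correction metrics (argument names are again ours):

```python
def correction_metrics(loc_right, corr_right, flagged, total):
    la = loc_right / total     # Location Accuracy: error positions found
    ca = corr_right / total    # Correction Accuracy: errors fixed correctly
    cp = corr_right / flagged  # Correction Precision: fixes among flagged
    return la, ca, cp
```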
3.3. Data Preprocessing
- (1)
Remove all punctuation from the sentence, leaving only plain text data.
- (2)
Label the original text at the character level: correct characters are labeled 0 and wrong characters are labeled 1; the resulting 0/1 sequence is used as the label of the original text.
- (3)
Count the number of distinct characters in the preprocessed dataset, the number of occurrences of each character, the maximum sentence length, and the average sentence length. The maximum sentence length is used as the fixed character length of each sentence: sentences longer than the fixed length are truncated, and sentences shorter than the fixed length are padded with 0. A code sketch of these steps is given below.
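The following sketch illustrates the three steps under stated assumptions: the `\w` character class stands in for "plain text" after punctuation removal, `error_positions` (the indices of wrong characters, assumed to index the punctuation-stripped sentence) comes from the corpus annotation, and the out-of-vocabulary id is an arbitrary choice of ours.

```python
import re

def preprocess(sentence, error_positions, char2id, max_len, oov_id=1):
    # Step (1): drop punctuation, keeping only word characters
    # (an approximation of the paper's punctuation removal).
    chars = [c for c in sentence if re.match(r"\w", c)]
    # Step (2): character-level 0/1 labels (1 = wrong character).
    labels = [1 if i in error_positions else 0 for i in range(len(chars))]
    # Map characters to integer ids for the embedding layer.
    ids = [char2id.get(c, oov_id) for c in chars]

    # Step (3): truncate to the fixed length or pad with 0.
    def pad(seq):
        return seq[:max_len] + [0] * max(0, max_len - len(seq))

    return pad(ids), pad(labels)
```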
3.4. Experimental Results
TensorFlow is Google's open-source machine learning framework based on data flow graphs, and Keras is a high-level API that wraps TensorFlow and Theano. Together they allow complex neural network models to be built quickly. We therefore used TensorFlow and Keras to implement the model described in Section 2.
To evaluate the performance of our proposed model more accurately, we used 10-fold cross-validation [47] and 5×2 cross-validation [48] to partition the dataset. In the 10-fold cross-validation method, we randomly divided the dataset into 10 parts, used nine of them in turn as the training set and the remaining part as the validation set, and took the mean of the 10 results as the evaluation result of our model. In the 5×2 cross-validation method, we randomly divided the dataset into five parts, used four of them in turn as the training set and the remaining part as the test set; within the training data we applied 2-fold cross-validation, and the average of the five results was taken as the evaluation result of our model.
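For the 10-fold protocol, a minimal sketch using scikit-learn's KFold is shown below; the `build_model` and `evaluate` hooks are assumed placeholders, and the epoch count is an assumed value.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_score(X, y, build_model, evaluate):
    # Train on 9 folds, validate on the held-out fold, and average
    # the 10 scores, as described above.
    scores = []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```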
Table 1 and Table 2 show the classification results of the text error detection model, and Table 3 and Table 4 show the classification results of the text error correction model.
It is well known that the number of training iterations affects the final performance of a model. We experimented with our models using different numbers of iterations. As shown in Table 5 and Table 6, as the number of iterations increases the model gradually over-fits, resulting in a decrease in generalization performance. When the number of iterations is less than 10, the performance of the model gradually improves; beyond 10 iterations, performance gradually decreases as the number of iterations grows.
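The iteration count here was tuned by hand; one standard Keras alternative (not used in the original experiments) is early stopping, sketched below with an assumed patience value.

```python
from tensorflow.keras.callbacks import EarlyStopping

def fit_with_early_stopping(model, x_train, y_train, x_val, y_val):
    # Stop once validation loss stops improving and roll back to the
    # best weights, instead of fixing the epoch count in advance.
    cb = EarlyStopping(monitor="val_loss", patience=3,
                       restore_best_weights=True)
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     epochs=50, callbacks=[cb])
```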
In the field of natural language processing, input sentences usually have different lengths, and the fixed length to which they are unified also affects the performance of the network model. We unified the input length to either the length of the longest sentence in the dataset or the average sentence length, and experimented with both settings. Table 7 and Table 8 show the experimental results for these two input lengths. Using the average sentence length truncates every sentence longer than the average, so part of the context information of those sentences is lost, which affects the final experimental results of the model. As can be seen from Table 7 and Table 8, using the maximum sentence length in the dataset as the fixed input length yields better results than using the average sentence length.
We compared our model with the experimental results of other models. The results of the error detection experiments of our model on the SIGHAN 2013 CSC dataset are shown in Table 9 and Figure 6.
In the method proposed by Hsieh et al. [49], the CKIP-WS system first segments the Chinese sentence and then searches for possible replacement words through the confusion set and the CKIP dictionary. In the G1-WS system, after segmentation, similar words are retrieved from a dictionary built from Google's unigram data and sorted by frequency, with low-frequency words taken as error candidates. The error detection performance of both systems depends heavily on the richness of the dictionary and confusion set they construct, which limits them in other domains. In the method proposed by Wang et al. [50], a rule-based check first identifies high-frequency erroneous words in the sentence, and a CRF-based parser then divides the sentence into a sequence of words; the characters in each short word (fewer than three characters) are treated as potential erroneous characters. Their error detection performance depends on the detection rules they construct. Compared with these models, our LSTM neural network automatically extracts more complex character features without the need to build a dictionary or rules, and the transition features of the CRF model correlate the outputs of the LSTM network to obtain better error detection performance.
The results of the error correction experiments of our model on the SIGHAN 2013 CSC dataset are compared with other models as shown in Table 10 and Figure 7.
In the method of Hsieh et al., similar characters of the wrong character from the confusion set or dictionary are used as candidate characters, and a 3-gram model then determines the best character sequence as the correction result. In the method of Wang et al., the detected wrong characters are replaced with characters of similar shape or pronunciation; the CRF parser then re-segments the modified sentences, an LM scores them, and the sentence with the highest LM score is taken as the correction result. Compared with their models, our Bi-LSTM model captures richer context features and therefore produces better error correction results.
4. Discussion
In this paper, we proposed a new text processing technology for speech recognition. We divided the text processing task into two sub-tasks: speech text error detection and speech text error correction. In the speech text error detection subtask, we developed a Bi-LSTM-CRF model that combines the advantages of the LSTM neural network and the CRF model. The LSTM model is a variant of the RNN model. Its neurons can effectively retain the historical information in the input sequence through the input gate, output gate, and forget gate, which makes it well suited to processing sequence data. In text data, each word is related to the words before and after it, so the LSTM neural network can effectively extract the context features in the text and process it well. The bidirectional LSTM we used obtains both the forward and backward dependency features of a word in a sentence, which better reflects the characteristics of the input text. By contrast, the n-gram model used by Hsieh et al. is a statistical language model whose core assumption is that the probability of the current word depends only on the previous n-1 words. As n increases, more dependency features of the words are captured and the prediction of the next word becomes more accurate, but larger n also causes data sparsity, which in turn hurts prediction accuracy. The Bi-LSTM network we used can obtain long-distance forward text features as well as backward text features. However, a standalone Bi-LSTM model does not take into account the contextual relationship between its outputs and may produce label sequences that are impossible in context, so we introduced the CRF model. The CRF model relates the outputs of the Bi-LSTM model to each other, treating the labeling of the whole sentence as a joint prediction problem, which effectively eliminates impossible label sequences and yields more accurate results. The experimental results in Section 3 show that our model has better text error detection performance than the n-gram model used by Hsieh et al. and the CRF model used by Wang et al.
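For reference, the core n-gram assumption discussed above can be written as:

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```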
In the speech text error correction subtask, we first constructed the candidate character set of each wrong character, then input the candidate characters of the errors detected in the error detection subtask into the Bi-LSTM neural network, and selected the candidate character with the largest output probability as the corrected character. Compared with other statistical probability models, the Bi-LSTM model we used can better extract the context features of the input text, and the optimal corrected character is then selected by comparing the different candidate characters. The experimental results in Section 3 show that our proposed text error correction model performs better than other text error correction models.
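As a sketch of this selection step, the function below substitutes each candidate at the detected error position and keeps the highest-scoring one; the trained scorer `score_fn`, standing in for the Bi-LSTM output probability, is an assumed placeholder.

```python
def best_candidate(sentence_chars, pos, candidates, score_fn):
    # Try each candidate character at the detected error position and
    # return the one the trained model scores highest.
    def score(cand):
        trial = list(sentence_chars)
        trial[pos] = cand
        return score_fn(trial)
    return max(candidates, key=score)
```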
In addition, we explored how the performance of the model changes with the number of iterations. By analyzing the experimental results, we found that as the number of iterations increases, performance first gradually improves, but once the number of iterations reaches a certain value, performance gradually decreases. When the number of iterations is very large, although performance on the training set is very good, the model parameters over-fit the training data, resulting in poor generalization and a weaker effect on the test set. The text length of the input model also affects its performance. We chose the maximum sentence length in the dataset and the average sentence length as candidate fixed input lengths. The experimental results show that the model performs better with the maximum sentence length, since this input length preserves more text features.
5. Conclusions and Future Work
The development of speech recognition technology is closely related to the progress of wireless networks. With the rapid development of wireless network technology [51,52,53,54,55,56], speech recognition will be used more and more widely in the field of the Internet of Things. Because of dialects and accents, the text produced by speech recognition needs to be corrected before it is displayed, and text processing technology for speech recognition has therefore attracted more and more attention from researchers. This paper proposes a text processing model for Chinese speech recognition based on an LSTM neural network model and CRF technology. The model can fully extract the context information of the input text. Through the analysis of the experimental results, we found that, compared with other models, our proposed neural network model has better text error detection and correction performance.
In addition, increasing the amount of data in the replacement set can further improve the error detection and correction performance of the model.
In this article, the raw data was processed with a character-level model. Given the advantages of word-level models, we could try processing the raw data with a word-level model. Moreover, besides homophone and near-homophone errors, speech text may also contain grammatical errors; addressing them is a direction for our future research.