*3.2. Testing Phase*

In the testing phase, an exemplary speech recognition model with an RNN-LSTM structure was created. The METU 1.0 Turkish speech set, which was proposed in the studies [36,37] and is distributed by the Linguistic Data Consortium (LDC) [39], was used for training the model. The outputs of the standard recognition model were subjected to the comparison model process, the details of which are given in Figure 5, with the aim of increasing system performance.

The procedure for correcting the output of the LSTM speech recognition model within the proposed model is shown in Figure 6. Accordingly, all outputs are subjected to a verification and correction process. First, each output is checked to determine whether it needs correction. If a result is generated with a 0% error rate, the recognition process is terminated and the system output is produced. In addition, this output is added to the database as a new "Referenced Template" if the word does not already exist there. If, however, the recognition output contains uncorrected characters, the output is subjected to a correction process that uses the previously prepared "Referenced Template DB". Each output is compared against the database using string distance calculation algorithms (a single algorithm in some tests and 11 different algorithms in others) in order to find the closest word. Every output passes through this search and distance calculation step without exception, so the closest reference word is always found. This reference word is then used as the system output in place of the incorrectly generated word. That is, instead of the raw text output generated from the audio data, the referenced word found by matching the incorrect output against the database is used.

**Figure 6.** The proposed correction method for LSTM speech recognition.
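The verification-and-correction flow of Figure 6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the plain Levenshtein distance stands in for whichever of the distance algorithms a given test uses, and `reference_db` is a hypothetical in-memory stand-in for the "Referenced Template DB".

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_output(word: str, reference_db: set) -> str:
    # Step 1: a zero-error output is passed through unchanged
    # (in the full model it would also be added to the DB as a
    # new "Referenced Template" if not already present).
    if word in reference_db:
        return word
    # Step 2: otherwise, replace it with the closest reference word.
    return min(reference_db, key=lambda ref: levenshtein(word, ref))
```

In the actual system the exhaustive `min` over the database is backed by an indexed search rather than a linear scan.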

The purpose of the proposed model is to correct system outputs that contain a few character errors and thereby improve recognition performance. The pseudocode of the algorithm is presented in Figure 7. In the first step, it is important to reset the output value to be produced by the LSTM model and the variables that hold the distances between the LSTM model output and the reference words. The LSTM-based speech recognition system is then run and its output is produced. This output is checked to determine whether its error rate is equal to 0%; this check was added to speed up the overall operation, since results with a 0% error rate are excluded from the correction process. However, if recognition is performed incorrectly, all words in the reference table are compared sequentially and the distance of each word to the LSTM model output is calculated. After this calculation, the index with the minimum distance is determined and the corresponding word is produced as the system output. In effect, a form of output correction is performed.

There are many different distance calculation algorithms for finding the nearest word. Although each algorithm takes a different approach, the main goal is the same: to make the correct estimation. At this stage, the most accurate approach would be to evaluate each word with more than one distance algorithm and take the best value. However, this naturally leads to a significant increase in processing time. Therefore, in this study, the distance algorithm most suitable for the dataset was first determined through tests, and that algorithm was then used in the comparison process. The reference word closest to the erroneous recognition output is found by the search described in the pseudocode.

**Figure 7.** Pseudocode of the most-matching learning algorithm.

#### *3.3. Datasets, Tools and Algorithms Used For Testing Proposed Model*

In order to test the proposed model in different ways, audio data, text data, a database tool, an LSTM model and several supporting libraries were needed.

LSTM models require audio data for training. In this study, the "Middle East Technical University Turkish Microphone Speech v. 1.0" dataset [36,37], a comprehensive resource for Turkish speech recognition studies, was used. In it, 120 speakers (60 males and 60 females) each speak 40 sentences (approximately 300 words per speaker), which amounts to approximately 500 min of speech in total. The 40 sentences are selected randomly for each speaker from a triphone-balanced set of 2,462 Turkish sentences. The speakers' ages range from 19 to 50 years, with an average of 23.9 years.

It is essential to create a text dataset that can serve as a template source for correcting LSTM speech recognition outputs; it needs to cover almost all words in Turkish. For this purpose, three commonly used Turkish data groups were preferred: Zemberek [40], the BOUN Corpus [41] and the METU 1.0 Speech Dataset [36,37]. Zemberek, published in 2010, is an open-source library containing grammatical features specific to the Turkish language and can be used in Natural Language Processing (NLP) studies; it contains approximately 1.15 million unique Turkish words. The BOUN Corpus, published in 2008, is a project to create a Turkish language resource; it contains approximately 1.4 million unique Turkish words. The METU 1.0 audio dataset was prepared for use in speech recognition studies and published in 2002; it contains approximately 7 thousand unique Turkish words.
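Combining the three corpora into a single template set reduces to deduplicating their word lists. The sketch below is our own illustration (the function name and normalization choices are assumptions, not the authors' code); note that Python's default case folding does not apply Turkish casing rules.

```python
def build_reference_db(*word_lists):
    """Merge several corpora into one set of unique template words.

    Caveat: str.lower() does not apply Turkish casing rules
    (e.g. 'I' -> 'ı'), so a locale-aware fold would be needed
    in a production system.
    """
    db = set()
    for words in word_lists:
        # Skip empty entries and normalize surrounding whitespace.
        db.update(w.strip().lower() for w in words if w.strip())
    return db
```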

The PostgreSQL database tool [42] was used to store the "Referenced Template Words" and to perform quick searches. The Python [43] programming language was used for training the LSTM speech recognition system, and Java libraries were used for testing. The dill [44], librosa [45], namedtupled [46], numpy [47], python\_speech\_features [48] and tensorflow [49] libraries were used for speech recognition with Python.

There are many algorithms available for the string comparison step of the model. In this work, 11 different distance calculation algorithms were used to test system performance in different ways: Levenshtein [50], Damerau-Levenshtein [51], Jaro-Winkler [52,53], Longest Common Subsequence [54], Metric LCS [54], Normalized Levenshtein [50], Optimal String Alignment [51], Precomputed Cosine [55], Qgram [56], Sift4 [57] and Weighted Levenshtein [24]. Thus, it was possible to observe the effects of the different algorithms on the proposed model and to make comparisons that contribute to performance improvements.
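To illustrate how the metrics in this list differ from plain edit distance, the sketch below implements one of them, the q-gram distance [56], in its common form as the L1 distance between q-gram count profiles. This is a simplified sketch for illustration, not the implementation used in the study.

```python
from collections import Counter

def qgram_profile(s: str, q: int = 2) -> Counter:
    # Count the overlapping substrings of length q (q-grams).
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a: str, b: str, q: int = 2) -> int:
    # L1 distance between the two q-gram profiles: strings sharing
    # many q-grams are considered close even if a character-level
    # alignment would be expensive.
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
```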

#### **4. Experiment and Results**

In order to test the proposed data correction model, a medium-scale Turkish training set was prepared. An LSTM-RNN-based model was built using an acoustic model with Connectionist Temporal Classification (CTC) and deep learning. The METU 1.0 audio dataset was used to test the application. The audio data were recorded in a quiet studio as a single channel at 16 kHz with 16-bit resolution. The analysis frames are 20 ms wide with a 10 ms overlap. Different sample sets were created and the proposed model was tested. Three different text sets were used for the tests: the Zemberek Library [40], the BOUN Text Corpus [41] and the METU 1.0 text data [36,37]. These datasets were examined and combined into a text dataset containing more than 2M different Turkish words. For training and testing of the model, multiple training and test sets were prepared from the audio dataset and the results were compared.
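The stated framing parameters (16 kHz sampling, 20 ms windows, 10 ms hop) fix the relationship between audio duration and frame count. A small sketch of that arithmetic (our own illustration, not part of the paper's code):

```python
SAMPLE_RATE = 16_000        # Hz, single channel, 16-bit resolution
FRAME_MS, STEP_MS = 20, 10  # 20 ms analysis window, 10 ms hop

def num_frames(duration_s: float) -> int:
    """Number of full analysis frames in a recording of the given length."""
    samples = int(duration_s * SAMPLE_RATE)
    frame = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per window
    step = SAMPLE_RATE * STEP_MS // 1000     # 160 samples per hop
    if samples < frame:
        return 0
    return 1 + (samples - frame) // step
```

One second of audio therefore yields 99 overlapping frames, each sharing 10 ms with its neighbor.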

Two different methods were used to evaluate the proposed model. First, the method was applied in multiple tests and the average contribution to system performance was observed. In the second method, at each step of the 10,000-Epoch tests, the original sentence was compared with both the LSTM model output and the proposed model output, and the number of sentences that were closer to the original was counted. Thus, in addition to the average increase, the number of sentences for which the proposed model produced better results across the 10,000 tests was also evaluated. Figure 8 shows an example of the second method.

**Figure 8.** Sample test scheme of the proposed model.

Table 1 shows the test results of the different distance algorithms, obtained over 10,000 Epoch tests. When the results are examined, it can be seen that different algorithms yield different performance increases. This is because the distance algorithms produce different results when finding the closest words. The proposed model achieved a maximum increase of 3.84% in overall system recognition performance. Another significant contribution appears in the before-and-after correction counts when each of the 10,000 tests is compared individually: in the best case, the corrected output was closer to the original in 7711 of the tests, while the uncorrected output was closer in 2289. This shows that the correction process is very useful not only for the overall performance of the system, but also for correcting inaccurate outputs throughout the system. A proportional difference of 27.21% is achieved on average.


**Table 1.** Recognition Results of LSTM Speech Recognition with Different Distance Algorithm.

Table 2 shows the basic parameters of the test models. Ten test sets were created with these parameters, each working with different data. The Normalized Levenshtein algorithm, one of the distance algorithms that provided the best contribution to overall system performance, was kept constant, and tests were performed to measure the performance increase on the 10 different test sets. The test results are presented in Table 3.

**Table 2.** Experiment Parameters of Designed Model for LSTM.


**Table 3.** Recognition Results of LSTM Speech Recognition with Correction Method.


Average Difference in Success Count: 40.64%; Average Performance Increase: 2.25%; Maximum Performance Increase: 3.55%; Minimum Performance Increase: 1.42%.

As shown in Table 3, the proposed model produced different results for different datasets. The purpose of these tests is to observe the contribution of the proposed model to system performance in different test environments. Changing the test sets did not change whether the model contributed to the system, but it did change the contribution rate. An average system performance increase of 2.25% was achieved. The variability of the improvement ratio is directly proportional to the output produced by the system before correction. The model offers a contribution of more than 3% in cases where each word contains few character errors but errors occur in more than one word of a sentence. However, the improvement level drops to around 1.5% in cases where a single word has more than one character error.

Changes in the Epoch number and in the number of hidden layers of the model lead to changes in the level of contribution. The model improves system performance in every case, but at levels ranging from 0.5% to 3.55% (Figure 9). Furthermore, the proportional difference in the number of correct recognitions before and after correction is much higher: on average, the model produces more accurate outputs in 40.64% more cases, which is a significant contribution. The model makes visible corrections in each test sentence, fixing many character errors. A numerical difference exceeding 50% indicates situations where the LSTM model does not learn or where the overall recognition performance of the model remains below 20%. As the Epoch number increases, the differences in success counts decrease to more reasonable levels; this is when real learning takes place and the numerical success settles at average levels. In the case of realistic and good learning, the contribution to overall system performance exceeds 3% and the difference in the success count reaches over 40%. This shows that the proposed model can contribute more when paired with a well-trained network.

**Figure 9.** Effect of Epoch Number and Hidden Layer Count on Recognition Performance.

With the increase in performance, it is natural that the proposed model also increases the system's computation time, since an output correction step is performed after the system output is produced. To observe how much this step adds to the overall runtime, we measured how long the verification method took to complete its operations; the results are shown in Table 4. These results were obtained on an Intel i5 processor with 8 GB of RAM. Approximately 0.04 s is required to correct each word during recognition. Assuming that an adult speaks an average of 200 words per minute, an extra 8 s is needed by a recognition model using the proposed method for 1 min of continuous speech. These times are slightly longer in multiple-comparison configurations that use several distance calculation algorithms to find the closest word.
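The runtime estimate follows from simple arithmetic; the sketch below makes it explicit, using the reported per-word cost and the assumed speaking rate (the function name is ours):

```python
PER_WORD_S = 0.04     # measured correction cost per word (Intel i5, 8 GB RAM)
WORDS_PER_MIN = 200   # assumed average adult speaking rate

def correction_overhead_s(speech_minutes: float) -> float:
    """Extra seconds the correction step adds for a given amount of speech."""
    return speech_minutes * WORDS_PER_MIN * PER_WORD_S
```

One minute of speech thus costs roughly 8 extra seconds of correction time, scaling linearly with speech length and with the number of distance algorithms applied.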


**Table 4.** Results of different recognition approach with METU 1.0.

The given processing times can be reduced with suitable optimization techniques or a faster computing infrastructure. However, what we want to show here is that the proposed model adds only a modest amount of extra time to a standard speech recognition system. Given its effect on system results and its runtime cost, it is a method that can be applied to many recognition systems.

In this study, an LSTM-based approach was preferred for the speech recognition process, and the proposed new approach contributed to a standard model. In addition to LSTM, many different algorithms such as Hidden Markov Models (HMM), Modified HMM, Embedded HMM, Gaussian Mixture Models (GMM), and Support Vector Machines (SVM) can be used, and there have been many studies using these methods [58–65]. In order to compare the results obtained in this study, the results of studies that use modelling approaches other than LSTM and perform tests with the METU 1.0 dataset [36,37] are also shown in Table 4.

When the results in Table 4 are examined, the error rate is approximately 30%, although small differences in the text data used for the language model affect the results. The LSTM-based model has an average error rate of 26.87%, which is generally better than that of the models using HMM, CVA or N-grams. With the proposed error-reduction approach, this performance was further improved, and the error rate was reduced to 24.82%.

#### **5. Conclusions**

In this study, a viable output correction mechanism for speech recognition systems using LSTM modelling is proposed, with the aim of increasing system performance.

This proposed hybrid model provides a performance increase of approximately 2.25% with small increases in system runtime. Once the system requirements are established, it is very easy to implement.

The diversification of the audio data set and the enrichment of the word set will contribute to the performance. In addition, optimization of the distance calculation algorithms and improving database schemas would allow a considerable reduction in the processing times.

To date, no speech recognition system has been developed that can operate in all situations, achieve 100% output performance and be used for all languages. Thus, the proposed model can be applied in many studies and structures to provide performance improvements at certain levels.

#### **6. Future Works and Restrictions**

Although the proposed model generally produces successful results, some limitations were encountered during the tests. The most important reason why the system cannot contribute more to overall recognition performance is that the issue with the space character has not been resolved. As shown in Figure 10, an output that was produced with almost 100% accuracy can be corrupted by the correction process. Solving this problem would yield a significant increase in overall recognition performance.

**Figure 10.** Restriction of the method: the space character problem.

The second problem is that different distance calculation algorithms cannot be used together because of their different calculation structures. If the necessary optimization and matching studies were performed on the distance values of the algorithms, it would be easier to find the correct template words, and this optimization would contribute to system recognition performance.

Finally, the biggest problem of the system is the delay in operating time. Since this method is a hybrid model, it requires a longer runtime than standard models. This time must be reduced to tolerable levels; otherwise, the method cannot be used in real-time speech recognition applications.

**Author Contributions:** Conceptualization, R.S.A. and N.B.; methodology, R.S.A. and N.B.; software, R.S.A.; validation, R.S.A. and N.B.; formal analysis, R.S.A. and N.B.; investigation, R.S.A. and N.B.; resources, R.S.A. and N.B.; data curation, R.S.A. and N.B.; writing—original draft preparation, R.S.A. and N.B.; writing—review and editing, R.S.A. and N.B.; visualization, R.S.A.; supervision, N.B.; project administration, N.B.

**Funding:** This study received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.
