**1. Introduction**

Speech is the most effective means of communication among people and the most natural way of exchanging information. Therefore, people's desire to interact vocally with computers is increasing day by day. To meet this desire, studies are being conducted in a number of fields intended to simulate humans' ability to talk, from the carrying out of simple tasks by computers through machine-human interaction to turning speech into text through Automatic Speech Recognition (ASR) systems [1]. In recent years, signal processing and recognition have been utilized in many areas such as human activity tracking and detection [2,3], computer engineering [4], the physical sciences, health-related applications [5], and the natural sciences and industry [6]. Speech sounds may get mixed with another speaker's speech in the background, with TV sound in a room, or with an external voice. All sounds except the speech signal are called noise. Therefore, it is necessary to filter this noise in speech recognition systems [7]. In order to reduce noise in the recording stage of speech, the environment and the sensors are very important. Many different sensor-based technologies have been proposed for signal capturing and processing [8–11].

Along with the developments in speech modelling, ASR systems have begun to be used in many different areas such as voice processing at call centers, tasks requiring a human-machine interface, travel information, stock-exchange transactions, quotations, weather reports, data entry, speech dictation, and access to information. Although there has been considerable progress in ASR systems over the last 60 years, there are still many problems to be solved and many details to be improved [12]. In the modelling of speech recognition systems, the flow-chart given in Figure 1 is generally followed. Accordingly, after the input signal is received into the system via the microphone, the pre-processing stage samples the sound, digitizes it, applies various filters and, where necessary, labels it and transforms it into a format that can be modelled. The aim here is to produce a simpler representation of the sound, free of speech variations and noise. In the next step, the parameters of the sampled audio signal are captured and calculations are made to extract the remaining properties of the sound at certain time intervals. Feature extraction is used in many applications such as real-time feature extraction from large satellite images [13], activity tracking [14], face recognition [15], real-time motion capture [16], and human detection and activity tracking [17]. The data obtained in feature extraction are therefore critical values for speech recognition and provide important clues. In the next step, the word estimated from these parameters is compared with the language models. The outputs of the acoustic model are compared with the outputs of the language model, and a search is performed on the language model. The aim is to find the right word to correct the acoustic model outputs; a kind of forecasting process is operated. The correct word is thus obtained by making the most appropriate selection, for the acoustic output, from the text data created with the help of the word log [2].
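The pre-processing and feature-extraction steps described above can be sketched as follows. This is a minimal illustrative NumPy example, not the pipeline used in this study; the frame length, hop size, and pre-emphasis coefficient are typical assumed values, and log energy stands in for a fuller feature set.

```python
import numpy as np

def extract_frame_features(signal, sample_rate=16000,
                           frame_ms=25, hop_ms=10, preemph=0.97):
    """Pre-emphasize, split into overlapping windowed frames,
    and return a simple per-frame feature (log energy)."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    window = np.hamming(frame_len)                   # smooth frame edges
    features = np.empty(n_frames)
    for i in range(n_frames):
        frame = emphasized[i * hop_len:i * hop_len + frame_len] * window
        features[i] = np.log(np.sum(frame ** 2) + 1e-10)  # log energy
    return features

# Example: 1 s of synthetic audio -> one feature value per 10 ms hop
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = extract_frame_features(audio)
print(feats.shape)  # (98,)
```

In a real system, each frame would yield a feature vector (e.g. MFCCs) rather than a single energy value, but the framing and windowing logic is the same.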

**Figure 1.** Basic flow of speech recognition systems.

This is the most basic flow of voice recognition techniques. Differences arise from the approaches used to create acoustic and language models.

In this study, a new method is proposed to improve the performance of Long Short Term Memory (LSTM) based Recurrent Neural Network (RNN)-trained speech recognition systems. The system is based on the principle of correcting recognition outputs with minimal increase in runtime.

Similar to the method proposed in this study, many different hybrid studies have been published in recent years. In the study [18], in which a Gaussian model was used with LSTM, a performance increase of approximately 0.2% to 0.5% was achieved compared to standard LSTM models. In the study [19], where talking facial expressions were used to support speech recognition performance, an increase was achieved in all cases. In the study by Kowari et al. [20], a Deep Neural Network (DNN), Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) were used in combination, and a performance increase was achieved.

In Section 2 of this study, the working method of the RNN-LSTM structure is explained, and the structure of the proposed model and its working process are explained in detail in Section 3. A comparative presentation of the experiments performed to observe the contribution of the model, together with their results, is given in Section 4. The evaluation of the results, some of the problems encountered, and the studies that can be carried out to solve these issues are presented in Sections 5 and 6.

#### **2. Recurrent Neural Networks and Long Short Term Memory**

An artificial neural network is a computational model based on the structure and functions of biological neural networks. Like a biological neuron, it receives, processes and transmits information. It basically consists of three layers. The input layer communicates with the environment and receives the values to be fed into the network. The output layer delivers information out of the network; it collects and transmits the required information. The hidden layer, located between the input and output layers, contains the neurons with activation functions; it extracts and uses the properties of the inputs coming from the previous layer, and may itself consist of multiple layers. An artificial neural network thus includes an input layer, a number of hidden layers depending on the problem, and an output layer. People produce new ideas by using their previous thoughts; in this way, the permanence of information is provided. Traditional neural network models, the first systems in which this type of thinking was modelled, cannot fully realize this idea. The forgotten old information constitutes a serious problem in the training of the network. RNN models were proposed as a solution to this problem: they have loops that allow information to persist.
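The three-layer structure described above can be illustrated with a minimal forward pass. This is a hedged NumPy sketch with arbitrary layer sizes and random placeholder weights, not a network from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 4 inputs -> 5 hidden neurons -> 2 outputs
W_hidden = rng.standard_normal((5, 4))
b_hidden = np.zeros(5)
W_out = rng.standard_normal((2, 5))
b_out = np.zeros(2)

def forward(x):
    """One forward pass: input layer -> hidden layer (tanh) -> output layer."""
    h = np.tanh(W_hidden @ x + b_hidden)   # hidden layer with activation
    return W_out @ h + b_out               # output layer (linear)

y = forward(np.array([0.5, -1.0, 0.3, 0.8]))
print(y.shape)  # (2,)
```

Training would adjust `W_hidden` and `W_out` by backpropagation; only the structure is shown here.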

The schema in Figure 2 shows a chunk of a neural network with input value xn and output value yk. A loop allows the information to be transmitted to the next step of the network.

**Figure 2.** An RNN allows recursion [21,22].

This loop system is not fundamentally different from a classical neural network. It can be thought of as an architecture that includes multiple copies of a neural segment, with each step receiving messages from the previous one. They work in the same way as arrays or lists. In recent years, the RNN structure has been applied to many problems such as speech recognition, language modelling, translation and image analysis, and very successful results have been achieved.

The most important idea of the RNN is that previous data can be used in the next steps of the network. The amount of previous data is an important consideration for RNN structures. This structure is well suited to problems where recent information is sufficient to make predictions about the future. However, more information may be needed to solve some problems. For example, consider a language model that tries to predict the next word using the previous words. This prediction may require a subject or verb that appeared much earlier in the sentence, not just the last word. RNN structures cannot learn well in cases where such dependencies grow long. The most important reason for this is the theoretically limited number of nodes in the network.
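The recurrence described above, where a hidden state carries information from step to step, can be sketched as follows. This is illustrative NumPy code with arbitrary dimensions and random placeholder weights, not the study's model.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 4

W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Vanilla RNN: each step mixes the current input with the
    previous hidden state, so earlier inputs influence later steps."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # h_t = tanh(Wx*x_t + Wh*h_{t-1} + b)
        states.append(h)
    return np.stack(states)

seq = rng.standard_normal((6, input_dim))  # a sequence of 6 time steps
states = rnn_forward(seq)
print(states.shape)  # (6, 4)
```

Because the same `W_h` is applied repeatedly, gradients through many steps shrink or grow multiplicatively, which is why long dependencies are hard for this plain structure to learn.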

LSTM-based RNN networks have been proposed to solve this problem. Long Short Term Memory (LSTM) is an RNN network that can learn long-term dependencies. This model was first proposed by Hochreiter and Schmidhuber in 1997 [4]. Its popularity has increased over time, and it has achieved very good results for many problems. It is well suited to avoiding long-term dependency problems; after all, long-term recall of knowledge is natural behavior for people (Figure 3).

**Figure 3.** An Unrolled Recurrent Neural Network [21,22].

A standard RNN is very simple, and each repeating layer may contain only a single "tanh" function. In LSTM networks, however, this layer has a slightly different structure, as shown in Figure 4.

**Figure 4.** The repeating module in an LSTM contains four interacting layers [3].

The key structure of the LSTM is the cell-state line that runs across the diagram, modified only by small linear interactions. Information can be added to or removed from it with the help of gates. The LSTM works in a structure consisting of six steps.

ft = σ(Wf · [ht−1, xt] + bf) (1)

it = σ(Wi · [ht−1, xt] + bi) (2)

C~t = tanh(WC · [ht−1, xt] + bC) (3)

Ct = ft × Ct−1 + it × C~t (4)

ot = σ(Wo · [ht−1, xt] + bo) (5)

ht = ot × tanh(Ct) (6)

In the above formulas, each b is a bias vector, W is a weight matrix, and xt is the input to the memory cell. The subscripts i, c, f and o refer to the input, cell memory, forget and output gates, respectively.

Step (1) is to decide which information is thrown away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at ht−1 and xt and generates an output between 0 and 1 for each element of the cell state Ct−1: 1 means "completely keep it" and 0 means "completely get rid of it".

Step (2) is to decide what new information will be kept in memory. This has two parts. First, a sigmoid layer decides which values to update, in step (2). Then, a tanh layer generates a vector of new candidate values C~t, in step (3).

It is then necessary to update the old cell state Ct−1 to the new state Ct. The old state is multiplied by ft, forgetting what was decided to be forgotten, and the new candidate values, scaled by it, are added in step (4).

Finally, it is necessary to decide what the cell output will be. This output depends on the current state of the cell, but is a filtered version of it. Using the sigmoid function, the data to be output from the cell are selected in step (5). Then, the cell state is passed through the tanh function in step (6), pushing the values between −1 and 1, and multiplied by the output of the sigmoid function. Thus, only the parts of the data that were decided on are produced as output [21,22].
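The six steps above can be collected into a single cell update. The NumPy sketch below follows the standard LSTM formulation (a concatenated [ht−1, xt] input, sigmoid gates, tanh candidate); the weights are random placeholders rather than trained values, and the dimensions are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to all four gates at once."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = len(h_prev)
    f_t = sigmoid(z[0:H])                # (1) forget gate
    i_t = sigmoid(z[H:2*H])              # (2) input gate
    C_tilde = np.tanh(z[2*H:3*H])        # (3) candidate values
    C_t = f_t * C_prev + i_t * C_tilde   # (4) new cell state
    o_t = sigmoid(z[3*H:4*H])            # (5) output gate
    h_t = o_t * np.tanh(C_t)             # (6) new hidden state
    return h_t, C_t

rng = np.random.default_rng(2)
H, D = 4, 3                              # hidden size, input size
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((5, D)):  # run 5 time steps
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape, C.shape)  # (4,) (4,)
```

The additive update in step (4) is the key design choice: the cell state changes by gated addition rather than repeated matrix multiplication, which is what lets gradients survive over many time steps.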

Using LSTM with RNN, remarkable results have been obtained for many different problems. The next step from this stage is structures that allow much larger amounts of data to be learned in each learning step. Many studies have been carried out on LSTM and its derivatives, and successful results have been obtained on better learning systems [23–29]. The Gated Recurrent Unit (GRU), similar in structure to the LSTM, was also a candidate for this study. In the study [30], the LSTM and GRU approaches were compared and reported to produce similar results. When LSTM and GRU trainings were performed for the model proposed in this study, more successful results were obtained with LSTM. For this reason, the success of the hybrid model was demonstrated by building the model with the LSTM structure.

#### **3. Proposed Correction Approaches for LSTM Speech Recognition**

The LSTM-based RNN structure can perform speech recognition at a certain level of performance on test data. Many factors, such as the dataset, model complexity and training time, affect this recognition process. For this reason, it is generally not possible to maintain satisfactory levels of system performance, and many different approaches have been proposed in the literature to improve performance [31–35]. In addition to the speech signal, features of human activities can contribute to recognition processes. Like these models, the proposed model aims to achieve a significant performance increase.

Using the comparison method with reference words together with the LSTM structure, an increase in performance can be achieved. Figure 5 shows the structure of the proposed model.

**Figure 5.** Framework of Speech Recognition with Correction Methodology.

There are basically two phases in this model. The creation of a database of "Referenced Template Words" is the first phase. Then, in the second phase, this database is compared with the LSTM system outputs and the correction of the system outputs is performed.
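The second phase, matching each recognizer output against the "Referenced Template Words" database, can be sketched with a simple edit-distance lookup. This is an illustrative reconstruction of the comparison idea, not the authors' exact algorithm; the sample reference words are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two words (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, reference_words):
    """Replace a recognizer output with its closest reference word."""
    if word in reference_words:
        return word                                      # already a known word
    return min(reference_words, key=lambda ref: edit_distance(word, ref))

# Hypothetical reference database and a misrecognized output
references = {"merhaba", "nasilsin", "tesekkur"}
print(correct("merhapa", references))  # -> "merhaba"
```

A production version over millions of words would prune the search (e.g. by word length or a prefix index) rather than scan the whole database, but the correction principle is the same.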

#### *3.1. Data Preparation Phase*

In the data preparation phase of the proposed model, a Turkish data collection process was first performed. For this, three publicly distributed Turkish datasets were used. All of these datasets were created for different purposes; therefore, data correction and filtering operations were required before they could be used in this model. For instance, many editing procedures were needed, such as deleting schemas showing the suffixes and roots of words or correcting Turkish character problems. After these operations, the three datasets were combined, the whole data were analyzed, duplicated words were deleted, and type conversions were made so that the data could be kept in the same database tables. As a result, a "Referenced Template Words" database containing approximately 3 million unique Turkish words was created.
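The merge-and-deduplicate step can be sketched as follows. The cleaning rules shown (lowercasing and repairing a few commonly mis-encoded Turkish characters) are illustrative assumptions, not the exact filters used in this study.

```python
def clean_word(word):
    """Normalize a word: trim, lowercase, and repair common
    mis-encodings of Turkish characters (illustrative mapping only)."""
    fixes = {"\u00fe": "\u015f",   # þ -> ş  (Latin-1 / Latin-5 mixup)
             "\u00f0": "\u011f",   # ð -> ğ
             "\u00fd": "\u0131"}   # ý -> ı
    word = word.strip().lower()
    for bad, good in fixes.items():
        word = word.replace(bad, good)
    return word

def build_reference_db(*datasets):
    """Combine several word lists, clean each entry, drop duplicates."""
    unique = set()
    for dataset in datasets:
        for word in dataset:
            cleaned = clean_word(word)
            if cleaned:
                unique.add(cleaned)
    return sorted(unique)

# Three small hypothetical word lists with overlaps and casing noise
ds1 = ["Merhaba", "kitap"]
ds2 = ["kitap", "kalem "]
ds3 = ["okul", "MERHABA"]
db = build_reference_db(ds1, ds2, ds3)
print(db)  # ['kalem', 'kitap', 'merhaba', 'okul']
```

A set gives deduplication for free; at the scale of millions of words, the same logic would typically run as a database query with a unique index on the cleaned word column.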

After the preparation of the text dataset, it was necessary to prepare an audio dataset. In this study, the audio data of Middle East Technical University Microphone Speech v 1.0 [36,37] was used. It is a dataset prepared in the Turkish language. Turkish is a phoneme-based agglutinative language, and each letter is represented by a phoneme. However, in some cases, vowels and consonants may vary depending on where they are produced. The twenty-nine letters of the Turkish alphabet are represented by 45 phonetic symbols. In this dataset, a 193-speaker audio corpus and a pronunciation lexicon were developed. A new corpus and audio tools were created to ensure the accuracy of phonetic alignment and phoneme recognition. For the Turkish audio corpus, 91.2% of the automatically labelled phoneme boundaries are placed within 20 ms of the hand-labelled locations. The corpus is about 600 MB in size. The data were digitally recorded with a Sound Blaster sound card on a PC at a 16 kHz sampling rate. The first 2000 sentences of the TIMIT [38] dataset were translated into Turkish. Afterwards, studies aimed at improving the dataset yielded 2462 sentences with 9165 different words.
