1. Introduction
Speech to Text (STT) [1,2] is a process in which a computer interprets a person’s speech and converts the content into text. One of the most popular algorithms is the HMM (Hidden Markov Model) [3], which constructs an acoustic model by statistically modeling voices spoken by various speakers [4] and constructs a language model using a corpus [5].
Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another (https://en.wikipedia.org/wiki/Machine_translation). MT has been approached with rules [6], examples [7], and statistics [8]. Recently, Neural Machine Translation (NMT) [9] has dramatically improved MT performance, and many translation apps, such as iTranslate (https://www.itranslate.com/) and Google Translate (https://translate.google.com/), compete in the market.
The automated captioning system [10] of YouTube, a popular video-sharing site, combines STT with MT technology. When a video is uploaded, YouTube extracts the voice data, writes subtitle files, and translates the files into the desired language. These techniques greatly help users who do not speak the language of the content. However, if the speaker does not deliberately separate sentences, for example by pausing between them, multiple sentences are recognized as a single sentence. This problem significantly degrades MT performance. What is even worse, YouTube manages subtitles by time slots, not by utterances: the automatically generated subtitles are divided into scene units, and those units are what gets translated.
In this paper, we propose an automatic sentence segmentation method that automatically generates a period mark using deep neural networks to improve the accuracy of automatic translation of YouTube subtitles.
In natural language processing, a period is a very important factor in determining the meaning of a sentence.
Table 1 shows sentences whose meanings differ according to the position of the period. The example sentence is an STT-generated sentence from which periods and capital letters have been removed. Depending on where the period is placed, it splits into completely different sentences, as in Case 1 and Case 2, and the resulting sentences have different meanings. If the original text is divided as in Case 1, “he” refers to “Sam John”. On the other hand, if it is divided as in Case 2, “he” refers to “John”, and “John” should eat “Sam”.
Past studies on real-time sentence boundary detection and period position prediction [11,12,13] attempted to combine words and other features (pause duration, pitch, etc.) into one framework. One study [14] verified that the detection process could improve translation quality. In addition, research has been conducted to automatically detect sentence boundaries based on a combination of N-gram language models and decision trees [15]. Currently, Siri, Android, and the Google Speech-to-Text API (https://cloud.google.com/speech-to-text/docs/automatic-punctuation) insert punctuation marks after recognition. For these systems, however, it is more important to distinguish whether the final punctuation of a sentence should be a period or a question mark than to find the appropriate positions when recognizing multiple sentences at once. Because these studies focused on speech recognition and translation, they relied on acoustic features, which YouTube scripts cannot provide. A Chinese sentence segmentation study [16] created a statistical model using data derived from the Chinese Treebank and predicted the positions of periods based on that model. This research differed from the other studies mentioned above in that it used only text data.
This paper presents a new sentence segmentation approach based on neural networks and YouTube scripts that is relatively less dependent on word order and sentence structure, and we measure its performance. We used the 27,826 subtitles included in the online lectures provided by Stanford University. These lecture videos provide subtitle data that is well separated into sentence units. The subtitles can therefore be converted into the format of the automatic subtitles provided by YouTube and used as training data for a model that classifies whether a period is present or not. We use Long Short-Term Memory (LSTM) [17], a variant of the Recurrent Neural Network (RNN) [18] with excellent performance in natural language processing, to build a model from the data and predict the position of the punctuation mark. LSTM has shown potential for punctuation restoration in speech transcripts [19] by combining textual features and pause duration. Although RNNs perform well with various input lengths, we sacrificed some of this benefit by making the data length similar to that of YouTube subtitles. In this study, we build the input to match YouTube scripts as closely as possible and try to locate the punctuation marks using only text features.
The rest of this paper is organized as follows. Section 2 describes the experimental data used in this study and their preprocessing. Section 3 explains neural network-based machine learning and describes the algorithms and models used in this study. Section 4 explains the period prediction and sentence segmentation experiments in detail. Finally, Section 5 summarizes the study and presents conclusions.
3. Methods
In this paper, we use LSTM, an artificial neural network among machine learning methods, to predict whether or not to split a sentence. To do this, we create a dictionary with a number (index) for each word, express each word as a vector using word embedding, and concatenate five word vectors to use as X_DATA.
Figure 3 shows how the data are processed in the program. The five words in Figure 3 are taken from the second example in Table 2. First, each word receives a 100-dimensional vector value through word embedding using Word2Vec. These vector values also contain information about the position of the period. Second, the words that have been converted to vectors enter the LSTM layer, and each output is calculated while being affected by the previous output. Finally, from these outputs the softmax function predicts that a period must be generated between “accuracy” and “we”.
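The windowing and labelling steps above can be sketched in Python. This is our own illustration, not the paper's published code; the function name, the five-word window, and the example tokens are assumptions based on the description and Table 2.

```python
# Sketch of the windowing step (hypothetical names; the paper's actual
# preprocessing code is not published).
def make_training_pairs(words, period_positions, window=5):
    """Slide a `window`-word frame over the token stream and emit
    (frame, one_hot_label) pairs, where the label marks the word
    after which a period should be inserted (all zeros if none)."""
    pairs = []
    for start in range(len(words) - window + 1):
        frame = words[start:start + window]
        label = [0] * window
        for i in range(window):
            if (start + i) in period_positions:
                label[i] = 1  # a period follows this word
        pairs.append((frame, label))
    return pairs

# Illustrative input modelled on Figure 3: a period belongs after "accuracy".
tokens = ["the", "translation", "accuracy", "we", "propose"]
pairs = make_training_pairs(tokens, period_positions={2})
print(pairs[0])  # one (five-word frame, one-hot label) training pair
```

Each label vector corresponds to one of the Y_DATA classes described later (class A marks a period after the first word, class B after the second, and so on).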
3.1. Word Embedding
Word embedding is a method of expressing all the words in a given corpus in a vector space [20]. Word embedding allows us to estimate the similarity of words, which makes it possible to achieve higher performance in various natural language processing tasks. Typical word embedding models are Word2Vec [21,22,23], GloVe [24], and FastText [25]. In this paper, words are represented as vectors using Word2Vec, the most widely known of these models.
Word2Vec is a continuous word embedding model created in 2013 by several researchers, including Mikolov at Google. Word2Vec has two learning models: Continuous Bag of Words (CBOW) and Skip-gram. CBOW builds a network that infers a given word from its surrounding words as input; this method is known to perform well when the data is small. The Skip-gram, on the other hand, uses the given word as input to infer its surrounding words; this method is known to perform well when there is a lot of data (https://www.tensorflow.org/tutorials/representation/word2vec).
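The difference between CBOW and Skip-gram can be seen in how each builds its training pairs. The sketch below is illustrative only (the function names and example tokens are our own); real Word2Vec then trains a shallow network over these pairs.

```python
# Minimal sketch contrasting CBOW and Skip-gram training pairs.
def cbow_pairs(tokens, window=2):
    # CBOW: surrounding words (context) -> centre word
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, centre))
    return pairs

def skipgram_pairs(tokens, window=2):
    # Skip-gram: centre word -> each surrounding word
    pairs = []
    for i, centre in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((centre, ctx))
    return pairs

tokens = ["we", "segment", "youtube", "subtitles"]
print(cbow_pairs(tokens)[1])      # (['we', 'youtube', 'subtitles'], 'segment')
print(skipgram_pairs(tokens)[0])  # ('we', 'segment')
```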
3.2. RNN (Recurrent Neural Network)
An RNN (Recurrent Neural Network) is a neural network model suited to learning time-series data. Earlier neural network structures assume that the inputs and outputs are independent of each other. In an RNN, by contrast, the same activation function is applied to every element of a sequence, and each output is affected by the previous output. However, practical RNN implementations can effectively handle only relatively short sequences. This limitation is called the vanishing gradient problem, and Long Short-Term Memory (LSTM) was proposed to overcome it.
As shown in Figure 4, LSTM solves the vanishing gradient problem by adding an input gate, a forget gate, and an output gate to the paths that process data in the basic RNN structure. These three gates determine which existing information in the cell is discarded, whether new information is stored, and which output value is generated.
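A single step of such a cell can be sketched as follows. This is a generic LSTM cell in NumPy with random weights, not the model trained in this study; the stacked parameter layout and dimensions are our own choices (100-dimensional inputs echo the word vectors used here).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One LSTM cell step with the three gates described above.
def lstm_step(x, h_prev, c_prev, W, U, b):
    """x: input vector; h_prev/c_prev: previous hidden/cell state.
    W, U, b hold stacked parameters for the four internal blocks
    (input gate i, forget gate f, output gate o, candidate g)."""
    z = W @ x + U @ h_prev + b
    H = h_prev.size
    i = sigmoid(z[0:H])       # input gate: how much new info to store
    f = sigmoid(z[H:2*H])     # forget gate: how much old cell state to keep
    o = sigmoid(z[2*H:3*H])   # output gate: how much cell state to expose
    g = np.tanh(z[3*H:4*H])   # candidate cell update
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 100, 64                       # 100-dim word vectors, as in this study
W = rng.normal(size=(4 * H, D)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

Because the forget gate multiplies the old cell state rather than repeatedly squashing it through an activation, gradients can flow across many time steps, which is what mitigates the vanishing gradient problem.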
LSTM has been successfully applied to many natural language processing problems. In particular, it performs well in language modelling [26], which calculates the probability of the next word in a sentence, and in machine translation [27], which decides which sentence to output as the result of automatic translation. In this study, we use this characteristic of LSTM to learn the features of sentence segmentation from a series of consecutive words. We also create a model that can automatically segment sentences by producing the degree of segmentation as the output value of the network.
4. Experimental Results
The data of this paper consist of 27,826 subtitles (X_DATA, Y_DATA) in total. Training data and test data were randomly divided in a 7:3 ratio. Table 3 shows the hyperparameters used in this study and the performance of the experiment. As can be seen in Table 3, the words were represented as 100-dimensional vectors, and training was repeated 2000 times on the training set. As a result, the final cost is 0.181 and the accuracy is 70.84%.
We created the confusion matrix in Table 4 to evaluate our method in detail. In this table, A, B, C, D, and E correspond to the classes of Y_DATA in Table 2, in which A is “1, 0, 0, 0, 0”, B is “0, 1, 0, 0, 0”, and so on. We excluded the prediction case of “0, 0, 0, 0, 0”, since this case could be further addressed in the future.
Table 5 shows the precision, recall, and f-measure values of each class. The average f-measure is over 80%, a relatively high performance compared with previous studies [11,12,13,14,15], even though we did not use acoustic features. The five classes did not show distinguishable differences in prediction performance. We did not consider which words could be candidates for segmentation, because that process would require additional computation and resources. Unlike the works proposed earlier, we did not use any manual annotation. Given those considerations, the performance shows much potential.
Figure 5 shows the cross-validation and training data cost over the learning epochs. Checking the cross-validation makes it possible to determine the optimal number of epochs and to check for overfitting or underfitting. In this graph, after 1000 repetitions the CV curve and the training curve differ only slightly. Therefore, it can be concluded that further iterative learning provides no significant benefit.
Table 6 shows test data predicted by the learned model. Example A shows a correct prediction and example B shows a wrong prediction.
5. Conclusions
In this paper, we presented a method to find the proper positions of period marks in YouTube subtitles, which can help improve the accuracy of automatic translation. The method is based on Word2Vec and LSTM. We cut off the data to make them similar to YouTube subtitles and tried to predict whether a period can come between each pair of words. In the experiment, the accuracy of the approach was measured at 70.84%.
In future work, we will apply other neural network models, such as CNNs, attention, and BERT, to larger datasets to improve accuracy. To do this, we first need to collect subtitle data from a wider range of online education sites. Second, we want to collect more YouTube subtitle data in various areas and translate it into Korean to build a parallel corpus; we will develop the software tools needed to build that corpus. Finally, we will develop a speech recognition tool based on the open API to create actual YouTube subtitle files. With those steps, we will be able to build the entire pipeline from YouTube voice to Korean subtitle creation.