Article

Automatic Correction of Indonesian Grammatical Errors Based on Transformer

Ahmad Musyafa, Ying Gao, Aiman Solyman, Chaojie Wu and Siraj Khan

1 School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
2 Informatics Engineering, Pamulang University, Jalan Raya Puspitek 46, Banten 15310, Indonesia
3 School of Software Engineering, South China University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(20), 10380; https://doi.org/10.3390/app122010380
Submission received: 13 September 2022 / Revised: 10 October 2022 / Accepted: 11 October 2022 / Published: 14 October 2022

Abstract
Grammatical error correction (GEC) is a major task in natural language processing (NLP) that has recently attracted considerable attention from researchers. GEC performance for high-resource languages such as English and Chinese has improved significantly, which can be attributed to powerful neural network models and pretrained language models. Motivated by these satisfactory results and by the lack of research on GEC for low-resource languages, especially Indonesian, this paper proposes an automatic Indonesian grammar correction model based on the Transformer architecture that can also be applied to the text of other low-resource languages. Furthermore, we build a large Indonesian corpus that can be used to evaluate future Indonesian GEC systems. We evaluate the models on this dataset, and the results show that the Transformer-based automatic error correction model achieves significant and satisfactory improvements over previous research models.

1. Introduction

Grammar is one of the most important features of human language, providing understanding and readability for readers and listeners [1]. Grammatical and related errors in text and speech lead to confusion and misunderstanding. Therefore, numerous studies in natural language processing have tried to address the challenge of automatic grammatical error correction (GEC). Recently, English and Chinese have received the most research attention thanks to large-scale pretrained models and massive GEC training data, with reported performance close to the human level [2,3]. However, languages such as Arabic, Russian and Indonesian are classified as low-resource languages and have received little research attention due to a lack of training data. Indonesian is the 10th-most-spoken language in the world (https://www.visualcapitalist.com/100-most-spoken-languages/, accessed on 1 May 2022) and is used by more than 270 million people (https://en.wikipedia.org/wiki/List, accessed on 1 May 2022). However, research on Indonesian natural language processing (NLP), especially the GEC task, remains limited because no parallel GEC training data are available.
In the last decade, neural networks have grown significantly in NLP and penetrated various tasks, including text classification [4,5], information extraction [6,7], summarization [8], question answering [9] and GEC [10]. Interest in the automatic correction of grammar and spelling has risen with the recent increase in second-language learners and with advances in deep learning technology, whose effects can be felt directly in real life [11,12]. Recently, automatic GEC systems based on sequence-to-sequence (seq2seq) machine translation techniques have been preferred, such as recurrent neural networks (RNNs) [13], convolutional neural networks (CNNs) [14] and multi-head attention networks (Transformer) [15,16], the last of which has been found to outperform the other models in many NLP tasks, including GEC. GEC based on NMT techniques uses parallel data to learn how to map the source (incorrect sentence) to the target (correct sentence), as shown in Figure 1.
Indonesian GEC is still an emerging area, and only a few studies have been performed. Floranti et al. [17] introduced a descriptive, analysis-based research method to analyze grammatical errors. Lin et al. [4] proposed an Indonesian GEC framework for classification tasks that corrects 10 part-of-speech (POS) error types in text. Fahda and Purwarianti [18] proposed a Markov probability model to correct spelling and grammatical errors. Yusnitasari and Suwartono [19] specified 10 types of grammatical errors that often occur in writing using a morphological and syntactic approach. Rahutomo et al. [20] introduced a rule-based grammar-checking method for sentence patterns and punctuation. Most previous methods are rule-based or statistical, and to the best of our knowledge, end-to-end neural approaches have not been investigated for Indonesian GEC.
In this paper, we propose an automatic GEC model based on the Transformer [15] to correct Indonesian grammatical errors, which can also be used to correct the text of other low-resource languages. The proposed model is equipped with a copy mechanism to handle the unknown and special words that appear in source sentences, one of the main challenges of GEC. In addition, we propose a semi-supervised method to construct parallel training data from an out-of-domain monolingual corpus. The proposed model is the first end-to-end neural Indonesian grammatical error correction (IGEC) system; it outperforms previous IGEC systems and can correct all types of grammatical errors. We summarize the contributions of this paper as follows:
  • We propose a multi-level, semi-supervised method to construct synthetic parallel training data from an out-of-domain corpus;
  • We construct a large amount of synthetic data available for open access to address the challenge of a lack of training data;
  • We propose an IGEC framework based on multi-head attention networks (Transformer) equipped with a copying mechanism to improve performance;
  • The proposed model has the auto-correction ability for all types of errors in Indonesian GEC and outperforms the existing approaches.
The code and trained models are available at GitHub (https://github.com/Almangiri/Indonesian-GEC-framework (accessed on 1 May 2022)). The rest of this paper is structured as follows. Section 2 presents the most related work. Section 3 describes the types of Indonesian grammatical errors. Section 4 explains the proposed model and system components. Section 5 presents the experimental details, Section 6 reports the results and discussion, and Section 7 concludes the paper.

2. Related Works

Research and industry attention toward automatically detecting and correcting grammatical errors in text has increased in the last decade, and different approaches have been proposed. Rule-based systems use hundreds of grammar rules to detect and correct grammatical errors [21]. N-gram language models measure the probability of each sequence of words (n-gram) estimated from a large text corpus [22]. Statistical classifier GEC systems use classification algorithms such as decision trees, naive Bayes classifiers and logistic regression to detect grammatical errors [4]. Statistical machine translation (SMT) GEC treats correction as phrase-based translation of the incorrect input sentence, maximizing the conditional probability of the corrected output sentence over possible corrections [10]. More recently, neural machine translation (NMT) with deep neural networks has been used in GEC as a class of models that learn to map an ill-formed input (an incorrect sentence) to a well-formed output (a correct sentence) via a set of hidden layers [23].
Buck et al. [24] proposed GEC based on a five-gram language model to overcome the limitations of training data and claimed that it outperformed the language model used in statistical machine translation; the model was trained on a large corpus collected from more than nine billion web pages. Hernandez and Calvo [25] used trigram and bigram language models to detect grammatical errors in a large corpus. Yeh et al. [26] combined an n-gram language model with a string-matching algorithm to resolve Chinese spelling problems arising from Cantonese usage. Another study, by Zhao et al. [27], used an n-gram model to detect and correct Chinese grammatical errors; it achieved high accuracy in correction but was not satisfactory in detection. The weakness of these language models is that they are not well suited to representing sequence probability or to handling sentence construction and rare words.
GEC classifier systems classify input sentences into sentence or word categories. Most previous work classified specific error types such as verb, preposition and article errors, given the large space of possible corrections. Li et al. [13] developed a GEC system to correct preposition errors, missing commas, articles, nouns and verb forms. Lin et al. [4] introduced an Indonesian GEC framework that combined a pretrained language model with a BRNN to correct 10 types of errors in text. Another good example is presented by Makarenkov et al. [28], who employed a bidirectional LSTM tagger for proper word choice in English GEC; it appends a part-of-speech (POS) tagger and filters the suggested words by verb form, article, preposition, comma and noun number from the source tokens. The disadvantage of this approach is that the error type classification target is extremely specific, making it difficult to extend to new error types.
SMT has also achieved remarkable improvement in GEC. For example, Junczys-Dowmunt and Grundkiewicz [29] used an SMT-based method with task-specific features to correct grammatical errors. Grundkiewicz and Junczys-Dowmunt [3] proposed a hybrid of SMT and NMT methods for automatic correction of grammatical errors. Although SMT systems can be used to correct all types of errors, they still rely heavily on large amounts of training data and are customized to specific error types. Recently, NMT approaches have been used in GEC. For example, Yuan and Briscoe [30] utilized the NMT approach to overcome the challenge of rare words, and it outperformed SMT systems when evaluated on both the First Certificate in English (FCE) and CoNLL-2014 test sets. Bahdanau et al. [31] proposed an NMT model that extends the basic encoder-decoder architecture to handle the problem of a fixed context vector for the source sentence. Solyman et al. [11] introduced a semi-supervised GEC model based on a CNN with an attention mechanism and fine-tuning to correct Arabic text. In addition, Wang et al. [32] demonstrated the ability of NMT-based GEC to handle infrequent error patterns, meaning that NMT approaches can cover more types of error patterns than classical GEC systems. Owing to the fast growth of neural networks, numerous GEC systems now use multi-head attention networks, which have achieved state-of-the-art results in GEC [16,31,33].
In summary, previous work on low-resource languages focused on rule-based techniques, GEC classifiers, SMT-based GEC and classical NMT techniques. Furthermore, numerous approaches have been proposed to generate synthetic data to address the lack of parallel GEC training data. The limitations of these works are that the rule-based approaches can correct only certain types of errors and that the generated synthetic data yield unreliable training patterns, since they rely on simple spelling confusion functions. This motivated us to propose an end-to-end Indonesian GEC framework together with a semi-supervised method for generating a large-scale parallel corpus.

3. Grammatical Error Types in the Indonesian Language

This section introduces a systematic classification method to describe and classify the most common types of Indonesian grammatical errors. Sari [34] studied the types of grammatical errors on Indonesia's official government and tourism websites, identifying two types of errors, namely syntactic and morphological errors. In contrast, Aini [35] analyzed grammatical errors in translated text using qualitative descriptive methods and identified 11 types of grammatical errors, classified as incorrect use of determiners, verbs, auxiliary verbs, prepositions, conjunctions, pronouns, singular nouns and noun phrases, and the omission of verbs, pronouns and determiners. Fahda and Purwarianti [18] introduced a rule-based model combined with a statistical MT approach to detect three types of errors: punctuation, spelling and word selection. Rahutomo et al. [20] used a rule-based method to check three types of errors: sentence patterns, punctuation and spelling.
Previous studies classified various types of grammatical errors from different points of view. However, none of them identified criteria for classifying Indonesian grammatical errors. In the following subsections, we classify the types of Indonesian grammatical and related errors based on the nature of the errors, their frequency, their effect and the detection technique, as reviewed in the literature. Accordingly, we identified four main criteria for Indonesian GEC as follows:
Error Nature: This includes the types of errors related to the difficulty and ease of detection. These errors must be separated into different classes. For instance, it is easy to detect spelling errors using a spell checker, whereas it is difficult to automatically detect semantic errors, which require experts (in-depth knowledge of linguistics).
Error Frequency: This includes highly frequent and common error types, such as syntax, semantics, spelling, preposition, punctuation, conjunction and word choice errors, which should form a separate group.
Error Effect: This covers types of errors related to the text validity, which should be separated according to the validity class. For example, a spelling error can invalidate a word, a syntax error can invalidate a text, and a sentence structure error can invalidate a sentence.
Error Detection: This groups error types that can be detected in the same way, whether at the sentence level or the word level. For example, spelling error detection can be performed by checking the complete sentence, while preposition error detection can be performed by checking the words before and after prepositions.
We identified the most common Indonesian grammatical errors according to the main criteria and the similar studies above, covering the most common errors made by both second-language learners and native speakers. In summary, we identified the main types of errors shown in Table 1; Figure 2 describes the main categories, and more details are given in the following subsections:
1. Syntax error: These error types relate to incorrect writing style and all violations of grammar rules, such as incorrect use of sentence structure, morphology, syntax and even punctuation. In syntax, the main concern is the relationship among words in a sentence, and we identified the following subtypes of errors:
  • Word choice: An inappropriate word in the sentence structure gives the sentence an inappropriate meaning. Second-language learners often choose the wrong word, which can lead to a different meaning, such as by exchanging the words ‘Ada’ and ‘Adalah’, ‘Berangkat’ and ‘Meninggalkan’ and ‘Mengambil’ and ‘Meninggalkan’.
  • Affix usage: Incorrect affix usage changes the meaning of the base word, and this error is often made by language learners, as in ‘Nikmat = Menikmati = Kenikmatan’, ‘Cerai = Bercerai = Penceraian’ and ‘Jalan = Berjalan = Perjalanan’.
  • Word order: This error concerns the structure or arrangement of words in sentences. Foreign learners often make this error by carrying English word order over into Indonesian, writing ‘Menarik hari’ instead of ‘Hari menarik = An interesting day’ and ‘Bagus buku’ instead of ‘Buku bagus = A good book’, among others.
  • Syntactic error: This error often appears as incomplete or ungrammatical sentences under Indonesian grammar rules. Learners often omit the subject, predicate or object of a sentence, so the sentence is not fully understood. For example, they omit the object in ‘Mereka menikmati banyak sekali = They enjoy a lot’, while the correct sentence is ‘Mereka sangat menikmati malam–malam indah = They enjoy the beautiful nights a lot’.
  • Preposition usage: Using an inappropriate preposition changes the meaning of the sentence. An example of incorrect preposition usage is ‘Mereka membeli sebuah buku ke Toko buku = They buy a book to the bookstore’, which should be ‘Mereka membeli sebuah buku di Toko buku = They buy a book at the bookstore’.
2. Spelling error: This frequently occurs while writing text, such as adding or deleting characters or missing spaces between words. These errors can be detected easily using a simple spell-checker application; misspelled affixes and conjunctions are typical examples.
3. Semantic error: These errors occur during writing as a result of incorrect use or misuse of punctuation marks in sentences, which leads to misunderstanding for the reader.
  • Conjunction usage: Incorrect use of conjunctions is common among second-language learners. Conjunctions link phrases, clauses and sentences, and to avoid misinterpretation they must be used correctly and appropriately; for instance, the words ‘Bahwa’ and ‘Walaupun’ are often used in contexts where they are not correct.
  • Plural formation: The formation of plural words involves using plural markers such as ‘Beberapa = some’, ‘Sedikit = few’ and ‘Banyak = many’, numerals such as in ‘10 Buku = 10 books’ and the repetition of the noun itself, such as ‘Toko-toko = shops’. However, language learners often use repeated plural words, such as in ‘Banyak toko-toko tutup karena pandemi = Many shop and shop are closed because of a pandemic’, and the correct usage is ‘Banyak toko tutup karena pandemi = Many shops are closed because of a pandemic’.
  • Passive construction usage: Passive and active sentences are among the toughest challenges for foreign learners, who often confuse the two, as in ‘Stadion itu membangun untuk olahraga national = The stadium build for national sports’, where the correct usage is ‘Stadion itu dibangun untuk olahraga national = The stadium was built for national sports’.

4. Method

This paper introduces a global GEC framework for low-resource languages, applied here to Indonesian. To overcome the data sparsity problem in Indonesian GEC, we first propose a semi-supervised confusion method to generate parallel training data. The baseline is a multi-head attention network equipped with a copy mechanism to address the challenge of special words in given sentences; the details are shown in Figure 3.

4.1. Confusion Method

A lack of training data is one of the main challenges of GEC, especially for low-resource languages. To address this problem, we propose a semi-supervised method to generate synthetic data that increase the amount of training data; the generated data were used to pretrain the GEC model. The seed of the synthetic data came from an open-source monolingual corpus, the CC100-Indonesian dataset. These data were created by Conneau et al. [36] and constitute one of 100 monolingual corpora processed from Common Crawl snapshots (https://data.statmt.org/cc-100/ (accessed on 1 May 2022)). The Indonesian data consist of 6,848,850 sentences with a total size of 747 MB.
Initially, the data were merged into one file and then normalized by removing mentions, non-UTF-8 characters, hashtags, links and extra spaces, while exclamation points, numbers and punctuation were kept. The target file was split into sentences with a maximum length of 40 words and a minimum of 10 words. Then, three versions of the synthetic data were produced based on normalization: in the first version, both the source and target sentences were normalized; in the second version, only the target side was normalized; and in the third version, both sides were kept in the original format. The proposed method consists of three sub-approaches: (1) misspelling, which adds or deletes a random character within random words; (2) synthesizing punctuation errors, which converts the input sentences to POS tags and then removes or adds punctuation marks from a given list; and (3) randomly swapping pairs of words to generate semantic errors. In the last step, the synthesized sentences were paired with their corresponding original sentences to generate synthetic GEC training pairs with various training patterns. The details of the semi-supervised method are given in Algorithm 1 and Figure 4, and a Python sketch follows the algorithm.
Algorithm 1 Confusion method approach
Require: X, α ▹ Monolingual sentence and the value of alpha
Ensure: (X̂, Ŷ) ▹ Synthetic data
      function AdChr(W_i) ▹ Add a random character to W_i
            W_i = [c_1, c_2, ..., c_n]; select a random c_i ∈ [c_1, c_2, ..., c_n]
            insert a random character ĉ_i at index i + 1
            return Ŵ_i = [c_1, c_2, ..., c_i, ĉ_i, ..., c_n] ▹ The synthesized word Ŵ_i
      end function
      function DelChr(W_i) ▹ Delete a random character from W_i
            W_i = [c_1, c_2, ..., c_n]; select and delete a random c_i
            return Ŵ_i = [c_1, c_2, ..., c_{i-1}, c_{i+1}, ..., c_n] ▹ The synthesized word Ŵ_i
      end function
      function Punctuation(Ŷ) ▹ P_lst = [?, [, ], {, }, !, ", (, ), *, ., :] is a list of punctuation
            Ŷ_i = [w_1, w_2, ..., w_n]
            insert a random item of P_lst at a random position within Ŷ_i
            return Ŷ_i
      end function
      function Swap(Ŷ)
            Ŷ_i = [w_1, w_2, ..., w_n]
            swap a random pair of words, e.g., Ŷ_i = [w_1, w_3, w_2, ..., w_n]
            return Ŷ_i
      end function
      procedure ConfusionFunction(X, α)
            Ŷ ← X
            N ← ⌈α · len(X)⌉
            fns = [AdChr, DelChr]
            for N iterations do
                  W_i ← a random word in X̂ ▹ Select a random word in X̂
                  Ŵ_i ← choice(fns)(W_i) ▹ Generate spelling errors
                  update X̂
                  Ŷ ← Punctuation(Ŷ) ▹ Generate punctuation errors
                  Ŷ ← Swap(Ŷ) ▹ Generate semantic errors
            end for
            return X̂
      end procedure
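To make Algorithm 1 concrete, the following Python sketch implements the three confusion sub-approaches. It is an illustration under our own assumptions rather than the released code: the function names, the character pool, the default value of alpha and the decision to fold all three error types into a single noisy source sentence (the pipeline above instead emits several normalized versions) are ours.

```python
import random

# Punctuation list P_lst from Algorithm 1.
PUNCT_LIST = ["?", "[", "]", "{", "}", "!", '"', "(", ")", "*", ".", ":"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_char(word: str) -> str:
    """AdChr: insert a random character at a random index."""
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice(ALPHABET) + word[i:]

def del_char(word: str) -> str:
    """DelChr: delete a random character (unchanged if the word is too short)."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def punctuation(tokens: list) -> list:
    """Insert a random punctuation mark at a random position."""
    out = tokens[:]
    out.insert(random.randrange(len(out) + 1), random.choice(PUNCT_LIST))
    return out

def swap(tokens: list) -> list:
    """Swap one random adjacent pair of words."""
    out = tokens[:]
    if len(out) > 1:
        i = random.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def confuse(sentence: str, alpha: float = 0.2) -> tuple:
    """Return a (noisy source, clean target) training pair."""
    target = sentence.split()
    source = target[:]
    n = max(1, int(alpha * len(source)))
    for _ in range(n):                       # spelling errors
        i = random.randrange(len(source))
        source[i] = random.choice([add_char, del_char])(source[i])
    source = punctuation(source)             # punctuation errors
    source = swap(source)                    # semantic / word-order errors
    return " ".join(source), " ".join(target)

# Example (output is random), e.g.:
# confuse("Banyak toko tutup karena pandemi")
# -> ('Banyak toko ktutup karena : pandemi', 'Banyak toko tutup karena pandemi')
```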

4.2. Model Architecture

Recently, neural-based approaches have demonstrated their efficiency in addressing the challenges of NLP tasks. In the same context, NMT approaches outperform other approaches, such as rule-based and SMT approaches [37].
In this paper, the baseline is based on the Transformer architecture and the attention mechanism. The model has an encoder-decoder structure, each side consisting of four identical stacked blocks. Each block consists of two sublayers, namely multi-head attention and a position-wise feedforward network, each followed by a residual connection and a normalization layer, as in the original paper [15]; the decoder additionally has an attention layer over the encoder's hidden states. The first layer combines word embeddings and positional embeddings into a context vector, allowing the model to keep the order of the tokens within a sequence during training. The model encodes the input sentence and translates it into a sequence of vectors representing the corrected target sentence, predicting the target tokens $(y_1 \ldots y_T)$ from the source tokens $(x_1 \ldots x_N)$ using the equations below:
$h^{src}_{1 \ldots N} = \mathrm{encoder}(L_{src}\, x_{1 \ldots N})$ (1)
$h_t = \mathrm{decoder}(L_{trg}\, y_{t-1 \ldots 1},\; h^{src}_{1 \ldots N})$ (2)
$P_t(w) = \mathrm{softmax}(L_{trg}^{T}\, h_t)$ (3)
where $h^{src}_{1 \ldots N}$ are the hidden states of the encoder and $h_t$ is the target hidden state for the next word. To generate the probability distribution of the next word, the softmax operation in Equation (3) is applied between the target hidden state and the word embedding matrix $L \in \mathbb{R}^{d_x \times |v|}$, where $d_x$ and $|v|$ refer to the word embedding dimension and the vocabulary size, respectively. The model is trained with the cross-entropy loss:
$l_{ce} = -\sum_{t=1}^{T} \log(p_t(y_t))$ (4)
where $l_{ce}$ denotes the loss of each training example, accumulating the cross-entropy loss at each position during decoding.
Motivated by the previous success of the copy mechanism in seq2seq tasks such as text summarization [38], we leverage this feature in a low-resource GEC system to improve the output probability distribution $P_t$. Figure 3 shows the global architecture of the proposed framework, consisting of the Transformer with the copy mechanism. Equation (5) defines the output $P_t$ as a mixture of the decoder generation distribution $P^{gen}_t$ and the copy distribution $P^{copy}_t$ of the GEC model, which extends the vocabulary to all the words appearing in the source sentence. A balancing factor $\alpha^{copy}_t \in [0, 1]$ at each time step $t$ controls the trade-off between copying and generating text:
$P_t(y_t) = (1 - \alpha^{copy}_t) \cdot P^{gen}(y_t) + \alpha^{copy}_t \cdot P^{copy}(y_t)$ (5)
In the same context, the normalized attention distribution is used as the copy score, and the copy hidden states are used to estimate the balancing factor $\alpha^{copy}_t$ as in the following equation:
$\alpha^{copy}_t = \mathrm{sigmoid}(W^{T}(A_t^{T} \cdot V))$ (6)
Finally, the loss is calculated as in Equation (4) while considering the mixed probability distribution of $y_t$ given in Equation (5).
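To illustrate how Equation (5) can be realized, the PyTorch sketch below mixes the generation and copy distributions. It is simplified under our own assumptions: the copy distribution is scattered onto a shared vocabulary (the model above instead extends the vocabulary with source-sentence words), all tensor names are hypothetical, and the balancing factor is assumed to have been computed per Equation (6).

```python
import torch

def mix_distributions(p_gen: torch.Tensor,        # (batch, vocab) generator softmax output
                      attn: torch.Tensor,         # (batch, src_len) normalized attention = copy scores
                      src_ids: torch.LongTensor,  # (batch, src_len) source token ids
                      alpha_copy: torch.Tensor    # (batch, 1) balancing factor from Eq. (6)
                      ) -> torch.Tensor:
    """P_t = (1 - alpha) * P_gen + alpha * P_copy, as in Equation (5)."""
    # Build P_copy by scattering the attention mass onto the source token ids.
    p_copy = torch.zeros_like(p_gen)
    p_copy.scatter_add_(1, src_ids, attn)
    return (1.0 - alpha_copy) * p_gen + alpha_copy * p_copy
```

At training time, taking the negative log of this mixed probability at the gold token gives the per-position term of the cross-entropy loss in Equation (4).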

5. Experiments

5.1. Dataset

The seed data were a monolingual corpus created by Conneau et al. [36], which was used to train the XLM-R model from the cc-net repository [39]. The original dataset includes 100 monolingual languages constructed using the URLs and paragraph indexes provided by the cc-net repository for processing Common Crawl snapshots. The total size of the data is 36 GB, of which we used 747 MB of the CC100-Indonesian corpus. After augmentation, the synthetic parallel data totaled 2.08 GB. The final dataset was subdivided into 5,999,634 examples for the training set, 424,431 examples for the development set and 424,420 examples for the test set.

5.2. Model and Parameters

The experiments used the configuration of the seq2seq Transformer proposed by Vaswani et al. [15] with the modifications listed in Table 2: the model size was reduced from 512 to 256, the batch size from 2048 tokens to 128 tokens, and the number of layers from six to four, while the number of attention heads was kept at the default value of eight. In addition, a static Adam optimizer [40] was used with a fixed learning rate of 3 × 10⁻⁴ instead of a warm-up and cool-down schedule. During training, we applied early stopping when the performance on the validation data stopped improving, to prevent overfitting. All the experiments were performed with Python 3.6 and CUDA 10.2 on two Titan RTX GPUs.
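The hyperparameters in Table 2 map directly onto a model definition. The sketch below uses PyTorch's stock nn.Transformer purely as a stand-in for the modified SAN-GEC baseline; the copy mechanism, vocabulary handling and feedforward width are not shown because they are not specified here, so the wiring is an assumption.

```python
import torch
import torch.nn as nn

# Hyperparameters taken from Table 2.
model = nn.Transformer(
    d_model=256,           # embedding size (reduced from the default 512)
    nhead=8,               # attention heads (kept at the default)
    num_encoder_layers=4,  # reduced from six layers
    num_decoder_layers=4,
    dropout=0.1,
)

# Static Adam with a fixed learning rate of 3e-4 (no warm-up/cool-down schedule).
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
```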

5.3. Evaluation

We evaluated the model's performance and report the precision, recall and F1 scores as well as the Bilingual Evaluation Understudy (BLEU) score [41]. BLEU is an evaluation metric often used in machine translation that measures the similarity between the reference data and a system's predictions; in this paper, we used BLEU-4 to evaluate the quality of the proposed model's outputs against the reference sentences. Precision measures the proportion of the corrections proposed by the model that are correct, recall measures the proportion of the actual errors that the model corrects, and the F1 score is the harmonic mean of the two.
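For reference, the corpus-level BLEU score can be computed with the sacrebleu library, as in the sketch below. The library choice and file layout are our assumptions, since the tooling is not named here, and the precision/recall computation is omitted because its exact matching scheme is not specified.

```python
import sacrebleu

# Hypothetical layout: one system correction and one reference per line.
with open("hypotheses.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# corpus_bleu computes 4-gram BLEU by default, matching BLEU-4 above.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score:.2f}")
```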

6. Results and Discussion

This section reports the experimental results of our IGEC model using synthetic and authentic training sets. Furthermore, different seq2seq models and different decoding improvement techniques were investigated and analyzed; they are discussed in the following subsections.

6.1. Impact of the Synthetic Data

First, the ability of the semi-supervised method to generate reliable training data was investigated with different frameworks. Table 3 shows the impact of the synthetic data on a bidirectional recurrent neural network (BRNN), a convolutional neural network (CNN) and a self-attention network (SAN), all without fine-tuning. The BRNN with an attention mechanism, which tends to attend only to nearby words and whose attention signal diminishes during backward propagation, was investigated for Indonesian GEC and reported an F1 score of 36.22 and a BLEU score of 48.18. It is worth mentioning that the CNN and SAN have never before been investigated for IGEC. CNN-GEC with attention can be trained in parallel, with no need for sequential operations; in addition, the CNN combines feature extraction and classification into one task, which makes training and decoding faster. It achieved remarkable improvement in the F1 score (+7.31) and BLEU score (+8.58), and even without fine-tuning this result outperformed a recent IGEC framework presented in [4]. SAN-GEC was based on a modified version of the Transformer, which perceives the entire input sequence simultaneously; it increased the F1 score to 55.28 and the BLEU score to 59.91. These findings demonstrate the efficiency and reliability of the semi-supervised method for constructing training data for the Indonesian GEC task and addressing the lack of training data. Figure 5 shows a comparison of these models.

6.2. SAN-GEC with a Copy Mechanism

This section investigates the performance of IGEC based on SAN-GEC (Transformer) with the copy mechanism, as well as some machine translation techniques for improving performance. Table 4 shows that SAN-GEC with the copy mechanism increased the F1 score from 55.28 to 62.67, an improvement of 7.39. This improvement comes from addressing the problem of UNK words outside the training vocabulary: each word predicted as UNK is replaced by copying the corresponding word from the source sentence.
Although the copy mechanism performed well in handling UNK words, this raises a concern about the accuracy of the copied words. To this end, the copy mechanism was also verified without copying UNK words, i.e., ignoring UNK tokens during the experiments. Table 4 shows that SAN-GEC with the copy mechanism remained ahead of plain SAN-GEC, with an increase of 4.98 in the F1 score and a remarkable improvement in recall. In the same context, BPEChar was applied to address the UNK words by subdividing each UNK word into sub-tokens. The impact of BPEChar was verified with different vocabulary sizes: 30,000, 10,000 and 1000. Table 4 shows that the best model, SAN-GEC with the copy mechanism and a 1000-subword BPEChar vocabulary, increased the F1 score from 62.67 to 69.87, an improvement of 7.20 over SAN-GEC with the copy mechanism alone. Furthermore, this best model was improved by fine-tuning on an authentic training corpus, which increased the F1 score from 69.87 to 71.94 and the BLEU score from 77.21 to 78.13. Figure 6 illustrates the difference in performance.

6.3. Discussion

The proposed IGEC framework demonstrated that it is an effective system for addressing the lack of training data for the Indonesian language. Furthermore, it was able to correct all types of errors and was not restricted to specific ones. The experimental results on over 100,000 sentences demonstrated that SAN-GEC with the copy mechanism and BPEChar performed best, with a BLEU score of 78.13, making it our best model. SAN-GEC achieved remarkable improvement in GEC by perceiving the entire input sequence simultaneously regardless of each word's position, while the copy mechanism increased the accuracy on named entities, which are usually treated as UNK words during training in classical neural GEC systems. Furthermore, BPEChar addressed spelling errors, one of the main challenges of automatic GEC models, more effectively than the copy mechanism. Comparing the different BPEChar settings, the 30,000-subword vocabulary captured word-level syntax and features more efficiently, the 10,000-subword vocabulary was better at correcting spelling and UNK words, and the 1000-subword vocabulary was best overall. Finally, we compared the performance of the proposed framework with recent Indonesian GEC systems. Bryant and Briscoe [42] introduced a language-model-based GEC system limited to correcting a range of errors that includes preposition, conjunction and indefinite pronoun errors, which reported an F1 score of 41.60. Lin et al. [4] proposed an Indonesian GEC model based on POS tags, trained on 10,000 sentences and used to correct 10 types of grammatical errors, which posted an F1 score of 55.10. To the best of our knowledge, ours is the first work to employ end-to-end neural GEC for Indonesian, with three different neural models. Aside from that, our semi-supervised confusion method constructed massive training data for the Indonesian GEC task.

6.4. Case Study

This subsection introduces a case study examining model performance on a real-world example taken from the CC100-Indonesian dataset. Table 5 shows the outputs of different versions of the Indonesian GEC model, namely the baseline SAN-GEC (Transformer), SAN-GEC with the copy mechanism, and SAN-GEC with the copy mechanism and BPEChar, together with the source, target and English translation. The presented example has 21 errors in six categories: syntactic or grammatical errors (numbers 1, 8, 11 and 13), wrong word selection (numbers 2, 5, 9, 17, 18 and 19), conjunction usage (numbers 6 and 14), incorrect preposition usage (numbers 4 and 20), affix usage and word order (numbers 3, 7, 12 and 21), and punctuation (numbers 10, 15 and 16). The red color marks all the errors in the sentences. The baseline model (Transformer) corrected the errors at numbers 1, 2, 5, 6, 8, 9, 11, 13, 14, 17, 18 and 19 and failed to correct the errors at numbers 3, 4, 7, 10, 12, 15, 16, 20 and 21. The next model (Transformer with the copy mechanism) corrected all the mistakes except the affix usage and word order errors at numbers 3, 7 and 12 and the remaining punctuation errors. The last model, Transformer with the copy mechanism and BPEChar, corrected all the mistakes but failed to correct the punctuation errors at numbers 10, 15 and 16, leaving only a comma in the sentences.
The SAN-GEC model with the copy mechanism and BPEChar proved its ability to detect and correct Indonesian text errors, except for some punctuation errors. The punctuation errors in the given example are not a major issue in Indonesian grammar, as the sentences remain acceptable in Indonesian writing. Thus, our model is relatively effective compared with the baseline (Transformer), showing a clear improvement over the previous performance.

7. Conclusions

We introduced a Transformer-based global GEC framework equipped with a copy mechanism for low-resource languages, applied here to Indonesian. The proposed model is the first end-to-end neural Indonesian GEC system and can also be used to correct the text of other low-resource languages. The copy mechanism deals with the unfamiliar and special words that occur frequently in source sentences. In addition, to overcome the problem of data sparsity in Indonesian GEC, we proposed a semi-supervised confusion method that produces parallel training data from an out-of-domain monolingual corpus. The performance of the proposed model improved significantly, and it can correct all types of Indonesian grammatical errors. Finally, the experimental results outperformed the previous GEC models, with an F1 score of 71.94 and a BLEU score of 78.13, which are relatively strong for grammatical error correction tasks.
Although the proposed confusion method was able to generate different types of training patterns, it still could not cover the most difficult grammatical errors, such as semantic and syntactic errors. To this end, we are interested in using an extended rule set to generate such errors, and we would also like to use error tags to control the error rates and distribution. Furthermore, we are interested in investigating the performance of our model on other low-resource languages such as Malay, Javanese and Sundanese.

Author Contributions

Conceptualization, A.M. and A.S.; methodology, A.M.; software, A.S.; validation, Y.G., A.M. and A.S.; investigation, A.M.; resources, A.M.; data curation, C.W. and S.K.; writing—original draft preparation, Y.G.; writing—review and editing, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangzhou Key Area R&D Program (202103010005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and trained models are available at github.com/Almangiri/Indonesian-GEC-framework (accessed on 2 September 2022).

Acknowledgments

Our work was supported by the Guangzhou Key Area R&D Program (202103010005).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Solyman, A.; Wang, Z.; Tao, Q. Proposed model for Arabic grammar error correction based on convolutional neural network. In Proceedings of the 2019 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), Khartoum North, Sudan, 21–23 September 2019; pp. 1–6.
  2. Kiyono, S.; Suzuki, J.; Mizumoto, T.; Inui, K. Massive exploration of pseudo data for grammatical error correction. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2134–2145.
  3. Grundkiewicz, R.; Junczys-Dowmunt, M. Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: New Orleans, LA, USA, 2018; Volume 2 (Short Papers), pp. 284–290.
  4. Lin, N.; Chen, B.; Lin, X.; Wattanachote, K.; Jiang, S. A Framework for Indonesian Grammar Error Correction. Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 1–12.
  5. Obied, Z.; Solyman, A.; Ullah, A.; Fat’hAlalim, A.; Alsayed, A. BERT Multilingual and Capsule Network for Arabic Sentiment Analysis. In Proceedings of the 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), Khartoum, Sudan, 26–28 February 2020; pp. 1–6.
  6. Tissot, H.C.; Shah, A.D.; Brealey, D.; Harris, S.; Agbakoba, R.; Folarin, A.; Romao, L.; Roguski, L.; Dobson, R.; Asselbergs, F.W. Natural language processing for mimicking clinical trial recruitment in critical care: A semi-automated simulation based on the LeoPARDS trial. IEEE J. Biomed. Health Inform. 2020, 24, 2950–2959.
  7. Osman, A.M.; Dafa-Allah, A.; Elhag, A.A.M. Proposed security model for web based applications and services. In Proceedings of the 2017 International Conference on Communication, Control, Computing and Electronics Engineering (ICCCCEE), Khartoum, Sudan, 16–18 January 2017; pp. 1–6.
  8. Jiang, J.; Zhang, H.; Dai, C.; Zhao, Q.; Feng, H.; Ji, Z.; Ganchev, I. Enhancements of attention-based bidirectional LSTM for hybrid automatic text summarization. IEEE Access 2021, 9, 123660–123671.
  9. Kia, M.A.; Garifullina, A.; Kern, M.; Chamberlain, J.; Jameel, S. Adaptable Closed-Domain Question Answering Using Contextualized CNN-Attention Models and Question Expansion. IEEE Access 2022, 10, 45080–45092.
  10. Junczys-Dowmunt, M.; Grundkiewicz, R. Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Association for Computational Linguistics: Austin, TX, USA, 2016; pp. 1546–1556.
  11. Solyman, A.; Zhenyu, W.; Qian, T.; Elhag, A.A.M.; Toseef, M.; Aleibeid, Z. Synthetic data with neural machine translation for automatic correction in Arabic grammar. Egypt. Inform. J. 2021, 22, 303–315.
  12. Lee, J.H.; Kim, M.; Kwon, H.C. Deep learning-based context-sensitive spelling typing error correction. IEEE Access 2020, 8, 152565–152578.
  13. Li, R.; Wang, C.; Zha, Y.; Yu, Y.; Guo, S.; Wang, Q.; Liu, Y.; Lin, H. The LAIX Systems in the BEA-2019 GEC Shared Task. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, Florence, Italy, 2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 159–167.
  14. Chollampatt, S.; Ng, H.T. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  16. Solyman, A.; Zhenyu, W.; Qian, T.; Elhag, A.A.M.; Rui, Z.; Mahmoud, Z. Automatic Arabic Grammatical Error Correction based on Expectation Maximization routing and target-bidirectional agreement. Knowl.-Based Syst. 2022, 241, 108180.
  17. Floranti, A.D.; Adiantika, H.N. Grammatical Error Performances in Indonesia EFL Learners’ Writing. IJELTAL (Indones. J. Engl. Lang. Teach. Appl. Linguist.) 2019, 3, 277–295.
  18. Fahda, A.; Purwarianti, A. A statistical and rule-based spelling and grammar checker for Indonesian text. In Proceedings of the 2017 International Conference on Data and Software Engineering (ICoDSE), Palembang, Indonesia, 1–2 November 2017; pp. 1–6.
  19. Yusnitasari, R.; Suwartono, T. Top Ten Most Problematic Grammatical Items for Indonesian Tertiary EFL Learner Writers. Premise J. Engl. Educ. 2020, 9, 1.
  20. Rahutomo, F.; Mulyo, A.S.; Saputra, P.Y. Automatic Grammar Checking System for Indonesian. In Proceedings of the 2018 International Conference on Applied Science and Technology (iCAST), Manado, Indonesia, 26–27 October 2018; pp. 308–313.
  21. Wu, X.; Huang, P.; Wang, J.; Guo, Q.; Xu, Y.; Chen, C. Chinese Grammatical Error Diagnosis System Based on Hybrid Model. In Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, Beijing, China, 31 July 2015; Association for Computational Linguistics: Beijing, China, 2015; pp. 117–125.
  22. Koto, F.; Lau, J.H.; Baldwin, T. Liputan6: A large-scale Indonesian dataset for text summarization. arXiv 2020, arXiv:2011.00679.
  23. Bryant, C.J. Automatic Annotation of Error Types for Grammatical Error Correction. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2019.
  24. Buck, C.; Heafield, K.; van Ooyen, B. N-gram Counts and Language Models from the Common Crawl. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Reykjavik, Iceland, 2014; pp. 3579–3584.
  25. Hernandez, S.D.; Calvo, H. CoNLL 2014 Shared Task: Grammatical Error Correction with a Syntactic N-gram Language Model from a Big Corpora. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, Baltimore, MD, USA, 26–27 June 2014; Association for Computational Linguistics: Baltimore, MD, USA, 2014; pp. 53–59.
  26. Yeh, J.F.; Chang, L.T.; Liu, C.Y.; Hsu, T.W. Chinese spelling check based on N-gram and string matching algorithm. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), Taipei, Taiwan, 1 December 2017; pp. 35–38.
  27. Zhao, J.; Liu, H.; Bao, Z.; Bai, X.; Li, S.; Lin, Z. N-gram Model for Chinese Grammatical Error Diagnosis. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA), Taipei, Taiwan, 1 December 2017; Asian Federation of Natural Language Processing: Taipei, Taiwan, 2017; pp. 39–44.
  28. Makarenkov, V.; Rokach, L.; Shapira, B. Choosing the Right Word: Using Bidirectional LSTM Tagger for Writing Support Systems. Eng. Appl. Artif. Intell. 2019, 84, 1–10.
  29. Junczys-Dowmunt, M.; Grundkiewicz, R. The AMU System in the CoNLL-2014 Shared Task: Grammatical Error Correction by Data-Intensive and Feature-Rich Statistical Machine Translation. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, Baltimore, MD, USA, 26–27 June 2014; Association for Computational Linguistics: Baltimore, MD, USA, 2014; pp. 25–33.
  30. Yuan, Z.; Briscoe, T. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 380–386.
  31. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473.
  32. Wang, Y.; Wang, Y.; Dang, K.; Liu, J.; Liu, Z. A Comprehensive Survey of Grammatical Error Correction. ACM Trans. Intell. Syst. Technol. (TIST) 2021, 12, 1–51.
  33. Grundkiewicz, R.; Junczys-Dowmunt, M.; Heafield, K. Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, Florence, Italy, 2 August 2019; pp. 252–263.
  34. Sari, M.E.C. Grammatical errors in the English version of Indonesia’s official tourism website. Lexicon 2014, 2, 147–159.
  35. Aini, N. The Grammatical Errors in the Translational Text: Indonesian-English Structure. Tell Teach. Engl. Lang. Lit. J. 2018, 6, 10–30651.
  36. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451.
  37. Maimaiti, M.; Liu, Y.; Luan, H.; Sun, M. Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation. Tsinghua Sci. Technol. 2022, 27, 150–163.
  38. Zhou, Q.; Yang, N.; Wei, F.; Zhou, M. Sequential copying networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  39. Wenzek, G.; Lachaux, M.A.; Conneau, A.; Chaudhary, V.; Guzmán, F.; Joulin, A.; Grave, E. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv 2019, arXiv:1911.00359.
  40. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
  41. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318.
  42. Bryant, C.; Briscoe, T. Language model based grammatical error correction without annotated training data. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, LA, USA, 5 June 2018; pp. 247–253.
Figure 1. An example of an automatic grammatical error correction system.
Figure 2. Indonesian error classification scheme for the GEC system.
Figure 3. The framework of a multi-head attention network with a copy mechanism.
Figure 4. Illustration of a semi-supervised method.
Figure 5. Illustration of the performance of the three baselines using our synthetic data.
Figure 6. Illustration of the performance of the proposed model with BPEChar and fine-tuning.
Table 1. Types of Indonesian grammatical errors.

Error Type | Incorrect Sentence | Correct Sentence
Word selection | Adalah banyak pembeli dan penjual dalam pasar. (Are many buyers and sellers in the market?) | Ada banyak pembeli dan penjual di dalam pasar itu. (There are many buyers and sellers in the market.)
Affix usage | Dia nikmat perjalanan di Indonesia. (He is delicious the journey in Indonesia.) | Dia menikmati perjalanan di Indonesia. (He enjoys the journey in Indonesia.)
Word order | Hari Senin, menarik hari. (Monday, day interesting.) | Hari Senin adalah hari yang menarik. (Monday is an interesting day.)
Sentence function or syntactic error | Saat ini adalah orang yang mau belajar, untuk menjadi dokter ide bagus!. (At this time is a person willing to study, to become a doctor a great idea!.) | Saat ini, banyak orang ingin belajar kedokteran untuk menjadi dokter, itu ide bagus!. (Today, many people want to study medicine to become a doctor. That’s a great idea!)
Preposition usage | Ayahku kembali di hotel Mandalika naik bus kecil. (My father returns at the Mandalika hotel by small bus.) | Ayahku kembali ke hotel Mandalika naik bus kecil. (My father returns to the Mandalika hotel by a small bus.)
Conjunction usage | Masjid ini dibangun dengan uang dari orang-orang bahwa tinggal di sekitar. (This mosque was built with money from the people that live nearby.) | Masjid ini dibangun dengan uang dari orang-orang yang tinggal di sekitar. (This mosque was built with money from the people who live nearby.)
Plural formation | Banyak perusahaan-perusahaan telah tutup karena pandemi. (Many company and company have closed because of the pandemic.) | Banyak perusahaan telah tutup karena pandemi. (Many companies have closed because of the pandemic.)
Passive construction usage | Bangunan ini membuat untuk presiden kedua. (This building builds for the second president.) | Bangunan ini dibuat untuk presiden kedua. (This building was built for the second president.)
Table 2. Size of parameters.

Parameters | Size
Embedding size | 256
Batch size | 128
Head attention | 8
Encoder-decoder layers | 4
Dropout ratio | 0.1
Learning rate | 0.0003
Table 3. A comparison of BRNN, CNN-GEC and SAN using precision, recall, F1 score and BLEU score.

Model | Precision | Recall | F1 | BLEU
BRNN-GEC | 43.12 | 31.23 | 36.22 | 48.18
CNN-GEC | 52.34 | 37.26 | 43.53 | 56.76
SAN-GEC | 65.53 | 47.81 | 55.28 | 59.91
Table 4. The impact of the copy mechanism, different settings of BPEChar and fine-tuning on the IGEC framework; bold indicates the best values.

Model | Precision | Recall | F1 | BLEU
SAN-GEC + copy mechanism (UNK words) | 64.85 | 60.64 | 62.67 | 66.12
SAN-GEC + copy mechanism (non-UNK words) | 61.25 | 59.31 | 60.26 | 64.32
SAN-GEC + copy mechanism + BPEChar 30k | 64.11 | 64.09 | 64.09 | 67.19
SAN-GEC + copy mechanism + BPEChar 10k | 67.21 | 65.29 | 66.23 | 71.17
SAN-GEC + copy mechanism + BPEChar 1k | 70.42 | 69.33 | 69.87 | 77.21
SAN-GEC + copy mechanism + BPEChar 1k + fine-tuning | 71.14 | 72.76 | 71.94 | 78.13
Table 5. Performance of different versions of our model, where the incorrect words have been numbered and colored in red.

Type | Example
Source | Namun demikian 1 saat ini pemerintah Jawa Barat memiliki uang 2 didik 3 ke 4 suatu 5 guru dan 6 mereka dapat kerja 7 baik-baik 8 . Suatu 9 guru juga, 10 disekolah 11 dapat ajar 12 macam-macam 13 yang 14 pelajaran serumpun, 15 misalnya matematika, fisika, kimia dan biologi, 16 selain itu mereka dapat 17 uang tambahan 18 dan uang tunjang 19 ke 20 keluarga dari pemerintah Indonesia atau pemerintah pusat berbulan-bulan 21 .
Target | Saat ini, pemerintah Jawa Barat memiliki dana pendidikan untuk seorang guru sehingga mereka dapat bekerja dengan baik dan tenang. Di sekolah, seorang guru dapat mengajar berbagai pelajaran yang serumpun misalnya Matematika, Fisika, Kimia, dan Biologi. Selain itu, setiap bulan mereka menerima dana bantuan dan dana tunjangan untuk keluarga dari pemerintah Indonesia atau pemerintah pusat.
Translation | Currently, the West Java government has an education fund for teachers so that they can work well and quietly. In school, a teacher can teach a variety of closely related subjects such as mathematics, physics, chemistry, and biology. In addition, every month they receive grants and allowances for families from the Indonesian government or the central government.
SAN-GEC | Saat ini, pemerintah Jawa Barat memiliki dana didik 3 ke 4 seorang guru sehingga mereka dapat kerja 7 dengan baik dan tenang. Seorang guru juga, 10 di sekolah dapat ajar 12 berbagai pelajaran yang serumpun, 15 misalnya matematika, fisika, kimia, dan biologi, 16 selain itu mereka menerima dana bantuan dan dana tunjangan ke 20 keluarga dari pemerintah Indonesia atau pemerintah pusat berbulan-bulan 21 .
SAN-GEC + copy mechanism | Saat ini, pemerintah Jawa Barat memiliki dana didik 3 untuk seorang guru sehingga mereka dapat kerja 7 dengan baik dan tenang. Di sekolah, seorang guru juga, 10 dapat ajar 12 berbagai pelajaran yang serumpun, 15 misalnya Metematika, Fisika, Kimia, dan Biologi, 16 Selain itu, setiap bulan mereka menerima dana bantuan dan dana tunjangan untuk keluarga dari pemerintah Indonesia atau pemerintah pusat.
SAN-GEC + copy mechanism + BPEChar 1k | Saat ini, pemerintah Jawa Barat memiliki dana pendidikan untuk seorang guru sehingga mereka dapat bekerja dengan baik dan tenang. Di sekolah, seorang guru, 10 dapat mengajar berbagai pelajaran yang serumpun, 15 misalnya Matematika, Fisika, Kimia, dan Biologi, 16 Selain itu, setiap bulan mereka menerima dan bantuan dan dana tunjangan untuk keluarga dari pemerintah Indonesia atau pemerintah pusat.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
