Article

BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders

1 Department of Computer Science and Engineering, Korea University, Seoul 136-701, Korea
2 Upstage, Yongin-si 16942, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(13), 6686; https://doi.org/10.3390/app12136686
Submission received: 26 May 2022 / Revised: 26 June 2022 / Accepted: 30 June 2022 / Published: 1 July 2022

Abstract: Recent studies have attempted to understand natural language and infer answers from it. Machine reading comprehension is a representative such task, and several related datasets have been released. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used to evaluate English proficiency, and research toward further advancement is not being actively conducted. We attribute the difficulty of deep learning research on TOEIC to this data scarcity problem and therefore propose two data augmentation methods to improve the model in a low-resource environment. Considering the attributes of the semantic and grammar problem types in TOEIC, the proposed methods use POS-tagging and lemmatizing to augment data that closely resembles real TOEIC problems. In addition, we confirmed the importance of understanding semantics and grammar in TOEIC through experiments on each proposed methodology and on varying amounts of data. The proposed methods address the data shortage problem of TOEIC and enable acceptable human-level performance.

1. Introduction

Deep learning technology has developed considerably in various fields, with many models exceeding human performance. However, deep learning models have rarely been used to solve tests that evaluate human ability. The TOEIC is a global test that evaluates the practical English skills needed in daily life and international work, focusing on communication skills for people whose native language is not English. It is used in more than 160 countries and by 14,000 institutions and companies worldwide for purposes such as promotion or overseas assignment (https://exam.toeic.co.kr/common/template/viewContents.php?contentsCode=19, accessed on 20 May 2022). However, studies on models for solving TOEIC problems have not been actively conducted. TOEIC comprises a total of 7 parts: parts 1–4 are listening comprehension problems, and parts 5–7 are reading comprehension problems. (The reading comprehension problems, which evaluate reading ability, comprise fill-in-the-blank in a single sentence (Part 5), fill-in-the-blank in multiple sentences (Part 6), and reading comprehension problems in which the content must be understood and inferred (Part 7).) TOEIC has thus long been used as a criterion for evaluating human English ability across these various parts. However, deep learning models have yet to demonstrate good results even on tasks that are considered very easy at a human level.
Table 1 presents the performance of TOEIC-BERT (https://github.com/graykode/toeicbert, accessed on 20 May 2022), which conducted experiments with a deep learning model on TOEIC Part 5, and Figure 1 shows the distribution of scores for the 455th test (administered on 20 February 2022), the most recent at the time of writing (https://exam.toeic.co.kr/result/statisToeic.php, accessed on 20 May 2022). Although TOEIC-BERT was evaluated only on a relatively easy part, the Part 5 single-sentence fill-in-the-blank, its best model achieved 76.38%; in comparison, humans are evaluated on all parts, and many people scored 750 or higher, which is equivalent to or exceeds 76.38%.
This study considers the data scarcity problem to be the reason why deep learning models do not perform well. There is currently sufficient data for humans to practice on; however, minimal open data is available for machine learning. In addition, research on TOEIC problem solving is not actively conducted because of this lack of data; owing to the lack of training data, deep learning models struggle even to understand semantic and grammatical relationships in simple sentences, and consequently cannot solve problems that humans solve easily. Therefore, this study proposes simple and efficient data augmentation techniques that improve a deep learning model's understanding of semantic and grammatical relationships in sentences, thereby improving the performance of the TOEIC problem-solving model. The results obtained with these techniques are analyzed comprehensively.
This study focused on problems in Part 5 of the TOEIC. An officially open dataset is essential for the objective performance verification of a model, and currently no official dataset exists except that for Part 5, which can be found on Kaggle (https://www.kaggle.com/tientd95/toeic-test, accessed on 20 May 2022). In addition, the cloze test [1] underlying Part 5 can be used for the evaluation of language ability [2,3,4] and is a widely used task for evaluating the language comprehension ability of NLP systems [5].
As mentioned earlier, an officially open dataset for Part 5 exists on Kaggle. However, as it contains only 3625 problems, it is insufficient for training a model. Therefore, this study proposes two data augmentation techniques that use the POS-tagging and lemmatizing tools of NLTK [6], based on the characteristics of the problems in TOEIC Part 5. In addition, since BERT has been applied to a wide range of tasks [7], this study also attempts to solve the TOEIC problem with various BERT-style models, through which an effective deep learning model for solving TOEIC problems can be built.
The contributions of this study are as follows.
  • Through the POS-tagging based data augmentation methodology, when solving semantic problems whose options share similar parts-of-speech but differ in meaning, the model can better grasp the intention of the problem by comparing the options with a focus on meaning;
  • Through the lemmatizing based data augmentation methodology, when solving grammar problems that present various forms of words with similar meanings, the model can better understand the problem by comparing the options with a focus on grammatical relationships;
  • The effectiveness of the proposed methodologies was verified through experiments on each methodology and on varying amounts of data, and we confirmed that the data scarcity problem could be mitigated through them.
This paper is structured as follows. Section 2 reviews related work in the area of the cloze test. Section 3 discusses the proposed method, while Section 4 describes the experiments and results. Finally, Section 5 concludes the paper.

2. Related Work

Several studies have focused on fill-in-the-blank problems, such as large-scale cloze-style datasets [8,9,10,11]. However, studies on TOEIC Part 5 problem-solving models are not being actively conducted. Fill-in-the-blank tasks can be divided into open-ended and multiple-choice forms. CodeXGLUE [12], a machine learning benchmark dataset for code understanding and generation, includes an open-ended fill-in-the-blank task, while RACE [13] conducted a multiple-choice fill-in-the-blank task that fills in blanks with reference to a text.
CodeXGLUE is a benchmark dataset for code intelligence that aids software developers and can be used for upgrading code search [14,15] and code completion [16,17] systems. Among its several tasks, the one most relevant to this research is understanding code by predicting masked code for programming languages such as Go, Java, and Python according to natural language commands. In this task, the meaning of the natural language command must be understood to fill in the blank, and the grammar of the programming language must be understood alongside the semantics. However, the fill-in-the-blank task in CodeXGLUE is specific to programming-language grammar and aims to evaluate a model's ability to understand code; thus, it is not suitable as training data for TOEIC problems. In addition, in contrast to TOEIC, which is multiple-choice, the CodeXGLUE task is open-ended, which limits its use in this study. Inspired by this line of work, this study instead analyzed the problem types of TOEIC Part 5 and augmented the data based on the characteristics of each problem type, thereby conducting model training specialized for TOEIC data.
RACE comprises questions that evaluate students' understanding and reasoning ability. It is similar to the TOEIC data addressed in this study in that the answer corresponding to the blank in a question is selected from options. However, unlike TOEIC Part 5, RACE requires reading a paragraph to infer the answer corresponding to the blank; thus, directly using RACE in this study is challenging.
CLOTH is a multiple-choice dataset comprising questions used in middle and high school language tests, and it is very similar to our task in that the answer corresponding to a blank in the content must be found among the options. There are four problem types in CLOTH: a grammar type related to tense, active/passive voice, the subjunctive, etc.; a matching/paraphrasing type that answers questions by copying or paraphrasing words in context; a short-term-reasoning type that infers the answer from information in the same sentence; and a long-term-reasoning type that infers the answer by synthesizing information distributed over several sentences. The TOEIC data in this study covers two types: the semantic and the grammar problem type. Just as CLOTH provides data by type, this study takes a similar perspective and establishes strategies specialized for the semantic and grammar problem types to achieve strong performance on both.
As discussed above, studies related to TOEIC have not progressed much. However, by borrowing the perspective of the cloze-style datasets related to TOEIC Part 5 and establishing a strategy for each problem type in the TOEIC, the performance of deep learning models can be sufficiently enhanced.

3. Proposed Method

3.1. Simple and Efficient Data Augmentation

Part 5 of the TOEIC is a single-sentence fill-in-the-blank problem. This part can be classified into two classes, as shown in Table 2: the semantic and the grammar problem types, which can be identified by the option type. Options for semantic problems consist of semantically different but syntactically similar words, whereas grammar problems consist of syntactically different but semantically similar words. Examples of TOEIC Part 5 can be found in Table 2. Inspired by these characteristics, this study proposes methods that augment data by considering the characteristics of the semantic and grammar problem types.

3.1.1. Random and Brute Data Augmentation

The TOEIC Part 5 single-sentence fill-in-the-blank is a problem in which one sentence with a blank is given and the most appropriate word, phrase, or clause for the blank is selected from four options. In this study, a data generation method was proposed that follows the objective of such fill-in-the-blank problems. Specifically, unlabeled mono-text was adopted and each sentence was tokenized into word segments. One of the split word segments was replaced with a blank to compose a query statement, and that word segment was regarded as the correct answer option. When utilizing the mono-text, punctuation marks and special characters at both ends of each word segment were removed so that the trained model can focus on the syntactic and semantic information of the words in a sentence.
Through the training process, the TOEIC problem-solving model is expected to infer the correct answer corresponding to the blank among the four options. The difficulty of a problem may vary depending on how the wrong answer options are presented alongside the correct one; consequently, it is important to configure the wrong answer options properly when augmenting data so that the problems are suitably challenging.
The simplest way to generate wrong answer options is to randomly extract word segments from all word segments in the entire training data. In this study, this simple strategy is denoted the random and brute data augmentation method and was adopted as the baseline. However, it cannot properly reflect the characteristics of TOEIC Part 5 because the relationship with the correct answer option is not considered. Therefore, data were created considering the characteristics of TOEIC Part 5, such that the augmented data resembles actual TOEIC data and the performance of the TOEIC problem-solving model can be improved.
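The random and brute baseline can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' code: the function name, the `_____` placeholder, and the fixed random seed are hypothetical choices.

```python
import random
import string

def make_cloze_item(sentence, vocabulary, rng=None):
    """Turn one unlabeled sentence into a TOEIC-style item.

    One word segment is replaced with a blank and becomes the answer;
    three distractors are drawn uniformly from the whole vocabulary,
    as in the random and brute baseline."""
    rng = rng or random.Random(0)
    # Strip punctuation and special characters at both ends of each segment.
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]
    idx = rng.randrange(len(words))
    answer = words[idx]
    query = " ".join(words[:idx] + ["_____"] + words[idx + 1:])
    # Distractors are sampled with no regard for their relation to the answer.
    distractors = rng.sample([w for w in vocabulary if w != answer], 3)
    options = distractors + [answer]
    rng.shuffle(options)
    return query, options, answer
```

Because the distractors ignore the answer's part-of-speech and lemma, items produced this way are generally easier than real Part 5 items, which motivates the two targeted strategies below.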

3.1.2. POS-Tagging Based Data Augmentation

The first method proposed for constructing wrong answer options is POS-tagging based data augmentation. As mentioned earlier, semantic problems in TOEIC Part 5 are characterized by options with similar parts-of-speech but different meanings [18,19]. Considering this characteristic, this method uses POS-tagging to classify all word segments in the entire data by part-of-speech and then extracts three wrong answer options from the same part-of-speech set as the correct answer option.
After splitting the English sentence data into word segments and tagging each segment with NLTK's POS-tagger, word segments with the same part-of-speech were grouped into the same set. Conducting this process over the entire English sentence data organizes all word segments by part-of-speech. In this way, word segment sets for a total of 39 parts-of-speech were generated from all sentences of the Korean-English parallel corpus. The part-of-speech inventory follows the NLTK standard, and only those tags present in this dataset were selected. Details can be found in Table A1 of Appendix A.
Among them, sets with too few word segments to serve as wrong answer options may harm data diversity; thus, sets with fewer than 30 word segments (TO, SYM, UH, WP$, LS, “, ”) were excluded. In addition, preprocessing excluded singular proper nouns (NNP), plural proper nouns (NNPS), and cardinals (CD), which do not focus on the meaning of words.
When generating the data, each English sentence was split into word segments, one of which was replaced with a blank to form a TOEIC-style item. By extracting three wrong answer options from the set with the same part-of-speech as the correct answer replaced by the blank, the data can be augmented to resemble semantic problems, wherein options with different meanings but similar parts-of-speech are listed. Examples of actual data generated through this methodology are presented in the POS-tagging Based Data column in Table 3.
As evident from the POS-tagging Based Data column in Table 3, the data was augmented to reflect the characteristics of semantic problems, whose options have different meanings but similar parts-of-speech. When data is augmented in this manner, the model can better focus on the meaning corresponding to the blank in the sentence; therefore, when solving actual TOEIC problems, it can better infer the intended answer by comparing the options with a focus on meaning.
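The POS-tagging based strategy can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code: the paper tags segments with NLTK's POS-tagger, whereas here the corpus is assumed to be pre-tagged, and both function names are hypothetical.

```python
import random

def build_pos_buckets(tagged_corpus):
    """Group word segments by POS tag. `tagged_corpus` is a list of
    (word, tag) pairs; the paper obtains the tags with NLTK."""
    buckets = {}
    for word, tag in tagged_corpus:
        buckets.setdefault(tag, set()).add(word)
    return buckets

def pos_distractors(answer, answer_tag, buckets, rng=None):
    """Draw three wrong options from the same POS bucket as the answer,
    mimicking semantic-type items (same POS, different meaning)."""
    rng = rng or random.Random(0)
    candidates = [w for w in buckets.get(answer_tag, ()) if w != answer]
    if len(candidates) < 3:
        # Bucket too small, analogous to the sparse tags excluded in the paper.
        return None
    return rng.sample(candidates, 3)
```

Sampling only from the answer's own part-of-speech bucket is what forces the resulting item to be decided by meaning rather than syntax.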

3.1.3. Lemmatizing Based Data Augmentation

The second method proposed for constructing wrong answer options is lemmatizing based data augmentation. Grammar problems in TOEIC Part 5 are characterized by options with similar meanings but different forms. Considering this characteristic, this method uses lemmatizing to classify all word segments in the entire data by lemma and then extracts three wrong answer options from among the word segments with the same lemma as the correct answer option.
After splitting the English sentence data into word segments and determining the lemma of each segment with NLTK's lemmatizer, word segments with the same lemma were grouped into the same set. For NLTK's lemmatizer, a more accurate lemma can be found by specifying the part-of-speech; however, specifying the exact part-of-speech for every word segment is impractical, because each segment can take different parts-of-speech. Thus, this method determines the lemma under every part-of-speech that can be specified: nouns, verbs, adjectives, adverbs, and satellite adjectives. For each word segment, the lemma was computed for each of these parts-of-speech, and the corresponding word segments were stored under each lemma. Conducting this process over the entire English sentence data organizes all word segments by lemma. Because TOEIC Part 5 presents four options, an item can be constructed only when at least four word segments share the same lemma.
When generating data based on lemmatizing, as in Section 3.1.2, each English sentence was split into word segments, one of which was replaced with a blank to form a TOEIC-style item. By extracting three wrong answer options from among the word segments sharing the lemma of the correct answer replaced by the blank, the data can be augmented to resemble grammar problems, wherein options with different forms but similar meanings are listed. Examples of actual data generated through this methodology are shown in the Lemmatizing Based Data column in Table 3.
As evident from the Lemmatizing Based Data column in Table 3, the data was augmented to reflect the characteristics of grammar problems, whose options are various forms of words with similar meanings. When data is augmented in this manner, the model can better focus on grammatical relationships within the sentence; therefore, when solving actual TOEIC problems, it can better infer the correct answer by comparing the options with a focus on the grammatical role of the blank.
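A minimal sketch of the lemmatizing based strategy follows. The paper uses NLTK's WordNetLemmatizer over the five specifiable parts-of-speech; here `lemmatize(word, pos)` is a stand-in for that call, and the function names are our own.

```python
import random

def build_lemma_buckets(words, lemmatize):
    """Group word segments by lemma. Because the true POS of each segment
    is unknown, every candidate POS is tried, as in Section 3.1.3;
    `lemmatize(word, pos)` stands in for NLTK's lemmatizer."""
    buckets = {}
    for word in words:
        for pos in ("n", "v", "a", "r", "s"):  # noun, verb, adj, adverb, satellite adj
            buckets.setdefault(lemmatize(word, pos), set()).add(word)
    return buckets

def lemma_distractors(answer, buckets, rng=None):
    """Draw three wrong options sharing a lemma with the answer
    (grammar-type items: similar meaning, different surface form).
    Only lemmas with at least four surface forms can yield an item."""
    rng = rng or random.Random(0)
    for forms in buckets.values():
        if answer in forms and len(forms) >= 4:
            return rng.sample([w for w in forms if w != answer], 3)
    return None
```

With at least four surface forms per lemma, one form becomes the answer and the other three become distractors, reproducing the "same lemma, different form" structure of grammar items.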

3.2. Model

In this study, BERT [20], RoBERTa [21], and ELECTRA [22] were adopted as the baseline model structures and trained with the augmented TOEIC data proposed in Section 3.1. To implement the TOEIC task with transformer encoder-based structures, the conventional multiple-choice pipeline [13,23,24,25,26] was adopted. To determine the correct answer among the four candidates, four sequences corresponding to the options of a specific TOEIC question were generated. The problem statement and each option word were concatenated with a [SEP] token, with a [CLS] token at the head; the [SEP] token separates the problem statement from the option part. Specifically, for a problem statement query and its candidates opt_1, opt_2, opt_3, opt_4, four input sequences seq_1, seq_2, seq_3, seq_4 were generated as in Equation (1).
seq_1 = [CLS] query [SEP] opt_1 [SEP]
seq_2 = [CLS] query [SEP] opt_2 [SEP]
seq_3 = [CLS] query [SEP] opt_3 [SEP]
seq_4 = [CLS] query [SEP] opt_4 [SEP]    (1)
When adopting RoBERTa, where <s> and </s> tokens were utilized instead of [CLS] and [SEP], they were revised to <s> and </s> respectively, to follow the pre-trained structure.
Subsequently, each input sequence seq_i was encoded into a latent hidden representation by the PLM. Specifically, the encoded hidden state of the first token of seq_i, denoted h_0^i, was used as the representation of the whole sequence, following the encoding process conventionally adopted in previous studies [20,21,22]. Thereafter, through a trainable pooling layer W and its bias term b, h_0^i was mapped to a pooled scalar c_i. The four values c_i were concatenated into a single pooled vector, and applying softmax to this vector, denoted O, predicts the index of the correct answer. These processes are described in Equations (2) and (3).
h^i = PLM(seq_i) = [h_0^i, h_1^i, …, h_n^i]    (2)
O = [o_1, o_2, o_3, o_4] = softmax([c_1, c_2, c_3, c_4]) = softmax([W h_0^1 + b, W h_0^2 + b, W h_0^3 + b, W h_0^4 + b])    (3)
The predicted output vector O gives the probability of each option candidate being selected as the answer. The above process was optimized to minimize the cross-entropy of predicting the correct answer label. Specifically, for a given label y and model output O over the whole training dataset D, the model was trained to minimize the loss L defined in Equation (4).
L = −∑_{(query, y) ∈ D} ∑_{i=1}^{4} t_i log(o_i), where t_i = 1(y = i)    (4)
In this equation, t_i is an indicator that returns 1 for the answer index and 0 otherwise. Based on the above process, a multiple-choice TOEIC classification model can be constructed that predicts the correct answer given the input problem statement and candidate options. The overall framework proposed in this study is shown in Figure 2.
From the unlabeled corpus, the TOEIC data was generated using the proposed augmentation strategies. Thereafter, by following the multiple-choice QA pipeline, the TOEIC multiple choice task was implemented.
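The pooling, softmax, and loss steps of Equations (2)–(4) can be illustrated numerically. This is a sketch under our own assumptions: the first-token hidden states `h0`, weight vector `W`, and bias `b` are taken as given (here randomly initialized), and the function names are hypothetical.

```python
import math
import numpy as np

def score_options(h0, W, b):
    """Pool each option sequence's first-token state into a logit
    c_i = W h_0^i + b, then softmax over the four logits to obtain
    the option probabilities O. h0 has shape (4, d)."""
    logits = h0 @ W + b                   # c_1..c_4, shape (4,)
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()               # O = softmax([c_1, .., c_4])
    return probs, int(np.argmax(probs))

def cross_entropy(probs, label):
    """Per-sample loss from Equation (4): -log probability of the gold option."""
    return -math.log(probs[label])
```

In training, `cross_entropy` would be summed over the dataset D and minimized; at inference, the argmax index is the predicted option.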

4. Experiments

4.1. Dataset Details

For TOEIC data, this study utilized the TOEIC Part 5 dataset of 3625 problems released on Kaggle (https://www.kaggle.com/tientd95/toeic-test, accessed on 20 May 2022). The data was split in an 8:1:1 ratio into 2899/363/363 training, validation, and test examples, respectively. The overall statistics, including the number of examples and the minimum, maximum, and average sentence length, are presented in Table 4.
Unlabeled data for augmenting the TOEIC data was obtained from AI Hub [27,28], where quality is guaranteed by human evaluation. The English-side texts of a Korean-English parallel corpus of 1,602,708 sentence pairs were leveraged. Only sentences with a sequence length of 64 tokens or less were adopted, because the longest token-level sequence in the Kaggle TOEIC data does not exceed 64.

4.2. Implementation Details

In this study, three models, BERT, RoBERTa, and ELECTRA, were used for the multiple-choice task [13,23,24,25,26]. The hyperparameters and learning rates for each model are as follows. Among the learning rate settings {1 × 10⁻⁵, 3 × 10⁻⁵, 5 × 10⁻⁵}, the learning rate with the best performance was selected.
  • BERT-large was trained using bert-large-uncased. BERT-large uses 24 layers with a hidden size of 1024 and has approximately 336M trainable parameters. The learning rate was set to 3 × 10⁻⁵;
  • RoBERTa-large was trained using roberta-large. RoBERTa-large uses 24 layers with a hidden size of 1024 and has approximately 355M trainable parameters. The learning rate was set to 1 × 10⁻⁵;
  • ELECTRA-large was trained using google/electra-large-discriminator. ELECTRA-large uses 24 layers with a hidden size of 1024 and has approximately 335M trainable parameters. The learning rate was set to 1 × 10⁻⁵.
Each model was trained with a batch size of 64, a maximum of 50 epochs, and a maximum sequence length of 64. Each experiment was conducted on an NVIDIA RTX A6000 GPU.

4.3. Evaluation Details

The performance evaluation metric used in this study was accuracy, which measures how closely the predicted data matches the actual data and is widely used in classification evaluation. Accuracy is defined as Equation (5).
Accuracy = (number of samples predicted correctly) / (number of all samples)    (5)
The predicted result was compared with the actual correct answer among options 0–3, with agreement counted as 1 and disagreement as 0, and these values were averaged. Thus, accuracy is the number of accurately predicted samples divided by the total number of samples.
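This computation amounts to a few lines; the function name is our own.

```python
def accuracy(predictions, labels):
    """Average of per-sample agreement: 1 if the predicted option
    index matches the gold index (0-3), 0 otherwise."""
    matches = [1 if p == y else 0 for p, y in zip(predictions, labels)]
    return sum(matches) / len(matches)
```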

4.4. Main Results

The experimental results are presented in Table 5. The table contains two groups of results: baselines for comparison (Kaggle TOEIC, Random and Brute) and the proposed methods (POS-tagging, Lemmatizing, Mixed). The former includes training results on Kaggle TOEIC data and on randomly and brutely augmented data, wherein wrong answer options were drawn randomly from all word segments of the entire data. The latter are the results for POS-tagging based, lemmatizing based, and mixed data (POS-tagging based and lemmatizing based data merged in a 5:5 ratio).
As indicated by the results, the proposed POS-tagging and lemmatizing based data positively affected performance. The model trained on the POS-tagging based data presumably concentrated on meaning when comparing the options to infer the intended answer, because that data induces the model to focus on the meaning corresponding to the blank in the sentence. In the case of the lemmatizing based data, the model can better focus on grammatical relationships within the sentence and concentrate appropriately on the grammar of the blank when solving TOEIC problems. In addition, the Mixed data, incorporating the POS-tagging based and lemmatizing based data, targets both question types, allowing the model to better understand the intent of the problem and raising performance further. These results indicate that the performance of the TOEIC problem-solving model can be improved through POS-tagging based and lemmatizing based data augmentation in a data-sparse setting.
Furthermore, additional fine-tuning on the Kaggle TOEIC data was performed after training on augmented data to examine the performance variation; the results are presented in the “With FT” column. Fine-tuning first on the augmented data and then continuing with the Kaggle TOEIC data outperformed training on the Kaggle TOEIC data alone, thereby mitigating the data sparsity issue. In particular, the best performance was achieved with the mixed data composed of POS-tagging based and lemmatizing based data, suggesting that applying both proposed methods contributed significantly to the model's understanding of semantic and grammatical relationships in sentences.

4.5. Performance Comparison Experiment According to the Amount of Data

To investigate the performance difference depending on the amount of data, experiments were performed with different data sizes: 100, 200, 400, 800, and 1450 k. The performance according to data size is depicted in Figure 3.
Figure 3 shows that performance generally tended to improve as the amount of data increased, suggesting that the proposed augmentation techniques are effective. In addition, fine-tuning on data augmented via POS-tagging and lemmatizing, which consider semantic and grammatical relationships respectively, increased performance compared to training on the randomly and brutely augmented data with randomly drawn wrong answers. Furthermore, the Mixed data, which considers both relationships, exhibited even higher performance. This implies that performance depends on how the data is augmented, even with the same amount of data.

5. Conclusions

This study analyzed the extent to which a machine reading comprehension model can perform on TOEIC. POS-tagging based and lemmatizing based data augmentation methods were proposed, which improved performance under data scarcity. The POS-tagging based method augments data for vocabulary problems, comprising options with different meanings but similar parts-of-speech. The lemmatizing based method augments data for grammar problems, comprising options with the same lemma but different word forms. The experimental results demonstrated that both methods are significant. Currently, data augmentation is performed only on TOEIC Part 5, so the approach cannot yet be applied to other parts of the TOEIC. In future work, it will be extended to cover all parts of TOEIC, and experiments will examine whether the proposed method can improve performance in other domain tasks [29,30].

Author Contributions

Conceptualization, J.L.; methodology, H.M.; software, H.M.; validation, J.L.; formal analysis, H.M. and S.E.; investigation, J.S.; writing—original draft preparation/review and editing, C.P., J.S., S.E. and H.L.; visualization, H.M. and J.L.; funding acquisition/project administration/supervision, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A03045425).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Parts-of-speech tags included in this dataset.
Tag | Description | Example
$ | dollar | $
“ | opening quotation mark | ‘ “
” | closing quotation mark | ’ ”
CC | conjunction, coordinating | and
CD | numeral, cardinal | 1987, one-tenth
DT | determiner | the
EX | existential there | there
FW | foreign word | ich
IN | preposition or conjunction, subordinating | among, into
JJ | adjective or numeral, ordinal | cheap
JJR | adjective, comparative | cheaper
JJS | adjective, superlative | cheapest
LS | list item marker | SP-44001
MD | modal auxiliary | could, will
NN | noun, common, singular or mass | table
NNP | noun, proper, singular | Venneboerger
NNPS | noun, proper, plural | Americans
NNS | noun, common, plural | tables
PDT | pre-determiner | both
POS | genitive marker | ’s
PRP | pronoun, personal | me, myself, themselves
PRP$ | pronoun, possessive | my, their
RB | adverb | fast
RBR | adverb, comparative | faster
RBS | adverb, superlative | fastest
RP | particle | up
SYM | symbol | =, *
TO | “to” as preposition or infinitive marker | to
UH | interjection | Gosh
VB | verb, base form | ask, avoid
VBD | verb, past tense | dipped, exacted
VBG | verb, present participle or gerund | telegraphing, focusing
VBN | verb, past participle | multihulled, experimented
VBP | verb, present tense, not 3rd person singular | predominate, wrap
VBZ | verb, present tense, 3rd person singular | reconstructs, marks
WDT | WH-determiner | whichever
WP | WH-pronoun | who, whom, whosoever
WP$ | WH-pronoun, possessive | whose
WRB | WH-adverb | whenever

References

  1. Taylor, W.L. “Cloze procedure”: A new tool for measuring readability. J. Q. 1953, 30, 415–433. [Google Scholar] [CrossRef]
  2. Fotos, S.S. The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations? Lang. Learn. 1991, 41, 313–336. [Google Scholar] [CrossRef]
  3. Jonz, J. Cloze item types and second language comprehension. Lang. Test. 1991, 8, 1–22. [Google Scholar] [CrossRef]
  4. Tremblay, A. Proficiency assessment standards in second language acquisition research: “Clozing” the gap. Stud. Second. Lang. Acquis. 2011, 33, 339–372. [Google Scholar] [CrossRef]
  5. Hu, Z.; Chanumolu, R.; Lin, X.; Ayaz, N.; Chi, V. Evaluating NLP Systems On a Novel Cloze Task: Judging the Plausibility of Possible Fillers in Instructional Texts. arXiv 2021, arXiv:2112.01867. [Google Scholar]
  6. Loper, E.; Bird, S. Nltk: The natural language toolkit. arXiv 2002, arXiv:cs/0205028. [Google Scholar]
  7. Bilal, M.; Almazroi, A.A. Effectiveness of Fine-Tuned BERT Model in Classification of Helpful and Unhelpful Online Customer Reviews. Electron. Commer. Res. 2022, 1–21. [Google Scholar] [CrossRef]
  8. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching machines to read and comprehend. Adv. Neural Inf. Process. Syst. 2015, 28, 1693–1701. [Google Scholar]
  9. Hill, F.; Bordes, A.; Chopra, S.; Weston, J. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv 2015, arXiv:1511.02301. [Google Scholar]
  10. Bajgar, O.; Kadlec, R.; Kleindienst, J. Embracing data abundance: Booktest dataset for reading comprehension. arXiv 2016, arXiv:1610.00956. [Google Scholar]
  11. Onishi, T.; Wang, H.; Bansal, M.; Gimpel, K.; McAllester, D. Who did what: A large-scale person-centered cloze dataset. arXiv 2016, arXiv:1608.05457. [Google Scholar]
  12. Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar]
  13. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. Race: Large-scale reading comprehension dataset from examinations. arXiv 2017, arXiv:1704.04683. [Google Scholar]
  14. Premtoon, V.; Koppel, J.; Solar-Lezama, A. Semantic code search via equational reasoning. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, London, UK, 15–20 June 2020; pp. 1066–1082. [Google Scholar]
  15. Wang, W.; Zhang, Y.; Zeng, Z.; Xu, G. Trans^3: A transformer-based framework for unifying code summarization and code search. arXiv 2020, arXiv:2003.03238. [Google Scholar]
  16. Svyatkovskiy, A.; Deng, S.K.; Fu, S.; Sundaresan, N. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, 8–13 November 2020; pp. 1433–1443. [Google Scholar]
  17. Svyatkovskiy, A.; Zhao, Y.; Fu, S.; Sundaresan, N. Pythia: AI-assisted code completion system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2727–2735. [Google Scholar]
  18. Moon, H.; Park, C.; Eo, S.; Seo, J.; Lee, S.; Lim, H. A Self-Supervised Automatic Post-Editing Data Generation Tool. arXiv 2021, arXiv:2111.12284. [Google Scholar]
  19. Moon, H.; Park, C.; Seo, J.; Eo, S.; Lim, H. An Automatic Post Editing With Efficient and Simple Data Generation Method. IEEE Access 2022, 10, 21032–21040. [Google Scholar] [CrossRef]
  20. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  21. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  22. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  23. Xie, Q.; Lai, G.; Dai, Z.; Hovy, E. Large-scale cloze test dataset created by teachers. arXiv 2017, arXiv:1711.03225. [Google Scholar]
  24. Zellers, R.; Bisk, Y.; Schwartz, R.; Choi, Y. Swag: A large-scale adversarial dataset for grounded commonsense inference. arXiv 2018, arXiv:1808.05326. [Google Scholar]
  25. Sakaguchi, K.; Le Bras, R.; Bhagavatula, C.; Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8732–8740. [Google Scholar]
  26. Bisk, Y.; Zellers, R.; Gao, J.; Choi, Y. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7432–7439. [Google Scholar]
  27. Park, C.; Lim, H. A study on the performance improvement of machine translation using public korean-english parallel corpus. J. Digit. Converg. 2020, 18, 271–277. [Google Scholar]
  28. Park, C.; Shim, M.; Eo, S.; Lee, S.; Seo, J.; Moon, H.; Lim, H. Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC. arXiv 2021, arXiv:2110.15023. [Google Scholar]
  29. Park, C.; Seo, J.; Lee, S.; Lee, C.; Moon, H.; Eo, S.; Lim, H.S. BTS: Back TranScription for speech-to-text post-processor using text-to-speech-to-text. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), Bangkok, Thailand, 6 August 2021; pp. 106–116. [Google Scholar]
  30. Park, C.; Go, W.Y.; Eo, S.; Moon, H.; Lee, S.; Lim, H. Mimicking Infants’ Bilingual Language Acquisition for Domain Specialized Neural Machine Translation. IEEE Access 2022, 10, 38684–38693. [Google Scholar] [CrossRef]
Figure 1. Distribution of the number of people by the score for the 455th test (implemented on 20 February 2022).
Figure 2. Overall process. The figure on the left corresponds to “Step 1: Data Augmentation” and the figure on the right corresponds to “Step 2: Multiple Choice Task”.
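The second step in Figure 2 casts TOEIC Part 5 as a multiple choice task: each option is substituted into the blank, every filled sentence is scored, and the highest-scoring option is the prediction. A minimal sketch, with a toy scorer standing in for the pretrained encoder (BERT/RoBERTa/ELECTRA) and its classification head:

```python
# Sketch of the "Multiple Choice Task" step.  `score` is a stand-in for
# a pretrained transformer encoder with a classification head; the toy
# scorer below is for illustration only.
def fill(question, option):
    return question.replace("______", option)

def predict(question, options, score):
    filled = [fill(question, o) for o in options]
    scores = [score(s) for s in filled]
    return options[max(range(len(options)), key=scores.__getitem__)]

# Toy scorer: prefers the grammatical past-tense completion.
toy_score = lambda s: 1.0 if "suffered last quarter" in s else 0.0

q = "The assets of the company ______ last quarter."
print(predict(q, ["suffer", "suffers", "suffering", "suffered"], toy_score))
# → suffered
```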
Figure 3. BERT, RoBERTa, and ELECTRA performance variation according to the data size.
Table 1. Performance of TOEIC-BERT, an existing study that conducted experiments with a deep learning model for TOEIC Part 5.
Model | BERT-Base-Uncased | BERT-Base-Cased | BERT-Large-Uncased | BERT-Large-Cased
Accuracy | 73.46% | 76.38% | 75.29% | 72.84%
Table 2. Examples of TOEIC Part 5 fill-in-the-blank in a single sentence.
Type | Question | Options
Semantic | Even experienced clerks are encouraged to attend training ______ to keep them updated on new ideas in the world of banking. | (A) materials (B) sessions (C) experiences (D) positions
Grammar | The assets of Marble Faun Publishing Company ______ last quarter when one of their main local distributors went out of business. | (A) suffer (B) suffers (C) suffering (D) suffered
Table 3. Examples of POS-tagging Based Data and Lemmatizing Based Data.
Data | Question | Options
POS-tagging Based | I think the ______ discovery is that of fire. | (A) sweetest (B) driest (C) youngest (D) greatest
POS-tagging Based | A litigation agent in a criminal case shall faithfully perform the following ______ until the close of the relevant case. | (A) beanies (B) observatories (C) laundries (D) duties
POS-tagging Based | So A is an eco-friendly product, and it’s a highly portable container which can be used by ______ it. | (A) folding (B) windsurfing (C) caching (D) downsizing
Lemmatizing Based | I was told that the initial ______ could be possible before going to the processing plant. | (A) purchase (B) purchases (C) purchasing (D) purchased
Lemmatizing Based | Weighing only a third of the existing products the flat lash is so light that users might forget that they are ______ artificial eyelashes. | (A) wears (B) wore (C) wear (D) wearing
Lemmatizing Based | It will make it easier to recognize the suspect at the same time ______ the possible attempts. | (A) reduces (B) reduce (C) reduced (D) reducing
Table 4. The statistics of the training, validation, and test set. Here, the ‘length’ refers to the length after tokenization with BertTokenizer.
Statistic | Train | Valid | Test
# questions | 2899 | 363 | 363
Min length | 5 | 12 | 10
Max length | 57 | 34 | 50
Avg length | 20.45 | 22.25 | 21.28
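The minimum, maximum, and average lengths in Table 4 can be reproduced with a small helper like the one below; the token lengths themselves would come from running each question through BertTokenizer, which is not reproduced here, so the input list is hypothetical.

```python
def length_stats(token_lengths):
    """Summarize tokenized question lengths (min/max/avg as in Table 4)."""
    return {
        "min": min(token_lengths),
        "max": max(token_lengths),
        "avg": round(sum(token_lengths) / len(token_lengths), 2),
    }

# Hypothetical subword lengths for a handful of tokenized questions.
print(length_stats([5, 12, 21, 34, 57]))
# → {'min': 5, 'max': 57, 'avg': 25.8}
```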
Table 5. Experimental results on Kaggle TOEIC data, Randomly and Brutely Augmented Data, POS-tagging Based Data, Lemmatizing Based Data, and Mixed Data.
Data | BERT (Only Aug / With FT) | RoBERTa (Only Aug / With FT) | ELECTRA (Only Aug / With FT)
Kaggle TOEIC | 86.22% | 96.14% | 90.63%
Random & Brute | 84.02% / 92.01% | 93.11% / 96.41% | 94.21% / 97.24%
POS-tagging | 90.08% / 94.21% | 95.59% / 96.96% | 96.14% / 97.52%
Lemmatizing | 90.63% / 94.21% | 95.04% / 97.24% | 96.96% / 98.07%
Mixed | 91.46% / 95.04% | 96.41% / 97.52% | 97.24% / 98.07%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Lee, J.; Moon, H.; Park, C.; Seo, J.; Eo, S.; Lim, H. BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders. Appl. Sci. 2022, 12, 6686. https://doi.org/10.3390/app12136686


