2. Materials and Methods
2.1. Participants
Seventy-two third- and fourth-grade students from 14 public elementary schools in South Korea participated in this study. Among them, 28 poor comprehenders were identified by screening (achievement below the 16th percentile on the district-wide reading assessments), followed by the standardized Reading Achievement and Reading Cognitive Process (RA-RCP) test battery [17]. These students with reading comprehension impairments had intact word-recognition capabilities and communication skills. The remaining 44 average students served as the control group.
2.2. Data Collection
In the training session, three sentences were presented, and the participants were asked to verbalize their thoughts right after reading each sentence.
The stimulus text for the main session was adopted from a textbook for elementary reading courses. The text included 10 sentences, and the average number of words in each sentence was 9.2 with a standard deviation of 2.7. The participants verbalized their thoughts right after reading each sentence, and those responses were transcribed.
As a result, a dataset of 720 pairs of sentences and responses was collected by transcribing the responses of the 72 participants to the 10 sentences in the main session. The average number of words in each response was 12.8, with a standard deviation of 9.1. Thus, the responses were generally longer than the sentences in the stimulus text; this difference in length may reflect the difference between written and oral language.
2.3. Feature Extraction
First, each sentence of the stimulus text and the corresponding response were encoded into numerical representations using sentence embedding. Two embedding models were used: term frequency–inverse document frequency (TF-IDF) [10] and Sentence BERT (SBERT) [15,16]. The main motivation for choosing these two models was to compare word-frequency-based and contextualized embeddings. TF-IDF is calculated from the frequencies of words and is insensitive to their arrangement. In contrast, SBERT is a Transformer-based embedding that considers neighboring words using the attention mechanism [15,16].
For TF-IDF, the vocabulary was built by first normalizing the dataset and extracting nouns, similarly to previous studies [18,19], using the open-source Korean text processor [20]. Nouns that appeared more than once in the dataset were collected, yielding a vocabulary of 246 nouns.
Using TF-IDF, stimulus sentences and readers' responses were encoded into 246-dimensional vectors, with each dimension representing the TF-IDF weight of the corresponding noun. Specifically, the TF-IDF value is defined by the product of the term frequency (TF) and inverse document frequency (IDF) as follows:

$$\mathrm{TFIDF}(w, t) = \mathrm{TF}(w, t) \times \mathrm{IDF}(w), \qquad \mathrm{IDF}(w) = \log \frac{N}{\mathrm{DF}(w)},$$

where TF(w,t) is the term frequency of the given term w in the sentence t, N is the total number of sentences, and DF(w) is the number of sentences that contain the term w. The resulting TF-IDF vectors were normalized using the Euclidean norm.
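As an illustration, the encoding step can be sketched with scikit-learn's TfidfVectorizer. This is a minimal sketch, not the exact pipeline of this study: the extract_nouns tokenizer is a hypothetical placeholder for noun extraction with the Korean text processor, min_df=2 only approximates the "appeared more than once" criterion, and scikit-learn's default IDF uses a slightly different smoothing than the formula above.

```python
# Minimal sketch of the TF-IDF encoding step (illustrative, not the study's exact code).
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_nouns(sentence):
    # Hypothetical placeholder: a real implementation would run the open-source
    # Korean text processor and return only the extracted nouns.
    return sentence.split()

# Toy corpus; in the study, the stimulus sentences and all responses formed the corpus.
corpus = [
    "the student reads the story about the fox",
    "the fox found the grapes in the garden",
    "the student talked about the grapes",
]

vectorizer = TfidfVectorizer(
    tokenizer=extract_nouns,
    token_pattern=None,  # disable the default regex tokenizer
    min_df=2,            # roughly mirrors keeping nouns that appear more than once
    norm="l2",           # Euclidean normalization, as described above
)
tfidf_vectors = vectorizer.fit_transform(corpus)  # one sparse vector per sentence
print(tfidf_vectors.shape)  # (number of sentences, vocabulary size)
```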
SBERT is a variation of the Transformer architecture [21] tailored for sentence-level representations. Employing a Siamese network structure [15,16], the SBERT model learns sentence embeddings whose distances are smaller for semantically similar sentences. To train an SBERT model, sentence pairs with annotated semantic similarity scores are used. This is in contrast to other Transformer models, which learn numerical representations by predicting masked words from the surrounding context.
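To make the training objective concrete, the following sketch shows how an SBERT-style model is typically fitted on similarity-labeled sentence pairs with the sentence-transformers library. The base model name and the toy pairs are illustrative assumptions, not the configuration behind the model used in this study.

```python
# Illustrative sketch of SBERT training on similarity-labeled sentence pairs
# (toy data; not the training run behind the model used in this study).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("klue/roberta-base")  # assumed base model, for illustration

train_examples = [
    InputExample(texts=["A child reads a book.", "A kid is reading."], label=0.9),
    InputExample(texts=["A child reads a book.", "It is raining outside."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# The Siamese setup: both sentences pass through the same encoder, and the
# cosine similarity of their embeddings is regressed toward the label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```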
Sentence embedding by SBERT was performed as follows. Following the preprocessing method for the Korean Language Understanding Evaluation (KLUE) benchmark dataset [22], each sentence and its corresponding response were tokenized into morphemes, and byte pair encoding was applied with a vocabulary size of 32 K. A pre-trained SBERT model with the RoBERTa architecture [23] for Korean was obtained from [24]; it was trained with the KorNLU dataset [25] provided by KakaoBrain [26]. Using this model, 768-dimensional embedding vectors were calculated for all the sentences in the stimulus text and the readers' responses.
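The encoding step itself reduces to a few lines with the sentence-transformers library. In the sketch below, the model identifier is a publicly available Korean SBERT used as an illustrative stand-in, not necessarily the exact model obtained from [24], and the sentence lists are placeholders.

```python
# Sketch of the SBERT encoding step (model id and inputs are illustrative).
from sentence_transformers import SentenceTransformer

# A publicly available Korean SBERT, used here only as a stand-in.
model = SentenceTransformer("jhgan/ko-sroberta-multitask")

stimulus_sentences = ["sentence 1 of the stimulus text", "sentence 2 ..."]  # placeholders
responses = ["transcribed response 1", "transcribed response 2 ..."]        # placeholders

stimulus_embeddings = model.encode(stimulus_sentences)  # shape: (num_sentences, 768)
response_embeddings = model.encode(responses)           # shape: (num_responses, 768)
```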
Next, similarity scores between the embeddings of the stimulus sentences and the corresponding responses were calculated as follows. For each sentence and its corresponding response, let the embedding vectors be denoted by $\mathbf{s}$ and $\mathbf{r}$, respectively. The cosine similarity score between these embeddings, denoted $\mathrm{sim}(\mathbf{s}, \mathbf{r})$, is calculated as follows:

$$\mathrm{sim}(\mathbf{s}, \mathbf{r}) = \frac{\mathbf{s} \cdot \mathbf{r}}{\lVert \mathbf{s} \rVert \, \lVert \mathbf{r} \rVert},$$

where $\cdot$ denotes the inner product of two vectors and $\lVert \cdot \rVert$ represents the Euclidean norm. Because there were 10 sentences in the stimulus text, 10 similarity scores were computed for each participant.
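Equivalently, the similarity scores can be computed directly from the embedding vectors, for example with NumPy. In this minimal sketch, random vectors stand in for the actual SBERT embeddings.

```python
# Cosine similarity between stimulus and response embeddings (minimal sketch;
# random vectors stand in for the actual 768-dimensional SBERT embeddings).
import numpy as np

def cosine_similarity(s, r):
    # sim(s, r) = (s . r) / (||s|| ||r||)
    return float(np.dot(s, r) / (np.linalg.norm(s) * np.linalg.norm(r)))

rng = np.random.default_rng(0)
stimulus_embeddings = rng.normal(size=(10, 768))  # one vector per stimulus sentence
response_embeddings = rng.normal(size=(10, 768))  # one vector per response

# Ten similarity scores per participant, one per sentence-response pair.
scores = [cosine_similarity(s, r)
          for s, r in zip(stimulus_embeddings, response_embeddings)]
```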
2.4. Classification Models
Seven machine learning models were chosen to classify participants into the normal and abnormal classes based on their similarity scores.
The first three classifiers were simple models. The first is logistic regression [27], which is simple and effective for linearly separating the input features into two groups. More specifically, the log odds between the two classes are modeled as a linear function of the input features, resulting in a linear decision boundary between the two classes. The second classifier is linear discriminant analysis (LDA) [28]. LDA is another linear classifier, but it is based on different assumptions about the input distribution: the input features are assumed to be normally distributed, with a different mean for each class and a covariance shared between the classes. The third classifier is quadratic discriminant analysis (QDA) [29]. Like LDA, QDA assumes that the input features are normally distributed with different means for different classes. In contrast to LDA, however, QDA allows the covariances of the classes to differ, resulting in non-linear decision boundaries.
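In scikit-learn terms, these three models correspond to the following estimators. This is a minimal illustrative sketch with default settings, as the study's exact configuration is not specified.

```python
# The three simple classifiers as scikit-learn estimators (defaults shown;
# illustrative only, not necessarily the study's exact configuration).
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

simple_classifiers = {
    "logistic_regression": LogisticRegression(),  # linear boundary via log odds
    "lda": LinearDiscriminantAnalysis(),          # Gaussian classes, shared covariance
    "qda": QuadraticDiscriminantAnalysis(),       # Gaussian classes, per-class covariance
}
```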
Four more advanced classifiers were chosen to capture more complex patterns in the features. The Naïve Bayes classifier [30] was used to investigate the effect of potential correlations between the input features: it models the input features as independent of each other, providing a baseline accuracy obtained when the interactions between input features are ignored. Next, a support vector classifier (SVC) with a linear kernel [31] was used to determine the hyperplane that separates the input features of the two classes with the maximum margin. In addition, the random forest classifier [32] was used to capture complex and potentially non-linear patterns in the data using an ensemble of randomly constructed decision trees. The final model was the k-nearest neighbors (KNN) classifier [33]. KNN can capture more complex and local structures in the data because the class label of an input sample is determined by the class labels of its k nearest neighbors.
The hyperparameters of the SVC, random forest, and KNN classifiers were varied over a wide range of values, and the highest accuracy was reported for each classifier. In the SVC, the level of robustness to outliers was controlled by the weight (C) of the penalty for misclassification during training. The value of C was varied from $10^{-5}$ to $10^{5}$ in multiplicative steps of $10^{0.25}$. The complexity of the random forest was controlled by the number of estimators (n). Increasing the number of estimators enables the classifier to capture more complex data patterns; however, too many estimators can result in overfitting and poor generalization to new inputs. The value of n was increased from 2 to 30 in steps of 2. In the KNN classifier, the number of neighbors (k) controlled the smoothness of the decision boundary. The value of k was increased from 1 to 25 in steps of 2.
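The remaining four classifiers and the hyperparameter grids described above translate to the following sketch. The grids follow the text; everything else (including the Gaussian variant of Naïve Bayes) is an assumed, plausible default rather than the study's confirmed configuration.

```python
# The four more flexible classifiers with the hyperparameter grids described
# above (grids follow the text; other settings are assumed defaults).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

C_grid = 10.0 ** np.arange(-5.0, 5.25, 0.25)  # 10^-5 ... 10^5 by factors of 10^0.25
n_grid = range(2, 31, 2)                      # number of trees: 2, 4, ..., 30
k_grid = range(1, 26, 2)                      # neighbors: 1, 3, ..., 25

naive_bayes = GaussianNB()  # feature-independence assumption, no sweep needed
svc_models = [SVC(kernel="linear", C=C) for C in C_grid]
rf_models = [RandomForestClassifier(n_estimators=n) for n in n_grid]
knn_models = [KNeighborsClassifier(n_neighbors=k) for k in k_grid]
```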
2.5. Classification Accuracy Measured Using Cross-Validation
The F1 scores of the seven classifiers for each embedding model were measured using stratified 5-fold cross-validation [34]. Specifically, the dataset was randomly shuffled and divided into five folds so that each fold had the same proportion of normal and abnormal participants. For a given embedding model, a classifier was trained using all but one fold, and the F1 score of the classifier was measured for the omitted fold. This was repeated for all folds.
The statistical significance of the F1 scores was calculated as follows. The baseline is the trivial classifier that predicts every participant as abnormal (the positive class). With 28 of the 72 participants labeled abnormal, this classifier has a precision of 28/72 ≈ 0.39 and a recall of 1, giving an F1 score of 0.56. Therefore, the five F1 scores obtained from cross-validation were compared against this baseline F1 score of 0.56 using a one-tailed t-test.
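The evaluation protocol can be sketched as follows. The function and variable names are illustrative, and SciPy's one-sided one-sample t-test is assumed for the significance test.

```python
# Sketch of the evaluation: stratified 5-fold CV F1 scores compared against the
# trivial always-abnormal baseline (F1 = 0.56) with a one-tailed t-test.
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

BASELINE_F1 = 0.56  # classifier that labels all 72 participants as abnormal

def cv_f1_scores(clf, X, y, seed=0):
    """Return the five per-fold F1 scores from stratified 5-fold cross-validation."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
    return np.array(scores)

# Usage with hypothetical data: X is a (72, 10) array of similarity scores,
# y a 0/1 label vector with 1 = abnormal.
# scores = cv_f1_scores(classifier, X, y)
# t_stat, p_value = ttest_1samp(scores, BASELINE_F1, alternative="greater")
```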
4. Discussion
The results of this study show that similarity scores based on sentence embeddings are an effective way of quantifying reading behavior. The similarity scores between the stimulus text and the responses were lower for poor comprehenders than for normal students. This is consistent with previous research indicating that poor comprehenders struggle to understand the meaning of given sentences and often digress into unrelated topics [3]. Such behaviors are believed to be associated with attention deficits or inefficient working memory [1,3]. Our method provides quantitative evidence supporting this theory.
The main difference between the two embedding models is the use of context in the text and responses. TF-IDF is based solely on the frequency of words and is insensitive to their order and relationships with neighboring words. In contrast, the SBERT model is trained to take contextual information into account by considering the words around a target word. This difference resulted in qualitatively different sentence embeddings and in higher classification accuracies for the SBERT model than for TF-IDF.
Even with simple linear classifiers, the embedding models produced qualitatively different F1 scores. The F1 scores of the logistic regression classifiers were similar for the two embedding models. The F1 score of LDA based on TF-IDF was lower than that of the logistic regression classifier based on TF-IDF. In contrast, the F1 score of LDA based on SBERT was higher than that of the logistic regression classifier based on SBERT, resulting in a significantly higher F1 score than the baseline classifier. This difference is due to the different assumptions made about the inputs to the classifiers. LDA assumes that the input is normally distributed. As shown in Figure 2, the distribution of similarity scores based on TF-IDF is bimodal, which does not fit the LDA assumption. In contrast, the distribution of similarity scores based on SBERT is unimodal, which is well modeled by LDA.
Similarly, a comparison of LDA and QDA provides the following insight. Both LDA and QDA assume normality of the input distribution. The only difference is that LDA assumes that the same covariance matrix is shared between the classes, whereas QDA allows each class to have its own covariance matrix, giving it more flexibility to capture the different class distributions with a larger number of parameters. For both embedding models, the F1 scores of QDA were lower than those of LDA. Thus, the lower accuracy of the more flexible classifier indicates overfitting.
The F1 scores of the Naïve Bayes classifiers were slightly higher than those of logistic regression, but not significantly higher than that of the baseline classifier. This is because the feature dimension (10), which corresponds to the number of sentences in the stimulus text, is relatively low in this study. Thus, the main advantage of Naïve Bayes, scaling to higher-dimensional features at the cost of ignoring the correlations between dimensions, provides little gain here.
Compared to the Naïve Bayes classifiers, SVM and random forest classifiers produced higher classification accuracies. This suggests that the reader’s response to an input text changes as the participant reads through the sentences in the text. Considering this non-stationary behavior allowed the SVM and random forest classifiers to achieve higher classification accuracies.
For both embedding models, the F1 scores of the SVM classifiers were significantly higher than that of the baseline classifier. This is consistent with previous findings that SVMs work well, especially when the number of samples is limited. The balance between the classification error and the sensitivity to outliers can be controlled by the hyperparameter C. With the optimal choice of this hyperparameter, the SVMs achieved high F1 scores of 0.65 (TF-IDF) and 0.68 (SBERT), with corresponding recall scores of 0.5 (TF-IDF) and 0.46 (SBERT).
The highest accuracy, achieved by the random forest classifier, demonstrates the effectiveness of the ensemble approach. The random forest classifier is a collection of simple decision trees, and aggregating the predictions of multiple trees improved the overall accuracy and reduced overfitting. Consequently, the highest F1 score of 0.74, with a recall score of 0.61, was achieved using SBERT.
Notably, the KNN classifier is the most flexible of the classifiers, yet its accuracy was low. This can be attributed to overfitting: the KNN classifier fits complex decision boundaries to the training data and does not generalize well to unseen examples. The drop in classification accuracy was more pronounced for TF-IDF, where the F1 score (0.57) was close to that of the baseline classifier (0.56). In comparison, SBERT was less sensitive to overfitting, as the F1 score of the KNN classifier based on SBERT (0.61) was significantly higher than the baseline score.
Further improvement could be achieved by end-to-end training of a deep neural network. Rather than separating feature extraction and classification, training a deep neural network end to end could lead to higher accuracy by enabling the model to learn both tasks simultaneously. However, end-to-end training would require a larger amount of data and is not feasible with the current dataset. The author is in the process of collecting more data to explore this direction in future work.
5. Conclusions
In this study, we proposed a method to quantify readers' responses to a given text and thereby automate the think-aloud protocol for diagnosing reading comprehension impairments. The data collection procedure of the think-aloud protocol remains the same; the readers' responses are instead quantified using features based on sentence embeddings and standard classification models. The stimulus text and the think-aloud responses were first encoded into high-dimensional vectors using sentence embedding, and the similarities between these encoded representations were then used as features. Notably, the similarity scores were lower for poor comprehenders than for average students. Using these similarity-based features, we successfully distinguished normal from abnormal readers. The highest F1 score of 0.74 was achieved using the SBERT embedding in combination with the random forest classifier.
Our future research will focus on further exploring the complex patterns within readers' utterances. In this study, we used pre-trained embedding models to represent the stimulus text and readers' responses for feature (similarity score) extraction and classification. To advance this methodology, our goal is to fine-tune the embedding models within an integrated framework that combines embedding, feature extraction, and classification into a unified model. This integrated model will be trained end-to-end to provide an embedding optimized for the think-aloud protocol. For this purpose, we are actively collecting additional think-aloud data, including texts on diverse topics and the corresponding responses of readers. We expect to achieve higher accuracy in identifying reading comprehension impairments and to uncover individual differences in reading comprehension.