Article

Exercise Semantic Embedding for Knowledge Tracking in Open Domain

1 School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei 230026, China
2 School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
* Author to whom correspondence should be addressed.
Information 2025, 16(4), 302; https://doi.org/10.3390/info16040302
Submission received: 13 March 2025 / Revised: 3 April 2025 / Accepted: 8 April 2025 / Published: 9 April 2025

Abstract

As one of the foundational techniques of data-driven intelligent education systems, knowledge tracing (KT) tracks students' mastery of specific knowledge points by analyzing historical student–exercise interaction data. Since most student–exercise interaction data in real-world applications are open-domain data, new exercises and new knowledge points are never modeled, which degrades knowledge tracing performance. The primary goal of this study is therefore to address knowledge tracing on open-domain data. To this end, this paper proposes a two-stage knowledge tracing framework, namely Exercise Semantic embedding for Knowledge Tracing (ESKT). In the first stage of ESKT, exercise semantic information is embedded with a pre-trained language model (PLM). In the second stage, to capture the semantics of students' answers, this paper proposes Knowledge Tracing with an Answer Encoder and Multiple Questions Attention Mechanism (KTAM). To verify the performance of the framework, it is compared with the state-of-the-art (SOTA) methods AKT and SAINT on an English reading comprehension dataset, and the results show that ESKT achieves up to a 7% AUC boost on open-domain data. In conclusion, this paper innovatively uses a pre-trained language model for exercise semantic embedding to solve the knowledge tracing task on open-domain data.

1. Introduction

Online education platforms, such as Intelligent Tutoring Systems (ITSs) and Massive Open Online Courses (MOOCs), have developed rapidly in recent years. The volume of student–exercise interaction data on these platforms is growing and encompasses various disciplines [1,2]. To better collect and exploit these data, intelligent education systems are built to facilitate personalized learning and intelligent exercise recommendation [3,4]. However, effectively developing and utilizing these data has become a bottleneck in advancing intelligent education systems. This study primarily focuses on the technologies used to leverage these data.
In intelligent education systems, student–exercise interaction data refer to the sequences of students' historical exercise records. These data include information about the exercises that students have previously worked on, as well as their performance and response scores on those exercises. Analyzing these data benefits applications such as cognitive diagnosis for judging students' cognitive ability [5], providing early warnings to students [6], dynamically adapting learning content to students' abilities [7], and recommending suitable exercises to students [8]. In these applications, knowledge tracing (KT) is a fundamental source of technical support.
Knowledge tracing predicts students' future exercise performance by monitoring and analyzing their evolving knowledge states during the exercise-solving process [9]. For example, Figure 1 shows a student practicing English reading comprehension. Given the student's historical exercise-solving process on $e_1$–$e_4$, the knowledge tracing task predicts the student's performance on the next exercise, e.g., $e_5$.
The principles of knowledge tracing are based on the observations that students with similar proficiency levels for specific knowledge points tend to perform similarly on corresponding exercises in student–exercise interaction data. Additionally, an individual student will perform consistently when practicing similar exercises. By analyzing student–exercise interaction data, knowledge tracing can obtain students’ mastery of the knowledge points involved in the exercises. Based on the information about students’ mastery of the knowledge points and the correlation between exercises, knowledge tracing can predict the students’ performance in practicing exercises in the future, intelligently recommend exercises, and provide personalized learning suggestions.
The existing methods of knowledge tracing can be divided into two categories according to the type of information they use, namely student–student or exercise–exercise connections. For instance, factor analysis methods [10,11,12] utilize similarity relationships between students to track their knowledge status, while deep learning methods [13,14,15,16,17] adopt the relationships between exercises by embedding knowledge point information. Although these deep learning methods have achieved good results on public datasets, they perform poorly on open-domain datasets, because most public datasets are closed-domain datasets in which the knowledge points implied in the exercises have already occurred in the training set. However, in real-world applications, newly added exercises contain new knowledge points without annotations, such as new English reading comprehension articles.
To our knowledge, no study focuses on these newly added exercises, but the similar cold start problem has been studied [18,19]. The cold start problem refers to the difficulty of prediction caused by having too little training data when a system is first established. The open-domain problem, in contrast, focuses on situations where the test data are unlabeled and not contained in the training set. In addition, although the cold start problem also involves predicting performance on new exercises, existing methods target mathematical practice datasets, which are unsuitable for open-domain English reading comprehension.
In summary, the research goals of this study are as follows:
  • To solve the problem of knowledge tracing in open-domain data. In applications, new exercises are constantly added to the exercise bank. These newly added exercises have no labeled knowledge point information, so predicting students' performance on them is difficult. One of the goals of this study is to address this problem.
  • To solve the problem of knowledge tracing in English reading comprehension data. Compared with public mathematics education data, English reading comprehension data contain long exercise contexts and multiple questions per exercise context. Determining how to use these features to improve the performance of knowledge tracing tasks is another goal of this study.
  • To take advantage of students' answer information. Existing methods only focus on students' performance (that is, whether students answered an exercise correctly or incorrectly), while students' answer information can also benefit the knowledge tracing task. Incorporating answer information can distinguish students' exercise records with different reasons for mistakes. Determining how to use students' answer information to improve knowledge tracing models is the supporting research objective of this study.

2. Related Work

Knowledge tracing. Knowledge tracing is a task that involves predicting students’ future performance based on their historical performance [20]. Traditional methods solve knowledge tracing mostly using Bayesian methods [21,22] and factor analysis [10,11,12].
Since the rapid development of deep neural networks, research on knowledge tracing based on deep neural networks has become mainstream. Deep knowledge tracing (DKT) is the first method to use the long short-term memory (LSTM) network to solve issues related to knowledge tracing [16]. Based on DKT, DKT+ [17] introduces regularization terms corresponding to reconstruction and waviness to the loss function to enhance the consistency in knowledge tracing prediction. The self-attentive knowledge tracing (SAKT) first introduces attention into knowledge tracing [15]. Dynamic key-value memory networks (DKVMNs) use an external memory matrix to store learner knowledge [23].
Since the transformer [24] framework was first proposed, it has made great progress in many domains. Separated self-attentive neural knowledge tracing (SAINT) is the first transformer-based knowledge tracing model to apply deep self-attentive layers to exercises and responses separately [13]. Based on SAINT, SAINT+ [25] integrates temporal features. Context-aware attentive knowledge tracing (AKT) utilizes a novel monotonic attention mechanism in which attention weights decay exponentially, modeling the forgetting behavior of students [14]. MonaCoBERT [26] applies monotonic attention to the ConvBERT architecture [27].
Some works have noted that exercise embedding is important in knowledge tracing tasks. The exercise-enhanced recurrent neural network (EERNN) uses an LSTM to represent exercise text and thereby makes full use of exercise record information [19]. Pre-training embeddings via bipartite graphs obtains exercise embeddings from the bipartite graph of question–skill relations [28].
Other works have explored knowledge tracing applications on specific educational data. Option tracing (OT) addresses multiple-choice questions on mathematical education data [29]. The mathematical exercise representation and association of exercise (ERAKT) method focuses on exercise embedding in mathematics education [30]. Some works [31,32] benchmark knowledge tracing tasks on several mathematical datasets, and these results inform knowledge tracing research on English reading comprehension.
With the advent of large language models (LLMs), there have been some knowledge tracing studies based on LLMs [33,34]. Though these studies are novel, LLM-based knowledge tracing has not yet surpassed non-LLM knowledge tracing.
Multi-choice reading comprehension. Multi-choice reading comprehension is a challenging task which requires a person to select an answer from a set of candidate options when given a passage and a question. This paper utilizes multi-choice reading comprehension methods to represent exercises.
Fine-tuning a PLM is currently the mainstream solution for multi-choice reading comprehension. PLMs such as BERT, XLNet, Longformer, RoBERTa, and ALBERT [35,36,37,38,39] have achieved significant improvements on various reading comprehension tasks, including multi-choice reading comprehension.
There are some works which have proposed specific mechanisms for multi-choice reading comprehension. Dual Multi-head Co-Attention (DUMA) models the relationships between the passage, question, and answer for multi-choice reading comprehension tasks, which is able to cooperate with popular PLM [40]. The dual co-matching network with two reading strategies about passage sentence selection and answer option interaction (DCMN+) can effectively improve multi-choice reading comprehension tasks in combination with PLM [41].

3. Materials and Methods

This section first introduces the formal definition of knowledge tracing in reading comprehension, and then the proposed Exercise Semantic embedding for Knowledge Tracing (ESKT) is explained, containing two stages: Stage I: fine-tune Pre-trained Language Model (PLM) and Stage II: Knowledge Tracing with an Answer Encoder and Multiple Questions Attention Mechanism (KTAM), as illustrated in Figure 2. In Figure 2, the left side shows the process of fine-tuning PLM on an external dataset, and the right side explains KTAM. KTAM consists of three encoders—the exercise encoder, the response score encoder, and the answer encoder—along with a single decoder. These components internally utilize the Multiple Questions Attention Mechanism to enhance their functionality. Finally, this section introduces strategies for training with auxiliary tasks.
The knowledge tracing task in this study is built upon the Transformer [24] architecture. Previous studies, such as SAINT [13] and AKT [14], have demonstrated that Transformer-based knowledge tracing models outperform those relying on Bayesian networks [21,22] or recurrent neural networks [16,17]. Drawing inspiration from state-of-the-art knowledge tracing algorithms like SAINT and AKT, the Exercise Semantic embedding for Knowledge Tracing (ESKT) framework is proposed to address this research's goals. The specific connection between the ESKT framework and our research goals is as follows:
  • To solve the problem of knowledge tracing in open-domain data, the ESKT framework incorporates a pre-trained language model to generate exercise semantic embeddings in Stage I: Fine-Tune Pre-trained Language Model (PLM). Even a newly encountered exercise can thus be represented by its exercise context.
  • To solve the problem of knowledge tracing in English reading comprehension data, the encoder and decoder of the ESKT framework incorporate the proposed Multiple Questions Attention Mechanism, which leverages multiple questions within a single exercise context in Stage II: Knowledge Tracing with an Answer Encoder and Multiple Questions Attention Mechanism (KTAM). Additionally, the long exercise context is embedded in Stage I.
  • To take advantage of students' answer information, the Answer Encoder of the ESKT framework encodes students' answer information in Stage II, which enriches the input features and improves knowledge tracing performance.

3.1. Task Definition of Knowledge Tracing

Let $S$ be the student set, in which each student is represented by $s \in S$, which contains the basic information of a student. Let $E$ denote the English multiple-choice reading comprehension exercise set. An exercise $e_i \in E$ is a tuple $(c_i, q_i, o_i, k_i, j_i, t_i)$, where $c_i$ is the context of the reading comprehension exercise, $q_i$ is the question text, $o_i = \{o_{A_i}, o_{B_i}, o_{C_i}, o_{D_i}\}$ stands for the four option texts of $q_i$, $k_i$ is the question type of $q_i$, $j_i$ denotes the exercise–question index, and $t_i \in \{A, B, C, D\}$ is the true option. An exercise record sequence $R$ of a student $s$ is defined as $R = \{(e_1, a_1, r_1), (e_2, a_2, r_2), \ldots, (e_n, a_n, r_n)\}_s$, where $e_i = (c_i, q_i, o_i, k_i, j_i, t_i) \in E$ is an exercise solved by student $s$ at time step $i$, $a_i \in \{A, B, C, D\}$ is the choice of student $s$ in solving $e_i$, and $r_i \in \{0, 1\}$ is the student's response score; $r_i = 1$ if $a_i = t_i$; otherwise, $r_i = 0$.
Given a student $s \in S$ and their exercise record sequence $\{(e_1, a_1, r_1), (e_2, a_2, r_2), \ldots, (e_n, a_n, r_n)\}$, knowledge tracing predicts the student's response score $\tilde{r}_{n+1}$ on the exercise $e_{n+1}$.
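To make the notation in the task definition concrete, the sketch below builds one exercise record $(e_i, a_i, r_i)$ and derives the response score $r_i$ from the chosen and true options. The field and function names are illustrative, not from the paper's code.

```python
from dataclasses import dataclass

# Hypothetical container mirroring e_i = (c_i, q_i, o_i, k_i, j_i, t_i).
@dataclass
class Exercise:
    context: str          # c_i: reading-comprehension passage
    question: str         # q_i: question text
    options: dict         # o_i: {"A": ..., "B": ..., "C": ..., "D": ...}
    question_type: str    # k_i: question type
    index: int            # j_i: exercise-question index
    true_option: str      # t_i in {A, B, C, D}

def make_record(exercise: Exercise, choice: str):
    """Build (e_i, a_i, r_i); r_i = 1 iff the student's choice matches t_i."""
    r = 1 if choice == exercise.true_option else 0
    return (exercise, choice, r)

e1 = Exercise("Passage text ...", "What is the main idea?",
              {"A": "opt A", "B": "opt B", "C": "opt C", "D": "opt D"},
              "main-idea", 0, "B")
record = make_record(e1, "B")
print(record[2])  # 1, because the student chose the true option
```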

3.2. Stage I: Fine-Tune Pre-Trained Language Model (PLM)

Given an external English multiple-choice reading comprehension exercise set $E$ and a pre-trained language model $PLM_{raw}$, we fine-tune $PLM_{raw}$ on $E$, as shown in Equation (1):
$$PLM_{tuned} = \mathrm{FineTune}(PLM_{raw}; E),$$
where we employ Longformer [35] as the language model $PLM_{raw}$, the external dataset is the RACE [42] dataset in the form of a reading comprehension task, and $\mathrm{FineTune}(\cdot)$ represents the fine-tuning operation, as defined by Equations (2)–(4):
$$text\_input_i = \mathrm{stack}(text_{A_i}, text_{B_i}, text_{C_i}, text_{D_i}),$$
$$h_{A_i}, h_{B_i}, h_{C_i}, h_{D_i} = PLM(text\_input_i),$$
$$p_{A_i}, p_{B_i}, p_{C_i}, p_{D_i} = \mathrm{Softmax}(\mathrm{Linear}(h_{A_i}, h_{B_i}, h_{C_i}, h_{D_i})),$$
where the operation $\mathrm{stack}(\cdot)$ concatenates the option texts $text_{A_i}, text_{B_i}, text_{C_i}, text_{D_i}$ into the exercise text $text\_input_i$; each option text is partial information from $e_i$, for example, $text_{A_i} = c_i \,[\mathrm{CLS}]\, q_i \,[\mathrm{CLS}]\, o_{A_i}$, where [CLS] is a special classification token in the vocabulary. The language model $PLM(\cdot)$, an additional fully connected layer $\mathrm{Linear}(\cdot)$, and the $\mathrm{Softmax}(\cdot)$ function convert the exercise text of $e_i$ into a probability distribution over the four options, as shown in Equations (3) and (4).
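A minimal sketch of the multiple-choice pipeline in Equations (2)–(4), with a stand-in encoder in place of Longformer (the real model would return pooled [CLS] states); all names and dimensions here are illustrative.

```python
import math, random

random.seed(0)
HID = 8  # hidden size of the (stand-in) language model

def plm(text):
    """Stand-in for PLM(.): maps one option text to a pooled hidden vector.
    A real implementation would run Longformer and pool the [CLS] state."""
    rng = random.Random(hash(text) % (2**32))
    return [rng.uniform(-1, 1) for _ in range(HID)]

def linear(h, w):
    """Stand-in for Linear(.): scores one option's hidden vector."""
    return sum(hi * wi for hi, wi in zip(h, w))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# Equations (2)-(4): stack the four option texts, encode each, score, softmax.
w = [random.uniform(-1, 1) for _ in range(HID)]     # Linear(.) weights
texts = [f"context [CLS] question [CLS] option {o}" for o in "ABCD"]
h = [plm(t) for t in texts]                         # h_A..h_D, Eq. (3)
p = softmax([linear(hi, w) for hi in h])            # p_A..p_D, Eq. (4)
print(round(sum(p), 6))  # probabilities over the four options sum to 1
```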

3.3. Stage II: Knowledge Tracing with Answer Encoder and Multiple Questions Attention Mechanism (KTAM)

Given a student’s exercise record sequence { ( e 1 , a 1 , r 1 ) , ( e 2 , a 2 , r 2 ) , , ( e n , a n , r n ) } and a next exercise e n + 1 , the KTAM predicts the student’s response score r ˜ n + 1 on e n + 1 in stage II. There are two specific steps: embedding, and encode–decode.
Embedding: Using the fine-tuned model $PLM_{tuned}$, we extract the features of the $i$-th exercise record $(e_i, a_i, r_i)$ in a student's exercise record sequence, as shown in Equations (5)–(7):
$$x_{e_i} = h^i_{W_0} \oplus \left((h^i_{W_0} - h^i_{W_1}) + (h^i_{W_0} - h^i_{W_2}) + (h^i_{W_0} - h^i_{W_3})\right),$$
$$x_{r_i} = x_{e_i} + M_r \cdot r_i,$$
$$x_{a_i} = h^i_{Z_0} \oplus \left((h^i_{Z_0} - h^i_{Z_1}) + (h^i_{Z_0} - h^i_{Z_2}) + (h^i_{Z_0} - h^i_{Z_3})\right),$$
where $x_{e_i}$, $x_{r_i}$, and $x_{a_i}$ are the embeddings of the exercise, the student's response score, and the student's answer option, respectively; $\{W_0, W_1, W_2, W_3\}$ is a permutation of $\{A, B, C, D\}$ in which $W_0$ is the correct option of $e_i$ and the others are the incorrect options; $h^i_A, h^i_B, h^i_C, h^i_D$ are obtained by Equation (3); $M_r$ is a trainable matrix; $\{Z_0, Z_1, Z_2, Z_3\}$ is a permutation of $\{A, B, C, D\}$ in which $Z_0$ is the option that the student selected when practicing $e_i$ and $\{Z_1, Z_2, Z_3\}$ are the other options; and $\oplus$ denotes vector concatenation.
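The embedding step can be sketched as follows, assuming the operation inside the parentheses of Equations (5) and (7) is an element-wise difference between the anchor option's vector and each remaining option's vector; vector sizes, the toy values for $h$, and the single-row stand-in for $M_r$ are illustrative.

```python
HID = 4  # toy hidden size of one option vector

def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def exercise_embedding(h, anchor):
    """Eq. (5)/(7) sketch: concatenate the anchor option's vector with the
    sum of its differences to the three other options. The oplus of the
    paper is modeled as Python list concatenation."""
    others = [o for o in "ABCD" if o != anchor]
    diff_sum = [0.0] * HID
    for o in others:
        diff_sum = vec_add(diff_sum, vec_sub(h[anchor], h[o]))
    return h[anchor] + diff_sum  # 2*HID-dimensional result

h = {o: [float(i + 1)] * HID for i, o in enumerate("ABCD")}  # toy h_A..h_D
x_e = exercise_embedding(h, anchor="B")  # W_0 = correct option, Eq. (5)
x_a = exercise_embedding(h, anchor="C")  # Z_0 = student's choice, Eq. (7)
M_r = [0.5] * (2 * HID)                  # toy stand-in for a row of M_r
r = 1                                    # response score r_i
x_r = [xe + mr * r for xe, mr in zip(x_e, M_r)]  # Eq. (6)
print(len(x_e), len(x_r), len(x_a))  # 8 8 8
```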
Encode–decode: In the encoders, the obtained embeddings $x_{e_i}, x_{a_i}, x_{r_i}\ (i \le n)$ and $x_{e_{n+1}}$ are encoded by Equations (8)–(10):
$$\tilde{x}_{e_{n+1}} = E_e(x_{e_1}, \ldots, x_{e_{n+1}}; \mathrm{MultiQAttn}),$$
$$\tilde{x}_{a_n} = E_a(x_{a_1}, \ldots, x_{a_n}; \mathrm{MultiQAttn}),$$
$$\tilde{x}_{r_n} = E_r(x_{r_1}, \ldots, x_{r_n}; \mathrm{MultiQAttn}),$$
where $\tilde{x}_{e_{n+1}}, \tilde{x}_{a_n}, \tilde{x}_{r_n}$ are $2d_h$-dimensional hidden vectors, and $E_e$, $E_a$, and $E_r$ denote the exercise encoder, answer encoder, and response score encoder, respectively. $\mathrm{MultiQAttn}(\cdot)$ stands for the Multiple Questions Attention Mechanism, defined by Equation (11):
$$\mathrm{MultiQAttn}(K, Q, V) = \mathrm{Attn}(K, Q, V; Mask_{ut}) \oplus \mathrm{Attn}(K, Q, V; Mask_{mq}),$$
where $K, Q, V$ are the key, query, and value of the attention input, $\oplus$ is vector concatenation, and $\mathrm{Attn}(\cdot)$ is a general attention operation. $Mask_{ut}$ stands for the upper-triangular (causal) attention mask, and $Mask_{mq}$ is defined by Equation (12):
$$Mask_{mq}(i, j) = \begin{cases} 0, & c_i = c_j \ \text{and}\ j \le i \\ 1, & \text{otherwise} \end{cases}$$
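A small sketch of the two masks combined in MultiQAttn, under the convention that a mask value of 0 permits attention; the context list, sequence length, and the $j \le i$ causal reading of Equation (12) follow the surrounding definitions, while the helper names are illustrative.

```python
def causal_mask(n):
    """Mask_ut: position i may attend to position j only when j <= i."""
    return [[0 if j <= i else 1 for j in range(n)] for i in range(n)]

def multi_question_mask(contexts):
    """Mask_mq sketch (Eq. (12)): attend only to earlier questions that
    share the same reading-comprehension context c."""
    n = len(contexts)
    return [[0 if (contexts[i] == contexts[j] and j <= i) else 1
             for j in range(n)] for i in range(n)]

# Four questions: the first two share context "c1", the last two share "c2".
ctx = ["c1", "c1", "c2", "c2"]
print(multi_question_mask(ctx))
# [[0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 1], [1, 1, 0, 0]]
```

Note how row 2 masks out positions 0 and 1: the third question may not attend to the first context's questions even though they are earlier in the sequence, which is exactly what distinguishes $Mask_{mq}$ from the plain causal mask.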
In the decoder, the answer option and response score hidden vectors $\tilde{x}_{a_n}, \tilde{x}_{r_n}$ are fused to obtain the student answer hidden vector $\tilde{x}_{ar_n}$. Then, $\tilde{x}_{e_{n+1}}$ and $\tilde{x}_{ar_n}$ are decoded into the hidden vector $\tilde{y}_{n+1}$, which is used to predict the future response score $\hat{r}_{n+1}$ of $e_{n+1}$, as shown in Equations (13)–(15):
$$\tilde{x}_{ar_n} = \mathrm{Linear}(\tilde{x}_{a_n} \oplus \tilde{x}_{r_n}),$$
$$\tilde{y}_{n+1} = \mathrm{Decode}(\tilde{x}_{e_{n+1}}, \tilde{x}_{ar_n}; \mathrm{MultiQAttn}),$$
$$\hat{r}_{n+1} = \mathrm{Predict}_{kt}(\tilde{y}_{n+1}),$$
where $\tilde{x}_{ar_n}$ and $\tilde{y}_{n+1}$ are $2d_h$-dimensional vectors, $\tilde{y}_{n+1}$ refers to the hidden knowledge state of student $s$ at time $n+1$, $\hat{r}_{n+1}$ ranges from 0 to 1 and refers to the predicted response score of student $s$ answering exercise $e_{n+1}$, $\mathrm{MultiQAttn}(\cdot)$ is defined in Equations (11) and (12), and $\mathrm{Predict}_{kt}(\cdot)$ is a fully connected network.
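The fusion in Equation (13) can be sketched as a single projection over the concatenated hidden vectors; the hidden size and random weights are toy choices.

```python
import random

HID2 = 8  # stand-in for the paper's 2*d_h hidden size
rng = random.Random(3)
# Weights of the Linear(.) in Eq. (13): maps the concatenated
# 2*(2*d_h) vector back down to 2*d_h dimensions.
W = [[rng.uniform(-0.1, 0.1) for _ in range(2 * HID2)] for _ in range(HID2)]

def fuse(x_a, x_r):
    """Eq. (13) sketch: concatenate the answer and response-score hidden
    vectors, then project back to 2*d_h dimensions."""
    x = x_a + x_r  # Python list concat plays the role of oplus
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x_ar = fuse([0.1] * HID2, [0.2] * HID2)
print(len(x_ar))  # 8: same dimensionality as the decoder's other inputs
```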

3.4. Training with Auxiliary Tasks

To address new exercises in open-domain datasets and add more information regarding the question type $k$, this paper uses auxiliary tasks [43] for training. Once the decoder produces the hidden vector $\tilde{y}_{n+1}$, it is sent to two classification networks. The first is the main knowledge tracing prediction network, which predicts the student's performance $\hat{r}_{n+1}$. The second is the auxiliary question-type classification network, which predicts the question type $k_{n+1}$ of $e_{n+1}$. These networks have the same module but do not share their parameters, as shown in Equation (16):
$$\mathrm{Predict}_{kt}(x) = \mathrm{Predict}_{kl}(x) = \sigma\left(\mathrm{Linear}_2(\mathrm{ReLU}(\mathrm{Linear}_1(x)))\right),$$
where $\mathrm{Predict}_{kt}(\cdot)$ and $\mathrm{Predict}_{kl}(\cdot)$ have the same formula but do not share parameters, and $\sigma(\cdot)$ and $\mathrm{ReLU}(\cdot)$ represent the sigmoid and ReLU activation functions, respectively.
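A sketch of the two prediction heads, which share the architecture of Equation (16) but not the parameters; the hidden sizes, number of question types, and seeds are illustrative assumptions.

```python
import math, random

def make_head(in_dim, hid_dim, out_dim, seed):
    """One Predict(.) head (Eq. (16)): sigma(Linear2(ReLU(Linear1(x)))).
    Both heads are built with this factory but get separate weights."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
          for _ in range(hid_dim)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(hid_dim)]
          for _ in range(out_dim)]
    def forward(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
        z = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
        return [1.0 / (1.0 + math.exp(-zi)) for zi in z]
    return forward

predict_kt = make_head(8, 16, 1, seed=1)  # response-score head
predict_kl = make_head(8, 16, 5, seed=2)  # question-type head (toy: 5 types)
y = [0.3] * 8                             # stand-in for decoder output y~
print(len(predict_kt(y)), len(predict_kl(y)))  # 1 5
```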
The outputs of the two classification networks are the exercise response score and the distribution over question types, defined in Equations (17) and (18):
$$\hat{r}_{n+1} = \mathrm{Predict}_{kt}(\tilde{y}_{n+1}),$$
$$\hat{k}_{n+1} = \mathrm{Predict}_{kl}(\tilde{y}_{n+1}),$$
where $\hat{r}_{n+1}$ refers to the predicted response score of exercise $e_{n+1}$ with ground truth $r_{n+1}$, and $\hat{k}_{n+1}$ is the predicted question-type distribution of $e_{n+1}$'s question $q_{n+1}$ with ground truth $k_{n+1}$.
The two classification networks have different losses, and the whole framework is optimized jointly by minimizing both, as shown in Equations (19)–(21):
$$\mathcal{L}_{kl} = -\sum_{s,t,i} \mathrm{onehot}(k_t^s)_i \log \hat{k}_{t,i}^s,$$
$$\mathcal{L}_{kt} = -\sum_{s,t} \left( r_t^s \log \hat{r}_t^s + (1 - r_t^s) \log (1 - \hat{r}_t^s) \right),$$
$$\mathcal{L} = \mathcal{L}_{kt} + \lambda \mathcal{L}_{kl},$$
where $\mathcal{L}_{kl}$ is the cross-entropy loss used to optimize the question-type classification network; $\mathrm{onehot}(k_t^s)_i$ is the $i$-th dimension of the one-hot vector of $k_t^s$, the question type student $s$ is practicing at moment $t$; $\hat{k}_{t,i}^s$ is the $i$-th dimension of the predicted question-type distribution for the exercise student $s$ completes at moment $t$; $\mathcal{L}_{kt}$ is the binary cross-entropy loss, in which $r_t^s$ is the response score of student $s$ at moment $t$ and $\hat{r}_t^s$ is the response score predicted by the model; and $\mathcal{L}$ is the overall loss, obtained by adding the two task losses with weight $\lambda$.
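The joint objective can be sketched as follows, using the weight $\lambda = 1 \times 10^{-2}$ reported in the experiments; the toy predictions and helper names are illustrative.

```python
import math

def kt_loss(r_true, r_pred):
    """Binary cross-entropy over response scores (Eq. (20))."""
    return -sum(r * math.log(p) + (1 - r) * math.log(1 - p)
                for r, p in zip(r_true, r_pred))

def kl_loss(k_true, k_prob):
    """Cross-entropy over predicted question-type distributions (Eq. (19));
    k_true holds class indices, k_prob rows are probability vectors,
    so the one-hot sum reduces to the log of the true class probability."""
    return -sum(math.log(probs[k]) for k, probs in zip(k_true, k_prob))

lam = 1e-2  # the loss weight lambda used in the experiments
r_true, r_pred = [1, 0, 1], [0.9, 0.2, 0.7]
k_true = [0, 2, 1]
k_prob = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
total = kt_loss(r_true, r_pred) + lam * kl_loss(k_true, k_prob)  # Eq. (21)
print(round(total, 4))
```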

4. Experiment Results and Analysis

To verify the performance of the proposed Exercise Semantic embedding for Knowledge Tracing (ESKT), ESKT was compared with the SOTA methods AKT and SAINT on the Zhixue dataset. To verify the effectiveness of the PLM, the Answer Encoder, and the Multiple Questions Attention Mechanism, ablation experiments were conducted. And to examine the adaptability of ESKT, exercise semantic embedding was applied to different downstream knowledge tracing models.
In addition, this section also introduces the Zhixue dataset and the experimental settings, and discusses the attention score of the model and the setting of hyperparameters.

4.1. Experiment Setup

Zhixue dataset. Table 1 shows some statistics on the Zhixue dataset. The Zhixue dataset is a large-volume, open-domain English reading comprehension dataset that can be used for knowledge tracing. Its characteristics are detailed as follows:
  • Open Domain. In the Zhixue dataset, the test set contains two types of open domain data. (1) New exercise data. A new exercise is an exercise that is performed by students in the test set but does not appear in the training set. (2) New exercise sequence data. A new exercise sequence is an exercise sequence that does not appear in the training set.
  • Data volume. Table 2 compares the exercise records, sequences, and question numbers of the Zhixue dataset with the Statics2011 (https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507 (accessed on 8 March 2025)), ASSISTments2009 (https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data/skill-builder-data-2009-2010 (accessed on 8 March 2025)), Bridge2006 (https://pslcdatashop.web.cmu.edu/KDDCup/ (accessed on 8 March 2025)), and NIPS34 (https://eedi.com/projects/neurips-education-challenge (accessed on 8 March 2025)) datasets. The Zhixue dataset comprises 266,487 student exercise sequences, over 5 million exercise records, and 281,748 exercise questions in its exercise bank, which is much larger than those of the other datasets in Table 2.
  • Special exercise data structure. The Zhixue dataset comprises educational data on English reading comprehension tasks, whereas most public datasets are educational tasks from the mathematics field. Mathematical data usually include knowledge point or concept labels, and general methods use these concepts to distinguish exercises, while English reading comprehension data weaken the role of knowledge points or concepts. The Zhixue dataset instead includes context, question, and option texts, which support semantic embedding of the exercises.
The entire dataset was first partitioned into training and test sets in a 7:3 ratio according to the students' schools and grades. This partition is in line with practical applications, in which the students in the test data often come from a new school or grade. Thus, the two subsets obtained by this division are heterogeneously distributed, and the test set contains data that do not appear in the training set. From the training set, we randomly took out 30% of the data to create the validation set, which is identically distributed with the training set and was used for early stopping and parameter tuning during training. After the model was trained, it was evaluated on the test set.
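A sketch of this group-wise partition, assuming a (school, grade) key; the fractions follow the 7:3 split and the 30% validation carve-out described above, while the record format and function names are illustrative.

```python
import random

def split_by_group(records, group_key, test_frac=0.3, val_frac=0.3, seed=0):
    """Whole (school, grade) groups go to either train or test, so test
    students come from unseen schools/grades; then a random 30% of the
    remaining training records becomes the validation set."""
    rng = random.Random(seed)
    groups = sorted({group_key(r) for r in records})
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    test = [r for r in records if group_key(r) in test_groups]
    train = [r for r in records if group_key(r) not in test_groups]
    rng.shuffle(train)
    n_val = int(len(train) * val_frac)
    return train[n_val:], train[:n_val], test

# Toy records: (school, grade, student_id) -> 4 schools x 2 grades x 5 students
recs = [(s, g, i) for s in "ABCD" for g in (7, 8) for i in range(5)]
train, val, test = split_by_group(recs, lambda r: (r[0], r[1]))
print(len(train), len(val), len(test))  # 21 9 10
```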
Table 1. Statistics on the Zhixue dataset.

Statistics                           Value
Exercise records                     5,096,545
Students                             266,487
Exercise questions                   281,748
Avg. exercise records per student    19.12
Table 2. Statistics on Zhixue and public datasets.

Dataset           Exercise Records    Sequences    Questions
Statics2011       194,947             333          1224
ASSISTments2009   346,860             4217         26,688
Bridge2006        3,679,119           1146         207,856
NIPS34            1,382,727           4918         978
Zhixue            5,096,545           266,487      281,748
To our knowledge, there is no public English reading comprehension knowledge tracing dataset, and the existing mathematics datasets do not meet our research goals, so our experiments could only be conducted on our commercial Zhixue dataset.
Baseline methods. This paper uses the SOTA methods SAINT [13] and AKT [14] as the baseline methods, in which the exercise–question index is used to embed the exercise, as in general methods [32].
Evaluation Metric. We use the general knowledge tracing metrics [32], Accuracy (ACC) and Area Under the Curve (AUC), to evaluate the model. To evaluate performance on the open-domain dataset, with a focus on new exercises and new exercise sequences, we further introduce Sequence Cold AUC (SCAUC), Sequence Cold ACC (SCACC), Exercise Cold AUC (ECAUC), and Exercise Cold ACC (ECACC), as shown in Equations (22)–(25):
$$\mathrm{Pre}(\hat{r}_t^s) = \begin{cases} 0, & \hat{r}_t^s \le \gamma \\ 1, & \text{otherwise} \end{cases}$$
$$\mathrm{Indicate}(\hat{r}_t^s, r_t^s) = \begin{cases} 1, & \hat{r}_t^s = r_t^s \\ 0, & \text{otherwise} \end{cases}$$
$$\mathrm{ACC} = \frac{\sum_{s,t} \mathrm{Indicate}\left(\mathrm{Pre}(\hat{r}_t^s), r_t^s\right)}{M + N},$$
$$\mathrm{AUC} = \frac{\sum_{s,t:\, r_t^s = 1} \mathrm{rank}(\hat{r}_t^s) - \frac{M(M+1)}{2}}{MN},$$
where $\gamma$ is the threshold for judging the prediction result, $\mathrm{rank}(\cdot)$ represents the ordinal (rank) function, $M$ represents the number of positive samples, and $N$ represents the number of negative samples.
AUC and ACC calculate all the exercise prediction results in the test set while ECACC (ECAUC) only calculates the ACC (AUC) of the new exercise in the test set, and SCACC (SCAUC) calculates the ACC (AUC) of new exercise sequences in the test set. We extract all exercises of a student’s exercise record sequence to form exercise sequence { e 1 , e 2 , } . An exercise sequence that does not appear in the training set is called a new exercise sequence and an exercise that was never seen in the training set is called a new exercise.
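These metrics can be sketched directly from Equations (22)–(25); the cold-start variants (SCAUC, SCACC, ECAUC, ECACC) apply the same formulas restricted to the new-exercise or new-sequence subsets of the test set. Rank ties are ignored here for brevity, and γ = 0.5 is a toy threshold.

```python
def auc(y_true, y_score):
    """Rank-based AUC (Eq. (25)):
    (sum of positive ranks - M(M+1)/2) / (M*N). Ties are not handled."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    rank = {i: pos + 1 for pos, i in enumerate(order)}  # 1-based ranks
    M = sum(y_true)
    N = len(y_true) - M
    pos_rank_sum = sum(rank[i] for i, y in enumerate(y_true) if y == 1)
    return (pos_rank_sum - M * (M + 1) / 2) / (M * N)

def acc(y_true, y_score, gamma=0.5):
    """Eq. (22)-(24): threshold predictions at gamma, then count matches."""
    preds = [1 if s > gamma else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

y_true = [1, 0, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.3]
print(auc(y_true, y_score), acc(y_true, y_score))  # 1.0 1.0
```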
Training. During PLM fine-tuning, we fine-tune Longformer on the external RACE dataset [42]. Exercise texts are truncated to 2048 tokens if they exceed this length, and shorter texts are padded with the special [PAD] token to ensure a uniform input length within each batch. In the RACE dataset, an exercise's context, question text, and option texts typically range from 300 to 600 words, and after tokenization the maximum token length does not exceed 2048, so truncation is not triggered on RACE. The [PAD] token is semantically meaningless and is used only to process batch inputs of the same length. We used the BertAdam optimizer to fine-tune Longformer with a learning rate of $1 \times 10^{-4}$. We ran three epochs on a single NVIDIA Tesla V100 GPU (32 GB of video memory; NVIDIA Corporation, Santa Clara, CA, USA) to fine-tune Longformer for the multi-choice reading comprehension task.
During knowledge tracing, we used the Adam optimizer to train a series of experiments with AKT and SAINT as the basic frameworks, with a batch size of 64 students. The learning rate was set to $1 \times 10^{-4}$, the number of duplicated blocks in each encoder and decoder was set to four, and the number of attention heads in the transformer was set to eight. The loss weight $\lambda$ was set to $1 \times 10^{-2}$. For the knowledge tracing tasks, we ran up to 100 epochs on a single NVIDIA Tesla V100 GPU; training the model took 12 h.

4.2. Comparison with State-of-the-Art Methods

To validate the performance of the proposed exercise semantic embedding for knowledge tracing (ESKT), it was compared with SOTA methods AKT and SAINT on the Zhixue dataset. The experimental results are shown in Table 3.
Table 3 shows that ESKT outperforms AKT by 2.3% Area Under the Curve (AUC) and 1.6% Accuracy (ACC). Compared with SAINT, the AUC of ESKT is higher by 2.8% and the ACC of ESKT is higher by 1.6%. Since AUC and ACC metrics can reflect model performance, these results show that the proposed ESKT outperforms the existing state-of-the-art models, AKT and SAINT.
In terms of SCAUC and SCACC in Table 3, ESKT showed performance improvements of 4.7% and 3.0%, respectively, compared to AKT. In particular, for ECAUC and ECACC, ESKT resulted in a significant improvement, with increases of 7% and 4.3% over AKT. Since the SCAUC, SCACC, ECAUC, and ECACC metrics reflect the model's performance on new exercises and new exercise sequences, these results confirm that ESKT has clear advantages over existing methods when facing new exercise sequences and new exercises, and that ESKT is superior to the state-of-the-art methods in solving open-domain knowledge tracing.
On the other hand, compared with AKT, ESKT improved by 7% and 4.3% on ECAUC and ECACC, respectively, which is larger than the improvement on SCAUC (4.7%) and SCACC (3.0%), and both of these improvements were greater than those in AUC (2.3%) and ACC (1.6%). These results show that ESKT achieved its greatest gains on the ECAUC and ECACC metrics, indicating that the improvement of ESKT on the open-domain dataset mainly comes from its performance on new exercises.
A deeper analysis of the results reveals that the ESKT framework distinguishes itself from the AKT and SAINT models through its innovative two-stage architecture, which incorporates three specialized modules: the Pre-trained Language Model, the Answer Encoder, and the Multi-Question Attention Mechanism. The ablation experiments presented in Section 4.3 provide a detailed analysis of the individual and collective contributions of the three modules within the ESKT framework. The results further highlight the effectiveness of ESKT and its three modules.

4.3. Ablation Experiments

4.3.1. Explanation of Model Symbols

In this section, ablation experiments were conducted on the Exercise Semantic embedding for Knowledge Tracing (ESKT) framework to evaluate the contributions of its three core modules: the Pre-trained Language Model (P), Answer Encoder (A), and Multiple Questions Attention Mechanism (M). The experiments systematically removed each module to assess its individual and combined impacts. Specifically, removing the Answer Encoder (A) resulted in the ESKT-A variant, eliminating the Multiple Questions Attention Mechanism (M) produced ESKT-M, and removing both yielded ESKT-AM. Further removal of the Pre-trained Language Model (P) from ESKT-AM resulted in ESKT-PAM, which aligns with the AKT model.
To further evaluate the adaptability of the two-stage framework, experiments were conducted by substituting the knowledge tracing architecture in the second stage with the SAINT model, resulting in a new variant named ESSAINT. Systematic ablation studies were performed on ESSAINT to assess the contributions of its components. Specifically, removing the Answer Encoder (A) yielded ESSAINT-A, eliminating the Multiple Questions Attention Mechanism (M) produced ESSAINT-M, and removing both modules resulted in ESSAINT-AM. Further removal of the Pre-trained Language Model (P) from ESSAINT-AM led to ESSAINT-PAM, which aligns with the original SAINT model.
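The naming convention above, where each trailing letter denotes a removed module, can be summarized with a small helper. This is only a reading aid for the variant names, not project code:

```python
# P = pre-trained language model, A = answer encoder,
# M = multiple questions attention mechanism.
ALL_MODULES = {"P", "A", "M"}

def enabled_modules(variant: str) -> set:
    """Modules still present in an ablation variant name such as 'ESKT-AM'."""
    _, _, removed = variant.partition("-")
    return ALL_MODULES - set(removed)

print(enabled_modules("ESKT"))      # all three modules enabled
print(enabled_modules("ESKT-AM"))   # only the PLM remains
print(enabled_modules("ESKT-PAM"))  # none: equivalent to the AKT baseline
```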

4.3.2. Pre-Trained Language Model, Answer Encoder and Multiple Questions Attention Mechanism

Ablation experiments on ESKT were conducted to verify the effects of the Pre-trained Language Model, Answer Encoder, and Multiple Questions Attention Mechanism. The experimental results are shown in Table 4.
In Table 4, compared with the full ESKT model, the ESKT-A model shows a decrease of 0.1–0.3% in all metrics. This indicates that the specific answer information provided by the Answer Encoder helps the model track students' abilities more accurately.
Compared with ESKT, the ESKT-M model reduces AUC, ACC, SCAUC, and SCACC by 0.3–0.5%, and reduces ECAUC and ECACC by less than 0.1%. These results show the effectiveness of the Multiple Questions Attention Mechanism. However, the ECAUC and ECACC results also show that the effect of this mechanism on new exercises is not as significant as its effect on new exercise sequences. A likely reason is that it is harder to capture the relationships between the sub-questions of entirely new exercises.
Compared with ESKT-AM, the results of ESKT-M and ESKT-A also confirm the above conclusions. ESKT-M shows an improvement of 0.2–0.6% in all indicators compared with ESKT-AM, while ESKT-A shows an improvement of 0.2–0.6% in AUC, ACC, SCAUC, and SCACC. ESKT-A also shows an improvement of 0.6% in ECAUC and is almost the same in ECACC.
The analysis then focuses on the results of ESKT-PAM (that is, AKT) and ESKT-AM in Table 4. Compared with ESKT-PAM, adding the PLM module yields improvements of 1.6% and 1.1% in AUC and ACC, 3.9% and 2.5% in SCAUC and SCACC, and 6.3% and 3.9% in ECAUC and ECACC. These results show that the PLM brings substantial gains, and that its effect is especially pronounced on new exercises and new exercise sequences. They indicate that the exercise semantic information provided by the PLM effectively solves the problem of embedding new exercises in open-domain datasets and improves the model's knowledge tracing ability.
In summary, the experimental results and subsequent analysis collectively demonstrate that the Pre-trained Language Model (P), Answer Encoder (A), and Multiple Questions Attention Mechanism (M) each independently contribute to the ESKT framework’s performance.

4.3.3. Downstream Knowledge Tracking Framework

Further experiments were conducted to verify the effectiveness of the Pre-trained Language Model, Answer Encoder, and Multiple Questions Attention Mechanism modules on the SAINT base model. The experimental results are shown in Table 5.
In Table 5, compared with ESSAINT, ESSAINT-A shows a decrease of 0.1–0.5% for all metrics. Compared with ESSAINT, ESSAINT-M shows a decrease of 0.1–0.6% in all metrics. In addition, ESSAINT-A shows an improvement of 0.2–1.0% in various metrics compared with ESSAINT-AM, and ESSAINT-M shows an improvement of 0.1–0.7% in various metrics compared with ESSAINT-AM. Moreover, compared with ESSAINT-PAM, ESSAINT-AM shows an improvement of 1.8% and 1.3% in AUC and ACC, 4.4% and 2.9% in SCAUC and SCACC, and 7.1% and 4.4% in ECAUC and ECACC. The results in Table 5 further verify the experimental conclusions in Section 4.3.2.
The experimental results presented in Table 4 and Table 5 demonstrate that the two-stage framework exhibits adaptability and effectiveness. In the first stage, the pre-trained language model remains fixed, serving as a stable foundation for capturing the exercise semantic embedding. In the second stage, the knowledge tracing model is designed to be modular, allowing it to be replaced with alternative Transformer-based architectures. Despite such substitutions, the three core modules—Pre-trained Language Model (P), Answer Encoder (A), and Multiple Questions Attention Mechanism (M)—continue to enhance the framework’s performance. The results demonstrate that this framework is not only effective in its original configuration but that it is also adaptable to diverse Transformer-based knowledge tracing models.

4.4. Experiments of Training with Auxiliary Tasks

Experimental results were compared to evaluate the impact of training with or without auxiliary tasks, as shown in Table 6. In this table, ESKT+T denotes the training of ESKT with auxiliary tasks, and the “+T” notation used for other models follows the same convention.
Training with auxiliary tasks improves SCAUC, SCACC, ECAUC, and ECACC in almost all instances, although the AUC and ACC indicators decrease slightly. The main reason is that the knowledge-type classification loss introduced by the auxiliary task yields a certain improvement on the zero-shot data, indicating that knowledge type is a common feature of multiple-choice reading comprehension in knowledge tracing. However, this loss also biases the model on the non-zero-shot set: instead of focusing purely on accuracy, the model becomes more generalized. As a result, AUC and ACC decrease slightly while the zero-shot metrics increase. In short, adding the knowledge-type classification loss biases the model slightly but enhances its generalization ability.
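The trade-off described above follows from the form of the training objective: the auxiliary knowledge-type classification loss is added to the knowledge tracing loss with weight λ. A minimal sketch, assuming both terms are binary cross-entropies (the actual loss functions in the paper may differ):

```python
import numpy as np

def bce(pred, label):
    """Mean binary cross-entropy."""
    p = np.clip(np.asarray(pred, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(label, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def training_loss(kt_pred, kt_label, type_pred, type_label, lam=1e-2):
    """Knowledge tracing loss plus the lambda-weighted auxiliary classification loss."""
    return bce(kt_pred, kt_label) + lam * bce(type_pred, type_label)

# With lam = 0 the auxiliary term vanishes and only the KT loss remains.
base = training_loss([0.8, 0.3], [1, 0], [0.6], [1], lam=0.0)
```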

4.5. Attention Visualization

The attention score matrices within the ESKT+T model are visualized in Figure 3, illustrating the attention patterns within its encoders and decoder.
In Figure 3, the attention scores of ESKT+T are displayed for the last block of the exercise encoder, response score encoder, answer encoder, and decoder. Each image shows the attention scores averaged over the eight heads. The top four images show the multi-question mask attention scores in the exercise encoder, answer encoder, response score encoder, and decoder, which capture information only from questions sharing the same exercise context. The bottom images display the upper-triangular mask attention scores, which capture information from historical exercise records.
Each attention score matrix is a square matrix that serves as an intermediate value within the model's three encoders and one decoder, reflecting the influence weights within the input sequence.
Taking the exercise encoder's attention score matrix as an example, assume its input is x_{e_1}, …, x_{e_{n+1}}, where x_{e_i} is the embedding of exercise e_i. The element at position (i, j) in this matrix represents the influence weight of exercise e_j on exercise e_i in the student's exercise sequence. To prevent future exercises from affecting the current exercise e_i, the i-th row of the matrix is masked so that e_i is influenced only by previous exercises. This masked portion is depicted as the colorless part of the matrix in Figure 3. For exercises preceding e_i, those with greater impact on e_i are shown in colors closer to the top of the color bar (yellow), while those with lesser impact are shown in colors closer to the bottom (black). Since ESKT employs an eight-head attention mechanism, the final block of the exercise encoder generates eight attention score matrices of the same size. Averaging these matrices and visualizing the result yields the exercise encoder's attention score matrix shown in the lower-left part of Figure 3.
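The masking and head-averaging steps can be sketched as follows, with random stand-ins for the eight per-head score matrices of the final block (illustration only, not the model's actual scores):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # length of the exercise sequence
heads = rng.random((8, n, n))           # stand-in scores for the 8 heads
causal = np.tril(np.ones((n, n)))       # row i attends only to positions j <= i
masked = heads * causal                 # the masked (colorless) region becomes zero
avg_attention = masked.mean(axis=0)     # the single matrix that is visualized
```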
The answer encoder and response score encoder take inputs x_{a_1}, …, x_{a_n} and x_{r_1}, …, x_{r_n}, respectively, and their attention score matrices are structured similarly to those of the exercise encoder. The decoder's input is the combined output of the three encoders, and its attention matrix reflects the interaction weights between the student's exercise sequence, answer information, and response scores. These matrices are generated by the same approach as described for the exercise encoder.
The four attention score matrices at the top of Figure 3 are multi-question mask matrices. The uncolored parts signify blocked future information, while the black areas indicate weights reset to 0 for exercises without shared reading context. This means the current exercise is influenced solely by historical sub-questions and remains unaffected by future exercises or those from different contexts.
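Putting the two constraints together, the multi-question mask combines the causal (upper-triangular) restriction with a same-context restriction. A sketch of how such a boolean mask could be constructed (the context ids are hypothetical; the paper's implementation details may differ):

```python
import numpy as np

def multi_question_mask(context_ids):
    """Position i may attend to position j only if j <= i (causal) AND
    both sub-questions belong to the same reading context."""
    ids = np.asarray(context_ids)
    n = len(ids)
    causal = np.tril(np.ones((n, n), dtype=bool))   # blocks future positions
    same_context = ids[:, None] == ids[None, :]     # blocks other passages
    return causal & same_context

# Questions 0-2 share passage 7; questions 3-4 share passage 9.
mask = multi_question_mask([7, 7, 7, 9, 9])
```

Positions where the mask is `False` correspond to the uncolored (future) and black (different-context) regions described above, and would have their attention weights zeroed out.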

4.6. Hyperparameter λ Analysis

To show how the hyperparameter λ affects the results, experiments were conducted with different values of λ. The experimental results are shown in Table 7.
In Table 7, λ works best around the order of 1 × 10⁻². If λ is greater than this magnitude, the model becomes too biased towards the knowledge-type loss, and its performance degrades in all respects. If λ is less than this magnitude, the auxiliary task has little effect: although the smaller weight of the knowledge-type loss allows AUC and ACC to recover, the model's generalization ability on zero-shot data improves too little. When λ is on the order of 1 × 10⁻², the model's generalization on zero-shot data (SCACC and ECACC) improves, although AUC and ACC decrease slightly compared with training without auxiliary tasks.

5. Discussion

The main experimental results of this study demonstrate that the proposed exercise semantic embedding for knowledge tracing (ESKT) framework achieves significant improvements in English reading comprehension knowledge tracing. A detailed analysis reveals that the three core modules, Pre-trained Language Model (P), Answer Encoder (A), and Multiple Questions Attention Mechanism (M), within the ESKT framework contribute to the main benefits. The reasons for the effectiveness of these three modules are as follows:
  • The ESKT framework innovatively introduces a fine-tuned pre-trained language model, which explicitly integrates exercise semantic embedding. This contrasts with the AKT and SAINT models, which rely solely on knowledge points or concept labels for exercise representation. Even when encountering a new exercise, ESKT can obtain its exercise embedding from context. Therefore, this module has greatly improved ESKT's performance on open-domain data. In addition, the Longformer pre-trained language model employed in this paper effectively handles the long exercise contexts of English reading comprehension, reducing the information loss that typically occurs when the exercise context is simplified to knowledge-point labels.
  • ESKT proposes a multi-question attention masking mechanism that combines an upper triangle mask with a multi-question attention mask. This design leverages historical answer information to enhance the attention weights between interrelated questions within the same overarching exercise context, strengthening the model’s ability to capture dependencies and contextual relationships across exercises.
  • ESKT designs an answer encoder that integrates students’ answer information, while AKT and SAINT only use students’ performance. The answer encoder helps distinguish the representations of students’ answers with different reasons for errors, enriches the input features, and improves the performance of knowledge tracing.
Furthermore, the experimental results also demonstrate that this two-stage framework exhibits adaptability. In the first stage, the pre-trained language model remains fixed. In the second stage, the knowledge tracing model is designed as a modular component, enabling seamless replacement with alternative Transformer-based architectures. Even when substituted, the three core modules still enhance the framework’s performance.

6. Conclusions

This paper outlines three research goals in English reading comprehension knowledge tracing. To address these objectives, it proposes a two-stage framework called exercise semantic embedding for knowledge tracing (ESKT), comprising three key modules: a Pre-trained Language Model (P), Answer Encoder (A), and Multiple Questions Attention Mechanism (M). Finally, the experimental results demonstrate the effectiveness of the proposed ESKT framework and achieve the research goals. The specific methods for achieving each research goal and the corresponding experimental results are presented as follows:
  • To solve the problem of knowledge tracing in open-domain data, this study uses a Pre-trained Language Model (P), which explicitly integrates exercise semantic embedding, during the first stage. This contrasts with the AKT and SAINT models, which rely solely on knowledge points or concept labels for exercise representation. Even when encountering a new exercise, ESKT can obtain its exercise embedding from context and predict students’ performance on it. The experimental results show that this module yields improvements of 6.3% and 3.9% in ECAUC and ECACC, respectively. This proves its effectiveness on new exercises and thus addresses knowledge tracing in open-domain data.
  • To address knowledge tracing in English reading comprehension data, the Multiple Questions Attention Mechanism (M) is proposed. This mechanism combines an upper-triangular mask with a multi-question attention mask to enhance the attention weights between interrelated questions within the same overarching exercise context. In addition, the long exercise contexts of English reading comprehension are handled in the first stage with the Longformer pre-trained language model. The ESKT framework therefore effectively leverages the characteristic features of English reading comprehension data, namely long exercise contexts and multiple questions within a single exercise, to enhance its performance.
  • To take advantage of students’ answer information, the Answer Encoder (A) is used in the second stage of ESKT. In contrast, AKT and SAINT only use students’ performance. The answer encoder helps distinguish the representations of students’ answers with different reasons for errors, enriches the input features, and improves the performance of knowledge tracing. The experimental results show that this module yields a 0.2–0.6% improvement in all metrics, proving its effectiveness in exploiting students’ answer information.
While this study has made progress in addressing knowledge tracing for English reading comprehension, it still has certain limitations. This study focuses on English reading comprehension exercises, but it does not encompass knowledge tracing for other question types like fill-in-the-blank questions or question answering, nor does it extend to subjects such as mathematics, physics, or computer science.
In future work, we will continue to build upon the two-stage framework introduced in this study. We plan to retain the first stage, which employs language models to generate embeddings for various English question types, while redesigning the second stage’s knowledge tracing mechanism to better capture interactions between different question formats. This approach aims to enhance knowledge tracing across diverse English exercises. Additionally, we intend to explore the application of similar methodologies to knowledge tracing tasks in other disciplines such as mathematics, physics, and computer science, further expanding the framework’s applicability and effectiveness.

Author Contributions

Conceptualization, Z.C.; methodology, Z.C.; software, Z.C.; validation, Z.C.; formal analysis, Z.C. and J.L.; investigation, Z.C.; resources, Z.C.; data curation, Z.C.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and J.L.; visualization, Z.C.; supervision, J.L.; project administration, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Acknowledgments

Thanks to Jinlong Li for providing valuable suggestions for revising the language of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Salama, R.; Hinton, T. Online higher education: Current landscape and future trends. J. Furth. High. Educ. 2023, 47, 913–924. [Google Scholar] [CrossRef]
  2. Ulum, H. The effects of online education on academic success: A meta-analysis study. Educ. Inf. Technol. 2022, 27, 429–450. [Google Scholar] [CrossRef] [PubMed]
  3. Pei, P.; Raga, R.C., Jr.; Abisado, M. Enhanced personalized learning exercise question recommendation model based on knowledge tracing. Int. J. Adv. Intell. Inform. 2024, 10, 13–26. [Google Scholar] [CrossRef]
  4. Terzieva, V.; Ivanova, T.; Todorova, K. Personalized Learning in an Intelligent Educational System. In Novel & Intelligent Digital Systems, Proceedings of the 2nd International Conference (NiDS 2022), Athens, Greece, 29–30 September 2022; Springer: Cham, Switzerland, 2022; pp. 13–23. [Google Scholar] [CrossRef]
  5. Liu, Y.; Zhang, T.; Wang, X.; Yu, G.; Li, T. New development of cognitive diagnosis models. Front. Comput. Sci. 2023, 17, 171604. [Google Scholar] [CrossRef]
  6. Xu, F.; Li, Z.; Yue, J.; Qu, S. A systematic review of educational data mining. In Intelligent Computing, Proceedings of the 2021 Computing Conference, Virtual, 15–16 July 2021; Springer: Cham, Switzerland, 2021; Volume 2, pp. 764–780. [Google Scholar] [CrossRef]
  7. Khosravi, H.; Sadiq, S.; Gasevic, D. Development and adoption of an adaptive learning system: Reflections and lessons learned. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, Portland, OR, USA, 11–14 March 2020; pp. 58–64. [Google Scholar] [CrossRef]
  8. Huang, Z.; Liu, Q.; Zhai, C.; Yin, Y.; Chen, E.; Gao, W.; Hu, G. Exploring multi-objective exercise recommendations in online education systems. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1261–1270. [Google Scholar] [CrossRef]
  9. Shen, S.; Liu, Q.; Huang, Z.; Zheng, Y.; Yin, M.; Wang, M.; Chen, E. A survey of knowledge tracing: Models, variants, and applications. IEEE Trans. Learn. Technol. 2024, 17, 1898–1919. [Google Scholar] [CrossRef]
  10. Cen, H.; Koedinger, K.; Junker, B. Learning factors analysis—A general method for cognitive model evaluation and improvement. In Intelligent Tutoring Systems, Proceedings of the International Conference on Intelligent Tutoring Systems, Jhongli, Taiwan, 26–30 June 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 164–175. [Google Scholar] [CrossRef]
  11. Lavoué, E.; Monterrat, B.; Desmarais, M.; George, S. Adaptive gamification for learning environments. IEEE Trans. Learn. Technol. 2018, 12, 16–28. [Google Scholar] [CrossRef]
  12. Thai-Nghe, N.; Drumond, L.; Horváth, T.; Krohn-Grimberghe, A.; Nanopoulos, A.; Schmidt-Thieme, L. Factorization techniques for predicting student performance. In Educational Recommender Systems and Technologies: Practices and Challenges; IGI Global: Hershey, PA, USA, 2012; pp. 129–153. [Google Scholar] [CrossRef]
  13. Choi, Y.; Lee, Y.; Cho, J.; Baek, J.; Kim, B.; Cha, Y.; Shin, D.; Bae, C.; Heo, J. Towards an appropriate query, key, and value computation for knowledge tracing. In Proceedings of the Seventh ACM Conference on Learning @ Scale, Virtual, 12–14 August 2020; pp. 341–344. [Google Scholar] [CrossRef]
  14. Ghosh, A.; Heffernan, N.; Lan, A.S. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2330–2339. [Google Scholar] [CrossRef]
  15. Pandey, S.; Karypis, G. A self-attentive model for knowledge tracing. arXiv 2019, arXiv:1907.06837. Available online: https://arxiv.org/pdf/1907.06837 (accessed on 8 March 2025).
  16. Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep knowledge tracing. Adv. Neural Inf. Process. Syst. 2015, 28. Available online: https://proceedings.neurips.cc/paper/2015/file/bac9162b47c56fc8a4d2a519803d51b3-Paper.pdf (accessed on 8 March 2025).
  17. Yeung, C.K.; Yeung, D.Y. Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the Fifth Annual ACM Conference on Learning @ Scale, London, UK, 26–28 June 2018; pp. 1–10. [Google Scholar] [CrossRef]
  18. Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Xiong, H.; Su, Y.; Hu, G. Ekt: Exercise-aware knowledge tracing for student performance prediction. IEEE Trans. Knowl. Data Eng. 2019, 33, 100–115. [Google Scholar] [CrossRef]
  19. Su, Y.; Liu, Q.; Liu, Q.; Huang, Z.; Yin, Y.; Chen, E.; Ding, C.; Wei, S.; Hu, G. Exercise-enhanced sequential modeling for student performance prediction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  20. Corbett, A.T.; Anderson, J.R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Model. User-Adapt. Interact. 1994, 4, 253–278. [Google Scholar] [CrossRef]
  21. Käser, T.; Klingler, S.; Schwing, A.G.; Gross, M. Dynamic Bayesian networks for student modeling. IEEE Trans. Learn. Technol. 2017, 10, 450–462. [Google Scholar] [CrossRef]
  22. Yudelson, M.V.; Koedinger, K.R.; Gordon, G.J. Individualized bayesian knowledge tracing models. In Artificial Intelligence in Education: Proceedings of the 16th International Conference, AIED 2013, Memphis, TN, USA, 9–13 July 2013; Proceedings 16; Springer: Berlin/Heidelberg, Germany, 2013; pp. 171–180. [Google Scholar] [CrossRef]
  23. Zhang, J.; Shi, X.; King, I.; Yeung, D.Y. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 765–774. [Google Scholar] [CrossRef]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Available online: https://dl.acm.org/doi/10.5555/3295222.3295349 (accessed on 8 March 2025).
  25. Shin, D.; Shim, Y.; Yu, H.; Lee, S.; Kim, B.; Choi, Y. Saint+: Integrating temporal features for ednet correctness prediction. In Proceedings of the LAK21: 11th International Learning Analytics and Knowledge Conference, Irvine, CA, USA, 12–16 April 2021; pp. 490–496. [Google Scholar] [CrossRef]
  26. Lee, U.; Park, Y.; Kim, Y.; Choi, S.; Kim, H. Monacobert: Monotonic attention based convbert for knowledge tracing. In Generative Intelligence and Intelligent Tutoring Systems: Proceedings of the International Conference on Intelligent Tutoring Systems; Springer: Cham, Switzerland, 2024; pp. 107–123. [Google Scholar] [CrossRef]
  27. Jiang, Z.H.; Yu, W.; Zhou, D.; Chen, Y.; Feng, J.; Yan, S. Convbert: Improving bert with span-based dynamic convolution. Adv. Neural Inf. Process. Syst. 2020, 33, 12837–12848. Available online: https://arxiv.org/pdf/2008.02496 (accessed on 8 March 2025).
  28. Liu, Y.; Yang, Y.; Chen, X.; Shen, J.; Zhang, H.; Yu, Y. Improving knowledge tracing via pre-training question embeddings. arXiv 2020, arXiv:2012.05031. Available online: https://arxiv.org/pdf/2012.05031 (accessed on 8 March 2025).
  29. Ghosh, A.; Raspat, J.; Lan, A. Option tracing: Beyond correctness analysis in knowledge tracing. In Artificial Intelligence in Education, Proceedings of the International Conference, Utrecht, The Netherlands, 14–18 June 2021; Springer: Cham, Switzerland, 2021; pp. 137–149. [Google Scholar] [CrossRef]
  30. Huang, T.; Liang, M.; Yang, H.; Li, Z.; Yu, T.; Hu, S. Context-Aware Knowledge Tracing Integrated with the Exercise Representation and Association in Mathematics. In Proceedings of the 14th International Conference on Educational Data Mining (EDM 2021), Virtual, 29 June–2 July 2021; Available online: https://files.eric.ed.gov/fulltext/ED615528.pdf (accessed on 8 March 2025).
  31. Liu, Z.; Liu, Q.; Chen, J.; Huang, S.; Luo, W. simpleKT: A simple but tough-to-beat baseline for knowledge tracing. arXiv 2023, arXiv:2302.06881. Available online: https://arxiv.org/pdf/2302.06881 (accessed on 8 March 2025).
  32. Liu, Z.; Liu, Q.; Chen, J.; Huang, S.; Tang, J.; Luo, W. pyKT: A python library to benchmark deep learning based knowledge tracing models. Adv. Neural Inf. Process. Syst. 2022, 35, 18542–18555. Available online: https://papers.nips.cc/paper_files/paper/2022/file/75ca2b23d9794f02a92449af65a57556-Paper-Datasets_and_Benchmarks.pdf (accessed on 8 March 2025).
  33. Li, H.; Yu, J.; Ouyang, Y.; Liu, Z.; Rong, W.; Li, J.; Xiong, Z. Explainable few-shot knowledge tracing. arXiv 2024, arXiv:2405.14391. Available online: https://arxiv.org/pdf/2405.14391 (accessed on 8 March 2025).
  34. Neshaei, S.P.; Davis, R.L.; Hazimeh, A.; Lazarevski, B.; Dillenbourg, P.; Käser, T. Towards Modeling Learner Performance with Large Language Models. arXiv 2024, arXiv:2403.14661. Available online: https://arxiv.org/pdf/2403.14661 (accessed on 8 March 2025).
  35. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. Available online: https://arxiv.org/pdf/2004.05150 (accessed on 8 March 2025).
  36. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. Available online: https://arxiv.org/pdf/1810.04805 (accessed on 8 March 2025).
  37. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. Available online: https://arxiv.org/pdf/1909.11942 (accessed on 8 March 2025).
  38. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. Available online: https://arxiv.org/pdf/1907.11692 (accessed on 8 March 2025).
  39. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf (accessed on 8 March 2025).
  40. Zhu, P.; Zhang, Z.; Zhao, H.; Li, X. DUMA: Reading comprehension with transposition thinking. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 269–279. Available online: https://arxiv.org/pdf/2001.09415 (accessed on 8 March 2025). [CrossRef]
  41. Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; Zhou, X. Dual co-matching network for multi-choice reading comprehension. arXiv 2019, arXiv:1901.09381. Available online: https://arxiv.org/pdf/1901.09381 (accessed on 8 March 2025). [CrossRef]
  42. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. Race: Large-scale reading comprehension dataset from examinations. arXiv 2017, arXiv:1704.04683. Available online: https://aclanthology.org/D17-1082/ (accessed on 8 March 2025).
  43. Liu, Z.; Liu, Q.; Chen, J.; Huang, S.; Gao, B.; Luo, W.; Weng, J. Enhancing deep knowledge tracing with auxiliary tasks. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 4178–4187. [Google Scholar] [CrossRef]
Figure 1. An example of knowledge tracing. The student’s answer and the result for e_5 are not predetermined in advance. The objective of knowledge tracking is to predict the student’s result for e_5, denoted by the red question mark.
Figure 2. The framework of exercise semantic embedding for knowledge tracing (ESKT).
Figure 3. Attention metric visualization.
Table 3. This table shows the comparative experimental results with SOTA methods. The best results are shown in bold.
Model | AUC | ACC | SCAUC | SCACC | ECAUC | ECACC
SAINT | 0.7428 | 0.6949 | 0.6782 | 0.6681 | 0.6113 | 0.6275
AKT | 0.7485 | 0.6987 | 0.6911 | 0.6756 | 0.6254 | 0.6343
ESKT | 0.7714 | 0.7142 | 0.7377 | 0.7053 | 0.6955 | 0.6771
Table 4. This table shows the ablation experiments of ESKT, and the baseline model is the AKT. The best results are presented in bold, and the underlined numbers indicate the second-best performance.
Model | AUC | ACC | SCAUC | SCACC | ECAUC | ECACC
ESKT-PAM (AKT) | 0.7485 | 0.6987 | 0.6911 | 0.6756 | 0.6254 | 0.6343
ESKT-AM | 0.7645 | 0.7096 | 0.7303 | 0.7004 | 0.6886 | 0.6735
ESKT-M | 0.7678 | 0.7115 | 0.7332 | 0.7027 | 0.6947 | 0.6770
ESKT-A | 0.7683 | 0.7114 | 0.7361 | 0.7022 | 0.6942 | 0.6738
ESKT | 0.7714 | 0.7142 | 0.7377 | 0.7053 | 0.6955 | 0.6771
Table 5. This table shows the ablation experiments of ESKT, and the baseline model is SAINT. The best results are shown in bold and the underlined numbers indicate the second-best performance.
Model | AUC | ACC | SCAUC | SCACC | ECAUC | ECACC
ESSAINT-PAM | 0.7428 | 0.6949 | 0.6782 | 0.6681 | 0.6113 | 0.6275
ESSAINT-AM | 0.7610 | 0.7077 | 0.7219 | 0.6971 | 0.6820 | 0.6712
ESSAINT-M | 0.7625 | 0.7079 | 0.7254 | 0.6984 | 0.6897 | 0.6743
ESSAINT-A | 0.7641 | 0.7102 | 0.7261 | 0.6995 | 0.6926 | 0.6742
ESSAINT | 0.7665 | 0.7121 | 0.7275 | 0.7206 | 0.6943 | 0.6758
Table 6. This table shows the comparative experimental results on the Zhixue dataset when training with auxiliary tasks or without auxiliary tasks.

Model        | AUC    | ACC    | SC AUC | SC ACC | EC AUC | EC ACC
ESKT-AM      | 0.7645 | 0.7096 | 0.7303 | 0.7004 | 0.6886 | 0.6735
ESKT-AM+T    | 0.7616 | 0.7086 | 0.7282 | 0.7044 | 0.6909 | 0.6807
ESKT         | 0.7714 | 0.7142 | 0.7377 | 0.7053 | 0.6955 | 0.6771
ESKT+T       | 0.7702 | 0.7132 | 0.7356 | 0.7061 | 0.6963 | 0.6775
ESSAINT-AM   | 0.7610 | 0.7077 | 0.7219 | 0.6971 | 0.6820 | 0.6712
ESSAINT-AM+T | 0.7605 | 0.7061 | 0.7232 | 0.6958 | 0.6902 | 0.6715
ESSAINT      | 0.7665 | 0.7121 | 0.7275 | 0.7206 | 0.6943 | 0.6758
ESSAINT+T    | 0.7655 | 0.7114 | 0.7293 | 0.7217 | 0.6955 | 0.6772
Table 7. The performance of different λ in ESKT+T.

λ        | AUC    | ACC    | SC AUC | SC ACC | EC AUC | EC ACC
0        | 0.7714 | 0.7142 | 0.7377 | 0.7053 | 0.6955 | 0.6771
1 × 10⁻² | 0.7702 | 0.7132 | 0.7356 | 0.7061 | 0.6963 | 0.6775
1 × 10⁻³ | 0.7707 | 0.7135 | 0.7360 | 0.7054 | 0.6957 | 0.6771
1 × 10⁻¹ | 0.7687 | 0.7128 | 0.7354 | 0.7051 | 0.6954 | 0.6765
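Table 7 varies λ, the weight given to the auxiliary-task loss in the "+T" variants; λ = 0 recovers plain ESKT. The following sketch shows the assumed form of the combined objective, L = L_KT + λ·L_aux, with binary cross-entropy for both terms. The `bce`/`total_loss` helpers and the sample prediction pairs are illustrative assumptions, not the paper’s exact loss implementation:

```python
import math

def bce(y, p):
    """Binary cross-entropy for a single (label, probability) pair."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def total_loss(kt_pairs, aux_pairs, lam):
    """Assumed combined objective: L = L_KT + lambda * L_aux."""
    l_kt = sum(bce(y, p) for y, p in kt_pairs) / len(kt_pairs)
    l_aux = sum(bce(y, p) for y, p in aux_pairs) / len(aux_pairs)
    return l_kt + lam * l_aux

kt = [(1, 0.8), (0, 0.2)]   # main knowledge-tracing predictions
aux = [(1, 0.7), (0, 0.4)]  # auxiliary-task predictions
print(total_loss(kt, aux, 0.0))   # lambda = 0: auxiliary term vanishes
print(total_loss(kt, aux, 1e-2))  # small lambda, as in Table 7
```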
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cheng, Z.; Li, J. Exercise Semantic Embedding for Knowledge Tracking in Open Domain. Information 2025, 16, 302. https://doi.org/10.3390/info16040302


