Article

Online Learning Strategy Induction through Partially Observable Markov Decision Process-Based Cognitive Experience Model

1 Department of Automation, Xiamen University, Xiamen 361000, China
2 School of Computer Science, Minnan Normal University, Zhangzhou 363000, China
3 Key Laboratory of Data Science and Intelligent Applications, Fujian Province University, Zhangzhou 363000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3858; https://doi.org/10.3390/electronics13193858
Submission received: 22 August 2024 / Revised: 20 September 2024 / Accepted: 26 September 2024 / Published: 29 September 2024
(This article belongs to the Special Issue Data-Driven Intelligence in Autonomous Systems)

Abstract: Inducing learning strategies is a crucial component of intelligent tutoring systems. Previous research has predominantly focused on the induction of offline learning strategies. Although existing offline induction methods can also be used to update learning strategies in real time, their update efficiency is low, making it difficult to capture, in a timely manner, the characteristics that learners exhibit during the learning process. Leveraging the modeling capability of the Partially Observable Markov Decision Process (POMDP), this paper proposes a POMDP-based cognitive experience model (PCEM), which can be updated quickly during interactions and enables the real-time induction of learning strategies by weighting the learning experiences of different learners. Experimental results demonstrate that the learning strategies induced by PCEM are more personalized and exhibit superior performance.

1. Introduction

Intelligent tutoring systems (ITSs) integrate computer science and educational principles to provide personalized support for learners' cognitive processes as they engage in knowledge concept learning [1]. These systems utilize artificial intelligence technology and educational models to offer customized learning paths, personalized feedback, and support in order to optimize learning outcomes [2,3]. The tutoring methods employed by an ITS vary with the type of learning resources. Figure 1 illustrates one such method, in which the system and the learner engage in a repetitive cycle: (a) the system chooses a question, (b) the learner answers the question, and (c) the learner checks the answer against the provided solution and reads the explanation. This repetitive cycle is known as the exercise-based cognitive process and is the primary focus of this article. In this process, the key to improving the learner's knowledge state lies in selecting appropriate exercises from a vast pool of options; the rule for making this selection is referred to as the learning strategy. Efficiently inducing the optimal learning strategy is the focus of this study.
Studies on optimizing learners' learning strategies fall primarily into two categories: model-free and model-based learning strategy optimization. Studies on model-free learning strategy optimization mainly utilize techniques based on reinforcement learning (RL) [4,5,6,7] and its extensions, such as hierarchical reinforcement learning (HRL) [8,9] and deep reinforcement learning (DRL) [10,11,12,13,14]. However, these methods do not explicitly represent how learners understand knowledge concepts during the cognitive process; that is, they do not model learners' cognitive processes. Consequently, the resulting learning strategies may lack explainability, which contradicts the principles of evidence-based learning [15].
In contrast, the model-based methods mainly utilize techniques based on a partially observable Markov decision process (POMDP) [16]. Early related studies primarily explored the feasibility of POMDP in cognitive modeling and learning strategy induction [17]. Subsequent studies improved modeling methods and strategy induction techniques, emphasizing the personalization of models and strategies [18,19,20,21]. Additionally, the online application of model-based methods also deserves attention.
Unlike online planning in POMDPs, which is required because the planning space is excessively large, the online methods required by an ITS focus more on the real-time update of cognitive models and learning strategies. Specifically, a learner's answers serve not only as the basis for the next decision but also as new information for updating the cognitive model. In other words, in this scenario, both the model and the corresponding optimal strategy are dynamic. However, achieving this requires addressing the following difficulties.
  • Generally, a large amount of computation is required to complete a model update, which is time-consuming and runs counter to the real-time updates required.
  • The information provided by a single answer is very limited and is easily diluted by other historical records.
The first issue arises because traditional methods require all historical data to train the model and ensure its accuracy. In real-time interaction, this would force learners to collectively wait for model updates. Using all historical data compensates for the missing information in individual data points; alternatively, although an individual learner's answer records may be limited, they closely reflect that learner's personal cognitive experience. We refer to the POMDP model trained solely on an individual learner's answer records as the POMDP-based cognitive experience model (PCEM). For a specific temporal answer sequence, which represents an advisee, each cognitive experience model can be regarded as an adviser, and the final learning strategy is derived by synthesizing the suggestions of these advisers. Considering the differences in cognitive ability among learners, each adviser's suggestion should not carry the same weight in the final learning strategy; instead, the weight should be determined by the similarity in cognitive ability between the adviser and the advisee, which precisely addresses the second issue mentioned above.
Our main contributions are summarized as follows:
  • To achieve personalized online updates of the model and strategy, we propose the POMDP-based cognitive experience model (PCEM). Unlike models that aim to accurately reflect learners' cognitive abilities, the cognitive experience model emphasizes learners' subjective experiences in the cognitive process. Additionally, we introduce an adviser–advisee mechanism to derive learning strategies from the PCEM.
  • This paper proposes new methods for parameter learning and strategy solving for the PCEM. The parameter learning draws on the ideas of HCPM, with appropriate modifications based on the characteristics of the PCEM. The strategy-solving method incorporates the similarity between cognitive experiences to improve performance.
  • We demonstrate the performance of PCEM in the real-time update of learning strategies in several real-world knowledge concept learning domains.
The remaining sections of this article are organized as follows. Section 2 reviews related work on learning strategy induction. Section 3.1 presents the background knowledge on HCPM, Section 3.2 defines the PCEM, and Section 3.3 and Section 3.4 provide a detailed exposition of its parameter learning and strategy induction methods. Section 4 presents the experimental results and demonstrates the performance of PCEM. Section 5 summarizes this work and discusses future research.

2. Related Works

Obtaining learning/teaching strategies through acquiring or constructing cognitive models can be challenging and expensive. Therefore, researchers have embraced model-free methods for inducing such strategies, which derive them directly from learners' answer data without relying on precise cognitive process modeling. Tang et al. [5] proposed an RL-based method for personalized learning that adaptively selects learning materials and optimizes recommendation strategies under uncertain information. Zhou et al. [9] introduced and applied an offline, off-policy HRL framework based on Gaussian processes, demonstrating through experiments that it is more effective than general RL in guiding teaching strategies. Ju et al. [11] proposed critical-RL to identify critical pedagogical decisions and, through classroom studies, found that, for certain learners, critical strategies at key decision points were significantly more effective than random strategies. Huang et al. [12] introduced the deep reinforcement learning framework for exercise recommendation (DRE), utilizing two Exercise Q-Networks (EQNM and EQNR) and three domain-specific rewards, and showed superior recommendation performance on real datasets.
Zhou et al. [6] demonstrated the improvement of learner–system interaction by combining RL with human-written explanations, with experiments showing that personalized RL decisions paired with explanations are more effective than RL decisions or explanations alone. Ausin et al. [13,14] showed how combining data-driven methods (such as DRL) with educational strategies, along with simple explanations, enhanced learners' interaction with ITSs and their learning performance, revealing the effectiveness of problem-level decisions for learners. Their team also developed the InferNet method to address the temporal delayed-reward problem, showing its effectiveness in inferring true immediate rewards and significantly improving learning outcomes in simulated tasks and empirical studies compared to immediate rewards and previous methods. Kubotani et al. [7] proposed an RL-based teaching framework that optimizes teaching strategies through internal modeling of learners, even with limited learning history, and validated its effectiveness through mathematical model experiments.
Model-based learning strategy induction methods help optimize and develop learning/teaching strategies by constructing and using cognitive models. These methods offer theoretical support and data-driven personalized learning paths, improving learning outcomes; however, building and maintaining accurate cognitive models is costly. Considering that making learning/teaching decisions involves reasoning and balancing multiple priorities, Rafferty et al. [17] addressed this by developing a POMDP planning framework. This framework provided a method for selecting learning/teaching actions based on learning models, domain structure, and teaching goals. It demonstrated the ability to select actions in real time in moderately sized domains and showed that, despite mismatches between learner models and actual learners, POMDP strategies could accelerate learning in concept learning tasks, outperforming alternative methods. Ramachandran et al. [18] designed the Assistive Tutor POMDP (AT-POMDP) to provide personalized support for learners practicing difficult math concepts over multiple tutoring sessions and demonstrated its effectiveness in significantly enhancing learning gains.
Nioche et al. [19] proposed a modular framework combining online inference for each user and item with online planning under learning-time constraints, expanding model-based tutoring system research. This framework showed effectiveness with both simulated and real learners, particularly in adapting to changes in learning ability and item difficulty. Gao et al. [20,21] used an ITS as a platform to introduce a new strategy induction method and practice-based cognitive modeling, adopting a POMDP model to optimize learning strategies and refining them using information-entropy techniques. This approach aimed to explain learners' performance on knowledge concepts of interest and provide personalized learning strategies. Recognizing that personalized modeling is more suitable than traditional methods for tutoring individual learners on knowledge concepts, the team also introduced the homomorphic POMDP (H-POMDP) model and a new cognitive modeling approach to induce learning strategies for individual learners, demonstrating its performance in multiple knowledge concept learning domains.

3. POMDP-Based Cognitive Experience Modeling and Online Personalized Learning Planning

In this section, we provide an overview of the background knowledge pertaining to HCPM [21], which lays the foundation for the development of PCEM. Following that, we delineate the definition of PCEM and introduce the associated method for updating the real-time model and learning strategy.
To facilitate subsequent explanations, we first define the relevant symbols. In the exercise-based cognitive process, learners are denoted by $Le$, exercises by $Ex$, and knowledge concepts by $Kc$, with their respective quantities represented by $L$, $M$, and $N$. A learner's attempt at a question is considered a learning action, denoted as $a$, and whether the answer is correct or incorrect is denoted as $o$. Thus, the cognitive process of the $i$th learner can be represented by the answer sequence $h_i = ((a_{i,1}, o_{i,1}), (a_{i,2}, o_{i,2}), \ldots, (a_{i,T_i}, o_{i,T_i}))$, where $T_i$ is the number of attempts made by the $i$th learner. The cognitive processes of all learners can be represented as $H = \{h_1, h_2, \ldots, h_L\}$.
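As a minimal illustration of this notation (with made-up exercise indices), the following Python sketch shows one way the histories could be stored and split into the per-learner action and observation sequences used later in Section 3.3; the variable names are assumptions for illustration only.

```python
import numpy as np

# One answer record (a, o): an exercise index a and a binary correctness flag o.
h_1 = [(3, 0), (3, 1), (7, 1)]             # learner Le_1 made T_1 = 3 attempts
h_2 = [(1, 1), (4, 0), (4, 1), (9, 1)]     # learner Le_2 made T_2 = 4 attempts
H = [h_1, h_2]                             # H = {h_1, ..., h_L}

# Split each history into its action sequence A_i and observation sequence O_i,
# the form consumed by the parameter-learning procedure in Section 3.3.
A_seqs = [np.array([a for a, _ in h]) for h in H]
O_seqs = [np.array([o for _, o in h]) for h in H]
```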

3.1. Background Knowledge

Figure 2 shows a four-step POMDP model representing the cognitive process of a given learner.
Given a knowledge learning domain, a POMDP-based cognitive model is represented as a 6-tuple $(S, A, \Omega, T, O, R)$, where $S$ is a set of knowledge states, $A$ is a set of learning actions, $\Omega$ is a set of observations related to the learner's performance on learning actions, $T$ is a state transition function that describes changes in knowledge state caused by learning actions, $O$ is an observation function that describes the possible performance of a learner on a specific learning action given a specific knowledge state, and $R$ is a reward function customized according to specific learning goals.
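For readers who prefer code, such a tabular cognitive model can be held in a handful of arrays. The sketch below merely illustrates the 6-tuple under an assumed integer coding of states, actions, and observations (the initial-state distribution $D$ used in Section 3.3 is included as well); it is not the authors' implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CognitivePOMDP:
    """Tabular representation of (S, A, Omega, T, O, R) with n_s knowledge states,
    n_a learning actions, and n_o observations (all integer-coded)."""
    T: np.ndarray  # T[s, a, s']: probability the knowledge state moves from s to s' under action a
    O: np.ndarray  # O[s, a, o]: probability of observing performance o for action a in state s
    R: np.ndarray  # R[s, a]: reward encoding the learning goal
    D: np.ndarray  # D[s]: initial knowledge-state distribution (used in Section 3.3)
```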
Compared with POMDP-based cognitive models that assume the same cognitive ability for all learners, HCPM can differentiate between the learning abilities of different individuals, thereby helping to achieve personalized learning strategy optimization. HCPM comprises several POMDP-based cognitive models $\{M_1, M_2, \ldots, M_K\}$ that satisfy the following properties $(\Phi, \Psi)$:
$$\Phi: \{S_1 = S_2 = \cdots = S_K = S,\quad A_1 = A_2 = \cdots = A_K = A,\quad \Omega_1 = \Omega_2 = \cdots = \Omega_K = \Omega\}$$
$$\Psi: \{T_1 \neq T_2 \neq \cdots \neq T_K,\quad O_1 = O_2 = \cdots = O_K = O,\quad R_1 = R_2 = \cdots = R_K = R\}$$
These properties enable the HCPM to represent various performances among individuals with different cognitive abilities in the same cognitive task.
This article presents the novel PCEM, evolved from HCPM. It shifts away from precisely describing the cognitive process, emphasizing rapid updates and documentation of learners’ experiences for real-time strategy optimization.

3.2. PCEM Specification

The number of POMDPs $K$ included in HCPM needs to be determined. Noting that learners' learning abilities vary to some extent, $K$ can be set to the number of learners in the data, so that each trained POMDP reflects the learning experience of one learner. Thus, we can define PCEM as follows.
Definition 1. 
Given the number of learners $L$, each learner is modeled by a POMDP, giving $\{M_1, M_2, \ldots, M_L\}$, and these POMDPs satisfy the following properties $(\Phi, \Psi)$. We refer to these $L$ POMDPs as the PCEM model $M^E = \{M_1, M_2, \ldots, M_L\}$.
$$\Phi: \{S_1 = S_2 = \cdots = S_L = S,\quad A_1 = A_2 = \cdots = A_L = A,\quad \Omega_1 = \Omega_2 = \cdots = \Omega_L = \Omega\}$$
$$\Psi: \{T_1 \neq T_2 \neq \cdots \neq T_L,\quad O_1 = O_2 = \cdots = O_L = O,\quad R_1 = R_2 = \cdots = R_L = R\}$$
Similarly to HCPM, all $L$ POMDPs in PCEM share the same state space $S$, action space $A$, and observation space $\Omega$, and have identical observation functions $O$ and reward functions $R$, but they have different transition functions $T_1, T_2, \ldots, T_L$. Unlike HCPM, each POMDP in PCEM reflects only the cognitive experience of an individual learner. Even if the answer sequence used to train a particular POMDP is sufficiently long, it is difficult for that POMDP to approximate the learner's true cognitive model. Specifically, a learner's cognitive process is often irreversible: once a learner has mastered certain knowledge concepts, it is difficult for the learner to forget them during continuous learning. Therefore, even relatively long answer sequences often contain only a small amount of information reflecting changes in the learner's knowledge state, while containing a large amount of information about how the learner maintains that knowledge state in the later stages.
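A PCEM can therefore be stored as $L$ learner-specific initial distributions and transition functions alongside a single shared observation function. The following sketch shows one plausible random initialization under the array shapes sketched above; the helper name and initialization scheme are assumptions for illustration.

```python
import numpy as np

def init_pcem(n_learners, n_s, n_a, n_o, rng=None):
    """Randomly initialize a PCEM: per-learner D_l and T_l, one shared O."""
    rng = np.random.default_rng() if rng is None else rng

    def random_stochastic(*shape):
        x = rng.random(shape)
        return x / x.sum(axis=-1, keepdims=True)   # normalize over the last axis

    D_list = [random_stochastic(n_s) for _ in range(n_learners)]            # D_l(s)
    T_list = [random_stochastic(n_s, n_a, n_s) for _ in range(n_learners)]  # T_l(s, a, s')
    O = random_stochastic(n_s, n_a, n_o)                                    # shared O(s, a, o)
    return D_list, T_list, O
```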

3.3. PCEM Parameter Learning

Given a set of learning activity data $H = \{h_1, h_2, \ldots, h_L\}$, where each data point $h_i = ((a_{i,1}, o_{i,1}), (a_{i,2}, o_{i,2}), \ldots, (a_{i,T_i}, o_{i,T_i}))$ represents the learner $Le_i$, the objective is to learn the cognitive experience model $M_l = (D_l, T_l, O)$ ($D_l$ is the distribution of the initial state) for each learner. We divide $H$ into observation sequences $O = \{O_1, O_2, \ldots, O_L\}$ ($O_i$ is the observation sequence of learner $Le_i$) and action sequences $A = \{A_1, A_2, \ldots, A_L\}$ ($A_i$ is the action sequence of learner $Le_i$), with the corresponding state sequences being $S = \{S_1, S_2, \ldots, S_L\}$ ($S_i$ is the state sequence of learner $Le_i$). Then, the probability of all observation sequences $O$ is given by
$$P(O \mid M^E, A) = \sum_{S} P(O \mid S, M^E, A)\, P(S \mid M^E, A)$$
where P ( · ) and P ( · | · ) represent the probability function and conditional probability function, respectively. The parameter learning can be achieved through the EM algorithm.
  • Determine the log-likelihood function of the complete data.
The complete data are
$$(A, O, S) = (a_{1,1}, a_{1,2}, \ldots, a_{1,T_1}, \ldots, a_{L,1}, a_{L,2}, \ldots, a_{L,T_L},\; o_{1,1}, o_{1,2}, \ldots, o_{1,T_1}, \ldots, o_{L,1}, o_{L,2}, \ldots, o_{L,T_L},\; s_{1,1}, s_{1,2}, \ldots, s_{1,T_1}, \ldots, s_{L,1}, s_{L,2}, \ldots, s_{L,T_L})$$
and the log-likelihood function of the complete data is $\log P(O, S \mid M^E, A)$.
  • E-step of the EM algorithm: calculate the Q function $Q(M^E, \overline{M}^E)$,
$$Q(M^E, \overline{M}^E) = \sum_{S} \log P(O, S \mid M^E, A)\, P(S \mid O, \overline{M}^E, A) = \sum_{S} \log P(O, S \mid M^E, A)\, \frac{P(O, S \mid \overline{M}^E, A)}{P(O \mid \overline{M}^E, A)}$$
where $\overline{M}^E$ denotes the current estimates of the parameters, $M^E$ denotes the parameters to be maximized, and $P(O \mid \overline{M}^E, A)$ is a constant that can be ignored.
$$P(O, S \mid M^E, A) = \prod_{i=1}^{L} \left[ D_i(s_{i,1}) \prod_{t=1}^{T_i} O(s_{i,t}, a_{i,t}, o_{i,t})\, T_i(s_{i,t}, a_{i,t}, s_{i,t+1}) \right]$$
Thus, the function $Q(M^E, \overline{M}^E)$ can be expressed as
$$Q(M^E, \overline{M}^E) = \sum_{i=1}^{L} \sum_{S_i} \log D_i(s_{i,1})\, P(O_i, S_i \mid \overline{M}_i, A_i) + \sum_{i=1}^{L} \sum_{S_i} \left( \sum_{t=1}^{T_i} \log T_i(s_{i,t}, a_{i,t}, s_{i,t+1}) \right) P(O_i, S_i \mid \overline{M}_i, A_i) + \sum_{i=1}^{L} \sum_{S_i} \left( \sum_{t=1}^{T_i} \log O(s_{i,t}, a_{i,t}, o_{i,t}) \right) P(O_i, S_i \mid \overline{M}_i, A_i) \tag{1}$$
  • M-step of the EM algorithm: maximize the Q-function Q ( M E , M E ¯ ) to estimate the model parameters.
Since the parameters to be maximized appear separately in ( 2 L + 1 ) terms in Equation (1), it is only necessary to maximize each term individually.
(a)
The first L terms of Equation (1) can be expressed as
$$\sum_{S_i} \log D_i(s_{i,1})\, P(O_i, S_i \mid \overline{M}_i, A_i) = \sum_{s \in S} \log D_i(s)\, P(O_i, s_{i,1} = s \mid \overline{M}_i, A_i)$$
Noting that $D_i(s)$ satisfies the constraint $\sum_{s \in S} D_i(s) = 1$, we use Lagrange multipliers and express the Lagrangian as
$$\sum_{s \in S} \log D_i(s)\, P(O_i, s_{i,1} = s \mid \overline{M}_i, A_i) + \gamma \left( \sum_{s \in S} D_i(s) - 1 \right)$$
Taking the partial derivative with respect to $D_i(s)$ and setting the result to 0,
$$\frac{\partial}{\partial D_i(s)} \left[ \sum_{s \in S} \log D_i(s)\, P(O_i, s_{i,1} = s \mid \overline{M}_i, A_i) + \gamma \left( \sum_{s \in S} D_i(s) - 1 \right) \right] = 0$$
we obtain
$$P(O_i, s_{i,1} = s \mid \overline{M}_i, A_i) + \gamma D_i(s) = 0 \tag{2}$$
Summing over $s$ yields $\gamma$:
$$\gamma = -P(O_i \mid \overline{M}_i, A_i)$$
Substituting this into Equation (2), we obtain $D_i(s)$:
$$D_i(s) = \frac{P(O_i, s_{i,1} = s \mid \overline{M}_i, A_i)}{P(O_i \mid \overline{M}_i, A_i)} \tag{3}$$
(b)
The $(L+1)$th to $(2L)$th terms of Equation (1) can be expressed as
$$\sum_{S_i} \left( \sum_{t=1}^{T_i} \log T_i(s_{i,t}, a_{i,t}, s_{i,t+1}) \right) P(O_i, S_i \mid \overline{M}_i, A_i) = \sum_{s, s' \in S} \sum_{t=1}^{T_i} \log T_i(s, a_{i,t}, s')\, P(O_i, s_{i,t} = s, s_{i,t+1} = s' \mid \overline{M}_i, A_i)$$
Similarly to the first $L$ terms, we apply Lagrange multipliers with the constraint $\sum_{s' \in S} T_i(s, a, s') = 1$; the partial derivative of $\log T_i(s_{i,t}, a_{i,t}, s_{i,t+1})$ with respect to $T_i(s, a, s')$ is nonzero only when $a_{i,t} = a$, which is captured by the indicator $I(a_{i,t} = a)$. This yields
$$T_i(s, a, s') = \frac{\sum_{t=1}^{T_i} P(O_i, s_{i,t} = s, s_{i,t+1} = s' \mid \overline{M}_i, A_i)\, I(a_{i,t} = a)}{\sum_{t=1}^{T_i} P(O_i, s_{i,t} = s \mid \overline{M}_i, A_i)\, I(a_{i,t} = a)} \tag{4}$$
$$I(\cdot) = \begin{cases} 1, & \text{if the condition is true} \\ 0, & \text{if the condition is false} \end{cases}$$
where $I(\cdot)$ is an indicator function.
(c)
The $(2L+1)$th term can be expressed as
$$\sum_{S} \left( \sum_{i=1}^{L} \sum_{t=1}^{T_i} \log O(s_{i,t}, a_{i,t}, o_{i,t}) \right) P(O_i, S_i \mid \overline{M}_i, A_i) = \sum_{s \in S} \sum_{i=1}^{L} \sum_{t=1}^{T_i} \log O(s, a_{i,t}, o_{i,t})\, P(O_i, s_{i,t} = s \mid \overline{M}_i, A_i)$$
Similarly, we use Lagrange multipliers with the constraint $\sum_{o \in \Omega} O(s, a, o) = 1$. Note that the partial derivative of $\log O(s_{i,t}, a_{i,t}, o_{i,t})$ with respect to $O(s, a, o)$ is nonzero only when $a_{i,t} = a$ and $o_{i,t} = o$, denoted by $I(a_{i,t} = a)$ and $I(o_{i,t} = o)$. This yields
$$O(s, a, o) = \frac{\sum_{i=1}^{L} \sum_{t=1}^{T_i} P(O_i, s_{i,t} = s \mid \overline{M}_i, A_i)\, I(a_{i,t} = a)\, I(o_{i,t} = o)}{\sum_{i=1}^{L} \sum_{t=1}^{T_i} P(O_i, s_{i,t} = s \mid \overline{M}_i, A_i)\, I(a_{i,t} = a)} \tag{5}$$
As shown in Figure 3, for each sequence, there is a corresponding POMDP (Figure 3a). We first initialize the parameters of these L POMDPs (Figure 3b). We extract the observation sequence and action sequence for each sequence to facilitate subsequent parameter updates based on the update formulas (Figure 3c). We repeatedly iterate and update the parameters according to Equations (3)–(5) until the parameter values converge (e.g., the difference between the parameter values of the last two updates does not exceed a certain threshold) (Figure 3d,e).
We summarize the above process in Algorithm 1. The algorithm generates the state space $S$, action space $A$, and observation space $\Omega$ based on the specific problem to be solved (line 1), and randomly initializes the parameters $M^E$ (line 2). Then, given the available learning records, the algorithm updates the parameters according to Equations (3)–(5) (lines 3–10) until the termination condition is met (e.g., the error between the parameters updated in the last two iterations is within a certain threshold).
Algorithm 1 Parameter Learning of PCEM
Input: The observation sequences $O$, the action sequences $A$, the termination condition
Output: The initial state distributions $D_i$, the transition functions $T_i$, and the observation function $O$ ($i = 1, 2, \ldots, L$)
 1: Generate $S$, $A$, and $\Omega$.
 2: Randomly initialize $M^E$.
 3: while the termination condition is not met do
 4:     for $i \in \{1, 2, \ldots, L\}$ do
 5:         for $s \in S$ do
 6:             Calculate $D_i(s)$ according to Equation (3)
 7:         for $s, s' \in S$, $a \in A$ do
 8:             Calculate $T_i(s, a, s')$ according to Equation (4)
 9:     for $s \in S$, $o \in \Omega$, $a \in A$ do
10:         Calculate $O(s, a, o)$ according to Equation (5)
11: return $D_i$, $T_i$, $O$ ($i = 1, 2, \ldots, L$)
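To make Algorithm 1 concrete, the following Python sketch implements one EM iteration under the tabular representation used above: a standard forward–backward pass per learner supplies the posteriors $P(O_i, s_{i,t} = s \mid \overline{M}_i, A_i)$ that appear in Equations (3)–(5). It is a simplified illustration rather than the authors' code: probabilities are unscaled (log-space or scaling would be needed for long sequences), one hidden state is aligned with each answered exercise, and all names are assumptions.

```python
import numpy as np

def forward_backward(D, T, O, actions, obs):
    """E-step for one learner: posteriors over hidden knowledge states.

    actions, obs : integer numpy arrays of length Ti (a_{i,t}, o_{i,t})
    Returns gamma[t, s] = P(s_t = s | h_i), xi[t, s, s'] = P(s_t = s, s_{t+1} = s' | h_i)
    for t = 0..Ti-2, and the sequence likelihood P(O_i | M_i, A_i).
    """
    Ti, n_s = len(actions), D.shape[0]
    alpha = np.zeros((Ti, n_s))
    beta = np.ones((Ti, n_s))
    alpha[0] = D * O[:, actions[0], obs[0]]
    for t in range(1, Ti):
        alpha[t] = (alpha[t - 1] @ T[:, actions[t - 1], :]) * O[:, actions[t], obs[t]]
    for t in range(Ti - 2, -1, -1):
        beta[t] = T[:, actions[t], :] @ (O[:, actions[t + 1], obs[t + 1]] * beta[t + 1])
    like = alpha[-1].sum()                      # P(O_i | M_i, A_i)
    gamma = alpha * beta / like
    xi = np.zeros((Ti - 1, n_s, n_s))
    for t in range(Ti - 1):
        xi[t] = (alpha[t][:, None] * T[:, actions[t], :]
                 * (O[:, actions[t + 1], obs[t + 1]] * beta[t + 1])[None, :]) / like
    return gamma, xi, like

def em_step(D_list, T_list, O, data):
    """One EM iteration of PCEM parameter learning (Equations (3)-(5) in spirit).

    data[i] = (actions_i, obs_i); D_list[i] and T_list[i] are learner-specific
    (updated in place), while O is shared across all learners.
    """
    n_s, n_a, n_o = O.shape
    O_num = np.zeros_like(O)
    O_den = np.zeros((n_s, n_a))
    loglik = 0.0                                   # can be monitored for termination
    for i, (actions, obs) in enumerate(data):
        gamma, xi, like = forward_backward(D_list[i], T_list[i], O, actions, obs)
        loglik += np.log(like)
        D_list[i] = gamma[0]                       # Equation (3)
        for a in range(n_a):
            mask = (actions[:-1] == a)             # I(a_{i,t} = a)
            num = xi[mask].sum(axis=0)             # joint posteriors, shape (n_s, n_s)
            den = gamma[:-1][mask].sum(axis=0)     # marginal posteriors, shape (n_s,)
            T_list[i][:, a, :] = np.where(den[:, None] > 0,
                                          num / np.clip(den[:, None], 1e-12, None),
                                          T_list[i][:, a, :])   # Equation (4)
        for t, (a, o) in enumerate(zip(actions, obs)):
            O_num[:, a, o] += gamma[t]
            O_den[:, a] += gamma[t]
    O = np.where(O_den[:, :, None] > 0,
                 O_num / np.clip(O_den[:, :, None], 1e-12, None), O)  # Equation (5)
    return D_list, T_list, O, loglik
```

Because each learner's $D_i$ and $T_i$ depend only on that learner's own sequence, the per-learner portion of the loop can run independently; only the shared $O$ requires aggregation across learners, which is the point exploited in the efficiency comparison below.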
  • Comparison of algorithm efficiency.
Comparing the parameter learning methods of HCPM and PCEM, the parameter learning method of HCPM consumes more time than that of PCEM primarily in two steps: (1) HCPM requires the computation of the membership degree of each time series to each POMDP, whereas in PCEM, these membership degrees are constants (since the time series and POMDPs correspond one-to-one) and do not require calculation; and (2) when calculating the parameters of the state transition function, HCPM uses each time series to (weighted) calculate the parameters of every POMDP, while PCEM only uses each time series to calculate the parameters of its corresponding single POMDP. Therefore, the difference in time complexity between the two can be expressed as follows:
$$\mathcal{O}_{\mathrm{HCPM}} - \mathcal{O}_{\mathrm{PCEM}} = O\!\left(K L |S| + |A| K L |S|^2 T\right)$$
where $\mathcal{O}$ denotes time complexity (distinct from the observation function $O$ mentioned earlier), $K$ is the number of POMDPs in HCPM, $L$ is the number of time series, $T$ represents the length of a time series (as the lengths of different time series may vary, a representative value of the same order of magnitude is used here), $|S|$ is the size of the state space, and $|A|$ is the size of the action space. The first term on the right-hand side represents the time complexity of calculating the membership degrees, while the second term represents the difference in time complexity between the two models when calculating the state transition functions. Additionally, the time complexity of the parameter learning methods for PCEM and PCPM is the same. However, in practical applications, since each learner is modeled individually (except for the observation function), the parameter updates for each model can be completed locally and independently, which significantly improves computational efficiency.

3.4. PCEM Online Planning

For PCEM, although it is possible to solve each of its POMDPs with planning methods to obtain a strategy corresponding to each learner, it should be noted that these POMDPs only reflect the past cognitive experiences of the corresponding learners. Therefore, such strategies often have difficulty providing correct tutoring for a cognitive process that requires the continuous acquisition of new knowledge.
Given the complexity of the cognitive process, we propose the adviser–advisee mechanism for real-time planning. The advisee is the learner tackling the upcoming exercise, while the other learners engaging with the system are termed advisers. For a particular knowledge concept that the advisee is about to learn, online planning consists of two steps: (a) the advisers put forward suggestions based on their respective cognitive experiences; and (b) the system comprehensively considers these suggestions and selects the next exercise for the advisee. The two steps are detailed as follows.
(a) Based on the advisee's current answer sequence, each adviser suggests its view of the optimal next exercise. This step can be implemented by dynamic programming [22]. The initial belief of each adviser $j$ is equal to the distribution of its initial state,
$$b_1(s_1) = D_j(s_1)$$
The belief is updated as follows:
$$b_t(s_t) = \frac{\sum_{s_{t-1} \in S} b_{t-1}(s_{t-1})\, O(s_{t-1}, a_{t-1}, o_{t-1})\, T_j(s_{t-1}, a_{t-1}, s_t)}{\sum_{s_{t-1}, s_t \in S} b_{t-1}(s_{t-1})\, O(s_{t-1}, a_{t-1}, o_{t-1})\, T_j(s_{t-1}, a_{t-1}, s_t)} \tag{6}$$
For any belief, the optimal strategy $\pi(b_t)$ chooses the action with the highest value:
$$\pi(b_t) = \arg\max_{a_t \in A} Q(b_t, a_t)$$
where the action value is calculated as
$$Q(b_t, a_t) = R(b_t, a_t) + \gamma \sum_{o_t \in \Omega} O(b_t, a_t, o_t)\, V_{t+1}(b_{t+1})$$
where the next belief b t + 1 is updated according to Equation (6).
The belief value $V_t(b_t)$ is the value of the action that maximizes it, which leads to the recursive computation in the dynamic program:
$$V_t(b_t) = \max_{a_t \in A} Q(b_t, a_t)$$
(b) After each adviser gives its estimate of the value of each action, the system comprehensively takes all of the advisers' value estimates into account. This requires considering the similarity between each adviser and the advisee, and computing a comprehensive value estimate for each action by weighting with that similarity. The similarity between learner $i$ (the advisee) and learner $j$ (an adviser) is calculated as follows:
$$w_{i,j} = \frac{P(O_i \mid M_j, A_i)}{\sum_{j=1}^{L} P(O_i \mid M_j, A_i)}$$
The comprehensive estimated value of the action is
$$Q^c(b_{i,t}, a_t) = \sum_{j=1}^{L} w_{i,j}\, Q_j(b_{j,t}, a_t)$$
The comprehensive optimal strategy is
$$\pi^c(b_{i,t}) = \arg\max_{a_t \in A} Q^c(b_{i,t}, a_t)$$
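The two planning steps can be condensed into a short routine: each adviser $j$ performs a shallow belief-space lookahead over its own transition function (using Equation (6) for belief updates), and the advisee's next exercise is the action that maximizes the similarity-weighted sum of the advisers' Q-values. The sketch below assumes the tabular arrays used earlier; the lookahead depth, the discount factor, and all function names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def forward_likelihood(D, T, O, actions, obs):
    """P(O_i | M_j, A_i): likelihood of the advisee's record under one adviser model."""
    alpha = D * O[:, actions[0], obs[0]]
    for t in range(1, len(actions)):
        alpha = (alpha @ T[:, actions[t - 1], :]) * O[:, actions[t], obs[t]]
    return alpha.sum()

def update_belief(b, a, o, T, O):
    """Equation (6): fold the observed answer (a, o) into the belief under (T, O)."""
    new_b = (b * O[:, a, o]) @ T[:, a, :]        # sum_s b(s) O(s,a,o) T(s,a,s')
    return new_b / new_b.sum()

def q_values(b, T, O, R, gamma=0.95, depth=2):
    """One adviser's action values Q(b, a) via a shallow belief-space lookahead."""
    n_s, n_a, n_o = O.shape
    q = np.zeros(n_a)
    for a in range(n_a):
        q[a] = b @ R[:, a]                       # expected immediate reward R(b, a)
        if depth > 1:
            for o in range(n_o):
                p_o = b @ O[:, a, o]             # P(o | b, a)
                if p_o > 1e-12:
                    nb = update_belief(b, a, o, T, O)
                    q[a] += gamma * p_o * q_values(nb, T, O, R, gamma, depth - 1).max()
    return q

def similarity_weights(actions, obs, D_list, T_list, O):
    """w_{i,j}: normalized likelihood of the advisee's sequence under each adviser."""
    likes = np.array([forward_likelihood(D_list[j], T_list[j], O, actions, obs)
                      for j in range(len(T_list))])
    return likes / likes.sum()

def choose_exercise(beliefs, T_list, O, R, weights, gamma=0.95, depth=2):
    """Comprehensive strategy: weight each adviser's Q-values and pick the best action."""
    q_c = sum(weights[j] * q_values(beliefs[j], T_list[j], O, R, gamma, depth)
              for j in range(len(T_list)))
    return int(np.argmax(q_c))
```

Here beliefs holds each adviser's current belief $b_{j,t}$ over the advisee's knowledge state, maintained by applying update_belief to the advisee's answer stream, and weights comes from similarity_weights evaluated on that same stream.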

4. Experimental Results

This section presents the results of the experiments validating the efficacy of our approach. We describe the datasets employed and the performance metrics and testing methodology used, and then present the experimental outcomes, followed by a summary and discussion. Our implementation integrates online planning for PCPM, HCPM, and PCEM. Below, we outline the methods used for comparative analysis. It is worth noting that these are the only POMDP-based planning strategies specifically designed for modeling learners' cognitive processes in knowledge concept learning documented in the existing literature.
  • PCPM [20]: This model incorporates POMDP to represent learners’ cognitive processes, facilitating real-time updates to a learner’s knowledge state based on feedback. This, in turn, enables the adaptation of learning strategies accordingly.
  • HCPM [21]: Using H-POMDP, this model represents learners' cognitive processes, allowing dynamic updates to learners' knowledge states and cognitive abilities based on their feedback. This facilitates the induction of suitable learning strategies.
  • PCEM: our method, proposed in Algorithm 1, aims to enhance the performance and efficiency of the system.
We devised three metrics (detailed in Equations (7)–(9)) to quantify the effectiveness of the compared algorithms. Furthermore, we conducted a statistical analysis (Equation (10)) and evaluated time complexity to comprehensively assess their performance. For our simulation studies, we chose eight knowledge concept learning domains from three distinct real-world datasets: two publicly accessible datasets, namely Non Skill-Builder data 2009-10 (ASSIST) [23] and Smart Learning Partner (SLP) [24], as well as a proprietary dataset, Quanlang (https://www.quanlangedu.com (accessed on 1 August 2024)). As shown below, the experiments demonstrate the advantage of PCEM in real-time modeling of learners' cognitive processes and in dynamically adapting learning strategies based on feedback across these diverse datasets. We provide comprehensive results, encompassing comparative studies, statistical analysis, and an evaluation of time complexity. The simulations were executed with Python 3.8 on an Ubuntu server equipped with a Core i9-1090K 3.7 GHz processor, a GeForce RTX 3090 graphics card, and 128 GB of RAM.

4.1. Datasets

We briefly describe the datasets used in the simulation experiments: ASSIST, SLP, and Quanlang. Table 1 summarizes these datasets and their sub-datasets. Figure 4 and Table 2 show the knowledge concept structures and names in the eight learning domains. For each domain, the experiments simulate learners with different abilities and knowledge states whose cognitive processes unfold while interacting with the system in real time.

4.2. Metrics for Evaluating Model Performance

To evaluate model effectiveness, we used specific performance metrics. We simulated real-time learning strategy updates for the various models and learner–system interactions based on these strategies. Assuming a 5 min response time per question, we simulated 10,000 learners interacting simultaneously with the system (10 interactions per knowledge concept). We logged the final knowledge states of these simulated learners and computed their distribution as $P(s) = \sum_{l=1}^{L} I(s_{T_l} = s)/L$, $s \in S$. Three metrics were employed to assess the performance of the different online learning strategies, and we analyzed the statistically significant differences in these metrics through a t-test.
  • $\delta(\kappa) \in [0, 1]$ represents the average mastery level of the knowledge concept $\kappa$ among the learner group.
    $$\delta(\kappa) = \sum_{s \in S} I(s, \kappa)\, P(s) \tag{7}$$
    where $I(s, \kappa)$ is an indicator function that represents the mastery level (0 or 1) of a learner with knowledge state $s$ for the knowledge concept $\kappa$.
  • $\Delta \in [0, N]$ is the average number of knowledge concepts mastered by the learner group in the learning domain.
    $$\Delta = \sum_{\kappa \in Kc} \delta(\kappa) = \sum_{s \in S} C(s)\, P(s) \tag{8}$$
    where $C(s) = \sum_{\kappa \in Kc} I(s, \kappa)$ is the number of knowledge concepts mastered by a learner with the knowledge state $s$.
  • $\Lambda \in [0, +\infty)$ represents the stability of the learning strategy.
    $$\Lambda = \sum_{s \in S} [\Delta - C(s)]^2\, P(s) \tag{9}$$
  • The independent-sample t-test statistic is
    $$t = \frac{\Delta_1 - \Delta_2}{\sqrt{\dfrac{(L_1 - 1)\Lambda_1^2 + (L_2 - 1)\Lambda_2^2}{L_1 + L_2 - 2}\left(\dfrac{1}{L_1} + \dfrac{1}{L_2}\right)}} \tag{10}$$
    where $L_1 = L_2 = L = 10{,}000$ in our experiments. Note that the $t$ here is distinct from the time step $t$ used earlier.
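For concreteness, the following sketch computes $P(s)$ and the three metrics from a vector of simulated final states, together with the t statistic of Equation (10). The 0/1 mastery matrix encoding $I(s, \kappa)$, and all of the names, are assumptions made purely for illustration.

```python
import numpy as np

def strategy_metrics(final_states, mastery):
    """Metrics of Equations (7)-(9) from simulated final knowledge states.

    final_states : length-L integer array of final state indices s_{T_l}
    mastery      : (n_s, N) 0/1 matrix with mastery[s, k] = I(s, kappa_k)
    """
    n_s = mastery.shape[0]
    P = np.bincount(final_states, minlength=n_s) / len(final_states)  # P(s)
    delta = mastery.T @ P            # Equation (7): delta(kappa) per concept
    C = mastery.sum(axis=1)          # C(s): number of concepts mastered in state s
    Delta = C @ P                    # Equation (8)
    Lam = ((Delta - C) ** 2) @ P     # Equation (9): stability
    return delta, Delta, Lam

def t_statistic(Delta1, Lam1, L1, Delta2, Lam2, L2):
    """Independent-sample t-test of Equation (10) comparing two strategies."""
    pooled = ((L1 - 1) * Lam1 ** 2 + (L2 - 1) * Lam2 ** 2) / (L1 + L2 - 2)
    return (Delta1 - Delta2) / np.sqrt(pooled * (1 / L1 + 1 / L2))
```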

4.3. Experiment: Evaluations on Online Learning Strategy Induction

We modeled each dataset using PCEM and HCPM, respectively, recording the number of iterations completed by each method within 5 min to compare their computational efficiency. As shown in Table 3, the computational efficiency of PCEM is significantly higher than that of HCPM. Additionally, by comparing the scales of several datasets, it is evident that computational efficiency is influenced by the number of learners, knowledge concepts, exercises, and records, which is consistent with the analysis in Section 3.3.
Table 4 presents the results of the experiment in which learners' progress is guided by real-time updated learning strategies. These findings highlight the effectiveness of different learning strategies across various knowledge concept learning domains and shed light on learners' performance in these domains. The results show how different strategies affect the final knowledge states that learners achieve and offer a quantitative measure of these outcomes. For each dataset and learning strategy, we analyze the distribution of learners' final knowledge states after the simulation ends, allowing us to calculate and compare the key metrics presented in Table 4. Consider the knowledge learning domain SLP2, where the state space, defined as $(\kappa_1, \kappa_2) \in \{(0, 0), (1, 0), (1, 1)\}$, represents the achievable knowledge mastery states. Under the PCEM-induced strategy, the probabilities of learners mastering $\kappa_1$ and $\kappa_2$ are high, at 0.9998 and 0.9987, respectively. These figures illustrate the efficacy of the evaluated strategies and support their continued application and potential expansion to other educational settings.
Table 5 provides a detailed statistical significance analysis of the performance differences between the online learning strategies induced by the PCEM model and those induced by the PCPM and HCPM models across various knowledge learning domains. The analysis utilizes a two-tailed t-test with a significance level of 0.05 and degrees of freedom equal to 19 , 998 , where the critical value is 1.96 .

4.4. Experimental Summary and Discussion

Through rigorous and extensive experiments, we have unequivocally established the superiority of PCEM in accurately modeling learners’ cognitive processes in real-time and dynamically adapting learning strategies based on feedback. The results obtained from diverse datasets further underscore the effectiveness and versatility of our method. We provide comprehensive experimental results, including comparative studies, statistical analysis, and an evaluation of time complexity.
In computational efficiency tests, we modeled datasets using both PCEM and HCPM, recording iterations completed within five minutes. Table 3 clearly shows PCEM’s superior efficiency over HCPM. Analysis across various dataset scales reveals that factors like the number of learners, knowledge concepts, exercises, and records impact computational efficiency, echoing our Section 3.3 analysis.
Across all tested domains, learners who followed PCEM-induced strategies consistently demonstrated a higher level of mastery over knowledge concepts, as illustrated in Table 4. A noticeable pattern emerged: knowledge concepts closer to the root of the knowledge structure were more likely to be mastered, while those nearer to the leaf nodes proved more challenging. This trend reflects the hierarchical nature of knowledge acquisition, wherein mastering core concepts is essential for understanding more complex ideas. This finding aligns with natural cognitive development, indicating that PCEM strategies resonate with learners’ intuitive learning paths. The hierarchical approach inherent in PCEM not only intuitively organizes the learning process, but also enhances efficiency. By giving priority to fundamental concepts, PCEM lays a strong foundation for building upon more complex ideas, facilitating gradual and manageable cognitive progression. This incremental approach mitigates cognitive overload, boosts knowledge retention, and fosters deeper comprehension.
The statistical analysis presented in Table 5 reveals that, although the PCEM model slightly lags behind the PCPM and HCPM models in the Quanlang3 domain, these disparities are statistically insignificant. Crucially, in most other domains, PCEM significantly outperforms its counterparts, showcasing higher proficiency and stability. This underscores the efficacy of PCEM-induced strategies in guiding learners to achieve comprehensive knowledge mastery.
In conclusion, our findings strongly recommend the implementation of PCEM-induced learning strategies. Their hierarchical approach resonates with natural cognitive development, elevating learning efficiency and depth of understanding. The statistical evidence supports their preeminence in most domains, highlighting their adaptability to individual differences and potential for personalized learning.

5. Conclusions and Future Work

In comparison to offline learning strategy induction methods, online personalized learning strategy induction offers enhanced practicality but also poses greater challenges. This paper utilizes ITS as a research platform and introduces the POMDP-based cognitive experience model (PCEM), which serves as the foundation for developing a novel online learning strategy update method. PCEM meticulously captures individual learner characteristics by recording their learning experiences, enabling the induction of personalized online learning strategies. By measuring the similarity between different learners, our method calculates the reference value of each learner’s experience in another’s learning process, facilitating personalized strategy updates. This approach ensures rapid updates to online learning strategies while significantly enhancing their overall performance.
However, our current research is limited to modeling simple state and action spaces, focusing on basic model settings and analysis. Future research will aim to gather more comprehensive information during the data collection phase and explore effective methods to incorporate this information into the modeling process. Our ultimate goal is to develop a tutoring system that considers multiple factors, akin to human instructors’ teaching methods, which will require more complex model design and broader data collection. This presents a substantial challenge in the application of artificial intelligence within ITS, and we look forward to addressing these challenges in our future work.

Author Contributions

Conceptualization, H.G.; methodology, H.G.; software, H.G.; validation, H.G.; formal analysis, H.G.; investigation, H.G.; resources, B.M.; data curation, H.G.; writing—original draft preparation, H.G.; writing—review and editing, B.M.; visualization, H.G.; supervision, B.M.; project administration, B.M.; funding acquisition, B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (Grants No. 62176225 and 62276168) and the Natural Science Foundation of Fujian Province, China (Grant No. 2022J05176) and Guangdong Province, China (Grant No. 2023A1515010869).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, H.; Tlili, A.; Huang, R.; Cai, Z.; Li, M.; Cheng, Z.; Yang, D.; Li, M.; Zhu, X.; Fei, C. Examining the applications of intelligent tutoring systems in real educational contexts: A systematic literature review from the social experiment perspective. Educ. Inf. Technol. 2023, 28, 9113–9148.
  2. Vasandani, V.; Govindaraj, T. Knowledge organization in intelligent tutoring systems for diagnostic problem solving in complex dynamic domains. IEEE Trans. Syst. Man Cybern. 1995, 25, 1076–1096.
  3. Goh, G.M.; Quek, C. EpiList: An intelligent tutoring system shell for implicit development of generic cognitive skills that support bottom-up knowledge construction. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 2006, 37, 58–71.
  4. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285.
  5. Tang, X.; Chen, Y.; Li, X.; Liu, J.; Ying, Z. A reinforcement learning approach to personalized learning recommendation systems. Br. J. Math. Stat. Psychol. 2019, 72, 108–135.
  6. Zhou, G.; Yang, X.; Azizsoltani, H.; Barnes, T.; Chi, M. Improving student-system interaction through data-driven explanations of hierarchical reinforcement learning induced pedagogical policies. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, Genoa, Italy, 12–18 July 2020; pp. 284–292.
  7. Kubotani, Y.; Fukuhara, Y.; Morishima, S. RLTutor: Reinforcement learning based adaptive tutoring system by modeling virtual student with fewer interactions. arXiv 2021, arXiv:2108.00268.
  8. Pateria, S.; Subagdja, B.; Tan, A.-H.; Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35.
  9. Zhou, G.; Azizsoltani, H.; Ausin, M.S.; Barnes, T.; Chi, M. Hierarchical reinforcement learning for pedagogical policy induction. In Proceedings of the Artificial Intelligence in Education: 20th International Conference, AIED 2019, Chicago, IL, USA, 25–29 June 2019; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2019; pp. 544–556.
  10. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
  11. Ju, S. Identify critical pedagogical decisions through adversarial deep reinforcement learning. In Proceedings of the 12th International Conference on Educational Data Mining (EDM 2019), Montreal, QC, Canada, 2–5 July 2019.
  12. Huang, Z.; Liu, Q.; Zhai, C.; Yin, Y.; Chen, E.; Gao, W.; Hu, G. Exploring multi-objective exercise recommendations in online education systems. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1261–1270.
  13. Sanz Ausin, M.; Maniktala, M.; Barnes, T.; Chi, M. Exploring the impact of simple explanations and agency on batch deep reinforcement learning induced pedagogical policies. In Proceedings of the Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane, Morocco, 6–10 July 2020; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2020; pp. 472–485.
  14. Ausin, M.S.; Maniktala, M.; Barnes, T.; Chi, M. Tackling the credit assignment problem in reinforcement learning-induced pedagogical policies with neural networks. In Proceedings of the International Conference on Artificial Intelligence in Education, Utrecht, The Netherlands, 14–18 June 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 356–368.
  15. Judd, C.H. Educational Psychology; Routledge: London, UK, 2012.
  16. Spaan, M.T. Partially observable Markov decision processes. In Reinforcement Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 387–414.
  17. Rafferty, A.N.; Brunskill, E.; Griffiths, T.L.; Shafto, P. Faster teaching via POMDP planning. Cogn. Sci. 2016, 40, 1290–1332.
  18. Ramachandran, A.; Sebo, S.S.; Scassellati, B. Personalized robot tutoring using the Assistive Tutor POMDP (AT-POMDP). In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8050–8057.
  19. Nioche, A.; Murena, P.A.; de la Torre-Ortiz, C.; Oulasvirta, A. Improving artificial teachers by considering how people learn and forget. In Proceedings of the 26th International Conference on Intelligent User Interfaces, College Station, TX, USA, 13–17 April 2021; pp. 445–453.
  20. Gao, H.; Zeng, Y.; Ma, B.; Pan, Y. Improving knowledge learning through modelling students' practice-based cognitive processes. Cogn. Comput. 2024, 16, 348–365.
  21. Gao, H.; Zeng, Y.; Pan, Y. Inducing individual students' learning strategies through homomorphic POMDPs. arXiv 2024, arXiv:2403.10930.
  22. Bellman, R. Dynamic programming. Science 1966, 153, 34–37.
  23. Feng, M.; Heffernan, N.; Koedinger, K. Addressing the assessment challenge with an online system that tutors as it assesses. User Model. User-Adapt. Interact. 2009, 19, 243–266.
  24. Lu, Y.; Pian, Y.; Shen, Z.; Chen, P.; Li, X. SLP: A multi-dimensional and consecutive dataset from K-12 education. In Proceedings of the 29th International Conference on Computers in Education (ICCE 2021), Online, 22–26 November 2021; Volume 1, pp. 261–266.
Figure 1. Exercise-based cognitive process. During this process, the system chooses a question from the question bank for the learner based on the learning strategy (either manually designed or algorithmically induced) and the estimated current knowledge state of the learner. After the learner answers the question, they then compare their answer with the correct answer provided by the system and read the explanation for the question to consolidate or improve their mastery of the knowledge concepts. The system updates its estimation of the learner’s knowledge state based on their performance, thereby creating a loop.
Figure 2. A POMDP-based cognitive model plans the exercise-based learning over four time steps. At each time step, the learner’s knowledge state and the selected question determine his/her performance. The selected question, in turn, alters the learner’s knowledge state. The system makes decisions for the current time step based on the questions answered and the performance at each previous time step recorded in the history. We aim for the learner’s knowledge state to be as strong as possible by the end of the process, so we set a value function for the final knowledge state.
Figure 3. The PCEM parameter learning comprises procedures (a–e). For each sequence, there is a corresponding POMDP (a). The parameters of these L POMDPs are first initialized (b). The observation sequence and action sequence are extracted from each sequence to facilitate subsequent parameter updates based on the update formulas (c). The parameters are iteratively updated according to Equations (3)–(5) until convergence (d,e).
Figure 4. Knowledge concept structures. The structure is represented by a directed graph, where each node represents a knowledge concept. A knowledge concept represented by a node can be mastered only if all the knowledge concepts represented by its parent nodes (if any) are mastered.
Table 1. The statistical features of the datasets and sub-datasets in the experiments.
Dataset/Sub-Dataset | Subject | Learners | Knowledge Concepts | Questions | Answer Logs
ASSIST | | 8096 | 543 | 6907 | 603,128
ASSIST1 | Math | 2865 | 3 | 202 | 25,963
ASSIST2 | Math | 3509 | 7 | 154 | 20,513
ASSIST3 | Math | 3283 | 6 | 208 | 25,642
Quanlang | | 11,765 | 399 | 4871 | 343,719
Quanlang1 | Math | 416 | 4 | 81 | 4243
Quanlang2 | Math | 296 | 3 | 62 | 3089
Quanlang3 | Math | 515 | 5 | 103 | 5304
SLP | | 3408 | 184 | 6851 | 344,576
SLP1 | Physics | 143 | 5 | 184 | 4278
SLP2 | Chemistry | 81 | 2 | 33 | 1400
Table 2. Knowledge concept descriptions.
Dataset | ID | Knowledge Name | State Description
ASSIST1 | κ1 | Subtraction whole numbers | (κ1, κ2, κ3)
 | κ2 | Subtraction whole numbers, Pattern finding |
 | κ3 | Pattern finding |
ASSIST2 | κ1 | Congruence | (κ1, κ5, κ2, κ6, κ3, κ7, κ4)
 | κ2 | Perimeter of a Polygon |
 | κ3 | Substitution |
 | κ4 | Equation solving more than two steps |
 | κ5 | Congruence, Perimeter of a Polygon |
 | κ6 | Congruence, Perimeter of a Polygon, Substitution |
 | κ7 | Congruence, Perimeter of a Polygon, Substitution, Equation solving more than two steps |
ASSIST3 | κ1 | Multiplication of whole numbers | (κ1, κ4, κ2, κ5, κ3, κ6)
 | κ2 | Pattern finding |
 | κ3 | Unlabeled |
 | κ4 | Multiplication of whole numbers, Pattern finding |
 | κ5 | Pattern finding, Unlabeled |
 | κ6 | Multiplication of whole numbers, Pattern finding, Unlabeled |
Quanlang1 | κ1 | Positive and negative numbers | (κ1, κ2, κ3, κ4)
 | κ2 | Absolute value |
 | κ3 | Opposite number |
 | κ4 | Addition of rational numbers |
Quanlang2 | κ1 | Definition of linear equations in one variable | (κ1, κ2, κ3)
 | κ2 | Formulating linear equations in one variable |
 | κ3 | Solutions to linear equations in one variable |
Quanlang3 | κ1 | Definition of triangles | (κ1, κ2, κ3, κ4, κ5)
 | κ2 | Classification of triangles |
 | κ3 | Triangular side relations |
 | κ4 | Interior angle properties of triangles |
 | κ5 | Exterior angle properties of triangles |
SLP1 | κ1 | Rectilinear propagation of light | (κ1, κ2, κ3, κ4, κ5)
 | κ2 | Reflection of light |
 | κ3 | Refraction of light |
 | κ4 | Image formation by converging lens |
 | κ5 | Lens and its applications |
SLP2 | κ1 | Physical and chemical changes | (κ1, κ2)
 | κ2 | Research into applications of substance property |
Table 3. Comparison of the number of iterations.
ASSIST1ASSIST2ASSIST3Quanlang1Quanlang2Quanlang3SLP1SLP2
PCEM447836468239921916
HCPM1433911781238792
Table 4. Experimental results on the learning strategy performance.
Sub-Dataset | Metric | PCPM | HCPM | PCEM
ASSIST1 | δ(κ1) | 0.9977 | 0.9988 | 0.9993
 | δ(κ3) | 0.9926 | 0.9947 | 0.9959
 | δ(κ2) | 0.9988 | 0.9986 | 0.9983
 | Δ | 2.9890 | 2.9921 | 2.9935
 | Λ | 0.0180 | 0.0133 | 0.0112
ASSIST2 | δ(κ1) | 0.9944 | 0.9983 | 0.9995
 | δ(κ5) | 0.9804 | 0.9913 | 0.9956
 | δ(κ2) | 0.9950 | 0.9985 | 0.9994
 | δ(κ6) | 0.9586 | 0.9793 | 0.9889
 | δ(κ3) | 0.9901 | 0.9953 | 0.9977
 | δ(κ7) | 0.9337 | 0.9613 | 0.9761
 | δ(κ4) | 0.9819 | 0.9870 | 0.9903
 | Δ | 6.8340 | 6.9110 | 6.9475
 | Λ | 0.4848 | 0.2427 | 0.1369
ASSIST3 | δ(κ1) | 0.9966 | 0.9978 | 0.9985
 | δ(κ4) | 0.9863 | 0.9923 | 0.9955
 | δ(κ2) | 0.9984 | 0.9994 | 0.9997
 | δ(κ5) | 0.9869 | 0.9918 | 0.9945
 | δ(κ3) | 0.9969 | 0.9977 | 0.9982
 | δ(κ6) | 0.9690 | 0.9795 | 0.9859
 | Δ | 5.9340 | 5.9584 | 5.9722
 | Λ | 0.1647 | 0.1008 | 0.0675
Quanlang1 | δ(κ1) | 0.9999 | 1.0000 | 1.0000
 | δ(κ2) | 0.9968 | 0.9987 | 0.9995
 | δ(κ3) | 0.9891 | 0.9932 | 0.9956
 | δ(κ4) | 0.9817 | 0.9874 | 0.9911
 | Δ | 3.9674 | 3.9793 | 3.9861
 | Λ | 0.0671 | 0.0391 | 0.0246
Quanlang2 | δ(κ1) | 0.9998 | 0.9999 | 0.9999
 | δ(κ2) | 0.9962 | 0.9977 | 0.9985
 | δ(κ3) | 0.9882 | 0.9917 | 0.9941
 | Δ | 2.9843 | 2.9893 | 2.9926
 | Λ | 0.0237 | 0.0157 | 0.0105
Quanlang3 | δ(κ1) | 1.0000 | 1.0000 | 1.0000
 | δ(κ2) | 0.9338 | 0.9337 | 0.9330
 | δ(κ3) | 0.9317 | 0.9296 | 0.9267
 | δ(κ4) | 0.9858 | 0.9875 | 0.9889
 | δ(κ5) | 0.9199 | 0.9211 | 0.9215
 | Δ | 4.7711 | 4.7718 | 4.7700
 | Λ | 0.2494 | 0.2464 | 0.2478
SLP1 | δ(κ1) | 1.0000 | 1.0000 | 1.0000
 | δ(κ2) | 0.9643 | 0.9636 | 0.9627
 | δ(κ3) | 0.9965 | 0.9976 | 0.9984
 | δ(κ4) | 0.9881 | 0.9899 | 0.9913
 | δ(κ5) | 0.9478 | 0.9492 | 0.9501
 | Δ | 4.8967 | 4.9004 | 4.9025
 | Λ | 0.1401 | 0.1298 | 0.1236
SLP2 | δ(κ1) | 0.9997 | 0.9997 | 0.9998
 | δ(κ2) | 0.9963 | 0.9978 | 0.9987
 | Δ | 1.9960 | 1.9975 | 1.9984
 | Λ | 0.0046 | 0.0030 | 0.0020
Bold numbers indicate the statistically significant difference.
Table 5. The statistical significance of the learning strategy performance.
PCEM vs. | ASSIST1 | ASSIST2 | ASSIST3 | Quanlang1 | Quanlang2 | Quanlang3 | SLP1 | SLP2
PCPM | 2.6378 | 14.4004 | 7.9324 | 6.1880 | 4.4891 | 0.1518 | 1.1302 | 3.0267
HCPM | 0.9550 | 5.9215 | 3.3538 | 2.7149 | 2.0473 | 0.2519 | 0.4169 | 1.2903
Bold numbers indicate the statistically significant difference.
