Abstract
We develop a method for estimating a discrimination matrix with a simple structure in a multidimensional item response theory model. The proposed method allows each test item to correspond to a single latent trait, which makes the results easier to interpret. It also enables clustering of test items according to their corresponding latent traits. The basic idea of our approach is to use the prenet (product-based elastic net) penalty, originally proposed in factor analysis. For the optimization, we show that combining a stochastic EM algorithm, the proximal gradient method, and the coordinate descent method yields solutions efficiently. Numerical experiments demonstrate the effectiveness of the proposed method, especially when the number of test subjects is small, compared with methods based on the existing $\ell_1$ penalty.
1. Introduction
Item response theory (IRT) is a mathematical framework for constructing and evaluating tests, and it is employed in the creation and operation of various large-scale ability tests such as language proficiency exams. Although IRT models are practical, standard IRT assumes a single latent trait, which is not suitable when a test measures multiple abilities. To measure multiple latent traits, multidimensional IRT (MIRT) models [1], which extend the IRT model to multiple dimensions, are used. However, the estimation results of MIRT models can be difficult to interpret; for example, it is often unclear what each latent trait represents or how the test items are related to one another. To facilitate interpretation, such as understanding which latent trait each test item measures, it is desirable for the estimated discrimination matrix to have a simple structure, for example one with many zero entries. In this paper, we propose a method for estimating a discrimination matrix with a simple structure, in which each item corresponds to a single latent trait, from binary (correct/incorrect) response data.
In MIRT, the simplicity of the estimation results plays an important role in interpretability. Existing research has proposed penalized estimation for MIRT [4] based on $\ell_1$-regularization, as employed in the lasso [2,3]. The $\ell_1$ penalty shrinks the estimates towards zero and allows some of them to become exactly zero. The method therefore simplifies the estimated matrix by excluding unnecessary variables, that is, by performing variable selection. The properties of $\ell_1$-regularization in linear regression have been studied extensively; it is known to provide accurate estimates and, under suitable conditions, consistent model selection [5,6,7].
However, $\ell_1$-regularization does not necessarily produce an interpretable, simple matrix. For example, if the regularization parameter is too large, all variables are estimated to be zero, making the analysis meaningless. Indeed, our numerical experiments (Section 4) show that when the number of subjects (examinees) is small, selecting the regularization parameter with the Bayesian information criterion (BIC) [8] leads to a matrix whose components are all zero. In addition, $\ell_1$-regularization shrinks all variables towards zero uniformly, so variables whose true values are close to zero are estimated as exactly zero more frequently.
In this study, we employ the product-based elastic net (prenet) penalty [9], proposed in the field of factor analysis, as the structure regularization. The prenet penalty penalizes the products of pairs of elements within each row of the matrix, and it drives the estimate towards having at most one nonzero component per row. It therefore allows the test items to be clustered by latent trait, with each item corresponding to a single latent trait. If the responses take multiple values, they can be treated as continuous and handled within the framework of factor analysis. When the responses are binary, however, as in this study, working within the conventional factor analysis framework is unnatural. Thus, this study can be viewed as an extension of the prenet penalty to discrete factor analysis. The optimization of the proposed method efficiently combines the stochastic EM algorithm [10], the proximal gradient method [11], and the coordinate descent method [12].
The regularization parameter of the prenet penalty controls the simplicity of the estimated matrix. In this study, the regularization parameter is selected using the BIC. Furthermore, a Monte Carlo simulation using synthetic data is conducted to compare $\ell_1$-regularization with the proposed prenet-based method. The proposed method is shown to recover the true structure of the matrix even when the number of subjects is small.
The remainder of this paper is organized as follows: Section 2 describes the MIRT model and the prenet penalty. Section 3 presents the optimization methods used to obtain the proposed estimates. Section 4 compares the performance of $\ell_1$-regularization and the prenet penalty using Monte Carlo simulations with synthetic data. Finally, Section 5 concludes the paper and discusses future extensions.
2. MIRT Model and Prenet Penalty
2.1. 2-Parameter Multidimensional IRT Model
Consider a situation with responses from N subjects to J items. The response of subject i to item j is binary and is denoted by $u_{ij} \in \{0, 1\}$. Each subject has a K-dimensional latent trait vector, represented by $\boldsymbol{\theta}_i = (\theta_{i1}, \ldots, \theta_{iK})^\top$. Assuming that $u_{ij}$ follows a two-parameter multidimensional IRT (2-PL MIRT) model with slope vector $\boldsymbol{a}_j = (a_{j1}, \ldots, a_{jK})^\top$ and intercept $b_j$, the model is
$$P(u_{ij} = 1 \mid \boldsymbol{\theta}_i) = \frac{\exp(\boldsymbol{a}_j^\top \boldsymbol{\theta}_i + b_j)}{1 + \exp(\boldsymbol{a}_j^\top \boldsymbol{\theta}_i + b_j)}. \qquad (1)$$
Furthermore, we assume local independence among the responses. Let $\boldsymbol{u}_i = (u_{i1}, \ldots, u_{iJ})^\top$, $A = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_J)^\top \in \mathbb{R}^{J \times K}$, and $\boldsymbol{b} = (b_1, \ldots, b_J)^\top$. Then, it is assumed that
$$P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i) = \prod_{j=1}^{J} P(u_{ij} = 1 \mid \boldsymbol{\theta}_i)^{u_{ij}} \left\{ 1 - P(u_{ij} = 1 \mid \boldsymbol{\theta}_i) \right\}^{1 - u_{ij}} \qquad (2)$$
holds.
Assuming the prior density of $\boldsymbol{\theta}_i$ is $\varphi(\boldsymbol{\theta}_i)$, the likelihood function for the complete data, given the latent traits and responses of each subject, is
$$L_c(A, \boldsymbol{b}) = \prod_{i=1}^{N} P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i)\, \varphi(\boldsymbol{\theta}_i).$$
In this paper, $\varphi$ is modeled as the density function of the K-dimensional normal distribution $N_K(\boldsymbol{0}, I_K)$, where $I_K$ is the K-dimensional identity matrix. While it is possible to estimate the covariance matrix $\Sigma$ by placing a prior distribution on it, in this paper we fix $\Sigma = I_K$. The log-likelihood of $(A, \boldsymbol{b})$, marginalized over the latent traits, is
$$\ell(A, \boldsymbol{b}) = \sum_{i=1}^{N} \log \int_{\mathbb{R}^K} P(\boldsymbol{u}_i \mid \boldsymbol{\theta})\, \varphi(\boldsymbol{\theta})\, d\boldsymbol{\theta}. \qquad (3)$$
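The integral in (3) has no closed form; as an illustration, it can be approximated by Monte Carlo integration over the $N_K(\boldsymbol{0}, I_K)$ prior. The following sketch is ours (function and variable names are not from the paper) and assumes U is an N×J binary response matrix, A a J×K slope matrix, and b a length-J intercept vector.

```python
import numpy as np
from scipy.special import logsumexp

def marginal_loglik_mc(U, A, b, n_draws=2000, seed=0):
    """Monte Carlo approximation of the marginal log-likelihood (3).

    The latent traits are integrated out against their N(0, I_K) prior by
    averaging the conditional likelihood of each subject over prior draws.
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_draws, A.shape[1]))  # draws from the prior
    eta = theta @ A.T + b                               # (n_draws, J) linear predictors
    logp1 = -np.logaddexp(0.0, -eta)                    # log P(u_ij = 1 | theta)
    logp0 = -np.logaddexp(0.0, eta)                     # log P(u_ij = 0 | theta)
    # (N, n_draws): log-likelihood of each subject's response vector under each draw
    ll = U @ logp1.T + (1.0 - U) @ logp0.T
    # log of the Monte Carlo average, subject by subject, then summed over subjects
    return float(np.sum(logsumexp(ll, axis=1) - np.log(n_draws)))
```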
In marginal maximum likelihood estimation for conventional MIRT, one maximizes (3) to obtain $(\hat{A}, \hat{\boldsymbol{b}})$. In this study, however, we add a regularization term P to impose a simple structure on A, with $\rho \geq 0$ as the regularization parameter:
$$(\hat{A}, \hat{\boldsymbol{b}}) = \arg\max_{A, \boldsymbol{b}} \left\{ \ell(A, \boldsymbol{b}) - \rho\, P(A) \right\}. \qquad (4)$$
Setting $P(A) = \sum_{j=1}^{J} \sum_{k=1}^{K} |a_{jk}|$ results in $\ell_1$-regularization, as proposed in [4].
2.2. Item Clustering Using Structure Regularization
With $\ell_1$-regularization as in the lasso [3], when $\rho$ is large the estimate becomes $\hat{A} = O$, making clustering infeasible. In addition, all of the estimates are shrunk towards zero. This study instead uses, for P, the product-based elastic net penalty (prenet penalty) [9]:
$$P(A) = \sum_{j=1}^{J} \sum_{k < l} \left( \gamma\, |a_{jk} a_{jl}| + \frac{1 - \gamma}{2}\, a_{jk}^2 a_{jl}^2 \right), \qquad \gamma \in (0, 1]. \qquad (5)$$
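As a small illustration (the function name and the loop-based implementation are our own), the penalty value can be computed directly from its definition:

```python
import numpy as np

def prenet_penalty(A, gamma=0.1):
    """Prenet penalty (5): row-wise pairwise products of the loadings.

    The penalty equals zero exactly when every row of A has at most one
    nonzero element (the perfect simple structure).
    """
    total = 0.0
    J, K = A.shape
    for j in range(J):
        for k in range(K):
            for l in range(k + 1, K):
                total += gamma * abs(A[j, k] * A[j, l])
                total += 0.5 * (1.0 - gamma) * A[j, k] ** 2 * A[j, l] ** 2
    return total
```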
The prenet penalty in the two-variable case is shown in Figure 1. The prenet penalty has a pointed (non-differentiable) shape where either variable is zero, and although it is not a convex function overall, it is multi-convex: it becomes convex when all but one variable are fixed. Due to this property, an efficient solution can be obtained using the coordinate descent method.
Figure 1.
The prenet penalty in the two-variable case.
Using the prenet penalty with a sufficiently large $\rho$, a simple structure is imposed on $\hat{A}$. In fact, when $\gamma = 1$, as $\rho \to \infty$, $\hat{A}$ has at most one nonzero component per row (Proposition 1 in [9]), leading to a situation like that in Figure 2, where each item corresponds to at most one latent trait. Therefore, it becomes possible to group the items by the latent traits they require, enabling item clustering. The prenet penalty is also discussed in relation to k-means in [9] and can be viewed as a generalization of k-means clustering.
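For example, once an estimate $\hat{A}$ with a (near-)perfect simple structure has been obtained, each item can be assigned to the latent trait carrying its largest loading in absolute value. A minimal sketch (names are our own) follows.

```python
import numpy as np

def item_clusters(A_hat, tol=1e-8):
    """Assign each item to the latent trait with the largest |loading|.

    Items whose loadings are all numerically zero receive the label -1.
    Returns an integer array of length J with labels in {-1, 0, ..., K-1}.
    """
    labels = np.abs(A_hat).argmax(axis=1)
    all_zero = np.all(np.abs(A_hat) < tol, axis=1)
    return np.where(all_zero, -1, labels)
```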
Figure 2.
Image of the obtained estimation results.
2.3. Determining the Regularization Parameter
In this study, the regularization parameter $\rho$ is determined using the Bayesian information criterion (BIC) [8]. The BIC is defined as follows:
$$\mathrm{BIC} = -2\, \ell(\hat{A}, \hat{\boldsymbol{b}}) + \log(N) \cdot \mathrm{df}. \qquad (6)$$
Here, $\mathrm{df}$ represents the number of nonzero components in $\hat{A}$. The BIC thus penalizes the complexity of the model through the number of nonzero components in $\hat{A}$. We calculate the BIC (6) for several values of $\rho$ and select the value that minimizes the BIC. The parameter $\gamma$ of the prenet penalty should, in principle, also be determined using the BIC, but in the experiments of this paper only $\gamma = 0.1$ is used. Furthermore, when clustering is the goal, a sufficiently large $\gamma$ can yield item clusters; however, if $\gamma$ is too large, the non-convexity of the prenet penalty becomes strong, leading to convergence to local minima and unstable solutions. It is therefore advisable not to make $\gamma$ excessively large.
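In code, the criterion can be computed as below; this is a sketch that reuses the hypothetical marginal_loglik_mc helper from Section 2.1 and counts the degrees of freedom as the number of nonzero slope entries.

```python
import numpy as np

def bic(U, A_hat, b_hat, tol=1e-8):
    """BIC (6): -2 * marginal log-likelihood + log(N) * (# nonzero entries of A_hat)."""
    N = U.shape[0]
    df = int(np.sum(np.abs(A_hat) > tol))
    return -2.0 * marginal_loglik_mc(U, A_hat, b_hat) + np.log(N) * df
```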
Just as in factor analysis, when the identifiability conditions of the orthogonal model (e.g., Theorem 5.1 in [13]) are satisfied, the solution is unique up to orthogonal rotation. To fix this rotational indeterminacy, some constraints must be imposed on A; for instance, $a_{jk} = 0$ for $j < k$. Under these identifiability conditions and the constraint that each row of A has at most one nonzero component, that is, the perfect simple structure, A becomes unique except for the sign and permutation of its columns. Using the prenet penalty, an A with the perfect simple structure can be estimated when $\rho$ is sufficiently large. Therefore, in this study, we conducted the estimation without adding any special constraints other than the prenet penalty.
3. Optimization Method
This section discusses methods for solving the optimization problem (4). In this paper, we employ the stochastic expectation-maximization (stEM) algorithm, as proposed for marginal maximum likelihood estimation [10,14]. The EM algorithm seeks a solution by repeating the E-step, which computes the expectation of the complete-data log-likelihood under the posterior distribution of the latent traits, and the M-step, which maximizes the expected log-likelihood obtained in the E-step. In the stEM algorithm, the E-step is performed efficiently using random numbers drawn from the posterior distribution, which in turn makes the calculations in the M-step more efficient.
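The overall procedure can be sketched as follows. The helpers gibbs_sample_theta and m_step_prox are placeholders for the stE-step of Section 3.3 and the M-step of Appendix A (sketches of both are given later); all names and default settings are ours.

```python
import numpy as np

def stem_iterates(U, K, rho, gamma=0.1, n_iter=300, init=None, seed=0):
    """Stochastic EM for the penalized 2-PL MIRT model (sketch).

    Alternates a stochastic E-step (one Gibbs draw of the latent traits per
    subject) with an M-step (penalized logistic regressions) and returns the
    trace of iterates; the final estimate is formed from this trace as
    described in Section 3.4.
    """
    rng = np.random.default_rng(seed)
    N, J = U.shape
    if init is None:
        A, b = rng.normal(scale=0.1, size=(J, K)), np.zeros(J)
    else:
        A, b = (np.array(x, dtype=float) for x in init)
    Theta = rng.standard_normal((N, K))
    A_trace, b_trace = [], []
    for _ in range(n_iter):
        Theta = gibbs_sample_theta(U, A, b, Theta, rng)   # stE-step (Section 3.3)
        A, b = m_step_prox(U, Theta, A, b, rho, gamma)    # M-step (Appendix A)
        A_trace.append(A.copy())
        b_trace.append(b.copy())
    return np.stack(A_trace), np.stack(b_trace)
```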
3.1. Stochastic E-step (stE-step)
Let $(A^{(t)}, \boldsymbol{b}^{(t)})$ be the parameters at the t-th iteration. In the standard E-step, for iteration $t + 1$, the expectation of the complete-data log-likelihood under the posterior distribution of the latent traits is computed as follows:
$$Q(A, \boldsymbol{b} \mid A^{(t)}, \boldsymbol{b}^{(t)}) = \sum_{i=1}^{N} \mathbb{E}_{\boldsymbol{\theta}_i \mid \boldsymbol{u}_i,\, A^{(t)},\, \boldsymbol{b}^{(t)}} \left[ \log \left\{ P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i)\, \varphi(\boldsymbol{\theta}_i) \right\} \right].$$
In the case of the MIRT model, this expectation cannot be computed in closed form, so existing research [4] has approximated it on a grid of lattice points. In this study, we approximate it by generating random numbers from the posterior distribution using a Markov chain Monte Carlo (MCMC) method, namely Gibbs sampling. The details of the random number generation using Gibbs sampling are described in Section 3.3.
In the stE-step, random numbers from the posterior distribution are generated only once per iteration. That is,
$$\boldsymbol{\theta}_i^{(t+1)} \sim p(\boldsymbol{\theta}_i \mid \boldsymbol{u}_i, A^{(t)}, \boldsymbol{b}^{(t)}), \qquad i = 1, \ldots, N,$$
is sampled, and the expected log-likelihood is approximated by
$$Q(A, \boldsymbol{b} \mid A^{(t)}, \boldsymbol{b}^{(t)}) \approx \sum_{i=1}^{N} \log \left\{ P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i^{(t+1)})\, \varphi(\boldsymbol{\theta}_i^{(t+1)}) \right\}. \qquad (9)$$
3.2. M-Step
In the M-step, we maximize the regularized expected log-likelihood obtained in the stE-step (9). Specifically, using the random numbers $\boldsymbol{\theta}_1^{(t+1)}, \ldots, \boldsymbol{\theta}_N^{(t+1)}$ generated in the stE-step, we solve
$$(A^{(t+1)}, \boldsymbol{b}^{(t+1)}) = \arg\max_{A, \boldsymbol{b}} \left[ \sum_{i=1}^{N} \log \left\{ P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i^{(t+1)})\, \varphi(\boldsymbol{\theta}_i^{(t+1)}) \right\} - \rho\, P(A) \right]. \qquad (10)$$
The part to be maximized, the expected log-likelihood, decomposes into J logistic regressions, so its gradients and Hessians are easy to compute. It can therefore be optimized with methods such as the proximal Newton method [15]. In this paper, we solve it using the proximal gradient method [11] and the coordinate descent method developed for $\ell_1$-regularization [12]. Details of the optimization calculations are presented in Appendix A.
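To make the decomposition explicit, write $\eta_{ij} = \boldsymbol{a}_j^\top \boldsymbol{\theta}_i^{(t+1)} + b_j$; the term of (10) that depends on $(A, \boldsymbol{b})$ separates over items, so each $(\boldsymbol{a}_j, b_j)$ is the coefficient vector of a logistic regression of $(u_{1j}, \ldots, u_{Nj})$ on the sampled latent traits:
$$\sum_{i=1}^{N} \log P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i^{(t+1)}) = \sum_{j=1}^{J} \sum_{i=1}^{N} \left\{ u_{ij} \eta_{ij} - \log\left( 1 + e^{\eta_{ij}} \right) \right\}.$$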
3.3. Gibbs Sampling
We now consider how to sample from the posterior distribution of the latent traits in the stE-step. In this study, following [16], we generate random numbers from the posterior distribution using Gibbs sampling based on the Pólya-Gamma distribution. This approach is an extension of the method proposed for logistic regression [17].
Definition 1.
When a random variable X has the same distribution as
$$X \overset{d}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2 / (4\pi^2)}, \qquad g_k \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Ga}(b, 1),$$
we say that X follows the Pólya-Gamma distribution with parameters $(b, c)$, denoted as $X \sim \mathrm{PG}(b, c)$. Here, $\mathrm{Ga}(b, 1)$ is the gamma distribution with shape parameter b and scale parameter 1.
Using the Pólya-Gamma distribution, the logistic function of $\psi \in \mathbb{R}$ can be expressed as an integral of the following form:
$$\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^2 / 2}\, p(\omega)\, d\omega, \qquad (12)$$
where $p(\omega)$ is the probability density function of $\mathrm{PG}(b, 0)$ and $\kappa = a - b/2$. From (12), the model (1) with $\psi_{ij} = \boldsymbol{a}_j^\top \boldsymbol{\theta}_i + b_j$ can be written as
$$P(u_{ij} \mid \boldsymbol{\theta}_i) = \frac{(e^{\psi_{ij}})^{u_{ij}}}{1 + e^{\psi_{ij}}} = \frac{1}{2}\, e^{\kappa_{ij} \psi_{ij}} \int_{0}^{\infty} e^{-\omega_{ij} \psi_{ij}^2 / 2}\, p(\omega_{ij})\, d\omega_{ij},$$
where $\omega_{ij} \sim \mathrm{PG}(1, 0)$.
Thus, the conditional distribution of $\boldsymbol{\theta}_i$ given $\boldsymbol{\omega}_i = (\omega_{i1}, \ldots, \omega_{iJ})^\top$ and $\boldsymbol{u}_i$ is
$$p(\boldsymbol{\theta}_i \mid \boldsymbol{\omega}_i, \boldsymbol{u}_i) \propto \varphi(\boldsymbol{\theta}_i) \prod_{j=1}^{J} \exp\left\{ \kappa_{ij} \psi_{ij} - \frac{\omega_{ij} \psi_{ij}^2}{2} \right\}, \qquad (13)$$
where $\kappa_{ij} = u_{ij} - 1/2$. In this study, since the prior distribution of $\boldsymbol{\theta}_i$ is $N_K(\boldsymbol{0}, I_K)$, the conditional distribution (13) becomes a normal distribution,
$$\boldsymbol{\theta}_i \mid \boldsymbol{\omega}_i, \boldsymbol{u}_i \sim N_K(\boldsymbol{\mu}_i, V_i),$$
where $V_i = (I_K + A^\top \Omega_i A)^{-1}$, $\boldsymbol{\mu}_i = V_i A^\top (\boldsymbol{\kappa}_i - \Omega_i \boldsymbol{b})$, $\boldsymbol{\kappa}_i = (\kappa_{i1}, \ldots, \kappa_{iJ})^\top$, and $\Omega_i = \mathrm{diag}(\omega_{i1}, \ldots, \omega_{iJ})$.
Furthermore, the conditional distribution of $\omega_{ij}$ given $\boldsymbol{\theta}_i$ is
$$\omega_{ij} \mid \boldsymbol{\theta}_i \sim \mathrm{PG}(1, \psi_{ij}).$$
Therefore, in the (t+1)-th stE-step, Gibbs sampling alternates between these two conditional distributions to generate random numbers from the posterior: $\omega_{ij}$ is drawn from $\mathrm{PG}(1, \boldsymbol{a}_j^{(t)\top} \boldsymbol{\theta}_i + b_j^{(t)})$ for all i and j, and $\boldsymbol{\theta}_i$ is then drawn from $N_K(\boldsymbol{\mu}_i, V_i)$ computed with $(A^{(t)}, \boldsymbol{b}^{(t)})$ and the current $\boldsymbol{\omega}_i$.
Random numbers from the Pólya-Gamma distribution can be obtained using R packages such as “pg”.
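A minimal sketch of one Gibbs sweep is given below. Instead of a dedicated Pólya-Gamma sampler, it draws approximate PG(1, c) variables by truncating the infinite gamma sum in Definition 1, which is adequate only for illustration; all function names are our own.

```python
import numpy as np

def sample_pg1(c, rng, n_terms=200):
    """Approximate PG(1, c) draws via a truncated version of the gamma-sum representation."""
    c = np.asarray(c, dtype=float)
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + (c[..., None] ** 2) / (4.0 * np.pi ** 2)
    g = rng.standard_exponential(size=denom.shape)        # Ga(1, 1) variables
    return (g / denom).sum(axis=-1) / (2.0 * np.pi ** 2)

def gibbs_sample_theta(U, A, b, Theta, rng):
    """One Gibbs sweep of Section 3.3: draw omega | theta, then theta | omega, u."""
    N, J = U.shape
    K = A.shape[1]
    psi = Theta @ A.T + b                                 # (N, J) linear predictors
    omega = sample_pg1(psi, rng)                          # (N, J) Polya-Gamma draws
    kappa = U - 0.5
    new_Theta = np.empty_like(Theta)
    for i in range(N):
        prec = np.eye(K) + A.T @ (omega[i, :, None] * A)  # I_K + A' Omega_i A
        cov = np.linalg.inv(prec)
        mean = cov @ (A.T @ (kappa[i] - omega[i] * b))    # V_i A'(kappa_i - Omega_i b)
        new_Theta[i] = rng.multivariate_normal(mean, cov)
    return new_Theta
```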
3.4. Calculation of Final Estimation Results
In the stEM algorithm, since the stE-step is computed stochastically using random numbers as in (9), the parameter sequence does not converge to a fixed value as in the conventional EM algorithm. Therefore, the final estimate is obtained by an operation such as averaging. In [10], the average over the last m steps is used:
$$\hat{A} = \frac{1}{m} \sum_{t = T - m + 1}^{T} A^{(t)}, \qquad (18)$$
where T is the total number of iterations. However, because stochastic operations are performed in the stE-step, even if regularization terms such as the $\ell_1$ penalty or the prenet penalty are added, a component $a_{jk}^{(t)}$ is not necessarily zero in every step $t = T - m + 1, \ldots, T$. Therefore, if the final estimate is obtained using (18), the advantages of regularization cannot be fully exploited. In this study, for the estimation of A, the elementwise median over the last m steps is used instead:
$$\hat{a}_{jk} = \mathrm{median}\left( a_{jk}^{(T - m + 1)}, \ldots, a_{jk}^{(T)} \right).$$
By using the median, if $a_{jk}^{(t)}$ is zero in most of the last m steps, then $\hat{a}_{jk}$ is estimated as zero. Note that for the estimation of $\boldsymbol{b}$, the average over the last m steps is used, as in (18).
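With the iterate traces from the sketch in Section 3 (arrays of shape (T, J, K) and (T, J); names are ours), this aggregation is simply:

```python
import numpy as np

def final_estimates(A_trace, b_trace, m=100):
    """Aggregate the last m stEM iterates: elementwise median for A, mean for b."""
    A_hat = np.median(A_trace[-m:], axis=0)   # zeros survive if a_jk is zero in most iterates
    b_hat = np.mean(b_trace[-m:], axis=0)
    return A_hat, b_hat
```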
4. Numerical Experiments
In this section, we evaluate the performance of the proposed method with the prenet penalty using synthetic data.
Comparison with Lasso
The synthetic data used in this study are generated from the 2-PL MIRT model (1) with a fixed true slope matrix A and intercept vector $\boldsymbol{b}$, and with latent traits drawn from $N_K(\boldsymbol{0}, I_K)$. The parameter $\gamma$ of the prenet penalty is set to 0.1. For the lasso, estimation is performed over a grid of values of the regularization parameter $\rho$, and the BIC is calculated for each; the result with the smallest BIC is chosen as the lasso estimate. Similarly, for our method with the prenet penalty, estimation is carried out over a grid of values of $\rho$, and the result with the smallest BIC is chosen. For both methods, a warm start is used: estimation begins from the largest value of $\rho$, which is then decreased, and the estimation at each step uses the preceding estimate as its initial value.
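The warm-start path over the grid of $\rho$ values, with BIC-based selection, can be sketched as follows; stem_iterates, final_estimates, and bic are the hypothetical helpers sketched in the previous sections, and the grid itself is left to the user.

```python
def fit_path_with_bic(U, K, rho_grid, gamma=0.1, m=100):
    """Fit over a decreasing grid of rho values with warm starts; return the BIC minimizer."""
    best, init = None, None
    for rho in sorted(rho_grid, reverse=True):              # start from the largest rho
        A_tr, b_tr = stem_iterates(U, K, rho, gamma=gamma, init=init)
        A_hat, b_hat = final_estimates(A_tr, b_tr, m=m)
        init = (A_hat, b_hat)                                # warm start for the next value
        score = bic(U, A_hat, b_hat)
        if best is None or score < best[0]:
            best = (score, rho, A_hat, b_hat)
    return best                                              # (BIC, rho, A_hat, b_hat)
```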
First, we evaluate the estimation results of the lasso and the prenet regularization when data are generated 50 times under the smallest of the sample sizes considered. The evaluation metrics used are the mean squared error (MSE) and the correct estimation rate (CER), computed for the estimate obtained from the s-th generated data set.
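Writing $\hat{A}^{(s)}$ for the estimate from the s-th data set and $A^{*}$ for the true slope matrix, one natural definition of the two metrics (the particular normalization by JK is our assumption) is
$$\mathrm{MSE}_s = \frac{1}{JK} \left\| \hat{A}^{(s)} - A^{*} \right\|_F^2, \qquad \mathrm{CER}_s = \frac{1}{JK} \sum_{j=1}^{J} \sum_{k=1}^{K} \mathbb{1}\left\{ \mathbb{1}\left(\hat{a}^{(s)}_{jk} \neq 0\right) = \mathbb{1}\left(a^{*}_{jk} \neq 0\right) \right\},$$
so that the CER is the proportion of entries whose zero/nonzero status is identified correctly.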
The MSE measures how well A is estimated, and the CER measures how well the zero/nonzero structure of A is estimated. Since the estimates from the lasso and the prenet penalty are indeterminate with respect to the sign and permutation of the columns of A, the MSE and CER are calculated for all sign and permutation combinations, and the smallest value is taken. Figure 3 shows the boxplot of the MSE when $\rho$ is selected by the BIC. In the case of the lasso, choosing $\rho$ by the BIC often results in $\hat{A} = O$; a nonzero matrix was estimated in only two of the 50 replications. For the prenet penalty, on the other hand, one component in each row does not become zero, so $\hat{A} \neq O$, and the MSE is generally smaller than that of the lasso. Next, Figure 4 shows the boxplot of the CER. For the lasso, as mentioned above, all components become zero, and the structure of A is not estimated well. For the prenet penalty, the average CER is around 85%, indicating that the structure is estimated well. In summary, with this small sample size, the lasso fails to estimate the structure of A, shrinking all components of $\hat{A}$ to zero, whereas the prenet penalty successfully estimates the structure of A and performs better than the lasso.
Figure 3.
MSE of $\hat{A}$ (first setting). The median (central line of the box), the interquartile range (the box), and the outliers (points beyond the whiskers) are illustrated.
Figure 4.
CER of $\hat{A}$ (first setting). The median (central line of the box), the interquartile range (the box), and the outliers (points beyond the whiskers) are illustrated.
Next, Figure 5 and Figure 6 show the boxplots of the MSE and CER for the estimates obtained when data are generated 50 times in the second setting. As seen in Figure 5, unlike the first setting, the estimates from the lasso are no longer all zero, and the MSE of the lasso is smaller than that of the prenet penalty. However, the CER of the lasso estimates is low, indicating that the lasso does not accurately estimate the structure of A. The prenet penalty, on the other hand, has a slightly larger MSE but a higher CER, successfully estimating the structure of A. In fact, in more than half of the replications the CER reaches 1, perfectly recovering the structure of A.
Figure 5.
MSE of $\hat{A}$ (second setting). The median (central line of the box), the interquartile range (the box), and the outliers (points beyond the whiskers) are illustrated.
Figure 6.
CER of $\hat{A}$ (second setting). The median (central line of the box), the interquartile range (the box), and the outliers (points beyond the whiskers) are illustrated.
Finally, Figure 7 and Figure 8 show the boxplots of the MSE and CER for the estimates obtained when data are generated 50 times in the third setting. As seen in Figure 7, unlike the second setting, the MSE of the prenet estimates is smaller and more stable than that of the lasso. Regarding the CER, the lasso is still mostly unable to recover the structure of A, whereas the prenet penalty recovers the structure of A perfectly in most replications, yielding a CER of 1.
Figure 7.
MSE of $\hat{A}$ (third setting). The median (central line of the box), the interquartile range (the box), and the outliers (points beyond the whiskers) are illustrated.
Figure 8.
CER of $\hat{A}$ (third setting). The median (central line of the box), the interquartile range (the box), and the outliers (points beyond the whiskers) are illustrated.
5. Conclusions
In this study, we proposed a method for clustering test items in the 2-PL MIRT model using the prenet penalty as the structure regularization. The prenet penalty allows each item to be affected by only one latent trait, resulting in a simple structure and making the estimation results easier to interpret. Although the prenet penalty is non-convex and therefore difficult to optimize directly, it is multi-convex, becoming convex when all but one variable are fixed, which allows an efficient solution via the coordinate descent method. In this study, we estimated the parameters efficiently using the stochastic EM algorithm together with the proximal gradient method and the coordinate descent method, which is also used for the lasso. However, the estimation may stop at a local minimum, so it is necessary to run the computation several times with different initial values. In the numerical experiments, we applied the method to synthetic data and demonstrated that it estimates the structure of the item parameter matrix better than the lasso. With the lasso, when the number of subjects is small, using the BIC to determine the regularization parameter results in all estimates becoming zero, while the prenet penalty does not have this issue and estimates well even with a small number of subjects. Moreover, with the lasso all estimates are shrunk, and for a large regularization parameter the estimate becomes $\hat{A} = O$; with the prenet penalty, in contrast, if only one component per row is nonzero the penalty is zero, so such shrinkage is avoided and an estimate that is entirely zero almost never occurs.
In this study, we dealt with the 2-PL MIRT model, in which responses are binary; future research should extend the method to models with a guessing parameter and to cases where responses take multiple values. Also, while this study dealt only with synthetic data, applications to real data and a more detailed evaluation of the performance of the proposed method are necessary. Furthermore, it is necessary to examine how the estimates are affected when additional constraints, such as $a_{jk} = 0$ for $j < k$, are imposed.
Author Contributions
Methodology, R.S. and J.S.; validation and writing, R.S.; supervision, J.S. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by JSPS KAKENHI Grant Number JP23KJ1458 and the Grant-in-Aid for Scientific Research (C) 22K11931.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The evaluation in the paper is based on synthetic data described above.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Details of the M-Step
Appendix A.1. Proximal Gradient Method
Here, we consider the update method in the M-step when the regularization parameter is fixed. The proximal gradient method [11] minimizes a function of the form $h(\boldsymbol{x}) = f(\boldsymbol{x}) + g(\boldsymbol{x})$ with respect to $\boldsymbol{x}$, where f is smooth and g admits an easily computable proximal operator. Starting from an initial value $\boldsymbol{x}^{[0]}$ and using a step size $s > 0$,
$$\boldsymbol{x}^{[u+1]} = \mathrm{prox}_{s g}\left( \boldsymbol{x}^{[u]} - s \nabla f(\boldsymbol{x}^{[u]}) \right), \qquad \mathrm{prox}_{s g}(\boldsymbol{v}) = \arg\min_{\boldsymbol{x}} \left\{ \frac{1}{2s} \|\boldsymbol{x} - \boldsymbol{v}\|^2 + g(\boldsymbol{x}) \right\},$$
is updated in this manner. Here, $\mathrm{prox}_{s g}$ is called the proximal operator. In the case of this study, to optimize (10) at the (t+1)-th step of the EM algorithm, we set
$$f(A, \boldsymbol{b}) = -\sum_{i=1}^{N} \log P(\boldsymbol{u}_i \mid \boldsymbol{\theta}_i^{(t+1)}), \qquad g(A) = \rho\, P(A).$$
Defining $\eta_{ij} = \boldsymbol{a}_j^\top \boldsymbol{\theta}_i^{(t+1)} + b_j$ and $p_{ij} = 1 / (1 + e^{-\eta_{ij}})$, the gradient of f is
$$\nabla_A f = (P - U)^\top \Theta, \qquad \nabla_{\boldsymbol{b}} f = (P - U)^\top \boldsymbol{1}_N.$$
Here, $\Theta = (\boldsymbol{\theta}_1^{(t+1)}, \ldots, \boldsymbol{\theta}_N^{(t+1)})^\top$, $U = (u_{ij})$, $P = (p_{ij})$, and $\boldsymbol{1}_N$ is a vector with all components equal to 1. Therefore, applying the proximal gradient method, at each step u, using the gradients of f,
$$A^{[u+1]} = \arg\min_{A} \left\{ \frac{1}{2s} \left\| A - \left( A^{[u]} - s \nabla_A f \right) \right\|_F^2 + \rho\, P(A) \right\} \qquad (A5)$$
is solved to update A. Here, $\| \cdot \|_F$ is the Frobenius norm, the square root of the sum of the squares of all elements of a matrix.
Since no regularization is applied to $\boldsymbol{b}$, it can be updated by a simple gradient step,
$$\boldsymbol{b}^{[u+1]} = \boldsymbol{b}^{[u]} - s \nabla_{\boldsymbol{b}} f.$$
The update of A, that is, the computation of (A5), is explained in the following subsection.
Appendix A.2. Computation of the Proximal Operator
Here, we consider the update of A in (A5). The prenet penalty is non-convex, which makes a direct update of A challenging. However, since the prenet penalty is multi-convex, optimizing one variable while fixing the others is straightforward. Therefore, we solve (A5) using the coordinate descent method.
Writing $Z = A^{[u]} - s \nabla_A f$ with elements $z_{jk}$, and fixing all elements of row j other than $a_{jk}$, the coordinate-wise update is
$$a_{jk} \leftarrow \frac{\mathcal{S}\left( z_{jk},\; s \rho \gamma \sum_{l \neq k} |a_{jl}| \right)}{1 + s \rho (1 - \gamma) \sum_{l \neq k} a_{jl}^2}, \qquad \mathcal{S}(z, \tau) = \mathrm{sign}(z) \max(|z| - \tau, 0). \qquad (A7)$$
Here, the $z_{jk}$ are values from the previous step of the proximal gradient method. Therefore, using (A7), we update $a_{jk}$ for each $j = 1, \ldots, J$ and $k = 1, \ldots, K$, and repeat until convergence to obtain the proximal operator for A.
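Putting the pieces together, a sketch of the proximal-operator computation by coordinate descent, and of the full M-step by proximal gradient, is given below; the step size, sweep counts, and function names are our own choices, and the objective follows (10) with the prenet penalty as written in (5).

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prenet_prox(Z, s, rho, gamma, n_sweeps=50, tol=1e-8):
    """Proximal operator of s * rho * prenet at Z, computed row-wise by coordinate descent (A7)."""
    A = Z.copy()
    J, K = A.shape
    for _ in range(n_sweeps):
        A_old = A.copy()
        for j in range(J):
            for k in range(K):
                others = np.delete(A[j], k)
                m = np.abs(others).sum()               # sum_{l != k} |a_jl|
                q = np.sum(others ** 2)                # sum_{l != k} a_jl^2
                num = soft_threshold(Z[j, k], s * rho * gamma * m)
                A[j, k] = num / (1.0 + s * rho * (1.0 - gamma) * q)
        if np.max(np.abs(A - A_old)) < tol:
            break
    return A

def m_step_prox(U, Theta, A, b, rho, gamma, n_steps=100, s=None):
    """M-step (10): proximal gradient on the penalized expected log-likelihood."""
    N = U.shape[0]
    if s is None:
        s = 1.0 / N        # a conservative fixed step size; a line search could be used instead
    for _ in range(n_steps):
        eta = Theta @ A.T + b                          # (N, J) linear predictors
        P = 1.0 / (1.0 + np.exp(-eta))                 # fitted response probabilities
        grad_A = (P - U).T @ Theta                     # gradient of the negative log-likelihood
        grad_b = (P - U).sum(axis=0)
        A = prenet_prox(A - s * grad_A, s, rho, gamma) # proximal step for the slopes
        b = b - s * grad_b                             # plain gradient step for the intercepts
    return A, b
```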
References
- Reckase, M.D. Multidimensional Item Response Theory, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
- Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Sun, J.; Chen, Y.; Liu, J.; Ying, Z.; Xin, T. Latent variable selection for multidimensional item response theory models via L1 regularization. Psychometrika 2016, 81, 921–939. [Google Scholar] [CrossRef] [PubMed]
- Fu, W.; Knight, K. Asymptotics for lasso-type estimators. Ann. Stat. 2000, 28, 1356–1378. [Google Scholar] [CrossRef]
- Wainwright, M.J. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 2009, 55, 2183–2202. [Google Scholar] [CrossRef]
- Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563. [Google Scholar]
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
- Hirose, K.; Terada, Y. Sparse and simple structure estimation via prenet penalization. Psychometrika 2022, 88, 1381–1406. [Google Scholar] [CrossRef] [PubMed]
- Zhang, S.; Chen, Y.; Liu, Y. An improved stochastic EM algorithm for large-scale full-information item factor analysis. Br. J. Math. Stat. Psychol. 2020, 73, 44–71. [Google Scholar] [CrossRef] [PubMed]
- Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed]
- Anderson, T.W.; Rubin, H. Statistical Inference in Factor Analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 26–31 December 1954; University of California Press: Berkeley, CA, USA, 1956. [Google Scholar]
- Celeux, G.; Diebolt, J. A stochastic approximation type EM algorithm for the mixture problem. Stochastics 1992, 41, 119–134. [Google Scholar] [CrossRef]
- Lee, J.D.; Sun, Y.; Saunders, M. Proximal Newton-type methods for convex optimization. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9. [Google Scholar]
- Jiang, Z.; Templin, J. Gibbs samplers for logistic item response models via the Pólya–Gamma distribution: A computationally efficient data-augmentation strategy. Psychometrika 2019, 84, 358–374. [Google Scholar] [CrossRef] [PubMed]
- Polson, N.G.; Scott, J.G.; Windle, J. Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Am. Stat. Assoc. 2013, 108, 1339–1349. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).