
## *Article* **Diagnosing a 12-Item Dataset of Raven Matrices: With Dexter**

#### **Ivailo Partchev**

Cito, 6814 CM Arnhem, The Netherlands; Ivailo.Partchev@cito.nl

Received: 21 February 2020; Accepted: 27 April 2020; Published: 6 May 2020

**Abstract:** We analyze a 12-item version of Raven's Standard Progressive Matrices test, traditionally scored with the sum score. We discuss some important differences between assessment in practice and psychometric modelling. We demonstrate some advanced diagnostic tools in the freely available R package, dexter. We find that the first item in the test functions badly—at a guess, because the subjects were not given exercise items before the live test.

**Keywords:** intelligence tests; classical test theory; IRT; interaction model; test-item regression

#### **1. Introduction**

Myszkowski and Storme (2018) have applied a number of binary and polytomous item-response theory (IRT; Lord 1980) models to the last series of Raven's Standard Progressive Matrices (SPM) test (Raven 1941), further referred to as the SPM-LS test. They have made their dataset publicly available, and the *Journal of Intelligence* has proposed a special issue where other researchers are encouraged to present their own analyses.

The idea is not entirely new. Back in 1976, Thissen (1976) tried to apply Bock's nominal response model (Bock 1972) to Raven's matrices in an attempt to throw light on the functioning of the distractors and improve scoring in the lower ability range. It is easy to overlook this publication as it came so incredibly early, some five years before Bock and Aitkin (1981) proposed a really practicable way to estimate the model.

To start with the big question of whether applying complex IRT models to an old, venerable test of intelligence would be an improvement: I have not one but two answers. One is "possibly", the other "certainly not". The duality arises from the fact that it is not possible to have methods and criteria that would be equally appropriate to summative assessment, formative assessment, survey research, methodological research, or substantive research.

Consider assessment. Computer-assisted learning has developed at staggering rates, becoming essentially intertwined with formative assessment. Operating within the effort to increase ability, we can even enjoy the luxury of being able to ask the same item multiple times and observe learning happen. Summative assessment has remained more traditional: We tend to interrupt the learning process for a while, hoping that ability will remain unchanged during testing, and praying that the items have not been compromised by disclosure. The two modes are not simply different—they are more like opposites. Hence, there is no methodological one-size-fits-all—not even within assessment practice.

On the other hand, not everybody who analyzes test data is busy grading exams. Some might be studying populations, as is the case with PISA, TIMSS and friends. Others might be interested in the way people behave when answering educational or intelligence tests. They will come up with ideas and hypotheses whose evidential support will have to be demonstrated, since statements are not limited to a specific individual or projected to a specific finite population but generalized beyond. Goodness of fit plays a very different role in such circumstances than in the more artisanal job of making a measurement instrument for testing.

In the role of researchers, we might for example ask whether persons are guessing responses at random, and we can try to formalize the question into a testable model. It is a perfectly valid discussion Azevedo (2009); Glas (2009); Maris and Bechger (2009); Partchev (2009); San Martín et al. (2009); Thissen (2009); von Davier (2009) whether such a model, say the 3PL, is a good idea from a substantive or mathematical point of view. From my participation in that dispute it is clear that I am not very enthusiastic; see also Appendix B for some results in applying the 3PL model on the SPM-LS dataset. However, this is not the same as porting the 3PL model into assessment practice, the latter being predominantly ruled by the sum score. This is mainly for two reasons: (i) The sum score makes sense in a particular social situation and (ii) it seems to capture most of the essential information in the responses.

As Dorans (2012) notes, commenting on earlier work by Paul Holland, test takers can assume multiple roles: Those of learners, examinees, or contestants. Quoting from his abstract: "Test takers who are contestants in high-stakes settings want reliable outcomes obtained via acceptable scoring of tests administered under clear rules." Turning to sports, where fairness is also a major issue, the 2020 edition of the ATP rulebook (ATP Tour Inc. 2020) defines every conceivable rule and situation in the game of tennis on 374 pages (beating the APA Publication Manual (American Psychological Association 2010) by more than 100 pages). Nothing is left to chance, and everything is specified well before the game starts; imagine, then, how bizarre it would be to define the scoring rules post hoc, based on a fairly opaque analysis of the results, while placidly assuming that athletes cheat as a rule. However, this is exactly what the 3PL model proposes.

Similar objections may be raised against the idea to 'exploit' the potentially useful information in the wrong responses by fitting a nominal response model. Investigate in research—yes; exploit in assessment—rather not. When we are to pass judgement over individuals, our thinking tends to be more binary: Either the distractors are wrong and should get no credit, or they are sensible and should get partial credit. In either case, it should be part of the rules before the referee shouts "Time!".

The need for simple scoring rules that are known before testing has begun, are easily explained to all parties involved, and are widely accepted as fair, is one of the main reasons why most assessment programs tend to rely on the sum score. When the test has more than one form, the choice is mainly between classical test theory (CTT) and equipercentile or kernel equating (still a hot topic, to judge by the number of recent books González and Wiberg 2017; Kolen and Brennan 2014; von Davier 2011; von Davier et al. 2004), or IRT, which provides an alternative solution to the equating problem. However, we would be interested primarily in models with sufficient statistics, such as the Rasch or the partial credit model, because they preserve the scoring rule (in the case of one test form, the ability estimates are just a monotone transformation of the sum score).

Another important advantage is that the degree of misfit of the IRT model would indicate the extent to which our scoring rule misses out on potentially useful information. This is more realistic at the item level, where it can be a valuable tool in quality assurance. At test level and within IRT, it is more difficult to demonstrate misfit in practice (see also Appendix C). The search for that important thing that is not already captured by the sum score has become something of a Holy Grail in psychometrics—since the day when they added a second parameter to the Rasch model and up to the latest advances in cognitive diagnostic assessment (Leighton and Gierl 2007). I have followed with great interest, have often been disappointed, and will probably be just as enthusiastic when the next wave appears.

What follows is an example of the initial data crunching that would be done at an educational testing institute when the data from a new test comes in. A careful exploratory analysis should always precede whatever comes next, whether assessment or further modelling and research; and we should not forget that the properties of an instrument and the properties of a dataset collected with it are not the same thing.

While playing Sherlock Holmes with the SPM-LS data, I take the opportunity to present our freely available R package, dexter (Maris et al. 2019), because it has been developed especially for this purpose and combines traditional and novel methods. The accent is on assessing and understanding item fit. There is no attempt at an exhaustive analysis of the psychometric properties of the 12-item test form, SPM-LS. Raven's matrices have been around for about 80 years and much is known about them—for example, Brouwers et al. (2009) examine 798 applications in 45 countries (N = 244,316) published between 1944 and 2003. Besides, an insight into the properties of the short form can be seen as the collective endeavour of the whole special issue—see, for example, Garcia-Garzon et al. (2019) for a factor-analytic study that shows the SPM-LS to be essentially unidimensional.

#### **2. Materials and Methods**

#### *2.1. Data*

The data is as supplied with the original study by Myszkowski and Storme (2018): The responses of 499 French undergraduate students aged between 19 and 24 to the twelve items of SPM-LS.

#### *2.2. Methods*

All analyses have been performed with dexter (Maris et al. 2019), a freely available package for R (R Core Team 2013). Dexter has been created to be as useful as possible to both researchers and test practitioners, as long as they stay with models that have sufficient statistics for their parameters (Andersen 1973). Every dexter project starts, as appropriate for testing, with a complete enumeration of the scoring rules for each item: Every admissible response gets mapped to an integer, with 0 as the lowest item score. Out of these rules, the program automatically creates a relational database optimized for the typical structure of test data.
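As a toy illustration (my own, not part of the paper): for the first SPM-LS item, whose key is response 7 (see Appendix A), the enumeration amounts to a table of the following form, which keys_to_rules() generates automatically from an answer key:

```
# Hypothetical, hand-written scoring rules for a single 8-option item.
# dexter's keys_to_rules() produces the same kind of table (see Appendix A).
rules_SPM01 = data.frame(item_id = 'SPM01',
                         response = as.character(1:8),
                         item_score = as.integer(1:8 == 7)) # 1 for the key, 0 otherwise
```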

The toolbox for assessing the quality of the items includes, among others:

• classical (CTT) statistics such as item facility and the item-total and item-rest correlations;
• distractor plots: non-parametric regressions of the response alternatives on the total score;
• item-total regression plots comparing the observed regression with the predictions of the Rasch model and of Haberman's interaction model, all described below.
There is a companion package, dextergui Koops et al. (2019), providing an easy graphical user interface (GUI) to the basic functions. The GUI is very convenient: All tables are live, they can be sorted on each column, and clicking anywhere on the table opens up the appropriate graphs. However, in a paper like this it is easier to reproduce a script (see Appendix A) than to explain a GUI.

Readers of this journal will hardly need CTT statistics like item facility and item-total correlation, the Rasch model (Rasch [1960] 1980), or the partial credit model (PCM; Masters 1982) explained. What we call distractor plots are non-parametric regressions of the response alternatives on the total score. We produce them by estimating the density of the total score, overall and for each response alternative, and applying Bayes' rule to obtain the probability of each response alternative given the total score.
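As a rough sketch of this construction (my own illustration, not dexter's implementation; only R's built-in density() is assumed):

```
# Sketch: non-parametric option-total regressions via Bayes' rule.
# resp: the chosen options for one item; score: the total test scores.
distractor_curves <- function(resp, score) {
  rng <- range(score)
  f_tot <- density(score, from = rng[1], to = rng[2], n = 101)$y
  sapply(sort(unique(resp)), function(a) {
    sel <- resp == a
    f_a <- density(score[sel], from = rng[1], to = rng[2], n = 101)$y
    mean(sel) * f_a / f_tot # P(a|s) = P(a) f(s|a) / f(s), by Bayes' rule
  })
}
```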

A useful and novel method is a plot (an example is shown in Figure 1) that compares three item-total regressions:

• the observed regression, shown as pink dots;
• the regression predicted by the Rasch model, shown as a thin black line;
• the regression predicted by Haberman's interaction model, shown as a thicker gray line.

Item-total regressions (ITR) are somewhat similar to item response functions (IRF), but there are some important differences. The IRF combines an unobservable quantity on an arbitrary scale (on the *x* axis) with observable data (on the *y* axis), while the ITR only involves observable data.

What, however, is the interaction model? Well hidden in a book on an entirely different subject, Haberman's interaction model (Haberman 2007) remains relatively unpopular and underestimated. We (the developers of dexter) have found it to be a very useful diagnostic tool, and we have generalized it to also handle polytomous items. The interaction model can be seen equivalently as a model for locally dependent items, a Rasch model where difficulty depends on item and score, and an exponential family model for classical test theory, as can be seen from the following equations:

$$P(\mathbf{x} \mid \theta) \propto \exp\Big(\theta x_+ - \sum_i \beta_i x_i + \sum_i \sum_{j>i} (\sigma_i + \sigma_j)\, x_i x_j\Big) \tag{1}$$

$$P(\mathbf{x} \mid \theta) \propto \exp\Big(\theta x_+ - \sum_i (\beta_i + \sigma_i x_+)\, x_i\Big) \tag{2}$$

$$P(\mathbf{X} \mid \boldsymbol{\theta}) \propto \exp\Big(\sum_i \beta_i x_{+i} + \sum_i \sigma_i \sum_p x_{pi}\, x_{p+} + \sum_s n_s \ln \lambda_s\Big) \tag{3}$$

where *i* and *j* index items, *p* indexes persons, *s* indexes sum scores, and + stands for summation. *x* are observed item responses, **x** a response vector, and **X** a matrix of responses. *θ* are latent abilities, *β* item difficulties, and *σ* are the item-specific interaction parameters featured in Haberman's model. The *λ* are there to reproduce (i.e., perfectly fit) the score distribution, and may be called score-parameters.
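A small derivation, added here for completeness, shows why (1) and (2) are two write-ups of the same model (up to a reparameterization): for binary scores $x_i^2 = x_i$, hence

$$\sum_i \sigma_i x_i x_+ = \sum_i \sigma_i x_i + \sum_i \sum_{j>i} (\sigma_i + \sigma_j)\, x_i x_j,$$

so the quadratic interaction term in (1) differs from the $\sigma_i x_+ x_i$ term in (2) only by a linear term $\sum_i \sigma_i x_i$, which can be absorbed into the item parameters $\beta_i$ (with the sign convention for $\sigma$ flipping between the two forms).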


**Figure 1.** Example plot comparing three item-total regressions for the fourth item. Pink dots show the observed regression (in this case, proportion of correct responses at each distinct total score), predictions from the Rasch model are shown with a thin black line, and those from the interaction model with a thick gray line.

Each of these three representations can serve as the point of departure for a potentially useful discussion. Because our interest here is mainly in item fit, we will concentrate on the third one. We observe that the three terms in the exponential ensure that the model will reproduce perfectly, through the three sets of parameters, *β*, *σ*, and *λ*, the classical item facilities, the correlations of the item scores with the total score, and the distribution of the total scores. Note that this is more or less everything that we want to know about the data within CTT.

Let us return to Figure 1. I have deliberately chosen the item that deviates the most from the Rasch model in having a higher discrimination. The IM readily detects this; in fact, the 2PL model can be shown to be a low-rank approximation to the IM, so we have even more flexibility with the IM than with the 2PL model. However, unlike the two-, three- or many-PL models, the IM has sufficient statistics, it can be estimated via the conditional likelihood, and it makes predictions conditional on the observed total score, not on a hypothesized, latent quantity. This makes it much more appropriate for evaluating item fit.

Observe how, when the Rasch model and the IM deviate for an item, the pink dots representing the empirical item-total regression tend to cluster around the IM curve. This is what one typically sees in practice, and the points tend to get closer to the line as the sample size increases. In other words, not only does the IM reproduce exactly the three most interesting aspects of the response data from a CTT point of view, but it seems to capture all systematic deviations from the Rasch model, leaving out just the random noise. To make the plots even more legible, we have introduced 'curtains' slightly obscuring but not hiding the lower and upper 5% of the data as measured on the test score. This helps concentrate on the really important differences among the three ITR.

Neither the Rasch model nor the IM make any provisions for random guessing. The 3PL model, on the contrary, assumes that people always guess, and then tries to fit a curve with a particular shape to the data. Even if that is successful (the model has an identification problem, as shown in Azevedo 2009; Glas 2009; Maris and Bechger 2009; Partchev 2009; San Martín et al. 2009; Thissen 2009; von Davier 2009), the data can lie near to the curve for many reasons, one of which is random guessing. None of the three models have a device to tell us whether people are actually guessing or not.

The two smoothed ITR start and end at the same points as the observed ITR. Inevitably, both the observed and the predicted item score must be 0 when the total score is 0, and when a person achieves a full total score, the item score for each item must also take the maximum possible value. This gives a specific aspect to the ITR depending on the slope. When an item discriminates better than predicted by the Rasch model, the ITR of the IM retains the same sigmoid shape but gets steeper. When discrimination is low, typical of badly written items, the curve starts to bend, resembling a cubic polynomial. This is particularly expressive when the ITR must accommodate a negative slope in the middle, typical of items with a wrong answer key. When the slope is small or negative, the ITR of the IM suggests that persons of low ability (say, at the left curtain) have a spuriously high probability of a correct response. This is not necessarily due to guessing.

To summarize: I believe that items discriminating similarly to or better than what is expected under the Rasch model can be used without consternation: Differences in the slope will cancel when we sum together even a very modest number of item scores (see also Appendix C). Low discrimination always means trouble of one kind or another. So, my recommended workflow is to catch such items, starting with the item-total and item-rest correlations and proceeding with the item-total regressions. A careful analysis of the distractor plots for the offending items will help diagnose what is wrong with the item and suggest a revision.

#### **3. Results**

The 12-item test, SPM-LS, has a decent Cronbach alpha of 0.81, and an average item facility of 0.65. The average correlation with the total score (rit) is 0.57, and the average correlation with the rest score (rir) is 0.47.

Table 1 shows the essential item level CTT statistics. As expected from the structure of the SPM test, the item facilities progressively decrease with the notable exception of the first item. Discrimination, as characterized by the rit and the rir, is highest in the middle and lowest at both ends, which is what one would expect from point-biserial correlations. However, the discrimination for the first item is a bit too low, especially as the item does not appear to be as easy as anticipated.

**Table 1.** Selected item level CTT statistics for the SPM-LS data set.


These observations are supported by the plots in Figure 2.

**Figure 2.** Item facility (**left**) and correlation with the rest score (**right**) by position of the item in the SPM-LS test.

The slight hint that the first item, SPM01, may be out of line with the others becomes quite dramatic when we examine the ITR plots (Figure 3). Just compare the plots for the first two items, which are supposed to be very similar. For item SPM02, the Rasch model and the IM agree almost perfectly. According to both models, persons of low ability, say at the left curtain (fifth percentile), have a high probability, well over 0.5, of answering correctly, but this is obviously because the item is very easy.

The plot for item SPM01 is very different. The IM curve has a very particular and irregular shape, thanks to which it deviates sharply from the Rasch model in the lower ability range, but much less in the upper range. What is going on among persons of lower ability? Are they guessing? The pseudo-guessing parameter in the 3PL model (Appendix B) is equal to zero for both items, and what would be the logic of guessing when the item is so easy? Why on the first item but not on the second?

Figure 4 shows distractor plots (non-parametric regressions of each response alternative to an item on the total test score), which are a less sophisticated but very detailed and useful diagnostic tool when an item appears spurious on the traditional CTT statistics and/or ITR plots. I have included the plots for all 12 items because an anonymous reviewer suggested that the reader would like to see them, and I completely agree; however, the idea is that, in practice, the CTT statistics and the ITR plots will help us narrow down the detailed diagnostic plots that we need to examine to the most problematic items.


**Figure 3.** Item-total regressions for the items in the SPM-LS test obtained from the data (pink dots), the Rasch model (thin black lines), and the interaction model (thick gray lines).


**Figure 4.** Non-parametric option-total regressions (distractor plots) for the twelve items in the SPM-LS test. The title of each plot shows the item label, in which booklet the item appears, and in what position. The legend shows the actual responses and the scores they will be given. Response alternatives that do not show up have not been chosen by any person.

Looking at the distractor plot for item SPM01, we observe that most of the seven distractors are not chosen at all, while one, represented with the blue line, is quite popular. When that happens, the item is effectively transformed into a coin tossing game. If this were a cognitive test, I would recommend rewriting the item. However, this is a matrix item, abstract and constructed in a principled manner, so the only explanation that comes to mind is that the test was given without a couple of items on which the examinees could train. For lack of these, the first item served as an exercise item.

Similar, but milder effects are observed on the ITR for items SPM09, SPM12, and possibly SPM11. The 3PL model (Appendix B) has larger pseudo-guessing parameters for these items. The distractor plots (Figure 4) show that all distractors are in use, but some of them tend to peak in the middle, a bit like the middle category in a partial credit item. There might be some reason for this in the way Raven items are constructed.

There is more logic in guessing when the item is difficult, especially if the stakes are high, but is this what is happening? Possibly. As one sharp-witted psychometrician once told me, the trouble with the research on guessing is that so much of it is guesswork. On the other hand, there must be ways to make guessing behaviour more explicit and include it in the rules of the game that we are playing. For example, one could have the subjects order the response alternatives by their confidence that each is correct, or use a continuous response model as described in Verhelst (2019).

#### **4. Discussion**

To a pool of different analyses of the same dataset, I have contributed a specimen of the exploratory analysis we typically do when developing a cognitive test for assessment. My purpose was mainly to increase the diversity in professional perspectives, and to popularize some novel and useful diagnostic tools in our software.

While I use the Rasch model and the less popular interaction model, the focus is not on modelling, not even on the more traditional psychometric analysis of an intelligence test. Capitalizing on the fact that the models share the same scoring rule as the original test, the sum score, I use them to evaluate and support the scoring rule, and to highlight items that possibly go astray. I might have relied more heavily on the models in different circumstances: For example, if the test had more than one form (the Rasch model is useful for equating), or if I were interested in research, not in an instrument to assess persons.

The way in which I use the models explains why a paper that deals essentially with model fit does not treat model fit in the way typical of scientific research. I did not put forward any model to explain the data, in which case model fit would be an argument supporting my ideas. I did formulate a hypothesis or, rather, a guess (confirmed later) when I found out that a certain item did not follow my preferred model. In this case, model fit was about quality control more than about anything else.

I am certainly not original in pointing out that summative assessment, formative assessment, population surveys, methodological research and substantive research are sufficiently different to have not only distinct but sometimes even mutually exclusive criteria on what is desirable, appropriate, or admissible. This is fine as long as it is not forgotten and ignored.

In the final run, my story has three morals: (i) The way you should go "depends a good deal on where you want to get to" Carroll (1865), (ii) whatever the destination, always do exploratory analysis first, and (iii) in practical assessment, the model should follow from the scoring rule, not vice versa.

**Funding:** This research received no external funding.

**Acknowledgments:** I am greatly indebted to the other persons behind dexter: Gunter Maris and Timo Bechger at ACTNext, and Jesse Koops at Cito. Special thanks to the reviewers and to Timo Bechger and Norman D. Verhelst for many useful suggestions. Cito has graciously consented to cover the publication fee.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

CTT classical test theory
IRT item response theory
IRF item response function
ITR item-total regression
IM (Haberman's) interaction model
PCM partial credit model
SPM (Raven's) Standard Progressive Matrices
SPM-LS the last series of the SPM
GUI graphical user interface
DIF differential item functioning

#### **Appendix A**

This is a minimal script to perform all analyses in this paper. It does not cover the final formatting of the tables and graphs. Note that the original dataset was modified slightly by hand: column names were changed from SPM1, SPM2, ... to SPM01, SPM02, etc.

```
library(dexter)                       # load the dexter library
setwd('~/WD/Raven')                   # set the working directory
keys = data.frame(                    # data frame as required
  item_id = sprintf('SPM%02d', 1:12), # by the keys_to_rules function
  noptions = 8,
  key = c(7,6,8,2,1,5,1,6,3,2,4,5)    # (the correct responses)
)
rules = keys_to_rules(keys)           # scoring rules as required by dexter
db = start_new_project(rules, 'raven.db') # data base from the rules
dat = read.csv('dataset.csv', head=TRUE) # read in data...
add_booklet(db, dat, 'r') # ... and add to the data base
tia_tables(db) # tables of CTT statistics
mo = fit_inter(db) # fit the Rasch and the IM
plot(mo) # produce all ITR plots
distractor_plot(db,'SPM01') # distractor plot for item SPM01
```
#### **Appendix B**

I tried to estimate the 3PL model for the SPM-LS dataset with three different programs: The R package mirt Chalmers (2012), the R package ltm Rizopoulos (2006), and the long-time flagship in educational testing, BILOG-MG Zimowski et al. (1996). All options in the R packages were held at their defaults, and no priors were used in BILOG-MG to constrain any of the three parameters during estimation. The results are shown in Table A1.
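For the two R packages, the default-settings calls would look roughly as follows (a sketch under the stated defaults; `dat` is the scored 0/1 response matrix):

```
library(mirt) # Chalmers (2012)
library(ltm)  # Rizopoulos (2006)
fit_mirt = mirt(dat, 1, itemtype = '3PL')             # unidimensional 3PL, defaults
coef(fit_mirt, IRTpars = TRUE, simplify = TRUE)$items # a, b, g (pseudo-guessing)
fit_ltm = tpm(dat)                                    # ltm's 3PL, defaults
coef(fit_ltm)
```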

We observe reasonably good agreement between mirt and BILOG-MG, while the ltm estimates deviate more. Interestingly enough, the estimates of the pseudo-guessing parameter, *c*, seem to agree the most among the three programs. They are also logical: since the items are arranged by increasing difficulty, there is no logic to guessing on the easy items, so the guessing parameter is zero or close to zero. For the two most difficult items, it is close to 1/8, but it is difficult to explain why people should want to guess more on item 9. Moreover, all three programs seem to struggle with the discrimination parameter of item 11, quite seriously in the case of ltm.


**Table A1.** Parameter estimates for the 3PL model obtained for the SPM-LS dataset with three different programs.

I made another comparison using only BILOG-MG and playing with the available priors for constraining parameter estimates. Table A2 shows results without any priors at all (same as in Table A1); with a lognormal(0, 0.5) prior on the discrimination parameter, *a*, and no prior on *c*; and with a lognormal(0, 0.5) prior on *a* and a beta(20 × 1/8 + 1, 20 × 7/8 + 1) prior on *c* (1/8 and 7/8 obtain from the fact that all items have 8 possible responses). The prior on *a* slightly alleviates the problem with the extremely high estimate, while the prior on *c* simply invents guessing where we could not possibly have information on it: The less reason people have to guess, the more the estimate drifts towards 1/8.


**Table A2.** Parameter estimates for the 3PL model obtained for the SPM-LS dataset with BILOG-MG and three different settings.

#### **Appendix C**

One of the reviewers has asked me to add some references on model fit at test level. Taken sufficiently seriously, this is not quite as easy as it may seem. Flowing out of the theory of errors, CTT is very concerned with test reliability and validity. Classical texts on CTT (Gulliksen 1950) have entire chapters devoted to, say, the effect of test length on test error, reliability, and validity. IRT has an indisputable contribution in focusing on the item and item fit, but it may have gone a bit too far, overlooking the proverbial forest for the sake of the trees.

For the more pragmatic outlook of this paper, an important reference concerned with the practical implications of model misfit is Sinharay and Haberman (2014). Reasoning similar to mine, but pertaining to differential item functioning (DIF) rather than item fit, is found in Chalmers et al. (2016).

In what follows, I will try to avoid the item level—test level dichotomy, and steal a peek in-between. Our software, dexter Maris et al. (2019), has a handy function, fit\_domains(), for the analysis of subtests within the test. The function transforms the items belonging to each subtest, or domain, into one large partial credit item. Such 'polytomisation', as discussed by Verhelst and Verstralen (2008), is a simple and efficient way to deal with testlets. The formal, constructed, and homogeneous nature of the SPM-LS test makes it a good candidate for some further experimentation. Note that I am not proposing a new method—I am just being curious.

I start by combining item 1, intended to be the easiest, with item 7, of medium difficulty. Item 2 will be combined with item 8, item 3 with item 9, and so on. We end up with 6 partial credit items combining an easier item with a more difficult one or, if you wish, six testlets or subtests made to be as parallel as possible. Their category trace lines are shown on Figure A1.
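In dexter, this polytomisation can be scripted by declaring an item property and calling fit_domains(); a sketch continuing the Appendix A script (the property name `pair` is my own):

```
# Pair item k with item k+6 (SPM01 with SPM07, etc.) and refit per testlet.
pairs = data.frame(item_id = sprintf('SPM%02d', 1:12),
                   pair    = sprintf('pair%d', rep(1:6, 2)))
add_item_properties(db, pairs)               # attach the property to the project
mp = fit_domains(db, item_property = 'pair') # PCM and interaction model per testlet
plot(mp)                                     # trace lines / ITR as in Figures A1-A2
```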

We can also examine the ITR (Figure A2), which are comparable to the ITR for the original test items; the item score on the *y* axis is also the 'test score' of the six subtests. We observe better item fit and a closer correspondence between the regressions predicted by the two models.

The next step will be to combine triplets of items: 1, 5, and 9; 2, 6, and 10, etc. Perhaps not surprisingly, the two models and the data come even closer (Figure A3).

**Figure A1.** Category trace lines for partial credit items obtained by combining the original items SPM01 and SPM07 (Item 1), SPM02 and SPM08 (Item 2) etc. The partial credit model is shown with thinner and darker lines, and the polytomous IM with broader and lighter lines of the same hue.

**Figure A2.** Item-total regressions for partial credit items obtained by combining the original items SPM01 and SPM07 (Item 1), SPM02 and SPM08 (Item 2) etc. Observed data is shown with pink dots, the PCM with thin black lines, and the interaction model with thick gray lines.

**Figure A3.** Item-total regressions for partial credit items obtained by combining triplets of items. Observed data is shown with pink dots, the PCM with thin black lines, and the interaction model with thick gray lines.

Quite predictably, the next step is to combine quadruples of items, and the result is even better fit (Figure A4).

**Figure A4.** Item-total regressions for partial credit items obtained by combining quadruples of items. Observed data is shown with pink dots, the PCM with thin black lines, and the interaction model with thick gray lines.

Finally, we have two subtests of six items each, one consisting of the odd-numbered items, and the other of the even-numbered items in the original test (Figure A5). This progression is the closest approximation to item fit at test level from an IRT perspective that I can produce as of now.

**Figure A5.** Item-total regressions for two subtests of six items each. Observed data is shown with pink dots, the PCM with thin black lines, and the interaction model with thick gray lines.

#### **References**





## *Article* **A Mokken Scale Analysis of the Last Series of the Standard Progressive Matrices (SPM-LS)**

#### **Nils Myszkowski**

Department of Psychology, Pace University, New York, NY 10038, USA; nmyszkowski@pace.edu

Received: 10 January 2020; Accepted: 24 April 2020; Published: 6 May 2020

**Abstract:** Raven's Standard Progressive Matrices (Raven 1941) is a widely used 60-item measure of general mental ability. It was recently suggested that, for situations where taking this test is too time consuming, a shorter version, comprised of only the last series of the Standard Progressive Matrices (Myszkowski and Storme 2018), could be used, while preserving satisfactory psychometric properties (Garcia-Garzon et al. 2019; Myszkowski and Storme 2018). In this study, I argue, however, that some psychometric properties have been left aside by previous investigations. As part of this special issue on the reinvestigation of Myszkowski and Storme's dataset, I propose to use the non-parametric Item Response Theory framework of Mokken Scale Analysis (Mokken 1971, 1997) and its current developments (Sijtsma and van der Ark 2017) to shed new light on the SPM-LS. Extending previous findings, this investigation indicated that the SPM-LS had satisfactory scalability (*H* = 0.469), local independence and reliability (*MS* = 0.841, *LCRC* = 0.874). Further, all item response functions were monotonically increasing, and there was overall evidence for invariant item ordering (*HT* = 0.475), supporting the Double Monotonicity Model (Mokken 1997). Item 1, however, appeared problematic in most analyses. I discuss the implications of these results, notably regarding whether to discard item 1, whether the SPM-LS sum scores can confidently be used to order persons, and whether the invariant item ordering of the SPM-LS allows the use of a stopping rule to further shorten test administration.

**Keywords:** Mokken scale analysis; non-parametric item response theory; psychometrics; invariant item ordering

#### **1. Introduction**

The general factor of intelligence (*g*) is central in the prediction of several outcomes, such as job performance (Ree and Earles 1992; Salgado et al. 2003) and academic achievement (Rohde and Thompson 2007). Its accurate measurement is therefore crucial in multiple contexts, including personnel selection, vocational guidance or academic research in individual differences. However, because of practical constraints, it is desirable in many contexts to reduce test length as much as possible, while maintaining acceptable accuracy.

Raven's Standard Progressive Matrices (SPM) (Raven 1941) and Advanced Progressive Matrices (APM) (Raven et al. 1962) are widely used—though also criticized (Gignac 2015)—instruments to measure *g*. However, both these tests remain rather long in untimed conditions, with some participants sometimes taking more than 40 min to respond to them (Hamel and Schmittmann 2006). Several solutions have been proposed to further reduce test administration time, such as constraining time (Hamel and Schmittmann 2006) and using short versions (Bors and Stokes 1998).

While these solutions have focused on the APM (Hamel and Schmittmann 2006; Bors and Stokes 1998), Myszkowski and Storme (2018) have recently suggested that the last series of the SPM—the SPM-LS—could be a more efficient solution, with only 12 items, while maintaining the progressive aspect characteristic of Raven's matrices, along with satisfactory psychometric properties. However, the original study (Myszkowski and Storme 2018)—which I propose to extend—studied the SPM-LS with parametric Item Response Theory (IRT) models, and largely focused on recovering information from distractor responses using nested logit models (Suh and Bolt 2010; Storme et al. 2019), therefore putting aside important aspects of the test—such as the monotonicity of item responses and invariant item ordering, which I discuss further below. I propose here to bridge these gaps using the framework of Mokken Scale Analysis (MSA) (Mokken 1971, 1982, 1997), a well-developed non-parametric item-response theory framework that is particularly appropriate to address them (Sijtsma and van der Ark 2017).

#### *1.1. The SPM-LS*

While the SPM is heavily studied, the SPM-LS is very recent, and thus has not been the object of many investigations. Currently, it has only been studied in its original investigation (Myszkowski and Storme 2018)—which used binary and nominal IRT models—and as part of this special issue through a further investigation of its dimensionality (Garcia-Garzon et al. 2019). Investigations of the SPM-LS indicated that IRT models could satisfactorily fit test responses (Bürkner 2020; Myszkowski and Storme 2018), and that the test seemed to present adequate reliability/information for abilities ranging from about 2 standard deviations below the mean—or 3 if recovering information from distractors—to 1.5 to 2 standard deviations above the mean (Myszkowski and Storme 2018), in a sample of undergraduate students, suggesting that it could be more appropriate in terms of difficulty for the general population than for post-secondary students. In addition, Garcia-Garzon et al. (2019) notably studied in this special issue the dimensionality of the SPM-LS using a variety of methods—Exploratory Graph Analysis (EGA), bifactor Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). Overall, the psychometric qualities of the SPM-LS so far appeared satisfactory for use by researchers and practitioners, but some characteristics have not been studied, for which Mokken Scale Analysis is a particularly appropriate framework.

#### *1.2. Mokken Scale Analysis*

Since its inception (Mokken 1971), Mokken Scale Analysis has been the object of several methodological developments, notably discussing how to evaluate the properties of instruments evaluated with MSA (Van der Ark 2012), best practices in MSA (Sijtsma and van der Ark 2017) and the active development of a package (Van der Ark 2007) for the statistical programming language R. While it is largely and more thoroughly described elsewhere (Mokken and Lewis 1982; Mokken 1997; Van der Ark 2007; Sijtsma and van der Ark 2017; Sijtsma 1998), I could briefly describe Mokken Scale Analysis (MSA) (Mokken and Lewis 1982; Mokken 1997) as a non-parametric IRT framework which, for dichotomous responses, represents the probability of succeeding on an item *j* as a function of a person *i*'s latent ability—*θi*. Unlike the Rasch model (Rasch 1993) and, more broadly, unlike binary logistic and normal ogive models—which are said to be parametric IRT models (Mokken and Lewis 1982)—MSA does not represent the relation between latent ability and item responses using item parameters, but using an item-response function only defined as monotonically increasing (Mokken and Lewis 1982).

#### *1.3. The Benefits of Mokken Scale Analysis*

Because they do not require response functions to have a specific shape, Mokken's models are less constrained than (notably) Rasch models (Meijer et al. 1990), which implies that some items that are not well fitted by Rasch models may still be scalable with MSA, because their response function may be monotonic without necessarily having a logistic/normal ogive shape. While MSA does not allow certain applications otherwise permitted by Rasch modeling, like test equating or computer adaptive testing, Meijer et al. (1990, p. 297) note that, "for many testing applications, it often suffices to know the order of persons on an attribute". Therefore, Mokken scaling is attractive for the reason that it focuses mainly on a test's capacity to order persons, while allowing for more items to fulfill its requirements than Rasch models do. In the context of the SPM-LS, this is particularly interesting, especially as Myszkowski and Storme (2018) had to use highly parametrized models to achieve an acceptable fit, with 3- and 4-parameter models fitting much better than notably the Rasch 1-parameter model—in this special issue, Bürkner (2020) reaches a similar conclusion using Bayesian IRT. Instead of increasing the number of parameters to better fit item responses—and risking overfitting and thus compromising reproducibility—Mokken scaling proposes to retain fewer (but fundamental) uses of a test: Ordering persons (for both MSA models) and items (for the Double Monotonicity model only).

#### 1.3.1. The Monotone Homogeneity and Double Monotonicity Models

For dichotomous items, Mokken (1997) defined two item-response models: The Monotone Homogeneity Model (MHM) and the Double Monotonicity Model (DMM). Both the Monotone Homogeneity Model and the Double Monotonicity Model assume the monotonicity of item response functions. However, the two models differ in that only the Double Monotonicity Model assumes that item response functions do not intersect—an assumption usually referred to as invariant item ordering.

Before focusing on these two assumptions central to MSA, as well as their consequences in the context of the SPM-LS, it is important to note that both models also assume unidimensionality, meaning that they both assume that the same latent attribute *θ<sup>i</sup>* explains the item scores—therefore also assuming local independence (Sijtsma et al. 2011). While MSA offers procedures (also used in this study) to investigate this assumption, they would probably not justify a new study, because the dimensionality of the SPM-LS has been, on this very dataset, investigated with a plethora of psychometric methods (Myszkowski and Storme 2018; Garcia-Garzon et al. 2019). I will therefore mainly focus here on the *incremental* value of using Mokken Scale Analysis in addition to these previously used approaches.

#### 1.3.2. Monotonicity of Item Response Functions

An important feature of MSA is that it allows one to study monotonicity, where parametric Item-Response Theory and traditional (linear) factor analysis models generally leave this assumption untested. Indeed, although parametric item response models for binary responses are (in general) monotonic, a misfitting item does not necessarily indicate that the item response is non-monotonic (Meijer et al. 1990). Therefore, because it has only been studied with parametric response models, the monotonicity of the SPM-LS has remained untested so far. This characteristic is manifested, in pass-fail (binary) tests like the SPM-LS, by item response functions that are monotonically increasing. This means that the probability to succeed on an item monotonically increases with the examinee's ability. In contrast with parametric IRT, the framework of Mokken Scale Analysis offers methods to investigate this property and specifically identify its violations (Van der Ark 2007). I therefore propose, in the present study, to use this framework to bridge that gap in the study of the test.

This study is therefore the first to examine the monotonicity of the SPM-LS, which, as previously noted (Van der Ark 2007), is relevant not only to Mokken scaling, but also to any model that formulates this assumption, such as parametric Item-Response Theory models and traditional factor analysis. It is an essential psychometric property of a test, because it is the property that implies that higher scores imply higher abilities (for all items, at any ability level), and thus that scores can be used to infer person ordering (Van der Ark 2007).

#### 1.3.3. Invariant Item Ordering

While both the Monotone Homogeneity Model and the Double Monotonicity Model assume unidimensionality and monotonicity of the response functions, only the Double Monotonicity Model assumes that the ordering of the items (based on their difficulty) is the same for all examinees (Mokken 1997; Sijtsma and van der Ark 2017; Sijtsma et al. 2011). In other words, this property, referred to as Invariant Item Ordering (IIO), assumes that, for any given item pair, the easier item has a higher probability of being succeeded than the more difficult one at any ability level. This manifests itself graphically by the item response functions of the two items not intersecting.

As was previously noted (Sijtsma et al. 2011; Sijtsma and van der Ark 2017), this property is an important feature of a test, as it "greatly facilitates the interpretation of test scores" (Sijtsma and van der Ark 2017, p. 143), and is "both omnipresent and implicit in the application of many tests, questionnaires, and inventories" (Ligtvoet et al. 2010, p. 593). Indeed, a stronger IIO implies that two persons with the same total score are more likely to have succeeded the same items, and that an examinee with a higher total score than another examinee is more likely to have answered correctly the same items, and one or several more difficult items. Therefore, invariant item ordering lends more meaning to person comparisons based on total scores.

In addition, IIO is especially relevant for the SPM-LS, because its items substantially vary in difficulty and are presented by increasing difficulty. A stronger IIO implies that, if an examinee fails an item, there is an increased probability that the examinee will fail the next (more difficult) one. Therefore, a stronger IIO would suggest that we can envision stopping the test administration after one or several items have been failed (Sijtsma and Meijer 1992). This would present practical advantages, notably for shortening test administration.

#### **2. Materials and Methods**

#### *2.1. Participants*

Per the topic of this special issue, I re-analyzed the publicly available dataset from the Myszkowski and Storme (2018) study. The original study presented various parametric IRT analyses performed on a dataset comprised of 499 students (214 males and 285 females) aged between 19 and 24. Because I directly reanalysed this dataset, I point to the original article for more details on data collection and sample characteristics.

One thing to note that is specific to this paper is that the sample size is both similar to the one used in Sijtsma and van der Ark's tutorial on Mokken scale analysis (Sijtsma and van der Ark 2017) and, more importantly, in accordance with the sample size recommendations provided by Straat et al. (2014). They show (p. 817) that a sample size of around 500 is largely sufficient for an accurate analysis with scalability coefficients *Hj* of 0.42 (or higher)—in the results, I present scalability coefficients and show that the scalability of the scale meets that requirement.

#### *2.2. Instrument*

The Last Series of the Standard Progressive Matrices, or SPM-LS (Myszkowski and Storme 2018), was built from the original Standard Progressive Matrices (Raven 1941), a very popular and extensively researched test of non-verbal logical reasoning, which is also frequently used as a brief measure of *g*, the general factor of intelligence. As its name indicates, it consists of the last—and thus most difficult—series of the original SPM, but used as a standalone test (without examinees taking previously the other series). It is composed of 12 items of theoretically increasing difficulty. Each item consists of an incomplete 3-by-3 matrix, with the last element of the matrix being missing. The examinee is to identify, among eight options—seven distractors and one correct response—the missing matrix element.

Research shows that *g* is far from being extensively or purely captured by the SPM (Gignac 2015; Carpenter et al. 1990), and this is certainly even more true of the SPM-LS, since it is a shortened version. Nevertheless, the SPM, and a fortiori the SPM-LS, present the advantage of being short measures, with overall satisfactory reliability. In particular, the SPM-LS, in its original investigation on this dataset (Myszkowski and Storme 2018), presented encouraging evidence of reliability, with observed reliabilities based on IRT modeling that ranged from 0.78 to 0.84 depending on the IRT model used, and a Cronbach's *α* of 0.92.

As unidimensionality is an assumption of Mokken Scale Analysis (Van der Ark 2007), it is also important to note that the SPM-LS investigations indicated that the test is essentially unidimensional, with a McDonald's coefficient *ω<sup>h</sup>* of 0.86 (Myszkowski and Storme 2018) and satisfactory fit of unidimensional models (Myszkowski and Storme 2018). Garcia-Garzon et al.'s (2019) explorations also supported unidimensionality, in spite of a nuisance factor specific to the last six items.

#### *2.3. Analysis*

Because Sijtsma and van der Ark's tutorial on Mokken scale analysis (Sijtsma and van der Ark 2017) presents the current state of the art of Mokken scale analysis and lays out clearly the different steps to take in order to perform one, I followed the steps provided in the tutorial. All analyses were computed using the same team's regularly updated and comprehensive R package mokken (Van der Ark 2007, 2012; Sijtsma et al. 2011) (version 2.8.11).

A reason for the popularity of Mokken scaling is the availability of an automatic procedure to select a set (or several sets) of scalable items, a procedure generally referred to as the Automated Item Selection Procedure (AISP), which aims at maximizing scalability. In addition, Straat et al. (2016) also recently suggested an item selection procedure which aims to maximize local independence. Likewise, a stepwise selection procedure aiming at maximizing invariant item ordering has been proposed (Ligtvoet et al. 2010). Still, it was decided here that the primary objective of the present study would be to investigate the SPM-LS as an a priori scale, meaning that the main objective was to investigate its qualities using Mokken Scale Analysis, not to carve a revised instrument out of it. This decision was motivated by the fact that the SPM-LS is already a very short measure (12 items), and also because, in the SPM-LS, the very process of solving items is—at least theoretically—used to help the examinee learn the rule(s) used in subsequent items (Myszkowski and Storme 2018). Therefore, even if an item were to present poor qualities (e.g., weak scalability), it might still be useful as training for the other items, and thus it may still be preferable or conservative to keep it.

#### 2.3.1. Data Preparation

The dataset analyzed did not present any missing data or impossible responses. Sijtsma and van der Ark (2017) recommend, as a preliminary step to Mokken Scale Analysis, to filter out cases whose responses are dubious, and they suggest doing so using the count of Guttman errors. I proceeded to count the number of Guttman errors *G*+ per case, computed with the function check.errors() of the mokken package. There were a total of 2021 Guttman errors, indicating that the items did not constitute a Guttman scale.

As Sijtsma and van der Ark (2017) suggested, I identified as dubious—and consequently removed—the cases for which *G*+ indices were beyond the upper Tukey fence of the distribution of *G*+ indices. This corresponded to cases with more than 15 Guttman errors, and resulted in the elimination of 14 cases (1.17% of cases) with suspicious item-score patterns. The frequency histogram of *G*+ indices, with the Tukey fence, is presented in Figure 1.

**Figure 1.** Histogram of the count of Guttman errors (*G*+), with Tukey fence (3rd quartile +1.5 × *IQR*) used as a threshold for outlier detection.
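For concreteness, here is a minimal sketch of this screening step (my own illustration; it assumes the scored 0/1 responses are in a matrix X, and that check.errors() returns the *G*+ counts via a returnGplus argument, as in recent versions of mokken):

```
library(mokken) # Van der Ark (2007, 2012)
# Count Guttman errors per respondent and drop cases beyond the upper Tukey fence.
Gplus <- check.errors(X, returnGplus = TRUE)$Gplus
fence <- quantile(Gplus, 0.75) + 1.5 * IQR(Gplus) # 3rd quartile + 1.5 x IQR
X <- X[Gplus <= fence, ]                          # keep only non-dubious patterns
```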

#### 2.3.2. Scalability

As recommended by Sijtsma and van der Ark (2017), I investigated the scalability of the complete SPM-LS by computing *Hjk* (scalability coefficients for item pairs), *Hj* (scalability coefficients for items) and *H* (the total scalability coefficient of the scale). I used the rules of thumb originally proposed by Mokken (1997) and currently suggested by Sijtsma and van der Ark (2017), which are *H* < 0.3 for insufficient scalability, 0.3 ≤ *H* < 0.4 for weak scalability, 0.4 ≤ *H* < 0.5 for medium scalability and *H* ≥ 0.5 for strong scalability. Since the Monotone Homogeneity Model implies that *Hjk* and *Hj* are all positive (and ideally as close to 1 as possible), I searched for negative values (or values close to 0) as violations of monotonicity (Sijtsma and van der Ark 2017).
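In mokken, these coefficients come from a single call; a minimal sketch, assuming the cleaned response matrix X from above:

```
scal <- coefH(X) # scalability coefficients of item pairs, items, and the scale
scal$Hij         # pairwise coefficients, as in Table 1
scal$Hi          # per-item coefficients
scal$H           # total scalability of the SPM-LS
```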

#### 2.3.3. Local Independence

Local independence is an assumption of both the Monotone Homogeneity Model and the Double Monotonicity Model. It implies that item scores are independent for a given ability level *θ*. As suggested in Sijtsma and van der Ark (2017)'s tutorial, I used the procedure proposed by Straat et al. (2016) to study local dependencies in the SPM-LS. They suggest the computation of three series of indices: *W*1, *W*2 and *W*3. While the computation of these indices is further explained in the original article, we can note that high *W*1, *W*2 and *W*3 values indicate local dependencies. High *W*1 values indicate that an item pair is likely positively locally dependent. An item with a high *W*2 is likely to be positively locally dependent with another item. High *W*3 values indicate that an item pair is likely negatively locally dependent. Again here, and as Straat et al. (2016) suggested, a Tukey fence was used to detect problematic items.
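In the mokken package, this procedure is implemented as the conditional-association check; a sketch (I assume check.ca() with its W-index output, which may differ across package versions):

```
ca <- check.ca(X, Windex = TRUE) # conditional association diagnostics
ca$Index                         # the W1, W2 and W3 indices, as in Table 2
ca$Flagged                       # items beyond the Tukey fence
```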

#### 2.3.4. Monotonicity

As recommended by Sijtsma and van der Ark (2017), I studied monotonicity by plotting item response functions, using a non-parametric regression of each item score on "rest scores" (the total scores on the other items) (Junker and Sijtsma 2000). Following the defaults of the check.monotonicity() function of the mokken package (Van der Ark 2007), the rest scores were grouped using a minimum size of *N*/5 (meaning groups of 100 cases at least). The identified violations were then tested for significance (Van der Ark 2007; Molenaar and Sijtsma 2000).
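A sketch of the corresponding call, with the defaults described above:

```
mono <- check.monotonicity(X) # rest-score groups of at least N/5 cases
summary(mono)                 # violations of monotonicity and significance tests
plot(mono)                    # item response functions, as in Figure 2
```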

As an alternative to testing monotonicity violations, it has recently been proposed that positive evidence for monotonicity can be gathered through Bayes factors (Tijmstra et al. 2015; Tijmstra and Bolsinova 2019). Based on a suggestion by a reviewer that I use this procedure, I contacted the first author of these papers, who provided code to implement it. The procedure is discussed in more detail in the original paper (Tijmstra et al. 2015), but, in short, it consists in evaluating the relative amount of support from the data for (strict) manifest monotonicity—denoted hypothesis *HMM*—against the competing hypothesis that there is at least one manifest non-monotonicity—denoted hypothesis *HNM*—and against the competing hypothesis of essential monotonicity—denoted *HEM*, defined as a form of monotonicity that allows for non-monotonicities between adjacent manifest scores. Bayes Factors *BFMM*,*NM* and *BFMM*,*EM* were estimated through Gibbs sampling, and used to indicate support for *HMM* in contrast with *HNM* and *HEM* respectively. Values above 1 indicate more support for *HMM* than for the competing hypothesis. 20,000 iterations were used as burn-in and discarded, and 100,000 iterations were subsequently used to estimate the Bayes Factors—which is more conservative than initially suggested (Tijmstra et al. 2015).

#### 2.3.5. Invariant Item Ordering

Even though Invariant Item Ordering (IIO) is only an assumption of the Double Monotonicity Model for binary items (Mokken 1997)—not of the Monotone Homogeneity Model—I studied IIO because of its benefits for score interpretability and the possibility to stop the examination after failed items. As Sijtsma and van der Ark (2017) suggested, overall IIO was assessed with the coefficient *HT*. As for the *H* scalability coefficients, and as suggested by Ligtvoet et al. (2010), I used thresholds of *HT* < 0.3 for insufficient IIO (Sijtsma and Meijer 1992), 0.3 ≤ *HT* < 0.4 for weak IIO, 0.4 ≤ *HT* < 0.5 for medium IIO and *HT* ≥ 0.5 for strong IIO. I also graphically compared the item response functions of pairs of items that significantly intersect.
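In code, the IIO check might look as follows (a sketch; check.iio() uses the MIIO method by default):

```
iio <- check.iio(X) # invariant item ordering
iio$HT              # overall HT coefficient
summary(iio)        # significant pairwise intersections, plotted for inspection
```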

#### 2.3.6. Reliability

Cronbach's *α* and empirical reliability from parametric IRT models have previously been reported and discussed as satisfactory for the same dataset (Myszkowski and Storme 2018). Here, to investigate reliability as recommended in the context of MSA (Sijtsma and van der Ark 2017), I used the Molenaar-Sijtsma (MS) reliability estimate (Sijtsma and Molenaar 1987), which assumes the Double Monotonicity Model. In addition, I report the Latent Class Reliability Coefficient (LCRC) (van der Ark et al. 2011), which is more robust to violations of the Double Monotonicity Model.
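A minimal sketch of these reliability estimates, under the same assumptions:

```r
# Molenaar-Sijtsma (MS) and latent class (LCRC) reliability estimates.
library(mokken)

check.reliability(spm, MS = TRUE, LCRC = TRUE)
```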

#### **3. Results**

#### *3.1. Scalability*

The SPM-LS had medium scalability, with an *H* coefficient of 0.469 (*SE* = 0.021). The scalability of the item pairs *Hjk* is reported in Table 1, along with the scalability of the items *Hj*. All item-pair and item scalability coefficients were positive, giving support to the monotone homogeneity model. However, it can be noted that the first item had a substantially lower scalability (*Hj* = 0.265) than the other items (*Hj* ranging from 0.401 to 0.602), and that the total scalability would be strong (*H* = 0.516) without this item.
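A minimal sketch of how these coefficients can be obtained, assuming as before that `spm` holds the scored responses:

```r
# Scalability coefficients: item pairs (Hjk), items (Hj), and total (H),
# with standard errors.
library(mokken)

coefH(spm)
```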


**Table 1.** Scalability coefficients of the item pairs (*Hjk*) and items (*Hj*).

#### *3.2. Local Independence*

The *W*1, *W*2 and *W*3 indices for local dependence detection are presented in Table 2. While the *W*2 indices did not suggest that any item was likely part of a positively locally dependent pair, the *W*1 indices identified three positive local dependencies: between Items 4 and 11, Items 5 and 11, and Items 5 and 12. The *W*3 indices suggested a negative local dependency between Items 1 and 9.

#### *3.3. Monotonicity*

The item response functions of all items are presented in Figure 2. Only one violation of monotonicity was observed, for Item 3: the response function of this item can be seen as slightly decreasing between rest scores 8–9 and 10–11. This violation was, however, non-significant.

The Bayes factors used to compare the relative support for monotonicity against non-monotonicity and essential monotonicity are reported in Table 3. Overall, monotonicity was supported for all items against its complement, although the support was much weaker for the last two items, with Bayes factors ranging from 1.64 to 818,417.9. The data also tended to support (strict) monotonicity against essential monotonicity, although this support was limited, and with the exception of Item 12, which had a Bayes factor slightly smaller than 1.

**Figure 2.** Item response functions of the last series of the Standard Progressive Matrices (SPM-LS) items (with 95% confidence intervals).


**Table 2.** *W*1, *W*2 and *W*3 indices of the SPM-LS (flagged values in bold face).

**Table 3.** Bayes factors for the relative support of monotonicity against its complement (*BFMM*,*NM*) and against essential monotonicity (*BFMM*,*EM*).


#### *3.4. Invariant Item Ordering*

The observed invariant item ordering was medium but close to strong, with an *HT* coefficient of 0.475, overall supporting IIO and therefore, in combination with the previous analyses, supporting the Double Monotonicity Model. Only three significant violations of IIO were observed, involving Item 1 with Items 4, 6 and 7. The item response functions for item pairs with significant intersections are presented in Figure 3.<sup>1</sup> Because all three violations involved Item 1, I computed *HT* again without it, and found that the IIO would in this case be strong (*HT* = 0.520).

<sup>1</sup> For Figure 3, the plotting function of the mokken package was modified so that the rest score groups on the *x*-axis are consistent across plots. In addition, it can be noted that all three plots involve Item 1, but that its item response function appears slightly different in each. The reason is that the rest score in each plot is computed using all items except the two items involved in the comparison; since the item pair differs across plots, so do the rest score groups, leading to slightly different response functions for the same item.

**Figure 3.** Item response functions (with 95% confidence intervals) of significantly intersecting item pairs.

#### *3.5. Reliability*

The MS reliability estimate was 0.836, and the LCRC reliability estimate was 0.876, both indicating, as previously found using other estimates (Myszkowski and Storme 2018), that the SPM-LS had satisfactory reliability. The item-rest correlations ranged between 0.285 and 0.563, with Item 1 having a notably lower item-rest correlation than the other items. However, the reliability indices were similar without this item (*MS* = 0.841, *LCRC* = 0.874).

#### **4. Discussion**

While the SPM-LS has already been investigated using a variety of methods in this very dataset, including parametric IRT, Bayesian IRT, factor analysis, and exploratory graph analysis (Myszkowski and Storme 2018; Garcia-Garzon et al. 2019; Bürkner 2020), the current study proposes the first investigation of this instrument using non-parametric IRT, and more specifically Mokken Scale Analysis (Mokken 1971; Mokken and Lewis 1982). This framework allowed the study of several psychometric properties, making it possible both to confirm previous encouraging results on the SPM-LS (regarding dimensionality, local independence and reliability) and to investigate new properties (monotonicity and invariant item ordering).

#### *4.1. Conclusions on the SPM-LS*

Overall, the SPM-LS showed robust psychometric qualities in this study. More specifically, it was found to have satisfactory monotonicity, scalability, local independence (with only a few local dependencies), invariant item ordering (with only a few significant violations) and reliability. This is an overall satisfactory set of results, which would lead us to encourage the use of this instrument.

The main new elements regarding the investigation of this scale were the support for monotonicity (the item response functions were overall monotonically increasing) and invariant item ordering (the item response functions overall did not intersect), giving, along with unidimensionality and local independence, support for the Double Monotonicity Model. The fact that this model was overall supported is interesting, as it presents several advantages for the use of the SPM-LS in practice (Ligtvoet et al. 2010). First, the monotonicity of item responses suggests that, even though Rasch 1-parameter (and, to a lesser extent, 2-parameter) models did not fit this dataset well (Myszkowski and Storme 2018; Bürkner 2020), there is support for the SPM-LS sum scores being able to order persons based on their ability. In addition, it is very clear that each series of Raven's matrices was originally conceptualized as having a cumulative structure, with examinees responding to items of gradually increasing difficulty, built by stacking logical rules to decipher and apply: Empirical support for invariant item ordering supports such a hypothetical functioning of the test. Test editors and practitioners generally assume that, because an item A has a higher success rate than another item B, item A is necessarily easier than B for all examinees, and they often use a test as though this assumption were true, without empirically testing it (Ligtvoet et al. 2010): The current study provides evidence that it is empirically justified to make such interpretations from the SPM-LS.

It was notable, throughout this investigation, that the issues encountered tended to involve Item 1. More specifically, Item 1 had the smallest scalability (based on *Hj* coefficients), was the only item with an outlying negative local dependency (based on *W*3 coefficients), was involved in all three significant violations of invariant item ordering, and had the lowest item-rest correlation. While it appears tempting to remove this item, I would recommend at least keeping it as a training item (meaning, having participants take it but not necessarily including it in the scoring). This is because (1) the presence of this item is still probably important for the examinees to learn the base rule used throughout the series, and (2) the plots suggest that this item's response function is still monotonic, and its intersections with the item response functions of Items 4, 6 and 7 appear somewhat minimal, as the confidence intervals overlap for most ability levels. The current study suggests that practitioners and/or future researchers using the SPM-LS use the full instrument, even though they may question and study their own dataset to decide whether to use Item 1 in the scoring.

#### *4.2. Limitations*

While this investigation presents satisfactory findings regarding the psychometric qualities of the SPM-LS, the different indices observed were not perfect; notably, the scalability of the scale was only medium (Mokken 1997; Sijtsma and van der Ark 2017), suggesting that the instrument can be further improved. I noted earlier that it would be categorized as strong if Item 1 were excluded from the scoring, but, although strong, it would still be only just above the strong threshold. In addition, excluding Item 1 from scoring remains a post-hoc suggestion, made after seeing each item's scalability. It would therefore call for further investigation using a new sample.

Mokken Scale Analysis investigates aspects of psychometric instruments that differ from more usual sets of analyses (notably of the factor analytic or Rasch traditions), especially the investigation of monotonicity and invariant item ordering, but this study also suffers from some limitations of this specific framework. For example, it does not provide a way to study or recover information from distractor responses like other approaches, such as nested logit models (Suh and Bolt 2010), the nominal response model (Bock 1997) or the multiple-choice model (Thissen and Steinberg 1984), which are an important aspect of this specific test (Myszkowski and Storme 2018). Relatedly, MSA certainly allows one to study item responses graphically, but, because it is non-parametric, it does not produce interpretable item parameters. This is a limiting factor in this context, because previous results (Myszkowski and Storme 2018; Bürkner 2020) suggest that phenomena like guessing, which MSA does not account for apart from their potential appearance in item response functions, are relevant for this test. Another limitation of MSA is that it does not provide a way to investigate conditional reliability, and therefore does not allow one, for example, to diagnose whether an instrument provides reliable ability estimates across a wide range of ability levels. This is particularly a problem in the case of the SPM-LS because, as it only includes one series of the original SPM, the range of abilities that are reliably measured may be limited (Myszkowski and Storme 2018). Finally, other advanced uses of Rasch modeling, such as computer-adaptive testing and test equating, are also impossible with Mokken scaling (Meijer et al. 1990).

#### *4.3. Future Directions*

Support for the Double Monotonicity Model, because of invariant item ordering, indicates that, for an item A of lower difficulty than an item B, an examinee who fails item A is predicted to also fail item B (and all items that are more difficult). Thus, if one orders items from the easiest to the most difficult, as is done with the SPM-LS, then it is conceivable to have examinees stop the test after a number of failures, because they are likely to fail all subsequent items. As a supplementary analysis, in this dataset, I computed the correlations between the full scores of examinees (using all item scores) and the scores they would have received had they been stopped after a number of consecutive failures. I found that stopping the test after only one failure provided scores that were strongly (but far from perfectly) correlated with full scores, *r*(483) = 0.735, *p* < 0.001, while stopping the test after two consecutive failures would preserve scores nearly perfectly, *r*(483) = 0.999, *p* < 0.001. Based on this, I would suggest that stopping the administration after two consecutively failed items could lead to gains in administration time without any substantial loss of information about an examinee's ability. I recommend that future studies further examine this possibility, though the present study already gives quite strong support for such a use.
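A minimal sketch of this supplementary analysis, assuming `spm` holds the 0/1 item scores in order of administration:

```r
# Score an examinee as if the test stopped after k consecutive failures.
stop_score <- function(x, k) {
  run <- 0
  for (i in seq_along(x)) {
    run <- if (x[i] == 0) run + 1 else 0
    if (run == k) return(sum(x[1:i]))  # items after the stop would score 0
  }
  sum(x)  # never stopped: full score
}

full  <- rowSums(spm)
stop2 <- apply(spm, 1, stop_score, k = 2)
cor(full, stop2)  # reported as r = 0.999 in the text
```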

Finally, while the psychometric investigation of an instrument can take many shapes, the current study demonstrates how Mokken Scale Analysis can provide insightful information about an instrument, even when that instrument has already been studied in the same dataset with multiple popular and less popular methods (Myszkowski and Storme 2018; Bürkner 2020; Garcia-Garzon et al. 2019). Besides replicating the present study in other samples and in other conditions, which is certainly called for, I suggest that future studies investigate the SPM-LS using other non-parametric IRT models, for example, spline IRT models (Winsberg et al. 1984), to better understand its functioning.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


Mokken, Robert Jan. 1971. *A Theory and Procedure of Scale Analysis.* The Hague and Berlin: Mouton de Gruyter.



© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Regularized Latent Class Analysis for Polytomous Item Responses: An Application to SPM-LS Data**

#### **Alexander Robitzsch 1,2**

<sup>1</sup> IPN—Leibniz Institute for Science and Mathematics Education, D-24098 Kiel, Germany; robitzsch@leibniz-ipn.de

<sup>2</sup> Centre for International Student Assessment (ZIB), D-24098 Kiel, Germany

Received: 9 July 2020; Accepted: 10 August 2020; Published: 14 August 2020

**Abstract:** The last series of Raven's standard progressive matrices (SPM-LS) test was studied with respect to its psychometric properties in a series of recent papers. In this paper, the SPM-LS dataset is analyzed with regularized latent class models (RLCMs). For dichotomous item response data, an alternative estimation approach based on fused regularization for RLCMs is proposed. For polytomous item responses, different alternative fused regularization penalties are presented. The usefulness of the proposed methods is demonstrated in a simulated data illustration and for the SPM-LS dataset. For the SPM-LS dataset, it turned out that the regularized latent class model resulted in five partially ordered latent classes. In total, three out of five latent classes are ordered for all items. For the remaining two classes, violations for two and three items were found, respectively, which can be interpreted as a kind of latent differential item functioning.

**Keywords:** regularized latent class analysis; regularization; fused regularization; fused grouped regularization; distractor analysis

#### **1. Introduction**

There has been recent interest in assessing the usefulness of short versions of the Raven's Progressive Matrices. Myszkowski and Storme (2018) composed the last 12 matrices of the Standard Progressive Matrices (SPM-LS) and argued that it could be regarded as a valid indicator of general intelligence *g*. As part of this special issue, the SPM-LS dataset that was analyzed in Myszkowski and Storme (2018) was reanalyzed in a series of papers applying a wide range of psychometric approaches.

Previous reanalyses of the SPM-LS dataset have in common that quantitative latent variable models were utilized. In this paper, discrete latent variable models (i.e., latent class models) are applied for analyzing the SPM-LS dataset. With discrete latent variable models, the analysis of types instead of traits is the primary focus (see von Davier et al. (2012) and Borsboom et al. (2016)). A disadvantage of discrete latent variable models is that they often have a large number of parameters to estimate. For example, latent class models result in item response probabilities that are allowed to vary across classes. Even with only a few classes, the number of estimated parameters is typically larger than in parametric models with quantitative latent variables. Hence, model selection based on parsimony principles often favors quantitative latent variable models over discrete latent variable models. So-called regularization approaches automatically reduce the number of parameters to estimate (see Huang et al. (2017) or Jacobucci et al. (2016) for the use of regularization in structural equation modeling, and Tutz and Schauberger (2015) or Battauz (2019) in item response modeling). In this paper, these regularization approaches are applied in discrete latent variable models, and some extensions for polytomous data are proposed.

The paper is structured as follows. In Section 2, we give a brief overview of latent class analysis. In Section 3, regularized latent class analysis for dichotomous and polytomous data is introduced. In Section 4, we apply the proposed models of Section 3 in a simulated data illustration. In Section 5, we apply regularized latent class analysis to the SPM-LS dataset. Finally, in Section 6, we conclude with a discussion.

#### **2. Latent Class Analysis**

Latent variable models represent the distribution of discrete items by a number of latent variables (see Agresti and Kateri 2014 for an overview). These latent variables can be categorical or quantitative, or a mixture of both. Quantitative latent variables are considered in factor analysis, structural equation models, or item response models. In this article, we focus on categorical latent variables. In this case, the latent variables are labeled latent classes and are extensively studied in the literature on latent class analysis (LCA; Collins and Lanza 2009; Langeheine and Rost 1988; Lazarsfeld and Henry 1968).

A latent class model (LCM) represents the multivariate distribution of *I* categorical items *X* = (*X*1, ..., *XI*) by a fixed number of *C* latent classes. Let *U* denote the latent class variable that takes one of the values 1, 2, ..., *C*. It is assumed that the items *Xi* are conditionally independent given the latent class variable *U*. That is,

$$P(\boldsymbol{X} = \boldsymbol{x} \mid U = c) = \prod_{i=1}^{I} P(X_i = x_i \mid U = c) \quad \text{for } \boldsymbol{x} = (x_1, \dots, x_I). \tag{1}$$

The multivariate probability distribution is then given as a mixture distribution

$$P(\boldsymbol{X} = \boldsymbol{x}) = \sum_{c=1}^{C} P(U = c) \prod_{i=1}^{I} P(X_i = x_i \mid U = c). \tag{2}$$

Applications of LCMs to intelligence tests can be found in Formann (1982) or Janssen and Geiser (2010).

#### *2.1. Exploratory Latent Class Analysis for Dichotomous Item Responses*

In this subsection, we describe the LCM for dichotomous items. Let *pic* = *P*(*Xi* = 1|*U* = *c*) denote the item response probability for correctly solving item *i* if a person is located in class *c*. In the estimation, these bounded parameters (*pic* ∈ [0, 1]) are transformed onto the real line by using the logistic transformation (see also Formann 1982)

$$P(X_i = x \mid U = c) = \frac{\exp(x \gamma_{ic})}{1 + \exp(\gamma_{ic})} \quad (x = 0, 1). \tag{3}$$

Note that *pic* is a one-to-one function of *γic*. For estimation purposes, it is sometimes more convenient to estimate models with unbounded parameters instead of estimating models with bounded parameters. For *I* items and *C* classes, *I* · *C* item parameters have to be estimated in the case of dichotomous items. In comparison to item response models (1PL model: one parameter, 2PL: two parameters, etc.), this results in many more parameters to be estimated. However, LCMs do not impose the assumption that classes are ordered, and no monotonicity assumptions are imposed on the item response functions.

Moreover, let *pc* = *P*(*U* = *c*) denote the probability that a person is in class *c*. As for item parameters, a logistic transformation is used to represent the class probabilities *pc* by parameters *δc*. More formally, we set

$$p_c = \frac{\exp(\delta_c)}{1 + \sum_{j=2}^{C} \exp(\delta_j)} \quad (c = 1, \dots, C), \tag{4}$$

where *δ*1 = 0. Because the probabilities sum to one, only *C* − 1 distribution parameters have to be estimated. In total, the saturated distribution of *I* dichotomous items has 2<sup>*I*</sup> − 1 free parameters, which is represented by *I* · *C* + *C* − 1 parameters in the LCM with *C* classes.
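As a quick worked example of this parameter count, using the SPM-LS dimensions (*I* = 12) and an assumed *C* = 4 classes:

```r
# Parameter counts: saturated distribution versus LCM.
I <- 12; C <- 4
2^I - 1        # saturated model: 4095 free parameters
I * C + C - 1  # LCM with C classes: 51 parameters
```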

LCMs can be interpreted as pure exploratory models because no structure of item response probabilities among classes is posed. Confirmatory LCMs assume additional equality constraints on item response probabilities (Finch and Bronk 2011; Nussbeck and Eid 2015; Oberski et al. 2015; Schmiege et al. 2018). Like in confirmatory factor analysis, it could be assumed that some items load only on some classes, which translates into equal item response probabilities. Cognitive diagnostic models can be seen as particular confirmatory LCMs (von Davier and Lee 2019).

It should be emphasized that restricted LCMs form the basis of almost all latent variable models for discrete item responses that are nowadays very popular. Formann (1982) suggested representing the vector *γ* = (*γ*11, ..., *γ*1*C*, ..., *γI*1, ..., *γIC*) of item parameters as linear combinations *γic* = *qicα* (*i* = 1, ..., *I*; *c* = 1, ..., *C*) using a parameter vector *α* and known weight vectors *qic*. In addition, the distribution parameter *δ* = (*δ*1, ..., *δC*) is represented by *δc* = *wcβ* using a parameter vector *β* and known weight vectors *wc*. The resulting so-called structured LCM (Formann and Kohlmann 1998) includes unidimensional and multidimensional 1PL and 2PL models as special cases, as well as mixture item response models (Formann 2007). To accomplish this, continuous latent variables are approximated by a finite number of discrete latent classes. For example, a normally distributed latent variable is approximated by discrete latent classes (e.g., *C* = 21 classes) whose probabilities are represented by only two components in *α* (i.e., the mean and the standard deviation as the first two moments). The usage of discrete latent classes can be interpreted as performing numerical integration with a fixed integration grid and applying the rectangle rule. Similar generalizations of restricted LCMs were proposed by von Davier (2008, 2010). In the rest of this article, we focus on the simple exploratory LCM, although the proposed extensions also apply to the more general structured latent class model.

In many applications, the allocation of persons to classes should be predicted by person variables *Z* (Collins and Lanza 2009). In more detail, class probabilities *pc* = *P*(*U* = *c*) are replaced by subject-specific conditional probabilities *P*(*U* = *c*|*Z* = *z*) (so-called latent class regression). These models further ease the interpretation of latent classes.

#### *2.2. Exploratory Latent Class Analysis for Polytomous Item Responses*

Now assume that there are *I* polytomous items and each item has *Ki* + 1 nominal categories 0, 1, ... , *Ki*. Item response probabilities are then given as *pikc* = *P*(*Xi* = *k*|*U* = *c*) and are again transformed into unbounded parameters *γikc* by a logistic transformation. In more detail, it is assumed that

$$P(X_i = x \mid U = c) = \frac{\exp(\gamma_{ixc})}{1 + \sum_{k=1}^{K_i} \exp(\gamma_{ikc})} \quad (x = 0, 1, \dots, K_i), \tag{5}$$

where *γi*0*c* = 0 for all items *i* and all latent classes *c*. Instead of estimating *Ki* + 1 probabilities for item *i* and class *c*, *Ki* free parameters *γikc* have to be estimated. If all polytomous items have *K* + 1 categories, the multidimensional contingency table of observations has (*K* + 1)<sup>*I*</sup> − 1 free parameters, while in the LCM *I* · *K* · *C* + *C* − 1 parameters are estimated. It should be emphasized that the LCM for polytomous items has more free parameters than LCMs for dichotomous items, as well as than unidimensional and multidimensional item response models for polytomous data.

Like for dichotomous data, restricted LCMs were formulated that represented the vector of all item response functions by *γikc* = *qikcα* (*i* = 1, ... , *I*; *k* = 1, ... , *Ki*; *c* = 1, ... , *C*) using a parameter vector *α* and known weight vectors *qikc* (Formann 1992). It is advisable to induce some structure on item response functions, especially for polytomous data, because many parameters have to be estimated without any structural assumptions.

#### **3. Regularized Latent Class Analysis**

As LCMs are exploratory models, the interpretation of results can sometimes be challenging. Moreover, in samples that are not too large, parameter estimation becomes unstable, and findings are sometimes not generalizable across different samples. Alternatively, confirmatory latent class models could be estimated to obtain more stable and more interpretable parameter estimates. However, such confirmatory approaches need assumptions that have to be known in advance of the data analysis. Hence, alternative approaches are sought.

Regularized latent class models (RLCMs; Chen et al. 2017) estimate item response probabilities under the presupposition that similar item response probabilities are grouped and receive the same value. The main idea of applying the regularization technique (see Hastie et al. 2015 for an overview) to LCMs is that, by subtracting an appropriate penalty term from the log-likelihood function, a simpler structure is imposed on the item response probabilities. Different penalty terms typically result in different estimated parameter structures. In a recent Psychometrika paper, Chen et al. (2017) proposed the RLCM for dichotomous item responses. Related work for dichotomous data can be found in Wu (2013) and Yamamoto and Hayashi (2015).

The regularization technique has also been applied for factor models with continuous items (Huang et al. 2017; Jacobucci et al. 2016) and discrete items (Chen et al. 2018; Sun et al. 2016) in order to fit exploratory factor models with the goal of estimating as many zero loadings as possible. In this respect, regularization is a viable alternative to factor rotation methods (Scharf and Nestler 2019).

The regularization technique has also been applied to Gaussian mixture models in which cluster means are estimated to be equal for some variables among clusters (Bhattacharya and McNicholas 2014; Ruan et al. 2011). Regularized latent class analysis (RLCA) is also referred to as penalized latent class analysis (see DeSantis et al. 2008). Under this label, however, what is typically meant are LCMs that apply regularization to the estimation of the regression coefficients in the latent class regression model (DeSantis et al. 2008, 2012; Houseman et al. 2006; Leoutsakos et al. 2011; Sun et al. 2019; Wu et al. 2018). Fop and Murphy (2018) provide a recent review of applications of the regularization technique in mixture models.

In the following, we describe the RLCM at first for dichotomous items. Afterward, we consider the more complex case of polytomous items in which more possibilities for setting equality constraints among item response probabilities are present.

#### *3.1. Regularized Latent Class Analysis for Dichotomous Item Responses*

At first, we consider the case of dichotomous items *Xi* (*i* = 1, ..., *I*). In an RLCM, not all item response probabilities *pic* (*c* = 1, ..., *C*) are assumed to be unique. Chen et al. (2017) subtracted a penalty term from the log-likelihood function that penalizes differences in ordered item response probabilities. In more detail, denote by *pi*,(*c*) (*c* = 1, ..., *C*) the ordered item response probabilities of the original probabilities *pic*, such that *pi*,(1) ≤ *pi*,(2) ≤ ... ≤ *pi*,(*C*), and collect all parameters of item *i* in *pi*<sup>∗</sup>. Then, Chen and colleagues used the following penalty function for item *i*

$$\operatorname{Pen}(\boldsymbol{p}_i^{*}; \lambda) = \sum_{c=2}^{C} H_{\mathrm{SCAD}}(p_{i,(c)} - p_{i,(c-1)}; \lambda), \tag{6}$$

where *H*SCAD denotes the smoothly clipped absolute deviation penalty (SCAD; Fan and Li 2001). The SCAD penalty takes a value of zero if *pi*,(*c*) − *pi*,(*c*−1) = 0 and is positive otherwise (see Figure 1 for the functional form of the SCAD penalty). The parameter *λ* is a regularization parameter that governs the strength of the penalty function. With small values of *λ*, differences are barely penalized; with large values of *λ*, differences are heavily penalized, and item parameters approach a uniform distribution.

If *X* denotes the matrix of observed data, *p*<sup>∗</sup> the vector of all ordered item response probabilities, and *δ* the vector that represents the skill class probabilities, the following function is maximized in Chen et al. (2017):

$$l(\boldsymbol{p}^{*}, \boldsymbol{\delta}; \boldsymbol{X}) - N \sum_{i=1}^{I} \operatorname{Pen}(\boldsymbol{p}_i^{*}; \lambda), \tag{7}$$

where *l* denotes the log-likelihood function of the data. By employing the penalty function Pen in the estimation, some item response probabilities are merged, which, in turn, eases the interpretation of the resulting latent classes. It should be noted that, for estimating the model parameters, the regularization parameter *λ* has to be fixed. In practice, the regularization parameter *λ* also has to be estimated. Hence, the maximization is performed on a grid of *λ* values (say, *λ* = 0.01, 0.02, ..., 0.30), and the model that is optimal with respect to some criterion is selected. Typical criteria are the cross-validated log-likelihood or information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (Hastie et al. 2015).

The maximization of (7) is conducted using an expectation-maximization (EM) algorithm (see Section 3.3 for a general description). The estimation approach of Chen et al. (2017) is implemented in the R package CDM (George et al. 2016; Robitzsch and George 2019).

#### 3.1.1. Fused Regularization among Latent Classes

Though the estimation approach of Chen et al. (2017) is successful, it is not clear how it could be generalized to polytomous data, because it is not evident how the item response probabilities of several categories should be ordered. Hence, we propose a different estimation approach. We apply the technique of fused regularization (Tibshirani et al. 2005; Tutz and Gertheiss 2016), which penalizes all pairwise differences of item response probabilities. In more detail, for a vector *pi* of item response probabilities, we replace the penalty of Chen et al. (2017) (used in Equation (6)) by

$$\operatorname{Pen}(\boldsymbol{p}_i; \lambda) = \sum_{c<d} H_{\mathrm{MCP}}(p_{ic} - p_{id}; \lambda), \tag{8}$$

where *pic* = *P*(*Xi* = 1|*U* = *c*) are class-specific item response probabilities, and *H*MCP denotes the minimax concave penalty (MCP; Zhang 2010). We do not expect dramatic differences compared to the SCAD penalty, but we would expect less biased estimators than when using the often-employed least absolute shrinkage and selection operator (LASSO) penalty *H*LASSO(*x*; *λ*) = *λ*|*x*| (see Hastie et al. 2015). By using pairwise differences in Equation (8), item response probabilities are essentially merged into item-specific clusters of values that are equal within each cluster. Hence, the same goal as in Chen et al. (2017) is achieved. As explained in Section 2.2, our estimation approach uses transformed item response probabilities *γ*. Therefore, in the estimation, we replace Equation (8) by

$$\operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda) = \sum_{c<d} H_{\mathrm{MCP}}(\gamma_{ic} - \gamma_{id}; \lambda). \tag{9}$$

Note that by using the penalty on *γi* in Equation (9) instead of on *pi* in Equation (8), a different metric for quantifying differences in item parameters is introduced. By using *γi*, differences in extreme probabilities (i.e., probabilities near 0 or 1) appear less similar than when using untransformed probabilities as in (8).

In Figure 1, the LASSO, MCP, and SCAD penalty functions are depicted. It can be seen that, for *x* values near 0, the MCP and SCAD penalties equal the LASSO penalty (i.e., *f*(*x*) = *λ*|*x*|). For sufficiently large *x* values, MCP and SCAD reach an upper asymptote, which is not the case for the LASSO penalty. Hence, for MCP and SCAD, the penalty is relatively constant for large values of *x*. This property explains why the MCP and SCAD penalties typically result in less biased estimates. It should be noted that the application of regularization presupposes some sparse structure in the data for obtaining unbiased estimates. In other words, the true data-generating mechanism must consist of a sufficiently large number of equal item parameters. If all item response probabilities were different in the data-generating model, employing a penalty that forces many item parameters to be equal to each other would conflict with the data-generating model.
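A minimal sketch of the three penalties, using their standard definitions (the tuning constants *a* = 3.7 for SCAD and *a* = 3 for MCP are common defaults, assumed here rather than taken from the paper):

```r
# LASSO, SCAD, and MCP penalty functions.
pen_lasso <- function(x, lam) lam * abs(x)

pen_scad <- function(x, lam, a = 3.7) {
  x <- abs(x)
  ifelse(x <= lam, lam * x,
         ifelse(x <= a * lam,
                (2 * a * lam * x - x^2 - lam^2) / (2 * (a - 1)),
                lam^2 * (a + 1) / 2))
}

pen_mcp <- function(x, lam, a = 3) {
  x <- abs(x)
  ifelse(x <= a * lam, lam * x - x^2 / (2 * a), a * lam^2 / 2)
}

curve(pen_lasso(x, 0.25), -1, 1, ylab = "penalty")
curve(pen_scad(x, 0.25), add = TRUE, lty = 2)  # flattens for large |x|
curve(pen_mcp(x, 0.25), add = TRUE, lty = 3)   # flattens for large |x|
```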

**Figure 1.** Different penalty functions used in regularization with regularization parameter *λ* = 0.25 (**left panel**) and *λ* = 0.125 (**right panel**).

#### 3.1.2. Hierarchies in Latent Class Models

The RLCM can be used to derive a hierarchy among latent classes. The main idea is depicted in Figure 2. In an RLCM with *C* = 4 classes, a partial order of latent classes is defined. Class 1 is smaller than Classes 2, 3, and 4. Classes 2 and 3 cannot be ordered. Finally, Classes 2 and 3 are smaller than Class 4. More formally, in an RLCM, we define class *c* to be *smaller* than class *d* (or: class *d* is *larger* than class *c*) if all item response probabilities in class *c* are at most as large as in class *d*, i.e., *pic* ≤ *pid* for all items *i* = 1, ... , *I*.

We use the notation *c* ⪯ *d* to indicate that *c* is smaller than *d*. In a test with many items, fulfilling these inequalities for all items might be too strong a requirement. Hence, one can weaken the concept of partial ordering a bit. Given a tolerance of at most *ι* violating items, we say that class *c* is *approximately smaller* than class *d* if *pic* ≤ *pid* is fulfilled for at least *I* − *ι* items.
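A minimal sketch of this check, assuming `p` is the *I* × *C* matrix of estimated item response probabilities:

```r
# Is class c (approximately) smaller than class d, tolerating iota items?
approx_smaller <- function(p, c, d, iota = 0) {
  sum(p[, c] <= p[, d]) >= nrow(p) - iota
}

approx_smaller(p, 1, 4)            # strict comparison
approx_smaller(p, 2, 4, iota = 2)  # allows up to two violating items
```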

**Figure 2.** Illustration of a partial order with four latent classes.

The partial ordering of latent classes substantially eases the interpretation of the results in RLCMs. Chen et al. (2017) used the RLCM to derive partially ordered latent classes in cognitive diagnostic modeling. Wang and Lu (2020) also applied the RLCM for estimating hierarchies among latent classes (see also Robitzsch and George 2019). Using the RLCM with an analysis of hierarchies may be considered a precursor to confirmatory approaches to latent class modeling.

#### *3.2. Regularized Latent Class Analysis for Polytomous Item Responses*

In the following, we propose an extension of the RLCM for polytomous item responses. It has been shown that using information from item distractors (Myszkowski and Storme 2018; Storme et al. 2019) can increase the reliability of person ability estimates compared to using only dichotomous item responses, which only distinguish between correct and incorrect answers. Moreover, it could be informative to learn about the differential behavior of item distractors by analyzing the data based on the correct response and all incorrect responses.

Assume that 0 denotes the category that refers to a correct response and 1, ..., *Ki* refer to the categories of the distractors. In our parameterization of the LCM for polytomous data (see Section 2.2), only the parameters *γikc* of distractors *k* for item *i* in classes *c* are parameterized. Given the relatively small sample size of the SPM-LS application data (i.e., *N* = 499), the number of estimated parameters in an unrestricted LCM turns out to be quite large because there are seven distractors per item. Moreover, it could be supposed that the distractors of an item behave similarly. Hence, it would make sense to estimate some item parameters to be equal to each other.

We now outline alternatives for structural assumptions on item response probabilities. Let us fix item *i*. For *Ki* + 1 categories and *C* classes, *Ki* · *C* item parameters are modeled (omitting category 0). Hence, we can distinguish between different strategies for setting equalities of item parameters. First, for a fixed category *k*, one can merge some item response probabilities among classes. This means that some of the differences *γikc* − *γikd* (*c* ≠ *d*) are zero. Hence, a penalty on the differences *γikc* − *γikd* has to be posed. This is just the penalty as for dichotomous items (see Equation (8)), but the regularization is applied for *Ki* categories instead of one category. Second, for a fixed class *c*, some item response probabilities among categories could be merged. In this case, one would impose a penalty on the differences *γikc* − *γihc* (*k* ≠ *h*). Third, penalization among classes and among categories can be applied simultaneously. In the remainder, we outline the different strategies in more detail.

#### 3.2.1. Fused Regularization among Latent Classes

Let *γik*∗ = (*γik*1, ..., *γikC*) denote the vector of item parameters for item *i* in category *k*. Again, let *γi* denote the vector of all item parameters of item *i*. For a regularization parameter *λ*1 and item *i*, we define the penalty

$$\operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda_1) = \sum_{k=1}^{K_i} \operatorname{Pen}(\boldsymbol{\gamma}_{ik*}; \lambda_1) = \sum_{k=1}^{K_i} \sum_{c<d} H_{\mathrm{MCP}}(\gamma_{ikc} - \gamma_{ikd}; \lambda_1). \tag{10}$$

As a result, for a given category, some item response probabilities will be merged across latent classes. However, the merging of item parameters (also referred to as fusing; Tibshirani et al. 2005) is applied independently for all categories of an item. In practice, it may not be plausible that all distractors of an item function differently, in which case the item parameters should be regularized more strongly.

#### 3.2.2. Fused Regularization among Categories

As a second alternative, we now merge categories. Let *γi*∗*c* = (*γi*1*c*, ..., *γiKic*) denote the vector of item parameters for item *i* in class *c*. For a regularization parameter *λ*2 and item *i*, we define the penalty

$$\operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda_2) = \sum_{c=1}^{C} \operatorname{Pen}(\boldsymbol{\gamma}_{i*c}; \lambda_2) = \sum_{c=1}^{C} \sum_{k<h} H_{\mathrm{MCP}}(\gamma_{ikc} - \gamma_{ihc}; \lambda_2). \tag{11}$$

As a result, some of the item response probabilities of categories are set equal to each other. As an outcome of applying this penalty, atypical distractors could be detected. However, by using the penalty in Equation (11), no equalities among latent classes are imposed.

#### 3.2.3. Fused Regularization among Latent Classes and Categories

The obvious idea is to combine the regularization among latent classes and categories. To do so, the penalties in Equations (10) and (11) are added. In more detail, for regularization parameters *λ*1 and *λ*2, we use the penalty

$$\operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda_1, \lambda_2) = \sum_{k=1}^{K_i} \sum_{c<d} H_{\mathrm{MCP}}(\gamma_{ikc} - \gamma_{ikd}; \lambda_1) + \sum_{c=1}^{C} \sum_{k<h} H_{\mathrm{MCP}}(\gamma_{ikc} - \gamma_{ihc}; \lambda_2). \tag{12}$$

It can be seen that the penalty in Equation (12) now depends on two regularization parameters. In the estimation, the one-dimensional grid of regularization parameters then has to be replaced by a two-dimensional grid, which substantially increases the computational demand.

#### 3.2.4. Fused Group Regularization among Categories

We can now proceed to pose additional structural assumptions on item parameters. One could suppose that two distractors *k* and *h* of item *i* show the same behavior. In the RLCM, this means that *γikc* − *γihc* = 0 holds for all classes *c* = 1, ..., *C*. The group regularization technique allows us to estimate all parameters in a subset of parameters to be zero (see Huang et al. 2012 for a review). A fused group regularization approach presupposes that either all differences *γikc* − *γihc* equal zero or all differences are estimated to be different from zero (Cao et al. 2018; Liu et al. 2019). This property can be achieved by placing a norm of the difference of the two parameter vectors inside the penalty. In more detail, one considers the penalty

$$\operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda_1) = \sum_{k<h} H_{\mathrm{MCP}}(\|\boldsymbol{\gamma}_{ik*} - \boldsymbol{\gamma}_{ih*}\|; \lambda_1), \tag{13}$$

where, for a vector *x* = (*x*1, ..., *xp*), the norm ||*x*|| is defined as the Euclidean norm

$$\|\boldsymbol{x}\| = \sqrt{\sum_{k=1}^{p} x_k^2}\,.$$

In practice, using the penalty in Equation (13) could provide a more parsimonious estimation than the penalty defined in Equation (12). In principle, model comparisons can be carried out to decide which assumption is better represented in the data.

#### 3.2.5. Fused Group Regularization among Classes

Alternatively, one could assume that two latent classes function the same across all categories of an item. In the RLCM, this would mean that *γikc* − *γikd* = 0 for all categories *k* = 1, ..., *Ki*. A fused group regularization then results in the property that the item parameters of classes *c* and *d* are either all equal to each other or all estimated to be different from each other. The following penalty is used in this case:

$$\operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda_2) = \sum_{c<d} H_{\mathrm{MCP}}(\|\boldsymbol{\gamma}_{i*c} - \boldsymbol{\gamma}_{i*d}\|; \lambda_2). \tag{14}$$

#### *3.3. Estimation*

We now describe the estimation of the proposed RLCM for polytomous data. Let *X* = (*xnik*) denote the observed dataset, where *xnik* equals 1 if person *n* (*n* = 1, ..., *N*) chooses category *k* for item *i*. Let *γi* denote the item parameters of item *i* and *γ* the vector that contains the item parameters of all items. The vector *δ* represents the skill class distribution. Furthermore, let *pic*(*x*; *γi*) = *P*(*Xi* = *x*|*U* = *c*) and *pc*(*δ*) = *P*(*U* = *c*).

Following Chen et al. (2017) and Sun et al. (2016), an EM algorithm is applied for estimating the model parameters. The complete-data log-likelihood function is given by

$$l_{\mathrm{com}}(\boldsymbol{\gamma}, \boldsymbol{\delta}; \boldsymbol{U}) = \sum_{n=1}^{N} \sum_{i=1}^{I} \sum_{k=1}^{K_i} \sum_{c=1}^{C} x_{nik} u_{nc} \log p_{ic}(k; \boldsymbol{\gamma}_i) + \sum_{n=1}^{N} \sum_{c=1}^{C} u_{nc} \log p_c(\boldsymbol{\delta}), \tag{15}$$

where *u<sup>n</sup>* = (*un*1, ... , *unC*) is the vector of latent class indicators for person *n*. It holds that *unc* = 1 if person *n* is located in class *c*. Obviously, the true class membership is unknown and, hence, Equation (15) cannot be used for maximization.

In the EM algorithm, the maximization of *l*com is replaced by the maximization of the expected complete-data log-likelihood function, integrating over the posterior distribution. In more detail, the unobserved values *unc* are replaced by their conditional expectations:

$$u_{nc}^{*} = \mathrm{E}\!\left(u_{nc} \mid \boldsymbol{x}_n; \boldsymbol{\gamma}^{(t)}, \boldsymbol{\delta}^{(t)}\right) = \frac{p_c(\boldsymbol{\delta}^{(t)}) \prod_{i=1}^{I} \prod_{k=1}^{K_i} p_{ic}(k; \boldsymbol{\gamma}_i^{(t)})^{x_{nik}}}{\sum_{d=1}^{C} p_d(\boldsymbol{\delta}^{(t)}) \prod_{i=1}^{I} \prod_{k=1}^{K_i} p_{id}(k; \boldsymbol{\gamma}_i^{(t)})^{x_{nik}}} \quad (c = 1, \dots, C), \tag{16}$$

where *γ*(*t*) and *δ*(*t*) are parameter estimates from a previous iteration *t*. The EM algorithm alternates between this E-step and the M-step. By replacing the unobserved values *unc* by their expected values *u*∗*nc*, the following *Q*-function is obtained that is used for maximization in the M-step

$$Q(\boldsymbol{\gamma}, \boldsymbol{\delta} \mid \boldsymbol{\gamma}^{(t)}, \boldsymbol{\delta}^{(t)}) = \sum_{n=1}^{N} \sum_{i=1}^{I} \sum_{k=1}^{K_i} \sum_{c=1}^{C} x_{nik} u_{nc}^{*} \log p_{ic}(k; \boldsymbol{\gamma}_i) + \sum_{n=1}^{N} \sum_{c=1}^{C} u_{nc}^{*} \log p_c(\boldsymbol{\delta}). \tag{17}$$

From this *Q*-function, the penalty function is subtracted, such that the following function is maximized for some regularization parameter *λ* in the M-step

$$Q(\boldsymbol{\gamma}, \boldsymbol{\delta} \mid \boldsymbol{\gamma}^{(t)}, \boldsymbol{\delta}^{(t)}) - N \sum_{i=1}^{I} \operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda). \tag{18}$$

It can be seen that the item parameters *γi* are obtained separately for each item *i* in the M-step because the penalties are defined independently for each item. Hence, for each item *i*, one maximizes

$$\sum_{n=1}^{N} \sum_{k=1}^{K_i} \sum_{c=1}^{C} x_{nik} u_{nc}^{*} \log p_{ic}(k; \boldsymbol{\gamma}_i) - N \operatorname{Pen}(\boldsymbol{\gamma}_i; \lambda). \tag{19}$$

Latent class probability parameters *δ* are also obtained independently from item parameters in the M-step.
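As an illustration of the E-step in Equation (16), a minimal sketch (all object names are assumptions: `p_ikc` is an *I* × (*K* + 1) × *C* array of current response probabilities, `p_c` the vector of class probabilities, and `x` the *N* × *I* matrix of chosen categories coded from 0):

```r
# E-step of Equation (16): posterior class membership probabilities u*_nc.
e_step <- function(x, p_ikc, p_c) {
  N <- nrow(x); C <- length(p_c)
  logpost <- matrix(log(p_c), N, C, byrow = TRUE)
  for (i in seq_len(ncol(x))) {
    # add log p_ic(x_ni; gamma_i) for every person and class
    logpost <- logpost + log(p_ikc[i, x[, i] + 1, ])
  }
  post <- exp(logpost - apply(logpost, 1, max))  # numerical stabilization
  post / rowSums(post)
}
```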

The penalty function Pen turns out to be non-differentiable. Here, we use a differentiable approximation of the penalty function (Oelker and Tutz 2017; see also Battauz 2019). Since the log-likelihood function in LCMs is well known to be prone to multiple maxima, using multiple starting values in the estimation is advisable.

The described EM algorithm is included in an experimental version of the function regpolca() in the R package sirt (Robitzsch 2020). The function is under active development to improve its computational efficiency.
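A hypothetical usage sketch follows; since the function is experimental, the argument names shown are assumptions for illustration and should be checked against the current sirt documentation:

```r
# Hypothetical call to the experimental regularized polytomous LCA function
# in sirt; the argument names `dat`, `nclasses`, and `regular_lam` are
# assumed, not verified against the package.
library(sirt)

mod <- regpolca(dat = spm_poly, nclasses = 4, regular_lam = 0.21)
summary(mod)
```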

#### **4. Simulated Data Illustration**

Before we illustrate the application of the method to the SPM-LS dataset, we demonstrate the technique using a simulated data set. This helps to better understand the proposed method of regularized latent class modeling under ideal conditions.

#### *4.1. Dichotomous Item Responses*

#### 4.1.1. Data Generation

First, we consider the case of dichotomous items. To mimic the situation in the SPM-LS dataset, we chose *I* = 12 items for simulating a dataset. Moreover, to somewhat reduce sampling uncertainty, a sample size of *N* = 1000 subjects was chosen. There were *C* = 4 latent classes with true class probabilities 0.30, 0.20, 0.10, and 0.40. In Table 1, we present the item response probabilities within each class. We only specified parameters for six items and duplicated these parameters for the remaining six items in the test. It can be seen in Table 1 that many item response probabilities were set equal to each other. Indeed, for the first four items, there are only two instead of four unique probabilities. Moreover, it is evident from Table 1 that the four classes are partially ordered. The first class has the lowest probabilities for all items and is, therefore, the smallest class, consisting of the least proficient subjects. The fourth class has the highest probabilities, constitutes the largest class, and contains the most proficient subjects.
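A minimal sketch of this kind of data generation (the probability matrix below is a random placeholder standing in for the true values of Table 1):

```r
# Simulate dichotomous LCM data with C = 4 partially ordered classes.
set.seed(1)
N <- 1000; I <- 12; C <- 4
class_prob <- c(0.30, 0.20, 0.10, 0.40)
p <- t(apply(matrix(runif(I * C), I, C), 1, sort))      # placeholder p_ic
cl <- sample(C, N, replace = TRUE, prob = class_prob)   # latent classes
X <- t(matrix(rbinom(N * I, 1, p[, cl]), I, N))         # N x I responses
```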

Model selection is carried out using the information criteria AIC and BIC. For regularized models, the number of parameters used in the computation of the information criteria is determined by the number of estimated unique parameters. For example, if four item response probabilities are estimated to be equal in a model, only one parameter is counted.


**Table 1.** Data illustration dichotomous data: true item response probabilities *pic*.

#### 4.1.2. Results

In the first step, we estimated exploratory latent class models with *C* = 2, 3, 4, 5, and 6 classes. The model comparison is presented in Table 2. While the decision based on the AIC was ambiguous and selected an incorrect number of classes, the BIC correctly selected the model with *C* = 4 latent classes. This observation is consistent with the literature, which argues that model selection in LCMs should be based on the BIC instead of the AIC (Collins and Lanza 2009; Keribin 2000).

**Table 2.** Data illustration dichotomous data: model comparison for exploratory latent class models (LCMs).


*Note: C* = number of classes; #np = number of estimated parameters.

In the solution with four classes, the estimated class probabilities were 0.290, 0.204, 0.120, and 0.386, which closely resembled the data-generating values. In Table 3, the estimated item response probabilities are shown. The estimates were very similar to the data-generating parameters presented in Table 1, although some deviations from the simulated equality constraints were obtained. It is important to emphasize that latent class solutions are not invariant with respect to their class labels (so-called label switching): class labels in the estimated model have to be permuted to match the class labels in the simulated data.

Finally, we estimated the RLCM with *C* = 4 classes for regularization parameters from 0.01 to 1.00 in steps of 0.01 in order to obtain the best-fitting solution. The regularization parameter *λ* = 0.21 provided the best-fitting model in terms of the BIC (BIC = 13,104, AIC = 12,957). Notably, this model showed a substantially better model fit than the exploratory LCM with four classes due to the more parsimonious estimation of item parameters. In the model with *λ* = 0.21, in total, 21 item parameters were regularized (i.e., they were set equal to item parameters in other classes for the respective item), resulting in 30 freely estimated model parameters. With respect to the AIC, the best-fitting model was obtained for *λ* = 0.15 (BIC = 13,110, AIC = 12,953). This model resulted in 19 regularized item parameters and 32 freely estimated item parameters. The model selected by the minimal BIC (*λ* = 0.21) resulted in estimated class probabilities of 0.29, 0.21, 0.11, and 0.39. The estimated item response probabilities (shown in Table 4) demonstrate that the equality constraints posed in the data-generating model were correctly identified in the estimated model.


**Table 3.** Data illustration dichotomous data: estimated item response probabilities in exploratory LCM with *C* = 4 classes.

**Table 4.** Data illustration dichotomous data: estimated item response probabilities in the regularized latent class model (RLCM) with *C* = 4 classes based on the minimal Bayesian information criterion (BIC) (*λ* = 0.21).


Lastly, we want to illustrate the behavior of the regularization. For the sequence of specified regularization parameters *λ*, the estimated item response probabilities *pic*(*λ*) can be plotted. Such a plot is also referred to as a regularization path (Hastie et al. 2015). With very small *λ* values, no classes were merged, and all item parameters were estimated to be different from each other. With increasing values of *λ*, item parameters were subsequently merged. For Item 1 and Classes 2, 3, and 4, the regularization path is shown in Figure 3. At first, the item parameters of Classes 2 and 4 are merged at *λ* = 0.04. Afterwards, all three item response probabilities are merged at *λ* = 0.09.

#### *4.2. Polytomous Item Responses*

#### 4.2.1. Data Generation

Now, we simulate illustrative data with polytomous item responses for 12 items, each item possessing four categories (i.e., *Ki* = 3). The first category (i.e., Category 0) refers to the correct response, while Categories 1, 2, and 3 refer to the distractors of the item. As in the case of dichotomous data, the class probabilities were 0.30, 0.20, 0.10, and 0.40, and *N* = 1000 subjects were simulated. Again, we specified item parameters for the first six items and replicated the parameters for the remaining six items. In Table 5, the true item response probabilities used for generating the dataset are shown. It is evident that the item response probabilities are strongly structured. All distractors of Items 1 and 5 function precisely the same. For Item 2, Category 1 and Category 2 show the same behavior, and Category 3 only shows a differential behavior in Classes 2 and 4. At the other extreme, all item response probabilities differ for Item 6 among classes and categories. It can be expected that an RLCM will result in a substantial model improvement compared to an exploratory LCM without equality constraints.

**Figure 3.** Data illustration dichotomous data: Regularization path for estimated item response probabilities for Item 1 for Classes 2, 3, 4 for the four-class solution.

#### 4.2.2. Results

At first, we fitted exploratory LCMs with 2, 3, 4, 5, and 6 classes. Based on the information criteria presented in Table 6, the correct model with *C* = 4 latent classes was selected. However, the model improvement from moving from 3 to 4 classes would be considered negligible in practice (i.e., a BIC difference of 3). The estimated latent class probabilities in the model with four latent classes were 0.29, 0.21, 0.12, and 0.38. The estimated item response probabilities are shown in Table A1 in Appendix A.

In the next step, different RLCMs for polytomous data were specified. As explained in Section 3.2, one can regularize differences in item parameters among classes (using a regularization parameter *λ*1), among categories (using a regularization parameter *λ*2), or both (using both regularization parameters or applying fused group regularization). We fitted the five approaches (Approaches R1, ..., R5) introduced in Section 3.2 to the simulated data, using one-dimensional and two-dimensional grids of regularization parameters. For each of the regularization approaches, we selected the model with the minimal BIC.

In Table 7, it can be seen that the model with the fused penalty on item categories fitted best (Approach R2: BIC = 24,689). In this model, 79 item parameters are regularized. The decrease in BIC compared to an exploratory LCM is substantial. From these findings, it follows that, for this dataset, it is important to fuse item parameters among categories rather than among classes. The best-fitting model using a simultaneous penalty for classes and categories (Approach R3: BIC = 24,836) outperformed the model in which parameters were fused only among classes (Approach R1: BIC = 24,932), but was inferior to the model with fusing among categories. Notably, the largest number of regularized parameters (#nreg = 103) was obtained for Approach R3. The fused group regularization approaches (Approaches R4 and R5) also improved fit compared to an unrestricted exploratory LCM but were likewise inferior to R2. The reason might be that applying group regularization results in the extreme decision that either all item parameters are equal or all are different. In contrast, the fused regularization Approaches R1, R2, and R3 allow the situation in which only some of the item parameters are estimated to be equal to each other.


**Table 5.** Data illustration polytomous data: true item response probabilities *pic*.

**Table 6.** Data illustration polytomous data: model comparison for exploratory LCMs.


*Note: C* = number of classes; #np = number of estimated parameters.

**Table 7.** Data illustration polytomous data: model comparison for different RLCMs with four classes.


*Note:* Appr. = approach; Cat = category; Eq. = equation for regularization penalty in Section 3.2; *C* = number of classes; #np = number of estimated parameters; #nreg = number of regularized item parameters.

For the best-fitting model of Approach R2 (i.e., fusing among categories), the estimated class probabilities were 0.29, 0.22, 0.10, and 0.39, respectively. The estimated item response probabilities from this model are shown in Table 8. It can be seen that the model estimation was quite successful in identifying the parameters that were equal in the data-generating model.


**Table 8.** Data illustration polytomous data: estimated item response probabilities in the RLCM with *C* = 4 classes and fused regularization among categories based on the minimal BIC.

*Note:* Cat = category.

#### **5. Application of the SPM-LS Data**

In this section, we illustrate the use of RLCMs with the SPM-LS dataset.

#### *5.1. Method*

In line with the topic of this special issue, the publicly available dataset from the Myszkowski and Storme (2018) study was reanalyzed. The original study compared various parametric item response models (i.e., 1PL, 2PL, 3PL, 4PL, and nested logit models) on a dataset comprising *N* = 499 students (214 males and 285 females) aged between 19 and 24. The analyzed data consist of responses to the 12 most difficult SPM items and are freely available at https://data.mendeley.com/datasets/h3yhs5gy3w/1. For details regarding the data gathering procedure, we refer to Myszkowski and Storme (2018).

Each of the *I* = 12 items had one correct category and *Ki* = 7 distractors. To be consistent with the notation introduced in this paper and to ease the interpretation of the results, we recoded the original dataset when using polytomous item responses. First, we scored the correct response as Category 0. Second, we renumbered the distractors according to their frequency: Category 1 in our rescored dataset was the most attractive (i.e., most frequent) distractor, while Category 7 was the least attractive distractor. The relative frequencies, together with references to the categories of the original dataset, are shown in Table 9. One could suppose that each item has some especially attractive distractors. However, many item category frequencies turned out to be relatively similar, so the distractors might also function homogeneously across latent classes. We also analyzed the SPM-LS dataset in its dichotomous version, in which Category 1 was scored as correct and Category 0 summarized the responses to all distractors.
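
As an illustration of this recoding step, the base-R sketch below maps the correct response to Category 0 and renumbers the distractors of each item by decreasing frequency. The response matrix `resp` (options coded 1 to 8) and the vector `key` of correct options per item are assumed objects, not part of the published dataset's documentation.

```
## Sketch of the recoding described above (assumed objects: a response matrix
## `resp` with options 1..8 and a vector `key` with the correct option per item).
recode_item <- function(x, correct) {
  out <- integer(length(x))
  out[x == correct] <- 0                     # correct response becomes Category 0
  ## rank the distractors by decreasing frequency: most frequent -> Category 1
  distractors <- setdiff(sort(unique(x)), correct)
  freq <- table(factor(x[x != correct], levels = distractors))
  ranks <- rank(-as.numeric(freq), ties.method = "first")
  for (d in seq_along(distractors)) {
    out[x == distractors[d]] <- ranks[d]
  }
  out
}

resp_rec <- sapply(seq_len(ncol(resp)), function(i) recode_item(resp[, i], key[i]))
```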


**Table 9.** The last series of Raven's Standard Progressive Matrices (SPM-LS), polytomous data: percentage frequencies and recoding table.

*Note:* Numbers in parentheses denote the original item category.

For the dataset with dichotomous items, the exploratory LCM and the RLCM were fitted with two to six latent classes, and model selection was conducted based on the BIC. For the dataset with rescored polytomous items, we used the same numbers of classes for estimating the exploratory LCM. For the RLCM, we applied the fused regularization approach with respect to classes (Section 3.2.1), to categories (Section 3.2.2), and to classes and categories simultaneously (Section 3.2.3).
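
For the exploratory part, standard software suffices. The sketch below uses the poLCA package, one of several R packages that fit unrestricted LCMs; its use here is an assumption for illustration, not a statement about the software actually used in this study.

```
## Sketch: exploratory LCMs for 2 to 6 classes, selected by BIC.
## poLCA expects item responses coded as positive integers (1 = incorrect,
## 2 = correct for the dichotomous data); resp_dich is an assumed 0/1 matrix.
library(poLCA)

dat <- as.data.frame(resp_dich + 1)
f <- as.formula(paste("cbind(", paste(names(dat), collapse = ","), ") ~ 1"))

bics <- sapply(2:6, function(C) {
  ## nrep restarts guard against local maxima of the LCM likelihood
  poLCA(f, dat, nclass = C, nrep = 5, verbose = FALSE)$bic
})
names(bics) <- paste0("C=", 2:6)
bics            # pick the number of classes with minimal BIC
```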

#### *5.2. Results*

#### 5.2.1. Results for Dichotomous Item Responses

For the SPM-LS dataset with dichotomous items, a series of exploratory LCMs and RLCMs with two to six classes was fitted. According to the BIC presented in Table 10, an exploratory LCM with four classes would be selected. When RLCMs were fitted, a model with five classes would be selected that had 19 regularized item parameters.

In Table 11, the item response probabilities and skill class probabilities for the RLCM with *C* = 5 classes are shown. Judging by the average item response probability per skill class, $\bar{p}_{\bullet c} = (\sum_{i=1}^{I} p_{ic})/I$, Class C1 (12% frequency) was the lowest-performing and Class C5 (37% frequency) the best-performing class, while Class C3 (40% frequency) could be seen as an intermediate class. Classes C2 and C4 were relatively rare. Compared to the intermediate Class C3, students in Class C2 performed particularly badly on Items 3, 6, and 11, but outperformed C3 on Items 7, 8, and 12. Students in Class C4 showed perfect performance on Items 8 and 9, but notably worse performance on Items 10 and 11. Interestingly, one could define a partial order on the classes if at most two violations of the inequality conditions were allowed. This partial order is depicted in Figure 4. The arrow from Class C1 to Class C3 means that the item response probabilities of C1 were smaller than those of C3. Arrows with labels indicate violations of the partial order; for example, C1 was approximately smaller than C2, with Items 3 and 11 violating the ordering property. To summarize, three out of the five classes fulfilled the ordering property for all items. Two classes showed violations for two or three items and could be interpreted as subpopulations of subjects exhibiting latent differential item functioning.
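
The class ordering underlying Figure 4 can be checked mechanically. Assuming a matrix `P` of estimated item response probabilities (items in rows, classes in columns), a minimal sketch of the class averages and the per-pair violation count:

```
## Sketch: class averages and ordering checks for an I x C matrix P of
## estimated item response probabilities (assumed object).
p_bar <- colMeans(P)            # average item response probability per class

## items on which the on-average weaker class c1 nevertheless beats class c2
violations <- function(P, c1, c2) {
  stopifnot(mean(P[, c1]) <= mean(P[, c2]))
  which(P[, c1] > P[, c2])
}

## e.g., for Classes C1 and C2 this should flag Items 3 and 11 (cf. Figure 4)
violations(P, 1, 2)
```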


**Table 10.** SPM-LS dichotomous data: model comparison for exploratory LCMs and RLCM.

*Note:* *C* = number of classes; *λ* = regularization parameter of the selected model with minimal BIC; #np = number of estimated parameters; #nreg = number of regularized parameters.

**Table 11.** SPM-LS dichotomous data: estimated item probabilities and latent class probabilities for the best-fitting RLCM with *C* = 5 latent classes.


*Note:* $p_c$ = skill class probability; $\bar{p}_{\bullet c}$ = average of the item probabilities within class *c*.

#### 5.2.2. Results for Polytomous Item Responses

We now briefly discuss the findings for the analysis of the SPM-LS dataset based on polytomous item responses. Among the exploratory latent class models, the model with just two latent classes would be selected according to the BIC, whereas the model with six latent classes would be selected according to the AIC. Given the large number of estimated item parameters, applying the RLCM seems necessary for obtaining a parsimonious model. The best-fitting model was obtained with *C* = 3 classes by fusing categories with a regularization parameter of *λ*2 = 0.24. Classes C1 (28% frequency) and C2 (5% frequency) had low performance, while Class C3 (67% frequency) was the high-performing class. As an illustration, we provide in Table 12 the estimated item probabilities for the last three items. It can be seen that some of the categories were fused such that they had equal item response probabilities within a latent class. All item parameters are shown in Table A2 in Appendix B.

**Figure 4.** SPM-LS dichotomous data: partial order of the latent classes from the RLCM.

**Table 12.** SPM-LS polytomous data: estimated item response probabilities and latent class probabilities for the best-fitting RLCM with *C* = 3 latent classes, for items SPM10, SPM11, and SPM12.


*Note:* Cat = category.

At the time of writing, the results for the polytomous SPM-LS data do not seem to be very consistent with those for the dichotomous data. It could be that the large number of parameters to be estimated (several hundred, depending on the number of classes) is critical given the relatively small sample size of *N* = 499. Other research has also shown that regularization methods for LCMs need sample sizes of at least 1000, or even more, to perform satisfactorily (Chen et al. 2015).

#### **6. Discussion**

In this article, we proposed an extension of regularized latent class analysis to polytomous item responses. Using the simulated data illustration and the SPM-LS dataset, we have shown that fusing among classes or categories can be beneficial in terms of model parsimony and interpretation. Conceptualizing substantive questions in terms of latent classes often makes it easier for researchers to think in types of persons; such an interpretation is not available in models with continuous latent variables.

In our regularization approach to polytomous data, we based the regularization penalties on the distractors of the items. Hence, the correct item response serves as a reference category. In LCA applications in which a reference category cannot be defined, the regularization approach certainly has to be modified. Note that for $K_i + 1$ categories, only $K_i$ item parameters per class can be independently estimated. Alternatively, a sum constraint $\sum_{k=0}^{K_i} \gamma_{ikc} = 0$ could be imposed, where $\gamma_{ikc}$ ($k = 0, 1, \ldots, K_i$) denotes the item parameter of item $i$ for category $k$ in class $c$. Such constraints can be replaced by adding ridge-type penalties of the form $\lambda_3 \sum_{k=0}^{K_i} \gamma_{ikc}^2$ to the fused regularization penalty, where $\lambda_3$ is another regularization parameter. Because the item parameters are squared in the penalty function, they are uniformly shrunk toward zero in the estimation.
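
To make the structure of such a combined penalty concrete, here is a small R sketch for a single item, with the item parameters $\gamma_{ikc}$ stored as a $(K_i + 1) \times C$ matrix. The pairwise-difference form of the fused part and the use of plain absolute values (rather than the differentiable approximation employed in actual estimation) are simplifying assumptions.

```
## Sketch: fused penalty among categories plus a ridge term, for one item.
## gamma is a (K_i + 1) x C matrix of item parameters gamma_{ikc}.
penalty_item <- function(gamma, lambda2, lambda3) {
  K1 <- nrow(gamma)
  fused <- 0
  for (c in seq_len(ncol(gamma))) {
    for (k in 1:(K1 - 1)) {
      for (l in (k + 1):K1) {
        ## fuse pairs of categories within class c; in actual estimation,
        ## abs() is replaced by a differentiable approximation
        fused <- fused + abs(gamma[k, c] - gamma[l, c])
      }
    }
  }
  ridge <- sum(gamma^2)   # lambda3 * ridge uniformly shrinks parameters to zero
  lambda2 * fused + lambda3 * ridge
}
```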

By treating the correct item response as the reference category, regularization operates only on the categories of the incorrect responses. As pointed out by an anonymous reviewer, it could be more appropriate to fuse classes separately for the correct item response and for the incorrect item response categories. This would lead to an overidentified model because all class-specific item response probabilities would appear in the model. However, if, again, a ridge-type penalty were employed, the identification issue would disappear.

As the application of the regularization technique to an LCM results in a particular restricted LCM, it has to be shown that the model parameters can be identified. Necessary and sufficient conditions for identification in restricted LCMs have recently been investigated (Gu and Xu 2018; Xu 2017). Because the inclusion of the penalty function, accompanied by a regularization parameter, introduces an additional amount of information into the estimation, it is unclear whether identifiability should be studied only for the likelihood part of the optimization function (see San Martín 2018 for a related discussion in the context of Bayesian estimation).

It should be noted that similar regularization approaches have been studied for cognitive diagnostic models (Chen et al. 2015; Gu and Xu 2019; Liu and Kang 2019; Xu and Shang 2018). These kinds of models pose measurement models on *D* dichotomous latent variables, which constitute $2^D$ latent classes. In addition, within this model class, modeling violations of the local independence assumption in LCA has been of interest (Kang et al. 2017; Tamhane et al. 2010).

Previous articles on the SPM-LS dataset also used distractor information by employing the nested logit model (NLM; Myszkowski and Storme 2018). The NLM is likewise very data-hungry given the low sample size of the dataset. It has been argued that reliability can be increased by using distractor information (Myszkowski and Storme 2018; Storme et al. 2019). It should be noted that this holds only to the extent that the item parameters can be reliably estimated, which will probably not be the case for *N* = 499 in the SPM-LS dataset. Regularized estimation approaches could circumvent such estimation issues (see Battauz 2019 for a similar approach in the nominal response model).

Finally, we would like to emphasize that the regularization approaches can be interpreted as empirical Bayesian approaches that employ hierarchical prior distributions on the item parameters (van Erp et al. 2019). It can be expected that Bayesian variants of RLCMs are competitive with EM-based estimation, especially for small(er) samples.

**Funding:** This research received no external funding.

**Acknowledgments:** I am incredibly grateful to Nils Myszkowski for his long-standing patience and for the invitation to contribute to the special issue 'Analysis of an Intelligence Dataset' in the *Journal of Intelligence*. I also thank the three anonymous reviewers whose comments helped improve and clarify this manuscript.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

AIC Akaike information criterion
BIC Bayesian information criterion
EM expectation-maximization
LCA latent class analysis
LCM latent class model
NLM nested logit model
RLCM regularized latent class model
SPM-LS last series of Raven's Standard Progressive Matrices
#### **Appendix A. Additional Results for Simulated Data Illustration with Polytomous Item Responses**

In Table A1, estimated item response probabilities for the exploratory LCM with four latent classes are shown.

**Table A1.** Data illustration polytomous data: estimated item response probabilities in the exploratory LCM with *C* = 4 classes.


**Appendix B. Additional Results for SPM-LS Dataset with Polytomous Item Responses**

Table A2 shows all estimated item response probabilities for the SPM-LS dataset with polytomous items for the best-fitting RLCM with *C* = 3 classes.


**Table A2.** SPM-LS polytomous data: estimated item response probabilities and latent class probabilities for best-fitting RLCM with *C* = 3 latent classes.

*Note:* Cat = category.

#### **References**



Langeheine, Rolf, and Jürgen Rost. 1988. *Latent Trait and Latent Class Models*. New York: Plenum Press. [CrossRef]




© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
