Maximum likelihood estimation of the linear factor model for continuous items assumes normally distributed item scores. We consider deviations from normality by means of a skew-normally distributed factor model or a quadratic factor model. We show that the item distributions under a skew-normal factor are equivalent to those under a quadratic model up to third-order moments. The reverse only holds if the quadratic loadings are equal to each other and within certain bounds. We illustrate that observed data which follow any skew-normal factor model can be so well approximated with the quadratic factor model that the models are empirically indistinguishable, and that the reverse does not hold in general. The choice between the two models to account for deviations from normality is illustrated by an empirical example from clinical psychology.
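A small simulation can illustrate the kind of third-order non-normality at issue. The sketch below generates item scores from a quadratic factor model with hypothetical parameter values (the loadings and unique variance are not taken from the paper) and confirms that the quadratic term induces skewness in the observed scores:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameters (illustrative only): linear loading lam,
# quadratic loading gam, unique standard deviation sigma.
lam, gam, sigma = 0.8, 0.3, 0.6

f = rng.standard_normal(n)       # latent factor, N(0, 1)
e = rng.normal(0.0, sigma, n)    # unique part

# Quadratic factor model: x = lam*f + gam*(f**2 - 1) + e
# (centring f**2 at its mean 1 keeps E[x] = 0)
x = lam * f + gam * (f**2 - 1) + e

# The quadratic loading produces non-zero third-order moments,
# i.e., skewed item scores even though f and e are normal.
print(round(skew(x), 2))
```

With these values the population skewness works out to (6·lam²·gam + 8·gam³) / (lam² + 2·gam² + sigma²)^{3/2} ≈ 1.07, so the sample value should be clearly positive.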

We consider multi-set data consisting of observations (e.g., subject scores) on *J* variables in *K* different samples, *k* = 1,…, *K*. We introduce a factor model for the *J* × *J* covariance matrix of each sample, where the common part is modelled by Parafac2 and the unique variances are diagonal. The Parafac2 model implies a common loadings matrix that is rescaled for each *k*, and a common factor correlation matrix. We estimate the unique variances by minimum rank factor analysis for each *k*. The factors can be chosen orthogonal or oblique. We present a novel algorithm to estimate the Parafac2 part and demonstrate its performance in a simulation study. We also fit our model to a data set from the literature. Our model is easy to estimate and interpret. The unique variances, the factor correlation matrix and the communalities are guaranteed to be proper, and a percentage of explained common variance can be computed for each *k*. Moreover, the Parafac2 part is rotationally unique under mild conditions.
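The covariance structure described above can be sketched numerically. Under a Parafac2-style decomposition, each group's covariance matrix combines a common loadings matrix rescaled by a group-specific diagonal matrix, a common factor correlation matrix, and diagonal unique variances. All parameter values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
J, R, K = 6, 2, 3   # items, factors, groups (illustrative sizes)

# Hypothetical common structure (not the paper's data):
A   = rng.uniform(0.4, 0.9, (J, R))        # common loadings
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])               # common factor correlations

for k in range(K):
    D_k = np.diag(rng.uniform(0.8, 1.2, R))   # group-specific rescaling
    U_k = np.diag(rng.uniform(0.2, 0.5, J))   # diagonal unique variances
    # Parafac2-style structure: Sigma_k = (A D_k) Phi (A D_k)' + U_k
    L_k = A @ D_k
    common  = L_k @ Phi @ L_k.T
    Sigma_k = common + U_k
    # percentage of explained common variance for group k
    pct = 100 * np.trace(common) / np.trace(Sigma_k)
    print(k, round(pct, 1))
```

Because the unique variances are positive and diagonal, each constructed covariance matrix is proper (positive definite) by construction, mirroring the guarantee mentioned in the abstract.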

Methods for the identification of differential item functioning (DIF) in Rasch models are typically restricted to the case of two subgroups. A boosting algorithm is proposed that is able to handle the more general setting where DIF can be induced by several covariates at the same time. The covariates can be both continuous and (multi-)categorical, and interactions between covariates can also be considered. The method works for a general parametric model for DIF in Rasch models. Since the boosting algorithm selects variables automatically, it is able to detect the items which induce DIF. It is demonstrated that boosting competes well with traditional methods in the case of two subgroups. The method is illustrated by an extensive simulation study and an application to real data.

An important distinction between different models for response time and accuracy is whether conditional independence (CI) between response time and accuracy is assumed. In the present study, a test for CI given an exponential family model for accuracy (for example, the Rasch model or the one-parameter logistic model) is proposed and evaluated in a simulation study. The procedure is based on non-parametric Kolmogorov–Smirnov tests. As an illustrative example, the CI test was applied to data from an arithmetic test for secondary education.
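The core idea of a Kolmogorov–Smirnov-based CI check can be sketched in a simplified form: under CI, the response-time distribution on an item should not differ between correct and incorrect responses (the paper's actual procedure additionally conditions on the sufficient statistic for ability, which this toy version omits). The data below are simulated, with a deliberate CI violation for contrast:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
n = 4000

# Simulated single item: accuracy indicator and log response times.
acc = rng.binomial(1, 0.6, n)
logrt_ci   = rng.normal(0.0, 1.0, n)              # CI holds: RT unrelated to accuracy
logrt_viol = rng.normal(0.0, 1.0, n) + 0.4 * acc  # CI violated: correct answers slower

# Two-sample Kolmogorov-Smirnov test comparing RT distributions of
# correct vs incorrect responses.
p_ci   = ks_2samp(logrt_ci[acc == 1],   logrt_ci[acc == 0]).pvalue
p_viol = ks_2samp(logrt_viol[acc == 1], logrt_viol[acc == 0]).pvalue
print(round(p_ci, 3), p_viol)
```

The violated-CI data set yields a vanishingly small *p*-value, while the CI data set typically does not, which is the qualitative behaviour such a test is designed to detect.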

A Monte Carlo study was used to compare four approaches to growth curve analysis of subjects assessed repeatedly with the same set of dichotomous items: a two-step procedure that first estimates latent trait measures using MULTILOG and then uses a hierarchical linear model to examine the changing trajectories with the estimated abilities as the outcome variable; a structural equation model using mean- and variance-adjusted weighted least squares (WLSMV) estimation; and two approaches in the framework of multilevel item response models, namely a hierarchical generalized linear model using Laplace estimation, and Bayesian analysis using Markov chain Monte Carlo (MCMC). These four methods have similar power in detecting the average linear slope across time. MCMC and Laplace estimates perform relatively better in terms of the bias of the average linear slope and the corresponding standard error, as well as the item location parameters. For the variance of the random intercept, and the covariance between the random intercept and slope, all estimates are biased in most conditions. For the random slope variance, only the Laplace estimates are unbiased when there are eight time points.

A multi-group factor model is suitable for data originating from different strata. However, it often requires a relatively large sample size to avoid numerical issues such as non-convergence and non-positive definite covariance matrices. An alternative is to pool the data from the different groups and fit a single-group factor model to the pooled data by maximum likelihood. In this paper, properties of pseudo-maximum likelihood (PML) estimators for pooled data are studied. The pooled data are treated as if they were normally distributed observations from a single group. The resulting asymptotic efficiency of the PML estimators of the factor loadings is compared with that of the multi-group maximum likelihood estimators. The effect of pooling is investigated through a two-group factor model. The variances of the factor loadings for the pooled data are underestimated under normal theory when the error variances in the smaller group are larger. The underestimation is due to dependence between the pooled factors and the pooled error terms. Small-sample properties of the PML estimators are also investigated in a Monte Carlo study.

The Asymptotic Classification Theory of Cognitive Diagnosis (Chiu *et al*., 2009, *Psychometrika*, *74*, 633–665) determined the conditions that cognitive diagnosis models must satisfy so that the correct assignment of examinees to proficiency classes is guaranteed when non-parametric classification methods are used. These conditions have only been proven for the Deterministic Input Noisy Output AND gate model. For other cognitive diagnosis models, no theoretical justification exists for using non-parametric classification techniques to assign examinees to proficiency classes. The specific statistical properties of different cognitive diagnosis models require tailored proofs of the conditions of the Asymptotic Classification Theory of Cognitive Diagnosis for each individual model – a tedious undertaking in light of the numerous models presented in the literature. In this paper a different way of addressing this task is presented. The unified mathematical framework of general cognitive diagnosis models is used as a theoretical basis for a general proof that, under mild regularity conditions, any cognitive diagnosis model is covered by the Asymptotic Classification Theory of Cognitive Diagnosis.

In a pre-test–post-test cluster randomized trial, one of the methods commonly used to detect an intervention effect involves controlling for pre-test scores and other related covariates while estimating the intervention effect at post-test. In many applications in education, the total post-test and pre-test scores, ignoring measurement error, are used as the response variable and covariate, respectively, to estimate the intervention effect. However, these test scores are frequently subject to measurement error, and statistical inferences based on a model that ignores measurement error can yield a biased estimate of the intervention effect. When multiple domains exist in test data, it is sometimes more informative to detect the intervention effect for each domain than for the entire test. This paper presents applications of a multilevel multidimensional item response model with measurement error adjustments in both the response variable and a covariate to estimate the intervention effect for each domain.

Cluster bias refers to measurement bias with respect to the clustering variable in multilevel data. The absence of cluster bias implies the absence of bias with respect to any cluster-level (level 2) variable. The variables that possibly cause the bias do not have to be measured to test for cluster bias. Therefore, the test for cluster bias serves as a global test of measurement bias with respect to any level 2 variable. However, the validity of the global test depends on its Type I and Type II error rates. We compare the performance of the test for cluster bias with the restricted factor analysis (RFA) test, which can be used if the variable that leads to measurement bias is measured. The RFA test turned out to have considerably more power than the test for cluster bias. However, the false positive rates of the test for cluster bias were generally around the expected values, whereas the RFA test showed unacceptably high false positive rates in some conditions. We conclude that if no significant cluster bias is found, significant bias with respect to a level 2 violator can still be detected with an RFA model. Although the test for cluster bias is less powerful, an advantage of the test is that the cause of the bias does not need to be measured, or even known.

In real testing, examinees may manifest different types of test-taking behaviours. In this paper we focus on two types that appear to be among the more frequently occurring behaviours – solution behaviour and rapid guessing behaviour. Rapid guessing usually happens in high-stakes tests when there is insufficient time, and in low-stakes tests when there is a lack of effort. These two qualitatively different test-taking behaviours, if ignored, will lead to violation of the local independence assumption and, as a result, yield biased item/person parameter estimation. We propose a mixture hierarchical model to account for differences among item responses and response time patterns arising from these two behaviours. The model is also able to identify the specific behaviour an examinee engages in when answering an item. A Monte Carlo expectation maximization algorithm is proposed for model calibration. A simulation study shows that the new model yields more accurate item and person parameter estimates than a non-mixture model when the data indeed come from two types of behaviour. The model also fits real, high-stakes test data better than a non-mixture model, and therefore the new model can better identify the underlying test-taking behaviour an examinee engages in on a certain item.
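A heavily stripped-down analogue of the mixture idea can be sketched with a two-component normal mixture on log response times alone, fitted by plain EM (the paper's model is hierarchical, models responses and response times jointly, and is calibrated by Monte Carlo EM; all numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated log response times from two test-taking behaviours:
# rapid guessing -> short RTs, solution behaviour -> longer RTs.
n_guess, n_sol = 300, 700
logrt = np.concatenate([rng.normal(0.0, 0.3, n_guess),   # rapid guesses
                        rng.normal(2.0, 0.5, n_sol)])    # solution behaviour

# Two-component normal mixture fitted by EM.
mu = np.array([logrt.min(), logrt.max()])   # spread-out starting means
sd = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for _ in range(200):
    # E-step: posterior probability of each component per observation
    dens = pi * np.exp(-0.5 * ((logrt[:, None] - mu) / sd) ** 2) / sd
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing weights, means, standard deviations
    nk = resp.sum(axis=0)
    pi = nk / len(logrt)
    mu = (resp * logrt[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (logrt[:, None] - mu) ** 2).sum(axis=0) / nk)

# Classify each response to its more probable behaviour:
# component 0 (small mean) = rapid guessing, component 1 = solution.
behaviour = resp.argmax(axis=1)
print(np.round(np.sort(mu), 1), round(pi.min(), 2))
```

The recovered component means and mixing weight approximate the generating values, and almost all simulated rapid guesses are assigned to the short-RT component, which is the kind of item-level behaviour identification the abstract describes.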
