An important distinction between different models for response time and accuracy is whether conditional independence (CI) between response time and accuracy is assumed. In the present study, a test for CI given an exponential family model for accuracy (for example, the Rasch model or the one-parameter logistic model) is proposed and evaluated in a simulation study. The procedure is based on non-parametric Kolmogorov–Smirnov tests. As an illustrative example, the CI test was applied to data from an arithmetic test for secondary education.
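The conditioning step behind such a Kolmogorov–Smirnov CI test can be illustrated on simulated data. This is a minimal sketch, assuming (as one plausible implementation, not necessarily the authors' exact procedure) that response-time distributions for correct and incorrect responses are compared within groups of examinees with the same rest score:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Simulated toy data for one focal item: 'score' is the rest score on the
# remaining items (the conditioning variable); under CI, response time does
# not depend on correctness once we condition on it.
n = 500
score = rng.integers(0, 6, size=n)            # rest-score groups 0..5
correct = rng.binomial(1, 0.3 + 0.1 * score)  # accuracy rises with rest score
rt = rng.normal(4.0, 0.5, size=n)             # log response times, independent of correctness

# KS test within each rest-score group: under CI, the RT distributions of
# correct and incorrect responders should coincide in every group.
for g in np.unique(score):
    mask = score == g
    rt_correct = rt[mask & (correct == 1)]
    rt_incorrect = rt[mask & (correct == 0)]
    if len(rt_correct) > 4 and len(rt_incorrect) > 4:
        stat, p = ks_2samp(rt_correct, rt_incorrect)
        print(f"score group {g}: D = {stat:.3f}, p = {p:.3f}")
```

With data generated under CI, as here, the p-values should look uniform; systematically small p-values in some score groups would signal a violation.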

A Monte Carlo study was used to compare four approaches to growth curve analysis of subjects assessed repeatedly with the same set of dichotomous items: A two-step procedure first estimating latent trait measures using MULTILOG and then using a hierarchical linear model to examine the changing trajectories with the estimated abilities as the outcome variable; a structural equation model using modified weighted least squares (WLSMV) estimation; and two approaches in the framework of multilevel item response models, including a hierarchical generalized linear model using Laplace estimation, and Bayesian analysis using Markov chain Monte Carlo (MCMC). These four methods have similar power in detecting the average linear slope across time. MCMC and Laplace estimates perform relatively better on the bias of the average linear slope and corresponding standard error, as well as the item location parameters. For the variance of the random intercept, and the covariance between the random intercept and slope, all estimates are biased in most conditions. For the random slope variance, only Laplace estimates are unbiased when there are eight time points.

A multi-group factor model is suitable for data originating from different strata. However, it often requires a relatively large sample size to avoid numerical issues such as non-convergence and non-positive definite covariance matrices. An alternative is to pool data from different groups and fit a single-group factor model to the pooled data using maximum likelihood. In this paper, properties of pseudo-maximum likelihood (PML) estimators for pooled data are studied. The pooled data are treated as if they were normally distributed observations from a single group. The resulting asymptotic efficiency of the PML estimators of factor loadings is compared with that of the multi-group maximum likelihood estimators. The effect of pooling is investigated through a two-group factor model. The variances of factor loadings for the pooled data are underestimated under the normal theory when error variances in the smaller group are larger. This underestimation is due to dependence between the pooled factors and pooled error terms. Small-sample properties of the PML estimators are also investigated in a Monte Carlo study.

The Asymptotic Classification Theory of Cognitive Diagnosis (Chiu *et al.*, 2009, *Psychometrika*, *74*, 633–665) determined the conditions that cognitive diagnosis models must satisfy so that the correct assignment of examinees to proficiency classes is guaranteed when non-parametric classification methods are used. These conditions have only been proven for the Deterministic Input Noisy Output AND gate (DINA) model. For other cognitive diagnosis models, no theoretical legitimization exists for using non-parametric classification techniques to assign examinees to proficiency classes. The specific statistical properties of different cognitive diagnosis models require tailored proofs of the conditions of the Asymptotic Classification Theory of Cognitive Diagnosis for each individual model – a tedious undertaking in light of the numerous models presented in the literature. In this paper, a different way to address this task is presented. The unified mathematical framework of general cognitive diagnosis models is used as a theoretical basis for a general proof that, under mild regularity conditions, any cognitive diagnosis model is covered by the Asymptotic Classification Theory of Cognitive Diagnosis.

In real testing, examinees may manifest different types of test-taking behaviours. In this paper we focus on two types that appear to be among the more frequently occurring behaviours – solution behaviour and rapid guessing behaviour. Rapid guessing usually happens in high-stakes tests when there is insufficient time, and in low-stakes tests when there is lack of effort. These two qualitatively different test-taking behaviours, if ignored, will lead to violation of the local independence assumption and, as a result, yield biased item/person parameter estimation. We propose a mixture hierarchical model to account for differences among item responses and response time patterns arising from these two behaviours. The model is also able to identify the specific behaviour an examinee engages in when answering an item. A Monte Carlo expectation maximization algorithm is proposed for model calibration. A simulation study shows that the new model yields more accurate item and person parameter estimates than a non-mixture model when the data indeed come from two types of behaviour. The model also fits real, high-stakes test data better than a non-mixture model, and therefore the new model can better identify the underlying test-taking behaviour an examinee engages in on a certain item.

In a pre-test–post-test cluster randomized trial, one of the methods commonly used to detect an intervention effect involves controlling for pre-test scores and other related covariates while estimating the intervention effect at post-test. In many applications in education, the total post-test and pre-test scores, ignoring measurement error, are used as the response variable and covariate, respectively, to estimate the intervention effect. However, these test scores are frequently subject to measurement error, and statistical inferences based on a model that ignores measurement error can yield a biased estimate of the intervention effect. When multiple domains exist in test data, it is sometimes more informative to detect the intervention effect for each domain than for the entire test. This paper presents applications of the multilevel multidimensional item response model with measurement error adjustments in both the response variable and a covariate to estimate the intervention effect for each domain.

Cluster bias refers to measurement bias with respect to the clustering variable in multilevel data. The absence of cluster bias implies absence of bias with respect to any cluster-level (level 2) variable. The variables that possibly cause the bias do not have to be measured to test for cluster bias. Therefore, the test for cluster bias serves as a global test of measurement bias with respect to any level 2 variable. However, the validity of the global test depends on the Type I and Type II error rates of the test. We compare the performance of the test for cluster bias with that of the restricted factor analysis (RFA) test, which can be used if the variable that leads to measurement bias is measured. The RFA test turned out to have considerably more power than the test for cluster bias. However, the false positive rates of the test for cluster bias were generally around the expected values, whereas the RFA test showed unacceptably high false positive rates in some conditions. We conclude that if no significant cluster bias is found, significant bias with respect to a level 2 violator may still be detected with an RFA model. Although the test for cluster bias is less powerful, it has the advantage that the cause of the bias need not be measured, or even known.

We show how the hierarchical model for responses and response times as developed by van der Linden (2007), Fox, Klein Entink, and van der Linden (2007), Klein Entink, Fox, and van der Linden (2009), and Glas and van der Linden (2010) can be simplified to a generalized linear factor model with only the mild restriction that there is no hierarchical model at the item side. This result is valuable because it makes available all the well-developed modelling tools and extensions that come with generalized linear factor models. We show that the restriction we impose on the hierarchical model does not influence parameter recovery under realistic circumstances. In addition, we present two illustrative real data analyses to demonstrate the practical benefits of our approach.

In this paper, the performance of six types of techniques for comparisons of means is examined. These six emerge from the distinction between the method employed (hypothesis testing, model selection using information criteria, or Bayesian model selection) and the set of hypotheses that is investigated (a classical, exploration-based set of hypotheses containing equality constraints on the means, or a theory-based limited set of hypotheses with equality and/or order restrictions). A simulation study is conducted to examine the performance of these techniques. We demonstrate that, if one has specific, a priori specified hypotheses, confirmation (i.e., investigating theory-based hypotheses) has advantages over exploration (i.e., examining all possible equality-constrained hypotheses). Furthermore, examining reasonable order-restricted hypotheses has more power to detect the true effect/non-null hypothesis than evaluating only equality restrictions. Additionally, when investigating more than one theory-based hypothesis, model selection is preferred over hypothesis testing. Because of the first two results, we further examine the techniques that are able to evaluate order restrictions in a confirmatory fashion by examining their performance when the homogeneity of variance assumption is violated. Results show that the techniques are robust to heterogeneity when the sample sizes are equal. When the sample sizes are unequal, the performance is affected by heterogeneity. The size and direction of the deviations from the baseline, where there is no heterogeneity, depend on the effect size (of the means) and on the trend in the group variances with respect to the ordering of the group sizes. Importantly, the deviations are less pronounced when the group variances and sizes exhibit the same trend (e.g., are both increasing with group number).

We analytically derive the fixed-effects estimates in unconditional linear growth curve models by typical linear mixed-effects modelling (TLME) and by a pattern-mixture (PM) approach with random-slope-dependent two-missing-pattern missing not at random (MNAR) longitudinal data. Results showed that when the missingness mechanism is random-slope-dependent MNAR, TLME estimates of both the mean intercept and mean slope are biased because of incorrect weights used in the estimation. More specifically, the estimate of the mean slope is biased towards the mean slope for completers, whereas the estimate of the mean intercept is biased in the opposite direction relative to the estimate of the mean slope. We also discuss why the PM approach can provide unbiased fixed-effects estimates for random-coefficients-dependent MNAR data but does not work well for missing at random or outcome-dependent MNAR data. A small simulation study was conducted to illustrate the results and to compare results from TLME and PM. Results from an empirical data analysis showed that the conceptual finding can be generalized to other real conditions even when some assumptions for the analytical derivation cannot be met. Implications of the analytical and empirical results are discussed, and sensitivity analysis is suggested for longitudinal data analysis with missing data.

Cognitive diagnosis models of educational test performance rely on a binary Q-matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non-parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non-parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non-parametric classification techniques require that the Q-matrix underlying a given test be known. Unfortunately, in practice, the Q-matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q-matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum-score profiles does not require knowledge of the Q-matrix, and results in a more accurate classification of examinees.
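The central idea, clustering examinees directly on their item-score vectors without using a Q-matrix, can be sketched as follows. The Q-matrix, slip and guess rates, and sample sizes below are purely illustrative, and k-means stands in for whichever clustering method is actually applied:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(7)

# Toy conjunctive ("DINA-like") data: 2 attributes, hence 4 latent
# proficiency classes; 8 items, 400 examinees (all values illustrative).
Q = np.array([[1, 0], [1, 0], [0, 1], [0, 1],
              [1, 1], [1, 1], [1, 0], [0, 1]])
alphas = rng.integers(0, 2, size=(400, 2))                 # latent attribute profiles
eta = np.all(alphas[:, None, :] >= Q[None, :, :], axis=2)  # item-mastery indicator
slip, guess = 0.1, 0.1
X = rng.binomial(1, np.where(eta, 1 - slip, guess))        # observed item scores

# Cluster directly on the raw item scores; the Q-matrix is only needed above
# to simulate data, not for the clustering itself.
centroids, labels = kmeans2(X.astype(float), 4, minit='++', seed=0)
print(np.bincount(labels, minlength=4))
```

Clustering on attribute sum-score profiles would instead require computing `X @ Q` first, which is exactly the step that presupposes a correctly specified Q-matrix.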

Equivalence tests are an alternative to traditional difference-based tests for demonstrating a lack of association between two variables. While there are several recent studies investigating equivalence tests for comparing means, little research has been conducted on equivalence methods for evaluating the equivalence or similarity of two correlation coefficients or two regression coefficients. The current project proposes novel tests for evaluating the equivalence of two regression or correlation coefficients derived from the two one-sided tests (TOST) method (Schuirmann, 1987, *J. Pharmacokinet. Biopharm.*, *15*, 657) and an equivalence test by Anderson and Hauck (1983, *Stat. Commun.*, *12*, 2663). A simulation study was used to evaluate the performance of these tests and compare them with the common, yet inappropriate, method of assessing equivalence using non-rejection of the null hypothesis in difference-based tests. Results demonstrate that equivalence tests have more accurate probabilities of declaring equivalence than difference-based tests. However, equivalence tests require large sample sizes to ensure adequate power. We recommend the Anderson–Hauck equivalence test over the TOST method for comparing correlation or regression coefficients.
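The TOST logic for two independent correlations can be sketched with Fisher's z transformation. Placing the equivalence bound on the z scale (rather than the correlation scale) is a simplification of this sketch, not necessarily the proposed test:

```python
import numpy as np
from scipy.stats import norm

def tost_two_correlations(r1, n1, r2, n2, delta=0.2):
    """TOST sketch for the equivalence of two independent correlations.
    delta is the equivalence bound on the Fisher-z scale (an assumption
    of this sketch); equivalence is declared if the returned p < alpha."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    diff = z1 - z2
    # Two one-sided tests: H0a: diff <= -delta  and  H0b: diff >= +delta.
    p_lower = 1 - norm.cdf((diff + delta) / se)
    p_upper = norm.cdf((diff - delta) / se)
    return diff, max(p_lower, p_upper)

diff, p = tost_two_correlations(0.42, 300, 0.45, 300, delta=0.2)
print(f"z-difference = {diff:.3f}, TOST p = {p:.4f}")
```

Rejecting both one-sided null hypotheses confines the difference to the interval (-delta, delta), which is what "equivalence" means here; non-rejection in an ordinary difference test provides no such guarantee.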

Many cognitive theories of judgement and decision making assume that choice options are evaluated relative to other available options. The extent to which the preference for one option is influenced by other available options will often depend on how similar the options are to each other, where similarity is assumed to be a decreasing function of the distance between options. We examine how the distance between preferential options that are described on multiple attributes can be determined. Previous distance functions do not take into account that attributes differ in their subjective importance, are limited to two attributes, or neglect the preferential relationship between the options. To measure the distance between preferential options it is necessary to take the subjective preferences of the decision maker into account. Accordingly, the multi-attribute space that defines the relationship between options can be stretched or shrunk relative to the attention or importance that a person gives to different attributes describing the options. Here, we propose a generalized distance function for preferential choices that takes subjective attribute importance into account and allows for individual differences according to such subjective preferences. Using a hands-on example, we illustrate the application of the function and compare it to previous distance measures. We conclude with a discussion of the suitability and limitations of the proposed distance function.
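A generic attention-weighted Minkowski distance illustrates the kind of function described above, in which the attribute space is stretched or shrunk by subjective importance weights. The specific functional form and parameter values are assumptions of this sketch, not the authors' exact proposal:

```python
import numpy as np

def weighted_distance(x, y, w, r=2.0):
    """Minkowski-type distance between two multi-attribute options,
    with non-negative attention weights w (normalized to sum to 1).
    A sketch of the general idea, not the proposed function itself."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                                   # normalize attention weights
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return float(np.sum(w * d ** r) ** (1 / r))

# Two options described on (price, quality). A decision maker who attends
# mostly to price effectively stretches the price dimension.
opt_a, opt_b = [10.0, 7.0], [12.0, 3.0]
print(weighted_distance(opt_a, opt_b, w=[0.8, 0.2]))  # price-focused
print(weighted_distance(opt_a, opt_b, w=[0.2, 0.8]))  # quality-focused
```

Because the options differ more on quality than on price, the quality-focused weighting yields the larger distance, capturing how subjective importance reshapes perceived similarity.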

How do people choose between a smaller reward available sooner and a larger reward available later? Past research has evaluated models of intertemporal choice by measuring goodness of fit or identifying which decision-making anomalies they can accommodate. An alternative criterion for model quality, which is partly antithetical to these standard criteria, is predictive accuracy. We used cross-validation to examine how well 10 models of intertemporal choice could predict behaviour in a 100-trial binary-decision task. Many models achieved the apparent ceiling of 85% accuracy, even with smaller training sets. When noise was added to the training set, however, a simple logistic-regression model we call the difference model performed particularly well. In many situations, between-model differences in predictive accuracy may be small, contrary to long-standing controversy over the modelling question in research on intertemporal choice, but the simplicity and robustness of the difference model recommend it to future use.
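One plausible reading of such a difference model, a logistic regression on reward and delay differences, can be sketched together with a simple holdout validation. The predictors, data-generating values, and holdout scheme are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy intertemporal-choice data: 1 = choose larger-later, 0 = smaller-sooner.
n = 400
d_amount = rng.uniform(1, 20, n)   # larger-later minus smaller-sooner reward
d_delay = rng.uniform(5, 60, n)    # extra waiting time (e.g., in days)
true_logit = 0.4 * d_amount - 0.1 * d_delay
choice = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Standardized design matrix with an intercept column (standardizing keeps
# plain gradient ascent numerically well behaved).
Z = np.column_stack([
    np.ones(n),
    (d_amount - d_amount.mean()) / d_amount.std(),
    (d_delay - d_delay.mean()) / d_delay.std(),
])

def fit_logistic(X, y, lr=0.1, steps=3000):
    """Logistic regression fitted by gradient ascent on the log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

# Holdout cross-validation: fit on the first half, predict the second half.
w = fit_logistic(Z[:200], choice[:200])
pred = (Z[200:] @ w > 0).astype(int)
accuracy = (pred == choice[200:]).mean()
print(f"out-of-sample accuracy: {accuracy:.2f}")
```

Scoring models by out-of-sample accuracy, as here, is the predictive criterion the abstract contrasts with goodness of fit on the training data.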

This article proposes an approach to modelling partially cross-classified multilevel data where some of the level-1 observations are nested in one random factor and some are cross-classified by two random factors. A simulation study compares the proposed approach with two other commonly used approaches, which treat the partially cross-classified data as either fully nested or fully cross-classified. Results show that the proposed approach demonstrates desirable performance in terms of parameter estimates and statistical inferences. Both the fully nested model and the fully cross-classified model suffer from biased estimates of some variance components and statistical inferences of some fixed effects. Results also indicate that the proposed model is robust against cluster size imbalance.

In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE^{®} General Analytical Writing and until 2009 in the case of TOEFL^{®} iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater^{®}. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so statistical adjustment methods are applied to obtain the needed estimates of the variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability.
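The best-linear-predictor idea from classical test theory can be sketched on simulated ratings. In practice the true score is never observed and the required variances and covariances must be estimated (e.g., from repeat test-takers, as the abstract describes), so the simulation below, with all values illustrative, only shows the form of the predictor:

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulate a true essay score and two error-prone measurements of it:
# a human holistic rating and a machine rating. Error SDs are illustrative.
n = 1000
true_score = rng.normal(3.5, 0.8, n)
human = true_score + rng.normal(0, 0.5, n)
machine = true_score + rng.normal(0, 0.6, n)

# Best linear predictor of the true score from (human, machine):
# solve S beta = c, where S = Cov(X) and c = Cov(X, T). Here the
# covariances are computed with the simulated true score, which is
# exactly the quantity that must be estimated indirectly in practice.
X = np.column_stack([human, machine])
S = np.cov(X, rowvar=False)
c = np.array([np.cov(human, true_score)[0, 1],
              np.cov(machine, true_score)[0, 1]])
beta = np.linalg.solve(S, c)
pred = true_score.mean() + (X - X.mean(axis=0)) @ beta

rmse_blp = np.sqrt(np.mean((pred - true_score) ** 2))
rmse_human = np.sqrt(np.mean((human - true_score) ** 2))
print(f"RMSE of best linear predictor: {rmse_blp:.3f}")
print(f"RMSE of human rating alone: {rmse_human:.3f}")
```

Pooling the two fallible measurements and shrinking towards the mean yields a smaller error than either rating alone, which is the sense in which scoring reliability can improve.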