This paper introduces a two-parameter family of distributions for modelling random variables on the (0,1) interval by applying the cumulative distribution function of one ‘parent’ distribution to the quantile function of another. Family members have explicit probability density functions, cumulative distribution functions and quantile functions, parameterized by a location parameter and a dispersion parameter. They capture a wide variety of shapes that the beta and Kumaraswamy distributions cannot. They are amenable to likelihood inference, and enable a wide variety of quantile regression models, with predictors for both the location and dispersion parameters. We demonstrate their applicability to psychological research problems and their utility in modelling real data.

Inference methods for null hypotheses formulated in terms of distribution functions in general non-parametric factorial designs are studied. The methods can be applied to continuous, ordinal or even ordered categorical data in a unified way, and are based only on ranks. In this set-up Wald-type statistics and ANOVA-type statistics are the current state of the art. The first method is asymptotically exact but rather liberal for small to moderate sample sizes, while the latter is only an approximation which does not possess the correct asymptotic α level under the null. To bridge these gaps, a novel permutation approach is proposed which can be seen as a flexible generalization of the Kruskal–Wallis test to all kinds of factorial designs with independent observations. It is proven that the permutation principle is asymptotically correct while keeping its finite exactness property when data are exchangeable. The results of extensive simulation studies support these theoretical findings. A real data set exemplifies its applicability.
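The rank-based permutation principle can be sketched for the simplest one-way layout. This is an illustrative reduction, not the paper's general factorial procedure: the function names are invented for the example, and ties are ignored for simplicity.

```python
import numpy as np

def kruskal_wallis_stat(groups):
    """Rank-based Kruskal-Wallis H statistic for k independent samples (assumes no ties)."""
    data = np.concatenate(groups)
    n = len(data)
    ranks = data.argsort().argsort() + 1.0  # ordinal ranks 1..n
    h, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]
        h += len(g) * (r.mean() - (n + 1) / 2.0) ** 2
        start += len(g)
    return 12.0 / (n * (n + 1)) * h

def permutation_pvalue(groups, n_perm=2000, seed=0):
    """Approximate the permutation null of H by re-randomizing group labels."""
    rng = np.random.default_rng(seed)
    sizes = [len(g) for g in groups]
    pooled = np.concatenate(groups)
    observed = kruskal_wallis_stat(groups)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        resampled, start = [], 0
        for s in sizes:
            resampled.append(perm[start:start + s])
            start += s
        if kruskal_wallis_stat(resampled) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction keeps p > 0
```

With exchangeable data every relabelling is equally likely under the null, which is what gives the permutation test its finite-sample exactness.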

Despite the fact that data and theories in the social, behavioural, and health sciences are often represented on an ordinal scale, there has been relatively little emphasis on modelling ordinal properties. The most common analytic framework used in psychological science is the general linear model, whose variants include ANOVA, MANOVA, and ordinary linear regression. While these methods are designed to provide the best fit to the metric properties of the data, they are not designed to maximally model ordinal properties. In this paper, we develop an order-constrained linear least-squares (OCLO) optimization algorithm that maximizes the linear least-squares fit to the data conditional on maximizing the ordinal fit based on Kendall's τ. The algorithm builds on the maximum rank correlation estimator (Han, 1987, *Journal of Econometrics*, 35, 303) and the general monotone model (Dougherty & Thomas, 2012, *Psychological Review*, 119, 321). Analyses of simulated data indicate that when modelling data that adhere to the assumptions of ordinary least squares, OCLO shows minimal bias, little increase in variance, and almost no loss in out-of-sample predictive accuracy. In contrast, under conditions in which data include a small number of extreme scores (fat-tailed distributions), OCLO shows less bias and variance, and substantially better out-of-sample predictive accuracy, even when the outliers are removed. We show that the advantages of OCLO over ordinary least squares in predicting new observations hold across a variety of scenarios in which researchers must decide to retain or eliminate extreme scores when fitting data.
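As a rough illustration of the maximum rank correlation idea that OCLO builds on, the sketch below grid-searches a two-predictor index direction that maximizes Kendall's τ with the outcome. The grid-search strategy and all names are illustrative assumptions, not the OCLO algorithm itself; the τ computation uses plain O(n²) pair counting without tie correction.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau-a via pairwise concordance counting (no tie correction)."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))

def max_rank_correlation(X, y, n_grid=181):
    """Grid-search the direction of a two-predictor linear index maximizing tau with y.

    Rank correlation is invariant to monotone transforms of the index, so only the
    direction (angle theta) of the coefficient vector is identified.
    """
    best_theta, best_tau = 0.0, -1.0
    for theta in np.linspace(0.0, np.pi, n_grid):
        idx = X[:, 0] * np.cos(theta) + X[:, 1] * np.sin(theta)
        t = kendall_tau(idx, y)
        if t > best_tau:
            best_theta, best_tau = theta, t
    return best_theta, best_tau
```

Because τ only depends on orderings, the recovered direction is robust to any monotone distortion of the outcome, which is the property that makes rank-based fitting resistant to extreme scores.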

This paper compares the multilevel modelling (MLM) approach and the person-specific (PS) modelling approach in examining autoregressive (AR) relations with intensive longitudinal data. Two simulation studies are conducted to examine the influences of sample heterogeneity, time series length, sample size, and distribution of individual-level AR coefficients on the accuracy of AR estimates, both at the population level and at the individual level. It is found that MLM generally outperforms the PS approach under two conditions: when the sample has a homogeneous AR pattern, namely, when all individuals in the sample are characterized by AR processes with the same order; and when the sample has heterogeneous AR patterns, but a multilevel model with a sufficiently high order (i.e., an order equal to or higher than the maximum order of individual AR patterns in the sample) is fitted and successfully converges. If a lower-order multilevel model is chosen for heterogeneous samples, the higher-order lagged effects are misrepresented, resulting in bias at the population level and larger prediction errors at the individual level. In these cases, the PS approach is preferable, given sufficient measurement occasions (*T* ≥ 50). In addition, sample size and distribution of individual-level AR coefficients do not have a large impact on the results. Implications of these findings for model selection and research design are discussed.
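A minimal person-specific analysis of the kind compared above can be sketched as simulating one individual's AR(1) series and estimating its lag-1 coefficient by least squares on that series alone. The function names are illustrative, not from the paper.

```python
import numpy as np

def simulate_ar1(phi, T, sigma=1.0, rng=None):
    """Simulate a mean-zero AR(1) series of length T: x_t = phi * x_{t-1} + e_t."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.normal(scale=sigma)
    return x

def estimate_ar1(x):
    """Person-specific OLS estimate of the lag-1 coefficient from one series."""
    return float(np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1]))
```

The precision of this per-person estimate grows with the number of measurement occasions *T*, which is why the PS approach needs sufficiently long series; a multilevel model instead pools information across persons.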

The gain–loss model (GaLoM) is a formal model for assessing knowledge and learning. In its original formulation, the GaLoM assumes independence among the skills. Such an assumption is not reasonable in several domains, in which some preliminary knowledge is the foundation for other knowledge. This paper presents an extension of the GaLoM to the case in which the skills are not independent, and the dependence relation among them is described by a well-graded competence space. The probability of mastering skill *s* at the pretest is conditional on the presence of all skills on which *s* depends. The probabilities of gaining or losing skill *s* when moving from pretest to posttest are conditional on the mastery of *s* at the pretest, and on the presence at the posttest of all skills on which *s* depends. Two formulations of the model are presented, in which the learning path is allowed to change from pretest to posttest or not. A simulation study shows that models based on the true competence space obtain a better fit than models based on false competence spaces, and are also characterized by a higher assessment accuracy. An empirical application shows that models based on pedagogically sound assumptions about the dependencies among the skills obtain a better fit than models assuming independence among the skills.

Subgroup analyses allow us to examine the influence of a categorical moderator on the effect size in meta-analysis. We conducted a simulation study using a dichotomous moderator, and compared the impact of pooled versus separate estimates of the residual between-studies variance on the statistical performance of the *Q*_{B(P)} and *Q*_{B(S)} tests for subgroup analyses assuming a mixed-effects model. Our results suggested that similar performance can be expected as long as there are at least 20 studies and these are approximately balanced across categories. Conversely, when subgroups were unbalanced, the practical consequences of having heterogeneous residual between-studies variances were more evident, with both tests leading to the wrong statistical conclusion more often than in the conditions with balanced subgroups. A pooled estimate should be preferred for most scenarios, unless the residual between-studies variances are clearly different and there are enough studies in each category to obtain precise separate estimates.

Based on data from a cognitive test presented in a condition with time constraints per item and a condition without time constraints, the effect of speed on accuracy is investigated. First, if the effect of imposed speed on accuracy is negative it can be explained by the speed–accuracy trade-off, and if it can be captured through the corresponding latent variables, then measurement invariance applies between a condition with and a condition without time constraints. The results do show a negative effect and a lack of measurement invariance. Second, the conditional accuracy function (CAF) is investigated in both conditions, with and without time constraints. The CAF shows an (item-dependent) negative conditional dependence between response time and response accuracy and thus a positive relationship between speed and accuracy, which implies that faster responses are more accurate. In sum, there seem to be two kinds of speed effects: a speed–accuracy trade-off effect induced by imposed speed and an opposite CAF effect associated with speed within conditions. The second effect is interpreted as stemming from a within-person variation of the cognitive capacity during the test which simultaneously favours or disfavours speed and accuracy.

Cognitive psychometric models embed cognitive process models into a latent trait framework in order to allow for individual differences. Due to their close relationship to the response process, the models allow for profound conclusions about the test takers. However, before such a model can be used its fit has to be checked carefully. In this manuscript we give an overview of existing tests of model fit and show their relation to the generalized moment test of Newey (*Econometrica*, 53, 1985, 1047) and Tauchen (*Journal of Econometrics*, 30, 1985, 415). We also present a new test, the Hausman test of misspecification (Hausman, *Econometrica*, 46, 1978, 1251). The Hausman test consists of a comparison of two estimates of the same item parameters which should be similar if the model holds. The performance of the Hausman test is evaluated in a simulation study. In this study we illustrate its application to two popular models in cognitive psychometrics, the Q-diffusion model and the D-diffusion model (van der Maas, Molenaar, Maris, Kievit, & Borsboom, *Psychological Review*, 118, 2011, 339; Molenaar, Tuerlinckx, & van der Maas, *Journal of Statistical Software*, 66, 2015, 1). We also compare the performance of the test to three alternative tests of model fit, namely the *M*_{2} test (Molenaar *et al*., *Journal of Statistical Software*, 66, 2015, 1), the moment test (Ranger *et al*., *British Journal of Mathematical and Statistical Psychology*, 2016) and the test for binned time (Ranger & Kuhn, *Psychological Test and Assessment Modeling*, 56, 2014b, 370). The simulation study indicates that the Hausman test is superior to the latter tests. The test closely adheres to the nominal Type I error rate and has higher power in most simulation conditions.

In generalized linear modelling of responses and response times, the observed response time variables are commonly transformed to make their distribution approximately normal. A normal distribution for the transformed response times is desirable as it justifies the linearity and homoscedasticity assumptions in the underlying linear model. Past research has, however, shown that the transformed response times are not always normal. Models have been developed to accommodate this violation. In the present study, we propose a modelling approach for responses and response times to test and model non-normality in the transformed response times. Most importantly, we distinguish between non-normality due to heteroscedastic residual variances, and non-normality due to a skewed speed factor. In a simulation study, we establish parameter recovery and the power to separate both effects. In addition, we apply the model to a real data set.

The purpose of this paper is to highlight the importance of a population model in guiding the design and interpretation of simulation studies used to investigate the Spearman rank correlation. The Spearman rank correlation has been known for over a hundred years to applied researchers and methodologists alike and is one of the most widely used non-parametric statistics. Still, certain misconceptions can be found, either explicitly or implicitly, in the published literature because a population definition for this statistic is rarely discussed within the social and behavioural sciences. By relying on copula distribution theory, a population model is presented for the Spearman rank correlation, and its properties are explored both theoretically and in a simulation study. Through the use of the Iman–Conover algorithm (which allows the user to specify the rank correlation as a population parameter), simulation studies from previously published articles are explored, and it is found that many of the conclusions purported in them regarding the nature of the Spearman correlation would change if the data-generation mechanism better matched the simulation design. More specifically, issues such as small sample bias and lack of power of the *t*-test and *r*-to-*z* Fisher transformation disappear when the rank correlation is calculated from data sampled where the rank correlation is the population parameter. A proof for the consistency of the sample estimate of the rank correlation is shown as well as the flexibility of the copula model to encompass results previously published in the mathematical literature.
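The population view of the Spearman correlation can be illustrated numerically: under a Gaussian copula (bivariate normal) with Pearson correlation r, the population Spearman rank correlation is (6/π)·arcsin(r/2), and the sample statistic converges to that value, not to r itself. The sketch below is illustrative only and does not implement the Iman–Conover algorithm; the rank computation assumes no ties.

```python
import numpy as np

def spearman_rho(x, y):
    """Sample Spearman rank correlation: the Pearson correlation of the rank vectors."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ordinal ranks (assumes no ties)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def normal_population_spearman(r):
    """Population Spearman rho implied by a bivariate normal with Pearson correlation r."""
    return 6.0 / np.pi * np.arcsin(r / 2.0)
```

For r = 0.8 the population Spearman value is about 0.786, so a simulation that treats 0.8 as the target rank correlation while generating bivariate normal data with Pearson 0.8 is quietly studying a mismatched parameter, which is the kind of design issue the paper highlights.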

The Cox proportional hazards model with a latent trait variable (Ranger & Ortner, 2012, *Br. J. Math. Stat. Psychol., 65*, 334) has shown promise in accounting for the dependency among response times from the same examinee. The model allows flexibility in shapes of response time distributions using the non-parametric baseline hazard rate while allowing parametric inference about the latent variable via exponential regression. The flexibility of the model, however, comes at the price of a significant increase in the complexity of estimating the model. The purpose of this study is to propose a new estimation approach to overcome this difficulty in model estimation. The new procedure is based on the penalized partial likelihood estimator in which the partial likelihood is maximized in the presence of a penalty function. The potential of the proposed method is corroborated by a series of simulation studies for fitting the proportional hazards latent trait model to psychological and educational testing data. The application of the estimation method to the hierarchical framework (van der Linden, 2007, *Psychometrika, 72*, 287) is also illustrated for jointly analysing response times and accuracy scores.

It is becoming more feasible and common to register response times in the application of psychometric tests. Researchers thus have the opportunity to jointly model response accuracy and response time, which provides users with more relevant information. The most common choice is to use the hierarchical model (van der Linden, 2007, *Psychometrika*, 72, 287), which assumes conditional independence between response time and accuracy, given a person's speed and ability. However, this assumption may be violated in practice if, for example, persons vary their speed or differ in their response strategies, leading to conditional dependence between response time and accuracy and confounding measurement. We propose six nested hierarchical models for response time and accuracy that allow for conditional dependence, and discuss their relationship to existing models. Unlike existing approaches, the proposed hierarchical models allow for various forms of conditional dependence in the model and allow the effect of continuous residual response time on response accuracy to be item-specific, person-specific, or both. Estimation procedures for the models are proposed, as well as two information criteria that can be used for model selection. Parameter recovery and usefulness of the information criteria are investigated using simulation, indicating that the procedure works well and is likely to select the appropriate model. Two empirical applications are discussed to illustrate the different types of conditional dependence that may occur in practice and how these can be captured using the proposed hierarchical models.

The emergence of Gaussian model-based partitioning as a viable alternative to *K*-means clustering fosters a need for discrete optimization methods that can be efficiently implemented using model-based criteria. A variety of alternative partitioning criteria have been proposed for more general data conditions that permit elliptical clusters, different spatial orientations for the clusters, and unequal cluster sizes. Unfortunately, many of these partitioning criteria are computationally demanding, which makes the multiple-restart (multistart) approach commonly used for *K*-means partitioning less effective as a heuristic solution strategy. As an alternative, we propose an approach based on iterated local search (ILS), which has proved effective in previous combinatorial data analysis contexts. We compared multistart, ILS and hybrid multistart–ILS procedures for minimizing a very general model-based criterion that assumes no restrictions on cluster size or within-group covariance structure. This comparison, which used 23 data sets from the classification literature, revealed that the ILS and hybrid heuristics generally provided better criterion function values than the multistart approach when all three methods were constrained to the same 10-min time limit. In many instances, these differences in criterion function values reflected profound differences in the partitions obtained.
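An iterated local search of this general shape can be sketched with the plain within-group sum-of-squares criterion standing in for the more general model-based criterion. The perturbation scheme (randomly reassigning roughly 10% of the points) and all names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def wgss(X, labels, k):
    """Within-group sum of squares: the simplest partitioning criterion."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def local_search(X, labels, k):
    """Greedy single-point relocation until no move lowers the criterion."""
    labels = labels.copy()
    improved = True
    while improved:
        improved = False
        for i in range(len(X)):
            current = labels[i]
            best_c, best_val = current, None
            for c in range(k):
                labels[i] = c
                val = wgss(X, labels, k)
                if best_val is None or val < best_val - 1e-9:
                    best_c, best_val = c, val
            labels[i] = best_c
            if best_c != current:
                improved = True
    return labels

def iterated_local_search(X, k, n_iter=20, seed=0):
    """ILS: local search, then repeatedly perturb the incumbent and re-search, keeping the best."""
    rng = np.random.default_rng(seed)
    best = local_search(X, rng.integers(k, size=len(X)), k)
    best_val = wgss(X, best, k)
    for _ in range(n_iter):
        pert = best.copy()
        for i in rng.choice(len(X), size=max(1, len(X) // 10), replace=False):
            pert[i] = rng.integers(k)  # perturbation: random reassignments
        cand = local_search(X, pert, k)
        val = wgss(X, cand, k)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val
```

The key contrast with multistart is that ILS restarts from a perturbed copy of the incumbent rather than from scratch, so each local search begins near a good solution and typically converges faster; a model-based criterion would replace `wgss` while the search skeleton stays the same.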

Stability or sensitivity analysis is an important topic in data analysis that has received little attention in the application of multidimensional scaling (MDS), for which the only available approaches are given in terms of a coordinate-based analytical jackknife methodology. Although in MDS the prime interest is in assessing the stability of the points in the configuration, this methodology may be influenced by imprecisions resulting from the inherently necessary Procrustes method. This paper proposes an analytical distance-based jackknife procedure to study stability and cross-validation in MDS in terms of the jackknife distances, which is not influenced by the Procrustes method. For each object, the corresponding jackknife estimated points are considered as naturally clustered points, and stability and cross-validation are analysed in terms of the MDS distances arising from the jackknife procedure, on the basis of a weighted cluster-MDS algorithm. A jackknife-relevant configuration is also proposed for cross-validation in terms of coordinates, in a cluster-MDS framework.
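The leave-one-out principle underlying the proposed procedure can be sketched generically. The function below computes jackknife replicates, the bias estimate, and the standard error for an arbitrary statistic; it is a textbook jackknife, not the paper's distance-based MDS jackknife, and the names are illustrative.

```python
import numpy as np

def jackknife(stat, data):
    """Leave-one-out jackknife: replicate statistics, bias estimate, and standard error."""
    n = len(data)
    reps = np.array([stat(np.delete(data, i, axis=0)) for i in range(n)])
    theta = stat(data)
    bias = (n - 1) * (reps.mean() - theta)
    se = np.sqrt((n - 1) / n * ((reps - reps.mean()) ** 2).sum())
    return reps, bias, se
```

In the MDS setting the "replicates" are whole configurations rather than scalars, which is what raises the alignment problem: coordinate-based approaches must Procrustes-rotate each leave-one-out configuration before comparing them, whereas the distance-based proposal works directly with the jackknife distances and avoids that step.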

A new multilevel latent state graded response model for longitudinal multitrait–multimethod (MTMM) measurement designs combining structurally different and interchangeable methods is proposed. The model allows researchers to examine construct validity over time and to study the change and stability of constructs and method effects based on ordinal response variables. We show how Bayesian estimation techniques can address a number of important issues that typically arise in longitudinal multilevel MTMM studies and facilitate the estimation of the model presented. Estimation accuracy and the impact of between- and within-level sample sizes as well as different prior specifications on parameter recovery were investigated in a Monte Carlo simulation study. Findings indicate that the parameters of the model presented can be accurately estimated with Bayesian estimation methods in the case of low convergent validity with as few as 250 clusters and more than two observations within each cluster. The model was applied to well-being data from a longitudinal MTMM study, assessing the change and stability of life satisfaction and subjective happiness in young adults after high-school graduation. Guidelines for empirical applications are provided and advantages and limitations of a Bayesian approach to estimating longitudinal multilevel MTMM models are discussed.

Multidimensional computerized adaptive testing (MCAT) has received increasing attention over the past few years in educational measurement. Like all other formats of CAT, item replenishment is an essential part of MCAT for its item bank maintenance and management, which governs retiring overexposed or obsolete items over time and replacing them with new ones. Moreover, calibration precision of the new items will directly affect the estimation accuracy of examinees’ ability vectors. In unidimensional CAT (UCAT) and cognitive diagnostic CAT, online calibration techniques have been developed to effectively calibrate new items. However, there has been very little discussion of online calibration in MCAT in the literature. Thus, this paper proposes new online calibration methods for MCAT based upon some popular methods used in UCAT. Three representative methods, Method A, the ‘one EM cycle’ method and the ‘multiple EM cycles’ method, are generalized to MCAT. Three simulation studies were conducted to compare the three new methods by manipulating three factors (test length, item bank design, and level of correlation between coordinate dimensions). The results showed that all the new methods were able to recover the item parameters accurately, and the adaptive online calibration designs showed some improvements compared to the random design under most conditions.

In the framework of meta-analysis, moderator analysis is usually performed only univariately. When several study characteristics are available that may account for treatment effect, standard meta-regression has difficulties in identifying interactions between them. To overcome this problem, meta-CART has been proposed: an approach that applies classification and regression trees (CART) to identify interactions, and then subgroup meta-analysis to test the significance of moderator effects. The previous version of meta-CART has shortcomings: when applying CART, the sample sizes of studies are not taken into account, and the effect sizes are dichotomized around the median value. Therefore, this article proposes new meta-CART extensions, weighting study effect sizes by their accuracy, and using a regression tree to avoid dichotomization. In addition, new pruning rules are proposed. The performance of all versions of meta-CART was evaluated via a Monte Carlo simulation study. The simulation results revealed that meta-regression trees with random-effects weights and a 0.5-standard-error pruning rule perform best. The required sample size for meta-CART to achieve satisfactory performance depends on the number of study characteristics, the magnitude of the interactions, and the residual heterogeneity.

Over the past decade, Mokken scale analysis (MSA) has rapidly grown in popularity among researchers from many different research areas. This tutorial provides researchers with a set of techniques and a procedure for their application, such that the construction of scales that have superior measurement properties is further optimized, taking full advantage of the properties of MSA. First, we define the conceptual context of MSA, discuss the two item response theory (IRT) models that constitute the basis of MSA, and discuss how these models differ from other IRT models. Second, we discuss dos and don'ts for MSA; the don'ts include misunderstandings we have frequently encountered with researchers in our three decades of experience with real-data MSA. Third, we discuss a methodology for MSA on real data that consist of a sample of persons who have provided scores on a set of items that, depending on the composition of the item set, constitute the basis for one or more scales, and we use the methodology to analyse an example real-data set.

Two different item response theory model frameworks have been proposed for the assessment and control of response styles in rating data. According to one framework, response styles can be assessed by analysing threshold parameters in Rasch models for ordinal data and in mixture-distribution extensions of such models. A different framework is provided by multi-process item response tree models, which can be used to disentangle response processes that are related to the substantive traits and response tendencies elicited by the response scale. In this tutorial, the two approaches are reviewed, illustrated with an empirical data set of the two-dimensional ‘Personal Need for Structure’ construct, and compared in terms of multiple criteria. Mplus is used as a software framework for (mixed) polytomous Rasch models and item response tree models as well as for demonstrating how parsimonious model variants can be specified to test assumptions on the structure of response styles and attitude strength. Although both frameworks are shown to account for response styles, they differ on the quantitative criteria of model selection, practical aspects of model estimation, and conceptual issues of representing response styles as continuous and multidimensional sources of individual differences in psychological assessment.
