Cognitive diagnosis models of educational test performance rely on a binary Q-matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non-parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non-parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non-parametric classification techniques require that the Q-matrix underlying a given test be known. Unfortunately, in practice, the Q-matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q-matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum-score profiles does not require knowledge of the Q-matrix, and results in a more accurate classification of examinees.

Equivalence tests are an alternative to traditional difference-based tests for demonstrating a lack of association between two variables. While there are several recent studies investigating equivalence tests for comparing means, little research has been conducted on equivalence methods for evaluating the equivalence or similarity of two correlation coefficients or two regression coefficients. The current project proposes novel tests for evaluating the equivalence of two regression or correlation coefficients derived from the two one-sided tests (TOST) method (Schuirmann, 1987, *J. Pharmacokinet. Biopharm.*, *15*, 657) and an equivalence test by Anderson and Hauck (1983, *Stat. Commun.*, *12*, 2663). A simulation study was used to evaluate the performance of these tests and compare them with the common, yet inappropriate, method of assessing equivalence using non-rejection of the null hypothesis in difference-based tests. Results demonstrate that equivalence tests have more accurate probabilities of declaring equivalence than difference-based tests. However, equivalence tests require large sample sizes to ensure adequate power. We recommend the Anderson–Hauck equivalence test over the TOST method for comparing correlation or regression coefficients.
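
The TOST logic described above can be sketched for two independent correlation coefficients. This is only a minimal illustration and not necessarily the statistic proposed in the paper: it assumes a Fisher z transformation, a large-sample normal approximation, and an equivalence margin stated on the z scale. The function name `tost_two_correlations` and its parameters are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tost_two_correlations(r1, n1, r2, n2, margin, alpha=0.05):
    """Two one-sided tests for equivalence of two independent correlations.

    Assumption (not from the source): the difference is measured on the
    Fisher z scale and `margin` is the equivalence bound on that scale.
    Returns (p_value, declared_equivalent).
    """
    diff = math.atanh(r1) - math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))

    # First one-sided test: H0 says diff <= -margin.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)
    # Second one-sided test: H0 says diff >= +margin.
    p_upper = normal_cdf((diff - margin) / se)

    # Equivalence is declared only if both one-sided nulls are rejected,
    # so the overall p-value is the larger of the two.
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha


if __name__ == "__main__":
    # Identical sample correlations, large versus small samples.
    print(tost_two_correlations(0.50, 1000, 0.50, 1000, margin=0.10))
    print(tost_two_correlations(0.50, 50, 0.50, 50, margin=0.10))
```

With identical sample correlations, the large-sample call declares equivalence while the small-sample call does not, which is consistent with the abstract's remark that equivalence tests need large samples for adequate power.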

We show how the hierarchical model for responses and response times as developed by van der Linden (2007), Fox, Klein Entink, and van der Linden (2007), Klein Entink, Fox, and van der Linden (2009), and Glas and van der Linden (2010) can be simplified to a generalized linear factor model with only the mild restriction that there is no hierarchical model at the item side. This result is valuable as it enables all well-developed modelling tools and extensions that come with these methods. We show that the restriction we impose on the hierarchical model does not influence parameter recovery under realistic circumstances. In addition, we present two illustrative real data analyses to demonstrate the practical benefits of our approach.

We analytically derive the fixed-effects estimates in unconditional linear growth curve models by typical linear mixed-effects modelling (TLME) and by a pattern-mixture (PM) approach with random-slope-dependent two-missing-pattern missing not at random (MNAR) longitudinal data. Results showed that when the missingness mechanism is random-slope-dependent MNAR, TLME estimates of both the mean intercept and mean slope are biased because of incorrect weights used in the estimation. More specifically, the estimate of the mean slope is biased towards the mean slope for completers, whereas the estimate of the mean intercept is biased in the opposite direction to the estimate of the mean slope. We also discuss why the PM approach can provide unbiased fixed-effects estimates for random-coefficients-dependent MNAR data but does not work well for missing at random or outcome-dependent MNAR data. A small simulation study was conducted to illustrate the results and to compare results from TLME and PM. Results from an empirical data analysis showed that the conceptual finding can be generalized to other real conditions even when some assumptions for the analytical derivation cannot be met. Implications from the analytical and empirical results were discussed and sensitivity analysis was suggested for longitudinal data analysis with missing data.

In this paper, the performance of six types of techniques for comparisons of means is examined. These six emerge from the distinction between the method employed (hypothesis testing, model selection using information criteria, or Bayesian model selection) and the set of hypotheses that is investigated (a classical, exploration-based set of hypotheses containing equality constraints on the means, or a theory-based limited set of hypotheses with equality and/or order restrictions). A simulation study is conducted to examine the performance of these techniques. We demonstrate that, if one has specific, a priori specified hypotheses, confirmation (i.e., investigating theory-based hypotheses) has advantages over exploration (i.e., examining all possible equality-constrained hypotheses). Furthermore, examining reasonable order-restricted hypotheses has more power to detect the true effect/non-null hypothesis than evaluating only equality restrictions. Additionally, when investigating more than one theory-based hypothesis, model selection is preferred over hypothesis testing. Because of the first two results, we further examine the techniques that are able to evaluate order restrictions in a confirmatory fashion by examining their performance when the homogeneity of variance assumption is violated. Results show that the techniques are robust to heterogeneity when the sample sizes are equal. When the sample sizes are unequal, the performance is affected by heterogeneity. The size and direction of the deviations from the baseline, where there is no heterogeneity, depend on the effect size (of the means) and on the trend in the group variances with respect to the ordering of the group sizes. Importantly, the deviations are less pronounced when the group variances and sizes exhibit the same trend (e.g., are both increasing with group number).

Many probabilistic models for psychological and educational measurements contain latent variables. Well-known examples are factor analysis, item response theory, and latent class model families. We discuss what is referred to as the ‘explaining-away’ phenomenon in the context of such latent variable models. This phenomenon can occur when multiple latent variables are related to the same observed variable, and can elicit seemingly counterintuitive conditional dependencies between latent variables given observed variables. We illustrate the implications of explaining away for a number of well-known latent variable models by using both theoretical and real data examples.

Research problems that require a non-parametric analysis of multifactor designs with repeated measures arise in the behavioural sciences. There is, however, a lack of available procedures in commonly used statistical packages. In the present study, a generalization of the aligned rank test for the two-way interaction is proposed for the analysis of the typical sources of variation in a three-way analysis of variance (ANOVA) with repeated measures. It can be implemented in the usual statistical packages. Its statistical properties are tested by using simulation methods with two sample sizes (*n* = 30 and *n* = 10) and three distributions (normal, exponential and double exponential). Results indicate substantial increases in power for non-normal distributions in comparison with the usual parametric tests. Similar levels of Type I error for both parametric and aligned rank ANOVA were obtained with non-normal distributions and large sample sizes. Degrees-of-freedom adjustments for Type I error control in small samples are proposed. The procedure is applied to a case study with 30 participants per group where it detects gender differences in linguistic abilities in blind children not shown previously by other methods.

For item response theory (IRT) models, which belong to the class of generalized linear or non-linear mixed models, reliability at the scale of observed scores (i.e., manifest correlation) is more difficult to calculate than latent correlation based reliability, but usually of greater scientific interest. This is not least because it cannot be calculated explicitly when the logit link is used in conjunction with normal random effects. As such, approximations such as Fisher's information coefficient, Cronbach's *α*, or the latent correlation are calculated, allegedly because it is easy to do so. Cronbach's *α* has well-known and serious drawbacks, Fisher's information is not meaningful under certain circumstances, and there is an important but often overlooked difference between latent and manifest correlations. Here, manifest correlation refers to correlation between observed scores, while latent correlation refers to correlation between scores at the latent (e.g., logit or probit) scale. Thus, using one in place of the other can lead to erroneous conclusions. Taylor series based reliability measures, which are based on manifest correlation functions, are derived and a careful comparison of reliability measures based on latent correlations, Fisher's information, and exact reliability is carried out. The latent correlations are virtually always considerably higher than their manifest counterparts, Fisher's information measure shows no coherent behaviour (it is even negative in some cases), while the newly introduced Taylor series based approximations reflect the exact reliability very closely. Comparisons among the various types of correlations, for various IRT models, are made using algebraic expressions, Monte Carlo simulations, and data analysis. Given the light computational burden and the performance of Taylor series based reliability measures, their use is recommended.
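
The gap between latent and manifest correlation can be illustrated with a small simulation. The sketch below is not the Taylor series measure derived in the paper. It assumes a random-intercept logistic model with two exchangeable binary items per person, and it uses the standard latent-scale formula σ² / (σ² + π²/3) for the logit link. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_persons = 100_000
sigma = 1.0  # assumed standard deviation of the normal random effect

# Each person has one latent ability that drives two binary item responses.
ability = rng.normal(0.0, sigma, size=n_persons)
prob_correct = 1.0 / (1.0 + np.exp(-ability))
item1 = (rng.random(n_persons) < prob_correct).astype(float)
item2 = (rng.random(n_persons) < prob_correct).astype(float)

# Manifest correlation: correlation between the observed 0/1 scores.
manifest = float(np.corrcoef(item1, item2)[0, 1])

# Latent correlation: share of variance due to the random effect on the
# logit scale, where the residual variance is pi^2 / 3.
latent = sigma**2 / (sigma**2 + np.pi**2 / 3.0)

print(f"latent correlation:   {latent:.3f}")
print(f"manifest correlation: {manifest:.3f}")
```

Under these assumptions the simulated manifest correlation falls below the latent one, in line with the abstract's observation that latent correlations are virtually always higher than their manifest counterparts.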

Applications of standard item response theory models assume local independence of items and persons. This paper presents polytomous multilevel testlet models for dual dependence due to item and person clustering in testlet-based assessments with clustered samples. Simulation and survey data were analysed with a multilevel partial credit testlet model. This model was compared with three alternative models – a testlet partial credit model (PCM), multilevel PCM, and PCM – in terms of model parameter estimation. The results indicated that the deviance information criterion was the fit index that always correctly identified the true multilevel testlet model based on the quantified evidence in model selection, while the Akaike and Bayesian information criteria could not identify the true model. In general, the estimation model and the magnitude of item and person clustering impacted the estimation accuracy of ability parameters, while only the estimation model and the magnitude of item clustering affected the item parameter estimation accuracy. Furthermore, ignoring item clustering effects produced higher total errors in item parameter estimates but did not have much impact on the accuracy of ability parameter estimates, while ignoring person clustering effects yielded higher total errors in ability parameter estimates but did not have much effect on the accuracy of item parameter estimates. When both clustering effects were ignored in the PCM, item and ability parameter estimation accuracy was reduced.

The previously unknown asymptotic distribution of Cook's distance in polytomous logistic regression is established as a linear combination of independent chi-square random variables with one degree of freedom. An exhaustive approach to the analysis of influential covariates is developed and a new measure for the accuracy of predictions based on such a distribution is provided. Two examples with real data sets (one with continuous covariates and the other with both qualitative and quantitative covariates) are presented to illustrate the approach developed.

Previous studies have discussed asymmetric interpretations of the Pearson correlation coefficient and have shown that higher moments can be used to decide on the direction of dependence in the bivariate linear regression setting. The current study extends this approach by illustrating that the third moment of regression residuals may also be used to derive conclusions concerning the direction of effects. Assuming non-normally distributed variables, it is shown that the distribution of residuals of the correctly specified regression model (e.g., *Y* is regressed on *X*) is more symmetric than the distribution of residuals of the competing model (i.e., *X* is regressed on *Y*). Based on this result, four one-sample tests are discussed which can be used to decide which variable is more likely to be the response and which one is more likely to be the explanatory variable. A fifth significance test is proposed based on the differences of skewness estimates, which leads to a more direct test of a hypothesis that is compatible with direction of dependence. A Monte Carlo simulation study was performed to examine the behaviour of the procedures under various degrees of associations, sample sizes, and distributional properties of the underlying population. An empirical example is given which illustrates the application of the tests in practice.
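
The residual-skewness argument can be checked numerically. The following sketch only compares sample skewness of the two sets of residuals and does not implement any of the significance tests discussed in the paper. It assumes a skewed (exponential) predictor, a normal error term, and a true model in which *Y* depends on *X*. The helper names `skewness` and `ols_residuals` are hypothetical.

```python
import numpy as np


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    centered = values - values.mean()
    return float((centered**3).mean() / (centered**2).mean() ** 1.5)


def ols_residuals(response, predictor):
    """Residuals from a simple least-squares regression with an intercept."""
    slope = np.cov(predictor, response, bias=True)[0, 1] / predictor.var()
    intercept = response.mean() - slope * predictor.mean()
    return response - (intercept + slope * predictor)


rng = np.random.default_rng(2024)
n = 5000
x = rng.exponential(scale=1.0, size=n)            # skewed explanatory variable
error = rng.normal(scale=np.sqrt(1 - 0.7**2), size=n)
y = 0.7 * x + error                               # true direction: X -> Y

skew_correct = skewness(ols_residuals(y, x))      # Y regressed on X
skew_reversed = skewness(ols_residuals(x, y))     # X regressed on Y

print(f"residual skewness, Y on X: {skew_correct:.3f}")
print(f"residual skewness, X on Y: {skew_reversed:.3f}")
```

In this setup the residuals of the correctly specified model inherit the symmetry of the normal error, while the residuals of the reversed model retain part of the skewness of *X*, so the first value is close to zero and the second is clearly positive.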

This study investigated differential item functioning (DIF) mechanisms in the context of differential testlet effects across subgroups. Specifically, we investigated DIF manifestations when the stochastic ordering assumption on the nuisance dimension in a testlet does not hold. DIF hypotheses were formulated analytically using a parametric marginal item response function approach and compared with empirical DIF results from a unidimensional item response theory approach. The comparisons were made in terms of type of DIF (uniform or non-uniform) and direction (whether the focal or reference group was advantaged). In general, the DIF hypotheses were supported by the empirical results, showing the usefulness of the parametric approach in explaining DIF mechanisms. Both analytical predictions of DIF and the empirical results provide insights into conditions where a particular type of DIF becomes dominant in a specific DIF direction, which is useful for the study of DIF causes.

The study of thresholds for discriminability has been of long-standing interest in psychophysics. While threshold theories embrace the concept of discrete-state thresholds, signal detection theory discounts such a concept. In this paper we concern ourselves with the concept of thresholds from the discrete-state modelling viewpoint. In doing so, we find it necessary to clarify some fundamental issues germane to the psychometric function (PF), which is customarily constructed using psychophysical methods with a binary-response format. We challenge this response format and argue that response confidence also plays an important role in the construction of PFs, and thus should have some impact on threshold estimation. We motivate the discussion by adopting a three-state threshold theory for response confidence proposed by Krantz (1969, *Psychol. Rev.*, *76*, 308–324), which is a modification of Luce's (1963, *Psychol. Rev.*, *70*, 61–79) low-threshold theory. In particular, we discuss the case in which the practice of averaging over order (or position) is enforced in data collection. Finally, we illustrate the fit of the Luce–Krantz model to data from a line-discrimination task with response confidence.

This paper demonstrates the usefulness and flexibility of the general structural equation modelling (SEM) approach to fitting direct covariance patterns or structures (as opposed to fitting implied covariance structures from functional relationships among variables). In particular, the MSTRUCT modelling language (or syntax) of the CALIS procedure (SAS/STAT version 9.22 or later: SAS Institute, 2010) is used to illustrate the SEM approach. The MSTRUCT modelling language supports a direct covariance pattern specification of each covariance element. It also supports the input of additional independent and dependent parameters. Model tests, fit statistics, estimates, and their standard errors are then produced under the general SEM framework. By using numerical and computational examples, the following tests of basic covariance patterns are illustrated: sphericity, compound symmetry, and multiple-group covariance patterns. Specification and testing of two complex correlation structures, the circumplex pattern and the composite direct product models with or without composite errors and scales, are also illustrated by the MSTRUCT syntax. It is concluded that the SEM approach offers a general and flexible modelling of direct covariance and correlation patterns. In conjunction with the use of SAS macros, the MSTRUCT syntax provides an easy-to-use interface for specifying and fitting complex covariance and correlation structures, even when the number of variables or parameters becomes large.
