In this paper, the performance of six types of techniques for comparisons of means is examined. These six emerge from the distinction between the method employed (hypothesis testing, model selection using information criteria, or Bayesian model selection) and the set of hypotheses that is investigated (a classical, exploration-based set of hypotheses containing equality constraints on the means, or a theory-based limited set of hypotheses with equality and/or order restrictions). A simulation study is conducted to examine the performance of these techniques. We demonstrate that, if one has specific, a priori specified hypotheses, confirmation (i.e., investigating theory-based hypotheses) has advantages over exploration (i.e., examining all possible equality-constrained hypotheses). Furthermore, examining reasonable order-restricted hypotheses has more power to detect the true effect/non-null hypothesis than evaluating only equality restrictions. Additionally, when investigating more than one theory-based hypothesis, model selection is preferred over hypothesis testing. Because of the first two results, we further examine the techniques that are able to evaluate order restrictions in a confirmatory fashion by examining their performance when the homogeneity of variance assumption is violated. Results show that the techniques are robust to heterogeneity when the sample sizes are equal. When the sample sizes are unequal, the performance is affected by heterogeneity. The size and direction of the deviations from the baseline, where there is no heterogeneity, depend on the effect size (of the means) and on the trend in the group variances with respect to the ordering of the group sizes. Importantly, the deviations are less pronounced when the group variances and sizes exhibit the same trend (e.g., are both increasing with group number).

The previously unknown asymptotic distribution of Cook's distance in polytomous logistic regression is established as a linear combination of independent chi-square random variables with one degree of freedom. An exhaustive approach to the analysis of influential covariates is developed and a new measure for the accuracy of predictions based on such a distribution is provided. Two examples with real data sets (one with continuous covariates and the other with both qualitative and quantitative covariates) are presented to illustrate the approach developed.
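A weighted sum of independent chi-square variables with one degree of freedom generally has no closed-form distribution function, but its tail probabilities are easy to approximate by simulation. The sketch below is a generic Monte Carlo evaluation under hypothetical weights, not the authors' procedure; the function name and the example weights are assumptions of this sketch.

```python
import numpy as np

def lincomb_chi2_tail(weights, q, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_j w_j * Z_j^2 > q), where the Z_j
    are independent standard normals, so each Z_j^2 is chi-square_1."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, len(weights)))
    draws = (z ** 2) @ np.asarray(weights, dtype=float)
    return float(np.mean(draws > q))

# With a single unit weight the distribution reduces to an ordinary
# chi-square with 1 df, so the tail at 3.841 should be close to 0.05.
p = lincomb_chi2_tail([1.0], q=3.841)
```

In practice the weights would be the eigenvalues arising from the model at hand; here they are placeholders used only to check the simulator against a known special case.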

The study of thresholds for discriminability has been of long-standing interest in psychophysics. While threshold theories embrace the concept of discrete-state thresholds, signal detection theory discounts such a concept. In this paper we concern ourselves with the concept of thresholds from the discrete-state modelling viewpoint. In doing so, we find it necessary to clarify some fundamental issues germane to the psychometric function (PF), which is customarily constructed using psychophysical methods with a binary-response format. We challenge this response format and argue that response confidence also plays an important role in the construction of PFs, and thus should have some impact on threshold estimation. We motivate the discussion by adopting a three-state threshold theory for response confidence proposed by Krantz (1969, *Psychol. Rev.*, *76*, 308–324), which is a modification of Luce's (1963, *Psychol. Rev.*, *70*, 61–79) low-threshold theory. In particular, we discuss the case in which the practice of averaging over order (or position) is enforced in data collection. Finally, we illustrate the fit of the Luce–Krantz model to data from a line-discrimination task with response confidence.

In this paper we propose a latent class distance association model for clustering in the predictor space of large contingency tables with a categorical response variable. The rows of such a table are characterized as profiles of a set of explanatory variables, while the columns represent a single outcome variable. In many cases such tables are sparse, with many zero entries, which makes traditional models problematic. By clustering the row profiles into a few specific classes and representing these together with the categories of the response variable in a low-dimensional Euclidean space using a distance association model, a parsimonious prediction model can be obtained. A generalized EM algorithm is proposed to estimate the model parameters and the adjusted Bayesian information criterion statistic is employed to test the number of mixture components and the dimensionality of the representation. An empirical example highlighting the advantages of the new approach and comparing it with traditional approaches is presented.

This study investigated differential item functioning (DIF) mechanisms in the context of differential testlet effects across subgroups. Specifically, we investigated DIF manifestations when the stochastic ordering assumption on the nuisance dimension in a testlet does not hold. DIF hypotheses were formulated analytically using a parametric marginal item response function approach and compared with empirical DIF results from a unidimensional item response theory approach. The comparisons were made in terms of type of DIF (uniform or non-uniform) and direction (whether the focal or reference group was advantaged). In general, the DIF hypotheses were supported by the empirical results, showing the usefulness of the parametric approach in explaining DIF mechanisms. Both analytical predictions of DIF and the empirical results provide insights into conditions where a particular type of DIF becomes dominant in a specific DIF direction, which is useful for the study of DIF causes.

Previous studies have discussed asymmetric interpretations of the Pearson correlation coefficient and have shown that higher moments can be used to decide on the direction of dependence in the bivariate linear regression setting. The current study extends this approach by illustrating that the third moment of regression residuals may also be used to derive conclusions concerning the direction of effects. Assuming non-normally distributed variables, it is shown that the distribution of residuals of the correctly specified regression model (e.g., *Y* is regressed on *X*) is more symmetric than the distribution of residuals of the competing model (i.e., *X* is regressed on *Y*). Based on this result, four one-sample tests are discussed which can be used to decide which variable is more likely to be the response and which one is more likely to be the explanatory variable. A fifth significance test is proposed based on the differences of skewness estimates, which leads to a more direct test of a hypothesis that is compatible with direction of dependence. A Monte Carlo simulation study was performed to examine the behaviour of the procedures under various degrees of associations, sample sizes, and distributional properties of the underlying population. An empirical example is given which illustrates the application of the tests in practice.
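The residual-skewness idea can be illustrated with a minimal simulation: regress each variable on the other by least squares and compare the third standardized moments of the two residual series. The data-generating values below are made up for illustration and are not taken from the paper.

```python
import numpy as np

def resid_skew(y, x):
    """Skewness (third standardized moment) of least-squares residuals
    from regressing y on x, with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    r = r - r.mean()
    return float(np.mean(r ** 3) / np.mean(r ** 2) ** 1.5)

rng = np.random.default_rng(1)
x = rng.exponential(size=5000)           # skewed explanatory variable
y = 0.8 * x + rng.standard_normal(5000)  # symmetric error term

# Residuals of the correctly specified model (y on x) should be closer
# to symmetric than those of the reversed model (x on y).
correct = abs(resid_skew(y, x))
reversed_ = abs(resid_skew(x, y))
```

Because the true error term is symmetric while *X* is skewed, the reversed regression inherits skewness from *X* into its residuals, which is the asymmetry the tests exploit.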

Applications of standard item response theory models assume local independence of items and persons. This paper presents polytomous multilevel testlet models for dual dependence due to item and person clustering in testlet-based assessments with clustered samples. Simulation and survey data were analysed with a multilevel partial credit testlet model. This model was compared with three alternative models – a testlet partial credit model (PCM), multilevel PCM, and PCM – in terms of model parameter estimation. The results indicated that the deviance information criterion was the fit index that always correctly identified the true multilevel testlet model based on the quantified evidence in model selection, while the Akaike and Bayesian information criteria could not identify the true model. In general, the estimation model and the magnitude of item and person clustering impacted the estimation accuracy of ability parameters, while only the estimation model and the magnitude of item clustering affected the item parameter estimation accuracy. Furthermore, ignoring item clustering effects produced higher total errors in item parameter estimates but did not have much impact on the accuracy of ability parameter estimates, while ignoring person clustering effects yielded higher total errors in ability parameter estimates but did not have much effect on the accuracy of item parameter estimates. When both clustering effects were ignored in the PCM, item and ability parameter estimation accuracy was reduced.

For item response theory (IRT) models, which belong to the class of generalized linear or non-linear mixed models, reliability at the scale of observed scores (i.e., manifest correlation) is more difficult to calculate than latent correlation based reliability, but usually of greater scientific interest. This is not least because it cannot be calculated explicitly when the logit link is used in conjunction with normal random effects. As such, approximations such as Fisher's information coefficient, Cronbach's *α*, or the latent correlation are calculated, allegedly because it is easy to do so. Cronbach's *α* has well-known and serious drawbacks, Fisher's information is not meaningful under certain circumstances, and there is an important but often overlooked difference between latent and manifest correlations. Here, manifest correlation refers to correlation between observed scores, while latent correlation refers to correlation between scores at the latent (e.g., logit or probit) scale. Thus, using one in place of the other can lead to erroneous conclusions. Taylor series based reliability measures, which are based on manifest correlation functions, are derived and a careful comparison of reliability measures based on latent correlations, Fisher's information, and exact reliability is carried out. The latent correlations are virtually always considerably higher than their manifest counterparts, Fisher's information measure shows no coherent behaviour (it is even negative in some cases), while the newly introduced Taylor series based approximations reflect the exact reliability very closely. Comparisons among the various types of correlations, for various IRT models, are made using algebraic expressions, Monte Carlo simulations, and data analysis. Given the light computational burden and the performance of Taylor series based reliability measures, their use is recommended.

This paper demonstrates the usefulness and flexibility of the general structural equation modelling (SEM) approach to fitting direct covariance patterns or structures (as opposed to fitting implied covariance structures from functional relationships among variables). In particular, the MSTRUCT modelling language (or syntax) of the CALIS procedure (SAS/STAT version 9.22 or later: SAS Institute, 2010) is used to illustrate the SEM approach. The MSTRUCT modelling language supports a direct covariance pattern specification of each covariance element. It also supports the input of additional independent and dependent parameters. Model tests, fit statistics, estimates, and their standard errors are then produced under the general SEM framework. By using numerical and computational examples, the following tests of basic covariance patterns are illustrated: sphericity, compound symmetry, and multiple-group covariance patterns. Specification and testing of two complex correlation structures, the circumplex pattern and the composite direct product models with or without composite errors and scales, are also illustrated by the MSTRUCT syntax. It is concluded that the SEM approach offers a general and flexible modelling of direct covariance and correlation patterns. In conjunction with the use of SAS macros, the MSTRUCT syntax provides an easy-to-use interface for specifying and fitting complex covariance and correlation structures, even when the number of variables or parameters becomes large.

Research problems that require a non-parametric analysis of multifactor designs with repeated measures arise in the behavioural sciences. There is, however, a lack of available procedures in commonly used statistical packages. In the present study, a generalization of the aligned rank test for the two-way interaction is proposed for the analysis of the typical sources of variation in a three-way analysis of variance (ANOVA) with repeated measures. It can be implemented in the usual statistical packages. Its statistical properties are tested by using simulation methods with two sample sizes (*n* = 30 and *n* = 10) and three distributions (normal, exponential and double exponential). Results indicate substantial increases in power for non-normal distributions in comparison with the usual parametric tests. Similar levels of Type I error for both parametric and aligned rank ANOVA were obtained with non-normal distributions and large sample sizes. Degrees-of-freedom adjustments for Type I error control in small samples are proposed. The procedure is applied to a case study with 30 participants per group where it detects gender differences in linguistic abilities in blind children not shown previously by other methods.

Score tests for identifying locally dependent item pairs have been proposed for binary item response models. In this article, both the bifactor and the threshold shift score tests are generalized to the graded response model. For the bifactor test, the generalization is straightforward; it adds one secondary dimension associated only with one pair of items. For the threshold shift test, however, multiple generalizations are possible: in particular, conditional, uniform, and linear shift tests are discussed in this article. Simulation studies show that all of the score tests have accurate Type I error rates given large enough samples, although their small-sample behaviour is not as good as that of Pearson's Χ^{2} and *M*_{2} as proposed in other studies for the purpose of local dependence (LD) detection. All score tests have the highest power to detect the LD which is consistent with their parametric form, and in this case they are uniformly more powerful than Χ^{2} and *M*_{2}; even wrongly specified score tests are more powerful than Χ^{2} and *M*_{2} in most conditions. An example using empirical data is provided for illustration.

The minimum-diameter partitioning problem (MDPP) seeks to produce compact clusters, as measured by an overall goodness-of-fit measure known as the partition diameter, which represents the maximum dissimilarity between any two objects placed in the same cluster. Complete-linkage hierarchical clustering is perhaps the best-known heuristic method for the MDPP and has an extensive history of applications in psychological research. Unfortunately, this method has several inherent shortcomings that impede the model selection process, such as: (1) sensitivity to the input order of the objects, (2) failure to obtain a globally optimal minimum-diameter partition when cutting the tree at *K* clusters, and (3) the propensity for a large number of alternative minimum-diameter partitions for a given *K*. We propose that each of these problems can be addressed by applying an algorithm that finds all of the minimum-diameter partitions for different values of *K*. Model selection is then facilitated by considering, for each value of *K*, the reduction in the partition diameter, the number of alternative optima, and the partition agreement among the alternative optima. Using five examples from the empirical literature, we show the practical value of the proposed process for facilitating model selection for the MDPP.
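The partition diameter itself is straightforward to compute for any candidate partition. The function below illustrates only the criterion, not the authors' algorithm for enumerating all minimum-diameter partitions; the dissimilarity matrix is a hypothetical toy example.

```python
from itertools import combinations

def partition_diameter(dist, clusters):
    """Diameter of a partition: the largest pairwise dissimilarity
    between two objects assigned to the same cluster.
    `dist` is a symmetric 2-D matrix; `clusters` lists index lists."""
    diam = 0.0
    for members in clusters:
        for i, j in combinations(members, 2):
            diam = max(diam, dist[i][j])
    return diam

# Toy dissimilarity matrix for four objects (hypothetical values).
D = [[0, 1, 6, 7],
     [1, 0, 5, 8],
     [6, 5, 0, 2],
     [7, 8, 2, 0]]

# Grouping {0,1} and {2,3} keeps within-cluster dissimilarities small.
good = partition_diameter(D, [[0, 1], [2, 3]])   # diameter 2
bad = partition_diameter(D, [[0, 2], [1, 3]])    # diameter 8
```

The MDPP then amounts to searching over partitions into *K* clusters for the one(s) minimizing this quantity, which is what makes enumerating alternative optima for each *K* informative for model selection.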

Virtually all discussions and applications of statistical mediation analysis have been based on the condition that the independent variable is dichotomous or continuous, even though investigators frequently are interested in testing mediation hypotheses involving a multicategorical independent variable (such as two or more experimental conditions relative to a control group). We provide a tutorial illustrating an approach to estimation of and inference about direct, indirect, and total effects in statistical mediation analysis with a multicategorical independent variable. The approach is mathematically equivalent to analysis of (co)variance and reproduces the observed and adjusted group means while also generating effects having simple interpretations. Supplementary material available online includes extensions to this approach and Mplus, SPSS, and SAS code that implements it.
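For a three-group design, the core of such an analysis reduces to indicator (dummy) coding of the independent variable and products of regression coefficients. The simulation sketch below uses made-up path values and is only a minimal illustration of relative indirect effects, not the authors' full inferential procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
group = np.repeat([0, 1, 2], n)        # control plus two treatments
d1 = (group == 1).astype(float)        # indicator coding relative
d2 = (group == 2).astype(float)        # to the control group
m = 0.5 * d1 + 1.0 * d2 + rng.standard_normal(3 * n)  # mediator
y = 0.7 * m + 0.3 * d2 + rng.standard_normal(3 * n)   # outcome

def ols(y, *cols):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# a-paths: effect of each treatment (vs. control) on the mediator.
_, a1, a2 = ols(m, d1, d2)
# b-path: effect of the mediator on the outcome, holding group constant.
_, b, _, _ = ols(y, m, d1, d2)

rel_indirect_1 = a1 * b   # relative indirect effect, group 1 vs. control
rel_indirect_2 = a2 * b   # relative indirect effect, group 2 vs. control
```

With the generating values above, the relative indirect effects should land near 0.5 × 0.7 and 1.0 × 0.7; in applications, inference about such products is usually carried out with bootstrap or Monte Carlo intervals.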

The study explores the robustness to violations of normality and sphericity of linear mixed models when they are used with the Kenward–Roger procedure (KR) in split-plot designs in which the groups have different distributions and sample sizes are small. The focus is on examining the effect of skewness and kurtosis. To this end, a Monte Carlo simulation study was carried out, involving a split-plot design with three levels of the between-subjects grouping factor and four levels of the within-subjects factor. The results show that: (1) the violation of the sphericity assumption did not affect KR robustness when the assumption of normality was not fulfilled; (2) the robustness of the KR procedure decreased as skewness in the distributions increased, there being no strong effect of kurtosis; and (3) the type of pairing between kurtosis and group size was shown to be a relevant variable to consider when using this procedure, especially when pairing is positive (i.e., when the largest group is associated with the largest value of the kurtosis coefficient and the smallest group with its smallest value). The KR procedure can be a good option for analysing repeated-measures data when the groups have different distributions, provided the total sample sizes are 45 or larger and the data are not highly or extremely skewed.

In item response theory, the classical estimators of ability are highly sensitive to response disturbances and can return strongly biased estimates of the true underlying ability level. Robust methods were introduced to lessen the impact of such aberrant responses on the estimation process. The computation of asymptotic (i.e., large-sample) standard errors (ASE) for these robust estimators, however, has not yet been fully considered. This paper focuses on a broad class of robust ability estimators, defined by an appropriate selection of the weight function and the residual measure, for which the ASE is derived from the theory of estimating equations. The maximum likelihood (ML) and the robust estimators, together with their estimated ASEs, are then compared in a simulation study by generating random guessing disturbances. It is concluded that both the estimators and their ASE perform similarly in the absence of random guessing, while the robust estimator and its estimated ASE are less biased and outperform their ML counterparts in the presence of random guessing with large impact on the item response process.

Latent trait models for responses and response times in tests often lack a substantial interpretation in terms of a cognitive process model. This is a drawback because process models are helpful in clarifying the meaning of the latent traits. In the present paper, a new model for responses and response times in tests is presented. The model is based on the proportional hazards model for competing risks. Two processes are assumed, one reflecting the increase in knowledge and the second the tendency to discontinue. The processes can be characterized by two proportional hazards models whose baseline hazard functions correspond to the temporary increase in knowledge and discouragement. The model can be calibrated with marginal maximum likelihood estimation and an application of the ECM algorithm. Two tests of model fit are proposed. The amenability of the proposed approaches to model calibration and model evaluation is demonstrated in a simulation study. Finally, the model is used for the analysis of two empirical data sets.

The difference between two proportions, referred to as a risk difference, is a useful measure of effect size in studies where the response variable is dichotomous. Confidence interval methods based on a varying coefficient model are proposed for combining and comparing risk differences from multi-study between-subjects or within-subjects designs. The proposed methods are new alternatives to the popular constant coefficient and random coefficient methods. The proposed varying coefficient methods do not require the constant coefficient assumption of effect size homogeneity, nor do they require the random coefficient assumption that the risk differences from the selected studies represent a random sample from a normally distributed superpopulation of risk differences. The proposed varying coefficient methods are shown to have excellent finite-sample performance characteristics under realistic conditions.
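For a single study, the risk difference and its elementary Wald interval can be computed as below. This is only the basic building block that multi-study methods combine, not the proposed varying coefficient procedure; the event counts are hypothetical.

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Point estimate and Wald confidence interval for the difference
    between two independent proportions (the risk difference)."""
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z * se, rd + z * se

# Hypothetical two-group study: 30/100 vs. 18/100 events.
rd, lo, hi = risk_difference_ci(30, 100, 18, 100)
```

The Wald interval is known to perform poorly with small samples or extreme proportions, which is part of the motivation for the more careful interval constructions the abstract describes.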

Missing values are a practical issue in the analysis of longitudinal data. Multiple imputation (MI) is a well-known likelihood-based method that has optimal properties in terms of efficiency and consistency if the imputation model is correctly specified. Doubly robust (DR) weighting-based methods protect against misspecification bias if one of the models, but not necessarily both, for the data or the mechanism leading to missing data is correct. We propose a new imputation method that captures the simplicity of MI and the protection of the DR method. This method integrates MI and DR to protect against misspecification of the imputation model under a missing at random assumption. Our method avoids analytical complications of missing data particularly in multivariate settings, and is easy to implement in standard statistical packages. Moreover, the proposed method works very well with an intermittent pattern of missingness when other DR methods cannot be used. Simulation experiments show that the proposed approach achieves improved performance when one of the models is correct. The method is applied to data from the fireworks disaster study, a randomized clinical trial comparing therapies in disaster-exposed children. We conclude that the new method increases the robustness of imputations.

The purpose of this study was to evaluate a modified test of equivalence for conducting normative comparisons when distribution shapes are non-normal and variances are unequal. A Monte Carlo study was used to compare the empirical Type I error rates and power of the proposed Schuirmann–Yuen test of equivalence, which utilizes trimmed means, with that of the previously recommended Schuirmann and Schuirmann–Welch tests of equivalence when the assumptions of normality and variance homogeneity are satisfied, as well as when they are not satisfied. The empirical Type I error rates of the Schuirmann–Yuen were much closer to the nominal α level than those of the Schuirmann or Schuirmann–Welch tests, and the power of the Schuirmann–Yuen was substantially greater than that of the Schuirmann or Schuirmann–Welch tests when distributions were skewed or outliers were present. The Schuirmann–Yuen test is recommended for assessing clinical significance with normative comparisons.
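A sketch of the Schuirmann–Yuen idea, two one-sided Yuen trimmed-t tests against equivalence bounds ±δ, might look like the following. The 20% trimming proportion, function names, and the "larger one-sided p-value" decision rule are assumptions of this sketch, not details confirmed by the abstract.

```python
import numpy as np
from scipy import stats

def yuen_stats(x, prop=0.2):
    """Trimmed mean, squared standard error, and effective n (Yuen-style)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    g = int(prop * n)
    h = n - 2 * g                        # effective sample size
    tmean = x[g:n - g].mean()
    xw = np.clip(x, x[g], x[n - g - 1])  # winsorized sample
    d = (n - 1) * xw.var(ddof=1) / (h * (h - 1))
    return tmean, d, h

def schuirmann_yuen(x, y, delta, prop=0.2):
    """Two one-sided trimmed-t tests of equivalence: the trimmed-mean
    difference is declared equivalent to zero within +/- delta when
    the larger of the two one-sided p-values is small."""
    mx, dx, hx = yuen_stats(x, prop)
    my, dy, hy = yuen_stats(y, prop)
    se = np.sqrt(dx + dy)
    df = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    t_lower = (mx - my + delta) / se   # tests H0: difference <= -delta
    t_upper = (mx - my - delta) / se   # tests H0: difference >= +delta
    return max(1 - stats.t.cdf(t_lower, df), stats.t.cdf(t_upper, df))
```

Trimming replaces the ordinary means and variances with trimmed means and winsorized variances, which is what buys the robustness to skewness and outliers reported in the simulation.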

We derive the statistical power functions in multi-site randomized trials with multiple treatments at each site, using multi-level modelling. An *F* statistic is used to test multiple parameters in the multi-level model instead of the Wald chi-square test as suggested in the current literature. The *F* statistic is shown to be more conservative than the Wald statistic in testing any overall treatment effect among the multiple study conditions. In addition, we improvise an easy way to estimate the non-centrality parameters for the means comparison *t*-tests and the *F* test, using Helmert contrast coding in the multi-level model. The variance of treatment means, which is difficult to fathom but necessary for power analysis, is decomposed into intuitive simple effect sizes in the contrast tests. The method is exemplified by a multi-site evaluation study of the behavioural interventions for cannabis dependence.

Fischer's (1973) linear logistic test model can be used to test hypotheses regarding the effect of covariates on item difficulty and to predict the difficulty of newly constructed test items. However, its assumptions of equal discriminatory power across items and a perfect prediction of item difficulty are never absolutely met. The amount of misfit in an application of a Bayesian version of the model to two subtests of the SON-R 5½–17 is investigated by means of item fit statistics in the framework of posterior predictive checks and by means of a comparison with a model that allows for residual (co)variance in the item parameters. The effect of the degree of residual (co)variance on the robustness of inferences is investigated in a simulation study.

The family of (non-parametric, fixed-step-size) adaptive methods, also known as ‘up–down’ or ‘staircase’ methods, has been used extensively in psychophysical studies for threshold estimation. Extensions of adaptive methods to non-binary responses have also been proposed. An example is the three-category weighted up–down (WUD) method (Kaernbach, 2001) and its four-category extension (Klein, 2001). Such an extension, however, is somewhat restricted, and in this paper we discuss its limitations. To facilitate the discussion, we characterize the extension of WUD by an algorithm that incorporates response confidence into a family of adaptive methods. This algorithm can also be applied to two other adaptive methods, namely Derman's up–down method and the biased-coin design, which are suitable for estimating any threshold quantiles. We then discuss, via simulations of the above three methods, the limitations of the algorithm. To illustrate, we conduct a small-scale experiment using the extended WUD under different response confidence formats to evaluate the consistency of threshold estimation.

In this paper we implement a Markov chain Monte Carlo algorithm based on the stochastic search variable selection method of George and McCulloch (1993) for identifying promising subsets of manifest variables (items) for factor analysis models. The suggested algorithm is constructed by embedding in the usual factor analysis model a normal mixture prior for the model loadings with latent indicators used to identify not only which manifest variables should be included in the model but also how each manifest variable is associated with each factor. We further extend the suggested algorithm to allow for factor selection. We also develop a detailed procedure for the specification of the prior parameter values based on the practical significance of factor loadings using ideas from the original work of George and McCulloch (1993). A straightforward Gibbs sampler is used to simulate from the joint posterior distribution of all unknown parameters and the subset of variables with the highest posterior probability is selected. The proposed method is illustrated using real and simulated data sets.

The semi-parametric proportional hazards model with crossed random effects has two important characteristics: it avoids explicit specification of the response time distribution by using semi-parametric models, and it captures heterogeneity that is due to subjects and items. The proposed model has a proportionality parameter for the speed of each test taker, for the time intensity of each item, and for subject or item characteristics of interest. It is shown how all these parameters can be estimated by Markov chain Monte Carlo methods (Gibbs sampling). The performance of the estimation procedure is assessed with simulations and the model is further illustrated with the analysis of response times from a visual recognition task.

Given a set of points on the plane and an assignment of values to them, an optimal linear partition is a division of the set into two subsets which are separated by a straight line and maximally contrast with each other in the values assigned to their points. We present a method for inspecting and rating all linear partitions of a finite set, and a package of three functions in the R language for executing the computations. One function is for finding the optimal linear partitions and corresponding separating lines, another for graphically representing the results, and a third for testing how well the data comply with the linear separability condition. We illustrate the method on possible data from a psychophysical experiment (concerning the size–weight illusion) and compare its performance with that of linear discriminant analysis and multiple logistic regression, adapted to dividing linearly a set of points on the plane.

Parameters in structural equation models are typically estimated using the maximum likelihood (ML) approach. Bollen (1996) proposed an alternative non-iterative, equation-by-equation estimator that uses instrumental variables. Although this two-stage least squares/instrumental variables (2SLS/IV) estimator has good statistical properties, one problem with its application is that parameter equality constraints cannot be imposed. This paper presents a mathematical solution to this problem that is based on an extension of the 2SLS/IV approach to a system of equations. We present an example in which our approach was used to examine strong longitudinal measurement invariance. We also investigated the new approach in a simulation study that compared it with ML in the examination of the equality of two latent regression coefficients and strong measurement invariance. Overall, the results show that the suggested approach is a useful extension of the original 2SLS/IV estimator and allows for the effective handling of equality constraints in structural equation models.
