This study adapted an effect size measure used for studying differential item functioning (DIF) in unidimensional tests and extended the measure to multidimensional tests. Two effect size measures were considered in a multidimensional item response theory model: signed weighted P-difference and unsigned weighted P-difference. The performance of the effect size measures was investigated under various simulation conditions including different sample sizes and DIF magnitudes. As another way of studying DIF, the χ^{2} difference test was included to compare the result of statistical significance (statistical tests) with that of practical significance (effect size measures). The adequacy of existing effect size criteria used in unidimensional tests was also evaluated. Both effect size measures worked well in estimating true effect sizes, identifying DIF types, and classifying effect size categories. Finally, a real data analysis was conducted to support the simulation results.

For a certification, licensure, or placement exam, allowing examinees to take multiple attempts at the test could effectively change the pass rate. Change in the pass rate can occur without any change in the underlying latent trait, and can be an artifact of multiple attempts and imperfect reliability of the test. By deriving formulae to compute the pass rate under two definitions, this article provides tools for testing practitioners to compute and evaluate the change in the expected pass rate when a certain (maximum) number of attempts are allowed without any change in the latent trait. This article also includes a simulation study that considers change in ability and differential motivation of examinees to retake the test. Results indicate that the general trend shown by the analytical results is maintained—that is, the marginal expected pass rate increases with more attempts when the testing volume is defined as the total number of test takers, and decreases with more attempts when the testing volume is defined as the total number of test attempts.
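The two pass-rate definitions can be illustrated with a minimal sketch. It does not reproduce the article's formulae. It assumes, for illustration only, that each examinee has a fixed per-attempt pass probability (no change in the latent trait), that attempts are independent, and that every examinee who fails retakes the test until the maximum number of attempts is reached. The function name and inputs are hypothetical.

```python
def expected_pass_rates(pass_probs, max_attempts):
    """Return (pass rate per test taker, pass rate per test attempt).

    pass_probs: one fixed per-attempt pass probability per examinee
                (an illustrative assumption, not the article's model).
    """
    n_takers = len(pass_probs)
    expected_passers = 0.0
    expected_attempts = 0.0
    for p in pass_probs:
        still_failing = 1.0  # probability the examinee has failed every attempt so far
        for _ in range(max_attempts):
            expected_attempts += still_failing      # examinee sits this attempt
            expected_passers += still_failing * p   # and passes on it
            still_failing *= 1.0 - p
    return expected_passers / n_takers, expected_passers / expected_attempts


# Two examinees with different pass probabilities, one versus two attempts allowed.
one_attempt = expected_pass_rates([0.2, 0.8], max_attempts=1)
two_attempts = expected_pass_rates([0.2, 0.8], max_attempts=2)
```

Under these assumptions the per-taker rate rises with a second attempt while the per-attempt rate falls, because later attempts are taken disproportionately by examinees with low pass probabilities.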

This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not-mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.

In observed-score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed-score equating. In this study, the asymptotic standard errors of observed-score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
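The step from item-level category probabilities to test-level score probabilities can be sketched as follows. This is an illustrative sketch rather than the study's derivation. It assumes local independence, a single fixed ability value, and item categories scored 0, 1, 2, and so on, so that the sum-score distribution is the convolution of the per-item category probability vectors. The function name is hypothetical, and the category probabilities in the example are made up rather than computed from a graded response or generalized partial credit model.

```python
def score_distribution(item_category_probs):
    """Sum-score probabilities at one fixed ability value.

    item_category_probs: list of per-item probability vectors, where
    entry c of a vector is the probability of scoring c on that item.
    Assumes local independence, so the vectors can be convolved.
    """
    dist = [1.0]  # with no items, a score of 0 is certain
    for probs in item_category_probs:
        updated = [0.0] * (len(dist) + len(probs) - 1)
        for score, p_score in enumerate(dist):
            for category, p_category in enumerate(probs):
                updated[score + category] += p_score * p_category
        dist = updated
    return dist


# One dichotomous item and one three-category item (made-up probabilities).
probs = score_distribution([[0.5, 0.5], [0.2, 0.3, 0.5]])
```

Marginal score probabilities would then be obtained by averaging such conditional distributions over an ability distribution, for example with quadrature weights.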

This article presents a novel method, the Complex Dynamics Essay Scorer (CDES), for automated essay scoring using complex network features. Texts produced by college students in China were represented as scale-free networks (e.g., a word adjacency model) from which typical network features, such as the in-/out-degrees, clustering coefficient (CC), and dynamic networks, were obtained. The CDES integrates the classical concepts of network feature representation and essay score series variation. Several experiments indicated that the network features capture differences in essay quality and that complex networks can be developed for autoscoring tasks. The average agreement between the CDES and human rater scores was 86.5%, and the average Pearson correlation was .77. The results indicate that the CDES produced functional complex systems and autoscored Chinese essays in a manner consistent with human raters. Our research suggests potential applications in other areas of educational assessment.

In this study, we describe what factors influence the observed score correlation between an (external) anchor test and a total test. We show that the anchor to full-test observed score correlation is based on two components: the true score correlation between the anchor and total test, and the reliability of the anchor test. Findings using an analytical approach suggest that making an anchor test a *miditest* does not generally maximize the anchor to total test correlation. Results are discussed in the context of what conditions maximize the correlations between the anchor and total test.
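The role of the two components can be illustrated with the classical attenuation relation from classical test theory, offered here as a sketch rather than the study's own derivation. It assumes an external anchor whose measurement errors are uncorrelated with those of the total test. The function name is hypothetical, and the total-test reliability argument is an addition needed by the classical formula; it is treated here as fixed.

```python
import math


def observed_correlation(true_score_corr, anchor_reliability, total_reliability):
    """Observed anchor-total correlation under classical test theory.

    Assumes uncorrelated measurement errors (external anchor), so the
    observed correlation is the true-score correlation attenuated by
    the square root of the product of the two reliabilities.
    """
    return true_score_corr * math.sqrt(anchor_reliability * total_reliability)


# Same true-score correlation and total-test reliability, two anchor reliabilities.
lower = observed_correlation(0.9, anchor_reliability=0.64, total_reliability=0.81)
higher = observed_correlation(0.9, anchor_reliability=0.81, total_reliability=0.81)
```

With the total test held fixed, the observed correlation grows with either the true-score correlation or the anchor reliability, so an anchor design that changes both must be judged on their combined effect.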