This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common-item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common-item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (*N* examinees = 4,397; *N* panelists = 13) were split into content-equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common-item equating (circle-arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (*N* = 8) and by simulating panelists (*N* = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small-sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.

A common suggestion made in the psychometric literature for fixed-length classification tests is that one should design tests so that they have maximum information at the cut score. Designing tests in this way is believed to maximize the classification accuracy and consistency of the assessment. This article uses simulated examples to illustrate that one can obtain higher classification accuracy and consistency by designing tests that have maximum test information at locations other than at the cut score. We show that the location where one should maximize the test information is dependent on the length of the test, the mean of the ability distribution in comparison to the cut score, and, to a lesser degree, whether or not one wants to optimize classification accuracy or consistency. Analyses also suggested that the differences in classification performance between designing tests optimally versus maximizing information at the cut score tended to be greatest when tests were short and the mean of ability distribution was further away from the cut score. Larger differences were also found in the simulated examples that used the 3PL model compared to the examples that used the Rasch model.

]]>Computerized adaptive testing (CAT) and multistage testing (MST) have become two of the most popular modes in large-scale computer-based sequential testing. Though most designs of CAT and MST exhibit strength and weakness in recent large-scale implementations, there is no simple answer to the question of which design is better because different modes may fit different practical situations. This article proposes a hybrid adaptive framework to combine both CAT and MST, inspired by an analysis of the history of CAT and MST. The proposed procedure is a design which transitions from a group sequential design to a fully sequential design. This allows for the robustness of MST in early stages, but also shares the advantages of CAT in later stages with fine tuning of the ability estimator once its neighborhood has been identified. Simulation results showed that hybrid designs following our proposed principles provided comparable or even better estimation accuracy and efficiency than standard CAT and MST designs, especially for examinees at the two ends of the ability range.

]]>De la Torre and Deng suggested a resampling-based approach for person-fit assessment (PFA). The approach involves the use of the statistic, a corrected expected a posteriori estimate of the examinee ability, and the Monte Carlo (MC) resampling method. The Type I error rate of the approach was closer to the nominal level than that of the traditional approach of using along with the assumption of a standard normal null distribution. This article suggests a generalized resampling-based approach for PFA that allows one to employ or another person-fit statistic (PFS) based on item response theory, the corrected expected a posteriori estimate or another ability estimate, and the MC method or another resampling method. The suggested approach includes the approach of de la Torre and Deng as a special case. Several approaches belonging to the generalized approach perform very similarly to the approach of de la Torre and Deng's in two simulation studies and in applications to three real data sets, irrespective of the PFS used. The generalized approach promises to be useful to those interested in resampling-based PFA.

]]>This study examined the utility of response time-based analyses in understanding the behavior of unmotivated test takers. For the data from an adaptive achievement test, patterns of observed rapid-guessing behavior and item response accuracy were compared to the behavior expected under several types of models that have been proposed to represent unmotivated test taking behavior. Test taker behavior was found to be inconsistent with these models, with the exception of the effort-moderated model. Effort-moderated scoring was found to both yield scores that were more accurate than those found under traditional scoring, and exhibit improved person fit statistics. In addition, an effort-guided adaptive test was proposed and shown by a simulation study to alleviate item difficulty mistargeting caused by unmotivated test taking.

]]>Equating methods make use of an appropriate transformation function to map the scores of one test form into the scale of another so that scores are comparable and can be used interchangeably. The equating literature shows that the ways of judging the success of an equating (i.e., the score transformation) might differ depending on the adopted framework. Rather than targeting different parts of the equating process and aiming to evaluate the process from different aspects, this article views the equating transformation as a standard statistical estimator and discusses how this estimator should be assessed in an equating framework. For the kernel equating framework, a numerical illustration shows the potentials of viewing the equating transformation as a statistical estimator as opposed to assessing it using equating-specific criteria. A discussion on how this approach can be used to compare other equating estimators from different frameworks is also included.

]]>