On the Cover
This issue's cover (courtesy of Katherine Castellano with input from Ben Shear and Andrew Ho) may be reminiscent of a Statistics 101 lesson that you've given again and again or one you distinctly remember being drilled into you. Mantras like “always plot your data” and “be aware of bin sizes” are surely running through your head. These eight plots of a single test score distribution for a real state × subject × grade × year data set (that we've taken the liberty of anonymizing) may be simple and familiar, but they illustrate the fundamental importance of knowing your test score data. They are far from the only eight we could have chosen, but hopefully, this selection helps make the point that different ways of plotting your test scores can highlight, mask, or distort various features of the distribution or imply various assumptions about the scale of the underlying test. For instance, the equal-width bars of the barplots (#5 and #7), unlike the histograms (#6 and #8), give the illusion that all achievement levels are “equal,” while at the same time the nominal labels avoid explicitly promoting an equal-interval scale. All but the aptly named discrete histogram (#3) mask the discreteness of the data, and in particular, mask the weight of students scoring at the highest obtainable scale score. The histogram (#2) and empirical density plot (#1) make it look like students obtained scores over a range of points near the maximum. The boxplot's (#4) single “outlying” maximum point gives no hint as to the number of students (2,500 in this case!) obtaining this score as it would be drawn even if only one student earned this score point.
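The masking effect of bin width is easy to reproduce. What follows is a minimal sketch using simulated data, not the anonymized state data behind the cover: the 200 to 300 integer scale, the normal latent distribution, and the 10-point bins are all assumptions chosen only for illustration. It compares the count of students at the highest obtainable score with the count in the top histogram bin that absorbs them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scale: integer scores from 200 to 300, where 300 is the
# highest obtainable scale score (both values are illustrative assumptions).
MIN_SCORE, MAX_SCORE = 200, 300
latent = rng.normal(loc=260, scale=25, size=50_000)

# Rounding and clipping produce discrete scores with a pile-up at the ceiling.
scores = np.clip(np.round(latent), MIN_SCORE, MAX_SCORE).astype(int)

# Discrete view: one count per obtained score point.
values, counts = np.unique(scores, return_counts=True)
n_at_max = int(counts[values == MAX_SCORE][0])

# Binned view: 10-point-wide bins with edges 200, 210, ..., 300.
bin_edges = np.arange(MIN_SCORE, MAX_SCORE + 10, 10)
binned_counts, _ = np.histogram(scores, bins=bin_edges)
top_bin_count = int(binned_counts[-1])  # last bin covers 290 through 300

print(f"Students at the maximum score ({MAX_SCORE}): {n_at_max}")
print(f"Students in the top bin (290-300): {top_bin_count}")
print(f"Share of the top bin at the maximum: {n_at_max / top_bin_count:.0%}")
```

In the binned view, the students at the ceiling are indistinguishable from those scoring anywhere in the top bin, which is the sense in which the histogram suggests scores spread over a range of points near the maximum.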
Discussions of the pros/cons of these and other visual depictions of test score distributions are dissertation-worthy and beyond the scope of this blurb, but we welcome your feedback. Cast your vote on which plot you prefer by e-mailing firstname.lastname@example.org.
In This Issue
If there is a unifying theme to be found in this issue, it is the coincidence that all four articles are solo-authored.
In “The Multiple-Use of Accountability Assessments: Implications for the Process of Validation,” Martha Koch introduces the notion of accountability-focused assessments with multiple intended uses that interact. In her case study examining the Education Quality and Accountability Office Grade 9 Assessment of Mathematics administered in schools throughout Ontario, Canada, she points to the example of teachers grading students on the basis of a subset of items pulled from the assessment as a use that could influence the interpretation and meaning of aggregated scores at the school level (based on all items from the assessment). She suggests that this can pose unforeseen challenges to the application of Kane's argument-based approach to test validation.
In “Defining and Measuring College and Career Readiness: A Validation Framework,” Wayne Camara discusses key issues that will need to be addressed in attempts to validate scores from large-scale assessments written to support inferences about the college and career readiness of American students by the end of high school. This is an important issue being grappled with by the two large-scale assessment consortia that were funded under Race to the Top, the Partnership for Assessment of Readiness for College and Careers (PARCC) and the Smarter Balanced Assessment Consortium (SBAC). Camara argues that unlike the large-scale assessments that have been predominant for the past decade under No Child Left Behind, the validation of tests written to assess college and career readiness will depend greatly upon whether scores are predictive of external criteria of college readiness, preparedness, or success. Camara presents the pros and cons of a number of possible criterion measures that could be used for this purpose, pointing out that it is likely to be much easier to come to consensus about measures of college readiness/preparedness/success than it will be to find good measures of career readiness/preparedness/success.
Camara discusses how cut points for performance levels might be set for PARCC or SBAC tests on the basis of some probability of success on an external criterion. One weakness of this approach that I have not seen get sufficient attention in the literature is that the sharpness of such cuts depends greatly on the slope of the regression (whether logistic or linear). If the slope is relatively flat, then we might well find that students quite far below or above a cut point still have about the same probability (i.e., within a few percentage points) of achieving (for example) a grade of B or higher in an entry-level college math course.
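To make the point about slope concrete, here is a small sketch under stated assumptions. The cut score of 500, the 30-point distance from the cut, and the two slope values are hypothetical numbers picked for illustration; none of them come from Camara's article or from PARCC or SBAC. The function `p_success` is likewise our own illustrative helper: a logistic curve calibrated so that the probability of success at the cut equals a chosen value.

```python
import math


def p_success(score, cut, slope, p_at_cut=0.5):
    """Logistic probability of success, calibrated so that
    P(success) equals p_at_cut exactly at the cut score."""
    intercept = math.log(p_at_cut / (1 - p_at_cut))
    logit = intercept + slope * (score - cut)
    return 1 / (1 + math.exp(-logit))


CUT = 500      # hypothetical cut score
DISTANCE = 30  # hypothetical distance below and above the cut

# Two hypothetical slopes on the logit scale, per scale-score point.
for label, slope in [("flat", 0.002), ("steep", 0.05)]:
    below = p_success(CUT - DISTANCE, CUT, slope)
    above = p_success(CUT + DISTANCE, CUT, slope)
    print(f"{label:>5} slope: P(success) {DISTANCE} points below the cut = {below:.3f}, "
          f"{DISTANCE} points above = {above:.3f}, gap = {above - below:.3f}")
```

With the flat slope, students 30 points on either side of the cut differ by only a few percentage points in their probability of success, so the cut separates them in label but hardly in expected outcome. With the steep slope, the same 30 points correspond to a large gap.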
In “Measuring Cohesion: An Approach That Accounts for Differences in the Degree of Integration Challenge Presented by Different Types of Sentences,” Kathleen Sheehan presents a novel approach that could be applied in the context of measuring the text complexity of reading passages. She argues that when attempting to quantify the cohesiveness of sentences within a passage, an adjustment must be made for the overall difficulty of the passage. She then introduces a measure of cohesion that makes such an adjustment, and using empirical data compares and contrasts the inferences about text complexity that would be reached using measures with and without the adjustment.
Finally, in “A Note on Assessing the Added Value of Subscores,” Sandip Sinharay shows how a method for evaluating the added value of reporting subscores on a test can be framed and understood in terms of the correlations between parallel test forms. The original approach, which Sinharay credits to Shelby Haberman, concludes that observed subscores only have added value when they can be said to predict the unobserved true score associated with the subscore better than the observed total score. Sinharay's contribution is to show that another way to understand this is that a subscore can be said to have added value if it is in better agreement than the total score with the corresponding subscore on a parallel form. He demonstrates the mathematical equivalence of the two interpretations and argues that an interpretation based on parallel forms may be easier for practitioners to grasp intuitively.
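The equivalence can be sketched in a few lines of classical test theory. The notation below is ours rather than the article's: s is the observed subscore, s_t its true score, x the observed total score, and s' the same subscore on a parallel form. We assume that measurement error on the parallel form is uncorrelated with x and that the correlations involved are positive. We also state the original criterion, as we understand it, as a comparison of squared correlations with the true subscore.

```latex
\begin{align*}
\text{Original criterion:}\quad
  & \rho^2(s, s_t) > \rho^2(x, s_t) \\
\text{Subscore with its parallel form (reliability):}\quad
  & \rho(s, s') = \rho^2(s, s_t) \\
\text{Total score with the parallel-form subscore:}\quad
  & \rho(x, s') = \rho(x, s_t)\,\rho(s, s_t) \\
\text{Hence:}\quad
  & \rho(s, s') > \rho(x, s')
    \iff \rho(s, s_t) > \rho(x, s_t)
    \iff \rho^2(s, s_t) > \rho^2(x, s_t)
\end{align*}
```

Dividing both sides of the parallel-form comparison by the positive quantity ρ(s, s_t) recovers the original criterion, so the two statements pick out the same subscores.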