In the lead article, Davenport, Davison, Liou, & Love demonstrate the relationships among homogeneity, internal consistency, and coefficient alpha, and also distinguish among them. These distinctions are important because coefficient alpha—a reliability coefficient—is too often interpreted as an index of homogeneity or internal consistency. We argue that factor analysis should be conducted before calculating internal consistency estimates of reliability. If factor analysis indicates that the assumptions underlying coefficient alpha are met, then it can be reported as a reliability coefficient. However, to the extent that items are multidimensional, alternative internal consistency reliability coefficients should be computed based on the parameter estimates of the factor model. Assuming a bifactor model fits well and the measure was designed to assess a single construct, omega hierarchical—the proportion of variance of the total scores due to the general factor—should be presented. Omega—the proportion of variance of the total scores due to all factors—should also be reported because it represents a more traditional view of reliability, although it is computed within a factor analytic framework. By presenting both of these coefficients, and potentially other omega coefficients, the reliability results are less likely to be misinterpreted.
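The two omega coefficients described above can be computed directly from standardized bifactor loadings: each is a ratio of explained variance to total score variance. The sketch below uses hypothetical loadings for a six-item scale with one general and two group factors; the values are illustrative only, not from any of the articles discussed.

```python
import numpy as np

# Hypothetical standardized bifactor loadings for a 6-item scale:
# one general factor plus two group factors (3 items each).
general = np.array([0.70, 0.60, 0.65, 0.55, 0.60, 0.50])
group1  = np.array([0.30, 0.35, 0.25, 0.00, 0.00, 0.00])
group2  = np.array([0.00, 0.00, 0.00, 0.40, 0.30, 0.35])

# Unique (error) variances implied by the standardized model
uniq = 1 - general**2 - group1**2 - group2**2

# Variance of total scores = squared loading sums plus unique variance
total_var = general.sum()**2 + group1.sum()**2 + group2.sum()**2 + uniq.sum()

# Omega hierarchical: variance of total scores due to the general factor
omega_h = general.sum()**2 / total_var

# Omega: variance of total scores due to all (general + group) factors
omega = (general.sum()**2 + group1.sum()**2 + group2.sum()**2) / total_var

print(round(omega_h, 3), round(omega, 3))
```

Reporting both values makes the distinction concrete: omega will always be at least as large as omega hierarchical, and the gap between them reflects how much reliable variance the group factors contribute beyond the general factor.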

I discuss the contribution by Davenport, Davison, Liou, & Love (2015) in which they relate reliability represented by coefficient α to formal definitions of internal consistency and unidimensionality, both proposed by Cronbach (1951). I argue that coefficient α is a lower bound to reliability and that concepts of internal consistency and unidimensionality, however defined, belong to the realm of validity, viz. the issue of what the test measures. Internal consistency and unidimensionality may play a role in the construction of tests when the theory of the attribute for which the test is constructed implies that the items be internally consistent or unidimensional. I also offer examples of attributes that do not imply internal consistency or unidimensionality, thus limiting these concepts' usefulness in practical applications.

Student growth percentiles (SGPs; Betebenner, 2009) are used to locate a student's current score in a conditional distribution based on the student's past scores. Currently, following Betebenner (2009), quantile regression (QR) is most often used operationally to estimate the SGPs. Alternatively, multidimensional item response theory (MIRT) may also be used to estimate SGPs, as proposed by Lockwood and Castellano (2015). A benefit of using MIRT to estimate SGPs is that techniques and methods already developed for MIRT may readily be applied to the specific context of SGP estimation and inference. This research adopts a MIRT framework to explore the reliability of SGPs. More specifically, we propose a straightforward method for estimating SGP reliability. In addition, we use this measure to study how SGP reliability is affected by two key factors: the correlation between prior and current latent achievement scores, and the number of prior years included in the SGP analysis. These issues are primarily explored via simulated data. In addition, the QR and MIRT approaches are compared in an empirical application.
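The core idea of an SGP—the percentile of a student's current score within the conditional distribution given prior scores—can be sketched with simulated data. The operational approach is quantile regression (Betebenner, 2009); the simplified stand-in below assumes a normal linear model for the conditional distribution, which conveys the same intuition. All variable names and the correlation value are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Simulate correlated prior and current achievement scores (toy data);
# rho plays the role of the prior-current latent correlation studied above.
n, rho = 5000, 0.7
prior = rng.normal(size=n)
current = rho * prior + sqrt(1 - rho**2) * rng.normal(size=n)

# Fit current ~ prior by least squares (a normal-model stand-in for QR)
b1, b0 = np.polyfit(prior, current, 1)
resid = current - (b0 + b1 * prior)
sigma = resid.std(ddof=2)

def sgp(prior_score, current_score):
    """Percentile of the current score in the estimated conditional
    normal distribution given the prior score (simplified SGP)."""
    z = (current_score - (b0 + b1 * prior_score)) / sigma
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# A student scoring exactly at the conditional mean sits at the 50th percentile
print(round(sgp(1.0, b0 + b1 * 1.0)))
```

In this parametric sketch the reliability questions raised above become visible: as rho grows, the conditional distribution narrows and a given score difference translates into a larger SGP difference.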

Little is known about the reliability of college grades relative to how prominently they are used in educational research, and the results to date tend to be based on small sample studies or are decades old. This study uses two large databases (*N* > 800,000) from over 200 educational institutions spanning 13 years and finds that both first-year and overall college GPA can be expected to be highly reliable measures of academic performance, with reliability estimated at .86 for first-year GPA and .93 for overall GPA. Additionally, reliabilities vary moderately by academic discipline, and within-school grade intercorrelations are highly stable over time. These findings are consistent with a hierarchical structure of academic ability. Practical implications for decision making and measurement using GPA are discussed.

This article uses definitions provided by Cronbach in his seminal paper for coefficient α to show the concepts of reliability, dimensionality, and internal consistency are distinct but interrelated. The article begins with a critique of the definition of reliability and then explores mathematical properties of Cronbach's α. Internal consistency and dimensionality are then discussed as defined by Cronbach. Next, functional relationships are given that relate reliability, internal consistency, and dimensionality. The article ends with a demonstration of the utility of these concepts as defined. It is recommended that reliability, internal consistency, and dimensionality each be quantified with separate indices, but that their interrelatedness be recognized. High levels of unidimensionality and internal consistency are not necessary for reliability as measured by α nor, more importantly, for interpretability of test scores.
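Cronbach's α, the coefficient whose mathematical properties the article examines, can be stated compactly: with k items, α = k/(k−1) · (1 − Σ item variances / variance of total scores). A minimal sketch with a hypothetical item-response matrix (toy data, not from the article):

```python
import numpy as np

# Hypothetical responses: 5 respondents x 4 items on a 1-5 scale
X = np.array([
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
], dtype=float)

k = X.shape[1]
item_vars = X.var(axis=0, ddof=1)       # variance of each item
total_var = X.sum(axis=1).var(ddof=1)   # variance of total scores

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))
```

Note that the formula makes no reference to dimensionality: α can be high for a multidimensional item set with strongly covarying items, which is why the article treats reliability, internal consistency, and dimensionality as distinct quantities.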

The alignment between a test and the content domain it measures represents key evidence for the validation of test score inferences. Although procedures have been developed for evaluating the content alignment of linear tests, these procedures are not readily applicable to computerized adaptive tests (CATs), which require large item pools and do not use fixed test forms. This article describes the decisions made in the development of CATs that influence and might threaten content alignment. It outlines a process for evaluating alignment that is sensitive to these threats and gives an empirical example of the process.

The purpose of this ITEMS module is to provide an introduction to differential item functioning (DIF) analysis using mixture item response models. The mixture item response models for DIF analysis involve comparing item profiles across latent groups, instead of manifest groups. First, an overview of DIF analysis based on latent groups, called latent DIF analysis, is provided and its applications in the literature are surveyed. Then, the methodological issues pertaining to latent DIF analysis are described, including mixture item response models, parameter estimation, and latent DIF detection methods. Finally, recommended steps for latent DIF analysis are illustrated using empirical data.

Grade inflation threatens the integrity of college grades as indicators of academic achievement. In this study, we contribute to the literature on grade inflation by providing the first estimate of the size of grade increases at the student level between the mid-1990s and mid-2000s. By controlling for student characteristics and course-taking patterns, we are able to eliminate alternative explanations for grade increases. Our results suggest that grade inflation has occurred across decades, at a small yet non-negligible rate. Suggestions for future research are discussed.

Feinberg and Wainer (2014) provided a simple equation to approximate and predict a subscore's value. The purpose of this note is to point out that their equation is often inaccurate: it does not always predict a subscore's value correctly. Therefore, the utility of their simple equation is not clear.

With the recent adoption of the Common Core standards in many states, there is a need for quality information about textbook alignment to standards. While there are many existing content analysis procedures, these generally have little, if any, validity or reliability evidence. One exception is the Surveys of Enacted Curriculum (SEC), which has been widely used to analyze the alignment among standards, assessments, and teachers’ instruction. However, the SEC can be time-consuming and expensive when used for this purpose. This study extends the SEC to the analysis of entire mathematics textbooks and investigates whether the results of SEC alignment analyses are affected if the content analysis procedure is simplified. The results indicate that analyzing only every fifth item produces nearly identical alignment results with no effect on the reliability of content analyses.

This study reports on the development of a teacher evaluation instrument, based on students’ observations, which exhibits cumulative ordering in terms of the complexity of teaching acts. The study integrates theory on teacher development with theory on teacher effectiveness and applies a cross-validation procedure to verify whether teaching acts have a cumulative order. The resulting teacher evaluation instrument comprises 32 teaching acts with cumulative ordering in terms of complexity. This ordering aligns with prior teacher development research. It also represents a valuable extension in that the instrument can provide feedback about a teacher's current phase of development and advice for improvement.

Much of the recent focus of educational policymakers has been on improving the measurement of teacher effectiveness. Linking student growth to teacher effects has been a large part of reform efforts. To date, neither researchers nor practitioners have arrived at a consensus on how to treat test scores from students with disabilities in growth-based teacher effectiveness indicators, despite the fact that these students make up approximately 13% of the K-12 student population. In this study, we leverage longitudinal data from the population of teachers in one state to explore practical questions related to including general assessment scores from students with disabilities in teacher evaluation. Findings suggest that including test scores from students with disabilities allows more teachers to be evaluated and does not substantially affect teachers’ scores. Moreover, including disability-related covariates can allow for fairer evaluations for teachers with many students with disabilities in their class.

Drawing valid inferences from item response theory (IRT) models is contingent upon a good fit of the data to the model. Violations of model-data fit have numerous consequences, limiting the usefulness and applicability of the model. This instructional module provides an overview of methods used for evaluating the fit of IRT models. Upon completing this module, the reader will have an understanding of traditional and Bayesian approaches for evaluating model-data fit of IRT models, the relative advantages of each approach, and the software available to implement each method.
