Recent international studies note that countries whose students perform well on international science assessments report the need to change science education. Some countries use assessments for diagnostic purposes to assist teachers in addressing their students' needs. However, in the United States, standards-based reform has focused the national discussion on documenting students' attainment of high educational standards. Students' science achievement is one of those standards, and in many states, “high-stakes” tests determine the resultant achievement measures. Policymakers and administrators use those tests to rank school performance, to bar students from graduating, and to evaluate teachers. Because science test measures are used in such different ways, statistical confidence in the measures' validity and reliability is essential. Using a science achievement test from one state's systemic reform project as an example, this paper discusses the strengths of the Rasch model as a psychometric tool and analysis technique, referring to person-item maps, anchoring, differential item functioning, and person-item fit. Furthermore, the paper proposes that science educators should carefully inspect the tools they use to measure and document changes in educational systems. © 2005 Wiley Periodicals, Inc. Sci Ed 90:253–269, 2006