Holding Accountability Accountable: A Cost–Benefit Analysis of Achievement Test Scores


Correspondence concerning this article should be addressed to Murray Levine or Adeline G. Levine, 18 St. Andrew's Walk, Buffalo, NY 14222. Electronic mail may be sent to psylevin@buffalo.edu.

In the name of accountability, elementary and secondary school children's achievement test scores have come to dominate educational discourse. Although the tests currently focus on the basic tools—English language and mathematics—the students' scores are used not only for decisions about promoting children from grade to grade but also to evaluate teachers’ and principals’ performances, as well as to rank schools and school districts for educational merit. Test scores are now used to justify granting vouchers, increasing the number of charter schools, and determining which public schools are to be labeled as failing, or subject to “turnaround” practices, or even to be closed. There is so much to lose or gain that the current process is called high-stakes testing.

In 2002, the No Child Left Behind (NCLB) Act introduced high-stakes standardized academic achievement testing nationwide. The federal act offered financial incentives to states administering annual achievement tests in public school grades 3–8, based on each state's educational standards. The proponents assured the public that the annual testing, with severe consequences for failing to show progress, would have two results: improved academic achievement test scores for all children and a decrease in the majority–minority achievement score gap. Initially, Congress set aside $400 million to support test development in the states and continued to appropriate about that amount every year for grants to the states to support assessment.

Testing has always been performed by teachers, based on their own individual curricula, to help them determine their students’ grasp of subject matter and also to improve their own classroom performance and presentation of material. Since the early to middle part of the 20th century, standardized tests in various subjects have been used to assess students from many different classrooms, schools, or geographic locations. Everyone takes the same tests; students can be compared, individually or by groups. The questions, necessarily a sample of what was taught, are assumed to thoroughly cover the subjects; the tests are assumed to be accurately and reliably scored; the test scores are assumed to represent actual learning.

High-stakes testing is not useful as an assessment to improve learning. Because the tests are administered and scored at the end of the school year, they are of little assistance to teachers who want to identify students in need of help. The results are reported as single scores that provide no diagnostic information to help a teacher work with a student.

Annual achievement testing with consequences is an intervention. It is sold explicitly as a method to improve education by raising test scores. It is basically the latest manifestation of attempts dating back to the 19th century to make public schools efficient, based on “scientific management” principles. The model is that of the factory production line: Students are the raw materials processed by teachers; teachers are the line factory workers in need of close supervision; test scores are the product. From the earliest days of standardized testing, educational administrators asserted that children's test scores could be attributed to teacher effectiveness. Tests made it easier to identify substandard teachers and easier to justify firing them.

Whether to assess or to intervene, if we are good managers, we must ask “how costly is this method of annual high-stakes achievement testing and how successful is it in producing measurable outcomes?” Thus, we need a cost–benefit analysis.

Research Questions

We explored the following three research questions:

  • What has been the benefit in New York State of high-stakes testing, measured in changes in test scores on New York State-constructed tests, and at what cost, measured in dollars? In other words, how much “bang for the buck” do New York taxpayers get from high-stakes testing? We chose New York because longitudinal data were available and we live and pay taxes in this state.
  • How much “bang for the buck” do we get on a national level, with benefits measured on a federally constructed test?
  • How much has the high-stakes testing intervention affected the majority–minority achievement gap as measured on a national level?

Data Sources


It is difficult to find trustworthy figures on state expenditures for NCLB testing. The New York State Department of Education does not provide precise figures about the state's cost for the testing process. Therefore, to do a cost–benefit analysis, we have made estimates based on the information made available to us in prompt responses to written requests.

A June 8, 2010, memorandum is one source of data for New York State's testing costs. Under the subject line “Assessment Cost Reduction Strategies,” New York State Education Commissioner John B. King, Jr. estimated that the 2009–2010 “cost of assessment will be approximately $45 million,” up from $14 million before NCLB. The dollar amount cited by Commissioner King included the cost of the annual Regents examinations. The memo also included the number of students taking Regents exams, and the number tested annually on state-constructed tests in grades 3–8.

Taking into account the proportion of Regents-taking students to NCLB-tested students, and assuming it is more costly to administer and to grade Regents examinations, we initially assigned $15 million—one-third of the $45 million spent on annual assessments—to the cost of NCLB testing. We compared that amount with New York State's portion of the approximately $400 million a year appropriated by the Congress to enable the NCLB testing mandates. New York's share of the federal money was about $15.8 million annually. With no estimate of the costs for state personnel to work with the contractors to develop and supervise the testing and the contracts, we assumed these costs are included in the $15 million annual estimate derived from the Commissioner's memo. However, given other possible costs, we thought it reasonable to use the $15.8 million federal grant to the state as a better, but still conservative, estimate of what New York State actually spent on NCLB testing.


Assuming similar costs each year, we estimated the sum of New York State's annual costs for NCLB assessments to be $94.8 million (6 × $15.8 million) over the 6-year period of the intervention, 2006–2011. We have examined other possible sources to estimate costs and conclude that our measure of cost is extremely conservative. It is irrelevant that much of the state's cost may be reimbursed in federal dollars. (As taxpayers, we really do not care. It all comes out of our pockets.)


Information about achievement test scores is available in abundance from state and federal education departments. We have argued that what an achievement test score actually measures remains unknown, the labels attached to it notwithstanding. As we will discuss later, there is evidence that the scores on achievement tests may measure general mental ability, or a single “latent trait” of test-taking ability rather than command of a subject. However, achievement test scores are the current quantitative measure used in educational policy discussions by educators, politicians, the media, and the public. Therefore, despite our reservations, to do a cost–benefit analysis, we will use changes over time in standardized achievement test scores as the measure of educational benefits.

Cost–Benefit Analysis

We can now proceed to a cost–benefit analysis using achievement test data available from the state of New York and from national studies. We will use $94.8 million as an approximation of the total cost for the NCLB-required intervention summed over the 6 years (6 × $15.8 million).


How Much ‘Bang for the Buck’ in New York State?

Tables 1 and 2 are derived from published reports about tests constructed and scored for New York State by a contractor, CTB/McGraw-Hill. NCLB testing started in 2006. Table 1 shows standard scale and raw scores for English language arts (ELA) for grades 3–8 from 2006 through 2011. Table 2 shows the results for mathematics for the same grades and years. The published reports note that standard “scale scores from the 2006 and 2011 … can be directly compared.” Raw scores (how many test items are answered correctly) are perfectly correlated with scale scores. To make the meaning of the standard score changes more concrete, we shall estimate the dollar cost of each raw score increment in correct answers.

Table 1. English Language Arts, New York State Achievement Test Standard Scores by Year and Grade

Grade    2006    2011    Score Changes
Sum                      28.56
Mean                     4.76    0.73    +3.25

  1. Scale: 450–770; Source: NY State Education Department. New York State Testing Program 2011: English Language Arts, Grades 3–8. Technical Report. Monterey, CA: CTB/McGraw-Hill (2011).

Table 2. New York State Mathematics Achievement Test Standard Scores by Year and Grade

Grade    2006    2011    Score Changes
Sum                      120.69
Mean                     20.12    3.05    +11.67

  1. Scale: 450–770; Source: NY State Education Department. New York State Testing Program 2011: Mathematics, Grades 3–8. Technical Report. Monterey, CA: CTB/McGraw-Hill (2011).

English Language Arts Standard Scores

As shown in Table 1, summed across 6 years of testing, New York students have gained a total of 28.56 standard score units, an average of 4.76 units per year, a small change. Because standard score units are comparable, it does not violate the underlying mathematics to simply sum the changes for each grade across the 6 years or to average them. If we divide $94.8 million by the total gain of 28.56 over 6 years, it has cost the state taxpayer about $3.32 million per unit of standard score gain in ELA test scores.

ELA Percent Changes

During the 6 years, the improvements each year were less than 1% for all but seventh grade, where it was less than 2%. Overall, a very small percent of students did better in 2011 than in 2006.

ELA Raw Scores

At its heart, a test score is the sum of the number of right answers on a specific test. When the scale scores are reconverted into raw scores for each grade, students averaged 3.25 more items correct in 2011 than in 2006 on tests with about 40 items. (The number of test items varies somewhat by grade, but 40 is sufficiently representative.) At a cost of $94.8 million, the high-stakes intervention averaged about $29.17 million per additional correct item. Even so, we have no idea what an additional correct item tells us specifically about what a child has learned.

Mathematics Standard Scores

Table 2 shows that the summed difference in math scores between 2006 and 2011, from grades 3–8, is 120.69 standard score points, an average of 20.12 points per year. If we divide $94.8 million by 120.69, the result is a cost to the taxpayer of $785,483 per unit of standard score gain in math over 6 years. Keep in mind that each standard score unit is a very small amount.

Mathematics Percent Changes

A small percent of students did better in math in 2011 than they did in 2006; the largest gain was 4.24% for the seventh grade over the 6-year period.

Mathematics Raw Scores

After 6 years of the high-stakes testing intervention costing $94.8 million, the average math score showed 11.67 more items correct in 2011 than in 2006, at a cost of $8.1 million per additional correct item. As we will discuss, increments in the number correct in the mathematics raw score correlate very strongly with grade level (ρ = .78; p < .001).
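The cost-per-unit figures in this section all come from the same division of total cost by score gain. A minimal Python sketch of that arithmetic follows; the function and variable names are ours, introduced only for illustration, and the inputs are the $15.8 million annual estimate and the gains reported in Tables 1 and 2.

```python
# Illustrative check of the cost-per-gain arithmetic; all names are hypothetical.
ANNUAL_COST_MILLIONS = 15.8  # estimated annual NCLB testing cost, New York State
YEARS = 6                    # 2006 through 2011


def cost_per_unit(total_cost_millions: float, gain: float) -> float:
    """Return the cost, in millions of dollars, per unit of score gain."""
    if gain <= 0:
        raise ValueError("gain must be positive to compute a cost per unit")
    return total_cost_millions / gain


total_cost = ANNUAL_COST_MILLIONS * YEARS  # about 94.8 (million dollars)

# Gains quoted in the text: summed standard score change and mean raw score change.
gains = {
    "ELA standard score": 28.56,
    "ELA raw score": 3.25,
    "Math standard score": 120.69,
    "Math raw score": 11.67,
}

costs = {label: cost_per_unit(total_cost, gain) for label, gain in gains.items()}

for label, cost in costs.items():
    print(f"{label}: ${cost:.2f} million per unit of gain")
```

Run as written, the sketch prints one line per measure. The values agree with those quoted in the text to within rounding of the last digit.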


Table 3 is based on our second source, the scores for New York State fourth and eighth graders in reading and mathematics from the National Assessment of Educational Progress tests (NAEP will be described later). The NAEP scores for 2007 and 2011 are for years overlapping those in the New York State longitudinal report. We did not have access to the raw score equivalents of the NAEP scale scores.

Table 3. National Assessment of Educational Progress (NAEP) Reading and Mathematics Standard Scores for NY State Students by Year and Grade

Subject        Grade        2007    2011    Difference    % Change
Reading        4th grade    224     222     −2            −0.89
Reading        8th grade    264     266     +2            0.76
Mathematics    4th grade    243     238     −5            −2.06
Mathematics    8th grade    280     280     0             0.0

  1. Scale: 0–500; Source: National Center for Education Statistics, IES, NAEP, State Profiles: http://nces.ed.gov/nationsreportcard/states/


The standard scale reading scores on the NAEP tests (Table 3) do not differ sharply from the state test results shown in Table 1. There is a less than 1% decrease in the New York fourth-grade standard reading scores and a less than 1% increase in the New York eighth-grade students’ reading scores on the NAEP tests.


The NAEP standard scale scores for mathematics (Table 3) show a contrast with the NY state test results (Table 2). The eighth-grade students’ scores show no gain on the NAEP tests in mathematics in the 2007–2011 period. There is a very small drop in fourth-grade scores. If they had learned mathematics as the state test results suggested, the students’ newly acquired knowledge should have been demonstrated on different tests covering similar content.

Overall, the national tests show almost no achievement score benefit and even some detriment over the 4-year period of high-stakes testing in New York State. During that period, we estimate that the state spent $63.2 million (4 × $15.8 million) on its own high-stakes testing intervention. The cost–benefit analysis is simply this: there was next to no benefit, only costs.
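The percent changes reported in Table 3 can be reproduced directly from the 2007 and 2011 scale scores. The following Python sketch shows the calculation; the dictionary layout and the function name are hypothetical conveniences, not part of the NAEP reports.

```python
# Illustrative sketch; the data structure and names are hypothetical.
# Values are (2007 score, 2011 score) for New York State students, from Table 3.
naep_ny = {
    ("Reading", "4th grade"): (224, 222),
    ("Reading", "8th grade"): (264, 266),
    ("Mathematics", "4th grade"): (243, 238),
    ("Mathematics", "8th grade"): (280, 280),
}


def percent_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100


changes = {key: percent_change(*scores) for key, scores in naep_ny.items()}

for (subject, grade), pct in changes.items():
    print(f"{subject}, {grade}: {pct:+.2f}%")
```

Every change is smaller than 2.1% in absolute value, and two of the four are zero or negative.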

National Assessment of Educational Progress

Before we turn to the remaining research questions, we will describe the second source of benefit data, the NAEP reports, which are published by the U.S. Department of Education. In 1988, the Elementary and Secondary Education Act amendments authorized the newly created National Assessment Governing Board to set policies for test development, for national administration of the test, and for general supervision of these activities.

In 2002, the NCLB required states to participate in NAEP testing every other year for grades 4 and 8 and at the end of high school. The National Center for Education Statistics, under the supervision of the Governing Board, developed highly sophisticated testing procedures and sampling methods for meeting its NCLB obligations. Originally meant to provide information about educational progress and to audit results of states’ tests, NAEP's findings have become increasingly important and widely used. The NAEP reports are called the National Report Card; the NAEP tests are considered more secure and less susceptible to “teaching the test” than are the state tests. NAEP's nationally representative samples are large enough to provide statistically reliable data state by state.

How Much ‘Bang for the Buck’ Nationally?

To answer the second research question, we used a NAEP report on national scores. NAEP standard scale scores range between 0 and 500. Table 4 shows the average standard scores on the NAEP tests for the full national sample for reading and for mathematics in 2003 and in 2011, the years when NCLB was in force.


Table 4. National Sample National Assessment of Educational Progress (NAEP) Mean Standard Scores in Reading and Mathematics by Year and Grade

Subject        Grade        2003    2011    Difference    % Change
Reading        4th grade    216     220     4             1.9
Reading        8th grade    261     264     3             1.1
Mathematics    4th grade    234     240     6             2.6
Mathematics    8th grade    276     283     7             2.5

  1. Scale: 0–500; Source: http://nces.ed.gov/nationsreportcard/itt/ State by State Results of NAEP Testing.


Between 2003 and 2011, the national scores in standard score units for fourth-grade reading increased by four points. For eighth-grade reading, the increase was three points.


Between 2003 and 2011, fourth-grade math scores increased by six points and eighth-grade math scores by seven points.

Percent changes

Translated into percent increases, in 2011 over 2003, the lowest percentage increase was 1.1% for eighth-grade reading; the highest was 2.6% for fourth-grade math. Evidently, there was very little change over time. Given the very large sample size, the small difference from 2003 to 2011 might well be statistically significant, but its educational significance is questionable.


The largest change in the standard scores for the national sample, between 2003 and 2011, is the seven-point increase in eighth-grade mathematics. We have no estimates of the amount of money all the states combined spent annually on NCLB testing, but if total testing industry revenues are reasonably estimated at about $2.8 billion a year, then the 9-year cost to the states of the high-stakes testing intervention is $25.2 billion, making the cost per increment nationally in eighth-grade math standard score about $3.6 billion. Because the other changes are smaller, the cost per increment of standard score would be higher.
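The same division can be applied to each of the four national score changes in Table 4. The Python sketch below is illustrative only: the names are hypothetical, and the $2.8 billion annual figure is the industry revenue estimate cited in the text, used here as a stand-in for state spending rather than as an audited cost.

```python
# Illustrative sketch; names are hypothetical, and the revenue figure is the
# estimate cited in the text, not an audited cost.
ANNUAL_INDUSTRY_REVENUE_BILLIONS = 2.8
YEARS = 9  # 2003 through 2011, inclusive

total_cost_billions = ANNUAL_INDUSTRY_REVENUE_BILLIONS * YEARS

# Change in national NAEP mean standard scores, 2003 to 2011 (Table 4).
score_gains = {
    "4th-grade reading": 4,
    "8th-grade reading": 3,
    "4th-grade mathematics": 6,
    "8th-grade mathematics": 7,
}

cost_per_point = {
    label: total_cost_billions / gain for label, gain in score_gains.items()
}

for label, cost in cost_per_point.items():
    print(f"{label}: about ${cost:.1f} billion per standard score point")
```

Because eighth-grade mathematics shows the largest gain, it yields the lowest cost per point; each of the other three measures costs more per point under the same assumption.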

How Much Has the NCLB-Mandated High-Stakes Intervention Affected the Majority–Minority Achievement Gap, as Measured on the National Level?

For this third question, we used NAEP national data on the changes in reading and mathematics standard scores for 2003 and 2011, for White, Black, and Hispanic fourth- and eighth-grade students.


Table 5 shows the average standard scores in reading in 2003 and 2011 for fourth and eighth grades. Between 2003 and 2011 inclusive, the minority students increased their standard scores more than White students did. The fourth-grade White student scores increased from 2003 to 2011 by three points or 1.3%; the Black students’ increase was eight points or 4.1%; the Hispanic students’ increase was six points or 3.0%. Table 5 shows that the results are similar for eighth-grade reading scores.

Table 5. National Sample NAEP Mean Standard Scores in Reading by White, Black, and Hispanic Students by Year and Grade

Grade    2003    2011    Difference    % Change
4th grade
8th grade

  1. Scale: 0–500; Source: http://nces.ed.gov/nationsreportcard/itt/ State by State Results of National Assessment of Educational Progress (NAEP) Testing.

How much did these small changes affect the size of the majority–minority gap in reading? While the standard scores and the percentage increases for both the fourth-grade reading and the eighth-grade reading are slightly larger for minority than for White students, the size of the majority–minority gap remains little changed.

In 2003, the fourth-grade reading gap between White and Black students was 30 scale score points (227 − 197). In 2011, the gap was 25 scale score points (230 − 205). For Hispanic students, the gap in scale points in 2003 was 28 (227 − 199); in 2011, it was 25 points (230 − 205). The majority–minority gap scarcely changed. The differences can be calculated in the same way for the eighth-grade reading scores.


Fourth-grade and eighth-grade math scores are shown in Table 6. Between 2003 and 2011, Black and Hispanic students showed somewhat larger increases in standard scores and percentage changes than White students, but the size of the majority–minority gap in mathematics essentially remains. The gap for fourth-grade Black students relative to Whites changed from 27 points in 2003 (243 − 216) to 25 points in 2011 (249 − 224). For Hispanic students, the gap changed from 22 points in 2003 (243 − 221) to 20 points in 2011 (249 − 229). Table 6 shows that the results are similar for eighth-grade math scores.

Table 6. National Sample NAEP Mean Standard Scores in Mathematics by White, Black, and Hispanic Students by Year and Grade

Grade    2003    2011    Difference    % Change
4th grade
8th grade

  1. Scale: 0–500; Source: http://nces.ed.gov/nationsreportcard/itt/ State by State Results of National Assessment of Educational Progress (NAEP) Testing.

The figures in the rows of Table 7 are the differences described earlier for Tables 5 and 6, between the standard scale scores for Whites–Blacks and Whites–Hispanics from 2003 to 2011 inclusive. Table 7 shows the percent of change in the majority–minority gap in both reading and math scores over the 9-year period. The larger the change in percentage, the more the gap was reduced.


Table 7. Black–White Gap and Hispanic–White Gap in National Sample NAEP Mean Standard Scores in Reading and Mathematics by Grade and Year

                            Black–White Gap              Hispanic–White Gap
Subject        Grade        2003    2011    % Change     2003    2011    % Change
Reading        4th grade    30      25      −16.7        28      25      −10.7
Reading        8th grade    26      24      −7.7         26      21      −19.2
Mathematics    4th grade    27      25      −7.4         22      20      −9.1
Mathematics    8th grade    35      31      −11.4        29      24      −17.2

  1. The larger the percent change, the smaller the gap. The average reduction in the gap between Whites and Blacks + Hispanics is 11.2% for reading and between Whites and Blacks + Hispanics, it is 11.3% for math. The overall gap between the minorities and Whites in 2011 is still over 88% of what it was in 2003. Source: http://nces.ed.gov/nationsreportcard/states/ State by State Results of National Assessment of Educational Progress (NAEP) Testing.
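The fourth-grade entries in Table 7 can be derived from the group means quoted in the text. The Python sketch below shows one way to do so; the names and data layout are hypothetical, and only the fourth-grade means are used because those are the ones stated explicitly in the discussion.

```python
# Illustrative sketch; names are hypothetical. Fourth-grade national NAEP
# means as (2003, 2011) pairs, taken from the figures quoted in the text.
means = {
    "Reading": {"White": (227, 230), "Black": (197, 205), "Hispanic": (199, 205)},
    "Mathematics": {"White": (243, 249), "Black": (216, 224), "Hispanic": (221, 229)},
}


def gap_change(white, minority):
    """Return (gap in 2003, gap in 2011, percent change in the gap)."""
    gap_2003 = white[0] - minority[0]
    gap_2011 = white[1] - minority[1]
    pct = (gap_2011 - gap_2003) / gap_2003 * 100
    return gap_2003, gap_2011, pct


results = {}
for subject, groups in means.items():
    for group in ("Black", "Hispanic"):
        results[(subject, group)] = gap_change(groups["White"], groups[group])

for (subject, group), (g03, g11, pct) in results.items():
    print(f"{subject}, White-{group} gap: {g03} -> {g11} points ({pct:+.1f}%)")
```

A negative percent change means the gap narrowed. The share of the gap that remains is 100% plus that change: for fourth-grade reading, for example, the Black–White gap in 2011 is 25/30, or about 83%, of its 2003 size.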

If we average the change in the achievement gap for the White compared with the two minority groups taken together, we can derive convenient summary figures. Even though it conceals the variability in Table 7, there is a reduction of 11.2% in the majority–minority achievement test score gap in reading. Similarly for math, there is a reduction of 11.3% in the majority–minority achievement test score gap. Nonetheless, the overall gap remains in 2011 at about 88% of what it was in 2003.


New York State and the nation have bought very little educational benefit from the high-stakes testing intervention. Some experts argue that the dollar costs of high-stakes assessment do not matter. They say that $15.8 million per year is a tiny fraction of the $28 billion New York State spent on public elementary and secondary education in 2011–2012. That may be true, but to paraphrase what a politician once said, “A million here, a million there and pretty soon you're talking about real money.” Ninety-four million dollars plus is real money; $390 million in federal appropriations to the states for assessment each year since 2002—almost $4 billion total on NCLB testing—is a lot of real money. The $2.8 billion to the testing industry annually in the years since the NCLB went into force is also a lot of real money.

The small gains in scores, shown on Tables 1–4, may be statistically significant, because the population samples are large. What can those students do now that they could not do in reading, writing, or math, before they gained the few additional items correct on the tests?

In regard to the increase in New York mathematics test scores shown in Table 2, one rival plausible hypothesis is that the longer children were exposed to the same items, the more items they got correct. New York State mathematics test results have long been criticized, because the tests were published and available to teachers and test items were repeated in exactly the same or in a similar form from year to year. The gains may well represent familiarity with the tests and not any specific increase in learning because of high-stakes testing. Alternatively, the small amount of gain in raw score could have easily resulted from subtle coaching during the test by teachers held in thrall to high-stakes testing or from persistent drilling. In any case, the students may gain on state tests, but there seems to be no generalizability from the state to the national tests in the same subjects. (Our analysis assumes that alternate forms of the tests used in different years are perfectly equivalent in difficulty and that the composition of the population taking the tests did not change from year to year.)

What About the Majority–Minority Gap?

Proponents of high-stakes testing confidently predicted that the program would lessen the majority–minority gap in achievement test scores. Despite the high costs, high-stakes testing has not changed the gap appreciably. Over the NCLB years, NAEP scores show that majority–minority achievement test gaps barely changed nationally for fourth- and eighth-grade reading and mathematics (Tables 5 and 6). Those findings are reflected as well in a longitudinal study of 46,400 New York City students. From school year 2005–2006 through 2009–2010, the White–Hispanic and White–Black differences in ELA scores on state tests increased slightly or remained unchanged. Both national and state results show that No Child Left Behind high-stakes testing has failed in this predicted aspect of its costly mission.

The Winners

If taxpayers got very little yield, the testing industry reaped a money harvest. Projected nationally, with an estimated 56 million NCLB-required tests administered in the 2007–2008 school year alone, the testing industry took in many millions of dollars for devising, scoring, and reporting on tests. The firms earn much more by selling workbooks, practice tests, and computer software designed for test preparation. “Profit margins in test preparation are as much as 7 times higher than they are for No Child tests. … In 2005, the industry generated $2.8 billion in revenue from testing and test preparation. … Getting students ready for tests is the biggest part of the industry's revenue.”1 Testing corporations and the suppliers of curriculum materials are anticipating a new bonanza when the Common Core Standards, with their extensive testing routines, go into effect, all on a national basis.

What Achievement Tests Measure

The achievement tests may not measure what they claim to measure. Consistent individual differences between students account for the largest amount of variance in achievement test performance. The correlations between measures of mental ability and achievement tests are in the neighborhood of .80. Researchers in Texas concluded that there apparently is a “latent trait” characterized as test-taking ability. It may be that achievement tests are impervious to instruction and that they are not a useful criterion for measuring educational progress.

High-stakes testing may have another significant flaw. Recently, educational researchers in Indiana compared two groups of students. Both groups took American College Testing (ACT) tests. ACT results are used widely by college administrators to assess high school achievement and to determine the probabilities of success in college. One group of students were high school seniors from schools with increased scores in reading and mathematics on Indiana state tests for 3 consecutive years. The other group of students were from high schools with a decrease in performance on the Indiana state tests for 3 years. When compared, there was little difference between the two groups in reading or mathematics performance on the ACT tests, another example showing that performance on state tests does not generalize to performance on national tests. More important, students in schools that had improved on the state tests scored significantly lower on the ACT science test than did the students from schools where scores had declined on the state tests in reading and math. Even though the difference was statistically significant, it was small, and such a study should be repeated by other researchers. One hypothesis to account for the findings is that schools that emphasized test-based instruction did their students a disservice by not adequately teaching nontested subjects such as science.


Music and art, and even recess periods, have been eliminated in many schools to provide more time for instruction in test taking. Horace Mann, a 19th-century statesman, was prominent in the establishment of publicly supported “common schools.” In 1847, Mann stressed that Americans need a broad education to strengthen our diversely populated democracy. He warned that “the naked capacity to read and write is no more education than a tool is to a workman.”2 Our current high-stakes testing deforms the educational process to the point that the public school curriculum in many states is dwindling down to the teaching of the “three Rs.”

Better Use of Money

Our daily newspapers frequently report that schools forgo programs and activities that cost far less than the amount spent on relentless testing. What else could we do to improve education with the $15.8 million New York State spends annually on NCLB testing? Obviously, good buildings, smaller classes, more teachers, more classroom aides, more libraries, and more professional librarians could improve the school experience considerably for students and teachers.

We can readily think of other uses for dollars now spent on testing. Here is one small example. Recently, a group of first grade teachers received funds from a private source to take their children to the local zoo. Their school is located in a poor neighborhood, and 50% of the children are from immigrant families where English is not spoken in the home. For school children, a field trip is a big treat. The zoo trip cost less than $1000 for tickets and a bus for 100 children and some parent volunteers. After the field trip, the first-grade children, who had been studying animals, wrote brief essays and made drawings about the animals they had seen and sent them with thank you notes to the sponsoring fund. The essays and notes expressed the excitement the children experienced at the zoo and the observations they made. While we have no hard data on the point, we would be willing to assert that the experience will prove memorable for the children and will serve to increase their liking for and attachment to school. The $15.8 million New York State spends annually on high-stakes testing could support 15,800 zoo trips for underprivileged children.

Here is another idea for using the money. We could improve science, technology, engineering, mathematics (STEM) education. For an estimated 10% of what we now spend annually on testing, we could fund start-up costs for working research laboratories in 1000 high schools and staff each one with a scientist who has completed a postdoctoral fellowship. The laboratory could do actual science and be required to take in students as assistants. The scientist would be available to consult with other teachers in the school, to network with other laboratories in the system, and to partner with local universities and research institutes. We could evaluate the projects by the products students work on, by oral examinations, by science fair projects, and by the colleges and majors the students select. Such a plan would be consistent with present government emphasis on science, technology, engineering, and mathematics and with the aims of the Common Core Standards in education.

Questions for Self-Assessment

  1. Discuss high-stakes achievement testing as an intervention rather than as an assessment.
  2. Discuss the majority–minority achievement gap and what effect high-stakes testing has had on that gap nationally.
  3. How does achievement testing create what Levine and Levine describe as abuse of the educational process, children, and teachers?
  4. What did a study conducted by researchers in Indiana reveal about the relationship between high school students' scores on American College Testing (ACT) tests and the students' reading and math scores on state achievement tests? What are some possible explanations for this outcome?
  5. Discuss the suggestions Levine and Levine propose for a better use of the billions of dollars now being paid to the testing industry.

Waste, Fraud, and Abuse

High-stakes testing has created waste, fraud, and abuse. The money spent on high-stakes testing is wasted. High-stakes testing has failed at its claim of improving academic achievement test scores. Even the small increase in mathematics knowledge that can be inferred from the data in Table 2 raises questions. Did the students learn more about math or more about test taking? What part did familiarity with the tests play? If the higher scores achieved by the students indicate more learning of mathematics, why is that not reflected in the NAEP test scores (Table 3)? If there is no learning that generalizes, the high-stakes testing intervention is a waste of money and of the time and energy of teachers and students. High-stakes testing has produced fraud on a huge scale. Cheating scandals have been noted in many states. Beyond direct cheating, some administrators, coping with unreasonable state or federal demands, have manipulated the difficulty of tests, changed test cutoff scores to show increased numbers of proficient students, and been less than honest in their “spin” on unfavorable test results. One example: Although test scores rose in New Orleans with the switch to charter schools, many charters refused to take handicapped children, other than those with minor disabilities. Even so, while their results improved, they were still at the bottom of the rankings for Louisiana. Similarly, education officials and politicians touted the growth in NAEP test scores in Washington, DC, but neglected to tell us that the Washington, DC, scores are still extremely low. These efforts at spin are public relations gimmicks, which reduce the credibility of our public officials when the gimmicks are exposed, a serious loss in today's post-truth era.

High-stakes testing also creates abuse of the educational process as the curriculum narrows, especially in schools with lower test scores. It is an abuse of children to deny recess periods because recess takes time away from instruction. It is an abuse of children when the stress of high-stakes testing affects so many that the instructions for administering tests contain detailed procedures for handling test booklets on which children have vomited. It is an abuse of children to retain them in grade because of low test scores when a hundred years of research says that retention in grade without additional assistance does no good and does positive harm. It is an abuse of immigrant children to humiliate them, to tell them they are persistently failing, when they capably serve as the liaison between their non-English-speaking parents and the new culture that they now inhabit.

It is an abuse of teachers to force them to adopt methods designed to raise test scores, methods which denigrate their professionalism by treating teachers as automatons in need of close supervision and micromanagement. Test scores have little meaning when measured against a teacher's loving concern for his or her students. We know that teachers in one of our local schools serving a poor, largely immigrant neighborhood contribute money from their own pockets to buy winter boots for their students, some of whom had never seen snow in their tropical homelands. A test score has little meaning compared with the extra hours that teachers voluntarily spend working with children and their parents.

It is an abuse of teachers to threaten their jobs and careers with test scores of uncertain meaning and with a large margin of error. Despite the failures of high-stakes testing and its detrimental consequences, there is no end in sight. The new Common Core Standards with associated tests will provide a new reason to bash the public school system and to make the case for further privatization and automation of many core teaching functions. Moreover, the Race to the Top (RTTT), a federal initiative in place since 2009, will increase the pressures on educators. The Race promises extra education funding and relief to states from the more onerous requirements of NCLB's adequate yearly progress provision, but only on condition that the states modify their education laws to include methods to evaluate teachers based on student test scores. Only if such a law is passed is a state eligible to apply for the funds; there is no guarantee that the funds will be forthcoming. The evaluations must include provisions for removing teachers whose students do poorly on state-mandated achievement tests. Because high-stakes testing is now written into state laws, it will have a continuing effect.

High-stakes testing has had 10 years of trial nationwide, a veritable experiment in nature. Test scores have risen only slightly during the NCLB years, and if we look back further, the rate of increase in test scores was higher before NCLB than it is now. High-stakes testing has been ineffective in raising test scores and has not closed the majority–minority gap. It has wasted resources of time and energy as well as money, has promoted fraud, and has proven abusive. With the New York data, and the results of national testing, the burden of proof that high-stakes testing is an effective educational strategy shifts to its proponents. It is time to hold accountability by high-stakes testing accountable.

  1. David Glovin & David Evans, How Test Companies Fail Your Kids, Bloomberg Markets, Dec. 2006, 128, 135.
  2. Jonathan Messerli, Horace Mann: A Biography 443 (1972).