Abstract


House's classic Evaluating with Validity proposes three dimensions—truth, justice, and beauty—for evaluation validity. A challenge to achieving validity is balancing priorities among these three dimensions when they conflict. This chapter examines the concept of validity, the values inherent in each of these dimensions, and the choices made between them. Our analysis of these inherent values and of any prioritization among truth, justice, and beauty aims to help the evaluator confront the kinds of dilemmas faced when one's commitment to values, evaluation theories, or methodology comes up against conflicting realities for a particular evaluation. Striking an appropriate balance can be particularly challenging in contexts involving diverse cultures or even homogeneous cultures of which the evaluator is not a part. We use two case examples to explore the issues in real-life contexts.

Perhaps the most dangerous threat to the validity of an evaluation is a poor understanding of what validity is, of what that label applies to, or of what role values play in striving for validity. Validity is most often discussed in the literature as applicable to measurement and research design (Cronbach & Meehl, 1955; Shadish, Cook, & Campbell, 2001). In particular, there is a robust literature on which research designs are most, and which are least, capable of eliminating threats to validity (Campbell & Stanley, 1966; Chen, Donaldson, & Mark, 2011; Collins, Hall, & Paul, 2004; Cook & Campbell, 1979; Morgan & Winship, 2007; Murnane & Willett, 2010; Scriven, 1976; Shadish, 2011; Shadish et al., 2001). Further literature discusses which types of validity are of most serious concern (Chen et al., 2011; Cronbach, 1982; Shadish et al., 2001). In the context of evaluation, these influential discussions create a dangerous space—one in which “experiment” is so closely tied to discussions about RCTs and particular quasi-experiments, and the concept of validity is so frequently tied to those experiments, that it is all but forgotten that validity is actually a trait of arguments.

Tests and research designs are not valid in the abstract—they are not universally valid. A particular test or research design is suitable for a particular purpose and in a particular context. That is to say, a test or research design is valid insomuch as it contributes to the validity of the argument supporting the claim it is intended to support (Cook & Campbell, 1979; Shadish et al., 2001; Wainer & Braun, 1988). It is arguments that are valid or invalid, and arguments are often about more than a design or a use of a particular method. Further, arguments that have the support of privileged research designs are not necessarily better supported than arguments supported by other premises. The important issue is to address and, hopefully, eliminate threats to validity.

Proofs and formal logic aside, arguments in the real world are subject to values in a variety of ways. When we construct arguments for ourselves, there may be fewer values involved and we need not always attend to them. When an argument is constructed for an audience, values require our attention, and a balancing of values is almost always required. Epistemological or methodological values are reflected when we prioritize particular research designs. This is not necessarily problematic, but it can become a problem when the preference is blind or when the evaluator is unaware that it is indeed a preference, an expression of values. In such cases, evaluators may not be aware of relevant and suitable alternatives. For example, part of the problem with a policy that prioritizes a particular design is that the policy places accountability for validity almost entirely on research design and often privileges certain aspects of that research design over others. What about the validity of the needs assessment, of the measures being used, or of the formation of the evaluation questions? Constructing an experimental design eliminates various threats to internal validity, but this can give a false sense of confidence in the findings and in the ability of the measures used to generate findings worth having. The validity of the measures, for that context, is assumed. Or, at the very least, we are not accustomed to seeing most evaluators make a case for the appropriateness of their measures in the particular evaluative context.

Once it becomes clear that validity is about the validity of the argument supporting claims and that values are inherently intertwined in this process, the evaluator is forced to contemplate how these ideas are present and operate in an evaluative context. House's framing of evaluation validity creates a useful framework within which one can reflect on evaluation decisions before, during, and after key activities have been executed. Both Scriven (1991) and House (1980) take us beyond the narrow focus of validity as it pertains to truth claims about the program. House further points the way by providing a useful lens, or at least presenting us with the fact that there are multiple lenses available, with which we can view what remains of evaluation validity. Neither Scriven nor House tells us precisely what specific factors will be relevant, nor should they. If we take both of them seriously in terms of attending to the entire evaluation context, these factors will vary between evaluations. House has delineated broad factors, the dimensions of evaluation validity, and the two case examples included in this chapter will show us how those dimensions come into precise focus. One unavoidable result of such a holistic and comprehensive view of evaluation validity is that there will be, inevitably, instances in which the relevant values come into conflict. As you consider the cases in this chapter, and our analysis of these cases, it will be helpful to keep the following questions in mind:

  1. What are the inherent values underlying truth, beauty, and justice and choices about balancing them? What values are we accepting, rejecting, or balancing when we choose between truth, justice, and beauty?
  2. How do these dimensions come into conflict in specific evaluation settings and what causes them to do so?
  3. How can an evaluator go about appropriately balancing these dimensions to produce valid evaluations? Must the attention to, and weighting of, the dimensions occur only in the final synthesis?

In many ways, the cases in this chapter represent opposite ends of a continuum of contexts in which evaluation practice occurs. Unlike the first case, the second case for analysis focuses on a highly publicized evaluation that ended with much controversy. Unsurprisingly, the evaluation team and other interested parties have written extensively about the case. Additionally, the stakes associated with the evaluation outcomes were much higher in the second case. Lastly, the overall scope of the two cases differs. Yet both can be used to analyze the central questions we explore in this chapter as a means to better understand the ways in which truth, beauty, and justice operate across evaluation contexts.

Truth, Beauty, and Justice in a Federally Funded Magnet School Evaluation


The current theory of change behind the Magnet Schools Assistance Program (MSAP) is that offering a specialized curriculum attracts students from different social, economic, ethnic, and racial backgrounds.1 In doing so, magnets reduce minority group isolation while simultaneously improving academic outcomes for all students.

As a means to test this theory of change, and particularly in the context of the external evaluation that is the subject of this case, the final evaluation design approved by both the school district and the federal MSAP program was a theory-driven evaluation (Donaldson, 2007) employing a quasi-experimental design. A quasi-experimental design was chosen because of limits imposed upon the external evaluation by the adoption of specific language around “scientifically based evaluation methods.” By requiring that the evaluation be consistent with this language, the federal funding agency could ensure that (a) its views about the types of questions that evaluations should answer were prominent and (b) its beliefs about how best to answer those questions were promulgated. Its interests ultimately required that the evaluation team answer the question of whether magnet school attendance led to higher learning gains, as measured by state-developed standardized exams, for students attending magnet programs compared to those not attending magnet programs. It is also important to note that a quasi-experimental design was chosen over an experimental design because of limits imposed on the external evaluation team by the local program. While the local program staff were interested in causal, outcome-oriented questions, they were not interested in them at the expense of denying a potentially successful intervention to students through a random assignment process. This was especially so because the magnet school theory of change states that magnet schools seek to reduce minority group isolation. For the local program staff, the moral imperative of achieving this far outweighed the benefits of an experimental design.

Further, a theory-driven evaluation design was chosen to meet the related but divergent needs of the local program staff. While they were equally interested in establishing a causal link between magnet school attendance and improved student academic outcomes as measured by state-developed standardized exams, their interest went beyond the simple “does it work?” question. For example, does the magnet program work equally well for all student subgroups? To what extent does school climate appear to influence the degree of effectiveness observed? Their interests ultimately came down to ongoing, formative feedback to inform decision making regarding the schools. Thus, before any evaluation work was conducted, the external evaluation team collaboratively developed a logic model with program staff that depicted the theory of change and hypothesized facilitating and hindering mechanisms.

To meet the demands of “rigor” placed upon the external evaluation team by the federal funding agency, the evaluation staff then delved into the magnet school and school improvement literature to (a) better understand the extant knowledge base regarding the identified facilitating and hindering mechanisms and (b) identify potentially useful surveys and survey items for measuring these domains. The evaluation team chose to modify existing survey instruments to measure potential facilitating and hindering mechanisms within the magnet school context. These survey instruments were pilot-tested, and factor analysis was used to establish the reliability of the instruments. This process was not only consistent with good survey design but also, as a secondary consideration, intended to meet the “rigor” demands placed upon the evaluation team by the federal funding agency and thereby increase the likelihood that the results would be perceived as credible. Further, survey administration followed the quasi-experimental design used to assess student outcomes. That is, instruments were administered to students and teachers at the treatment (magnet) and comparison (nonmagnet) schools. Comparison schools were identified using propensity score matching techniques (Parsons, 2001; Rubin, 1997) to comply with the limits imposed on the external evaluation team by the federal funding agency.
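To make the comparison-school matching step concrete, the following is a minimal, hypothetical sketch of greedy 1:1 propensity score matching at the school level, written in Python with pandas and scikit-learn. The covariate names (pct_minority, pct_frl, prior_score) and the data are illustrative assumptions, not the actual variables, software, or matching specification used in the MSAP evaluation.

```python
# Hypothetical sketch: greedy 1:1 propensity score matching of comparison
# (nonmagnet) schools to treatment (magnet) schools. All data are invented.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One row per school: a treatment flag plus covariates used for matching.
schools = pd.DataFrame({
    "magnet":       [1, 1, 0, 0, 0, 0],   # 1 = magnet (treatment), 0 = candidate comparison
    "pct_minority": [0.62, 0.55, 0.60, 0.30, 0.58, 0.45],
    "pct_frl":      [0.70, 0.66, 0.68, 0.40, 0.71, 0.52],   # free/reduced-price lunch
    "prior_score":  [243.0, 250.0, 244.0, 261.0, 246.0, 255.0],
})
covariates = ["pct_minority", "pct_frl", "prior_score"]

# Step 1: estimate each school's propensity score, i.e., the predicted
# probability of being a magnet school given its observed covariates.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(schools[covariates], schools["magnet"])
schools["pscore"] = model.predict_proba(schools[covariates])[:, 1]

# Step 2: greedy 1:1 matching without replacement; each magnet school is
# paired with the unmatched nonmagnet school whose propensity score is closest.
treated = schools[schools["magnet"] == 1]
pool = schools[schools["magnet"] == 0].copy()
matches = {}
for idx, row in treated.iterrows():
    best = (pool["pscore"] - row["pscore"]).abs().idxmin()
    matches[idx] = best
    pool = pool.drop(best)

print(matches)  # maps each magnet school's row index to its matched comparison school
```

Whatever the exact procedure used in the evaluation, the logic is the same: model selection into treatment from observed covariates, pair treatment and comparison units on that score, and then check covariate balance for the matched pairs before comparing outcomes.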

The evaluation team had to develop a reporting system that would present the truth as uncovered by the evaluation in a way that would be perceived as credible to multiple stakeholder groups; thus, a two-phase reporting system was developed. As a first step, the evaluators produced formal reports written in nonacademic language, reserving a more formal and traditional scientific style for the appendices containing technical information. The intended audience for these reports was primarily the MSAP grant officer and others in the US Department of Education, and secondarily others interested in a more thorough discussion of results and associated technical details (e.g., researchers and local staff within the district who work daily with accountability data). In addition, findings briefs were created and modeled after state-developed annual testing briefs. These results were disseminated to program staff (including administrators and teachers), students, parents, and board of education members. The evaluation team hypothesized that by modeling the briefs after an existing accountability reporting system within the state, results would be easily digested and therefore more likely to be used. Further, this approach could capitalize on the perceived credibility of the existing accountability system, further increasing the likelihood of interest in, and use of, findings.

A further important consideration that influenced how the evaluation was conducted and how reports were created and disseminated is that the school district originally hired a different evaluation firm to conduct an external evaluation employing quasi-experimental methods. In the year that followed, the external evaluation contract was terminated for a variety of reasons. For example, that evaluator neglected to involve the local program staff in any portion of the evaluation, including providing progress reports or sharing formative evaluation data. The first official correspondence concerning the project's progress was a Year-1 summative report that was sent simultaneously to the local program staff and the federal MSAP grant officer. The local program staff read the report and immediately deemed it not credible. While the report was coherent in the sense that it was organized by evaluation question, further inspection of data reported in tables and figures was cause for concern. For example, the external evaluator reported results for students in only one grade level. According to the report, this was due to the unavailability of data. This was incongruent with the perspective of the local staff because data on students from multiple grade levels across the district had been sent to the external evaluator. In short, the credibility of the evaluator, and subsequently of the findings he or she produced, was called into question. Because the local program staff was very interested in findings that were deemed truthful and credible, they negotiated with their MSAP grant officer to gain authorization to hire a new evaluation firm. While no concrete evidence exists, one can infer from the MSAP grant officer's approval of this modification that the grant officer had similar reservations about the ability of the evaluation to uncover the truth and, as a result, produce credible findings.

The particulars of the context into which the new evaluation team entered also had implications for the proposed evaluation design. In particular, they highlighted that the credibility of evaluation findings would be judged not only on whether the evaluation was uncovering the truth in this particular context but also on the role that stakeholder involvement and program staff perceptions played in the perceived credibility of evaluation findings. Specifically, the program staff was sensitive to, and intent on, ensuring that stakeholders were involved in the evaluation process, that findings were grounded in truthful and credible data, and that the team carrying out the external evaluation was not necessarily invested in the success of the program but was invested in finding answers to evaluation questions. It was not enough for the evaluation to produce truth. Truth had to be produced while being sensitive to local needs and ideas about how evaluation ought to be conducted.

The 1990 National Assessment Governing Board: National Assessment of Educational Progress Standard Setting Evaluation


The National Assessment Governing Board (NAGB) was created in 1988 with the passage of the bill containing the Augustus F. Hawkins–Robert T. Stafford Elementary and Secondary School Improvement Amendments (P.L. 100–297).2 One of the main responsibilities of the NAGB was to develop and implement a standard-setting process for results from the National Assessment of Educational Progress (NAEP; Vinovskis, 1998). The 1988 School Improvement Amendments also included a mandate requiring an external evaluation of the NAGB standard-setting process. When the NAGB contracted Stufflebeam, Jaeger, and Scriven to conduct the external evaluation shortly after the newly formed standards were released, criticism had already emerged regarding the credibility of the process the NAGB used to set the standards and the proficiency levels based on those standards (Stufflebeam, 2000; Vinovskis, 1998).

Stufflebeam and his colleagues sought to conduct an evaluation that provided both formative information and an overall summative judgment concerning the reliability and validity of the NAGB process for categorizing students as “below basic,” “basic,” “proficient,” or “advanced” based on their NAEP scores. The evaluation team issued three formative reports, which the NAGB used to modify its process. These reports were delivered to NAGB staff and not released for public consumption. In the first report, the evaluators noted significant flaws in the standard-setting process. In the second and third reports, they reported that, despite the NAGB's sincere and constructive attempts to address the concerns and improve the process, significant flaws persisted (Stufflebeam, 2000; Vinovskis, 1998).

Given that the three formative evaluations reported significant flaws and the second and third evaluations reported that attempts to fix the problems had failed, it should come as no surprise that the draft summative report indicated that significant flaws in the standard-setting process remained and that the “resulting standards, which are due to be released in spite of the project's technical failures, must be used only with extreme caution” (Stufflebeam, Jaeger, & Scriven, 1991, as cited in Vinovskis, 1998). The draft was sent out for prerelease review to a total of 45 individuals representing various stakeholder groups (e.g., NAGB representatives, measurement experts, and policy makers), more than half of whom were measurement and research methodology experts. This was done to ensure the report was “sound, clear, useful, and up to the standards Congress would expect” (Stufflebeam, 2000, p. 299). The draft summative report was intended to remain confidential, and those receiving a copy were instructed not to distribute the report because the sole purpose of the prerelease review was to provide feedback to the evaluation team.

Upon receiving the draft summative report and notice that the draft had been disseminated to a prerelease review panel, the NAGB immediately fired Stufflebeam, Jaeger, and Scriven. Further, according to Stufflebeam, the NAGB “sent an unsigned, vitriolic attack on [the] draft report to all members of [the] prerelease review group” (NAGB, 1991; Stufflebeam, 2000, p. 299). Despite being fired, the evaluation team took the recommendations of the non-NAGB prerelease reviewers, incorporated them into the draft, and sent the final summative evaluation report to the NAGB. Why did the evaluation process end on such a bad note? What objections did the NAGB have to the draft summative report?

It seems unlikely that there were serious technical flaws in the draft summative report. That report apparently contained much of what had already been reported in the formative reports without drawing objections or criticism from NAGB personnel. In addition, with a few notable exceptions (e.g., Cizek, 1993; Kane, 1993), the many external summative evaluations that ensued after Stufflebeam and colleagues issued their report supported their claims (Linn, Koretz, Baker, & Burstein, 1991; Shepard, 1993; U.S. General Accounting Office, 1993). So what else explains the acrimonious ending to the evaluative relationship? What happened between the submission of the third formative report and the first draft of the summative report?

Analysis


According to House, it is not uncommon for an evaluator to focus intensely on one dimension of validity to the exclusion of the other two dimensions. He observed that evaluations entrenched in an “objectivist” approach focus “exclusively on the truth aspect of validity,” and are thus “often not credible to those evaluated and are sometimes undemocratic, unfair, or otherwise normatively incorrect” (House, 1980, p. 250). We see a clear example of this in the MSAP case when the original evaluator's intense focus on truth prevented him from considering other dimensions of validity, resulting in invalid evaluation conclusions. He was not cognizant of the need to balance the dimensions. The resulting negative evaluation was seen as invalid not only by the program but also by the program officer (the client), whom the program staff convinced to replace the evaluator. It is not difficult to see why. Based on the description of the original evaluator's approach to the evaluation and his relative indifference toward relevant stakeholder concerns, we can see that the evaluator would have had a very difficult time producing a valid evaluative argument.

The original evaluator failed to move beyond truth. Intentionally or not, he missed opportunities to learn about his audience and identify common ground that could be used as a foundation for an evaluative argument that might have been accepted by them. Further, the original evaluator did not even communicate with, or identify, the relevant stakeholders, let alone look for ways to give them a voice in the process. Their reaction shows that this deficiency prevented their interests from being represented in the evaluation report to their program officer.

What House does not emphasize, but seems likely, is that such an intense focus on one dimension of validity is a threat to that dimension itself. By ignoring the interests and excluding the voices of the variety of stakeholders, that is, by ignoring beauty and justice, one is likely also to miss at least part of the truth. For example, the original evaluator in the MSAP case not only failed to move beyond truth, he even failed on the truth dimension itself, by failing to consider other perspectives or other factors that might influence what was perceived as the truth. Another possibility is not attending to one dimension sufficiently. As we see in the NAGB case, this need not be wholesale neglect and may be the result of factors beyond the evaluator's control. In such cases, the validity of the evaluation, or its degree of validity, may be less clear.

In the NAGB case, privileging truth to the neglect of the other dimensions is not the issue. It is unlikely that the justice dimension was neglected at all. In fact, the potential challenges to validity in the NAGB case may have nothing to do with imbalance per se, at least not initially. From the beginning, the evaluation team appears to have given fair opportunity for all stakeholder voices to be heard, while conducting the evaluation in such a way that the interests of the most vulnerable stakeholders were well represented. Each of the three formative reports was submitted to scrutiny in public forums, allowing adequate opportunity for feedback concerning any difficulties in the clarity or technical quality of the evaluation leading up to those points. In each of these three reports, the evaluation team raised serious concerns about the standard-setting process with no objections, rebuttals, or qualifications from NAGB staff.

But something important changed between the delivery of those three reports and the delivery of the draft summative report. In Stufflebeam's (2000) recounting, after the third formative report was delivered, NAGB staff asked the evaluation team not to finish the summative report, stating that the “assignment was complete” (p. 298). About a month later, NAGB staff again contacted Stufflebeam, asking that his team submit a summative evaluation, explaining that “state-level critics had protested that NAGB had not fulfilled the congressional requirement for an external summative evaluation of the achievement levels project” (p. 289) and that the report was needed within the next month. Stufflebeam and his colleagues began work immediately and, at the urging of the NAGB, did so even without obtaining a formal contract. The context thus shifted from a reasonably paced, contractually binding formative evaluation, in which information is provided primarily to program staff for the purpose of improvement, to a rushed, contract-free summative evaluation, in which information is provided primarily to the group to whom program staff are accountable, for the purpose of final judgment.

In the rushed shift between contexts, the evaluation team may have overlooked important aspects of both truth and beauty. Despite the apparent technical quality of the evaluation, the team left themselves vulnerable to potentially legitimate criticism on the truth dimension of validity. In science and evaluation, there is more to truth than technical merit.

Scientific statements can never be certain; they can be only more or less credible. And credibility is a term in individual psychology, i.e., a term that has meaning only with respect to an individual observer. To say that some proposition is credible is, after all, to say that it is believed by an agent who is free not to believe it, that is, by an observer who, after exercising judgment and (possibly) intuition, chooses to accept the proposition as worthy of his believing it. (Weizenbaum, 1976 as cited in House, 1980, p. 71, emphasis ours)

As House explains, a major component of truth in evaluation is the negotiated agreement on beginning, or foundational, premises. He writes, “The development of an evaluation argument presupposes agreement on the part of the audiences. The premises of the argument are the beginning of this agreement and the point from which larger agreement is built” (1980, p. 76). According to House, agreements “derived from the negotiation that often precede the evaluation—agreements between sponsors, program personnel, and evaluators” are potentially the most important agreements for a particular evaluation (p. 78, emphasis ours). House stresses the importance of this negotiation and notes that elements subject to negotiation, and hence elements for which agreement is crucial, include criteria, methods, and procedures. “Disagreement on these points can destroy the entire credibility of the evaluation” (p. 78). As he summarizes, “the evaluator must start from where his audiences are, even though the beginning premises may not be acceptable to other parties nor to the evaluator himself. Otherwise the evaluation will not be credible or persuasive” (pp. 78–79). If there is no agreement on beginning principles, there is likely to be disagreement about the truth of resulting conclusions. Indeed, as we will emphasize in what follows, establishing not only agreement, but a record of agreement, can be critical for resolving potential disagreement about answers to evaluation questions.

Values and Validity in the Context of Evaluation Practice


So, what does it look like when the evaluator appropriately balances these three dimensions in an evaluation? The second evaluator from the MSAP case provides an illustration. She was able to attend to the truth aspect of validity in the design she chose—that is, a quasi-experimental design to answer causal-type evaluation questions. But by collaboratively developing a logic model that was used to advocate for the need to go beyond the simple “what works” question, in addition to strengthening the truth dimension, she was able to (a) address the justice dimension by including voices in the evaluation beyond that of the federal funding agency and (b) address the beauty dimension by creating a visual framework that was deemed credible by all stakeholders. This also provides evidence that the weighing of these dimensions can occur at any point in the process, not just in the final synthesis.

Further, as can be gathered from the NAGB case, balancing and negotiating these dimensions must begin at the start of an evaluation and continue throughout. Stufflebeam states that his experience with the NAGB taught him the lesson that it is unwise to work without a contract. We suggest, in harmony with Stufflebeam, that establishing a jointly endorsed record of the negotiation is almost as important as the negotiation itself. This is because failing or neglecting to establish explicit, documented agreement on the basic premises and other foundational aspects of the evaluation can result in misunderstandings and differing recollections of the points of agreement. This, in turn, affects not only the truth dimension of validity but also the beauty dimension. Different views of the context result in different prescriptions for action and hence different views about the criteria for evaluating the success of any undertaking. It is possible that the NAGB had a different view of the standard-setting process that resulted in disagreement about the deficiencies pointed out by the evaluation team. Perhaps they agreed that there were deficiencies but disagreed about the relative importance of the elements in which deficiencies were found. On such a view, the deficiencies might be seen as something to continue to refine, but of sufficiently minimal importance that the standards produced were both acceptable and the best option available. Or the NAGB may have agreed both with the assessment of the deficiencies and with their importance, but disagreed about what would constitute a solution to the problem. They could have legitimately believed they had addressed the source of the deficiency, and seen the evaluation team's criticisms in the draft summative report as irrelevant to what they had understood to be the problem. Either of these scenarios would present a legitimate disagreement with the team's evaluative conclusions, presenting a challenge to the validity of the evaluative argument.

So, how can an evaluator go about appropriately balancing these dimensions to produce valid evaluations? House provides us with an answer when he says, “the validity of an evaluation depends upon whether the evaluation is true, credible, and normatively correct” (House, 1980, p. 255). Validity requires all three. Striking the right balance requires careful and consistent consideration of each dimension before the evaluation commences, during the evaluation, and even after the evaluation has ended. The evaluator must be careful about giving priority to any one of the three. It is perhaps more productive to use this system as a set of checks and balances. In doing so, one is forced to ask oneself, “What values am I privileging?” “What influence will privileging those values have on the evaluation?” and “How can I attend to these issues in a balanced manner?” Giving more than a little priority to any one dimension may undermine all three. For example, too much priority on social justice is likely to blind the evaluator to some aspects of the truth. Decisions based on those errors are likely to cause harm to someone, potentially even the group the evaluator set out to protect. Too much emphasis on trying to please everyone in terms of acceptable questions and methodology is likely to water down the evaluation and create a sea of potentially conflicting findings in which important evaluative information could get lost. Too little attention to the framing of the evaluation—or to stakeholder and evaluator fundamental views of the evidence, the nature and purpose of the evaluation, the criteria, or the relevant evaluation questions—can lead to controversy over the legitimacy of evaluation findings and to little or no use of those findings.

Turning back to a point we made earlier in the chapter, what should an evaluator do when forced to choose between dimensions? As was mentioned earlier, House himself notes that it is not uncommon for an evaluator to focus intensely on one dimension to the exclusion of the other two. He further argues, “in those concrete instances in which truth and beauty conflict, truth is more important than beauty. And justice more important than either” (p. 117). So for House, when forced to privilege one dimension over the others, and knowing that giving priority to one could undermine the entire process, he chooses justice first, then truth, and beauty last. Thus, he puts fairness and social justice above all else. While we agree with much of House's writing, it is here that we depart from him. In our view, abandoning truth for justice risks losing both. Truth must be given priority, but it must be a truth that is both scientifically and culturally humble. This is a truth that takes careful note of the subtle intersection of truth and beauty where they converge around credibility. For something to be credible, it must be both meaningful and coherent to its audience. If one scrupulously attends to this quality of truth, then justice is present in that the evaluation genuinely examines the context from different perspectives, heeding the voices of all relevant stakeholders; it can thus be judged to be normatively correct. We believe that one cannot adequately attend to the truth dimension without acknowledging the intersection of truth and beauty around credibility, and thus also attending to the justice dimension. The dimensions are intertwined and demand appropriate balance. We agree with House that social justice is truly of utmost importance, but without truth one cannot say so. Of course, without beauty few will understand what is said.

Notes
  1. The Magnet Schools Assistance Program (MSAP) is currently authorized under Title V, Part A, of the Elementary and Secondary Education Act, as amended in 1994, and is administered by the Office of Innovation and Improvement. According to the U.S. Department of Education, the MSAP “provides grants to eligible local educational agencies to establish and operate magnet schools that are operated under a court-ordered or federally approved voluntary desegregation plan” (retrieved from http://www2.ed.gov/programs/magnet/index.html on September 25, 2011). A common identifier of magnet schools is their intended structure around a distinctive educational curriculum or theme (Ballou, 2009). Examples of magnet themes include STEM, language immersion, visual and performing arts, and international baccalaureate.

  2. Readers wishing to know more about the NAGB evaluation than is provided here should consult the extensive documentation of this process, and in particular, this evaluation (cf. Stufflebeam, 2000; U.S. General Accounting Office, 1993; Vinovskis, 1998).

References

  • Ballou, D. (2009). Magnet school outcomes. In M. Berends & M. G. Springer (Eds.), Handbook of research on school choice (pp. 409–426). New York, NY: Routledge.
  • Cizek, G. C. (1993, August). Reactions to National Academy of Education report, “Setting performance standards for student achievement.” Unpublished manuscript. Retrieved from http://www.eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=ED360397
  • Campbell, D., & Stanley, J. (1966). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.
  • Chen, H. T., Donaldson, S. I., & Mark, M. M. (Eds.). (2011). New Directions for Evaluation: No. 130. Advancing validity in outcome evaluation: Theory and practice. San Francisco, CA: Jossey-Bass.
  • Collins, J., Hall, N., & Paul, L. (Eds.). (2004). Causation and counterfactuals. Cambridge: MIT Press.
  • Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago, IL: Rand-McNally.
  • Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass.
  • Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957
  • Donaldson, S. I. (2007). Program theory-driven evaluation science: Strategies and applications. New York, NY: Routledge.
  • House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage.
  • Kane, M. (1993, November). Comments on the NAE evaluation of the NAGB achievement levels. Unpublished manuscript. Retrieved from http://www.eric.ed.gov/contentdelivery/servlet/ERICServlet?accno=ED360398
  • Linn, R., Koretz, D., Baker, E., & Burstein, L. (1991). The validity and credibility of the achievement levels for the 1990 National Assessment of Educational Progress in Mathematics (Center for the Study of Evaluation Report No. 330). Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing, University of California at Los Angeles.
  • Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge, UK: Cambridge University Press.
  • Murnane, R. J., & Willett, J. B. (2010). Methods matter: Improving causal inference in educational and social science research. New York, NY: Oxford University Press.
  • National Assessment Governing Board (NAGB). (1991, August 14). Response to the draft summative evaluation report on the National Assessment Governing Board's inaugural effort to set achievement levels on the National Assessment of Educational Progress. Washington, DC: Author.
  • Parsons, L. S. (2001, April). Reducing bias in a propensity score matched-pair sample using Greedy matching techniques. Paper presented at the Annual SAS Users Group International Conference, Long Beach, CA.
  • Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757–763. doi:10.7326/0003-4819-127-8_Part_2-199710151-00064
  • Scriven, M. (1976). Maximizing the power of causal investigations: The modus operandi method. In G. V. Glass (Ed.), Evaluation studies review annual (pp. 101–118). Beverly Hills, CA: Sage.
  • Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.
  • Shadish, W. R. (2011). The truth about validity. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), New Directions for Evaluation: No. 130. Advancing validity in outcome evaluation: Theory and practice (pp. 107–117). San Francisco, CA: Jossey-Bass. doi:10.1002/ev.369
  • Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
  • Shepard, L. A., Glaser, R., Linn, R. L., & Bohrnstedt, G. (1993). Setting performance standards for student achievement (final report). Stanford, CA: National Academy of Education.
  • Stufflebeam, D. L. (2000). Lessons in contracting for evaluations. American Journal of Evaluation, 21, 293–314. doi:10.1177/109821400002100302
  • Stufflebeam, D. L., Jaeger, R. M., & Scriven, M. (1991, August). Summative evaluation of the National Assessment Governing Board's inaugural effort to set achievement levels of the National Assessment of Educational Progress. Draft report submitted to NAGB on August 1, 1991.
  • U.S. General Accounting Office. (1993). Educational achievement standards: NAGB's approach yields misleading interpretations. Washington, DC: United States General Accounting Office.
  • Vinovskis, M. (1998). Overseeing the nation's report card: The creation and evolution of the National Assessment Governing Board (NAGB). Washington, DC: National Assessment Governing Board.

Biographies

  • James C. Griffith is a doctoral candidate for a dual degree in philosophy and psychology at Claremont Graduate University and a lead evaluator at the Claremont Evaluation Center.

  • Bianca Montrosse-Moorhead is an assistant professor in the Measurement, Evaluation and Assessment program, a research scientist for the Collaborative on Strategic Education Reform (CSER), and coordinator of the Graduate Certificate Program in Program Evaluation at the University of Connecticut.