From a review of the related literature on evaluating the performance of school psychologists in the field and during their graduate preparation (Prus & Waldron, 2008), four key principles emerge as critical to a credible performance evaluation system: (a) the use of multiple measures, including at least one measure of impact on student outcomes; (b) reliability and validity, with validity anchored to the 2010 NASP Practice Model; (c) utility for distinguishing different levels of proficiency; and (d) linkage to professional development and improvement (see Table 1). These four key principles align well with the policy guidelines set by the National Alliance of Pupil Services Organizations (NAPSO) to assist states in considering how best to apply student achievement outcomes to educator evaluation systems (www.napso.org). NAPSO's (2011) recommendations include a central role for school psychologists and other support personnel in the creation of the evaluation systems designed to determine their competence. These evaluation systems should be research-based and consider professional preparation and practice models supported by the national organizations responsible for advancing research and practice for each distinct profession. NAPSO (2011) also advocates for the use of evaluators with expertise in the roles, responsibilities, and job functions specific to the position they are evaluating, so that they understand the unique practices and foundational knowledge of the profession and the specific demands, needs, and requirements of each position. Appropriately credentialed evaluators are also critical for providing meaningful feedback. Finally, evaluation systems must use multiple measures in evaluating professional performance (NAPSO, 2011).
Performance Appraisal Rubrics
The use of multiple measures in a performance evaluation system is essential as a means of capitalizing on the advantages and minimizing the disadvantages of each individual, episodic, or isolated method (Prus & Waldron, 2008). Traditionally, however, performance appraisal rubrics and rating scales have served almost exclusively as the sole measure used to evaluate the performance of a school psychologist. Typically, performance appraisal rubrics are adapted from instruments used to evaluate teachers (e.g., Marzano Model, Charlotte Danielson's Framework for Teaching) or administrators, depending on whether school psychologists are under a teacher or administrative contract. Two examples of performance standards based on Charlotte Danielson's Framework for Teaching include the standards developed by the Cincinnati Public Schools (2009; see Table 1) and the framework developed by the Delaware Department of Education (2012). An example of a performance appraisal rubric based in part on the NASP Standards can be obtained from the Indiana Association of School Psychologists website (www.iasponline.org).
Far too often, a performance appraisal rubric or rating scale is the sole measure used to evaluate school psychologists’ professional competencies, and observations are conducted in one setting of one professional activity (e.g., leading a meeting among teachers and parents). Although many school districts have developed, or are in the process of developing, performance appraisal rubrics or rating scales, the evaluator is not necessarily someone with professional knowledge and a background in school psychology (e.g., a building principal, assistant principal, or district administrator without school psychology credentials or affiliations).
The advantages of performance appraisal rubrics and rating scales are that they can be aligned with professional training standards; they provide a direct measure of skills and behaviors in the settings in which those skills and behaviors are expected to be performed; they are generally accepted, and expected, in school contexts; and they offer a sense of fairness in that school psychologists are evaluated using the same type of measure used to evaluate teachers or administrators (Prus & Waldron, 2008).
Performance appraisal rubrics and rating scales also have several disadvantages identified by Prus and Waldron (2008) that may place limits on their reliability, validity, and utility for use as a measure within a comprehensive performance evaluation system. First, ratings can be quite subjective, especially if provided by a single evaluator. This is a particular concern in situations in which the single evaluator does not have the background knowledge and expertise in school psychology needed to evaluate the more complex aspects of professional practice (e.g., data-based decision making, assessment). In these instances, the non-school psychologist evaluator frequently bases the appraisal on professional competencies that are on public display (e.g., conducting meetings with parents and teachers), severely limiting the comprehensiveness of the performance evaluation. Reliability and validity are sacrificed to the degree that the evaluator is not able to discern competent practice or is disinclined to report less than competent practice (Yariv, 2006). Likewise, the utility of a performance appraisal rubric is reduced if the evaluator is unable to distinguish different levels of proficiency.
A second and related limitation of performance appraisal rubrics is that actual observations of some situations (e.g., counseling, conflict resolution) may be difficult due to concerns about confidentiality, the potential impact of observers on clients, or the low frequency in which the circumstances requiring these skills may occur (Prus & Waldron, 2008). Similarly, the results for school psychologists may vary as a function of the school setting (e.g., expectations for practice) and rater (e.g., building administrator, supervising school psychologist). Individuals responsible for conducting evaluations of school psychologists point out that the time and effort required to complete performance appraisal rubrics can be overwhelming, particularly if many competencies are to be assessed (Prus & Waldron, 2008).
Performance appraisal rubrics may be incorporated as one measure in a comprehensive performance evaluation system comprising multiple measures if school districts follow the guidelines put forth by Prus and Waldron (2008). Rubrics and rating scales must have specific, operational criteria for observing and appraising performance. Additionally, rigorous training in the use of the measure must be provided to all evaluators. Specific operational criteria and rigorous training are critically important for all evaluators, and particularly for non-school psychologists who may serve as evaluators of school psychologists. Prus and Waldron (2008) recommend that each school psychologist be rated by more than one source (e.g., building administrator, supervising school psychologist) and that the performance of a school psychologist be assessed in multiple situations and settings over time.
Case Studies (Single-Case Designs)
Single-case designs are widely considered to be one of the best methods for evaluating intervention effectiveness and linking practitioner efforts to student growth over time. Although school psychologists are not typically involved in the direct implementation of academic and behavioral interventions, they do play an essential role in collaborative problem solving with individual teachers as a member of a problem-solving team. Assessing student outcomes in response to increasingly intensive interventions in the context of a multitiered system of support is an outcomes-based approach to evaluating a school psychologist's consultation skills and knowledge of evidence-based academic and behavioral interventions. As school psychology practitioners, we are required by NASP Standards to link our professional practices to direct, measurable outcomes, regardless of whether the practices involve direct services (e.g., behavior contingency contracts, academic tutoring, counseling) or indirect services (e.g., consultation).
The basic AB (case study) single-case design can be highly effective in documenting a student's baseline level of performance as well as academic and/or behavior changes over the course of an intervention (Bloom, Fischer, & Orme, 2005). The essential steps for gathering case study data involve (a) selecting an outcome measure, (b) collecting baseline data, (c) implementing an intervention, and (d) collecting ongoing data (Steege, Brown-Chidsey, & Mace, 2002). Baseline data should be collected for a duration sufficient to document that the behavior is stable. Case study data need to be visually displayed to discern whether there have been changes in trend, level, or variability. Other standard methods of outcome determination can be used based on data from AB designs, such as goal attainment scaling (GAS), percentage of nonoverlapping data (PND), and effect size (ES).
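As an illustration, the change in level between the two phases of an AB design can be computed directly from the phase data. The following minimal sketch is ours, not drawn from the cited sources; the data values and the function name are hypothetical:

```python
from statistics import mean

def level_change(baseline, intervention):
    """Change in level: mean of the intervention phase minus mean of the baseline phase."""
    return mean(intervention) - mean(baseline)

# Hypothetical weekly counts of a target behavior (A = baseline, B = intervention)
baseline = [12, 14, 13, 15, 14]     # stable baseline prior to intervening
intervention = [11, 9, 8, 6, 5, 4]  # data collected during the intervention

print(level_change(baseline, intervention))  # a negative value indicates the behavior decreased
```

In practice this level comparison would accompany, not replace, visual analysis of the graphed data for changes in trend and variability.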
Although case study designs are not adequate to establish internal validity definitively, as is the case with more rigorous single-case designs (e.g., ABA or ABAB), Kazdin (1981) has argued that the use of specific methodologies can maximize the extent to which valid inferences can be drawn from case studies, enabling case study designs (also known as accountability designs) to play an important role in the overall framework of evidence-based practices. Specific methodologies that increase the strength of case study designs to serve accountability purposes include (a) the use of direct observations of operationally defined student behaviors to yield objective data whose reliability and validity can be assessed, in contrast to anecdotal information; (b) multiple assessment occasions prior to and during the implementation of an intervention, in contrast to a single assessment before and after the intervention; and (c) repeated measures of a student's target behavior(s) to establish the range of preintervention and postintervention variability in the student's performance (Kazdin, 1981). Given these methodological features, case study designs can provide reasonable evidence that the intervention services being provided by a school psychologist are producing the desired results (Brown-Chidsey, Steege, & Mace, 2008).
Drawing a distinction between accountability for research and accountability for practice may help clarify the role of case study designs. In research, rigorous experimental designs are required to establish the internal validity of a novel intervention approach if the new intervention is to be disseminated as evidence based (Brown-Chidsey et al., 2008). By contrast, school psychologists’ practice involves the delivery of well-established, research-based intervention approaches and the documentation of their effectiveness. Indeed, federal, state, and agency regulations require the documentation of intervention effectiveness, and school psychologists have an ethical responsibility to do so (Polaha & Allen, 1999; Steege et al., 2002). A parallel can be drawn to primary care physicians, who are expected to show that their recommended treatments had the desired effects over time for a variety of concerns but are not obligated to conduct double-blind randomized controlled trials with patients as part of their routine practice to demonstrate accountability. Thus, case study designs can and should be used as part of a comprehensive approach to demonstrating accountability in practice. The strength of the evidence is further enhanced when case study designs are incorporated into a school psychologist's routine practice and multiple replications are demonstrated over time (Steege et al., 2002).
The use of case study designs for performance evaluation shares many of the advantages and disadvantages of portfolio assessments used in graduate preparation programs. The advantages include the ability to represent multiple samples of work over time, thus reflecting a practitioner's knowledge and skill development across settings while avoiding the problems inherent with one-shot measurement occasions (Prus & Waldron, 2008). Case study designs can be used to measure the effectiveness of interventions targeting individuals or small groups, or at a classwide or systems level (Polaha & Allen, 1999; Steege et al., 2002). Case studies further allow for flexibility in assessing a variety of professional competencies (e.g., data-based decision making, problem solving, consultation, academic and behavioral intervention design, communication skills) in the natural context in which the school psychologist works, thus enabling low-inference evaluative judgments to be made regarding the practitioner's performance (Prus & Waldron, 2008). Finally, case study designs increase school psychologists’ participation in the performance evaluation process.
The primary disadvantage of the use of case study designs is that, as a descriptive approach (also referred to as pre-experimental), AB single-case designs do not completely address all plausible rival hypotheses, nor do they control for threats to internal validity (Cook & Campbell, 1979). Consequently, the school psychologist is unable to conclude with any confidence that changes in student performance were the direct result of the intervention (Brown-Chidsey et al., 2008).
A second limitation of case study designs for performance evaluation is that cases depend largely on the opportunities school psychologists have in the settings in which they work, which may vary by unique role and context variables (Prus & Waldron, 2008). Given that potentially high-stakes performance evaluation decisions may be based on case study demonstrations, the extent to which samples represent the school psychologist's independent ability rather than the product of other collaborators may be a concern (Prus & Waldron, 2008). A final limitation is the recognition that collecting, analyzing, and aggregating case study data may involve knowledge and skills not previously mastered by the school psychologist.
Case study designs may be incorporated as a measure in a comprehensive performance evaluation system comprising multiple measures if school districts adhere to the following guidelines. First, the case study approach needs to have clear, published expectations for content and the evaluation criteria, including exemplars (Prus & Waldron, 2008). The case study process developed as part of NASP's National School Psychology Certification System for candidates from non–NASP-approved programs includes a rubric for evaluating the quality of case studies (NASP, 2010b). It is recommended that each case study or collection of case studies be rated by more than one trained professional, that inter-rater reliability be monitored, and that recalibration be completed periodically, as needed (Prus & Waldron, 2008). Practically speaking, submitted work should be limited to a volume that can be thoroughly and effectively evaluated by raters (Prus & Waldron, 2008). A cost-effective approach may involve submitting case studies electronically and having evaluators review them over the summer months. Given that submitted case studies involve actual cases, it will be critical that school psychologists remove all identifiable student and consultee (i.e., teacher/parent) information from all submitted materials (Prus & Waldron, 2008). To verify the authenticity of the case study's implementation and outcomes, however, procedures need to be established for a third-party “sign off” from an impartial administrator or supervisor familiar with the intervention and its outcomes. Finally, it should be recognized that case studies submitted as part of a performance evaluation system will likely represent a school psychologist's best work and need to be evaluated as such (Prus & Waldron, 2008).
Measuring Impact Using Case Studies: The Ohio Internship Program in School Psychology
The evaluation of the Ohio Internship Program illustrates how case studies can be used to evaluate the impact of school psychological services. The Ohio Internship Program is a collaboration among the Ohio Department of Education, Office for Exceptional Children, and Ohio's nine school psychology graduate preparation programs. Nearly 100 school psychology graduate students complete their internship each year in the state-funded Ohio Internship Program. An emphasis on accountability for school psychological services and shifts toward evidence-based intervention decisions led to the development of a model for evaluating the statewide internship experience with regard to outcomes for schools and students (Morrison et al., 2011; Morrison, Graden, & Barnett, 2009).
The evaluation of the Ohio Internship Program comprises three components. The first is a measure of intern competencies. To assess the development of interns’ skills and competencies during the internship, university-developed rating scales were completed by internship field supervisors at the beginning, midpoint, and end of the internship. The second is a measure of the number of students served by each intern based on the professional practice logs they were required to maintain throughout the school year. For this output measure, interns are asked to report the number of students served at each tier within a multitiered system of support: Tier 1—universal-/system-level practices, such as Positive Behavior Support planning and universal screening for instructional decision making; Tier 2—supplemental/targeted interventions; and Tier 3—intensive/individualized interventions. The third component of the evaluation of the Ohio Internship Program is a measure of the impact of intervention services using a case study approach. Interns are asked to provide outcome data for six individual, targeted, and universal interventions in which they were meaningfully involved. The interventions for which outcome data are required include three academic interventions (Tiers 1–3) and three social/behavior interventions (Tiers 1–3). The interventions for which outcome data are provided are judged by the interns to be exemplars of the support services they provided during their internship year.
Goal Attainment Scaling. GAS is the primary method used for summarizing intervention outcomes for students served by school psychology interns. As a supplement to the GAS process, two additional summary statistics are calculated, in instances where such calculations are appropriate, to measure the effects of an intervention provided by the interns: the PND and ES.
The GAS process involves the development of a 5-point scale for measuring goal attainment as outlined by Kiresuk, Smith, and Cardillo (1994). In this evaluation model, “Expected Level of Outcome” is replaced with “No Change” to better represent students’ responses to the intervention. Thus, positive ratings reflect a positive change in the target, and negative ratings reflect a change in an undesired direction for the target. The other scale anchors remained the same: “Somewhat More Than Expected,” “Somewhat Less Than Expected,” “Much More Than Expected,” and “Much Less Than Expected.”
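The 5-point scale described above can be represented as a simple lookup from numeric rating to scale anchor. This sketch is illustrative only; the numeric codes (-2 to +2) follow the common GAS convention from the Kiresuk et al. tradition and are an assumption on our part, as the text does not specify the numeric coding:

```python
# Illustrative mapping of the 5-point GAS scale described above.
# Assumption: ratings are coded -2..+2 (a common GAS convention),
# with "No Change" at 0 as in this evaluation model.
GAS_SCALE = {
    2: "Much More Than Expected",
    1: "Somewhat More Than Expected",
    0: "No Change",
    -1: "Somewhat Less Than Expected",
    -2: "Much Less Than Expected",
}

def describe_gas(rating):
    """Translate a numeric GAS rating into its scale anchor."""
    return GAS_SCALE[rating]

print(describe_gas(1))  # Somewhat More Than Expected
```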
Reviews of the reliability and validity of many applications of GAS procedures are available in Cardillo and Smith (1994) and Smith and Cardillo (1994), respectively. Studies that used a 5-point scale (similar to the approach used herein) reported interrater reliability indices between .87 and .93 (as cited in Cardillo & Smith, 1994). Test–retest reliability also was acceptable (e.g., correlation of r = .84 over a 2- to 3-week period; see studies reported in Cardillo & Smith, 1994). In school settings, the use of GAS methodology has been demonstrated to be of significant value in the evaluation of intervention-based change and is “a more accurate estimate than any other measure” (Sladeczek, Elliott, Kratochwill, Robertson-Mjaanes, & Stoiber, 2001, p. 52). GAS validity evidence includes analyses of many types of intervention outcomes, including school-based interventions (see Kratochwill, Elliott, & Busse, 1995). GAS has been found to be responsive to measuring diverse functional goals across services and sensitive to measuring intervention-induced change, making it a strong outcome measure for groups of students in which the rate of progress varies (MacKay, McCool, Cheseldine, & McCartney, 1993). A summary of the research regarding the utility and acceptability of GAS for measuring students’ progress can be found in Roach and Elliott (2005).
Percentage of Nonoverlapping Data. Calculating the PND involves counting the number of intervention data points that exceed the highest baseline point (for studies seeking to increase a target behavior) or counting the number of intervention data points lower than the lowest baseline point (for studies seeking to decrease a target behavior). The number of nonoverlapping data points is then divided by the total number of intervention points to obtain the PND. PND has been found to produce a summary statistic that is consistent with the outcomes obtained through visual analysis of individual participant graphs (Olive & Smith, 2005). PND should not be calculated when a baseline data point of zero is present in decreasing behavior studies or an extremely high baseline data point is present in increasing behavior studies (Scruggs & Mastropieri, 1998; Scruggs, Mastropieri, & Casto, 1987).
The use of PND as a summary statistic that is easy to calculate and interpret has wide support in the research literature (Mathur, Kavale, Quinn, Forness, & Rutherford, 1998). Ratings using PND are judged on the following scale: a PND greater than or equal to 90% is considered “Highly Effective,” a PND of 70% to less than 90% is judged as “Moderately Effective,” a PND of 50% to less than 70% is considered “Mildly Effective,” and a PND of less than 50% is rated as “Ineffective” (Scruggs, Mastropieri, Cook, & Escobar, 1986).
Effect Size. There are many ES estimation methods (Busk & Serlin, 1992; Thompson, 2007). ES in this evaluation model was calculated as the change in achievement or behavior relative to the baseline (control) standard deviation (Busk & Serlin, 1992). As a general guide for outcomes without much specific prior evidence for comparisons, interventions that yield an ES greater than or equal to 0.80 are considered to have a large effect; an ES between 0.50 and 0.79 represents a moderate effect, whereas an ES between 0.20 and 0.49 reflects a small effect.
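A minimal sketch of this ES calculation, following the Busk and Serlin (1992) approach of dividing the change in mean level by the baseline standard deviation. We assume the sample standard deviation here, and the data are hypothetical:

```python
from statistics import mean, stdev

def effect_size(baseline, intervention):
    """Busk & Serlin (1992)-style effect size: change in mean level
    relative to the baseline (control) standard deviation.
    Assumption: the sample standard deviation (n - 1 denominator) is used."""
    return (mean(intervention) - mean(baseline)) / stdev(baseline)

# Hypothetical oral reading fluency scores
baseline = [40, 42, 38, 41]
intervention = [43, 45, 41, 48, 50, 47]
es = effect_size(baseline, intervention)
print(round(es, 2))  # an ES >= 0.80 would be considered a large effect
```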