Measuring instructional practice in science using classroom artifacts: lessons learned from two validation studies



With growing interest in the role of teachers as the key mediators between educational policies and outcomes, the importance of developing good measures of classroom processes has become increasingly apparent. Yet, collecting reliable and valid information about a construct as complex as instruction poses important conceptual and technical challenges. This article summarizes the results of two studies that investigated the properties of measures of instruction based on a teacher-generated instrument (the Scoop Notebook) that combines features of portfolios and self-report. Classroom artifacts and teacher reflections were collected from samples of middle school science classrooms and rated along 10 dimensions of science instruction derived from the National Science Education Standards; ratings based on direct classroom observations were used as comparison. The results suggest that instruments that combine artifacts and self-reports hold promise for measuring science instruction with reliability similar to, and sizeable correlations with, measures based on classroom observation. We discuss the implications and lessons learned from this work for the conceptualization, design, and use of artifact-based instruments for measuring instructional practice in different contexts and for different purposes. Artifact-based instruments may illuminate features of instruction not apparent even through direct classroom observation; moreover, the process of structured collection and reflection on artifacts may have value for professional development. However, their potential value and applicability on a larger scale depends on careful consideration of the match between the instrument and the model of instruction, the intended uses of the measures, and the aspects of classroom practice most amenable to reliable scoring through artifacts. 
We outline a research agenda for addressing unresolved questions and advancing theoretical and practical knowledge around the measurement of instructional practice. © 2011 Wiley Periodicals, Inc. J Res Sci Teach 49: 38–67, 2012

There is growing consensus among researchers and policymakers about the importance of accurate, valid, and efficient measures of instructional practice in science classrooms. Instruction directly or indirectly mediates the success of many school improvement efforts, and thus accurate descriptions of what teachers do in classrooms as they attempt to implement reforms are key for understanding “what works” in education and, equally importantly, “how.” Many educational policies and programs rely on claims about the value of certain practices for improving student outcomes; for example, the No Child Left Behind legislation prompted schools to adopt scientifically based practices to improve the achievement of all students. Similarly, the reform teaching movement often recommends specific approaches to instruction designed to promote higher-level learning. More generally, the National Research Council (2006) recommended that states and districts address existing inequities in the kinds of experiences or opportunities to learn that different groups of students are exposed to in their classrooms. These examples suggest that a detailed examination of teachers' classroom practices and their relationship with student achievement is key for understanding why policy recommendations such as these may be effective or not (Ball & Rowan, 2004; Blank, Porter, & Smithson, 2001; Mayer, 1999).

In the case of science classrooms, large-scale implementation of high quality instruction can be particularly challenging given the relative lack of qualified teachers available. At the same time, there is a relative paucity of research on the measurement of instructional practices in science classrooms compared to other subjects such as mathematics (see e.g., Glenn, 2000). As a result, the research base to support claims about instructional effects in science is often limited (e.g., Laguarda, 1998; McCaffrey et al., 2001; Von Secker & Lissitz, 1999). As with other subjects, this shortage of empirical evidence reflects in part the conceptual difficulty of finding common frames of reference for describing science instruction, but in equal or larger measure the technical challenge of developing efficient, reliable procedures for large-scale data collection about science teachers' instructional practices.

Features of Instructional Practice in Middle School Science

The first major challenge facing the development of a measure of instructional practice is defining the target construct itself. Instruction in any subject is a complex and multidimensional phenomenon that can be described (and quantified) only partially within an organizing model or set of assumptions. In the case of science education, representing instruction across scientific disciplines and areas of scientific knowledge can make construct definition particularly challenging. For this study, we adopted as reference the model of content (what is taught) and pedagogy (how it is taught) proposed by the National Science Education Standards (National Research Council, 1996). The standards emphasize student learning of the skills that characterize the work of scientists (observation, measurement, analysis, and inference), and accordingly focus on instructional practices and classroom experiences that help students learn how to ask questions, construct and test explanations, form arguments, and communicate their ideas (Ruiz-Primo, Li, Tsai, & Schneider, 2010).

While the NRC model offers a useful set of organizing notions for conceptualizing and studying science instruction, it lacks specificity and detail in terms of concrete features of teacher classroom practices. Le et al. (2006) operationalized the NRC model in terms of specific measurable features of teacher practice, offering more concrete guidance for characterizing variation in instruction across classrooms. In their study, a panel of scientists and national science education experts was convened to develop a taxonomy of science curriculum and instruction linked to the NRC model. The initial taxonomy included four categories (scientific understanding, scientific thinking, classroom practice, and teacher knowledge), which the panel then described in terms of concrete behaviors and instructional practices that could be found in classrooms. Through this process, the panel identified 10 measurable features or dimensions of science instruction: Grouping, Structure of Lessons, Scientific Resources, Hands-on, Inquiry, Cognitive Depth, Explanation and Justification, Connections and Applications, Assessment, and Scientific Discourse Community. Figure 1 presents synthetic definitions for each of these dimensions, which provided the conceptual framework for characterizing instruction in our studies. For each of these dimensions a complete rubric can be found in the Supplementary online appendix; the rubrics describe in detail the teacher behaviors that characterize instructional practice of varying quality.

Figure 1.

Dimensions of instructional practice in middle school science.

Importantly, the framework underlying the dimensions of science instruction in this paper predates more recent conceptualizations such as the model offered in Taking Science to School (NRC, 2007), the science Standards for College Success (College Board, 2009), or most recently the Framework for K-12 Science Education of the National Research Council and the National Academy of Science (NRC, 2011). However, careful review suggests that the dimensions in our framework are far from obsolete; collectively, they reflect a model of instruction that shares many easily recognizable features and areas of emphasis in the newer models. Below we review our dimensions of science instruction specifically in relation to the framework for K-12 Science Education (NRC, 2011), the most recent model that will serve as the foundational document for the first generation of common core science standards. The discussion reveals considerable overlap between the 1996 and 2011 frameworks, but also points to areas where significant differences exist between them.

Most of the dimensions we use map fairly directly onto elements of the new NRC framework. In our model Cognitive Depth refers to instruction that emphasizes understanding central (core) disciplinary concepts or ideas, developing models as generalizations of findings, and drawing relationships among science concepts. While not organized under a cognitive depth category, these components are all included in the new framework; in particular, the notion of disciplinary core ideas represents one of three major dimensions that constitute the Framework, dimensions that are organized as learning progressions that best support student learning across grades (NRC, 2011, p. 2-2 to 2-3). Furthermore, the development and use of mental and conceptual models is one of eight elements that constitute the scientific practices dimension (NRC, 2011, p. 3–8). In our model, cognitive depth covers both content and enactment, which mirrors the emphasis on designing learning experiences that intertwine scientific explanations and practices in the new framework (NRC, 2011, p. 1–3).

Another instance of considerable overlap concerns the evaluation of evidence for supporting scientific discourse and explanation. Our Explanation/Justification dimension focuses on the use of scientific evidence and concepts to explain and justify claims or findings, while Scientific Discourse is concerned with students and teachers communicating scientific evidence and reasoning to each other. In concert, these two dimensions in our framework reflect one of the major emphases of the 2011 framework: engaging students in argumentation from evidence. Both models are concerned with students and teachers “talking science,” communicating scientific evidence and reasoning.

A key change from the 1996 National Science Education Standards in the 2011 framework is the replacement of scientific inquiry with the broader notion of scientific practices that emphasize understanding of how scientists work (NRC, 2011, p. 3–2). Key practices include engaging students first hand in posing scientific questions, designing investigations, collecting and analyzing data, and providing evidence to support an answer (NRC, 2011, p. 3-6 to 3-18). Notably, however, each of these practices can be mapped directly to elements of the Inquiry dimension in our model of instruction based on the 1996 standards. The dimension Connections and Applications in our model refers to instruction that emphasizes developing connections among scientific concepts and students' experiences in the world around them, and the application of scientific knowledge and reasoning to specific real world contexts. Like inquiry, the term connections and applications does not explicitly appear in the three dimensions of the 2011 framework; rather, this notion constitutes one of six guiding principles underlying the framework (NRC, 2011, p. 2–4).

In our model, Structure of Lessons refers to an organized and coherent series of lessons that build on one another logically to enhance student learning. The new framework addresses this dimension in two ways. First, the disciplinary core ideas build on student prior knowledge and interest; second, these core ideas are organized as coherent learning progressions that delineate the developmental trajectory necessary for students to master a concept (NRC, 2011, pp. 2-2, 2-6). Finally, Assessment in our model includes formal and informal approaches teachers use to gauge student understanding and progress and inform instructional decision-making. The 2011 framework articulates a vision of formative and summative assessment that emphasizes combined use of formal and informal assessments in the classroom, using teacher developed or large-scale tools aligned to curriculum and instruction and linked to longitudinal models of student learning.

Some dimensions in our model are not identified specifically in the new framework, but refer to practices that are still easily recognizable within it. Our dimensions Scientific Resources and Hands-on are two examples. The former refers to use of scientific tools, materials, and equipment during instruction; the latter more specifically requires students to handle these resources directly as a way to physically engage with scientific phenomena. While the 2011 framework does not explicitly name these as dimensions, the use of scientific tools and materials is an important component of scientific practice (see e.g., “planning and carrying out investigations”). As described in the first dimension in the 2011 framework (Scientific and Engineering Practices), the use of a variety of tools and materials to engage children in work that aims to solve specific challenges posed by the teacher is a priority for science students of all ages (NRC, 2011, p. 3–17). Similarly, Grouping refers to students working together in groups to carry out scientific tasks. While there is little direct discussion of Grouping practices during instruction in the 2011 framework, the notion of collaborative learning and the social nature of scientific investigation is hardly foreign to it. Indeed, the document acknowledges that science is a collaborative enterprise (p. 2–3) and refers to findings in Taking Science to School that call for incorporating a range of instructional approaches including collaborative small-group investigations to reflect the social nature of science (NRC, 2011, p. 10-9).

This overlap is, of course, not surprising; the latest framework from the NAS and NRC, and the ones that preceded it, did not seek to compete with or replace the NRC (1996) standards so much as build on and complement them to offer a more comprehensive and cohesive vision for science education (NRC, 2011, p. 2–5). Nevertheless, some prominent features of these more recent frameworks are absent in our model of instruction. For example, our dimensions do not include mathematical reasoning and quantitative applications highlighted in the 2009 College Board standards and incorporated as one of the practices of scientists in the 2011 framework. Similarly absent are the engineering practices and technology applications that provide one of the central organizing themes in the new framework. Also, unlike recent frameworks, our dimensions do not explicitly address equity and other social issues related to science and technology. Finally, while the new framework outlines important implications for the types of instruction needed in science classrooms, it is not directly tied to a particular model of teaching or instructional standards. Systematically measuring instructional practice will require explicitly describing and classifying teaching behaviors and in that sense defining teaching standards (see e.g., Le et al., 2006). In the final section of the article, we discuss how dimensions of instructional practice of varying grain sizes and inference levels might be derived from general science education frameworks in future research or policy efforts.

Methods for Collecting Information About Instructional Practice in Classrooms

Researchers have used different methods to collect data about instructional practice, each with strengths and weaknesses. Surveys are the most widely used approach because they offer a cost-effective way to include a large number of classrooms and a broad range of aspects of practice (e.g., coverage of subject matter, cognitive demand, instructional strategies, time allocation, or teachers' beliefs and attitudes; see, for example, Mayer, 1999). However, like all self-report measures, surveys are subject to error, bias, and social-desirability effects. First, respondents have imperfect memories and may not always consistently recall, summarize, or judge the frequency or nature of instruction over the school year. Moreover, practitioners and researchers may not have a shared understanding of key terms, particularly those used to describe new or evolving practices (e.g., cooperative groups, formative classroom assessment), or aspects of practice that involve high abstraction (e.g., inquiry, classroom discourse); in these situations teachers may refer to personal definitions or idiosyncratic interpretations (Antil, Jenkins, Wayne, & Vadasy, 1998). Finally, some elements of classroom practice (e.g., interactions between teachers and students) may be inherently difficult to capture accurately through teacher surveys (Matsumura, Garnier, Slater, & Boston, 2008).

Case study methods overcome some of the limitations of surveys through extensive direct observation in schools and classrooms, and interviews that provide insights into the perspectives of students, teachers, and administrators (Stecher & Borko, 2002). Because observers can be carefully trained to recognize nuanced features of practice, case studies can reduce respondent bias and memory error and are easily adaptable for studying different kinds of instructional innovations (Spillane & Zeuli, 1999). However, case studies are time- and labor-intensive and thus are usually not feasible as a tool for large-scale research or policy uses (Knapp, 1997). Nevertheless, while the generalizability of findings sometimes may be suspect, much of what we know about instructional practice is based on in-depth studies of small numbers of classrooms (Mayer, 1999).

In light of the limitations of surveys and case studies, novel approaches have been proposed to gather information about instruction. Some researchers have asked teachers to record information about classroom events or interactions in daily structured logs using selected-response questions to make recall easier and reduce the reporting burden (see e.g., Rowan, Camburn, & Correnti, 2004; Smithson & Porter, 1994). While logs typically address discrete events rather than more complex features of instruction, collecting logs over extended periods may also provide a broader, longer-term perspective on instruction. Others have explored the use of vignettes to obtain insights into instructional practice: teachers respond to written or oral descriptions of real or hypothetical classroom events, ideally revealing their attitudes, understanding and pedagogical skills (Kennedy, 1999; Stecher et al., 2003). When used with open-ended response formats, vignettes offer an opportunity for teachers to provide detailed descriptions about the instructional strategies they use and to explain the decisions they make when planning and implementing their lessons. However, both logs and vignettes rely on teacher self-report and thus, like surveys, suffer from the potential for self-report bias, social desirability, and inconsistency in interpretation (Hill, 2005).

More recently, researchers have incorporated instructional artifacts into their studies of classroom practice (e.g., Borko et al., 2006; Clare & Aschbacher, 2001; Resnick, Matsumura, & Junker, 2006; Ruiz-Primo, Li, & Shavelson, 2002). Artifacts are actual materials generated in classrooms, such as assignments, homework, quizzes, projects, or examinations. Systematically collected artifacts, assembled into portfolios or collected in other forms, can be used to measure various features of instructional practice, including some that are difficult to capture through surveys or observations (e.g., use of written feedback); moreover, because they contain direct evidence of classroom practice, artifacts are less susceptible to biases and social desirability effects. In addition to its potential for measuring instructional practice, the process of collecting artifacts can have value for teacher professional development (see e.g., Gerard, Spitulnik, & Linn, 2010; Moss et al., 2004). However, this method is not without limitations: collecting artifacts places a significant burden on teachers, who must save, copy, assemble, and even annotate and reflect on the materials. Furthermore, as with surveys, artifacts may reveal little about instructional interactions between teachers and students during class time.

The Scoop Notebook: Measuring Instructional Practice Through Artifacts and Self-Report

We designed an instrument for measuring instruction that seeks to combine the advantageous features of portfolios, logs, and vignettes. We call our instrument the Scoop Notebook as an analogy to scientists scooping samples of materials for analysis. As with most portfolios, our notebook contains actual instructional materials and work products that serve as a concrete basis for interpreting teacher reports about their classroom practices, reducing self-report bias and potentially providing richer information about instruction. As with logs, the notebook asks teachers to collect and annotate artifacts daily for a period of time; reporting on daily events when they are recent in memory reduces memory effects, and enables the consideration of day-to-day fluctuations in classroom practice in context. Finally, like vignettes, the notebook includes open-ended questions that solicit teachers' reflections on their practice specifically situated in their classroom context. Specifically, when compiling their notebooks, science teachers are first asked to respond to a set of reflection questions intended to elicit important contextual information for understanding their instructional practices in the context of the particular classroom and series of lessons. Teachers then collect three kinds of classroom artifacts every day over a period of 5 days of instruction: instructional artifacts generated or used before class (e.g., lesson plans, handouts, rubrics), during class (e.g., readings, worksheets, assignments), and after/outside class (e.g., student homework, projects). Teachers also provide three samples of student work for each graded artifact (e.g., assignments, homework), and a recent formal assessment used in the classroom (e.g., test, quiz, paper). Teachers use self-adhesive notes included in the materials they receive to briefly describe each artifact and sample of student work. 
A disposable camera enables teachers to provide transitive evidence of instruction that cannot be photocopied (e.g., equipment, posters, three-dimensional science projects); a daily photo-log is used to title or briefly describe each photograph. At the end of the notebook period teachers answer a series of retrospective questions eliciting additional information about the series of lessons in the notebook. Finally, teachers are asked to assess the extent to which the contents of the notebook reflect their instructional practice in the classroom.

As this description makes clear, our instrument belongs in the broad category of teacher portfolios (or more recently e-portfolios; see for example, Wilkerson & Lang, 2003). We see the Scoop Notebook as a particular type of portfolio instrument designed to provide more depth of information about instruction over a shorter period of time (Wolfe-Quintero & Brown, 1998). The leading hypothesis behind this type of instrument is that the combination of teacher reflections and classroom artifacts results in a more complete picture of science instruction than each source can provide by itself. A more detailed presentation of the notebook and accompanying materials, including sample artifacts, annotations, and instructions to teachers is available in the Supplementary online appendix or from the authors by request.

This article summarizes the results of two field studies that investigated the properties of measures of instructional practice in middle school science based on our teacher-generated notebook instrument. The purpose of these pilot studies was twofold: first, to answer basic questions about the reliability and validity of the measure; and second, to collect useful information to help researchers design better artifact-based instruments for measuring instruction in the future. The results contribute to a small but growing body of research that systematically examines the measurement of instructional practice through a variety of methods and in a variety of contexts (see e.g., Borko et al., 2006; Matsumura et al., 2008; Mayer, 1999; Pianta & Hamre, 2009; Rowan & Correnti, 2009). This is a particularly interesting topic in science education, because previous research on the measurement of instruction has largely focused on mathematics and language arts.


We present the results of two field studies that investigated the properties of measures of instruction based on the Scoop Notebook and on direct classroom observation. Our studies addressed four main research questions: (a) What is the reliability of measures of science instruction based on the notebook and direct classroom observation? (b) What are the patterns of inter-correlation among dimensions of instructional practice (i.e., do measures based on notebooks and observations reflect similar underlying structure)? (c) What is the correlation between measures of the same classroom based on notebooks and observations? and (d) What lessons can be drawn from our use of a teacher-generated notebook for measuring science instruction in general, and for improving artifact-based instruments in particular?¹


We recruited middle school teachers for our study in school districts in the Los Angeles and Denver areas. In all, 49 teachers from 25 schools (nine in California, 14 in Colorado) participated in the two studies. Table 1 shows the composition of the samples for each study. The year 1 (2003–2004) sample included 11 middle school science teachers in California and 17 in Colorado. The year 2 (2004–2005) study included a different sample of 11 teachers in California and 10 in Colorado. The schools come from six districts in two states, ensuring a diversity of contextual policy influences, including academic standards, curricular programs, and instructional approaches. The schools are also diverse with respect to enrollment of minority students (11–94%), students receiving free/reduced lunch (1–83%), and English language learners (0.2–40%). Finally, the sample is diverse in terms of school performance, with 17–90% of students proficient in English, 13–77% proficient in math, and about a third of schools in each state identified as low performing.

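Table 1. Summary of data collected, sources of evidence, and analysis

| | 2003–2004 Study, Notebook | 2003–2004 Study, Observation | 2004–2005 Study, Notebook | 2004–2005 Study, Observation |
| --- | --- | --- | --- | --- |
| Sample size (schools) | 4 CA, 10 CO | 4 CA, 10 CO | 6 CA, 7 CO | 6 CA, 7 CO |
| Sample size (teachers) | 11 CA, 17 CO | 11 CA, 17 CO | 11 CA, 10 CO | 11 CA, 10 CO |
| Number of raters | 2 | 1 | 1 | 2 |
| Number of occasions | 1 | 2 | 1 | 3 |
| Generalizability design | T×Rᵃ | n/a | n/a | T×R×O, T×Rᵇ |
| Correlations | Notebook–Observation; Notebook–Gold Standard | Observation–Gold Standard | | |

a. D, Daily Ratings; S, Summary Rating; GS, Gold Standard Rating.
b. T, Teacher (notebook); R, Rater; O, Occasion.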

Teachers selected for the study a classroom they considered typical of the science classes they taught with respect to student composition and achievement. To compile their notebooks, teachers collected materials in the selected class during a 5-day period of instruction (or equivalent with block scheduling) starting at the beginning of a unit or topic. Before starting data collection we met with teachers to review the notebook contents and data collection procedures.²


As described in the previous section, our measures characterize science instruction along 10 features or dimensions of practice derived from the National Science Education Standards (Le et al., 2006; NRC, 1996). Detailed rubrics (scoring guides) were developed to characterize each dimension of practice on a five-point scale, ranging from low (1) to high (5). The rubrics provide descriptions of high (5), medium (3), and low (1) quality practice, anchored to examples of what these scores might look like in the classroom. The middle scores (2 and 4) are not defined in the rubrics; raters use these scores to designate practices that fall somewhere between two anchored points in the scale. This helps to define the scores not only as qualitatively distinct pictures of instructional practice, but also as ordered points in a quantitative scale. Complete rubrics for all the dimensions are available in the accompanying online appendix.

Notebooks and observations were rated using the same rubrics. Notebook readers assigned a single score for each dimension considering the evidence in the notebook as a whole. Classroom observers assigned two kinds of rating on each dimension: daily ratings after each visit, and summary observation ratings for the series of lessons. Finally, classroom observers were asked to review the teacher's notebook after the observation period and assign a Gold Standard (GS) rating for each dimension considering all the evidence at their disposal from their own observations and the notebook. These GS ratings are not independent from observation or notebook ratings; rather, they are composite ratings that represent our best available estimate of the “true” status of a teacher's instruction on the 10 dimensions of practice.

Notebook and observation ratings were carried out by the authors and a group of doctoral students with backgrounds in science teaching, and training in research methodology. A calibration process was undertaken to ensure consistent understanding and scoring of dimensions across raters. Observer training involved first reviewing and discussing the scoring guides and independently rating videotapes of science lessons; the group then discussed the ratings to resolve disagreements and repeated the process with a second videotape. Notebook readers first discussed the use of the scoring guides to judge instruction on the basis of the notebook contents. Readers then rated three notebooks independently and discussed the scores to resolve differences of interpretation; the process was repeated with two more notebooks.

Design and Analytic Methods

The year 1 study focused on estimating the reliability of notebook ratings and criterion validity with respect to classroom observations. Therefore, each notebook was independently rated by two trained readers; classrooms were visited by one observer on two occasions during the notebook period. The year 2 study investigated the reliability of classroom observation ratings. Therefore, pairs of observers visited each classroom on three occasions during the notebook period; notebooks were rated by one reader who had not visited the classroom. In both years, GS ratings were assigned by observers who visited each classroom considering the additional evidence in the notebook. Table 1 summarizes the sources of information and design used each year.


We employed two complementary approaches to investigate the reliability of notebook and observation ratings. Inter-rater agreement indices offered preliminary evidence of consistency and helped pinpoint problematic notebooks, classrooms, or raters. The reliability of ratings was then formally assessed using Generalizability (G) Theory (Shavelson & Webb, 1991). G-theory is particularly suitable as a framework for investigating the reliability of measures of instruction because it can assess the relative importance of multiple sources of error simultaneously (e.g., raters, tasks, occasions; see e.g., Moss et al., 2004). In the year 1 study each notebook was scored by multiple raters on each dimension; this is a crossed Teacher × Rater design with one facet of error (raters), which identifies three sources of score variance: true differences in instructional practice across teachers ($\sigma^2_T$), mean differences between raters (i.e., variance in rater severity, $\sigma^2_R$), and a term combining interaction and residual error ($\sigma^2_{TR,e}$). The year 2 study investigated the reliability of observation ratings. For summary observation ratings assigned at the end of the observation period we used the same Teacher × Rater design just described. With daily observation ratings there is one more facet of error (Teacher × Rater × Occasion); the design thus separates true variance in instructional practice ($\sigma^2_T$) from error variance related to raters and occasions ($\sigma^2_R$ and $\sigma^2_O$); variance related to two-way interactions ($\sigma^2_{TR}$, $\sigma^2_{TO}$, and $\sigma^2_{RO}$; for example, raters give different scores to the same teachers, averaging over occasions); and residual interaction and error in the model ($\sigma^2_{TRO,e}$).³
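To make the one-facet decomposition concrete, the following is a minimal sketch of how the three variance components of a crossed Teacher × Rater design (teacher, rater, and interaction/residual) could be estimated from the mean squares of a two-way ANOVA with one rating per cell. The function name and the ratings are hypothetical illustrations, not the software or data used in the studies; truncating negative estimates at zero is one common convention and is an assumption here.

```python
import numpy as np

def variance_components_txr(scores):
    """Estimate variance components for a crossed Teacher x Rater design.

    scores: 2-D array-like, rows = teachers (notebooks), columns = raters,
    with exactly one rating per cell.
    """
    x = np.asarray(scores, dtype=float)
    n_t, n_r = x.shape
    grand = x.mean()
    teacher_means = x.mean(axis=1)
    rater_means = x.mean(axis=0)

    # Mean squares from a two-way ANOVA without replication
    ms_t = n_r * np.sum((teacher_means - grand) ** 2) / (n_t - 1)
    ms_r = n_t * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = x - teacher_means[:, None] - rater_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_t - 1) * (n_r - 1))

    # Solve the expected-mean-square equations; negative estimates set to 0
    return {
        "teacher": max((ms_t - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_t, 0.0),
        "residual": ms_res,
    }

# Hypothetical ratings on one dimension: 5 notebooks scored by 2 readers (1-5 scale)
ratings = [[2, 3], [4, 4], [1, 2], [5, 4], [3, 3]]
print(variance_components_txr(ratings))
```

A large teacher component relative to the rater and residual components would indicate that most score variation reflects differences in instruction rather than differences among readers.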

In addition to reviewing the sources of variance in the ratings, we conducted a series of decision (D) studies to estimate the reliability of the ratings under various measurement scenarios (e.g., varying numbers of raters and observations). G-theory distinguishes between reliability for relative (norm-referenced) and absolute (criterion-referenced) score interpretations. Because our measures assess teacher practice in relation to fixed criteria outlined in our model of science instruction (not in relation to other teachers) we report absolute reliability coefficients (known as dependability coefficients in G-theory).
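The distinction between relative and absolute coefficients can be sketched for the one-facet Teacher × Rater design, assuming fully crossed random facets. The relative coefficient treats only the interaction/residual variance (averaged over raters) as error, whereas the absolute (dependability) coefficient also counts rater severity variance as error. The function names are illustrative rather than the authors' code; the inputs are the notebook Cognitive Depth components reported in Table 2 (% of variance).

```python
def relative_coefficient(var_t, var_tr_e, n_raters):
    """Generalizability coefficient for relative (norm-referenced) decisions."""
    return var_t / (var_t + var_tr_e / n_raters)

def dependability_coefficient(var_t, var_r, var_tr_e, n_raters):
    """Dependability coefficient for absolute (criterion-referenced) decisions."""
    return var_t / (var_t + (var_r + var_tr_e) / n_raters)

# Notebook Cognitive Depth, Table 2: T = 53.2, R = 10.4, TR,e = 36.5.
phi_2 = dependability_coefficient(53.2, 10.4, 36.5, n_raters=2)
phi_3 = dependability_coefficient(53.2, 10.4, 36.5, n_raters=3)
print(round(phi_2, 2), round(phi_3, 2))  # 0.69 0.77
```

With these inputs the absolute coefficient rounds to 0.69 for two raters and 0.77 for three, which agrees with the notebook entries for Cognitive Depth in Table 4. Because rater variance enters only the absolute coefficient, the relative coefficient is never smaller than the dependability coefficient for the same design.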

Correlations Among Dimensions Within Methods

To address the second research question we conducted a series of exploratory factor analyses to examine the patterns of intercorrelation among dimensions. Comparing the results observed with notebook and observation ratings can offer evidence of the degree to which both methods yield measures that capture the same underlying constructs of instructional practice. We first examined the hypothesis of unidimensionality—whether a dominant factor underlies all dimensions of instruction. We then assessed the possibility that more than one factor underlies and explains the correlations among dimensions. In this situation creating a single aggregate index of instruction would not be appropriate; instead multiple indices would be necessary to reflect different aspects of instructional practice represented by separate groups of dimensions.4
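A rough first check on unidimensionality can be sketched as the share of total variance carried by the largest eigenvalue of the correlation matrix among dimensions. This is a principal-components simplification, not the exploratory factor analysis used in the studies, and the function name and demonstration data are hypothetical.

```python
import numpy as np

def first_component_share(ratings):
    """Share of total variance on the first principal component.

    ratings: 2-D array, rows = rated units, columns = dimensions.
    """
    corr = np.corrcoef(np.asarray(ratings, dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues.sum()

# Hypothetical data: three perfectly correlated dimensions.
demo = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]
share = first_component_share(demo)
```

A share near 1 indicates that a single component reproduces nearly all of the shared variation; a share near 1/k for k dimensions indicates essentially uncorrelated dimensions.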

Correlations Among Methods Within Dimensions

The third research question concerns the degree to which notebooks and observations converge in their assessments of instructional practice in the same science classrooms. To address this question we estimated raw and disattenuated correlations between ratings of instruction as measured through notebooks and observations in the year 1 study (n = 28 classrooms).5 In addition, we estimated correlations between notebook, observation, and Gold Standard ratings.
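The disattenuated correlation follows the classical correction for attenuation: the raw correlation is divided by the square root of the product of the two reliabilities. The sketch below uses hypothetical input values, not figures from the studies, and the function name is illustrative.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct a raw correlation for unreliability in both measures."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: raw r = 0.60, both reliabilities = 0.80.
corrected = disattenuate(0.60, 0.80, 0.80)
```

Because the divisor is at most 1, the corrected value is never smaller in magnitude than the raw correlation, and it can exceed 1.00 when the reliabilities are low or imprecisely estimated.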

Additional Qualitative Analysis of Notebook Use

To address the fourth research question, we examined the operational evidence from the two field studies to understand how the notebooks functioned in practice and how they might be modified to improve reliability, validity, and feasibility for measuring science instruction on a larger scale. These analyses consisted of a qualitative review of the completeness of each notebook and the variety of artifacts collected in it, and an extensive qualitative analysis of teachers' reflections on the notebook: their perceptions of the notebook and its potential for adequately capturing their instructional practices in the science classroom, and their feedback on ways to improve the notebook and data collection procedures. Finally, raters assessed the usefulness of each source of information in the notebook for judging each dimension of instruction using a three-point scale (ranging from 0-Not helpful to 2-Very helpful).


Reliability of Notebook and Observation Ratings

Table 2 shows inter-rater agreement indices for notebook and summary observation ratings for each dimension of science instruction. Exact inter-rater agreement was low to moderate, ranging from 22% to 47% with notebooks and from 29% to 62% with observations. These results suggest that better rater training, and potentially also improvements or clarifications to the scoring guidelines, may be needed; in the discussion section, we outline changes to the scoring rubrics that may help improve rating consistency. On the other hand, agreement within one point gives a more hopeful sense of rater consistency: over 90% for Overall, and over 75% for all dimensions, which is in the typical range for more established observation and portfolio measures (see e.g., Pecheone & Chung, 2006; Pianta & Hamre, 2009). These results also indicate that raters in our studies were able to judge science instruction using the notebooks with levels of agreement similar to those of observers inside classrooms.6
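The two agreement indices can be sketched as simple proportions over paired ratings: exact agreement counts identical scores, and within-one agreement counts scores that differ by at most one scale point. The function name and the paired ratings below are hypothetical.

```python
def agreement_rates(ratings_a, ratings_b):
    """Return (exact, within_one) agreement proportions for paired ratings."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one

# Hypothetical ratings of five notebooks by two readers on the 1-5 scale.
reader_1 = [1, 2, 3, 4, 5]
reader_2 = [1, 3, 3, 2, 5]
exact, within_one = agreement_rates(reader_1, reader_2)
print(exact, within_one)  # 0.6 0.8
```

Neither index corrects for chance agreement, which is one reason the variance-component analysis is used as the formal assessment of reliability.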

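Table 2. Notebook and summary observation ratings of reform-oriented instruction (1–5 scale): percent agreement and variance components by dimension

| Dimension | Notebook (2003–2004): exact agreement (%) | Notebook: within 1 (%) | Notebook: T (% of variance) | Notebook: R (% of variance) | Notebook: error (TR,e) (% of variance) | Summary observation (2004–2005): exact agreement (%) | Observation: within 1 (%) | Observation: T (% of variance) | Observation: R (% of variance) | Observation: error (TR,e) (% of variance) |
|---|---|---|---|---|---|---|---|---|---|---|
| Cognitive depth | 40 | 82 | 53.2 | 10.4 | 36.5 | 38 | 90 | 63.9 | | 32.6 |
| Discourse Community | 43 | 91 | 61.0 | | 34.3 | 33 | 81 | 33.2 | 8.8 | 57.9 |
| Scientific resources | 37 | 75 | 51.9 | 8.2 | 39.9 | 48 | 90 | 76.8 | | 23.1 |
| Structure of lessons | 47 | 82 | 28.9 | 9.9 | 61.1 | 33 | 86 | 57.6 | 21.9 | 20.3 |

Note: Components that account for 5% of variance or less are not displayed for ease of interpretation.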

We conducted a series of analyses using Generalizability Theory to further investigate the sources of error affecting judgments of science instruction based on notebooks and observations. Table 2 also presents estimated variance components for notebook ratings and summary observation ratings of each dimension. The results indicate that most of the variance in the Overall ratings (57% and 65%, respectively, with notebooks and observations) reflects true differences between teachers. Also notable are the small rater variance components, which may suggest that rater training was more successful than the agreement indices would initially indicate. Among the individual dimensions the few instances of substantial rater variance occurred not with notebooks but with observation ratings: observers were less consistent in judging Assessment, Explanation/Justification, and Structure of Lessons. Finally, residual error variance was considerable. The residual term ($\sigma^2_{TR,e}$) combines teacher by rater interaction, random error, and potentially variance associated with facets excluded from the design. One important source of error often hidden in measures of instruction is variation across measurement occasions (Shavelson, Webb, & Burstein, 1986); in the next section, we investigate whether this facet may have contributed to error variance in the measures.

Table 3 presents variance components for a teacher by rater by occasion Generalizability design for daily observation ratings in the year 2 study. Of note is that most of the variance in Overall ratings (52%) reflects true differences across teachers. The results also show that variance across occasions is an important source of error in the measures; in particular we found substantial (over 20%) day-to-day variation in instructional practices (the teacher by occasion interaction, $\sigma^2_{TO}$) related to Grouping, Scientific Resources, Hands-On Activities, and Connections/Applications. Naturally, instruction in science classrooms can be expected to vary from day to day for a variety of reasons, and thus $\sigma^2_{TO}$ could be seen as reflecting true variance in teacher practice over time, not error. However, the interaction does reflect the degree of uncertainty (i.e., error) involved in generalizing from a measure of science instruction based on a limited sample of observations (or, with notebooks, days of data collection) to instruction across the school year, as if the measure were a true average over occasions and raters. This day-to-day variation highlights the need for drawing sufficient samples of occasions of practice, as we will examine in detail below.

Table 3. Variance components by dimension, daily observation ratings (2004–2005)
Columns (% of variance): T, R, O, T × R, T × O, R × O, TRO,e
Note: Components that account for 5% of variance or less are not displayed for ease of interpretation.

Cognitive depth: 30.2, 11.5, 16.2, 42.0
Discourse Community: 17.3, 49.9, 7.4, 22.6
Scientific resources: 52.1, 26.3, 12.2
Structure of lessons: 26.6, 10.5, 12.1, 16.6, 29.2

Table 4 presents absolute reliability (i.e., dependability) coefficients for measures of science instruction obtained through notebooks and observations, averaging over two and three raters. In the g-theory framework these coefficients reflect the extent to which a measure of instruction generalizes beyond the specific instances of measurement it represents (i.e., one lesson judged by one rater) to the universe of admissible conditions under which it could have been obtained (i.e., all lessons in the year, all possible raters).7 The results offer valuable insight for assessing the reliability of measures based on notebooks and observations. For individual dimensions the dependability of ratings differed across methods. Notebook ratings of Hands-on, Inquiry, Scientific Resources, and Structure of Lessons have lower dependability coefficients than observation ratings. Conversely, notebook ratings of Assessment, Explanation/Justification, and Discourse Community are more reliable than observation ratings. The large teacher by rater interaction seen with Discourse Community suggests that raters interpreted the scoring guide for this dimension inconsistently during direct observation. In the discussion section, we consider in detail the possibility that some aspects of instructional practice may be measured more reliably by artifact collections while others are better measured through direct observation. Finally, the dependability of Overall notebook ratings over two raters is 0.73, compared to 0.79 and 0.80 for Overall summary and daily observation ratings (with three raters dependability is over 0.80 throughout). One early lesson from these results is that while notebooks and observations may support reliable Overall judgments of science instruction, reliable measures of the individual dimensions of practice are harder to obtain and require either larger numbers of observations or raters, or potentially modifications to the rating rubrics.

Table 4. Dependability coefficients for notebook, summary observation, and daily observation ratings, by dimension

| Dimension | Notebook (2003–2004): 2 raters | Notebook: 3 raters | Summary observation (2004–2005): 2 raters | Summary observation: 3 raters | Daily observation (a) (2004–2005): 2 raters | Daily observation (a): 3 raters |
|---|---|---|---|---|---|---|
| Cognitive depth | 0.69 | 0.77 | 0.78 | 0.84 | 0.70 | 0.75 |
| Discourse Community | 0.76 | 0.82 | 0.50 | 0.60 | 0.37 | 0.46 |
| Scientific resources | 0.68 | 0.76 | 0.87 | 0.91 | 0.84 | 0.86 |
| Structure of lessons | 0.45 | 0.55 | 0.73 | 0.80 | 0.59 | 0.66 |

(a) The coefficients reflect the reliability of the average rating over five observations.

Because classroom practice can be naturally expected to vary from day to day, it is also important to consider how the reliability of indicators of instructional practice may vary as a function of the number of days of observation or data collection. The variance components in Table 3 can be used to obtain projections of reliability under different scenarios; Figure 2 plots estimated dependability coefficients by number of daily observations. The figure indicates that Overall ratings of science instruction by two raters reach dependability of 0.80 after the fifth observation (with three raters only three observations would be required; see Table 4). As could be expected, for individual dimensions with greater day-to-day variance, such as Grouping, Hands-on, and Connections, additional observations are required to produce reliable measures. In general, the curves suggest that reliability improves little beyond five or six observations. Thus, another lesson that can be drawn from our results is that obtaining reliable ratings may require multiple visits to the classrooms (as many as five or more depending on the features of instruction of interest) which has direct implications for assessing the cost, efficiency, and ultimately the usefulness of direct observation in classrooms for research and policy purposes.

Figure 2. Dependability of science classroom observation ratings (by number of observations; n = 3 raters).
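The kind of projection plotted in Figure 2 can be sketched with the standard absolute-error formula for a fully crossed Teacher × Rater × Occasion design with random facets: each error component is divided by the number of conditions over which ratings are averaged. The function name and the variance components below are hypothetical and chosen only for illustration; they are not the estimates from Table 3.

```python
def dependability_txrxo(vc, n_raters, n_occasions):
    """Dependability coefficient for a crossed T x R x O design.

    vc: dict of variance components with keys
        't', 'r', 'o', 'tr', 'to', 'ro', 'tro_e'.
    """
    abs_error = (
        vc["r"] / n_raters
        + vc["o"] / n_occasions
        + vc["tr"] / n_raters
        + vc["to"] / n_occasions
        + (vc["ro"] + vc["tro_e"]) / (n_raters * n_occasions)
    )
    return vc["t"] / (vc["t"] + abs_error)

# Hypothetical components (% of variance) with a large teacher-by-occasion term.
components = {"t": 50.0, "r": 2.0, "o": 3.0, "tr": 5.0,
              "to": 20.0, "ro": 0.0, "tro_e": 20.0}

# Projected dependability for two raters and one to eight observations.
curve = [dependability_txrxo(components, 2, n) for n in range(1, 9)]
```

Under these assumed components the coefficient rises steeply over the first few observations and then flattens, because the rater-related terms are not reduced by adding occasions.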

In our study, each teacher completed only one notebook and thus it is not possible to directly estimate the extent of variation in the ratings over time (or, in consequence, to investigate rating reliability as a function of number of occasions sampled). In practice, however, notebook ratings do implicitly consider variation over time because each rating is based on evidence that spans 5 days of instruction. The results in Table 4 suggest that a single notebook rating spanning multiple days of instruction can offer reliability comparable to that attainable by averaging together multiple daily observation ratings. One final lesson that can be drawn from the middle column in Table 4 is that summary observation ratings (single ratings based on evidence from multiple observations taken as a whole) can be used to improve further on the reliability of daily ratings.

Correlations Among Dimensions Within Methods of Measurement

Table 5 condenses the results of exploratory factor analyses investigating the internal structure of notebook ratings (year 1) and summary observation ratings (year 2). The first column for each type of rating shows the results of analyses that tested the hypothesis of unidimensionality. The results generally support the notion that a dominant factor underlies the 10 dimensions of instructional practice in science—as shown in the table the first factor accounts for 50% of the total variance in notebook ratings, and 42% of the variance in summary observation ratings. While these one-factor solutions may appropriately describe the pattern of correlations among dimensions, from a substantive standpoint the fact that about half of the total variance in these measures remains unexplained should not be overlooked. It suggests that additional factors are necessary to fully explain the pattern of correlations among dimensions and the variance of individual dimensions of instruction. Accordingly, additional analyses explored solutions with two and three factors underlying the ten dimensions of science instruction. The model with three factors appeared to fit the notebook and observation data best, both statistically (TLI = 0.97 and 0.95) and substantively. The first factor (which we term Content) groups Cognitive Depth, Discourse Community, Assessment, and Inquiry; the second factor (termed Format) reflects use of Scientific Resources and Hands-on experiences in the classroom. Finally, Structure of Lessons is singled out as a third factor, suggesting that well-structured lesson plans are equally likely to occur in classrooms that differ substantially in terms of the other two factors. This analysis suggests that future studies might investigate scoring notebooks (or observing classrooms) using a rating system where individual aspects of instruction are mapped to or organized around two or three overarching instructional factors.

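Table 5. Factor loadings for notebook and summary observation ratings, one and three-factor solutions

Notebook ratings (per rater), 2003–2004, n = 84: one-factor solution (50% of variance); three-factor solution (RMSEA = 0.071, TLI = 0.97).
Summary observation ratings (per rater), 2004–2005, n = 42: one-factor solution (42% of variance); three-factor solution (RMSEA = 0.059, TLI = 0.95).

| Dimension | Notebook: one-factor (Instruction) | Notebook: Content | Notebook: Format | Notebook: Structure | Observation: one-factor (Instruction) | Observation: Content | Observation: Format | Observation: Structure |
|---|---|---|---|---|---|---|---|---|
| Cognitive depth | 0.81 | 0.83 | 0.33 | 0.44 | 0.83 | 0.75 | 0.41 | 0.52 |
| Discourse Community | 0.81 | 0.87 | 0.26 | 0.18 | 0.79 | 0.84 | 0.34 | 0.20 |
| Scientific resources | 0.59 | 0.37 | 0.92 | 0.30 | 0.66 | 0.38 | 0.80 | 0.20 |
| Structure of lessons | 0.36 | 0.27 | 0.16 | 0.97 | 0.47 | 0.23 | 0.24 | 0.93 |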

Correlations among Methods Within Dimensions

Table 6 presents correlations between notebook, summary observation, and Gold Standard ratings for each dimension of instruction.8 The raw correlation coefficient between Overall notebook and observation ratings is 0.57 (the disattenuated correlation is 0.69). Notably, the correlation between the average rating across dimensions for notebooks and observations is 0.71, suggesting that a simple arithmetic average may be a more consistent measure of practice across dimensions than the holistic average provided by raters (i.e., the Overall rating). Across individual dimensions the raw correlations between notebooks and observations are 0.5 or higher in nearly all cases (the weaker correlation for Structure of Lessons reflects its skewed distribution and lower reliability seen in Tables 3 and 4). The strongest convergence was observed with Hands-on (0.76), Inquiry (0.69), and Discourse Community (0.64).

Table 6. Correlation among notebook, summary observation, and gold standard ratings, by dimension (2003–2004 and 2004–2005 studies). Entries are Pearson correlations, with disattenuated correlations in parentheses.

| Dimension | Notebook–Observation (2003–2004; n = 28) | Notebook–Gold Standard (2003–2004; n = 28) | Observation–Gold Standard (2004–2005; n = 21) (a) |
|---|---|---|---|
| Overall (holistic rating) | 0.57 (0.69) | 0.59 (0.70) | 0.92 |
| Average Index (average rating) | 0.71 | 0.72 | 0.94 |
| Hands-on | 0.76 (0.95) | 0.85 (0.99) | 0.95 |
| Inquiry | 0.69 (0.85) | 0.62 (0.75) | 0.96 |
| Scientific resources | 0.55 (0.72) | 0.59 (0.79) | 0.92 |
| Assessment | 0.54 (0.77) | 0.54 (0.73) | 0.82 |
| Cognitive depth | 0.53 (0.83) | 0.41 (0.75) | 0.95 |
| Connections | 0.55 (0.63) | 0.70 (0.81) | 0.93 |
| Discourse community | 0.64 (0.72) | 0.70 (0.81) | 0.90 |
| Explanation/Justification | 0.62 (0.77) | 0.54 (0.67) | 0.84 |
| Grouping | 0.61 (0.73) | 0.67 (0.80) | 0.96 |
| Structure of lessons | 0.26 (0.39) | 0.26 (0.39) | 0.96 |

(a) Note: Corrected correlations were over 1.00 and are not displayed.

Similar correlations were observed in the year 1 study between notebook and Gold Standard ratings, contrasting with the much higher correlations observed between GS and observation ratings in year 2 (0.92 for Overall ratings and over 0.84 for all dimensions, see Table 6). Because GS ratings combine evidence from notebooks and observations, these correlations represent part-whole relationships and thus cannot be used directly to assess criterion validity.9 However, the correlations reflect the relative weight raters gave to evidence from notebooks and observations when both sources were available. The larger correlations in the second study suggest that raters who visited a classroom may have been more persuaded by or inclined to rely on their own observations than on the evidence contained in the notebook. Where both sources of evidence are available, these results imply a need for explicitly outlining the ways in which the evidence from the notebook should complement and in some cases override evidence garnered during classroom visits.

Notebook Completeness

Nearly every notebook contained teacher-generated artifacts, with the average being nine artifacts per notebook. Similarly, nearly all notebooks contained annotated examples of student work. On the other hand only six in 10 teachers provided examples of formative or summative assessments given in their classroom, and only four in 10 included completed assessments with student answers; this could reflect teacher unwillingness to share assessments or it could mean simply that no assessments were given during the notebook period. Teachers annotated most of the materials they collected in the notebook, but the comments tended to be brief and often added little information that was not apparent in the materials themselves. Every teacher answered the pre-notebook reflection questions, providing information about classroom organization, expectations for student behavior and learning, equipment and material availability, plans, events, and activities affecting classroom practice during the notebook period. While most teachers (82%) provided answers to all daily reflection questions, the quality of the information provided varied considerably from terse responses providing little valuable context, to in-depth commentary illuminating important elements of the lesson. Daily reflections often grew shorter over the notebook period.

Usefulness of Notebook Artifacts for Judging Instructional Practice

Table 7 summarizes rater reports of the usefulness of each source of information in the notebook for judging each dimension of instruction. Reflection questions (in particular daily reflections) were very helpful for judging most aspects of instructional practice. In addition, at least one artifact was considered very useful for judging instruction across dimensions. Instructional materials (e.g., lesson plans, handouts, worksheets) were very helpful for judging inquiry, cognitive depth, and connections/applications, and somewhat helpful for all the remaining dimensions. On the other hand, some artifacts provide useful information for some dimensions but not others; for example, formal assessment artifacts and samples of student work were very helpful for judging assessment, explanation-justification, and cognitive depth, and not as helpful for the remaining dimensions. Finally, the photo log and the white (assessment) labels were of limited value for judging most dimensions of science instruction. The results provide some support for the notion that the combination of artifacts and teacher reflections is useful for assessing instructional practice.

Table 7. Usefulness of artifacts for rating each dimension of science instruction

Teacher Perceptions of the Notebook

Teacher answers to the post-notebook reflection questions offer additional insight on the potential of this instrument for supporting valid measures of instructional practice in science classrooms. The vast majority of teachers said the lessons in the notebook were very representative of their typical instructional practice in that classroom during the year. Most teachers also said the collection of artifacts and reflections in the notebook captured very well what it was like to learn science in their classroom; however, a minority thought that the notebook did not adequately reflect instruction in their classrooms. Teacher feedback points to ways the notebook may be improved for future use. The most frequent suggestions were to extend the notebook period, and to collect additional materials to reflect the organizational and support structures of classrooms. Interestingly, a few teachers suggested supplementing the information in the notebooks with classroom observations.


This article presented the results of two field studies of the Scoop Notebook—an artifact-based instrument for measuring middle school science instruction that combines artifact collection and teacher self report. Our analyses addressed four main research questions concerning reliability, dimensionality, and correlation between notebook ratings and ratings based on direct observations, and lessons for improving artifact-based measures of instruction. The results have bearing on the strength of a validity argument for interpreting notebook scores as reflective of variation in science teachers' instructional practices in the classroom. Equally importantly, the results offer valuable insight into the conceptual and methodological challenges involved in measuring instructional practice in science classrooms. In this section, we discuss the lessons we derive from our studies for the development of better artifact-based instruments that may help address some of these challenges in the future.

Summary of Results (Reliability and Validity of Notebook Ratings)

In our study, global (overall) ratings of instruction based on our instrument showed reliability comparable to that attainable through direct observation over multiple classroom visits. For individual dimensions of instruction, the results were mixed: reliability was adequate for some dimensions (e.g., Grouping, Hands-on, Discourse) but not for others (e.g., Cognitive Depth, Inquiry, Assessment, Structure). Factor analyses point to similar (albeit not identical) factorial structures underlying notebook and observation ratings. Finally, we found sizeable correlations (0.60–0.70) between Overall and Average notebook ratings and their observation counterparts; for individual dimensions the correlations are over 0.50 in nearly all cases, further bolstering the claim that the two methods are measuring the same selected aspects of science instruction.

Overall, these results suggest that carefully constructed artifact-based instruments hold potential for measuring instructional practice in science classrooms; at the same time, there is ample reason to caution against over-interpretation and misuse of ratings of instruction based on the notebook. Notebook ratings can be valuable for describing instructional practice for groups of science teachers, and to assess curriculum implementation or track change in practice over time (Lee, Penfield, & Maerten-Rivera, 2009). Aggregate measures can also be used to evaluate the effect of interventions or professional development programs on the practices of groups of teachers (Bell, Matkins, & Gansneder, 2011). In its current form, however, use of the notebook for decisions or judgments about individual teachers on individual dimensions is not warranted. Portfolio instruments may be appropriate for use within a multiple indicator system for assessing teacher performance, but further validation research would be needed to justify such uses. Moreover, additional research with larger samples of teachers is needed to investigate how notebooks function with different groups of teachers (e.g., novice and expert) or students (e.g., low or high performing), and in different types of classes (e.g., lower vs. college track).

The Notebook Validation Studies: Lessons Learned

Our studies set out to shed light on the technical aspects of developing reliable and valid measures of instruction in science classrooms. The psychometric results and our experience conducting the study emphasized the close interconnectedness of the technical and conceptual challenges involved in measuring instructional practice. In the following section we discuss the implications and lessons we draw from our studies for the development of better instruments for measuring instructional practice.

Dimensions of Instruction: Sources of Variance and Sources of Evidence

The first series of lessons is related to the implications of variation in practice over time for measuring different dimensions of instruction. Considerable day-to-day variability affected the reliability of daily ratings for some dimensions (e.g., Grouping). For dimensions with large daily fluctuations it may be preferable to assign a single score over time than to assign daily ratings. Summary observation and notebook ratings take this holistic approach, resulting in considerable improvement in reliability for ratings of the Grouping dimension. Thus, an “overall” approach (i.e., assigning a single score that takes into account the variation in practice observed over time) may be better suited for measuring dimensions of practice that vary considerably from day to day. Finally, it should be noted that variation over time and average quality are not directly related; large fluctuations in practice could signal high quality instruction (i.e., instruction that is varied and adaptable to lesson content and student progress) in some classrooms, while similar variation in other classrooms might still be accompanied by low quality instruction.

A second lesson concerns the fit between dimensions and sources of evidence. In our study notebook ratings of Hands-on, Inquiry, Scientific Resources, and Structure of Lessons have lower reliability than observation ratings. The evidence suggests that these dimensions may stand out clearly when observing teachers in classrooms, but are likely more difficult to discern on the basis of notebook contents alone (e.g., hands-on use of materials will be apparent in class, while notebooks can only offer indirect evidence via artifacts such as photographs or worksheets). Conversely, notebook ratings of Assessment, Explanation/Justification, and Discourse Community are more reliable than observation ratings. For these dimensions, artifacts accompanied by teacher reflections may offer a clearer and more comprehensive picture of practice than would be available to classroom observers, who may not have access to materials, and may not visit a classroom when key instructional activities are occurring. These findings are encouraging for the future of artifact-based measures, given the importance of classroom assessment and student-generated explanations in current thinking on science instruction (Gerard et al., 2010; Ruiz-Primo et al., 2010).

The results of the factor analyses shed further light on the close interplay between dimensions and sources of evidence in the notebook and observation ratings. While we found similar factorial structures with notebook and observation ratings, the differences point to interesting unresolved questions about the precise nature of the constructs measured with each approach. Given the high levels of inference involved in judging the complex dimensions of instruction in the model, these results hold important clues about how the two methods may be influenced by (or privilege) different sources of evidence of practice. For example, notebook ratings of Grouping relate to the content factor more strongly than observation ratings, suggesting that notebook readers considered not only the frequency of group work, but also the cognitive nature of the work carried out in the groups. Inquiry is more closely tied to the format factor in observation ratings, and to the content factor in notebook ratings, suggesting that observers judged this dimension primarily in terms of physical arrangements and formal activities carried out in classrooms (e.g., laboratories, experiments), whereas notebook raters may have more closely considered the cognitive nature of the activities as intended in the scoring guide for this dimension. In general, because notebook readers rely on teacher reflections to illuminate the artifacts collected, notebook ratings of some dimensions (e.g., grouping, inquiry) may be highly influenced by the quality and depth of teacher reflections, or the overall cognitive depth reflected by the contents of the notebook. Classroom observers on the other hand process large amounts of visual and auditory information about classroom activity and discourse, which for some dimensions may lead to judgments of instruction that are influenced by routine processes, types of activities, and physical arrangements in the classroom.

The correlations between notebook and observation ratings also provide insight into the match between measures and sources of evidence. While the correlations are sizeable (i.e., 0.50 or larger) they are not so high as to suggest complete convergence across methods. More likely, notebooks and observations tap into some overlapping aspects of science instruction, and each tool also reflects unique features of instructional practice not adequately captured by the other method. The assessment dimension presents a prime example: notebooks are well suited to measure formal assessment practice through the variety of artifacts collected, but are inherently limited in the extent to which they can convey informal, on-the-fly assessment. Conversely, instances of informal assessment will be evident to classroom observers, but more formal aspects of assessment might be difficult to gauge after only two or three visits. This raises questions about the use of observations as the central validation criterion for notebook ratings (or other measures of instruction), when a combination of both methods would yield a more complete picture of instructional practice in science than either method is capable of by itself. Interestingly, Matsumura et al. (2008) reached a similar conclusion in their investigation of tools for measuring instructional practice in literacy and mathematics lessons. Since different artifacts and sources of information may provide better evidence for some dimensions than others, it is crucial to start with a clear idea of what aspects of practice are of interest, consider the potential sources of evidence available, and select those sources that balance evidentiary power with practicality.

Grain Size and Level of Inference in Measures of Instructional Practice

Our experience in conducting these studies offers valuable insights into the challenges faced in designing measures to capture complex features of science instruction, and the potential for tradeoffs between the reliability and validity of the measures (Moss, 1994). Lessons can also be drawn about the interplay between the grain size and level of inference of a measure of instruction. Grain size refers to the scope of a construct (its conceptual richness or internal dimensionality), while level of inference refers to the distance between the evidence in the notebook and the scores assigned by a rater on a dimension—that is, ratings may map directly to artifacts in the notebook, or require considerable inference and interpretation from raters.

Each dimension in our model represents a rich construct that may involve multiple features of science instruction. The conceptual richness resulting from this large grain size is an important aspect of the validity of the resulting measures of science instruction. At the same time, the reliability and practical usefulness of these measures rest on the ability of the rubrics to map features of science instruction to scores in a way that minimizes subjectivity in rater interpretation and therefore maximizes the reliability of notebook ratings. Compared to measures of smaller grain size, the inclusion of multiple features of instruction within a dimension can negatively impact the reliability of ratings, or alternatively require additional rater training to attain the same reliability. For example, in our study, the assessment dimension had generally low levels of agreement and reliability. As defined in our model, this dimension encompasses both formal and informal aspects of classroom assessment practice, which rater reports suggest sometimes made it difficult to synthesize practice into a single score.

Our experience suggests that to maximize rating consistency it is useful to define dimensions in terms of fewer features of practice, provided that this does not compromise the core essence of the dimension. Thus, one possibility would be to use a larger number of dimensions, each encompassing fewer features of practice. For example, separate dimensions could be defined for formal and informal components of classroom assessment, and informal assessment might be further separated into dimensions for initial assessments of prior knowledge and monitoring of small group activities. Crucially, however, narrowing a measure of instruction to improve its reliability can also raise important questions of validity or usefulness. The reliability-validity paradox is well known in the measurement literature: while adequate reliability is generally necessary for validity, after a certain point increasing reliability by narrowing a measure limits the variance it can share with others, thus effectively decreasing validity (see e.g., Li, 2003). For example, while the number of quizzes administered per week is easier to measure reliably, it also yields less rich and less useful information than a larger grain assessment practice dimension, other things being equal. Finally, a larger number of items, even of small grain size, can increase the burden on raters or observers.

If multiple features of practice are needed to adequately characterize a dimension, another approach would be to provide weights or precise decision rules describing how these features should be condensed into one rating. This approach reduces the latitude of raters in interpreting the scoring guides and can improve consistency. For example, judgments of grouping based on the frequency of group work will differ from others based on the nature of activities conducted in groups; to improve rater consistency without omitting either feature, our guides specified different combinations of frequency and type of activity leading to an intermediate rating (see rating guide in the Supplementary online appendix). Admittedly, the complexity of the dimensions may make it difficult to specify all possible combinations of activities that comprise the dimension in a rubric, and the weights to be assigned to each of them in different classroom contexts and scenarios. These issues concerning grain size, reliability, and validity warrant investigation in future research.
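The idea of an explicit decision rule can be illustrated with a minimal sketch. The category labels, the 1 to 5 scale, and the equal weighting below are hypothetical choices made only for illustration; they are not the rules in our rating guide.

```python
# Hypothetical decision rule for a Grouping rating on a 1-5 scale.
# The levels and the equal weighting are illustrative assumptions,
# not the contents of the actual rating guide.

FREQUENCY_LEVELS = ("rare", "occasional", "frequent")
ACTIVITY_LEVELS = ("procedural", "mixed", "conceptual")


def grouping_rating(frequency: str, activity: str) -> int:
    """Map (frequency of group work, nature of group activity) to a rating."""
    if frequency not in FREQUENCY_LEVELS:
        raise ValueError(f"unknown frequency level: {frequency!r}")
    if activity not in ACTIVITY_LEVELS:
        raise ValueError(f"unknown activity level: {activity!r}")

    f = FREQUENCY_LEVELS.index(frequency)  # 0, 1, or 2
    a = ACTIVITY_LEVELS.index(activity)    # 0, 1, or 2

    # Equal weighting: the sum ranges from 0 to 4, so adding 1
    # yields a rating between 1 and 5.
    return 1 + f + a


# Two different combinations lead to the same intermediate rating.
print(grouping_rating("frequent", "procedural"))  # 3
print(grouping_rating("rare", "conceptual"))      # 3
```

Writing the rule down in this form makes the weighting explicit: any two raters who agree on the two component judgments will necessarily assign the same score, and disagreement is confined to the component judgments themselves.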

Ultimately, efforts to improve reliability by tightening the scoring rules have to be balanced with the need to avoid narrow rules that are not responsive to the range and complexity of science instruction. The tradeoffs between small and large grain size, and between low and high inference, when measuring instruction are well known: measures with large grain size can offer rich, contextualized information about instruction with high potential value for informing policy analysis and professional development. This type of measure has thus become standard practice for measuring instruction through observation (see e.g., Grossman et al., 2010; Hill et al., 2008; Pianta, Hamre, Haynes, Mintz, & Paro, 2009), video (see e.g., Marder et al., 2010), artifacts (Matsumura, Garnier, Pascal, & Valdés, 2002), and portfolio instruments (e.g., Pecheone & Chung, 2006; Silver et al., 2002). However, a large grain size also carries higher development, training, and collection costs that may not be realistic for many applications, and in some cases it ultimately results in measures that do not improve upon, or even fail to match, the reliability of survey-based measures. Small grain size ratings offer superior reliability, but are limited in their ability to capture instruction in its full complexity: rich, contextualized information about a construct as complex as instruction cannot be obtained by merely aggregating narrow, de-contextualized indicators. Researchers and developers thus need to be mindful and explicit about their goals, assumptions, and choices in considering the tradeoffs between specificity (i.e., reliability) and richness (i.e., validity) of the dimensions included in their measures of instruction. Importantly, the grain size of a measure is not dictated by the framework used to develop it; any framework poses similar tradeoffs for researchers between large and small grain sizes, reliability, and validity. Within any given framework some aspects of instructional practice (e.g., discipline, grouping, hands-on practices) may lend themselves more easily to quantification through small grain-size measures and constructs, while others (e.g., explanation/justification, inquiry) may be best captured through broader constructs.

Similarly, in designing artifact-based and observational measures of instruction it is important to consider the appropriate combination of grain size and level of inference. While the two are often positively correlated, grain size and level of inference should be understood as distinct and important aspects of a measure. The assessment dimension again offers a good case study: as discussed before, one option for improving the reliability of this dimension would be to split it into two dimensions capturing only formal or informal aspects of assessment. While narrower than the original, both dimensions would still clearly encompass constructs of fairly large grain size. However, the constructs would differ in terms of level of inference as it relates to our instrument: formal assessment can be more directly quantified from artifacts in the notebooks, while informal assessment requires considerable inference from artifacts, and teacher annotations and reflections. In general, irrespective of grain size, developers try to minimize the level of inference in their measures to the degree this is reasonable and useful. It is important to note, however, that inferential leaps are always required for deriving contextualized ratings of complex constructs like instruction, and that a potential advantage of portfolio-like instruments is precisely that the artifacts collected can help strengthen these inferences, and ameliorate concerns about validity and teacher self-report bias. Developers should thus aim to balance grain size and level of inference to suit the intended uses and interpretations of their measures. One final recommendation for improving consistency when high inference measures are deemed desirable is to offer specific guidance to raters about the sources of information in the notebook that should be considered when rating the dimension. For example, our rubrics instruct raters to look for indirect evidence of Discourse Community in teacher reflections and lesson plans, and to look for evidence to inform their ratings of hands-on and grouping in the photographs and the photograph log.

Refinements to Notebooks and Materials

Our experience suggests a number of ways in which our instrument could be improved for future use. Rater perceptions about the usefulness of notebook materials for rating instructional practice revealed two broad patterns: First, the usefulness of artifacts varied across dimensions of practice; not surprisingly the most revealing artifacts for judging Cognitive Depth are not the same as those most revealing of Hands-On activities. This has direct implications for notebook design. If interest centers only on a subset of dimensions of science instruction some artifacts could be eliminated without significant loss of clarity. Conversely, if some artifacts are eliminated the notebook will lose power for describing some dimensions of practice but not others. Secondly, raters found that teacher reflections were as revealing of instructional practice as the artifacts in the notebooks. As noted earlier, we hypothesized that artifacts and reflections combined would provide a more complete picture of science instruction than either source alone. The findings seem consistent with this hypothesis, suggesting that artifacts are most informative when illuminated by teacher reflections. For example, while assessment artifacts may be revealing of teacher expectations and cognitive demand, teacher reflections describing the learning goals and the way the lesson developed provide crucial contextual information to better interpret these artifacts. Finally, some artifacts in the notebook appeared to be generally less valuable for measuring instruction than others; specifically photos and photo logs were not consistently provided and when present were of limited usefulness. For these reasons, and because providing disposable cameras increases the cost and burden to both teachers and researchers, these artifacts could be omitted in future versions of the notebook.

Feedback solicited from teachers after they completed their notebooks also points to potential areas for improvement. The teachers generally felt that more days of data collection and more materials would provide a more accurate representation of instruction in their classrooms. Future studies should investigate the use of portfolios spanning different periods of time (e.g., 5, 10, or 20 days of instruction) and assess the properties of the resulting measures against the cost incurred in the collection and scoring of the portfolio. Studies are also needed to explore different configurations of notebooks for collecting data over time (e.g., five consecutive days of instruction, 5 days in a fixed period of time, 5 days in content units of variable length). Moreover, informal discussions with teachers and a review of notebook contents reveal that teachers provided shorter and terser annotations and responses to reflection questions as days went by. This pattern suggests that notebooks should be designed to minimize the amount of daily open-ended writing required of teachers. One possibility is to use shorter questions for daily reflections and for the self-adhesive notes; another is to eliminate daily reflections altogether and to incorporate some of this information in responses on the adhesive notes.

Unanswered Questions and Next Steps

What Dimensions Should be Used to Characterize Science Instruction?

Several questions remain related to the measurement of science instruction in general and the use of artifact-based measures of science instruction in particular. The first involves the choice of model and dimensions of science instruction to measure. Albeit still highly influential, the National Science Education Standards are only one possible model of instruction. Other models have been proposed which may emphasize different aspects or dimensions of instruction (see e.g., NRC, 2007; Luykx & Lee, 2007; Windschitl, 2001). In particular, future studies aiming to measure instructional practice in science should incorporate the new common core science standards that will be developed from the 2011 framework for K-12 Science Education offered by the National Academy (NRC, 2011). Unlike others in the past, the new NAS framework specifically highlights implications for instruction and offers a general vision of instructional practice that emphasizes scientific practices and coherently carries core scientific notions across disciplines and grades (NRC, 2011, p. 10-9). While the development of measures and rating schemes would require a far greater level of detail than the framework provides, the attention to instruction in the framework suggests that such guidance will be present in the forthcoming common core science standards.

To be more relevant for instructional improvement, the dimensions should reflect a fully articulated, widely adopted set of standards based on the new 2011 framework. As discussed in the introduction, there is considerable overlap between this framework and the dimensions of instruction derived from the 1996 NRC standards used in this study, so that a set of dimensions of instructional practice constructed to reflect the new framework will likely bear more than a passing resemblance to the dimensions in our measures. Nevertheless, it is likely that new dimensions would differ in important ways from the rating dimensions used in this study. For example, new dimensions would be needed to incorporate practice related to equity and social issues, and quantitative analysis and summary of data, both of which are prominent in the new framework. One or more dimensions are also needed to capture the new explicit focus on engineering practices and problems. Some dimensions in our model would likely also need to be revised to incorporate some of these key components of the new framework that we expect will be reflected in common core standards. Finally, our dimensions currently focus on pedagogy and are not anchored to specific scientific content. The 2011 framework makes this approach more difficult in some cases because a coherent treatment of core disciplinary ideas and concepts across lessons, grade levels, and disciplines is at the heart of the model. For example, we reduced redundancy in this study by eliminating focus on conceptual understanding from the definition of Structure of Lessons and Discourse Community. This approach would not work under the new framework; instead, the dimensions would need to be revised to incorporate an explicit expectation for continued and cohesive treatment of cognitively complex core ideas over time.

A closely related issue concerns the appropriate degree of overlap between the dimensions in the model. While each dimension represents a distinct aspect of instructional practice, the dimensions are closely related conceptually (and empirically as demonstrated by the factor analysis results). A review of the scoring rubrics (available in the Supplementary online appendix) further highlights the areas of overlap between some of them. For example, the description for high scores in Grouping also hints at a constructivist-oriented pedagogy and high levels of cognitive demand, while low scores reflect more didactic teaching. Discourse Community emphasizes frequency and mode of communication, but also includes peer questioning and review. The overlap reflects cross cutting themes embedded in the NRC model, but is more generally symptomatic of the conceptual difficulty entailed in disentangling the elements of complex multidimensional constructs. It may be possible to use a smaller set of dimensions condensed from the NSES (or other) model while still providing useful information about instruction for specific purposes. In addition, using fewer, less overlapping dimensions could also help improve rating consistency and reliability.

At the same time, the results suggest that reducing the number of dimensions can result in substantial loss of information about unique aspects of science instruction that would not be adequately captured by aggregate indices. Thus, improvements in reliability may come at the cost of losing nuance and richness in the information. The Overall dimension in our rating guidelines can be seen as an extreme example of this type of overlap, where all the conceptual richness of the model and all the evidence in the notebook are condensed into a single rating. This dimension has reasonable empirical backing in the form of a large proportion of variance explained in a factor analysis, and in fact it exhibits better measurement properties than some of the individual dimensions. However, it is also apparent that its usefulness for most or all practical purposes is at least questionable—an Overall rating of 3 (medium) contains little specific information of value to form an impression or a judgment about any one teacher's instructional practices, or the areas of relative strength or weakness.

Thus, in deciding what model to use to characterize instruction, consideration should be given to the inferences sought and the uses intended. In general, for evaluative purposes (or other purposes for which reliability is key) we caution against using large numbers of overlapping dimensions in favor of fewer, more conceptually distinct dimensions. For other purposes (e.g., professional development, providing feedback to teachers) retaining more dimensions that reflect specific aspects of science instruction will likely be desirable. This process is closely related to the considerations about grain size discussed previously. The 2011 framework is organized around three macro dimensions (scientific practices, disciplinary core ideas, and crosscutting concepts), each sub-divided into multiple elements (e.g., eight for scientific practices, six for crosscutting concepts). In principle these elements might represent the appropriate initial grain size for measuring instruction, but as noted earlier they are themselves rich, overlapping constructs, and capturing them reliably may pose a significant measurement challenge. For example, a developer would need to decide how to handle the substantial overlap between discursive processes captured in developing arguments from evidence and communicating information (scientific practices 7 and 8 in the new framework).

How Many Days (or Topics) are Needed for Reliable Notebook Ratings?

Another critical issue in developing measures of instruction involves the sampling of observations and topics. Our studies confirm the notion that variation in teacher practice over time is an important source of error in the measures (Shavelson et al., 1986), and suggest that five or more observations may be needed to support reliable global ratings of instructional practice in science classrooms, with longer periods likely needed for judging some individual dimensions of instruction. One could extend this claim to suggest that a 5-day period may also be sufficient for notebooks to support reliable judgments of science instruction across the school year; however, this extrapolation requires empirical testing. For each classroom, we collected a single notebook in a single science unit and therefore our data do not allow us to assess the degree to which our measures can be generalized over time or to other science topics. Additional research is needed employing designs that include collection of multiple notebooks from each teacher to allow investigation of questions related to variation in practice over time and over topics. While the NRC model of science instruction applies in principle across the range of contents found in middle school science curricula, one could certainly imagine that some science units are more conducive than others to particular instructional practices. If present, this science content specificity could call for designing portfolios to capture instructional practice that spans multiple units of science content.
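The logic behind the question of how many days are needed can be sketched as a simple decision-study projection. The sketch below assumes a simplified one-facet design (teachers crossed with days) in which error variance for a teacher's mean rating shrinks in proportion to the number of days sampled. The variance components are hypothetical values chosen for illustration, not estimates from our studies, and the function names are our own.

```python
def dependability(var_teacher: float, var_error: float, n_days: int) -> float:
    """Projected dependability of a teacher's mean rating over n_days days.

    Simplified one-facet design: the error variance of the mean is
    var_error / n_days, so dependability rises as more days are sampled.
    """
    if n_days < 1:
        raise ValueError("n_days must be at least 1")
    return var_teacher / (var_teacher + var_error / n_days)


def days_needed(var_teacher: float, var_error: float,
                target: float, max_days: int = 60):
    """Smallest number of days reaching the target coefficient, else None."""
    for n in range(1, max_days + 1):
        if dependability(var_teacher, var_error, n) >= target:
            return n
    return None


# Hypothetical variance components (NOT estimates from the studies).
VAR_TEACHER = 0.40  # stable between-teacher variance
VAR_ERROR = 0.60    # day-to-day variation and residual error

for n in (1, 3, 5, 10):
    print(n, round(dependability(VAR_TEACHER, VAR_ERROR, n), 2))

print(days_needed(VAR_TEACHER, VAR_ERROR, target=0.70))  # 4
print(days_needed(VAR_TEACHER, VAR_ERROR, target=0.85))  # 9
```

Under these assumed values, a modest increase in the target coefficient more than doubles the number of days required, which illustrates why the sampling of days has to be weighed against collection and scoring costs. A fuller design would also include raters and topics as facets.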

What are the Most Cost-Effective Uses of Notebooks?

An important set of considerations in using portfolio-type measures relates to the cost of data collection and scoring. The viability of a teacher-generated artifact-based instrument like ours for large-scale use rests not only on the reliability and validity of the measures of instruction it may yield, but also on the cost of collecting and scoring the evidence in the notebooks. Our experience conducting these studies suggests that teachers spent an average of 10–12 hours compiling and annotating materials for the notebook, and raters invested an average of 45–60 minutes scoring each notebook. More systematic cost-effectiveness studies are needed to assess the burden that the various notebook components place on teachers and raters, and to explore ways to streamline notebook collection and rating without negatively affecting reliability and validity. Finally, a comprehensive cost-benefit analysis of notebooks would ideally include a systematic investigation of their potential value as formative tools for teacher professional development.
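As a rough illustration of how these figures scale, the sketch below totals teacher and rater time for a hypothetical data collection. It uses the approximate ranges reported above (10–12 teacher hours and 45–60 rater minutes per notebook) and assumes one notebook per teacher; the sample size and the number of independent ratings per notebook are assumptions chosen for the example.

```python
def notebook_burden_hours(n_teachers: int,
                          ratings_per_notebook: int = 1,
                          teacher_hours=(10.0, 12.0),
                          rater_minutes=(45.0, 60.0)):
    """Return (low, high) total person-hours, assuming one notebook per teacher.

    Defaults reflect the approximate averages reported in the text;
    ratings_per_notebook is an assumption supplied by the user.
    """
    if n_teachers < 0 or ratings_per_notebook < 0:
        raise ValueError("counts must be non-negative")
    low = n_teachers * (teacher_hours[0]
                        + ratings_per_notebook * rater_minutes[0] / 60.0)
    high = n_teachers * (teacher_hours[1]
                         + ratings_per_notebook * rater_minutes[1] / 60.0)
    return low, high


# Hypothetical scenario: 100 teachers, each notebook scored by 2 raters.
low, high = notebook_burden_hours(100, ratings_per_notebook=2)
print(f"{low:.0f} to {high:.0f} person-hours")  # 1150 to 1400 person-hours
```

In this scenario teacher time accounts for most of the total, which suggests that streamlining what teachers are asked to compile would reduce burden more than shortening the rating process.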

How Should Measures of Instruction be Validated Against Student Achievement?

A final but critical issue concerns the types of evidence that should support the validity of notebook ratings as indicators of instructional practice in science classrooms (Jaeger, 1998). In our studies, validity was assessed with reference to a particular model of instruction by looking at the dimensionality of the measures, and their relationship with other instruments measuring the same dimensions of instruction (i.e., observations). As with other measures of instruction, however, it is also critical to consider the relationship to relevant student outcomes. The value of a portfolio tool for measuring science instruction ultimately rests on its ability to diagnose and improve instruction, which is conceived (implicitly or explicitly) as a mechanism for improving student science learning. Because we were not able to collect measures of student science learning in our studies, an important piece of the validity argument for these measures is missing. Future studies should thus include investigation of the relationship between measures of science instruction and relevant student science learning outcomes.


The findings in the studies presented in this paper suggest that artifact-based instruments like the Scoop Notebook may hold promise for supporting reliable and valid measures of instructional practice in science classrooms. Artifact collection can be suitable for capturing important components of science instruction that may be difficult to capture using other types of instruments (e.g., long-term science projects, group collaborations), and components that do not occur regularly and are thus difficult to capture through direct observation (e.g., cognitive challenge of assessment, written feedback to students). Finally, artifact-based measures or portfolios may have value as tools for professional development. However, the use of this type of instrument can also represent a considerable challenge for researchers and practitioners. Our studies point to areas in need of improvement with our current instruments, but more generally they provide useful insight into the strengths and limitations of portfolio-type instruments, and their potential value as part of a comprehensive model of science assessment.

Overall, the studies highlight some of the key conceptual, methodological, and practical issues faced in developing portfolio-type instruments anchored on a comprehensive model of science instruction.

The authors would like to thank three anonymous reviewers and the editors for their thoughtful comments and suggestions for improving the manuscript. We also thank Matt Kloser at Stanford for providing valuable feedback to strengthen the final version of this document. The work reported herein was supported under the Educational Research and Development Centers Program, PR/Award Number R305B960002, administered by the Institute of Education Sciences (IES), U.S. Department of Education.


1A more detailed presentation and discussion of methodological approaches and results is provided in the Supplementary online technical appendix.

2Each teacher received a $200 honorarium for participating in the study.

3The designs are incomplete because not all raters evaluate all notebooks. However, because the cells in the design are missing at random we treated our design as fully crossed for estimation purposes. Variance components were estimated using SAS Varcomp (SAS Institute Inc., 2003) with minimum variance quadratic unbiased estimates (MIVQUE) to take into account the imbalanced sample sizes in the design (Brennan, 2001).

4Principal component analysis was first carried out in SPSS v.16 (SPSS Inc., 2007) to investigate the hypothesis of unidimensionality; this was followed by factor analysis with OLS extraction and oblique rotation using CEFA (Tateneni, Mels, Cudeck, & Browne, 2008). Solutions were assessed through RMSEA and TLI fit indices alongside substantive considerations. Due to the small sample sizes available, for these analyses we considered different raters as separate data points.

5Raw correlations are biased downward by unreliability of the measures. Disattenuated correlations are also shown; these estimate the theoretical true correlation that would be obtained without measurement error.
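For reference, the standard correction for attenuation can be written as follows. The notation is ours, the note does not state which reliability estimates entered the correction, and the worked numbers are hypothetical.

```latex
% Standard correction for attenuation (notation assumed, not from the source):
%   r_{XY}  : observed correlation between notebook (X) and observation (Y) ratings
%   r_{XX'} : reliability of the notebook ratings
%   r_{YY'} : reliability of the observation ratings
\[
  \hat{\rho}_{T_X T_Y} \;=\; \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}
\]
% Hypothetical worked example:
\[
  r_{XY} = 0.60,\quad r_{XX'} = 0.80,\quad r_{YY'} = 0.70
  \;\Longrightarrow\;
  \hat{\rho}_{T_X T_Y} = \frac{0.60}{\sqrt{0.56}} \approx 0.80
\]
```

Because the denominator cannot exceed 1, the disattenuated value is never smaller in magnitude than the raw correlation, and the gap between the two widens as either reliability falls.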

6The indices of agreement are based on different samples of teachers over 2 years. While the teacher samples were very similar across studies, and the rater pool was nearly identical, the comparison of agreement indices offered is indirect and warrants some caution.

7As with any reliability coefficient, dependability coefficients cannot be judged without reference to a specific use and context; coefficients of 0.70 are often acceptable for research or low-stakes uses, whereas 0.80–0.90 is typically needed for decisions about individual subjects.

8Correlations involving notebooks shown come from the year 1 study. In the year 2 study, independent notebook ratings were not obtained for each classroom.

9Where resources permit, future studies should try to obtain Gold Standard ratings independent from notebook and observation ratings (i.e., assigned by a separate pool of raters).