The success of all students in science has become a priority in countries throughout the world, as governments have increasingly realized that their economic futures depend on a workforce that is capable in science, mathematics, and engineering (Kilpatrick & Quinn, 2009; Duschl, Schweingruber, & Shouse, 2007). A particular focus in policy discussions is on science in the elementary grades, where children's early attitudes and orientations are formed. Yet science education is particularly problematic in elementary schools. Numerous surveys have found that elementary teachers are often unsure of themselves in science, with little confidence in their science knowledge or pedagogy (Harlen & Qualter, 2008; Cobern & Loving, 2002; Pell & Jarvis, 2003). Since the appearance of the National Science Education Standards (National Research Council, 1996, 2000, 2012) and the Next Generation Science Standards (www.nextgenscience.org/next-generation-science-standards) frameworks, there has been general agreement in the United States about what students should learn in science, and a consensus that science should be taught using inquiry-oriented methods that emphasize conceptual understanding rather than just facts. Yet beyond this broad agreement, what do we know about what works in elementary science? There has been a rapid increase in the use of rigorous experimental methods to evaluate educational programs of all kinds, and this is beginning to have a significant impact on science education (see Marx, 2012; Penuel & Fishman, 2012). However, experiments evaluating practical applications of alternative science programs and practices are still rare at all grade levels.
Vitale, Romance, and Crawley (2010), for example, reported that experimental studies with student learning as an outcome accounted for only 16% of studies published in the Journal of Research in Science Teaching in 2005–2009, and this percentage has declined since the 1980s. Most of the few experiments are brief laboratory-type studies, not evaluations of practical programs in real schools over significant time periods.
There have been several reviews of research over time on various aspects of science education, such as inquiry teaching (Anderson, 2002; Bennett, Lubben, & Hogarth, 2006; Furtak, Seidel, Iverson, & Briggs, 2012; Minner, Levy, & Century, 2010; Shymansky, Hedges, & Woodworth, 1990), small-group methods (Bennett, Lubben, Hogarth, & Campbell, 2004; Lazarowitz & Hertz-Lazarowitz, 1998), and overall methods (Fortus, 2008; Hipkins et al., 2002; Schroeder, Scott, Tolson, Huang, & Lee, 2007). Yet the studies reviewed in all of these are overwhelmingly secondary, not elementary. For example, the Schroeder et al. (2007) review identified 61 qualifying studies, of which only 6 took place in elementary schools. Minner et al. (2010), in a review of inquiry-based science instruction, found 41 of 138 studies to focus on elementary science, but many of these were of low methodological quality, according to the authors. Furtak et al. (2012) identified 22 studies evaluating inquiry methods published in 1996–2006, but only 3 of these involved grades K-6.
While there have been several reviews of research on various aspects of science teaching, there has not been a comprehensive review of experimental evaluations of alternative approaches to elementary science education. The only review of all research on elementary science within the past 25 years is an unpublished bibliography of research and opinion about science education written for Alberta (Canada) school leaders (Gustafson, MacDonald, & d'Entremont, 2007). A review of research focusing specifically on elementary science approaches is important for several reasons. Science is very different in elementary schools than in middle or high schools, so findings from studies of secondary science may not apply to elementary science teaching. Elementary science is almost always taught by non-specialists, teachers who are responsible for all other subjects and rarely have university degrees in science (Epstein & Miller, 2011). A recent survey (Trygstad, Smyth, Banilower, & Nelson, 2013) found that only 36% of elementary teachers who teach science took at least one university course in each of life, earth, and physical science, 21% took a course in only one of these, and 6% took none at all. Most teachers reported being very well prepared in reading (80%) and math (77%), but not science (39%). As a result, innovations in science education may need to support teachers' content knowledge and to help them manage limited time and resources. Also, there is little time set aside for science in most elementary schools, and elementary schools rarely have the labs and equipment common in middle and high schools. In the United States and the United Kingdom, among other countries, science is not tested as part of state or national accountability until secondary school, so science often receives less attention, with time and resources devoted instead to reading and math. In contrast, middle and high schools almost invariably have specialist science educators with regular periods set aside for science, and there is eventually accountability for science learning.
It is important to note ways in which science is different from math and reading, the subjects that have been most often studied using the experimental methods emphasized in this review. First, science covers such a broad range that it is typically taught in time-limited units. For example, a fourth grade teacher might teach a 4-week unit on electricity, another on cell functions, and a third on volcanos. These topics do not build on each other, the way skills in reading or math do. Further, there is relatively widespread agreement about the ultimate goals of elementary reading and math instruction, and accountability measures and state standards clearly define what those goals will be. In contrast, the content of science standards is constantly evolving, and is fiercely contested. The lack of science assessments in most states leaves the ultimate goals of science instruction more open to local variation. These aspects of science are important context for this review, and will be discussed in introducing the review methods and explaining the findings.
Affordances and Limitations of Quantitative Reviews in Science Education
This review applies to elementary science education a quantitative synthesis of experimental studies of practical applications of alternative science approaches. It is important to note that the quantitative review methods applied in this article are well suited for some questions but not others, and the review does not pretend to encompass all questions. In particular, the focus of this review is squarely on innovations in elementary science, and not on the objectives of instruction. When it is clear what is to be taught, and the question is what materials, methods, and professional development will best accomplish the desired outcome, this review's methods are arguably appropriate, focusing on differential outcomes of different approaches on objectives that all teachers are trying to help their students attain. In contrast, changes in the objectives of instruction, such as emphasis on particular topics, are guided by different imperatives, such as scientific advances and philosophical debates about the purpose of science education, that are less amenable to experimental evaluation.
Science standards develop because scientists, science educators, economists, policy makers, and the general public come to believe that a given topic deserves more attention, or teaching of a given topic at a particular time is believed to facilitate further learning (Wilson, 2009). For example, the recently released Next Generation Science Standards (www.nextgenscience.org/next-generation-science-standards) emphasize the interconnected nature of science, deeper understanding of content, and a greater focus on engineering and technology. They encourage more focus on topics such as climate change and evolution. These standards are statements of a set of understandings and values of what science education should achieve in the modern world. If a student learns more about climate change, that is of value in itself, and teachers teaching about climate change need not be compared to those who do not teach about climate change. What meaningful measure could assess the difference? A test of climate change begs the question, while any other test misses the essence of what the new standards are intended to accomplish.
This distinction between the study of teaching methods and of objectives matters in that many studies in science education evaluate objectives that are different from those emphasized in schools today. Such studies may, for example, provide students with innovative instruction on science topics that students might never otherwise see. They might then give pre- and posttests to note gains made by the students who received the novel content and then compare gains on such measures to those made in a control group. However, in such a study the experimental-control comparison is not a meaningful evaluation, because it is obvious that students who are taught content that others are not taught at all will learn more of that material. The value of the novel curriculum, in this case, has to be demonstrated in some other way, perhaps using observations, judgments by experts, or international benchmarking. These research methods may be appropriate and rigorous within their own genres.
Even where the objectives are common to all classrooms, having outcome measures that are fair to intervention and control groups is a key methodological consideration. This review excludes studies in which experimenters made their own outcome measures closely aligned with their treatments and then did not ensure that students in the control group were exposed to the content or skills measured on their purpose-built assessments. As one example, Vosniadou, Ioannides, Dimitrakopoulou, and Papademetriou (2001) evaluated an approach to teaching fifth and sixth graders about forces, energy, and mechanics. The control group received 3 weeks of ordinary instruction in mechanics, while the experimental group received an intensive program over the same period. The pre- and posttest, made by the experimenters, focused on the precise topics and concepts emphasized in the experimental group. The control group made no gain at all on this test from pre- to posttest, while the experimental group did gain significantly. Were the students better off as a result of the treatment, or did they simply learn about topics that would not otherwise have been taught? It may be valid to argue that the content learned by the experimental group was more valuable than that learned by the control group, but the experiment does not provide evidence that this particular content is better than traditional content.
Another recent example of the problem of treatment-inherent measures is a study by Heller, Daehler, Wong, Shinohara, and Miratrix (2012) comparing three professional development strategies for teaching fourth graders a unit on electric circuits. Students were pretested and then posttested on a test “…designed to measure a Making Sense of SCIENCE content framework…” (Heller et al., 2012, p. 344). The three experimental groups all implemented the Making Sense of SCIENCE curriculum unit on electric circuits, while the control teachers may not have been teaching electric circuits at all during the same time period and certainly could not be assumed to be teaching the same content contained in the Making Sense of SCIENCE curriculum. (The only indication that they were teaching electric circuits at any point in fourth grade was a suggestion that this topic typically appears in fourth grade standards, but even if control teachers did teach electric circuits, they may have done so before or after the experimental period.) Comparisons among the three experimental conditions in this study are meaningful, but the comparisons with the control group are not, because such comparisons may simply reflect the fact that experimental teachers were teaching about electric circuits during the experimental period and control teachers were not doing so.
A study reported by Slavin and Madden (2011), focusing on math and reading studies reviewed in the U.S. Department of Education's What Works Clearinghouse (WWC), found that measures that are “inherent” to the treatment (covering content not taught in the control group) are associated with effect sizes that are much higher than are measures of the curriculum taught in experimental as well as control groups. For example, among seven mathematics studies included in the WWC and using both treatment-inherent and treatment-independent measures, the mean effect sizes were +0.45 and −0.03, respectively. Among ten reading studies, the mean effect sizes were +0.51 and +0.06, respectively. In studies of science education, experimenter-made measures inherent to the content taught only or principally in the experimental condition are often the only measures reported. These measures are often justified by their authors on the basis that the material taught and measured is what students should have been taught. This may well be the case, but an experimental study of this kind provides no evidence one way or the other on the value of the experimental treatment.
While recognizing the value of other research methods in science education and, in particular, the importance of arguments for innovations in science standards, the present review focuses exclusively on experiments in which experimental and control groups are equally focused on achieving particular objectives so that they can be fairly compared on common measures. A major limitation of this focus is that most studies that use common measures in experimental and control groups use standardized tests, which many science educators reject as being overly focused on facts rather than inquiry or scientific processes (see, e.g., Furtak et al., 2010). Yet there are exceptions, in which more inquiry-oriented measures have been used, and even when this is not the case one can argue that it is of interest to know about science programs capable of improving student outcomes on traditional measures, even as we acknowledge that better measures and better curricula tied to those measures may be desirable.
The most important finding of the present review is the very limited number of rigorous experimental evaluations of elementary science programs. After an exhaustive search involving examination of 332 published and unpublished articles that purported to evaluate science approaches in elementary schools since 1980, only 23 studies met the review standards. (For a table listing studies that did not qualify for the review, and the main reasons they were not included, please contact the first author). As a point of comparison, a review of elementary mathematics programs using a somewhat more stringent set of inclusion standards (requiring a treatment duration of at least 12 weeks instead of 4) identified 87 qualifying studies (Slavin & Lake, 2008).
The elementary science studies that did meet the inclusion criteria provide useful information on several approaches to improving outcomes in science teaching. Seventeen of the qualifying studies focused on inquiry-oriented instructional processes for teachers, including approaches such as cooperative learning, integrating science and reading, and use of science kits. The theory of action uniting this category of approaches is an emphasis on teachers improving science learning by using specific, well-articulated strategies designed to develop students' understanding, curiosity, and ability to apply scientific methods. These interventions invariably emphasize professional development and coaching to help teachers use promising approaches.
Two categories of inquiry-oriented instructional process programs were designated: Those that also provided teachers with kits and specific guidelines for hands-on inquiry-oriented explorations (7 studies), and those that provided professional development without kits (10 studies). The theory of action underlying the programs providing kits emphasizes the idea that if teachers have well-designed materials to enable them to teach inquiry lessons, as well as professional development to help them use these materials, they are more likely to effectively implement the programs, and student outcomes will improve. Examples of this approach include Full-Option Science System (FOSS) and Science and Technology for Children (STC). These provide extensive professional development, but the main focus is on providing teachers with appealing, well-developed materials to help them use inquiry and laboratory approaches as well as traditional content. It also includes programs such as Scott Foresman Science, which combines kits with leveled readers focusing on science inquiry.
The theory of action underlying the inquiry-oriented programs without science kits, such as cooperative learning and science-reading integration, emphasizes teaching teachers generic strategies they can use every day to make science teaching engaging, comprehensible, and conceptually challenging.
Another category of approaches to improving science instruction emphasizes the use of technological applications to enhance student outcomes. This category includes six studies of individual technologies, such as computer-assisted instruction, as well as class-focused technology, such as video and interactive whiteboard technologies, and combinations of these types.
Inquiry-Oriented Programs Without Science Kits
Inquiry-oriented programs that do not provide specific materials focus their efforts on helping teachers learn and use generic processes in their daily science teaching, such as cooperative learning, concept development, and science-reading integration. Table 1 summarizes characteristics and findings of the ten qualifying studies of interventions in this category. Overall, the sample size-weighted effect size for inquiry-oriented programs that do not use science kits was +0.36.
Table 1. Inquiry-oriented programs without science kits
| Study | Design | Duration | N | Grade | Sample characteristics | Posttest | Effect sizes by subgroup/measure | Overall effect size |
|---|---|---|---|---|---|---|---|---|
| Increasing conceptual challenge | | | | | | | | |
| Mant, Wilson, and Coates (2007) | Matched | 1 year | 32 schools (16E, 16C); 1,120 students (560E, 560C) | Year 6: 10–11 years old | Rural and village schools in Oxfordshire, England, mostly White, middle class | (National) Key Stage 2 science tests | | +0.33 |
| Baines, Blatchford, and Chowne (2007) | Matched | 2 years | 31 schools (12E, 19C); 61 classes (21E, 40C); 1,587 students (560E, 1,027C) | Years 4–5: 8–10 years old | Schools in London, England | Items adapted from standardized tests for Year 6, simplified for younger children | | +0.21 |
| Ebrahim (2010) | Matched | 6 weeks | 8 classes; 164 students (86E, 78C) | 4–5 | Girls' schools in Kuwait | Experimenter-made tests on Earth, soil, agriculture | | +0.27 |
| Romance and Vitale (1992) | Matched | 1 year | 7 classes (3E, 4C); 128 students (51E, 77C) | 4 | Large urban district in Florida | MAT Science | | +0.90 |
| Romance and Vitale (2001) | Matched | 1 year | 15 schools; 393 students (227E, 166C) | 4–5 | Large urban district in Florida | MAT Science | | +0.66 |
| Cervetti, Barber, Dorph, Pearson, and Goldschmidt (2012) | Cluster random | 8 weeks | 94 teachers; 1,913 students (976E, 937C) | 4 | Southern state | Experimenter-made: Science content; Science vocabulary; Science writing | Science content +0.65; Science vocabulary +0.22; Science writing +0.40 | +0.42 |
| Collaborative concept-mapping with co-teaching | | | | | | | | |
| Jang (2010) | Matched | 8 weeks | 114 students (58E, 56C) | 4 | Science classes in Taiwan | Schoolwide science test on electricity and rainbows | | +0.54 |
| Systematic vocabulary instruction | | | | | | | | |
| Rosebrock (2007) | Matched | 12 weeks | 686 students (401E, 285C); matched on test scores but not demographics | 5 | Middle-class suburb of Houston | TAKS Earth and Space Science Subtest | | +0.24 |
| Scott (2005) | Matched | 1 year | 99 students (66E, 33C) | 3 | Large, diverse district outside of Houston, TX; 54%H, 37%AA, 5%W, 83% FL, 40% LEP | ITBS-Science | | +0.29 |
| 4-E Learning Cycle | | | | | | | | |
| Ebrahim (2004) | Matched | 4 weeks | 98 students (49E, 49C) in 4 classes | 4 | Schools in Kuwait | Experimenter-made test: plants and food | | +0.96 |
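The sample size-weighted effect size reported for this category can be approximated from the entries in Table 1. The sketch below is illustrative only: it assumes that each study is weighted by its total number of students, which is one plausible reading of "sample size-weighted" and may differ from the weighting procedure the review actually used. The names `TABLE_1` and `weighted_mean_effect_size` are hypothetical.

```python
# (study, total students N, overall effect size), transcribed from Table 1.
TABLE_1 = [
    ("Mant, Wilson, and Coates (2007)", 1120, 0.33),
    ("Baines, Blatchford, and Chowne (2007)", 1587, 0.21),
    ("Ebrahim (2010)", 164, 0.27),
    ("Romance and Vitale (1992)", 128, 0.90),
    ("Romance and Vitale (2001)", 393, 0.66),
    ("Cervetti et al. (2012)", 1913, 0.42),
    ("Jang (2010)", 114, 0.54),
    ("Rosebrock (2007)", 686, 0.24),
    ("Scott (2005)", 99, 0.29),
    ("Ebrahim (2004)", 98, 0.96),
]


def weighted_mean_effect_size(rows):
    """Return the mean effect size, weighting each study by its student N.

    Assumption: weights are total student counts; the review may have
    used a different scheme (for example, capped weights).
    """
    total_n = sum(n for _, n, _ in rows)
    return sum(n * es for _, n, es in rows) / total_n


if __name__ == "__main__":
    print(f"{weighted_mean_effect_size(TABLE_1):+.2f}")
```

Under this assumption the large studies dominate: the three studies with more than 1,000 students account for roughly three quarters of the total weight, so the small studies with very large effect sizes have little influence on the category mean.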
Increasing Conceptual Challenge
Mant, Wilson, and Coates (2007) evaluated a professional development program in 32 mostly rural and village schools in Oxfordshire, England. Almost all children were White, and few qualified for free school meals. Teachers of Year 6 (ages 10–11) in 16 schools were provided with extensive professional development intended to increase engagement and conceptual challenge in science lessons. Sixteen control schools were matched on prior scores on the national science exam (number of students receiving scores of “5,” the top score), number of children in Year 6, and percent of students with special needs.
In each experimental school, the science coordinator and a Year 6 class teacher participated in an extensive series of professional development sessions, consisting of 8 full-day and 4 evening trainings at Oxford Brookes University. The sessions emphasized cognitively challenging, practical, whole-class science lessons. Teachers learned to use thinking skills strategies such as regular “bright ideas time” opportunities for focused discussion, “positive, minus, and interesting” (PMI) features of phenomena, and “big questions.” Teachers were encouraged to emphasize higher-order thinking, practical work, investigations, and purposeful, focused recording. The content and materials used in experimental and control schools were the same, as dictated by the National Curriculum for England.
The evaluation compared Key Stage 2 science tests routinely administered to all students in England at the end of elementary school (Year 6). Students' tests are rated on a scale from 1 to 5, with 4 considered passing and 5 outstanding. The year before the experiment, experimental and control schools were nearly identical in percent of students attaining Level 5 (E = 39.6%, C = 39.4%). At the end of the study year, however, 51.4% of experimental students and 41.6% of control students reached level 5. This difference was statistically significant at the school level (p < 0.05), and was equivalent to an individual-level effect size of +0.33, with estimated N's for each condition of E = 560, C = 560.
Two studies evaluated forms of cooperative learning. One of these, by Baines et al. (2007), evaluated a cooperative learning intervention in 21 classes in 12 London elementary schools (N = 560). Students were in Years 4–5 (8–10 years old). Control students were in 40 classes in 19 schools (N = 1,027) in a different area of London. The schools were selected in the year following the experimental year to match the experimental schools in demographics and pretests.
The cooperative intervention, called Social Pedagogical Research in Group Work (SPRinG), involved students working in groups of 2–4 on a regular basis over the course of a year. Teachers participated in 7 half-day meetings, and were given manuals and lesson plans to provide a structure and examples of cooperative work. Students were trained in cooperative skills such as listening, explaining, and sharing ideas, and these skills were reinforced during implementation.
Pre- and posttests were constructed from items adapted from standardized tests for Year 6, simplified for younger children. They included both multiple choice and open-ended items and emphasized interpretation of diagrams, tables, and graphs. Controlling for pretests, the overall effect size was +0.21 (p < 0.01).
As noted previously, embedded within the overall experiment was a “micro-experiment” in which students in experimental and control groups were pre- and posttested on a unit on evaporation, and then on a unit on forces. An evaluation of the 2-week evaporation unit produced much larger effect sizes than those reported for the whole year, but did not meet the duration standards of this review. It is interesting to note that on the end-of-year tests as well, questions relating to evaporation and forces showed very positive outcomes, whereas analyses of the remaining items showed no experimental-control differences. This pattern suggests that teachers emphasized these two topics much more in the experimental than in the control group.
A second study of cooperative learning was carried out by Ebrahim (2010) in two schools for girls in Kuwait. The cooperative learning intervention was not clearly specified, but it involved organizing students into mixed-ability groups of 4–5. The team method apparently emphasized positive interdependence and individual accountability.
Eight intact fifth grade classes were taught by 4 female teachers (N's = 86E, 78C). Each teacher taught one class randomly assigned to use cooperative learning and one to use teacher-centered instruction during a 6-week unit on earth, soil, and agriculture. Because analysis was at the student level, this was considered a randomized quasi-experiment. Teachers taught the same content in each of their classes.
Students were pre- and posttested on an experimenter-made test on the content taught equally in both types of classes. Controlling for pretests, students in the cooperative learning classes learned significantly more than controls (ES = +0.27, p < 0.03).
A concern frequently expressed by science educators is that in elementary schools driven by math and reading tests, science is often pushed aside. An approach developed by Romance and Vitale (2001, 2011) confronts this problem with a program called Science IDEAS, which integrates science with reading and focuses on building content-area reading skills as well as science skills. Teachers in Science IDEAS receive extensive professional development and coaching to help them build comprehension strategies for science and to build science concepts. Students are taught to link together observed events, to make predictions or manipulate conditions to produce outcomes, and to make meaningful interpretations of events. The science approach emphasizes hands-on activities, concept mapping, and journal writing. In particular, students are taught to read and to create propositional concept maps to represent scientific phenomena. Schools adopt Science IDEAS throughout the school and use it every day in a 1 ½ to 2-hour science/reading block. Project staff regularly visit teachers to monitor fidelity of implementation.
The first large study of Science IDEAS was reported by Romance and Vitale (2001). This study involved 15 schools in a diverse district in Florida. A total of 227 students in grades 4–5 were in schools implementing Science IDEAS, and 166 were in matched control schools. On MAT Science tests, controlling for ITBS Reading scores from the previous year, students in the experimental classes scored substantially higher (ES = +0.66, p < 0.01). They also scored better on ITBS-Reading posttests (ES = +0.11, p < 0.01).
Romance and Vitale (1992) evaluated an earlier version of Science IDEAS in a program that replaced a district-adopted reading textbook approach with a program that integrated reading with science, introducing content-area-reading strategies, hands-on science activities, and science process skills. Students in the experimental group participated in a combined daily 2-hour reading/science block, while those in the control group maintained a 1 ½ hour reading/language arts block, using the district's basal series, and a ½ hour science period, mostly using the district science text. Because of the limited time allocated to science in the control classes, teachers in the control group had fewer opportunities to use hands-on activities or to pursue science topics in depth.
The evaluation compared 3 fourth-grade classes (N = 51) using the experimental program to 4 control classes (N = 77) in a demographically similar school with similar pretest scores, all located in a large urban district in Florida. The treatments were implemented over a school year.
Controlling for prior-year ITBS reading scores, students in the experimental group scored substantially higher than controls on MAT-Science (ES = +0.90, p < 0.001) and also on ITBS Reading (ES = +0.40, p < 0.01). The science difference amounted to almost a full grade equivalent, while the reading difference was about 25% of a grade equivalent.
Another approach to integrating science and literacy in the upper elementary grades was described by Cervetti, Barber, Dorph, Pearson, and Goldschmidt (2012). In their form of science-literacy integration, built around a 40-session (8 week) unit on light, four major investigations were carried out. Each 10-session investigation included 4 days of hands-on activities, 2 of reading, 2 of writing, and 2 of discourse. During the reading sessions, students were taught specific study strategies that they applied in partnership and then in the whole class.
A study of the integrated light unit was carried out with 94 fourth grade classes in a southern state. Teachers were randomly assigned to treatment or control conditions. The groups were well matched on percent free and reduced lunches (58% E, 53% C), percent African American (36% E, 39% C), Hispanic (6% E, 7% C), and White (53% E, 49% C).
Control teachers were asked to present the content of their state science standards, using their usual methods and materials. One of the four topics taught in the treatment group, Light as Energy, was not taught in the control group, and the control group did not teach a segment on Light and Color, so the assessments focused only on the material covered equally in both groups, on characteristics of light and interactions of light.
Student learning was assessed using an experimenter-made test designed to align with state standards and to fairly assess the content taught in experimental and control groups. Science Understanding, Science Writing, Science Vocabulary, and Reading Comprehension were assessed (only the first three relate to the present review). Students were pre- and posttested. On Science Understanding, the posttest adjusted for pretest differences had an effect size of +0.65 (p < 0.001). On Science Vocabulary, the effect size was +0.22 (p < 0.001), and on Science Writing, ES = +0.40, p < 0.001. On reading comprehension, however, there were no differences (ES = +0.09, n.s.).
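The overall effect size of +0.42 listed for this study in Table 1 is consistent with a simple unweighted average of the three science measures. The check below is a sketch under that assumption, which this section does not state explicitly; the name `CERVETTI_MEASURES` is hypothetical.

```python
from statistics import mean

# Effect sizes for the three measures relevant to this review.
# Reading comprehension (+0.09, n.s.) is excluded, as in the text.
CERVETTI_MEASURES = {
    "Science Understanding": 0.65,
    "Science Vocabulary": 0.22,
    "Science Writing": 0.40,
}

# Assumption: the study-level value is the unweighted mean across measures.
overall = mean(list(CERVETTI_MEASURES.values()))
print(f"{overall:+.2f}")
```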
One concern in this study is whether the results might be due to the treatment simply encouraging teachers to teach more science during the 8-week period, despite the investigators' attempts to equalize the focus on science. In fact, teacher surveys indicated that experimental teachers taught science 3.66 hours/week while control teachers allocated 3.03 hours/week (ES = +0.53, p < 0.05). In particular, experimental teachers may have spent more time on light. A year-long or at least semester-long study in which teachers in experimental as well as control groups teach many objectives would be needed to rule out this alternative explanation of the findings.
Collaborative Concept Mapping
Jang (2010) reported an evaluation of a collaborative concept-mapping technique in fourth-grade science classes in Taiwan. In the experimental classes, two teachers worked together as a team. Students (N = 58) worked in small groups on activities that emphasized creating concept maps to organize information and ideas. Students discussed together, but then made their own learning journals and concept maps. The experiment compared two experimental to two control classes in an 8-week study focusing on electricity and rainbows. The matched control classes (N = 56) received whole-class instruction using the same materials and activities, but without team teaching or team learning. The outcome measure was a schoolwide uniform science test ordinarily given by the schools. Adjusting for pretests, posttest scores significantly favored the experimental group (ES = +0.54, p < 0.05).
Systematic Vocabulary Instruction
A method for teaching science vocabulary was evaluated in a middle-class suburb of Houston by Rosebrock (2007). The method taught fifth graders 35 terms relating to Earth and space science over a period of 12 weeks. Each of the terms, selected from the Texas state science standards, was introduced in a 12-step process in which the words were introduced, defined, explained, read in various contexts, demonstrated in hands-on lab work, discussed in small groups, illustrated in writing, concept maps, or diagrams, used in games and crossword puzzles, and finally quizzed. The experiment compared one school that used the vocabulary intervention and one that served as a control group. The schools were well matched on state test scores but not on demographics; the experimental school had higher proportions of African American students (20% vs. 10%), Hispanic students (20% vs. 17%), and Asian students (13% vs. 5%) and a lower proportion of White students (48% vs. 68%) than the control school. A similar proportion of students was economically disadvantaged in both schools (around 16%). There were 401 students in the experimental group and 285 controls.
The posttest measure consisted of the nine multiple-choice items relating to Earth and space science on the 40-item Texas Assessment of Knowledge and Skills (TAKS) science test. The author was unaware of how much overlap there was between the 35 vocabulary words taken from the state standards and the nine TAKS items relating to Earth and space science. Controlling for TAKS pretests, students in the vocabulary intervention scored significantly higher than controls (ES = +0.24, p < 0.001).
Scott (2005) carried out a year-long matched evaluation of an extensive science professional development approach called Teachers Engaged in Authentic Mentoring Strategies (TEAMS). The program provided teachers with a 2-week summer institute, professional development days, mentoring from a building science specialist, monthly after-school meetings, classroom observation days, and participation in an electronic database system. These resources were intended to help teachers learn and effectively implement inquiry approaches to science teaching that emphasized engagement, exploration, elaboration, and evaluation. Teachers were taught reading and vocabulary strategies as applied to reading science content. They learned to use formative assessments for science teaching.
The study took place in Aldine, Texas, where the author was science director. Aldine is a large, diverse district outside of Houston. Although the TEAMS process was used throughout many elementary schools in Aldine, the study evaluated only third graders taught by three teachers in three experimental schools and by three control teachers in three similar schools matched on pretests and demographic factors. The TEAMS schools averaged 83% free lunch and 40% Limited English Proficient. Fifty-four percent of students were Hispanic, 37% African American, and 5% White. ITBS-Science data were obtained at the end of second grade (pretest) and the end of third grade for a total of 66 experimental and 33 control students. Adjusting for pretest differences, posttest differences favored the TEAMS students (ES = +0.29).
4-E Learning Cycle
In a small study in Kuwait, Ebrahim (2004) evaluated a 4-E Learning Cycle in four fourth-grade classes. The experimental treatment emphasized exploration, explanation, expansion, and evaluation, using experiments, student-centered cooperative work, assessment through teacher observations rather than student tests, and real-world problem solving. The control group used a traditional lecture format to cover the same content, a month-long unit on plants.
The study compared two classes in each treatment. Each of two teachers taught one 4-E and one control class (N = 49E, 49C). Because Kuwaiti classes are segregated by gender, there was one class of boys and one of girls in each treatment.
Students were pre- and posttested using an experimenter-made test of the content taught equally in both conditions. The groups were well matched overall at pretest. At posttest, differences strongly favored the 4-E groups (ES = +0.96, p < 0.001).
Inquiry-Oriented Programs With Science Kits
Inquiry-oriented programs that provide teachers with specific materials and instruction resemble those discussed in the previous section in that they provide teachers with extensive initial training and coaching. However, they are different in focus, in that they tend to emphasize the rich content supported by their materials rather than focusing on all of science education. That is, the theory of action in science kit programs is that implementing the hands-on activities will build deep learning about the scientific process and about the core concepts of elementary science. There may be less of an emphasis on the direct teaching of science concepts that takes place during times when kits are not being used. This contrasts with the approaches described in the previous section, which tended to focus equally on generic strategies for inquiry and hands-on experiments and on strategies for concept development that applied to all science taught in the elementary grades.
Table 2 summarizes characteristics and findings of the seven qualifying studies of instructional process approaches that provide specific student inquiry activities and materials. The sample size-weighted mean effect size for these studies was +0.02, or effectively zero.
Table 2. Inquiry-oriented programs with science kits
|Study||Design||Duration||N||Grade||Sample characteristics||Posttest|| ||Effect sizes by subgroup/measure||Overall effect size|
|Insights, FOSS, and STC|
|Pine et al. (2006)||Matched||1 year||41 classrooms||5||9 diverse school districts in CA, AZ, and NV 500 students in each group||TIMSS Items||−0.02|| ||+0.05|
| || || || || || ||Performance tasks||+0.11|| || |
|Leach (1992)||Random||14 weeks||5 classes (2E, 3C) 103 students (38E, 65C)||5||Urban district in TX, 49% minority||CTBS Science||+0.29|| ||+0.48|
| || || || || || ||Electricity and magnetism test||+0.67|| || |
|Newman et al. (2012)||Cluster random||1 year||82 schools (41E, 41C) 7,528 students (4,082E, 3,446C)||5,7||Alabama statewide||SAT-10 Science|| ||+0.05|| |
|System-Wide Change for All Learners and Educators (SCALE)|
|G. Borman, Gamoran, and Bowdon (2008) and Gamoran et al. (2012)||Cluster random||2 years||71 schools (33E, 38C)||4-5||Los Angeles USD 73%H, 11%W, 8%AA 76% FL, 33%ESL|| ||LAUSD Test||State Test||−0.01|
| || || || || || ||Life Science||−0.04||−0.01|| |
| || || || || || ||Earth Science||+0.01||+0.01|| |
|K. Borman, Boydston, Lee, Lanehart, and Cotner (2009)||Cluster random||3 years||20 schools||3–5||Pasco County, FL||PASS|| || ||+0.04|
| || || || || || ||Multiple choice||+0.08|| || |
| || || || || || ||Performance||0.00|| || |
|Miller, Jaciw, and Ma (2007)||Cluster random||1 year||36 schools (18E, 18C) 2,079 students (1,059E, 1,020C)||3–5||5 districts in WA, UT, OH, FL, CA||NWEA Science|| || ||−0.02|
|Kim et al. (2012)||Cluster random (classes within schools)||1 year||6 schools 356 students||K-3||Title I schools in Virginia||MAT-8 Science|| || ||−0.01|
Insights, FOSS, and STC
Pine et al. (2006) carried out a large-scale evaluation of the impacts of the major hands-on inquiry curricula developed in the 1990s: Insights, FOSS, and STC. The study compared fifth graders in 41 classrooms in 9 school districts in California, Arizona, and Nevada. Two groups of schools using hands-on inquiry curricula over the course of a year were identified: high-SES (less than 50% free lunch; mean = 21%) and low-SES (more than 50% free lunch; mean = 64%). Then matched schools using traditional textbooks were identified. Approximately 500 students were in each treatment condition. In order to control for any pre-existing differences, students were given a standardized Cognitive Abilities (CogAt) test, focusing on reading and math. This was given at about the same time as the outcome measures, so this is a contemporaneous control variable rather than a pretest. Two tests were used to assess outcomes. One was a 25-item selection of items from the Third International Math and Science Study (TIMSS), with 23 multiple-choice and 2 open-ended questions. The second was a performance measure developed by the investigators. Students were asked to carry out four experiments, one involving determining weight using a spring, one testing the absorbency of different paper towels, one comparing melting rates of ice cubes in salt vs. fresh water, and one involving observations of flatworms over 3 days. None of these topics were directly taught in the kits. Each of the performance measures, administered one-on-one by research assistants, yielded scores on planning an inquiry, observation, data collection, graphical and pictorial representation, inference, and explanation based on evidence.
A total of 720 students took all measures. After adjustments for the CogAt, there were no differences between inquiry and textbook students on the TIMSS items (mean ES = −0.02). There were significant differences favoring the inquiry students on the flatworms task (p < 0.05), but not on the other measures. Averaging across the four performance measures, the mean effect size was +0.11. An HLM analysis, with students nested within classrooms, also found a small positive effect for the flatworm task, but no significant differences for the four tasks taken together. There were no interactions with gender or socioeconomic status.
A 14-week study involving fifth graders was carried out by Leach (1992) in an urban district in Texas with a minority enrollment of 49%. Students were randomly assigned to one of two experimental and three control classes (N = 38E, 65C). Control classes were taught three chapters from a textbook, while experimental students used three FOSS units. The only overlap in content was a unit (FOSS) and chapter (control) on electricity and magnetism. The experimenter selected items on this topic from the control group's textbook for use as a posttest, and CTBS science was also used as a posttest. On CTBS, effects non-significantly favored the FOSS students (ES = +0.29, n.s.). On the electricity and magnetism test, effects were statistically significant and much larger (ES = +0.67, p < 0.02). However, it was unclear whether the amount of time and focus on electricity and magnetism was similar in the two conditions.
The Alabama Math, Science and Technology Initiative (AMSTI) is an ambitious, state-wide approach intended to improve performance in upper-elementary and middle schools across Alabama. After developing the program and beginning its dissemination in 2002, the state contracted with a third-party evaluator to do a cluster randomized experiment to evaluate the program starting in 2006–2007 (Newman et al., 2012).
The AMSTI intervention involves providing teachers with extensive materials and supplies in kits to enable them to make extensive use of hands-on activities throughout the year. Within regions, kits are rotated among schools every 3–4 months. The kits include equipment such as thermometers, digital cameras, and test kits. Teachers received ten days of inservice (5 in science, 5 in math) during the summer before implementation. During the implementation year, faculty from regional universities visited participating schools to provide encouragement and advice.
The evaluation involved students in grades 4–8 in 82 schools throughout Alabama that had applied to participate. Schools were matched on prior performance and demographic factors, and then one school in each pair was randomly assigned to the experimental group and one to control, yielding 41 in each group. Science was assessed only in grades 5 and 7; because the report combines results for these two grades, we report them together. Attrition among teachers and students was small and equal across groups. Two experimental schools and one control school dropped out, leaving 39E, 40C, and there were 192 teachers (102E, 90C) and 7,528 students (4,082E, 3,446C) in the analytic sample.
The posttest was the routinely administered SAT-10 Science test. SAT-10 Reading was used as a pretest and covariate. The analyses used HLM. The adjusted effect size was +0.05, which was not statistically significant. Results for mathematics and reading were similar.
Despite the disappointing learning outcomes, there was strong evidence that teachers in the AMSTI treatment reported more use of active learning instruction in science than did controls (ES = +0.32, p < 0.002).
G. Borman, Gamoran, and Bowdon (2008) evaluated a large-scale professional development initiative in the Los Angeles Unified School District (LAUSD). The intervention was a National Science Foundation Math and Science Partnership initiative, called SCALE, for System-Wide Change for All Learners and Educators. In the SCALE elementary science component, fourth- and fifth-grade teachers participated in summer institutes and then received coaching and mentoring in the use of extended, inquiry-based “immersion units” intended to take students and teachers through a full cycle of inquiry in science investigation. The units emphasized “big ideas,” posing scientific questions, giving priority to evidence, connecting evidence-based explanations to scientific knowledge, and communicating and justifying explanations. One teacher in each grade level participated in the summer institute, but all teachers received extensive coaching and mentoring at their school.
Eighty schools were randomly assigned to experimental or control conditions. A few schools had missing data, and the analysis sample included 33 experimental and 38 control schools. Control schools were offered the SCALE curriculum units, but not the professional development or ongoing coaching. Approximately 73% of students were Hispanic, 11% were White, 8% were African-American, 3% were Asian, and 3% were Filipino. Seventy-six percent of students received free lunch, and 33% were English language learners. Experimental and control schools were well matched on these factors and on reading and math scores.
During the first program year, the outcome measures were three science assessments provided by LAUSD to all students in grades 4–5. One test focused on life science, one on earth science, and one on physical science. Each consisted of 20 multiple choice items and one constructed-response item. Teachers were allowed to give these tests in any order.
Hierarchical linear modeling (HLM) was used to analyze the data, controlling for science pretests and other factors. On life science, the treatment effects were significantly negative (ES = −0.27, p < 0.01), while on earth science (ES = +0.01, n.s.) and physical science (ES = −0.08, n.s.) there were no differences, for an average effect size of −0.11. Additional analyses investigated these unexpected findings. One hypothesis was that effects might be more positive for the science lead teachers who actually participated in the summer training. However, the students of the lead teachers scored slightly worse, relative to controls, than did the students of teachers in general. Another analysis found that for teachers in general, treatment effects were the same for experienced teachers (>3 years) as for less experienced teachers. However, students of lead teachers with less experience gained slightly from the SCALE treatment, while students of more experienced lead teachers did worse than controls. Examinations of outcomes on life science questions more closely aligned to the SCALE curriculum did not show positive outcomes.
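The average of −0.11 reported for the first program year is consistent with a simple unweighted mean of the three subject-area effect sizes. The sketch below checks that arithmetic; equal weighting of the three tests is an assumption, and the variable names are illustrative.

```python
from statistics import mean

# First-year SCALE effect sizes on the three LAUSD assessments, as reported above
first_year_es = {
    "life science": -0.27,
    "earth science": 0.01,
    "physical science": -0.08,
}

# Assumption: each test counts equally toward the study-level average
average_es = mean(list(first_year_es.values()))
print(f"{average_es:+.2f}")
```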
Gamoran, G. Borman, Bowdon, Shewakramani, and Kelly (2012) followed students in the G. Borman et al. (2008) study for an additional year. At the end of that time, achievement results were no longer significantly negative, but they were essentially zero on all LAUSD measures: life science (ES = −0.05, n.s.), earth science (ES = +0.03, n.s.), and physical science (ES = −0.03, n.s.). On state standardized tests given to fifth graders, differences were also very small on life science (ES = −0.02, n.s.), earth science (ES = −0.02, n.s.), and physical science (ES = −0.03, n.s.).
Another large-scale, randomized evaluation of science kits was carried out by K. Borman, Boydston, Lee, Lanehart, and Cotner (2009). They evaluated the Teaching SMART professional development program in Pasco County, Florida. Twenty schools and their teachers of grades 3–5 were matched on pretests and demographic factors and then randomly assigned to Teaching SMART or control conditions (N (schools) = 10E, 10C) over a 3-year period. Teaching SMART professional development emphasized an exploratory, hands-on approach, cooperative learning, equity, questioning techniques, problem solving, discovery, and real-world applications. In addition to initial inservices, teachers received extensive on-site coaching from specially trained site coaches (each of three site coaches was responsible for about 40 teachers). The program provides more than 100 “culturally sensitive, grade-specific” lesson plans based on AAAS and NSF standards and benchmarks, as well as activity kits with consumable supplies and equipment kits with all necessary resources.
Student achievement was measured on the Partnership for the Assessment of Standards-based Science (PASS), which combined authentic performance assessments with multiple-choice items. PASS assessments were administered as pretests and then at the end of third, fourth, and fifth grades. Data from routinely administered state FCAT reading and math tests were also collected and reported.
Outcomes on the PASS over the 3-year experiment were not statistically or educationally significant. Adjusting for pretests, there was no significant difference on the PASS multiple choice items (ES = +0.08, n.s.), and no difference on the performance measures (ES = 0.00, n.s.).
Scott Foresman Science
Scott Foresman Science, published by Pearson, is a year-long curriculum intended to be used every day in grades 3–5. It includes kits focusing on science inquiry, as well as leveled readers. During a half-day inservice, teachers learn to use a strategy emphasizing a progression from “directed inquiry” to “guided inquiry” to “full inquiry.” Experiments using materials from the kits are used at all of these stages, but particularly in “full inquiry,” where students have the opportunity to work in small groups to set up their own experiments. However, much of class time is spent on the leveled readers, which emphasize inquiry but do not use the kits.
A third-party evaluator was engaged to evaluate Scott Foresman Science (Miller, Jaciw, and Ma, 2007). The study involved five districts around the United States. Within each district, teachers of grades 3–5 were randomly assigned to use Scott Foresman Science or to continue with their existing science approaches. The study took place over a full school year. Students were pre- and posttested on the NWEA Science Concepts and Processes and Reading Achievement scales. Both pretests were used as covariates in an HLM analysis. The study involved 36 schools (18E, 18C), 92 teachers (46E, 46C), 113 classrooms (56E, 57C), and 2,079 students (1,059E, 1,020C), who were well matched on pretest and demographic factors. Two of the districts had significant numbers of English learners, but most students were White and spoke English as their first language.
The outcomes indicated no differences on science posttests, controlling for pretests (ES = −0.02, n.s.). There was a small positive effect on reading, but it was also not significant (ES = +0.05, n.s.). Analyzed separately, none of the five districts showed a positive effect.
Project Clarion is a science program for grades K-3 that uses prepared science units from the Integrated Curriculum Model, or ICM (VanTassel-Baska, 1986). Each unit includes an inquiry based on a concept of change or systems. Students take on a role as a scientist, learning the scientific process in order to answer a question or solve a real-world problem.
In a study in six Virginia Title I schools (Kim et al., 2012), teachers were randomly assigned to participate in Project Clarion or to serve as a control group. The study took place over 3 years, but since children frequently transitioned between experimental and control conditions, only the first year data could be used. Students in grades K-3 were pre- and posttested on the MAT-8 science test. Adjusting for pretest differences, the posttest effect size was near zero (ES = −0.01).
Despite substantial interest in technology applications throughout the science education community and many small trials of exciting innovations, only six studies of technology programs in elementary science met the standards of this review. The many articles on technology programs that did not meet the review standards typically described studies of very brief duration, often carried out under very artificial circumstances (e.g., with many additional staff members helping children with the technology). Perhaps most importantly, many studies of technology programs in science that did not qualify for this review used measures inherent to the experimental program and did not ensure that there was a control group studying the same content. It is interesting to note that in systematic reviews of research on elementary math (Slavin and Lake, 2008) and reading (Slavin et al., 2009), studies of technology programs, especially computer-assisted instruction, formed the category with the largest number of qualifying studies. The inclusion standards in those reviews were nearly identical to those used in this review.
Table 3 summarizes characteristics and outcomes of the six studies of technology-focused programs that met the standards of the present review. The weighted mean effect size for these studies was +0.42.
Table 3. Technology programs
|Study||Design||Duration||N||Grade||Sample characteristics||Posttest||Effect sizes by subgroup/measure||Overall effect size|
|Barak, Ashkar, and Dori (2011)||Matched||1 year||7 schools (5E, 2C) 1,335 students (926E, 409C)||4, 5||Israel||Measure based on Israeli national standards|| ||+0.43|
|SEG Research (2009)||Matched||1 semester||371 students (186E, 185C)||3, 5||Palm Beach (FL) and New York City||SAT-10 Science||Gr. 3 +0.10, Gr. 5 +0.55||+0.33|
|Waterford Early Math and Science Program (WEMS)|
|Powers and Price-Johnson (2007)||Cluster Random (classes within schools)||1 year||5 schools 22 classes (13E, 9C) 338 students (199E, 139C)||K||Tucson (mostly Hispanic students)||SESAT (SAT-10 for kindergarten)|| ||+0.70|
|Voyage of the Mimi|
|Rothman (2000)||Matched||1 year||109 students in 7 classes (57E, 52C)||5||Suburban Philadelphia||MAT-7 Science|| ||+0.25|
|Sun, Lin, and Yu (2008)||Matched||8 weeks||113 students in 2 schools (56E, 57C)||5||Taiwan||Experimenter-made tests of acids and alkalis, use of microscope|| ||+0.30|
|Sun, Lin, and Wang (2009)||Matched||4 weeks||118 students in 4 classes (63E, 65C)||4||Taiwan||Experimenter-made tests of sun and moon systems|| ||+0.26|
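As a rough illustration of what a sample size-weighted mean effect size involves, the sketch below weights each overall effect size in Table 3 by the student N listed for that study. Weighting by the raw student count is an assumption; the review may have capped or otherwise adjusted the weights, so the result should land near the reported +0.42 but need not match it exactly. The function name is hypothetical.

```python
# (study, student N, overall effect size), copied from Table 3
table3 = [
    ("Barak, Ashkar, and Dori (2011)", 1335, 0.43),
    ("SEG Research (2009)", 371, 0.33),
    ("Powers and Price-Johnson (2007)", 338, 0.70),
    ("Rothman (2000)", 109, 0.25),
    ("Sun, Lin, and Yu (2008)", 113, 0.30),
    ("Sun, Lin, and Wang (2009)", 118, 0.26),
]

def weighted_mean_es(rows):
    """Mean effect size with each study weighted by its sample size."""
    total_n = sum(n for _, n, _ in rows)
    return sum(n * es for _, n, es in rows) / total_n

print(f"{weighted_mean_es(table3):+.2f}")
```

Because the Barak, Ashkar, and Dori (2011) study contributes more than half of the students, its effect size dominates the weighted mean under this scheme.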
In an Israeli study, Barak, Ashkar, and Dori (2011) evaluated a program in which whole classes were shown on-line multimedia content called BrainPOP (http://www.brainpop.com). BrainPOP students viewed 3- to 5-minute animated BrainPOP videos that explain scientific concepts in an interesting way. A teacher's section provides lesson plans and ideas for building on the BrainPOP content. In this experiment, students saw about one video each week. They then engaged in activities either individually or in cooperative pairs, with teacher instruction following up on the concepts introduced in the videos. The BrainPOP videos and follow-up activities were organized to align with the Israeli national curriculum. Control classes used traditional textbooks and classroom teaching to study the same content, equally aligned with Israeli standards.
The experiment took place over the course of a school year. A total of 926 fourth and fifth graders in 5 elementary schools received the experimental treatment, while 409 students in two schools matched on pretests and parent characteristics served as a control group. Students were pre- and posttested on a measure of “understanding of scientific concepts and phenomenon,” based on Israeli national standards. Adjusting for pretests, the posttest means strongly favored the experimental group (ES = +0.43, p < 0.001). Ratings of students' explanations also favored the experimental group (p < 0.05).
SEG Research (2009) carried out an evaluation of BrainPOP in Palm Beach County, Florida, and New York City. Third and fifth graders who used BrainPOP 2–3 hours/week (N = 186) were pre- and posttested on Stanford-10 scales, and compared to matched control students (N = 185). On the science scale, BrainPOP students gained significantly more than controls in fifth grade (ES = +0.55, p < 0.001) but not third grade (ES = +0.10, n.s.), for a mean of +0.33. Positive effects were also reported for SAT-10 measures of reading, vocabulary, and language for fifth graders, and for reading and vocabulary among third graders.
Waterford Early Math and Science Program (WEMS)
The Waterford Early Math and Science Program is an educational software program that provides self-paced computer-assisted instruction in grades K-2. In a study with kindergarten students, Powers and Price-Johnson (2007) evaluated WEMS in five majority-Hispanic schools in Tucson, Arizona. Within the schools, intact classrooms were randomly assigned to use WEMS (N = 13 classrooms, 199 students) or to serve as a control group (N = 9 classrooms, 139 students). In the WEMS classes, teachers were asked to give students at least four 22-minute sessions each week, split between math and science content. In fact, many students received less than the expected dosage. The researchers established a minimum of 1,100 minutes of use as the expectation over the course of the year, and 26% of children did not receive this much exposure.
Students were pre- and posttested on kindergarten forms of the SAT-10 Environment test. Adjusting for pretest differences, the WEMS students gained more than controls by an effect size of +0.70.
The Voyage of the Mimi
The Voyage of the Mimi (Bank Street, 1984) is a multimedia program that uses a variety of technology related to whales to teach science in elementary and middle schools. Rothman (2000) evaluated an application of the program in three schools in a Philadelphia suburb. At the time of the study, the program included computer simulations and modeling, microcomputer-based laboratory data collection and analysis, and interactive video disks that showed students appealing video content on the topics of study. In the Rothman (2000) evaluation, four modules were used: “Introduction to Computing,” “Maps and Navigation” (in which student teams use science and math to help free a whale caught in the net of a fishing trawler), “Ecosystems” (two computerized simulations in which students observe changes in populations of animals and plants as ecosystems change), and “Whales and Their Environment” (hands-on microcomputer activities in which students collect data about temperature, light, and sound to test hypotheses related to whales).
The study compared a total of 163 fifth graders in three schools. One school implemented all four of the Mimi modules and participated in a Mimi-oriented field trip. In its four fifth-grade classes (N = 57), the author estimated that Mimi activities were used in 37% of class periods, leaving 63% for traditional textbook instruction. A second school with four classes (N = 54) used only one Mimi module, for 7% of class periods, and a control school with three classes (N = 52) used only the textbook.
Students were pre- and posttested on a 40-item Metropolitan Achievement Test (MAT-7) science scale in a year-long experiment. Students in the school that used the full program gained non-significantly more than the control school (ES = +0.25, n.s.), and the school that made minimal use of the program also gained non-significantly more than the control students (ES = +0.33, n.s.). On an attitude measure, only the full treatment school gained significantly more than the control school (p < 0.015).
In a study in two Taiwan elementary schools, Sun, Lin, and Yu (2008) evaluated an approach in which fifth graders used web-based lab simulations to do experiments. Two 4-week units were taught, one on acids and alkalis and one on the operation of a microscope.
In each of several lab exercises, students were shown computer screens. On the left side, they carried out simulated experiments, while on the right side were “cabinets” containing simulated tools and instruments, such as thermometers, alcohol burners, and test tubes. Students could use the simulated equipment and see the results of their work; for example, moving a simulated magnet near a simulated compass would cause the needle to point toward the magnet. Records of students' operations were made immediately available to the teacher, who could then respond right away to errors.
The experiment compared four intact classes in two schools. Classes were randomly assigned to experimental (N = 56) or control (N = 57) conditions, but with such a small number of classes the design was treated as matched. Control classes were taught precisely the same content as were experimental students and the same amount of time was allocated to each group. Detailed lesson plans were given to each teacher to try to standardize the content taught in each treatment group.
Students were pre- and posttested on experimenter-made tests covering the content taught in all classes. Adjusting for (small) pretest differences, students using the web-based labs scored higher than controls (ES = +0.30, p < 0.05).
In a closely related experiment, Sun, Lin, and Wang (2009) evaluated use of a 3-D virtual reality (VR) model of the sun and moon in a 4-week unit. Taiwanese fourth graders in the VR group used a unit called “Capricious Moon Lady” focusing on location of the moon, phases of the moon, relation of the moon phases to the lunar calendar, and related topics. The computer was able to simulate the positions of the Earth and moon, 3-D coordinates, effects of the gravitational pulls of sun, Earth, and moon, and so on. Students could choose to “observe” the sun, Earth, and moon from the Earth, from a movable spaceship, or from a spaceship in a set orbit. They went through a series of exercises to learn about the moon's phases and movement, and also used the software to analyze their own observations of the moon each evening. Control students studied the same content, but used 2-D photographs to learn about the moon. Control students also observed the moon each evening, but did not enter their observations on the computer.
In four intact classrooms within an elementary school in southern Taiwan, students were in two treatment and two control classes in a matched design (T = 63, C = 65). Students were pretested and posttested on experimenter-made measures keyed to the content studied by both groups. At the end of the 4-week experiment, the treatment group scored significantly better, adjusting for small pretest differences (ES = +0.26, p < 0.02).
As noted earlier, the most important finding of this review is that very few studies of elementary science met the inclusion standards. Out of 332 identified studies purporting to evaluate science approaches in elementary schools, only 23 had control groups, durations of at least 4 weeks, equivalence on pretests, and measures not inherent to the experimental treatment. In light of the small numbers of qualifying studies of any particular type of program, it must be acknowledged that any conclusions about the findings of these studies can only be tentative.
Previous syntheses of research on science teaching have reported much more positive impacts on science achievement than those found in the current synthesis. For example, a meta-analysis by Schroeder et al. (2007) reported mean effect sizes ranging from +0.29 to +1.28 for 8 categories of science treatments in elementary and secondary schools, far higher than those reported in the present review. However, Schroeder et al. included experiments using treatment-inherent measures, brief studies, and artificial procedures characteristically associated with high positive effect sizes.
It is important to set these studies against the backdrop of contested and changing views about the nature and purpose of science education. Changes in state and national science standards are driven both by advances in scientific knowledge and by changing perceptions of what students should be taught. Curricula developed in the 1950s and 1960s, when the priority was educating future scientists, have been replaced by those emphasizing the production of scientifically literate citizens for the twenty-first century (Atkin & Black, 2007; Osborne & Dillon, 2008). Over the same period, there has been a move toward more student-centered, hands-on, dialogic teaching methods (Treagust, 2007). Yet as standards evolve, it is still important to know what programs and teaching methods are most effective in helping students meet whatever standards are currently prevalent.
A surprising finding from the largest and best-designed of the studies synthesized in the present review is the limited achievement impact of elementary science programs that provide teachers with kits to help them make regular use of hands-on, inquiry-oriented activities. These include evaluations of the well-regarded FOSS, STC, Insights, Project Clarion, and Teaching SMART programs, none of which showed positive achievement impacts. Introduced in the 1990s, these hands-on, kit-based curricula were designed to be easier for teachers to use than the inquiry-based curricula that preceded them. Research has shown that elementary teachers using kits present lessons that are more accurate in content than do those not using kits (Nowicki, Sullivan-Watts, Shim, Young, & Pockalny, 2012). One might argue that traditional science tests might not be sensitive to the more sophisticated understandings of scientific process that are the targets of these inquiry-oriented approaches, but the studies by Pine et al. (2006) and K. Borman et al. (2009) used (in addition to traditional tests) well-designed measures in which students had to demonstrate deep understandings of scientific reasoning, and these measures also failed to register positive effects. The only study of a science inquiry kit that did show positive effects was a very small evaluation of FOSS by Leach (1992). The weighted overall mean effect size across the six studies of science kit programs was only +0.02.
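To make the idea of a weighted mean effect size concrete, the following is a minimal sketch of how such a mean might be computed. It assumes that each study is weighted by its sample size, which is an assumption made for illustration and is not stated in this section. The function name and all numbers are hypothetical and are not the values from the studies reviewed here.

```python
from typing import Sequence


def weighted_mean_effect_size(
    effect_sizes: Sequence[float], sample_sizes: Sequence[int]
) -> float:
    """Return the mean effect size, weighting each study by its sample size.

    ASSUMPTION: sample-size weighting is used here for illustration only;
    the weighting scheme of the review is not specified in this section.
    """
    if len(effect_sizes) != len(sample_sizes):
        raise ValueError("effect_sizes and sample_sizes must have equal length")
    total_n = sum(sample_sizes)
    if total_n <= 0:
        raise ValueError("total sample size must be positive")
    weighted_sum = sum(es * n for es, n in zip(effect_sizes, sample_sizes))
    return weighted_sum / total_n


# HYPOTHETICAL data: one small study with a large positive effect and
# two large studies with effects near zero.
hypothetical_es = [0.50, 0.00, -0.05]
hypothetical_n = [50, 1000, 950]

unweighted = sum(hypothetical_es) / len(hypothetical_es)
weighted = weighted_mean_effect_size(hypothetical_es, hypothetical_n)

print(f"unweighted mean ES: {unweighted:+.3f}")
print(f"weighted mean ES:   {weighted:+.3f}")
```

Under this weighting, a single small study with a positive effect contributes little to the overall mean when the larger studies find effects near zero, which is the general pattern described above for the kit programs.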
Previous descriptive research has supported the observation that when teachers are given science kits, their focus can be on implementing the materials rather than on building deeper understandings among students. For example, a large evaluation of the Local Systemic Change Through Teacher Enhancement Program in 61 sites across the United States noted that “LSCs have had difficulty in moving teachers beyond ‘surface changes’—simply implementing new materials—to the larger task of teaching for understanding” (Boyd, Banilower, Pasley, & Weiss, 2003, p. 64).
In contrast, several equally inquiry-oriented professional development programs that did not provide kits did show positive science achievement outcomes in rigorous evaluations. These studies provided extensive professional development in effective science teaching, emphasizing conceptual challenge (Mant et al., 2007), cooperative learning (Baines et al., 2007; Ebrahim, 2010), science-reading integration (Cervetti et al., 2012; Romance & Vitale, 1992, 2001), teaching scientific vocabulary (Rosebrock, 2007), and use of an inquiry learning cycle (Ebrahim, 2004). All ten of these studies found significant positive effects of inquiry-oriented professional development on conventional measures of science achievement, with a weighted mean effect size of +0.36.
The six qualifying studies of technology applications in elementary science all show significant promise. Four approaches had qualifying evaluations: Waterford Early Math and Science (WEMS), BrainPOP, The Voyage of the Mimi, and use of web-based laboratory exercises. WEMS is a traditional computer-assisted instruction approach, applied in this case only to kindergartners, but the other three models are all characterized by the use of video or computer graphics to illustrate scientific processes, active inquiry using technology tools, integration of technology, teaching, and group work among students, and efforts to make science content motivating and relevant to students. These science applications are very different from the computer-assisted instruction applications that have dominated uses of technology in elementary mathematics (Slavin & Lake, 2008). Computer-assisted instruction (CAI) in math has emphasized having students work on problems at their appropriate level of need, with feedback on the correctness of their answers, while most of the science applications with evaluations that met the standards of this review focused more on using technology to enhance classroom teaching and laboratory work.
While the technology applications had the highest weighted mean effect size among the three categories of elementary science approaches (ES = +0.42), it is important to take these findings as promising rather than proven. Most of the studies used matching rather than random assignment, and matched studies leave open the possibility of selection bias (schools or teachers using the programs might have been better or more reform-oriented teachers). Except for the two BrainPOP evaluations, the sample sizes are small, and small studies tend to have larger effect sizes than do ones with large samples (Slavin & Smith, 2009). Yet these preliminary findings argue for further development and large-scale evaluations of modern approaches, for example those that integrate video and computer technologies with inquiry-oriented teaching and cooperative learning.
Although the limited number of qualifying studies makes explanations of these divergent outcomes tentative, it is nevertheless interesting to speculate about their meaning. First, how could the provision of science kits carefully designed to facilitate hands-on inquiry have so little benefit for student learning, while other inquiry-oriented professional development approaches did have positive effects? One possible answer lies with the nature of the kits themselves, which have been criticized for failing to adequately facilitate conceptual understanding (Boyd et al., 2003). An alternative interpretation relates to the nature of practical science teaching in elementary schools. In reality, time and resource limitations for elementary science teachers make it difficult to cover the entire science curriculum. In recent years, as high-stakes accountability has focused increasingly on reading and math rather than science, this problem may have become more serious. Elementary teachers who spend a great deal of time on laboratory exercises may be taking time away from coverage of the rest of the science curriculum, especially objectives not covered by the kits. Further, professional development targeted toward helping teachers use kits may not help them enhance their effectiveness on the science units taught without kits.
In contrast, the programs that focus primarily on improving daily instruction on all objectives, not just those that are the focus of provided science materials, may help teachers teach the entire range of science objectives more effectively. That is, a teacher who learns to make effective, daily use of cooperative learning, or conceptually challenging content, or science-reading integration, can take advantage of these new skills every day, for every objective. Elementary science teachers need to develop pedagogical content knowledge, which means knowing how to make science content meaningful, useful, and engaging (Duschl et al., 2007; Cobern & Loving, 2002; Zembal-Saul, Starr, & Krajcik, 2002). Previous work on cooperative learning in science has demonstrated that it is the interactions established through cooperative learning that best predict positive outcomes (Howe et al., 2007; Thurston et al., 2010).
Many of the science teaching approaches found to be effective in the studies meeting the inclusion criteria resemble methods that have been found to have positive effects in other subjects and in a broader range of science studies. This is particularly true of cooperative learning, which has been frequently found to work at all levels of science education (Bennett et al., 2004; Lazarowitz & Hertz-Lazarowitz, 1998) and in a wide variety of other subjects (Rohrbeck, Ginsburg-Block, Fantuzzo, & Miller, 2003; Slavin, 2013; Webb, 2008). Science-reading integration has also been found to be effective in reading studies (e.g., Guthrie et al., 1998; Guthrie, Anderson, Alao, & Rinehart, 1999).
The findings of the qualifying studies do not call into question the value of inquiry itself or of hands-on laboratory activities, which have long been accepted by the profession as the core of any modern science curriculum (see, for example, Minner et al., 2010; Shymansky et al., 1990; Bennett et al., 2006; Anderson, 2002). Yet few if any elementary science teachers use hands-on inquiry activities every day to cover all of the curricular expectations in today's state and national standards. In fact, research has shown that, despite the focus on inquiry-based teaching in science education policy, it has a generally low profile in classroom practice (Weiss, Pasley, Smith, Banilower, & Heck, 2003). In order to make a substantial difference on broad measures of science learning, teachers may need effective pedagogical strategies for all objectives and all teaching modes.
It is important to note the limitations of this review. Its methods focus on rigorous experimental evaluations of teaching methods and technologies intended to improve the learning of elementary science. However, the review's inclusion standards ruled out many studies that might be of interest to some readers. These include very brief studies (fewer than 4 weeks), studies that use artificial, non-replicable procedures, and studies in which the control group was not studying the content tested on the outcome measures. These exclusions were intended to focus the review on studies that inform readers about pragmatic science approaches that could readily be used at a significant scale. However, other experimental, correlational, and observational research is also valuable for theory building, description, and tests of concept. It is not possible in a review to do justice to every type of research done for every type of purpose, but it would be misleading to suggest that research excluded here is of less importance than studies that were included. Excluded studies simply addressed different objectives.
Far more research and development are needed to identify effective and replicable approaches to improving science achievement outcomes for elementary schools. Science education needs to move beyond brief and artificial pilot tests of exciting new methods and technologies to put them to the test in real schools over extended time periods with valid and comprehensive measures of what students should know and be able to do in science. It is encouraging that the framework behind the new U.S. science standards recognizes the need for empirical testing of instructional approaches and specifically recommends randomized trials to evaluate ideas and practices used in the development of learning progressions (National Research Council, 2012). Too many curious, creative students leave elementary school with a diminished love for science and deep misconceptions about scientific principles and the nature of science itself (The Royal Society, 2010). Science education researchers need to use the tools of science to evaluate and progressively improve the programs and practices needed to help elementary teachers build a scientifically literate society.
This research was supported by a grant from the National Science Foundation (DRL-1019306). However, any opinions expressed are those of the authors and do not represent NSF positions or policies. We would like to thank Daphne Minner, Jeanne Century, Derek Bell, Mary Ratcliffe, Judith Bennett, Michael Karweit, and Gavin Fulmer for their comments on earlier drafts.