The results of international assessments such as the Trends in International Mathematics and Science Study (TIMSS) are often reported as rankings of nations. Focusing solely on national rank can result in invalid inferences about the relative quality of educational systems that can, in turn, lead to negative consequences for teachers and students. This study seeks an alternative data analysis method that allows for improved inferences about international performance on the TIMSS. In this study, four classroom teachers categorized a sample of TIMSS items by the cognitive domains of knowing and applying using the definitions provided by the TIMSS 2011 Assessment Frameworks. Items of different cognitive domains were analyzed separately. This disaggregation allowed for more valid inferences to be made about student performance. Results showed almost no significant difference between the performance of U.S. students and the students of five other nations. Additionally, no differences were observed in U.S. students' performance on knowing items and applying items, although students from some sample nations performed significantly better on knowing items. These results suggest that policy makers, educators, and citizens should be cautious when interpreting the results of TIMSS rank tables.
In making these recommendations we … wish to make clear that it is not our intention to interfere with the freedom of the press. Nor is it our intention to discourage public debate over how best to educate children. We do feel, however, that the public is not well served when it is exposed to misleading information and unsupported speculation. (Canadian Psychological Association [CPA], 2000, p. 14)
Assessments are designed based on a theory of learning (National Research Council [NRC], 2001b). How one defines learning will influence how one designs an assessment. For example, an assessment created from a sociocultural learning perspective would have students interact with each other to create a joint product, rather than individually responding to test items (Jakobsson, Mäkitalo, & Säljö, 2009; Noble et al., 2012). While there are many different theories of learning (Miller, 2009), much of the research in recent years agrees that learning is more than memorizing.
The processes of learning in mathematics and science are described in educational research. Some of the literature in mathematics education recommends that learning mathematics be viewed as acquiring mathematical proficiency (NRC, 2001a). Mathematical proficiency includes the five interwoven strands of conceptual understanding, procedural fluency, strategic competence, adaptive reasoning, and productive dispositions. Thus, a mathematically proficient student can perform computations (procedural fluency) and reason about complex relationships and situations (adaptive reasoning). Similarly, learning in science extends beyond knowing facts or concepts (NRC, 2012). One who learns science should understand how concepts connect through multiple disciplines and should be able to participate in the practices of science. Research related to expertise highlights the need to conceptualize learning as more than knowing facts (NRC, 2000). Rather than simply having a greater body of memorized facts, an expert's knowledge is organized around central concepts, contextualized to signal when using the knowledge is relevant, and transfers from situation to situation (NRC, 2000). Thus, it is important for assessments to be built upon a theory of learning that incorporates more than memorizing facts.
The Trends in International Mathematics and Science Study (TIMSS) is an international assessment that attempts to assess students' understanding at multiple levels. These levels include three cognitive domains: knowing, applying, and reasoning (Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2009). Knowing items require students to remember facts or procedures. Items from the applying domain require that students use knowledge to solve a science or mathematics problem. To respond correctly to reasoning items, students must find solutions to complex, multistep problems in unfamiliar contexts.
Although one could report the results of the TIMSS by cognitive domain, it is more common to report each nation's rank (e.g., Arenson, 2004; Armario, 2010; Asimov, n.d.; Nagesh, 2010). In this way, the scores of nations are lined up, and nations are ranked from top to bottom. It should be noted that some of the groups that participate in the TIMSS are not formally nations (e.g., Hong Kong); however, for simplicity, all groups in this study will be referred to as nations. As seen in Table 1, the rank of U.S. fourth graders on science items has dropped consistently, from third in 1995 to eighth in 2007. Such rankings have led to increased concerns about the state of U.S. education (Finkel, 2012; NRC, 2001a). For example, one report quoted President Obama as saying, “It is unacceptable to me, and I know it's unacceptable to you, for us to be ranked on average as 21st or 25th—not with so much at stake. We don't play for second place here in America. We certainly don't play for 25th” (Nagesh, 2010, para. 3).
Table 1. Rankings and Scores of Nations on TIMSS Fourth Grade Science
| Rank | Nation (1995) | Score | Nation (2003) | Score | Nation (2007) | Score |
|---:|---|---:|---|---:|---|---:|
| | International average | | International average | 489 | International average | 500 |
| 2 | Japan | 574 | Chinese Taipei | 551 | Chinese Taipei | 557 |
| 3 | United States | 565 | Japan | 543 | Hong Kong | 554 |
| 4 | Czech Republic | 494 | Hong Kong | 542 | Japan | 548 |
| 5 | England | 551 | England | 540 | Russian Federation | 546 |
| 6 | Canada | 549 | United States | 536 | Latvia | 542 |
| 8 | Ireland | 539 | Hungary | 530 | United States | 539 |
| 9 | Scotland | 536 | Russian Federation | 526 | Hungary | 536 |
| 17 | Iran, Islamic Rep. of | 416 | Moldova, Rep. of | 496 | Netherlands | 523 |
| 18 | | | Slovenia | 490 | Slovenia | 518 |
| 19 | | | Cyprus | 480 | Denmark | 517 |
| 20 | | | Norway | 466 | Czech Rep. | 515 |
| 21 | | | Armenia | 437 | Lithuania | 514 |
| 22 | | | Iran, Islamic Rep. of | 414 | New Zealand | 504 |
| 23 | | | Philippines | 332 | Scotland | 500 |
| 24 | | | Tunisia | 314 | Armenia | 484 |
| 25 | | | Morocco | 304 | Norway | 477 |
| 26 | | | | | Ukraine | 474 |
| 27 | | | | | Iran, Islamic Rep. of | 436 |
| 28 | | | | | Georgia | 418 |
| 29 | | | | | Colombia | 400 |
| 30 | | | | | El Salvador | 390 |
| 31 | | | | | Algeria | 354 |
| 32 | | | | | Kuwait | 348 |
| 33 | | | | | Tunisia | 318 |
| 34 | | | | | Morocco | 297 |
| 35 | | | | | Qatar | 294 |
| 36 | | | | | Yemen | 197 |
Focusing solely on the rank of nations on international assessments such as the TIMSS is problematic (Soh, 2012). This focus leads to inferences about the relative quality of educational systems that may not be valid. Validity, the “degree to which evidence and theory support the interpretations of assessment scores” (NRC, 2001b, p. 39), is a central concern in assessment design (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; Songer & Ruiz-Primo, 2012). While an assessment itself is not valid or invalid, the inferences made from an assessment may have more or less evidence for validity (Downing, 2003). When evidence for validity is high, one can be more confident about inferences drawn from that assessment (NRC, 2001b).
The inferences that one can draw from an assessment are limited by the purposes intended in the design of the assessment (NRC, 2001b; Pellegrino, 2012). For example, when suggesting that teachers make use of available TIMSS items, Glynn (2012) cautions that teachers use these items only to assess student achievement as defined by TIMSS. Nevertheless, a single assessment is often used for multiple purposes, such as measuring student learning, promoting or firing teachers, and evaluating the quality of educational systems (Songer & Ruiz-Primo, 2012). The more purposes an assessment is used to serve, “the more each purpose will be compromised” (NRC, 2001b, p. 2). Thus, when interpreting results, it is important to limit inferences to the purposes for which the assessment was designed.
The purpose of the TIMSS assessment includes providing data about student achievement and learning contexts across multiple years and nations in order to inform educational policy in the participating nations (Mullis et al., 2009). The TIMSS was not designed to compare and rank the educational systems of multiple nations. In fact, Beaton, Postlethwaite, Ross, Spearritt, and Wolf (1999), writing on behalf of the International Academy of Education, caution that comparisons of nations are valid in only two situations: (a) when comparing results of items on which the nations being compared share a common curricular emphasis and sequence, or (b) when “a single underlying trait (or dimension of knowledge) is being assessed by the test” (Beaton et al., 1999, p. 18). It follows that inferences regarding the comparative quality of nations' educational systems drawn from reports of national rank (as shown in Table 1) have little evidence for validity.
A key reason why rank does not validly imply the quality of educational systems is that the curriculum of participating nations differs. For example, Reddy (2005) contends that South Africa's poor performance on the TIMSS (for eighth graders) is because that nation's curriculum is not adequately reflected in the assessment. South Africa's curriculum, Reddy argues, is focused on goals of “access, redress, equity and quality” (p. 67)—goals which are not reflected on the TIMSS. In addition to differing in content, the curriculum of participating nations may vary in sequence. It is not uncommon for nations to introduce skills or topics as much as two or three years apart (Beaton et al., 1999).
Comparing nations by simply looking at national rank on the TIMSS, as is often done in media reports (CPA, 2000), has limited evidence for validity. Beaton et al. (1999) stated, “Small changes [in rank] occur frequently and may simply be due to chance. These changes should not be taken seriously” (p. 27). Although a nation's change in rank is often cited as evidence of a change in the quality of its educational system (e.g., Arenson, 2004), this may not be the case. For example, a change in a nation's rank may be due to a change in the number of participating nations (Beaton et al., 1999; Soh, 2012). If the United States dropped from second place to fifth place from one year's assessment to the next, it may not be because the U.S. educational system had deteriorated. Rather, it is possible that U.S. performance improved, but two nations that outperformed the United States participated only in the latter year. Additionally, a nation's rank may change with changes in the demographic makeup of participating nations (CPA, 2000; Rotberg, 1998). Bracey (1998) provided a notable example highlighting the need for caution when emphasizing relative rankings of nations. In the 1995 TIMSS results, the difference between the scores of Bulgaria (fifth place) and Iceland (30th place) was only 10%. Although the rankings lead the reader to infer dramatic differences between ranks such as fifth place and 30th place, a closer analysis of the scores reveals that the actual difference may not be noteworthy. Thus, inferences about the relative quality of an educational system based solely on TIMSS rank have little evidence for validity.
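The point that rank depends on who participates, not only on how a nation performs, can be made concrete with a small sketch. The nations and scores below are invented purely for illustration and are not TIMSS data:

```python
# Hypothetical scores illustrating that a nation's rank depends on which
# other nations participate, not only on its own performance.

def rank_of(nation, scores):
    """Return `nation`'s rank (1 = highest score) among all scored nations."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(nation) + 1

# Year 1: the United States places second among four participants.
year1 = {"Nation A": 560, "United States": 550, "Nation B": 540, "Nation C": 530}

# Year 2: the U.S. score actually improves (550 -> 555), but two new
# high-scoring nations join the assessment.
year2 = {"Nation A": 560, "United States": 555, "Nation B": 540,
         "Nation C": 530, "Nation D": 580, "Nation E": 570}

print(rank_of("United States", year1))  # prints 2
print(rank_of("United States", year2))  # prints 4
```

In this sketch the drop from second to fourth place reflects only the change in the pool of participants; the underlying score rose.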
This study reexamines a sample of the results on TIMSS items from a few nations. These items were sorted by cognitive domain, a “dimension of knowledge” (Beaton et al., 1999, p. 18), allowing for more valid comparisons across nations. Specifically, this study asks:
- How did U.S. students perform as compared with students from other nations on TIMSS items in the same cognitive domain?
- How did each nation perform on knowing items as compared with applying items?
- How do nations rank when looking only at overall performance, as typically reported for international assessments?
This analysis highlights the difference between simply ranking nations and comparing student performance in specific cognitive domains. Rankings suggest differences in student performance that are not observable when items are disaggregated by domain. For example, although Australia is ranked last (in this sample) for both mathematics and science, there is no significant difference between U.S. and Australian student performance on items grouped by domain. Similarly, few nations seem to be more successful in one domain than the other. The significant differences that were observed are of limited practical importance, as indicated by their small effect sizes.
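As a rough illustration of why statistical significance alone says little about practical importance, consider an effect-size computation. The sketch below uses Cohen's d (a standardized mean difference, one common effect-size measure) on invented percent-correct data; the numbers are hypothetical and not drawn from the TIMSS results analyzed here:

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a) +
                  (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_var ** 0.5

# Hypothetical percent-correct scores on a set of items for two nations.
nation_a = [60, 85, 70, 75, 65, 80]   # mean 72.5
nation_b = [59, 84, 69, 74, 64, 79]   # mean 71.5

# The mean difference is one point, but relative to the spread of the
# scores the effect is small: well under 0.2, the conventional cutoff
# for even a "small" effect.
print(round(cohens_d(nation_a, nation_b), 2))  # prints 0.11
```

With large samples, even a difference this small can reach statistical significance, which is why reporting effect sizes alongside significance tests matters.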
Therefore, the primary conclusion from this analysis is that reports that focus on the rank of nations are limited in their scope and can lead to invalid inferences about the relative quality of educational systems. The TIMSS has been designed on a theory of learning (NRC, 2001b) that conceptualizes learning as having multiple levels of complexity (Mullis et al., 2009). Because items have been designed to correspond with levels of learning, the performance of nations can be compared at these levels. These comparisons can lead to inferences about how well students in participating nations are performing on items from the same cognitive domain. These inferences have much greater evidence for validity (Downing, 2003) than inferences about the relative performances of nations (Beaton et al., 1999).
These results cannot be assumed to hold for an analysis using all TIMSS items and all participating nations. However, this study suggests that analyzing the entire TIMSS assessment by cognitive domain would likely yield results from which more valid inferences could be made.
Inferences about cognitive domains could yield much more useful information for educators and policy makers. The rank of a nation is minimally informative. As discussed above, a nation's rank can change for many reasons, such as the addition of a participating nation or differences in a nation's demographic composition (Beaton et al., 1999; CPA, 2000; Reddy, 2005). Many of these reasons are unrelated to the improvement or decline of an educational system, though reports often imply otherwise (e.g., Asimov, n.d.). Analysis by cognitive domain gives educators and policy makers information about how students are performing on specific cognitive tasks. In this way, educators and policy makers could know what areas need improvement. For example, if students in Russia were outperforming other nations on reasoning items, it would be important to explore how Russia is helping students succeed at such high-level cognitive tasks. Additionally, consider the response if it were found that the United States was performing at a mediocre level overall but outperforming other nations on reasoning items. Although these results would indicate needed improvement in the areas of knowing and applying, excelling in the most challenging domain would be a great success.
In addition to providing more useful results, this type of analysis could help avoid some of the negative effects of large-scale testing. At present, some of the stress placed upon educators and policy makers stems from less-than-ideal ranks on assessments in which they are compared with nations whose curricula are more aligned with the assessment or whose demographic characteristics are more favorable (CPA, 2000; Reddy, 2005). Although a comparison by cognitive domain may be influenced by these factors, it is possible that, as shown in this study, nations perform more similarly than overall rankings indicate. Where differences do exist, analysis by cognitive domain provides more detailed and valid information. Such information is especially important because it may curb the tendency, driven by large-scale testing, for teachers to focus on low-level learning (Lomax et al., 1995; NRC, 2001a). If results highlighted that students were successful on knowing items but struggled with reasoning items, teachers might be prompted to adjust their teaching to encourage higher-level learning.
In the future, this analysis should be conducted with results from the entire TIMSS assessment, including all participating nations. This study, with its small sample, shows the potential of such an analysis. Because comparisons between nations are also valid when nations share a similar curriculum (Beaton et al., 1999), future studies should compare performance on items that align with the curricula of multiple nations. Additionally, it would be worthwhile to conduct a similar analysis spanning multiple years of data. In this way, one could ascertain whether students were improving in a particular cognitive domain over time and whether that improvement was similar to that of other nations.
Policy makers, educators, and citizens are urged to be cautious when making inferences from large-scale assessments. It is important to consider the theory of learning upon which the assessment was built (NRC, 2001b). It is also important to keep in mind the purpose the assessment was designed to accomplish and to keep inferences close to those purposes (Pellegrino, 2012). International assessments show promise for encouraging the improvement of educational systems, but that promise can be compromised when policy makers, educators, and citizens solely focus on international rank. Those reporting results to the public should take care to be explicit about these purposes (AERA et al., 1999) and avoid leading readers to unsubstantiated inferences (CPA, 2000).