Keywords: TIMSS; cognitive domain; validity


  1. Top of page
  2. Abstract
  3. Literature Review
  4. Methods
  5. Results
  6. Conclusions
  7. References
  8. Acknowledgments

Abstract

The results of international assessments such as the Trends in International Mathematics and Science Study (TIMSS) are often reported as rankings of nations. Focusing solely on national rank can result in invalid inferences about the relative quality of educational systems that can, in turn, lead to negative consequences for teachers and students. This study seeks an alternative data analysis method that allows for improved inferences about international performance on the TIMSS. In this study, four classroom teachers categorized a sample of TIMSS items by the cognitive domains of knowing and applying using the definitions provided by the TIMSS 2011 Assessment Frameworks. Items of different cognitive domains were analyzed separately. This disaggregation allowed for more valid inferences to be made about student performance. Results showed almost no significant difference between the performance of U.S. students and the students of five other nations. Additionally, no differences were observed in U.S. students' performance on knowing items and applying items, although students from some sample nations performed significantly better on knowing items. These results suggest that policy makers, educators, and citizens should be cautious when interpreting the results of TIMSS rank tables.

In making these recommendations we … wish to make clear that it is not our intention to interfere with the freedom of the press. Nor is it our intention to discourage public debate over how best to educate children. We do feel, however, that the public is not well served when it is exposed to misleading information and unsupported speculation. (Canadian Psychological Association [CPA], 2000, p. 14)

Assessments are designed based on a theory of learning (National Research Council [NRC], 2001b). How one defines learning will influence how one designs an assessment. For example, an assessment created from a sociocultural learning perspective would have students interact with each other to create a joint product, rather than individually responding to test items (Jakobsson, Mäkitalo, & Säljö, 2009; Noble et al., 2012). While there are many different theories of learning (Miller, 2009), much of the research in recent years agrees that learning is more than memorizing.

The processes of learning in mathematics and science are described in educational research. Some of the literature in mathematics education recommends that learning mathematics be viewed as acquiring mathematical proficiency (NRC, 2001a). Mathematical proficiency includes the five interwoven strands of conceptual understanding, procedural fluency, strategic competence, adaptive reasoning, and productive dispositions. Thus, a mathematically proficient student can perform computations (procedural fluency) and reason about complex relationships and situations (adaptive reasoning). Similarly, learning in science extends beyond knowing facts or concepts (NRC, 2012). One who learns science should understand how concepts connect through multiple disciplines and should be able to participate in the practices of science. Research related to expertise highlights the need to conceptualize learning as more than knowing facts (NRC, 2000). Rather than simply having a greater body of memorized facts, an expert's knowledge is organized around central concepts, contextualized to signal when using the knowledge is relevant, and transfers from situation to situation (NRC, 2000). Thus, it is important for assessments to be built upon a theory of learning that incorporates more than memorizing facts.

The Trends in International Mathematics and Science Study (TIMSS) is an international assessment that attempts to assess students' understanding at multiple levels. These levels include three cognitive domains: knowing, applying, and reasoning (Mullis, Martin, Graham, O'Sullivan, & Preuschoff, 2009). Knowing items require students to remember facts or procedures. Items from the applying domain require that students use knowledge to solve a science or mathematics problem. To respond correctly to reasoning items, students must find solutions to complex, multistep problems in unfamiliar contexts.

Although one could report the results of the TIMSS by cognitive domain, it is more common to report each nation's rank (e.g., Arenson, 2004; Armario, 2010; Asimov, n.d.; Nagesh, 2010). In this way, the scores of nations are lined up, and nations are ranked from the top to the bottom. It should be noted that some of the groups that participate in the TIMSS are not formally known as nations (e.g., Hong Kong). However, for simplicity, all groups used in this study will be referred to as nations. As seen in Table 1, the rank of U.S. fourth graders on science items has consistently dropped since 1995, when it was ranked third, to 2007, when it was ranked eighth. Such rankings have led to increased concerns about the state of U.S. education (Finkel, 2012; NRC, 2001a). For example, one report quoted President Obama as saying, “It is unacceptable to me, and I know it's unacceptable to you, for us to be ranked on average as 21st or 25th—not with so much at stake. We don't play for second place here in America. We certainly don't play for 25th” (Nagesh, 2010, para. 3).

Table 1. Rankings and Scores of Nations on TIMSS Fourth Grade Science

Rank  1995                          2003                          2007
      International Average         International Average (489)   International Average (500)
   2  Japan (574)                   Chinese Taipei (551)          Chinese Taipei (557)
   3  United States (565)           Japan (543)                   Hong Kong (554)
   4  Czech Republic (494)          Hong Kong (542)               Japan (548)
   5  England (551)                 England (540)                 Russian Federation (546)
   6  Canada (549)                  United States (536)           Latvia (542)
   8  Ireland (539)                 Hungary (530)                 United States (539)
   9  Scotland (536)                Russian Federation (526)      Hungary (536)
  10  Hong Kong (533)               Netherlands (525)             Italy (535)
  11  New Zealand (531)             Australia (521)               Kazakhstan (533)
  12  Norway (530)                  New Zealand (520)             Germany (528)
  14  Greece (497)                  Italy (516)                   Slovak Rep. (526)
  17  Iran, Islamic Rep. of (416)   Moldova, Rep. of (496)        Netherlands (523)
  18                                Slovenia (490)                Slovenia (518)
  19                                Cyprus (480)                  Denmark (517)
  20                                Norway (466)                  Czech Rep. (515)
  21                                Armenia (437)                 Lithuania (514)
  22                                Iran, Islamic Rep. of (414)   New Zealand (504)
  23                                Philippines (332)             Scotland (500)
  24                                Tunisia (314)                 Armenia (484)
  25                                Morocco (304)                 Norway (477)
  26                                                              Ukraine (474)
  27                                                              Iran, Islamic Rep. of (436)
  28                                                              Georgia (418)
  29                                                              Colombia (400)
  30                                                              El Salvador (390)
  31                                                              Algeria (354)
  32                                                              Kuwait (348)
  33                                                              Tunisia (318)
  34                                                              Morocco (297)
  35                                                              Qatar (294)
  36                                                              Yemen (197)

Note. The 1999 TIMSS assessment is not included because it tested only eighth-grade students. Nations that did not meet the sampling criteria are not included in these tables. Nations included in the sample for this study appear in bold in the original. TIMSS = Trends in International Mathematics and Science Study.

Focusing solely on the rank of nations on international assessments such as the TIMSS is problematic (Soh, 2012). This focus leads to inferences about the relative quality of educational systems that may not be valid. Validity, the “degree to which evidence and theory support the interpretations of assessment scores” (NRC, 2001b, p. 39), is a central concern in assessment design (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; Songer & Ruiz-Primo 2012). While an assessment itself is not valid or invalid, the inferences made from an assessment may have more or less evidence for validity (Downing, 2003). When evidence for validity is high, one can be more confident about inferences drawn from that assessment (NRC, 2001b).

The inferences that one can draw from an assessment are limited by the purposes intended in the design of the assessment (NRC, 2001b; Pellegrino, 2012). For example, when suggesting that teachers make use of available TIMSS items, Glynn (2012) cautions that teachers only use these items to assess student achievement as defined by TIMSS. Notwithstanding, a single assessment is often used for multiple purposes, such as measuring student learning, making decisions about promoting or firing teachers, and judging the quality of educational systems (Songer & Ruiz-Primo, 2012). The more purposes an assessment is used to serve, "the more each purpose will be compromised" (NRC, 2001b, p. 2). Thus, when drawing inferences from an assessment's results, it is important to limit those inferences to the assessment's intended purposes.

The purpose of the TIMSS assessment includes providing data about student achievement and learning contexts across multiple years and nations in order to inform educational policy in the participating nations (Mullis et al., 2009). The TIMSS was not designed to compare and rank the educational systems of multiple nations. In fact, Beaton, Postlethwaite, Ross, Spearritt, and Wolf (1999), writing on behalf of the International Academy of Education, which oversees the TIMSS, caution that comparisons of nations are only valid in two situations. These include (a) when comparing results of items upon which the nations in the comparison share a common curricular emphasis and sequence, or (b) when "a single underlying trait (or dimension of knowledge) is being assessed by the test" (Beaton et al., 1999, p. 18). It follows that inferences regarding the comparative quality of nations' educational systems drawn from reports of national rank (as shown in Table 1) have little evidence for validity.

A key reason why rank does not validly imply the quality of educational systems is that the curriculum of participating nations differs. For example, Reddy (2005) contends that South Africa's poor performance on the TIMSS (for eighth graders) is because that nation's curriculum is not adequately reflected in the assessment. South Africa's curriculum, Reddy argues, is focused on goals of “access, redress, equity and quality” (p. 67)—goals which are not reflected on the TIMSS. In addition to differing in content, the curriculum of participating nations may vary in sequence. It is not uncommon for nations to introduce skills or topics as much as two or three years apart (Beaton et al., 1999).

Comparing nations by simply looking at national rank on the TIMSS, as is often the case in media reports (CPA, 2000), has limited evidence for validity. Beaton et al. (1999) stated, "Small changes [in rank] occur frequently and may simply be due to chance. These changes should not be taken seriously" (p. 27). Although a nation's change in rank is often cited as the result of changes in the quality of an educational system (e.g., Arenson, 2004), this may not be the case. For example, a change in a nation's rank may be due to changes in the number of participating nations (Beaton et al., 1999; Soh, 2012). If the United States dropped from second place to fifth place from one year's assessment to another, it may not be because the United States' educational system has deteriorated. Rather, it is possible that the U.S. performance improved, but two nations that outperformed the United States participated only in the latter year. Additionally, a nation may change rank based on changes in the demographic makeup of participating nations (CPA, 2000; Rotberg, 1998). Bracey (1998) provided a notable example highlighting the need to be cautious when emphasizing relative rankings of nations. In the 1995 TIMSS results, the difference between Bulgaria (fifth place) and Iceland (30th place) was only 10%. Although the rankings lead the reader to infer dramatic differences between ranks, such as fifth place and 30th place, a closer analysis of the scores reveals that the actual difference may not be noteworthy. Thus, inferences about the relative quality of an educational system based solely on TIMSS rank have little evidence for validity.

This study seeks to reexamine a sample of the results on TIMSS items from a few nations. These items were sorted by cognitive domain, a "dimension of knowledge" (Beaton et al., 1999, p. 18), allowing for more valid comparisons across nations. Specifically, this study asks:

  1. How did U.S. students perform as compared with students from other nations on TIMSS items in the same cognitive domain?
  2. How did each nation perform on knowing items as compared with applying items?
  3. How do nations rank when looking only at overall performance, as is typically reported for international assessments?

Literature Review


In order to understand the context of this study, a brief overview of the TIMSS assessment is provided first. Next, some of the literature on large-scale assessments is summarized. Although this study focuses on the TIMSS, an international assessment, the relevant literature addresses many types of large-scale assessments, including international, national, and regional assessments.

TIMSS Assessment

The TIMSS assessment is an international assessment overseen by the International Association for the Evaluation of Educational Achievement, an organization that has been conducting assessments in multiple content areas around the globe for over 50 years (Beaton et al., 1999). The TIMSS assessment focuses on mathematics and science student achievement and related contextual factors (such as demographics and classroom factors). It is administered to fourth and eighth graders every four years.

As mentioned above, items are designed to assess the cognitive domains of knowing, applying, or reasoning. For mathematics, the TIMSS 2011 Assessment Frameworks (Mullis et al., 2009) describes knowing as including the following cognitive processes: recall, recognize, compute, retrieve, measure, classify, or order. Applying in mathematics includes the processes select, represent, model, or solve routine problems. Items that require a student to reason in mathematics require the ability to analyze, generalize, specialize, integrate, synthesize, justify, or solve nonroutine problems (Mullis et al., 2009). The science knowing domain includes the thought processes recall/recognize, define, describe, illustrate with examples, or use tools and procedures. The applying domain includes compare, contrast, classify, use models, relate, interpret information, find solutions, or explain. Reasoning in science requires students to analyze, integrate, synthesize, hypothesize, predict, design, draw conclusions, generalize, evaluate, or justify. As explained below, this study focuses on TIMSS items designed for fourth graders. Of these items, 40% (16–24 items) are knowing items, 40% (16–24 items) are applying items, and the remaining 20% (8–12 items) are reasoning items.

Criticisms of Large-Scale Assessments

Much has been written regarding large-scale assessments, both in the popular press (e.g., Kohn, 2004) and in academic outlets (e.g., Maltese & Hochbein, 2012). Some of these papers have been positive about the quality and effects of large-scale assessments (e.g., Cizek, 2001; Glynn, 2012). For example, Glynn (2012) found that TIMSS released items had good psychometric qualities. Although there were ways to improve some of the items, he recommended that researchers and teachers make use of these available high-quality items in their own assessments. MacPherson and Osborne (2012) found that items on the Programme for International Student Assessment, an international study similar to the TIMSS, were cognitively demanding. Additionally, Cizek (2001) argued that large-scale assessments have benefited schools in many ways, including encouraging improved professional development, greater teacher knowledge of assessment, and increased student learning.

Although these authors and others have responded positively to large-scale testing, many authors have written critiquing such assessments. One of the flaws commonly mentioned is that there are many factors that contribute to variation in test performance. Primary among these factors are cultural factors, including socioeconomic status (Noble et al., 2012; Rotberg, 1998), gender (CPA, 2000), student interest (Olsen & Lie, 2011), family makeup (Shen, 2005), and school structure (Rotberg, 1998; Shen, 2005). Some researchers have even proposed that differences in international assessment performance may be attributed to the proportion of low performers included in the sample population (Bracey, 1998; Rotberg, 1998).

Additionally, critics of large-scale testing have questioned the value of such tests because they lack information that is useful for impacting classroom instruction (NRC, 2001a; Reddy, 2005). Because what is assessed may not match with the nation's intended curriculum, and because of the challenge of understanding complex scoring systems, it is hard for a nation to determine what should be done based on the results of international assessments (Reddy, 2005). This is especially true when the primary focus of international assessments is the rank of each nation. Rather than emphasizing what students do or do not know so that educators can act to improve student learning, a focus on rank primarily makes international assessments a competition (Reddy, 2005; Soh, 2012).

Other authors, instead of critiquing the flaws of large-scale assessments, have focused on their negative consequences (AERA, 2000; CPA, 2000; Lomax, West, Harmon, Viator, & Madaus, 1995; NRC, 2001a; 2001b). For instance, focusing on testing has been found to increase low-level teaching. Lomax et al. (1995) found that teachers reported increasing their use of teaching practices that emphasized memorization and lower-level thinking because of pressure from standardized testing, an increase that was most prominent among teachers who taught high percentages of minority students. Additionally, research in mathematics education has found that when teachers prepare students for large-scale assessments, teachers focus on teaching procedural knowledge rather than mathematical proficiency (NRC, 2001a).

Other negative consequences of large-scale testing include detrimental psychological effects on school officials and students (NRC, 2001a). The CPA and the Canadian Association of School Psychologists (CPA, 2000) have published a position statement cautioning against emphasizing ranks of schools, citing examples of inappropriate behavior by school officials. In each of these situations, “the responsible factor was the increased pressure on the schools that resulted from the rankings” (CPA, 2000, p. 9).

Because of the potential negative consequences of large-scale assessments, it is important that the results be as clear as possible to avoid potential misinterpretation (AERA et al., 1999). Because nations can be validly compared when a common trait connects items, an analysis of items by cognitive domain should lead to a more valid comparison of nations.


Methods

This study was completed by a team of four classroom teachers who were in their final semester of a master's degree in science and mathematics education. This study focuses on fourth-grade items because the collective experience of the team was most aligned with the fourth-grade content. The perspective we brought as classroom teachers who were immersed in both science education research and daily interactions with students in the classroom is valuable (Glynn, 2012).

Data Collection and Coding

Data were found on the Dare to Compare Web site (National Center for Education Statistics [NCES], n.d.a), which allows users to view actual items from previous TIMSS assessments. Users select a subject area (e.g., mathematics), a grade level (fourth or eighth), and the number of items to view (from five to 20). The Web site then generates a sample of actual items from previous TIMSS assessments in the form of a practice test. The reason we used this Web site as a data source is that one can view the percent of students who correctly answered each test item for individual nations (e.g., Singapore). Therefore, this Web site allowed us to collect data on student performance on individual TIMSS items from multiple nations. The nations included in this analysis were the United States, International (meaning the average for all students who participated in the study), Australia, Singapore, Hong Kong, and England. These nations were included in this analysis because Dare to Compare reported scores on all items for only these six nations.

Thirty-nine fourth-grade mathematics items and 39 fourth-grade science items were collected. We stopped collecting items at this point because we began to encounter the same items repeatedly. The percent of students who answered each item correctly in each nation was recorded. Items collected in this sample came from the TIMSS 1995, 2003, and 2007 assessments. No information about the validity or reliability of these specific items was provided. However, as items that were previously included in the TIMSS assessment, a reasonable level of quality can be assumed (Glynn, 2012).

The research team coded each item by cognitive domain as described in the TIMSS 2011 Assessment Frameworks (Mullis et al., 2009). After team members discussed the interpretation of the framework, each team member individually coded the first ten items. The coding from each team member was compared and discussed. Because there was 90% agreement among the team members, it was determined that a sufficient level of interrater reliability had been reached for each team member to continue coding a portion of the remaining items independently. Upon completion of coding, the researchers found that out of all the items in the sample, only one qualified as reasoning. A possible reason for this is that the Dare to Compare Web site contains only multiple-choice items, which are best suited for assessing lower-level thinking. Because there were no other items to compare it with, this item was excluded from the analysis. Data were compiled in Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and analyzed in SPSS (IBM Corp., Armonk, NY, USA).
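The agreement check described above can be sketched as simple percent agreement between two coders. This is a minimal illustration, not the team's actual procedure; the item codes below are hypothetical, and a chance-corrected statistic such as Cohen's kappa would be a stronger check.

```python
def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders assigned the same cognitive domain."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes for the first ten items from two coders;
# they disagree on exactly one item.
a = ["knowing", "knowing", "applying", "knowing", "applying",
     "knowing", "applying", "knowing", "knowing", "applying"]
b = ["knowing", "knowing", "applying", "knowing", "applying",
     "knowing", "applying", "knowing", "applying", "applying"]
print(percent_agreement(a, b))  # 0.9, i.e., the 90% threshold the team used
```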

An example of an item from each domain is provided in Figure 1. Test item 5 for mathematics requires the process of "computing" to find the correct answer; thus, the item was coded as knowing. Test item 12 requires a student to mentally rotate a three-dimensional image, which qualifies the item to be coded as applying. No mathematics reasoning items were found in this sample. Test item 15 for science asks a student to recall and describe information about boiling water, which qualifies the item to be coded as knowing. Test item 1 is coded as applying because it asks a student to interpret the information given in the picture and explain it using scientific knowledge. In item 17, students must evaluate the best spot for growing crops based on the information provided in the image. This process of weighing options and considering the conditions in each location qualifies the item to be coded as reasoning.


Figure 1. Examples of TIMSS items.


It is acknowledged that all test items came from Dare to Compare (NCES, n.d.a), a Web site intended to help students ascertain how their ability to answer TIMSS items compares with that of students from other nations. The tool is part of the NCES Kids' Zone, which is intended to connect students with information about education and college (NCES, n.d.b). As such, we cannot claim that the items in this sample are representative of the composition of the TIMSS. In fact, because this sample contains only one reasoning item (2.5% of the sample), as opposed to 20% of the entire test, it is evident that this sample is not representative of the entire test. Additionally, items on Dare to Compare are restricted to multiple-choice items, thus neglecting the constructed-response items that are included in the TIMSS (Mullis et al., 2009). Notwithstanding this limitation, we chose to use these items because we had access to data from multiple nations about individual test items. Additionally, for this study, a sample that is representative of the entire TIMSS is not necessary because none of the conclusions require inferences about the entire composition of the test.

Data Analysis

In order to compare the performance of each nation on items of the same cognitive domain, a one-way analysis of variance (ANOVA) was conducted for each domain (i.e., mathematics knowing, mathematics applying, science knowing, and science applying). In this analysis, the independent variable was the nation, and the dependent variable was the percent of students who correctly answered the item.
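The one-way ANOVA described above can be sketched as follows. The study used SPSS; this is an equivalent computation in Python with SciPy, and the percent-correct values are hypothetical stand-ins (the actual analysis used five nations and 26 or 13 items per domain).

```python
from scipy import stats

# Hypothetical percent-correct values per item for three nations;
# each list plays the role of one nation's scores on items of one domain.
nation_a = [68, 72, 81, 59, 75, 70]
nation_b = [80, 85, 78, 88, 82, 79]
nation_c = [66, 70, 64, 73, 69, 71]

# Independent variable: nation (the groups); dependent variable:
# percent of students answering each item correctly.
f_stat, p_value = stats.f_oneway(nation_a, nation_b, nation_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

A significant F would then be followed by a post hoc test (the study used Dunnett's test, comparing each nation against the United States).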

The second question sought to determine whether each nation's students performed differently on knowing items and applying items in each of the subjects tested. Therefore, a series of independent-samples, two-tailed t-tests was conducted. These compared, for instance, U.S. student performance on science knowing items with U.S. student performance on science applying items. This was repeated for science and mathematics for all nations in the sample. Independent t-tests are reasonable because the same students did not take all test items, so the scores can be treated as independent of each other. Because each t-test addressed a separate question (e.g., Singapore science knowing vs. Singapore science applying and England mathematics knowing vs. England mathematics applying), an unadjusted alpha of .05 is appropriate.
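A single such comparison can be sketched as below. The fractional degrees of freedom reported in Table 3 (e.g., 31.238) suggest a Welch-type correction for unequal variances, which `equal_var=False` requests here; the data are hypothetical stand-ins for one nation's knowing and applying items.

```python
from scipy import stats

# Hypothetical percent-correct values for one nation:
# knowing items vs. applying items (stand-ins for the 26 and 13 real items).
knowing = [81, 68, 75, 90, 72, 66, 84, 78]
applying = [60, 55, 71, 58, 49, 65]

# equal_var=False performs Welch's t-test, which does not assume
# equal group variances and can yield non-integer df.
result = stats.ttest_ind(knowing, applying, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```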

Lastly, an international ranking table, mimicking those used to report TIMSS results, was created for this sample of nations and items. This table orders the nations by the average percent of students correctly answering sample items in mathematics and in science.
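Constructing such a rank table amounts to sorting nations by their average percent correct. The sketch below uses rounded values from the mathematics results reported later in this study; the ranking logic, not the numbers, is the point.

```python
# Average percent of students answering the sampled mathematics items
# correctly (rounded from the values reported in the Results section).
avg_correct = {
    "Singapore": 78.2,
    "Hong Kong": 76.6,
    "United States": 65.8,
    "England": 62.4,
    "Australia": 62.3,
}

# A rank table simply orders nations by their average, highest first.
ranked = sorted(avg_correct.items(), key=lambda kv: kv[1], reverse=True)
for rank, (nation, pct) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {nation:<15} {pct:.1f}")
```

Note how close England and Australia are (62.4 vs. 62.3): the table assigns them different ranks even though the underlying difference is trivial, which is precisely the study's concern about rank tables.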


Results

Comparisons Across Nations

A one-way ANOVA compared the performance of students across multiple nations on items of the same domain. Each of these results will be reported below and in Table 2.

Table 2. ANOVA and Post Hoc Analysis Comparing Student Performance on TIMSS Items

Mathematics Knowing
Comparison with the United States    Mean Difference    SE       p
Hong Kong                            12.308             4.894    .045*

Mathematics Applying
Comparison with the United States    Mean Difference    SE       p
Hong Kong                            7.846              5.973    .494

Science Knowing

Science Applying

Note. *** p < .001, * p < .05. ANOVA = analysis of variance; df = degrees of freedom; SE = standard error; TIMSS = Trends in International Mathematics and Science Study.
Mathematics knowing items

Analysis indicated a significant difference in the performance of students from all six nations on mathematics knowing items, F(4,125) = 5.342, p = .001, η2 = .15. This analysis was followed up with Dunnett's test, which compared student performance in each nation with that of the United States. This test indicated that the only nation that scored significantly differently (p = .045) than the United States (M = 68.885) was Hong Kong (M = 81.192). The performance of students in the other nations was not significantly different than the performance of U.S. students.
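The reported effect size can be recovered from the F statistic and its degrees of freedom, since η² = SSB/(SSB + SSW) = (F · df_between)/(F · df_between + df_within). A quick check against the mathematics knowing result:

```python
def eta_squared(f_stat, df_between, df_within):
    """Effect size eta squared recovered from an F statistic and its df."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)

# Mathematics knowing ANOVA reported above: F(4, 125) = 5.342
print(round(eta_squared(5.342, 4, 125), 2))  # 0.15, matching the reported eta^2
```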

Mathematics applying items

A comparison of student performance in the six nations on mathematics applying items revealed a significant difference, F(4,60) = 3.265, p = .017, η2 = .18. When the performance of students from each nation was compared with the performance of U.S. students using a Dunnett's test, it was found that none of these nations' students performed significantly differently than the United States on mathematics applying items.

Science items

No significant differences were found between the performance of students in the six different nations on science knowing items, F(4,125) = .241, p = .915. Additionally, the difference between the performance of students in these six nations on science applying items was not statistically significant, F(4,60) = .099, p = .982. These results indicate that no nations' students performed significantly differently than students in the United States on items in the domains of science knowing and science applying.

In short, few differences in student performance were observed when items were analyzed by domain. Students in most nations performed similarly to students in the United States. The only exception was Hong Kong, whose students performed better than U.S. students on mathematics knowing items. However, this difference was associated with a small effect size.

Comparisons Within Each Nation

In addition to comparing how students from different nations performed, this study sought to determine how students within the same nation performed on items from different cognitive domains. To do this, independent t-tests compared the performance of one nation on knowing items with that nation's performance on applying items for each subject. These results are reported in Table 3.

Table 3. Comparison of the Average Percent of Students Correctly Answering Items

Mathematics
                 Knowing (n = 26)    Applying (n = 13)    t       df       p (η2)
International    61.127 (16.088)     50.385 (11.969)      2.346   31.238   .026* (.15)
Australia        65.846 (17.955)     55.154 (14.052)      1.861   37       .071
England          64.885 (20.861)     57.538 (14.501)      1.28    32.819   .209
Hong Kong        81.192 (13.894)     67.385 (16.215)      2.768   37       .009** (.17)
Singapore        80.577 (14.662)     73.538 (13.176)      1.460   37       .153
United States    68.885 (19.780)     59.538 (17.391)      1.445   37       .157

Science
International    67.321 (17.145)     57.769 (17.001)      1.629   37       .112
Australia        70.462 (18.974)     64.154 (17.034)      1.011   37       .319
England          72.269 (19.437)     63.692 (18.843)      1.312   37       .198
Hong Kong        72.923 (16.134)     60.692 (22.728)      1.734   18.253   .100
Singapore        75.269 (16.639)     61.615 (20.484)      2.236   37       .031* (.12)
United States    73.000 (18.234)     64.538 (17.154)      1.392   37       .172

Note. ** p < .01, * p < .05. df = degrees of freedom. Values are mean percent correct (standard deviation).
Mathematics items

There was no difference in student performance on mathematics knowing and applying items in four of the six nations in this sample. Students in the international sample and in Hong Kong performed significantly differently on mathematics knowing and mathematics applying items. A greater percent of students internationally answered mathematics knowing items correctly (M = 61.127) than correctly answered mathematics applying items, M = 50.385, t(31.238) = 2.346, p = .026, η2 = .15. Similarly, more students in Hong Kong correctly answered mathematics knowing items (M = 81.192) than mathematics applying items, M = 67.385, t(37) = 2.768, p = .009, η2 = .17.
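For the t-tests, the reported effect sizes follow the analogous identity η² = t²/(t² + df), which reproduces the values in Table 3:

```python
def eta_squared_t(t_stat, df):
    """Effect size eta squared for a t-test: t^2 / (t^2 + df)."""
    return t_stat ** 2 / (t_stat ** 2 + df)

# Hong Kong, mathematics knowing vs. applying: t(37) = 2.768
print(round(eta_squared_t(2.768, 37), 2))      # 0.17
# International sample, mathematics: t(31.238) = 2.346
print(round(eta_squared_t(2.346, 31.238), 2))  # 0.15
```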

Science items

Only students in Singapore performed significantly differently on science knowing items and science applying items, t(37) = 2.236, p = .031, η2 = .12. This shows that a greater percent of Singapore students correctly answered science knowing items (M = 75.269) than correctly answered science applying items (M = 61.615).

From these analyses, it can be seen that many of the nations' students performed similarly on knowing items and applying items. There were only three places where a nation's students performed better on knowing items than applying items. No nation's students performed better in one domain for both mathematics and science.

Ranking of Nations

The rank of the nations for this sample is reported in Table 4. These ranks are based on the average percent of students who correctly answered items in this study sample. All five nations scored above the international average in mathematics (M = 57.546). Beginning at the highest rank, the nations were ordered in mathematics as follows: Singapore (M = 78.231), Hong Kong (M = 76.590), United States (M = 65.769), England (M = 62.436), and Australia (M = 62.282). In science, these five nations also scored above the international average. Beginning at the highest rank, the nations were ordered in science as follows: Singapore (M = 70.718), United States (M = 70.179), Hong Kong (M = 68.846), England (M = 69.410), and Australia (M = 68.359).

Table 4. Ranking of Sample Nations for Sample Items

Rank  Fourth-Grade Mathematics  Students Correct (%)  Fourth-Grade Science  Students Correct (%)
1     Singapore                 78.231                Singapore             70.718
2     Hong Kong                 76.590                United States         70.179
3     United States             65.769                Hong Kong             68.846
4     England                   62.436                England               69.410
5     Australia                 62.282                Australia             68.359


Conclusions

This analysis highlights the difference between simply ranking nations and comparing student performance in specific cognitive domains. Rankings suggest differences in student performance that are not observable when items are disaggregated by domain. For example, although Australia is ranked last (in this sample) for both mathematics and science, there is no significant difference between U.S. student performance and Australian student performance on items grouped by domain. Similarly, few nations seem to be more successful in one domain than the other. The significant differences that were observed carry little practical weight, as indicated by their small effect sizes.

Therefore, the primary conclusion from this analysis is that reports that focus on the rank of nations are limited in their scope and can lead to invalid inferences about the relative quality of educational systems. The TIMSS has been designed on a theory of learning (NRC, 2001b) that conceptualizes learning as having multiple levels of complexity (Mullis et al., 2009). Because items have been designed to correspond with levels of learning, the performance of nations can be compared at these levels. These comparisons can lead to inferences about how well students in participating nations are performing on items from the same cognitive domain. These inferences have much greater evidence for validity (Downing, 2003) than inferences about the relative performances of nations (Beaton et al., 1999).

These results should not be assumed to hold for an analysis using all TIMSS items and all participating nations. This study does suggest, however, that analyzing the entire TIMSS assessment by cognitive domain would yield results from which more valid inferences could be made.

Inferences about cognitive domains could yield much more useful information for educators and policy makers. The rank of a nation is minimally informative. As discussed above, a nation's rank can change for many reasons, such as the addition of a participating nation or differences in a nation's demographic composition (Beaton et al., 1999; CPA, 2000; Reddy, 2005). Many of these reasons have nothing to do with the improvement or degradation of an educational system, as is often implied by reports (e.g., Asimov, n.d.). Analysis by cognitive domain gives educators and policy makers information about how students are performing on specific cognitive tasks. In this way, educators and policy makers could identify which areas need improvement. For example, if students in Russia were outperforming other nations on reasoning items, it would be important to explore how Russia is helping students succeed at such high-level cognitive tasks. Additionally, consider the response if it were identified that the United States was performing at only an average level overall but was outperforming other nations on reasoning items. Although these results would indicate some needed improvement in the areas of knowing and applying, excelling in the most challenging domain would be a great success.

In addition to providing more useful results, this type of analysis could help avoid some of the negative effects of large-scale testing. As it stands, some of the stress placed upon educators and policy makers stems from a less-than-ideal rank on assessments where they are compared with nations whose curricula are more aligned with the assessment or whose demographic characteristics are more favorable (CPA, 2000; Reddy, 2005). Although a comparison by cognitive domain may be influenced by these factors, it is possible that, as this study shows, nations may perform more similarly than overall rankings indicate. Where differences do exist, analysis by cognitive domain provides more detailed and valid information. Such detail is especially important because it may curb the tendency of teachers, under the pressure of large-scale testing, to focus on low-level learning (Lomax et al., 1995; NRC, 2001a). If results highlighted that students were succeeding on knowing items but struggling with reasoning items, teachers might be prompted to change their teaching to encourage this higher-level learning.

In the future, this analysis should be conducted with results from the entire TIMSS assessment, including all participating nations. This study with a small sample shows the potential of such an analysis. Because comparisons between nations are more valid when nations share similar curricula (Beaton et al., 1999), future studies should compare performance on items that align with the curricula of multiple nations. Additionally, it would be worthwhile to conduct a similar analysis spanning multiple years of data. In this way, one could ascertain whether students were improving in a particular cognitive domain over time and compare that improvement with other nations.

Policy makers, educators, and citizens are urged to be cautious when making inferences from large-scale assessments. It is important to consider the theory of learning upon which the assessment was built (NRC, 2001b). It is also important to keep in mind the purpose the assessment was designed to accomplish and to keep inferences close to those purposes (Pellegrino, 2012). International assessments show promise for encouraging the improvement of educational systems, but that promise can be compromised when policy makers, educators, and citizens solely focus on international rank. Those reporting results to the public should take care to be explicit about these purposes (AERA et al., 1999) and avoid leading readers to unsubstantiated inferences (CPA, 2000).


References
  • American Educational Research Association (AERA) (2000). Position statement on high-stakes testing. Washington, DC: Author.
  • American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME) (1999). Standards for educational and psychological testing. Washington, DC: Author.
  • Arenson, K. W. (2004). Math and science tests find 4th and 8th graders in U.S. still lag many peers. The New York Times, December 15, 2004. Retrieved from
  • Armario, C. (2010). “Wake up Call”: U.S. Students Trail Global Leaders. Retrieved from
  • Asimov, N. (n.d.). No gain by U.S. students on international exam: Math, science scores stay only above average. San Francisco Chronicle. Retrieved from
  • Beaton, A. E., Postlethwaite, T. N., Ross, K. N., Spearritt, D., & Wolf, R. M. (1999). The benefits and limitations of international educational achievement studies. Paris: International Institute for Educational Planning/UNESCO.
  • Bracey, G. W. (1998). TIMSS: The message and the myths. Principal, 77, 18–22.
  • Canadian Psychological Association (CPA) (2000). A joint position statement by the Canadian Psychological Association and the Canadian Association of School Psychologists on the Canadian press coverage of the province-wide achievement test results. Ottawa, Canada: Canadian Psychological Association.
  • Cizek, G. J. (2001). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 4, 19–27.
  • Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37, 830–837.
  • Finkel, E. (2012). What can U.S. schools learn from foreign counterparts? District Administration, 48(2), 30–32.
  • Glynn, S. M. (2012). International assessment: A Rasch model and teachers' evaluation of TIMSS science achievement items. Journal of Research in Science Teaching, 49(10), 1321–1344.
  • Jakobsson, A., Mäkitalo, Å., & Säljö, R. (2009). Conceptions of knowledge in research on students' understanding of the greenhouse effect: Methodological positions and their consequences for representations of knowing. Science Education, 93(6), 978–995.
  • Kohn, A. (2004). What does it mean to be well-educated? and more essays on standards, grading, and other follies. Boston: Beacon Press.
  • Lomax, R. G., West, M. M., Harmon, M. C., Viator, K. A., & Madaus, G. F. (1995). The impact of mandated standardized testing on minority students. The Journal of Negro Education, 64(2), 171–185.
  • MacPherson, A., & Osborne, J. (2012). There's more to science than recall: An analysis. Paper presented at the Annual Conference of the National Association of Research in Science Teaching, Indianapolis, IN, March.
  • Maltese, A. V., & Hochbein, C. D. (2012). The consequences of “school improvement”: Examining the association between two standardized assessments measuring school improvement and student science achievement. Journal of Research in Science Teaching, 49(6), 804–830. doi:10.1002/tea.21027.
  • Miller, P. H. (2009). Theories of developmental psychology (5th ed.). New York: Worth Publishers.
  • Mullis, I. V. S., Martin, M. O., Graham, J. R., O'Sullivan, C. Y., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks. Amsterdam, The Netherlands: International Association for the Evaluation of Educational Achievement.
  • Nagesh, G. (2010). Obama: US Students' Performance in Science, Math is “Unacceptable.” Retrieved from
  • National Center for Education Statistics (NCES) (n.d.a). Dare to Compare. Retrieved March 14, 2012, from
  • National Center for Education Statistics (NCES) (n.d.b). NCES Kid's Zone. Retrieved April 14, 2012, from
  • National Research Council (NRC) (2000). How people learn: Brain, mind, experience, and school. Washington, DC: National Academy Press.
  • National Research Council (NRC) (2001a). Adding it up: Helping children learn mathematics. Washington, DC: National Academy Press.
  • National Research Council (NRC) (2001b). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.
  • National Research Council (NRC) (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. Washington, DC: Author.
  • Noble, T., Suarez, C., Rosebery, A., O'Connor, M. C., Warren, B., & Hudicourt-Barnes, J. (2012). “I never thought of it as freezing”: How students answer questions on large-scale science tests and what they know about science. Journal of Research in Science Teaching, 49(6), 778–803. doi:10.1002/tea.21026.
  • Olsen, R. V., & Lie, S. (2011). Profiles of students' interest in science issues around the world: Analysis of data from PISA 2006. International Journal of Science Education, 33(1), 97–120.
  • Pellegrino, J. W. (2012). Assessment of science learning: Living in interesting times. Journal of Research in Science Teaching, 49, 831–841. doi:10.1002/tea.21032.
  • Reddy, V. (2005). Cross-national achievement studies: Learning from South Africa's participation in the Trends in International Mathematics and Science Study (TIMSS). Compare: A Journal of Comparative Education, 35(1), 63–77.
  • Rotberg, I. C. (1998). Interpretation of international test score comparisons. Science, 280(5366), 1030–1031.
  • Shen, C. (2005). How American middle schools differ from schools of five Asian countries: Based on cross-national data from TIMSS 1999. Educational Research & Evaluation, 11(2), 179–199.
  • Soh, K. C. (2012). Fifteen-years-old students of seven east Asian cities in PISA 2009: A secondary analysis. New Horizons in Education, 60(1), 83–91.
  • Songer, N. B., & Ruiz-Primo, M. A. (2012). Assessment and science education: Our essential new priority? Journal of Research in Science Teaching, 49(6), 683–690. doi:10.1002/tea.21033.


Acknowledgments

The author would like to acknowledge the assistance of classmates Jeff and Jacob on this project. Thanks also to Nancy Wentworth, Eula Monroe, and the reviewers for their constructive feedback.