#### Discussion on the paper by Wild, Pfannkuch, Regan and Horton

**John MacInnes** (*University of Edinburgh*)

This paper is important for three reasons. First it asks us to think about statistics education in an original but fundamental way: what is the simplest, clearest, but theoretically sound way to teach inference—a keystone idea of the discipline—to students who may have little or no facility with mathematics, nor any particular interest in statistics nor any desire or ambition to study it further? Second it provides an opportunity for us to think about the relevance of their approach in teaching statistics to other audiences. Finally, it ought to make us think also about school mathematics education and its fitness for preparing students, at school, college or university, to understand statistics.

I have complete sympathy for the argument of Wild and his colleagues that statistical ideas, including inference, can usefully be taught in schools, and that schools should not be seen as providing only core mathematics competencies. I am a sociologist, but I find that colleagues in the life and even the natural sciences often find that the mathematics competencies that are taught in UK schools focus too much on passing assessments in mathematics techniques and far too little on either their application or any sense of the relevance and power of numbers.

I share also their desire to put the relationship between samples and populations at the heart of the matter; their faith in ‘complexity reduction strategies’ (which is not a bad definition for the whole enterprise of statistics) and their insistence on ‘motivationally compelling learning’ rather than proceeding logically and abstractly from first analytic principles. Univariate descriptive statistics might constitute, logically, the ground floor of the statistical edifice, but students often find it such an uninspiring place that they seldom want to explore the rest of the building. I teach postgraduate students who have little or no previous statistical training. They also struggle to get a clear idea of a sample and population, simply because they are used to proceeding on the assumption that data always come from that mythical beast the ‘representative’ sample. I thus endorse the authors’ decision to expel ‘external context knowledge’ from the learning process.

I share their realism about students’ inability to ‘think under the null’, their enthusiasm for visual approaches and above all their belief that

‘rules too often serve as replacement for thinking, even as a barrier to thinking, rather than ... aids to thinking that can help to marshal our common sense’

(school mathematics again?). The students I teach also struggle to ‘think under the null’ and, presented with precisely the variety and complexity of messages that the authors describe so well, can come to grasp at formulaic rules as a lifebelt, rather than developing an understanding through exploring data and coming to realize both the potential and the limits of a sample.

I have three, small, reservations. The first concerns their wish to keep such a clear distinction between ‘the description game’ and ‘the inference game’. There are costs and benefits here. Sufficiently accurate descriptions might inevitably yield differences, but I find that a clear distinction between description and inference is one that beginners neither see nor need to focus on: they describe in order to infer. Their aim is usually inference from a sample to a population, and description of their data is the means to achieve it. What is fundamental, as the authors argue, is ‘building within students a reliable propensity to conjure up the ideas of sampling variation’ when they see data. Comparing shift and spread is a good way to get at the mechanics of inference, but both come, in a sense, from the ‘description game’.

Second, is there a danger that students might draw a correct but misleading conclusion from the milestones: that big samples are good, full stop, rather than bigger samples make finer comparisons possible? Will they not see our aim, to use the authors’ visual metaphor, to see the world through the clearest possible glass?

Third, and I doubt that the authors would disagree, having secured some knowledge of inference, I would want to pay at least some attention to other sources of error, especially measurement error. At least in the social sciences, the construction of data is something that I find students struggle with. They find it easy to write off almost any attempt at measurement as hopelessly ‘relativist’. However, once data are in their hands, they rapidly acquire an aura of almost Papal infallibility. Measurement is another key source of ‘ripples’ in the glass.

Wild and his colleagues argue that ‘technology is the ultimate game changer’ in statistics education, making a convincing case for the role of computer-generated animations in teaching. Innovative visualization of data is an indispensable weapon, not only in teaching statistics but also in securing greater general statistical literacy. However, good animations need rare computing skills and time to produce. They do not fit well either with the standard artisan model of university education, nor with the existing system of funding it. This may be something that can be done better, and ought to be funded, collectively.

Finally, let me make an observation about what we might call the generation of statistical knowledge and awareness. Let us distinguish

- (a)
the development of the discipline of statistics in its own right,

- (b)
its application by those formally trained in statistics and

- (c)
its use and application by others.

This threefold classification is imprecise because the boundary between groups (b) and (c) is porous. The development of statistics has always been tied to its application, from agriculture to astronomy. However, with the expansion, professionalization and bureaucratization (in its Weberian, rather than any pejorative, sense) of scientific enquiry, many of those using and applying statistics will have studied the discipline, if at all, briefly and in a purely applied context. It is a tool that they may use with less concern for its inner workings than whether ‘it does what it says on the tin’. It is also imprecise because group (c) is virtually limitless. As the Royal Statistical Society's ‘getstats’ campaign reminds us, any active, and even not so active, citizen can benefit from statistics in making sense of an ever increasing supply of public data, or assessing risk.

We too often assume that the relationship between these three groups is simple and hierarchical. Professional statisticians whose expertise is rooted in their immersion in the discipline should set standards for, and where possible deliver, the training of the second group, as well as taking some responsibility for enlightening the third. Thus the Royal Statistical Society's mission is

- (a)
to ‘nurture the discipline of statistics by publishing a Journal, organising meetings, setting and maintaining professional standards, accrediting university courses and operating examinations’ and

- (b)
to ‘promote the discipline of statistics by disseminating and encouraging statistical knowledge and good practice with both producers and consumers of statistics, and in society at large’.

In making such an assumption we perhaps pay insufficient attention to two issues. First there is no necessary association between having specialist knowledge and being able to describe it. This applies with particular force when its description necessarily involves a novel specialist language for the learner. Second, the relationship between our three groups works best when hierarchy is moderated by vigorous communication. Perhaps the most important lesson of this paper is that we can best advance statistical knowledge when we listen carefully to those who do not yet have it.

I am pleased to second the vote of thanks to our authors for a very interesting paper. In particular there are three innovative insights and use of software that I commend. These are

- (a)
the insight of the sample as seeing the population through a distorting lens,

- (b)
the software that retains the different results of a boxplot with fading so that you have the idea that when I see one I remember the variability pattern and

- (c)
the idea of making a call that *x* is larger than *y* which hides the formal requirement for a null hypothesis.

So why do I feel distinctly uneasy when reading the paper as a whole?

It took me back to when I first taught school statistics in 1959 to students who were 16 years old and older. Here it was assumed that all data were good data and that all samples could be considered as random samples. These assumptions were not valid then; even less are they valid now.

The emphasis in the paper is on laying the foundations of classical statistical inference based on random samples. This is only one aspect of what is being done in modern day statistics. The authors say that these things are for all learners—just developing specific techniques over time. But all this takes time and all school time is under pressure. You would have to justify this particular set of teaching over other things within the school curriculum. Even when you have made the case for teaching statistics to all you have to justify which statistics teaching to which pupils. You have to prioritize. The authors dismiss other aspects of statistics—yet many of them are more important for most students.

The only distorting effect of the lens (the sample) that is considered is that of variability. A far larger distorting effect in most of the data that the students are likely to come across or to generate themselves is that of bias. All students need to be aware of how to obtain good data and to judge whether the data they have are good data.

The idea of making a call comes down to a set of rules. The learner could go away with the impression that the more statistical techniques you know the closer the two samples can be for you to make a call. Yet this is *not* what we want. You can always make a call if the two samples are different (e.g. in means or medians)—the important thing is to have some idea about how likely you are to be wrong. The idea that you might be wrong is fundamental and is seriously understated in this paper.

For most learners it is more important to know that, in the circumstances where you have random samples, when you make a call the following make you more likely to be right:

- (a)
the greater the difference between the samples;

- (b)
the larger the samples (so smaller between-sample variation);

- (c)
the smaller the variability within the sample.

Only then move on to the principle that we could use things like medians and edges of boxplots to indicate an acceptable degree of certainty in making this call.

This emphasis on random samples and sample variability raises the question of what is important in school statistics. The work of statisticians is to obtain and interpret data. Many applications today have data sets that do not come easily under the category of representative samples from a population. Many are population statistics in their own right; some could be samples—but if they are so seen it is not clear what sort of sample they are (e.g. sales by a supermarket to its customers using loyalty cards). It is the job of the statistician to make sense of these data. It is an important part of school statistics that students become aware of the sort of problems that statisticians work on and the sort of inferences they can draw.

This brings us to the idea of statistical inference. The authors take statistical inference to mean only the sorts of inference drawn in classical statistics—drawing inferences from random samples or those that can be considered as random. Statistical inference is much, much more than this. Intuitive statistical inference is broader than laying the basic foundations for classical statistical inference.

It starts when you look at the data and ask *what do these data tell me?* In early school years this can lead to straight deductive answers (e.g. more pupils in our class come to school by bus than by car). But it easily moves to inference when the question is changed to *what might have caused these data to be as they are?* This is the start of inference.

Answers to this question might include references to variability as in this paper, but they also give rise to many other possibilities and important statistical ideas. This question raises the idea of accuracy of the data—how were they measured and obtained? So we consider the nature of questionnaires and questionnaire design or collecting data efficiently through experimental design. It also raises the nature of the sample (if it is a sample) from which the data were drawn. This question may also point us to looking at similar data from the past to see how things have changed and inferences about what might have caused them to change. These raise the whole idea of bias in the data as a possible explanation for what is there. They are also topics that are readily grasped by most secondary school students.

All this links more with scientific understanding of inference and possible sociological or scientific explanations—and these might be open to further investigation on which is more or less likely. In my view these are more important aspects of statistics that all students should know and, if there is insufficient time to teach all, should take precedence over the more rule-oriented decision making that is described in the later parts of this paper.

The vote of thanks was passed by acclamation.

**Adrian Bowman** (*University of Glasgow*)

I add my thanks to the authors for a very stimulating paper. It must be recognized that there has sometimes been a slightly uneasy relationship between the world that is inhabited by teachers, of any subject and at any level, and the world of educational research. The authors of this paper are to be commended for helping to bridge the gap for statistics. In particular, they have clearly demonstrated the benefit of understanding what goes on in the minds of those who meet statistical concepts for the first time.

I agree with the authors that the development of concepts in the same ‘visual space’ is very helpful. This includes not only concepts of data but of models, e.g. in two-way analysis of variance where the superimposition of fitted means from different models on carefully structured plots of the data can help greatly in explaining ideas, in particular of interaction.

I agree with the authors that animation is a hugely valuable tool. It is not yet widely recognized that animation is actually readily available in R which has become to a large extent a common ‘computational space’ for statisticians. One example is the rpanel package (Bowman *et al*., 2007), which aims to make the construction of animations as easy as possible to do, particularly in an everyday, time-pressured teaching context.

These principles are applicable in more sophisticated contexts. For example, in non-parametric regression animation allows us to view the effects of altering the complexity of the model that we fit. There are standard methods of indicating uncertainty in our estimation, but graphical reference bands, expressing where our estimate is expected to lie if the relationship is linear, can provide an effective assessment of the suitability of that simple model. The authors are keen to avoid ‘thinking under the null’, for reasons I understand but do not completely share. If we view models as competing descriptions of the underlying population, then viewing the world through spectacles which in other language would be expressed as a null hypothesis is not necessarily a completely unnatural concept.

The fact that I would like to transfer the authors’ ideas to other arenas serves simply to underline the fact that I found the paper very stimulating and I would like to repeat my thanks to the authors for presenting it.

**Jim Ridgway** (*Durham University*)

*Statistical literacy should be about heuristics, not cook books*

The paper paves the way not just for a radical rethink of statistics curricula, but also for approaches to statistical literacy. A key idea in the paper is that statistical novices can use heuristics which lead to judgements that a professional statistician would agree with, without necessarily understanding technical aspects of inference. Here, I argue that this approach should underpin efforts to increase public statistical literacy, and I offer an example.

In the UK, great emphasis is placed on evidence-informed policy. Examination results are used to create school league tables; schools are required to present student performance data in particular ways and are strongly encouraged to use these data to reflect on issues of teaching quality. The National Strategies Web site (http://nationalstrategies.standards.dcsf.gov.uk/node/188855) offers guidance and instruction on how this can be done, illustrated by a school case history. Here is an extract, based on reading score tests from 10-year-old children:

‘*Analysis showed that those making the best progress were:*

This is then qualified by ‘because of pupil mobility, data available [*sic*] only for 15 out of the 24 who are currently in the year group’.

Statistical literacy is knowing that conclusions from such data are silly. An important mission for the Royal Statistical Society's ‘getstats’ campaign should be to ensure that government Web sites designed for key user groups should offer robust heuristics for interpreting data—such as the milestones in this paper. Milestones customized to particular sample sizes would be easy to generate on line. (Heuristics at a coarser grain size, e.g. *high dropout makes interpretation difficult*, could also be useful.) The current ‘library’ on the National Strategies Web site that provides clear explanations of important statistical ideas (regression, confidence intervals etc.) might better serve the intended audience if it focused on building statistical literacy via engagement with interactive displays.

Primary schools typically have year groups of around 30 pupils. Education effects are rather weak (Hattie, 2009). Wild and his colleagues show very clearly that conclusions about weak effects based on small samples are unwarranted. Primary school teachers need to be statistically literate. It is wrong to engage them in activities with high stake implications for their schools and professional activities that are conceptually flawed. This paper offers ideas and tools that are directly relevant to the statistical education of teachers.

**John Pullinger** (*Royal Statistical Society ‘getstats’ Campaign Board, London*)

Let me first congratulate Chris Wild and his fellow authors for the energy and inspiration in this paper. It is a rallying cry for our profession. We must find ways to make the ideas that underpin statistical thinking accessible to wider audiences if the discipline of statistics is to thrive and be appreciated for what it is: an essential life skill in today's world.

All too often we put people off. We seem dry and dusty. We make statistics look difficult. In schools we allow ourselves to be bit part players in mathematics lessons. Or, often even worse, we allow statistics to be mainstreamed in substantive subjects where students are encouraged to press a few buttons on a package and magic numbers pop out. We should never accept a null hypothesis. This paper argues with vigour that we need least of all to accept this nullest one.

The alternative is to make statistics interesting. The wholly visual approach that is advocated in the paper is appealing. The links that are provided in the paper take us to a series of mesmerizing pulsating boxplots. The approach makes the necessary intuitive connection to the fundamental ideas of statistical inference. I buy the analysis but I am compelled to ask for more.

Success needs to be judged, as in all fields of enlightenment, by the ability of the method to equip the seeker with a desire to apply their new-found skill in their own lives. They must have enough thirst to learn more on their own. To do this, the paper's rallying cry must call the profession to show just how pervasive statistical thinking is to everyday life and be its advocates. Each of us can show how many of the judgements that are made by us and those around us are based on a sample drawn from the fullness of reality. Each of us can show those we meet how to be enquiring about how those samples are drawn in order to make sound inferences for their choices. Each of us can help others, in schools, in work, in government, in the media and in everyday life to ‘getstats’.

Thank you for a paper that calls us to action.

**James Nicholson** (*Durham University*)

The authors acknowledge that the detail of the paper is looking at a narrow area of the curriculum and a small part of the statistical process. I agree that there are very interesting uses of technology, especially in visualization, which may help to build much better understanding of significance where random sampling is involved. The paper also talks about wider issues in statistical education and it is to those that I wish to address the bulk of my comments.

One striking feature of the New Zealand curriculum is the early introduction of multivariate data, and the repeated revisiting of reasoning from evidence in gradually more sophisticated contexts. Teaching mathematics in the UK I have seen students who are mathematically talented, or at least very competent, feel that they were not needed by the social sciences in that they never encountered real social problems as substantial contexts, nor worked with data where mathematics offered real insights into social issues. Those students with a leaning towards the social sciences, or wanting to address social issues, find the academic work mostly based on words with some headline statistics, usually of aggregated data, but little to suggest that there is a need for the students to be able to interpret the data for themselves, or that mathematics and statistics are potentially valuable assets.

As a consequence, in the UK at least, the current crop of social science students has a level of mathematical competency and statistical literacy which causes concern. The issues of dealing with multivariate data on the scale that is encountered in social sciences does not require statistical inference in the sense of making inferences about populations from random samples of quite small size. However, there is the task of reasoning from the evidence of large-scale multivariate data where headline statistics can be disaggregated by one or more factors to explore the nature of differences. For example, if on average women live longer than men, is this consistent across all ethnic groups?; across social class?; across all regions of the UK?; and is the situation changing over time? This is quite a different sort of statistical inference, but one which might have a broader usefulness than that addressed in detail in the paper.

I wonder whether the authors feel that the early engagement with multivariate data will do anything to reduce the perception which students seem to acquire that social sciences and mathematics belong in non-overlapping universes.

**Thomas King** (*University of Southampton*) **and Clare M. Woodford** (*Queen Mary University, London*)

There is a need for students to be taught more of the problem solving process and creativity of statistics, and also for these ideas to be accessible to all students. However, this leaves the questions of what is actually learned; what are the outcomes for different students; how can they apply this in their future life; and, how does this foster statistical literacy? If we define statistical literacy as being ‘familiarity with reading and writing data stories’ (King, 2011), then we need to think of how this may be developed as concurrent work shows this is not an immediate outcome (Pfannkuch *et al*., 2010).

The focus on visual thinking initially caused some concern as facility with different modes of thought varies substantially between students. However, the idea in the paper is to keep eyes on the graph whereas visual thinking typically means more abstract visualization than concrete comparisons (Grandin, 2006). To tell data stories visually, and also to assess students’ competence, they can certainly present their own arguments, giving a commentary supported by their own graphs. Indeed such a presentation would engage other students in interpreting the data stories of their peers and reading their graphs. This would also bring a greater understanding of the difficulty of engaging with unfamiliar graphs which have typically not been understood by scientists (Roth, 2004).

‘would help students develop mathematically as citizens [through] understanding and using statistics (including a critical appreciation of how they are presented and interpreted in the media)’

(Advisory Council on Mathematics Education (2010), page 5) even on the least technical level 3 pathway. Smith (2004) recommends that statistics should be taught outside mathematics, although this was not recommended by the government, perhaps partly because of teaching capability. Thus this paper would present the technical part of a wider implementation of statistics in the school curriculum but does not address the range of mathematical outcomes that students within their programme may have. Furthermore, if this will be such a new development in schools, how will the learning be supported by parents and teachers in other subjects, and does this strategy for developing inference enable conceptions of statistics as a versatile method for empirical discovery rather than a recipe for divining significance (see Petocz and Reid (2010))?

**Julian Stander and Rana Moyeed** (*University of Plymouth*)

We congratulate the authors on their educationally important and thought-provoking paper. We very much like the idea of producing a mental picture of superimposed or vibrating boxplots whenever we deal with data from a continuous random variable such as those shown in Fig. 5 and Fig. 6. This idea has already been included in an introductory statistics module delivered to a large group of students at the University of Plymouth as we believe that it will provide a valuable aid to in-class discussion of repeated sampling. Nevertheless, we contend that notions like repeated sampling from a parent population, sampling variation, confidence intervals, statistical significance and *p*-values will continue to present many pedagogical difficulties. Although we understand the need for students to have a good knowledge of frequentist statistics, we would like to see the statistics education community give more emphasis to the development of tools for the teaching of Bayesian techniques, an excellent example of which is Tony O'Hagan's First Bayes software that is available from http://www.firstbayes.co.uk. Such tools can also enhance and enrich the teaching of probability ideas, a firm grasp of which is fundamental for the understanding of statistical inference. We shall shortly start to investigate with the Royal Statistical Society Centre for Statistical Education how visualization techniques motivated by those in this paper can be applied in the Bayesian framework. We believe that adopting the Bayesian approach to learning about underlying processes would help educators to present inferential ideas in a much more convincing and scientifically natural way, avoiding the issues and topics related to repeated sampling that are often so difficult to understand and explain. In any case, the authors’ use in Section 3.3 of the phrase ‘‘‘I'm thinking that the true median is likely to be in here somewhere'’’ suggests that people instinctively adopt a Bayesian approach for statistical inference.

**Ramesh Kapadia** (*Institute of Education, London*)

It is fascinating to hear a paper about teaching conceptions of statistical inference read in London three decades after the first major international project on teaching statistics completed its main work in schools in England (Schools Council Project on Statistical Education, 1980). As the project officer under Peter Holmes I fondly remember the innovative ideas expounded, many of which are now part of the English national curriculum.

The units of work that were produced for schools in the late 1970s did include some inferential work, such as on drawing implications from samples to populations. Yet, since the work was aimed at the ‘bulk of students and not just an intellectual *élite*’ (page 5), we avoided the definition given here of statistical inference as ‘confidence intervals ... and ... distributions’. We encouraged pupils to look at data in context, without applying formal tests. We also aimed to underpin the work with the sound foundations that are afforded by probability. This aspect seems to have been underplayed in the current paper, but it is explored in Kapadia and Borovcnik (1991) and Borovcnik and Kapadia (2009). We would be interested to hear of how even the majority of students can really cope with ideas of formal hypothesis testing or milestone 3, since many pupils find probabilistic ideas difficult (Kahneman *et al.*, l982).

This paper uses visual possibilities of technology to widen the possible approaches: evidence on how well this works is now needed, notwithstanding the pioneering work of Hans Rosling. We would like to see detailed evaluative research testing the efficacy of such an approach. This needs to confront the relative lack of impact of statistical software and computers in mathematics education as a whole, in comparison with the sometimes quite extravagant claims that are made for geometrical and statistical packages.

The paper advocates that the required inferential steps should not be demanding and supports a rule-based approach. This ignores the fact that the process of reaching rules is pedagogically important and too camouflaged here.

We would also be interested in reading about national assessment from New Zealand where it is ‘*impossible*’ to obtain a good mark by blindly applying tests and high marks will require ‘understanding and insight’. Even a few examples of such assessment would mean that New Zealand led the world, for such questions are all too rare in examinations in England.

The following contributions were received in writing after the meeting.

**Alan Agresti** (*University of Florida, Gainesville*)

I commend the authors for making sensible proposals for improving the teaching of statistical inference at various stages of the education process. I like their emphasis on a gradual building of complexity using visual displays of sampling variation.

Like the authors, I believe that the inference part of most introductory statistics courses overemphasizes significance testing. I agree that it is more informative to focus on answering ‘Which is bigger, and how much bigger?’ than ‘Are they different?’. We all recognize how much difficulty students have in ‘thinking under the null’ and properly interpreting *p*-values. Given its difficulty and how few of the students at this level will ultimately be conducting research or reading research journal articles, I do not think it warrants so much attention. In addition, for a staged development path in schools such as the authors recommend, many (probably most) of the teachers themselves are unlikely to understand well the subtleties of significance testing. By contrast, I think that students find it easier to understand a confidence interval than the many steps of a significance test. In the authors’ milestone 4 introducing formal inference, I would also suggest explanation of confidence intervals and the distinction between practical and statistical significance.

The authors’ milestones emphasize the analysis of quantitative variables, treating categorical variables in an extension. I believe that the authors’ complexity reduction strategies in explaining sampling variation would be enhanced by also discussing categorical data at early stages, focusing mainly on the binary case. The students will already have heard about polls and other surveys, so it is easier for them to comprehend variability from sample to sample in a percentage than variability in a boxplot. Simulation to illustrate sampling variation is also simpler to explain with binary data. At an early stage, the instructor could connect a binary version of the Fig. 11 footprint with the sample size by noting that the simple error bound of 100/√*n* (on the percentage scale) describes variability of the footprint.

In summary, our students’ understanding and appreciation of the importance of statistics would improve, and our profession would ultimately be better off, if more of us followed the authors’ lead in thinking seriously about these issues and then making concrete proposals for change.

**Janet Ainley** (*University of Leicester*) **and Dave Pratt** (*Institute of Education, London*)

The paper presents an innovative pedagogic sequence for leading students towards an understanding of sampling variation. It is argued that current simulation methods involving resampling confuse students. Naive learners’ grasp of the key concept is tenuous and so it is reasonable to conjecture that approaches which follow the heuristic ‘keep the eyes on the graph’ might focus attention. We look forward to this idea being tested.

The approach proposed aims to support a relational understanding of sampling variation by employing an analogy in which the actual patterns in the population are distorted through sampling variation. Current pedagogic methods continue to emphasize an instrumental approach in which procedures are temporarily learned and soon forgotten because of their meaninglessness.

A continued tension, however, is that, although the pedagogic focus is on gaining a relational understanding of sampling variation, students are likely to lose any sense of how this notion could be significant for them. It is one thing to understand the difference between the description game and the inference game, but that matters little if they see no point in playing the game at all! We therefore advocate an early emphasis on investigations that throw up the need for sampling and issues about why sampling variation is useful. In other words, we want milestone 1(a) to be ‘writ large’ so that a utility-based understanding might be constructed and not forgotten in the imaginative learning trajectory that might afford a relational understanding.

Such tasks would need to focus on the question ‘lf you are trying to answer a question about the population, why look at a sample?’. In a real life situation, the answer to this question is that the population is not accessible, because it is very large (in the case of market research or opinion polls), because it is not finite (as might be the case in a scientific experiment) or it is impractical (as in a blood sample). An example in this paper is a sample taken from the CensusAtSchool data for New Zealand. Taking samples from this is a pedagogic device, since all the data in the population are available. Such pedagogic artifices are often deployed when relational understanding is sought but we need to remember that, in such situations, the lack of authenticity works against utility-based understanding.

**Murray Aitkin** (*University of Melbourne*)

As I have not taught statistics at school level my comments are relevant to the university introductory course for non-mathematicians.

I have had happy success with the approach of Hodges *et al.* (1975). The finite population approach in which students draw their own random samples by dice throwing seems to me the essential place to begin: it raises randomness concepts immediately, which can be followed up with the examples of voluntary sampling biases in Moore's several books, and clinical trial design. It also allows us to define probability, which is a term that is apparently missing from the early courses described by the authors, in terms of equally likely cases based on the die regularity.

I restricted the inferential content to population proportions. Once we have probability, it is straightforward to compute probabilities of events, and now Bayes theorem for simple two-alternative hypotheses becomes simple to use, also replacing Venn diagrams by two-way contingency tables.

To make inferential statements by plausible intervals we can expand the two-alternative case to the *K*-alternative case—the advantage again of finite populations is that the probabilities are all small but finite, and the usual credible interval calculation can be introduced as an approximation to the real finite sum. For continuous variables we dichotomized them at the median.

In the course I last gave at the University of Newcastle, I interspersed the discussion of probability and sampling with set piece lectures on specific real world studies which illustrated the importance of proper design—the analysis was not the issue.

I had the highest approval rating ever from students in this course, and I enjoyed it more than any other course I have taught. The weakest student said to me that she really enjoyed it and came away from it with the conviction that statistics was very important in all areas of society.

It was a revelation to her that her 95% interval did not include the true value, though those of all the other students did! She asked with great embarrassment ‘What did I do wrong in my sampling?’. ‘Nothing!’ was my reassuring answer. ‘You are discovering an inherent limitation of statistical inference—we can only make 100% confident statements which are trivial!’

**Adrian Baddeley** (*Commonwealth Scientific and Industrial Research Organisation, Floreat, and University of Western Australia, Perth*)

First I wish to emphasize the importance of this topic in the widest sense. It is already clear that some of the major scientific errors and technological failures of the 21st century will be caused by erroneous inferences from data. An increasing proportion of research effort in many disciplines is devoted to data analysis rather than laboratory or fieldwork. New research findings rest increasingly on inferences from data, so it is vital that the key ideas of statistical inference be absorbed into general knowledge, It is now crucially important for our discipline to take seriously the task of communicating statistical ideas, particularly statistical inference, to a wide audience.

Arguably the authors’ project is not just about finding new and better ways to communicate key statistical concepts. Perhaps on the principle that we do not really understand something until we can explain it to a child, the project is also a fundamental investigation of the logical structure of statistical inference. The challenge of presenting the key ideas in graphical form is also an investigation of connections between statistics and geometry.

The authors demonstrate some of the wonderful opportunities that are presented by new information technology for communicating statistical concepts. This technology also presents our greatest threat, namely the ‘drag-and-drop’ model of data analysis, the growing expectation that statistical analysis can be reduced to a software process. To respond, we need to identify and emphasize those parts of statistical methodology which truly cannot be automated, and this again is related to the authors’ project.

First-year university students in microbiology are shown how to extract DNA from a tissue sample in their first laboratory class. They follow a recipe and are completely unaware of the complex biochemical and physical processes that are involved; yet the excitement in the room is palpable. Our challenge is to match that level of excitement when we teach statistics.

**Manfred Borovcnik** (*University of Klagenfurt*)

The authors’ approach to statistical inference is visually oriented and rule based, but not closely linked to context (which is not discussed here), decision logic and probability. The comparison of two distributions is related to the key question of ‘When can I make the call that B tends to give larger values than A?’(pages 259 and 270), which lacks an easy answer and might lead one astray. The probability of a paired comparison being incorrect can be high for various distributions.

The visualization of sampling fluctuation is compiled by the superposition of repeated boxplots. However, the blurred graph may emerge from boxes that are centred at the same point widening or narrowing, or, with the same width shifting left and right. More importantly the *size* of fluctuation is caught only *qualitatively*. Mathematically the minimum and maximum (marked by whiskers) of larger samples are spread more to the extremes; yet, paradoxically, the diagrams (page 258) show the opposite.

The three suggested milestones vary in the inherent *α*-levels of the rules: it varies in milestone 1 from 15% to 0.4% (with *n*) and is about 8% in milestone 2, and 2% in milestone 3. So more sophisticated milestones, perhaps perversely, do not necessarily lead to better outcomes, depending on the sample size: moreover, the power to detect a difference decreases by later milestones, which is counterintuitive statistically and perhaps misleading.

There are situations where a difference is correctly detected by the crude rule of milestone 1 but not with the rule of milestones 2 or 3. For a difference in the means of two distributions of the size of 0.6745*σ* (with common standard deviation *σ*), rule 1 (of milestone 1) has a power of 0.66, and rule 2 only of 0.52, whereas rule 3 goes down to even 0.44 (as estimated from a simulation study from normal distributions with 30 data).

The poorer behaviour of the milestone 3 rule is partly caused by the ‘intuitive intervals’, which treat extreme and usual cases as equally likely. This procedure is defended by the authors but makes the ‘confidence’ interval broader by a factor of roughly and precludes a probabilistic argument. Thus this is not a good progression to the more conventional milestone 4 of formal statistical inference.

Besides the inconsistencies noted, the visual representation helps only to *memorize* the rules, but not to understand them. Where is the cognitive development as rules are not motivated?

**Mike Camden** (*Statistics New Zealand, Wellington*)

This contribution aims to raise two issues. The first is that a widespread understanding of inference is vitally important for the quality of official statistics. The second is that an understanding of sampling uncertainty for estimates is valuable in itself. The methods in this paper are used mainly for comparisons, but they could be adapted for estimates like medians and proportions.

The authors ask ‘why should statisticians care ...’ about school statistics? Official statistics agencies have good reasons to care. Much of the agencies’ efforts go into sample surveys. The quality of the results from these surveys depends on response rates. The approaches in this paper will enable current students and future citizens to understand the aims of statistical inference, and so to value their contributions to social and business sample surveys. A statistically informed society and high response rates are vital for minimizing sampling error as well as bias.

The authors make the results of sampling visible for estimates of centres (Figs 5 and 6) and proportions (Figs 11 and 12). However, much of their thinking centres on comparisons, and ‘making the call’ about differences. A task for the future may be to apply their thinking to the estimates, and indeed to estimates of differences. Questions about estimates may be less motivating than questions about comparisons, but they are common in official statistics, and in the rest of the work of ‘data detectives’. As the authors note, the vibrating boxplots of sample medians lead to intervals of uncertainty. Perhaps these can be used by students to make informal inference statements about population medians and similar measures.

The authors note that ‘computer technology has changed the world entirely’. Methods like those in this paper, soundly designed in terms of both statistics and pedagogy, will enable today's students and tomorrow's citizens to contribute to and to use data wisely. Perhaps they will say ‘I helped to reduce the uncertainty in that estimate’, and be proud of it!

**Len Cook** (*Victoria University of Wellington*)

The on-going revolution in information and communications technologies brings new challenges to the numeracy of populations in all countries. How to make this a compelling challenge to people at all stages of learning is of serious concern to which statisticians have generally not yet responded successfully.

The ideas and methods in this paper are a significant addition to the statistical toolbox that we need for our role as educators in statistical reasoning. They extend the capacity to lift numeracy, not only of those involved in teaching, but also of those who develop and present evidence.

The societies that we live in are in the midst of responding to major transitions, and common to almost all are the effects of social and demographic change, globalization and climate change and information and communications technology. We can effectively plan the future form of but few activities and systems without a rich understanding of the context in which they will operate, and the understanding that we need will come from understanding and evaluating evidence from both traditional and innovative sources of information. Most decisions will be taken by those without a trained scientific understanding of the validity of the evidence gained from traditional approaches, whatever their field of endeavour.

Developing simple means of bringing concepts of variation and statistical inference to a wide range of decision makers and analysts, whatever their stage of learning, would be an extraordinary achievement, in terms of its effect on the quality of decision making in such uncertain times. This paper is also a reminder that those involved in the development of evidence, and its evaluation, have obligations to ensure its accessibility not only to those involved in policy development and advice, but also for informing the wider community about whatever new knowledge we have. We are reminded that this includes ways of assessing evidence, as well as knowledge of the evidence itself. To this extent, the authors should have as their target those who engage in scientific activity in all fields, simply because public understanding in an informed democracy of the evidence that is produced about communities, societies and the planet itself is vital for the political choices that we all now face.

**Neville Davies** (*Royal Statistical Society Centre for Statistical Education, Plymouth*)

Scanning the Society's journal over the last 173 years it is clear that there are few papers that address the teaching of statistics, let alone help beginning learners of it. Therefore I particularly welcome the fact that this paper is designed to help teaching *and* learning in an area that some regard as ‘obvious’ or ‘simple’: inference is often difficult for newcomers—especially non-specialists. It is all too easy not to realize the difficulty that students have in getting to grips with the concepts of inference.

The authors brilliantly marry easy-to-use technology with exploitation of visual senses to produce better understanding of the relationship between samples, their sizes, variation, sampling and how these relate to what are ‘back in the populations’ (adopting the authors’ catch-phrase). These result in simple but giant steps forward in helping to teach *and* learn these aspects of statistics. However, I believe that the authors should stress *more* that the approach is useful for beginning learners of inference at *any* age.

The authors make extensive use of real survey-type data from *CensusAtSchool* New Zealand, one of several similar projects run in six other countries, including the UK. The data produced have been useful to help to develop statistical thinking at school and other levels (Connor *et al.,* 2006). The databases from each country's implementation are stored at the Royal Statistical Society Centre for Statistical Education (RSSCSE) and contain well over 1.4 million responses that can be sampled over the Internet. The RSSCSE intends to build from the ideas in the paper, exploit the family of databases, augment them with others that reflect data production methods from designed experiments and use technology and visualization tools to create an electronic Wiki-type dynamic collaborative teaching, learning, assessment and continuing professional development environment. It will be free to use and will support the RSS 10-year statistical literacy campaign (http://www.getstats.org.uk). There will be six dynamically connected sections:

- (a)
data visualization, interrogation and presentation using the authors’ diagrams with memory and geographical information systems;

- (b)
investigation and problem solving, including opportunities for discussion and review;

- (c)
teaching resources with a shareable teacher's corner enabling exchange of dynamic documents;

- (d)
learning resources, including a learner's corner with self-reflective journals;

- (e)
embedded assessment, including formative, summative and a problem solving approach to teaching and learning (

Marriott *et al.*, 2009);

- (f)
continuing professional development and access to on-line resources leading to accreditation by the RSS.

The environment will be available on the RSSCSE Web site in due course.

**N. I. Fisher** (*University of Sydney and ValueMetrics Australia, Sydney*)

I congratulate the speakers on their efforts to interest the wider statistical community in what—and how—school students learn about statistics. And New Zealand is to be congratulated on its leadership in establishing a national curriculum labelled *mathematics and statistics*.

Their paper identifies difficult challenges in introducing a school statistics curriculum that will, on the one hand, equip all students with some life skills which they will need in dealing with the vagaries of life and, on the other, attract some of the abler students to pursue further statistical studies.

Currently, in Australia, statistics is trapped in a game of double jeopardy, incorporated invisibly in a curriculum area called mathematics (seemingly its most obvious home); and, as a consequence, mathematicians believing that ownership of statistics is rightfully theirs. Since 2002, the leadership of the Australian Bureau of Statistics and of the Statistical Society of Australia have been working on strategies to revolutionize statistics education from kindergarten to post-doctoral research. It has been an arduous journey, with occasional moments of great promise followed by lengthy periods of desolation (Fisher, 2010). We are now close to having a statistics stream in a new national K-12 Mathematics (*sic*) Curriculum, but we are still battling to have it read like a statistics curriculum developed by statisticians and not by mathematicians. (However, there are grounds for hope: after one lengthy discussion, a mathematics educator finally remarked to me, ‘Ah, I get it. Statistics starts with a question’.) Many mathematics teachers do not appreciate that

‘... Statistics is a science ... and it is no more a branch of mathematics than are physics, chemistry and economics; for if its methods fail the test of experience—not the test of logic—they are discarded’

‘The traditional road to statistical knowledge is blocked, for most, by a formidable wall of mathematics’.

There have been many interesting lessons to learn from our experience.

Perhaps statistics education and the related area of professional accreditation should be viewed as part of public awareness, and so benefit from a broad, professionally developed strategy based on synthesizing the numerous, very worthy, yet disparate activities (booklets, e.g. *Statistics: a Job for Professionals*, available from http://www.statsoc.org.au/statistics.htm, Web sites such as http://understandinguncertainty.org/, journals like *Significance*, ...) that are currently in place.

**Joan Garfield and Andrew Zieffler** (*University of Minnesota, Minneapolis*)

*Information and formal statistical inference?: new questions raised*

Wild and colleagues have done a laudable job of describing problems that students have in learning to make statistical inferences and in also creating an innovative approach to prepare school children to reason informally about statistical inference. The ideas espoused in the paper, with the pedagogical focus at earlier grades on the big concepts underlying inference (e.g. sampling variation), rather than obfuscation via procedures (e.g. *t*-tests) and theory (e.g. the central limit theorem), are at the heart of informal inferential reasoning.

‘the way in which students use their informal statistical knowledge to make arguments for connections between observed samples and unknown populations’,

- (a)
reasoning about possible characteristics of a population (e.g. shape and centre) based on a sample of data,

- (b)
reasoning about possible differences between two populations based on observed differences between two samples of data (i.e. are differences due to an effect as opposed to just due to chance?) and

- (c)
reasoning about whether or not a particular sample of data (and summary statistic) is likely (or surprising) given a particular expectation or claim.

Informal inferential reasoning has been the focus of recent research and discussion, as it appears to offer a promising way to help students to build important concepts of statistical inference. Furthermore, it allows the study of how students intuitively reason about sophisticated ideas that are usually difficult to formalize and use. In Zieffler *et al.* (2008) we suggested several questions that emerge from the study of informal and formal statistical inference that can guide research. The approach presented by Wild and his colleagues suggests many other questions, such as the following.

- (a)
What are effective ways to develop students’ informal inferential reasoning?

- (b)
How does good informal inferential reasoning foster students’ ability to use and understand formal methods of statistics inference?

- (c)
What aspects of formal inference are needed given current tools and approaches (e.g. randomization methods)?

Wild and his colleagues have created an approach worth studying that can help to address the first two questions. However, a key question is whether good informal inferential reasoning can stand alone, or whether it must be a stepping-stone to formal methods of inference. If students develop good ways of reasoning about the uncertainty of conclusions drawn from sample data in light of variability, would that not be a worthy end goal of statistical instruction?

**Andrew Gelman** (*Columbia University, New York*)

I agree that, wonderful as informal plots and data summaries are, we also should be teaching students formal statistical inference, which is a big part of what separates statistical thinking from mere intelligent commentary. I like the authors’ formulation that statistical inference

‘addresses a particular type of uncertainty, namely that caused by having data from random samples rather than having complete knowledge of entire populations, processes or distributions’.

‘we also need to start from a clear distinction between when we are playing the description game and when we are playing the inference game’.

I would go one step further and get rid of the concept of ‘description’ entirely, for two reasons.

- (a)
Ultimately, we are almost always interested in inference. Catch-phrases such as ‘let the data speak’ and rhetoric about avoiding assumptions (not in the paper under discussion, but elsewhere in the statistics literature) can obscure the truth that we care about what is happening in the population; the sample we happen to select is just a means to this end.

- (b)
Description can often—always?—be reframed as inference. For example, we talk about the mean and standard deviation of the data, or the mean and standard deviation of the population. But the former can be presented simply as an estimate of the latter. I prefer to start with the idea of the population mean and standard deviation, then introduce the sample quantities as estimates.

Similarly, some textbooks first introduce linear regression as a data summary and then return to it later in the context of statistical inference. I prefer to start with the linear model *y*=*a*+*bx*, with no ‘*ɛ*’—I believe it is a mistake to focus on the error term before students have become comfortable with the deterministic model—and then introduce the least squares estimate, not as a data summary, but as an estimate of an underlying pattern of interest.

In giving these comments, I am not trying to imply that my approach is the best way or even a good way to teach statistics. I have no evidence to back up my hunches on how to teach. But I would like to suggest the possibilities, because I think that statisticians are so stuck in a ‘description *versus* inference’ view of the world which can lead to difficulty in teaching and learning.

**Harvey Goldstein** (*University of Bristol*)

I very much enjoyed reading this paper, and especially its emphasis on the progressive refinement of statistical understandings. One of the issues that is not discussed by the authors, apart from a brief mention in connection with Fig. 12(a), is that of dependences between data values. In all their other examples is embedded the unarticulated assumption that data values have been independently sampled and there is also, throughout most of the examples, the assumption that the sampling procedure is ‘unbiased’.

It could, of course, be argued, as the authors do in connection with Fig. 12(a), that these are complexities that can be introduced at later stages. This assumes, however, that in an important sense the notions of unbiased and independent sampling are more ‘basic’ or more ‘natural’. Although they may be considered formally more basic they may not be so pedagogically. Thus, in practice, samples are often biased—e.g. through non-response—and also may contain (subtle) dependences, such as occur in multistage sampling or where time series are involved such as in successive daily temperature measurements. If such dependences or biases are important then the early milestone inferences may not be even approximately correct. The provenance of the data is relevant and important. Thus, the example of pupil heights that was discussed by the authors implicitly assumes both independence and unbiasedness, since the authors are concerned with making inferences to a real population—although what the population consists of is not actually described. When the authors discuss the computer animation of sampling variation they are (I assume) generating simple random samples from a ‘population’. But these do not necessarily have the same statistical properties as real life samples—so how are students to be made aware of this artificiality?

My suggestion is to introduce ideas of bias and dependence early on, say in milestone 1. I take the point that one does not wish to introduce too many concepts at once, but both of these ideas are so fundamental to all inference that to ignore them runs the danger of divorcing the pedagogy from the reality. This issue is generally not a problem in traditional ‘formal’ statistics courses since the simplifying assumptions of unbiased and independent sampling are explicit, later to be modified. The challenge in the curriculum that is proposed by the authors is how to deal with them at the outset in ways that are intuitively accessible to students. I suspect that to do this satisfactorily would involve the extensive use of real data before, or alongside, artificially generated data and I would very much welcome the authors’ comments.

**Robert Gould** (*University of California, Los Angeles*)

The authors are to be congratulated on developing an innovative and exciting new curriculum for teaching statistical thinking to young students. Most exciting to me is the authors’ conception of students as life-long learners who will ‘live statistically’ and who therefore need to internalize statistical thinking. In the past, students first encountered data in their introductory statistics course, where they were taught formal data collection methods and warned that inferential techniques applied to only very special circumstances. Course content focused on producing critical consumers of statistics. However, modern students encounter data routinely, in both formal and informal environments, and are far more likely than prior generations to need to process and analyse data. The need to produce citizen statisticians has never been greater.

This demand to produce statistically literate citizens has perhaps caused many instructors to overload the curriculum. Wild and his colleagues take a different tack and strip away all but the essentials. This decision to focus on inference in a limited context at such an early age will lay a strong foundation for learning some of the more complex statistical concepts at a later stage. Even those of us who must teach ‘all of statistics’ in one (too short) course could benefit from the example of choosing a small number of topics and teaching them very well, rather than lightly touching on many topics and teaching very little. One lesson learned from this paper is that the proper use of technology can strip much of the tediousness from the curriculum and allow students and instructors to focus on important conceptual features.

As Hotelling (1948) explained in his historic summary of the then-current state of statistics education, *who* teaches statistics is as important as what is taught. The curriculum proposed can be taught only by teachers with a firm conceptual understanding of inference and data analysis. This is not meant as a criticism of the authors’ work, since even the best curriculum can be ruined by a poor teacher, but instead as a reminder that, in most classrooms, statistics education is left in the hands of people who have not themselves learned statistics. Still, to prevent the lessons from being anything more than a transmission of a set of rules (an approach which has dominated statistics education), improved professional development will be required to ensure that teachers of the curriculum proposed will have a sufficiently deep understanding of statistics.

**Sander Greenland** (*University of California, Los Angeles*)

I am disturbed by the authors’ exclusive focus on random variability. In evidence synthesis and observational research, random variation can be smaller than uncertainty about bias (Greenland, 2005; Lash *et al*., 2009). Even in randomized trials, extensive drop-out and non-adherence can render naive any claim to resembling the experimental designs on which conventional statistics are based. Sample surveys can suffer similarly through non-response and erroneous response.

‘addresses a particular type of uncertainty, namely that caused by having data from random samples rather than having complete knowledge of entire populations.... It does not address study design and execution issues, quality of data, relevance of data, practical importance and so on.’

I think this conceptualization is harmful, making statistics a bag of blind algorithms rather than an integral part of *scientific* inference. Worse, it seems to license the usual application to highly non-random data of algorithms that assume purely random variation.

‘it is misleading to report confidence intervals which reflect only the extent of possible random error when systematic biases are suspected’

(Turner *et al*., 2009). Despite such *caveats*, basic statistics courses often omit serious discussion about the non-random problems of real studies. Consequently, researchers rely on quantifying only random variation, and tend to underestimate seriously the uncertainty that they should associate with a given result. They may see a confidence interval which summarizes only random variability, and then *behave* as if that captures most uncertainty (even when claiming to know better). I fear that the authors’ approach will only further propagate and entrench such overconfident behaviour, making it even more difficult to unlearn than it is now.

Fortunately, one can introduce bias concepts in tandem with random variation without leaving the authors’ framework or introducing untoward complications. For example, one could illustrate what might happen in a survey in which heights are asked of the children rather than measured directly, when there is a tendency towards under-reporting by girls and over-reporting by boys. Showing how such problems cast doubt on conventional rules (like those given by the authors) would be a worthwhile educational achievement; it should not be left to a separate module.

**Paul Hewson** (*University of Plymouth*)

I find the paper very interesting and the underlying body of work very encouraging. I thank the authors for that. I have two questions: one specific; one general. The specific question concerns the authors’ statement in Section 2.5 that the ‘experimental *versus* observational’ issue is a tangential concern. I am not really sure that I entirely follow this point. I wonder whether this dichotomy is so fundamental that the authors have really presented a method for engaging with observationally based inference. One could develop a parallel, experimentally based pathway, which would have many stages in common but be fundamentally different precisely because the dichotomy is so fundamentally important.

I was encouraged to read that randomization-based inference will be introduced at a school level as this seems a more direct way of encouraging people to think under the null. But my more general point is that I did note the authors’ repeated comments that ‘thinking under the null’ should be delayed as much as possible. I think that this is a great idea, but I wonder why not delay it indefinitely? The Bayesian paradigm is widely accepted in applied statistics and practice nowadays. Additionally, noting that (to give a Waikato- based example) Bolstad (2007) has produced a highly acclaimed textbook which teaches non-statistical specialists how to work in a Bayesian framework, we have support for the view that we do not actually need any thinking under the null. I think that the general point also relates to the specific—thinking under the null seems simpler to introduce in the context of experimentally based inference than observationally based inference, and I therefore think that the proposals in this paper really lean towards Bayesian inference.

**Kuldeep Kumar** (*Bond University, Gold Coast*)

This is a very interesting paper dealing with teaching statistical inference to high school students and also as an explication for a first course at university level. Statistical inference is of course one of the difficult subjects to teach and I must congratulate the authors for providing insight into teaching this subject. In most of the business schools statistical inference is taught only as a part of the statistics for management course and I think that it is much more challenging to teach the concept of statistical inference to management students. One of the reasons is the low numeracy skills of the students and sometimes, secondly, difficulties arising in teaching mature age students who have left mathematics long ago. The Business School cannot afford to have a full course on statistical inference. Accordingly, the question is how can we teach statistical inference to these students? Especially in the context of a business school there should be more emphasis on teaching the application of statistical inference rather than the theory. There has been much interest in the teaching of statistics in business schools for a very long time; for example see Kumar (2010) and references included in the paper. We usually use problem-based learning where all the topics are introduced with a problem in real life. However, maybe in a business context, the statistical tools should be taught to show how the problem can be solved, and thus exploring the help of these tools and techniques by using real data.

Teaching basic statistics courses by using Microsoft EXCEL has become more popular in business schools. We also quite often use EXCEL to demonstrate the central limit theorem and other concepts dealing with statistical inference. One of the advantages of using EXCEL is that it is very easy to use and it comes as part of the Microsoft package which is available on most personal computers. EXCEL is a very good tool for graphical representation and producing tables. The ‘Analysis toolpak’ can do many statistical analyses, e.g. calculation of descriptive statistics, correlation, multiple regression, analysis of variance, the *t*-test, *Z*-test, confidence intervals, moving averages and the random-number generator. The basic statistical functions can be used to calculate normal probabilities, binomial probabilities, *χ*^{2}-tests etc.

Usually I have observed that school children are good at using EXCEL and I wonder whether the authors can use EXCEL to demonstrate the concepts that are mentioned in the paper in a more accessible way.

The authors of this paper have several good ideas on *how* to teach statistical inference in first courses; but I have reservations about their ideas on *what* to teach. They rightly say

‘Many of the problems with students learning statistics stem from too many concepts having to be operationalized almost simultaneously’

(Section 4), so why not teach *one* concept which will embrace the *whole* of worthwhile statistics? That concept is probability with its three rules of convexity, addition and multiplication. Probability can be taught by the physical experience of drawing balls from an urn, so that students can collect their own data.

Statistics is about uncertainty. The authors emphasize variation in repeated samples from a population, which is a source of uncertainty, but it is not the only one; what of the uncertainty in a court of law associated with a defendant's guilt? Uncertainty can only be satisfactorily expressed through probability, which is essentially an extension of standard deterministic logic to embrace uncertainty. The concept of a population is often artificial and rarely relevant to a practical problem. Repetition is not basic to statistics; coherence is and probability is coherent. The idea of a million apples falling from Newton's tree is not exciting but that the phenomenon coheres with the motion of the planets is. Boxplots are not basic but merely useful decorations, whereas probability is basic and not just pictorially effective. Even experienced statisticians can fail to understand the correct concept of a confidence interval: were the procedure to be carried out indefinitely in samples from the same population, then 95% of the resulting, intervals would include the true value. Often they think of the true value lying within the interval with 95% probability, which is typically false.

The authors’ view of statistics is too narrow, restricting themselves to frequentist ideas, whereas operationally we need opinions that necessarily involve probability.

**Thomas A. Louis** (*Johns Hopkins Bloomberg School of Public Health, Baltimore*)

It is a pleasure to comment on this communication, especially on World Statistics Day. It will improve quantitative literacy, will add breadth and depth to statistical thinking and will enhance appreciation of statistics and statisticians. Students at all levels will benefit from the examples and viewpoints. I endorse the authors’ call to action and most of their specifics, but I add my list of ‘most important and fundamental concepts’.

A criticism: I am not in favour of using the term ‘distortions’ to characterize the effect of simple random sampling; it suggests chicanery. Why not use ‘variation’ in this context and save ‘distortion’ for other sampling forms?

- (c)
*The importance of strategic thinking*: for example, ask whether a conclusion should depend on whether distance is measured in miles or metres or a risk ratio is *A/B* or *B/A*. Use these to motivate use of (statistic − centre)/scale and logarithm-based procedures.

- (d)
*The fundamental role of the full probability distribution*: full distributional thinking promotes discussion of inferential goals. For example, it may be that the median or some other percentile is more relevant than the mean, or that the quantity of interest is the random variable induced by filtering a stochastic input (e.g. exposure) through an output function (e.g. an exposure–response relationship). In teaching the Gaussian distribution, show that the mean and standard deviation fully describe the distribution but are not necessarily the primary objects of inference. Use a similar approach for other parametric distributions. Full distribution thinking is essential for Bayesians (for whom it is innate) and frequentists.

- (e)
*The use of live data*: recruit students via interesting and timely data. For example, discuss issues in modelling of the spread of influenza, publication bias, stochastic properties of the yield from a screening programme, estimating the size of a crowd or civilian casualties, political polls, stock market gains (and losses!) ....

In summary, I recommend emphasizing the foregoing when educating, collaborating and communicating. I encourage all of us to build on the authors’ work and thereby to promote statistics and statisticians.

**Helen MacGillivray** (*Queensland University of Technology, Brisbane*)

The deserved congratulations to the authors for the innovative, thoughtful and hard work of this paper should be matched by a determination of the statistical community to follow the examples set by them and others to become seriously involved in the many and diverse challenges in statistical education. There should be no illusions about the toughness of the challenges, the value of tackling them and the need for significant and scholarly work and analysis that bring together statistical and teaching expertise and experience.

Key points of this paper include the building of understanding gradually through revisiting and extending simple procedures, richer use of technology than merely reducing calculations, and that statistical soundness should underpin any simple procedures.

All statistical procedures need to be seen as part of what Chambers (1993) called ‘greater’ statistics. Emphasis on more holistic and practical approaches to statistics, reflecting what statisticians do, has played a crucial role in major developments in statistics education, including articulation and integration of the data-investigative process or cycle (Holmes, 1997; Wild and Pfannkuch, 1999).

Similarly, the skilful and integrated use of technology in statistics education should not prevent us from developing statistical thinking away from the computer. In discussing the teaching of sampling and sampling variation, the ‘downward’ view of sampling from a known ‘population’ or distribution can be matched by the ‘upward’ question of such importance in statistical inference, namely of what general situation or group can we consider our data to be representative with respect to the question(s) of interest (Utts and Heckard (2007), page 72). Again, the data-investigative cycle and emphasis on real and everyday situations provide many opportunities for developing this awareness across all educational levels.

As in this paper, by choosing to illustrate simple procedures within the context of real many-variabled data sets, ‘multivariatitis’ can become the norm in teaching statistics. However, the tertiary level brings other questions, e.g. why not introduce statistical inference through categorical data? And should we not be using technology to investigate real many-variabled data sets sooner than in the traditional approach? Statisticians who help postgraduates across disciplines are aware of the consequences of overexposure to the traditional two-sample approach.

Perhaps we should develop ‘worry questions’ for teaching statistics, noting Bruner's (1996) point that most people have acquired a ‘folk pedagogy’ that reflects certain ‘wired-in human tendencies and some deeply ingrained beliefs’ (page 46). Asking questions, lateral thinking and analysing information are part of good statistical practice, including in teaching statistics.

**Xiao-Li Meng** (*Harvard University, Cambridge*)

‘We don't care about school stuff.’ Surely any professors who want good students (and who does not?) care, and even those who only care about research care about ‘school stuff’ in effect because they love to have well-taught students as research assistants! The issue, as the authors noted, is the concern that statistical education in high school can do more harm than good when there is a severe shortage of qualified statistical educators (Meng, 2009a,b,2010a). The shortage is real, even at the university level (Meng, 2009c, 2010b). However, the authors’ reasoning that we should start as early as possible is compelling, especially in this information age when everyone and everything are competing for the attention of future generations. The proposal to use modern technology is therefore appealing because technology can help, among many other things, to simplify and globalize pedagogical presentations, thereby reducing the dependence on local expertise for quality education, at least to some extent.

Regarding the authors’ emphasis on concentrating on a single concept, arguments can go either way, as the authors noted. I cannot convince myself either way even after raising several glasses. Forcing beginners to dive in when they cannot swim is fatal. Allowing them to believe that they can jump into any wet environment because they have learned diving in a swimming pool is equally fatal. The right solution, I believe, is to start simply but also to demonstrate complications. It is safer to start with a swimming pool. This, however, should be supplemented by also showing beginners pictures or videos of brooks, creeks, streams, reservoirs, ponds, springs, estuaries, rivers, lakes, seas, oceans, or even waterfalls, wells and aquifers. They do not need to learn how to swim or dive in all these environments (nobody can!), but knowing about their existence helps to put what they have learned into perspective. After students have learned about sampling variation by using the authors’ ingenious vibrating boxplots, why not ‘vibrate’ them more with a few relatable examples such as noise in digital transmissions, errors in forecasting the stock market and uncertainties in predicting their own examination scores? There is no need to spell out what causes these ‘vibrations’. The mere recognition that they are real, need to be studied but do not fall into the category of ‘sampling variation’ should serve well the purpose of inspiring students to study statistics as a vast, vital and vibrant scientific discipline.

**Deborah Nolan** (*University of California, Berkeley*)

I congratulate Wild, Pfannkuch, Regan and Horton for a bold innovative approach to teaching statistical inference. They think critically about concepts in inference; they separate and scaffold the big ideas, and create a model for how technology can aid in understanding these concepts. Too often we allow technology to drive the application; we simply duplicate existing paradigms for learning without consideration of their appropriateness for or advantages of the new technology.

The technique that is presented for visualizing sampling variability reminds me of Rosling's innovative use of animation to help students to understand multivariate times series. Rosling took a familiar idea—flipbook animation—and, with the help of new technology, ‘liberated the *x*-axis from the burden of time’ (Rosling and Johansson, 2009). Wild and his colleagues take the familiar concept of shading, which has recently gained popularity via transparent colours and overplotting, and apply it to the concept of sampling variation. The blurred images that they create have great potential as a metaphor for sampling variability. As they explain, it enables students to keep their eyes on the graphical representations as they gain insight into statistical inference.

A key step is for students to move from understanding how blurry overplotted boxplots demonstrate sampling variability (Fig. 7) to successfully comparing two boxplots (Fig. 8), each representing a sample, to detect differences between corresponding populations. Wild and his colleagues cleverly avoid several pitfalls by comparing boxplots, rather than, say, histograms. Yet, I am curious whether students have difficulty in making this transition.

Currently, I am experimenting with R (R Development Core Team, 2010) in an introductory, university level course and also am pairing physical and technological demonstrations of concepts. The students use small computational steps that parallel a physical model to construct understanding of a concept. Although I think that this approach is more effective than one function or applet that ‘does it all’, this claim needs to be further examined. More, it has been jarring for students to go between the computational approach and traditional textbook, and I expect that we can only have success with an integrated curriculum.

Technology is constantly changing, offering tremendous opportunities, yet threatening to make newly developed applications obsolete. In contrast, physical models seem timeless. We may ask whether the benefit of uncovering new paradigms for learning offset the burden of keeping abreast of these continual changes. The excellent contributions of Wild and his colleagues demonstrate that with careful design the benefits clearly outweigh the disadvantages.

**Sastry Pantula and Ron Wasserstein** (*American Statistical Association, Alexandria*)

Just as the nurturing of mathematical thinking begins early, so must statistical reasoning. It is imperative that statistics is included in a 21st-century education for all students, but many K-12 teachers have not received the necessary training at the appropriate level to teach statistical literacy in their classrooms. Sound statistical reasoning skills take time for students to develop and cannot be honed in a single course. Foundational statistical concepts should be introduced and nurtured in the elementary grades and strengthened and expanded throughout the middle school, high school and post-secondary grades. We applaud the discussion of Wild, Pfannkuch, Regan and Horton regarding the necessity of statistics being appropriately taught at the school level and the need for statisticians to care about and be involved in the statistical education of our school children.

As the authors discuss, characteristics of effective statistics education including data analysis and probability are described in Franklin *et al.* (2007) (http://www.amstat.org/education/gaise). These include teaching statistical problem solving as an investigative process, promoting statistical thinking through hands-on problem solving and concept activities, emphasizing concepts, using real world data, stressing the importance of context, recognizing variability and using appropriate technology to emphasize concepts and to support learning. Students must be taught to think statistically, in part by learning to ask key questions about scientific investigations to understand their reported conclusions and to offer an informed opinion about their legitimacy. Franklin *et al.* (2007) provide a framework to describe what is meant by a statistically literate high school graduate and provide steps to achieve this goal of a statistically literate citizenry. We appreciate the authors’ discussion of how to teach the basic tenets of statistical inference at the K-12 level to guide students further to develop statistical reasoning over time.

The American Statistical Association recognizes the need for high quality teacher preparation and professional development in statistics. Well-trained teachers should have tools and training to teach statistical concepts effectively. To enhance teacher training and student learning, teachers and teacher educators need access to materials supported by statistics education research. Statisticians and statistical societies need to work to improve statistical instruction and literacy through policy and outreach efforts with education departments, teacher preparation and professional development programmes. As we enhance the statistical reasoning of teachers and students, we shall take an important step towards enabling our children to thrive in our data-driven society.

**Emanuel Parzen** (*Texas A&M University, College Station*)

I strongly support the authors’ goals for statistical education.

- (a)
To motivate students about the practical value of learning statistical reasoning teach that success in life (all aspects) is achieved through understanding (rather than memorizing) formulae that are useful to answer practical questions. Learn methods ‘just in time’ to solve a practical problem, posed in words. Require students to formulate and solve statistical problems ultimately to answer original practical questions. Explain that each step is obtained by reasoning (applying universal strategies), not memorization.

- (b)
Frontiers of methods and applications can be taught if what we teach extends to more advanced problems. The competitive advantage of statisticians is that they extend to complex data methods that work for simple data; vice versa should be an aim of statistical education.

- (c)
Visual (produced by software) teaching and practice of statistical methods are best provided by scatter diagrams, slopes of lines and correlation coefficients. The two-sample problem (of data

*X*_{i} size

*m*,

*Y*_{j} size

*n*) should be introduced visually as a scatter diagram of combined data (0,

*X*_{i}),(1,

*Y*_{j}). The least squares line joins

with slope

the significant difference of

*d* from zero is judged by

*T*=√{(

*m*+

*n*−2)

*R*}/√(1−

*R*^{2}), where

*R* is the sample correlation coefficient of the scatter diagram (calculated by using

*n* rather than

*n*−1 in sample variance). To compute confidence intervals calculate SE(

*d*)=

*d*/

*T*; plot quantiles

*Q*(

*P*), 0<

*P*<1, of

*T* and

*Z* normal.

- (d)
The above regression dummy variable approach to the two-sample problem helps to introduce regression for continuous *X* and *Y*; logistic regression.

- (e)
An alternative to the dot plot (of sample *X*_{i} size *n*) is to plot the sample quantile function (it works for any sample size; 0–1 data) defined as scatter plot ((*i*−0.5)/*n*,*X*(*i*;*n*));*X*(*i*;*n*) are ordered values (add boxplots and fences; interpret tails and symmetry). The rotating plot is the sample distribution function.

- (f)
All statistical education, including introductory education, should learn to solve problems by ‘analogies between analogies’ (links between statistical problems in different fields and different applications, and links between different parts of the course).

- (g)
A unifying idea of frontier statistical inference methods is to calculate the optimal function that minimizes a (information) distance from a specified function, subject to inner product constraints that provide parametric representation, and estimating equations for parameters of the optimal function. For the model selection problem, choose constraints that best fit data.

**Brian Pink** (*Australian Bureau of Statistics*)

‘statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write’.

The authors have rightly placed the primary objective of statistical education in schools to be the learning of life skills so that students may reason and understand the use of statistical evidence, regardless of the career that they will pursue. It is indeed important that students understand basic design issues and may question issues relating to the relevance and quality of data, presentation and practical importance of the results, etc. The current curriculum tends to restrict the use of statistics to applying tools for summarizing data and making conclusions. The focus of the paper is on inferential thinking but readers must not be distracted from the importance of the wider context of sound statistical reasoning. Indeed addressing the questions ‘is the data set fit for the purpose of the investigation?’ and ‘are the conclusions justified?’ should be an important part of the journey.

The approach to teaching statistical inference that is advocated in the paper would rely on students being able to look at graphs and to interpret information. Such skills are not always found in students in early years of high school, and it would be useful if the authors could clarify how thinking relating to the data as visualized in graphs could be taught well. The authors touch lightly on the ‘descriptive game’ but to play the ‘inference game’ students need to be able to use graphical tools to describe and understand the data.

We strongly support teaching the ‘minimum set of the biggest ideas of statistical inference’. Should the ‘minimum set’ include extending from a sample to an infinite population? We would welcome the authors’ comment on how this abstract concept could be included in a manner that is underpinned by the principles proposed.

The authors made much effective use of automation in data presentation to help to draw out the message in an understandable manner. At the Australian Bureau of Statistics we have provided various interactive tools on the Web site to help to display visually the data that are disseminated (see http://www.abs.gov.au/websitedbs/D3310114.nsf/home/Interact+with+our+data). Teachers should, however, encourage students to develop their own presentation tool to tell the story behind the data. Paradoxically, if our teaching of statistical thinking and inference is successful, we would not overly rely on the availability of presentation tools!

I thank the authors for a thought-provoking paper.

**J.-F. Plante** (*HEC Montréal*) **and N. Reid** (*University of Toronto*)

We congratulate the authors on a very original and dedicated programme of improving statistical teaching in schools. As the authors note, this may help us to recruit more, and stronger, students to our discipline. Just as important, it is part of our social role to improve statistical literacy in the population.

Too often, as educators and commentators, we describe and illustrate a list of errors in inference, a list of ‘don't dos’. Although we may be correct, we rarely follow this critique by a similarly long list of acceptable approaches. Too often, the final answer is ‘The solution is hard and you may learn it some day, if you choose the right optional course’. We see very positively the proposal of the authors to go ahead with simple forms of inference, leaving discussions about some important issues (and their solutions) to a later time. As students try different problems, they will gain confidence in their capacity to analyse data, and they will discover for themselves the need for more suitable models.

At the University of Toronto, we taught a first-year reading course on statistics to students with little background in statistics or mathematics. The outline of the course was intentionally vague, and stories in the news were used as pretexts to discussions of statistical methodology. We found that linking statistical methods to current stories improved the interest and the understanding of the students. We also found that explanations of puzzling or new concepts always drew more attention when discussed in the context of a question from a member of the class. The mathematical content of our course was minimal, but experience makes us believe that it is possible to gain a very intuitive understanding of traditional statistical concepts such as variability, confidence intervals, multiple comparisons and statistical significance.

Most readers of this paper will be trained statisticians who understand the subtleties of the discipline. One challenge that is not discussed is the feasibility of implementation at a larger scale, involving teachers with different levels of ability in statistics. A natural next step may be to think of optimal ways to train teachers to ensure that they are comfortable in the delivery of the material and aware of important subtleties. This should be achievable—our elementary music teachers were not virtuosos, but they did a great job at teaching us the basics—but it deserves careful planning.

**Donald B. Rubin** (*Harvard University, Cambridge*)

The target article displays substantial wisdom when criticizing current methods for teaching and conveying basic ideas of statistical inference. A rather dramatic example of the confusion that can be created by traditional frequentist ‘thinking under the null’ is a current US legal case in which the defendant was convicted of a federal crime, basically for not adhering to a strict interpretation of significance tests, when he provided a press release concerning the possible mortality benefit of his company's product for treating an inevitably fatal, swiftly advancing disease (United States *versus* Harkonen, 2010 U.S. Dist. LEXIS 75528 (N.D. Cal. July 27th, 2010)).

To avoid such confusion, however, my approach is distinctly different from that proposed in the paper, although perhaps complementary. We should start the inferential process with clear statements of scientific objectives and associated estimands, and a precise description of the hypothetical data set from which we would simply calculate the estimands. Such a data set could be composed mostly of unobserved or even unobservable values, but it defines the objects of inference. Then we should think about the mechanism that led to observed values being observed and unobserved values being missing (as in Rubin (1976)). To me, statistical inference concerns estimating missing values from observed values rather than pondering *p*-values, and one cannot even begin to do this without assuming something about the process that created the missing values: was it, for example, simple random sampling or was it a censoring mechanism?

This inferential process is most directly conceptualized as finding the probability distribution of missing values given observed values and scientific assumptions, i.e., formally, finding the posterior predictive distribution of the missing values, whence the posterior distribution of the estimands can be calculated. There is no need to teach technical details of this Bayesian approach; instead we can use ideas based on simulating the unknowns under scientifically motivated models. We should not avoid discussing the scientific context of any statistical problem, because this can greatly affect the resulting inference.

The approach of distributionally filling in (or multiply imputing (Rubin, 2004)) the missing values is intuitive and reveals the natural uncertainty of inference. The standard frequentist approach, which treats some known functions of the observed values as unknown (e.g. sufficient statistics) and treats some unknowns (e.g. parameters) as known, leads to the confusing double-negative logic of ‘failing to reject the null hypothesis’ etc.

After understanding this natural approach to access uncertainty of inference, then we can teach the importance of evaluating operating characteristics of such procedures. We should rarely, if ever, fully trust the veracity of the models used to impute the missing values, and frequency-based evaluations of them can be extremely helpful, as argued in Rubin (1984).

**Richard L. Scheaffer** (*University of Florida, Gainesville*)

I congratulate the authors on their original thinking on key issues surrounding statistics in schools; the emphasis on data exploration must be expanded to include basic concepts of inference and professional statisticians must play a major role in establishing the primacy of statistics in schools while promoting creative thinking in its teaching.

The Quantitative Literacy Project in the USA (1980s), borrowing unabashedly from the UK's Schools Project, succeeded in raising interest in statistics through the new and exciting emphasis on exploratory data analysis. Our National Council of Teachers of Mathematics supported this effort and promoted statistics (data analysis) as part of the school curriculum. Emphasis on teaching statistics began to wane after 2000 when schools came under the pressure of testing as *the* measure of ‘progress’. Statistical thinking, which is difficult to test and not yet in the mainstream, became neglected or replaced by the ‘meanmedianmode’ framework. Franklin *et al.* (2007) were instrumental in helping some states (where education decisions lie) to revitalize their curriculum in positive ways, and the new common core state standards (http://corestandards.org/) should expand the interest in statistics.

The common core standards contain a potentially strong statistics strand throughout the middle and secondary years, but the potential will be realized only if creative thinking of the type that is espoused in this paper becomes widespread. The plan of a logical progression of as few new ideas as are needed for basic inference, coupled with the use of appropriate technology while keeping an eye on the data, seems sound. Related work on pedagogy and assessment is also essential to the enterprise. Statistics should be taught as an investigative process; this approach allows that—even requires it.

I am happy to see the authors not projecting all inference towards the Procrustean *t*-test, as less restrictive mechanisms for inference are now accessible through technology. However avoidance of ‘thinking under the null’ is not entirely positive, as it may be quite natural for some to think ‘would the observed difference in these medians be so large if the populations were the same?’.

The quotes from John Tukey are great; we should also keep in mind that his influence in making data analysis exciting and acceptable among statisticians allowed real statistics education to enter the schoolhouse door. Now, the statistical profession must take interest in ‘school stuff’ to keep that door open. Statistical thinking is too important to be relegated to at most one college course!

**Milo Schield** (*Augsburg College, Minneapolis*)

At the 1995 meeting of the American Statistical Association in Orlando, Robert Hogg talked extemporaneously about problems in teaching the introductory statistics course. I can still hear his voice saying, none too quietly,

‘I'm tired of people talking about problems in the introductory course. I know there are problems. I've written about some of the problems. What I want is for someone to come up with a solution: a comprehensive solution, not just a small change. I want them to write up their solution in detail so I can see what to do differently. I want to be able to try it for myself. Then I will know how good it is.’

Bob Hogg was very perceptive. It is much easier to be a critic than a creator.

The Wild–Pfannkuch–Regan–Horton paper is all but certain to satisfy Bob Hogg's requirements. This paper does not just dwell on the problems; it advocates solutions; solutions that involve greater focus on ideas and concepts, less focus on the numerical details. And the solutions are detailed—not just broad generalizations; they form an integrated whole—not just a narrow change. Although some elements of this paper have been presented and discussed at various conferences (Statistical Reasoning, Thinking and Literacy, the International Conference on Teaching Statistics, the International Statistical Institute, American Statistical Association, etc.) by the co-authors and others, this paper integrates these ideas into a unified plan based on visual evidence.

The metaphors presenting these ideas are vivid and compelling:

‘our scene setting metaphor for statistical inference starts with the idea that looking at the world by using data is like looking through a window with ripples in the glass’;

‘we limit attention to distortions that are produced by the act of sampling and sampling variation’.

The central issue and the author's key claim are presented in a straightforward way:

‘some statisticians may be uncomfortable about using the non-overlap of individual uncertainty intervals to indicate [statistical] significance’;

‘we believe that this is a finer-distinction, later-refinement issue rather than a fundamental issue’;

‘each [of our guidelines] is statistically valid but with a limited range of applicability’.

In conclusion, if the goal is to introduce the idea of statistical inference to statistical beginners without the use of computational aids and with minimal mathematics, then arguably this paper represents the single biggest advance in several decades of concerted effort in statistical education.

**Michael Stuart** (*Trinity College Dublin*)

While admiring the efforts of the authors to simplify the introduction of key ideas of statistical inference, I am puzzled by their dismissal of process sampling as a base on which to build an approach to inference. The fact that repeated sampling is an integral part of process monitoring means that the key concept of the sampling distribution can be given an operational introduction through visualizing continuing process monitoring, by way of an evolving control chart, rather than the theoretical but practically implausible notion of repeated sampling of a population.

The fact that the distribution of evolving process data can be built up step by step means that a concrete view of that distribution can be built up and, with sufficient simulation, be seen to lead to a conceptual model. The fact that a stable process is the ideal in a process monitoring context and that control charts are designed to detect substantial departures therefrom means that the ‘null hypothesis’ concept arises naturally. Shewhart, the originator of the control chart, used ‘chance causes’ and ‘assignable causes’ of variation to denote null hypotheses and departures therefrom respectively. He used the normal model for chance causes, thus leading to his 3*σ*-limits.

By initially sampling one value at a time, corresponding to the so-called ‘individuals control chart’, and later moving to sampling several values at a time, corresponding to the -chart, the ideas of repeated sampling and standard error can be separately introduced, thus easing both tasks.

Related ideas become more accessible when introduced in this way. For example, the notion of a significance level may be identified with the control chart *false alarm rate* which has a simple operational interpretation in terms of an on-going process, with built-in repeated sampling, rather than the more obscure *probability of rejecting the null hypothesis when true*. It also facilitates emphasizing that there is more to statistical analysis than hypothesis testing, as a control chart is considerably more than a succession of individual hypothesis tests.

It is suggested that the approach which is advocated here passes all four tests listed at the end of Section 2.3 of the paper, that it lends itself to the wholly visual approach advocated in the paper and that it overcomes the reservations raised by Wild (2006), section 2.5, regarding process sampling.

First, I congratulate the authors on an interesting and thought-provoking paper. To provide some background I was head of the Australian Bureau of Statistics from 2000 to 2006. During this period I took a strong personal interest in the teaching of statistics in schools and funded many initiatives. This was driven to a large extent by employers’ concem about the supply of statisticians being considerably less than the demand. The demographics of statisticians suggested that this was going to become worse before it got better. Some employers raised their concerns at the Ministerial level, provoking the interest of the Minister and his acceptance that statistics was an important life skill.

A roundtable of key stakeholders was held to discuss the problem and it identified that the key part of the solution was with schools—students, teachers and pedagogy. Foremost was the lack of statistical training of teachers, inadequate teaching resources, uninspiring teaching methods and little knowledge of potential careers in statistics.

Although we had the interest of the Minister we had great difficulty in raising the interest of many in our profession. Many in senior academic positions did not see the problem. As a result progress was much slower than it could have been. The lesson is that there is a need for a co-operative approach from all the leaders of the statistical community. It should not be assumed. It may require some internal debate first.

I support the approach by Wild and his colleagues which is, in essence, learning by doing. Statistical theory is a real turn-off to most and not necessary to teach the statistical concepts and processes which are important in everyday life. It is the approach that is taken in our application of CensusAtSchools. This was the most successful of the Australian Bureau of Statistics's schools projects, helped by the involvement of teachers in the design of the product and lesson plans.

I believe that the focus of teacher training must be on mathematics teachers. Statistics in schools is most likely to be an identifiable part of the mathematics curriculum rather than a subject in its own right. Given that statistics often becomes ‘real’ when its application can be demonstrated to students, statistical content in subjects such as geography and economics can be very useful.

To conclude, I strongly support the proposal of Wild and his colleagues that ‘We need complexity reduction strategies’. This will mean moving away from conventional thinking on teaching.

The **authors** replied later, in writing, as follows.

We thank the Royal Statistical Society, Neville Davies and the Royal Statistical Society Centre for Statistical Education, John MacInnes, Peter Holmes and everyone who participated in the discussion of our paper. It is very encouraging to see contributions from such a diverse spectrum of the statistical family ranging from leaders of national statistical agencies, leaders of international statistical societies and Committee of Presidents of Statistical Societies award winners to many toilers at the coalface of statistical research, the provision of statistical services, statistical education and user communities. Given its societal importance, having such a broad coalition actively working together is absolutely vital to advance the penetration and impact of statistical education.

With so many contributions it is impossible to address them all individually so we shall concentrate on several general themes, leaving most of what we agree with, and some of what we disagree with, to pass without comment. Although there was plenty of support for what we are trying to do there was a spread of opinion over details, and particularly over scope.

*Misinterpretations and the tyranny of page limits*

Our paper combined a general appeal for engagement in reconceiving the broad spectrum of statistics in much more accessible ways with a detailed example. We detailed work in the area where we have so far done most of our own ‘reconceiving’—conceptual pathways to classical statistical inference. Phrases such as ‘We limit attention to …’, which were intended to relate to the scope of exposition in a page-limited paper, were often misinterpreted as limitations on the totality of what students can and should do. A second misinterpretation arose when we talked about ‘ ‘‘statistical inferences’’ as the territory addressed by confidence intervals, critical values, *p*-values and posterior distributions’. Here we were simply defining the meaning we were ascribing to the *phrase*‘statistical inference’ (consistently with books that include it in their titles). We certainly did not mean to claim that this was the totality of extracting meaning from data.

The two misinterpretations above led to ‘There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy’ accompanied by input about what they were. The unexpected benefit has been wide-ranging discussion about what is important in statistics education, including many areas that are indeed more important than classical inference. Areas discussed included consideration of *biases* and other *non-sampling errors* (Holmes, Greenland, Goldstein, Louis and Maclnnes), *more typical* but complex forms and aspects of *data production* including *process data* (Holmes and Stuart), *dependence structures* (Goldstein) and *missing data* (Rubin), *experimental versus observational data* and *causation* (Hewson), approaches based explicitly on probability including *Bayesian* (Aitkin, Hewson and Lindley), imperatives for *official statistics* (Camden), *power* and *levels of confidence* (Holmes) and the irrelevance of classical inference for very large *n* data sets (Nicholson). There were also some questions and differences of opinion about pedagogy. We shall pick up as many of these issues as space permits and conclude our response with themes raised about agencies, statistical organizations and school curricula in the advancement of statistical literacy.

*Just one part of a much wider curriculum*

Our work on the inference problem addresses a detailed part of a much wider statistics curriculum in New Zealand (NZ) in which statistics is taught in every year of schooling from age 5 to about 18 years as part of the curriculum area that is now called ‘Mathematics and Statistics’. This statistics curriculum (http://www.censusatschool.org.nz/2008/documents/new-nz-stats-curriculum.pdf) is almost certainly the most comprehensive in the world and covers most of the above, whereas those areas that are not addressed are probably much too advanced to be broached before university. Why did we work and report on statistical inference?: because it corresponded to an important hole in our broader fabric—an important area where a good way forward had not already been worked out either by us or others.

The NZ statistics curriculum consists of three overlapping strands that go through all 13 years of schooling. These strands are *statistical investigation* (experiences in, and skills and concepts for, conducting investigations), *statistical literacy* (experiences in, and skills and concepts for, critiquing accounts of investigations by others) and *probability* (rather more traditional and building the mathematical linkages). Everything is based around the *problem–plan–data–analysis–conclusions* (PPDAC) cycle; see MacKay and Oldford (2000) and Wild and Pfannkuch (1999)). Over the course of their schooling, our students will have gone around the PPDAC cycle hundreds of times, using it both as a guide to planning and conducting investigations, and also as a framework for interpreting and critiquing the work of others. (What is the problem that they were trying to address? What were their questions? How well do these questions address the base problem? What were their measures? Do they address the questions? How did they collect their data? …). We owe this basic structure to Jock MacKay and Wayne Oldford's second-year course at the University of Waterloo in the early 1990s. It arose from their work on structured approaches to problem solving for managers in industrial process improvement. Thus NZ students are being educated to think critically about the various types of non-sampling errors and resulting biases when they think their way through the ‘plan’ and ‘data’ steps of the cycle; likewise for issues of measurement. This contributes to a greater or lesser degree of trust in anything the data might say. They think about more technical matters (including inference) under ‘analysis’. The key issue of confounding is considered at the interface between ‘analysis’ and ‘conclusions’. A collection of process diagrams and other representations of aspects of statistical thinking can be found at http://www.stat.auckland.ac.nz/~wild/StatThink.

*Complexity and frameworks*

What these discussions have begun to highlight is the complexity and subtlety of statistical thinking as it interacts with the real world. Our deep-seated desire to build all of this at once collides violently with the realities of small working memory and cognitive overload that cognitive psychologists warn about and all students (but only perceptive teachers) experience every day. And yet somehow we must try to build a very rich fabric of interlinked concepts in a way that the right nodes will be activated at the appropriate times. Our way forward relies on clustering concepts into smaller, more manageable sets that share fairly well defined spheres of influence and then finding strategies (e.g. PPDAC) to build bridges between these clusters which facilitate the right clusters being activated at the right time. Several things are clear: by attempting to do everything at once we shall achieve nothing; cognitive overload is seldom more than a hair's breadth away; and even being entirely correct is unhelpful when bought at the expense of being comprehensible.

Students in NZ are wrestling with issues of design and bias from early on in their studies of statistical literacy. As a result, when we start to introduce issues of informal inference in a stripped-down manner, they have a framework to integrate this with the larger questions that are raised by several of the discussants. As they proceed through the staged development that we have laid out, they continue to sharpen their questions and ensure that they do not get a precise answer to the wrong question.

Ours is a world of competing absolute imperatives with no ‘silver bullets’: only well balanced trade-offs. All our swords are double edged. Even the now accepted belief in using interesting real data has its negatives. On the one hand it is absolutely necessary for motivation, for making it obvious that certain things really are worth thinking and learning about. Additionally most of the PPDAC cycle simply cannot be experienced without real data with rich context (Wild, 1994). But on the other hand our recent research and experience say that, when trying to reason from data, students and their teachers do not know when to pay attention to *context* and when to pay attention to *pattern*. Technical statistics, including classical inference, is concerned with reasoning from pattern. Meaning, explanations of patterns thought likely to persist (Holmes) and much of critique draw most heavily on context. Statisticians rapidly shuttle between these two spheres (Wild and Pfannkuch, 1999; Cobb, 2007), something extremely subtle with a long encultural time. Consequentially, when trying to build the concepts of statistical inference by using data with a context that is too rich and compelling, teachers and students tend to go off on tangents into the context sphere at the very point where they are trying to form linkages in the pattern sphere. Thus these linkages may neither be coherently presented nor received. Timely and appropriate activation of context *versus* pattern is a critical area for researchers to develop new scaffolds.

It was gratifying to have the reading of our paper used in the launch of the Society's 10-year statistical literacy campaign ‘getstats’. As with such campaigns, the motivation for our own work is the democratization of statistics so that the benefits of statistical conceptions and modes of thought can be brought to the widest possible segments of society (Cook, Pink, Pullinger and Camden). But, since the ‘traditional road to statistical knowledge is blocked, for most, by a formidable wall of mathematics’ (Efron and Tibshirani as quoted by Fisher), we are seeking non-mathematical, maximally accessible, ways of conveying ideas. This precludes starting inference from probability (Aitkin and Louis) and going immediately to Bayesian conceptions (Lindley, Hewson, and Stander and Moyeed; we do look forward to their work on Bayesian visualization). Models of dependence structures (Goldstein) and the subtle issues that were discussed by Rubin raise even higher barriers to mass entry, whereas process data (Stuart), with its lurking menaces of dependence and non-stationarity, are inherently more complex than sampling or randomization data. Ideas about these more complex matters should be easier to build if some solid intuitions have already been established in simpler, if less frequent (Holmes), contexts. Meng's metaphors of learning swimming in swimming pools but glimpsing oceans, etc., are gloriously apposite.

*Description and inference*

Although we agree with Gelman and Maclnnes that in reality description and inference are intimately interlinked, developmentally we have to pull these things apart (Pfannkuch *et al.*, 2010). In part this is because first people need ‘to learn to see’ and this does not come quickly even with what statisticians consider trivially simple displays. A focus on description helps with ‘learning to see’ and the distinction also helps to cement ‘what I'm seeing differs somewhat from the underlying reality’. We use the triggers (Fig. 1 in Pfannkuch *et al.* (2010)), ‘*I notice*…’ (to trigger thoughts about what I am seeing), ‘*I wonder*…’ (to trigger thoughts about what of this might reflect aspects of the deeper reality), ‘*I worry* ...’ (about data relevance, quality and alternative explanations) and ‘*I expect*…’ (what I expected to see and what surprises me). We shall appropriate Holmes's excellent ‘What might have caused these data to be as they are?’.

*Key concepts of inference*

We agree with Agresti that confidence intervals are more generally useful than tests. The most critical messages are, in decreasing importance for reach, and beginning by adapting Box on models,

- (a)
all estimates are wrong but some are sufficiently close to be useful (similarly measures),

- (b)
every estimate should be accompanied by a measure of uncertainty,

- (c)
ideas about what uncertainties these measures account for,

- (d)
beginning ideas about how to obtain them and

- (e)
technical understandings and competence.

We would like everyone to understand the first and to feel aggrieved when not provided with the second, and many to be steeped in what questions to consider when addressing the third. For the fourth we favour the bootstrap, with facilitates handling measurement data and category data (Agresti and MacGillivray) simultaneously. But it is all a question of how fast you can get there and that seems to be highly dependent on maturity. For the ages of students whom our teachers were working with we found that we had to slow the process from sampling variation to confidence intervals down and first to try to make a call on direction. For many young students, self-discovery that ‘the data can get it backwards’ is revelatory, as is the prior self-discovery that ‘I can get useful information about a population by taking a sample’ (Louis). And as part of these discovery processes our students do learn about being wrong (Holmes) and acting in ways that prevent this happening too frequently. These classroom implementation issues (Borovcnik, King and Woodford) and assessment (Kapadia) will be reported on separately. Since we wrote this paper Chris Wild has been working on dynamic visualizations for the simple bootstrap with exciting results. The direct visual links between the effects of sampling as a dynamic visual process and mimicked by the effects of bootstrap resampling as a dynamic visual process make bootstrap confidence intervals quite compelling; see http://www.stat.auckland.ac.nz/~wild/BootAnim. ‘Do they work?’ Simulate and see!

We shall teach ‘thinking under the null’ and significance (Bowman and Scheaffer, and despite Bayesian and other objections of Hewson, Lindley and Rubin) in the last year of school, but in modules on data from designed experiments (where the idea of an intervention making absolutely no difference is plausible) using randomization tests so that the analysis involves no mathematical machinery and is directly linked to the data production mechanism.

As to the many challenges that have been issued (Ainley and Pratt, Borovcnik, Garfield and Zieffler, Kapadia and Nolan), brevity precludes saying more but many correspond to work that is already in train or on the drawing board and we encourage others to join us in trying to answer these very important questions.

*Curricula and the advancement of statistical literacy*

The desire for significant whole-society increases in statistical literacy is strongest in those currently or formerly in leadership positions in statistical agencies (Pink, Trewin, Cook and Pullinger) and to a significant if lesser extent in those who lead or have led Societies (e.g. Pantula and Wasserstein, Scheaffer, Fisher, Louis and MacGillivray). This is not surprising as these are all roles which demand big picture vision, far beyond ‘the sorts of statistics I do and the sorts of students I see’. As National Statisticians Trewin (Australia) and Pink (NZ and then Australia) have really stood out for their personal investments in pushing education agendas to achieve such ends. Pink's successor heading Statistics New Zealand, Geoff Bascand, has continued this tradition and 20 years of lobbying of educational officials by Mike Camden has paid real dividends.

We believe that the only feasible road to substantial increases in whole-society statistical literacy is an appreciable presence of statistics in compulsory school curricula wisely used; anything less amounts to tinkering around the margins. It might be valuable tinkering but, in terms of percentage reach, it is tinkering nonetheless.

Concern has been raised about the training and ability of teachers to teach statistics and the need for development (Bowman, Plante and Reid, Trewin, Fisher, Gould, Pantula and Wasserstein, and Meng). Unfortunately, substantial investment in any of these things follows presence in the actual *assessed* curriculum; it does not precede it. If we wait until everything is perfect in terms of teacher preparation nothing will ever happen. We must either trust that intelligent people, provided with appropriate support coupled with research evidence (Borovcnik and Kapadia), can learn to do these things or conclude that widespread statistical literacy is a just nice dream that can never happen.

Although some things can be learned from the NZ experience its small size makes it too special. Australia has the large-society complications of states with the power to make their own rules. The following are gleaned from experiences in Australia (Trewin and Fisher), NZ and even some in the UK. The processes of curricular reform are almost entirely political. Doors of opportunity momentarily open then rapidly close for years, perhaps decades. The national statistical society and the national statistical offices need to be allies who speak with one voice with commitment from the top of both organizations. Each organization has legitimacy with different stakeholders and can open different doors. Although connections at a political level are very useful they are also very transient. Their most important advantage may be in receiving a sympathetic hearing from key officials who then must be convinced that statistics is critical to societal advancement, is different from mathematics with almost orthogonal imperatives and is *the province of an entirely different community* (see Fisher). This is an on-going challenge as, even in the bureaucracy, people and roles change. Finally it is critical to have ears to the ground that will detect the earliest rumblings of a curriculum movement. As soon as the first tremors are felt, start running fast and pray that you have not already been shut out by all the competing voices.