Meta-analysis of quasi-experimental research: are systematic narrative reviews indicated?


Jerry A Colliver, Professor, Statistics and Research Consulting, Southern Illinois University School of Medicine, 913 North Rutledge Street (Room 2104), Springfield, Illinois 62794-9623, USA. Tel: 00 1 217 545 7765; Fax: 00 1 217 545 5218; E-mail:


Context  Meta-analyses are commonly performed on quasi-experimental studies in medical education and other applied field settings, with little or no apparent concern for biases and confounds present in the studies synthesised. The implicit assumption is that the biases and confounds are randomly distributed across the studies and are averaged or cancelled out by the synthesis.

Objectives  We set out to consider the possibility that the results and conclusions of meta-analyses in medical education are subject to biases and confounds and to illustrate this possibility with a re-examination of the studies synthesised in an important, recently published meta-analysis of problem-based learning.

Methods  We carefully re-examined the studies in the meta-analysis. Our aims were to identify obvious biases and confounds that provided plausible alternative explanations of each study’s results and to determine whether these threats to validity were considered and convincingly ruled out as plausible rival hypotheses.

Results  Ten of the 11 studies in the meta-analysis used quasi-experimental designs; all 10 were subject to constant biases and confounds that favoured the intervention condition. Threats to validity were not ruled out in the individual studies, nor in the meta-analysis itself.

Conclusions  Our re-examination of the results and conclusions of the meta-analysis illustrates our concerns about the validity of meta-analyses based primarily on quasi-experimental studies. Our tentative conclusion is that the field of medical education might be better served in most instances by systematic narrative reviews that describe and critically evaluate individual studies and their results in light of threats to their validity.


Quasi-experimentation1–3 is the predominant approach to research in medical education and understandably so. Random assignment to study conditions is difficult, if not impossible, as a result of practical and ethical constraints in applied field settings, which are the usual sites of most medical education research. As a result, research in medical education is commonly subject to biases and confounds due to non-randomised, uncontrolled, quasi-experimental studies.4 This has important – yet typically overlooked – implications for meta-analysis in medical education, which we consider here.


To begin with, the results and conclusions of any single study are only as good as the data analysed. Results based on biased and confounded data – such as are likely with a quasi-experimental study – are subject to alternative explanations that reflect the biases and confounds. Donald Campbell, the founder of quasi-experimentation, referred to these biases and confounds as ‘threats to validity’ of the conclusion about the relationship between the intervention and outcome studied.1–3 Is the relationship or effect caused by the intervention or the confound? The rival explanations that arise from threats to validity are often more plausible than the conclusion intended by the research hypothesis and must be ruled out – preferably with an evidence-based argument or demonstration – to provide support for the study’s research hypothesis.1–3

Throughout his writings, Campbell emphasised that ‘quasi-experimentation’ involves two inter-related parts: a taxonomy of designs, and a theory of generic confounds associated with each design that must be ruled out to establish the causal inference about the intervention and outcome.1–5 A classic example is the commonly used one-group pretest–post-test design, which is subject to numerous threats or confounds such as: history (any change-producing event other than the intervention that occurs between the pretest and post-test, such as related lectures and demonstrations in another class that, rather than the nominal intervention, may improve post-test performance); maturation (natural human processes may change with the passage of time, such that study participants become older, hungrier, less motivated, tired, more sick, etc., and subsequent post-test scores reflect the process changes, not the effect of the intervention); testing (the pretest itself may be a stimulus for change that causes students to think about or read up on the topic and perform better on the post-test, regardless of the intervention, or the pretest may sensitise participants to the intervention, which otherwise might not have been effective), and statistical regression towards the mean (an extreme group on pretest will be less extreme on post-test as a result of the regression of measurement error to the mean, such that a low-performing group will improve performance on post-test even without the remediation or intervention).1 These are generic design threats that must be considered and ruled out in the specific study context in order to establish the effectiveness of the intervention with a one-group pretest–post-test design.
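The regression threat can be made concrete with a small calculation. The sketch below assumes a simple linear model in which pretest and post-test scores have the same population mean and variance and correlate at a given value; the function name and all numbers are hypothetical and are not drawn from any study discussed in this paper.

```python
def expected_posttest(pretest_mean, population_mean, correlation):
    """Expected post-test mean of a group selected on its pretest score,
    with no intervention at all, under a simple linear regression model.

    Assumes pretest and post-test have equal means and variances in the
    population and correlate at `correlation` (between 0 and 1).
    """
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be between 0 and 1")
    return population_mean + correlation * (pretest_mean - population_mean)


# Hypothetical low-performing group: pretest mean of 60 on a test whose
# population mean is 75, with a pretest/post-test correlation of 0.7.
pretest = 60.0
post = expected_posttest(pretest, 75.0, 0.7)

# Improvement expected from regression alone, without any remediation.
apparent_gain = post - pretest
```

Under these assumed values the group would be expected to gain roughly 4.5 points with no intervention, which is the kind of rival explanation a one-group design has to rule out.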

Quasi-experimentation then requires more than simply selecting a design and referring to it by name in the research report. The challenge for the researcher is to identify the design’s threats to validity and demonstrate that they can be ruled out in the context of the particular research study in order to establish the validity of the causal inference about the intervention and outcome. Giving an in-passing nod to these threats in a paragraph on limitations in the Discussion section – as seems to be the current practice in reports of quasi-experimental studies – does not begin to rule them out and does nothing to establish the validity of the research hypothesis over and above any plausible rival hypotheses. When the quality of medical education research is criticised for ‘serious methodological flaws’,6 it would be more accurate and constructive to say that researchers are not following through on the second step of quasi-experimentation, which is to address the threats to validity and rule them out.4


Similarly, the results and conclusions of a meta-analysis are only as good as the results of the studies synthesised. However, researchers seem to be even less cognisant of, or concerned about, the threat of biases and confounds in the multiple studies synthesised in a meta-analysis. Meta-analytic reports based primarily on quasi-experimental studies typically do not acknowledge the existence of plausible rival explanations of the study findings synthesised. The implicit assumption is that the biases and confounds are randomly distributed across the studies and hence are ‘averaged out’ or ‘cancelled out’ by the synthesis. This may seem reasonable, but can this assumption be trusted to guarantee the validity of the results and conclusions of most meta-analyses, in particular, those in medical education that involve a preponderance of quasi-experimental studies?

We were concerned about this assumption because of the possibility of constant biases and confounds. A ‘constant’ bias has been defined as a specific bias shared by a group of studies.3 Such constant biases would seem to be common in meta-analyses that synthesise results from education innovations in applied field settings in medical education. These studies are necessarily subject to numerous practical constraints and considerations that dictate and limit the conduct of the study. Constant biases, then, would result from uniform implementation decisions that address the constraints within a given research area. We also thought that different biases and confounds in different studies might be ‘constant’ in the sense that they might influence results in the same (constant) direction – by, say, favouring a popular innovation – and, in combination, would bias the meta-analysis as a whole. It seems plausible to think that these implementation decisions, more often than not, would unwittingly serve to advantage the intervention or innovation. Naturally, faculty would want their innovative curriculum, course, or workshop to be the best possible, so a lot of effort would be put into polishing and perfecting the new intervention, probably with little or no attention given to the quality of the out-of-favour traditional arm of the study (i.e. the control group). The innovation might be further advantaged by special selection, more time on task, outcomes designed specifically for the intervention, etc.

We were also concerned because, even if biases are not constant across all or most studies, it seems unlikely that positive and negative biases will be perfectly and completely balanced or cancelled out by one another. Hence, any remaining bias in a group of synthesised studies will pose a threat of unknown magnitude (and direction) to the validity of the meta-analysis. Either way – whether there is constant bias or remaining unbalanced bias – the results of meta-analyses based on quasi-experimental studies are suspect and require an initial study-by-study examination that considers and rules out plausible alternative explanations.
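A small numerical sketch may help to show why averaging does not remove bias. It assumes, purely for illustration, that the true effect is zero in every study, so that each observed effect equals that study's bias; the weights, bias values and function name are hypothetical.

```python
def weighted_mean_effect(effects, weights):
    """Weighted mean of study effect sizes (weights could be sample sizes)."""
    if len(effects) != len(weights) or not effects:
        raise ValueError("effects and weights must be non-empty and equal length")
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)


# Hypothetical scenario: true effect of zero, so observed effect = bias.
true_effect = 0.0
weights = [40, 60, 50, 50]

mixed_biases = [0.3, -0.2, 0.2, -0.1]    # biases in both directions
constant_biases = [0.3, 0.2, 0.4, 0.3]   # biases that all favour the innovation

pooled_mixed = weighted_mean_effect(
    [true_effect + b for b in mixed_biases], weights
)
pooled_constant = weighted_mean_effect(
    [true_effect + b for b in constant_biases], weights
)
```

With these assumed values the mixed biases leave a small residual (about 0.025) because they do not balance exactly, and the constant biases produce a pooled effect of about 0.295 even though the true effect is zero.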


To illustrate the problem, we report the findings of our re-examination of the most recent meta-analysis of problem-based learning (PBL). This is the second of two meta-analyses7,8 conducted by the same group of investigators, published in two major journals in the field of education, and based for the most part on the same studies, which were also included in earlier PBL reviews.9–15 Thus these studies have seen a lot of meta-analysis action. The first meta-analysis, published by Dochy, Segars, Van den Bossche & Gijbels7 in 2003, categorised outcomes as ‘knowledge’ or ‘skills’, and found a small negative, but significant, effect for knowledge (weighted d = −0.223) and a medium-sized significant effect for skills (weighted d = 0.460). The second meta-analysis, by Gijbels, Dochy, Van den Bossche & Segars8 in 2005, which is examined here, categorised the outcomes as ‘concepts’, ‘principles’ or ‘application’, based on Sugrue’s model16 of the cognitive components of problem solving. The authors found non-significant effects for concepts and application and a large significant effect for principles (weighted d = 0.795). For the purposes of this illustration, we focused on the studies in the principles category of the more recent Gijbels et al.8 meta-analysis, which showed the largest significant effect and included most of the studies in the skills category in the Dochy et al.7 meta-analysis (both categories showed significant positive results) and many of the studies cited in earlier reviews.9–15


All three authors of this paper reviewed the 11 studies17–27 that reported effect size for outcomes in the principles category of the Gijbels et al. meta-analysis8 (Table 1). The authors first carefully read, reread and discussed the original studies. We then identified major obvious biases and confounds that provided plausible rival explanations of each study’s results. We also sought to determine whether these threats to validity were considered and satisfactorily ruled out as plausible alternative explanations of the results. (A twelfth study28 was excluded from both meta-analyses because it was considered a serious outlier [d = 2.171]. We were unable to confirm this estimate of the effect size for study 12, but we checked the two other serious outliers [d = −8.291 and d = −7.91] that had been excluded from both meta-analyses and found that both were computed by dividing by the standard error of the mean rather than the standard deviation. Our re-computations were d = −1.09 and d = −0.67, respectively, which show sizeable negative effects.)
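The computational error noted in the parenthetical above can be sketched as follows. This is an illustration under stated assumptions rather than a reconstruction of the original calculations: the means, standard deviation and group size are hypothetical, a single standard deviation is used for simplicity, and the function names are ours.

```python
import math


def standardised_difference(mean_treat, mean_ctrl, sd):
    """Effect size d: difference in means divided by the standard deviation."""
    return (mean_treat - mean_ctrl) / sd


def mistaken_difference(mean_treat, mean_ctrl, sd, n):
    """The error described above: dividing by the standard error of the
    mean (sd / sqrt(n)) instead of the standard deviation."""
    return (mean_treat - mean_ctrl) / (sd / math.sqrt(n))


# Hypothetical values: means of 70 and 75, SD of 10, group size of 64.
d = standardised_difference(70.0, 75.0, 10.0)
inflated = mistaken_difference(70.0, 75.0, 10.0, 64)

# Dividing by the standard error inflates d by a factor of sqrt(n).
inflation_factor = inflated / d
```

With these assumed numbers a moderate d of about −0.5 becomes about −4.0, an inflation by the square root of the group size, which is how an ordinary negative effect can look like an extreme outlier.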

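Table 1.   Summary of design, biases and confounds, threats to validity and effect size (d) of studies in the ‘principles’ category of the meta-analysis by Gijbels et al.8

Column groups: n (PBL, LBL); Comparison of (two tracks, consecutive classes, different schools, small subgroups of curriculum groups, randomised groups); Biases and confounds (selection bias, intervention/outcome confound, time-on-task confound, testing lag and response rates); Threats to validity (favour intervention, not ruled out).

| Studies included in the meta-analysis by Gijbels et al.8 | n PBL | n LBL | Two tracks | Consecutive classes | Different schools | Small subgroups of curriculum groups | Randomised groups | Selection bias | Intervention/outcome confound | Time-on-task confound | Testing lag and response rates | Favour intervention | Not ruled out | Outcome effect size (d) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 Boshuizen et al.17 | 4 | 4 | | | X | X | | ? | X | | | X | X | 2.268 |
| 2 Distlehorst & Robbs18 | 47 | 154 | X | | | | | X | | | | X | X | 0.445 |
| 3 Doucet et al.19 | 21 | 26 | X | | | | | X | X | X | | X | X | 1.293 |
| 4 Finch20 | 21 | 26 | | X | | | | | X | | | X | X | 1.904 |
| 5 Goodman et al.21 | 36 | 297 | X | | | | | X | | | | X | X | −0.133 |
| 6 Hmelo et al.22 | 20 | 20 | | | | X* | | X | X | | | X | X | 0.7305 |
| 7 Hmelo23 | 39 | 37 | X | | | X* | | X | X | | | X | X | 0.768 |
| 8 Martenson et al.24 | 1,651 | 818 | | X | | | | | | | X | X | X | 0.00 |
| 9 Mennin et al.25 | 144 | 447 | X | | | | (X) | X | | | | X | X | 0.046 |
| 9 (randomised) | (67) | (27) | (X) | | | | | | | | | (NA) | (NA) | (−0.16) |
| 10 Richards et al.26 | 88 | 364 | X | | | | | X | | | | X | X | 0.3375 |
| 11 Schmidt et al.27 | 204 | 204 | | | X | | X | | | | | NA | NA | 0.310 |

*PBL elective versus non-PBL elective
PBL = problem-based learning; LBL = lecture-based learning; NA = not applicable
? = uncertain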


Our re-examination of the Gijbels et al.8 meta-analysis revealed that 10 of the 11 studies17–27 in the principles category used quasi-experimental designs (Table 1). All 10 of these studies were subject to constant biases and confounds that favoured the intervention group. These threats to validity were not ruled out as plausible rival explanations in the original studies, and the study-level validity threats were not considered in the meta-analysis. (Gijbels et al.8 addressed this in one sentence in the Discussion section, saying that: ‘It is also known that selection bias problems are sometimes inherent in “between-institution” or “elective track” studies.’)

Seven studies (2, 3, 5, 6, 7, 9 and 10 in Table 1) were subject to selection bias; the self-selection and special selection in five of these studies (2, 3, 5, 9 and 10) clearly resulted in superior students in the treatment group at the outset. (For example, in study 2, PBL students were reported to have had a significantly higher mean score on the Medical College Admissions Test, with d = 0.46, which is nearly identical to the outcome difference reported by Gijbels et al. for study 2 [d = 0.445]. Even stronger support for the alternative ‘selection’ hypothesis was provided by study 9, which fortuitously included results of a secondary analysis of a randomised subset of data for a subgroup of students who had requested PBL but were randomly assigned to PBL or the standard condition. In general, the results for the randomised comparisons showed negative effects for the PBL intervention [d = −0.33 for the National Board of Medical Examiners licensure examinations Part III], whereas results for the non-randomised quasi-experimental comparisons in the full study were positive [d = 0.33 for Part III].)

Five studies (1, 3, 4, 6 and 7) were subject to intervention and outcome confounding, such that the intervention group had experience and practice with the outcome as a part of the intervention. (For example, the studies by one research group [studies 6 and 7] compared small groups of paid volunteers from PBL and standard curricula on a series of pathophysiological explanation tasks that involved clinical problems or cases – much like the usual PBL routine. Not surprisingly, the PBL students did better, because they were simply doing what they had been doing all along with similar problems, which the standard-track students had yet to encounter. In addition, exactly the same proportions of PBL [89%] and standard [89%] students made the correct diagnosis by the end of the case problem, despite the higher scores of PBL students on the PBL process test/outcome. The problem with these studies was clearly stated by another study [4] in this group, which reported: ‘…the PBL students had exposure to modified essay questions associated with the PBL evaluation exercises which were not a feature of the traditional programme.’)

One study (3) was subject to both selection bias and intervention/outcome confounding, and also to a time-on-task bias, whereby more time was spent on the intervention task (three 2-hour sessions over a 3-week period versus one 2-hour meeting). Another study (8) appeared to be subject to two other sources of confounding (time lapse from course to testing and differential response rates [2–3 years versus 3.5–4.5 years, and n = 1651 versus n = 818]), although the results showed no effect of the intervention. The remaining study (11) reported results for randomised comparisons (students were assigned by lottery to different schools with different curricula), which showed a weak effect of the intervention (d = 0.31). (Note that the effect size of the single randomised study was considerably less than the mean of all 11 studies, which included the 10 non-randomised, quasi-experimental studies.)
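The comparison in the closing note can be checked roughly from the effect sizes listed in Table 1. The sketch below computes simple unweighted means, which is our simplifying assumption; the meta-analysis itself reported a weighted d, so these figures are not expected to reproduce its pooled estimate.

```python
from statistics import mean

# Effect sizes (d) as listed in Table 1, keyed by study number.
effect_sizes = {
    1: 2.268, 2: 0.445, 3: 1.293, 4: 1.904, 5: -0.133, 6: 0.7305,
    7: 0.768, 8: 0.00, 9: 0.046, 10: 0.3375, 11: 0.310,
}
randomised_study = 11  # the single randomised study in Table 1

# Unweighted means (an assumption; the meta-analysis used weighted means).
mean_all = mean(list(effect_sizes.values()))
mean_quasi = mean(
    [d for study, d in effect_sizes.items() if study != randomised_study]
)
randomised_d = effect_sizes[randomised_study]
```

On this unweighted basis the mean of all 11 studies is about 0.72 and the mean of the 10 quasi-experimental studies is about 0.77, against 0.31 for the randomised study.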


Our re-examination of the Gijbels et al.8 results and conclusions reinforced our concerns about the validity of meta-analyses based on quasi-experimental studies. Our findings illustrate that threats to validity in quasi-experimental studies can seriously bias the meta-analysis conclusion. The implication of this illustration of a worst-case scenario, with its serious constant biases, for meta-analysis practice in general is that unless positive and negative biases are completely counterbalanced or cancelled out across the studies synthesised – which seems very unlikely – the results of a meta-analysis are subject to remaining threats to validity which would seem to undermine confidence in the conclusions of the meta-analysis. We hope this illustration will alert researchers and readers to the threat of biases in meta-analyses based primarily on quasi-experimental studies that they might undertake or read.

Systematic narrative reviews

Our tentative conclusion is that, at this stage of research in medical education, the field might be better served, in most instances, by systematic narrative reviews that describe and critically evaluate individual studies and their results, rather than by reviews that obscure biases and confounds by averaging. At the least, meta-analysts must avoid a mindless coding and averaging of studies and emphasise the critical reading and rereading of each and every candidate study to establish whether there are serious concerns that might disqualify any given study from inclusion and to determine whether a meta-analysis is meaningful or even possible with the studies located by the search. In the illustration presented here, a meta-analysis does not seem advisable.


In this paper, we have focused on quasi-experimental studies in meta-analysis because most studies in medical education are quasi-experimental and prone to validity threats. Very few studies in medical education are randomised experiments, and the number targeted for review in a specific research domain could be very small. For example, in our illustration, only one of 11 studies was randomised (study 11, Schmidt et al.,27 in Table 1). Because the number of randomised experiments in most specific research domains would seem too small to warrant a meaningful meta-analysis, our recommendation is that experimental studies should be included with the quasi-experimental studies in a systematic narrative review.

More seriously, many so-called randomised studies in field settings such as medical education may start out randomised, but succumb to various threats to validity along the way, which effectively ‘de-randomise’ the groups and turn an experiment into a quasi-experiment. For example, differential levels of attrition in treatment and control groups might result in only the very best students surviving in a challenging innovative treatment condition, while there is little or no attrition in the control condition. More time might be required for the administration and completion of the treatment task than for the less involved control condition. The assessment method used for the outcome might also be employed as a routine part of the intervention, but it is only experienced as the outcome for controls. After the initial randomisation, students might be enrolled in two different schools (with different faculty, different resources, rural versus urban contexts, etc.) and the innovative curriculum might be delivered at one school and the traditional curriculum at the other (as in the randomised study 11, Schmidt et al.27). Analysis of covariance (ANCOVA) is not the answer because it only partially corrects for biases and confounds and thus does not fully eliminate the threats in quasi-experimental studies29,30 and often creates more confusion than it resolves. Even ‘pure’ randomisation (without de-randomisation) is no guarantee that groups are equivalent at baseline: randomly assigned groups often (P = 0.05) differ significantly on baseline measures.
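The point about baseline differences can be illustrated with a short calculation. It assumes, for simplicity, that the baseline measures are statistically independent and that each is tested at the 0.05 level; real baseline measures are usually correlated, so this is only a rough guide, and the function name is hypothetical.

```python
def chance_of_baseline_difference(n_measures, alpha=0.05):
    """Probability that at least one of n independent baseline comparisons
    is 'significant' at level alpha purely by chance under randomisation."""
    if n_measures < 0:
        raise ValueError("n_measures must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_measures


# Chance of at least one spurious baseline difference for 1, 5 and 10 measures.
chances = {k: chance_of_baseline_difference(k) for k in (1, 5, 10)}
```

Under these assumptions the chance rises from 5% with one baseline measure to roughly 23% with five and roughly 40% with ten.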

Nevertheless, researchers in medicine have come to refer to randomisation as the ‘gold standard’, and researchers in medical education have followed suit. Worrall31 criticises this ‘gold standard’ conception of randomisation and says it creates the impression that randomisation ‘plays a uniquely privileged role’ in scientific inquiry and that ‘RCTs [randomised controlled trials] carry special weight – often indeed that they are essential for any truly scientific conclusion to be drawn from trial data about the effectiveness or otherwise of proposed new therapies or treatments’. However, randomisation is not the sine qua non of scientific inquiry: it is neither necessary nor sufficient for causal inference, as indicated by the title of Worrall’s article, ‘Why there’s no cause to randomise’.31

Instead, randomisation is simply a tool – albeit a very powerful tool – that can be used to approximate an ideal for scientific inquiry. This ideal (aka the Galilean model) provides a theoretical definition of what is meant by ‘true’ experiment and valid causal inference, which is that all factors except one must be held constant so that the outcome can be attributed to that one factor. This is easier said than done – it’s an ideal! Randomisation then is just a practical tool or procedure that attempts to achieve the ideal: it attempts to control for bias in the selection of subjects for conditions. Blinding is another powerful practical tool, but it attempts to control for confounding factors after the initial randomisation. And so on. Our recommendation then is that randomised studies – particularly in otherwise uncontrolled, non-laboratory field settings, such as medical education research – should also be critically evaluated for validity threats to determine the credibility and validity of the study conclusions and should be included with quasi-experimental studies in a systematic narrative review.

Study quality versus validity threats

Mechanically scoring studies for quality (such as by a research assistant using a checklist) sidesteps the critical concern about whether the outcome results can be legitimately attributed to the nominal intervention.32,33 That is, study quality as assessed with checklists commonly reflects reporting quality, acknowledgement of ethical issues, adequate background, rationale and literature search, etc. and is not a proxy for ruling out validity threats and establishing the validity of the research conclusion. The hazards of scoring the quality of studies for meta-analysis were dramatically demonstrated by Juni et al.,34 who applied 25 published quality assessment scales to the original 17 clinical trials in a meta-analysis that compared low-molecular-weight heparin versus standard heparin in the prevention of postoperative thrombosis.35 They found that different quality scales ‘can dramatically influence the interpretation of meta-analytic studies’, whereby ‘depending on the scale used, the effect size either increased or decreased with increasing trial quality’. They concluded that ‘relevant methodological aspects should be assessed individually and their influence on effect sizes explored’.34

In brief, systematic narrative reviews of individual studies might prove instructive to researchers and ultimately contribute to the quality of medical education research. At the least, reviewers should perform an initial critical narrative review to see if the studies warrant synthesis in a meta-analysis and to provide a basis for substantive coding for the meta-analysis. Meta-analysis is a powerful tool for medical education, but it demands an initial, critical, study-by-study approach to determine the suitability of studies, especially quasi-experimental studies, for meta-analysis. Meta-analysis also requires a willingness to exclude studies for which plausible validity threats cannot be convincingly ruled out and – harder yet – to concede that a valid meta-analysis of the selected studies may not be possible, in which case a systematic narrative review is indicated.

Contributors:  all authors made substantial contributions to the conception and design of this paper and to its critical revision for important intellectual content. All authors approved the final manuscript.

Acknowledgements:  none.


Funding:  none.

Conflicts of interest:  none.

Ethical approval:  not applicable.


What is already known on this subject

Meta-analyses are commonly performed on quasi-experimental studies, despite the biases and confounds present in the individual studies. The implicit assumption is that biases and confounds are averaged or cancelled out by the synthesis.

What this study adds

This paper emphasises the likelihood that the biases and confounds of quasi-experimentation will undermine the conclusions of the meta-analysis and illustrates the problem by re-examining the results of a recent meta-analysis based primarily on quasi-experimental studies.

Suggestions for further research

The authors tentatively conclude that the field of medical education (and other applied research areas that rely heavily on quasi-experimentation) might be better served by systematic narrative reviews that critically evaluate individual studies and their results.