David A Cook MD, MHPE, Division of General Internal Medicine, Mayo Clinic College of Medicine, Baldwin 4-A, 200 First Street SW, Rochester, Minnesota 55905, USA. Tel: 00 1 507 266 4156; Fax: 00 1 507 284 5370; E-mail: email@example.com
Medical Education 2011: 45: 227–238
Context Studies evaluating reporting quality in health professions education (HPE) research have demonstrated deficiencies, but none have used comprehensive reporting standards. Additionally, the relationship between study methods and effect size (ES) in HPE research is unknown.
Objectives This review aimed to evaluate, in a sample of experimental studies of Internet-based instruction, the quality of reporting, the relationship between reporting and methodological quality, and associations between ES and study methods.
Methods We conducted a systematic search of databases including MEDLINE, Scopus, CINAHL, EMBASE and ERIC, for articles published during 1990–2008. Studies (in any language) quantifying the effect of Internet-based instruction in HPE compared with no intervention or other instruction were included. Working independently and in duplicate, we coded reporting quality using the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement, and coded study methods using a modified Newcastle–Ottawa Scale (m-NOS), the Medical Education Research Study Quality Instrument (MERSQI), and the Best Evidence in Medical Education (BEME) global scale.
Results For reporting quality, articles scored a mean ± standard deviation (SD) of 51 ± 25% of STROBE elements for the Introduction, 58 ± 20% for the Methods, 50 ± 18% for the Results and 41 ± 26% for the Discussion sections. We found positive associations (all p < 0.0001) between reporting quality and MERSQI (ρ = 0.64), m-NOS (ρ = 0.57) and BEME (ρ = 0.58) scores. We explored associations between study methods and knowledge ES by subtracting each study’s ES from the pooled ES for studies using that method and comparing these differences between subgroups. Effect sizes in single-group pretest/post-test studies differed from the pooled estimate more than ESs in two-group studies (p = 0.013). No difference was found between other study methods (yes/no: representative sample, comparison group from same community, randomised, allocation concealed, participants blinded, assessor blinded, objective assessment, high follow-up).
Conclusions Information is missing from all sections of reports of HPE experiments. Single-group pre-/post-test studies may overestimate ES compared with two-group designs. Other methodological variations did not bias study results in this sample.
In medical education, as in all research, investigators employ a variety of research designs to answer important questions. Each scientific paradigm or approach uses specific methodological features to enhance the rigour of results and the defensibility of interpretations. Within the quantitative experimental paradigm, methods such as randomisation, allocation concealment (for randomised controlled trials [RCTs]), adequate power, complete follow-up, blinding of assessment and objective assessment enhance study quality by reducing confounding and strengthening the causal link between intervention and outcome.1 In addition to the issue of confounding, however, the question of whether weak methods bias the result itself also requires resolution: do results from weaker studies overestimate or underestimate the true effect? Some evidence in clinical medicine suggests an association between certain study quality measures and study outcomes (i.e. strong and weak methods produce different results2), whereas evidence from other studies does not.3,4 A review of education studies in non-medical fields found that the variance explained by differences between randomised and observational studies, although small, was of approximately the same magnitude as the variance explained by study interventions.5 However, another study in medicine found that some design features, such as objective outcome assessment, mitigate other design flaws.6 We are not aware of any studies in medical education empirically quantifying the effect of different research designs on study outcomes. Such information would inform discussions regarding the trustworthiness of weak designs in medical education research.
Consumers of the medical literature also need transparent and complete reporting. Several expert panels have developed guidelines to facilitate good reporting (e.g. Consolidated Standards of Reporting Trials [CONSORT],7 Transparent Reporting of Evaluations with Non-randomised Designs [TREND],8 Strengthening the Reporting of Observational Studies in Epidemiology [STROBE]9), yet reporting quality remains suboptimal.10–12 Four studies of reporting quality in the medical education literature13–16 have identified deficiencies, but all of these were limited by incomplete evaluation of quality standards. A number of recent studies in clinical medicine have likewise identified room for improvement in adherence to reporting guidelines.11,12,17–22 However, most of these studies focused exclusively on RCTs and only one included both RCT and observational studies.17 We are not aware of a systematic review of reporting quality in medical education research employing a data extraction instrument based on comprehensive, accepted standards. Such a review would contribute substantially to our understanding of the scope of incomplete reporting, which, in turn, would set the stage for the exploration of potential solutions.
Systematic reviews in medical education are beginning to evaluate study quality23,24 and a few studies25–28 have focused specifically on methodological quality. However, we are not aware of any studies in medical education that have explored associations between study methods and reporting quality as some research has in clinical medicine.29 These constructs should, at least in theory, be independent; substantial correlation would indicate that poorly reported studies are interpreted (correctly or incorrectly) as having used inferior methods.
Finally, at least three measures have been used in systematic reviews of medical education research to evaluate the methodological quality of disparate quantitative designs: the Medical Education Research Study Quality Instrument (MERSQI);25,26 the Newcastle–Ottawa Scale,24,30 and the Best Evidence in Medical Education (BEME) global rating.23 In choosing among these instruments, systematic reviewers and consumers in general would benefit from comparative data. We are not aware of studies providing such information.
To address the gaps noted above, we conducted a study with a four-fold purpose, namely: (i) to explore the quality of reporting of experimental research in medical education; (ii) to compare three measures for evaluating the methodological quality of experimental medical education research; (iii) to evaluate the relationship between reporting and methodological quality, and (iv) to evaluate the relationship between effect size (ES) and specific aspects of methodological quality. We did this using the articles identified in a recently published systematic review of Internet-based instruction.24 Although it does not cover all topics in medical education, this sample reflects a comprehensive snapshot from 1990 to 2008 of experiments in one field of active education research and covers journals from a broad range of specialties, including medicine, nursing, allied health, dentistry and pharmacy.
Study eligibility and selection
We included studies from a recent systematic review of Internet-based instruction.24,31 This review included studies published in any language that investigated use of the Internet to teach health professions learners, in comparison with another intervention, using quantitative outcomes of satisfaction, knowledge or attitudes, skills (in a test setting), behaviours (in practice), or effects on patients. Definitions and a full search strategy have been previously published.24,31 Briefly, we defined health professions learners as students, postgraduate trainees or practitioners in a profession directly related to human or animal health, including doctors, nurses, pharmacists and dentists, and we defined Internet-based instruction as computer-assisted instruction that uses the Internet or a local intranet as the means of delivery. We searched databases including MEDLINE, EMBASE, CINAHL, ERIC, Web of Science and Scopus for articles published from 1 January 1990 to 17 November 2008, using terms such as ‘Internet’, ‘web’, ‘e-learning’ and ‘computer-assisted instruction’. We supplemented this search with articles from authors’ files and article reference lists. When we found duplicate reports of the same study, we included only one report. Of the 2705 studies initially found, we identified 266 studies eligible for inclusion through duplicate review of titles, abstracts and full text.
For analyses evaluating ESs in relation to methodological variations, we included all studies. For reasons of feasibility, for the analysis of reporting quality we randomly selected half the studies using computer-generated random numbers, after excluding dissertations, whose format and length vary substantially from those of journal publications.
The first step in determining reporting quality required the selection of a quality standard. We considered three different guidelines: the TREND,8 CONSORT for non-pharmacological treatments32 and STROBE9 statements. We selected the STROBE statement because most of the studies in our sample were observational (hence many elements of CONSORT did not apply) and because the STROBE guidelines have now been endorsed by over 100 journals (http://www.strobe-statement.org). We used the ‘more informative abstract’ headings for coding abstract completeness.33
To extract data on methodological quality, we used a modification of the Newcastle–Ottawa Scale (m-NOS),24 the MERSQI25 and the BEME global rating.23 We also extracted data on ethical issues (institutional review board approval and participant consent) and study conclusions (our interpretation and our impression of the authors’ interpretation of whether the study results favoured the study intervention, the comparison intervention or neither).
Using these parameters of reporting and methodological quality, we developed and refined a data abstraction form through iterative pilot testing and revision. Appendix S1 (online) contains a detailed description of how we operationalised these criteria. We abstracted data independently in duplicate for all items. Conflicts were resolved by consensus.
Data synthesis and analysis
We calculated inter-rater agreement on quality codes using the intraclass correlation coefficient (ICC) for a single rater.34 We used sas Version 9.1 (SAS Institute, Inc., Cary, NC, USA) for all analyses. Statistical significance was defined by a two-sided alpha of 0.05.
We determined the frequency of presence for all STROBE elements and ethical issues. To enable correlation with other quantitative measures, we calculated a completeness of reporting index for each main section (Title/Abstract, Introduction, Methods, Results, Discussion) reflecting the percentage of elements present in that section. The sum of the individual section indices constituted an overall reporting index (maximum score 500). To evaluate reporting quality over time, we used publication year to divide articles into four groups of roughly equal size, and compared reporting indices using the Kruskal–Wallis test. We used the Wilcoxon rank sum test to compare reporting quality between randomised and observational studies.
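As a concrete illustration, the section and overall reporting indices can be computed as in the following minimal Python sketch; the per-section element counts are invented for the example and are not taken from any study in the review.

```python
# Sketch of the reporting indices described above (hypothetical counts).
# Each section index is the percentage of applicable STROBE elements
# present in that section (0-100); the overall reporting index is the
# sum of the five section indices (maximum 500).

def section_index(present: int, total: int) -> float:
    """Percentage of STROBE elements present in one section."""
    return 100.0 * present / total

def overall_index(counts: dict) -> float:
    """Sum of the five section indices (maximum 500)."""
    return sum(section_index(p, t) for p, t in counts.values())

# Hypothetical article: (elements present, elements applicable) per section
article = {
    "Title/Abstract": (3, 6),
    "Introduction":   (2, 4),
    "Methods":        (7, 12),
    "Results":        (5, 10),
    "Discussion":     (2, 5),
}

print(round(overall_index(article), 1))  # 50 + 50 + 58.3 + 50 + 40 = 248.3
```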
We reported individual frequencies or scores for each item on the m-NOS and MERSQI, and calculated mean total scores for the m-NOS, MERSQI and BEME scales. Using Spearman’s rho, we calculated the association between each methodological quality scale and the other methodological quality scales and the overall reporting index.
We abstracted outcomes data and determined ES as previously described.24 To keep comparisons consistent, in this study we used only knowledge outcomes. We explored associations between knowledge ESs and individual quality features, including: sample representativeness; number of groups; whether the comparison group was drawn from the same community as the intervention group (two-group studies only); randomisation (two-group studies only); allocation concealment (randomised studies only); blinding of participants to the study hypothesis; blinding of assessors; objective outcome assessment, and completeness of follow-up. We explored associations in three ways. Firstly, to determine how methodological quality might affect the results of a meta-analysis, we conducted random effects meta-analyses for high- and low-quality subgroups following procedures previously described.24 Secondly, we performed meta-regression using these quality features as covariates, except that we did not include allocation concealment (because this would have restricted regression to RCTs) or features for which there were fewer than four studies at each quality level. As different pooled ESs would be expected for different comparisons, we performed separate meta-analyses and meta-regression for each study type (i.e. studies making comparison with no intervention, studies comparing Internet-based instruction with a non-Internet intervention, and studies comparing two computer-based instructional designs that facilitated differing levels of learner cognitive interactivity). Thirdly, although subgroup meta-analyses and meta-regression tell us how much methodological differences affect a pooled estimate, they do not indicate the degree to which a study’s methods might cause results to deviate from the true effect (bias). Pooled analyses permit a positive deviation in one study to compensate for or cancel out a negative deviation in another study. 
Although such deviations will increase between-study variability and thus be reflected in a wider confidence interval (CI), we explored this issue more directly by examining associations between study methods and the magnitude of deviation from the pooled ES (our best estimate of true effect). We calculated a difference score for each study by subtracting that study’s ES from the pooled ES for that study type, and then taking the absolute value. We used mixed-effects analysis of variance to determine the association between the predictor variables (listed above) and the difference score, including all study types and accounting for repeated measures on articles (as several articles contributed more than one comparison).
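The difference-score calculation described above can be sketched as follows. The effect sizes here are invented for illustration, and a simple unweighted mean stands in for the random-effects pooled estimate used in the actual analysis.

```python
# Sketch of the difference-score analysis described above (invented data).
# Each study's difference score is |ES_pooled - ES_i|, with the pooled ES
# computed separately for that study's comparison type.  A simple
# unweighted mean stands in for the random-effects pooled estimate used
# in the actual review.

from statistics import mean

# (comparison type, effect size, number of groups) - hypothetical studies
studies = [
    ("no-intervention", 1.50, 1),
    ("no-intervention", 0.30, 1),
    ("no-intervention", 0.70, 2),
    ("no-intervention", 0.75, 2),
]

# Pooled ES per comparison type (unweighted mean as a stand-in)
pooled = {kind: mean(es for k, es, _ in studies if k == kind)
          for kind in {s[0] for s in studies}}

# Mean absolute deviation from the pooled ES, grouped by number of
# groups -- the feature compared between subgroups in the text
deviation = {1: [], 2: []}
for kind, es, n_groups in studies:
    deviation[n_groups].append(abs(pooled[kind] - es))

for g in (1, 2):
    print(g, round(mean(deviation[g]), 4))
```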
We identified 266 articles reporting comparative studies of Internet-based instruction involving 32 928 learners. For reasons of feasibility, we randomly selected half (n = 133) for additional data extraction on reporting quality and previously uncoded methodological features. All 266 were eligible for analyses of primary methodological features and ES, but only 209 provided the needed knowledge outcomes. PRISMA-type diagrams35 detailing trials identified, excluded and included have been published previously.24,31
Quality of reporting
Of the 133 articles, we excluded from reporting quality analyses three very short reports (< 500 words) with restricted journal requirements (e.g. no references permitted). We present reporting quality for the remaining 130 articles in Fig. 1; details are reported in Appendix S1. The mean ± SD reporting index (the percentage of STROBE elements present for a given article section, maximum 100) ranged from 41 ± 26 for the Discussion section to 58 ± 20 for Methods. The overall reporting index (sum of the five individual reporting indices, maximum 500) was 253 ± 90, ranging from 39 to 486 for individual articles. We had very good inter-rater reliability for reporting quality data abstraction, with ICCs of > 0.61 for all elements and > 0.71 for all but five (problem statement, statement of study intent, outcome description, discussion summary, discussion generalisability).
Although 120 of 130 articles (92%) clearly described the Internet-based intervention, only 61 of 93 (66%) articles with a comparison arm clearly described the comparison intervention. Sixty-nine articles (53%) stated the study design and 22 (17%) reported sample size calculations. Seventy-five articles (58%) noted institutional review board evaluation (56 studies, 43%), participant consent (57 studies, 44%), or both.
Fifty-five studies (42%) reported the number of subjects eligible for participation, 103 (79%) reported follow-up rates and 17 (13%) provided a CONSORT-style flow diagram. Although 114 articles (88%) reported p-values, only 82 (63%) reported both the mean and a measure of variance (e.g. SD or standard error of the mean) and only 11 (8%) reported the CI for the difference between means.
Study limitations and strengths were infrequently acknowledged, with 66 articles (51%) commenting on sources of potential bias, 44 (34%) mentioning precision (i.e. adequacy of sample size), and 65 (50%) discussing the magnitude of effect or potential confounders. Even fewer articles (n = 29, 22%) interpreted study results in light of limitations.
The reporting quality of RCTs was somewhat higher than that of observational studies for most individual elements (Appendix S1). Reporting indices for all sections were significantly higher for RCTs than for observational studies (p < 0.001).
Reporting quality improved over time (p = 0.002). The mean ± SD overall reporting index rose from 212 ± 81 for studies reported during 1996–2001 (n = 27) to 235 ± 82 for those reported in 2002–2004 (n = 35), 261 ± 84 for those reported in 2005–2006 (n = 42) and 307 ± 97 for those reported in 2007–2008 (n = 26). We observed a similar pattern for each section-specific reporting index (not shown).
Interpretation of study results
We coded our interpretation of the study results as favouring the study (Internet-based) intervention, favouring the comparison, or neutral, and also coded our impression of the authors’ interpretations. Our agreement on these codes was excellent (ICC = 0.86). Although we found generally good concordance between our interpretations of the study results and our impressions of the authors’ interpretations (ICC = 0.81), we found discrepancies in 12 studies (9%). These generally involved author interpretations of neutral results (our impression) as favouring the study intervention (n = 8), or of results favouring the comparison intervention (our impression) as neutral (n = 2) or favouring the study intervention (n = 1). In one instance we interpreted results as favouring the study intervention and the authors interpreted them as neutral.
We rated the methodological quality of 133 articles using three previously described scales (Table 1). We achieved excellent inter-rater agreement for all MERSQI codes (ICC ≥ 0.76) except appropriateness of analytic methods, for which inter-rater agreement was moderate (ICC = 0.53). We also achieved moderate agreement for BEME codes (ICC = 0.58). As previously reported,24 inter-rater agreement for m-NOS score domains was moderate or substantial36 (≥ 0.48).
Table 1. Ratings from three measures of study methodological quality (n = 133)

Medical Education Research Study Quality Instrument (MERSQI)
Total score (maximum 18): mean ± SD 11.7 ± 2.1; median (range) 11.5 (6–16)
Subscales (points awarded if present):
  Study design (maximum 3): single-group pre-/post-test (1.5); observational two-group (2); randomised two-group (3)
  Sampling: institutions, n (maximum 1.5): > 2 (1.5)
  Sampling: follow-up (maximum 1.5): < 50% or not reported (0.5); ≥ 75% (1.5)
  Type of data: outcome assessment (maximum 3)
  Validity evidence (maximum 3): internal structure (1); relations to other variables (1)
  Data analysis: appropriate (maximum 1)
  Data analysis: sophistication (maximum 2): beyond descriptive analysis (2)
  Highest outcome type (maximum 3): satisfaction, attitudes, perceptions (1); knowledge, skills (1.5); patient/health care outcomes (3)

Newcastle–Ottawa Scale (modified)
Total score (maximum 6): mean ± SD 2.9 ± 1.5; median (range) 3 (0–6)
Subscales: representativeness of sample; comparison group from same community; comparability of comparison cohort, criterion A*; comparability of comparison cohort, criterion B*; blinded outcome assessment

Best Evidence in Medical Education (BEME) global scale (single item)†
Average score (maximum 5): mean ± SD 2.3 ± 1.0; median (range) 2 (1–5)
Anchors: no clear conclusions can be drawn (1); results ambiguous but appears to show a trend (2); conclusions can probably be based on results (3); results are clear and likely to be true (4); results are unequivocal (5)

* Comparability of cohorts criterion A was present if the study was (i) randomised or (ii) controlled for a baseline learning outcome; criterion B was present if (i) a randomised study concealed allocation or (ii) an observational study controlled for another baseline learner characteristic
† Whereas the MERSQI and m-NOS are multi-item scales, the BEME scale is a single item ranked from 1 to 5 using the anchors noted above
We found high correlations (all p < 0.0001) between MERSQI scores and m-NOS (ρ = 0.73) and BEME (ρ = 0.62) scores, and moderate correlation between m-NOS and BEME (ρ = 0.57) scores. We also found significant associations (all p < 0.0001) between reporting quality (overall reporting index) and MERSQI (ρ = 0.64), m-NOS (ρ = 0.57) and BEME (ρ = 0.58) scores.
Association between methodological quality and effect size
Using the 209 studies reporting knowledge outcomes (25 397 learners), we explored associations between methodological quality and ES in three ways.
Firstly, to understand how methodological differences might affect the results of a meta-analysis, we performed meta-analyses on methodological quality subgroups. As the pooled ES for each study type (comparison with no intervention, with a non-Internet intervention or with a computer-based intervention requiring less learner cognitive interactivity) varied,24,31 we analysed each type separately. Complete analyses are reported in Appendix S1; we mention here the most salient results. For controlled studies with no intervention, we found lower ESs for studies with two or more (versus one) groups and studies in which learners were not blinded to the study hypothesis. We also found lower quality associated with higher ESs for sample representativeness and selection of the comparison group, but 95% CIs overlapped. Media-comparative studies demonstrated a consistent association between lower quality and higher ESs for all features except sample representativeness, allocation concealment and participants blinded to study hypothesis, but differences were relatively small (< 0.2 SDs) and CIs overlapped substantially. By contrast, studies comparing two computer-based interventions showed higher ESs for all high-quality features except allocation concealment, although again CIs showed substantial overlap.
Secondly, we performed meta-regression to identify study features independently associated with knowledge outcomes (see details in Appendix S1). In the analysis of controlled studies with no intervention (n = 126), only the number of groups demonstrated a significant association: studies with two or more groups had a lower average ES than single-group studies (difference − 0.35, 95% CI − 0.61 to − 0.08; p = 0.012). For the 63 media-comparative studies, studies with a representative sample had a higher average ES (difference 0.27, 95% CI 0.01–0.53; p = 0.043) than those with less representative samples. For the 20 studies comparing two computer-based interventions, we found no significant association between methods and ES.
Deviation from pooled estimate
Thirdly, although the above analyses tell us how much methodological differences affect a pooled estimate, they do not indicate how a given study’s results differ from the true effect. Thus, we explored associations between study methods and the magnitude of deviation from the pooled ES. Among the nine quality features examined (Table 2), only the number of groups demonstrated a statistically significant difference (difference from pooled ES was 0.83 for single-group studies and 0.49 for two-group studies; between-subgroup difference = 0.34, 95% CI 0.07–0.61; p = 0.013). This indicates that results from single-group pre-/post-test studies differ from the pooled estimate (study results either greater than or less than the pooled ES) by about one-third SD more than results from two-group studies. Notably, only no-intervention studies contributed to that analysis. Differences between other methodological variations were small (≤ 0.15) and not statistically significant (p ≥ 0.32).
Table 2. Associations between methodological quality and effect size (n = 209)

Quality feature (difference score,* mean [95% CI]):
  Number of groups
  Comparison group from same community (two-group studies only)
  Group assignment (two-group studies only)
  Allocation concealed (randomised studies only)
  Participants blinded to study hypothesis

* The difference score was calculated as the absolute value of the difference between the pooled effect size (ES) and the ES for that study: ES_difference_i = |ES_pooled − ES_i|. Reported values represent adjusted estimates from mixed-effects analysis of variance adjusting for study type (no intervention, media-comparative, computer-based instruction versus computer-based instruction)
95% CI = 95% confidence interval
In this sample of health professions education experimental research, we sought to determine the quality of reporting, compare three measures for evaluating methodological quality, explore the relationship between reporting and methodological quality, and evaluate associations between ES and study methods. We found reporting quality to be generally suboptimal. Each article contained, on average, only about half the STROBE elements and only one element (a clear definition of the study intervention) was present in > 90% of reports. Discussion sections were particularly prone to reporting deficiencies: not only were the summary of results, study limitations and integration with other studies infrequently identified, but we disagreed with authors in the interpretation of study results 9% of the time and in these disagreements the study authors nearly always favoured the study intervention. We also found moderate to high correlation between three measures of methodological quality and found that studies with higher methodology scores had higher reporting indices.
The magnitude of difference between ESs in individual studies and the pooled estimate across studies was similar for high- and low-quality experimental study designs, except that one-group pre-/post-test studies deviated more than two-group studies. Meta-regression found a similar effect for number of groups even after adjusting for other method features. In subgroup analyses comparing meta-analytic pooled estimates between high- and low-quality study features, we found no consistent pattern for no-intervention comparative studies and media-comparative studies; some quality features were associated with larger ESs and others were associated with lower ESs. However, comparisons of two computer-based interventions nearly always revealed larger ESs for higher-quality studies.
Strengths and limitations
This study has several limitations. Firstly, we focused on a single field of health professions education research (i.e. Internet-based instruction), although we included all experimental studies within that field. Because we included only quantitative comparative studies, it is unlikely that these results characterise the quality of other designs, such as cross-sectional studies and qualitative research. Secondly, we recognise heterogeneity in the study designs and research questions of the included reports. This increases variance in pooled analyses and may blur potentially important between-study differences. However, the present study focuses on study quality rather than educational effectiveness per se. Thirdly, poor reporting in original reports may have affected our coding of study methods, a point to which we will return. Fourthly, to keep comparisons consistent we limited our analyses of the relationship between methods and ES to knowledge outcomes; we might have found different results with skills or behaviour outcomes. Furthermore, in our estimations of bias we used the pooled estimate as a measure of true ES, but in reality the true effect is unknown. Fifthly, given multiple comparisons, the meaning of a single statistically significant p-value in the evaluation of bias could be questioned (i.e. results may reflect chance). We also note that some of the quality variables are not strictly independent. Finally, some subgroup estimates have relatively wide CIs, indicating uncertainty in the magnitude of effect.
This study also has several strengths. We addressed several important questions related to research reporting and research methods in medical education research. We included a large number of studies, reporting data on nearly 33 000 learners representing a broad spectrum of training programmes. We used a widely accepted reporting guideline as the foundation of our reporting quality measure. All data abstraction was conducted in duplicate using rigorous methods.
Comparison with other studies
Other studies in medical education have described suboptimal reporting quality focusing on the abstract,37 introduction15 and selected methods.13–16,38 By using the comprehensive STROBE framework, the present study expands on these, particularly on the reporting of results and subsequent discussion. Compared with the only study in clinical medicine to use STROBE,17 reporting quality in the present study was somewhat lower. Both real differences in reporting quality and differences in definitions of operational criteria may contribute to this gap. As has been reported in clinical medicine,11,12,22 it appears that reporting quality in medical education experimental research has improved over time. This may reflect increased author training or awareness, or changes in journal policies.
Our findings on methodological quality build on those of previous studies25–28 by explicitly comparing three previously used quality measures, examining the association between methodological and reporting quality, and exploring the relationship between methodological quality and ES. The absence of association between methods and quantitative results has been reported in clinical medicine.3,4
Although this study focused on experimental research in Internet-based instruction, we suspect our findings have relevance to quantitative research in health professions education generally. The precise numbers will vary, but the ultimate conclusion – that there is room for improvement in the clarity, completeness and objectivity of reporting – is likely to hold across research themes. The association between reporting and methodological quality (which are conceptually distinct) may simply mean that researchers capable of employing stronger methods have superior writing skills. It may also mean that superior reporting allows readers to discern more clearly a study’s methodological rigour. The latter interpretation supports the need for high-quality reporting.
Precisely which reporting elements are most important depends on the perceptions and purposes of the individual consumer. For example, a scientist seeking to confirm or refute a theoretical model, a designer seeking to replicate an effective instructional intervention, and a researcher conducting a systematic review may each require slightly different information from a report. This variety of purposes underscores the importance of complete reporting, which, in turn, validates the need for guidelines that define essential reporting elements. Rote adherence to guidelines will not compensate for poor-quality research or inferior writing skills, but inclusion of the elements listed in guidelines such as the STROBE, CONSORT or TREND statements will enable a wide range of consumers to understand and apply the study results.
The problem of incomplete reporting is not limited to medical education research. However, as the problem almost certainly includes some degree of lack of awareness (implying that the solution will require education), it makes sense that medical educators should assist in identifying solutions. Guidelines help,39 but will be insufficient on their own.10,40,41 Hands-on editing has been shown to improve reporting quality.42–45 Rigorously enforced journal policies on required reporting elements such as human subjects protections also contribute.15 Ultimately, it will fall to reviewers and editors to not only raise the bar, but to help authors develop the skills they need to vault it.
High methodological and reporting quality does not guarantee valid interpretations and conclusions. The finding that our interpretations differed from those of some study authors suggests confirmation bias, the tendency to interpret results as favouring a more desirable conclusion. Almost 20 years ago, Cohen and Dacanay reported a nearly identical bias towards new technologies.46 More recently, Colliver and McGaghie noted over-interpretation of study results.47 In addition to demanding rigorous methods and complete reporting, editors and reviewers must ensure objective interpretation of study findings.
How should researchers grade the quality of quantitative research in medical education? Although there is no reference standard against which to compare, the high correlation between the MERSQI and m-NOS scores suggests they may be superior to the global BEME score. More importantly, the multi-dimensional nature of these quality indices permits users to evaluate individual quality components. Although the MERSQI and m-NOS cover similar domains (sampling, comparability of cohorts, follow-up, trustworthiness of outcomes), individual scale items address these domains differently. The m-NOS entails more rater subjectivity, which enhances flexibility for different study designs but increases the risk of reviewer error or bias, as reflected in the generally higher rater agreement for the MERSQI. The MERSQI has accumulated considerable evidence to support the validity of scores for summarising the quality of large numbers of published studies,25 which may confer an advantage for such applications. However, in terms of making judgements about individual studies, it remains unclear whether the MERSQI, the m-NOS or some combination of the two will provide the most useful information.
It appears that single-group pre-/post-test studies may overestimate the ES, as might be expected given the multiple validity threats from which this design suffers.1 Although our findings merit confirmation in other samples, the absence of other clear associations between study methods and ESs calls into question the conventional wisdom that better methods provide quantitative estimates closer to truth. Of course, superior methods have other advantages, most notably that they facilitate the interpretation of results by enhancing external validity and measurement score validity, and minimise alternative explanations for the observed effect (confounding). Although we found little difference in ES between randomised and observational studies, only randomised designs permit a clear causal link between the intervention and the outcome. Nonetheless, good evidence can be accumulated using a variety of study methods. We believe that researchers should focus first on asking important research questions and then on minimising the threats to valid study interpretation, rather than embracing a specific research design.
Contributors: DAC had full access to all data used in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. All authors contributed to the conception and design of the study and the acquisition and analysis of data. DAC wrote the first draft of this paper and all authors contributed to its critical revision. All authors approved the final manuscript for submission.
Acknowledgements: the authors thank Denise M Dupras MD, PhD, Patricia J Erwin MLS and Victor M Montori MD, MSc for their roles in initial study identification.
Funding: this work was supported by intramural funds and a Mayo Foundation Education Innovation award. AJL is supported in part through the John R Evans Chair in Health Sciences Education Research at McMaster University, Hamilton, Ontario. There was no external funding.