Basic knowledge of biostatistics and study design is important for the assessment of scientific findings and their introduction into clinical practice. A recently published article in the Journal of the American College of Surgeons provides compelling evidence that the statistical complexity of research in surgical journals is increasing1. However, critical appraisal of the scientific literature continues to challenge the surgeon. Moreover, there are common caveats in the interpretation of scientific literature that can potentially mislead the reader and must therefore be properly identified. This article outlines, in simple, non-mathematical language, certain caveats that are often relevant to the proper interpretation of published surgical research. It is hoped that it will help surgeons, both established and in training, to recognize common conceptual pitfalls.

Caveat 1: High impact factor is not synonymous with high quality

A critical approach is important, regardless of the reputation of the journal; a high impact factor does not necessarily equate with high-quality research. There are many examples of publications in top journals, both medical and surgical, that have serious methodological flaws2. Although editorial scrutiny and peer review are intended to recognize errors in manuscripts before publication, this process is by no means perfect. Surgeons are strongly advised to cast a critical eye on articles, regardless of the journal in which they appear. Research findings should not be introduced into clinical practice unless there is confidence that the research was performed with methodological rigor and that the findings are free from significant bias.

Caveat 2: Garbage in, garbage out

Abandon the belief that complex statistical computations are always associated with high-quality research. The objective of statistics is to transform raw, uninterpretable data into meaningful results, from which conclusions can be drawn. Statistics are undoubtedly important in medical research and they are often essential in the design and interpretation of surgical studies. However, the most sophisticated statistics can do nothing to improve the quality of the actual data. If the data are incorrect or if crucial factors, such as confounders (see Caveat 3), are missing, the study will remain problematic, regardless of the complexity of the applied statistical computations. Remember the adage—garbage in, garbage out.

Caveat 3: Beware of confounding

Before defining a confounder, it is important to understand the meaning of a predictor variable and an outcome. Clinical studies are often designed to show an association between a predictor variable and an outcome. In the surgical literature, a predictor variable may be a new operation, new diagnostic imaging tool, novel neoadjuvant or adjuvant chemotherapy regimen, or a risk or prognostic factor. The objective of the study generally is to assess the impact of the predictor variable on the outcome. Common outcomes are overall survival, disease-free survival, response to treatment and postoperative complications. A confounding variable (also known as a confounding factor or confounder) is an extrinsic factor linked to the predictor variable that also affects the outcome. The perceived association between predictor variable and outcome variable is distorted (biased) by the confounder.

An example that illustrates the potential danger of confounding can be drawn from a prospective observational (non-randomized) study comparing open versus laparoscopic total mesorectal excision (TME) for rectal cancer. Assume that the primary endpoint of this study is overall survival. Assume also that laparoscopic TME is found to have a significantly (P < 0·001) improved 5-year overall survival rate compared with the open approach. Does this mean that laparoscopic TME is truly better than open surgery?

The answer is ‘not necessarily’. It is certainly conceivable that subsets of patients undergoing open and laparoscopic TME differ greatly with respect to important prognostic factors. For instance, patients undergoing open TME might be older, have more co-morbid conditions and a higher rate of poorly differentiated cancer. They may present with larger tumours or with node-positive disease. Conversely, those having laparoscopic surgery may represent a highly selected subset of younger, healthier patients, with early-stage disease and well differentiated carcinomas. All of these factors (age, co-morbidities, tumour stage, lymph node status, tumour grade) may confound the relationship between the primary predictor variable (type of procedure) and the endpoint (overall survival).

Confounding is a common and troublesome problem in non-randomized studies. Often, associations between a predictor variable and an outcome are found to meet statistical criteria for significance and are proclaimed by investigators as real. However, it is of the utmost importance to examine such results as critically and objectively as possible. Ask whether a highly significant association could possibly be distorted by one or several confounders. To decrease bias, results from non-randomized studies should be adjusted for potential confounding factors using multivariable analyses, or other statistical techniques, such as stratification or propensity score analysis.

Caveat 4: Beware of postrandomization bias—levels of evidence are not sacred3

The classification of different study types into a hierarchy of levels of evidence has become common practice. The highest level of evidence is ascribed to large, prospective, randomized studies, with cohort and case–control studies ranked further down the hierarchy5. Prospective randomized studies are intended to reduce the possibility of bias but, when poorly designed or conducted, they are vulnerable to distortion and confounding. One important challenge in performing clinical trials is to avoid postrandomization bias. Postrandomization bias is much more common in surgical studies than in drug trials, as blinding is generally more difficult in the former. It is therefore important that the reader of clinical trials in surgery evaluates whether or not the outcome assessments were performed in a blinded fashion.

An example is postrandomization bias due to differences in expertise between the surgeons operating in respective study arms6. Consider a randomized study comparing open and laparoscopic resection of the sigmoid colon for diverticular disease. It can be assumed that a patient who is randomly assigned to the laparoscopic procedure will be operated on by senior surgeons with extensive experience of laparoscopic surgery. Conversely, if the patient is assigned to open resection, he or she may be operated on by a surgeon in training who has quite limited experience of colonic surgery. Although the randomization process distributes both known and unknown confounders (such as age, co-morbidity, nutritional status, sex, race and extent of disease) equally between the study arms, differences in surgical expertise create a postrandomization bias of potentially enormous magnitude. This renders meaningful interpretation of the results impossible.

Postrandomization bias can be minimized by standardizing surgical procedures. Moreover, all surgeons who participate in a clinical trial should have a similar degree of technical expertise. Ideally, they should all have performed a specified number of relevant operations (sufficient to overcome their learning curve) before being allowed to participate in a prospective randomized study. Randomized controlled trials are not guaranteed to be free of bias. In fact, the introduction of postrandomization bias can make the results of randomized clinical trials difficult or even impossible to interpret correctly. It is critically important to abandon the belief that only data from randomized clinical trials help advance the state of medical knowledge. A well-designed and conscientiously performed cohort study is of much greater value than a poorly designed and executed randomized clinical trial. A well-designed and well-performed case–control study is much better than a methodologically poor cohort study.

Caveat 5: Understand the meaning of the P value

The correct interpretation of P values, one of the most ubiquitous statistics in the surgical literature, can prove something of a challenge to many surgeons. An understanding of the meanings of the null and alternative hypotheses is fundamental. The null hypothesis of a study (H_{0}) states that no difference exists between the study groups. In a two-armed randomized controlled trial, for example, H_{0} is that there is no difference between arm 1 and arm 2 for the endpoint under investigation. Conversely, the alternative hypothesis (H_{A}) is that a difference exists between arms 1 and 2. The P value is defined as the probability that the difference between the two arms is at least as large as the one observed in the sample, if there actually is no difference in the overall population (assuming H_{0} to be true). In other words, the P value represents the probability that the difference observed between study arms could occur by chance alone.

An example can be found in a recently published randomized clinical trial7 comparing perioperative chemotherapy plus surgery with surgery alone. This landmark multicentre study showed a significant 5-year overall survival benefit for patients randomly assigned to receive perioperative chemotherapy; in fact, the estimated 5-year overall survival rate was 36 per cent in the perioperative chemotherapy group compared with only 23 per cent in the surgery-alone group. The P value associated with this benefit was 0·009.

This P value can be interpreted as follows. If perioperative chemotherapy plus surgery results in equivalent overall survival to that of surgery alone (if H_{0} is true), there is only a 0·9 per cent (P = 0·009) chance of observing a survival difference as large as, or larger than, the one observed. In other words, if both treatment arms are equally effective, it is unlikely that the reported overall survival difference would have occurred by chance alone. If the P value is small, the probability of obtaining the observed difference by chance alone is low, and one may assume that H_{0} can be rejected. Conversely, if the P value is large, it is conceivable that the observed results may plausibly be due to the play of chance and, in consequence, H_{0} cannot be dismissed.
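This logic can be illustrated with a short simulation. The arm size and pooled survival rate below are hypothetical, chosen purely for illustration and not taken from the trial; the sketch simply counts how often a difference at least as large as the observed one arises when H_{0} is true.

```python
import random

random.seed(1)

def simulate_p_value(n_per_arm, pooled_rate, observed_diff, n_sim=20_000):
    """Estimate the probability, under H0 (both arms share the same
    pooled survival rate), of a between-arm difference at least as
    large as the one observed."""
    extreme = 0
    for _ in range(n_sim):
        a = sum(random.random() < pooled_rate for _ in range(n_per_arm))
        b = sum(random.random() < pooled_rate for _ in range(n_per_arm))
        if abs(a - b) / n_per_arm >= observed_diff:
            extreme += 1
    return extreme / n_sim

# Hypothetical trial: 100 patients per arm, pooled 5-year survival
# 30 per cent, observed difference of 13 percentage points.
p = simulate_p_value(100, 0.30, 0.13)
print(f"simulated two-sided P value: {p:.3f}")
```

With these made-up numbers the simulated P value lands in the region of the conventional 0·05 threshold, showing directly that the P value is nothing more than the frequency with which chance alone reproduces the observed difference.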

Although the concept of the P value as a measure of statistical significance seems straightforward, there are a number of pitfalls that must be borne in mind. First, a non-significant P value does not necessarily demonstrate that H_{0} is true. Large non-significant P values can occur due to small sample size (see Caveat 6). A non-significant P value in such circumstances indicates only that the evidence is not strong enough to reject H_{0}4, 8. Second, statements such as ‘the association was found to be statistically significant (P < 0·050)’ are commonplace. Such statements are imprecise and should be avoided. For example, a significant P value could be 0·049 or 0·00001. It is far more informative to provide the exact value9. Finally, the P value depends on a variety of factors: the size of the difference between the study groups, the scatter of the data (the standard deviation) and the sample size. The larger the difference between the study groups, the smaller the standard deviation and the larger the sample size, the smaller (more significant) the P value will be. Considering the role that these factors play in determining the P value, the ‘canonical’ benchmark of 0·050 should not be used as a default cutoff between relevant and unimportant results. The statement that ‘the P value was not significant; therefore, the study findings are not important’ is overly simplistic and may be altogether mistaken, as we will discuss in greater detail in Caveat 6.
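The dependence of the P value on sample size can be made concrete with a two-proportion z-test. This is a normal-approximation sketch; the event rates and sample sizes are hypothetical, and the same 5-percentage-point difference is tested at three sample sizes.

```python
import math

def two_proportion_p(p1, p2, n_per_arm):
    """Two-sided P value from a pooled two-proportion z-test
    (normal approximation)."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # twice the upper-tail normal probability

# The same 20 versus 25 per cent difference at three sample sizes:
for n in (50, 500, 5000):
    print(n, round(two_proportion_p(0.20, 0.25, n), 4))
```

The identical clinical difference is far from significant with 50 patients per arm, borderline with 500 and overwhelmingly significant with 5000, which is precisely why the 0·050 benchmark should not be read as a divider between relevant and irrelevant findings.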

Caveat 6: Statistical significance does not necessarily equal clinical relevance (and vice versa)3, 4

As noted above, the magnitude of the P value depends, among other factors, on sample size. If the sample size is sufficiently large, even tiny differences between study groups will become statistically significant. The question, however, is whether these small differences are of clinical relevance. Statistically significant results may prove trivial in the context of a real-world situation. On the other hand, even though the P value might not be statistically significant (for example, owing to small sample size), differences found between the study groups may be clinically important.

The distinction between statistical significance and clinical relevance becomes even more important with the increasing use of administrative databases for medical research purposes3. The author has participated in an analysis of a database containing more than 230 000 patients with breast cancer10, and some administrative databases (such as the Nationwide Inpatient Sample and US State databases) may contain several million patients. It is clear that, with such huge patient numbers, the tiniest differences in outcome become statistically significant. However, these statistically significant differences may be of no clinical relevance whatsoever.

Consider a randomized clinical trial comparing two different chemotherapy regimens after potentially curative resection of colonic cancer. Hypothesize that patients in arm 1 have a 5-year overall survival rate of 60 per cent compared with 40 per cent in arm 2. Such a huge difference is obviously of enormous clinical relevance. However, if fewer than 172 patients are enrolled into the trial (sample size computations are based on α of 0·05, β of 0·2 and an accrual period of 1 year), the difference will not be statistically significant. A clinically important difference that fails to reach statistical significance is not thereby disproven. In other words, the study has not ruled out the existence of a clinically important benefit. However, neither has it proven any benefit (see Caveat 8).
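The flavour of such a sample size computation can be sketched with the standard two-proportion approximation. The 172-patient figure quoted above comes from a time-to-event calculation that accounts for accrual; the cruder formula below ignores censoring and accrual time, so it yields a number of similar magnitude rather than an identical one.

```python
import math

def n_per_arm(p1, p2):
    """Approximate patients per arm for detecting a difference between
    two proportions (normal approximation; two-sided alpha 0.05,
    power 0.80; ignores censoring and accrual time)."""
    z_alpha = 1.959964  # two-sided 5 per cent critical value
    z_beta = 0.841621   # corresponds to 80 per cent power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# 60 versus 40 per cent 5-year overall survival, as in the example:
print(n_per_arm(0.60, 0.40), "patients per arm")
```

The approximation asks for roughly 95 patients per arm, illustrating why a trial far smaller than this cannot hope to declare even a 20-percentage-point survival difference statistically significant.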

Caveat 7: Beware of overinterpreting survival curves4

A survival curve is a graphical presentation of time-to-event data. The term survival curve is misleading, as not only time to death but time to any event may be displayed graphically—tumour recurrence, extubation after surgery, rejection of a kidney transplant, and so on. For this reason, the appellation ‘Kaplan–Meier curve’ may be preferable. The starting point of a Kaplan–Meier curve, or time zero (T_{0}), marks the beginning of the period of observation for the event under investigation. For instance, T_{0} can be the time point of randomization, the day of operation or the start of adjuvant chemotherapy. By definition, the curve starts at 100 per cent at T_{0}, as no patient has yet experienced any event (e.g. death, if overall survival is the primary outcome). Each step down on a survival curve represents the occurrence of an event. If two patients experience the event under investigation at the same time, the step down is twice as large as for one patient; if three patients experience the event, the step down is three times as large, and so on.

In most survival analyses, some patients are ‘lost’ (data regarding the outcome in question are not available) before the event occurs, or before follow-up is complete for the study. This phenomenon is common in investigations that recruit patients over many years. As a consequence, the length of follow-up varies greatly among the enrolled population. Those who do not experience the event of interest are referred to as censored, either during the study (for example, due to loss to follow-up) or at the end of the study11. Censoring allows such patients to provide valuable information despite not having experienced the outcome under investigation. When a patient is censored, the survival curve remains horizontal. However, the number of patients at risk for the event decreases.
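The mechanics described above (a step down at each event; censored patients leaving the risk set without a step) can be written as a minimal Kaplan–Meier estimator. The follow-up times below are invented purely for illustration.

```python
def kaplan_meier(times, events):
    """Return (time, survival) pairs of the Kaplan-Meier estimate.
    events[i] is True if patient i experienced the event at times[i],
    False if the patient was censored at that time."""
    at_risk = len(times)
    survival = 1.0
    curve = [(0, 1.0)]  # at T0 the curve starts at 100 per cent
    # events are processed before censorings at tied times
    for t, event in sorted(zip(times, events), key=lambda te: (te[0], not te[1])):
        if event:
            survival *= 1 - 1 / at_risk  # step down at each event
            curve.append((t, survival))
        # a censored patient leaves the risk set without a step down
        at_risk -= 1
    return curve

# Hypothetical follow-up in months; False marks a censored patient
times = [3, 5, 5, 8, 12, 16]
events = [True, True, False, True, False, True]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Each event triggers a step of size 1 divided by the number of patients still at risk; because the censored patients shrink the risk set, steps late in follow-up are larger, which is exactly why the right-hand end of a Kaplan–Meier curve is increasingly unreliable.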

It is important that survival curves clearly indicate when a patient is censored. This can be achieved by use of tick marks (Fig. 1). The display of censored subjects enables the reader to deduce how the number of patients at risk has decreased over time. It is of prime importance to realize that the number of patients upon which the Kaplan–Meier estimates are computed decreases steadily as length of follow-up increases. In other words, estimates of survival are associated with increasing uncertainty as the Kaplan–Meier curve progresses from left to right12. Often, the far end of a Kaplan–Meier curve may represent only a few patients out of the original population under study. An alternative to using tick marks is to display the number of patients remaining in the study below the Kaplan–Meier curve at regular intervals, such as every 12 months. Again, it is crucial that the reader is able to assess how the patient sample has decreased over time; this allows an appreciation of the increasing unreliability of the displayed survival estimates.

One final point to remember is that the survival curve will give accurate estimates of the patients who experience an endpoint at any time only if a censored patient has the same risk for this outcome as those who remain in the study. For example, censoring deaths in a study that examines myocardial infarction may provide misleading results.

Caveat 8: Beware of underpowered clinical trials and type II errors3, 4, 8

Power is the probability that a study will find a statistically significant result (that the null hypothesis will be rejected) when the overall populations from which the study samples are drawn are truly different. A type II or β error represents a circumstance in which the study results lead to the erroneous conclusion that no difference exists between the groups when a difference does, in fact, exist. β, the false-negative rate, is the complement of the power of a study (power = 1 − β).

Adequate power in a clinical trial is critical, as investigators and funding agencies must be confident that an existing difference in the overall patient population can be detected with the proposed study sample. The power of a study is inextricably linked to sample size: the larger the sample size, the greater the power. This is important, as a randomized clinical trial may fail to answer the research question definitively if the sample size is too small. Clinical trials that neither find a statistically significant difference between outcomes nor enrol a sufficiently large number of patients should be considered inconclusive (not negative) and the results should be interpreted, at best, as hypothesis generating rather than hypothesis testing8. Unfortunately, a plethora of studies exist in the medical literature that were clearly underpowered but whose authors concluded that there was no statistically significant difference in outcome13, 14; this is an erroneous and potentially harmful determination. For correct interpretation of an investigation that does not find a statistically significant difference between the groups, it is important to check whether a 95 per cent confidence interval (a range of values that will contain the true value in 95 per cent of all cases) was computed. If the entire range of values in the 95 per cent confidence interval is of no clinical importance, a strong negative conclusion can be made. Conversely, if the 95 per cent confidence interval contains values that are of clinical relevance, the study should be considered inconclusive.
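The confidence-interval check described above can be sketched with a simple Wald interval for a difference between two proportions; the event counts below are hypothetical.

```python
import math

def diff_ci_95(events1, n1, events2, n2):
    """Wald 95 per cent confidence interval for the difference
    between two proportions (normal approximation)."""
    p1, p2 = events1 / n1, events2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - 1.96 * se, diff + 1.96 * se

# Hypothetical small trial: 12/40 versus 8/40 complications
lo, hi = diff_ci_95(12, 40, 8, 40)
print(f"difference 10 per cent, 95 per cent CI {lo:.1%} to {hi:.1%}")
```

Here the interval runs from roughly −9 to +29 per cent: it contains both negligible and clinically important values, so such a trial would be inconclusive rather than negative.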

The reader of the surgical literature must also beware of post hoc (after the study has been completed) sample size calculations. These may be misleading as they often involve quite unreliable assumptions (for example, no heterogeneity between different medical centres, or unrealistic guesses about patterns of dropout or censoring). However, the most important reason for viewing post hoc sample size computations with scepticism is simply that providing pretrial estimates of what might be observed in a study if the actual observations from the study were already known is, of course, inherently flawed. This is like reviewing prematch form when you already know the final score.

Caveat 9: Understand the difference between absolute and relative risk reduction

The relative risk (also known as the risk ratio) is the likelihood that the outcome of interest will occur in the group with the risk factor divided by the likelihood of that outcome occurring in the group without the risk factor. If two interventions are compared, the relative risk is defined as the risk of experiencing a negative outcome in group 1 divided by the risk of experiencing this outcome in group 2.

When considering relative versus absolute risk reduction, one might imagine a paper stating that laparoscopic appendicectomy reduces the risk of postoperative wound infection by 50 per cent (a relative risk of 0·5 for developing postoperative wound infection compared with open surgery). Imagine also that this result is correct. But how important is such a finding? The answer is that it depends on the prevalence of the complication. If the wound complication rate after open appendicectomy is 20 per cent, a 50 per cent reduction is undoubtedly important, as it would indicate an absolute risk reduction of 10 per cent (from 20 to 10 per cent) in postoperative wound infection. Conversely, if the wound infection rate after open appendicectomy is 2 per cent, the same 50 per cent relative risk reduction would result in a reduction to 1 per cent (an absolute risk reduction of only 1 per cent), a decrease of minimal clinical relevance. Clearly, a large relative risk reduction can be associated with a very small absolute risk reduction.

The number needed to treat (NNT) is a helpful tool in assessing the importance of a difference between two interventions. It is the number of patients who must be treated to prevent the occurrence of one adverse outcome. The NNT is the inverse of the absolute risk reduction. In the first scenario described above, the NNT is 10 (absolute risk reduction of 10 per cent or 0·1; NNT 1/0·1). In other words, ten patients would need to undergo laparoscopic surgery to prevent one postoperative wound infection. In the second scenario, the NNT is 100 (absolute risk reduction of 1 per cent or 0·01; NNT 1/0·01), meaning that 100 patients would need to undergo laparoscopic surgery to prevent one postoperative wound infection.
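The arithmetic of both scenarios can be written out directly (a sketch using the rates from the example above):

```python
def risk_summary(risk_control, risk_treatment):
    """Relative risk, absolute risk reduction and number needed to treat."""
    rr = risk_treatment / risk_control
    arr = risk_control - risk_treatment
    nnt = 1 / arr
    return rr, arr, nnt

# Scenario 1: wound infection falls from 20 to 10 per cent
print(risk_summary(0.20, 0.10))  # relative risk 0.5, ARR 0.10, NNT 10
# Scenario 2: wound infection falls from 2 to 1 per cent
print(risk_summary(0.02, 0.01))  # relative risk 0.5, ARR 0.01, NNT 100
```

The relative risk is identical in both scenarios, yet the number needed to treat differs tenfold, which is the whole point of the caveat.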

Overview

A basic knowledge of statistics and study design is important for clinicians if they are to introduce new advances appropriately into clinical practice. It is hoped that this overview of some important caveats will provide a useful resource for surgeons in the interpretation of surgical literature.

Acknowledgements

The author thanks Mr Jonathan McCall for reading the manuscript and making numerous valuable suggestions.