Missing data are a problem that occurs widely in medical research and are difficult to avoid. The aim of this commentary is to deal with their implications in clinical trials. In analysing the results of a randomised controlled trial, it is important to ensure that the main benefit of randomisation has not been compromised: namely, that the treatment arms (experimental and control) remain comparable in everything except the interventions being compared.[1] Missing outcome data are a serious problem because they can bias the findings of a clinical trial, affecting the direction of the estimated effect; data loss also reduces precision, making the results less reliable.

A good definition of missing data is given by Little, ‘values that are not available and that would be meaningful for analysis if they were observed’.[2] As our interest is in trials, we concentrate on missing outcome data, although methods have been developed to handle missing covariates as well. The bias caused by missing data may depend on the reason why the data are missing. We therefore have to consider the mechanism by which data come to be lost.

Little and Rubin defined three patterns of missing data.[3]

  1. Missing completely at random (MCAR), in which the available data are just as representative of the population from which they were drawn as the complete data set would have been.
  2. Missing at random (MAR), in which, given the values that we have observed, the probability of data being missing does not depend on the unobserved data.
  3. Missing not at random (MNAR), in which the probability of missing data depends on the unobserved data.

This is the standard nomenclature. To avoid misunderstanding, we suggest that ‘conditionally at random’ (which Rubin terms MAR) might be less misleading than ‘at random’, which could be interpreted to mean ‘completely at random’.

As an example of these patterns, consider questionnaires administered at an ambulatory clinic to women experiencing pelvic pain. If some questionnaires are randomly lost, the pattern of loss is MCAR. If women who are disabled are transported directly to hospital, bypassing the ambulatory clinics where pelvic pain is being measured, so that data cannot be obtained from some of them, and the probability of non-attendance is predictable from the data we have, the pattern is MAR. However, the reason for missing data may lie outside the clinic, and so depend on unobserved variables. This might occur, for example, if some of the disabled women are depressed and less likely to attend, but this information has not been recorded; then the pattern is MNAR. The simulation sketched below illustrates the three mechanisms, and Table 1 summarises them.
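
To make the three mechanisms concrete, the sketch below simulates them in Python for the pelvic pain example. The variables, probabilities and effect sizes (a recorded disability indicator, an unrecorded depression indicator, a continuous pain score) are invented purely for illustration; the point is that an analysis conditioning on the observed data recovers the true mean under MCAR and MAR, but not under MNAR.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2024)
n = 10_000

# Hypothetical variables for the pelvic pain example (all values invented)
disability = rng.binomial(1, 0.3, n)                    # recorded in the data set
depression = rng.binomial(1, 0.2, n)                    # NOT recorded
pain = 3 + 2 * disability + 1.5 * depression + rng.normal(0, 1, n)

full = pd.DataFrame({"disability": disability, "pain": pain})

def complete_case_mean(d):
    """Mean pain score among women whose questionnaire was returned."""
    return d["pain"].mean()

def disability_adjusted_mean(d):
    """Complete case means within disability strata, weighted by the
    full-sample stratum proportions (valid under MAR, not under MNAR)."""
    stratum_means = d.groupby("disability")["pain"].mean()
    stratum_weights = d["disability"].value_counts(normalize=True)
    return (stratum_means * stratum_weights).sum()

# MCAR: questionnaires lost completely at random
mcar = full.copy()
mcar.loc[rng.random(n) < 0.2, "pain"] = np.nan

# MAR: probability of a lost questionnaire depends only on recorded disability
mar = full.copy()
mar.loc[rng.random(n) < np.where(disability == 1, 0.4, 0.1), "pain"] = np.nan

# MNAR: missingness also depends on depression, which is not in the data set
mnar = full.copy()
mnar.loc[rng.random(n) < 0.1 + 0.3 * depression, "pain"] = np.nan

print("true mean pain:", round(full["pain"].mean(), 2))
for label, d in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(label,
          "complete case:", round(complete_case_mean(d), 2),
          "disability-adjusted:", round(disability_adjusted_mean(d), 2))
```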

Table 1. Types of missing data patterns and implications for clinical trial analysis
Illustrative example used throughout: a study of the intensity of pelvic pain and related factors, with data collected by questionnaires given to women at a clinic.

MCAR: missing completely at random
  Definition(a): the probability of a particular value being missing is completely independent of both the observed data and the unobserved data.
  Example: some questionnaires are randomly lost by accident, or there are random errors in entering the data.
  Bias: none.
  Implications for the analysis in clinical trials: complete case analysis; loss of power, imprecision. The assumption cannot be tested.

MAR: missing (conditionally) at random
  Definition(a): the probability of a particular value being missing depends only on the observed data.
  Example: disabled women experience difficulty in attending the clinics where the questionnaires are administered; the women's disability information is recorded.
  Bias: none, because the missing data do not depend on the unobserved data.
  Implications for the analysis in clinical trials: the assumption required for most types of analyses (e.g. multiple imputation); the assumption can be tested.

MNAR: missing not at random
  Definition(a): the probability of a particular value being missing depends on the unobserved data.
  Example: some of the disabled women are depressed and less likely to attend, but this information has not been recorded.
  Bias: yes, because the missing data depend on unobserved values.
  Implications for the analysis in clinical trials: it is very rare to know the appropriate model for this data loss mechanism.

(a) White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med 2011;30:377–399.

The Consolidated Standards of Reporting Trials (CONSORT) statement was designed to improve the quality of reporting of randomised controlled trials.[4] It comprises an evidence-based set of recommendations for reporting RCTs that includes a 25-item checklist and a flow diagram. The checklist deals with the design of the trial and the analysis and interpretation of the results. The flow diagram displays the progress of participants through the stages of enrolment, intervention allocation, follow-up and data analysis.

A flowchart enables us to keep track of missing data at each stage of the trial, and so makes clear the problem that we will have to deal with in the analysis. People whose participation ceased after allocation are unlikely to be representative of all participants in the study. Knowing the number of participants who did not receive the intervention as allocated, or did not complete treatment, enables the reader to assess to what extent the estimated effect of therapy might be biased.

Therefore, our first recommendation is to present a participant flowchart following the CONSORT statement. In Figure 1 we show an example of a clinical trial flowchart recently published in BJOG. Such a flowchart should give a more comprehensive picture of the missing data than a single summary statistic.

Figure 1. Example of the CONSORT flow of participants through each stage of the trial (Lakeman et al. BJOG 2012;119:1473–1482).

The ‘intention to treat’ (ITT) principle means comparing patients in the groups to which they were originally randomly assigned.[5] This is generally interpreted as including all patients, regardless of whether they actually satisfied the entry criteria, whether they actually received the allocated treatment, and whether they subsequently withdrew or deviated from the protocol. This maintains the comparability of the groups, apart from random variation, which is the reason for randomisation.

As White et al.[6] point out, it is not clear how the principle can be applied when outcome data are missing: we cannot tell whether the loss of participants is attributable to chance, to an improvement or worsening of their condition, to side effects, or to some other characteristic of the participant. The traditional way to deal with missing data is complete case analysis, in which patients with missing data are simply excluded. This is a sensible method if the loss mechanism is MCAR: under MCAR the results are unbiased, although the statistical power is reduced. MCAR is, however, a strong assumption.
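
In code, complete case analysis amounts to dropping participants whose outcome was never observed while still grouping them by the arm to which they were randomised; the data frame and column names below are hypothetical, for illustration only.

```python
import pandas as pd

# Hypothetical trial data: 'arm' is the randomised allocation and
# 'good_outcome' is 1/0, recorded as None/NaN when the outcome is missing
trial = pd.DataFrame({
    "arm": ["experimental"] * 6 + ["control"] * 6,
    "good_outcome": [1, 0, None, 1, 1, None, 0, None, 1, 0, None, 1],
})

# Complete case analysis: keep only participants with an observed outcome,
# but compare them in the groups to which they were originally assigned
complete_cases = trial.dropna(subset=["good_outcome"])
print(complete_cases.groupby("arm")["good_outcome"].agg(["mean", "size"]))
```

Under MCAR this gives an unbiased estimate with reduced precision; under MAR or MNAR it may be biased, which is why the approaches described below are needed.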

Our second recommendation is therefore to perform a main analysis of all observed data that is valid under a plausible assumption about the missing data.

Other, more complex, methods for dealing with missing data, such as weighting and imputation procedures, are based on the MCAR or MAR assumption. Weighting procedures consist of weighting every observed value by the inverse of its probability of being observed, given the covariates.[7]
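
As a rough sketch of the weighting idea (under the MAR assumption), one can model the probability that the outcome is observed from baseline covariates and then weight each complete case by the inverse of that fitted probability. The column names (age, baseline_pain) and the logistic model below are assumptions for illustration, not part of any particular trial.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def ipw_proportion_good(trial: pd.DataFrame) -> pd.Series:
    """Inverse probability weighted proportion of good outcomes per arm.

    Expects columns 'arm', 'good_outcome' (1/0, NaN when missing) and
    hypothetical baseline covariates 'age' and 'baseline_pain'.
    """
    d = trial.copy()
    d["observed"] = d["good_outcome"].notna().astype(int)

    # Model the probability of the outcome being observed, given covariates
    observation_model = smf.logit("observed ~ age + baseline_pain + C(arm)",
                                  data=d).fit(disp=0)
    d["p_observed"] = observation_model.predict(d)

    # Each complete case stands in for 1 / P(observed) similar participants
    cases = d[d["observed"] == 1]
    return cases.groupby("arm").apply(
        lambda g: np.average(g["good_outcome"], weights=1.0 / g["p_observed"])
    )
```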

Imputation methods include single and multiple imputation. Single imputation fills in (imputes) each missing value in the data set once. In multiple imputation, each missing value is imputed several times using values sampled from models for the missing data conditional on all relevant observed data, and the results obtained from the completed data sets are then combined appropriately.[7, 8]
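
A minimal sketch of multiple imputation for a continuous outcome is shown below, using scikit-learn's IterativeImputer (with posterior sampling) to create several completed data sets and Rubin's rules to pool the treatment effect estimates. The column names, coding of 'arm' and the choice of m = 20 imputations are illustrative assumptions; dedicated tools (for example the mice package in R, or the MICE routines in statsmodels) offer fuller implementations.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_imputed_effect(trial: pd.DataFrame, m: int = 20):
    """Multiply imputed difference in mean outcome (experimental minus control).

    Expects numeric columns only: 'arm' coded 0/1, baseline covariates,
    and a continuous 'outcome' containing the missing values.
    Returns the pooled estimate and its standard error by Rubin's rules.
    """
    estimates, within_vars = [], []
    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(trial), columns=trial.columns)

        exp = completed.loc[completed["arm"] == 1, "outcome"]
        ctl = completed.loc[completed["arm"] == 0, "outcome"]
        estimates.append(exp.mean() - ctl.mean())
        within_vars.append(exp.var(ddof=1) / len(exp) + ctl.var(ddof=1) / len(ctl))

    q_bar = np.mean(estimates)                  # pooled point estimate
    w_bar = np.mean(within_vars)                # average within-imputation variance
    b = np.var(estimates, ddof=1)               # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b         # Rubin's rules
    return q_bar, np.sqrt(total_var)
```

Because the between-imputation variance enters the standard error, multiple imputation reflects the extra uncertainty caused by the missing values, which single imputation ignores.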

Our assumptions about the mechanism of loss (typically, MAR) might, however, be mistaken. We need to find out whether the results are robust to the assumptions we make. To this end, we may vary the assumption about randomness and see whether the findings change. White et al. suggest that if the main analysis assumes similarity between the groups who are and are not lost to follow-up, a good sensitivity analysis might assume that the group lost to follow-up have systematically worse outcomes.[6] With binary outcomes, the best- and worst-case scenarios can be examined. Table 2 contains a hypothetical example to illustrate the necessity of taking missing data into account in the analysis; as can be seen, different scenarios of missing data can make a big difference to the result.

Table 2. Hypothetical scenarios for missing data and the results of various methods to deal with data loss

(a) Actual results with complete case analysis
              Good outcome   Bad outcome   Missing data   Total   Proportion good
Experimental  49             21            30             100     0.70
Control       35             35            30             100     0.50

(b) Cases with missing data had good outcomes
              Good outcome   Bad outcome   Total   Proportion good
Experimental  79             21            100     0.79
Control       65             35            100     0.65

(c) Cases with missing (control) data had good outcomes; cases with missing (experimental) data had bad outcomes
              Good outcome   Bad outcome   Total   Proportion good
Experimental  49             51            100     0.49
Control       65             35            100     0.65

(d) Cases with missing (experimental) data had good outcomes; cases with missing (control) data had bad outcomes
              Good outcome   Bad outcome   Total   Proportion good
Experimental  79             21            100     0.79
Control       35             65            100     0.35

(e) Cases with data missing completely at random
              Good outcome   Bad outcome   Total   Proportion good
Experimental  70             30            100     0.70
Control       50             50            100     0.50

(f) Cases with data missing not at random
              Good outcome   Bad outcome     Total   Proportion good
Experimental  49 + x         21 + (30 − x)   100     0.49 + 0.01x
Control       35 + y         35 + (30 − y)   100     0.35 + 0.01y

Notes:
1. In (a) we have the same proportion of good results as in (e), i.e. the estimate is unbiased; however, because the effective sample is smaller, the confidence interval for it will be about 11% greater.
2. As an example, if x = 10 but y = 20 (because depression is less well controlled in the controls and causes non-participation), the proportions in the farthest right column for experimental and control subjects will be 0.64 and 0.55, a less convincing result than in (a).
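
The scenarios in Table 2(b)–(e) are simple arithmetic and can be checked with a few lines of code: starting from the counts observed in (a), the 30 missing participants in each arm are assigned good or bad outcomes according to the assumption being examined (the counts below are taken directly from the table).

```python
def proportion_good(good, bad, missing, missing_good):
    """Proportion with a good outcome if `missing_good` of the missing
    participants are assumed to have had a good outcome."""
    return (good + missing_good) / (good + bad + missing)

# Observed counts from Table 2(a): 30 outcomes missing in each arm
experimental = dict(good=49, bad=21, missing=30)
control = dict(good=35, bad=35, missing=30)

# Assumed number of good outcomes among the 30 missing, per scenario
scenarios = {
    "(b) all missing had good outcomes":   (30, 30),
    "(c) experimental bad, control good":  (0, 30),
    "(d) experimental good, control bad":  (30, 0),
    "(e) missing completely at random":    (21, 15),
}

for label, (miss_good_exp, miss_good_ctl) in scenarios.items():
    p_exp = proportion_good(**experimental, missing_good=miss_good_exp)
    p_ctl = proportion_good(**control, missing_good=miss_good_ctl)
    print(f"{label}: experimental {p_exp:.2f}, control {p_ctl:.2f}")
```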

We therefore make a third recommendation, namely to perform a sensitivity analysis. Its results should be reported in the published study, with a statement about the robustness of the findings to changes in the assumptions.

Sterne et al.[9] offer guidelines for the reporting of analyses potentially affected by missing data, which are consistent with our framework. Such information could be included in an appendix to a paper. They recommend reporting the number of missing values for the variables of interest, or the number of cases with complete data for each important component of the analysis. They recommend giving reasons for missing data, particularly in terms of other variables, and describing any important differences between individuals with complete and incomplete data. Where applicable, these could support the assumptions upon which missing data are handled when performing ITT, as well as the sensitivity analysis afterwards. In the latter connection they recommend giving details of the modelling used in multiple imputation.

Although our discussion above is mainly about reporting, it is important to mention the design stage of a study. When doing a sample size calculation it may be worthwhile to anticipate drop-outs by increasing the recommended sample size by an appropriate amount. The inflation factor may be suggested by experience with research in related areas. In the conduct of the study it is essential to ensure that all efforts are made to minimise losses.
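
As a purely illustrative calculation (the required size and dropout rate below are invented), dividing the calculated sample size by the expected proportion retained gives the number to recruit:

```python
import math

n_required = 200          # per arm, from the power calculation (illustrative)
expected_dropout = 0.15   # anticipated proportion lost to follow-up (illustrative)

# Recruit enough participants so that about n_required per arm remain after losses
n_to_recruit = math.ceil(n_required / (1 - expected_dropout))
print(n_to_recruit)       # 236 per arm
```

Note that this inflation preserves power only if the losses are essentially non-informative; it does not protect against the bias discussed above.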

We made a brief search of the BJOG website for recent articles that mentioned ‘missing data’ explicitly, using this expression in the search field. This yielded five articles published in the current millennium, albeit not connected with clinical trials. Of these, the three earlier ones mentioned the problem but did not attempt a formal statistical treatment,[10-12] whereas the two more recent ones did.[13, 14] Notwithstanding the small number, if these articles represent the wider picture, the trend is welcome, certainly for reporting the results of clinical trials. Our recommendations are summarised in Box 1.

Box 1. Summary of recommendations

Plan sample size, taking losses to follow-up into consideration

Inflate the sample size by taking into account the potential for losses and take all measures to avoid missing data during the study.

Include a participant flowchart that shows all losses

Information about the flow of participants enables missing data to be identified.

Intention-to-treat analysis adjusted for missing data should be the primary analysis

The analysis should be based on reasonable assumptions about the missing data; complete case analysis fails to meet the intention-to-treat principle.

Include a sensitivity analysis

See whether the results are robust to different assumptions.

Contribution to authorship

All authors actively participated in the preparation of this article and agreed on the order shown. All shared equally in the conception and planning. MJ wrote the initial drafts of the main text and covering letter (and the revised versions), which were then sent to the co-authors (AR and JZ), and drafts were extensively revised in light of correspondence between the three authors. AR was responsible for the initial draft of Table 1.

Acknowledgement

The initiative came from Prof K Khan, who provided helpful advice and feedback on an earlier draft.

References
