Correspondence to: Dr Kristie Carter, Department of Public Health, University of Otago, PO Box 7343, Wellington, New Zealand; e-mail: email@example.com
Background: Most research is affected by differential participation, where individuals who do not participate have different characteristics to those who do. This is often assumed to induce selection bias. However, selection bias only occurs if the exposure-outcome association differs for participants compared to non-participants. We empirically demonstrate that selection bias does not necessarily occur when participation varies in a study.
Methods: We used data from three waves of the longitudinal Survey of Family, Income and Employment (SoFIE). We examined baseline associations of labour market activity and education with self-rated health using logistic regression in five participation samples: A) the original sample at year one (n=22,260); B) those remaining in the sample (n=18,360); C) those (at year 3) consenting to data linkage (n=14,350); D) drop outs over three years (n=3,895); and E) those who dropped out or did not consent (n=7,905).
Results: Loss to follow-up was more likely among lower socioeconomic groups and those with poorer health. However, for labour market activity and education, the odds of reporting fair/poor health were similar across all samples. Comparisons of the mutually exclusive samples (C and E) showed no difference in the odds ratios after adjustment for sociodemographic (participation) variables. Thus, there was little evidence of selection bias.
Conclusions: Differential loss to follow-up (drop out) need not lead to selection bias in the association between exposure (labour market activity and education) and outcome (self-rated health).
All studies suffer from non-participation in some form, be it due to missing data, initial non-response or, in longitudinal studies, loss to follow-up or attrition (caused by difficulty locating participants, refusals to continue, or death), which may lead to selection bias.1–4 Selection bias arises when the association between the exposure and outcome is different among those who participate, compared to those who do not.5 It has been shown that non-participation (defined as non-response and attrition) more often occurs in younger populations, people of lower socioeconomic position, those in less stable family or household types and those in poorer health.6,7 This will lead to biased estimates of population prevalence of sociodemographic and health characteristics.6,8,9 However, this does not necessarily cause selection bias of the association between the exposure and outcome, as is often argued in the literature (and through peer review).10,11 Rather, for selection bias to occur, we would need to observe differential participation by the joint distribution of the exposure and the outcome (i.e. exposure and outcome are dependent predictors of participation).
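The condition above can be illustrated numerically. The following is a minimal sketch and not part of the original analysis: the 2×2 counts and participation probabilities are hypothetical, chosen only to show that the odds ratio among participants equals the full-population odds ratio multiplied by the factor (s11 × s00)/(s10 × s01), where s_ij is the probability of participating for exposure level i and outcome level j. That factor is 1 whenever exposure and outcome affect participation separately (multiplicatively), and differs from 1 only when participation depends on their joint distribution.

```python
def odds_ratio(table):
    """Odds ratio from a 2x2 table keyed by (exposed, outcome), each 0 or 1."""
    return (table[1, 1] * table[0, 0]) / (table[1, 0] * table[0, 1])


def participants(table, p_participate):
    """Expected participant counts given cell-specific participation probabilities."""
    return {cell: n * p_participate[cell] for cell, n in table.items()}


# Hypothetical full population: (exposed, fair/poor health) -> count.
population = {(1, 1): 400, (1, 0): 600, (0, 1): 200, (0, 0): 1800}

# Participation is lower for the exposed and lower for those in poor health,
# but the two effects multiply: p = base * exposure_factor * outcome_factor.
separate = {(e, d): 0.9 * (0.8 if e else 1.0) * (0.7 if d else 1.0)
            for e in (0, 1) for d in (0, 1)}

# Participation is additionally halved for people who are both exposed and in
# poor health, so it now depends on the joint distribution.
joint = dict(separate)
joint[1, 1] = separate[1, 1] * 0.5

or_population = odds_ratio(population)
or_separate = odds_ratio(participants(population, separate))
or_joint = odds_ratio(participants(population, joint))

print(f"full population:        OR = {or_population:.2f}")
print(f"separate participation: OR = {or_separate:.2f}")
print(f"joint participation:    OR = {or_joint:.2f}")
```

In the "separate" scenario the prevalence of fair/poor health among participants falls from 20% to about 14%, yet the odds ratio is unchanged; only the "joint" scenario distorts it.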
It is often accepted that non-response and attrition in a survey automatically lead to selection bias and jeopardise the validity of results.10,11 However, most studies only compare the characteristics of responders with non-responders and do not investigate whether any difference in participation affects the exposure and outcome association of interest.7,12,13 In the few studies where this has been investigated, differential participation did not lead to selection (or non-response) bias in prevalence rates4,14,15 or baseline associations with future mortality.6 In a comparison of attrition in the English Longitudinal Study of Ageing and the US Health and Retirement Study, it was found that although there was differential attrition between the two surveys, this had no impact on the association between different health states and socioeconomic status.4
The objective of this paper is to demonstrate empirically, using longitudinal data, that selection bias need not occur for the analytical association of interest in the presence of differential participation and consent. We do so by examining the association between socioeconomic variables and self-rated health (SRH) at Wave 1 of a longitudinal study, for: A) Wave 1 original sample members; B) the balanced panel (those who participated in Waves 1, 2 and 3); C) the balanced panel restricted to those also consenting to data linkage in Wave 3; D) those lost to follow-up (who dropped out of the survey) by Wave 3; and E) those who dropped out or did not consent. We do not attempt to analyse factors of non-response or time-varying attrition in this article.
The Survey of Family, Income and Employment (SoFIE) is a longitudinal panel survey, administered by Statistics New Zealand (NZ), of approximately 11,500 households (77% initial response rate) with more than 22,000 adults (≥15 years) interviewed on an annual basis, starting in October 2002. Annual face-to-face interviews collected comprehensive information on demographics, households, income, employment, education and family composition, as well as SRH. In Wave 3, written consent was requested from participants to link their SoFIE record to cancer registrations and hospitalisations.
The current analyses utilise the first three waves of SoFIE data (Wave 1 to 7 data Version 1).
Cross-tabulations investigate the prevalence of demographic and socioeconomic variables in the four population restrictions (described above). To examine the effects of selection bias on the results, univariate and multivariable logistic regression analyses are used to examine the association of baseline (Wave 1) socioeconomic variables (labour market activity and education) with SRH in the four populations. SRH is dichotomised into good (excellent, very good, good) and poor (fair, poor) health. Multivariable analyses are adjusted for age, sex, ethnicity, and other socioeconomic factors (education, labour market activity and area deprivation). The Wald test is used to test for heterogeneity between the final model estimates in the mutually exclusive populations (Wave 3 responders and consenters [C] compared to Wave 3 drop out and non-consenters [E]).
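One common way to carry out such a heterogeneity test for two mutually exclusive (and therefore independent) samples is to compare the log odds ratios with a Wald z statistic. The sketch below is an assumption about the form of the test, not the SAS code used in this analysis; the function name and the example estimates are hypothetical. Standard errors are recovered from the 95% confidence limits.

```python
import math


def wald_heterogeneity(or_1, ci_1, or_2, ci_2):
    """Two-sided Wald test that two independent odds ratios are equal.

    Each estimate is an odds ratio plus its 95% confidence limits
    (lower, upper). Returns the z statistic and the p-value.
    """
    def log_se(ci):
        lower, upper = ci
        # A 95% CI on the log scale spans 2 * 1.96 standard errors.
        return (math.log(upper) - math.log(lower)) / (2 * 1.96)

    diff = math.log(or_1) - math.log(or_2)
    se_diff = math.sqrt(log_se(ci_1) ** 2 + log_se(ci_2) ** 2)
    z = diff / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Hypothetical estimates for two non-overlapping samples (not values from Table 3).
z, p = wald_heterogeneity(2.0, (1.7, 2.35), 2.3, (1.9, 2.8))
print(f"z = {z:.2f}, p = {p:.2f}")
```

The same test is often reported as a chi-square statistic on one degree of freedom, which is simply z squared. It is only valid when the two samples share no participants, which is why it would apply to samples C and E rather than to the nested samples A, B and C.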
All analyses are conducted on unit level data using SAS 8.2. All numbers of participants presented in the tables of this paper are rounded to the nearest multiple of five, with a minimum value of five, as per Statistics NZ confidentiality protocol; totals may therefore differ from the sum of the counts.
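The effect of this rounding rule on totals can be sketched as follows. This is an illustration only: the function name is hypothetical, the cell counts are invented, and the treatment of zero counts under the actual Statistics NZ protocol is not stated here, so the function assumes positive counts.

```python
def round_count(n):
    """Round a positive count to the nearest multiple of five, minimum five."""
    nearest = 5 * ((n + 2) // 5)  # integer arithmetic, no float ties possible
    return max(5, nearest)


# Hypothetical cell counts whose rounded values do not sum to the rounded total.
cells = [11, 11, 11]
rounded_cells = [round_count(n) for n in cells]
rounded_total = round_count(sum(cells))

print(rounded_cells, "sum to", sum(rounded_cells))
print("rounded total:", rounded_total)
```

Each cell is rounded independently of the total, so small discrepancies of this kind are expected wherever counts are added or subtracted.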
A total of 22,260 adult original sample members participated at Wave 1 (Figure 1). By Wave 3, 18,360 (82.5%) of the original sample members had been re-interviewed in both Waves 2 and 3. Therefore, 3,895 participants (17.5%) dropped out of, or did not respond in, Waves 2 and/or 3. Approximately 150 deaths occurred each year, which are included in the attrition numbers. Table 1 shows that attrition was greater among younger participants, those reporting an ethnicity other than NZ European, and those with poorer health status and lower socioeconomic status (unemployed, living in highly deprived areas). Of the 18,360 people who were interviewed at Wave 3, 14,350 (78.1%) consented to having their health records linked to their SoFIE records. This represents 64.5% of the original Wave 1 population.
Table 1. Demographic and socioeconomic characteristics of the original (Wave 1) and restricted adult SoFIE participants.
Table 2 presents the results of logistic regression analyses, regressing labour market activity and education on SRH in populations with different participation levels. For labour market activity and all levels of educational qualifications, the odds ratios of reporting fair/poor health compared to good health were similar for the original Wave 1 population, the balanced panel, and the Wave 3 consenters. For example, the univariate odds ratio of fair/poor SRH for those not working was 4.9 (95% CI 4.5–5.5) in the original Wave 1 population, 4.6 (95% CI 4.1–5.2) in the balanced panel and 4.6 (95% CI 4.0–5.2) in those who remained in the sample and consented to record linkage at Wave 3. Adjusting for age, sex, ethnicity and socioeconomic factors in the multivariable analysis reduced the associations between the socioeconomic variables and SRH.
Table 2. Logistic regression of the relationship between Wave 1 fair/poor self-rated health and socioeconomic variables for the Wave 1 adult population, the balanced panel and those who consented to data linkage.
Table 3 presents the results of the Wald test comparing the odds ratio for poor health in the most extreme (mutually exclusive) population groups: the balanced panel who consented at Wave 3 (C) and the population that dropped out or did not consent (E). The odds ratios for the two population subgroups were not statistically significantly different from each other (p=0.09). Once demographic and socioeconomic factors (potentially predicting drop out) were adjusted for, the odds ratios were more or less identical (p=0.72).
Table 3. Logistic regression of the relationship between Wave 1 fair/poor self-rated health and labour force non-participation for the balanced panel and consenting population at Wave 3 and those who dropped out or didn't consent.
Columns: Odds Ratio Univariate Fair/Poor; Wald Test p-value; Odds Ratio Multivariable* Fair/Poor; Wald Test p-value
Rows: Balanced panel + consented (C) (n=14,350); Drop out and non-consenters (E) (n=7,905); Full data Wave 1 population (n=22,260)
Note: All numbers of participants presented in the tables of this paper are rounded to the nearest multiple of five, with a minimum value of five, as per Statistics NZ confidentiality protocol, so totals may not add up to the sum of counts.
In this analysis of three years of longitudinal data, we have shown that people who continue to participate have different characteristics to those who drop out or do not consent to data linkage. By Wave 3 of SoFIE, 17.5% of participants had dropped out of, or did not respond in, Waves 2 or 3 of the survey, leading to a population that is older, more likely to be of NZ European ethnicity, and has better health and higher socioeconomic status (higher income, employed, living in less deprived areas). This is consistent with other research that has found that those consenting to participate in research and those who continue to respond to a survey differ from those who do not.13,16–18 However, despite this differential participation, we found little evidence of selection bias due to drop out or consent on the association between baseline socioeconomic measures and health, especially after adjustment for demographic and socioeconomic factors associated with participation. Other studies that have looked at the effect of non-participation or attrition on regression estimates have also found minimal impact on models of exposure-outcome associations.4,6,15,17,19,20
The odds ratios in our study became even more similar after adjusting for covariates, as these covariates were possibly predictors of participation. This is consistent with selection bias arising from common causes of exposure and participation, and common causes of outcome and participation (as opposed to exposure and outcome directly influencing participation), in which case adjustment for these common causes (or their proxies) will minimise any bias.3,5
In this analysis we do not attempt to analyse the initial household sampling non-response (23%). The SoFIE study was conducted by Statistics New Zealand, which is reflected in the high household response rate.21 During the survey, Statistics New Zealand made great attempts to track all original sample members. If they refused follow-up or could not be found and were not interviewed for two or more consecutive years then they were no longer tracked, leading to the increasing attrition (drop-out from the sample over time). We do not attempt to examine selection bias due to time-dependent attrition or patterns of missing data in this paper.
A number of longitudinal surveys have shown that the effect of time-varying attrition on longitudinal estimates is minimal.4,18–20,22 Some types of longitudinal analysis, such as fixed effects models, only use within individual changes over time to compute estimates so may be less prone to selection bias.
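The reasoning behind the fixed effects remark can be sketched with a toy linear example. This is a simplified illustration under stated assumptions, not an analysis of SoFIE data: the panel is hypothetical and noise-free, the function name is invented, and the outcome is continuous rather than binary. Because the within estimator uses only each person's deviations from their own mean, losing people whose overall level differs (for example, those in persistently poorer health) does not change the estimate.

```python
from collections import defaultdict


def within_estimate(rows):
    """Fixed effects (within) slope from (person_id, x, y) observations."""
    by_person = defaultdict(list)
    for person, x, y in rows:
        by_person[person].append((x, y))

    sxy = sxx = 0.0
    for obs in by_person.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            # Only deviations from the person's own mean contribute.
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


# Hypothetical panel: y = person_level + 2 * x, with very different person levels.
rows = [("a", 0, 10), ("a", 1, 12), ("a", 2, 14),
        ("b", 0, 50), ("b", 1, 52), ("b", 2, 54)]

slope_all = within_estimate(rows)
slope_after_dropout = within_estimate(rows[:3])  # person "b" lost to follow-up

print(slope_all, slope_after_dropout)
```

This protection covers only attrition related to time-invariant characteristics; drop out that depends on changes within a person over time could still bias such estimates.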
In conclusion, the use of longitudinal data allows us to examine the effect of non-response, attrition and consent to data linkage on the association between baseline socioeconomic factors and SRH. Although others have shown theoretically and empirically that differential participation has minimal effect on exposure-outcome associations, it is still common practice for researchers to make ill-considered assertions about selection bias based only on cross-sectional participation and on differences in participation by the exposure and the outcome considered separately. These results are valid for the SoFIE population only and for cross-sectional associations between labour market activity, education and health. We hope that this paper will encourage researchers to explicitly consider this bias in exposure-outcome associations, and to go beyond presenting only univariate comparisons of participation as an assessment of selection bias.
SoFIE-Health is primarily funded by the Health Research Council of New Zealand as part of the Health Inequalities Research Programme. Access to the data used in this study was provided by Statistics New Zealand in a secure environment designed to give effect to the confidentiality provisions of the Statistics Act, 1975. The results in this study and any errors contained therein are those of the author, not Statistics New Zealand.