Bronwyn Loong was a doctoral student in the Department of Statistics at Harvard University when this research was conducted.
Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS
Article first published online: 13 MAY 2013
Copyright © 2013 John Wiley & Sons, Ltd.
Statistics in Medicine
Volume 32, Issue 24, pages 4139–4161, 30 October 2013
How to Cite
Loong, B., Zaslavsky, A. M., He, Y. and Harrington, D. P. (2013), Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS. Statist. Med., 32: 4139–4161. doi: 10.1002/sim.5841
- Issue published online: 1 OCT 2013
- Manuscript Accepted: 8 APR 2013
- Manuscript Revised: 2 APR 2013
- Manuscript Received: 11 NOV 2012
- National Cancer Institute (NCI). Grant Number: U01 CA093344
- NCI. Grant Number: U01 CA093332
- data confidentiality
- data utility
- disclosure risk
- multiple imputation
- synthetic data
Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents’ identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and assessment of identification disclosure risk. We evaluate data utility by replicating published analyses and comparing results obtained from the original and synthetic data, and we discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are best suited to preliminary data analysis. These methods address the requirement to share data in clinical research without compromising confidentiality.
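The abstract refers to inferential methods for partially synthetic data without stating them. As a minimal sketch, the following assumes the standard combining rules for partially synthetic data (Reiter, 2003): given m synthetic datasets, each yielding a point estimate q_i and a variance estimate u_i, the combined estimate is the mean of the q_i, and its variance is T_p = u_bar + b_m / m, where b_m is the between-dataset variance of the q_i and u_bar is the mean of the u_i. The function name and the example numbers are hypothetical and are not taken from the CanCORS analyses.

```python
import math
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic datasets.

    Hypothetical helper. Assumes the standard combining rules for
    partially synthetic data (Reiter, 2003); nothing here is specific
    to the CanCORS imputation models.

    estimates: point estimates q_i, one per synthetic dataset
    variances: variance estimates u_i, one per synthetic dataset
    Returns (q_bar, t_p, df).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = mean(estimates)    # combined point estimate
    b_m = variance(estimates)  # between-dataset variance (divisor m - 1)
    u_bar = mean(variances)    # average within-dataset variance

    # Total variance for partial synthesis: no (1 + 1/m) inflation factor,
    # unlike the rule used for multiple imputation of missing data.
    t_p = u_bar + b_m / m

    # Degrees of freedom of the reference t distribution.
    if b_m > 0:
        df = (m - 1) * (1 + u_bar / (b_m / m)) ** 2
    else:
        df = math.inf

    return q_bar, t_p, df


# Illustrative numbers only: three synthetic datasets.
q_bar, t_p, df = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, round(t_p, 4), df)
```

An interval estimate would then take the form q_bar ± t × sqrt(t_p), with t a quantile of the t distribution on df degrees of freedom. Comparing such intervals with those computed from the original data is one way to carry out the kind of data utility evaluation the abstract describes.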