When releasing data to the public, data disseminators are typically required to protect the confidentiality of survey respondents' identities and attribute values. Removing direct identifiers such as names and addresses is generally not sufficient to eliminate disclosure risks, so the data must be altered before release to limit the risk of unintended disclosures. When intense data alteration is needed to ensure protection, the quality of the released data can be seriously degraded. This article reviews a disclosure limitation approach called synthetic data, in which values of confidential data are replaced with values simulated from statistical models. Theoretical and empirical investigations have shown that synthetic data approaches have the potential to yield higher data quality than other disclosure limitation procedures, particularly when intense data alteration is necessary. The article discusses the two main variants of synthetic data approaches, full synthesis and partial synthesis, and covers synthetic data generation and disclosure risk assessment.
WIREs Comp Stat 2011, 3:450–456. DOI: 10.1002/wics.174