Data masking for disclosure limitation
Version of Record online: 30 JUN 2009
Copyright © 2009 John Wiley & Sons, Inc.
Wiley Interdisciplinary Reviews: Computational Statistics
Volume 1, Issue 1, pages 83–92, July/August 2009
How to Cite
Duncan, G. and Stokes, L. (2009), Data masking for disclosure limitation. WIREs Comp Stat, 1: 83–92. doi: 10.1002/wics.3
- Issue online: 13 JUL 2009
Keywords
- confidentiality/data confidentiality;
- multiple imputation;
- record linkage;
- data swapping;
- synthetic data
Protecting confidentiality is essential to the functioning of systems for collecting and disseminating the data on individuals and enterprises that are necessary for evidence-based public policy formulation. Deidentification of records, defined as removing obvious identifiers such as name and address, is not sufficient to protect confidentiality. Microdata have characteristics that increase disclosure risk, such as the existence of identification files, geographical detail, outliers, many or detailed attribute variables, or a longitudinal or panel structure. Data stewardship organizations can lower disclosure risk through disclosure limitation methods and through the construction of synthetic data. Both record and attribute suppression can be represented by matrix masks, as can perturbation through noise addition and data swapping. Sampling and aggregation also have matrix mask representations. Distinct from masking methods, synthetic data construction treats the microdata as a realization of some statistical model and replaces the true microdata with samples generated according to that model. The released data consist of records of individual synthetic units rather than records for the actual units. The organization must recognize uncertainty in both the form of the model and the values of its parameters, which argues for the relevance of hierarchical and mixture models in generating the synthetic data. Synthetic data have an advantage over masked data as a disclosure limitation method because they are easier for the user to analyze. Copyright © 2009 Wiley Periodicals, Inc., A Wiley Company
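The abstract states that suppression, noise addition, data swapping, sampling, and aggregation all have matrix mask representations but does not spell out the form. The sketch below assumes the commonly used form M = AXB + C, where X is the n-by-p microdata matrix, A acts on records (rows), B acts on attributes (columns), and C is a displacement term. The function name `matrix_mask`, the toy data values, the noise scales, and the choice to express swapping through C are all illustrative assumptions, not taken from the article.

```python
import numpy as np


def matrix_mask(X, A=None, B=None, C=None):
    """Return A @ X @ B + C; any operator left as None means 'no change'."""
    n, p = X.shape
    A = np.eye(n) if A is None else A
    B = np.eye(p) if B is None else B
    M = A @ X @ B
    return M if C is None else M + C


# Toy microdata (made-up values): 4 records (rows) by 3 attributes (columns).
X = np.array([[25.0, 40000.0, 1.0],
              [37.0, 52000.0, 0.0],
              [52.0, 91000.0, 1.0],
              [61.0, 30000.0, 0.0]])

# Record suppression (also how a sample is drawn): delete row 2 of the identity.
A_suppress = np.delete(np.eye(4), 2, axis=0)           # shape (3, 4)
records_suppressed = matrix_mask(X, A=A_suppress)

# Attribute suppression: delete column 1 of the identity.
B_suppress = np.delete(np.eye(3), 1, axis=1)           # shape (3, 2)
attributes_suppressed = matrix_mask(X, B=B_suppress)

# Aggregation: replace pairs of records by their averages.
A_aggregate = np.array([[0.5, 0.5, 0.0, 0.0],
                        [0.0, 0.0, 0.5, 0.5]])
aggregated = matrix_mask(X, A=A_aggregate)

# Noise addition: C holds random noise; per-column scales are arbitrary here.
rng = np.random.default_rng(0)
C_noise = rng.normal(loc=0.0, scale=[1.0, 1000.0, 0.05], size=X.shape)
noisy = matrix_mask(X, C=C_noise)

# Data swapping: exchange attribute 1 between records 0 and 3.
# P permutes rows, E selects the swapped column; here the swap is written as a
# data-dependent displacement C = (P - I) X E, one possible representation.
P = np.eye(4)[[3, 1, 2, 0]]
E = np.zeros((3, 3))
E[1, 1] = 1.0
swapped = matrix_mask(X, C=(P - np.eye(4)) @ X @ E)
```

Each mask differs only in which of A, B, or C departs from its default, which is what makes the matrix form a convenient common language for comparing masking methods.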