Keywords:

  • confidentiality/data confidentiality;
  • multiple imputation;
  • record linkage;
  • data swapping;
  • synthetic data

Abstract

Protecting confidentiality is essential to the functioning of systems for collecting and disseminating data on individuals and enterprises that are necessary for evidence-based public policy formulation. Deidentification of records, defined as removing obvious identifiers such as name and address, is not sufficient to protect confidentiality. Microdata have characteristics that lead to increased disclosure risk, such as the existence of identification files, geographical detail, outliers, many or detailed attribute variables, or longitudinal or panel structure in the data. Data stewardship organizations can lower disclosure risk through disclosure limitation methods and through the construction of synthetic data. Record and attribute suppression can both be represented by matrix masks, as can perturbation through noise addition and data swapping. Sampling and aggregation also have matrix mask representations. Distinct from masking methods, synthetic data construction considers the microdata to be a realization of some statistical model. It then replaces the true microdata with samples generated according to the model. The released data consist of records of individual synthetic units rather than records for the actual units. The organization must recognize uncertainty in both model form and values of model parameters. This argues for the relevance of hierarchical and mixture models for generating the synthetic data. Synthetic data have an advantage as a disclosure limitation method over masked data because they are easier for the user to analyze.

Copyright © 2009 Wiley Periodicals, Inc., A Wiley Company

For further resources related to this article, please visit the WIREs website.
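
The matrix mask representation mentioned in the abstract can be illustrated with a small sketch. It assumes the common form M = AXB + C, where X is the n-by-p microdata matrix, A acts on records (rows), B acts on attributes (columns), and C displaces values. The function name matrix_mask, the toy data, and the noise scales are hypothetical and chosen only for illustration; writing a swap as the sum of two masks is one possible formulation, not necessarily the one used in the article.

```python
import numpy as np

def matrix_mask(X, A=None, B=None, C=None):
    """Return A @ X @ B + C; A and B default to identities, C to zero."""
    n, p = X.shape
    A = np.eye(n) if A is None else A
    B = np.eye(p) if B is None else B
    M = A @ X @ B
    return M if C is None else M + C

# Toy microdata (made-up values): columns are age, income, indicator.
X = np.array([[34.0,  52000.0, 1.0],
              [51.0,  87000.0, 0.0],
              [29.0,  41000.0, 1.0],
              [78.0, 990000.0, 0.0]])
n, p = X.shape

# Record suppression: A keeps rows 0-2 and drops the outlying record 3.
A_keep = np.eye(n)[[0, 1, 2], :]
records_suppressed = matrix_mask(X, A=A_keep)

# Attribute suppression: B keeps columns 0 and 2 and drops income.
B_keep = np.eye(p)[:, [0, 2]]
attrs_suppressed = matrix_mask(X, B=B_keep)

# Aggregation: A averages records 0-1 and records 2-3.
A_agg = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5]])
aggregated = matrix_mask(X, A=A_agg)

# Noise addition: C is a random displacement (assumed per-column scales).
rng = np.random.default_rng(0)
C_noise = rng.normal(0.0, [1.0, 1000.0, 0.0], size=X.shape)
noisy = matrix_mask(X, C=C_noise)

# Data swapping: exchange income between records 0 and 2.
# P permutes rows; E selects the income column. The swap is the sum of
# a mask that blanks income and a mask that re-inserts permuted income.
P = np.eye(n)[[2, 1, 0, 3], :]
E = np.zeros((p, p))
E[1, 1] = 1.0
swapped = matrix_mask(X, B=np.eye(p) - E) + matrix_mask(X, A=P, B=E)
```

In this sketch each method differs only in which of A, B, and C departs from its default, which is the sense in which suppression, sampling, aggregation, noise addition, and swapping share one representation.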