INFERRING LOGIT MODELS FROM EMPIRICAL MARGINS USING PROXY DATA

Authors


  • This work was supported in part by the Internal Revenue Service (IRS) for research on errors (TIRNO08E00040) and by the Office of Naval Research (ONR) for research on dynamic networks (N00014-08-1-1186) and metric robustness (N00014-06-1-0104). Additional support was provided by the Center for Computational Analysis of Social and Organizational Systems (CASOS). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Internal Revenue Service, the Office of Naval Research, or the U.S. government. Direct correspondence to Kathleen M. Carley, ISR, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA; e-mail: Kathleen.carley@cs.cmu.edu.

Abstract

We examine several approaches for inferring logit models from empirical margins of predictor covariates and conditional margins containing the means of a binary response for each covariate margin. One method is to fit proxy data to the conditional response using the beta distribution, a process we call “margin analysis.” Proxy data can be obtained using three approaches: (1) implementing the iterative proportional fitting (IPF) procedure on the margin totals, (2) sampling from a larger relevant data source such as the census, and (3) enumerating, or sampling from, the combinatoric space of all possible tables constrained by the margins. The first procedure is a well-studied approach for estimating contingency tables from margins, but it does not necessarily maintain the associations between the covariates unless it is seeded with an initial table containing those associations. The second approach, which is appropriate for analyzing sociodemographic covariates, uses a large census sample and adjusts for sampling biases observed in the empirical margins. However, the appropriateness of a census proxy depends substantially on how similar the sampling pools are. The third approach entails exploring the combinatoric space of all contingency tables constrained by the margins while considering the associations among the covariates. We aggregate the logit models estimated from each table in that space into a single model. This approach is more robust than the first two because it considers multiple proxies. While the estimated logit models from each approach are generally similar to one another, for the low-dimensional tables we explore in this paper, the combinatoric approach incurs wider standard errors, which renders potentially significant coefficients insignificant. Finally, we suggest weighting the combinatoric models with evidence-relevant probabilities obtained using the multivariate Pólya distribution.
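As an illustration of the first approach only, the following is a minimal sketch of two-way IPF in Python with NumPy. It is not the authors' implementation: the function name `ipf`, the tolerance, and the margin totals are hypothetical, and the seed table is assumed to be strictly positive. It shows the point made above about seeding: a uniform seed yields the independence table, while a seed that carries an association passes its odds ratio through to the fitted table.

```python
import numpy as np

def ipf(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Scale a positive seed table until its margins match the targets.

    Hypothetical helper for illustration; assumes every seed cell is > 0
    and that the row and column totals sum to the same grand total.
    """
    table = np.asarray(seed, dtype=float).copy()
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        # Rescale each row to its target total.
        table *= (row_totals / table.sum(axis=1))[:, None]
        # Rescale each column to its target total.
        table *= (col_totals / table.sum(axis=0))[None, :]
        # Columns now match exactly; stop once the rows match as well.
        if np.allclose(table.sum(axis=1), row_totals, atol=tol, rtol=0):
            break
    return table

# Made-up margin totals for two binary covariates (illustrative only).
row_totals = [40, 60]
col_totals = [30, 70]

# Uniform seed: no association is supplied, so none appears in the result.
independent = ipf(np.ones((2, 2)), row_totals, col_totals)

# Seed with an odds ratio of 16: IPF preserves that association.
associated = ipf(np.array([[4.0, 1.0], [1.0, 4.0]]), row_totals, col_totals)
odds_ratio = (associated[0, 0] * associated[1, 1]) / (
    associated[0, 1] * associated[1, 0]
)
```

Because each step multiplies whole rows or columns by a constant, the cross-product ratio of the seed is unchanged by the procedure, which is why the choice of seed table determines the association structure of the proxy.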
