Treating Words as Data with Error: Uncertainty in Text Statements of Policy Positions

American Journal of Political Science, Volume 53, Issue 2, pages 495–513, April 2009
Article first published online: 27 MAR 2009
© 2009, Midwest Political Science Association

This research was partly supported by the European Commission Fifth Framework Programme (project number SERD-2002-00061) and by the Irish Research Council for the Humanities and Social Sciences. We thank Andrea Volkens for generously sharing her experience and data regarding the CMP; Thomas Daubler for research assistance; and Thomas Daubler, Gary King, Michael D. McDonald, Oli Proksch, and Jon Slapin for comments. We also thank James Adams, Garrett Glasgow, Simon Hix, Abdoul Noury, and Sona Golder for providing and assisting with their replication datasets and code.
How to Cite
Benoit, K., Laver, M. and Mikhaylov, S. (2009), Treating Words as Data with Error: Uncertainty in Text Statements of Policy Positions. American Journal of Political Science, 53: 495–513. doi: 10.1111/j.1540-5907.2009.00383.x
Political text offers extraordinary potential as a source of information about the policy positions of political actors. Despite recent advances in computational text analysis, human interpretative coding of text remains an important source of text-based data, ultimately required to validate more automatic techniques. The profession's main source of cross-national, time-series data on party policy positions comes from the human interpretative coding of party manifestos by the Comparative Manifesto Project (CMP). Despite widespread use of these data, the uncertainty associated with each point estimate has never been available, undermining the value of the dataset as a scientific resource. We propose a remedy. First, we characterize processes by which CMP data are generated. These include inherently stochastic processes of text authorship, as well as of the parsing and coding of observed text by humans. Second, we simulate these error-generating processes by bootstrapping analyses of coded quasi-sentences. This allows us to estimate precise levels of nonsystematic error for every category and scale reported by the CMP for its entire set of 3,000-plus manifestos. Using our estimates of these errors, we show how to correct biased inferences, in recent prominently published work, derived from statistical analyses of error-contaminated CMP data.
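To illustrate the bootstrapping approach described above, the sketch below resamples a manifesto's coded quasi-sentences with replacement and recomputes a left-right scale on each draw, yielding a standard error for the point estimate. This is a minimal illustration, not the authors' implementation: the category codes and the left/right category sets are hypothetical placeholders, not the CMP's full "rile" definition.

```python
import random
import statistics

# Illustrative left/right category sets (NOT the full CMP rile scheme).
LEFT = {"103", "105", "106"}
RIGHT = {"104", "201", "203"}

def rile(codes):
    """Left-right score: (% right categories) - (% left categories)."""
    n = len(codes)
    right = sum(c in RIGHT for c in codes) / n * 100
    left = sum(c in LEFT for c in codes) / n * 100
    return right - left

def bootstrap_rile(codes, draws=1000, seed=42):
    """Resample quasi-sentences with replacement and recompute the
    scale on each draw; the spread of the draws estimates the
    nonsystematic error in the point estimate."""
    rng = random.Random(seed)
    reps = [rile(rng.choices(codes, k=len(codes))) for _ in range(draws)]
    return statistics.mean(reps), statistics.stdev(reps)

# A hypothetical manifesto of 80 coded quasi-sentences.
codes = ["103", "104", "201", "105", "000", "203", "106", "104"] * 10
est = rile(codes)            # point estimate from the observed coding
mean, se = bootstrap_rile(codes)  # bootstrap mean and standard error
```

The same resampling loop, run over every coded manifesto, is what allows an error estimate to accompany each published CMP category percentage and scale score.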