To P or not to P? The Usefulness of P-values in Quantitative Political Science Research

This contribution gives a short overview of the mechanics of significance testing in inferential statistics, in particular linear models, and tries to put the discussion about the usefulness of p-values into a broader perspective of scientific practice. This discussion needs to be embedded into the larger debate about the credibility crisis faced by empirical social science research. In particular, it seems of utmost importance to discuss what the profession as a whole, journals, publishers, as well as editors can do to encourage better research practice that generates reliable and useful empirical findings.


Introduction: The Use and Misuse of P-values
Star-gazing and p-hacking are just two of the commonly used pejorative descriptions of publication or favoured-hypothesis bias. The so-called replication crisis (Gelman 2011; Benjamin et al. 2018; Lakens et al. 2018; McShane et al. 2017; Trafimow and Marks 2015; Nuzzo 2014) in quantitative social science research is often attributed to the (mis-)use of p-values when presenting inferential statistical results to empirically support a previously stated hypothesis or theoretical argument.
A quote from the infamous Political Science Rumors website exemplifies this problem: "Third-year AP here. Starting to realize that there is no way I can demonstrate a meaningful relationship between my two variables without manipulating P-values. Two questions: 1) Is this unethical? 2) What are the consequences if I get caught? At this point, I've sunk too much time into the project, so abandoning it simply isn't an option." Recent research has shown that the distribution of p-values in published research differs significantly from the distribution in unpublished work (Gerber and Malhotra 2008a,b). In published empirical research, p-values bunch up at the (arbitrarily) set α-value of 0.05 (Esarey and Wu 2016; Gerber and Malhotra 2008a,b; Gerber et al. 2001). This research finds that statistically significant results are overrepresented in academic articles. If significant results are consistently favoured in the review process, published empirical findings could systematically overstate the magnitude of the effects even under ideal conditions (Esarey and Wu 2016; Gerber and Malhotra 2008a,b; Gerber et al. 2001). Gerber and Malhotra (2008a) analyze empirical articles in the two leading political science journals, the American Political Science Review (APSR) and the American Journal of Political Science (AJPS), and conclude that there is publication bias due to the reliance on the 0.05 significance level in empirical research. Gerber et al. (2001), in addition, argue that to achieve statistical significance, the effect size must be larger in small samples. If published work is frequently biased against statistically insignificant findings, we should observe that reported effect sizes shrink as sample sizes increase. And they show exactly this.
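The selection mechanism described above can be illustrated with a small simulation. The sketch below is a hypothetical setup and is not taken from the cited studies: each simulated study estimates the same true effect of 0.2 from n unit-variance observations, and a study counts as published only if its t-ratio exceeds 1.96 in absolute value. The function name, the cutoff, and all parameter values are assumptions chosen for illustration.

```python
import random
import statistics


def simulate_published_effects(true_effect, n, studies=1000, seed=1):
    """Return the mean estimate over all studies and over 'published' ones.

    Hypothetical publication rule: a study is published only if its
    t-ratio exceeds 1.96 in absolute value (normal-approximation cutoff).
    """
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(studies):
        # One study: estimate a mean from n noisy observations.
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        estimate = statistics.fmean(sample)
        std_error = statistics.stdev(sample) / n ** 0.5
        all_estimates.append(estimate)
        if abs(estimate / std_error) > 1.96:
            published.append(estimate)
    return statistics.fmean(all_estimates), statistics.fmean(published)


small_all, small_published = simulate_published_effects(0.2, n=25)
large_all, large_published = simulate_published_effects(0.2, n=400)
print(f"n=25:  all studies {small_all:.2f}, published only {small_published:.2f}")
print(f"n=400: all studies {large_all:.2f}, published only {large_published:.2f}")
```

Under these assumptions the average over all studies should stay close to the true effect, whereas the average over published small-sample studies should lie well above it. With the larger sample the gap should nearly disappear, which is the pattern of effect sizes shrinking with sample size that Gerber et al. (2001) describe.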
The new editor of the prime political methodology journal, Political Analysis, recently banned the use of p-values and significance stars from articles published in the journal (Gill 2018). This set off a debate about the usefulness of statistical hypothesis testing in general and of presenting p-values as an indication of statistical significance more specifically. This debate cannot be treated independently of a more general discussion of replicability, robustness and reproducibility of empirical research and ultimately academic misconduct.
After the American Statistical Association published their statement on the use of p-values (Wasserstein and Lazar 2016), I, as then editor-in-chief of the EPSA journal PSRM, initiated a debate with the editorial board about the use and misuse of p-values. The debate concluded that p-values as such are not the problem; they provide more or less useful information for the consumer of scientific research. However, they cannot be used as the sole criterion for the reliability, significance or economic/political relevance of the empirical findings. This information needs to be coupled with information on effect size (i.e. the real-world relevance of the empirical results) and the robustness of the estimates, as well as a discussion of coverage and potential effect heterogeneity. In combination, these different sets of empirical information can paint a more complete picture of the credibility of the presented statistical results.
Certainly, t-tests and p-values are not more or less useful than providing confidence intervals or credible intervals in Bayesian statistics. Bayesian statisticians argue that credible intervals are more useful because they are generated by simulating the posterior distribution of the estimates. The underlying philosophy differs, but Bayesians make equally strong assumptions about prior and posterior distributions that, if violated, have equally negative effects on inference. Gelman (2011) argues that so-called Bayesian hypothesis testing is just as bad as regular hypothesis testing.
In what follows, I will quickly present the logic of statistical inference and significance testing, discuss the implications of significance testing in linear models, and will then turn to the bigger question of what the profession can do to deal with academic misconduct, since p-value hacking is just a symptom of a larger credibility crisis.

The Econometrics of P-values: Hunting for Inference
Inference, the potential to draw conclusions beyond the analysed data sample to the population, is one of the main goals of empirical analysis in the social sciences. Researchers want to know whether the relationships they find in the sample at hand can predict the relationships between the same variables but drawn from a different sample. What we are ultimately interested in are out-of-sample predictions.
Significance tests have been developed to answer exactly the question of whether it is possible to generalize the regression results for the sample under observation to the universe of cases. However, for significance tests to produce reliable results, a host of assumptions has to hold. In linear (OLS) regressions this set of underlying assumptions is called full ideal conditions or Gauss-Markov assumptions. These assumptions ensure that the data sample under observation matches the characteristics of the universe of cases or the so-called population. For this to work the researcher has to define the population. This is usually a theoretical question and harder than most applied researchers expect: To what set of cases does the formulated theory or theoretical argument apply? All countries over all periods of time? A set of countries over a defined time-span? All individuals across geographical entities, sex, age, time?
The underlying assumption for significance tests to produce reliable results is that the sample is randomly drawn from the underlying population and thus mirrors all relevant characteristics of the universe of cases. All deviations are due to random sampling error. Gauss-Markov assumptions ensure that this is the case. If deviations from the population are non-random, the standard errors of the estimated coefficients are estimated incorrectly, and the resulting significance tests are therefore wrong and lead to false conclusions.
Bayesian statisticians strongly criticise the assumption underlying inferential statistical significance testing that standard errors depict the sampling variation of the estimated coefficient, i.e. the distribution of all effects estimated with a large number of different randomly drawn samples. This criticism is fuelled by the observation that a) we often do not know what the actual population is from which we are drawing a sample, b) samples are often not randomly drawn even if Gauss-Markov assumptions hold, and c) we often cannot draw a sample from a population, especially when we analyse a fixed set of countries or other geographical entities. These issues are certainly present and affect inferential statistical analysis. However, standard errors can be interpreted as the precision with which the relationship in the sample can be estimated. For example, they depict random noise whose source is not necessarily random sampling but random measurement error and other sources.

The T-test: A Quick Discussion
The t-test is the most commonly used significance test in linear OLS regression analysis. It tests whether the estimated coefficient is significantly different from zero, i.e. whether x (the right-hand-side variable) has an effect on y (the dependent variable). The Null-Hypothesis (H0) thus states that β = 0, whereby β denotes the effect of x on y. There are two variations: a one-sided alternative (HA) with β > 0 or β < 0, or a two-sided alternative hypothesis with β ≠ 0. The test statistic follows a Student's t distribution under the Null-Hypothesis if and only if all Gauss-Markov assumptions are met. Here t denotes the critical value of the Student's t distribution for a specific number of observations n and a specific level of significance. This level of significance, usually denoted α, is the threshold against which the p-value is compared.
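As a minimal sketch of the mechanics, the code below computes the t-ratio for the slope of a bivariate OLS regression, that is, the estimated coefficient divided by its estimated standard error. This is the standard textbook form and not necessarily the exact expression intended in the text. The simulated data, the function name `slope_t_test`, and the use of the normal distribution in place of the t distribution (reasonable only for large n) are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean


def slope_t_test(x, y):
    """Bivariate OLS of y on x: return slope, standard error, t-ratio, p-value."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / std_error  # test of H0: slope equals zero
    # Assumption: n is large, so the normal CDF stands in for the t CDF.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return beta, std_error, t_stat, p_value


# Simulated data with a true slope of 0.3 (hypothetical values).
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [1.0 + 0.3 * xi + rng.gauss(0, 1) for xi in x]
beta, std_error, t_stat, p_value = slope_t_test(x, y)
print(f"slope={beta:.3f} se={std_error:.3f} t={t_stat:.2f} p={p_value:.2g}")
```

With small samples the p-value should instead be taken from the t distribution with n - 2 degrees of freedom, which the Python standard library does not provide.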
The level of significance can in theory be set by the researcher, but in practice the convention in statistics and quantitative data analysis in general is a significance level of 5%, or 2.5% in each tail of the t-distribution for a two-sided t-test.
The 5% threshold itself is an arbitrary number, yet the stated convention has led to the discussed problems of p-hacking, star-gazing and publication bias because the profession has been conditioned for decades to accept results that are significant at the 5% level. In order to combat this publication bias, several political scientists (Benjamin et al. 2018; Esarey 2017) have suggested lowering the threshold for p-values to 0.005. However, in my opinion, a mechanical lowering of the accepted threshold will not solve the problem.
Why is this the case? Significance levels govern the frequency with which the researcher allows her statistical analysis to make α or Type-I errors as compared to β or Type-II errors. Statistical testing adopts the legal philosophy "in dubio pro reo": rather acquit the defendant even though s/he might be guilty than convict an innocent. In this sense, the statistical profession has decided that it is more important to avoid Type-I errors (wrongly rejecting the Null-Hypothesis and concluding that there is a non-zero effect) than to avoid Type-II errors (wrongly accepting the null that the coefficient is zero). Whether this is reasonable for every single empirical analysis remains debatable. The choice of significance level raises or lowers the probability of Type-I and Type-II errors. The smaller the significance level (0.05, 0.01), the lower the probability of making Type-I errors and the higher the probability of making Type-II errors. In a discipline like Political Science, we should be equally concerned with uncovering effects that are indeed there, in particular when the effects are policy relevant.
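The trade-off between the two error types can be made concrete with a short calculation. The sketch below assumes a two-sided z-test (a normal approximation to the t-test) and a hypothetical true effect of 0.5 estimated with a standard error of 0.2. The function name and these values are illustrative assumptions only.

```python
from statistics import NormalDist


def type_two_error(alpha, effect, std_error):
    """Type-II error rate of a two-sided z-test at significance level alpha."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    shift = effect / std_error  # expected value of the test statistic
    # Power: probability that the statistic lands in either rejection region.
    power = (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)
    return 1 - power


for alpha in (0.05, 0.01, 0.005):
    miss_rate = type_two_error(alpha, effect=0.5, std_error=0.2)
    print(f"alpha={alpha}: Type-II error rate={miss_rate:.3f}")
```

In this setup the Type-II error rate is roughly 0.29 at a significance level of 0.05, about 0.53 at 0.01, and about 0.62 at 0.005. Tightening the threshold therefore buys fewer false positives at the price of missing a real effect far more often.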
Under ideal conditions, the t-test has good statistical power. However, as most applied researchers understand, ideal conditions are just that and are frequently violated in real data analysis. It is therefore useful to discuss and question the mechanical convention of a significance level of 0.05. Since different set-ups, data types and samples meet these ideal conditions to different degrees, it does not seem helpful to set another, lower static significance level to solve the problem of publication bias (Esarey 2017).
Researchers often know which of the Gauss-Markov assumptions are violated and how these violations affect the estimation of the standard errors and thus the significance tests. A multitude of solutions to these specification issues have been developed and are frequently employed by applied researchers: robust standard errors (clustering, for example) that control for (group) heteroscedasticity, serial correlation and spatial correlation, amongst other issues, as well as small-sample and non-normality corrections. The problem with adjusting only the standard errors is, quite often, that most violations of full ideal conditions affect the estimation of both the coefficients and the standard errors. Just treating the standard errors might increase the potential for wrong inferences.
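To illustrate how a violated assumption distorts the significance test, the simulation below generates data in which the true effect is zero but the errors share a common shock within groups. It compares the naive t-test on all observations with a t-test on group means, a deliberately simple stand-in for clustered standard errors. The setup (30 groups of 20 observations, a regressor that varies only across groups) and the function names are assumptions for illustration. The example covers only the case in which the standard errors alone are affected.

```python
import math
import random
from statistics import fmean


def slope_t_ratio(x, y):
    """t-ratio of the slope in a bivariate OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    return beta / math.sqrt(rss / (n - 2) / sxx)


def rejection_rates(groups=30, per_group=20, sims=300, seed=3):
    """Share of simulations rejecting a true null: naive vs. group means."""
    rng = random.Random(seed)
    naive = collapsed = 0
    for _ in range(sims):
        x, y, group_x, group_y = [], [], [], []
        for _ in range(groups):
            x_g = rng.gauss(0, 1)  # regressor varies only across groups
            u_g = rng.gauss(0, 1)  # shared group shock: errors are clustered
            ys = [u_g + rng.gauss(0, 1) for _ in range(per_group)]
            x.extend([x_g] * per_group)
            y.extend(ys)
            group_x.append(x_g)
            group_y.append(fmean(ys))
        # The true slope is zero, so every rejection is a Type-I error.
        naive += abs(slope_t_ratio(x, y)) > 1.96
        collapsed += abs(slope_t_ratio(group_x, group_y)) > 1.96
    return naive / sims, collapsed / sims


naive_rate, collapsed_rate = rejection_rates()
print(f"naive: {naive_rate:.2f}, group means: {collapsed_rate:.2f}")
```

In this design the naive test should reject the true null in a large share of simulations, far above the nominal 5%, while the test on group means should stay close to the nominal rate.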
While these solutions go some way in reducing the potential for overestimating the statistical significance of effects, because they usually are more conservative estimates of standard errors, they do not necessarily solve the problem of p-hacking and preferred hypothesis bias. The incentives set by the profession, journals, and the research community remain untouched.

The Bigger Debate: Academic Misconduct
The debate about the misuse of p-values in empirical research is intimately intertwined with the more recent debate on academic fraud and thus reproducibility, reliability, credibility, and robustness of published empirical findings. Why is there an incentive to engage in academic misconduct and risk one's career? Like doping in sports, cheating allows one to reach the goal (publications, citations, tenure, promotion) faster. With the probability of detection still very low, incentives for cheating remain high. But the costs are borne by honest academics both personally (competition) and as a profession (reputation).
DART (Data Access and Research Transparency) and COPE (Committee on Publication Ethics) initiatives help to raise awareness and define standards for replication and robustness. Many journals in political science have developed dedicated replication guidelines for empirical research and some of them have implemented in-house replication of quantitative analysis (PSRM, PA, AJPS).
Yet, this does not seem to be enough. Academic research produces (positive) results that hinge on our credibility and reputation. We need to maintain this credibility and reputation by implementing self-control mechanisms that prevent academic fraud and misconduct. We cannot leave it to the (criminal) justice system, since the fraud of a few produces negative externalities for the whole profession.
It seems almost impossible to detect subtle kinds of fraud like p-hacking and non-robust empirical results through the typical peer review process, which is supposedly the main instrument of quality assurance in the academic profession. In most cases, authors do not have to provide their data to the reviewers. There may often be good reasons for this, for example when data is original, sensitive, or even personalized. Yet the peer review process only evaluates the plausibility of results; it assumes honesty.
What are the solutions? Banning p-values from articles does not seem to help much, or is at best a drop in the ocean of publication bias and academic fraud, since it only treats a symptom but not the disease itself. Raising the costs of misconduct is one way forward. Solutions have to increase the perceived probability of detection for the individual researcher. Let me discuss a few possibilities that come to mind, without claiming to be exhaustive.
Publishers can easily integrate plagiarism-detection software into their online submission systems to screen articles and books for potential copying of existing work without proper citation. A few journals like PSRM have implemented this.
Since the incentives cannot be denied, researchers must bind themselves to the mast like Ulysses through pre-registration. Disciplines that are less affected by spectacular fraud seem to be leading. In political science the EGAP registry holds 1128 pre-registered research designs, as compared to only 80 in 2014. In economics the RCT Registry of the American Economic Association contains 2370 registered studies, as compared to 240 in 2014. Registration of research designs is increasing exponentially. This is a welcome development since registered experiments cannot be changed ex post in order to adapt the design to the empirical results. However, not all studies lend themselves to pre-registration. Again, editors have to step up and make pre-registration compulsory in order to make this practice the norm in the profession. Registration does not work, however, if researchers regard the experimentally generated data as private property that does not have to be published or made available to reviewers. In this case researchers can in principle remove cases that do not fit the argument. In addition, recent research indicates that few studies that actually pre-register follow through and do what they committed themselves to do. This is in itself an issue, yet it might point to a more fundamental problem with the idea of pre-registration: the difficulty of anticipating the whole set of necessary analyses before collecting the data.
Another potential measure is to make all data publicly available. Again, many journals require data and code to be made available to the public before publication. But often there are no requirements as to whether source data has to be included. When source data is original, confidential, or personalized, publication might be impossible or undesirable. However, new avenues to make this kind of data available for replication need to be explored.
Given that the collection of original data is time consuming, costly, and creates public goods for the discipline, data citation must be improved. Data are intellectual products for which citation should be required (Mooney 2011). This practice increases incentives for scholars to publish data because it will affect their citation count. Original data collection should also be valued more by the profession and our journals to make both collecting and sharing data more attractive.
The DART initiative and leading journals and editors have institutionalized the publication of replication material. When it comes to replication, journals and their editors are key because they set the standards for good practice in the profession. One way is to strengthen the review process with actual replication of empirical results. This might not always be feasible for the reasons discussed above. That is why journals need to conduct their own replication analysis of accepted empirical studies, as several leading journals in the discipline now do (PSRM, AJPS, PA).
Replication of empirical results is a necessary but not sufficient condition for detecting and reducing misconduct, especially because the implemented procedures mostly check that the authors' code runs through and reproduces the results in the paper given the provided data. The example of the Excel spreadsheet mistakes of Reinhart and Rogoff, as well as the problem of how to treat missing values in the Piketty case, show that simple replication of results will remain insufficient to prevent the publication of unreliable empirical findings. Robustness checks can close part of the gap. They have become increasingly standard in the social sciences. Robustness checks do not just replicate empirical results but take into account that researchers have to make many decisions about estimation and specification. Many published studies read as if the presented specification was the only plausible one. Robustness checks, however, assume that alternative specifications are no less plausible and test whether results and conclusions hold for alternative assumptions. The problem still remains that it is in the hands of the authors to decide which robustness and sensitivity checks to include. This implies the same logic as for p-hacking, yet at least raises the bar an inch higher.
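The logic of selective reporting can be sketched in a few lines. In the hypothetical simulation below the outcome is pure noise, and a researcher tries up to ten unrelated regressors, reporting the first one that reaches |t| > 1.96. For ten independent tests at the 5% level the chance of at least one false positive is 1 - 0.95^10, or about 0.40. The function names, sample size, and number of specifications are assumptions for illustration.

```python
import math
import random
from statistics import fmean


def slope_t_ratio(x, y):
    """t-ratio of the slope in a bivariate OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    return beta / math.sqrt(rss / (n - 2) / sxx)


def false_positive_rate(specifications, n=100, sims=400, seed=5):
    """Share of simulations in which at least one specification is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        y = [rng.gauss(0, 1) for _ in range(n)]  # outcome is pure noise
        for _ in range(specifications):
            # Each hypothetical specification swaps in an unrelated regressor.
            x = [rng.gauss(0, 1) for _ in range(n)]
            if abs(slope_t_ratio(x, y)) > 1.96:
                hits += 1
                break  # stop at the first significant result
    return hits / sims


single = false_positive_rate(1)
searched = false_positive_rate(10)
print(f"one specification: {single:.2f}, best of ten: {searched:.2f}")
```

The same arithmetic applies when authors choose which robustness checks to report: the more specifications remain at their discretion, the less a single reported result says on its own.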
The problem that is faced by the profession is feasibility. Even if we could agree on a set of necessary robustness and sensitivity tests, the question remains who should be in charge of checking that these rules have been followed and at what stage of the publication process?
There is much to do. The profession, publishers and editors need to decide on joint policies with respect to replication and robustness, and journals need to start accepting and publishing more null findings and replication studies. As a profession we need better practices that allow us to learn from null findings. The ability to distinguish a non-significant finding that rejects a wrong theory from a non-significant finding that results from a weak research design is key. If non-significant findings are useful because they are based on a strong design, scholars and journals will have an incentive to publish them.
This also requires that the scientific community, publishers and journals provide the necessary resources to build an infrastructure that increases the probability of detecting academic fraud, much more so than is the case at present.

Conclusion
Researchers always have an incentive to select results that confirm their favoured hypotheses. No requirement for robustness and sensitivity checks, or banning of p-values, can change this incentive. Unless the profession renders academic fraud more costly, instils better norms of replicability and reproducibility, promotes pre-registration of research designs not just for experimental studies, and encourages publication of null or negative findings, banning p-values cannot and will not solve the replication crisis.

Biography
Vera E. Troeger, Professor of Comparative Politics, Faculty for Economics and Social Sciences, Universitaet Hamburg, Germany, vera.eva.troeger@uni-hamburg.de; and Professor of Quantitative Political Economy, Department of Economics, University of Warwick, UK, v.e.troeger@warwick.ac.uk.
My research interests lie at the intersection between international and comparative political economy, labour economics, as well as applied quantitative data analysis and political methodology.