Summary Clinical trials data often come in the form of low-dimensional tables of small counts. Standard approximate tests such as score and likelihood ratio tests are imperfect in several respects. First, they can give quite different answers from the same data. Second, the actual type-1 error can differ significantly from nominal, even for quite large sample sizes. Third, exact inferences based on these can be strongly nonmonotonic functions of the null parameter and lead to confidence sets that are discontiguous. There are two modern approaches to small sample inference. One is to use so-called higher order asymptotics (Reid, 2003, Annal of Statistics 31, 1695–1731) to provide an explicit adjustment to the likelihood ratio statistic. The theory for this is complex but the statistic is quick to compute. The second approach is to perform an exact calculation of significance assuming the nuisance parameters equal their null estimate (Lee and Young, 2005, Statistic and Probability Letters 71, 143–153), which is a kind of parametric bootstrap. The purpose of this article is to explain and evaluate these two methods, for testing whether a difference in probabilities p2− p1 exceeds a prechosen noninferiority margin δ0. On the basis of an extensive numerical study, we recommend bootstrap P-values as superior to all other alternatives. First, they produce practically identical answers regardless of the basic test statistic chosen. Second, they have excellent size accuracy and higher power. Third, they vary much less erratically with the null parameter value δ0.