The basic research question in the European dipper study was whether the apparent survival probability of birds differed between years when floods occurred during the breeding season (this species nests near streams) and normal years when a flood did not occur. There are two models: (1) {ϕ(·),*p*(·)}, implying that apparent survival (ϕ) and recapture probability (*p*) are approximately constant over years; and (2) {ϕ(*n*),ϕ(*f*),*p*(·)}, where years are partitioned into normal years (*n*) and flood years (*f*) in terms of apparent survival probability. These models are clear representations of two science hypotheses: one where a flood has no impact on apparent survival, and one where a flood does impact apparent survival. The first model has *K* = 2 parameters, whereas the second model has *K* = 3 parameters. The research question (a simple one-parameter observational study) relates to the possible change in survival probability during flood years.

The maximum likelihood estimates (MLE) for ϕ and measures of precision for parameters in the two models are presented in Table 2. The difference in estimates of survival probability (the ‘effect size’) is 0·1383, SE = 0·0532, with a 95% confidence interval for this difference of (0·0340, 0·2425). The MLE of *p* is 0·9025 (SE = 0·0286) vs. 0·8997 (SE = 0·0293) for the two models, respectively.
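The confidence interval quoted above is a standard 95% Wald interval, estimate ± 1·96 × SE. As a quick arithmetic check (a sketch, not the authors' computation; small discrepancies in the last digit are rounding):

```python
from math import sqrt

# 95% Wald confidence interval for the effect size (difference in
# apparent survival), using the estimate and SE quoted in the text.
effect, se = 0.1383, 0.0532
lo, hi = effect - 1.96 * se, effect + 1.96 * se
print(round(lo, 4), round(hi, 4))  # close to the (0.0340, 0.2425) reported
```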

#### null hypothesis testing

The simpler model is nested in the three-parameter model and a simple likelihood ratio test of the two models provides a test statistic of 6·735 and, assuming this is χ^{2} distributed on one degree of freedom, we obtain a *P*-value of 0·0095. This would be ruled ‘significant’; some would say ‘highly significant’ and others would include ‘**’ in tabular material to emphasize its high significance. Note that only the null hypothesis (*H*_{0}) is the subject of the test.

Formally, the *P*-value of 0·0095 is the probability of a value as large as 6·735 or larger, given the null model {ϕ(·),*p*(·)} is true. Given that this is such a small probability, one concludes (by default) that the alternative model {ϕ(*n*),ϕ(*f*),*p*(·)} is ‘significantly’ better. The proper interpretation of the *P*-value is strained; this provides some explanation regarding why so many people erroneously believe the *P*-value means something else (e.g. the probability that the null model is true).
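The tail probability quoted above can be verified directly. For a chi-square distribution with one degree of freedom, the upper-tail probability reduces to a complementary error function, so no statistics library is needed (an illustrative check, not the authors' code):

```python
from math import erfc, sqrt

# Likelihood ratio test from the text: test statistic 6.735, df = 1.
# For df = 1, P(X > x) = erfc(sqrt(x / 2)).
lrt = 6.735
p_value = erfc(sqrt(lrt / 2.0))  # upper-tail probability
print(round(p_value, 4))         # ≈ 0.0095, matching the text
```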

#### i-t approach

Under this approach, one obtains the model probabilities directly: 0·0868 for {ϕ(·),*p*(·)} and 0·9132 for {ϕ(*n*),ϕ(*f*),*p*(·)}.

In addition, these are mathematically equivalent to Bayesian posterior model probabilities (Burnham & Anderson 2004). These model probabilities provide direct evidence regarding the empirical support for the two models, without having to assume that either model is ‘true’ (there are no true models). We believe that most scientists and resource managers would view these model probabilities as more meaningful forms of evidence compared with *P*-values.

The quantification of information loss (Δ_{i} = AIC_{i} − minAIC) allows the computation of the likelihood of model *g*_{i}, given the data: ℒ(*g*_{i} | *data*) = exp(−Δ_{i}/2).

The probability of model *i* is a normalization of the model likelihoods: *w*_{i} = exp(−Δ_{i}/2) / Σ_{r} exp(−Δ_{r}/2), where the sum is over all *R* models in the set.

The *w*_{i} are ‘Akaike weights’ or model probabilities. These weights are quite unlike *P*-values (the probability of the data, given the null model); instead they are the probability of model *i*, given the data (Table 1). Finally, an evidence ratio (*E*) is useful in comparing the relative strength of evidence for two hypotheses, *i* and *j*: *E* = *w*_{i}/*w*_{j}.

Burnham & Anderson (2002) provide a discussion of evidence ratios and model probabilities. Evidence ratios provide a measure of the relative likelihood of one hypothesis vs. another. Here, likelihood has a technical meaning, can be quantified and should not be confused with probability. For example, if person A holds three raffle tickets and person B holds one, person A is three times more *likely* to win than person B; we do not know the absolute probability of either person winning without knowing the total number of raffle tickets. In the dipper example, the evidence ratio gauges the relative support for the two alternatives: 0·9132/0·0868 = 10·52. Given the available data, a difference in survival probability having occurred between normal and flood years is 10·52 times more likely than no difference having occurred. This suggests somewhat limited to moderate evidence for a flood effect on apparent survival probability (strong evidence of a flood effect is not warranted, contrary to the result from NHT). Evidence ratios are invariant to other models in the model set and are the statistic used in legal settings, such as criminal trials relying on DNA evidence (Evett & Weir 1998). Evidence ratios are a continuous measure, but some useful guidelines exist in the statistical literature (Table 3; Jeffreys 1948; Evett & Weir 1998). Inference should be about models and parameters, given data; however, *P*-values are probability statements about data, given null models. Model probabilities and evidence ratios provide a means to make inference directly about models and their parameters, given data.
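The Akaike weights and evidence ratio above can be reconstructed from quantities already given in the text: the likelihood ratio statistic (6·735) and the change in parameter count (*K* = 2 vs. *K* = 3) imply Δ ≈ 6·735 − 2 = 4·735 for the null model and 0 for the other. A sketch (not the authors' code; small differences from the published 0·0868/0·9132 and 10·52 reflect rounding in the reported statistic):

```python
from math import exp

# Delta-AIC values for {phi(.),p(.)} and {phi(n),phi(f),p(.)},
# reconstructed from the reported LRT statistic and parameter counts.
deltas = [4.735, 0.0]
likes = [exp(-d / 2.0) for d in deltas]   # model likelihoods exp(-Delta_i / 2)
total = sum(likes)
weights = [l / total for l in likes]      # Akaike weights w_i
evidence_ratio = weights[1] / weights[0]  # E = w_i / w_j
print([round(w, 3) for w in weights], round(evidence_ratio, 2))
```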

Table 3. General guidelines for the amount of support given by an evidence ratio, based on Evett & Weir (1998)

| Evidence ratio | Verbal description |
| --- | --- |
| 1–10 | Limited support |
| 10–100 | Moderate support |
| 100–1000 | Strong support |
| > 1000 | Very strong support |
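As a small illustration of how these guidelines might be applied in practice, the following helper (hypothetical, not from the paper; the treatment of the boundary values is a judgment call) maps an evidence ratio onto the verbal categories of Table 3:

```python
# Map an evidence ratio onto the verbal guidelines of Table 3
# (Evett & Weir 1998). Boundary values are assigned to the lower band.
def support_category(evidence_ratio):
    if evidence_ratio <= 10:
        return "limited support"
    if evidence_ratio <= 100:
        return "moderate support"
    if evidence_ratio <= 1000:
        return "strong support"
    return "very strong support"

print(support_category(10.52))  # the dipper evidence ratio
```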

I-T methods can be used in single-parameter problems such as the pollutant problem posed by Stephens *et al*. (2005), despite their claim that AIC was not applicable because ‘AIC cannot be used to compare models of different data sets’ (Stephens *et al*. 2005). Stephens *et al*. (2005) misinterpret what is meant in the statistical sciences by a ‘data set.’ In particular, a data set does not mean just one vector of numerical values. The example presented by Stephens *et al*. (2005) is a case of a control–treatment design, which assumes a control (sites are similar) and independent samples at each site. In actuality, paired control–treatment samples would be recorded at similar times because pollutant effects are time-dependent as a result of stream flow, but we disregard this to be consistent with Stephens *et al*.'s (2005) original example. Thus, the analysis could be framed as two models:

E(*Y*) = β_{0}   (intercept only)   and   E(*Y*) = β_{0} + (5 + β_{1})*X*   (indicator variable),

where *Y* is the concentration of the pollutant, β_{0} is the overall mean concentration (the intercept), β_{1} is the treatment effect, 5 is the constant added representing the minimum treatment effect of interest, and *X* is an indicator of the upstream (control) or downstream (treatment) site. The intercept-only model treats the control and treatment observations as if they were collected at the same site, whereas the indicator-variable model constrains these observations to be site-specific in the analysis. In this case the response variable is the same, therefore models can be built to represent the hypotheses and I-T methods are applicable. If the response variables were different at each site, neither I-T nor NHT could be used.
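The two-model comparison sketched above can be carried out with ordinary least squares and AIC. The following is illustrative only: Stephens *et al*. (2005) report no data for this example, so the sample size, effect size and noise level here are arbitrary, and AIC is computed up to an additive constant as *n* log(RSS/*n*) + 2*K* under a Gaussian likelihood:

```python
import random
from math import log

random.seed(1)
n = 30  # arbitrary sample size for illustration
# Simulated concentrations: upstream (control) mean 20, downstream
# (treatment) mean 27, i.e. a true treatment effect of 7 units.
control = [20.0 + random.gauss(0.0, 2.0) for _ in range(n // 2)]
treated = [27.0 + random.gauss(0.0, 2.0) for _ in range(n // 2)]

def rss(groups):
    # Residual sum of squares when each group is fit by its own mean;
    # a single group corresponds to the intercept-only model.
    out = 0.0
    for g in groups:
        m = sum(g) / len(g)
        out += sum((v - m) ** 2 for v in g)
    return out

def aic(rss_value, k):
    return n * log(rss_value / n) + 2 * k

aic_null = aic(rss([control + treated]), k=2)  # common mean + variance
aic_trt = aic(rss([control, treated]), k=3)    # site-specific means + variance
print(aic_null > aic_trt)  # the treatment model should win under these settings
```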

Stephens *et al*. (2005) were mistaken when they claimed that NHT can provide ‘the probability with which *H*_{A} could be supported’. NHT does not provide information about the probability of the alternative hypothesis because only the null hypothesis is the subject of the test. I-T methods provide the probability of the alternative hypothesis that the authors seek and both the model weights and the evidence ratios quantify the empirical support for the hypotheses, whether there are two or more such hypotheses.

I-T methods allow us to go a step further in our analysis and make formal inference from multiple models simultaneously (Burnham & Anderson 2002). The {ϕ(·),*p*(·)} and {ϕ(*n*),ϕ(*f*),*p*(·)} apparent survival estimates can be model-averaged to produce an estimate of flood and normal year apparent survival that takes model selection uncertainty into account. The model-averaged estimates are a weighted average of the estimates from the two models, with the weights based on the model probabilities (Burnham & Anderson 2002). The model-averaged variance accounts for sampling variance and variation in parameter estimates across models (Burnham & Anderson 2002). The model-averaged effect size for the dipper example is 0·1263 (SE = 0·0639). Note the estimate is a bit smaller than that for the {ϕ(*n*),ϕ(*f*),*p*(·)} model and the standard error is larger. The difference represents the uncertainty as to which model is actually best in terms of Kullback–Leibler information loss. This estimate is conditional on the set of models considered rather than on a single model. NHT offers no procedure for model averaging or for computing the unconditional estimates of sampling variation or covariation.
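The model-averaged quantities can be reconstructed from values quoted in the text (a hedged sketch, not the authors' computation): under the null model the effect size is 0 with SE 0; under the flood model it is 0·1383 with SE 0·0532; the weights are the model probabilities 0·0868 and 0·9132. The variance form used here, var = Σ *w*_{i}(var_{i} + (θ̂_{i} − θ̄)²), is one of the unconditional-variance estimators discussed by Burnham & Anderson:

```python
from math import sqrt

# Model-averaged effect size and unconditional SE for the dipper example,
# using the per-model estimates, SEs and Akaike weights quoted in the text.
weights = [0.0868, 0.9132]      # {phi(.),p(.)}, {phi(n),phi(f),p(.)}
estimates = [0.0, 0.1383]       # effect size is 0 by definition under the null
ses = [0.0, 0.0532]

theta_bar = sum(w * t for w, t in zip(weights, estimates))
var_bar = sum(w * (se ** 2 + (t - theta_bar) ** 2)
              for w, t, se in zip(weights, estimates, ses))
print(round(theta_bar, 4), round(sqrt(var_bar), 4))  # ≈ 0.1263 and ≈ 0.064
```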

It remains true that NHT is not wrong, but it is relatively uninformative in most cases. The scientific method in combination with NHT has increased knowledge since its formalization. However, the theory underlying NHT is weak in that it is based on the probability of the data, given the null model. We believe I-T approaches represent an improved methodology because these methods encourage greater a priori thinking about plausible scientific hypotheses (even if there are only two) and because the outputs are more directly interpretable, regardless of the problem sophistication.