#### Data sources

Data were obtained from three sources. First, every pig herd is required to register with the Danish Central Husbandry Register. This provided a unique identifier (the CHR number), details of farm location, herd size and the number of sows in the herd.

The second source of data was the central database of the DSSCP. We used the results from 9735 farms in 2003 (*n* = 578 260 individual samples) for initial model building. To investigate our different sampling schemes, the DSSCP database also provided results from the 8151 farms that were sampled in both 2003 and 2004. Details retrieved from the DSSCP database included the CHR number, the date of sampling and the result of the Danish-mix ELISA (DME). This test measures antibodies in meat-juice to determine the previous exposure of finisher pigs to *Salmonella* spp. and can detect O-antigens from at least 93% of all serovars known to be present in Danish pigs (Mousing et al., 1997). The principal advantages of serological methods for *Salmonella* detection are the ability to assay a large number of samples rapidly and at relatively low cost (2€ per sample), and high sensitivity when compared with bacteriology. For these analyses, an ELISA optical density percentage (OD%) greater than 20 is classified as positive. This is equivalent to an adjusted OD% of greater than 10: the cut-off for positivity that has been used by the DSSCP since 1 August 2001 (Alban et al., 2002). All samples included in this study were analysed at the Danish Institute for Food and Veterinary Research using the DME. On the basis of testing, herds receive a monthly ‘serological *Salmonella* index’, which is based on a weighted average of the results from the previous three months. The index levels are: low or no antibodies (index 0–39); medium (index 40–69); and high (index 70 or greater) (Alban et al., 2002). Herds at the medium and high index levels receive reduced payments for finisher pigs sent to slaughter and must collect pen-faecal samples to determine the subtype and distribution of *Salmonella* in the herd.
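The classification rules in this paragraph can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the index is assumed to be computed from monthly percentages of positive samples, and the `weights` argument is a placeholder because the text does not state the weights used by the programme.

```python
def is_positive(od_percent, cutoff=20.0):
    """A sample is positive when its OD% exceeds the cut-off (20 in the text)."""
    return od_percent > cutoff


def serological_index(monthly_percent_positive, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the results from the previous three months.

    ASSUMPTION: equal weights are a placeholder; the programme's actual
    weights are not given in the text.
    """
    if len(monthly_percent_positive) != 3 or len(weights) != 3:
        raise ValueError("expected three monthly results and three weights")
    weighted = sum(w * p for w, p in zip(weights, monthly_percent_positive))
    return weighted / sum(weights)


def index_level(index):
    """Map an index value to the three levels described in the text."""
    if index >= 70:
        return "high"
    if index >= 40:
        return "medium"
    # ASSUMPTION: non-integer values between 39 and 40 are treated as low.
    return "low"
```

For example, `index_level(serological_index([30, 60, 90]))` places a herd with those three monthly results at the medium level under equal weights.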

The third source of data was the Danish Specific Pathogen Free (SPF) Company which provided health status details associated with each farm.

We chose to analyse data from 2003 and 2004 because we had access to additional farm-level details such as herd size, health status and the number of sows on the farm for those years. We proposed fitting a model to data from 2003 to inform sampling strategies for the subsequent year (estimation). We then fitted a model to the 2004 data and used it to assess how successful the sampling strategies chosen from the 2003 data were (prediction).

#### Model development for the sampling schemes

The frequency histogram of the herd-level prevalence based on the actual test results from the OHS sampling strategy for 2003 and 2004 (Figure 1) showed a large amount of variation with a predominance of test-negative herds. Test-negative herds can be of two types: (i) those that are truly uninfected, so that every sample is negative; and (ii) those that are in fact infected but provide insufficient samples to detect the presence of infection. This led us to propose a zero-inflated binomial (ZIB) approach to model herd-level *Salmonella* prevalence, as it reflected our understanding of what is happening on the farm. The ZIB model has two herd-level outcomes: the probability of infection and, conditional on infection being present, an estimate of herd-level seroprevalence. This type of modelling provides an added advantage over logistic regression: the ability to assess the extent of the similarities and differences between factors affecting herd infection status (invasion) and those affecting the seroprevalence in infected herds (persistence and spread).

Variables that might explain both the presence of infection and herd-level prevalence included herd size, farm location, the number of sows present and herd health status. Herd size was the actual number of slaughter pigs produced for the year; this was centred by subtracting the mean and scaled by dividing by 1000. Farm location was a binary variable: a herd located in the Sonderjylland district was coded as 1, otherwise 0. Health status was a three-level categorical variable: conventional, SPF and SPF with *Mycoplasma*. The presence of sows was expressed as a three-level ordinal variable: farms with no sows, farms with fewer than 125 sows (some) and farms with over 125 sows (many).
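A minimal sketch of this covariate coding is given below. The function names and category labels are hypothetical, and the treatment of a farm with exactly 125 sows is an assumption, since the text defines only "less than 125" and "over 125".

```python
from statistics import mean


def centre_herd_size(sizes):
    """Subtract the mean herd size, then divide by 1000."""
    m = mean(sizes)
    return [(s - m) / 1000 for s in sizes]


def code_location(district):
    """1 for herds in the Sonderjylland district, otherwise 0."""
    return 1 if district == "Sonderjylland" else 0


def code_sows(n_sows):
    """Three-level ordinal coding of the number of sows on the farm."""
    if n_sows == 0:
        return "none"
    if n_sows < 125:
        return "some"
    # ASSUMPTION: a farm with exactly 125 sows is placed in "many".
    return "many"
```

Dividing by 1000 keeps the herd-size coefficient on a scale comparable to those of the categorical covariates, so it is interpreted per 1000 slaughter pigs.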

Logistic regression modelling was used for initial model building. Bivariate analyses found all covariates significant at the *P* ≤ 0.25 level and, using data from 2003, we built a multivariable model within the statistical software R, version 2.5.1 (Ihaka and Gentleman, 1996). The outcome variable was seroprevalence, defined as the number of cases divided by the number of samples taken. All putative risk factors were significant. The continuous variable herd size was checked for linearity in the log odds (Hosmer and Lemeshow, 1989). Polynomials of herd size and biologically plausible two-way interaction terms between the main-effect variables were considered for inclusion.

Once satisfied with the model structure, we developed a logistic model within a Bayesian framework using WinBUGS version 1.4.1 (Gilks et al., 1994). The code for the model is shown in Fig. 2. Initially, we stipulated informed priors for the intercept term and for the covariates relating to location, health status and the number of sows present on farm. We based these on published literature supplying subjective information about the likelihood ascribed to various combinations of covariate values (Congdon, 2001). For example, from earlier work on other data from the Danish *Salmonella* surveillance-and-control programme, we believed that SPF health status would be a protective factor for a herd (Benschop et al., 2008b). Moreover, residing in the district of Sonderjylland in the south of Jutland would be a risk factor for herd-level sero-positivity (Benschop et al., 2008a). Based on available literature, an increased number of sows on the farm was considered a risk factor for *Salmonella* in finishers (Hautekiet et al., 2008).

Priors for the Bayesian logistic regression model were expressed in terms of a conjugate beta density (Congdon, 2001). We used a non-informed, normally distributed prior centred at zero and with a variance of 1 for the effect of herd size, because information about the effect of this variable on sero-positivity was uncertain or conflicting. Three chains were run and convergence was judged to have occurred on the basis of visual inspection of time series plots and Gelman-Rubin plots (Toft et al., 2007). The length of the chain was determined by running sufficient iterations to ensure that the Monte Carlo standard error for each parameter was less than 5% of the posterior standard deviation. A total of 40 000 iterations were run with a ‘burn in’ of 4000 iterations.
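The chain-length criterion can be illustrated with a batch-means estimate of the Monte Carlo standard error (MCSE). The sketch below is one common estimator and is not taken from the paper; whether it matches the calculation used by the software is an assumption, and the function names and the choice of 20 batches are hypothetical.

```python
import math
from statistics import mean, stdev


def batch_means_mcse(draws, n_batches=20):
    """Estimate the MCSE of the posterior mean from one chain of draws."""
    batch_size = len(draws) // n_batches
    if batch_size < 1:
        raise ValueError("not enough draws for the requested number of batches")
    batch_avgs = [
        mean(draws[b * batch_size:(b + 1) * batch_size])
        for b in range(n_batches)
    ]
    # The spread of the batch averages reflects autocorrelation in the chain.
    return stdev(batch_avgs) / math.sqrt(n_batches)


def mcse_small_enough(draws, threshold=0.05, n_batches=20):
    """True when the MCSE is below 5% of the posterior standard deviation."""
    return batch_means_mcse(draws, n_batches) < threshold * stdev(draws)
```

A chain that mixes well passes this check after relatively few iterations, whereas a strongly autocorrelated chain needs many more before its MCSE falls below the threshold.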

The logistic regression model was extended to a zero-inflated binomial model and specified as follows:

$$\mathrm{cases}[i] \sim \mathrm{Binomial}(p[i],\ \mathrm{pop}[i])$$

Here, the number of cases from the *i*th herd, cases[*i*], is binomially distributed as a function of the number of trials (tests for *Salmonella* antibodies in meat-juice), pop[*i*], and the probability of a test being positive (adjusted OD% > 10), *p*[*i*].

We further defined:

$$p[i] = J[i] \times \mathrm{rho}[i]$$

where *J*[*i*] is an indicator variable representing the infection status of the *i*th herd and rho[*i*] is the sero-prevalence conditional on the presence of infection. The term *rho* therefore represents the probability of finding infection in a randomly chosen pig from an infected herd. The latent variable *J*[*i*] is distributed as:

$$J[i] \sim \mathrm{Bernoulli}(q[i])$$

where *q*[*i*] is the probability of a herd being infected. This latent variable was modelled as:

$$\mathrm{logit}(q[i]) = \beta_0 + \sum_{m=1}^{4} \beta_m x_m[i] + A_i \qquad (1)$$

In Equation 1, the logit of the observed probability of the *i*th herd being infected, logit(*q*[*i*]), was modelled as a function of *m* = 4 farm-level explanatory variables *x*_{m}[*i*] (herd size, location, the number of sows present and health status), with intercept *β*_{0} and regression coefficients *β*_{m}, and a random effect term, *A*_{i}, which was normally distributed with a mean of zero and precision *σ*. For the ZIB model, the continuous variable herd size was categorized to facilitate model convergence. The categories chosen were the same as those used in the DSSCP (Alban et al., 2002).

The latent variable rho[*i*] was modelled as:

$$\mathrm{logit}(\mathrm{rho}[i]) = \gamma_0 + \sum_{m=1}^{4} \gamma_m x_m[i] + B_i \qquad (2)$$

In Equation 2, the logit of the probability of observing infection in a randomly chosen pig from the *i*th infected farm was modelled as a function of the four farm-level explanatory variables defined earlier, with intercept *γ*_{0} and regression coefficients *γ*_{m}, and a random effect term for herd, *B*_{i}, which was normally distributed with a mean of zero and precision *τ*.
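To make the zero inflation concrete, the probability of observing a given number of positive samples can be written by summing over the latent infection status. The sketch below is an illustration under the definitions above, not the authors' code; the function name `zib_pmf` is hypothetical, and `q`, `rho` and `n` stand for *q*[*i*], rho[*i*] and pop[*i*].

```python
from math import comb


def zib_pmf(k, n, q, rho):
    """P(k positives out of n tests) under a zero-inflated binomial.

    An uninfected herd (probability 1 - q) always yields k = 0; an infected
    herd (probability q) yields a Binomial(n, rho) count.
    """
    binom = comb(n, k) * rho ** k * (1 - rho) ** (n - k)
    if k == 0:
        return (1 - q) + q * binom
    return q * binom
```

With `q = 0.5`, `rho = 0.05` and `n = 60`, the call `zib_pmf(0, 60, 0.5, 0.05)` adds the chance that the herd is uninfected to the chance that an infected herd returns 60 negative samples, which is why test-negative herds dominate even when many herds are infected.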

We set non-informed, normally distributed priors centred at zero and with a precision of 0.5 for each of the fixed effect terms, including the intercept. Sensitivity to these priors was evaluated by re-running the models with precisions of 1 and 0.2. For the precision of the random farm-level effects, *σ* and *τ*, we specified a precision of 1. Sensitivity to these priors was evaluated by re-running the models with precisions of 0.5 and 0.3.

Three chains were run and convergence was judged to have occurred on the basis of visual inspection of plots of the sampled values as a time series (Toft et al., 2007). The required number of iterations of the Gibbs sampler was determined by running sufficient iterations to ensure the Monte Carlo standard errors for each parameter were less than 5% of the posterior standard deviations. A total of 30 060 iterations were run with a ‘burn in’ of 1000 iterations.

We proposed fitting this model to the 2003 data to inform sampling strategies for the subsequent year (estimation). We then fitted a model to the 2004 data and used it to assess how successful the sampling strategies chosen from the 2003 data were by, for example, comparing the number of false negatives (prediction).

To check for consistency between years (2003 and 2004), we examined model outputs from both years of data separately and compared the magnitude and direction of the regression coefficients. The 8151 random farm-level effects for the 2 years were compared using scatter-plots and quantified using Lin’s concordance correlation coefficient (Lin, 1989).
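For reference, Lin's concordance correlation coefficient measures how closely paired values fall on the line of perfect agreement, penalising both scatter and systematic shift. The sketch below uses the standard formula with variances computed with a divisor of *n*; the function name `lin_ccc` is hypothetical, and `x` and `y` stand for the paired farm-level effects from the two years.

```python
from statistics import mean


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least two values")
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # A difference in means lowers the coefficient even when correlation is 1.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)
```

Unlike Pearson's correlation, the coefficient falls below one when the two years differ by a constant offset, so it tests agreement rather than association alone.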

A scatter plot of the median conditional sero-prevalence rho[*i*] versus the median probability of infection *q*[*i*] (Fig. 3) was used to identify the cut-off for the two model-derived risk-based sampling schemes, MRBA and MRBB.

#### Comparison of sampling schemes

The results from all four sampling schemes were compared by considering cost, the number of false-negative farms and the number of farms detected with a within-herd sero-prevalence of ≥0.40.

Costs were compared by adding up the number of tests taken under each of the four sampling schemes. Only the costs of meat juice testing were taken into account, with each meat juice sample tested costing 2€. These costs are borne by the producers through levies on each pig slaughtered. Follow-on tests costing 200€ are required once herds reach level 2 or 3, with further costs if herds are found to be positive. These follow-on tests were not considered further in this study.

For each farm (*n* = 8151) there were 1020 iterations stored from the model and these were used to determine the false-negative rate and the number of farms detected with a within-herd sero-prevalence of ≥0.40 for each of the four sampling schemes.

The number of farms that were falsely reported as negative and the sensitivity of each of the four sampling schemes were determined using the following process:

- (a) the *J*[*i*] parameter, the indicator variable representing infection status of the *i*th herd, for 2004 was examined at each iteration. If it equalled one, then, for that iteration, the farm was considered infected. Otherwise, for that iteration, the farm was considered uninfected;

- (b) rho[*i*], the predicted within-herd seroprevalence given the herd was infected, for 2004 was determined for each iteration when the farm was infected. rho[*i*] was combined with the number of pigs sampled, using the binomial distribution, to determine the number of positives that would be detected at each iteration;

- (c) a false-negative iteration was defined as one where the farm was infected at the iteration, but no positives were detected at that iteration. The number of false-negative iterations was summed over all farms and divided by the number of iterations to give the expected number of false-negative farms;

- (d) this was expressed as the sensitivity of the sampling scheme by dividing the number of false-negative farms by the total number of farms (*n* = 8151), and subtracting this fraction (the false-negative fraction) from one.
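Steps (a) to (d) can be sketched as a small simulation. This is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, `j_draws[f][t]` and `rho_draws[f][t]` are taken to hold the stored posterior draws of *J* and rho for farm `f` at iteration `t`, `n_sampled[f]` is the number of pigs sampled on that farm under a given scheme, and the random seed is arbitrary.

```python
import random


def false_negative_summary(j_draws, rho_draws, n_sampled, seed=0):
    """Return (expected number of false-negative farms, sensitivity)."""
    rng = random.Random(seed)
    expected_fn = 0.0
    for j_farm, rho_farm, n in zip(j_draws, rho_draws, n_sampled):
        fn_iterations = 0
        for j, rho in zip(j_farm, rho_farm):
            if j != 1:
                continue  # step (a): farm uninfected at this iteration
            # step (b): binomial draw of positives among the n pigs sampled
            positives = sum(rng.random() < rho for _ in range(n))
            if positives == 0:
                fn_iterations += 1  # step (c): infected but nothing detected
        expected_fn += fn_iterations / len(j_farm)
    # step (d): one minus the false-negative fraction over all farms
    sensitivity = 1 - expected_fn / len(j_draws)
    return expected_fn, sensitivity
```

Running the function once per sampling scheme, with that scheme's `n_sampled`, gives directly comparable figures because the same posterior draws are reused each time.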

The number of farms that were predicted to have an observed seroprevalence of ≥0.40 for each of the four sampling schemes was determined using the following process:

1. the number of positives detected in each herd for each iteration was determined as in steps (a) and (b) above;

2. the number of positives was divided by the number sampled to give the observed seroprevalence in each herd at each iteration;

3. these numbers were summed and divided by the number of iterations to obtain the expected number of herds with observed seroprevalences of ≥0.40.
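One reading of this procedure is sketched below: at each iteration a herd is counted when its simulated observed seroprevalence reaches the cut-off, and the counts are averaged over iterations and summed over herds. That interpretation of the final step is an assumption, as are the function name, the input layout (`j_draws[f][t]`, `rho_draws[f][t]`, `n_sampled[f]`) and the arbitrary random seed.

```python
import random


def expected_high_prevalence_farms(j_draws, rho_draws, n_sampled,
                                   cutoff=0.40, seed=0):
    """Expected number of herds with observed seroprevalence >= cutoff."""
    rng = random.Random(seed)
    expected = 0.0
    for j_farm, rho_farm, n in zip(j_draws, rho_draws, n_sampled):
        hits = 0
        for j, rho in zip(j_farm, rho_farm):
            # Step 1: positives are simulated only when the herd is infected.
            positives = sum(rng.random() < rho for _ in range(n)) if j == 1 else 0
            # Step 2: observed seroprevalence at this iteration.
            if n > 0 and positives / n >= cutoff:
                hits += 1
        # Step 3: average over iterations, then accumulate over herds.
        expected += hits / len(j_farm)
    return expected
```

Because the comparison uses the observed rather than the underlying seroprevalence, schemes that take few samples per herd give noisier counts near the cut-off.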