## Introduction

The infection rate of *Wolbachia* is generally estimated to be at least 20% (Werren *et al*., 1995; Werren & Windsor, 2000). This estimate is based on several *Wolbachia* screenings in which arthropod species, mainly insects, were tested for infection. In most cases, only one individual per species was tested; we refer to these as one-individual samples. One study reports a much higher infection rate of 76% (Jeyaprakash & Hoy, 2000). However, that study used a ‘long PCR’ method that is much more sensitive to trace *Wolbachia* molecules, so environmental contaminants are more likely to be detected. In contrast, most other studies, which used standard PCR techniques, give consistent estimates of infection levels (Table 1).

Study | Number of samples | Proportion of infections (%) |
---|---|---|
Werren & Windsor (2000) | 141 | 20 |
Werren et al. (1995) | 139 | 15 |
West et al. (1998) | 53 | 15 |
Kikuchi & Fukatsu (2003) | 103 | 31 |
Nirgianaki et al. (2003) | 23 | 0 |
Tagami & Miura (2004) | 20 | 25 |
Gotoh et al. (2003) | 21 | 0 |
Total^{*} | 547 | 19 |
Jeyaprakash & Hoy (2000) | 62 | 73^{†} |

- ^{*} Includes one-individual samples from all 20 studies.
- ^{†} Differs from 76% because for two species five individuals were tested, which are excluded here.

The following problem arises in studies based on one or a few individuals per species. If a tested individual is infected, the species is rightly classified as infected. If the one or few individuals tested are uninfected, however, the species is classified as uninfected. This method works only when infection frequencies within infected populations are always high. However, low infection frequencies have also been reported. For instance, Tagami & Miura (2004) found only 3.1% of the Japanese butterfly *Pieris rapae* to harbour *Wolbachia*. The probability of detecting this infected species would have been low if only a single specimen had been tested. Furthermore, infection levels may depend, in part, on the mode of reproductive manipulation induced by *Wolbachia*; for instance, male-killers are expected to occur at lower frequencies (5–50%) within species than those causing cytoplasmic incompatibility (CI) (Hurst & Jiggins, 2000). There is also theoretical (Turelli, 1994; Flor *et al*., 2007) and empirical (Hoffmann *et al*., 1998) evidence that CI-infected individuals can occur at intermediate or low frequencies. Thus, because within-species infection frequencies differ across species, it is reasonable to assume that the *c*. 20% infection level found in several studies by testing a few individuals per species is an underestimate.
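The effect of sample size on detection can be illustrated with a minimal sketch. It assumes that individuals are sampled independently and that every infected individual tested is detected; both are simplifying assumptions made here for illustration, and the function name `detection_probability` is hypothetical. Under these assumptions, a species with within-species infection frequency *q* is classified as infected with probability 1 − (1 − *q*)^{n} when *n* individuals are tested.

```python
def detection_probability(q, n):
    """Probability that at least one of n tested individuals is infected.

    q: within-species infection frequency (between 0 and 1)
    n: number of individuals tested
    Assumes independent sampling and perfect detection (illustrative only).
    """
    return 1.0 - (1.0 - q) ** n


# The 3.1% frequency reported for Pieris rapae, with a few sample sizes.
for n in (1, 10, 100):
    print(n, round(detection_probability(0.031, n), 3))
```

With a single specimen the detection probability equals the infection frequency itself, which is why species with low infection frequencies are easily missed in one-individual samples.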

Here we present a meta-analysis of 20 different studies investigating the frequency of *Wolbachia*, and develop a statistical approach to estimate the overall frequency of *Wolbachia*-infected species. We show that studies where >100 individuals per species were tested tend to be biased towards infected species. Correcting for this bias, we estimate that 66% of species are infected with *Wolbachia*. It should be emphasized that this estimate was not achieved using the approach of Jeyaprakash & Hoy (2000); that study was excluded from the analysis because its infection estimate is an outlier relative to other samples and because of the highly sensitive PCR methods used. Rather, the estimate is derived from studies that routinely give 15–30% infection rates when one individual per species is tested, by extrapolating from these the expected percentage of infected species among arthropods.

By applying a beta-binomial model, we can estimate a function describing the distribution of infection frequencies within species, and provide an estimate of the total percentage of infected species. This work aims to investigate the degree to which the frequency of *Wolbachia* has been underestimated in previous studies and to identify the sampling methods necessary to obtain estimates of the distribution of *Wolbachia* within and among species.

### Data analysis

We summarized data from 20 different *Wolbachia* screenings (Werren *et al*., 1995; Breeuwer & Jacobs, 1996; Bouchon *et al*., 1998; West *et al*., 1998; Kondo *et al*., 1999; Plantard *et al*., 1999; Werren & Windsor, 2000; Jiggins *et al*., 2001; Ono *et al*., 2001; Van Borm *et al*., 2001; Shoemaker *et al*., 2002; Vavre *et al*., 2002; Gotoh *et al*., 2003; Kikuchi & Fukatsu, 2003; Nirgianaki *et al*., 2003; Rasgon & Scott, 2003; Rokas *et al*., 2002; Shoemaker *et al*., 2003; Thipaksorn *et al*., 2003; Tagami & Miura, 2004). These 20 studies include data from 9432 individuals of 917 arthropod species.

The data show that the frequency of infected species increases with the number of individuals tested. Part of this trend is likely due to studies with large sample sizes having focused on species already known to be infected, in order to determine infection frequencies within species more precisely (Van Borm *et al*., 2001; Rasgon & Scott, 2003). In contrast, studies comprising predominantly one-individual samples of unknown infection status aimed at determining the overall infection frequency among various arthropod species (Werren *et al*., 1995; Werren & Windsor, 2000). Thus, the complete data set does not represent an unbiased sample. We deal with this issue by using both the complete data set and supposedly less biased subsets in a statistical analysis to estimate overall species infection frequencies. We then test the different data sets for bias. A further problem is that different orders might not be evenly represented in the samples because of collection methods. Some studies focus on single insect orders; others screen individuals from various species and orders. These limitations reduce the reliability of the resulting estimates. Nevertheless, the estimates serve as a first attempt to interpret existing data.

Our goal is to estimate the total proportion of infected species as well as to describe the distribution of infection frequencies within species. Both can be achieved using a beta-binomial model (Böhning, 1999; Carlin & Louis, 2000). The beta-binomial model considers *N* random variables *X*_{j}, which are all binomially distributed, but each with different parameters *q*_{j} and *n*_{j}, so that *X*_{j}∼*Bin*(*q*_{j}, *n*_{j}). The parameters *q*_{j} of the species-specific binomial distributions are assumed to themselves follow a distribution. If this distribution is the beta distribution, the conditions to apply a beta-binomial model are fulfilled.
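As a minimal sketch of this model, the probability of observing *k* infected individuals among *n* tested, when the species-specific frequency *q*_{j} is drawn from a beta distribution with parameters α and β, can be written with the standard beta-binomial probability mass function. The function names and the parameter values below are hypothetical and chosen only for illustration; they are not estimates from the data.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ Bin(q, n) with q ~ Beta(alpha, beta)."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def prob_at_least_one_infected(n, alpha, beta):
    """Probability that a sample of n individuals contains an infected one."""
    return 1.0 - beta_binomial_pmf(0, n, alpha, beta)


# Illustrative parameter values only (not fitted to any data set).
alpha, beta = 0.5, 1.5
for n in (1, 2, 10, 100):
    print(n, prob_at_least_one_infected(n, alpha, beta))
```

For *n* = 1 this probability reduces to the mean of the beta distribution, α/(α + β), so under the model a one-individual sample measures the average within-species infection frequency rather than the proportion of species that carry the infection.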

The beta distribution depends on two parameters α and β, which are to be estimated within the framework of a beta-binomial model [for details, see Böhning (1999); Carlin & Louis (2000)]. To obtain the estimates and thus the distribution of the infection frequency within species, we apply a procedure consisting of the following three steps:

1. Determination of the moment estimators of the mean and variance by Eqns (1) and (2), where *X*_{j} is the number of infected individuals, *n*_{j} is the number of individuals tested of species *j* and *N* is the number of tested species.
2. Determination of α and β by Eqns (3) and (4).
3. Determination of the overall infection rate *x* by integrating the distribution of the infection rates within species, which is a function of both estimated parameters α and β [Eqn (5)], where *c* defines a threshold frequency below which species are considered to be uninfected.
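The three steps can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Eqns (1)–(5): step 1 is assumed to be a sample-size-weighted mean and variance of the observed frequencies *X*_{j}/*n*_{j}, step 2 is assumed to be the textbook moment-matching relations for a beta distribution with a given mean and variance, and step 3 integrates the beta density from *c* to 1 using a standard continued-fraction evaluation of the regularized incomplete beta function. All function names, the example counts and the value of *c* are hypothetical.

```python
from math import exp, lgamma, log, log1p


def moment_estimates(infected, tested):
    """Step 1 (assumed form): sample-size-weighted mean and variance."""
    n_species = len(tested)
    total = sum(tested)
    mu = sum(infected) / total
    weighted_ss = sum(n * (x / n - mu) ** 2 for x, n in zip(infected, tested))
    s2 = n_species * weighted_ss / ((n_species - 1) * total)
    return mu, s2


def beta_parameters(mu, s2):
    """Step 2 (assumed form): beta distribution with mean mu and variance s2."""
    k = mu * (1.0 - mu) / s2 - 1.0
    return mu * k, (1.0 - mu) * k


def _continued_fraction(a, b, x, max_iter=500, eps=1e-13, tiny=1e-300):
    """Modified Lentz evaluation of the incomplete-beta continued fraction."""
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        odd = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        for term in (even, odd):
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """Cumulative distribution function of Beta(a, b) evaluated at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _continued_fraction(a, b, x) / a
    return 1.0 - front * _continued_fraction(b, a, 1.0 - x) / b


def infected_species_fraction(alpha, beta, c):
    """Step 3: integral of the beta density from the threshold c to 1."""
    return 1.0 - regularized_incomplete_beta(alpha, beta, c)


# Invented counts for illustration only (not data from the 20 studies).
infected = [0, 1, 5, 10]
tested = [10, 10, 10, 10]
mu, s2 = moment_estimates(infected, tested)
alpha, beta = beta_parameters(mu, s2)
print(alpha, beta, infected_species_fraction(alpha, beta, c=0.001))
```

The moment-matching step requires the estimated variance to be smaller than μ(1 − μ); otherwise the resulting parameters are not positive and no beta distribution matches the moments.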

Because the infection frequencies within species are weighted by the respective sample sizes [Eqns (1) and (2)], large samples have a strong impact on the estimation procedure. This can be a problem because large samples might be based on prior knowledge and thus not be independent of the parameter being estimated. This is likely the case for the largest sample, from *Culex pipiens* (Rasgon & Scott, 2003), of which 1090 individuals were tested (1083 were found to be infected). *Culex pipiens* was known to be infected prior to this survey (Yen & Barr, 1973), and this prior knowledge presumably led to the collection and screening of more than a thousand individuals. Among the 13 species with more than 100 individuals tested, 12 harboured *Wolbachia*. This is almost certainly due to researcher bias towards more extensive sampling of species already known to harbour *Wolbachia* infections (Table 2).

Sample size *n* | Number of samples | Infected species (%) |
---|---|---|
1 | 547 | 19 |
2 | 110 | 21 |
10 | 6 | 33 |
≥10 | 115 | 54 |
>100 | 13 | 92 |

To test for the potential biases of larger samples, we determined parameter values for three different sample sets, and then tested these for evidence of bias. Specifically, we determined three different distributions *B*_{(i)}, *B*_{(ii)} and *B*_{(iii)} based on three different data sets: (i) complete data, (ii) without the *C. pipiens* sample (thus *n*_{j}<1000) and (iii) only samples with sample size *n*_{j}<100.

Because some species were known to be infected before sampling, we further evaluated a distribution *B*_{(iv)} based on a data set (iv) that excludes 12 species that were primarily analysed to determine natural infection frequency or *Wolbachia*-induced modifications of the reproductive system.