Nonequivalent lethal equivalents: Models and inbreeding metrics for unbiased estimation of inbreeding load

Abstract Inbreeding depression, the deterioration in mean trait value in progeny of related parents, is a fundamental quantity in genetics, evolutionary biology, animal and plant breeding, and conservation biology. The magnitude of inbreeding depression can be quantified by the inbreeding load, typically measured in numbers of lethal equivalents, a population genetic quantity that allows for comparisons between environments, populations or species. However, there is as yet no quantitative assessment of which combinations of statistical models and metrics of inbreeding can yield such estimates. Here, we review statistical models that have been used to estimate inbreeding load and use population genetic simulations to investigate how unbiased estimates can be obtained using genomic and pedigree‐based metrics of inbreeding. We use simulated binary viability data (i.e., dead versus alive) as our example, but the concepts apply to any trait that exhibits inbreeding depression. We show that the increasingly popular generalized linear models with logit link do not provide comparable and unbiased population genetic measures of inbreeding load, independent of the metric of inbreeding used. Runs of homozygosity result in unbiased estimates of inbreeding load, whereas inbreeding measured from pedigrees results in slight overestimates. Due to widespread use of models that do not yield unbiased measures of the inbreeding load, some estimates in the literature cannot be compared meaningfully. We surveyed the literature for reliable estimates of the mean inbreeding load from wild vertebrate populations and found an average of 3.5 haploid lethal equivalents for survival to sexual maturity. To obtain comparable estimates, we encourage researchers to use generalized linear models with logarithmic links or maximum‐likelihood estimation of the exponential equation, and inbreeding coefficients calculated from runs of homozygosity, provided an assembled reference genome of sufficient quality and enough genetic marker data are available.


| INTRODUC TI ON
Inbreeding depression, the deterioration in mean trait value in progeny of related parents (Crow & Kimura, 1970, chapter 3), is a fundamental quantity in genetics, evolutionary biology, animal and plant breeding, and conservation biology (Charlesworth & Willis, 2009;Hedrick & Kalinowski, 2000;Kristensen & Sorensen, 2005;Wright, 1977). Conceptual and practical advances in these disciplines require accurate and robust estimates of the magnitude of inbreeding depression that can be compared among different traits, among sets of individuals of different ages and sexes, and among different environments, populations or species (Armbruster & Reed, 2005;Fox & Reed, 2010;Hoeck, Wolak, Switzer, Kuehler, & Lieberman, 2015;Kruuk, Sheldon, & Merilä, 2002;Leroy, 2014;Waller, Dole, & Bersch, 2008). These goals in turn require widespread adoption of a standard estimator of the magnitude of inbreeding depression that is unbiased, quantitatively comparable and firmly rooted in population genetic theory.
One such estimator is the inbreeding load, B, measured as the negative slope of a regression of the logarithm of a trait on inbreeding coefficient F (Charlesworth & Charlesworth, 1987;Charlesworth & Willis, 2009;Keller & Waller, 2002). Inbreeding load in viability (i.e., survival versus mortality) is measured in units of "lethal equivalents," where one lethal equivalent corresponds to a group of deleterious alleles that would cause one death on average if made homozygous (Morton, Crow, & Muller, 1956). The number of lethal equivalents can equally be interpreted as the number of deaths that would be expected in a group of hypothetical individuals where each individual carried one deleterious allele in homozygous state (i.e., the group contains as many individuals as there are deleterious alleles; Morton et al., 1956). Hence, one lethal equivalent can correspond to a lethal allele at one locus or to several mildly deleterious alleles at several loci. The concept of lethal equivalents was invented to quantify inbreeding depression in viability (Morton et al., 1956), hence the terminology "lethal." Throughout our study, we use viability data as example. However, the general approach to quantifying inbreeding load as a logarithmic relationship with F can be applied to other fitness components (Charlesworth & Charlesworth, 1987), or indeed to any other trait as long as alleles that improve trait value are, on average, dominant over alleles that reduce trait value or show overdominance (Wolak & Keller, 2014).

In population genetic theory, inbreeding load is defined as
where q i is the frequency of the deleterious allele i , s i is its deleterious effect when homozygous, h i is the dominance coefficient, and the sum is taken over all L biallelic loci at which deleterious alleles can occur (Morton et al., 1956). Morton et al.'s (1956) fundamental insight was that inbreeding load B for trait y can be estimated in the absence of information on q i , s i and h i simply as the slope of a weighted regression of − log e (y) on F, that is with individuals pooled into groups of similar F, and where A is the intercept and y the expected value of the trait for that level of F. This model is itself rooted in population genetics theory and assumes that effects of different environmental and genetic factors act independently and thus have multiplicative effects that translate into additive effects only on the logarithmic scale (Charlesworth & Charlesworth, 1987). It is therefore important that a logarithmic scale is used.
When data are only available for mean trait values of known outbred (y 0 ) and inbred individuals (y F ) with a single known level of F, for example offspring of selfing or full-sibling mating generated in a breeding design, the inbreeding load can be estimated as (Charlesworth & Charlesworth, 1987;Lynch & Walsh, 1998, p. 278).
Such breeding designs are hard to impose in wild, free-living populations or captive populations of endangered animals, but comparable and unbiased estimates of inbreeding load from such populations are key to understanding evolutionary dynamics (Kokko & Ots, 2006) and deciding population management strategies (Caballero, Bravo, & Wang, 2017a,b;Theodorou & Couvet, 2017). Morton et al.'s (1956) regression model (equation 2) provides a conceptually elegant and theoretically well-founded approach for estimating inbreeding load that can be applied given a range of naturally occurring F values.
However, implementation has not been without difficulties that have impeded widespread adoption despite recognition of its useful properties (Keller & Waller, 2002). Indeed, relatively few wild population studies have so far explicitly reported estimates of inbreeding load (Table 1).
One primary problem is that − log e (y) is undefined for any level of inbreeding with a trait mean of zero (e.g., zero survivors), meaning that model 2 cannot be directly fitted across all data.
Multiple alternative statistical models have consequently been advocated (Table 2). Read (1983, 1984) suggested a small sample size correction given group means of zero, but this introduces its own bias (Kalinowski & Hedrick, 1998;Lacy, 1997;Willis & Wiese, 1997). Kalinowski and Hedrick (1998) proposed a model that avoids the issue of undefined logarithms by directly fitting the exponential model y F = y 0 e −BF . Kruuk et al. (2002) extended this model to allow for heterogeneity in outbred survival and inbreeding load among years. García-Dorado, Wang, and López-Cortegano (2016) also developed software to fit this model to individual-level data. Glémin, Vimond, Ronfort, Bataillon, and Mignot (2006) used generalized linear models (GLMs) with a logarithmic link to estimate the regression slope B, pooling groups of individuals with similar levels of inbreeding. As an alternative that does not require calculation of group means, Armstrong and Cassey (2007) and Grueber, Nakagawa, Laws, and Jamieson (2011) suggested the use of GLMs and generalized linear mixed models (GLMMs) with various link functions and error distributions. As an alternative to the conditional GLMMs, Fredrickson, (2) − log e (y) = A + BF, Woolf, and Hedrick (2007) used generalized estimating equations (GEE) to obtain marginal estimates of the number of lethal equivalents. These GLMM and GEE models can easily be applied to individual survival data and, in principle, readily allow estimation of variation in inbreeding depression across ages, sexes or environments. Additional but more rarely used models can be found in Makov and Bittles (1986), Ralls, Ballou, and Templeton (1988), Lee, Lascoux, and Nordheim (1996), Lascoux and Lee (1998) or Hedrick, Hellsten, and Grattapaglia (2016). However, as we will show, some of these models do not preserve the population TA B L E 1 Estimates of inbreeding load from wild vertebrate populations obtained with unbiased statistical models. All studies calculated inbreeding coefficients from pedigree data (i.e., F ped ). The model used to estimate inbreeding load is coded 1 for logarithmic regression or class comparisons similar to the model proposed by Morton et al. (1956) or 2 for maximum-likelihood estimation of an exponential relationship. The life stage column indicates the time frame over which survival was assessed. The next five columns list haploid inbreeding load B for traits assigned to the following life stages: survival in juveniles (Juv.), survival until approximately half the age of sexual maturity (50%), survival until approximately sexual maturity (100%), survival in adults (Ad.) and reproductive traits (Rep.). The last column lists the publication that reported the inbreeding load or that reported the data used to calculate the inbreeding load The estimates for traits marked with an asterisk * are based on our reanalysis of available data. Rationales and methods are described in the R code in the Supporting Information, which also explains why some estimates are omitted. The high estimate of Kruuk et al. (2002) is based on a large data set, but that only includes 22 inbred pairings. Jimenez et al. (1994) estimated adult survival across a 3-week period (approximately 117-138 days of age), which appears to be the period leading to the largest difference between inbred and outbred individuals (their Figure 2). genetic assumptions (additivity on a logarithmic scale) underlying Morton et al.'s (1956) original derivation and, hence, do not yield comparable unbiased estimates of the inbreeding load.
All these models (Table 2) have in common that they require some metric of the inbreeding coefficient, F, of focal individuals (Table 3).
Pedigrees allow estimation of inbreeding coefficients (F ped ) that measure the expected amount of identity by descent of an individual (Wright, 1969, chapter 7). However, Mendelian sampling and linkage cause realized identity by descent to deviate from its expectation (Franklin, 1977;Hill & Weir, 2011;Knief, Kempenaers, & Forstmeier, 2017;Leutenegger et al., 2003;Stam, 1980). Further, wild population pedigrees usually encompass limited numbers of generations and typically contain errors and missing data which can cause bias and error in F ped (Knief et al., 2015;Wang, 2014). Recent developments in DNA sequencing technologies and resulting genomic data are now opening opportunities to quantify realized identity by descent and hence quantify inbreeding load through genomic rather than traditional pedigree-based approaches (Curik, Ferenčaković, & Sölkner, 2014;Hoffman et al., 2014;Kardos, Taylor, Ellegren, Luikart, & Allendorf, 2016;Keller, Visscher, & Goddard, 2011). Several methods to estimate inbreeding coefficients from genomic data are available. In the absence of an assembled reference genome, F can be quantified as a deviation in observed heterozygosity from its expectation based on Hardy-Weinberg equilibrium (Wang, 2014(Wang, , 2016. If an assembled reference genome is available, chromosomal regions can be identified that are homozygous in an individual, and the proportion of the genome in such "runs of homozygosity" is then used to calculate F ROH (McQuillan et al., 2008). Because F ROH is calculated as a proportion, it ranges from 0 to 1, as does F ped , while metrics based on deviation from Hardy-Weinberg equilibrium include positive and negative values (Table 3). Thus, the various estimators of inbreeding differ not only in data requirements and meaning, but also in some of their properties, such as range, mean and variance. These differences may affect resulting estimates of inbreeding load (Kardos, Nietlisbach, & Hedrick, 2018;Yengo et al., 2017).
Despite the need for comparable and unbiased estimates of inbreeding load across diverse natural populations and the increasing diversity of available statistical models ( Table 2) and metrics of F (Table 3), there is as yet no quantitative assessment of which combinations of models and metrics can yield the requisite estimates. Such assessments must themselves be consistent with underlying population genetic theory. Accordingly, we used population genetic simulations to investigate how unbiased measures of inbreeding load can be obtained using genomic and pedigree-based estimates of inbreeding and thereby provide a generally applicable roadmap for future studies.

| MATERIAL S AND ME THODS
We conducted two sets of independent simulations in this study. TA B L E 2 Summary of models for estimation of inbreeding load. The names of these models are used in Figure 1. Details for all models are described in Supporting Information 1, and the models are illustrated in Figure S4 in Supporting Information 1. For the model "GLM logit-link," we used F = 0 and F = 0.25 for predictions, but see Supporting Information 1 for a discussion of the effects of the arbitrary choice of these levels then the set of phenotypic simulations (where F directly affects survival) and finally the set of genetically explicit simulations (where survival is affected by simulated genotypes).

| Genetic simulations of metapopulations
We conducted genetically explicit simulations using Nemo v2.3.46r4 (Guillaume & Rougemont, 2006). To represent patterns of inbreeding that can emerge in natural vertebrate populations, simulations were loosely inspired by a song sparrow (Melospiza melodia) metapopulation on the Gulf Islands in British Columbia, Canada, which is known to express considerable among-individual variation in the degree of inbreeding and to show inbreeding depression in fitness traits (Keller, 1998;Nietlisbach et al., 2017;Reid et al., 2014;Smith, Keller, Marr, & Arcese, 2006;Wilson & Arcese, 2008; Table 1).
We  known attributes (Laine et al., 2016). Nemo then distributed these loci randomly within the chromosomes (see also Nietlisbach et al., 2017). Our simulations follow the genetic model of Morton et al. (1956) by assuming no epistasis. We also did not simulate overdominant loci. We will revisit these assumptions in the Discussion.
Selection coefficients s i were drawn from an exponential distribution with mean s = 0.03, a value in the middle of empirical es-  , 5, 10, 20), two different intercepts of (a) A = 0.25 or (b) A = 0.75, and 791 individuals with realistic F values and binary survival events y F sampled with survival probabilities Π F = e −A-BF . We quantified inbreeding load using the models summarized in Table 2 and illustrated in Figure S4 in Supporting Information 1. Inbreeding load was estimated as the slope of a Poisson generalized linear model with logarithmic link function ("GLM log-link"), with an exponential model ("exponent. ML"), by weighted regression either without (" Morton et al.") or with the small sample size correction of Read (1983, 1984)   with h = 0.1 (Wang et al., 1999). Due to the exponential distribution of s i , the simulated mean dominance coefficient was 0.18, a value close to empirical mean estimates of 0.2-0.4 (reviewed by Wang et al., 1999) or 0.1-0.3 (reviewed by Lynch & Walsh, 1998, p. 286). The resulting distributions of s i (range from 1.25 × 10 −5 to 0.22) and h i (range from 2.94 × 10 −6 to 0.50), and their relationship are shown in Figure S1 in Supporting Information 1.
Mutation rate at neutral and deleterious loci was set to 0.0002.
Mutation rate and number of deleterious loci were chosen in conjunction so that a diploid individual would experience on average one new deleterious mutation, a value compatible with empirical data (Lynch & Walsh, 1998, p. 351;Wang et al., 1999). Due to con- Pedigree-based inbreeding coefficients F ped (Wright, 1969;chapter 7) were calculated based on the previous 20 generations of the metapopulation pedigree (i.e., since generation 4,976, yielding a pedigree of up to 25 generations) using the R package pedigreemm (Vazquez, Bates, Rosa, Gianola, & Weigel, 2010). Three genomic metrics of F were calculated using neutral loci (Table 3). Although some loci with deleterious effects may be part of empirical data sets, we excluded them here because realistic genomic data sets are unlikely to contain all deleterious loci and many of them would be excluded due to minor allele frequency cut-offs.  (Purcell et al., 2007).
The second metric F alt is similar to F H in that it also provides a metric of inbreeding relative to reference allele frequencies, but it differs in that homozygous genotypes are weighted with the inverse of their allele frequency (Yang et al., 2010 (2016) and Bérénos, Ellis, Pilkington, and Pemberton (2016), and F UNI by For all four metrics of F, we calculated mean and variance across all individuals per deme (excluding immigrants).

| Comparison of statistical models to estimate inbreeding load
To investigate which of the five focal statistical models for estimation of inbreeding load (Table 2) provided unbiased estimates of B, we conducted a set of simulations in R that assumed F values were known precisely and were directly affecting fitness. This set of simulations was not genetically explicit, to allow a comparison of statistical models without adding the complexity of potential biases in metrics of F that could arise if survival probability and F were both estimated from genetic data. Consequently, the performance of different metrics of F as proxies for genotypes at loci with deleterious alleles will be addressed in the second set of (genetic) simulations below. Random errors in F or fitness, however, did not affect results here ( Figures S2 and S3 in Supporting Information 1). To obtain realistic distributions of F values for this set of simulations, we used the F ROH values of a single deme simulated in Nemo (791 individuals in total). Using these F values as input, we calculated the expected survival probability F for each class of individuals with inbreeding coefficient F as We then used F to create 791 individual survival events (y F = 0: dead, y F = 1: alive) by sampling survival events from a Bernoulli distribution with success probability F . Hence, the individual survival events y F were Bernoulli distributed with residual variance F ⋅(1 − F ) around the expectation π F . The intercept A was set to 0.25 or 0.75, and the slope B (i.e., the inbreeding load) was set to 1, 5, 10 or 20.
For each combination of A and B, we simulated 10,000 data sets (of 791 individuals each) and then quantified B using each statistical model (Table 2). We applied the method of Morton et al. (1956) to data grouped into similarly sized classes of similar values of F as summarized in the introduction, both with and without the small sample size correction proposed by Read (1983, 1984). Individual survival was analysed using the maximum-likelihood approach described by Kalinowski and Hedrick (1998). We also fitted a GLM with binomial errors and logit link function, and used predictions from this model in equation 3 as recommended by Grueber et al. (2011). In addition, we fitted a GLM with Poisson error distribution and logarithmic link function. Although not a commonly used approach, it is known that a GLM with Poisson distribution and logarithmic link function does provide unbiased point estimates for binary data (e.g., survival versus mortality) and usually avoids convergence problems that may occur with binomial errors and logarithmic link function. However, standard confidence intervals from a Poisson GLM are typically too large, yet this issue can be resolved by using the so-called sandwich estimator, a robust error variance estimation procedure (Zou, 2004; Supporting Information 1). For each model and combination of A and B, we extracted the estimated mean B and the 2.5% and 97.5% quantiles across the 10,000 data sets. These simulations directly compare the performance of the different statistical models to estimate inbreeding load (Table 2)

| Comparison of effects of metrics of F on estimates of inbreeding load
The above analyses showed that a Poisson GLM with logarithmic link provides reliable estimates of inbreeding load (see Results). Therefore, to compare the effects of the four different metrics of F on estimates of inbreeding load, we used this statistical model to regress individual survival on F ped , F H , F alt or F ROH in separate analyses.
Contrary to the previous analysis, we here used observed survival from the genetically explicit Nemo simulations. Thus, in this analysis, survival probability was determined by individual genotypes at loci with deleterious alleles (see above for details), and neutral loci or the pedigree was used to independently measure F. We extracted the slope as an estimate of inbreeding load per replicate and the mean and 2.5% and 97.5% quantiles across the 280 replicates (28 demes from 10 simulation runs). We calculated the actual inbreeding load present in the focal deme using equation 1, with allele frequencies q i from the focal generations 4,996-4,999 and selection s i and dominance coefficients h i for each locus as used in the Nemo simulations. This value provided a genetic reference that equals the value that a reliable method should estimate. We considered a metric of F to be biased if the difference between actual inbreeding load (calculated using equation 1) and its estimate was different from 0 with a p-value of less than 5%, as assessed using an intercept-only linear model with that difference for each deme as response variable. We additionally calculated root mean square error (RMSE), which is a combined measure of accuracy and precision.
Although a GLM with Poisson distribution and logarithmic link function provides unbiased point estimates for binary data, a sandwich estimator has to be used to calculate robust standard errors (Zou, 2004

| Comparison of statistical models to estimate inbreeding load
Fitting the full set of statistical models (Table 2) to the simulated individual survival data showed that only the GLM with logarithmic link function, and the maximum-likelihood estimation of the exponential equation, provided unbiased estimates of inbreeding load in all cases (Figure 1). These two methods fit essentially identical models in different ways. Morton et al.'s (1956) regression model substantially underestimated B when applying the small sample size correction of Read (1983, 1984), confirming previous extensive simulation studies (Kalinowski & Hedrick, 1998;Lacy, 1997;Willis & Wiese, 1997). Without the small sample size correction, Morton et al.'s model gave unbiased estimates for B up to 10, but overestimates for B of 20. This is because, for high B, many replicates had inbreeding classes with zero survivors, which have to be excluded from calculations using Morton et al.'s (1956) model. This In contrast, logarithmic GLMs and maximum-likelihood estimation consistently provided unbiased estimates of inbreeding load (Figure 1). However, maximum-likelihood estimation of the exponential equation failed in 106 out of 80,000 simulated data sets, and its implementation in some software packages may be considered more complicated, particularly given multiple covariates. We consequently recommend using the slope of a GLM with logarithmic link function and Poisson-distributed errors to estimate inbreeding load and to use a sandwich estimator to get appropriate confidence intervals (Zou, 2004;Supporting Information 1).

| Comparison of effects of metrics of F on estimates of inbreeding load
As expected, the distributions of the four metrics of F differed somewhat across the focal simulated individuals. F ped and F ROH had only positive values, with F ped showing a narrower range than F ROH , whereas F H and F alt contained both positive and negative values and thus had a wider range and a mean close to 0 (Table 3 and Figure S5 in Supporting Information 2). We also noted that values of F alt in immigrants and their descendants were too high because F alt strongly weighs rare alleles brought in by immigrants (see Supporting Information 2).
We ran genetically explicit simulations where survival was determined by genotypes at loci with deleterious alleles, and we used neutral loci or the pedigree to calculate four different metrics of F. The resulting estimates of inbreeding load did not yield identical results. Specifically, F ped led to slight overestimates of inbreeding load, and moreover the variation among estimates from the replicate demes was large, making this a relatively imprecise method (Figure 2). Consequently, root mean square error (RMSE) was rather large at 1.33. F ROH with runs of homozygosity longer than 1 Mbp provided unbiased estimates of inbreeding load, and variation in estimates was smaller than for F ped , giving an RMSE of 1.01 (Figure 2). F H led to underestimation of inbreeding load with an RMSE of 0.86, while F alt led to overestimation of inbreeding load with an RMSE of 2.05 (Figure 2

| Comparison of statistical models to estimate inbreeding load
The concept of "inbreeding load" (Morton et al., 1956) provides a standardized and theoretically rigorous measure of the magnitude of inbreeding depression that can be compared among traits, environments and populations. While multiple statistical models (Table 2) have been used to estimate inbreeding load, our simulations show that only logarithmic models yield unbiased estimates. Specifically, a Poisson generalized linear model (GLM) with logarithmic link function, and the maximum-likelihood exponential equation model proposed by Kalinowski and Hedrick (1998), returned unbiased estimates of inbreeding load. Other statistical models might be useful to study aspects of inbreeding other than quantification of inbreeding load.
Of these two models, the GLM with logarithmic link function is generally easy to implement. While it is not usual to model binary traits (such as survival) with Poisson error distributions and associated logarithmic links, such models return unbiased point estimates and appropriate confidence intervals can be computed (Zou, 2004;Supporting Information 1). GLMs designed to estimate inbreeding load in other traits could use error distributions other than Poisson, but using a logarithmic link function is crucial to preserve the population genetic interpretation of inbreeding load.
Meanwhile, Morton et al.'s (1956) (Table 1), Morton et al.'s (1956) (Figure 1). Such differences in baseline survival rate occur, for example, due to environmental differences between years or study sites.
To illustrate the problem, we used published data from Chatham Island black robins (Petroica traversi) (Kennedy, Grueber, Duncan, & Jamieson, 2014) to demonstrate how a logit link model can lead to erroneous comparative assessments of inbreeding load. A standard GLM with binomial errors and logit link generates estimates of inbreeding load that differ more than threefold among three focal study sites (R code in Supporting Information). Such highly different estimates emerge even though the same model provided no statistical support for the hypothesis that inbreeding load varied among sites (i.e., the site-by-F interaction was not significant and excluded from the model). This major apparent discrepancy in interpretation arises because the survival rates of outbred individuals varied markedly among study sites, which most likely reflects ecology (Kennedy et al., 2014). Using a GLM with logarithmic link instead does not lead to such inconsistent results. Thus, predictions from models with logit links should not be used to estimate inbreeding load. A number of estimates of "lethal equivalents" in the literature, particularly in more recent literature, are not in fact equivalent and cannot be meaningfully quantitatively compared.

| Comparison of effects of metrics of F on estimates of inbreeding load
Our genetically explicit genomic simulations showed that fitting the same (appropriate) statistical model using different metrics of F (Table 3) returned quantitatively different estimates of the inbreeding load. Of three metrics derived from genetic markers, only that based on runs of homozygosity (F ROH ) provided unbiased estimates.
F H systematically underestimated inbreeding load, but showed the lowest RMSE. Meanwhile, F ped slightly and F alt considerably overestimated inbreeding load. Our additional analyses of subsets of individuals and loci imply that if much larger data sets were available, estimates based on F H , F ped and F alt would likely still be biased whereas estimates based on F ROH would not, while the RMSE for F ROH would likely decrease (Supporting Information 2). Given appropriate genomic data, F ROH may therefore be the best metric of inbreeding for quantification of inbreeding load. Yengo et al. (2017) concluded from simulations that F H and particularly F alt were the best metrics to quantify inbreeding depression. However, they simulated trait values as a function of an inbreeding coefficient that was calculated in a similar way as F H and F alt , not based on genetically explicit simulations. This shortcut is likely to bias results in favour of metrics with similar properties, leading to conclusions that simply reflect simulation methodology (Kardos et al., 2018). Our genetically explicit simulations, where both trait values and inbreeding coefficients are emergent properties of Mendelian inheritance, genetic drift and selection, show that F ROH yields less biased estimates of the inbreeding load than F H and F alt (see also Keller et al., 2011).
Although F ped has similar properties to F ROH , it yields slight overestimates of the inbreeding load. Pedigrees measure expected identity by descent and not variation due to Mendelian sampling and recombination, whereas a large number of genetic markers allow measuring variation in realized identity by descent (Franklin, 1977;Hill & Weir, 2011;Leutenegger et al., 2003;Stam, 1980). High-density marker-based metrics of inbreeding consequently showed higher correlations with genome-wide identity by descent than F ped in simulation studies (Kardos, Luikart, & Allendorf, 2015;Keller et al., 2011;Wang, 2016), as is expected when realized identity by descent randomly deviates from its expectation based on F ped . In general, independent random errors in the independent variable (i.e., F) increase the variance and may lead to biased regression slopes (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006;Reid et al., 2014). Overestimation, such as we observed, might arise if F ped systematically underestimates genomic inbreeding, for example due to selection and resulting reduced variance (Groen, Kennedy, & Eissen, 1995). Indeed, simulations by Curik, Sölkner, and Stipic (2001) showed that regression slopes of trait values on F were overestimated when using F ped instead of realized genomic inbreeding, because F ped underestimated the variance in identity by descent.
Although desirable and increasingly feasible (Kardos et al., 2016), generating genomic data to measure inbreeding is not without challenges and may not be an option for every research programme (Andrews, Good, Miller, Luikart, & Hohenlohe, 2016;Shafer et al., 2017;Sims, Sudbery, Ilott, Heger, & Ponting, 2014). In these cases, pedigrees of sufficient depth will yield reasonable if slightly biased estimates of inbreeding load. However, if an assembled reference genome of sufficient quality and a dense genetic marker data set are available, we recommend using F ROH and as many individuals as possible for estimation of inbreeding load.

| Implications for wild populations
Our results show that estimates of inbreeding load are contingent on the underlying statistical model and the metric of F, implying that diverse published estimates are often not equivalent and impeding quantitative comparison. We thus collated published estimates of inbreeding loads in wild vertebrate populations that used unbiased methods in Table 1 and explain in the R code in the Supporting Information why other estimates were deemed to not be comparable. Not all studies of inbreeding depression reported estimates of inbreeding load, but they sometimes contained sufficient data to allow approximate calculation (details of analyses and exclusions, and R code, are in the Supporting Information).
We mainly attempted to recalculate estimates of inbreeding load calculated in review studies by O'Grady et al. (2006) and Frankham et al. (2017, table 3.2). We describe the detailed methods in the R code in the Supporting Information and also explain there why some values differ.
We did not list some of the previously reported estimates, mainly because they were not from wild populations or for various issues that we explain in the R code in the Supporting Information. For example, the highest value among vertebrate populations cited by Frankham et al. (2017) is based on a study on red deer (Cervus elaphus) (Huisman et al., 2016) that did not report inbreeding load and that used logit links and F alt to analyse inbreeding depression. As we have shown here, such estimates of inbreeding load may be unreliable. The same concerns apply to the even higher estimates of inbreeding load reported for the same study of red deer by Hedrick and García-Dorado (2016).
When using only estimates from models known to have little bias, a mean inbreeding load for survival until sexual maturity of 3.5 haploid lethal equivalents was found among wild vertebrate populations (Table 1). This value is higher than the mean of 2.3 reported for mammals in captivity (Ralls et al., 1988). We did not observe a recent increase in reported inbreeding load estimates from the wild as previously noted (Hedrick & García-Dorado, 2016). However, there are not many reliable estimates of inbreeding load available for wild vertebrate populations and especially not for measures of lifetime fitness. To improve this situation, we encourage researchers to explicitly calculate and report inbreeding load for their study populations whenever possible. Furthermore, study systems where lifetime reproductive success is well known offer interesting prospects for quantification of inbreeding load in measures of total fitness. The widespread availability of genomic methods will ease the challenge of measuring inbreeding in wild animals and plants in the coming years. However, the difficulty of accurately measuring fitness in wild populations will remain. Thus, detailed long-term study populations where survival and reproduction can be monitored in detail will become increasingly valuable for ecological and evolutionary genomics.

| Limitations
Although our recommendation to use F ROH for measuring inbreeding is in line with other studies (e.g., Keller et al., 2011), there are limitations to our simulations and hence quantitative conclusions.
We investigated the performance of different metrics of F given a metapopulation of 30 demes of fixed size connected by little dispersal and gene flow. Quantitative conclusions will likely change given different structures and resulting means and variances in F. Neither an appropriate metric of F, nor an appropriate statistical model, can guarantee an unbiased estimate of inbreeding load if other assumptions of the underlying theory are violated. In particular, if the assumption of independent effects of loci is violated, for example due to epistasis or additive rather than multiplicative effects among loci, different statistical procedures may be required.
If inbreeding depression is mainly due to overdominance rather than partial directional dominance, biases in estimates of inbreeding load may also change (Curik et al., 2001). Similarly, further research is needed to assess what would change if inbreeding depression was mainly caused by few loci with large effects, such as recessive lethal mutations. Further bias in estimates of inbreeding load could arise if there are nonrandom associations between individual F values and environmental quality, if propensity to inbreed is correlated with fitness-related heritable traits (Becker, Hegelbach, Keller, & Postma, 2016;Reid, Arcese, & Keller, 2008), or if parental investment differs depending on offspring F (Duthie, Lee, & Reid, 2016). In such cases, use of the metrics and models that we have highlighted may need to be coupled with experiments that break associations between F and environmental and parental effects, or with more sophisticated regression models that additionally account for additive genetic effects.

ACK N OWLED G EM ENTS
We thank Peter Arcese, A. Bradley Duthie, Richard Frankham, Christine Grossen, Catherine Grueber, Marty Kardos and three anonymous reviewers for helpful comments and discussion on earlier versions of this manuscript. We thank Cate Lessels and Peter Boag from whose 1987 paper in The Auk we copied the idea for the title of this paper. Our work was supported by a Swiss National Science Foundation grant (31003A-116794) to LFK, the Forschungskredit of the University of Zurich (FK-15-104) and a Swiss National Science Foundation grant (P2ZHP3_168447) to PN, and JMR was supported by a European Research Council grant.

DATA A R C H I V I N G S TAT E M E N T
Code for computer simulations is uploaded as Supporting Information.