On the estimation of inbreeding depression using different measures of inbreeding from molecular markers

Abstract The inbreeding coefficient (F) of individuals can be estimated from molecular marker data, such as SNPs, using measures of homozygosity of individual markers or runs of homozygosity (ROH) across the genome. These different measures of F can then be used to estimate the rate of inbreeding depression (ID) for quantitative traits. Some recent simulation studies have investigated the accuracy of this estimation with contradictory results. Whereas some studies suggest that estimates of inbreeding from ROH account more accurately for ID, others suggest that inbreeding measures from SNP‐by‐SNP homozygosity giving a large weight to rare alleles are more accurate. Here, we try to give more light on this issue by carrying out a set of computer simulations considering a range of population genetic parameters and population sizes. Our results show that the previous studies are indeed not contradictory. In populations with low effective size, where relationships are more tight and selection is relatively less intense, F measures based on ROH provide very accurate estimates of ID whereas SNP‐by‐SNP‐based F measures with high weight to rare alleles can show substantial upwardly biased estimates of ID. However, in populations of large effective size, with more intense selection and trait allele frequencies expected to be low if they are deleterious for fitness because of purifying selection, average estimates of ID from SNP‐by‐SNP‐based F values become unbiased or slightly downwardly biased and those from ROH‐based F values become slightly downwardly biased. The noise attached to all these estimates, nevertheless, can be very high in large‐sized populations. We also investigate the relationship between the different F measures and the homozygous mutation load, which has been suggested as a proxy of inbreeding depression.


| INTRODUC TI ON
Inbreeding depression, the change in the mean of a quantitative trait with inbreeding (a reduction in fitness in the case of life-history traits), is a main factor for the extinction of endangered species (Allendorf, Luikart, & Aitken, 2013;Frankham, Ballou, & Briscoe, 2010;O'Grady et al., 2006) and the management of livestock (Leroy, 2014). Regarding fitness components, inbreeding depression is generally assumed to occur because of the increased homozygosity of partially recessive deleterious alleles (Charlesworth & Willis, 2009;Hedrick, 2012), although some unknown depression may also occur by reduced heterozygosity at loci presenting heterozygote advantage (Charlesworth, 2015).
For quantitative traits, the rate of inbreeding depression (ID) is usually quantified by the slope of the linear regression of the individual phenotypic values on their inbreeding coefficient, F (Lynch & Walsh, 1998). The inbreeding coefficient of an individual, the probability of identity by descent of alleles, that is, alleles of an individual in a locus that come from a common ancestor to its parents, can be obtained from pedigree data (Wright, 1969) but also inferred from genome homozygosity (Li & Horvitz, 1953;Malécot, 1948;Toro et al., 2002), mainly using highly dense molecular markers, such as SNPs. Genomic or molecular measures of F have the advantage of providing estimates of realized inbreeding values, which are often more precise than pedigree ones (Curik, Ferenčaković, & Sölkner, 2014;Keller, Visscher, & Goddard, 2011;Wang, 2016), and do not require a knowledge of the genealogical relationships of individuals. They, however, measure overall homozygosity, which includes not only identity by descent of alleles but also identity in state, that is, identical alleles coming from different ancestors.
An assessment of the accuracy of some of the F measures in the estimation of ID can be made with computer simulations (e.g., Kardos, Luikart, & Allendorf, 2015;Keller et al., 2011), where the true parameters are known, even though simulated scenarios are always a simplified view of the natural processes. Thus, Keller et al. (2011) studied the correlation between different F measures and the homozygous mutation load (HML) of each individual, defined as the number of homozygous loci for rare alleles (MAF < 0.05) carried by an individual. This is assumed to be a proxy of the individual overall load of homozygous (partially) recessive deleterious mutations and, thus, of inbreeding depression. Using this approach, Keller et al. (2011) showed that ROH-based F measures are the most powerful to detect ID. Kardos et al. (2015) rather considered the correlation between the F measures and the proportion of the genome which is identical by descent (IBD), concluding that both ROH-based and SNP-by-SNP-based F measures can explain a large amount of the genomic IBD variation if the number of SNPs is sufficiently large (tens of thousands). However, these previous simulation studies did not consider ID of a quantitative trait in itself, but addressed the issue in an indirect way by considering some proxies of it.
More recently, Yengo et al. (2017) used data on true genotypes for several millions of human SNPs and simulated ID by ascribing phenotypic effects to a sample of the SNPs under different scenarios. They then investigated the accuracy in the estimation of ID using different measures of individual F. The main conclusion was that the most accurate estimations were obtained with a measure of genomic inbreeding based on the correlation between uniting gametes (F UNI ; previously called F III by Yang et al., 2011), a statistic based on the deviation of SNP homozygosity from their expected values giving a larger weight to rare alleles than to common alleles. In contrast, ROH-based measures of inbreeding (F ROH ) were shown to provide very large overestimations of ID. Yengo et al. (2017) thus explicitly recommended the use of F III to estimate ID with molecular data. This generated some debate, though, as it was argued that the simulations performed by Yengo et al. (2017) were not carried out with explicit simulated individuals subjected to genetic processes (Kardos, Nietlisbach, & Hedrick, 2018). Rather, trait values had been simulated as a function of the inbreeding coefficient which was calculated in a similar way as F III , perhaps biasing the conclusions in favour of this F measure (Kardos et al., 2018; see also reply by Yengo et al., 2018 Yengo et al. (2017) used in their simulations real human genotypic data from 10,000 unrelated individuals, and dominance effects were simulated assuming they were inversely proportional to the allele frequency variance and, therefore, with a potentially very large value.
Here, we carried out another set of genetic explicit simulations considering simplistic scenarios to evaluate the relative performance of different genomic marker F measures on the estimation of the rate of inbreeding depression. Our results show that, for low effective population sizes (N = 100), where individuals are expected to be more related to each other and where natural selection is, in general, relatively less intense, F ROH provides the most accurate estimates of ID, whereas F III provides overestimations, in full agreement with the results of Nietlisbach et al. (2019). However, for large effective population sizes (N ≥ 1,000), where individuals tend to be less related and deleterious alleles are expected to be at lower frequencies, F III provides almost unbiased average estimations of ID, in agreement with the results of Yengo et al. (2017), although with a great noise, whereas F ROH can produce slightly downwardly biased average estimates of ID, also with a great noise.

| Simulation procedure and parameters
Time-forward individually based simulations were carried out of a diploid monoecious population of constant size N (100, 500, 1,000, 5,000 and 10,000) individuals with random mating. A modified version of the program SLIM (Messer, 2013) was used in which mutations with effect on fitness have a pleiotropic effect on a quantitative trait (Caballero, Tenesa, & Keightley, 2015). A single genome sequence of 100 Mb, initially devoid of variation, was assumed where mutations appear at a rate U per haploid genome and generation chosen to produce about 30,000 SNPs in the final generation for all population sizes. The rate of recombination between consecutive positions was assumed to be c = 10 −8 , which implies an average rate of recombination of 1 cM per Megabase. Thus, the genome length sequence (100 Mb) and the genetic length (1 M) are typical of a mammalian chromosome. Simulations were run for 10N discrete generations (5N in the case of N = 10,000). Ninety-five per cent of mutations were assumed to be neutral, the remainder having an effect on a quantitative trait (QTL) and a pleiotropic effect on fitness. The genotypic values for the wild-type homozygote, the heterozygote and the mutant homozygote were 0, ah and a for the quantitative trait, and 1, 1 + sh and 1 + s for fitness, where a constant dominance coefficient of h = 0.2 was assumed. Multilocus genotypic values assumed a multiplicative model for fitness and an additive model for the quantitative trait across loci. To minimize the noise, phenotypic values for the quantitative trait were assumed to equal genotypic values; that is, no environmental error was added to the phenotypic values. Values of a and s were obtained from a bivariate gamma distribution with mean effect −0.03 and shape parameter β = 1 (i.e., following an exponential distribution). A correlation (ρ) between a and s values was generally assumed to be one, that is, the trait is fitness itself, but a values were scaled so that the amount of ID was about the same for all population sizes considered (about 1% decrease in the mean per 1% increase in inbreeding or an inbreeding load of about one haploid lethal equivalent). Because mutations always reduced the value of the trait and were partially recessive, the model implied directional dominance, that is, homozygous recessive genotypes always reduced the trait value, thus generating ID for the trait. Except for the impact of purifying selection on deleterious mutations, the simulated populations were Wright-Fisher ideal populations so that the population size (N) approximately equals the effective population size.
For the population sizes of N = 100, 1,000 and 10,000, additional simulations were run considering all previous parameters (default scenario) with the following changes: (a) a one-order higher (c = 10 −7 ) or lower (c = 10 −9 ) recombination rate between consec-

| Measures of inbreeding coefficients and estimates of the rate of inbreeding depression
From the last generation of each simulation, the files map and ped were generated with all individuals of the population (to avoid the confounding effect of sampling) and all neutral segregating SNPs (i.e., QTLs were removed). No MAF pruning was made to the data.
The following measures of the coefficient of inbreeding of each individual were then obtained with the programs PLINK (Purcell et al., 2007) and GCTA (Yang et al., 2011): Estimators F I , F II , and F III provided by GCTAv1.93.0 with the -ibc option: where S is the total number of markers, x k is the number of minor alleles of marker k (i.e., 0, 1 or 2 copies), and p k is the current frequency of the minor allele in the population.
The estimator F I is that proposed by VanRaden (2008)  Finally, the estimator based on runs of homozygosity (F ROH ) was obtained using PLINK1.9 with the default options: a minimum of 100 SNPs, at least 1 SNP per 50 Kb, and a scanning window of 50 SNPs.
An explanation of the rationale and relationship between the estimators F I , F II , F III and F HOM is given in the Supplementary Appendix.
The inbreeding coefficient of an individual (F), that is, the probability of identity by descent of the two alleles at a locus, is a concept relative to a (sometimes arbitrary) reference base population (e.g., an earlier generation of the population). If the allele frequencies of the reference generation are considered in the estimators, these are expected to measure the inbreeding coefficient (IBD) relative to that reference generation, at least when all loci are at linkage equilibrium (see Supplementary Appendix). Unfortunately, the reference generation frequencies are usually unknown and the estimates are obtained assuming the current generation allele frequencies. In this case, the measures of F from Equations (1-4) refer to the deviations of the frequencies of homozygotes from those expected under Hardy-Weinberg expectations or the correlation between the alleles carried by individuals. Thus, they take positive or negative values, generally implying an excess or a defect, respectively, of homozygotes. In contrast, the measures obtained by F ROH take only positive values (from zero to one) as they are expected to include genomic segments of homozygosity produced by IBD.
The estimated ID was obtained as the slope of the linear regression of the phenotypic values (Phe) of individuals on their different F measures. All individuals of the population were used in this regression to avoid biases due to sampling errors. In addition, as was mentioned above, phenotypic values were taken with no error, so that any deviation of the estimated ID from its true value should be ascribed to the different properties of the molecular measures of F.
Simulations were repeated between 100 and 1,000 times depending on the population size considered. For each replicate, the estimated ID obtained with each F measure was compared with the expected (true) ID, calculating the distribution of deviations from the expected value and the root-mean-square error (RMSE), a combined measure of accuracy and precision. Pearson correlations between all F values and between these and the phenotypic values of individuals (Phe) and the different homozygous mutations loads (HML, HML MAF and HML QTL ) were also obtained and averaged across replicates.

| Alternative set of simulations
To double-check the main results obtained, additional time-forward individually based simulations were performed with an in-house C program alternative to SLIM, modified from Caballero, Bravo, and Wang (2017). In this case, populations of size N = 100, 1,000 and 5,000 were run for 1,000 (N = 100) or 10,000 (N ≥ 1,000) discrete generations, assuming a 60 Kb genome sequence of 1 Morgan genetic length and a mutation rate adjusted to produce up to 30,000 SNPs. Thus, the number of SNPs and the genetic length were similar to those of the previous simulations. A 10% proportion of mutations were assumed to be deleterious with average selection and dominance coefficients of s = −0.03 and h = 0.2, respectively. Selection coefficients for mutations were obtained, as before, from an exponential distribution, but dominance coefficients were assumed to be variable, with an inversely proportional relationship with selection coefficients following the model proposed by Caballero and Keightley (1994) (see also Caballero, 2020, p. 159). Values of the true ID and those estimated with the different F values were obtained from each of the last 200 generations of each replicated simulation, considering the whole population. Given the short genome sequence assumed in these simulations, ROH segments larger than 1 Kb were considered.   Figure S1.

| RE SULTS
For N = 100, the average root-mean-square error (RMSE) of the estimates of ID obtained with F ROH-5 (0.216 ± 0.047) is lower than that obtained with F III (0.597 ± 0.097) (Figure 2). However, for larger population sizes, this difference disappears, for example for N = 10,000, RMSE F ROH-5 = 0.837 ± 0.255 and RMSE F III = 0.772 ± 0.231.  Figure 1 and Figure S2). In addition, the corresponding results using a simulation program alternative to SLIM ( Figure S3  (N = 100), Phe has a larger correlation with HML than with HML MAF .
In contrast, for larger N, Phe is more correlated with HML MAF than with HML, suggesting that the phenotype of individuals can be ascribed more strongly to rare homozygous alleles. In fact, as shown in Figure S4, the larger the population size, the lower the QTL frequencies and the larger the contribution to ID by rare alleles, as would be expected. Nevertheless, for large N (particularly for N = 10,000), all correlations between the different values of HML and Phe are rather low, indicating that the homozygous mutation load is a poor proxy of fitness in large populations.
Finally, Figure 4 shows the correlation between the F measures and the phenotypic value (Phe) of individuals and the homozygous mutations loads (HML). F III has the largest correlation with Phe and HML MAF for all values of N and with HML QTL for N ≥ 1,000. The largest correlation with HML is achieved by F HOM , as expected.

| D ISCUSS I ON
Inbreeding depression is a key issue for explaining the evolution of populations and to carry out the management and conservation of wild and domestic species (Lynch & Walsh, 1998). The increasing availability of dense molecular markers (SNPs) for many species has allowed the prediction of the relationships among individuals in the absence of pedigree records. In fact, genomic measures of F can be more useful than pedigree estimates, as they provide realized rather than expected genomic relationships (Kardos et al., 2016;Keller et al., 2011). These estimates of genomic inbreeding can, in turn, be used to estimate the rate of inbreeding depression. If the allele frequencies of many loci at Hardy-Weinberg and linkage equilibrium were known without error at a given previous generation of the population and these frequencies were included in the different SNP-  Wang, Hill, Charlesworth, & Charlesworth, 1999). To perform their simulations, they used a binary viability model and compared the F I G U R E 1 Proportional deviation of the estimates of the rate of inbreeding depression (ID) obtained with different measures of the inbreeding coefficient with marker data (see text), with respect to the true simulated ID. The dot is the mean deviation, and the bar indicates the 95% of the distribution of simulated replicates. Simulated parameters: population size N; genome sequence of 100 Mb run for 10N discrete generations (5N in the case of N = 10,000); rate of recombination between consecutive positions c = 10 −8 ; mutation rate per haploid genome and generation U chosen to produce about 30,000 SNPs in the final generation for all population sizes; 95% of mutations assumed to be neutral, the remainder having an effect on a quantitative trait (QTL) and a pleiotropic effect on fitness; homozygous effects for the trait (a) and fitness (s) obtained from a bivariate gamma distribution with mean effect −0.03 and shape parameter β = 1; the mean homozygous effect for the trait (a) was adjusted to obtain a true inbreeding depression rate of about 1 (one per cent decline in mean per one per cent increase in inbreeding or an inbreeding load of about one haploid lethal equivalent) for all population sizes; dominance coefficient h = 0.2; correlation ρ = 1 between a and s values Their results also showed that the variation between the estimates of ID was smaller by using F ROH than by using F III and those from F HOM had the lowest variation (see Nietlisbach et al., 2019, Figure 2  They also found that the estimates of ID from F ROH had about a threefold larger standard error than those from F III . Additionally, they found that both F III and F HOM produced overestimates of ID when QTLs are enriched in high-LD regions, and underestimates of ID when enriched in low-LD regions, and that these biases could be corrected or reduced if LD score and MAF stratification were applied. Population stratification also produced biases in estimates of ID from F HOM and F ROH , which were not observed in those from F III . In accordance with their simulated results, the estimations of ID over 25 human quantitative traits were consistently larger when obtained from F ROH measures than from F III ones. However, more significant cases of ID were detected with F III than with F ROH .

F II F III F HOM F ROH-0.1 F ROH-1 F ROH-
Our simulations consider some parameters similar to those assumed by Nietlisbach et al. (2019), such as the average selection coefficient and dominance, but assume a single undivided population of varying size. We also considered a simple additive multilocus model for a quantitative trait rather than a binary viability model, so that a log scale is not necessary to be applied and the regression of raw phenotypic values on the predicted F provides a direct estimate of the inbreeding depression rate. Our simulations for N = 100 show no bias of the average ID when F ROH is used, irrespective of whether short (>0.1 or >1 Mb) or long (>5 Mb) ROH are considered (Figure 1). F HOM underestimates ID by a fraction of about −0.3, whereas F III overestimates it by a fraction of about +0.5. These results are actually very similar to those obtained by Nietlisbach et al. (2019), that is −0.2 and +0.7, respectively. In addition, for N = 100 we found that the RMSE of ID estimates from F III measures were three times larger than those from F ROH ones (Figure 2), in concordance with a twofold difference in the same direction observed by Nietlisbach et al. (2019). We observed, however, that estimates of ID from F HOM for N = 100 had a RMSE larger than those from F ROH (Figure 2 Table 1).
However, it can also be argued that long ROH only account for recent inbreeding whereas measures of F including also short ROH would incorporate more ancient inbreeding, what may contribute to values of short-ROH-F measures larger than those from long ones.
The increase in population size also results in a reduction of the average bias of the estimated ID using F III , whose bias becomes 0.5, 0.28, 0.21, −0.002 and −0.14 for increasing values of N. Thus, average estimates of ID with F III for large population sizes are basically unbiased or slightly so, in agreement with the main result obtained by Yengo et al. (2017). Our results do not agree with those of Yengo et al. (2017), however, in relation with the estimation of ID using F ROH , as they obtained overestimations by a fraction +0.9 whereas we obtained almost unbiased estimates or slight underestimations for large N. In fact, Yengo et al. (2017) found generally much larger estimates of ID for 25 human quantitative traits when F ROH was used than when F III was used. Kardos et al. (2018) argued that the F ROH measures considered by Yengo et al. (2017)  difficulties of estimating the rate of inbreeding depression in large outbred populations. We observed that variation in the estimates of ID from F III and F ROH were similar in the case of N = 10,000 (see Figure 1) and that the RMSE of these estimates were also about the same for these two F measures (see Figure 2 for N = 10,000). This contrasts with the result of Yengo et al. (2017), which indicated that the estimates of ID from F ROH had a threefold larger standard error than those from F III . It must be noted, however, that our simulations refer to a simplistic model of a constant population size, with uniform LD along the genome and without any source of stratification, which can differ from the complex structure of the human genomes and populations analysed by Yengo et al. (2017). Thus, there may be a source of unknown factors that could be contributing to the differences between the results of both studies.
We obtained estimates of the mean, variance and correlation between the different genomic F measures and between these and the homozygous mutation load (HML), suggested to be a proxy of ID . The mean of the genomic measures F I , F II , F III and F HOM is basically the same (Table 1) and close to the value of the deviation from Hardy-Weinberg proportions expected in a panmictic population, −1/2N (Kimura & Crow, 1963;Robertson, 1965 (Table 2) and are poor estimators of ID (Figure 1 and Figure  genomic cattle data by Alemu et al. (2020). Some of these results contrast with those of Keller et al. (2011), who found that F ROH had the largest correlation with HML MAF , followed by F III and then F HOM (see Figure 7 of Keller et al., 2011), whereas we observed that the largest correlation is for F III , followed by F ROH and then F HOM (see Figure 4). However, the simulations of Keller et al. (2011) did not consider selection of any type. Purifying selection, as applied in our simulations, would imply some reduction in genetic variability across the genome, so that rare allele frequencies would get more relevance in the HML.
Regarding F ROH , we found that the lowest bias in the estimation of ID occurred when large ROH were considered (Figure 1 and Figure S3). This could be explained, as mentioned above, because short ROH may not fully reflect IBD, overestimating the true genomic inbreeding, but also because short ROH are less enriched in QTLs. Szpiech et al. (2013) showed that ROH in the human genome are enriched in deleterious mutations and that long ROH are more enriched than short ones. In cattle, however, the result is the opposite, with short ROH more enriched in deleterious mutations than long ones (Zhang et al., 2015). Our simulations show that, for the smallest population size (N = 100), the correlation between the phenotypic values (Phe) and F ROH is 0.66 ± 0.003 for F ROH-0.1 and 0.58 ± 0.004 for F  , in line with the cattle results. In contrast, for the largest population size (N = 10,000) the corresponding correlations are 0.012 ± 0.001 for F ROH-0.1 and 0.017 ± 0.001 for F  , in agreement with the human genome results. It is expected that cattle populations, heavily subjected to selection, have lower effective population sizes than human populations, so our results are consistent with the observations.
The F measure showing the largest correlation with the overall homozygosity mutation load (HML, considering all allele frequencies) was F HOM . Multilocus heterozygosity has been traditionally related to overall fitness, although with a poor correlation (Grueber, Waters, & Jamieson, 2011;Slate & Pemberton, 2002). This, however, has been mainly focused on few markers (e.g., microsatellites) and the possibility to explore the multilocus heterozygosity-fitness correlation with highly dense markers using F HOM or other measures of F opens new research areas.
In summary, our simulation results indicate that estimates of the rate of inbreeding depression from F ROH measures of inbreeding are appropriate for populations with low effective sizes but may lead to some underestimation for large ones, unless the ROH fragments con-

ACK N OWLED G EM ENTS
We are grateful to Miguel Toro, Aurora García-Dorado and three anonymous referees for helpful comments on the manuscript. This

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
All software and scripts used in the simulations are available in