Genomic relationships computed from either next-generation sequence or array SNP data


  • M. Pérez-Enciso

    Corresponding author
    1. Centre for Research in Agricultural Genomics (CRAG), Bellaterra, Spain
    2. Veterinary School, Universitat Autònoma de Barcelona, Bellaterra, Spain
    3. Institut Català de Recerca i Estudis Avancats (ICREA), Barcelona, Spain
    4. Animal Breeding and Genomics Group, Wageningen University, Wageningen, The Netherlands
    • Correspondence

      M. Pérez-Enciso, Centre for Research in Agricultural Genomics (CRAG), 08193 Bellaterra, Spain. Tel: +3493 563 66 00; Fax: +3493 563 66 01; E-mail:

    Search for more papers by this author


The use of sequence data in genomic prediction models is a topic of high interest, given the decreasing prices of current ‘next’-generation sequencing technologies (NGS) and the theoretical possibility of directly interrogating the genomes for all causal mutations. Here, we compare by simulation how well genetic relationships (G) could be estimated using either NGS or ascertained SNP arrays. DNA sequences were simulated using the coalescence according to two scenarios: a ‘cattle’ scenario that consisted of a bottleneck followed by a split in two breeds without migration, and a ‘pig’ model where Chinese introgression into international pig breeds was simulated. We found that introgression results in a large amount of variability across the genome and between individuals, both in differentiation and in diversity. In general, NGS data allowed the most accurate estimates of G, provided enough sequencing depth was available, because shallow NGS (4×) may result in highly distorted estimates of G elements, especially if not standardized by allele frequency. However, high-density genotyping can also result in accurate estimates of G. Given that genotyping is much less noisy than NGS data, it is suggested that specific high-density arrays (~3M SNPs) that minimize the effects of ascertainment could be developed in the population of interest by sequencing the most influential animals and rely on those arrays for implementing genomic selection.


There is an increasing interest in implementing sequence data into genomic selection breeding programmes (Druet et al. 2014; Hickey 2013), a trend that is being fostered by dwindling sequencing technology (NGS) prices and by the popularization in the use of molecular information for genetic improvement. Having access to complete or almost complete sequence from whole populations is certainly very attractive for population genetics analyses because of the possibility of directly interrogating genomes for causal mutations, instead of relying on indirect signals based on linkage disequilibrium. For the estimates on the gains in genetic progress with NGS versus high-density array genotyping, the evidence is, however, still limited. Some of the previous studies on this have employed a somewhat simplistic approach, either in generating polymorphism data or in the analyses of NGS data (Meuwissen & Goddard 2010). Recently, more realistic simulations by Druet et al. (2014) have suggested that sequence data will be particularly useful if causal mutations are at low frequency. How NGS data are generated and analysed is relevant because NGS is a highly error-prone technology, especially in detecting rare variants. For instance, the number and reliability of SNPs detected with NGS critically depend on parameters like average depth and the error pattern in sequencing (Nielsen et al. 2011). As we shall show here, additionally, the history of a population is also a determining factor in the behaviour of the different methodologies.

In principle, barring for non-SNP variants like copy number variants (CNVs), the same algorithms and software can be used for genomic selection using either array-based SNPs or sequence-based SNPs. Extant software should nevertheless be optimized to deal with the increase in markers resulting from sequence data. Although variants exist to compute the molecular relationship matrix (G), such as those used in G-BLUP (VanRaden 2008), all are based in computing the allele difference counts between two individuals, summed over SNPs and divided by either the number of SNPs or by a function of allele frequency (p), typically the variance, 2p(1−p) (Toro et al. 2011). The denominator is important though. With array-SNP data, the only information available is the number of SNPs and their frequencies whereas, with sequence data, the total length of DNA sequenced is also known.

Two important concepts are needed to fully understand the differences between genotypes provided by either arrayed SNPs or sequence. The first one is SNP ascertainment bias and the second one, the Most Recent Common Ancestor (MRCA) between two sequences, a concept in turn that is closely related to nucleotide diversity θ = 4 Ne μ where Ne is effective population size and μ, the mutation rate. Ascertainment bias is the criterion by which SNPs that are present in an array or genotyping protocol are chosen. This process often results in markers with high minimum allele frequency, that is, markers that are segregating in a large number of individuals and populations and are therefore highly ‘informative’. Standard population genetics theory (Nielsen & Slatkin 2013) tells us that rare variants (with very low minimum allele frequency, MAF) are the most frequent ones, so the effect of discarding these low informative but very common class of markers is potentially important. So far, all genomic selection programmes are based on arrays made up of ascertained SNPs. Despite this, the effect of ascertainment bias on the computation of the genomic relationship matrix has not been studied, to the best of our knowledge.

A pair of DNA sequences, say the two haplotypes in a single individual, coalesce in their MRCA if they do have not undergone recombination. If recombination has occurred, as will be normally the case, each stretch of DNA for the same two haplotypes will have a different MRCA (Nordborg & Tavare 2002). The number of nucleotide differences between a pair of sequences, for example, the number of SNPs found in a single individual, divided by total DNA length sequenced is directly proportional to the MRCA if no recombination has occurred and is also proportional to nucleotide diversity θ (Nielsen & Slatkin 2013). If recombination has occurred, it will be proportional to average time to all MRCAs. If an element of the sequence-based genomic relationship matrix is computed as the number of allelic differences between the two individuals divided by the number of base pairs actually sequenced, then it can be interpreted as the average time to MRCA between the two individuals’ pairs of haplotypes. In turn, recall that time to MRCA is proportional to θ. Therefore, a principal difference of having access to sequence data instead of only SNP arrays is the possibility of estimating DNA diversity.

In this work, we study the effect of SNP ascertainment bias on molecular relationship matrices. Simulations are used to compare how well molecular relationship matrices G can be estimated from either NGS or array data. Therefore, we do not directly analyse how much can be gained in terms of genetic progress with say G-BLUP but, rather, we provide guidelines on the scenarios and experimental settings where NGS is to perform comparatively better than SNP arrays to estimate G, and on how SNP arrays could be designed to improve its performance. Nevertheless, the assumption is made that the true G (the matrix computed if complete sequence from all individuals were known without errors) is the best possible scenario for genomic prediction. This assumption is not discussed here, and may not be true, but it is reasonable starting point.


Molecular relationships compared

A wide literature is available on different metrics to compute G (VanRaden 2008; Toro et al. 2011). Here, we compare two approaches to measure genetic similarity that use or not allele frequencies. In the first one (GL), we count the number of alleles shared between two individuals, summed over SNPs in the region and divided by total DNA length analysed (L):

display math(1)

where aik is genotype for i-th individual and k-th locus, and δ is an indicative variable taking values 1, 0.5 and 0 for genotype pairs (AA, AA), (AB, –) and (AA, BB), respectively.

As mentioned previously, the quantity gL has a clear evolutionary interpretation: average pairwise difference is proportional to the MRCA between any two sequences, and the average of GL elements is an estimator of nucleotide diversity. In NGS data, because coverage will never be complete, instead of L we used the number of base pairs that fulfilled the minimum depth and base and mapping quality requirements, which was computed for every pair of individuals. Then, with NGS data, L varies from one comparison to another. For SNP arrays, gL cannot be properly computed because the number of interrogated base pairs (L) has no meaning and so the denominator in (1) is the total number of SNPs. Despite its theoretical appeal from a population genetics point of view, a potential disadvantage of GL is that it gives equal weights to each SNP irrespective of its frequency (Yang et al. 2011).

The second metrics used, also widely known and used in many G-BLUP programs, corrects for this dividing by the variance of genotype values:

display math(2)

Where xij is arbitrarily coded as −1, 0 and 1 for genotypes AA, AB and BB, respectively, p is the population frequency of allele B, and μ = −2p is the population mean. We use Equation (2) above instead of averaging over loci math formula because the former has been shown to have a better behaviour (Toro et al. 2011). A problem with Equation (2) is what frequency to use, a matter of additional complication if several populations are analysed. Here, we employed average frequency over all populations and individuals. In most cases presented here, this gave reasonable results. Note that GF and GL are equivalent, for an equal set of markers, if p is 0.5 across all markers.

General setting

The main forces in shaping livestock genomes are bottlenecks, that is, a reduction in effective population size due to domestication and modern selection, admixture, that is, hybridization between different breeds, and artificial selection. Although bottlenecks and selection are usually considered as highly influential, here, we show that admixture can also have a dramatic influence. We consider two scenarios that may represent extreme cases in livestock (Figure 1). The simplest scenario considered, called ‘cattle’, consists of a bottleneck occurred t2 = 2000 generations ago followed by a recent splitting t1 = 50 generations ago in two populations that become isolated after splitting. These two populations may represent dairy and beef cattle breeds, created after a bottleneck as a result of domestication. Following the literature (Gibbs et al. 2009), we assumed a nucleotide variability θ = 0.0012 per bp in both breeds, and effective population size Ne = 300.

Figure 1.

Demographic scenarios considered: left, ‘cattle’ scenario; right, ‘pig’ scenario. In pigs, dotted arrows mean limited introgression from China into international breeds. Migration between international breeds was also allowed.

The second scenario considered is a ‘pig’ scenario. In pigs, it is well known that current international breeds are the result of introgression of Chinese pigs into the ancestral local European pig, and that both Chinese and ancestral European pigs are highly divergent (Giuffra et al. 2000; Groenen et al. 2012). The extent of autosomal introgression into modern breeds is quite high and has been estimated as ~ 30% in two independent studies (Ojeda et al. 2011; Groenen et al. 2012). We simulate individuals from four populations: a local European pig without admixing with Chinese pigs (called ‘Iberian’, population 1), two international breeds, admixed with Chinese pigs (called ‘Large White’ and ‘Duroc’, populations 2 and 3) and a Chinese population (population 4). Chinese admixture has occurred between 25 and 30 generations ago, at a rate of 1 or 2% per generation, with a small migration also between international breeds; three bottlenecks has occurred approximately 30, 100 and 5000 generations ago, the latter time is when Chinese population diverged from the European clade.

We employ coalescence simulations with software MaCS (Chen et al. 2009) to generate DNA data in stretches of 100 kb. The exact coalescence commands for both scenarios are in the Appendix 1.

SNP ascertainment

The most popular criterion to select array-SNPs is ‘informativity’, that is, the requirement that the SNP segregates in different breeds and that its MAF exceeds a given threshold; this results in a bias towards intermediate allele frequency SNPs. From a genealogical perspective, high MAF SNPs are also older on average than low MAF SNPs. Here, we compare two different ascertainment criteria. In the first case, we required a MAF ≥ 0.15 for a SNP to be included in the genotyping panel, and we randomly selected 10 SNPs per 100 kb out of those that fulfilled this criterion; in the second case, we chose a milder allele frequency requisite MAF ≥ 0.05, and 1 of every 100 SNPs was randomly selected. The former criterion would result in SNP arrays of ~300k SNPs, given that, for a typical mammalian genome of 3 Gb, 10 SNPs per 100 kb is equivalent to 300k SNPs per genome. The latter strategy implies 10 times more SNPs, approximately, or 3M SNPs per genome. In the cattle simulation, we ascertained SNPs in 25 individuals form the first breed; in the pig simulation, SNPs were ascertained in 50 individuals, 25 from each of the ‘international’ breeds. Individuals in the SNP discovery panel are not present in the genotyped and sequenced panel.

Simulating NGS errors and bioinformatics pipeline

It is important to realize that NGS-based sequence is not simply a dramatic increase in SNP density. NGS reads present a high error rate, and many bioinformatics subtleties affect the final list and reliability of identified SNPs (Nielsen et al. 2011). Besides, compared with SNP arrays, NGS data are highly unbalanced, and there is no guarantee that all sites will be equally covered across samples. To carefully account for these effects, we simulate a typical NGS analysis pipeline as realistically as possible. The complete bioinformatics pipeline is detailed in Figure 2.

Figure 2.

Bioinformatics pipeline.

First, we simulate polymorphisms with the coalescence using the models detailed in Figure 1, which are mapped onto a fasta file of 100 kb. Initially, the fasta file consisted of randomly chosen nucleotides. One of the simulated sequences is considered as the ‘reference’ sequence and fasta files for the simulated sequences are generated using this reference sequence and the coalescence simulations. In each demographic model and simulation replicate, a subset of pairs of sequences (diploid individuals) is selected to identify SNPs following the ascertainment process described previously. The rest of pairs of sequences are subject to genotyping using both the sparse and the high-density SNP sets; to achieve this, the ascertained SNP positions are interrogated in every genotyped individual. Array genotyping is assumed to be error-free and without missing data. In parallel, for the same genotyped individuals, 100 bp long paired-end reads and 500 ± 2 bp libraries are simulated with software art_illumina (Huang et al. 2012) using the default error profile for Illumina's technology. Depths are either 4× or 10×, but most of results presented refer to 10× only. Reads are aligned with bwa (Li & Durbin 2009) against the reference sequence allowing for six mismatches. SNPs are called with samtools mpileup (Li et al. 2009) using all samples from all populations simultaneously, filtering by base and mapping quality 20, the minimum depth to call a SNP is set to three. We also explored minimum depth of five to call a SNP, but results were in general slightly worse. For a SNP to be considered, its minimum quality from samtools is set to 10. To compute relationships and allele frequencies, we consider the genotype likelihoods from samtools, that is, we weigh each genotype by its probability rather than taking as true genotype the most likely genotype. Multiallelic calls, indels and ambiguous genotypes, that is, the calls where two genotypes have the same likelihood, are discarded.

Finally, we compute the different G matrices (Table 1) for all pairs of individuals in each of the 100 kb simulated windows. The correlation between the elements of the G matrix obtained from the ‘true’ sequence and with the different methods is taken as a measure of how well a given method performs. Correlations are obtained between and within populations. The whole-simulation process is replicated 200 times. We retain all elements of the molecular relationship matrices, but we randomly sample 10% in plotting to improve visualization.

Table 1. List of metrics used in this study
DataAverage allele difference (Equation (1))Frequency weighted score (Equation (2))
  1. a

    Divided by number of total sites.

  2. b

    Divided by number of total callable sites, that is, sites with appropriate depth and mapping quality per pair of individuals.

  3. c

    Divided by number of segregating SNPs.

True sequenceGTLaGTF
SNP (1 SNP/10 kb)GSScGSF
High-density SNP (~1 SNP/kb)GHScGHF

Results and discussion

Influence of the demographic scenario

First, we study the properties of the two demographic models from the coalescence simulations, as if true sequences were known. Figure 3 shows the distribution of DNA variability (θ) and Fst in both ‘cattle’ and ‘pig’ scenarios. These values are obtained across simulation replicates, but they can also be interpreted as the distribution across 100 kb windows in the genome of a single individual, provided the windows are sufficiently apart to be considered as unlinked, and assuming homogeneity throughout the genome. Certainly, the latter condition is never fulfilled because the genome varies widely in all its features: number of SNPs, recombination rate, evolutionary constraint and so on, but this only means that the variances in the Figure 3 are severely underestimated compared with the likely real ones.

Figure 3.

Top: distribution of nucleotide diversity in cattle (cow) and pig scenarios (CN, China; IB, Iberian; LW, Large White); middle: distribution of Fst between pairs of populations (DU, Duroc); bottom, distribution of Fsts between Iberian and Large White versus Large White and Duroc.

The most important aspect to notice in Figure 3 is that the history of the species, especially the presence of admixture, has a dramatic influence on the distribution of nucleotide diversity and on differentiation between breeds. These two parameters are fundamental for many aspects of livestock genetics, including effective genetic management, repeatability of GWAS studies or estimating molecular relationships, the main subject of this paper. In the absence of admixture and in a relatively simple model that only includes bottlenecks (the ‘cattle’ model), both θ and Fst follow smooth patterns. In practice, this means that any single window is equally representative of the whole genome and that a limited number of markers will suffice to characterize the genome.

In contrast, when hybridization between genetically distant breeds occurs, as in the pig, the whole pattern becomes much more complex and unpredictable. Note in particular differentiation between two breeds (Fst). While ‘Iberian’ – ‘Chinese’ Fsts windows are consistently high, the Fst's between ‘Large White’ and China is also high but shows a thick tail on the lower end of the distribution. This means that some windows – those that contain tracts of Chinese germplasm – are not too differentiated between these two breeds. The situation becomes even more complex when comparing Iberian and International breeds or between International breeds. Our pig model predicts that the profile is very flat in this case and, therefore, that differentiation along the genome will be extremely heterogeneous. It should be mentioned that Fst values are poorly correlated across comparisons; for instance, even if Fst's patterns were very similar in IB-LW and LW-DU comparisons, the correlation between those Fst's was mild, 0.48 (Figure 3c). In other words, our model predicts that rarely the same two windows in, say, Large White and Duroc will be simultaneously enriched for alleles of Asian origin.

The effect of demography is also seen in the distribution of G matrix elements (Figure 4, left column). In no case, the G elements were normally distributed while the distribution was symmetrical in the no migration scenario (top row). In stark contrast, but unsurprisingly, migration induces a profound change in how G elements are distributed, consistent with what is observed in Fst (Figure 3). The bottom row in Figure 4 shows these distributions for the ‘Large White’ breed. The distribution is highly skewed to the left because, due to introgression, some pairs of individuals in some genome regions are far less related than the rest of non-admixed individuals or windows. This distribution is very complex and, certainly, its exact shape will depend on a number of factors like age and extent of admixture. For instance, if admixing is very recent and massive, like in any of the B. taurus × B. indicus synthetic breeds, the density will be multimodal and rather flat. If admixture occurred long time ago, recombination will have erased differences and the distribution would become symmetrical again, as in the cattle scenario of Figure 1. This is not the case of the pig, where introgression occurred quite recently and so we expect to see patterns similar to those presented in the bottom row of Figure 4.

Figure 4.

Density of GL matrices’ elements obtained from Equation (1) in the cattle (top row) and pig (bottom row) inferred from the different data types: using the true sequence (G_NL), NGS data (G_NL), standard SNP density (G_SS) and high-density genotyping (G_HS). Elements plotted are those corresponding to within the cattle population not used in the SNP discovery process (top) and the pig ‘Large White’ breed (bottom). Numbers in parentheses are skewness coefficients. Note that x and y axes scales vary widely from plot to plot.

The effect of SNP ascertainment

SNP ascertainment is difficult to model because there are several, often unknown, reasons for a given SNP to be included in the array's final panel. Its effects also depend on the genetic distance between the genotyped population and the breeds used to discover the SNPs, which may be large or even unknown. For all those reasons, the results presented here should be taken with some caution and are aimed at gaining insight on the, plausibly, most important consequences of ascertainment. One of the most well-known effects of ascertainment is distorting the site frequency spectrum (SFS) of the markers relative to the true one if individuals were sequenced (Nielsen et al. 2004; Lachance & Tishkoff 2013). Figure 5 shows the observed SFS in the cattle population that is not used for the SNP discovery process. Clearly, while NGS allows us to recover closely the true spectrum, array-based SNP data are biased towards intermediate allele frequency alleles, as is usually observed in real data from large-scale genotyping studies. Note that even high-density (equivalent to a ~3 million SNPs array) results in a considerable shift towards common variants.

Figure 5.

Site frequency spectra in the cattle population that was not used for SNP ascertainment. Average results of 200 replicates of 100 kb windows and a 25 individual population. In the Figure we show the unfolded spectrum, where a high frequency means a high frequency of the mutated (derived) allele. In practice, the ancestral allele is not necessarily known so only the minimum allele frequency (0< MAF <0.5) can be ascertained.

Unsurprisingly, SNP ascertainment also affects the distribution of the G elements (Figure 4). In the case of no migration (top row), the effect of SNP ascertainment is not so dramatic because the distributions have the same shape as the true one, although note that the mean and variances differ. SNP arrays resulted in highly distorted profiles in the presence of admixture (bottom row), again, that underestimates skewness but overestimates variance as compared to the true distribution. NGS data (GNL) allowed us to recover approximately the true pattern, both with and without admixture.

In commercial arrays, the user has no control over the SNPs genotyped. It is possible, nevertheless, that genotyping technologies improve so that custom-designed high-density arrays may become affordable in the future. In this case, it makes sense to study how SNP ascertainment side effects can be minimized. To that end, we explore four SNP ascertainment strategies that differ in SNP density (10 and 100 per 100 kb) and in the minimum MAF required (0.15 versus 0.01 or 0.05). The correlations between true GL elements and those obtained from ascertained SNPs are in Table 2; for completeness, those with NGS data at low (4×) or medium (10×) depths are also shown. At equal SNP density, relaxing the MAF criterion results in better correlations, especially for those breeds not included in the ascertainment processes (Iberian and China). Increasing the SNP density also improves correlation, although with a likely increase in associated costs as well. We do not find a clear improvement if the number of individuals used to discover the SNPs were 150 instead of 50 (results not shown). For SNP discovery and eventually designing a SNP array, it is better to sequence fewer individuals at modest/high depth than a large population with a shallow depth (unpublished results, B. Nevado, S.E. Ramos-Onsins, M. Pérez-Enciso, submitted). The likely reason for this is that the number of SNPs discovered follows a law of diminishing returns (the expected number of SNPs in a sample of size n is proportional to Ewen's sampling formula math formula) and that, at shallow depth, many true heterozygous SNPs will be missed. Pool sequencing, it should be noted, also results in a bias towards high MAF alleles and distorts the original SFS (Raineri et al. 2012). Finally, note that shallow depth sequencing may result in low correlations in some breeds (Iberian), although it is definitely better than SNP genotyping for distantly related breeds (China).

Table 2. Correlation between true and estimated GL elements (pig scenario)
Ascertainment criterionBreed
Max. no. SNPsMinimum MAFIberianLarge WhiteChina
  1. SNPs were ascertained in 25 Large White and 25 Duroc individuals, which were not included when computing G elements.

NGS 4×0.640.940.74
NGS 10×0.900.970.83

The effect of standardizing by allele frequency

If marker allele frequencies were similar, standardizing by frequency would not have a big effect. However, as this is not the case (Figure 5), it is important to assess the effect of allele frequency in G. The density of GF elements is in Figure 6. As can be seen, standardization by frequency results in a more positive skewness (relative to GL, Figure 4). Standardization has a dramatic effect in the admixture scenario, by smoothing the distribution and making it symmetrical, approximately. This proves that low frequency variants are important in building the G matrices. Standardization will tend to wipe out outlier signals, which can improve the statistical properties of G-BLUP, but may also obscure demographic effects of importance. In practice, it seems sensible to compute both matrices and explore their characteristics as a diagnostics tool.

Figure 6.

Density of G matrices’ elements obtained from Equation (2) in the cattle (top row) and pig (bottom row) inferred from the different data types: using the true sequence (G_NF), NGS (G_NF), standard SNP density (G_SF) and high-density genotyping (G_HF). Elements plotted are those corresponding to within the cattle population not used in the SNP discovery process (top) and the pig ‘Large White’ breed (bottom). Numbers in parentheses are skewness coefficients. Note that x- and y- axis scales vary widely from plot to plot.

NGS, but also high-density genotyping, can result in accurate estimates of G

Figure 7 presents the plots of true versus estimated G elements for the variety of scenarios, methods and populations studied. Provided sequencing depth is enough, NGS data results consistently in the best estimates. In the cattle model, correlations between NGS and true sequence-based G elements were close to one (ρ ~ 0.97), the fit being excellent. Relationships based on SNP data perform also well, in particular for the high-density chip and standardized GF elements (ρ = 0.92).

Figure 7.

Scatter plots of true (x-axis) versus estimated G (y-axis) elements across scenarios, populations and methods. Across rows, from top to bottom: (1) within the non-ascertained cattle population, (2) within Iberian, (3) within Large White, (4) within China, (5) between Iberian and Large White. Across columns, from left to right: GSS, GHS, GNL, GSF, GHF, and GNF. Numbers in the title are the correlations between true and estimated G elements.

Not unexpectedly, the situation becomes much more complex when admixture is present (2nd-5th rows in Figure 7). In general, again, NGS provides accurate estimates of relationships within all populations and population pairs (ρ > 0.85) and were slightly better when standardizing by frequency (GNF), and this improved the linearity of the plots. With SNP data, in contrast, the relationship becomes unpredictable, often markedly non-linear and multimodal. In some extreme case, like for China, using array data to infer relationships was virtually useless because of the large genetic distance between ascertained (Large White and Duroc) and genotyped (China) populations. Still, array-based estimates perform reasonably well within international breeds (ρ = 0.70–0.83), although patterns are often far from linear.

We evaluate the influence of average depth across a limited range of options. We do not observe a marked improvement by increasing 10–20× in depth (results not presented). In contrast, shallow depth (4×) does have a big impact for worse, but mainly in the non-standardized G elements (GL). For instance, in ‘Iberian’ pig, correlation between true and NGS elements is 0.25 for GL elements, but it increases to 0.94 in GF. The reason for such a large – and consistent – discrepancy between GL and GF at shallow depth is not completely clear. It is likely, though, that this is due to the fact that only the SNPs identified are considered in the computation of GF, whereas all sites are employed in GL, irrespective of whether SNPs are identified or not. At very shallow depth, many truly heterozygous individuals will not show up because reads pertain to only one allele type, simply because of sampling. This site will count as homozygous in GL but will not be considered in GF. Besides, these events will occur more frequently as MAF decreases so these markers will in any case be given less weight in GF.

Will high-density genotyping, instead of NGS, suffice?

The answer to this question critically depends in part on issues that have not been specifically addressed here, like the reliability of massive imputation of haplotypes out of a few influential animals’ sequenced. Other aspects to be considered is that sequence data are far more complex to analyse and results in highly unbalanced data compared with chip genotyping; sequence data are also much slower and more expensive to acquire than array genotyping. From our simulations, we also observe that the demographic scenario has a key influence on the properties of the G matrices, so a preliminary study to infer key genetic parameters in the target population should be undertaken before setting a definitive sequence-based selection programme.

Despite their limitations, relationships based on ascertained high-density SNPs are reasonably accurate (ρ ~ 0.80 in Large White pigs and ρ > 0.90 in cattle, Figure 7). The main practical limitation is to have access to such a high-density array, which is equivalent to a ca. 3 M SNPs in mammalian genomes (this number follows from assuming 100 SNPs per 100 kb window and a 3 Gb genome). Currently, the densest array in human is over 4 M SNPs but the densest livestock array is the 800k SNP cattle chip (Matukumalli et al. 2011). For that quantity of SNPs, our model predicts a correlation between estimated and true G elements of 0.57 and 0.83 for GL and GF, respectively, in the cattle scenario. These correlations are obtained when SNPs are ascertained with a minimum MAF ≥ 0.15, and are slightly higher if restriction on MAF was less strict, i.e., 0.05. Therefore, the use of high-density SNP arrays instead of sequence data for genomic selection will partly depend on whether standardization by allele frequency results in optimum genetic progress as compared to using average pairwise differences (Equation (1)).


This work explores the effects of SNP ascertainment and of sequencing bioinformatics pipelines in estimating genetic relationships. It evaluates, using plausible demographic scenarios, the impact of NGS and of SNP ascertainment on the estimation of genetic relationships, a fundamental tool for genomic selection. Several conclusions can be drawn from this study. First, the demographic scenario, specially the presence of admixture, has a dramatic influence on the G elements. While average pairwise differences (GL) have a clear asymmetrical distribution with admixture, weighting by SNP frequency (GF) normalizes – to an extent – this distribution. Second, NGS data result in the most accurate estimates of G, provided enough depth is obtained, although shallow NGS may result in highly distorted estimates of GL (although not so much for GF). Third, high-density genotyping can also result in accurate estimates of G; we recommend nevertheless that the distribution of G elements should be carefully inspected to detect unexpected patterns.

It has been advocated (Larkin et al. 2012; Hayes et al. 2013) that implementation of NGS data in genomic selection schemes could involve sequencing a few highly influential animals and have the genomes of the rest of the population imputed using genotyping or very shallow sequencing data. Here, we provide an alternative strategy: to develop high-density specific arrays for the population of interest with a very mild restriction on MAF. These chips could be designed using, for example, the sequence of the founders or of influential animals and would later be used for massively genotyping the rest of the population. This approach should be cheaper and more reliable than sequencing followed by imputation because animals should be genotyped or sequenced anyhow and the development of custom arrays should be widely feasible in the future. A final important note: this strategy critically depends on GF (Equation (2)) being better suited for genomic selection than GL (Equation (1)).


I thank Andrés Legarra. Miguel Toro and Clara Díaz, for suggestions and many discussions, and the editor and reviewer for constructive criticisms. I also thank Martien Groenen, Hendrik Jan Megens and students for a stimulating environment and discussions, Sebastián Ramos-Onsins for his mstatpop program, and Bruno Nevado for his script to map MaCS output onto fasta files. This work was accomplished during a sabbatical stay in Wageningen University, kindly funded by Wageningen University Research (WUR), which provided a stimulating environment and numerous opportunities for invigorating bicycle rides. Work also funded by AGL2010-14822 grant (Spain).

Appendix 1

 Coalescence commands to generate sequence data

For a thorough documentation on coalescence commands, the reader is referred to Hudson's site (Hudson 2002). The command used in the cattle model is (parameters are scaled in 4Ne units):

  • macs 250 100000 -t 0.0012 -r 0.0012 -I 2 150 50 -ej 0.042 2 1 -eN 1.67 10

which is read as: a coalescence simulation of 125 diploid individuals (250 chromosomes) sequence of length 100000 bp with population variability (θ) and recombination rates per base pair equal to 0.0012 in 4Ne units (Ne is 300), 75 and 50 individuals are sampled from each of two populations (-I 2 150 50), the populations diverged t1 = 0.042 generations ago in 4Ne units (-ej 0.042 2 1) and the ancestral population suffered a bottleneck decreasing its effective population size 10-fold t2 = 1.67 generations ago (-eN 1.67 10). These parameters are adjusted to mimic, approximately, observed diversity levels in cattle (Gibbs et al. 2009).

The complete command for the ‘pig’ scenario is, where scaling is again in 4Ne units, with Ne=125, derived from Ne = θ/4μ =0.0012/4 × 10−6:

  • macs 300 100000 -t 0.0005 -r 0.0005 -I 4 50 100 100 50 -m 2 3 0.01 -m 3 2 0.01 -n 1 0.15 -n 2 0.15 -n 3 0.15 -n 4 1.5 -em 0.049 3 4 5 -em 0.05 2 4 10 -eM 0.06 0 -ej 0.07 2 1 -ej 0.0701 3 1 -en 0.09 1 5 -en 0.2 1 7 -en 0.21 4 10 -ej 10 1 4

The above command (-I 4 50 100 100 50) means that diploid sequence of 25 individuals from each of Iberian and Chinese populations are simulated, together with 50 from Duroc and Large White. Of those, 25 Duroc and 25 LW are later randomly selected to carry out SNP ascertainment, which is used to ‘genotype’ the rest of samples. Migration between China (population 4) and International breeds (populations 2 and 3) have occurred between generations 0.049 and 0.06 ago at a rate 5% and 10%, as well as limited symmetric migration between LW and DU (-m 2 3 0.01 -m 3 2 0.01) all European populations have merged 0.07 × 4Ne generations ago (-ej 0.07 2 1), whereas China and Europe diverged 10 × 4Ne generations ago (-ej 10 1 4).

The properties, in terms of variability and Fst distributions, from these two models were explored with mstatpop (Ramos-Onsins et al., unpublished, available at, that process directly the output from MaCS software.