On the origin and structure of haplotype blocks

Abstract The term “haplotype block” is commonly used in the developing field of haplotype‐based inference methods. We argue that the term should be defined based on the structure of the Ancestral Recombination Graph (ARG), which contains complete information on the ancestry of a sample. We use simulated examples to demonstrate key features of the relationship between haplotype blocks and ancestral structure, emphasizing the stochasticity of the processes that generate them. Even the simplest cases of neutrality or of a “hard” selective sweep produce a rich structure, often missed by commonly used statistics. We highlight a number of novel methods for inferring haplotype structure, based on the full ARG, or on a sequence of trees, and illustrate how they can be used to define haplotype blocks using an empirical data set. While the advent of new, computationally efficient methods makes it possible to apply these concepts broadly, they (and additional new methods) could benefit from adding features to explore haplotype blocks, as we define them. Understanding and applying the concept of the haplotype block will be essential to fully exploit long and linked‐read sequencing technologies.

In the sampled generation, a sample of k genomes is represented as {{0, {}, {{1}}}, …, {0, {}, {{k}}}: the j' th genome is ancestral to itself, (j), over its whole length.Stepping back, crossovers are generated for each genome, drawn from a Poisson with expectation R, and uniformly distributed.To be efficient, these are generated in a single draw.Genomes that experienced a crossover are replaced by two parent genomes.For example, if there were a crossover at j 1 , {0, {}, {1}} would be replaced by genomes {0, {j 1 }, {{1}, {}}} and {X, {j 1 }, {{}, {1}}}.Here, the genotype X is chosen randomly with probability equal to the allele frequency in the whole population.Stepping back in time, coalescence is simulated by assigning each genome a parent with the same allelic state.If the population in the previous generation carries k copies with allele X = 1, then each current genome is assigned a random integer in {1, k}, and similarly for allele X = 0. Parent genomes are then assigned by merging ancestries.For example, if current genomes {0, {j 1 }, {{1}, {}}} and {0, {j 2 }, {{3}, {2, 3}}} share a parent, and j 1 < j 2 , then that parent is assigned as {0, {j 1 , j 2 }, {{1, 3}, {3}, {2, 3}}}.Note that coalescence can only occur within a genetic background (defined by X = 0 or 1), and that multiple coalescences between multiple genomes can occur: we simulate the Wright-Fisher model, not the standard coalescent, which only applies in the limit of a large population.This is important, since in the early stages of a sweep, the new allele is present in few copies.In each generation, the list of ancestral genomes is tidied by deleting genomes that are not ancestral to the sample, and deleting junctions that separate regions with the same ancestry.The algorithm is iterated back, until every ancestor is ancestral to the whole sample, which implies that all genealogies have reached their single common ancestor.
Very large populations can be simulated, since we only track ancestors.The computation is limited by the number of ancestors, which increases with 2 N e R, and the time to complete coalescence, whioch increases with 4 N e .For a given 2 N e R, it is most efficient to choose a modest N e and a large R, but N e should not be so small that results deviate from the standard coalescent.This can be determined by making some runs for larger N e , and by checking against theoretical predictions for the standard coalescent.
In the neutral case, the allelic state is simply set to X = 0. To simulate conditional on a sweep, the trajectory is first generated, by simulating forwards from a single mutation, according to the Wright-Fisher model, and conditioning on fixation.The trajectory is then augmented back in time, by setting it at one copy into the indefinite past.(It is not obvious what the correct choice should be here, but since the genealogy at the selected locus must have coalesced completely at the time of the sweep, and since linked lineages are unlikely to coalesce into the ancestor of the mutation (assumed to be in a single copy) the choice should not make much difference).By first drawing the trajectory, and then simulating coalescence and recombination conditional on it, we can separate random variation in the trajectory, from random variation in the outcome for a given trajectory.We can only infer properties of the single actual trajectory, which may well deviate substantially from expectation.
The core structure, which defines the ARG, is the list of ancestors, traced back through time.From this, we generate a list of all distinct junctions, and thus, of all intervals along the genome; for each of these, a genealogy is constructed.We then construct a list of branches, each branch being generated by a coalescence event that occurs at a specific time, and that brings together a specific set of sampled genomes.For each branch, we list the regions of genome that it covers, in which

Neutral case
Example with 10 genomes, sampled from 2N=100 on R=0.1 (i.e.10cM); SNP generated with μ/r=2 Setting up, and saving results The simulation stores individuals that are represented as {X, {r 1,2 , …}, {a 1. , a 2 , …}}, where X is the genotype at the selected locus, {r 1,2 , …} lists the breakpoints, and {S 1. , S 2 , …} lists the sets of sampled genomes that each interval is ancestral to.The list below goves the lengths of these sets, for individuals 10 generations from the end, when there are 8 ancestral genomes, and much of the genome has coalesced (represented by sets S of length 10).
New functions are defined with the suffix ... Full that distinguish blocks that originate at different times, t. branchListFull returns branches in the form {S, t, {{r 1 , r 2 }, …}}, allowing for the possibility that they may fall in multiple intervals (though this does not happen in this example).posBlockFull[pop, {S, t, {{r 1 , r 2 }, …}], R] returns the instances of the branch in the form {t, {r 1 , r 2 }}.These ..Full versions should be used throughout.

Making the genealogies
This sets up a list of the 34 genealogies along the 10cM stretch of genome: This is a movie of the genealogies along the genome.Numbers label the branches, and the vertical scale is proportional to coalescence times.The vertical axis is always 270 (the time for all of the genome to coalesce.It is not obvious which direction causality runs, and so we plot map length against the time spanned by the branch (le�) as well as time span against map length (right).The 482 branches are shown by blue dots, and the mean for each timespan is shown by the red dots.The two fits (top, bottom row in the table) are to the blue vs the red dots.In each plot, the black line is ∼t -1 , to indicate the slope expected with a simple inverse relation.There is an inverse relation, but the least-squares fit is x ∼t -0.68 (le�), or t∼x -0.54 rather than the expected simple inverse.
Here, the exponent has magnitude less than 1 due to regression to the mean.Suppose that we have two variables, log(t) = y + v t and log(x) = 1 y + v x , which are determined by an underlying inverse relationship plus independent random errors with variances v t , v x , then the regression coefficient of log(x) on log(t) is - The key question is, what kind of branches carry the total ancestry (measured by the area of the branch, and proportional to the expected # of SNP).The area of a branch increases with its timespan (as ∼t 0.34 ), and with the map length that it spans.The le� plot shows the distribution of areas; note the peaks at 0.1, 0.2 which correspond to branches that start at the tips and go back 1 or 2 generations before the first coalescence.The right plot shows the cumulative distribution of area.The total ancestry (map length*generations) is 105.3, on 488 branches; 44 branches with area>0.58carry half of this, and 243 with area>0.1 carry 90% of the total ancestry.Histograms of time span (le�) and map length (right).Blue: all 488 branches; yellow: the 243 branches that carry 90% of the ancestry; red: the 44 branches that carry 50% of the ancestry.The 44 most substantial branches (red) span a wide range, but tend to have longer timespans and span wider map lengths than for the branches as a whole.In this diagram, genome #1 is the bottom row, #10 the top.Thus, branch 1 leads down to to 1 (black), branch 14 leads down to {7, 10} (grey), and so on.These sets can be seen in the genealogies above.These are the data used in the block diagrams in the paper.The two versions only differ in the order with which blocks are overlaid.

Setting up the sweep
There is substantial variation between replicate sweeps, which arises mainly from random variation in the time at which exponential growth is established.The probability of fixation is ∼2 s, and conditional on fixation, the sweep is established as if starting at an initial frequency p 0 that is exponentially distributed, with mean 1/(2 s).This particular sweep has a rather long delay.The trajectory itself is spliced onto a very long run of 1's, so that the simulation can continue back until all lineages have coalesced.This sets the ancestral genotype, prior to the mutation, to zero: This sets up a list of genealogies for each of the 428 intervals along the 20cM stretch of genome.We track yy, the interval it is working on.This is quite slow (and an inefficient algorithm, since it works independently on each genealogy).This generates a set of genealogies that have an outgroup added, and also that specify the associated genotype at the selected locus.It takes some time to add the ancestries: This saves the example.Note that there are two sets of genealogies: genB20 which includes only the sampled 20 genomes, and genB20New which includes the selected background and an outgroup.blB20 lists branches only from the first set.

Description of the genealogies
This shows depth (top right) and mean pairwise divergence (lower le�) for the genealogies.The lower right plot shows SNP patterns, comparing the # of segregating sites (black) with π (blue) along the genome.The classic theory works on average, but not necessarily in any one instance, even for a rather clear sweep..

Genealogies along the genome
This shows how the genealogy along the genome, on a log scale.The genomic interval is shown below each plot, and the numbers label the branches.Below, doing the same, colouring in the branches properly (by showing where the recombination onto a different background happens), and marking times when p = 1 400 , 0.1, 0.5, 0.9 (at {22,38,53,110 generations; dashed lines}).Note that coalescence can only occur within the same background (i.e.black with black or red with red).In the recent past, recombinations may occur within the derived (red) background; further back, when the fitter allele is still rare, recombination will almost certainly be onto the ancestral background (red→black), rather than in the opposite direction.Recombination out is likely when p is below 0.5, and must happen before the origin of the mutation if coalescence is to be avoided.Up to map position ~0.01, there is some rearrangement amongst recent genealogies, but all lineages coalesce in a MRCA associated with the new mutation (marked 731 in the top le� and top middle panels).The first recombination splits the genealogy into clades of size {4,16}; the number of deep branches (that trace back >110 generations, deeper than the top dotted line) for the 9 trees shown below is {1,1,2,2,3,3,5,6,6}.Note that under the standard coalescent, the expected time for a large sammple to coalesce down to 6 lineages is 120 generations -so we expect that large samples will already be substantially thinned before the sweep is reached.By map position 0.0129 (top right), we have had 5 recombinations out (and one back) at {{38,45},47,65,90}.At the end of the genome (map position 0.2; bottom right) there have been about 7 recombination events.One can accurately calculate the expectations for these numbers using the standard coalescent.

General considerations
The following sections try various ways of identifying sets of "interesting" branches.Showing the largest branches: 30 non-tip branches, area>2 These are the # of blocks and their areas, for different threshold areas.7 of the branches give 22% of the area, whilst 22 give 44%.I choose a threshold area of 2, and restrict to non-tips which gives 33% of the area.
Here, I find 32 non-tip branches that cover 33% of the area and 31% of SNP; two of these are nested, and so dropped, leaving 30 branches.
For the 30 branches, these are the # of descendants; times of origin; extent; # SNP; and areas.One branch(#347) is disjunct.Not many are generated by the sweep, or associated with it, in the sense that only 3 begin at zero, and only 9 originate between 40 and 80 generations back (which is when the sweep is inducing most coalescence) {"index", "colour", "# desc", "t", "# SNP", "area", "{r 0 ,r This shows the 30 non-nested, non-tip branches that are associated with the focal genealogy.The light pink rectangle at le� is where the whole genealogy coalesces in the sweep.The substantial branches seem to be concentrated near the sweep, even though SNP diversity is high all along the genome, except for the le�most 1cM Showing the largest branches: 49 branches, area>2 With a threshold area of 2, including tip branches, we include 62% of the area and 61% of SNP: {"index", "colour", "# desc", "t", "# SNP", "area", "{r 0 ,r 1 }"}], TableDepth -> 2] This shows the 30 non-nested, non-tip branches that are associated with the focal genealogy.The light pink rectangle at le� is where the whole genealogy coalesces in the sweep.The substantial branches seem to be concentrated near the sweep, even though SNP diversity is high all along the genome, except for the le�most 1cM This shows all 30 of the "substantial" blocks

This is the set of branches used in the final figure
This shows the 9 branches with >8 descendants, that originate at or more recently than the mutation; and that have area >0.5.The branches with the largest area, longest extent, and the most descendants tend to be near the sweep, but there are three small branches to the right with 9 or 10 descendants: This saves the example: ancestry is enormous, apparently, so we did not save it.Note that there are two sets of genealogies: genB20 which includes only the sampled 20 genomes, and genB20New which includes the selected background and an outgroup.blB20 lists branches only from the first set.I do not save the genealogies, since these take a long time to construct, and are not needed for the plots.The ancestry stored here also includes ancestry for the main example.

10 5 ,
{}, 0, {j, Length[ints20]};This throws 937 SNP onto the 730 branches and 20 genomes (using addSNPFull).There are 4762 segregating alleles: is a histogram of the block areas, on a log 10 scale -centred on about 0.1; the right two plots give the distribution of # SNP (log and normal scale).Most branches have no SNP; 604/937 SNP are on 51 branches with at least 5, 438 are on the 25 branches with at least 10, and 209 are on the 8 branches with at least 20.