Keywords:

  • Coalescent;
  • Composite likelihood;
  • Lipoprotein lipase;
  • Marginal likelihood;
  • Mutation rate heterogeneity;
  • Pseudolikelihood;
  • Recombinational hot spot;
  • Recombination rate

Abstract

  1. Top of page
  2. Abstract
  3. 1. Introduction
  4. 2. Background
  5. 3. Marginal likelihood
  6. 3.1. Implementation
  7. 4. A composite likelihood
  8. 4.1. Informal theoretical considerations
  9. 4.2. Simulation results
  10. 5. Lipoprotein lipase data
  11. 5.1. Possible recurrent mutation
  12. 5.2. Recombinational hot spot
  13. 5.3. Methodological issues
  14. 6. Discussion
  15. Acknowledgements
  16. References

Summary. There is currently great interest in understanding the way in which recombination rates vary, over short scales, across the human genome. Aside from inherent interest, an understanding of this local variation is essential for the sensible design and analysis of many studies aimed at elucidating the genetic basis of common diseases or of human population histories. Standard pedigree-based approaches do not have the fine scale resolution that is needed to address this issue. In contrast, samples of deoxyribonucleic acid sequences from unrelated chromosomes in the population carry relevant information, but inference from such data is extremely challenging. Although there has been much recent interest in the development of full likelihood inference methods for estimating local recombination rates from such data, they are not currently practicable for data sets of the size being generated by modern experimental techniques. We introduce and study two approximate likelihood methods. The first, a marginal likelihood, ignores some of the data. A careful choice of what to ignore results in substantial computational savings with virtually no loss of relevant information. For larger sequences, we introduce a ‘composite’ likelihood, which approximates the model of interest by ignoring certain long-range dependences. An informal asymptotic analysis and a simulation study suggest that inference based on the composite likelihood is practicable and performs well. We combine both methods to reanalyse data from the lipoprotein lipase gene, and the results seriously question conclusions from some earlier studies of these data.


1. Introduction

Humans, like most (so-called diploid) organisms, have two versions of each chromosome, one inherited from each parent. Each sperm or egg has only a single copy of each chromosome, typically formed as a mosaic of the two parental copies. The process of ‘shuffling’ the two parental copies to produce a single chromosome is called recombination. Although not completely understood, it involves variation, so for example the complete set of chromosomes in two different sperm or eggs from the same person would be expected to differ, because of differences in the recombination processes during their formation.

The recombination rate between two specified positions, or loci, on a chromosome is defined to be the probability that the deoxyribonucleic acid (DNA) at the two locations on the offspring chromosome comes from different chromosomes in the parent. Between two positions which are very close together on the chromosome, the recombination rate will be extremely small. For positions which are far apart, it will be close to 1/2. (Recombination rates are bounded below by 0 and above by 0.5.)

There is enormous current interest in understanding the way in which recombination rates vary across the human genome. There is known to be variation over large scales, but little is known about the extent to which recombination rates vary over small scales. Aside from inherent interest, in shedding light on the underlying biological processes, a good understanding of patterns of local variation is central to both the design and the analysis of current and planned studies aimed at elucidating the genetic basis of common human diseases (e.g. Pritchard and Przeworski (2001)). Although the methods described in this paper can be applied to data from any diploid species, we shall, for definiteness, focus the discussion on the case of current major interest, namely humans.

It may be helpful to the subsequent discussion to have some (very rough!) idea of the scales involved in the human genome. The genome consists of around 3×10^9 nucleotides, or base pairs of DNA, broken into 23 chromosomes. (For our purposes, DNA can be thought of as being a linear string made up of discrete building-blocks, namely the four nucleotides {A,C,G,T}.) The chromosomes differ in size but think of a typical chromosome as being 10^8 bp (base pairs) or 100 Mb (megabases) long; the smallest is about 50 Mb, and the largest about 300 Mb. Distances measured in base pairs are called physical distances. This contrasts with so-called genetic distance which measures the amount of recombination between two loci. Two loci are at a genetic distance of 1 M (morgan) if the expected number of recombination events between them during one meiosis (the formation of a sperm or egg cell) is 1. Because recombination rates vary across the genome, there is no standard conversion between physical and genetic distances. The average across the genome is close to 1 cM per megabase (e.g. Pritchard and Przeworski (2001)). There are various models which relate recombination rates to genetic distance (see for example Speed and Zhao (2001)), but over small genetic distances recombination rates and genetic distances are almost the same. Thus for example the recombination rate between two loci at genetic distance 1 cM will be close to 0.01.
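As a rough illustration of how a recombination rate follows from a genetic distance, the sketch below uses Haldane's map function, r = (1 − exp(−2d))/2 with d in morgans. This is only one of the "various models" referred to above and is chosen here as an assumption for illustration; the function name is ours.

```python
import math


def haldane_recombination_rate(genetic_distance_morgans: float) -> float:
    """Recombination rate implied by Haldane's map function (an assumed model)."""
    if genetic_distance_morgans < 0:
        raise ValueError("genetic distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * genetic_distance_morgans))


if __name__ == "__main__":
    # Small distances give r close to d; large distances approach the bound 0.5.
    for distance_cm in (0.1, 1.0, 10.0, 100.0, 1000.0):
        rate = haldane_recombination_rate(distance_cm / 100.0)  # cM -> morgans
        print(f"{distance_cm:7.1f} cM -> recombination rate {rate:.5f}")
```

Under this assumed map function the rate at 1 cM is very close to 0.01, and the rate never exceeds 0.5 however large the genetic distance.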

The standard method for assessing recombination rates in humans is from pedigree data. In its simplest form, we might have observations on parent–child duos and simply count the number of recombination events between the loci of interest. In practice it is much more complicated, and can be highly non-trivial statistically, effectively because often only incomplete information is available about recombination events (see for example Holmans (2001) and Thompson (2001)). There are natural limits to the resolution of pedigree-based methods: they can only be used to estimate recombination rates between loci whose genetic distance is of the order of centimorgans or more. If complete information were available, the problem would amount to estimating a binomial success probability. The real problem is more difficult, but for example thousands of meiotic events are needed to estimate recombination rates of the order of 10^−2.
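The binomial idealization can be made concrete with a small calculation. The sketch below assumes complete information (every meiosis is scored as recombinant or not) and uses the usual standard error of a binomial proportion; the function names and the 20% precision target are illustrative assumptions, not values from the text.

```python
import math


def binomial_relative_se(rate: float, n_meioses: int) -> float:
    """Relative standard error of the binomial estimate of `rate`."""
    if not 0.0 < rate < 1.0 or n_meioses <= 0:
        raise ValueError("need 0 < rate < 1 and a positive number of meioses")
    return math.sqrt(rate * (1.0 - rate) / n_meioses) / rate


def meioses_needed(rate: float, target_relative_se: float) -> int:
    """Approximate number of meioses for the relative SE to reach the target."""
    if not 0.0 < rate < 1.0 or target_relative_se <= 0:
        raise ValueError("need 0 < rate < 1 and a positive target")
    return math.ceil((1.0 - rate) / (rate * target_relative_se ** 2))


if __name__ == "__main__":
    print(binomial_relative_se(0.01, 1000))  # relative SE with 1000 meioses
    print(meioses_needed(0.01, 0.2))         # meioses for a 20% relative SE
```

With a rate of 0.01 and 1000 fully informative meioses the relative standard error is still roughly 31%, which is consistent with the statement that thousands of meiotic events are needed at this scale.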

Over the last 5 years or so, the ability to type single sperm separately has added to the resolution, allowing the estimation of rates of the order of 10^−3 or 10^−4. Recently, more sophisticated (sperm typing) experimental approaches allow a practicable estimation of male recombination rates as small as 10^−5 (Jeffreys et al., 2001) but at very substantial cost. Sperm typing methods are currently not feasible for simultaneously assessing rates across the genome, and in any event provide no information on recombination rates in females (which differ in humans from rates in males, at least over large scales). Thus, in short, existing high throughput methods provide information about recombination rates over scales of millions, or in some cases hundreds of thousands, of bases. (We note in passing that an estimation of recombination rates is much easier in most experimental organisms, effectively because experimenters can simply arrange for large numbers of matings of the type that are most helpful.)

A recent study (Jeffreys et al., 2001) showed substantial clumping of recombination events within one 200 kb region of the human genome. The possibility of such extreme local variation in recombination rates is also supported by several other lines of evidence (see for example Jeffreys et al. (2001) and Daly et al. (2001), and references therein), but little systematic information is currently available about recombination rates over physical distances of kilobases or tens of kilobases in humans.

There is potentially useful information for estimating local recombination rates in samples of chromosomes taken from unrelated individuals in a population. Such chromosomes are related by an (unobserved) pedigree, or genealogy, going back many generations (typically many tens of thousands for human DNA). If we could observe both the genealogy and the recombination events on it, this would allow (straightforward) estimation of recombination rates as small as 10^−4 or 10^−5. In practice, neither the genealogy nor the recombination events are directly observed, but data of this kind carry some information about each, and hence, at least in principle, information for estimating recombination rates over small scales.

There has thus been intense recent interest in methods for estimating local recombination rates from population genetic data. Early approaches used summary statistics (often via a method-of-moments approach; see for example Hudson and Kaplan (1985), Hudson (1987) and Wakeley (1997)); however, these methods only use a small amount of the information that is available from the data and can have poor statistical properties (Wall, 2000). More encouragingly, several different groups have developed full likelihood inference methods (Griffiths and Marjoram, 1996a; Kuhner et al., 2000; Fearnhead and Donnelly, 2001), which are ‘optimal’ in the sense that they use all the available information from the data. However, all these approaches use computationally intensive statistical techniques (either Markov chain Monte Carlo or importance sampling), with very substantial computational burdens, and potential difficulties in assessing the accuracy of approximated likelihood curves. As one example, the most efficient current method takes a month on a 400 MHz Pentium personal computer to give reliable estimates of the likelihood surface for a data set of 500 bp of sequence data from each of 31 chromosomes (Fearnhead et al., 2002). (This method is typically about four orders of magnitude more efficient than some other published approaches; see Fearnhead and Donnelly (2001).)

Existing full likelihood approaches are not practicable for the sizes of data set that are currently being generated. Although there may be scope for improving their efficiency, possibly substantially, we adopt a different approach in this paper, in studying two approximations. Rather than developing computational methods for approximating the full likelihood, we consider two different likelihoods. The first is a marginal likelihood, where we ignore some of the information in the data, considering instead the likelihood for a reduced data set. Through a careful choice of which aspects of the data to ignore, this results in considerable computational savings with only a very limited loss of information for estimating parameters of interest. Our second approximation in effect uses a different, simpler, model for the data, essentially by ignoring certain long-range dependences. We call the resulting likelihood a composite likelihood for our original problem. We sketch some asymptotic (in the length of the sequence) theory and describe simulation studies, both of which are encouraging, and suggest that the resulting procedures may be reasonable for estimation.

Neither general idea is new, although the way in which we implement them is novel. For example Wall (2000) presented an estimation method for recombination based on the likelihood for a summary of the data. We choose a higher dimensional, more informative, summary, so rather more sophisticated methods for studying the associated likelihood are needed. We note that, with our chosen summary, the marginal likelihood is typically almost indistinguishable from the full likelihood. Our composite likelihood approach draws on ideas from spatial statistics (e.g. Besag (1975)). A related, though in some senses complementary, approach to estimating recombination rates has been developed by Hudson (2001a) and subsequently adapted by McVean et al. (2002). They created a composite log-likelihood by summing the log-likelihood of the data at all pairs of nucleotides. Instead, we split the chromosomal region of interest into subregions and create a composite log-likelihood by summing the full log-likelihoods for each subregion. The accuracy of inference based on our composite log-likelihood function depends crucially on the size of subregions used. The larger the subregions the closer the composite log-likelihood is to the (optimal) full log-likelihood.

There are two ways of thinking about the generic statistical problem that is considered here. In the population genetics context there is an accepted family of models for population evolution, including mutation and recombination. Forwards in time these can be thought of as versions of the Fleming–Viot measure-valued diffusion, an infinite dimensional stochastic process. Backwards in time, they induce a random genealogy modelled by the coalescent, and its relatives. (There are important issues about the adequacy of these models for particular applications, but the models are none-the-less widely used, and we shall not consider such issues here. For a discussion in the context of estimating recombination rates, see Fearnhead and Donnelly (2001).) One way of thinking about the challenge of the inference problem is that our sample of chromosomes from the population corresponds to partial information about the value of the Fleming–Viot process at a single point in time, on the basis of which we wish to estimate the (two) parameters governing its evolution. Another perspective, closer to the presentation four paragraphs above, is that this is a missing data problem. If we were told the unobserved genealogy, and recombination events, inference would be straightforward. Missing data problems have received considerable recent attention, but this one is particularly challenging. Although, as is typical, there are choices about exactly how to specify the missing data, however this is done, the space in which it resides is very large.

The paper is organized as follows. The next section gives a very brief outline of the role of the coalescent and its extension to allow for recombination. For further background see for example Hudson (1990) and Donnelly and Tavaré (1995). The marginal likelihood approximation is introduced and studied in Section 3, with the composite likelihood approximation in Section 4. In Section 5 we apply these methods to a data set of sequence variation in the lipoprotein lipase (LPL) gene, a data set which is many times too large for it to be practicable to use full likelihood methods. Aspects of our results differ markedly from a previous analysis of these data (Templeton et al., 2000a): we find no evidence for their conclusion that repeat mutation, and not recombination, is responsible for producing many of the features that are observed in the data. Evidence for their conjectured recombinational hot spot differs substantially across population samples.

The term site refers to a particular nucleotide position. A site is said to be segregating in a sample of chromosomes if not all chromosomes in the sample have the same nucleotide at that site. We also assume throughout that the data consist of DNA sequences from each chromosome sampled. In genetics terminology, this is an assumption that the haplotype of each chromosome is known, or equivalently that we know the phase at each segregating site. Phase information can either be obtained experimentally (e.g. Sobel and Lange (1996)) or inferred (e.g. Stephens et al. (2001)). Throughout we only consider estimating recombination rates from DNA sequence data, although the approximate likelihood methods that we suggest have obvious extensions to other types of population genetic data.

2. Background

Population genetic data are generated by the interaction of two processes: the genealogical process (the interrelatedness of different chromosomes as a result of shared ancestry over long timescales) and the mutation process. Note that recombination affects the first of these, as it enables the DNA at two loci on one chromosome to be descended from different chromosomes in the previous generation. First consider the genealogical history of a single position or site in the sequence. This can be represented by a tree (Fig. 1), with time going back into the past as we move up the page. For a sample of n chromosomes, there are initially n distinct branches, each representing the ancestry, at the site of interest, of one of the chromosomes sampled. As we go back in time, chromosomes in the sample share common ancestors (represented by the joining, or coalescing, of branches in the tree). The genealogy stops when all the chromosomes sampled are traced back to a single common ancestor at the site in question.

Figure 1. Example of a genealogy at (a) a single site and (b) at two sites for a sample of size 3: moving up the tree or graph corresponds to going back in time; the joining of branches (going back in time) represents chromosomes sharing a common ancestor (these are called coalescent events); in (b) the dependence between the genealogies can be seen—they differ only because of the effect of recombination

The genealogy for a stretch of recombining DNA is more complicated: each site in the region of interest has its own genealogical tree. However, genealogies at nearby sites will be strongly dependent—in fact they are often identical (they only differ if a recombination event occurs between the two sites during the genealogical history of the sample). The entire collection of genealogical trees for the region of interest can be represented by a graph, called the ancestral recombination graph (ARG) (Griffiths and Marjoram, 1996b); see Fig. 1 for an example.

Conditional on the genealogy of a sample, mutations occur as a Poisson process along the branches of the ARG, independently for distinct sites. Furthermore, given the realization of the ARG, it is straightforward to evaluate the probability of a particular configuration of sequences in the chromosomes sampled.
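For the simplest case of a single non-recombining locus, the interplay of genealogy and mutation can be sketched in a few lines. The code below assumes the standard coalescent with time scaled so that each pair of lineages coalesces at rate 1, so that mutations fall at rate θ/2 per unit branch length, and it assumes the infinite sites model so that every mutation creates a new segregating site. It is a toy illustration with hypothetical function names, not the simulation code used in the paper.

```python
import random


def simulate_total_branch_length(n: int, rng: random.Random) -> float:
    """Total branch length of a coalescent tree for n sampled chromosomes."""
    total = 0.0
    for k in range(n, 1, -1):
        # With k lineages, the time to the next coalescence is exponential
        # with rate k(k-1)/2, and k branches extend over that time.
        total += k * rng.expovariate(k * (k - 1) / 2.0)
    return total


def simulate_poisson(mean: float, rng: random.Random) -> int:
    """Poisson draw: count unit-rate arrivals falling in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(n: int, theta: float, rng: random.Random) -> int:
    """Number of segregating sites: mutations at rate theta/2 along the tree."""
    branch_length = simulate_total_branch_length(n, rng)
    return simulate_poisson(theta * branch_length / 2.0, rng)


def expected_segregating_sites(n: int, theta: float) -> float:
    """Expectation theta * (1 + 1/2 + ... + 1/(n-1)) under these assumptions."""
    return theta * sum(1.0 / i for i in range(1, n))
```

Averaging `simulate_segregating_sites` over many replicates should approach `expected_segregating_sites`, which gives a quick sanity check on the time scaling.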

As noted above, we shall focus here on the simplest setting in which the model for the ARG is given by the coalescent with recombination. See for example Kingman (1982a, b), Hudson (1983) and Kaplan and Hudson (1985) for further background. The model is parameterized by scaled mutation and recombination rates, denoted by θ and ρ respectively. If N is the effective population size, and u and r are respectively the probabilities of mutation and recombination within the region of interest in a single generation, then θ=4Nu and ρ=4Nr. (The timescaling by N reflects the fact that chromosomes are typically related over times of the order of N generations. Thus unless the per generation recombination rate r between two loci is extremely small (say 10^−4 or smaller) the scaled recombination parameter ρ will be large, and the genealogies, and hence genetic types in a sample, will be independent at the two loci. This is an important difference from pedigree analyses of recombination. In our population setting, dependence between the types at two loci (induced by the correlation in their genealogies) typically only extends over distances of fractions of centimorgans. In pedigree studies the dependence extends over chromosomal scales.)
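A short worked example of the scaling may help. The effective population size of 10 000 used below is a hypothetical value chosen for illustration, and the per-kilobase recombination probability of 10^−5 per generation simply restates the genome-wide average of about 1 cM per megabase quoted earlier.

```python
def scaled_rate(effective_population_size: float,
                per_generation_probability: float) -> float:
    """Return 4 * N * x, the scaling that defines both theta and rho."""
    if effective_population_size <= 0 or per_generation_probability < 0:
        raise ValueError("N must be positive and the probability non-negative")
    return 4.0 * effective_population_size * per_generation_probability


if __name__ == "__main__":
    n_effective = 1.0e4   # hypothetical effective population size
    r_per_kb = 1.0e-5     # 1 cM per megabase expressed per kilobase
    print(scaled_rate(n_effective, r_per_kb))   # rho for a 1 kb region
    print(scaled_rate(n_effective, 1.0e-4))     # rho when r is about 10^-4
```

Under these assumed values ρ is of order 1 per kilobase, and it is already several units once r reaches 10^−4.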

Any full likelihood method essentially needs to calculate probabilities by averaging over the unobserved realizations of the genealogy. These are themselves very high dimensional objects, living in an infinite dimensional space. Two different classes of approach have been suggested, based on either importance sampling or Markov chain Monte Carlo methods. See Stephens and Donnelly (2000) and Stephens (2001) for a general discussion of the different approaches to this problem and Fearnhead and Donnelly (2001) for a comparison of published methods for full likelihood inference in the presence of recombination.

To proceed, we shall also need to fix on a particular model for mutation. For simplicity and definiteness we shall focus primarily on the so-called infinite sites model. This model, which assumes that each mutation event that occurs in the genealogical history of the sample will affect a different nucleotide site, may not be unreasonable for much human DNA sequence data (with the exception of mitochondrial DNA). An extension of the ideas in what follows to other mutational models is straightforward in principle, and in some cases also in practice. One of our methods, and the data analysis of Section 5, is based on a finite sites model which explicitly allows recurrent mutation.

In the simulation studies which follow we fix particular values for θ and ρ, which are plausible for human populations, of 1 per kilobase in each case (e.g. Pritchard and Przeworski (2001)). Discussions of sequence length should be interpreted relative to these parameter values. Thus the computational burden depends on the total recombination and mutation rates across regions in question. A method which is computationally feasible for a sequence of length 2 kb with θ=ρ=1 per kilobase will be feasible for a sequence of length 10 kb if θ=ρ=0.2 per kilobase.

3. Marginal likelihood

Our first approximation is to consider the likelihood of a summary of the data, the idea being to find a summary which is both informative about ρ and for which it is still practicable to calculate likelihood surfaces. Wall (2000) used this idea in the context of estimating recombination rates. He chose a summary which was sufficiently simple that the likelihood for the reduced data could be estimated well by naïve simulation. Here we work with a higher dimensional, and hopefully considerably more informative, summary. The price to be paid is that more sophistication is needed to estimate the relevant likelihood. As described below, we do this here by adapting to this simpler problem the importance sampling approach that we developed in Fearnhead and Donnelly (2001) for full likelihood estimation of recombination rates.

Let S be the set of segregating sites at which the minor nucleotide frequency (the number of times that the less common nucleotide appears in the data) is greater than some prespecified value. Several particular choices are discussed below. We summarize the data by D_S, the haplotypes defined (only) by sites in S, and S_O, the number of segregating sites in the data which are not in S. See Fig. 2 for an example. The idea is that D_S should be informative about ρ, as it contains much of the linkage disequilibrium (LD) information from the data (i.e. information about the non-independence of the collections of nucleotides at different positions) and it is this information that is particularly informative about ρ. The total number of segregating sites should be informative about θ. The marginal likelihood L_M(ρ,θ) is the likelihood of this summary of the data.

Figure 2. Example of our summary of the data: the full data consist of the DNA at eight sites in four chromosomes; here we have chosen to keep only the sites at which the minor nucleotide frequency is 2 (i.e. there are two of both of the nucleotides at that site); this defines our set S; our summary is the types of each chromosome at these three sites, and the number of other segregating sites (sites at which more than one nucleotide appears); we also assume known the position of the sites in S, the number of sites sequenced and the minor nucleotide frequency used in the definition of S
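The data summary can be sketched as follows. The haplotypes in the example are invented to mimic the shape of Fig. 2 (four chromosomes, eight sites, three retained sites) and are not the figure's actual data. The sketch assumes at most two nucleotides per segregating site and treats the threshold as "minor count at least `min_minor_count`"; the function name is ours.

```python
from collections import Counter
from typing import List, Tuple


def summarise_haplotypes(haplotypes: List[str],
                         min_minor_count: int) -> Tuple[List[int], List[str], int]:
    """Return (retained site positions, haplotypes at those sites, count of
    other segregating sites)."""
    if not haplotypes or len({len(h) for h in haplotypes}) != 1:
        raise ValueError("need at least one haplotype, all of equal length")
    retained_positions = []
    other_segregating = 0
    for site in range(len(haplotypes[0])):
        counts = Counter(h[site] for h in haplotypes)
        if len(counts) == 1:
            continue  # every chromosome agrees: not a segregating site
        minor_count = min(counts.values())  # assumes a biallelic site
        if minor_count >= min_minor_count:
            retained_positions.append(site)
        else:
            other_segregating += 1
    reduced = ["".join(h[i] for i in retained_positions) for h in haplotypes]
    return retained_positions, reduced, other_segregating


if __name__ == "__main__":
    hypothetical_sample = ["ACGTACGT", "ACGCATGT", "TCGTATGA", "TCACACGT"]
    print(summarise_haplotypes(hypothetical_sample, min_minor_count=2))
```

For this invented sample the three retained sites are positions 0, 3 and 5, and two further sites are segregating but carry only a singleton.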

Let an ancestral history H be the collection of genealogies at all sites in the sequence, together with the mutational history at the sites in S. Thus H is the ARG for the whole sequence (see Fig. 1), plus the positions in the ARG of the mutations which affect the sites in S. Conditionally on H, D_S and S_O are independent, so

$$L_M(\rho,\theta) = \sum_{H} p(D_S \mid H)\, p(S_O \mid H,\theta)\, p(H \mid \rho,\theta),$$

where the summation is over all possible ancestral histories. (For simplicity, in our notation we have suppressed the dependence of p(S_O|H,θ) on the position of sites in S and the threshold used to define S. Also we have slightly abused the notation as the summation over ancestral histories is in fact a sum over all topologies of the ARG, together with an integral over the lengths of the branches of this ARG, and the positions of the mutations in the ARG.)

Now p(D_S|H) is either 1 or 0 (as the ancestral histories uniquely determine the sample at sites in S). If we let ℋ denote the set of ancestral histories for which p(D_S|H)=1, and if q(H) is a probability mass function whose support contains ℋ, then

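$$L_M(\rho,\theta) = \sum_{H \in \mathcal{H}} p(S_O \mid H,\theta)\, p(H \mid \rho,\theta) = E_q\!\left[ \frac{p(S_O \mid H,\theta)\, p(H \mid \rho,\theta)}{q(H)} \right]. \tag{1}$$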

This suggests using importance sampling, with proposal density q(H), to approximate L_M(ρ,θ). This will be feasible provided that we can calculate p(S_O|H,θ) (up to a constant multiplier which does not depend on H).
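The importance sampling idea can be illustrated on a toy discrete problem. In the sketch below the four "histories", their probabilities, the weights standing in for p(S_O|H,θ) and the proposal are all invented for illustration; real ancestral histories live in a vastly larger space, and the proposal of Fearnhead and Donnelly (2001) is far more elaborate than this.

```python
import random
from typing import Callable, Dict


def importance_sampling_estimate(weight: Callable[[str], float],
                                 target: Dict[str, float],
                                 proposal: Dict[str, float],
                                 n_draws: int,
                                 rng: random.Random) -> float:
    """Estimate sum_h weight(h) * target[h] over the support of `proposal`
    by averaging weight(h) * target[h] / proposal[h] for h drawn from it."""
    states = list(proposal)
    draws = rng.choices(states, weights=[proposal[s] for s in states], k=n_draws)
    total = sum(weight(h) * target[h] / proposal[h] for h in draws)
    return total / n_draws


if __name__ == "__main__":
    # Hypothetical prior over four histories; only "b" and "d" are compatible
    # with the retained haplotypes, so the proposal puts mass on those alone.
    prior = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
    stand_in_weight = {"b": 0.5, "d": 0.25}
    proposal = {"b": 0.3, "d": 0.7}
    estimate = importance_sampling_estimate(
        stand_in_weight.__getitem__, prior, proposal, 20000, random.Random(0))
    exact = sum(stand_in_weight[h] * prior[h] for h in stand_in_weight)
    print(estimate, exact)
```

The estimate is unbiased for any proposal that covers the compatible histories, but its variance depends strongly on how closely the proposal tracks the product of weight and prior.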

We calculate p(S_O|H,θ) as follows. Consider just sites not in S, and write L for the total length of branches in the genealogies at these sites, L_F for the length of the subset of those branches on which a mutation could occur and yet leave the site outside S, and θ_b for the per base mutation rate. Then, assuming that there are no repeat mutations at sites not in S (which is automatic under the infinite sites assumption, but also may be a reasonable approximation under other mutation models), since mutations occur, independently, as a Poisson process of rate θ_b/2 along each branch,

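$$p(S_O \mid H,\theta) \propto \left( \frac{\theta_b L_F}{2} \right)^{S_O} \exp\!\left( -\frac{\theta_b L}{2} \right).$$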

Although in principle any choice of proposal density q (with the appropriate support) in equation (1) will result in unbiased estimation of L_M(ρ,θ), the choice of q can have dramatic effects on the efficiency of the method. We took as our choice of proposal density the one developed by Fearnhead and Donnelly (2001) for infinite sites data. We implemented this by adapting their program infs.

This approach is still highly computationally intensive because of the need to propose ancestral histories which contain genealogies at all sites in the sequence. Large computational savings can be obtained if we define a new ancestral history H, consisting of just genealogies and mutations at the sites in S. As before, if ℋ denotes the set of all such ancestral histories with p(D_S|H)=1, then

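$$L_M(\rho,\theta) = \sum_{H \in \mathcal{H}} p(S_O \mid H,\theta)\, p(H \mid \rho,\theta) = E_q\!\left[ \frac{p(S_O \mid H,\theta)\, p(H \mid \rho,\theta)}{q(H)} \right] \tag{2}$$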

for any proposal density q(H) whose support contains ℋ.

Exact calculation of p(S_O|H,θ) is not now possible. Instead we approximate this probability by assuming that the genealogy at each site is the same as the genealogy of the site in S to which it is closest (if a site is equidistant from the two closest sites in S, then with probability 1/2 we assume that it has the genealogy of the left-hand site, and otherwise we assume that it has the genealogy of the right-hand site). Use of this approximation to p(S_O|H,θ) in equation (2) defines an approximation to the marginal likelihood, which we call the approximate marginal likelihood. Again we approximated the approximate marginal likelihood via importance sampling. Our proposal density is that derived by Fearnhead and Donnelly (2001) for finite sites data. (Thus one advantage of our approximate marginal likelihood approach is that it is based on a mutation model which explicitly allows repeat mutations at sites in S.) We implemented this by adapting their program fins.
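The nearest-site rule, including the random tie-break, can be sketched as below. The function name and the example positions are hypothetical, and the sketch assumes the retained positions are supplied in increasing order; it covers only the assignment step, not the likelihood calculation itself.

```python
import bisect
import random
from typing import List


def nearest_retained_site(position: int,
                          retained_positions: List[int],
                          rng: random.Random) -> int:
    """Index of the retained site whose genealogy is borrowed for `position`.

    `retained_positions` must be sorted in increasing order. Ties between the
    left-hand and right-hand neighbours are broken with probability 1/2 each.
    """
    if not retained_positions:
        raise ValueError("at least one retained site is required")
    right = bisect.bisect_left(retained_positions, position)
    if right == 0:
        return 0  # no retained site to the left
    if right == len(retained_positions):
        return right - 1  # no retained site to the right
    left_gap = position - retained_positions[right - 1]
    right_gap = retained_positions[right] - position
    if left_gap != right_gap:
        return right - 1 if left_gap < right_gap else right
    return right - 1 if rng.random() < 0.5 else right


if __name__ == "__main__":
    retained = [100, 200, 400]  # hypothetical positions of retained sites
    rng = random.Random(0)
    for pos in (50, 120, 180, 300, 350, 500):
        print(pos, "->", retained[nearest_retained_site(pos, retained, rng)])
```

Position 300 in the example is equidistant from 200 and 400, so repeated calls assign it to either neighbour with equal probability.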

3.1. Implementation

We simulated data for samples of 50 sequences of length 1 kb, 2 kb and 4 kb, for θ=ρ=1 per kilobase, values which, as noted above, are plausible for human populations. At least for this region of parameter space, increasing the sample size has little effect on the precision of parameter estimation (Fearnhead and Donnelly, 2001), so here and elsewhere in the paper we focus simulation effort on exploring other aspects of the problem. The thresholds used in deciding which segregating sites to exclude in the marginal likelihoods were as follows:

  • (a) for the 1 kb data, include only sites at which the minor nucleotide frequency is at least 2;
  • (b) for 2 kb and 4 kb, include the five sites with the highest minor nucleotide frequency (including ties), and all other segregating sites with minor nucleotide frequency at least 30% of the sample.
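Thresholding scheme (b) can be read as the following selection rule. This is our interpretation, offered as a sketch: "including ties" is taken to mean that any site whose minor count equals that of the fifth-ranked site is also retained, and the input is assumed to be a mapping from site position to minor nucleotide count. The function and parameter names are hypothetical.

```python
from typing import Dict, List


def select_sites_scheme_b(minor_counts: Dict[int, int],
                          sample_size: int,
                          n_top: int = 5,
                          min_fraction: float = 0.3) -> List[int]:
    """Positions retained under scheme (b), as interpreted in the lead-in."""
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    if not minor_counts:
        return []
    ranked = sorted(minor_counts.values(), reverse=True)
    # Minor count of the n_top-th ranked site; sites tied with it are kept too.
    tie_cutoff = ranked[min(n_top, len(ranked)) - 1]
    return sorted(
        position
        for position, count in minor_counts.items()
        if count >= tie_cutoff or count / sample_size >= min_fraction
    )


if __name__ == "__main__":
    # Hypothetical minor counts for a sample of 50 chromosomes.
    counts = {10: 25, 50: 20, 90: 16, 130: 12, 170: 9, 210: 9, 250: 4, 290: 1}
    print(select_sites_scheme_b(counts, sample_size=50))
```

In the example the fifth- and sixth-ranked sites share a minor count of 9, so six sites are retained rather than five.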

There was striking agreement between each of the full likelihood, marginal likelihood and approximate likelihood surfaces in the simulation study. For example, Fig. 3 shows likelihood curves for ρ with θ fixed at the true value, for 2 kb and 4 kb data. Similar results are obtained for many other data sets. For analysing 1 kb data the three curves are always almost identical (the results are not shown), confirming the plausible intuition that there is little information about recombination in singleton mutations (although such mutations are not completely uninformative).

Figure 3. Comparisons of full ( —— ), marginal (——) and approximate marginal (- - - - -) log-likelihood curves for ρ at the true value of θ for simulated data sets of 50 chromosomes (the thresholding schemes used are described in the text; the true value of ρ was 1): (a) 2 kb data; (b) 4 kb data (only the full and approximate marginal log-likelihood curves are shown)

We conclude that in practice, for these thresholding schemes, there is little to be gained in using a full likelihood approach, nor in working hard to calculate the marginal likelihood exactly. We would thus recommend the use of the approximate marginal likelihood method for sequences of this length, implemented by ignoring (only) singleton sites. The saving in computational time is considerable. Although it depends on the structure of particular data sets, and (often to a lesser extent) the level at which the threshold is set, calculating the approximate marginal likelihood accurately can reduce computing time by 1–2 orders of magnitude when compared with the marginal likelihood and 1–3 orders of magnitude compared with the full likelihood, with greater relative savings for more complicated problems. For sequences of length above about 5 kb, even the calculation of the marginal likelihoods can become computationally prohibitive.

4. A composite likelihood

Here we consider a different approximation, which is similar in spirit to that of Besag (1975) for spatial data. Consider DNA sequence data from a chromosomal region of interest. Split the region of interest into R subregions. For r=1,…,R, let Dr be the data from the rth subregion. For notational simplicity, we assume that each subregion is of the same length, and let ρ and θ now denote the recombination and mutation rates over one subregion (i.e. 1/R times the rates for the whole region of interest). We assume in this section that these rates are constant across the subregions. Now we define the composite likelihood LC(ρ,θ) to be

$$L_C(\rho,\theta)=\prod_{r=1}^{R} p(D_r\mid\rho,\theta).$$

The composite likelihood ignores information in the data: it neglects the fact that the ith sequence in each of D1,…,DR comes from the same chromosome. Furthermore, the composite likelihood is not even a probability of some summary of the data, as it ignores the dependence between data from different subregions. However, we propose to base inference on this composite likelihood, and in particular to estimate the parameters of interest by the values which maximize the composite likelihood. Note that this approach has similarities to the pairwise methods of Hudson (2001a) and subsequently McVean et al. (2002), and the idea of zeroth-order likelihood for stationary stochastic processes (Azzalini, 1983).

The composite likelihood function can be calculated by using the importance sampling method of Fearnhead and Donnelly (2001) to evaluate each factor in the product. In view of the results of the previous section, it would seem natural to evaluate each subregion likelihood p(Dr|ρ,θ) via the approximate marginal likelihood, rather than the full likelihood. This is indeed what we recommend in practice (and what we have applied in Section 5). To understand the consequences of our various approximations better, we consider in this section the use of the composite likelihood with an evaluation of the full likelihood for each subregion.
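To make the structure of the calculation concrete, here is a minimal sketch of how subregion log-likelihood curves, once available on a common grid of ρ values, might be combined into a composite log-likelihood and maximized. The function names and the quadratic curves are stand-ins chosen for illustration; in practice each row would be an importance sampling estimate of log p(Dr|ρ,θ).

```python
import numpy as np

def composite_loglik(subregion_logliks):
    """Sum subregion log-likelihood curves evaluated on a common grid.

    subregion_logliks: array of shape (R, G); row r holds log p(D_r | rho, theta)
    at G grid values of rho, with theta held fixed.
    """
    return np.asarray(subregion_logliks).sum(axis=0)

def composite_mle(rho_grid, subregion_logliks):
    """Grid maximizer of the composite log-likelihood, plus the curve shifted to peak at 0."""
    total = composite_loglik(subregion_logliks)
    return rho_grid[int(np.argmax(total))], total - total.max()

# Toy stand-in curves (NOT coalescent likelihoods): quadratics with different peaks.
rho_grid = np.linspace(0.0, 5.0, 501)
peaks = np.array([0.6, 1.0, 1.4])
toy_curves = -0.5 * (rho_grid[None, :] - peaks[:, None]) ** 2
rho_hat, curve = composite_mle(rho_grid, toy_curves)
```

Working on the log scale turns the product over subregions into a sum, which avoids numerical underflow when R is large.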

In the next subsection we give a very informal discussion of some relevant theoretical issues. For a more complete consideration, see Fearnhead (2002). Section 4.2 then considers empirical evidence on the use of this composite likelihood. Both the theoretical and the empirical considerations are encouraging.

4.1. Informal theoretical considerations

The obvious point estimates for ρ and θ are just the values ρ̂ and θ̂ that maximize LC(ρ,θ). However, the statistical properties of these estimators are unknown, and a rationale for interval estimation is not straightforward. Partial answers to these questions can be obtained by using asymptotic theory. By asymptotic we mean here the limit as the number of subregions, and hence the size of the region of interest, increases. Another limiting regime would be to fix the size of the region and then to let the number of chromosomes sampled tend to ∞. These are rather different scenarios. Additional sampled copies of the same region are very highly positively correlated, so the gain in information is small. Nothing is known formally, but it is plausible that in this limiting regime the information grows as the logarithm of the sample size. However, precisely because of recombination, there is rather more independence between sequenced regions, from the same chromosomes, as the regions move further apart. Again, the formal position is not clear, but it seems likely that information grows linearly, or close to linearly, in the number of subregions sequenced. For a discussion of these issues in a simpler setting see Pluzhnikov and Donnelly (1996).

Azzalini (1983) discussed the asymptotic properties of maximum likelihood estimates (MLEs) which are based on approximate likelihoods that are similar to our composite likelihood. How these ideas specifically apply to our composite likelihood is considered in detail in Fearnhead (2002). We briefly, and very informally, discuss these results here.

The asymptotic properties of the composite likelihood MLEs will depend on the correlation between the score functions for different subregions. We studied these correlations via simulation (the results are not shown), and they appeared small. This suggests treating the composite likelihood as a true likelihood, and in particular assuming that the MLE has an approximate normal distribution, and that the likelihood ratio statistic has an approximate χ²-distribution.

The theoretical results of Fearnhead (2002) show that the correlation decays inversely with the amount of recombination between the subregions. Whereas this decay is sufficiently quick to ensure consistency of the MLEs based on the composite likelihood, it is sufficiently slow that the asymptotic distribution of the likelihood ratio statistic will not be χ² distributed (in fact a χ²-approximation can be made arbitrarily poor by using increasingly more subregions).

It may still be the case that the usual asymptotic distributional results will provide useful approximations in some settings. This may occur if the subregions themselves were large and the log-likelihood from each subregion were approximately quadratic, and if the number of subregions were small. Below, we use simulation to examine the distribution of the likelihood ratio statistic, and also the MLE, for our composite likelihood.

4.2. Simulation results

We now describe the results of a simulation study aimed at understanding the properties of the composite likelihood method. We consider the large sequence properties and sampling distributions of the estimators.

We simulated our data from the coalescent, assuming neutrality, random mating and a constant population size. For reasons discussed above we simulated data from 50 chromosomes throughout. We generated data over different sequence lengths, assuming that 1 kb of DNA corresponds to parameter values ρ=θ=1.0. As noted above, these would be typical values for human populations.

Calculating the full likelihood even for a subregion of 1 or 2 kb is extremely computationally intensive. Especially for the larger region, it is also person intensive, as care should ideally be used in deciding whether enough iterations of the importance sampling method have been used. (See Fearnhead and Donnelly (2001) for a detailed discussion.) The trade-off in the composite likelihood approach is clear. The larger the size of the subregions, the less information is lost through ignoring dependences, but the greater the cost of obtaining good likelihood estimates, and the higher the chance, especially if the process is automated, of serious errors in the subregion likelihood estimates. The main effect here is that, although the importance sampling method that we use gives an unbiased estimate of the likelihood, it is the sample mean of a sample from a distribution with an extremely long right-hand tail. Thus in practice not running the method for sufficiently long will typically result in an underestimation of the likelihood. (See Fearnhead and Donnelly (2001) and Stephens and Donnelly (2000) for a fuller discussion.)
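The tendency of a short run to underestimate the likelihood can be illustrated with a small numerical experiment. The sketch below uses lognormal weights as a stand-in for importance sampling weights with a long right-hand tail; the lognormal form, the spread `sigma` and the run lengths are illustrative assumptions and are not taken from the method of Fearnhead and Donnelly (2001).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0                         # assumed spread of the log-weights (illustrative)
true_mean = np.exp(sigma**2 / 2)    # exact mean of a lognormal(0, sigma) weight
n_weights, n_reps = 100, 2000

# Each row is one short importance sampling run; its mean is an unbiased estimate.
weights = rng.lognormal(mean=0.0, sigma=sigma, size=(n_reps, n_weights))
estimates = weights.mean(axis=1)

frac_below = np.mean(estimates < true_mean)          # share of runs that underestimate
median_ratio = np.median(estimates) / true_mean      # typical estimate relative to truth
```

Although every run is unbiased, most runs miss the rare very large weights, so the typical estimate falls below the true value.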

Simulation studies of highly computationally intensive methods are necessarily somewhat limited, so caution should be applied in interpreting the results below. For the composite likelihood approach, the largest size of subregion that is amenable to a simulation study will be smaller than the largest size which can be used for particular data analyses.

We first considered the large sample properties of the composite likelihood MLE. We generated one 500 kb data set and ten 100 kb data sets. The composite log-likelihood curves (based on splitting each region into 1 kb subregions) for the ten 100 kb data sets are shown in Fig. 4. There is no noticeable bias in the estimation of ρ. This conclusion is supported by the composite log-likelihood curve for the 500 kb data set (the results are not shown).


Figure 4. Composite log-likelihood curves for ten 100 kb data sets: each composite log-likelihood curve is based on 1 kb subregions; the bold curve shows the sum of the ten composite log-likelihood curves (the true value of ρ per kilobase is 1); each curve is adjusted to have its maximum at 0


We also analysed the 100 kb data sets using subregions of 2 kb. Fig. 5 shows the composite likelihood curves based on using 2 million simulated ancestral histories to estimate the likelihood for each subregion. There is evidence of a negative bias caused by occasional inaccuracies in estimating the likelihood curves for individual subregions.


Figure 5. Composite log-likelihood curves for ten 100 kb data sets, based on 2 kb subregions: the bold curve shows the sum of the relevant composite log-likelihood curves (the true value of ρ per kilobase is 1); each curve is adjusted to have its maximum at 0


Also of concern is the appropriateness of assuming an approximate normal distribution for ρ̂, and an approximate χ²-distribution for the likelihood ratio statistic. Fig. 6 shows QQ-plots for both ρ̂ and the likelihood ratio statistic for the composite likelihood for both 10 kb and 20 kb data (each composite likelihood was based on 1 kb subregions). For 20 kb data, the χ²-approximation for the likelihood ratio statistic seems poor, whereas (except for the constraint of ρ≥0) ρ̂ does have an approximate normal distribution. Also, as the sequence length is increased from 10 kb to 20 kb, the fit of the likelihood ratio statistic appears to be less good, as suggested by the theoretical analysis of Fearnhead (2002).


Figure 6. QQ-plots of the composite likelihood MLE and likelihood ratio statistic for (a) 20 kb data and (b) 10 kb data: each composite likelihood is based on 1 kb subregions; the results are based on 1500 kb of simulated data


Finally, Table 1 summarizes how the performance of inference for ρ depends on the length of data that are analysed (again inference is based on the composite likelihood for the various choices of subregion size). Firstly consider the results based on 1 kb subregions. The point estimates of ρ appear good, though the length of the sequence plays a crucial role in the variance of the estimates. In contrast, the coverage properties of interval estimates based on the asymptotic distribution of the likelihood ratio statistic are poor. Interval estimation based on a normal approximation for ρ̂ performs somewhat better, but again the confidence intervals have coverage probabilities for large sequences that are lower than nominal.

Table 1. Summary of the sampling properties of the composite likelihood MLE, for ρ per kilobase (the true value is 1) and associated confidence intervals, for different lengths of data, based on a sample of 50 chromosomes†

| Subregions | Length (kb) | Mean | Variance | Median | g | Confidence interval coverage (a) | Confidence interval coverage (b) |
|---|---|---|---|---|---|---|---|
| 1 kb | 5 | 0.97 | 1.20 | 0.66 | 0.39 | 0.85 | 0.97 |
| | 10 | 0.97 | 0.79 | 0.82 | 0.51 | 0.80 | 0.94 |
| | 20 | 0.98 | 0.46 | 0.88 | 0.67 | 0.75 | 0.90 |
| 2 kb | 10 | 0.92 | 0.41 | 0.82 | 0.69 | 0.83 | 0.86 |
| | 20 | 0.90 | 0.18 | 0.86 | 0.82 | 0.76 | 0.82 |
| 5 kb | 5 | 1.3 | 0.91 | 1.12 | 0.69 | 0.89 | 0.77 |
| | 10 | 1.22 | 0.46 | 1.07 | 0.78 | 0.83 | 0.74 |
| | 20 | 1.16 | 0.21 | 1.12 | 0.91 | 0.78 | 0.72 |

†The statistic g (used in Wall's (2000) comparisons) is the proportion of times that the MLE is within a factor of 2 of the truth. The final two columns give the estimated coverage probability of approximate 95% confidence intervals. These confidence intervals are based on (a) an approximate χ²-distribution for the likelihood ratio statistic and (b) an approximate normal distribution for the MLE, whose variance is the inverse of the curvature of the relevant composite log-likelihood curve. The results for subregions of 1 and 2 kb are based on 1500 kb of simulated data; those for 5 kb on 1000 kb of simulated data, with data simulated under the standard neutral model. Composite likelihoods for subregions of 1 and 2 kb were based on the exact likelihood for each subregion, that for 5 kb on the approximate marginal likelihood, omitting only singleton sites, for each subregion. The composite likelihood curve was calculated for ρ per kilobase between 0 and 5 in all cases. For 5 kb sequences, estimates of ρ=5 were obtained in around 2% of cases; these indicate an estimate of ρ which is greater than 5.0, and hence the estimated means and variances for 5 kb sequences are negatively biased.
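The summaries reported in Table 1 can be computed from a composite log-likelihood curve along the following lines. This is a sketch under stated assumptions: the curve is available on an equally spaced grid with an interior maximum, the likelihood ratio interval uses the 95% point of a χ²-distribution with 1 degree of freedom (the degrees of freedom are our assumption), and all function names are hypothetical.

```python
import numpy as np

CHI2_1_95 = 3.841  # 95% point of chi-squared with 1 degree of freedom (assumed df)

def within_factor_two(estimates, truth):
    """Proportion of estimates between truth/2 and 2*truth (the statistic g)."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates >= truth / 2) & (estimates <= 2 * truth))

def lr_interval(grid, loglik, cutoff=CHI2_1_95):
    """Range of grid points with likelihood ratio statistic 2*(max - loglik) <= cutoff."""
    loglik = np.asarray(loglik)
    inside = 2 * (loglik.max() - loglik) <= cutoff
    return grid[inside].min(), grid[inside].max()

def normal_interval(grid, loglik, z=1.96):
    """MLE +/- z/sqrt(curvature); curvature from a central second difference at the peak."""
    i = int(np.argmax(loglik))
    h = grid[1] - grid[0]
    curvature = -(loglik[i + 1] - 2 * loglik[i] + loglik[i - 1]) / h**2
    half_width = z / np.sqrt(curvature)
    return grid[i] - half_width, grid[i] + half_width

grid = np.linspace(0.0, 5.0, 501)
toy_loglik = -0.5 * (grid - 1.0) ** 2 / 0.25   # toy quadratic curve, variance 0.25
g = within_factor_two([0.4, 0.6, 1.9, 2.5], truth=1.0)
lr_lo, lr_hi = lr_interval(grid, toy_loglik)
wald_lo, wald_hi = normal_interval(grid, toy_loglik)
```

For an exactly quadratic curve the two intervals nearly coincide; they differ when the curve is asymmetric, which is one reason the two coverage columns in Table 1 can disagree.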

The results based on 2 kb subregions show evidence for a bias. Despite this, the performance of point estimation is better (measured either via the mean-square error or the proportion of times that the MLE is within a factor of 2 of the truth). The extra information in a 2 kb subregion as opposed to two 1 kb subregions considerably reduces the variance of the estimators. Interval estimates have poor coverage properties. We note again that there is no theoretical foundation even for interval estimation for the full likelihood MLE for ρ, although limited empirical evidence is encouraging (Fearnhead and Donnelly, 2001).

The analysis for the 5 kb subregions is at the limit of feasibility for a simulation study. One consequence is that the importance sampling estimates of the approximate marginal likelihoods for each subregion may not be particularly accurate. For example in many cases these likelihood estimates had estimated effective sample sizes (see Fearnhead and Donnelly (2001) for details) of around 10. Here, an inaccurate likelihood surface may, and apparently does, particularly affect the properties of confidence intervals, although it may also be responsible for the increased bias. In analysing a particular data set, one can assess the effective sample size and if necessary increase the simulation effort in calculating subregion approximate likelihoods. As a consequence, the actual performance when using 5 kb subregions should be better than suggested by Table 1. We thus regard the use of 5 kb subregions as the best alternative among the composite likelihood approaches that we considered.
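One common way of estimating an effective sample size from importance sampling weights is sketched below. The formula used, (Σw)²/Σw², is a standard choice and is our assumption here; Fearnhead and Donnelly (2001) should be consulted for the exact definition used in the simulations. The function name is hypothetical.

```python
import numpy as np

def effective_sample_size(log_weights):
    """Effective sample size (sum w)^2 / sum(w^2), computed from log-weights."""
    lw = np.asarray(log_weights, dtype=float)
    w = np.exp(lw - lw.max())    # rescale to avoid overflow; the ratio is scale invariant
    return w.sum() ** 2 / np.sum(w ** 2)

even = effective_sample_size(np.zeros(1000))                      # equal weights
skewed = effective_sample_size(np.array([0.0] + [-50.0] * 999))   # one dominant weight
```

With equal weights the effective sample size equals the number of simulated histories; when one weight dominates it falls to about 1, signalling that more simulation effort is needed.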

In connection with Table 1 we note that there are possible but very unlikely sample configurations for which the MLE of ρ will be ∞. Thus the moments of ρ̂ do not exist. None-the-less, we have found that the use of the sample mean and sample variance of ρ̂ provides a helpful summary of the properties of different estimators. We give estimated histograms of the sampling distributions for some scenarios in Fig. 7. These confirm the conclusions described above in discussing Table 1. Fig. 7(c) shows the performance of the approximate marginal likelihood when applied to 5 kb sequences. The sampling properties of the estimator compare favourably with those of the MLE for smaller sequences (see Fig. 9 of Fearnhead and Donnelly (2001)).


Figure 7. Histograms of the composite likelihood MLE for ρ based on samples of 50 chromosomes: (a) 5 kb sequences, 1 kb subregions; (b) 20 kb sequences, 1 kb subregions; (c) 5 kb sequences, 5 kb subregions; (d) 20 kb sequences, 5 kb subregions



Figure 9. Plots of pairwise LD, as measured by D (below the diagonal) and the likelihood ratio for independence of the alleles at the two sites (above the diagonal) (see the text for details): only segregating sites whose minor allele frequency was greater than 15% were used (for reasons discussed elsewhere, see for example Jeffreys et al. (2001)); the boundaries of each shaded rectangle are at the midpoints between segregating sites; the darker the box, the more LD there is between the relevant pair of sites; high values of LD suggest small amounts of recombination between the two sites, whereas low values of LD suggest large amounts of recombination; blocks of low LD values near the diagonal are indicative of elevated recombination rates in the respective regions, whereas blocks of high LD values near the diagonal are indicative of smaller recombination rates


In spite of the approximations that are inherent in its construction, the behaviour of the composite likelihood MLEs, at least for point estimation, is encouraging. We briefly compare it with other available estimators.

The method of Wakeley (1997) estimates ρ per kilobase for a 50 kb region (based on a sample of 20 chromosomes) with a variance of 1.16, making it substantially worse than our composite likelihood estimators (for example, when based on 2 kb subregions, the composite likelihood MLE, as described in Table 1, has a sample variance of 0.09). Wall (2000) had also found that this estimator performed poorly in his comparisons. Wall's (2000) method performs worse than the approximate marginal likelihood for 5 kb data and has a performance which is comparable with that of the composite likelihood for 10 kb data (with 5 kb subregions) when performance is measured in terms of the statistic g (defined in Table 1) and better performance in terms of the mean-square error (Wall, personal communication). On the basis of our own simulations (the data are not shown), and those of Hudson (personal communication), Hudson's pairwise likelihood method performs somewhat worse than approximate marginal likelihood for 5 kb data, but comparably or sometimes better than any composite likelihood for longer sequence lengths.

The comparison with Hudson's (2001a, b) method is interesting. His method combines data from all pairs of segregating sites. In long sequences, this will be dominated by comparisons between sites which are reasonably distant. In contrast, our composite likelihood explicitly ignores information in data from regions of the sequence that are widely separated, concentrating instead on the local information from nearby segregating sites but using joint information from many such sites, rather than just from pairs of sites. The loss of the joint information seems expensive for relatively short sequences (for which full or marginal likelihood approaches perform better). As the sequence length grows, it begins to be offset by the additional information from patterns of LD between distant sites. In this sense Hudson's pairwise and our composite likelihood approaches are somewhat complementary. We are currently investigating ways of combining the two approaches.

5. Lipoprotein lipase data

We now apply our approximate likelihood methods to analyse sequence data from the LPL gene. The full data are presented in Nickerson et al. (1998) and Clark et al. (1998) and have been used by Templeton et al. (2000a), Przeworski and Wall (2001) and Kuhner et al. (2000) to estimate the amount of recombination in the LPL gene. The data consist of approximately 9.7 kb of DNA sequenced in 142 chromosomes from individuals in Jackson and Rochester in the USA and from North Karelia, Finland. The full haplotype information of some chromosomes is not known: the phase of singleton mutations was not determined, and for some segregating sites the alleles on some chromosomes are unknown. We focus here on two specific issues raised in the original papers:

  • (a)
    the extent to which recurrent mutation (i.e. more than one mutation event at a particular site), rather than recombination, has shaped the data and
  • (b)
    the possibility of substantially elevated recombination rates towards the centre of the sequenced region.

5.1. Possible recurrent mutation

Templeton et al. (2000a, b) suggested that there is a significant amount of repeat mutation, due to variation in the mutation rate between sites. The issue is potentially important, because multiple mutations at the same site can leave patterns in the data that are similar to those from recombination events, so, as Templeton and colleagues have argued, a failure to account for this would lead to an overestimation of the recombination rate. In particular they suggested that CpG dinucleotides mutate at a much faster rate than the genome-wide average. (This has been noted from other data: for example, Nachman and Crowell (2000) estimated that they mutate 10 times more frequently than average; see also Krawczak et al. (1998).)

We first analysed the data from the 48 chromosomes from Jackson. We base our inference on the composite likelihood, splitting the data into 10, 975-base, subregions. As with our earlier suggestion, we used the approximate marginal likelihood in each subregion, in which singleton mutations were ignored. In addition to the computational saving, this has the advantage that for these data the phase of singleton mutations is not known in any case. We also note that it means that throughout this analysis we use a finite sites mutation model, which explicitly allows for the possibility of repeat mutations at each site. For each subregion, we omitted any chromosome whose full haplotype was not known, leaving samples of between 31 and 48 chromosomes for the 10 subregions. On average, it took half a day to calculate the approximate marginal likelihood for each subregion by using a 400 MHz personal computer. The coefficient of variation of the final importance sampling weights was generally of the order of 10⁻³ or 10⁻⁴, implying that the estimates of the approximate marginal likelihood surface in each subregion are within a few per cent of the correct values.
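The data preparation described above can be sketched as follows. The coding of unknown alleles as -1, the 1-based site positions and the function name are assumptions introduced for this illustration; the toy data are not the LPL data.

```python
import numpy as np

def split_into_subregions(haplotypes, positions, region_length, n_subregions):
    """Partition sites into equal-length subregions and drop incomplete chromosomes.

    haplotypes: 2-D integer array, rows = chromosomes, columns = segregating sites,
                with -1 marking an unknown allele (an assumed coding).
    positions:  1-based base positions of the sites within the region.
    """
    haplotypes = np.asarray(haplotypes)
    positions = np.asarray(positions)
    width = region_length / n_subregions
    index = np.minimum(((positions - 1) // width).astype(int), n_subregions - 1)
    pieces = []
    for r in range(n_subregions):
        block = haplotypes[:, index == r]
        complete = (block != -1).all(axis=1)   # chromosomes fully typed in subregion r
        pieces.append(block[complete])
    return pieces

toy_haplotypes = np.array([
    [0, 1, 0, 1, 0],
    [1, -1, 0, 0, 1],
    [0, 0, -1, 1, 1],
])
toy_positions = np.array([100, 975, 976, 3000, 9750])
pieces = split_into_subregions(toy_haplotypes, toy_positions, 9750, 10)
```

Because chromosomes are filtered separately in each subregion, the sample size can differ between subregions, as it does in the analysis of the Jackson data.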

As one way of assessing the consequences of increased mutation rates associated with CpG dinucleotides, we analysed the data under two mutational models. In the first, all sites mutated at the same rate. In the second, we identified CpG doublets in the data and allowed these to mutate at a rate that was 10 times higher than that for other sites. In fact, only mutations away from CpG sites seem to be at increased frequency and, of these, transitions at a higher rate than transversions. In this sense the model that we actually fit will tend to overstate the possibility of repeat mutations. In fact, even allowing for 100-fold increased rates at CpG doublets gives likelihood curves, and estimates of recombination rates, that are similar to those in Fig. 8 (the data are not shown).
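A minimal sketch of the second mutational model is given below: it locates CpG doublets in a reference sequence and assigns them an elevated relative mutation rate. Treating both bases of the doublet as elevated, and the function names, are assumptions made for this example; the text does not specify these details.

```python
def cpg_positions(sequence):
    """0-based indices of bases in a CpG doublet (a C immediately followed by a G)."""
    seq = sequence.upper()
    sites = set()
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            sites.update((i, i + 1))
    return sorted(sites)

def relative_rates(sequence, cpg_factor=10.0):
    """Per-site relative mutation rates: cpg_factor at CpG bases, 1 elsewhere."""
    cpg = set(cpg_positions(sequence))
    return [cpg_factor if i in cpg else 1.0 for i in range(len(sequence))]

sites = cpg_positions("ACGTTCGA")
rates = relative_rates("ACGTTCGA")
```

Setting `cpg_factor=100.0` would correspond to the more extreme 100-fold scenario mentioned above.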


Figure 8. Composite log-likelihood curves under two mutation models (——, all sites mutate at the same rate; - - - - -, CpG sites mutate 10 times faster) for the LPL data from Jackson, based on the approximate marginal likelihoods for 10, 975-base, subregions: the curves are for the profile log-likelihood of ρ , and each has been set to have a maximum value of 0


The composite log-likelihood curves for ρ under the two models are shown in Fig. 8. These are in fact profile likelihood curves, although the MLE for θ varies very little with ρ. There is little difference between the shape of the two curves, which contrasts with the conclusions of Templeton et al. (2000a), who suggested that allowing for variable mutation rates would substantially reduce the estimate of ρ.

Exactly how much recurrent mutation there may be in the data can be assessed in different ways. Przeworski and Wall (2001) noted that, even with mutation rate variation at the level which has been suggested, the sequence is sufficiently short that the a priori expectation would be for at most a small number of instances of multiple mutations at the same site. Our methods allow us to assess this question a posteriori, since they provide information on the conditional distribution of multiple mutations, and of numbers of recombination events, given the data. This distribution depends on the unknown parameter values, and on which population is analysed. We explored various possibilities for the data from all populations (including the elevated rate for CpG dinucleotides, and the putative recombinational hot spot discussed below), all of which give qualitatively similar results. This analysis, for a range of plausible parameter values, suggests that it is very unlikely that there were more than 10–20 (and much more probably 1–5) multiple mutations, and that in contrast there were at least several hundred recombination events, in the history of the data.

The maximum value of the composite log-likelihood (for the Jackson sample) under the mutation model with tenfold higher rates for CpG dinucleotides is 29.4 units larger than under the simpler model with no mutation rate variation, showing that allowing mutation rate variation improves the fit to the data, but assessing the significance of any difference in composite log-likelihood is not straightforward. Note that evidence of an increased mutation rate at CpG sites (which this suggests) is not the same as the statement that there have been multiple mutations at the same (CpG or other) site, which seems at best a minor effect in the light of the analysis described in the previous paragraph.

We thus conclude, on the basis of these several arguments, that multiple mutations have not played a major role in shaping the observed diversity in the LPL gene. More generally, similar arguments suggest that, although there is undoubtedly a variation in mutation rates across human sequences, plausible levels of this variation are unlikely to have a major effect on the estimation of recombination rates from population data.

5.2. Recombinational hot spot

Another conclusion of Templeton et al. (2000a) is that there is a recombinational hot spot between sites 2987 and 4872, which corresponds exactly to the fourth and fifth of our subregions. We now examine this in more detail.

It turns out that the signal for a putative hot spot differs substantially across the three population samples. We first give a more informal analysis, similar to many in the literature, which examines patterns of LD between different pairs of segregating sites in the data. LD measures the dependence between the alleles at a pair of loci. For two biallelic loci, with alleles A and a at the first and alleles B and b at the second, LD is often measured in terms of the difference between the frequency of chromosomes which carry both A and B and the product of the marginal frequencies of the A and B alleles. (It is 0 if the alleles at the two loci are statistically independent.) There are numerous measures of LD, all based on this difference (see for example Hudson (2001b)). The measure D is the absolute value of the difference, divided by its maximum value given the marginal allele frequencies (some researchers define D to be the signed difference). Thus D takes values between 0 (no dependence, or linkage equilibrium) and 1 (strong dependence). An alternative measure of LD is the likelihood ratio for comparing the hypothesis of a general joint distribution with that of independence of alleles at the two loci.
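The two measures of LD just described can be computed from the four haplotype counts at a pair of biallelic loci, as in the sketch below. The normalized measure is often written |D′| in the literature. The likelihood ratio is reported here on the 2 log scale, which is our assumption about the scaling; the function name is hypothetical.

```python
import math

def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (raw difference, normalized absolute difference, likelihood ratio statistic)."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    diff = n_AB / n - p_A * p_B
    # Largest difference attainable given the marginal allele frequencies.
    if diff >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    normalized = abs(diff) / d_max if d_max > 0 else 0.0
    # Likelihood ratio: general 2x2 multinomial against independence of the two loci.
    rows = (n_AB + n_Ab, n_aB + n_ab)
    cols = (n_AB + n_aB, n_Ab + n_ab)
    counts = ((n_AB, n_Ab), (n_aB, n_ab))
    lr = 0.0
    for i in range(2):
        for j in range(2):
            if counts[i][j] > 0:
                lr += 2 * counts[i][j] * math.log(counts[i][j] * n / (rows[i] * cols[j]))
    return diff, normalized, lr

complete = ld_measures(25, 0, 0, 25)      # only AB and ab haplotypes observed
independent = ld_measures(10, 10, 10, 10) # alleles independent in the sample
```

The normalized measure reaches 1 whenever at least one of the four haplotypes is absent, whatever the sample size, whereas the likelihood ratio also reflects how much data support the dependence.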

Fig. 9 shows patterns of pairwise LD in each of the three samples, and when the data are combined across populations. The samples from Finland and from Rochester, Minnesota, show distinct blocks of LD on either side of the putative hot spot, as would be expected if there were an elevated recombination rate in the region claimed. The pattern for the Jackson population is markedly different from that for the other two samples, possibly suggesting instead an elevated recombination rate in the entire left-hand (5′) half of the sequence, but not obviously a hot spot in the centre. The combined sample maintains the suggestion of a hot spot.

For a more thorough analysis, we evaluated composite likelihoods under two models, the first with a constant value of ρ across the entire sequence, the second with one value of ρ for subregions 4 and 5, and a different (common) value of ρ for the remainder of the sequence. In both cases, our mutational model assumed a tenfold increase in mutation rate at CpG sites. (We regard the whole analysis as exploratory, and in any case we cannot perform formal tests, so we note, but do not otherwise adjust for, the fact that the location of the putative hot spot was suggested on the basis of these data.)
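The comparison of the two models can be sketched as follows, given subregion log-likelihood curves on a common grid of ρ per kilobase. Under the hot spot model the composite log-likelihood separates into a sum over the hot spot subregions and a sum over the remainder, so the two rates can be maximized separately. The factor of 2 in the statistic, the function name and the toy quadratic curves are assumptions for illustration.

```python
import numpy as np

def hot_spot_statistic(curves, hot):
    """Composite likelihood ratio statistic for a two-rate against a one-rate model.

    curves: array (R, G) of subregion log-likelihoods on a common grid of rho.
    hot:    0-based indices of the subregions given their own recombination rate.
    """
    curves = np.asarray(curves)
    hot_mask = np.zeros(curves.shape[0], dtype=bool)
    hot_mask[list(hot)] = True
    constant = curves.sum(axis=0).max()
    two_rate = curves[hot_mask].sum(axis=0).max() + curves[~hot_mask].sum(axis=0).max()
    return 2 * (two_rate - constant)

# Toy curves (NOT real likelihoods): ten subregions, two of which peak at a higher rate.
rho_grid = np.linspace(0.0, 5.0, 501)
peaks = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0])
toy_curves = -0.5 * (rho_grid[None, :] - peaks[:, None]) ** 2
statistic = hot_spot_statistic(toy_curves, hot=[3, 4])
```

The statistic is never negative, because the one-rate model is a special case of the two-rate model; as noted in the text, no formal reference distribution is available for it.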

Fig. 10 shows the likelihood curves for each sample. As for the more informal analysis of LD above, two of the populations, and the combined sample, reinforce the suggestion of the hot spot. On the basis of these samples, estimated values of ρ per kilobase for the hot spot region are higher (substantially so for the Finnish population) than for the remainder of the sequence. This does not occur for the Jackson data. The values of the (composite) likelihood ratio statistics for choosing between the hot spot and ‘no-hot-spot’ models are 0.3, 30, 30 and 20 for the Jackson, Finnish, Rochester and combined samples respectively.


Figure 10. Composite log-likelihood curves for the LPL data for (a) Jackson, (b) Finland, (c) Rochester and (d) all the populations ( —— , composite log-likelihood under a model of a constant recombination rate; ——, composite log-likelihood for the non-hot-spot region; - - - - -, composite log-likelihood for the hot spot region): each curve has been adjusted to have its maximum at 0


Thus, although informal, both the LD and the composite likelihood analyses of the Finnish, Rochester and combined samples reinforce the existence of an elevated recombination rate in the centre of the sequence. Interestingly, there is limited or no evidence for this from the Jackson sample.

5.3. Methodological issues

In our view, some differences between the results of our analysis and those of Templeton et al. (2000a) are due in part to systematic problems with the methods that they used. Very briefly, instead of estimating the recombination rate, they used (repeatedly) a particular test which aimed to detect the recombination events in the history of the sample. Their inferred recombination events

  • (a)
    accounted for only a small proportion of the homoplasies in the data (features which must be caused by either recombination or a repeat mutation) and
  • (b)
    showed clustering between sites 2978 and 4812.

Inference from (a) that repeat mutation is responsible for most homoplasies relies on the assumption that the hypothesis test used will actually detect all or most recombination events which have occurred. There is no a priori reason for expecting the test to have such power. In fact the approach only detected 28 recombination events. This suggests a value of ρ about an order of magnitude lower than our (and others’) estimates for these data. It is also about an order of magnitude lower than the expected number of recombination events conditional on the data. We are thus inclined to the view that their analysis is simply not picking up many of the recombination events that are of interest. The method may also be less likely to detect recombination events at the ends of the sequence, which complicates inference from it concerning a central hot spot. (Though subject to the concerns above about the Jackson population, the conclusion is probably not unreasonable.)

6. Discussion

There is considerable current interest in understanding the way in which recombination rates change over small scales in the genome. Standard pedigree methods and single-sperm typing do not have sufficient resolution to address this issue. Recent more sophisticated sperm-based approaches do offer the required resolution (for male recombination rates only) but at a considerable cost. By contrast, samples of chromosomes taken from a population are related over long timescales and do contain information on local (sex-averaged) recombination rates. A variety of methods for inferring recombination rates from such data have been developed. In view of the high positive correlations underlying population data of this kind, there is a premium on using as much of the information in the data as possible. Simple methods, based on summary statistics, can be seriously misleading. Several methods for approximating the full likelihood surface, under the standard coalescent model, have recently been developed. All such methods are currently highly computationally intensive, and some can be very misleading. Diagnostics are difficult, and one consequence is that there is a very real danger of inadvertently seriously misestimating the likelihood curve (see the comparisons of different methods in Fearnhead and Donnelly (2001)). In addition, full likelihood methods are impractical for many data sets of the sizes that are currently being generated.

In this paper we have introduced and studied two approximations. The first, a marginal likelihood, ignores some of the information in the data. A careful choice of what to ignore can result in substantial computational savings while losing little of the information in the data that is relevant to estimating recombination rates. An additional approximation in calculating the marginal likelihood also reduces the computational burden, at little cost in accuracy. In considering sequences with values of the scaled mutation and recombination rates θ and ρ up to about 5 (typically up to about 5 kb for human sequences), the use of the approximate marginal likelihood based on ignoring singleton mutations results in computational savings but a likelihood surface that is very close to that for the full data, and we would recommend this approach. We would expect inference based on this approximate marginal likelihood to be similar to that for the full likelihood, which we have shown (Fearnhead and Donnelly, 2001) to be reasonable for both point and interval estimation. (To the extent that they are relevant to this, our current simulations also support this view, except possibly for the 5 kb case, where we believe that errors in assessing the likelihood particularly affect the construction of confidence intervals. As noted above, this is easily rectified in practical data analysis.) Note also that our approximate marginal likelihood method explicitly allows recurrent mutation.
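
As a rough illustration of the data reduction behind the marginal likelihood, the sketch below drops singleton sites from a set of haplotypes. It shows only the preprocessing step, not the likelihood calculation itself. The function name, the 0/1 coding of alleles and the definition of a singleton through the minor allele count (so that no knowledge of the ancestral allele is needed) are assumptions made for the example.

```python
def drop_singletons(haplotypes):
    """Remove sites at which the rarer allele appears on exactly one chromosome.

    haplotypes : list of equal-length strings over {'0', '1'}, one string per
                 sampled chromosome, one character per segregating site.
    Returns the reduced haplotypes and the indices of the retained sites.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0]) if haplotypes else 0
    keep = []
    for j in range(n_sites):
        ones = sum(1 for h in haplotypes if h[j] == "1")
        minor = min(ones, n - ones)
        # Only singletons are discarded; any other site is left untouched.
        if minor != 1:
            keep.append(j)
    reduced = ["".join(h[j] for j in keep) for h in haplotypes]
    return reduced, keep


# Toy sample of five chromosomes and five sites; the last two sites are singletons.
toy_haplotypes = ["00101", "01100", "11000", "10010", "10000"]
toy_reduced, toy_kept = drop_singletons(toy_haplotypes)
```

The retained site indices are returned so that physical positions, which matter for the recombination model, can still be looked up after the reduction.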

The approximate marginal likelihood approach becomes impractical as a high throughput method for sequences with θ and ρ larger than about 5. We note, however, that in one-off analyses considerably larger (or more recombinagenic) regions can be analysed either by full likelihood or approximate marginal likelihood. For example, Fearnhead et al. (2002) used full likelihood to estimate recombination rates from a region with ρ of about 50. Although the computational cost of such an analysis is considerable, it is still much less than the cost of collecting the data in the first place! In this sense, and because of continual increases in computing power, our simulation results on performance should understate what is achievable in practice.

For longer sequences we have suggested a different approximation, which we call composite likelihood. In effect this amounts to replacing the agreed coalescent model by a simpler model in which certain long-range dependences are ignored. The sequenced region of interest is broken up into R subregions, with the composite likelihood being the product of the likelihoods (or, in practice, of the approximate marginal likelihoods) for each subregion. Parameter values which maximize the composite likelihood are used for point estimation. Asymptotic considerations (Fearnhead, 2002) and our simulation study suggest that the resulting procedures provide reasonable point estimates. Interval estimation seems more problematical.
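
The construction just described, a product of subregion likelihoods maximized over the parameter, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: the function names are hypothetical, every subregion's (approximate marginal) log-likelihood is assumed to have been evaluated already on a common grid of ρ per kilobase, and the quadratic toy curves stand in for real likelihood curves.

```python
def composite_loglik(subregion_logliks):
    """Sum per-subregion log-likelihood curves evaluated on a common grid.

    A product of likelihoods over the R subregions is a sum of
    log-likelihoods, which is numerically safer to work with.
    """
    n_grid = len(subregion_logliks[0])
    if any(len(curve) != n_grid for curve in subregion_logliks):
        raise ValueError("all subregion curves must share the same grid")
    return [sum(curve[k] for curve in subregion_logliks) for k in range(n_grid)]


def composite_mle(grid, subregion_logliks):
    """Return the grid value maximizing the composite log-likelihood."""
    total = composite_loglik(subregion_logliks)
    k_best = max(range(len(grid)), key=lambda k: total[k])
    return grid[k_best], total


# Toy example with R = 3 subregions whose curves peak at 1, 2 and 3 per kb.
rho_grid = [0.25 * k for k in range(1, 21)]
toy_curves = [[-((rho - peak) ** 2) for rho in rho_grid] for peak in (1.0, 2.0, 3.0)]
rho_hat, composite_curve = composite_mle(rho_grid, toy_curves)
```

The curvature of the composite curve ignores dependence between subregions, which is one way to see why interval estimation is less straightforward than point estimation.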

The composite likelihood method of point estimation considerably outperforms some published alternatives. It uses information which is complementary to that used in Hudson's pairwise approach (itself a composite likelihood in our terminology, but in which the likelihood for all pairs of sites in the data is combined as if the pairs were all independent). Our conclusion is that for ‘short’ sequences, say ρ≤ 5–10, maximizing the approximate marginal likelihood (ignoring only singleton mutations) should provide better point estimates, but as the sequence length grows beyond this Hudson's method will tend to outperform our composite likelihood approach. Hudson's approach has an added practical advantage in that it deals with diploid (unphased) sequence data. It is also much faster computationally. Work in progress aims to combine both approaches appropriately.

A general point is that, in the context of suspected variation in local recombination rates, the practical relevance of performance over long sequence lengths is questionable. The borrowing of information from neighbouring regions which is implicit in simple applications of methods to large sequences may be of limited value or, worse, inappropriate, if there is considerable local variation in rates, putting a premium on methods which perform well over smaller scales. Here and elsewhere we have used rather informal methods for assessing the variation in rate, and there remains a need for practicable methods for fitting models which allow such variation (both for recombination rates and, as in the LPL data, for variation in mutation rates).

The general statistical issue for our composite likelihood approach is interesting and, apparently, not even straightforward to formulate. The trade-offs over the choice of size of subregion involve comparing the numerical inaccuracy of estimates of likelihood curves for each subregion, which increase with the size of the subregion but are notoriously difficult to assess accurately in exactly those settings where they are serious, and the loss of information in ignoring dependence across subregions, which decreases with the size of subregion. Note that considerably better implementations, in the sense of using larger subregions, are practicable for a single data set than for a simulation study to assess performance.

We have used our methods to (re)analyse a recent high profile data set consisting of 9.7 kb of human sequence data from the LPL gene. We find support for a considerably increased mutation rate at CpG sites, although we conclude on several grounds that its presence will not seriously affect estimates of recombination rates (cf. Templeton et al. (2000a)). For similar reasons, we would not expect plausible levels of variation in mutation rates in the (nuclear) human genome to have a great effect on estimates of recombination rates from human population data.

The signal for a putative recombination hot spot is strong for two of the three samples analysed but effectively absent from the third. This is curious, and disconcerting. One possibility is that the Jackson population, being predominantly African American, has a larger effective population size. This would increase ρ throughout the region, perhaps in just such a way as to destroy patterns of LD in the 5′ end of the sequence, but not those in the 3′ end. Such a post hoc explanation deserves scepticism, and the problem warrants further attention.

The background levels of recombination in the Finnish and Rochester samples are around the genome-wide average, and for these samples the rate in the putative hot spot was higher by an order of magnitude of perhaps inline image or 1. Rates in the Jackson sample seem about 10–20 times higher than the genome-wide average, across the whole sequence. In contrast, estimates of the mutation parameter θ were 1–2 times the average of 1 per kilobase, with the Jackson estimate larger, but only by about 30–50%.

We have not addressed here the robustness of the standard coalescent model for human populations. Fearnhead and Donnelly (2001) offered some encouragement about the performance of full likelihood methods when the data are generated by a more complicated demographic scenario. Another general point is that our analysis does not treat recombination separately from gene conversion. (Gene conversion can be thought of as a recombination-like event in which two crossings over occur within a short stretch, of perhaps a few hundred base pairs.) In this sense our (and some others’) estimated rates do not relate solely to the (simple) recombination events that we observed in pedigrees. There is currently conflicting evidence on the extent to which gene conversion is important in affecting patterns of polymorphism in humans (e.g. Przeworski and Wall (2001), Frisse et al. (2001) and Jeffreys et al. (2001)). As Przeworski and Wall (2001) noted, this may account for the observation (including ours for the LPL gene) of higher local rates than would be consistent with genome-wide averages from pedigree studies.

Acknowledgements

This work was supported by Engineering and Physical Sciences Research Council grant GR/M14197 and Biotechnology and Biological Sciences Research Council grant 43/MMI09788. We thank Dick Hudson, Molly Przeworski, Matthew Stephens and three referees for comments on an earlier version of the manuscript.

References

  • Azzalini, A. (1983) Maximum likelihood estimation of order m for stationary stochastic processes. Biometrika, 70, 381–387.
  • Besag, J. (1975) Statistical analysis of non-lattice data. Statistician, 24, 179–195.
  • Clark, A. G., Weiss, K. M., Nickerson, D. A., Taylor, S. L., Buchanan, A., Stengard, J., Salomaa, V. et al. (1998) Haplotype structure and population genetic inferences from nucleotide sequence variation in human Lipoprotein Lipase. Am. J. Hum. Genet., 63, 595–612.
  • Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. and Lander, E. S. (2001) High-resolution haplotype structure in the human genome. Nat. Genet., 29, 229–232.
  • Donnelly, P. and Tavaré, S. (1995) Coalescents and genealogical structure under neutrality. A. Rev. Genet., 29, 401–421.
  • Fearnhead, P. (2002) Consistency of estimators of the population-scaled recombination rate. Technical Report. Department of Mathematics and Statistics, Lancaster University, Lancaster.
  • Fearnhead, P. and Donnelly, P. (2001) Estimating recombination rates from population genetic data. Genetics, 159, 1299–1318.
  • Fearnhead, P., Harding, R. M., Schneider, J. A., Clegg, J. B. and Donnelly, P. (2002) Extreme local variation in local recombination rates near the β-globin hot-spot revealed by coalescent analysis of population data. To be published.
  • Frisse, L., Hudson, R. R., Bartoszewicz, A., Wall, J. D., Donfack, J. and Di Rienzo, A. (2001) Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet., 69, 831–843.
  • Griffiths, R. C. and Marjoram, P. (1996a) Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol., 3, 479–502.
  • Griffiths, R. C. and Marjoram, P. (1996b) An ancestral recombination graph. In IMA Volume on Mathematical Population Genetics (eds P. Donnelly and S. Tavaré), pp. 257–270. New York: Springer.
  • Holmans, P. (2001) Nonparametric linkage. In Handbook of Statistical Genetics (eds D. J. Balding, M. Bishop and C. Cannings), pp. 478–506. Chichester: Wiley.
  • Hudson, R. R. (1983) Properties of a neutral allele model with intragenic recombination. Theor. Popln Biol., 23, 183–201.
  • Hudson, R. R. (1987) Estimating the recombination parameter of a finite population without selection. Genet. Res., 50, 245–250.
  • Hudson, R. R. (1990) Gene genealogies and the coalescent process. In Oxford Surveys in Evolutionary Biology (eds D. Futuyma and J. Antonovics), vol. 7, pp. 1–44.
  • Hudson, R. R. (2001a) Two-locus sampling distributions and their application. Genetics, 159, 1805–1817.
  • Hudson, R. R. (2001b) Linkage disequilibrium and recombination. In Handbook of Statistical Genetics (eds D. J. Balding, M. Bishop and C. Cannings), pp. 309–324. Chichester: Wiley.
  • Hudson, R. R. and Kaplan, N. (1985) Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111, 147–164.
  • Jeffreys, A. J., Kauppi, L. and Neumann, R. (2001) Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet., 29, 217–222.
  • Kaplan, N. and Hudson, R. R. (1985) The use of sample genealogies for studying a selectively neutral m-loci model with recombination. Theor. Popln Biol., 28, 382–396.
  • Kingman, J. F. C. (1982a) The coalescent. Stoch. Process. Applic., 13, 235–248.
  • Kingman, J. F. C. (1982b) Exchangeability and the evolution of large populations. In Exchangeability in Probability and Statistics, pp. 97–112. Amsterdam: North-Holland.
  • Krawczak, M., Ball, E. V. and Cooper, D. N. (1998) Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am. J. Hum. Genet., 63, 474–488.
  • Kuhner, M. K., Yamato, J. and Felsenstein, J. (2000) Maximum likelihood estimation of recombination rates from population data. Genetics, 156, 1393–1401.
  • McVean, G. A. T., Awadalla, P. and Fearnhead, P. (2002) A coalescent method for detecting recombination from gene sequences. Genetics, to be published.
  • Nachman, M. W. and Crowell, S. L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics, 156, 297–304.
  • Nickerson, D. A., Taylor, S. L., Weiss, K. M., Clark, A. G., Hutchinson, R. G., Stengard, J., Salomaa, V., Vartiainen, E., Boerwinkle, E. and Sing, C. F. (1998) DNA sequence diversity in a 9.7-kb region of the human Lipoprotein Lipase gene. Nat. Genet., 19, 233–240.
  • Pluzhnikov, A. and Donnelly, P. (1996) Optimal sequencing strategies for surveying molecular genetic diversity. Genetics, 144, 1247–1262.
  • Pritchard, J. K. and Przeworski, M. (2001) Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet., 69, 1–14.
  • Przeworski, M. and Wall, J. D. (2001) Why is there so little intragenic linkage disequilibrium in humans? Genet. Res., 77, 143–151.
  • Sobel, E. and Lange, K. (1996) Descent graphs in pedigree analysis: application to haplotyping, location scores, and marker sharing statistics. Am. J. Hum. Genet., 58, 1323–1337.
  • Speed, T. P. and Zhao, H. (2001) Chromosome maps. In Handbook of Statistical Genetics (eds D. J. Balding, M. Bishop and C. Cannings), pp. 3–38. Chichester: Wiley.
  • Stephens, M. (2001) Inference under the coalescent. In Handbook of Statistical Genetics (eds D. J. Balding, M. Bishop and C. Cannings), pp. 213–238. Chichester: Wiley.
  • Stephens, M. and Donnelly, P. (2000) Inference in molecular population genetics (with discussion). J. R. Statist. Soc. B, 62, 605–655.
  • Stephens, M., Smith, N. J. and Donnelly, P. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68, 978–989.
  • Templeton, A. R., Clark, A. G., Weiss, K. M., Nickerson, D. A., Boerwinkle, E. and Sing, C. F. (2000a) Recombinational and mutational hotspots within the human Lipoprotein Lipase gene. Am. J. Hum. Genet., 66, 69–83.
  • Templeton, A. R., Weiss, K. M., Nickerson, D. A., Boerwinkle, E. and Sing, C. F. (2000b) Cladistic structure within the Lipoprotein Lipase gene and its implications for phenotypic association studies. Genetics, 156, 1259–1275.
  • Thompson, E. A. (2001) Linkage analysis. In Handbook of Statistical Genetics (eds D. J. Balding, M. Bishop and C. Cannings), pp. 541–564. Chichester: Wiley.
  • Wakeley, J. (1997) Using the variance of pairwise differences to estimate the recombination rate. Genet. Res., 69, 45–48.
  • Wall, J. D. (2000) A comparison of estimators of the population recombination rate. Molec. Biol. Evoln, 17, 156–163.