Summary. Background: Hemophilia A (HA) has a high level of variation within the disease class, with more than 1000 mutations being listed in the HAMSTeRS database. At the same time a number of F8 mutations are present in specific populations at high frequencies. Objectives: The simultaneous presence of large numbers of rare mutations and a small number of high-frequency mutations raises questions about the origins of HA mutations. The present study was aimed at describing the origins of HA mutations in the complete Swedish population. The primary issue was to determine what proportion of identical mutations are identical by descent (IBD) and what proportion are attributable to recurrent mutation events. The age of IBD mutations was also determined. Patients/Methods: In Sweden, the care of HA is centralized, and the Swedish HA population consists of ∼ 750 patients from > 300 families (35% severe, 15% moderate, and 50% mild). Identical haplotypes were defined by single-nucleotide polymorphism and microsatellite haplotyping, and the ages of the mutations were estimated with estiage. Results: Among 212 presumably unrelated patients with substitution mutations, 97 (46%) had mutations in common with other patients. Haplotyping of the 97 patients showed that 47 had IBD mutations (22%) with estimated ages of between two and 35 generations. The frequency of mild disease increased with an increasing number of patients sharing the mutations. Conclusions: A majority of the IBD mutations are mild and have age estimates of a few hundred years, but some could date back to the Middle Ages.
Introduction
Both disease-causing mutations and neutral polymorphisms appear de novo in each new generation. Given an estimated world population of 7 × 10⁹ individuals and an average mutation frequency of 10⁻⁸ per base pair and generation, it is clear that all changes compatible with life are represented in the human population; that is, all positions will be mutated more than once. The vast majority of these changes are, however, exceedingly rare and present in only very few individuals, and will normally not be detected, because of the very limited sample sizes in most studies. Genetic diseases vary widely with respect to the number of disease-causing mutations that exist within populations. In general, the number of different alleles is lower when the total frequency of disease alleles is high, whereas in diseases with a low overall frequency of disease alleles, the number of different alleles is much higher. Accordingly, properties such as a severe phenotype, dominance and high penetrance are associated with the existence of many different disease alleles, whereas a benign phenotype, recessiveness and low penetrance are associated with a lower number of alleles; that is, mutations are present at higher frequencies. In diseases where only a very limited number of different disease alleles are found, and thus a large proportion of all families carry identical mutations, it is very likely that such an allele has a single mutational origin; that is, it is identical by descent (IBD). At the other extreme of the spectrum, in diseases where a very large number of alleles exist, there may still be cases where apparently unrelated families are found to carry identical mutations. In these cases, however, it is far from evident that they have single mutational origins; they may also be the result of recurrent mutations (RMs).
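The saturation argument above can be checked with a short back-of-the-envelope calculation. The sketch below uses only the two figures quoted in the text; the function name and the assumption of two genome copies per individual (autosomal sites) are illustrative and not taken from the study.

```python
def expected_new_mutations_per_site(population, rate_per_bp, copies_per_person=2):
    """Expected number of new mutations at one base pair in one generation.

    copies_per_person=2 is an assumption (diploid, autosomal site).
    """
    return population * copies_per_person * rate_per_bp


# Figures quoted in the text: 7e9 individuals, 1e-8 per base pair and generation.
per_site = expected_new_mutations_per_site(7e9, 1e-8)

# Each site can change to three alternative bases, so divide by 3 for the
# rough expectation per specific substitution (ignores rate differences).
per_specific_change = per_site / 3

print(round(per_site), round(per_specific_change))
```

Under these assumptions, every site is expected to be hit many times in each generation, which is the sense in which all positions are mutated more than once.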
As a consequence, the presence of the mutation itself in several individuals is not sufficient to determine whether the mutation has a common ancestor or not, i.e. is IBD or an RM. This must instead be inferred from the haplotypes of the carriers. Haplotypes that are IBD between closely related individuals tend to be much longer than shared haplotypes from a population sample, where many generations of recombination have fragmented the ancestral common haplotype. However, even after 100 generations of random mating and recombination, the average length of the shared haplotypes will still be ∼ 0.6 cM. To identify and define IBD mutations, a combined analysis of single-nucleotide polymorphisms (SNPs) and microsatellites from the chromosome region that is potentially IBD is usually performed [4,5]. In principle, all SNPs are potentially informative for determining whether mutations are IBD, but the most efficient approach is to use tagSNPs from all relevant haploblocks outwards from the mutation. The potential IBD haplotype under study will thus be defined by all analyzed SNPs of the long-range haplotype. The power of this analysis to discriminate between a true IBD haplotype and an identical haplotype occurring by chance is determined by the population frequency of the long-range haplotype. For very long haplotypes encompassing many haploblocks, the frequency of the haplotype in the population will be very low. As microsatellites represent a source of genetic variation additional to SNPs, and as they are also much more polymorphic than SNPs, they are the marker system of choice for discriminating between different haplotypes.
Hemophilia A (HA) is an X-linked congenital bleeding disorder, caused by a lack of or dysfunction of coagulation factor VIII, and is classified as severe (< 1%), moderate (1–5%), or mild (5–40%), according to the plasma activity of FVIII. Several different types of HA mutation, such as inversions, deletions, duplications, and substitutions, occur in the 186-kb F8 gene. HA has a population frequency of one in 5000 males, and a high level of variation within the disease class. The HAMSTeRS database (http://hadb.org.uk/) lists > 1000 mutations, some of which are present at high frequencies and others of which are present in a single individual. In Sweden, the care of HA is centralized, and all patients are carefully registered. The Swedish HA population consists of ∼ 750 patients from > 300 families (35% severe, 15% moderate, and 50% mild).
HA is caused by many different mutations, as is expected for an X-linked monogenic disease, but a number of reports have also described some F8 mutations that are present in specific populations at high frequencies. One hemophilic mutation, c.6047C>T (V2016A), was identified in an isolated population in rural Newfoundland. With the use of two markers, a total of 44 patients were found to carry the same mutation on the same disease-associated haplotype. An exon 13 duplication was detected in northern Italy in 10 of 31 mildly affected patients with the use of three markers. In an Irish study, a hemophilic missense mutation, c.1649G>A (R531H), was identified in 13 of 69 presumably unrelated families. An identical disease haplotype was identified in all 13 patients with the use of four markers. In a systematic study for mutations in Spain, the screening of 114 unrelated patients detected the c.3780C>G (D1241E) polymorphism in 22 patients, and showed that it was overrepresented in mild HA patients, in whom genetic analyses of F8 failed to detect another pathologic mutation. Also in this case, the disease-associated haplotype was shown to be identical in all 22 patients with the use of four markers. In another example from southern Italy, 20 HA patients with mild disease were shown to carry the same mutation in intron 10 (c.1538-18G>A). Haplotyping with seven markers showed that all patients shared the same haplotype, suggesting the existence of a single ancestor. Finally, seven seemingly unrelated patients with mild HA from Austria had microsatellite haplotypes that were derived from an ancestral haplotype carrying the hemophilic splice site mutation c.788-14T>G. The patients in the examples above have, in all cases, been reported as having a mild phenotype and a common ancestor, i.e. they are assumed to be IBD.
The Swedish HA register provides an opportunity to study an almost unbiased sample of an entire HA population. The present study specifically investigated the cases where the same base change or short insertion/deletion (indel) was found in several, presumably independent, patients. The main task was to determine the origin of these mutations: whether they are truly independent, i.e. resulting from independent mutation events, or whether they have a common ancestor. In the latter cases, the ages of the mutations were estimated.
Materials and methods
The study initially included 301 unrelated Swedish male patients with HA comprising all levels of clinical severity. Clinical and laboratory data were recorded for each patient, and their phenotypes were classified as mild, moderate, or severe. They were analyzed by PCR for the presence of inversions in intron 1 and intron 22, and analyzed by multiplex ligation-dependent probe amplification for the presence of deletions and duplications. They were subsequently subjected to DNA sequencing of all coding sequences for the presence of substitutions and small indels. The 301 patients were classified, according to their mutations, into the following classes: inversions in intron 1 (4), large deletions (5), no mutation identified (9), inversions in intron 22 (71), and small indels and substitutions (212). The class containing small indels and substitutions was selected for further analysis (Table S1). A control population of 285 individuals was also studied. As F8 is located on the X-chromosome, and both patients and control individuals were all male, a direct determination of haplotypes was possible. The ethical committee of Lund University and the Swedish Data Inspection Board approved the study, and informed consent was obtained from all participating individuals. Genomic DNA was extracted from blood collected in EDTA with the QIAamp DNA Blood Maxi kit (Qiagen, Hilden, Germany), and DNA concentrations were determined by fluorometry with PicoGreen (Molecular Probes, Eugene, OR, USA).
Genetic variation in a 17-Mb chromosome region covering F8 at the distal end of the X-chromosome was analyzed for 70 SNP markers (Table S2). The selection of markers was such that a set of 18 markers covered F8 (0.2 Mb), a set of six markers covered 0.6 Mb of the telomeric flank, and an additional set of 46 markers covered 16.2 Mb of the centromeric flank. The SNP genotypes were determined with the Sequenom MassARRAY MALDI-TOF system. The system analyzes allele-specific primer extension products by using mass spectrometry. Assay design was performed with spectrodesigner software (Sequenom, San Diego, CA, USA), and primers were obtained from Metabion GmbH (Martinsried, Germany) (Table S2). A total of 212 patients and 285 controls were genotyped for the 70 SNPs. The resulting set of 34 790 genotypes contained < 1.0% missing values, and a set of 38 individuals analyzed in duplicate were concordant for all informative combinations. No single individual or marker had > 5% missing values.
A 500-kb DNA sequence containing the F8 locus was searched for short tandem repeats with tandem repeats finder software. A number of tandemly repeated sequences that had > 15 perfect dinucleotide repeat units were identified, and assay systems were designed with primer blast (http://www.ncbi.nlm.nih.gov/tools/primer-blast/). Five of the assay systems passed the quality control (Table S3), and were subsequently used to analyze both patients and controls. The microsatellite markers were amplified with Veriti 384 PCR machines (Applied Biosystems, Foster City, CA, USA), as described in Table S3. PCR products were pooled and diluted in formamide before separation by capillary electrophoresis on ABI 3130 XL or ABI 310 DNA sequencers. Data were analyzed with genemarker software (SoftGenetics, State College, PA, USA).
In all cases where two or more patients had the same mutation, their haplotypes were compared. First, the frequencies of the shared haplotypes were scored among the controls. Some of the individuals had completely different haplotypes, and were therefore coded as not applicable. A second measure based on the length of the shared haplotype was subsequently calculated. For all pairwise combinations of chromosomes in the control population, the length of the shared haplotype, starting from the position of the mutation, was determined. The proportion of pairs of chromosomes with an equal or longer shared haplotype than observed for the mutated pair was then scored. The estiage program was used to estimate the time from the most recent common ancestor, based on the breakdown of haplotypes and assuming a common origin of identical mutations. The program used allele frequencies from the control population (Table S2) and recombination frequencies from HapMap (http://hapmap.ncbi.nlm.nih.gov/). Linear interpolation was used to obtain data from markers where genetic map position data were lacking. The mutation rate for SNPs was set to 0 in this analysis. Further details of the genetic analysis are presented in Data S1.
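The second measure, based on the length of the shared haplotype, can be illustrated with a small sketch. It is not the implementation used in the study: the function names, the toy haplotypes and positions, and the definition of length as the span between the outermost shared markers around the mutation are all assumptions made for illustration, and missing genotypes are ignored.

```python
from bisect import bisect_left
from itertools import combinations


def shared_length(hap_a, hap_b, positions, mutation_pos):
    """Span of the run of identical alleles surrounding mutation_pos.

    positions must be sorted; the result is in the same units as positions.
    """
    start = bisect_left(positions, mutation_pos)  # first marker at/after the mutation
    left = start
    while left > 0 and hap_a[left - 1] == hap_b[left - 1]:
        left -= 1                                  # extend the shared run to the left
    right = start
    while right < len(positions) and hap_a[right] == hap_b[right]:
        right += 1                                 # extend the shared run to the right
    if right == left:                              # no shared marker on either side
        return 0
    return positions[right - 1] - positions[left]


def proportion_equal_or_longer(controls, positions, mutation_pos, observed):
    """Fraction of control pairs sharing a haplotype at least as long as observed."""
    lengths = [shared_length(a, b, positions, mutation_pos)
               for a, b in combinations(controls, 2)]
    return sum(length >= observed for length in lengths) / len(lengths)


# Hypothetical toy data: five markers (positions in kb), four control chromosomes.
positions = [10, 20, 30, 40, 50]
controls = ["AAGTC", "AAGTT", "CAGAC", "ACCTC"]
print(proportion_equal_or_longer(controls, positions, 25, observed=10))
```

A small proportion means that few random control pairs share a haplotype as long as the one seen in the patients, which argues against sharing by chance.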
Results
The 212 patients, carrying a total of 151 different substitution mutations and small indels (Table S1), represent families with no known common ancestors. A total of 115 mutations occurred in single individuals, among whom there were four cases where the same base position had been mutated twice to different bases. The remaining 36 mutations occurred in two or more individuals. Thus, a total of 97 patients of 212 (46%) had mutations in common with other, presumably unrelated, patients (Fig. 1).
A graphical representation of the haplotypes of these 97 patients is shown in Fig. 2, where each row represents the graphical genotype of a patient for all 70 SNP markers. The actual sizes of the haplotypes that were identical between patients are given in Table 1 for each patient. The sizes ranged from 0 to 16 998 kb. The size 0 was represented by seven patients with mutations at nucleotide positions c.209, c.1648, c.3637, c.5123, and c.5303, where no allele sharing at all occurred. The frequencies of all specific haplotypes and the frequencies of all pairwise haplotypes of equal or longer lengths were also determined among 285 control individuals (Table 1). It can be seen that these two measures were positively correlated, in addition to being negatively correlated with the length of the shared haplotype. Both frequencies varied from 0 to ∼ 0.8. For the seven patients lacking shared alleles, no frequencies are given.
Table 1. Haplotype information for all patients sharing identical mutations
Five microsatellite markers (M1-M5 in Table 1, Fig. 2 and Table S3) were used for haplotyping of the F8 region in the 97 patients and in 254 controls. A total of 73 different haplotypes were identified among the 254 controls. The three most common haplotypes were present at frequencies of 34%, 6.3%, and 4.7%, respectively, whereas 63 of the 73 haplotypes were present at frequencies of < 2% in the population (Table S3). Thus, the discriminative power of this set of microsatellite markers is fairly high, corresponding to a genetic identity of 0.13. More than 85% of this identity can be attributed to the most common haplotype.
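The relationship between the haplotype frequencies and the genetic identity can be sketched as follows, assuming that genetic identity is defined as the sum of squared haplotype frequencies (the probability that two randomly drawn chromosomes carry the same haplotype); the text does not spell out the definition. Only the three frequencies quoted above are used; the function name is illustrative.

```python
def haplotype_identity(frequencies):
    """Sum of squared frequencies: chance that two random chromosomes match."""
    return sum(p * p for p in frequencies)


top_three = [0.34, 0.063, 0.047]       # frequencies quoted in the text
reported_identity = 0.13               # total over all 73 haplotypes, as reported

partial = haplotype_identity(top_three)          # contribution of the top three
share_of_most_common = 0.34 ** 2 / reported_identity

print(round(partial, 4), round(share_of_most_common, 2))
```

Under this assumed definition, the most common haplotype alone contributes 0.34² ≈ 0.116, which is consistent with the statement that it accounts for more than 85% of the identity.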
The identical SNP haplotypes among the 97 patients (marked dark gray in Fig. 2) were further evaluated with the microsatellite data. Seven patients had completely different SNP haplotypes, and 43 patients had both haplotype frequency measures (specific SNP haplotype and shared haplotype length; Table 1) of < 0.10 in the control population. Within this group of 43 patients, patients with the same mutation all had identical microsatellite haplotypes. The remaining 47 patients had at least one haplotype measure of ≥ 0.10. Among these, four patients had no microsatellite markers located within the shared SNP haplotypes, 39 patients had different microsatellite haplotypes (marked red in Fig. 2), and only four had identical microsatellite haplotypes. These four patients all carried the most common microsatellite haplotype, with a control population frequency of 0.34.
By the use of data on genetic map positions and allele frequencies for all SNP markers (Table S2), mutation ages were estimated with estiage. To provide a background against which the estimated ages could be interpreted, a simple simulation was performed to calculate the expected survival time of HA mutations. The detailed description of the simulation and the results are presented in Data S1. The results show that mild mutations are unlikely to survive for 50 generations and very unlikely to survive for 100 generations. On the basis of the SNP haplotypes, the mutations were subdivided into three categories: (i) mutations where all patients had both measures < 0.10; (ii) mutations where no patient had both measures < 0.10; and (iii) mutations where some patients had both measures < 0.10 and some did not. These categories were represented by 12, 20 and four mutations, respectively. The 12 mutations in the first category represented a total of 34 patients with age estimates ranging from two to 35 generations. These ages were compatible with the mutations being IBD according to the simulations. All of these haplotypes also showed a complete absence of microsatellite differences. Thus, all available facts point to these mutations being IBD. The 20 mutations in the second category represented a total of 45 patients with age estimates ranging from 87 to 26 275 generations. In some cases, the shared haplotype was so short that the program could not return an estimate. According to the simulations, these ages are very unlikely for HA mutations. In addition, 18 of them had microsatellite differences (including the mutation with an age estimate of 87 generations). These mutations are therefore interpreted as being RMs. The four mutations in the third category represented 18 patients. Thirteen of these had low to moderate SNP haplotype measures, shared identical microsatellite haplotypes, and had age estimates varying from two to 24 generations. 
These are interpreted as being IBD. The remaining five either lacked a shared haplotype or had high SNP haplotype measures in combination with microsatellite differences. These are interpreted as being RMs. Thus, according to these criteria, a total of 50 patients had RMs and 47 patients had shared IBD mutations.
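The three-way subdivision of mutations by their SNP haplotype measures can be written as a simple rule. The sketch below is illustrative only: the function name, the data layout and the example values are hypothetical, and the final IBD/RM call in the study also drew on microsatellite haplotypes and age estimates, which are not modeled here.

```python
THRESHOLD = 0.10  # cut-off applied to both SNP haplotype measures in the text


def snp_category(patients):
    """Return 1, 2 or 3 for one mutation.

    patients: one entry per patient, either a tuple
    (haplotype_frequency, shared_length_frequency) or None when the
    patient shares no haplotype with the others (assumed encoding).
    """
    both_low = [p is not None and p[0] < THRESHOLD and p[1] < THRESHOLD
                for p in patients]
    if all(both_low):
        return 1   # every patient has both measures below the threshold
    if not any(both_low):
        return 2   # no patient has both measures below the threshold
    return 3       # mixed: some patients do, some do not


# Hypothetical example values, not taken from Table 1.
print(snp_category([(0.00, 0.01), (0.02, 0.03)]))   # category 1
print(snp_category([(0.45, 0.60), None]))           # category 2
print(snp_category([(0.01, 0.02), (0.30, 0.05)]))   # category 3
```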
All 212 patients were also compared with respect to phenotype (Table S1). Ninety-one severe (43%), 30 moderate (14%) and 91 mild (43%) phenotypes were present among the 212 patients. Figure 3 shows the relative frequencies of patients belonging to the phenotype classes severe, moderate and mild as a function of how many patients had identical mutations. A total of 115 mutations occurred in a single individual. These had a relative frequency of severe disease of 55%. This frequency decreased with an increasing number of patients sharing the mutations. A quite different pattern was observed for mild disease, where a relative frequency of 31% was observed in the single individual group. This frequency then gradually increased up to 79% in the group where seven patients shared the same mutation.
Discussion
The present study investigated different properties of mutation-carrying F8 haplotypes. This information was then used to infer whether identical mutations were IBD or RMs. The SNP haplotype measures of similarity (Table 1) calculated the probability of attaining the observed SNP haplotypes by chance, i.e. given that they are RMs. Thus, both measures function as P-values in a test with RM as the null hypothesis and IBD as the alternative. This means that small values in general indicate IBD mutations, and high values are compatible with the mutations being RMs. However, it is possible for mutations that are IBD to reside on a common haplotype. In such cases, the age estimates are informative, as they are estimated under the assumption that all cases of a specific mutation are IBD. There is no direct calculation of a probability, but the simulation results indicate a likely age interval. Thus, the age estimates only have meaning in cases where a group of patients are classified as having IBD mutations. It must be noted that all estimates of ages of mutations have a considerable variance; that is, the confidence intervals are quite broad. In addition, they are skewed towards high values, as illustrated by the mutation at position c.3146, where the age estimate is 35 generations but the upper 95% limit is 286 generations. By using the point estimate, we are thus conservative in favor of IBD mutations rather than RMs. estiage assumes a star phylogeny for cases of more than two patients, which again favors IBD mutations rather than RMs. However, the fact that the inferred numbers of IBD mutations and RMs are so similar shows that both categories exist in substantial numbers in the Swedish HA population.
The presence of many RM cases is far from the paradigm underlying the interpretation of common and neutral SNPs, where identical alleles are assumed to be IBD. The RM cases in the present study were detected because the sampling of patients affected by a monogenic disease directly selects for mutations in the underlying gene. Collecting large numbers of such patients will ultimately result in the accumulation of RMs. Another reflection of this is the four cases where the same position has been mutated twice into different base pairs.
The HAMSTeRS database combines information on cases from the whole world; it is therefore likely that frequently reported mutations are RMs. It is also likely that more severe phenotypes are reported more comprehensively to the database than mutations with a mild phenotype. The HAMSTeRS database was surveyed for non-Swedish reports of the mutations in Table 1. A majority of these mutations have also been reported in other countries. The survey revealed a striking difference between the categories of mutations discussed above. The mutations in category 1 have an average of 0.75 (range: 0–2) reports, whereas the mutations in category 2 have an average of 16 (range: 0–65) reports. Our interpretation of this difference is that certain positions have a higher mutation rate that shows a similar pattern all over the world (RMs that mostly have a severe phenotype). The cases in category 1 represent relatively rare mutations that have reached a high number in the Swedish population (IBD mutations that mostly have a mild phenotype). This points to the existence of a third class with a high mutation rate and a mild phenotype. This class is probably represented by category 3, which is dominated by mild phenotypes and IBD mutations.
The present study is based on the analysis of 70 SNP and five microsatellite markers, of which 30 SNP and five microsatellite markers covered a 1.4-Mb region containing F8. Identical SNP haplotypes in patients with the same IBD mutations had, in most cases, breakpoints outside of this 1.4-Mb region, whereas most cases regarded as RMs had breakpoints within the 1.4-Mb region. As the microsatellite information is obviously critical to the discrimination of IBD mutations or RMs in these cases, the accuracy of the microsatellite data is of great importance. The microsatellite analyses were therefore performed in duplicate, and each dataset was interpreted independently by two persons. A few non-concordant results were reanalyzed, and could thereafter be unambiguously determined. This dataset is therefore assumed to have very few, if any, errors. In addition, ∼ 2.5 allele differences can be expected to occur in the dataset as a result of microsatellite mutation, assuming the analysis of 97 patients for five microsatellite markers, a maximum age of 50 generations for each mutation, and a microsatellite mutation rate of 10⁻⁴ per locus and generation. As the dataset contains 27 polymorphisms (marked red in Fig. 2), mutation can only explain a minor fraction (∼ 10%) of the observed polymorphisms.
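The expected number of microsatellite differences arising by mutation follows directly from the figures given above, as this short check shows. All inputs are taken from the text; the variable names are illustrative.

```python
patients = 97            # patients sharing mutations
markers = 5              # microsatellite markers per patient
max_generations = 50     # assumed maximum age of each mutation
mutation_rate = 1e-4     # per locus and generation

# Expected number of microsatellite changes accumulated across the dataset.
expected_changes = patients * markers * max_generations * mutation_rate

observed_polymorphisms = 27
explained_fraction = expected_changes / observed_polymorphisms

print(round(expected_changes, 3), round(explained_fraction, 2))
```

The product comes to roughly 2.4, or about 9% of the 27 observed differences, in line with the approximate values of 2.5 and 10% quoted in the text.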
As pointed out in the Introduction, the patients in the present study represent a major proportion of the HA families in Sweden. The investigated material therefore, in principle, constitutes an unbiased sample of all HA mutations occurring in Sweden. The major theme of the present study has been to discriminate between IBD mutations and RMs, which were found in approximately equal proportions. The implications of this finding for the molecular diagnostics of HA are obvious: in regions where there is a very high frequency of IBD mutations, a first screen for such mutations is warranted, before resequencing of the complete gene. In the Swedish population, with 16 different IBD mutations present in the population and with none present at high frequency, no such screen is warranted. Another consequence of this finding is that the genetic identity observed in the class of mutations is higher than an identity with respect to a common mutational origin. The mutations identified as IBD are apparently, in some cases, fairly young, with estimated ages of two or three generations. However, the available pedigree data cannot confirm this, indicating higher ages in these cases. As the estimates all have confidence intervals ranging between one and at least six generations (6–16), and our pedigree data seldom contain more than three or four generations, it may be that estiage slightly underestimates low ages. Within the group of mutations that are IBD, the highest estimates are 24, 26 and 35 generations, pointing to ages in the interval 500–900 years. This means that some of the mutations could date back to the Middle Ages.
Disclosure of conflict of interests
The authors state that they have no conflict of interest.