Giovanni Romeo, Unità Operativa di Genetica Medica, Dipartimento di Scienze Ginecologiche, Ostetriche e Pediatriche, Policlinico Sant’Orsola Malpighi, Università di Bologna, Via Massarenti 9, 40136 Bologna, Italy. Tel: 0039 051 2088420; Fax: 0039 051 5870611; E-mail: email@example.com
In principle, mutational records make it possible to estimate frequencies of disease alleles (q) for autosomal recessive disorders using a novel approach based on the calculation of the Homozygosity Index (HI), i.e., the proportion of homozygous patients, which is complementary to the proportion of compound heterozygous patients P(CH). The rarer the disorder, the higher the HI and the lower the P(CH) are expected to be. To test this hypothesis, we used mutational records of individuals affected with Familial Mediterranean Fever (FMF) and Phenylketonuria (PKU), born to either consanguineous or apparently unrelated parents from six population samples of the Mediterranean region. Despite the unavailability of precise values of the inbreeding coefficient for the general population, which are needed in the case of apparently unrelated parents, our estimates of q are very similar to those of previous descriptive epidemiological studies. Finally, we inferred from simulation studies that the minimum sample size needed to use this approach is 25 patients with either unrelated or first-cousin parents. These results show that the HI can be used to produce a ranking order of allele frequencies of autosomal recessive disorders, especially in populations with high rates of consanguineous marriages.
Studies of consanguinity have taken advantage of the relationship between the frequency of a given autosomal recessive disorder and the proportion of offspring of consanguineous couples affected with the same disorder. If the inbreeding coefficient between parents is known, this approach allows one to calculate the frequency of autosomal recessive disorders in a way that is free of the problems arising from missed diagnoses (incomplete ascertainment). This requires first of all a fairly precise computation of the inbreeding coefficient (F) in the general population. In classical studies of consanguinity, the gene frequency (q), and thus the prevalence (P), of autosomal recessive conditions has been inferred from the ratio between the frequency of consanguineous parents of children affected with a specific autosomal recessive disorder and the frequency of consanguineous couples of the same degree of kinship in the general population. This type of approach has made use of mathematical formulas worked out by population geneticists like Dahlberg (1948) and has made it possible to estimate the prevalence of, among others, Phenylketonuria, Friedreich ataxia, and Cystic Fibrosis in Italy (Romeo et al., 1983a, b, 1985).
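As an illustration of the kind of relationship involved, the classical argument can be sketched for first-cousin marriages only (F = 1/16), assuming that all other couples are unrelated. The symbols c (proportion of first-cousin couples in the general population) and k (proportion of first-cousin couples among parents of affected children) are introduced only for this sketch. It is a textbook-style reconstruction and not a quotation of the formulas used in the studies cited above.

```latex
\begin{align}
P(\text{affected} \mid \text{first cousins}) &= \tfrac{1}{16}\,q + \tfrac{15}{16}\,q^{2} \\
P(\text{affected} \mid \text{unrelated}) &= q^{2} \\
k &= \frac{c\,(1 + 15q)}{c + 16q - cq} \\
q &= \frac{c\,(1 - k)}{16k - 15c - ck}
\end{align}
```

The last line shows why a reliable value of c is indispensable: q is obtained from k only through the population frequency of consanguineous couples.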
The problem always encountered in this type of genetic epidemiology study is the difficulty of obtaining a reliable estimate of the frequency of consanguineous couples in the general population for any given degree of relationship. In Italy this estimate has been made possible up to 1964 by the availability of centralized archives of the Catholic Church, which kept records in Rome of all the dispensations (or permits to marry in Church) awarded by the Pope for consanguineous marriages celebrated in Italy during a period of almost 400 years. These demographic data were collected and organized by Cavalli-Sforza et al. (2004). The absence of an equivalent centralized archive of consanguinity in countries other than Italy has hindered the use of this epidemiological approach.
We therefore used a novel approach to estimate the gene frequency for rare autosomal recessive disorders using mutational records of patients. This approach is based on the possibility that the affected child can carry two copies of the same causative mutation (homozygosity) or, alternatively, two different causative mutations in the same gene (compound heterozygosity). Although in the former case the mutant alleles can be identical by descent (IBD) or by state (IBS), in the latter case the two mutations must have been inherited through two different ancestors even if the parents are consanguineous, which implies that the two mutant alleles are not IBD. The proportion of compound heterozygotes among children affected with a given autosomal recessive disorder depends not only on the frequency of consanguineous marriages but also on the relative frequency of the different pathogenic alleles. The theoretical basis of this approach has been discussed in a recent paper by ten Kate et al. (2010), starting from the relationship between the frequency of compound heterozygotes among the patients affected with a given autosomal recessive condition (P(CH)) and the frequency of the relevant disease allele in the general population (q). In their paper, ten Kate et al. introduced theoretical calculations to demonstrate how q is positively correlated with P(CH) and with the inbreeding coefficient (F) among the probands, thus generating an additional tool to infer the frequency of autosomal recessive diseases. In this paper we demonstrate the feasibility of the Homozygosity Index (HI) approach using mutation data from different Mediterranean countries where consanguinity rates are high.
Materials and Methods
Diseases and Patients
We used mutation datasets of individuals affected with two different autosomal recessive disorders, namely Familial Mediterranean Fever (FMF, OMIM #249100) and Phenylketonuria (PKU, OMIM #261600).
The symptoms and severity of the inflammation vary depending on the type and number of mutations in the MEFV gene (M694V being the most severe and penetrant) which can be found in the heterozygous, homozygous, or compound heterozygous state in different patients (Mattit et al., 2006; Moradian et al., 2010). Furthermore, it is not infrequent to find patients carrying three or more mutations and patients carrying complex alleles (Medlej-Hashim et al., 2005; Moradian et al., 2010). MEFV mutations in the homozygous or compound heterozygous state are likely to determine the most severe forms of FMF. However, since such genotypes are detected in only 41–76% of patients, it has been hypothesized that regulatory mutations may go undetected (Yilmaz et al., 2009) or, alternatively, that unknown modifier genes and/or environmental factors may affect the expression of the disease (Touitou, 2001).
Moreover, the FMF carrier rate can be as high as one in three in some ethnic groups (like Armenians), a finding which in turn raises the possibility of a selective heterozygote advantage (El Shanti et al., 2006).
PKU is one of the most common inborn errors of amino acid metabolism and by far (98%) the most frequent form of hyperphenylalaninemia, a group of diseases characterized by the persistent elevation of phenylalanine levels in tissues and biological fluids (Zare-Karizi et al., 2010). PKU is due to a deficiency of phenylalanine hydroxylase. This enzyme is encoded by the PAH gene, which consists of 13 exons (Zare-Karizi et al., 2010). So far, more than 500 different mutations have been identified and described in PAH, with various phenotypic consequences (Santos et al., 2010). Most of them are point mutations and microdeletions, usually localized to the coding region or the intron–exon boundaries of the gene, mainly in its 3’ region (Berchovich et al., 2008b). The number of different mutations in a given population is usually high, with a few prevalent mutations and a large number of private mutations (Berchovich et al., 2008a). This results in a high number of compound heterozygous affected individuals (Santos et al., 2010). Several studies indicate that the prevalence of the disorder and the spectrum of mutations differ among populations (Berchovich et al., 2008a, b; Zare-Karizi et al., 2010).
Collection of Mutation Data for Autosomal Recessive Disorders from Offspring of Consanguineous and Nonconsanguineous Parents
Mutational records were obtained from the following diagnostic laboratories: Medical Genetics Unit, St. Joseph University, Beirut, Lebanon; Genetics Department, Institute for Experimental Medicine, Istanbul University, Istanbul, Turkey (for FMF) and Metabolic Disease Unit, Edmond and Lily Safra Children's Hospital, Sheba Medical Center, Tel Hashomer, Israel (for PKU). Pedigree drawings were collected only for patients born to consanguineous parents, and only the offspring of first cousins were included in the consanguineous group for the purpose of this work. Closer consanguineous relationships were not observed in the sample, whereas groups of offspring of more distantly related parents and/or groups of small sample size were not included at all in the study (five FMF patients born to second cousins and 11 born to third cousins or more distantly related individuals in the Lebanese sample; five PKU patients born to first cousins, one born to second cousins and one born to individuals whose degree of kinship was not well defined in the Israeli Jews sample; four PKU patients born to second cousins, one born to third cousins and three patients born to parents whose degree of kinship was not well defined in the Israeli Arab sample). Written informed consent was available for every patient. Only homozygous or compound heterozygous genotypes for alleles previously reported as disease associated (Touitou, 2001; Medlej-Hashim et al., 2005; Mattit et al., 2006; Berchovich et al., 2008a, b) were considered in the allele count.
Estimation of q from the Proportion of Homozygous Genotypes
Assuming that: (1) we have a set of genotypes from individuals affected with an autosomal recessive disease; (2) every genotype is from an individual of the population with a known inbreeding coefficient F; (3) genotypes are only homozygous or compound heterozygous for the disease alleles of a single gene; (4) these alleles are identified only by a single disease-associated variant (i.e., different haplotypes of the same variant are not considered as different alleles); and (5) they strictly act in a recessive manner (no phenotypic effect on heterozygotes); then we can calculate the HI as the number of homozygotes (HOM) over the total of homozygous and compound heterozygous (CH) genotypes (i.e., the total number of patients):
$$\mathrm{HI} = \frac{\mathrm{HOM}}{\mathrm{HOM} + \mathrm{CH}}$$
The equation introduced by ten Kate et al. (2010) derives q from the inbreeding coefficient (F), the proportion of compound heterozygotes (P(CH)), and the proportion of compound heterozygotes among non-IBD genotypes (R(CH))
where R(HN) is the proportion of IBS genotypes among non-IBD genotypes and qi is the relative frequency of the ith disease allele (with ∑qi= 1).
It is evident that, in the studied population (i.e., under the above mentioned assumptions), HI = 1-P(CH).
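A sketch of how such an expression can be derived under assumptions (1)-(5) follows. The notation is illustrative and may differ from that of ten Kate et al. (2010). The starting point is that a child of parents with inbreeding coefficient F is affected through two IBD alleles with probability Fq and through two non-IBD alleles with probability (1 − F)q².

```latex
\begin{align}
P(\text{affected}) &= Fq + (1-F)\,q^{2} \\
R(HN) &= \sum_i q_i^{2}, \qquad R(CH) = 1 - R(HN) \\
P(CH) &= \frac{(1-F)\,q\,R(CH)}{F + (1-F)\,q} \\
q &= \frac{F\,P(CH)}{(1-F)\,\bigl[R(CH) - P(CH)\bigr]}
   = \frac{F\,(1-\mathrm{HI})}{(1-F)\,\bigl[\mathrm{HI} - R(HN)\bigr]}
\end{align}
```

The second form of the last line uses HI = 1 − P(CH). It makes explicit that q is defined only when HI exceeds R(HN), that is, when homozygosity among patients exceeds what the allele spectrum alone would produce.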
In addition to samples of patients born to related parents, it should be possible in theory to apply this formula also to samples of patients born to apparently unrelated parents, when F for that specific population is known.
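The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the closed form q = F(1 − HI) / [(1 − F)(HI − R(HN))], which follows from assumptions (1)-(5) but is a reconstruction rather than a quotation from the source. It also estimates the relative allele frequencies qi from the alleles carried by the patients themselves. The function names and the toy genotypes (alleles "A", "B", "C") are hypothetical.

```python
from collections import Counter


def homozygosity_index(genotypes):
    """Proportion of homozygous genotypes: HI = HOM / (HOM + CH)."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    hom = sum(1 for first, second in genotypes if first == second)
    return hom / len(genotypes)


def relative_allele_frequencies(genotypes):
    """Relative frequency q_i of each disease allele among all patient alleles."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def estimate_q(hi, inbreeding_f, r_hn):
    """Assumed closed form: q = F(1 - HI) / ((1 - F)(HI - R(HN)))."""
    denominator = (1 - inbreeding_f) * (hi - r_hn)
    if denominator <= 0:
        raise ValueError("HI must exceed R(HN) for q to be defined")
    return inbreeding_f * (1 - hi) / denominator


# Hypothetical sample: 25 patients born to first cousins (F = 1/16),
# with made-up allele labels "A", "B" and "C".
sample = [("A", "A")] * 20 + [("B", "B")] * 2 + [("C", "C")] + [("A", "B")] * 2
hi = homozygosity_index(sample)
freqs = relative_allele_frequencies(sample)
r_hn = sum(f ** 2 for f in freqs.values())
print(f"HI = {hi:.2f}, R(HN) = {r_hn:.4f}, q = {estimate_q(hi, 1 / 16, r_hn):.4f}")
```

For a sample of patients born to apparently unrelated parents, the same functions apply with the population value of F passed as `inbreeding_f`.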
To assess whether our estimates of q were reliable, we collected information about disease frequencies in the same populations from which the mutational records were obtained. Data on disease prevalence, disease allele frequencies, and the total disease allele frequency were obtained from traditional epidemiological and mutational reports (Ozen et al., 1998; Dinc et al., 2000; Mattit et al., 2006; Berchovich et al., 2008a) and from the collaborating diagnostic laboratories (see Table 1).
Table 1. Patients studied and relative disorders, countries of origin and centers in which they were tested, along with the sample sizes.

| Disorder | Country | Center | N. (relationship between parents) |
| --- | --- | --- | --- |
| FMF | Lebanon | St. Joseph University, Beirut | 34 (1C^a), 107 (UR^b) |
| FMF | Turkey | Institute of Experimental Medicine, Istanbul | |
| PKU | Israel | Metabolic Disease Unit, Sheba Medical Center, Tel Hashomer | 30 (UR^b), 8 (1C^a), 87 (UR^b) |

a: Born to first cousins.
b: Born to apparently unrelated parents. Among the PKU patients, the 87 UR were Israeli Jews and the rest Israeli Arabs.
Sample Size Requirements for Estimating q
We simulated a series of populations of 1000 genotypes with three pathogenic alleles. Different frequencies of the main pathogenic allele were set, with q1 ranging from 0.4 to 0.8. For every q1 value we built two populations: one with F = 0.0625 and one with F = 0.001. We therefore determined the genotype distribution among the 1000 affected individuals as a function of q, q1, q2, q3, HI and F (through Sewall Wright's F-statistics), by means of a random-based model built in an Excel® worksheet. Among the populations created, we chose those with realistic values of q (ranging between q = 0.002 and q = 0.1).
Then, using a custom Perl script, we randomly extracted from every set of 1000 genotypes 100 samples of n genotypes (with n = 10, 25, 35, 50, 75, 100). For each sample, we estimated q using Equation (10). To simplify calculations, we considered only q1 in the computation of q (i.e., we put R(HN) = q1²), given that this does not significantly affect the estimate of q (as explained in the Theoretical Model and Simulations section). We then calculated, for each sample size, the confidence interval of q with α = 0.05 (CI95%), thus producing an index of the accuracy of q estimates in the population (see Online Supplementary Material S1 for more details). As expected, CI95% generally shrinks as the sample size (n) increases, and q of the population (qpop) always falls within this range or is very close to it. Slight inconsistencies are probably due to random variation and outlier values that occasionally appear in the samplings.
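The resampling procedure described above can be sketched in Python as follows. This is not the Excel model or the Perl script used in the study, and several choices are assumptions of the sketch: the generating model (a patient carries IBD alleles with probability F / [F + (1 − F)q]), the closed form used for q with R(HN) = q1², the use of the sample's own major-allele frequency as q1, and the empirical 2.5th to 97.5th percentile range as the interval. All function names are hypothetical.

```python
import random


def simulate_patients(q, inbreeding_f, rel_freqs, size=1000, rng=None):
    """Draw patient genotypes; alleles are labelled 0, 1, 2, ... by index."""
    rng = rng or random.Random()
    alleles = list(range(len(rel_freqs)))
    # Probability that an affected child carries two IBD alleles (assumed model).
    p_ibd = inbreeding_f / (inbreeding_f + (1 - inbreeding_f) * q)
    patients = []
    for _ in range(size):
        first = rng.choices(alleles, weights=rel_freqs)[0]
        if rng.random() < p_ibd:
            second = first  # IBD: necessarily homozygous
        else:
            second = rng.choices(alleles, weights=rel_freqs)[0]
        patients.append((first, second))
    return patients


def estimate_q_from_sample(sample, inbreeding_f):
    """Estimate q with R(HN) approximated by q1 squared; None if undefined."""
    n = len(sample)
    hi = sum(1 for first, second in sample if first == second) / n
    counts = {}
    for genotype in sample:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    q1 = max(counts.values()) / (2 * n)  # relative frequency of the major allele
    denominator = (1 - inbreeding_f) * (hi - q1 ** 2)
    if denominator <= 0:
        return None
    return inbreeding_f * (1 - hi) / denominator


def sampling_interval(patients, inbreeding_f, n, draws=100, rng=None):
    """Empirical 2.5th-97.5th percentile range of q over repeated samples of size n."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(draws):
        estimate = estimate_q_from_sample(rng.sample(patients, n), inbreeding_f)
        if estimate is not None:
            estimates.append(estimate)
    if not estimates:
        raise ValueError("q was undefined in every sample")
    estimates.sort()
    low = estimates[int(0.025 * (len(estimates) - 1))]
    high = estimates[int(0.975 * (len(estimates) - 1))]
    return low, high


if __name__ == "__main__":
    rng = random.Random(1)
    patients = simulate_patients(0.01, 0.0625, [0.6, 0.3, 0.1], rng=rng)
    for n in (10, 25, 35, 50, 75, 100):
        low, high = sampling_interval(patients, 0.0625, n, rng=rng)
        print(f"n={n:3d}  q between {low:.4f} and {high:.4f}")
```

Samples in which HI does not exceed q1² give no estimate and are skipped here; this occurs mainly for very small n, which is one practical reason for requiring a minimum sample size.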
Theoretical Model and Simulations
We investigated the relationships among the different variables in the model through simulation. We observed a positive correlation between q and q1 (the relative allelic frequency of the major pathogenic allele). More specifically, q1 affects q more than the other pathogenic alleles, which are increasingly irrelevant as their relative frequencies decrease (Fig. 1). Indeed, if we replace R(HN) with q1² in Equation (10) (so that we consider only q1 in the computation of q), this does not significantly affect the estimate of q, as shown in Fig. 1. Moreover, q is inversely correlated with R(HN) and HI (see Fig. 2), which is in agreement with the general postulate that the rarer the disorder, the higher the frequency of homozygotes among affected individuals.
HI is also positively correlated with q1, and the strength of this relationship depends on F, which is itself directly correlated with HI. In other words, the higher the inbreeding coefficient between the parents of the probands, the higher the probability of a single mutation occurring in homozygosity. This is clear from Figure 3, which illustrates the range of values of HI (maximum/minimum HI) versus q1, in subjects born to consanguineous and nonconsanguineous couples. Figure 3 also suggests that, should we encounter a population with a strikingly prevalent pathogenic allele, the differences in HI between a hypothetical sample of probands born to first cousins and one of probands born to unrelated individuals would be very small.
HI increases with F (as expected from Sewall Wright's F-based estimate of the proportion of heterozygous individuals in a population) and is positively correlated with q1; therefore, for a fixed value of q and a given HI, q1 must increase as F of the sample decreases (see Fig. 4).
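These qualitative relationships can be checked numerically with a short forward model. The sketch below assumes that the expected proportion of compound heterozygotes among patients is P(CH) = (1 − F)q(1 − R(HN)) / [F + (1 − F)q], which follows from assumptions (1)-(5) but is not quoted from the source. The three-allele spectrum, in which the two minor alleles share 1 − q1 in a 2:1 ratio, is a hypothetical choice made only for illustration.

```python
def expected_hi(q, inbreeding_f, rel_freqs):
    """Expected HI among patients under the assumed model."""
    r_hn = sum(f * f for f in rel_freqs)  # IBS homozygotes among non-IBD genotypes
    non_ibd = (1 - inbreeding_f) * q
    p_ch = non_ibd * (1 - r_hn) / (inbreeding_f + non_ibd)
    return 1 - p_ch


def three_allele_spectrum(q1):
    """Hypothetical split: the two minor alleles share 1 - q1 in a 2:1 ratio."""
    rest = 1 - q1
    return [q1, rest * 2 / 3, rest / 3]


# HI as a function of q1 for first cousins (F = 0.0625) and for a
# nearly outbred sample (F = 0.001), at a fixed q = 0.01.
for q1 in (0.4, 0.5, 0.6, 0.7, 0.8):
    freqs = three_allele_spectrum(q1)
    hi_cousins = expected_hi(0.01, 0.0625, freqs)
    hi_unrelated = expected_hi(0.01, 0.001, freqs)
    print(f"q1={q1:.1f}  HI(F=0.0625)={hi_cousins:.3f}  HI(F=0.001)={hi_unrelated:.3f}")
```

Under these assumptions HI rises with both q1 and F and falls as q grows, and the gap between the two values of F narrows as q1 approaches 1.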
As summarized in Table 2, we tested six samples of patients for whom we knew individual genotypes and degree of relationship between parents.
Table 2. Comparison between total allele frequency (q)/prevalence (P) estimated by the present method (upper line) and those previously calculated by traditional methods (lower line of the last two columns).
The mutational spectra for each of the six samples, with the relative values of HI (genotype not shown) and all the mutational data available, are summarized in Fig. S1a–f.
Because of the difficulties in estimating the inbreeding coefficient for the populations taken into consideration, we had to set approximate values of F for the samples made up of probands born to unrelated subjects. In some cases (such as the Lebanese and Turkish samples) we used the values indicated by the collaborating diagnostic laboratories. Whenever possible we chose F values among those published and/or reported on the http://www.consang.net website or computed them as a mean of the data reported on that website (see Table S1). More specifically, we tried to select the most reliable values, with special regard to data relative to specific ethnic groups (namely the Israeli Arabs and Israeli Jews samples).
Sample Size Requirements for Estimating q
Our simulations indicate that for a sample of 25 patients showing three different alleles, fairly precise and reliable estimates of q can be obtained with a major allele frequency of 0.4 ≤ q1 ≤ 0.8 (Fig. S2a, b). We decided to study q variation within the range q1 = 0.4–0.8 for two main reasons. First, whatever the F, in real populations we usually expect to find more than three pathogenic alleles; the greater the number of alleles, the more balanced the ratio among allele frequencies will be, with the upper limit of q1 rarely exceeding 0.8. Second, with regard to the lower limit of the interval, hypothesizing q1 < 0.4 in a three-allele system with q1 increments of 0.1 becomes internally contradictory, because the most frequent of three alleles must have a relative frequency of at least 1/3.
This work is based on the estimate of disease-allele frequencies (and therefore of the prevalence of the corresponding autosomal recessive disorders) in the general population, relying only on mutational records. We believe that these records, if used extensively, will generate in the future a useful epidemiological picture of the frequency of autosomal recessive disorders in different populations. The estimate of HI is directly dependent on the relative frequencies of the pathogenic alleles and inversely correlates with the global frequency of these alleles in the general population (q). Therefore, HI can link together all the variables considered so far in the genetic epidemiology of autosomal recessive disorders, namely q, the relative allele frequencies q1, q2, ..., qn, and F.
This holds generally true also for the PKU Israeli Arabs samples, where the fact that the q/P estimate is higher in the sample of patients born to unrelated parents than among patients born to first cousins is not surprising, as the q/P value calculated through this approach always refers to the general population investigated and not to the specific sample. In other words, these two prevalences represent two point estimates of the same population parameter, not two different parameters characteristic of the samples. This also applies to the FMF Lebanon samples. However, for a given disorder in a given population, it is theoretically possible that a sample of patients born to unrelated parents gives a q/P greater than a sample of patients born to consanguineous ones: it would be sufficient that q1 (the relative frequency of the main pathogenic allele) is so large as to outweigh even a small F in the “unrelated” sample or, conversely, that q1 is so small as to counterbalance the effect of a high F in the “related” sample (as in this case, see Fig. S1d). Although this is very unusual in large samples, for very small samples like our Israeli Arab sample it can happen because qi fluctuation is notably affected by single genotypes.
The use of F for the group of apparently unrelated parents is the weak point of this approach, due to the unavailability of accurate F estimates for the populations examined. Although for the first-cousin samples we can rely on the estimate (F = 0.0625) based on pedigree reconstruction, despite evidence that this might be an underestimate (Woods et al., 2006), the estimate of F for the samples with apparently unrelated parents based on demographic data is not equally reliable. Several studies have tried to infer F in some populations using different experimental methods. The most reliable ones seem to be those based on the measurement and count of Runs of Homozygosity (ROH) with a given minimal length in the genome (Carothers et al., 2006; McQuillan et al., 2008; Polašek et al., 2010). Alternatively, a novel statistical method to estimate the length of ROH (and thus the inbreeding coefficient), relying on a maximum-likelihood approach based on a Hidden Markov Model, has been proposed (Leutenegger et al., 2003). Further research on the estimate of F is needed. Another possible limitation could be the relatively small number of probands that can be collected in each population, because autosomal recessive disorders are usually rare conditions and probands born to consanguineous parents are even rarer. It is interesting that our sample size analysis suggests that we need only 25 patients to ensure a reliable estimate of q. If we exclude simple heterozygotes (which are not taken into account in the model), our approach can give reliable estimates of disease allele frequencies for rare autosomal recessive disorders. In fact, a high HI will imply a less biased estimate of q, and therefore a smaller confidence interval for q estimates (ten Kate et al., 2010).
Such a high HI could be observed in some specific disorders like Cystic Fibrosis, especially in those countries where ΔF508 (the main pathogenic allele) reaches very high relative frequencies, entailing a high prevalence of homozygous individuals in the population. More importantly, a high HI can occur in samples of affected children of consanguineous parents, therefore making our model of great help in those countries where the frequency of consanguineous marriages is high. In practical terms, in every population or ethnic group, we can produce a ranking order of the prevalence of autosomal recessive disorders. This will have social and clinical relevance and will allow the establishment of priorities for genetic testing at the population level.
Given the very low values of disease allele frequency and disease prevalence in the general population, it is impossible to estimate q with great precision. Highly accurate estimates will become possible when all the sources of bias are under control. However, in light of the potential applications of this approach, we are at present more interested in building a ranking order of prevalence of autosomal recessive disorders than in estimating their prevalence in a very precise way.
In conclusion, we propose to collect mutation data from the offspring of consanguineous marriages (as well as of apparently unrelated parents) affected with autosomal recessive disorders from different molecular diagnostic laboratories in all the countries with high consanguinity rates. We will obtain in every population a ranking order according to the different HIs, whose values will be inversely related to the frequency of the disorder. This approach will have the advantage over traditional descriptive epidemiology studies of generating an estimate of the relative frequency of the different autosomal recessive disorders, free of the bias due to underdiagnosis. Moreover, this approach will not need the collection of very large samples or additional mutation data from the general population of unaffected individuals. From a decision-making point of view, this new combined approach of molecular and genetic epidemiology based on consanguinity should become useful to establish priorities for genetic screening and to assess the advisability of widespread genetic screening for certain autosomal recessive disorders with respect to others, especially in those countries where there is a high frequency of consanguineous marriages.
Therefore, it will have potentially useful applications in epidemiological studies of autosomal recessive disorders. Indeed, it will allow researchers to build, in each population, a ranking order of prevalence of several autosomal recessive disorders, relying only on data already available (i.e., the genotype distribution and the mutational records of samples of patients, along with their pedigree information). This means a saving of economic resources, which is a very important aspect in planning genetic screening programs at a population level in developing countries. As a consequence, it will be possible to concentrate resources directly on the prevention of those autosomal recessive disorders which are most frequent in a given country.
Finally, the approach based on mutation analysis in offspring of consanguineous parents can be integrated in the Locus Specific DataBases (LSDBs) which have been rapidly increasing in number during the last decade (Romeo, 2010; van Baal et al., 2010) and in new research projects and networks like the one recently proposed by a group of medical geneticists from different countries of the Mediterranean sea basin (Ozcelik et al., 2010). It is therefore advisable that mutational records report from now on the degree of relationship of parents (if consanguineous) besides the molecular characterization of the mutation present in each patient.
The authors wish to thank Prof. L.P. ten Kate for useful discussions, Prof. Dr. Ahmet Gul for sharing Turkish-FMF patient genotype information and Dr. S. Presciuttini for technical advice in the statistical analysis.