Accumulation of processed pseudogenes in low recombination regions
We analysed the correlation between the absolute density of processed pseudogenes and recombination rate in nonoverlapping 5-Mb windows. The results are shown in Table 1. The pairwise correlations show that (i) Gerstein pseudogene density is significantly negatively correlated with recombination rate, and (ii) Bork and Hoppsigen pseudogene densities are also negatively, but not significantly, correlated with recombination rate.
Table 1. Spearman correlations of absolute density of retrotransposons with recombination rate and gene density in 5-Mb windows.
| ||N||Retrotransposon vs. Recombination||Retrotransposon vs. Gene|
|L1||582||−0.342||<0.0001|| || || || || || |
|Alu||582||0.295||<0.0001|| || || || || || |
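The pairwise coefficients reported here are Spearman rank correlations computed across windows. As a minimal sketch of that statistic (not the authors' actual pipeline), the following pure-Python code ranks each variable, assigning tied values their mean rank, and then takes the Pearson correlation of the ranks. The function names and the toy window values are hypothetical and serve only as illustration.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (made-up values, one entry per window).
ppgene_density = [12, 9, 7, 7, 3, 1]             # pseudogenes per window
recomb_rate = [0.2, 0.5, 0.9, 1.1, 1.8, 2.4]     # cM per Mb
rs = spearman(ppgene_density, recomb_rate)
print(f"rs = {rs:.3f}")  # negative: density falls as recombination rises
```

Because only ranks enter the calculation, the coefficient is unchanged by any monotonic rescaling of either variable, which is convenient when densities are strongly skewed across windows.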
Chromatin structure, gene richness and recombination rate are all regarded as factors that influence the distribution of processed pseudogenes. Because processed pseudogene density, recombination rate, GC content and gene density all correlate with one another, the correlation between processed pseudogene density and recombination rate might be confounded by the other variables. Partial correlation analysis (Table 1), however, shows that processed pseudogene density remains negatively correlated with recombination rate when both GC content and gene density are held constant, suggesting that the association is mediated neither by GC content nor by gene density. Figure 1 also shows that processed pseudogenes, particularly the Gerstein and Bork pseudogenes, are biased towards regions of low recombination rate. We also found that processed pseudogene density is positively correlated with gene density (Table 1), in contrast with a model of selection against the insertion of processed pseudogenes into gene-dense regions. The effect of such selection is detectable only in the regions of highest gene density in Fig. 2 and cannot account for the general increase of processed pseudogene density with gene density. Repeat-associated ectopic recombination, which occurs relatively more frequently in gene-poor regions (Abrusan & Krambeck, 2006), might be responsible for the deficiency of processed pseudogenes in gene-poor regions. It has been shown that the shift of Alu and L1 repeat distributions towards gene-rich regions during evolution might be caused by ectopic recombination (Abrusan & Krambeck, 2006).
Specifically, ectopic recombination between repeats occurs at a higher frequency in gene-poor regions, where it is less deleterious than in gene-dense regions; the repeats are therefore deleted at a higher rate in gene-poor regions, leaving a higher abundance in gene-dense regions. For the same reason, processed pseudogenes might be deleted by repeat-associated ectopic recombination at a higher rate in gene-poor regions than in gene-dense regions, leading to the accumulation of processed pseudogenes in gene-dense regions as a passive consequence of their elimination from gene-poor regions.
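The partial correlation analysis mentioned above asks whether two variables remain correlated once the others are held constant. A minimal sketch follows, assuming the standard recursive formula for partial correlation; whether the original analysis used this formula or a statistics package is not stated, and for a Spearman-type partial correlation the inputs would first be converted to ranks. All names and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y with every variable in `controls` held constant.

    Uses the recursion
        r_xy.zR = (r_xy.R - r_xz.R * r_yz.R)
                  / sqrt((1 - r_xz.R**2) * (1 - r_yz.R**2))
    where z is the last control and R the remaining controls.
    """
    if not controls:
        return pearson(x, y)
    z, rest = controls[-1], controls[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data (made-up): x and y are each the shared driver z plus independent noise.
z = [1, 1, -1, -1, 1, 1, -1, -1]
noise_x = [1, -1, 1, -1, 1, -1, 1, -1]
noise_y = [1, 1, 1, 1, -1, -1, -1, -1]
x = [a + b for a, b in zip(z, noise_x)]
y = [a + b for a, b in zip(z, noise_y)]

print(pearson(x, y))            # x and y are correlated through z
print(partial_corr(x, y, [z]))  # the correlation vanishes once z is controlled
```

In the toy data the pairwise correlation is produced entirely by the shared driver, so controlling for it removes the association. The result reported in the text is the opposite situation: the negative correlation with recombination rate persists after GC content and gene density are controlled.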
Figure 1. The relative processed pseudogene density in regions of different recombination rates, calculated over 5-Mb windows across the human genome. Recombination rate bins with 0.5 cM Mb−1 intervals are used. See ‘Materials and methods’ for details of the calculation of the relative processed pseudogene density.
Figure 2. The relative processed pseudogene density in regions of different gene density, calculated over 2-Mb windows across the human genome. Gene density bins with intervals of 5 genes Mb−1 are used. See ‘Materials and methods’ for details of the calculation of the relative processed pseudogene density.
Although a limited number of fine-scale recombination rates are available for the human genome (Crawford et al., 2004), the current genome-wide recombination map is at megabase-scale resolution (Kong et al., 2002). The results of the correlation analysis given in Table 1 are based on 5-Mb windows. To test the effect of window size, we also analysed the correlations based on 3-Mb and 10-Mb windows using the Gerstein dataset. The results are similar to those based on 5-Mb windows (see Tables 1 and 2), indicating that window size at the megabase scale has no bearing on our main conclusions. However, the correlations are stronger when the window size is larger, which may suggest a cumulative effect of processed pseudogenes in large-scale regions with conserved recombination rate patterns (Serre et al., 2005).
Table 2. Spearman correlations of Gerstein processed pseudogene density with recombination rate and gene density for different window sizes.
|Variable 1||Variable 2||Window size||N||Pairwise||P-value||Partial||P-value|
|ppgene density||Gene density||3-Mb||959||0.298||<0.0001||0.333||<0.0001|
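The window-size comparison above amounts to re-counting the same pseudogene positions in nonoverlapping windows of different widths. The sketch below illustrates this for a single hypothetical chromosome. The function name, the 0-based coordinates and the inclusion of a shorter final window are assumptions, since the handling of chromosome ends is not described in this section.

```python
def window_counts(positions, chrom_length, window_size):
    """Count features in nonoverlapping windows along one chromosome.

    positions    : 0-based start coordinates, each smaller than chrom_length
    chrom_length : chromosome length in bp
    window_size  : window width in bp
    The final window may be shorter than window_size (an assumption).
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window_size] += 1
    return counts

# Toy data (made-up coordinates on a hypothetical 15-Mb chromosome).
positions = [1_200_000, 4_900_000, 5_100_000, 9_999_999, 14_000_000]
chrom_length = 15_000_000

counts_3mb = window_counts(positions, chrom_length, 3_000_000)
counts_5mb = window_counts(positions, chrom_length, 5_000_000)
counts_10mb = window_counts(positions, chrom_length, 10_000_000)

# Absolute density per Mb for the 5-Mb windows.
density_5mb = [c / 5 for c in counts_5mb]
print(counts_3mb, counts_5mb, counts_10mb, density_5mb)
```

Larger windows pool more pseudogenes per data point, which reduces sampling noise in each density estimate. This is one possible contributor to stronger correlations at larger window sizes, alongside the biological explanation given in the text.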
In recent years, recombination has been reported to play an important role in the distribution of transposons along chromosomes in a variety of organisms (Duret et al., 2000; Bartolome et al., 2002; Rizzon et al., 2002; Petrov et al., 2003; Wright et al., 2003; Song & Boissinot, 2007). The association between the distribution of transposons and recombination depends on both the type of transposon (DNA or mRNA transposon) and the organism, and can be affected by other factors such as gene density and the insertion preference of the transposition process. To account for a true correlation between transposon distribution and recombination rate, two models, the ectopic recombination model and the deleterious insertion model, have been discussed (Bartolome et al., 2002; Hua-Van et al., 2005). Below we discuss several explanatory models, including these two, for the negative correlation between processed pseudogene density and recombination rate.
Firstly, processed pseudogenes may act as suppressors of recombination by reducing the homology between homologous chromosomes. It has been shown that a decrease in the similarity between two homologous sequences caused by indels or SNPs can reduce the homologous recombination rate (Brenner et al., 1985; Balakirev & Ayala, 2003). For a similar reason, the integration of processed pseudogenes into chromosomes, and the subsequent mutations in those pseudogenes, could decrease the similarity of the homologous regions and thereby reduce the homologous recombination rate. As an evolutionary consequence, the recombination rate would be reduced in regions where processed pseudogenes have accumulated. The suppressor model seems plausible at first glance. However, we do not think it explains our observation. Both L1 and Alu retrotransposons share similar transposition machinery with processed pseudogenes (Maestre et al., 1995; IHGSC, 2001), and the majority of them are also nonfunctional, disabled copies characterized by a high decay rate. Hence, if the effect of processed pseudogenes on recombination rate were strong enough to be detected over evolutionary history, a similar effect should also be detected for L1 and Alu. However, no such effect was observed for Alu (Table 1).
Secondly, similar to the effect described in the weak selection model proposed for the negative correlation between intron length and recombination rate (Comeron & Kreitman, 2000), the insertion of processed pseudogenes into regions of low recombination rate might be favoured by selection because it decreases the Hill–Robertson interference between weakly selected mutations, which allows the adjacent genes or exons to evolve more efficiently. The weak selection hypothesis could in theory explain the negative correlation between recombination rate and processed pseudogene density. However, we speculate that such positive selection is neither strong nor ubiquitous enough in the genome to explain the observation, because the majority of new insertions of processed pseudogenes are more likely to be either neutral or detrimental.
Thirdly, the deleterious insertion model posits selection against the deleterious insertion of processed pseudogenes into the genome. It predicts an accumulation of processed pseudogenes in regions of reduced recombination, where the efficiency of selection is decreased by Hill–Robertson interference (Hill & Robertson, 1966). Our data show a positive correlation between processed pseudogene density and gene density, so it should be borne in mind that the selection described in the deleterious insertion model does not correspond to selection against the insertion of processed pseudogenes into gene-dense regions.
Finally, ectopic recombination between transposons is known to be detrimental because it causes chromosomal rearrangements or deletions of functional elements. If its frequency is proportional to the meiotic recombination rate (Langley et al., 1988), insertions of transposons into regions of high recombination rate would have more deleterious effects (Charlesworth et al., 1994; Bartolome et al., 2002). Therefore, it is possible that the preferential distribution of processed pseudogenes in regions of low recombination rate is caused by relatively strong selection against ectopic recombination between homologous processed pseudogenes in regions of high recombination rate. In summary, we favour the negative selection described in the deleterious insertion model and the ectopic recombination model as the explanation for the accumulation of processed pseudogenes in regions of low recombination rate.
It should be noted that the results from the different pseudogene databases do not agree well with each other (Table 1). Each of the models discussed above, regardless of the mechanism involved, predicts that processed pseudogenes would be more abundant in regions of reduced recombination. Why, then, is the correlation between recombination rate and pseudogene density stronger for the Gerstein dataset and weaker for the Bork and Hoppsigen datasets (see Table 1)? The Hoppsigen dataset contains a relatively small number of pseudogenes, although these are of high quality (Khelifi et al., 2005). The weak correlation for the Hoppsigen dataset might therefore be caused by the small sample size. The Bork dataset is the largest, but its correlation is also weak. This might be caused by the relatively low quality of the pseudogenes in the dataset, because the method used to identify the Bork pseudogenes is less conservative (Torrents et al., 2003; Khelifi et al., 2005), which might result in more contamination by nonpseudogene sequences.
Test of the ectopic recombination model
In the previous section, we discussed several models for the accumulation of processed pseudogenes in regions of reduced recombination, but did not strictly test whether each of them is responsible for the observation. In general, it is hard to distinguish between models that lead to the same expectation. Here, we test the ectopic recombination model. Of the three processed pseudogene datasets, the Gerstein dataset shows the most significant relationship between processed pseudogene distribution and recombination rate. The Gerstein dataset is therefore used in this analysis, so that any biological effect is more readily observable.
The ectopic recombination model is based on selection against ectopic recombination, which has deleterious effects, between highly similar nonallelic sequences. That is to say, selection against ectopic recombination is absent if there is no risk of ectopic recombination between homologous processed pseudogenes. Ectopic recombination between two genomic sequences depends on two important, though not exclusive, conditions: high similarity and a short distance between the two sequences, both of which are located on the same (or homologous) chromosome(s). Ectopic recombination is impossible (or extremely rare) between two genomic sequences that are far apart on the same chromosome or located on different chromosomes. Therefore, processed pseudogenes that are at risk of stimulating ectopic recombination can be separated from those that are not in order to test the model. The test is based on the hypothesis that the processed pseudogenes at risk of stimulating ectopic recombination (denoted at-risk ER ppgenes) should accumulate in regions of low recombination rate more preferentially than those not at risk (denoted nonrisk ER ppgenes).
To test the ectopic recombination model, we classified the processed pseudogenes, excluding those located on chromosome Y, into two groups, at-risk ER ppgenes and nonrisk ER ppgenes, and compared the distribution patterns of the two groups in regions of different recombination rates. See ‘Materials and Methods’ for the definitions of the at-risk ER ppgenes and nonrisk ER ppgenes. The classification yields 2856 at-risk ER ppgenes and 8508 nonrisk ER ppgenes. The distribution patterns of the two groups in regions of different recombination rates are illustrated in Fig. 3a. As shown in the figure, the at-risk ER ppgenes preferentially accumulate in regions of low recombination rate (0.0–0.4 cM Mb−1), whereas the nonrisk ER ppgenes seem to have no observable distribution bias. The latter observation, however, is merely an artefact of the axis scale of the figure, because Fig. 3b shows that the nonrisk ER ppgenes also exhibit some distribution bias towards regions of low recombination rate. The correlation between nonrisk ER ppgene density and recombination rate (Spearman correlation rs = −0.20, P < 0.0001, N = 582) confirms this. Compared with the nonrisk ER ppgenes, the at-risk ER ppgenes show a much stronger distribution bias towards regions of reduced recombination rate, and a Wilcoxon test shows that recombination rates at the at-risk ER ppgene loci are significantly lower than those at the nonrisk ER ppgene loci (Z = −14.2, P < 0.0001), suggesting that selection against ectopic recombination affects the distribution of at-risk ER ppgenes. Of course, the selective effect associated with ectopic recombination is not the only one at work, because the nonrisk ER ppgenes also show a distribution bias towards regions of reduced recombination rate, although it is not as strong as that of the at-risk ER ppgenes.
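The two-group comparison above rests on a Wilcoxon rank-sum Z statistic. A minimal sketch of how such a statistic can be computed is given below, assuming the large-sample normal approximation with no tie or continuity correction; the exact variant behind the reported Z value is not stated here. The function name and the toy recombination rates are hypothetical.

```python
from math import sqrt

def rank_sum_z(sample_a, sample_b):
    """Wilcoxon rank-sum Z for sample_a versus sample_b.

    Normal approximation without tie or continuity correction
    (an assumption). A negative Z means sample_a tends to hold
    lower values than sample_b.
    """
    # Pool the samples, remembering which group each value came from.
    pooled = sorted(
        [(v, "a") for v in sample_a] + [(v, "b") for v in sample_b],
        key=lambda item: item[0],
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # tied values share their mean rank
        rank_sum_a += mean_rank * sum(
            1 for k in range(i, j + 1) if pooled[k][1] == "a"
        )
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    expected = n_a * (n_a + n_b + 1) / 2
    variance = n_a * n_b * (n_a + n_b + 1) / 12
    return (rank_sum_a - expected) / sqrt(variance)

# Toy data (made-up recombination rates in cM per Mb at pseudogene loci).
at_risk_rates = [0.1, 0.2, 0.3]
nonrisk_rates = [1.0, 1.1, 1.2, 1.3]
z = rank_sum_z(at_risk_rates, nonrisk_rates)
print(f"Z = {z:.3f}")  # negative: at-risk loci have lower rates
```

Swapping the two samples flips the sign of Z, so the direction of the reported effect depends on which group is listed first.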
Figure 3. The relative densities of the at-risk ER ppgenes and nonrisk ER ppgenes in regions of different recombination rates, calculated over 5-Mb windows across the human genome. Recombination rate bins with 0.2 cM Mb−1 intervals are used. Panel b is an enlargement of the distribution pattern of the nonrisk ER ppgenes in panel a, given only to illustrate the distribution bias clearly.
We propose that the risk of ectopic recombination between homologous processed pseudogenes is inversely correlated with the distance between them on the same chromosome, and that closely located homologous processed pseudogenes would therefore preferentially accumulate in regions of low recombination to avoid the deleterious effects of ectopic recombination. For the same reason, recombination rate is expected to be positively correlated with the distance between homologous processed pseudogenes. To test this hypothesis, we investigated the relation between the recombination rate at each processed pseudogene and the distance between that pseudogene and its most closely located counterpart. The result shows that recombination rate is significantly positively correlated with the distance between homologous processed pseudogenes (rs = 0.137, P < 0.0001, N = 6262), which is consistent with the expectation.
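The distance to the most closely located counterpart can be obtained by grouping pseudogenes and scanning each group in positional order. The sketch below assumes that two processed pseudogenes count as homologous when they derive from the same parent gene and lie on the same chromosome. That criterion, the record layout and all names are illustrative assumptions; the authors' actual definition is given in their ‘Materials and Methods’.

```python
from collections import defaultdict

def nearest_homolog_distance(ppgenes):
    """Map each pseudogene id to the distance (bp) to its nearest counterpart.

    ppgenes : iterable of (ppgene_id, parent_gene, chrom, position) tuples.
    A counterpart is another pseudogene with the same parent gene on the
    same chromosome (an illustrative assumption). Pseudogenes without any
    counterpart are omitted from the result.
    """
    groups = defaultdict(list)
    for ppgene_id, parent_gene, chrom, position in ppgenes:
        groups[(parent_gene, chrom)].append((position, ppgene_id))

    distances = {}
    for members in groups.values():
        if len(members) < 2:
            continue  # no counterpart on this chromosome
        members.sort()
        for i, (position, ppgene_id) in enumerate(members):
            gaps = []
            if i > 0:
                gaps.append(position - members[i - 1][0])
            if i + 1 < len(members):
                gaps.append(members[i + 1][0] - position)
            # In sorted order the nearest counterpart is always a neighbour.
            distances[ppgene_id] = min(gaps)
    return distances

# Toy records (made-up identifiers and coordinates).
toy_ppgenes = [
    ("pp1", "geneA", "chr1", 1_000_000),
    ("pp2", "geneA", "chr1", 1_400_000),
    ("pp3", "geneA", "chr1", 9_000_000),
    ("pp4", "geneA", "chr2", 500_000),    # only geneA copy on chr2
    ("pp5", "geneB", "chr1", 1_100_000),  # only geneB copy on chr1
]
distances = nearest_homolog_distance(toy_ppgenes)
print(distances)
```

Pairing each distance with the recombination rate at the corresponding locus and computing a rank correlation would then reproduce the form of the test described above. Pseudogenes with no counterpart on the same chromosome drop out, which is one plausible reason why N here is smaller than the full dataset.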
We also analysed the correlation between the average length of processed pseudogenes and recombination rate in 5-Mb windows for at-risk ER ppgenes and nonrisk ER ppgenes, respectively. The results show that the average length of processed pseudogenes is significantly negatively correlated with recombination rate for both the at-risk and nonrisk ER ppgenes (rs = −0.110, P = 0.025, N = 440; rs = −0.214, P < 0.0001, N = 582). Under the ectopic recombination model, selection against the deleterious effects of ectopic recombination should affect longer elements more strongly than shorter ones, as they represent longer targets for homologous pairing (Dray & Gloor, 1997). Therefore, transposon length is expected to be negatively correlated with recombination rate (Petrov et al., 2003). The negative correlation between the length of at-risk ER ppgenes and recombination rate is consistent with this expectation. However, the length effect is not exclusive to ectopic recombination, because a similar effect is observed for the nonrisk ER ppgenes as well. Under the deleterious insertion model, the insertion of longer processed pseudogenes into chromosomes might be more deleterious, leading to the accumulation of long processed pseudogenes in regions of low recombination rate, where the efficiency of selection is decreased by Hill–Robertson interference.
In conclusion, the present study found a negative correlation between processed pseudogene density and recombination rate, which is consistent with selection against ectopic recombination between closely located homologous processed pseudogenes. We believe that selection against illegitimate recombination is a universal, but not the only, mechanism modulating the distribution of repeated DNA fragments such as transposons (Hua-Van et al., 2005; Song & Boissinot, 2007), pseudogenes and palindromes (Waldman et al., 1999), because closely located repeats can cause genetic instability via illegitimate recombination. We therefore hope that this study offers deeper insights into genome evolution and has implications for genetic diseases and unstable transgenes.