Interspersed repeats in the horse (Equus caballus); spatial correlations highlight conserved chromosomal domains

Authors


D. L. Adelson, School of Molecular and Biomedical Science, University of Adelaide, North Terrace, Adelaide, South Australia, Australia 5005
E-mail: david.adelson@adelaide.edu.au

Summary

The interspersed repeat content of mammalian genomes has been best characterized in human, mouse and cow. In this study, we carried out de novo identification of repeated elements in the equine genome and identified previously unknown elements present at low copy number. The equine genome contains typical eutherian mammal repeats, but also has a significant number of hybrid repeats in addition to clade-specific Long Interspersed Nuclear Elements (LINE). Equus caballus clade specific LINE 1 (L1) repeats can be classified into approximately five subfamilies, three of which have undergone significant expansion. There are 1115 full-length copies of these equine L1, but of the 103 presumptive active copies, 93 fall within a single subfamily, indicating a rapid recent expansion of this subfamily. We also analysed both interspersed and simple sequence repeats (SSR) genome-wide, finding that some repeat classes are spatially correlated with each other as well as with G+C content and gene density. Based on these spatial correlations, we have confirmed that recently-described ancestral vs. clade-specific genome territories can be defined by their repeat content. The clade-specific Short Interspersed Nuclear Element correlations were scattered over the genome and appear to have been extensively remodelled. In contrast, territories enriched for ancestral repeats tended to be contiguous domains. To determine if the latter territories were evolutionarily conserved, we compared these results with a similar analysis of the human genome, and observed similar ancestral repeat enriched domains. These results indicate that ancestral, evolutionarily conserved mammalian genome territories can be identified on the basis of repeat content alone. Interspersed repeats of different ages appear to be analogous to geologic strata, allowing identification of ancient vs. newly remodelled regions of mammalian genomes.

Introduction

Mammals vary widely in their appearance and physiology, yet are very similar based on comparisons of their genes. The core mammalian genome consists of approximately 20 000 protein-coding genes, with the vast majority conserved across species (Lander et al. 2001; Venter et al. 2001; Metzker et al. 2004; Lindblad-Toh et al. 2005). However, these protein-coding genes account for only about 1.5% of a typical mammalian genome. The rest of the genome is non-protein coding and, for the most part, is not transcribed (van Bakel et al. 2010). While there is still debate on how much of the genome is in fact transcribed, almost half of a typical mammalian genome is repetitive and was dubbed by some as ‘junk DNA’, much of which is derived from self-propagating mobile elements and retroviruses (Jurka et al. 2007). Interspersed repeats are the largest class of sequences in mammals, accounting for 40–50% of the total length of these genomes (Smit 1996; Lander et al. 2001; Waterston et al. 2002; Jurka et al. 2007). The most common interspersed repeats are derived from retro-transposons, also known as retroposons or retro-transposable elements (RTs), which replicate and jump throughout the genome in a manner similar to retroviruses (Smit 1996). While many RTs are common to all mammals and are thus presumably of ancestral origin (Smit & Riggs 1995), every species/clade seems to have one or more unique kinds of Short Interspersed Nuclear Element (SINE), which contribute heavily to species-specific genome sequences (Jurka et al. 2007). While many RTs are no longer active, species- and lineage-specific repeats serve to remodel genomes by interrupting and often outnumbering ancestral repeats during their phase of rapid transposition/expansion (Deininger et al. 2003; Kazazian 2004; Giordano et al. 2007).
Active RTs are believed to be responsible for 10% of mutations in rodents (Kazazian 1998), while less active RTs in humans appear to account for a small fraction of new mutations (Deininger & Batzer 1999). The accumulation of RTs within or near genes has been studied (Birney et al. 2007), and there is evidence that insertions within or near promoters can alter gene expression, while insertions into exons are often incorporated into existing protein-coding genes (Krull et al. 2007). Recently it has become clear that evolution has also made use of these repetitive sequences to wire new regulatory circuits (Mikkelsen et al. 2007). This has resulted from the incorporation of RTs into promoters, miRNA precursors and coding exons (Babushok et al. 2007; Gentles et al. 2007). Species-specific RTs can also contain regulatory elements such as the p53 tumour suppressor binding motif, and thus influence transcriptional regulatory networks genome-wide (Wang et al. 2007). Thus, while the protein-coding genetic complement of mammals is >80% orthologous or homologous (Elsik et al. 2009), the remainder of these genomes is both highly repetitive and variable. It is therefore apparent that RTs are major drivers of genome evolution.

In mammals, LINE L1 repeats are the dominant RT type both in the common ancestor and in extant species (Lander et al. 2001; Waterston et al. 2002; Lindblad-Toh et al. 2005). Few mammals have active non-long terminal repeat (LTR) LINEs other than L1 that contribute significantly to repeat composition, with the exception of the LINE RTE (BovB) repeats in ruminants (Adelson et al. 2009) and marsupials (Gentles et al. 2007).

Short Interspersed Nuclear Elements require LINEs for their transposition. In primates, LINE L1 repeats encode the machinery to transpose SINE Alu repeats (Dewannieux et al. 2003). Ancestral L2 LINEs, which became incapable of retrotransposition prior to the divergence of eutheria, are believed to have encoded the machinery to transpose SINE MIR (Jurka et al. 2007).

In this report, we describe de novo repeat identification, the generation of repeat consensus sequences and an analysis of the overall repeat content of the equine genome. We also show that there is evidence for spatial accumulation/segregation of repeats based on pairwise correlations of repeat abundance. These results confirm the existence of ancestral genomic territories on the basis of repeat content.

Materials and methods

De novo repeat identification and annotation

Equine genome assembly v.2.0 was used for repeat identification, as described previously (Adelson et al. 2009; Wade et al. 2009).

Identification of, and tree construction for, intact LINE elements

Coordinates for the intact L1 were retrieved from the equine genome assembly using PALS (Edgar & Myers 2005), with a minimum length of 90% of the query sequence, and a minimum of 70% identity. Sequences were globally aligned using MUSCLE (Edgar 2004) and the alignments used to create maximum likelihood trees using RAxML (Stamatakis 2006) with the GTRCAT substitution model, and an initial 500 bootstraps followed by a thorough maximum likelihood search. Putative active L1 were identified based on conserved ORF1 and ORF2 sequences.

Correlation analysis

The correlation analysis followed the procedure described previously (Adelson et al. 2009). The SINEs ERE1 and ERE2 were amalgamated into a single group (ERE1_2) for the purpose of this analysis. The genome was partitioned into bins, the repeats in each repeat group were counted in each bin, and Spearman’s rank correlations were calculated for all pairs of repeat groups, and between the repeat groups and segmental duplications, gene count and G+C content. In addition to the 1.5 Mbp bins used previously (Adelson et al. 2009), the analysis was repeated with bins of 20, 50, 100, 150 and 500 Kbp, and of 1 and 7.5 Mbp, to assess the impact of bin size on the analysis.
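The binning and rank correlation steps described above can be sketched as follows. This is a minimal illustration in pure Python, not the code used in the study: the function names (`bin_counts`, `average_ranks`, `spearman`) and the toy coordinates are hypothetical, and repeats are assigned to bins by start position, which is an assumption.

```python
def bin_counts(positions, chrom_length, bin_size):
    """Count repeat start positions in fixed-width bins along one chromosome.

    Assumes every position satisfies 0 <= position < chrom_length.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        counts[pos // bin_size] += 1
    return counts


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / (sxx * syy) ** 0.5


# Toy example: two repeat groups on a hypothetical 6 Mbp chromosome, 1.5 Mbp bins.
group_a = bin_counts([10, 20, 1_600_000, 3_100_000, 3_200_000, 4_600_000],
                     6_000_000, 1_500_000)
group_b = bin_counts([50, 1_700_000, 1_800_000, 3_300_000, 5_900_000],
                     6_000_000, 1_500_000)
rho = spearman(group_a, group_b)
```

Repeating the same calculation over a list of bin sizes gives the correlation-versus-scale profile for each pair of groups.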

Identification of extreme density bins and repeat content analysis

The bins were classified as having low, medium or high MIR/L2 density. The cut-off between groups was the 2-tail 10% significance level cut-off for the sum of the MIR and L2 ranks. For the ERE1/2, ERE3 and L1 repeat groups, the statistical package R (The R Foundation for Statistical Computing, 2009) was used to perform Wilcoxon rank sum tests with continuity correction between the high and low-density groups. The high-density MIR/L2 bins for human (hg18) were obtained as for the equine, except that the RepeatMasker library for human was used. These human bins were mapped to equine bins based on full genomic alignments of repeat masked sequence from the horse genome to the human hg18, using PatternHunter 10 (Ma et al. 2002). Following established methods (Waterston et al. 2002; Lindblad-Toh et al. 2005), we identified co-linear clusters of the identified synteny anchors, which were used to form larger syntenic segments in a hierarchical fashion. Segments that were larger than a given size in both genomes, and were comprised of at least four anchors at a given stage in the merging process, defined a resolution-dependent, pairwise synteny map between the two genomes.
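The low/medium/high classification at the start of this section can be sketched as follows. The text does not state how the random distribution of the rank sum was constructed, so this sketch assumes that, under the null, the MIR and L2 ranks of a bin are independent and each uniform on 1..n. The function names are hypothetical.

```python
def rank_sum_cutoffs(n_bins, tail=0.05):
    """Cut-offs (low, high) for the sum of two independent ranks.

    Assumption: each rank is uniform on 1..n_bins and the two are independent.
    `low` is the largest sum s with P(sum <= s) <= tail; `high` is its mirror
    image, because the distribution is symmetric about n_bins + 1.
    """
    total = n_bins * n_bins
    cumulative = 0
    low = 1  # below the smallest possible sum (2): no bin qualifies
    for s in range(2, 2 * n_bins + 1):
        # number of (rank1, rank2) pairs whose sum equals s
        ways = min(s - 1, 2 * n_bins + 1 - s)
        if (cumulative + ways) / total > tail:
            break
        cumulative += ways
        low = s
    high = 2 * n_bins + 2 - low
    return low, high


def classify_bin(mir_rank, l2_rank, low, high):
    """Label a bin by the sum of its MIR and L2 ranks."""
    rank_sum = mir_rank + l2_rank
    if rank_sum <= low:
        return "low"
    if rank_sum >= high:
        return "high"
    return "medium"


# Toy example with 10 bins and 5% in each tail (two-tailed 10%).
low, high = rank_sum_cutoffs(10, tail=0.05)
labels = [classify_bin(m, l, low, high) for m, l in [(1, 2), (5, 6), (10, 9)]]
```

When the two ranks are strongly positively correlated, many more bins fall beyond these cut-offs than the nominal 5% per tail.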

Results

Our ab initio repeat identification and annotations (Table 1) indicate that the equine genome has a comparable repeat content to other eutheria (∼47%), with LINE 1 (L1) RT the major class of interspersed repeats. The major SINE class in horse is the perissodactyl-specific ERE type (Gallagher et al. 1999). There are, however, many unclassified repeats that account for 11% of the equine genome. Most of these unclassified repeats are made up of a number of fragments of known types, and can best be described as possessing a chimeric or recombinant sequence. Only a minority of these unclassified sequences cannot be annotated using currently available RepeatMasker or RepBase repeat databases. The complexity of the chimeric repeats has proved to be resistant to a simple classification scheme, and will not be discussed further here.

Table 1.   Simple and interspersed repeats in the equine genome.
Group | Number | Total bp | Percent coverage of genome: Equus caballus | Bos taurus | Human | Mouse

Non-LTR retrotransposons
LINE L1 | 464 621 | 402 200 888 | 16.25112 | 11.26 | 17.07 | 19.14
LINE L2 | 234 304 | 71 040 977 | 2.87044 | 1.18 | 3.07 | 0.37
LINE CR1 | 24 886 | 6 321 427 | 0.25542 | 0.11 | 0.27 | 0.06
LINE RTE | 19 102 | 4 519 823 | 0.18263 | 10.74 | NA | 0.02
Subtotal | 742 913 | 484 083 115 | 19.55961 | 23.29 | 20.41 | 19.59

SINEs
SINE ERE1/2 | 296 962 | 70 493 153 | 2.84831 | NA | NA | NA
SINE MIR | 433 220 | 65 909 940 | 2.66312 | 1.39 | 2.43 | 0.55
SINE ERE3 | 159 847 | 35 928 664 | 1.45171 | NA | NA | NA
SINE Other | 6507 | 647 743 | 0.02617 | 14.27 | 10.68 | 6.78
SINE tRNA | 3356 | 535 733 | 0.02165 | 1.99 | NA | 0.00
Subtotal | 899 892 | 173 515 233 | 7.01096 | 17.65 | 13.11 | 7.33

ERVs
LTR MaLR | 189 978 | 66 335 259 | 2.68031 | 1.45 | 3.72 | 4.00
LTR ERVL | 97 995 | 45 135 591 | 1.82372 | 0.89 | 1.56 | 1.03
LTR ERV1 | 95 173 | 37 924 134 | 1.53234 | 0.81 | 3.00 | 0.80
LTR ERVK | 6250 | 1 698 832 | 0.06864 | 0.05 | 0.29 | 4.02
Subtotal | 389 396 | 151 093 816 | 6.10502 | 3.20 | 8.57 | 9.85

DNA transposons
DNA All | 330 947 | 77 908 269 | 3.14792 | 1.96 | 3.00 | 0.89

LTR Other
LTR other | 16 778 | 4 205 795 | 0.16994 | 0.42 | 0.00 | 0.01

Di-nucleotide SSR
Di AG | 491 252 | 4 615 663 | 0.18650 | 0.12 | 0.05 | 0.43
Di AC | 455 500 | 5 154 925 | 0.20829 | 0.23 | 0.14 | 0.76
Di AT | 317 721 | 3 249 245 | 0.13129 | 0.17 | 0.08 | 0.20
Di CG | 6671 | 60 736 | 0.00245 | 0.003 | 0.00 | 0.01
Subtotal | 1 271 144 | 13 080 569 | 0.52853 | 0.52 | 0.27 | 1.40

Tri-nucleotide SSR
Tri AAT | 239 773 | 2 478 174 | 0.10013 | 0.08 | 0.04 | 0.09
Tri AAG | 196 665 | 2 012 054 | 0.08130 | 0.06 | 0.01 | 0.12
Tri AGG | 182 780 | 1 830 576 | 0.07397 | 0.07 | 0.01 | 0.12
Tri AGC | 118 372 | 1 179 278 | 0.04765 | 0.13 | 0.00 | 0.06
Tri AAC | 113 989 | 1 174 230 | 0.04745 | 0.06 | 0.02 | 0.09
Tri ACC | 97 191 | 972 959 | 0.03931 | 0.04 | 0.01 | 0.06
Tri ATC | 85 399 | 890 424 | 0.03598 | 0.03 | 0.01 | 0.04
Tri ACT | 29 303 | 284 872 | 0.01151 | 0.01 | 0.00 | 0.01
Tri CCG | 12 825 | 135 075 | 0.00546 | 0.01 | 0.00 | 0.01
Tri ACG | 2560 | 24 792 | 0.00100 | 0.00 | 0.00 | 0.00
Subtotal | 1 078 857 | 10 982 434 | 0.44375 | 0.49 | 0.10 | 0.60

Tetra/penta-nucleotide SSR
Tetra.penta All | 2 560 707 | 30 290 224 | 1.22389 | 1.25 | 0.39 | 2.16

Unclassified/chimeric
Unclassified/chimeric |  | 278 805 770 | 11.26500 |  |  |

Interspersed repeat total | 2 379 926 | 1 169 611 998 | 47.25844 | 46.54 | 45.09 | 37.67
SSR total | 4 910 708 | 54 353 227 | 2.19617 | 2.27 | 0.76 | 4.16

  1. SINE, Short Interspersed Nuclear Elements.

We have characterized the intact L1s because they are the most common type of RT in the equine genome, and they are the only non-LTR LINEs that have the potential for autonomous retrotransposition. For L1s to have autonomous activity, they must be full length and encode two functional ORFs. We identified 1115 full-length L1s by aligning our improved L1 consensus sequences against the Equine v2 assembly and extracting all full-length matching sequences (≥90% and ≤110% of the consensus length) that were ≥70% identical. These intact L1s were used to construct a maximum likelihood tree (Fig. 1). The tree topology reflects the five known equine L1 subfamilies (Smit et al. 1996–2004; Jurka et al. 2005). Active L1s should have two ORFs encoding 40- and 150-kDa proteins (Goodier et al. 2007). Only 103 of our full-length L1s could be classified as potentially active on this basis, with 93 of those in the L1_1 subfamily and the remaining 10 in the L1_2 subfamily.
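The length and identity criteria for full-length L1 copies amount to a simple filter, sketched below. The `Hit` record, its field names and the 6000 bp consensus length are hypothetical placeholders for illustration; they do not reflect the output format of PALS or the actual equine consensus lengths.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of an L1 consensus to the assembly (hypothetical record)."""
    name: str
    aligned_length: int      # length of the genomic match, in bp
    percent_identity: float  # identity to the consensus, 0-100


def is_full_length(hit, consensus_length,
                   min_frac=0.90, max_frac=1.10, min_identity=70.0):
    """True if the hit spans 90-110% of the consensus at >= 70% identity."""
    ratio = hit.aligned_length / consensus_length
    return min_frac <= ratio <= max_frac and hit.percent_identity >= min_identity


# Toy example with a made-up 6000 bp consensus.
hits = [
    Hit("copy_a", 5900, 92.5),  # kept
    Hit("copy_b", 3100, 95.0),  # truncated copy, rejected on length
    Hit("copy_c", 6100, 64.0),  # too diverged, rejected on identity
    Hit("copy_d", 7000, 90.0),  # longer than 110% of consensus, rejected
]
full_length = [h.name for h in hits if is_full_length(h, 6000)]
```

Both length bounds must hold at once, which is why the criterion is a conjunction rather than an alternative.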

Figure 1.

 L1 phylogeny/active repeats. The maximum likelihood tree derived from the global alignment of all 1115 intact/full-length L1 sequences. Red lozenges indicate the 103 putative active L1s; 93 of these were L1_1 and 10 were L1_2.

To identify spatial correlations between repeats, and between repeats and other genome features such as gene models, segmental duplications and G+C content, we carried out a comprehensive, pairwise correlation analysis of repeat types as performed previously (Adelson et al. 2009). The results of this analysis are summarized in Fig. 2.

Figure 2.

 Correlation analysis of repeat groups. The lower-left panel illustrates the pairwise correlations among the repeat groups and between the repeat groups and segmental duplication (Seg_Dup), gene density, and G+C content (GC) based on 1.5 Mbp bins. The direction and significance of the correlations (95% 2-tailed test after Bonferroni correction) is indicated by the cell colors: Yellow (not significant), blue (positive and significant), and red/orange (negative and significant). The repeat groups are clustered based on all their correlations from the 1.5 Mbp bins. The upper-right panel illustrates the general trend in each of the pairwise correlations as the bin size increases from 20 Kbp to 7.5 Mbp.

The effects of bin size on the correlations are shown in the upper-right panel of Fig. 2 and summarized in Table 2. In general, the use of a 1 Mbp bin size gave the strongest correlations for most pairs, with little additional strengthening at larger bin sizes. Most pairs (61%) had correlations that strengthened with increasing bin size, indicating that those associations were not specific to a particular scale. However, 34% of the pairs showed a scale-dependent response (a maximum, a minimum, a cubic trend or a change of sign; Table 2), indicating that some associations were specific to certain scales/genomic distances. Only a few pairs showed their strongest correlations at a small scale (≤50 Kbp), but almost 10% of pairs had correlations that changed sign as a function of bin size. Some exemplars of these bin size effects are shown in Fig. 3. The scale dependence of some correlations suggests a potential biological effect, perhaps related to effects on gene regulation. Because most correlations were close to their maxima at 1.5 Mbp and had large enough sample sizes to keep the standard error manageable, we settled on this scale for the representative set of correlations shown in the lower-left panel of Fig. 2.
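The percentages quoted above can be recomputed from the counts in Table 2, as in the short check below. The grouping of response categories into "strengthening", "scale-dependent" and "sign-changing" is our reading of the text. The observation that 528 equals the number of unordered pairs among 33 items (for example 30 repeat groups plus segmental duplication, gene density and G+C content) is an inference, not something stated in the source.

```python
from math import comb

# Counts copied from Table 2 (response of each pairwise correlation to bin size).
TABLE2 = {
    "No change (0)": 19,
    "Strengthening (+)": 169,
    "Strengthening (-)": 157,
    "Has maximum (+)": 67,
    "Has maximum (-)": 61,
    "Has minimum (-)": 1,
    "Is cubic (+)": 5,
    "Is cubic (-)": 2,
    "Changes sign (- to +)": 33,
    "Changes sign (+ to -)": 13,
    "Changes sign back (- to + to -)": 1,
}


def fraction(prefixes):
    """Share of all pairs whose response label starts with one of the prefixes."""
    total = sum(TABLE2.values())
    hits = sum(n for label, n in TABLE2.items() if label.startswith(prefixes))
    return hits / total


total_pairs = sum(TABLE2.values())                                   # 528
strengthening = fraction(("Strengthening",))                         # 326 / 528
scale_dependent = fraction(("Has", "Is cubic", "Changes sign"))      # 183 / 528
sign_changes = fraction(("Changes sign",))                           # 47 / 528

print(f"pairs: {total_pairs} (C(33, 2) = {comb(33, 2)})")
print(f"strengthening:   {strengthening:.1%}")
print(f"scale-dependent: {scale_dependent:.1%}")
print(f"sign changes:    {sign_changes:.1%}")
```

The three shares come out at roughly 61.7%, 34.7% and 8.9%, in line with the rounded figures in the text.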

Table 2.   How repeat group correlations vary as a function of increasing bin size (bp scale).
Response | Number of correlations
No change (0) | 19
Strengthening (+) | 169
Strengthening (−) | 157
Has maximum (+) | 67
Has maximum (−) | 61
Has minimum (−) | 1
Is cubic (+) | 5
Is cubic (−) | 2
Changes sign (− to +) | 33
Changes sign (+ to −) | 13
Changes sign back (− to + to −) | 1
Total | 528

Figure 3.

 Exemplars of correlation vs. bin size patterns. Various correlation responses to changing bin size are illustrated.

Based on these results we were able to identify a small number of very strong correlations, some of which we had observed previously in cow and human (Adelson et al. 2009). Specifically, we wish to draw attention to a very strong correlation between the fossil RT LINE 2 (L2) and SINE MIR (r > 0.8, see Fig. 4), and that between clade-specific SINE ERE1/2 and ERE3 sub-families. We also wish to draw attention to a very strong correlation between L1 and gene models, and a weaker correlation between L1 and segmental duplications. The strong pair-wise correlations for L2 and MIR and ERE1/2 and ERE3 were reminiscent of the relationships we had previously uncovered (Adelson et al. 2009; Elsik et al. 2009) in cow and suggested to us that similar spatial associations might exist in horse. We were particularly interested in the L2/MIR correlation, as these are an ancestral LINE/SINE pair, and their strongly correlated ranks indicated that they were being conserved in the same genomic regions. Similarly, the relationship between the clade-specific ERE1/2 and ERE3 indicated that these repeats appeared to favor integration in similar regions.

Figure 4.

 L2/MIR rank correlation. Ranks of the ancestral Long Interspersed Nuclear Elements L2 and Short Interspersed Nuclear Element MIR counts in each 1.5 Mbp bin. Color coding indicates chromosome of origin of the bins. Lines in the upper right and lower left corners indicate the cut off for the high and low density bins respectively, and are based on the expected 5% tails from the random distribution of the sum of the ranks.

By plotting the ranks for L2 and MIR for every 1.5 Mbp bin (see Fig. 4), it is clear that not only is the correlation coefficient very high, but the distribution of ranks is consistent, symmetric and highly constrained. Because rank acts as a proxy for repeat count or density, we were able to select the highest and lowest repeat density bins for further study by partitioning the bins at the two-tailed 10% cut-off (5% in each tail) of the expected random distribution of the sum of the ranks. These partitions are shown by the lines in Fig. 4, with the extreme higher density L2/MIR bins (upper 5% tail) in the upper-right hand corner, and the extreme lower density L2/MIR bins (lower 5% tail) in the lower-left hand corner. Because each bin can be positioned on the chromosome scaffolds, it was then possible to determine the spatial distribution of the extreme density L2/MIR regions and to do the same for the extreme density clade-specific ERE1/2/ERE3 regions (Fig. 5).

Figure 5.

 Genome distribution of extreme density ancient and clade specific RTs. (a) Extreme high and low L2/MIR and ERE1/2/ERE3 bins plotted side by side on the chromosome scaffolds. (b) Ancestral (L2/MIR) high-density bins for equine and human are shown on the equine assembly, along with the overall alignment. Note that the ‘top’ of the chromosomes corresponds to the end near the X-axis on our plot. The Y-axis corresponds to nucleotide coordinates in mega base pairs (Mbp) from the equine assembly.

The distributions of extreme-density ancestral repeats (L2/MIR) and clade-specific repeats were clearly different: the high-density ancestral bins formed large contiguous regions, whereas the other extreme-density bins were much more evenly distributed. Furthermore, there was little overlap between the high-density ancestral regions and the high-density clade-specific bins. This was also evident from Table 3, where significantly fewer ERE3 RTs were found in high-density ancestral bins than in low-density ancestral bins. The same was true for L1, which is also of recent origin compared to L2 and MIR.
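The high versus low comparison reported in Table 3 used the Wilcoxon rank sum test with continuity correction in R. A pure-Python sketch of the same kind of test (two-sided, normal approximation with tie and continuity corrections) is given below for illustration. It is not the authors' code, the function names are hypothetical and the per-bin counts in the example are invented.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Two-sided rank sum test; returns (W, p) from the normal approximation."""
    nx, ny = len(x), len(y)
    n = nx + ny
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    # W: rank sum of the first sample minus its minimum possible value
    w = sum(ranks[:nx]) - nx * (nx + 1) / 2
    # variance of W, reduced for ties in the pooled data
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    sigma = math.sqrt(nx * ny / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return w, 1.0  # all observations identical
    z = w - nx * ny / 2
    if z != 0:
        z -= math.copysign(0.5, z)  # continuity correction towards zero
    p = math.erfc(abs(z / sigma) / math.sqrt(2))  # two-sided tail area
    return w, min(p, 1.0)


# Invented per-bin repeat counts for low- and high-density ancestral bins.
low_bins = [104, 99, 110, 97, 101, 95, 108]
high_bins = [84, 90, 79, 88, 86, 81, 92]
w_stat, p_value = wilcoxon_rank_sum(high_bins, low_bins)
```

With the real data, each sample would hold the ERE1/2, ERE3 or L1 count of every bin in the corresponding density class.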

Table 3.   Number of clade specific repeats in ancestral regions.
Measure | MIR/L2 density bins: Low | Medium | High | All | P-value*
Number of bins | 225 | 1125 | 223 | 1573 |
MIR counts | 26 267 | 286 297 | 114 778 | 427 342 |
MIR/bin | 116.74 | 254.49 | 514.7 | 271.67 |
L2 counts | 17 347 | 162 021 | 51 803 | 231 171 |
L2/bin | 77.1 | 144.02 | 232.3 | 146.96 |
Median ERE1/2 | 168 | 180 | 159 | 176 | NS
Median ERE3 | 99 | 101 | 84 | 98 | 8.7e-09
Median L1 | 304 | 279 | 256 | 280 | 3.9e-15

  1. *High vs. Low, Wilcoxon rank sum test with continuity correction.

In order to determine if these ancestral repeat-rich regions were evolutionarily conserved, we repeated our correlation analysis on the human genome assembly (hg18) and found that L2/MIR were highly correlated in human (r = 0.86) as well. We identified the high-density L2/MIR bins for the human genome and then converted their coordinates to equine genome locations using our horse/human whole-genome alignment. Figure 5b shows the equine and human ancestral domains and reveals that they have a large degree of overlap that is particularly striking for the larger contiguous sets of bins that define large territories.

Discussion

Our novel methodology for identifying, annotating and analysing repetitive DNA has yielded a number of interesting results. While we did not identify many novel repeats, we did find a large number of hybrid or chimeric repeats. Chimeric repeats have been reported before (Buzdin et al. 2003), but at a much lower frequency than we have observed. These repeats could represent novel, recombinant and evolving RT or they could be satellite sequences that have arisen from RT insertion (Kapitonov et al. 1998), or composite satellites amplified in a chromosome-specific manner. If they are composite satellites, we expect their distribution to be chromosome specific and peri-centromeric or sub-telomeric. While preliminary analyses indicate that some of these repeats are probably chromosome specific, it is too early to conclude that all of these chimeric repeats are composite satellites.

Our comprehensive spatial analysis of repeat groups has confirmed that ancient fossil L2 and MIR RTs are highly correlated, as they are in cow and human (Adelson et al. 2009; Elsik et al. 2009). In this report we extended our spatial correlation analysis to investigate the effect of bin size on correlation. While most correlations strengthened as a function of bin size, for some, the choice of bin size determined the sign and magnitude of the correlation. A specific example of this is the relationship between L1 and G+C content (Fig. 3). When the human and mouse genome sequences (Lander et al. 2001; Waterston et al. 2002) were analysed using 50 Kbp sequence windows, L1s were observed to be negatively correlated with G+C content. Our small-bin observations were consistent with these results, but at larger window/bin sizes the relationship changed. An arbitrary choice of window or bin size can therefore result in an incomplete understanding of spatial relationships. Comparing SINE repeat group correlations with G+C content in horse gave the opposite result to that reported in mouse and human, where SINEs were positively correlated with G+C content. Furthermore, our analysis of bovine repeat correlations (Adelson et al. 2009) showed that SINE families paired with LINE_RTE were negatively correlated with G+C content, while one that was probably paired with L1 was positively correlated with G+C content.

We also noted positive correlations of L1 and LTR ERV1 with segmental duplications. Others have also reported associations of RT with segmental duplication/copy number variation (Bailey et al. 2003; She et al. 2008), but only the latter report implicated L1 RT.

Our analysis also confirmed that fossil RT densities (L2 and MIR) define conserved, syntenic ancestral genome domains. Because L2 and MIR have been inactive since the mammalian radiation, the persistence of such domains can be explained by two alternative scenarios: (i) negative selection that preserved ancestral territories or (ii) protection from new retrotransposition events. There is evidence that many L2 and MIR RTs have undergone strong negative selection because they have been co-opted to regulate gene expression (Silva et al. 2003; Lowe et al. 2007). This suggests that the conserved ancestral repeat-enriched genome territories we have discovered here are the result of purifying selection or of chromatin structural constraints, and are therefore probably of functional significance.

Author contributions

DLA: designed research, performed research, analysed data, wrote paper; JMR: performed research, analysed data, wrote paper; MG: performed research; RCE: contributed analytic tools, wrote paper.

Conflicts of interest

The authors have not declared any potential conflicts.
