D. L. Adelson, School of Molecular and Biomedical Science, University of Adelaide, North Terrace, Adelaide, South Australia, Australia 5005 E-mail: firstname.lastname@example.org
The interspersed repeat content of mammalian genomes has been best characterized in human, mouse and cow. In this study, we carried out de novo identification of repeated elements in the equine genome and identified previously unknown elements present at low copy number. The equine genome contains typical eutherian mammal repeats, but also has a significant number of hybrid repeats in addition to clade-specific Long Interspersed Nuclear Elements (LINE). Equus caballus clade specific LINE 1 (L1) repeats can be classified into approximately five subfamilies, three of which have undergone significant expansion. There are 1115 full-length copies of these equine L1, but of the 103 presumptive active copies, 93 fall within a single subfamily, indicating a rapid recent expansion of this subfamily. We also analysed both interspersed and simple sequence repeats (SSR) genome-wide, finding that some repeat classes are spatially correlated with each other as well as with G+C content and gene density. Based on these spatial correlations, we have confirmed that recently-described ancestral vs. clade-specific genome territories can be defined by their repeat content. The clade-specific Short Interspersed Nuclear Element correlations were scattered over the genome and appear to have been extensively remodelled. In contrast, territories enriched for ancestral repeats tended to be contiguous domains. To determine if the latter territories were evolutionarily conserved, we compared these results with a similar analysis of the human genome, and observed similar ancestral repeat enriched domains. These results indicate that ancestral, evolutionarily conserved mammalian genome territories can be identified on the basis of repeat content alone. Interspersed repeats of different ages appear to be analogous to geologic strata, allowing identification of ancient vs. newly remodelled regions of mammalian genomes.
Mammals vary widely in their appearance and physiology, yet are very similar based on comparisons of their genes. The core mammalian genome consists of approximately 20 000 protein-coding genes, with the vast majority conserved across species (Lander et al. 2001; Venter et al. 2001; Metzker et al. 2004; Lindblad-Toh et al. 2005). However, these protein-coding genes account for only about 1.5% of a typical mammalian genome. The rest of the genome is non-protein coding and, for the most part, is not transcribed (van Bakel et al. 2010). While there is still debate on how much of the genome is in fact transcribed, almost half of a typical mammalian genome is repetitive and was dubbed by some as ‘junk DNA’, much of which is derived from self-propagating mobile elements and retroviruses (Jurka et al. 2007). Interspersed repeats are the largest class of sequences in mammals, accounting for 40–50% of the total length of these genomes (Smit 1996; Lander et al. 2001; Waterston et al. 2002; Jurka et al. 2007). The most common interspersed repeats are derived from retro-transposons, also known as retroposons or retro-transposable elements (RTs), which replicate and jump throughout the genome in a manner similar to retroviruses (Smit 1996). While many RTs are common to all mammals and are thus presumably of ancestral origin (Smit & Riggs 1995), every species/clade seems to have one or more unique kinds of Short Interspersed Nuclear Element (SINE) that contribute heavily to species-specific genome sequences (Jurka et al. 2007). While many RTs are no longer active, species- and lineage-specific repeats serve to remodel genomes by interrupting and often outnumbering ancestral repeats during their phase of rapid transposition/expansion (Deininger et al. 2003; Kazazian 2004; Giordano et al. 2007).
Active RTs are believed to be responsible for 10% of mutations in rodents (Kazazian 1998), while less active RTs in humans appear to account for a small fraction of new mutations (Deininger & Batzer 1999). The accumulation of RTs within or near genes has been studied (Birney et al. 2007), and there is evidence that insertions within or near promoters can alter gene expression, while insertions into exons are often incorporated into existing protein-coding genes (Krull et al. 2007). Recently it has become clear that evolution has also made use of these repetitive sequences to wire new regulatory circuits (Mikkelsen et al. 2007). This has resulted from the incorporation of RTs into promoters, miRNA precursors and coding exons (Babushok et al. 2007; Gentles et al. 2007). Species-specific RTs can also contain regulatory elements such as the P53 tumour suppressor binding motif, and thus influence transcriptional regulatory networks genome-wide (Wang et al. 2007). Therefore, while the protein-coding genetic complement of mammals is >80% orthologous or homologous (Elsik et al. 2009), the remainder of these genomes is both highly repetitive and variable. It is therefore apparent that RTs are major drivers of genome evolution.
Short Interspersed Nuclear Elements require LINEs for their transposition. In primates, LINE L1 repeats encode the machinery to transpose SINE Alu repeats (Dewannieux et al. 2003). Ancestral L2 LINEs, which became incapable of retrotransposition prior to the divergence of eutheria, are believed to have encoded the machinery to transpose SINE MIR (Jurka et al. 2007).
In this report, we describe de novo repeat identification, the generation of repeat consensus sequences and an analysis of the overall repeat content of the equine genome. We also show that there is evidence for spatial accumulation/segregation of repeats based on pairwise correlations of repeat abundance. These results confirm the existence of ancestral genomic territories on the basis of repeat content.
Identification of, and tree construction for, intact LINE elements
Coordinates for the intact L1 were retrieved from the equine genome assembly using PALS (Edgar & Myers 2005), with a minimum length of 90% of the query sequence, and a minimum of 70% identity. Sequences were globally aligned using MUSCLE (Edgar 2004) and the alignments used to create maximum likelihood trees using RAxML (Stamatakis 2006) with the GTRCAT substitution model, and an initial 500 bootstraps followed by a thorough maximum likelihood search. Putative active L1 were identified based on conserved ORF1 and ORF2 sequences.
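The length and identity thresholds described above can be illustrated with a minimal Python sketch. It is not the PALS workflow itself: the `Hit` record, its field names and the helper functions are hypothetical, and only the two thresholds (at least 90% of the query length, at least 70% identity) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of the L1 consensus (query) to the genome.

    Hypothetical record for illustration; not the PALS output format.
    """
    chrom: str
    start: int        # 0-based, inclusive
    end: int          # exclusive
    identity: float   # fraction of identical aligned positions, 0..1


def is_intact(hit, query_len, min_len_frac=0.90, min_identity=0.70):
    """True if the hit spans >= 90% of the query at >= 70% identity."""
    long_enough = (hit.end - hit.start) >= min_len_frac * query_len
    similar_enough = hit.identity >= min_identity
    return long_enough and similar_enough


def intact_hits(hits, query_len):
    """Keep only the hits that pass both thresholds."""
    return [h for h in hits if is_intact(h, query_len)]
```

In this sketch the coordinates of the retained hits would then be used to extract the sequences passed on to alignment and tree building.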
This step was performed according to the procedure described in Adelson et al. (2009). The SINEs ERE1 and ERE2 were amalgamated into a single group (ERE1_2) for the purpose of this analysis. The genome was partitioned into bins, the repeats of each repeat group were counted in each bin, and Spearman’s rank correlations were calculated for all pairs of repeat groups, and between each repeat group and segmental duplications, gene count and G+C content. In addition to the 1.5 Mbp bins used in Adelson et al. (2009), the analysis was repeated with bins of 20, 50, 100, 150 and 500 Kbp, and of 1 and 7.5 Mbp, to assess the impact of bin size on the analysis.
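The binning and rank-correlation procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the pipeline of Adelson et al. (2009): the function names are hypothetical, a repeat is assigned to the bin containing its start coordinate (an assumption), and Spearman's coefficient is computed as the Pearson correlation of tie-averaged ranks.

```python
from collections import Counter


def bin_counts(positions, chrom_len, bin_size):
    """Count repeat start positions in each fixed-size bin of one chromosome.

    Assumption: a repeat is counted in the bin that contains its start.
    """
    n_bins = -(-chrom_len // bin_size)  # ceiling division
    counts = Counter(pos // bin_size for pos in positions)
    return [counts.get(i, 0) for i in range(n_bins)]


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Repeating the same calculation with different values of `bin_size` is one way to examine how a correlation depends on scale.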
Identification of extreme density bins and repeat content analysis
The bins were classified as having low, medium or high MIR/L2 density. The cut-off between groups was the two-tailed 10% significance level cut-off for the sum of the MIR and L2 ranks. For the ERE1/2, ERE3 and L1 repeat groups, the statistical package R (The R Foundation for Statistical Computing, 2009) was used to perform Wilcoxon rank sum tests with continuity correction between the high- and low-density groups. The high-density MIR/L2 bins for human (hg18) were obtained as for the equine, except that the RepeatMasker library for human was used. These human bins were mapped to equine bins based on full genomic alignments of repeat-masked sequence from the horse genome to the human hg18, using PatternHunter 10 (Ma et al. 2002). Following established methods (Waterston et al. 2002; Lindblad-Toh et al. 2005), we identified co-linear clusters of the identified synteny anchors, which were used to form larger syntenic segments in a hierarchical fashion. Segments that were larger than a given size in both genomes, and that comprised at least four anchors at a given stage in the merging process, defined a resolution-dependent, pairwise synteny map between the two genomes.
Our ab initio repeat identification and annotations (Table 1) indicate that the equine genome has a comparable repeat content to other eutheria (∼47%), with LINE 1 (L1) RT the major class of interspersed repeats. The major SINE class in horse is the perissodactyl-specific ERE type (Gallagher et al. 1999). There are, however, many unclassified repeats that account for 11% of the equine genome. Most of these unclassified repeats are made up of a number of fragments of known types, and can best be described as possessing a chimeric or recombinant sequence. Only a minority of these unclassified sequences cannot be annotated using currently available RepeatMasker or RepBase repeat databases. The complexity of the chimeric repeats has proved to be resistant to a simple classification scheme, and will not be discussed further here.
Table 1. Simple and interspersed repeats in the equine genome.
Percent coverage of genome
SINE, Short Interspersed Nuclear Elements.
402 200 888
71 040 977
6 321 427
4 519 823
484 083 115
70 493 153
65 909 940
35 928 664
173 515 233
66 335 259
45 135 591
37 924 134
1 698 832
151 093 816
77 908 269
4 205 795
4 615 663
5 154 925
3 249 245
1 271 144
13 080 569
2 478 174
2 012 054
1 830 576
1 179 278
1 174 230
1 078 857
10 982 434
2 560 707
30 290 224
278 805 770
Interspersed repeat total
2 379 926
1 169 611 998
4 910 708
54 353 227
We have characterized the intact L1s because they are the most common type of RT in the equine genome, and they are the only non-LTR LINEs that have the potential for autonomous retrotransposition. For L1s to have autonomous activity, they must be full length and encode two functional ORFs. We identified 1115 full-length L1s by aligning our improved L1 consensus sequences against the Equine v2 assembly and extracting all full-length (≥90% and ≤110% of the consensus length) matching sequences that were ≥70% identical. These intact L1s were used to construct a maximum likelihood tree (Fig. 1). The tree topology reflects the five known equine L1 subfamilies (Smit et al. 1996–2004; Jurka et al. 2005). Active L1s should have two ORFs encoding 40- and 150-kDa proteins (Goodier et al. 2007). Only 103 of our full-length L1s could be classified as potentially active on this basis, with 93 of those in the L1_1 subfamily and the remaining 10 in the L1_2 subfamily.
To identify spatial correlations among repeats, and between repeats and other genome features such as gene models, segmental duplications and G+C content, we carried out a comprehensive, pairwise correlation analysis of repeat types as performed by Adelson et al. (2010). The results of this analysis are summarized in Fig. 2.
The effects of bin size on the correlations are shown in the right-hand diagonal half of Fig. 2 and summarized in Table 2. In general, the use of a 1 Mbp bin size gave the strongest correlations for most pairs, with little additional strengthening at larger bin sizes. Most pairs (61%) had correlations that simply strengthened with increasing bin size, indicating that those associations were not specific to a particular scale. However, 34% of the pairs had correlations whose strength or sign depended on the particular bin size, indicating that some associations appeared to be specific to certain scales/genomic distances. Only a few pairs showed their strongest correlations at a small scale (≤50 Kbp), but almost 10% of pairs had correlations that changed sign as a function of bin size. Some exemplars of these bin size effects are shown in Fig. 3. The scale dependence of some correlations suggests a potential biological effect, perhaps related to effects on gene regulation. Because most correlations were close to their maxima at 1.5 Mbp and had large enough sample sizes to keep the standard error manageable, we settled on this scale for the representative set of correlations shown in the left-hand diagonal half of Fig. 2.
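As a hedged illustration of how sign changes across bin sizes might be tabulated, the following Python sketch reduces a series of correlation coefficients, ordered by increasing bin size, to its sequence of signs. The function name and the `tol` parameter are hypothetical, and the sketch covers only the sign-change categories, not the maximum, minimum or cubic categories of Table 2. It uses an ASCII hyphen for negative values.

```python
def sign_pattern(correlations, tol=0.0):
    """Collapse correlations (ordered by increasing bin size) to a run of signs.

    Values within +/- tol of zero are treated as zero and skipped.
    Returns e.g. "+", "- to +", "- to + to -", or "0" if all are zero.
    """
    signs = []
    for r in correlations:
        if r > tol:
            s = "+"
        elif r < -tol:
            s = "-"
        else:
            continue  # effectively zero: carries no sign information
        if not signs or signs[-1] != s:
            signs.append(s)
    return " to ".join(signs) if signs else "0"
```

A pair whose pattern contains more than one sign is one whose apparent association depends on the choice of bin size.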
Table 2. How repeat group correlations vary as a function of increasing bin size (bp scale).
Number of correlations
No change (0)
Has maximum (+)
Has maximum (−)
Has minimum (−)
Is cubic (+)
Is cubic (−)
Changes sign (− to +)
Changes sign (+ to −)
Changes sign back (− to + to −)
Based on these results we were able to identify a small number of very strong correlations, some of which we had observed previously in cow and human (Adelson et al. 2009). Specifically, we wish to draw attention to a very strong correlation between the fossil RT LINE 2 (L2) and SINE MIR (r > 0.8, see Fig. 4), and that between clade-specific SINE ERE1/2 and ERE3 sub-families. We also wish to draw attention to a very strong correlation between L1 and gene models, and a weaker correlation between L1 and segmental duplications. The strong pair-wise correlations for L2 and MIR and for ERE1/2 and ERE3 were reminiscent of the relationships we had previously uncovered (Adelson et al. 2009; Elsik et al. 2009) in cow and suggested to us that similar spatial associations might exist in horse. We were particularly interested in the L2/MIR correlation, as these are an ancestral LINE/SINE pair, and their strongly correlated ranks indicated that they were being conserved in the same genomic regions. Similarly, the relationship between the clade-specific ERE1/2 and ERE3 indicated that these repeats appeared to favour integration in similar regions.
By plotting the ranks for L2 and MIR for every 1.5 Mbp bin (see Fig. 4), it is clear that not only is the correlation coefficient very high, but the distribution of ranks is consistent, symmetric and highly constrained. Because rank acts as a proxy for repeat count or density, we were able to select the highest and lowest repeat density bins for further study by partitioning the bins on the basis of the 10% tails from an expected random distribution of bins. These partitions are shown by the lines in Fig. 4, with the 5% extreme higher density L2/MIR bins in the upper-right hand corner, and the 5% extreme lower density L2/MIR bins in the lower-left hand corner. Because each bin can be positioned on the chromosome scaffolds, it was then possible to determine the spatial distribution of the extreme density L2/MIR regions and to do the same for the extreme density clade-specific ERE1/2/ERE3 regions (Fig. 5).
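One way to obtain cut-offs of this kind is sketched below in Python. It rests on an assumption that is not stated in the text: that the "expected random distribution" is the null in which the two rankings are independent, so that the sum of two ranks, each scaled to the interval 0 to 1, follows a triangular distribution with P(S ≤ s) = s²/2 for s ≤ 1. The function names are hypothetical and the continuous approximation ignores the discreteness of ranks.

```python
import math


def rank_sum_cutoffs(n_bins, tail=0.05):
    """Lower and upper cut-offs on the sum of two ranks (each 1..n_bins).

    ASSUMED null: the two rankings are independent and uniform, so the
    scaled rank sum is triangular on [0, 2] (continuous approximation).
    """
    if not 0 < tail <= 0.5:
        raise ValueError("tail must be in (0, 0.5]")
    s = math.sqrt(2 * tail)  # solves s**2 / 2 == tail, valid for s <= 1
    return s * n_bins, (2 - s) * n_bins


def classify(rank_a, rank_b, n_bins, tail=0.05):
    """Label a bin 'low', 'medium' or 'high' from the sum of its two ranks."""
    low, high = rank_sum_cutoffs(n_bins, tail)
    total = rank_a + rank_b
    if total <= low:
        return "low"
    if total >= high:
        return "high"
    return "medium"
```

Under this assumed null each tail holds 5% of bins. When the two repeat groups are strongly positively correlated, as L2 and MIR are, more than 5% of the observed bins would be expected to fall beyond each cut-off.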
The distributions of extreme-density ancestral repeats (L2/MIR) and clade-specific repeats were clearly different, with large contiguous regions of high-density ancestral repeats compared with much more evenly distributed extreme-density clade-specific bins. Furthermore, there was little overlap between the high-density ancestral regions and the high-density clade-specific bins. This was also evident from Table 3, where there were significantly fewer ERE3 RTs found in high-density ancestral bins than in low-density ancestral bins. The same was true for L1, which are also of recent origin compared to L2 and MIR.
Table 3. Number of clade specific repeats in ancestral regions.
MIR/L2 Density bins
*High vs. Low, Wilcoxon rank sum test with continuity correction.
Number of Bins
In order to determine if these ancestral repeat-rich regions were evolutionarily conserved, we repeated our correlation analysis on the human genome assembly (hg18) and found that L2/MIR were highly correlated in human (r = 0.86) as well. We identified the high-density L2/MIR bins for the human genome and then converted their coordinates to equine genome locations using our horse/human whole-genome alignment. Figure 5b shows the equine and human ancestral domains and reveals that they have a large degree of overlap that is particularly striking for the larger contiguous sets of bins that define large territories.
Our novel methodology for identifying, annotating and analysing repetitive DNA has yielded a number of interesting results. While we did not identify many novel repeats, we did find a large number of hybrid or chimeric repeats. Chimeric repeats have been reported before (Buzdin et al. 2003), but at a much lower frequency than we have observed. These repeats could represent novel, recombinant and evolving RT or they could be satellite sequences that have arisen from RT insertion (Kapitonov et al. 1998), or composite satellites amplified in a chromosome-specific manner. If they are composite satellites, we expect their distribution to be chromosome specific and peri-centromeric or sub-telomeric. While preliminary analyses indicate that some of these repeats are probably chromosome specific, it is too early to conclude that all of these chimeric repeats are composite satellites.
Our comprehensive spatial analysis of repeat groups has confirmed that ancient fossil L2 and MIR RT are highly correlated, as they are in cow and human (Adelson et al. 2009; Elsik et al. 2009). In this report we extended our spatial correlation analysis to investigate the effect of bin size on correlation. While most correlations strengthened as a function of bin size, for some, choice of bin size determined the sign and magnitude of the correlation. A specific example of this is the relationship between L1 and G+C content (Fig. 3). When the human and mouse genome sequences (Lander et al. 2001; Waterston et al. 2002) were analysed using 50 Kbp sequence windows, L1 were observed to be negatively correlated with G+C content. Our small-bin observations were consistent with these results, but at larger window/bin sizes the relationship changed. An arbitrary choice of window or bin size can therefore result in an incomplete understanding of spatial relationships. Comparing SINE repeat group correlations with G+C content in horse gave the opposite result to that reported in mouse and human, where SINE were positively correlated with G+C content. Furthermore, our analysis of bovine repeat correlations (Adelson et al. 2009) showed that SINE families paired with LINE_RTE were negatively correlated with G+C, while one that was probably paired with L1 was positively correlated with G+C content.
We also noted positive correlations of L1 and LTR ERV1 with segmental duplications. Others have also reported associations of RT with segmental duplication/copy number variation (Bailey et al. 2003; She et al. 2008), but only the latter report implicated L1 RT.
Our analysis also confirmed that fossil RT densities (L2 and MIR) define conserved, syntenic ancestral genome domains. Because L2 and MIR have been inactive since the mammalian radiation, the persistence of such domains can be explained by one of two alternative scenarios: (i) negative selection that preserved ancestral territories or (ii) protection from new retrotransposition events. There is evidence that many L2 and MIR RT have undergone strong negative selection because they have been co-opted to regulate gene expression (Silva et al. 2003; Lowe et al. 2007). This suggests that the conserved ancestral repeat-enriched genome territories we have discovered here are the result of purifying selection or of chromatin structural constraints and are therefore probably of functional significance.
DLA: designed research, performed research, analysed data, wrote paper; JMR: performed research, analysed data, wrote paper; MG: performed research; RCE: contributed analytic tools, wrote paper.
Conflicts of interest
The authors have not declared any potential conflicts.