Replication origins were mapped in hyperthermophilic crenarchaea, using high-throughput sequencing-based marker frequency analysis. We confirm previous origin mapping in Sulfolobus acidocaldarius, and demonstrate that the single chromosome of Pyrobaculum calidifontis contains four replication origins, the highest number detected in a prokaryotic organism. The relative positions of the origins in both organisms coincided with regions enriched in highly conserved (core) archaeal genes. We show that core gene distribution provides a useful tool for origin identification in archaea, and predict multiple replication origins in a range of species. One of the P. calidifontis origins was mapped in detail, and electrophoretic mobility shift assays demonstrated binding of the Cdc6/Orc1 replication initiator protein to a repeated sequence element, denoted Orb-1, within the origin. The high-throughput sequencing approach also allowed for an annotation update of both genomes, resulting in the restoration of open reading frames encoding proteins involved in, e.g., sugar, nitrate and energy metabolism, as well as in glycosylation and DNA repair.
DNA replication is an essential cellular process in all organisms, required for duplication of the chromosome(s) prior to cell division, thereby ensuring that both daughter cells receive a full complement of the genetic material. Similar to other biosynthetic pathways, replication is regulated at the level of initiation which, in most organisms, occurs at specific sites in the chromosome, replication origins. The origins contain binding sites for positively acting replication factors, initiator proteins, to which other replication proteins bind, promoting origin unwinding and subsequent initiation of DNA polymerization (Duderstadt and Berger, 2008).
The number of replication origins varies in different evolutionary lineages (Aves, 2009). All bacteria studied to date contain a single replication origin, from which the entire chromosome is bidirectionally replicated. In contrast, eukaryotic chromosomes contain multiple replication origins, ranging from tens to thousands, depending on genome size, resulting in a shortening of the time required for replication, the S phase, as compared to single origin usage (Masai et al., 2010).
Species belonging to the third domain of life, Archaea, display either the single or multiple origin mode of replication, and the trait is correlated to phylogeny. Within the Crenarchaeota phylum, all organisms examined to date contain multiple origins. Species belonging to the Sulfolobales order contain three replication origins (Lundgren et al., 2004; Robinson et al., 2004) unevenly distributed over the chromosome, while in organisms belonging to the Aeropyrum genus, within the Desulfurococcales order, two origins of replication have been identified (Robinson and Bell, 2007). Within the Euryarchaeota phylum, species belonging to the Pyrococcus and Archaeoglobus genera have been shown to contain a single origin of replication (Myllykallio et al., 2000; Maisnier-Patin et al., 2002), while multiple replication origins have been suggested in other genera, including Methanocaldococcus (Maisnier-Patin et al., 2002), Halobacterium (Berquist and DasSarma, 2003; Zhang and Zhang, 2003) and Haloferax (Norais et al., 2007). No experimental data concerning replication origins is available for members of the Thaumarchaeota phylum.
Putative replication origins may be identified by bioinformatics approaches, including searches for previously identified origin-specific sequence elements, origin recognition boxes (archaeal Orbs; Lopez et al., 1999; Zhang and Zhang, 2002; 2004; Robinson et al., 2004; Lundgren and Bernander, 2005), determination of nucleotide skews along the chromosome and other methods (Sernova and Gelfand, 2008). However, ambiguous results are often obtained, resulting in a need for experimental verification, and this may be achieved by marker frequency (MF) analysis. In this, the relative proportions of different markers along the chromosome are determined in exponentially growing cell cultures. Provided that replication is initiated at a fixed position, markers situated close to the origin will, on average, be present in higher copy number than origin-distal markers in DNA from the entire cell population, generating a marker gradient from the replication origin to the terminus.
In the original version of the assay, a small number of markers, in the order of 10 to 15, distributed over the chromosome were used to map the single origin of the Escherichia coli bacterium (Bird et al., 1972). With the advent of whole-genome DNA microarrays, the method could be expanded to thousands of markers regularly spaced across the chromosome (Khodursky et al., 2000; Raghuraman et al., 2001; Lundgren et al., 2004). The recent development of high-throughput (HT) sequencing technologies provides yet a further development of the approach, in which each individual sequence read, in the order of tens to hundreds of millions in a single experiment, can be mapped onto a reference genome sequence to generate an MF gradient (Srivatsan et al., 2008; Skovgaard et al., 2011; Xu et al., 2012). The direct quantification of the reads also obviates the need for inclusion of stationary phase reference DNA in the experiments, a prerequisite for the hybridization-based microarray assays, which are based on relative abundance as compared to a sample from a non-replicating cell population.
During our studies of the cell cycle of the hyperthermophilic crenarchaeon Pyrobaculum calidifontis (Lundgren et al., 2008; Ettema et al., 2011), we decided to carry out a replication origin mapping since Orb sequences characteristic of origins in other archaea had not been identified in this species, and the number, features and relative positions of the replication origin(s), thus, remained unknown.
Here, we report the use of HT sequencing for MF analysis of P. calidifontis, as well as of Sulfolobus acidocaldarius as control. We demonstrate the usefulness of the approach, confirm the positions of the three origins in S. acidocaldarius, and identify four origins of replication dispersed over the P. calidifontis chromosome. We also demonstrate binding of the Cdc6/Orc1 initiator protein to Orb elements at one of the origins, and provide an annotation update of the complete genome sequences of both species.
The hyperthermophilic crenarchaea S. acidocaldarius and P. calidifontis were grown in batch cultures at 80°C and 90°C respectively. Genomic DNA was extracted from samples collected in exponential growth phase. A portion of each sample was subjected to flow cytometry analysis, to confirm that a substantial fraction of each population resided in the chromosome replication stage (S phase) of the cell cycle (Fig. 1). The purified DNA was sequenced using the 454 HT platform (Margulies et al., 2005), yielding a total output of raw sequence data of about 38 Mb and 45 Mb for S. acidocaldarius and P. calidifontis respectively. The reads were filtered to remove artificial replicates (Gomez-Alvarez et al., 2009), and the remaining 35 Mb and 33 Mb of reads were mapped onto the respective genome. The cumulative number of reads was then plotted against chromosome position.
Replication origin mapping in S. acidocaldarius
For S. acidocaldarius about 138 000 reads were retrieved, and 124 000 remained after filtering. When plotted against genome position, three peaks corresponding to replication origins were resolved (Fig. 2). The relative positions agreed with previous mappings by microarray-based MF analysis (Lundgren et al., 2004), 2D gel electrophoresis (Robinson et al., 2004), and GC skew analysis (Chen et al., 2005). As also observed in previous analyses, bidirectional replication was initiated from all three origins, and all six replication forks progressed along the chromosome at similar rates, as indicated by the slopes of the MF gradients. The results provided proof-of-principle data for HT sequencing-based MF analysis of archaea, in addition to providing an independent confirmation of previous replication origin mappings.
Replication origin mapping in P. calidifontis
Of the 113 000 reads that were retrieved for P. calidifontis, 81 000 remained after filtering. When mapped onto the genome sequence, four distinct peaks became apparent, centred at chromosomal regions 0 kb, 500 kb, 900 kb and 1600 kb (Fig. 3). The single chromosome of P. calidifontis, thus, contains four replication origins, the highest number observed in a prokaryotic organism to date. All peaks were of similar relative height, indicating that all four origins were used in all cells in the culture (cf. Sulfolobus species; Lundgren et al., 2004), i.e. that initiation of chromosome replication occurred within a short time interval at all origins in each individual cell.
The MF gradients displayed similar slopes on both sides of each origin, showing that all four origins specified bidirectional replication, and that all eight replication forks progressed along the chromosome at similar speed. In addition, the depth of the intervening troughs between the peaks was, in all four cases, correlated to the distance between the flanking origins, such that the longer this distance, the deeper the corresponding trough, as expected (Lundgren et al., 2004). In contrast to the co-ordinated initiation, only the two initial replication termination events occurred at a similar relative time, while the remaining two events took place at later time points, different for each event, due to the different inter-origin distances (500, 400, 700 and 400 kb). This asynchronous termination is, similar to the co-ordinated initiation, reminiscent of the organization of the replication process in Sulfolobus species (Bernander, 2007).
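The link between origin positions, inter-origin distances and termination order can be illustrated with a short sketch. It assumes a circular chromosome of 2000 kb (simply the sum of the four inter-origin distances quoted above), simultaneous initiation at all origins and equal fork rates; the function name is hypothetical.

```python
def inter_origin_distances(origins_kb, genome_kb):
    """Return the distance (kb) from each origin to the next one
    on a circular chromosome, in order of origin position."""
    ordered = sorted(origins_kb)
    gaps = []
    for i, start in enumerate(ordered):
        nxt = ordered[(i + 1) % len(ordered)]
        # Modulo handles the wrap-around gap; a single origin spans the whole circle.
        gaps.append(((nxt - start) % genome_kb) or genome_kb)
    return gaps


# Approximate peak positions from the MF plot (kb); genome length is an assumption.
origins = [0, 500, 900, 1600]
gaps = inter_origin_distances(origins, 2000)

# Two converging forks share each gap, so each fork covers half of it.
per_fork = [g / 2 for g in gaps]

print(gaps)              # inter-origin distances in kb
print(sorted(per_fork))  # termination order follows distance per fork
```

With equal fork rates, the two 400 kb intervals are completed first and at the same time, followed by the 500 kb and then the 700 kb interval, in line with the asynchronous termination described above.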
The ratio between the highest tip and lowest trough in the MF distribution may be used to estimate the fraction of the generation time occupied by the replication period (Lundgren et al., 2004). The ratio indicated that the relative length of the S phase was substantially longer than in a previous estimation based on flow cytometry data alone (Lundgren et al., 2008), in which a single bidirectional replication event was assumed. The MF distributions revealed that the replication forks terminated replication at different time points, resulting in a successive reduction in DNA synthesis per cell. In the flow cytometry DNA distributions (Fig. 1; Lundgren et al., 2008), cells in which residual replication is going on form an overlapping distribution with those in which all forks have run to completion, resulting in an underestimation of the replicating population and, consequently, the S phase length, explaining the discrepancy.
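The relation used for this estimate is not spelled out here. A minimal sketch, assuming the standard age-distribution relation for a steady-state exponential culture, in which the copy-number ratio between a marker replicated at the start of the replication period and one replicated at its end equals 2^(C/τ) (C, replication period; τ, generation time), is given below. The function name and the read counts are hypothetical and are not taken from Fig. 3.

```python
import math


def replication_fraction(peak_reads, trough_reads):
    """Estimate C/tau from a peak-to-trough marker frequency ratio,
    assuming ratio = 2 ** (C / tau) (an assumption, not stated in the text)."""
    if peak_reads <= 0 or trough_reads <= 0:
        raise ValueError("read counts must be positive")
    return math.log2(peak_reads / trough_reads)


# Hypothetical read counts per window at the highest peak and deepest trough.
fraction = replication_fraction(300, 200)
print(round(fraction, 3))
```

Under this assumption a peak-to-trough ratio of 2 would correspond to a replication period spanning the entire generation time.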
We previously estimated the P. calidifontis replication fork progress rate at 390 bp s−1 (Lundgren et al., 2008). The estimation was based on bidirectional replication from a single origin and, consequently, also needs modification in light of the new data. Simply dividing by 4 would give a rate of 98 bp s−1, close to that previously reported for S. acidocaldarius (80–110 bp s−1; Lundgren et al., 2004). A more accurate estimate is obtained by taking into account the uneven spacing of the origins along the P. calidifontis chromosome (Fig. 3), and the fact that the S phase length is limited by the replication fork covering the longest distance of DNA. The longest inter-origin distance is found between origins 3 and 4 in the MF plot (Fig. 3), as further evidenced by the fact that this is where the deepest trough is found. Each of the two replication forks that meet in this trough has synthesized approximately 350 kb of DNA during the entire S phase, which was estimated at around 165 min, indicative of a fork rate around 35 bp s−1, or about one-third of that of S. acidocaldarius.
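The arithmetic behind the two estimates can be reproduced as follows. The sketch uses only the figures quoted in the text; the function name is hypothetical.

```python
def fork_rate_bp_per_s(distance_kb, s_phase_min):
    """Fork rate when one fork covers distance_kb within s_phase_min."""
    return distance_kb * 1000 / (s_phase_min * 60)


# Naive correction: previous single-origin estimate divided by four origins.
naive = 390 / 4

# Estimate based on the longest stretch covered by a single fork (350 kb in 165 min).
corrected = fork_rate_bp_per_s(350, 165)

print(naive)                # bp per second
print(round(corrected, 1))  # bp per second
```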
Initiator binding sites
We attempted to further pinpoint the locations of the replication origins by searching for putative binding sites for replication initiation proteins. No Orb elements (see Introduction) have been identified in P. calidifontis, despite the presence of well-conserved motifs in a wide range of archaea from both the Crenarchaeota and Euryarchaeota phyla (Robinson et al., 2004). We therefore performed exhaustive searches for repeated elements, regardless of similarities to previously known Orbs, in intergenic regions around all four MF peaks.
Two identical oppositely oriented 12 bp repeats, as well as a third copy, shorter and less conserved, were identified in the ori-1 region (Fig. 4A), in the intergenic space immediately upstream of the single annotated cdc6 gene in the P. calidifontis genome (GenBank accession number NC_009073). Electrophoretic mobility shift assay (EMSA) experiments demonstrated binding of purified Cdc6 protein to the repeat region (Fig. 4B), indicating that Cdc6 recognized the AAAACTTTCAGT consensus sequence, which we designate Orb-1. An additional identical copy of the TTTCAGT component was also present (Fig. 4A). This sequence element corresponds to a well-conserved part of the consensus sequence for common cren- and euryarchaeal Orbs (Robinson et al., 2004), that has been shown to interact with the Cdc6/Orc1 wing domain (Dueber et al., 2011). Various other repeated motifs were identified in intergenic regions around the other three MF peaks (Table S1). However, the significance of these for replication initiation remains uncertain in the absence of independent supporting data, and they were therefore not further considered.
At all three Sulfolobus replication origins, the gene for the initiator protein is located immediately adjacent to the origin, as also appears to be the case for ori-1 of P. calidifontis. Other replication-associated genes are occasionally located adjacent to initiator genes in archaeal and bacterial genomes, and gene context analysis may thus be of use in attempts to reveal the remaining, unknown, initiator proteins in P. calidifontis. The positions of known replication-associated genes in the P. calidifontis genome are, consequently, indicated in Fig. 5, together with the replication origins.
Core gene distribution
We have demonstrated that all three replication initiation regions in S. acidocaldarius and Sulfolobus solfataricus are enriched in archaeal core genes, i.e. genes conserved in a wide range of archaea, which often are essential and highly expressed (Andersson et al., 2010; Table S2). We mapped core genes along the chromosome also in P. calidifontis, as well as in all other genome-sequenced archaea. The four origin regions mapped by HT-based MF coincided well with four peaks of increased core gene density in the P. calidifontis genome (Fig. 6), indicating that core gene distribution provides a useful tool for mapping the number and locations of chromosome replication origins in archaeal genomes. Thus, based on core gene distributions, we predict multiple origins also in the crenarchaea Sulfolobus tokodaii, Hyperthermus butylicus, Staphylothermus marinus and Thermoproteus neutrophilus, in the thaumarchaea Nitrosopumilus maritimus and Cenarchaeum symbiosum, and in the euryarchaea Haloquadratum walsbyi, Methanococcus aeolicus, Thermoplasma acidophilum and Thermoplasma volcanium (Fig. 6). Multiple origins may also be present in a range of additional archaeal species (Fig. S1).
Annotation update of S. acidocaldarius
In addition to the MF analyses, the HT approach resulted in near complete resequencing of both organisms, with consequent possibilities for identification of deviations from the published genome sequences. For both species, the total number of deviations corroborated by high coverage was in the order of 1 per 80 000–150 000 bp (Table S3), testifying to the high quality of the previously reported data. Examples of deviations that resulted in major consequences for predicted proteins are listed in Table 1, and a selection of these is discussed in the following.
Table 1. Selected annotation updates for S. acidocaldarius and P. calidifontis.
Location (gene identifier)
Gene product; comments
Fructose-1,6-bisphosphate aldolase; extension of orf
SSV1 integrase; extension of orf
Substitution (G to T)
Sarcosine oxidase alpha and beta subunits; fusion of orfs Saci_0783 and Saci_0784; extension of orf Saci_0785
Substitution (A to T)
Substitution (A to T)
Hypothetical protein; shortening of orf
Acetyl-coenzyme A synthetase; extension of orf
Xpb/Rad25-related helicase; extension of orf
Malto-oligosyltrehalose synthase; extension of orf
Substitution (G to A)
Glycosyltransferase; extension of orf
Glutamyl-tRNA amidotransferase subunit A; extension of orf
Position (gene identifier)
Gene product; comments
Substitution (C to T)
Hypothetical protein; extension of orf
Coenzyme A transferase; extension of orf
Substitution (T to G)
Molybdopterin oxidoreductase (Pcal_1601)
Nitrate reductase alpha subunit; shortening of orf
See Table S1 for complete list and further details.
In the original annotation of the S. acidocaldarius genome (GenBank accession number NC_007181), a frameshift was noted in the Saci_0419 aldolase orf. The restoration of the reading frame by a single nucleotide deletion, and consequent elongation of the predicted protein, resulted in a full-length aldolase, adding to the predicted repertoire of sugar metabolic enzymes.
The Saci_0783, Saci_0784 and Saci_0785 orfs, with a truncation and a frameshift noted in the original annotation of the latter two respectively, were together affected by three single nucleotide substitutions and one insertion. As a result, the orfs were reduced to two, encoding putative subunits of a sarcosine oxidase (sox genes), affecting the predicted amino acid, energy and/or carbon metabolic capacity.
Similar to the Saci_0419 gene, frameshifts were noted in the original annotation of the Saci_1149 acetyl-coenzyme A synthetase and the Saci_1436 malto-oligosyltrehalose synthase. Single nucleotide insertions restored both orfs to full-length homologues, suggesting that the respective genes encode functional enzymes.
A single nucleotide insertion upstream of the Saci_1326 orf, encoding an Xpb/Rad25-type DNA helicase, resulted in an N-terminal extension of the predicted gene product. The extended protein displayed high similarity, both in length and sequence, to other Sulfolobus Xpb/Rad25 homologues. The orfs encoding the Saci_1904 glycosyltransferase and the Saci_2168 glutamyl-tRNA amidotransferase were affected by a single nucleotide substitution and deletion respectively. The resulting shorter Saci_1904 and extended Saci_2168 gene products, as compared to the original annotation, are likely to constitute the full-length functional products of the respective transcription units.
Annotation update of P. calidifontis
Two closely positioned single nucleotide deviations were detected in the Pcal_0343 orf. One of these resulted in a frameshift, extending the N-terminal part of the predicted product. Sequence similarity searches with the extended orf did not yield insights into the possible function of the protein.
The Pcal_1385 and Pcal_1601 orfs deviated by a single nucleotide insertion and substitution respectively, resulting in a frameshift and a stop codon elimination. Both proteins lack functional annotation and were listed as pseudogenes in the original P. calidifontis genome data set (GenBank accession number NC_009073). Sequence similarity searches with the restored orfs revealed that they encode a coenzyme A transferase and a molybdopterin oxidoreductase (molybdopterin binding subunit) respectively.
Finally, a frameshift caused by a single nucleotide deletion in the Pcal_1907 nitrate reductase alpha subunit orf resulted in an N-terminal shortening of the protein, as compared to the original annotation.
We demonstrate the usefulness of HT sequencing-based MF analysis for replication origin mapping in archaea. Four replication origins were identified in the single chromosome of P. calidifontis, the highest number recorded for a prokaryotic organism. The result further indicates that the multiple origin mode of replication may be a general crenarchaeal trait. A Cdc6 binding element, Orb-1, was identified in one of the origin regions, and core gene distribution was shown to be a useful predictive tool for origin location in archaea. The replication origins were initiated in near synchrony and specified bidirectional replication with fork progression rates around 35 bp s−1. An annotation update of both the P. calidifontis and S. acidocaldarius genome sequences is also presented.
The HT sequencing-based approach is applicable to any organism for which a sizeable part of the cell population can be harvested in the S phase of the cell cycle, and for which the complete genome sequence is available, or can be generated in the same experiment. The approach is, further, independent of HT platform choice (cf. Srivatsan et al., 2008), as long as the output is directly proportional to marker abundance in the sampled population. A distinct advantage is that the strategy obviates the need for stationary phase reference DNA.
The EMSA experiments provided evidence for the precise location of one of the P. calidifontis replication origins, ori-1, in the region upstream of the gene for the single cdc6-1 replication initiator protein, while precision mapping of the other three origins requires further experimental data. The fact that no Orb boxes could be identified as common to the four origin regions indicates that each origin could be recognized by a specific initiator protein, similar to the situation in Sulfolobus species (Robinson et al., 2004; Robinson and Bell, 2007).
We have previously shown that highly transcribed genes cluster around replication origins in Sulfolobus species (Andersson et al., 2010). Highly conserved archaeal genes, core genes, which usually are essential, are over-represented among such highly expressed genes and, thus, also around replication origins. Here we show that also in P. calidifontis, core genes tend to cluster in origin-proximal regions of the chromosome, and that the correlation is strong enough to allow the use of core gene distribution as a predictive tool for origin location in archaea. Several distinct core gene clusters were apparent in a range of other cren-, thaum- and euryarchaeal genomes (Figs 6 and S1), indicating the presence of multiple replication origins also in these species.
The clustering of core genes around early replicating regions suggests an adaptive advantage for this mode of chromosome organization. Further, the fact that the number of replication origins varies in different archaeal lineages, and that the relative origin positions in the chromosome vary even between species belonging to the same genus (Lundgren et al., 2004), indicates a high degree of flexibility in terms of overall chromosome organization and replication characteristics. The intuitive explanation that the main selective force behind multiple origins would be to shorten the time required for chromosome replication does not agree with the uneven origin spacing observed in most genomes (cf. Fig. 5). We have hypothesized (Andersson et al., 2010) that early replication of core genes could provide a safeguard for essential genetic information in organisms that thrive in highly mutagenic environments, by quickly providing a back-up copy for expression and homologous repair in case of gene damage.
The four P. calidifontis origins correlate well with four regions singled out by third position GC content analysis (Khrustalev and Barkovsky, 2011), in which the GC content bias was suggested to reflect a differential mutation frequency along the chromosome (see also Flynn et al., 2010). If the observed bias is instead an effect of differential codon usage related to the level of gene expression, this could, again, reflect the tendency for core genes (highly expressed; enriched in codons for abundant tRNAs) to cluster in origin-proximal regions. Genes located close to the replication origin have also been found to undergo lower substitution rates in many bacterial species (Mira and Ochman, 2002), proposed to be driven by homologous recombination between genes at high copy number due to multiple replication forks. Regardless of the selective forces behind the bias, we believe that a combination of core gene distribution, cumulative GC skew and third position GC content analysis, if possible augmented by identification of Orb-like elements and initiator genes, currently provides the most comprehensive and reliable bioinformatics approach for origin mapping in newly sequenced genomes.
The observed sequence variations in the two genomes, as compared to previously published data, may reflect differences between specific isolates used in different laboratories or, alternatively, methodological differences such as choice of DNA sequencing technology. In either case, the restoration of a set of orfs adds to the understanding of the metabolic capacity of both organisms. Together with other curation initiatives (Salzberg, 2007; Du et al., 2011; Esser et al., 2011), this facilitates evaluation of molecular, biochemical and physiological data, and improves our comprehension of this evolutionarily distinct and fascinating group of organisms, displaying some of the most extreme adaptations known to biology.
Growth of cell cultures; DNA isolation
Pyrobaculum calidifontis VA1 (Amo et al., 2002) and S. acidocaldarius DSM 639 (Deutsche Sammlung von Mikroorganismen) were grown in TY medium containing 0.3% (w/v) sodium thiosulfate pentahydrate at 90°C, and 1 × Allen medium containing 0.2% tryptone (Grogan, 1989) at 80°C, respectively. Growth was monitored by spectrophotometry and flow cytometry as described previously (Lundgren et al., 2004; Ettema et al., 2011). Samples of 100 ml and 40 ml of cell culture respectively, were collected in exponential phase and placed on ice to inhibit further replication. The cells were harvested by centrifugation and total genomic DNA was extracted as described previously (Lundgren et al., 2004). All experimental steps, including DNA sequencing, were performed in duplicate for both organisms.
Cloning and purification of recombinant Cdc6
The Pcal_0002 gene, encoding the Cdc6 protein, was amplified by PCR, cloned and verified by DNA sequencing following the protocol described in Pelve et al. (2011). For PCR, forward and reverse primers 5′-ataggatcccatgcacatggtcataattgatg-3′ and 5′-ataggatcctcattgtagctcctctctaata-3′ were used, respectively, resulting in a product flanked by BamHI restriction sites. Recombinant histidine-tagged protein was affinity-purified on a 1 ml HisGravi Trap (GE Healthcare) column and eluted with 20 mM sodium phosphate buffer, pH 7.5, containing 0.5 M NaCl and 0.5 M imidazole. Eluted samples were desalted on PD-10 (GE Healthcare) columns.
Electrophoretic mobility shift assays
Single-stranded oligonucleotides (81 nt) encompassing the Orb-1 elements (Fig. 4A) were end-labelled using T4 polynucleotide kinase (Fermentas) and γ-32P-ATP (10 µCi µl−1; GE Healthcare). Labelled oligonucleotides were mixed and annealed into a double-stranded probe: 5′-gtagagtttagtaaaactttcagttttcagtggtcctcttaaactatagtagactgaaagtttttaagccggattgtccat-3′. To investigate binding, 0.2 pmol labelled fragment was incubated together with purified Cdc6 protein for 30 min at 37°C in 10 mM Tris, pH 7.5, 50 mM KCl, 5 mM MgCl2, 2.5% glycerol, 1 mM DTT, 1 mg ml−1 BSA and 0.05 mg ml−1 poly dI-dC. In the competition experiments, an unrelated 381 bp DNA fragment or unlabelled probe was used in 10-fold and 50-fold molar excess respectively. DNA-protein complexes were electrophoresed on native 5% polyacrylamide gels in 0.5 × TBE buffer at 70 V for 2 h and visualized using a Personal Molecular Imager (PMI) system (Bio-Rad).
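The arrangement of the Orb-1 elements within the probe can be checked directly from the sequence given above. The following sketch locates the 12 bp consensus, its reverse complement and the shorter TTTCAGT component; the helper names are hypothetical and positions are 0-based.

```python
# Probe sequence as given in the text (top strand, 5' to 3').
PROBE = ("gtagagtttagtaaaactttcagttttcagtggtcctctt"
         "aaactatagtagactgaaagtttttaagccggattgtccat")
ORB1 = "aaaactttcagt"   # Orb-1 consensus
CORE = "tttcagt"        # shorter component of the consensus


def reverse_complement(seq):
    """Return the reverse complement of a lower-case DNA string."""
    comp = {"a": "t", "c": "g", "g": "c", "t": "a"}
    return "".join(comp[base] for base in reversed(seq))


def find_all(seq, motif):
    """Return 0-based start positions of all (overlapping) matches."""
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


forward_hits = find_all(PROBE, ORB1)
reverse_hits = find_all(PROBE, reverse_complement(ORB1))
core_hits = find_all(PROBE, CORE)

print(len(PROBE), forward_hits, reverse_hits, core_hits)
```

The search finds one forward and one inverted copy of the 12 bp element, and a second copy of the TTTCAGT component immediately downstream of the forward element.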
High-throughput DNA sequencing; data filtering
Genomic DNA, at a concentration of 250 ng µl−1, was sequenced with a GS-FLX 454 pyrosequencer (Roche), using titanium chemistry. Single-ends libraries were prepared by standard procedures (Margulies et al., 2005), and DNA sizes and concentrations were determined with a Bioanalyzer 2100 (Agilent). The P. calidifontis and S. acidocaldarius DNA preparations were differentially labelled with short tags and mixed in equimolar amounts in one-fourth of a sequencing plate. After sequencing, artificially redundant reads were filtered out as described by Gomez-Alvarez et al. (2009), after which 80 528 reads (median length 441 bp) and 123 686 reads (median length 296 bp) remained respectively.
Of the reads remaining after filtering, 98% and 97% could be mapped to specific genome positions in P. calidifontis and S. acidocaldarius, respectively, using Newbler 2.3 (Margulies et al., 2005). The genomes were divided into windows of 6.5 kb for P. calidifontis and 8.5 kb for S. acidocaldarius, and each read was assigned to a window according to its start point. The number of reads in each window was then plotted against chromosomal position.
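The binning step can be sketched as follows. This is an illustration only, assuming that mapped read start positions are available as 0-based integer coordinates; the function name and the toy data are hypothetical and do not reproduce the actual processing pipeline.

```python
def bin_read_starts(starts_bp, genome_bp, window_bp):
    """Count read start points in consecutive fixed-size windows."""
    n_windows = -(-genome_bp // window_bp)  # ceiling division
    counts = [0] * n_windows
    for pos in starts_bp:
        if not 0 <= pos < genome_bp:
            raise ValueError(f"start position {pos} outside genome")
        counts[pos // window_bp] += 1
    return counts


# Toy example: a 26 kb "genome" split into 6.5 kb windows.
toy_starts = [0, 100, 6499, 6500, 13000, 25999]
toy_counts = bin_read_starts(toy_starts, genome_bp=26_000, window_bp=6_500)
print(toy_counts)
```

Plotting such counts against window position yields the MF profile, in which origin-proximal windows are expected to contain more reads than origin-distal ones.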
Core gene distribution
Data was retrieved from the arCOG database (Makarova et al., 2007), January 2010 update. The arCOGs that were conserved in all archaeal genomes, except for the reduced genome of Nanoarchaeum equitans, were identified, resulting in a total of 152 arCOGs present in 65 genome sequences. For each genome, arCOGs represented by more than one gene were omitted, to avoid ambiguities. Archaeal genome DNA sequences were retrieved from GenBank (NCBI; for accession numbers, see legend to Fig. S1). Start positions for genes belonging to conserved arCOGs were extracted for each genome, and the genes were plotted against chromosome position within a 100 kb sliding window, transposed in 1 kb steps.
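The sliding-window count can be sketched as follows. The sketch assumes that the window is anchored at its left edge and wraps around the end of a circular chromosome; neither detail is specified above, and the function name and toy data are hypothetical.

```python
def sliding_window_counts(gene_starts_bp, genome_bp,
                          window_bp=100_000, step_bp=1_000):
    """Count gene start positions in a window moved along a circular
    chromosome. Returns (window_left_edge, count) pairs."""
    profile = []
    for left in range(0, genome_bp, step_bp):
        # Distance from the left edge, measured around the circle (assumption).
        n = sum(1 for g in gene_starts_bp
                if (g - left) % genome_bp < window_bp)
        profile.append((left, n))
    return profile


# Toy example: 10 kb circle, 2 kb window, 1 kb steps, three "core genes".
toy_profile = sliding_window_counts([500, 1500, 9500], genome_bp=10_000,
                                    window_bp=2_000, step_bp=1_000)
print([n for _, n in toy_profile])
```

Peaks in such a profile mark regions of increased core gene density, which can then be compared with experimentally mapped origin positions.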
Intergenic sequences in the P. calidifontis chromosome were extracted from regions spanning 50 kb on either side of the four peaks identified in the MF analysis. Conserved direct repeats (≥ 8 bp) were identified with FAIR (Senthilkumar et al., 2010), while degenerated repeats, direct or palindromic, were identified with REPuter (Kurtz et al., 2001), allowing ≤ 4 mismatches and/or indels. The repeat-containing intergenic sequences were further analysed for additional, less conserved, direct or palindromic repeats with the ‘DNA pattern’ software included in RSAT (Thomas-Chollier et al., 2011), as well as by manual inspection.
Genome annotation updates
Deviations from the published genome sequences represented in 75% or more of the reads from both experiments were compiled. The deviations were mapped and analysed in Artemis (Rutherford et al., 2000), after which the DNA sequences were extracted and translated for identification of new or altered orfs, which then were used in protein similarity searches against the GenBank database. All steps were complemented by manual inspection and modification.
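The 75% criterion can be illustrated with a short sketch. It assumes that, for each candidate position, the number of reads supporting the deviation and the total read coverage are available per experiment, and that the threshold must be met in each experiment separately (the criterion could also be read as applying to the pooled reads). The data layout, function name and values are hypothetical.

```python
def supported_deviations(experiments, min_fraction=0.75):
    """Return positions whose deviation is supported by at least
    min_fraction of the reads in every experiment.

    experiments: list of dicts mapping position -> (variant_reads, total_reads)
    """
    shared = set(experiments[0])
    for exp in experiments[1:]:
        shared &= set(exp)
    kept = []
    for pos in sorted(shared):
        if all(exp[pos][1] > 0 and exp[pos][0] / exp[pos][1] >= min_fraction
               for exp in experiments):
            kept.append(pos)
    return kept


# Hypothetical counts from two replicate sequencing experiments.
exp1 = {100: (9, 10), 200: (7, 10), 300: (8, 8)}
exp2 = {100: (15, 20), 200: (10, 10), 400: (5, 5)}
kept = supported_deviations([exp1, exp2])
print(kept)
```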
We thank Lionel Guy for assistance with extraction of intergenic regions, and Alejandro Artacho for processing of pyrosequencing reads. This work was supported by Grants 621-2007-5864 and 621-2010-5551 from the Swedish Research Council to R. B., and by Grant Microgen CSD2009-00006 from the Consolider-Ingenio program, Spanish Ministry of Science and Innovation to A. M.