De novo chromosome level assembly of a plant genome from long read sequence data

SUMMARY Recent advances in the sequencing and assembly of plant genomes have allowed the generation of genomes with increasing contiguity and sequence accuracy. Chromosome level genome assemblies using sequence contigs generated from long read sequencing have involved the use of proximity analysis (Hi‐C) or traditional genetic maps to guide the placement of sequence contigs within chromosomes. The development of highly accurate long reads by repeated sequencing of circularized DNA (HiFi; PacBio) has greatly increased the size of contigs. We now report the use of HiFiasm to assemble the genome of Macadamia jansenii, a genome that has been used as a model to test sequencing and assembly. This achieved almost complete chromosome level assembly from the sequence data alone without the need for higher level chromosome map information. Eight of the 14 chromosomes were represented by a single large contig (six with telomere repeats at both ends) and the other six assembled from two to four main contigs. The small number of chromosome breaks appears to be the result of highly repetitive regions including ribosomal genes that cannot be assembled by these approaches. De novo assembly of near complete chromosome level plant genomes now appears possible using these sequencing and assembly tools. Further targeted strategies might allow these remaining gaps to be closed.


INTRODUCTION
Reference genome sequences are a key resource for plant science. The challenge of producing a complete genome sequence has been greatly reduced by advances in both DNA sequencing (Hon et al., 2020; Levy and Myers, 2016) and sequence assembly tools (Chen et al., 2017; Phillippy, 2017). Final assembly of chromosome level genomes has relied upon evidence other than the sequence data alone, such as genetic maps (Fierst, 2015; Yu et al., 2019).
Advances in sequencing, assembly and scaffolding technologies, along with the rapid increase in the amount of freely available genomic data (https://www.ncbi.nlm.nih.gov/genbank/statistics), have greatly facilitated the development of highly accurate de novo assemblers.
Short-read de novo assemblers are not efficient in assembling the complex and long repetitive regions of plant genomes, such as centromeres and telomeres (Liao et al., 2019). To address this limitation, long read sequencing technologies, also known as third generation sequencers, have been developed. However, these long reads from PacBio (Menlo Park, CA, USA) and Oxford Nanopore (Oxford, UK) have been less accurate, with an average base calling accuracy of 90% compared to the 99.9% accuracy of the Illumina (San Diego, CA, USA) reads (Amarasinghe et al., 2020; Shendure et al., 2017). Hybrid assembly pipelines have often been used to assemble many genomes, aiming to overcome the shortcomings of both the long reads and short reads. This has allowed assembly of larger contigs from complex genomes. However, to achieve chromosome level genome assembly, scaffolding of the contigs was usually required. Analysis of sequence proximity in the chromatin by methods such as Hi-C has made this possible (Dudchenko et al., 2017; Kaplan and Dekker, 2013).
Recent advances in long read sequencing technology have allowed a single molecule to be sequenced multiple times to produce long high fidelity reads (HiFi; PacBio) with a base level accuracy of 99.9% (Wenger et al., 2019). We have used Macadamia jansenii to compare methods for the sequencing and assembly of plant genomes (Murigneux et al., 2020; Sharma et al., 2021a). This genome has a size (approximately 800 Mb) typical of many plant genomes but with a relatively low heterozygosity (Sharma et al., 2021b). Assembly of this genome from highly accurate circular consensus sequencing (CCS) reads (HiFi; PacBio) using the HiFiasm assembly tool (Cheng et al., 2021) was found to give a more contiguous genome than that obtained with the earlier, longer continuous long reads (CLR; PacBio) (Sharma et al., 2021a). The HiFiasm assembler has been used to successfully assemble genomes of Fragaria × ananassa (garden strawberry), Rana muscosa (mountain yellow-legged frog) and Sequoia sempervirens (California redwood) (Cheng et al., 2021). Recently, HiFiasm was reported to allow highly contiguous assembly of plant genomes (Driguez et al., 2021). We now report the near complete chromosome level assembly of the M. jansenii genome from HiFi reads with the HiFiasm assembly tool, as well as an analysis of the assembled genome against a Hi-C chromosome level assembly.

HiFiasm assembly
The estimated genome size of the M. jansenii genome is 780 Mb (Murigneux et al., 2020). The size of the primary HiFiasm assembly was 826 Mb, including 779 contigs (Table S1), with the longest contig of 71.9 Mb and an average contig length of 1 Mb. BUSCO analysis (https://busco.ezlab.org) showed that the assembly covered 99.6% of universal single copy genes (Table 1). The contigs generated in this assembly were characterized in three groups based upon their size: large contigs (>1 Mb); medium size contigs (between 1 Mb and 100 kb) and small contigs (<100 kb).
Larger size contigs >1 Mb. There were 30 contigs greater than 1 Mb in length. These contigs alone provided a good assembly with an N50 of 46 Mb and a BUSCO score of 99.1% (Table 1). Dotplot analysis against the Hi-C assembly (Sharma et al., 2021b) showed that, of the nine contigs more than 46 Mb in length, eight corresponded to complete Hi-C pseudomolecules (i.e. each contig corresponds to a single chromosome; chromosomes 1, 4, 5, 6, 10, 11, 13 and 14) (Figure 2a). One contig (Ptg000010|) corresponded to a large part of the second largest chromosome (chromosome 2) and another two contigs of approximately 25 and 2.7 Mb covered the other parts of this chromosome (Figure 2b and Tables 2 and 3). The 14 contigs between 4 and 46 Mb in size covered the remaining six chromosomes, in combinations of two to four contigs. Five of the contigs between 1 and 4 Mb in size corresponded to nuclear ribosomal RNA sequences, and the other two contigs matched parts of chromosomes 2 and 7 (Figure 2b and Tables 2 and 3).
Medium size contigs. There were 64 contigs between 1 Mb and 100 kb in size. These contigs had 0% BUSCO genes (Table 1). Only eight contigs in the range between 100 and 824 kb corresponded to seven Hi-C pseudo-molecules (with an alignment block length of more than 100 kb) (Figure 1b; Figures S2 and S3; Tables S2 and S3). Out of these eight contigs, five corresponded to the terminal part of the Hi-C pseudo-molecules and three corresponded to the nonterminal regions of Hi-C chromosomes 3 and 7, marked as red stars in Figure S2(a,b). Most of the medium size contigs corresponded to ribosomal RNA genes (Figure 5b) and one contig of 183 kb corresponded to a chloroplast assembly (Figure 3b). None of the contigs showed similarity with mitochondrial sequences (Figure 4b).
Smaller contigs. There were 685 contigs between 10 and 100 kb in size. Most of these small contigs from the HiFiasm assembly corresponded to small portions of the chloroplast and mitochondrial genomes. Aligned together, these contigs covered the complete organelle genomes (Figures 3c and 4c). However, a few contigs corresponded to nuclear ribosomal RNA sequences (Figure 5c). This contig set also showed 0% BUSCO genes (Table 1).

Influence of data volume
HiFiasm assemblies from CCS reads from two individual single molecule, real time (SMRT) sequencing cells and from the combined data are given in Table S1. A HiFiasm assembly generated from the 10× CCS data produced 4511 contigs with an assembly of 909 Mb and N50 of 0.38 Mb, whereas a larger CCS file with 18× coverage generated an assembly with fewer contigs (1058), a shorter assembly length (833 Mb) and an improved N50 of 4.4 Mb (Table S1). The 18× assembly was closer to the combined CCS assembly (and the Hi-C assembly) than the 10× assembly.
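The two quantities compared above, N50 and genome coverage, can be illustrated with a minimal Python sketch. The function names and the toy numbers are illustrative assumptions and are not taken from the study; N50 is computed here by the usual definition (the length of the contig at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the total assembly length).

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def coverage(total_read_bases, genome_size):
    """Approximate fold coverage as sequenced bases divided by genome size."""
    return total_read_bases / genome_size


# Toy contig lengths (hypothetical, not from Table S1)
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)

# Hypothetical read yield: 7.8 Gb of reads over a 780 Mb genome
toy_cov = coverage(7_800_000_000, 780_000_000)
```

In practice these values would normally be taken from an assembly report such as QUAST output rather than recomputed by hand.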
Haploid assembly details are given in Table 1. The haploid 1 assembly had a greater number of contigs than the haploid 2 assembly. The BUSCO results were similar for the two haploid and primary assemblies as all assemblies were relatively complete.

Comparison with Hi-C assembly
A dotplot analysis of 14 pseudo-molecules of M. jansenii Hi-C assembly against the HiFiasm assembly is shown in Figure 1. The dotplot of contigs >1 Mb in size showed a complete match of 25 contigs (out of total 30) with the 14 Hi-C pseudo-molecules (Sharma et al., 2021b) (Figure 1a). The remaining five large contigs did not contribute to the genome assembly. They were composed of nuclear ribosomal RNA sequences. Chromosomes 1, 4, 5, 6, 10, 11, 13 and 14 were covered by a single contig of the HiFiasm assembly (Figure 2a), two chromosomes (Chr 8 and 9) were covered by two contigs, chromosomes 2, 3 and 12 were covered by three contigs, and chromosome 7 was covered by four contigs (Figure 2b, Tables 2 and 3).
Analysis of the sequence at the ends of the HiFiasm contigs (Table 4) showed that the eight Hi-C pseudo-molecules (1, 4, 5, 6, 10, 11, 13 and 14) covered by single HiFiasm contigs had telomere repeats at both ends, except for pseudo-molecules 1 and 5, which had a telomere at one end and an 18S ribosomal RNA gene at the other end. The other two pseudo-molecules that were covered by two contigs (Chr 8 and Chr 9) had telomere sequences at one end of each contig. Chromosomes 2, 3 and 12 were covered by three contigs. In the case of chromosome 12, two contigs had telomere repeats at one end, indicating their position at the end of the chromosome. One had 5S RNA gene sequences at the other end, confirming the match with 5S RNA sequences on the end of the middle contig. Chromosome 3 (covered by three contigs) also had two contigs with telomere repeats, confirming their terminal position in the chromosome. Similarly, chromosome 7 (covered by four contigs) had telomere repeats at one end of two contigs, indicating their position at the end of the chromosome, with the other two contigs in the middle of the chromosome.
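The kind of end-of-contig check described above can be sketched in Python as follows. This is not the method used in the study (telomeres were identified with an online platform). The motif TTTAGGG and its reverse complement CCCTAAA (the common plant-type telomere repeat), the 1 kb window and the threshold of 10 copies are all assumptions chosen for illustration.

```python
# Assumed plant-type telomere repeat and its reverse complement
TELOMERE_MOTIFS = ("TTTAGGG", "CCCTAAA")


def telomere_ends(seq, window=1000, min_copies=10):
    """Return (start_has_telomere, end_has_telomere) for one contig.

    An end is flagged when either motif occurs at least `min_copies`
    times (non-overlapping count) within the terminal `window` bases.
    """
    seq = seq.upper()

    def has_repeat(region):
        return any(region.count(motif) >= min_copies for motif in TELOMERE_MOTIFS)

    return has_repeat(seq[:window]), has_repeat(seq[-window:])


# Toy contigs (hypothetical sequences)
both_ends = "CCCTAAA" * 20 + "ACGT" * 500 + "TTTAGGG" * 20
one_end = "CCCTAAA" * 20 + "ACGT" * 500

both_result = telomere_ends(both_ends)
one_result = telomere_ends(one_end)
```

A contig flagged at only one end, as for the toy `one_end` sequence, would be a candidate for joining to a neighbouring contig or for inspection of the sequence at the unflagged end (for example, for ribosomal RNA genes).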

Organelle genome analysis
Dotplot analysis of a 159 kb full length chloroplast genome assembled using the GetOrganelle toolkit (Jin et al., 2020) against the HiFiasm genome assembly indicated the insertion of small fragments of chloroplast sequences in the nuclear genome assembly (Figure 3a; Figure S1A), which also align with previously reported Hi-C assembly results (Sharma et al., 2021b) (Figure S1B). Among the middle size contig set, only one contig (ptg0000186|) of 183 kb aligned with the chloroplast genome (Figure 3b). Contig ptg000186| covered the complete chloroplast genome including the two inverted repeat regions of the chloroplast (Figure S4). Another HiFiasm middle size contig, ptg000066|, also showed some similarity with the chloroplast assembly and also aligned with the terminal end of Hi-C chromosome 14 (Figure S5). Analysis of the smaller size contigs showed that the majority of these contigs contained some fragments of the chloroplast assembly (Figure 3c).
Mitochondrial sequence analysis revealed that the size of the de novo mitochondrial assembly was 351 kb. Analysis against the HiFiasm assembly indicated the presence of mitochondrial sequences in the smallest set of contigs. The majority of these contigs cover small fragments of the mitochondrial genome (Figure 4c), whereas, in the larger contig set (>1 Mb), only a few contigs showed some similarity with mitochondrial sequences. These represent the mitochondrial sequences inserted in the nuclear genome (Figure 4a), which aligns with the dotplot result of the Hi-C assembly (Figure S1B[b]). The middle size contigs did not show the presence of any mitochondrial sequences in the dotplot analysis (Figure 4b).

Nuclear ribosomal RNA gene sequence analysis
Dotplot analysis of nuclear ribosomal RNA sequences showed matches with the majority of the middle size contigs, with a small number of contigs from the smaller set of contigs having ribosomal RNA sequences (Figure 5b,c).

Analysis of repeat elements
The HiFiasm contigs were longer than the corresponding Hi-C pseudomolecules (Table 3). This is probably because the HiFiasm contigs included a larger proportion of repetitive elements than the corresponding Hi-C pseudomolecules (Table 5). The longer chromosomes generally had a higher content of repetitive elements, suggesting that the presence of these repeat regions explained their greater size. The HiFiasm assemblies included more repetitive elements in the larger chromosomes but a lower repeat content in the smaller chromosomes, largely as a result of the inclusion of fewer unclassified repeats in the HiFiasm assemblies of the smaller chromosomes.

DISCUSSION
This era of genomics is continuing to advance with improved sequencing technologies and the potential to sequence all recorded species on earth (Lewin et al., 2018). Accurate chromosome level genome assembly requires accurate reads, high genome coverage and long read length. This has typically involved the use of very high coverage and data from multiple sequencing platforms, along with Hi-C mapping, to achieve chromosome level assemblies. However, the combination of high sequence accuracy and long read length in HiFi reads (99.8% accuracy at around 15 kb average length) provides the option to assemble a complete genome using a single sequencing technology (Cheng et al., 2021) and with a more readily obtainable genome coverage (Wenger et al., 2019).
In the present study, we have combined the benefit of highly accurate reads with an improved assembly tool, HiFiasm (Cheng et al., 2021). HiFi read genome coverage of 28-40×, for plant genomes within the range of 700-1000 Mb in size, was sufficient to generate high quality assemblies with Mb contig sizes (Sharma et al., 2021b). The DNA extracted from M. jansenii may have contained some impurities that reduced the efficiency of the DNA sequencing. Two SMRT cells were required to generate 28× genome coverage with CCS reads. For some samples, a single run may provide the required coverage if sufficient DNA purity is achieved, reducing the cost of obtaining sufficient sequence. When the two individual CCS runs of 10× and 18× were assembled separately using HiFiasm, the final assemblies were very fragmented (N50 of 0.38 and 4.4 Mb, respectively) for M. jansenii (Table S1), whereas the combined 28× gave a highly contiguous assembly with an N50 of 46.1 Mb and a BUSCO score of 99.6%. The combined CCS run result suggests that, if the isolation method resulted in high purity DNA, a single run with less coverage may be sufficient to assemble the genome. The higher base-calling accuracy of HiFi reads improves assembly accuracy and bypasses many time-consuming and computationally demanding steps in the assembly workflow. The M. jansenii assembly from HiFiasm using HiFi sequencing data produced a near chromosome level assembly, with eight contigs covering eight complete Hi-C pseudo-molecules and the other six chromosomes covered by a total of 17 contigs (two to four contigs per chromosome). Chromosomes 1 and 5 had 18S ribosomal RNA genes at one end, suggesting that these repeats near the end of the chromosome had prevented assembly to the telomere. Chromosome 12 was interrupted by 5S ribosomal RNA genes.
For a plant with a genome of approximately 800 Mb, we estimate a high-quality chromosome level assembly could be produced within 1 week from the plant material, if the DNA extraction step is well established. This highly contiguous M. jansenii chromosome level assembly will help achieve a better understanding of the genome of macadamias. All four species of Macadamia are listed as threatened under Australian legislation (Mast et al., 2008), although M. jansenii is particularly endangered because of its very low population size (<200 plants in the wild) (Shapcott and Powell, 2011). The highly accurate genome assembly will facilitate its conservation and use in breeding. Macadamia jansenii has small inedible nuts (Gross and Weston, 1992); however, as a result of its small tree size and narrow root spread, it is being tested as a rootstock and in hybrids with the commercial species Macadamia integrifolia (Alam et al., 2018). The HiFiasm assembly (BUSCO 99%) is more complete than the Hi-C assembly (BUSCO 97%) (Sharma et al., 2021b), suggesting the incorporation of some regions missing in the Hi-C assembly.
The initiative to complete the genome assembly of almost all living organisms (Koepfli et al., 2015; Lewin et al., 2018) requires a highly efficient assembly method with sustainable financial, computational and time requirements without compromising on genome accuracy. Contiguity and completeness should be taken into consideration (Rhie et al., 2021). Our analysis suggests that HiFiasm assembly with HiFi reads may require almost no further scaffolding for plants with a similar genome size of approximately 800 Mb. Analysis of the nature of the few remaining regions of the genome that are not assembled in these analyses may allow the development of targeted strategies to complete these assemblies. Analysis of the sequences at the ends of the contigs formed by HiFiasm assembly of HiFi reads may identify those contigs that have been interrupted by repetitive sequences that cannot be assembled de novo. This technology is successfully assembling regions with high levels of the repeat sequences that make up more than 50% of the M. jansenii genome (Sharma et al., 2021b). It may be that the very high accuracy of the HiFi reads detects minor variations in repeat sequences that allow their unique assembly and that only perfect repeats that are longer than the HiFi reads create a barrier to assembly. The present study suggests that more than half of the total chromosomes could be assembled telomere to telomere for plants with a genome size of approximately 800 Mb, whereas plants with larger genome sizes may require some additional methods for complete assembly. Assemblies of larger genomes have been shown to require a higher level of coverage with long read data to achieve the same size of assembled contigs (Sharma et al., 2021a). The chromosomes covered by more than one contig have some end sequences that indicate how they should be connected to other contigs.
The present study also suggests that the large ribosomal gene clusters in the genome of plants may be one of the few limitations to complete assembly. This would suggest that sequence analysis of the ends of contigs could be used to guide high level assembly of the genome. However, additional information may be required for plants with very large and complex genomes. This approach will be useful for generating high quality de novo chromosome level assemblies of plant genomes, especially for laboratories with limited financial, technical and computational resources.

HiFiasm assembly
The HiFiasm genome assembly (Cheng et al., 2021) was generated using the High Performance Computing facility at the University of Queensland. For assembly, 24 core processing units and 120 GB of memory were employed. Default settings of the HiFiasm assembler were used to assemble heterozygous genomes with built-in duplication purging parameters. The HiFiasm output directory contains GFA graph files for two haploid assemblies (1 and 2), one primary contig assembly and one alternate haplotig assembly. Each haplotig GFA file and the primary contig GFA file were converted to FASTA format using the awk command.
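The GFA to FASTA conversion step can be illustrated with a short Python sketch. The exact awk command used is not given in the text, so this is an equivalent written under the assumption that each contig is stored as a GFA segment record (a tab-separated line starting with `S`, followed by the contig name and its sequence). The function name and the toy records are hypothetical.

```python
import io


def gfa_to_fasta(gfa_lines):
    """Yield FASTA lines (header, then sequence) for each GFA segment record."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Segment records look like: S <name> <sequence> [optional tags]
        if len(fields) >= 3 and fields[0] == "S":
            yield f">{fields[1]}"
            yield fields[2]


# Toy GFA content (hypothetical contig names and sequences)
example_gfa = io.StringIO(
    "S\tptg000001l\tACGTACGT\tLN:i:8\n"
    "A\tptg000001l\t0\t+\tread1\t0\t8\tid:i:0\n"
    "S\tptg000002l\tGGCC\tLN:i:4\n"
)
fasta_text = "\n".join(gfa_to_fasta(example_gfa))
print(fasta_text)
```

For a real assembly, the same function could be applied to an open file handle instead of the in-memory example, writing the yielded lines to a FASTA file.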

Analysis of assembly
The primary HiFiasm assembly of M. jansenii included 779 contigs that were categorised into three subsets: (i) contigs >1 Mb in size; (ii) contigs <1 Mb and more than 100 kb in size; and (iii) contigs <100 kb in size. Along with the main primary and two haploid assemblies, all three primary contig subsets were analysed using QUAST (Gurevich et al., 2013), BUSCO (Simão et al., 2015) and REPEATMODELER (Humann et al., 2019). The telomere sequences in the HiFiasm contigs were identified using the BIOSERF platform (https://bioserf.org) (Somanathan and Baysdorfer, 2018). Ribosomal RNA and other protein coding genes at the terminal end of the HiFiasm contigs were identified using an NCBI BLAST search (https://blast.ncbi.nlm.nih.gov). Ribosomal RNA in the contigs was identified using Barrnap (https://github.com/tseemann/barrnap) (Seemann, 2013) with default settings for eukaryotes.
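The partition of contigs into the three size subsets can be sketched in Python as follows. The thresholds of 1 Mb and 100 kb come from the text; the treatment of contigs exactly at a boundary, the function names and the toy contig lengths are assumptions made for illustration.

```python
def size_class(length_bp):
    """Assign a contig to a size subset from its length in base pairs."""
    if length_bp > 1_000_000:
        return "large"   # >1 Mb
    if length_bp >= 100_000:
        return "medium"  # 100 kb to 1 Mb (boundary handling is an assumption)
    return "small"       # <100 kb


def partition_contigs(contig_lengths):
    """Group contig names by size subset from a {name: length} mapping."""
    subsets = {"large": [], "medium": [], "small": []}
    for name, length in contig_lengths.items():
        subsets[size_class(length)].append(name)
    return subsets


# Toy contig lengths (hypothetical names and values)
toy_contigs = {
    "contig_a": 71_900_000,
    "contig_b": 183_000,
    "contig_c": 25_000,
}
toy_subsets = partition_contigs(toy_contigs)
```

Each subset could then be written to its own FASTA file and passed separately to the assembly assessment tools.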

Comparison with Hi-C assembly
The HiFiasm contigs were compared with the 14 M. jansenii pseudo-molecules from the Hi-C assembly (Sharma et al., 2021b) using the online interactive D-Genies dotplot tool (Cabanettes and Klopp, 2018), which compares two genomes using Minimap2 alignments. Dotplot images were created after selecting the 'sort contigs' option, setting the 'minimum identity' parameter to 0.75 and checking the 'strong precision' tick box.
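A comparable filter can be applied offline to Minimap2 output in PAF format, as in the Python sketch below. This is not the D-Genies implementation. It assumes that identity can be approximated as the number of matching residues (PAF column 10) divided by the alignment block length (column 11), and it combines the 0.75 identity threshold with the 100 kb block length used elsewhere in the text. The function name and the toy alignment records are hypothetical.

```python
def filter_paf(lines, min_identity=0.75, min_block=100_000):
    """Keep PAF alignments passing identity and block-length thresholds.

    Returns a list of (query, target, block_length, identity) tuples.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        matches, block = int(fields[9]), int(fields[10])
        if block == 0:
            continue
        identity = matches / block
        if block >= min_block and identity >= min_identity:
            kept.append((fields[0], fields[5], block, identity))
    return kept


# Toy PAF records (hypothetical contigs, chromosomes and coordinates)
toy_paf = [
    "contigA\t500000\t0\t200000\t+\tchr3\t50000000\t1000\t201000\t190000\t200000\t60",
    "contigB\t150000\t0\t50000\t+\tchr7\t40000000\t5000\t55000\t49000\t50000\t60",
    "contigC\t300000\t0\t200000\t-\tchr1\t70000000\t0\t200000\t100000\t200000\t10",
]
kept_alignments = filter_paf(toy_paf)
```

In this toy input only the first record is retained: the second has a block shorter than 100 kb and the third has an identity of 0.5.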

Characterisation of organelle genomes content of HiFiasm contigs
A reference mitochondrial genome, chloroplast genome and nuclear ribosomal RNA sequence from this sample were assembled from Illumina raw reads (Murigneux et al., 2020) using the GetOrganelle toolkit (Jin et al., 2020) with default parameters. The HiFiasm contigs (779) were compared with the organellar and ribosomal sequences in dotplots.

Figure S2. (a) Dotplots of Hi-C pseudo-molecules against HiFiasm contigs (longest contigs >1 Mb). (b) Dotplots of Hi-C pseudo-molecules against HiFiasm contigs (longest and middle size contigs).
Figure S3. (a) Dotplots of Hi-C pseudo-molecules against HiFiasm contigs (longest contigs >1 Mb). (b) Dotplots of Hi-C pseudo-molecules against HiFiasm contigs (longest and middle size contigs).
Figure S4. Chloroplast assembly covered by a single HiFiasm contig (Ptg0000186|) and small bits by Ptg000066|.
Figure S5. Chloroplast sequence (Ptg0000186| and Ptg000066|) insertions in the Hi-C assembly.
Table S1. IPA and HiFiasm assembly from different volumes of sequence data.
Table S2. HiFiasm contigs (<1 Mb and >100 kb) that are part of Hi-C pseudo-molecule assembly.