Improving Illumina assemblies with Hi‐C and long reads: An example with the North African dromedary

Abstract Researchers have assembled thousands of eukaryotic genomes using Illumina reads, but traditional mate‐pair libraries cannot span all repetitive elements, resulting in highly fragmented assemblies. However, both chromosome conformation capture techniques, such as Hi‐C and Dovetail Genomics Chicago libraries and long‐read sequencing, such as Pacific Biosciences and Oxford Nanopore, help span and resolve repetitive regions and therefore improve genome assemblies. One important livestock species of arid regions that does not have a high‐quality contiguous reference genome is the dromedary (Camelus dromedarius). Draft genomes exist but are highly fragmented, and a high‐quality reference genome is needed to understand adaptation to desert environments and artificial selection during domestication. Dromedaries are among the last livestock species to have been domesticated, and together with wild and domestic Bactrian camels, they are the only representatives of the Camelini tribe, which highlights their evolutionary significance. Here we describe our efforts to improve the North African dromedary genome. We used Chicago and Hi‐C sequencing libraries from Dovetail Genomics to resolve the order of previously assembled contigs, producing almost chromosome‐level scaffolds. Remaining gaps were filled with Pacific Biosciences long reads, and then scaffolds were comparatively mapped to chromosomes. Long reads added 99.32 Mbp to the total length of the new assembly. Dovetail Chicago and Hi‐C libraries increased the longest scaffold over 12‐fold, from 9.71 Mbp to 124.99 Mbp and the scaffold N50 over 50‐fold, from 1.48 Mbp to 75.02 Mbp. We demonstrate that Illumina de novo assemblies can be substantially upgraded by combining chromosome conformation capture and long‐read sequencing.


| INTRODUC TI ON
Technlogical advances in sequencing have enabled researchers to assemble thousands of eukaryotic genomes. More than 82% of the ~4,300 eukaryotic genomes in the National Center for Biotechnology and Information (NCBI) Assembly database with assembly reports have been assembled using short-and/or long-insert (mate-pair) libraries sequenced with Solexa/Illumina's "Sequencing-By-Synthesis" technology (Bonetta, 2006;Kitts et al., 2016; Table   S1). These genomes are typically in a draft form, consisting of tens of thousands of scaffolds that comprise the majority of the assembly. Long-insert libraries greater than 8 Kbp are needed to span long interspersed nuclear elements (LINEs), some of the most common repetitive elements in eukaryotic genomes (Sotero-Caio, Platt, Suh, & Ray, 2017;Treangen & Salzberg, 2011;van Heesch et al., 2013).
Newer high-throughput laboratory methods are beginning to overcome the limitations of traditional long-insert libraries, and these new libraries can extend across repetitive regions enabling the scaffolding and ordering of previously unscaffolded contigs.
One method, Hi-C, is a type of chromosome conformation capture or proximity ligation method. The method involves DNA regions that are in close proximity three-dimensionally and cross-linked in vivo, digested with restriction enzymes, repaired with biotinylated residues, and ligated together resulting in DNA regions that have various chromatin interactions but are located close together on the same synthetic molecule. These synthetic molecules can then be sheared, enriched for interacting regions using streptavidin beads, and ultimately sequenced using Illumina short-insert libraries in a higher-throughput fashion compared to laborious bacterial artificial chromosome (BAC) and fosmid end sequencing (Lieberman-Aiden et al., 2009). Resulting Hi-C paired-end reads can be mapped to de novo genome assemblies and used to scaffold and order contigs, creating super scaffolds in the size range of chromosomes (Bickhart et al., 2017;Dudchenko et al., 2017;Kaplan & Dekker, 2013;Korbel & Lee, 2013). Another proximity ligation method, Chicago libraries, has much in common with Hi-C except Chicago libraries are constructed in vitro (Putnam et al., 2016) and are available as a commercial service from Dovetail Genomics (Santa Cruz, California, USA). Both Hi-C and Dovetail Chicago libraries have been successfully used in creating super scaffolds to improve the continuity of numerous eukaryotic genomes (Kaplan & Dekker, 2013;Korbel & Lee, 2013;Moll et al., 2017).
Eukaryotic genome assemblies have been further enhanced by long-read sequencing technologies from Pacific Biosciences (PacBio, Menlo Park, CA, USA; Eid et al., 2009) and Oxford Nanopore Technologies (Oxford Nanopore, Oxford, UK; Venkatesan & Bashir, 2011). These technologies generate much longer sequences, but raw reads have higher error rates and are more prone to insertions/deletions (indels) than Illumina reads (Jain, Olsen, Paten, & Akeson, 2016;Salmela & Rivals, 2014). Both Oxford Nanopore and PacBio overcome the problems of errorprone raw reads by generating a consensus sequence either on the level of the instrument whereby DNA molecules in PacBio sequencers are read multiple times (i.e., circular consensus sequences) or after the sequences have been generated by PacBio or Oxford Nanopore sequencers. Ultimately, these long-read technologies generate longer sequences that can span repetitive regions, enabling the assembly of longer contigs that can be later error corrected and/or scaffolded into high-quality eukaryotic genome assemblies using traditional long-insert, Hi-C, or Dovetail Chicago libraries (Jiao et al., 2017;Miller et al., 2017;Passera et al., 2018).
High-quality genomes resulting from long reads and/or Hi-C libraries have improved gene sequence completeness for evolutionary studies and can be used to understand what genetic variation influences phenotypic traits to benefit evolutionary ecology and selective breeding. For example, the latest assembled goat genome has taken advantage of long-read sequences and Hi-C libraries in its assembly (Bickhart et al., 2017). Additionally, long-read sequence assembly of great ape genomes facilitated high-resolution comparative analyses between humans and great apes (Kronenberg et al., 2018).
Genetic variation that influences phenotypic traits can be identified with genome-wide association studies (GWAS) that benefit from more contiguous assemblies. Contiguous assemblies have more variants per scaffold, which can improve genotype imputation (i.e., filling in missing genotype data) in GWAS (Davies, Flint, Myers, & Mott, 2016), and contiguous assemblies also permit searching genomes for genes located nearby variants that are significantly associated with phenotypes. For example, GWAS with contiguous genome assemblies identified candidate loci associated with tuberculosis susceptibility and recombination hot spots in wild boar and Soay sheep, respectively (Johnston, Bérénos, Slate, & Pemberton, 2016;Queirós, Alves, Vicente, Gortázar, & de la Fuente, 2018). GWAS and genomic selection are now routine approaches to improve selective breeding in agriculture and horticulture, for example to investigate beef and milk production traits in cattle (Sorbolini et al., 2017;Yue et al., 2017) and growth and fatness in pigs (Guo et al., 2017).
A contiguous genome assembly is not available for the dromedary (Camelus dromedarius), an important livestock species especially for dry and marginal ecoagricultural parts of the world. Dromedaries are among the last livestock species to have been domesticated, only around 3,000 years ago (Almathen et al., 2016;Uerpmann & Uerpmann, 2012). Traditionally, they have been bred as multi-purpose animals (Abdussamad, Charruau, Kalla, & Burger, 2015), for milk, meat, hides and wool, and for endurance and transport; only recently stronger selection has begun for fast, narrow-bellied racing camels (Faye, Abdallah, Almathen, Harzallah, & Al-Mutairi, 2011).
Thus, dromedaries present a very interesting model to study the "initial stages" of domestication, where potential signals of selection for tameness and tolerance of humans are not overlaid by stronger signatures of artificial selection for economic traits, as seen in specific meat and milk breeds from other livestock. In terms of evolutionary significance, dromedaries form together with their sister taxa, the domesticated Bactrian camel (Camelus bactrianus) and the highlyendangered wild two-humped camels (Camelus ferus), the tribe of Camelini (Old World camels). Next to the New World camels (Lamini) they are the only representatives of the suborder Tylopoda. Thus, dromedary breeders and evolutionary biologists would benefit from a high-quality dromedary reference genome. Although draft genome assemblies from North African and Arabian dromedaries have been established, respectively (Fitak, Mohandesan, Corander, & Burger, 2016;Wu et al., 2014); these genome assemblies are highly fragmented, and scaffolds are not assigned to chromosomes.
Here we describe our efforts to improve the North African dromedary reference genome. We used Chicago and Hi-C sequencing libraries from Dovetail Genomics to resolve the placement and order of previously de novo assembled contigs from Illumina shortand long-insert libraries (Fitak et al., 2016), producing almost chromosome-level scaffolds for which we filled in gaps using PacBio long reads, mapped scaffolds to chromosomes, and annotated the resulting assembly.

| Brief summary of the CamDro2 assembly process
We scaffolded the existing Illumina-only assembly (Fitak et al., 2016; GenBank accession: GCA_000803125.1) with Dovetail Chicago data, improved the Chicago assembly by scaffolding with Hi-C data, filled in gaps in the Hi-C assembly with PacBio reads, then filled in gaps and polished the assembly with Illumina reads used to de novo assemble GCA_000803125.1. An overview of the CamDro2 assembly process is given in Figure S1.

| The original North African dromedary genome assembly
The original North African dromedary genome assembly was created from a female dromedary "Waris" (Fitak et al., 2016; GenBank accession: GCA_000803125.1) owned by the First Austrian Camel Riding School, stemming from the Canary Islands with North African ancestry. Briefly, two types of Illumina libraries were generated and sequenced from DNA extracted from whole blood, which was collected commensally during veterinary diagnostic treatment with the owner's consent: 500 bp (short-insert, 100 bp paired-end reads) and 5 Kbp (long-insert/mate-pair, 50 bp paired-end reads) libraries.

| Dovetail Chicago and Hi-C libraries
Dovetail Genomics created Chicago and Dovetail Hi-C libraries from a low-passage cell culture line (Perelman, Pichler, Gaggl, & Larkin, 2018) derived from ear fibroblasts of the same dromedary "Waris" used in CamDro1. The fibroblast cells were retrieved from a diagnostic skin scraping for parasites, and the owner agreed on using the leftover material for research purposes. Dovetail Genomics created three Chicago and three Hi-C libraries with the DpnII restriction enzyme, sequenced these libraries on six lanes of an Illumina HiSeq sequencer, and then scaffolded the CamDro1 assembly using the HiRise pipeline (Putnam et al., 2016). To do so, first, the CamDro1 assembly was scaffolded using Dovetail Chicago data. The Chicago assembly was then improved by scaffolding with Hi-C data creating a Hi-C assembly.

| PacBio long-read sequencing
We extracted high molecular weight DNA from the same low-passage cell line used to create Dovetail Chicago and Hi-C libraries.
Briefly, the high molecular weight DNA was extracted by lysing freshly harvested cells in lysis buffer, followed by phenol chloroform extraction and precipitation. Throughout the whole extraction process, the DNA was manipulated gently to preserve high molecular weight. From this DNA, the Vienna BioCenter Core Facilities NGS Unit (Vienna, Austria, www.vbcf.ac.at) created a PacBio library for the PacBio Sequel sequencer and sequenced this library on five 1M v2 PacBio Sequel SMRT Cells using PacBio Sequel 2.1 sequencing reagents.

| Additional assembly steps
We used bamtools v. 2.5.0 (Barnett, Garrison, Quinlan, Strömberg, & Marth, 2011) to extract FASTQ sequences from PacBio Sequel subread BAM (binary alignment map) files. Because quality values for subreads from the PacBio Sequel are given a Phred quality score of 0, we artificially assigned a Phred score of 30 to all bases for input into pbjelly v. 15.8.24 (English et al., 2012) to fill in gaps in the Hi-C assembly. We polished the PBJelly assembly with pilon (see Supplementary Methods for settings) once again with the same error-corrected Illumina reads, fixing any SNPs and indels that were not accounted for in the first round of polishing but also filling in gaps. We refer to this as the CamDro2 assembly.

| K-mer analysis and dot plot
We compared 27-mers present in the error-corrected Illumina short-insert sequences and the CamDro2 assembly using kat v. 2.3.4 (Mapleson, Garcia Accinelli, Kettleborough, Wright, & Clavijo, 2017; see Supplementary Methods for settings) to evaluate the proportion of the sequencing reads, duplication rates, and heterozygosity present in the CamDro1 and CamDro2 assemblies.
To assess the level of disagreement between CamDro1 and CamDro2, we made a whole-genome alignment with minimap2 v. 2.15 (Li, 2018) using the "asm5" preset. We then used d-genies v. 1.2.0.1 (Cabanettes & Klopp, 2018) to generate a dot plot for the alignment by using the contig sorting function and filtering out matches with ≤0.001% dot plot width and identity ≤0.75.

| RNA-Seq mapping
To assess the quality of the new assembly, we aligned 10 sets of paired-end RNA-Seq reads (Alim et al., 2019) to the original assembly (CamDro1), to the new assembly (CamDro2), and to several controls: The 10 RNA-Seq data sets comprise a 2 × 2 factorial experiment: summer versus winter seasons and supraoptic nucleus (SON) versus neurointermediate lobe (NIL) brain tissues, with n = 3 replicates in each class. Tissue was homogenized and extracted using in Trizol/chloroform (ThermoFisher), and purified using the RNeasy MiniKit (Qiagen). The library template was prepared using a ribosome depletion protocol (Ribo-Zero Gold; Illumina) and libraries prepared using TruSeq Stranded protocol (Illumina). Samples were multiplexed into lane pools with an 8 pm concentration and sequenced (100 bp paired-end reads with an average 134 bp insert size) to a depth of > 35 million reads using an Illumina HiSeq 2500. Two of the 12 replicates were rejected for insufficient quality. We used tophat v. 2.0.9 (Kim et al., 2013) with default settings to align reads to each genome and report overall alignment rate (default output of Tophat) within each class.
We repeated blastn mapping of our V. pacos RH probe sequences to the Alpaca assembly to assign putative chromosome numbers to this assembly.

| Annotation
We annotated CamDro2 scaffolds greater than 10 Kbp with maker v.
After the second MAKER run, we only retained genes, transcripts, and proteins with AED ≤ 0.50. Next, we predicted putative  (Walker et al., 2014), and then filling in gaps with Illumina short-insert libraries using abyss sealer (Jackman et al., 2017), and polishing again but also filling in gaps with Pilon; and for comparison the Arabian dromedary assembly (Wu et al., 2014;GCA_000767585.1) gene function with diamond v. 0.9.19 (Buchfink, Xie, & Huson, 2015) searches against the UniProt/TrEMBL release 2018_04 database using an E value cutoff of 1e−6. We also mapped proteins predicted by MAKER against the same UniProt/TrEMBL database using diamond and generated a frequency polygon of the query sequence length (predicted proteins) divided by the subject sequence length (UniProt/ TrEMBL proteins) to assess if predicted proteins were truncated (query sequence length divided by the subject sequence length <1.0) due to uncorrected indels introduced by PacBio reads that might interrupt reading frames affecting protein translation (Watson & Warr, 2019).
We also generated 75 random sets of 250 transcripts with AED ≤ 0.25 to test the specificity and sensitivity of Augustus ab initio models used during the first and second MAKER runs.
As part of the assessment of CamDro2, we also annotated CamDro1 using the same MAKER settings and input files used for CamDro2's annotation. We then summarized annotations (i.e., length distributions of genes, mRNAs, exons, introns, and CDS [coding sequences]) with genome annotation generator (Geib et al., 2018).  Table S2). Dovetail Genomics'

| Dovetail Chicago and Hi-C libraries
HiRise pipeline generated a Hi-C linkage density plot between the Hi-C assembly and the Hi-C read pairs (Figure 1). Considering super scaffolds >1 Mbp are allocated in separate shading blocks, this suggests the Hi-C assembly is well assembled (Figure 1).

| Additional assembly steps
In the PBJelly assembly (i.e., Hi-C assembly plus PacBio reads), there were 34,504 gaps (74,277 fewer than the Hi-C F I G U R E 1 Dovetail Genomics' Hi-C linkage density plot for Hi-C reads mapped to the Hi-C assembly. X-and Y-axes give the cumulative mapping positions of the first and second read in a read pair respectively, grouped into bins.  (Figure 2).

| K-MER analysis and dot plot
KAT k-mer analysis indicated a low proportion of sequencing data missing (i.e., black histogram bars) from both the CamDro1 ( Figure   S2a) and CamDro2 ( Figure S2b) assemblies, suggesting that most of the sequencing reads were accounted for in both assemblies.
Both assemblies were also mostly haploid (i.e., red bars) with low heterozygosity (Figures S2a and S2b; peak at k-mer multiplicity of 15 for black bars). The CamDro2 assembly had a lower proportion of missing sequencing data than the CamDro1 assembly indicated by the lower amount of black shading between k-mer multiplicity values 5 and 10, which is replaced by increased red shading at kmer multiplicity values near 1 (see panels below Figure S2a and b for magnified views).
The dot plot for the whole-genome alignment between CamDro1 and CamDro2 shows very good correspondence and agreement between the two assemblies with little to no structural variations ( Figure 3). We scoured the dot plot for signs of insertions, deletions, inversions, and repeats but could find very little evidence of structural variation even upon zooming into the plot. One example of structural variation between CamDro1 and CamDro2 is the 875 Kbp CamDro1 scaffold JWIN01032405.1, which was split and inverted relative to CamDro2 chromosome X ( Figure S3). JWIN01032405.1 was split by Dovetail Genomics' HiRise pipeline during scaffolding with Dovetail Chicago reads. We are not aware of other major structural variation, suggesting that synteny is likely conserved between CamDro1 and CamDro2.

| Chromosome mapping
Of 4,891 V. pacos RH probes, 3,005 had hits with E values ≤ 1e−30 to CamDro2 scaffolds. For each chromosome set of V. pacos RH probes, nearly all of the probes (96 ± 7.7%; mean ± SD; Table S4) had best hits to a single CamDro2 scaffold, thus we were able to assign at least one super scaffold to each of the 37 chromosomes except the Y chromosome as the dromedary used in CamDro1 and CamDro2 was female. Chromosomes are denoted by numbers 1-36 and X in the CamDro2 assembly. There were 101,628,251 bases in scaffolds not assigned to chromosomes accounting for 4.72% of the assembly.
We found strong correspondence between CamDro2 and Alpaca scaffolds through a dot plot of the whole-genome alignment ( Figure   S5). There are inversions in chromosomes 9, 16, 30, and 35 between the two assemblies ( Figures S6-S9). These findings suggest there is strong conservation of chromosomal arrangement among CamDro2 and Alpaca assemblies. In summary, we were able to assign chromosomes 1-36 and X to the Alpaca assembly (Table S5). We could not assign the Y chromosome to the Alpaca assembly as we do not have an alpaca RH probe chromosome set for the Y chromosome.

| Annotation
We predicted 22,534 genes that produced 34,024 proteins for the first MAKER run on the CamDro2 assembly, and there were 26,237 genes that produced 38,070 proteins for the second MAKER run on the CamDro2 assembly. There were 7.7% (1,730) of genes without an assigned annotation in the first MAKER run, whilst 21.5% (5,639) were unannotated in the second MAKER run. The Arabian dromedary assembly (NCBI Annotation Release 100) predicted 24,457 genes that produced 26,716 proteins. We assessed if predicted proteins were truncated due to uncorrected indels intro-

| Genome assembly
We were able to greatly improve the North African dromedary genome assembly by using a combination of chromosome conformation capture sequencing libraries for scaffolding, long reads to fill in gaps, and comparative chromosome mapping to assign super scaffolds to chromosomes. We demonstrate that data from existing Illumina de novo assemblies can be combined with the before-mentioned techniques to produce high-quality reference genomes.
Other genome assembly projects that began with Illumina shortand long-insert libraries have also taken advantage of chromosome conformation capture and/ or long-read technologies to improve assemblies. For example, the AllMis1 assembly (American alligator, Alligator mississippiensis) was assembled with Illumina short-insert libraries and scaffolded with mate-pair and BAC libraries (Green et al., 2014) and then subsequently improved with Dovetail Chicago libraries resulting in the AllMis2 assembly (Rice et al., 2017

| Genome annotation
There were more and longer annotation features and also more mRNAs per gene and exons and introns per mRNA in CamDro2 versus CamDro1 suggesting that CamDro2 has both improved assembly and annotation statistics relative to CamDro1. Details regarding our Augustus model choice are discussed in the Supplements (Appendix S1: Discussion).

| CON CLUS ION
The CamDro2 reference should be of great value to evolutionary biologists and the camelid genetics community, especially researchers interested mammalian comparative genomics and in designing RNA-Seq and GWAS experiments. In particular, the large scaffolds identified in CamDro2 will be useful in SNP imputation if hundreds of dromedaries are sequenced at low coverage using programs such as STITCH (Davies et al., 2016).

ACK N OWLED G EM ENTS
We greatly thank G. Gassner from the First Camel Riding School in Eithental, Austria for allowing leftover diagnostic material from Waris the dromedary to be used for genome sequencing and cell line creation. We also wish to thank J. analyses are available from Online Appendix S1: Methods S1 and also the Dryad repository. Note that annotations hosted by NCBI will differ from annotations on Dryad.