How complete are “complete” genome assemblies?—An avian perspective

The genomics revolution has led to the sequencing of a large variety of nonmodel organisms often referred to as “whole” or “complete” genome assemblies. But how complete are these, really? Here, we use birds as an example for nonmodel vertebrates and find that, although suitable in principle for genomic studies, the current standard of short‐read assemblies misses a significant proportion of the expected genome size (7% to 42%; mean 20 ± 9%). In particular, regions with strongly deviating nucleotide composition (e.g., guanine‐cytosine‐[GC]‐rich) and regions highly enriched in repetitive DNA (e.g., transposable elements and satellite DNA) are usually underrepresented in assemblies. However, long‐read sequencing technologies successfully characterize many of these underrepresented GC‐rich or repeat‐rich regions in several bird genomes. For instance, only ~2% of the expected total base pairs are missing in the latest chicken reference (galGal5). These assemblies still contain thousands of gaps (i.e., fragmented sequences) because some chromosomal structures (e.g., centromeres) likely contain arrays of repetitive DNA that are too long to bridge with currently available technologies. We discuss how to minimize the number of assembly gaps by combining the latest available technologies with complementary strengths. Finally, we emphasize the importance of knowing the location, size and potential content of assembly gaps when making population genetic inferences about adjacent genomic regions.

Why are (avian) genome assemblies of varying quality? To date, no sequencing technology exists that is capable of sequencing entire avian chromosomes from one end to the other in a single read (Figure 1a). Instead, short-read sequencing technologies produce sequence information ("reads"; usually in "read pairs") of some hundreds of base pairs (bp), and long-read sequencing technologies yield reads of some tens of thousands of bp (Figure 1b; Goodwin, McPherson, & McCombie, 2016). Similar to a jigsaw puzzle, these reads are then assembled into contiguous sequences ("contigs") and linked contigs ("scaffolds") (Yandell & Ence, 2012). Scaffolds thus consist of contigs (all nucleotides determined) and assembly gaps (placeholders of undetermined "N" nucleotides), the latter usually containing repetitive elements such as interspersed repeats (transposable elements and endogenous viruses) and tandem repeats (microsatellites and satellites; Figure 2; Chaisson, Wilson, & Eichler, 2015b; Thomma et al., 2016). Like a puzzle piece occurring multiple times in a single puzzle, repetitive elements are problematic for genome assembly because they contain ambiguous information about their exact position. If reads or read pairs are shorter than the repeat unit (an individual transposon or tandem repeat) and there are multiple identical repeat copies in the genome, this ambiguity will interfere with the assembly process and cause a loss of information (assembly gaps). This issue typically results in assembly gaps of known size (i.e., approximated by "N" nucleotides) when contigs are bridged into scaffolds by linkage information of read pairs (Figure 2, left; Chaisson et al., 2015b). On the other hand, repeat-rich regions (e.g., clusters of interspersed repeats or large arrays of tandem repeats) are usually not spanned by reads or read pairs at all and thus often lead to termination of scaffolds, that is, assembly gaps of unknown size (Figure 2, right; Chaisson et al., 2015b).
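To make the relationship between scaffolds, contigs and within-scaffold gaps concrete, the following minimal Python sketch splits a scaffold sequence at runs of "N" nucleotides. The function name `split_scaffold` and the parameter `min_gap` are hypothetical and chosen for illustration only; this is not the procedure used by any particular assembler, and real assemblies would be read from FASTA files rather than from a string literal.

```python
import re


def split_scaffold(seq, min_gap=1):
    """Split one scaffold into contigs and within-scaffold gap lengths.

    A gap is any run of at least `min_gap` undetermined "N" nucleotides
    (hypothetical threshold, chosen for illustration).
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    # Length of every run of N placeholders.
    gaps = [m.end() - m.start() for m in gap_pattern.finditer(seq)]
    # Fully determined stretches between the gaps; drop empty pieces
    # that arise when a scaffold starts or ends with N.
    contigs = [piece for piece in gap_pattern.split(seq) if piece]
    return contigs, gaps


# Toy scaffold: three contigs bridged by two gaps of approximate size.
scaffold = "ACGTACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TTAA"
contigs, gaps = split_scaffold(scaffold)
print(len(contigs), "contigs;", len(gaps), "gaps;", sum(gaps), "N nucleotides")
```

Note that this only captures within-scaffold gaps of approximate size. Between-scaffold gaps of unknown size leave no "N" placeholders and can only be inferred indirectly, for example from the excess of scaffolds over the haploid chromosome number.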
Nearly all currently available avian genome assemblies were generated using short-read sequencing (mostly using the Illumina platform; Kapusta & Suh, 2017). Considering that one can expect a positive correlation between read length and the ability to assemble individual repeat units or repeat-rich regions, we hypothesized that currently published avian genomes based only on short-read sequencing contain significant amounts of missing DNA (i.e., the sum of all DNA hidden in assembly gaps as defined in Figure 2). Although the "true" genome sizes of birds cannot be determined precisely, at least as long as read lengths are shorter than individual chromosomes, genome sizes of hundreds of bird species have been approximated through flow cytometry (Gregory, 2017) and we consider these estimates to be entirely independent of genome assembly sizes. We therefore quantified the amount of missing DNA and the numbers of assembly gaps by comparing assembly summary statistics, flow cytometric genome size estimates and haploid chromosome numbers. While we cannot determine whether these genome size estimates might be biased by genomic properties (such as a higher GC or repeat content) because "true" genome sizes are unknown, we note that the comparison of haploid chromosome numbers vs.
FIGURE 1 Currently available genomics technologies. (a) Schematic illustration of the data structure of these technologies produced from a hypothetical input DNA molecule. Short reads come in read pairs, long reads as single reads, linked-read clouds (LRC) as short-read pairs with a unique barcode (red asterisk) for each input molecule. Optical maps (OM) contain physical distances between short sequence motifs, and Hi-C maps are short-read pairs of 3D genome interactions obtained through chromatin conformation capture.
[...] Zhang et al. (2014), we were able to obtain genome size and karyotype data for 13 species (Table 1), which span most of the major groups within Neoaves, Galloanserae and Palaeognathae (sensu Jarvis et al., 2014; Suh, 2016).
FIGURE 2 Schematic illustration of how repetitive elements may cause gaps in short-read genome assemblies. (a) Interspersed elements (IRs; transposable elements or endogenous viruses, both in blue) can lead to within-scaffold gaps of approximate size (left) or between-scaffold gaps of unknown size (right). (b) Tandem repeats (TRs; microsatellites in red or satellites in orange) can lead to within-scaffold gaps (left; alternatively, a muted gap, i.e., a sequence contraction) or between-scaffold gaps.
Table footnotes: [...] (2017) and Christidis (1990). Genome size estimates were converted from C-values into billion base pairs (Gb) assuming 1 pg = 0.978 Gb (Doležel, Bartoš, Voglmayr, & Greilhuber, 2003). Missing DNA is given as the percentage of the expected genome size either missing in the assembly or assembled as "N" nucleotides.
In a "complete" assembly, the number of scaffolds (or ideally, contigs) should equal the haploid chromosome number. However, the haploid chromosome number of the sampled birds ranges from 25 to 54 and the number of scaffolds ranges from approximately 21,000 to 346,000 (mean 112,000 ± 91,000; Table 1). Therefore, there are tens of thousands to hundreds of thousands of gaps between scaffolds (i.e., of unknown size) in these genome assemblies (Table 1). Furthermore, there are significant amounts of within-scaffold gaps, given that the number of undetermined "N" nucleotides ranges from approximately 6 to 49 million base pairs (Mb; mean 25 ± 15 Mb; Table 1). We next calculated the total amount of missing DNA by subtracting the assembly size from the flow cytometric genome size estimate and adding the number of "N" nucleotides in the assembly. The estimates range from approximately 7% to 42% missing DNA (mean 20 ± 9%). Even the lowest estimate is a significant proportion considering that analyses based on such genome assemblies are often referred to as "whole-genome" or "genomewide" analyses.
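The missing-DNA calculation described above (expected genome size minus assembly size, plus the number of "N" nucleotides) can be sketched in a few lines of Python. The function name `missing_dna` and the input values are hypothetical and serve only to illustrate the arithmetic; they are not taken from Table 1. The only constant from the text is the conversion 1 pg = 0.978 Gb (Doležel et al., 2003).

```python
PG_TO_GB = 0.978  # 1 pg of DNA = 0.978 Gb (Doležel et al., 2003)


def missing_dna(c_value_pg, assembly_size_bp, n_count_bp):
    """Return (missing base pairs, missing percentage of expected size).

    c_value_pg:       flow cytometric genome size estimate (C-value, pg)
    assembly_size_bp: total assembly length, including "N" nucleotides
    n_count_bp:       number of "N" nucleotides in the assembly
    """
    expected_bp = c_value_pg * PG_TO_GB * 1e9
    # DNA absent from the assembly plus DNA present only as placeholders.
    missing_bp = expected_bp - assembly_size_bp + n_count_bp
    return missing_bp, 100.0 * missing_bp / expected_bp


# Hypothetical bird: C-value 1.25 pg, 1.05 Gb assembly containing 25 Mb of N.
missing_bp, missing_pct = missing_dna(1.25, 1.05e9, 25e6)
print(f"{missing_bp / 1e6:.1f} Mb missing ({missing_pct:.1f}% of expected)")
```

Because the flow cytometric estimate is itself an approximation, the resulting percentage is best read as an estimate rather than an exact measurement.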
Note that hundreds of gaps are still unresolved in the human genome (Table 2), which is arguably the best vertebrate genome assembly available (Chaisson et al., 2015a). Even the well-curated reference genomes of important model organisms, such as Drosophila melanogaster and Arabidopsis thaliana, still contain missing DNA (Table 2). One may argue that this missing DNA almost entirely consists of repetitive DNA and is outside the scope or interest of most (avian) genomics studies. However, simply ignoring assembly gaps "may lead to false positives and over-optimistic findings," as shown in Domanska, Kanduri, Simovski, and Sandve (2018), where depletion of mapped reads in gap regions biased the inference of colocalization of genomic features. We currently lack a comprehensive understanding of the functional relevance of repetitive DNA even in the most-studied model organisms such as humans (Cordaux & Batzer, 2009; Kellis et al., 2014; Koonin, 2016) and Drosophila (Gallach, 2015; Joshi & Meller, 2017; Zhou et al., 2013); thus, it might be premature to label these regions as completely irrelevant in birds. Furthermore, short-read sequencing is known to be biased against highly GC-rich sequences, meaning that these will be largely underrepresented in the resulting assembly (Chaisson et al., 2015b). This problem might be particularly pronounced in birds because their smallest chromosomes ("microchromosomes") are highly GC-rich (Burt, 2002). It is therefore conceivable that many genes and other functionally important regions are hidden in the missing DNA due to their repetitiveness and/or nucleotide composition. Indeed, a growing number of studies suggest that many genes previously declared as "missing" in bird genomes were in fact just "missed" due to their GC richness (Bornelöv et al., 2017; Botero-Castro, Figuet, Tilak, Nabholz, & Galtier, 2017; Hron, Pajer, Pačes, Bartůněk, & Elleder, 2015).
Overcoming the issue of GC underrepresentation requires long-read sequencing data (Chaisson et al., 2015b) or modified protocols for short-read library preparation (Tilak, Botero-Castro, Galtier, & Nabholz, 2018).
To further quantify missing DNA, we next analysed the genome assemblies of chicken and zebra finch (Table 3), two avian model systems where considerable efforts combining conventional Sanger sequencing, bacterial artificial chromosome libraries and cytogenetic methods were used to build chromosome models (Hillier et al., 2004; Warren et al., 2010). Thanks to the combination of all these techniques (including Sanger read lengths longer than those in short-read sequencing), these genome assemblies have lower amounts of missing DNA than the aforementioned short-read assemblies, but nevertheless contain tens of thousands of gaps between scaffolds (Table 3). [...] (Table 3), the chicken long-read genome assembly (version galGal5) is the most complete and facilitates a direct comparison to the previous chicken Sanger genome (version galGal4). Strikingly, the total amount of missing DNA decreased from 14.1% to 2.4% and the number of "N" nucleotides decreased from approximately 58 to 12 Mb (Table 3). The total number of scaffolds is very similar between the galGal5 and galGal4 assemblies (approximately 23,000), despite the significant increase in assembly contiguity through long-read sequencing (Kapusta & Suh, 2017; Warren et al., 2017). It is likely that the high number of galGal5 scaffolds despite many closed gaps results from the fact that many GC-rich or repeat-rich regions (which were largely inaccessible with previous technologies) have been successfully sequenced and partially assembled, but remain unplaced on chromosomes (Warren et al., 2017).
This would explain why sequences belonging to the three smallest chicken microchromosomes (36, 37 and 38) have still not been confidently assigned (Warren et al., 2017).
Table footnotes: [...] Gregory (2017). Genome size estimates were converted from C-values into billion base pairs (Gb) assuming 1 pg = 0.978 Gb (Doležel et al., 2003).
TABLE 3 Quantification of missing DNA in four birds where both short-read (Illumina; except for Sanger in avian models) and long-read (PacBio) assemblies are available
a [...] (2017) and Christidis (1990). Genome size estimates were converted from C-values into billion base pairs (Gb) assuming 1 pg = 0.978 Gb (Doležel et al., 2003).
b Assembly metrics from Table S1 of Kapusta and Suh (2017), except for galGal4 (Hillier et al., 2004), hooded crow (Weissensteiner et al., 2017) and Anna's hummingbird + zebra finch PacBio (present study).
c Assembly size subtracted from expected genome size.
d Sum of all "N" nucleotides present in the genome assembly.
e Percentage of the expected genome size either missing in the assembly or assembled as "N" nucleotides.
f Although the zebra finch assembly taeGut2 contains 64 chromosome-level scaffolds, one of these ("chrUn") is a concatenation of 35,359 unanchored contigs separated by "N" gaps.
FIGURE 3 A road map for minimizing the number of assembly gaps using current technologies. Missing DNA is indicated by grey bars, interspersed repeats (IRs) are in blue, and tandem repeats (TRs) are in orange and red. OM: optical mapping; LRC: linked-read cloud sequencing; Hi-C: chromosome conformation capture.
There is already the chance to get a glimpse into particularly difficult-to-assemble gaps such as centromeres in humans (Jain et al., 2018b). For avian genomes, Weissensteiner et al. (2017) recently demonstrated that optical mapping data provide an indirect means to estimate the size and potential sequence content of some assembly gaps. They could anchor candidate centromeric tandem repeat arrays with array lengths of over a million base pairs into the hooded crow genome assembly and illustrate an effect on genetic diversity and differentiation between populations in adjacent genomic regions. This approach was of importance to chromosome 18 containing the previously identified "speciation island", a region of particularly high genetic differentiation between European hooded and carrion crow populations presumably involved in reproductive isolation (Poelstra et al., 2014).
Although chromosome 18 contains multiple assembly gaps, only the between-scaffold gap in the middle of the "speciation island" is large and contains a tandem repeat array which potentially is (part of) a centromere (Weissensteiner et al., 2017), showcasing the importance of incorporating information on genome structure into population genetic studies. While assembly gaps may bias results in co-localization analyses of genomic features (Domanska et al., 2018), fragmented assemblies may also lead to biased results when assessing the chromosomal landscape of population genetic statistics. For example, stretches of elevated differentiation ("FST peaks") are often used to detect genomic regions under selection or to infer gene flow (Wolf & Ellegren, 2017). However, in an overly fragmented assembly, consecutive stretches of elevated differentiation may be too short to be detected, or erroneous inferences may occur if stretches are considered across scaffold boundaries. Thus, it is likely that both false-positive and false-negative discoveries may occur more frequently in incomplete assemblies.
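One simple safeguard against considering stretches across scaffold boundaries is to compute window-based statistics separately for each scaffold. The following Python sketch illustrates the idea with per-site values of an arbitrary statistic (e.g., FST). The function `windowed_means`, its parameters and the toy data are hypothetical and only meant as an illustration under these assumptions, not as the method of any cited study.

```python
from collections import defaultdict


def windowed_means(sites, window_size, min_sites=1):
    """Mean of a per-site statistic in non-overlapping windows.

    sites: iterable of (scaffold, position, value) with 0-based positions.
    Windows are keyed by scaffold, so no window ever spans a
    between-scaffold gap of unknown size.
    """
    bins = defaultdict(list)
    for scaffold, pos, value in sites:
        bins[(scaffold, pos // window_size)].append(value)
    return {
        (scaffold, idx * window_size): sum(vals) / len(vals)
        for (scaffold, idx), vals in sorted(bins.items())
        if len(vals) >= min_sites  # skip windows with too few sites
    }


# Toy data: one longer scaffold and one very short scaffold.
sites = [
    ("scaf1", 100, 0.1),
    ("scaf1", 900, 0.3),
    ("scaf1", 1500, 0.8),
    ("scaf2", 50, 0.9),
]
all_windows = windowed_means(sites, window_size=1000)
strict_windows = windowed_means(sites, window_size=1000, min_sites=2)
print(all_windows)
print(strict_windows)
```

With a stricter `min_sites` threshold, the high value on the short scaffold is dropped entirely, which mirrors how elevated differentiation on short scaffolds can go undetected in a fragmented assembly.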
Finally, it is important to keep in mind that birds are on the low end of repeat content among vertebrates (Sotero-Caio, Platt, Suh, & Ray, 2017). Given that difficulty of genome assembly increases with repeat content (Sedlazeck, Lee, Darby, & Schatz, 2018), our case study on avian genomes might be a good starting point to illustrate that even sequencing genomes with relatively low repeat content is far from trivial and should not be labelled as "complete" yet. While avian genomes show a repeat density of only 4%-10% with a maximum of 22% in the downy woodpecker, other vertebrates, invertebrates and plants often reach a repeat density of more than 50% (e.g., human genome 50%-69%, Cordaux & Batzer, 2009; de Koning, Gu, Castoe, Batzer, & Pollock, 2011; Locusta migratoria ~59%, Wu, Twort, Crowhurst, Newcomb, & Buckley, 2017; Fritillaria spp. 90%, Ambrozová et al., 2011). For such genomes, even more caution is needed when interpreting results than illustrated here for birds. So, how complete are "complete" avian genome assemblies? For now, the answer is indeed that substantial parts are usually missing.
However, we are confident that the true extent of genetic variation, hidden in repeat-rich and other tricky-to-assemble regions, will be more and more appreciated in the near future, a development spurred by rapid technological developments. Meanwhile, considering that not all gaps are equal in size or structure, our recommendation is this: Mind the gap!

ACKNOWLEDGEMENTS
We thank Mozes Blom, Anne-Marie Dion-Côté, Jan Engler, Per Ericson, Takeshi Kawakami, Cormac Kinsella and Robyn Womack for valuable comments on an earlier version of the manuscript. We also thank Shawn Narum and four anonymous reviewers for further improving this manuscript with their comments. A.S. was supported by grants from the Swedish Science Foundation (2016-05139) and the SciLifeLab Swedish Biodiversity Program (2015-R14).

CONFLICT OF INTEREST
No conflict of interests to declare.

AUTHOR CONTRIBUTIONS
V.P., M.H.W. and A.S. conceived the study, analysed the data and wrote the manuscript.

DATA ACCESSIBILITY
All the data used in this study were previously made available by the cited references.