A chromosome-level Amaranthus cruentus genome assembly highlights gene family evolution and biosynthetic gene clusters that may underpin the nutritional value of this traditional crop

Traditional crops have historically provided accessible and affordable nutrition to millions of rural dwellers but have been neglected, with most modern agricultural systems over-reliant on a small number of internationally traded crops. Traditional crops are typically well-adapted to local agro-ecological conditions and many are nutrient-dense. They can play a vital role in local food systems through enhanced nutrition (particularly where diets are dominated by starch crops), food security and livelihoods for smallholder farmers, and a climate-resilient and biodiverse agriculture. Using short-read, long-read and phased sequencing technologies, we generated a high-quality chromosome-level genome assembly for Amaranthus cruentus , an under-researched crop with micronutrient- and protein-rich leaves and gluten-free seed, but lacking improved varieties, with respect to productivity and quality traits. The 370.9 Mb genome demonstrates a shared whole genome duplication with a related species, Amaranthus hypochondriacus . Comparative genome analysis indicates chromosomal loss and fusion events following genome duplication that are common to both species, as well as ﬁssion of chromosome 2 in A. cruentus alone, giving rise to a haploid chromosome number of 17 (versus 16 in A. hypochondriacus ). Genomic features potentially underlying the nutritional value of this crop include two A. cruentus -speciﬁc genes with a likely role in phytic acid synthesis (an anti-nutrient), expansion of ion transporter gene families, and identiﬁcation of biosynthetic gene clusters conserved within the amaranth lineage. The A. cruentus genome assembly will underpin much-needed research and global breeding efforts to develop improved varieties for economically viable cultivation and realization of the beneﬁts to global nutrition security and agrobiodiversity. . We identify gene families that contracted and expanded in the A. cruentus genome, as well as


INTRODUCTION
Substantial investment in genetic improvement of crops during the Green Revolution (1960s to 1980s) led to the development of high-yield varieties of staple cereals (predominantly maize, wheat and rice) to feed the world's population. However, while these crops provide significant calories for human consumption, they are relatively low in protein (and in particular amino acids), vitamin and micronutrient content. Moreover, wide-scale cultivation of these staple crops has dramatically reduced agrobiodiversity and heightened the vulnerability of the global agricultural system. Orphan crops are typically traditional crops grown, traded and consumed within subsistence farming systems. They are not traded internationally but can play a major role in the diets and economy of low-income communities across the developing world. Importantly, they are accepted in local diets. Research and advances in orphan crops have lagged significantly behind that of staple crops, but orphan crops are often uniquely adapted to their local environments with enhanced nutritional content compared with more widely cultivated cereals, vegetables and fruits (Jamnadass et al., 2020).
The genus Amaranthus comprises approximately 60 species of annual and short-lived perennial herbaceous plants, of which several are considered orphan crops. These species have great variability and phenotypic plasticity, and probably included introgression and hybridization between species (Sauer, 1957). This makes taxonomy based on morphological characteristics difficult. Amaranth plants are originally from the Americas where they have been grown mainly for grain for over 8000 years and once were a staple food for the Mayan, Aztec and Inca civilizations (Sauer, 1950). Amaranth plants grow widely in sub-Saharan Africa and South-East Asia where they have adapted to local conditions, are now considered native to these locations, and consumed mainly as a leafy vegetable. Amaranth is one of the most important informally traded leafy vegetables in Africa (Grubben and Denton, 2004). Traditionally, different amaranth species have been grown for grain or as a leafy vegetable: Amaranthus hypochondriacus, Amaranthus caudatus and Amaranthus cruentus species, with cream coloured seeds, are generally used for grain. These three species have been independently semidomesticated from the wild black-seeded Amaranthus hybridus (Brenner et al., 2000;Stetter and Schmid, 2017;Stetter et al., 2020). Three other black-seeded amaranth species, Amaranthus blitum, Amaranthus dubius and Amaranthus tricolor are cultivated for their leaves, with the latter being the most cultivated in South-East Asia (Andini et al., 2013). However, the distinction between 'grain amaranth' and 'leafy amaranth' species is cultural rather than scientific, with the three grain species mentioned above also being used as leafy vegetables in sub-Saharan Africa and Asia (Hauptli and Jain, 1984). Moreover, recently there have been efforts to develop A. cruentus as a dual-use (grain-leaf) crop (Hoidal et al., 2019).
Both grain and leaf amaranth harbour high nutritional qualities (reviewed by Coelho et al., 2018). The fibre and mineral content (such as iron, magnesium and calcium) of amaranth seed is typically higher than in grain of most cereals (Alvarez-Jubete et al., 2009;Pedersen et al., 1987), and amaranth seed contains elevated amounts of lysine, which is often a limiting amino acid in cereals. In addition, amaranth seed has a low gluten content that makes it suitable for consumption by people with coeliac disease and gluten sensitivity (Ballabio et al., 2011), and a wide array of compounds with antioxidant properties (Coelho et al., 2018). Amaranth leaf is a rich source of protein, vitamins (including A, C, B, E and K) and minerals at similar or higher levels than spinach and chard leaves (Venskutonis and Kraujalis, 2013). One cooked portion of A. cruentus leaves was estimated to contribute 89% of the daily vitamin A needs and 34% of the daily iron needs of a child (4-8 years) (van Jaarsveld et al., 2014), a particularly striking example of the nutrition gap that traditional crops can help meet. Amaranth leaves also contain anti-nutrients such as oxalates, nitrates and phytates, which can sequester minerals and prevent uptake. However, analysis of Indian germplasm found no correlation between the accumulation of nutrients and anti-nutrients suggesting that new varieties can be found or developed with improved nutritional status (Prakash and Pal, 1991). As species with C 4 photosynthesis (Kadereit et al., 2003), amaranth plants have increased water use efficiency and, hence, resilience to drought. In addition, amaranth is highly tolerant to other abiotic stresses such as salinity, heat and ultraviolet irradiance (Jamalluddin et al., 2019;Omami and Hammes, 2006). All of these nutritional and physiological characteristics contribute to making amaranth a 'superfood' crop that can be cultivated on poor quality land in low-input agricultural systems (Joshi et al., 2018).
There have been several efforts to develop genomic resources for amaranths to inform breeding programs producing improved varieties with enhanced agronomic and/or nutritional traits. The most detailed of these efforts describes the evolution at chromosome level of A. hypochondriacus compared with quinoa (Chenopodium quinoa) and sugar beet (Beta vulgaris), other members of the Amaranthaceae family . Here we present a high-quality genome assembly for A. cruentus, confirming the genome is n = 17 in agreement with cytogenetic studies (Yssel et al., 2019). We demonstrate that the whole genome duplication (WGD) seen in A. hypochondriacus occurred before divergence of this species and A. cruentus, and highlight intraand inter-chromosomal rearrangements compared with A. hypochondriacus. We identify gene families that contracted and expanded in the A. cruentus genome, as well as potential biosynthetic gene clusters that may play a role in the high nutritional content of this underutilized crop.

Amaranthus cruentus genome consists of 17 chromosomes
Genomic DNA (gDNA) was extracted from a black-seeded A. cruentus line 'Arusha' (Gerrano et al., 2017) ( Figure S1) and sequenced using Illumina paired-end (PE) reads, 109 PE reads (109; Genomics, Pleasanton, CA, USA), Oxford Nanopore Technology (Oxford, UK) and Hi-C sequencing (Phase Genomics, Seattle, WA, USA) (Table S1). For the initial assembly, 24.5 Gb of Oxford Nanopore Technology sequencing data were assembled using WTDBG2 (Ruan and Li, 2020), with polishing using approximately 110 Gb of Illumina short-read data and Purge_dups (Guan et al., 2020) to remove haplotigs and overlaps. This assembly was combined with approximately 120 Gb of Hi-C reads using three-dimensional (3D)-DNA (Dudchenko et al., 2017) to produce an initial chromosome-level assembly. Manual inspection of the Hi-C heat map and the placement of telomere signature sequences was used to adjust (and confirm) final chromosomal scaffolds. The Illumina and 109 Illumina reads were used with PILON (Walker et al., 2014) for polishing to produce the final assembly.
K-mer analysis gave a genome size estimate of 398.6 Mb ( Figure S2), which is similar to the reported 376.4-403.9 Mb genome assembly size of A. hypochondriacus . Our the A. cruentus genome assembly was 370.9 Mb in length, and consisted of 625 scaffolds with a scaffold N50 of 21.7 Mb (Table 1); 98.5% of the genome assembly was anchored and oriented into 17 pseudochromosomes (Table S2, Figure S3), six of which had telomeric repeats at one end. The 17 chromosome-length scaffolds have been named corresponding to the A. hypochondriacus assembly , based on the degree of synteny between the two genomes. A heat map of the genome assembly ( Figure S4) demonstrates the alignment of scaffolds into pseudochromosomes and the overall quality of the assembly. K-mer analysis indicated that the 'Arusha' A. cruentus line has very low levels of heterozygosity (0.07%, Figure S2), supported by the fact that the process to remove haplotigs and contig overlaps to produce a haploid genome sequence only removed a small fraction (5.9 Mb) from the initial assembly.

Genome annotation reveals 25 477 protein-coding genes
Repetitive sequences account for approximately 58% of the 370.9 Mb A. cruentus genome, with 35% due to transposable elements (TEs) ( Table 2). The proportion of TEs in A. cruentus (35%) is higher than that in its closest sequenced relative A. hypochondriacus (24%) and similar to the TE content in the related quinoa A-and B-genome diploid ancestors (Chenopodium pallidicaule, 38%; Chenopodium suecicum, 34%) and sugar beet (B. vulgaris, 31%) genomes (Table S3). Quinoa (C. quinoa) has a much higher content at 78%. The key difference in TE content between A. hypochondriacus and A. cruentus/diploid quinoa/beet relatives is a much lower content of long terminal repeats (LTRs) retrotransposons in A. hypochondriacus (12% versus 22-27%), in particular the content of Gypsy-like LTRs. Gypsy-like LTRs comprise <5% of the A. hypochondriacus genome versus 11% in A. cruentus, 10% in sugar beet, 23%  and 19% in the quinoa ancestors and 33% in quinoa. Content of other classes of TE is similar across these related species. The distribution of TEs across the genome shows an inverse correlation to gene density, as expected from other plant genome assemblies ( Figure 1). Based on a combination of ab initio prediction, protein homology and RNA-sequencing (RNA-seq) evidence, we predicted 25 477 protein-coding genes in the A. cruentus genome. This is similar to the gene number in the A. hypochondriacus and B. vulgaris genomes and, as expected, considerably lower than the allotetraploid C. quinoa (Table 3). We generated full-length transcripts from 10 different A. cruentus tissues and/or conditions using Pacbio Isoseq technology. These full-length transcripts supported 51% of the A. cruentus genes with 73% supported by homology to other species. On average, protein-coding genes in A. cruentus are 4337 bp long and contain 4.86 exons, with these values similar to those of other sequenced Amaranthaceae species (Table 3); 94.3% of these A. cruentus genes could be assigned functions using InterProScan (Quevillon et al., 2005), as well as proteinlevel homology analysis. BUSCO (v4.0.4) analysis of these predicted protein sequences showed that 90.5% of the BUSCO gene set are found as complete genes (1461 of 1614), with another 40 genes present as incomplete sequences, bringing coverage to 93%. Of the complete BUSCO genes, 87.9% are single copy and 2.6% are found in duplicate. Seven per cent of the BUSCO gene set (113 genes) is missing.
As an additional check of the quality of the structural annotation, we compared the lengths of the coding sequences (CDS) of annotated genes in A. cruentus with their reciprocal best-hit orthologue in A. hypochondriacus ( Figure S5). The vast majority of predicted protein-CDS were the same length in both species; however, a greater proportion had a longer sequence in the A. cruentus genome (compared with A. hypochondriacus) than vice versa, suggesting an improved annotation of the A. cruentus genome compared with A. hypochondriacus (i.e. with more full-length CDS). Furthermore, the distributions of transcript, CDS, exon and intron length are all very comparable with three closely related species ( Figure S6). The density of protein-coding genes varies along each chromosome (Figure 1), as expected from other plant genomes, presumably reflecting reduced gene density around centromeric regions. In addition to protein-coding genes and repetitive sequences, we also annotated 131 microRNAs (miRNAs) from 33 families with high confidence and predicted putative target genes for 32 of these families (Table S4a,b). As expected, the predicted target genes represent a broad set of functions [61% could be associated with a gene ontology (GO) term through homology) with enrichment for genes involved in metabolism, regulatory processes, response to hormones and transport associated processes and functions (Table S4c). Nine hundred and twenty-six genes encoding transfer RNAs and 325 genes encoding ribosomal RNAs were predicted in the genome assembly. The A. cruentus genome and annotation information are publicly available within the African Orphan Crop Consortium portal of the ORCAE database (https://www.bioinf ormatics.psb.ugent.be/gdb/amaranthus/) (Sterck et al., 2012;Yssel et al., 2019).

Comparative genomics highlights a shared WGD in the Amaranthus lineage
Using the WGD package (Zwaenepoel and Van de Peer, 2019), we inferred the distribution of the rate of synonymous substitutions per synonymous site (Ks) for the whole paranome and anchor pairs (homeologues) in the A. cruentus and A. hypochondriacus genomes. Using the anchor pairs, a very clear signature of WGD was present at Ks = 0.54 in both A. cruentus and A. hypochondriacus (Figure S7), consistent with the published results for A. hypochondriacus , and indicating that these two species share the same WGD event that occurred between 18 and 34 million years ago (MYA). This is confirmed by the cross-species (A. cruentus-A. hypochondriacus) Ks comparison indicating divergence between these two genomes is much more recent (Figure S7). Within the Amaranthaceae, previous studies indicated a single relatively recent WGD event in C. quinoa that occurred between 3.3 and 6.3 MYA  and no recent WGD event in B. vulgaris. Hence, the new A. cruentus genome confirms the independent WGD events in the amaranth and quinoa lineages and places speciation of A. cruentus and A. hypochondriacus significantly after the amaranth WGD event ( Figure S7).
Using 106 single-copy gene orthologues across 14 angiosperm species, including A. cruentus and A. hypochondriacus, we built a phylogenetic tree. As expected, this shows that A. cruentus is most closely related to A. hypochondriacus and places the divergence of the two species approximately 1.45 (0.61-2.41) MYA ( Figure 2). In this analysis, C. quinoa appeared more closely related to spinach (Spinacia oleracea) than B. vulgaris [as expected from previous molecular analyses (Kadereit et al., 2003) with an estimated divergence time of approximately 30 (16.5-47.4) MYA] ( Figure 2).

Chromosome evolution differs between Amaranthus cruentus and Amaranthus hypochondriacus
As indicated above, the amaranth and quinoa lineages have undergone independent WGD events, presumably from an ancestor with a haploid chromosome number of 9. Chenopodium quinoa has retained its haploid chromosome number of n = 18 during the 3.3-6.3 million years since genome duplication, but the haploid chromosome numbers of A. cruentus and A. hypochondriacus have been reduced to n = 17 and n = 16 respectively, over the 18-34 million years since their shared WGD event. Syntenic analysis in A. hypochondriacus  indicated that this reduction was likely due to the loss of one homoeologue of chromosome 5 and the fusion of the two homoeologues of chromosome 1 to produce n = 16. A similar pattern is evident in A. cruentus ( Figure 1). We examined the homologous relationships among the 17 A. cruentus pseudochromosomes using MCScanx (Wang et al., 2012) and found 4561 collinear genes. Chromosome 1 has 496 collinear genes that showed collinearity to the other half of chromosome 1 and very little collinearity to any other pseudochromosome (Figure 1), indicating the likely fusion of the original subgenome homeologues. Chromosome 1 is also the longest of the A. cruentus chromosomes (Table S2). Chromosome 5 has minimal similarity to the other pseudochromosomes (just one collinear block with chr02B) suggesting that (as in A. hypochondriacus) its homoeologous copy was lost during evolution of modern amaranths, and that the loss of chromosome 5 occurred before speciation of A. cruentus and A. hypochondriacus. Ten of the other 15 pseudochromosomes (chr02B, chr03, chr04, chr06, chr08, chr09, chr10, chr14, chr15 and chr16) have clearly identifiable one-to-one homoeologous relationships (chr02B-chr03, chr04-chr06, chr08-chr15, chr09-chr14, chr10-chr16), while additional rearrangements have given rise to five pseudochromosomes (chr02A, chr07, chr11, chr12 and chr13) showing substantial homology to two pseudochromosomes (chr02A-chr11 and chr12; chr07-chr12 and chr13; and chr13-chr07 and chr04) ( Figure 1). Comparative syntenic analysis of the A. cruentus genome (n = 17) and A. hypochondriacus genome (n = 16) showed a high level of collinearity between the two genomes with MCScanx identifying 35 242 collinear genes representing 74% and 71% of the A. hypochondriacus and A. cruentus gene complements, respectively. Most collinear blocks on A. hypochondriacus have one or two homoeologous copies in A. cruentus (and vice versa) indicating a predominant 1:1 relationship with translocations of a number of collinear blocks ( Figure 3). However, the entire blocks on chromosome 02A and 02B in A. cruentus share collinearity with chromosome 2 in A. hypochondriacus ( Figure 3) indicating a fission of chromosome 2 into 02A and 02B in A. cruentus (subsequent to the chromosome 5 copy loss and chromosome 1 fusion shared with A. hypochondriacus) to produce n = 17. Consistent with this, chr02A is the shortest of the 17 A. cruentus chromosomes (Table S2).
There was extensive synteny between the genomes of A. cruentus and B. vulgaris (n = 9). MCScanx confirmed 13 963 collinear genes and most collinear blocks showed a 2:1 orthologous relationship (A. cruentus/B. vulgaris), resulting from the WGD event in the A. cruentus lineage that is not shared with B. vulgaris. For example, Bvchr6-chr02B and chr03; Bvchr7-chr08 and chr15; and Bvchr8-chr09 and chr14 ( Figure S8). Several more complex orthologous relationships are evident, for example, large collinear blocks on Bvchr5 are orthologous to regions on chr02A, chr07, chr11 and chr12 in A. cruentus, presumably due to multiple chromosomal rearrangements. However, we find only a single A. cruentus chromosome (chr1) with large-scale orthology to Bvchr9, supporting the previous conclusion that A. cruentus experienced a chromosome fusion of chromosome 1 subsequent to the last WGD event.

Amaranthus cruentus gene family evolution
We compared gene families across the 14 genomes presented in the phylogenetic tree ( Figure 2; Table S5). Overall, 78.8% of the 454 388 genes from these genomes could be assigned to orthogroups (21 039 orthogroups) (Emms and Kelly, 2019). One-third of these orthogroups are found across all 14 species, and <4% (773 orthogroups) are species-specific. Twenty orthogroups are specific to A. cruentus, representing 88 genes, of which only 19 had Pfam annotations. Notably, two of these A. cruentus-specific genes had inositol 1,3,4-trisphosphate 5/6-kinase, ATPgrasp domain Pfam annotations. This enzyme plays a key role in the production of inositol hexaphosphate (InsP6), also known as phytic acid, an anti-nutrient that chelates minerals in the gastrointestinal tract inhibiting their uptake. Amaranthus cruentus is known to contain phytate, with some studies finding higher levels in Amaranths compared with cereals such as rice (Lorenz and Wright, 1984). Phytic acid also plays a role in regulating intracellular calcium ion signalling, a key component of plant stress responses, and acts as a co-factor to the jasmonic acid and auxin receptors in Arabidopsis (Hou et al., 2016). Other A. cruentus-specific orthogroups of note with respect to plant stress responses include genes encoding a protein with an auxin response factor domain and a protein containing the NB-ARC domain, a key domain in plant disease resistance proteins. We used the CAFE software (Han et al., 2013) to identify gene families expanded or contracted across the same 14 genomes (Figure 4). Five hundred and three gene families were expanded in A. cruentus compared with the other 13 genomes, and 1099 families contracted. The expanded gene families were enriched for multiple GO terms relating to ion transport and for UDP-glycosyl transferase (UGT) activity ( Figure S9). Several studies have found higher levels of minerals in Amaranths (and other traditional leafy vegetables) compared with conventional species such as spinach and kale (Odhav et al., 2007), and it is possible that expanded cation transporter families contribute to this. UGTs are a family of proteins that transfer UDP-activated sugar moieties on to a variety of substrates. In plants, UGTs are involved in the production of defence compounds, the activity of phytohormones, the activity and stability of secondary metabolites, as well as driving xenobiotic detoxification (Louveau and Osbourn, 2019). With the exception of betalains (Brockington et al., 2011), the scientific community is just starting to explore the range of specialized metabolites produced in amaranths (e.g. Sarker and Oba, 2020) and the relevance of an increased number of UGTS may become clearer.
A larger number of gene families are contracted in A. cruentus compared with the other 13 genomes, and markedly, the majority of the GO terms enriched in these families are associated with fundamental gene expression processes, e.g. messenger RNA splicing, ribosomal RNA processing and ribosome biogenesis. Why these families have been contracted in A. cruentus is not clear; the genome has a similar number of genes to related plant species (Table 3), and long-read RNA-seq analysis of 10 different A. cruentus tissues detected 51% of these. Interestingly, gene families annotated with GO terms associated with messenger RNA splicing are also contracted when comparing the two amaranth species (A. hypochondriacus and A. cruentus) with the other plant genomes in this analysis ( Figure S10). The expansion of gene families annotated with GO terms relating to ion transport are also enriched in both A. hypochondriacus and A. cruentus compared with the other plant genomes in this analysis ( Figure S10).

Secondary metabolite biosynthetic gene clusters
Amaranthus, Beta, Spinacia and Chenopodium are four genera within the Amaranthaceae family that produce tyrosine-derived betalain pigments rather than anthocyanins (Brockington et al., 2011;Clement and Mabry, 1996). Three key genes in betalain biosynthesis are: (i) a cytochrome P450 with tyrosine hydroxylase activity (Polturak et al., 2016); (ii) L-DOPA 4,5-dioxygenase (DODA) (Christinet et al., 2004) with all enzymes shown to have DODA activity to date being DODAa orthologues (Sheehan et al., 2020); and (iii) L-DOPA oxidase (catalysed by the cytochrome P450, CYP76AD1, with again the a clade seeming specific to betalain synthesis) (Brockington et al., 2015;Hatlestad et al., 2012). In B. vulgaris the betalain-specific isoforms of DODAa1 and CYP76AD1 are co-located on chromosome 2, with a single gene between them (Sheehan et al., 2020). Orthologues of these two genes are also co-located in A. hypochondriacus (chromosome 16) and in C. quinoa, with the grouping present on two chromosomes in this species consistent with the recent WGD (Sheehan   et al., 2020). Analysis of our A. cruentus genome indicated that DODAa1 and CYP76AD1 orthologues are also colocated in A. cruentus, on chromosome 16, with this biosynthetic co-localization also conserved in the S. oleracea genome ( Figure 5). Within spinach and quinoa, tandem gene duplications (in addition to the recent WGD event in the quinoa lineage) appear to have given rise to additional copies of these two enzymes, with a tandem pair of the two genes in spinach, and a tandem duplication and triplication of the two biosynthetic genes in quinoa ( Figure S11).
Given the importance of secondary metabolites to the nutritional content of amaranth, particularly as a leafy vegetable, we used PLANTISMASH (Kautsar et al., 2017) to identify additional potential secondary metabolite biosynthetic gene clusters in the A. cruentus and A. hypochondriacus genomes. There is increasing evidence that, as in bacteria and fungi, genes encoding secondary metabolite biosynthetic enzymes involved in the same pathway are found in clusters in plant genomes (N€ utzmann et al., 2016). With constraints that each cluster must include at least three biosynthetic genes of two different types (and closely related duplicate genes are counted only once), PLANTISMASH identified 22 clusters in A. cruentus and 23 clusters in A. hypochondriacus (Table S6a,b). Ten of these clusters were clearly orthologous in A. cruentus and A. hypochondriacus (containing the same core domains, located on homologous chromosomes and with similar overall cluster sizes) ( Table S6c), strengthening the evidence that they are genuine biosynthetic gene clusters. These shared clusters include those likely to synthesize lignans, terpenes, alkaloids and polyketides, as well as clusters containing UGT genes for glycosylation (a gene family expanded in these two species; Figure S9) and likely synthesizing saccharide compounds. Within these clusters the presence of nonbiosynthetic genes varies between the two amaranth species, and biosynthetic gene inversions and variation in the exact biosynthetic gene number is seen in several orthologous clusters demonstrating genome rearrangements that are specific to each Amaranthus species.
Six of these 10 biosynthetic gene clusters show conservation across other genomes in the Amaranthaceae. Three  In two clusters, three different types of biosynthetic enzyme are conserved across A. cruentus and A. hypochondriacus but only two of these enzyme types are conserved across all five genomes ( Figure S12d,e). However, one cluster is particularly striking with genes encoding three different types of biosynthetic enzymes conserved across all five Amaranthaceae genomes analysed (A. cruentus, A. hypochondriacus, B vulgaris, S. oleracea and C. quinoa) (Figure 6). This cluster contains genes encoding the core domains of terpene synthase (type II squalene-hopene cyclase), epimerase and cytochrome P450 enzymes.

DISCUSSION
Here we present a high-quality chromosome-level genome assembly of the traditional crop A. cruentus. This genome is the second Amaranthus chromosome-level genome to be published (after A. hypochondriacus; Lightfoot et al., 2017) and will drive molecular plant breeding initiatives for these neglected crops, which have much to offer in terms of climate-resilience, nutrition, food security and smallholder farmer livelihoods, as well as gluten-free cereal alternatives. Enhanced agrobiodiversity through new and diversified crops will be vital for the environmental, economic and social sustainability of future agricultural systems. The World Vegetable Center (WVC) has developed a number of improved lines for East Africa, which have been partially adopted, but further improvement is still much needed (Ochieng et al., 2019). Amaranthus genetic resources at the WVC include a MAGIC population from eight parental lines: four A. cruentus and four A. hypochondriacus (R. Schafleitner, personal communications). The high-quality genome sequences of the two parental species will therefore be extremely valuable in designing markers for quantitative genetic analysis of this  (Stetter et al., 2020) hence, effective breeding and crop development, for both grain and leafy vegetable traits, will help realize the environmental and socioeconomic potential of this species. Our A. cruentus genome assembly and annotation are freely available within the AOCC portal of the ORCAE database (Sterck et al., 2012;Yssel et al., 2019) both to enable easy reuse of the information, as well as facilitate communitybased manual annotation of the genome.
Amaranthus cruentus is monoecious and can selfpollinate. Amaranths are also known to be able to hybridize (Sauer, 1950), and interspecific hybridization is thought to underlie the success of A. tuberculatus as a weed (Trucco et al., 2009). However, the line we sequenced (Arusha; Gerrano et al., 2017) is a semi-domesticated accession, and as expected had a low level of heterozygosity (0.07%). Combining long-read, short-read and phased sequencing technologies, we generated a 371 Mb genome assembly, with 365 Mb assembled into 17 pseudochromosomes. This is slightly shorter than the latest A. hypochondriacus genome assembly of 404 Mb ; however, K-mer analysis indicated a total A. cruentus genome length of approximately 399 Mb.
Annotation of the A. cruentus assembly indicated a gene number similar to A. hypochondriacus and closely related B. vulgaris, with very similar distributions of transcript, CDs, exon and intron lengths between these species (as well as C. quinoa). The overall proportion of TEs in the two Amaranthus genomes varied (35% versus 24%) with A. cruentus containing a higher proportion of LTRs. However, this may reflect different repeat analysis methods across the two studies. Other types of element were found at similar proportions, apart from SINE elements, which were responsible for five times as much sequence in A. cruentus compared with A. hypochondriacus. The proportion of TEs in plant genomes varies widely, for example, 14% in Arabidopsis thaliana, 63% in tomato and 84% in maize (Ragupathy et al., 2013). The proportion of TEs in a genome is related to genome size (increasing as genomes increase in size) (Tenaillon et al., 2010) with 35% in A. cruentus in line with what is seen in other similarly sized genomes.
Interestingly, our A. cruentus genome contained a larger proportion of longer protein-CDS than the published A. hypochondriacus genome . This is likely due to the A. cruentus Isoseq transcript long-read data we used to complement homology and ab initio gene prediction approaches. We included different tissues and/ or conditions to capture a wide range of transcripts, bearing in mind many transcripts will show fairly restricted expression patterns. Fifty-one per cent of predicted genes in A. cruentus were present as in our long-read transcriptome data, a fairly low gene coverage but with the advantage of full-length transcripts for these genes. Generation of long-read transcriptome data from additional cell/tissue types and conditions/treatments will be needed to further improve the genome annotation and identify full-length transcripts (and importantly, transcript isoforms) for the remaining A. cruentus genes. Such a transcript dataset is critical for accurate RNA-seq expression analysis and investigation of differential splicing (Brown et al., 2017).
Comparison of the new A. cruentus genome with 13 other plant genomes placed the known WGD event in A. The five Amaranthaceae species were: Ac, Amaranthus cruentus; Ah, Amaranthus hypochondriacus; Bv, Beta vulgaris; Cq, Chenopodium quinoa; So, Spinacia oleracea. This cluster includes three different enzyme domains: epimerase, terpene synthase and cytochrome P450. Boxes indicate predicted gene models with different colours indicating different enzyme types (yellow, terpene synthase; blue, cytochrome P450; purple, epimerase) and orthologous genes linked by grey lines. Pink boxes indicate a non-biosynthetic enzyme encoding gene that is also co-located across all five genomes. hypochondriacus  at approximately 30 MY before speciation of A. cruentus and A. hypochondriacus, approximately 1.45 MYA (Figure 2). It also highlighted substantial chromosomal rearrangements following the amaranth lineage WGD event, both pre-and post-speciation. The loss of one copy of chromosome 5 and the fusion of both copies of chromosome 1 were common to both A. cruentus and A. hypochondriacus, whereas the fission of one copy of chromosome 2 into two is specific to A. cruentus. The latter event giving A. cruentus a haploid chromosome number of 17, and A. hypochondriacus, 16. Draft genomes have been published for the agricultural weed species, A. tuberculatus, A. palmerii and A. hybridus (Montgomery et al., 2020), but the lack of annotation restricts comparative genome rearrangement analysis. However, chromosome loss, fusion and fission after the WGD appears to be prevalent in the Amaranthus genus. Among 30 amaranth accessions analysed, four species have a haploid chromosome number n = 16 and 26 have n = 17 (Grant, 1959). As more Amaranthus genomes are assembled, it remains to be seen to what extent chromosome loss, fusion and fission events occurred independently in each species. Assembled Amaranthus genomes will also assist in understanding of hybridization between species with different chromosomal numbers, and its impact on improvement strategies for this neglected crop.
Analysis of gene families in A. cruentus and related genomes, highlighted two A. cruentus-specific genes encoding inositol 1,3,4-trisphosphate 5/6-kinase ATP-grasp domains, which may play a role in phytic acid synthesis. Phytic acid has both anti-nutrient properties, inhibiting the uptake of minerals, as well as beneficial anti-cancer and antioxidant activity. It is likely that with a well-balanced diet the beneficial properties outweigh the negative effects, although in populations with high levels of micronutrient deficiencies and/or malnourishment, the impact of phytates on mineral uptake may be significant (Schlemmer et al., 2009). However, a recent study across 20 cereal and pseudocereal flakes demonstrated that amaranth and teff contained the best ratios of minerals versus dietary fibre, phytates and tannins (Kiewlicz and Rybicka, 2020). Hence consumption of these would help deliver gains in mineral nutrition (particularly for magnesium and iron). The expansion of ion transporter gene families in A. cruentus (and A. hypochondriacus) compared with other plant genomes ( Figure S10) could play a role in the increased accumulation of minerals in this genus compared with cereals and pseudocereals.
Although the Arusha A. cruentus line sequenced here has not been seen to accumulate visible red pigments in either the inflorescence or leaves, it is one of the Amaranthus species producing betalain pigments rather than anthocyanins (Brockington et al., 2011). Consistent with this, we detected conserved co-location of the genes encoding the key enzymes of betalain synthesis, which was previously identified in A. hypochondriacus, B. vulgaris and C. quinoa. We also highlight the same conservation of co-location in the S. oleracea genome (Dohm et al., 2014). In addition to this betalain enzyme grouping, we identified 22 potential biosynthetic gene clusters for plant secondary metabolites in the A. cruentus genome, with 10 of these conserved in A. hypochondriacus. Six of these showed conserved colocation of biosynthetic enzymes across all five Amaranthaceae genomes outlined above, with one cluster exhibiting co-location of genes encoding two cytochrome P450 enzymes, an epimerase, and multiple terpene synthases. We hypothesize this gene cluster plays a role in a terpene synthesis pathway leading to compounds that may be specific to this lineage. Co-location of biosynthetic enzymes can dramatically increase the ease of identifying biosynthetic pathway genes, particularly when combined with transcriptome analysis correlating gene expression with the presence of specific metabolites in specific plant tissues. Complementation of this genome assembly with transcriptome and metabolome analyses will further aid in the discovery and exploitation of A. cruentus, and drive realization of its potential in supporting a healthy and sustainable food system. Enhanced agrobiodiversity through new and diversified crops will be vital for the environmental, economic and social sustainability of future agricultural systems.

Plant growth, sampling and DNA extraction
Plants (A. cruentus cv. 'Arusha'; Gerrano et al., 2017) were grown in the glasshouse under 12 h light (28°C)/12 h dark (24°C) in 10 cm 9 10 cm 9 10 cm pots with fine peat-based compost and added sand. In this sequencing project different pools of individual plants were used for gDNA extraction. For short-read Illumina a total amount of 1.0 lg DNA per sample was used as input material for a sequencing library generated using NEBNextâ DNA Library Prep Kit according to the manufacturer's protocol. For long-read Nanopore sequencing, gDNA was extracted from leaf nuclei (isolated according to Workman et al., 2018) using the Nanobind Plant Nuclei Big DNA kit (Circulomics, Inc., Baltimore, MD, USA). Long fragment DNA sequencing libraries were prepared using the Oxford Nanopore Technologies (Oxford, UK) ligation sequencing kit LSK-SQK109. For chromatin conformation capture (Hi-C), 4-6 week-old leaf material was cross-linked by incubation at room temperature for 1 h in 1% formaldehyde. Glycine (125 mM final concentration) was added with incubation for a further 15 min before rinsing with water, and flash-freezing in liquid nitrogen. The cross-linked material was shipped to Phase Genomics who performed subsequent gDNA extraction, chromatin isolation, library preparation and sequencing. For 109 Chromium Genome sequencing, 4-6-week-old leaf material was sent to the UC Davis Genome Center for subsequent gDNA extraction, library preparation (according to 109 Genomics) protocols and sequencing.

Genome sequencing datasets
Long-read sequencing was carried out by the Technology Facility (University of York) using Oxford Nanopore Technologies MinION sequencers running R9.4.1 flow cells. Data were basecalled using GUPPY v1.8.3. We recovered 24.5 Gb of long-read data comprising 7 728 433 reads with a N50 read length of 6.43 kb and with a maximum length of 356 126 bp (Table S1). Given an estimated genome size of 400 Mb (see below), this represents a coverage of approximately 609. We used these long reads to construct the initial contig-level assembly.
We generated three different types of PE short-read sequences using the Illumina sequencing platform, including Hi-C, 109 Genomics and regular short inserts (Table S1). For Hi-C sequencing, we obtained 120 Gb (approximately 3009 coverage) of in vivogenerated Hi-C data from Phase Genomics. The Hi-C data were used for the construction of a chromosome-scale assembly from the contig-level assembly. One hundred and twenty Gb of 109 Genomics reads and 110 Gb from Illumina PE read libraries with short inserts were used for the estimation of genome size and heterozygosity level, as well as base polishing of the genome assembly at the final stages to improve its accuracy. We used GEN-OMESCOPE (Vurture et al., 2017) to estimate the genome size of A. cruentus. First, a 32-mer distribution was generated with JELLYFISH (Marc ßais and Kingsford, 2011) from the 120 Gb 109 Genomics short reads and used as input in the subsequent GENOMESCOPE analysis.

Genome assembly with chromosome-length scaffolds and polishing
For the initial contig-level assembly, we ran the WTDBG2 (Ruan and Li, 2020) assembly pipelines with the Oxford Nanopore dataset producing an initial assembly of 372.9 Mb with 1948 contigs and a contig N50 of 0.98 Mb. Purge_dups (Guan et al., 2020) was used to remove haplotigs and contig overlaps to produce a haploid representation of the genome resulting in a 367.0 Mb haploid assembly with 1438 contigs and a contig N50 of 1.02 Mb.
To construct highly contiguous scaffolds from the above contiglevel haploid genome, we used the 3D-DNA pipeline (Dudchenko et al., 2017) with the Hi-C data first to correct misjoins in the contigs, then to scaffold and merge overlaps. The resulting assembly was adjusted and curated using (i) guidance from the chromogram of the Hi-C reads, where pixel intensity in the contact matrix indicates how often a pair of loci co-locate in the nucleus, and (ii) placement of telomere signature sequences. The final A. cruentus genome assembly is detailed in Table 1, and includes 17 chromosome-length scaffolds. The genome assembly metrics at different stages of the process outlined above are shown in Table S7.
To improve the base accuracy of the genome assembly further, using the PILON software tool, we polished (Walker et al., 2014): two rounds after mapping the Illumina PE short reads to the working assembly with BWA (Li and Durbin, 2009), and another two rounds after mapping the reads from 109 Chromium Genome sequencing with the LONGRANGER mapping tool (109 Genomics), resulting in a polished final haploid assembly of 370.9 Mb. The quality of the genome assembly was assessed (via cumulative length and genome assembly statistics) using QUAST (Gurevich et al., 2013); 96.8% of the 368 million 150-bp sequencing reads we generated aligned to the final genome assembly with 92.5% of the reads mapped as accurate pairs.

Full-length transcript isoforms of Amaranthus cruentus
To assist genome annotation for the identification and validation of gene structure and coding regions, we used the PACBIO SINGLE MOLECULE, Real-Time sequencing technology and Iso-Seq analysis pipeline to generate full-length cDNA sequences. RNA was extracted from 10 different tissues (2-week-old seedlings; 6-weekold plant root, stem and leaf; flowers; seed; 10-week-old senescent leaf; leaves from 4 week-old plants 1 h after wounding; and leaves of 4-week-old plants 24-and 48-h after Botrytis cinerea infection) according to standard cetyltrimethylammonium bromide protocols followed by lithium chloride precipitation. Samples were pooled and first-strand cDNA synthesized according to Pacific Biosciences (Menlo Park, CA, USA) Isoseq recommended protocol. Library preparation and sequencing were performed according to Pacific Biosciences protocols by the UC Davis Genome Center.
We analysed the highly accurate long reads (HiFi reads) generated with the Circular Consensus Sequence algorithm using the Iso-Seq bioinformatics pipeline: identifying and clustering a fulllength transcript dataset at the isoform level that spans entire transcript isoforms, and polishing to create the A. cruentus highquality consensus dataset containing 51 928 sequences. Cupcake, a set of supporting scripts for processing Iso-Seq data (Gordon et al., 2015), was used to remove further redundancies with the parameters (MIN_ALN_COVERAGE: 0.95; MIN_ALN_IDENTITY: 0.90) producing a final dataset of unique full-length transcript isoforms containing 21 732 sequences in total.
Genome annotation REPEATMODELER v2.0.1 (Flynn et al., 2020) was used to identify the repeat families in the genome assembly of A. cruentus based on the default TE Dfam database and Repbase database with the support of LTR_Struc (McCarthy and McDonald, 2003). Furthermore, LTR_Finder (v1.0.7), LTR_harvest from GENOMETOOLS (v1.5.9) and LTR_retriever (v2.9.0) were used to identify and trace the LTR elements in the A. cruentus genome (Ellinghaus et al., 2008;Ou and Jiang, 2018;Xu and Wang, 2007). We merged the libraries from RE-PEATMODELER and LTR_retriever by USEARCH (Edgar, 2010) with 80% identity as the minimum threshold for combining similar sequences in the de novo libraries to get the non-redundant de novo repeat library. Finally, we used REPEATMASKER v4.1.1 (https://sc icrunch.org/resolver/RRID:SCR_012954, Chen, 2004) with parameter: -e rmblast -a -s -norna -xsmall -gff -lib to identify and classify repeats in the A. cruentus genome assembly. Gene models were predicted with a combination of ab initio prediction, homology search and RNA-aided annotation. BRAKER2 (Br una et al., 2021) was used for ab initio gene prediction using model training based on Iso-seq data from A. cruentus and proteins of very close homology from A. hypochondriacus after the annotated repeats were soft masked in the assembly. For homology prediction, protein sequences from four closely related species that belong to the same family were used as query sequences to search the reference genome using TBLASTN with different e-values (A. hypochondriacus with the e-value ≤1e À 10, B. vulgaris, C. quinoa and S. oleracea with the e-value ≤1e À 5). Regions mapped by these query sequences were subject to Exonerate (Slater and Birney, 2005) to generate putative transcripts. For RNA-aided annotation, the full-length transcript isoforms generated from the Isoseq data were used as input for the PASA pipeline (Haas et al., 2003) for gene structure predictions, with TRANSDECODER (Haas et al., 2013) also used to predict open reading frames. Finally, EVIDENCEMODELER v1.1.1 (Haas et al., 2008) was used to integrate all of the above evidence and BUSCO v4.0.4 (Benchmarking Universal Single-Copy Orthologs; Seppey et al., 2019) to assess the quality of annotation results.
Putative of these predicted genes was obtained by aligning the protein sequences of these genes against the sequences in public protein databases and the UniProt database using BLASTP (E-value <1 9 10 À5 ).

miRNAs and putative target gene prediction
Known mature miRNAs were obtained from miRbase (release 22.1) and aligned to the soft-masked A. cruentus genome using SeqMap with no mismatches (Jiang and Wong, 2008). We extracted approximately 110 bp upstream and downstream sequence surrounding every aligned locus and discarded miRNA candidates located within protein-CDS or repetitive elements to produce the refined miRNA set. The predicted stem-loop structure and minimum free energy were analysed for each region using the RNAfold program of VIENNARNA v2.1.1 (Lorenz et al., 2011) with default settings. A miRNA target transcript prediction pipeline was developed using VMATCH (Schnable et al., 2009). Mature miRNA sequences in the refined miRNA set were reverse complemented and matched against the protein-CDS of A. cruentus, with the parameters allowing up to three mismatches. Predicted miRNA target genes were tested for GO term enrichment using TBtools (Chen et al., 2020).

Ks distribution analysis
Ks distribution analysis was performed using the wgd package and the paranome (entire collection of duplicated genes) was obtained with 'wgd mcl' using all-against-all BlastP and MCL clustering. Ks distribution of A. cruentus was then constructed using 'wgd ksd' with default settings using MAFFT (Katoh and Standley, 2013) for multiple sequence alignment, codeml for maximum likelihood estimation of pairwise synonymous distances, and Fas-tTree for inferring phylogenetic trees used in the node weighting procedure. Anchors or anchor pairs (paralogous genes lying in collinear or syntenic regions of the genome) were obtained using I-ADHORE, employing the default settings in 'wgd syn'.

Phylogenetic divergence
The amino acid sequences of all proteins from A. cruentus and 13 other angiosperms were downloaded from PLAZA (Van Bel et al., 2018). All-versus-all BLASTP with an E-value cut-off of 1e À 05 was performed and orthologous genes were clustered using ORTHOFINDER v2.1.2 (Emms and Kelly, 2019). Single-copy orthologous genes were extracted from the clustering results. MAFFT (Katoh and Standley, 2013) with default parameters was used to perform multiple alignment of protein sequences for each set of single-copy orthologous genes, and transform the protein sequence alignments into codon alignments. Poorly aligned or divergent regions were removed from the multiple sequence alignment results using TRIMAL. The resulting codon alignments from all single-copy orthologues were then concatenated to one supergene for species phylogenetic analysis. A maximum likelihood phylogenetic tree of single-copy proteins alignments and codon alignments from A. cruentus and 13 other angiosperms was constructed using IQ-TREE with the GTR+G model and 1000 bootstrap replicates. Divergence times between the 13 plant species were estimated using MCMCtree from the PAML package (Yang, 2007)

Genome and gene family evolution
Syntenic analysis of genomes was performed using MCSCANX (Wang et al., 2012) with parameters '-s 10' and the circos figures were drawn using TBtools (Chen et al., 2020). CAF E (v3.1) (Han et al., 2013) was used to identify the expansion and contraction of gene families following divergence predicted by the phylogenetic tree above. Tbtools was also used to determine enrichment of GO terms in expanded and contracted families (Chen et al., 2020).

Analysis of biosynthetic gene clusters
For the analysis of betalain biosynthetic genes, DODA sequences in C. quinoa , S. oleracea (Xu et al., 2017), A. hypochondriacus  and A. cruentus (this study) were identified using BLAST searches with the protein sequence of the previously characterized BvDODAa1 (Hatlestad et al., 2012). CYP76AD1 orthologues were identified via a BLAST search using B. vulgaris subsp. Cicla sequence from the National Center for Biotechnology Information (NCBI) (accession no.: KU644144). All gene sequences were checked manually, aligned using mafft and a maximum likelihood gene tree based on protein sequences constructed using IQ-TREE with GTR+G model and 1000 bootstrap replicates to confirm orthology. Microsynteny of DODAa1 and CYP76AD1 orthologues was analysed using MCSCAN JCVI (Tang et al.et al., 2008) including approximately 20 genes from either side of the betalain enzymes. PLANTISMASH (N€ utzmann et al., 2016) was used to identify potential biosynthetic gene clusters in the newly assembled A. cruentus genome and that of A. hypochondriacus . Based on the gene families from ORTHOFINDER analysis, we identified the orthologous sequences in A. hypochondriacus, B. vulgaris, C. quinoa and S. oleracea for each member of the gene cluster in A. cruentus and used MCSCAN JCVI to visualize the microsynteny (Tang et al., 2008). genome assembly by YL and ZN. Genome annotation and comparative analyses were carried out by XM, with the analysis of biosynthetic gene clusters by KJD and XM. Funding for the research was obtained by KJD, IAG, SLV, MWB, SM, YVdP and AVD. The manuscript was written by XM, FEV and KJD with input from all authors.

CONFLICT OF INTERESTS
The authors declare that they have no competing interests.

DATA AVAILABILITY STATEMENT
All data were available either within the supporting information of this manuscript or associated with the A. cruentus genome in the AOCC portal of ORCAE (https://www. bioinformatics.psb.ugent.be/gdb/amaranthus/). In this portal, Chromosome 2B is named Chromosome 17. Raw data and the genome assembly are available at NCBI under BioProject: PRJNA713964.

SUPPORTING INFORMATION
Additional Supporting Information may be found in the online version of this article. Figure S1. Amaranthus cruentus cv. Arusha. Plants were raised as seedlings in the greenhouse and planted in the field approximately 4 weeks after sowing. (a,b) Show plants approximately 9 weeks after transplanting with the plants approximately 1.2 m tall; (c) shows an inflorescence, approximately 30 cm in length; and (d) black seed with diameter approximately 1 mm. Figure S2. K-mer analysis indicates a genome size of 398.6 Mb and a heterozygosity level of 0.07%. Figure S3. Cumulative length of the Amaranthus cruentus genome assembly by number of contigs and scaffolds. Figure S4. Heat map of the Hi-C paired reads indicating the 17 linkage groups. Heat map colour intensity indicates the frequency of paired read co-location. Figure S5. Amaranthus cruentus predicted protein lengths are similar to those in Amaranthus hypochondriacus, with a greater proportion of longer predicted proteins. Frequency of the length ratio between reciprocal best hit orthologues is shown, with values >1 indicating longer proteins in A. hypochondriacus, and values <1 indicating longer proteins in A. cruentus. Figure S6. Distribution of transcript, protein-coding sequence, exon and intron length are similar across the genomes of Amaranthus cruentus, Amaranthus hypochondriacus, Beta vulgaris and Chenopodium quinoa. Figure S7. Ks distributions for paralogues within the Amaranthus cruentus genome and for orthologues of A. cruentus and related species. Figure S8. Syntenic relationship between the Amaranthus cruentus genome and Beta vulgaris genome. Size of chromosomes is indicated in Mb around the outside with syntenic blocks of genes shown in the same colour. Figure S9. Gene ontology (GO) term enrichment in expanded (a) and contracted (b) gene families in Amaranthus cruentus. Only statistically significant enrichment is shown (adjusted P < 0.05). Figure S10. GO term enrichment in expanded (a) and contracted (b) gene families in the two amaranth species (Amaranthus cruentus and Amaranthus hypochondriacus) compared with other 12 plant genomes analysed. Only statistically significant enrichment is shown (adjusted P < 0.05). Figure S11. (a) Spinacia oleracea and (b) Chenopodium quinoa genomes show evidence for tandem gene duplications of DODAa1 and CYP76AD1. Ac, Amaranthus cruentus; Ah, Amaranthus hypochondriacus; Bv, Beta vulgaris; Cq, C. quinoa; So, S. oleracea. Rectangles indicate gene locations and colour the relative orientation (blue, À strand; green, + strand). Orthologous genes are linked by grey lines, with DODAa1 orthologues linked by red lines, and CYP76AD1 orthologues by blue. Figure S12. Conserved biosynthetic gene clusters in the genomes of five Amaranthaceae species: Ac, Amaranthus cruentus; Ah, Amaranthus hypochondriacus; Bv, B. vulgaris; Cq, Chenopodium quinoa; So, Spinacia oleracea. Each of the panel boxes indicate predicted gene models with different colours indicating different enzyme types and orthologous gene pairs linked by grey lines. Genes indicated with grey boxes are 'other' genes within the cluster or flanking the cluster, which do not have a predicted biosynthetic role. Table S1. Statistics of raw sequence read files for the assembly and polishing of the Amaranthus cruentus genome. Data are the output of seqkit (https://github.com/shenwei356/seqkit) software command (seqkit stats -a). Table S2. Chromosome statistics. Length of each pseudochromosome is shown in bp. Table S3. A comparison of transposable element content of Amaranthus cruentus with related species. Genome information is obtained from this study (A. cruentus), Jarvis et al. (2017) Table S4. (a) miRNA annotation in Amaranthus cruentus genome; (b) predicted miRNA target genes in the A. cruentus genome. First column is the name of the miRNA identified in this study and the second column is the predicted miRNA target gene; (c) miRNA target gene GO term enrichment. Target genes of miRNAs are from Supplementary Table 4b.  Table S5. Orthology across species analysis in Amaranthus cruentus and 13 additional plant genomes. Table S6. (a) PLANTISMASH results for Amaranthus cruentus indicating the cluster number, chromosome the cluster is located on, type of metabolite the cluster core enzyme produces, location of the cluster in the genome, size of the cluster, key biosynthetic protein domains encoded in the cluster, the number of CD-HIT gene groups in the cluster, and the gene IDs present in the cluster (with the core domain genes in bold); (b) PLANTISMASH results for Amaranthus hypochondriacus indicating the cluster number, chromosome the cluster is located on, type of metabolite the cluster core enzyme produces, location of the cluster in the genome, size of the cluster, key biosynthetic protein domains encoded in the cluster, the number of CD-HIT gene groups in the cluster and the gene IDs present in the cluster (with the core domain genes in bold); (c) PLANTISMASH biosynthetic gene clusters conserved across the A. cruentus and A. hypochondriacus genomes. Table S7. Metrics at different stages of the Amaranthus cruentus genome assembly