Draft genomes of two outcrossing wild rice, Oryza rufipogon and O. longistaminata, reveal genomic features associated with mating‐system evolution

Abstract Oryza rufipogon and O. longistaminata are important wild relatives of cultivated rice, harboring a promising source of novel genes for rice breeding programs. Here, we present de novo assembled draft genomes and annotation of O. rufipogon and O. longistaminata. Our analysis reveals a considerable number of lineage‐specific gene families associated with the self‐incompatibility (SI) and formation of reproductive separation. We show how lineage‐specific expansion or contraction of gene families with functional enrichment of the recognition of pollen, thus enlightening their reproductive diversification. We documented a large number of lineage‐specific gene families enriched in salt stress, antifungal response, and disease resistance. Our comparative analysis further shows a genome‐wide expansion of genes encoding NBS‐LRR proteins in these two outcrossing wild species in contrast to six other selfing rice species. Conserved noncoding sequences (CNSs) in the two wild rice genomes rapidly evolve relative to selfing rice species, resulting in the reduction of genomic variation owing to shifts of mating systems. We find that numerous genes related to these rapidly evolving CNSs are enriched in reproductive structure development, flower development, and postembryonic development, which may associate with SI in O. rufipogon and O. longistaminata.


| INTRODUC TI ON
The genus Oryza belongs to the grass family and consists of more than 20 wild species and two cultivated species (Khush, 1997;Vaughan, 1994). These rice species have been assigned ten genome types (AA, BB, CC, EE, FF, GG, BBCC, CCDD, HHJJ, and HHKK) (Aggarwal, Brar, & Khush, 1997;Ge, Sang, Lu, & Hong, 1999;Nayar, 1973), representing an enormous gene pool for genetic improvement of modern rice cultivars. The majority of alien genes involved in rice improvement are derived from wild AA-genome species to broaden gene pool of cultivated rice through introgression lines from the other wild relatives of Oryza. This bunch of AAgenome Oryza species comprises two cultivated rice (O. sativa and O. glaberrima) and six wild relatives (O. barthii, O. longistaminata, O. nivara, O. rufipogon, O. meridionalis, and O. glumaepatula), respectively (Vaughan, Morishima, & Kadowaki, 2003). They span a wide range of global pantropical regions and are disjunctively distributed in Asia, Africa, Australia, and South America (Vaughan, 1994). Phylogenomic studies showed that they have generated extensive AA-genome diversity from a common AA-genome ancestor and diverged within 4.8 million years (Myr) (Gao et al., 2019;Zhang et al., 2014;Zhu et al., 2014). There were frequent switches of mating systems between selfing (O. sativa, O. glaberrima, O. barthii, O. nivara, O. meridionalis and O. glumaepatula) and outcrossing species (O. rufipogon and O. longistaminata). This closely spaced series of recent speciation events has occurred with different life-history and breeding traits (Oka, 1988;Morishima, Sano, & Oka, 1992), providing an unparalleled system for understanding the gene and genome divergence that determines the wealth of phenotypic diversity and adaptive differences among multiple plant lineages.
O. rufipogon Griff., belonging to the genus Oryza, is thought to be the wild progenitor of Asian cultivated rice, O. sativa (Oka, 1988;Cheng et al., 2003;Fuller et al., 2010;Huang, Chen, et al., 2012;Huang, Kurata, et al., 2012;Khush, 1997). This perennial outcrossing plant species widely grows in diverse habitats of tropical and subtropical regions of Asia and Australia (Morishima et al., 1992). As a result of long historical domestication and improvement, O. sativa has experienced a considerable loss of genetic diversity through genome-wide bottlenecks and artificial selection for domestication traits compared to O. rufipogon (Kovach, Sweeney, & McCouch, 2007;Xu et al., 2011). Thus, O. rufipogon definitely harbors abundant source of novel alleles that are critical and necessary for rice breeding programs in the future. A large number of alien genes from O. rufipogon have successfully been introduced into cultivated rice, generating many environmentally resistant, and high-yielding varieties (Brar & Ramos, 2007), for example, the application of the "wild-abortive" allele for the fruitful raising of hybrid rice varieties (Lin & Yuan, 1980).
Recent decades have witnessed the progress in population and conservation genetic studies using multiple molecular markers to obtain novel insights into the extent, distribution, and dynamics of genetic diversity within and among natural populations of Chinese O. rufipogon (Gao, 2004). Despite the release of nuclear genomes of O. rufipogon and the seven other AA-genome Oryza species (Du et al., 2017;Goff et al., 2002;IRGSP, 2005;Stein, Yu, Copetti, Zwickl, & Zhang, 2018) the lack of a typical O. rufipogon genome may seriously impede its evolutionary and functional genomic studies. Indepth knowledge of genomic variation and population structure within the species is urgently needed to provide timely information useful for developing appropriate and efficient conservation management of wild rice germplasms.
O. longistaminata A. Chev. & Roehr., which is native to Africa, is among the eight AA-genome species in the genus Oryza. This wild rice species possesses highly valued traits to improve cultivated rice, including rhizomatousness for perennial rice breeding program (Glover et al., 2010), strong resistance to diseases and abiotic stresses (Song et al., 1995), and self-incompatibility (SI) for new procedures to generate seeds of hybrid rice (Ghesquiere, 1986).
Recently, Zhang et al. (2015) reported a draft genome assembly of O. longistaminata with a relatively short (12.5 kb) contig N50 length.
However, deciphering the O. longistaminata genome is fundamental to uncover molecular mechanisms that determine these remarkable agronomic traits to enhance rice genetic improvement.
The past decades have witnessed the completion of nuclear genomes of the two cultivated rice subspecies and a number of their relatives (Du et al., 2017;Goff et al., 2002;IRGSP, 2005;Shi et al., 2020;Stein et al., 2018;Wang et al., 2014;Yu et al., 2002;Zhang et al., 2014Zhang et al., , 2015Zhang et al., , 2016. Here, we sequenced and de novo assembled the two outcrossing wild rice genomes, O. rufipogon and O. longistaminata. We report the draft genome assembly and annotation of a typical O. rufipogon as well as its large transcriptome datasets using Illumina and 454 sequencing platforms. We also present an improved genome assembly and the annotation of O. longistaminata as well as large transcriptome datasets using Illumina sequencing platforms. Comparisons of all eight Oryza AA-genomes with well-defined phylogenetic framework and splitting times may undoubtedly serve as an important model that deserves endeavors for obtaining the full-genome patterns and multispecies perspective of evolutionary dynamics and genome dissimilarity. Access to the unprecedented data of these O. rufipogon and O. longistaminata genome sequences will be necessary for mining valuable alleles from this wild rice to boost the future rice breeding programs.

| Plant materials
An individual plant of O. rufipogon (RUF) was collected from a typical natural population grown in Yuanjiang County, Yunnan Province, China (Gao, Wei, Yang, Hong, & Ge, 2001). Another single plant of O. longistaminata (LON), which was originally from Botswana, was introduced from IRRI, Los Banos, the Philippines, and grown in the greenhouse of Kunming Institute of Botany, the Chinese Academy of Sciences. High-quality genomic DNA was extracted from leaves using a modified CTAB method (Porebski, Bailey, & Baum, 1997

antha, Zea mays, Sorghum bicolor, and Brachypodium distachyon
were aligned to the genome assemblies using GenBlastA (version 1.0.1) (She, Chu, Wang, Pei, & Chen, 2009) and further refined by GeneWise (version 2.2.0) (Birney, Clamp, & Durbin, 2004). To improve the quality of gene predictions, we aligned the assembled transcripts to the genome assembly using PASA (Program to Assemble Spliced Alignments) (Haas et al., 2003) to determine the potential gene structures. We used EVidenceModeler (EVM) (Haas et al., 2008) to combine the ab initio gene predictions, protein alignments and transcription alignments described above into weighted consensus gene structures. To validate the predicted gene models, protein sequences of SAT (v 7.0) genome were downloaded from MSU database (http://rice.plant biolo gy.msu.edu/), and then, all these peptide sequences were aligned to gene models using BLAT.  (Xu & Wang, 2007), MGEScan (Lee et al., 2016), and LTR retriever (Ou & Jiang, 2018). Previously annotated transposons were retrieved from the collected Oryza RiTE database (Copetti et al., 2015) and Repbase Update (Jurka et al., 2005). To annotate the repeat sequences in these two new genomes all the upper date were used to create a combined library for RepeatMasker (Chen, 2004;Smit, Hubley, & Green, 2016). Simple Sequence Repeats (SSRs) were identified and located using MISA (http://pgrc.ipk-gater sleben.de/ misa/). We combined SSRs from the plus and minus strands and differences caused by reading frames.

| Construction of the orthology synteny map of the eight AA-genome Oryza species
To aid evolutionary analyses, we accurately identified and aligned orthologous genomic regions from the eight assembled AA-genomes using MERCATOR (Dewey, 2007) and MAVID (Bray & Pachter, 2004).

| Identification and analysis of conserved noncoding sequences
We identified conserved noncoding sequences (CNSs) using GERP++ software (Cooper et al., 2005). The nucleotide substitution frequency of a CNS was estimated by , where d ij is an estimate of the number of nucleotide substitutions per site between DNA sequence i and j, and n is the number of the examined sequences. For each rapidly evolving region in CNSs, nucleotide substitution frequency was subsequently calculated for each species using an average of nucleotide substitution frequency of a species within this genomic region compared to other seven Oryza AAgenome species. We then used the nucleotide substitution frequency of this rapidly evolving region to compare with average nucleotide substitution frequency of total CNSs by Chi-squared test.

| Genome sequencing, assembly and quality assessment
We sequenced nuclear genomes of two wild rice species: O. rufipogon from Asia and O. longistaminata from Africa (Table S1) Figure S3a and Table S5). The resulting genome assembly was referred to as Oryza_rufipogon_v1.0. The N50 lengths of the assembled LON contigs and scaffolds were ~16.89 Kb and ~1.13 Mb, respectively (Table 1 and Table S4). The contig N50 and scaffold N50 sizes represent ~1.35-fold and ~3.10-fold improvement in length from the previously reported  ~12.5 Kb and 364 Kb, respectively.
About 96.94% of the LON assembly fell into 419 scaffolds larger than 100 Kb in length ( Figure S3b and Table S5). The resulting assembled genome was referred to as Oryza_longistaminata_v1.0.
To scaffolds with lengths varying from 385 to 680 Kb against the available contigs assembled using SMRT technology. After eliminating repeat sequence-masked and gap regions, pairwise alignments yielded high sequence similarities of ~99.54% to 99.64% with 100% coverage of sequence length ( Figure S4 and Table S9).

| Genome annotation
To aid the gene annotation, we de novo assembled the transcrip-  (Table S11). We also assembled RNA-Seq data (Table S10) from LON, and this process generated 111,105 transcripts with a N50 length of 1,064 bp and a total sequence length of ~74,561,481 bp (Table S11).
In combination with ab initio prediction, protein and public EST alignments, EVM combing and further filtering, we predicted 52,997 and 40,014 protein-coding genes for RUF and LON, respectively ( Figure S5 and Table S12). After the predicted genes were functionally annotated against InterPro, Pfam, and GO protein databases, we aligned protein sequences of O. sativa ssp. japonica cv. Nipponbare and the above-mentioned RNA-Seq data of RUF and LON, representing the major tissue types and different developmental stages, to assess the quality of gene prediction. Our results showed that ~89.8% and ~81.6% gene models were supported by transcripts or proteins (identity ≥30% and coverage ≥90%) in RUF and LON, respectively ( Figure S6 and Table S13).
Our annotation of repetitive sequences showed that approximately 45.18% of the RUF genome consists of TEs ( Figure 1a and Table S14), slightly lower than that (~50.97%) annotated in the Nipponbare genome with the same methods ( Figure 1a and Table   S14). LTR retrotransposons and MULEs were the most abundant TE types, occupying roughly 15.26% and 11.21% of the RUF genome, respectively. The annotation of repetitive sequences revealed that approximately 36.69% of the LON genome consists of TEs (Table   S14). LTR retrotransposons and MULEs were also the most abundant repeated sequences, occupying approximately 10.66% and 6.53% of the LON genome. We annotated a total of 214,337 SSRs (Table   S15) in RUF and 184,773 SSRs (Table S15) in LON, they will provide valuable genetic markers to assist rice breeding programs (Jurka & Pethiyagoda, 1995).

| Evolutionary dynamics of rice gene families
Genome-wide comparative analysis of the nucleotide-binding sites with leucine-rich repeat (NBS-LRR) genes further showed a remarkable expansion of gene families relevant to an enhanced disease resistance in RUF and LON. In total, we identified 631, 845,489,450,476,392,768,416, and 426 genes encoding NBS-LRR proteins in SAT, RUF, NIV, GLA, BAR, GLU, LON, MER, and PUN, respectively (Table S24). This amplification in RUF and LON is mainly attributable to an increase in CC-NBS, CC-NBS-LRR, NBS, and NBS-LRR domains, further supporting that they may have played an important role in biotic resistance and abiotic stresses. We positioned these orthologous R-genes to definite genomic locations across the SAT chromosomes (Figure 3), displaying an almost unequal distribution of the amplified NBS-encoding genes throughout the entire genome, among which Chromosome 11 harbored the utmost number of R-genes for all these nine rice species. They will evidently offer a large number of candidate loci for further functional genomic studies on disease resistance and rice breeding programs.

| Rapid evolution of CNSs
Conserved noncoding sequences are genomic regions showing a reduced mutation frequency of noncoding bases, many of which are regulatory elements that evolve under purifying selection F I G U R E 2 Expansion and contraction of gene families among the eight AA-genome Oryza species using O. punctata (BB-genome) as outgroup (Haudry et al., 2013). To aid evolutionary analyses of CNSs, we first identified and aligned orthologous genomic regions from the eight AA-genome assemblies. In total, we obtained 8,742 orthologous genomic segments among the eight AA-genome Oryza species, ranging from 40.02% in RUF to 52.58% in GLA (   are associated with the rapidly evolving regions were annotated in each of eight Oryza AA-genomes using AgriGO (Du, Zhou, Ling, Zhang, & Su, 2010). It is intriguing that a number of GO terms  A large collection of genomic sequences of Asian cultivated rice and its wild relatives has unquestionably formed a solid foundation to search for novel gene sources from wild rice germplasms (e.g., Du et al., 2017;Goff et al., 2002;IRGSP, 2005;Shi et al., 2020;Stein et al., 2018;Wang et al., 2014;Yu et al., 2002;Zhang et al., 2016;Zhang et al., 2014;Zhang et al., 2015). Future functional experiments should employ advanced genetic tools to study theses disease resistance candidate loci and genomic regulatory elements to determine how they are involved in specific adaptations. We thus expect efforts to generate chromosome-scale reference genome sequences of wild rice species using long-read SMRT platform, which would be particularly helpful for comparative genomic analyses of the genus Oryza, functional genomic studies, wild rice germplasm utilization towards the future rice breeding programs, and efficient conservation management of the seriously endangered natural populations of these wild rice species.

ACK N OWLED G M ENTS
We thank International Rice Research Institute (Manila, Philippines) for kindly providing rice germplasms. This work was supported by Yunnan Innovation Team Project and Natural Science Foundation of Yunnan (2015FA030) (to L.-Z. G.).

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The raw sequence data and genome assemblies reported in this paper have been deposited in the Genome Sequence Archive and Genome Warehouse in BIG Data Center, Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under accession numbers PRJCA002637.