Diversity and evolution of rice progenitors in Australia

Abstract In the thousands of years of rice domestication in Asia, many useful genes have been lost from the gene pool. Wild rice is a key source of diversity for domesticated rice. Genome sequencing has suggested that the wild rice populations in northern Australia may include novel taxa, within the AA genome group of close (interfertile) wild relatives of domesticated rice that have evolved independently due to geographic separation and been isolated from the loss of diversity associated with gene flow from the large populations of domesticated rice in Asia. Australian wild rice was collected from 27 sites from Townsville to the northern tip of Cape York. Whole chloroplast genome sequences and 4,555 nuclear gene sequences (more than 8 Mbp) were used to explore genetic relationships between these populations and other wild and domesticated rices. Analysis of the chloroplast and nuclear data showed very clear evidence of distinctness from other AA genome Oryza species with significant divergence between Australian populations. Phylogenetic analysis suggested the Australian populations represent the earliest‐branching AA genome lineages and may be critical resources for global rice food security. Nuclear genome analysis demonstrated that the diverse O. meridionalis populations were sister to all other AA genome taxa while the Australian O. rufipogon‐like populations were associated with the clade that included domesticated rice. Populations of apparent hybrids between the taxa were also identified suggesting ongoing dynamic evolution of wild rice in Australia. These introgressions model events similar to those likely to have been involved in the domestication of rice.

implying that the wild populations have remained largely isolated from the impacts of gene flow from domesticated crops that has apparently been widespread in Asia (Brozynska et al., 2017). The AA genome species of rice include cultivated species and their close relatives (Choi, Platts, Fuller, Wing, & Purugganan, 2017). Draft genome sequences of the AA genome populations from Australia have recently been reported, indicating that these populations may be an important genetic resource for rice because of their high diversity and phylogenetic relationship to domesticated rice (Brozynska et al., 2014, 2017; Sotowa et al., 2013; Wambugu, Brozynska, Furtado, Waters, & Henry, 2015).
We now report on an analysis of the genomes of rice collected from sites over a wide area in northeastern Australia allowing analysis of the diversity and relationships within and between these wild populations. GPS coordinates, observations of plant spike form, awn length, an herbarium voucher, and photographs of flowers (where possible) were obtained at each site (Appendix S1, Table S1, Figure S1).

| Morphological measurement
Anther and awn measurements were recorded in the field. For anther length, 4-8 flowers from 3 to 6 immature panicles were selected at random from each population, photographed against a standard background with a scale, and measurements obtained later in the laboratory using Image-Pro Plus software (Media Cybernetics, MD, USA, http://www.mediacy.com/index.aspx?page=IPP). The awn length was measured for ten different plants from each population selected at random.

| DNA extraction and sequencing
Vegetative tissue from 29 samples (representing each of the collection sites) was prepared and DNA extracted as described by Furtado (2014). Three approaches were used to assess the quality and quantity of the extracted DNA: NanoDrop (Thermo Fisher Scientific), agarose gel electrophoresis, and Qubit (Thermo Fisher Scientific).
Multiplex sequencing of the 29 wild rice samples was conducted on a HiSeq 4000 (Illumina) using 2 × 150 bp paired-end reads, aiming to produce approximately 10× whole genome coverage on average. Reference chloroplast genome sequences were obtained as described in Appendix S1, Table S9.
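The relationship between a coverage target and sequencing effort can be sketched as simple arithmetic: coverage is total sequenced bases divided by genome size. The following is a minimal illustration only. The genome size used here (400 Mbp) is an assumed round figure for demonstration and is not a value reported in this study; the function name is hypothetical.

```python
import math


def read_pairs_for_coverage(genome_size_bp: int, target_coverage: float,
                            read_length_bp: int = 150) -> int:
    """Return the number of read pairs needed to reach the target mean coverage.

    Mean coverage = (read pairs * 2 * read length) / genome size.
    """
    bases_per_pair = 2 * read_length_bp  # two reads per paired-end fragment
    return math.ceil(genome_size_bp * target_coverage / bases_per_pair)


# ASSUMPTION: illustrative genome size only, not taken from the paper.
ASSUMED_GENOME_SIZE_BP = 400_000_000

pairs_needed = read_pairs_for_coverage(ASSUMED_GENOME_SIZE_BP, target_coverage=10)
print(f"Read pairs per sample for ~10x coverage: {pairs_needed:,}")
```

Because the chloroplast genome is present in many copies per cell, a modest nuclear coverage target of this kind would be expected to yield much deeper chloroplast coverage.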

| Chloroplast genome assembly
The sequence reads were analyzed using CLC Genomics Workbench v.9, Geneious v.9.1.5, and Clone Manager Professional 9 (Kim et al., 2015). A quality check (QC) was applied to all raw data. Based on the results of the QC report, reads were trimmed. A dual pipeline approach was used to assemble the chloroplast genome sequences: mapping reads to reference and de novo assembly. The outputs of both pipelines were combined, and all discrepancies were resolved and corrected manually.
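The comparison step of the dual pipeline can be sketched as listing the positions where the two assemblies disagree, so that each can be inspected manually. This is a hypothetical illustration, not the procedure used in CLC or Geneious: it assumes the reference-mapped consensus and the de novo assembly have already been aligned to equal length, with "-" marking gaps.

```python
def find_discrepancies(mapped: str, denovo: str) -> list[tuple[int, str, str]]:
    """List (1-based position, mapped base, de novo base) where two aligned
    sequences differ. Both inputs must already be aligned to the same length."""
    if len(mapped) != len(denovo):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(mapped.upper(), denovo.upper()), start=1)
        if a != b
    ]


# Toy sequences (invented for illustration): one substitution and one indel.
mapped_consensus = "ACGT-A"
denovo_contig = "ACCTTA"
for pos, a, b in find_discrepancies(mapped_consensus, denovo_contig):
    print(f"position {pos}: mapped={a} de_novo={b}")
```

Each reported position would then be checked against the read evidence before a final base is chosen.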

| Chloroplast genome annotation
All chloroplast sequences were annotated using the CpGAVAS website (http://www.herbalgenomics.org/0506/cpgavas/analyzer/home), using the default parameters as recommended. The outcome was imported directly into Geneious software to allow comparison with the reference O. sativa japonica NC_001320 to identify polymorphisms.

| Phylogenetic analysis of nuclear genes
Phylogenetic analysis was based upon a set of 4,643 genes that were found in all included Oryza species (Brozynska et al., 2017).
These sequences were obtained from the sequence data pool for each field sample and reference genome using the software packages FastQC, BWA, Samtools, bcftools, and MUMmer. The accession identifiers of the reference samples used were as follows:  (Katoh et al., 2002). Following this individual gene alignment, files were concatenated into a single alignment for each chromosome; then, all chromosomes were combined into a whole genome alignment of 8,179,015 base pairs (Figure 3b).
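The two-stage concatenation (genes into chromosomes, chromosomes into a whole genome alignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: each gene alignment is represented as a dictionary from taxon name to aligned sequence, every alignment is assumed to contain the same taxa, and the toy data and function name are hypothetical.

```python
from collections import defaultdict


def concatenate(alignments: list[dict[str, str]]) -> dict[str, str]:
    """Join aligned blocks end to end, taxon by taxon, in the order given."""
    if not alignments:
        return {}
    taxa = set(alignments[0])
    parts = defaultdict(list)
    for aln in alignments:
        if set(aln) != taxa:
            raise ValueError("every alignment must contain the same taxa")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within an alignment must have equal length")
        for taxon, seq in aln.items():
            parts[taxon].append(seq)
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data (invented): two genes on chr1, one gene on chr2, two taxa.
genes_by_chromosome = {
    "chr1": [{"A": "AC", "B": "AT"}, {"A": "G-", "B": "GG"}],
    "chr2": [{"A": "T", "B": "C"}],
}

# Stage 1: one alignment per chromosome. Stage 2: whole genome alignment.
per_chromosome = {c: concatenate(g) for c, g in genes_by_chromosome.items()}
genome = concatenate([per_chromosome[c] for c in sorted(per_chromosome)])
print(genome)
```

Keeping the per-chromosome alignments as an intermediate product allows the same data to be analyzed both by chromosome and as one full-length sequence.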
Phylogenetic trees were reconstructed using three analytical approaches: maximum likelihood (ML), maximum parsimony (MP), and Bayesian inference (BI). For the ML analysis, PHYML version 20131022 was used with the following settings: tree topology search = NNIs, initial tree = parsimony, model of nucleotide substitution = GTR (Guindon & Gascuel, 2003). For the MP analysis, PAUP 4.0 was used with the following settings: stepwise taxon addition with random seed, heuristic tree search strategy, and 1,000 bootstrap replicates (Swofford, 2002). For the BI analysis, MrBayes was used with the same settings as reported in Brozynska et al. (2017).

| RESULTS AND DISCUSSION
Wild AA genome rice was collected from 27 sites in north Queensland, Australia (Figure 1 and Appendix S1, Table S1). Plants were found around the margins of lakes and creeks (Appendix S1, Figure S1) where, for the most part, water was available to support their growth. Wild rice was not located on Cape York north of the Jardine River (−11.103665, 142.283901). All regions of the chloroplasts were successfully sequenced.
The high sequence coverage ensured a complete genome sequence was obtained for all sites in the assembly pipeline that was used.
The average coverage of the total chloroplast for all samples was 683× while the highest and lowest coverages were 2,063× and 10×, respectively (Appendix S1, Table S2). Compared to the reference sequence, an average of 129.6 variants (deletions, insertions, and SNPs) per sample were found (Appendix S1, Table S3), which agrees with the results reported by Brozynska et al. (2014). A total of 18 functional polymorphisms were found in the chloroplasts with six of them common to all samples (Appendix S1, Tables S4, S5).
The aligned sequence comprised 135,532 bp. Of the variable sites, 227 were parsimony-informative and 661 were uninformative (427 were unique). The phylogenetic trees constructed using different approaches (Appendix S1, Table S6) were highly congruent (Brozynska et al., 2014; Kim et al., 2015; Wambugu et al., 2015). As in earlier work (Wambugu et al., 2015) includes Taxon A (Appendix S1, Table S7) (Kim et al., 2015). The chloroplasts of the different Australian AA genome taxa showed significant genetic differences (Figure 2). The concatenated alignment of 4,555 nuclear genes comprised 8,179,015 bp of which 44.1% were invariant. The minimum and maximum lengths were 5,916,081 and 7,013,653 bp, respectively, slightly longer than reported previously (Brozynska et al., 2017). The nuclear analysis (as one full length sequence and by chromosomes) grouped the Australian samples into two main clades. One of these included Taxon A and the other much larger group (27 samples (Brozynska et al., 2017). The phylogeny based upon individual chromosomes (Appendix S1, Figures S2-S13) shows that these populations were sister to all Asian and African rices (chromosomes 4, 5, 6, 7, 8), the Asian rices (chromosomes 9, 10), O. indica/O. nivara (chromosomes 1, 2, 3, 11), or Australian rices (chromosome 12). This suggests these Australian populations should be considered as a distinct, undescribed taxon (Brozynska et al., 2017).
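The site categories reported for the chloroplast alignment follow the standard definition: a variable site is parsimony-informative when at least two character states each occur in at least two sequences; otherwise it is variable but uninformative. The sketch below illustrates that classification on a toy alignment. Treating gaps and ambiguous bases as missing data is an assumption made here, and the counts in the text were produced by the phylogenetic software, not by this code.

```python
from collections import Counter


def classify_site(column: str, missing: str = "-N?") -> str:
    """Classify one alignment column as invariant, informative, or uninformative."""
    # ASSUMPTION: gaps and ambiguity codes are ignored as missing data.
    counts = Counter(c for c in column.upper() if c not in missing)
    if len(counts) <= 1:
        return "invariant"
    # States seen in two or more sequences can support a grouping.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared_states >= 2 else "uninformative"


def summarize(alignment: list[str]) -> Counter:
    """Count site categories across an alignment of equal-length sequences."""
    return Counter(classify_site("".join(col)) for col in zip(*alignment))


# Toy alignment (invented): four sequences, four sites.
toy_alignment = ["AACA", "AACT", "AGTA", "AGTA"]
site_summary = summarize(toy_alignment)
print(dict(site_summary))
```

In the toy alignment, the first column is invariant, the second and third each split the sequences two against two, and the fourth carries a single unshared difference, which cannot favor one tree over another under parsimony.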

Analysis of the chloroplast genomes placed Australian plants with O. rufipogon-like morphology in the Australian clade, distant from the Asian O. rufipogon, which were placed in the Asian clade. Some populations with a nuclear genome similar to O. meridionalis had a chloroplast genome that was closer to the O. rufipogon-like plants (Taxon A), suggesting that their evolutionary history involved some introgression or hybridization and chloroplast capture (Brozynska et al., 2014, 2017; Wambugu et al., 2015). One example of chloroplast capture in the other direction was also detected (WR- Further research should determine the diversity of useful alleles in these populations that might be incorporated into domesticated rice to improve stress tolerance and grain quality. The need for increased efforts to conserve these species in situ and ex situ is suggested by the very limited collection of this material in seed collections and the more limited distribution of the O. rufipogon-like populations in the wild in locations that may be threatened by the incursion of weeds.