Multicentric origin and diversification of atp6‐orf79‐like structures reveal mitochondrial gene flows in Oryza rufipogon and Oryza sativa

Abstract Cytoplasmic male sterility (CMS) is a widely used genetic tool in modern hybrid rice breeding. Most genes conferring rice gametophytic CMS are homologous to orf79 and are co‐transcribed with atp6. However, the origin, differentiation and flow of these mitochondrial genes in wild and cultivated rice species remain unclear. In this study, we performed de novo assembly of the mitochondrial genomes of 221 common wild rice (Oryza rufipogon Griff.) and 369 Asian cultivated rice (Oryza sativa L.) accessions, and identified 16 haplotypes of atp6‐orf79‐like structures and 11 orf79 alleles. These homologous structures were classified into 4 distinct groups (AO‐I, AO‐II, AO‐III and AO‐IV), all of which were observed in O. rufipogon but only AO‐I was detected in O. sativa, causing a decrease in the frequency of atp6‐orf79‐like structures from 19.9% to 8.1%. Phylogenetic and biogeographic analyses revealed that the different groups of these gametophytic CMS‐related genes in O. rufipogon evolved in a multicentric pattern. The geographical origin of the atp6‐orf79‐like structures was further traced back, and a candidate region in the north‐east of the Gangetic Plain on the Indian Peninsula (South Asia) was identified as the origin centre of AO‐I. The orf79 alleles were detected in all three cytoplasmic types (Or‐CT0, Or‐CT1 and Or‐CT2) of O. rufipogon, but only two alleles (orf79a and orf79b) were observed in the Or‐CT0 type of O. sativa, while no orf79 allele was found in other types of O. sativa. Our results also revealed that the orf79 alleles in cultivated rice originated from the wild rice population in South and South‐East Asia. In addition, strong positive selection pressure was detected on the sequence variations of orf79 alleles, and a special evolutionary strategy was noted in these gametophytic CMS‐related genes, suggesting that their divergence could be beneficial to their survival in evolution.


| INTRODUCTION
Cytoplasmic male sterility (CMS) is a maternally inherited trait characterized by the inability to produce functional pollen; it has been observed in many species of higher plants and is widely used in commercial hybrid seed production. Molecular studies indicate that CMS is usually associated with chimeric open reading frames (ORFs) in the mitochondrial genome that encode abnormal (toxic) proteins (Bentolila, Alfonso, & Hanson, 2002) and that include sequences homologous to essential genes coding for ATPase, cytochrome c oxidase or ribosomal proteins (Arrieta-Montiel & Mackenzie, 2011; Chen & Liu, 2014; Tang et al., 2017). These chimeric ORFs are typically thought to originate from illegitimate recombination events between normal mitochondrial genes and gene-flanking sequences.
Rice CMS has been intensively studied and widely used in commercial production for decades. To date, more than 60 rice CMS lines have been developed by hybridization between species, subspecies and varieties; they can be categorized into two major types, sporophytic and gametophytic CMS (Li, Yang, & Zhu, 2007).
In these two types, the sterility of male gametes depends on the sporophyte and gametophyte genotypes, respectively. Three distinct sporophytic CMS genes have been cloned in rice, WA352 and WA314 in wild abortive-type CMS (WA-CMS) and orf182 in D1-CMS (Tang et al., 2017; Xie et al., 2018), although their cytoplasms are derived from the same wild rice species, Oryza rufipogon. Most gametophytic CMS-related genes share homologous sequences of the same gene; for example, orf79 (here denoted orf79a), L-orf79 (orf79b) and orfH79 (orf79k) are considered the functional genes of Boro II-type CMS (BT-CMS), Lead rice-type CMS (LD-CMS) and Hong-Lian-type CMS (HL-CMS), respectively, but their cytoplasms are derived from Chinsurah Boro II (Oryza sativa ssp. indica), the Burmese cultivar Lead rice (O. sativa ssp. indica) and red-awned wild rice (O. rufipogon), respectively (Kazama, Itabashi, Fujii, Nakamura, & Toriyama, 2016; Peng et al., 2010; Shinjyo, 1969; Wang et al., 2006; Watanabe, 1971; Yingsheng, 1988). These orf79 alleles are all co-transcribed with the upstream atp6 in the form of atp6-orf79-like structures; they can induce gametophyte abortion through different physiological mechanisms, such as cytotoxicity caused by the accumulation of ORF79 and L-ORF79 mainly in microspores, or energy deficiency resulting from reduced enzymatic activity of mitochondrial complex III. The amount of ORF79 protein produced in LD-CMS is considerably lower than that in BT-CMS (Itabashi, Kazama, & Toriyama, 2009; Kazama et al., 2016). These atp6-orf79-like structures share high nucleotide sequence similarity and play a very important role in the molecular mechanism underlying gametophytic CMS, suggesting a possible origin from a common ancient genotype of atp6-orf79.
Previous studies have also focused on the variations of the known gametophytic CMS genes and variant haplotypes of orf79 (Duan, Li, Li, Xiong, & Zhu, 2007; Duan, Zheng, Yan, He, & Liao, 2015; Li, Tan, Wang, Wan, & Zhu, 2008; Luan et al., 2013). However, whereas the formation and evolution patterns of WA352-like structures for sporophytic WA-CMS have been studied in detail (Tang et al., 2017), the origin and diversification of mitochondrial atp6-orf79-like structures for gametophytic CMS, such as the phylogenetic and geographic relationships among them and how they originated and spread in wild and cultivated rice species, remain obscure. Furthermore, large-scale and accurate assembly of mitochondrial genomes is difficult owing to the rapid variation in the noncoding regions and complex rearrangement of mitochondrial genes (Knoop, 2004) and to their exchange of fragments with nuclear and plastid genomes (Timmis, Ayliffe, Huang, & Martin, 2004). These difficulties have limited the molecular characterization of plant mitochondrial genomes at the population level; thus, only a limited number of complete mitochondrial genomes have been successfully assembled from whole genome sequencing (WGS) data sets in plants (Donnelly et al., 2017; Iorizzo et al., 2012; Zimmermann et al., 2019), even with the accumulation of high-throughput sequencing data in recent years. However, the variation rates of the coding regions of mitochondrial genes were found to be very low, even lower than those of nuclear and plastid genes (Muse, 2000; Wolfe, Li, & Sharp, 1987). Hence, mitochondrial protein-coding genes could be readily assembled from WGS data sets on a large scale.
In this study, a wide-scope screening of homologous structures related to atp6-orf79 was attempted by assembling and assessing draft mitochondrial genomes of 590 common wild rice (O. rufipogon) and Asian cultivated rice (O. sativa) genotypes based on WGS data sets from previous studies (Wang et al., 2018); 16 atp6-orf79-like structures and 11 orf79 alleles were obtained.
Distinct groups and multicentric features were observed in the analyses of phylogenetic and biogeographic diversification, revealing different evolutionary routes of these gametophytic CMS-related genes. Their geographical origin was deduced and revealed a complex process with multiple origins. Furthermore, novel evidence was provided to confirm the distinct and continuous mitochondrial gene flows during the diversification of common wild rice and the domestication of Asian cultivated rice. In addition, strong positive selection pressure was detected on sequence variations of the orf79 alleles, indicating a special evolutionary strategy of these gametophytic CMS-related genes, in which their divergence could be beneficial to their survival under natural conditions.

| Raw data set
A total of approximately 280 Gb of whole genome sequencing (WGS) data for Asian O. rufipogon and O. sativa genotypes were downloaded from the EMBL database; these had been generated from paired-end libraries (Wang et al., 2018). The whole genome coverage depth of sequence reads ranged from 1.02× to 7.35×, with an average of 2.53×. In all, 590 samples comprising 221 O. rufipogon accessions and 369 O. sativa cultivars, which originated from 63 countries or regions in Asia, Europe, Africa, America and Oceania, were included (Figure S1, Table S1). The samples were first selected randomly from the original data sets and then adjusted manually according to their geographic distribution. Among them, 91 samples, including 59 O. rufipogon accessions and 32 O. sativa varieties, were identified as containing homologous structures of orf79 (Table S2). The O. sativa samples were classified into five types: Indica, Japonica, Aus, Aromatic and Intermediate (Wang et al., 2018).

| The de novo assembly and annotation of draft mitochondrial genomes
The raw WGS data of 590 accessions were processed using FastQC v0.11.5 and NGSQCToolkit v2.3 software to control the sequence quality of the original data set; they were then filtered using BWA and SAMtools (Li & Durbin, 2010) to extract reads of mitochondrial origin that were properly paired to the reference mitochondrial genomes. These reads were finally used for de novo assembly of the mitochondrial genomes with SPAdes (Bankevich et al., 2012). To improve the assembly quality for mitochondrial genes from the low-coverage WGS data, we applied 4 strategies: (a) Quality control reports for all paired raw data were first generated using FastQC. The data that passed quality control on per base sequence quality, per tile sequence quality and adapter content were selected. The remaining data were further trimmed and processed in NGSQCToolkit v2.3.3 with default options except qualCut-Off = 25 and cutOffQualScor = 25. (b) Interference from excessive variation in the mitochondrial genomes of distant species was reduced by using only 13 mitochondrial genomes in the GenBank database, including 10 from the Oryza genus, as reference genomes. Read pairs in which at least one read mapped to the reference genomes (the -G 12 option in SAMtools) were selected as targeted reads. (c) The careful option of SPAdes was applied to fix assembly errors caused by mismatches and short indels, and the cov-cutoff option was set to auto to remove residual contigs of nuclear origin according to their abnormal k-mer coverage. (d) Gap closing and additional scaffolding were further conducted to obtain complete atp6-orf79-like structures, if necessary.
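The read-selection rule in strategy (b) can be illustrated with SAM flag arithmetic. The sketch below assumes that the setting refers to `samtools view -G 12`, which discards a record only when both the "read unmapped" (4) and "mate unmapped" (8) bits are set. The function names and the toy records are hypothetical, and this is not the authors' script.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate of this read is unmapped


def passes_G12(flag: int) -> bool:
    """Mimic `samtools view -G 12`: reject only if BOTH bits 4 and 8 are set."""
    mask = READ_UNMAPPED | MATE_UNMAPPED
    return (flag & mask) != mask


def select_reads(records):
    """Keep (name, flag) records from pairs with at least one mapped end."""
    return [(name, flag) for name, flag in records if passes_G12(flag)]


# Toy records (hypothetical read names, standard SAM flag values).
example = [
    ("pair1/1", 99),  # paired + proper pair + mate reverse + first in pair
    ("pair2/1", 73),  # paired + mate unmapped + first in pair
    ("pair3/1", 77),  # paired + read unmapped + mate unmapped + first in pair
]
kept = select_reads(example)
```

Only the third record is dropped, because neither end of that pair maps to a reference mitochondrial genome.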
Summary statistics were calculated to evaluate the quality of the assembled mitochondrial contigs by using QUAST (Gurevich, Saveliev, Vyahhi, & Tesler, 2013). Gene annotation of the assembled mitochondrial contigs was performed using a local BLASTN program and manually corrected in MEGA7 (Kumar, Stecher, & Tamura, 2016) when necessary. Three atp6-orf79-like sequences related to BT-CMS (AP017386.1), LD-CMS (AP011077.1) and HL-CMS (Peng et al., 2010) were obtained from GenBank or published reports and were used as query sequences to annotate the homologous structures of atp6-orf79; 50 complete CDSs of protein-coding genes in the mitochondrial genome of Nipponbare were downloaded from the Ensembl Plants website and used as reference sequences for the annotation of homologous genes.

| Haplotype and genetic diversity analyses
Genealogical relationships of the identified haplotypes were inferred using a median-joining method and were visualized in Network v5.003 (Fluxus Technology Ltd.) and Adobe Illustrator software (Adobe Systems Incorporated). Haplotype diversity, evolutionary distances based on the Tajima-Nei model and population differentiation (FST) were calculated for each group of haplotypes by using DnaSP 6 (Rozas et al., 2017) and MEGA7 (Kumar et al., 2016). The principal coordinates analysis (PCoA) was conducted using GenAlEx 6.5 (Peakall & Smouse, 2012). Population diversity (π) within and between different populations was calculated and tested using the pairwise differences method in Arlequin 3.5 (Excoffier & Lischer, 2010).
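As an illustration of the haplotype diversity statistic mentioned above, the following minimal sketch uses Nei's estimator, Hd = n/(n − 1) × (1 − Σ p_i²), where p_i is the frequency of the ith haplotype among n sampled sequences. It is assumed here that this is the estimator reported by the software; the function name and the toy sample are hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's haplotype diversity for a list of haplotype labels."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sequences are required")
    # Relative frequency of each distinct haplotype in the sample.
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))


# Toy sample: four sequences carrying three haplotypes (invented input).
hd = haplotype_diversity(["H1", "H1", "H6", "H10"])
```

A sample in which every sequence carries the same haplotype gives a diversity of 0, and a sample in which every sequence differs gives 1.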

| Phylogenetic analysis
To perform the phylogenetic analysis, DNA sequences of mitochondrial genes were aligned in MAFFT software (Katoh & Standley, 2013) and were manually adjusted in MEGA7 (Kumar et al., 2016). All DNA regions were aligned separately and concatenated before analyses.
Phylogenetic analyses were conducted using maximum likelihood methods in IQ-TREE software (Nguyen, Schmidt, von Haeseler, & Minh, 2014), and the Bayesian information criterion was used to determine the best-fit model of nucleotide substitution. Ultrafast bootstrap with 2,000 replicates was used to assess support for the best tree and its branches. The final trees were then plotted using the online tool Interactive Tree of Life (https://itol.embl.de/). Selection pressure analysis was conducted using the codeml program in PAML 4.9 (Yang, 1997).
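The model-selection step can be made concrete with the standard definition BIC = k ln(n) − 2 ln L, where k is the number of free parameters, n the number of alignment sites and ln L the maximized log-likelihood; the model with the lowest BIC is preferred. The sketch below is only an illustration: the log-likelihoods and parameter counts are invented toy values (in practice branch lengths also count as free parameters), and the function names are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; lower values indicate a better fit."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def best_model(candidates, n_sites):
    """Return the name of the candidate model with the lowest BIC.

    `candidates` maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: bic(*candidates[name], n_sites))


# Toy values, not results from the study.
toy_models = {
    "JC": (-2500.0, 1),
    "HKY": (-2480.0, 5),
    "GTR": (-2478.0, 9),
}
chosen = best_model(toy_models, n_sites=1485)
```

With these toy numbers the small likelihood gain of GTR over HKY does not offset the penalty for its additional parameters, so HKY is chosen.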

| Geographic differentiation and biogeographical inference
The spatial autocorrelation analysis and Mantel test were conducted using SPAGeDi 1.5a (Hardy & Vekemans, 2002) and GenAlEx 6.5 (Peakall & Smouse, 2012). The geographic central feature (GCF) and geographic median centre (GMC) of a haplotype were identified based on pairwise geographic distances between the genotypes containing that haplotype. The GCF was defined as the genotype with the shortest average distance to the other genotypes, whereas the GMC was defined as the theoretical coordinate with the shortest average distance to the genotypes. Spatial kernel density for a selected haplotype was estimated from the coordinates of its holders by using the package MASS (Venables & Ripley, 2002) in R (R Core Team, 2018).
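The GCF definition above can be sketched as follows, assuming great-circle (haversine) distances on a sphere of radius 6,371 km; the distance measure actually used is not stated in the text, and the function names and coordinates are hypothetical. The GMC would require an iterative search over arbitrary coordinates and is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed spherical model)


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geographic_central_feature(coords):
    """Index of the sample with the smallest mean distance to all others."""
    def mean_dist(i):
        others = [c for j, c in enumerate(coords) if j != i]
        return sum(haversine_km(coords[i], c) for c in others) / len(others)
    return min(range(len(coords)), key=mean_dist)


# Toy holders of one haplotype: a central site surrounded by four others.
toy_coords = [(25, 80), (25, 90), (25, 85), (20, 85), (30, 85)]
gcf_index = geographic_central_feature(toy_coords)
```

Here the sample at (25, 85) is returned because it lies in the middle of the other four.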
The geographic region for an ancestral haplotype was inferred by constructing a linear regression model based on pairwise genetic distances and pairwise geographic distances, modified from the method described by Ramachandran et al. (2005), as follows:

$$y = ax + b, \qquad x_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_j$$

where $y$ is the vector of genetic distances between the target haplotype and the remaining haplotypes, calculated using IQ-TREE with the best-fit model and optimal parameters; $x$ is the vector of average geographic distances between the candidate origin location of the target haplotype and the locations of the remaining haplotypes, based on coordinate information, whereas $x_i$ is the average geographic distance between the candidate location of the target haplotype and the locations of the $i$th haplotype; $n_i$ is the total number of genotypes containing the $i$th haplotype; $d_j$ is the geographic distance between the candidate location of the target haplotype and the location of the $j$th genotype containing the $i$th haplotype; the slope $a$ indicates the mutation rate of a gene with geographic distance, whereas the intercept $b$ indicates the mean distribution radius of the target haplotype. The goodness of fit of the models for the coordinates of different candidate origin locations was calculated using a customized Perl script, and a hotspot region comprising all coordinates with $R^2 \geq 0.8$ was considered the origin area of the target haplotype. Additionally, a filtered data set of haplotypes was constructed for the genetic-spatial regression model by taking the intersection of haplotypes in the genotypic and spatial groups to eliminate interference from foreign haplotypes (i.e. haplotypes derived from other regions or originating from nontarget haplotypes).
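The candidate-location scan described above can be sketched in a few lines. This is a minimal illustration rather than the customized Perl script: it regresses genetic distance on the mean geographic distance from each candidate origin and keeps candidates whose R² reaches the threshold. Planar Euclidean distance stands in for geographic distance, and all names and toy values are hypothetical.

```python
import math


def linear_fit(x, y):
    """Ordinary least squares for y = a*x + b; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx, (sxy * sxy) / (sxx * syy)


def mean_distance(candidate, locations):
    """x_i: mean distance from a candidate origin to the holders of haplotype i."""
    return sum(math.dist(candidate, p) for p in locations) / len(locations)


def scan_candidates(candidates, holders, genetic_dist, threshold=0.8):
    """Return (candidate, R^2) for every candidate with R^2 >= threshold."""
    haps = sorted(genetic_dist)
    y = [genetic_dist[h] for h in haps]
    hits = []
    for cand in candidates:
        x = [mean_distance(cand, holders[h]) for h in haps]
        _, _, r2 = linear_fit(x, y)
        if r2 >= threshold:
            hits.append((cand, r2))
    return hits


# Toy data: genetic distance grows with distance from the point (0, 0).
toy_holders = {"A": [(1, 0)], "B": [(0, 2)], "C": [(-3, 0)], "D": [(0, -4)]}
toy_genetic = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
hotspot = scan_candidates([(0, 0), (0, 5)], toy_holders, toy_genetic)
```

In this toy case only the candidate at (0, 0) passes the threshold; the candidate at (0, 5) gives an R² of roughly 0.57 and is rejected.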

| The de novo assembly statistics of rice mitochondrial genomes
We performed de novo assembly of the mitochondrial genomes of O. rufipogon and O. sativa based on an efficient customized pipeline.
In the pre-assembly process, a total of 314,448,811 short reads of mitochondrial origin were identified and filtered from the WGS data set of 590 genotypes as pre-assembly data. For each genotype, an average of 532,662 mitochondrial-original sequence reads, ranging from 124,920 to 2,555,310, were obtained, giving an average coverage of 111.56× for the whole mitochondrial genome, which was sufficient to assemble long genomic contigs based on the de novo approach. [...] were located in the long contigs (≥5.00 kb) that are long enough to cover most mitochondrial protein-coding genes, including the atp6-orf79-like structures (Figure 1). Annotation of the mitochondrial contigs indicated that 27 protein-coding genes, including atp6, were present in the form of complete coding sequences (CDS) in a single contig in all 590 genotypes. Further, the CDSs of all the identified orf79 alleles were also found within single contigs. These results indicated that the assembled mitochondrial genomes could be adequately used for the identification and variation analysis of atp6-orf79-like structures and other protein-coding genes.

| Identification of different atp6-orf79-like structures
To investigate sequence variation and distribution of atp6-orf79, we [...] were identified, whereas the remaining 13 structures were different from those reported previously (Table 1). We analysed the sequences of the 16 atp6-orf79-like structures by searching GenBank (http://www.ncbi.nlm.nih.gov/) and re-annotated their sequence [...]

Table footnote: a Nucleotide positions of H1 were used as the reference positions for all loci.

| Nucleotide variation and genetic diversity of different atp6-orf79-like structures
Numerous mutations were observed in the intergenic region and orf79 coding sequence of different atp6-orf79-like structures. Their complete nucleotide sequences were aligned along a total length of 1,485 bp with 18 single nucleotide polymorphisms (SNPs), 3 length polymorphisms (9-105 bp) and two small insertions (1-4 bp; Table 2).
Among these variations, 6 SNPs were detected in the coding region of orf79, whereas no variation was found in the coding region of atp6, indicating relatively frequent sequence variation in the CMS gene orf79, but high conservation of the important mitochondrial gene atp6. The remaining variations were all located in the intergenic region.
Phylogenetic relationships and molecular diversity of the haplotypes of atp6-orf79-like structures were assessed on the basis of the conserved regions in the complete sequences. Three main clades were identified in the phylogenetic tree (H1-H9 and H16, H10-H13, and H14-H15), consistent with the statistical parsimony network of these haplotypes (Figures 3 and 4a). The latter two clades [...] (Table S5). H11 and H16 were detected in the EA (5 and 1, respectively) and SEA (9 and 1, respectively) regions. [...] were found only in the EA region; and H3 was noted in the SEA region (Figure 5).
The multicentric characteristics of these haplotypes could be further summarized on the basis of their geographical distributions in the 4 regions (Figure 5; Table 3). Two distinct distribution areas were observed: the SA (SA-I and SA-II) and EA regions contained 13 of the 15 haplotypes but shared no haplotypes. The SA-I region was a major distribution centre with the highest frequency of atp6-orf79-like structures (0.32) and orf79 alleles (0.51) (Table 3) and covered the GCF and GMC locations of the two widely distributed haplotypes (H1 and H6) in group AO-I (Table S5), as well as the most common orf79 allele (orf79a) in O. rufipogon (Table S6). Nevertheless, geographical isolation or barriers could also have affected the distribution of atp6-orf79-like structures; for example, a total of 11 haplotypes were region-specific, of which 3, 3, 3 and 2 haplotypes were specific to the SA-I, SA-II, EA and SEA regions, respectively.

| Spatial-genetic regression confirms the pattern of origin of the atp6-orf79-like structures
The disturbances associated with multicentric distribution and geo- [...] Therefore, we constructed a spatial-genetic regression model to [...] showed considerably higher frequencies of orf79 alleles (54.55% and 51.85%, respectively) than those in SEA (13.64%) and EA (0%), and almost all transferred orf79 alleles and atp6-orf79-like structures were detected in the SA regions, although a few of them were identified in SEA (Figure S5), strongly suggesting that the transferred orf79 alleles in the 4 groups of cultivated rice had mostly descended from the wild rice populations in the SA region.

| An efficient pipeline for the auto-assembly of mitochondrial genes based on low-coverage WGS data
As an organelle genome of endosymbiotic origin with maternal inheritance, the mitochondrial genome plays a very important role in studies related to productivity and development (Ogihara et al., 2005), evolution and historical phylogeny of plants (Lonsdale, 1988; Palmer & Herbon, 1988; Wolfe et al., 1987; Ye et al., 2017), as well as those related to variation, exchange and interaction between functional genes in nuclear and mitochondrial genomes (Hsu & Mullin, 1989; Knoop, 2004; Stern & Lonsdale, 1982; Warren, Simmons, Wu, & Sloan, 2016). The number of completed mitochondrial genomes has been increasing over the past few years, and hundreds of plant mitochondrial genomes have been deposited in the GenBank database. However, assembling complete mitochondrial genomes from sequencing data generated on second- and/or third-generation sequencing platforms is both time- and resource-consuming (Notsu et al., 2002; Shi et al., 2018), which makes it difficult to analyse mitochondrial genes on a large scale. Alternatively, Sanger sequencing of cloned fragments amplified using PCR or special PCR methods, for example hiTAIL-PCR (Jaramillo-Correa, Aguirre-Planter, Eguiarte, Khasa, & Bousquet, 2013; Luan et al., 2013; Tang et al., 2017), is commonly used, although it is also a laborious process.
In this study, we constructed an efficient pipeline to facilitate batch de novo assembly of long mitochondrial scaffolds for assessing mitochondrial genes from low-coverage WGS data of large collections. The main purpose of our pipeline was to assemble each of the CDS regions of mitochondrial genes in a single contig irrespective of its length; with this approach, we could produce accurate and continuous sequences of mitochondrial genes by avoiding interference from the abundant structural variations of noncoding regions and contamination by nuclear genomic fragments. Several strategies were employed to improve the continuity while ensuring the accuracy of the assembled mitochondrial genomes. For example, we used an optimal number of 13 mitochondrial genomes (10 from the Oryza genus and 3 from other related species of Poaceae) as reference genomes, instead of following the common practice of using a collection of dozens of different species or even all plant species; this enhanced the efficiency of filtering the target reads of mitochondrial origin from the WGS data set and reduced nontarget mapping resulting from the incorporation of excessive sequence variation from reference genomes of unrelated species. When the assembly was performed using SPAdes, different k-mer values were automatically tested to obtain an optimal k-mer, and the Mismatch Corrector module was activated to fix assembly errors caused by mismatches and short indels; the residual contigs of nuclear and plastid origin were further identified and removed based on their abnormal coverage depth (e.g. the coverage depth of mitochondrial contigs was usually tens or even more than one hundred times that of nuclear contigs, but less than one-third that of plastid contigs). By using this pipeline, we could assemble draft mitochondrial genomes of 590 genotypes with an average coverage depth of 112.85×, which was considerably higher than that of the nuclear genome (2.53×) in the original WGS data set.
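The coverage-based separation of contigs described above can be illustrated with a simple rule. The sketch below is not the pipeline's actual filter (SPAdes applies its own automatic cutoff): it assumes a hypothetical 10-fold minimum over nuclear depth, treats contigs at or above one-third of the plastid depth as plastid-like, and uses an invented plastid depth of 1,500×.

```python
def classify_contig(cov, nuclear_depth, plastid_depth, min_fold=10.0):
    """Assign a contig to a genome compartment from its coverage depth.

    min_fold is a hypothetical threshold: mitochondrial contigs are expected
    to be at least this many times deeper than the nuclear background.
    """
    if cov < min_fold * nuclear_depth:
        return "nuclear"
    if cov >= plastid_depth / 3.0:
        return "plastid"
    return "mitochondrial"


# Nuclear depth of 2.53x is the data-set average; plastid depth is assumed.
toy_contigs = {"contig_a": 3.1, "contig_b": 112.85, "contig_c": 1400.0}
labels = {
    name: classify_contig(cov, nuclear_depth=2.53, plastid_depth=1500.0)
    for name, cov in toy_contigs.items()
}
```

In practice the thresholds would need to be tuned per sample, because nuclear, mitochondrial and plastid depths vary with tissue and library preparation.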
In addition to the CMS gene orf79, 27 distinct mitochondrial genes, including 20 conserved genes, were each successfully assembled in a single contig from the low-coverage WGS data set (Table S7).
The nucleotide sequences of these 20 genes were highly conserved and remained unchanged in all 590 assembled genomes, which further underscored the reliability of this pipeline (Table S7). However, additional manual scaffolding was needed to generate continuous CDSs of some nonconserved mitochondrial genes and intergenic sequences owing to abundant variations in those regions; for example, in more than 200 genotypes, 14 mitochondrial protein-coding genes were annotated with their CDS regions split across more than one contig, which could still be fixed manually, although this would be a time-consuming process.

| Multi-origination and complex diversification process of atp6-orf79-like structures during the co-evolution of common wild rice and Asian cultivated rice
We found that (a) both the ancestral haplotype (H1) of atp6-orf79-like structures and the ancestral allele (orf79a) of orf79 in group AO-I were most frequently detected in O. rufipogon accessions derived from SA-I, with frequencies of 11.11% and 51.85%, respectively, followed by SA-II (9.09% and 45.45%, respectively), SEA (4.55% and 13.64%, respectively) and EA (0% and 0%, respectively), indicating a clear flow from the SA regions to SEA and finally, probably, to EA; (b) ancestral haplotypes in the AO-II (i.e. H10 and H11) and AO-IV (i.e. H16) groups showed a dispersing tendency from EA (with higher frequency) to SEA and then to the SA regions (with lower frequency); (c) many haplotypes showed cytoplasmic- or region-specific distribution; for example, H14 was distrib- [...] after which they further dispersed to SEA with a flowing-and-varying feature. We further investigated the previously reported variation of orf79 (Duan et al., 2007, 2015; Li et al., 2008; Luan et al., 2013) and obtained a total of 15 alleles that were all different from those identified in this study (except orf79a and orf79k). Most of them (10/15) were detected in only a single genotype, indicating a potential cytoplasmic- or region-specific distribution; three alleles, including orf79k, [...] Indica and Aus groups were identified as Or-CT1 type. In addition, the Aromatic and Aus groups showed a considerably high frequency of atp6-orf79-like structures and orf79 alleles, which was close to or higher than that of the wild rice in the SA regions. These results revealed considerably frequent gene exchange between Asian cultivated rice and common wild rice after the early domestication of Japonica, which led to considerably higher diversification of Asian cultivated rice and common wild rice in the SA regions.

| Strong positive selection pressure on the gametophytic CMS genes in the rice mitochondrial genome
The mitochondrial CMS genes always showed special genetic variation and evolution features, owing to their maternal inheritance and co-existence with restorer genes, under natural conditions.
Although harmful, they could be inherited within a closed CMS/Rf system that comprised a mitochondrial CMS gene conferring male sterility and one or more nuclear restorer gene(s) cancelling the male sterility. As the CMS gene was inherited only by the cy- [...] (Table S8). These results revealed that variations at these positively selected loci were favoured and beneficial to the survival of the CMS genes. This was likely because more restorer genes were induced along with the occurrence of new alleles of the gametophytic CMS gene. Rf1 was first reported as the restorer gene in the BT-CMS (orf79a) line and was shown to have the ability to restore fertility in the LD-CMS (orf79b) line (Itabashi et al., 2009; Wang et al., 2006); Rf5 and Rf6 were found to be the restorer genes of the HL-CMS (orf79k) line (Hu et al., 2012; Huang et al., 2015); Rf2 can completely restore the fertility of the LD-CMS (orf79b) line and weakly restore the fertility of the BT-CMS line (Itabashi et al., 2009; Kazama et al., 2016). Therefore, divergence of the orf79 alleles could broaden the resources of sterility restorer genes and facilitate the survival of the CMS genes.

| Potential exploitation of the cytoplasm with new haplotypes of atp6-orf79
CMS has been widely used for commercial hybrid seed production in many cereal crops such as maize and rice for decades. Hybrid rice has exhibited increased yield (by about 20%) compared with inbred rice varieties (Cheng, Zhuang, Fan, Du, & Cao, 2007; Ma & Yuan, 2015), and has made a great contribution to global food production (Zhu, 2016). However, potential risks exist when hybrids based on a single CMS cytoplasm are continuously grown over large areas. For instance, in 1970, an outbreak of Southern corn leaf blight (Helminthosporium maydis Nisikado & Miyake race T) occurred in U.S. maize hybrids produced using Texas-type CMS, which carried a mitochondrial gene, T-urf13, with the dual role of causing CMS and disease susceptibility (Levings, 1990). Thus, the exploitation of novel CMS cytoplasms to enrich cytoplasmic diversity is very important for commercial hybrid seed production. In this study, we screened the mitochondrial scaffolds of 590 common wild rice accessions and Asian cultivated rice cultivars and obtained 16 haplotypes of atp6-orf79, of which H1, H4 and H11 had been previously confirmed to cause BT-CMS, LD-CMS and HL-CMS, respectively (Table S2), whereas the remaining 13 haplotypes obtained here were different from those in previous reports. Among them, 5 haplotypes (H2, H3, H5, H6 and H7) could be considered candidate genes conferring gametophytic BT-like CMS because they shared atp6 and orf79 sequences identical to those of H1, with variations only in the intergenic sequences (Table S2). Similarly, H12 and H13 could be candidate genes conferring gametophytic HL-like CMS. These results can provide new cytoplasmic resources for research on and application of BT-CMS and HL-CMS. The remaining 6 haplotypes contained orf79 alleles different from the known rice CMS-related genes and could serve as candidate genes for developing new types of CMS in rice.
In addition, although the CMS/Rf system has been widely used in global hybrid rice production, genetic resources with restorer genes for gametophytic CMS (such as BT-CMS, HL-CMS and Dian1-CMS) are still limited in rice, especially in japonica rice (Huang, Zhi-Guo, Zhang, & Shu, 2014; Huang, 2012). In this study, a total of 74 fertile accessions, including 44 O. rufipogon accessions and 30 O. sativa varieties, were identified as carrying the 16 haplotypes of the CMS gene atp6-orf79 and were speculated to contain the corresponding restorer genes. Seventeen genotypes, including 12 common wild rice accessions, 2 indica, 2 japonica and 1 Aus rice varieties, were identified as carrying the known gametophytic CMS genes (H1, H4 and H11) and the corresponding restorer genes. Thirty-eight genotypes, including 13 common wild rice accessions, 7 indica, 3 japonica, 9 Aromatic and 6 Aus rice varieties, were identified as carrying 5 BT-CMS-like and 2 HL-CMS-like haplotypes of atp6-orf79. The remaining 19 genotypes were common wild rice accessions with novel haplotypes.
New restorer genes identified from these genotypes could be exploited for the gametophytic CMS/Rf system in rice.

CONFLICT OF INTEREST
None declared.