Highly variable chloroplast genome from two endangered Papaveraceae lithophytes Corydalis tomentella and Corydalis saxicola

Abstract The increasingly wide application of chloroplast (cp) genome super‐barcode in taxonomy and the recent breakthrough in cp genetic engineering make the development of new cp gene resources urgent and significant. Corydalis is recognized as the most genotypes complicated and taxonomically challenging plant taxa in Papaveraceae. However, there currently are few reports about cp genomes of the genus Corydalis. In this study, we sequenced four complete cp genomes of two endangered lithophytes Corydalis saxicola and Corydalis tomentella in Corydalis, conducted a comparison of these cp genomes among each other as well as with others of Papaveraceae. The cp genomes have a large genome size of 189,029–190,247 bp, possessing a quadripartite structure and with two highly expanded inverted repeat (IR) regions (length: 41,955–42,350 bp). Comparison between the cp genomes of C. tomentella, C. saxicola, and Papaveraceae species, five NADH dehydrogenase‐like genes (ndhF, ndhD, ndhL, ndhG, and ndhE) with psaC, rpl32, ccsA, and trnL‐UAG normally located in the SSC region have migrated to IRs, resulting in IR expansion and gene duplication. An up to 9 kb inversion involving five genes (rpl23, ycf2, ycf15, trnI‐CAU, and trnL‐CAA) was found within IR regions. The accD gene was found to be absent and the ycf1 gene has shifted from the IR/SSC border to the SSC region as a single copy. Phylogenetic analysis based on the sequences of common CDS showed that the genus Corydalis is quite distantly related to the other genera of Papaveraceae, it provided a new clue for recent advocacy to establish a separate Fumariaceae family. Our results revealed one special cp genome structure in Papaveraceae, provided a useful resources for classification of the genus Corydalis, and will be valuable for understanding Papaveraceae evolutionary relationships.


| INTRODUC TI ON
, generally considered to have originated from ancient cyanobacteria, are the main site of photosynthesis and energy conversion in plant cells, containing the major enzyme systems for photosynthesis and a highly conserved genome (Ahlert et al., 2003;Moore et al., 2010). With the development of high-throughput sequencing technology, cp genomics has made rapid progress (Li et al., 2015).
The National Center for Biotechnology Information (NCBI) database included 377 complete cp genome sequences in 2010 and had more than 10,381 sequences in 2020 (https://www.ncbi.nlm.nih.gov/genom e/brows e/), a nearly 30-fold increase over 10 years. Currently, cp genomics research is an intense area of botanical and genomic study.
Correct understanding of the relationship between different biological groups is the main focus of phylogenetic biology, the basis of taxonomy and naming, and a foundation for research in other branches of biology (Chen et al., 2016). Compared with traditional molecular markers, the cp genomes provide specific advantages for establishing plant phylogenetic relationships and taxonomic research (Guo et al., 2017). The length of cp genomes is usually 115-165 kb, a modest size that is easily sequenced. The longer sequence provided more sufficient information for phylogenetic analysis.
Relatively conserved gene sequences allow produce co-linearity among plant groups, and the evolution rates of coding regions and noncoding regions are significantly different to be suited for phylogenetic analysis of different ranks (Clegg et al., 1994). Taxonomists have used cp genomes to study plant phylogenetics and advocated for use of cp genomes as a super DNA barcode for species identification (Guo et al., 2017).
In recent years, a large number of cp genome have been sequenced, providing abundant data that can be used for plant phylogeny research to more accurately reveal the true evolutionary relationships between species and effectively solve difficult phylogenetic relationship problems in the study of complex plant taxa (Guo et al., 2019;Jansen et al., 2006;Zhang et al., 2017). Cp genomes have been successfully used as a "super barcode" to identify many taxonomically difficult species (Cui et al., 2019;Ying et al., 2019).
With the reduced cost of sequencing and the development of bioinformatics technology, cp genome will be extensively used in future studies of plant taxonomy.
Corydalis DC., the largest genus of Papaveraceae, is recognized as one of the most taxonomically challenging plant taxa (Magnus et al., 1996). It has extremely complex morphological variation because of typical reticulate evolution and intense differentiation during evolution (Wu et al., 1996). Taxonomic study of the genus on the basis of morphological characteristics has been very difficult (Lu et al., 2018). Cp genomes have been proven effective for phylogenetic research of many taxonomically complex taxa. However, there currently are few reports about cp genomes of the genus Corydalis, but see two plants, Corydalis trisecta and Corydalis conspersa (Kanwal et al., 2019;Wu et al., 2020). Therefore, it is necessary to sequence the cp genomes of Corydalis plants in order to provide more accurate basis for the classification and identification of this genus.
In this study, high-throughput sequencing and comparative genomics were used to study the cp genomes of two Corydalis plants: Corydalis saxicola and Corydalis tomentella. They belong to Sect. Thalictrifoliae Fedde of the genus Corydalis, which grows in dry cracks of limestone ( Figure 1) and is known as lithophytes.
There are little available soil and water on the limestone, so they have been subjected to extreme environmental conditions, such as high temperature, drought, and high PH . Then, we asked whether the cp genome structures of these two lithophytes had special variation under the extremely harsh lithophytic environment, and whether these variation would affect their classification and identification. We sequenced four complete cp genome sequences from these two plants, described their genomic characteristics, conducted comparisons between these genomes and other Papaveraceae cp genomes, and analyzed the phylogenetic relationships on the basis of common protein CDS. Our study aim was to assess structural variation, and provide valuable resources for classification of the genus Corydalis.

| Genome assembly and annotation
The cp genome were assembled on a Linux system. First, raw sequencing data were filtered using Trimmomatic (Version 0.36) to get the high-quality clean data (Bolger et al., 2014). In the second step, we used the thirteen chloroplast genome sequences of Papaveraceae species which were downloaded from GenBank to establish a Basic Local Alignment Search Tool (BLASTn) database. Then the clean data were mapped to the BLAST database, and the mapped reads which were considered as reads from chloroplast genome were extracted.
Next step, the extracted reads were assembled to contigs using SOAPdenovo2 (Luo et al., 2012). At last, SSPACE was used to construct the scaffold of the chloroplast genome (Boetzer et al., 2011), and GapCloser was used to fill gaps (Luo et al., 2012). The completed genomes were annotated using CPGAVAS2 (Shi et al., 2019), and the results were modified for starter and terminator revisions by Apollo software (Lee et al., 2009). CPGAVAS2 software was used to convert revised GFF3 format annotation results into a sqn format for NCBI submission. Sequin software was used to check and correct unsatisfactory comments in the sqn file, and the corrected results were submitted to the NCBI database. Physical maps of the cp genomes were drawn by GenomeDRAW (Marc et al., 2013) using a GB format file exported from the sqn file by sequin software.

| Phylogenetic analysis
A total of 13 cp whole genome sequences were used in cluster

| Chloroplast genomes features
Approximately, 5.12, 5.23, 2.68, and 2.77 Gb raw paired-end reads  Figure 2 and Table 1. The lengths of the two complete C. saxicola genomes were 189,029 and 189,155 bp, which were slightly smaller than those of C. tomentella. The cp genome structure, size of each region, and GC content were similar between the two species (Table 1).
CPGAVAS2 was used to annotate the cp genomes of C. tomentella and C. saxicola. Removing duplicate genes, a total of 119 annotated genes ( Figure 2, Table 2 and Table S1), including 78 protein-coding genes, 37 tRNA genes, and four rRNA genes, were identified from the C. tomentella. There were 28 genes in the IR region, of which 15 were involved in gene expression. Introns greatly affect regulated selective splicing in the genome. There were 19 genes that contain introns in the C. tomentella cp genome. Most intron genes contained only one intron, while the ycf3 gene contained two introns. There were 12 introns with a length of more than 700 bp, and the longest gene was trnK-UUU with a length of 2,478 bp. The gene features of C. saxicola cp genome were similar to those of C. tomentella. The F I G U R E 2 Schematic representation of the chloroplast genomes of C. tomentella. The map contains four rings. From the center going outward, the first circle shows forward and reverse repeats connected with red and green arcs, respectively. The next circle shows tandem repeats marked with short bars. The third circle shows microsatellite sequences identified by MISA. The fourth circle is drawn using drawgenemap and shows the gene structure of the plastome. The genes are colored on the basis of their functional categories. Genes inside and outside of the circle are transcribed in clockwise and counterclockwise directions, respectively. IR, inverted repeat; LSC, large single copy; SSC, small single copy. The red rectangles indicated the nine gens (ndhF, ndhD, ndhL, ndhG, ndhE, psaC, ccsA, rpl32, and trnL-UAG) normally located in the SSC region have migrated to IRs; the green rectangles indicated the reversed segment involving five genes (rpl23, ycf2, ycf15, trnl-CAU, and trnL-CAA) C. saxicola cp genome contained 120 genes, including 78 proteincoding genes, 38 tRNA genes, and four rRNA genes. Nineteen genes contained introns. The longest intron gene in the C. saxicola cp genome was trnK-UUU, and its length was also 2,478 bp (Figure 2, Table 2 and Table S1).

| Variation in genome structural
VISTA software was used to make multiple comparisons of the C. tomentella and C. saxicola cp genome sequences, and results show that intraspecific variation was small but there were still some interspecific differences (Figure 3).  However, the ycf1 genes in C. tomentella and C. saxicola cp genomes F I G U R E 3 Sequence identity plot comparison of the C. tomentella and C. saxicola cp genomes. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the inverted repeats (IRs), respectively. A cutoff of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50% to 100% have been transferred to the SSC region to become a single copy  Simple sequence repeats are short tandem repeats of 1-6 bp DNA sequences that are widely distributed throughout the cp genome (Lee et al., 2019). In this study, CPGAVAS2 software was used to analyze the sequences and the classification statistics of SSRs with a length greater than or equal to 8 bp. Here, we analyzed the distribution and the type of SSRs contained in C. tomentella and C. saxicola cp F I G U R E 4 Sequence identity plot comparison of the cp genomes of C. tomentella, C. saxicola, P. somniferum, P. rhoeas, and C. hymenoides. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the inverted repeats (IRs), respectively. A cutoff of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50% to 100% genomes. A total of 172 SSRs were identified in the whole C. tomentella cp genome (take MHJ1 as an example), including 100 mono-, 34

| Codon usage bias, SSRs, and repeat sequences
di-, and one compound nucleotide SSRs. Among all SSR types, A and T were the most commonly used bases and 116 SSRs in the C. tomentella cp genome had A, T, or AT repeat units (Table 3 and Table S3). For C. saxicola, 170 SSRs (take YHL2 as an example) were categorized as 96 mono-, 36 di-, six tri-and six compound nucleotide SSRs, including 115 SSRs with A, T, or AT repeat units (Table 3 and Table S3).
In addition to SSRs, forward repeats (F) and palindromic repeats  (Table 3). Comparing the cp genomes of the two species, the C. saxicola genome had a greater total number of repeats than the C. tomentella cp genome, and the cp genome repeat content in both species was significantly higher than that of most species.

| Phylogenetic analysis
With C. chinensis and N. tabacum as outgroups, 70 common protein coding sequences from 13 cp genome sequences were extracted from C. saxicola, C. tomentella, and six Papaveraceae species to build a Maximum Likelihood (ML) phylogenetic tree ( Figure 6)

| High variability of genome size and the expansion of IRs
Corydalis saxicola and C. tomentella cp genomes are the large cp genomes due to the expansion of IR regions. Most angiosperms cp genomes are highly conserved, typically 115-165 kb in size and possessing a quadripartite structure with two IR regions (IRa and IRb) separating the LSC region and the SSC region . The sizes of C. saxicola and C. tomentella cp genomes are larger than those of most flowering plants, such as N. tabacum (Sajjad et al., 2016;Shinozaki et al., 1986;Yukawa et al., 2006), 30-40 kb larger than those reported genomes in Papaveriaceae, such as P. somniferum (Sun et al., 2016) and C. hymenoides (Kim & Kim, 2016). Distinctions between different cp genomes mainly result from the variability of the length and direction of IR regions (Duan et al., 2020). In terms of length, IR regions of the genus Taxodium (T. distichum, T. mucronatum and T. ascendens) contracted to about 282 bp (Saski et al., 2005), while IR regions were entirely absent in Pisum sativum and Cryptomeria japonica (Hirao et al., 2008;Ki & Hae, 2005). In contrast, the length of Pelargonium hortorum IR regions expanded to 76 kb (Duan et al., 2020). Numerous studies have shown that IR region lengths F I G U R E 6 ML tree of C. saxicola and C. tomentella and its relative species based on common protein coding sequences | 4167

REN Et al.
are the main factor influencing cp genome size (Yan et al., 2017). In our study, IR region lengths for the two newly sequenced species were 41,955 to 42,350 bp, which significantly increased their cp genome sizes over that of other Papaveraceae species. Genes normally located in the SSC region, such as ndhC, ndhD, ndhE, ndhF, ndhG, ndhL, rpl32, and trnL-UAG, have moved to IR regions, contributing to the expanded size of C. saxicola and C. tomentella IRs.

| Gene inversions, duplications, and deletions
Inversions usually serve as useful phylogenetic markers (Cosner et al., 2004;Kim et al., 2005). An up to 9 kb inversion containing five genes (rpl23, ycf2, ycf15, trnL-CAU, and trnL-CAA) was found in the IR regions of C. tomentella and C. saxicola cp genomes. Relatively To determine if it can be used as a phylogenetic marker of genus Corydalis, more species will need to be sequenced. In some plants, the large inversions have been found to be associated with short inverted repeats in cp genome (Joachim et al., 2017;Yi et al., 2013).
In Geraniaceae, Campanulaceae and some Fabaceae species, a mass of short inverted repeats have been found to be present at their inversion endpoints (Cosner et al., 2004;Yan et al., 2017). However, we didn't detect large numbers of short inverted repeats emerged in inversion endpoints in C. tomentella and C. saxicola.
Several NDH (NADH dehydrogenase-like) genes (ndhD, ndhE, ndhF, ndhG, ndhL) are duplicated in the C. tomentella and C. saxicola cp genomes, which could provide an explanation for their robust adaptability to harsh environments. Large-scale duplication of cp genes tends to occur only in highly rearranged genomes and can be explained by repeated expansion and contraction of IRs (Mercedes & Bartolomé, 2010;Ruhlman et al., 2015). In this study, genes that are normally located in the SSC region (ndhD,ndhE,ndhF,ndhG,ndhL,psaC,rpl32,ccsA, have migrated to IRs resulting in IR expansion and gene duplication. We found that most of these duplicated genes belong to the NDH complex. Because plastid NDH genes are dispensable under optimal growth conditions, they have been lost in a number of autotrophic and heterotrophic lineages, although they are widely retained across land plants (Ruhlman et al., 2015;Yan et al., 2017). For example, plastid NDH genes have been partially lost or pseudogenized in parasitic plants, such as several orchids and Petrosavia (Petrosaviaceae), and autotrophs plants, such as Najas (Hydrocharitaceae) and Erodium (Geraniaceae) (Mercedes & Bartolomé, 2010), even they have been completely lost in Selaginella tamariscina (Xu et al., 2018). Conversely, it is rare for NDH genes to undergo large-scale duplication and augmentation, and the effects of the increased genes resulting from gene duplication on plant growth and development have rarely been discussed in previous research. The NDH complex participates in photosystem I (PSI) cyclic electron flow (CEF), chlororespiration. NDH-dependent CEF provides additional pH change and ATP for CO 2 assimilation and alleviates oxidative stress caused by stromal over-reduction under stress conditions (Ruhlman et al., 2015). The nonphotochemical quenching ability of NDH deficient mutants decreased under mild drought (Sergi et al., 2005). NDH deficient mutants grow slowly at low humidity (Horvath, 2000). Under strong light, tobacco ndhB mutants were more susceptible to photobleaching (Sergi et al., 2005).
Under heat stress conditions, NDH-mediated cyclic and chlororespiratory electron transport are accelerated, mitigating photo-oxidative damage, and inhibition of CO2 assimilation caused by high temperature (Ju et al., 2003). Oleaceae (Joachim et al., 2017;Yan et al., 2017). The accD gene encodes an acetyl-CoA carboxylase subunit and is an important regulator of carbon flow entering the fatty acid biosynthesis pathway (Rousseau-Gueutin et al., 2013). It is known to be essential for leaf development in angiosperms (Hong et al., 2017;Kode et al., 2005).
Recent research has shown that the accD gene present in the plastome of most angiosperms is functional (Hong et al., 2017;Rousseau-Gueutin et al., 2013). Furthermore, several studies have shown that the accD gene has been transferred into the nucleus, and the proteins it encodes are transported from the nucleus to the chloroplast to function in the form of a transfer peptide (Joachim et al., 2017;Liu et al., 2016). Whether the C. tomentella and C. saxicola accD genes have been lost or transferred to the nucleus, the effects on development are currently unknown.

| Potential application of cp genome in phylogenetic research of Corydalis and Papaveraceae
By exhibiting high species identification power that accurately distinguished two closely related species (C. saxicola and C. tomentella), cp genomes have demonstrated a great potential for use as a superbarcode to discriminate Corydalis species. Corydalis, is considered to be one of the most taxonomically complex taxa (Wu et al., 1996).

It is extremely difficult to depend on morphological characteristics
for Corydalis species identification. Single-locus DNA barcodes lack adequate variation in closely related taxa. Researches using short sequence gene fragments and DNA barcodes showed that both nuclear genome (ITS/ITS2) sequence and cp genome (matK/rbcL/rps16) sequence produced unsatisfactory taxonomic identifications within Corydalis Wang, 2006). Cp genomes, exhibiting many advantages, including a moderate size and an appropriate frequency of nucleotide substitutions that can provide sufficient mutation sites (Yan et al., 2017), have been successfully used in the identification of various taxa, such as genera Epimedium (Guo et al., 2019), Fritillaria (Yan et al., 2018), Epipremnum (Tian et al., 2018), and Papaver (Zhou et al., 2017). In this study, C. tomentella and C. saxicola, two closely related species from Sect. Thalictrifoliae in Corydalis, are clustered into two branches in the phylogenetic tree, which indicates they could be accurately distinguished by cp genome analysis. While, in the phylogenetic analysis based on short sequences of DNA barcodes, these two related species were not monophyletic and couldn't be effectively distinguished. Recent barcoding studies have placed a greater emphasis on the use of whole-cp genome sequences, which are now more readily available as a consequence of improving sequencing technologies (Li et al., 2015). The demonstrated use of cp genomics in Corydalis species identification suggests that it has a great potential for taxonomic identification of this genus.
The cp genome also efficiently identified every genus of Papaveraceae in this study. The evolution rates of coding and noncoding regions are significantly different in cp genomes, enabling cp genome use for systematic analysis of different phylogenetic ranks (Clegg et al., 1994). The genus Corydalis belongs to Papaveraceae Fumarioideae (Corydaleae) and the phylogenetic relationships of this genus remain controversial (Wu et al., 1996). Recent studies have tended to treat the genus Corydalis as an independent Fumariaceae family because the morphological characteristics of this genus constitute a unique evolutionary series (Pérez-Gutiérrez et al., 2012;Wu & Lu, 2003;Zhang et al., 2008). In this study, a Papaveraceae phylogenetic tree, built using common protein CDS, shows that every genus is clustered into one separate clade. However, the clade of Corydalis is far from the other genera of Papaveraceae. Combined with the substantial differences in cp genome structures between Corydalis and the other Papaveraceae genera, it will be necessary to analyze more representative species to reveal the phylogenetic relationship of Corydalis.

ACK N OWLED G M ENTS
We are grateful to Jianguo Zhou and Yuanyao Xin for support and troubleshooting. This work was supported by National Natural The funders did not play any roles in the design of the study, collection, analysis and interpretation of the relevant data, and writing the manuscript.

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
The DNA sequences reported in this study have been deposited in the National Center for Biotechnology Information (NCBI) genome database, and Genbank accessions: MT093187, MT07787-MT07789. All sequences used in phylogenetic analysis of Papaveraceae are available from NCBI (Accession numbers: see the "Phylogenetic analysis of Papaveraceae" in Section 3).