The complete chloroplast genomes of three Hamamelidaceae species: Comparative and phylogenetic analyses

Abstract Hamamelidaceae is an important group that represents the origin and early evolution of angiosperms. Its members are used for timber, medicine, spices, and ornamental planting. In this study, the complete chloroplast genomes of Loropetalum chinense (R. Br.) Oliver, Corylopsis glandulifera Hemsl., and Corylopsis velutina Hand.‐Mazz. were sequenced using the Illumina NovaSeq 6000 platform. The sizes of the three chloroplast genomes were 159,402 bp (C. glandulifera), 159,414 bp (C. velutina), and 159,444 bp (L. chinense). These chloroplast genomes contained typical quadripartite structures with a pair of inverted repeat (IR) regions (26,283, 26,283, and 26,257 bp), a large single‐copy (LSC) region (88,134, 88,146, and 88,160 bp), and a small single‐copy (SSC) region (18,702, 18,702, and 18,770 bp). The chloroplast genomes encoded 132–133 genes, including 85–87 protein‐coding genes, 37–38 tRNA genes, and 8 rRNA genes. The coding regions were composed of 26,797, 26,574, and 26,415 codons, respectively, most of which ended in A/U. A total of 37–43 long repeats and 175–178 simple sequence repeats (SSRs) were identified, and the SSRs contained a higher number of A + T than G + C bases. The genome comparison showed that the IR regions were more conserved than the LSC or SSC regions, while the noncoding regions contained higher variability than the gene coding regions. Phylogenetic analyses revealed that species in the same genus tended to cluster together. Chunia Hung T. Chang, Mytilaria Lecomte, and Disanthus Maxim. may have diverged early and Corylopsis Siebold & Zucc. was closely related to Loropetalum R. Br. This study provides valuable information for further species identification, evolution, and phylogenetic studies of Hamamelidaceae plants.


| INTRODUCTION
Hamamelidaceae is an important group representing the origin and early evolution of angiosperms and is well known for its broad and scattered geographic distribution and endemics (Endress, 1993; Zhang & Lu, 1995). Hamamelidaceae fossils have been found in Upper Cretaceous-early Tertiary strata (Manchester et al., 2009; Zhang & Lu, 1995); thus, the flora of Hamamelidaceae may have arisen earlier than the Cretaceous. This family contains 28 genera and about 120 species (Judd et al., 2007), which mainly occur in are used as medicinal materials (Kim et al., 2020; Simon et al., 2021).
In addition, most of the genera have ornamental value, particularly Rhodoleia Champ. ex Hook. and Corylopsis.
The chloroplast is an important plant organelle and the site of photosynthesis (Douglas, 1994). It is also a semiautonomous genetic organelle that contains independent chloroplast DNA (cpDNA), which has a length of 110-160 kb (Choi & Park, 2015). In general, cpDNA has a circular structure that includes one large single-copy (LSC) region, one small single-copy (SSC) region, and two inverted repeat (IR) regions, with the IR regions separating the LSC and SSC regions (Ferrarini et al., 2013; Wu et al., 2014; Xue et al., 2019). The chloroplast genome is independent of the nuclear genome and is maternally inherited, with independent transcription and transport systems (Wu et al., 2020). Considering its similar structures, highly conserved sequences, and stable maternal heredity, the chloroplast genome has become an ideal resource for species identification, population genetics, phylogenetic, and genetic engineering studies (Fan et al., 2021; Nock et al., 2014). Moreover, gene mutations, rearrangements, duplications, and losses can be detected in the chloroplast genomes of the angiosperm lineages (Luo et al., 2021). Structural changes in the genome can be used to study taxonomic significance and phylogenetic relationships, and provide information for the development of genomic markers (Cheng et al., 2020; Watson et al., 2002). Repeat sequences are DNA sequence motifs that are repeated hundreds or thousands of times at different positions in the genome (Biscotti et al., 2015). They are ubiquitous in genomes and play important roles in evolution. Repeat sequences are mainly divided into two categories: tandem repeats, which mainly include shorter repeats such as simple sequence repeats (SSRs), and interspersed repeated sequences, which are commonly known as transposons (Treangen & Salzberg, 2011).
SSRs, also called microsatellites, are composed of repeat units of 1-6 nucleotides and have been widely used as molecular markers in population genetics and evolutionary biology (Bondar et al., 2019; Dashnow et al., 2015) because of their high reproducibility, codominance, multi-allelic nature, and chromosome specificity (Miri et al., 2014; Oliveira et al., 2006; Vieira et al., 2016).
Interspersed repeated sequences account for most of the plant genomic repeats (Zhao & Ma, 2013), whereas retrotransposons play an important role in genome amplification (Ammiraju et al., 2007; Baucom et al., 2009; Paterson et al., 2009; SanMiguel et al., 2009; Schnable et al., 2009) and contribute to the expansion and contraction of the genome and the difference in the interspecific sequence (Morgante et al., 2007). The complete chloroplast genome contains all genes used to reconstruct the evolutionary history and provides more valuable and high-quality information for evolutionary and phylogenetic analyses. Complete chloroplast genome sequences are easily obtained due to the rapid development of large-scale high-throughput sequencing techniques, such as the Illumina and PacBio sequencing platforms (Kim et al., 2021; Lin et al., 2018; Yang et al., 2019; Ye et al., 2020).
Hamamelidaceae is a key family for studying the phylogeny of angiosperms (Zhang et al., 2001). The relationships between genera in this family have been controversial for a long time (Hao & Wei, 1998; Li et al., 1997; Magallon, 2007; Xie et al., 2010). For example, Ye et al. (2020) reported that Hamamelis is sister to the clade that includes Parrotia C. A. Mey. and Distylium Siebold & Zucc., which is consistent with previous studies (Li, Bogle, & Donoghue, 1999; Li, Bogle, & Klein, 1999; Magallon, 2007; Shi et al., 1998; Xie et al., 2010). The results of another study showed that Parrotia subaequalis (H. T. Chang) R. , which is consistent with the result of Jiang et al. (2020). Different taxonomists have systematically divided Hamamelidaceae based on morphology, anatomy, and palynology (Bogle & Philbrick, 1980; Harms, 1930; Reinsch, 1890), but the traditional identification method based on morphological characteristics cannot be used to clearly distinguish Hamamelidaceae species (Deng et al., 1992; Endress, 1969, 1989; Zhang, 1999). In recent years, phylogenetic analyses of Hamamelidaceae species have been carried out with the rapid development of molecular technology (Li et al., 2000; Shi et al., 1998; Wen & Shi, 1999; Xiang et al., 2019; Xie et al., 2010; Zhou et al., 2019), and early studies focused on DNA fragment-labeling techniques or phylogenetic analyses based on nuclear or chloroplast DNA fragments. However, limited nuclear or chloroplast DNA fragments do not provide sufficient phylogenetic information to effectively resolve interspecific relationships (Hao & Wei, 1998; Li et al., 1997).
Complete chloroplast genomes provide more valuable and higher-quality information for evolutionary and phylogenetic analyses and reduce the sampling error inherent in studies of one or a few genes that may indicate critical evolutionary events (Cho et al., 2019). In this study, we aimed to (1) study the molecular structures of these three chloroplast genomes; (2) examine the variations in the repeat sequences and the SSRs in the three chloroplast genomes; (3) discover the divergence hotspot regions to provide potential molecular markers for future phylogenetic studies; and (4) establish and analyze the phylogenetic relationships of Hamamelidaceae species based on their complete chloroplast genome sequences, as well as the LSC, SSC, and IR regions. The data obtained in this study will provide valuable reference information for further studies on species identification and evolution, as well as population genetics and phylogenetic analyses of Hamamelidaceae.

| Genome comparison
Chloroplast genome sequences are often used to measure genetic diversity within a species, gene flow between species, and ancestral population size of separated sister species (Cavender et al., 2015). Therefore, it is necessary to understand the divergence of chloroplasts between species. The online comparison tool mVISTA (Mayor et al., 2000) was applied to compare the whole chloroplast genomes of L. chinense, C. glandulifera, and C. velutina to three published chloroplast genomes of Chunia bucklandioides Chang (NC_041163), Distylium tsiangii Chun ex Walker (MN711651), and Rhodoleia championii Hook. f. (NC_045276) in Shuffle-LAGAN mode (Frazer et al., 2004) with the L. chinense annotation as the reference. Although the IR regions are the most conserved, expansion and contraction of the IR boundary are the main reasons for differences in the sizes of chloroplast genomes (Kode et al., 2005; Raubeson et al., 2007; Yao et al., 2015). IRscope (Ali et al., 2018) (Table S1).

| Phylogenetic analysis
Maximum likelihood (ML) and Bayesian inference (BI) methods were used to perform phylogenetic analyses based on the following four datasets: (1) the complete chloroplast genome sequences; (2) LSC regions of the chloroplast genomes; (3) SSC regions of the chloroplast genomes; and (4) IR regions of the chloroplast genomes. The ML analysis (Guindon et al., 2010) was conducted using IQ-TREE (Nguyen et al., 2015) and Ultrafast bootstrap (Minh et al., 2013), and the BI analysis was conducted using MrBayes (Ronquist et al., 2012). All datasets were aligned using MAFFT (Katoh & Standley, 2013) under default parameters. ModelFinder (Kalyaanamoorthy et al., 2017) was used to select the best-fit model using Akaike's Information Criterion, and GTR (general time-reversible)+F+I+G4 was selected as the best substitution model for the complete chloroplast genome sequences and the LSC regions. GTR+F+G4 was selected as the best substitution model for the SSC regions and GTR+F+I was selected for the IR regions. The ML analysis was conducted with 1,000 repetitions of Ultrafast bootstrap and 1,000 bootstrap replicates of the Shimodaira/Hasegawa approximate likelihood-ratio test (SH-aLRT) (Guindon et al., 2010). The Markov chain Monte Carlo algorithms were run for 2,000,000 generations and sampled every 100 generations for the BI analysis. The first 25% of the generations were discarded as burn-in. MAFFT, ModelFinder, IQ-TREE, Ultrafast bootstrap, and MrBayes were used in PhyloSuite (Zhang, Gao, et al., 2020). The phylogenetic relationships were visualized using FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
FIGURE 1 The chloroplast genome maps of Corylopsis glandulifera, Corylopsis velutina, and Loropetalum chinense. Genes on the inside of the circle are transcribed clockwise and those on the outside are transcribed counter-clockwise. The darker gray inner circle corresponds to the GC content, whereas the lighter gray indicates the AT content. Different colors represent different functional genes
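The model-selection criterion and the Bayesian sampling settings reported in this section can be illustrated with a little arithmetic. The following is a minimal sketch, not the PhyloSuite, ModelFinder, or MrBayes implementation: the function names are hypothetical, the AIC formula is the textbook definition (AIC = 2k - 2 ln L), and the sample count ignores the initial state at generation 0.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's Information Criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model(scores: dict) -> str:
    """Return the model name with the lowest AIC from a {name: AIC} mapping."""
    return min(scores, key=scores.get)


def mcmc_sample_counts(generations: int, sample_every: int, burnin_fraction: float):
    """Return (total, discarded, retained) sample counts for one MCMC run.

    The initial state at generation 0 is ignored for simplicity (an assumption).
    """
    total = generations // sample_every
    discarded = int(total * burnin_fraction)
    return total, discarded, total - discarded


# Settings stated in the text: 2,000,000 generations, sampled every 100,
# with the first 25% discarded as burn-in.
total, discarded, retained = mcmc_sample_counts(2_000_000, 100, 0.25)
print(total, discarded, retained)  # 20000 5000 15000
```

Under these assumptions, each run yields 20,000 sampled states, of which 15,000 remain after burn-in.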

| Chloroplast genome features of the three Hamamelidaceae species
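The chloroplast genomes of C. glandulifera (accession no. MZ642354), C. velutina (accession no. MZ823391), and L. chinense (accession no. MZ642355) were 159,402, 159,414, and 159,444 bp in length, respectively, and each had a typical quadripartite structure (Table 1). The overall GC contents of the three chloroplast genomes were almost identical (37.97%-38.03%) (Table 1) and the GC contents of the LSC and SSC regions were lower than those of the IR regions (Table 2).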
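In a quadripartite chloroplast genome, the total length should equal LSC + SSC + 2 × IR. The sketch below applies this check to the region sizes given in the abstract, assuming that each parenthesized list of three values follows the order C. glandulifera, C. velutina, L. chinense. The dictionary layout and function name are illustrative only.

```python
# Region lengths (bp) as reported in the abstract; the species order within
# each list of three values is an assumption.
GENOMES = {
    "C. glandulifera": {"LSC": 88_134, "SSC": 18_702, "IR": 26_283, "total": 159_402},
    "C. velutina": {"LSC": 88_146, "SSC": 18_702, "IR": 26_283, "total": 159_414},
    "L. chinense": {"LSC": 88_160, "SSC": 18_770, "IR": 26_257, "total": 159_444},
}


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total genome length for one LSC, one SSC, and two IR copies."""
    return lsc + ssc + 2 * ir


for species, r in GENOMES.items():
    computed = quadripartite_length(r["LSC"], r["SSC"], r["IR"])
    # Prints the species, the computed length, and whether it matches the reported total.
    print(species, computed, computed == r["total"])
```

With this ordering, the computed lengths agree with the reported genome sizes for all three species.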

| Codon usage analysis
Analyzing codon usage is essential to evaluate the evolution of the chloroplast genome (Chi et al., 2020; Sun et al., 2021). Relative synonymous codon usage (RSCU) values were computed for the C. glandulifera, C. velutina, and L. chinense chloroplast genomes based on the protein-coding sequences. Most of the codons ended in A/U (Table S2), indicating that most of the amino acids tended to use A/U-ending codons rather than C/G-ending codons.
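Relative synonymous codon usage (RSCU) is conventionally defined as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally; values above 1 indicate a preferred codon. The following is a minimal sketch of that calculation, not the software used in this study. It assumes the standard genetic code, in-frame coding sequences, and a hypothetical function name `rscu`; stop codons are treated as one synonymous family.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (third position varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for in-frame coding sequences (DNA or RNA)."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


# Toy example: alanine encoded three times by GCT and once by GCA.
example = rscu(["GCTGCTGCTGCA"])
print(example["GCT"], example["GCA"], example["GCC"])  # 3.0 1.0 0.0
```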

| Repeat sequence analysis
Repeated structures longer than 30 bp are known as long repeats (Asaf et al., 2018), and there are four types of long repeats, namely forward, palindromic, reverse, and complement repeats. In this study, three types of repeated sequences (forward, reverse, and palindromic) were identified (Table S3). The types and content of the long repeats were similar in species from the same genus.
TABLE 3 Lists of genomic genes for Corylopsis velutina, Corylopsis glandulifera, and Loropetalum chinense
TABLE 4 Characteristics and sizes of the intron and exon genes from the three Hamamelidaceae species
The majority of the SSRs were located in intergenic regions. Most of the SSRs were located in the LSC regions rather than in the SSC or IR regions (Table S4). There were 143-152 mononucleotides, 9-10 dinucleotides, 58-66 trinucleotides, 4-5 tetranucleotides, 2 pentanucleotides, and 0-1 hexanucleotides (only in L. chinense). Among these SSRs, mononucleotide repeats were the most abundant, while pentanucleotide and hexanucleotide repeats were the least abundant. Most mononucleotides and dinucleotides were composed of A and T (Figure 4).
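The SSR classes counted above (mono- to hexanucleotide repeats) can be located with a simple pattern search. The following is a minimal sketch, not the tool used in this study: the function name `find_ssrs` is hypothetical, and the minimum repeat numbers (10, 5, 4, 3, 3, and 3 for motif lengths 1 to 6) are assumed thresholds, since the section does not state the criteria applied.

```python
import re

# Minimum number of repeat units per motif length. These thresholds are
# assumptions for illustration; the text does not give the actual criteria.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_n in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_n - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # so each tract is reported once under its shortest motif.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence with one (A)10 tract and one (AT)5 tract.
toy = "GC" + "A" * 10 + "GC" + "AT" * 5 + "GC"
print(find_ssrs(toy))  # [(2, 'A', 10), (14, 'AT', 5)]
```

Counting the hits by motif length and by A/T versus G/C composition would give tallies of the kind summarized in this section.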

| Comparative genomic analysis
To investigate genomic divergence, the percentage of sequence identity was calculated for six species of Hamamelidaceae using the mVISTA program with L. chinense as the reference. The results showed that the similarity among the six species was high and the variability in the IR regions was less than that in the LSC and SSC regions. Furthermore, the chloroplast genomes were more highly variable in their noncoding regions than in their coding regions, which is consistent with the pattern found in most angiosperms (Yang et al., 2020) (Figure 5).
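The idea of percent identity along a genome relative to a reference can be sketched as a sliding-window comparison of two already aligned sequences. This is only an illustration of the concept and does not reproduce mVISTA or Shuffle-LAGAN; the function name, window size, and step size are assumptions, and gap columns are counted as mismatches.

```python
def window_identity(ref, other, window=100, step=100):
    """Percent identity per window for two aligned, equal-length sequences.

    Returns a list of (window_start, percent_identity) tuples. Alignment
    columns containing a gap ("-") are counted as mismatches.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    ref, other = ref.upper(), other.upper()
    results = []
    for start in range(0, len(ref) - window + 1, step):
        matches = sum(
            a == b and a != "-"
            for a, b in zip(ref[start:start + window], other[start:start + window])
        )
        results.append((start, 100 * matches / window))
    return results


# Toy alignment with a single substitution in the second window.
print(window_identity("ACGTACGTAC", "ACGTACGAAC", window=5, step=5))
# [(0, 100.0), (5, 80.0)]
```

Windows with lower identity correspond to the more variable stretches, which in this study fell mostly in the noncoding parts of the LSC and SSC regions.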
The chloroplast genome contains many variable nucleotides, which can be used to resolve closely related species or genera as valuable DNA barcodes (Xiong et al., 2020).
Among them, four fragments were distributed in the LSC region and three in the SSC region (Figure 6).

| Phylogenetic analysis
The chloroplast genome sequences obtained provide essential data with which to further elucidate and understand the phylogenetic relationships. Hamamelidaceae species gathered on a large branch, and species in the same genus were clustered together to a certain degree. The Hamamelidaceae branch was divided into two clades, with Chunia and Mytilaria related to the other nine genera. Disanthus was related to the other eight genera, among which Corylopsis and Loropetalum were found to be sister to the other six genera (Sinowilsonia Hemsl., Fortunearia, Sycopsis, Distylium, Parrotia, and Hamamelis). In addition, Corylopsis and Loropetalum were sister genera to each other. However, ML and BI analyses revealed incongruent topologies based on the IR regions. Moreover, some of the nodes had very low bootstrap support values (Figure S3), indicating that the IR regions were not suitable for identification or phylogenetic analysis.

| DISCUSSION
The chloroplast genome provides valuable information for species identification, as well as population genetics, phylogenetic, and genetic engineering studies (Daniell et al., 2016; Luo et al., 2021; Wu et al., 2021). In this study, the complete chloroplast genomes of  Mader et al., 2018; Xu et al., 2017; Yang, Hu, et al., 2018; and the GC content was lower than the AT content. This was generally the same as seen in other angiosperm chloroplast genomes (Asaf et al., 2018; Raubeson et al., 2007). The results also showed that the GC content in the IR regions was the highest, which may be due to the presence of a large number of rRNA genes in the IR regions. GC skewness is considered a dominant factor in codon bias. Several studies have indicated that high AT content is the main reason for synonymous codons ending in A/U (Clegg et al., 1994; Shimada & Sugiura, 1991), which may be related to natural selection and mutation during evolution.
In addition, SSRs are usually composed of a higher number of A + T bases than G + C bases (Hu et al., 2017; Kuang et al., 2011; Simeone et al., 2018; Yang, Hu, et al., 2018), which is consistent with our observations, and this may also be related to the high AT content in the nucleotide composition.
The lengths of the exons and introns in genes are important information in chloroplast genomes. Genes are interrupted by introns in major groups of organisms (Fan et al., 2021). One-intron genes vary among species, while clpP, rps12, and ycf3 are two-intron genes (Wu et al., 2020; Zhang, Gao, et al., 2020). This finding is consistent with our observations. ClpP protease encoded by the clpP gene widely exists in mitochondria and chloroplasts of prokaryotes and eukaryotes, where it plays a vital role in regulating protein metabolism (Chen et al., 2014; Zhang et al., 2014).
The rps12 gene is a trans-spliced gene with the 5′ end located in the LSC region and duplicated 3′ ends located in the IR regions (Guo et al., 2018). In addition, ycf3 is related to photosynthesis (Boudreau et al., 1997; Naver et al., 2001). Consequently, detecting the clpP and ycf3 genes will contribute to further investigation of the chloroplasts in Hamamelidaceae.
The LSC and SSC regions are usually variable, while expansion and contraction are noted in the highly conserved IR regions (Asaf, 2017), which may be a critical factor underlying the size variation in the chloroplast genomes (Daniell et al., 2016; Kolodner et al., 1976).
The difference in the size of the chloroplast genomes among the six Hamamelidaceae species compared in this study was not significant, which could be due to their similar expansion and contraction in the IR regions (such as rps19, ndhF, ycf1, and trnH located at the LSC/IRb, IRb/SSC, SSC/IRa, and IRa/LSC boundaries, respectively) except C. bucklandioides. The longest chloroplast genome among the six Hamamelidaceae species was observed in C. bucklandioides, which may be associated with the size expansion of ycf2 in the IR regions.
Expansion or contraction of the IR regions in these species is thought to be related to gene retention or loss, and we suggest that gene-loss events occurred during the evolution of this family and the differentiation of its species.
FIGURE 7 Comparison of the borders of the large single-copy (LSC), small single-copy (SSC), and inverted repeat (IR) regions among the six Hamamelidaceae chloroplast genomes. Genes are denoted by colored boxes. The gaps between the genes and the boundaries are indicated by the base lengths (bp)
The nucleotide diversity analysis also demonstrated that the IR regions contained fewer variable loci than the SC regions (LSC and SSC regions). Moreover, genes with nucleotide diversity (Pi) values > 0.055 were mainly located in the SC regions. Chloroplast genomes have a copy-dependent repair mechanism to ensure consistency and stability of the two IR regions in sequence, which enhances the stability and conservation of the genome (Khakhlova & Bock, 2006; Perry & Wolfe, 2002). This could explain why the IR regions contain less sequence divergence than the LSC or SSC regions (Shaw et al., 2007). None of the intron-containing genes (ndhA, ndhB, petB, petD, atpF, rpl16, rpl2, rpoC1, trnA-UGC, trnG-GCC, trnG-UCC, trnI-GAU, trnK-UUU, trnL-UAA, trnV-UAC, trnE-UUC, rps12, clpP, and ycf3) had a Pi value > 0.055, except rps16, suggesting that intron-containing genes are more highly conserved than genes containing only exons in the chloroplast genome. In other words, higher variability was found in the genes containing only exons, which provides more valuable information for species evolution.
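Nucleotide diversity (Pi) is commonly defined as the average number of nucleotide differences per site between all pairs of sequences in an alignment. The sketch below computes Pi for a set of aligned sequences and in sliding windows; it is a simplified illustration, not the software used in this study. The function names and the window and step sizes are assumptions, columns with gaps or ambiguous bases are skipped, and only the 0.055 threshold comes from the text.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site over gap-free alignment columns."""
    if len(aligned) < 2:
        raise ValueError("need at least two aligned sequences")
    aligned = [s.upper() for s in aligned]
    # Keep only columns where every sequence has an unambiguous base.
    columns = [col for col in zip(*aligned) if all(b in "ACGT" for b in col)]
    if not columns:
        return 0.0
    n_pairs = len(aligned) * (len(aligned) - 1) // 2
    diffs = sum(sum(a != b for a, b in combinations(col, 2)) for col in columns)
    return diffs / (n_pairs * len(columns))


def sliding_pi(aligned, window=600, step=200):
    """Return (window_start, Pi) tuples; window and step sizes are assumptions."""
    length = len(aligned[0])
    return [
        (start, nucleotide_diversity([s[start:start + window] for s in aligned]))
        for start in range(0, length - window + 1, step)
    ]


def hotspots(windows, threshold=0.055):
    """Windows whose Pi exceeds the threshold (0.055 is the value used in the text)."""
    return [(start, pi) for start, pi in windows if pi > threshold]


# Toy alignment: one variable site out of four, three sequences.
toy = ["ACGT", "ACGA", "ACGT"]
print(round(nucleotide_diversity(toy), 4))  # 0.1667
```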
Chloroplast genome data are valuable for analyzing species definitions because organelle-based "barcodes" can be established for some species and then applied to reveal the phylogenetic relationships among species (Fan et al., 2021; Yang et al., 2013) Hamamelidaceae originated successively in the evolutionary history of angiosperms, and the three groups are paraphyletic (Dong et al., 2013, 2018; Soltis et al., 2013; Tarullo et al., 2021; Xiang et al., 2019). Alternatively, a different relationship of these paraphyletic groups was inferred from the morphological and molecular data, with an earlier divergence time for Cercidiphyllum than for Liquidambar (Magallon, 2007 Africa; and Northeastern Australia (Bobrov et al., 2020; Tarullo et al., 2021). The diversity in Hamamelidaceae is not fully understood, and extinct and extant new species are still being reported (Averyanov et al., 2017; Haynes et al., 2020; Huang et al., 2017). Therefore, the morphological and molecular evidence may not be complete due to sampling difficulties. Conversely, the unresolved mysteries in Hamamelidaceae may lead to more follow-up studies. To fully understand the phylogeny of Hamamelidaceae species, studies on more genera and more genes need to be conducted in the future.
Nevertheless, the phylogenetic trees constructed in this study provide a valuable resource for investigating the classification, phylogeny, and evolutionary history of Hamamelidaceae.

| CONCLUSION
In this study, the complete chloroplast genomes of three Hamamelidaceae species were determined and the basic structures, conservation, and variability in these sequences were revealed. The IR regions were more conserved than the LSC or SSC regions, while the noncoding regions contained more variability than the gene coding regions. SSRs and divergent hotspot regions could be used to develop molecular markers for population genetic and phylogenetic studies. The complete chloroplast genomes, LSC regions, and SSC regions were used to establish good phylogenetic relationships and solve the relationships between and within genera, while the IR region was not suitable for identification or phylogenetic analysis.
FIGURE 10 Bayesian inference (BI) and maximum likelihood (ML) phylogenetic trees were constructed using the general time-reversible (GTR)+F+G4 model based on the SSC regions. Numbers on the branches are support values for ML-SH-aLRT, ML-UFBoot, and BI-PP (SH-aLRT/UFBoot/PP). The species investigated in this study are colored in red
Notably, the relationship within the genus Distylium has not been well resolved. More studies on the relationship within this genus are needed to fully understand the phylogeny of Hamamelidaceae species. The results of this study provide a valuable reference for further studies on species identification, determination of evolutionary relationships, and the development of genetic resources within Hamamelidaceae.
Program (2021K038A). The authors would like to thank TopEdit (www.topeditsci.com) for its linguistic assistance during the preparation of this manuscript.

CONFLICT OF INTEREST
The authors declare no conflicts of interest regarding publication of this paper.

DATA AVAILABILITY STATEMENT
The original sequencing data have been submitted to the NCBI database and received GenBank accession numbers MZ642354 (C. glandulifera), MZ823391 (C. velutina), and MZ642355 (L. chinense). The data used to support the findings of this study are included in Appendix S1.