No hybrid snowcocks in the Altai—Hyper‐variable markers can be problematic for phylogenetic inference

A recent article in Ecology and Evolution featured the discovery of hybrid snowcocks (Tetraogallus) and speculated on the hybrid origin of an extant species (T. altaicus). Comprehensive re‐analyses of original data from the latter paper reliably refute the phylogenetic hypothesis taken as firm evidence of a past hybridization event in these birds. The new re‐analyses showed that there is no evidence of hybridization in these snowcocks from the data available so far.

dividuals were suggested to have derived from successive interspecific and intrageneric hybridization (thus having an admixed hybrid genome of three species; Toews et al., 2018).
In a recent paper on the diversification of Tetraogallus snowcocks (Galliformes, Phasianidae), Ding et al. (2020) reported on the discovery of three putative hybrid individuals that they supposed to have originated from a hybridization event between the Himalayan snowcock (T. himalayensis) and the Tibetan snowcock (T. tibetanus). These three birds had originally been identified as T. himalayensis by Wang et al. (2011) and carried to two distinct D-loop haplotypes (H35, H36). From Wang et al. (2011), there is no information on a deposition of any specimen that would refer to these two haplotypes, so for the time being we must assume that Wang et al. (2011) at best had seen the birds and identified them as T. himalayensis based on their phenotype. However, in Ding et al.'s (2020) re-analysis of the D-loop data set by Wang et al. (2011), the two crucial T. himalayensis sequences (H35 and H36) clustered with T. tibetanus. Ding et al. (2020) took this as firm evidence of having detected previously undocumented hybrid specimens that must have originated from a putative recent event when "the male T. himalayensis hybridized with the female T. tibetanus." To be precise, at this stage of analysis, Ding et al. (2020) were discussing two snowcock specimens that someone else had identified as Himalayan snowcocks, T. himalayensis, and that contrary to expectations, they believed to carry a mitochondrial haplotype of another species-level lineage, that is, that of the Tibetan snowcock, T. tibetanus. This alone is not indicative of hybrid origin of these specimens, because there are other obvious explanations for this supposed mismatch between taxon assignment at GenBank and molecular species identification, for example a simple misidentification (Tritsch et al., 2017). The result might also be indicative of mitochondrial introgression into a phenotypical T. himalayensis population with a nuclear gene pool of that species. Readers would therefore expect further and stronger evidence from molecular data for the hybrid hypothesis. However, instead Ding et al. (2020)  Furthermore, (Ding et al., 2020) hypothesized that the extant Altai snowcock, T. altaicus, should have emerged from that postulated hybrid population. Thus, according to Ding et al. (2020), the Altai snowcock should be the next candidate to be added to the shortlist of avian hybrid species. They put this hypothesis to test using a multi-locus data set of five mitochondrial markers and four nuclear markers for six species (four Tetraogallus sp. and two Alectoris sp.) each of them represented by a single concatenated sequence.
However, a sampling that includes a single individual of a hypothesized hybrid taxon and a single individual of each of the putative parental taxa has barely any explanatory power. Moreover, a closer look at the results presented by Ding et al. (2020)  data set, we should expect a strong mitochondrial signal and thus a closer relationship of the putative hybrid species T. altaicus to the postulated female parent, that is, T. tibetanus. However, the opposite is the case: The multi-locus phylogeny by Ding et al. (2020) showed a sister-group relationship of T. altaicus and the presumed male parent, T. himalayensis (Ding et al., 2020: figure 2). Taking into account that the hybrid origin of T. altaicus under the scenario developed by Ding et al. (2020) would be realistic, extremely strong mito-nuclear discordance could be one (if not the sole) alternative explanation for this unexpected topology (Bonnet et al., 2017;Toews & Brelsford, 2012).
Identification of hybrid individuals using a single mitochondrial marker (D-loop as in case of haplotypes H35/H36) seems even less convincing. However, if we acknowledged the hybrid origin of these specimens from cross-breeding between a female T. tibetanus and a male T. himalayensis, then we would expect the hybrid offspring to carry any of the known recent haplotypes of the female parental species or at least a haplotype that was firmly nested in one of the parental clades. The Italian sparrow, Passer italiae, that was reliably shown to have originated from past hybridization between two sparrow species (P. domesticus, P. hispaniolensis) during the late Pleistocene is a perfect example: In all populations genotyped so far, haplotypes of the house sparrow (P. domesticus) lineage occur with frequencies near 100% (Hermansen et al., 2011(Hermansen et al., , 2014Päckert et al., 2019). However, the situation in Tetraogallus snowcocks is different: The two putative hybrids haplotypes H35 and H36 were not "embedded in the T. tibetanus clade" as, claimed by Ding et al. (2020), they were sister to this clade separated by a deep split dated at 1.83 Ma from all T. tibetanus (figure 5 in Ding et al., 2020).

| Mitochondrial markers
To control for the position of the three putative hybrid specimens (haplotypes H35, H36) in the Tetraogallus tree, I downloaded the original D-loop sequences used by Ding et al. (2020) from GenBank.
Because there was no information on alignment procedure in Ding et al. (2020), I applied the ClustalW algorithm (Higgins et al., 1994) in MEGAX (Kumar et al., 2018). From that first alignment, it was evident that original GenBank sequences had different lengths, because they were gathered from two different studies: All T. himalayensis  Table S1). For data set 1, those 74 sequences were cut down to 890 bp, whereas data set 2 con-  (Table S2).

| Multi-locus sequence data
To control for the position of the putative hybrid form T. altaicus in the Tetraogallus tree, I downloaded original sequences included in the multi-locus-data set of 9 loci by Ding et al. (2020) from GenBank (Table S3). During multi-locus alignment preparation, I realized that ND2 sequences of the two outgroup species Alectoris rufa and A. chukar (DQ307002, FJ752426) were nearly identical. This appeared rather unreliable, because the two coding mtDNA markers (COI and cytb) differed greatly among the two species and so did the noncoding mtDNA markers (D-loop, 12S rRNA). Because Ding et al.
(2020) used the split among these two outgroup species for time calibration of their multi-locus phylogeny, I assumed a notable effect of that mismatch among mtDNA markers on divergence times. however, we might assume that this could be due to the use of an inappropriate ND2 sequence for one of the outgroups. Therefore, I downloaded a comprehensive data set of ND2 sequences for Alectoris and Tetraogallus to control for the position of the two partridge outgroup sequences DQ307002, FJ752426 (Table S4).

| Inference of phylogeny
For Bayesian inference of phylogeny using BEAST v.1.8.1 (Drummond et al., 2012), I applied exactly the same settings as provided in Ding et al. (2020): MCMC chain length of 50 Mio generations, sampling every 1000th generation, and a burn-in of 10% applied. I also referred to the same time calibration as Ding et al. (2020) who assumed a fixed node age for the time of the most recent common ancestor (tmrca) of the two partridge species used as outgroups: 2.84 Ma for the split between Alectoris chukar and A. rufa, mean tmrca prior starting value, SD = 0.01 (compare Ding et al., 2020).
Following Ding et al. (2020), I applied the HKY + I + G to the D-loop data set. Though Ding et al. (2020) claimed that they had a priori identified "best partition schemes for the dataset," they applied a single model (GTR + I) across all 9 loci of the multi-locus data set.
Because they stated that they had used concatenated sequence data for Bayesian inference of phylogeny with BEAST, I must assume that they treated their entire multi-locus alignment as a single partition.
Although there are more appropriate alternatives for partitioning of a multi-locus data set including coding and non-coding mtDNA and nuclear markers, I decided to stick closely to Ding et al.'s (2020) protocols and treated the whole multi-locus data set as one partition.  Table   A1 in Ding et al., 2020). I had to remove T. caspicus from the nuclear data set, because none of the nuclear markers was available for that species. When according to the model settings in Ding et al. (2020) the GTR + I model was applied to the nuclear data set, BEAST runs were always aborted due to a numerical likelihood error, no matter which prior settings were modified. I had to apply the simpler HKY As a control, I used a maximum-likelihood approach using RaxML (Stamatakis, 2006(Stamatakis, , 2014 with the GTR + I + Г model applied for tree | 16357 LETTER TO THE EDITOR reconstruction with all data sets and alignments. All obtained phylograms were edited in FIGTREE vers. 1.4.2 (Rambaut, 2009).
For illustration of mitochondrial lineage differentiation, I reconstructed unrooted minimum spanning networks with PopART (http://popart.otago.ac.nz) and with TCS (Clement et al., 2000) to check for a potential effect of gaps in the D-loop alignment.
For calculation of uncorrected pairwise distances between species and intraspecific mitochondrial lineages, I used MEGAX (Kumar et al., 2018). parsimony-informative sites were located within the region between site 101 and 138, that is, 41% of the total parsimony-informative sites were accumulated in a hypervariable region of only 37 bp length.

| Mitochondrial markers
When the entire first fragment including that hypervariable region was cut off, the remaining fragment (751 bp length) contained 37 variable sites (the two Alectoris outgroup taxa excluded) of which 28 were parsimony-informative. At those parsimony-informative sites, sequences H35 and H36 shared ten (out of 28) substitutions with all T. himalayensis, whereas only one substitution with T. tibetanus and another strongly diversified T. himalayensis haplotype H37. When only that last fragment of the D-loop (751 bp without the hypervariable region) was analyzed (including sequence information for the putative hybrid form T. altaicus), H35 and H36 were sister to T. himalayensis with full support from Bayesian posterior probabilities and moderate support from likelihood bootstrap (Figure 2, Figure S1).
Due to removal of the hypervariable region, the number of distinct D-loop haplotypes decreased from 73 to 36 (T. himalayensis, n = 18; T. tibetanus, n = 15; T. altaicus, n = 1). To avoid inflation of the data set and a possible effect of duplicate identical sequences, the analysis was repeated with the remaining 35 distinct haplotypes. Removal of duplicate haplotypes had no effect on likelihood bootstrap values; however, it evoked a notable decrease of node support from Bayesian posterior probabilities (except for the node uniting clades A, B1, and B2 of T. himalayensis; Figure 2). Nevertheless, a sistergroup relationship of the supposed hybrid lineage H35/H36 or of the supposed hybrid species T. altaicus with the putative female parent T. tibetanus was not supported in any run with BEAST or RaxML based on the second conservative fragment (751 bp; Figure 2, Figure   S1). The Altai snowcock, T. altaicus, was firmly nested in T. himalayensis; however, its relationships with clades A, B1, and B2 of the latter species were conflicting among phylogenetic reconstructions; for example, in the Bayesian tree T. altaicus was sister to clade B1 4 vs. 13 substitutions in the TCS network when gaps were treated as a 5th character, Figure S2). Accordingly, pairwise distances between H35/H36 and the putative female parent T. tibetanus inferred from the second conservative D-loop fragment alone were about 2 to 3 F I G U R E 1 Phylogenetic relationships of enigmatic haplotypes H35 and H36 inferred from three different data sets of mitochondrial D-loop sequences (1= 890 bp, n = 74; 2= 1163 bp, n = 74; 3= 1163 bp, n = 80) based on two different alignment procedures (automatic alignment by ClustalW (left) and automatic alignment plus manual editing of the hypervariable region of sequences H35 and H36 (right) using two different methods: (a) Bayesian inference of phylogeny; time-calibrated node Alectoris (mean age = 2.84 Ma; below RaxML, node support inferred from 1000 bootstrap replicates); (b) maximum-likelihood using RaxML; node support inferred from all three data sets indicated as shown in the figure; asterisk indicates full support; diamonds indicate nodes supporting a sister-group relationship of specimens H35/H36 with either T. tibetanus (diamond I) or T. himalayensis (diamond II) times greater (p-dist = 1.6%) than those between H35/H36 and each of the three intraspecific mitochondrial lineages of T. himalayensis (0.5% < p-dist < 0.9%; Table 1, below diagonal).
Both the D-loop and the cyt-b data confirmed the existence of several intraspecific splits among mitochondrial lineages of the Himalayan snowcock, T. himalayensis (Figure 3, Figure S3). In both data sets, mean intraspecific divergence within T. himalayaensis was greater than mean interspecific divergence between the latter species and T. altaicus (Figure 3, Figure S3; Table 1). This is another reason why resolution of phylogenetic relationships was problematic for a deeply divergent mitochondrial lineage of T. himalayensis like H35/H36, particularly when phylogenetic inference was based on a hypervariable marker like the D-loop.

| Multi-locus data set
Separate tree reconstructions for concatenated mtDNA and nuclear markers showed that T. altaicus carried a mitochondrial lineage that was closely related to T. himalayensis (not to the postulated female parent T. tibetanus), however, with poor support (Figure 4). Comparison of ND2 sequences showed that the Alectoris rufa sequence DQ307002 Ding et al. (2020) had selected for multi-locus analyses was firmly nested in the A. chukar clade ( Figure S4). Thus, their two outgroup species used for time calibration were unnaturally similar in one mtDNA marker, which as a consequence led to unreliably ancient divergence time estimates among snowcock species (Table 2). When the inappropriate outgroup sequence DQ307002 was replaced by another sequence that represented the true A. rufa lineage, divergence time estimates for all nodes decreased as expected (Table 2).

| D ISCUSS I ON
Re-examination of D-loop sequence data from Ding et al. (2020;and Wang et al., 2011) reliably showed that the unexpected position of samples H35 and H36 in the Tetraogallus tree was strongly affected by alignment procedure and the position of gaps in a hypervariable region (compare Swain, 2018). As a consequence, the sister-group relationship of H35 and H36 with T. tibetanus-the sole argument by Ding et al. (2020) for the putative hybrid status of the three specimens carrying those haplotypes-was strongly supported by only one out of six runs with BEAST and by none of the six maximumlikelihood trees inferred from RaxML analysis.
Although the control region generally appeared to perform equally well for phylogenetic reconstructions as other mitochondrial genes (such as cytochrome b for example in Galliformes; Randi et al., 2001), the hypervariable domains I and II can be considerable sources of error in alignments and thus lead to incorrect tree topologies. Particularly the phylogenetic signal of gaps and their position in an alignment can have a strong effect on phylogenies (Dessimoz & Gil, 2010;Lutzoni et al., 2000), even more when shifts of larger To ensure alignment accuracy, removal of hypervariable regions from alignments prior to phylogenetic reconstruction is therefore a commonly applied strategy, for example, for D-loop sequences (Robins et al., 2010) or other hypervariable markers like 16S rRNA (Anthes et al., 2008). Removing ambiguously aligned parts from alignments can even make sense for phylogenetic analysis of protein sequence data sets (Talavera & Castresana, 2007). According to these expectations, removal of the first hypervariable D-loop fragment from the alignment resulted in a fully supported monophyletic T. himalayensis including sequences H35 and H36.
In fact, Ding et al. (2020) did not discover previously unknown hybrid snowcocks, because haplotypes H35 and H36 just represent another deeply split mitochondrial lineage of the genetically diverse Himalayan snowcock, T. himalayensis (Figure 1, clades A, B1, B2; compare Wang et al., 2011). Strong intraspecific variation of the Himalayan snowcock was corroborated by two strongly diverged cytochrome b lineages of T. himalayensis (Figure 3) and by a recent microsatellite study . Strikingly, the strong intraspecific diversification of T. himalayensis does not seem to be associated with currently discussed future restrictions of access to digital sequence information (DSI) under the regulations of the Nagoya Protocol.

CO N FLI C T O F I NTE R E S T
None to declare.

DATA AVA I L A B I L I T Y S TAT E M E N T
All sequence data used in this study are available at GenBank. For information on sequence data sets accession numbers, please refer to information provided in the Methods and to the full documentation of the sequence data sets in Tables S1-S4.