From STRs to SNPs via ddRAD‐seq: Geographic assignment of confiscated tortoises at reduced costs

Abstract Assigning individuals to their source populations is crucial for conservation research, especially for endangered species threatened by illegal trade and translocations. Genetic assignment can be achieved with different types of molecular markers, but technical advantages and cost saving are recently promoting the shift from short tandem repeats (STRs) to single nucleotide polymorphisms (SNPs). Here, we designed, developed, and tested a small panel of SNPs for cost‐effective geographic assignment of individuals with unknown origin of the endangered Mediterranean tortoise Testudo hermanni. We started by performing a ddRAD‐seq experiment on 70 wild individuals of T. hermanni from 38 locations. Results obtained using 3182 SNPs are comparable to those previously obtained using STR markers in terms of genetic structure and power to identify the macro‐area of origin. However, our SNPs revealed further insights into the substructure in Western populations, especially in Southern Italy. A small panel of highly informative SNPs was then selected and tested by genotyping 190 individuals using the KASP genotyping chemistry. All the samples from wild populations of known geographic origin were genetically re‐assigned with high accuracy to the original population. This reduced SNPs panel represents an efficient molecular tool that enables individuals to be genotyped at low cost (less than €15 per sample) for geographical assignment and identification of hybrids. This information is crucial for the management in‐situ of confiscated animals and their possible re‐allocation in the wild. Our methodological pipeline can easily be extended to other species.

Some of the most common and widespread techniques used for genotyping SNPs in nonmodel organisms are based on reduced libraries as the restriction site-associated DNA sequencing (RAD-seq; Davey et al., 2011;Hohenlohe et al., 2011;Miller et al., 2007), its variant double-digest RAD-seq (ddRAD-seq; Peterson et al., 2012) and other derived methods (Campbell et al., 2018). Currently, sequencing of nuclear genomic regions with reduced library approaches offers the possibility to obtain, at low price, several thousands of SNPs. Moreover, selecting a posteriori only the informative SNPs that maintain as much as possible the original dataset information (e.g., SNP-based panel) required for the purpose of interest can considerably reduce the subsequent genotyping costs and technical efforts necessary to analyze a large number of individuals. This aspect is crucial in conservation genetics where practitioners need accuracy but also cheap and easy-to-implement guidance (Holderegger et al., 2019).
SNP-based panels, compared to STRs, address this need by simplifying the allele scoring procedure, allowing data transferability across studies, with the potential for high-throughput screening (e.g., Garvin et al., 2010). SNP arrays are also flexible, with lowdensity arrays allowing tens to thousands of SNPs to be genotyped and high-density arrays or "SNP chips" supporting tens of thousands to millions of SNPs (Humble et al., 2020;von Thaden et al., 2020).
Small panels of SNPs (<100) have already been used for the identification of distinct genetic populations to delineate biologically accurate management units and for the identification of individuals, including the assignment of individuals to their source population (Förster et al., 2018;Henriques et al., 2018;Kleinman-Ruiz et al., 2017). This approach can be described in four steps. First, a large number of SNPs is isolated and typed in a small number of individuals; second, the suitability of these markers to answer the question of interest is tested; third, a much smaller panel of SNPs is identified, given it can answer the same question of interest but at lower costs; and fourth, the subset is tested using a larger sample of individuals.
Here, we applied these steps to isolate informative SNPs for the assignment of samples to their population or area of origin. Inferring the geographic origin of living organisms from their genotypes is of great interest in wildlife management, conservation, and forensic applications. It can provide information about gene flow, migration patterns and connectivity in natural populations (Kremer et al., 2012) but can also help inform wildlife managers about illegal animal translocations and poaching hotspots (Biello et al., 2021;Ogden & Linacre, 2015). Our study produced a tool specific for the Herman's tortoise, but it can be considered also as a useful example to facilitate the production of reduced SNP panels in other species.
The Hermann's tortoise, Testudo hermanni Gmelin (1789), is one of the most endangered reptiles in Europe. The intense harvesting for pet trade, especially before the 1980s when it was not banned yet (Ljubisavljević et al., 2011), the releases of non-native individuals, and the habitat reduction and degradation (Stubbs et al., 1985) are the major threats for this species (Bertolero et al., 2011). T. hermanni is distributed in disjoint populations across Mediterranean Europe, from Spain to the Balkans, including various Mediterranean islands.
Based on mtDNA and STR markers, two subspecies are recognized (Biello et al., 2021;Perez et al., 2014) Red List (1996). Previous genetic studies suggest that individuals can be correctly assigned in most cases to their macro-area of origin with a small panel of STR markers (Biello et al., 2021;Perez et al., 2014).
However, given the large number of individuals of unknown origin hosted in rescue centers and potentially suitable for reintroductions plans, the possibility to use modern NGS technologies to decrease the genotyping costs should be carefully considered.
The main goal of this study is to introduce and validate a small panel of SNPs extracted from genome-wide distributed markers for the cost-effective geographic assignment of large numbers of confiscated T. hermanni individuals, necessary for their management and their possible re-allocation in the wild. We accomplished this goal following the four steps defined above. In particular, 1. we performed a ddRAD-seq experiment on 84 wild individuals of which 70 provided informative genotypes for further analyses; 2. we verified that the results provided by the ddRAD-seq loci, in terms of genetic structure and power to identify the macro-area of origin, are adequate (in our case, considering the results obtained with a sample of 292 wild individuals already typed at 7 STRs); 3. we designed a cost-effective reduced panel of SNPs with retained informativeness; and 4. we tested the reduced SNP panel in 190 individuals using the KASP-by-Design Fluidigm Assays (LGC Genomics; He et al., 2014).
Our study has practical consequences for the conservation, management, and reintroduction of T. hermanni, and it is also demonstrative of a technical pipeline applicable to other species.

| Samples, DNA extraction, and ddRAD-seq library preparation
Eighty-four samples already available in our laboratories were selected to perform the ddRAD-seq genotyping experiment. These samples belonged to individuals collected across most of the contemporary geographical range of the species and covering all the previously known genetic clusters. Of these, 70 were retained for further analyses after quality control and filtering (see next paragraph and Figures S1-S12

| Quality control and filtering
Raw reads generated were demultiplexed and split into individual data files using the process_radtags program in version 1.44 of STACKS . We used TRIMMOMATIC (Bolger et al., 2014) to trim our raw reads, removing read-through adapters, low quality (light filter-modify on a case-by-case basis), and shorter than 200 bp, cropping the first two bp (GG left from restriction site).
The resulting reads were 200 bp in length to ensure downstream compatibility in STACKS, which requires uniform read length when building de novo loci.
After trimming and quality-filtering, the STACKS programs ustacks, cstacks, and sstacks were used to build de novo catalog loci.
We specified a minimum of five reads per locus to build primary catalog stacks and allowed a maximum distance of six nucleotide mismatches within these stacks. Furthermore, we allowed eight mismatches between stacks during construction of the catalog because the specimens in our dataset were represented by two subspecies of the same species.
Following de novo mapping, an initial data-filtering step was performed using the population component of STACKS retaining only those loci present in at least 75% of individuals at each site and with a maximum of 5 SNPs per locus (full dataset, n = 3182). In an additional filtering step, we retained only one SNP per locus (single SNP dataset, n = 1179) to minimize redundancy of genetic information due to the physical linkage between the markers. This reduced dataset was used for PCA and F ST -based methods. The resulting vcf files were converted to other program-specific input files using PGDSPIDER v2.0.5.1 (Lischer & Excoffier, 2012).

| Genetic structure with ddRAD-seq in comparison with STRs
In order to compare the results provided by the ddRAD-seq loci in terms of genetic structure and power to identify the macro-area of origin of the individuals, we selected a subset of 292 wild samples genotyped previously at 7 STR loci (Biello et al., 2021;Perez et al., 2014) from the same locations of the samples included in the ddRAD-seq experiment.
The population structure was inferred from the ddRAD-seq with the package fineRADstructure designed to identify co-ancestry from RAD-seq data (Malinsky et al., 2018). Briefly, the algorithm compares each RAD-seq locus of each individual (recipient) with the alleles in all other individuals (potential donors) estimating the number of sequence differences (i.e., SNPs) to infer its nearest neighbor allele (donor) that is the allele (i.e., the concatenation of all the SNPs at each locus) with the least number of differences. The local co-ancestry values are then summed across all loci to obtain the co-ancestry similarity matrix for the full data. Next, an MCMC clustering algorithm infers the most likely population configuration.
In summary, using information from the entire genetic variation identified with RAD-seq, the analysis estimates the number of genetic clusters, quantifies ancestry sources in each group in terms of co-ancestry value, and provides a tree-like illustration of the relationships between clusters. To perform this analysis, the output of the software population in STACKS was converted into a RADpainter input file with the python script Stacks2fineRAD.py. Samples were assigned to populations using 100,000 iterations as burn-in before sampling 100,000 iterations. A tree was built using 10,000 iterations, and the output visualized using the fineradstructureplot.r and finestructurelibrary.r R scripts (available at https://github.com/milla nek/fineR ADstr ucture). Populations were defined as clusters within the fineRADstructure tree and relatedness plot.
The result of the cluster analysis based on the ddRAD-seq dataset was compared with similar analyses suitable for the panel of STRs described above. Specifically, we performed a cluster analysis on the STRs dataset using the Bayesian method implemented in STRUCTURE v2.3.4 (Pritchard et al., 2000). The analysis was conducted choosing a model with admixture, uncorrelated allele frequencies, and a nonuniform ancestry prior alpha among clusters, as suggested by (Wang, 2017) for uneven samplings. We ran 20 replicates for each value of K (i.e., the number of source populations) from K = 1 to K = 10 with 750,000 MCMC after a burn-in of 500,000.
Structure results were summarized and visualized with the web server CLUMPAK (Kopelman et al., 2015). We used STRUCTURE HARVESTER (Earl & vonHoldt, 2012) to infer the best value of K, based on both deltaK (Evanno et al., 2005) and the Pr[X|K] (Pritchard et al., 2000).
We further tested for population structure both the ddRADseq and STR datasets using a Principal Component Analysis (PCA) with the dudi.pca function in the R package adegenet v2.1.3 (Jombart, 2008). The samples were visualized following the clusters inferred by fineRADstructure.
Pairwise population F ST values were estimated among clusters inferred from fineRADstructure analysis using ARLEQUIN v3.5 (Excoffier & Lischer, 2010) for ddRAD-seq and STR datasets. Furthermore, in order to investigate in detail the genetic pattern found in Calabria (Southern Italy), where additional genetic clusters are identified only with ddRAD-seq data (see Section 3), we tested the effects of geographic distance on genetic structure on those populations. We calculated the pairwise genetic distances among individuals using either the number of allelic differences between individuals or the proportion of shared alleles (D PS ; Bowcock et al., 1994) using the R packages poppr (Kamvar et al., 2014) and adegenet v2.1.3 (Jombart, 2008), respectively. Euclidean distances between geographic locations were calculated using the R package fossil (Vavrek, 2011). The Mantel test was conducted using the R package vegan (Dixon & Palmer, 2003).

| Design of a small panel of informative SNPs
The ability of the panels to reflect the geographic assignment based on the complete set of SNPs was analyzed following a "node approach" based on the dendrogram produced by fineRADstructure with the whole dataset (see Section 3). This dendrogram represents the maximum level of geographic resolution supported by the entire set of SNPs (n = 3182). To maintain this level of resolution while reducing the number of markers, we considered one node at the time, and ranked each SNP with the aim of identifying SNPs that were most informative for maximizing genetic differentiation between the two groups separated by the node. A comparable number of SNPs was selected from each node (see Section 3). We applied three different methods to select informative SNPs use- Our aim was to identify the best panel with 48 or 96 SNPs, meaning that a total of six panels were compared (two panel sizes for each method). Only SNPs having at least 50-bp flanking region in the ddRAD-seq loci were selected, as required by the KASP screening protocol (see below).

| PCA-based panel
We performed seven principal component analyses, one per node, using the R package adegenet v2.1.3 (Jombart, 2008). For each PCA, a graph was created to visualize the distribution of the data and evaluate which axes would better separate the distinct groups; then, for each informative axis, we saved the SNPs within the top 5% loading value.

| F ST -based panel
The R package genepop v1.0.5 (Rousset, 2008) was used to compute pairwise F ST between the two groups branching from each node, retaining the SNPs with the highest F ST values.

| Random forest panel
We finally applied RF (R package randomforest v4.6-12; Liaw & Wiener, 2002), a machine learning method used for classification and regression. This method allows to identify a limited number of features (e.g., SNP) that best classify the observations into prior groups by computing the Mean Decrease in Accuracy (MDA). MDA is a measure of worsening of the model in classifying the observations when removing one feature at the time. Higher MDA means that the feature is important to the model. At every run, a forest of decision trees is grown. The out-of-bag (OOB) error is how RF measures misclassification; it is the mean of the prediction error of a bootstrapped training subsample across all the trees. Following the node approach as for the previous methods, we determined for every node our ntree parameter (number of trees) by running RF with 100, 500, 1000, 8000, 25,000 trees. The mtry parameter (number of features considered at each node) was tested using the function tuneRF. We then ran RF 10 times per node, as suggested in Sylvester et al. (2018), computed and exported MDA from every run. We only selected SNPs that were listed in all the runs per node with MDA > 1.
We finally created two panels retaining SNPs with the highest MDA.
We performed self-assignment tests to compare the three methods (each with two panel sizes, 48 and 96 SNPs) in terms of assignment accuracy. Assignment testing was performed by using the R package AssignPop (Chen et al., 2018). Initially, the dataset is divided into training (baseline), and test (holdout) datasets using a resampling cross-validation approach by the function assign.MC. The user can specify the proportion of individuals from each source "population" to be used in the training dataset. High grading bias is avoided using this method by producing randomly selected, independent training and test datasets (Anderson, 2010). Furthermore, using a PCA, the dimensionality of the training datasets (i.e., the genotypes) is decreased, and the result is utilized to create prediction models using user-chosen classification machine learning algorithms (Chen et al., 2018). Lastly, the models are used to estimate membership probabilities of tested individuals and assign them to a source population. Additionally, the training data are evaluated, and assignment tests are performed to assess the origin of individuals (Chen et al., 2018). Resampling was repeated 500 times for each combination of training individuals and loci. The proportion of individuals from each source population randomly allocated to the baseline dataset was set to 0.5 and 0.7. Finally, the linear discriminant analysis (LDA) was used as a classifier model.

| Test and validation of the SNP panel
After evaluating the performances and the cost-effectiveness of the 96 and 48 SNP panels, we selected the smaller set of markers (48 SNPs) identified by the F ST approach and these loci were tested with the KASP-by-Design Fluidigm Assays (LGC Genomics;He et al., 2014). Using this panel, we typed a total of 190 DNA samples including samples previously typed with other markers (only 7 STRs, n = 68; STRs and ddRAD-seq, n = 7; 75 in total) as internal control, samples from captivity (from rescue centers), or from wild populations never typed before (see Table S1-S12 for a detailed description of the samples). ddRAD-seq reference (3182 SNPs; 70 samples): We performed an assignment analysis with STRUCTURE (POPFLAG for individuals in the reference database; "update allele frequencies using only individuals with POPFLAG=1" option under a USEPOPINFO without admixture model). K was fixed at its optimal value (see Section 3), while the run length and other parameters were set as above (see STRs analysis with STRUCTURE). We assigned individuals to a source population when the probability of an individual belonging to that population was above 80% (Biello et al., 2021;Lang et al., 2021). Exclusion tests were performed with the partially Bayesian exclusion method (Rannala & Mountain, 1997) implemented in GENECLASS2 (Piry et al., 2004). We compared observed genotypes with an expected likelihood distribution of genotypes generated for each reference population by simulating 1000,000 individuals with Monte Carlo resampling. STR reference (7 loci; 292 samples from Biello et al., 2021): We performed assignment analysis on a subset of samples for which STR data were already available using the same methods and parameters used for the ddRAD-seq reference.
The diagram in Figures S1-S12 summarizes the workflow of our study.  Table S2). After filtering loci for missing data, by retaining markers that were genotyped in at least 75% of the individuals, we recovered 1179 loci and 3182 SNPs (full dataset).

| Genetic structure with ddRAD-seq and comparison with STRs
The co-ancestry matrix obtained with fineRADstructure using the full dataset showed the presence of eight main clusters (Figure 1a).
The lowest amount of co-ancestry was found between the two subspecies, T. h. boettgeri and T. h. hermanni. Within the T. h. hermanni group, six distinct clusters were identified: Italian Peninsula (ITP), North Calabria (NCA), Central Calabria (CCA), South Calabria (SCA), Sicily (SIC), and Sardinia (SAR). These results support a finer genetic structure than previously suggested using STRs (Biello et al., 2021;Perez et al., 2014). More specifically, SNPs allowed the identification of three clusters within the Calabria region (NCA, CCA, SCA) and a distinction between Sicily and Sardinia (SIC, SAR; Figure 1b). The average co-ancestry between the six clusters of T.
h. hermanni was generally low, with the higher co-ancestry levels found between ITP and NCA, and between SIC and SAR. Within the T. h. boettgeri group, we detected two clusters, namely Greece (GRE), and Bosco Mesola + Croatia (MEC). The comparison between individuals within these two genetic clusters showed the highest level of co-ancestry compared to other groups. See Table   S3 for a description of the nomenclature across the reference literature of each genetic cluster described in this study (Biello et al., 2021;Perez et al., 2014).
While fineRADStructure utilizes haplotype linkage information, other approaches such as PCA may be biased by the association between markers. Thus, population clustering was further investigated by means of PCA using only independent SNPs  Table S4). When we analyzed in detail the populations from Calabria (NCA, CCA and SCA), where additional clusters were identified, we found that geographic distance contributes significantly to genetic differentiation between populations in this region (number of allelic differences, Mantel test: r = 0.7413, p = 1e-04; D PS , Mantel test: r = 0.8404, p = 1e-04; Figure S5).
The SNPs markers isolated with the ddRAD-seq approach appeared thus able to genetically discriminate populations with different geographic distribution, with a precision that was not the same using different statistical approaches. To further confirm that the SNPs markers contained relevant information to discriminate individuals with different geographic origin, at least at the macro-areas level, we compared the geographic structure inferred with the SNPs with that estimated from a small STR panel of proven efficacy for conservation and forensic applications (Biello et al., 2021). The goal here was not to analyze in detail the difference between the different markers, but to additionally validate the SNPs dataset (based on 70 individual and >1000 independent biallelic loci) by showing that it contains the same geographic partition of the genetic variation inferred from a different type of molecular marker (almost 300 individuals typed at 7 STRs). We can confidently conclude that the geographic structure identified by the SNPs is very similar, and slightly more resolved, than that based on a much larger sample size

G R E M E C I T P N C A C C A S C A S I C S A R
because: (i) the pairwise F ST values between the groups inferred by the SNPs and the STRs ( Figure S2, upper diagonal) were highly correlated (see Figure S3); (ii) the PCA results on the STR dataset confirmed a clear subdivision of the two subspecies, as well as the two T. h. boettgeri clusters GRE and MEC (Figure 2c,d); overall, the first three components separated some of the T. h. hermanni populations, but with higher overlap between individuals, as observed in the SNPs dataset (Figure 2a,b); (iii) the major genetic clusters identified by the STRUCTURE analysis on the STR markers (see Figure 3 and Figure S4) corresponded to those observed with the SNPs dataset, but without separating the three pairs of areas with the smallest F ST values (ITP and NCA; CCA and SCA; SIC and SAR).

| Isolation of a small panel of informative SNPs
The SNPs were selected following a "node approach" based on the dendrogram produced by fineRADstructure with the whole dataset   Figure   S7). Assigning individuals to the two subspecies was extremely accurate, ranging from 99.1% to 100% depending on the method used (see Figure S8). Accuracy was very high also when we attempted to assign individuals to the two main groups in T. h. hermanni, Italian peninsula (including ITP, NCA, and CCA clusters) and the Mediterranean Islands (including SCA, SIC, and SAR). The mean accuracy ranged from 91.6% (PCA) to 97.6% (RF; see Figure S9).  Figure S10). Assigning individuals to the 2 subspecies was 100% accurate independently from the method used (see Figure   S11). The mean accuracy when we assigned individuals to the two main groups in T. h. hermanni was extremely accurate, ranging from 98.5% (F ST ) to 99.1% (RF; see Figure S12). of the tortoises were assigned to the GRE cluster, 24.3% to ITP, 18%

| Test and validation of the SNP panel
to MEC, 2.1% to the SAR, 0.5% to NCA, and 0.5% to CCA, while 21.7% of the individuals were not assigned to any predefined cluster (NA).
Furthermore, we compared the geographical assignment of samples that were genotyped at 41 SNPs and 7 STR loci (see Table S1), using first the ddRAD-seq reference and then the STR reference (see Biello et al., 2021;Figure 5). In order to minimize the differences between the two references, we decided, in this analysis, to group together the clusters SIC and SAR inferred by SNPs following the STRs cluster ISS (see Biello et al., 2021; Table S3). Assigning individuals to their area of origin was extremely accurate when we considered samples from wild populations with known geographic origin and for which the reference (ddRAD-seq reference) was available.
Both markers assigned all the samples to the same cluster ( Figure 5).
By contrast, for the captive populations, the results were not completely concordant between the two set of markers. The ddRADseq reference and the STR reference assigned 83% of samples to the same clusters ( Figure 5). Among the unassigned samples (17%), four samples assigned to GRE with the SNPs were assigned to MEC with the STR data (clusters from the same subspecies, T. h. boettgeri), and three samples assigned to ITP with the STRs were assigned to ITP and CAL with the SNPs (clusters from the same subspecies and adjacent geographic regions, T. h. hermanni).
Finally, we performed assignment analysis for the captive samples (from rescue centers), and from wild populations never typed before (Dune; Figure 6). Captive samples showed evidence of long-distance translocations and potential hybridization events.
For example, 46% of the tortoises from the Veneto center were not assigned to any cluster, 27% were classified as imported from Greece (GRE), 19% were classified within the T. h. hermanni subspecies (Italian Peninsula, ITP), and only 8% were assigned to the MEC cluster (typically found in northeastern Italy). In the Emilia-Romagna center, 58% of the tortoises had Greek origins, and 6% were assigned to the MEC cluster. Approximately one-third of the tortoises from this center was not assigned. The centers of Emilia-Romagna and Lazio show a similar proportion of unassigned individuals (33% and 22%, respectively). On the contrary, the two centers show an opposite trend in terms of individuals assigned to the GRE (i.e., nonnative origin) and ITP (i.e., local origin) clusters: 58% GRE and 6% ITP for the Emilia-Romagna center, 11% GRE and 56% ITP for the Lazio center. Interestingly, the tortoises from a wild and isolated population located in a natural reserve with fossil dunes corresponding to the former coastline of the Northern Adriatic Sea (Dune), were mainly assigned to the MEC cluster (78%; Figure 6). Populations from this cluster are typical in nearby areas (northeast Italy), showing a likely native component of this population.

| DISCUSS ION
In this study, we present the development and the application of a small panel of SNP markers useful to investigate population genetic structure and geographical assignment in the endangered species Hermann's tortoise, Testudo hermanni. We initially performed a ddRAD-seq experiment on 70 individuals from different locations, and we showed that these loci, in terms of genetic structure and power to identify the macro-area of origin, outperform those obtained with a sample of 292 wild individuals previously typed at 7 STRs. Starting from these results, we designed a small panel of SNPs, retaining the most informative markers in terms of maximizing F ST between the clusters inferred by the population genetic analysis.
Finally, we tested the small SNP panel in 190 individuals using the KASP genotyping chemistry, a relatively inexpensive SNP genotyping approach. This cost-effective molecular tool enables a large number of confiscated individuals to be genotyped for geographical assignment, necessary for their management and their possible reallocation in the wild.

| Population genetic structure of T. hermanni: From STRs to SNPs
The overall results of the genetic structure of T. hermanni wild populations using SNPs confirmed the pattern based on the analysis of 7 STR loci found by Biello et al. (2021) but also revealed a more detailed structure. First, SNPs data were able to identify three genetically distinct groups in Calabria, a small region in the South of Italy, corresponding to samples collected in Northern, Central, and Southern areas, respectively. This result is not surprising given that this geographically heterogeneous region is a hotspot of genetic diversity for many temperate species and likely acted as a single or multiple glacial refugia (Canestrelli et al., 2010;Chiocchio et al., 2017). Considering however that the samples are not covering homogenously the whole region, and the clustering algorithm we applied tends to overestimate genetic structure under isolation by distance (Frantz et al., 2009), we cannot exclude that these clades do not refer to genetic isolates but to genetically distinct groups sampled within an area of large genetic variation but limited barriers and genetic discontinuities. Second, SNP data suggest, differently from STRs, the existence of different genetic pools in Sicily and Sardinia. Finally, we note that even with a descriptive analysis such as the PCA, the SNPs dataset produces groups more easily distinguishable from each other, supporting the higher discriminatory power for these markers when compared to seven STRs. In summary, even if the samples present in the ddRAD-seq dataset included only 23% of the individuals included in the STR dataset, these were sufficient to understand better the genetic structure in this species, especially in emphasizing the fine-scale structure among wild populations.

| Design and validation of the SNP panel
Custom species-specific SNP panels, including high-ranking loci, have been shown to be highly informative for assignment F I G U R E 5 Contingency tables (a) and bar plots (b) of the geographical assignment of Wild (left) and Captive (right) samples to the ddRAD-seq reference and STR reference (see Table S1). (a) Columns correspond to the number of individuals assigned to original clusters using the ddRAD-seq reference, while rows correspond to the number of assignments using the STR reference (Biello et al., 2021) studies (Förster et al., 2018;Jenkins et al., 2019;Kleinman-Ruiz et al., 2017) and prove to be an important resource for identifying the geographic origin of species in a conservation and forensic perspective.
We tested different methods for selecting highly informative  hermanni's STRs, outsourcing the entire service is extremely expensive (Table 2). One option to reduce the cost of STR analysis is to carry out part of the wet lab internally, although KASP remains much cheaper (Table 2).

| Test of the SNP panel
We observed a high genotyping success rate when we tested the panel with 190 samples, even with a low DNA yield (500 ng), which is significant in the case of noninvasively collected samples. We showed that missing data were below 1.9% for every SNP in our final panel, which is low compared with higher rates of missing data observed especially in STR genotyping of species of conservation concern where the sampling should not be invasive (Kraus et al., 2015). A few SNPs (seven) failed due to technical problems, most probably caused by nonspecific annealing of primers in multiple genomic locations. The 41 SNPs performed very well in terms of individual identification.
Assigning individuals to their area of origin was highly accurate (100%) when we considered samples from wild populations with known geographic origin and for which the reference (ddRAD-seq reference) was available. These results were also confirmed by using a panel of STR markers (Biello et al., 2021). This is in accordance with previous studies, which showed that reduced panels of highly informative SNPs (<100 SNPs) performed as well as or better than the traditional number (10-20) of STR loci used for individual identification (Glover et al., 2010). We also showed that assigning individuals from captive populations to their location of origin was not completely concordant between the ddRAD-seq reference and the STR reference. Approximately 17% of samples were not assigned to the same groups or not assigned to any group. These samples could represent descendent of crosses in captivity between individuals with different origin (including hybrids between subspecies), or to the lack of major genetic clusters in the reference populations.
Among the tortoises with unknown origin that we genotyped for the first time and are currently hosted in Italian rescue centers, we found evidence of long-distance translocations of individuals from Balkan regions and individuals with unclear genotype (i.e., potentially hybrids). This pattern confirms the Balkan areas as a source of illegal trading and the presence of hybrids in the Italian rescue centers due to mating occurring in captivity between individuals with different origins (see Biello et al., 2021).
Most of the tortoises from a wild population located in a natural reserve with fossil dunes corresponding to the former coastline of the Northern Adriatic Sea (Dune), which was never sampled before, were assigned to the genetic cluster presents also in other areas in North-East Italy (Bosco Mesola) and Croatia. However, we also found non-native tortoises belonging to genetic cluster observed in Greek individuals.

| Implications for management and perspectives
Approximately 2  as compared to previous estimations (Perez et al., 2014;Zenboudji et al., 2016) in a wild population in the Var district (France). The massive release (more than 4300 captive individuals; Devaux, 1990) carried out in the 1990s to reinforce declining populations most TA B L E 2 Cost per sample for 7 STR markers (Biello et al., 2021;Perez et al., 2014) and 41 SNPs (present work) for an increasing number of samples France), these measures should be carefully evaluated as a necessary step in conservation plans.
In a previous paper we introduced a panel of 7 STR markers useful to reach this goal (Biello et al., 2021). It was an important advancement for the assignment of individuals of unknown origin, based on a large reference dataset of 461 wild individuals collected across most of the distribution range of this species. Here, we showed a new development based on a preliminary ddRAD-seq genomics study that allowed the introduction of an informative and cost-effective panel of biallelic markers suitable for the T. hermanni protection. This study, describing in detail the necessary methodological steps of the process, represents also a useful example for the development of similar tools in other species with similar management needs.
The SNP panel introduced in this study allows the genotyping of a large number of samples at low cost. Compared to the development and the typing of a set of similarly informative STR markers, this approach is cheaper, faster; requires less handling; and provides immediate standardization across laboratories. At the moment, however, the reference database of wild populations for the SNPs panel is still smaller and geographically less representative of the distribution range of this species compared to the STRs database, and we therefore recommend, until more wild individuals will be typed, the use of the SNPs panel integrated for difficult and especially forensic assignments with the analysis of the STR markers.
Hundreds of tortoises are currently maintained in captivity in breeding and rescue centers, providing a highly valuable source of individuals for reintroduction and reallocation projects. A fraction of these individuals, selected considering their health conditions, sex, age, and genetic compositions, could represent not only a demographic and genetic supplementation of small natural populations, but also the founders of new wild populations in ecologically suitable areas where this species was present in the past. These interventions would favor the well-being of the captive tortoises but also the human well-being by creating more natural and biodiverse environments (Fuller et al., 2007;Kuo, 2015).

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
Raw demultiplexed sequences are available on the Sequence Read Archive (SRA) on the study accession number: PRJNA777783.