Development of a single nucleotide polymorphism array for population genomic studies in four European pine species

Pines are some of the most ecologically and economically important tree species in the world, and many have enormous natural distributions or have been extensively planted. However, a lack of rapid genotyping capability is hampering progress in understanding the molecular basis of genetic variation in these species. Here, we deliver an efficient tool for genotyping thousands of single nucleotide polymorphism (SNP) markers across the genome that can be applied to genetic studies in pines. Polymorphisms from resequenced candidate genes and transcriptome sequences of P. sylvestris, P. mugo, P. uncinata, P. uliginosa and P. radiata were used to design a 49,829 SNP array (Axiom_PineGAP, Thermo Fisher). Over a third (34.68%) of the unigenes identified from the P. sylvestris transcriptome were represented on the array, which was used to screen samples of four pine species. The conversion rate for the array on all samples was 42% (N = 20,795 SNPs) and was similar for SNPs sourced from resequenced candidate gene and transcriptome sequences. The broad representation of gene ontology terms by unigenes containing converted SNPs reflected their coverage across the full transcriptome. Over a quarter of successfully converted SNPs were polymorphic among all species, and the data were successful in discriminating among the species and some individual populations. The SNP array provides a valuable new tool to advance genetic studies in these species and demonstrates the effectiveness of the technology for rapid genotyping in species with large and complex genomes.

resources that could be used for high-resolution genotyping for population genetic and genomic studies. This is particularly true for species with large and complex genomes, for which genome assembly is particularly challenging (Prunier, Verta, & MacKay, 2016;Zimin et al., 2017).
The draft conifer genomes that have been published so far (e.g. Picea abies: Nystedt et al., 2013;Pinus taeda: Zimin et al., 2014 Pinus lambertiana: Stevens et al., 2016;Pseudotsuga menziesii: Neale et al., 2017; Abies alba: Mosca et al., 2019), highlight the complexities involved in working with species within the order: their genomes exceed 20 gigabase pairs and contain transposons and other repetitive content including gene families and pseudogenes. However, developments in high-throughput sequencing methods have allowed insights into genomes of virtually any organism. Whole transcriptome sequencing is a relatively straightforward and informative alternative method (as compared to whole genome sequencing) as it effectively subsamples the genome by sequencing the coding regions alone: reducing both the vast size and complexity of the task.
Previous research, involving transcriptome sequencing of Pinus sylvestris and several closely related taxa from the Pinus mugo complex, has led to the discovery of thousands of polymorphic regions for pines (Wachowiak, Trivedi, Perry, & Cavers, 2015).
Increasing numbers of examples of SNP-based genotyping methods for humans, animals, crops, model plant species and some forest trees (Kathiresan et al., 2009;Kranis et al., 2013;Lepoittevin et al., 2015;Singh et al., 2015) demonstrate their potential to significantly advance genomic studies. Arrays containing thousands of loci provide much higher coverage and resolution of the genome compared with traditional sequencing and PCR-based genotyping methods.
However, their usefulness is dependent on the inclusion of markers, which have been carefully chosen to avoid ascertainment bias, either via their distribution across the genome or their frequency within and among populations. A further consideration is the fact that arrays are necessarily confined to genotyping at a preselected subset of all available loci and will only genotype at this selected subset, as well as fail to uncover additional variation-the SNP selection process should therefore be carefully undertaken to optimize the SNP set for the question being asked. However, such tools can provide powerful and cost-efficient alternatives to other genotyping methods, such as genotyping by sequencing and exome capture, which are also computationally expensive and more demanding of bioinformatic expertise by comparison. Furthermore, genotyping arrays can be applied for repeated screening of a common set of SNPs across different experiments and the genomic coverage they offer appears to be ample to allow trait prediction methods such as those employed in 'breeding without breeding' and genomic selection approaches (El-Kassaby & Lstibůrek, 2009;Grattapaglia & Resende, 2011) and may be useful in evolutionary studies.
Here, we describe the design and testing of a 49,829 SNP chip (Axiom_PineGAP array) for population genetic and molecular breeding studies in pines, with a focus on Scots pine (Pinus sylvestris L.), dwarf mountain pine (P. mugo T.), mountain pine (P. uncinata R.) and peat-bog pine (P. uliginosa N.). These taxa form a monophyletic group within Pinaceae, but differ strongly in phenotype, geographical distribution and ecology (Wachowiak, Perry, Donnelly, & Cavers, 2018).
This, together with their recent speciation history and high phenotypic divergence, mean they are especially suitable for comparative analyses of patterns of polymorphism and divergence and form a useful experimental system for studies of phenotypic trait variation at different ecological and evolutionary timescales (Wachowiak et al., 2015;Wachowiak, Zaborowska, et al., 2018). For the array design, we considered all available genomic resources for these species, including SNPs derived from transcriptomes and resequencing of candidate genes from previous population genetic studies. The aim was to develop a new large scale molecular tool for population genomic analysis of pines and to test its ability to differentiate species by genotyping a sample collection from each of the focal species.

| Design of the array
An initial set of 260,262 SNPs were provided to Affymetrix, derived from transcriptome and candidate gene sequences and markers from previous population genetic studies. Each probe consisted of the target SNP plus two (forward and reverse) 35-nt flanking sequences.
Filtering to select the final set included blast results to reference genome, Affymetrix recommendations, SNP frequency in a discovery panel and representation of transcripts. Other sequences in the genome similar to the probe, indels and the repetitive nature of large pine genomes may lead to spurious probe or primer binding and consequently reduce the number of converted SNPs. Therefore, for each of the two flanking sequences per SNP, Affymetrix calculated a repetitive variable and a p-convert score. SNPs with more than 300 hits of the flanking sixteen nucleotides between the SNP sequence and ref-  (Table S1). The vast majority (N = 49,052) were obtained from transcriptome sequencing of four pine species: Pinus sylvestris, P. mugo, P. uncinata and P. uliginosa (Wachowiak et al., 2015). These included SNPs, which were common to all species and also SNPs fixed in one species and polymorphic within and among others. A further set of SNPs (N = 578) were included from candidate genes (N = 279), which had been resequenced in previous population genetic studies of the pine species (Kujala & Savolainen, 2012;Mosca, Eckert, Di Pierro, et al., 2012;Palmé, Wright, & Savolainen, 2008;Wachowiak, Balk, & Savolainen, 2009;Wachowiak, Zaborowska, et al., 2018).
Variation in mitochondrial DNA (mtDNA) was targeted using a set of SNPs (N = 14), which had been discovered by Donnelly et al. (2017). A set of SNPs putatively associated with susceptibility to Dothistroma needle blight (discovered in Pinus radiata, ENA accession numbers ERS1034542-53) were also included (N = 185). SNPs derived from transcriptome sequences (Wachowiak et al., 2015) were assessed to determine how well they represented variation across Pinus unigenes.
During transcriptome assembly, contigs were aligned into 40,968 unigenes, of which 40,798 were considered to be high quality (not containing putative retrotransposon sequences). The proportion of unigenes included on the array and within the successfully converted SNP set were determined. The representation of different gene ontology (GO) classifications on the array and within the converted SNP set was also assessed in comparison with the transcriptome. After a new assembly of the P. taeda reference genome became available (GCA_000404065.3_Ptaeda2.0, Zimin et al., 2017), we used the SNP sequences (71-mers) included on the array as blast queries. A blast hit was counted where there was a minimum 95% match between the 71-nt SNP probe and the reference genome. A high proportion of hits to the P. taeda genome in SNPs which were not successfully converted may indicate that these failed due to, for example: nontarget hybridization; undetected SNPs in the flanking region; and the presence of a third allele. In contrast, a low proportion of hits to the P. taeda genome in nonconverted SNPs may indicate that the target sequence spanned an intron, for example, or that they were simply missing from the genome assembly.

| Source of samples and DNA extraction
The technology demanded an initial commitment of 5 x 384 samples for development of the array. This full sample set of 1920 was genotyped, and all successful genotypes were used in optimization of SNP calling filters (see next paragraph). Here, we report full details of array testing using a subset of 87 samples, which included four species of pine (Pinus sylvestris: SY; P. mugo: MU; P. uncinata: UN; P. uliginosa: UL; Tables S2-S4). DNA was extracted from needles using a Qiagen DNeasy 96 kit following the manufacturer's instructions.
Needles were dried on silica gel prior to extraction and DNA was quantified using a Qubit spectrophotometer to ensure a minimum standardized concentration of 35 ng/µl. The quality of genomic DNA was also checked visually for fragmentation on 1% agarose gel.

| Genotyping and SNP calling
Genotyping was done at Edinburgh Genomics following DNA amplification, fragmentation, chip hybridization, single-base extension through DNA ligation and signal amplification performed according to the Affymetrix Axiom ® Assay protocol. Genotyping was performed in 384-well format on a GeneTitan according to the manufacturer's procedure. Genotype calls were performed using Axiom Analysis Suite software as recommended by the manufacturer (Thermo Fisher). In order to both maximize the number of samples included in subsequent analyses and minimize the distortive effect of poor quality samples on genotype calls, three separate analyses were performed (Table S5).
Samples were assigned to an analysis group based on their call rate (CR) and dish QC (DQC: a metric provided by Thermo Fisher which is generated by measuring signals at multiple sites in the genome known not to vary among individuals), using the following thresholds: DQC 'high' ≥0.82; DQC 'low' <0.82; CR 'high' ≥96; CR 'low' <96. Analyses: High-quality samples (N = 529), with high CR and DQC, were used to set thresholds for allele calls. Posteriors for allele calls were subsequently used as priors for analyses 2 (N = 753) and 3 (N = 251).

| Testing of the array and statistical analyses
To test the array, a subset of genotyped samples of each species (N = 87) were analysed, including five populations per species (three populations for UL). Each population comprised 3-5 individuals,  (2003) TA B L E 1 Conversion rates for single nucleotide polymorphism (SNP) array using thresholds set by high-quality samples (N = 529) each from different mother trees (Tables S2 and S4). SNP type (polymorphic, monomorphic, CR < 80%) was compared among species to identify the proportion of SNPs, which could be analysed further.

Principal component analysis for all samples across all species was
performed in TASSEL (Bradbury et al., 2007). Separate analyses were performed using all available SNPs (N = 20,795) and only common (mean allelic frequency, MAF > 0.1) SNPs, which were polymorphic among all species (N = 2,358). The first two principal components from each analysis were plotted to qualitatively assess the extent of variation within and among species.

| Conversion rates for SNPs on the array
In total, 49,237 (over 98%) SNPs on the array represented genomic regions from pine transcriptome sequencing, and the remaining 1.19% included SNPs from candidate genes and mitochondrial regions ( The representation of GO terms in unigenes containing converted SNPs was more comparable to that of their coverage across the transcriptome than on the full array (Figure 1), for example: metabolic process, regulation of biological process and response to stimulus were overrepresented in the full array while cell death, cell communication and membrane were underrepresented. The mean percentage difference in GO term representation among the transcriptome and full array was 1.27% (range: 0.03%-6.84%) whereas the mean percentage difference in representation of GO terms among the transcriptome and converted SNPs was 0.11% (range 0.01%-0.33%).

| Testing of the array
Over a quarter of successfully converted SNPs were polymorphic among all species (N = 5,665,  Nearly a further quarter of all converted SNPs were monomorphic among all species (N = 4,950) or had a low call rate in at least one species (N = 4,921). About two thirds of successfully converted SNPs had call rates above 90% (Table S6) Figure S1).

| D ISCUSS I ON
The design of this genotyping array and its successful testing in a small number of populations across four related pine species demonstrates that the method can be effective in organisms with large and complex genomes. The design and testing of the array were necessarily constrained by the availability of material-both plant tissue  The moderate conversion rate for the pine array was probably due to a combination of factors. The use of transcriptome sequences for SNP discovery, which aimed to identify markers in coding regions, may have resulted in the inclusion of exon-intron boundaries or unknown polymorphisms within oligonucleotide binding regions.
Duplication and homology within flanking regions are also likely for a proportion of the unconverted SNPs, despite measures to control for this during the array design by using the first assembly of Pinus taeda genome (Zimin et al., 2014) to identify loci within duplicated or paralogous regions. As about 80% of successfully converted SNPs had unique hits to the novel assembly of the P. taeda genome (Zimin et al., 2017), we assume that the conversion rate may be partially affected by nontarget hybridization. In addition, the predominance of Scots pine transcriptome in initial SNP selection and the subsequent sample set used to set thresholds for allele calls may have compromised the conversion rate to some extent, viewed across all species.
However, the use of multiple species in designing and testing the array has the major advantage of increasing the potential applications of the array (for example, the same array may be used for both multiple and single species assays) and its subsequent relevance to a broader range of the scientific community. Although the conversion rate of SNPs derived from P. radiata (putative Dothistromaassociated SNPs) was limited and included no polymorphic SNPs, this species is not phylogenetically close to P. sylvestris (Gernandt, Lopez, Garcia, & Liston, 2005). Therefore, in future iterations of the array, other pines from the same section which are frequently studied due to their ecological and/or economic importance, such as P.
nigra, P. halepensis, P. pinaster or P. pinea, may be better candidates if strong trait-linked SNPs in these species have been identified.
Disadvantages of using multiple species include the effects of ascertainment bias and genomic divergence: for example, the sample set used to set thresholds for allele calls included only 0.57% P. mugo samples, which subsequently had extremely high levels of SNPs with low call rates compared with the other species. It is unknown whether this was a cause or effect: there were similarly low levels of P. uliginosa in the high-quality sample set but the level of SNPs with low call rates was not equivalently high, although this could be due to the genetic relatedness of P. uliginosa and P. sylvestris (Wachowiak et al., 2011).
Overall, the SNP conversion rate was similar for SNPs derived from different sources including transcriptomes, resequenced genes and mtDNA data variants. The array, and particularly the set of converted SNPs, represented the pine transcriptome (and therefore, putatively, the genome) well: unigenes identified in transcriptome sequencing were well represented, as were the range and relative proportion of associated gene functions. Although a published reference genome for the studied species is not yet available, it is known that there is rapid decay of linkage disequilibrium in conifers (Brown, Gill, Kuntz, Langley, & Neale, 2004;Wachowiak et al., 2009); therefore, the set of converted SNPs identified in this study are likely to represent unique mutations in individual genes. These findings support the use of SNP arrays in studies focussing on local adaptation and association genetics. Breeding studies are also increasingly looking to marker-assisted selection (Isik, 2014) to inform selection for trait improvement. These studies require well-phenotyped individuals and a large pool of SNPs from across the genome with which to identify putatively associated markers. A basic understanding of their putative function is also important to put results in context.
Pinus sylvestris, which is both economically and ecologically important, is a particularly suitable candidate for these studies: quantitative traits, such as growth, phenology and disease resistance, are commonly measured in progeny-provenance trials of this species and there would be considerable commercial value in the identification of markers linked to key traits.
The converted SNPs were found to reflect results from previous studies on inter-and intraspecific relationships in the focal species which, despite morphological, geographical and ecological differentiation, show high genetic similarity based on biometric, biochemical (monoterpenes, isozymes) and molecular markers (Boratyńska & Boratyński, 2007;Lewandowski et al., 2000;Wachowiak, Boratyńska, & Cavers, 2013). The presence of a high number of shared SNPs among the species, as found in comparative transcriptome sequencing (Wachowiak et al., 2015) and candidate gene studies (Wachowiak, Zaborowska, et al., 2018), confirms their close genetic relationships and common ancestry. Considering their re- Under multivariate analysis, three major clusters were identified (with similar resolution when including all available SNPs or only a subset of common polymorphic SNPs), which corresponded to P. sylvestris, P. uncinata and P. mugo/ P. uliginosa. The position of the P. uliginosa samples, between P. sylvestris and P. mugo but more proximate to the latter, has been observed in previous studies (Wachowiak, Żukowska, Wójkiewicz, Cavers, & Litkowiec, 2016).
Although the number of samples from individual populations was limited, the analysis showed the array's potential for discriminating populations at broad spatial scales. Within species, populations showed clear structure, while those from P. uliginosa appeared to be most clearly diverged, supporting results from a recent study using mtDNA markers (maternally inherited in pines; Łabiszak, Zaborowska, & Wachowiak, 2019). There was evidence for distinct population clusters in other species for highly diverged populations at the species' margins (including P. sylvestris in Scotland, P. mugo from the Abruzzi Mountains in Italy and P. uncinata from Castiello de Jaca in Spain).
The design and testing of the pine array demonstrate its potential application to a range of studies including, but not limited to population genetics, evolutionary history, conservation management, local adaptation, association genetics and tree breeding. The array is the highest resolution molecular tool developed to date for the focal pine species and has significant potential to further understanding of pine genetics and genomics.

ACK N OWLED G M ENTS
This work was financially supported primarily by two grants in the

AUTH O R CO NTR I B UTI O N S
The research was designed and planned by WW and SC; SNP selection for the array, data analysis and manuscript writing were performed by AP and WW; SNP genotyping was performed by AD and RT; Manuscript review and revision and final approval of manuscript were performed by all authors.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data sets supporting the results of this article are freely available through the NERC's Environmental Information Data Centre, as follows: (a) Scots pine SNP for Axiom array ( Stephen Cavers https://orcid.org/0000-0003-2139-9236