An SSR‐based approach incorporating a novel algorithm for identification of rare maize genotypes facilitates criteria for landrace conservation in Mexico

Abstract Since maize was domesticated in Mexico around 9,000 years ago, local farmers have selected and maintained seed stocks with particular traits and adaptations to local conditions. In the present day, many of these landraces are still cultivated; however, increased urbanization and migration from rural areas pose a risk that this invaluable maize germplasm may be lost. In order to implement an efficient mechanism of conservation in situ, the diversity of these landrace populations must be estimated. Development of a method to select the minimum number of samples that would include the maximum number of alleles and identify germplasm harboring rare combinations of particular alleles will also safeguard the efficient ex-situ conservation of this germplasm. To reach this goal, a strategy based on SSR analysis and a novel algorithm to define a minimum collection and rare genotypes using landrace populations from Puebla State, Mexico, was developed as a "proof of concept" for methodology that could be extended to all maize landrace populations in Mexico and eventually to other native crops. The SSR-based strategy using bulked DNA samples allows rapid processing of large numbers of samples and can be set up in most laboratories equipped for basic molecular biology. Therefore, continuous monitoring of landrace populations locally could easily be carried out. This methodology can now be applied to support incentives for small farmers for the in situ conservation of these traditional cultivars.

| 1681 HAYANO-KANASHIRO et Al. traditional dishes. In Mexico, maize landraces are maintained (Badstue et al., 2007) by saving seed from one season to the next (Pressoir & Berthaud, 2004), and desirable genotypes are often exchanged between family members or through social alliances with both local and distant farmers, or even acquired from commercial suppliers (Bellon & Berthaud, 2004; Louette, Charrier, & Berthaud, 1997). When seed stocks are insufficient, farmers will commonly mix seed from several different sources (Bellon & Berthaud, 2004). The heterogeneous and dynamic nature of local landraces is advantageous when environmental conditions vary or infestation by pests or pathogens occurs.
Although in commercial terms many landraces are unsuitable for grain production, these varieties provide a reservoir of genes that could be exploited to develop new materials with specific adaptations (Esquinas-Alcazar, 2005).
The introduction of commercial maize hybrids and the potential introduction of transgenic cultivars in the future have raised concerns with respect to genetic erosion of traditional landraces (Dyer, López-Feldman, Yúnez-Naude, & Taylor, 2014). Since the 1940s, to support the in situ conservation of maize germplasm by encouraging small farmers to maintain the cultivation of traditional landraces and considering incentives which would benefit the farmers who preserve the most diverse genotypes, even though these are often not commercially viable materials. The main challenges to implementing a strategy of incentives are to: (1) implement a relatively simple experimental strategy that can be easily replicated in low-tech laboratories, but allows reliable sampling and genotyping of a maximum number of individuals while maintaining overall costs at a minimum, (2) obtain a realistic image of the existing diversity in the main regions of the country where landraces are routinely grown, and (3) identify within these samples the most uncommon or "rare" genotype combinations.
Developing a strategy to meet these challenges with emphasis on supporting local farmers to maintain their traditional methods of cultivation and selection, while safeguarding the conservation of diversity within landrace populations, is the main objective of this report.
In order to meet these challenges, several genotyping methods were considered. For the proposed landrace diversity study, it was reasoned that to make the best use of resources, the priority should be the robust analysis of the greatest number of samples, in contrast to the accumulation of extensive genotype data on a few samples. Therefore, although genotyping-by-sequencing (GBS) methods (Elshire et al., 2011; Poland, Brown, Sorrells, & Jannink, 2012) are extremely powerful and economically relatively accessible, these methods can be time-consuming, their exploitation implies sophisticated infrastructure and depends on highly trained bioinformatics experts, and the level of complex data generated would be a drawback rather than an advantage for the efficient conclusion of the proposed diversity study. From these observations, a microsatellite (SSR)-based strategy was developed and, by employing an information theory approach and previously obtained maize SSR data, a sampling protocol and minimum number of SSR markers were determined (Reyes-Valdés et al., 2013).
Although several methods have been reported to identify "rare" genotypes or the smallest subset of most diverse genotypes (Gouesnard et al., 2001; Kim et al., 2007; Thachuk et al., 2009), these have been targeted at ex-situ germplasm collections or collections assembled for breeding purposes. The range and scope of this long-term project called for the development of a rapid and robust method of analysis to quickly identify germplasm comprising uncommon alleles or allele combinations and to facilitate the implementation of efficient conservation strategies. Therefore, a novel algorithm was developed and tested with this aim. While developing the algorithm, it became clear that it could also be exploited to identify the minimum number of accessions needed to cover all the diversity identified in a particular sample. These materials could then be maintained with reduced storage and maintenance costs as a safeguard ex-situ collection in the event that the in situ germplasm is lost.
The ultimate goal of the initial phase of the landrace diversity project is to analyze around 1,000 maize landrace accessions collected within the last 5-10 years from the Mexican states of Puebla, Tlaxcala, Michoacán, Oaxaca, Guerrero, and Tabasco, with the aim of identifying rare genotypes, supporting decisions on the provision of incentives to small farmers, and encouraging the in situ conservation of maize germplasm. In order to optimize available resources, the strategy was were analyzed in this study as a "proof of concept" that the experimental strategy and data analysis methods developed were feasible and effective. These samples were chosen because they had been recently collected (2011) and geographical and morphological data were well documented. Descriptions of the accessions and other pertinent data are presented in Table S1. Type 1 refers to the primary race classification of each accession and Type 2 to a secondary, additional classification for accessions where more than one race could be identified.

| Selection of microsatellite markers
Fourteen microsatellite markers distributed across the 10 maize chromosomes were chosen based on data from a prior simulation analysis

| PCR amplification conditions
One primer of each pair was 5′ fluorescently labeled with one of the ABI Prism dyes: 6-FAM, PET, NED, and VIC (see Table 1). PCR amplification was carried out in a 30 μl volume using AmpliTaq Gold® PCR Master Mix (Applied Biosystems). One hundred nanograms of template genomic DNA from each bulk was used for the PCR amplification using a GeneAmp 2600 or Veriti thermal cycler (Applied Biosystems). the full-scale analysis. All primer combinations produced PCR products within the expected size range. Before the PCR products were sent for separation on an Applied Biosystems ABI 3730XL sequencer (carried out at the Genomic sequencing facility at LANGEBIO, CINVESTAV-Irapuato), positive controls and a selection of samples were visualized on 2% agarose gels.

| SSR genotyping
PCR reactions for each primer pair were carried out separately and then combined to produce samples containing the four different fluorescent dyes before separation of the amplified fragments on the ABI 3730XL, using GeneScan 500LIZ as size standard (Applied Biosystems). Samples were genotyped using GENEMAPPER V. 4.0 and Peak Scanner V. 1.0 software programs (Applied Biosystems).

| Data analysis
The marker selection and bulk sampling scheme was developed and A coefficient of rareness, Ri, was estimated for each of the 185 accessions. For a given accession i, this measure is calculated as the average of the squared differences between the score of each marker/allele combination in that accession and the mean score of the same combination in the whole collection. Therefore, accessions with a higher average of uncommon marker/allele combinations have a higher Ri value than those that have more common alleles.
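The verbal definition of Ri can be illustrated with a short sketch. It assumes that each accession is scored 1 or 0 for the detection of each marker/allele combination; the scoring scheme actually used is given in Data S1 and may differ. The function name `rareness` and the toy accessions are hypothetical.

```python
from typing import Dict, List


def rareness(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean squared deviation of each accession's scores from the collection mean.

    Assumption: scores[a][j] is 1 if marker/allele combination j was
    detected in accession a and 0 otherwise.
    """
    names = list(scores)
    n_comb = len(next(iter(scores.values())))
    # Mean score of each marker/allele combination over the whole collection.
    means = [sum(scores[a][j] for a in names) / len(names) for j in range(n_comb)]
    # Average of the squared differences for each accession.
    return {
        a: sum((scores[a][j] - means[j]) ** 2 for j in range(n_comb)) / n_comb
        for a in names
    }


# Hypothetical toy collection: four accessions, four marker/allele combinations.
toy = {
    "acc1": [1, 1, 0, 0],
    "acc2": [1, 1, 0, 0],
    "acc3": [1, 0, 1, 0],
    "acc4": [1, 1, 0, 1],
}
R = rareness(toy)
rarest = max(R, key=R.get)
```

In this toy collection, the accession that lacks a common combination and carries an uncommon one obtains the highest value, which is the behavior described in the text.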
To estimate a set of accessions that includes all marker/allele combinations, a looping algorithm (AMA) was developed. The algorithm begins by selecting the accession with the highest Ri and including it in the selected set. Then, for each accession not in the selected set, the gain in number of marker/allele combinations that the accession would contribute is measured, and the accession with the greatest gain is added to the set. In the case of a tie, the accession with the higher Ri value is selected. The process is repeated until all marker/allele combinations are included in the selected set. Although this procedure does not guarantee the identification of the smallest or "optimum" set, it produces results close to it. The methods used to develop Ri and AMA are described in detail in Data S1.
The relation between race and marker/allele combinations was determined by contingency analyses using the likelihood ratio test or G-statistic. Linear regression models using various selection methods were employed to estimate the putative relationship between marker/allele combinations and meters above sea level (MASL). Details and discussion of the statistical data analysis are presented in Data S1. All data can be accessed at http://computational.biology.langebio.cinvestav.mx/GenoMaiz/index.html
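The selection loop described above can be sketched as a greedy covering procedure. This is an illustration written from the verbal description only, not the implementation in Data S1; the function name `greedy_cover`, the toy accessions, and their rareness values are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(present: Dict[str, Set[str]], rareness: Dict[str, float]) -> List[str]:
    """Select accessions until every marker/allele combination is covered.

    present[a]  : set of marker/allele combinations detected in accession a
    rareness[a] : rareness coefficient of accession a (assumed precomputed)
    """
    all_combos = set().union(*present.values())
    # Step 1: seed the set with the accession of highest rareness.
    first = max(present, key=lambda a: rareness[a])
    selected = [first]
    covered = set(present[first])
    # Step 2: repeatedly add the accession contributing the most new
    # combinations, breaking ties in favor of the higher rareness value.
    while covered != all_combos:
        candidates = [a for a in present if a not in selected]
        best = max(candidates, key=lambda a: (len(present[a] - covered), rareness[a]))
        selected.append(best)
        covered |= present[best]
    return selected


# Hypothetical toy data: combinations are labeled "marker.allele".
present = {
    "A": {"m1.a", "m2.a"},
    "B": {"m1.a", "m2.b"},
    "C": {"m1.b", "m2.a", "m3.a"},
    "D": {"m1.a", "m2.a", "m3.a"},
}
toy_rareness = {"A": 0.10, "B": 0.20, "C": 0.30, "D": 0.15}
selected = greedy_cover(present, toy_rareness)
```

Because every choice is made with a fixed rule, repeated runs on the same data return the same set. As with any greedy covering heuristic, the result is not guaranteed to be the smallest possible set.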
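As an illustration of the likelihood ratio test mentioned above, the G-statistic for a contingency table can be computed as G = 2 Σ O ln(O/E), where O are observed counts and E are the counts expected under independence. The sketch below is a generic textbook calculation; the table layout (marker/allele combination detected or not, by race of interest versus other races) and the counts are hypothetical and are not taken from the study.

```python
import math
from typing import List, Tuple


def g_statistic(table: List[List[float]]) -> Tuple[float, int]:
    """Return the G-statistic and degrees of freedom for a contingency table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # empty cells contribute nothing to the sum
                expected = row_tot[i] * col_tot[j] / n
                total += obs * math.log(obs / expected)
    df = (len(table) - 1) * (len(col_tot) - 1)
    return 2.0 * total, df


# Hypothetical counts: rows = combination detected / not detected,
# columns = accessions of one race / accessions of all other races.
g, df = g_statistic([[10, 20], [20, 10]])
```

Under the null hypothesis of no association, G approximately follows a chi-square distribution with the returned degrees of freedom.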

| RESULTS
The geographical locations of the collection sites for the 185 maize landrace accessions analyzed in the present study are shown in Fig. S1. As can be observed, the samples were obtained throughout Puebla State and cover locations at different altitudes and with different soil types.
In order to gauge the efficiency of the experimental strategy in terms of allele detection, the total number of alleles and the range of sizes of SSR alleles identified in the accessions from Puebla were compared with previous studies using the same SSRs to determine diversity in maize inbred or landrace materials, as shown in Table 2. SSR marker PHI031 was the only marker used in the current study for which no previous reports were available for Mexican maize landraces. For the remaining 13 SSRs, seven presented more alleles, five presented fewer, and one presented the same total number of alleles as had been described previously for maize landraces (Table 2). In addition, in all cases a wider range of allele sizes is reported in the current study in comparison with previous reports.
Taken together, these data indicate that the experimental strategy and the SSR markers selected are informative for the material under study and have the ability to uncover new, unidentified alleles for each marker.

| Comparison of genetic distances within and between accessions
Genetic distances between accessions from Puebla State (PL) were Landrace populations are very variable, and we would expect some level of diversity within each accession/bulk. This is also illustrated by the data in Table S1, where around one-third of the accessions showed characteristics of two different races (Type 1 and Type 2). Therefore, as an additional measure to demonstrate the consistency of the data presented in Figure 1, the genetic distances between pairs of bulks from the same accession and between bulks from different accessions were calculated. The results of this analysis are presented in section S1-2.2 of Data S1 and show that the distances between pairs of bulks range from 7.94 to 22.45 with a mean of 15.82 and a median of 15.75 (see Table S1-6 and Figs. S1-3 in Data S1).

| Relationships observed between dendrogram topology, geographical location, and race classification
The dendrogram in Figure 1 presents clearly defined groups in relation to widely separated genotypic groups (PL landraces, PA, and TE).
In order to determine whether geographical location or morphological traits were also correlated with the groupings observed, associations between maize type (race), kernel color, geographical location, and meters above sea level of geographical location were determined and the results of these analyses were superimposed on the original dendrogram.
Sixteen distinct maize races were identified in the PL samples (Table S1), and the mean genetic distance between races was found to be slightly higher (15.33) than within races (14.42). However, for the comparison between race and genotype, only data from races composed of at least 10 accessions were analyzed (8 including PA).
The dendrogram in Figure 2A, where accessions are colored depending on their race classification, shows some association between race and genotype as indicated by * for specific clades. Additional analyses were carried out in order to determine whether specific marker/allele combinations were associated with different races; Table 3 shows that 88 marker/allele combinations were significant for the t test and 64 for the CT analysis and 47 were identified by both analyses. Although no single marker/allele combination was found to be significant for all accessions tested, at least one allele of marker PHI96100 was significant for each accession (Data S1). However, this marker alone was not sufficient to distinguish the different races when used individually to produce a dendrogram. The most significant marker/allele combinations for each accession are shown in Table 4.
When the association between kernel color and genotype was investigated, no clear association could be observed (Figure 2B) and only 11 significant marker/allele combinations were identified for this trait (Data S1), implying that particular kernel colors are not strongly indicative of a specific race, but have probably been incorporated into different races based on cultural preferences.
FIGURE 2 Distribution of race, meters above sea level, and kernel color in relation to genotype and relation to geographical location. (A) Relation between race and genotype; the race determined for each accession is represented by different colors overlaid on the dendrogram presented in Figure 1. The key indicates the color assigned to each race. Accessions classified as containing two different races (Type 1 and Type 2 in Table S1)

Figure 2D. The teosinte samples also support the relationship between genotype and MASL because the Parviglumis samples were all collected at lower altitudes, whereas the Mexicana samples were collected at medium to high altitudes. The Palomero landrace has also been reported to grow at higher altitudes, and this is reflected in the Palomero clade in the dendrogram. Local farmers tend to grow landraces which have been selected locally; however, exchange and transport of seed are common, and it is likely that some genotypes related to high altitudes will have been grown at lower altitudes and vice versa, which may explain the mixture of MASL for closely related genotypes. More detailed statistical analysis (Data S1) confirmed a strong correlation between marker/allele combinations and MASL, where 39% of all marker/allele combinations were significantly associated with MASL, and it was determined that around 73% of the variance related to MASL was determined by the genotype. All 14 SSR markers were shown to be associated with MASL, but a model was developed in order to determine which marker/allele combinations were most relevant (Data S1), and these results are summarized in Table 5.

| Identification of rare maize genotypes based on a novel algorithm
Rare or unusual genotypes could be produced by the presence of very rare alleles, by novel combinations of alleles, or by both of these factors together. As the primary aim of the present work was to identify rare maize germplasm and provide a basis for criteria to determine priorities for conservation in situ of maize landraces, a "coefficient of rareness" (RA) and new algorithm (AMA) were developed in order to select a small set of in situ accessions that will include all marker/allele combinations present in the complete collection. A secondary function of the algorithm is to prioritize rare combinations over the most common combinations (the algorithm is described in detail in Data S1).

TABLE 3 Number of accessions and number of significant (FDR ≤ 0.1%) marker/allele combinations for each one of the eight races represented by at least 10 accessions. Results are presented for t test and contingency table (CT) analyses. Column "Both" shows the number of marker/allele combinations significant in both tests (t and CT).

Average values of z are presented for the race (column "In race") and for all other accessions in the set PL∩PA (column "In others").

| DISCUSSION
One of the challenges related to the genotyping of maize landraces in Mexico is how to balance the experimental costs with the ability to analyze the maximum number of accessions and/or individual plants.
The most effective strategy to meet this challenge is to analyze bulked samples. Similar studies employing bulks are usually based on DNA prepared from pooled leaf samples (Deputy et al., 2002; Wang, Li, & Li, 2011); however, individual extraction, although more time-consuming and expensive, was shown to produce consistent results in terms of allele detection when individual and bulked samples were compared (Reyes-Valdés et al., 2013) and was therefore the method of choice for this study. The bulking scheme has the advantage of allowing the sampling of a larger number of individuals at lower cost than individual scoring, but implies that we cannot obtain a direct estimate of the frequency of each marker/allele combination in the population sampled.
The detection of a marker/allele in a bulk of ten plants implies only that at least one of the 20 haplotypes presented that combination.
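The consequence of scoring bulks rather than individuals can be illustrated with a simple calculation. Assuming that the 20 haplotypes in a bulk are independent random draws from the population (an idealization introduced here, not a claim of the study), an allele with population frequency p is detected with probability 1 - (1 - p)^20. The function name and default values are hypothetical.

```python
def detection_probability(freq: float, plants: int = 10, ploidy: int = 2) -> float:
    """Probability that at least one haplotype in a bulk carries the allele.

    Assumption: haplotypes are independent random draws from a population
    in which the allele has frequency `freq`.
    """
    copies = plants * ploidy  # 20 haplotypes for ten diploid plants
    return 1.0 - (1.0 - freq) ** copies


# Detection probability for a few illustrative allele frequencies.
for p in (0.01, 0.05, 0.20):
    print(f"p = {p:.2f}: P(detected in bulk) = {detection_probability(p):.3f}")
```

Because many different frequencies lead to the same presence/absence outcome, a single bulk indicates that the allele is present but not how common it is.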
Although SSR analysis may be almost completely automated, allele designation should be reviewed manually. In particular, null alleles are problematic to detect and designate because the technical failure of PCR reactions or independent mutations that alter the primer site could both lead to the lack of marker/alleles (Matsuoka et al., 2002).
In this case, putatively failed PCR reactions were repeated and alleles were designated as null if the PCR reaction was repeated at least twice and consistently gave a negative result. Null alleles were identified in a proportion of around 0.49% (154 cases of 31,329 reads), and assuming that a small proportion of these nulls may be false negatives, they should not have a significant impact on the overall results and conclusions drawn from the data.
All accessions could be discriminated based on the allele data obtained, and in general, the groups in the dendrogram in Figure 1 correspond to overall differences in genotype, as TE and PT form separate classes in comparison with the PL samples, and race-specific clades were formed which corresponded to the different TE races. Samples TE04 and TE23, classified as Race Balsas, are outliers within the TE group, and this may be due to the effects of maize-teosinte hybridization as has been described previously (Ellstrand, Garner, Hedge, Guadagnuolo, & Blancas, 2007; Fukunaga et al., 2005; Wilkes, 1967, 1977) (Doebley, Gaut, & Smith, 2006; Hanson et al., 1996).
In contrast, adaptation to high altitudes is more strongly associated with specific landraces and genotypes. Interestingly, the TE accessions from subspecies Mexicana also show the correlation with high altitudes, and this agrees well with the theory that although maize was originally domesticated in a single event from TE Parviglumis

it is completely deterministic and gives exactly the same results every time it is run on a given dataset. In contrast, Core Hunter II has a stochastic component, and thus it could give different results each time it is run on the same dataset. Also, because AMA does not need extra parameters to be run, and because it explicitly takes the rareness coefficient as its objective function, it generally gives an output set with higher rareness than Core Hunter II. In addition, AMA is at least two orders of magnitude faster than Core Hunter II, a fact that is important for large germplasm collections. Details of the comparison are presented in section S1-2.5.2 of Data S1.
In general, the devised strategy proved to be efficient and highly satisfactory for the low-cost, simple, high-volume analysis of Mexican landrace genotypes and is currently being employed to complete the