Linkage disequilibrium and population structure in wild and domesticated populations of Phaseolus vulgaris L.

Together with the knowledge of the population structure, a critical aspect for the planning of association and/or population genomics studies is the level of linkage disequilibrium (LD) that characterizes the species and the population used for such an analysis. We have analyzed the population structure and LD in wild and domesticated populations of Phaseolus vulgaris L. using amplified fragment length polymorphism markers, most of which were genetically mapped in two recombinant inbred populations. Our results reflect the previous knowledge of the occurrence of two major wild gene pools of P. vulgaris, from which two independent domestication events originated, one in the Andes and one in Mesoamerica. The high level of LD in the whole sample was mostly due to the gene pool structure, with a much higher LD in domesticated compared to wild populations. In relation to association studies, our results also suggest that whole-genome-scan approaches are feasible in the common bean. Interestingly, an excess of inter-chromosomal LD was found in the domesticated populations, which suggests an important role for epistatic selection during domestication. Moreover, our results indicate the occurrence of a strong bottleneck in the Andean wild population before domestication, suggesting a Mesoamerican origin of P. vulgaris. Finally, our data support the occurrence of a single domestication event in Mesoamerica, and the same scenario in the Andes.


Introduction
Linkage disequilibrium (LD) is the nonrandom association of alleles at two (or more) loci. The extent of LD in natural and domesticated populations is mainly related to effective recombination rate, mating system and population size. LD is promoted by drift, hitchhiking and epistatic selection, and it has become the subject of intensive studies in recent years, with the objective of mapping and identifying genes of interest in human and animals (Lai et al. 1994;Slatkin 1999). More recently, it has also been studied in plant species, such as Zea mays (Remington et al. 2001; Thornsberry et al. 2001;Vigouroux et al. 2002Vigouroux et al. , 2005, Glycine max (Zhu et al. 2003), Hordeum vulgare (Lin et al. 2002;Kraakman et al. 2004;Morrell et al. 2005) and the model plant species Arabidopsis thaliana (Nordborg et al. 2002;Tian et al. 2002;Caicedo et al. 2004).
Two main approaches have been proposed to exploit LD patterns for the identification of genes and genomic regions with adaptive roles. The most common approach involves association studies, where the aim is to associate a given phenotype to a molecular variant (Lynch and Walsh 1997;Cardon and Bell 2001;Thornsberry et al. 2001;Mazzucato et al. 2008). This is basically an extension of quantitative trait loci (QTL) mapping, where the focus turns from families to populations. An alternative approach is exploitation of the signature of selection on the structure of the molecular diversity (Kohn et al. 2000;Vigouroux et al. 2002). This approach is useful for validation of previously identified genes with putative adaptive roles, or for identification of genes or genomic regions involved in genetic control of important adaptive pathways, even without any prior information (Kohn et al. 2000;Akey et al. 2002;Fullerton et al. 2002;Vigouroux et al. 2002;Wright et al. 2005;Papa et al. 2007;Bitocchi et al. 2009).
All of these methods (QTLs, association genetics, natural selection mapping) depend on the LD between phenotypic causative and linked molecular variants. While traditional mapping procedures are based on the observable differential decay of LD between loci in experimental families over one or a few generations (e.g. F2, RIL), both association genetics and population genomics rely on historical differential decay of LD between pairs of loci in natural and domesticated populations. The larger the level of LD, the lower the resolution for detecting the genome location of the causative molecular variant.
For this reason, a critical aspect for the planning of association and/or population genomics studies is the level of LD that characterizes the species and the population used for such an analysis. When the LD is low, a candidate gene approach is usually preferred, because in this case, too many markers will be needed to perform a whole genome scan to cover the variation in the entire genomes. On the other hand, when LD is moderate/high, a whole genome scan can be more appropriate. An ideal situation would be to use different populations distinguished by variable LD levels.
Linkage disequilibrium studies show great promise for the identification of the genetic basis of important traits if the effects of population structure are effectively controlled (Remington et al. 2001;Zhao et al. 2007). To determine how the markers should be distributed through the genome to find causal variants, it is very important to know the extent and structure of genetic diversity and the level of LD in different populations. Without this knowledge, the LD mapping approach will not be successful, because a strong LD detected between a marker and a phenotypic trait might be due to the recent occurrence of disequilibrium (e.g. admixture, population structure), rather than to the close physical location of the two loci.
Phaseolus vulgaris is a diploid (2n = 2x = 22), annual species, and is predominantly self-pollinating. It is the most important grain legume for direct human consumption. In P. vulgaris, many aspects of its molecular and phenotypic diversity, migration dynamics and population structure are well known (Gepts et al. 1986;Gepts and Bliss 1988;Gepts 1996;Alvarez et al. 1998;Papa and Gepts 2003;Chacó n et al. 2005;Papa et al. 2006;Angioi et al. 2009;). To date, in contrast, little information is available on the extent of LD ). Thus, little is known about the suitability of the LD mapping approach in this context. However, the first indirect analysis (Papa et al. 2005(Papa et al. , 2007 indicated that the extent of LD in P. vulgaris is large and extends over a few centimorgans (cM), as is also the case for other autogamous species, such as soybean (Hyten et al. 2007), A. thaliana (Nordborg et al. 2002), barley (Kraakman et al. 2004;Morrell et al. 2005;Caldwell et al. 2006) and rice (Garris et al. 2003;Mather et al. 2007;Zhu et al. 2007). This observation suggests that a genome-scan approach would be promising in P. vulgaris. This species is characterized by two geographic gene pools distributed in Mesoamerica and in the southern Andes, with parallel geographic patterns in wild and domesticated beans, indicating the occurrence of independent domestication events in these two regions. Sequence analyses of the gene coding for phaseolin, the main seed storage protein, have suggested that the ancestral populations of P. vulgaris were likely to have originated from Ecuador and northern Peru (Kami et al. 1995;Gepts et al. 1999). In this study, we have used a large set of amplified fragment length polymorphism (AFLP) markers, and we have analyzed the extent and structure of the genetic diversity of different populations of P. vulgaris from both gene pools. Moreover, using these mapped AFLPs, we have explored the levels of LD in both gene pools, for wild and domesticated populations that are likely to show different levels of LD, as also found in other species (e.g. barley, soybean). These analyses are relevant to the domestication history of the common bean and also raise important questions related to the origins of the wild populations of P. vulgaris.

Plant materials and DNA extraction
In this study, 183 accessions of common bean were used, each represented by an individual genotype. Ninety-three accessions represented the geographical distribution of wild P. vulgaris from northern Mexico to northwestern Argentina, including seven wild accessions from northern Peru and Ecuador that are characterized by the ancestral phaseolin type I (Debouck et al. 1993;Kami et al. 1995). Ninety accessions, mostly landraces, represented the Andean and Mesoamerican gene pools and their major morphological races cultivated in central and southern America. Based on both passport data, which include race type, phaseolin type, seed colour and pattern, and seed weight, and considering previous studies on the common bean (Duarte et al. 1999;Beebe et al. 2000Beebe et al. , 2001Singh et al. 2001;McClean et al. 2004;Rosales-Serna et al. 2005;Blair et al. 2006a,b;Díaz and Blair 2006;Pedrosa et al. 2006;McClean and Lee 2007), we classified 42 domesticated accessions as Andean and 48 as Mesoamerican. A complete list of the accessions used in this study is available in Table S1. The seeds were provided by the USDA Western Regional Plant Introduction Station (Pullman, WA, USA) and the International Center of Tropical Agriculture (CIAT) in Colombia. For each accession, genomic DNA was extracted from young leaves of a single, greenhouse-grown plant using the miniprep extraction method (Doyle and Doyle 1987).

AFLP analysis
We used a total of 19 AFLP primer combinations ( Table S2) that were chosen based on a prior screening with 141 AFLP primer combinations (Papa et al. 2007). Of these 19 combinations, 14 were used for the population structure and LD analyses, whereas the remaining five were added only for construction of an AFLP-based consensus map. The AFLP analyses were carried out according to the technique developed by Vos et al. (1995), with minor modifications. Briefly, total genomic DNA (300 ng) from each accession was digested with the EcoRI (5¢-G^AATTC-3¢) and MseI (5¢-T^TAA-3¢) restriction enzymes and ligated with the respective adaptors by incubation at 37°C for 4 h, followed by 20 min at 65°C, and a final temperature of 10°C. The pre-amplification reactions were performed using DNA from the restriction-ligation, with an eight-fold dilution prior to amplification with the primers EcoRI (5¢-GACTGCGTACCAATTC-3¢) and MseI (5¢-GATGAGTCCTGAGTAA-3¢) with a single selective nucleotide (i.e. EcoRI and MseI +N). The preselective amplification products were diluted 11-fold, and used as templates for the second amplification. Cy5-labelled EcoRI primers were used for the final selective polymerase chain reaction, and the AFLP fragments were separated on 6% polyacrylamide gels by electrophoresis for 2 h at 50 W constant power, using a Genomix system (Beckman, Fullerton, CA). The AFLP fingerprints were manually scored and recorded for the absence or presence of fragments; bands of equal fragment size were assumed to be homologous. To minimize the effects of size homoplasy, only fragments of the medium/large size classes were scored (Vekemans et al. 2002). Only clearly amplified bands were scored, with bands presenting variable intensities across individuals not considered as informative and not retained for data analysis. Several precautions were taken to ensure reliability of the scoring. Thus, the genotypes were randomly distributed across different gels (four gels were necessary for each primer combination), and replicated samples were used to investigate polymorphism within gels. About 10% of the samples were replicated twice, to test polymorphism between gels. Control genotypes were also used to align the different gel runs for each primer combination, as well as monomorphic bands across the samples. The gels were scored twice, and all of the validated polymorphic fragments were checked by running a 'summary gel' that corresponded to a subset of genotypes (about 30%) that were representative of the polymorphism observed.

AFLP mapping (AFLP consensus map)
To map the AFLP markers identified, two recombinant inbred populations were used: the BAT93 · Jalo EEP558 population (BJ; 75 lines; Freyre et al. 1998) and the Midas · G12873 population (MG; 58 lines; Koinange et al. 1996). The AFLP markers were placed using previously established linkage maps that are available for these two populations: the BJ linkage map (Freyre et al. 1998;Kwak et al. 2008), and the MG linkage map (Koinange et al. 1996), both of which are mostly constituted of restriction fragment length polymorphism (RFLP) markers. The markers were placed according to standard mapping procedures (Freyre et al. 1998). The mapmaker/exp computer software, version 3.0b (Lander et al. 1987), was used to determine the linkage relationships and the linear order of segregating markers. Linkage map distances are in Kosambi units (Kosambi 1944) and LOD scores >3.0 (Freyre et al. 1998) were chosen to assign the markers to linkage groups and to find an internal order. Markers already mapped on the two RIL populations (Koinange et al. 1996;Freyre et al. 1998;Kwak et al. 2008) were used as a framework to locate the AFLP markers on their respective linkage maps. We used 56 framework markers for the BJ population, and 67 for the MG population, distributed among 11 linkage groups. The most likely order of markers in each group was determined using the 'order', 'compare', 'build' and 'ripple' commands. A chisquared test was performed to determine the deviation from the 1:1 ratio expected under the null-hypothesis of no segregation distortion; markers identified as 'distorted' (P £ 0.05) were not used to find the internal order. The presence of putatively co-dominant loci was also tested among co-segregated bands; loci were considered as putatively co-dominant if the segregating products came from the same AFLP primer combination and if polymorphism between individuals was alternatively present. A consensus AFLP map was then constructed by combining the two maps, using 75 previously identified common markers between the two mapping populations (55 AFLP markers and 20 framework markers). After evaluation of the marker orders and their respective positions on the chromosomes, the average distances between common markers were considered, to place the other markers. The quality of the consensus map was evaluated by comparing the locus arrangement of the consensus map with the arrangement of loci in the individual maps, and re-evaluating the distances between the markers, with a two-point independent analysis, in the consensus map using the 'two-point' command. The linkage group nomenclature followed Pedrosa et al. (2008).

Data analysis
The population structure was investigated using the model-based (Bayesian) clustering algorithm implemented in the structure software (Pritchard et al. 2000;Falush et al. 2003), which identifies groups of individuals according to their allele distributions. In this study, the data were first analyzed using all of the markers and the admixture model. Then, mapped markers were used to run the linkage model and the admixture model, to compare the results obtained using different models and markers and to test the robustness of the results. All of the runs were carried out with 1 000 000 iterations and 500 000 burn-ins, assuming correlated allele frequencies (Falush et al. 2003). No a priori population information was used. Twenty runs were carried out for each dataset, and a range of K from 1 to 10 was explored. The value of K was selected using DK (Evanno et al. 2005). This measure can provide a better estimation of the 'true' value of K compared to the maximal value of L(K) returned by structure. Individuals with membership coefficients of q i ‡ 0.7 were assigned to a specific group, whereas individuals with q i < 0.7 were identified as admixed (if not specified otherwise). Possible substructures that might have been hidden by the main population structure were explored (Evanno et al. 2005); indeed, each of the groups identified was subsequently analyzed to determine the number of putative subgroups.
Moreover, the percentage of polymorphic loci, the percentage of rare alleles (markers with a frequency <0.1 and >0.9) and the unbiased genetic diversity (H E ; Nei 1978) were estimated. An ad hoc statistic (DH) was used to measure the loss of diversity in domesticated populations, as proposed by Vigouroux et al. (2002): where H D and H W are the genetic diversity in domesticated and wild common bean, respectively. We used this statistic to compare the diversity between different gene pools using the larger diversity as described for H W . Overall, these statistics were calculated both considering the groups classified by passport data and those identified by the structure analysis.
The level of LD and its significance for each pair of AFLP loci were evaluated using the tassel software package (Bradbury et al. 2007); a threshold of P < 0.01 was chosen. LD was estimated for the entire population and for specific subgroups. All the analyses were conducted both with and without consideration of the admixed genotype identified by structure at K = 2. The LD was estimated by computing the squared correlation coefficient (r 2 ) between pairs of markers (Hill and Robertson 1968), using all of the loci, only the loci with a frequency >0.1 and <0.9, and only the loci with a frequency >0.2 and <0.8. To test for the independence of the level of LD observed from the population relatedness, the sample numbers were reduced by discarding very similar entries ( ‡15% identical alleles) and creating a 'normalized' sample that was used to estimate pairwise LD (Breseghello et al. 2005).
The plots of LD (r 2 ) for pairs of loci versus genetic distance (cM) were drawn from the r 2 -values calculated by tassel. A Wilcoxson nonparametric test was used to evaluate the significance of the different models. Five classes of linkage distances were chosen to represent intrachromosomal LD (D1: £3; D2: >3 and £10; D3: >10 and £30; D4: >30 and £50; D5: >50), while inter-chromosomal LD among markers located in different chromosomes was also calculated. Separation of the means between the classes was performed using a chi-squared test (P < 0.01) after a stepwise Bonferroni correction. In addition to the pairwise LD, a multilocus LD statistic was also evaluated (multilocus version 1.2, Agapow and Burt 2000); this multilocus LD index (rd) (Burt et al. 1999;Agapow and Burt 2000) is based on the distribution of allelic mismatches between pairs of individuals over all of the loci (Brown et al. 1980). Specifically, LD evaluation is based on the variance of the number of pairwise differences among samples that were subjected to genetic analysis at multiple loci; rd is an extended statistic from the traditional multilocus LD index of association, IA, which was corrected for the number of loci used in the analysis, thus making the comparison between different groups possible.

AFLP genetic diversity
Overall, a total of 418 AFLP markers were identified in the whole sample. Table 1 summarizes the levels of polymorphism and genetic diversity for all of the markers for the different groups defined using the passport data information. A higher percentage of polymorphic markers (62.4%) was found in the Mesoamerican gene pool, compared to the Andean one (54.8%); in both gene pools, the percentages of polymorphic loci of the wild forms were always higher than those of the domesticated forms. A greater number of rare alleles were found for the Andean gene pool (64.2%) compared to the Mesoamerican one (43.7%). The Mesoamerican gene pool showed a 1.6-fold higher genetic diversity (H E ) than the Andean one (v 2 = 24.7, P = 6.8 · 10 )7 , Wilcoxon, nonparametric test). These differences in genetic diversity were still significant when the analysis was carried out separately for each form, as wild (Mesoamerican versus Andean, v 2 = 35.8, P = 2.1 · 10 )9 , Wilcoxon nonparametric test) and domesticated (Mesoamerican versus Andean, v 2 = 5.1, P = 0.023, Wilcoxon nonparametric test). In the Mesoamerican gene pool, the wild forms were more diverse than the domesticated forms (v 2 = 19.4, P = 1 · 10 )5 , Wilcoxon nonparametric test), as also shown by analysis of the reduction in genetic diversity (DH = 0.32). In the Andean gene pool, the genetic diversity between the forms did not differ (DH = 0).

Linkage mapping of AFLP markers
The AFLP markers identified by the 19 primer combinations were placed in two different genetic maps based on two recombinant inbred populations: BAT93 · Jalo EEP558 (BJ; Freyre et al. 1998) and Midas · G12873 (Koinange et al. 1996). Considering both populations, 229 polymorphic markers were detected. Overall, we were able to place 166 markers in the 11 linkage groups (LG), 55 of which were shared between the two maps; the remaining 63 markers were unlinked at a LOD value >3.0. Eleven pairs of polymorphic bands were found to co-segregate as alleles of co-dominant loci. Thus, of 229 polymorphic markers, 207 (90.4%) were dominant loci, whereas 22 (9.6%) represented 11 putative co-dominant loci.

The BJ map
In the BJ population, 116 of 139 polymorphic markers were placed on the established core linkage map (Freyre et al. 1998), whereas 23 were unlinked (at an LOD score >3.0). Markers were placed in all of the linkage groups, for a total length of 1372 cM, ranging from three markers in LG 03, to 19 markers in LG 02, with an average of 10.6 markers per linkage group. Of 139 polymorphic markers, 68 (48.9%) originated in the parental BAT93 line, whereas 71 (51.1%) were present in the parental Jalo EEP558 line. Segregation analysis performed on the AFLP markers showed a significant deviation (P < 0.05) from the expected 1:1 ratio under the null hypothesis of no-segregation-distortion for only eight (6.9%) markers out of 139.

The MG map
In the MG population, 105 (67.3%) of 156 polymorphic markers were placed on the established map (Koinange et al. 1996), whereas 51 were unlinked (at a LOD score >3.0). Markers were placed in all linkage groups, for a total length of 1258 cM, ranging from two markers in LG 03 and LG 06, to 20 markers in LG 01 and 02, with an average of 10 markers per linkage group. Ninety (57.7%) of 156 polymorphic markers originated in the Midas parental line, whereas 66 (43.3%) were inherited from the G12873 parental line. Segregation analysis performed on the AFLP markers showed a significant deviation (P < 0.05) from the expected 1:1 ratio under the null hypothesis of no-segregation-distortion for 22 (14.1%) markers out of 156.

Consensus map
Before building the consensus map, the two maps were aligned to verify the positions of the markers along the  (418).
Linkage disequilibrium and population structure in Phaseolus vulgaris L. Rossi et al. linkage groups. After evaluation of the consistency of the loci order between the two different maps, 75 common loci between the two mapping populations (55 AFLP and 20 framework) were used to determine the relative orientations of the linkage groups in their respective maps and to combine the maps into a consensus map (Fig. 1). The average genetic distances between common adjacent markers were computed, to determine the positions of the other markers. The total length obtained for the consensus map is 1029 cM, with an average distance between markers of 6.2 cM. The AFLP markers were distributed in all the linkage groups, from four markers in LG 03, to 32 markers in LG 02. The quality of the consensus map was then evaluated using a two-point Figure 1 AFLP consensus map. Linkage groups (LG 01-LG 11) with AFLP markers (in bold) shown to the right, along with the previously mapped framework markers (Vallejos et al. 1992;Gepts et al. 1993;Nodari et al. 1993;Koinange et al. 1996;Freyre et al. 1998;Kwak et al. 2008). Domestication and other traits are shown to the left (grey rectangles), mapping them near to the AFLP markers. Genetic distances are in Kosambi map units. QTLs and major genes: CBB, common bacterial blight resistance; DF, days to flowering; DM, days to maturity; DO, dormancy; fin, determinacy; C, Gy, V seed coat pattern and colour; PD, photoperiod sensitivity; HI, harvest index; L5, length of fifth internode; NM, number of nodes on main stem; NN, rhizobium nodulation; NP, number of pods; PL, pod length; Ppd, photoperiod sensitivity; SW DOM seed weight, identified in cross with wild bean; SW ND seed weight, identified in cross between cultivars Koinange et al. 1996;Gepts et al. 1999;Geffroy et al. 2000). independent analysis to re-examine the distances between the markers in the consensus map; the two distances were in agreement.

Population structure
The uppermost hierarchical level of population structure identified using the method based on the estimation of DK (Evanno et al. 2005) suggested that our sample was made up of two main genetic groups (K = 2). However, we also identified a secondary peak at K = 4 (Fig. 2), and thus we analyzed our sample considering all of the K-values from 2 to 4. Using K = 2, structure identified two main clusters corresponding to the Mesoamerican and the Andean gene pools: the Mesoamerican cluster constituted of 48 domesticated Mesoamerican genotypes, plus 41 wild genotypes from Mexico, Guatemala, Costa Rica and Colombia, and the Andean cluster constituted of 41 domesticated Andean genotypes, plus 37 wild genotypes from Peru, Bolivia and Argentina (Fig. 3A). Sixteen genotypes were admixed. In particular, all of the seven wild genotypes with the ancestral phaseolin (type I; Kami et al. 1995) were admixed even if at a lower threshold (q i ‡ 0.6) were assigned to the first cluster (Mesoamerican). A stepwise increase to K = 3 showed the occurrence of a new group consisting of wild genotypes from Mexico, central America and Colombia, and five genotypes with phaseolin type I (Fig. 3B). At K = 4, there were two Mesoamerican groups (domesticated and wild), a third group represented by the 'I' gene pool (with the ancestral phase-  olin I), and a fourth group including Andean materials (Fig. 3C).
We then performed structure analysis on the Mesoamerican and the Andean groups (K = 2) separately. The analysis performed on the Mesoamerican group provided the same results as those shown in Fig. 3C; indeed, strong separation was again seen, which confirmed the subdivision between wild and domesticated Mesoamerican genotypes. The Andean group, which showed a homogeneous structure in the previous analysis, was further subdivided into two main clusters (K = 2) corresponding to the wild and domesticated forms (Fig. 4). Fifteen domesticated genotypes were found to be admixed; this number was further reduced to 8 using a lower threshold (q i > 0.6).
The last step was to investigate the population structure within each form (wild or domesticated) for each gene pool. Thus, we did not consider the admixed genotypes between gene pools and the populations with the ancestral phaseolin I. For these reasons, the analyses with structure were conducted on four subgroups (Fig. 5): MW (Mesoamerican wild), MD (Mesoamerican domesticated), AW (Andean wild) and AD (Andean domesticated).
Considering the MW subgroup, all of the genotypes were assigned to three subclusters (K = 3), with the exception of three genotypes, which were identified as admixed. The MW_1 subcluster contained only Mexican genotypes, plus one from Costa Rica, whereas the MW_2 subcluster contained all of the genotypes from Colombia (4) and Guatemala (3), plus one from Costa Rica. Interestingly, the MW_3 subcluster contained only genotypes that were classified as weedy, from the passport information.
In the MD subgroup, all of the genotypes were assigned to three subclusters (K = 3), with the exception of nine identified as admixed. In addition, the subclusters identified appeared to reflect the known subdivisions in races of the common bean. Indeed, the MD_1 subcluster includes genotypes mainly classified as 'race Mesoamerica', with an average seed weight of 0.19 g (SD = 0.03). The MD_2 subcluster includes genotypes mainly classified as 'race Durango-Jalisco', with an average seed weight of 0.38 g (SD = 0.15). The MD_3 subcluster is a smaller group, with an average seed weight of 0.22 g (SD = 0.07) and no specific relationship to race classifications. The pairwise differences in seed weight between the three subclusters were significant only for the MD_1 versus MD_2 (t-test, P < 0.0001), and MD_2 versus MD_3 (t-test, P < 0.05) comparisons.
Considering the AW subgroup, all of the genotypes were assigned to three subclusters (K = 3), with the exception of six that were identified as admixed. However, in the AW genotypes, the structure observed was not clearly related to any phenotypic traits, or geographical or ecological variables.
In the AD subgroup, even though we obtained evidence of three subclusters (K = 3), the genotypes were assigned to only two subclusters, with almost one third (12) of the genotypes showing admixed ancestry. Similar to AW, the observed structure for AD was not associated with any traits (e.g. seed colour, seed weight, race type, or phaseolin).

AFLP genetic diversity for groups identified by structure
The diversity was then evaluated for the main groups identified by structure at K = 2, excluding the admixed genotypes (Fig. 3A), and subdivided between the wild and domesticated forms. The results are summarized in Table 2, and they are very similar to those obtained using the classification of the genotypes based on the passport data information.

Linkage disequilibrium
Pairwise and multilocus LD were evaluated for all of the polymorphic loci, for the loci with a frequency > 0.1 and <0.9, and for the loci with a frequency > 0.2 and <0.8. The analyses were also conducted both with and without the previously identified admixed genotypes (K = 2). Moreover, to test the independence of the level of LD due to population relatedness, we also evaluated the levels of pairwise LD in a 'normalized' sample (Breseghello et al. 2005). Due to the very similar results obtained for all of these analyses, we present here only the results obtained using loci with a frequency >0.1 and <0.9, and without considering admixed genotypes between gene pools (K = 2).
In the whole sample, 56% of the pairwise LD comparisons were significant at the P < 0.01 level, with an average r 2 of 0.1. The multilocus LD computed (rd) was also significant (P < 0.01), and was equal to 0.05 (Table 3). Considering the subdivisions in the gene pools, we found a greater level of loci in significant LD for the Andean gene pool (pairwise LD: 40.8%; average r 2 = 0.11; multilocus LD: rd = 0.09) compared to the Number of genotypes, without considering admixed genotypes between gene pool. àPercentage of polymorphic loci calculated over total loci (418).
Linkage disequilibrium and population structure in Phaseolus vulgaris L. Rossi et al. Mesoamerican one (pairwise LD: 16.7%; average r 2 = 0.05; multilocus LD: rd = 0.03). Considering the subdivisions, a higher level of loci in significant LD was observed for the domesticated form (pairwise LD: 57.3%; average r 2 = 0.18; multilocus LD: rd = 0.11), compared to the wild form (pairwise LD: 31.5%; average r 2 = 0.08; multilocus LD: rd = 0.04). The LD was then evaluated for the subgroups identified by the structure analysis carried out independently within gene pools (Table 3). The highest level of loci in significant LD was seen for AD, followed by MD and AW, whereas the lowest level of LD was shown by MW. The analysis of LD (average r 2 ) as a function of genetic distance in cM (Fig. 6) showed significant differences among the recombination distance class (cM) only for domesticated samples. For both wild populations, these differences were not significant. In the MD subgroup, the first significant difference in the levels of LD among the distance classes was seen when the first two classes where compared (within 10 cM; v 2 = 8.7; P = 0.002; Wilcoxon nonparametric test), whereas in the AD subgroup, significant differences were seen when the fourth class was compared with the first three classes (within 50 cM; v 2 = 10.7; P = 0.001; Wilcoxon nonparametric test). Loci of distance-class D5 (>50 cM) in the domesticated samples showed a significantly lower level of LD (MD: v 2 = 9.2; P = 0.002; AD: v 2 = 15.5; P = 8 · 10 )5 ; Wilcoxon nonparametric test) than loci located in different chromosomes (inter-chromosomal LD), whereas in the wild samples, the difference in the level of LD between the two classes considered was not significant. The standardized index of association (rd) was then calculated within each of the subclusters described above, following the subdivision highlight by structure (Table 4). This summary statistic largely avoids dependency on the number of loci and therefore facilitates comparisons between groups with different numbers of loci and genotypes. In the three Mesoamerican domesticated subclusters (MD_1, MD_2 and MD_3), only MD_1 (race Meso-  america) and MD_2 (races Durango-Jalisco) showed much lower levels of LD compared to the entire sample. In contrast, the MD_3 subcluster showed a level of LD 3.8-fold higher than the entire sample. In the Andean domesticated population, much lower levels of LD were seen for subclusters AD_1 and AD_3, compared to the entire sample.
In the Mesoamerican wild population, subcluster MW_3 was not analyzed because of a lack of polymorphism. The MW_1 subcluster (Mexican genotypes) showed the same level of LD as that for the whole sample, whereas in the MW_2 subcluster (central American genotypes), the level of LD was four-fold higher than the LD in the entire MW group. In the Andean wild population, the levels of LD for the three subclusters (AW_1; AW_2; AW_3) were very similar to those found in the entire sample.

Discussion
We have described here the population structure and the levels of LD in wild and domesticated populations of the common bean using a large set of AFLP markers that were located on a genetic map using two different segregating populations (RILs), and then integrated into a consensus map. AFLPs have been extensively used to construct or enrich genetic maps in many plant species (Becker et al.1995;Schondelmaier et al. 1996;Boivin et al. 1999;Liu et al. 2003;Moen et al. 2004;Papa et al. 2005Papa et al. , 2007Zhong et al.2006;Troggio et al.2007), and they generally give good coverage of the whole genome (Nichols et al. 2003;Moen et al. 2004). In addition, AFLPs have proven to be very useful for diversity studies (Bonin et al. 2007;Berloo et al. 2008;Meudt and Clarke 2007) and are an excellent alternative to simple sequence repeat (SSR) markers to enhance the resolution in studies of population assignments (Campbell et al. 2003;Garoia et al. 2007;Sonstebo et al. 2007).
A potential disadvantage in the use of this technique is the size homoplasy due to the occurrence of co-migrating, nonhomologous fragments (Vekemans et al. 2002) where homoplasy produces underestimations of both withinand between-population diversity (Vekemans et al. 2002;Koopman and Gort 2004;Caballero et al. 2008). However, it should be noted that homoplasy is moderate when, as in this study, closely related genotypes (e.g. within species) are analyzed (Rouppe van der Voort et al. 1997;Vekemans et al. 2002;Koopman 2005), and when small genomes, like that of the common bean, are considered (Althoff et al. 2007). Moreover, the potential bias due to size homoplasy can be prevented by an accurate selection of polymorphic fragments (Vekemans et al. 2002;Koopman and Gort 2004;Caballero et al. 2008), as described in the Materials and methods section.

Diversity and population structure
The two main clusters identified with the analysis of population structure reflect our previous knowledge of the occurrence of two major wild gene pools of P. vulgaris Table 4. Multilocus linkage disequilibrium (rd) analysis within each subgroup (MW, AW, MD and AD) for each subcluster; the level of polymorphic AFLP markers and the unbiased genetic diversity are also given.  (Nei 1978). (Gepts et al. 1986;Koenig and Gepts 1989;Khairallah et al. 1990Khairallah et al. , 1992Koinange and Gepts 1992;Freyre et al. 1996), in which two independent domestication events took place, one in the Andes and one in Mesoamerica (Singh et al. 1991b,c;Becerra Velasquez and Gepts 1994). The Mesoamerican group included wild genotypes from Mexico, Central America and Colombia, while the genotypes with the ancestral phaseolin I, which originated from northern Peru and Ecuador were classified as admixed even if they are closer to the Mesoamerican cluster.
Even if in the Mesoamerican gene pool the separation between wild and domesticated populations was stronger compared to the Andean cluster, in both gene pools, the wild and domesticated populations can be split into two distinct groups. The domesticated forms further included three groups of accessions. While in Mesoamerica the genotype assignment mostly reflected phenotypic differences related to classifications in ecogeographic races, in the Andean group, it did not relate to any traits or known information included in the passport data.
A similar picture was obtained considering the wild form. Indeed, in the Mesoamerican gene pool, in addition to a weedy group, structure discriminated between Mexican and Central American-Colombian populations, whereas in the Andean gene pool, no possible explanations could be identified to explain their existence. The Mesoamerican and Andean gene pools showed marked differences for both levels of diversity and internal structures, with the Mesoamerican gene pool much more diverse and structured compared to the Andean one. Previous analysis has also suggested that the Mesoamerican and Andean gene pools have different structures and levels of genetic diversity, for both the wild and domesticated populations. Indeed, a higher genetic diversity was observed in the Mesoamerican gene pool, compared to the Andean one (Koenig and Gepts 1989;Beebe et al. 2000Beebe et al. , 2001Papa and Gepts 2003;McClean et al. 2004), even if recent analyses using SSRs have shown more similar levels of diversity between the Mesoamerican and Andean gene pools (Blair et al. 2006b;. In parallel, a higher between-populations component of genetic diversity was found in the Mesoamerican wild populations (using AFLP; Papa and Gepts 2003) than in the Andean wild populations [using random amplification of polymorphic DNA (RAPD); Cattan-Toupance et al. 1998]. A much stronger differentiation has also been seen between the domesticated races from Mesoamerica than between those from South America (Beebe et al. 2000(Beebe et al. , 2001. Our data obtained using the same markers on a balanced sample of wild and domesticated genotypes confirm these observations. Moreover, because these results were obtained for both the wild and the domesticated forms, the differences seen between gene pools should have originated during the formation and evolution of the two gene pools much earlier than the beginning of cultivation and domestication of the common bean in the Mesoamerican and Andean regions. It is very interesting to compare our AFLP results with those obtained using SSR loci by  in a comparable sample of P. vulgaris genotypes. Indeed, even if the major results related to the population structure are very similar between these two studies, a few intriguing differences are seen. First, in , the Mesoamerican and Andean wild genotypes did not show strong differences in genetic diversity (DH = 0.07), as we found in this study (DH = 0.45), or as was found for eight allozyme loci (DH = 0.34), calculated from published allelic frequencies (Koenig and Gepts 1989). Similarly, while in this study we observed a strong reduction in genetic diversity due to domestication (wild versus domesticated samples), only in Mesoamerica (DH = 0.32), in , was a moderate reduction (about 10%) found, although in both gene pools.
A major difference between the molecular markers mentioned is their different mutational model and mutation rates. The AFLP mutations that are primary due to point mutations within the restriction site, with a small fraction of in/dels that occurs between the two restriction enzyme sites and that should originate co-dominant AFLP loci. For instance, in our case, no more than 10% of the AFLP fragments could be considered as alleles of co-dominant loci, and a similar fraction has been found in other species (for review, see Rodriguez et al. 2006). In contrast, microsatellite polymorphisms are primarily determinated by variation in repeat number and by indels outside of repeat units. Indirect estimates of the AFLP mutational rate have suggested a value ranging between 10 )6 and 10 )5 (Mariette et al. 2001;Gaudeul et al. 2004;Kropf et al. 2009), whereas the SSRs have much higher mutation rates, varying between 10 )3 and 10 )4 , as found using indirect (Estoup and Angers 1998;Mariette et al. 2001;Garoia et al. 2007) and direct estimates (Vigouroux et al. 2002;Thuillet et al. 2005). Indeed, the number of generations needed for the recovery of genetic diversity (H) after a bottleneck due to mutations is expected to be close to the reciprocal of the mutation rate (Nei et al. 1975;Nei 2005). Thus, markers that differ substantially in their mutation rates can show very different patterns of molecular diversity, as we found in our case. On this basis, our results suggest that the Andean gene pool was subjected to a bottleneck before domestication, which is still evident based on data from the more slowly mutating AFLP markers, but has nearly disappeared based on the more rapidly mutating SSR markers.

Linkage disequilibrium
Considering the whole sample, we detected a very high level of LD, with most of the comparisons (56%) between loci significant at P < 0.01 (Table 3). However, the lower LDs seen for the separate subsamples indicate that the major cause of the level of LD in the common bean is the gene pool structure (Andean versus Mesoamerican). A higher level of LD was found in the Andean compared to the Mesoamerican gene pool. Of note, the same result was obtained for both the wild and domesticated Andean populations, suggesting that the higher LD in the Andean gene pool originated prior to domestication. This increased LD potentially represents a remnant of the increase in LD that may have appeared as a consequence of the migration, genetic drift, and selection that accompanied the establishment of the wild Andean gene pool. Differences between gene pools are also evident in the domesticated populations from the different patterns of LD decay along the chromosomes, with a slower decay seen for the Andean population compared to the Mesoamerican, whereas in the wild population no significant variation was found for any class of map distances.
Overall, the level of LD was much greater in domesticated compared to wild populations. A large effect on LD due to population structure and related to the process of domestication has also been reported in other autogamous species, such as barley, where the genome-wide LD extended from 10 to 15 cM when evaluated with AFLP markers (Kraakman et al. 2004), SSRs (Malysheva-Otto et al. 2006) and single nucleotide polymorphisms (Rostoks et al. 2006). The pattern of LD was extremely population dependent, and was highly related to the process of domestication (Morrell et al. 2005;Caldwell et al. 2006). In soybean, the level of LD of the wild ancestor, Glycine soja, was very limited, compared to the domesticated G. max groups, which showed much higher levels of LD (Hyten et al. 2007). The higher level of LD found in the domesticated compared to the wild populations can be explained by the sole effect of drift (bottleneck effect). However, the excess of inter-chromosomal LD that was found in (and only in) the domesticated population, suggested that the high level of LD in the domesticated population (compared to the wild one) might also be due to the action of epistatic selection (Lewontin 1964). Within the domesticated Mesoamerican and Andean common bean, it was also possible to identify genetic subgroups (associated with races in the Mesoamerican gene pool), where the level of LD was reduced to values very close to those seen for wild populations. These results might be related to selection for different phenotypic architectures that characterized the different races, which should have originated (and were main-tained) after the initial step of domestication, mainly because of selection. An alternative hypothesis would invoke the domestication process itself.
Our results have important implications in relation to the development of association mapping projects in the common bean. In particular, it is clear that a key factor is the careful assessment of population structure. Moreover, our results suggest that whole-genome-scan approaches are feasible, and that with the choice of appropriate populations, it will be possible to obtain different resolutions in association studies in the common bean. Additional LD studies focusing on a few genomic segments will be needed to obtain a more complete picture of the level and structure of LD in the common bean. In particular, these studies will be able to evaluate the possibility of conducting LD mapping based on a candidate gene approach using specific populations (e.g. the wild form).

Implications for the origins of the wild populations
Our results raise the question about the origin of wild P. vulgaris. Although the genus Phaseolus is almost certainly of Mexican origin (Delgado-Salinas et al. 2006), sequence studies of the genes coding for phaseolin, the main seed-storage protein, suggest that the ancestral populations of P. vulgaris are likely to have originated in Ecuador and northern Peru (Kami et al. 1995;Gepts et al. 1999). However, our results are not consistent with the hypothesis of a northern Peru-Ecuador origin: (i) both in this study and that of , the northern Peru-Ecuador genotypes with phaseolin type I are closer to the Mesoamerican gene pool rather than to the Andean one; (ii) among the estimates of F K (Falush et al. 2003), a parameter that is similar to F ST , but specific for each population and expected to be proportional to the divergence from a common ancestral population, the Mesoamerican and Mexican wild forms show the lowest estimates both in this study and in that of ; and (iii) all of the results relating to the molecular diversity and the levels of LD appear to argue in favour of a Mesoamerican origin. Indeed, the Andean populations have patterns of genetic diversity and LD that appear to reflect the occurrence of a strong bottleneck prior to domestication: the Andean wild populations showed almost half of the molecular diversity present in Mesoamerica, a much larger frequency of rare alleles, and a higher level of LD. The occurrence of a bottleneck prior to domestication in the Andes is also supported by comparing AFLP and SSR ) results.
These results tend to question the Peru-Ecuador origins and argue in favour of a Mesoamerican origin of wild P. vulgaris. Although the results presented by Kami et al. (1995) appear to be very robust in the identification of phaseolin type I as the ancestral phaseolin, the present distribution of phaseolin might not reflect its ancient distribution. For example, phaseolin type I might be extinct in Mesoamerica or might still be present but was not included in the samples studied. Indeed, for several reasons (reviewed in Doyle 1992), an agreement between gene phylogeny and species phylogeny is not assured. Thus, our results can be explained together with those of Kami et al. (1995), by considering the phaseolin type I P. vulgaris from northern Peru and Equador as a relict population that migrated from Mesoamerica in ancient times, and that phaseolin I became extinct (or nearly so) in Mesoamerica. Alternatively, our results might be due to the occurrence of a bottleneck in the Andes after the separation between gene pools for unknown reasons. Our results may also be explained by invoking environmental and geographical differences between the wild beans of the two gene pools. The habitat of the Mesoamerican gene pool is much larger, as it not only extends from Colombia to northern Mexico, but also includes a widespread distribution in Mexico (Freytag and Debouck 2002). This large habitat contrasts with that of the wild bean in the Andean region, where it is distributed in a narrow altitudinal fringe on the western and eastern slopes of the Cordillera (Debouck et al. 1993;Freyre et al. 1996). Therefore, the higher genetic diversity of the Mesoamerican gene pool may be associated with a higher effective population size and higher environmental variation in contrast with the Andean gene pool. Moreover, our results can be explained as due to an expansion of diversity in Mesoamerica (e.g. due to population growth). However, in all of these cases, we would expect different trends between the SSRs and AFLPs compared to those seen. Thus, the Mesoamerican origin of the species appear to be the most likely and robust explanation for our results.

Implications for the history of domestication
Our results are also relevant to the domestication history of the common bean. The occurrence of independent domestication events in Mesoamerica and the Andes is well established, due to the large set of coherent results that have been obtained using different approaches based on molecular markers (seed protein: Gepts et al. 1986;Gepts and Bliss 1988;allozymes: Koenig and Gepts 1989;Singh et al. 1991b;RFLPs: Becerra Velasquez and Gepts 1994;RAPDs: Freyre et al. 1996;AFLPs: Tohme et al. 1996;Papa and Gepts 2003;SSRs: Blair et al. 2006a,b;Díaz and Blair 2006) and morphological characteristics (Singh et al. 1991a,c;Gepts 1998). However, the number of domestication events that occurred in the two differ-ent regions remains a topic of discussion. From the lower diversity for phaseolin types in the domesticated Mesoamerican gene pool compared to the Andean one, Gepts et al. (1986) suggested that a single domestication event occurred in Mexico whereas multiple domestication events occurred in the Andes. Using molecular markers, the results of Papa and Gepts (2003),  and  also support the occurrence of a single domestication event in Mesoamerica. However, some studies indicate that from the comparison of the population structures of the two domesticated gene pools, there is an indication of multiple domestication events in Mesoamerica (Singh et al. 1991a,b,c;Beebe et al. 2000). Indeed, this study shows a high population structure related to important phenotypic differences in the domesticated forms from Mesoamerica, whereas almost no structure is seen in the Andes. These results can also be explained by differences between the two wild forms (e.g. the very low molecular diversity in the Andean gene pool). These are paralleled in the domesticated forms, suggesting that they originated prior to domestication. Similarly, from an analysis of cpDNA RFLP data, Chacó n et al. (2005) suggested that the common bean has undergone multiple domestications in Mesoamerica. However, as also suggested by Chacó n et al. (2005), their results can easily be explained by gene flow between the wild and domesticated forms. Our data support the occurrence of a single domestication event in Mesoamerica, and they also tend to support the same scenario in the Andes, although the low diversity in this gene pool did not provide our study with the appropriate resolution power to identify different genetic groups.

Supporting Information
Additional Supporting Information may be found in the online version of this article: Table S1. List of the genotypes of P. vulgaris used in this study; the classification is based on passport data. Table S2. The AFLP primer combinations used in this study.
Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.