Information about the locations and sample sizes of the investigated populations is summarized in Figure 2 and Table S1. Leaf samples of adult trees and saplings were collected from seven populations of S. parvifolia including two populations from peninsular Malaysia (Seremban and Mersing), two populations from Sumatra (Nanjak Makmur and Asialog), and three populations from eastern Borneo, Kalimantan (Sari Bumi Kusuma, ITCI Karya Utama, and Sumalindo). Twelve individuals were analyzed for each population, except for the population ITCI Karya Utama, where only six individuals could be sampled. In total, 78 individuals of S. parvifolia were analyzed in this study. Most individuals from the population Seremban were the same as those studied by Ishiyama et al. (2003) and Ishiyama et al. (2008). Individuals from populations Nanjak Makmur, Asialog, and Sari Bumi Kusuma were the same as those studied by Cao et al. (2006). Species identification in the field was done based on leaf morphological characters (e.g., length, petiole length, width, distance from petiole to the widest part of the leaf, number of venations, number of lobes, domatia length, and leaf shape).
In our previous studies, one putative hybrid between S. parvifolia and S. acuminata and one putative hybrid between S. parvifolia and S. leprosula were found among S. parvifolia individuals from peninsular Malaysia based on genotypes of the GapC and PgiC gene regions (Ishiyama et al. 2003; Ishiyama et al. 2008). To detect putative interspecific hybrids in the populations analyzed in the present study, one individual each of S. acuminata, S. leprosula, and S. curtisii, which belong to the same timber group (Red Meranti) as S. parvifolia (Symington 1943), was included. Close relationships among these species have been reported by Kamiya et al. (2005). In addition, one individual of S. maxwelliana, which is not closely related to the aforementioned species, was included as an outgroup in some neutrality tests. Individuals of these four species were the same as those used by Ishiyama et al. (2003) or Ishiyama et al. (2008).
Several enzymes that commonly exist in plants and whose functions are well known were chosen as candidate loci for this study. Trial PCR primers were designed based on the known sequences of Shorea species obtained in other studies (GapC: Ishiyama et al. 2003; GBSSI: Kamiya, pers. commun.; PgiC: Ishiyama et al. 2008) and expressed sequence tag (EST) data from S. leprosula (Tsumura, pers. commun.). Sequences of the PCR products amplified using these primers were determined. Gene regions that contained single-nucleotide repeats (>10) or many indels were excluded. Specific primers for the remaining gene regions were designed based on the sequences of S. parvifolia obtained during trial PCR amplifications. Finally, the following five nuclear genes were used: GapC (glyceraldehyde-3-phosphate dehydrogenase, EC 1.2.1.12), GBSSI (granule-bound starch synthase I, EC 2.4.1.242), PgiC (cytosolic glucose-6-phosphate isomerase, EC 5.3.1.9), SBE2 (starch branching enzyme class II, EC 2.4.1.18), and SODH (sorbitol dehydrogenase, EC 1.1.1.14). Names of all loci were assigned according to the corresponding homologues of Arabidopsis thaliana.
DNA isolation, amplification, and sequencing
Genomic DNA was isolated from ∼300 mg of leaves using a modified cetyl trimethyl ammonium bromide method (Murray and Thompson 1980) or the DNeasy 96 Plant Kit (QIAGEN). Partial regions of the five nuclear genes were amplified for each individual by PCR. When the efficiency of PCR amplification was poor, nested PCR was performed. Sequences of the primers for PCR, nested PCR, and sequencing are listed in Table S2. PCR amplification conditions were as follows: 35 cycles of denaturation at 94°C for 30 sec, annealing at 55°C for 30 sec, and extension at 72°C for 150 sec. For nested PCR, the number of cycles ranged from 15 to 35 according to amplification efficiency. Amplification products were purified using the Wizard® SV Gel and PCR Clean-Up System kit (Promega). Purified products were directly sequenced for both strands using an ABI Prism 3100 automatic sequencer (Applied Biosystems). We obtained sequences of both haplotypes for each locus and each individual. When sequences obtained by direct sequencing had no or only one heterozygous site, the sequences of both haplotypes of an individual could be inferred directly. When two or more heterozygous sites or indels were detected by direct sequencing, purified amplification products were cloned into the pGEM-T Easy vector (Promega). Individual clones were sequenced using the universal primers T7 and SP6 for the promoter sites of the vector. To eliminate PCR errors, we sequenced individual clones until three clones with the same phase at heterozygous sites were obtained. The consensus sequence of these three clones was regarded as the sequence of the first haplotype. This procedure was then repeated using additional clones to obtain the sequence of the second haplotype. Sequences of the obtained haplotypes were compared to the corresponding direct sequence to check consistency.
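The clone-based phasing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure as actually carried out by hand: the function names, the toy clone sequences, and the choice of a per-column majority vote as the "consensus" are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def phase_key(clone, het_sites):
    """Bases carried by one clone at the heterozygous sites (its phase)."""
    return tuple(clone[i] for i in het_sites)


def first_supported_haplotype(clones, het_sites, min_clones=3):
    """Scan clones in sequencing order. As soon as `min_clones` clones share
    the same phase, return (phase, consensus); return None if no phase
    reaches that level of support."""
    by_phase = defaultdict(list)
    for clone in clones:
        key = phase_key(clone, het_sites)
        by_phase[key].append(clone)
        if len(by_phase[key]) == min_clones:
            group = by_phase[key]
            # Majority base per alignment column; an isolated PCR error in
            # one of the three clones is outvoted by the other two.
            consensus = "".join(
                Counter(column).most_common(1)[0][0] for column in zip(*group)
            )
            return key, consensus
    return None


# Hypothetical clones of one individual, heterozygous at columns 0 and 3.
# The fourth clone carries a PCR error at column 2 (G -> T).
example_clones = ["ACGTA", "TCGCA", "ACGTA", "ACTTA"]
example_result = first_supported_haplotype(example_clones, het_sites=[0, 3])
```

Running the same function on the remaining clones, after setting aside those assigned to the first haplotype, would mirror the repeat step for the second haplotype.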
Sequences obtained in this study have been deposited in GenBank under accession numbers AB724403 through AB725191. We also used sequences of the PgiC gene region from the Seremban population of S. parvifolia obtained by Ishiyama et al. (2008), and the PgiC sequence for one individual of S. maxwelliana obtained by Kamiya et al. (2005).
DNA sequences were verified and assembled into a contiguous sequence for each locus of each individual using the ATGC program ver. 4 (GENETYX CORPORATION). Multiple sequence alignment for individual loci was performed using the Clustal W program ver. 1.4 (Thompson et al. 1994) and corrected manually. Alignment gaps were excluded from all analyses. To assess levels of nucleotide polymorphism, nucleotide diversity (π; Nei 1987) and haplotype diversity (Hd; Nei 1987) were estimated for each of the five investigated loci. The population recombination parameter ρ (ρ = 4Nec, where Ne is the effective population size and c is the recombination rate per generation per site) was estimated for each locus using the composite-likelihood method (Hudson 2001) implemented in the software package LDhat (http://www.stats.ox.ac.uk/~mcvean/LDhat/index.html). To test for deviation from selective neutrality and the other assumptions of the standard neutral model (random mating, constant population size, and no migration), Tajima's D test (Tajima 1989) was performed for individual loci. For this test, the 95% confidence interval of Tajima's D statistic for each locus was obtained from 10,000 replicates of coalescent simulations under the standard neutral model (Hudson 1990) with no recombination. The coalescent simulations were conditioned on the observed number of polymorphic sites, which defined the number of mutations. Heterogeneity of the ratio of divergence to polymorphism was tested between synonymous and nonsynonymous sites using the MK test (McDonald and Kreitman 1991) and among loci using the multilocus HKA test (Hudson et al. 1987). The ratio should be the same if the tested sites or loci evolve neutrally. Shorea maxwelliana was used as the outgroup species in the MK and HKA tests. The multilocus Tajima's D test and the HKA test were performed using the HKA program obtained from Jody Hey's website (http://lifesci.rutgers.edu/~heylab/).
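For readers who want to see how the per-locus summary statistics are defined, the following Python sketch computes nucleotide diversity (π) and Tajima's D from a gap-free alignment using the standard formulas of Tajima (1989). It is only an illustration: the function names and the toy alignments are assumptions, and it does not reproduce DnaSP or its coalescent-based confidence intervals.

```python
from itertools import combinations
from math import sqrt


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences (k)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def nucleotide_diversity(seqs):
    """pi: mean pairwise differences per site (alignment assumed gap-free)."""
    return mean_pairwise_differences(seqs) / len(seqs[0])


def segregating_sites(seqs):
    """Number of alignment columns with more than one state (S)."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajimas_d(seqs):
    """Tajima's D; None when it is undefined (n < 4 or no polymorphism)."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    k = mean_pairwise_differences(seqs)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignments (not data from this study).
intermediate_freq = ["ACGT", "ACGT", "ACGA", "ACTA"]
singletons_only = ["AAAA", "AAAT", "AATA", "ATAA"]
```

In the first toy alignment one variant is at intermediate frequency and D is positive; in the second every variant is a singleton and D is negative, which is the pattern the test uses to detect an excess of rare variants.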
In the multilocus Tajima's D test, P-values for the average Tajima's D statistic over the five loci were obtained. All calculations and coalescent simulations (except for the estimation of ρ and the multilocus Tajima's D and HKA tests) were performed using the DnaSP program ver. 4.10.9 (Rozas et al. 2003).
To investigate the degree of population differentiation, fixation indices (FST; Hudson et al. 1992) between populations were estimated for each gene region. To visualize relationships among the investigated populations, we constructed a neighbor-joining (NJ) tree (Saitou and Nei 1987) based on the net number of nucleotide differences (Da; Nei 1987). The tree was constructed using the MEGA5 program (Tamura et al. 2011). Furthermore, we used the model-based clustering algorithm (Pritchard et al. 2000) implemented in the STRUCTURE program ver. 2.2 (http://pritch.bsd.uchicago.edu/structure.html) to detect population structure and assign individuals to populations. Related haplotypes were grouped and treated as single alleles. Haplotype grouping was performed using the TCS program ver. 1.18 (Clement et al. 2000). All model parameters in the STRUCTURE analysis were set to the program defaults. We conducted five independent simulations with 50,000 iterations for the burn-in phase and 200,000 iterations for the data collection phase. The number of distinct clusters (K) was selected based on the ΔK statistic of Evanno et al. (2005).
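The ΔK criterion can be illustrated with a short Python sketch. It assumes the commonly used form of the statistic, in which the absolute second-order difference of the mean log-likelihood L(K) across runs is divided by the standard deviation of L(K) across runs. The function name and the log-likelihood values below are hypothetical and are not STRUCTURE output from this study.

```python
from statistics import mean, stdev


def evanno_delta_k(log_likelihoods):
    """log_likelihoods maps K to a list of L(K) values, one per independent run.
    Returns {K: delta_K} for every K that has both neighbours K-1 and K+1."""
    means = {k: mean(values) for k, values in log_likelihoods.items()}
    delta_k = {}
    for k in sorted(log_likelihoods):
        if k - 1 not in means or k + 1 not in means:
            continue  # the second-order difference needs both neighbours
        spread = stdev(log_likelihoods[k])
        if spread == 0:
            continue  # delta K is undefined when all runs agree exactly
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta_k[k] = second_difference / spread
    return delta_k


# Hypothetical L(K) values from three runs per K.
example_runs = {
    1: [-100.0, -102.0, -98.0],
    2: [-50.0, -51.0, -49.0],
    3: [-48.0, -49.0, -47.0],
    4: [-47.0, -46.0, -48.0],
}
example_delta_k = evanno_delta_k(example_runs)
best_k = max(example_delta_k, key=example_delta_k.get)
```

Note that ΔK cannot be evaluated for the smallest and largest K tested, so a range of K wider than the expected number of clusters is needed.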
Our STRUCTURE analysis revealed two genetically distinct groups of populations: the Sumatra-Malay group and the Borneo group. To infer the history of the split between these population groups, we used the IMa program (Hey and Nielsen 2007). We estimated the following six parameters: 4Neu of the Sumatra-Malay group (θsm), the Borneo group (θb), and their ancestral population (θA); the migration rate from the Borneo group to the Sumatra-Malay group (msm); the migration rate from the Sumatra-Malay group to the Borneo group (mb); and the divergence time (t). The program uses Markov chain Monte Carlo simulations to generate genealogies that fit the "isolation with migration" (IM) model (Hey and Nielsen 2004) to data from multiple loci. The infinite-site model (Kimura 1969) was used as the mutation model for all loci in the simulations. Since the IM model assumes no recombination, we used the longest part of each sequence alignment that showed no evidence of recombination in the four-gamete test (Hudson and Kaplan 1985). First, the prior intervals of the parameters were determined empirically from preliminary runs of the IMa program with large parameter intervals. Subsequently, using the prior maxima thus obtained, we ran 100,000,000 steps of simulation after a burn-in period of 100,000 steps, saving a genealogy every 1000 steps. Peaks of the resulting marginal posterior probability distributions were taken as estimates of the parameters. Since selective neutrality of the GBSSI gene was rejected by the MK test, another simulation was run with the four remaining loci to check how this locus affects the estimates.
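The four-gamete criterion used to select non-recombining segments can be sketched as follows. Under the infinite-site model, observing all four combinations of alleles at a pair of biallelic sites implies at least one recombination event between them. The Python functions below are a minimal sketch: their names and the toy haplotypes are assumptions, and the way the longest compatible block is chosen here is one straightforward reading of the text, not necessarily the exact procedure used.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Pairs of biallelic columns (i, j), i < j, showing all four gametes."""
    columns = list(zip(*haplotypes))
    biallelic = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    violations = []
    for i, j in combinations(biallelic, 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


def longest_compatible_block(haplotypes):
    """Longest contiguous column range [start, end) that contains no pair of
    sites violating the four-gamete test."""
    violations = four_gamete_violations(haplotypes)
    best = (0, 0)
    start = 0
    for end in range(len(haplotypes[0])):
        # A block ending at `end` must begin after the left site of every
        # violating pair whose right site is at or before `end`.
        for i, j in violations:
            if j == end:
                start = max(start, i + 1)
        if end + 1 - start > best[1] - best[0]:
            best = (start, end + 1)
    return best


# Hypothetical haplotypes: columns 0 and 2 show all four gametes.
example_haplotypes = ["ACAG", "ACTG", "TCAG", "TCTG", "ACAC"]
example_block = longest_compatible_block(example_haplotypes)
```

In the toy data the returned half-open range excludes column 0, so the retained segment no longer contains both sites of the incompatible pair.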
The six parameters estimated by the IMa program were converted to actual demographic parameters (i.e., Ne, effective population size; T, divergence time in years; 2Nem, population migration rate). To estimate T, t must be divided by the geometric mean of the mutation rates per locus per year. Unfortunately, it is difficult to estimate the mutation rate of Shorea species because precisely dated fossil records are lacking. Thus, the minimum and maximum mutation rates per site per year reported for synonymous nucleotide substitutions in nuclear genes of other tree species, usyn = 0.7 × 10⁻⁹ in Pinus (Willyard et al. 2007) and usyn = 2.61 × 10⁻⁹ in palms (Gaut et al. 1996), were used as the mutation rates for silent sites (intron and synonymous sites) in S. parvifolia. The mutation rate per site per year for nonsynonymous sites was computed by multiplying the synonymous mutation rate by the observed Ka/Ks ratio. The resulting minimum and maximum geometric means of the mutation rates per locus per year for S. parvifolia were 2.58 × 10⁻⁷ and 9.63 × 10⁻⁷ when all five loci were included, and 2.87 × 10⁻⁷ and 1.07 × 10⁻⁶ when the GBSSI locus was excluded. To obtain estimates of Ne, θ must be divided by 4V, where V is the mutation rate per locus per generation. Assuming a minimum generation time of 60 years for Shorea (Ashton 1969), the minimum and maximum mutation rates per locus per generation (V) for S. parvifolia were computed as 1.55 × 10⁻⁵ and 5.78 × 10⁻⁵ for five loci, and 1.72 × 10⁻⁵ and 6.42 × 10⁻⁵ for four loci excluding the GBSSI gene. The population migration rate 2Nem per generation was computed by multiplying θ by m/2.
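The conversions in this paragraph amount to three short formulas, sketched here in Python. The mutation rate and generation time are the values quoted in the text; the function name and the θ, m, and t inputs are hypothetical placeholders, not estimates from this study.

```python
def to_demographic(theta, m, t, u_locus_per_year, generation_years=60):
    """Convert mutation-scaled IMa estimates to demographic units.

    theta: 4*Ne*u for one population
    m: migration rate into that population, scaled by the mutation rate
    t: divergence time scaled by the mutation rate
    u_locus_per_year: geometric mean mutation rate per locus per year
    """
    v = u_locus_per_year * generation_years  # per locus per generation (V)
    return {
        "V": v,
        "Ne": theta / (4.0 * v),           # effective population size
        "T_years": t / u_locus_per_year,   # divergence time in years
        "2Nm": theta * m / 2.0,            # population migration rate
    }


# Minimum five-locus rate from the text; theta, m, and t are made-up inputs.
example = to_demographic(theta=1.0, m=0.5, t=0.1, u_locus_per_year=2.58e-7)
```

With the minimum five-locus rate, V evaluates to 1.548 × 10⁻⁵, which rounds to the 1.55 × 10⁻⁵ reported above. Repeating the call with the maximum rate gives the other end of the range for Ne and T, whereas 2Nem does not depend on the assumed mutation rate.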
To test several different demographic models, log-likelihood ratio (LLR) tests between the full model and nested models were performed using the results of the simulations run with the IMa program. These tests are also implemented in the IMa program (Hey and Nielsen 2007). The full model includes all six parameters estimated by the aforementioned simulations, whereas some parameters are fixed (e.g., msm = 0) in the nested models. For one of the nested models, msm = mb = 0, namely no migration after divergence (isolation model), the LLR is expected to follow a mixed χ² distribution. However, this is not a good approximation (IM discussion group: http://groups.google.com/group/isolation-with-migration). Therefore, the LLR between the full model and the nested model with equal migration rates (msm = mb) was first tested against a χ² distribution (df = 1). Subsequently, the LLR between the nested models (msm = mb) and (msm = mb = 0) was tested against a mixed χ² distribution in which one half of the values is zero and the other half follows a χ² distribution (df = 1). If neither null hypothesis was rejected, the isolation model was accepted as the null model.
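The two reference distributions used in this stepwise procedure can be evaluated with the Python standard library alone, as sketched below. The sketch assumes that the test statistic passed in is already on the χ² scale (twice the difference in log-likelihood); the function names are illustrative and do not correspond to IMa output fields.

```python
from math import erfc, sqrt


def chi2_df1_p_value(statistic):
    """Upper-tail probability of a chi-square distribution with df = 1."""
    if statistic <= 0:
        return 1.0
    return erfc(sqrt(statistic / 2.0))


def mixed_chi2_p_value(statistic):
    """Upper-tail probability under a 50:50 mixture of a point mass at zero
    and a chi-square distribution with df = 1."""
    if statistic <= 0:
        return 1.0
    return 0.5 * chi2_df1_p_value(statistic)


def isolation_model_accepted(stat_full_vs_equal, stat_equal_vs_zero, alpha=0.05):
    """Stepwise rule: accept the isolation model only if neither test rejects."""
    step1 = chi2_df1_p_value(stat_full_vs_equal) >= alpha
    step2 = mixed_chi2_p_value(stat_equal_vs_zero) >= alpha
    return step1 and step2
```

Because half of the mixture's mass sits at zero, a given statistic yields half the P-value it would under the ordinary χ² distribution (df = 1), so the 5% critical value drops from about 3.84 to about 2.71.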