Genome‐wide association study discovered genetic variation and candidate genes of fibre quality traits in Gossypium hirsutum L.

Summary Genetic improvement of fibre quality is one of the main breeding goals for the upland cotton, Gossypium hirsutum, but there are difficulties with precise selection of traits. Therefore, it is important to improve the understanding of the genetic basis of phenotypic variation. In this study, we conducted phenotyping and genetic variation analyses of 719 diverse accessions of upland cotton based on multiple environment tests and a recently developed Cotton 63K Illumina Infinium SNP array and performed a genome‐wide association study (GWAS) of fibre quality traits. A total of 10 511 polymorphic SNPs distributed in 26 chromosomes were screened across the cotton germplasms, and forty‐six significant SNPs associated with five fibre quality traits were detected. These significant SNPs were scattered over 15 chromosomes and were involved in 612 unique candidate genes, many related to polysaccharide biosynthesis, signal transduction and protein translocation. Two major haplotypes for fibre length and strength were identified on chromosomes Dt11 and At07. Furthermore, by combining GWAS and transcriptome analysis, we identified 163 and 120 fibre developmental genes related to length and strength, respectively, of which a number of novel genes and 19 promising genes were screened. These results provide new insight into the genetic basis of fibre quality in G. hirsutum and provide candidate SNPs and genes to accelerate the improvement of upland cotton.


Introduction
Cotton is one of the most important natural textile fibre crops and is extensively planted throughout the world (Chen et al., 2007;Wendel, 1989). The cotton genus, Gossypium, consists of approximately 46 diploid species and six allotetraploid species (Grover et al., 2014;Wendel and Cronn, 2003). Of these, Gossypium hirsutum, also known as 'upland cotton', is the most widely cultivated species, constituting more than 95% of the world's cotton production. Upland cotton is characterized by wide adaptability and high production. However, its low-quality fibre requires improvement to meet human demands and progress in spinning technology.
The fibre quality of cotton is a complex quantitative trait that is controlled by multiple genes and is susceptible to environmental conditions (Paterson et al., 2003). The major fibre quality characteristics include length, strength, fineness (also called micronaire), uniformity and elongation, among which length, strength and fineness are most important for spinning yarn quality. Previous studies have dissected the genetic architecture of fibre quality through traditional quantitative trait locus (QTL) linkage mapping using biparental populations (Jamshed et al., 2016;Paterson et al., 2003;Shen et al., 2005;Yang et al., 2014b;Zhang et al., 2011). Many QTLs of fibre quality traits have been identified (Lacape et al., 2010;Said et al., 2013), providing significant insight into fibre genetics. However, most of these QTLs obtained from interspecific populations are not directly applicable to upland cotton improvement because they are localized in very large genetic regions and are often not stable across populations (Islam et al., 2016), and the molecular mechanisms underlying most of these QTLs are unknown.
Genome-wide association study (GWAS) based on linkage disequilibrium (LD) can effectively associate genotypes with phenotypes in natural populations and simultaneously detect many natural allelic variations and candidate genes in a single study, in contrast to QTL linkage mapping (Gupta et al., 2005;Huang and Han, 2014;Nuzhdin et al., 2012). Because of its advantages, including high resolution, cost efficiency and nonessential pedigrees, GWAS has been applied to many important and complicated phenotypes in crops such as rice (Huang et al., 2010;Yano et al., 2016;Zhao et al., 2011), maize Thornsberry et al., 2001;Yang et al., 2014a), soybean (Han et al., 2015;Zhou et al., 2015b), sorghum (Mace et al., 2013;Morris et al., 2013) and millet (Jia et al., 2013). The identification and characterization of the genes associated with important agronomic traits is essential for understanding the genetic basis of phenotypic variation and for promoting crop improvement.
The first commercial high-density CottonSNP63K array was recently developed; it provides a new resource for the genetic dissection of agronomically and economically important traits in cotton improvement (Hulse-Kemp et al., 2015). Additionally, the release of the upland cotton TM-1 genome sequence Zhang et al., 2015b) and the development of highthroughput single nucleotide polymorphism (SNP) assays  have enabled GWAS to explore the genetic basis of complex cotton traits. However, there are few reports of GWA mapping using large natural populations based on SNP markers in cotton.
Thus, in this study, to identify the natural allelic variation in the cotton genome and candidate genes significantly associated with fibre quality, we performed a GWAS of fibre quality using 719 upland cotton accessions and the high-density CottonSNP63K array based on phenotypic tests in eight environments. The results further our understanding of the mechanisms underpinning fibre quality, provide molecular markers for high-quality cotton breeding and may provide a reference for genetic dissection of other complex quantitative traits in cotton.

Phenotypic variation in fibre quality traits
To evaluate the phenotypic variation in fibre quality traits in the association population, we analysed five traits in eight environments during two years. Fibre length (FL), fibre strength (FS), fibre micronaire (FM), fibre uniformity (FU) and fibre elongation (FE) varied from 22.07 to 35.56 mm, 22.69 to 36.80 cN/tex, 3.14 to 6.34, 77.48 to 89.00 % and 3.55 to 8.65 %, respectively ( Table 1). The combined variance analysis based on phenotype traits in the eight environments showed significant differences among genotypes, environments and a genotype and environment interaction (Table S2). Correlation analysis of the traits showed that FL was significantly positively correlated with FS and FU, and a positive correlation was also found between FS and FU. FM was significantly negatively correlated with FL and FS ( Figure S1).

Genetic variation based on SNPs
The genotypes of 719 accessions were examined using the Cotton 63K Illumina Infinium SNP array. All the SNP data were analysed using Illumina GenomeStudio software. First, SNPs without allele polymorphism were eliminated. Low-quality SNP loci (call rate < 85% and minor allele frequency < 0.05) were also deleted. A final set of 10 511 high-quality SNPs was screened and used for genetic variation and GWA analysis ( Figure 1a; Table 2). These SNP markers were not evenly distributed across the entire genome, with 3923 and 6588 SNPs in the At and Dt subgenomes, respectively. Chromosome Dt08 had the maximum number of SNPs (844), and At04 had the minimum (97) (Figure 1a; Table 2). The polymorphism information content (PIC) values ranged from 0.208 to 0.312 among chromosomes, and the mean PIC of the At and Dt subgenomes was 0.287 and 0.283, respectively (Table 2).
To understand the genetic diversity of our association panel, molecular phylogenetic analysis among these accessions was conducted based on the genetic distances of these SNPs. The neighbour-joining (NJ) tree results showed two divergent groups (Figure 1b), designated G1 and G2, with 323 and 396 accessions, respectively (Table S1). We then compared the differences in genome-wide SNPs between the two groups. There were 360 unique SNPs across the accessions of G2 and only 68 unique SNPs in G1 (Figure 1c, d). These results suggested the two groups had a degree of genetic differentiation at the molecular level (Coart et al., 2002).

Population structure and linkage disequilibrium
We investigated the population structure of our panel using STRUCTURE software. The LnP(K) values continuously increased with K from 1 to 10 ( Figure 2a); however, the delta K value reached a sharp peak at K = 2 ( Figure 2b). Therefore, this association population was clustered into two subpopulations ( Figure 2c). Similarly, PCA showed two clusters for the population, despite some accessions overlapping in the two clusters   Figure S1), the varietal population in this study was considered not highly structured and could be used for GWAS as in other previously reported crops (Yano et al., 2016). The extent of LD provides a moderate resolution for genomewide identification for gene discovery (Flint-Garcia et al., 2003). The LD decay of our population was approximately 0.82 Mb, where the r 2 drops to half the maximum value ( Figure 2e). The overall LD decay distance in the At subgenome was significantly higher than that in the Dt subgenome, 1.30 and 0.33 Mb, respectively ( Figure 2e). Considering the LD decay distances and in comparison with other crops, such as rice and maize, with LD of 100 kb-1 Mb and 1-100 kb, respectively (Gore et al., 2009;Nordborg et al., 2002;Remington et al., 2001;Zhao et al., 2011), we assumed 200 kb as the region of SNP-associated candidate genes for fibre traits.

Genome-wide association mapping
To explore the genetic factors associated with the fibre quality, we conducted a GWAS, which took into account the population structure and familial relatedness (Yu et al., 2006) using all traits analysed in 2014 and 2015. A total of 46 significantly associated SNPs were detected in at least one environment (Table 3). These SNPs were located on chromosomes At01, At07, At08, At10, At12, At13, Dt01, Dt03, Dt04, Dt05, Dt06, Dt07, Dt10, Dt11 and Dt13. In addition, some were detected repeatedly across multiple environments, which was similar to the results for other plants, showing more stable SNPs linked with traits (Xu et al., 2016;Zhao et al., 2011). For FL, 20 significant SNPs were identified on chromosomes At07, At10, Dt03, Dt05, Dt06, Dt07 and Dt11 (Table 3; Figure S2), explaining approximately 57.72 % of the total phenotypic variation (Table 3). Seven loci in At07 and four loci in Dt11 were close in their respective chromosomes. Moreover, the locus i60962Gt in Dt11 was detected across six environments (Table 3).
For the other three traits, FM, FU and FE, 4, 4 and 11 loci were detected, respectively (Table 3). There were four SNPs located in three chromosomes significantly associated with FM (Table 3; Figure S4), and all of four loci for FU were located close together in At08 (Table 3; Figure S5). The 11 SNPs for FE on six chromosomes contributed 2.67 to 5.45 % of the phenotypic variation, but three (i24710Gh, i23677Gh and i09441Gh) were also present among significant SNPs associated with FL (Table 3; Figure S6).

Identification of candidate genes associated with SNPs
We confirmed potential candidate genes near 46 significant SNP loci based on genes annotated in the G. hirsutum TM-1 genome (Zhang et al., 2015b). Within 200 kb of significant SNPs, a total of 612 candidate genes were identified (Table S3). We scrutinized the distribution of these candidate genes among each chromosome according to the distance to the nearest associated SNP ( Figure 3a; Table S3). The results showed that genes detected in the Dt subgenome overall doubled those in the At subgenome ( Figure 3a). The genes on Dt03, Dt05 and Dt06 increased as the distance increased, and genes on At07, At10 and Dt11 showed no obvious change ( Figure 3a).
Gene ontology (GO) analysis indicated that the functions of most genes were binding and catalytic reactions, regardless of the intervals (Figure 3b). We also conducted KEGG pathway enrichment of all candidate genes, and found 284 genes enriched in 72 pathways, with approximately 40 genes involved in important fibre development pathways (Table S4). The top three concrete pathways containing more than ten genes were plant hormone signal transduction, carbon metabolism and ribosome pathways. For the nucleotide sugar metabolism and starch and sucrose   metabolism pathways clearly related to fibre development, two genes, Gh_A13G0573 and Gh_D07G1799, code for galacturonosyltransferase (GAUT), and Gh_A07G1759 and Gh_D06G1953 code for ADP glucose pyrophosphorylase and UDP-arabinose 4-epimerase, respectively (Tables S3, S4). We analyse these genes and their related transcriptome data below.

Analysis of major SNP loci relevant to fibre length and strength
We analysed major SNP loci relevant to the two critical quality traits, fibre length and strength, that breeders most often consider in selection. For SNPs associated with FL, a significant peak appeared in chromosome Dt11 (Figure 4a, b), and there were 20 candidate genes associated with the significant SNPs ( Figure 4b). Haplotype analysis showed a high level of LD between the associated SNPs in Dt11 (Figure 4c). The four SNPs resulted in four haplotypes in our association panel (Figure 4d). The average FL of Hap4 was 29.24 mm, greater than those of the other three (Figure 4e). Based on the polymorphism of the SNP markers, each SNP locus of Hap4 could be classified into three genotypes ( Figure S7). The genotypes of the four SNP loci containing in Hap4 displayed a higher average FL than the other nonincluded genotypes ( Figure S7). Seven significant SNPs on chromosome At07 were simultaneously analysed for fibre length and strength across multiple environments (Table 3; Figure 5a, b). We selected seven SNPs to further investigate the allelic variation. There were also 20 candidate genes surrounding them (Figure 5b). Haplotype analysis showed a high level of LD between these SNP markers ( Figure 5c). Among the 719 accessions, there were three haplotypes with the seven SNPs (Figure 5d). An overwhelming majority of the accessions belonged to Hap1, while Hap3 was composed of 47 accessions (Table S5), some of which were known high-quality upland cotton cultivars. The average fibre length and strength of Hap3 were 29.85 mm and 30.81 cN/tex, respectively, significantly greater than those of other two (Figure 5e, f).
For the genetic effect of each SNP in Hap3, taking locus i39753Gh for example, the average fibre length and strength in the accessions with the GG genotype were 29.90 mm and 31.37 cN/tex, respectively ( Figure S8). The three SNPs i02034Gh, i02035Gh and i02037Gh had the same effect on the phenotype as the locus i39753Gh ( Figure S8). Moreover, the genotype of the other loci in Hap3 had a higher average fibre length and strength than those of the other haplotypes ( Figure S8).

Verification of candidate genes by transcriptome analysis
We further validated these candidate genes using transcriptome sequencing data. We investigated the expression level (log 2 (1+RPKM) > 1) of all 212 and 161 genes related to FL and FS, respectively (Table S3), and found that approximately twothirds of genes (163 and 120) were significantly expressed during some time points (0-30 days post anthesis, DPA) of fibre development, including initiation, cell elongation and cell wall thickening ( Figure S9). Among them, there were 107 and 95 genes (including 30 and 27 hypothetical proteins) not reported in cotton (Table S3). This means the genes are novel candidates for fibre development. The genes that had been reported in cotton (Deng et al., 2016;Guo et al., 2009;Huang et al., 2013;Samuel et al., 2006;Wang et al., 2004;Zhou et al., 2015a) were mainly (63 % in FL and 68 % in FS) involved in fibre development, which confirmed the accuracy of the fibre-associated candidate genes described in this study. The differences in fibre length and strength between cultivars could be determined by gene expression in different stages of fibre development. For genes in the fibre elongation stage (0-15 DPA), those with significantly increased expression levels in at least two of the three time points (5, 10 and 15 DPA) compared with 0 DPA (predominately expressed) were screened in four upland cotton cultivars. There were nine candidate genes that matched this condition (Figure 6a). In the fibre thickening stage, spanning 20-30 DPA, we identified eight differentially expressed genes (Figure 6b).
All 17 selected genes (nine in FL and eight in FS) were related to diverse functions (Table 4) and were classified into three expression patterns ( Figure 6). First, some genes, for example Gh_D05G1524, encoding a lipid transfer protein, showed steeply increasing expression from 0 DPA of fibre elongation stage and sharply decreased after 10 DPA among four upland cotton cultivars (Figure 6c). We designated this type of gene 'elongation pattern'. Second, another type of gene, for example Gh_D06G1953, associated with fibre length, coding for UDP-  (Table S4) and catalysing the epimerization between UDP-D-Xyl and UDP-L-Ara in the cell wall polysaccharides (Burget et al., 2003), showed a striking increase in expression from 15 DPA, peaking at 20 DPA (Figure 6d). We designated this type of gene 'thickening pattern'. Third, several of the 17 genes, such as Gh_D06G0002 and Gh_A01G1733, were highly expressed from 5 to 30 DPA (Figure 6a, b). We designated this type of gene 'elong-ckening pattern'. Other than the five known genes (Gh_D03G0294, Gh_D05G1451, Gh_D05G1507, Gh_D05G1523 and Gh_D07G1799), ten (excluding two genes found in both FL and FS) of these 17 predominately expressed genes were not previously reported in cotton (Table S3). Additionally, there were three expression patterns of fibre development for the 163 and 120 screened genes of FL and FS ( Figure S9).
We found two genes with significant differences in expression between high-quality and normal-quality cultivars (Figure 7; Table 4). Gh_D07G1799, coding for galacturonosyltransferase (GAUT), is involved in pectin biosynthesis, which is part of glycan metabolism (Sterling et al., 2006). In the KEGG analysis, we found that it was from the nucleotide sugar metabolism and sucrose metabolism pathways (Table S4). In the previous study of our team, we cloned a novel gene, GbGAUT1, from G. barbadense and proposed that GbGAUT1 plays a key role in fibre development (Chi et al., 2009). The expression level of Gh_D07G1799 showed an obvious increase in fibre elongation stage and was expressed at much higher levels in normal-quality cultivars than in high-quality cultivars (Figure 7a, b). Another gene Gh_D13G1792 belonged to the Arabidopsis ankyrin repeat family protein, and this protein served as a molecular chaperone that plays many important roles in plant cellular metabolism (Shen et al., 2010). It is associated with membrane-enclosed organelles and is required for pollen tube growth in lilies (Huang et al., 2006). However, it was not previously reported in cotton (Table S3). This gene had a much higher expression level in high-quality cultivars than in normal-quality cultivars ( Figure  7c, d).
In addition, we verified 20 associated genes in FL and 20 genes in FS surrounding the peak SNPs in Dt11 and At07 based on expression (Figure 8a, b). Three genes (Gh_D11G1926, Gh_D11G1928 and Gh_D11G1929) in Dt11 for FL contained a significant SNP locus ( Figure 4b) and had higher expression level, especially Gh_D11G1929 (Figure 8a). This gene was highly expressed at all time points of fibre development (Figure 6a, b;  Figure 8a). There were seven genes not expressed during fibre development. Most of the genes (except the five genes that were not expressed) associated with the seven SNPs in At07 for FS showed high expression levels ( Figure 8b). The two loci i02034Gh and i02037Gh were located inside gene Gh_A07G1768 and Gh_A07G1769, respectively ( Figure 5b). Moreover, there were three SNPs near the two genes ( Figure 5b) that were steadily expressed from 0 to 30 DPA (Figure 8b). These significant SNPs had an effect on the neighbouring genes. One of these genes, Gh_A07G1768, annotated as an unknown protein, may be a novel gene for fibre development (Table S3). In addition, the expression of gene Gh_A07G1758 near locus i39753Gh remained high (Figure 6a, b; Figure 8b). These data suggested that the above candidate genes play important roles in different stages of fibre development. Additionally, among the 26 significantly expressed genes in the haplotypes, there were three known and 23 not reported genes in cotton (Table S3).

Discussion
In this study, we first performed a genome-wide association analysis of fibre quality traits based on the CottonSNP63K array with a large number of natural accessions of G. hirsutum. This study uncovered numerous loci underlying variation in fibre quality traits and identified a set of candidate genes that could be exploited to alter fibre development to improve upland cotton cultivars.   By phenotyping the 719 cotton accessions in various environments, relatively abundant variation and significant differences among genotypes and environments were observed for the fibre quality traits (Table 1; Table S2), which showed higher diversity compared with previous reports in cotton (Jamshed et al., 2016;Shen et al., 2005;Yang et al., 2014b). The panel was classified into two groups based on molecular analyses, with 68 unique SNPs in G1 and 360 unique SNPs in G2 (Figure 1c, d); however, they were not completely consistent with geographical sources due to the wide adaptation of cotton, as in previous studies (Tyagi et al., 2014;Wang et al., 2016). The differences in polymorphic SNPs likely reflected the history of extensively interspecific crosses between diverse accessions of the two groups, resulting in introgression among germplasms. Therefore, it was difficult to fully classify the cotton accessions according to geographical source (Li et al., 2014;Xu et al., 2016). In addition, LD decay limits the mapping resolution of GWAS. Our study found that the LD decay was 0.82 Mb, which provided a more accurate reference for selecting candidate genes than previous studies using SSRs (Abdurakhmonov et al., 2008;Nie et al., 2016). Taking account of the greater continuous phenotypic variation among the tested fibre quality traits, our not highly structured association panel is applicable for performing GWAS.
Historically, QTL mapping has been utilized to identify and map causative genomic locations controlling fibre quality using biparental populations. Hundreds of QTLs have been located in intraspecific and interspecific populations of cotton (Said et al., 2015), but this method has rarely led to candidate gene isolation because it only captures limited allelic diversity existing in two parental lines. In contrast to traditional linkage mapping, GWA mapping does not require biparental crosses between individuals and provides greater precision in the localization of QTLs. This approach has been widely used for various traits in other crops, including morphological traits (Crowell et al., 2016;Li et al., 2014;Meijon et al., 2014;Porth et al., 2013), disease resistance (Dimkpa et al., 2015;Wang et al., 2012;Wen et al., 2014), environmental adaptation (Famoso et al., 2011;Nagel et al., 2015;Wu et al., 2015;Yang et al., 2014a;Zhang et al., 2015a) and flowering time (Harsh et al., 2015;Stracke et al., 2009;Xu et al., 2016). In cotton, efforts have been made to detect QTLs using GWAS with SSR markers in recent years (Abdurakhmonov et al., 2008;Cai et al., 2014;Nie et al., 2016;Zeng et al., 2009). These GWAS results, although conducted with few markers and samples, are important for fibre improvement. However, limitations of mapping resolution and genome coverage are inevitable. Therefore, in this study, the genetic factors controlling fibre quality discovered by GWA mapping with more SNP markers and cotton accessions could provide more precise insight into fibre development.
A combination analysis of our GWAS and transcriptome sequencing data revealed 163 and 120 significantly expressed candidate genes in FL and FS, respectively. Two of these genes, Gh_A07G1758 and Gh_D11G1929, were associated with fibre length and strength. Gh_A07G1758 on At07 is homologous to AT4G17170 (RAB GTPase B1C) and Gh_D11G1929 on Dt11 codes for KIP-related protein 6 (KRP6) in Arabidopsis. The Rab family encodes key factors of vesicle-targeting specificity proteins in A. thaliana (Rutherford and Moore, 2002). A plant Rab GTPase, RabA4b, was proposed to regulate membrane trafficking in cells (Preuss et al., 2004). Another plant-unique RAB5 protein, ARA6, was demonstrated to have a functional link between a specific RAB acting in the endosomal trafficking pathway and a specific SNARE complex in Arabidopsis thaliana (Ebine et al., 2011). Complexes of SNARE proteins mediate intracellular membrane fusion between vesicles and organelles to facilitate transport cargo proteins in plant cells (Baker and Hughson, 2016). These results indicated that Gh_A07G1758 could play a key role in the formation of cotton fibre, as indicated by a QTL (Zhang et al., 2011) near Gh_A07G1758 found in previous research. The detailed mechanism of this gene should be further investigated. In addition, we identified a gene, Gh_D06G0002, encoding a SNARE-like superfamily protein homologue, which was not previously reported in cotton. Gh_D11G1929 is a homologue of A. thaliana KRP6, which is a cyclin-dependent kinase inhibitor. In Arabidopsis thaliana, KRP6 overexpression accelerated entry into mitosis, but delayed mitotic progression (Vieira et al., 2014). The regulator KRP6 partially repressed GA-dependent activation of the cell cycle during germination (Nieuwland et al., 2016). However, analysis of this gene in cotton fibre development has not been reported.
Many genes participating in nucleotide sugar metabolism are important in fibre cells. An identified gene, Gh_D03G0294, homologous to Arabidopsis xyloglucan endotransglucosylase/ hydrolase (XTH), made plant cells undergo cell expansion, acting as a cell wall-loosening enzyme (Van Sandt et al., 2007). GhXTH1 was the predominant XTH in elongating fibres, and its expression limited cotton fibre elongation (Lee et al., 2010). Another gene, Gh_D05G1451, homologous to Arabidopsis trehalose-6-phosphate synthase (TPS), was implicated in the regulation of sugar metabolism/embryo development (Eastmond et al., 2002). Arabidopsis TPS1 may play a major role in coordinating cell wall biosynthesis and cell division in cellular metabolism (Gomez et al., 2006). These two genes play different roles during fibre development based on their expression patterns in cotton ( Figure 6); however, the functions of these genes in cotton remain to be elucidated.
In conclusion, we genotyped 719 upland cotton accessions using the CottonSNP63K array for the first time and identified 46 SNPs significantly associated with fibre quality traits across eight environments and a number of novel genes, including 19 promising genes, of which ten were not reported in cotton, for FL and FS by GWA mapping. The identified genetic variation and candidate genes deepen our understanding of the molecular mechanisms underlying cotton fibre development. The validated accessions with excellent haplotypes are valuable breeding materials to improve cultivars. Our study provides a new resource for the improvement of cotton fibre quality through biotechnology-assisted selection in future breeding efforts.

Phenotypic evaluation and statistical analyses
When mature, 20 naturally open bolls from the central part of the plants from each accession were hand harvested at each location and ginned. Fibre samples were sent to the Supervision and Testing Center of Cotton Quality, Ministry of Agriculture of China in Anyang, Henan Province for fibre property determination. Fibre quality traits, including the upper half mean fibre length (FL, mm), fibre strength (FS, cN/tex), fibre micronaire (FM), fibre uniformity (FU, %) and fibre elongation (FE, %), were measured using a high volume instrument (HVI). Statistical analyses, Pearson correlation between traits and significance analyses were conducted using SPSS 22.0 software. Differences were tested for significance at the 1% probability level.

Genotyping and SNP marker screening
Genomic DNA of each accession was extracted from young leaf tissues for genotyping using a modified CTAB method (Zhang and Stewart, 2000). A CottonSNP63K array containing 63 058 SNPs (Hulse-Kemp et al., 2015), which was recently developed by an international cotton SNP consortium, was applied to genotype the 719 accessions using the Illumina Infinium platform according to the manufacturer's protocol. All the SNP data were clustered and selectively analysed by Illumina GenomeStudio genotyping software. The SNP data set was further filtered with a calling rate < 0.85 and MAF < 0.05. For the physical localization of SNP markers, the probe sequences of the SNPs were used to perform a local BLAST (Altschul et al., 1990) search against the G. hirsutum TM-1 reference genome (Zhang et al., 2015b). SNPs that could not be assigned to a G. hirsutum chromosome were excluded from further analysis.
Population structure and association mapping analysis STRUCTURE 2.3.4 software (Evanno et al., 2005) was used to estimate the genetic structure of the population consisting of 719 accessions based on polymorphic SNPs. The numbers of hypothetical groups ranged from K = 1 to 10, using an admixture model with ten independent runs of 10 000 burn-in time and 10 000 MCMC (Markov chain Monte Carlo) replication number. The output from STRUCTURE was analysed for the delta K value (DK) in STRUCTURE HARVESTER (Earl and vonHoldt, 2011). The optimal K value was determined by the log probability of LnP(K) and delta K based on the rate of change of LnP(K) between successive K. The Q matrix was derived for the subsequent association mapping which was the result of the integration of the cluster membership coefficient of replicate runs from STRUCTURE using CLUMPP software (Jakobsson and Rosenberg, 2007). Principal component analysis (PCA) using GCTA software (Yang et al., 2011) was used to assess the population structure. PowerMarker version 3.25 (Liu and Muse, 2005) was used to construct a NJ phylogenetic tree by calculating Nei's genetic distance among individuals.
For the genome-wide association analysis, TASSEL 3.0 software (Bradbury et al., 2007) was used to determine the association between SNPs and phenotypic traits using a mixed linear model (MLM) . The Q matrix from STRUCTURE and the kinship calculated with TASSEL 3.0 were included as fixed and random effects, respectively. The LD parameter (r 2 ) between pairwise SNPs (MAF > 0.05) was estimated using TASSEL 3.0 software. The significance of the associations between SNPs and traits was based on the threshold of the Bonferroni correction for multiple tests (1/n), where n was the total number of SNPs used in the association analysis.

Supporting information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1. Correlation analysis between five traits related to fibre quality. Figure S2. Manhattan plots showing the GWAS for FL in eight environments. Figure S3. Manhattan plots showing the GWAS for FS in eight environments. Figure S4. Manhattan plots showing the GWAS for FM in eight environments. Figure S5. Manhattan plots showing the GWAS for FU in eight environments. Figure S6. Manhattan plots showing the GWAS for FE in eight environments. Figure S7. Boxplots depicting the genetic effects of SNPs with significant associations with fibre length in Dt11. Figure S8. Boxplots depicting the genetic effects of SNPs with significant associations with fibre length and strength in At07. Figure S9. Expression of all candidate genes related to fibre length and strength. Table S1. List of 719 upland cotton accessions used for association mapping. Table S2. Analysis of variance (ANOVA) results of the fibre quality traits. Table S3. List of 612 candidate genes with fibre quality traits. Table S4. KEGG analysis of all candidate genes. Table S5. Phenotypes of the fibre length and strength of 47 upland cotton accessions belonging to Hap3.