Parallel and Intertwining Threads of Domestication in Allopolyploid Cotton

Abstract The two cultivated allopolyploid cottons, Gossypium hirsutum and Gossypium barbadense, represent a remarkable example of parallel independent domestication, both involving dramatic morphological transformations under selection from wild perennial plants to annualized row crops. Deep resequencing of 643 newly sampled accessions spanning the wild‐to‐domesticated continuum of both species, and their allopolyploid relatives, are combined with existing data to resolve species relationships and elucidate multiple aspects of their parallel domestication. It is confirmed that wild G. hirsutum and G. barbadense were initially domesticated in the Yucatan Peninsula and NW South America, respectively, and subsequently spread under domestication over 4000–8000 years to encompass most of the American tropics. A robust phylogenomic analysis of infraspecific relationships in each species is presented, quantify genetic diversity in both, and describe genetic bottlenecks associated with domestication and subsequent diffusion. As these species became sympatric over the last several millennia, pervasive genome‐wide bidirectional introgression occurred, often with striking asymmetries involving the two co‐resident genomes of these allopolyploids. Diversity scans revealed genomic regions and genes unknowingly targeted during domestication and additional subgenomic asymmetries. These analyses provide a comprehensive depiction of the origin, divergence, and adaptation of cotton, and serve as a rich resource for cotton improvement.


Assessment of sequence coverage, SNP detection methods, and resequencing data quality
Simulations of resequencing depth were used to assess our dataset using our variation detection pipeline. First, three replicate samples of random reads representing different coverage levels (5×, 10×, 15×, 20×, 25×, 30× and 35×) were extracted from a high coverage sequenced dataset (SRR1536366, Supplementary Note 2 Fig. 1). Sampled reads were independently mapped to the reference genome and analyzed for SNPs. Over 97% of reads achieved a quality score >Q20.
Only uniquely-mapped, high-quality reads were retained, whereas duplicated and low mapping quality reads were removed. For the low coverage (5×) sample, which has similar depth to previous diversity assays ( [3][4][5]; Supplementary Note 2, Table 1), only about 45% of the genome was covered by more than 5 reads, and only 5% was covered by more than 10, both of which reflect a general unevenness of coverage at lower sequencing depths. For the simulated 20× samples, which is comparable to the targeted coverage of this study (Supplementary Note 2, Table 1), more than 80% of the genome was mapped by at least 10 reads (versus 5% at 5× coverage). Because both depth and evenness of read coverage can impact the number and distribution of SNPs detected, we evaluated reproducibility of SNP detection at different coverage levels (Supplementary Note 2 Fig. 1B). In the 5× datasets, only ~30% of SNPs were detected in 3 out of 3 replicate analyses and only ~60% were detected in 2 out of 3 replicates. In the 20× datasets, however, 75% of SNPs were detected in all 3 replications and 86% were detected in 2 out of 3 replicates. This indicates that higher sequence coverage results in increased robustness and reproducibility of SNP detection. Methods for SNP detection can be found in the main manuscript and at https://github.com/Wendellab/BYUReseq.
We assessed coverage of all candidate datasets (1,432) (Supplementary Note 2 Fig. 2), and parsed these results by data origin (i.e., sequenced here or downloaded from the SRA). The effective depth of this study was 22.5x coverage, more than 4-and 10-fold than the previous reports of [6] and [7,8] . More than 85% of the reference genome had 10x coverage for samples sequenced in the present study, compared to 25% and 7% for [6] and [7,8], respectively (Supplementary Note 2, Table 1). After the initial round of SNP identification (see main methods), 408 of these low coverage accessions were removed from further analyses due to a significant number of missing SNPs ( > 25%, Supplementary Note 2 Fig. 3). Notably, most of these were from the publicly available datasets. By removing these low-coverage samples and increasing the coverage of wild/landrace accessions (Supplementary Note 2 Fig. 3), we were able to detect 3-fold more variation than previously reported [3,7,8].  Fig. 1. The effect of sequencing depth on variation identification. A) Effective cumulative coverage distribution (y-axis) of different coverage datasets (x-axis). The right-pointing arrows indicate ~45% and ~5% of the reference genome that was aligned or 'covered' by more than 5 and 10 reads, respectively, in the 5× dataset. The left-pointing arrows indicate ~93% and ~80% of the reference genome that was covered by more than 5 and 10 reads, respectively, in the 20× dataset.  [6]. The dark-blue slope line represents the average coverage for that study. The green lines represent the percentage of reference genome coverage (y-axis) at each coverage depth (x-axis) for individual accessions of the study (NAU) [7,8]. The dark-green slope line represents the average coverage for that study.

Summary of genetic variation in different genomic regions
We identified 53.7 million SNPs in the seven species of Gossypium (AD 1 -AD 7 ; Supplementary Note 3, Table 1 and Fig. 1). Approximately half of the genic SNPs were found in introns,  Table 2). Most InDels were smaller than 3 bp (Supplementary Note 3, Table 3). Most InDels (77.27%) were found in intergenic regions (9.79% upstream, 6.47% downstream, and 6.48% in gene regions, Supplemental Note 3, Table 4), which was lower than the percentage of intergenic SNPs (88.65%). In gene regions, 63.55% of the InDels were located in introns. There was a higher percentage of InDels identified upstream/downstream of genes (putative regulatory regions) compared to the percentage of identified SNPs.
Nucleotide diversity (π) was calculated for each population, as per the main methods, and diversity between G. hirsutum and G. barbadense was compared overall (Supplementary Note 3, we verified that this imbalance did not influence our estimates of π. We calculated π for ten replicates of 100 randomly chosen vcf files from each species, and used calculated with VCFtools, as previously noted. The maximum, minimum, and average π values were calculated for each replicate (Supplementary Note 3, Table 5), and these values were compared to the values recovered for the entire dataset. Fixation index (F ST ) is a measure of population differentiation, genetic distance, based on genetic polymorphism data [9]. We measured the F ST value using VCFtools [10] in 100-kb windows with sliding steps of 20 kb (Supplementary Note   3, Fig. 6).
Relationships among accessions of G. hirsutum and G. barbadense were also evaluated using phylogenetics and principal component analysis using the same parameters as outlined above. In both cases, accessions of G. mustelinum were used as an outgroup. Both the phylogeny (Supplementary Note 3, Fig. 7) and PCA (Supplementary Note 3, Fig. 8) for G. hirsutum suggest a four populations that include a wild population, the modern cultivars, and two distinct landrace populations. Relationships within G. barbadense were likewise divided into four populations, including two distinct landraces, by phylogeny (Supplementary Note 3, Fig. 9) and PCA (Supplementary Note 3, Fig. 10

Methods
Putative regions of selection were identified by identifying genomic regions where nucleotide diversity exhibited the greatest reduction and where F st between landrace 2 and the cultivar was the greatest. As per the main methods, VCFtools v0.1.13 [10] was used to measure each in 100kb windows sliding 20kb. Candidate domestication-sweep windows were identified as the top 5% genomic regions exhibiting the greatest reduction in diversity (π L /π c [10] values and the top 5% of regions with the greatest F st between landrace and cultivar (Supplementary Note 4, Fig. 1). Raw RNA-seq reads for multiple, bulked tissues were downloaded from NCBI (Project ID: PRJNA490626) [11]. These accessions include bulked RNA from roots, stems, leaves, and various reproductive organs of G. hirsutum accession TM-1 and G. barbadense accession Hai7124 [11]. Raw reads were cleaned by SOAPnuke v1.5.2 [12] and subsequently aligned to the reference TM-1 genome [13] using STAR v2.7.1a [14]. Quantification of gene expression was performed with Cufflinks version v2.2.1 [15]. To detect tissue-dominant or tissue-specific expression, we performed an enrichment test with TissueEnrich [16]. Gene expression in wild and cultivated G. hirsutum fiber (Supplementary Note 4, Fig. 2) was also compared using previously generated results [17]. Expression heatmaps were drawn using the online tool ClustVis [18]. Gene in putative regions of selection were cross-referenced with fiber expression for G. hirsutum (Supplementary Note 4, Fig. 3), and GO enrichment categories were identified for all genes in regions of selection using the R package clusterProfiler [19] (Supplementary   Note 4, Fig. 4). Functional annotations were derived from the release on CottonGen (https://www.cottongen.org/species/Gossypium_hirsutum/jgi-AD1_genome_v1.1).
To test whether the length of overlapping selected regions is greater than expected by chance, we performed a permutation test to randomly generate "selected regions" for G. hirsutum and G.
barbadense, which maintained the number and size of regions as characterized in this dataset.
Permutations were performed 1000 times to generate the null distributions of overlapped regions by region number and length. While the number of selected regions that overlap (22) was observed was within the distribution (rank in 58%), the total length of overlap (3.1 Mb) was greater than expected, occurring in less than 2.5% of the random permutations. Introgression was identified by scanning and categorizing SNPs identified between wild G. hirsutum and wild G. barbadense that did not have a history of interbreeding. Generally, we followed the methods recently described for the detection of homoeologous conversion [20]. Briefly, we created an SNP-index of nucleotides that were diagnostic of G. hirsutum and G. barbadense using multiple, truly wild accessions (36 wild G. hirsutum and 46 wild G. barbadense). Because of our depth of sequence, we used a robust 10x coverage threshold for each base throughout the pipeline. Ignoring heterozygosity, we compiled the consensus nucleotide (MAF > 0.10 and missing rate between accessions <40%) for each position and from each wild species into a joint species-specific SNP-index capable of distinguishing G. hirsutum SNPs from G. barbadense. This SNP-index was used to identify putatively introgressed bases along each chromosome by characterizing each SNP in each accession as G. hirsutum or G. barbadense-like based on the below pipeline.

Supplementary Note 4,
Assuming that each paired-end read originated from a single DNA molecule, we scanned the read alignment file (e.g. bam file) for each accession and partitioned read pairs into separate alignment files for reads with G. hirsutum (H) nucleotides or reads with G. barbadense (B) nucleotides based on the SNP-index. For this, we used PolyCat (BamBam package, [21]), which analyzes paired-end reads concurrently and uses a default of 75% agreement among SNPs to place read pairs into a category. A second program from BamBam [21], i.e., eflen, was used to identify regions of introgression based on a minimum coverage (i.e., 10x) of SNPs and immediate adjacency (within 500 bp) based on read overlap, which were subsequently written as annotations (i.e., gff files). Small blocks were then merged via awk (https://github.com/Wendellab/BYUReseq) if the adjacent block was within 30 kilobases (kb) and shared the same origin (i.e., G. hirsutum or G. barbadense).
As a controlled test for introgression detection methods we used three previously reported nearisogenic lines (NIL) containing an introgressed block of G. barbadense into G. hirsutum (Supplementary Note 5, Table 1; [23]. These NIL lines used as controls were sequenced at much lower sequence coverage requiring a lower threshold of 3x read coverage throughout the detection pipeline. Even with lower coverage, all three homozygous, introgressed blocks were successfully identified (Supplementary Note 5, Figures 1-12). The small heterozygous block in N17 D08 was not detected (Supplementary Note 5, Fig. 8). In this region, our method detected half the coverage of 'B' bases in the introgressed region (~1.5x coverage) compared to the homozygous regions, putting the coverage level below our detection limits.