Adaptive sequence evolution is driven by biotic stress in a pair of orchid species (Dactylorhiza) with distinct ecological optima

Abstract The orchid family is the largest in the angiosperms, but little is known about the molecular basis of the significant variation they exhibit. We investigate here the transcriptomic divergence between two European terrestrial orchids, Dactylorhiza incarnata and Dactylorhiza fuchsii, and integrate these results in the context of their distinct ecologies that we also document. Clear signals of lineage‐specific adaptive evolution of protein‐coding sequences are identified, notably targeting elements of biotic defence, including both physical and chemical adaptations in the context of divergent pools of pathogens and herbivores. In turn, a substantial regulatory divergence between the two species appears linked to adaptation/acclimation to abiotic conditions. Several of the pathways affected by differential expression are also targeted by deviating post‐transcriptional regulation via sRNAs. Finally, D. incarnata appears to suffer from insufficient sRNA control over the activity of RNA‐dependent DNA polymerase, resulting in increased activity of class I transposable elements and, over time, in larger genome size than that of D. fuchsii. The extensive molecular divergence between the two species suggests significant genomic and transcriptomic shock in their hybrids and offers insights into the difficulty of coexistence at the homoploid level. Altogether, biological response to selection, accumulated during the history of these orchids, appears governed by their microenvironmental context, in which biotic and abiotic pressures act synergistically to shape transcriptome structure, expression and regulation.

2 treatment) and directional Illumina sequencing as 100bp paired-end reads was performed at the CSF Vienna (www.csf.ac.at/ngs). Samples fB4, iB0 and iS4 (Table 1) were sequenced as half lanes, whereas sample fP7 has been sequenced as two half lanes. Six other samples were sequenced as full lanes.
After controlling read quality with FastQC v0.11.2 (available from http://www.bioinformatics .babraham.ac.uk/projects/fastqc/), filtering of the raw reads was done with Trimmomatic v0.32 (Bolger et al., 2014) with default settings, except for increasing the threshold for average quality per base to 20 when scanning the reads over four-base sliding windows, and only retaining adaptor-free, high-quality reads that are at least 50 bases long. After quality filtering 78.6% of the read pairs were retained (Table S1). We further used only the pairedend filtered reads, as the single-end files produced by Trimmomatic contained only a negligible amount of orphan reads. The reads were finally filtered against a database of all whole genome Virus and Bacteria from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ and ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ version 2014-07-03) by using Bowtie2 v2.2.3 (Langmead and Salzberg, 2012) with default setting. Less than 1% of reads were found to be potential contamination from viruses or bacteria.

De novo transcriptome assemblies and annotation
The transcriptome of each sequenced accession was individually assembled from its cleaned reads using Trinity version r20140413 (Grabherr et al., 2011;Haas et al., 2013) with a kmer setting of 25, strand-specificity (i.e., FR orientation) and retaining only contigs with a minimum length of 200 bp. One individual reference per species has been then retained based on the total length, N50 and the percentage of reads that uniquely mapped back to each reference by using the CLC Genomics Workbench v7.5 (Qiagen) with strand-specificity, a minimum similarity fraction of 0.95 over at least 0.95 of the length and mismatch/insertion/deletion costs of 2/3/3. A multi-individual assembly was also attempted for each species, however this failed to extend the N50 length significantly or increase the back-mapping rate (results not shown), and was discarded to avoid the risk of chimeric contigs.
Ecologically-relevant genes are expected to be generally expressed at low rates, and hence may be missed in individual assemblies. To account for this, we have retained for each accession the un-mapping reads when aligning them to the selected species-specific assembly, and these reads were further combined per species and were re-assembled with Trinity with 3 the same settings as above. The D. fuchsii, and, respectively, D. incarnata "lowly expressed" contigs were then added to their specific reference. A common reference for the two species has been constructed by pooling together the two individual references. We finally removed redundancy from each of the three constructed references by using the clustering algorithm of CD-HIT-EST (Fu et al., 2012) with a global identity of 80% over at least 70% of the length of the shorter sequence (i.e., -aS 0.7) and by comparing only the 5'-3' strands (i.e., -r No). The final references are hereafter referred to as "f_reference", "i_reference" and the Dactylorhiza (i.e., "f_i") reference.
The individual Trinity assemblies contained between 7,686 (fA6 , Table S1) and 70,193 contigs (iS8), with N50 ranging from 303 (for fA6) to 472 bp (for fB4). Based on the richness (i.e., the total length, Table S1), contiguity (i.e., N50 estimates) and completeness (i.e., mapping rate) we retained the assemblies of fB5 and iS8 as the best representative transcriptomes for D. fuchsii and respectively D. incarnata. In order to get better representation of the lowly expressed genes, species-specific Trinity assembly of all D. fuchsii and, respectively, D. incarnata reads that did not map to these reference transcriptomes were performed and resulted in 3,113 and 2,064 transcripts for D. fuchsii and, respectively, for D. incarnata. After combining the different assemblies and removing redundancy as explained above, the final combined fi_reference contains 33.8 Mbp within 101,010 transcripts (52.3% originated from D. incarnata and 47.7% from D. fuchsii; Table S2). This transcriptome will be of high value for further studies of gene expression alterations following whole genome duplication events in Dactylorhiza.
Functional annotation analyses of the fi_reference were performed in Blast2Go v.3.2.7 (Conesa and Götz, 2008) using cloud-based NCBI BLAST+ searches against the Viridiplantae database v. 17.01.2016 under the BLASTX algorithm and a minimum e-value of 10 -6 . Blast2Go was further used to assign gene ontology (GO) terms to the contigs, to identify signatures of protein domains using InterProScan (Quevillon et al., 2005), and to annotate non-coding RNAs and cis-regulatory elements by performing Rfam scans against Xfam servers. Of the 101,010 transcripts in the fi_reference, 42.2% had significant BLASTX hits in Blast2Go. After mapping these contigs and integrating InterProScan results, 30,046 contigs were successfully annotated with GO terms (Fig. S1). In addition, 121 contigs received annotations from the Rfam scan, with 80 of them representing an rRNA type and 16 intronic regions.

Small RNA library preparation and sequencing
Isolation of RNA with smRNA enrichment was performed with the mirVana miRNA Isolation Kit (Life Technologies) following the manufacturer's instructions. Up to 90 mg of the same tissue fixations as for the RNAseq experiment were used. The concentration of the raw smRNA isolates was measured with a Quant-iT Picogreen dsDNA assay (Invitrogen) and a NanoDrop Fluorospectrometer ND-3300 (Thermo Scientific). The quantification and quality of the RNA isolate was then confirmed with a 2100 Bioanalyzer (Agilent Technologies) using a small RNA analysis kit. The RNA extracts were denatured with loading buffer at 95°C for 3 min and further purified by gel size selection in a XCell SureLock Mini-Cell (Life Technologies) using a microRNA Marker (NEB) and 15% TBE-Urea pre-casted gels (Life Technologies) stained with 2x SYBR Green II (Life Technologies) for 1 h at RT. The smRNA samples were re-isolated from gel slices by overnight incubation in 0.3M NaCl at 4 °C, followed by filtration through Ultrafee-MC Durapore Filter Units 0.22µm (Merck Millipore) and precipitation for 2 h at -20 °C in 2.5x volume 100% ethanol, together with 1 µl GlycoBlue (Ambion). After resuspension in RNase-free water the smRNA isolates were again checked on a Bioanalyzer small RNA chip. For individual iB6 the isolate has yielded insufficient smRNA quantity and as no tissue was further available, the sample has not been processed further. The smRNA libraries were prepared with the NEBNext Multiplex Small RNA Library Prep Set for Illumina (Set 1 and 2, NEB) following the manufacturer's protocol, except for diluting the adapters 1:2 prior to use due to low sample concentration (130<mg) and using only 1.5 µl instead of 2.5µl of the primers. The samples were purified on a 1.5 % agarose gel with 1x TBE buffer. The DNA was extracted from the gel slices with a MinElute Gel Extraction clean-up Kit (Qiagen) and eluted in water. Illumina sequencing as a half-lane of 50 bp single-end reads was performed at the CSF Vienna (www.csf.ac.at/ngs). The individual samples were demultiplexed with the BamIndexDecoder tool of the Illumina2Bam software collection (available from http://gq1.github.io/illumina2bam/) and adapter sequences have been removed using Trimmomatic v0.30 (Bolger et al., 2014). After demultiplexing and adapters removal, the total number of reads longer than 15 nucleotides averaged ca. 9.9 million (std. 5.3 million) across all samples of the two species (D. fuchsii mean = 13.2, std. = 4.2; D. incarnata mean = 5.8, std. = 3.4). Table S1. Summary of individual RNAseq Trinity assemblies. The uniquely and, respectively, total mapping rates refer to mapping with CLC Genomics Workbench of the reads of one sample to the assembly produced by those respective reads with Trinity. Supporting Figure S1. Map of the native localities of the Dactylorhiza samples analysed here. The samples were transplanted in a common garden at least one growing season before the material was fixed for analyses.        Bubble size is proportional to the frequency of the respective term in the public GO database. The colour represents the log10 value of the significance of the Fisher's test of enrichment, corresponding to the indicated scale. C, compound; Ac, activity; SAc, synthase activity; Bi, binding; Sy, synthesis; Pr, process; MPr, metabolic process.