Divergence in gene expression within and between two closely related flycatcher species

Abstract Relatively little is known about the character of gene expression evolution as species diverge. It is for instance unclear if gene expression generally evolves in a clock‐like manner (by stabilizing selection or neutral evolution) or if there are frequent episodes of directional selection. To gain insights into the evolutionary divergence of gene expression, we sequenced and compared the transcriptomes of multiple organs from population samples of collared (Ficedula albicollis) and pied flycatchers (F. hypoleuca), two species which diverged less than one million years ago. Ordination analysis separated samples by organ rather than by species. Organs differed in their degrees of expression variance within species and expression divergence between species. Variance was negatively correlated with expression breadth and protein interactivity, suggesting that pleiotropic constraints reduce gene expression variance within species. Variance was correlated with between‐species divergence, consistent with a pattern expected from stabilizing selection and neutral evolution. Using an expression PST approach, we identified genes differentially expressed between species and found 16 genes uniquely expressed in one of the species. For one of these, DPP7, uniquely expressed in collared flycatcher, the absence of expression in pied flycatcher could be associated with a ≈20‐kb deletion including 11 of 13 exons. This study of a young vertebrate speciation model system expands our knowledge of how gene expression evolves as natural populations become reproductively isolated.

The gene annotation system is evidence-based; all protein coding and non-coding RNA gene models are supported by biological sequences from public databases. Input data for the protein coding gene models came from UniProt, data generated in this study and the Ensembl release 71 and 68 databases for chicken and zebra finch respectively. Data from each source were aligned to the genome and filtered in order to generate gene models.

Homology pipeline
Coding models were generated using data from related species. The genomic positions of the BLAST hits against the UniProt database were passed to GeneWise to build protein-coding models. This allowed GeneWise to run over a much reduced search space and create a slice-aware alignment of each target protein on the genome. This step generated a total of 194,425 initial gene models.
In addition, the longest protein coding translation was retrieved for each Ensembl gene for chicken (e71) and zebra finch (e68). These sequences were aligned to the flycatcher genome using Exonerate, generating 32,060 models. As Exonerate is a fast, splice-aware it was possible to run the entire set proteins across the genome, selecting the 'best in genome alignment' in each case. This method provided a useful alternative strategy to GeneWise.

RNA-seq pipeline
RNA-seq data used in the annotation comprised of paired end data from samples including: brain, embryo, kidney, liver, lung, muscle, ovary, skin, testis and a merged set of all nine organs from a number of individuals. The available reads were aligned to the genome using BWA. The Ensembl RNA-seq pipeline was used to process the BWA alignments and create further split read alignments using Exonerate. The RNA-seq pipeline produced 174,504 gene models in total. The predicted open reading frames were compared to UniProt Protein Existence (PE) classification level 1 and 2 proteins using BLAST. Models with poorly scoring or no BLAST alignments were split into a separate class and not used in the final gene set.

Final protein-coding geneset creation:
The models produced by the RNA-seq and homology pipelines were filtered to select the highest quality models at each genomic position. Highest preference was given to RNA-seq models that were verified by a protein alignment covering 80-100 percent of the candidate ORF and with a percent identity of >= 80. Equal preference was given to models built from the homology pipeline based on bird proteins from SwissProt with PE level 1 or 2. Models from RNA-seq with poorer protein alignments, non-bird vertebrate models and models coming from Ensembl longest translation alignments were given lower precedence whenever a conflict in model structure occurred at a genomic position. Using this system the lower quality or redundant models were filtered out and a final geneset was created. UTR was added whenever possible using information from overlapping RNA-seq models.     Spearman's ρ between collared flycatcher organs, the upper right half between pied flycatcher organs.

Metabolic chains
Brain ρ = 0.113 p < 2.2 · 10 -16 ρ = 0.057 p = 1.4 · 10 -5 ρ = -0.223 p < 2.2 · 10 -16 Kidney ρ = 0.045 p = 3.8 · 10 -4 ρ = -0.        shows data from brain as a representative example organ. n ind , number of individuals of both species a gene was detected in; n genes , number of genes expressed in that category; n uniq , number of genes in that category which was uniquely expressed in one of the species. Boxes represent quartiles, whiskers the minimum of either 5x the interquartile range or the maximum of the distribution.