Recent advances in sequencing technology promise to provide new strategies for studying population differentiation and speciation phenomena in their earliest phases. We focus here on the black carrion crow (Corvus [corone] corone), which forms a zone of hybridization and overlap with the grey coated hooded crow (Corvus [corone] cornix). However, although these semispecies are taxonomically distinct, previous analyses based on several types of genetic markers did not reveal significant molecular differentiation between them. We here corroborate this result with sequence data obtained from a set of 25 nuclear intronic loci. Thus, the system represents a case of a very early phase of species divergence that requires new molecular approaches for its description. We have therefore generated RNAseq expression profiles using barcoded massively parallel pyrosequencing of brain mRNA from six individuals of the carrion crow and five individuals from a hybrid zone with the hooded crow. We obtained 856 675 reads from two runs, with average read length of 270 nt and coverage of 8.44. Reads were assembled de novo into 19 552 contigs, 70% of which could be assigned to annotated genes in chicken and zebra finch. This resulted in a total of 7637 orthologous genes and a core set of 1301 genes that could be compared across all individuals. We find a clear clustering of expression profiles for the pure carrion crow animals and disperse profiles for the animals from the hybrid zone. These results suggest that gene expression differences may indeed be a sensitive indicator of initial species divergence.
Hybrid zones, often referred to as natural laboratories of evolution, are ideal systems to unveil the barriers that keep gene pools from merging. The European crow hybrid zone is a classic textbook example of hybridization between the all black carrion crows (CC) (Corvus [corone] corone) and the grey breasted and coated hooded crows (HC) (Corvus [corone] cornix) (Stearns & Hoekstra 2005; Ridley 2007). Morphological, ecological and behavioural evidence suggest almost complete reproductive isolation between these semispecies (Moore 1977; Saino 1992; Saino et al. 1998). The zone of overlap and hybridization is relatively narrow (15–150 km, e.g. Saino 1992; Rolando 1993; Kryukov & Blinov 1994; Randler 2007) and apart from minor shifts it has been stably maintained for at least 100 years (Meise 1928; Haas & Brodin 2005). It coincides roughly with other suture zones that have arisen by secondary contact after the last glaciation periods (Hewitt 2000). It has thus been hypothesized that carrion and hooded crows diverged in allopatry during the Quaternary and came into secondary contact in the Holocene (see e.g. Mayr 1942). Hybridization of strongly diverged genomes would lead to genomic incompatibilities by negative epistasis and the disruption of co-adapted gene complexes (Bateson–Dobshansky–Muller model; Bateson 1909). Consistent with these ideas, carrion and hooded crows have been considered semispecies or even full species that will diverge further to a state where reproductive isolation is complete (reviewed in Parkin et al. 2003). This reasoning predicts that hybrids be less viable and the dynamics of the hybrid zone follow the well-established model of a tension zone where dispersal and hybrid fitness are the predominant parameters (Barton & Hewitt 1985). Consistent with this model and observations in other hybrid zones, trait variability seems to be higher within the zone than in the putatively pure populations (egg volume & fledging success: Saino & Bolzern 1992; clutch size: Saino & Villa 1992; genetic variability: Spiridonova & Kryukov 2004). Reinforcement of prezygotic isolation that has also been shown to be of importance in other avian hybrid zones (Saetre et al. 1997) is ubiquitous across the entire hybrid zone (reviewed in Randler 2007) and fits the idea of an evolutionary response to lowered hybrid fitness.
There is, however, accumulating evidence that unhinges this explanatory model. Observations of assortative mating can also be interpreted as a by-product of sexual imprinting common to most bird species (Immelmann 1975). This idea has been used by modelling approaches. They convincingly suggest that the hybrid zone dynamics can be explained exclusively by sexual imprinting without having to invoke intrinsic postzygotic isolation (Brodin & Haas 2006, 2009). Moreover, while there are some reports on hybrid disadvantage (Saino & Bolzern 1992), the evidence is not compelling and continuous variation in plumage colour suggests that hybrids successfully reproduce and backcross into the parental populations. A recent phylogeographical analysis further challenges the traditional view of genomic incompatibilities. Using the mitochondrial control region as a genetic marker, Haring et al. (2007) find no genetic differentiation between carrion and hooded crows in Europe, which is consistent with earlier mitochondrial studies (Kryukov & Suzuki 2000), restriction fragment length polymorhism (RFLP) analysis (Kryukov et al. 1992) and allozyme markers (Saino et al. 1992). Haas et al. (2009) have studied 18 microsatellite loci, and although they find a significant FST differentiation between their population samples, the differences between the species are not larger than differences within the species. In parallel, they have studied a candidate gene for plumage colour (MC1R), but find also no differentiation at this locus. This suggests divergence times in the range of a few thousand years rather than millions of years. Such rapid morphological divergence in the absence of significant molecular differentiation is not uncommon in birds (Pavlova et al. 2003; Zink et al. 2005; Miláet al. 2007). This raises the interesting prospect that we are faced with the challenge to explain almost complete reproductive isolation, evidenced through morphology, ecological niche, behaviour and assortative mating in a background of low genetic differentiation. This system offers therefore the opportunity to study the initial stages of speciation.
The recent advent of new sequencing techniques allows genomic investigations in evolutionary model systems where only scarce genomic resources are available (see Elmer et al., 2010). At present, exploration of the transcribed subset of the genome seems most promising (Vera et al. 2008). The transcriptomal sequence space bears several analytical advantages. The shear amount of sequence data is reduced to a smaller fraction and cleaned of repetitive and heterochromatic sequences making it amenable for de novo assembly (see Künstner et al., 2010). It constitutes the functional backbone of the genome, and together with the sequence information one obtains a digital measure of gene expression (see Fontanillas et al., 2010 and Harr & Turner, 2010). In addition, the restriction to the more strongly conserved coding sequences makes the successful assignment of contigs to genes in reference species easier.
It has also long been argued that gene expression differences may be of prime importance in species differentiation (King & Wilson 1975; Carroll et al. 2004), but the evidence remains controversial (Hoekstra & Coyne 2007; Wray 2007). The major argument for a special role for regulatory differences comes from the fact that many genes have tissue-specific enhancer elements and changes in these would be expected to have fewer pleiotropic effects on gene function than changes in protein sequence. Thus, fast divergence could be achieved at this level. Furthermore, it has been suggested that expression differences can be a contributor to Bateson–Dobshansky–Muller incompatibilities, at least in Drosophila species (Haerty & Singh 2006). However, whether this is also the case for very recently diverged species is still open.
Here, we revisit the degree of genetic differentiation at the DNA sequence level and explore expression divergence in populations of carrion and hooded crows. We used traditional Sanger sequencing of introns for a population sample and massively parallel pyrosequencing for a subset of individuals from both taxa. This is a pilot study that explores how well genomic resources from the model organisms zebra finch (Warren et al. in prep.) and chicken (Hillier et al. 2004), that have diverged from the crow lineage ∼50 mya (Barker et al. 2004) and ∼100 mya ago (van Tuinen & Hedges 2001) respectively, can be used. Further, we describe how RNAseq can be applied to obtain robust estimates of expression levels from large EST data sets and how candidate genes of expression divergence can be found. We finally discuss the further potential of next generation sequencing on expression studies in non-model organisms and reflect the results in the speciation framework outlined above.
Materials and methods
Sanger sequencing of autosomal introns
Genomic DNA was extracted from bone marrow of tibiotarsi of freshly killed birds using DNeasy® tissue kit (Qiagen™). In total, the genomic sample corresponds to 26 CC and 27 HC. It is comprised of 11 CC from western Germany sampled close to the Dutch border (51.75°N, 6.24°E), nine CC from western Germany close to the city of Dortmund (51.57°N, 7.38°E), 20 HC from Poland close to Warsaw (52.23°N, 21.01°E) and one Swedish HC individual sampled close to the city of Uppsala (59.93°N, 17.76°E). It further contains six CC from western Germany sampled south of Cologne (55.66°N, 6.79°E) and 6 HC samples from Northern Ireland in the area of Belfast (54.31°N, 5.62°W). This subset of 12 individuals was also used for the expression analysis (see below). Although crows in Ireland are generally described as a pure hooded crow population (but see Skelton 1993), it appears that the Scottish hybrid zone extends into Ireland. Pure HC have uniformly grey feathers on the mantle, breast, upper and under tail coverts. Although we focused on animals that resembled the pure HC expectations most closely, some individuals with dark ingrown tail coverts and darker feather rachises on mantle feathers are included.
Introns of 27 autosomal genes chosen from a reference set of consensus anchored tagged sequences spread across the avian genome were amplified for all individuals (Backstrom et al. 2007). The list of markers can be found in Table 1. After PCR amplification (Qiagen™ multiplex PCR®), products were purified by ExoSap-IT® (USB corporation) and sequenced on an ABI 3730 sequencer employing the Bigdye® Terminator 3.1 cycle sequencing Kit (Applied Biosystems). Quality control, sequence alignment and SNP scoring were conducted with the software CodonCode Aligner Version 3.0.1. (CodonCode Corporation). Consistent with results obtained for other genetic markers (see Introduction), there were no diagnostic differences and hardly any observable differentiation between the two taxa. We therefore reconstructed haplotypes from a pool of all chromosomes (at least 94) with PHASE (Stephens et al. 2001; Stephens & Donnelly 2003) as implemented in DnaSP 4.90.1. (Rozas et al. 2003). Two loci (those with the lowest sequencing quality) showed significant deviation from neutrality and were excluded from the analyses leaving a total of 16 780 nt of sequence information spread across 25 loci (see Table 1). Subsequently, we estimated population differentiation as FST (Hudson et al. 1992) between the two taxa for each locus separately including all 26 pure CC and 21 pure HC individuals (potential hybrid individuals from Ireland were excluded). We further constructed a matrix of pairwise π values for the subset of 12 individuals for all possible intra- and inter-individual comparisons (using C++ library pi_stat: Dutheil et al. 2006). In contrast to locus specific estimates used for the population sample, the concatenated 15 934 bp sequence of all but one locus (for which one individual could not be sequenced) was used for this purpose, as null values are frequent in pairwise individual comparisons leading to a downward biased estimates. Diagonal entries in the matrix represent individual heterozygosity (πwithin: average per site nucleotide difference between the two chromosomes of one diploid individual), and all non-diagonal entries exclusively reflect mean nucleotide divergence per site between two individuals (πbetween: mean of all four possible inter-individual comparisons). Following Hudson et al. (1992), we then calculated a pairwise distance measure for each pair of individuals as FST = 1 − πwithin/πbetween. This yields a distance matrix based on sequence information that can be compared with a distance matrix obtained from expression data (see below).
Table 1. Population genetic measures for a set of 25 intronic markers sequenced with the traditional Sanger method
Orthologous locus in chicken
Genbank accession numbers
No. sites [nt]
Sample size [chromosomes]
Average nucleotide diversity per site [π × 10−3]
Polymorphism information is based on bi-directional sequences (except locus ENSGALT00000003977, which constitutes a single large intron read from both ends without overlap). CC, carrion crow; HC, hooded crow.
Expression analyses were conducted with the subset of six CC from Western Germany and six HC from Northern Ireland described above. Brain tissue was harvested from freshly killed crows and subsequently stored in the RNA preservation buffer (RNAlater from Qiagen™). CC were hunted with peregrine falcons and HC with the rifle. Owing to field conditions, no specific dissection of brain regions was possible nor was it possible to use the entire brain. Nonetheless, the samples used here all stem from the same general region of the hindbrain. After removing the preservation buffer, total RNA was extracted in TRIzol® (Invitrogen™, Carlsbad, CA, USA) according to the manufacturer’s instructions. RNA quality and integrity was assessed on an 2100 Bioanalyzer using the RNA 6000 Nano Kit (Agilent Technologies) by visual inspection of electropherograms and with the help of RIN scores (criterion RIN > 8.7). rRNA was removed from the high quality samples using the MiniMACS™ magnetic bead column system from Miltenyi Biotec including a short step of DNAase treatment (rDNAse1, Ambion Inc.). After quantifying the mRNA and controlling for rRNA contamination ∼1 μg of each sample was used for GS-FLX library preparation.
A double stranded non-normalized cDNA library was created by reverse transcription with the SMART™ PCR cDNA Synthesis Kit from Clontech. To increase the amount of starting material, the resulting double stranded cDNA was amplified in 14 PCR cycles. A library was then constructed for each of the 12 individuals that were individually tagged with the standard multiplex identifiers supplied by Roche. After all libraries were prepared according to the manufacturer’s protocol, samples for each species were pooled in equal amounts and pyroysequenced on a Roche 454 FLX sequencer using the standard manufacturer’s protocol. One female HC sample yielded hardly any sequence information and was discarded from subsequent analyses leaving a total of six CC and five HC for expression analysis. Gender was determined by genotyping the alleles of the chromo-helicase-DNA-binding gene (which are sex chromosome specific) on an ABI 3730 sequencer (Griffith et al. 1998). The CC sample consists of two females and four males, the HC sample of three females and three males. Raw read data were deposited at the NCBI short read archive under the following accession numbers: study (SRP000770)/experiment (SRX006755-56)/samples (SRS004190-91)/runs (SRR019143-44).
Estimation of expression levels
After removal of vector sequence and identity tags, raw reads from all individuals were assembled de novo into ‘contigs’ with newbler 2.0.00.20. Homology with zebra finch and chicken genes was established by blast analysis of the contigs against the cDNA library of both species. However, 5′ and 3′ UTR regions are only annotated for a subset of transcripts in zebra finch (8.02%, 19.65%) and chicken (46.52%, 56.13%, data from Ensembl, database version 54). To make best use of all reads (including those in the UTR regions), we constructed an in silico cDNA library for both zebra finch and chicken. We downloaded protein coding sequences from Ensembl using BioMart and added 600 bp of genomic sequence at the 5′ end and 3000 bp at the 3′ end. These numbers were derived from the distribution of UTRs of fully annotated cDNAs in chicken (5′/3′ respectively: min: 2/2 median: 77/314 Q95%: 353/1662 Q99%: 575/2989 max: 3647/5591). These in silico cDNA libraries were then subjected to blast searches of all contigs. To maximize the number of orthologues, we conducted four blast searches in the following order: blastn against zebra finch, tblastx against zebra finch, blastn against chicken and tblastx against chicken. All contigs with blast hits of e-value of greater than 10−10, with multiple hits to more than one gene that differed by an e-value of 105 or with no significant reciprocal blast hit were discarded. Adding only 1000 nt at the 3′ ends of CDS as suggested by the length of zebra finch UTRs (5′/3′ respectively: min: 2/2 median: 120.5/347 Q95%: 436/683 Q99%: 569/870 max: 716/3104) yielded significantly fewer orthologues gene hits, possibly indicating that a non-negligible number of crow UTRs extend over several kb, or that the UTR annotation of the zebra finch is incomplete or biased.
After the establishment of orthology, reads of each individual were separately mapped against all contigs and their number recorded. Read numbers of non-overlapping contigs of the same orthologous genes were summed. Contigs without orthology information were discarded, as it would not be possible to ascertain that they are from different genes. The number of reads for an orthologous gene is the most basic expression measure. This measure was then standardized by zebra finch CDS length (see below) and the number of initial reads for each individual (=number of reads that passed the initial quality control of the newbler 2.0.00.20 software). Standardized expression levels are quantified as a normalized measure of exonic read density and overall read coverage of individuals. Let ai denote the total number of initial reads in the i-th animal, and gi,j the number of reads mapped to contigs that were annotated as the j-th gene in animal i. We further call lj the average length of zebra finch CDS encoded by gene j. Then the expression level of gene j in animal i was computed as
where t is simply a scaling factor. With t =1, this quantity represents the number of reads per bp exon per read. To have a more practical quantity, we set t to 109. ei,j then represents the number of reads per kb exon (103) per million reads (106) or RPKM (following Mortazavi et al. 2008). RPKM measures spanned over five orders of magnitude (min: 0.88, max: 27 770), which is comparable to other studies implementing RNAseq (Mortazavi et al. 2008) and were log10 transformed. Standardization by transcript length only makes sense if reads are uniformly distributed across transcripts and no 3′ UTR bias is introduced by polyA based mRNA enrichment and reverse transcription during library preparation. We therefore conducted a coverage distribution analysis on the subset of ∼20% genes with known UTRs in zebra finch, again making the assumption that UTR length is roughly comparable in crows. We projected every transcript onto 100 bins and calculated the proportional coverage for each bin. The weighted average of these proportions (weighed by the coverage of each gene) allows inspection of the read distribution. For short (250–746 nt = 25% quantile to median of zebra finch transcript length), medium sized transcripts (747–1318 nt = 75% quantile) and long transcripts (1319–6890 nt), reads are roughly equally distributed over the transcript and are only slightly underrepresented at the 5′ and 3′ UTR ends (Fig. S1).
Differential expression analysis
All descriptive and inferential statistical expression analyses were programmed in r 2.8.0 (Ihaka & Gentleman 1996). For the core set of genes expressed in all 11 individuals, we calculated Pearson correlation coefficients for all 55 pairwise comparisons between individuals. It is important to note that length standardization introduces a bias in correlation analysis leading to artificially high correlation levels (see Fig. 1). This is because of the fact that division by CDS length performed on the original data is equivalent to subtraction in the log10 transformed data. In pairwise correlations between individuals, this leads to rearrangement of data points each being shifted by the same amount (proportional to gene length) for both individual x and y. Under the assumption of homogeneity of variance of expression in x and y (var(x) = var(y)) and no covariation between unstandardized expression level and transcript length (which holds in our data), the correlation strength between x and y depends on the variance of transcript length l and is given by:
Thus, even without covariance between expression in individual x and y, significant correlations can be produced. Treating this problem in depth is beyond the scope of this manuscript. As we are not interested in the absolute correlation values, but only in the relative values between individuals, we still adhere to the standardized measure throughout the manuscript, following the recommendations of Mortazavi et al. (2008). This seems justified, as despite different absolute values, correlation coefficients are highly correlated between unstandardized and standardized data (R2 = 0.97, P < 0.001). In addition, we ran all analyses on data without length standardization and obtained the same results. Minor differences arise only with respect to the set of candidate genes found. The 11 × 11 correlation distance matrix based on standardized data was then used for several downstream analyses. We compared it to binary matrices grouping pairs of individuals by taxon and/or sex (same vs. different) using Mantel and partial Mantel tests (ecodist package in R). Several data reduction methods like principle component analysis (stats package in R) and hierarchical clustering analysis (cluster package in R) were used to asses which individuals group by their gene expression patterns. Cluster analysis was based on Ward’s criterion of aggregation, which iteratively merges groups with the lowest increase in intra-group variance and the lowest decrease in inter-group variance. This process yields a binary splitting tree (dendrogram) reflecting the hierarchy of similarities between lineages, and wherein branch lengths represent the decrease of variance involved in the creation of a node. Prior to these analyses, data were z-transformed separately for each gene. Z-transformed data were also used to detect differentially expressed genes. We considered both differential expressions between the two taxa and between the two groups suggested by multivariate analyses. We calculated three test statistics for each orthologuous gene including Wilcoxon ranked sum tests, t-test with Welch correction allowing for unequal variances and simple anova assuming homogeneity of variances between the groups. We accounted for multiple testing by controlling for false discovery rate (FDR) (qvalue package, Storey & Tibshirani 2003). As expected, Wilcoxon ranked sum tests yielded the lowest number of significantly different genes, followed by t-tests and anova. As the homogeneity of variance assumption of anova is likely to be violated in many cases and power is considerably lower for the nonparametric Wilcoxon test, we chose to present the results based on the t-test statistic.
Sequence divergence between the two taxa
Low levels of differentiation between CC and HC have been suggested on the basis of different molecular markers including allozymes, RFLP, microsatellites and mtDNA sequence data (see Introduction). Here, we explore genetic diversity and divergence for pure individuals from each taxon (26 CC, 21 HC, see Methods) using traditional Sanger sequencing on 25 nuclear intronic regions totalling 16 780 nt of sequence information (Table 1). Average nucleotide diversity per site was low (π = 0.00114) and did not differ between pure CC and pure HC populations (t-test: t48 = −0.1027, P = 0.92). Likewise, there was no difference in haplotype diversity (t-test: t48 = 0.0778, P = 0.94). The average number of nucleotide substitutions per site between populations (Dxy = 0.00115) was comparable to that within populations (πHC: 0.0011; πCC: 0.0014). FST calculation within species (the two CC populations) shows no significant differentiation (FST ± CI99.9 = −0.0045 ± 0.015), while the between species calculation shows a low, but significant level of population differentiation (FST±CI99.9 = 0.023 ± 0.021). However, this is comparable to the level of within-taxon population differentiation found in the microsatellite study by Haas et al. (2009) and could be ascribed to an isolation by distance effect, in particular, when one takes into account that the two CC populations samples by us are only approximately 100 km apart of each other. None of the 25 investigated loci showed any fixed differences between the taxa.
We further calculated nucleotide diversity and divergence estimates for all possible pairwise comparisons between the 11 individuals subject to 454 transcriptome sequencing (see below). As nucleotide diversity measures based on locus-by-locus estimates are expected to slightly underestimate π, we used the estimates derived from the full concatenated sequences (see Table S1). Hierarchical clustering of pairwise FST values reveals no obvious pattern between the two taxa (Fig. 2A). There is further no indication that HC individuals from the zone of hybridization are more heterozygous than pure CC (t-test: t8.19 = 0.2785, P = 0.79).
Working with genomic resources of distantly related model organisms
We used massively parallel pyrosequencing to obtain EST sequence and expression data for brain RNA from six CC and five HC individuals from within a zone of overlap and hybridization, using individually tagged libraries for each individual. We obtained a total of 856 675 reads from two runs, with average raw read length of 270 nt. De novo assembly of all individuals yielded 19 552 contigs with an average length of 319 (min: 94 median: 395 max: 5907) and mean coverage of 8.44. A total of 13 557 (69.3%) of these had significant hits in blast searches against in silico cDNA libraries of zebra finch and chicken, resulting in a total of 7637 orthologous genes (CC: 7348, HC: 7154). By mapping the reads from each individual onto the contigs, we obtained the number of genes present in each individual and at the same time expression levels for each gene. Mean coverage for individual contigs resulting from the mapping procedure was 2.53 (min: 1, median: 2.49, max: 320). The number of genes retrieved for each individual ranged from 3343 to 6091 (Table 2) and was a non-linear function of the read number saturating at ∼6150 genes with more than 150 000 reads (Fig. 3). A total of 3650 genes were on average shared between any two individuals (min: 2275; max: 6091), a core set of 1301 genes were shared among all individuals. Overall, throughout the entire procedure of base calling, de novo assembly, mapping and blast searches against the in silico cDNA libraries of chicken and zebra finch and 47.3% of initial reads were lost (see Table 2).
Table 2. Summary statistics of 454 sequencing for six carrion crows (CC) and five hooded crows (HC)
No. raw reads
No. reads mapping on contigs
No. reads mapping on contigs with orthologue
No. orthologous genes
% of raw reads used
Multivariate analyses based on the expression profile of 1301 core genes shared among all individuals indicate that individuals can be reliably separated into two groups. According to principle component analysis, all pure CC individuals form one cluster, whereas HC are more disperse with one individual (HC1) falling into the CC group (Fig. 4). Groups are significantly separated by the first component, which explains 29% of the overall variance (anova, F1,9 = 8.85, P < 0.05). Hierarchical clustering reveals the same pattern (Fig. 2B).
The correlation between gene expression levels from two individuals can be used to quantify expression divergence (Fig. 1). Expression profiles are highly correlated between all 11 individuals both for the core set of 1301 genes (Pearson’s r min: 0.47, mean: 0.70, max: 0.87) and for all genes shared by two individuals (min: 0.47, mean: 0.67, max: 0.80, Table S2). The correlation matrix can be used to assess whether expression profiles are more similar in same-sex than in different-sex comparisons. For sex, no difference can be detected (Mantel test: P = 0.866). Expression profiles within the same taxon, however, are more similar than between taxa (Mantel test: P = 0.015, Fig. 5A). The number of genes shared by two individuals is only higher in tendency for within-taxon comparisons (P = 0.069). If individual HC1 is excluded (P = 0.036) or grouped with the six CC as suggested by multivariate analyses (P = 0.027), the two groups differ (Fig. 5B).
Differentially expressed genes
Depending on the FDR applied, 18/138/288 genes (FDR: 0.01/0.05/0.1) can be identified that are differentially expressed between the two groups inferred by multivariate analysis (6 CC + 1 HC vs. 4 HC, Fig. S2). 9/106/257 are differentially expressed between the two taxa if individual HC1 is removed (6 CC vs. 4 HC, Fig. 6). A list of all genes, expression levels and FDR rates can be found in Table S3.
Most of the genes with the most significant expression difference are involved in neural regulation (e.g. ion channels, glutamic acid decarboxylase or beta-adrenergic receptor kinase) and differentiation (e.g. cell-cycle regulation, actin binding protein or microtubule associated protein). The two most significantly differentially expressed genes are VGluT2, which is a specific marker for glutamatergic neurons and neurexin which belongs to the organizing molecules for excitatory glutamatergic synapses (Craig & Kang 2007). These genes could be involved in creating new synaptic connections and might therefore play a role in different behavioural traits. However, for most of the genes identified only a biochemical function is known, rather than a biological one that could provide a clue towards their involvement in behavioural traits.
Genetic diversity and divergence between carrion and HC
Traditional Sanger sequencing of 25 intronic nuclear loci confirms the almost complete lack of genetic differentiation between carrion and HC that has been observed by others on the basis of different molecular markers (Saino et al. 1992; Haring et al. 2007). The degree of differentiation is well within the level observed between populations within the same taxon (e.g. Haas et al. 2009). In 16 780 bp of sequence information, we did not find a single diagnostic nucleotide, which sharply contrasts with the clear geographically well separated differences in plumage colour between the taxa. This leads to the conclusion that the genetic data do not even support subspecies status, while other lines of evidence classify carrion and hooded crow as separate species (Parkin et al. 2003). A similar phenomenon can be found in many young radiations like e.g. in the Cichlid fishes in Lake Victoria, where many clearly diagnosable species are genetically not distinct (Albertson et al. 1999). Tracing the initial steps of divergence in such systems that may or may not complete the speciation process is of particular interest, as they allow studying the conditions under which the speciation process was initiated (e.g. Tautz 2004; Bekkevold et al. 2005; Steinfartz et al. 2007; Wolf et al. 2008).
In theory, a single gene can promote prezygotic isolation and lead to speciation if it codes for a signal that is relevant to mate choice. As most bird species are sexually imprinted on their parents early in life, basically any one gene producing a perceivable signal can be involved. If imprinting is effective, positive frequency dependent selection will prevent immigration of different phenotypes into pure populations as has been recently explored for the crow hybrid zone (Brodin & Haas 2009). While the maintenance of population structure because of a morphological trait can be understood, it is harder to imagine how a mutant phenotype can initially increase in frequency. This is best explained by genetic drift in an isolated small local population. For the crow system investigated here, this implies that one of the taxa should be derived from such a peripheral population and should therefore have significantly lower diversity. Compared with other bird species (Ellegren 2007), nucleotide and haplotype diversity were low for both carrion and hooded crow (suggesting effective population sizes in the range of 2 × 105), but they are indistinguishable between the two taxa.
Given the confirmation of an extremely recent species divergence, we can consider the crow system as a test case of a non-model species where only higher level genomic approaches can lead a step forward. For cases like this, massively parallel sequencing should be particularly suitable. As a result of the lack of genome information, it would not be easily possible to use other genome wide approaches, such as microarrays, for studying expression divergence. The molecular distances to sequenced bird genomes (chicken and zebra finch) are so large that they cannot serve as a reliable basis for designing microarray probes.
As it will be typical for many nonmodel studies, we were constrained by a number of technical factors. This includes the fact that we had to take population samples directly from the wild where neither the previous ecological conditions nor age matching could be controlled. In addition, careful dissection of specific brain regions was not possible in the field. Although each of these constraints could have resulted in high variances of expression profiles, we have obtained rather consistent results. Expression levels are highly correlated across individuals and all animals of the pure species sample group tightly together, while the hybrid zone samples show dispersion, as one could have expected (see below). We note that similar constraints occur in the work with primates and humans, but have nonetheless yielded fairly consistent results in gene expression analyses (Enard et al. 2002; Gilad et al. 2006). Hence, although our study will need to be extended to include more populations from different regions and involve common garden experiments, it is encouraging to see that samples taken under seemingly suboptimal conditions can nonetheless provide insights into divergence patterns between populations.
Given the long reads obtained through 454 pyrosequencing, it was relatively easy to obtain a de novo assembly of contigs that could be associated with orthologous genes in zebra finch and chicken. However, we did not reach a sufficient density of reads that would have allowed ensuring that all assembled contigs match to exactly one gene only. This becomes clear from the contig mapping to annotated orthologues in chicken and zebrafinch. A total of 13 557 contigs obtained in the crows mapped to 7637 genes in chicken and zebrafinch, suggesting that more than 40% of the contigs can still be joined. For studying expression differences, it was necessary to restrict the analysis to the unequivocally identified single genes. Thus, all reads that could represent genes or splice forms specific for the crow lineage had to be discarded from the comparative expression analysis, although they may contain additional valuable information. This will be a typical constraint for non-model organisms. However, these reads could be included in a microarray design and interesting genes could then be further analysed on a case by case basis (i.e. doing RACE based studies to link possible contigs).
The application of RNAseq data to a non-model organism also raises several statistical issues. RNAseq expression level estimates require correction by transcript length. However, as contigs are often only fractions of the transcripts, transcript length can only be inferred from distant species with available genomic sequences. But even in annotated species like chicken and zebra finch, UTR lengths are unavailable for a large number of genes, so length correction can merely be based on the coding sequences. An additional problem lies in the fact that standardizations on log-transformed data will alter the correlation between individuals and to a lesser degree influence the relative expression differences between genes (see Methods for more details).
It is further of concern that tests for differential gene expression are dependent on the read coverage for each gene. We expected that differentially expressed genes would be a biased sample of genes highly covered with reads. We were thus surprised to see that neither P-values nor differences in expression were associated with read number (R2max = 0001; pmin = 0.127), although read number differed considerably across genes (min: 1 median: 9 mean: 20 max: 1910). This is unexpected, as stochastic variation should increase the within group variance lowering detection power in the test statistic. This issue certainly merits in depth statistical exploration in the future.
We used traditional Sanger sequencing to obtain nucleotide diversity estimates from selected intron regions. Estimates should, however, also be possible on the basis of the next generation sequencing data. Theoretical models have been proposed (Lynch 2008; Jiang et al. 2009) and first software implementations have been accomplished. As they are designed for whole genome sequencing, it will be interesting to see how well these methods perform on transcriptome data. Due to polymorphisms in regulatory regions or epigenetic alterations, it is likely that the two copies of a gene in a diploid organism are not transcribed at the same rate and that this allelic imbalance is reflected in the sequence data. This would influence the probability of getting both alleles from an individual in the shotgun data and could thus bias the polymorphism estimate.
The role of expression divergence in speciation
It is well-known that gene expression can be very sensitive to environmental perturbation (Gibson 2008). Under noncontrolled conditions, divergence between taxa could result from different environmental conditions (Cheviron et al. 2008). While these are not evident in our case, they also cannot be entirely excluded. Different treatments in the way of obtaining the samples could result in a systematic bias. Two different hunting methods were used for obtaining the hybrid zone and the pure species samples, namely rifle vs. falcons. The latter hunting scheme could have led to an increase in stress related gene expression during the hunting phase. However, only one of the upregulated genes in the CC sample is correlated with chronic stress (glycoprotein M6A), i.e. there is no strong indicator for stress induced gene expression changes. Furthermore, the fact that one hooded crow individual (HC1) clusters with all CC and the arguments provided in the ‘technical issues’ section speak also against this possibility.
Leaving these technical issues aside, the fact that we can see a clear differentiation between pure species and hybrid animals at the level of expression divergence corroborates the assumption that expression changes may contribute significantly to early divergence patterns. However, much of this divergence may be neutral and not of direct adaptive significance. Such neutral divergence patterns of gene expression have been suggested for the primate lineage (Khaitovich et al. 2004, 2005). In a systematic study of house mouse populations and subspecies, we found also no direct evidence for much positive selection involved in changes in gene expression. (Staubach et al. in press). Thus, most of the changes that differentiate the two groups of animals are likely to be of no direct significance for the species separation. If one treats gene expression estimates as neutral markers, one can explain the relative dispersion of the hybrid zone animals in the multivariate analysis (Fig. 4). Assuming these animals constitute various stages of backcross genotypes between the two pure populations, one can envisage a segregation of these markers. Gene expression analysis in hybrids between house mouse subspecies has shown extensive additivity of expression characteristics (Rottscheidt & Harr 2007), supporting the possibility of a segregation of expression characteristics in a hybrid zone. Individual HC1 that groups in its transcript pattern with the CC individuals is a particularly interesting case under this assumption. The transcript pattern of this individual suggests that most of its genome is of CC origin, while its external phenotype links it to the HC. This gives a first hint that at least the external phenotypic traits may be controlled by only a few loci, which have happened to come from the HC in this particular individual.
We have used a combination of a traditional Sanger sequencing and a massively parallel sequencing approach to shed some light on the underlying genetic factors of species differentiation in European carrion and hooded crow. Consistent with results by others, we find almost no divergence at the DNA level. This suggests that the genome wide accumulation of genetic incompatibilities during an extended allopatric phase might play a subordinate role in hybrid zone dynamics of these species. A model that proposes the emergence of a different colour morph in a small isolated population with subsequent spread by positive frequency dependent selection is also not readily supported, as the levels of nucleotide diversity in both taxa are almost indistinguishable from one another. The study of expression patterns using next generation sequencing techniques appears to provide an alternative approach to study this system. Our results suggest that overall expression divergence evolves faster than overall nucleotide divergence, possibly due to correlated effects that the change of expression of one gene has on other genes. At present it remains open how much of this expression change might be causally involved in species differentiation, but it provides a starting point for the investigation of species differences in this and other systems.
We like to thank Chris Harrod, Jaimie Dick, Andrzej Kruszewicz, Magdalena Gajownik and Deutscher Falkenorden for sample acquisition. Elke Bustorf provided valuable help in the lab. We are also very thankful for enlightening mathematical discussions with Carina Mugal. Many thanks also to Niclas Backström for providing sequencing primers. Funding was provided by the Max-Planck Society (JW, TB, BH, DT), Deutsche Forschungsgemeinschaft (PR, MS; Cluster of Excellence Future Ocean) and VolkswagenStiftung grant I/83 496 (JW).
JW and DT conceived the study and wrote the manuscript. JW performed fieldwork and together with MS and PR did the work in the laboratory. JW, TB and BH analysed the data.
Conflicts of interest
The authors have no conflict of interest to declare and note that the funders of this research had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.