Detecting signatures of positive selection in partial sequences generated on a large scale: pitfalls, procedures and resources


Marine Brieuc, Fax: (206) 543 5728; E-mail:


Studying the action of selection provides insight into adaptation, population divergence and gene function. Next-generation sequencing produces large numbers of partial sequences, potentially facilitating efforts to detect signatures of selection based on comparisons between synonymous (dS) and nonsynonymous (dN) substitutions, and single nucleotide polymorphism assays placed in selected genes would improve the ability to study adaptation in population surveys. However, sequences generated by these technologies are typically short. In the nonmodel organisms that are often the focus of evolutionary studies, the lack of a reference genome to facilitate the assembly of short sequences has limited surveys of positive selection in large numbers of genes. Here, we describe a series of steps to facilitate these surveys. We provide Perl scripts to assist data analysis, and describe the use of commonly available programs. We demonstrate these approaches in six salmon species, which have partially duplicated genomes. We recommend using multiway BLAST to optimize the number of alignments between partial coding sequences. Reading frames should be manually detected after alignment with sequences in GenBank using the blastx program. We encourage the use of a phylogenetic approach to separate orthologs from paralogs in duplicated genomes. Simple simulations on transferrin, a gene known to have undergone selection in salmon species, showed that the ability to detect selection in short sequences (<600 bp) depended on the proportion of codons under selection (1–2%) within that sequence. This relationship was less relevant in longer sequences. In this exploratory study, we detected 11 genes showing evidence of positive selection.
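The comparison between synonymous and nonsynonymous substitutions mentioned above can be illustrated with a minimal Python sketch. It is not one of the Perl scripts provided with this work; the function name `count_codon_differences` and the example sequences are hypothetical. It assumes two sequences that are already aligned, in frame, and of equal length, and it uses the standard genetic code. It returns raw counts of differing codons only, whereas dN and dS proper are rates normalized by the numbers of nonsynonymous and synonymous sites, which dedicated software estimates.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (third base varies fastest, bases in the order T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Assumes both sequences are aligned, in frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")

    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # identical codons carry no substitution
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: cannot be classified
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid encoded
        else:
            nonsynonymous += 1  # amino acid changed
    return synonymous, nonsynonymous


# Hypothetical example: GCT -> GCC keeps alanine, AAA -> AGA changes lysine to arginine.
print(count_codon_differences("ATGGCTAAA", "ATGGCCAGA"))  # (1, 1)
```

Because a short fragment contains few codons, counts like these are small, which is consistent with the observation that detection in short sequences depends on the proportion of codons under selection.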