Allelic imbalance metre (Allim), a new tool for measuring allele-specific gene expression with RNA-seq data

Estimating differences in gene expression among alleles is of high interest for many areas in biology and medicine. Here, we present a user-friendly software tool, Allim, to estimate allele-specific gene expression. Because mapping bias is a major problem for reliable estimates of allele-specific gene expression using RNA-seq, Allim combines two different strategies to account for the mapping biases. In order to reduce the mapping bias, Allim first generates a polymorphism-aware reference genome that accounts for the sequence variation between the alleles. Then, a sequence-specific simulation tool estimates the residual mapping bias. Statistical tests for allelic imbalance are provided that can be used with the bias corrected RNA-seq data.


Introduction
Allim, Allelic imbalance meter, offers an integrated and user-friendly solution for measuring allele specific gene expression (ASE) within species. Allim estimates allelic imbalance in F1 hybrids. Since mapping bias is the largest problem for reliable estimates of allele specific gene expression using RNA-seq, Allim combines two different measures to account for mapping biases. First, Allim generates a polymorphism aware reference genome that accounts for the sequence variation between the alleles of both parents (or parental lines). Second, Allim includes a sequence specific simulation tool to estimate the remaining mapping bias.
This estimated mapping bias is then incorporated in the statistical tests for allelic imbalance.
The pipeline requires whole transcript high throughput RNA sequencing (RNA-seq) reads of F1 hybrids. Additionally, either RNA-seq reads for both parents, genomic sequencing reads for both parents or two parental genomes for both parents have to be provided in separate files. Allim was tested on Illumina paired-end RNA-seq reads but it can also handle FASTQ files from other NGS sequencing platforms. The provided parental RNA-seq libraries can be from homozygous as well as heterozygous parents (however, all heterozygous loci will be excluded during the analysis).
Allim has five modules that can be run by a single command. All input parameters can be specified in the AllimOptions file. These parameters are then used to run the complete pipeline. Allim provides two different input options: AllimOptions_2Pexpr: This configuration file can be used when RNA-seq data for parent1, parent2 and F1 (hybrid) is provided. In this case Allim uses parent1 and parent2 RNA-seq reads to identify the fixed SNPs. The fixed SNPs are subsequently used to generate two "parental -genomes", one for each parent. Note, rather than RNA-seq data, also DNA reads can be provided for the two parents.
With Allim if user does not have reference genome and corresponding gene annotation in GTF format. It is possible to provide a single reference transcriptome along with RNA-seq data of both parents. The use of a reference transcriptome (or contigs from a RNA-Seq de novo assembly) instead of a reference genome is accompanied by the following differences: • In a transcriptome assembly different isoforms are typically presented by different contigs. Furthermore, assemblies often contain additional redundancies between contigs due to sequencing errors and polymorphisms in the RNA-seq reads used for de novo transcriptome assembly.
• As Allim creates two parental references (genomes or here transcriptomes), and maps RNA-seq reads of the F1 individual to the "diploid genome", only reads that map non-ambiguously can be used to determine ASE profiles. It is therefore recommended to remove redundancy from the transcriptome assembly before Allim usage.
• As transcriptome assemblies typically do not have a gtf file containing gene features (needed as Allim input). We have therefore added an additional short script to obtain a simple gtf file based on contig ids.
AllimOptions_2Pgenomes: This configuration file can be used when the user provides RNAseq data for the F1 hybrid and a genomic sequence for each parental line. If genomic sequences of both parents are provided, both FASTA files need to have the same size and identical FASTA IDs.

Five Modules of Allim:
(1) Identification of fixed SNPs (2) Computer simulation of RNA-seq reads with fixed SNPs (3) Estimation of the remaining mapping bias with simulated data (4) Estimation of allele specific expression for experimental data (5) Statistical test of significant allelic imbalance The source code and the user manual of Allim are available at http://code.google.com/p/allim/

System Requirements
To use the Allim pipeline, the user needs to meet the following requirements: Note: To install R packages listed above 11-14, R >=2.15.0 should be installed.

Operating system:
The Allim package is designed to work with a 64-bit Unix operating system with at least 4 GB of RAM and 2 CPUs (processors).

Python installation:
Allim is developed on Python version 2.7.3. Python 2.7.3 for your operating system can be obtained from http://www.python.org/download/. After the download follow the instructions given on the web page to complete the python installation and configuration.

Biopython installation:
Allim requires biopython for efficient fasta sequence reading, writing and other sequence manipulations. Once python is installed, biopython can be obtained and installed with the following steps: Step1: Download the biopython source code from http://biopython.org/DIST/biopython-

1.60.tar.gz
Step2: Uncompress the downloaded file with the following command: tar -zxvf biopython-1.60.tar.gz This command will return the folder/directory biopython-1.6.0 Step3: Enter the uncompressed folder with following command: cd biopython- 1.60 Step4: To install the biopython package run these two commands: python setup.py build python setup.py install

NumPy installation:
NumPy is the fundamental package for scientific computing with Python. In the Allim pipeline NumPy is used to integrate sequencing errors during the simulation of RNA-seq reads. In order to install NumPy on any Unix system the following steps are required: Step1: Download the NumPy package source code from http://sourceforge.net/projects/numpy/files/NumPy/1.6.0/ Step2: Uncompress the downloaded file with following command: tar -zxvf numpy-1.6.0.tar.gz This command will return the folder/directory numpy-1.6.0 Step3: Enter the uncompressed folder with following command: cd numpy-1.6.0 Step4: To install this python package run these two commands: python setup.py build python setup.py install

PICARD installation:
Picard comprises Java-based command-line utilities that manipulate SAM/BAM files. It is a collection of many JAVA jar files, which can be downloaded and used directly without prior installation. The Allim pipeline uses three jar files: 1) BuildBamIndex.jar, 2) MergeSamFiles.jar, 3) SortSam.jar. Provide the full paths of these three jar files in AllimOptions run file to run the Allim pipeline.

SAMTOOLS installation:
SAMTOOLS provide various utilities for the manipulation of alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

BedTools installation:
BedTools provide a set of functions to sort and intersect various genomic formats including bam files and files that contain genomic annotation such as gene position and SNP information. Step4: To make BedTools executable run the make all command:

make all
After running the "make all" command, BEDTools-Version-2.16.2/bin sub-folder/subdirectory will be created, which contains all BedTool executables.
The Allim pipeline uses three executables: 1) intersectBed, 2) sortBed, 3) bamToBed. Provide the full paths of these three executables in the AllimOptions run file to run the Allim pipeline.

GSNAP installation:
GSNAP (Genomic Short-read Nucleotide Alignment Program) is a mapper for RNA-seq data.
It has the advantages that it can detect splicing events and is capable of SNP tolerant alignments (Wu & Nacu 2010).

cd gmap-2012-07-20
Step4: To install the gmap package run these four commands: ./configure make make check (optional) make install The above four commands will build GSNAP and other executables in "/usr/bin" Step5: To check the GSNAP installations run this command:

gsnap -help
This command will return the detailed help manual for various GSNAP parameters if the installation was successful.
Note: For more detailed information how to install GSNAP please read the gmap-2012-07-20/README file.

R installation:
The R source code specific to your operating system can be obtained from http://cran.rproject.org/. To install R, please follow the instructions provided with R.

RPy2 installation:
RPy2 is a python class designed to do statistics programming using R in python. In order to install RPy2 on any Unix system the following steps are required:

Allim Installation
The user can download the latest version of Allim from http://code.google.com/p/allim/. The file to download is called Allim_1.0.tar.gz. Move the file to an appropriate directory and run the following command to uncompress the file: Note that after uncompressing the tar.gz file, a new folder will be created named Allim_1.0.
This directory contains the following files:

The Validation of the Installation & Sample Input Files
To validate the installation of the Allim pipeline it can be run with a small test data set. The test data set and the corresponding Allim configuration files can be obtained from the To get help on how to run Allim and required parameters enter: python Allim.py -help

Allim Input Description
Allim can be run with the following command, which should be run under the Allim_1.0 directory: python Allim.py --option-file <path to AllimOptions file> However, before running Allim with your own dataset all parameters have to be specified in the appropriate Allim Options files (AllimOptions_2Pexpr OR AllimOptions_2Pgenomes).
Global input parameters: Figure 1 shows global input parameters, which are essential for all modules of Allim. Figure 2 shows how the paths to third party tools that are used by Allim can be specified in the AllimOptions file.  The remaining input parameters are described in the following chapter along with the Allim modules description.

Module 1: Identification of fixed SNPs
This module is only used when the input option, "AllimOptions_2Pexpr" is chosen (RNA-seq or genomic DNA reads of both parents are provided). With the input option, "AllimOptions_2Pgenomes" (two genomes, one for each parent) is used this module will be skipped.
The module determines fixed SNPs between the parental genotypes (lines) based on the user provided RNA-seq libraries. It accepts multiple replicates for the calculation of fixed SNPs to increase the power and accuracy of SNP detection. For each parental genotype and replicate RNA-seq data (alternatively genomic DNA reads) can be provided for paired-end sequence data (pairs of fastq files, read1.fq and read2.fq, for each condition). Besides the RNA-seq data (genomic reads) further parameters can be specified as shown in Figure 3. The type of data that is used for calling fixed SNPs, either RNA or genomic sequence is reads is specified via the parameter SEQUENCE_TYPE ("mRNA" or "DNA"). In

Module 2: Computer simulation of RNA-seq reads
This module simulates paired-end Illumina reads in fastq format. It internally uses the fixed SNP information to create two "parental genomes", one for each parent or alternatively the user provided reference genomes. Further it used the information from the provided GTF file (provided as a global input parameter).
For the read simulation, first two parent specific genomes ("parental genomes"), which only differ with respect to the identified fixed SNPs, are generated. For each parental genome all possible paired-end reads that cover at least one fixed SNP position are simulated once.
Therefore, the expression ratio been reads simulated for each parent should equal 1 for each gene. In contrast, the deviation from an expression ratio of 1 in the simulated data for a gene indicates a remaining mapping bias, which is caused by the mapper itself.

Algorithm:
1. Creation of two parent specific ("parental") genomes via the substitution of the base at a fixed SNP position in initial reference genome (this step is skipped if two parental genomes are provided initially). Additional parameters that define the properties of the simulated reads are shown in Figure 4.   The module generates an expression table for all genes and all exons in two separate files. This expression table will be subsequently used in module 5 to test for significant allelic imbalance.

Estimation of allele specific expression for experimental data
Modules 3 & 4 make use of a two-genome approach to determine allele specific expression.
In this approach the two "parental" genomes (one for each parental genotype / parental line) that where created for the RNA-seq read simulation in module 2 (AllimOptions_2Pexpr) or provided by the user (AllimOptions_2Pgenomes) are used as a "combined reference" (genome of parent 1 plus genome of parent 2) to map RNA-seq reads. Only the reads that map unambiguously (no multiple equally good mapping positions) are used to determine allele specific expression. This approach excludes reads that do not span fixed SNPs as these have at least two equally good mapping locations in the "combined reference".
The input files and other parameters for modules 3& 4 can be specified in the AllimOptions file as shown in figure 5. Note that for each hybrid library only one TMM factor has to be calculated. This factor is used for the expression profile of both parental alleles. Table 1 provides read counts   before and table 2 after the first normalization step.  2) Mapping bias correction The residual mapping bias after accounting for fixed SNPs between the two parental lines is accounted for via the mapping bias coefficient. This coefficient is defined as the expression ratio between both parental alleles obtained from the simulated data (module 2).

Mapping bias coefficient = readcount_simuldata(parent1) / readcount_simuldata(parent2)
The mapping bias coefficient is calculated for each gene. All read counts normalized by library size (table 2) are then multiplied by the mapping bias coefficient of the respective gene (table 3).

3) Rescaling of the corrected read count
After applying the above two normalization factors, read counts for the different genes across libraries can be inflated or decreased depending on the values of the respective factors used.
As inflation results in an unjustified increase of power in the statistical test and a lowered count in an unjustified decrease, the values have to be rescaled on a gene-wise level. This is done via the following rescaling factor.

Rescaling factor = expression value before normalization (sum over all samples) /after normalization (sum over all samples)
All corrected read counts by the two previous correction factors (table 3) are multiplied by the rescaling factor of the respective gene (table 4). This folder contains the identified fixed SNPs between the parental genotypes (parental lines).
The number at the beginning of the file indicates in which cycle of remapping the fixed SNPs were called. "n" is a parameter that has to be provided by the user. The remaining two files provide the results of the significance testing for allelic imbalance for the F1 generation.

Benchmark of the Allim Pipeline
We have run the Allim pipeline with the following data set of RNA-seq reads from Drosophila pseudoobscura to benchmark the runtime of Allim. The reference genome release 2.23 was taken from Flybase and an improved annotation file of D. pseudoobscura was taken from (Palmieri et al 2012). The GTF file with the gene annotation contained 17,112 gene models.

RNA-seq reads (Illumina, GA II):
Parent1 reads: 2 million 100 bp paired-end reads with 68 bp average insert size  * This step will be skipped when RNA-seq data for parent1 and parent2 not available.
All benchmarks have been done on a Mac OS X 10.6.8, 2x2.8 GHz Quad-Core Intel Xeon, with 4 GB of RAM using 4 processors (CPU).

Validation
We validated the performance of our pipeline with experimental as well as simulated RNAseq reads. The experimental data consisted of paired-end RNA-seq reads from males and females of two different isofemale lines of Drosophila pseudoobscura (Table 6; Palmieri et al 2012). The four libraries were given as input to Allim to call fixed differences between both isofemale lines (different sexes were used as biological replicates). Module 1 of the pipeline identified154,920 fixed SNPs present in a total of 9,626 genes. The following Allim parameters were used to call the fixed SNPs:

Assignment of reads to the correct parental line:
To test the accuracy of Allim to identify the parental origin of a read, ASE expression profiles were determined for experimental and simulated data:

Experimental data:
a) pooled RNA-seq reads from libraries ps88 males and ps94 males (Table 6) b) pooled RNA-seq reads from libraries ps88 females and ps94 females (Table 6) In contrast to RNA-seq data of individuals from the F1 generation, the parental origin of the pooled reads from both lines is known. Therefore, the percentage of reads that were assigned to the correct parental line by Allim could be determined. The results show that between 98.60% and 99.97% of the reads were assigned to the correct parental line by Allim (Table 7).
The marginal difference between the success rates of both lines is due to the fact that ps94 is a derivative of the strain for which the reference genome sequence was provided to Allim. For this reason reads originating from ps94 that span polymorphic sites, which did not meet the threshold for calling fixed SNPs, are more easily mapped than reads originating from ps88.
Since we used the most extreme case for a reference bias, the small bias observed suggests that under less extreme settings, Allim will provide almost no bias caused by the reference genome.

Simulated data:
RNA-seq reads were simulated with the procedure described in module 2 based on the 154,920 fixed SNPs determined for the experimental data previously. The results show that 99.99% of the simulated reads were assigned to the correct parental line by Allim (Table 7). As the reads were directly simulated based on the reference genome with only fixed differences edited, this remaining error occurs during the mapping with GSNAP (possibly for reads spanning splice junctions).

Improvement of mapping quality via modification of the reference genome
We validated the improvement of the mapping quality via reference modification by mapping RNA-seq reads (experimental & simulated data) to the original as well as the modified genomes. Mapping success was then compared between both approaches. The mapping success varies between 85.97% and 92.15% between the different data sets. However, in all data sets an increase in mapping success after editing of the genomes between 0.07% and 0.32% can be detected (Table 8). We validated the performance of our pipeline with experimental as well as simulated RNAseq reads. The experimental data consisted of paired-end RNA-seq reads from males and females of two different isofemale lines of Drosophila pseudoobscura (