Introduction
- Top of page
- Introduction
- Basic Protocol 1: Finding Human Coding Exons with Highest SNP Density
- Basic Protocol 2: Loading Data and Understanding Datatypes
- Basic Protocol 3: Calling Peaks for ChIP-seq Data
- Basic Protocol 4: Compare Datasets Using Genomic Coordinates
- Basic Protocol 5: Working with Multiple Sequence Alignments
- Guidelines for Understanding Results
- Commentary
Most experimental biologists cannot fully take advantage of genomic data due to a formidable wall of countless and unnecessary computational issues. The goal of Galaxy (Blankenberg et al., 2010; Goecks et al., 2010) is to solve these issues. Consider the following example: a researcher wants to identify protein-coding exons containing the highest density of SNPs. Most biologists know three primary sources of genome-wide data for vertebrates: Entrez at the National Center for Biotechnology Information (NCBI; unit 1.3; Maglott et al., 2005), the Genome Browser at the University of California at Santa Cruz (unit 1.4; Karolchik et al., 2003; Schneider et al., 2006; Rosenbloom et al., 2009), and Ensembl (unit 1.15; Birney et al., 2004) at the EBI/Wellcome Trust Sanger Institute (U.K.). Although these three sources offer extensive information about genes, including genomic structure, gene expression profiles, and SNPs, the end user must still perform this task elsewhere: the listed resources do not provide the functionality necessary to perform this analysis. Typically, the project ends up in the hands of a graduate student who might initially try to achieve the analysis using popular desktop applications. Unfortunately, Excel (like many other desktop applications) cannot handle that much data. As a result, this relatively simple task becomes a complex endeavor that may easily take weeks or months. In the authors' view, this does not have to be complicated. Galaxy bridges the gap between data and analyses by allowing experimental biologists without programming experience to easily perform large-scale studies from within their Web browsers.
In this unit, the authors describe the functionality of Galaxy using a series of examples that correspond to the following protocols. Basic Protocol 1 covers the most fundamental features of Galaxy. Basic Protocol 2 elaborates on different types of data accepted by Galaxy. It also shows the user how to upload data and set data attributes. Basic Protocol 3 demonstrates analysis with ChIP-seq high throughput sequencing data. Basic Protocol 4 shows that manipulation of genomic intervals is one of Galaxy's greatest strengths. Basic Protocol 5 explains how Galaxy enables users to manipulate multiple alignments.
In addition, a fully interactive supplement titled Using Galaxy to Perform Large-Scale Interactive Data Analysis: A live supplement is available on the main public Galaxy instance under Shared Data: Published Pages: Using Galaxy 2012, at http://usegalaxy.org/u/galaxyproject/p/using-galaxy-2012. For each protocol, the input datasets, a complete history, and any workflows are included along with the exact methods and a screencast (video tutorial). These items can be examined, copied, rerun, and modified at the main public Galaxy instance (http://usegalaxy.org) and downloaded for use in a local or cloud instance (http://getgalaxy.org).
Basic Protocol 1: Finding Human Coding Exons with Highest SNP Density
Suppose one wants to find the top hundred protein-coding exons in the human genome with the highest density of single nucleotide polymorphisms (SNPs). Answering this question is not trivial. To do so, one needs to compare all human exons to all human SNPs. To put this into perspective, the current version of the human genome at UCSC for hg19 includes over 350,000 known coding exons and dbSNP build 134 (Sherry et al., 2001) contains nearly 49 million SNPs. Galaxy is specifically designed to make such large-scale analyses fast and user-friendly. Galaxy's interface is accessible from http://usegalaxy.org. In the following protocol, the authors will use RefSeq (Karolchik et al., 2004; Pruitt et al., 2005) exons and dbSNP annotations on chromosome 22 extracted from the UCSC Table Browser (Fujita et al., 2011).
- An Internet-connected computer
- Internet browser that supports JavaScript (e.g., most current browsers such as Mozilla Firefox, Safari, Opera, Chrome, or Microsoft Internet Explorer)
- None
| 1. | Open the Galaxy Project's homepage by pointing your Web browser to http://galaxyproject.org. The project homepage features four prominent sections: Use Galaxy, Get Galaxy, Learn Galaxy, and Get Involved. | ||||||
| 2. | Click on the Use Galaxy link (http://usegalaxy.org), which will bring up the free public Galaxy server. The Galaxy interface, populated with sample data, is shown in Figure 10.5.1. | ||||||
| 3. | Hover over User in the top bar, then click on Register in the submenu. The center panel of the Galaxy interface will change into a form asking you to provide user account details. | ||||||
| 4. | Fill in the Create Account form and click submit. Although Galaxy can be used without creating an account, the authors highly recommend registering. First, having an account allows you to access your data from any machine connected to the Internet. Second, having an account safeguards data stored in the history against deletion. Anonymous histories and datasets are not reusable from one session to the next. You also cannot do all the protocols in this unit without an account. When registering for the account, note the mailing list subscription checkbox. Checking it subscribes the new user to the galaxy-announce mailing list. This list is a moderated, low-volume list for announcements of interest to the Galaxy community. | ||||||
| 5. | After registering, you will be automatically logged in. For subsequent sessions, log in using your e-mail and password. Hover over User in the top menu and click Login in the submenu. | ||||||
| 6. | Name this history.
You can only do this step if you are logged in. | ||||||
| 7. | Click the Get Data link at the top of the Tools panel. | ||||||
| 8. | Click the UCSC Main link. The UCSC Table Browser interface will appear in the middle panel of the Galaxy screen. The History panel on the right will disappear until you leave UCSC Main. | ||||||
| 9. | Import coordinates of protein-coding exons of known human genes from the UCSC Table Browser to Galaxy. Make sure the following parameters are set as shown in Figure 10.5.2A:
This brings up the next screen of the Table Browser interface as shown in Figure 10.5.2B. This will return you to Galaxy and create the first item called UCSC Main on Human: refGene (chr22:1-51304566) in your History panel, and place a large green box in the center panel showing that the upload has been successfully added to the Galaxy job queue. The history item is initially gray, showing it is queued (Fig. 10.5.3). The history item becomes yellow when the job is running, and green once it is complete. If a task fails for any reason the history item will turn red. This dataset contains ~ 7,100 exons. | ||||||
| 10. | Rename the dataset to something more memorable.
This opens the Edit Attributes panel in the center as shown in Figure 10.5.5. This step is not necessary, but it does keep this somewhat useful information (UCSC Main on Human: refGene (chr22:1-51304566)) associated with the dataset. | ||||||
| 11. | In the Tools panel, click Get Data, and then UCSC Main under it. | ||||||
| 12. | Import coordinates of SNPs from the UCSC Table Browser to Galaxy. Make sure the following parameters are set:
This brings up the next screen of the Table Browser interface. This will create the second history item named UCSC Main on Human: snp132Common (chr22:1-51304566). This dataset is much larger than the Exons dataset, with ~170,000 SNPs in it. | ||||||
| 13. | Click Operate on Genomic Intervals in the Tools panel. | ||||||
| 14. | Click Join to perform a Join operation.
This will join any exon and SNP records that overlap by one or more base pairs. For explanation of various join options, see Basic Protocol 4. This will take a few minutes to compute. The join tool allows the user to find intersections between two sets of genomic intervals. In our case, we are joining protein-coding exons and SNPs as shown in Figure 10.5.6A. The result of this operation, a dataset with ~4,800 overlapping exon-SNP pairs, is shown in Figure 10.5.7. The first six columns represent protein-coding exons, while the last six represent SNPs. The six columns are: (1) chromosome, (2) start position, (3) end position, (4) description, (5) score (always 0 in this example), and (6) strand (+ or -). Figure 10.5.7 highlights a single exon (located on chromosome 22 between positions 17,264,508 and 17,265,299), which contains (overlaps with) 4 SNPs. One can see that coordinates of SNPs (columns eight and nine) are always within the start and end positions of the exon (columns two and three). | ||||||
| 15. | Click Statistics in the Tools panel. | ||||||
| 16. | Click Count to count the number of SNPs per exon as shown in Figure 10.5.6B.
Column 4 contains the exon name. Figure 10.5.7 shows that the number of times each exon is listed equals the number of SNPs that exon overlaps with. Thus, by counting the number of occurrences of every exon in this dataset, one can compute how many SNPs each exon overlaps with. The resulting dataset contains ~2,600 lines, one for each exon that overlaps with one or more SNPs. | ||||||
| 17. | Sort results by the number of SNPs per exon as shown in Figure 10.5.6C.
The resulting history item contains the input dataset, sorted by the number of SNPs in each exon (column 1). | ||||||
| 18. | Select the top 100 exons from this list as shown in Figure 10.5.6D.
After execution is finished your new history item will contain a list of the 100 exons with the highest SNP density. | ||||||
| 19. | Retrieve the other information for the top 100 exons as shown in Figure 10.5.6E.
The exon name, the common value between the two datasets, is in column 4 in the exons dataset and column two in the counts dataset. | ||||||
| 20. | Rename and format the final result dataset.
The resulting dataset contains 100 rows from the Exons dataset. Each row contains a full BED record. This dataset can now be used anywhere a genomic interval dataset (see Basic Protocol 4) or BED dataset can be used. It can also be visualized in genome browsers. |
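The join, count, sort, select, and compare steps of this protocol can be summarized in a few lines of code. The following is a minimal Python sketch of that logic, not the code Galaxy runs: the function names, the tuple layout `(chrom, start, end, name)`, the half-open coordinate convention, and the sample records are all assumptions made for illustration. Like the protocol, it ranks exons by the number of overlapping SNPs, and it uses a naive all-against-all comparison that would be too slow for genome-scale inputs.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end, min_overlap=1):
    """True if two half-open intervals [start, end) share >= min_overlap bases."""
    return min(a_end, b_end) - max(a_start, b_start) >= min_overlap


def top_exons_by_snp_count(exons, snps, top_n=100):
    """Rank exons by the number of overlapping SNPs.

    exons and snps are lists of (chrom, start, end, name) tuples
    (a hypothetical layout; exon names are assumed to be unique).
    """
    # Join: one (exon, snp) pair per overlap on the same chromosome.
    joined = [
        (exon, snp)
        for exon in exons
        for snp in snps
        if exon[0] == snp[0] and overlaps(exon[1], exon[2], snp[1], snp[2])
    ]
    # Count: how many joined rows carry each exon name.
    counts = Counter(exon[3] for exon, _ in joined)
    # Sort in descending order and select the first top_n entries.
    ranked = counts.most_common(top_n)
    # Compare: recover the full exon record for each selected name.
    by_name = {exon[3]: exon for exon in exons}
    return [(by_name[name], n) for name, n in ranked]


# Made-up example records, for illustration only.
example_exons = [
    ("chr22", 100, 200, "exonA"),
    ("chr22", 300, 400, "exonB"),
    ("chr22", 500, 600, "exonC"),
]
example_snps = [
    ("chr22", 110, 111, "rs1"),
    ("chr22", 150, 151, "rs2"),
    ("chr22", 350, 351, "rs3"),
    ("chr22", 450, 451, "rs4"),
    ("chr21", 120, 121, "rs5"),
]

for exon, n_snps in top_exons_by_snp_count(example_exons, example_snps, top_n=2):
    print(exon, n_snps)
```

Exons with no overlapping SNP never appear in the joined rows, so they are absent from the ranking, which mirrors the behavior described for the Count step.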
Basic Protocol 2: Loading Data and Understanding Datatypes
In Galaxy, information is stored in datasets, which are analogous to files. Datasets can be added to your history by uploading files from your computer, or extracting from external data sources integrated with Galaxy such as UCSC's ENCODE datasets (Blankenberg et al., 2007; Raney et al., 2011). Transferring external data via HTTP or FTP, copying from shared or public Galaxy histories and libraries, and running data manipulation and analysis tools within Galaxy are also explained. In addition to its data contents, each Galaxy dataset is associated with metadata. Metadata is information that describes the characteristics of a dataset. These can include the assigned and given names/annotation, the associated reference genome and build, the format datatype, and, frequently, additional datatype-specific labels and definitions.
In this protocol, we demonstrate how metadata are assigned and modified for common genome analysis datasets uploaded into Galaxy using the methods listed above. We also use Galaxy to transform a dataset from a custom format into a standard BED format.
- An Internet-connected computer
- Internet browser that supports JavaScript (e.g., most current browsers such as Mozilla Firefox, Safari, Opera, Chrome, or Microsoft Internet Explorer) and an FTP client, such as FileZilla
- None
| 1. | Return to the main Galaxy interface by going to the URL http://usegalaxy.org. | ||||
| 2. | Create a new history. In the History panel, click on Options and select Create New. | ||||
| 3. | Name the new history by clicking on the text Unnamed History and entering Basic Protocol 2. | ||||
| 4. | Import two ChIP-Seq mouse ENCODE control and tag datasets from a shared data library.
These datasets are raw data from an ENCODE transcription factor binding site experiment described at http://genome.ucsc.edu/cgi-bin/hgFileUi?db=mm9&g=wgEncodeSydhTfbs. The original data were generated and analyzed by the labs of Michael Snyder at Stanford University and Sherman Weissman at Yale University. An important point for this protocol is that they are all in a legacy Illumina FASTQ format and processed by Galaxy's primary tool base (as tools are backwardly compatible with older FASTQ formats). To make this protocol run significantly faster, the two datasets have been reduced to contain only data that will eventually map to chromosome 19. The original full-length files are available at http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeSydhTfbs/wgEncodeSydhTfbsMelCtcfDmso20IggyaleRawDataRep1.fastq.gz and http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeSydhTfbs/wgEncodeSydhTfbsMelCtcfDmso20IggyaleRawDataRep2.fastq.gz. The two imported datasets are now history items. | ||||
| 5. | Upload an annotated promoter dataset via FTP.
This dataset comes from the Mammalian Promoter Database (MPromDB, http://mpromdb.wistar.upenn.edu; Gupta et al., 2011), a curated database that strives to annotate gene promoters identified from ChIP-Seq experiment results. MPromDB is a public resource, but requires a login to download data and the data are restricted to noncommercial use. Depending on network speed and server load, this transfer may take several minutes. The MM9.chr19.AnnotatedPromotersWithTissueRNAP2Density.txt file now appears in the Files uploaded via FTP section. Galaxy knows about many reference genome builds, including mm9, the most recent reference for mouse. Setting this field gives Galaxy context for subsequent operations. | ||||
| 6. | Convert the dataset to a genomic intervals format so it can be visualized and used with Galaxy's interval operations (as described in Basic Protocol 4).
Clicking the eye icon shows a preview of the dataset in the center panel. Column 2 contains the genomic coordinates as chromosome:start..stop. To convert this file into a Galaxy genomic intervals format, this single column needs to be split into 3 columns. Click Execute. The resulting dataset contains only one column, the genomic coordinates, from column 2 of the input dataset. Click Execute. The output dataset has two columns in it: the first containing the chromosome name, and the second the start and stop positions, separated by two periods. Click Execute. The output dataset has three columns in it. Click Execute. The output dataset has 13 columns in it. The first three are the genomic coordinates, and the last 10 are from the original dataset. The center panel is updated and several new attributes appear, as shown in Figure 10.5.12. In this case, Galaxy correctly detects that the chromosome column is 1, and the start and end columns are 2 and 3. Galaxy did not detect the strand and name columns, but they can be easily manually assigned. | ||||
| 7. | Convert this dataset from a generic genomic interval format to BED format, which is a similar, but stricter, type of interval format. This allows the dataset to be used with tools that require BED format. The BED format is defined at http://genome.ucsc.edu/FAQ/FAQformat.html#format1. Not all of the data in the MPromDB file maps to columns in BED, but the data for all required and some optional BED columns are in the MPromDB dataset.
Click Execute. The output dataset contains ~8,600 promoters (same as the input file), but contains only the 6 columns specified in the Cut tool, and those columns have been rearranged as in Figure 10.5.13. The dataset is now formatted as a BED file, but the format type has not been applied yet. This adds an additional attribute, Score column for visualization, to the center panel. BED can include (and this dataset does) a score value in column 5. Note that if Convert to new format is used to transform interval to BED, the score value will be lost (and padded as 0) as it is not a defined interval format attribute. | ||||
| 8. | Get the RefSeq gene definitions for chromosome 19. This gene set will provide context for visualizations in subsequent protocols.
This brings up the next screen of the Table Browser interface. This will create an item named UCSC Main on Mouse: refGene (chr19:1-61342430) in your history. This dataset contains 944 genes at the time of publication; the exact count may vary slightly as the RefSeq Genes track is updated with GenBank incremental releases (by the track source, UCSC). This has no impact on the analysis methods presented in protocols that use this dataset; however, some counts may vary slightly. |
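Steps 6 and 7 split a single chromosome:start..stop column into three columns and then rearrange the result into six BED columns. The Python sketch below shows one way to express that transformation outside Galaxy. The coordinate syntax comes from the protocol; the column positions for name, score, and strand, and the sample row, are hypothetical, because the actual layout of the MPromDB file is not reproduced here. The sketch also copies the coordinates unchanged and does not attempt any 0-based or 1-based adjustment.

```python
def split_coordinates(field):
    """Split a 'chromosome:start..stop' string into (chrom, start, stop)."""
    chrom, span = field.split(":")
    start, stop = span.split("..")
    return chrom, int(start), int(stop)


def to_bed6(line, coord_col=1, name_col=0, score_col=2, strand_col=3):
    """Turn one tab-separated row into a six-column BED line.

    The default column indexes (0-based) are assumptions for illustration;
    adjust them to match the real input file.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, stop = split_coordinates(fields[coord_col])
    bed_fields = [
        chrom,
        str(start),
        str(stop),
        fields[name_col],
        fields[score_col],
        fields[strand_col],
    ]
    return "\t".join(bed_fields)


# A made-up input row: name, coordinates, score, strand.
example_row = "P1\tchr19:3000..4000\t12\t+"
print(to_bed6(example_row))
```

Keeping the coordinate parsing in its own function mirrors the protocol, where the split is done in separate passes before the columns are cut and reordered.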
Basic Protocol 3: Calling Peaks for ChIP-seq Data
The decreasing cost and increasing throughput of sequencing technologies have made chromatin immunoprecipitation followed by sequencing (ChIP-seq) an essential tool for genome-wide profiling of protein-binding, histone modification, and nucleosome positioning (Park, 2009; Pepke et al., 2009). There are numerous tools for various stages of ChIP-seq analysis, and this protocol will focus on the use of MACS (Model-based Analysis of ChIP-Seq; Zhang et al., 2008) to perform peak calling that identifies regions of the mouse genome that are positive for zinc-finger CTCF tags versus a control. CTCF is a transcription factor that can function as either a repressor or activator. Though known to bind to several thousand different genomic locations, it has also been experimentally associated with cancer tumors including, but not limited to, testis, prostate, lung, and breast (Phillips and Corces, 2009). This protocol begins with FASTQ Tag and Control datasets that are groomed (using FASTQ Groomer, a Galaxy tool that normalizes quality scores and FASTQ formatting; Blankenberg et al., 2010) and mapped (using Bowtie, a DNA short read aligner; Langmead et al., 2009), and ends with peak calling by MACS.
- An Internet-connected computer
- Internet browser that supports JavaScript (e.g., most current browsers such as Mozilla Firefox, Safari, Opera, Chrome, or Microsoft Internet Explorer)
- Results from Basic Protocol 2, step 4 (see that step for sources, methods, and references).
- Datasets: Main Galaxy public instance http://usegalaxy.org
- Shared Data: Data Library: ChIP-Seq Mouse Example
- 1. Control file
- 2. Tags file
- (both created or imported by user)
| 1. | Return to the main Galaxy interface and start a new history.
| ||
| 2. | Load ChIP-seq input files described in Basic Protocol 2, step 4.
The right history panel will now contain the two copied datasets. Data from Basic Protocol 2, step 4 are in the original, ungroomed FASTQ format from the source. These data will require grooming (format standardization) prior to mapping.
| ||
| 3. | Groom the ChIP-seq FASTQ files as shown in Figure 10.5.15.
Two new history datasets will be added to the history. More about job status in the history panel: often the next steps in a protocol can be started before a prior job run has completed, to create a queue of related jobs that will run in sequence.
| ||
| 4. | Map the ChIP-seq datasets to the Mouse Reference Genome using Bowtie.
This will launch the Bowtie mapping job for the input (control) dataset. This will launch the Bowtie mapping jobs for the control and tags datasets. The result will be two new datasets added to the history. | ||
| 5. | Call Peaks with MACS (Model-based Analysis of ChIP-Seq).
This creates the optional output files listed in step 6, d and e. | ||
| 6. | Output datasets consist of one or more result files (a to e) and an HTML summary report (f). Dataset results are listed in the far right history panel, and if the HTML summary report eye icon is clicked, it will display in the center panel, as shown in Figure 10.5.18:
BED and WIG are both plain-text data formats that describe discrete or continuous genome annotation features. These datatypes were developed by the UC Santa Cruz Bioinformatics Group (http://genome.ucsc.edu; Fujita et al., 2011). Interval format is a plain-text data format that describes discrete genome annotation features. This datatype was developed by the Galaxy Team (http://galaxyproject.org; Giardine et al., 2005; Blankenberg et al., 2010; Goecks et al., 2010).
| ||
| 7. | Click on the pencil icon for dataset 6.a. to name and format the BED file.
The CTCF Peaks chr19 BED result file demonstrates the primary output from this ChIP-seq expression peak-calling workflow. |
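The grooming step in this protocol standardizes FASTQ quality scores before mapping. As a rough illustration of what such normalization involves, the Python sketch below converts quality strings from a Phred+64 encoding to the Sanger Phred+33 encoding. This is an assumption about the input: the protocol only states that the files are in a legacy Illumina format, and the sketch is not the FASTQ Groomer implementation, which handles more variants and validation. The function names and the sample record are hypothetical.

```python
def phred64_to_phred33(quality):
    """Re-encode a quality string from Phred+64 to Phred+33 (Sanger)."""
    converted = []
    for symbol in quality:
        score = ord(symbol) - 64  # decode assuming an offset of 64
        if score < 0:
            raise ValueError(f"character {symbol!r} is not valid Phred+64")
        converted.append(chr(score + 33))  # encode with an offset of 33
    return "".join(converted)


def groom_record(record):
    """Normalize one four-line FASTQ record given as a list of strings."""
    header, sequence, separator, quality = record
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # The separator line is reduced to a bare '+' in this sketch.
    return [header, sequence, "+", phred64_to_phred33(quality)]


# A made-up read with Phred+64 qualities (scores 40, 40, 0, 2).
example = ["@read1", "ACGT", "+read1", "hh@B"]
print(groom_record(example))
```

Running the conversion on data that are already Phred+33 would raise an error or silently produce wrong scores, which is why a real grooming tool asks for the input encoding explicitly.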
Basic Protocol 4: Compare Datasets Using Genomic Coordinates
The protocol describing finding human exons with highest SNP density (Basic Protocol 1) used the join operation to find all protein-coding exons that contain SNPs. This is just one of many interval operations offered in Galaxy, which are based on the bx-python package (https://bitbucket.org/james_taylor/bx-python/wiki/Home) developed at Penn State University and Emory University. These include intersect, subtract, complement, merge, concatenate, cluster, coverage, base coverage, and join. Some operations are analogous to relational database queries, such as join and coverage (unit 9.2). Other operations are analogous to set operations. Figures 10.5.19 and 10.5.20 show examples of input and output produced by individual interval operations. In the following protocol, the authors use two human chromosome 22 annotation datasets as examples. The first dataset, Exons, representing protein-coding exons, is imported from the Basic Protocol 1 history. The second dataset Repeats, representing interspersed repeats (also known as transposable elements or simply repeats in the text), is retrieved from the UCSC Table Browser.
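To make the intersect and subtract operations concrete, here is a minimal Python sketch of their whole-interval and subregion modes. It is an illustration only and does not use or reproduce the bx-python API: the function names, the `(chrom, start, end, name)` tuple layout, the half-open coordinates, and the sample intervals are all assumptions.

```python
def _overlap(a, b):
    """Number of bases shared by two (chrom, start, end, name) intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def intersect_intervals(first, second, min_overlap=1):
    """Whole intervals of `first` that overlap any interval of `second`."""
    return [a for a in first
            if any(_overlap(a, b) >= min_overlap for b in second)]


def intersect_pieces(first, second, min_overlap=1):
    """Only the overlapping subregions, one piece per overlapping pair."""
    pieces = []
    for a in first:
        for b in second:
            if _overlap(a, b) >= min_overlap:
                pieces.append((a[0], max(a[1], b[1]), min(a[2], b[2]), a[3]))
    return pieces


def subtract_intervals(first, second, min_overlap=1):
    """Whole intervals of `first` that overlap nothing in `second`."""
    return [a for a in first
            if not any(_overlap(a, b) >= min_overlap for b in second)]


def subtract_pieces(first, second):
    """Intervals of `first` with every subregion covered by `second` removed."""
    result = []
    for chrom, start, end, name in first:
        cursor = start  # left edge of the part not yet removed
        for b in sorted(b for b in second if b[0] == chrom):
            if b[2] <= cursor or b[1] >= end:
                continue  # no overlap with the remaining part
            if b[1] > cursor:
                result.append((chrom, cursor, b[1], name))
            cursor = max(cursor, b[2])
        if cursor < end:
            result.append((chrom, cursor, end, name))
    return result


# Made-up intervals, for illustration only.
exons = [("chr22", 100, 200, "e1"), ("chr22", 300, 400, "e2"),
         ("chr22", 500, 600, "e3")]
repeats = [("chr22", 150, 160, "r1"), ("chr22", 190, 320, "r2")]

print(intersect_intervals(exons, repeats))
print(intersect_pieces(exons, repeats))
print(subtract_intervals(exons, repeats))
print(subtract_pieces(exons, repeats))
```

In this toy input, exon e1 overlaps two repeats, so the pieces mode of intersect yields more rows than the whole-interval mode, and the pieces mode of subtract splits e1 into two regions that keep the same name.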
Figure 10.5.20 Examples highlighting the functionality of coverage tools.
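The merge, cluster, base coverage, and coverage operations can be sketched in the same spirit. The Python code below is a simplified illustration under stated assumptions, not the Galaxy or bx-python implementation: intervals are `(chrom, start, end)` tuples with half-open coordinates, the cluster function supports only non-negative maximum distances and only the return type that merges each cluster into a single interval, and all names and sample values are hypothetical. Merge is written as a cluster with maximum distance 0 and minimum size 1, following the equivalence stated in this protocol.

```python
def cluster(intervals, max_distance=0, min_intervals=1):
    """Merge each cluster of nearby intervals into one spanning interval."""
    if max_distance < 0:
        raise ValueError("negative distances are not handled in this sketch")
    merged = []
    current = None  # [chrom, start, end, number of intervals in cluster]
    for chrom, start, end in sorted(intervals):
        if (current is not None and chrom == current[0]
                and start - current[2] <= max_distance):
            current[2] = max(current[2], end)  # extend the open cluster
            current[3] += 1
        else:
            if current is not None and current[3] >= min_intervals:
                merged.append(tuple(current[:3]))
            current = [chrom, start, end, 1]
    if current is not None and current[3] >= min_intervals:
        merged.append(tuple(current[:3]))
    return merged


def merge(intervals):
    """Flatten overlapping or touching intervals into single intervals."""
    return cluster(intervals, max_distance=0, min_intervals=1)


def base_coverage(intervals):
    """Total bases covered, counting overlapping bases only once."""
    return sum(end - start for _, start, end in merge(intervals))


def coverage(first, second):
    """Append bases covered and fraction covered to each interval of `first`."""
    flat = merge(second)  # flatten so that no base is counted twice
    rows = []
    for chrom, start, end in first:
        covered = sum(max(0, min(end, e) - max(start, s))
                      for c, s, e in flat if c == chrom)
        length = end - start
        fraction = covered / length if length else 0.0
        rows.append((chrom, start, end, covered, fraction))
    return rows


# Made-up intervals, for illustration only.
repeats = [("chr22", 100, 200), ("chr22", 150, 250), ("chr22", 400, 450)]
exons = [("chr22", 180, 280), ("chr22", 300, 350)]

print(merge(repeats))
print(base_coverage(repeats))
print(coverage(exons, repeats))
print(cluster(repeats, max_distance=100, min_intervals=2))
```

Flattening the second dataset before computing coverage is a design choice of this sketch: it guarantees that the covered fraction never exceeds 1 even when the second dataset contains overlapping intervals.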
- An Internet-connected computer
- Internet browser that supports JavaScript (e.g., most current browsers such as Mozilla Firefox, Safari, Opera, Chrome, or Microsoft Internet Explorer)
- None
| 1. | Create a new history. In the History panel click on Options and select Create New. | ||
| 2. | Name the new history by clicking on the text Unnamed History and entering Basic Protocol 4. | ||
| 3. | Retrieve exons for chromosome 22 dataset from the Basic Protocol 1 history:
| ||
| 4. | Refresh the History panel and use the pencil icon (see Fig. 10.5.4) to change the name of the new dataset to Exons on the Edit Attributes form, as shown in Figure 10.5.5. | ||
| 5. | Retrieve repeats for chromosome 22:
This brings up the next screen of the Table Browser interface. The history item will appear after a moment (10 to 20 sec) with the name UCSC Main on Human: rmsk (chr22:1-51304566). | ||
| 6. | Intersect: Find exons that overlap with one or more transposable elements, as shown in Figure 10.5.19A. Intersect allows for the intersection of two datasets. The intersect tool can output either the entire intervals from the first dataset that overlap the second dataset (e.g., all exons containing repeats), or it can return just the intervals representing the overlap between the two datasets (e.g., only the parts of exons that are repetitive). This step demonstrates the first option. When finding entire intervals (by setting Return to Overlapping Intervals), the order of the datasets is important. The operation will output all of the intervals in the first query that overlap any interval in the second query. It can also be thought of as a filter: intervals that do not overlap any interval in the second query will be filtered out.
The minimum overlap of 1 requests that any overlapping regions (even if they overlap by only 1 position) will be output. This launches the intersect operation. A new item appears in the History panel. The resulting dataset contains ~220 regions: every coding exon that overlaps at least 1 base pair of a transposable element. The entire intervals from the coding exons dataset are output whenever there is an overlap with any transposable element interval. | ||
| 7. | Intersect: Find regions within exons that overlap with transposable elements, as shown in Figure 10.5.19B. The second intersect option is to return only the pieces of intervals that overlap. When finding pieces of intervals, or the regions representing the overlap between the two datasets (by setting Return to Overlapping Pieces of Intervals), the output will be the intervals of the first dataset with the nonoverlapping subregions removed.
This launches the intersect operation. A new item appears in the History panel. The output dataset contains ~250 regions: the subregions of the exons that overlap with the intervals of the repeats. This dataset contains more regions than the previous intersect example because several exons overlap with more than one repeat. Examine the first few rows of this dataset. The start and end columns of the new dataset are different from those in the first intersect dataset, and the exon names are repeated whenever more than one repeat intersects with that exon. | ||
| 8. | Subtract all: Find exons that do not contain any repeats, as shown in Figure 10.5.19C. Subtract does the opposite of intersect. It removes the intervals or parts of intervals in the first dataset that are found in the second dataset. Like intersect, subtract can treat intervals as a whole, removing or keeping entire intervals, or it can break them apart, removing overlapping subregions. This step demonstrates the first option, returning entire intervals. As with arithmetic subtraction, the order of the datasets is important. The second dataset is subtracted from the first dataset. The output is a variation of the first dataset and all of its columns. When subtracting whole intervals (by setting Return to Intervals with no overlap), the output will be the intervals of the first dataset that do not overlap any part of intervals of the second dataset.
The minimum overlap of 1 means that any overlapping regions will be removed from the output. This launches the subtract operation. The output dataset contains ~7000 exons that contain no transposable elements; each exon that overlaps a transposable element is removed from the output. | ||
| 9. | Subtract subregions: Remove subregions of exons that overlap with transposable elements, as shown in Figure 10.5.19D. When subtracting overlapping subregions (by setting Return to Non-overlapping pieces of intervals), the output will be the intervals of the first dataset with the overlapping subregions removed.
The minimum overlap of 1 means that any overlapping regions will be removed from the output. This launches the subtract operation. The output dataset contains ~7300 regions/rows. These are the exons minus the subregions that overlap transposable elements. This is different from the previous example: only the overlapping subregions of the exons are removed. Regions or intervals not overlapping are preserved. Thus, this dataset contains more regions than the input exon dataset: exons that overlapped with repeats have now been split into multiple regions (but still with the same exon name). | ||
| 10. | Concatenate and Merge: Compare coding exons and transposable elements, as shown in Figures 10.5.19E (Concatenate) and 10.5.19F (Merge). Concatenate and Merge together are analogous to addition or union. They can be used together to combine datasets and merge (or flatten) the intervals. Concatenate (Figure 10.5.19E) simply combines two interval datasets. The option Both queries are exactly the same filetype indicates that columns in both datasets are the same. If this option is unchecked, then the second dataset is adjusted to match the column assignments of the first. However, since the columns chromosome, start, end, and strand are the only columns used by the operations, all other columns will be replaced in the second dataset with a period. This option is usually left checked, as BED files are the typical interval format used within Galaxy. Merge reads a dataset and combines all overlapping intervals into single intervals. When merging intervals, all columns besides chromosome, start, and end are lost. When two intervals are combined into one, it is ambiguous what the other columns represent or which fields should be carried over to the resulting interval. For this reason, all columns except for chromosome, start, and end are omitted from the output.
Both datasets are in BED format. After the operation has completed, the history item will change to a light-green color. You may click on the title of the history item to view the first few lines, or click the eye icon to view the dataset. This dataset is both datasets combined into one dataset. It contains ~82,000 regions. | ||
| 11. | Base Coverage: Calculate the number of bases covered by all transposable elements, as shown in Figure 10.5.20A. The Base Coverage tool (Figure 10.5.20A) calculates the number of bases covered by all of the intervals in a dataset. It does not count overlapping bases more than once; if there are two intervals referring to the same region, those bases are only counted once.
| ||
| 12. | Coverage: Determine how much of each coding exon is covered by repeats, as shown in Figure 10.5.20B. The Coverage tool (Figure 10.5.20B) is a combination of Intersect and Base Coverage. Coverage finds the number of bases in each interval of the first dataset that are covered by intervals of the second dataset. In addition, it finds the fraction of the interval's total length that is covered by intervals in the second query. The resulting dataset is all of the intervals from the first input dataset, with two columns added to the end: bases covered and fraction covered. The additional two columns can be manipulated with other tools such as Filter under the Filter and Sort section of the toolbox or with Compute under the Text Manipulations section of the toolbox.
| ||
| 13. | Complement: Chromosome complement of repeats on chromosome 22, as shown in Figure 10.5.21A.
The Complement tool (Figure 10.5.21A) inverts a dataset. Complement reads in all of the regions of a dataset, and outputs the regions not covered by any intervals in that dataset. The option Genome-wide complement allows for the entire genome to be complemented, regardless of whether a chromosome, contig, scaffold, etc., is represented in the query dataset. In a genome-wide complement of a dataset, any chromosome that does not have any intervals in the query dataset will be output in the result as the entire chromosome. In a normal complement, only the chromosomes, contigs, scaffolds, etc., that are referenced in the query dataset will be represented in the output.
| ||
| 14. | Cluster: Merge clusters of at least 2 transposable elements within 100 base pairs into single region elements, as shown in Figures 10.5.21B and 10.5.21C.
Cluster (Figures 10.5.21B and 10.5.21C) is one of the most versatile and powerful interval operations. Cluster finds clusters of intervals, and has a wide range of behavior depending on the options specified. The Maximum distance parameter specifies the maximum distance allowed between regions for those regions to be considered a cluster. Maximum distance can be a positive number, zero, or a negative number. When maximum distance is a positive number, regions that are at most that distance from each other are considered to be a cluster. When maximum distance is zero, Cluster considers intervals that are touching to be a cluster. This is similar to the behavior of the Merge tool, but is more flexible and specific. When maximum distance is a negative number, intervals that have at least that amount of overlap are considered to be a cluster.
A cluster will be ignored unless it contains at least as many intervals as specified by the parameter Minimum intervals per cluster. If this is set to 1 or lower, then all intervals, even single intervals that do not cluster with any surrounding intervals, are included in the output.
Cluster has five options for output, listed in the drop-down list Return type:
Merge clusters into single intervals finds all of the clusters according to the criteria set by maximum distance and minimum intervals per cluster, and outputs the start and end of each cluster as an interval. The result is that clustered intervals become one large, continuous interval spanning all of the intervals within that cluster. Setting maximum distance to 0 and minimum intervals per cluster to 1 with this option produces exactly the same output as the Merge tool.
Find cluster intervals; preserve comments and order finds all of the clusters according to the criteria set by maximum distance and minimum intervals per cluster, and outputs those intervals in the original order in which they were encountered in the input dataset. This option can be thought of as a filter that removes the intervals that are not found within a cluster.
Find cluster intervals; output grouped by clusters finds all of the clusters according to the criteria set by maximum distance and minimum intervals per cluster. It is the same as the previous option, except that the intervals are grouped together in the output by cluster.
Find the smallest interval in each cluster and Find the largest interval in each cluster first build the clusters and then return only the smallest or largest interval in each cluster.
| ||
| 15. | Join: Compare and Join coding exons with transposable elements, as shown in Figure 10.5.22A.
The Join tool (Figure 10.5.22) operates similarly to the joins performed by database management systems such as MySQL. Join looks at two datasets of intervals, and joins them based on interval overlap. Any interval in the second dataset that overlaps an interval in the first dataset will be appended to the line from the first dataset and output.
Like Intersect, Join allows a minimum overlap to be specified. Intervals must meet or exceed the minimum overlap to be joined. There are several types of join that can be done, as listed in the following paragraphs. These are specified by the drop-down list labeled Return:
Only records that are joined (INNER JOIN) returns only those intervals in the first query that overlap and are joined to an interval in the second query. For users of SQL databases, this is similar to an INNER JOIN (Fig. 10.5.22A).
All records of first dataset (fill null with .) returns all intervals from the first dataset. Any interval in the first dataset that does not join an interval in the second dataset will have the extra fields padded with a period (Fig. 10.5.22B).
All records of second dataset (fill null with .) returns all intervals from the second dataset. Any interval in the second dataset that is not joined to an interval in the first dataset will have fields filled in with a period (Fig. 10.5.22C).
All records of both datasets (fill nulls with a .) returns all of the intervals from both datasets. Intervals that do not join have fields filled in with a period (Fig. 10.5.22D). An example of output for each join option is shown in Figure 10.5.22E. Notice that in all but the first option (A), example intervals may contain invalid chromosome, start, and/or end data points (null values, shown as a period).
This could result in a dataset that requires filtering to exclude null values before performing further operations.
|
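The interval operations in steps 10 to 15 can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Galaxy's implementation: the function names are hypothetical, intervals are assumed to be (chromosome, start, end) tuples with BED-style 0-based, half-open coordinates, and the handling of a negative maximum distance is an approximation of the behavior described for Cluster.

```python
from collections import defaultdict

# Hypothetical helpers; intervals are (chrom, start, end), 0-based half-open.

def cluster(intervals, max_distance=0, min_intervals=1):
    """Return one spanning interval per cluster of nearby intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    result = []
    for chrom in sorted(by_chrom):
        spans = sorted(by_chrom[chrom])
        cur_start, cur_end = spans[0]
        count = 1
        for start, end in spans[1:]:
            if start - cur_end <= max_distance:
                # Close enough to the current cluster: extend it.
                cur_end = max(cur_end, end)
                count += 1
            else:
                if count >= min_intervals:
                    result.append((chrom, cur_start, cur_end))
                cur_start, cur_end, count = start, end, 1
        if count >= min_intervals:
            result.append((chrom, cur_start, cur_end))
    return result

def merge(intervals):
    """Flatten overlapping or touching intervals (Cluster with 0 and 1)."""
    return cluster(intervals, max_distance=0, min_intervals=1)

def base_coverage(intervals):
    """Count covered bases, counting overlapping bases only once."""
    return sum(end - start for _, start, end in merge(intervals))

def coverage(first, second):
    """Append bases covered and fraction covered to each first-set interval."""
    flat = merge(second)
    rows = []
    for chrom, start, end in first:
        covered = sum(max(0, min(end, e) - max(start, s))
                      for c, s, e in flat if c == chrom)
        rows.append((chrom, start, end, covered, covered / (end - start)))
    return rows

def complement(intervals, chrom_lengths):
    """Uncovered regions, only for chromosomes present in the input."""
    gaps, last_end = [], {}
    for chrom, start, end in merge(intervals):
        previous = last_end.get(chrom, 0)
        if start > previous:
            gaps.append((chrom, previous, start))
        last_end[chrom] = end
    for chrom, previous in last_end.items():
        if previous < chrom_lengths[chrom]:
            gaps.append((chrom, previous, chrom_lengths[chrom]))
    return sorted(gaps)

def inner_join(first, second, min_overlap=1):
    """Pair intervals that overlap by at least min_overlap bases."""
    return [(a, b) for a in first for b in second
            if a[0] == b[0] and min(a[2], b[2]) - max(a[1], b[1]) >= min_overlap]

# Made-up coordinates and a made-up chromosome length, for illustration only.
exons = [("chr22", 100, 200), ("chr22", 300, 400)]
repeats = [("chr22", 150, 250), ("chr22", 240, 320), ("chr22", 900, 950)]

print(merge(repeats))            # [('chr22', 150, 320), ('chr22', 900, 950)]
print(base_coverage(repeats))    # 220
print(coverage(exons, repeats))  # 50 bases (0.5) and 20 bases (0.2)
print(complement(repeats, {"chr22": 1000}))
print(cluster(repeats, max_distance=100, min_intervals=2))
print(len(inner_join(exons, repeats)))  # 2
```

Writing Merge as a special case of Cluster mirrors the statement in step 14 that a maximum distance of 0 and a minimum of 1 interval per cluster reproduce the Merge output.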
Basic Protocol 5: Working with Multiple Sequence Alignments
Galaxy includes several tools to specifically work with paired and multiple sequence alignment format (MAF) datasets. The tool functions can upload, extract, and summarize the content of MAF datasets sourced from the UCSC Browser with the goal of maximizing analytical access to the underlying data. Both custom and standard MAF datasets can be uploaded and used with the majority of tools. The MAF manipulation tools used in this protocol were developed by the Galaxy team (Blankenberg et al., 2011).
Part A of this protocol will demonstrate how to extract regions from a standard Conservation MAF reference track (hg19), based on the query interval ranges from Basic Protocol 1, step 20: the top 100 SNP-containing human coding exons on chromosome 22.
Part B of this protocol will demonstrate how to generate coverage statistics from a standard Conservation MAF reference track (hg19), based on the same query interval ranges from Basic Protocol 1, step 20.
Part C of this protocol will demonstrate how to extract and manipulate syntenic transcript FASTA sequence from a standard Conservation MAF reference track (hg19), based on the query interval ranges from a human RefSeq Genes track, as extracted in BED format from the UCSC Table Browser, limited to chromosome 22.
- An Internet-connected computer
- Internet browser that supports JavaScript (e.g., most current browsers such as Mozilla Firefox, Safari, Opera, Chrome, or Microsoft Internet Explorer)
- Results from Basic Protocol 1, step 20 (see Basic Protocol 1 for sources, methods, and references):
- 1. SNP Coding Exons chr22
- (created or imported by user)
- UCSC Browser tracks for Conservation and RefSeq Genes:
- 2. Conservation 46-way multiZ track for hg19
- (local on Main Galaxy public instance http://usegalaxy.org)
- 3. RefSeq Genes hg19 chr22
- (imported by user into Galaxy history)
- Workflow: Main Galaxy public instance http://usegalaxy.org
- Shared Data: Published Workflows
- 4. Transform Stitch Gene blocks FASTA blocks to standardized FASTA file
- (imported by user)
| 1. | Return to the main Galaxy interface and start a new history.
|
| 2. | Copy BED file from Basic Protocol 1, step 20: SNP Coding Exons chr22.
After the copy completes, a green banner at the top of the form will display the following message: 1 datasets copied to 1 history: Basic Protocol 5.
The right history panel will now contain the copied dataset SNP Coding Exons chr22. This dataset, copied from Basic Protocol 1, is a 100-line BED-format file. | ||
| 3. | Extract conserved MAF blocks for primate species.
Result file MAF blocks for SNP Coding Exons hg19 chr22 contains the MAF alignment blocks corresponding to the 100 input hg19 exon query interval ranges. An example of this output is in Figure 10.5.24.
| ||
| 4. | Generate coverage statistics for SNP Coding Exons chr22 from MAF for all species.
|
| 5. | Import transcript coordinates of human RefSeq Genes from the UCSC Table Browser to Galaxy.
This brings up the next screen of the Table Browser interface. | ||
| 6. | Extract syntenic FASTA sequence from MAF for primate species (same 10 species as listed in Part A, Step 3). Example result data is shown in Figure 10.5.28.
| ||
| 7. | Use a Galaxy Workflow to transform the FASTA blocks into a standardized FASTA file.
Transforming the data into a concatenated FASTA file containing only those results with sequence will make the data suitable for tools that accept nucleotide FASTA sequence as an input.
This workflow generates five new datasets, some of them hidden in the history panel, as shown in Figure 10.5.33. To access these intermediate hidden datasets, click on Options: Show Hidden Datasets in the top right corner of the right history panel.
The result dataset FASTA all for RefSeq Genes hg19 chr22 will contain 6,882 sequences and is formatted for use with tools that accept FASTA format.
| ||
| 8. | Transform the FASTA blocks into a standardized FASTA file for a single species.
Subsetting the results by species will give the data a specific genome context and make it usable by tools that require a reference genome assignment.
Note: Many of the operations in step 8 are the same as those bundled into the step 7 workflow. Step 8 demonstrates the individual tools in detail, showing how Galaxy's data manipulation, filtering, sorting, and format conversion tools work in combination. Each Galaxy tool most often performs a single, distinct task, which maximizes the ability to create customized analysis paths. Bundling multiple steps into a workflow makes a customized analysis easy to apply to additional datasets and to share with collaborators. Target species: (see step 3 for full list)
rheMac2 is the short label for the reference genome, used for the attribute database: and Database/Build: in the Galaxy user interface and file system.
Result dataset FASTA rheMac2 for RefSeq Genes hg19 chr22 contains predicted transcript FASTA sequence for only the rheMac2 species/build, corresponding to the input hg19 transcript query interval ranges (when conserved in the hg19 MAF data). Reassignment of the database attribute ensures that this dataset will be used correctly with downstream analysis tools.
|
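The two transformations in steps 7 and 8 (keeping only records that contain sequence, then subsetting by species) can be illustrated with a short Python sketch. It is not the published workflow: the function names are hypothetical, and the header layout `>build.identifier` is an assumption made for the example rather than a documented output format.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_with_sequence(records):
    """Drop records whose sequence is empty or consists only of gaps."""
    return [(h, s) for h, s in records if s.replace("-", "")]


def select_species(records, build):
    """Keep records whose header starts with the given build label.

    Assumes headers look like 'rheMac2.some_id' (hypothetical layout).
    """
    return [(h, s) for h, s in records if h.split(".", 1)[0] == build]


# Made-up records for illustration only.
example = (
    ">hg19.geneA\nACGT\nAC\n"
    ">rheMac2.geneA\nAC-T\n"
    ">panTro2.geneA\n----\n"
    ">rheMac2.geneB\n\n"
)

with_sequence = keep_with_sequence(parse_fasta(example))
print(len(with_sequence))                        # 2
print(select_species(with_sequence, "rheMac2"))  # [('rheMac2.geneA', 'AC-T')]
```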
Guidelines for Understanding Results
Galaxy was designed to be an interactive system, and in most cases results will be self-descriptive, depending on which tools were applied to the original data. As always, caution should be used when interpreting genomic data: the information produced by Galaxy is only as good as the underlying data imported.
Commentary
Background Information
Modern Web-based genomic resources offer many facilities for retrieval and visualization of data. However, few of these resources offer sophisticated tools for further analysis of these data. As a result, almost every experimental biologist has to analyze data on his/her own, struggling with numerous difficulties arising from format incompatibility or incomprehensible user interfaces. Although our computational colleagues are happy to help, few are willing to devote time and resources to develop a good user interface (a significant challenge). Galaxy is a system designed to help both sides. For experimental biologists, Galaxy provides an intuitive user interface offering a direct connection to many widely used data sources and browsers, a simplified FTP data-loading procedure, and a custom genome option for most tools including the native Galaxy Track Browser (GTB, or Trackster). The Galaxy workspace includes a unique history system to organize, label and display data, to track datasets and analysis for sharing and/or publishing, and to extract analysis functions into workflows for re-use. For computational biologists, Galaxy provides a framework that can integrate command-line tools with almost no effort. For each tool, Galaxy generates an interface and provides all housekeeping (e.g., input and output management, job control, error catching, and testing facilities). As this text was compiled with experimental biologists in mind, it does not contain any information on technical aspects of the Galaxy system (found at http://galaxyproject.org).
Critical Parameters and Troubleshooting
Galaxy allows a virtually unlimited number of analyses to be performed on genomic data. In designing the system, the authors tried to put as few constraints on the user as possible. In that sense Galaxy is similar to a car with a manual gearbox: it gives you more control if you know what you are doing (e.g., you do not shift from fifth to reverse). Fortunately, user feedback provides convincing evidence that a short test drive is sufficient to understand how Galaxy works. This text is equivalent to such a test drive. Below, the authors list the most common problems encountered by Galaxy users. They can be condensed into two categories: (1) data format issues and (2) genome build incompatibilities.
Data format issues
Galaxy understands several datatypes including genomic coordinates (e.g., BED, GFF/GTF, Wig), sequences (e.g., FASTQ, FASTA), and alignments (e.g., SAM/BAM and MAF). Most of the tools require data to be in one of these formats. For example, the genomic intervals operations described in Basic Protocol 4 can only be performed on data in interval format. In most cases, changing your data to interval format is as simple as correctly setting metadata, as shown in Basic Protocol 2, step 6.
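To illustrate why column metadata is all that separates a generic tabular file from interval data, the following Python sketch reads an interval from a tab-delimited line once the chromosome, start, and end columns have been named. The function name, its parameters, and the example row are hypothetical and are not part of Galaxy.

```python
def to_interval(line, chrom_col=1, start_col=2, end_col=3):
    """Read one tab-delimited line using 1-based column assignments."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[chrom_col - 1]
    start = int(fields[start_col - 1])
    end = int(fields[end_col - 1])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
    return chrom, start, end


# A made-up row whose coordinates are not in the first three columns.
row = "exon_17\tchr22\t+\t16258185\t16258303"
print(to_interval(row, chrom_col=2, start_col=4, end_col=5))
# ('chr22', 16258185, 16258303)
```

The file content is unchanged; only the column assignments passed to the function differ, which is the role the metadata settings play in Basic Protocol 2, step 6.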
Genome build incompatibilities
Galaxy supports interactive genome analyses that use a mix of different genomes within a single analysis space (History). In the authors' opinion, such mixing is essential for a true comparative genomics resource. The ease of mixing also means that, in some cases, users will accidentally attempt comparing data from different genomes. Thus, when using tools that operate on more than one history item (i.e., most genomic interval operations), make sure that all data come from the same genome build.
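The check recommended above can be expressed as a small guard that runs before any multi-dataset operation. This is a sketch under stated assumptions: the function name is hypothetical, and each dataset is represented simply as a (name, build) pair rather than as a Galaxy history item.

```python
def require_same_build(datasets):
    """Return the shared build label, or raise if the builds differ.

    datasets: iterable of (name, build) pairs, e.g. ("exons", "hg19").
    """
    builds = {build for _, build in datasets}
    if len(builds) != 1:
        raise ValueError("expected one genome build, found: "
                         + (", ".join(sorted(builds)) or "none"))
    return builds.pop()


print(require_same_build([("exons", "hg19"), ("repeats", "hg19")]))  # hg19
```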
If you have questions
Galaxy has a vibrant and growing user and developer community. If you want to learn more or encounter problems, the best places to find out how to get connected are in the Galaxy Wiki (http://galaxyproject.org), specifically our Learning Hub (http://galaxyproject.org/Learn) and Support Resource (http://galaxyproject.org/Support) pages.
Acknowledgments
A vision for Galaxy was originally articulated by Ross Hardison, who is also the major source of support and critical feedback. The authors would like to thank Jim Kent and David Haussler for their continuing support and making UCSC Genome Browser uplink and connection possible. Istvan Albert pioneered initial aspects of Galaxy design. Efforts of the Galaxy Team (Enis Afgan, Guru Ananda, Dannon Baker, Nate Coraor, Jeremy Goecks, Greg Von Kuster, Ross Lazarus) were instrumental in making this work happen. The following individuals also contributed to the Galaxy project at different stages: Richard Burhans, Ramkrishna Chakrabarty, Laura Elnitski, Belinda Giardiane, Bob Harris, Jianbin He, Kanwei Li, Webb Miller, Cathy Riemer, Kelly Vincent, and Yi Zhang. Robert Castelo, France Denoeud, Roderic Guigo, Erika Kvikstad, Julien Lagarde, and Kateryna Makova provided critical comments during software testing. Ramana Davuluri gave permission to use the MPromDB data in these protocols. This work was funded by an NIH grant GM07226405S2 to KDM, a Beckman Foundation Young Investigator Award to AN, NSF grant DBI 0543285 and NIH grants HG004909 and HG006620 to AN and JT, NIH grants HG005133 and HG005542 to JT and AN, as well as funds from Penn State University and Penn State Institute for Cyber Science and the Huck Institutes for the Life Sciences to AN and from Emory University to JT. Additional funding is provided, in part, under a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds. The Department specifically disclaims responsibility for any analyses, interpretations, or conclusions.