Introduction
- Introduction
- Basic Protocol 1: An Introduction to the Galaxy Approach: Finding Promoters Containing TAF1 Binding Sites Identified From a ChIP-Seq Experiment
- Basic Protocol 2: Combining and Filtering Genome Annotations: Finding Exons with the Highest Number of Nucleotide Polymorphisms
- Support Protocol 1: Saving Results in Galaxy and Sharing Data with Others
- Basic Protocol 3: Generating a Workflow From a History in Galaxy
- Support Protocol 2: Modify a Parameter in the Workflow in Galaxy
- Support Protocol 3: Running Workflows with Galaxy
- Support Protocol 4: Sharing Workflows with Galaxy
- Basic Protocol 4: Generating Workflows from Scratch with Galaxy
- Basic Protocol 5: Extracting Sequences and Alignments with Galaxy: An SNPs in Exons Example
- Commentary
Research in the life sciences continues to become more data-intensive. With new high-throughput experimental techniques, an individual laboratory can generate raw data of a scale that was unthinkable only a few years ago. These developments represent an enormous opportunity for basic and applied research. However, they are also creating a crisis for many scientists, since making sense of this wealth of data requires significant analysis infrastructure. Without informatics support, experimental biologists, who possess key biological knowledge and experience, and thus the best potential for making novel discoveries, cannot effectively use the available data.
Galaxy (http://galaxyproject.org) addresses this challenge by providing the needed informatics infrastructure (Taylor et al., 2007). For experimentalists, it provides an analysis environment in which they can perform analysis interactively, while ensuring that the resulting analyses are transparent and reproducible. The Galaxy framework encapsulates high-end computational tools, and gives them intuitive user interfaces while hiding the details of computation and storage management. It thus eliminates the need for specialized informatics expertise when performing many common types of large-scale analysis.
This unit describes the functionality of Galaxy using a series of examples. It is directed primarily at experimentalists, and makes use only of analysis tools available at the public Galaxy service at http://usegalaxy.org. Various components and tools of the public Galaxy server are explored by following several connected, but independent, protocols. Although the data being investigated in these protocols may not be of personal research interest, the techniques demonstrated are useful in a wide array of applications. Each of the protocols below is accompanied by a screencast (a real-time movie showing the steps of the protocol as they appear on the screen) available from http://galaxycast.org/CPMB. Following along with the screencasts is recommended, as they provide an alternate presentation of details not easily conveyed by text. This unit is divided into the following protocols: Basic Protocol 1 is an introduction to the Galaxy approach: finding promoters containing TAF1 binding sites identified from a ChIP-Seq experiment. Basic Protocol 2 involves a bit more data manipulation: finding coding exons with the most SNPs. Support Protocol 1 describes how to save results in Galaxy and share data with others. Basic Protocol 3 describes generating a workflow from a history in Galaxy. Support Protocol 2 describes modifying a parameter of the workflow in Galaxy. Support Protocol 3 describes running workflows with Galaxy. Support Protocol 4 describes sharing workflows with Galaxy. Basic Protocol 4 describes generating workflows from scratch with Galaxy. Basic Protocol 5 describes extracting sequences and alignments with Galaxy, using an SNPs in exons example.
These protocols cover the basic aspects of the functionality of Galaxy. They are sufficient for overcoming the initial learning curve, but Galaxy has much more to offer, including complex analyses of next generation sequencing data such as metagenomic applications or re-sequencing studies. Additionally, the Galaxy project is progressing rapidly with new tools and features added on a monthly basis. The best way to keep up with these enhancements is to regularly check the screencast page at http://galaxycast.org.
Before beginning the protocols, it is beneficial to review terminology and concepts. Many of the formats (datatypes) used in genomics are composed of rows of tab-delimited columns, which contain varied data (known as tabular data and similar in function to a spreadsheet). One of these is known as interval, in which each row represents the position of a genomic feature in a particular genome. The interval format contains at least three columns: (1) the chromosome, (2) the start position within that chromosome, and (3) the end position within that chromosome. Other columns commonly included are name, strand, score, and exon information (when the intervals are gene annotations). Additional formats beyond those composed of tabular columns are used, but the intricacies of their formats can be largely ignored in this introductory text as Galaxy can handle most of the details needed for performing complex analysis. The practice of matching rows between tabular datasets with Galaxy is known as joining. Two different Join tools are used here. The first Join tool works on interval datasets (using multiple columns to determine matching) and creates a dataset where rows are matched if their interval on the genome overlaps (by a user-specified number of nucleotides) and combined into a single row. The second type of join works on a single column from each dataset and is useful for matching between identifiers. Every time a tool is run, one or more datasets are created in the user's history. The box surrounding the dataset will change color based upon its state: a query in the queue will be indicated by a gray box, a running query will be yellow, and a completed query will have a green box. Although a dataset is only ready to be viewed or used as input after it has turned green, additional analysis steps can be lined up for non-completed queries by using the desired tools as normal; the tools will wait in the queue for the dataset needed to finish before running. 
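The two kinds of joining described above can be illustrated outside Galaxy with a minimal Python sketch. This is not Galaxy's implementation; the function names, the example rows, and the use of half-open coordinates (start included, end excluded) are assumptions made only for illustration.

```python
# Illustrative sketch only: not Galaxy's code. Rows are tuples laid out like
# the interval format: (chromosome, start, end, name).
# Assumption: coordinates are half-open (start included, end excluded).

def overlap(a, b):
    """Number of bases shared by two intervals (0 if none or different chromosome)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))

def interval_join(left, right, min_overlap=1):
    """Combine each left row with each right row that overlaps it by at least min_overlap bases."""
    return [l + r for l in left for r in right if overlap(l, r) >= min_overlap]

def column_join(left, right, left_col, right_col):
    """Combine rows whose values in one chosen column of each dataset are equal."""
    index = {}
    for r in right:
        index.setdefault(r[right_col], []).append(r)
    return [l + r for l in left for r in index.get(l[left_col], [])]

# Hypothetical example rows
sites = [("chr1", 100, 200, "site1"), ("chr2", 50, 80, "site2")]
promoters = [("chr1", 150, 1150, "geneA"), ("chr1", 500, 1500, "geneB")]

joined = interval_join(sites, promoters)
by_name = column_join([("geneA", 12)], promoters, 0, 3)
```

Here `interval_join` matches on position (several columns at once), whereas `column_join` matches on a single identifier column, mirroring the distinction between the two Join tools.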
Examining Figure 19.10.1 in detail will familiarize the user with the layout of Galaxy's interface, including a history for the user and the tools menu.
Figure 19.10.1 Galaxy's Analyze Data interface consists of four regions: the masthead (A) at the top; the tool menu (B) on the left-hand side; the work area (C) in the middle; and the history panel (D) on the right. The Get Data section has been expanded in the tool menu and the Upload File tool has been selected. In the work area, a local file containing TAF1 ChIP-Seq data has been chosen (see Basic Protocol 1, step 1); clicking the Execute button will cause the data to be uploaded and appear in the history panel. See the TAF1 screencast (http://galaxycast.org/cpmb-2009-1) for more details.
Basic Protocol 1: An Introduction to the Galaxy Approach: Finding Promoters Containing TAF1 Binding Sites Identified From a ChIP-Seq Experiment
This protocol presents an example situation in which a ChIP-Seq experiment identified a series of genomic regions that bind the TAF1 protein. The next task is to identify a list of genes that contain such sites. This can be easily done with Galaxy in just a few steps. This protocol uses a file arranged by tab-delimited columns, where each column contains information about the genomic positions (intervals), as well as name and score data, for TAF1-binding sites from a ChIP-Seq experiment. Each row in this file represents an individual TAF1-binding site by listing the chromosome and the start and end positions within that chromosome. Here, it is assumed that the ChIP-Seq data have already been processed into putative binding regions, since this procedure is currently highly experiment- and laboratory-specific. However, as best practices are defined for performing and evaluating the quality of these procedures, appropriate tools will be added to Galaxy.
A screencast of the protocol can be viewed at http://galaxycast.org/cpmb-2009-1.
NOTE: The items to alter are stated in the protocol. If other menus and options are not referenced, leave those settings in their default or existing condition.
- A file containing genomic coordinates for TAF1-binding sites from the ChIP-Seq experiment (an example file can be downloaded at http://galaxy.psu.edu/CPMB/TAF1_ChIP.txt; Kim et al., 2005)
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
| 1. | Upload the TAF1 ChIP-Seq data. Before beginning the analysis, the ChIP-Seq data needs to be uploaded into Galaxy's workspace (known as a user's history throughout this document). It is possible to skip the downloading step and directly upload the data by entering the URL into the paste box, causing Galaxy to fetch the URL contents automatically. |
| 2. | Set properties of the TAF1 dataset (Fig. 19.10.2). To begin the analysis, a number of properties for the ChIP-Seq dataset need to be set. |
| 3. | Upload gene annotations from the UCSC Table Browser (Fig. 19.10.3; Karolchik et al., 2004, 2008). To identify which genes' promoters contain the TAF1 binding sites, the gene coordinates must first be uploaded. |
| 4. | Transform coordinates of genes into coordinates of putative promoters (Fig. 19.10.4). |
| 5. | Remove unnecessary columns from dataset no. 3. Only five columns are needed from this dataset. Galaxy's Cut tool allows the removal of unwanted columns. |
| 6. | Identify promoters containing the TAF1 binding sites. Now join the coordinates of TAF1 binding sites from dataset no. 1 with the coordinates of putative promoters from dataset no. 4. The Genomic Interval Join Tool (Fig. 19.10.5) matches two separate sets of genomic coordinates (intervals) according to their overlap, creating a single output containing the matched rows. |
| 7. | Visualize results of this analysis using the UCSC Genome Browser. |
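The idea behind transforming gene coordinates into putative promoter coordinates can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Galaxy tool itself: the function name `promoter`, the 1000-bp upstream length, and the half-open coordinates are all hypothetical choices, and the protocol's own tool settings should be followed in practice.

```python
# Illustrative sketch only. Assumptions (not taken from the protocol):
# the promoter is approximated as a fixed-length region immediately upstream
# of the gene, the default length is 1000 bp, and coordinates are half-open.

def promoter(chrom, start, end, name, strand, upstream=1000):
    """Return the putative promoter interval for one gene row."""
    if strand == "+":
        # Upstream of a plus-strand gene lies before its start coordinate.
        return (chrom, max(0, start - upstream), start, name, strand)
    # Upstream of a minus-strand gene lies after its end coordinate.
    return (chrom, end, end + upstream, name, strand)

# Hypothetical gene rows
plus_gene = promoter("chr1", 5000, 9000, "geneA", "+")
minus_gene = promoter("chr1", 5000, 9000, "geneB", "-")
near_edge = promoter("chr1", 300, 900, "geneC", "+")
```

The strand check matters because "upstream" refers to the direction of transcription, so the promoter of a minus-strand gene sits at higher coordinates than the gene itself.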
Basic Protocol 2: Combining and Filtering Genome Annotations: Finding Exons with the Highest Number of Nucleotide Polymorphisms
The objective of this protocol is to demonstrate joining, grouping, sorting, and filtering of genomic annotations in Galaxy. To explore these features using real data, an illustrative example is presented: identification of exons containing the largest number of single nucleotide polymorphisms (SNPs).
A screencast of the protocol can be viewed at http://galaxycast.org/cpmb-2009-2.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
| 1. | Upload exon annotations from the UCSC Table Browser. |
| 2. | Upload SNP coordinates. |
| 3. | Join coordinates of exons with coordinates of SNPs to identify those exons that contain SNPs. |
| 4. | Count the number of SNPs per exon using the Group tool. In Figure 19.10.7, if an exon contains multiple SNPs, its name is repeated. It is possible to take advantage of this by using the Group tool. By counting the number of times each exon's name appears within dataset no. 3, the number of SNPs within that exon will be obtained. |
| 5. | Sort exons by SNP count. To see the highest possible number of SNPs per exon in this dataset, sort the dataset from the previous step. |
| 6. | Restrict dataset no. 5 to exons that have ten or more SNPs. |
| 7. | Restore genomic location for exons containing ten or more SNPs. Step 6 produced a list of exons containing ten or more SNPs; however, information about their genomic position, strand orientation, etc. has been lost. Because dataset no. 6 contains the exon identifier field, it can be used to restore genomic context information by joining with dataset no. 1. The Join two Queries tool is different from the Genomic Operations Join, which was used earlier; this tool matches two separate datasets by matching column contents between any tab-delimited datasets (including interval datasets). |
| 8. | Visualize dataset no. 7 (Join two Queries on data 6 and data 1) in the UCSC Genome Browser. |
| 9. | To save the analysis and share it with colleagues, continue on to Support Protocol 1. |
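The group, sort, filter, and join-back sequence of steps 4 to 7 can be summarized in a short Python sketch. It is meant only to make the logic concrete; it is not how Galaxy implements these tools, and the function name, row layout, and example data are hypothetical.

```python
# Illustrative sketch only; names and data are hypothetical.
from collections import Counter

def exons_with_many_snps(exons, joined_exon_names, min_snps=10):
    """exons: rows of (chrom, start, end, exon_id).
    joined_exon_names: one exon_id per exon/SNP overlap, as left by the interval join."""
    counts = Counter(joined_exon_names)                          # Group: count rows per exon name
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)  # Sort: highest count first
    kept = [(exon_id, n) for exon_id, n in ranked if n >= min_snps]      # Filter: ten or more SNPs
    by_id = {row[3]: row for row in exons}
    # Join on the identifier column to restore the genomic coordinates.
    return [by_id[exon_id] + (n,) for exon_id, n in kept if exon_id in by_id]

exons = [("chrX", 100, 200, "exon1"), ("chrX", 300, 400, "exon2")]
joined_exon_names = ["exon1"] * 12 + ["exon2"] * 3
result = exons_with_many_snps(exons, joined_exon_names)
```

The final lookup by `exon_id` plays the role of the Join two Queries step: grouping discards the coordinates, and matching on the identifier brings them back.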
Support Protocol 1: Saving Results in Galaxy and Sharing Data with Others
How can researchers ensure that the analyses they have just conducted are safely stored and that they are able to go back to them at any time? They will need to create a free account within Galaxy. This is the only requirement to save analyses. The protocol below explains how to store results and introduces sharing analyses with colleagues. A screencast can be viewed at http://galaxycast.org/cpmb-2009-3 to walk the user through the process.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- Results from Basic Protocol 2
- A Galaxy account (created by clicking Register in the Galaxy interface); histories must be linked to a user to be stored and shared
Basic Protocol 3: Generating a Workflow From a History in Galaxy
Basic Protocols 1 and 2 demonstrate interactive analysis in Galaxy; the result is a history that documents each step of an analysis. Galaxy also allows the construction of reusable multi-step analysis workflows (Fig. 19.10.8). In this protocol, the creation of a workflow from an existing analysis history is demonstrated.
Figure 19.10.8 To create a workflow from an existing history (see Basic Protocol 3), the user needs to make sure that they are logged in and then select History Options and click Extract Workflow. A new workflow will be populated from the current history as shown; the workflow can now be renamed and created. See the Workflow screencast (http://galaxycast.org/cpmb-2009-4) for more details.
A screencast of the protocols can be viewed at http://galaxycast.org/cpmb-2009-4.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- History created from Basic Protocol 2
- A Galaxy account (created by clicking Register in the Galaxy interface); all workflow manipulation in Galaxy requires the user to be logged in with an account
| 1. | Ensure a non-empty history is loaded (for this example, the history resulting from the completion of Basic Protocol 2 is used). |
| 2. | In the header of the History panel (top-right of the Galaxy analysis interface), click Options. This will load a menu of options that apply to the current history. |
| 3. | Click Extract Workflow. This will load a list of the actions (tool runs) that generated each dataset in the current history. A subset of tools can be selected by clicking the checkboxes on this page (e.g., if more than one analysis has been performed in the current history, but a workflow is only to be created from one of them). Certain tools cannot be used in workflows, including most external data sources. In these cases, the dataset can be treated as an input to the workflow. Here, a workflow is constructed from the entire history, so do not change any checkboxes. |
| 4. | Provide a name for the new workflow by entering a name of choice in the text box underneath the label Workflow Name. |
| 5. | Click the Create Workflow button to create the new workflow; a message will be displayed in the center panel confirming that the workflow was created. |
Support Protocol 2: Modify a Parameter in the Workflow in Galaxy
After constructing a workflow from an existing analysis, the Workflow Editor can be used to modify tool parameters (or even add and remove steps).
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- Workflow created by Basic Protocol 3
Support Protocol 3: Running Workflows with Galaxy
Once a workflow has been constructed, it can be run in the analysis view just like any other tool in Galaxy.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- Workflow saved in Basic Protocol 3
| 1. | Return to the Analyze Data view by clicking Analyze Data in the top panel. |
| 2. | Create a new empty history in which to store the result of running the workflow. |
| 3. | Get exon and SNP annotations for human chromosome X from the UCSC Table Browser. |
| 4. | Click Workflows at the bottom of the tool menu (left panel), then click All Workflows in the list of options that appears. |
| 5. | In the center panel, click the name of the workflow created in Basic Protocol 3. This will load the workflow in the center panel with prompts for parameters that need values. |
| 6. | Under Step 1: Input Dataset, select the second item in the history (the SNPs). |
| 7. | Under Step 2: Input Dataset, select the first item in the history (the exons). |
| 8. | Click Run Workflow at the bottom of the form in the center panel. A message will be displayed confirming that the workflow has been run, and the datasets for each workflow step will be added to the history (in the queued state). At this point, the workflow is running, and each step will execute once the data it requires has been generated by previous steps. The box surrounding the dataset will change color based upon its state as the steps progress: a query in the queue will be indicated by a gray box, a running query will be yellow, and a completed query will have a green box. |
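The statement that each step executes once the data it requires has been generated can be pictured as ordering steps by their dependencies. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to derive one valid execution order. It is only an illustration and says nothing about how Galaxy schedules jobs internally; the step names are hypothetical labels for the workflow extracted from Basic Protocol 2.

```python
# Illustrative sketch only; step names are hypothetical labels, and this is
# not a description of Galaxy's own job scheduling.
from graphlib import TopologicalSorter

def run_order(dependencies):
    """Return one order in which every step comes after the steps it depends on."""
    return list(TopologicalSorter(dependencies).static_order())

# Each step maps to the set of steps whose output it needs.
workflow = {
    "exons": set(),                      # input dataset
    "snps": set(),                       # input dataset
    "join_intervals": {"exons", "snps"},
    "group_count": {"join_intervals"},
    "sort": {"group_count"},
    "filter": {"sort"},
    "join_ids": {"filter", "exons"},
}

order = run_order(workflow)
```

Several orders can be valid (for example, the two inputs may come in either order); what is guaranteed is only that no step precedes a step it depends on.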
Support Protocol 4: Sharing Workflows with Galaxy
Galaxy allows researchers to share workflows with others. Workflows can either be shared with a specific Galaxy user, or made publicly accessible by a special link.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- Workflow created by Basic Protocol 3
Basic Protocol 4: Generating Workflows from Scratch with Galaxy
In addition to creating workflows from existing histories, Galaxy allows the creation of a workflow from scratch. In this protocol, a simple workflow that finds the 50 longest intervals from a dataset in a 6-column BED file is constructed. The BED format is a specialized version of the interval format discussed earlier; it contains the information required to represent a genomic position. A 6-column BED file contains the chromosome, start position in the chromosome, end position in the chromosome, name, score, and strand for a set of genomic positions.
A screencast of this protocol can be viewed at http://galaxycast.org/cpmb-2009-5.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- A Galaxy account (created by clicking Register in the Galaxy interface); all workflow manipulation in Galaxy requires the user to be logged in with an account
| 1. | Move from the Analyze Data view to the Workflow view by clicking Workflow in the top panel of the Galaxy interface. |
| 2. | Create a new empty workflow. |
| 3. | Load the (empty) workflow in the workflow editor. |
| 4. | Add an input dataset to the workflow. |
| 5. | Add a step to the workflow to compute the length of each interval in the input dataset. |
| 6. | Create a connection between the input dataset and the Compute node. Outputs of a node are represented by circled arrowheads overlapping the right edge of a node, while data inputs are circled arrowheads overlapping the left edge of a node (Fig. 19.10.9). Connections are made by dragging. |
| 7. | Add a step to the workflow to sort the intervals by length. |
| 8. | Add a step to the workflow to select the longest intervals. |
| 9. | Edit the parameters of the Compute step to calculate interval length. |
| 10. | Edit the parameters of the Sort step to sort on the correct column. |
| 11. | Edit the parameters of the Select First step to select the first 50 intervals. |
| 12. | Click the Save button in the title bar header of the center workflow canvas panel to save the workflow. |
| 13. | Click Close in the header of the workflow canvas panel to return to the workflow list. This workflow can now be run in the same fashion as described in Support Protocol 3. |
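What this three-step workflow computes can be written as a short Python sketch: add a length column, sort by it in descending order, and keep the first 50 rows. This is an illustration of the logic only, not of Galaxy's tools; the function name is hypothetical, and the length is assumed to be end minus start.

```python
# Illustrative sketch only; assumes interval length = end - start.

def longest_intervals(bed_lines, n=50):
    """Return the n longest intervals from 6-column BED lines, with length appended."""
    rows = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        length = int(fields[2]) - int(fields[1])      # Compute: new column with the length
        rows.append(fields + [length])
    rows.sort(key=lambda row: row[-1], reverse=True)  # Sort: on the new column, descending
    return rows[:n]                                   # Select First: keep the top n rows

# Hypothetical 6-column BED rows: chrom, start, end, name, score, strand
bed = [
    "chr1\t0\t100\ta\t0\t+",
    "chr1\t50\t400\tb\t0\t-",
    "chr2\t10\t20\tc\t0\t+",
]
top = longest_intervals(bed, n=2)
```

Sorting on the appended column rather than on an original one corresponds to step 10, where the Sort step must be pointed at the column produced by Compute.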
Basic Protocol 5: Extracting Sequences and Alignments with Galaxy: An SNPs in Exons Example
This protocol demonstrates how Galaxy is used to extract genomic sequences and multiple species alignments corresponding to regions of interest. It starts with the data that was generated in Basic Protocol 2, where human coding exons with high SNP counts were found. Two types of data will be extracted for these regions: the genomic sequence of each region, and pieces of a whole-genome alignment between human and other species overlapping these regions. Because the whole-genome alignment used here (produced by Multiz, a local aligner) is fragmented into pieces, these pieces will then be projected back onto the regions of interest (exons) to facilitate per-exon analysis of the alignments (the result is sometimes called a pseudo-global alignment). This protocol provides a brief illustration of how easily Galaxy can be used to handle the often tricky manipulation of these files.
A screencast of this protocol can be viewed at http://galaxycast.org/cpmb-2009-6.
- An internet-accessible computer with any modern Web browser (Firefox, Safari, Opera, Internet Explorer)
- Completed and saved history created by Basic Protocol 2 and Support Protocol 1
| 1. | Go to the main Galaxy interface at http://usegalaxy.org. |
| 2. | Load the history created in Basic Protocol 2. |
| 3. | The history panel will refresh with the selected history, which should contain seven steps. Dataset no. 7 (Join two Queries on data 6 and data 1) contains the genomic coordinates of the exons of interest. |
| 4. | Extract Genomic DNA corresponding to each of the exons. |
| 5. | When the query finishes, a dataset called Extract Genomic DNA on data 7 is created. This dataset contains the human genomic DNA corresponding to each of the 109 exons (one sequence per exon) in FASTA format, a very common format for storing multiple named sequences. |
| 6. | Extract multiple species alignment blocks for each of the human exon locations. |
| 7. | A new history item is created, no. 9 (Extract MAF blocks on data 7), which contains the portions of the source alignment that overlap with the exon regions. For the 109 regions, 387 alignment blocks were retrieved, because multiple local alignment blocks can overlap an individual exon. Thus, the resulting dataset contains every local alignment block overlapping an exon, trimmed to include only the portion of the alignment that overlapped. This dataset is useful for examining the conservation of exons in aggregate; however, the relationship between exons and alignments has been lost. |
| 8. | Create one projected alignment per human exon. |
| 9. | A new history item is created, no. 10 (Stitch MAF blocks on data 7 and data 9), which contains one alignment block for each of the human exons, with regions where no alignment was found represented as gaps (-). Click the eye icon to examine the data in the center panel. The projected alignment is in FASTA format, suitable for downstream analysis in most phylogenetic software packages, including those available in Galaxy. For 109 regions, 545 FASTA sequences (109 regions each with sequences for five species) were generated in 109 alignment blocks. |
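The projection of fragmented alignment blocks onto a region of interest can be illustrated with a simplified Python sketch. It is not the algorithm used by Galaxy's tools: the function name and data layout are hypothetical, and the sketch assumes that the reference (human) rows contain no gap characters, so each alignment column corresponds to exactly one reference position.

```python
# Illustrative sketch only. Simplifying assumption: the reference sequence in
# each block is ungapped, so column i of a block sits at reference position
# ref_start + i. Real alignment blocks do not guarantee this.

def stitch(region_start, region_end, blocks, species):
    """Project alignment blocks onto one region, filling uncovered positions with '-'.

    blocks: list of (ref_start, {species_name: aligned_text}).
    Returns {species_name: sequence of length region_end - region_start}.
    """
    length = region_end - region_start
    out = {sp: ["-"] * length for sp in species}
    for ref_start, seqs in blocks:
        for sp, text in seqs.items():
            if sp not in out:
                continue
            for offset, base in enumerate(text):
                pos = ref_start + offset - region_start
                if 0 <= pos < length:      # trim anything outside the region
                    out[sp][pos] = base
    return {sp: "".join(chars) for sp, chars in out.items()}

# Hypothetical blocks covering part of the region 10..20
blocks = [
    (10, {"human": "ACG", "mouse": "ACA"}),
    (15, {"human": "TTT"}),
]
projected = stitch(10, 20, blocks, ["human", "mouse"])
```

Every species ends up with a sequence of the same length as the region, which is what makes the result usable as one alignment per exon.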
Commentary
Galaxy successfully bridges the gap between data collection and analysis. The public Galaxy server allows researchers across the globe to perform computationally intensive, large-scale analyses; the only equipment required is a computer with an internet-connected Web browser. Users need not learn the intricacies of running a large collection of unrelated programs, but instead have access to a unified point-and-click interface. Galaxy provides both experimental biologists and their computational colleagues with a framework to facilitate truly reproducible cutting-edge science.
The protocols contained within this unit offer only a glimpse of possible analyses and tool functionality, and should be considered only an introduction to performing complex analysis with Galaxy. New datasets, tools, and features are added regularly, so menu choices may be added or moved relative to what is described here. In addition to the screencasts that accompany these protocols, many more screencasts that demonstrate additional functionality are available at http://galaxycast.org, and others will be added over time.
Transparency and reproducibility
Open and transparent research is essential to the process of science. Research papers cannot be published without making the protocols and generated experimental data publicly available. Unfortunately, the same standards are often not applied to computational analysis. When analysis is performed within Galaxy, every detail is preserved in the history and can be inspected later. These histories can be shared or published, and can be reproduced (with or without modification) through the workflow system. Thus, without additional effort on the part of the user, Galaxy facilitates greater transparency and reproducibility of computational analyses.
Collaboration
While the scope of this unit is limited to introducing a user to performing data analysis with the public Galaxy server, Galaxy is also an excellent resource for collaborative analysis. Because it is Web-based, collaborators at different locations can easily and rapidly share data and analyses. In particular, Galaxy's library system provides for sharing of datasets within research groups, complete with access controls and version histories.
Research groups that have their own collections of analysis scripts and binaries will find it worthwhile to download the open source framework, integrate their unique tools, and maintain a private server (a Galaxy instance) for laboratory members to work on their projects. A local Galaxy server makes collaborations between computational and experimental researchers more efficient, since new analysis tools can be effortlessly made available to colleagues, allowing programmers to focus on method development. Although beyond the scope of this introduction to the user interface, documentation and assistance for programmers are also available on the Galaxy site.
The Galaxy Framework is easily downloaded, quickly configured, and effortlessly deployed. Although Galaxy is written in Python, no knowledge of the Python programming language is required to deploy or maintain a personal Galaxy instance. This facilitates local development of new tools, the creation of new Galaxy instances with custom toolsets, and secure private Galaxy instances for analyzing protected data (e.g., genotype data obtained in a clinical setting). To download the Galaxy Framework and view detailed installation documentation, visit http://getgalaxy.org.
Help and feedback
Galaxy is under constant development and is improved based upon user suggestions. Extensive help is available in the form of screencasts as well as active public mailing lists, where both experimentalists and computationalists can request and receive advice. Discussion of feature requests is also encouraged. For links to these resources and to use Galaxy, visit http://galaxyproject.org.
Acknowledgements
A vision for Galaxy was originally articulated by Ross Hardison, who is also the major source of support and critical feedback. The authors thank Jim Kent and David Haussler for their continuing support and for making the UCSC Genome Browser uplink and connection possible. Istvan Albert pioneered initial aspects of the Galaxy design. The following individuals contributed to the Galaxy project at different stages: Richard Burhans, Laura Elnitski, Belinda Giardine, Bob Harris, Jianbin He, Webb Miller, Cathy Riemer, and Yi Zhang. The authors thank Warren Lathe of OpenHelix for critical reading of the manuscript. Galaxy hardware is maintained by Nate Coraor. Robert Castelo, France Denoeud, Roderic Guigo, Erika Kvikstad, Julien Lagarde, and Kateryna Makova provided critical comments during software testing. This work is supported by funds provided by the Eberly College of Science, Huck Institutes of the Life Sciences at Pennsylvania State University, NSF BD&I grant 0543285, NIH-NHGRI grant R01 HG004909 as well as funds from the Pennsylvania Department of Public Health.