The combination of growing demand for analysis via MG-RAST and rapidly increasing data set sizes has made it evident that dedicated computing resources will not provide sufficient capacity even in the short term, with considerably larger shortfalls in the medium and long terms. Our first approach to this problem was to scale out the MG-RAST backend onto existing shared resources at Argonne, followed closely by the use of cloud resources. This change posed technical challenges, both because of differences in the general infrastructure provided on shared systems and because of the switch to distributed resources.
2.1 Technical challenges
Several high-level issues quickly presented themselves as we adapted the MG-RAST pipeline to distributed resources. First and foremost, we realized that we had an orchestration problem: MG-RAST had previously depended heavily on Sun Grid Engine (SGE) for job execution reliability and prioritization, and SGE was not in use on these shared resources. In addition, analysis operations had a variety of system infrastructure requirements: some required large amounts of memory or local disk space, while others produced large quantities of output.
2.2 Choosing which analysis operations to distribute first
An examination of the computational needs of the analysis stages in the MG-RAST pipeline showed that they fell into four broad categories: data-intensive applications, large-memory applications, applications with long runtimes, and pleasingly parallel applications. Table 1 describes each of the major stages in the pipeline in terms of its overall fraction of runtime, CPU intensity, memory requirements, and data requirements. By far the largest share of runtime, and hence the analysis bottleneck, came from similarity computations; that stage also had the simplest set of infrastructure requirements, so we distributed it first.
Table 1. Resources required for various stages of metagenome analysis in MG-RAST.
| Stage | Runtime fraction | CPU | Memory | Data required |
|---|---|---|---|---|
| Quality control | + | + | + | Full data set |
| DNA/peptide clustering | + | + | + | Full data set |
| Assembly | ++ | ++ | ++++ | Full data set |
2.3 Wide-area computational orchestration
As we mentioned previously, we had depended on SGE for analysis execution reliability and control. Upon initial inspection, this function seemed simple enough to implement ourselves, so we built a single centralized coordination entity, called AWE, to coordinate execution across multiple distributed systems. As described in the previous section, the similarity analysis stage, implemented using NCBI BLAST, was our initial target. It is a good candidate for distribution, both because of its large overall resource consumption and because the computation uses a large fixed database that changes infrequently. The only input to this stage is a small query sequence, easily transmitted from AWE.
Several other systems provide similar capabilities, including the Berkeley Open Infrastructure for Network Computing (BOINC) and Condor. In our case, we opted to implement AWE in order to model work units in a format specifically suited to genome sequence data. For example, because of the fixed metagenome database used by MG-RAST, request deduplication can be implemented if AWE recognizes previously executed work units. Given the relative simplicity of this software, we chose to build a purpose-built orchestration system rather than adapt our workload to an off-the-shelf system.
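As a sketch of this deduplication idea: because identical (query, database) pairs yield identical results, a stable fingerprint of the pair can serve as a cache key. The names and hashing scheme below are illustrative assumptions, not AWE's actual schema.

```python
import hashlib

def workunit_fingerprint(query_sequence: str, db_name: str, db_version: str) -> str:
    """Derive a stable deduplication key for a similarity work unit.

    Since the reference database changes infrequently, the same query
    against the same database version always produces the same result,
    so a hash of the pair identifies previously executed work units.
    """
    payload = "\x00".join((query_sequence, db_name, db_version))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# A server could consult a results cache keyed by this fingerprint
# before enqueueing the work unit, returning the cached result instead.
```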
The Argonne Workflow Engine (AWE) consists of a centralized set of Python daemons that communicate with clients via a RESTful interface. This approach is widely portable, as clients need only be able to perform HTTP requests to the server. The work queue, as well as results and statistics, is stored in a PostgreSQL database. We used Facebook's Tornado framework to build a lightweight and efficient set of front-tier web servers.
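The following is a minimal sketch of what such a front tier might look like in Tornado. The endpoint name, payload format, and in-memory queue are assumptions for illustration; the real system backs its queue with PostgreSQL rather than a Python list.

```python
import json

import tornado.ioloop
import tornado.web

# Stand-in for the database-backed work queue (illustrative only).
WORK_QUEUE = []

class WorkHandler(tornado.web.RequestHandler):
    """RESTful work endpoint in the spirit of AWE's front tier."""

    def get(self):
        # A client asks for work sized to its locally available resources.
        size = int(self.get_argument("size", "1"))
        chunk = WORK_QUEUE[:size]
        del WORK_QUEUE[:size]
        self.set_header("Content-Type", "application/json")
        self.write(json.dumps({"workunits": chunk}))

    def post(self):
        # The pipeline submits new work units as a JSON list.
        WORK_QUEUE.extend(json.loads(self.request.body))
        self.set_status(201)

def make_app():
    return tornado.web.Application([(r"/work", WorkHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```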
Work units are stored as small chunks. When a client requests work, it includes a work unit size, allowing it to size the request to the resources available. The small chunks of work are then aggregated into a larger unit, and each chunk is marked with a lease that includes a start time and duration. As long as it is running, the client can extend this lease. If work units reach the end of their lease, they are freed for allocation to other client requests. This mechanism provides basic reliability in the face of unreliable remote clients. Each client can run one or more of these work unit cycles, depending on the resources available. On a current commodity system, a work unit typically takes 10–15 min to complete.
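A minimal sketch of this lease cycle, with hypothetical names and an in-memory store standing in for the database, might look as follows.

```python
import time
import uuid

LEASE_DURATION = 15 * 60  # seconds; a running client renews before expiry

class WorkUnit:
    """A small chunk of work; several are aggregated per client request."""

    def __init__(self, payload):
        self.payload = payload
        self.lease_id = None
        self.lease_expires = 0.0  # epoch seconds; 0.0 means never leased

def checkout(units, requested, now=None):
    """Lease up to `requested` available chunks under a single lease id."""
    now = time.time() if now is None else now
    lease_id = uuid.uuid4().hex
    granted = []
    for unit in units:
        if len(granted) == requested:
            break
        if unit.lease_expires <= now:  # never leased, or lease has lapsed
            unit.lease_id = lease_id
            unit.lease_expires = now + LEASE_DURATION
            granted.append(unit)
    return lease_id, granted

def renew(units, lease_id, now=None):
    """Extend a still-live lease; lapsed units may belong to others."""
    now = time.time() if now is None else now
    for unit in units:
        if unit.lease_id == lease_id and unit.lease_expires > now:
            unit.lease_expires = now + LEASE_DURATION
```

Because expired leases are simply reissued to the next requester, a crashed or partitioned client requires no explicit failure detection; its work is eventually redone elsewhere.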
Although the initial implementation was relatively straightforward, we found that we needed to address a wider range of error conditions than we had previously encountered; SGE had handled many of these issues for us. This highlights one of the important lessons we learned in this process: most pipelines are still user-level software, not system software. By implementing AWE and moving to resources we did not control, we began dealing with a wider range of error conditions. Moreover, these errors occurred in a distributed environment, making debugging considerably more difficult. Similarly, debugging distributed performance problems required more sophisticated information gathering than when MG-RAST ran on a single cluster.
After becoming accustomed to dealing with performance issues, we were able to retune our database, modify our work management reliability scheme, and scale a single backend up to 500 compute nodes without issue. Further optimizations can be applied as needed; the architecture is designed to scale across multiple backend machines. However, since our current approach has been sufficiently scalable for the compute resources we have access to, we have not pursued this further.
2.4 Confidentiality and data security on distributed platforms
Using a diverse set of heterogeneous computers might present security issues if access to an arbitrary subset of the data being analyzed is unacceptable to the end user. In our use case, access to subsets of DNA sequences without contextual metadata (see, e.g., the GSC-maintained metadata types [7, 9]) renders the data useless. The risk of accidentally granting access to anonymous DNA sequence data without any provenance information (as distributed by the mechanisms we implemented) is therefore deemed acceptable for the application we describe.
However, to further increase anonymity, any existing sequence identifiers generated by sequencing equipment could also be rewritten. In our application, we strip out data protected by the Health Insurance Portability and Accountability Act (HIPAA) before submitting sequence data to distributed resources.
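One illustrative way to rewrite such identifiers is to replace FASTA headers with salted hashes before upload. The function below is a sketch under that assumption, not the exact procedure used in the MG-RAST pipeline.

```python
import hashlib

def anonymize_fasta(in_path, out_path, salt="site-secret"):
    """Rewrite FASTA headers to opaque identifiers before upload.

    Instrument-generated read names can encode run, lane, and sample
    details; replacing them with salted hashes removes that context
    while keeping records distinguishable. A local mapping table could
    be kept if results must later be tied back to the original names.
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                original_id = line[1:].strip()
                digest = hashlib.sha256(
                    (salt + original_id).encode("utf-8")
                ).hexdigest()[:16]
                dst.write(">%s\n" % digest)
            else:
                dst.write(line)
```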
2.5 A brief discussion of performance
Moving from an integrated local cluster environment to a more distributed workload has allowed us to utilize up to 500 nodes, where in the past we were restricted to far more limited locally available resources. We have been able to use available cycles from a number of machines, including the Department of Energy Magellan cloud test bed located at Argonne National Laboratory, Argonne's Laboratory Computing Resource Center computer, and Amazon's commercial EC2 offering.
One surprising result is that, because the data volumes transferred after the initial deployment are small, even a DSL-type connection provides suitable throughput for our application model. Interestingly, the results of sequence comparison tend to be significantly larger than the input data sets, frequently 10 times the size of the input. Here, using our domain knowledge, we chose to change the BLAST report format to produce results with a significantly reduced on-disk and data-transfer footprint.
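For example, with NCBI BLAST+ the tabular report (-outfmt 6) replaces the verbose pairwise alignment output with one tab-separated line per hit. The wrapper below is a sketch of that approach; the exact program and parameters used in the production pipeline may differ.

```python
import subprocess

def run_similarity(query_fasta, database, out_path):
    """Run a similarity search with a compact tabular report.

    "-outfmt 6" emits twelve tab-separated fields per hit (query id,
    subject id, percent identity, ..., bit score) rather than the
    default pairwise report, which dominates output volume.
    """
    subprocess.check_call([
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "6",  # compact tabular output
        "-out", out_path,
    ])
```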