Scalable data analysis in proteomics and metabolomics using BioContainers and workflow engines

Recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are designed as single desktop applications, which limits the scalability and reproducibility of the data analysis. In this paper we give an overview of the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy and Nextflow.


Introduction
Progress in the application of mass spectrometry (MS) to biological compounds has revolutionized the field of biology: the large-scale identification of proteins and metabolites provides a unique snapshot of a biological system of interest at a given time point [1]. MS-based high-throughput technologies have resulted in exponential growth in the dimensionality and sample size of datasets [2]. This increase has two major directions: I) the number of samples processed, powered by new mass spectrometers; and II) the number of molecules (metabolites, peptides, and proteins) identified in each sample [3]. In addition, data analysis in MS-based metabolomics and proteomics is becoming more complex, including several convoluted steps to go from spectra identification to the final list of relevant molecules. This scenario creates major challenges for software developers and the bioinformatics community: I) software and data analysis scalability; II) software availability and findability; and III) reproducibility of the data analysis [3,4].
Computational proteomics and metabolomics have been dominated by monolithic desktop software for the past decades, which has hampered high-throughput analysis in High-Performance Computing systems (HPCS) and cloud environments [5,6]. Furthermore, many of these tools are proprietary closed-source solutions that often run only on MS Windows or on the vendor's hardware and use proprietary binary formats for data intake. These are barriers to reproducible science. During the last decade, open source software and distributed solutions have slowly made their way into these computational fields, with an ecosystem of computational tools flourishing in proteomics and metabolomics (see reviews in both fields [6,7]). While the arrival of open source and distributed frameworks in these omics fields is positive for the scalability, portability, and reproducibility of data analysis, it often comes at the cost of increased technical complexity: installing, maintaining and executing this analysis software is usually complex and requires advanced software expertise, which is often a rare skill among scientific practitioners. This is further complicated by the fact that reproducibility and collaboration demand the installation of these tools in different computational environments (local computers, HPCS, cloud, collaborators' clusters, etc.), often requiring different installation processes and software dependencies to be fulfilled [8].
In the past few years, the use of software containers and software packaging systems has markedly increased in the field of bioinformatics [8,9]. In particular, the BioContainers [8] (http://biocontainers.pro) and BioConda [10] (http://bioconda.github.io) communities have greatly increased the availability of containers and adequately packaged bioinformatics tools respectively, today providing thousands of tools in a format that can be used seamlessly on local workstations, HPCS and cloud environments [9]. These software containers lower the technical entry barrier for setting up scientific open source software and for making setups portable across multiple environments.
While containers and software packages make the installation and portability of bioinformatics tools easier, they still leave to the scientist the task of combining (plumbing) tools together to create bioinformatics data analysis workflows and pipelines [11]. This is a complex task that demands familiarity with the Linux command line environment, the underlying file system and data streams. In addition, if the analysis is intended to run on distributed architectures (e.g. HPC clusters or Cloud), the bioinformatician will need to combine the workflow design (which tools to run with which data inputs and parameters) with the execution logic (e.g. job scheduler, data filesystem). To facilitate workflow design and execution on different distributed environments, such as HPC or Cloud architectures, the bioinformatics community has developed various bioinformatics workflow systems [11]. During the past 10 years, open source workflow environments have started to consolidate in the field of bioinformatics, and in the past 5 years they have entered metabolomics and proteomics. The first popular workflow environments in bioinformatics were Taverna (now Apache Taverna) and Galaxy. In this manuscript, we will discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. The combination of software containers and workflow environments promises to make scientific analysis pipelines scalable, reproducible, portable and accessible to scientists who do not have any expertise in the use of complex computational infrastructure and command line environments. We will introduce to the proteomics and metabolomics communities a complete ecosystem of tools and frameworks for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy [12] and Nextflow [13].

Current approaches for computational mass spectrometry.
In proteomics, the most common strategy for the interpretation of data-dependent acquisition (DDA) MS/MS spectra consists of comparing the experimental spectra to a set of ideal spectra (also called theoretical spectra) extrapolated from the predicted fragmentation of peptides derived from a protein database [15]. During this process, every spectrum obtained by the mass spectrometer needs to be compared with all the theoretical spectra within the same precursor mass window. As more data is generated (larger cohorts and more complex samples), the running time becomes longer [16]. In recent years, algorithms and tools have been developed to perform the identification step, such as Andromeda [17], MSGF+ [18] and MSFragger [19]. Even though most of these algorithms have become robust and reliable, the analysis of large-scale experiments is still computationally intensive and takes considerable execution time [20]. After the identification process, the resulting peptide-spectrum matches can be reliably controlled by false discovery rate (FDR) filters (Figure 1). Recently, many tools have implemented a secondary database search which takes the initial identification results and refines the search parameters. Finally, the list of quantified proteins is assembled from the identified peptides using protein inference algorithms [21,22]. The list of quantified proteins is provided to the downstream statistical analysis step, which reports the final relevant proteins.
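The FDR filtering step can be illustrated with a short sketch. It assumes a target-decoy strategy, which is one common way to estimate the FDR and is not stated in the text; the PSM tuple layout, the function name `filter_psms_by_fdr` and the scores are hypothetical and not taken from any of the tools cited above.

```python
from typing import List, Tuple

# Hypothetical PSM record: (spectrum_id, score, is_decoy); a higher score is a better match.
PSM = Tuple[str, float, bool]


def filter_psms_by_fdr(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Keep target PSMs down to the lowest-ranked position where estimated FDR <= max_fdr.

    The FDR at a given rank is estimated as (#decoys) / (#targets) among all PSMs
    ranked at or above that position (simple target-decoy estimate, ties ignored).
    """
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    targets = 0
    decoys = 0
    keep = 0  # number of top-ranked PSMs that pass the threshold
    for rank, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            keep = rank
    # Decoy hits are only used for the estimate and are not reported.
    return [p for p in ranked[:keep] if not p[2]]


example = [
    ("s1", 10.0, False),
    ("s2", 9.0, False),
    ("s3", 8.0, True),
    ("s4", 7.0, False),
]
strict = filter_psms_by_fdr(example, max_fdr=0.01)
relaxed = filter_psms_by_fdr(example, max_fdr=0.5)
```

With the strict threshold only the two matches ranked above the first decoy are kept; relaxing the threshold admits the lower-scoring target as well. Production tools use more refined estimates (for example q-values), so this is only meant to make the idea concrete.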
Data-independent acquisition (DIA) is a relatively new mass spectrometry-based technique for systematically collecting tandem mass spectrometry data. Whereas data-dependent acquisition (DDA) selects precursor ions according to their abundances, DIA aims to fragment all precursor ions in parallel, regardless of their intensity or other characteristics, enabling the establishment of a complete record of the sample [23]. This analytical method is well suited for applications requiring the measurement of thousands of proteins or demanding the flexibility to investigate multiple hypotheses without having to acquire additional data sets. Several software tools have been implemented to analyze DIA datasets, such as OpenSWATH [24] and Skyline [25].
Similarly, computational metabolomics is mainly based on the comparison of metabolite spectra against a well-curated database of previously identified metabolites (spectral library strategy) (Figure 1). Spectral libraries such as METLIN and MassBank contain information about the mass and structure of small molecules, although MS/MS spectra are available for only a share of the small molecules in the database. The basic analytical workflow yields thousands of molecular features within minutes of data acquisition. However, as in proteomics, only a minority of detected masses can be matched to a molecule in the database, or more commonly to several possible molecular formulas [26,27]. Statistical validation and manual curation can only be achieved with a matched MS/MS spectrum and/or another compound-specific property such as retention time, which is then compared to a synthesised standard compound. In principle, quantitative analysis in metabolomics experiments is very similar to the label-free quantitation approaches based on extracted ion chromatograms in proteomics workflows. Feature alignment and detection is followed by quantitation and then, where possible, identification of a compound. However, the tendency of small organic molecules to form multimers or adducts (e.g. sodium or ammonium) needs to be considered, and the detected masses and their intensities need to be deconvoluted before quantitation and statistical evaluation [26].
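To make the adduct issue concrete, the sketch below computes the expected m/z of a neutral molecule for three common singly charged positive-mode adducts and checks which of them explains an observed mass. The mass shifts are approximate literature values, and the function names, the 10 ppm tolerance and the glucose example are assumptions chosen for illustration only.

```python
from typing import Dict, List

# Approximate mass shifts (Da) of common singly charged positive-mode adducts.
ADDUCT_SHIFTS: Dict[str, float] = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
}


def adduct_mz(neutral_mass: float) -> Dict[str, float]:
    """Expected m/z of each adduct for a neutral monoisotopic mass (charge 1)."""
    return {name: neutral_mass + shift for name, shift in ADDUCT_SHIFTS.items()}


def annotate_adducts(observed_mz: float, neutral_mass: float, tol_ppm: float = 10.0) -> List[str]:
    """Names of the adducts whose expected m/z matches observed_mz within tol_ppm."""
    matches = []
    for name, expected in adduct_mz(neutral_mass).items():
        ppm_error = abs(observed_mz - expected) / expected * 1e6
        if ppm_error <= tol_ppm:
            matches.append(name)
    return matches


# Hypothetical example: glucose, monoisotopic mass of about 180.063388 Da.
GLUCOSE = 180.063388
expected = adduct_mz(GLUCOSE)
sodium_hit = annotate_adducts(203.0526, GLUCOSE)
```

A feature observed near m/z 203.0526 is therefore explained as the sodium adduct of the same compound that would appear near m/z 181.0707 as the protonated ion, which is why such features have to be grouped before quantitation.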

Current software ecosystem for computational mass spectrometry.
The most established and common tool design for proteomics and metabolomics data analysis is the monolithic desktop application. In this type of bioinformatics tool, all the analysis steps (Figure 1) are encapsulated in the same application, which lends itself to being used as a black box, with users having little understanding of the intermediate analysis steps.

MaxQuant (Proteomics)
MaxQuant [28] is one of the most frequently used platforms for mass spectrometry (MS)-based proteomics data analysis. The platform includes a database search engine (Andromeda) to perform the peptide identification and a set of algorithms and tools designed for quantitative label-free proteomics, MS1-level labelling, and isobaric labelling techniques.
Recently, MaxQuant has implemented full export to the mzTab file format, enabling the proteomics community to perform complete submissions to ProteomeXchange repositories and to analyse the data in a standard file format [29].
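Because mzTab is a tab-separated text format in which every line starts with a three-letter section prefix (for example MTD for metadata, PSH for the PSM header and PSM for PSM rows), a downstream component can read it with very little code. The following is a minimal sketch under that assumption; the function name `read_mztab` and the sample content are hypothetical, and the sketch ignores most of the validation that a real mzTab reader performs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Maps each header-line prefix to the prefix of the rows it describes.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}

Table = List[Dict[str, str]]


def read_mztab(text: str) -> Tuple[Dict[str, str], Dict[str, Table]]:
    """Split mzTab text into a metadata dict and one table per row prefix."""
    metadata: Dict[str, str] = {}
    headers: Dict[str, List[str]] = {}
    tables: Dict[str, Table] = defaultdict(list)
    for line in text.splitlines():
        fields = line.split("\t")
        prefix, values = fields[0], fields[1:]
        if prefix == "MTD" and len(values) >= 2:
            metadata[values[0]] = values[1]
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = values
        elif prefix in headers:
            # Data row: pair each value with the column name from its header line.
            tables[prefix].append(dict(zip(headers[prefix], values)))
        # Comment lines (COM), blank lines and unknown prefixes are skipped.
    return metadata, dict(tables)


sample = (
    "MTD\tmzTab-version\t1.0.0\n"
    "COM\tillustrative content only\n"
    "PSH\tsequence\tPSM_ID\n"
    "PSM\tPEPTIDE\t1\n"
)
meta, tables = read_mztab(sample)
```

The point of the sketch is that any tool able to emit or consume such a format can be connected to another without a tailored conversion step.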

Skyline (Proteomics)
Skyline [25] is an open source platform for targeted and data-independent proteomics and metabolomics data analysis. It runs on Microsoft Windows and supports the raw data formats of multiple mass spectrometer vendors. It provides a graphical user interface to display chromatographic data for individual peptide or small molecule analytes. Skyline supports multiple workflows, including selected reaction monitoring (SRM) / multiple reaction monitoring (MRM), parallel reaction monitoring (PRM), data-independent acquisition (DIA/SWATH) and targeted data-dependent acquisition. Because both SRM and DIA data are based on the analysis of MS/MS chromatograms (selected and extracted respectively), the processing (chromatogram peak integration) and visualization of data acquired using these two methods is very similar within Skyline. In a recent publication, the Skyline team recognized that one of the areas for future work is the parallelization and distribution of computation and processing on HPC and cloud architectures [31]. These developments will be vital in obtaining the robust, sensitive quantitative measurements required to better understand the systems biology of cells, organisms, and disease states.

XCMS2 and MZmine2 (Metabolomics)
XCMS2 [32] and MZmine2 [33] have arguably become the most widely used free software tools for pre-processing untargeted metabolomics data. XCMS2 is publicly available software that can be used within the R statistics language [32]. XCMS2 is capable of providing structural information for unknown metabolites. Its "similarity search" algorithm was developed to detect possible structural motifs in the unknown metabolite which may produce characteristic fragment ions and neutral losses related to reference compounds contained in reference databases, even if the precursor masses are not the same. In addition, XCMS provides algorithms and tools to find peaks, align/group peaks, correct retention times between different samples, fill peaks, and filter by dilution, among other methods [32].
MZmine was first introduced in 2005 as an open-source software toolbox for LC-MS data processing. The first version of MZmine defined the data analysis workflow and implemented simple methods for data processing (e.g. peak noise detection) and visualization. In 2010 [33], a critical assessment of the tool found that MZmine was built on a monolithic design, which limited the possibility of expanding the software with new methods developed by the scientific community. In response, MZmine2 was completely redesigned to be modular. MZmine2 is built from multiple data processing modules, with emphasis on usability and support for high-resolution spectra processing. MZmine2 includes the identification of peaks using online databases, MS-n data support, improved isotope pattern support, scatter plot visualization, and a method for peak list alignment based on the random sample consensus (RANSAC) algorithm.
In 2017, Weber and co-workers conducted a survey on software data usage in metabolomics and found that LC-MS data analysis in metabolomics is performed in 84% of  can be easily combined into analysis pipelines even by non-experts and can be used in proteomics workflows. These applications range from useful utilities (file format conversions, peak picking) to wrappers for known applications like peptide identification search engines.
These two frameworks have recently been used to analyse big datasets [36,37]. Although they have been fully implemented as component-based frameworks, they have been slow to implement and promote standard file formats between components.

Standard file formats for better compatibility between components.
Standard file formats allow the development of a common persistent (e.g. file) representation of the data being analysed (e.g. spectra, peptides). The proposed approach in Figure  A major problem for standardised workflows in mass spectrometry-based metabolomics and proteomics is the lack of intermediate exchange formats similar to the existing genomics formats (such as BAM, SAM, CRAM, VCF and BED, to name a few). In these younger fields, tools often generate results in ad-hoc formatted files with poor specifications that are incompatible with the downstream tools that would naturally follow them in a pipeline. This means that further tailored conversion steps need to be provided, which slow development, require maintenance, and might in some cases introduce errors or data loss.

Packaging and deployment using BioContainers
A component-based architecture like the one proposed in this manuscript (Figure 2) poses multiple challenges in deployment and execution. The created packages and containers contain all the software dependencies needed to execute the tool in question. In general, one package contains just one tool; large packages containing many tools are discouraged. This makes it possible to execute the pipeline in different compute environments without the complexity of installation, dependency management, etc. It also allows the pipeline to be moved from one environment to another (e.g. HPC, Cloud or local personal computer) because everything is executed in containers. At the time of writing, BioContainers provides more than 7000 bioinformatics containers that can be searched, tagged and accessed through a common web registry (https://biocontainers.pro/#/registry/). Importantly, the BioContainers and BioConda communities automatically convert Bioconductor packages into containers.
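The portability argument can be sketched as follows: the same tool invocation is wrapped in a container command, and only the container engine changes between environments. This is a minimal illustration, not the BioContainers tooling itself; the function name `container_command`, the `/data` mount point, the image name `biocontainers/blast:v2.2.31_cv2` and the BLAST arguments are assumptions made for the example.

```python
from typing import List


def container_command(image: str, tool_args: List[str], workdir: str,
                      engine: str = "docker") -> List[str]:
    """Build (but do not run) the command line that executes a tool inside a container.

    The host directory `workdir` is mounted at /data, which is also used as the
    working directory inside the container.
    """
    if engine == "docker":
        return ["docker", "run", "--rm",
                "-v", f"{workdir}:/data", "-w", "/data",
                image, *tool_args]
    if engine == "singularity":
        # Singularity can execute Docker images referenced with a docker:// URI.
        return ["singularity", "exec",
                "--bind", f"{workdir}:/data", "--pwd", "/data",
                f"docker://{image}", *tool_args]
    raise ValueError(f"unsupported container engine: {engine}")


# Hypothetical image name and tool arguments, used only for illustration.
IMAGE = "biocontainers/blast:v2.2.31_cv2"
ARGS = ["blastp", "-query", "query.fa", "-db", "proteins"]

on_workstation = container_command(IMAGE, ARGS, "/home/user/run1")
on_cluster = container_command(IMAGE, ARGS, "/scratch/run1", engine="singularity")
```

The returned list can be passed to `subprocess.run`; the tool arguments are identical in both cases, which is what makes the pipeline step portable.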

Workflow systems
High-throughput bioinformatic genomics and transcriptomics analyses increasingly rely on pipeline frameworks to process sequence and metadata.

Nextflow
Nextflow (https://www.nextflow.io/) is an expressive, versatile and particularly comprehensive framework for composing and executing workflows. Nextflow uses a domain-specific language (DSL) which also supports the full syntax and semantics of Groovy, a dynamic language that runs on the Java platform. One of the features that make Nextflow a powerful workflow engine is its dataflow functionality. Within the workflow definition, Nextflow allows users to filter data, run processes conditionally on data values, or express splitting and merging pipeline steps in a short, elegant syntax.
Nextflow separates the workflow definition from the execution environment, which allows users to execute the same workflow on different architectures (Cloud, HPC or a local machine). This level of abstraction is guaranteed by an execution layout that defines which type of containers will be used to execute the tools (components of the workflow) and which type of architecture will be used to execute those containers (e.g. HPC, Cloud
Schmidt, A., Forne, I., Imhof, A., Bioinformatic analysis of proteomics data. BMC Syst Biol 2014, 8 Suppl 2, S3.
[16] Griss, J., Perez-Riverol, Y., Lewis, S., Tabb, D. L., et al., Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics dataset
The design of bioinformatics workflows that use specific containers and abstract the execution from the compute environment (e.g. Cloud or HPC). A very important step of this design is the use of standard file formats that enable communication between the different tools and steps of the workflow.
The workflow step (called process) describes which process will be performed and its input/output parameters. The container section inside the blastSearch process states which container will be used, including the container name (blast) and the version of the container (v2.2.31_cv2). Between triple quotes is the actual command that will be executed in the container (in this case blast). This is needed because one container can provide multiple tools.
(B) The Nextflow config file (https://www.nextflow.io/docs/latest/config.html) defines how the present workflow (A) will be executed. In the example, we have defined two possible scenarios: local and lsf. If the user runs the workflow with the local configuration, it will use Docker containers; if the user selects lsf, it will use Singularity and the LSF cluster executor.
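The separation between what is run and where it is run can be mimicked in a few lines of Python. This is only a conceptual sketch of the two scenarios described in the caption, not Nextflow syntax and not the Nextflow API; the names `PROFILES` and `execution_plan` are hypothetical.

```python
from typing import Dict

# The two scenarios from the caption: each profile fixes the executor and container engine.
PROFILES: Dict[str, Dict[str, str]] = {
    "local": {"executor": "local", "container_engine": "docker"},
    "lsf": {"executor": "lsf", "container_engine": "singularity"},
}


def execution_plan(process: str, container: str, profile: str) -> Dict[str, str]:
    """Combine a workflow step with the settings of the selected execution profile."""
    if profile not in PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    settings = PROFILES[profile]
    return {
        "process": process,
        "container": container,
        "executor": settings["executor"],
        "container_engine": settings["container_engine"],
    }


# The workflow step is defined once and stays the same for every profile.
plan_local = execution_plan("blastSearch", "blast:v2.2.31_cv2", "local")
plan_lsf = execution_plan("blastSearch", "blast:v2.2.31_cv2", "lsf")
```

Only the profile name changes between the two plans, which mirrors how the same workflow definition can be launched on a workstation or on an LSF cluster.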