TopDownApp: An open and modular platform for analysis and visualisation of top-down proteomics data

Although top-down (TD) proteomics techniques, aimed at the analysis of intact proteins and proteoforms, are becoming increasingly popular, efforts are needed at different levels to generalise their adoption. In this context, numerous improvements are possible in the area of open science practices, including a greater application of the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. These include, for example, increased data sharing practices and readily available open data standards. Additionally, the field would benefit from the development of open data analysis workflows that enable the reuse of public datasets, something that is increasingly common in other proteomics fields.

The increasing popularity of open science practices in proteomics has resulted in the dramatic growth of publicly available MS-based proteomics datasets. The PRIDE database (https://www.ebi.ac.uk/pride/) [2] is the largest proteomics data repository worldwide and is one of the founding members of the ProteomeXchange consortium [3]. As of August 2023, PRIDE stores approximately 35,500 datasets. ProteomeXchange resources adopt the FAIR data principles (Findable, Accessible, Interoperable and Reusable) [4]. One of the main benefits of making data publicly available is to enable data reuse and the reproducibility of the analysis, facilitating an independent assessment of the results described in the corresponding publications. Additionally, as is already happening in other proteomics fields, new knowledge and tools are being generated from data reuse activities, for instance by applying machine learning techniques [5]. Furthermore, in the context of data interoperability, it is important to highlight that ProteomeXchange resources implement the main open data standards developed by the Proteomics Standards Initiative [6], such as mzML (for MS data) [7] and mzTab (for peptide/protein identification and quantification) [8,9].
However, despite improvements in recent years, in the context of open science practices, TD proteomics still lags behind other, more widely implemented, proteomics fields. For instance, the total number of TD proteomics datasets submitted to PRIDE is still small (approximately 231 datasets, as of May 2023); the trend is fortunately improving, albeit slowly. In our view, the low number of TD datasets in the public domain is due to two reasons. On the one hand, the number of TD proteomics practitioners is still relatively low when compared with bottom-up approaches. On the other hand, and more relevant, TD datasets are not being made available at the same pace as in other proteomics fields. One notable exception is the data for the Blood Proteoform Atlas [10], accounting for 56 ProteomeXchange datasets among its 1507 LC-MS/MS experiments. Furthermore, apart from greater data availability in the public domain, to increase the reuse of TD proteomics data, it is important to develop open software for enabling computational data (re)analysis and visualisation. Indeed, in this context, advances in topics such as data integration and the reproducibility of the computational analyses, including scaling up analysis capabilities for increasingly large experiments, are key to match the ongoing developments in other proteomics fields, and in the life sciences as a whole [11].
We here introduce an open, modular and flexible platform called TopDownApp for the analysis and visualisation of TD proteomics data, which can be applied for instance to public datasets. TopDownApp is open source and freely available under an Apache 2.0 licence to allow for free use and software development. The source code can be found at https://github.com/mwalzer/TopDownApp. The following contains the technical descriptions of TopDownApp. A glossary containing definitions for specialist terminology can be found in the Supplementary Material (Section 1).
Since the TD data analysis methodology is still quite dynamic, a flexible and modular approach to data analysis was necessary. This is why TopDownApp was implemented using automated and modular analysis workflows, that is, a flexible succession of tools, inputs/outputs connected through open data standard formats (as the interoperability layer between analysis tools), and a modular choice of tools (through software containerisation) orchestrated in Nextflow [12]. In the context of TD data analysis, currently, the most common and generally successive tasks are: MS raw file access, deconvolution, and proteoform identification.
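The idea of a flexible succession of interchangeable tools can be illustrated with a minimal Python sketch. This is not the TopDownApp implementation (which is orchestrated in Nextflow); all function names and file-naming conventions below are hypothetical stand-ins that only manipulate file names, assuming each step consumes the output file of the previous one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A step takes the path of its input file and returns the path of its output file.
Step = Callable[[str], str]


@dataclass
class Workflow:
    """An ordered succession of steps connected through their files."""
    steps: List[Step]

    def run(self, raw_file: str) -> List[str]:
        """Run the steps in order, feeding each output into the next step."""
        artefacts = [raw_file]
        for step in self.steps:
            artefacts.append(step(artefacts[-1]))
        return artefacts


def _swap_suffix(path: str, new_suffix: str) -> str:
    """Replace everything after the last dot with new_suffix."""
    return path.rsplit(".", 1)[0] + new_suffix


# Hypothetical stand-ins: a real step would call the containerised tool.
def convert_raw(path: str) -> str:
    return _swap_suffix(path, ".mzML")


def deconvolve(path: str) -> str:
    return _swap_suffix(path, "_deconv.mzML")


def identify_toppic(path: str) -> str:
    return _swap_suffix(path, "_toppic.mzTab")


def identify_topmg(path: str) -> str:
    return _swap_suffix(path, "_topmg.mzTab")


# The identification tool is a modular choice; the other steps stay untouched.
IDENTIFIERS: Dict[str, Step] = {"toppic": identify_toppic, "topmg": identify_topmg}


def build_workflow(identifier: str = "toppic") -> Workflow:
    """Assemble raw file access, deconvolution and identification in order."""
    return Workflow([convert_raw, deconvolve, IDENTIFIERS[identifier]])


if __name__ == "__main__":
    print(build_workflow("topmg").run("sample.raw"))
```

Because the steps only share file formats, swapping the identification tool changes one entry in the list and leaves raw file access and deconvolution as they are.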
Additionally, workflows that already support the deconvolution visualisation can be used through a user interface, as shown in Figure 1.
The workflow results are reported in multiple formats: the tools' native output and, of note, the open mzML and mzTab formats adapted for TD data. Table 1 includes the tools and the corresponding versions that are part of TopDownApp at the time of writing.
The application is designed to work on local machines as well as in remote settings. Beyond this, the framework can be used for the development of scaled-up and/or automated use cases, since the different parts are functional on their own or in combination. For example, the workflow can be used to automate the data analysis, in particular on research compute infrastructures, which frequently support the use of Singularity software containers [18]. Additionally, the workflow is designed to use all necessary tools through software containers, meaning it can use the appropriate versions of the tools regardless of locally installed versions or the operating system. All tool parameters used in the workflow, including their software container, can be specified via a customisable configuration file. This makes updates and customisation simple. Note, however, that parameter changes from one version or tool to the next may imply changes in how the tool is used, which need to be reflected in the workflow itself. If new tools are added, data compatibility needs to be ensured (preferably via the use of the open data standards mzML and mzTab). For example, for now, TopDownApp supports Thermo RAW files, but RAW to mzML file converters compatible with other instrument vendors could be plugged in as soon as they become available (see Supplementary Material).
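As a rough illustration of how a configuration entry can carry both the tool parameters and the software container, the following Python sketch turns such an entry into a container command line. It is not taken from the TopDownApp configuration: the image name, executable and flags are invented placeholders, and bind mounts for data directories are omitted for brevity.

```python
import shlex
from typing import Dict, List

# Hypothetical configuration: image, executable and flags are placeholders,
# not the actual TopDownApp settings.
CONFIG: Dict[str, Dict] = {
    "deconvolution": {
        "container": "example/deconv-tool:1.0",
        "executable": "deconv-tool",
        "params": {"--min-charge": 2, "--max-charge": 50},
    },
}


def container_command(step: Dict, input_file: str, engine: str = "singularity") -> List[str]:
    """Build the argument list that runs one configured tool in its container."""
    if engine == "singularity":
        # 'singularity exec' can run a command inside an image pulled from a Docker registry.
        prefix = ["singularity", "exec", "docker://" + step["container"]]
    elif engine == "docker":
        prefix = ["docker", "run", "--rm", step["container"]]
    else:
        raise ValueError(f"unsupported container engine: {engine}")
    args: List[str] = []
    for flag, value in step["params"].items():
        args += [flag, str(value)]
    return prefix + [step["executable"]] + args + [input_file]


if __name__ == "__main__":
    command = container_command(CONFIG["deconvolution"], "sample.mzML")
    print(shlex.join(command))
```

Updating a tool then amounts to editing the `container` tag, while a renamed or removed flag in a new tool version requires editing `params`, which mirrors the caveat about parameter changes between versions.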
As a result of the software containerisation, the minimal setup required for a user to run the tool is the container system (Singularity/Docker) and a combined container (see Supplementary Material for instructions, Section 4). The application can be started via a single command and then used through a web browser (see Supplementary Material Section 4 and the code repository at https://github.com/mwalzer/TopDownApp for more details). Here, a local Thermo RAW file and a protein sequence database (in fasta format) can be selected as input, and the protein modification parameters can also be set. Currently, the user interface allows researchers to select a number of standard protein modifications and combinations thereof. From there, the analysis workflow can be started, and the results can be inspected once the data analysis process is finished. The successfully deconvolved spectra can be selected for visualisation against their original form from a table (Figure S2). Additionally, within each spectrum, each peak can be selected for deconvolution visualisation. Identified spectra are listed in a table, from which one can also select the corresponding spectra to be visualised. New visualisations can be developed using a Python notebook and added to the user interface (see Supplementary Material Section 8).
The identification results are also available to download in the form of a development version of the mzTab standard format for TD proteomics data (see Supplementary Material, Section 3). The mzTab format was chosen because it has proven to be effective for representing identification data in other specialisations of MS technologies, including metabolomics and lipidomics [8]. Furthermore, the format adaptations necessary for TD data are minor, and we expect a straightforward process to establish a community-agreed extension of mzTab.
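To give an impression of why a tab-separated format such as mzTab is easy to consume, the sketch below reads one section of an mzTab-like text with the Python standard library. It relies only on the general mzTab layout, in which each line starts with a line-type prefix (for example `MTD` for metadata, `PSH` for the PSM header and `PSM` for PSM rows). The example rows are invented, use a reduced set of columns, and do not reflect the TD-specific development version described in the Supplementary Material.

```python
from typing import Dict, Iterable, List


def read_mztab_section(lines: Iterable[str],
                       header_prefix: str = "PSH",
                       row_prefix: str = "PSM") -> List[Dict[str, str]]:
    """Collect the rows of one mzTab section as dictionaries keyed by column name."""
    header = None
    rows: List[Dict[str, str]] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == header_prefix:
            header = fields[1:]
        elif fields[0] == row_prefix and header is not None:
            rows.append(dict(zip(header, fields[1:])))
    return rows


# Invented example content with a reduced column set (illustration only).
EXAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "PSH\tsequence\tPSM_ID\taccession\tcharge",
    "PSM\tMKVLAAGIVGL\t1\tPROT_A\t9",
    "PSM\tMSEQNNTEMTF\t2\tPROT_B\t11",
])

if __name__ == "__main__":
    for row in read_mztab_section(EXAMPLE.splitlines()):
        print(row["PSM_ID"], row["accession"], row["charge"])
```

All values are returned as strings; a consumer would convert columns such as `charge` to numbers as needed.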
FIGURE 1 Overview of the use (left panel) and the processes (right panels) for TopDownApp. Local input selected through the web browser (of RAW files and a fasta sequence database) is processed via Nextflow and dedicated tool software containers. The workflow is then started through the browser, and after processing the results can be inspected there. The Nextflow workflows can also be used independently to process large batches of data.

TABLE 1 The tools in use with the current version (as of August 2023) of TopDownApp.

Likewise, the representation of TD peak data in mzML is already supported by the data standard. However, for complete compatibility, the deconvolution data needs to be represented and formalised in a format extension, for which we developed a functional proposal. We used a TD spectrum reporting convention that attaches deconvolution information such as charge and isotope target to a user-defined section of each spectrum's element in mzML (userParam). This is compatible with the current release of the widely used Python library for mzML consumption (and other PSI data standards), Pyteomics [17], and many other mzML capable software libraries. For now, only FLASHDeconv supports the deconvolution specific mzML output, and thus visualisation of deconvolution is exclusive to FLASHDeconv. For both mzML and mzTab, input from the TD proteomics community will be needed for both to become accepted data standards for representing TD data. For details on the format specification extensions, see the Supplementary Material Section 3.
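The userParam convention can be illustrated with a small Python sketch that collects the `userParam` entries attached to each spectrum element. It uses only the standard library XML parser rather than Pyteomics, and the XML fragment is an invented, incomplete excerpt: the parameter names and values are hypothetical placeholders, not the names defined in the proposed format extension (see Supplementary Material Section 3 for those).

```python
import xml.etree.ElementTree as ET
from typing import Dict

MZML_URI = "http://psi.hupo.org/ms/mzml"
MZML_NS = {"mz": MZML_URI}

# Invented fragment for illustration; not a complete, valid mzML document.
# The userParam names below are hypothetical.
SNIPPET = """<spectrumList xmlns="http://psi.hupo.org/ms/mzml" count="1">
  <spectrum id="scan=8557" index="0" defaultArrayLength="0">
    <userParam name="deconvolved charge" type="xsd:int" value="9"/>
    <userParam name="isotope target" type="xsd:string" value="example"/>
  </spectrum>
</spectrumList>"""


def user_params_by_spectrum(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Map each spectrum id to the name/value pairs of its direct userParam children."""
    root = ET.fromstring(xml_text)
    result: Dict[str, Dict[str, str]] = {}
    for spectrum in root.iter("{%s}spectrum" % MZML_URI):
        params = {
            param.get("name"): param.get("value")
            for param in spectrum.findall("mz:userParam", MZML_NS)
        }
        result[spectrum.get("id")] = params
    return result


if __name__ == "__main__":
    print(user_params_by_spectrum(SNIPPET))
```

Since `userParam` is a regular part of the mzML schema, readers that do not know the convention can still load such files and simply ignore the extra entries.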
We demonstrate the utility of TopDownApp with two benchmarking datasets, the first to showcase examples of visualisation, and the second to provide a re-assessment of a previously published human dataset. The first dataset corresponds to an E. coli lysate, measured using a Thermo Scientific Orbitrap Eclipse (MassIVE dataset accession MSV000087484) [19]. There, we used an E. coli proteome database from UniProtKB/SwissProt (canonical; release 2023_01; K12 taxon), and selected one modification (Oxidation [Unimod:35]). The workflow configuration used was FLASHDeconv for deconvolution and TopPIC for identification. We used the user interface to visualise the precursor deconvolutions. Figure 2 shows how TopDownApp can be used to examine the signal quality of the precursors corresponding to the identified proteoforms, which is crucial information for the quality control of identifications [20,21].
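When inspecting precursor signals, it helps to recall where the differently charged isotope packets of one proteoform appear on the m/z axis. The following sketch computes these positions under the assumption of positive-mode protonated ions, using the standard relation m/z = (M + z * m_proton) / z. The function names are hypothetical, and the example mass is the monoisotopic mass reported for scan 8557 in the Figure 2 caption.

```python
from typing import Dict, Iterable

PROTON_MASS = 1.007276      # Da, mass of a proton
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def mz_for_charge(neutral_mass: float, charge: int) -> float:
    """m/z of a protonated ion with the given neutral mass and charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz: float, charge: int) -> float:
    """Inverse relation: neutral mass from an observed m/z and charge."""
    return charge * (mz - PROTON_MASS)


def isotope_peak_spacing(charge: int) -> float:
    """Approximate m/z distance between neighbouring isotope peaks."""
    return ISOTOPE_SPACING / charge


def charge_ladder(neutral_mass: float, charges: Iterable[int]) -> Dict[int, float]:
    """m/z position of the monoisotopic peak for each requested charge state."""
    return {z: mz_for_charge(neutral_mass, z) for z in charges}


if __name__ == "__main__":
    for z, mz in charge_ladder(8564.733, range(8, 13)).items():
        print(f"z={z:2d}  m/z={mz:9.4f}  isotope spacing={isotope_peak_spacing(z):.4f}")
```

The shrinking isotope spacing at higher charge is one reason why the packets of a single proteoform are spread over separate m/z regions in the raw spectrum.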

TABLE 2 Results from the batch application of the default workflow setting (FLASHDeconv + TopPIC) in TopDownApp, plus an additional workflow setting (FLASHDeconv + TopMG), for the datasets PXD026123 and PXD026159.

Table 2 shows the reanalysis results using different tools for identification, split by the number of identified proteins (protein accessions), number of truncated sequences for Proteoform Spectrum Matches (PrSM) with or without PTMs, number of proteoforms, and PrSMs. These are contrasted with numbers derived from the TDPortal result export reported with the original Blood Proteoform Atlas publication, selected for the MS runs in the two datasets.
The samples of dataset PXD026123 were subject to an enrichment strategy. This, combined with the fact that this dataset comprises 35 MS runs, as opposed to the 6 MS runs of dataset PXD026159, can explain the overall greater number of identifications detected. Switching from the TopMG identification software to TopPIC also noticeably affected the number of observed identifications. This can be explained by the different goals of the identification analysis for the two tools. TopMG is designed "for identifying highly modified proteoforms", whereas TopPIC is built for the characterisation of proteoforms at the proteome level [22]. This is further corroborated by comparing
TopDownApp is free and open to use, and can be utilised on a local computer or an institutional cluster set-up. The installation is as simple as downloading and running the containerised application (Supplementary Material Section 4). However, unlike other available computing environments for TD data processing like the web browser-based TDPortal or TopPIC Gateway [23,24], or the stand-alone tool MASH explorer [25] (which needs the Windows operating system to run), the workflows themselves are not reconfigurable via a graphical user interface. Instead, high-throughput analysis can be conducted on a wide variety of compute infrastructures through the application of the underlying Nextflow workflows. Changing the sequence or the type of tools included in the workflow needs to be coded in Nextflow. As a beneficial side effect of this, adding new tools into the workflow can be achieved with a few lines of configuration changes for any software, provided it can be called from the command line and has compatible input/output. In the case of MASH explorer, external tools can also be added, but the workflows cannot be run on standard Linux clusters.
Another distinct characteristic of TopDownApp is the direct integration of the results visualisation as the main component of the user interface.Unfortunately, it was not possible for us to perform a comparison of the functionality with the popular tool ProSight [26] due to its commercial nature.
In the future, TopDownApp could be enhanced with additional open software modules for deconvolution and identification, such as MSPathFinderT [27], to further increase the software options for data analysis. Moreover, the inclusion of a dedicated label-free quantification module, such as FLASHQuant [28], would be ideal to enable quantitative analysis. In general, when the data standards for TD data have matured and been implemented widely, we expect the field of TD proteomics to have many more interoperable analysis tools at its disposal, leading to better workflows and analyses. To ensure robust and reproducible results, adopting additional quality control strategies at different levels, including proteins, protein isoforms and proteoforms, would be highly beneficial. Currently, parameter configuration through the browser is not fully implemented due to the wide variety of supported parameters. Adoption of other PSI standard data formats, such as the ProForma 2.0 notation [29] for proteoform representation, would also improve data interoperability.
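As a brief illustration of what a ProForma-style proteoform string looks like, the sketch below parses only the simplest subset of the notation: one-letter residues, each optionally followed by a single modification in square brackets. The full ProForma 2.0 specification covers considerably more (terminal modifications, mass shifts, ambiguity, and so on), none of which is handled here. The example sequence is invented; the modification accession corresponds to the oxidation used in the E. coli analysis.

```python
import re
from typing import List, Optional, Tuple

# One upper-case residue, optionally followed by one bracketed modification.
_TOKEN = re.compile(r"([A-Z])(?:\[([^\]]+)\])?")


def parse_simple_proforma(text: str) -> List[Tuple[str, Optional[str]]]:
    """Split a simple ProForma-style string into (residue, modification) pairs."""
    tokens: List[Tuple[str, Optional[str]]] = []
    position = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            raise ValueError(f"unsupported notation at position {position}")
        tokens.append((match.group(1), match.group(2)))
        position = match.end()
    if position != len(text):
        raise ValueError(f"unsupported notation at position {position}")
    return tokens


def bare_sequence(text: str) -> str:
    """Return the amino acid sequence with all modifications removed."""
    return "".join(residue for residue, _ in parse_simple_proforma(text))


if __name__ == "__main__":
    print(parse_simple_proforma("EM[UNIMOD:35]EVK"))
    print(bare_sequence("EM[UNIMOD:35]EVK"))
```

Anything outside this subset raises a `ValueError` rather than being silently misread.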
We expect that the availability of an open data analysis workflow will enable the reuse and reanalysis of TD proteomics datasets in the public domain. This would increase recognition of the original datasets' authors and open many possibilities, such as the integration of the results in popular bioinformatics resources such as PRIDE, UniProtKB [30] and the Human Proteoform Atlas [31] (among others), making TD proteomics and proteoform data more FAIR. Herein lies great potential for increased recognition of the authors in the field of TD proteomics. Many TD publications do not yet provide access to the corresponding data in ProteomeXchange resources. In our view, the TD proteomics field needs to be more proactive in open science practices. Similarly, we expect that the availability of TopDownApp as an open and shared development platform will also aid the reproducibility of TD data analysis, providing a basis for new, automated and easy to share TD data analysis workflows. In our view, the greater availability of automatable and reproducible open source analysis platforms will be essential to the success of the envisioned Human Proteoform Project [32].
Furthermore, we envision improved accessibility to deconvolution and identification data. As we demonstrated one method of visualisation of TD identification results, we hope this will serve as a basis for more analysis results to be made (visually) accessible to current practitioners.

Figure 2 (left and right panels) shows visualisation examples of high and low quality precursor signals, respectively. For each precursor mass, TopDownApp shows all its differently charged isotope packets in the input raw spectrum (blue colour coded peaks) as well as noisy peaks around them (red peaks). Users can easily distinguish signal and noise components by colour and do not need to search for separate m/z value regions to observe peaks with

FIGURE 2 Example of a successful and unsatisfactory deconvolution process from the reanalysis of the E. coli MSV000087484 dataset. (A) Deconvolution of scan 8557, monoisotopic peak mass 8564.733, exemplifying a successful deconvolution with blue peaks for target peaks, pink peaks declared noise, and dashed curve showing a good approximation of the isotope traces. (B) Deconvolution of a poor quality signal in scan 8722, monoisotopic peak mass 13101.864.