Exploring the Genomic Landscape of Cancer Patient Cohorts with GenVisR

The creation of visualizations to interpret genomics data remains an important aspect of data science within computational biology. The GenVisR Bioconductor package was created to lower the entry point for publication‐quality graphics and has remained a popular suite of tools within this domain. GenVisR supports visualizations covering a breadth of topics including functions to produce visual summaries of copy‐number alterations, somatic variants, sequence quality metrics, and more. Recently, the GenVisR package has undergone significant updates to increase performance and functionality. To demonstrate the utility of GenVisR, we present protocols for use of the updated Waterfall() function to create a customizable Oncoprint‐style plot of the mutational landscape of a tumor cohort. We explain the basics of installation, data import, configuration, plotting, clinical annotation, and customization. A companion online workshop describing the GenVisR library, Waterfall() function, and other genomic visualization tools is available at genviz.org. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
The creation of genomic visualizations remains an integral part of data exploration and presentation (Qu, Lau, Nguyen, Zhou, & Catchpoole, 2019). Since its inception, Gen-VisR has focused on supporting this facet of bioinformatics, lowering the entry point for high-quality visualizations. The GenVisR library has numerous functions covering visualization of mutations (e.g., single-nucleotide variants, insertions, deletions), mutation features (e.g., transitions versus transversions), loss of heterozygosity, copy number alterations, and other genomic events. As of May 2021, the GenVisR library is in the 87th percentile of all Bioconductor packages. It has found use in diverse analyses, including large-scale cancer studies, explorations of agricultural crop resistance genes, and recently, examination of the COVID-19 mutational landscape (Bayer et al., 2019;Napit et al.,2021;Wagner et al., 2018). Efforts to improve the functionality of GenVisR have continued and have recently centered around the popular Waterfall() function. This function produces an Oncoprint-style plot where somatic variant data and relevant metrics are displayed at the cohort level (Gao et al., 2013). This function has been re-factored to include several enhancements, including the conversion from a functional to objectoriented model using the Bioconductor S4 class system. This has numerous development advantages and will decrease the likelihood of a user encountering unexpected errors. Further improvements include the extensive use of the data.table package in function internals, providing significant performance improvements in terms of speed and memory. The expansion of unit tests, with >95% code coverage, has reduced the chance that a package update will introduce an unforeseen bug. Finally, numerous improvements to the end-user experience have been made, including new functions to retrieve data from the plots.

GENERATING A Waterfall() PLOT FROM ORIGINAL DATA
The GenVisR Waterfall() function has the ability to read in data in a variety of formats including Ensembl Variant Effect Predictor (VEP) or mutation annotation formats (MAF; (McLaren et al., 2016); https:// docs.gdc.cancer.gov/ Data/ File_Formats/ MAF_Format/ #introduction. These standard formats have the advantages of being widely used and simple to create. However, the Waterfall() function also has the flexibility to support custom data formats in the form of a data.frame or data.table structure, given that the structure has at least gene, sample, and mutation information. Here we will use that method for an illustrative set of variant calls from a Phase 1 clinical trial of Buparlisib for metastatic breast cancer patients (Ma et al., 2016). This file was derived from pipelines running the Genome Modeling System (Griffith et al., 2015).

Software
3. The Waterfall() function will look for specific column names within the data.table; these are expected to be "sample", "gene", and "mutation". Here we rename our columns to conform to this expectation. The amino acid change column is not required, but is used in Basic Protocol 4. All other columns are not used and ignored.
4. In situations where there are multiple mutations for the same gene/sample, we need to specify which mutation type should get the priority for plotting purposes. This is done by defining a mutation hierarchy as a data.table, where the most significant mutation should come first. We also tell Waterfall what colors we would like to plot in this step.
• plotData <-Waterfall(myVars, mutationHierarchy = my-Hierarchy) 6. We next save this plot to a PDF device for viewing and/or publication. First we open a PDF device in R. We then use the GenVisR drawPlot() function to output a plot ( Fig.  1), and finally close the graphics device.
Saving to a PDF is optional; calling drawPlot() by itself will generate a plot in the Rstudio viewer window. Also, R provides a number of options for saving graphics, in addition to pdf() there are png(), jpeg(), tiff(), and bmp() functions that are available for use within base R.

ADDING CLINICAL DATA TO A Waterfall() PLOT
It is often beneficial to view clinical annotations for each patient in the context of mutations. Doing so may help identify important associations. For example, mutations in a specific gene may be associated with treatment response, sex, tumor stage, or subtype. Waterfall() allows for the addition of clinical annotations to make such patterns obvious. This protocol is an addendum to Basic Protocol 1 and assumes that the relevant libraries and data from that protocol are already present.

Necessary Resources Hardware
A modern compute environment capable of running R/Rstudio > 3.5.0.

Files
An example dataset containing clinical data from a Phase1 clinical trial is available at: http:// genomedata.org/ gen-viz-workshop/ GenVisR/ BKM120_Clinical.tsv myVars (see Basic Protocol 1) myHierarchy (see Basic Protocol 1) 1. First, read the clinical data. In this case, data is saved in a custom TSV file and read directly from a URL.
3. The "sample" name column should be named as such, and should match those sample designations that were used in the myVars data structure (i.e., Sample 1 should be named as such in both the myVars and myClinical data objects). • setnames(myClinical, c("sample", "Best response", "PTEN Immunohist.")) • myClinical[,sample := gsub("WU0+","",sample)] 4. Create a named vector for clinical variables and colors to set a color palette. In our example, we have six clinical variables that we must supply names and colors for.
5. Read the clinical information into GenVisR via the Clinical() function. This function looks for a data.frame or data.table with a "sample" column. All other columns will be treated as clinical variables. We also set a color palette and specify that we would like our legend split up into two columns.

CUSTOMIZING MUTATION BURDEN IN Waterfall() PLOTS
By default, the Waterfall() function will display observed mutation frequencies from the input data prior to any filtering that the function itself does. However, this can be misleading due to variances in the genomic space sequenced within the cohort. This issue can be resolved by calculating the tumor mutation burden (TMB), which normalizes for covered space. In this protocol, we examine how to instruct Waterfall() to output a TMB instead of a frequency for the top panel.

Necessary Resources Hardware
A modern compute environment capable of running R/Rstudio > 3.5.0 Software R/Rstudio, GenVisR

Files
An example dataset containing coverage data from a Phase1 clinical trial is available at: http:// genomedata.org/ gen-viz-workshop/ GenVisR/ bkm120_ coverage myVars (see Basic Protocol 1) myHierarchy (see Basic Protocol 1) 1. First, read in the coverage data from a custom TSV file. The file contains a sample column and the number of bases covered to at least 20×.
• mutationBurden <fread("http://genomedata.org/gen-viz-workshop/ GenVisR/bkm120_coverage.txt") 2. The sample column must match the sample identifiers used in the main plot. We use a regular expression here so the identifiers match between the data in the mutation-Burden and myVars objects.
• myMutBurden <as.numeric(mutationBurden$coverage) • names(myMutBurden) <-mutationBurden$sample 4. The Waterfall() function will plot mutation burden instead of mutation frequency if we set the "burden" flag for the "plotA" parameter. We also supply the named vector holding the coverage space so Waterfall() can properly calculate this for each sample using the parameter "coverage".

BRIEF EXPLORATION OF CUSTOMIZABLE OPTIONS
GenVisR functions are designed to be user friendly, with many aesthetic options exposed to the end user. However, exposing all aesthetic options as parameters is unfeasible. To get around this, all GenVisR plots can add layers conforming to the 'grammar of graphics' paradigm used by ggplot2 (Wickham, 2016). With knowledge of ggplot2 syntax, virtually any aesthetic element in a plot can be manipulated. In this protocol, we use some of the Waterfall() function exposed features to add greater detail to the mutation subplot (top sub-panel; plotA) and the gene subplot (left sub-panel; plotB). We also will limit the size of the display to 10 genes, and add labels to plot cells. We then go on to use ggplot to manipulate additional aesthetic features of the main plot (main panel; plotC). Specifically, we change the color and rotation of the x-axis labels as well as increase their size. These parameters are a small subset of options available to users: additional parameters include additional aesthetic changes, data filtering options, and annotation parameters. A full list of parameters is available within the R documentation of the GenVisR library.

Necessary Resources Hardware
A modern compute environment capable of running R/Rstudio > 3.5.0

R/Rstudio, GenVisR
Files myVars (see Basic Protocol 1) myHierarchy (see Basic Protocol 1) 1. The waterfall function can tabulate counts per mutation type for both the sample and gene subplots; these are controlled by switching the plotATally and plotBTally parameters to from "simple" to "complex". With these parameter settings, the top and side plots now show a more complex view of mutation frequency colored by mutation type.
• plotData <-Waterfall(myVars, mutationHierarchy = my-Hierarchy, plotATally = "complex", plotBTally = "complex") 2. A common request is to add additional information to waterfall plot cells in addition to mutation type. GenVisR supports this if there is a column in the original input from which to pull additional annotations. Here we modify the labels in the "amino acid change" column of myVars (see Basic Protocol 1), and set up Waterfall() to pull these annotations.
• plotData <-Waterfall(myVars, labelColumn="amino acid change", mutationHierarchy = myHierarchy, plotATally = "complex", plotBTally = "complex", geneMax=10) 4. Adding ggplot2 layers to a plot allows control over aesthetic elements. Here we adjust the axis text of the main plot to be horizontal, bigger, and colored red. We load the ggplot2 library first, then use ggplot2 syntax and save the resulting object into a list. This is then passed into the Waterfall() function via the parameter "plotCLayers".

GUIDELINES FOR UNDERSTANDING RESULTS
GenVisR and its associated functions are based on the grammar of graphics (ggplot2) package (Wickham, 2016). Graphical objects that are output by GenVisR are essentially "GROB" objects from ggplot2 (refer to ggplot2 documentation for details). As such, users can control any aspect of a GenVisR plot not already controlled by a parameter through the addition of ggplot2 layers. Such modifications require at least a basic understanding of ggplot2 syntax. In addition, the Waterfall() function now includes the underlying data in its output in the form of data.tables. Users are able to access this information through the S4 slot selector "@". For example, accessing the primary data of the plot outputs specified in the protocols above would look like this: plotData@primaryData. All Gen-VisR functions have R Documentation included within the package. For a complete list of parameters, options, and additional details, this documentation can be accessed by prepending a question mark in front of the function name (i.e., ?Waterfall) within the R interface.

COMMENTARY Background Information
Oncoprint-style plots (Gu, Eils, & Schlesner, 2016;Mayakonda, Lin, Assenov, Plass, & Phillip Koeffler, 2018) are commonly used in cancer genomics to aid in visually elucidating mutually exclusive or co-occurring mutation events or genotype-phenotype associations. Such events are known to happen in a variety of cancers and can inform treatment options. For example, PTEN and PIK3CA mutations have been shown to be mutually exclusive in myxoid liposarcoma (Demicco et al., 2012), emphasizing the importance of visualization of these events. Furthermore, Oncoprint-style plots have the benefit of summarizing a cohort in a compact and efficient way. This visualization method not only provides a view of mutations, but also the tumor mutation burden, the importance of each gene across the cohort, and the types of mutations impacting genes-of-interest, providing insight into the role of the gene itself. For example, a gene with a preponderance of loss-of-function mutations might be indicative of a tumor suppressor. Further, the ability to add clinical data may support the identification of genomic patterns that would remain otherwise obscure.

Critical Parameters
Users of these protocols should ensure that they are using a recent version of R and Gen-VisR. At the time of this writing, R (4.0.5) and GenVisR (1.24.0) are used. Further, users should be aware that two waterfall functions exist within the GenVisR package. These protocols correspond to Waterfall() [note the capital W, differentiating the function from the previous waterfall() function].

Troubleshooting
Table 1 lists problems that may arise with this method along with their possible causes and solutions.

Suggestions for Further Analysis
There are many tools available for genomic visualizations, from foundational plotting libraries such as ggplot2 and D3 (Bostock, Ogievetsky, & Heer, 2011;Wickham, 2016) to higher-level tools such as GenVisR. To further explore genomic visualization topics, we recommend genviz.org (Skidmore, Griffith, Griffith, Campbell, & Cotto, 2019; https:// genviz. org), a website designed to decrease the time required to learn GenVisR and other genome visualization tools. The genviz.org website covers several areas related to genomic visualization including web-based tools, GenVisR, R/ggplot2, genome browsers, and others.