detritalPy: A Python‐based toolset for visualizing and analysing detrital geo‐thermochronologic data

Detrital geochronology and thermochronology have emerged as primary methods of reconstructing the tectonic and surficial evolution of the Earth over geological time. Technological improvements in the acquisition of detrital geo‐thermochronologic data have resulted in a rapid increase in the quantity of published data over the past two decades, particularly for the mineral zircon. However, existing tools for visualizing and analysing detrital geo‐thermochronologic data generally lack flexibility for working with large datasets, hampering efforts to utilize the large quantity of available data.


| INTRODUCTION
The fields of detrital geochronology and thermochronology aim to identify the timing of crystallization and cooling of individual detrital mineral grains, respectively, with application to understanding both solid-earth (tectonic) and earth surface processes (Fedo, Sircombe, & Rainbird, 2003;Gehrels, 2012;Reiners & Brandon, 2006). Technological improvements over the last several decades, particularly in laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS), have led to greater efficiency in collecting detrital geochronologic data (Gehrels, 2014), and to a lesser extent thermochronologic data (Horne, van Soest, Hodges, Tripathy-Lang, & Hourigan, 2016). Although many detrital mineral types are used in geo-thermochronology, U-Pb and (U-Th)/He dating of detrital zircon (DZ) has emerged as a primary tool for constraining sediment provenance, refining palaeogeographic and tectonic interpretations, and constraining the depositional age of stratigraphic sequences (Fedo et al., 2003;Gehrels, 2014). Termed the "DZ Revolution" (Gehrels, 2012), the number of scientific articles, including peer-reviewed publications and conference abstracts, published per year that contain the phrase "detrital zircon" in the title has increased from ca 14 per year during the early 1990s to nearly 600 in 2016 ( Figure 1a). During this time period, both the number of grain analyses per sample and the total number of samples per study has also increased ( Figure 1a), resulting in an ever-expanding number of geo-thermochronologic data points.
How many individual DZ geochronologic analyses have been collected within the past 20 years? Voice, Kowalewski, and Eriksson (2011) and Puetz, Ganade, Zimmermann, and Borchardt (2018) compiled ca 200,000 and ca 260,000 individual DZ U-Pb ages, respectively, but these compilations are probably a small fraction of the total that has been collected. Back-of-the-envelope calculations suggest a conservative estimate of several million or more published U-Pb analyses of DZ alone (Figure 1b). This is probably a conservative estimate as not all articles that publish detrital geochronologic data contain the exact phrase "detrital zircon" in their title, and this estimate does not include unpublished data. The trend towards a large-n sampling strategy (Daniels et al., 2017;Pullen, Ibáñez-Mejía, Gehrels, Ibáñez-Mejía, & Pecha, 2014) suggests that the quantity of detrital geochronologic data will continue to increase at a rapid pace. Although detrital thermochronologic datasets are typically smaller than their geochronologic counterparts, as a result of greater effort and expense of data collection, ongoing advances in He-dating using laser ablation shows promise in increasing the future quantity of detrital thermochronometric data (Horne et al., 2016).
The proliferation of detrital geo-thermochronologic data provides an opportunity for researchers to leverage published data in interpreting their own datasets (Gehrels, 2014). Yet efficient management and analysis of such quantities of data can make even common tasks difficult, such as (a) selecting samples for comparison, (b) combining samples into groups, and (c) assessing the similarity and/or dissimilarity between samples or groups of samples (e.g. multi-dimensional scaling; Vermeesch, 2013). A number of analytical and visualization tools have been developed for working with detrital geo-thermochronologic data, including Isoplot (Ludwig, 2008), Excel-based macros from the Arizona LaserChron Center, DZStats (Saylor & Sundell, 2016), DZMix , and a number of tools developed by Pieter Vermeesch and his colleagues including Density Plotter (Vermeesch, 2012), MuDiSc (Vermeesch, 2013), and the R provenance package (Vermeesch, Resentini, & Garzanti, 2016). detritalPy, a Python 3-based toolset presented herein, supplements these existing tools by allowing efficient visualization and analysis of large detrital geochronologic and thermochronologic datasets. The following sections provide an overview of the data format required by detritalPy and an explanation of its visualization and analysis tools. Additional explanation is provided in the detritalPy manual (see Supporting Information) and within commented lines within the Python code itself (https://github.com/grsharma n/detritalPy). Example datasets from Hourigan (2013), Sharman, Graham, Grove, Kimbrough, andWright (2015), and Thomson, Stockli, Clark, Puidgefàbregas, and Fildani (2017) are used to illustrate detritalPy functions.

| DATA STRUCTURE
detritalPy requires input data to be structured by sample and detrital analysis ( Figure 2). Samples must contain a unique alphanumeric identifier (e.g. "11-Escanilla"; Figure 2) and each sample must have at least one detrital analysis. The default data input format is a Microsoft Excel  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016 Published studies per year Estimated cumulative published detrital zircon U-Pb analyses (x10⁶)   1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014   F I G U R E 2 Example of the default data structure used by detritalPy. (a) A "Samples" worksheet contains a required "Sample_ID" column and additional optional columns. (b) An "ZrUPb" worksheet contains required "Sample_ID", "BestAge", and "BestAge_err" columns. Additional columns are required for some plotting and analysis functions (e.g. "U_ppm"). Example data are included as supporting information a b F I G U R E 3 (a) Example code that illustrates how to import required libraries, load data in two separate Excel files, and plot a histogram of the distribution of analyses per sample. (b) Example code that illustrates how to select one or more individual sample(s) (above) or sample group (s) (below) for plotting and analysis (U-Th)/He ages. Each analysis must be linked with a sample by its unique sample identifier (i.e. "11-Escanilla", Figure 2). Each analysis must have at least one detrital age with an associated analytical uncertainty (1-or 2-sigma). If multiple detrital analyses are from the same grain (e.g. rim and core analyses), then a unique grain identifier can be used (e.g. "7_Guaso_81"; Figure 2b). The "ZrUPb" worksheet can optionally contain other information related to the detrital analysis (e.g. U concentration or Th/U).

SELECTION
Data can be loaded into detritalPy by simply specifying the file pathway and file name and extension ( Figure 3a). If data are present in multiple spreadsheets, these can be imported simultaneously and will be merged together, provided that there is no duplication of data and that both spreadsheets use the same column heading names. A histogram of the number of grains per sample can be optionally plotted ( Figure 3a). Samples can be selected for plotting and/or analysis via one of two options: (a) one or more individual samples can be specified by listing each unique sample identifier in an array, (b) one or more groups of samples can be specified by listing the sample identifiers that make up each group in an array, followed by an alphanumeric name for the group, all contained within a tuple data structure ( Figure 3b).

| DETRITALPY FUNCTIONS
The visualization and analysis functions included with detritalPy are described below. All functions can be modified to fit the user's requirements by modifying the source code within the included detritalFuncs library. Additional information for each of the following functions is provided within the detritalPy manual (see Supporting Information) and within commented lines within the Python code itself.

| Plot detrital age distributions
Detrital age distributions can be plotted for individual samples or sample groups using the most popular visualization types, including cumulative distributions, relative probability distributions, histograms and pie diagrams ( Figure 4). There is no limit to the number of samples or sample groups that can be plotted. The plot is divided into two parts. (a) An upper subplot contains superimposed cumulative distributions for each sample or sample group. (b) One or more lower subplot(s) includes relative probability distributions, histograms, and pie diagrams, if selected to be plotted. Each sample or sample group will be plotted in the order that the unique sample identifiers are listed. If the 'separateSubplots' variable is set to True, then each sample or sample group will be plotted in a separate subplot (Figure 5). The y-axes of the subplots will have the same scale (i.e. normalized) if the 'normPlots' variable equals True ( Figure 5). If the 'separateSubplots' variable equals False, then each normalized distribution (for a sample or sample group) will be vertically stacked within a single subplot ( Figure 5). However, histograms and pie diagrams cannot be plotted if 'separateSubplots' equals False.
Two types of relative probability distributions can be plotted. Probability density plots (PDPs) are probability distributions constructed from summing a Gaussian distribution for each analysis and normalizing the distribution such that its integral equals 1, where the mean and standard deviation of each summed Gaussian distribution is equal to the age and 1σ analytical uncertainty of the analysis (Vermeesch, 2012). Kernal density estimations (KDEs) are constructed in a similar manner to PDPs, except that a bandwidth is used for the standard deviation of each Gaussian, rather than the 1σ analytical uncertainty (Vermeesch, 2012). Although the PDP lacks a solid theoretical foundation and has been criticized as a visualization approach for detrital age distributions (Vermeesch, 2012(Vermeesch, , 2018, this option is included as the PDP continues to be widely used by the detrital geochronology community. Cumulative PDPs and/or KDEs can be also be plotted (e.g. Figure 4b).
Three options for colouring relative probability distributions have been included: (a) solid colouration that fills the area under the age distribution and that matches the line colour of the cumulative distribution, if selected for plotting (e.g. Figure 6), (b) age-dependent colouration that fills the area under the relative age distribution and that correspond to user-defined detrital age categories (e.g. Figure 4f), and (c) vertical, coloured bars that correspond to user-defined age categories. The user-specified age categories and colours can also be used to plot pie diagrams (Figure 4f).

| Plot rim age versus core age
Some detrital grains contain younger mineral growth (i.e. a rim) over an older core. Rim versus core age relationships can be plotted provided that (a) grains are identified with a unique alphanumeric label (e.g. "7_Guaso_81"), and (b) "Rim" and "Core" designations are included in a "RimCore" column ( Figure 2b). Figure 7 presents an example plot of rim versus core age from the Ainsa Basin of the Spanish Pyrenees (data from Thomson et al., 2017). Data points are automatically coloured according to either the sample or sample group, and error bars can be plotted, optionally.

| Plot detrital age distributions in comparison to another variable (e.g. Th/U)
Geochemical attributes of detrital minerals can provide additional insight into sedimentary provenance and/or the petrogenesis of source rocks (Barth, Wooden, Jacobson, & Economos, 2013;Colgan & Stanley, 2015;Malkowski & Hampton, 2014). For detrital zircon, concentrations of U and Th are routinely measured in LA-ICP-MS. Increasingly, the abundances of other trace and rare-earth elements and Hf isotopes are also analysed (Gehrels, 2014). Detrital U-Pb ages can be plotted in comparison to any other numeric variable in the "ZrUPb" worksheet (e.g. the concentration of U or ratio of Th to U; Figure 2b) to assess changes in this variable with respect to the age of the detrital mineral  Figure 8). Error bars can be plotted optionally; error can either be specified as a percentage or as the column heading that contains the variable error data. The option is also provided to plot a moving average of a specified window size (Figure 8).

| Plot detrital age populations as a bar graph
Bar graphs can provide a useful visualization of how detrital ages vary between samples or sample groups (Sharman  Figure 7), with the caveat that such plots involve subjective selection of bin boundaries that may over-simplify complex age distributions. detritalPy allows age proportions to be plotted as one or more bar graphs, using user-specified age categories and colours (Figure 9). If plotting sample groups, setting the variable 'sepa-rateGroups' to True results in plot(s) with the age proportions for individual samples in each group (Figure 9b). Otherwise, the proportions will reflect all analyses within each group (Figure 9c).

| Plot sample locations on an interactive map
Samples with latitude and longitude coordinates can be plotted on an interactive map, provided that the folium library has been installed (https://github.com/python-visua lization/folium; Figure 10). A number of basemap options are available and can be specified through the variable 'mapType'. Age distributions pop-ups (e.g. PDP or KDE) can be enabled and viewed interactively by clicking on each sample. Sample locations, including the unique sample identifier and an optional descriptor (e.g. the 'Unit' category; Figure 2a), can also be exported as a Google Earth kml file.

| Plot and export maximum depositional age (MDA) calculations
The youngest DZ U-Pb ages provide an estimate of the maximum depositional age (MDA) of a detrital sample (Dickinson & Gehrels, 2009;Fedo et al., 2003). An automated approach is provided to calculating the MDA of one or more sample(s) or sample group(s) using three ad hoc metrics used by Dickinson and Gehrels (2009). (a) The youngest single grain, YSG, assigns the MDA as the youngest detrital analysis (Figure 11). The YSG is defined by sorting all analyses by their U-Pb age plus 1σ F I G U R E 8 Illustration of plotting detrital U-Pb age distributions versus another analysis variable (Th to U ratio) for the Point of Rocks Sandstone and Butano Sandstone sample groups (see Figure 7).  uncertainty, and selecting the first analysis. Thus, it is possible to have a younger, but less precise, age than the YSG as defined herein. (b) The youngest cluster of 2 or more ages with overlapping 1σ uncertainties, YC1σ(2+), has the advantage of not relying on a single analysis that could be affected by lead loss or other analytical problems (Dickinson & Gehrels, 2009). The YC1σ(2+) is defined by sorting all analyses by their U-Pb age plus 1σ uncertainty, and identifying the youngest cluster of analyses with overlapping 1σ error ( Figure 11). (c) The youngest cluster of three or more ages with overlapping 2σ uncertainties, YC2σ(3+), provides a more conservative, but typically older, estimate of the MDA than the other two metrics (Dickinson & Gehrels, 2009). The YC2σ(3+) is defined by sorting all analyses by their U-Pb age plus 2σ uncertainty, and identifying the youngest cluster of 3 or more analyses with overlapping 2σ error ( Figure 11). A spreadsheet with all MDA calculation results is exported automatically. The MDA calculations can be optionally plotted for each sample or sample group selected ( Figure 11). Only the grain ages that are used in at least one of the three MDA calculations will be included. Analyses can be arranged by their mean age, mean age plus 1-sigma analytical uncertainty, or mean age plus two-sigma analytical uncertainty. Additional plot parameters can be used to change the appearance of the plot, including its dimensions, the width of the bars, and colours to use for the different MDA calculations.

| Multi-dimensional scaling
Multi-dimensional scaling (MDS) has become a popular approach for visual assessment of the degree of similarity and dissimilarity between detrital geochronologic samples Vermeesch, 2013). MDS plots for samples or sample groups can be created in detritalPy using the sklearn library with the option for either metric or non-metric MDS (Figure 12). Following Vermeesch (2013) and Saylor et al. (2017), MDS calculations can be based on the maximum separation between cumulative distribution functions (CDF) (Kolgomorov-Smirnov D max ) or the sum of the maximum differences between two CDFs (i.e. CDF 1 -CDF 2 and CDF 2 -CDF 1 ; Kupier V max ). The option is provided to plot data points as pie diagrams, using user-specified age categories and colours. Visualizing MDS plots using pie

dating" plot
Grains with both U-Pb crystallization and (U-Th)/He cooling ages can be plotted against each other in a 'double dating' plot (Thomson et al., 2017). The main portion of the plot contains a scatter plot, with grey shading within the region where the cooling age is older than the crystallization age (Figure 13). Separate subplots on the x-and y-axis show the relative probability distribution (PDP and/or KDE) of the U-Pb and (U-Th)/He age distributions respectively ( Figure 13). Histograms can be plotted, optionally.

| Export sample comparison matrices as a CSV file
A number of metrics have been proposed to evaluate the similarity or dissimilarity between detrital age distributions (Saylor & Sundell, 2016; and references within), although the use of some metrics (e.g. likeness, cross-correlation) has recently been discouraged (Vermeesch, 2018). Sample comparison matrices for samples or sample groups can be exported to a CSV file, including similarity, likeness, the Kolgomorov-Smirnov statistic D max and p-value, the Kuiper statistic V max , and the cross-correlation (r 2 ) coefficient of either the PDP or KDE. Similarity, likeness, and r 2 coefficient values are based on selection of either the PDP or KDE distribution, and when based upon the KDE, will depend on choice of a bandwidth.
4.10 | Export detrital age distributions as a

CSV file
Raw detrital age distributions (either cumulative or relative) can be exported as a CSV file. The age range and discretization interval can be specified. If the variable 'normalize' equals True, the distribution(s) will be forced to sum to 1. Exported age distributions can be used as inputs for more advanced analytical procedures (e.g. sediment unmixing; Sharman & Johnstone, 2017).

| Export ages and errors in tabular format as a CSV file
To promote compatibility with other software, detritalPy includes the option to export a CSV file containing U-Pb ages and 1σ uncertainties for any number of samples or sample groups. U-Pb ages are automatically sorted from youngest to oldest, and listed in adjacent columns, the format required by many existing tools (e.g. Arizona Laser-Chron Center Excel worksheets).

| KDE bandwidth selection
Selection of an appropriate KDE bandwidth is important for avoiding over-smoothing or under-smoothing detrital age distributions (Saylor & Sundell, 2016;Vermeesch, 2012). automatically selected by an algorithm that attempts to select an optimized value (Shimazaki & Shinomoto, 2010) using either a fixed or variable bandwidth (Figure 14), as implemented through the adaptivekde library (https://pypi. python.org/pypi/adaptivekde).
To consider the influence of KDE bandwidth selection on the appearance of detrital age distributions, two endmember cases are considered: a single sample from the Morrison Formation of central Colorado (149 analyses; Sharman, Stockli, Flaig, Raynolds, & Covault, in press) and a large compilation of 1,308 samples (120,968 analyses) that are mostly from North American continent (Figure 14). Application of the Shimazaki and Shinomoto (2010) algorithm to the single sample yields an optimized, fixed bandwidth of 17.8 Myr. This bandwidth selection appears to yield an acceptable match with the histogram over much of the age range of the detrital analyses, but over-smooths the youngest, Jurassic age peak (Figure 14a). Use of a narrower bandwidth (e.g. 5 Myr) results in better reproduction of the Jurassic age peak, and greater similarity to the PDP, but appears to under-smooth the older Palaeozoic and Proterozoic age peaks (Figure 14a). Although the optimized, variable bandwidth (Shimazaki & Shinomoto, 2010) is able to better reproduce the precision of the young, Jurassic peak, it appears to over-smooth the older age components. For the large data compilation, the differences in appearance between the PDP and different KDE bandwidths are less than with the single sample (Figure 14b). The optimized, fixed bandwidth (2.1 Myr) yields a result that closely resembles the PDP, 5 Myr bandwidth, and optimized, variable bandwidth plots (Figure 14b). However, the 10 Myr and 20 Myr bandwidths appear to over-smooth the Mesozoic-Cenozoic age populations (Figure 14b).
It is suggested here that the selection of a KDE bandwidth, whether user-defined or based on an optimized routine (Botev, Grotowski, & Kroese, 2010;Shimazaki & Shinomoto, 2010), will ultimately depend upon the nature of the age distributions themselves and the intended purpose of the display, with the decision having the greatest impact on plotted distributions with low numbers of analyses. Plots comprised of a relatively small number of detrital analyses may typically require larger bandwidths to avoid under-smoothing regions of sparse data while also tending to over-smooth young, precise age peaks (Figure 14a). This issue may be partially alleviated by plotting young and old detrital analyses separately, using different bandwidths for each (Sharman et al., in press). Plots comprised of large numbers of analyses, however, may benefit from selection of a smaller bandwidth (Figure 14b). Note that the choice of a KDE bandwidth has limited influence on the appearance of the cumulative (summed) KDE relative to the significant impact on the appearance of the KDE (Figure 14).

F I G U R E 1 3 (U-Th)/He versus U-Pb
'double dating' plot. PDPs are shown in subplots to the bottom and right of the scatterplot. Data from Thomson et al. (2017)

| Strengths of detritalPy
Although intended as a complement to rather than replacement of existing tools, detritalPy has a number of strengths that allow for efficient visualization and analysis of large detrital geo-thermochronologic datasets:

The code is executed in the open-source (Python Soft-
ware Foundation License) Python 3 language and is implemented using a user-friendly Jupyter Notebook interface (Kluyver et al., 2016;Perez & Granger, 2007). Thus, detritalPy does not require the use of proprietary software (e.g. MathWorks Matlab). Although no significant coding knowledge is required, users have the option of modifying the code to create user-customized functions and plots, which can be difficult with some existing tools that utilize a graphical user interface. 2. All detritalPy functions are compatible with an unlimited number of samples, or sample groups, and samples can be selected without manipulation of data in spreadsheets. Samples can be selected via simple reference to the unique sample identifier (e.g. 'Sample B'), and changes can be made on-the-fly. Thus, detritalPy eliminates the need to manually combine or organize data prior to plotting or analysis, beyond initial data organization (Figure 2), helping to eliminate duplication of data in separate spreadsheets. 3. Plotting and analysis functions are designed to allow maximum user flexibility in controlling the appearance and types of plots, following the most commonly used visualization and analytical approaches (Saylor & Sundell, 2016;Vermeesch, 2012Vermeesch, , 2013. Jupyter Notebooks are ideal for exploratory data analysis. Once the desired graphs have been created, plots can be exported in a vector-friendly format, requiring little modification for publication-quality figures. 4. Data can be exported in the common format used by the majority of existing analytical and visualization tools, for use in other published software.

| 'Big Data' in detrital geochronology
The proliferation of detrital geo-thermochronologic data within the last 20 years (Figure 1) provides an opportunity for  (Vermeesch & Garzanti, 2015). Yet development of an efficient means of visualizing and analysing large datasets remains a critical need in the detrital geo-thermochronologic community (Gehrels, 2014). For example, the ability to easily create sample groups within detritalPy will facilitate construction of reference curves (Gehrels, Dickinson, Ross, Stewart, & Howell, 1995; that can be compared with the magmatic and/or metamorphic history of known basement terranes (Dickinson & Gehrels, 2009) or with other detrital samples (Sharman, Covault, Stockli, Wroblewski, & Bush, 2017). The ability to quickly visualize and analyse related detrital geochronological and geochemical data also have relevance for characterization of source terranes, such as the magmatic flux of volcanic arc terranes (Ducea, 2001;Malkowski, Schwartz, Sharman, Sickmann, & Graham, 2017;Sharman et al., 2015). detritalPy also provides improved functionality for querying and exploring geo-thermochronologic data, particularly when combined with existing data repositories (e.g. the Geochron database; www.geochron.org). The ability to plot sample locations on an interactive, zoomable map and export them to Google Earth allows rapid identification of geographically related samples ( Figure 10). Relationships between samples or sample groups can be quickly assessed both visually and quantitatively (e.g. Figures 6 and 12). For instance, plotting samples or sample groups in stratigraphic succession (Gehrels et al., 2011) has potential for assessing the degree of multi-generational sediment recycling over time (Thomas, 2011). Because detritalPy is open-source, its functions can be modified or added to, as needed. It is anticipated that future development of new visualization and analytical tools within the detritalPy framework will address the evolving needs of the detrital geo-thermochronologic community.

| CONCLUSIONS
detritalPy, a Python-based approach to visualizing and analysing large detrital geo-thermochron datasets, addresses a critical need for an efficient means of processing the rapidly expanding quantity of detrital mineral isotopic and geochemical data. detritalPy is implemented through a user-friendly Jupyter Notebook interface and requires no significant coding expertise. However, the existing code can be modified to allow for user-customized plots and analysis.
An unlimited number of samples can be either plotted individually or within groups. Functionality includes (a) plotting detrital U-Pb age distributions using the most commonly employed visualization types, (b) plotting rim age versus core age, (c) comparing detrital age distributions to another variable (e.g. Th/U), (d) plotting age group proportions as a bar graph, (e) plotting sample locations on an interactive, zoomable map and exporting a Google Earth kml file, (f) calculating and visualizing maximum depositional ages, (g) multi-dimensional scaling, (h) calculation of similarity and dissimilarity metrics (e.g. similarity, likeness, Kolgomorov-Smirnov statistic) and (i) exporting U-Pb age and error data and age distributions as CSV files.
detritalPy has a number of advantages over existing tools, including not requiring proprietary software, offering flexibility in how data are plotted and analysed, and eliminating the need to manipulate data within spreadsheets to select which samples or groups of samples to plot. Furthermore, data can be easily exported from detritalPy in the common format required by most existing data visualization and analytical tools. It is expected that detritalPy will provide an important toolset for analysing detrital geochronologic and thermochronologic data within a 'Big Data' framework.