dartR v2: An accessible genetic analysis platform for conservation, ecology and agriculture

Innumerable approaches to analyse genetic data are now available to guide conservation, ecological and agricultural projects. However, streamlined and accessible tools are needed to bring these approaches within the reach of a broader user base. dartR was released in 2018 to lessen the intrinsic complexity of analysing single nucleotide polymorphisms (SNPs) and dominant markers (presence/absence of amplified sequence tags) by providing user-friendly data quality control and marker selection functions. The dartR user base has grown steadily since its release, and users have provided valuable feedback on their interaction with the package, allowing us to enhance dartR's capabilities. Here, we present Version 2 of dartR. In this version, we substantially increased the number of available functions from 45 to 144. In addition to improved functionality, we focused on enhancing the user experience by extending plot customisation, standardising functions, expanding user support and increasing function speed. dartR provides functions for various stages in analysing genetic data, from data manipulation to reporting. It also provides many functions for importing, exporting and linking to other packages, offering an easy-to-navigate conduit between data generation and analysis options already available via other packages. We also implemented simulation functions whose results can be analysed seamlessly with several other dartR functions. As more methods and approaches mature to inform conservation, we envision that accessible platforms to analyse genetic data will play a crucial role in translating science into practice.


| INTRODUCTION
The plummeting costs of DNA sequencing have opened a powerful window of opportunity to use genetic data to inform biodiversity conservation, restoration of ecosystems, invasive species management and breeding of animals and plants (Breed et al., 2019). Remarkably, applied genetic studies have transitioned from typically analysing a dozen molecular markers to tens and even hundreds of thousands of markers in less than a decade. Similarly, the process of marker development that could take months of laboratory work a decade ago has been taken over by sequencing companies using novel approaches, such as genotyping by sequencing (Narum et al., 2013) or using restriction enzymes to reduce genome complexity (DArTseq; Kilian et al., 2012). These technological advances are reflected in the growing number and diversity of ways genetic data are analysed and applied (e.g. identification of adaptive variation is now within reach for non-model organisms; Weigand & Leese, 2018).
Even though genetic data are increasingly accessible and population genomics has proved to be a powerful tool to improve biodiversity conservation and ecological restoration efforts (Garner et al., 2016; Hohenlohe et al., 2021), genetic information is not yet regularly used outside of the research community (Shafer et al., 2015). Several barriers to bridging this gap between research and practice have been identified, including poor communication between researchers and other stakeholders, insufficient funding and lack of genetics expertise (Taylor et al., 2017). A further barrier is arguably the intrinsic complexity involved in analysing genetic data. For instance, to interpret analysis results appropriately, it is necessary to understand theoretical models and population genetics principles (Andrews & Luikart, 2014). Furthermore, advanced computer and programming skills and the use of several programs, which are often complex and time-consuming to master, are required to make full use of the genetic data (Hohenlohe et al., 2021). Therefore, today, it is no longer the time needed for DNA sequencing that limits the speed of results, but rather a deficit of knowledge and skills to analyse genetic data.
dartR, an R package for analysing single nucleotide polymorphisms (SNPs) and presence/absence of amplified sequence tags, was released in 2018 (Gruber et al., 2018) and designed to bridge the gap between science and practice. dartR aims to bring the timeframe to analyse genetic data into line with the timeframe required by stakeholders to make their decisions. The second aim of dartR is to provide a broad range of analyses and pipelines in a user-friendly platform that allows users with no programming expertise to do so. dartR leverages the capabilities of the open-source programming language R (R Core Team, 2021) and the robustness of the genlight object from the package adegenet for representing large genetic datasets (Jombart & Ahmed, 2011). In the 4 years since its release, dartR has grown a large user base, evidenced by several hundred daily downloads and an active Google group (https://groups.google.com/g/dartr). With the genomic revolution well underway, there is a constant and rapid diversification of new methods and analyses, which users seek to include in their work, ideally without switching between platforms.
Here we present a significant update of dartR. Our purpose is to bring diverse and sophisticated analytical tools within the reach of a broad user base of genomic data. dartR facilitates all stages in analysing genetic data, from data quality control to the preparation of publication-quality plots, through streamlined and accessible functions and strong user support, including tutorials, detailed function documentation and error checking.

| WHAT IS NEW IN DARTR 2.0?
In dartR 2.0, we have added 99 functions to the initial 45 functions from version 1 (Figure 1; Table S1). In response to user feedback, we provide users with a deeper understanding of the purpose of each function, its underlying theory and its limitations by expanding and improving our tutorials and function documentation. Additionally, we have implemented messages to communicate errors, warnings, reports and important information while running each function. All the functions have been extensively tested, debugged and standardised, and their speed has been increased in many cases. Following the adage 'a picture is worth a thousand words', we have improved all the graphical outputs by standardising their format, increasing readability and extending their scope for customisation.
We realised that many individual researchers had developed their own scripts and analyses, which would be very helpful for others if made available. Therefore, we encourage these 'independent developers' to collaborate with dartR, and we have provided a framework on how to write and document functions for dartR. To further encourage this collaboration, we hold regular developer meetings and offer personal support to integrate analyses of independent developers. Initially, dartR aimed to primarily analyse the genomic data format provided by the sequencing company Diversity Arrays Technology Pty Ltd (DArT; https://www.diversityarrays.com/). In version two, we extended dartR's capabilities to import from and export to several formats used to store SNP data, making dartR accessible to a broader pool of users.

| FUNCTION CATEGORIES AVAILABLE IN DARTR
To facilitate the usage and identification of the resources available in dartR, we categorised the functions based on the different stages in the analysis of genetic data. Typical steps are data input, data manipulation, filtering, reporting, exploration, visualisation and analysis. We also provide tutorials to guide the user for the most relevant stages, which can be accessed at http://georges.biomatix.org/dartR.
In this section, we enumerate dartR function categories while highlighting representative functions from each category.
As our basic format to input and store genetic data, we adopted the genlight object from the package adegenet (Jombart & Ahmed, 2011).
One of the main attributes of the genlight object is its efficient data compression using a bit-level coding scheme. We extended the genlight object by adding two additional compartments containing metadata for individuals (ind.metrics) and loci (loc.metrics). dartR can read common formats, including FASTA, VCF, PLINK, DArTseq™, genepop and CSV files. To ensure the compatibility of the imported data, we developed the function gl.compliance.check() to inspect the elements within the genlight object and, if necessary, correct incompatibilities.
dartR offers functions to facilitate data manipulation for loci, individuals and populations, such as renaming individuals, assigning and reassigning them to populations, removing individuals, populations and loci, merging populations and subsampling individuals and loci. After data manipulation, some locus metrics will no longer apply; the function gl.recalc.metrics() will recalculate the various locus metrics as necessary.
The filtering process is a decisive step in analysing genetic data that depends on sensible threshold decisions (O'Leary et al., 2018).
With this in mind, we provide a complementary reporting function for each of our 16 filtering functions. Reporting functions present the data in the form of summary statistics, tabulation of quantiles, boxplots and histograms. In a two-stage process, users can use the results of reporting functions to implement thresholds in filter functions that are appropriate for their application and data characteristics. For example, identifying and filtering loci that deviate from Hardy-Weinberg proportions is essential in many workflows.
Several technical and biological phenomena can cause this deviation and must be distinguished for correct interpretation of the data (Waples, 2015). Our functions gl.diagnostics.hwe(), gl.report.hwe() and gl.filter.hwe() allow the diagnosis, evaluation and filtering of loci deviating from Hardy-Weinberg proportions using either the Exact or the Chi-square method, adjustment for multiple comparisons and ternary plots (Figure 2).
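To make the Chi-square option concrete, the following is a minimal base R sketch of a Hardy-Weinberg test for a single biallelic locus, followed by a standard multiple-comparison adjustment. It is an illustration only and is not the code used inside gl.report.hwe() or gl.filter.hwe(); the function name hwe_chisq, its arguments and the genotype counts are hypothetical.

```r
# Chi-square test of Hardy-Weinberg proportions for one biallelic locus.
# n_AA, n_AB and n_BB are observed genotype counts (hypothetical names).
# Assumes the locus is polymorphic; a monomorphic locus gives expected
# counts of zero and an undefined statistic.
hwe_chisq <- function(n_AA, n_AB, n_BB) {
  n <- n_AA + n_AB + n_BB
  p <- (2 * n_AA + n_AB) / (2 * n)  # frequency of allele A
  q <- 1 - p                        # frequency of allele B
  observed <- c(AA = n_AA, AB = n_AB, BB = n_BB)
  expected <- n * c(AA = p^2, AB = 2 * p * q, BB = q^2)
  chi_sq <- sum((observed - expected)^2 / expected)
  # 1 degree of freedom: 3 genotype classes - 1 - 1 estimated allele frequency
  p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
  list(p = p, expected = expected, chi_sq = chi_sq, p_value = p_value)
}

# Toy locus with a deficit of heterozygotes
res <- hwe_chisq(n_AA = 30, n_AB = 40, n_BB = 30)

# When many loci are tested, p-values are usually adjusted; here the
# second and third p-values are made-up placeholders for other loci.
adjusted <- p.adjust(c(res$p_value, 0.50, 0.01), method = "bonferroni")
```

In practice the threshold applied to the adjusted p-values would be chosen after inspecting the corresponding report, in line with the two-stage process described above.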
FIGURE 1 Overview of the functions currently available in dartR covering various stages in the analysis of genetic data. We use the prefix 'gl' in function names to acknowledge the use of the genlight object from package adegenet (Jombart & Ahmed, 2011) as our input format.
The exploration and visualisation stage is critical to identify and interpret genetic patterns, generate hypotheses and set the path for downstream analyses. Functions for this stage in dartR include gl.pcoa() and gl.pcoa.plot(), which perform and plot principal component analysis (PCA) and principal coordinates analysis (PCoA).
PCA and PCoA are particularly suitable for genetic data. Despite not relying on genetic principles or models, results can reveal spatial patterns, evolutionary or ecological processes such as migration, geographical and reproductive isolation, and admixture (McVean, 2009).
Other visualisation and exploration tools available include heatmaps, network plots, smear plots and mapping of sampling locations.
Once the dartR user has read, manipulated, filtered and explored their genetic data, many analyses can be performed to inform the decision-making, evaluation and monitoring processes of conservation, restoration and breeding projects. Genetic data can provide insights into biological processes on two different but tightly linked fronts: a) issues associated with genetic diversity and its relationship with fitness, such as inbreeding depression and evolutionary potential; and b) demographic issues, such as dispersal, population size and hybridisation. dartR offers various functions that address both of these suites of processes.
Genetic variation can be monitored or evaluated with the function gl.report.diversity(), which calculates the q-profile, a spectrum of measures whose contrasting properties provide a rich summary of diversity, including allelic richness, Shannon information and heterozygosity (Sherwin et al., 2017). These measures are then converted to a standard scale of effective numbers (Hill numbers), so they can be directly compared. Other functions allow different aspects and metrics of diversity to be characterised by partitioning variation geographically using analysis of molecular variance (AMOVA), statistical testing of heterozygosity difference between populations or standardising heterozygosity estimates using the number of invariant sites.
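The idea of a q-profile expressed as effective numbers can be sketched in a few lines of base R. The example below computes Hill numbers of order q = 0, 1 and 2 from a vector of allele frequencies, which correspond to richness, the exponential of Shannon information and the inverse of expected homozygosity, respectively. This is a minimal sketch under the standard definition of Hill numbers; it does not reproduce gl.report.diversity(), and the function name hill_number and the frequencies are hypothetical.

```r
# Hill number (effective number of alleles) of order q for a vector of
# allele frequencies p. The q = 1 case is defined as the limit, which is
# the exponential of Shannon information.
hill_number <- function(p, q) {
  p <- p[p > 0]      # alleles with zero frequency do not contribute
  p <- p / sum(p)    # make sure frequencies sum to 1
  if (isTRUE(all.equal(q, 1))) {
    exp(-sum(p * log(p)))
  } else {
    sum(p^q)^(1 / (1 - q))
  }
}

# Toy locus with three alleles at unequal frequencies
allele_freq <- c(0.5, 0.3, 0.2)
q_profile <- sapply(c(0, 1, 2), function(q) hill_number(allele_freq, q))
names(q_profile) <- c("q0", "q1", "q2")
```

Because all three values are on the same scale (an equivalent number of equally frequent alleles), they can be compared directly: the profile declines with q more steeply when allele frequencies are more uneven.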

Identifying natural aggregations of individuals and populations using genetic data has been an important tool to maximise and prioritise available resources in conservation and restoration projects, for example, to define evolutionarily significant units (ESUs; Funk et al., 2012), to delimit species (Georges et al., 2018; Unmack et al., 2022), to identify populations suitable for eradication (Robertson & Gemmell, 2004) and to demarcate seed transfer zones for ecological restoration (Durka et al., 2017). dartR functions suitable for these applications include gl.fixed.diff(), which generates a matrix of fixed allelic differences between populations. The function gl.collapse() can be used to iteratively combine populations and aggregations of populations based on the absence of fixed allelic differences to yield a set of diagnosable units. These functions accommodate the risk of false-positive fixed differences likely to occur when sample sizes are small. A further application of identifying populations is the assignment of individuals of unknown provenance to their source population, which is particularly important in wildlife forensics to support law enforcement (Bourret et al., 2020).
Functions such as gl.assign.pa() and gl.assign.pca() are capable of assigning individuals of unknown provenance to a population using private alleles (i.e. alleles that are exclusive to particular populations) and standardised proximity, respectively.
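The notion of a fixed allelic difference can be illustrated with a small base R sketch. It assumes genotypes are coded as the count of the alternate allele (0, 1 or 2, with NA for missing data) and counts, for each pair of populations, the loci at which one population is fixed for one allele and the other population is fixed for the alternative. This is a simplified illustration and not the implementation of gl.fixed.diff(), which also deals with sampling error; the function name count_fixed_diff and the toy data are hypothetical.

```r
# Count fixed allelic differences between every pair of populations.
# geno: individuals x loci matrix of alternate-allele counts (0, 1, 2, NA)
# pop:  vector giving the population of each individual (row of geno)
count_fixed_diff <- function(geno, pop) {
  pops <- sort(unique(pop))
  # alternate-allele frequency per population (rows) and locus (columns)
  freq <- do.call(rbind, lapply(pops, function(pp) {
    colMeans(geno[pop == pp, , drop = FALSE], na.rm = TRUE) / 2
  }))
  out <- matrix(0L, length(pops), length(pops), dimnames = list(pops, pops))
  for (i in seq_along(pops)) {
    for (j in seq_along(pops)) {
      # fixed difference: frequency 0 in one population and 1 in the other
      fixed <- (freq[i, ] == 0 & freq[j, ] == 1) |
               (freq[i, ] == 1 & freq[j, ] == 0)
      out[i, j] <- sum(fixed, na.rm = TRUE)
    }
  }
  out
}

# Toy data: 6 individuals, 4 loci, 3 populations
geno <- rbind(
  c(0, 0, 1, 2), c(0, 0, 0, 2),   # population A
  c(2, 0, 1, 0), c(2, 0, 2, 0),   # population B
  c(2, 2, 0, 0), c(2, 2, NA, 0)   # population C
)
pop <- c("A", "A", "B", "B", "C", "C")
fd <- count_fixed_diff(geno, pop)
```

Under the logic described above for gl.collapse(), pairs of populations with a count of zero in such a matrix would be candidates for amalgamation; none of the toy pairs qualifies here.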
Dispersal and gene flow are fundamental evolutionary and ecological processes that enable individuals to recolonise new habitat and replenish a population's gene pool (Tigano & Friesen, 2016). These processes can be investigated by assessing the correlation between genetic distance among populations or individuals and the geographical distance separating them (Cayuela et al., 2018). The function gl.genleastcost() performs a least-cost path analysis based on a friction matrix to test the hypothesis that genetic distance correlates with landscape attributes, such as barriers or habitat corridors, rather than geographical distance. Other functions include the calculation of several genetic distances between individuals and populations, testing for isolation by distance (Van Strien et al., 2015) and dispersal simulations.
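One common way to assess the correlation between genetic and geographical distance is a Mantel-style permutation test, sketched below in base R. This is offered as an illustration of the general approach under that assumption, not as a description of how dartR tests for isolation by distance; the function name mantel_perm and the simulated coordinates and distances are hypothetical.

```r
# Mantel-style permutation test for the correlation between two
# symmetric distance matrices of the same dimension.
mantel_perm <- function(gen_dist, geo_dist, n_perm = 999) {
  lower <- lower.tri(gen_dist)            # use each pair once
  r_obs <- cor(gen_dist[lower], geo_dist[lower])
  n <- nrow(gen_dist)
  r_perm <- replicate(n_perm, {
    idx <- sample(n)                      # shuffle sample labels
    cor(gen_dist[idx, idx][lower], geo_dist[lower])
  })
  # one-sided p-value: proportion of permutations at least as extreme
  p_value <- (sum(r_perm >= r_obs) + 1) / (n_perm + 1)
  list(r = r_obs, p_value = p_value)
}

# Toy data: 8 sampling sites with genetic distance increasing linearly
# with geographical distance (a deliberately idealised scenario)
set.seed(1)
coords <- cbind(x = runif(8), y = runif(8))
geo <- as.matrix(dist(coords))
gen <- 0.05 + 0.2 * geo
diag(gen) <- 0
ibd <- mantel_perm(gen, geo)
```

Permuting rows and columns together, rather than shuffling individual entries, preserves the dependence among distances that share a sampling site, which is why a standard correlation test is not used directly.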
The evaluation and monitoring of inbreeding and relatedness can provide valuable information to maximise existing genetic variation and avoid inbreeding depression. This information has been used in captive breeding programmes to prevent the detrimental effects of small population size, founder effects and lack of gene flow (Wright et al., 2021). Various functions can guide the breeding of plants and animals; gl.grm() calculates and plots the mean probability of identity by descent across all loci that would result from all the possible crosses of the individuals that were sampled (Figure 4; Endelman & Jannink, 2012). This information can identify potential pairs of individuals whose crossing might prevent inbreeding.
We have developed functions to simplify the process of running external software that requires several steps (a.k.a. wrapping functions), linking to programmes such as Outflank (Whitlock & Lotterhos, 2015), BLAST (Altschul et al., 1990; Altschul et al., 1997), NewHybrids (Anderson & Thompson, 2002), Neestimator2 (Do et al., 2014), STRUCTURE (Pritchard et al., 2000), Clumpp (Jakobsson & Rosenberg, 2007), Distruct (Rosenberg, 2004) and Evanno's method (Evanno et al., 2005). For example, the latter four programmes can be run within dartR using the functions gl.run.structure(), gl.evanno(), gl.plot.structure() and gl.map.structure().
FIGURE 3 Principal component analyses (PCA) using a platypus dataset provided with the package. PCA shows that platypuses sampled below (Severn below) and above (Severn above) a large dam form separated clusters in contrast to platypuses sampled in an unregulated river (Tenterfield Creek). Ellipses encapsulate a 95 percentile area from the centroid of each population.
Computer simulations are powerful tools for understanding complex evolutionary and genetic processes and their relationships to ecological processes. They can be used, for example, to predict complex scenarios involving the interaction between evolutionary forces, to evaluate the plausibility of alternative hypotheses, or to validate and evaluate genetic methods (Hoban et al., 2012). In this second version of dartR, we developed a realistic simulation model that can be parameterised with real-life genetic characteristics such as the number, location, allele frequency and the distribution of fitness effects (selection coefficients and dominance) of loci under selection. In the simulation model, recombination is accurately modelled, and it is possible to use real recombination maps as input.
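To illustrate how a selection coefficient and a dominance coefficient enter a forward simulation, the following base R sketch tracks the frequency of one allele at a single locus under Wright-Fisher drift with selection. It is a deliberately simplified toy model and does not represent the dartR simulation model, which handles many loci and recombination; the function name simulate_wf and all parameter values are hypothetical.

```r
# Single-locus Wright-Fisher simulation with selection and dominance.
# p0:    starting frequency of allele A
# n_ind: number of diploid individuals (constant population size)
# n_gen: number of generations to simulate
# s:     selection coefficient against genotype aa (assumed s < 1)
# h:     dominance coefficient (fitness of Aa is 1 - h * s)
simulate_wf <- function(p0, n_ind, n_gen, s = 0, h = 0.5) {
  p <- numeric(n_gen + 1)
  p[1] <- p0
  for (g in seq_len(n_gen)) {
    q <- 1 - p[g]
    # genotype fitnesses: AA = 1, Aa = 1 - h * s, aa = 1 - s
    w_bar <- p[g]^2 + 2 * p[g] * q * (1 - h * s) + q^2 * (1 - s)
    # frequency of A after selection
    p_sel <- (p[g]^2 + p[g] * q * (1 - h * s)) / w_bar
    # genetic drift: binomial sampling of 2N gene copies
    p[g + 1] <- rbinom(1, 2 * n_ind, p_sel) / (2 * n_ind)
  }
  p
}

set.seed(42)
trajectory <- simulate_wf(p0 = 0.2, n_ind = 50, n_gen = 100, s = 0.1, h = 0.5)
```

Running the function repeatedly with different seeds gives a distribution of trajectories, which is the kind of output that can be compared against alternative hypotheses.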
We have also developed a set of internal functions that facilitate the user's interaction with dartR. For example, the function gl.install.vanilla.dartR() installs all required packages for using all the functions available in dartR; and the functions gl.print.history() and gl.play.history() print and replay, respectively, the history of all the analyses performed previously on a genlight object.

| CONCLUDING REMARKS
The remarkable recent advances in applied and theoretical genetics offer many novel opportunities to address and better manage rates of biodiversity and ecosystem loss. Notwithstanding this, the list of skills and level of expertise required to integrate novel genomic resources and perform increasingly complex analyses have increased simultaneously. Thus, researchers and stakeholders often struggle to keep up with the various ways to analyse and apply genetic data and to take maximum advantage of them to inform conservation and restoration. We envision that as the number of analyses and their complexity continue to increase, accessible, streamlined and reliable platforms to analyse genetic data, such as dartR, will play a crucial role in translating science into practice.
FIGURE 4 Heatmap of the probabilities of identity by descent (IBD) in which yellow and red colours indicate individuals more related to each other. The identification number of each individual is shown in the margins of the figure, where the last letter denotes whether the individual is male (M) or female (F). This information is being used to guide the captive breeding programme of the Arabian oryx at the Al-Wusta Wildlife Reserve in Oman (Al Rawahi et al., 2022).

FIGURE 2 Output from function gl.diagnostics.hwe(), which implements the recommendations from Waples (2015) and De Meeûs et al. (2007). (a) Histogram showing the distribution of p-values of Hardy-Weinberg equilibrium (HWE) tests. The distribution should be roughly uniform across equal-sized bins. (b) Bar plot showing observed and expected number of significant HWE tests for the same locus in multiple populations. If HWE tests are significant by chance alone, observed and expected number of HWE tests should have roughly a similar distribution. (c) Scatter plot with a linear regression between FST and FIS, averaged across subpopulations. In the lower right corner of the plot, the Pearson correlation coefficient is reported. A positive relationship is expected in case of the presence of null alleles (De Meeûs, 2018).
STRUCTURE and its companion programmes are run within dartR using the functions below, and the results are plotted in an interactive map as shown in Figure 5. Note that while we aimed to facilitate access to resources and analytical tools, users should remain aware of the assumptions and characteristics of such analyses so that they can be run and interpreted properly. We envisage that future versions of dartR will continue the development of functions that will facilitate testing of assumptions and screening of adequate execution (e.g. convergence).
# Run STRUCTURE for K = 2 to 5 with 10 replicates per K
> out_struc <- gl.run.structure(bandicoot.gl, k.range = 2:5, num.k.rep = 10, exec = "~/structure.exe", noadmix = FALSE)
# Apply Evanno's method to the STRUCTURE output
> out_evanno <- gl.evanno(out_struc)
# Plot the results for K = 3, using CLUMPP
> qmat <- gl.plot.structure(out_struc, k = 3, CLUMPP = "~/CLUMPP.exe")
# Plot the results in an interactive map
> gl.map.structure(qmat, bandicoot.gl)
Exporting genetic data to other formats is a common step and one of the most time-consuming and susceptible to errors in the analysis of genetic data. dartR offers 24 functions to export genlight objects to other formats, including FASTA, PLINK and VCF.