fundiversity: a modular R package to compute functional diversity indices

Abstract

To compute these indices in reproducible ways ecologists rely on computational tools able to crunch the numbers for thousands of species and thousands of sites.In the last few years, R has been the programming language of choice for ecologists (Lai et al., 2019;R Core Team, 2021).The main tool available to compute functional diversity indices has been the FD package which has now accumulated more than 1200 citations (Laliberté et al., 2014).But FD has been released in 2009 and has since then only received minor updates that stopped in 2015.At the same time, best practices in software development have changed dramatically and new, higher-performance tools have emerged in the R ecosystem.Additionally, since 2009, the size of ecological datasets has grown exponentially (Farley et al., 2018;Wüest et al., 2020) and high performance computing (HPC) environments have become standard.There is therefore a dire need for a modern package to compute functional indices using state-of-the-art software development techniques and tools.
The main function of the FD package dbFD() lets users compute a dozen functional diversity indices in a single call from raw trait data (Laliberté et al., 2014).While great for exploratory analyses this can increase computation time when only a single index is needed.Furthermore, it does not enforce good practice in choosing beforehand the appropriate functional diversity index for the question(s) asked (Legras et al., 2018;Mason et al., 2013;Schleuter et al., 2010).It encourages the user to fish the functional diversity index that matches their predicted relationships (a form of p-hacking).This can lead the users to report all computed functional diversity indices even when there are no clear expectations on different functional diversity facets and/or to report correlated indices (Legras et al., 2018;Mason et al., 2013;McPherson et al., 2018;Schleuter et al., 2010).Computing all indices in a single function also makes long-term maintenance and addition of new indices harder.Finally, it adds an extra performance burden in the case where not all indices are needed.
The average size of datasets analyzed in ecology increased several folds in the last years (Wüest et al., 2020).Considering that most analyses on functional diversity rely on null models that increase the data size by two or three orders of magnitude (Gotelli & Graves, 1996), the need for efficient computation is paramount.First, any improvement of the algorithmic efficiency to compute functional diversity indices could save sensible amounts of time as it is repeated many times.For example, we noted that many R packages that compute functional diversity indices do not leverage matrix algebra with its libraries available that can cut the number of operations by orders of magnitude compared to using a loop directly in R. Second, functional diversity indices are generally computed over many mathematically independent sites.With the rise of multi-core computers, parallelization, i.e. splitting independent computations between different Computing Processor Units (CPUs), is becoming the norm.Very few functional diversity R packages propose parallelization which leaves the burden of implementing it to the user.There have been formidable new developments in this area in R over the last few years with the release future framework (Bengtsson, 2020) that allows the user to seamlessly parallelize computations on multiple cores on a single machine or across several machines, or even on a remote cluster without changing execution code.Third, computations on the same input can be cached through a process called memoization (Wickham et al., 2021).This avoids wasting computing power on previously seen inputs.Several functional diversity indices rely on the computation of convex hulls across a multi-dimensional space (Cornwell et al., 2006;Villéger et al., 2008).Caching the results of this costly computation could save time and computing power when measuring the diversity across similar sets, such as sites across a given region.
Increasing discussions are held regarding scientific software robustness and reliability in ecology (Mislan et al., 2016;Poisot, 2015;White, 2015;Wilson et al., 2017).Mainly because most ecologists are self-trained in programming (Farrell & Carey, 2018), these virtuous practices are rarely applied in ecology (Barraquand et al., 2014).For example, unit tests use predefined inputs to compare the software's outputs to expectations (Poisot, 2015).Unit tests have also become standard in R packages since the release of packages streamlining this process, such as testthat and tinytest.In part because of the relative recentness of the testing frameworks, very few R functional diversity packages provide unit tests to assess that the functions behave expectedly.Automatic tests of one's code are crucial when developing a tool for a wider audience as it may be used across different contexts.
We here propose a modern alternative to FD called fundiversity that benefits from modern development practices, necessary features for large-sized datasets (modularity, parallelization, and memoization), and greater flexibility.The package can be easily extended to accommodate additional diversity indices not covered by following a clear design pattern detailed in the next section.We then go through a use case to show how it can be used.We then compare the performance of fundiversity against similar packages.

Main features of fundiversity
To ensure the consistency of its functions and to make it user-friendly, fundiversity follows clear design principles.In this section, we expose its distinctive features and principles.
To give maximum flexibility to the users, we tried to build fundiversity as modular as possible.Each function in fundiversity computes a single functional diversity index, as such if the user is interested in computing a single index, they only need to use a single function.All functions in fundiversity are prefixed with fd_ to avoid conflict with similarly named functions in other packages, as it's becoming standard practice in newer R packages (rOpenSci et al., 2021).In line with its modularity, we focused on making the inputs and outputs of functions coherent.The functions compute functional diversity indices using two main information: a species by traits matrix and a site by species matrix, all functions accept these two objects as first arguments.Because the function outputs one diversity value per site the outputs are always structured similarly: one site column that contains the name of its sites and one column named as the computed index (such as FRic when computing functional richness).The shape of the output is predictable and easy to be combined with other data.
Parallelization can be an easy way to vastly decrease computation speed why leveraging the architecture of modern computers.Almost all functions in fundiversity can be parallelized out of the box.
fundiversity provides parallelization through the future backend (Bengtsson, 2020).Parallelization is toggled through a single function call using future::plan() before using fundiversity functions.Thanks to the flexibility given by the future backend, the code to use won't change whether parallelizing across several cores on a single computer, across multiple computers, or on a remote high-performance cluster.The user has only to update the call to future::plan() to distribute computations on another infrastructure.Furthermore, the future backend provides load balancing so that no cores/units stay idle for too long and the parallelized tasks are split evenly.The packages contains a dedicated "Parallelization" vignette to guide the users through transforming unparallelized to parallelized code (accessible through vignette("parallel", package = "fundiversity")).
Because functional diversity indices can be computed repeatedly on the same data subset, such as in null models, we can leverage these repeated computations to reuse already computed indices.For example to compute functional richness (FRic) the convex hull of the input data has first to be identified, then the program needs to compute the volume of this convex hull.The first step, identifying the convex hull, takes the most time and as such, storing the results of each computed convex hull across a subset of data can vastly cut computation time for a little memory footprint.Memoization consists in doing exactly that, it trades a little of computer memory (keeping the convex hulls stored) for more computation speed.fundiversity leverages memoization for all complex computations such as convex hulls.By default, memoization is turned on for FRic, the intersection of convex hulls, and FDiv.However, it can sometimes create a memory bottleneck which slows down the overall computation.The default behavior can always be overridden through a change in the option fundiversity.memoise.
Packages depend on one another to avoid reinventing the wheel and thus reuse already developed functions.A higher number of dependencies means that a package requires more packages to be installed before its installation.While a high number of dependencies minimizes code replication, it also comes with a high price, because if a single dependency breaks then the whole package cannot be installed anymore (Cox, 2019).Inflated dependencies have been identified as a major risk in software and especially scientific software development (Claes et al., 2014;Cox, 2019).FD only has four dependencies but other functional diversity packages have many more dependencies.This renders them quite brittle for the users after years of not being actively developed.fundiversity has been designed to only have minimal external dependencies, it currently depends on only four external packages: future.applywhich depends only on two other packages, Matrix which is shipped with R, geometry and vegan on which FD also depends.
Because user flexibility is key, fundiversity has minimal assumptions on the input data structure.
All its functions work with data frames, matrices, or sparse matrices alike.Sparse matrices are a different formalization of matrices that do not store explicitly the cells that contain zero.They offer a reduced memory footprint and optimized algebra library for computation (Bates & Maechler, 2021).These matrices are thus specifically relevant for occurrence/abundance matrices that contain many zeros.If the used data have a high proportion of zeros, using sparse matrices can vastly decrease computational time in fundiversity.
As we underlined in the introduction, automatic software testing, while not 100% foolproof, is needed to increase the confidence in the behavior of functions.It is widespread in computer science but less in scientific software development.This means that software behavior is never assessed against known inputs to make sure it behaves in expected ways.It does not mean that the software is of poor quality, but rather that some simple errors could introduce unnoticed changes in the behavior of functions.Most packages that compute functional diversity indices do not include any form of automatic testing.We do want to point out that most ecologists never received formal training in software development hence the lack of tests (Farrell & Carey, 2018).We designed fundiversity with many unit tests from the beginning, executing at least every single line of code once (i.e.achieving coverage of 100%).

Case Study
Now that we described the main features of fundiversity, we are going to show how to use it in practice when computing functional diversity indices.As an example dataset, we included in fundiversity sitespecies and trait data from Nowak et al. (2019).It is accessible through the use of the data(..., package = "fundiversity") function.This dataset describes the presence of bird species in South America at different elevations and four morphological traits.
[Figure 1 about here.] The trait values show species in rows (species are specified as row names) and traits in columns with trait names as column names.Similarly, the site-species matrix contains sites as rows (site names are row names) and species as columns (species names are column names).
data("traits_birds", package = "fundiversity") data("site_sp_birds", package = "fundiversity") Now we obtained trait and occurrence data we need to compute the trait dissimilarity between each pair of species.As all traits are quantitative we first Z-score them, then we compute the Euclidean distance between pairs of species.z_traits = scale(traits_birds, center = TRUE, scale = TRUE) trait_distance = as.matrix(dist(z_traits)) We can then compute the functional richness of each index at each location.To do so we are using the If all traits are not quantitative it is possible to transform them back into independent traits through the use of Gower's distance [Gower (1971); and its extensions: Podani (1999), Pavoine et al. (2009);] then applying multivariate analysis to obtain orthogonal dimensions (Maire et al., 2015).

Performance Comparison
To test the actual performance improvements realized by fundiversity, we compared computation time on standardized datasets across similar functions in other packages.We only compared "original" packages that provide actual functions and not wrappers that depend on other packages to compute functional diversity indices.We identified 6 packages that computed similar indices to fundiversity.Most indices are computed by the FD::dbFD() function but the comparison would be unfair as the function computes many indices in a single call while functions in fundiversity only compute single indices.We considered functions from: adiv (Pavoine, 2020), BAT (Cardoso et al., 2015), betapart (Baselga & Orme, 2012), hillR (Li, 2018), mFD (Magneville et al., 2022), and FD (Laliberté et al., 2014) (see Table 2 for the correspondence between packages).A continuously updated version of this section can be found in the performance comparison vignette within the fundiversity package with vignette("performance", package = "fundiversity").
Table 2: List of functions available in fundiversity to compute functional diversity indices and corresponding functions in other packages.The name of the package is indicated before the :: while the name of the functions (including specified arguments) follows.
The benchmark ran 30 times through the bench package (Hester & Vaughan, 2021).A summary of the results of the benchmark can be seen in Fig. 2. Full results detailing the timings for each combination of parameters and functions are available in the Supplementary Material (Fig. S4).
[Figure 2 about here.] We see that for all the indices and functions tested, fundiversity is at least an order of magnitude faster than alternative packages.For functional dispersion, fundiversity is two orders of magnitude faster compared to BAT and mFD.For functional divergence, fundiversity is one order of magnitude faster than mFD.For functional evenness, fundiversity is two orders of magnitude faster than mFD with sequential and parallelized versions having similar performances.For Rao's quadratic entropy, fundiversity is one order of magnitude faster than hillR and mFD, two orders faster than adiv, and three orders of magnitude faster than BAT.For functional richness, fundiversity is half an order of magnitude faster than the hull version of BAT, as well as one and a half order of magnitude faster than its tree version and mFD.For functional richness intersection (beta functional diversity), fundiversity is two orders of magnitude faster than betapart and hillR.
[Figure 3 about here.] As shown on Fig. 2, on average the parallelized versions of fundiversity functions executed an order of magnitude faster than the sequential versions.For functional richness we even observed a difference of two orders of magnitude.However, for functional dispersion, parallelization increased overall computation time.This may be due to inherent parallelization issues: there is an overhead cost when splitting tasks across multiple cores of a computer.The efficiency of parallelization depends on the difficulty of the tasks that are split between cores.In the case of functional richness, the main task is computing the convex hull, which is computationally costly, that is why parallelization increase performance in this case.While computing functional dispersion is simpler, and as such, does not benefit from being split across different cores.Different values for the number of cores, species, traits or sites produce qualitatively the same results (full results in Fig. S5).
One important note regarding parallelization in fundiversity, is that it is important to avoid doing both memoization and parallelization simultaneously.Memoization creates a cache to avoid recomputing results, and the cache may be corrupted if several cores access the same results at the same time.We noticed that toggling memoization while performing parallelization severely increase total computational time, compared to sequential performance.
Note that these benchmarks only assess the packages computation speed and in no way any package intrinsic quality or usefulness.We're comparing fundiversity, a package whose one of the main goals is performance, with other packages that may have other primary goals and offer other benefits.Most other packages offer more features than simply computing functional diversity indices.For example, several packages offer nice default visualization functions to plot the different diversity indices, while we explicitly considered that visualization functions were not part of fundiversity and let the users decide how they want to plot their indices.

Conclusion
We proposed a modern alternative R package to compute functional diversity indices.This package follows current best development practices and leverages modern features like parallelization and memoization to increase its performance.This is only made possible by recent developments which were for the most part not available at the time when alternative packages came out.fundiversity does not propose to replace the entire toolkit for the researcher interested in functional diversity (including the upstream selection of the traits and building of a functional space) but instead focuses on improving the most computationally costing step: computing functional diversity indices.We hope it will be a useful contribution to this toolkit.
This package aims to always be a work in progress and we welcome contributions from interested users and developers.
fundiversity only computes alpha functional diversity indices, because other recent packages exist to compute other types of functional diversity indices[Hill numbers, Li (2018); beta-diversity indices, Baselga_betapart_2012].We focused on indices available through the dbFD() function in the FD package and on indices that could benefit from faster implementation.fundiversity contains the following alpha functional diversity indices: functional richness (FRic), functional dispersion (FDis), functional divergence (FDiv), functional evenness (FEve), and Rao's quadratic entropy (Q).fundiversity also contains a betadiversity index as it can be useful to compare functional richness between sites.
fd_fric() function.It expects quantitative trait values as the first argument and a site-species matrix as the second argument.in fundiversity use a similar structure, the first input is trait data the second one is a site-species matrix.For Rao's quadratic entropy computed through fd_raoq() functional dissimilarities can be specified as the third argument: # With functional dissimilarity birds_raoq = fd_raoq(traits = NULL, site_sp_birds, dist_matrix = trait_distance) # With trait values birds_raoq_2 = fd_raoq(z_traits, site_sp_birds) # Both options give the same results identical(birds_raoq, birds_raoq_2) ## [1] TRUE

Figure 1 :
Figure 1: Conceptual diagram showing the input and typical ouput data from 'fundiversity' functions.Input data are generally a site-speciestable and a species-traits table, and the output gives back a table of functional diversity index per site.

Figure 2 :
Figure2: Timing comparison across functional diversity indices between packages.Each point represents the execution time of one run using a simulated dataset, the points are transparent and jittered to avoid overplotting.We here show the performance results considering only a single set of parameters with 4 traits, 500 species, and 100 sites, repeated 30 times.

Figure 3 :
Figure 3: Timing comparison between parallel and sequential version of fundiversity functions across functional diversity indices.Each point represents the execution time of one run using simulated datasets with fixed properties (4 traits, 100 sites, 500 species), the points are transparent and jittered to avoid overplotting.The parallel version ran across 6 cores.

Figure 4 :
Figure 4: Performance comparison across functions of different packages over a range of parameters (number of traits, species, and sites).Note that each combination of parameters ran 30 iterations.The lines show trends of execution time in function of number of sites of the input dataset.

Figure 5 :
Figure 5: Performance comparison across internal functions over a range of parameters (number of traits, species, and sites) and different parallelization parameters.Note that each combination of parameters ran 20 iterations.The lines show trends of execution time in function of number of sites of the input dataset.

Table 1 :
List of functions available in fundiversity to compute functional diversity indices.The two last columns specify which functions are parallelizable and memoizable.
table and a species-traits table, and the output gives back a table of functional diversity index per site.