An interpolated biogeographical framework for tropical Africa using plant species distributions and the physical environment

Existing phytogeographical frameworks for tropical Africa lack either spatial completeness, unit definitions smaller than the regional scale or a quantitative approach. We investigate whether physical environmental variables can be used to interpolate floristically defined vegetation units, presenting an interpolated, hierarchical, quantitative phytogeographical framework for tropical Africa, which is compared to previously defined regions.


| INTRODUC TI ON
By defining areas of relatively homogeneous species composition, biogeographical frameworks provide spatial units of analysis that are useful in macroecological, evolutionary and systematic studies of the processes which shaped or maintain species distributions (Morrone, 2018). We can use theory generated by such studies to improve the biogeographical framework itself, predicting where species are likely to occur for areas which are incompletely collected, providing increasingly complete geographical and taxonomic coverage (Holt et al., 2013). By predicting which species occur in which place, biogeographical frameworks are useful tools in conservation and land management: the units are convenient for monitoring trends in species abundance, range or habitat condition (Dauby et al., 2017), can provide shortlists of native species for restoration (Lillesø et al., 2011), and facilitate geographical searching of online identification resources (Brunken et al., 2008).
Africa has both a rich history and contemporary research front in plant biogeography. The most influential framework for African vegetation is that of White (White, 1979(White, , 1983(White, , 1993. Building on the work of others (Aubreville, 1962;Lebrun, 1960Lebrun, , 1961Monod, 1957), White used species distributions of a subset of the flora to define a chorological framework at regional scale. Where others had used thresholds of endemism at successively finer taxonomic ranks to distinguish choria at different ranks (Engler, 1879;Good, 1947;Takhtajan, 1986), endemism thresholds were important to White only for distinguishing between types of region, and did not preclude the recognition of species poor or endemic poor choria at the rank of region. White overlaid on his regional chorology, a detailed map of vegetation units (1983), synthesizing many previous local vegetation maps. These mapping units are labelled with respect to their structure as well as the regional chorological situation (e.g. Congolian swamp forest). WWF's African ecoregions were derived from White's physiognomic vegetation map, though the ecoregion borders were simplified and differ where they were redrawn to respect animal distributions (Olson et al., 2001). More recently, the local vegetation maps that White synthesized have been resurrected and digitized for east Africa (Lillesø et al., 2011).
White's regional framework was produced from limited species distribution datasets and without the aid of multivariate statistics, prompting six quantitative assessments of tropical Africa's phytogeographical regions (Denys, 1980;Droissart et al., 2018;Fayolle et al., 2014Fayolle et al., , 2019Linder et al., 2005Linder et al., , 2012. All have shown broad similarities to White's regions, but with discrepancies from each other and White. Most have analysed gridded distribution data of species and infraspecific taxa at one degree square resolution; Fayolle et al. analysed local checklists and plot data for trees (2014) or woody taxa (2019). For the gridded analyses, there has been a trend towards more taxonomically complete datasets: 494 taxa (Denys, 1980), 5,438 (Linder et al., 2005), 5,881 (Linder et al., 2012), 24,719 (Droissart et al., 2018); the present dataset includes 31,046 species and infraspecific taxa. Most have used ordination and cluster analysis to produce regionalizations, as in the present analysis, while Droissart et al. used bipartite network analysis (Edler, Guedes, Zizka, Rosvall, & Antonelli, 2017). These quantitative floristic frameworks have left much of the area of tropical Africa unassigned to a spatial unit, due to the insufficiency of plant species distribution data from many parts of the continent, and because there is currently no comprehensive set of species distribution models for African plants. A lack of geographical completeness makes it difficult to use the quantitative phytoregionalizations as spatial frameworks in further analyses, as they apply only to the idiosyncratic portion of the continent that was used to derive them. For tropical Africa, we also lack quantitatively defined floristic units similar in size to ecoregions.
Using lower cluster solutions to define districts or provinces below regional level is precedented for African birds (De Klerk, Crowe, Fjeldså, & Burgess, 2002), and for plants a hierarchical system has been described as a useful and natural way to depict floristic relationships (McLaughlin, 1992).
Taxonomic information on biological composition must be derived from field survey, and is more limited in spatial and temporal resolution, and geographical coverage, than physical environmental variables like climate, which have been interpolated at high resolution across the globe (Kriticos et al., 2012). Land cover maps derived from remotely sensed data, like GlobCover 2009(Arino et al., 2012, have provided 90m resolution continuous characterizations of vegetation physiognomy, in terms of canopy height, openness, and deciduousness, for the world. For Africa, there is also A New Map of Standardised Terrestrial Ecosystems (Sayre et al., 2013), presenting a predicted vegetation classification at 90 m resolution across the continent using a classification and regression tree (CART) method.
The vegetation classification that was predicted was described as a draft. Training points were supplied by experts or derived from previous maps, and were reconciled into a hierarchical physiognomic framework following principles developed for the USA.
The prediction of vegetation physiognomy from physical environmental variables, especially climate, is a long established and current activity (Arino et al., 2012;Holdridge, 1947). Some African plant species distributions have been predicted with physical environmental variables via species distribution models McClean et al., 2005). Plant communities in Africa are typically ordinated and correlated with physical environmental variables, to reveal strong relationships with rainfall (Bongers, Poorter, & Hawthorne, 2004;Fayolle et al., 2014Fayolle et al., , 2019Hall & Swaine, 1976), lithology (Fayolle et al., 2012), soil (Réjou-Méchain et al., 2008, or a combination of soil and rainfall (Swaine, 1996), temperature and altitude (Fayolle et al., 2014(Fayolle et al., , 2019. Within climatic envelopes, fire, biotic interactions and feedback processes are important (Favier et al., 2012). The manner and extent to which floristically defined units can be predicted by physical environmental variables is of great interest (Tuomisto, Ruokolainen, & Yli-Halla, 2003). If the predictive relationship is strong, we could use the relationship to (a) 'complete' our maps, filling holes left by patchy species distribution data; (b) increase the resolution of the mapped units; (c) interpret the relative importance of particular physical environmental variables for determining floristic turnover and (d) make the relationship between physiognomy, environment and chorology more explicit and objective.
The aim of this study is to present a spatially complete quantitative phytogeographical framework for tropical mainland continental Africa (24°N to 24°S) that can predict which plant species occur at regional to more local scale across the study area, useful for future studies in ecology, evolution and conservation, and to facilitate (online) identification tools. We define a hierarchical phytogeographical framework via well-established multivariate methods from a comprehensive whole-flora assemblage of available plant species distribution data. We investigate the strength of the predictive relationship between physical environmental variables and these floristically defined spatial units, seeking to use this relationship to render the framework geographically complete within its scope. We ask:

| Species distribution data
We analyse 533,383 records of 31,046 tropical African species and infraspecific taxa in 1,197 degree squares of tropical mainland Africa between 24°N and 24°S. To prepare the dataset, the TRAFRICA dataset (Marshall, Wieringa, & Hawthorne, 2016) was supplemented with the FLOTROP dataset (Taugourdeau et al., 2019) and the RAINBIO dataset (Dauby et al., 2016), and is the largest yet compiled for tropical Africa. Species names and synonymy follow the taxonomic backbone of the TRAFRICA database, derived initially from the tropical African section of the African Plants Database (Conservatoire et Jardin botaniques de la Ville de Genève and South African National Biodiversity Institute, Pretoria, 2016). All distribution record identifications were updated against this framework for analysis. Records were checked and cleaned for geographical errors using custom Microsoft Visual FoxPro routines. Each record's supplied coordinates, if any, were compared against supplied textual locality information, with contradictions resolved or the record excluded. Records with textual locality information and without coordinates were parsed to the bounding boxes of the stated locality. We checked all records from centres of administrative areas, all records for any locality with > 25 records and suspiciously round coordinates.
The geographical resolution of the record was respected via the use of bounding boxes: unlikely coordinates were removed in favour of administrative polygon geolocation, and records which straddled one degree squares were dropped. Distribution data were summarized at the commonly used resolution of one degree square; the difference in area between the largest and smallest degree square is not large (8.8%). We excluded from analysis vague names, hybrids, cultivars and taxa we knew to be introduced or cultivated. Analyses are conducted at the lowest named taxonomic rank, the most commonly used rank used in phytogeographical studies in tropical Africa (Denys, 1980;Fayolle et al., 2014Fayolle et al., , 2019Linder et al., 2005Linder et al., , 2012.
Only 4.5% of informative records are resolved to infraspecific rank.
We excluded from analysis degree squares with < 15 taxon records, and ≤ 5% of taxa sampled. These thresholds were set as low as possible to maximize floristic representation with respect to environmental gradients, without analysing noise, and are slightly more stringent than the 5 species/100 km 2 used by Kreft and Jetz (2010) and Linder et al. (2012). A description of data sources and cleaning is in Appendix S1. The geographical pattern of sampling is shown in Figure 1, and is included for each degree square in Appendix S2.

| Environmental data
Environmental data were summarized at one degree square and half degree square using QGIS 3.4.3. We summarized mean altitude from GMTED2010 at 30 arc second resolution (Danielson & Gesch, 2011).
We derived topographic ruggedness from GMTED2010 using the GDAL Terrain Ruggedness Index tool via QGIS. Mean values for the climatic variables Bio1 to Bio35 at 30-min resolution for the years 1961-1990 were downloaded from the CliMond database (Kriticos et al., 2012). We used the surficial lithology classification of Sayre et al. (2013). We consulted the Harmonized World Soil Database, but could not use its classification as there were too many classes for the Random Forest algorithm to handle. We summarized the majority land cover class from GlobCover 2009 (Arino et al., 2012). We F I G U R E 1 Geographical distribution of 533,383 records of 31,046 vascular plant taxa across 1,197 degree squares of tropical Africa included in the analysed dataset. Degree squares are coloured by sampling completeness from ≥ 6% (lightest grey) to ≥ 100% (black) in equal intervals of 1%. Sampling completeness is calculated by comparing the number of species recorded as present with richness estimates of Barthlott et al., 2005. Map projection Africa Albers Equal Area Conic estimated completeness of taxon sampling for each degree square by comparing the number of species recorded as present with richness estimates of Barthlott, Mutke, Rafiqpoor, Kier, & Kreft, 2005. Environmental data summarized at half degree square resolution can be found in Appendix S3.

| Gradients in species composition
Analysis was conducted in R 3.6.3 (R Core Team, 2020), scripts are found in Appendix S9. Dissimilarity in taxon composition between degree cells was measured using the Simpson index of beta diversity (βsim) in the R package 'vegan' (Oksanen et al., 2019). βsim downweights dissimilarity between cells based on differences in species richness, making it appropriate for unevenly sampled datasets (Kreft & Jetz, 2010). The βsim dissimilarity matrix was ordinated using locally constrained NMDS with seven axes specified (stress = 0.094), in vegan. Local NMDS is recommended as a robust technique for indirect gradient analysis (Minchin, 1987). The first three axis scores were visualized in RGB colour space using the R package 'plotrix' (Lemon, 2006). Environmental variables were fitted to the ordination space as fitted vectors using vegan's envfit function.
Ward's algorithm gave the highest agglomerative coefficient with this dataset (0.95). We inspected 2-30 cluster solutions, every fifth cluster solution thereafter up to 95, and the 99th cluster solution. The highest node was always selected and no degree squares were reassigned. To help us choose the number of clusters to present, we calculated the decay in four different evaluation metrics for each of the clustering levels from 2 to 99, identifying the position of the knee using a geometric knee method: (a) dendrogram node height, where lower nodes are less informative than higher nodes; (b) ANOSIM R, a test statistic with the null that similarity between groups is greater than or equal to the similarity within the groups; (c) mean regional endemism, the average proportion of taxa restricted to each cluster and (d) total endemism, the percentage of all taxa that are restricted to any one cluster.
The two endemism metrics were calculated from the tropical African dataset only, so regions particularly at the margins of the dataset will include some 'endemics' that are found elsewhere in the world.

| Modelling the framework
Random forest classification models were built using the environmental data as predictors, using the R package randomForest (Liaw & Wiener, 2002). We trained one model on the 19 regions to predict the regional framework. We subsequently trained 19 models to predict the distribution of the 99 districts within each of the 19 regions, using the same selection of predictor variables. Continuous variables were rescaled between 0 and 1. Accuracy of the models was assessed using the out-of-bag error rate. Variable importance was assessed using the mean decrease in the Gini Index. Only half degree squares absent from the training dataset were classified using the model (half degree squares in the training dataset retained their classification). To reduce multicollinearity, overfitting and increase interpretability, indirect predictor variables were excluded and bioclim variables were restricted to the annual mean, minimum, and maximum values of temperature and precipitation, and annual mean value of the moisture index and radiation ( Fig. S8.2 for correlations between variables). Including more bioclim variables decreased the OOB error rate by a negligible amount. Half degree squares were merged by region and district to produce two shapefiles (Appendix S4). The 19 regions were coloured and the 99 districts are crosshatched using code written in Microsoft Visual Foxpro 9 for SVG output ( Figure 5).

| Characterizing and comparing the framework
Characteristics of each region are summarized in Table 2  confidence intervals of the median are calculated using ± 1.58 IQR/ sqrt(n). Categorical data were summarized by their majority class.
We tested for significant differences in MAT and MAP among regions and districts with the multiple comparison Kruskal-Wallis test from package 'pgirmess' (Giraudoux, 2018). We used three of the evaluation metrics (mean endemism, total endemism, and ANOSIM R) to compare our framework floristically against the regional frameworks of White (1983) and Droissart et al. (2018). We compared the spatial congruence of our regions with the White and Droissart et al. regionalizations, and the spatial congruence of our districts with WWF's ecoregions, using the V-measure implemented by the R package 'sabre' (Nowosad & Stepinski, 2018).

| Gradients in species composition
Floristic turnover at one degree square resolution is continuous, with no sharp disjunctions in ordination space (Figure 2). Floristic turnover is well explained by both geographical and climatic variables while the estimated percentage of species sampled explains almost none of the one degree cells' position in ordination space (r 2 = 0.027) ( Table 1). NMDS axis 1 represents the moisture gradient, and NMDS axis 2 the altitude/temperature gradient.
The soil moisture index (bio28) is the most strongly correlated variable with axis 1, and can explain a very large proportion of the one degree cells' positions in ordination space (r 2 = 0.86). The soil moisture index is a modelled environmental variable derived from precipitation, solar radiation (via pan evaporation) and water holding capacity of the soil; pan evaporation includes the effects of temperature, humidity, drought dispersion and wind (Hutchinson, Xu, Nix, & McMahon, 2009). Precipitation makes the most important contribution to the moisture index: the moisture index is very well correlated with mean annual precipitation (MAP) (Fig. S8.2). MAP only explains slightly less of the one degree cells' positions in ordination space than the soil moisture index (r 2 = 0.83), and is also very strongly parallel with axis 1.
Altitude is the most strongly parallel variable with axis 2, and explains a large amount of variation in ordination space (r 2 = 0.69).
Altitude is well correlated with mean annual temperature (MAT) ( Fig. S8.2), as is axis 2. Of the two variables, MAT explains more of the variation in ordination space than altitude (r 2 = 0.76). The zero point of axis 2 separates the continent rather precisely into the two parts which White (1983) recognized as 'high' and 'low' Africa: a separation of cooler, higher altitude southern and eastern Africa from the hotter, lower altitude northern, western and central regions; the 'line' is drawn from Angola in the southwest, around the Congo to western Ethiopia in the northeast (Figure 2b).

| Deriving the biogeographical framework
The continuous species turnover apparent in the ordination is also apparent in the cluster analysis. All four evaluation metrics have smooth curves, and the knee detection tests suggest different optimum cluster numbers from each other: between 9 and 20 clusters ( Fig. S7.1). The decision about how many clusters to recognize at the regional level is necessarily arbitrary within the range of breakpoints (Kreft & Jetz, 2010). No particular level is true or false, and so the results should be judged on the usefulness of the result (De Klerk et al., 2002). We chose to recognize the 19 cluster solution as regions. The 19th highest cluster is region 9 of Figure 3, a floristically distinct area which has previously been recognized at regional level (~ White's Karoo Namib). The 20th cluster is the division of the Sudanian region 2 into two latitudinal bands. The species distribution data through this area are relatively poor towards the east, weakening our confidence in the assignment of this cluster to regional level.
We recognized the 99 cluster solution as districts: the choice of 99 is arbitrary, but is informed by the popularity of WWF's ecoregions

| Modelling the framework
The regional Random Forest model has an out of bag error rate of 10.4%, that is, region could be accurately predicted for c. 90% of half degree square cells during cross validation. The error rates by region range from 2.94% (Region 12) to 25.4% (Region 3), and are higher for regions where sampling rates are lower (confusion matrix Table S8.2). Regional half degree square cell classifications are shown in Figure 4a. The relative importance of the model variables in the regional model is shown in Figure 4b. As in the ordination results, the most informative variables are those related to water availability (MAP, rad, soilMI and maxP), followed by temperature/ altitude (maxT, minT, alt and MAT), with lower contributions from lithology, landcover and topographic ruggedness. The out of bag error rates for the district models range from 4.7% (Region 3) to 23.6% (Region 16). Floristic districts, derived from the 99-cluster solution, are mapped within the regional framework in Figure 5.

| Characterizing and comparing the framework
The regions are different from each other with respect to their total taxa (921 to 10,555), and mean endemism rate (1%-22%) (Table 2, Appendix S5). We have not adopted endemism thresholds to distinguish between Our regions outperform White's regions by all evaluation metrics (mean endemism, total endemism, ANOSIM R; to White's regional framework (V-measure = 0.63, Figure 6c). At least in well-sampled areas, like our Guineo-Congolian (West) and Guineo-Congolian (West-Central), the districts would seem plausible based on field observation.

| Can physical environmental variables be used to predict floristically defined vegetation units?
One of White's guiding principles in the assembly of his 1983 map was 'Vegetation, in the first instance, should be classified without F I G U R E 2 (a) NMDS ordination scores for 1,197 degree squares of tropical Africa. A betasim dissimilarity matrix constructed from 31,046 vascular plant records in each degree square was ordinated. Each degree square is represented in RGB, with red for axis 1, green for axis 2, and blue for axis 3 (third axis not shown). High axis 1 scores have maximum red values, low axis 2 scores have maximum green values and high axis 3 scores have maximum blue values. The third axis is also represented by size, with larger points having higher axis 3 scores. Environmental variables are fitted as vectors with length proportional to r 2 (Table 1). MAT = mean annual temperature (bio1), MAP = mean annual precipitation (bio12), soilMI = soil moisture index (bio28), alt = altitude. (b) Schematic representation of the location of cells, coloured by the same scheme [Colour figure can be viewed at wileyonlinelibrary.com]   Maharjan et al., 2011;McClean et al., 2005), between climate and vegetation physiognomy (Arino et al., 2012), between climate and local-scale species assemblages across west African forest (Bongers et al., 2004), and between climate and local-scale species assemblages across savanna and forest biomes of tropical Africa (Fayolle et al., 2014(Fayolle et al., , 2019. Our model is empirical (Guisan & Zimmermann, 2000), employed for the pragmatic purpose of rendering a biogeographical map spatially complete in a way that is plausible and useful.  (Tansley, 1935). This definition draws together two principle approaches that have been used to divide the biosphere: the first using affinities and discontinuities in the distribution of taxa, the second using patterns of abiotic variation or direct measurement to classify the biosphere physiognomically (Mackey, Berry, & Brown, 2008). At finer spatial scales, it is rarely justified to separate the two approaches very distinctly, whereas in studies at larger spatial scales and scope the two disciplinary approaches have diverged. We have used a method popular in physiognomic spatial classification and applied it to a floristic biogeographical framework, producing spatial unit definitions that are closer to that of an ecosystem. Our framework has the advantage over physiognomic classifications (Arino et al., 2012;Sayre et al., 2013) that the units are diagnosed by, and fully characterized by, plant species distributions as currently represented by the plant biological record at

F I G U R E 4
The 19 clusters derived from Ward clustering of the betasim dissimilarity matrix of 31,046 vascular plant taxa of tropical Africa (Figure 3) are modelled using Random Forest to yield the regional framework. The model has an out of bag error rate of 10.4%. (a) Cells with no border constitute the training data, predicted cell classifications with regional classification error rates of 10 > 25% are shown with a black border, predicted cell classifications with regional classification error rates of 3%-10% are shown with a grey border. Lakes are shown in white. (b) Variables included in the model are ranked by their importance. rad = radiation (bio20); MAP = mean annual precipitation (bio12); soilMI = annual mean moisture index (bio28); maxT = maximum temperature of the warmest month (bio5); maxP = precipitation of the wettest month (bio13); minT = minimum temperature of the coldest month (bio6); alt = mean altitude; MAT = mean annual temperature (bio1); lith = majority lithological category (Sayre et al., 2013); globcover = majority GlobCover category (Arino et al., 2012); minP = precipitation of the driest month (bio14); TRI = terrain ruggedness index. Map projection Africa Albers Equal Area Conic [Colour figure can be viewed at wileyonlinelibrary.com]

(a) (b)
one degree square resolution. It has the advantage over previous quantitative phytochorological classifications (Droissart et al., 2018;Fayolle et al., 2014Fayolle et al., , 2019Linder et al., 2005Linder et al., , 2012 that it is spatially complete across its scope. It has the advantage over qualitatively interpolated biogeographical frameworks (Olson et al., 2001;White, 1983) that our unit boundaries are drawn in an reproducible, objective and quantitative manner. Of course, the degree of advantage presented by our framework is dependent on the intended use.

| How do our phytogeographical regions compare with previously defined regions?
Each region is interpreted with respect to previous region definitions in Appendix S8. Overall, our regions are very geographically coherent before modelling (Figure 2, Figure 3): not a single cell was manually reassigned, and there are almost no cells which have a dubious cluster assignment. This is in contrast to previous quantitative gridded chorological studies, which showed more geographically fractured regions (Droissart et al., 2018;Linder et al., 2005;plants in Linder et al., 2012). More comprehensive species distribution data have helped to resolve the phytogeographical affinities across the tropical continent. The bipartite method (Edler et al., 2017) (Kreft & Jetz, 2010), and the failure to resolve archipelago and mosaic regions with gridded data is a perennial issue (Linder et al., 2005). Even White (1993) (1983), but allowing the regional identity of the smaller cells to diverge from that of their enclosing larger cell, would be a solution.
Of note is our reinterpretation of the arid flora affinities in Tagant-Djibouti. The arid flora of eastern Namibia (9: Namib) is supported at regional level for the first time in a floristic quantitative regionalization of tropical African plants. The 'arid track' -shared species of plants (de Winter, 1971) and animals (Balinsky, 1962) across the geographically disjunct arid floras of the south west (Namibia) and north east (Djibouti and environs) -can be seen in ordination space (Figure 2), though overall their strongest affinity is to their geographically adjacent regions (Figure 3c). In the forest biome, we define three regions.  (Vetaas, 1993;Vetaas, Salih, & Jurasinski, 2012), and other outposts further south and north are not recovered within the region.

| Future directions
We limited the scope of our analysis to continental tropical ours  Botanica and the Missouri Botanical Garden. This paper has benefitted from reviews by Thomas Couvreur, Mike Swaine, and others anonymously.

DATA AVA I L A B I L I T Y S TAT E M E N T
Dryad data deposit DOI: https://doi.org/10.5061/dryad.rfj6q 5786.
Sources of distribution data are summarized in Appendix S1. All re-