Can we predict the number of plant species from the richness of a few common genera, families or orders?




1. Halting biodiversity loss, a major environmental challenge, relies on the understanding of species richness patterns. The assessment of species richness is often hampered by limited taxonomic knowledge and the general dearth of trained systematists. Research has shown that we can predict the number of species in a community by the number of higher order taxonomic units present. Here, we test whether we need to know all the genera, families or orders in order to do so. Further, the number of common species in a region is a good predictor of total richness and we test if this predictability translates to using higher taxa.

2. We used data from 240 sites from the Natura 2000 network of protected areas in Greece, including 5148 plant species and subspecies, which are grouped in 1113 genera, 174 families and 56 orders. We correlated species richness with the number of common genera, families or orders present. The analysis was repeated using the number of the most speciose higher taxa instead of the most common.

3. We found that we do not need to know all higher order taxa present in order to predict species richness. If we know how many out of the 30 most common orders are present, we can reliably predict the number of species. Similar results were obtained if we know how many of the 60 most common families or 200 most common genera are present.

4. Equally good results were obtained using the same numbers of the most speciose higher taxa.

5. Synthesis and applications. Our analysis demonstrates that species richness can be predicted from the number of common or more speciose genera, families and orders present. These predictions hold without complete sampling of these higher taxa. The implication is that we need only limited systematic knowledge, resources and effort in order to predict species richness. Assuming these findings hold in other taxonomic groups and in other regions, we argue that the uncertainty introduced by limited knowledge of the systematics of less studied taxa should not be used as an excuse to avoid making conservation decisions.


Biodiversity decline is one of the major environmental problems of the 21st century. Despite the urgent need to preserve biodiversity, documentation and monitoring at a species level is very time consuming (Raven & Wilson 1992; Pressey et al. 2000; Faith et al. 2001), while our knowledge about the distribution of most species remains poor (Margules & Pressey 2000; Brooks, da Fonseca & Rodrigues 2004).

To make things worse, there is a decline in global taxonomic capacity; the number of experts in systematics is decreasing and the funding of basic research and conservation implementation is limited (Winston & Metzger 1998; Miller & Rogo 2002; Wortley, Bennett & Scotland 2002; Basset et al. 2004). Moreover, species-level biodiversity surveys require large resource investments and advanced systematic training. Estimates of the scientific hours and workforce required for plant assessments at species level are provided by Lawton et al. (1998) and, more recently, by Lengyel et al. (2008), indicating the tremendous amount of resources that needs to be committed. However, in this framework of limited knowledge, limited resources and high uncertainty, we need to make urgent conservation decisions and plan reserve networks. So far, two approaches have been used to speed up species richness estimation.

Firstly, if we know the total number of genera, families or orders present in an area we can reliably predict species richness. This approach has been tested with some success, and is referred to as ‘higher taxon surrogacy’ (Gaston & Williams 1993). Several studies of different taxa in various biogeographic zones have verified the validity of this approach, at least for genera and families (Williams & Gaston 1994; Williams, Humphries & Gaston 1994; Gaston & Blackburn 1995; Báldi 2003; Cardoso et al. 2004; Bergamini et al. 2005; Larsen & Rahbek 2005; Villaseñor et al. 2005; Mazaris et al. 2008a). However, other studies failed to find sufficient relationships between higher taxonomic levels and species richness (Andersen 1995; Fjeldså 2000). Specifically, several authors have raised concerns regarding the applicability at the family (Prance 1994; Balmford, Green & Murray 1996; Vanderklift, Ward & Phillips 1998), or order (Balmford, Lyon & Lang 2000; Grelle 2002; La Ferla et al. 2002; Prinzing et al. 2003) levels for this purpose. Although this approach is certainly less demanding in time and manpower, measurement of the total number of genera or families relies heavily on thorough field work and extensive systematic expertise to identify all samples accurately (Williams & Gaston 1994).

The second approach is based on the premise that if we can estimate the number of common species we can reliably predict total species richness. Commonness in this context refers to how widely a species is distributed over the study area, not to its local abundance. To date, there have been few studies supporting this idea (Lennon et al. 2004; Vazquez & Gaston 2004). However, the generality of this pattern has been tested in different habitats, at various scales and for different sampling units, highlighting the importance of widely distributed species in describing overall biodiversity patterns (Pearman & Weber 2007; Mazaris et al. 2008b). The method still relies on our ability to identify organisms to species level, but it limits the range of species that need to be identified as well as the level of systematic knowledge required.

Here, we propose a new approach, namely predicting total species richness from the presence of common genera, families and orders. We test if it is sufficient to be able to identify only a few families, and especially the common, i.e. more widespread and well known ones, in order to reliably predict species richness patterns. In our case, commonness is measured in terms of geographic distribution and not in terms of local abundance. The distinction between common vs. rare genera, families and orders demands prior knowledge of the distribution of these taxa. Therefore, we used a second ranking scheme based upon the diversity of each taxon within the entire data set, which depends on the a priori available systematics knowledge. In other words, we also tested whether knowing how many of the most speciose genera, families and orders (those with the largest numbers of species) are present can be used as an efficient surrogate of species richness.

Materials and methods

We ask whether we need to be able to identify all higher order taxa in order to predict species richness, and if there is a criterion that will enable us to minimize the sampling effort and require less expertise. Lennon et al. (2004) demonstrated that we can predict total species richness from the number of common species. In our approach, we test whether commonness is a good criterion to minimize sampling effort. If we lack information about the commonness of higher order taxa in the region of interest, we test whether we can predict species richness based on the number of the most speciose taxa. In order to examine the efficiency of these approaches we test how well a random subset performs compared with the other two schemes.

Data base

We used an existing data set of the distribution of 5148 higher plant species and subspecies in the Greek Natura 2000 network of protected areas. This data set included a total of 339 193 occurrence records sampled across 16 114 sampling plots located in 240 sites across Greece (Fig. 1). The species were classified into 56 orders, 174 families and 1113 genera.

Figure 1. The spatial distribution of the 296 sites selected for field sampling from the Natura 2000 network in Greece.

The data set was analysed at two scales. The coarse scale was the Natura 2000 site scale, where all plant samples collected at a site were aggregated and analysed as a single case. However, at this scale the sampling effort varied significantly. In order to standardize for the differences in sampling effort and to test the same patterns at a finer scale, we also analysed all data at the sampling plot scale. We identified two representative size categories of sampling plots (30 and 300 m²) and repeated the analysis within each. At this finer scale of the analysis we used 886 plots of 30 m² and 829 plots of 300 m². Sampling was not nested, so the 30 m² plots were located in different positions, or possibly even in different habitats, from the 300 m² plots.

Richness of common higher order taxa

In each sample, we measured the number of species present. Then we measured the richness of the higher order taxa, i.e. how many genera, families and orders were present in that sample. These two measures are expected to be correlated. We asked whether we need to know the total number of higher taxa present to estimate species richness, or whether we can rely on a subset.

Initially, we ranked all genera in our data set according to their commonness (i.e. according to how many times they were counted in the data set), from the genera present in the greatest number of samples to the genera present in the smallest number of samples. The ranking created a common-to-rare sequence of genera. For example, considering the 20 commonest genera, in each sample we can measure how many of these genera are present, so each sample is associated with a value between 0 and 20 representing the richness of the 20 commonest genera in the sample. Then, we correlated this value with the total species richness of the sample.

As there is no a priori definition of what is common, there is no reason to restrict the analysis to the 20 commonest genera. We can increase the number of genera considered in an iterative process by adding genus by genus until all the genera are included in the analysis. The process can then also be applied to families and orders.
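As an illustration only, the ranking and counting steps described above might be sketched in Python as follows. The site names, genera and helper functions are hypothetical and are not taken from the original analysis; ties in commonness are broken alphabetically, which is an assumption made here for reproducibility.

```python
from collections import Counter

# Hypothetical toy data: each sample is the set of genera recorded in it.
samples = {
    "site_A": {"Quercus", "Pinus", "Trifolium", "Poa"},
    "site_B": {"Quercus", "Poa"},
    "site_C": {"Quercus", "Pinus", "Crocus"},
    "site_D": {"Poa", "Trifolium", "Quercus"},
}


def rank_by_commonness(samples):
    """Order taxa from most to least common (number of samples occupied)."""
    counts = Counter(taxon for taxa in samples.values() for taxon in taxa)
    # Descending frequency; ties broken alphabetically (an assumption).
    return sorted(counts, key=lambda t: (-counts[t], t))


def subset_richness(samples, ranking, k):
    """For each sample, count how many of the k top-ranked taxa are present."""
    top_k = set(ranking[:k])
    return {name: len(taxa & top_k) for name, taxa in samples.items()}


ranking = rank_by_commonness(samples)

# Iterative process: add one taxon at a time until all are included.
richness_by_k = {
    k: subset_richness(samples, ranking, k) for k in range(1, len(ranking) + 1)
}
```

Each entry of `richness_by_k` holds, for one value of k, the per-sample richness of the k commonest genera; these are the values that would then be related to total species richness. The same functions apply unchanged to families or orders.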

Richness of most speciose higher order taxa

The procedure described above relies on ranking higher order taxa (genera, families and orders) according to their presence. This approach was used because of the known association between species commonness and species richness (Lennon et al. 2004; Mazaris et al. 2008b). However, this approach demands prior knowledge about the distribution of the flora of the area. An alternative ranking scheme would be to rank higher order taxa according to the number of species they represent globally. For example, we can rank all genera according to how many species are described for each genus, from the genus with the greatest number of species to the genus with the smallest number of species. Again, the process iteratively adds one genus after another and can also be applied to families and orders.

In order to measure the correlation of the different ranking schemes we correlated each genus rank according to commonness with its rank according to species richness using Spearman’s rank correlation coefficient (Rs). The same analysis was performed for families and orders.
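A minimal sketch of this comparison is given below, assuming that each scheme yields a strict ordering of the same taxa (no tied ranks), so that the classical formula Rs = 1 − 6Σd²/[n(n² − 1)] applies. The species counts and the function name are hypothetical; a statistical package that handles ties would normally be preferred for real data.

```python
def spearman_from_rankings(ranking_a, ranking_b):
    """Spearman's Rs between two orderings of the same taxa (no tied ranks)."""
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must list the same taxa exactly once")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("at least two taxa are needed")
    position_b = {taxon: i for i, taxon in enumerate(ranking_b)}
    # Sum of squared rank differences between the two schemes.
    d_squared = sum((i - position_b[t]) ** 2 for i, t in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings: by commonness, and by number of species per genus.
by_commonness = ["Quercus", "Poa", "Pinus", "Trifolium", "Crocus"]
species_per_genus = {"Trifolium": 9, "Quercus": 5, "Crocus": 4, "Poa": 3, "Pinus": 2}
by_speciosity = sorted(species_per_genus, key=lambda t: (-species_per_genus[t], t))

rs = spearman_from_rankings(by_commonness, by_speciosity)
```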

Null model: random ranking

Finally, assuming a complete lack of prior information about the flora of the area we produced random sequences of orders, families and genera and repeated the ranking process. We repeated the random ranking 1000 times, and estimated averages of the 1000 runs. The random ranking could also be used as a null model, to examine the efficiency of our ranking schemes.
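The random-ranking procedure could be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the toy data are hypothetical, and the squared Pearson correlation is used as a simple stand-in for the goodness of fit of the model, purely to keep the example short.

```python
import random


def pearson_r2(xs, ys):
    """Squared Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: no variation means no explanatory power
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def random_ranking_curve(samples, species_richness, n_runs=1000, seed=0):
    """Average fit for the first k taxa of a random sequence, for every k."""
    rng = random.Random(seed)
    taxa = sorted({t for present in samples.values() for t in present})
    names = sorted(samples)
    ys = [species_richness[name] for name in names]
    totals = [0.0] * len(taxa)
    for _ in range(n_runs):
        order = taxa[:]
        rng.shuffle(order)  # one random sequence of taxa
        included = set()
        for k, taxon in enumerate(order):
            included.add(taxon)
            xs = [len(samples[name] & included) for name in names]
            totals[k] += pearson_r2(xs, ys)
    return [total / n_runs for total in totals]


# Hypothetical toy data.
samples = {
    "site_A": {"Quercus", "Pinus", "Trifolium", "Poa"},
    "site_B": {"Quercus", "Poa"},
    "site_C": {"Quercus", "Pinus", "Crocus"},
    "site_D": {"Poa", "Trifolium", "Quercus"},
}
species_richness = {"site_A": 9, "site_B": 3, "site_C": 6, "site_D": 5}

curve = random_ranking_curve(samples, species_richness, n_runs=200, seed=0)
```

The last element of `curve` does not depend on the random order, because by then every taxon has been included; the earlier elements give the baseline against which an informed ranking can be compared.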

Model building and validation

In each step of the analysis, we correlated species richness with the richness of a sub-assemblage of higher order taxa (for example the richness of the 20 commonest genera). Previous studies on the higher order surrogacy approach highlighted that the power law model provides the best fit to the data (see Villaseñor et al. 2005; Mazaris et al. 2008b), and this model was used here also.
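For readers who wish to experiment, a power law of the form species = b × (richness of higher taxa)^z can be fitted by ordinary least squares after log-transforming both variables. The sketch below assumes this log-log approach, which is one common choice; the text does not state which fitting procedure was used. The function names and data are hypothetical, and samples with zero richness are skipped because their logarithm is undefined.

```python
import math


def fit_power_law(taxon_richness, species_richness):
    """Fit S = b * T**z by least squares on log-transformed values.

    Returns (b, z, r2), where r2 is the coefficient of determination
    computed on the log-log scale.
    """
    pairs = [
        (math.log(t), math.log(s))
        for t, s in zip(taxon_richness, species_richness)
        if t > 0 and s > 0  # logarithm undefined for zero richness
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("at least two samples with non-zero richness are needed")
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("higher taxon richness must vary among samples")
    z = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    log_b = my - z * mx
    ss_tot = sum((y - my) ** 2 for _, y in pairs)
    ss_res = sum((y - (log_b + z * x)) ** 2 for x, y in pairs)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_b), z, r2


def predict(b, z, taxon_richness):
    """Predicted species richness for a given higher taxon richness."""
    return b * taxon_richness ** z


# Hypothetical, noise-free data generated from S = 2 * T**1.5.
T = [1, 2, 4, 8, 16]
S = [2 * t ** 1.5 for t in T]
b, z, r2 = fit_power_law(T, S)
```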

The test of any surrogacy method is how well it performs in new and unknown conditions. For example, how well does the method perform in a region for which we do not know the common to rare ranking of plant genera, families or orders? As we lack data from a totally different geographic area, we tested our surrogacy method using the statistical method of cross-validation. For this reason, we separated our data into two data sets: one (the training set, including 70% of all Greek Natura 2000 sites) was used to build the models and the second (the test set, including the remaining 30% of the Greek Natura 2000 sites) was used to test the models. We initially re-constructed the models based solely on the training data set, which means that we ranked higher order taxa according to their commonness in the training set and then estimated the power law parameters in the training set. Finally, we applied this model (ranking and mathematical formula) to the test set. For example, we estimated the parameters produced using the 20 most common genera of the training data set. Next, we measured how many of these 20 genera were present in each sample of the test set and applied the formula estimated in the training set to predict the species richness in each sample of the test set.
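The validation procedure can be sketched as below. This is a hedged illustration: the random 70/30 split, the log-log fit, the function names and the idealised toy data are all assumptions introduced here, since the text does not describe how sites were assigned to the two sets. The key point shown is that both the ranking and the model parameters come from the training sites only.

```python
import math
import random
from collections import Counter


def split_sites(site_names, train_fraction=0.7, seed=0):
    """Randomly assign sites to a training and a test set."""
    rng = random.Random(seed)
    shuffled = sorted(site_names)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def top_k_taxa(samples, site_names, k):
    """The k commonest taxa, ranked using the given sites only."""
    counts = Counter(t for name in site_names for t in samples[name])
    return set(sorted(counts, key=lambda t: (-counts[t], t))[:k])


def fit_log_log(xs, ys):
    """Least-squares fit of y = b * x**z on log-transformed values."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    if sxx == 0:
        raise ValueError("need at least two distinct richness values")
    z = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sxx
    return math.exp(my - z * mx), z


def train_and_predict(samples, species_richness, k, seed=0):
    train, test = split_sites(samples, seed=seed)
    top_k = top_k_taxa(samples, train, k)  # ranking from training sites only
    pairs = [(len(samples[s] & top_k), species_richness[s]) for s in train]
    pairs = [(t, r) for t, r in pairs if t > 0 and r > 0]  # log undefined at 0
    b, z = fit_log_log([t for t, _ in pairs], [r for _, r in pairs])
    predictions = {}
    for s in test:
        count = len(samples[s] & top_k)
        predictions[s] = b * count ** z if count > 0 else 0.0
    return train, test, predictions


# Hypothetical, idealised data: 10 sites with nested genus lists and
# richness following an exact power law of genus richness.
sizes = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4]
samples = {f"site_{i}": {f"genus_{j}" for j in range(n)} for i, n in enumerate(sizes)}
species_richness = {f"site_{i}": 3 * n ** 1.2 for i, n in enumerate(sizes)}

train, test, predictions = train_and_predict(samples, species_richness, k=4)
```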

In addition, we examined the efficiency of the validation test using information at the finer scale of the analysis (i.e. sampling plots of 30 and 300 m²). To measure how well the model predictions fit the observed richness of the test set, we estimated standard deviations between observed and produced values [(observed − produced)/observed].
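The deviation measure can be illustrated with a short sketch. It interprets the bracketed expression as a per-sample relative deviation, which is an assumption about the intended calculation; the plot names and values are hypothetical. A mean deviation close to zero would indicate no systematic over- or underestimation, while the mean absolute deviation summarizes the typical size of the error.

```python
def relative_deviations(observed, predicted):
    """Per-sample (observed - predicted) / observed.

    Samples with zero observed richness are skipped, because the
    relative deviation is undefined for them.
    """
    return {
        name: (observed[name] - predicted[name]) / observed[name]
        for name in observed
        if observed[name] != 0
    }


# Hypothetical observed and predicted species richness for three test plots.
observed = {"plot_1": 100, "plot_2": 50, "plot_3": 80}
predicted = {"plot_1": 90.0, "plot_2": 60.0, "plot_3": 80.0}

deviations = relative_deviations(observed, predicted)
mean_deviation = sum(deviations.values()) / len(deviations)
mean_absolute_deviation = sum(abs(d) for d in deviations.values()) / len(deviations)
```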


The positive correlation between species richness and higher taxon richness was significant for genus, family and order (P < 0·0001). In all cases (including the two sampling plot sizes), the correlations tended to decrease with increasing systematic level, with genera being most strongly correlated with species richness, as the coefficients of determination (R²) and the standard errors of the estimates show (Table 1).

Table 1. Coefficients of determination (R²) for the relationship between species richness and higher taxon richness, for the aggregated data set (coarse scale) and for all different size categories of sampling plots (finer scale)

Sampling plot size (m²)    Species vs. order    Species vs. family    Species vs. genera
Aggregated data set        0·767 (0·292)        0·819 (0·256)         0·912 (0·121)
30                         0·868 (0·254)        0·895 (0·241)         0·955 (0·1)
300                        0·722 (0·157)        0·796 (0·154)         0·928 (0·106)

Standard errors of the estimates are given in parentheses.

Fitting the models

The main aim of this study was to investigate how many genera, families or orders are needed in order to reliably predict species richness. The inclusion of the first few common higher taxa rapidly increased the coefficient of determination and decreased the standard error of the estimates, until c. 20% of available genera, 30% of families or 50% of orders were included (Fig. 2). After that point, the addition of more information improved the predictive ability of the model only slightly. The results of the ranking based on actual species occurrence (or commonness) and the ranking based on the higher order taxon richness were practically indistinguishable (Fig. 2), suggesting that both approaches produce equally efficient models.

Figure 2. Predictive power indicated by the coefficient of determination and the standard error of model estimates of the regression produced using common to rare sequences for (a, d) orders, (b, e) families and (c, f) genera to predict species richness. In each panel, common to rare sequences were produced by ranking higher taxa (i) according to their frequency (continuous line), (ii) from the most speciose to least speciose taxon (dashed line) and (iii) randomly (dotted line).

The random ranking scheme predicted species richness less well (Fig. 2). Although the inclusion of additional higher taxa improved the predictive power of the random model, much greater numbers of genera, families and orders were required to produce results comparable with the ranking based on commonness or that based on the higher order taxon richness. In addition, a higher number of genera would be required to obtain the initial predictions of species richness.

The fine scale analysis (at the different size categories of sampling plots) generated similar results. The analysis showed that information about 30 (out of 48) orders could account for more than 80% and 70% of the variation in species richness at sampling plots of 30 and 300 m² respectively. Similar results were obtained at the family level, with information about a relatively low number of families yielding good estimates of species richness. Approximately 60 (out of 121) families were needed to account for more than 85% and 79% of the variation in species richness at sampling plots of 30 and 300 m² respectively. The identification of the 200 most common genera (out of 756) could achieve a coefficient of determination >95% at sampling plots of 30 m². In the larger 300 m² sampling plots, 250 (out of 647) genera were required to achieve a coefficient of determination of 90%.

Correlations of ranking schemes

A rank correlation of the two schemes, at the coarse scale, demonstrated no significant correlations at the level of orders and genera (P > 0·05), but showed a significant positive relationship at the family level (Spearman Rs = 0·894; P < 0·001). Despite this lack of correlation, the two schemes needed similar numbers of genera, families or orders in order to predict species richness (Fig. 2a,b; Table 2).

Table 2. Spearman rank correlation coefficients between the two common to rare schemes produced based on the two definitions of commonness for orders, families and genera

Data sets                                        Order    Family     Genera
Groups of sampling plots of similar size (m²)
 Aggregated data set                             0·192    0·894**    0·045

Significant correlations at the 0·05 and 0·01 levels of significance are marked with * and ** respectively.

Validation of the method

Applying the higher taxa surrogates from the training data set to a validation data set demonstrated that the use of 30 orders or 60 families achieved a coefficient of determination as high as 75% and 80% respectively (Fig. 3a). Similarly, utilizing information about the 200 more diverse genera recognized throughout the rest of Greece yielded a coefficient of determination of about 85%. These findings are further supported by the decreasing trends in the standard error of estimates at the different systematic levels (Fig. 3b); it is clear that the standard errors of the models tend to decrease as the systematic level increases because there are fewer taxa included in the level, and these taxa have a higher probability of being shared by sites.

Figure 3. Predictive power of the methods for test sites, indicated by the coefficient of determination (a) and the standard error of estimates (b) of the regression produced using sequences of (i) orders (black line), (ii) families (open grey line) and (iii) genera (dark grey line) by ranking from the most speciose to least speciose taxon within each site-aggregated data set after excluding the test sites (see text for details).

The models produced by the training data set were able to accurately predict species richness of the validation data set (Table 3). The models produced from the most common higher taxa performed as well as the models produced from the most speciose higher taxa. Furthermore, the analysis of only the 50 commonest or the 50 most speciose genera yielded low values of deviation between predicted and observed values of species richness. Under both ranking schemes, we obtained no evidence for biased over- or underestimation of species richness for the test data set.

Table 3. Prediction accuracy (standard deviations from observed species richness) for the test data sets using parameters obtained from the training data set [species = b (richness of higher taxa)^z], for different sub-assemblages of the commonest/most speciose orders (a), families (b) and genera (c)
 Ranking according to commonness    Ranking according to speciose taxa
(a) Number of orders
(b) Number of families
(c) Number of genera

Similar results were obtained when using the model to predict species richness of sampling plots of various sizes. When the model was run for the 30 commonest or most speciose orders and for the 100 commonest or most speciose families, the standard deviation between predicted and observed values was <30%. When comparing sampling plots of different sizes, for the 30 m² sampling plots the use of 200 and 300 genera achieved standard deviations of about 30% and 15% respectively, while for the same level of prediction we would require 250 and 550 genera for sampling plots of 300 m².


The efficiency of the higher taxon approach as a surrogate of species richness is relatively well documented in the literature (for example Balmford et al. 2000; Negi & Gadgil 2002; Bergamini et al. 2005; Villaseñor et al. 2005). The novelty of our results is that not only can we predict species richness just from the richness of genera, families and orders, but we do not even need to know all the genera, families or orders to do so.

We have demonstrated that knowing the number of the most common or most species-rich genera, families or orders present in an area is sufficient to predict species richness. Our finding suggests that the time required for rapid assessment of species richness could be significantly reduced. We propose that for an initial evaluation of an area, it is not necessary to identify all plants to species level but only to establish whether they belong to one of the 30 most common orders or one of the 50 most common families. This level of census requires limited effort and little time. An advantage of this approach is that it is much easier and less error prone to recognize common families as compared with rare species. This is even more pronounced in many cases where it is easy to identify a specimen to its higher taxon but notoriously difficult to identify it to species level (e.g. Poaceae). In many cases, such families (e.g. Poaceae, Fagaceae, Fabaceae, Asteraceae) are among the more common taxa in a floristic checklist. In other cases, it is difficult to place a specimen in its generic (or family) status. Our results suggest that such difficulties in identifying families or genera might be avoided by excluding them from a rapid estimation of species richness, as we do not need to know all higher order taxa in an area. Another advantage is that less experienced personnel and volunteers would be able to follow our approach. Given the increased volunteer participation in conservation programmes (Bell et al. 2008), our approach could contribute to a higher public awareness of environmental issues.

One key advantage of our surrogacy scheme is that it can easily be applied to new geographic regions. For this purpose, we divided our data into two sets, one used to train our models and the other to test the predictive ability of our model. We found that the model constructed from a given set of sites could reliably predict species richness in different sites. This extrapolation exercise needs to be verified in other biogeographic regions and at different scales of analysis before we can make generalizations about its efficiency. For example, in the USA, the state of Florida has c. 40% fewer species than the state of California, but 15% more families. Is it possible that if we measured only common American families the results would change?

We are not advocating that full floristic surveys should be avoided. On the contrary, we believe that the information obtained in such detailed surveys is important and useful, especially regarding rare species, genera, families or orders. Our approach predicts only species richness. Species richness is the most common measure of biodiversity and has been widely used for conservation, management, design and prioritization of protected areas (Prendergast et al. 1993). Although a rough estimation of species richness might be enough for the identification of biodiversity hotspots, for the implementation of a protection network a wealth of ecological information should be collected, as we need to know not only how many but also which species are there.

Species richness is not a panacea and should not be used uncritically as the ‘one size fits all’ measure of biodiversity for conservation policy planning. The protection of rare, endangered or endemic species is urgent and for that reason detailed surveys are required. Furthermore, some have argued for a rank-free system of nomenclature in which traditional Linnaean categories such as genus, family or order are abandoned (for example see Cantino & de Queiroz 2004; De Queiroz 2006). We would argue that there are good reasons to keep the Linnaean scheme (see Nixon & Carpenter 2000; Benton 2007 concerning the flexibility, stability and ease of use and communication of the scheme) and our results show how useful it can be for predicting species richness.

Some concerns have been raised regarding the potential effects that sampling effort may have on the efficiency of the higher taxon approach (Balmford et al. 2000; Grelle 2002; Cardoso et al. 2004). However, we found that strong relationships between higher taxonomic levels and species richness hold at various sizes of sampling plots, indicating that this pattern itself does not clearly depend on the sampling effort. This finding is important for many reasons. Firstly, it provides evidence for the efficiency of higher taxon sampling even at fine scales. Moreover, taking into account that the size of the sampling plot depends on the habitat or organism to be sampled (Cox 1990; Barr & Babbitt 2001), our results suggest that the relationships between higher taxonomic levels and species may hold for various different habitats, although further studies are needed to confirm this hypothesis.

Our study was carried out on higher plants, a taxon that is relatively well known taxonomically. However, there are many taxonomic groups that have received less attention and only a fraction of their total diversity has been described. For example, insects are the most diverse taxon known to science, with c. 950 000 described species, and estimates of total richness range between 2 and 100 million species (Erwin 1983). So in the best-case scenario, our knowledge of the biodiversity of this taxon includes less than half of the species, and in the worst-case scenario <1%. Do our results hold for such cases? Assuming this relationship does hold, we suggest that the uncertainty introduced by the limited knowledge of the systematics of less well studied taxa should not be used as an excuse to avoid making conservation decisions based on limited information.
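The best-case and worst-case fractions quoted in this paragraph follow directly from the ratio of described to estimated species, using only the figures given in the text:

```latex
\[
\frac{950\,000}{2\,000\,000} = 0.475 = 47.5\% \quad (\text{best case}),
\qquad
\frac{950\,000}{100\,000\,000} = 0.0095 = 0.95\% \quad (\text{worst case}).
\]
```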

In conclusion, identification to species level and complete surveys are an essential prerequisite for any effective conservation action; the lack of knowledge should stimulate research to obtain that information. However, practical limitations and time constraints might dictate the use of surrogacy methods. Our analysis has shown that the time required for assessments of species richness could be significantly reduced by recording the presence of just a few higher order taxa. We used two schemes in order to select the smallest number of higher taxa needed to predict total species richness. One scheme was based on the commonness of each taxon (i.e. how widespread it is), and relied on information about the distribution of the study taxa; information that is not always available. We found that using an alternative scheme based on the systematic richness of the taxa (i.e. how many species there are in the genus, family or order) was equally efficient. This second scheme relies on the number of species in the higher taxonomic groups, which is more or less independent of the study area and could be estimated from information already available in the literature.


We thank the Hellenic Ministry of the Environment, Planning and Public Works, for providing us with the data from the Natura 2000 network. We also thank Dr J.M. Halley for linguistic improvements on an earlier version of this manuscript. Furthermore, we thank three anonymous reviewers and the editor for their insightful comments and thorough revision of our manuscript. The work was supported by the EU FP7 SCALES project (‘Securing the Conservation of biodiversity across Administrative Levels and spatial, temporal and Ecological Scales’; project #226852).