Question: What are the effects of the number of presences on models generated with multivariate adaptive regression splines (MARS)? Do these effects vary with data quality and quantity and species ecology?
Location: Spain and Ecuador.
Methods: We used two data sets: (1) two tree species from Spain, representing data sets with a high number of occurrences, real absences and unbalanced prevalence; (2) two herb species from Ecuador, representing data sets with a low number of occurrences, no real absences and balanced prevalence. We assessed model quality with two measures: reliability and stability. For each sample size, replicate samples were drawn at random and then used to generate a consensus model.
Results: Model reliability and stability increase with sample size. The optimal minimum sample size varies depending on many factors, many of which are unknown. Regional niche variation and ecological heterogeneity are critical.
Conclusions: (1) Model predictive power improves greatly with more than 18-20 presences. (2) Model reliability depends on data quantity and quality as well as on the ecological characteristics of the species. (3) Depending on the number of presences in the data set, investigators must carefully distinguish between models that should be treated with skepticism and those whose predictions can be applied with reasonable confidence. (4) For species that combine few initial presences with wide environmental range variation, it is advisable to generate several replicate models from partitions of the initial data and combine them into a consensus model. (5) Models of species with narrow environmental range variation can be highly stable and reliable, even when generated with few presences.
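
The replicate-and-consensus procedure described in Methods and recommended in conclusion (4) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the least-squares linear scorer is a hypothetical stand-in for a MARS fit (it is not MARS), the consensus is taken as the mean of the replicate predictions, and stability is approximated by the mean pairwise correlation between replicate predictions. The abstract does not specify how the authors computed either quantity.

```python
import numpy as np


def fit_linear_scorer(X, y):
    """Hypothetical stand-in for a MARS fit: ordinary least-squares linear model."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def consensus_from_replicates(X_pres, X_abs, X_eval, n_presences, n_replicates, rng):
    """Fit one model per random subsample of presences, then combine them.

    Returns the consensus prediction (assumed here: mean across replicates)
    and a stability proxy (assumed here: mean pairwise Pearson correlation
    between replicate predictions). Requires n_replicates >= 2.
    """
    preds = []
    for _ in range(n_replicates):
        # Draw n_presences presence records without replacement.
        idx = rng.choice(len(X_pres), size=n_presences, replace=False)
        X = np.vstack([X_pres[idx], X_abs])
        y = np.concatenate([np.ones(n_presences), np.zeros(len(X_abs))])
        preds.append(fit_linear_scorer(X, y)(X_eval))
    preds = np.array(preds)  # shape: (n_replicates, n_eval)
    consensus = preds.mean(axis=0)
    corr = np.corrcoef(preds)  # replicate-by-replicate correlation matrix
    upper = np.triu_indices(n_replicates, k=1)
    stability = corr[upper].mean()
    return consensus, stability


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic environmental predictors; purely illustrative data.
    X_pres = rng.normal(1.0, 1.0, size=(200, 2))
    X_abs = rng.normal(-1.0, 1.0, size=(200, 2))
    X_eval = rng.normal(0.0, 1.5, size=(100, 2))
    for n in (5, 10, 20, 50):
        _, stability = consensus_from_replicates(X_pres, X_abs, X_eval, n, 10, rng)
        print(n, round(float(stability), 3))
```

Running the loop over several values of `n_presences` gives one stability value per sample size, which is the kind of comparison the study makes across sample sizes; the synthetic data here say nothing about the thresholds reported above.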