How to make use of unlabeled observations in species distribution modeling using point process models

Abstract
Species distribution modeling, which allows users to predict the spatial distribution of species with the use of environmental covariates, has become increasingly popular, with many software platforms providing tools to fit such models. However, the species observations used can have varying levels of quality and can have incomplete information, such as uncertain or unknown species identity. In this paper, we develop two algorithms to classify observations with unknown species identities which simultaneously predict several species distributions using spatial point processes. Through simulations, we compare the performance of these algorithms using 7 different initializations to the performance of models fitted using only the observations with known species identity. We show that performance varies with differences in correlation among species distributions, species abundance, and the proportion of observations with unknown species identities. Additionally, some of the methods developed here outperformed the models that did not use the misspecified data. We applied the best‐performing methods to a dataset of three frog species (Mixophyes). These models represent a helpful and promising tool for opportunistic surveys where misidentification is possible or for the distribution of species newly separated in their taxonomy.

This document walks through the steps for using the functions ppmMixEngine and ppmLoopEngine for data classification. These functions and others are contained in functionTestsim160420-SH.R. First, we load the functions and packages we will need:
source("functionTestsim160420-SH.r")

Data and environmental covariates
We load simulated point data for three species and the environmental covariates stored in the PrepData.RDATA file. We display the true intensity of each species as well as the three point patterns.

Mixture methods
We use a simulated dataset in which we hide some of the label information. In the article, we choose three proportions of hidden labels and run the algorithm for each of them. Here we only present the case where 50% of the observations are hidden. The main function implementing our Mixture algorithm is ppmMixEngine, with the following arguments:
• Unknown.ppp is the point pattern made up of the points with hidden labels.
• Known.ppp is a marked point pattern in which each mark represents a known species. Both point patterns are ppp objects from the spatstat package.
• quadsenv is a dataframe with information on the quadrature points: coordinates x and y and environmental covariates at those points.
• ppmform specifies the spatial trend we want the model to fit to the data.
• The initweights argument sets the method used to calculate the initial weights:
-knn (we use the distance to the k nearest points). The argument k sets the number of nearest neighbors used by this initialization method.
-kmeans (we use the distance to the k centroids of the known species),
-random (we randomly attribute initial weights to the unknown points),
-kps (similar to knn, but the k nearest distances are calculated within each species),
-coinF (we randomly attribute a label to the unknown points).
• The classif argument sets the type of classification: classif = "soft" selects a soft classification and classif = "hard" selects a hard classification.
We apply this function to the simulated data and run both the soft and the hard classification to compare the results.
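To make the ideas of initial weights and of soft versus hard classification concrete, here is a minimal base-R sketch. It is not the code inside ppmMixEngine: the helpers knn_init_weights and harden are hypothetical, and the sketch assumes that each of the k nearest known points votes for its species with weight 1 / distance.

```r
# Hypothetical helper (not part of the authors' code): initial membership
# weights for unknown points from their k nearest known points.
# Assumption: each neighbour votes for its species with weight 1 / distance.
knn_init_weights <- function(known_xy, known_sp, unknown_xy, k = 3) {
  species <- sort(unique(known_sp))
  w <- matrix(0, nrow = nrow(unknown_xy), ncol = length(species),
              dimnames = list(NULL, species))
  for (i in seq_len(nrow(unknown_xy))) {
    # Euclidean distance from unknown point i to every known point
    d <- sqrt((known_xy[, 1] - unknown_xy[i, 1])^2 +
              (known_xy[, 2] - unknown_xy[i, 2])^2)
    nn <- order(d)[seq_len(k)]
    # Sum the inverse-distance votes per species (NA for species with no vote)
    votes <- tapply(1 / (d[nn] + 1e-9),
                    factor(known_sp[nn], levels = species), sum)
    votes[is.na(votes)] <- 0
    w[i, ] <- votes / sum(votes)  # each row sums to 1
  }
  w
}

# Hypothetical helper: turn soft weights into a hard 0/1 assignment
# by giving each point entirely to its most probable species.
harden <- function(w) {
  hard <- matrix(0, nrow(w), ncol(w), dimnames = dimnames(w))
  hard[cbind(seq_len(nrow(w)), max.col(w, ties.method = "first"))] <- 1
  hard
}

# Toy data: two clusters of known points and three unlabeled points
known_xy   <- cbind(x = c(0, 1, 0, 10, 11, 10), y = c(0, 0, 1, 10, 10, 11))
known_sp   <- c("sp1", "sp1", "sp1", "sp2", "sp2", "sp2")
unknown_xy <- cbind(x = c(0.5, 9.5, 5), y = c(0.5, 9.5, 5))

soft_w <- knn_init_weights(known_xy, known_sp, unknown_xy, k = 4)
hard_w <- harden(soft_w)
print(round(soft_w, 3))
print(hard_w)
```

With soft weights, a point lying between the two clusters contributes partly to each species; with hard weights, it is assigned entirely to one of them.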

Loop methods
The Loop methods work in a similar way. We run the ppmLoopEngine function, whose arguments are:
• Known.ppp, Unknown.ppp, n.sp and quadsenv are the same as for the ppmMixEngine function.
• addpt sets the looping method of the algorithm:
-"LoopA" adds all points,
-"LoopT" adds all points with membership probabilities above a threshold. The threshold starts at delta_max and decreases by delta_step at each iteration until it reaches delta_min.
-"LoopE" adds a similar number of points for each species at the start and increases this number by one point after the first iteration. The number of initial points to add is set by num.add.
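The decreasing threshold used by "LoopT" can be illustrated with a short base-R sketch. This is an illustration under assumptions, not the code of ppmLoopEngine: the helper loopT_thresholds is hypothetical, and the membership probabilities are held fixed here, whereas the actual algorithm would refit the model and update them at each iteration.

```r
# Hypothetical helper: the sequence of thresholds visited when starting at
# delta_max and decreasing by delta_step until delta_min is reached.
loopT_thresholds <- function(delta_max, delta_min, delta_step) {
  stopifnot(delta_step > 0, delta_max >= delta_min)
  seq(from = delta_max, to = delta_min, by = -delta_step)
}

thresholds <- loopT_thresholds(delta_max = 0.9, delta_min = 0.5, delta_step = 0.1)

# Toy values: highest membership probability of five unlabeled points.
# Assumption for illustration only: these stay fixed across iterations.
member <- c(0.95, 0.85, 0.72, 0.55, 0.40)

added      <- logical(length(member))       # has the point been added yet?
step_added <- rep(NA_integer_, length(member))
for (s in seq_along(thresholds)) {
  newly <- !added & member > thresholds[s]  # points passing the current threshold
  step_added[newly] <- s
  added <- added | newly
  # A real run would refit the point process model here and recompute
  # the membership probabilities before the next iteration.
}
print(data.frame(member, added, step_added))
```

In this toy run, the point with probability 0.40 is never added because it stays below delta_min.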

Evaluate performance
We can access useful outputs from an algorithm run: coefficients, membership probabilities, and predictions. Whether we use the Loop or the Mixture method, we can access the estimated covariate coefficients with the function coef_fit, which needs the simulation object. To access the final predicted intensities and membership weights directly, we use the functions pred_int and member_prob respectively. member_prob needs only the simulation object, whereas pred_int also requires the environmental dataframe defined earlier. We also set colpred to a number from 1 to 3 to choose the plot color scale from those available in the viridis package.
We can calculate and store the performance measures accuracy, meanRSS, IMSE, and sumcor for each method using the Perffunc function. We need to specify:

• the fitted model object,
• the list of the true species intensities,
• the known marked point pattern defined earlier,
• the hidden labels to be reclassified and the number of species.
• The pf argument selects which performance measures to compute among accuracy, meanRSS, sumIMSE and sumcor. The default value NULL computes all of them.
-For IMSE and sumcor, the argument fun lets us apply a log or a square root to the intensity. By default, fun = "Else", which leaves the predicted intensity unchanged.
-We can also choose the method used to calculate the correlation ("pearson", "kendall", "spearman").
• For mixture methods, the LoopM argument is set to FALSE, whereas for the loop methods it is set to TRUE.
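As a rough illustration of what two of these measures capture, the base-R sketch below computes a classification accuracy and a correlation between true and predicted intensities on toy values. It is not the Perffunc implementation: the helper transform_intensity and the option names "log" and "sqrt" are assumptions made for this example, and only fun = "Else" is taken from the description above.

```r
# Hypothetical helper: optionally transform an intensity before comparison.
# "log" and "sqrt" are assumed option names; any other value (such as
# "Else") leaves the intensity unchanged.
transform_intensity <- function(x, fun = "Else") {
  switch(fun, log = log(x), sqrt = sqrt(x), x)
}

# Toy data: true and reclassified labels of four hidden observations
true_labels <- c("sp1", "sp2", "sp3", "sp1")
pred_labels <- c("sp1", "sp2", "sp1", "sp1")

# Accuracy: share of hidden observations assigned to the correct species
acc <- mean(pred_labels == true_labels)

# Toy data: true and predicted intensity of one species at four locations
true_lambda <- c(1, 4, 9, 16)
pred_lambda <- c(1.2, 3.5, 10, 15)

# Correlation between true and predicted intensity, with a choice of
# transformation and of correlation method
cor_raw  <- cor(transform_intensity(true_lambda, "Else"),
                transform_intensity(pred_lambda, "Else"),
                method = "pearson")
cor_sqrt <- cor(transform_intensity(true_lambda, "sqrt"),
                transform_intensity(pred_lambda, "sqrt"),
                method = "pearson")
cor_rank <- cor(true_lambda, pred_lambda, method = "spearman")

print(c(accuracy = acc, pearson = cor_raw,
        pearson_sqrt = cor_sqrt, spearman = cor_rank))
```

Rank-based methods such as "spearman" are unaffected by monotone transformations like the log or square root, so the choice of fun only matters for "pearson".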