Modelling heterogeneity in the classification process in multi‐species distribution models can improve predictive performance

Abstract Species distribution models and maps from large‐scale biodiversity data are necessary for conservation management. One current issue is that biodiversity data are prone to taxonomic misclassifications. Methods to account for these misclassifications in multi‐species distribution models have assumed that the classification probabilities are constant throughout the study. In reality, classification probabilities are likely to vary with several covariates. Failure to account for such heterogeneity can lead to biased prediction of species distributions. Here, we present a general multi‐species distribution model that accounts for heterogeneity in the classification process. The proposed model assumes a multinomial generalised linear model for the classification confusion matrix. We compare the performance of the heterogeneous classification model to that of the homogeneous classification model by assessing how well they estimate the parameters in the model and their predictive performance on hold‐out samples. We applied the model to gull data from Norway, Denmark and Finland, obtained from the Global Biodiversity Information Facility. Our simulation study showed that accounting for heterogeneity in the classification process increased the precision of true species' identity predictions by 30% and accuracy and recall by 6%. Since all the models in this study accounted for misclassification of some sort, there was no significant effect of accounting for heterogeneity in the classification process on the inference about the ecological process. Applying the model framework to the gull dataset did not improve the predictive performance between the homogeneous and heterogeneous models (with parametric distributions) due to the smaller misclassified sample sizes. However, when machine learning predictive scores were used as weights to inform the species distribution models about the classification process, the precision increased by 70%. We recommend multiple multinomial regression to be used to model the variation in the classification process when the data contains relatively larger misclassified samples. Machine learning prediction scores should be used when the data contains relatively smaller misclassified samples.


Introduction
Species distribution models are essential ecology and conservation management tools that predict how natural and human factors affect biodiversity (Elith and Leathwick, 2009;Vermeiren et al., 2020).With an increasing amount of biodiversity data from multi-species surveys available to scientists, multi-species distribution models (hereafter mSDMs) have become widely used in analysing these data.These mSDMs model data at the community level, identify the important variables that drive species co-occurrences and predict the distribution of species in a community (Hui et al., 2015;Ovaskainen and Soininen, 2011;Pollock et al., 2014).
However, the biodiversity data obtained from these surveys can be subject to observation errors.These observation errors include false positives (recording absent species as present and misidentifying one species for another) and false negatives (failing to record present species).These false positives are usually a species that has been misclassified.It would be a great advantage if the observation was correctly classified, rather than being discarded once it has been identified as a false positive.The false positives may arise from imperfect classifiers (Spiers et al., 2022;Wright et al., 2020), observer error, and many other sources.In this study, we collectively describe all the false positives and misidentifications as misclassification.We use the term true states in this study to describe the correct or actual observation identity we are interested in modelling.False negatives are mostly accounted for in occupancy models by modelling the imperfect detection in the observation model.Failure to account for or correct these errors leads to biases in inferences about state variables such as occupancy probabilities, covariate effects and relative activity (Clare et al., 2021;Ferguson et al., 2015;Wright et al., 2020), leading to an impairment in decision making (Hoekman, 2021).
The methods to deal with misclassification from biodiversity data can be grouped into data review methods and model-based methods (Clare et al., 2021).Data review methods require complete and proper data collection and processing methods.This process can be very demanding as it is challenging to control for misclassification.This makes the model-based methods more popular when working with large-scale datasets from large-scale biodiversity data vendors like the Global Biodiversity Information Facility (GBIF hereafter;GBIF.Org, 2022).Model-based methods estimate classification probabilities jointly with the true state variables of interest.Model-based methods attempting to account for misclassification in multispecies occupancy models currently include modelling misclassification with detection heterogeneity (Clement et al., 2022;Ferguson et al., 2015;Louvrier et al., 2018), integrating multiple observers records with other methods such as distance sampling and N-Mixture models (Hoekman, 2021), supervised methods with extra information from observation confirmation or verification (Ferguson et al., 2015;Guillera-Arroita et al., 2017), site confirmation (Clare et al., 2021) and other calibrated methods.These methods need extra data from the verification process, which helps in estimating the misclassification probabilities in a semi-supervised setting (Spiers et al., 2022) and makes the parameters in the model identifiable (Guillera-Arroita et al., 2017).The above-mentioned studies have either used verified data collected on the site-level (where the occupancy state of a species is known at a site and not at the individual sample level; Chambert et al., 2018b), on aggregated individual sample level using a multinomial model with site-covariates (Wright et al., 2020), or on individual sample-level validation data which helps in modelling non-species identities (morphospecies) to species identities (Spiers et al., 2022).It is also worth stating that some studies have explored misclassification in abundance (Conn et al., 2013) and capture-recapture (Augustine et al., 2020) models.
Furthermore, these previous studies assumed that the misclassification probabilities are homogeneous (constant) across the study.In reality, the classification probabilities may vary with environmental covariates (such as field conditions; Conn et al., 2013) or observer experience (especially when ascertaining how well each observer classifies a report in citizen science projects will be informative; Arazy and Malkinson, 2021;Johnston et al., 2022), distance from a transect when using transect data (Conn et al., 2013), picture quality, etc.An attempt at modelling the heterogeneity in the classification process would be to model this effect as part of the true state model (such as abundance or occupancy).
However, this approach may not solve the problem of heterogeneity in the classification process since the estimates of the true state observation process parameters only serve as informed priors to the classification process (Spiers et al., 2022).
A more correct approach to model this heterogeneity is adding the covariate effect to the classification process.Some studies on dynamic false positive single-species occupancy models have modelled temporal changes in false positives using year as a covariate (Kéry and Royle, 2020;Miller et al., 2013;Sutherland et al., 2013), showing the possibility to model misclassification trends over time.Our study attempts to model variation in classification probabilities in mSDMs by modelling the probability of classifying a true state identity with a multinomial generalised linear model.To our knowledge, no previous work has been done on this.Failure to account for the heterogeneity in the misclassification probabilities will lead to biased estimates in the parameters of the true state observation process (such as species abundance, richness, and occupancy probabilities) and reduce the model's predictive performance.
Moreover, recent efforts to correctly classify observations from biodiversity surveys have relied on machine learning (hereafter ML) algorithms (Borowiec et al., 2022;Keshavan et al., 2019;Koch et al., 2022;Lotfian et al., 2021;Saoud et al., 2020;Suzuki-Ohno et al., 2022;Willi et al., 2019).These ML algorithms use sounds and/or images of observations to predict the true identity of the individual observations, and they can be trained to mimic expert verification of observations (Keshavan et al., 2019;Langenkämper et al., 2019;Ponti and Seredko, 2022).These ML algorithms use a prediction score (a value that shows the weight of predicting the observations as something else) to predict the possible list of the true identities of the individual reported observations.These prediction scores and a list of possible true identities provide information about the classification process of each observation.They can be used to model heterogeneity in the classification process.
This study is the first to model the heterogeneity in the classification process by using the prediction scores to weigh the distribution of the reported observations and predict the distribution of the actual observation identities.
Here, we present a joint model that simultaneously models the true state variables of interest (such as occupancy and abundance) and the heterogeneity in the classification process.Our model set-up extends the work done by Wright et al. (2020) and Spiers et al. (2022) by a) allowing the classification probabilities to vary with a covariate, b) allowing additional morphostates designations in imperfect classifications (for example, having "unidentified woodpecker" as part of the classified observations), c) using ML prediction scores as weights to account for heterogeneity in the classification process and d) performing variable selection on the classification process covariate to check for potential mSDM overfit.We compared the classification performance of our model with models that assume a homogeneous classification probability done by Wright et al. (2020) and (Spiers et al., 2022) through simulation studies.We parameterise our model with citizen science data on gulls in Norway, Finland and Denmark from iNaturalist (Matheson, 2014) downloaded from the GBIF (GBIF.Org, 2022).

Model framework
The proposed framework assumes a model for where the true state's identity is "present" (we use "state(s)" in this work to refer to taxon identity as well as any other identification category or morpho-states, such as an unidentified group).This true state can be on any taxonomic level.However, the true states are often unknown and treated as latent.An individual is sampled and classified as another state, known as the "reported" or "classified" state, with a probability.These reported states can also be on any taxonomic level or any unidentified group (such as in studies that include morphospecies; Spiers et al., 2022).
We further assume that the individuals are verifiable (and that we have information on the verification process) and that the verified state approximates the true state identity.

True state observation process model
We specify a relative abundance model (specifically a multinomial logit model) for the true state observation process.Our objective is to show how to model heterogeneity in the classification process and not to make inferences about the true state's abundance, so we chose a model that was easier to fit and understand to explain the distribution of the true state.
Let λ is be the average number of s = 1, 2, . . ., S true individuals at location i = 1, 2, . . ., R, which describes the abundance of the true states over space.This means that intensity can be modelled as an inhomogeneous point process (PPM) which assumes that the data are dependent on the environment covariate, or as a log-Cox Gaussian Point process (LGCP) where we assume a spatial dependency in the observed data (Renner et al., 2015).Here, the mean intensity is modelled using the inhomogeneous point process and defined as: where β 0s is the intercept of state s, β js is the effect of covariate j on the intensity of true state s, x ij is the j th covariate that affects the observation of the state at location i and n e is the number of covariates in the observation process model.Note that we assume there are no species interactions in our relative abundance model, and this could have been added as a random effect in the true state intensity definition (equation (1)).
Let p is be the probability of true state s being at location i.We estimate this probability from the mean intensities as follows: where λ is is given by (1).
The true state s = 1, 2, . . ., S at location i = 1, 2, . . ., R is then a realisation from a categorical distribution with probability p is .
This model specification for the ecological process used here is similar to the occupancy dynamics and encounter rate model used by Spiers et al. (2022) by assuming that occupancy probability equals one; and similar to the model used by Wright et al. (2020) by assuming Poisson counts with intensity λ is (refer to Table 3 for the link between our model framework and that of Spiers et al. (2022) and Wright et al. (2020)).

Classification process model
For each observation, we assume there was imperfect classification and that the true state identities can be classified into k = 1, 2, . . ., K states.Assuming the true state distribution represents the true point pattern for the species under study, each reported information can be seen as a draw from any of the K states under consideration with a given proba- Let Ω sk be the probability that a single true state s ∈ {1, . . ., S} is classified as state k ∈ {1, . . ., K}.The probabilities across all the possible k states sum to 1.In studies with homogeneous classification probabilities, the confusion matrix for the classification can be expressed as: where the rows correspond to the true states s and the columns correspond to the reported states k.
We model the heterogeneity in the classification probabilities by fitting a multinomial generalised linear model (Fahrmeir et al., 2013) to each of the rows of Ω defined in equa-tion (3).We refer to this approach as the multiple multinomial generalised linear model (MMGLM, hereafter).We define the linear predictor of the MMGLM as: and in matrix notation: where ω 0 is an S × K matrix of intercepts and ω j is an S × K matrix of covariate j effects, and z ij is the j th covariate that drives the classification process at location i.Using equation ( 4) as the definition for the linear predictor, our estimates of the parameters ω 0sk and ω jsk are identifiable with reference to one reported state.Therefore for each true state identity s, the classification probabilities for each reported state k = 1, 2, . . ., K − 1 with reference to state K is modelled as: where z 1 , . .., z n are covariate vectors.
This general framework has S × (K − 1) × (n c + 1) parameters to be estimated, where S is the number of true states, K is the number of reported states, and n c is the number of covariates in the classification process.Estimating these parameters can be very computationally expensive as the number of true states, reported states and covariates increases and would require significant numbers of misclassified state samples to estimate them.Therefore, we explored simplified forms of the generalised model in equation ( 5).
A simplified case of equation ( 5) assumes that the classification covariate z only af-fects the probability of correctly classifying the states.In this instance, the matrix ω j has its off-principal diagonal elements ω jsk = 0 for s = k for covariate j, and these parameters are not estimated (This is our study scenario "fixed-covariate" in Table 2).This simplification reduces the number of parameters estimated for the classification process by n c × (K − S − 1), where S is the number of true states, K is the number of reported states and n c is the number of classification covariates.A further simplification would also assume that the average probability of correctly classifying the true state does not change with species, that is, ω 0sk is the same for all s = k (This is our study scenario "fixed-intercov" in Table 2).The classification process heterogeneity is captured by the covariate effect ω 1sk for all s = k.The latter further reduces the number of parameters estimated by S − 1.This last simplification is useful especially when the states are very similar, and one would expect that their average classification probabilities would be the same.
Given which true state (s) was present at location i, the reported state at that location is a draw from K states with probability Ω.
where Ω is.refers to the probability of being classified as any of the K states given that the true state identity at location i is state s.

Variable selection
Fitting a complex model with these many parameters can result in an overfitted model.An overfitted model captures the pattern and noise in the training data, but performs poorly on validation or test data (Montesinos López et al., 2022).In some cases, the true state observation and classification process covariates may be highly correlated.These correlated covariates can inflate standard errors (reduce the precision) of the estimated parameters (Caradima et al., 2019;Roberts et al., 2017;Yu et al., 2015).To avoid overfitting the model, there is a need to perform variable selection.In this study, we performed Bayesian variable selection, specifically the spike and slab prior to the classification process covari-ates (for review of Bayesian variable selection see O' Hara and Sillanpää, 2009).For each of the classification process covariates, we re-define the linear predictor in equation ( 4) as: where ψ j is the indicator that variable j is selected with the expected probability q j .The variable selection indicator specified in equation ( 7) jointly selects the variable that affects all the true states in the model but can also be made state-specific (Ovaskainen and Abrego, 2020).Probabilities q j closer to 1 indicate that the variable contributes much to the model and should be selected, and those closer to 0 indicate that the variable contributes less and can be discarded.

Classification process
Heterogeneous Variable / covariate Variations in the MMGLM for the classification process defined by equation ( 7) and observation process defined by equation ( 1) used as our study scenarios.There are three heterogeneous models: covariate, fixed-covariate and fixed-intercov, and three homogeneous models: intercept, constant, and main.

Modelling heterogeneity using ML prediction scores
Some studies give weight to the true identities of the reported observations or state, for example, because they use machine learning to classify the observation.These prediction scores (such as F 1 score, mean square error, and logarithmic loss) are not classification probabilities but are values that indicate how well the ML algorithm classifies data in the test sample.We can use this information to model the heterogeneity in the classification process and predict the true state identity as follows: where λ ik is the average abundance of reported state k at site i and w iks is the predictive score of the k th reported state to true state s at sampling unit i.
It must be noted that this approach is a non-parametric approach to account for heterogeneity in the classification whereas the MMGLM is a parametric approach.Moreover, modelling the heterogeneity in the classification process using ML prediction scores models the covariate effects on the expected abundance of the reported states and corrects them using the prediction scores as weights to obtain the relative abundance of the true states.However, the MMGLM models the covariate effect on the expected abundance of true states and estimates the heterogeneity in the classification process using a parametric model.The prediction of the true state identity is done by weighing the expected intensity of true states with the estimated classification covariate.

Generalisation of model framework
The proposed framework generalises the existing mSDMs that account for misclassification in occupancy models.Wright et al. (2020) provided a framework to account for the homogeneity in the classification process, and our model is connected to this by using the relationship between the multinomial and the Poisson distribution (Steel et al., 1953) for the observation process as well as using a species by species constant model for the classification process (Table 3).Wright et al. (2020) further provided arguments that their proposed models were generalised forms of models for the binary detection of two species (Chambert et al., 2018a), single species with count detections (Chambert et al., 2015) and single species with binary detections (Chambert et al., 2018b).Since our proposed framework can be seen as a heterogeneous version of Wright et al. (2020), our framework is also a generalisable form of the models in Chambert et al. (2018aChambert et al. ( , 2015Chambert et al. ( , 2018b)).Spiers et al. (2022) provided an individual-level semi-supervised approach that estimates species misclassification with occupancy dynamics and encounter rates, and our model is connected to this if we assume a homogeneous classification process (that is ω 1 = 0), and assume that the occupancy probability is 1 (Table 3).

Simulation Study
To demonstrate how our proposed model works and its use in prediction, we performed a simulation study for S = 2 true states, and K = 3 reported states over R = 1000 sites.We simulated two covariates for the true state observation process and one for the classification, all from a Normal distribution with a mean of 0 and variance of 1.The true intensity was simulated from equation (1).The intercepts of the model for the two true states were chosen as β 01 = −1, β 02 = 0 and covariate effect for each true state was chosen as β 11 = 4, β 12 = −2, β 21 = 0 and β 22 = 0 (that is, state 2 is used as a reference category).The intercept and covariate effect for the classification process was chosen as follows: These values were chosen to obtain significant sample sizes of misclassified states.
We simulated 200 datasets with a heterogeneous classification process using the variable model in Table 2 (referred to as the "full model"), 200 with a homogeneous classification process by assuming ω 1 = 0 (a matrix of zeros) in equation ( 5) (referred to as "reduced model") and another 200 with the covariate effect for the classification process (using the True state observation process: relative abundance Choose ψis = 1, then zis = 1 for all locations i and true states s. [Vijs|zis = 1] ∼ Poisson(λijs) using the relationship between multinomial and Poisson distribution (Steel et al., 1953).
Classification process: where Ω sk ∼ Dirichlet(α) Classification process: where Ω sk can be chosen as any of the homogeneous models described in Table 2 Spiers et al. ( 2022) True state observation process: occupancy dynamics and encounter rates zist ∼ Bernoulli(ψist) True state observation process: relative abundance Choose t = 1, ψist = 1, then zist = 1 for all locations i and true states s.
[Vijs] ∼ Categorical Classification process: where Classification process: where Ω sk can be chosen as any of the homogeneous models described in Table 2.  (Wright et al., 2020), it is safe to say that they are also generalised forms of Chambert et al. (2018aChambert et al. ( , 2015Chambert et al. ( , 2018b)).
variable model in Table 2) that is correlated to the true state observation process covariate (referred to as "correlation model").The first simulated dataset explored modelling heterogeneity's effect on the classification process, whereas the latter explored the effect of having correlated covariates for the classification and true state observation process.
The second assessed the effect of overfitting the classification process model (adding heterogeneity to the classification process when it should be homogeneous).Moreover, we assessed the impact of the number of misclassified samples on the mSDMs predictive performance.We increased the principal diagonal components of ω 0 by 6 to obtain a reduction in the number of misclassified samples simulated.
We randomly withheld two hundred of the true state identities for each dataset simulated as our validation sample.The number of validation samples was not varied since Spiers et al. (2022) found that the number of validation samples had a modest effect on the model's predictive ability.We fitted the model under the various scenarios described in Table 2 to the data and evaluated the model's predictive performance on the validation sample.

Case study: Gulls dataset
The proposed model was used to analyse gull dataset downloaded from GBIF (GBIF.Org, 2022).The database hosts over two billion occurrence observations with well over a million observers (website visited on 17th February 2023).We were interested in the iNaturalist records since they have community verifications (Matheson, 2014).The observers collected these occurrence records and uploaded their observations with images and/or sounds that allowed for verification.The reported observations go through iNaturalist community verification and are accepted as research grade when two-thirds of the community agreed to the taxon identification (Ueda, 2020), at which point they are published on GBIF.We assumed that the community-accepted taxon name is the true state V.
We checked the iNaturalist platform to track the identification process of the observations and use the first reported identification as the reported state Y.
We obtained observations for some species of gulls in Denmark, Finland and Norway from 2015 to 2022.Specifically, we selected great black-backed gulls (Larus marinus), herring (Larus argentatus), common gulls (Larus canus) and lesser black-backed gulls (Larus fuscus) because the iNaturalist website reported that these species are commonly misclassified as the other.Any other species reported apart from the above-mentioned species were labelled as "others".We used annual precipitation (accessed from the raster package; Hijmans et al., 2015) as true state observation process covariates in the model since it has been noted in some literature to affect the distribution of sea birds such as gulls (Algimantas and Rasa, 2010;Jongbloed, 2016).
The data obtained were presence-only records.Exploratory analysis revealed that there were no multiple observations at the same location for our selected species.We, therefore, assumed that our locations were discrete and treated the data as a marked point process, where the species that was reported at a location is given a value of 1 and 0 for the other species in this study.If we had a species list, then we could have modelled it as a repeated marked point process at the same location and treated the sites as a random effect.
Citizen science data are known to be affected by several sources of bias.Some common biases are spatial bias (Johnston et al., 2022;Tang et al., 2021), and misclassifications (Johnston et al., 2022;Tulloch et al., 2013).We only accounted for the misclassifications in this study, as we are interested in explaining the classification process and not making inferences about the abundance of the gulls.Citizen scientists have been reportedly known to correctly classify species as they gain experience reporting the species (Vohland et al., 2021).We, therefore, modelled the variation in the classification process by using the number of reports made by each observer as a covariate in the classification process.If an observer has more than ten observations, the extra number of observations was calibrated at 10.We used this number of observations for an observer as a measure of experience (Johnston et al., 2018;Kelling et al., 2015), although there are other indices for measuring effort or experience of the citizen scientist (Santos-Fernandez and Mengersen, 2021;Vohland et al., 2021).
We also used ML algorithm to obtain prediction scores (specifically F 1 scores) for the possible true state identity of our downloaded data.The ML algorithm was a Convolutional Neural Network (a modified form in Koch et al., 2022) trained with data from all citizen science observations of any species in Norway.Since the ML algorithm is trained with all birds data from GBIF in Norway, we trained all our six study scenarios models summarised in Table 2 with all data for our selected gull species in Norway and all data reported before 2022 in Finland and Denmark and used all data reported in 2022 in

Fitting and evaluating the model
We ran all the analyses using the Bayesian framework using the Markov chain Monte Carlo approach in the NIMBLE package (de Valpine et al., 2017) using the R software (R Core Team, 2022).We chose the priors for all true state observation parameters from a normal distribution with a mean of 0 and standard deviation of 10, and we chose the priors for the classification process parameters from Normal distribution with a mean of 0 and standard deviation of 1.For the scenarios: constant and main model, we chose the priors of the confusion matrix (Ω) from the Dirichlet distribution with parameter alpha (α), which has a prior of an exponential distribution with mean 1.
We ran 3 chains, each with 10000 iterations; the first 5000 iterations were chosen as the burn-in.We check the convergence of the models by visually inspecting the trace plots and ignoring models with a Gelman-Rubin statistic (Brooks and Gelman, 1998) value greater than 1.1.We kept a fifth of the remaining samples in each of the chains.
We used accuracy (the proportion of predicted true states identities from all the predictions of the validation samples), precision (the proportion of mismatched true states in the validation sample that were correctly classified from the predictions) and recall (the proportion of correct true states in the validation sample retrieved from the predictions) as performance metrics.We used a Bayesian approach and got the posterior distributions for the parameters.Higher values of the validation metrics indicated the preferred model.
We also checked how well the model estimated the true state observation process parameters (β 0 , β 1 and β 2 ) and the classification process parameters (ω 0 and ω 1 ) by estimating the bias (difference between the true value and the estimated value) and precision of the parameters.

Results
3.1 Simulation study

Predictive performance
We illustrated the gain in model performance by using the accuracy, recall and precision of our model's predictions (Figure 1 A and B).When data was simulated from the full and correlated models, there was a strong indication that the predictive performance of mSDMs improved when the variability in the classification process was included.That is the "variable" model performed best for the full and correlation models with the highest accuracy, recall and precision values (Figure 1 A (i -ii), B (i -ii)).The simplified heterogeneous models (fixed-intercov and fixed-covariate) however did not perform any better than the homogeneous models (Figure 1 A (i -ii), B (i -ii)).This suggested that simplifying the heterogeneous classification model did not yield any improvement in predictive performance, and the heterogeneous model that captures the entire variability (in this case variable model; Table 2) would be the best predictive model.When the classification process covariate was modelled as part of the observation process (main model), the model's predictive performance also performed similarly to the homogeneous models (Figure 1 A and B).
Overfitting a homogeneous classification process with a heterogenous one did not have any effect on the mSDM's predictive performance (Figure 1 A iii) and B iii)).We expected the overfitted heterogeneous models to have poor predictive performance (Montesinos López et al., 2022), but the heterogeneous and homogeneous performed similarly (with equal recall, accuracy and precision across all six study models).The Bayesian variable selection probability indicated that the homogeneous classification model was better (with the probability of including classification covariates in heterogeneous models being 0.359 ± 0.012; Supplementary information 2; S2-1).Although the simplified heterogeneous models did not yield improvement in predictive performance, they performed similarly to the variable model in the variable selection process.
FIGURE 1: Boxplot of validation metrics (accuracy, precision, recall) from the six study models defined in Table 2 on the two hundred (200) withheld samples out of the thousand (1000) samples simulated in each dataset.Accuracy is the proportion of withheld samples that were correctly classified, recall is the proportion of correctly classified samples that were retrieved from the withheld samples, and precision is the proportion of the misclassified samples that were correctly classified.Each boxplot shows the median and the interquartile range (25 -75% quartiles).Each column shows the type of model used to simulate the dataset: "full" refers to using the variable/covariate model in Table 2, "reduced" refers to using the intercept model in Table 2 and "correlation" refers to using the variable model in Table 2, but with correlated true state observation and classification process covariates.The rows correspond to changes made to the number of misclassified samples in the simulated dataset: "Baseline" refers to using the values defined in section 2.4 and "Decrease" refers to reducing the number of misclassified samples by diagonal elements of ω by 6 as described in section 2.4

Effect of number of misclassified samples
As we increased the number of misclassified samples in our simulated data, the precision increased by on average 30% and accuracy and recall reduced by 6% (Figure 1 A (i -ii), B (i -ii)).This decrease in accuracy and recall could be attributed to the reduction in the number of correct classifications in the simulated data as the number of misclassified samples increased.Moreover, the classification parameters were estimated better when the number of misclassified samples was higher, leading to the high precision of predictions (Supplementary information 2; S2-1).This suggested that our proposed model will be beneficial when one has a substantial number of misclassified samples.

Bias in observation and classification process parameters
Although failure to account for misclassification in mSDMs can result in biased true state observation process parameters (Spiers et al., 2022;Wright et al., 2020), any method used to account for misclassification in mSDMs has a small effect on the accuracy and precision of the true state observation process parameters.The bias of the observation process parameters was consistently low for all six models, and the coverage was higher for all the scenarios under the full and reduced model (Supplementary information 2; S2-1).All the scenarios studied accounted for misclassification of some sort, thereby correcting for the bias in the observation parameters estimates (Spiers et al., 2022;Wright et al., 2020).
The classification process parameters were estimated more accurately and precisely for the variable model than the other models (Supplementary information; S2-1).This was only possible in the case where we had enough misclassified samples.This suggests that if the objective of a study is to predict true state identity with mSDMs, then modelling the full heterogeneity can improve predictive performance; if the aim is inference on true state distribution, then heterogeneous models may not provide any advantage over homogeneous models.

TABLE 4
Validation metrics of the models under study on the withheld gull dataset.The accuracy is the proportion of correctly classified validated data, the precision is the proportion of mismatched identities that were correctly matched and recall is the proportion of correctly matched identities that were recovered.The number of validated samples was 384 out of which 10 were mismatched.

Case study: Gull dataset
All six study scenario models performed equally well regarding their predictive performance with high accuracy and recall but smaller precision (Table 4).The poor precision value could not be attributed to the insignificance of the classification covariate (observer experience) in explaining the heterogeneity in the classification process since the variable selection probabilities are closer to 1 (Table 4) but to the small misclassification sample sizes (Supplementary Information 1; S2-2, S2-3).However, the precision increased from 10% to 80% when the heterogeneity in the classification process was accounted for by using the prediction scores from the Machine learning algorithm (Table 4).The ML algorithm's prediction scores were individual observation-specific which provided direct information to the observation process model.However, the six classification models depended on the misclassified sample size to capture the heterogeneity in the classification process.This suggests that one remedy to improve mSDM's predictive performance for data with very small misclassified samples is to use ML weights to account for heterogeneity in mSDMs.
Although the study scenario models had smaller precision, it was observed that the probability of correctly classifying the gull species in Denmark, Finland and Norway increased with the experience of the observer (Figure 2).The pattern showed that observers have a high chance of making mistakes on their first few reports, and they get better as the number of reports increased (Vohland et al., 2021).

Discussion
The main objective of this paper was to propose a general framework to account for misclassifications from imperfect classifications (such as those from surveys) and uncertain classifications (from automated classifiers) in mSDMs.This work builds on previous work by Spiers et al. (2022); Wright et al. (2020) by accounting for the heterogeneity in the classification probabilities while allowing the classified categories to be more than the verified species (such as unknown species, morphospecies etc.).Moreover, we assessed the effect of overfitting a homogeneous classification process on the predictive performance of mSDMs and provided ways of checking the overfitting of the classification process model.
Our study bridges the knowledge gap in the literature on accounting for misclassification in mSDMs by modelling the heterogeneity in the classification process.False positives are inevitable in biodiversity data (Kéry and Royle, 2020;Miller et al., 2013).
To model these false positives in this study, we presented the true state observation process as one model and the classification process as another model in a hierarchical form.
We model the classification probabilities for each true state identity as a multinomial generalised linear model.This specification generalises the modelling of the classification process to model effects of covariates as fixed or random effects or both.For example, one can estimate the classification probabilities of each observer in volunteer-collected data by assuming a random observer effect.This formulation for the classification process also mitigates the modelling problems of using the Dirichlet distribution as the prior for the classification probabilities (Spiers et al., 2022).
Furthermore, the specification of a separate state-space model for the true state observation process in the proposed framework allows the use of various multi-species observation models (such as joint species distribution models (Ovaskainen and Abrego, 2020;Tobler et al., 2019), Royle-Nichols model for abundance (Royle and Nichols, 2003), amongst many others) to model the distribution of the true states.Our simulation study showed that the proposed model framework could estimate the true state relative abundance parameters (with the bias of estimated parameters close to zero; Supplementary information 2; S2-1), an observation noted in previous studies that use observation confirmation design to model misclassification in mSDMs (Kéry and Royle, 2020;Spiers et al., 2022;Wright et al., 2020).We have shown that the true state observation process model is a simplified form of occupancy and abundance models that account for misclassification (Table 3), so we believe our proposed framework can be extended to any design used to collect and verify data on the true states (for example, point processes, distance sampling, site-confirmation (multi-method) design, etc.), and any model used to fit the data (for example, multi-state occupancy model (Kéry and Royle, 2020), joint species distribution models (Ovaskainen and Abrego, 2020;Tobler et al., 2019)).
Modelling the true state observation process with more complex models than the relative abundance models used in the study would add another level of hierarchical structure to the true state observation process (for example modelling detection probability or true occupancy state).This complexity could introduce confounding of the classification process and observation process parameters and with frequentist estimation approaches make the likelihood multi-modal (Kéry and Royle, 2020).This study did not explore such issues and further work can be done on this.A possible solution in the Bayesian framework to avoid such confounding issues would be to model the different processes with separate covariates, choose a good prior for the mSDM parameters, and use repeated survey visit data to model the true state observation process (Kéry and Royle, 2020).Moreover, the identifiability or confounding issues could also be tackled by using data with much information on detection and false positive detections such as those derived from acoustic surveys (Clement et al., 2022) and integrating occupancy data that are not susceptible to misclassifications such as those from camera traps to those with misclassifications (Kéry and Royle, 2020).
Accounting for the heterogeneity in the classification process increases the predictive performance of mSDMs.The homogeneous classification models may sometimes be unable to explain the variation in the classification process (Conn et al., 2013), leading to poor model predictive performance as a result of overfitting (Montesinos López et al., 2022).The simulation study showed a 30% increase in precision and a 6% decrease in accuracy and recall when the heterogeneity in the classification process was accounted for in the mSDMs (Figure 1 A, B).However, there was no change in predictive performance when a heterogeneous classification model overfitted a homogeneous classification process (Figure 1 A, B) due to the classification covariate effect size, observed from the bias of parameter estimates and low Bayesian variable selection probability (Figure 1 A, B).Since the predicted posterior probability for the true state's identity heavily relies on the weights from the misclassification probability (Supplementary Information 1), failure to account for that would mean our posterior probability would be incorrectly estimated.
The incorrectly predicted probability would lead to the underestimation of the prediction of the ranges of coverage and possibly abundance in the true states (Molinari-Jobin et al., 2012).
Fitting a more complex true state observation model with the covariate that explains the heterogeneity of the classification process does not provide enough information to improve the mSDM's predictive performance.Previous studies have shown that the estimates of the observation process model inform the estimation of the classification probabilities (Spiers et al., 2022), but the variability in the classification process cannot be inferred from variability in the observation process model (Figure 1 A (i -iii), Supplementary Information 1).Ecologists should therefore model the variability in the classification in its process model to gain the advantage in the mSDMs predictive performance.
Our model was parameterised with volunteer-collected gull data.These volunteercollected data have several sources of bias in their generation, such as spatial bias, and misidentification of species, among many others.We acknowledge that all these sources of biases may be present in the data, but we only modelled the misidentification of species by using the number of previously collected data as a proxy measure for the observer's experience in the classification process model.The predictive performance of the homogeneous and heterogeneous models was approximately the same due to small mis-classified samples (19 misclassified out of 1382 samples in training data (Supplementary Information S2-2) and 10 misclassified out of 378 samples in validation data (Supplementary Information S2-3)).the estimated covariate effect gives an idea of how the experience affects the probability of classifying a new observation.Specifically, the probability of correctly identifying the correct species increases with the observer's experience, as is noted in some literature (Johnston et al., 2018;Kelling et al., 2015;Santos-Fernandez and Mengersen, 2021;Vohland et al., 2021).Therefore, there is a trade-off between the model's ability to correctly classify mismatched data (precision) and understanding the covariate's effect driving the classification process when there are relatively small misclassified samples.
The inclusion of ML prediction scores in the mSDMs to account for the heterogeneity in the classification process increased the precision of our predictions by 70% (Table 4).
These ML prediction scores are observation specific and provide much information about the classification process to increase the precision of the model.The information from the ML does not depend on the misclassified sample sizes but on the quality of the images (Koch et al., 2022), making them advantageous to use in accounting for heterogeneity in the classification process when misclassified sample sizes are small (like we have in our gull data).
The proposed model framework in this study is flexible and can be generalised into any species distribution model and integrated distribution model.The framework proposed fits into the frameworks provided by Wright et al. (2020) and Spiers et al. (2022) and any framework their study generalises.Our proposed classification process model, MMGLM, improved the predictive performance of mSDMs, but it heavily relies on the misclassified samples size.Furthermore, the confusion matrix defined in the model framework allows for the classification of different taxonomic groups, as opposed to just the species by species confusion matrix in Wright et al. (2020) and including morphospecies in the classification categories (Spiers et al., 2022).This will make it possible for citizen science data analysts to account for the misclassification of data at any level in the data collection process.We recommend that variable or model selection is performed during the analysis to check for overfitting.Moreover, ecologists should explore using ML prediction scores True state observation process: absolute counts zis ∼ Bernoulli(ψis) [Vijs|zis = 1] ∼ Poisson(λijs)

Finland
and Denmark as our validation sample.The summary of the classifications (true and false positives) in the training and validation sample is presented in Supplementary information 2 (S2-2, S2-3).

FIGURE 2 :
FIGURE 2: Summary of results from the model fit to gull dataset showing A) Probability of correct classification for the common (Larus canus), herring (Larus argentatus), great black-backed (Larus marinus) and lesser black-backed gulls (Larus fuscus) and B) the distribution of the experiences of the observers used in the modelling.The ribbon around the correct classification probability estimates represents the 95 % credible interval of the estimates.

TABLE 3
Spiers et al. (2022)oposed models from homogeneous classification process studies done byWright et al. (2020)andSpiers et al. (2022).The table specifies the true state observation process model for Wright et al. (absolute abundance model), Spiers et al. (occupancy dynamics and encounter rate model) and ours (relative abundance model); and also the classification process model for Wright et al. and Spiers et al. (homogeneous classification process with classification probabilities simulated from Dirichlet distribution) and ours from heterogeneous models described in Table 2. Since our framework extends the work done by