Generalized additive models for categorical count data: An exploration of the decline of queen triggerfish Balistes vetula in the Bahamas and Turks and Caicos

Citizen science is growing in importance for ecosystem management and long-term monitoring.

underwater or distance traveled. This data has proven valuable for determining the efficacy of marine protected areas, and for monitoring population trends (Gravem et al., 2020;Pattengill-Semmens & Semmens, 2003). However, several features of the dataset make fine-scale analysis difficult. Data collection is opportunistic, so the dataset contains many variables that are strongly correlated to the particularities of the sampling, such as depth of survey and bottom time, which makes hypothesis testing difficult. Data is also collected by many surveyors of differing levels of experience in identifying different species, which adds an additional layer of error to the dataset. These kinds of problems have been addressed in analyses of other large opportunistic datasets (e.g., eBird, Johnston et al., 2021).
Current best practices for citizen science rely on statistical models (Johnston et al., 2021). Previous modeling has incorporated REEF presence-absence information into larger model-training sets, or combined information from multiple surveys and external count information to create abundance indices (Campbell et al., 2022;Grüss et al., 2018;Montecino-Latorre et al., 2016;Tolimieri et al., 2017).
Due to the abundance and diversity of reef fish, the abundance of each encountered species is difficult to estimate while diving. This led to the creation of a categorical count system that allows divers to mark whether they observed a single individual, a few individuals (2-10), many individuals (10-100), or abundant individuals (100+). This data also contains zeros, which imply that a species was not present or that the diver was unable to identify or did not see a species. While Single-Few-Many-Abundant (SFMA) categories are representative of actual abundance, they are condensed into a single categorical value. REEF divers record information in the form of a species checklist, allowing them to "check off" each species positively identified and mark its abundance category on a dive. For each checklist (i.e., survey), the date, time, and environmental variables are recorded, and REEF staff members internally mark the surveyor's experience level on each survey when the data are entered into the database.
REEF also records a measure of surveyor experience. Surveyors move from novice to experienced by completing a specified number of surveys within a region and taking an identification quiz. Existing methods infer a mean abundance from REEF categories aggregated across multiple surveys, but do not account for information from confounding variables such as dive duration (Campbell et al., 2022;Tolimieri et al., 2017;Wolfe & Pattengill-Semmens, 2013a). At the survey level, REEF data can be interpreted as binary (presenceabsence) or multinomial (by categories), although models based on these distributions omit much of the information on the abundance that is captured by SFMA categories.
One way to reduce errors produced by a lack of observer skill is to examine species that are easily identifiable. The Queen Triggerfish (Balistes vetula) is an easily identifiable species with purple markings along its body, ornate fins, and a distinct body shape ( Figure 1). The Queen Triggerfish is also large enough to be easily observed by a roving diver, which alleviates some concern about surveyors being unable to identify a species and makes it a good candidate for a model organism in citizen science. While the Queen Triggerfish is of minor economic importance, it is a food fish locally, and few fisheries-based or fishery-independent data sources include this species. Artisanal fisheries have led to local reductions in population elsewhere, and the species is listed as near threatened by the IUCN red list because of depletion in part of its range (Liu et al., 2015;Sagarese et al., 2018). No data from The Bahamas and Turks and Caicos were used in evaluating Queen Triggerfish for listing ( Figure 2).
Our objective was to determine if trends in the presence and abundance of Queen Triggerfish in The Bahamas, Turks, and Caicos could be estimated by accounting for variability in sampling details of REEF surveys (duration, experience, time of sampling).
The method includes two steps, an imputation of the number of fish observed in each survey based on counts in each size category. and then a Generalized Additive Model (GAM) to predict mean abundance in each survey. Many predictor variables affect abundance nonlinearly, so linear regression is likely inappropriate.
Thus, Generalized Additive Models were used because they allow for a non-linear relationship between predictor variables and the response variable, estimated through penalized smoothing functions. This family of models has already been applied to binomial data, but has not been applied to abundance information contained in REEF (Grüss et al., 2018). Like linear models, GAM predicts mean abundance for each observation as a function of predictor variables. This two-part estimation method was tested in simulation, and then applied to the Queen Triggerfish, to confirm that it worked in practice with covariates collected by REEF. Without the ability to account for survey covariates, changes in observed abundance could also be caused by changes in underlying survey conditions. These methods allow results to be more robust and are closer to best practices for other semi-structured citizen science initiatives (Johnston et al., 2021). Although the use of abundance information may not be necessary for all species because patterns in abundance and presence-absence are likely to be similar, abundance is more information-rich, and for some species, can be more informative (Joseph et al., 2006). F I G U R E 1 An adult Queen Triggerfish (Balistes vetula) observed by the author on a reef slope in Belize on 2/19/2014.

| Imputation of categorical data
The number of observations in adjacent abundance categories was used to estimate the arithmetic mean of each category as if the sampled distribution was log-normal (Wolfe & Pattengill-Semmens, 2013a, 2013b. These averages were then used to estimate average abundance over a particular group of surveys. In REEF data, four abundance categories are non-zero: S = number of observations of lone organisms, F = number of observations of 2-10 organisms, M = number of observations of 11-100 organisms, and A = number of observations of >100 organisms. The mean abundance in each category was estimated as follows: Where AverageF, AverageM, and AverageA are estimated average counts in the F, M, and A categories, and MeanAbundance is the average count per observation. Because S is the number of solitary fish observed (i.e., average = 1), S is used directly in equation 1d. This method was derived by calibrating the average value for each category to actual count data that was taken concurrently with SFMA counts in California and creating a simulated confidence interval for the mean (Wolfe & Pattengill-Semmens, 2013a, 2013b. For our analysis, instead of aggregating scores to estimate mean abundance across multiple observations, values of each survey were imputed as the mean of its reported abundance category (i.e., AverageF, AverageM, or AverageA). These averages were calculated over the entire dataset so that each observation of the same SFMA category was given the same value. This method does not incorporate errors introduced by the process of imputing abundance from REEF categories. Substituting the mean for each category is a very simple method, but produces a y variable that is appropriately scaled to convey information about the relative abundance of observations. Only 10 observations were of the abundant category of Queen Triggerfish in REEF data and were omitted due to missing information in important covariates. Therefore, models were trained with only sub-100 counts, and abundant counts were not considered. Averages were calculated over the entire dataset so that each observation of the same SFM category had the same value: AverageF = 3.234952 and AverageM = 12.518702.

| Simulated data
To examine potential bias in applying a negative binomial model to imputed count data from categories, a simulated dataset was created for model comparison in R (R Developement Core Team, 2020).
Parameters were selected to mimic data for Queen Triggerfish used in this study ( Figure 3). The Few and Many categories were slightly overrepresented in simulated data compared to actual data for Queen Triggerfish, and the abundant category was missing as in the actual data. Data were simulated by randomly generating a set of variables (x), with sample sizes ranging from 100 to 10,000, where x was taken from a uniform distribution from −3 to 3, and then generating count data (y) from a negative binomial distribution with mean = e x , and size = 10. Samples of 100, incremented in size by 100 (ranging from 100 to 10,000), were used to determine how model performance changed with the amount of training data.
Each simulated data set resulted in abundance data that could be described by a negative binomial model with one independent xvariable of slope = 1 and intercept = 0 when modeled with a log link Generalized Linear Model (GLM). The negative binomial distribution is often applied to over-dispersed count data of organisms, such as citizen science data (Johnston et al., 2021). Simulated count data was then coerced into 0-SFM categories by replacing counts between 2 and 10 with F, and counts between 11 and 100 with M.
Negative binomial GLM was then applied with four versions of abundance data. In the first GLM, the y variable was the original uncategorized data, for comparison to models fitted to abundance imputed from SFM categories. In the second GLM, y was abundance data after being coerced into SFM categories and then converted back to abundance with imputed category means (Equation 1). In the third model, abundance (y) was imputed as minimum possible values for each observation category (i.e., F = 2, M = 11, A = 101).
In the fourth GLM, an alternative exponential imputation method was used (Wolfe & Pattengill-Semmens, 2013a, 2013b, in which y = 5.73^(DEN-1)^1.28, where DEN was an index (S = 1, F = 2, M = 3, and A = 4). The mean method described in Equation 1 was considered to be the best (Wolfe & Pattengill-Semmens, 2013a, 2013b, whereas the other two methods were used to evaluate how much the imputation method influenced the imputed trend. All models were fit using the MGCV package in R, with the x variable assumed to have a linear relationship with the log link mean (Wood, 2017). Categorized and uncategorized model predictions were compared to true uncategorized count data to measure how wellcategorized data tracked true underlying abundance trends. True and predicted data were compared with root mean square error (RMSE), R-squared (R 2 ), and average error: Where N is the number of values, y i is the i th recorded value, y ′ i is the i th predicted value, and ‼ y is the mean value of the variable.
Metrics were calculated with both y i and y ′ i as either uncategorized abundance counts or means imputed from categories to evaluate how much error was introduced by using an imputed mean rather than counts. Categorized model predictions were also compared to categorized counts using the fraction of predictions that were in the correct abundance category. Code Appendix S1 contains code used to simulate categorized values, fit models, and calculate metrics.
After metrics were calculated for each simulated sample and model, the mean and standard deviation among samples were calculated for each metric.

| Fitting to REEF data
REEF data was used from 1994 through 2020 in The Bahamas and Turks and Caicos ( Figure 2) (REEF). Data manipulation was in R using RStudio with the tidyverse, measurements, lubridate, and lunar packages (Birk, 2019;Grolemund & Wickham, 2011;Lazaridis, 2020;R Core Team, 2020;RStudio, 2022;Wickham et al., 2019). Most surveys were collected by novice REEF surveyors, so data were not excluded based on experience. This was justified by the use of an easily identified model species, and the inclusion of experience as a binary parameter in the model. Surveys for which depth, visibility, habitat, current, and start time were recorded as zero were excluded because zero reflected missing data for most variables. For example, the lowest depth category recorded by REEF (a snorkel) is recorded as a 1, not a 0. For Start Time, although dives can begin at 00:00, an anomalous number of dives began at that time compared to other times late in the evening, so we assumed they were data entry errors. Surface and bottom temperatures below 10°C were well below Bahamian climatology, so were excluded. Bottom times <10 min and longer than 100 min were excluded. This was done based on work done with eBird data which standardized survey efforts in a similar fashion (Johnston et al., 2021). This filter removed <3% of the available surveys and so effectively removed durations for which high precision estimation was not feasible. Average depth, current, and visibility were collected by REEF as ordered categories (Appendix S1). Dives deeper than 24 m were combined into one category due to the small number at that depth. They were not excluded because the ordinal parametric fit used to model categories was capable of fitting to a category for dives of a certain depth or greater. Habitats with <100 observations were excluded since the precise estimation of occurrence within those habitats was not possible. After data cleanup, all observations of abundant (>100) Queen triggerfish contained several data entry issues, were outliers (10 of 18,345 total observations), and were therefore excluded from the analysis. The original dataset of 18,345 observations was reduced to 7380 after observations with missing values were omitted (Table 1; Appendix S1).
Lunar phase in radians and decimal day of the year were calculated from the date recorded for each survey. Queen Triggerfish, like several other large reef fish, aggregate seasonally for breeding during the full moon , the only known aggregation by this species. To accommodate this behavior, a Boolean variable (0,1) was used to represent whether or not a survey fell on a full moon during peak breeding activity from November to March. The full moon was defined as the quarter of the lunar cycle centered around the full moon (approximately 4 days before and after).
Variables were selected for inclusion in models using Akaike Information Criterion (AIC). AIC selection was used for selecting terms to reduce overfitting because models were fit to the same dataset (Hastie et al., 2009). The bidirectional selection was used, starting with backward selection, and then testing if adding any variables back to the final model improved the AIC score. The model with the lowest AIC was selected. After AIC selection, the selected model was fit to the entire valid dataset for selected variables.
Variables included: Visibility, Depth, Bottom time, surface temperature, latitude, longitude, current, breeding vs non-breeding, moon phase, decimal day of the year, decimal year, surveyor experience (Expert or Novice), and start time. Cyclical variables were fit using a cyclic cubic spline, and other numeric variables were fit using thinplate regressor splines. Latitude and longitude were incorporated into the model using a tensor product smooth, and the breeding variable was treated as an interaction variable on the latitude and longitude smooth because different locations were expected to exhibit different trends in abundance during spawning aggregations.
REML was used as the parameter estimation method. The default basis dimension parameter K was used for smoothing functions, because exploratory fits with larger K did not change the statistical significance of the model terms, but increased run time from a few minutes to several hours. The gam.check function was used to inform k selection, and while several k values were highly significant, increasing k values past 100 did not improve values for gam.check.
The overall form of models was as follows: y ~ experience + s(moon phase, bs = "cc") + current + averaged depth + habitat + s(bottom temperature, bs = "tp") + s(decimal day of year, bs = "cc") + s(date, bs = "tp") + s(bottom time, bs = "tp") + s(start time, bs = "cc") + te(latitude, longitude, bs = "tp", by = vetula breeding dummy variable). To determine population trends for Queen Triggerfish, binomial and negative binomial models were fit with the year (1994-2020) as a categorical variable for testing the significance of a potential temporal trend in presence and abundance. The anova.gam function in mgcv was used to determine whether the linear trend was significant, based on approximate p-values, which were sufficient for our analysis (Wood, 2017). Confidence intervals on relative abundance were calculated using 95% confidence intervals for each year from the GAM, applied to abundances imputed from categories. These confidence intervals were produced based on the assumption made by the GAM that it is fitting to negative binomial observations, and do not account for the error introduced through the imputation of categorical data. TA B L E 2 Goodness of fit (R 2 ), root-mean-square error (RMSE), average error, final slope, and final intercept of negative binomial model metrics for simulated data (100-10,000 data points), from comparing model predictions to imputed categorical and original uncategorized data. Simulation parameters were set to imitate observations of Queen Triggerfish (Balistes vetula) in The Bahamas and Turks and Caicos during 1994-2020, based on marine citizen-science data (Reef Environmental Education Foundation, REEF). The final slope and intercept (with standard error) for final models were fit to the dataset of 10,000 points (expected to be 0 and 1, respectively).

| Simulated data
Models fit to categorized data performed worse for all metrics than models fit to original data ( Table 2). The exponential model had an overwhelmingly negative R 2 and large RMSE and was by far the worst-fitting model. The three best models converged to an asymptotic range of values after <5000 datapoints (Figure 4). Fitted metrics converged relatively quickly, so the additional variance in those metrics, relative to the uncategorized fit model, was primarily due to error of the imputation method rather than a variation in sample size (Table 2). Average error of mean value imputation was closest to zero for all categorical imputed models and resulted in the highest R 2 . The R 2 of mean imputed model predictions relative to uncategorized data was also underestimated by R 2 of predictions relative to the value imputed into the model. The reverse was true for minimum value imputation. Minimum value imputation and mean value imputation both had a similar slope, although minimum value imputation was shifted down with a negative intercept. The mean value method was slightly overestimated, and the minimum value method severely underestimated the original uncategorized data.

| REEF data
The AIC selection process left the binomial model with 11 variables and the negative binomial model with 9 variables ( Most features were consistent among splines in the different models, in particular for negative binomial and binomial models. Positive observations decreased with the decimal year ( Figure 5) and increased linearly with bottom time. The spatial distribution of Queen Triggerfish did not differ among models, and the spatial distribution during the breeding season had a single high anomaly in the southwest for the binomial model, although the lack of inclusion of breeding season indicated this could be a spurious feature of the binomial model (Appendix S1). Queen Triggerfish were most often observed during the afternoon in 76 °F water. The DHARMa residual plots did not show a clear pattern in simulated residuals for all models fitted to are included in (Appendix S1).
Year was highly significant (p < 0.01) as an annual trend (linearly related to date) in both binomial and negative binomial models, which indicated the Queen Triggerfish population declined since 1994. The probability of observation and relative abundance both declined more than 50% since 1994 ( Figure 5).

| Model interpretation
Various methods have been developed to extract information on the underlying abundance from REEF categories (Campbell et al., 2022;Tolimieri et al., 2017;Wolfe & Pattengill-Semmens, 2013a). By testing these methods in simulation, it can be determined how appropriate they may be in different contexts.
The near-identical slope of the mean and minimum imputation methods suggests that for analyzing trends both methods may perform comparably well. In absolute terms, though, the minimum value consistently underestimated the true abundance and would be inferior in cases where the intercept of the model had a biologically relevant interpretation. Both the mean and minimum value imputation methods also showed bias in the slope in simulated data, which could lead to a failure to detect a significant trend. This limitation of REEF data is not unique to the imputation method, however, since the log categorical nature of the data collection allows for changes in abundance without changes in observed categories (Campbell et al., 2022). The exponential density score imputation did not work well for imputing individual categories, although other studies have used density scores to track population trends without translating them to abundance (Campbell et al., 2022). Depending on the study system, the use of abundance information or the use of pure presence-absence information can provide better model performance (Fukuda et al., 2012;Howard et al., 2014). In this study, the models that incorporated abundance information (the negative binomial and multinomial) performed worse than the presence-absence model TA B L E 3 Binomial and negative binomial ΔAIC scores for the best-fit model, with each variable removed (see Table 1 for variable definitions), for estimating the presence and abundance of Queen Triggerfish (Balistes vetula) in The Bahamas and Turks and Caicos during 1994-2020, based on marine citizen-science data (Reef Environmental Education Foundation, REEF).

TA B L E 4
Mean and standard deviation of goodness of fit (R 2 ), root-mean-squared error (RMSE), average error, and percent of categories predicted correctly for binomial and negative binomial cross-validation of ~15,000 surveys used to estimate the presence and abundance of Queen Triggerfish (Balistes vetula) in The Bahamas and Turks and Caicos during 1994-2020, based on marine citizen-science data (Reef Environmental Education Foundation, REEF). Binomial category prediction categories (1-0) differed from negative binomial prediction categories (0-SFM). Full prediction metrics for the binomial model are in Appendix S1.

Negative binomial
Mean (binomial). This likely depends on the species and study system and more comprehensive research is required to determine the relative efficacy of different methods applied to this data.
Accounting for variation from sources other than changes in abundance, as in our study, is important for the estimation of abundance indices and CPUE from fisheries data (Maunder & Punt, 2004).
For other citizen-science data, similar work has been done to account for observer effort, which can confound results if not considered (Johnston et al., 2021). Differences between abundance and presence-absence models can also inform decisions. Exclusion of the breeding variable in the abundance model suggests that either breeding behavior does not affect local abundances of Queen Triggerfish, or that there is not enough data to accurately predict over that range. The second seems more likely since this species forms seasonal breeding aggregations  last 20 years we detected is immediately relevant for species conservation and suggests that further investigation and management intervention may be warranted. Causes of this decline likely include fishing pressure, changes in diet, and habitat loss Hernández et al., 2019;Reinthal et al., 1984;Sadovy De Mitcheson et al., 2008). Current classification of Queen Triggerfish by the IUCN Red List is based on surveys of population trends from fisheries data within the Caribbean, so surveys of population trends are needed in The Bahamas and Turks and Caicos (Liu et al., 2015).
For many coastal fisheries, limited data on catch or abundance limits the ability of traditional models to detect population trends (Sagarese et al., 2018). REEF data can provide broad-scale, inexpensive data that is independent of fisheries catch for coastal species (Campbell et al., 2022). By translating categorical abundance data into values that can be modeled at the survey level, we developed and tested a simple and effective way to account for variation in survey characteristics while predicting overall trends in abundance of Queen Triggerfish in The Bahamas and Turks and Caicos. We found that trends in abundance and presence of Queen Triggerfish differed little, but the same may not be true for other species (Campbell et al., 2022). These methods could be quickly applied to other coastal fisheries and could be used to further incorporate REEF data into the IUCN red list process beyond presence-absence models already in use (Gravem et al., 2020).

| Potential future research
Surveyor experience, species identification, data screening, and missing data all likely affected our findings. The binary surveyor experience level used by REEF was a significant factor for the models used in our study. We used the Queen Triggerfish as a model organism due to its ease of identification and observation, whereas other species are likely to be more difficult to identify or observe.
Incorporating information about the relative detectability of species might be necessary for multispecies studies (Ashley et al., 2022).
Further exploration is needed of the interplay between surveyor experience and species identification. For example, other surveyor experience metrics could improve model performance by creating a continuous experience metric (Kelling et al., 2015). Surveyor experience also likely affects missing data. We omitted data if records in multiple fields were missing, which could bias results if data is not randomly missing. Future studies could account for missing data either through multiple imputations or data augmentation (Nakagawa & Freckleton, 2008). Such analyses were outside the scope of our study, but given that roughly half of the data available for our study was omitted due to problems with missing data, the need for better missing data handling is an important area for potential future research.
We were only able to examine variables collected by REEF surveyors, while broader explorations of multiple long-term monitoring datasets have successfully integrated REEF observations (Grüss et al., 2018). Similar to presence-absence modeling that integrated REEF observations, integration with actual counts using similar methods is promising. The precision of REEF abundance category observations was lower than actual counts, but the benefits of its inclusion could overcome quality concerns. Similarly, many of the explanatory variables collected by REEF could be supplemented by other sources, such as information from coral reef mapping. Divers often swim through multiple habitats while on a dive and many divers do not enter habitat information into their survey data. The use of external sources would increase the amount of usable data for modeling and thereby improve performance. In addition to external environmental data, actual count data from the same region could be compared to model-based estimates of presence and abundance.
The conversion function we used in our analysis was constructed by comparing actual counts to REEF categories, but the models we used have yet to be validated in this way (Wolfe & Pattengill-Semmens, 2013a, 2013b. More sophisticated imputation methods could be used to infer counts from SFMA categories. Substituting the inferred mean is a common and simple imputation method, and performed well in our study. Future work could use a more complex model to infer means, perhaps by separating data by years or habitat types, for which distributions of abundance categories may differ. Methods such as multiple imputation or Bayesian methods could be used to assign each imputed count to a theoretical distribution of values based on a truncated lognormal distribution within a range of possible values (e.g., 2 to 10 for F) (Nakagawa & Freckleton, 2008). Such methods might more accurately capture the uncertainty in estimated relationships between variables because they would include imputation error in the y variable. However, such methods would lack the simplicity and ease of use of our method.

FU N D I N G I N FO R M ATI O N
No authors received funding for this work.

CO N FLI C T O F I NTE R E S T S TATE M E NT
Authors declare no conflict of interest with this work.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data are available from reef.org, REEF requests that the data are not distributed to other parties without prior notification to REEF's Co-Executive Director of Science and Engagement and that the geographic location data are not shared or published.