Predicting Imminent Cyanobacterial Blooms in Lakes Using Incomplete Timely Data

Toxic cyanobacterial blooms (CBs) are becoming more frequent globally, posing a threat to freshwater ecosystems. While making long‐range forecasts is overly challenging, predicting imminent CBs is possible from precise monitoring data of the underlying covariates. It is, however, infeasibly costly to conduct precise monitoring on a large scale, leaving most lakes unmonitored or only partially monitored. The challenge is hence to build a predictive model that can use the incomplete, partially‐monitored data to make near‐future CB predictions. By using 30 years of monitoring data for 78 water bodies in Alberta, Canada, combined with data of watershed characteristics (including natural land cover and anthropogenic land use) and meteorological conditions, we train a Bayesian network that predicts future 2‐week CB with an area under the curve (AUC) of 0.83. The only monitoring data that the model needs to reach this level of accuracy are whether the cell count and Secchi depth are low, medium, or high, which can be estimated by advanced high‐resolution imaging technology or trained local citizens. The model is robust against missing values as in the absence of any single covariate, it performs with an AUC of at least 0.78. While taking a major step toward reduced‐cost, less data‐intensive CB forecasting, our results identify those key covariates that are worth the monitoring investment for highly accurate predictions.

covariate without assuming all else to be fixed.Namely, the synergistic effects involved in CB dynamics (Paerl & Otten, 2013;Whitton, 2012) are not addressed, hindering attempts to rank potential management strategies.
One challenge in predicting CB is the inconsistent collected data over different lakes.For a management region with multiple lakes, CB monitoring efforts are not evenly distributed to each lake.There exists an extended monitoring history for lakes with higher social-economical values and scattered monitoring data for lakes in remote, less-populated, or less-bloomed areas.Hence, available covariates for predicting CB are not the same over the different lakes.A flexible model is desired that can identify, and once available, use primary covariates, that are most informative and sufficient for predicting CB, and if unavailable, exploit the remaining secondary covariates to make the prediction.Mathematically, primary covariates are those minimal covariates that if conditioned on, the target variable becomes statistically independent of other covariates.Making predictions in the absence of some covariates is impossible with non-probabilistic models, such as linear regressions and neural networks, because they require all predefined inputs to be available when making predictions.
Bayesian networks (BNs) allow the development of a single predictive model including all covariates and the use of any subset of the covariates in a probabilistically marginalized fashion for making predictions (Ramazi et al., 2021a).A BN visualizes the joint probability distribution of the variables of interest by a directed acyclic graph whose nodes represent the variables and links represent probabilistic dependence, and a conditional probability distribution parameterizes the relation between each node and its parents.Moe et al. manually constructed a BN on CB based on their a priori knowledge (Moe et al., 2016).Additionally, Rigosi et al. (2015) constructed a similar BN structure based on the input of expert knowledge, and learned the parameters from data.However, both the parameters and structure of a BN can be learned partially (Jiang et al., 2021) or entirely from data and without manipulation (Alameddine et al., 2011;Feki-Sahnoun et al., 2018), allowing for insights as to which covariates are essential for predictions, but may have been neglected in previous studies.Once the BN is constructed, the primary covariates will be the Markov blanket of the target variable, which is the set of its parents, children, and co-parents of its children, and the reminder will be secondary covariates.
Here, to facilitate timely mitigation of the effects of CBs, we use BNs to predict future 2-week blooms.Furthermore, we learn the conditional dependencies between 17 multi-scale covariates and CB occurrences to better inform longer term management of CBs.Unlike the common practice in the literature that are limited to a single lake-except for example, (Rigosi et al., 2015)-we use the data of 78 water bodies in Alberta, Canada, over a 30 year time period.The learning is entirely based on data without the artificial intervention of the network structure.
Our objectives are as follows: (a) learn, for the first time, a BN entirely from data to predict future 2-week CB of a specified lake based on 17 covariates, including lake monitoring, meteorological, lake features, and watershed features; (b) identify the primary and secondary covariates form the structure of the resulting BN; (c) reveal the aspects of the structure that match and differ from the current state of knowledge of CB and motivate further research; (d) find the most informative covariates to CB prediction based on the bloom probabilities conditioned on a single covariate; and (e) find the performance of groups/subsets of covariates that are of interest.
We first show that the performance of our predictive BN model is comparable with other predictive models.We then reveal the information encoded by the learned network structure.With the learned network, we compare the predictive power of covariates by examining the informativeness of covariates on the test data.Finally, we compare the performance of the prediction for several scenarios of partially missing data.The informativeness of covariates and predictions using partially missing data will help monitor and predict bloom occurrence in Alberta.

Data and Variables
We used data of 78 Albertan lakes and reservoirs collected from 1986 to 2017 (Figure S1 in Supporting Information S1).The lakes and reservoirs are located within the longitudes of 49.3°-58.8°N,and latitudes of 110.0079°W -119.21667°W.Our data set consisted of three parts: (a) the lake monitoring data collected in the open water season, May to October, with phytoplankton data collected as part of the Alberta Lake Monitoring Program, (b) the meteorological data, and (c) the lake and watershed information data.The used variables are summarized in Table 1 with histograms provided in Figure S4 of Supporting Information S1.

10.1029/2023WR035540
3 of 15 The collected data set was not ideal for predicting a fixed time in the future as the time difference between two monitoring/samples of the cyanobacteria cell counts was not fixed (Figures S2 and S3 in Supporting Information S1).Having the goal of predicting future 2-week blooms, out of every two consecutive cyanobacteria cell count samples for each lake, we took the latter as the target variable, say at time T, and the former as a covariate which was sampled at anytime from about T − 44 to T − 14.We set T − 14 as the time the prediction is made, so the meteorological data were averaged from T − 14 backward to a length of at most 30 days, that is, T − 44.This mimics the realistic situation where from "today" (T − 14), we want to predict blooms 2 weeks in the future (T) using the previous cell count samples for the lake, which could have been collected anytime from a month ago to today (T − 44 to T − 14), and the meteorological data are available for all past days.Alberta Environment and Parks (AEP) provided data other than those related to meteorology.The original lake monitoring data set contained chemical concentrations in lake water samples, Secchi depth, lake stratification, cyanobacteria cell count, etc. Depth-integrated water samples were taken for water columns in the euphotic zone at 10 locations in a lake and then mixed to obtain the composite sample to represent the overall condition of the lake.The regional watershed information for each lake was generated using ArcGIS in a previous study (Loewen et al., 2020).
We took the following steps to obtain the rest of the covariates for our data set: 1. Total cyanobacterial cell count data.The cell count data were extracted from a taxonomic enumeration data set of phytoplankton communities in water samples.We first extracted all the cyanobacteria cell count (see Table S1 in Supporting Information S1 for a list of 256 cyanobacteria taxa included in our study) from the phytoplankton data set of 1,306 phytoplankton taxa in sampled lake water.We only used data sampled in the open water season and collected following the standard protocols (see Figure S2 in Supporting Information S1 for the monitoring times).With respect to the definition of blooms, the guidelines vary from country to country and by scenario.Here, we define a bloom to be an algal density over 1,530 cells/mL, which is the median CB density of sampled lake data.The 100,000 cells/mL threshold by WHO (Chorus and Bartram, 1999) was devised for near-shoreline areas where surface grabs (of the top several inches of water) were common and not directly comparable to our pelagic zone sampling, where samples were taken vertically from one to several meters below the surface.For this reason, when a bloom is present we expect our samples to show significantly lower cell counts than the WHO threshold.Furthermore, our threshold is consistent to observed HABs in literature (Trainer et al., 2020) 2. Meteorological data.Daily meteorological data were extracted from the software BioSIM (Régnière et al., 2014) and its model "Climatic Daily" that takes as input the geographic information including longitude, latitude, and elevation, and BioSIM generated daily mean temperature (Celsius), total precipitation (mm), wind speed (km/hr), total radiation (MJ/m 2 ).The software obtained the meteorological data for each lake by interpolating data measured at the local weather stations in Canada from 1900 to 2017 with some adjustment according to the lake elevation.Then, for each data instance, we took the average of meteorological data from the monitoring date to 2 weeks before the day of the prediction.For example, for lake 1 with a sample on 4 July 2016, and a target prediction time of 25 July 2016, we averaged the meteorological data between July 4 and July 11, so we could make a 2-week ahead prediction using the available meteorological data on and before 11 July 2016.We restricted the weather history-length (the number of days before the prediction date) to a maximum of 30 days, as averaging weather over too many days could reduce the prediction power.3. Lake features and Watershed data.Lake elevation and lake-watershed area ratio were obtained using geographical information system (ArcGIS 10.S4 in Loewen et al., 2020), representing human-footprint.
We chose those variables according to a previous multi-scale study of the CB in the study area (Loewen et al., 2020).Some factors were not included, such as buoyancy and water temperature, because of a considerable portion of missing data.However, we included air temperature, which is a representative of water temperature and easier to obtain with the help of the weather stations.A summary of variables is shown in Table S3 of Supporting Information S1.The original covariates except for the current month and stratification of the lake were continuous.We discretized the 15 covariates by frequency into three levels: low, medium, and high (with approximately one third in each level).The three levels are essentially formed based on historical observations and local knowledge of the lake.Moreover, although more levels might yield more accurate predictions, we limited them to three so that trained citizens can also easily distinguish and report the level, avoiding the need of costly monitoring.The target variable, cyanobacteria cell count on the target date, was discretized into two levels based on the threshold of 1,530 cells/mL discussed previously with "0" indicating bloom-free and "1" indicating a bloom event.

Partitioning Data Into Training and Testing
Unlike the typical machine-learning approach of randomly partitioning the data set into training (calibration), used for learning the structure and parameters, and testing (validation), used for evaluating the model, here, we used a temporal partitioning.For each lake, we included in the testing data set the data instances collected for that lake during the most recent 3 years and included the remaining older data instances in the training data set.For example, if a lake was sampled at years 2010, …, 2018, we used instances at 2018, 2017, and 2016 for the testing and 2010, …, 2015 for the training.Since in practice, we may want to apply the model to a lake that we have never sampled before, we held out six lakes for the testing data set and did not use any of the corresponding data instances in the training.This resulted in a test data set consisting of 95 data instances, that is a portion of 15% of the whole data set, and the remainder was included in the training data set.The distribution of month over the test data set was as follows: 12 June, 46 July, 64 August, 48 September, and 20 October instances.

Learning and Testing
We found the BN structure that scored the lowest Bayesian information criterion (BIC) (Schwarz, 1978) on the training data set.BIC measures the fit to the data set via the likelihood function and penalizes the number of model parameters.Thus, it can prevent overfitting, which explains its use in learning BNs in the literature (Alameddine et al., 2011;Feki-Sahnoun et al., 2018;Ramazi et al., 2021a).Additionally, we assumed no manipulation, or forced structure in construction of the BN, as to allow the entire structure to be learned from the training data set only.We then learned the BN parameters, and compared the prediction accuracy of the resulting BN with four predictive models: generalized linear model (GLM), naive Bayes (NB), boosted decision tree (or, gradient boosting machine (GBM)), k nearest neighbor (KNN), and neural network (NN) (Ramazi et al., 2021b).We used the area under the curve (AUC) (Hajian-Tilaki, 2013) as the performance measure over the test data set.We tested whether there is a significant difference between the performance of BN and the other models.
Then we tested the performance of the BN when none or one of the primary covariates were missing and reported both the AUC and the area under the precision-recall curve (AUPR) (Raghavan et al., 1989), which is appropriate for unbalanced data.
The code was implemented in the programming language R. The package bnstruct (Franzin et al., 2017) was used to find the optimal BN from data.Analysis of the obtained BN, including the computation of the conditional probabilities, was performed using the bnlearn package (Scutari, 2009).The NN model was implemented using the deepnet package (Rong, 2014), the NB model was implemented using the bnlearn package, and the other tested models, that is, GLM, KNN, and GBM, were implemented using the caret package (Kuhn, 2008).Based on our prior experience, we set the value of k to 5 for the KNN model and the number of trees to 100 for training the GBM.The number in the best iteration was used during the test.The NN had one hidden layer with the number of neurons equal to half of the number of the input variables and the sigmoid function as the activation function.

Prediction Accuracy
The final trained BN scores the best BIC on the training data set and predicts future 2-week CB bloom with 0.79 AUC on the testing data set and 0.86 AUC on the training data set.The AUC scores of the competitive models range from 0.71 to 0.85 on the testing data set (Table 2, See Figure S6 in Supporting Information S1 for the ROC and AUPR curves and Table S6 in Supporting Information S1 for the AUC of each lake).Using the DeLong test (DeLong et al., 1988), we find that the AUC score of BN is not significantly different from the AUC of the other six models.The prediction accuracy of our model is comparable to other competitive models.

Network Structure
The BN suggests the following covariates as primary, which form the Markov blanket of the response variable bloom and are sufficient for predicting it (Koller & Friedman, 2009): the current cyanobacterial cell count, phosphorous, nitrogen, % pastureland in the watershed, and the current month.These covariates are sufficient for the BN to achieve an AUC of 0.79.The remaining covariates are secondary, which do not contribute to the above prediction but are used when the value of one or more of the primary covariates is missing.Namely, the prediction accuracy does not change when any secondary covariates are missing, given the five primary covariates.Should the value of any of the primary covariates be missing, the prediction performance would not get lower than 0.78 AUC (Table 3).
The structure of the BN (Figure 1) proposes a number of probabilistic conditional independencies among the variables (Table S4 in Supporting Information S1).One observation is the presence of watershed covariates, in particular, % pastureland, in the Markov blanket.Furthermore, it is not surprising that the covariate, current month, is present as primary because of the high latitude of Alberta.We further note that the meteorological covariates are statistically dependent on month, whereas the nutrients within the lake are independent of the month.
The existence of links between watershed variables and nutrient measurements are particularly interesting because they indicate a connection between land use and eutrophication that highlights potential anthropogenic nutrient sources.Additionally, the existence of a link between two variables implies that the two do not become independent conditioned on any other (subset of) variables.It does not imply the ability of one node to fully predict another.

Informativeness of the Covariates
Comparing the informativeness of covariates helps to identify the covariates that are important for CB prediction as a complement to the primary covariates.The obtained BN allows us to determine the probability of future The red node is the response variable (future 2-week CB), the other nodes represent the current covariate levels.Blue nodes are in the Markov blanket of the response variable, and hence, primary covariates, and white ones are the secondary covariates.The links do not represent causal relationships but conditional probabilistic dependencies.Note that all the continuous predictor variables were discretized into three levels, and the response variable was discretized into two levels, that is, bloom-free and bloom.
2-week bloom conditional on each of the discretized covariates, for example, Prob(bloom |nitrogen = low).We find five covariates that predicts high risk of bloom (Prob(bloom = 1 |covariate = X) > 2/3, for covariate level X) on the test data set: high level of current cell count predicts the bloom in 2 weeks with a probability of 0.843, and high % pastureland in watershed predicts a bloom of probability 0.746, followed by low level of Secchi depth (0.704), high level of phosphorus (0.683), and high level of nitrogen (0.668) (Figure 2a).There are 8 scenarios with low risk bloom (Prob(bloom = 1 |covariate = X) < 1/3, for covariate level X) (Figure 2b): high levels of % forest in watershed, Secchi depth and lake depth, low levels of current cell count, % pastureland in watershed, phosphorus and nitrogen, and medium level of % wetland in watershed.Those high-risk scenarios and low-risk scenarios are useful to quickly estimate the risk of bloom.
Another way to study the relative informativeness of the covariates is to compare the level of certainty that a single covariate can provide before obtaining the state of that covariate.We use Shannon's entropy to measure the level of uncertainty for a probability distribution, where "0" means the lowest uncertainty level, or most certain, and "1" is the highest uncertainty level, for example, probability distribution of tossing a fair coin.We rank the uncertainty (Table S5 in Supporting Information S1) of the conditional probability tables (CPT) using the entropy (Shannon, 1948).We find that the CPT of the current cell count level has the lowest uncertainty (0.75), meaning that knowing the current cell count level is more likely to avoid ambiguous prediction for bloom occurrences, if one were to choose only one covariate for prediction.There are five covariate CPTs with an uncertainty level between 0.84 and 0.9, including % pastureland in the watershed, phosphorus, Secchi depth, % forest in the watershed, and nitrogen.

Performance of Groups of Covariates
We categorize the covariates except current month into four groups: lake monitoring, meteorology, lake features and watershed features (Table 4) and investigate the prediction performance of a single group, using our obtained BN and values of the chosen group in the test data set.We find that the lake monitoring covariates performs the best, with an AUC of 0.78, almost the same accuracy as when all covariates are available in the BN.The next best is the group of watershed features (0.77 AUC).Just the two lake feature covariates predict bloom with a 0.73 AUC.The meteorological covariates predict almost as poorly as a random classifier, that is, 0.53 AUC.
We hypothesize that using an educated estimate might help boost the prediction accuracy, when we need timely predictions of CBs.With limited ability of getting lab-based output from monitoring samples promptly, the total cyanobacteria cell count and Secchi depth may be estimated by a well-trained local by identifying the level of cyanobacteria amount and the water clarity using their own experience and some scientific guideline.Moreover, with recent advances in high-resolution imaging, automated cyanobacterial cell counting methods (Graham et al., 2018) could be used to generate data of sufficient taxonomic resolution (i.e., total cyanobacterial cell count) in a timely manner for use in the predictive model.We compare two subsets of the covariates (Table 4).One subset of regional covariates includes the covariates associated with meteorology, lake features and watershed features.Using this subset and our obtained BN, we predict the bloom in 2 weeks with an AUC of 0.77.We complement the set of regional covariates with the estimates of the level of cyanobacteria growth (corresponding to cell count) and the water clarity (corresponding to Secchi depth) by a well-trained local or the high-resolution imaging technology.With the additional data-estimated level of cell count and Secchi depth, the prediction of bloom in 2 weeks outperforms the prediction using only the regional data, and ends up with an AUC of 0.83, which we take as our final predictive model.

Discussion
We obtained the following three main findings: (a) the purely data-based BN structure is consistent with most results from previous studies, identifying five primary covariates-P, N, and cyanobacteria cell count from the most recent sample, the pastureland percentage in the corresponding watershed, and the current month; (b) the BN proposes the most recent cell count, the pastureland percentage, and the Secchi depth to be the three major covariates that are the most informative as a single predictor of CB; (c) the BN predicts future 2-week CB with AUC 0.79 in the testing data set; however, if we select regional and local covariates as the only input variables, the AUC increases to 0.83.

Prediction Accuracy
Our approach to predicting future 2-week CB provides statistically similar accuracy as six other competing approaches.As for the performance of other studies (see references in Rousso et al., 2020), only a few had a Group/subset Covariates AUC Lake Monitoring Current cyanobacterial cell count, P, N, Secchi depth, stratification status (obtained on the currently most recent sampling date) 0.78 Meteorology Temperature, precipitation, wind speed, solar radiation (the average daily meteorological condition from the sampling date to today) 0.53 Lake feature Lake depth, lake elevation 0.73 Watershed feature % pastureland, % cropland, % forest, % wetland, lake-watershed area ratio 0.77

Regional covariates a
All the covariates in the groups of meteorology, lake feature and watershed feature 0.77 Regional covariates + local b In addition to the regional covariates we add covariates that can be estimated locally: the level of total cyanobacterial cell count and the Secchi depth 0.83 a Regional covariates refer to the meteorological covariates and those that are fixed for a given lake.b Local covariates here only refer to the level of total cyanobacterial cell count and the Secchi depth.

Table 4 Groups/Subsets of Covariates and Their Corresponding Prediction Performance Over the Testing Data Set
forecast horizon of at least 2 weeks, and we could not make a conclusive comparison because most of them use R 2 , that is, the coefficient of determination, instead of AUC scores.Nevertheless, our obtained BN is fully data based, which relaxes the stated limitation of requiring expert knowledge to develop BN models in previous work (Rousso et al., 2020).
Our prediction framework differs from others in that we do not require the lake monitoring data to be collected exactly "today" to make predictions for 2 weeks later.Instead, the monitoring data could be collected several days (0 ∼ 30) before "today" (see Figure S3 in Supporting Information S1), and the past meteorological data were used to complement the monitoring data gathered earlier.Although we had to choose such a framework due to the irregular gaps between consecutive monitoring dates, our framework better represents the reality that monitoring is not very frequent, and there is an expected time delay in reporting the monitoring data, especially the cell count.
The time lag can be variable-perhaps anywhere from 48 hr for an emergency sample to weeks or months for non-priority sampling (Graham et al., 2018).
While the obtained BN has the advantage of handling missing values, compared to other models, it could fall short of making accurate predictions.Nevertheless, there was no significant difference between the AUC of the BN and other models, supporting the choice of BNs.The AUC of the BN was slightly better than NB, which is consistent with predicting Karenia selliformis blooms in Feki-Sahnoun et al. (2018).This supports the dependence between the covariates even if conditioned on the target variable.While the parameters of GLM and NB are obtained using standard methods such as the maximum likelihood estimation (Koller & Friedman, 2009), obtaining the optimal configurations of some of the models may not be possible.In particular, a NN may take an arbitrary number of layers and neurons as well as an arbitrary activation function, and finding the "best" architecture is an open problem.Nevertheless, one may "hope" to obtain a competitive configuration by following some rules of thumbs such as those in Panchal et al. (2011) as we did in this paper.
The reason why omitting some of the covariates increases the accuracy (Table 3) is as follows.Our aim is to find the model that best describes the relationships between the variables and does not overfit the data set.Hence, we optimized the BN based on BIC (although one could also use AIC).This is a generative task (Jebara, 2001) with the goal of best estimating the joint distribution  Prob( ) where B is the target variable, bloom, and   is the set of the covariates.Now, when we use the model for predicting bloom, we are performing a discriminative task, that is, estimating the conditional distribution  Prob( | ) .The log of these likelihoods are being maximized during the training phase, and the log of the first likelihood has an extra term compared to the second: . Thus, the generative task comes with the cost of having perhaps a lower prediction accuracy of the target variable since it is also optimizing the extra term  log Prob() which stands for the log-likelihood of the covariates.So it is possible to obtain another model, such as NB, that is more accurate at predicting bloom but less accurate at matching the likelihood of the data, or equivalently the relationships between the covariates.

Network Structure
To exploit the full potential of the data set, the network structure was learned entirely from the training data set and had no prescribed connections or parameters (Ramazi et al., 2021a), which could have resulted in a different structure.Although previous studies suggest that such manipulations based on expert knowledge can decrease the complexity of the network (Alameddine et al., 2011), one goal of this work was to compare an entirely learned structure to the state of current knowledge.
The inclusion of P and N concentration in the Markov blanket of the target variable reinforces the idea that they both play an essential role in bloom formation (Dolman et al., 2012).Moreover, the link between N and P in the BN implies that both nutrients are likely to contribute to nutrient pollution, as opposed to the excess of just a single nutrient (Lang et al., 2013).The implied correlation between N and P from the network structure also leads to the unsurprising link between N and current bloom status.There is an overwhelming body of literature supporting that nutrient concentrations drive CB dynamics, and our model is one such study that supports this assertion.Furthermore, the temperature is believed to play an important role in bloom formation.Keeping in mind that Alberta has large average temperature fluctuations between seasons, literature would suggest that the season, or month, would be an important factor for evaluating the probability of CB.In our model, we note that the current month is linked to the current cyanobacterial density.However, since we are making 2-week predictions, it is not surprising that the month is not directly linked to the target but is still contained in the Markov blanket.Lastly, we note that our network structure shows a link between land type/use within the watershed of the focal lake and nutrient concentrations in the lake water.Although this link is sparsely discussed in the literature, there is an intuitive understanding of the link between the surrounding land type and nutrient concentrations.That is, lakes surrounded by farmlands such as pastureland and cropland may be more eutrophic than lakes surrounded by forests or wetlands due to an increase in anthropogenic nutrient sources (Carpenter, 2005;Keatley et al., 2011;Rodríguez-Gallego et al., 2017).conditional independencies imposed by the network suggests that certain measurements may not be required.For example, our model shows that knowing Secchi depth is not necessary for making a prediction, provided we know the covariates in the Markov blanket.This result is informative, as it suggests that even though Secchi depth is commonly used to predict algal abundance, it may be unnecessary (Canfield & Hodgson, 1983;Krause-Jensen et al., 2009).Also, certain lake characteristics such as lake depth and stratification are not deemed valuable predictors compared to other covariates.It is commonly referred to in the literature that CBs are often successful in stratified conditions due to their ability to regulate their buoyancy (Paerl & Huisman, 2009;Reynolds et al., 1987;Wang et al., 2007), and that deeper lakes generally have lower CB abundance (Berger et al., 2006;Wang et al., 2007).The covariates related to meteorology, wind speed, precipitation, air temperature, and solar radiation are not directly required for our prediction either.There are many reasons to expect that meteorology plays a crucial role in bloom formation.Several CB species growth rates are shown to have optimal growth near a temperature of 25°C (Butterwick et al., 2005), and higher winds encourage water column mixing, which in turn alters light and nutrient availability (Paerl, 2014).
Although these results may stray from the common beliefs found in the literature, they are not contradictory nor necessarily based upon error.When considering a more extensive breadth of environmental covariates, new and previously unconsidered covariates can cover the information provided by the typical covariates.For example, Secchi depth, in part, is a way of measuring the turbidity of a lake.However, one can reasonably postulate that the turbidity is also directly related to P and N concentrations, or perhaps even the surrounding land use.Hence, if we know the P and N concentrations and know the lake is in an agricultural watershed, Secchi depth may not provide any further information.Additionally, the absence of meteorological covariates required for the prediction can be explained by the relatively predictable monthly meteorological patterns of our study system of Alberta, Canada.That is, the average meteorological covariates are described mainly by the month from which we are predicting.
Moreover, BIC is a consistent score for learning BN structures; that is, as the data size approaches infinity, the structure that correctly represents the conditional independencies between the variables minimizes BIC and no other structure does so (Koller & Friedman, 2009).So for a large enough data set, the Markov blanket is expected to be correct.While this supports our approach to find the primary covariates, the validity of the results is limited to the size of the dataset-an issue with almost all data-driven results.One may consider re-training the model once more data is collected.
Lastly, our network shows several logically standard links, that is, the meteorological and land type covariates are interconnected.Of course, the surrounding land-use variables should be anti-correlated in the sense that the sum of the proportions of surrounding wetland (%), pastureland (%), cropland (%), and forest (%) should not exceed 100.The meteorological covariates and month should also be correlated as the average meteorological conditions in Alberta are highly cyclic and regular annually, implying a direct link to season or month.Finally, we expect a grouping of the covariates pertaining to nutrient concentrations, surrounding land use, current bloom status, and target.When a lake is considered eutrophic, it is often eutrophic in both P and N, although P is generally the main driver of bloom dynamics (as we see in this network (Figure 1): P is directly but N is indirectly linked to the target variable).We further expect N and P to be influenced by the type of surrounding land, that is, it is argued that pasture and croplands will contribute to eutrophication more than the other types of land in the watershed.However, for a given lake, whether a bloom will occur also depends on the local conditions prior to the target date.

Insights for CB Management and Mitigation
The BN structure and prediction results also present useful information in terms of management and mitigation strategies.Current mitigation strategies can be categorized as either within waterbody or within watershed/ airshed controls (Paerl et al., 2016).Various strategies apply to each of these categories, although both within-lake and within-watershed interventions may have unintended consequences.Some within-lake strategies involve significant alteration of the ecosystem, such as algaecides, sediment capping or dredging, and food web manipulations.There may be no direct link between such mitigation strategies and the alteration of covariates considered in our model thus, our work is limited in understanding the efficacy of such strategies.For example, in-lake remediation strategies targeting nutrient levels may also indirectly alter turbidity, in which case multiple covariates are altered in a way that is difficult to determine.However, certain within-watershed mitigation strategies may directly target covariates in our model, in which case we can offer insight into the potential effectiveness of that strategy.
First, it is readily known that excess nutrient inputs of nitrogen and phosphorus increase CB abundance.It is contested in the literature whether P, N, or a combination of these variables drive freshwater CB abundance (it is also worth noting that relationships differ in marine environments).However, due to the ability of several genera to fix nitrogen, phosphorus reduction has been thought to be more feasible in managing blooms.Many avenues can be pursued to reduce anthropogenic nutrient inputs into the watershed, a few examples that could potentially apply to Alberta lakes include; reduction of urban and industrial waste, reduction of agricultural pollutants such as fertilizers, manure and disturbance of legacy nutrients in soils, and riparian habitat restoration or preservation to sequester mobile surface nutrient.All the above strategies would directly target a reduction in our model's P and N covariates.Interestingly, P and N are contained within the Markov blanket, and Figure 2 shows that low P or N concentrations give a low probability of bloom occurring.It is worth noting that our model shows a low P concentration has a lower probability of a bloom occurring than N (19% vs. 24%).At the same time, a previous work in the province (Loewen et al., 2021) implies that the ratio of N:P may be important in Alberta, with higher N to P favoring non-cyanobacterial species which are unable to fix atmospheric N.
Additionally, altering the land type surrounding a lake and within the watershed has been shown to be a useful mitigation strategy.For example, thoughtful placement of riparian buffers within the watershed can increase the overall water quality and decrease nutrient pollution (Anbumozhi et al., 2005).Also, wetlands, both constructed and natural, can act as supplemental wastewater treatment processes to remove and trap even more nutrients (Paerl et al., 2016;Vymazal, 2010).To this end, our model supports the claim that land type within the watershed is an important indicator for predicting CB blooms.Although the watershed land-type variables (% pastureland, % cropland, % wetland, and % forest) are all correlated variables, we see certain trends in our model.As seen in Table 4 watershed features, in general, can be helpful in predicting CB blooms, however, the results do not single out an increase in percent wetlands as a potential management strategy as suggested by many studies (Paerl et al., 2016;Vymazal, 2010).Instead, as seen in Figure 2, our model suggests that reducing the percentage of pastureland may result in a lower probability of CB blooms.

Informativeness of the Covariates
Our findings that the level of cyanobacterial cell count, Secchi depth, nutrients, and most watershed covariates are among the most informative covariates are consistent with previous studies in other water systems (Rigosi et al., 2015;Steffen et al., 2017;Zhao et al., 2019).Contrary to the conventional findings on the crucial effect of wetland on nutrient retention and water quality (Baron et al., 2013;Verhoeven et al., 2006), we found that a high level of % wetland predicts a higher risk of bloom compared with a medium level of % wetland.We speculate that in addition to the total amount of wetland in a watershed the number of wetlands may also be important, as the accumulative effect of small wetlands in controlling CBs is reported to outperform one large wetland (Cheng & Basu, 2017).While wetlands frequently intercept and retain nutrients, they maintain stronger groundwater connections and nutrient supplies to lakes that may create eutrophic lake conditions that support blooms in undeveloped watersheds, especially later in the season (Loewen et al., 2020).We did not expect that % pastureland in the corresponding watershed of a focal lake ranks higher than the level of phosphorus and nitrogen within the lake, even though the % pastureland of a lake remains stationary.In combination with our finding that the covariate group of watershed feature has an AUC of 0.77 (Table 4), one of our important future directions is to develop a preliminary appraisal using our model framework and those stationary covariates, including lake and watershed features, for the less frequently monitored and currently bloom-free lakes in our study region.This will enable the environmental managers to make proactive management strategies (Qu et al., 2014) so that the limited monitoring and management resources can be efficiently allocated to the lakes with higher pre-assessed bloom risk before detrimental outcomes occur (Michalak et al., 2013).Taking proactive measures could possibly reduce the cost of management, because cyanobacteria are reported to be resistant to current mitigation efforts and management measures are notoriously costly (Cooke et al., 2016).To our surprise, unlike studies in Kosten et al. (2012) and Taranu et al. (2012), none of the meteorological covariates here were shown to be very informative.The reason behind this is unclear to us, even though we acknowledge the correlation between the current month and meteorological covariates.One explanation for the lack of importance of meteorological factors is that they may have limited direct influence, but influence blooms indirectly.For instance, warm or low wind conditions might push an agriculturally-driven watershed with high nutrient concentrations into a bloom, where the bloom might not otherwise have occurred if not for the combination of factors.
Our model suggests that the connections described above are existent; however, when a covariate has a high uncertainly level with respect to its probability distribution of predicting blooms, (Table S5 in Supporting Information S1) it may not be a strong predictor.Consequently, the connection may be spurious.This highlights the importance of determining the confidence over the BN connections, which can be done by Bayesian averaging (Koller & Friedman, 2009) and is left as future work.In this light, the interpretations provided are merely a potential explanation of the model results, regardless of the confidence of the model.

Performance of Groups of Covariates
A valuable feature of the BN is its ability to make predictions with missing covariates.This feature makes its range of applicability much broader.For example, a subset of the regional covariates could be used to make predictions if prompt monitoring data is not available.In certain instances, only a subset of covariates may be measurable due to technical issues or varied interests at a given time and region.Furthermore, our model can help understand which group of covariates is most useful for predictions, potentially cutting down on data collection costs and sampling.We considered six logical groups/subsets of covariates to show the applicably of our model for varying availability of data.All groups/subsets of covariates except the meteorological group perform reasonably well.In particular, considering the timely availability of data, the model can make good 2-week predictions without relying on detailed analysis of chemicals.The poor performance of the meteorological covariates may be explained by the invariability in the data, lack of important covariates (e.g., those in the Markov blanket), or the structure of the conditional dependencies of the covariates being used.In any sense, the ability of our approach to utilize subsets of covariates to make predictions makes it a potentially powerful tool for timely predictions to be made by health officials, managers, and scientists in general.

Limitations and Future Direction
Our data set includes a few frequently sampled lakes and many less frequently sampled lakes (see Figure S2 in Supporting Information S1).We obtain a representative set for testing, as is described in Section 2. However, the imbalanced nature of our data set might have lead to an unexpected prediction of a higher bloom level when the % wetland level is high (see Figure S6 in Supporting Information S1 for the conditional probability for % wetland).It is also possible that this unique result is due to the effect of wetland sizes (Cheng & Basu, 2017) which we do not consider.Obtaining more data could help to uncover the underlying cause for this result.When there is enough data, a balanced data set can be constructed to generate a BN.From that BN, we could verify the effect of % wetland in predicting CB.
The possible reason for meteorological covariates not showing a critical contribution to CB prediction as much as other groups of covariates is that we restrict our study to lakes in Alberta, where there are not distinct meteorological variations among lakes.The meteorological covariates might be shown to have a larger impact in CB prediction for a larger scale, for example, continental or global scale.
Overall, our approach of predicting CB performs as well as the competing methods and proposes the importance of watershed features in the CB study for regional water body systems.The proposed framework allows us to achieve more realistic goals when data are partially missing.Limited by the size of the data set, we did not include the effect of buoyancy because its effect is not as important as shown in an earlier study (Loewen et al., 2020).In the future, it would be interesting to justify the role of meteorology for CBs in the study area.Extending the methodology to obtain, rather than a discrete, a continuous BN (Jackson-Blake et al., 2022) allows predicting the Cyanobacteria level in a finer scale.The informativeness of lake and watershed features suggests that our approach could help prediction when monitor data is not available for some lakes.

Conclusion
We developed a purely data-driven BN that uses watershed characteristics, meteorological conditions, Secchi depth, and the (low, medium, high) level of cyanobacterial cell-count to predict future 2-week CBs in lakes with 0.83 AUC.Some predictive models, such as a GLM, perform up to 0.02 AUC higher; however, they cannot make predictions when one or more of the covariates are randomly or systematically missing in the data as in large-scale monitoring.However, being a probabilistic model, the BN can marginalize the missing covariates and may perform only 0.05 AUC worse in the absence of a covariate, making it a suitable fit for the case with incomplete, partially-monitored data.On the other hand, unlike most machine-learning models that bear a blackbox design with little to no insight, BNs are probabilistic and graphical, revealing the dependence between the variables.In particular, the obtained BN identifies five primary covariates: P, N, percentage of pastureland in the watershed, current cyanobacterial cell count, and current month, and uses the remaining secondary covariates to predict CB in the absence of one or more of the primary.Moreover, the probabilistic analysis of the BN proposes cell count, pastureland percentage, and Secchi depth as the three most informative covariates in predicting CB.We further discussed known and new insights on the relationship between the covariates and their "contribution" to the predictions and implications on CB management and mitigation.Overall, the BN modeling approach has a high potential for predicting CBs in lakes using incomplete, timely data and identifies those informative groups of covariates that are worth future monitoring investments for further improving the prediction accuracy.This research was primarily supported by a Grant from Alberta Conservation Association.Hao Wang was partially supported by an NSERC Individual Discovery Grant RGPIN-2020-03911 and an NSERC Discovery Accelerator Supplement Award RGPAS-2020-00090 as well as a Canada Research Chair.Mark Lewis was supported by an NSERC Discovery RGPIN-2018-05210 and the Gilbert and Betty Kennedy Chair.Pouria Ramazi was partly supported by an NSERC Discovery Grant RGPIN-2022-05199.

Figure 1 .
Figure1.The structure of the selected Bayesian network.This network has the lowest Bayesian information criterion score on the training data set.The red node is the response variable (future 2-week CB), the other nodes represent the current covariate levels.Blue nodes are in the Markov blanket of the response variable, and hence, primary covariates, and white ones are the secondary covariates.The links do not represent causal relationships but conditional probabilistic dependencies.Note that all the continuous predictor variables were discretized into three levels, and the response variable was discretized into two levels, that is, bloom-free and bloom.

Figure 2 .
Figure 2. Predicting risk of bloom with the obtained Bayesian network (BN) given the state of a single covariate.Predictions are made by inputting the value of a single covariate into the BN.(a) The probability of a high bloom level in 2 weeks (x-axis) based on the current covariate levels (y-axis).As an example, the top bar should be interpreted as follows: There is a bloom probability of 0.843 in the future 2 weeks, if the current cell count is High.(b) The probability of a low bloom level in 2 weeks (x-axis) based on the current covariate levels (y-axis).See Figure S7 in Supporting Information S1 for the bloom probabilities of all covariates.

Table 1
List of Variables and Their Abbreviations 3.1).Proportional coverage of forest and wetlands were obtained from a reclassified Agriculture and Agri-Food Canada 2010 Land Use grid (ca.2010, see AAFC (Agriculture and Agri-Food Canada), 2015), representing natural land cover.Proportional coverage of pastureland and cropland were obtained from reclassified Alberta Biodiversity Monitoring Institute (ABMI) Human Footprint Inventory data (ca.2016, see ABMI (Alberta Biodiversity Monitoring Institute), 2018; for reclassification details, see Table

Table 2
AUC Scores of the Selected Bayesian Network (in Bold) and Other Predictive Models Note.The results are on the test data set.AUC and AUPR for a random classifier on the test data set are 0.5 and 0.55.

Table 3
AUC and AUPR Scores of the Selected Bayesian Network When the Value of a Single Covariate Is Missing