The human factor: Weather bias in manual lake water quality monitoring

Sampling bias due to weather conditions has been anecdotally reported; however, in this analysis we demonstrate that manual lake sampling is significantly more likely to take place in “fair weather” conditions. We show and quantify how a manual lake monitoring program in Maine, USA, is biased due to wind speed, rainfall intensity, and air temperature. Emulating a manually sampled water quality (WQ) data set, we show that, on average, manual sampling recorded, depending upon depth, higher water temperature (between 0.4°C and 1.2°C), lower dissolved oxygen (DO) (between −0.8 and −0.4 mgL−1), and higher chlorophyll values (2.0 μgL−1) than average automated monitoring. By analyzing the actual manual monitoring data, we show that manually collected lake water temperatures are on average 1.0°C higher in the epilimnion and 0.5°C (corrected for sensor lag) higher in the hypolimnion compared to those collected using automated methods. We attribute these differences in WQ measurement values to the weather‐induced manual sampling bias. We believe that the nature of weather bias on manual monitoring will always record higher water temperatures, higher chlorophyll, and lower DO than automatic monitoring. The methodologies presented in this study will apply to similar manually sampled lake monitoring programs and the manual sampling bias will likely affect other WQ parameters. The weather‐induced water temperature bias reported is of the same order of magnitude as the root mean square errors reported in many lake models and is therefore considered substantial. If generally applicable and not corrected for, these results will have important implications for climate models, and similar applications, where manually collected WQ data are employed.

The monitoring of lake water quality (WQ) is conducted by ecologists and limnologists for a myriad of reasons, including the assessment of biological and chemical health, meeting environmental or drinking water standards, and obtaining data for long-term time series and lake models. Formalized and regular WQ monitoring programs are an essential part of this measurement and characterization of lakes (Ban et al. 2014). WQ data can be collected by manual means, for example, sensors deployed from a boat, or automated methods, for example, a floating data buoy. Local, regional, or national guidance is often used in the design of such monitoring programs. These WQ monitoring programs, however, can be significantly affected by various sampling biases, both temporal and spatial.
Anecdotally, weather biases in WQ monitoring are intuitively recognized by lake scientists but not necessarily overtly addressed. For example, although regular monitoring is a keystone of most WQ sampling programs (British Standards 2006), lake scientists knowingly avoid working from boats in rain and high winds for practical and safety reasons but do not account for this behavior in their sampling programs. Weather bias in data collection is mentioned in the literature but its treatment is rarely addressed. Fair weather bias, in oceanographic terms, is the tendency for ships to avoid or be unable to work in bad weather and thus prevent data collection in such conditions (Dhanak and Xiros 2016). Data gathered by ships is invariably subject to such bias (Austin 2012). Guidance on the design of monitoring programs, such as the ISO 5667 series (British Standards 2006), is typically concerned with weather from a boat safety and station-keeping point of view but not from a data bias standpoint. It should be noted that such guidance, when looking for systematic changes over time, generally advises that lake WQ monitoring is conducted regularly on the same day and at the same time to minimize sample variations (British Standards 2006). If monitoring programs could adhere to such rigid timetables, then any weather-derived biases would potentially be eliminated.
In this paper, we examine weather influence on a lake WQ monitoring program conducted on the Belgrade Lakes, Maine, USA, and the influence of this on the WQ data collected. Although the methodology presented is for a single lake monitoring program, it could be applied to similar lake monitoring programs globally to quantify the effects of weather in lake WQ data.
The hypothesis examined is that manual sampling suffers (1) a statistically significant bias due to different types of weather conditions and that (2) these biases have a statistically significant impact on the WQ data collected. To test part one of our hypotheses, we use data from the WQ sampling program from all seven Belgrade Lakes to examine if weather conditions produce a data bias when manual sampling is conducted. Specifically, we examine if average daily wind speed, temperatures, and rainfall affect the probability of manual sampling. Based on the anecdotal evidence, we hypothesize that high winds, cold temperatures, and heavy rainfall decrease the likelihood of manual sampling or, conversely, that low wind speed, low rainfall, and high air temperatures increase the likelihood of manual sampling to take place, all other things being equal. To test the second part of the hypothesis, manual and automated lake sampling data from Great Pond (one of the Belgrade Lakes) is analyzed. Using the methodologies presented, we show that, for our Great Pond case study, manual lake sampling yields significantly different WQ values (higher temperature and chlorophyll and lower dissolved oxygen [DO] on average) as compared to automated WQ monitoring. This finding may have a more general relevance to all manual WQ sampling parameters and programs but further research would be required to investigate this.

Monitoring program
Most lake monitoring programs measure similar WQ parameters on a regular basis (Stanley et al. 2019), including Secchi disk transparency, temperature and DO profiles, and discrete or integrated water samples for parameters such as chlorophyll a and total phosphorus. Our analysis comprises two main elements: (1) to assess the presence of a temporal bias due to weather conditions on manual sampling, we have estimated the likelihood of manual sampling taking place on any given day within the sampling period depending on weather parameters obtained from the Belgrade Lakes weather station (see the Influence of weather on manual sampling section) and (2) to see the effect of this weather bias in the WQ data, we have analyzed WQ data from both manually obtained profiles and an automated data buoy on Great Pond (see the Effect of weather sampling bias on WQ parameters using only automated buoy measurements and Effect of weather sampling bias on water temperature using automated buoy and manually collected measurements sections).
The Belgrade Lakes are a chain of temperate lakes in central Maine. Temperature and DO profiles are measured throughout the open water season by the 7 Lakes Alliance (a regional watershed organization) in conjunction with Colby College. The sampling program is based upon the State of Maine WQ monitoring guidance (Webster 1997). Standard procedure is for manual profiles to be taken from a boat weekly in June-August, biweekly in May and September, and once in April and October at various buoyed sample stations in the lakes. Great Pond is a mesotrophic, dimictic lake with a mean depth of 6.4 m and a maximum depth of 21 m. On Great Pond (Fig. 1), there is an automated WQ data buoy at Sta. 2 (also used for mooring boats when sampling manually), which measures temperature, DO, chlorophyll, and photosynthetically active radiation (PAR) at fixed water-column depths every 15 min, and an automated weather station located on the lake shore that samples weather parameters every 15 min. These spatially adjacent and temporally aligned monitoring methods allow for weather and WQ data to be compared (Fig. 2). Equipment used to monitor the weather and WQ parameters comprises:
• Manual sampling: Aqua TROLL 400 Multiparameter Probe measuring temperature (sensor accuracy: ±0.1°C, resolution: 0.01°C, with a sensor lag of <30 s, factory calibrated with zero drift), DO (sensor accuracy: ±0.1 mgL−1, resolution: 0.01 mgL−1, with a sensor lag of <60 s), and depth (sensor accuracy: ±0.08 m, resolution: ±0.008 m), deployed from a boat moored to the automated data buoy (in order to minimize wind-induced drift and therefore the vertical offset of the manual sensor in the water).

Data
For the analysis of weather bias in the sampling program, manual sampling data (see the Monitoring program section) from all seven Belgrade Lakes and the data from the Great Pond weather station were combined. It is assumed that the weather parameters recorded at the weather station represent the Belgrade Lakes, which are all within a 12 km radius of the weather station. For this study, we had access to valid observations from 26 July 2016 to 08 November 2019, a total of 1201 d. Manual sampling is scheduled to take place each year between April and October, resulting in a total sample of 740 d. The weather data were aggregated for each day, providing daily average measurements of air temperature, wind speed, and rainfall intensity (Table 1). Manual sampling took place on 221 d (in at least one of the seven lakes), that is, on around 30% of the 740 d in the sample. Several control variables were introduced to account for the temporal nature of the data, namely a binary variable measuring if the day was a working day (Monday-Friday) or not and temporal lag variables, accounting for the planned regularity of the sampling program.
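The daily aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name `daily_design_rows`, the record layout, and the toy values are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, date
from statistics import mean

def daily_design_rows(weather_records, sampling_days):
    """Aggregate sub-daily weather to daily means and attach the outcome.

    weather_records: iterable of (timestamp, air_temp, wind_speed, rain_intensity)
    sampling_days:   set of datetime.date on which at least one lake was sampled
    (Both input layouts are hypothetical.)
    """
    by_day = defaultdict(list)
    for ts, temp, wind, rain in weather_records:
        by_day[ts.date()].append((temp, wind, rain))

    rows = []
    for day in sorted(by_day):
        temps, winds, rains = zip(*by_day[day])
        rows.append({
            "date": day,
            "t_air": mean(temps),
            "s_wind": mean(winds),
            "i_rain": mean(rains),
            "working_day": int(day.weekday() < 5),  # Monday=0 ... Friday=4
            "sampled": int(day in sampling_days),   # binary dependent variable
        })
    return rows

# Toy input with invented values: two readings on a Friday, one on a Saturday.
records = [
    (datetime(2018, 7, 6, 9, 0), 24.0, 2.0, 0.0),
    (datetime(2018, 7, 6, 9, 15), 26.0, 4.0, 0.0),
    (datetime(2018, 7, 7, 9, 0), 18.0, 8.0, 1.5),
]
rows = daily_design_rows(records, {date(2018, 7, 6)})
```

Each returned row corresponds to one day of the sample, with the weather means as predictors and `sampled` as the outcome.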

Methods
To establish if the weather has an influence on the manual sampling program, we model the probability that manual sampling takes place on any given day, depending on the weather that same day. Specifically, we consider a linear time series model where the dependent variable $y_t$, $t = 1, 2, \ldots, T$, is a realization of a Bernoulli process (Nyberg 2010), representative of the manual sampling outcome at time $t$. As such, the dependent variable is operationalized in the following way:
$$y_t = \begin{cases} 1 & \text{if manual sampling took place on day } t, \\ 0 & \text{otherwise.} \end{cases}$$
We estimate this probability of sampling taking place using a logistic regression (LR) model with the independent variables of daily mean air temperature ($T_{air}$), mean wind speed ($S_{wind}$), and mean rainfall intensity ($I_{rain}$). Additionally, we control for whether the day was a working day or not ($WD$). The independent variables for each day $t$ can then be collected in a vector $x_t = \{T_{air}, S_{wind}, I_{rain}, WD\}^T$. We also initially included the prospective sampling schedule, that is, how many samples were supposed to be taken according to the seasonal schedule (see the Monitoring program section), as a control variable but did not include it in the final model since it had no significant effect and did not improve the model fit.
Logistic regression
Following the notation of Hosmer et al. (2013), the standard LR model can be written as
$$\pi(x) = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}}$$
where $\pi(x) = E(Y \mid x)$ is the conditional mean of the dependent variable $Y$ given the independent variables $x$. Here, $x = (1, x_1, \ldots, x_n)^T$ is a vector of $n$ independent variables, augmented with unity to allow for an intercept term, and $\beta = (\beta_0, \beta_1, \ldots, \beta_n)^T$ is a vector of the corresponding regression coefficients. These regression coefficients are determined by maximum likelihood estimation using observations of the pairs $(x_t, y_t)$, $t = 1, 2, \ldots, T$.
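To make the maximum likelihood step concrete, the sketch below fits a logistic model by plain gradient ascent on the Bernoulli log-likelihood. It is an illustration only: the optimizer, the function names, and the toy wind-speed data are assumptions, and a real analysis would normally rely on an established statistics package rather than this hand-rolled routine.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, step=0.1, n_iter=5000):
    """Maximise the Bernoulli log-likelihood by gradient ascent.

    X: list of feature lists (no intercept column); y: list of 0/1 outcomes.
    Returns beta with beta[0] as the intercept.
    """
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)                  # augment with unity
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += (yi - p) * v             # score contribution
        beta = [b + step * g / len(y) for b, g in zip(beta, grad)]
    return beta

# Invented toy data: sampling becomes less likely as wind speed rises.
wind = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]
sampled = [1, 1, 1, 0, 1, 0, 0, 0]
beta = fit_logistic(wind, sampled)
p_calm = sigmoid(beta[0] + beta[1] * 1.0)
p_windy = sigmoid(beta[0] + beta[1] * 8.0)
```

With these toy data the fitted wind coefficient is negative, so the predicted probability of sampling is higher on the calm day than on the windy one.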
Note that, for an LR, the regression coefficients β represent the slope of each independent variable for the log-odds of the dependent variable. Hence, care must be taken when interpreting the numerical results. For ease of interpretation, we provide marginal effects plots (Fig. 3) to visualize the effect of each independent variable on the dependent variable. Due to the temporal nature of these data and the fact that the manual sampling program was intended to follow a specified schedule, the observations were not strictly independent and temporal autocorrelation was expected to be an issue. Serial dependence was investigated using (1) the autocorrelation function (ACF), which calculates the correlation of the dependent variable with itself at previous times (lags), (2) the partial autocorrelation function (PACF), which estimates the partial correlation between different lagged observations while controlling for the effect of shorter lags (Metcalfe and Cowpertwait 2009), as well as (3) the Durbin-Watson test for autocorrelation. To control for temporal autocorrelation, which the standard LR model does not account for, two additional LR models were investigated: a lagged dependent variable model (LDV) and a generalized linear autoregressive moving-average model (GLARMA). For each of the models, the relevant assumption tests were completed following standard procedures (Menard 2010; Hosmer et al. 2013; Garson 2016).
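Two of the serial-dependence diagnostics named above have simple closed forms, sketched here for illustration. The sample ACF and the Durbin-Watson statistic are standard definitions; the function names and the toy weekly series are assumptions, and the PACF and significance testing are deliberately left out.

```python
def acf(series, lag):
    """Sample autocorrelation of a series with itself at the given lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return num / denom

def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 means no first-order autocorrelation,
    below 2 suggests positive and above 2 negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Invented example: a strictly weekly sampling pattern over six weeks.
weekly = [1, 0, 0, 0, 0, 0, 0] * 6
acf_lag7 = acf(weekly, 7)   # strong positive correlation at the weekly lag
acf_lag3 = acf(weekly, 3)   # weak negative correlation off the schedule
dw_alternating = durbin_watson([1, -1, 1, -1])
```

A schedule-driven outcome such as the weekly toy series shows a pronounced ACF peak at lag 7, which is the kind of pattern that lagged terms are meant to absorb.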

Lagged dependent variable model
The LDV model controls for the temporal autocorrelation by including the dependent variable $y$ at some time lag $k$ as an independent variable (Menard 2010). This is done by adding the dependent variable $y_{t-k}$, where $k$ is the lag, to the vector of independent variables, giving $x_t = (1, x_{1,t}, \ldots, x_{n,t}, y_{t-k})^T$. The vector of coefficients then becomes $\beta = (\beta_0, \beta_1, \ldots, \beta_n, \alpha_k)^T$, where $\alpha_k$ is an additional regression coefficient which, broadly speaking, measures the influence of the dependent variable at lag $k$. The advantage of this approach is the ease of interpretation of the resulting model, as the lags can simply be interpreted in the same way as any other binary independent variable. However, the inclusion of lagged dependent variables can potentially bias the regression coefficients (Achen 2000; Keele and Kelly 2006).
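Constructing the lagged predictors can be sketched as below. The helper name `add_lagged_outcome` is hypothetical, and the sketch assumes one row per consecutive calendar day; how gaps between sampling seasons were handled is not stated in the text and is not addressed here.

```python
def add_lagged_outcome(rows, y, lags):
    """Append y[t-k] for each lag k to the feature row of day t.

    rows: list of feature lists, one per consecutive day
    y:    list of 0/1 sampling outcomes aligned with rows
    lags: iterable of positive integer lags, e.g. (1, 5, 7, 14)
    Days without a full lag history are dropped.
    """
    lags = list(lags)
    max_lag = max(lags)
    X_ldv, y_ldv = [], []
    for t in range(max_lag, len(y)):
        X_ldv.append(list(rows[t]) + [y[t - k] for k in lags])
        y_ldv.append(y[t])
    return X_ldv, y_ldv

# Invented toy series with a single placeholder feature per day.
features = [[i] for i in range(6)]
outcome = [1, 0, 0, 1, 0, 1]
X_ldv, y_ldv = add_lagged_outcome(features, outcome, lags=(1, 2))
```

The augmented rows can then be passed to any logistic regression routine, with the extra columns playing the role of the lag coefficients.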

Generalized linear autoregressive moving-average model
To validate the conclusions, we fitted a GLARMA model, an extension of generalized linear models which robustly controls for serial dependence by including autoregressive (AR, $\phi_t$) and moving-average (MA, $\theta_t$) terms (Dunsmuir 2015; Dunsmuir and Scott 2015). A more detailed description of the GLARMA model is given in Dunsmuir and Scott (2015) and Dunsmuir (2015).

Results
All three LR models were applied to the data and Table 2 shows the regression coefficients, standard errors, and significance levels of each. Figure 3 shows the marginal effects plots for each weather-related independent variable, for ease of model interpretation. Regardless of the model, it is evident that the air temperature and whether it is a working day (weekday) have a positive effect on sampling events and that wind speed and rainfall intensity have a negative effect on sampling events. According to the ACF and PACF, the standard LR model exhibited significant autocorrelation for lags k = {1, 5, 7, 14}, which was also reflected in the Durbin-Watson test for autocorrelation (DW = 1.477, p < 0.001). Accordingly, the regression coefficients of the standard LR model should not be interpreted and are only reported for completeness. We included the lags exhibiting significant autocorrelation in the LDV and GLARMA models, which eliminated all significant autocorrelation. We hypothesize that these 1, 5, 7, and 14 d lags broadly represent prevailing sampling patterns. Introducing lags in the LDV model reduces the effect of air temperature, wind speed, and weekday, while the effect of rainfall intensity is increased. These differences are due partly to causal relationships exposed by controlling for past observations and partly to bias introduced through that same process (Achen 2000; Keele and Kelly 2006), as mentioned earlier. However, this is not particularly concerning in this case, considering that the differences are relatively minor and the coefficients remain significant in the LDV model. Considering the GLARMA model, we observe that all regression coefficients converge to values between those of the two previous models, and that the standard errors are mostly of a lower order than in either of the other models.
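Because the coefficients are on the log-odds scale, a marginal effects curve is obtained by varying one predictor over a grid while holding the others at their means and converting the linear predictor to a probability. The sketch below shows that calculation; the coefficient values and means are hypothetical placeholders, not the estimates in Table 2.

```python
import math

def predicted_probability(beta, x):
    """Convert the linear predictor (log-odds) to a probability."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect_curve(beta, means, index, grid):
    """Vary predictor `index` over `grid`, holding the others at their means."""
    curve = []
    for value in grid:
        x = list(means)
        x[index] = value
        curve.append((value, predicted_probability(beta, x)))
    return curve

# Hypothetical coefficients: intercept, air temperature, wind speed.
beta_example = [0.0, 0.1, -0.5]
means_example = [20.0, 3.0]          # hypothetical mean air temp and wind speed
wind_curve = marginal_effect_curve(beta_example, means_example,
                                   index=1, grid=[1.0, 3.0, 5.0])
```

A negative log-odds coefficient, as assumed here for wind speed, produces a curve whose predicted probability falls as the predictor increases.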
Since GLARMA is generally considered more robust than the LDV approach for time-series regression (Dunsmuir 2015; Dunsmuir and Scott 2015), we believe that the GLARMA model best represents the underlying "true" relationship between weather conditions and manual sampling events, even though the LDV model achieves a better Akaike Information Criterion (AIC, see Quinn and Keough 2002). It is likely that the relatively simplistic approach of including lagged dependent variables has introduced some bias in the LDV model, which would be reflected in a higher likelihood and consequently lower AIC, given that the models have the same number of estimated parameters. While the McFadden pseudo $R^2$ cannot be interpreted as analogous to the coefficient of determination in a linear least squares model, values above 0.3 indicate an excellent fit for a logistic model (McFadden 1979). This strongly supports our hypothesis that weather conditions influence the probability of undertaking manual WQ samples on a particular day.
Fig. 3. Marginal effects plots for all the models. $E(Y \mid x)$ is the conditional mean of the dependent variable, and broadly represents the probability of a sample being taken. For each effects plot, the remaining variables were held constant at their means. Shaded areas show 95% confidence intervals. These results align with our hypothesis and the anecdotal behavior of lake scientists where "good weather" promotes manual sampling taking place and "bad weather" discourages or prevents manual sampling from taking place.
It should be noted that, while the exact meanings of "good" and "bad" weather are subjective, and especially dependent upon the time of year, the GLARMA model uses moving averages for the weather data so in effect accounts for such seasonal variations.
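The two fit diagnostics used in the model comparison follow directly from the log-likelihood. The sketch below uses the standard definitions (McFadden pseudo R² as one minus the ratio of model to intercept-only log-likelihood, and AIC as 2k − 2 ln L); the function names and toy predictions are assumptions for illustration.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p):
    """1 - lnL(model) / lnL(null), where the null model predicts the base rate."""
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1.0 - log_likelihood(y, p) / ll_null

def aic(y, p, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood(y, p)

# Invented outcomes and predictions from a confident versus an uninformative model.
y_toy = [1, 0, 1, 0]
p_good = [0.9, 0.1, 0.9, 0.1]
p_null = [0.5, 0.5, 0.5, 0.5]
r2_good = mcfadden_r2(y_toy, p_good)
r2_null = mcfadden_r2(y_toy, p_null)
```

With the same number of parameters, the model with the higher likelihood always has the lower AIC, which is why a biased but better-fitting LDV model can still win on that criterion.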

Effect of weather sampling bias on WQ parameters using only automated buoy measurements
Having shown that weather conditions bias when manual sampling takes place, this and the next section examine the effect of the weather bias on the manually collected WQ data.
In the Effect of weather sampling bias on WQ parameters using only automated buoy measurements section, we use only data from the automated data buoy on Great Pond. By using this single data source, and using it to emulate the manual sampling data, any potential issues concerning sensor calibration, sensor drift, positional accuracy, environmental, and lake effects are eliminated. The analysis in this section demonstrates a consistent effect in the data, which we ascribe to the manual sampling weather bias.
The Effect of weather sampling bias on water temperature using automated buoy and manually collected measurements section analyses water temperature data to directly compare the automated and manually collected data. This relates our findings to actual manual measurements and additionally reveals sensor lag as another critical issue with manual sampling. While there are some caveats with respect to comparing measurements from two different sensors, the findings of both approaches align closely, hence mutually validating both analyses.

Data and methods
For the sampling period April to October, obtained annually for the 4 yr 2016-2019, we have approximately 35,000 automated measurements for each WQ sensor over a total of 422 d from the data buoy on Great Pond. The details and depths of these water temperature, DO, chlorophyll, and PAR sensors are given in the Monitoring program section. In this same period, we identified 51 d when manual sampling took place on Great Pond.
Table 2. Regression coefficients (β, log odds), standard errors (SE), and common diagnostics for LR models estimating the influence of weather and time variables on the likelihood of sampling. (1) Standard logistic regression (LR) model not correcting for temporal autocorrelation. (2) Lagged dependent variable (LDV) model. (3) Autoregressive moving-average model (GLARMA). Significance codes: ****p ≤ 0.001, ***p ≤ 0.01, **p ≤ 0.05, *p ≤ 0.1.
Columns: Predictor; (1) standard LR model; (2) LDV model; (3) GLARMA model.
By taking a subset of the automated data buoy measurements, filtered by the exact time window when actual manual sampling measurements took place, we obtained an emulated manual sampling data set. By comparison of the full automated data set and the emulated manual data set, we investigated the measurement bias introduced by the weather-dependent manual sampling. Under the assumption that the mean values of all automatic measurements of each WQ parameter represent the true population means ($\mu_{true}$), we estimated the difference in the subset means ($\mu_{sub}$) compared to the true means using one-sample t-tests (Ross 2021). For each t-test, we report the t-statistic (t), degrees of freedom (df), and significance level in the form of the p-value (p) (see, e.g., Berthouex and Brown 2002; Quinn and Keough 2002; Ross 2021). The PAR data did not meet the criteria for robust t-test analysis and were, therefore, not included.
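The emulation procedure can be sketched in two steps: filter the buoy record to the manual sampling windows, then compare the subset mean with the mean of all buoy measurements in a one-sample t-test. The function names, the use of plain numbers as timestamps, and the toy values are assumptions; the p-value lookup from the t distribution is omitted.

```python
import math
from statistics import mean, stdev

def emulate_manual_subset(buoy, windows):
    """Keep buoy values whose timestamp falls inside any manual sampling window.

    buoy:    list of (timestamp, value)
    windows: list of (start, end) pairs, both inclusive
    """
    return [v for ts, v in buoy if any(s <= ts <= e for s, e in windows)]

def one_sample_t(sample, mu_true):
    """t-statistic and degrees of freedom for H0: mean(sample) == mu_true."""
    n = len(sample)
    t = (mean(sample) - mu_true) / (stdev(sample) / math.sqrt(n))
    return t, n - 1

# Invented buoy temperatures indexed by hour; manual sampling ran from hour 2 to 4.
buoy = [(0, 20.0), (1, 20.5), (2, 22.0), (3, 22.4), (4, 22.2), (5, 20.1)]
subset = emulate_manual_subset(buoy, [(2, 4)])
mu_true = mean(v for _, v in buoy)       # mean of all automated measurements
t_stat, df = one_sample_t(subset, mu_true)
bias = mean(subset) - mu_true            # estimate of mu_sub - mu_true
```

Because both data sets come from the same sensors, any difference in means reflects only when the measurements were taken.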

Results
The analysis showed a statistically significant difference between the mean of the subset and the population mean in the entire water column for the water temperature data, the DO data, and for the 2 m chlorophyll sensor (Table 3). The 6 m chlorophyll sensor shows no significant difference. We ascribe the statistical difference between all automated collected data and the emulated manual sampled data to the effect of the weather bias on the collection of the manual WQ data.
Water temperature measurements from the manual emulation were, on average and depending upon depth, between 0.4°C and 1.2°C higher than the total measurements gathered by the automated buoy, while annual DO measurements were lower, differing by between −0.8 and −0.4 mgL−1. Chlorophyll at 2 m depth is measured at 2.0 μgL−1 higher. These findings make intuitive sense, as warmer, less windy, and sunnier "manually biased" weather yields higher water temperature, which would encourage algae growth as well as decrease the capacity of the water to hold DO.
Examining the difference between the subset mean and population mean, $\mu_{sub} - \mu_{true}$, depending on the depth, we find a stratified depth dependency where the consistent magnitude of these WQ parameter differences abruptly changes. For water temperature, there is an approximate 1.1°C difference above the 9 m depth, and an approximate 0.4°C difference below. This phenomenon is investigated further in the next section.
Table 3. One-sample t-test analysis of emulated manual sampling data vs. true population mean. The manual sampling emulated data are a subset of all automated sampling data, filtered within the time window where manual sampling started and stopped on each occasion. $\mu_{sub} - \mu_{true}$ represents the averaged difference between all automated measurements and those collected by emulated manual sampling and is a quantification of the weather bias effect on each WQ parameter at each particular depth. All values, except for the 6 m depth chlorophyll, are statistically significant (p < 0.05), indicating that the weather bias has a significant effect on manually collected WQ data. Note that there appears to be a depth effect where the magnitudes of $\mu_{sub} - \mu_{true}$ differ below and above approximately 9 m. This effect is investigated for the water temperature parameter in the Effect of weather sampling bias on water temperature using automated buoy and manually collected measurements section. Significant differences are highlighted in bold.
Columns: Variable and depth; $\mu_{sub} - \mu_{true}$; 95% CI; t; df; p.

Effect of weather sampling bias on water temperature using automated buoy and manually collected measurements
With a weather bias effect upon WQ data having been demonstrated using emulated data, this section investigates the bias effect using the actual manually collected water temperature data to confirm our findings and examine the depth effect noted in the Results section above. The manual and automatic sensors were not jointly calibrated. However, according to the manufacturer, all of the temperature sensors involved in this analysis require no periodic calibration and do not drift. Despite the manufacturer's calibration assurance, Table S1 shows descriptive statistics for the fully automated and manually collected temperature data set. The data used in this analysis are historical. No contemporary checks were conducted to assure that the temperature sensors on the automated buoy measure the same temperature as the sensor on the manual Aqua-Troll 400. However, based on the manufacturer's declaration that the sensors are factory calibrated, do not drift, and require no periodic recalibration, we assume the sensors measure the same temperature value. Furthermore, seeing that the boat used during manual sampling is attached to the automated data buoy, it is assumed that the different sensors are colocated during the manual cast and subject to the same conditions during sampling. The Comparison of matched measurements section details the investigation into whether these assumptions are valid. The Sensor lag correction section details the investigation into sensor lag and why, at depths below 9 m, some of our assumptions may be violated.
To investigate whether the manual sensor probe indeed measures the same temperatures as the automatic data buoy sensors, we matched these two data sources according to time and depth. We found 46 d with both automatic and manual sampling on Great Pond, constituting the data set for the analysis of matched pairs. Note that the raw temperature data were collected at depths $D = \{1, 3, 5, 7, 9, 11, 13, 15, 17, 19\}$ (meters) for the automatic measurements and continuously over depths $z \in [0, 20]$ (meters) for the manual measurements. Hence, to pair the automatic and manual data, the two data sources required harmonization, to which end the manual measurements were aggregated according to their proximity to the automatic depths $D$.
The automatic measurements (superscript (a)) were aggregated in time, using all temperature measurements from each day, so that the average automatic temperature measurement for day $t$ at depth $d \in D$ was
$$\bar{T}^{(a)}_{t,d} = \frac{1}{N^{(a)}_{t,d}} \sum_{n=1}^{N^{(a)}_{t,d}} T^{(a)}_{n,t,d}$$
where $T^{(a)}_{n,t,d}$ is the $n$th automatic temperature measurement and $N^{(a)}_{t,d}$ is the number of automatic measurements on day $t$ at depth $d$. We averaged the automatic measurements using all measurements from each day, as opposed to using only daytime measurements, since it was found that the within-day water temperatures did not vary greatly. Moreover, using only daytime measurements did not produce different results from those presented. Supporting Information Figure S1 shows that all day (24 h) averaged temperature data are perfectly correlated (R = 1, p < 0.001) and operationally equivalent to working day (08:00-16:00, 8 h) averaged data.
Conversely, the manual measurements (superscript (m)) were collected continuously at various depths $z$ as the probe was lowered. To match these measurements with the fixed, discrete depths of the automatic buoy, we aggregated the manual measurements according to their distance to sampling depths of the automatic measurements, giving more influence to manual samples closer to the automatic buoy depths $D$, to obtain distance-weighted average manual temperatures
$$\bar{T}^{(m)}_{t,d} = \frac{\sum_{n=1}^{N^{(m)}_{t,d}} w_{n,t,d}\, T^{(m)}_{n,t}}{\sum_{n=1}^{N^{(m)}_{t,d}} w_{n,t,d}}$$
with weights $w_{n,t,d} = 1 - |d_z - z_{n,t}|$, where $|d_z - z_{n,t}|$ is the distance between the depth of the manual measurement $z_{n,t}$ and the nearest automatic buoy depth $d_z$, and $N^{(m)}_{t,d}$ is the number of manual measurements within 1 m of the buoy depth $d$. Figure 4 depicts the paired automatic and manual aggregates, while summary statistics can be found in Supplementary Table S1.
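The depth harmonization of a manual cast can be sketched as follows. The weights follow the definition in the text (one minus the distance to the nearest buoy depth); normalizing by the sum of the weights is an assumption of this sketch, as are the function name and the toy profile.

```python
BUOY_DEPTHS = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]  # meters, as listed in the text

def weighted_manual_means(profile, depths=BUOY_DEPTHS):
    """Distance-weighted mean manual temperature at each buoy depth.

    profile: list of (depth_m, temperature) from one manual cast.
    Measurements 1 m or more from their nearest buoy depth get no weight.
    """
    sums = {d: [0.0, 0.0] for d in depths}   # depth -> [sum of w*T, sum of w]
    for z, temp in profile:
        d = min(depths, key=lambda b: abs(b - z))   # nearest buoy depth
        w = 1.0 - abs(d - z)
        if w > 0:
            sums[d][0] += w * temp
            sums[d][1] += w
    # Assumption: normalise by the sum of the weights.
    return {d: wt / w for d, (wt, w) in sums.items() if w > 0}

# Invented cast: two readings near 1 m, one exactly at 3 m.
cast = [(1.0, 22.0), (1.5, 21.0), (3.0, 20.0)]
manual_means = weighted_manual_means(cast)
```

In the toy cast, the reading at 1.5 m contributes half the weight of the reading taken exactly at the 1 m buoy depth.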
Inspection of the matching data (see the Comparison of matched measurements section for statistical validation) revealed that both the automatic and manual data exhibited some manner of bimodal distribution (Fig. 5, upper panel). We successfully separated the two distributions (Fig. 5, lower panels) into distinct, unimodal distributions by filtering the matching data according to depth. Specifically, we separated data points from the epilimnion (1-7 m) from the layers below, that is, the thermocline and hypolimnion (9-19 m). This bimodal, depth-dependent distribution is the same as identified in the Results section above. In the following analyses, we consider all three cases, namely (1) the entire data set, (2) data from the epilimnion, and (3) data from the thermocline and hypolimnion regions.

Results
In this section, we (1) analyze if the automated buoy and the manual temperature sensors measure the same water temperature values (see the Comparison of matched measurements section) and investigate the bimodal depth effect, which we ascribe to sensor lag; (2) quantify and correct for the sensor lag (see the Sensor lag correction section); and (3) analyze the water temperature data sets measured by automated and sensor-lag corrected manual means (see the Weather bias estimation section). Using both frequentist and Bayesian statistics, we achieve distinct statistical analyses which constitute robust methods for comparing manual and automated temperature data. A clear difference in overall water temperature is shown between manual and automated collection methods. We attribute these differences to weather bias on the collection of manual WQ samples, confirming the earlier analysis in the Effect of weather sampling bias on WQ parameters using only automated buoy measurements section.

Comparison of matched measurements
We tested for a difference between the matched manual and automatic temperature samples using two methods: a paired Student's t-test (Ross 2021) and an analogous Bayesian analysis for paired samples (Rouder et al. 2009). To evaluate the significance of the Bayesian results, we report the Region of Practical Equivalence (ROPE), which is broadly the Bayesian equivalent of a p-value (Kruschke 2018). Additionally, we compute the discrete Jensen-Shannon (JS) divergence between manual and automatic samples, using a bin width of 1°C. The JS divergence is a symmetric measure of the divergence (or "dissimilarity") between two distributions (Lin 1991).
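A discrete JS divergence with a fixed bin width can be computed as in the sketch below. The base-2 logarithm (which bounds the result between 0 and 1) and the bin alignment at integer multiples of the bin width are assumptions, since the text specifies only the 1°C width; the function name and toy samples are illustrative.

```python
import math
from collections import Counter

def js_divergence(sample_a, sample_b, bin_width=1.0):
    """Discrete Jensen-Shannon divergence between two binned samples."""
    bins_a = Counter(math.floor(v / bin_width) for v in sample_a)
    bins_b = Counter(math.floor(v / bin_width) for v in sample_b)
    js = 0.0
    for k in set(bins_a) | set(bins_b):
        p = bins_a[k] / len(sample_a)     # Counter returns 0 for missing bins
        q = bins_b[k] / len(sample_b)
        m = 0.5 * (p + q)                 # mixture distribution
        if p > 0:
            js += 0.5 * p * math.log2(p / m)
        if q > 0:
            js += 0.5 * q * math.log2(q / m)
    return js

# Invented temperatures: same bin, disjoint bins, and partial overlap.
js_same = js_divergence([20.1, 20.5], [20.2, 20.9])
js_disjoint = js_divergence([20.1], [25.3])
js_partial = js_divergence([20.1, 21.2], [21.4, 22.8])
```

Values near 0 indicate that the two binned distributions are nearly identical, while values near 1 indicate almost no overlap.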
When comparing all data, it did not appear that automatic and manual sampling measured the same temperatures (t[435] = 14.65, p < 0.001, $\mu_{diff}$ = 0.73°C). However, when filtered according to depth, we discovered that for epilimnion depths (1-7 m), the two sampling methods agreed (t[176] = 1.37, p = 0.173, $\mu_{diff}$ = 0.05°C), while for thermocline and hypolimnion depths (9-19 m), they disagreed (t[258] = 18.18, p < 0.001, $\mu_{diff}$ = 1.19°C). Both the Bayesian posterior estimates and the JS divergences confirmed this impression. Please refer to Supplementary Table S2 for further details. We hypothesize that the effect in the hypolimnion can be attributed to sensor lag related to the manual sensor experiencing greater temperature variability as it passes through the thermocline. This is despite the manual sampling procedure of slowly lowering the Aqua-Troll sensor to help eliminate sensor lag. The manual probe was, on average, found to measure higher temperatures than the automated buoy. As the probe is lowered from generally warmer water, through the thermocline layer where temperature gradients are larger, to cooler water, the sensor response time is insufficient to compensate, and we attribute the higher manual readings to this. The violin plot (Fig. 6) illustrates this phenomenon.

Sensor lag correction
Having identified a temperature sensor lag that is predominant in the thermocline/hypolimnion, this section quantifies and corrects for the lag throughout the entire water column of the Great Pond data. To isolate the weather bias from the identified sensor-lag bias, we first computed the expected manual sensor-lag bias for each automatic sampling depth $d \in D$, using the paired measurements (N = 436) from the previous analysis. The resulting sensor-lag biases and relevant statistics are shown in Supplementary Table S3. We then corrected every manual temperature measurement (N = 14,260) using the sensor-lag bias corresponding to the nearest automatic buoy depth. We refer to these data as sensor-lag corrected manual measurements, and use them in place of the raw measurements in the analysis below.
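The correction step can be sketched as below. The text does not state how the expected bias per depth was computed; taking it as the mean paired difference (manual minus automatic) at each buoy depth is an assumption of this sketch, as are the function names and toy values.

```python
from collections import defaultdict
from statistics import mean

def lag_bias_by_depth(pairs):
    """Mean manual-minus-automatic difference at each buoy depth.

    pairs: list of (buoy_depth, manual_temp, automatic_temp).
    Assumption: the expected sensor-lag bias is the mean paired difference.
    """
    diffs = defaultdict(list)
    for d, t_man, t_aut in pairs:
        diffs[d].append(t_man - t_aut)
    return {d: mean(v) for d, v in diffs.items()}

def correct_manual(measurements, bias):
    """Subtract the bias of the nearest buoy depth from each manual reading.

    measurements: list of (depth_m, temperature).
    """
    depths = sorted(bias)
    corrected = []
    for z, temp in measurements:
        d = min(depths, key=lambda b: abs(b - z))   # nearest buoy depth
        corrected.append((z, temp - bias[d]))
    return corrected

# Invented paired aggregates at 1 m and 11 m, then two raw manual readings.
pairs = [(1, 22.0, 22.0), (1, 22.2, 22.0), (11, 12.0, 11.0), (11, 12.5, 11.0)]
bias = lag_bias_by_depth(pairs)
corrected = correct_manual([(0.6, 22.1), (10.4, 12.25)], bias)
```

In the toy data the estimated bias is small near the surface and larger at 11 m, mirroring the depth pattern described in the text.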

Weather bias estimation
In this section, having corrected for any sensor lag, we compare all automated and manual water temperature measurements to show the effect of the weather bias in the WQ data from Great Pond. In line with the paired, aggregated data, we found a bimodal distribution in the raw, unaggregated data, which persisted after correcting for the sensor lag bias (Fig. 7). Hence, we filtered the unaggregated data using the same criteria, and treated data from the epilimnion and the hypolimnion separately.

Table 4. Two-sample t-tests, Bayesian posteriors and JS divergences comparing unaggregated automatic and sensor-lag corrected manual samples. μ_man − μ_aut is the difference between the overall automated and the manual temperature populations and in effect represents the overall average difference of manual compared to automated temperature sampling, which we have ascribed to a weather bias effect.

Fig. 7. Density histograms and kernel density estimates (bw = 1) showing the distributions of the raw, unaggregated automatic temperature measurements (blue) and sensor-lag corrected manual temperature measurements (red), respectively. The top plot shows all the data, while the bottom two plots show the data from depths above and below 7 m, respectively.
Using both a two-sample t-test (Ross 2021) and the analogous Bayesian analysis for two-sample designs (Rouder et al. 2009), we estimate the difference in means between the whole population of the sensor-lag corrected manual and automatic temperature samples, as shown in Table 4. Both the t-tests and the Bayesian analyses show that there is a difference in the means for all three subsets. For epilimnion depths (0-8 m), the two-sample t-test indicates a significant difference in the means of 1.0°C (t[7256.9] = 22.49, p < 0.001). For hypolimnion depths (8-20 m), the two-sample t-test indicates a significant difference in the means of 0.5°C (t[8636.9] = 16.07, p < 0.001). These findings are confirmed by the Bayesian analysis for two-sample designs. We hypothesize that the smaller difference at greater depths can be attributed to the higher lake temperature stability as the depth increases.
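As an illustration of the two-sample comparison, the sketch below computes an unequal-variance (Welch) t statistic with Welch-Satterthwaite degrees of freedom. The choice of the Welch variant is an assumption inferred from the fractional degrees of freedom reported above; the function name `welch_t` and the example values are hypothetical.

```python
import statistics


def welch_t(a, b):
    """Return (difference in means, t statistic, Welch-Satterthwaite df) for two samples."""
    na, nb = len(a), len(b)
    # Squared standard errors of each sample mean.
    va = statistics.variance(a) / na
    vb = statistics.variance(b) / nb
    diff = statistics.fmean(a) - statistics.fmean(b)
    t = diff / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return diff, t, df


# Hypothetical temperatures (°C): a few manual readings versus more buoy readings.
manual = [21.0, 22.0, 23.0, 24.0]
automatic = [19.0, 20.0, 21.0, 22.0, 23.0, 24.0]

diff, t_stat, df = welch_t(manual, automatic)
print(f"mu_man - mu_aut = {diff:.2f} °C, t[{df:.1f}] = {t_stat:.2f}")
```

The unequal-variance form is a natural fit when the two samples differ greatly in size, as is the case for sparse manual sampling compared with continuous buoy records.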
As established in the Influence of weather on manual sampling section, "good weather" (high air temperature, low rainfall, and low wind speed) results in a higher probability of manual sampling taking place (Fig. 3). Conversely, "bad weather" (low air temperature, high rainfall, and high wind speed) has a lower probability of manual sampling taking place. "Good weather" is associated with typically higher lake water temperatures and "bad weather" with typically lower lake water temperatures. While further validation on other lakes is required, this result indicates that weather-biased manual water temperature measurements will tend to be higher than their corresponding automated temperature measurements. These findings imply that manual sampling, typically subject to weather bias, will tend to record higher lake water temperatures on average than the "true" average lake water temperature. In this case study, this temperature difference was found to be between 0.5°C and 1.0°C, depending upon depth.

Fig. 8. Aggregated water temperature at depth data collected by automated buoy compared to raw manual data, sensor lag corrected manual data, and emulated manual data (subset of the automatic data reflecting the manual sampling days). The differences between manual data collection values and the automated values are ascribed to the effect of weather bias on the manual sampling. The effect of sensor lag at depth on the manual sampling data is also clearly evident.

Table 5. Examples of the RMSE temperature values of four lake models taken from recent literature (Dissanayake et al. 2019; Baracchini et al. 2020; Gasca-Ortiz et al. 2020; Man et al. 2021). None of these studies accounted for the effect of weather bias on their manually sampled WQ data. If our findings are representative of these studies and the water temperature weather bias was accounted for, then this would very significantly alter the reported lake water temperature values and/or the RMSE values reported.

Date | Data collection method | Model | RMSE (°C) | Temperature sensor | Accuracy/resolution (°C)

Comparison of the estimated weather biases
Using two different data sets and both frequentist and Bayesian statistics, we have robustly shown that manual sampling records statistically significantly different lake WQ values as compared to continuous, automated sampling. We have demonstrated this bias for water temperature and, albeit with less independent validation, for DO and chlorophyll. As shown in Fig. 8, two different approaches for estimating the water temperature bias arrive at broadly the same result. In this figure, the close alignment of the actual and emulated manual water temperature data can be seen, indicating measurement equivalence of the manual and automatic sensors. However, this only holds true in the epilimnion: from 9 m depth and deeper, we see a sensor lag effect in the actual manual measurements, which disappears in the sensor lag corrected measurements. Importantly, all of these manual measurements are markedly offset from the automated temperature values, visualizing the weather bias in manual sampling.
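The emulated manual data set referred to above is a subset of the automatic record taken at the times when manual sampling actually occurred. A minimal sketch of that idea follows, assuming that matching is done on calendar date and hour of day; the exact matching window is not specified here. The function name `emulate_manual`, the timestamps, and the temperatures are hypothetical.

```python
from datetime import datetime
from statistics import fmean


def emulate_manual(buoy_records, manual_times):
    """Keep only buoy readings that share a date and hour with a manual sampling visit.

    buoy_records: list of (timestamp, value) from the automated buoy.
    manual_times: list of timestamps of actual manual sampling.
    """
    keys = {(t.date(), t.hour) for t in manual_times}
    return [(t, v) for t, v in buoy_records if (t.date(), t.hour) in keys]


# Hypothetical buoy temperatures (°C) and two manual sampling visits.
buoy = [
    (datetime(2020, 7, 1, 10), 22.0),
    (datetime(2020, 7, 1, 22), 21.0),
    (datetime(2020, 7, 2, 10), 18.0),  # no manual visit on this cooler day
    (datetime(2020, 7, 3, 10), 23.0),
]
manual_times = [datetime(2020, 7, 1, 10, 30), datetime(2020, 7, 3, 10, 5)]

emulated = emulate_manual(buoy, manual_times)
# Bias of the emulated manual record relative to the full automated record.
bias = fmean(v for _, v in emulated) - fmean(v for _, v in buoy)
print(f"emulated-manual bias = {bias:.2f} °C")
```

Because the emulated record comes from the same sensor as the full record, any difference between the two means cannot be caused by instrument differences or sensor lag, which is what makes it useful as a cross-check on the actual manual data.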

Conclusions
Our case study of the lake monitoring program across seven lakes in Maine, USA, has revealed a manual sampling weather bias. This weather bias, based on air temperature, rainfall intensity, and wind speed, was found to have a statistically significant effect on the probability of manual WQ sampling taking place. In simple terms, "good" weather promotes manual WQ sampling while "bad" weather inhibits it.
Using measurements from a single automated data buoy and emulating manual sampling data by filtering by the day and time of actual manual sampling, we have shown a measurable effect of this weather bias on the water temperature, DO, and to some extent, chlorophyll measurements. Compared to the continuous buoy measurements, the emulated manual measurements were found to be, on average, statistically significantly different and consequently biased. These mean effects ranged, depending upon depth, between 0.4°C and 1.2°C higher for water temperature, between −0.8 and −0.4 mgL−1 (that is, lower) for DO, and 2.0 μgL−1 higher for chlorophyll.
Despite the monitoring program's procedure of slowly lowering the manual probe to eliminate it, sensor lag was still apparent in the manually collected data. This finding is confirmed by both t-test and Bayesian analysis. In the epilimnion (0-8 m), the automated and manual temperature measurements align. However, in the thermocline and hypolimnion (8-20 m), there is an offset between manual and automated measurements (Fig. 6), which we ascribe to the average effect of the manual temperature probe displaying sensor lag while transiting the thermocline layer. This sensor lag effect was calculated and used as a correction factor in the hypolimnion layer. This finding reiterates the care required, when using any manually collected data, to examine and correct for sensor lag effects, regardless of the manual collection procedures used to eliminate such a lag.
To reinforce the emulated data bias analysis findings, the effects of the weather bias have been demonstrated by comparing actual manually collected water temperature with that collected by an automated data buoy. In the epilimnion (1-7 m), manual methods measure, on average, 1.0°C higher temperatures than measured by automated methods. In the hypolimnion (8-20 m), once the effect of thermocline/sensor lag has been removed, manual methods measure, on average, 0.5°C higher temperatures than measured by automated methods. The difference between these depth-reliant values can be explained by the increased stability of water temperature (and DO) at lower depths, and their concomitant lower susceptibility to weather bias.
The observed influence of weather-induced sampling bias on lake water temperature, DO, and chlorophyll strongly suggests that this effect may be a general finding applicable to other lakes. More manual sampling programs need examining; however, if this finding is representative, calculating and accounting for weather bias when using manually obtained WQ sampling data should become standard practice.
Weather-induced manual sampling bias is expected to affect manually measured WQ parameters other than water temperature, chlorophyll, and DO. Further research is required to examine this hypothesis. The literature, for example British Standards (2006), recommends that manual sampling programs looking for systematic changes over time be conducted with regular periodicity to avoid detecting variations in WQ that are not of interest. While such guidelines are not specifically concerned with our novel weather bias findings, we believe that adherence to them through strictly regular WQ measurement periodicity would go some way to addressing the effect of the manual sampling weather bias identified. It may be possible to account for any weather bias developed by a manual monitoring program through examination of weather records and any deviations from a strict periodicity of measurement. Ultimately, the use of automated WQ data collection is the solution to the issue of weather bias reported in this paper.
The identified weather bias promotes manual WQ sampling taking place during warmer, drier, and less windy weather conditions. Conversely, it reduces the probability of manual WQ sampling taking place during periods of colder, wetter, and windier weather. In short, the weather bias promotes manual sampling during "good" weather and reduces manual sampling in "bad" weather, which in turn results in sampling overall warmer water temperatures. Warmer water holds less DO and encourages algal growth and therefore higher chlorophyll. As it is hypothesized that the weather bias relationship observed for water temperature will generally hold true for other WQ parameters, the corresponding assessment of overall WQ and ecosystem health will be affected.
The magnitude of the error in lake water temperature due to the weather-induced manual collection bias is similar to typical lake model root mean square errors (RMSE) reported in the literature (Table 5). If our weather bias temperature values were applied to these models, they would significantly alter the lake model temperature inputs, with corresponding changes to the lake model temperature outputs and/or the RMSE values. That these weather bias temperature values have similar magnitudes to model RMSE values demonstrates the significance of our findings.
To further illustrate the significance of our findings, when examining the warming of northeastern North American lakes due to climate change, Richardson et al. (2017) report a mean warming of lake surface temperatures of 0.05°C yr−1. Much of the data used by Richardson et al. (2017) will inevitably have been obtained through manual sampling. If our case study finding is representative of other manual monitoring programs, then a potential weather bias in the epilimnion of 1.0°C, as we have reported, would have a significant impact on the validity of such climate change studies. In addition, a manual sampling weather bias increase of 2.0 μgL−1 in average annual chlorophyll measurements, as we have shown, is of sufficient magnitude to change the trophic classification of a lake in the State of Maine (State of Maine Legislature 2021).
Our main conclusion is that users of lake WQ data obtained from manual collection programs should be aware that the data may contain weather-induced bias. We advise that records of the manual WQ data collection timings and relevant weather records be examined to determine whether weather bias is likely to be present. The methodologies presented in this paper can be used to assess and correct for such weather-induced WQ biases.

Data availability statement
The analysis data and code are available at https://github.com/MirjamOdile/weather-bias-in-water-quality-data.