Predicting Water Temperature Dynamics of Unmonitored Lakes with Meta Transfer Learning

Most environmental data come from a minority of well-monitored sites. An ongoing challenge in the environmental sciences is transferring knowledge from monitored sites to unmonitored sites. Here, we demonstrate a novel transfer learning framework that accurately predicts depth-specific temperature in unmonitored lakes (targets) by borrowing models from well-monitored lakes (sources). This method, Meta Transfer Learning (MTL), builds a meta-learning model to predict transfer performance from candidate source models to targets using lake attributes and candidates' past performance. We constructed source models at 145 well-monitored lakes using calibrated process-based modeling (PB) and a recently developed approach called process-guided deep learning (PGDL). We applied MTL to either PB or PGDL source models (PB-MTL or PGDL-MTL, respectively) to predict temperatures in 305 target lakes treated as unmonitored in the Upper Midwestern United States. We show significantly improved performance relative to the uncalibrated process-based General Lake Model, where the median RMSE for the target lakes is $2.52^{\circ}C$. PB-MTL yielded a median RMSE of $2.43^{\circ}C$; PGDL-MTL yielded $2.16^{\circ}C$; and a PGDL-MTL ensemble of nine sources per target yielded $1.88^{\circ}C$. For sparsely monitored target lakes, PGDL-MTL often outperformed PGDL models trained on the target lakes themselves. Differences in maximum depth between the source and target were consistently the most important predictors. Our approach readily scales to thousands of lakes in the Midwestern United States, demonstrating that MTL with meaningful predictor variables and high-quality source models is a promising approach for many kinds of unmonitored systems and environmental variables.


Introduction
Environmental data often does not exist at the appropriate resolution or extent for decision making or characterizing change. Models can be used to fill gaps in key ecosystem variables, such as extreme precipitation rates (Lockhoff et al., 2014), soil moisture (Mishra et al., 2017), hydrological flow (Y. Liu et al., 2017), and lake temperature (Aguilera et al., 2016), which otherwise would be unavailable at the spatial and temporal scales needed for ecological decision-making (Lovett et al., 2007). Although sensor data is increasingly prevalent, it will always be incomplete, especially for variables where observations are concentrated in a small subset of locations and the majority of locations remain unmonitored. Since observing key variables like these at scale is prohibitively costly (Caughlan & Oakley, 2001), models that can efficiently use existing data and transfer information to unmonitored systems are critical to closing our information gaps.
There are many modeling approaches for predicting complex environmental phenomena, and model choice can be viewed as a trade-off among prediction accuracy, data needs, and generalizability to new systems. Process-based models are a popular modeling choice for water resources tasks like the prediction of stream temperature (Dugdale et al., 2017), hydrological variables (Paniconi & Putti, 2015;Fatichi et al., 2016), and lake temperature (Gaudard et al., 2019;Hipsey et al., 2019;Winslow et al., 2017). Processbased models encode our understanding of relevant physical processes into numerical formulations. These relationships are often developed from decades of theory, observation, and experimentation, resulting in sufficient understanding of processes and their interactions to support defining them with code for a simulation model (Cuddington et al., 2013;White & Marshall, 2019). However, these models provide an approximation of reality and often require time-intensive parameter calibration to compensate for incomplete inclusion or resolution of processes. More recently, the rapid growth of sensor data (Porter et al., 2012;Hampton et al., 2013) along with advances in computation have led to development and increased use of powerful data-driven environmental models. Ensemble tree methods like gradient boosting and random forests, in addition to more advanced methods like deep learning (Goodfellow et al., 2016), have been effectively used for geoscientific applications (Reichstein et al., 2019) and water resources (Erdal & Karakurt, 2013;Shen, 2018;Tyralis et al., 2019). A major reason for this success is that ML models, given sufficient data, can discern patterns and structure in problems where complexity prohibits explicit programming of a system's exact physical nature. Given this ability to automatically extract complex relationships from data, ML models (e.g., deep learning) appear promising for scientific problems with physical processes that are not fully understood by researchers, but for which data of adequate quality and quantity is available. Given enough data, data-driven models can increase prediction accuracy relative to existing process-based methods due to lack of a priori constraints and the expressive power of modern data-driven models, though they can lack interpretability and generalizability, and they often fail to leverage domain knowledge. Coupling deep learning in particular with process-based models is an emerging paradigm for modeling earth systems, enabling the discovery of patterns that are not only generalizable but also consistent with existing scientific knowledge (Karpatne et al., 2018;Shen, 2018;Reichstein et al., 2019). For example, in J. S. Read et al., 2019), typically data-hungry long short term memory deep learning models (Hochreiter & Schmidhuber, 1997) are augmented with process-based knowledge to predict lake temperature dynamics more accurately than both the process-based model and the standard deep learning model. This class of method has been called "process-guided deep learning" (PGDL; see Table 1 for list of abbreviations) and is an accelerating field of study (Willard, Jia, et al., 2020;Kashinath et al., 2021). Previous works modeling lake temperature at a broad scale have focused on calibrating parameters with available data, when data are unavailable, using recommended default values based on field and laboratory studies (J. S. Read et al., 2014;Winslow et al., 2017). These approaches have since been outperformed by PGDL models in cases of both high and low data availability (J. S. Read et al., 2019). However, in the case of no available temperature measurements to train or calibrate a model, no effort has yet been made to transfer PGDL models from well-monitored systems for prediction.
Lakes are an exemplar for the disparity in observations across systems, where >80% of in-situ water quality observations come from 20% of monitored lakes (Stanley et al., 2019), and the majority of lakes have no in-situ monitoring data. In this work, we designate "monitored" vs "unmonitored" status of lakes based on the presence of in-situ data, and consider remote sensing integration in the discussion section. How can we leverage the information in a small population of lakes to make predictions in the much larger population of sparsely monitored to completely unmonitored systems? First, temporal synchrony in characteristics across ecosystems suggests that information or models from a highly monitored system could be transferred to a less-or un-monitored system. Examples include synchrony between stream temperature and streamflow, between organic matter concentrations across different lakes (Baines et al., 2000;Erlandsson et al., 2008), or coherence in lake temperature patterns (Benson et al., 2000). Synchrony can emerge for a variety of reasons, including but not limited to shared underlying physical processes, weather conditions, or landscape context; patterns in synchrony across ecosystems therefore exhibit strong relationships to other physical variables. For instance, lake morphometric factors like maximum depth and surface area have a direct relationship to the stratification dynamics of lakes (Gorham & Boyce, 1989;Stefan et al., 1996) and correlate with temporal coherence between lakes (Magnuson et al., 1990;George et al., 2000). Water clarity can also affect the responses of below-surface phenomena to solar radiation across different systems (J. S. Read & Rose, 2013;Rose et al., 2016). Differences in coherence strength can also be attributed to different dominant external drivers (Livingstone, 2008). Fortunately, many of these physical characteristics like shape, depth, and water clarity are more widely available than other measures of water quality. Further, these characteristics mediate the relationship between external drivers and within-lake responses (e.g., through sedimentation rates and head storage), such that information gained about dynamics in one lake could be transferred to other similar lakes, regardless of whether they exhibit temporal synchrony. Determining the generalizability of the relationship between physical characteristics and water quality dynamics across different ecosystems could allow the strategic transfer of site-specific models from well-monitored systems to predict temporal patterns in unmonitored systems.
Currently, methods to extend accurate site-specific models to broad scale predictions are rare or nonexistent. In hydrology, extending site-specific parameterizations has been achieved through regionalization and catchment classification (Sivapalan et al., 2003;Wagener et al., 2007;Samaniego et al., 2010). For example, (Samaniego et al., 2010) focus on transfer functions connecting geophysical attributes to process model parameters. However, these approaches are not widely regarded as successful, with noted drawbacks of (1) uncertainty in geophysical attributes, which translates to large uncertainty in parameter estimates, and (2) often-weak relationships between these attributes and parameters, perhaps because many of those parameters lack direct physical meaning (Archfield et al., 2015). Water resources research has yet to establish a robust way to bridge scales for prediction accuracy for key ecosystem variables.
Transfer learning is a powerful technique for applying knowledge learned from one problem domain to another, typically to compensate for missing or nonexistent data in the new problem domain. The idea is to transfer knowledge from an auxiliary task, i.e., the source task, where sufficient labeled data is available, to a new but related task, i.e., the target task, often when data is scarce or inadequate (Pan et al., 2010;Weiss et al., 2016). Transfer learning using deep neural networks has shown recent success in ecological applications such as plant classification models (Kaya et al., 2019), air quality prediction (Ma et al., 2019), and grassland fire risk assessment (X.-p. Liu et al., 2019). Transfer learning for deep neural networks is analogous to calibrating process-based models in well-monitored systems and transferring the calibrated parameters to models for unmonitored systems, which has shown success in hydrological applications (Kumar et al., 2013;Roth et al., 2016). The task of deciding what model or parameters to transfer can be posed as a problem to be solved by meta-learning, or learning from previous learn-ing experiences, which is another active area of machine learning research (Vanschoren, 2018(Vanschoren, , 2019. In this paper, we focus on the meta-learning task of systematically learning how to map candidate source models (models trained on well-monitored lakes) to a particular task (prediction in unmonitored lakes) ("Metalearning: Concepts and Systems", 2009). For clarity, we define base-learning models as the traditional machine learning models or process-based models that learn or are calibrated for specific tasks (e.g., prediction in a specific lake) as opposed to the meta-learning model's goal of learning from a multitude of experiences transferring source models to target tasks. In the transfer learning context, which we call Meta Transfer Learning, the meta-learning predicts which base models to transfer based on performance metrics for past transfer learning experiences and meta-features relating to the transferability of base-learning models (Ying et al., 2018). We demonstrate this method by transferring a suite of source lake temperature models to a number of artificially unmonitored target lakes, where temperature observations were only used for final evaluation. The metamodel was used to determine which source models would transfer well to the target lake and which lake attributes can best indicate the transfer performance.
Here, we demonstrate a meta transfer learning framework to predict lake temperature at depth. Our objectives are to (1) Demonstrate the use of a metamodel to rank both process-based models and process-guided deep learning models from well-monitored lakes (source lakes) in terms of their expected ability to predict lake temperature for a different, unmonitored lake (target lake); (2) Evaluate the MTL approach against existing process-based modeling approaches; and (3) Investigate the extent to which MTL can outperform the existing state-of-the-art process-guided deep learning models for the target lake itself in situations of limited observation data.

Overview
Here, we describe a method for model selection of trained source models from datarich systems to predict lake water temperature in target systems with no data. The general idea of the MTL framework is visualized in Figure 1 and summarized as follows, 1. Build and train two source models, a calibrated PB model and PGDL model, for each of the 145 well-monitored lakes. 2. For each source lake, use all 144 other source models of the same type (PB or PGDL) to predict daily temperatures and evaluate prediction accuracy. 3. Train the meta-learning model to predict the 145*144 collected model performance values from (2) based on the lake characteristics that we hypothesized could be important for selecting good transfer models. 4. Given an artificially unmonitored target lake, where data is only used for final evaluation, and its meta-features, use the meta-learning model to predict model performance of each source model. Use the source model[s] with the lowest predicted error to model the target.
Sections 2.2 and 2.3 summarize the two types of source models, PB and PGDL models, respectively. Then, we describe the meta-learning model and how it is trained and used to identify lake features that predict successful transfers between source and target lakes 2.4. Lastly, Section 2.5 describes the data used in Section 2.6, which contains descriptions of the experiments.

Process-Based Models
As in previous studies of deep learning applied to lake temperature prediction (J. S. Read et al., 2019;Jia et al., 2019), we chose the General Lake Model (GLM) 2.2 (Hipsey et al., 2019) to represent process-based modeling, due to its proven ability to simulate thermal hydrodynamics in lakes along with its open-source code availability (https://github .com/AquaticEcoDynamics/GLM). GLM can also be used to predict temperature at broad scales using widely-available lake characteristics (depth, surface area, clarity) to parameterize the model even when observations are not available . We acknowledge that GLM may not be the ideal process-based model in all calibrated and uncalibrated modeling scenarios, but consider the comparison of different process-based models for broad scale prediction to be out-of-scope of this work. Given that the MTL framework can use any similar hydrodynamic process-based model, we will further refer to the calibrated GLM using all the available observation data as "PB" and the parameterized but uncalibrated version of GLM as "PB0". To calibrate GLM for the PB models, we selected three parameters for calibration based on their known importance to model fits and their relative uncertainty: solar radiation scaling factor, momentum exchange coefficient, and hypolimnetic mixing efficiency. We used the optim() function in R (R Core Team, 2013) to modify these parameter values to minimize the RMSE of GLM temperature predictions relative to the available observations. See the Supplemental Information (text S3) in J. S. Read et al. (2019) for details.

Process-Guided Deep Learning (PGDL) Models
We used a recently-developed PGDL model for lake temperature prediction, J. S. Read et al., 2019), where process knowledge was combined with a Long Short Term Memory (LSTM) network via (1) a loss function term to encourage physical consistency and (2) pre-training using process-based model simulation data. LSTM networks are part of a class of deep learning architectures built for sequential and time series modeling called recurrent neural networks (Hochreiter & Schmidhuber, 1997). These are particularly suited for predicting lake temperature dynamics given the often persistence of the response and the time lag between the input drivers and water temperature changes that can be represented in the memory properties of LSTM J. S. Read et al., 2019). Here, the simulation data used for pre-training are the output of the parameterized but uncalibrated version of the PB model (PB0) described in Section 2.2. The components of the PGDL model are described in more detail in Supplemental Information (Text S1). The input features for the model are the meteorological factors that contribute to incoming and outgoing heat fluxes and the depth (distance from surface) of the target prediction (Wetzel & Likens, 2000;Fink et al., 2014;Zhong et al., 2016). This includes short-wave and long-wave radiation (in W/m 2 ), air temperature (in°C), relative humidity (0-100%), wind speed (in m/s), rain (in m/day), and snow (in m/day). The meteorological features are identical to the drivers used in the GLM simulations except that they are each normalized to a mean of 0 and standard deviation of 1 based on a calculated global mean for each driver across all lakes, a recommended step for training neural networks to address differences in the scales across input variables (Sola & Sevilla, 1997).

Meta Transfer Learning with Gradient Boosting Regression
Our MTL framework aims to predict the accuracy of each source model on an unmonitored target lake. Here, two metamodels were built, one for predicting the performance of source PB models on target lakes (PB-MTL), and one for predicting the performance of source PGDL models on the same target lakes (PGDL-MTL). As shown in Figure 1, each meta-learning model takes in lake-level features that may contain information about the transferability from a source to a target. We call these predictors metafeatures; meta-features included differences in physical attributes between the source and target lake, measures of data quality in the source lake, and features of the source and target that were derived from PB0 such as the likelihood of stratification. The response variable was the prediction accuracy (measured as root mean squared error, RMSEs) of transferring the source model (either PGDL or PB) to the target lake, where lower RM-SEs represent a successful transfer between lakes. If i is the index for the source lake, and j the index of the target lake, the meta-features for each unique source-target pair can be written as X i→j , and target RMSE values as RM SE i→j . The function F we are attempting to approximate can then be written as The training dataset for each metamodeling scenario then contains all (n) * (n− 1) possible source-target pairs as follows: The following subsections describe details of this MTL approach, including the method of gradient boosting regression used for the metamodel, how meta-features were selected, and how gradient boosting hyperparameters were tuned.

Gradient Boosting Regression
Due to its predictive power, ease of implementation, and ability to illustrate the relationships between predictors and the response, we chose gradient boosting regression to predict the RMSE of source-target pairs from meta-features. In short, gradient boosting creates an ensemble of estimator models. It starts by fitting an initial regression tree model to the data. Regression decision trees are generated such that each decision node in the tree contains a test on the input variable's value, and the tree terminates with nodes that contain the predicted output variable values (RMSE in this case). Then, it builds a second model that prioritizes accurately predicting the cases where the first model performs poorly, a process known as boosting. The ensemble of these two models can be expected to perform better than the first model due to this new prioritization. Estimators are then continuously added until a set amount is reached. Gradient boosting in particular generalizes boosting by optimizing with a differentiable loss function, which in the case of regression is usually mean squared error (MSE). Further method details can be found in Friedman (2001).

Identification of Meta-Features
We started with a collection of candidate meta-features that we hypothesized could predict the performance of source models in predicting temperature in different target lakes. As in Equations 1 and 2, each set of meta-features (X i→j ) is unique to a sourcetarget lake pair. An exhaustive list of 96 possible meta-features is listed in Supplemental Information Table S1, which are divided into four categories: lake attributes, PB0 simulation statistics, general observation statistics, and meteorological statistics. The last two categories are commonly-used meta-features that either (1) use statistics relating to the quality and quantity of observations of the source lake, or (2) compare differences in the data distributions of input features between source and target domains (Castiello et al., 2005). We expand on this with the two additional categories, lake attributes and a PB0 simulation statistic. All differences are calculated as the source value minus the target value.
1. Lake Attributes: These features contain information about maximum depth, surface area, and other lake properties that are not directly used in model train- ing since they are not features or observations, but may mediate or contain useful information about the physical response of lakes to meteorological drivers. They are calculated as the difference between the source and the target lake values. 2. PB0 Simulation Statistic: This feature describes an important property of the PB0 temperature predictions, the percentage of dates on which each lake is stratifies. We used PB0 predictions as a surrogate for in-situ temperature observations, which are not available for target lakes. The PB0 model translates driver data into temperature predictions via process understanding, and it can therefore give insight into similarities across lakes such as the likelihood the lake is stratified or how the lake responds to wind events. This statistic was already available as part of the pre-training process for PGDL models in this study, and is also calculated as a difference between the source and the target lake. 3. General Observation Statistics: These features contain information about temperature measurements that only pertains to the source lake. Ideally they would contain information about the quality of the source data. For example, a very poorly monitored source lake without adequate data to train a model could indicate poor transfer performance. Example statistics include total observations, number of observations per season, mean depth where temperature was measured, and mean temperature measured. 4. Meteorological Statistics: These features contain differences between the source and the target lake in both annual and seasonal averages and standard deviations of the 7 meteorological drivers used as inputs to the source models . Examples include differences in mean air temperature, solar radiation, and relative humidity.
Then, to narrow down the number of meta-features, we performed recursive feature elimination with cross validation (RFECV) (Guyon et al., 2002). Recursive feature elimination is a feature selection method that fits a model and iteratively removes the weakest features until an ideal set that produces the lowest cross-validation error is reached. To do this we used two Scikit-learn python modules (Buitinck et al., 2013). For building the base model we used GradientBoostingRegressor with default parameters and 3000 estimators, and for performing feature selection we used the RFECV library with 24fold cross validation and mean squared error loss. We also used the importance of each meta-feature to interpret how the transfers were selected. Here, feature importance was calculated by the GradientBoostingRegressor as a measure of how each feature affected mean squared error across nodes in the decision trees, weighted by how often those nodes are reached (Buitinck et al., 2013).

Hyperparameter Tuning
For both PB-MTL and PGDL-MTL, we tuned two gradient boosting hyperparameters that are known to affect performance: the number of boosted decision trees and the learning rate (impact of each tree on final outcome). The remaining parameters were left at their default values for the GradientBoostingRegressor class in scikit-learn version 0.22.1. We construct a nested 24-fold cross validation (CV) to estimate the generalization ability of the model given certain hyperparameter values. This CV works by performing 24 iterations of removing 1/24th of samples from the dataset for validation and taking the average mean squared error as an estimate of model performance for a given set of hyperparameters. CV is done for every set of candidate hyperparameter values in an exhaustive search of two candidate learning rates {0.05, 0.10} and intervals of 100 decision tree estimators from 1000 to 6000. The ideal hyperparameters were found to be learning rates equal to 0.05 for both PB-MTL and PGDL-MTL, and number of decision trees equal to 4500 for PB-MTL and 4900 for PGDL-MTL.

Data
All the data used in this work is available through a data release on the U.S. Geological Survey's ScienceBase platform (Willard, Read, et al., 2020). All study lakes are located in the Midwestern United States, and details about the selected lakes are included in the data release. Briefly, 450 lakes met our data density criterion of at least 50 unique observation dates where there are at least one measurement for every two meters of depth or at least 5 total observations. From these lakes, in-situ lake temperature measurements between 1980 and 2019 were used to train and test all our models. To build the metamodel 145 of these lakes were used, and the rest are considered "artificially unmonitored", where data is only used for final evaluation. An additional 1882 lakes with fewer observations were used as targets in an expansion exercise described in the Discussion (Figure 2).
Meteorological data used as the input drivers for our models were gathered from the North American Lake Data Assimilation System (NLDAS-2) (Mitchell et al., 2004). As in J. S. Read et al., 2019), these gridded data were transformed into process-model ready input (see (Hipsey et al., 2019)). These inputs were then normalized for use in the machine learning models as mentioned in Section 2.3. Lake attributes used as meta-features in the MTL algorithm were acquired in the same manner as in previous modeling studies in the region (J. S. Read et al., 2014Read et al., , 2019. A more detailed description of the sources and processing of these attribute data can be found in .

Model Experiments
We designed two experiments that use the previously described metamodel built using meta-features and past model transfer RMSE. For both experiments, we used 145 of the 450 well-monitored lakes as detailed in Section 2.5 as source lakes, and we kept the remaining 305 lakes as target lakes for which the metamodel was used to select one of the 145 source lakes. Source lakes were selected to be representatively distributed across maximum depth values and log-scale surface area values (see Supplemental Information Figure S3). In all experiments the metamodel training data consisted of RMSEs from applying each of the 145 source lake models on all other source lakes, leading to 144*145 meta-learning data points (20880 total). Then, after the metamodel is trained, for each source-target pair we constructed the meta-features as described in Section 2.4.2. From these meta-features, both metamodels (PB-MTL, PGDL-MTL) were then used to predict the expected RMSE of each of the 145 source models when transferred to the target lake.
2.6.1 Experiment 1: Predicting Temperatures in "Unmonitored" Lakes Experiment 1 evaluates the performance of the meta transfer learning models in a real-world scenario: predicting water temperature at multiple depths in unmonitored lakes. Given the predictions of source model performances from both metamodels, the PB or PGDL model with the lowest predicted source-to-target RMSE was singled out for use on each target lake. We compared the top-predicted transfers for each of the 305 test lakes against its PB0 simulation.
We assumed that our metamodel would not be able to select the true best source lake in all instances. We therefore also evaluated an ensemble method that combined several of the top predicted models. Ensembles of multiple individual models that perform well can almost always improve over their average prediction error (Zenobi & Cunningham, 2001;Kuncheva & Whitaker, 2003). This was proven by Krogh and Vedelsby (1995), who showed that increasing ensemble diversity (the extent to which single models disagree), given constant average error of individual ensemble members, reduces the overall ensemble error. This reduction often occurs because some ensemble member predictions are biased positively and some negatively, leading to bias cancellation in the ensemble prediction. We used a simple ensemble that takes an unweighted average of the predictions from each selected PGDL source model for each date and depth. Lakes selected for ensembling were the top "n" source models predicted to have the lowest RMSE on the test lake. The optimal value of n was estimated using a 29-fold cross validation. In each cross validation fold, 5/145 source lakes were designated as validation lakes and a metamodel with the same hyperparameters as described in Section 2.4.3 was trained on the remaining 140 lakes. Then, n source lakes were selected for each validation lake. Estimated ensemble error was then the mean ensemble errors across all folds. This was repeated for values of n ranging from 2 to 10, where 9 was found to be the optimum, but values differed minimally between 5 and 10. We call this 9 source ensemble approach PGDL-MTL9.
Lastly, we examined the metamodels themselves. In addition to evaluating the performance of the predicted best source-target transfer, we looked at how well the metamodel predicted the RMSE of every source-to-target combination and how well it was able to rank the source models. For the former we calculate a median across all target lakes of the metamodels' predictions for RMSEs and the actual source-to-target RMSEs. For the ranking evaluation, we used the Spearman rank correlation coefficient shown as r s . We also looked at distributions of the actual ranks (for the RMSEs of sources actually applied to targets) of those models that were identified by the metamodel as the top or top 9.

Experiment 2: Comparing PGDL-MTL with PGDL for Sparsely Monitored Systems
Experiment 2 examines the extent to which PGDL-MTL is an improvement over PGDL in systems that have some observations but are not sufficiently monitored to train any traditional deep learning model effectively. In this experiment, we define "sparsely monitored" as between 1 and 50 sampling dates. Deep learning models are generally datahungry, but PGDL models pre-trained on PB0 output have shown to achieve high accuracy with only a few observations ; J. S. Read et al., 2019). Thus, both PGDL and PGDL-MTL have the potential to alleviate the difficulty in calibrating processbased models for sparsely monitored lakes, where overfitting can be a problem. However, PGDL-MTL has the potential to harness more data and thereby outperform PGDL. The situation of few available observations is also far more common than the well-monitored case of the lakes chosen in this work (E. K. Read et al., 2017). To that end, we artificially sparsified the data available in the 305 test lakes to train PGDL models on low amounts of data. Then, these low-data PGDL models were compared to the PG-MTL and PGDL-MTL results of Experiment 1. We used this comparison to estimate the data threshold where PGDL tends to outperform PGDL-MTL.
Artificial sparsity was induced by building five PGDL models for each suitable lake for twelve different amounts of sampling dates (1,2,5,10,15,20,25,30,35,40,45,50) used for training. The test period was set as the first third of temperature observations in time for each lake, leaving the training period as the last two thirds of temperature observations. For each sampling date treatment we used all lakes that had at least that number of sampling dates during the training period. Of the 305 possible lakes, 221 had 50 or more observations, 270 had 40 or more observations, and all 305 had 30 or more observations. For the five models within each lake and data availability category, variability was introduced by randomly selecting dates for the training data. Note. The first three columns are the quartile distributions of RMSE of the best predicted source lake for each test lake. The fourth column is the median RMSE between the metamodel-predicted RMSEs and the observed RMSEs. The fifth column is the median Spearman rank correlation coefficient between the metamodel-predicted RMSEs and the actual RMSEs.

PB-and PGDL-MTL model accuracy on 305 test lakes
In Experiment 1, PGDL-MTL and PB-MTL predictions of water temperature in the 305 test lakes were typically more accurate than predictions from the uncalibrated process-based model (PB0; Table 2 and Figure 3). The median RMSE across the test lakes was 2.42°C for PB-MTL and 2.16°C for PGDL-MTL, versus 2.52°C for PB0. PB-MTL outperformed PB0 for 203/305 of the lakes and PGDL-MTL outperformed PB0 for 226/305 of the lakes, and the amount of improvement the transfer provided generally increased with PB0 error (Figure 3). Predictions of deeper water temperatures from the transferred models had higher RMSEs in general as compared to the lake-specific accuracy of all depths, with the highest median RMSE from PB0 models, followed by PB-MTL, and with PGDL-MTL having the lowest median deep-water RMSE (2.59°C, 2.56°C, and 2.36°C, respectively; RMSEs calculated based on predicted versus observed temperatures at or below 75% of the maximum depth of the lake; 18 of 305 lakes had no observations at these depths).

Model ensemble performance
Additionally, the ensemble PGDL-MTL9 model provided still better performance than PGDL-MTL. We can see in Table 3 that the RMSE of the combined averaged prediction of the source models tended to be lower than most of the source models individ- ually. The ensemble model PGDL-MTL9 had a median RMSE of 1.88°C, which is an improvement over the single-source PGDL-MTL, which had a 2.16°C median RMSE. When compared to PB0 in Figure 3, PGDL-MTL9 outperforms PB0 for 260/305 of the test lakes. Table 3 also shows the distribution of RMSE values per source systems at given ranks, between 1 and 9, as predicted by the metamodel. Comparing the individual source model RMSEs across the top 9 ranks, we see ranges of only 0.09°C in median RMSE, 0.12°C in lower quartile RMSE, and 0.13 in upper quartile RMSE.

Meta-features and importances
The top selected meta-features were related to maximum depth in both PB-MTL and PGDL-MTL, with combined importances of 50% and 45%, respectively. Surface area, observation count, source lake observed temperature, and stratification indicators were selected as meta-features in both PB-MTL and PGDL-MTL but were of lesser importance (Table 4).  The metamodels typically chose source models that were good, but not optimal, matches to the target lake (Figure 4). In a stratified lake with high PGDL-MTL accuracy (RMSE = 1.3°C), top-ranked source models all came from stratified source lakes ( Figure 4d) and captured the summer stratification dynamics (Figure 4a). In a stratified lake with low PGDL-MTL accuracy (RMSE = 3.5°C), top-ranked source models came from a mix of stratified and unstratified source lakes (Figure 4e) and had similar predictions to a low-ranked source model (Figure 4b). In our 305-lake test set, mixed lakes (n=121) had lower mean RMSEs (mean=2.01, SD=0.52°C) than stratified lakes (n=184; mean=2.62, SD=0.96°C). Our mixed example lake illustrates that all candidate source lakes had lower RMSEs (Figure 4i) and similar predictions (Figure 4c) such that even though the metamodel selected a combination of mixed and stratified source lakes, the resulting RMSEs could still be quite low (PGDL-MTL: 1.3°C, PGDL-MTL9: 1.1°C). Consistent with the meta-feature importances in Table 4, the selected source lakes tended to be similar to the target lake with respect to not just stratification but also maximum lake depth and surface area (Figure 4d-f). Ensembling with PGDL-MTL9 yielded similar accuracy to PGDL-MTL for the two example lakes with high PGDL-MTL ac-curacy (Figure 4g,i) and substantially improved accuracy in the example lake where the PGDL-MTL model failed to capture the observed stratification dynamics (Figure 4h).

Features of best and worst source lakes
There were large differences in the frequency at which source models were chosen by the MTL to represent target lakes, and several factors emerged that suggested differences exist between commonly and rarely selected source lakes. A small fraction of source models were used to predict almost one third of target lake water temperatures, and eleven lakes were selected as the top PGDL or PB source for ten or more target lakes. Seven top PGDL source models were used for 100 target lakes and seven PB models for 95 of 305 target lakes, and three lakes were in this top category for both PGDL and PB models. In contrast, 59 PGDL and 64 PB source models were not chosen as a top model for any target lake (31 were never selected as sources in either model). Additionally, we summed the number of times each lake was predicted to be in the top 9 sources for the ensembles, and compared the raw lake attributes of the upper and lower quartiles (Figure 5). For both PGDL and PB transfer models, lakes that were transferred often were in general deeper, larger, and more monitored than minimally transferred lakes. For PGDL, source models in the lower quartile of MTL selections had a median depth of 6.71 m, a median surface area of 0.86 km2, and a median of 872 training observations, compared to 13.1 m, 1.12 km2, and 2,117 observations for the upper quartile medians, respectively. The lower quartile of PB-MTL source models had medians of 7.9 m, 0.55 km2, and 824 calibration observations, with upper quartile medians of 18 m, 1.7 km2, and 2,557 calibration observations.

Metamodel performance
To assess the metamodel's ability to predict the performance of source lake models, we looked at both the RMSE of the predicted RMSE versus the actual RMSE when transferring source models in Experiment 1, and also the ability of the metamodel to accurately rank source models from best to worst in the form of the Spearman rank correlation coefficient. The median meta-RMSE for PB-MTL was 0.853°C and the Spearman rank correlation coefficient r s was 0.659, and for PGDL-MTL the meta-RMSE was 0.871°C with an r s of 0.663 (Table 2). Then, in Figure 6, in addition to showing the distribution of actual ranks for the predicted best source PGDL model for each target system, we also show the distribution of ranks for sources within the 9 source ensemble PGDL-MTL9. Further visualization of the ranking ability of the metamodels is shown in Supplemental Information Figure S2. Here, we see that the two metamodels have similar predictive ability, with PGDL-MTL ranking slightly better as seen in the Spearman coefficient values.     Other bars: best, median, and worst of the top-9-predicted sources.
Comparison with data-sparse target lake models In Experiment 2, the median RMSE across 305 test lakes tended to decrease as the number of sampling dates used for training increased (Figure 7 and Table 5). Figure 7 shows performance of PGDL trained with differing numbers of temperature profiles compared to the MTL approach, and Table 5 shows the specific RMSE numbers for Figure  7. Here, the RMSE for each test lake is defined as the median RMSE across 5 randomly chosen sets of the same number of observations. Given that the RMSEs of the singlesource PGDL-MTL and ensemble-of-sources PGDL-MTL9 from the previous experiment were 2.16°C and 1.88°C, respectively, PGDL models trained only on the target lake's data met or exceeded median MTL performance at between 5 and 15 observations for PGDL-MTL and between 35 and 40 observations for PGDL-MTL9. In other words, even for a reasonably well-monitored lake (up to 40 observations), it can be better to borrow a model from a different and better-monitored lake than to train a model on the target lake observations. For context, 45 profiles is approximately the coverage a lake would have if it had a monitoring program that took a temperature profile monthly during the ice-free period for 6 years.

Baseline performance of PB and PGDL source models
Success in transfer learning depends both on (1) metamodel success in choosing the best of the available source models for a target lake and (2) the baseline performance of the source models that could be transferred. If the PGDL-MTL metamodel had selected the best available source PGDL model for every target lake, the median RMSE would have been 1.54°C, versus an RMSE of 1.79°C if the best PB model was selected every time. This difference, aligning with established knowledge that PGDL predicts more accurately than PB (J. S. Read et al., 2019;Jia et al., 2019), can explain how the RMSE across the test lakes of PGDL-MTL was lower than PB-MTL even though PB-MTL had a lower meta-RMSE predicting the performance of source models.

Discussion
In this paper, we show Meta Transfer Learning (MTL) can be used to address monitoring gaps in environmental and ecological sciences by predicting in unmonitored systems. Even with the data deluge resulting from modern sensor developments, the majority of lakes and streams are unmonitored or have sparse observations. This has made it difficult to calibrate process-based models for these systems due to risk of overfitting, and even more inaccessible for traditional deep learning models which can require thousands or millions of data points. The MTL paradigm in this work harnesses data from many other systems to accurately predict temperature in unmonitored systems. Specifically, the transfer process leverages observations from highly monitored systems, simulated temperature data from process models, past model performance measures, and thousands of past transfer learning experiences to alleviate the drawbacks of both deep learning and process-model calibration in unmonitored systems.
As experts in the water resources community have called for integration of processbased and data-driven methods (Hipsey et al., 2015;Shen, 2018), MTL involves a collection of approaches harnessing both ML and process knowledge. Here, we use the ML technique of gradient boosting regression for the meta-learning task of predicting the transferability of source models including those that employ Process-Guided Deep Learning (PGDL), which iteself integrates process knowledge into ML. Limnology domain expertise was also used in defining the candidate meta-features offered to the metamodel. The top selected meta-features matched our process understanding from dozens of studies that show relationships between the properties of lakes (surface area, depth) and physical responses to external drivers (Gorham & Boyce, 1989;Stefan et al., 1996). This work shows that these lake characteristics, which are more widely available than water quality data themselves, can be used to transfer information from highly monitored to unmonitored systems. Note. Medians of medians are calculated as the median across 305 test lakes of the median of 5 models trained with different random selections of observations. Different types of lake-specific data were used to determine which sources should be transferred. Lake maximum depth difference between the source and target lake emerged as the most important in both the PGDL-MTL and PB-MTL approaches. Surface area differences were also included in both, but of less importance. This aligns with existing process-based lake modeling knowledge that maximum depth and surface area are key factors in lake stratification and thermodynamics (Gorham & Boyce, 1989;Stefan et al., 1996). Other meta-features related to source model quality, like the number of observations and mean observation temperatures, were also included in both metamodels. This is consistent with common modeling intuition that more data can lead to both better calibrated process models and better trained ML models (J. S. Read et al., 2019;. We also saw the PB0 meta-feature, GLM stratification percentage, as the 2nd most important feature for PB-MTL, and included with less importance in PGDL-MTL. Top sources had lower mean observation temperatures, which possibly indicates either more balanced measurements between surface and deeper depths, or a better spread of observations across seasons, in a given lake. For example, source lakes that use mostly surface temperatures would have higher mean observation temp, and source lakes that have many deeper measurements would have lower mean observation temp. Inspecting the characteristics of the most frequently selected source lakes could guide future monitoring and MTL modeling efforts. Only eleven unique source models were used to predict almost one third of target lake water temperatures using both PGDL-MTL and PB-MTL, and top source lakes were generally deeper, larger, and more wellmonitored compared to rarely or never selected source models ( Figure 5). Differences between source and target in lake depth and area, as well as the observation count of source lakes, were important meta-features used to select source lakes. While these features likely explain why some lake models are generally more transferable, the unique properties of some target lakes and their selected source models are important to consider when designing lake monitoring campaigns or evaluating future model transfer methods. For example, PGDL source models that were rarely selected (chosen one, two, or three times as a top source) still helped overall test lake performance and were often better actually-ranked options for their target lakes compared to the ranks of commonly chosen (ten or more times) source models (based on ranking the performance of all possible source model transfers to each target lake; median actual PGDL-MTL rank for rare source transfers: 20 of 145, and common source transfers: 36.5 of 145; n=100 and 103, respectively). This pattern did not hold for PB-MTL transfers, with generally worse actual ranks for rare sources compared to common sources (median rank for rare and common were 28.5 and 23, and n=80 and 95, respectively), and additional research may be necessary to understand these differences. The important meta-features used in this study (e.g., differences in maximum depth and area, and the number of observations used to train or calibrate the source model) differ from previous process-based modeling parameter transfer methods that have been applied to rivers. These previous works have instead focused on spatial proximity, spatial fields of hydrologic signatures, or global parameterization (Mizukami et al., 2017).
Because lake temperature is an ecological "master factor" (Magnuson et al., 1979), predictions at broader scales can support a wide variety of science and management efforts, from improved modeling of biota (Jones et al., 2006;Hansen et al., 2017) to improved thermoelectric power plant heat management (Cook et al., 2015). PGDL-MTL models can output predictions at scale wherever meteorological and essential lake attribute data are available, and the MTL approach could eventually be developed into use for forecasting applications. A forecasting variant of MTL could be developed by building base models specifically for forecasting (e.g. with probabilistic outputs), and optimizing transfer performance to new systems by simulating forecasting performance instead of hindcasting RMSE. Below, we discuss the various ways the MTL approach can scale to other systems.
The applicability of MTL scales beyond just unmonitored systems to a large range of monitored systems as well, bridging the gap between local accuracy and broad-scale modeling. In Experiment 2, we investigated the point at which, for sparsely monitored systems, it would be better to transfer models from different better-monitored systems as opposed to training PGDL models on what little target data is available. This is a pertinent question for broad scale modeling; while a majority of lakes in this region are unmonitored, a large fraction of monitored lakes have <40 observations (Stanley et al., 2019). Though PGDL models have been shown to outperform calibrated process-based models on even a small number of water temperature sampling dates by taking advantage of process-based simulation data and process-informed learning constraints (J. S. Read et al., 2019), MTL presents the opportunity to improve prediction by harnessing more simulation data, observation data, and metadata from past modeling experiences across many other lake systems. There is also opportunity to expand the MTL framework to incorporate sparse data available in many lakes, where the transferred source models could be fine-tuned using data from the target lake itself. Another major benefit of MTL with PGDL in particular is the scalability and efficiency of ML models once the meta-learning model and source models are trained. MTL can be built with data that are easier to obtain than temperature observations (e.g. max- RMSE. e-f) The distribution of RMSE from PB0 and transfer models, where the vertical gray and colored lines are the median PB0 and transfer RMSE, respectively. imum depth and surface area), and MTL does not require any new models to be trained. Therefore, it can scale to a much larger number of lakes than the ones used in this study. To demonstrate this scalability, we applied the transfer models to 1882 additional lakes that were less monitored than our initial 305 lakes. The transfers maintained a significant accuracy improvement over a purely process-based modeling approach (PB0). For this expanded set of lakes, median RMSE was 1.80°C for PGDL-MTL9, 2.06°C for PGDL-MTL, and 2.29°C for PB0 (Table 6). Temperatures in a majority of lakes were more accurately predicted by the transfer models compared to PB0; for PGDL-MTL9 1484 of 1882 lakes improved over PB0, and for PGDL-MTL 1206 of 1882 improved over PB0 (Figure 8).
Finally, given the demonstrated generalizability of PGDL and PB models using MTL, this approach opens doors to new research directions, like transferring source models into new spatial domains, including remote sensing surface observation data, incorporating uncertainty quantification, and aggregating models more effectively. Though our study was limited to 5 Midwestern states in the United States, this could be expanded to include a much larger variety of lake types and locations. A remaining question for this transfer approach is, when expanding to new types of lakes, how should an optimal set of source lakes be identified? Another research direction includes uncertainty estimation in the metamodel construction. Uncertainty estimates could be used to reject a target lake for which all the source model error estimates are confidently high. Furthermore, the ensembling approach could be improved, using more complex methods than a simple average to combine top source models. One promising option is generalized stacking of neural networks (Ghorbani & Owrangh, 2001), where all the source neural networks would be connected by an averaging layer. Remote sensing data integration could also help in adding surface temperature data to the source models and could allow corrective measures to be taken for predictions in lakes unmonitored by in-situ data. Though, remote sensing observations have known drawbacks in this application such as being limited to only surface temperature on larger lakes (Schaeffer et al., 2018;Topp et al., 2020).
Given the successful prediction of environmental variables using MTL approaches, there are many research opportunities in different types of applications and data scenarios. For example, predicting only lake surface temperature would allow for the use of MTL without the need for maximum depth measurements, which could allow for predictions in many more lakes. Also, different types of source models could also be used in different scenarios. Some process-based models likely work better for some lakes than others; for example, process models built specifically for reservoir dynamics could be important source models in regions where reservoirs are a common lake type. Other environmental variables could also be targeted for prediction like water quality (e.g. dissolved oxygen, conductivity) and water quantity in lakes, streams, wetlands and other water bodies.

Data and Code Availability Statement
The experimental data from this manuscript are freely available in the USGS data release "Predicting Water Temperature Dynamics of Unmonitored Lakes with Meta Transfer Learning" (https://doi.org/10.5066/P9I00WFR) (Willard, Read, et al., 2020). Also the code used in this is study is available at https://github.com/jdwillard19/MTL lakes.