Statistical upscaling of ecosystem CO2 fluxes across the terrestrial tundra and boreal domain: Regional patterns and uncertainties

The regional variability in tundra and boreal carbon dioxide (CO2) fluxes can be high, complicating efforts to quantify sink‐source patterns across the entire region. Statistical models are increasingly used to predict (i.e., upscale) CO2 fluxes across large spatial domains, but the reliability of different modeling techniques, each with different specifications and assumptions, has not been assessed in detail. Here, we compile eddy covariance and chamber measurements of annual and growing season CO2 fluxes of gross primary productivity (GPP), ecosystem respiration (ER), and net ecosystem exchange (NEE) during 1990–2015 from 148 terrestrial high‐latitude (i.e., tundra and boreal) sites to analyze the spatial patterns and drivers of CO2 fluxes and test the accuracy and uncertainty of different statistical models. CO2 fluxes were upscaled at relatively high spatial resolution (1 km2) across the high‐latitude region using five commonly used statistical models and their ensemble, that is, the median of all five models, using climatic, vegetation, and soil predictors. We found the performance of machine learning and ensemble predictions to outperform traditional regression methods. We also found the predictive performance of NEE‐focused models to be low, relative to models predicting GPP and ER. Our data compilation and ensemble predictions showed that CO2 sink strength was larger in the boreal biome (observed and predicted average annual NEE −46 and −29 g C m−2 yr−1, respectively) compared to tundra (average annual NEE +10 and −2 g C m−2 yr−1). This pattern was associated with large spatial variability, reflecting local heterogeneity in soil organic carbon stocks, climate, and vegetation productivity. The terrestrial ecosystem CO2 budget, estimated using the annual NEE ensemble prediction, suggests the high‐latitude region was on average an annual CO2 sink during 1990–2015, although uncertainty remains high.

Both the growing and non-growing seasons are important for these annual budget estimates. A recent synthesis found that non-growing season soil CO 2 emissions from the northern permafrost region are larger than previously estimated (Natali et al., 2019). However, CO 2 uptake by plants over the growing season can be substantial and is often the dominant component of the annual CO 2 budget (Alekseychik et al., 2017;Kolari et al., 2009;Lafleur et al., 2012). The current state of the annual terrestrial high-latitude CO 2 budget (net sink or source) remains highly uncertain. A key research priority is to develop robust data-driven quantitative frameworks to constrain regional boreal and tundra CO 2 budgets at annual and seasonal time scales.
Estimating high-latitude CO 2 fluxes across large areas and over long timescales is challenging due to their high spatiotemporal variability (Ai et al., 2018;Wilkman et al., 2018) that is controlled by a range of environmental variables (Camps-Valls et al., 2015;Lund et al., 2010). The ecosystem CO 2 balance (i.e., net ecosystem CO 2 exchange; NEE) is the relatively small difference between the two large CO 2 fluxes of photosynthesis (gross primary production; GPP) and ecosystem respiration (ER; comprising autotrophic and heterotrophic respiration). Although NEE can be measured with the eddy covariance (EC) and chamber techniques (Baldocchi et al., 1988;Lundegårdh, 1927), GPP and ER are estimated indirectly using environmental light and temperature measurements for EC sites Reichstein et al., 2005) and dark chamber measurements for chamber sites (Shaver et al., 2007).

K E Y W O R D S
Arctic, CO 2 balance, empirical, greenhouse gas, land, permafrost, remote sensing and soil properties (e.g., soil nutrients and moisture) (Arens et al., 2008;Dagg & Lafleur, 2011;Lund et al., 2009). However, our understanding of the influence of these drivers on GPP and ER, and particularly on NEE, across the entire high-latitude region remains limited (see e.g., Belshe et al., 2013;Lund et al., 2010).
Knowledge of the contemporary high-latitude terrestrial CO 2 budget is further limited by an increasing, but still relatively sparse, flux measurement network (Alton, 2020;Chu et al., 2017;Virkkala et al., 2018). The majority of flux sites are concentrated within a few intensively studied regions, particularly Alaska and Fennoscandia (Metcalfe et al., 2018;Pastorello et al., 2020;Virkkala et al., 2019), with just a few sites in other large regions such as Siberia and northern Canada. Consequently, issues related to the temporal, geographical and environmental representativeness of the measurements need to be considered to accurately estimate high-latitude carbon budgets and their uncertainties. Previous studies have used a variety of synthesis approaches (Belshe et al., 2013;McGuire et al., 2012), and statistical (Natali et al., 2019), process-based (López-Blanco et al., 2019McGuire et al., 2018;Rawlins et al., 2015;Wania et al., 2009) and atmospheric inversion models (McGuire et al., 2012), yielding highly different CO 2 budgets. Most of these modeling studies have been conducted at coarse spatial resolutions (25-100 km; Natali et al., 2019;Rawlins et al., 2015; 2019) that do not fully capture the heterogeneity in high-latitude environments despite their importance for the regional CO 2 budgets (Raynolds et al., 2019;Treat et al., 2018). New efforts synthesizing the current distribution of flux data and developing models at high spatial resolution are required to improve our understanding on the spatial patterns and magnitudes of CO 2 fluxes.
Models that rely on the statistical relationships between CO 2 flux and predictor variables have been increasingly employed to constrain global and high-latitude CO 2 budgets (e.g., Jung et al., 2020;Natali et al., 2019;Warner et al., 2019). These statistical models are useful for predicting fluxes across larger areas (i.e., upscaling) because they directly draw upon relationships between fluxes and environmental variables, can account for environmental variability across space and time at high resolutions, and are able to handle biases in the geographic representation of the data (Jung et al., 2020;Natali et al., 2019;Warner et al., 2019). A broad range of statistical models and data sources are available for upscaling, but not all of these have been fully utilized. For example, many past studies have upscaled high-latitude fluxes using a single model (Natali et al., 2019;Peltola et al., 2019;Ueyama, Ichii, et al., 2013), but how different models compare with each other is not well known (with exception of Jung et al., 2017 andTramontana et al., 2016). Further, most of these studies have primarily used machine learning models due to their ability to capture non-linear relationships and interactions in data (Elith et al., 2008). However, traditional regression methods can be a powerful tool in upscaling high-latitude ground conditions due to their ability to extrapolate beyond the range of data used for training, and due to their generalizability and ease of interpretation (Aalto et al., 2018). Finally, many of the recent upscaling studies have relied on EC flux measurements only, neglecting chamber measurements despite their importance as additional data sources (with exception of Natali et al., 2019). Chambers are useful especially in remote, sparsely measured treeless tundra where they can capture the entire ecosystem CO 2 balance and directly measure NEE and ER (Sørensen et al., 2019). Thus, a compilation of both EC and chamber flux measurements and the comparison of available modeling techniques is clearly required to ensure accurate CO 2 flux estimates from existing data and models.
Here, we synthesize annual and growing season CO 2 fluxes from EC and chamber measurements across the high-latitude terrestrial tundra and boreal region. We then use this new database to upscale annual average ecosystem CO 2 fluxes at relatively high spatial resolution (1 km 2 ) across the high-latitude domain using several statistical models. We compare our new database of in situ CO 2 fluxes to past tundra syntheses (Belshe et al., 2013;McGuire et al., 2012), provide a detailed assessment of model performance, analyze the spatial patterns and drivers of CO 2 fluxes, and discuss the resulting CO 2 budget estimates and recommendations for future work. We focus on understanding the spatial variability in average CO 2 fluxes instead of a temporal analysis of CO 2 flux change; however, our modeling framework also considers the interannual variability in fluxes.

| Data collection
2.1.1 | Collection of CO 2 flux data Our study area was defined by the high-latitude tundra and boreal biomes (>45°N) based on global ecoregions (20.6 × 10 6 km 2 ; Figure 1; Dinerstein et al., 2017). We first conducted a literature survey to identify existing EC and chamber-based terrestrial CO 2 flux observations of GPP, ER, and NEE over annual and growing season periods across the domain. Potential sites were identified from previous studies Marushchak et al., 2013;McCallum et al., 2013;Watts et al., 2014) and prior synthesis efforts (Belshe et al., 2013;McGuire et al., 2012;Virkkala et al., 2018). We augmented the resulting site list using a Web of Science search with key words ("tundra" or "boreal" or "arctic") and ("CO 2 flux" or "CO 2 exchange" or "CO 2 budget").
Additionally, a community call was solicited through a CO 2 flux synthesis workshop (Parmentier et al., 2019), whereby investigators contributed their most current unpublished data. Additional EC data were downloaded from FLUXNET2015 (Pastorello et al., 2020). The compiled dataset represents all natural terrestrial vegetation types (categorized by needle-or broadleaf forest, shrubland, grassland, wetland, and sparse vegetation) present in the high-latitude region.
We included studies and sites with NEE, GPP, and ER estimates over a full growing season or calendar year (i.e., cumulative flux).  Table S1, sites that were excluded from our analysis are in Table S2. F I G U R E 1 Measured median annual (a-c) and growing season (d-f) fluxes of GPP (gross primary production), ER (ecosystem respiration), and NEE (net ecosystem exchange) in the study domain (>45°N). The color of the point defines the median flux of the site (i.e., a sampling location), and the size of the point the number of observations that was measured (i.e., number of years). The background map represents the high-latitude region (dark gray = boreal biome, light gray = tundra biome). In all panels, sites that had only eddy covariance measurements are shown with black outline color around the point, and chamber measurements are without outline. One site had both eddy covariance and chamber measurements, but this is shown with black outline color. Positive numbers for NEE indicate net ecosystem CO 2 loss to the atmosphere (i.e., CO 2 source) and negative numbers indicate net ecosystem CO 2 gain (i.e., CO 2 sink) [Colour figure can be viewed at wileyonlinelibrary.com] and Belshe et al. (2013) we also included data from the boreal biome, additional tundra sites, and wetlands (not synthesized in Belshe et al., 2013; Figure S1). Similar to McGuire et al. (2012) and Belshe et al. (2013), our database primarily represents undisturbed environments.
However, it also includes measurements from ca. 10 sites that have experienced high natural, anthropogenic or anthropogenically induced disturbances, such as permafrost thaw (Bäckstrand et al., 2010;Cassidy et al., 2016;Trucco et al., 2012), fires (Iwata et al., 2011;Ueyama et al., 2019), insect outbreaks (Heliasz et al., 2011;López-Blanco et al., 2017;Lund et al., 2017), or extensive harvesting practices (Coursolle et al., 2012;Machimura et al., 2005). Throughout the text, positive numbers for NEE indicate net CO 2 loss to the atmosphere (i.e., CO 2 source) and negative numbers indicate net CO 2 gain (i.e., CO 2 sink). GPP and ER are always given as positive numbers.  Figure S2 for more information about the predictors). These predictors characterize previously identified key relationships between CO 2 fluxes and summer and winter temperatures, radiation, precipitation, local hydrology and soil conditions, soil carbon stocks, and vegetation properties (i.e., see Beer et al., 2010;Belshe et al., 2013;Lund et al., 2010;Natali et al., 2019;. NDVI further reflects disturbances as it can show spectral browning signals related to drought, harvesting, or fires (Myers-Smith et al., 2020; Figure S3; Supplementary Text Section 2.5). We recognize that GPP and ER partitioning and gap filling rely on supporting environmental data (e.g., temperature and radiation), and consequently these fluxes already include some information about variables that we also used as predictors in our statistical models. We used annual  data for GDD3, FDD, WAB, and NDVI; the remaining predictors were considered to be static. All predictor datasets were masked to only include high-latitude tundra and boreal biomes (Dinerstein et al., 2017), and to exclude permanent water bodies, urban areas, and croplands based on a land cover

| Statistical modeling
Our main response variables were annual and growing season cumulative GPP, ER, and NEE, but we also modeled daily average GPP, ER, and NEE during the growing season. Annual and growing season CO 2 fluxes were linked to the environmental predictors using a range of different statistical modeling methods ( Figure S4). We used five statistical models; two were extensions of linear regression models, and three were based on machine-learning. All of these models have been widely used in empirical CO 2 flux upscaling studies (Bond-Lamberty & Thomson, 2010;Hursh et al., 2016;Tramontana et al., 2016;Ueyama, Ichii, et al., 2013). Specifically, we examined generalized linear models (GLMs); generalized additive models (GAMs); generalized boosted regression trees (GBMs); random forest (RF models); and support vector machines (SVMs).

We used several model approaches because individual models have inherent strengths and weaknesses (Supplementary Text
Section 2). For example, machine learning methods might suffer from overfitting, whereas regression methods might result in unrealistic values when extrapolated outside the model data range. Further, individual models may detect different patterns in the data, and the best performing models are not always the same for different response variables (Segurado & Araújo, 2004). We also produced an ensemble prediction by calculating a median prediction over the five predictions from the individual modeling methods (see also Tramontana et al., 2016). We used the median instead of the mean to avoid extreme predicted values inflating the ensemble prediction. In this procedure, the uncertainty of the ensemble is expected to be lower than the uncertainty of a single model (Aalto et al., 2018). Consequently, we produced six model predictions for each of our response variables.
To determine the main drivers of the spatial patterns of response variables, the relative contribution of predictors in the models was assessed using a prediction re-shuffling approach (Niittynen & Luoto, 2018). We first fit the model and developed predictions using the original data, and then repeated this procedure with the values for one predictor randomly permuted. The contribution of a variable was calculated as a correlation between these two predictions (i.e., original model and the model with a shuffled predictor) subtracted from one: Values close to 1 indicate that the two predictions were different, indicating high variable importance of the predictor variable.
Relative contribution = 1 − correlation Prediction original data , Prediction Randomly permuted data Each predictor was randomly permuted 100 times for each flux with each of the modelling methods, and an ensemble contribution was derived by taking a mean of the values. To visualize a predictor's effect on a response variable after controlling for the effects of other predictors, partial dependence plots were derived from the random forest model. For both variable importance and partial dependence plot analyses, we used daily average growing season fluxes because the growing season length estimates that were used to calculate growing season fluxes are not independent from GDD3. We found that the daily average fluxes correlated strongly with the growing season fluxes (Pearson's correlation 0.93-0.94), so they can be assumed to reflect the same relationships with the predictors.
To extrapolate across the study domain, we fit the models using the entire dataset to produce annual flux predictions and their ensembles that were subsequently averaged to 1990-2015 mean values. Because the ensemble predictions were among the most accurate and least uncertain predictions across all response variables, and because their use is generally recommended in predictive efforts (Araújo & New, 2007), our final flux maps and budgets were based on the flux ensemble. In addition to annual and growing season budgets, we also calculated a non-growing season budget (see Table S4). We had different numbers of observations and sites available for each flux and model, and consequently observed and predicted ER and GPP fluxes and budgets do not sum up to NEE.

| Model fit, predictive performance and uncertainty
To evaluate model fit, we predicted fluxes over the entire model training data. To assess the predictive performance of the models, we used a leave-one-site-out cross validation scheme in which each site was iteratively left out from the dataset, and the remaining data were used to predict fluxes for the excluded site (Bodesheim et al., 2018). For both model fit and predictive performance, we calculated bias as an average of the absolute error between prediction and actual observations, Pearson correlation (r) to determine the strength of the linear relationship between the observed and predicted fluxes, and root mean squared error (RMSE) to estimate the model error. We use the terms "observed" and "predicted" to distinguish between field measurements and model predictions but acknowledge that some of these observed values represent indirect estimates of fluxes (e.g., GPP).
We evaluated the prediction uncertainty of all flux models and the budget uncertainty of annual and growing season NEE models using a repeated random resampling procedure (Aalto et al., 2018). Prediction uncertainty was calculated to characterize the spatial variability in flux predictions across the high-latitude region,

| Observed flux variation
Flux measurements showed considerable variation in magnitudes and signs (CO 2 sink vs source) across the high-latitude environments ( Figure 1 and Table 1). Observed annual NEE (no upscaling) was on average a small source of CO 2 in the most northern parts of the study domain (tundra: +10 g C m −2 yr −1 , 42 sites; northern permafrost region: +6 g C m −2 yr −1 , 63 sites) and in drier environments (tundra upland: +16 g C m −2 yr −1 , 36 sites), whereas the boreal biome (−46 g C m −2 yr −1 , 41 sites), and in particular boreal uplands (−47 C m −2 yr −1 , 34 sites), and non-permafrost regions (−90 g C m −2 yr −1 , 20 sites) were net ecosystem CO 2 sinks. All environmental categories were, on average, net CO 2 sinks during the growing season, with the average NEE ranging from −37 to −115 g C m −2 period −1 (Table 1).
Tundra upland and non-permafrost regions had the lowest average growing season sink strength. The non-permafrost region sink was greatly reduced by one disturbed site that had large source values up to +600 g C m −2 period −1 (Petrone et al., 2014), but this was not apparent in the annual averages because the same site did not report annual fluxes. Although the environmental conditions at the sites were fairly representative of the entire high-latitude region ( Figure S6), colder environments with low NDVI and GDD3 as well as high FDD were less well represented (e.g., large areas of Siberia; Figure 1). Some chamber sites were located in conditions that would have otherwise remained undersampled ( Figure S6). These included sites with relatively high soil organic carbon stocks in Hudson Bay Lowland and northwestern Canada, and wet climates in Greenland and northern Fennoscandia.  Figure S8). Annual models for ER and NEE exhibited a better fit and predictive performance than the growing season models (based on r), whereas the opposite was true for GPP ( Figure 2 and Figure S8). The growing season GPP model fit and predictive performance was higher than that of the ER models, but annual GPP and ER models performed equally well. Model fit and predictive performance were similar in models trained with and without chambers (Table S3). In most predictive performance analyses, the lowest and highest observed fluxes were over-and underestimated, respectively, indicating overall poor predictive performance at the extremes ( Figures S9 and S10).

| Predictive performance of the models
Average predicted and observed fluxes were of similar magnitude (Table 1). However, there was a tendency for the average predicted values to have slightly larger GPP and ER values (e.g., observed and predicted annual GPP in the tundra: 250 g C m −2 yr −1 and 378 g C m −2 yr −1 , respectively) and stronger net CO 2 sink values than what was observed (e.g., observed and predicted annual NEE in the tundra: +10 g C m −2 yr −1 and −2 g C m −2 yr −1 , respectively). Our crosscomparison of annual and growing season flux ensemble predictions showed there was a mismatch between annual and growing season component fluxes in approximately 2% of the pixels (growing season GPP/ER larger than annual GPP/ER) and that unrealistic flux values (negative GPP or ER) were found in less than 0.01% of the pixels in the ensemble predictions. TA B L E 1 Summary statistics of observed and predicted (using the average ensemble prediction) annual and growing season GPP (gross primary productivity), ER (ecosystem respiration), and NEE (net ecosystem exchange) fluxes (g C m −2 yr −1 for annual and g C m −2 period −1 for growing season fluxes) in different environments across the high-latitude region over 1990-2015. The time-series of the sites were averaged prior to calculating the observed mean flux (i.e., one flux value from one site was used when the regional averages were calculated). Positive numbers for NEE indicate net CO 2 loss to the atmosphere (i.e., CO 2 source) and negative numbers indicate net CO 2 gain (i.e., CO 2 sink). Note that ER and GPP do not sum up to NEE as different numbers of observations and sites were available for each flux and model. Moreover, some plant uptake occurs outside of our defined growing season, and consequently growing season GPP and annual GPP do not equal to each other. The average fluxes were calculated based on the extent of the high-latitude tundra and boreal biomes (Dinerstein et al., 2017), permafrost zones (Brown et al., 2002), and land cover (i.e., wetlands, and everything else is upland; ESA, 2017). The confidence intervals for the observed fluxes and the uncertainty ranges for the predicted fluxes can be found in the

| Predicted flux variation
Predicted fluxes showed high spatial variability across the region with a general trend towards decreasing fluxes and sink strength with increasing latitude for GPP, ER, and NEE (Figure 3 and Figure   S11). The variability was related to differences in climate (GDD3 and FDD), solar radiation (RAD) and vegetation greenness (NDVI), which had the strongest influence on most of the fluxes (Figure 4).
Moreover, SOC, CLAY, and LC were important variables for annual NEE; CLAY and SOC both had a positive yet saturating relationship ( Figure S12). The relationship between LC and NEE suggested that the annual and growing season net sink strength was largest in wetlands and smallest in sparse vegetation ( Figures S12 and S13). Some variables had a very low variable importance for most of the fluxes (e.g., TWI, soil pH).
Our predictions revealed regional hot spots in annual and growing season NEE, GPP, and ER. Strong annual and growing season CO 2 sinks, having low ER and high GPP, were found in forested regions with high GDD3, NDVI, RAD, and low FDD across Fennoscandia and European Russia, southern Canada, and southern Siberia (Figure 3 and Figure S11). Annual CO 2 sources were identified within northern and central Siberia, Greenland, northern and central Alaska, as well F I G U R E 2 Observed and predicted annual fluxes of GPP (gross primary production; a and d), ER (ecosystem respiration; b and e), and NEE (net ecosystem exchange; c and f) based on model fit (a-c) and predictive performance (d-e). Model fit was evaluated by predicting fluxes over the entire model training data, while predictive performance was assessed using a leave-one-site-out cross validation scheme in which each site was iteratively left out from the dataset, and the remaining data were used to predict fluxes for the excluded site.

| Terrestrial ecosystem NEE budget for the high-latitude region
Our ensemble predictions showed that the high-latitude tundra and boreal region was on average an annual terrestrial ecosystem CO 2 sink over the 26-year (1990-2015) study period ( The average annual NEE budgets over the study period from CMIP5 and FLUXCOM were −488 and −1056 Tg C yr −1 , respectively (Table S5). In the boreal biome, average annual GPP in our study was 8850 compared to 8561 Tg C yr −1 in FLUXCOM. In the tundra biome, the average annual GPP in this study was twice as high as in FLUXCOM (2495 and 1267 Tg C yr −1 , respectively). Differences were larger for annual ER. Our annual ER budget for the boreal and tundra biomes was 8241 and 2156 Tg C yr −1 , respectively, but the same budgets were only 6363 and 1200 Tg C yr −1 in FLUXCOM. For the regional NEE budgets estimated with CMIP5 and FLUXCOM, see Table S5.

| DISCUSS ION
This study provides a conceptual and methodological framework to bridge the gap between local, regional, and high-latitude scales in

| Drivers and spatial patterns of GPP, ER, and NEE
Our results suggest that climatic, vegetation, and soil variables were all important predictors for terrestrial ecosystem CO 2 fluxes.
However, almost all CO 2 fluxes were strongly driven by the broad climatic gradients and spatiotemporal variability in solar radiation, growing and non-growing season climatic conditions, water balance, and the resulting vegetation greenness patterns, supporting the find-  Puma et al., 2007). In general, we found that warmer, moderately wet, and greener conditions (i.e., environments of higher biomass as indicated by NDVI) increased the magnitude of annual GPP and ER.
However, our results also indicate that the overall net sink strength increases with larger greenness, warmer and shorter winters, and wetter climate. These results suggest that GPP and ER respond rather similarly to changes in climate and vegetation conditions across the high-latitude region, although GPP might increase even more due to F I G U R E 4 Variable importance for annual and growing season fluxes of GPP (gross primary production), ER (ecosystem respiration), and NEE (net ecosystem exchange). Explanatory variables are GDD3 (growing degree days), FDD (freezing degree days), WAB (water balance), NDVI (normalized difference vegetation index), TWI (topographic wetness index), RAD (potential incoming direct annual radiation), SOC (soil organic carbon stocks up to 2 m), pH (topsoil pH), CLAY (topsoil clay content), and LC (land cover). Variable importance was calculated by assessing how a randomly permuted predictor influences the predictions across all five statistical models. Values close to 0 and 1 indicate low and high importance of the predictor variable, respectively. The box corresponds to the 25th and 75th percentiles. The lines denote the 1.5 IQR of the lower and higher quartile, where IQR is the inter-quartile range, or distance between the first and third quartiles [Colour figure can be viewed at wileyonlinelibrary.com] increases in vegetation greenness  and changing climate (Lund et al., 2010). However, differences in these relationships might occur in different regions and land cover types (Baldocchi et al., 2018;Belshe et al., 2013;Lafleur et al., 2012).
In addition to the climate and greenness variables operating mostly at large scales, other more local-scale variables such as soil organic carbon stock and land cover helped explain CO 2 fluxes. Soil organic carbon stock was the most important predictor for annual NEE, and it had a positive relationship with it, demonstrating that areas with high carbon stocks might lose more CO 2 to the atmosphere. However, this result was not supported by the annual ER models, which would represent the main process behind this positive relationship (i.e., larger carbon stocks have more potential for increased CO 2 emissions, particularly in dry conditions (Voigt et al., 2019)). The lack of this relationship might be due to annual ER models not covering the full range of conditions represented by the annual NEE models, or spurious causal relationships being identified by the relatively poorly performing NEE models. The importance of land TA B L E 2 Annual and growing season average GPP, ER, and NEE budgets (Tg C yr −1 ) over 1990-2015 across the environments and the spatial extent of each environmental category when permanent water bodies, urban areas, and croplands were masked away. The NEE budgets are based on upscaled NEE data and include an uncertainty range derived by bootstrapping. Note that ER and GPP do not sum up to NEE as different numbers of observations and sites were available for each flux and model. For the non-growing season CO 2 budgets estimated based on annual and growing season budgets, see

F I G U R E 5
Complementing annual NEE predictions averaged over 1990-2015. Mean annual NEE derived by subtracting annual ER (ecosystem respiration) from GPP (gross primary production) in this study (a), from a global upscaling product FLUXCOM (b), and from a process model ensemble CMIP5 (Coupled Model Intercomparison Project Phase 5) (c), and the standard deviation of these and the independently modeled annual NEE in this study (visualized in Figure 3c) (d). A regional-scale example of the spatial variation of annual NEE in our prediction in northern Alaska, with black outlines depicting the size of the pixel in one of the highest resolution (smallest pixel size) models in the CMIP5 ensemble (1.92 × 1.5°) (e) [Colour figure can be viewed at wileyonlinelibrary.com] cover was expected as it summarizes many key processes related to carbon cycling (e.g., the carbon uptake capacity, temperature sensitivity, as well as quantity and quality of carbon inputs into the soil; Sørensen et al., 2019) and other environmental characteristics (e.g., soil moisture is likely higher in wetlands than in sparse vegetation).
Our ensemble prediction suggested that most of the southern portion of the high-latitude terrestrial region was an annual net ecosystem CO 2 sink while the central and northern regions were neutral or small net CO 2 sources. Observed and predicted spatial patterns in fluxes were similar to those described by most previous studies. For example, our compiled field observations and predictions are consistent with the majority of Alaskan tundra being an annual ecosystem CO 2 source on average, similar to the average observed fluxes in McGuire et al. (2012) or the prediction in Ueyama, Ichii, et al. (2013). The strongest annual ecosystem CO 2 sinks in our study were located in southern European Russia, Fennoscandia, and southern Canada, as also observed in the FLUXCOM product (Jung et al., 2017;Tramontana et al., 2016).
For some regions, our ensemble prediction differed from the predictions of previous studies. The distribution of annual net CO 2 sources across the tundra biome was larger in our prediction compared to FLUXCOM, particularly in Siberia and Canada. This was likely explained by our models including some tundra sites from Canada, Greenland, European Russia, and Siberia, which were not covered by the FLUXCOM model training data. Some of the sites in these regions were annual net CO 2 sources on some years (Emmerton et al., 2016;Karelin et al., 2013). A similar disagreement was found between an Asia-focused statistical upscaling analysis by Ichii et al. (2017) which suggested stronger sink strength across large parts of Siberia, likely due to a limited number of northern eddy covariance sites used to train their models. The largest regional differences between our predictions, CMIP5, and FLUXCOM occurred in central Siberia,

Fennoscandia, European Russia, and eastern Canada and the Canadian
Arctic Archipelago, and these differences were primarily driven by the fact that CMIP5 showed these regions to be sources whereas they were sinks in FLUXCOM and our analysis ( Figure 5). These regional differences demonstrate that these particular areas should be studied further to understand the sink-source patterns more accurately in the future.
Our uncertainty estimation suggests that CO 2 flux predictions should be interpreted carefully in areas that lack sampling locations or have large variability in fluxes that cannot be captured by the predictor variables. Such areas are particularly concentrated in European Russia, eastern Canada, and the Canadian Arctic Archipelago. As the accuracy of the prediction can usually be improved with increases in the quantity and quality of data, new measurements in these regions or better predictors would likely improve the performance of highlatitude CO 2 flux models.

| Key sources of uncertainty in our modeling approach
No single best model could be identified across the five modeling methods. However, the three machine learning methods outperformed the two regression models, particularly for NEE, as demonstrated by the improved model performance, lower uncertainty, and the lack of unrealistically high or low flux values in predictions. The better performance of the machine learning methods was likely related to their flexibility and capability to find complex structures in the flux data (Elith et al., 2008). Our results demonstrate that several machine learning methods should be tested to produce the most accurate high-latitude flux predictions and that ensemble methods provide robust predictions (Araújo & New, 2007). Our results also indicate that an ensemble prediction based on machine learning methods alone would likely lead to higher model accuracy and transferability (see also Tramontana et al., 2016).
Our models performed well when predicting to the same data that the models were trained with, but the models had challenges when tested against independent validation data. The predictive performance of our ensemble predictions was comparable to (annual GPP and ER) or less than (growing season GPP, ER, NEE, and annual NEE) that of in other global and regional upscaling studies Natali et al., 2019;Peltola et al., 2019;Tramontana et al., 2016;Ueyama, Ichii, et al., 2013). year as daily-to-monthly fluxes, was roughly 0.65-0.7 for NEE and 0.7-0.8 for GPP and ER. There are several reasons for why some of our models performed more poorly than these previous studies, which we explain below.
The lower quantity of measurements and weaker comparability of fluxes derived with EC and chamber techniques and with variable measurement lengths might explain the lower predictive performance in our study compared to the other upscaling studies. As we used aggregated fluxes over the growing season and annual time scales, the sample size in our models was smaller than in other efforts which all used daily or monthly fluxes (a few hundred observations versus thousands of observations). A larger sample size usually increases the predictive performance of the models, particularly when these measurements cover variable environmental conditions that can be captured by the predictors. For example, FLUXCOM models (Jung et al., 2017(Jung et al., , 2020Tramontana et al., 2016) might have had a higher predictive performance than our models because they use a global FLUXNET database (Pastorello et al., 2020), which covers broad environmental gradients. However, FLUXNET data originates mostly from lower latitudes (e.g., only five sites from the Arctic and 34 from the boreal out of 224 global sites in total used in Tramontana et al., 2016). This could explain the larger net sink strength in FLUXCOM compared to our predictions. The higher predictive performance of FLUXCOM compared to our prediction might also be explained by the fact that FLUXNET is based on a single flux measurement technique (EC) with standardized filtering, gap-filling, and partitioning procedures. We included chambers to our analysis as they covered conditions that were not covered by the EC network even though we acknowledge that using both chamber and EC measurements, and different partitioning methods for EC, increased the number of different flux measurement techniques and study designs, and may have made the comparison of fluxes across sites more uncertain (Fox et al., 2008;Tramontana et al., 2016). However, we observed no significant differences in fluxes estimated with the two approaches indicative of these mismatches ( Figure S6d), and the performance of models did not change when chambers were excluded from model training data. These results suggest that the relatively low performance of some models is related to the high variability in both EC and chamber-derived CO 2 flux estimates that is not captured by our predictors. Further, it demonstrates that including chamber measurements, despite operating at different spatial and temporal resolutions than the EC technique, did not decrease the model performance. It is also possible that the lower predictive performance of growing season models compared to annual models was related to the variable growing season measurement periods used across the studies. We accepted this variability because our goal was to use as many published fluxes as possible to improve the geographical and environmental coverage of sites.
The accuracy of our ensemble predictions varied depending on the flux, with the predictive performance being lowest for NEE models (r = 0.21-0.27). The predictive performance of our GPP and ER models was higher (r = 0.53-0.73) and is comparable to past efforts Natali et al., 2019;Tramontana et al., 2016;Ueyama, Ichii, et al., 2013) because these fluxes represent the ecophysiological and biogeochemical processes describing CO 2 uptake and loss, respectively. GPP and ER also already included some information about temperature and radiation variables that we used as predictors in our statistical models, which may introduce some circularity and artificially inflate the model performance. Our NEE models over-and underesti- 2017; Figure 3b), but some sink observations (24%) were also predicted as sources. We also discovered that the observed average annual NEE was often larger (more positive) than the individually predicted average NEE, which was either a result of the model not being able to predict sources accurately, or of the distribution of flux sites being biased towards environments with larger CO 2 source observations than the entire region on average (see the large number of sites with source observations originating primarily only from Alaska in Figure 1). These results demonstrate that the predictors included in our analyses did not fully represent the spatial gradients and dynamic temporal variability in environmental conditions that influence carbon cycle processes, and particularly the high and low NEE conditions. Further research should explore improvements offered by other current and potential future predictors related to the disturbance and permafrost conditions, snow cover duration and snow depth, soil moisture and nutrient availability, and phenology, root properties, and microbial communities (Illeris et al., 2003;Järveoja et al., 2018;Nobrega & Grogan, 2007).
Even though the geographical and environmental coverage of the flux sites was improved in our study compared to previous efforts, our models included only ca. 10 sites from heavily disturbed conditions (see Section 2.1.1). Consequently, our sites did not cover the full range of disturbance and post-disturbance recovery conditions and the associated impacts on CO 2 fluxes. For example, rapidly thawing permafrost and burned landscapes remained largely under-sampled across Siberia.
These disturbances have a substantial impact on carbon cycling in high-latitude ecosystems (Abbott et al., 2016;Walker et al., 2019), including direct emissions from the disturbance (not estimated with our models) and typically increased net CO 2 emissions for several years to decades after the disturbance (Coursolle et al., 2012;Kittler et al., 2017;Lund et al., 2017;Turetsky et al., 2020) which should ideally be captured by our models. The lack of flux data representing disturbed and post-disturbance recovery conditions likely leads to underestimations in net ecosystem CO 2 emissions, and is generally thought as one of the key limitations in statistical upscaling efforts (Jung et al., 2020;Zscheischler et al., 2017).

| Terrestrial ecosystem CO 2 budget and its uncertainty
Although our models may be biased towards sinks, our results suggest that high-latitude terrestrial ecosystems were on average an annual net CO 2 sink during 1990-2015. The uncertainty of this budget was high, as demonstrated by the low predictive performance of the annual NEE model, and the fact that budgets derived from different predictions (individual NEE predictions and ER-GPP predictions) differed by ca. 500 Tg C yr −1 -the latter most likely being linked to the different numbers of observations and sites available for each flux and model ( Figure 1). Nevertheless, the annual NEE budget was of similar magnitude to the one estimated by CMIP5 models and larger (less negative) than the one estimated by FLUXCOM (Table   S5). The boreal biome was responsible for most of this sink strength (−406 Tg C yr −1 , from −499 to −239 Tg C yr −1 ; 13.9 × 10 6 km 2 ). In contrast, the tundra biome was on average a small sink (−13 Tg C yr −1 , from −81 to +62 Tg C yr −1 ; 6.7 × 10 6 km 2 ) or a small source (+10 g C m −2 yr −1 ), based on our predictions and observations. This suggests that the tundra biome was on average close to CO 2 neutral even though the large soil organic carbon stocks of this region would indicate larger historical CO 2 sink strength (Hugelius et al., 2014 Belshe et al., 2013 was larger than in this study) which might explain some of the discrepancies between these studies, although the general patterns of these boundaries were rather similar (see e.g., Figure 1 in McGuire et al., 2012 vs. our tundra domain in Figure 1).