Upscaling Net Ecosystem Exchange Over Heterogeneous Landscapes With Machine Learning

This paper discusses different feature selection methods and CO2 flux data sets with a varying quality‐quantity balance for the application of a Random Forest model to predict daily CO2 fluxes at 250 m spatial resolution for the Rur catchment area in western Germany between 2010 and 2018. Measurements from eddy covariance stations of different ecosystem types, remotely sensed vegetation data from MODIS, and COSMO‐REA6 reanalysis data were used to train the model and predictions were validated by a spatial and temporal validation scheme. Results show the capabilities of a backwards feature elimination to remove irrelevant variables and an importance of high‐quality‐low‐quantity flux data set to improve predictions. However, results also show that spatial prediction is more difficult than temporal prediction by reflecting the mean value accurately though underestimating the variance of CO2 fluxes. Vegetated parts of the catchment acted as a CO2 sink during the investigation period, net capturing about 237 g C m−2 y−1. Croplands, coniferous forests, deciduous forests and grasslands were all sinks on average. The highest uptake was predicted to occur in late spring and early summer, while the catchment was a CO2 source in fall and winter. In conclusion, the Random Forest model predicted a narrower distribution of CO2 fluxes, though our methodological improvements look promising in order to achieve high‐resolution net ecosystem exchange data sets at the regional scale.

Process-based biogeochemical models have been widely applied for this purpose (e.g., Post et al., 2018;Xiao et al., 2011), but subject to assumptions and model parametrizations. Data-driven machine learning techniques such as Random Forest (RF) are another promising approach to predict NEE as they can grasp even highly nonlinear relationships to explanatory variables as is usual in environmental data (Cutler et al., 2007). Previous attempts using statistical modeling include nonspatial predictions of NEE at the EC tower scale (Dou et al., 2018;Safa et al., 2019;Zhou et al., 2019). Other attempts aimed at upscaling of carbon fluxes to the continental or national scale (Papale et al., 2015;Sun et al., 2011;Xiao et al., 2008) or the globe, most notably the FLUXCOM approach (Bodesheim et al., 2018;Jung et al., 2011Jung et al., , 2020. Upscaling to the regional scale at high spatial resolution has rarely been done although NEE estimates of heterogeneous regional and local ecosystems are of high value for assessing ecosystem services in spatial planning (Tammi et al., 2017). Furthermore, products at a finer spatial resolution are less prone to contain mixed pixels with contamination of the main land use class by, for example, roads, settlements, or tree rows. Zhang et al. (2011) developed a regression model for the U.S. Great Plains based on EC towers representing grassland only. Post et al. (2018) already upscaled NEE to the study area of this analysis, though with a process-based model. Spatial cross validation, that is, excluding whole locations from model training and testing the model on them, is crucial for a realistic assessment of the reliability of spatial predictions beyond the locations of training data. A substantial performance decrease is to be expected in comparison to a random split of data points into training and test sets, which hence overrates model performance due to spatial autocorrelation Roberts et al., 2016). Tramontana et al. (2016) conducted a profound cross validation analysis for spatial predictions of various carbon and energy fluxes with the conclusion that NEE is especially difficult to predict. Feature selection of explanatory variables, on the other hand, can considerably improve data-driven model performance as it reduces overfitting and removes irrelevant or redundant variables (Hall & Smith, 1999). Meyer (2018) proposed a sequential feature selection algorithm based on spatial cross validation to remove spatially autocorrelated predictors. In contrast to this, conventional feature selection as implemented in the caret package (Classification And REgression Training, Kuhn, 2020) is based on internal cross validations within the training data, and hence fails to improve model performance when testing on locations not used for model training (Meyer et al., 2019). Genetic algorithms like the Guided Hybrid Genetic Algorithm (GHGH, Jung & Zscheischler, 2013) are useful for very large numbers of features (>100), though generally they do not rely on spatial cross validation. Quality of EC data is another issue for upscaling attempts, especially when aggregating half-hourly to daily fluxes. While data quality improves when excluding low-confidence values based on quality control, too small data set. also limit the learning capacities of machine learning algorithms (Ließ et al., 2012). A common practice is to indicate daily data as missing if more than 20% of half-hourly values are missing or of low quality (Tramontana et al., 2016;Yuan et al., 2010). However, to our knowledge a sensitivity analysis to different percentages has not been done before.
In conclusion, NEE upscaling with data-driven methods at high spatiotemporal resolutions and incorporating different land uses remains a major task to be handled in order to approach the goal of flux information "everywhere, all of the time" (Baldocchi, 2014). Thus, the objectives of this paper are i) to perform upscaling of daily NEE over heterogeneous landscapes of the Rur catchment in western Germany for the years 2010-2018 with a RF model incorporating EC measurements, remote sensing and reanalysis data and ii) to assess the impact of EC data quality and feature selection on the model performance.

Study Area
The Eifel/Lower Rhine Valley Observatory covers the Rur catchment located near the German-Belgian-Dutch border and is one of four Terrestrial Environmental Observatories (TERENO) in Germany (Zacharias et al., 2011). These areas were selected for the TERENO network because they are representative of typical landscapes found in Central Europe . The catchment covers an area of 2,354 km 2 and can be divided into a northern lowland part with intensive agriculture and a relatively high proportion of built-up areas and a southern low mountain part where pastures and forests prevail, as shown in Figure 1. Based on a simplified land cover classification by Lussem and Herbrecht (2019), the catchment consists of 27.6% grassland, 25.7% cropland, 17.7% deciduous forest, 8.5% coniferous forest, and 20.4% other land cover types including urban areas, open cast mines and water bodies. Mean annual temperatures range from about 7.5-10.2°C, increasing from south to north. Mean annual precipitation decreases from 1,200 mm in the southern low mountain parts to 700 mm in the north (Baatz et al., 2014).

Eddy Covariance Data
CO 2 flux measurements from nine EC stations covering different land cover types and elevations within the study area have been used for model training and prediction (see Table 1 for details and abbreviations). The nine stations are all part of the TERENO network (Zacharias et al., 2011). Measurements from these stations were processed with the TK3 software (Mauder & Foken, 2011); 20 Hz frequency data were hereby processed to 30 min fluxes and corrected for storage terms to obtain NEE values. All processing and quality control schemes were carried out according to the standardized strategy presented by Mauder et al. (2013), which also includes a test on developed turbulence after Foken and Wichura (1996). Detailed measurement and processing descriptions can be found in the references listed in Table 1, a short description of each site is given here.
RO is an extensively managed grassland site, which is cut several times per year and mostly consists of ryegrass and smooth meadow grass (Lolium perenne, Poa pratensis). The EC tower was placed in the middle of two neighboring pastures with slightly different management regimes (Borchard et al., 2015;Korres et al., 2010). RU3 is a grassland site with similar characteristics , while RU1 is a grassland site at significantly lower elevation (Lussem & Herbrecht, 2019). SE, ME, RU2, and RU4 are cropland sites with rotating crops, mostly sugar beet, winter wheat and winter barley (Eder et al., 2015;Lussem & Herbrecht, 2019;Post et al., 2015;Schmidt et al., 2012). WU1 is located above a planted spruce forest (Picea abies) of uniform height (Graf et al., 2014), while nearby WU2 is placed in an 8.6 ha area which was deforested in 2013 to allow a natural succession toward a European beech forest (Ney et al., 2019;Wiekenkamp et al., 2016). Currently (2020), spontaneous vegetation of the deforested area consists mostly of grass, shrubs (e.g., Cytisus scoparius) and young trees (Sorbus aucuparius, Betula pubescens).
EC data were aggregated from half-hourly fluxes to daily data. As only high to moderate quality EC data were used (quality flags 0 and 1), frequent gaps were created. The number (n) of days containing all 48 REITZ ET AL.   (Jarvis et al., 2008) and eddy covariance (EC) stations used for training within the Rur catchment area (left), the location of the study area and the FLUXNET stations within Germany (middle) and simplified land cover classes after Lussem and Herbrecht for the Rur catchment (2019) (right).
half-hourly intervals (100%) was only 386 for all TERENO stations combined, which is about 3.2% of all possible days and constitutes the first data set. Additional data sets were created with a varying number of missing 30-min intervals allowed: minimum 45/48 (93.75%) intervals of high to moderate quality (n = 1,035; 8.5% of possible days), 42/48 (87.5%) (n = 2,032; 16.6% of possible days), and 36/48 (75%) (n = 3,996; 32.7% of possible days). For the calculation of these daily NEE values, gap-filled data inferred with the REddyProc package (Wutzler et al., 2018) were used. In case gap-filled data were not available, the mean was calculated of all nonfilled values of each respective day. Based on the minimum of reliable half-hourly values included, these data sets are referred to in this paper as 48, 45, 42, and 36, respectively. Forest sites were underrepresented in the TERENO data, as only one coniferous forest site and no deciduous forest site were included. To achieve a better representation of each ecosystem type and to broaden the environmental envelope, we added daily NEE data with variable ustar-thresholds of six further stations (two coniferous forest, two deciduous forest, one grassland, one cropland site) from the FLUXNET2015 database (Pastorello et al., 2020), as shown in Table 1. Because quality-flag schemes may have differed among these sites, we filtered the FLUXNET data according to the relative uncertainty instead. In order to create data sets of equal proportions as the 48, 45, 42, and 36 data sets, we took the X days with the lowest relative uncertainty, with X being 3.2%, 8.5%, 16.6%, and 32.7%, respectively. Finally, these FLUXNET data sets, were added to the 48, 45, 42, and 36 data sets. The sensitivity of each of these data sets with a varying quality-quantity balance to the RF performance was then further evaluated with the feature selection and cross validation strategies described below.

Raster Data
Explanatory variables were compiled from various sources and were of different spatial and temporal resolutions, as shown in Table 2. These variables were chosen because they are regarded to potentially affect NEE, and were selected by availability for the whole time period 2010-2018. Vitale et al. (2016) showed that variations of vegetation indices such as Leaf Area Index (LAI) can highly influence carbon fluxes. Ishtiaq and Abdul-Aziz (2015) concluded that meteorological parameters have a strong linkage with CO 2 fluxes, especially "radiation-energy" components. Datetime variables such as Day of Year can also be a useful proxy for estimating CO 2 fluxes (Acosta et al., 2018). We used the following remotely sensed REITZ ET AL.  (Didan, 2015). All of these data sets were quality controlled to exclude contaminated pixels with the quality assurance raster included in the MODIS products. Subsequently, for NDVI, EVI, LAI, and FPAR a Whittaker smoother (Atzberger & Eilers, 2011) was applied to fill gaps and smooth out noise in the data occurring from undetected clouds. Finally, these vegetation data sets were linearly interpolated in time from 8-day to daily data.
Daily gridded data for the meteorological variables air temperature and relative humidity in 2 m, soil temperature, precipitation, zonal, and meridional wind speed in 10 m, long wave upward and downward radiation at the surface and net shortwave and longwave radiation at the surface were obtained from the COSMO-REA6 regional reanalysis data set (Bollmeyer et al., 2015) and were regridded with Climate Data Operators (Schulzweida, 2019). Furthermore, daily downward shortwave radiation from 2010 to 2017 was acquired from the Heliosat (SARAH-2) Surface Solar Radiation Data Set (Pfeifroth et al., 2019), other variables include a digital elevation model from the Shuttle Radar Topography Mission (Jarvis et al., 2008), and soil moisture and potential evapotranspiration from the German Weather Service (DWD, 2019) based on Löpmeier (1994  Raster data were used and further processed at two different steps in the analysis, to i) extract values at the coordinates of each site for model training and validation, and ii) predict NEE for the entire catchment area.
For the latter, all raster sets were homogenized to the same extent and same spatial resolution of 250 m with bilinear interpolation of the raster package in R (Hijmans, 2020).

RF Model
RF is a machine learning method based on an ensemble of many binary decision trees. The algorithm was introduced by Breiman (2001) and is widely used for classification and regression in ecology (e.g., Aide et al., 2012;Prasad et al., 2006;Tramontana et al., 2016). Each decision tree is grown with a random subsample with replacement of the input data, called bootstrapping (Efron, 1979). At each node in the decision tree, a threshold of a randomly selected explanatory variable is ascertained to split the data into the two most homogeneous subgroups, i.e. with the lowest variance. The leaf nodes at the end of the tree do not further split the data but contain predictions of the target variable. This value is the mean of the target variable of all elements in the corresponding subgroup. For the final prediction, results of all trees (in this case 500) are averaged to overcome weaknesses of single trees. One consequence of this algorithm, however, is that predictions cannot be out of bounds of the training range. In this study, we used the implementation of the RF code in the randomForest package in R by Liaw and Wiener (2002) to predict NEE in a regression approach.
In order to identify an ideal number of predictor variables used at each split node ("mtry"), model tuning was conducted with the caret package, which is a wrapper to perform model tuning for various predictive models.
In order to perform feature selection, we first split the data into spatial and temporal folds (described in the next section in detail) with the CAST package (Meyer, 2018). In a next step, we applied the forward feature selection (FFS) procedure of this package with root mean squared error (RMSE) as performance metric to punish high errors in particular. The advantage of CAST FFS is that feature selection results are based on spatiotemporal cross validation rather than on training data only. However, as FFS sometimes selected very few variables with unsatisfactory performance (see Section 3.1), a slightly modified version of this procedure has been devised and compared to FFS. We applied a backward feature elimination (BFE), which starts with all features and iteratively removes the worst feature based on a spatial or temporal cross validation. Conversely, FFS iteratively adds features to the best combination of two features. Since RF relies on randomization, results can significantly differ between model runs. Hence, each iteration within BFE was repeated five times to average out such randomization effects. This leads, however, to a significant increase in computation time. The general algorithm of the BFE procedure is described in Table 3. To illustrate the impact of these two feature selection procedures on model performance, model runs without any feature selection were evaluated too.

Cross Validation
In order to assess RF performance beyond the scope of training data, NEE predictions have been cross-validated for (a) spatially and (b) temporally independent test sets. The additional FLUXNET data from outside the catchment were only used for training, whereas the TERENO data were used for training and testing. Figure 2 displays the cross validation strategies in a schematic way. Especially spatially independent test sets may be important for the assessment of the upscaled NEE predictions presented in Section 3.2 because they simulate predictions for pixels without any information used for model training. These cross validation strategies have been performed for all different NEE data sets and feature selection methods.
a) Data were split into folds by station ID. Stations were only considered for leave-out if they i) were not the only station of their land cover class to ensure that the class was still contained in the training data when excluding the station and ii) had data spanning over at least three years to ensure representative results. Therefore, five stations (SE, RU1, ME, RO, and WU1) were regarded, though all other stations were included for training. b) Data were split into folds by year. Each fold containing one year of data was left out once and predicted by a model trained with the other years. 2010 has only been regarded for training and not for testing as only data from one TERENO site was available for 2010.
We used the coefficient of determination (R 2 ), the mean absolute error (MAE) and RMSE as metrics to evaluate model performance.
The relative importance of selected variables for model building was assessed through RF's internal variable importance metric implemented in the randomForest package. For this, data points of each variable are randomly permutated and the relative increase of the mean squared error (MSE) based on an internal cross validation within in the training data is measured. This error is assumed to increase if the variable is important.
REITZ ET AL.

Cross Validation Results
We evaluated NEE predictions with a leave-one-fold-out cross validation by withholding either sites (spatial cross validation) or years (temporal cross validation). Note. Performance is displayed as root mean squared error (RMSE; in g C m −2 d −1 ), mean absolute error (MAE; in g C m m −2 d −1 ) and coefficient of determination (R 2 ). n Var gives the number of selected variables, variables selected lists which variables were selected by the Nr. Stated in Table 2. RMSE, MAE, and R 2 values are reported as averages of the respective folds. Abbreviations: BFE, backward feature elimination; FFS, forward feature selection. The relative importance of selected variables was assessed by the importance function of the randomForest package. As shown in Table 5, the most important variables were EVI and ETpot. Figure 3 displays an assessment of the quality of NEE predictions in comparison to observed TERENO validation data. While predictions and observations have almost the same mean values (−2.31 and −2.3 g C m −2 d −1 ) and rather similar median values (−1.84 and −1.33 g C m −2 d −1 ), and the regression line a slope close to 1 (y = 0.15 + 1.06x; Figure 3a), the standard deviation of predictions (2.5 g C m −2 d −1 ) is much lower compared to observations (3.73 g C m −2 d −1 ). The interquartile range of predictions is also narrower than of observations (−3.96 to −0.27 compared to −4.49 to 0.29 g C m −2 d −1 ; Figure 3b). As for predictions, values from about −5 to 1 g C m −2 d −1 were more frequent, and outside of that range less frequent than in observations, resulting in a narrower distribution of values ( Figure 3c). This results in higher absolute errors for high flux magnitudes, especially for positive fluxes (Figure 3d).

Upscaling Results
We predicted daily NEE data at 250 m spatial resolution for the Rur catchment from 2010 to 2018. According to the results of the previous section, the 48 NEE data set and explanatory variables selected with spatial BFE were used for model training. Table 6 shows the upscaled results aggregated by land cover class and season. To put these results into perspective, such aggregations over actual measurements within the catchment are also included in Table 6. Pixels classified as urban or built-up were excluded from the analysis because anthropogenic CO 2 emissions were not represented in the training data. Results show that vegetated areas of the Rur catchment were on average a CO 2 sink between 2010 and 2018 with about −0.65 g C m −2 d −1 . Grasslands and deciduous forests were the strongest sink (−0.76 g C m −2 d −1 and −0.72 g C m −2 d −1 , respectively), while croplands captured the least net amount of CO 2 (−0.56 g C m −2 d −1 ). During winter (December-February) and fall (September-November), the Rur catchment was a CO 2 source (0.86 g C m −2 d −1 and 0.75 g C m −2 d −1 , respectively), while in spring (March-May) it was a strong sink (−2.14 g C m −2 d −1 ), closely followed by summer (June-August; −2.02 g C m −2 d −1 ). Figure 4 shows yearly courses of predicted NEE aggregated by land cover for the investigation period. Additionally, daily NEE raster were aggregated by season and the whole investigation period ( Figure 5). These results show that all land cover classes were a CO 2 source in fall and winter and sink in spring and summer, although the CO 2 uptake started decreasing in summer already. Croplands were the earliest to become a sink in spring and also to turn into a source after day of year 200. This NEE sink capacity decrease of croplands from spring to summer is also observable in Figure 5 as croplands prevail in the northern half of the catchment. Forests were a stronger source than croplands and grasslands in fall and winter, though deciduous forests were also the strongest sink with average NEE below −5 g C m −2 d −1 around day of year 170. However, coniferous forests and especially deciduous forests were a greater sink in summer in actual measurements, and no CO 2 source in fall. In contrast, grasslands were predicted to be a greater sink in summer compared to actual measurements. Differences between land cover classes were in general less pronounced in upscaled predictions than in measurements.

Discussion
The results of the study showed that a data-driven upscaling of NEE to the regional scale predicted the average NEE well though underestimated the variance (Figure 3b). Feature selection and the right quality-quantity balance of NEE data, however, can improve model performance. Similar to our results, high errors for REITZ ET AL.   (Jung et al., 2011;Tramontana et al., 2016). Xiao et al. (2011) showed that an ecosystem model predicted NEE with an R 2 between 0 and 0.66, depending on the site. Richardson et al. (2012) demonstrated increasing random errors with flux magnitude for half-hourly CO 2 flux measurements. The absolute errors displayed in Figure 3d show a similar pattern, indicating that high flux magnitudes may be difficult to predict and validate because actual measurements in those ranges are already error-prone. Meyer et al. (2018) demonstrated that random cross validation lead to an overoptimistic view of the model performance compared to spatial cross validation. In our case, RMSE could be improved to 1.85 g C m −2 d −1 and R 2 up to 0.82 with a random cross validation, indicating that mere data reproduction is much easier than actual spatial prediction. Elevation was named as a typical example of a spatially autocorrelated predictor by Meyer et al. (2019). Hence, it is reasonable that it was removed by FFS and BFE for spatial cross validation, but not for temporal cross validation. Besnard et al. (2019) concluded that integrating memory effects of past disturbances in a recurrent neural network outperforms nondynamic statistical models like RF. So not including memory effects in our study might limit the prediction capacities.
REITZ ET AL.
10.1029/2020JG005814 10 of 16 One intrinsic feature of RF is to not extrapolate beyond the input data due to the prediction being the average target value of the subgroup within a leaf node. Although we attempted to overcome this limitation by including sites from outside the catchment and hence broadening the environmental envelope, outliers REITZ ET AL.   with high flux magnitudes were still underpredicted. A portion of the prediction error can also be attributed to uncertainties in the raster data sets used for model training and predictions. Some of the most important variables such as ETpot, Tair, and rH were also measured in situ at the EC stations SE, RO, WU1, and WU2. Averaged over these four stations, Tair from Cosmo-REA6 coincided very well with in situ Tair (R 2 = 0.99); the same applies to rH (R 2 = 0.88), and modeled ETpot (R 2 = 0.93). SWI was combined from two different sources without data overlap but both sources also agree well with in situ measurements (Heliosat: R 2 of 0.96; MODIS: R 2 of 0.92). However, we assume that MODIS-based vegetation indices did not capture smallscale vegetation structures well and hence contributed to prediction errors. A possibility to improve vegetation data to inform the RF model would be to use remote sensing data with a higher spatial resolution such as Sentinel-2 (Drusch et al., 2012), which was not used here because it did not cover the whole investigation period. Another limitation of our study comprises the exclusion of 20.4% of the land area from the analysis because anthropogenic fluxes were not measured. This high proportion results from the high population density in the northern part of the catchment and the relatively large (13 km 2 ) Inden open pit mine. However, only small biospheric net fluxes are to be expected from these areas as they are to a large extent non vegetated and thus may not contribute much to the overall biospheric fluxes of the catchment.
The results indicate that smaller data sets incorporating only few (<6.25%) or no low-quality intervals in the aggregated daily fluxes are more beneficial than larger data sets with more low-quality data. Small data sets can increase overfitting of a predictive model; however, the ensemble characteristic of RF of averaging multiple trees also counteracts overfitting. Thus, it seems reasonable for RF to select for small data sets with higher quality. Although a standardized quality-flag scheme was used on the TERENO-data set, it should be noted that quality-flagging is not fully standardized in the flux-community yet. Thus, our thresholds may not be transferable to other schemes.
As uncertainty is correlated with flux magnitude, filtering the FLUXNET data by small relative uncertainties has the side-effect to favor large NEE values and discriminate small ones, whereas quality flags are not correlated with magnitude. However, the distribution of the TERENO NEE magnitudes shows that, naturally, small fluxes occur much more frequent than large fluxes (Figure 3c). Such imbalanced data is a problem for RF, which requires about equally sized domains in the training data to not overpredict the largest domains (Krawczyk, 2016;Torgo et al., 2013). Therefore, favoring large fluxes in the FLUXNET data improves their representation in the training data sets. The test data sets, however, consisting only of quality-flag filtered data, remained unbiased and are thus regarded suitable for model cross validation. Even so, Figure 3c shows that the maximum around NEE = 0 in the training data was still overpredicted and rare domains underpredicted, indicating that the training data probably was still not balanced enough.
The performance differences between BFE and FFS can partly be explained by local optima of variable subsets, as sequential feature selection algorithms are prone to being trapped in such local optima ( Motoda, 2008). In these cases, the first local optimum trap for BFE is much closer or even identical to the absolute optimum than the first local optimum trap for FFS. Hence, a BFE is regarded superior in such cases. The relatively high variance between RF model runs increased fluctuations and can thus amplify this effect by creating artificial local optima, leading to a suboptimal variable selection. Averaging five model runs reduced the variance within 100 model runs by about 76%, generally leading to more robust results. However, it should be noted that variance between model runs can be lower for other machine learning algorithms and that repeating and averaging is computationally expensive and therefore not suitable for large numbers of variables to select from. For such cases, a genetic algorithm like GHGA (Jung & Zscheischler, 2013) may be more appropriate.
The relatively high elevations of forests and grasslands in the catchment resulted in lower average annual temperatures in years 2010-2018 (coniferous forests: 8.9°C, deciduous forests: 9.6°C, grasslands: 9.5°C) compared to croplands (10.8°C), and hence a later start of the growing season might be an explanation for croplands being an earlier CO 2 sink. Deciduous trees, on the other hand, first need to build-up the canopy leaf area to utilize suitable conditions for photosynthesis, though having higher photosynthesis capacities when fully leafed. However, differences between land cover types were less pronounced in upscaled results than in actual measurements (Table 6). One explanation for this might be mixed pixels in MODIS EVI (250 m spatial resolution) containing spectral responses from different land cover types.
The catchment was a slightly stronger CO 2 sink in spring than in summer. Lindroth et al. (2008) stated that net CO 2 uptake in Swedish spruce forests is shifted toward the earlier parts of the growing season because respiration was still low while radiation was already high. Managed grasslands on the other hand, usually are cut several times during summer. For example, Rollesbroich was cut three times in the growing season of 2013 (Borchard et al., 2015) and each defoliation had the potential to turn grassland temporally into a CO 2 source (Wohlfahrt et al., 2008). Croplands showed the largest decrease of CO 2 uptake in late summer. Schmidt et al. (2012) analyzed vegetation parameters of a winter wheat field in the catchment over the course of two years. LAI of living/green leaves reached the maximum in early May, plant senescence (LAI of brown leaves) already started in late April and reached its peak in July. Although these patterns can differ for other crops, the results still indicate that specific croplands uptake the most CO 2 in spring. The EC aggregations in Table 6 further confirm a decrease of CO 2 uptake in summer for croplands and grasslands. In comparison, the aggregated EVI of the whole Rur catchment started slowly increasing in late February, peaked in early June, and declined afterward. Graf et al. (2020) showed that the exceptional drought and heat across Central Europe during the 2018 growing season resulted in a reduced net CO 2 uptake for many drought-affected EC stations, including SE, RO and WU1. The whole catchment was predicted to be a significantly weaker CO 2 sink in summer 2018 (−0.89 g C m −2 d −1 ) compared to 2010-2017 (−2.16 ± 0.45 g C m −2 d −1 ). Whereas in spring 2018 the sink capacity decrease to 2010-2017 was less distinct (−1.92 g C m −2 d −1 in 2018 compared to −2.16 ± 0.51 g C m −2 d −1 2010-2017), indicating that the seasonal averages may be influenced by one exceptional year. In view of these findings, we consider the seasonal variations of upscaled NEE as largely plausible.

Conclusion
In this study, we scaled up daily EC NEE data to the regional scale at 250 m spatial resolution with a RF model integrating remote sensing and reanalysis data. Furthermore, we evaluated the impact of feature selection and NEE data quality-quantity balance on the model performance. We conclude that upscaling results can be improved with a BFE to remove unnecessary predictors and by incorporating no or only small (<6.25%) amounts of low-quality intervals in the aggregated daily NEE data. Therewith, we provided a data-driven approach for predicting spatial NEE data sets which can be used for assessing the CO 2 uptake of heterogeneous local and regional ecosystems or calibrating and validating process-based models. However, the spread of NEE observations and differences between land cover types were underestimated.
Vegetated parts of the Rur catchment acted as a CO 2 sink between 2010 and 2018 with about −0.65 g C m −2 d −1 . The catchment was predicted to be a slightly stronger sink in spring than in summer probably partly due to plant senescence increasing in summer in cropland and grassland ecosystems, while it was a CO 2 source during fall and winter. In future work, a model incorporating emissions from urban and built-up areas should be implemented to produce spatially continuous NEE data sets. Furthermore, remotely sensed vegetation products with a higher spatial resolution are likely to improve model accuracy as they would allow to distinguish small-scale vegetation structures.