On the information content of surface meteorology for downward atmospheric long-wave radiation synthesis
 One of the key uncertainties in site-based evaluations of land surface, hydrological or ecological models stems from the lack of availability of downward long-wave radiation, even at observational stations where tens of other key variables are measured. State of the art techniques for its synthesis are typically functionally dependent on surface temperature, vapour pressure and some representation of cloudiness. Here we show that existing functional forms for downward long-wave synthesis underutilise information in these key predictor variables, and that in fact cloudiness variables may be redundant. By using an empirical model at a range of sites globally, we examine the contribution of each of the predictor variables and conclude that an extremely simple empirical model may provide more defensible prediction where no a priori knowledge of site behavior exists.
 Land surface model evaluation using data collected at flux tower sites provides a unique opportunity for diagnostic model evaluation. It is one of the few examples in climate science of a data source that provides both model drivers and several model evaluation variables at the same time step as the model's execution. One of the key data-based uncertainties, however, is that most flux tower sites do not have measurements of downward long-wave radiation (LWdown), and so at these sites a key component of the energy budget is missing.
 There have been many attempts to address this problem, although usually in a hydrological or ecosystem study context, rather than in relation to land surface modeling (LSM). They began with clear-sky approximations to LWdown [e.g., Ångström, 1918; Brunt, 1932; Swinbank, 1963; Idso and Jackson, 1969; Brutsaert, 1975], based on empirically derived functions of surface air temperature (T) and/or surface vapour pressure (e). Over time, the considerable effect cloudiness had on LWdown was recognized and this was dealt with by modifying atmospheric emissivity in these clear-sky formulations [e.g., Crawford and Duchon, 1999; Iziomon et al., 2003; Lhomme et al., 2007]. The cloudiness modifications use either a cloudiness index derived from the ratio of measured short-wave radiation (SWdown) to potential SWdown (a top-of-atmosphere estimate based on location, time of day and time of year) or cloud fraction, commonly estimated using this ratio. We will refer to this ratio (which is affected by cloud fraction, as well as, for example, haze) as SWrat below. A comprehensive review of these techniques is given by Flerchinger et al. [2009]. One difficulty of using this type of correction is that corrections based on SWdown are only available during daylight hours. At night, such a method must either revert to its clear-sky parent or attempt to approximate cloudiness using the cloudiness index from the daylight hours surrounding it.
 Flerchinger et al. [2009] looked for an ‘optimal’ cloudiness correction technique by testing 13 clear-sky approximations and coupling them in all possible combinations to 10 cloudiness corrections at 14 sites in the USA and China. While this and other work [e.g., Wang and Liang, 2009] focus on the benefits that different cloudiness corrections offer over clear-sky approximations, they assume that at least one of the clear-sky functional forms examined (mostly developed on limited data) appropriately represents clear-sky LWdown as a function of T and e. Here we will suggest that a different functional form of LWdown = f(T, e) is able to provide improved approximations of LWdown in both clear-sky and cloudy conditions, and that the benefit of the extra information provided by a cloudiness index is in fact minimal when this functional form is better chosen. We use an empirical model to construct this relationship and, unlike several of the examples previously mentioned, restrict our analysis entirely to out-of-sample results.
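 The ratio of measured to potential SWdown described above can be sketched as follows. This is only an illustration under stated assumptions: the solar constant value, Cooper's declination approximation, the eccentricity factor, the cap at 1 and the hypothetical `min_potential` cut-off are choices made for this example and are not taken from any of the cited studies.

```python
import math

SOLAR_CONSTANT = 1361.0  # W m-2; assumed value for this sketch


def potential_swdown(lat_deg, day_of_year, solar_hour):
    """Top-of-atmosphere SWdown on a horizontal surface (W m-2)."""
    lat = math.radians(lat_deg)
    # Cooper's approximation for the solar declination.
    decl = math.radians(23.45) * math.sin(2.0 * math.pi * (284 + day_of_year) / 365.0)
    # Hour angle: zero at local solar noon, 15 degrees per hour.
    hour_angle = math.radians(15.0 * (solar_hour - 12.0))
    cos_zenith = (math.sin(lat) * math.sin(decl)
                  + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    # Simple correction for the varying Earth-Sun distance.
    eccentricity = 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    # Negative values mean the sun is below the horizon.
    return max(0.0, SOLAR_CONSTANT * eccentricity * cos_zenith)


def sw_ratio(swdown_obs, lat_deg, day_of_year, solar_hour, min_potential=5.0):
    """SWrat = measured / potential SWdown; None when the sun is too low."""
    potential = potential_swdown(lat_deg, day_of_year, solar_hour)
    if potential < min_potential:
        return None  # no daylight, so no cloudiness information
    return min(1.0, swdown_obs / potential)
```

 Returning None at night makes explicit the limitation noted above: a correction based on SWdown carries no information outside daylight hours.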
 Different synthesis studies nominate different LWdown approximation techniques as having the best performance [Choi et al., 2008; Wang and Liang, 2009; Flerchinger et al., 2009] and this seems in large part due to the applicability of different techniques to different environments [e.g., Sridhar and Elliott, 2002; Duarte et al., 2006; Bilbao and de Miguel, 2007; Kjaersgaard et al., 2007; Choi et al., 2008]. To compare with our empirical approach, we choose clear-sky and cloudiness-adjusted synthesis approaches that are both commonly represented in published work and ‘optimal’ in a significant proportion of studies [Choi et al., 2008; Wang and Liang, 2009; Flerchinger et al., 2009]. They are shown in Table 1: the clear-sky approximations of Brunt, Swinbank, Brutsaert, and Dilley and O'Brien, as well as the cloudiness correction of Crawford and Duchon, applied to both the Brunt and Brutsaert clear-sky approximations. In applying the cloudiness correction, we used a threshold of 5 W m−2 to discriminate between day and night, and reverted to the parent clear-sky approximation during night-time time steps. We do not suggest our sample of techniques is either comprehensive or optimal in all applications; rather, it is chosen to be broadly representative of existing techniques. While we also accept that the performance of these techniques would be improved in sample by calibrating their parameters to particular data sets, we are primarily interested in synthesizing LWdown at ‘unseen’ sites, and so avoid the danger of over-fitting by using original published versions of parameter values.
 We use observed half-hourly meteorology from ten flux tower sites distributed across a range of environments globally (see Table 2). These were downloaded from the Protocol for the Analysis of the Land Surface (PALS, pals.unsw.edu.au) which in turn processed the data from the Fluxnet La Thuile ‘Free-fair-use’ data release. Details of additional processing, as well as information on the proportion of gap-filled meteorological data are available through the PALS website. Only consecutive whole years of data were used for each site.
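 As an illustration of how a clear-sky approximation and a cloudiness correction fit together, the sketch below pairs the commonly quoted form of the Brutsaert clear-sky emissivity with the commonly quoted form of the Crawford and Duchon correction. The coefficient 1.24, the 1/7 exponent, and taking cloud fraction as 1 − SWrat are assumptions of this example rather than values read from Table 1; only the 5 W m−2 day/night threshold and the night-time reversion to the clear-sky parent come from the text above.

```python
SIGMA = 5.670374e-8  # Stefan-Boltzmann constant, W m-2 K-4


def brutsaert_emissivity(t_kelvin, e_pascal):
    """Clear-sky emissivity, assumed form 1.24 * (e / T) ** (1 / 7), e in hPa."""
    e_hpa = e_pascal / 100.0
    return 1.24 * (e_hpa / t_kelvin) ** (1.0 / 7.0)


def lwdown_clear(t_kelvin, e_pascal):
    """Clear-sky LWdown (W m-2) from the assumed Brutsaert emissivity."""
    return brutsaert_emissivity(t_kelvin, e_pascal) * SIGMA * t_kelvin ** 4


def lwdown_cloud_corrected(t_kelvin, e_pascal, swdown, sw_ratio,
                           day_threshold=5.0):
    """Cloudiness-corrected LWdown (W m-2).

    Assumed correction: emissivity = clf + (1 - clf) * clear-sky emissivity,
    with cloud fraction clf = 1 - SWrat. Below the day/night threshold the
    clear-sky parent is returned unchanged.
    """
    if swdown < day_threshold or sw_ratio is None:
        return lwdown_clear(t_kelvin, e_pascal)
    cloud_fraction = min(1.0, max(0.0, 1.0 - sw_ratio))
    eps_clear = brutsaert_emissivity(t_kelvin, e_pascal)
    eps = cloud_fraction + (1.0 - cloud_fraction) * eps_clear
    return eps * SIGMA * t_kelvin ** 4
```

 With this form, a fully overcast time step (SWrat of 0) yields black-body emission at air temperature, and a cloud-free one (SWrat of 1) reproduces the clear-sky estimate.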
 We also tried to ensure that gap-filling of meteorological data did not affect our results. The La Thuile synthesis included a binary quality control variable associated with each meteorological variable, primarily reflecting which time steps were gap-filled. Before conducting any of the analysis or empirical model training described below, we filtered our data by removing any time step that had a quality control flag in any of SWdown, T, e, or LWdown. While our 10 sites collectively contained 806448 time steps of data, after this filtering process only 670280 remained.
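 The filtering step can be sketched as below. The record layout (a hypothetical `qc` dictionary per time step, with a truthy value meaning the variable is flagged) is an assumption made for illustration and is not the format of the La Thuile files.

```python
QC_VARIABLES = ("SWdown", "T", "e", "LWdown")


def is_clean(qc_flags, variables=QC_VARIABLES):
    """True when none of the listed variables is flagged at this time step."""
    return not any(qc_flags[name] for name in variables)


def filter_time_steps(time_steps):
    """Drop every time step with a quality control flag on any variable."""
    return [step for step in time_steps if is_clean(step["qc"])]


# Share of time steps removed, using the counts quoted in the text.
total_steps = 806448
kept_steps = 670280
removed_fraction = (total_steps - kept_steps) / total_steps
```

 With the counts given above, `removed_fraction` comes to roughly 17% of all time steps.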
 The empirical model we use involves clustering our independent variables (that is, for example, T and e if we are modeling LWdown = f(T, e)) using k-means clustering, a least-squares clustering approach. Each cluster contains a subset of the data, in our case time steps, grouped to have similar values of the independent variables. Each cluster is defined by minimizing the within-cluster sum of squares. Use of k-means clustering is already common in hydrology and other Earth sciences [e.g., Lagacherie et al., 1997; Gorsevski et al., 2003]. To complete the mapping, we train a multiple linear regression between the independent variables and LWdown separately for each cluster.
 To predict LWdown on an unseen data set, we use the cluster centres and regression parameters established above. A new prediction time step will first be allocated to a particular cluster, based on the least squared distance of its independent variable values to the existing cluster centres. The regression parameters associated with that cluster will then be used to predict LWdown using those independent variable values. For a given functional relationship, such as LWdown = f(T, e), this approach gives a piecewise linear approximation to the functional form in the data itself, with the number of ‘pieces’ effectively determined by the number of clusters. We could equally have chosen other similarly robust empirical approaches such as those used by Hsu et al. or Jung et al.
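 A minimal sketch of this cluster-then-regress scheme, using NumPy, is given below. It is not the implementation used in this study: standardizing the predictors before clustering, the random initialization, the fixed number of iterations and the fall-back to a global regression for very small clusters are all assumptions of the example, and the function names are hypothetical.

```python
import numpy as np


def _assign(Z, centres):
    """Index of the nearest centre (least squared distance) for each row."""
    dist2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)


def fit_cluster_regression(X, y, n_clusters, n_iter=50, seed=0):
    """k-means on standardized predictors, then one linear fit per cluster."""
    rng = np.random.default_rng(seed)
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std  # assumed scaling so T and e weigh equally
    centres = Z[rng.choice(len(Z), size=n_clusters, replace=False)]
    for _ in range(n_iter):  # plain Lloyd iterations
        labels = _assign(Z, centres)
        for k in range(n_clusters):
            members = Z[labels == k]
            if len(members):
                centres[k] = members.mean(axis=0)
    labels = _assign(Z, centres)
    A = np.column_stack([X, np.ones(len(X))])  # predictors plus intercept
    global_coef = np.linalg.lstsq(A, y, rcond=None)[0]
    coefs = np.tile(global_coef, (n_clusters, 1))  # fall-back for tiny clusters
    for k in range(n_clusters):
        idx = labels == k
        if idx.sum() > X.shape[1] + 1:
            coefs[k] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    return {"mean": mean, "std": std, "centres": centres, "coefs": coefs}


def predict_cluster_regression(model, X):
    """Allocate each row to its nearest centre, then apply that cluster's fit."""
    Z = (X - model["mean"]) / model["std"]
    labels = _assign(Z, model["centres"])
    A = np.column_stack([X, np.ones(len(X))])
    return (A * model["coefs"][labels]).sum(axis=1)
```

 With `n_clusters=1` this reduces to an ordinary multiple linear regression; larger values give the piecewise linear approximation described above.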
 We test several configurations of this empirical approach. For all of the functional dependencies described below, we examine the performance of the empirical model using 1, 3, 10 and 20 clusters, showing the effect of an increasingly complex (albeit piecewise linear) empirical model. We also try to quantify how much information T, e and SWrat contribute to LWdown predictability. To do this, we train separate implementations of the empirical model examining LWdown as f(T), f(e), f(T, e) and f(T, e, SWrat). Each of these is compared with the existing approaches in Table 1.
 Finally, all results presented here represent out-of-sample tests. We construct these tests using a bootstrap approach: for each site, all the other 9 sites are used to calibrate empirical model parameters, and the single remaining site is used for testing. This process is then repeated for all sites, so that where results for all sites are shown, all are still entirely out-of-sample. In this sense we are also testing the portability of the empirical approach. Good performance therefore, however we choose to define it, becomes even better as the sample size of sites included in the study decreases. We chose 10 sites as a compromise between representativeness and the potential to demonstrate strong out-of-sample performance.
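 The site-wise splitting described above, often called leave-one-site-out cross-validation, can be sketched as follows. The site labels are hypothetical placeholders, not the sites of Table 2.

```python
def leave_one_site_out(site_names):
    """Yield (training_sites, test_site) pairs, holding out one site at a time."""
    for held_out in site_names:
        training = [name for name in site_names if name != held_out]
        yield training, held_out


# Hypothetical labels standing in for the ten flux tower sites.
sites = ["site_%02d" % i for i in range(1, 11)]
splits = list(leave_one_site_out(sites))
```

 Each of the ten splits trains on nine sites and tests on the tenth, so no test site ever contributes to the model that predicts it.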
3. Results and Discussion
 Table 3 shows the mean absolute error (MAE) of the 6 existing techniques and two configurations of the empirical model, with results shown separately for each site. An all-site result is shown in bold on the last line of Table 3, with the two previous lines indicating day- and night-time only results. The two empirical model configurations shown here were chosen as our preferred models for any larger scale deployment of this technique; we will further outline our reasons for this below. Recall that all results are out of sample (testing site not used in training data).
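Table 3. The Mean Absolute Error Across All 10 Sites of the 6 Existing Approximations as Well as Two Empirical Model Representations of LWdown as a Function of Surface Temperature and Vapour Pressure
| Site | Brunt | Swinbank | Brutsaert | Dilley and O'Brien | Brunt + Crawford and Duchon | Brutsaert + Crawford and Duchon | Empirical f(T, e), 1 cluster | Empirical f(T, e), 20 clusters |
|---|---|---|---|---|---|---|---|---|
| Maun-Mopane | 38.17 | 26.83 | 37.56 | 45.87 | 30.05 | 29.71 | 26.58 | 31.02 |
| All sites day | 41.19 | 43.40 | 40.38 | 41.85 | 32.49 | 39.11 | 33.51 | 32.07 |
| All sites night | 42.33 | 44.37 | 41.34 | 41.63 | 42.33 | 41.34 | 30.49 | 28.22 |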
 The most striking feature of Table 3 is the superior performance of the two empirical model approaches, neither of which utilize any form of cloudiness information. Even out of sample, they improve LWdown prediction performance over the cloudiness-correction approaches by as much as 40% of MAE (at Amplero or Cabauw, using either parent clear-sky function), and average a 19–24% improvement across all sites, depending on which approach is used. Note in particular the strength of the empirical models with night-time data, when the cloudiness correction techniques were not applied. There are some exceptions, most notably the Audubon and Skukuza sites. Also note that the Swinbank clear-sky approach appears particularly strong for the two African sites. A surprising result is that there are several examples of the cloudiness correction technique significantly degrading the clear-sky approximation (Audubon, Goodwin Creek, Howard Springs).
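 The percentage figures above are relative reductions in MAE. A minimal sketch of both calculations follows; the function names are hypothetical, and the two numbers in the example are simply two night-time MAE values taken from Table 3 for illustration.

```python
def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over paired time steps."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def relative_improvement(mae_reference, mae_candidate):
    """Percentage reduction in MAE of the candidate relative to the reference."""
    return 100.0 * (mae_reference - mae_candidate) / mae_reference


# Two night-time MAE values from Table 3, used only as an illustration.
night_improvement = relative_improvement(41.34, 28.22)
```

 Here `night_improvement` is a reduction of roughly 32% in MAE.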
 A potentially confusing result is the occasional ability of the 1-cluster regression (that is, just a multiple linear regression) to outperform the 20-cluster model (at Howard Springs, Maun-Mopane and Skukuza). These sites apparently have non-linear behaviour that is different to the others – the linear model performed reasonably well, but when 20 clusters were used, the nature of the non-linear relationship learned at the 9 training sites was not appropriate for the testing site. We note that these three are the hottest sites in our sample and would likely see the largest range of vapour pressure as well as temperature extremes. This apparent inversion of behaviour across the empirical model spectrum may not have appeared if we had a larger sample of sites (although this would have impaired our ability to demonstrate out-of-sample performance).
 We chose to examine these two examples of our empirical approach because of their relative simplicity. Table 4 shows MAE values across all sites for a larger range of empirical model configurations. The aim in this instance was to explore (a) how much information each independent variable provided about LWdown (different functional forms in the columns of Table 4), and (b) how this was affected by the relative complexity of the empirical model used.
Table 4. The Mean Absolute Error of Empirical Models of Different Complexities and Representing Different Dependencies
 The biggest surprise here is that e, T and SWrat do not appear to provide any more information about LWdown than e and T alone. Across the different model resolutions there is no detectable performance difference between these two functional forms. Perhaps equally surprising is that modeling LWdown as a function of either e or T alone provides performance levels at least as good as the best cloudiness-corrected approach we investigated (indeed slightly better).
 These somewhat surprising results suggest that existing functional forms, while no doubt effective at particular sites, fail to generalize the nature of LWdown synthesis globally. That is, they do not appropriately utilize the information in e and T about LWdown. Equally importantly, it appears that there may be no utility in including a cloudiness correction. In fact, given the nature of Table 4, we are drawn to suggesting that the most practical and transparent approach is to use a linear regression against e and T (i.e., a 1-cluster approach). If we recalculate regression coefficients across all 10 of our sites (recall that results to date are calculated for nine sites and tested at a tenth), we find that
LWdown = 0.0311 e + 2.84 T − 522.5
for e in Pascal and T in Kelvin offers the best fit. This will no doubt be a controversial suggestion for LWdown estimation, but one that appears well supported by the data we used. The 95% confidence interval for the e coefficient is 0.0311 ± 0.0002, 2.84 ± 0.01 for the T coefficient and −522.5 ± 2.9 for the intercept. Based on the smaller sample size of the training sets for individual sites, the mean and standard deviation of these three coefficients across all sites were: 0.031 and 0.005; 2.83 and 0.29; and −520 and 75, respectively.
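 For readers who want to apply the all-site regression directly, it can be written as a one-line function using the central coefficient estimates quoted above. The function name and the example inputs are hypothetical, and the output is taken to be in W m−2, which is an assumption since the units of LWdown are not restated here.

```python
def lwdown_linear(t_kelvin, e_pascal):
    """All-site linear regression for LWdown (assumed W m-2).

    t_kelvin: surface air temperature in Kelvin.
    e_pascal: surface vapour pressure in Pascal.
    """
    return 0.0311 * e_pascal + 2.84 * t_kelvin - 522.5


# Hypothetical example: 15 degrees C (288.15 K) and 1000 Pa vapour pressure.
example = lwdown_linear(288.15, 1000.0)
```

 For these inputs the regression gives approximately 326.9.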
 Finally, while the MAE values in Tables 3 and 4 appear relatively large, we note that a considerable proportion of the studies discussed above test LWdown synthesis techniques in-sample – using the same data to calibrate and evaluate. Given that existing techniques are already widely used to create forcing data sets for LSMs, the benefits of the reduction in forcing uncertainty that our simplified approach provides should be clear.
 By using an empirical model we showed that at a broad range of sites globally, the functional form of existing downward long-wave radiation synthesis algorithms appears to be inappropriate. The empirical model, even entirely out of sample, was able to utilize the information available in time series of temperature and vapour pressure to make predictions that clearly outperformed existing approaches. When such a model was given cloudiness information in addition to these two variables it produced no marked improvement in performance, suggesting that existing cloudiness-correction approaches may in fact be compensating for a poor functional form in their clear-sky parent function. While we cannot claim from these results that the approach is superior in every instance, it does offer a marked improvement in global downward long-wave radiation predictability overall.
 This work used eddy covariance data acquired by the FLUXNET community and in particular by the following networks: AmeriFlux (U.S. Department of Energy, Biological and Environmental Research, Terrestrial Carbon Program (DE-FG02-04ER63917 and DE-FG02-04ER63911)), AfriFlux, AsiaFlux, CarboAfrica, CarboEuropeIP, CarboItaly, CarboMont, ChinaFlux, Fluxnet-Canada (supported by CFCAS, NSERC, BIOCAP, Environment Canada, and NRCan), GreenGrass, KoFlux, LBA, NECC, OzFlux, TCOS-Siberia, USCCC. We acknowledge the financial support to the eddy covariance data harmonization provided by CarboEuropeIP, FAO-GTOS-TCO, iLEAPS, Max Planck Institute for Biogeochemistry, National Science Foundation, University of Tuscia, Université Laval and Environment Canada and US Department of Energy and the database development and technical support from Berkeley Water Center, Lawrence Berkeley National Laboratory, Microsoft Research eScience, Oak Ridge National Laboratory, University of California-Berkeley, University of Virginia.
 The Editor thanks the two anonymous reviewers for their assistance in evaluating this paper.