To evaluate a model's ability to simulate radiative fluxes, the first aspect considered is whether there is closure in the radiation budget. Closure is assessed by comparing the net all wave radiation (*Q**_{calc}) calculated from the two forcing variables provided (K_{↓}, L_{↓}) and the two modelled variables (K_{↑}, L_{↑}) with the returned modelled *Q**_{mod}. No difference results in a coefficient of determination (*r*^{2}) of 1. At Stage 1, 15 of 32 models have no difference. In Stages 2/3/4, the number of models with *r*^{2} = 1 is 13/16/13, respectively, but the total number of models with *r*^{2} = 1 at any stage is 18. Through all four stages, only ten models maintained no difference between *Q**_{calc} and *Q**_{mod}. If time periods with a difference of less than 1 W m^{−2} are considered (which includes one model with an *r*^{2} of 0.999999), then Stages 1/2/3/4 have 16/14/16/13 models, respectively. These models are treated in the later analyses as being ‘closed’. Beyond these, the *r*^{2} values for Stage 1 range from 0.999991 down to 0.0989; seven are above 0.998, two more above 0.990, four more above 0.980, and two more above 0.870. The general groupings remain the same through the stages but the *r*^{2} values do vary, except for the poorest models in Stages 1 and 2, which jump to greater than 0.998 at Stage 3.
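
The closure check described above can be sketched in a few lines. The following is an illustrative reconstruction, not the comparison's actual processing code; the flux values are invented, and it assumes the four component fluxes and the model-returned *Q** are available as aligned 30-min series:

```python
import numpy as np

def closure_r2(k_down, k_up, l_down, l_up, q_star_mod):
    """r^2 between Q*_calc = K_down - K_up + L_down - L_up and the model's own Q*."""
    q_star_calc = k_down - k_up + l_down - l_up
    r = np.corrcoef(q_star_calc, q_star_mod)[0, 1]
    return r ** 2

# A model with perfect closure returns exactly the sum of its components:
k_down = np.array([300.0, 500.0, 100.0])
k_up = 0.15 * k_down                       # e.g. an albedo of 0.15
l_down = np.array([350.0, 360.0, 340.0])
l_up = np.array([420.0, 450.0, 400.0])
q_star_mod = k_down - k_up + l_down - l_up
print(round(closure_r2(k_down, k_up, l_down, l_up, q_star_mod), 6))  # → 1.0
```

Any internal term omitted from the returned *Q** (or forcing data replaced internally) pulls *r*^{2} below 1, which is how the non-closed models are detected.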

Each modelling group that had a case of non-closure was asked to determine the cause. The models without radiation balance closure problems are classified as P0 in the following analysis. Explanations for non-closure include (classified in the analysis): not using the forcing data provided (P1), fluxes calculated independently (P1), timing issues (P3), day length (P3), spatial resolution (P3), and unknown (P4). In the first case, there are two different explanations: instead of using the individual 30-min interval forcing K_{↓} data, the daily peak observed K_{↓} was used and the other time periods for the day were obtained by assuming clear-sky conditions, resulting in over-predicted K_{↓} and therefore *Q** (four cases, P1); and the observed L_{↓} data were not used but modelled (one case, P1). In the second case, fluxes are calculated independently: the ULSM calculates *Q** directly, but for the purpose of this comparison the radiative components have been calculated separately (three cases, P1), or there is an additional term in L_{↑} which is not incorporated into *Q** (one case, P4). The third case relates to timing: the lack of closure arises because the 30-min forcing data are interpolated to a shorter time step for the model calculations and then averaged back to the 30-min period for analysis (two cases, P3). This approach requires the forcing data to be interpolated, which for K_{↓} may be questionable. For L_{↑}, the approach depends on an emitted contribution from the surface temperature and a reflected part: L_{↑}(*t*) = (1 − ε)L_{↓}(*t* − δ*t*) + εσ*T*_{S}^{4}(*t*). The surface temperature *T*_{S} depends on the energy received and has inertia. Alternatively, non-closure occurs because K_{↓} is only calculated if the sun is above the horizon for the whole time interval (one case, P3), thereby shortening the effective day length.
The fifth case, spatial resolution (two cases, P3), is related to an underestimation of the total sky view factor (all model patches sum to less than 1.0) that arises when rasterizing the surface within the model. The affected models then absorb slightly too much or too little diffuse solar or longwave radiation. The final case is where the modelling groups have not been able to determine the problem leading to the imbalance (three cases, P4).
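
The timing-related expression quoted above can be illustrated directly. This is a minimal sketch of that one equation; the emissivity, downwelling flux, and surface temperature below are illustrative assumptions, not values from any participating model:

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant (W m^-2 K^-4)

def upwelling_longwave(l_down_prev, t_surface, emissivity=0.95):
    """L_up(t) = (1 - eps) * L_down(t - dt) + eps * sigma * T_s(t)**4."""
    reflected = (1.0 - emissivity) * l_down_prev   # reflected part of L_down
    emitted = emissivity * SIGMA * t_surface ** 4  # emitted part; T_s carries inertia
    return reflected + emitted

# e.g. L_down = 350 W m^-2 at the previous sub-step and T_s = 300 K:
print(round(upwelling_longwave(350.0, 300.0), 1))  # → 453.8 (W m^-2)
```

Because *T*_{S} responds to the energy received with inertia, averaging sub-step values of this expression back to 30 min need not reproduce a single 30-min calculation, which is the source of the small imbalance described.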

##### 3.1.1. Outgoing shortwave radiation

The performance of each model with respect to outgoing shortwave radiation (K_{↑}) is shown in Figure 2 based on RMSE; models that do not have closure are indicated. For this upwelling solar flux, only daytime fluxes are analysed, giving 4266 × 30-min periods for comparison. The mean observed flux is 54.2 W m^{−2}. The Stage 1 K_{↑} mean RMSE for all models (*N* = 32), for the *N* = 31 models that completed all stages (model 17 did not), for the not-closed models, and for the closed models is 28/17/42/15 W m^{−2}, respectively; the large difference is because of one model (17) which does not have closure. The mean RMSE for all 32 models by stage is generally larger than the median (Figure 2) because the mean is affected by two poorly performing models, one of which did not complete Stage 4.
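
The two headline statistics used throughout this section, the root-mean-square error and (below) the mean bias error, are computed from paired modelled and observed 30-min series. A sketch with invented values, not the Preston observations:

```python
import numpy as np

def rmse(mod, obs):
    """Root-mean-square error of modelled vs observed flux (W m^-2)."""
    return float(np.sqrt(np.mean((mod - obs) ** 2)))

def mbe(mod, obs):
    """Mean bias error; negative means the model under-predicts on average."""
    return float(np.mean(mod - obs))

obs = np.array([40.0, 60.0, 55.0, 62.0])
mod = np.array([42.0, 57.0, 58.0, 59.0])
print(round(rmse(mod, obs), 2), round(mbe(mod, obs), 2))  # → 2.78 -0.25
```

The cohort statistics quoted in the text (mean and median RMSE across models) are then simple aggregates of these per-model values.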

Considering all 32 models, as increasing information was provided (Stages 1–4) there was an improvement at each stage in the mean but not in the median RMSE. The median RMSE improves from Stage 1 to 2 and again between Stages 3 and 4 (Figure 2). Of the 16/32 models with an improved RMSE from Stage 1 to Stage 2, 7/16 improved from Stage 2 to 3, and 2/7 of those improved from Stage 3 to 4. Thus, only two models had a reduction in RMSE at every stage. At Stage 2, improvement is associated with the fraction of vegetation to built areas becoming known (Table II). This fraction allows a more realistic assignment of ‘urban’ and ‘vegetated’ albedos within the models. However, the RMSE for five models became poorer. Between Stages 2 and 3, 14 models reduced their RMSE (and 14 increased it); between Stages 3 and 4, 13 reduced it (and 4 increased it). At Stage 3, more detailed information was provided about the surface fractions and heights. For the urban fraction it was now possible to distinguish the road and roof fractions correctly, in addition to knowing the wall heights. Within the pervious fraction, grass could be distinguished from other vegetation. As expected, the largest overall improvement in K_{↑} based on the mean and median RMSE occurred at Stage 4, when the site observed albedo was provided (Figure 2).

The relative ordering of models in terms of performance remains similar across all stages for K_{↑}, with the same three models in the top three at every stage (Figure 2). Similarly, the poorest performing models, with slight reordering, remain the same for the four stages. But there are some notable changes for individual models between stages; e.g. model 22 does very well in Stages 1 and 2, performs much more poorly in Stage 3, but then returns to very good performance for K_{↑} in Stage 4. This demonstrates the importance not only of the model physics but also of the user's choice of parameter values, which can significantly influence the outcome. For Stages 1–3, there is a larger median systematic error (RMSE_{S}) than unsystematic error (RMSE_{U}), even when excluding model 17, but not for Stage 4 (Figure 2), suggesting that the additional surface information is important for improving the model performance. In Stage 4, once information about the albedo is available, 80% of the models have an RMSE_{U} that is greater than the RMSE_{S}. The shading of the bars distinguishes the models' complexity (C): simple (s, yellow, light grey), medium (m, blue, medium grey), and complex (c, crimson, dark grey) (see Section 2 for definitions). The three model types are distributed across the range of model performances, with all three occurring in the first and last five at Stage 1. By Stage 4, the Cc models are all in the middle group (except the one Cc model that dropped out). At Stage 4, the majority of the Cs models do well, but the poorest performing model also belongs to that group.
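
The split of RMSE into systematic and unsystematic components follows the usual least-squares decomposition, with the systematic part taken from a linear fit of the modelled to the observed values, so that RMSE^{2} = RMSE_{S}^{2} + RMSE_{U}^{2}. A sketch under that assumption (toy values, not any model's output):

```python
import numpy as np

def rmse_sys_unsys(mod, obs):
    """Systematic/unsystematic RMSE via an OLS fit mod_hat = a + b * obs."""
    b, a = np.polyfit(obs, mod, 1)   # polyfit returns slope first, then intercept
    mod_hat = a + b * obs
    rmse_s = float(np.sqrt(np.mean((mod_hat - obs) ** 2)))  # systematic part
    rmse_u = float(np.sqrt(np.mean((mod - mod_hat) ** 2)))  # unsystematic part
    return rmse_s, rmse_u

obs = np.array([30.0, 45.0, 60.0, 75.0, 90.0])
mod = np.array([36.0, 49.0, 66.0, 77.0, 97.0])
rmse_s, rmse_u = rmse_sys_unsys(mod, obs)
total = float(np.sqrt(np.mean((mod - obs) ** 2)))
print(np.isclose(total ** 2, rmse_s ** 2 + rmse_u ** 2))  # → True (sums in quadrature)
```

A large RMSE_{S} indicates error that a linear recalibration (e.g. better parameter values) could remove, which is why the shift to RMSE_{U} dominance at Stage 4 is read as the surface information doing its job.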

The effective albedo (α_{eff}) used in the models can be determined from K_{↑mod}/K_{↓obs}. Here this value is investigated at two times of the year (June 21 and December 21) at 13:00 h; these two dates have the maximum and minimum amounts of midday shadow. The range of values at Stage 1 is from 0.08 to 0.28 (except for two extreme outliers). The best performing model had an α_{eff} of 0.15 on both dates, the same as the observed value provided at Stage 4. For December 21, there were 4 (3) cases with α_{eff} < 0.1 (> 0.2); 3 (4) cases within 0.10–0.125 (0.175–0.20); and 16 cases with an α_{eff} within 0.125–0.175, of which 11 have the lowest RMSE for K_{↑}. For June 21, the distribution was similar. Slightly higher α_{eff} values (0.175–0.18) are associated with the next best cohort in terms of RMSE performance.
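
As a sketch of this diagnostic (the flux values below are illustrative, not the 13:00 h values from any model):

```python
def effective_albedo(k_up_mod, k_down_obs):
    """alpha_eff = modelled K_up divided by observed K_down at one time step."""
    return k_up_mod / k_down_obs

# e.g. a model returning 75 W m^-2 under 500 W m^-2 of observed K_down:
print(effective_albedo(75.0, 500.0))  # → 0.15, matching the observed site albedo
```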

The cohort mean MBE is strongly influenced by the poorest performing models (Figure 2). The models have both positive and negative biases across the range, which results in a small net negative bias (−4 W m^{−2} excluding model 17) for Stage 1. The median MBE improves markedly from Stage 1 to 2 but after that remains almost constant at −1 W m^{−2}. At Stage 4, the Cm models that perform least well all have a negative bias, whereas the poorly performing Cs models have both positive and negative MBE.

On a normalized Taylor (2001) plot, where ideal model performance is indicated by the open circle at 1.0, 1.0, 0.0 (Figure 3), the correlation coefficient (polar axis), normalized standard deviation (*y*-axis), and normalized RMSE (inner circles) are shown. Except for one model, the correlation coefficient is better than 0.8; for the majority of the models it is better than 0.9, and for many better than 0.99. One can track the impact of the additional information for individual models; e.g. model 44 (medium complexity, so blue, with a plus sign within a circle in Figure 3) had a correlation of ∼0.85 in Stage 1, which improved to ∼0.91 in Stage 2 and again to ∼0.95 in Stage 4; between Stages 2 and 3 there is only a very minor change in correlation. In addition, its normalized RMSE improves from greater than 0.5 to 0.4 to less than 0.4 (ideal is 0.0), and its normalized standard deviation from 0.62–0.73 to 0.74–0.88 (ideal is 1.0). For model 46 (same symbol but simple complexity), one can see that the model does not systematically improve.
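
The three quantities plotted on the normalized Taylor diagram can be computed from paired series as follows. This is an illustrative sketch (variable names and data are ours, not the study's code); the inner circles follow from the law-of-cosines relation between correlation, normalized standard deviation, and centred RMSE:

```python
import numpy as np

def taylor_stats(mod, obs):
    """Correlation, normalized standard deviation, and normalized centred RMSE."""
    r = float(np.corrcoef(mod, obs)[0, 1])
    sigma_hat = float(np.std(mod) / np.std(obs))
    # law-of-cosines relation used to draw the inner circles of the plot;
    # max() guards against tiny negative rounding error under the sqrt
    crmse_hat = float(np.sqrt(max(0.0, 1.0 + sigma_hat ** 2 - 2.0 * sigma_hat * r)))
    return r, sigma_hat, crmse_hat

obs = np.array([10.0, 30.0, 50.0, 30.0, 10.0])
print(taylor_stats(obs.copy(), obs))  # ideal point is (1.0, 1.0, 0.0)
```

Note that the centred RMSE ignores the mean bias, which is why a model can sit near the ideal point on the Taylor plot while still carrying a nonzero MBE.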

Ensemble modelling, where the mean result from a number of different models is reported, is now used quite extensively in the climate community (e.g. Gillett *et al.*, 2002). Here we consider the performance of four different ensembles: three based on complexity [simple with 14 models, medium with 11, and complex with 7 (or 6 when model 17 drops out in Stage 4)] and a fourth with all of the models included [32 (Stages 1–3) or 31 (Stage 4)]. In Figure 3, these are shown for each stage. For the simple models, the correlation remains approximately constant, but both the normalized standard deviation and normalized RMSE of the ensemble improve with stage. This is also the case for the medium and complex models. However, the ensemble performance of the complex models in Stages 1 and 2 is clearly strongly influenced by the outlier model (17), which is beyond the plot boundaries. At Stage 4 the ensemble performance is best when all (A) models are used, but this is only slightly better than the ensemble mean of the complex models; the simple models' ensemble mean is slightly better than that of the medium complexity models.
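
An ensemble mean of the kind evaluated here is simply the per-time-step average of the member models' flux series; offsetting errors among members tend to cancel. A sketch with toy series (not actual model output):

```python
import numpy as np

def ensemble_mean(member_series):
    """Average a list of per-model flux series, time step by time step."""
    return np.mean(np.asarray(member_series), axis=0)

# three 'models' with offsetting errors around a common signal:
members = [np.array([48.0, 60.0, 52.0]),
           np.array([52.0, 58.0, 50.0]),
           np.array([50.0, 62.0, 54.0])]
print(ensemble_mean(members))  # → [50. 60. 52.]
```

A single strong outlier (such as model 17 here) shifts the mean of a small ensemble substantially, which is why the complex-model ensemble is degraded in Stages 1 and 2.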

The characteristics used to classify the models (Figure 1) include some that are directly related to radiative modelling. When the model results are grouped by these characteristics (Figure 4), we can determine whether particular approaches result in better performance. In several classes, there is a clear separation in the mean performance associated with modelling K_{↑}. However, in many cases the change in the mean is caused by one model's performance, so the median is the more robust measure of central location. To maintain anonymity, each set of results plotted was required to have four or more models; some classes are therefore amalgamated. For each characteristic at each stage, a box plot of the RMSE gives the interquartile range (IQR); the individual models are plotted as dots, the median as a square, and the mean as a circle. Below each box appear the stage, the characteristic with its class, the number of models, the median, and the mean. For example, in Figure 4(a) ‘1-Vn/11/14/17’ indicates that for Stage 1, when the models are classified by their approach to vegetation (V), there were 11 models that did not include it (n), with a median RMSE of 14 W m^{−2} and a mean of 17 W m^{−2}.
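
The per-class summary quoted below each box (count, median, mean) amounts to a simple grouping of per-model RMSE values by class label. A sketch with invented values and labels, echoing the ‘1-Vn/11/14/17’ reading above:

```python
import numpy as np

def class_summary(rmse_by_model, classes, target):
    """Count, median, and mean RMSE for the models in one class."""
    vals = np.array([r for r, c in zip(rmse_by_model, classes) if c == target])
    return len(vals), float(np.median(vals)), float(np.mean(vals))

# invented per-model RMSE values and vegetation classes, for illustration only:
rmse = [12.0, 14.0, 16.0, 14.0, 30.0]
veg = ["Vn", "Vn", "Vn", "Vi", "Vn"]
print(class_summary(rmse, veg, "Vn"))  # (count, median, mean) for the Vn cohort
```

Note how the single large value (30.0) pulls the cohort mean well above its median, which is exactly why the text prefers the median as the measure of central location.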

The first characteristic considered is whether the model integrates vegetation with the urban tile (Vi), treats it separately (Vs), or does not include it at all (Vn). For the Vi models there is a clear improvement across all four stages (Figure 4(a)). By Stage 3, the Vi models have a median RMSE of < 4 W m^{−2}, the smallest value. From Stage 2, when more models included vegetation (Vs models increase in number at the expense of Vn), the model cohorts retain the same ordering of decreasing median RMSE (Vn, Vs, Vi), but both the Vn and Vs median performances deteriorate slightly in Stage 3. We can conclude that accounting for vegetation is important, which is consistent with the conclusions from Phase 1 (Grimmond *et al.*, 2010).

Urban morphology (L) is specified using seven different approaches, from a slab surface (L1) to single-layer models (L2: two components; L3: three facets) and multi-layer (L4–7) models. The multi-layer models treat different aspects of the surface in more detail (Figure 1), which leads to small numbers in each class; in this paper they have been grouped together and labelled L6. This group has by far the largest mean RMSE because of one outlier (Figure 4(b)). The median performance of the simplest slab models (L1) improves at each stage and has the lowest median RMSE at Stage 4. For the other classes there is no consistent trend between stages; for the L2 models, the Stage 3 and 4 results have a higher median, although a smaller range, maximum, and minimum, than the earlier stages. The L3 models have the second-best median RMSE at Stage 4. Note that for this characteristic there is no change in the number of models per cohort between stages.

The approach to surface geometry, specifically whether the model explicitly includes shaded surfaces (FO), shows distinct differences between groups (Figure 4(c)). The simplest case, where the surface has a bulk geometry (FO1), has the lowest RMSE at all stages. It has a median RMSE of 4 W m^{−2} for all stages; however, the IQR decreases, indicating increasingly similar results. The most complex approach, which has both shading and intersections (FOi), has a systematic decrease in median RMSE at each stage, but at Stage 4 it is 11 W m^{−2}. This is greater than for models that take shading into account but have no intersections (i.e. have infinitely long canyons) (FOo), which have a median RMSE at Stage 4 of 7 W m^{−2}. The FOi models are clearly benefitting from the additional information provided, such as the wall height and built fraction at Stage 3. Both the FOo models and those that have an infinitely long canyon but do not account for shaded areas (FOn) vary in behaviour between stages; neither shows a continuous or significant improvement. The latter have the largest median RMSE at Stage 4 (16 W m^{−2}). The choice of geometry strongly influences the complexity of the modelling, with the simplest FO1 requiring considerably fewer computing resources than the more complete FOi, which is theoretically much more realistic if within-canyon information is required. Note, however, that the ability to model in-canyon information is not evaluated here.

Not only may the surface morphology description differ, but the approach taken to model reflections (R) also varies, from single (R1) to multiple (Rm) to infinite reflections (Ri). The simplest (R1), unlike the other two approaches, shows a systematic improvement in the median RMSE with stage (Figure 4(d)). By Stage 4, its median RMSE of 6 W m^{−2} is the smallest of the three approaches. The Rm approach, although it has a large scatter, shows a net improvement by Stage 4 (median RMSE = 8 W m^{−2}). The Ri group (median RMSE = 17 W m^{−2}) actually deteriorates through the stages. Thus the simplest group is consistently the best performing and benefits from the additional information provided.

The albedo and emissivity (AE) classification distinguishes the amount of parameter information required by the models. The simplest case requires one bulk value (AE1) and so behaves similarly to FO1 and L1 (not shown). The significant improvement for these models at Stage 4 is a simple consequence of model formulation: prior to Stage 4 the albedo was assumed, but in Stage 4, for some models, K_{↑} is just the product of two given values, the site albedo and K_{↓}. Models can also require two values (per parameter), typically associated with two facets (AE2), or three or more values (AE3). The median RMSE is lowest for the AE1 group and largest for AE2 (median RMSE at Stage 4 of 4 and 20 W m^{−2}, respectively). The vast majority of the models (22) require at least three values (AE3), for which the median RMSE by Stage 4 is 9 W m^{−2}, a net improvement from Stage 1. However, this group, like the Rm group, continues to have a wide range of values for the individual models.

The models that do not have a problem with net radiation balance closure (P0) have the smallest median RMSE at each stage (Figure 4(e)). Their IQR does not have the smallest spread, but their minimum values are the lowest and, except for Stage 4, their 75th percentile is the lowest. The P3 (time and space resolution issues) and P4 (unknown) models improve systematically with stage. At Stage 4, the median RMSE is 6/20/8/5 W m^{−2} for the P0/P1/P3/P4 models, respectively. The P1 models, which have problems calculating a component of the radiative balance or did not use the forcing data for individual time intervals, perform poorly throughout.

For all three model complexities, there are steady improvements in performance as additional information is provided (Figure 4(f)). The simplest and most complex (Cs, Cc) models have a larger overall improvement than the Cm models with the additional surface information. The Cs models have a slightly better median (6 rather than 7 W m^{−2}), but the mean is better for the Cc models (8 W m^{−2}).

Overall, K_{↑} is modelled well, and the provision of additional information about the surface does result in better performance. The models that perform best for individual characteristics are the simplest, as they can be assigned one parameter that is close to the observed value. The inclusion of vegetation is important to performance. Based on overall complexity, the simplest and the most complex models have similar results. The models that have net radiation closure generally perform better. The poorest performing cohort overall at Stage 4 (P1) does not have radiative closure and either did not make use of the individual time interval data and/or calculated the fluxes independently.

##### 3.1.2. Outgoing longwave radiation

A combination of parameter information and flux calculations impacts surface temperatures and hence the outgoing longwave radiation flux (L_{↑}). Thus, the modelling of day- and night-time L_{↑} is more complex than modelling K_{↑} because of the relation between surface temperature, sensible heat, and storage heat fluxes, as well as L_{↑} itself. This means that, unlike the K_{↑} case, when additional information is provided more related parameters may be influenced.

For L_{↑}, the median RMSE for the 32 models for Stages 1/2/3/4 is 16/13/14/17 W m^{−2}, respectively (Figure 5). Overall, 18 models improved from Stage 1 to 2, 11 from Stage 2 to 3, and 8 from Stage 3 to 4. Of the 32 models, only two improved across all the stages, but eight improved in three consecutive stages. The largest improvement for an individual model was from Stage 2 to 3, with a greater than 20 W m^{−2} decrease in RMSE. From Stage 3 to 4, despite models now having the most information about the site (Table II), performance suffered its largest loss, with 23 models having an increase in RMSE. This relates to the trade-offs made in parameter values. The largest individual deterioration also occurred between Stages 3 and 4 (an increase of > 35 W m^{−2} in RMSE). One model deteriorated across all four stages.

The models that close the radiation balance generally perform better (e.g. smaller median RMSE), but that is not the case in Stage 1. At all stages, the models have a larger mean RMSE_{S} than RMSE_{U}, but by Stages 3 and 4 the median RMSE_{U} is slightly larger (Figure 5), suggesting that the model parameter information is appropriate for most of the models. In terms of the MBE, more models have a positive bias than a negative one, but the two models (one at Stage 4) that perform least well have a large negative bias. The median MBE remains at about 8 W m^{−2} across all four stages.

The overall range of RMSE is smaller for L_{↑} than K_{↑}, but the best performing model for L_{↑} has a larger RMSE than the best model for K_{↑}. Comparison of the normalized Taylor plots (Figures 3 and 6) shows that the correlation is generally poorer for L_{↑}. The mean L_{↑} flux is larger, but the diurnal range smaller, than for K_{↑}. As with K_{↑}, one model (although a different one) performs best across almost all stages (based on RMSE) and shows very little improvement as additional information is provided. This again is a simple model (Cs). The poorest performing model (excluding model 17) does improve slightly with additional site information but still has a larger RMSE_{S} than RMSE_{U}, suggesting that the model could be improved further. This differs from the next least well-performing model, which has a larger RMSE_{U} and a small positive MBE.

The three classes of complexity are scattered across the range of performance. However, again the best and poorest models are simple (Cs). In general, the simpler models are grouped in the middle or poorer end by Stage 4, whereas many of the Cm models are amongst the best. Unlike for K_{↑}, the ensemble mean performance of the models does not improve with stage (Figure 6). At Stage 4, all three measures have deteriorated for all four ensembles. Before Stage 4, one model clearly performs better than the ensemble (but this is not the model with the lowest RMSE). From the Taylor plot, the best performing ensemble is the medium complexity one, but the four ensembles are clustered (and have moved together as a cluster between stages).

No model class performs better than the others. In most cases, the model cohorts of all classes show their poorest performance in Stage 4. For example (Figure 7), at Stage 4 the IQR is greater than in Stage 3 for all the approaches to vegetation (V). The treatment of urban morphology (L) shows a drop in performance for each cohort in Stage 4, with the more complex models (L6) having the largest increase in median RMSE; there is very little change between stages in the other L classes. A similar result is obtained for the facet and orientation characteristic (FO), with no cohort improving across all four stages; one class (FOo) has a 6 W m^{−2} increase in median RMSE. For R and AE, similar results are obtained.

The models that have radiative closure (P0) have a median RMSE of 15 W m^{−2} at Stages 1 and 4. At Stage 4, the P0 cohort has the lowest median, but this is not the case at all stages. For those without closure, the Stage 4 median is larger in all cases than at Stage 1. For all P classes, the median RMSE was smallest at Stage 2.

The modelling of L_{↑} initially has the same median RMSE as K_{↑} but does not show the general improvement with additional information (or progressive stage). This is seen consistently across all the classes of model types. In most cases, the Stage 4 results are poorer and have a larger IQR. At Stage 4, the best performing modelling approaches (lowest median RMSE) have the Vi, L3, FO1, Ri, AE1, and Cc characteristics. As demonstrated previously (Grimmond *et al.*, 2010, Fig. 3), no single model has all of these characteristics.

The models generally perform better at night than over the 24 h period (mean observed flux: day = 410.14 W m^{−2}, night = 368.98 W m^{−2}). At night, the median RMSE for Stages 1/2/3/4 is 12/11/10/12 W m^{−2} and the median MBE is 8/7/2/−0.2 W m^{−2}. At Stage 4, the best performing models (median RMSE, W m^{−2}) have Vn (13)/L2 (10)/FOn (11)/Rm (11)/AE2 (10) characteristics. Notably, there is no difference between the Cs/Cm/Cc models; they all have a median RMSE of 12 W m^{−2}. The daytime, as expected, is poorer, with median RMSE for Stages 1/2/3/4 of 18/14/16/20 W m^{−2} and median MBE of 9/7/9/12 W m^{−2}. At Stage 4, the best performing models (median RMSE, W m^{−2}) have Vi (16)/L2 (17)/FOi (18)/Ri (15)/AE1 and AE3 (20)/Cc (15) characteristics. Thus, the characteristics that result in the lowest median RMSE change with time of day, so there is not a clear choice, although the differences in the errors are small.

The models that do not have radiative closure occur across the complete spectrum of model performance for all time periods. The daytime median RMSE for the P0 models improves from 18 W m^{−2} at Stage 1 to 16 W m^{−2} at Stage 4, but the Stage 2 result is the best for the P0/P3/P4 models. For the P1 models, the best performance is at Stage 3 (15 W m^{−2}), but at Stage 4 their median RMSE is the poorest (26 W m^{−2}). At night, the median RMSE for the P0 models is 11 W m^{−2} at all stages (but deteriorating). The best performance is at Stage 3/2/4 for the P1/P3/P4 models, respectively.

Overall, L_{↑} is not modelled as well as K_{↑}. The daytime, when the mean flux is larger, has the larger median RMSE. The models generally improved when information about the pervious/impervious fraction was provided, but generally did not improve when further details about heights and surface fractions were provided. Most models deteriorated when provided with details of the building materials, typically back to Stage 1 performance but in many cases even poorer. Given both the wide range of materials in urban areas and the wide range of property values for individual material types, deciding what the appropriate values should be is difficult. This suggests that, until there is a way to obtain realistic values for actual sites, the specification of materials may not be worth the effort required to obtain the information. Here we contacted a large number of people associated with building and planning design, plus materials suppliers (see ‘Acknowledgements’), to allow us to provide the data in Table II.

##### 3.1.3. Net all wave radiation

Figure 8 shows the ranked performance of the models based on the RMSE of net all wave radiation (*Q**), with lack of radiative closure indicated. It can be seen from Figures 2, 5, and 8 that models which do not have closure are distributed from the best performing to the poorest performing for all three radiative fluxes evaluated, but are mainly the poorest performing for *Q**. For Stage 1, the mean RMSE for all models is 29 W m^{−2} for *Q**, or 28 W m^{−2} when the model with poorest closure (*r*^{2} of 0.0989) is removed because it did not complete all four stages. However, this model is not the poorest performing for *Q**, although it is for K_{↑} and L_{↑} at Stage 1 (Figures 2 and 5). Models that have radiative closure generally perform better for *Q** over all stages than those that do not, on average having a mean RMSE 20 W m^{−2} smaller. However, closure of the radiation balance is not a good measure of the ability to calculate a particular flux. Comparing the performance of the components to the net all wave radiation shows a clear re-ranking between fluxes: notably, those that perform poorly for an individual component flux are not the poorest for *Q** (Figures 2, 5, and 8). This means that the application for which the model is being used is important; for example, when assessing a mitigation strategy (such as the effect of changing the albedo of materials on radiative fluxes and temperatures), a ULSM may model the most directly impacted flux well but not the other fluxes (or vice versa).
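
The re-ranking between fluxes can be made concrete: sorting the same set of models by RMSE separately for each flux generally yields different orderings. A toy illustration (the RMSE values are invented, not the study's numbers):

```python
import numpy as np

# toy per-model RMSE values (W m^-2) for three models, one array per flux
rmse_k_up = np.array([10.0, 25.0, 15.0])
rmse_l_up = np.array([18.0, 12.0, 20.0])
rmse_q_star = np.array([22.0, 16.0, 14.0])

for name, vals in [("K_up", rmse_k_up), ("L_up", rmse_l_up), ("Q*", rmse_q_star)]:
    order = np.argsort(vals)  # model indices, best (smallest RMSE) first
    print(name, order.tolist())
```

Here model 0 is best for K_{↑}, model 1 for L_{↑}, and model 2 for *Q**, mirroring the observation that skill in one flux does not guarantee skill in another.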

There were 14 models which showed a reduction in RMSE from Stage 1 to 2; of these five had a further improvement at Stage 3; and two of these improved again at Stage 4. However, in the opposite situation there are eight models whose RMSE increased from Stage 1 to 2; of which five had a further increase at Stage 3 and four had a further drop in performance at Stage 4.

Overall performance for *Q** does not vary much between stages, though, with the mean RMSE at Stage 4 being approximately 30 W m^{−2}, slightly larger than in the earlier stages. Also at Stage 4, models that do have closure of the radiation balance have a smaller mean and median RMSE (both 18 W m^{−2}, Figure 8). At Stage 4, however, these models have a slightly larger RMSE_{S} than RMSE_{U}, suggesting that an improvement could still be made in the physics or parameter specification; this is not the case for both K_{↑} and L_{↑}. The models generally have a negative MBE (Figure 8; Stage 4 median −6 W m^{−2}). The models with the largest absolute MBE include both positive and negative biases (Figure 8).

The best and poorest performing models at all stages are of medium complexity (Cm). At Stage 1, models from all three levels of complexity occur at both ends of the performance spectrum. By Stage 4, the more complex models have generally improved, with three of the six remaining Cc models (recall that model 17 no longer appears) among the best performing. The Cm models are grouped more toward the poorly performing end.

From the Taylor plot (Figure 9) it is clear that, except for three models, all do an excellent job of modelling *Q**; there is a very tight cluster around (but not on) the ideal point. This performance is clearly better than for the separate radiative fluxes. Although this is good, it does suggest that some compensation is occurring within the individual fluxes, which may not be physically correct. As noted previously, this result suggests that caution is needed when using the models to account for changing radiative characteristics. For the ensemble performance, the medium complexity models are poorer than the other three ensembles. The best are the simple and complex model ensembles, with slightly poorer performance from the ‘all’ ensemble.

The models that do not account for vegetation (Vn) show a steady decline in performance across all stages (Figure 10(a)). In contrast, there is no strong evidence of improvement for those that do include vegetation. The lowest median RMSE at Stage 4 (21 W m^{−2}) is for the Vi models but, as for L_{↑}, this is a deterioration from 13 W m^{−2} at Stage 3. The best performing morphology class at Stage 4 is the simplest (L1), but the best performance across all stages and classes is Stage 2 L3, with a median RMSE of 14 W m^{−2}. The same result is found when the models are sorted by their approach to facets and orientation: the simplest models (FO1) perform best at Stage 2, although FOo is only slightly larger at the same stage. This is repeated for the classifications based on reflections (R1) and on albedo and emissivity (AE1).

The models with radiative closure (P0) have their lowest median RMSE at Stage 2 (15 W m^{−2}) and their largest at Stage 4 (25 W m^{−2}). The smallest median RMSE for the P1 models is at Stage 3, but these models have the largest IQR in Stages 3 and 4 (Figure 10(e)). As for L_{↑} at Stage 4, the complex (Cc) models perform slightly better than the less complex models, even though they have deteriorated from better performance at earlier stages. The Cm models perform least well as a group, with a median RMSE that increases at each stage.

The models generally perform better at night than for the 24 h period or the daytime (mean observed flux: day = 216.83 W m^{−2}, night = −59.45 W m^{−2}). The night-time median RMSE for Stages 1–4 is 11/10/10/12 W m^{−2} and the median MBE is −7/−7/−2/1 W m^{−2}. At Stage 4, the best performing models (median RMSE, W m^{−2}) have Vs (11)/L1 and L2 (10)/FO1 (7)/R1 (7)/AE1 (7)/Cs (9) characteristics. The daytime median RMSE for Stages 1–4 is 27/24/28/29 W m^{−2} and the median MBE is −5/−5/−8/−12 W m^{−2}. At Stage 4, the best performing models (median RMSE, W m^{−2}) have Vi (28)/L1 (25)/FO1 (21)/R1 (25)/AE1 (21)/Cc (27) and Cs (28) characteristics. Compared to L_{↑}, there is much greater variability between classes; e.g. the Cm models have a daytime median RMSE of 50 W m^{−2} at Stage 4.

Models defined by simpler characteristics often perform best, driven by the treatment of solar radiation. However, accounting for vegetation is important in improving model performance. When overall complexity is considered, though, it is the more complex models that perform best and, as a cohort, make better use of the new site characteristics provided. The medium complexity models systematically drop in performance as more information is provided, although there is consistently a Cm-type model performing best throughout.