A critical assessment of surface cloud observations and their use for verifying cloud forecasts



Total cloud amount and cloud-base height are two quantities diagnosed from the vertical distribution of cloud in a model grid column. Together they form the basis of many cloud-based forecast products. Forecasts from four Met Office Unified Model (MetUM) horizontal resolution configurations are compared against manual and automated conventional synoptic (SYNOP) observations. The analysis shows that observation-type-dependent characteristics feed through to model forecast biases and skill scores, where manual and automated cloud observations produce biases of opposite kind. The mixing of observation types is therefore not recommended, as the ability to interpret results is compromised. This is especially relevant when tuning model physics. The effect of horizontal grid resolution on both bias and skill is mixed. Copyright © 2012 British Crown copyright, the Met Office. Published by John Wiley & Sons Ltd.

1. Introduction

It seems relatively little has been written in the literature about routine short-range model cloud forecast verification using conventional (synoptic) surface observations (SYNOP). Perhaps this is in recognition of the inadequacies of existing conventional datasets, as will be described later. On the other hand, the assessment of cloud has enjoyed greater prominence in the climate community as cloud feedbacks in climate models are of critical importance for the radiation balance, and future climate simulations (e.g. Ringer et al., 2006).

Cloud forecast assessments using less conventional datasets and approaches are also more abundant in the literature. Instead of assessing forecast aspects associated with end users (such as total cloud amount, TCA, or cloud-base height, CBH), brightness temperature, optical depth, liquid and ice water content, or liquid water path are assessed. For example, Jakob et al. (2004) investigated the use of vertically pointing cloud radar data to derive a total hydrometeor amount, and considered a probabilistic approach for addressing representativeness differences between model forecasts and these derived quantities. The Atmospheric Radiation Measurement (ARM) Program brought together a range of active sensors to consider cloud properties (e.g. Clothiaux et al., 2000). A more recent example of combined use of cloud radar, lidar and radiometer data for cloud assessment is documented by Illingworth et al. (2007), describing the CloudNet project. The dataset has since been widely used by e.g. Hogan et al. (2009), Bouniol et al. (2010), and Morcrette et al. (2012) to compare several numerical weather prediction (NWP) models and parametrisation schemes. These studies represent a site-specific time series assessment which provides detailed analyses of the vertical structure and distribution of cloud-related model parameters.

On the other hand, Williams and Brooks (2008) took a global view, using the ISCCP (International Satellite Cloud Climatology Project) dataset of cloud optical depth and cloud-top pressure to look at initial tendencies in cloud amounts to understand process errors. Satellite data in general are featuring more prominently because of the benefit of global coverage. Garand and Nadon (1998) used the model-to-satellite approach to compute radiances in order to compare forecasts of cloud fraction, height and outgoing radiation to near-coincident Advanced Very High Resolution Radiometer (AVHRR) imagery. More recently, Palm et al. (2005) described a study using satellite lidar from the Geoscience Laser Altimeter System (GLAS) to assess forecast cloud fraction, vertical distribution and boundary-layer height. Böhme et al. (2011) produced a two-year continuous evaluation of integrated water vapour, brightness temperature, CBH and precipitation. The advent of active remote-sensing instruments in space such as the CloudSat cloud radar has led to studies such as that by Bodas-Salcedo et al. (2008), describing a comparison of simulated radar reflectivities against CloudSat radar reflectivities. Most of these assessments are aimed at the model development community, with the characteristics of the datasets generally well understood. Often the availability and coverage of such high-quality datasets is inadequate for routine monitoring of forecast performance, where surface SYNOP observations still have a part to play. It is less clear whether the characteristics of conventional observations are as well understood, especially for verification.

In this article the focus is on understanding the impact of surface observation type on the assessment of model cloud forecast biases and the interpretation of scores that measure forecast skill. In particular, based on the results presented, what conclusions can be drawn with respect to:

  • the suitability of automated and/or manual SYNOP observations for assessing cloud forecasts;

  • the suitability of conventional metrics for assessing cloud forecasts; and

  • the impact of horizontal model grid resolution on bias and skill.

Section 2 describes the Met Office Unified Model (MetUM) configurations while section 3 is devoted to explaining the characteristics of standard synoptic observations. Section 4 considers the differences in automated and manual cloud parameter distributions. Section 5 introduces the metrics used for this analysis and section 6 then describes model bias differences that result when using manual and automated synoptic observations. The impact these differences have on forecast skill is then explored in section 7. A summary and conclusions are presented in section 8.

2. Model configurations

The MetUM provides a seamless nested forecasting system that can be run at multiple resolutions and time-scales, from the km-scale to the climate scale. Davies et al. (2005), Lean et al. (2008) and Hewitt et al. (2011) provide further details. In this article, cloud forecasts from the Global (GM, 25 km), North Atlantic European (NAE, 12 km), UK4 (4 km) and UKV (1.5 km) configurations are compared over the UK. All these configurations run with 70 vertical levels, although the UKV and UK4 have a different level set with more levels in the boundary layer. Other key parametrisation differences are noted in Table 1.

Table 1. Comparison of parametrisation schemes used for representing key physical processes across MetUM configurations.

All models run four times a day although the run lengths vary from 36 h (UKV and UK4) to 48 h (NAE) and 60 h or 120 h for the GM. While the GM and NAE run at 0000, 0600, 1200 and 1800 UTC, the UK4 and UKV run with a three-hour offset, i.e. 0300, 0900, 1500 and 2100 UTC. The offset in lead times is indicated in the text as follows: t+24(21) h indicates a 24 h lead time for the GM or NAE with 21 h (in brackets) indicating the lead time for the UK4 and UKV forecasts. The results sections are based on this single forecast lead time, unless stated otherwise.
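The t+24(21) h notation can be illustrated with a short sketch that maps a verification time back to the initialisation time of each configuration. The names `LEAD_HOURS`, `RUN_HOURS` and `initialisation_time` are hypothetical and are introduced only for this illustration; the lead times and run cycles are those stated in the text.

```python
from datetime import datetime, timedelta

# Lead times for the t+24(21) h comparison, as described in the text.
LEAD_HOURS = {"GM": 24, "NAE": 24, "UK4": 21, "UKV": 21}

# Run cycles (UTC hours) of each configuration, as described in the text.
RUN_HOURS = {
    "GM": (0, 6, 12, 18),
    "NAE": (0, 6, 12, 18),
    "UK4": (3, 9, 15, 21),
    "UKV": (3, 9, 15, 21),
}


def initialisation_time(model, valid_time):
    """Return the run start time whose forecast verifies at valid_time."""
    init = valid_time - timedelta(hours=LEAD_HOURS[model])
    if init.hour not in RUN_HOURS[model]:
        raise ValueError(f"{model} has no run starting at {init:%H%M} UTC")
    return init


if __name__ == "__main__":
    valid = datetime(2010, 6, 2, 12)  # an arbitrary 1200 UTC verification time
    for model in ("GM", "NAE", "UK4", "UKV"):
        print(model, initialisation_time(model, valid))
```

For a 1200 UTC verification time the GM and NAE forecasts come from the 1200 UTC run on the previous day, while the UK4 and UKV forecasts come from the 1500 UTC run, which is the three-hour offset the bracketed lead time encodes.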

The station-based verification database is populated with nearest model grid point values for all WMO Block 03 observing sites, of which there are ∼160. Forecast extraction is dependent on the availability of a valid observation at the time. For this study t+24(21) h forecasts from the GM, NAE, UK4 and UKV configurations have been verified against hourly manual and automated SYNOP TCA and CBH for the calendar year 2010. In particular forecasts were assessed at 0000, 0600, 1200, 1800 UTC to consider the evolution as a function of time of day.

3. Overview of synoptic observations

The World Meteorological Organization guide (WMO 2008) Chapter 1 states that ‘The representativeness of an observation is the degree to which it accurately describes the value of the variable needed for a specific purpose’. It goes on to say ‘synoptic observations should typically be representative of an area up to 100 km around the station, but for small-scale or local applications the considered area may have dimensions of 10 km or less’. This definition is observation-based and not model-based, since km-scale modelling reveals considerable variability at scales less than 10 km. While the output from km-scale models may not be accurate at the grid-scale (e.g. Lorenz, 1969) it does highlight the fact that meteorological parameters are not constant for an area of 100 km; therefore neither can observations be representative for such an area.

Manual observations of TCA are described as ‘the fraction of the celestial dome covered by all clouds visible’. It depends on the visible horizon. For manual CBH observations observers can be guided by the availability of vertically pointing low-power lidar (often referred to as a low-cloud-base recorder LCBR or ceilometer) data, balloon ascents, cloud searchlights (at night) and aircraft reports. Manual observations are usually taken up to 10 min before the hour, and not on the hour. WMO (2008) Chapter 15 suggests that most observations of TCA are still made manually and only refers to instrumental methods being under development. They may be used operationally in some applications, most notably in the aviation sector. The UK has a mixture of automated and manual cloud observations with the number of manual observations dwindling rapidly in recent years. Figure 1 shows the total number of manual and automated observations from the UK observing network during 2010 taken at 0000, 0600, 1200 and 1800 UTC. The total number of manual observations is between 20 and 25% of the automated observations available. Therefore in the UK at least, the status quo described in the WMO documentation has not been the case for many years.

Figure 1. Total number of manual and automated cloud observations (at 0000, 0600, 1200 and 1800 UTC) for 2010.

The WMO guide goes on to describe specific guidelines for operational measurement uncertainty requirements and instrument performance. For automated TCA, the achievable measurement uncertainty is 2 okta, which represents the difference between ‘instantaneous’ values. These ‘instantaneous’ values are in themselves aggregates over time as instrument sampling is typically of the order of seconds. For CBH, the guidelines suggest a nominal instrument range of 0–30 km (which in reality is more like 10 km maximum) at 10 m resolution with a required measurement uncertainty no more than 10 m for CBH less than 100 m and 10% for CBH greater than 100 m. The achievable measurement uncertainty though is stated as being ‘undetermined because no clear definition exists for instrumentally measured CBH (e.g. based on penetration depth or significant discontinuity in the extinction profile)’. The guide also points out that there is significant attenuation during precipitation which will introduce a bias.

Jones et al. (1988) reported on an international ceilometer intercomparison. A total of eleven ceilometers from five WMO member states, consisting of seven different instrument types (manufacturers and/or models) were deployed and run continuously for six months at an experimental site in the UK. Other meteorological variables were monitored concurrently to be able to assess performance as a function of weather type. The instruments were compared according to their ability to detect cloud and the heights reported. Overall there was fairly good agreement between the instruments. Laser ceilometers were found to be reliable instruments with great measurement consistency although all types suffered from deficiencies during certain weather conditions, most notably precipitation.

Automated TCA is derived from estimates of cloud amount in view of the observation point in each identified cloud layer. As such it is a time average of cloud passing directly over a ceilometer. Ceilometers also provide cloud cover as a continuous fraction between 0 and 1 (just like NWP models), but the process of encoding into a SYNOP message involves converting to okta, thus reducing the resolution of the observation. Farmer (1992) compared two algorithms (single linkage clustering and exponential decay) for deriving hourly automated cloud observations at a single site. Derived values were compared to manual observations to assess their accuracy. Based on the results, the exponential decay algorithm was preferred, and has been used ever since for hourly automated cloud derivation across the UK observing network. The key characteristics are summarised in Table 2. The algorithm, and the SYNOP encoding, is known to introduce a low CBH bias, mainly for the aviation sector.

Table 2. Characteristics of the exponential decay algorithm used for deriving hourly CBH and TCA values for the UK (from Farmer, 1992).
| Key feature | Additional comments |
|---|---|
| (1) Uniform spatial distribution of cloud elements is assumed. | |
| (2) Correction applied to obscuring effects of lower cloud layers. | |
| (3) Exponential weighting with τ = 20 min. | 63% weight given to last 20 min, 86% to last 40 min and 95% to last 60 min. |
| (4) Trace of 0.5 okta is reported as 1 okta. | Based on the proportion of time (with exponential weighting) equal to 0.063 |
| (5) CBH refers to lowest cloud layer with at least 0.5 okta. | |
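The weights quoted in Table 2 follow directly from an exponential decay with τ = 20 min, since the fraction of the total weight carried by the most recent t minutes is 1 − exp(−t/τ). The sketch below reproduces these percentages and shows one possible way of forming a weighted cloud fraction from per-minute ceilometer detections. It is a minimal illustration only: the function names, the one-minute sampling and the rounding used in `fraction_to_okta` are assumptions, not the operational algorithm of Farmer (1992).

```python
import math

TAU_MIN = 20.0  # e-folding time in minutes (Table 2)


def cumulative_weight(minutes_back, tau=TAU_MIN):
    """Fraction of the total exponential weight in the last minutes_back minutes."""
    return 1.0 - math.exp(-minutes_back / tau)


def weighted_cloud_fraction(hits, tau=TAU_MIN):
    """Exponentially weighted cloud fraction.

    hits is ordered oldest to newest, one entry per minute (an assumed
    sampling interval): 1 if cloud was detected overhead, otherwise 0.
    """
    if not hits:
        raise ValueError("at least one sample is required")
    n = len(hits)
    # The newest sample has age 0 and therefore weight 1.
    weights = [math.exp(-(n - 1 - i) / tau) for i in range(n)]
    return sum(w * h for w, h in zip(weights, hits)) / sum(weights)


def fraction_to_okta(fraction):
    """Encode a 0-1 cloud fraction as whole okta (assumed rounding rule).

    A weighted fraction of at least 0.5 okta (0.0625) is reported as at
    least 1 okta, following feature (4) of Table 2.
    """
    if fraction < 0.5 / 8.0:
        return 0
    return max(1, min(8, round(fraction * 8)))


if __name__ == "__main__":
    for minutes in (20, 40, 60):
        print(minutes, round(100 * cumulative_weight(minutes)))
```

The encoding step makes the loss of resolution explicit: a continuous fraction is reduced to one of nine okta values before it reaches the SYNOP message.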

Typically, observations are treated in an absolute sense, and when verifying, all errors are attributed to the forecast. The uncertainties in the observations are often disregarded. Bowler (2006) and Mittermaier (2008) began to explore how observation uncertainty can be incorporated into verification metrics, and the impact these uncertainties may have. It is important to recognise that observations of the same atmospheric quantity may vary across the observing network, in terms of type (automated or manual), as well as instrument type (i.e. different manufacturers or older/newer models), which may affect sensitivity and the ability to calibrate; additionally each site will be at a different stage of the site maintenance cycle.

At this stage it is perhaps useful to compare and contrast the main properties of the automated and manual observations. These are summarised in Table 3. While manual observations are essentially made ‘instantaneously’ (but considering a considerable 3D volume that is determined by the distance to the visible horizon), automated observations are a retrospective time average, consisting of cloud that has already passed downwind of the site. Upwind cloud is not included. Therefore automated observations have a variable spatial dimension through advection, which cannot be called hemispheric. It is clear that automated and manual observations are very different by definition. Automated observations could be considered superior except for the sensitivity issue which may introduce a significant bias, which for verification purposes means the model is not assessed consistently at each site. Manual observing practices may also vary. It is perhaps also questionable whether manual CBH measurements are truly manual, or whether automated values are entered as a manual value when a site is manned. Manual observing of CBH is very difficult and because all sites (in the UK) have automated cloud observations, observers do use them as a guide. It is not known how often they may deviate from the automated value since the standard encoded SYNOP message allows for only one value to be recorded. For this reason it is argued that it is difficult to assess the true accuracy of manual CBH observations.

Table 3. Subjective assessment of the relative strengths and weaknesses of automated and manual cloud observations.
| | Manual | Automated |
|---|---|---|
| Overall | Potential underestimation of 0 and 8 okta events | Underestimation of events due to lack of sensitivity (detects too little high cloud) |
| Consistency | Observer differences | Each instrument unique; inconsistency across observing network; calibration drift |
| Daytime | Good; CBH still difficult | Good, but may miss high cloud |
| Nighttime | Difficult | Good, but may miss high cloud |
| Horizon | Restricted hemispheric view | Downwind only |
| Precipitation | CBH potentially difficult | Attenuation problems; CBH lower than actual |

4. Comparing distributions

For comparative purposes it would be ideal if manual and automated SYNOP observations were available simultaneously at the same site. Unfortunately spatially and temporally collocated manual and automated SYNOP observations do not exist for the UK observing network because of the way that SYNOPs are encoded and stored. An observation can only be manual or automated at any given time. Therefore matched (forecast, observation) datasets for either manual or automated observations can be obtained, but not both manual and automated observations for the same time. The alternative approach, the one adopted here, is to consider a large enough sample of manual and automated observations, spanning all seasons. For this study all matched observations at 0000, 0600, 1200 and 1800 UTC for 2010 were used.

From a model development perspective, the entire distribution is important, although not all parts of the distribution are equally important from a user perspective. Forecasters are also focused on the full spectrum of cloud cover, but often concentrate on the tails of the distribution. While reduced instrument sensitivity above ∼3000 m is deemed to have little impact on the relatively low CBHs that are typically of the most interest (e.g. to the aviation sector), an underestimation of automated TCA potentially leads to an enhanced low bias in the observations which has implications for the model assessment of TCA.

Quality control is governed by the data assimilation process (Rawlins et al., 2007) and thus each model has its own set of quality-controlled observations, and these may, and occasionally do, differ from model to model.

4.1. Total cloud amount

Firstly the differences in the observed distributions of TCA are examined, as shown in Figure 2. Here the matched (to the forecast) automated and manual observation distributions as a function of time of day are given for the NAE. The distributions are very similar for the other models (not shown).

Figure 2. Distribution of cloud cover observations per okta, as a function of time of day.

Figure 2 shows that overall the distributions at the four times are very similar with the biggest differences in the tails. At the largely ‘cloud free’ end, it can be seen that the observer is still hedging away from the boundaries, even if they cannot see (either because part of the horizon is obscured or during nighttime). This hedging increases with the visibility of the horizon during the day where there is less inclination to report 0 okta cloud. An alternative view is simply that we rarely have totally cloud-free conditions, given an observer's hemispheric view. The automated frequencies of 0 okta are consistently high but are perhaps enhanced by a lack of instrument sensitivity when only high cloud is present. At the ‘mostly cloudy’ end of the distribution, there is a reversal in manual and automated frequencies for the 7 and 8 okta categories. Again during daylight hours manual observations hedge away from 8 okta due to the visibility of the horizon and ‘blue sky’. Automated frequencies for the ‘partial’ cloud categories are lower, possibly again illustrating a reduced detection of high cloud.

4.2. Cloud-base height

The distributions of manual and automated observations from the matched NAE forecasts are plotted in Figure 3, showing only small variations as a function of time of day. The notable exception is the increased frequency of very low cloud bases less than 500 m at 0000 and 0600 UTC, most probably through the enhanced occurrence of mist or fog during the night and early morning. The other curious feature is the distinct peak in the automated distribution at 1500 m, which may be an artefact of the binning algorithm used for SYNOP encoding (as it is also seen in the manual observations at 0000, 0600 and possibly 1800 UTC). Not surprisingly, observers provide a coarser discretisation of CBH in the vertical, but are aware of SYNOP encoding rules. There are also fewer automated observations of CBH greater than 6.5 km, which may be due to the lack of instrument sensitivity but also the algorithm used to derive the hourly CBH.

Figure 3. Distribution of CBH observations as a function of time of day. Proportions reported at 57.5 km represent the occasions where TCA was < 3 okta where no CBH is reported, being considered ‘cloud free’.

5. Contingency tables and categorical statistics

A categorical analysis is typically based on a 2 × 2 contingency table which is populated by applying a threshold to matched forecast–observation pairs and counting the number of hits (a), false alarms (b), misses (c) and correct non-events (d). The total sample size n is then a + b + c + d. A multitude of metrics and scores can be calculated and many are listed in the literature (e.g. Jolliffe and Stephenson, 2003).
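As a concrete illustration, the following sketch populates the 2 × 2 table from matched forecast–observation pairs and derives the frequency bias and base rate that are discussed next. The function names and the sample okta values are invented for this example, and an event is assumed to mean that the value meets or exceeds the threshold.

```python
def contingency_table(forecasts, observations, threshold):
    """Count hits (a), false alarms (b), misses (c) and correct non-events (d).

    An event is assumed to occur when the value is >= threshold.
    """
    a = b = c = d = 0
    for f, o in zip(forecasts, observations):
        forecast_event = f >= threshold
        observed_event = o >= threshold
        if forecast_event and observed_event:
            a += 1
        elif forecast_event:
            b += 1
        elif observed_event:
            c += 1
        else:
            d += 1
    return a, b, c, d


def frequency_bias(a, b, c, d):
    """Number of forecast events divided by number of observed events."""
    return (a + b) / (a + c)


def base_rate(a, b, c, d):
    """Frequency of observed occurrence."""
    return (a + c) / (a + b + c + d)


if __name__ == "__main__":
    # Invented matched TCA values in okta, for illustration only.
    forecast_okta = [0, 8, 7, 3, 8, 1, 6, 8, 7]
    observed_okta = [1, 8, 5, 6, 7, 0, 7, 8, 2]
    table = contingency_table(forecast_okta, observed_okta, threshold=6)
    print(table, frequency_bias(*table), base_rate(*table))
```

With these invented values and a 6 okta threshold there are six forecast events and five observed events, so the frequency bias exceeds one, which indicates overforecasting of the event.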

It is potentially misleading to draw conclusions based on only one verification metric, and therefore it is generally recommended to look at several, and always consider the frequency bias,

$$\text{frequency bias} = \frac{a + b}{a + c} \tag{1}$$

and the frequency of observed occurrence (or base rate). Hogan et al. (2009) identified two metrics with desirable properties for verifying cloud forecasts, and these will be considered here. One is the log-odds ratio:

$$\text{LOR} = \ln\left(\frac{ad}{bc}\right) \tag{2}$$

which is a measure of association, although Stephenson (2000) points out that any zero entry in the contingency table means it can no longer be calculated. This could be considered a disadvantage, but using a sufficiently large sample will eliminate this potential problem. Hogan et al. (2009) also found that the log-odds ratio for a contingency table with random entries does not score zero, so they proposed amending the log-odds ratio (Eq. (2)), to take this residual skill into account, utilising the definitions for random counts given in Eqs. (4)–(7). Despite being unbounded, with the adjustments for random skill, the log-odds ratio remains a score with desirable properties.

equation image(3)


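$$a_r = \frac{(a + b)(a + c)}{n} \tag{4}$$
$$b_r = \frac{(a + b)(b + d)}{n} \tag{5}$$
$$c_r = \frac{(a + c)(c + d)}{n} \tag{6}$$
$$d_r = \frac{(b + d)(c + d)}{n} \tag{7}$$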
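A minimal sketch of the unadjusted log-odds ratio of Eq. (2) and of the counts expected from a random forecast with the same marginal totals is given below. The function names are hypothetical, the random counts use the standard expression for a 2 × 2 table with fixed marginals, and the adjusted score itself is not reproduced here.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Natural log of the odds ratio ad/(bc); undefined if any count is zero."""
    if min(a, b, c, d) == 0:
        raise ValueError("log-odds ratio is undefined when any cell is zero")
    return math.log((a * d) / (b * c))


def random_counts(a, b, c, d):
    """Expected counts for a random forecast with the same marginal totals."""
    n = a + b + c + d
    a_r = (a + b) * (a + c) / n
    b_r = (a + b) * (b + d) / n
    c_r = (a + c) * (c + d) / n
    d_r = (b + d) * (c + d) / n
    return a_r, b_r, c_r, d_r


if __name__ == "__main__":
    counts = (4, 2, 1, 2)  # invented hits, false alarms, misses, correct non-events
    print(log_odds_ratio(*counts))
    print(random_counts(*counts))
```

The explicit check for zero cells reflects the limitation noted by Stephenson (2000): with small samples or extreme thresholds a single empty cell makes the score incalculable.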

The other score proposed by Hogan et al. is the Symmetric Extreme Dependency Score (SEDS), which is a modification of the Extreme Dependency Score (EDS), introduced in a weather verification context by Stephenson et al. (2008). The SEDS can be written as:

$$\text{SEDS} = \frac{\ln q - \ln H}{\ln p + \ln H} \tag{8}$$

where q = (a + b)/n to mirror the base rate (frequency of occurrence of observed events) p = (a + c)/n, but in this case for forecast events; H is the hit rate defined as a/(a + c). The primary advantage of the revised score is that it can be used for uncalibrated (biased) forecasts. Ferro and Stephenson (2010) recently showed that, like the EDS, the SEDS is still base-rate dependent, but SEDS is less susceptible to hedging (playing the score by changing the forecast to give a higher value) and it is asymptotically equitable (i.e. random or constant forecasts all score equally poorly), like the log-odds ratio. Using the SEDS on a bounded, non-extreme quantity such as TCA may seem somewhat counter-intuitive, but it has several desirable properties, and some of the mid-range TCA values have a frequency of occurrence (or base rate) of 0.05 or less. Furthermore, both these scores have analytical formulae for calculating the standard error σ:

equation image(9)


equation image(10)

These have been used to compute error bars in subsequent sections.
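The SEDS can be sketched from the definitions of p, q and H given above, using the algebraically equivalent form (ln q − ln H)/(ln p + ln H). The function names are hypothetical. The standard error shown for the log-odds ratio is the usual large-sample expression for a log odds ratio and is assumed, not confirmed, to correspond to the formula used for the error bars; the SEDS standard error is not reproduced.

```python
import math


def seds(a, b, c, d):
    """Symmetric Extreme Dependency Score from 2 x 2 contingency counts."""
    if a == 0:
        raise ValueError("SEDS is undefined when there are no hits")
    n = a + b + c + d
    p = (a + c) / n   # base rate: frequency of observed events
    q = (a + b) / n   # frequency of forecast events
    hit_rate = a / (a + c)
    return (math.log(q) - math.log(hit_rate)) / (math.log(p) + math.log(hit_rate))


def lor_standard_error(a, b, c, d):
    """Large-sample standard error of a log odds ratio (assumed form)."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


if __name__ == "__main__":
    print(seds(10, 0, 0, 90))   # perfect forecast
    print(seds(6, 24, 14, 56))  # hits equal to the random expectation
    print(lor_standard_error(4, 2, 1, 2))
```

A perfect forecast has q = p and H = 1, which gives a score of one, while a forecast whose hits equal the random expectation has H = q, which gives a score of zero.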

6. Impact on model frequency bias

At the Met Office, the verification database is populated with instantaneous nearest model grid point forecast values of TCA and CBH. While the forecast is an instantaneous (time step) value, it represents an area defined by the model grid resolution and therefore can also be considered a time average of sorts (dependent on the advection speed). Matching forecasts and observations in this case means comparing an instantaneous grid-box average with either a time-averaged automated observation or a manual observation with an offset of up to 10 min. The perception is that historically the representativeness of the observations and the forecast values for point verification was considered of secondary importance, since model forecast errors formerly dominated any verification result. As forecasts have improved, this viewpoint is less justifiable. The same applies to treating observations in an absolute sense (i.e. that they are ‘perfect’).

6.1. Total cloud amount

Forecast and observed joint distributions are very useful in indicating associations, and form the basis of the categorical contingency table concept. A perfect forecast system would have no off-diagonal elements (false alarms or misses in a 2 × 2 contingency table context). This can be expanded to multiple categories. Figure 4 shows the 1200 UTC joint and marginal distributions of the forecasts and automated observations for each of the four model configurations considered. The percentage per category is shown with the legend for the joint distribution given in the top right. At t+24(21) h, no clear 1:1 association is present for any of the models in the 2–6 okta categories. Most categories have evenly distributed weights of 0.5–2%. The differences between the model configurations are generally subtle, and mostly confined to the tails: 0–1 and 7–8 okta. Based on this dataset, it would appear that forecast cloud is essentially a binary response in the models with little correspondence between the observed and forecast cloud amounts in the mid-range. This may be a feature of MetUM configurations, since Figure 4 is certainly different to Figure 2 in Hogan et al. (2009) which showed better correspondence in a time-aggregate of the German Weather Service (DWD) model (7 km grid resolution) against cloud-radar-derived cloud amounts for t+0 to t+2 h. Besides being a time aggregate, the data preparation is very different. TCA was derived from both model and cloud radar data after cloud fractions were binned into 1 km vertical bins up to 11 km. While there appears to be greater correspondence, Hogan et al. also comment: ‘The joint histogram shows most of the data lying around the edge, suggesting a rather poor association between the two datasets’. This is similar to what is seen in Figure 4. Even the GM distribution at t+2 h (not shown) differs very little from that shown here. 

It is worth reiterating that the results presented in this article are hourly instantaneous values for all Block 03 sites aggregated over a year. The differences seen are probably due to the use of different observations and data preparation. Derived, time-aggregate cloud radar observations and forecasts may be more representative of each other. This does not mean that the DWD model is superior; such a conclusion could only be drawn if the models in question had been treated the same way and verified against the same datasets. As an aside, it is worth noting that t+2 h is most often treated as a quasi-analysis by forecasters because a new model run is typically only available 2–3 h after data time. Often t+3 h represents the first real forecast of use.
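A joint distribution of the kind discussed here can be built as a 9 × 9 table of percentages over the okta categories 0 to 8, with the marginal distributions obtained by summing rows and columns. The sketch below is illustrative only: the function name and the sample values are invented, and rows are assumed to index observed okta and columns forecast okta.

```python
def joint_distribution(forecast_okta, observed_okta):
    """Return (joint, observed_marginal, forecast_marginal) in percent.

    joint[o][f] is the percentage of pairs with observed okta o and
    forecast okta f; both inputs hold integers from 0 to 8.
    """
    n = len(forecast_okta)
    if n == 0 or n != len(observed_okta):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = [[0.0] * 9 for _ in range(9)]
    for f, o in zip(forecast_okta, observed_okta):
        joint[o][f] += 100.0 / n
    observed_marginal = [sum(row) for row in joint]
    forecast_marginal = [sum(joint[o][f] for o in range(9)) for f in range(9)]
    return joint, observed_marginal, forecast_marginal


if __name__ == "__main__":
    # Invented values mimicking an "all-or-nothing" forecast response.
    forecasts = [0, 8, 8, 0]
    observations = [0, 8, 7, 1]
    joint, obs_marg, fcst_marg = joint_distribution(forecasts, observations)
    on_diagonal = sum(joint[k][k] for k in range(9))
    print(on_diagonal, fcst_marg[0], fcst_marg[8])
```

The percentage on the diagonal gives a quick measure of exact category agreement, while a forecast marginal concentrated at 0 and 8 okta is the signature of the binary response described above.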

Figure 4. Joint observed–forecast and marginal distributions at 1200 UTC for (a) t+21 h UKV, (b) UK4 TCA, (c) t+24 h NAE and (d) GM TCA forecasts against automated observations.

Generally, frequency biases are amplified at midday (from convective overturning triggered by diurnal heating) as compared to 0000, 0600 and 1800 UTC. This holds for all model configurations (not shown). Figure 5 presents the 1200 UTC UK4 model frequency bias in two ways: TCA less than, and TCA greater than, a given threshold. This is done because most often cloud forecasts are communicated such that cloud cover is expected to exceed a certain amount. Rarely are the cloud amounts considered in isolation, and especially not the extremes of the distribution, as shown in Figure 4. Therefore Figure 5 attempts to show how the frequency bias evolves as the threshold is moved, and whether cloud is added or removed from the assessment as the threshold is changed. Figure 5(a) shows a steady reduction in frequency bias as progressively larger TCAs are included, with overforecasting against manual observations and underforecasting against automated observations. From Figure 5(a), it can be concluded that against manual observations the UK4 has far too many cloud-free events, but not enough against automated observations. Note that a frequency bias of unity does not imply a perfectly forecast TCA in the mean. In Figure 5(b), the frequency bias is shown in terms of cloud exceedance which shows a reversal in the signal, where the bias increases as the lower cloud amounts are progressively removed. This way it can be seen that the UK4 also has too many totally cloudy forecasts compared to manual observations but a relatively neutral signal for cloudy events against automated observations.
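The two views of the frequency bias can be sketched as follows, using the fact that the frequency bias is the number of forecast events divided by the number of observed events. The function names and the sample okta values are invented for illustration, and the event definitions (at or below the threshold, and at or above it) are assumptions about how the thresholds were applied.

```python
def frequency_bias_at_or_below(forecasts, observations, threshold):
    """Frequency bias for the event 'TCA <= threshold'."""
    n_forecast = sum(1 for f in forecasts if f <= threshold)
    n_observed = sum(1 for o in observations if o <= threshold)
    return n_forecast / n_observed if n_observed else float("nan")


def frequency_bias_exceedance(forecasts, observations, threshold):
    """Frequency bias for the event 'TCA >= threshold'."""
    n_forecast = sum(1 for f in forecasts if f >= threshold)
    n_observed = sum(1 for o in observations if o >= threshold)
    return n_forecast / n_observed if n_observed else float("nan")


if __name__ == "__main__":
    # Invented okta values: a binary forecast against graded observations.
    forecasts = [0, 0, 0, 8, 8, 8, 8, 4]
    observations = [1, 0, 2, 7, 8, 6, 8, 3]
    for okta in range(9):
        print(
            okta,
            frequency_bias_at_or_below(forecasts, observations, okta),
            frequency_bias_exceedance(forecasts, observations, okta),
        )
```

In this invented sample the forecast has three times as many cloud-free events and twice as many overcast events as the observations, so both curves depart from unity at the extremes even though the biases partly cancel when all categories are included.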

Figure 5. Deviations from a perfect frequency bias of t+21 h UK4 forecasts at 1200 UTC using manual or automated observations. (a) shows the ≤ okta bias, and (b) shows the conventional exceedance frequency bias.

6.2. Cloud-base height

Figure 6 shows the CBH frequency bias of t+24(21) h model forecasts against automated and manual observations at 0000 UTC. While other times could have been chosen, low cloud bases are more prevalent at night with distributions similar at 0000 and 0600 UTC. At 0000 UTC all models overforecast the occurrence of all CBH less than 1.5 km against automated and manual observations, the GM being the exception, which has too few cloud bases below ∼400 m against both manual and automated observations. Between 400 and 1500 m, the GM overforecasts against automated observations, but underforecasts against manual observations. When cloud bases above 1.5 km are included, the overall bias tends towards too few forecast cloud bases against manual observations. The model bias against automated observations oscillates between overforecasting and underforecasting, the exception being the UK4 which has a near-neutral bias over all observed cloud bases. However, given the detection limits of automated observations, this may not necessarily be a true reflection of model biases. From a forecasting impact perspective, the CBH below 2 km are of more interest.

Figure 6. Frequency bias of the t+24(21) h forecasts at 0000 UTC using manual or automated observations.

The frequency biases at other times of day (not shown) are broadly similar. At 1200 UTC in particular there is a tendency for underforecasting the frequency against manual observations at all levels. The overforecasting of high-cloud bases above 6.5 km against automated observations is still probably due to the lack of detection. The UKV and NAE are similar above 6.5 km, with the UK4 more similar to the GM. The NAE seems to have the smallest biases for mid-level CBH, while the GM has consistently too few cloud bases across all heights, except above 6.5 km against automated observations.

7. Impact on forecast skill metrics

7.1. Total cloud amount

Taking an all-inclusive approach, the log-odds ratio has been computed for model configurations and all exceedance thresholds and plotted at 0000 and 1200 UTC in Figure 7. Generally, forecasts from all models verify better against manual observations with the highest scores at the cloud-free end of the distribution. There is a small diurnal variation in the magnitude of the scores with 1200 UTC scores lower than other times. Against automated observations, the GM has the biggest skill deficit at 1200 UTC when compared to the other model configurations. Overall scores against automated observations for overcast conditions are the worst, with the biggest separation in skill between manual and automated observations at 1200 UTC. By contrast, the GM appears fairly competitive against manual observations. There are subtle differences between the model configurations, with the UK4 and NAE competing for being the most skilful. The UKV performance seems rather disappointing, being more comparable to the GM, which is ∼16 times coarser in horizontal resolution. Therefore an initial thought is that this may be a manifestation of the representativeness error, instead of a true reflection of forecast skill. Cloud is inherently similar to precipitation in terms of the discrete nature of the forecast field (where models appear to have an ‘all-or-nothing’ approach to TCA) and features may not be quite in the right place at the right time, falling foul of the double penalty effect (Rossa et al., 2008). The issues with verifying km-scale precipitation forecasts are well established in the literature, with many developed spatial verification methods rewarding close forecasts (e.g. Gilleland et al., 2009, provide a review), and it is hypothesised that the results shown here are similar to what is seen for high-resolution precipitation forecasts verified against gauges.

Figure 7.

Log-odds ratio for TCA for a range of thresholds at 0000 and 1200 UTC. Error bars reflect analytically calculated standard errors. This figure is available in colour online at wileyonlinelibrary.com/journal/qj
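For readers who wish to reproduce scores of this kind, the log-odds ratio and an analytical standard error can be sketched as follows. This is a minimal illustration, assuming the standard 2×2 contingency table (hits a, false alarms b, misses c, correct non-events d) and the usual asymptotic standard error; the function name is hypothetical, and the exact error formulation used for the figures may differ (e.g. through an allowance for serial dependence).

```python
import math


def log_odds_ratio(hits, false_alarms, misses, correct_negatives):
    """Return (log-odds ratio, asymptotic standard error) for a 2x2 table.

    Assumes the conventional cell labels a=hits, b=false alarms,
    c=misses, d=correct non-events for one exceedance threshold.
    """
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    if min(a, b, c, d) <= 0:
        # The score is undefined if any cell is empty.
        raise ValueError("all four contingency-table cells must be positive")
    lor = math.log((a * d) / (b * c))
    # Usual asymptotic standard error of the log-odds ratio.
    std_err = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return lor, std_err


if __name__ == "__main__":
    # Illustrative counts only, not taken from the study.
    print(log_odds_ratio(50, 25, 25, 100))
```

Because the numerator is the product of hits and correct non-events, the score is largest when both are large relative to the misses and false alarms, and the standard error grows quickly when any one cell is sparsely populated.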

A preliminary test of this hypothesis is offered here, using the ‘minimum coverage’ spatial method proposed by Damrath (2004) which enables a point observation to be compared to a forecast neighbourhood (instead of just the nearest grid point). Using this method, a useful forecast is defined as one that predicts an event correctly over a minimum fraction of an area of interest. The results were compiled for a randomly selected month of UKV cloud forecasts. Based on a sample of thirty 36 h forecasts, the cumulative percentage UKV forecast skill gained (compared to traditional precise matching) by aggregating score differences over the 36 h forecast is between 1 and 5%. This was true for a range of neighbourhood sizes mimicking the coarser model resolutions (4, 12 and 25 km), and a fractional minimum coverage threshold of 0.5. The biggest gain is for lower cloud amounts where conceivably even slight displacement errors will have a maximum negative effect on scores when calculated based on the traditional method of precise matching of forecasts and observations in space and time. This would suggest that high-resolution cloud forecasts are subject to the double-penalty effect, just like precipitation, where UKV TCA forecasts are more skilful when a neighbourhood of forecast points around an observing site is used. Although not tested, based on the differences in score shown here, a 1–5% increase on plotted UKV scores could make the UKV skill comparable to, or better than, other models. This is the subject of ongoing work.
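The neighbourhood test described above can be illustrated with a short sketch. It assumes a two-dimensional forecast field of TCA in okta on a regular grid, a square neighbourhood centred on the grid point nearest the observing site, and an event defined as the threshold being met or exceeded; the function and argument names are hypothetical and the code is not the implementation used in the study.

```python
import numpy as np


def minimum_coverage_event(field, row, col, half_width, threshold,
                           min_fraction=0.5):
    """Return True if the forecast neighbourhood predicts the event.

    The event is forecast when at least `min_fraction` of the grid points
    in the square window around (row, col) meet or exceed `threshold`.
    half_width=0 reduces to traditional nearest-grid-point matching.
    """
    field = np.asarray(field, dtype=float)
    # Clip the window at the domain edges.
    r0, r1 = max(row - half_width, 0), min(row + half_width + 1, field.shape[0])
    c0, c1 = max(col - half_width, 0), min(col + half_width + 1, field.shape[1])
    window = field[r0:r1, c0:c1]
    fraction = np.mean(window >= threshold)
    return bool(fraction >= min_fraction)


if __name__ == "__main__":
    # Overcast on the eastern side of the domain, with a one-point gap
    # exactly at the observing site (illustrative values only).
    tca = np.zeros((5, 5))
    tca[:, 2:] = 8.0
    tca[2, 2] = 0.0
    print(minimum_coverage_event(tca, 2, 2, half_width=0, threshold=4.0))
    print(minimum_coverage_event(tca, 2, 2, half_width=1, threshold=4.0))
```

In this constructed case, precise matching scores a miss if cloud is observed, whereas the 3×3 neighbourhood (five of nine points cloudy) counts the forecast as useful, which is the kind of displacement tolerance the method is intended to provide.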

Figure 8 shows the SEDS for the same forecasts. In contrast to Figure 7, the results suggest that forecasts of cloud amount generally improve in skill as lower cloud amounts are eliminated from the calculation, the exception being for overcast conditions. Recall that the log-odds ratio is a measure of association and larger scores for the lower exceedance thresholds suggest that the product of the hits and correct non-events is largest when the sample is more inclusive. As the smaller cloud amounts are eliminated from the calculation, the strength of association decreases relative to the misses and false alarms. On the other hand, recall that SEDS can be computed for biased forecasts so p and q need not be equal. The SEDS increases as the smaller cloud amounts are eliminated from the calculation, with the hit rate H decreasing, thus reducing the denominator. If there are also more false alarms than misses, then q is larger than p such that the difference between the numerator and denominator decreases, and the score progressively increases as the smaller cloud amounts are eliminated from the calculation. The log-odds ratio suggests that the ratio of correctly forecast events weakens for totally overcast conditions, while at the same time the SEDS suggests that, despite a reduction in the hit rate, overforecasting can improve the score.

Figure 8.

SEDS for TCA for a range of thresholds at 0000 and 1200 UTC. Error bars reflect analytically calculated standard errors. This figure is available in colour online at wileyonlinelibrary.com/journal/qj

Given that neither model forecasts nor the observations are perfect (be they manual or automated), in general model forecasts still appear to be more skilful against manual observations, the exception being cloud-free events. The UK4 also appears to have a more tangible skill margin over the NAE most of the time, the exception being for cloud amounts ≥ 7–8 okta. Once again there is little to choose between the UKV and GM, the biggest deficit still in 1200 UTC forecasts. At 1800 UTC (not shown) the UKV forecasts for overcast conditions appear to be the least skilful of all model configurations. As before, subjective assessment of forecasts does not support these poor scores. The ranking of models against manual observations is markedly different with the NAE and GM, on balance, probably the more skilful, while the UK4 has the lowest scores. The ordering of models is more sensitive to the observation type than the metric. The main decision is therefore which set of results to believe.

7.2. Cloud-base height

Model forecasts of CBH are diagnosed using the three-dimensional cloud fraction, with the diagnosis itself improving as the cloud fraction increases. Verification is aligned with aviation requirements, where only CBH for cloud amounts greater than 2 okta is verified. Therefore if the model TCA bias changes, e.g. cloud amounts are reduced, potentially fewer cloud bases will be derived too. Differences in manual and automated CBH distributions are an additional influence on the calculated model CBH bias. Figure 9 depicts the log-odds ratio for all CBH below 5000 m at 0000 and 1200 UTC for each of the models. Note that a logarithmic scale is used to place greater emphasis on the low CBH scores. One thing immediately noticeable is how the distribution of scores ‘buckles’ as the day progresses, with the distribution comparatively flat at 0000 and 0600 UTC (not shown). Again, scores against manual observations appear to be higher for the lowest CBH thresholds, with some convergence in scores between 500 and 1000 m, and some model configurations verified against automated observations remaining comparatively more skilful than against manual observations. Rather paradoxically, the GM and UKV CBH forecasts verify better against automated observations of the lowest CBH than the other models do, the exception being the GM at 1200 UTC, which appears to struggle to be competitive with the other model configurations. At 0000 UTC (Figure 6), the UKV has the smallest bias for CBH less than 500 m, with the GM underforecasting these low CBHs. This suggests that a smaller bias, or even underforecasting, is beneficial to the scores. This is also partially true for the UKV forecasts against manual observations. The sample of low CBH manual observations is also comparatively small, as reflected by the size of the error bars for CBH less than 500 m. As for TCA, log-odds ratios are rather low overall, although comparable to those reported by Hogan et al. (2009), for example.

Figure 9.

Log-odds ratio for a range of CBH thresholds at 0000 and 1200 UTC. Note the logarithmic x-axis. Error bars reflect analytically calculated standard errors. This figure is available in colour online at wileyonlinelibrary.com/journal/qj
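The link between forecast cloud amount and the number of diagnosed cloud bases can be illustrated with a deliberately simple sketch. It assumes that CBH is taken as the height of the lowest model level whose cloud fraction exceeds 2 okta, with levels ordered from the surface upwards; the actual MetUM diagnostic is not described here and may differ, and the function name is hypothetical.

```python
import numpy as np


def diagnose_cbh(cloud_fraction, level_heights, min_okta=2.0):
    """Return the height (m) of the lowest level with cloud > min_okta.

    Assumption: levels are ordered bottom-up and the cloud base is the
    first level whose layer cloud fraction exceeds min_okta/8.
    Returns None when no level qualifies (no cloud base diagnosed).
    """
    cloud_fraction = np.asarray(cloud_fraction, dtype=float)
    level_heights = np.asarray(level_heights, dtype=float)
    cloudy = np.flatnonzero(cloud_fraction > min_okta / 8.0)
    if cloudy.size == 0:
        return None
    return float(level_heights[cloudy[0]])


if __name__ == "__main__":
    heights = [100.0, 400.0, 900.0, 1600.0]  # illustrative level heights
    print(diagnose_cbh([0.0, 0.1, 0.3, 0.9], heights))
    # Slightly less cloud in the third level moves the base upwards.
    print(diagnose_cbh([0.0, 0.1, 0.2, 0.9], heights))
```

Under this assumption a modest reduction in layer cloud fraction can raise the diagnosed base or remove it altogether, so a change in the TCA bias feeds directly into the CBH frequency bias.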

Figure 10 shows the SEDS at 0000 and 1200 UTC for all forecasts and model configurations. The pattern is not dissimilar to that of the log-odds ratio, with larger skill scores for lower CBH, which is arguably of more interest to the user. In this case, the GM does not show the same level of competitiveness but the UKV is still showing signs of being superior, especially against automated observations but also against manual observations. The differences in calculated skill based on observation type are still the most compelling feature.

Figure 10.

SEDS for a range of CBH thresholds at 0000 and 1200 UTC. Note the logarithmic x-axis. Error bars reflect analytically calculated standard errors. This figure is available in colour online at wileyonlinelibrary.com/journal/qj

8. Summary and conclusions

This article has investigated three specific areas:

  • Suitability of automated and/or manual SYNOP observations for assessing cloud forecasts.

    A key disadvantage is that SYNOP observations can provide an assessment of the vertical distribution of cloud only in an indirect or bulk sense. Automated TCA may be underestimated when the cloud cover is dominated by high cloud, yielding more cloud-free events, and potentially reducing cloud amounts overall. During precipitation, automated CBH may be lower than it should be. When using automated observations for verification, this systematic behaviour must be factored into the interpretation, even if the effects cannot be isolated.

    Forecast TCA and CBH are derived from model profiles. For TCA, maximum-random overlap is used. Therefore, when it comes to verifying TCA, not only is the model physics being verified, but also the overlap assumption. This is a good example of the difference between verifying a forecast product and an underlying model characteristic.

    In the study presented, there is a fundamental mismatch between the surface observation and the model forecast stored for verification. The model forecast of TCA or CBH is an instantaneous time-step grid-box average. Hogan et al. (2009) and Bouniol et al. (2010) used instantaneous profiles of cloud fraction (not diagnosed quantities), and compared these to time-aggregated cloud radar observations to achieve comparability with a given model-grid resolution. Manual SYNOP observations are more ‘instantaneous’ but hemispheric, i.e. encompassing an area that stretches to the visible horizon (O(10–100 km)). The automated observation is a time-aggregate. To make a true comparison, the forecasts and observations must be matched as far as possible in terms of their spatial and temporal resolution. This is arguably more difficult with SYNOP observations.

    Despite the shortcomings, SYNOP cloud observations will continue to form the basis for routine operational NWP forecast verification, focused primarily on products. The main message is not to mix observation types. It is suggested that SYNOP observations are inadequate for fully understanding model physics. For this purpose, datasets from active sensing instruments, such as cloud radar, are invaluable.

  • Suitability of conventional metrics for assessing cloud forecasts.

    The study used two metrics identified by Hogan et al. (2009) as being suitable for the verification of cloud forecasts, and this study supports this recommendation. It was shown that the log-odds ratio and SEDS can provide subtly different messages which, when interpreted, lead one to conclude different aspects of model and/or observation behaviour.

    Taking a ‘distribution’ view, such as through the use of joint distributions, is also helpful in understanding the degree of association between forecast and observation. In that respect, the log-odds ratio and joint distributions are complementary.

    While the distribution shown in Figure 4 might seem disappointing, the outcome could be more promising if model forecasts could be matched more appropriately (i.e. have a model time-mean TCA), as discussed in the previous bullet.

    More disturbingly perhaps, the often-preferred method of using a selection of exceedance thresholds, such as used here in sections 6 and 7, does not show association very well, especially if only a few thresholds are selected. This can potentially contribute to a false understanding of cloud forecast skill. Using more than one methodology to assess forecasts is critically important to have a realistic understanding of the level of accuracy an NWP model can deliver.

    While this study presents a UK- and MetUM-specific perspective, the generic message is clear. Without sufficient knowledge regarding the characteristics of an observation type, i.e. how different types behave, their limitations, etc., it is all too easy to misinterpret verification statistics. Even when only one observation type is used, an understanding of the characteristics is still necessary. For example, as shown here, the calculated model frequency bias may suggest an overforecasting of cloud amounts. Yet this result may have nothing to do with the model, and may instead arise because the observations underestimate the cloud amount (e.g. through instrument limitations). Under such circumstances, signal attribution becomes very complex, if not impossible.

    Differences in verification results based on a variety of observation types should be expected, but they should also be explainable. From a verification perspective, this raises two questions: which observation is more like the model (grid-box average) forecast, and which results should be acted upon? Some might argue it is the automated value, which represents a time average computed from sub-hourly samples of cloud passing directly over a vertically pointing instrument. Others might claim that the hemispheric view of a manual observation made by an observer provides an area average at one point in time. Most importantly, if the SYNOP observation is a temporal average, should it be used to assess an instantaneous model forecast value? Or should a new time-mean forecast TCA and CBH be created to better match the observations? The answers to these questions may differ depending on what the ultimate goal of a comparison might be.

  • Impact of horizontal model-grid resolution on bias and skill.

    When considering Figures 7 to 10, it is difficult to draw any consistent conclusions (considering both observation types) as to which model is more skilful and whether horizontal resolution is a factor in determining forecast skill. The observation type dominates the interpretation.

    The intriguing result of this study is the bias and skill of the 1.5 km UKV forecasts. It was hypothesised that the TCA forecasts suffer from the double-penalty effect through the use of precise matching at observing sites. This was found to be a possibility, based on a simple test using one of the single-observation forecast-neighbourhood (SO-NF) methods. There may be other aspects of UKV 3D cloud forecast fields which this assessment cannot capture. This may be why CBH scores for the lowest cloud bases are not as poor. Fundamentally, the UKV vertical profile of cloud may be better, providing a better diagnosis of CBH. Only a full assessment of the vertical distribution of cloud can provide these answers.
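To make concrete the point in the first bullet that verifying TCA also verifies the overlap assumption, the following sketch computes a total cloud amount from a profile of layer cloud fractions using a common formulation of maximum-random overlap (adjacent cloudy layers overlap maximally, while layers separated by clear air overlap randomly). The MetUM implementation is not described in this article and may differ in detail; the function name is hypothetical.

```python
import numpy as np


def total_cloud_amount_max_random(cloud_fraction):
    """Total cloud fraction from a profile under maximum-random overlap.

    Assumes the widely used form
        1 - TCA = (1 - c_1) * prod_k (1 - max(c_k, c_{k-1})) / (1 - c_{k-1}),
    with the layers given in vertical order.
    """
    c = np.clip(np.asarray(cloud_fraction, dtype=float), 0.0, 1.0)
    clear = 1.0 - c[0]
    for k in range(1, c.size):
        if c[k - 1] >= 1.0:
            # An overcast layer makes the whole column overcast.
            return 1.0
        clear *= (1.0 - max(c[k], c[k - 1])) / (1.0 - c[k - 1])
    return 1.0 - clear


if __name__ == "__main__":
    # Illustrative profiles only, not taken from the study.
    print(total_cloud_amount_max_random([0.5, 0.5]))       # adjacent layers
    print(total_cloud_amount_max_random([0.5, 0.0, 0.5]))  # separated layers
```

Two adjacent half-covered layers give a total of 0.5 (4 okta), whereas the same two layers separated by a clear layer give 0.75 (6 okta). The same model physics can therefore yield different forecast TCA depending solely on the overlap assumption.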


I would like to thank all those who contributed to the rich discussions on this subject: Ric Crocker, Clive Wilson, Roy Kershaw, Cyril Morcrette and Malcolm Brooks from the Met Office, Chris Ferro at Exeter University, and fellow members of the WMO Joint Working Group for Forecast Verification Research (JWGFVR). I would also like to thank the reviewers for their comments which helped to draw out the main points.