## 1. Introduction

Convective-scale numerical weather prediction (NWP), based on models with a horizontal resolution of order 1 km, is motivated to a large extent by the desire to predict precipitation and winds associated with cumulus convection. As conventional observations are sparse at the convective scale, radar is an important source of information. The first operational NWP systems with kilometre-scale resolution use simple methods to assimilate radar data, such as latent heat nudging (LHN) (Jones and Macpherson, 1997; Macpherson, 2001; Leuenberger and Rossa, 2007; Stephan *et al.*, 2008), but research is ongoing using more sophisticated techniques.

In a perfect model context, simulated observations of convective storms have been assimilated using four-dimensional variational assimilation (4DVAR) (Sun and Crook, 1997, 1998) and Ensemble Kalman Filter (EnKF) methods (Caya *et al.*, 2005). Real observations of an isolated storm have been assimilated using EnKF by Dowell *et al.* (2004, 2010) and Aksoy *et al.* (2009, 2010). These studies represent significant progress towards operational systems, but also indicate the difficulty of inserting observed storm cells into the models and suppressing spurious simulated cells.

In general one would expect problems, since the dynamics of convective clouds at small scales strongly violate key assumptions that the methods depend on. In 4DVAR (Talagrand and Courtier, 1983; Bouttier and Rabier, 1998; Bouttier and Kelly, 2001) and Kalman filtering (Kalman, 1960; Kalman and Bucy, 1961), it is assumed that the distribution of background error is unimodal and adequately described by second-order moments through an error covariance matrix, as for a Gaussian distribution. The size of this matrix can be reduced to a manageable number of degrees of freedom by balance assumptions and observations of correlations, or through representation with a small ensemble in EnKF (Evensen, 1994; Houtekamer and Mitchell, 1998, 2001). Furthermore it is assumed that the temporal evolution of the error distribution can be represented by tangent linear dynamics or the evolution of a small ensemble of forecasts. Alternatively, more general methods such as the particle filter (Van Leeuwen, 2009) do not require these assumptions, although they gain generality at the cost of computational efficiency, potentially requiring prohibitively large ensemble sizes (Bengtsson *et al.*, 2008; Bickel *et al.*, 2008; Snyder *et al.*, 2008). These issues are reviewed by Bocquet *et al.* (2010).

In exploring new data assimilation methods, it is often convenient to complement tests using full atmospheric models with test problems using idealized models. Popular choices include the models of Lorenz (1963, 1995, 2005), which include coupling of fast and slow variables in a low-dimensional dynamical system, or the quasigeostrophic equations (Ehrendorfer and Errico, 2008). However, both of these systems were designed to represent key processes of synoptic-scale dynamics, rather than to capture the particular characteristics of the convective scales that make data assimilation difficult. To make progress, it is necessary to consider the nature of the non-Gaussianity and nonlinearity found in the convecting atmosphere.

The essence of this problem can be seen by considering assimilation of radar reflectivity for convective storms. Radar data have a high spatial resolution, comparable to the model resolution, but the field of precipitation particles they observe is highly intermittent. Large areas contain no precipitation at all, and strong gradients over distances of a couple of grid points are common. This results in a highly non-Gaussian forecast error distribution, with long tails associated with displacement errors, where a position error of a few grid points produces an order-one error in reflectivity. A consequence of this spatial intermittency of precipitation fields is a lack of spatial correlations that would otherwise reduce the effective number of degrees of freedom. This can be contrasted with the situation on synoptic scales, where dynamical balance introduces correlations in space and between model variables. The problem is a version of the ‘curse of dimensionality’: the number of possible states of the system increases exponentially as the dimensionality of the system grows (Bellman, 1957, 1961).

A second issue is that the typical temporal resolution of radar observations of 5–15 min is coarse in comparison to the model time step, which is determined by the numerical requirement that the model state does not change too much over the interval, and is typically less than a minute. Furthermore, clouds do not appear on radar until large precipitation particles have had time to form, by which time the dynamical circulation of the cloud is well developed. The result is that there is significant error growth between observation times, and indeed well-developed cumulus clouds can appear from one observation time to the next. This lack of temporal correlation between observation times results in an essentially stochastic evolution of the field between observation times, as previously unobserved features suddenly appear as precipitating clouds.

Two potential strategies have been discussed to cope with the lack of spatial and temporal correlations. The first is localization, where the analysis at a given location is only influenced by observations that are close by in space and time (Ott *et al.*, 2004). Patil *et al.* (2001) showed that the atmosphere often has a local low dimensionality and therefore localization reduces a high-dimensional problem to a set of problems of lower dimension. The second strategy is observation averaging. By averaging the observations over a region in space to create a so-called super-observation (Alpert and Kumar, 2007; Zhang *et al.*, 2009), the intermittency is reduced. Upscaling the observations not only reduces the effective dimensionality of the system by introducing spatial correlations, it also produces more smoothly varying fields, leading to better (more Gaussian) error statistics. The cost of this improvement is that the observations lose detail and may no longer resolve individual convective cells, so that even an analysis that ‘perfectly’ matches the averaged observations is not a perfect analysis when considered at full resolution.
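As a minimal illustration of the averaging strategy (the 1-D setup and names such as `superobs` and `window` are our own, not from any operational system), the following sketch block-averages a raw observation vector into coarser super-observations. Note how the isolated cell is smeared out: the intermittency is reduced, but the cell's exact position is no longer resolved.

```python
# Hypothetical sketch of super-observation averaging on a 1-D grid.
# Raw observations are averaged over non-overlapping blocks of size
# `window` to form super-observations.

def superobs(obs, window):
    """Average consecutive blocks of `window` raw observations."""
    n = len(obs) // window
    return [sum(obs[i * window:(i + 1) * window]) / window
            for i in range(n)]

# An intermittent reflectivity-like field: mostly zero, one isolated cell.
raw = [0, 0, 0, 4, 8, 4, 0, 0]
print(superobs(raw, 4))  # [1.0, 3.0]
```

The averaged field varies more smoothly, at the cost of losing the sub-block position of the convective cell.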

The purpose of this paper is to introduce a minimal model that represents these key features of spatial intermittency and stochastic time evolution. This will provide a simple test bed to examine the performance of various data assimilation methods. The model can be regarded as a minimal version of the stochastic cumulus parametrization scheme of Plant and Craig (2008), which is based on a statistical mechanics theory of convective fluctuations (Craig and Cohen, 2006). The convecting atmosphere is represented by a stochastic birth–death process in space, where cumulus clouds appear at random locations with a certain triggering frequency, and existing clouds disappear with a certain frequency. The result is a field of randomly located clouds (a spatial Poisson process) that have a random lifetime, but with the average density of clouds in space and the average cloud lifetime determined by the birth and death rates.
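A birth–death process of this kind can be simulated in a few lines. The sketch below is illustrative only (parameter names `birth_rate` and `death_prob` are our own): at each time step, clouds are born at uniformly random positions on [0, 1) with a Poisson-distributed count, and each existing cloud dies independently with a fixed probability. The stationary mean cloud number is then birth rate divided by death probability.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for a Poisson-distributed count with mean lam."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def step(clouds, birth_rate, death_prob, rng):
    """One step of the birth-death process on the unit interval."""
    # Each existing cloud survives with probability 1 - death_prob.
    survivors = [x for x in clouds if rng.random() > death_prob]
    # New clouds appear at uniformly random locations (a Poisson process).
    births = [rng.random() for _ in range(poisson(birth_rate, rng))]
    return survivors + births

rng = random.Random(1)
clouds, counts = [], []
for t in range(5000):
    clouds = step(clouds, birth_rate=2.0, death_prob=0.2, rng=rng)
    if t >= 1000:  # discard spin-up
        counts.append(len(clouds))
mean_n = sum(counts) / len(counts)
print(round(mean_n, 1))  # close to birth_rate / death_prob = 10
```

The long-run statistics (mean cloud density and mean lifetime of 1/death_prob steps) are set entirely by the two rates, as described above.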

Two data assimilation methods will be applied to this simple model, in basic and localized forms, and with averaged observations. These are the Ensemble Transform Kalman Filter (ETKF) of Bishop and Toth (1999) and its local version as described by Hunt *et al.* (2007), and the Sequential Importance Resampling (SIR) particle filter (Van Leeuwen, 2009) and its local version. These two methods were chosen because they make very different approximations while targeting the same posterior distribution and are likely to show different behaviours. The behaviour of the SIR filter should be easy to anticipate since it is expected to respond directly to the effective dimensionality of the system, while the ETKF is being applied well outside of its regime of validity and may not work at all.
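For readers unfamiliar with the SIR filter, a single analysis step can be sketched as follows. This is a generic illustration, not the implementation used in this paper: a scalar state, a Gaussian observation likelihood with error standard deviation `obs_err`, and the function name `sir_step` are all our own assumptions.

```python
import math
import random

def sir_step(ensemble, obs, obs_err, rng):
    """One Sequential Importance Resampling step for a scalar state."""
    # Importance weights: likelihood of the observation given each member,
    # here assumed Gaussian with standard deviation obs_err.
    w = [math.exp(-0.5 * ((x - obs) / obs_err) ** 2) for x in ensemble]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resample members with probability proportional to their weights;
    # members far from the observation tend to be discarded.
    return rng.choices(ensemble, weights=w, k=len(ensemble))

rng = random.Random(0)
ens = [-2.0, -1.0, 0.0, 1.0, 2.0]
new_ens = sir_step(ens, obs=1.2, obs_err=0.5, rng=rng)
```

Because the analysis ensemble is drawn only from existing members, the filter degenerates when no member lies near the observations, which is why its performance is expected to track the effective dimensionality of the system.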

The aim of this analysis is not to determine which data assimilation method is best, but rather to shed some light on what new problems are likely to appear when these methods are applied on the convective scale. The simple birth–death process used in this study omits many processes that are important in nature and that may enable the data assimilation algorithms to function more effectively. There is no detailed description of the processes responsible for triggering convective clouds, nor any representation of the coupling of convection to the larger-scale flow. The model could be extended to treat such processes, and would have to be before any conclusions could be drawn about one method being better than another for real applications. Instead we hope to identify typical patterns of error, using a model simple enough that their origins can be traced and strategies to correct them can be found.

The organization of the paper is as follows. First, the simple model is introduced, along with the implementation details of the two data assimilation algorithms. The ability of the basic schemes to converge to the correct state for stationary and time-varying cloud fields is then examined in detail for a representative ensemble size. The dependence on ensemble size is then considered, followed by the impact of localization and averaging.