#### 3.1. Initial Model Order and Dwell Time Parameterization

[13] Much of the fitting procedure used was as detailed by *Sansom* [1999], but a short summary is given here and some differences are noted. The fitting procedure is an iterative maximum likelihood procedure, so initial values are required. Following *Rabiner* [1989], uniform probabilities are sufficient to initialize the transition probability matrix, and, from *Sansom* [1999], if the dwell time probabilities are nonparametric (as is assumed initially) it is also sufficient to initialize these as uniform. However, according to *Rabiner* [1989], the parameters of the state distributions need to be initialized with values near to those that give the global maximum in the likelihood; otherwise, convergence may only reach a local maximum. To minimize this risk, initial values were randomly selected many times and the fitting procedure followed to convergence. The fit with the greatest likelihood was then taken as the global maximum. Each random selection was restricted to feasible values, with all the location parameters confined to the range of the data and the scale/correlation parameters kept to the same order as those of the data. While this method did not guarantee a global maximum, it was a practical strategy which, since many such fits were made, provided some degree of confidence that the global maximum had been reached.
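
The random-restart strategy can be sketched as follows. This is an illustrative stand-in, not the authors' code: the toy objective, function names, and learning rate are all assumptions, and a real application would replace `nll` with the HSMM negative log likelihood and `local_fit` with the iterative ML procedure.

```python
import numpy as np

# Toy multimodal objective standing in for the HSMM negative log likelihood:
# it has a local minimum near x = +2 and the global minimum near x = -2.
def nll(x):
    return (x**2 - 4.0)**2 + x

def grad(x):
    return 4.0 * x * (x**2 - 4.0) + 1.0

def local_fit(x0, lr=0.01, steps=2000):
    """Stand-in for one run of the iterative fitting procedure: a simple
    gradient descent that converges to whichever local minimum's basin
    contains the starting value."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def fit_with_restarts(n_starts=20, seed=0):
    """Random-restart strategy: start from many feasible random initial
    values (here, within the range of the 'data') and keep the fit with
    the greatest likelihood, i.e. the smallest negative log likelihood."""
    rng = np.random.default_rng(seed)
    fits = [local_fit(x0) for x0 in rng.uniform(-3.0, 3.0, n_starts)]
    return min(fits, key=nll)

best = fit_with_restarts()
```

A single run started in the wrong basin settles near x = +2; with enough restarts the run with the greatest likelihood lands near the global minimum at x ≈ -2, which is the paper's rationale for repeating the fit from many random initializations.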

[14] In this way, *Sansom* [1999] fitted to the same data set a series of models, each with a different number of states. The states were labeled with R for rain and S for showers, with integer suffixes (e.g., R1, R2, S1, S2) denoting subdivisions, together with I for interevent dry and E for error (so named since *Sansom and Thomson* [1992] found a component that could be attributed to imperfections in the manual digitizing process). In general, the greater the number of states, the smaller the BIC (Bayesian information criterion) and hence the better the statistical fit, but the physical cause for each state was also assessed. This sometimes suggested that the same label was suitable for two states because they were collocated with similar links to other states. In such a case, even if the model was statistically justifiable according to the BIC criterion, it was considered that the decomposition had progressed too far and it was unlikely that two physically distinct states had been found. It might have been that seasonality or some other low-frequency variation in the data was being seen. A nine state model with four drys (R, S1, S2, and I) and five wets (R1, R2, S1, S2, and E) was adopted by *Sansom* [1999] since all the more complicated models had at least two states with the same physical interpretation. *Sansom* [1999] showed that this model was suitable throughout New Zealand, and it was chosen as the starting point for the analysis of the data from the 20 stations considered in this paper.
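
The BIC comparison used throughout this selection process is a standard computation; the sketch below uses hypothetical likelihood and parameter-count figures, not values from the paper.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k*ln(n) - 2*ln(L); smaller is better."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

# Hypothetical example: a richer model must raise the likelihood enough
# to pay the ln(n) penalty on each of its extra parameters.
small_model = bic(log_lik=-34700.0, n_params=62, n_obs=13000)
large_model = bic(log_lik=-34650.0, n_params=72, n_obs=13000)
better = "large" if large_model < small_model else "small"
```

This is why, in the text, adding states usually lowers the BIC, yet the final choice also weighs whether each extra state has a distinct physical interpretation.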

[15] Data for the 5 year period from January 1988 to December 1992 were used for most of the 20 stations, but a few had slightly earlier or later 5 year periods. The data from Wellington (41°17′S, 174°46′E) are shown in Figure 3 by a histogram of the drys at the top and a scatterplot of the wets below. Wellington can be used as an effective example for all 20 stations since, to first order, the data from all stations are similar, and so the same model could be fitted at all stations. If this were not the case, then the spatial modeling presented in this paper would not be possible. In Figure 3 all rates and durations have been logarithmically transformed in an effort to make the data more Gaussian and in the light of suggestions by many authors [e.g., *Biondini*, 1976; *Kedem et al.*, 1994] that many types of rainfall measurement are lognormally distributed. It can be seen that the transformation did not completely normalize the data; the drys have a heavy upper tail, and the contours in the wet scatterplot are distorted from being elliptical. It was assumed that the logarithms of the data could be well represented by a mixture of normals or, equivalently, that the original data could be well represented as a mixture of lognormals, with each component in the mixture corresponding to a state in the HSMM.
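
The working assumption — that the log-transformed data form a mixture of normals, one component per state — can be illustrated with a basic EM fit. This is a sketch only: the data are synthetic, two components are used for brevity, and the actual model ties the components to HSMM states rather than fitting an unconditional mixture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 'durations': a mixture of two lognormals, so the logs are a
# mixture of two normals (one hypothetical component per state).
logs = np.concatenate([rng.normal(0.0, 0.5, 600), rng.normal(2.0, 0.7, 400)])

def em_gmm(x, n_iter=200):
    """Basic EM for a two-component normal mixture on log-transformed data."""
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])   # spread the initial means apart
    sd = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: reweighted moment estimates
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd

w, mu, sd = em_gmm(logs)
```

Exponentiating each fitted normal component back to the original scale recovers the lognormal components referred to in the text.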

[16] In the nine state model of *Sansom* [1999], state E was attributed to poor digitizing and involved data over a wide range of rates but all with short durations, i.e., those breakpoints which would be digitized with the most difficulty. However, the HSMM could now be fitted with the shortest durations censored, so that only knowledge of their occurrence was used in the fitting procedure and their values were ignored. Thus, as given by *Sansom and Thomson* [1998], the wet data to the left of the solid line in Figure 3 were censored, and, with the number of wet states reduced to four, an eight state model was fitted. The BIC for the nine state model at Wellington was 69888.6, and that of the eight state model of the censored data was 69095.4, which is much smaller, indicating a significantly better model.
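
Censoring here means the fit uses only the fact that an observation fell below the threshold, not its value. The one-dimensional sketch below shows the two kinds of likelihood contribution; it is a simplification, since the paper's censoring line lies in the bivariate rate–duration plane, and the function name is illustrative.

```python
import math

def log_lik_point(log_x, mu, sigma, log_censor_at=None):
    """Likelihood contribution of one log-transformed observation under a
    N(mu, sigma^2) component. Below the censoring threshold only the
    occurrence is used: the contribution is the probability mass below
    the threshold, not the density at the recorded value."""
    if log_censor_at is not None and log_x < log_censor_at:
        # Censored: P(log X < threshold) via the normal CDF.
        z = (log_censor_at - mu) / sigma
        return math.log(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    # Uncensored: the usual normal log density.
    z = (log_x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
```

Note that every censored point contributes the same amount regardless of its (unreliable) recorded value, which is exactly why censoring protects the fit from poorly digitized breakpoints.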

[17] One of the reasons for fitting an HSMM to the data, rather than an HMM, is that the dwell time distributions of an HMM are implicitly geometric, and it is far from certain that rainfall behaves that way. Indeed, because it is uncertain how the dwell times are distributed, a nonparametric dwell time distribution over durations of 1, 2, …, D was initially fitted, where D needed to be long enough to capture the longest likely dwell periods but short compared to the amount of data. For all the data sets concerned, dwell periods of at most 30 breakpoints were expected and about 13,000 data were available, and so a D of 50 was chosen. The nonparametric dwell time distribution for Wellington is shown in Figure 4 with a geometric, fitted to the nonparametric probabilities, superimposed. Clearly, the geometric is a poor fit for three of the four states, and the necessity for an HSMM rather than an HMM is established. Also shown is the fit of a more parsimonious parametric model, fitted using the methods of *Sansom and Thomson* [2000, 2001]. This new parametric form, which is a modified geometric with a free parameter for the probability of a dwell of 1 and a geometric tail for dwells greater than 1, fits the nonparametric probabilities adequately and lowered the BIC to 67402.9, which again indicates a significantly better fit.
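
The modified geometric just described — one free probability for a dwell of 1, a geometric tail thereafter — can be written down directly. This is a sketch of the form described in the text; the exact parameterization used by *Sansom and Thomson* [2000, 2001] may differ, and the parameter names are assumptions.

```python
def modified_geometric(d, p1, q):
    """Dwell-time pmf: P(dwell = 1) = p1 is free, and dwells d >= 2 follow
    a geometric tail with ratio q, scaled by the remaining mass (1 - p1).
    The tail sums to (1 - p1), so the pmf is properly normalized."""
    if d == 1:
        return p1
    return (1.0 - p1) * (1.0 - q) * q ** (d - 2)
```

Unlike the plain geometric, whose mode is always at a dwell of 1 and whose tail is tied to that same parameter, this form can decouple the probability of a one-breakpoint visit from the persistence of longer visits, which is what Figure 4 shows the data require.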

#### 3.2. Alignment of the States Between the Stations and Final Selection of Model Order

[18] HSMMs with four dry states and four wet states were also fitted to the data from all the other stations, using both nonparametric dwell times and dwell times distributed as the modified geometric specified above. For the other 19 stations, the latter gave improvements in the BIC similar to that found for Wellington. Figure 5 shows the model structure in its final form, again using Wellington as an example, but the structure for the fit at this stage differed from the final one in only a few details. Furthermore, much similarity was found in the structures of all 20 fits, with the relative locations and scales of all states and the interconnections between the states being qualitatively similar from station to station. The main differences from the structure of Figure 5 were in the dry states, where more separation occurred between the two with the shortest durations, while one of the others often tended to stretch rather unrealistically from the shortest durations to the longest. However, the fitting procedure does not label the states except as state 1, state 2, etc., and, in general, it was not the case that state N at a particular station was equivalent to state N at any other station.

[19] Using Wellington as a convenient standard, for each other station in turn, all combinations of a subset of its model parameters were compared to the same subset from Wellington using *k*-means clustering [*Hartigan and Wong*, 1979]. Thus, for example, if only three states were concerned and S_{1}, S_{2}, and S_{3} represented the parameters of the states at a station and W_{1}, W_{2}, and W_{3} those at Wellington, then clustering would be sought in the following seven-observation multivariate data set, i.e., {W_{1},W_{2},W_{3}; S_{1},S_{2},S_{3}; S_{1},S_{3},S_{2}; S_{2},S_{1},S_{3}; S_{2},S_{3},S_{1}; S_{3},S_{1},S_{2}; S_{3},S_{2},S_{1}}. The clustering was performed in stages, with two clusters being found at each stage and the combinations in the cluster not containing the Wellington values being discarded. This was repeated until only one combination clustered with the Wellington values, and it was taken to have its states aligned with Wellington's. Occasionally, more than one combination was in the final cluster, and a selection was made manually. A final visual check was made using plots similar to Figure 5, and some further changes were made. The alignment of the states was not strictly necessary for the next stage of the fitting, but it did facilitate the identification of poorly fitted states, which was made through the estimation errors of the model parameters.
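
The effect of the staged clustering is to find the permutation of a station's states that groups with Wellington's parameter values. The sketch below simplifies the staged *k*-means procedure to directly selecting the nearest permutation — a stand-in with the same end result under well-separated states, not the paper's algorithm — and the example arrays are hypothetical.

```python
import numpy as np
from itertools import permutations

def align_states(wellington, station):
    """Return the ordering of the station's states whose concatenated
    parameter vector is closest to Wellington's (a simplification of the
    staged two-cluster k-means search described in the text)."""
    w = wellington.ravel()
    best = min(permutations(range(len(station))),
               key=lambda p: np.linalg.norm(station[list(p)].ravel() - w))
    return list(best)

# Hypothetical example: three states, two parameters each
# (say, mean log rate and mean log duration).
wellington = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
station = np.array([[10.1, 9.9], [0.2, 0.1], [5.1, 4.8]])
order = align_states(wellington, station)
```

Here `order` gives, for each Wellington state, the index of the matching station state, so `station[order]` is the aligned parameter set.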

[20] To assess estimation errors, fits were made to simulated data based on the parameters estimated by the fit to the actual data. At each station, 50 simulated data sets and 50 sets of parameter estimates were created in this way using much the same number of breakpoints as that in the original data set for the station concerned. (Technically, the number kept the same was that of the state visits, which was estimated from the original data using equation (6) from *Sansom and Thomson* [2001].) The standard error of each parameter estimate from the actual data was then estimated by its standard deviation over the 50 simulations, with any bias in the estimation procedure being checked by comparing the parameter estimates from the actual data set to the mean of that parameter's estimates from the 50 simulations. The statistic of interest (*Z*) is the ratio of the difference between the mean, over the 50 simulations, of the parameter (*p*_{m}) and the estimate of the parameter (*p*_{e}) to the standard deviation, over the 50 simulations, of the parameter (*s*_{m}) divided by the square root of 50:

*Z* = (*p*_{m} − *p*_{e}) / (*s*_{m}/√50)   (1)

[21] Rather than computing such statistics for all 72 parameters of the model, only those for the 12 location parameters (i.e., the four mean rates and four mean durations for the wets and the four mean durations for the drys) were found. For unbiased estimates this statistic should be distributed as a standard normal; however, 38% of them had values less than −3, and 8% had values greater than +3, leaving only 54% between −3 and +3 rather than the 99.7% that might be expected for standard normal variates. The parameters most affected were those for the wet states that had been most heavily censored (by up to 10% each), although the overall censoring was at most 2% of the data.
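
The bias check of equation (1) amounts to the following computation per parameter. The function and variable names are not from the paper; only the formula is.

```python
import numpy as np

def bias_statistic(actual_estimate, simulated_estimates):
    """Z of equation (1): the difference between the mean of the parameter's
    estimates over the simulations (p_m) and the estimate from the actual
    data (p_e), scaled by the standard error of that mean (s_m / sqrt(n))."""
    sims = np.asarray(simulated_estimates, dtype=float)
    n = len(sims)  # 50 simulations in the paper
    return (sims.mean() - actual_estimate) / (sims.std(ddof=1) / np.sqrt(n))
```

If the estimation procedure is unbiased, Z behaves like a standard normal variate, so values repeatedly outside ±3 flag biased parameters, which is how the censoring problems discussed next were detected.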

[22] The initial censoring line had been based on that successfully used by *Sansom and Thomson* [1998]; simulations were now used to assess the optimum positioning of a censoring line. From the parameters fitted to the actual data sets at each of the 20 stations, data sets were simulated, which were then censored to varying degrees, and the model was refitted. Both the slope and the position of the censoring line had to be changed from those initially chosen, with the solid line in Figure 3 being a suitable compromise between removing enough poor data to prevent them from adversely affecting the fit and not removing so much that the biases found above reoccurred. The overall censoring was reduced to 0.5% or less. The BICs increased (up to 67566.3 for Wellington), but, on repeating simulations to assess the estimation errors of the fitted model parameters, the statistic of equation (1) was more normally distributed, with 88% between −3 and +3, 2% greater than +3, and 10% less than −3. However, half of those below −3 were for the mean duration of the longest dry periods. This general bias toward a shorter mean duration for state I could result from data defects and/or from its representation as a lognormal component being an unsuitable choice.

[23] Defects in the dry data, which had not been censored, were unlikely to occur at the high end, where the length of the durations was much greater than any error that manual digitizing could introduce, but the shortest periods were prone to poor digitizing, with a high probability of unreasonably short periods being spuriously introduced. Such an excess of short dry periods might give rise to the tendency, as noted earlier, for state I to stretch from the shortest durations to the longest. Thus censoring of short dry durations was introduced and optimized through simulations and refitting after various censoring thresholds had been applied. In general, the removal of the few extremely short durations, less than 0.1% of the whole data set, resulted in significant differences in the parameter estimates, but any further censoring rapidly led to much larger differences; hence only this minimal censoring was adopted.

[24] With regard to the suitability of state I being represented by a lognormal component, *Sansom* [1995] and *Sansom and Thomson* [1998] refer to a companion to state I which might be termed state M, where M refers to multiple since it originates from the concatenation of two or more other dry states. Primarily, it would be expected to arise if, between two occurrences of state I, a weak event took place which did not give any precipitation at the observation site. Synoptic weather observations, which commonly report precipitation near to but not at the point of observation, confirm that such events do occur. Such cases cannot be recognized from the data alone, and, even if they could be, those dry durations could not be separated into their parts, but the existence and handling of such behavior should be incorporated within the model.

[25] Three approaches were tried. First, a fifth dry state was introduced, but this state M produced as many biased estimates as state I had previously; also, the BICs increased markedly, which was mainly due to the 10 extra parameters in the model. Second, rather than introduce an extra state, state I was modeled as a mixture of two lognormals, which ensured that the two components of state I would behave in the same way regarding transitions to other states, and this increased the number of model parameters by only three. This model structure still produced biases, but only in 9% of the parameters, which were not all associated with state I, and the BICs all decreased; however, for all stations one of the other dry states had a significant presence across the whole range of dry durations. The third approach was to retain state I as a mixture but reduce the total number of dry states to three, since in the previous scheme the state stretching across all durations clearly had no particular function and might well be superfluous. As with the previous scheme, 9% of the parameter estimates had some bias, again not all associated with state I, but most of the BICs improved even further, and none of the dry states stretched across the whole duration range.

[26] Thus a model with three dry states, with state I requiring a two-component mixture, was best, but having reduced the number of states, it seemed prudent to check that the second component of state I was strictly necessary. When it was dropped, 6% of the parameters still had some bias, all of the BICs increased, and once again state I stretched across the whole duration range; the second component of state I was therefore needed. The components of this HSMM fit to the Wellington data are shown in Figure 5, where the univariate and bivariate normal component distributions are shown scaled to a size appropriate to the relative frequencies of the states. The scaling factors for the wet states are determined from the estimated transition probability matrix and the dwell time distributions. A comparison between the fit and the data is given in Figure 3. For the drys the overall fitted distribution is superimposed on the histogram and shows a good fit between the data and the model. For the wets the percentage difference between the empirical distribution of the wets and the overall fitted distribution is shown by dashed lines. It can be seen in Figure 3 that the bulk of the wets is modeled well, and even around the extremes the absolute difference is not often more than 5%.

[27] The bottom panel of Figure 5 shows those transitions that collectively account for 94% of all the transitions that occurred in the data. The states are placed in the same relative positions as in the top panels, except that the I and M substates are shown together (i.e., the solid dot on the right) at a position that is the mean of their combined distribution. It can be seen that eight transitions, in four pairs, are indicated as each contributing over 4% to the total number, and they collectively account for 70% of the total number. These four connections partitioned six of the seven states into two groups. The first group linked S1 with R and I/M (to which R1 and R2 are also weakly linked), and the second group joined S2 with S and I/M. The first group appears to represent the sequence of showers and rain associated with a frontal system; the second group represents periods of convective activity. The two groups are weakly linked from S2 to S1 but are largely independent since they are otherwise separated by interevent drys. It is also interesting to note that heavy and persistent rain (R2) always occurs after an interevent dry and decays to lighter and less persistent rain (R1).

#### 3.3. Restricting Transitions and the Final Model

[28] The necessary conditions for the proposed spatial variability scheme to be feasible and effective are the following: (1) the same model structure applies over the whole area where the spatial variability is being found, (2) the model's parameters vary slowly and smoothly with dependence only on position and orography, and (3) the model is physically meaningful and capable of similar interpretation at all stations. The first condition has been imposed, but the others needed to be verified. It has been noted that all the data sets were similar to Figure 3 and that, at least superficially, all the model fits were similar to Figure 5; thus the second condition could be assumed.

[29] To further ensure that the fits from station to station were as compatible as possible, a refitting procedure was followed in which each station was refitted using the fits from itself and from all the other stations as initializations, with the fit having the highest likelihood then taken as the best fit. In general, it was not the case that the best fit at a station was the one initialized from its current fit, and refitting was repeated until most stations were best fitted from themselves. It is this fit for Wellington that is shown in Figure 5. Fits to the other stations were broadly similar, with the main differences being in the position of state I and in the less important connections between states (i.e., the minor structure depicted within plots of the kind shown in the bottom panel of Figure 5).

[30] Twelve of the stations had I and M states similar to Wellington's, where their relative locations and scales can be justified as follows: State I represents the lognormal distribution of interevent durations (variate *X*), whose mean and variance relate to those of the normal distribution, N(μ,σ^{2}), of the logarithmically transformed data, i.e.,

E(*X*) = e^{μ+σ^{2}/2},  var(*X*) = e^{2μ+σ^{2}}(e^{σ^{2}} − 1).

If two interevent durations became concatenated because the event between them resulted in no precipitation at the observation site, then the total duration would be an observation from the distribution of the sum of two such durations or, when logarithmically transformed and approximated as normal, from the distribution N(μ′, σ′^{2}). By matching the mean and variance of the sum, it can be shown that

σ′^{2} = ln[1 + (e^{σ^{2}} − 1)/2],  μ′ = μ + ln 2 + (σ^{2} − σ′^{2})/2.

Since the durations spanning a missed event (or events) might also include one or more S or R type drys, an exact fit to this argument was not found. However, for the values encountered, the relations imply that σ′^{2} is somewhat less than σ^{2} and that μ′ exceeds μ by ∼1 or 2; these features can be seen in the Wellington fit and in the 11 fits similar to it. By refitting using initializations from these 12 stations, similar features were imposed at the other eight stations, and some progress was made toward ensuring that the third condition was satisfied, that is, that the model has the same physical interpretation at all stations.
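
The I-to-M relation can be made concrete by moment matching — one interpretation of the concatenation argument, not necessarily the authors' exact derivation: treat an M duration as the sum of two state I durations, equate means and variances, and solve for the lognormal parameters of the concatenated state.

```python
import math

def concatenated_params(mu, sigma2):
    """If state I durations are lognormal(mu, sigma2), the sum of two such
    durations has mean 2*E(X) and variance 2*var(X); fitting a lognormal
    (mu', sigma2') to those moments gives the state M parameters."""
    sigma2p = math.log(1.0 + 0.5 * (math.exp(sigma2) - 1.0))
    mup = mu + math.log(2.0) + 0.5 * (sigma2 - sigma2p)
    return mup, sigma2p

mup, sigma2p = concatenated_params(2.0, 2.0)
```

For illustrative values μ = 2 and σ² = 2 this gives σ′² ≈ 1.43 < σ² and μ′ − μ ≈ 0.98, in line with the text's observation that σ′² is somewhat less than σ² and that μ′ exceeds μ by roughly 1 to 2.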

[31] Apart from forbidding self-transitions, no restrictions had been imposed on which transitions could occur. Within all the data sets a dry period never follows a dry period; thus all the probabilities for transitions between the dry states were zero. In addition, some of the remaining possible transitions had probabilities which were close to zero, and by setting them to zero in all initializations it was possible to impose further common structure onto the fits at all stations. As noted above, variation existed between stations' fits in the less important connections between states, and if the mean across the stations for a particular transition probability was under 0.05, then it was set to zero for all stations. This reduced the number of parameters associated with the transition matrix from 35 to 17, and the BICs for all but one of the stations decreased. Also, it became clear that state R2 could have two functions: either it connected, after a mean persistence of six breakpoints, to state R1 and occasionally state I and so was associated with frontal activity, or, after a mean persistence of three breakpoints, it connected to state S2 and state S and so was associated with convection. The relatively high persistence, which would not generally be associated with convection, supports frontal activity as the more likely function for state R2. However, at all stations state R2 can be forced into taking on one function or the other by setting the appropriate transition probabilities to zero. In each case the number of parameters was reduced from 17 to 12, and the BICs for 16 of the 20 stations indicated that the model having state R2 involved in frontal activity was better.
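
The thresholding rule can be sketched as follows. In the paper the zeros were imposed in the initializations and the model refitted; here a row renormalization stands in for that refit, the toy matrices are illustrative (and, unlike the real model, allow self-transitions), and the function name is an assumption.

```python
import numpy as np

def restrict_transitions(mats, threshold=0.05):
    """Zero, at every station, any transition whose mean probability across
    stations falls below the threshold, then renormalize each row so the
    probabilities still sum to 1. `mats` has shape (stations, S, S)."""
    mats = np.asarray(mats, dtype=float)
    mask = mats.mean(axis=0) >= threshold   # transitions to keep, shared by all
    out = mats * mask
    return out / out.sum(axis=2, keepdims=True)

# Two hypothetical stations, two states: the 0 -> 1 transition averages
# 0.02 < 0.05 across stations, so it is removed everywhere.
r = restrict_transitions([[[0.97, 0.03], [0.5, 0.5]],
                          [[0.99, 0.01], [0.4, 0.6]]])
```

Because the mask is computed from the across-station mean, the same sparsity pattern is imposed at every station, which is how the common structure described in the text is enforced.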

[32] To again ensure that fits from station to station were as compatible as possible, the refitting procedure was followed. For state I/M, 10 stations retained the physically acceptable pattern described above, and, as before, by refitting using initializations from only those 10 stations, similar features were imposed at the others. The BICs for those stations increased, but all BICs, except at two stations, were still smaller than those for the unrestricted transitions model. Also, refitting to data sets simulated using these fits resulted in nearly 99% of values of the statistic of equation (1) being between −3 and +3, with a worst case of −5.9 for a parameter from one of the stations whose BIC was larger than its value for the unrestricted transitions model. For this station, state I/M was located at much lower values than for all other stations, so it was not used in the determination of the spatial variation, especially as it was located close to another station; it was one of the overlapping pair just south of 41°S and east of 175°E in Figure 1.