## 1. Introduction

[2] Numerical weather prediction (NWP) is an established scientific discipline whose beginnings predate the advent of computers [*Richardson*, 1922; *Yoden*, 2007]. As NWP models have grown in number and complexity [*Charney et al.*, 1950; *Lynch*, 2008], model verification has become an increasingly important task. Objective evaluation of forecast quality is crucial for both scientific and operational purposes [*Brier and Allen*, 1951]. But a consensus about what constitutes a good-quality forecast is difficult to achieve, even if attention is confined to just two aspects of quality: accuracy, defined as agreement between pairs of observations and forecasts; and skill, measured with respect to a reference standard of performance [*Murphy*, 1993].

[3] A variety of verification procedures have been developed; a review can be found, e.g., in *Wilks* [2006]. Here, we review the most commonly used elementary yardsticks for accuracy: the bias (a.k.a. unconditional bias), the root-mean square error (RMSE) and Pearson's correlation coefficient. Each measure has its own strengths and shortcomings, and the shortcomings of one are not necessarily addressed by the others:

[4] 1. The bias indicates the overall systematic difference between forecast and reality, so that useful guiding notions like “the model is too wet/dry or too warm/cold” can be derived, but whether a bias is large or small cannot be judged from its value alone, without context.

[5] 2. The RMSE gives a good estimate of the overall error between the model and the observations, but it tends to vary directly with the standard deviation of the observed quantities [*Koh and Ng*, 2009]. This means that the size of the RMSE does not reflect the model's performance alone: for example, small errors for temperature and humidity in the tropics and large errors for wind in the upper troposphere are to be expected from the correspondingly small or large variability of the physical quantities themselves.

[6] 3. The correlation coefficient is useful for detecting errors arising from phase lead or lag between forecast and observation, but it is independent of the difference in the variance of forecast and observation. A correlation of one is therefore of dubious significance if the forecast variance is much smaller than the observed variance and the difference is left uncorrected.
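The three measures above can be sketched in a few lines of Python. This is a minimal illustration and not the formulation used later in the paper: it assumes the discrepancy is taken as forecast minus observation, and the sample values are hypothetical. The damped forecast has the correct phase but one tenth of the observed amplitude, so the bias is zero and the correlation is one even though the RMSE is large.

```python
import math

def bias(forecast, observed):
    """Mean discrepancy, assuming D = F - O (unconditional bias)."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(observed)

def rmse(forecast, observed):
    """Root-mean square of the discrepancy D = F - O."""
    n = len(observed)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)

def correlation(forecast, observed):
    """Pearson's correlation coefficient between forecast and observation."""
    n = len(observed)
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_f * var_o)

# Hypothetical observations, and a forecast in phase with them but with
# only one tenth of the observed amplitude.
observed = [0.0, 10.0, 20.0, 10.0, 0.0, -10.0, -20.0, -10.0]
damped = [0.1 * o for o in observed]

print("bias        =", bias(damped, observed))
print("RMSE        =", rmse(damped, observed))
print("correlation =", correlation(damped, observed))
```

No single number here tells the whole story: the bias and the correlation both look perfect, and only the RMSE reveals the amplitude mismatch.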

[7] Since the error information given by any one error metric is always either incomplete or not detailed, there is a need for a suite of suitably chosen error metrics. One example is the decomposition of the mean square error (MSE) into correlation, conditional bias, unconditional bias, and possibly other contributions [cf. *Murphy*, 1988, equation (12)]. However, the price of decomposing a single metric into many components in order to identify the nature of the error is that there are simply too many metrics to examine.
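As a minimal illustration of such a decomposition (a standard identity, not necessarily the form of the cited equation), the MSE separates into an unconditional-bias term and a centered term. The notation here is an assumption: the discrepancy is D = F − O, angle brackets denote a sample mean, and σ_{D} is computed with 1/N normalization.

$$
\mathrm{MSE} = \langle D^2 \rangle = \langle D \rangle^2 + \sigma_D^2
$$

The first term on the right is the squared bias, and the second is the square of the centered RMSE.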

[8] The situation was improved when *Taylor* [2001] recognized that a simple geometrical relation exists between the centered RMSE and the standard deviations of forecast and observation and proposed a compact diagram to visualize these metrics. The Taylor diagram has since become generally accepted and is useful for comparing RMSE or other skill scores between different models.
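The geometrical relation is a law of cosines: the squared centered RMSE equals the sum of the forecast and observed variances minus twice the product of their standard deviations and the correlation. The sketch below checks this numerically. It is illustrative only, assumes D = F − O and 1/N normalization, and uses hypothetical sample values and helper names.

```python
import math

def mean(x):
    return sum(x) / len(x)

def std(x):
    """Population standard deviation (1/N normalization)."""
    m = mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def correlation(f, o):
    """Pearson's correlation coefficient with 1/N normalization."""
    mf, mo = mean(f), mean(o)
    cov = sum((a - mf) * (b - mo) for a, b in zip(f, o)) / len(o)
    return cov / (std(f) * std(o))

def centered_rmse(f, o):
    """Standard deviation of the discrepancy, assuming D = F - O."""
    return std([a - b for a, b in zip(f, o)])

# Hypothetical sample values
observed = [1.0, 3.0, 2.0, 5.0, 4.0]
forecast = [1.5, 2.0, 2.5, 6.0, 3.0]

s_f, s_o = std(forecast), std(observed)
rho = correlation(forecast, observed)

lhs = centered_rmse(forecast, observed) ** 2
rhs = s_f ** 2 + s_o ** 2 - 2.0 * s_f * s_o * rho  # law of cosines
print(lhs, rhs)
```

The two printed values agree to rounding error, which is why the two standard deviations and the centered RMSE can be drawn as the sides of a triangle whose included angle has cosine equal to the correlation.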

[9] Another issue that is often overlooked in verification studies is the rigorous mathematical generalization of diagnostics from scalar to vector variables. Common methods for analyzing the error of vectors, such as wind, invariably break them up into Cartesian or polar components, and each component is treated separately as a scalar [*Anthes et al.*, 1989; *Qian et al.*, 2003; *Hanna and Yang*, 2001; *Hogrefe et al.*, 2001]. The result is that the information associated with the covariance between vector components is missed [*Koh and Ng*, 2009]. From a mathematical point of view, components of vectors are not scalars and do not respect the same invariance principles under transformations of the reference frame. The danger is that the diagnostics of separate vector components, even when taken together, do not completely describe physical reality or might even misrepresent it.
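The frame dependence of component-wise diagnostics can be seen in a small sketch. It is a hypothetical illustration, not a diagnostic from this paper: the error vectors are invented, and the helper names are assumptions. The same set of two-dimensional error vectors is expressed in two reference frames that differ by a 45° rotation. The per-component RMSE values change completely, whereas the RMSE of the full vector does not.

```python
import math

def component_rmse(errors):
    """RMSE of each Cartesian component, treated as a separate scalar."""
    n = len(errors)
    rmse_x = math.sqrt(sum(dx * dx for dx, _ in errors) / n)
    rmse_y = math.sqrt(sum(dy * dy for _, dy in errors) / n)
    return rmse_x, rmse_y

def vector_rmse(errors):
    """RMSE of the error vector as a whole (root-mean squared magnitude)."""
    return math.sqrt(sum(dx * dx + dy * dy for dx, dy in errors) / len(errors))

def rotate_frame(errors, angle):
    """Components of the same vectors in a frame rotated by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * dx + s * dy, -s * dx + c * dy) for dx, dy in errors]

# Hypothetical wind errors (m/s) whose two components are perfectly correlated
errors = [(2.0, 2.0), (-1.0, -1.0), (3.0, 3.0), (-2.0, -2.0)]
rotated = rotate_frame(errors, math.pi / 4)

print("original frame:", component_rmse(errors), vector_rmse(errors))
print("rotated frame: ", component_rmse(rotated), vector_rmse(rotated))
```

In the original frame both components show the same error, which hides the fact that all of the error lies along one direction. That information is carried by the covariance between the components and only becomes visible in the rotated frame.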

[10] In the first part of this paper, we aim to develop a systematic suite of elementary diagnostics (some new, others previously published) that can (1) shed light on different aspects of model *accuracy*; (2) be neatly summarized in a few diagrams that relate the diagnostics geometrically; and (3) be easily generalized from scalar to vector variables. The total error is resolved into bias and pattern error, and the latter is further decomposed into errors arising from the mismatch in the phase or amplitude of variations. For a scalar, the error metrics can be succinctly summarized in two diagrams, whereas one more diagram is needed for two-dimensional vectors to investigate the anisotropy of the vector error distribution.

[11] The suite of diagnostics is put to the test in the second part of this paper by assessing the performance of the Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS^{®}). More than merely demonstrating the utility of the error metrics, the objective here is to contrast the model performance in the tropics and extratropics, so as to evaluate the current-day *skill* of tropical NWP. COAMPS is a limited-area mesoscale model originally developed for NWP in the USA and has been demonstrated to be efficacious in various parts of the world [e.g., *Kong*, 2002; *Liu et al.*, 2007]. But the model is largely unverified in tropical regions such as Southeast Asia. The earlier work of *Koh and Ng* [2009] verified the COAMPS model against two months of intensive radiosonde observations from the South China Sea Monsoon Experiment (SCSMEX) [*Ding et al.*, 2004] using a more restricted set of diagnostics. With the suite of error metrics advanced here, we extend the effort by verifying the COAMPS model against one year of radiosonde data for Southeast Asia and comparing the results with those for the southeastern USA.

[12] The paper is organized as follows: Section 2 contains a brief review of the current diagnostic framework. In section 3, we advance a diagnostic framework for model accuracy and summarize these diagnostics into two (for scalar variables) or three (for vector variables) diagrams. Section 4 describes the COAMPS model and the observation data set used for the verification study. Section 5 highlights the main advantages of the proposed diagnostic tools and section 6 evaluates COAMPS model performance in the tropics and extratropics. The main conclusions are summarized and discussed in section 7. Table 1 provides a list of acronyms and symbols used in this paper for convenient reference.

Table 1. Acronyms and symbols used in this paper.

Acronym/Symbol | Meaning | Defining Equations |
---|---|---|
MSE | mean square error | - |
RMSE | root-mean square error | (3) |
NRMSE | normalized root-mean square error | (26) |
NPE | normalized pattern error | (27) |
NBias | normalized bias | (28) |
O | observed variable | - |
F | forecast/modeled variable | - |
D | discrepancy of forecast/model from observation | (1) |
σ_{O} | standard deviation of observation | - |
σ_{F} | standard deviation of forecast/model | - |
σ_{D} | standard deviation of discrepancy, or centered RMSE | - |
_{F} | forecast variability normalized by observed variability | |
_{D} | centered RMSE normalized by observed variability | |
*_{F} | anti-symmetric measure of variance similarity, used in *Yu et al.* [2006] | (11) |
*_{D} | anti-symmetric measure of normalized error variance | (12) |
ν | fractional difference of forecast from observation | |
ρ | correlation (for scalars) | (4) |
ρ | correlation (for vectors) | (18) |
ψ | angle on the Taylor diagram | (9) |
η | variance similarity | (19) |
η* | modified variance similarity | (33) |
ϕ | angle on the correlation-similarity diagram | (20) |
α | normalized error variance | (17) |
α* | one example of skill score | (22) |
δ | normalized root-mean square error, NRMSE | (26) |
σ | normalized pattern error, NPE | (27) |
μ | normalized bias, NBias | (28) |
γ | angle on the error decomposition diagram | (29) |
θ | preferred direction of vector pattern error | - |
ε_{s} | symmetrized definition of eccentricity, used to measure vector error anisotropy | (32) |
ε | conventional definition of eccentricity | (B3) |
β | alternative symmetrized definition of eccentricity, used in *Koh and Ng* [2009] | |
a | larger eigenvalue of var(D) | - |
b | smaller eigenvalue of var(D) | - |