## 1. Introduction

The problem of combining forecasts has received much attention lately because of the availability of numerous forecasts from several modelling institutions. Virtually every article that has been written on this topic agrees that combining multiple forecasts leads to increased accuracy and reliability. However, there is considerable disagreement on the optimal method for combining forecasts. A question that arises repeatedly is whether weighting forecasts unequally yields significantly better predictions than weighting forecasts equally. Numerous studies have shown that equal weighting, especially the arithmetic multi-model mean, is a competitive method for producing accurate and probabilistically reliable forecasts (Kharin and Zwiers, 2002; Peng *et al.*, 2002; Hagedorn *et al.*, 2005; DelSole, 2007; Weigel *et al.*, 2010). The consistently good performance of the multi-model mean has also been recognized in the economic literature, as documented routinely in the *International Journal of Forecasting*, *Operational Research Quarterly* and *Journal of Forecasting* (Clemen, 1989). Nevertheless, many of these results are based on cross-validation experiments that lack a probabilistic statement as to whether an unequal weighting strategy yields *significantly* better cross-validated predictions than equal weighting. In addition, cross-validation requires the definition of a particular strategy for unequal weighting, such as ordinary least-squares or ridge regression, with which to compare equal weighting. It would be desirable to test whether unequal weighting is significantly better than equal weighting without having to specify the strategy for unequal weighting.
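The cross-validation comparison described above can be sketched in a few lines. The following is a minimal illustration, not the test proposed in this article: it generates hypothetical synthetic forecasts (each model forecast assumed to equal the truth plus independent noise) and compares the leave-one-out cross-validated squared error of the equal-weight multi-model mean against one particular unequal weighting strategy, ordinary least squares. All names and parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (assumption for illustration): K model forecasts of a
# common target, each forecast = truth + independent noise.
N, K = 40, 5                                   # N verification cases, K models
truth = rng.normal(size=N)
forecasts = truth[:, None] + rng.normal(scale=1.0, size=(N, K))

def loo_mse(weight_fn):
    """Leave-one-out cross-validated mean squared error of a combination."""
    errs = []
    for i in range(N):
        train = np.delete(np.arange(N), i)          # hold out case i
        w = weight_fn(forecasts[train], truth[train])
        errs.append((forecasts[i] @ w - truth[i]) ** 2)
    return float(np.mean(errs))

# Equal weighting: the arithmetic multi-model mean.
equal = lambda X, y: np.full(X.shape[1], 1.0 / X.shape[1])
# One specific unequal weighting strategy: ordinary least squares.
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

print(f"equal-weight LOO MSE: {loo_mse(equal):.3f}")
print(f"OLS-weight   LOO MSE: {loo_mse(ols):.3f}")
```

Note that this comparison is tied to the specific choice of OLS as the unequal weighting strategy, which is precisely the limitation the proposed test is designed to avoid.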

The purpose of this article is to propose a statistical test for whether a multi-model combination based on unequal weighting has significantly smaller errors than one based on equal weighting. The test can be derived from a standard framework for testing hypotheses in linear regression models. Accordingly, we first review the standard framework in section 2, and specialize it to our particular problem in section 3. To illustrate certain aspects of the test, we apply it to an idealized example based on synthetic data in section 4. The test is then applied to a set of multi-model hindcasts of seasonal mean temperature and precipitation. The dataset is described in section 5 and the results of the test are described in section 6. This article concludes with a summary and discussion of results.