This paper addresses the selection of multivariate generalized autoregressive conditional heteroskedastic (GARCH) models in terms of variance matrix forecasting accuracy, with a particular focus on relatively large-scale problems. We consider 10 assets from the New York Stock Exchange and compare 125 models based on 1-, 5- and 20-day-ahead conditional variance forecasts over a period of 10 years using the model confidence set (MCS) and the superior predictive ability (SPA) tests. Model performance is evaluated using four statistical loss functions which account for different types and degrees of asymmetry with respect to over-/under-predictions. When considering the full sample, MCS results are strongly driven by short periods of high market instability during which multivariate GARCH models appear to be inaccurate. Over relatively unstable periods, i.e. the dot-com bubble, the set of superior models is composed of sophisticated specifications such as orthogonal and dynamic conditional correlation (DCC), both with leverage effect in the conditional variances. However, our results show that the orthogonal specifications, unlike the DCC models, tend to underestimate the conditional variance. Over calm periods, a simple assumption like constant conditional correlation and symmetry in the conditional variances cannot be rejected. Finally, during the 2007–2008 financial crisis, accounting for non-stationarity in the conditional variance process generates superior forecasts. The SPA test suggests that, regardless of the period, the best models do not provide significantly better forecasts than the DCC model of Engle (2002, Journal of Business and Economic Statistics 20: 339–350) with leverage in the conditional variances of the returns. Copyright © 2011 John Wiley & Sons, Ltd.