Corresponding author: I. Honkonen, Earth Observation, Finnish Meteorological Institute, PL 503, 00101, Helsinki, Finland. (email@example.com)
 We study the performance of four magnetohydrodynamic models (BATS-R-US, GUMICS, LFM, OpenGGCM) in the Earth's magnetosphere. Using the Community Coordinated Modeling Center's Run-on-Request system, we compare model predictions with magnetic field measurements of the Cluster, Geotail and Wind spacecraft during a multiple substorm event. We also compare model cross polar cap potential results to those obtained from the Super Dual Auroral Radar Network (SuperDARN) and the model magnetopause standoff distances to an empirical magnetopause model. The correlation coefficient (CC) and prediction efficiency (PE) metrics are used to objectively evaluate model performance quantitatively. For all four models, the best performance outside geosynchronous orbit is found on the dayside. Generally, the performance of models decreases steadily downstream from the Earth. On the dayside most CCs are above 0.5 with CCs for Bx and Bz close to 0.9 for three out of four models. In the magnetotail at a distance of about −130 Earth radii from Earth, the prediction efficiency of all models is below that of using an average value for the prediction with the exception of Bz. Bx is most often best predicted and correlated both on the dayside and the nightside close to the Earth whereas in the far tail the CC and PE for Bz are substantially higher than other components in all models. We also find that increasing the resolution or coupling an additional physics module does not automatically increase the model performance in the magnetosphere.
 The growing interest in space weather forecasting from both government and industry is also increasing the need for space weather model development. The increasing amount of infrastructure and people that can be affected by severe space weather events demand reliable forecasting of those events and their effects both in space and on the ground. This in turn requires systematic testing, verification and validation of space weather models and the objective evaluation of their suitability for a particular purpose. The international need for model improvement and validation was highlighted, for example, in the recent European Commission's Space Weather Awareness Dialogue [Krausmann and Bothmer, 2012].
 A good overview of the process of verifying and validating a space plasma model is given by Ledvina et al. . The verification of a code consists of, for example, comparing the results to an analytic solution (e.g., using the method of manufactured solutions employed recently by Welling et al.  for verifying SpacePy); monitoring conserved quantities, symmetries, and other predictable outcomes; or comparing results to those from other codes. Verification must happen before validation in order to make sure that the equations chosen for modeling the system are being solved correctly, and the results should be published especially in the case of a new model or a scheme. In this regard global magnetohydrodynamic (MHD) models leave something to be desired as only the BATS-R-US (Block Adaptive Tree Solar-wind Roe Upwind Scheme) code has been verified against analytic or semi-analytic results in a publication [Powell et al., 1999]. Results for the ubiquitous shock tube test (see Ryu and Jones  for a plethora of examples) have not been presented for the MHD solver(s) of any global MHD model even though such tests have been conducted for all models used here. For example, the MHD solvers of GUMICS were used in several one, two, and three-dimensional tests by Honkonen et al. , but in addition to the parallel scalability results of all tests, only the result of a three-dimensional blast wave test with adaptive mesh refinement (AMR) was shown. The validation of a code can consist of controlled experiments designed to investigate a physical process, experiments specifically designed to validate codes or passive observations of physical events. Global MHD models have mostly been validated with many qualitative comparisons against observations by various spacecraft and, to our knowledge, no systematic quantitative comparisons have been conducted until recently.
 The Geospace Environment Modeling (GEM) 2008–2009 challenge represents the largest effort to date to validate global MHD models against observations using various objective metrics. Pulkkinen et al. quantified the performance of three global MHD models and two empirical models in reproducing observations of ground magnetic field during four geospace storm events. The models were compared against observations from 12 observatories located between 43.5° and 74° of geomagnetic latitude. Five different metrics, each applicable to different situations, were used to evaluate the model performance: root-mean-square difference, prediction efficiency, log-spectral distance, utility, and ratio of maximum amplitudes. Rastätter et al. used the same events and global MHD models as Pulkkinen et al.  along with two empirical magnetospheric models to quantify model performance at the geosynchronous orbit with respect to observations of magnetic field strength and elevation. The models were compared to observations from two NOAA Geosynchronous Operational Environmental Satellites (GOES) in each event using the prediction efficiency and log-spectral distance metrics. The most recent GEM 2008–2009 challenge comparison [Rastätter et al., 2013] quantified the ability of different physics-based and statistical models to reproduce the Dst geomagnetic activity index (http://wdc.kugi.kyoto-u.ac.jp/dstdir).
 Magnetospheric substorms are an essential part of the dynamics of near-Earth space [McPherron, 1991]: During the substorm growth phase, magnetic flux accumulates in the tail lobes due to dayside reconnection and is eventually released by rapid reconnection in the near-Earth magnetotail, which also starts the substorm expansion phase. The cross-tail current is diverted from the region of reconnection and flows along magnetic field lines to the midnight ionosphere where it flows westward for a (relatively) short distance, locally enhancing the westward electrojet before returning to the tail. Magnetospheric substorms and hence the dynamics of the near and far tail magnetosphere are also important from the point of view of space weather and can have a significant effect on technological systems even at ground level. For example, the largest geomagnetically induced currents (GIC) occur with highest probability during the substorm expansion phase about 5 min after the expansion onset below the corrected geomagnetic (CGM) latitude of 72° [Viljanen et al., 2006].
 In this work, the performance of four global MHD models is systematically evaluated in the Earth's magnetosphere and in the ionosphere during an event with multiple substorms on 18 Feb 2004. The global MHD models BATS-R-US (Block Adaptive Tree Solar-wind Roe Upwind Scheme), GUMICS-4 (Grand Unified Magnetosphere-Ionosphere Coupling Simulation), LFM (Lyon-Fedder-Mobarry), and OpenGGCM (Open General Geospace Circulation Model) are given identical solar wind input, and the results are compared to the magnetic field measurements of Cluster 1 [Balogh et al., 1997] within the magnetosheath, Geotail [Kokubun et al., 1994] in the near tail, Wind [Lepping et al., 1995] in the far tail, and the cross polar cap potential (CPCP) obtained from SuperDARN [Chisham et al., 2007]. The models' magnetopause standoff distances are also compared to the empirical magnetopause model of Lin et al. . All simulations are carried out through NASA's Community Coordinated Modeling Center (CCMC) Run-on-Request system (http://ccmc.gsfc.nasa.gov), and the settings used for the models are as close to each other as reasonably possible. The results presented here are also available through CCMC. Similarly to Pulkkinen et al.  and Rastätter et al. , the detailed scientific analysis of the effect of various model parameters on the quality of model results in the magnetosphere is left for future work. As Pulkkinen et al.  and Rastätter et al.  studied the model performance at ground level and at geosynchronous orbit, this study is a natural next step to these investigations by validating the code performance in the near and far tail during dynamical events. In section 2, we describe the models, the features, and parameters which were used in this work and the event that was simulated. In section 4, we present the model results with the corresponding measurements, and in section 5, we compare them using the correlation coefficient (CC) and prediction efficiency (PE) metrics. We discuss the results and analysis in section 6 and draw our conclusions in section 7.
2 Global MHD Model Features and Settings
 The features and settings of global MHD models used in this study are presented in Table 1. We emphasize that some of the models support a wide range of features and different settings, but we have listed only the ones used in this study. All models are executed through the CCMC Run-on-Request system and receive as input the solar wind data measured by the Advanced Composition Explorer (ACE) satellite [Stone et al., 1998] located at GSE (221, −22, 9) RE (Earth radii) during the simulated event provided by CCMC. The options and features used in all models are as close to each other as reasonably possible.
Table 1. Summary of Features and Settings of Global MHD Models Used in This Study, See the Text for Details
aThe dipole orientation is fixed in SM coordinates, but solar wind and solar EUV conditions are adjusted with time.
 While all the models solve the MHD equations in the magnetosphere and the same electrostatic potential equation in the ionosphere, there are some differences. Both BATS-R-US and GUMICS solve the ideal (i.e., inviscid and perfectly conducting), conservative, non-relativistic MHD equations [Powell et al., 1999; Janhunen et al., 2012], while LFM and OpenGGCM solve the MHD equations in a semi-conservative form where the total energy is replaced with the fluid energy [Lyon et al., 2004; Raeder et al., 2008]. In OpenGGCM, a resistive term is also included in the equation for the electric field [Raeder et al., 2008]. BATS-R-US solves the MHD equations using a second-order eight-wave approximate Riemann solver that maintains zero divergence of the magnetic field to truncation error. GUMICS primarily uses a first-order seven-wave approximate Riemann solver and is the only model to periodically remove the divergence of magnetic field with the projection method of Brackbill and Barnes . Both LFM and OpenGGCM use a constrained transport method for advecting the magnetic field which preserves the divergence of magnetic field to roundoff error. BATS-R-US and GUMICS use an adapted Cartesian mesh for the magnetosphere. In BATS-R-US, the grid is adapted at the start of the simulation in blocks of 6³ cells and is static afterwards, while in GUMICS, the grid is adapted during the simulation on a cell-by-cell basis based on local gradients of several plasma quantities and geometric considerations. LFM uses a distorted spherical grid and OpenGGCM uses a stretched Cartesian grid and neither uses AMR. BATS-R-US, GUMICS and LFM separate the magnetic field into perturbed and static background components [see Tanaka, 1994]. The number of cells in the magnetospheric grid is about 800 k in BATS-R-US, 400 k in GUMICS, 330 k in LFM, and 3.6 M in OpenGGCM.
 The inner boundary of the magnetosphere in the models is between 2 and 4 RE. The ionosphere and magnetosphere are coupled through field aligned currents (FAC) and electric potentials mapped between the ionosphere and the inner boundary of the magnetosphere along the Earth's dipole magnetic field. Field aligned currents are obtained from currents computed from the magnetic field in the inner magnetosphere. A two-dimensional electrostatic solver is used in the ionosphere to solve the ionospheric potential from FACs using the current continuity equation. The solved electric potential is used to set plasma flow in the inner magnetosphere. Merkin and Lyon  provides more details on the LFM potential solver. OpenGGCM also includes a three-dimensional dynamical model of the thermosphere which adds, for example, the effect of the neutral wind dynamo to the ionospheric electric potential solution.
3 Event Description and Data
 Magnetospheric substorms are an essential part of the dynamics of near-Earth space [McPherron, 1991], and hence, an event with multiple substorms was selected to assess the performance of global models in the Earth's magnetosphere. The solar wind at the ACE satellite during the event of 18 Feb 2004 along with the AE and Dst indices is shown in Figure 1. The delay from ACE to the magnetopause for the event was calculated by Honkonen et al.  to be 46 min. During the event, the solar wind density fluctuated between 1 and 3 cm⁻³ with large jumps recorded at 16:40 and 19:45 UT. The interplanetary magnetic field (IMF) z component changes sign more than a dozen times, varying between −8 and 8 nT. This leads to modest driving of the magnetosphere-ionosphere system as indicated by the AE index being over 600 for several hours during the event. Solar wind velocity is slightly above the average staying between 430 and 490 km/s.
 Undelayed solar wind is used as input for all models and, consequently, the delay from ACE to the solar wind boundary of the models must be taken into account separately. We assume a constant solar wind speed when calculating the delay for each model. In BATS-R-US the solar wind boundary is located at 33 RE which translates to a delay of 2505 s. In GUMICS, LFM and OpenGGCM the solar wind boundaries are at 32, 30 and 60 RE, which translate to delays of 2518, 2545 and 2145 s, respectively. These delays were used for all model results when comparing against observations.
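As a rough check of the quoted delays, the travel time can be sketched as the distance along the Sun-Earth line divided by a constant speed. The speed of 478 km/s used below is an assumption: it is not stated in the text, but it lies within the 430 to 490 km/s range reported for the event and approximately reproduces the quoted values. The Earth radius of 6371 km and all function and variable names are likewise illustrative choices.

```python
# Sketch: propagation delay from ACE to each model's solar wind boundary,
# assuming a constant solar wind speed along the Sun-Earth line.
R_E_KM = 6371.0      # mean Earth radius in km (assumed value)
ACE_X_RE = 221.0     # GSE x position of ACE from the text, in Earth radii
V_SW_KM_S = 478.0    # ASSUMED constant solar wind speed; not given in the text


def ace_delay_seconds(boundary_x_re, v_km_s=V_SW_KM_S):
    """Travel time in seconds from ACE to a boundary at boundary_x_re (in R_E)."""
    return (ACE_X_RE - boundary_x_re) * R_E_KM / v_km_s


# Solar wind boundary positions (in R_E) as listed in the text.
boundaries_re = {"BATS-R-US": 33.0, "GUMICS": 32.0, "LFM": 30.0, "OpenGGCM": 60.0}
delays_s = {name: ace_delay_seconds(x) for name, x in boundaries_re.items()}

for name, delay in delays_s.items():
    print(f"{name}: {delay:.0f} s")
```

Under these assumptions the computed delays fall within a second or two of the values quoted above; a different assumed speed would shift all four delays proportionally.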
 The simulation results reported here are available through the CCMC Web site as run numbers 102709_1, 103009_1, 011110_1, and 020410_1 prefixed with “Ilja_Honkonen_.” The data from all simulations used in this study is saved to disk every 5 min, hence, all the satellite data is averaged using a 5 min sliding window. Figure 2 shows the trajectories of Cluster 1, Geotail and Wind during the simulated event with rectangles marking their location at the start of the event (15:00 UT). The distances of Cluster, Geotail, and Wind are about 9, 26, and 133 RE from Earth, respectively. Cluster is flying towards dusk side ecliptic plane, Geotail is advancing outward from the dawn side flank, and Wind is almost stationary in the far tail.
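The averaging of satellite data onto the 5 min simulation cadence can be sketched as follows. The text does not say whether the window is centered or trailing, so a centered window is assumed here; the function name and the synthetic samples are hypothetical.

```python
# Sketch: average irregular or high-cadence samples onto output times using
# a sliding window. ASSUMPTION: the window is centered on each output time.
def sliding_mean(times_s, values, out_times_s, window_s=300.0):
    """Mean of `values` whose times fall in [t - window/2, t + window/2)."""
    half = window_s / 2.0
    averaged = []
    for t in out_times_s:
        in_window = [v for ti, v in zip(times_s, values) if t - half <= ti < t + half]
        # NaN marks output times with no samples inside the window.
        averaged.append(sum(in_window) / len(in_window) if in_window else float("nan"))
    return averaged


# Synthetic 1 min samples (not spacecraft data) averaged onto two output times.
sample_times = [60.0 * k for k in range(20)]
sample_values = list(sample_times)  # a linear ramp, so each mean equals its center time
print(sliding_mean(sample_times, sample_values, [300.0, 600.0]))
```

Averaging the observations over the same interval as the output cadence keeps sub-5-min fluctuations, which the saved simulation output cannot represent, from penalizing the comparison.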
 Figure 3 shows the magnetic field components from simulations and Cluster 1 as a function of time, CC, and PE scores calculated from that data in section 5, scatter plots of each model B component versus the observation at the same instant of time, the coordinate of Cluster 1, and the region in which it is located. Based on Cluster 1 ion energy spectrogram data (not shown), Cluster was in the magnetosheath (labeled msheath in the Figure) with short excursions into the magnetosphere (labeled ms) until about 23:00 UT after which Cluster moved into the solar wind (labeled sw). Before 19:00 UT, there is a large difference between the models' Bx prediction at Cluster 1 but afterwards models and Cluster show a fairly constant Bx. Bx has two noticeable depressions at 20:30 and 22:00 UT of which the first is captured by BATS-R-US, GUMICS, and LFM and the second by OpenGGCM and perhaps by BATS-R-US. By shows several large changes between 17:00 and 23:00 UT. The largest change in By around 17:00 UT is captured by BATS-R-US, GUMICS, and LFM. The increase at 20:15 UT is reproduced best by BATS-R-US and GUMICS while in OpenGGCM and LFM, the increase starts some 20 min earlier. Bz shows three large increases and subsequent decreases starting around 16:30, 20:15, and 21:30 UT. The major features of Bz are captured by all models. After 17:00 UT, BATS-R-US and GUMICS are in good agreement with each other and along with LFM are close to Cluster 1 observations. OpenGGCM reproduces Cluster 1 observations reasonably well but with an offset of about −10 nT and a 1 to 2 h shorter first enhancement in Bz at 17:00 UT. At 20:30 UT, all models are in good agreement with Cluster 1. The temporary spread in modeled results at 18:30 UT stands out. For all components of B, the largest differences between models occur at the beginning of the event before about 17:00 UT.
 Figure 4 shows the magnetic field data from simulations and Geotail as a function of time. Based on density and ion temperature data, Geotail is mostly in the plasma sheet (labeled ps in the Figure) or plasma sheet boundary layer (labeled psbl) before 21:00 UT and in the lobe afterwards [Aikio et al., 2013]. Bx has large changes around 16:00 and 20:00 UT of which only the changes before 17:30 UT seem to be reproduced by the models. All models predict the sharp decrease in Bx at Geotail around 17:00 UT. Geotail shows two depressions in Bx around 20:00 and 23:00 UT, but on the other hand, Bx stays more or less constant in BATS-R-US and LFM. Furthermore in GUMICS and OpenGGCM, contrary to Geotail, Bx clearly increases around 20:00 UT and also seems to increase, on average, around 23:00 UT. By does not show large features at Geotail except for noticeable increases at 16:00 and 17:00 UT. The former one seems to be captured by BATS-R-US and OpenGGCM while the latter is captured by LFM and, with a small delay, by BATS-R-US and GUMICS. The steady decrease of By between 17:00 and 19:30 UT is reproduced best by LFM with BATS-R-US and GUMICS also showing a similar feature. Geotail Bz has two large increases starting at 17:00 and 20:00 UT, which last for several hours. Between about 17:00 and 21:00 UT, all the models agree quite well with each other and, with an added offset of about 3 nT, also with Geotail. Interestingly, Bz in BATS-R-US, GUMICS, and LFM at Geotail is quite similar to the simulated Bz at Cluster 1, but with a delay of 15 to 30 min. This is not the case for observations where Geotail shows one large enhancement between 20:00 and 22:00 UT while Cluster 1 shows two separate smaller enhancements at 20:30 and 22:00 UT. The disagreement between Geotail Bz and the models is largest before 16:00, at 21:30 and at 23:00 UT.
 Figure 5 shows the magnetic field data from simulations and Wind as a function of time. During the event, Wind travels between the northern or southern lobes via the neutral sheet (labeled nsheet in the Figure). On average Bx seems to decrease steadily between 16:00 and 19:00 UT after which it starts to increase. Several depressions of Bx are overlaid on top of the average behavior, the largest ones being at about 17:30, 19:30, 21:00, and 22:30 UT. The only features that seem to be reproduced consistently by all models are the depressions of Bx at 17:30 and 21:00 UT. OpenGGCM shows quite good agreement with Wind between 19:30 and 21:00 UT. Wind shows large increases in By starting at 17:00, 18:30, and 21:00 UT, and all models reproduce its observations between 17:00 and 18:30 UT. The second enhancement between 18:30 and 20:00 UT is not reproduced by any model. From about 20:00 onward BATS-R-US, GUMICS, and LFM again agree reasonably well with Wind. Bz measured by Wind has features similar to Bz of Cluster 1 but with a magnitude of about one fourth of that of Cluster. For the whole event, BATS-R-US, GUMICS, and LFM agree with Wind quite well with one exception of about 30 min around 20:00 UT. OpenGGCM agrees well with Wind between 17:00 and 19:30 UT after which it starts to show very large variations. Again the Bz result of BATS-R-US, GUMICS, and LFM are similar to their respective results for Bz at Cluster 1 but with a 30 to 45 min delay.
 Figure 6 shows the cross polar cap potential (CPCP) in the northern hemisphere from simulations and calculations from SuperDARN along with the number of flow vectors that were used in the calculation as a function of time. The procedure of calculating SuperDARN CPCP [Ruohoniemi and Baker, 1998] consists of finding the best fit for the electric potential Φ from observed flow vectors using . In regions without data coverage, a statistical model based on the solar wind IMF [Ruohoniemi and Greenwald, 1996] is used to constrain the solution. Most of the time, over 100 flow vectors are available and, for example, starting at 21:30 UT, about 300 vectors are available for almost 2 h. When the number of available vectors is above about 100, they seem to be available on both the dawn and dusk side of the northern hemisphere (data not shown).
 During the event SuperDARN CPCP increases significantly four times with peak values at about 16:30, 19:30, 21:30, and 23:00 UT, which coincide with southward/northward turning of the IMF. A first-order minimum estimate for CPCP determined from the Defense Meteorological Satellite Program (DMSP) spacecraft available at NADIAWEB (http://cindispace.utdallas.edu/DMSP/NADIA_FAQ.html) gives values in the same range (∼40 to ∼100 kV) as SuperDARN. BATS-R-US follows observations most closely with significantly different values for a period of about 30 min only at 17:30 and 20:45 UT. GUMICS and LFM capture the dynamics of CPCP well, but their results differ from observations by a constant factor of about 0.7 and 3, respectively. OpenGGCM also captures the main behavior of CPCP but with smaller enhancements before 21:00 UT and a very large increase in CPCP at around 23:00 UT.
 Figure 7 shows the minimum distance of the magnetopause from the Earth within 30° from the Sun-Earth line (hereinafter referred to as R0) from simulations and the empirical model of Lin et al.  as a function of time. With the exception of the first 1.5 h, BATS-R-US and GUMICS show very similar results for the entire event, staying between 11 and 13 RE. Their result is also quite close to the empirical model, but with an almost constant offset of about 1.5 RE. The results from LFM are closest to the empirical model with very good agreement (differences of less than 0.25 RE) from about 17:00 to 20:30 UT and at other times the difference is at most about 1 RE. In both LFM and the empirical model, R0 stays almost completely between 9 and 11 RE. OpenGGCM has the lowest average value of R0 of 8 RE, and its dynamic range is much larger than that of the other models, varying between 6 and 12 RE.
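The definition of R0 as a minimum distance within a 30° cone can be illustrated with a short sketch. It assumes that a set of magnetopause surface points in GSE coordinates (in RE, with x pointing sunward) has already been identified; how the magnetopause is located in each simulation is not described here, and the function name and sample points are hypothetical.

```python
# Sketch: minimum geocentric distance of magnetopause points lying within a
# cone of given half-angle around the Sun-Earth line (+x axis in GSE).
from math import acos, degrees, sqrt


def standoff_distance(points_re, max_angle_deg=30.0):
    """Return the smallest |r| among points within max_angle_deg of +x, or None."""
    best = None
    for x, y, z in points_re:
        r = sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the origin has no defined direction
        # Angle between the position vector and the +x axis, clamped for safety.
        angle = degrees(acos(max(-1.0, min(1.0, x / r))))
        if angle <= max_angle_deg and (best is None or r < best):
            best = r
    return best


# Hypothetical surface points in R_E; the second and fourth lie outside the cone.
points = [(10.0, 0.0, 0.0), (9.0, 9.0, 0.0), (11.0, 1.0, 0.0), (5.0, 10.0, 0.0)]
print(standoff_distance(points))
```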
 The topology of the magnetic field in a global MHD simulation can give significant insight into the solution that was obtained and is essential for example when studying reconnection in a global setting [see, e.g., Dorelli et al., 2007]. Figures 8 and 9 show the magnetic field topology [Rastätter et al., 2012] from simulations in the y=0 RE plane at about 21:20 UT (tracing parameters in Figure 8: N1 = N2 = 11, adaptation = 6; flow line start positions in Figure 9: uniform random in cut plane). In Figure 8, traced magnetic field lines connected to the Earth at both ends are shown in red, field lines connected only to the northern or southern hemisphere are shown in yellow or green, respectively, and field lines not connected to the Earth are shown in blue, while in Figure 9, field lines connected to only one hemisphere are shown in black. In GUMICS two large plasmoids form in the magnetotail starting at 20:30 and 23:30 UT, which is most likely caused by multiple large and fast rotations of the interplanetary magnetic field (IMF) clock angle [Honkonen et al., 2011]. Figures 8b and 9b show the magnetic field topology from GUMICS at 21:12 UT, about 5 min before the first plasmoid dissipates. The plasmoid is visible as a region of closed magnetic field located in the far tail beyond −100 RE. In BATS-R-US, the plasmoid forms about 10 min later than in GUMICS and is shown in Figures 8a and 9a 5 min before it dissipates. The result looks similar to that of GUMICS with a complicated structure of closed, lobe, and solar wind field lines in the ecliptic plane surrounded by lobe field lines above and below. Figures 8c and 9c show the plasmoid in LFM 5 min before it dissipates. The plasmoid forms at about the same time as that in BATS-R-US, but its structure is different. The closed field line regions stay closer to the ecliptic plane and do not detach from the Earth as in the case of BATS-R-US and GUMICS. At the time of plasmoid formation in other models, OpenGGCM does not show a closed field line region extending downstream from the Earth, but northern lobe field lines do show an additional region further down the tail. OpenGGCM shows significant changes in magnetic field topology prior to 20:00 UT. In BATS-R-US and LFM, the boundary between lobe and solar wind field lines in the magnetosheath is quite wavy, which is also noticeable in GUMICS.
 In order to get a quantitative estimate for the performance of different models, the correlation coefficients (CC) and prediction efficiencies (PE) with respect to observations were calculated. Table 2 presents the correlation coefficients between model predictions and measurements. For every model, the magnetic field component with the largest correlation coefficient for every spacecraft is shown in bold and the smallest coefficient in italics. In this section we use the expressions best correlated and best predicted only for describing values of the respective metrics of one magnetic field component relative to another of the same combination of spacecraft and model. As will be shown, in some cases, an average prediction is better than the modeled result for all components of B, but even then, one component will have the highest score, which we will refer to as best.
Table 2. Correlation Coefficients Between Model Magnetic Fields, Cross Polar Cap Potential, and Measurements for the Event of 18 Feb 2004a
aFor every model, separately, the magnetic field component with the largest correlation coefficient for every spacecraft is shown in bold and the smallest coefficient in italics.
 Overall Bz for every spacecraft is most often (7/12) best correlated with measurements, but there are spatial differences. On the dayside (Cluster 1), the largest correlations are evenly distributed between Bx and Bz, while By has the lowest correlation for three models. On the nightside close to the Earth (Geotail), Bx is most often (3/4) best correlated with measurements while By has most often (3/4) the worst correlation. Far in the magnetotail (Wind), Bz is best correlated with measurements and Bx the worst for all models. By is never the best correlated magnetic field component for any model and spacecraft combination.
 When examining the correlations of each magnetic field component separately, several things can be observed. For three of the four models the highest correlation in Bx is obtained on the dayside and the lowest correlation far in the tail. In BATS-R-US and LFM the correlation of Bx decreases steadily from dayside to far tail. In GUMICS the correlation of Bx drops significantly from dayside to nightside and increases slightly from nightside to the far tail, which is probably due to GUMICS having a lower resolution at Geotail than the other models. In OpenGGCM the highest Bx correlation is on the nightside. For all models, the lowest correlations of both By and Bz are on the nightside close to Earth with only one exception (OpenGGCM By at Wind). Overall, 37 out of 40 correlation coefficients are positive when SuperDARN is included.
 The prediction efficiency (PE) for a discrete signal is calculated following Pulkkinen et al.:
$$\mathrm{PE} = 1 - \frac{\left\langle \left( x_{\mathrm{obs}} - x_{\mathrm{sim}} \right)^2 \right\rangle_i}{\sigma_{\mathrm{obs}}^2}$$
where $x_{\mathrm{obs}}$ and $x_{\mathrm{sim}}$ are the observed and simulated signals, respectively, $\langle \ldots \rangle_i$ indicates an arithmetic mean taken over i (i.e., time) and $\sigma_{\mathrm{obs}}^2$ is the variance of the observed signal. A PE value of 1 indicates a perfect prediction while a PE value of 0 is equal to using the mean value of the signal as a predictor. The prediction efficiencies of the models are presented in Table 3. For every model the magnetic field component with the largest prediction efficiency for every spacecraft is shown in bold and the smallest prediction efficiency in italics.
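A minimal sketch of how the two metrics could be computed for a pair of equally sampled time series is given below. It assumes that CC is the Pearson linear correlation coefficient and that PE is one minus the mean squared error divided by the population variance of the observations; the function names and the sample values are illustrative and are not taken from the study.

```python
# Sketch: correlation coefficient (CC) and prediction efficiency (PE) for two
# equally sampled signals. ASSUMPTIONS: CC is the Pearson coefficient and the
# variance in PE is the population variance (division by N).
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def correlation_coefficient(x_obs, x_sim):
    """Pearson linear correlation between observed and simulated signals."""
    m_obs, m_sim = mean(x_obs), mean(x_sim)
    cov = sum((o - m_obs) * (s - m_sim) for o, s in zip(x_obs, x_sim))
    ss_obs = sum((o - m_obs) ** 2 for o in x_obs)
    ss_sim = sum((s - m_sim) ** 2 for s in x_sim)
    return cov / sqrt(ss_obs * ss_sim)


def prediction_efficiency(x_obs, x_sim):
    """1 - <(x_obs - x_sim)^2> / var(x_obs); 1 is perfect, 0 equals the mean predictor."""
    m_obs = mean(x_obs)
    mse = mean([(o - s) ** 2 for o, s in zip(x_obs, x_sim)])
    var_obs = mean([(o - m_obs) ** 2 for o in x_obs])
    return 1.0 - mse / var_obs


# Synthetic example: the "model" has the right shape but twice the amplitude.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim = [2.0, 4.0, 6.0, 8.0, 10.0]
print(correlation_coefficient(obs, sim))  # 1.0
print(prediction_efficiency(obs, sim))    # -4.5
```

The example shows that a signal can be perfectly correlated with the observations and still score well below zero in PE when its amplitude is wrong.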
Table 3. Magnetic Field and Cross Polar Cap Potential Prediction Efficiencies of Global MHD Models for the Event of 18 Feb 2004a
aFor every model, separately, the magnetic field component with the largest prediction efficiency for every spacecraft is shown in bold and the smallest efficiency in italics.
 Overall, Bx for every spacecraft is most often (7/12) best predicted by the models but again there are spatial differences. On the dayside and nightside close to the Earth, Bx is predicted best by three of the four models, but in the far tail, the highest prediction efficiency is mostly (3/4) obtained for Bz instead. On the dayside and far in the tail, By is predicted the worst almost without exception, but on the nightside close to Earth, Bz is predicted worst by all models. All the models are worse than using an average value for predicting By with the exception of one model for only one spacecraft (GUMICS on the dayside). On the other hand, for three models, the Bx and Bz prediction efficiencies on the dayside are above 0.5 and above 0.3 for Bz in the far tail.
 When examining the prediction efficiencies of each magnetic field component separately, several things are observed: For BATS-R-US and LFM, the highest PE for Bx is on the dayside and the lowest in the far tail, a situation identical to the correlation coefficients of Bx for these models. For GUMICS-4 and OpenGGCM, the highest PE for Bx is also on the dayside, but the lowest one is on the nightside close to Earth. The PE of By decreases downstream for GUMICS-4 and OpenGGCM but for BATS-R-US and LFM, By PE has the highest value on the nightside close to Earth. For all the models, Bz PE is significantly lower on the nightside than either the dayside or the far tail. Only BATS-R-US predicts the CPCP better than an average value would, and all other models are worse than using a random value. Overall, 13 out of 40 prediction efficiencies are positive when SuperDARN is included.
 For the combined magnetic field CC and PE results from all models Bx has the largest value among all components 12/24 times while By has the largest value 1/24 times and Bz 11/24 times. The number of times Bx, By and Bz have the smallest value among components are 5/24, 13/24 and 6/24, respectively.
 In order to estimate the effect that different modules/parameters of a model can have on the simulation result, three runs were done with a newer version of the BATS-R-US model (version 20110131) with different parameters: (1) only the version was changed, (2) higher resolution was used in the magnetosphere, and (3) the Rice Convection Model (RCM) module was included which solves the adiabatic drift of isotropic particle distributions [Toffoletto et al., 2005] in the inner magnetosphere and provides density and pressure corrections to the magnetospheric module [Tóth et al., 2005]. At Cluster 1, the standard resolution run has a resolution of 0.5 RE (up to about 8 RE distance from Earth) while the high resolution version has a resolution of 0.25 RE (up to about 16 RE distance from Earth on the dayside). At Geotail the resolutions are 1 RE and 0.5 RE, respectively, although in the high resolution run the resolution decreases to 1 RE when Z > 6 and further to 2 RE when Z > 12 RE.
 We only summarize the results for CC and PE in these runs, but the simulation results are again available through CCMC (run numbers 112112_1, 112112_2, 112112_3, and 112312_1 all prefixed with “Ilja_Honkonen_”). The results for Bx at Cluster orbit do not differ significantly between any version or module combination of BATS-R-US that was tested. At Geotail only RCM improved the results for Bx noticeably by increasing the absolute value of both CC and PE by about 0.1. For By, the differences are less straightforward: CC at Cluster improved slightly (about 0.05) with the new version regardless of the requested resolution and further by about 0.05 when using RCM. At Geotail, the new version improved CC of By by 0.15 which higher resolution increased further by 0.06. Interestingly in this case, including RCM decreased CC by about 0.05. The prediction efficiency of By at Cluster improved significantly for the new version (by 0.35) and increasing the resolution and including RCM further improved the result by 0.1 and 0.2, respectively. At Geotail, the new version increased PE by almost 0.2 while increasing the resolution or including RCM decreased the result slightly. For Bz the results are relatively straightforward: A new version of BATS-R-US does not affect either CC or PE significantly; increasing the resolution improves all results noticeably at Cluster while having no or insignificant effect at Geotail. Interestingly, using RCM gives the worst results for Bz everywhere except for PE at Geotail, but there the results for all tested models are already worse than a random prediction.
 In this work the prediction efficiency (PE) and correlation coefficient (CC) metrics are used to obtain a quantitative estimate of model performance. An intuitive picture of these metrics can be obtained from the cross polar cap potential (CPCP) results where CC seems to quantify how well a model reproduces the “dynamics” of observations while PE indicates the quality of predicting the absolute values of observations. The CPCP prediction of LFM has the highest CC score of all models and indeed the relative changes of LFM CPCP correspond best to SuperDARN observations. On the other hand, the PE score of LFM CPCP is by far the lowest, which is not surprising given that LFM CPCP is mostly a factor of 3 larger than that of SuperDARN or any other model. As stated by Pulkkinen et al., no one metric is the absolute best and the choice of the metric depends on the situation. For some applications it could be a valid, albeit unphysical, approach to divide the LFM CPCP by three in order to obtain the best available prediction for CPCP based on the CC metric.
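The different behavior of the two metrics can be illustrated with a minimal sketch. It assumes the common definitions CC = Pearson correlation and PE = 1 − MSE/σ², where σ² is the variance of the observations; the function names and the CPCP values below are invented for illustration and are not taken from the simulations discussed here. A prediction that is three times too large keeps a perfect CC but receives a strongly negative PE, while predicting the observed average gives PE = 0.

```python
import math


def correlation_coefficient(obs, model):
    """Pearson correlation coefficient between observed and modeled series."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_m = sum(model) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(obs, model))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_m = sum((m - mean_m) ** 2 for m in model)
    return cov / math.sqrt(var_o * var_m)


def prediction_efficiency(obs, model):
    """PE = 1 - (mean squared error) / (variance of observations).

    This definition is an assumption of the sketch. PE = 1 is a perfect
    prediction, PE = 0 equals predicting the observed average, and
    PE < 0 is worse than predicting the average.
    """
    n = len(obs)
    mean_o = sum(obs) / n
    mse = sum((o - m) ** 2 for o, m in zip(obs, model)) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    return 1.0 - mse / var_o


# Invented "observed" CPCP values in kV (illustration only).
obs = [30.0, 45.0, 60.0, 80.0, 55.0, 40.0]
# A hypothetical model with the right dynamics but three times too large.
scaled = [3.0 * o for o in obs]
# Using the observed average as the prediction at every time step.
mean_prediction = [sum(obs) / len(obs)] * len(obs)
# Dividing the scaled prediction by three, as suggested in the text.
rescaled = [s / 3.0 for s in scaled]

cc_scaled = correlation_coefficient(obs, scaled)        # perfect correlation
pe_scaled = prediction_efficiency(obs, scaled)          # strongly negative
pe_mean = prediction_efficiency(obs, mean_prediction)   # zero
pe_rescaled = prediction_efficiency(obs, rescaled)      # perfect after rescaling

print(cc_scaled, pe_scaled, pe_mean, pe_rescaled)
```

In this synthetic case the rescaled series reproduces the observations exactly, which is of course far more favorable than any real comparison would be.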
 Overall, based on the PE metric, none of the tested models seem good at predicting observations on the nightside at a distance of about 25 RE or more from the Earth. With the exception of Bx at Geotail and Bz at Wind, all model predictions at those satellites are worse than using an average value for the prediction, and even in the rest of the cases, PE is less than 0.5. On the dayside closer to Earth at about 14 RE, all model PEs are significantly higher with the PE of Bx and Bz being over 0.5 for three out of four models. When the CC metric is used, all models fare substantially better, and interestingly, the highest and lowest values of CC occur in about the same locations and for the same components of B as with PE.
 In BATS-R-US and LFM the values of CC and PE tend to decrease steadily from the dayside to the far tail with the exception of Bz. For a system with high Mach number(s) flow past an obstacle, it is reasonable that model predictions are most accurate upstream of the obstacle and decrease downstream from there since turbulence and other effects have had more time to affect the system. In GUMICS and OpenGGCM, both CC and PE in the nightside close to the Earth often have smaller values than in the far tail. There does not seem to be a simple explanation for this behavior since GUMICS as a model is closer to BATS-R-US than LFM is to BATS-R-US (Table 1), but it is GUMICS that behaves differently from BATS-R-US in this respect.
 Bz is the largest exception to the above rule that metric scores decrease steadily downstream from the dayside since in all models and for both CC and PE, the score is higher in the far tail than in the nightside close to Earth. The lack of modeled physics in the nightside does not seem to explain this since including RCM in BATS-R-US decreases the nightside CC and PE scores if they are positive to begin with. One obvious explanation for this behavior could be the fact that the axis of Earth's strong intrinsic dipole field is also directed in the general direction of the GSE Z axis. In order to verify this, a similar comparison would probably have to be carried out for Neptune's magnetosphere where the dipole axis can point almost directly sunward [Ness et al., 1989].
6.2 Scatter Plots
 The scatter plots in Figures 3, 4, and 5 illustrate the quality of predictions from another point of view. They allow one to understand the metric scores calculated here better and also let us estimate limits to the CC and PE scores above which the simulation results can be considered good. For example, it is more apparent from the scatter plot than the time series that all models underestimate Bz at Geotail quite strongly. In a BATS-R-US run with RCM included, the shape of the point distribution does not change significantly but the whole distribution moves closer to the diagonal. Based on visual inspection of scatter plots, the modeled results seem good when both CC and PE are above about 0.6. Also, using only the CC metric for estimating the quality of the result can be misleading as the points in a scatter plot can still be quite far from the diagonal (e.g., GUMICS Bx at Cluster or BATS-R-US/LFM Bx at Geotail).
6.3 Cross Polar Cap Potential
 The CPCP result of GUMICS differs from that of SuperDARN by a constant factor of about 0.7 which can be explained, at least partially, by magnetospheric resolution and the dipole tilt angle. When using the Run-on-Request system, if the tilt angle is not updated with time, its direction is set to the start time of the simulation. In a previous simulation of this event with GUMICS, the dipole tilt angle was set to its average value during the simulated event [Honkonen et al., 2011] and the CPCP prediction was noticeably higher, i.e., closer to SuperDARN and the other models. Also, increasing the resolution of the inner magnetosphere in GUMICS gives higher field-aligned currents (FAC), which increase CPCP further.
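One way to estimate such a constant factor between two CPCP time series is a least-squares fit through the origin. The following is only a sketch: the function name and the numbers are hypothetical, and it is not claimed that the factor of about 0.7 quoted above was obtained this way.

```python
def least_squares_scale(reference, model):
    """Return k that minimizes sum((model - k * reference)^2).

    Setting the derivative with respect to k to zero gives
    k = sum(reference * model) / sum(reference^2).
    """
    numerator = sum(r * m for r, m in zip(reference, model))
    denominator = sum(r * r for r in reference)
    return numerator / denominator


# Invented reference CPCP values in kV (illustration only).
reference_cpcp = [40.0, 55.0, 80.0, 65.0, 50.0]
# A hypothetical model that is consistently 30% lower than the reference.
model_cpcp = [0.7 * r for r in reference_cpcp]

scale = least_squares_scale(reference_cpcp, model_cpcp)
print(round(scale, 3))
```

With real data the residuals around the fitted line would indicate how well a single constant factor actually describes the difference.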
 The CPCP from SuperDARN reaches its saturation point of about 80 kV [e.g., Shepherd et al., 2003] twice during the event at 16:30 and 19:30 UT. In this case the possible saturation of CPCP most likely would not change the relative result between models significantly because at those times all model predictions are less than or equal to SuperDARN with the exception of LFM, which is a factor of three higher. At times the number of flow vectors used for calculating SuperDARN CPCP falls below 100, and the number does not increase much beyond 300. This may cast some doubt on the calculated CPCP since, for example, the large statistical study of Grocott et al.  only included periods with 300 or more flow vectors. We argue that the number of vectors and hence the reliability of SuperDARN CPCP is adequate for the purpose of comparing global MHD models, with quite different CPCP predictions, to observations.
6.4 Additional Physics
Pulkkinen et al.  reported that neither increasing spatial resolution in OpenGGCM nor including thermospheric physics in LFM systematically improved the performance of either model with respect to ground magnetic field observations. In this event increasing the resolution in BATS-R-US has a significant effect only on the dayside CC and PE of Bz, and in particular, higher resolution does not have a significant effect on Bx anywhere. Contrary to Pulkkinen et al. , including the RCM module in BATS-R-US does not lead to an improved result in this case. Although the result for Bx does improve on the nightside by including RCM, the result for Bz becomes worse on the dayside. The reason for this behavior would be difficult to pinpoint based on even several tests. The possibilities range from small mistakes in the code, installation or usage to fundamental problems in the representation of the physics in each separate model, or in their coupling together. It is clear that including additional physical models in a simulation does not automatically guarantee a better result in the whole simulated volume.
6.5 Magnetopause Standoff Distance
 The magnetopause standoff distance between different MHD models varies by almost 6 RE at 16:30 and 22:00 UT while at 18:00 UT all models show an almost identical value and are quite close at 20:30 UT. The standoff distance from an empirical model tends to fall in the middle of MHD models except from about 21:00 UT onward where the standoff distance is lower than in all but one MHD model. Proper validation would require a comparison to observations, which is not possible in this case using the current CCMC interface. Nevertheless, a comparison to an empirical model shows what values and dynamical behavior to expect based on the upstream solar wind conditions.
 The increase in standoff distance of all models at 18:00 UT is probably due to the sudden large decrease in solar wind density while the increase at 20:30 UT might be due to northward turning of IMF Bz. The empirical model shows similar behavior although the increase at 18:00 UT is smaller than in MHD. At 22:00 UT the standoff distance again increases, probably due to northward turning of IMF Bz, in BATS-R-US, GUMICS, LFM, and the empirical model. While the standoff distance increases several times to a very similar value between all MHD models, the standoff distance seems to subsequently return to a baseline value that is different for each model. Finding the cause of this will require further investigation in subsequent works, as there is no apparent explanation for this behavior in the upstream solar wind conditions.
6.6 Statistical Studies Needed
 When validating, verifying, or just comparing models, as many parameters as possible should be kept constant. Unfortunately, this is difficult to accomplish with the models used here, especially through the CCMC interface, which limits the parameter space of models available to users. For example, even when a higher resolution run of BATS-R-US is requested, the resolution does not increase in the whole simulation domain but is lower, for example, around the lobes. As shown in Table 1, there are also significant differences between, e.g., the magnetospheric grid used by different models. In this work we examined only one event, and more may be needed in order to draw solid conclusions on the performance of global MHD models in the Earth's magnetosphere. Due to the complicated physics involved and the differences in global models, it would be important to not only simulate single events but to run weeks or even many months' worth of simulations in order to assess model performance using various metrics as a function of, for example, AE (http://wdc.kugi.kyoto-u.ac.jp/aedir) and Dst indices, the upstream solar wind driver, substorm phase, etc. This would allow the users of global models to estimate the quality of the solution for a particular event and could provide weights for statistical comparisons of simulations and observations. For model developers it would provide, for example, a baseline quality against which various modifications and changes in model parameters could be compared using a smaller but representative set of events.
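A statistical assessment of this kind could, for example, group simulated and observed values by the level of a geomagnetic activity index and evaluate a metric separately in each group. The sketch below assumes PE = 1 − MSE/σ² with σ² the variance of the observations in each bin; the function names, bin edges, and sample values are all invented for illustration.

```python
from collections import defaultdict


def prediction_efficiency(obs, model):
    """PE = 1 - (mean squared error) / (variance of observations); assumed definition."""
    n = len(obs)
    mean_o = sum(obs) / n
    mse = sum((o - m) ** 2 for o, m in zip(obs, model)) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    return 1.0 - mse / var_o


def pe_by_index_bin(samples, bin_edges):
    """Compute PE separately for each activity index bin.

    samples:   iterable of (index_value, observed, modeled) tuples
    bin_edges: increasing bin edges; a sample belongs to [lo, hi)
    Bins with fewer than two samples or constant observations are skipped.
    """
    groups = defaultdict(list)
    for index_value, observed, modeled in samples:
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= index_value < hi:
                groups[(lo, hi)].append((observed, modeled))
                break

    result = {}
    for key, pairs in groups.items():
        obs_vals = [p[0] for p in pairs]
        model_vals = [p[1] for p in pairs]
        if len(pairs) < 2 or max(obs_vals) == min(obs_vals):
            continue  # PE is undefined without variance in the observations
        result[key] = prediction_efficiency(obs_vals, model_vals)
    return result


# Invented samples: (AE-like index in nT, observed Bz in nT, modeled Bz in nT).
samples = [
    (50, 5.0, 5.0), (120, 7.0, 7.0), (180, 9.0, 9.0),      # quiet: model matches
    (400, 5.0, 9.0), (650, 7.0, 11.0), (900, 9.0, 13.0),   # active: constant offset
]
pe_bins = pe_by_index_bin(samples, bin_edges=[0, 200, 1000])
for (lo, hi), pe in sorted(pe_bins.items()):
    print(f"index in [{lo}, {hi}): PE = {pe:.2f}")
```

The same grouping could be applied to CC or to other drivers such as Dst or substorm phase, given enough samples in each bin.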
 As a final note, there are four different global MHD models available at CCMC through a consistent interface, and each user can start several runs daily. Based on the results presented here and in previous works, there is virtually no reason to limit oneself to only one model when simulating the Earth's magnetosphere using CCMC resources. If two or more independent models agree on a particular result, it is very likely as close to reality as current state-of-the-art in global MHD can reasonably get.
6.7 Large Plasmoid Formation in Global MHD
 The results presented in Figures 8 and 9 lend more credibility to the hypothesis put forward by Honkonen et al.  that multiple large and fast rotations of the IMF clock angle result in large plasmoid formation in a global MHD simulation. In three out of four different global MHD models, two large plasmoids form in the magnetotail during the event, and the plasmoids occur close in time between all three models in both cases although there are 5 to 15 min differences between the three models in the stages of plasmoid formation. The plasmoid structure is most similar between BATS-R-US and GUMICS with LFM also showing the closed magnetic field line region extending about −200 RE downstream from Earth.
 The three models that agree on plasmoid formation also exhibit a large cross polar cap potential drop prior to the downstream growth of the closed magnetic field line region. GUMICS shows only very small variations in the ionospheric conductivities during the event, and hence changes to CPCP are almost completely due to changes in the FACs. Although the FACs in BATS-R-US vary by more than a factor of 2, the conductivities in the auroral oval also change moderately when the CPCP varies. In LFM both FACs and conductivities change significantly while in OpenGGCM only conductivities change drastically during variations of CPCP. Thus it seems that in order for a large tail plasmoid to be formed in a global MHD simulation, significant changes in ionospheric FACs are required and that ionospheric conductivities can have a significant effect on the structure of the formed plasmoid.
 In this work the performance of four global MHD models (BATS-R-US, GUMICS-4, LFM, and OpenGGCM) in the Earth's magnetosphere is studied by comparing model predictions to the magnetic field measurements of Cluster 1, Geotail, and Wind spacecraft during a multiple substorm event. Model results for the cross polar cap potential are also compared to the measurements of SuperDARN, and the model magnetopause standoff distances are compared to the empirical magnetopause model of Lin et al. All simulations are executed through the CCMC Run-on-Request system. Comparisons are conducted using two quantitative and objective metrics: correlation coefficient and prediction efficiency. We find that for all four models, the best performance is on the dayside and, generally, model performance decreases steadily downstream from the Earth. Of the different components of the magnetic field, Bx is most often best predicted and correlated both on the dayside and the nightside close to the Earth whereas in all models Bz CC and PE are substantially higher in the far tail than for other components. On the dayside CCs are above 0.5 most of the time with Bx and Bz CCs close to 0.9 for three out of four models. In the magnetotail at a distance of about 130 RE, the prediction efficiency of all models is below that of using an average value for the prediction with the exception of Bz. We also find that increasing the resolution or coupling an additional physical model does not automatically increase model performance, at least with respect to the CC and PE metrics. With a coupled inner magnetosphere module, the performance of BATS-R-US increases significantly close to the Earth for By and in all relevant cases decreases moderately for Bz.
 This work is a part of the project 200141-QuESpace, funded by the European Research Council under the European Community's seventh framework programme. The work of I.H. and M.P. is supported by project 218165 of the Academy of Finland. A.G. is supported by NERC Grant NE/G019665/1. The National Center for Atmospheric Research is supported by the National Science Foundation. We thank the rest of the CCMC staff for providing a valuable service, the Cluster Active Archive, Coordinated Data Analysis Web and the instrument teams and PIs of Geotail MGF, Wind MFI, Cluster FGM, and ACE MAG and SWEPAM (S. Kokubun, R. Lepping, A. Balogh, N.F. Ness, D.J. McComas, respectively) for providing the spacecraft data used in this study. We also thank the World Data Center for Geomagnetism for the Kyoto AE and Dst index services and the anonymous referee for insightful comments. I.H. thanks C. Anekallu for insightful discussions.