Multi-Model Ensembles for Upper Atmosphere Models

Multi-model ensembles (MMEs) are used to improve forecasts of thermospheric neutral densities. A variety of algorithms for constructing the model weights for the MMEs are described and have been implemented, including performance weighting, independence weighting and non-negative least squares. Using both empirical and physics-based models, compared against in-situ CHAMP observations, the skill of each MME weighting approach has been tested in both solar minimum and maximum conditions. In both cases the MME performs better than any individual model. A non-negative least squares weighting for the MME on a set of bias corrected models provides a 68% and 50% reduction in the mean square error compared to the best model (Jacchia-Bowman 2008) in the solar minimum and maximum cases respectively.


Introduction

Background
Accurately propagating satellite orbits requires knowledge of the forces acting on the satellite. For satellites in low Earth orbit (LEO) (less than 1,000 km), forces include terrestrial gravity, solar radiation pressure, lunar and solar gravity and drag caused by the atmosphere (Eshagh & Najafi Alamdari, 2007). The drag force increases dramatically as a satellite's altitude decreases and becomes significant below approximately 600 km (Fortescue et al., 2011). However, there are large uncertainties in modelling the magnitude of the drag acting on a satellite. To do so requires an understanding of the thermospheric mass density, winds and the satellite's ballistic coefficient. The largest contribution to error in the forecasting of satellite positions is the specification of thermospheric density (Mehta et al., 2018), although for tumbling or complex geometries, the errors in the ballistic coefficient can be a substantial contribution.
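For context, the drag acceleration experienced by a satellite is commonly written (in a standard form, not taken from any of the references above) as

$$\mathbf{a}_{\mathrm{drag}} = -\frac{1}{2}\,\rho\,\frac{C_D A}{m}\,\left|\mathbf{v}_{\mathrm{rel}}\right|\,\mathbf{v}_{\mathrm{rel}}$$

where $\rho$ is the thermospheric mass density, $C_D$ the drag coefficient, $A$ the cross-sectional area, $m$ the satellite mass ($C_D A/m$, or its inverse, is the ballistic coefficient) and $\mathbf{v}_{\mathrm{rel}}$ the satellite velocity relative to the co-rotating atmosphere. This makes explicit how errors in $\rho$ and in the ballistic coefficient feed directly into the drag force and hence into along-track position errors.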
Currently a variety of mathematical models are used to provide estimates of the density. Empirical models are often used by satellite operators. They are fitted to measurements of thermospheric parameters; however, such measurements are sparse. In particular, there are very few measurements between 100 km and 250 km because balloons cannot reach these heights and satellites re-enter too quickly for any long-term study.
Fabry-Perot interferometers can be used to measure wind between 220 and 600 km (Titheridge, 1995) and meteor radars can measure wind, as well as temperature and pressure, between 80 and 100 km (John et al., 2011; Reid et al., 2018).
Physics-based models solve the equations which describe the physical processes in the thermosphere. Initially the atmospheric density, wind and temperatures are generally provided by empirical models, but a 'spin-up' time is used for the results to stabilize. The spin-up time can be reduced in subsequent model runs by using previous output from the model. Neutral and ion species production is then calculated via chemical reaction equations and using solar X-ray and EUV conditions. Ion transport and recombination are also considered. The initial and boundary conditions, as well as proxies for solar activity, are the main drivers for the models. There are a number of approaches to modelling the physics of the thermosphere, which rely on different numerical methods (Purnell, 1976; Augenbaum, 1984; Bott, 1989), and thus exhibit different levels of complexity and use a variety of inputs. Model developer choices about the solver and the selection of boundary conditions lead to differences in the outputs from models.

Multi-Model Ensembles
A multi-model ensemble (MME) is a combination (usually weighted) of individual models (Thompson, 1977; Murray, 2018). Ideally the models should have independent errors, and the improved performance of the MME arises from the errors partially cancelling (Hagedorn et al., 2005). Tracton and Kalnay (1993) showed the utility of MMEs in one of the first operational MME weather forecasts, and also demonstrated skill in longer-term forecasts. Elvidge (2014) and Elvidge et al. (2016) demonstrated for the first time the skill gained from using MMEs for upper atmosphere forecasts. The result has been further demonstrated by combining the four Utah State University (USU) Global Assimilation of Ionospheric Measurements (GAIM) models (Schunk et al., 2016).
A key question when using an MME is how the models should be combined. Elvidge et al. (2016) used both an equal weighting scheme and a scheme where the weights were the inverse of the mean square error (MSE) of the models used to create the MMEs. That work showed that six hour forecasted densities of the thermosphere had a 60% reduction in the root mean square error (RMSE) when using an MME. To further investigate MMEs in the thermosphere, Elvidge et al. (2016) recommended that:
• A 'training' dataset should be used for the weighting scheme rather than the testing dataset
• More variety of weighting methods should be included
• Longer test scenarios should be used to reduce the uncertainties in the statistics.
This work addresses those recommendations.

Observations
This work uses data from the Challenging Minisatellite Payload (CHAMP) satellite (Reigber et al., 2002). CHAMP was operational from July 2000 to September 2010.
During this time the CHAMP orbit degraded from an altitude of 454 km to 296 km due to atmospheric drag. One of its primary missions was to precisely measure the terrestrial gravity field, which required a very accurate accelerometer. Thermospheric total mass densities have been estimated from the CHAMP accelerometer data (Sutton, 2009). This is the dataset that was used in Elvidge et al. (2016); however, since that study, the CHAMP drag coefficient and surface area have been re-analysed using higher fidelity satellite geometry models and more advanced drag coefficient estimation (Mehta et al., 2017). This has resulted in a 20% reduction of the estimated densities. This work uses the re-analysed data, but it should be noted that the empirical models JB2008 and DTM-2013 are fitted to the older data (see the Models section).
The CHAMP data has a high sampling rate along its orbit (10 s). However, the local solar time varies by only a few seconds every orbit. Therefore, over the course of a month, CHAMP would only sample approximately 2.8 hours of local solar time (Häusler et al., 2010). Even so, structures in the neutral density such as travelling atmospheric disturbances (TADs) and the midnight density maximum (MDM) can be seen (Emmert, 2015). However, the limited data coverage does impact the overall assessment of this method. The accelerometer-derived neutral density observations have an estimated mean error of 10.1% (Sutton, 2008).

MME Weighting Methods
In this paper the performance of six different weighting schemes that can be used to combine models is tested. These are: equal weights (EW), performance weights (PW), performance weights with bias removed (PWB), Reliability Ensemble Averaging (REA), Independence Weighting (IW) and Non-Negative Least Squares Regression (NNLS). Each scheme is described in the following sections. In each case the MME is formed as a weighted combination of the input models:

$$M(x,t) = \sum_{k=1}^{N} R_k(x,t)\, Z_k(x,t) \qquad (1)$$

where $M(x,t)$ is the MME for each point $x$ at time $t$, $R_k(x,t)$ is the weight for each model $k$ (which can also vary in time and space), $Z_k(x,t)$ is each model output and $N$ is the number of models.
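For illustration, a minimal Python sketch of Equation 1 follows (the function and array names are illustrative, not taken from any operational code):

    import numpy as np

    def mme_combine(Z, R):
        """Form the MME M(x, t) of Equation 1.

        Z : array of shape (N_models, N_points), the model outputs Z_k(x, t).
        R : array of shape (N_models,) or (N_models, N_points), the weights R_k.
            Most of the schemes described below produce weights that sum to one.
        """
        Z = np.asarray(Z, dtype=float)
        R = np.asarray(R, dtype=float)
        if R.ndim == 1:                  # constant weight per model
            R = R[:, np.newaxis]
        return (R * Z).sum(axis=0)       # weighted combination of the models

Each weighting scheme below differs only in how the weights R are constructed.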

Equal Weighting
Constructing appropriate model weights can be difficult given small sample sizes and the available data (Kharin & Zwiers, 2002). As such, it has been argued that the only way to generate a good MME for small datasets is by taking the ensemble mean (Hagedorn et al., 2005). Though simple, this method has been shown to produce good results in the thermosphere (Elvidge et al., 2016) and more broadly in climatology (Barnston et al., 2003; Palmer et al., 2004; Weisheimer et al., 2009). The MME weights are given by:

$$R_k = \frac{1}{N} \qquad (2)$$

Performance Weights
A performance weighting scheme uses a measure of model skill to weight the models, so that the best performing model (against a representative dataset) has the highest weight. The performance weighting used in this work is a modified version of that described by Rozante et al. (2014). It uses the mean square error (MSE) as the skill measure to weight the models. The weights then have the value:

$$R_k = \frac{1/\mathrm{MSE}_k}{\sum_{j=1}^{N} 1/\mathrm{MSE}_j} \qquad (3)$$

where

$$\mathrm{MSE}_k = \frac{1}{N_p} \sum_{i=1}^{N_p} \left(Y_k^i - X_i\right)^2 \qquad (4)$$

where $Y_k^i$ is a model prediction in the training period, $X_i$ is a data point in the training period and $N_p$ is the number of points.
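A minimal sketch of Equations 3 and 4 (array names are illustrative):

    import numpy as np

    def performance_weights(Y_train, X_train):
        """Inverse-MSE performance weights.

        Y_train : array of shape (N_models, N_p), model predictions in the training period.
        X_train : array of shape (N_p,), observations in the training period.
        """
        mse = np.mean((Y_train - X_train) ** 2, axis=1)   # MSE_k (Equation 4)
        w = 1.0 / mse                                     # inverse-MSE weights
        return w / w.sum()                                # normalise (Equation 3)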

Performance Weights with Bias removed
Some models show a good level of skill in terms of the correlation, showing that they model the thermospheric response well. However, the MSE may still be large, and a reason for this can be model biases. The bias is the difference between the average density of each model in the training period and the average density of the dataset:

$$B_k = \frac{1}{N_p}\sum_{i=1}^{N_p} Y_k^i - \frac{1}{N_p}\sum_{i=1}^{N_p} X_i \qquad (5)$$

so that

$$\tilde{Y}_k^i = Y_k^i - B_k \qquad (6)$$

and

$$R_k = \frac{1/\widetilde{\mathrm{MSE}}_k}{\sum_{j=1}^{N} 1/\widetilde{\mathrm{MSE}}_j} \qquad (7)$$

where

$$\widetilde{\mathrm{MSE}}_k = \frac{1}{N_p} \sum_{i=1}^{N_p} \left(\tilde{Y}_k^i - X_i\right)^2 \qquad (8)$$

If the bias from the training data is removed before the MSEs are taken, a potentially better representation of model skill can be achieved. The MSE of an unbiased model is equal to the error variance, so this is effectively a variance weighting. The bias is then also removed from the model outputs in the validation period before the models are averaged. This assumes the bias does not change between training and validation, which for a short time period should be a reasonable approximation.
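A sketch of the bias-removed variant (Equations 5 to 8); the biases estimated on the training data are returned so that they can also be subtracted from the model outputs during validation, as described above:

    import numpy as np

    def pwb_weights(Y_train, X_train):
        """Performance weights with bias removed.

        Returns (weights, biases); the biases B_k should also be subtracted
        from the model outputs before forming the MME in the validation period.
        """
        bias = Y_train.mean(axis=1) - X_train.mean()         # B_k (Equation 5)
        Y_debiased = Y_train - bias[:, np.newaxis]           # Equation 6
        mse = np.mean((Y_debiased - X_train) ** 2, axis=1)   # Equation 8
        w = 1.0 / mse
        return w / w.sum(), bias                             # Equation 7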

Reliability Ensemble Averaging
Elvidge et al. (2016) suggested the use of Reliability Ensemble Averaging (REA) to estimate the ensemble weights. REA is used in terrestrial weather and climate modelling to infer the unknown future performance of a model from its previous performance and from comparison with the other models' predictions (Giorgi & Mearns, 2002). The weighting process involves calculating the following quantity:

$$R_k^j = \left[\left(\frac{\epsilon}{\left|B_k\right|}\right)^m \left(\frac{\epsilon}{\left|D_k^j\right|}\right)^n\right]^{\frac{1}{mn}} \qquad (9)$$

The $R_k^j$ are weights per model $k$ and validation point $j$, $\epsilon$ is an estimate of the dataset's variability, which could be the range or the standard deviation of the data (a constant value of $1\times10^{-12}$ was used in this work as an estimate), $B_k$ is the bias of the model calculated against previous data, and $D_k^j$ are the distances from the models to the weighted multi-model average, $\tilde{Y}^j$, given by

$$\tilde{Y}^j = \frac{\sum_{k=1}^{N} R_k^j Z_k^j}{\sum_{k=1}^{N} R_k^j} \qquad (10)$$

with $D_k^j = Z_k^j - \tilde{Y}^j$, and $m$ and $n$ allow for separate weightings of the bias and distance terms (usually $m = n = 1$; Giorgi & Mearns, 2002). This is a circular definition, since $R_k^j$ is defined in terms of the distance from $\tilde{Y}^j$ in Equation 9, so an iterative procedure is used to find the weights, which is usually complete within a few cycles. The weights are calculated using Equation 9, then a new average is calculated using Equation 10, until a weight reaches a value of one (Giorgi & Mearns, 2002). This could be useful in storm time, when little is known about the storm, since it relies on the distance from the model average, which may be a better estimate of a model's reliability than its pre-storm bias.
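A simplified Python sketch of the REA iteration (the stopping rule is reduced here to convergence of the weights, and no capping of the reliability factors is applied; variable names are illustrative):

    import numpy as np

    def rea_weights(Z, B, eps=1e-12, m=1, n=1, max_iter=20):
        """Iteratively estimate REA weights (Equations 9 and 10).

        Z : array of shape (N_models, N_points), model forecasts Z_k^j.
        B : array of shape (N_models,), model biases against previous data.
        """
        Z = np.asarray(Z, dtype=float)
        B = np.maximum(np.abs(np.asarray(B, dtype=float)), 1e-30)  # avoid division by zero
        R = np.ones_like(Z)                                   # start from equal weights
        for _ in range(max_iter):
            Y = (R * Z).sum(axis=0) / R.sum(axis=0)           # weighted average (Equation 10)
            D = np.maximum(np.abs(Z - Y), 1e-30)              # distances to the average
            R_new = ((eps / B[:, np.newaxis]) ** m
                     * (eps / D) ** n) ** (1.0 / (m * n))     # Equation 9
            if np.allclose(R_new, R, rtol=1e-3):              # simplified convergence check
                break
            R = R_new
        return R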

Independence Weighting
Model independence is a critical requirement for an MME to work (Elvidge, 2014).
It may be the case that a set of models are not independent and share a lot of their structure with each other. The 'independence weighting' approach aims to take this into account. To determine the level of independence between models, first each model has its bias removed. Ideally these de-biased time series would have Gaussian errors and the covariance between the errors of different, independent models would be zero. In practice these errors often do have some covariance. A covariance matrix of these errors, for each of the different bias corrected models, is constructed, and weights are produced based on the variance between model and data, and the amount of covariance between the models (Bishop & Abramowitz, 2013):
$$\mathbf{w} = \frac{A^{-1}\mathbf{1}}{\mathbf{1}^{\mathsf{T}} A^{-1} \mathbf{1}} \qquad (11)$$

where $\mathbf{1}$ is a vector of all ones and $A$ is the model difference covariance matrix. This system can produce negative weights, which are not physically meaningful, so the method is adjusted to give only positive weights (Bishop & Abramowitz, 2013). A consequence of this is that one model always has zero weight, and is therefore excluded from the weighting. This method allows the use of different versions of the same model, since independence is no longer a concern, potentially allowing similar models of different versions, generations and formulations to be used.
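A sketch of the independence weighting: Equation 11 is implemented as given above, while the positive-weight adjustment is sketched here as a shift by the minimum weight followed by renormalisation, which is one simple construction consistent with the description (one model always receiving zero weight) rather than necessarily the exact adjustment of Bishop and Abramowitz (2013):

    import numpy as np

    def independence_weights(Y_train, X_train):
        """Weights from the model difference covariance matrix (Equation 11),
        adjusted so that all weights are non-negative.

        Y_train : array of shape (N_models, N_p), training-period predictions.
        X_train : array of shape (N_p,), training-period observations.
        """
        err = Y_train - X_train                      # model-minus-data differences
        err = err - err.mean(axis=1, keepdims=True)  # remove each model's bias
        A = np.cov(err)                              # model difference covariance matrix
        ones = np.ones(A.shape[0])
        w = np.linalg.solve(A, ones)
        w = w / (ones @ w)                           # Equation 11
        w = w - w.min()                              # shift so the smallest weight is zero
        return w / w.sum()                           # renormalise to sum to one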

Non-Negative Least Squares
Non-Negative Least Squares is a simple constrained regression which does not allow the coefficients to become negative. Specifically, it finds the coefficients $R_k$ such that

$$\underset{R_k \ge 0}{\arg\min} \left\| \sum_{k=1}^{N} R_k Y_k - X \right\|_2 \qquad (13)$$

where $\|\cdot\|_2$ is the Euclidean norm (Bro & De Jong, 1997). The regression is performed on the training dataset.
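Non-negative least squares solvers are widely available; a minimal sketch using SciPy (array names are illustrative):

    import numpy as np
    from scipy.optimize import nnls

    def nnls_weights(Y_train, X_train):
        """Solve argmin_{R >= 0} || Y_train.T @ R - X_train ||_2.

        Y_train : array of shape (N_models, N_p), training-period predictions.
        X_train : array of shape (N_p,), training-period observations.
        """
        R, _residual_norm = nnls(np.asarray(Y_train).T, np.asarray(X_train))
        return R

Note that, unlike the other schemes, the regression coefficients are not constrained to sum to one.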

Models
In this paper six models have been used to create the multi-model ensemble (MME).
Three of the models: NRLMSISE-00 (Picone et al., 2002), the Thermosphere Ionosphere Electrodynamics General Circulation Model (TIE-GCM) (Qian et al., 2014) (version 1.95 was used in Elvidge et al. (2016) whilst version 2.0 is used here) and the Global Ionosphere-Thermosphere Model (GITM) (Ridley et al., 2006) (updated since then) were used in Elvidge et al. (2016) (refer to that paper for a brief description of the models, or to the references for a detailed description). GITM and TIE-GCM were both run at 5° × 5° resolution. Additionally the Coupled Thermosphere-Ionosphere-Plasmasphere Electrodynamics model (CTIPe) (Millward et al., 1996; Codrescu et al., 2012), Jacchia-Bowman 2008 (JB2008) (Bowman et al., 2008) and the Drag Temperature Model 2013 (DTM-2013) (S. Bruinsma, 2015) are used in this paper. A summary of the differences between the empirical models used in this work is given in Table 1 of Emmert (2015), whilst Table 2 of the same paper highlights the differences between the physics-based models.

CTIPe
The Coupled Thermosphere-Ionosphere-Plasmasphere Electrodynamics model (CTIPe) has been developed at the National Oceanic and Atmospheric Administration (NOAA).
It is a physics-based model with a fixed resolution of 18 cells in longitude, 90 in latitude and 15 vertical pressure levels. These values reflect the smaller scales of spatial variation in latitude compared to longitude. The model assumes hydrostatic equilibrium (as does TIE-GCM from Elvidge et al. (2016)). As well as F10.7, CTIPe uses hemispheric power at 12 minute intervals. The model was run on request at the CCMC website and automatically interpolated to the CHAMP paths on the website (Codrescu et al., 2012).

JB2008
Jacchia-Bowman 2008 (JB2008) has been developed by Space Environment Technologies (SET) and is an empirical thermospheric density model (Bowman et al., 2008).
It is based on the previous JB2006 model and the original Jacchia diffusion equations (Bowman et al., 2008; Jacchia, 1977). The model uses four solar proxies (computed from in-orbit sensors) as well as disturbance storm time index (Dst) data (a measure of geomagnetic activity) (Tobiska et al., 2009). The model has been validated using density data derived from satellite drag for a range of satellites between 175 and 1,000 km.

DTM-2013
The Drag Temperature Model 2013 (DTM-2013) is a semi-empirical model which describes thermospheric temperature, density and composition. The model has been developed by the Centre National d'Etudes Spatiales (CNES) and has a long development history starting with DTM-78 (Barlier et al., 1978). DTM derives its densities and temperatures from satellite drag data and was the first model to include the high-accuracy accelerometer data from the CHAMP and GRACE satellite missions (S. L. Bruinsma et al., 2004). Recent developments of the model include the use of GOCE satellite data from 270 km to improve the specification of the lower thermosphere and the use of F30 (30 cm radio flux) instead of F10.7. These updates have been shown to increase the performance of the model with regard to specifying thermospheric density (S. Bruinsma, 2015). It uses am instead of ap for modelling geomagnetic storms.

Test Scenarios

Solar Minimum Scenario
The test scenario used in this work is an extension of the first test scenario in Elvidge et al. (2016): a 20 day long run from 18th August 2009 (Elvidge et al. (2016)'s test scenario started on 28th August 2009). The MME weighting schemes which require training are trained on the 20 day period before this (from July 29th). The test scenario time period contains one geomagnetic storm (Figure 1), and during the rest of the month the geomagnetic conditions are quiet. The F10.7 only varies between 67 and 76 flux units. The extremely low solar minimum of 2008-2009 presents a significant modelling challenge since the F10.7 values have been shown to not represent the correct thermospheric conditions (Solomon et al., 2010; Bilitza et al., 2017). However, at solar minimum, internal and external dynamics, rather than solar drivers, dominate the evolution of the thermospheric densities. It is expected that the greatest differences between the tested models will be evident at these times (Elvidge et al., 2016). A sample timeseries of a physics-based model, an empirical model and the data for this period is shown in Figure 2. Recall that the average mean error of the observations is approximately 10.1% (Sutton, 2008), which is shown as error bars around the CHAMP data points. Whilst the errors are not insignificant, they are smaller than the differences between the models. The fast periodicity of the data is due to the CHAMP satellite completing one orbit every 90 minutes, with each point being 15 minutes apart.

Solar Maximum Scenario
The second test scenario is a typical 30 day solar maximum period from 2002, using a 30 day training window.The F10.7 varies between 135 and 240, with some significant spikes in ap (Figure 3).

Results

Introduction
The various upper atmosphere models, and the different MME approaches (whose weights are calculated over the training periods), have been run for the test scenarios. These forecasts are compared to the derived CHAMP neutral densities; it should be noted, however, that the CHAMP data has an estimated mean error of 10.1% (Sutton, 2008), and whilst these results cover month-long scenarios, those scenarios only represent approximately 2.8 hours of local solar time coverage (Häusler et al., 2010). The models are compared using modified Taylor diagrams (Elvidge et al., 2014). To read such a diagram (e.g. Figure 4): the radial distance of a data point from the origin is the model's normalized standard deviation (here they are normalized by the standard deviation of the observations), and the azimuthal angle corresponds to the correlation between the model and the observations. The dashed line (marked with a star) shows the normalized standard deviation of the observations (i.e., unity). The dotted semicircles, originating from the intersection of the observed standard deviation (dashed line) and the horizontal axis, show contours of the standard deviation of the model error. Finally, the (normalized) mean square error between the model and observation time series can be found by adding, in quadrature, the standard deviation of the model error and the model bias (model minus observations), which is shown by the colour bar. The normalization factors have been included in the top right of the diagram and can be used to revert any factor to its original value.
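The quantities shown in the modified Taylor diagrams can be computed directly from the model and observation time series; a minimal sketch, including the quadrature relationship between bias, error standard deviation and MSE referred to above:

    import numpy as np

    def taylor_stats(model, obs):
        """Return (normalised std, correlation, bias, MSE) for one model time series."""
        model = np.asarray(model, dtype=float)
        obs = np.asarray(obs, dtype=float)
        bias = model.mean() - obs.mean()        # model minus observations
        err_std = (model - obs).std()           # standard deviation of the model error
        norm_std = model.std() / obs.std()      # normalised standard deviation
        corr = np.corrcoef(model, obs)[0, 1]    # correlation with the observations
        mse = bias ** 2 + err_std ** 2          # bias and error std added in quadrature
        return norm_std, corr, bias, mse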

Solar Minimum Scenario
Figure 4 shows a modified Taylor diagram of the solar minimum test scenario. It can be seen that TIE-GCM and CTIPe have large positive biases and normalised standard deviations greater than 1 compared to the CHAMP observations, whilst the other physics-based model, GITM, underestimates the range of the observations (normalised standard deviation significantly less than 1) and is negatively biased. TIE-GCM and GITM have a lower correlation to the data than the empirical models, whilst CTIPe is similar. NRLMSISE-00 has a very high bias and variance, although with a moderately better correlation than TIE-GCM. DTM performs similarly to NRLMSISE-00, but with a slightly smaller bias. Overall JB2008 performs the best of any individual model. Of the MME approaches, the simple equally weighted ensemble leads to a greater correlation with the data than any individual model and an accurate variance. It is a common theme for thermospheric models to overestimate the neutral density, but GITM here has a low bias which improves the bias of the equally weighted ensemble. REA and PWB have near-zero biases due to the bias correction, which shows that the biases in these models vary over timescales longer than a month. The non-negative least squares weighting performs best overall: of the MMEs, the highest MSE is the equal weighting with an MSE of $1.03\times10^{-24}$ and the lowest is the non-negative least squares with an MSE of $1.73\times10^{-25}$. The maximum drop in MSE is therefore 94%, and 68% from the best model.
The weightings of each model, for the different schemes used here, are shown in Table 2. After removing the bias, the weights allocated to NRLMSISE-00 become higher and the weights for JB2008 become lower. The regression and independence weightings both favour the physics-based models more heavily at the expense of JB2008. GITM again has a high bias and variance, along with CTIPe.

Solar Maximum Scenario
Figure 5 shows a modified Taylor diagram for the individual models and MME results for the solar maximum test scenario. The empirical model performance is superior. The MSE of JB2008 is $4.53\times10^{-25}$ and the worst model is GITM with an MSE of $2.62\times10^{-23}$. The physics-based models generally performed worse here than in the 2009 solar minimum test. GITM has a positive bias and a normalised standard deviation greater than 1, in contrast to the solar minimum test scenario where it is less than 1 and negatively biased. In this case TIE-GCM has a negative bias, again in contrast to 2009. TIE-GCM also has a small correlation, implying it had trouble producing the correct features in the neutral density field, whereas GITM produced a very high correlation. The best MME was again the non-negative least squares, with an MSE of $2.35\times10^{-25}$, while the equally weighted ensemble had an MSE of $3.28\times10^{-24}$. The highest improvement is therefore 99%, and 50% from the best model (JB2008).
The weightings are shown in Table 3. It can be seen that the physics-based models are weighted less heavily than the empirical models. NRLMSISE-00 is weighted more heavily than in the non-negative least squares. The equally weighted MME did not have a lower MSE than the best performing model in either scenario.

Conclusions
Multi-model ensembles (MMEs) have been shown to improve the mean square error (MSE) of upper atmosphere forecasts. They rely on a spread of values around the true value to approximate it. Upper atmosphere models tend to be biased, and for satellite predictions the bias is the most important statistical parameter, since it leads to a consistent deviation away from the true satellite track. Models (and MMEs) therefore need some kind of bias correction. Efforts such as HASDM (Storz et al., 2005), where the biases of a thermospheric model are corrected by data assimilation, can reduce them to near zero and lead to vastly improved satellite prediction capabilities. However, MMEs offer an opportunity to de-bias the model output simply, without the need for a computationally expensive data assimilation system, and can be used during forecasts where data are unavailable. A number of different MME methodologies have been described and compared here, which can broadly be used throughout space weather (not just in the context of thermospheric density specification). Following the recommendations of Elvidge et al. (2016), a separate training dataset and a greater variety of weighting schemes have been used. The testing scenarios have also been extended to reduce the uncertainties in the statistics. In both the solar maximum and minimum test scenarios the MME performs better than any individual model compared in this study, within the confines of only using CHAMP data. Whilst many of the MME weighting methods perform similarly, overall a non-negative least squares weighting on bias corrected models gives the largest reduction in error. In the solar minimum case this is a 68% reduction in the mean square error from the best individual model (Jacchia-Bowman 2008 [JB2008]) and a 50% reduction in the solar maximum case, again compared to JB2008. If deploying such a system in an operational setting we would recommend that weights are calculated on a "rolling" one-month basis (if not using Reliability Ensemble Averaging which, by definition, varies over time).

Figure 1 .
Figure 1. ap (blue) and F10.7 (red) for the test scenario and training period, which runs from July 29th to September 8th 2009. The training period is before August 18th (black line); values after this are used for validation. The large spike in ap is associated with a geomagnetic storm.

Figure 2 .
Figure 2. Sample timeseries of neutral density from the solar minimum scenario for CHAMP (blue, with error bars shown in black), TIE-GCM (orange) and JB2008 (green).

Figure 3 .
Figure 3. ap (blue) and F10.7 (red) for the test scenario and training period, which runs from August 1st to September 30th 2002. The training period is before August 31st (black line); values after this are used for validation. The large spikes in ap are associated with geomagnetic storms.

Figure 4 .

Figure 5 .
Figure 5. Modified Taylor Diagram for 30 days validation from 31st August with a 30 day training period. Data are from the CHAMP satellite. This diagram shows in one plot the correlation to the data, the normalised standard deviation and the bias. The timeseries have not been binned.

Table 1 .
Labels in the Modified Taylor Diagrams.

Table 2 .
Weighting of the different models in 2009.

Table 3 .
Weighting of the different models in 2002.