Issues in calibrating models with multiple unbalanced constraints: the significance of systematic model and data errors

Calibrating process-based models using multiple constraints often improves the identifiability of model parameters, helps to avoid several errors compensating each other and produces model predictions that are more consistent with underlying processes. However, using multiple constraints can lead to predictions for some variables getting worse. This is particularly common when combining data sources with very different sample sizes. Such unbalanced model-data fusion efforts are becoming increasingly common, for example when combining manual and automated measurements. Here we use a series of simulated virtual data experiments that aim to demonstrate and disentangle the underlying cause of issues that can occur when calibrating models with multiple unbalanced constraints in combination with systematic errors in models and data. We propose a diagnostic tool to help identify whether a calibration is failing due to these factors. We also test the utility of adding terms representing uncertainty in systematic model and data errors to calibrations. We show that unbalanced data by itself is not the problem: when fitting simulated data to the 'true' model, we can correctly recover model parameters and the true dynamics of latent variables. However, when there are systematic errors in the model or the data, we cannot recover the correct parameters. Consequently, the modelled dynamics of the low-data-volume variables depart significantly from the true values. We demonstrate the utility of the diagnostic tool and show that it can also be used to identify the extent of imbalance at which the calibration starts to ignore the sparser data. Finally, we show that representing uncertainty in model structural errors and data biases in the calibration can greatly improve the model fit to low-volume data, and improve coverage of uncertainty estimates.
We conclude that the underlying issue is not one of sample size or information content per se, despite the popularity of ad hoc approaches that focus on ‘weighting’ datasets to achieve balance. Our results emphasize the importance of considering model structural deficiencies and data systematic biases in the calibration of process‐based models.


| INTRODUCTION
Calibrating a model with multiple constraints means that we use data on several model outputs, often at similar organizational levels of the modelled system, to constrain uncertainties about model structure or parameters. The value of this approach has long been recognized: from a theoretical perspective, models that make multiple predictions are considered to be 'efficient' as they are often supported by multiple lines of evidence and can be tested against different types of observations (Grimm & Railsback, 2012; Marquet et al., 2014). From a practical perspective, scientists are increasingly reliant on complex process-based models (Fisher et al., 2018; Fisher & Koven, 2020), together with methods to combine models and data, to generate precise forecasts and improve system understanding (Van Oijen, 2020). In both cases, the use of multiple constraints is important when alternative competing hypotheses or models are compatible with a single set of observations. While it is often not hard for complex models to get a single 'right' answer for the wrong reasons, it is much harder to hit multiple benchmarks at the same time, and careful comparisons to multiple data constraints can help isolate incorrect assumptions (Grimm & Railsback, 2012; Medlyn et al., 2015).
The value of multiple data constraints is not limited to model testing, but extends equally, if not more so, to model calibration.
Issues of equifinality (i.e. multiple alternative parameter combinations producing the same model output) and parameter identifiability are common when complex models are constrained by a single type of data, making it easy for models to get the right answer for the wrong reason (Williams et al., 2009). In principle, the process of constraining model uncertainties via calibration (a.k.a. inverse modelling or model-data fusion, e.g. Hartig et al., 2012) is relatively straightforward: the idea is to infer model parameters that produce outputs that agree with the observed data. This can be achieved via informal calibration or optimization procedures (e.g. Aber & Federer, 1992; Parton et al., 1993), but as increasingly more data have become available in recent years (Farley et al., 2018; Hampton et al., 2013; LaDeau et al., 2017), the field has moved towards formal statistical calibration methods based on likelihood or Bayesian statistics (Fer et al., 2018; Hilborn & Mangel, 1997; Van Oijen et al., 2005).
Technically, combining multiple, heterogeneous data sources within a statistical calibration is straightforward. Provided that measurement errors associated with the data are uncorrelated and hence independent, we can combine them by multiplying the statistical likelihoods (the probability of observing a dataset under any particular set of proposed model parameters) for the individual data streams (Van Oijen, 2020). In practice, however, the statistical calibration of complex models can be challenging, especially when data sources differ greatly in volume (e.g. Medvigy et al., 2009; Ricciuto et al., 2011; Richardson et al., 2010). Unbalanced calibration datasets are now common as low volumes of manually collected field data are frequently combined with high volumes of automatically collected data from in situ sensors or remote sensing. Since each data point is usually modelled as an independent piece of information in a statistical likelihood, sparse observations can often be overwhelmed by the higher volume of data. This is undesirable as the low-volume data often constrain parts of the system with high uncertainties that are crucial for future projections (e.g. soil carbon and nitrogen), and require higher labour costs to collect. As increasingly more data become available, this issue of unbalanced datasets is likely to worsen significantly. For example, NASA's earth observation system is expected to grow by an order of magnitude, from an already overwhelming ~5 PB/year in 2018-2020 to a staggering ~50 PB/year, as soon as 2022 (https://earthdata.nasa.gov/eosdis/cloud-evolution).
Since the apparent issue is the imbalance in data volume, existing approaches often try to correct that balance by thinning out, aggregating or reweighting the calibration datasets so that they have a more balanced influence on the calibration. Common examples include reweighting different datasets so they count equally (Cailleret et al., 2020; Keenan et al., 2013; Medvigy et al., 2009; Richardson et al., 2010) or weighting by inverse sample size (Thum et al., 2017).
KEYWORDS
Bayesian inference, inverse modelling, model calibration, model discrepancy, multiple constraints, predictive uncertainty, structural model error, systematic data bias

In fisheries, Stewart and Monnahan (2017) and Carvalho et al. (2017) have suggested that, in the likelihood-based statistical procedures used in stock assessment, down-weighting or elimination of data is often used (e.g. Kell et al., 2014; Siddeek et al., 2017) to deal with data conflicts arising from model misspecification. They suggest that model misspecification is a main cause of the sensitivity of calibration results to data weighting, and that down-weighting data is not necessarily appropriate because it may not resolve the misspecification. The main purpose of these ad hoc approaches is to down-weight the high-volume data so that its influence on the calibration is more balanced.
Unfortunately, such ad hoc approaches have no basis in probability theory; indeed, it makes no logical sense that the information content of a dataset in the calibration should be determined by the presence of another, sparser dataset. The significance of a dataset in the calibration should be determined by the reliability of that dataset alone. By arbitrarily changing the reliability of the calibration data, we are also throwing away potentially useful information that could be used to improve models. In reweighting the data, we introduce subjective control over the calibration via some measure of how closely we want the model to fit the different data streams after calibration. A better option would be to develop solutions based on the underlying causes that lead to poor outcomes when calibrating models with unbalanced data. Oberpriller et al. (2021) showed that the calibration problem with unbalanced data streams is not due to the imbalance per se, but arises because the model cannot fit both data sources when structural error is present. The calibration will favour the high-volume data, at the expense of worse model predictions for the low-volume data, because the former has a higher weight in the likelihood.
Here we investigate this problem in more detail. First, we use several virtual experiments to illuminate the underlying reasons for the issues discussed. Second, we propose a diagnostic tool to help researchers identify whether issues that they are facing during calibration can be attributed to the interaction of imbalanced calibration data with model/data error rather than some other cause. Finally, we illustrate, as simply as possible, that including uncertainty in model structural error and data systematic bias in the likelihood improves model predictions and provides a quantification of uncertainty that has greater utility than using ad hoc methods such as reweighting.

| Very simple ecosystem model
To illustrate the issues of model calibration to multiple constraints, we developed the very simple ecosystem model (VSEM). The model was designed to be as simple as possible, yet resemble more complicated, process-based ecosystem models that are commonly used in terrestrial ecosystem modelling.
In essence, the model calculates the daily accumulation of carbon in the plant and soil from the growth of the plant via photosynthesis and senescence to the soil, which respires carbon back to the atmosphere. While we rely on a terrestrial carbon budget model for this example, the underlying issues are general to any model that predicts multiple outputs (e.g. species, life-history stages, biogeochemical pools and fluxes). These issues apply to wide classes of models in routine use across marine, freshwater and terrestrial systems that are used to describe physiological, population, community, ecosystem and evolutionary processes.
The VSEM requires only one input dataset to drive the model, daily photosynthetically active radiation (PAR, MJ m −2 day −1 ).
Since we are interested in virtual experiments, we simulated the PAR input data using a sinusoidal function of Days plus Gaussian random noise, where Days is a vector of integers from 1 to the number of days in the simulation (2048 in this case).
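As an illustration, a PAR driver of this kind could be simulated along the following lines. The amplitude, noise level and function names here are our own illustrative choices, not the exact values used in the study:

```python
import numpy as np

def simulate_par(n_days=2048, amplitude=10.0, noise_sd=1.0, seed=0):
    """Simulate daily PAR (MJ m-2 day-1) as a seasonal sinusoid plus
    Gaussian random noise. Amplitude and noise level are illustrative."""
    rng = np.random.default_rng(seed)
    days = np.arange(1, n_days + 1)          # the Days vector: 1..n_days
    seasonal = amplitude * 0.5 * (1.0 + np.sin(2.0 * np.pi * days / 365.0))
    par = seasonal + rng.normal(0.0, noise_sd, n_days)
    return days, np.clip(par, 0.0, None)     # PAR cannot be negative
```

Clipping at zero is our addition, since additive noise can otherwise produce physically impossible negative radiation values.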
The model calculates gross primary productivity (GPP) using a very simple light-use efficiency (LUE) formulation multiplied by light interception. Light interception is calculated via Beer's law with a constant light extinction coefficient, KEXT, operating on leaf area index, which itself is calculated from vegetation foliar carbon (Cv) and leaf area ratio (LAR). A respiration parameter (GAMMA) determines the fraction of GPP that is autotrophic respiration, giving the net primary productivity (NPP).
There are three state equations representing the change in vegetation (Cv), root (Cr) and soil (Cs) carbon pools over time. The NPP is allocated to above-ground (vegetation) and below-ground (root) carbon pools via a fixed allocation fraction (Av). Carbon is lost from the plant pools to a single soil pool via fixed vegetation and root turnover rates (tauV and tauR).
Heterotrophic respiration in the soil is determined via a soil turnover rate (tauS).
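The pools and fluxes described above can be sketched as a daily difference-equation update. Parameter names follow the text (LUE, KEXT, LAR, GAMMA, Av, tauV, tauR, tauS), but the explicit daily time step, the sign convention for NEE and the function layout are our assumptions rather than a verbatim transcription of VSEM:

```python
import numpy as np

def vsem_step(state, par, p):
    """One daily update of the three carbon pools.
    state = (Cv, Cr, Cs); p holds LUE, KEXT, LAR, GAMMA, Av, tauV, tauR, tauS."""
    Cv, Cr, Cs = state
    lai = p["LAR"] * Cv                                      # leaf area index
    gpp = p["LUE"] * par * (1.0 - np.exp(-p["KEXT"] * lai))  # Beer's law interception
    npp = (1.0 - p["GAMMA"]) * gpp                           # after autotrophic respiration
    dCv = p["Av"] * npp - Cv / p["tauV"]                     # allocation minus turnover
    dCr = (1.0 - p["Av"]) * npp - Cr / p["tauR"]
    dCs = Cv / p["tauV"] + Cr / p["tauR"] - Cs / p["tauS"]   # litter in, respiration out
    nee = Cs / p["tauS"] + p["GAMMA"] * gpp - gpp            # respiration minus GPP
    return (Cv + dCv, Cr + dCr, Cs + dCs), nee
```

With no light, GPP is zero and NEE collapses to heterotrophic respiration (Cs/tauS), which provides a quick sanity check on the flux bookkeeping.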

| Bayesian calibration
We use a Bayesian approach to model calibration, though we note that the issues we raise, and their solutions, are not limited to Bayesian approaches but extend equally to other forms of statistical model calibration (e.g. maximum likelihood). In Bayesian calibration (BC), our aim is to quantify the posterior probability of the model parameters (θ). The posterior probability is calculated via Bayes' theorem, P(θ | D) ∝ L(D | θ) P(θ), where P(θ) and L(D | θ) are the prior and likelihood, respectively.
Since it is not possible to calculate the posterior distribution for VSEM analytically, we estimate it with Markov Chain Monte Carlo sampling, using the DREAMzs algorithm (Vrugt et al., 2009) implemented in the R package BayesianTools (Hartig et al., 2019).
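The study uses the DREAMzs sampler from the R package BayesianTools; purely to illustrate the idea of drawing samples from P(θ | D) when it cannot be computed analytically, a minimal random-walk Metropolis sampler is sketched below. This is not the algorithm used in the paper, just the simplest member of the same MCMC family:

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=5000, step=0.1, seed=0):
    """Random-walk Metropolis sampler (illustrative stand-in for DREAMzs).
    log_post maps a parameter vector to the unnormalized log posterior."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        prop = theta + rng.normal(0.0, step, theta.size)  # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:          # accept w.p. min(1, ratio)
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain
```

Adaptive, multi-chain samplers such as DREAMzs follow the same accept/reject logic but tune the proposal automatically, which matters for the correlated posteriors typical of process-based models.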

| Prior
We used simple uniform priors (Table 2) since our aim is to identify the issues associated with multiple constraints using a simple and easy-to-interpret modelling approach. We focus the calibration on a subset of the parameters in Table 1 because we manipulated the values for two parameters, allocation to vegetation (Av) and initial root pool (Cr), as part of the simulated data experiments described below. Since the root pool is not part of the model-with-error in these experiments, we also exclude tauR from the calibration. The parameters LAR and GAMMA were removed from the calibration to avoid non-identifiability issues. During calibration, tauR, LAR and GAMMA were fixed to the 'true' values used when generating simulated data.

| Likelihood
For the likelihood, we use a univariate Gaussian distribution. This is a typical choice and, as we simulated the calibration data under the same assumptions (see Section 2.3.1), we know this form of the likelihood to be appropriate. In Section 2.4, we discuss modifications to this simple likelihood to represent model structural error and data systematic bias. Because heteroskedasticity is a common feature of carbon cycle data, each variance σ² was modelled as proportional to the variable in question (net ecosystem exchange [NEE], soil carbon, vegetation carbon) via a single coefficient of variation parameter, included in the calibration.
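A sketch of such a heteroskedastic Gaussian likelihood, with a single coefficient-of-variation parameter cv per data stream and independent streams combined by summing log-likelihoods (i.e. multiplying likelihoods). Function and variable names are ours:

```python
import numpy as np

def log_lik_stream(obs, pred, cv, sd_floor=1e-6):
    """Gaussian log-likelihood with sigma proportional to the predicted value."""
    sd = np.maximum(cv * np.abs(pred), sd_floor)  # heteroskedastic: sd scales with signal
    resid = np.asarray(obs) - np.asarray(pred)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sd**2)
                        - 0.5 * (resid / sd) ** 2))

def total_log_lik(streams, cv):
    """Independent data streams combine by summing their log-likelihoods."""
    return sum(log_lik_stream(obs, pred, cv) for obs, pred in streams)
```

The summation in `total_log_lik` is exactly why a high-volume stream dominates: each data point adds its own term, so a 2048-point stream contributes far more terms than a 6-point one.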

| Experiments with virtual data from VSEM
To illustrate the impacts of relative data volume (balanced vs. unbalanced) and different sources of model and data errors on the outcome of calibrating models to multiple data constraints, we designed a series of calibration experiments. Specifically, we simulated data from VSEM assuming a Gaussian observation error and then calibrated the model to these pseudo-observations for one flux, NEE, and two pools, vegetative carbon and soil carbon, that represent likely real-world data constraints. Model assessment focused on both the quantitative ability to recover the 'true' model parameters and the ability of the calibrated models to reconstruct the observed time series. The experiments described below are summarized in Table 3.
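Pseudo-observations of this kind can be generated by thinning the 'true' daily series to the desired number of sample dates and adding Gaussian noise with a standard deviation proportional to the signal. The function name and interface below are our assumptions, but the recipe mirrors the one described above:

```python
import numpy as np

def make_pseudo_obs(truth, cv, n_obs, seed=0):
    """Thin a daily 'true' series to n_obs evenly spaced sample dates and add
    heteroskedastic Gaussian observation noise (sd = cv * |signal|)."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth, dtype=float)
    idx = np.linspace(0, len(truth) - 1, n_obs).round().astype(int)
    obs = truth[idx] + rng.normal(0.0, cv * np.abs(truth[idx]))
    return idx, obs
```

Choosing a small n_obs for one variable (e.g. six vegetation-carbon samples) and a large n_obs for another (e.g. daily NEE) reproduces the unbalanced designs used in the experiments.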

| Model with known structural error
To simulate a model with a known structural error, we consider a situation where a major model process/structure is unknown and therefore missing in the calibrated model (but not the pseudodata).
Here we remove the root pool completely from the VSEM to simulate a major structural error. This was done by initializing the root pool to zero and setting the root allocation fraction to zero so that all the NPP is now allocated to the vegetation pool. This also shuts off any senescence from the root pool to the soil. This gives the model a structural error of the kind we might encounter in a real situation, while remaining simple enough that we can still interpret the influence of the error.
This experiment was run both with balanced data (Eb) and unbalanced data (Eu).

| Observational data with known bias
In addition to considering model structural error, we also investigated the influence of observations with systematic biases since all observational data will to a greater or lesser extent contain biases.
Here we multiplied the soil data by 0.8 to represent a considerable multiplicative bias in the observations of soil carbon. The observation bias experiment was repeated for the perfect model/balanced data case (PbB), for the case where data were unbalanced (PuB), and when there is both unbalanced data and a model structural error (EuB).

| Modified likelihood to represent structural errors in the model and systematic biases in the data
Here we address the question of whether modifications to the likelihood function can help compensate for the errors introduced above (model structural errors, biased observational errors). A general principle in modelling is to begin with the simplest approach and only move on to more complicated solutions if the simple approach fails.
We adopt that approach here, representing model structural error and data systematic bias via very simple multiplicative and additive corrections to the model outputs (i.e. a linear bias correction). We add these terms for each of the three outputs for which we have calibration data, and therefore have six additional parameters to represent the additive and multiplicative errors for each of NEE, soil carbon and vegetative carbon (modaddNEE, modmultNEE, modaddCs, modmultCs, modaddCv and modmultCv). The priors for each of these are given in Table 4.
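In code terms, the modification simply transforms each model output with its additive and multiplicative discrepancy parameters before evaluating the Gaussian likelihood. This sketch (names are ours) shows the idea for a single stream:

```python
import numpy as np

def corrected_log_lik(obs, pred, cv, add, mult):
    """Gaussian log-likelihood after a linear bias correction of the model
    output: corrected = add + mult * pred. The add/mult parameters are
    calibrated alongside the process parameters, with priors expressing
    uncertainty about systematic model/data error."""
    corr = add + mult * np.asarray(pred)
    sd = np.maximum(cv * np.abs(corr), 1e-6)   # heteroskedastic observation error
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sd**2)
                        - 0.5 * ((np.asarray(obs) - corr) / sd) ** 2))
```

With priors centred on add = 0 and mult = 1, the correction collapses to the original likelihood when no discrepancy is needed, but lets the calibration absorb a systematic offset or scaling instead of distorting the process parameters.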

| Identifying the underlying issue
Here, we investigate the underlying issues when calibrating a model with unbalanced data (experiments Pb-EuB, Table 3). The following sections refer to posterior marginal parameter plots in Figure 1 and time series plots in Figure 3. For the unbalanced-data experiment with model error (Eu), the departure of the parameters from their 'true' values to compensate for the model structural error is less than for Eb.
Looking at the time series (Figure 3), the model does well for Cs (and NEE, not shown) but drifts away significantly from the six vegetation measurements. Consistent with the issues raised in the Introduction, this example illustrates a common behaviour for calibrations with a large data imbalance, where the sparsely measured parts of the system are sacrificed in favour of the parts of the system with many observations.

| Perfect model and balanced data with a multiplicative bias (PbB)
We investigate the influence of data bias on the calibration by multiplying the soil carbon pool by 0.8 (Section 2.3.3). Similarly to when there is a model structural error (Eb), parameters in the calibration do not all recover their 'true' values and hence 'absorb' the influence of data error, particularly for the below-ground parameters.
The initial Cs and tauS both decrease, increasing the turnover and decreasing the soil carbon pool to match the erroneous data. As before, these departures of the parameters from their 'true' values allow there to be a reasonably close match between the model outputs after calibration and the data (Figures S9 and S10).

| Perfect model and unbalanced data with a multiplicative bias (PuB)
Adding the effect of unbalanced data to the calibration with data bias, KEXT is now larger than its true value, increasing the carbon input into the system, but this is counteracted by a lower LUE. Cv is smaller and tauV larger, which has the combined effect of passing on less carbon to the soil. Belowground, tauS is slightly closer to its true value than in PbB, and Cs has increased versus PbB. When the model structural error is also present (EuB), the erroneous increase in the vegetation pool, due to the missing root pool, adds to the difficulty of matching the erroneously low soil carbon observations. The combined effect of the two errors pushes the model prediction even further away from the six Cv observations (Figures S13 and S14).

| Changes to the Likelihood to represent model and data errors
The results from Section 3.1 demonstrate that the underlying issue with including unbalanced data in the calibration is not the imbalance itself but systematic errors in the model structure, the data or both affecting the calibration. As presented in Section 2.4, we introduce linear bias-correction terms into the likelihood that represent our uncertainty about what these systematic errors could be.
Since we would not normally know the error present in the model or the data, these terms are not designed to address the specific errors present in the model and data here but rather as a simple linear correction to the model outputs.
We now repeat calibrations Eu, PuB and EuB but with the new likelihood. For all three experiments (EuL, PuBL and EuBL), the uncertainty has increased significantly for a number of parameters (KEXT, LUE, tauV, Cs and Cv; Figure 2; see also Figures S15-S17), so that in general they are now closer to the parameters' 'true' values.
An outlier is tauS, where the uncertainty has increased but the centre of the distribution is further away from the true value. Looking at the output time series (Figure 3), the influence of the error has not been removed, but there has been a significant improvement in the predictions versus Eu, PuB and EuB, with the centre of the posterior now much closer to the 'truth' line, especially for Cv. In addition, the uncertainty has increased so that, in general, data points are now inside the posterior predictive interval. The linear terms introduced have not completely removed the influence of the error, but there is a much greater sense that the sparse Cv data are influencing the calibration.

| Unbalanced data in model calibration: Identifying the underlying issue
Our aim was to identify as cleanly as possible the underlying causes behind issues for model calibration caused by unbalanced data. First, we demonstrated that unbalanced data by itself is not the problem: there was no issue with including very unbalanced data so long as the observation error in the data was unbiased and the model was perfect. This finding runs counter to the hypothesis implicit in weighting data streams, which is that poor fits reflect an imbalance in information content, and thus that this imbalance needs to be corrected by reweighting.
Second, if we introduce a very significant model error or data bias, but keep the data streams balanced, the model predictions after calibration remain close to the data. In real-world calibrations, where we do not know the extent of the systematic model and data errors present, nor the 'true' settings of the parameters, these calibrations with balanced data would be considered successful. Given that we had access to the true parameter settings, however, we found that after calibration the parameters were far from their true values with high confidence. From the perspective of the calibration, the goal is to diminish the model-data difference. The likelihood cannot distinguish between model-data differences due to parameter error, model structural error or observation error, and has no means to change the structure of the model, so the model-data difference is reduced solely by the parameters departing significantly from their true values. In this way, the calibration 'absorbs' the model and data errors into wrong settings of the parameters such that the model delivers fair performance on all data streams it is calibrated to. Other outputs from the model may still be very poor, but there are no data available to assess this.
Third, it is only when we combine unbalanced data with a systematic error in either the model or data that the model predictions against the more sparse calibration data become poor and we identify an issue in the calibration. Because most real-world calibrations against multiple data constraints involve unbalanced data, it is easy to wrongly attribute the issue to unbalanced data. Indeed, while the model predictions were poorer after calibration with the unbalanced data (Eu, PuB), the parameters were if anything closer to their true values and less confidently wrong.

| Diagnostic tool
Our aim in developing a diagnostic tool (Figures 4 and 5) was first to identify the characteristic behaviour, or signature, that indicates model or data errors are causing issues when calibrating with unbalanced datasets. We illustrated with a perfect model (Figure 4) that the RMS difference goes down for all model outputs when the quantity of data in the calibration increases, regardless of whether the data are balanced. However, when a model error is present (Figure 4), the RMS difference increases for the sparse data output as the data imbalance increases. This is the signature behaviour that diagnoses the influence of the model discrepancy (or data bias) on the calibration.
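The bookkeeping behind such a diagnostic can be sketched as follows, with the calibration itself abstracted behind a user-supplied `calibrate` function (an assumed interface, since any calibration machinery could be plugged in). The signature to look for is a sparse-stream RMSE that rises as the imbalance ratio grows:

```python
import numpy as np

def rmse(obs, pred):
    """Root-mean-square difference between observations and predictions."""
    return float(np.sqrt(np.mean((np.asarray(obs) - np.asarray(pred)) ** 2)))

def imbalance_diagnostic(calibrate, dense_data, sparse_data, ratios):
    """For each imbalance ratio, calibrate on the full dense stream plus a
    thinned sparse stream, then record the RMSE against each stream.
    `calibrate(dense, sparse)` must return (pred_dense, pred_sparse)."""
    rows = []
    for r in ratios:
        n_sparse = max(1, len(dense_data) // r)
        sparse_sub = sparse_data[:n_sparse]
        pred_dense, pred_sparse = calibrate(dense_data, sparse_sub)
        rows.append((r, rmse(dense_data, pred_dense),
                     rmse(sparse_sub, pred_sparse)))
    return rows
```

Plotting the third column of the returned rows against the ratio reproduces the diagnostic curve; a flat or falling curve suggests no damaging systematic error, while a rising curve flags the discrepancy signature described above.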
We showed that this diagnostic plot can also be created where the true model and data are unknown (Figure 5); hence, this tool can be used in real-world calibrations. In addition, the diagnostic figure can also be used to identify what size of imbalance in the data leads to a significant problem. This could be used to estimate how severely model and data errors are detrimentally influencing a calibration with unbalanced data and which variables are most affected, which may give clues about the underlying model and data systematic errors at issue.

| The role of data autocorrelation when calibrating with unbalanced datasets
Another class of potentially important errors in observations is due to autocorrelation. Accounting for autocorrelation in observations is an important way to discriminate between raw sample size and the information content of the data, and to generate a more appropriate weighting among data constraints. Sample size and the contribution of a data stream to the likelihood are not the same thing, as the contribution can be lowered, for example, by higher variance or by a high degree of autocorrelation in the data. Nevertheless, we have shown here that the fundamental problem with data imbalance in model calibrations is not one of differences in information content or weight, but one of systematic errors in the model and/or data. So while it would be 'best practice' to incorporate autocorrelation into the calibration, we have shown that it will not solve the problem that we have identified here.
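To illustrate how strongly autocorrelation can deflate raw sample size, consider the standard effective-sample-size approximation for residuals following a first-order autoregressive (AR(1)) process. This is our illustration; the paper does not fit an AR(1) model:

```python
def effective_sample_size_ar1(n, rho):
    """Effective number of independent observations for n equally spaced
    data points whose residuals have AR(1) autocorrelation rho in (-1, 1).
    Standard approximation: n_eff = n * (1 - rho) / (1 + rho)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)
```

For example, 2048 daily observations with rho = 0.9 carry roughly the information of about 108 independent points, so accounting for autocorrelation narrows, but as argued above does not eliminate, the gap between dense and sparse streams.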

| Addressing underlying causes rather than symptoms
We argued in the Introduction that using ad hoc methods, such as reweighting the calibration data to give a more balanced dataset, is logically the wrong approach. The virtual data experiments that we have conducted in this study provide another reason to avoid ad hoc methods. In general, it is much preferable to 'treat' the underlying cause of a problem rather than try to mitigate the symptoms.
Therefore, it is better to address model and data errors directly rather than trying to mitigate the symptoms by reweighting the data to arbitrarily adjust its reliability. Ideally, the best approach would be to make changes to the model and the data collection to eradicate the damaging systematic and structural errors. In reality, all models are approximations and data are also imperfect, so this is only achievable to some extent. For example, in our terrestrial carbon flux example, there are known issues with eddy covariance data due to a lack of closure of the energy budget (Wilson et al., 2002); it is not possible to fully match such data with models that conserve mass and energy. As a solution, Carvalho et al. (2017) state that ideally model misspecification would be eliminated, but that this is often difficult to diagnose. Hence, this study provides further evidence that calibration without any explicit recognition of model discrepancy (systematic error) is potentially 'dangerous' (Brynjarsdóttir & O'Hagan, 2014). Stewart and Monnahan (2017) and Carvalho et al. (2017) argue that down-weighting or eliminating conflicting data may not be appropriate as it may not resolve model misspecification. Stewart and Monnahan (2017) state that 'analysts should be aware that they cannot weigh their way out of a misspecified model'. They further suggest that inclusion of 'process variation', and not excessive down-weighting of data, is more likely to provide robust estimation. In Bayesian inference of soil respiration models, Elshall et al. (2019) note that residuals are often assumed to be independent, normally distributed and homoscedastic.
Furthermore, they suggest that not accounting for violations of these assumptions may not result in biased predictions and parameter estimates; however, it will lead to underestimated posterior uncertainties and poorer predictions.

| Model and data discrepancy modelling recommendations
Given the complexities of many mechanistic models and the processes that we are aiming to model, it will often be very challenging to find a good discrepancy model. In many cases, the discrepancy may be highly nonlinear. Indeed, given the very large variation in models and processes, and hence in model discrepancies, it is not possible to offer a general approach that will work in most circumstances. Brynjarsdóttir and O'Hagan (2014) and others (Oberpriller et al., 2021; Van Oijen, 2020) have advocated the use of a Gaussian process (GP) as a flexible and powerful approach to discrepancy modelling, and indeed this may be a good approach for many applications, but it can have significant downsides. Brynjarsdóttir and O'Hagan (2014) show that such an approach can only avoid identifiability issues between model parameters and model error, recover the true parameter values, and hence be useful for extrapolation predictions if good prior information is available on the GP parameters, which they acknowledge will in many cases be very challenging. In addition, using GPs ignores physical mechanisms and can often be very expensive computationally because it involves the inversion of a potentially large covariance matrix. In general, in modelling, it is good practice to try simple solutions first and only to progress to more complex solutions, such as GPs, when needed. This motivated our choice of a simple linear bias-correction term in the calibration to represent our uncertainty about model structural errors and data systematic biases. Similar approaches have been used to correct for systematic biases in greenhouse gas emission measurement and modelling by Van Oijen et al. (2011) and biases in soil respiration data by Fer et al. (2018).
With this simple discrepancy model, we were able to illustrate that the linear bias correction increased the uncertainty in the joint posterior parameter distribution, making it more likely that the true parameter values were somewhere in the joint posterior distribution and that the posterior predictions included the 'true' system. This facilitated a significant improvement in the fit of model predictions to the data, even with very unbalanced datasets, although even in this very simple case the linear discrepancy model did not fully recapture all the true model parameter settings. Indeed, in many real-world calibrations, a simple linear modelling approach may be found to be too simplistic; nevertheless, it has been usefully employed here to illustrate the importance of addressing model discrepancy and data bias in model calibration, especially where large calibration data imbalances are present. The topic of identifying and creating good statistical models of model discrepancy (and data bias) is not straightforward, and it is an important area for future research and tool development (Chandler, 2013; Van Oijen, 2020). Nevertheless, as in all modelling, we advocate beginning with simple approaches, as we have done here, and adding complexity incrementally.

| CONCLUSIONS
The virtual data calibrations presented here demonstrate cleanly that the underlying issue when calibrating models with multiple, unbalanced data constraints is not the imbalance in the data itself, but that models and data have systematic errors that remain hidden when we calibrate with balanced datasets and whose influence is only seen in poor predictions after calibration with unbalanced datasets. This issue is likely even more rampant in the common case of calibrating models against a single constraint, but it only becomes apparent when such models are tested against additional types of observations. By addressing the underlying cause and including terms in the calibration for systematic error (discrepancy), we demonstrated that the model fit to low-volume data can be greatly improved, with a quantification of uncertainty that has sufficient coverage to include the true system.

AUTHOR CONTRIBUTIONS
The original ideas for this study came from David Cameron and

ACKNOWLEDGEMENTS
This work greatly benefited from discussions within COST Action

CONFLICT OF INTEREST
No conflicts of interest are known.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.14002.